
The following chunking best practices are tailored to **spiritual books + LightRAG** and are directly actionable:

***

## How LightRAG's chunking mechanism works

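LightRAG's chunks are not used for vector retrieval alone: **the same chunk feeds** entity extraction → knowledge-graph construction → vector storage. Chunk quality therefore directly affects graph quality, and the bar is higher than in ordinary RAG. The default is `chunk_token_size=1200` tokens, which needs adjusting for Chinese spiritual books.[1][2]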

***

## Recommended chunking parameters for spiritual books

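```python
from lightrag import LightRAG

# Model bindings such as llm_model_func and embedding_func are also needed
# for a real run; they are omitted here.
rag = LightRAG(
    working_dir="./spiritual_books",
    chunk_token_size=800,  # Chinese is semantically dense; 800 suits it better than the default 1200
    chunk_overlap_token_size=100,  # preserve relations that span chunk boundaries
    addon_params={
        "language": "Simplified Chinese",
        # Entity types: person, theological concept, doctrine, Bible passage,
        # spiritual movement, book, historical period
        "entity_types": [
            "人物", "神学概念", "教义", "圣经章节",
            "属灵运动", "书卷", "历史时期",
        ],
    },
    enable_llm_cache=True,  # identical passages are not re-extracted, which saves cost
)
```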
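To make the two numbers concrete, here is a minimal sketch of token-window chunking with overlap. It is an illustration of what a chunk size of 800 and an overlap of 100 mean, not LightRAG's own code; the function name `sliding_window_chunks` and the stand-in integer tokens are assumptions made for the example.

```python
from typing import List, Sequence


def sliding_window_chunks(tokens: Sequence[str], size: int, overlap: int) -> List[List[str]]:
    """Split a token sequence into windows of `size` tokens sharing `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # each new window starts `step` tokens after the previous one
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Stand-in tokens: one numbered token each, so the window arithmetic is easy to follow.
tokens = [str(i) for i in range(2000)]
chunks = sliding_window_chunks(tokens, size=800, overlap=100)
for chunk in chunks:
    print(chunk[0], chunk[-1], len(chunk))
```

With 2000 tokens this yields three windows; the last 100 tokens of each window reappear at the start of the next one, which is what lets a relation that straddles a boundary be seen in full by at least one chunk.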

***

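## Chunking strategy: three tiers that preserve spiritual semantics

Spiritual books have a clear hierarchy (volume → chapter → paragraph → illustration), so **split by structure rather than by character count**:[3]

**First priority: split by paragraph + heading**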
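```python
# Reuse the structure of the existing JSON documents
{
    "book": "属灵人",  # The Spiritual Man
    "chapter": "第二章魂的各部分",  # Chapter 2: the parts of the soul
    "section": "理智的功用",  # the function of the mind
    "text": "…",  # body text of the chunk
    "metadata": {
        "author": "倪柝声",  # Watchman Nee
        "page": 45,
    },
}
```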
Each JSON `section` maps directly to one chunk and is **never merged across chapters**, which keeps every chunk theologically self-contained.[3]
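The one-section-one-chunk rule can be sketched as follows. This assumes records shaped like the JSON above; the helper names, the placeholder records, and the idea of prefixing each chunk with its book / chapter / section path are illustrative choices, not something LightRAG requires.

```python
from typing import Dict, List


def section_to_chunk(record: Dict) -> str:
    """Render one section record as a chunk, with its heading path on the first line."""
    header = " / ".join([record["book"], record["chapter"], record["section"]])
    return header + "\n" + record["text"]


def records_to_chunks(records: List[Dict]) -> List[str]:
    """One chunk per section. Sections are never merged, so no chunk spans chapters."""
    return [section_to_chunk(r) for r in records if r.get("text", "").strip()]


# Placeholder records; the second chapter and all body texts are made up for the example.
records = [
    {"book": "属灵人", "chapter": "第二章魂的各部分", "section": "理智的功用",
     "text": "(placeholder body text A)"},
    {"book": "属灵人", "chapter": "第三章", "section": "(placeholder section)",
     "text": "(placeholder body text B)"},
    {"book": "属灵人", "chapter": "第三章", "section": "(empty section)", "text": "  "},
]
chunks = records_to_chunks(records)
print(len(chunks))
print(chunks[0])
```

Keeping the heading path inside the chunk text gives the entity-extraction step the book and chapter context even when the body text never names them.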

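**Second: recursive splitting when a paragraph is too long**

For paragraphs longer than 800 tokens, look for a split point in this order:
```
paragraph separator → full stop (. or 。) → comma (, or ，) → hard cut by character count
```
**Never cut in the middle of a quoted Bible verse.** When a citation marker such as "（约" (John) or "（弗" (Ephesians) is detected, enlarge the chunk so the full citation stays intact.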
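The separator order and the verse rule above can be sketched as a small recursive splitter. This is a minimal sketch under stated assumptions: tokens are approximated as one per character, the `VERSE` pattern only knows the two markers named above and must be extended for other books, and all function names and the sample sentence are hypothetical.

```python
import re
from typing import List

SEPARATORS = ["\n\n", "。", "，"]  # paragraph break, full stop, comma
# Assumed citation shape: full-width parenthesis opened by 约 (John) or 弗 (Ephesians).
VERSE = re.compile(r"（(?:约|弗)[^）]*）")
TRAILING = "。，；！？"  # punctuation that should stay with the preceding chunk


def count_tokens(text: str) -> int:
    """Stand-in token counter: one token per character."""
    return len(text)


def inside_verse(text: str, pos: int) -> bool:
    """True if a cut at `pos` would fall strictly inside a verse citation."""
    return any(m.start() < pos < m.end() for m in VERSE.finditer(text))


def hard_cut(text: str, limit: int) -> List[str]:
    """Last resort: cut every `limit` characters, enlarging a chunk to finish a citation."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + limit, len(text))
        for m in VERSE.finditer(text):
            if m.start() < end < m.end():
                end = m.end()  # keep the whole citation in this chunk
                break
        while end < len(text) and text[end] in TRAILING:
            end += 1  # do not start the next chunk with punctuation
        chunks.append(text[start:end])
        start = end
    return chunks


def split_recursive(text: str, limit: int, level: int = 0) -> List[str]:
    if limit < 1:
        raise ValueError("limit must be at least 1")
    if count_tokens(text) <= limit:
        return [text]
    if level >= len(SEPARATORS):
        return hard_cut(text, limit)
    # Candidate cuts sit right after each separator, never inside a citation.
    cuts = [m.end() for m in re.finditer(re.escape(SEPARATORS[level]), text)
            if m.end() < len(text) and not inside_verse(text, m.end())]
    if not cuts:
        return split_recursive(text, limit, level + 1)
    pieces = [text[a:b] for a, b in zip([0] + cuts, cuts + [len(text)])]
    chunks, current = [], ""
    for piece in pieces:  # greedily pack pieces; oversized ones go one level deeper
        if current and count_tokens(current + piece) > limit:
            chunks.extend(split_recursive(current, limit, level + 1))
            current = piece
        else:
            current += piece
    chunks.extend(split_recursive(current, limit, level + 1))
    return chunks


text = "这里是一段示例正文（约3：16，17），后面还有几个字。"  # made-up sample sentence
chunks = split_recursive(text, limit=12)
print(chunks)
```

In this sample the comma inside the citation is skipped as a split point, and the first chunk is allowed to exceed the limit so that the citation is not broken. For real use, swap `count_tokens` for the tokenizer that matches your LightRAG setup and adjust `hard_cut`, which here assumes one token per character.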

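**Third: handle illustration passages separately**

Spiritual books rely heavily on a "metaphor + explanation" structure (for example, the three-layer diagram of man's structure that Watchman Nee often uses). For such passages:
- the metaphor and its explanation **must stay in the same chunk**
- the limit can be relaxed to about 1000 tokens, with no hard cut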
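One way to apply this rule is to merge each metaphor paragraph with the explanation that follows it before any size-based splitting runs. The sketch below assumes paragraphs were already labeled upstream as `"metaphor"`, `"explanation"`, or `"plain"` (how to label them is outside this example), and it approximates tokens as characters; all names are hypothetical.

```python
from typing import List, Tuple

Paragraph = Tuple[str, str]  # (kind, text); kind is "metaphor", "explanation" or "plain"


def merge_illustrations(paragraphs: List[Paragraph],
                        relaxed_limit: int = 1000) -> Tuple[List[str], List[int]]:
    """Glue each metaphor to the explanation right after it.

    Returns the chunks plus the indices of merged chunks that exceed
    `relaxed_limit`, so they can be reviewed by hand instead of hard-cut.
    """
    chunks: List[str] = []
    oversized: List[int] = []
    i = 0
    while i < len(paragraphs):
        kind, text = paragraphs[i]
        has_explanation = (kind == "metaphor" and i + 1 < len(paragraphs)
                           and paragraphs[i + 1][0] == "explanation")
        if has_explanation:
            text = text + "\n" + paragraphs[i + 1][1]
            i += 1  # the explanation has been consumed
            if len(text) > relaxed_limit:
                oversized.append(len(chunks))
        chunks.append(text)
        i += 1
    return chunks, oversized


paragraphs = [("plain", "P1"), ("metaphor", "M1"), ("explanation", "E1"), ("plain", "P2")]
chunks, oversized = merge_illustrations(paragraphs)
print(chunks, oversized)
```

Oversized pairs are reported rather than split, which matches the "no hard cut" rule and leaves the decision to a human reviewer.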

***

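## Chinese-specific issue: tokenization affects boundaries

LightRAG measures chunk length in tokens by default, and a tokenizer built mainly for English handles Chinese imprecisely ("神圣经纶" may be cut into "神圣" + "经纶", two fragments that lose the term's meaning). The fix:[1]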

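```python
# Preprocess with jieba so that proper terms are not split apart
import jieba

# Load a user dictionary of spiritual terms
jieba.load_userdict("spiritual_terms.txt")
# Example contents of spiritual_terms.txt (word, frequency, part-of-speech tag):
# 神圣经纶 99 n
# 三一神 99 n
# 内住生命 99 n
# 神化 99 n
```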
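Loading the dictionary only changes how jieba segments text; the segmentation still has to drive where chunks are cut. A minimal sketch of that step is below, assuming the word list comes from `jieba.lcut(text)` in practice. Here a hand-written list stands in for it so the example runs without jieba installed, and `pack_words` is a hypothetical helper that measures length in characters.

```python
from typing import List


def pack_words(words: List[str], max_len: int) -> List[str]:
    """Pack segmented words into chunks of at most `max_len` characters.

    A word is never split; a single word longer than `max_len` becomes
    its own over-length chunk.
    """
    chunks: List[str] = []
    current = ""
    for word in words:
        if current and len(current) + len(word) > max_len:
            chunks.append(current)
            current = word
        else:
            current += word
    if current:
        chunks.append(current)
    return chunks


# Hand-written segmentation standing in for jieba.lcut(text) with the user dictionary loaded.
words = ["神圣经纶", "与", "三一神", "的", "内住生命"]
chunks = pack_words(words, max_len=6)
print(chunks)
```

Because boundaries can only fall between words, a dictionary term such as "神圣经纶" always lands whole in one chunk.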

***

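## QueryParam configuration: match different question types

Once the existing JSON documents are ingested, the query mode directly determines how well multi-hop questions work:[1]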

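```python
from lightrag import QueryParam

param = QueryParam(
    mode="mix",  # combine knowledge-graph retrieval with vector retrieval
    top_k=60,  # how many graph items (entities or relations) to retrieve
    chunk_top_k=20,  # how many text chunks to keep
    enable_rerank=True,  # takes effect only when a rerank model is configured
    # Prompt, in English: "Answer only from the teaching of the spiritual books, cite the
    # specific book title and chapter, and do not add theological inferences of your own."
    user_prompt="请只根据属灵书籍的教导回答，引用具体书名和章节，不要加入你自己的神学推断。",
)
# Typical use: answer = rag.query(question, param=param)
```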

***

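## Practical advice: benchmark on one book first

Pick *The Normal Christian Life* (《正常的基督徒生活》), run the whole book end to end, and check:
- the number of graph nodes (normally between 500 and 1500)
- the share of isolated nodes (should be below 10%)
- the Faithfulness score on 5 to 10 questions with known answers

Use the result as the baseline for parameter tuning, then batch-process the full catalog. **Building the index right once saves more time and money than rebuilding it repeatedly.**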
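The first two checks can be scripted once the graph's nodes and edges are exported. The sketch below assumes you can obtain them as plain Python lists from whatever graph storage your LightRAG setup uses; `graph_health` and the toy graph are hypothetical, and the thresholds are the ones given above.

```python
from typing import Dict, Iterable, Set, Tuple


def graph_health(nodes: Iterable[str], edges: Iterable[Tuple[str, str]]) -> Dict[str, object]:
    """Report node count and isolated-node ratio against the benchmark thresholds."""
    node_set: Set[str] = set(nodes)
    connected: Set[str] = set()
    for a, b in edges:
        if a != b:  # a self-loop does not link a node to the rest of the graph
            connected.update((a, b))
    isolated = node_set - connected
    ratio = len(isolated) / len(node_set) if node_set else 0.0
    return {
        "node_count": len(node_set),
        "isolated_ratio": ratio,
        "node_count_ok": 500 <= len(node_set) <= 1500,
        "isolated_ok": ratio < 0.10,
    }


# Toy graph: four nodes, one of them ("d") with no edges.
report = graph_health(["a", "b", "c", "d"], [("a", "b"), ("b", "c")])
print(report)
```

Running the same report after each parameter change makes it easy to see whether a new chunk size improved or degraded graph connectivity.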

Sources
[1] [EMNLP2025] “LightRAG: Simple and Fast Retrieval-Augmented … https://github.com/HKUDS/LightRAG
[2] LightRAG: Simple and Fast Retrieval-Augmented Generation – arXiv https://arxiv.org/html/2410.05779v1
[3] Chunking for RAG: best practices – Unstructured https://unstructured.io/blog/chunking-for-rag-best-practices
[4] Chunking Strategies for LLM Applications – Pinecone https://www.pinecone.io/learn/chunking-strategies/
[5] Advanced Chunking/Retrieving Strategies for Legal Documents : r/Rag https://www.reddit.com/r/Rag/comments/1jdi4sg/advanced_chunkingretrieving_strategies_for_legal/
[6] 9 Chunking Strategies to Improve RAG Performance – Non-Brand Data https://www.nb-data.com/p/9-chunking-strategis-to-improve-rag
[7] RAG Chunking Strategies Deep Dive – DEV Community https://dev.to/vishalmysore/rag-chunking-strategies-deep-dive-2l72
[8] LightRAG/lightrag/api/README.md at main – GitHub https://github.com/HKUDS/LightRAG/blob/main/lightrag/api/README.md
[9] The Ultimate Guide to RAG Chunking Strategies – Agenta https://agenta.ai/blog/the-ultimate-guide-for-chunking-strategies
[10] LightRAG, Deployment and Usage Guide: The Simplest Tutorial for … https://stable-learn.com/en/lightrag-introduction/
[11] How to Choose the Right Chunking Method for Your RAG App https://www.linkedin.com/posts/bhavishya-pandit_7-chunking-methods-for-rag-activity-7396044216394571776-WMBJ
[12] lightrag-dembrane – PyPI https://pypi.org/project/lightrag-dembrane/
[13] [PDF] LightRAG: Simple and Fast Retrieval-Augmented Generation – arXiv https://arxiv.org/pdf/2410.05779.pdf
[14] LightRAG: A Better Approach to Graph-Enhanced Retrieval … https://www.linkedin.com/pulse/lightrag-better-approach-graph-enhanced-generation-holt-nguyen-hfx2c
[15] LightRAG https://lightrag.github.io