
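For **highly specialized Chinese corpora** such as Christian spiritual books, entity extraction is the GraphRAG step where things most often go wrong. The best practices below can be applied directly: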

***

## Define the spiritual-book ontology first

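**Do not let the LLM improvise.** Define the list of entity types in advance; otherwise it will extract "神" (God), "上帝" (God), and "主" (Lord) as three nodes with no edges between them. A suggested ontology for spiritual books uses the entity and relation types listed in the extraction prompt in the next section. [1]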
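
One way to make such an ontology enforceable is to hold the allowed types in code and reject anything outside them. The sketch below is an assumption about structure, not the author's own table: the entity types and relation patterns are copied from the extraction prompt in the next section, the relation labels are English renderings of those in that prompt, and `is_valid_triple` is a hypothetical helper name.

```python
from enum import Enum


class EntityType(str, Enum):
    """Entity types taken from the extraction prompt."""
    PERSON = "Person"
    CONCEPT = "Concept"
    SCRIPTURE = "Scripture"
    BOOK = "Book"
    DOCTRINE = "Doctrine"
    MOVEMENT = "Movement"
    PERIOD = "Period"


# Allowed (subject type, relation, object type) patterns.
ALLOWED_RELATIONS = {
    (EntityType.PERSON, "expounds", EntityType.CONCEPT),
    (EntityType.PERSON, "authored", EntityType.BOOK),
    (EntityType.CONCEPT, "developed_from", EntityType.CONCEPT),
    (EntityType.CONCEPT, "based_on", EntityType.SCRIPTURE),
    (EntityType.DOCTRINE, "contrasts_with", EntityType.DOCTRINE),
    (EntityType.PERSON, "influenced", EntityType.PERSON),
}


def is_valid_triple(subject_type: str, relation: str, object_type: str) -> bool:
    """Return True only if the triple matches a pattern in the ontology."""
    try:
        key = (EntityType(subject_type), relation, EntityType(object_type))
    except ValueError:
        # Unknown entity type: the model went outside the ontology.
        return False
    return key in ALLOWED_RELATIONS
```

Triples that fail this check can be logged for review rather than silently written to the graph.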

***

## Prompt engineering: an extraction template for the spiritual domain

The default prompt in generic GraphRAG does not recognize spiritual terminology at all, so it has to be customized: [2]

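```python
# Fill in with ENTITY_EXTRACTION_PROMPT.format(text=...); the doubled braces
# keep the JSON example literal when str.format is applied.
ENTITY_EXTRACTION_PROMPT = """
You are an expert in building knowledge graphs from spiritual books. Extract entities and relations from the passage below.

[Entity types]: Person, Concept, Scripture, Book, Doctrine, Movement, Period

[Relation types]:
- (Person) -[expounds]-> (Concept)
- (Person) -[authored]-> (Book)
- (Concept) -[developed_from]-> (Concept)
- (Concept) -[based_on]-> (Scripture)
- (Doctrine) -[contrasts_with]-> (Doctrine)
- (Person) -[influenced]-> (Person)

[Important rules]:
1. Normalize "神", "上帝", and "主" to "三一神" (the Triune God) or to the specific person of the Godhead
2. Keep the original terms; do not translate or simplify them (e.g. "经纶" (economy) is not the same as "计划" (plan))
3. If a relation is uncertain, mark confidence as low; do not guess

Output format (JSON):
{{"entities": [...], "relations": [...], "confidence": "high/medium/low"}}

Passage: {text}
"""
```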
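
Because the prompt asks for JSON, the reply can be checked before anything is written to the graph. The following is a minimal sketch under stated assumptions: the prompt does not spell out the fields of each entity or relation, so `name`, `type`, `source`, and `target` are assumed field names, and `parse_extraction` is a hypothetical helper. It keeps only entities of an allowed type and relations whose endpoints survived that filter.

```python
import json

ALLOWED_ENTITY_TYPES = {
    "Person", "Concept", "Scripture", "Book", "Doctrine", "Movement", "Period",
}
CONFIDENCE_LEVELS = {"high", "medium", "low"}


def parse_extraction(raw: str) -> dict:
    """Parse one model reply; fall back to an empty, low-confidence result."""
    empty = {"entities": [], "relations": [], "confidence": "low"}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return empty
    if not isinstance(data, dict):
        return empty

    # Assumed entity fields: "name" and "type".
    entities = [
        e for e in data.get("entities", [])
        if isinstance(e, dict) and e.get("type") in ALLOWED_ENTITY_TYPES
    ]
    names = {e.get("name") for e in entities}

    # Assumed relation fields: "source" and "target" (entity names).
    relations = [
        r for r in data.get("relations", [])
        if isinstance(r, dict)
        and r.get("source") in names
        and r.get("target") in names
    ]

    confidence = data.get("confidence", "low")
    if confidence not in CONFIDENCE_LEVELS:
        confidence = "low"
    return {"entities": entities, "relations": relations, "confidence": confidence}
```

Treating malformed replies as low confidence keeps them in the review queue instead of dropping them silently.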

***

## Entity disambiguation: the most easily overlooked pitfall

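Chinese spiritual books have a serious problem with **synonyms and multiple names for the same referent**, so a normalization dictionary is required: [3]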

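```python
# Alias dictionary (must be curated by hand): canonical name -> known aliases
ALIAS_MAP = {
    "倪柝声": ["Watchman Nee", "倪弟兄", "倪先生"],
    "李常受": ["Witness Lee", "李弟兄"],
    # Caution: the prompt's rule 2 says "经纶" is not "计划"; confirm "神的计划" belongs here.
    "神圣经纶": ["神的经纶", "神的计划", "Divine Economy"],
    "得救": ["救恩", "拯救", "被救"],
    "三一神": ["神", "上帝", "主", "父神", "耶稣基督"],  # depends on context
}
```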

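For core entities such as "三一神" (the Triune God), it is better to **split by person of the Godhead** than to merge everything into one node; otherwise the relation graph becomes too tangled.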
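
A dictionary like this is usually applied through a reverse index from alias to canonical name. The sketch below is one possible way to do that, not a prescribed method: `build_alias_index` and `normalize` are hypothetical helpers, the sample map is a shortened illustration, and the "圣父" entry is invented here only to show what happens when one alias ("神") points at two canonical entities. Ambiguous aliases return `None` so they can be resolved from context instead of being merged blindly.

```python
from collections import defaultdict

# Shortened sample for illustration; "圣父" is a hypothetical per-person entry.
SAMPLE_ALIAS_MAP = {
    "倪柝声": ["Watchman Nee", "倪弟兄", "倪先生"],
    "李常受": ["Witness Lee", "李弟兄"],
    "三一神": ["神", "上帝", "主"],
    "圣父": ["父神", "神"],
}


def build_alias_index(alias_map: dict) -> dict:
    """Map every alias (and every canonical name) to its candidate canonicals."""
    index = defaultdict(set)
    for canonical, aliases in alias_map.items():
        index[canonical].add(canonical)
        for alias in aliases:
            index[alias].add(canonical)
    return dict(index)


def normalize(mention: str, index: dict):
    """Return the canonical name, or None if the mention is unknown or ambiguous."""
    candidates = index.get(mention.strip(), set())
    if len(candidates) == 1:
        return next(iter(candidates))
    return None  # leave the decision to context or a human reviewer


index = build_alias_index(SAMPLE_ALIAS_MAP)
```

Mentions that come back as `None` are good candidates for the manual spot check after extraction.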

***

## Chunking strategy affects extraction quality

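Chunks that are too small (<200 characters) lose context, so relations that span sentences cannot be extracted; chunks that are too large (>1000 characters) spread the LLM's attention thin and lead to missed extractions. Recommendations: [1]

- **Chunk by paragraph**, keeping the section heading as metadata
- Keep a **50-character overlap** between each chunk and its neighbors, so relations that cross paragraph boundaries are not cut
- Handle structured passages (such as bibliographies and footnotes) separately; do not chunk them together with the body text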
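
The first two recommendations can be sketched as follows. This is a minimal illustration under assumptions not stated in the text: the input is plain text with Markdown-style `#` headings, paragraphs are separated by blank lines, and `chunk_by_paragraph` is a hypothetical helper. It does not merge or split paragraphs to hit the 200 to 1000 character range, and it does not detect structured passages.

```python
import re


def chunk_by_paragraph(text: str, overlap: int = 50) -> list[dict]:
    """Split text into paragraph chunks with heading metadata and overlap."""
    heading = ""
    paragraphs = []  # (heading, body) pairs in document order
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        if not lines:
            continue
        if lines[0].startswith("#"):
            # A heading line updates the metadata for the paragraphs after it.
            heading = lines[0].lstrip("#").strip()
            lines = lines[1:]
        body = "\n".join(lines).strip()
        if body:
            paragraphs.append((heading, body))

    chunks = []
    for i, (head, body) in enumerate(paragraphs):
        before = after = ""
        if overlap > 0 and i > 0:
            before = paragraphs[i - 1][1][-overlap:]  # tail of previous paragraph
        if overlap > 0 and i + 1 < len(paragraphs):
            after = paragraphs[i + 1][1][:overlap]    # head of next paragraph
        chunks.append({"heading": head, "body": body, "text": before + body + after})
    return chunks
```

The `text` field is what would be sent to the extractor, while `body` keeps the un-overlapped paragraph so that duplicated triples from the overlap can be traced back to a single source chunk.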

***

## Two-stage extraction pipeline (saves cost)

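Running GPT-4o extraction over a whole book costs roughly $15-30; a **two-stage strategy** can cut that cost by about 60%: [2]

1. **Coarse filter (small model)**: run `Qwen2.5-7B` locally to filter out chunks that contain no entities, keeping only chunks with theological content
2. **Fine extraction (large model)**: run Claude/GPT-4o on the surviving chunks to extract triples in detail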
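
The two stages above can be sketched as a small pipeline. Everything model-specific is left out on purpose: `has_theological_content` and `extract_triples` are hypothetical callables standing in for the local small-model call and the large-model call, and no real API is invoked. The returned counts show how many chunks actually reached the expensive stage.

```python
from typing import Callable, Iterable


def two_stage_extract(
    chunks: Iterable[str],
    has_theological_content: Callable[[str], bool],
    extract_triples: Callable[[str], list],
) -> tuple[list, dict]:
    """Run the cheap filter on every chunk and the costly extractor on survivors."""
    total = 0
    kept = 0
    triples = []
    for chunk in chunks:
        total += 1
        # Stage 1: cheap yes/no decision (e.g. a local small model).
        if not has_theological_content(chunk):
            continue
        kept += 1
        # Stage 2: detailed triple extraction (e.g. a hosted large model).
        triples.extend(extract_triples(chunk))
    stats = {"total": total, "sent_to_large_model": kept}
    return triples, stats
```

If large-model cost is roughly proportional to the number of chunks sent, the saving is about `1 - sent_to_large_model / total`, minus whatever the small model costs to run; the actual figure depends on how much of the book the filter removes.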

***

## Post-extraction validation (do not skip)

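After running extraction with **LightRAG** or **Microsoft GraphRAG**, manually spot-check 5% of the nodes: [4]

- Check for isolated nodes (nodes with no relation edges); these are usually extraction errors
- Check for extremely high-degree nodes (for example, "神" connected to 500+ edges); these need to be split or refined
- Visualize the graph in Neo4j Browser and inspect by eye whether the structure looks reasonable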
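
The first two checks and the 5% sample are easy to automate before opening a visualizer. The sketch below works on a plain node list and edge list rather than on any particular graph store; `audit_graph` and its parameter names are hypothetical, and the 500-edge threshold and 5% rate are simply the figures mentioned above.

```python
import random
from collections import Counter


def audit_graph(nodes, edges, hub_threshold=500, sample_rate=0.05, seed=0):
    """Flag isolated nodes and hubs, and draw a random sample for manual review."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1

    # Nodes with no edges at all are usually extraction errors.
    isolated = [n for n in nodes if degree[n] == 0]
    # Nodes at or above the threshold probably need splitting or refinement.
    hubs = [n for n in nodes if degree[n] >= hub_threshold]

    node_list = list(nodes)
    sample_size = max(1, round(len(node_list) * sample_rate)) if node_list else 0
    sample = random.Random(seed).sample(node_list, sample_size)
    return {"isolated": isolated, "hubs": hubs, "sample": sample}
```

A fixed `seed` makes the review sample reproducible, so two reviewers can look at the same nodes.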

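The quality ceiling of a spiritual-book knowledge graph is set by **the ontology design and the alias dictionary**; whatever time these two take is well spent.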

Sources
[1] From Unstructured Data to Entity Graph: 5 Questions to Ask Before … https://memgraph.com/blog/unstructured-data-to-entity-graph-best-practices
[2] Fine-Tuning Large Language Models for Graph RAG-5 Indexing https://www.dataworkz.com/blog/fine-tuning-a-llm-to-index-graph-rag-5/
[3] Automatic knowledge extraction from Chinese electronic medical … https://pmc.ncbi.nlm.nih.gov/articles/PMC10240026/
[4] GraphRAG-Bench: Challenging Domain-Specific Reasoning … – arXiv https://arxiv.org/html/2506.02404v2
[5] Graph RAG in the Wild: Insights and Best Practices from Real-World … https://www.semantic-web-journal.net/content/graph-rag-wild-insights-and-best-practices-real-world-applications
[6] Domain adaptation in 2025 – Fine-tuning v.s RAG/GraphRAG – Reddit https://www.reddit.com/r/LLMDevs/comments/1kiht8g/domain_adaptation_in_2025_finetuning_vs/
[7] How to Create a Knowledge Graph from Text? https://web.stanford.edu/class/cs520/2020/notes/How_To_Create_A_Knowledge_Graph_From_Text.html
[8] What is GraphRAG: Complete guide [2026] – Meilisearch https://www.meilisearch.com/blog/graph-rag
[9] GraphRAG in Practice: How to Build Cost-Efficient, High-Recall … https://www.facebook.com/groups/3670562573177653/posts/4429604870606749/
[10] Research on entity relation extraction for Chinese medical text https://journals.sagepub.com/doi/10.1177/14604582241274762
[11] GraphRAG-Bench: Challenging Domain-specific Reasoning for… https://openreview.net/forum?id=QcgkUJbfxT
[12] Knowledge Graph Construction of Chinese Traditional Yu Opera … https://dl.acm.org/doi/10.1145/3677389.3702564
[13] an ontological and knowledge graph approach to the transmission … https://www.nature.com/articles/s40494-024-01504-x
[14] Quality-Controllable automatic construction method of Chinese … https://www.sciencedirect.com/science/article/abs/pii/S0306457325000895
[15] [D] Knowledge Graph Extraction from Unstructured Medical Texts https://www.reddit.com/r/MachineLearning/comments/193s7dq/d_knowledge_graph_extraction_from_unstructured/