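Microsoft introduced the GraphRAG concept in April this year and open-sourced GraphRAG last week. The GitHub repository is at https://github.com/microsoft/graphrag, and at the time of writing it has 6,900 stars.
Installation
The official recommendation is Python 3.10-3.12. When I installed with Python 3.10, project initialization raised an error; after switching to Python 3.11 everything ran normally. My guess is that Python 3.10 is incompatible with some of Microsoft's latest SDKs, so I recommend a Python 3.11 environment. Installing GraphRAG itself is simple: the single command below is enough.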
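pip install graphrag
Usage
In this tutorial we use Ma Boyong's (马伯庸) short novel 《太白金星有点烦》 as the example text to see how well Microsoft's open-source GraphRAG handles it.
Note: GraphRAG uses an LLM to extract entities and relationships from text chunks, so it consumes a lot of tokens. For personal experimentation I do not recommend GPT-4-class models because the cost is too high (if money is no object, ignore this advice). Balancing cost and quality, I use the DeepSeek-Chat model here.
Initialize the project
I first created a temporary test directory, myTest. Following the official tutorial, I created an input directory under myTest, renamed the txt version of 《太白金星有点烦》 to book.txt, and placed it in input. Then I ran python -m graphrag.index --init to initialize the project and generate the configuration files.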
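mkdir -p ./myTest/input
# Example only: replace the URL with the txt file you actually want to test.
curl -o ./myTest/input/book.txt https://www.xxx.com/太白金星有点烦.txt
cd ./myTest
python -m graphrag.index --init
When the command finishes, several new entries appear in the current directory (myTest):
output: intermediate results from later runs are saved here.
prompts: the prompts used during processing.
.env: the LLM API configuration file. By default it contains only GRAPHRAG_API_KEY, which holds the model's API key.
settings.yaml: the overall configuration. If you use a model or API other than OpenAI's official ones, edit this file so that GraphRAG runs with your settings.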
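A quick way to confirm that initialization produced what the description above lists is to check for those four entries. The sketch below is only an illustration: the helper name `missing_init_artifacts` is hypothetical, and treating `output` and `prompts` as directories and `.env` and `settings.yaml` as files is an assumption taken from this article's description, not from GraphRAG's documentation.

```python
from pathlib import Path

# Entry names come from the article's description of what --init generates.
# Assumption: the first two are directories, the last two are regular files.
EXPECTED_DIRS = ("output", "prompts")
EXPECTED_FILES = (".env", "settings.yaml")


def missing_init_artifacts(root: str) -> list[str]:
    """Return the expected entries that are absent under `root`."""
    base = Path(root)
    missing = [name for name in EXPECTED_DIRS if not (base / name).is_dir()]
    missing += [name for name in EXPECTED_FILES if not (base / name).is_file()]
    return missing


if __name__ == "__main__":
    gaps = missing_init_artifacts(".")
    print("init looks complete" if not gaps else f"missing: {gaps}")
```

Run it from inside myTest; an empty result means every entry named above is present.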
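Configure the files
First set the LLM API key in the .env file. This setting applies globally, so once it is in .env you do not need to repeat it in settings.yaml. The default model in settings.yaml is gpt-4-turbo-preview. If you do not need to change the model or the API endpoint, configuration is already complete: you can skip the rest of this section and go straight to the run step.
I use an API key provided by agicto (mainly because new users get 10 RMB of free credit on sign-up, and free is nice). I mainly changed the API base URL and the model name. The complete settings file after my changes is as follows: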
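The reason the key only needs to live in .env is that settings.yaml refers to it through the `${GRAPHRAG_API_KEY}` placeholder. The following is a minimal sketch of that placeholder idea using only the standard library. It is not GraphRAG's actual loader: the helper names `parse_env` and `substitute` are hypothetical, and the key value is a made-up example.

```python
from string import Template


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def substitute(settings_text: str, env: dict[str, str]) -> str:
    """Replace ${NAME} placeholders; names missing from env are left untouched."""
    return Template(settings_text).safe_substitute(env)


# Hypothetical .env content; the key is not a real credential.
env = parse_env("GRAPHRAG_API_KEY=sk-example-not-a-real-key\n")
print(substitute("api_key: ${GRAPHRAG_API_KEY}\nbase_dir: output/${timestamp}/artifacts", env))
```

Placeholders that are not defined in the environment, such as `${timestamp}`, pass through unchanged in this sketch.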
encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: deepseek-chat
  model_supports_json: false # recommended if this is available for your model.
  api_base: https://api.agicto.cn/v1
  # max_tokens: 4000
  # request_timeout: 180.0
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: text-embedding-3-small
    api_base: https://api.agicto.cn/v1
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 300
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 0

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_report:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # max_tokens: 12000

global_search:
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32
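To make the `chunks` settings concrete (`size: 300`, `overlap: 100`), here is a minimal sliding-window sketch. It assumes each chunk starts `size - overlap` tokens after the previous one and that the input has already been tokenized; the function name `chunk_tokens` is hypothetical and this is not GraphRAG's own splitter. The n_tokens column in the run log later in this article ends with 300, 214, 14, which is consistent with a stride of 200 (214 - 200 = 14).

```python
def chunk_tokens(tokens: list, size: int = 300, overlap: int = 100) -> list[list]:
    """Split a token list into windows of `size` that overlap by `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    stride = size - overlap  # each window starts this many tokens after the previous one
    return [tokens[start:start + size] for start in range(0, len(tokens), stride)]


# Stand-in "tokens": plain integers instead of real tokenizer output.
lengths = [len(chunk) for chunk in chunk_tokens(list(range(614)))]
print(lengths)  # [300, 300, 214, 14]
```

Larger overlap keeps more shared context between neighbouring chunks but produces more chunks, and therefore more LLM calls during extraction.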
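Run and build the graph index
This is GraphRAG's core step: it builds the graph-based knowledge base that the later question-answering stage relies on. The following command starts it.
python -m graphrag.index
Following the approach described in Microsoft's paper, GraphRAG works through these stages: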
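Source Documents → Text Chunks: split the source documents into text chunks.
Text Chunks → Element Instances: extract graph node and edge instances from each text chunk.
Element Instances → Element Summaries: generate a summary for each graph element.
Element Summaries → Graph Communities: partition the graph into communities with a community detection algorithm.
Graph Communities → Community Summaries: generate a summary for each community.
Community Summaries → Community Answers → Global Answer: use the community summaries to produce partial answers, then combine those partial answers into a global answer (this last stage happens when a question is asked, not while the index is built).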
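The middle stages above can be pictured with a toy example: merge extracted relationship instances into a weighted graph, then group the nodes. This is a sketch under loud assumptions. Connected components stand in for the paper's community detection algorithm purely to keep the code self-contained, the edge weight simply counts repeated mentions, and the function names are hypothetical. The entity names are taken from the run log in this article.

```python
from collections import defaultdict, deque


def build_graph(instances):
    """Merge (source, target) relationship instances into a weighted undirected graph."""
    weights = defaultdict(float)
    adjacency = defaultdict(set)
    for source, target in instances:
        edge = tuple(sorted((source, target)))
        weights[edge] += 1.0  # assumption: weight = number of times the pair was extracted
        adjacency[source].add(target)
        adjacency[target].add(source)
    return weights, adjacency


def connected_components(adjacency):
    """Group nodes by reachability (a simplification, not real community detection)."""
    seen, groups = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        queue, group = deque([start]), []
        seen.add(start)
        while queue:
            node = queue.popleft()
            group.append(node)
            for neighbor in sorted(adjacency[node]):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        groups.append(sorted(group))
    return groups


instances = [
    ("飞马书屋", "《太白金星有点烦》"),
    ("马伯庸", "《太白金星有点烦》"),
    ("编辑", "出版社"),
    ("新书", "读者"),
]
weights, adjacency = build_graph(instances)
for group in connected_components(adjacency):
    print(group)
```

A real community detection algorithm would also split densely connected graphs into sub-communities at several levels, which is what the level column in the log reflects.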
Total runtime depends on the size of the text. My example took about 20 minutes in total and cost roughly 4 RMB. The output during the run was:
Reading settings from settings.yaml
/home/xinfeng/miniconda3/envs/graphrag-new/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: DataFrame.swapaxes is deprecated and will
be removed in a future version. Please use DataFrame.transpose instead.
  return bound(*args, **kwds)
create_base_text_units
id chunk ... document_ids n_tokens
0 5fe95645e8592dc5146ae4e6e2343ad4 \n附每天更新最新最全的小说飞马书屋(FEIAS.COM)\n\n《太白金星有点烦》... ... [764c0e80c3fc53191ccd9e87ad9e4803]
300
1 e91ee08e3684833d1dd3cb26679a8e6a 歪斜斜落在殿旁台阶上。\n李长庚从鹤背上跳下来猫腰检查了一下。台阶倒没什么事只是仙鹤的右... ...
[764c0e80c3fc53191ccd9e87ad9e4803] 300
2 7eea0da373e721b9f87ad6c7c05565de 同期飞升的神仙早换成了更威风的神兽坐骑只有李长庚念旧一直骑着这头老鹤四处奔波。\n李长庚... ...
[764c0e80c3fc53191ccd9e87ad9e4803] 300
3 d0fbd3139f977d98891f5aeae2ac9180 形了。\n“您回来啦” 织女头也没抬专心看着宝鉴。\n“嗯回来了。”\n李长庚端起童子... ...
[764c0e80c3fc53191ccd9e87ad9e4803] 300
4 ab349a2200a3878ba2a340c71ba1641f 来泡平白被自己的牛饮糟蹋了。\n李长庚嘬了嘬牙花子悻悻坐下把一沓玉简文书从怀里取出来。... ...
[764c0e80c3fc53191ccd9e87ad9e4803] 300
.. ... ... ... ... ...
214 7f8d6ded30cb1488837df6102c77cab4 旅游。编辑说买ps5也不能报哦。我说鹓雏非梧桐不止非练实不食非醴泉不饮会看得上你这点... ...
[764c0e80c3fc53191ccd9e87ad9e4803] 300
215 73b2cf432f11036b715a7ced295a6091 《两京十五日》之后我也是写了个短篇《长安的荔枝》休息权当运动之后的拉伸。\n最初我并没打... ...
[764c0e80c3fc53191ccd9e87ad9e4803] 300
216 1a10c703e1637de884a1fad7f109a50b 头一看好嘛居然有十万字。\n也好尽兴了疲惫一扫而空这波不亏。\n有朋友问我你是不... ...
[764c0e80c3fc53191ccd9e87ad9e4803] 300
217 239fe13a155eb285cebc6938559cf0e9 味也不合心意。\n当然这种乘兴而写的东西神在意前一气呵成固然写得舒畅细节不免粗糙... ...
[764c0e80c3fc53191ccd9e87ad9e4803] 214
218 b9fb2d6193b2840cdce5a3cf25542ca7 凑个整不然心里难受。\n\n ... [764c0e80c3fc53191ccd9e87ad9e4803] 14
[877 rows x 5 columns]
create_base_extracted_entities
entity_graph
0 <graphml xmlns="http://graphml.graphdrawing.or...
create_summarized_entities
entity_graph
0 <graphml xmlns="http://graphml.graphdrawing.or...
create_base_entity_graph
level clustered_graph
0 0 <graphml xmlns="http://graphml.graphdrawing.or...
1 1 <graphml xmlns="http://graphml.graphdrawing.or...
2 2 <graphml xmlns="http://graphml.graphdrawing.or...
3 3 <graphml xmlns="http://graphml.graphdrawing.or...
/home/xinfeng/miniconda3/envs/graphrag-new/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: DataFrame.swapaxes is deprecated and will
be removed in a future version. Please use DataFrame.transpose instead.
  return bound(*args, **kwds)
/home/xinfeng/miniconda3/envs/graphrag-new/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: DataFrame.swapaxes is deprecated and will
be removed in a future version. Please use DataFrame.transpose instead.
  return bound(*args, **kwds)
create_final_entities
id name ... text_unit_ids description_embedding
0 b45241d70f0e43fca764df95b2b81f77 飞马书屋 ... [159e9102707eeaef1f9188407e428111, 45e28cf587e... [0.008881675079464912, 0.012866131030023098, -...
1 4119fd06010c494caa07f439b333f4c5 马伯庸 ... [5fe95645e8592dc5146ae4e6e2343ad4] [0.03241756930947304, 0.03757039085030556, -0....
2 d3835bf3dda84ead99deadbeac5d0d7d 太白金星李长庚 ... [5fe95645e8592dc5146ae4e6e2343ad4] [0.002768812933936715, 0.020227784290909767,
-...
3 077d2820ae1845bcbb1803379a3d1eae 启明殿 ... [02c57ca370b4c0316a20148d00723bac, 046ed708031... [0.01269223727285862, 0.026068691164255142, 0....
4 3671ea0dd4e84c1a9b02c5ab2c8f4bac 《太白金星有点烦》 ... [5fe95645e8592dc5146ae4e6e2343ad4, 7f8d6ded30c... [0.003794945077970624, 0.016000036150217056,
-...
.. ... ... ... ... ...
207 7ea0bc1467e84184842de2d5e5bdd78e 《长安的荔枝》 ... [7f8d6ded30cb1488837df6102c77cab4] [0.012446477077901363, 0.005391148384660482,
0...
208 056f23eb710f471393ae5dc417d83fd9 两京十五日 ... [73b2cf432f11036b715a7ced295a6091] [0.021373916417360306, -0.0032437569461762905,...
209 e1ae27016d63447a8dfa021370cba0fa 长安的荔枝 ... [73b2cf432f11036b715a7ced295a6091] [0.022816641256213188, -0.0042687226086854935,...
210 f8c10f61a8f344cea7bdafa2d8af14b8 新书 ... [239fe13a155eb285cebc6938559cf0e9] [0.05925222113728523, 0.02118016593158245, -0....
211 aa7d003f25624e19bc88d3951d4dc943 读者 ... [239fe13a155eb285cebc6938559cf0e9] [0.0453583225607872, 0.020338334143161774, -0....
[851 rows x 8 columns]
/home/xinfeng/miniconda3/envs/graphrag-new/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: DataFrame.swapaxes is deprecated and will
be removed in a future version. Please use DataFrame.transpose instead.
  return bound(*args, **kwds)
/home/xinfeng/miniconda3/envs/graphrag-new/lib/python3.11/site-packages/datashaper/engine/verbs/convert.py:72: FutureWarning: errors='ignore' is deprecated and
will raise in a future version. Use to_datetime without passing errors and catch exceptions explicitly instead
  datetime_column = pd.to_datetime(column, errors="ignore")
/home/xinfeng/miniconda3/envs/graphrag-new/lib/python3.11/site-packages/datashaper/engine/verbs/convert.py:72: UserWarning: Could not infer format, so each
element will be parsed individually, falling back to dateutil. To ensure parsing is consistent and as-expected, please specify a format.
  datetime_column = pd.to_datetime(column, errors="ignore")
create_final_nodes
level title type description ... graph_embedding top_level_node_id x y
0 0 飞马书屋 ORGANIZATION 飞马书屋是一个多功能的在线阅读平台其域名为FEIMASW.COM。作为一个组织飞马书屋不... ... None
b45241d70f0e43fca764df95b2b81f77 0 0
1 0 马伯庸 PERSON 马伯庸是一位小说作者著有《太白金星有点烦》。 ... None
4119fd06010c494caa07f439b333f4c5 0 0
2 0 太白金星李长庚 PERSON 太白金星李长庚是小说《太白金星有点烦》中的主要角色最近感到烦恼。 ... None
d3835bf3dda84ead99deadbeac5d0d7d 0 0
3 0 启明殿 GEO 启明府是位于仙界的一个重要组织与三官府、二十八星宿相当显示了其在仙界中的地位。李长庚在此... ... None
077d2820ae1845bcbb1803379a3d1eae 0 0
4 0 《太白金星有点烦》 EVENT 《太白金星有点烦》是由马伯庸所著的一部小说讲述了太白金星李长庚的故事。这部作品是作者创作的... ...
None 3671ea0dd4e84c1a9b02c5ab2c8f4bac 0 0
... ... ... ... ... ... ... ... .. ..
3399 3 《长安的荔枝》 EVENT 作者在完成《两京十五日》后创作的短篇作品作为休息和拉伸。 ... None
7ea0bc1467e84184842de2d5e5bdd78e 0 0
3400 3 两京十五日 EVENT 《两京十五日》是一个文学作品作者在此之后创作了另一个短篇《长安的荔枝》。 ... None
056f23eb710f471393ae5dc417d83fd9 0 0
3401 3 长安的荔枝 EVENT 《长安的荔枝》是作者在创作《两京十五日》后写的一个短篇作为休息和创作的延续。 ... None
e1ae27016d63447a8dfa021370cba0fa 0 0
3402 3 新书 EVENT 新书发布是一个即将发生的事件作者希望得到读者的支持和关注。 ... None
f8c10f61a8f344cea7bdafa2d8af14b8 0 0
3403 3 读者 ... None aa7d003f25624e19bc88d3951d4dc943 0 0
[3404 rows x 14 columns]
/home/xinfeng/miniconda3/envs/graphrag-new/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: DataFrame.swapaxes is deprecated and will
be removed in a future version. Please use DataFrame.transpose instead.
  return bound(*args, **kwds)
/home/xinfeng/miniconda3/envs/graphrag-new/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: DataFrame.swapaxes is deprecated and will
be removed in a future version. Please use DataFrame.transpose instead.
  return bound(*args, **kwds)
create_final_communities
id title level raw_community relationship_ids text_unit_ids
0 0 Community 0 0 0 [1c97184ce5ea4049be417a3fd125357b, 13a044c4043... [159e9102707eeaef1f9188407e428111,45e28cf587e6...
1 2 Community 2 0 2 [8d9ded5fc9cf4c4faba8c6c8cd50e2f4, 595a841aa60... [02c57ca370b4c0316a20148d00723bac,046ed708031c...
2 4 Community 4 0 4 [5a224002ecbc4725abeb5a424aaca6a6, 8826a17bbda... [d0fbd3139f977d98891f5aeae2ac9180, 27248272776...
3 3 Community 3 0 3 [ea465e5cd92247829f52ff0c8591d1bb, 2dbac25b512... [003906d4aeb4b30451d6b15477f474cf,00aa40cc8961...
4 6 Community 6 0 6 [40c2425cb1c34c1591f7cb89f9f5e0bf, 7cf59650687... [0c08b05560ec3763c4eef3215d9de406,1bf7f3f6d2d8...
.. ... ... ... ... ... ...
167 171 Community 171 3 171 [cc08fc303cdc4177ad77e6e7d3d15cfd, 318a9d64ba7... [0110b1a44d2939f061fabdca3c0c822a,050f809899ba...
168 169 Community 169 3 169 [22dc64e73efe47c1be1be0552c3e935a, 0a983d6c050... [13318cc421ba835d8ee409100f7e3c43,4c0646412c3c...
169 166 Community 166 3 166 [2edf3e83c1c64da393d5206ce5b352a3, 58ff8f61ba2... [1a10c703e1637de884a1fad7f109a50b,d25d9589f7d4...
170 168 Community 168 3 168 [6104e6eabe444d6195ec6efc79a2d618, f7bdce302b5... [06047f1634e84ec122354736d0da0512,2cd2d62cc35c...
171 170 Community 170 3 170 [1268f164ec404b48a520fe672bca0f16, 2456d7a68d0... [4502bb159a6b1ae4429141760179b1f3,4a14da17885b...
[172 rows x 6 columns]
join_text_units_to_entity_ids
text_unit_ids entity_ids id
0 159e9102707eeaef1f9188407e428111 [b45241d70f0e43fca764df95b2b81f77, 19a7f254a5d... 159e9102707eeaef1f9188407e428111
1 45e28cf587e6d50704fd6ed866278782 [b45241d70f0e43fca764df95b2b81f77, 077d2820ae1... 45e28cf587e6d50704fd6ed866278782
2 4b8b97e111eb9dc6d262c5ec7eb60801 [b45241d70f0e43fca764df95b2b81f77, 19a7f254a5d... 4b8b97e111eb9dc6d262c5ec7eb60801
3 5fe95645e8592dc5146ae4e6e2343ad4 [b45241d70f0e43fca764df95b2b81f77, 4119fd06010... 5fe95645e8592dc5146ae4e6e2343ad4
4 6fe888799b2e26cd911859f9c31f85d6 [b45241d70f0e43fca764df95b2b81f77, 19a7f254a5d... 6fe888799b2e26cd911859f9c31f85d6
.. ... ... ...
871 73b2cf432f11036b715a7ced295a6091 [47f6d6573cf34e1096c95e36251dd60c, 056f23eb710... 73b2cf432f11036b715a7ced295a6091
872 da06f0769e85e52a06407bdf7dec4c2c [3f3a2d7aa1294116814f0b4d89baa23d, bbdd53a15e9... da06f0769e85e52a06407bdf7dec4c2c
873 239fe13a155eb285cebc6938559cf0e9 [5d398b88ee4242a59c32feb188683ec3, f8c10f61a8f... 239fe13a155eb285cebc6938559cf0e9
874 7837d3a4069066d3a313a050c5401a77 [bbdd53a15e99452a9deff05d1de2d965, d2ed972353a... 7837d3a4069066d3a313a050c5401a77
875 27b95fa0e9192d3c4088bbdd1d820b5c [9532cf83e9324ea0a46e5ac89bac407d, 8919fa72a9e... 27b95fa0e9192d3c4088bbdd1d820b5c
[876 rows x 3 columns]
/home/xinfeng/miniconda3/envs/graphrag-new/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: DataFrame.swapaxes is deprecated and will
be removed in a future version. Please use DataFrame.transpose instead.
  return bound(*args, **kwds)
/home/xinfeng/miniconda3/envs/graphrag-new/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: DataFrame.swapaxes is deprecated and will
be removed in a future version. Please use DataFrame.transpose instead.
  return bound(*args, **kwds)
/home/xinfeng/miniconda3/envs/graphrag-new/lib/python3.11/site-packages/datashaper/engine/verbs/convert.py:65: FutureWarning: errors='ignore' is deprecated and
will raise in a future version. Use to_numeric without passing errors and catch exceptions explicitly instead
  column_numeric = cast(pd.Series, pd.to_numeric(column, errors="ignore"))
create_final_relationships
source target weight description ... human_readable_id source_degree target_degree rank
0 飞马书屋 《太白金星有点烦》 1.0 飞马书屋提供《太白金星有点烦》这部小说的最新最全版本。 ... 0
4 4 8
1 飞马书屋 李长庚 4.0 李长庚是飞马书屋小说中的角色显示他与这个组织有文学上的联系。李长庚的对话内容被记录在飞马书... ...
1 4 323 327
2 飞马书屋 小说更新 1.0 飞马书屋提供每天最新最全的小说更新服务。 ... 2 4
1 5
3 飞马书屋 最好看的小说 1.0 飞马书屋提供最好看的小说满足读者的阅读需求。 ... 3 4
1 5
4 马伯庸 《太白金星有点烦》 1.0 马伯庸是《太白金星有点烦》这部小说的作者。 ... 4 1
4 5
... ... ... ... ... ... ... ... ... ...
1891 编辑 出版社 1.0 编辑在出版社工作负责处理作者的稿件。 ... 1891 3 1
4
1892 编辑 我 1.0 作者与编辑之间存在关于创作内容和休息方式的交流和分歧。 ... 1892 3
3 6
1893 我 《长安的荔枝》 1.0 作者在完成《两京十五日》后创作了《长安的荔枝》作为休息。 ... 1893 3
2 5
1894 《两京十五日》 《长安的荔枝》 1.0 《长安的荔枝》是作者在《两京十五日》之后创作的短篇作品作为休息。 ... 1894
1 2 3
1895 新书 读者 1.0 新书发布时作者希望得到读者的支持和捧场这是一种期待和互动的关系。 ... 1895 2
1 3
[1896 rows x 10 columns]
join_text_units_to_relationship_ids
id relationship_ids
0 5fe95645e8592dc5146ae4e6e2343ad4 [1c97184ce5ea4049be417a3fd125357b, ae0d3104647...
1 45e28cf587e6d50704fd6ed866278782 [13a044c404394c34af1e9b07c48aa985, 8d9ded5fc9c...
2 4b8b97e111eb9dc6d262c5ec7eb60801 [13a044c404394c34af1e9b07c48aa985, a9b900821b8...
3 a55a87d948656692651bffe4d3aa5f82 [13a044c404394c34af1e9b07c48aa985, 8d9ded5fc9c...
4 e080b0c08ed32f44c6adc344b9771781 [13a044c404394c34af1e9b07c48aa985, f8402b10349...
.. ... ...
871 613f893eee700fad17498654df3182c0 [58126221b0894f01bae564e2608b754d, 69b67d3b170...
872 239fe13a155eb285cebc6938559cf0e9 [fc757d03e1814784a3a213d87ea36e23, 21bd7045ca9...
873 b0c5905978e8e25106a43ca347427229 [9636a7d02e614d00ac8602bd65da987b, 1a315dfbb60...
874 7837d3a4069066d3a313a050c5401a77 [3fa936635320477cbb990905f5db11d6, 616436c3a00...
875 27b95fa0e9192d3c4088bbdd1d820b5c [c86a30f7f1fe4a01807dd66719394ec3, 392721fc26e...
[876 rows x 2 columns]
/home/xinfeng/miniconda3/envs/graphrag-new/lib/python3.11/site-packages/graphrag/index/graph/extractors/community_reports/prep_community_report_context.py:57:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  invalid_context_df[schemas.CONTEXT_STRING] = _sort_and_trim_context(
/home/xinfeng/miniconda3/envs/graphrag-new/lib/python3.11/site-packages/graphrag/index/graph/extractors/community_reports/utils.py:16: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[schemas.CONTEXT_SIZE] = df[schemas.CONTEXT_STRING].apply(lambda x: num_tokens(x))
/home/xinfeng/miniconda3/envs/graphrag-new/lib/python3.11/site-packages/graphrag/index/graph/extractors/community_reports/prep_community_report_context.py:61:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  invalid_context_df[schemas.CONTEXT_EXCEED_FLAG] = 0
create_final_community_reports
community full_content ... full_content_json id
0 164 # 天廷与神话组织\n\n天廷是一个神话中的组织负责管理仙界事务和财务与多个神话人物和地... ... {\n title: \u5929\u5ef7\u4e0e\u795e\u8bd...
7d7397c8-e65a-40ca-8f5e-c8ee95ec9bb0
1 166 # 宝象国与八十一难\n\n宝象国是一个地理位置涉及多个重要事件和人物包括玄奘、李长庚、... ... {\n title: \u5b9d\u8c61\u56fd\u4e0e\u516...
d9067bdd-b669-4ce7-b2e8-de33e6487bcf
2 168 # 阿傩与黄风怪的复杂关系\n\n该社区围绕阿傩和黄风怪展开涉及多个角色和组织如正途弟子... ... {\n title: \u963f\u50a9\u4e0e\u9ec4\u98c...
7288eb84-717e-46aa-8ccf-90432682a374
3 169 # 三星洞与石猴社区\n\n该社区以三星洞为核心组织涉及多个关键实体如石猴、六耳、冒名顶替... ... {\n title: \u4e09\u661f\u6d1e\u4e0e\u77f...
4db3be55-ad2a-4e7e-a311-29d4743a71be
4 170 # 通臂与三星洞社区\n\n该社区围绕通臂展开涉及多个关键实体如三星洞都管、孙悟空、六耳等... ... {\n title: \u901a\u81c2\u4e0e\u4e09\u661...
d993b5ec-6ec2-4d66-859d-512ca93bf01c
.. ... ... ... ... ...
148 3 # 两界山与取经之旅\n\n该社区以两界山和取经之旅为核心涉及多个关键实体如玄奘、阿傩长老... ... {\n title: \u4e24\u754c\u5c71\u4e0e\u53d...
d78001e2-5247-4bf9-b2af-fd2cbb109bf4
149 4 # 天庭与神话组织社区\n\n该社区以天庭为核心涉及多个神话组织和人物如释门、广目天王、... ... {\n title: \u5929\u5ead\u4e0e\u795e\u8bd...
085b720d-4f87-43a4-8a29-4a9da687789e
150 6 # 西王母与天庭关系网络\n\n该社区以西王母为核心涉及天庭、卷帘大将、李长庚等多个关键实... ... {\n title: \u897f\u738b\u6bcd\u4e0e\u592...
72ae5d47-5b05-43d1-a60a-de2f92557b56
151 8 # 文殊与普贤的佛教神祇社区\n\n该社区以文殊和普贤两位佛教菩萨为核心围绕他们的活动和互... ... {\n title: \u6587\u6b8a\u4e0e\u666e\u8d2...
cb782e19-0886-4bf9-93b7-bf864adfa2f3
152 9 # 护法渡劫与师徒四人\n\n该社区围绕‘护法渡劫’事件展开涉及‘师徒四人’、‘菩萨’等关... ... {\n title: \u62a4\u6cd5\u6e21\u52ab\u4e0...
52ddd79d-cfa7-43d1-a070-5598db14461d
[153 rows x 10 columns]
create_final_text_units
id ... relationship_ids
0 5fe95645e8592dc5146ae4e6e2343ad4 ... [1c97184ce5ea4049be417a3fd125357b, ae0d3104647...
1 e91ee08e3684833d1dd3cb26679a8e6a ... [26c926c6016d4639b05427f01ba629f5, 8f6872eeb81...
2 7eea0da373e721b9f87ad6c7c05565de ... [8d9ded5fc9cf4c4faba8c6c8cd50e2f4, 595a841aa60...
3 d0fbd3139f977d98891f5aeae2ac9180 ... [ac80a99fda2b488285d29596dd4d1471, 67d6a3481e4...
4 ab349a2200a3878ba2a340c71ba1641f ... [904cd052ec194654bb72f4027e43daa3, 7e88fd2e835...
.. ... ... ...
872 7f8d6ded30cb1488837df6102c77cab4 ... [6bb9bed2e39c4e31a81f12479af3d16c, 7dbca0fef7d...
873 73b2cf432f11036b715a7ced295a6091 ... [2f13e93b77b84d5994605e27c17c3244, 20574c1c47c...
874 1a10c703e1637de884a1fad7f109a50b ... [e65667ec99e145fea2055d6b583cb05b, 2edf3e83c1c...
875 239fe13a155eb285cebc6938559cf0e9 ... [fc757d03e1814784a3a213d87ea36e23, 21bd7045ca9...
876 b9fb2d6193b2840cdce5a3cf25542ca7 ... None
[877 rows x 6 columns]
/home/xinfeng/miniconda3/envs/graphrag-new/lib/python3.11/site-packages/datashaper/engine/verbs/convert.py:72: FutureWarning: errors='ignore' is deprecated and
will raise in a future version. Use to_datetime without passing errors and catch exceptions explicitly instead
  datetime_column = pd.to_datetime(column, errors="ignore")
create_base_documents
id text_units raw_content title
0 764c0e80c3fc53191ccd9e87ad9e4803 [5fe95645e8592dc5146ae4e6e2343ad4, e91ee08e368...
\n附每天更新最新最全的小说飞马书屋(FEIAS.COM)\n\n《太白金星有点烦》... book.txt
create_final_documents
id text_unit_ids raw_content title
0 764c0e80c3fc53191ccd9e87ad9e4803 [5fe95645e8592dc5146ae4e6e2343ad4, e91ee08e368...
\n附每天更新最新最全的小说飞马书屋(FEIAS.COM)\n\n《太白金星有点烦》... book.txt
⠋ GraphRAG Indexer
├── Loading Input (InputFileType.text) - 1 files loaded (0 filtered) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
├── create_base_text_units
├── create_base_extracted_entities
├── create_summarized_entities
├── create_base_entity_graph
├── create_final_entities
├── create_final_nodes
├── create_final_communities
├── join_text_units_to_entity_ids
├── create_final_relationships
├── join_text_units_to_relationship_ids
├── create_final_community_reports
├── create_final_text_units
├── create_base_documents
└── create_final_documents
All workflows completed successfully.
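One detail worth noticing in the create_final_relationships table in the log above: in every row that is printed, rank equals source_degree plus target_degree. The sketch below checks that against the logged values and shows a toy computation of the same quantity. This is only an observation about the rows shown in this run, not a statement of how GraphRAG defines rank, and the function names are hypothetical.

```python
from collections import Counter

# human_readable_id -> (source_degree, target_degree, rank),
# copied from the rows printed in the log above.
LOGGED_ROWS = {
    0: (4, 4, 8),
    1: (4, 323, 327),
    2: (4, 1, 5),
    3: (4, 1, 5),
    4: (1, 4, 5),
    1891: (3, 1, 4),
    1892: (3, 3, 6),
    1893: (3, 2, 5),
    1894: (1, 2, 3),
    1895: (2, 1, 3),
}


def rank_mismatches(rows):
    """Return the ids whose rank differs from source_degree + target_degree."""
    return [rid for rid, (src, tgt, rank) in rows.items() if src + tgt != rank]


def edge_ranks(edges):
    """Toy version: degree = edges touching a node, rank = sum of both endpoint degrees."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return {(s, t): degree[s] + degree[t] for s, t in edges}


print(rank_mismatches(LOGGED_ROWS))  # [] means every logged row satisfies the sum
print(edge_ranks([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]))
```

Read this way, a high rank marks a relationship whose endpoints are both heavily connected, such as the edge touching 李长庚 with a target degree of 323.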
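Ask questions against the knowledge base
GraphRAG supports two query modes: global search and local search. Global search targets questions that require understanding the whole corpus, for example "What are the main themes of the dataset?" Such questions need corpus-wide understanding and summarization rather than retrieval from one local region of the text. Local search, by contrast, works from local regions of the text (text chunks), which are the units that RAG methods retrieve.
Let's have GraphRAG describe what this story is about. The command is below; the question means "What is this article mainly about?"
python -m graphrag.query --root ../myTest --method global "这篇文章主要讲述了什么内容?"
The output is: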
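The paper describes global search as a map-reduce over community summaries: each summary yields a partial answer with a helpfulness score from 0 to 100, answers scored 0 are dropped, and the rest are added in descending score order until a token budget is reached. The sketch below illustrates only the reduce step under simplifying assumptions. The partial answers and scores are made up, tokens are counted by whitespace splitting, and `PartialAnswer` and `reduce_answers` are hypothetical names rather than GraphRAG's API.

```python
from dataclasses import dataclass


@dataclass
class PartialAnswer:
    community_id: int
    text: str
    score: int  # helpfulness in the range 0-100, assigned during the map step


def reduce_answers(partials, max_tokens, count_tokens=lambda s: len(s.split())):
    """Keep the most helpful partial answers that fit within the token budget."""
    useful = sorted((p for p in partials if p.score > 0), key=lambda p: p.score, reverse=True)
    kept, used = [], 0
    for partial in useful:
        cost = count_tokens(partial.text)
        if used + cost > max_tokens:
            break  # budget reached: stop adding lower-ranked answers
        kept.append(partial)
        used += cost
    return kept


# Made-up partial answers standing in for LLM output from the map step.
partials = [
    PartialAnswer(82, "Li Changgeng coordinates the pilgrimage", 90),
    PartialAnswer(95, "Zhinü works with Li Changgeng", 60),
    PartialAnswer(12, "unrelated publisher notice", 0),
]
print([p.community_id for p in reduce_answers(partials, max_tokens=8)])
```

In a full system the kept answers would then be passed to the model once more to write the final response.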
INFO: Reading settings from ../myTest/settings.yaml
/home/xinfeng/miniconda3/envs/graphrag-new/lib/python3.11/site-packages/graphrag/query/indexer_adapters.py:71: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
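Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  entity_df["community"] = entity_df["community"].fillna(-1)
/home/xinfeng/miniconda3/envs/graphrag-new/lib/python3.11/site-packages/graphrag/query/indexer_adapters.py:72: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  entity_df["community"] = entity_df["community"].astype(int)
creating llm client with {api_key: REDACTED,len51, type: openai_chat, model: deepseek-chat, max_tokens: 4000, request_timeout: 180.0, api_base: https://api.agicto.cn/v1, api_version: None, organization: None, proxy: None, cognitive_services_endpoint: None, deployment_name: None, model_supports_json: False, tokens_per_minute: 0, requests_per_minute: 0, max_retries: 10, max_retry_wait: 10.0, sleep_on_rate_limit_recommendation: True, concurrent_requests: 25}

SUCCESS: Global Search Response: This article mainly tells the stories of several mythological communities. Each community revolves around particular core figures or events and involves complex interactions and dynamics. These communities include the community of the Jade Emperor and the Heavenly Court deities … [Data: Reports (138, 119, 136, 93, 58, more)]

In addition, the article covers the complex relationships and dynamics among multiple communities and events, involving entities such as Wukong, Li Changgeng, and Guanyin, along with their interactions and influence on one another. This content spans several levels, from religion to politics, and shows each community's core roles and important events. [Data: Reports (125, 115, 143, 71, 92, more)]

The article also describes in detail Li Changgeng's relationship with the Heavenly Court: his core role there, his complex relationships with Guanyin, Sun Wukong, Xuanzang, and the scripture-seeking party, and his many key duties and influence. [Data: Reports (82)]

The article further revolves around Zhinü and Yaochi in the Heavenly Court community, involving multiple mythological figures and events, including Zhinü's role and influence in the Heavenly Court, Yaochi's standing in the community, the family relationship between Zhinü and Niulang, Zhinü's interest in Xuanzang's journey, and her working relationship with Li Changgeng. [Data: Reports (95)]

Finally, the article covers the Buddhist deity community of Wenshu and Puxian, built around their activities and interactions, including the selection of the scripture-seeking party, the "testing the Zen mind" episode, and their complex interactions with Li Changgeng. [Data: Reports (60)]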
Now let's ask the same question using local search. The command:
python -m graphrag.query --root ../myTest --method local "这篇文章主要讲述了什么内容?"
The output is:
INFO: Reading settings from ../myTest/settings.yaml
[2024-07-07T13:58:58Z WARN lance::dataset] No existing dataset at /home/xinfeng/PycharmProjects/graphrag/myTest/lancedb/description_embedding.lance, it will be created
/home/xinfeng/miniconda3/envs/graphrag-new/lib/python3.11/site-packages/graphrag/query/indexer_adapters.py:71: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  entity_df["community"] = entity_df["community"].fillna(-1)
/home/xinfeng/miniconda3/envs/graphrag-new/lib/python3.11/site-packages/graphrag/query/indexer_adapters.py:72: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  entity_df["community"] = entity_df["community"].astype(int)
creating llm client with {api_key: REDACTED,len51, type: openai_chat, model: deepseek-chat, max_tokens: 4000, request_timeout: 180.0, api_base: https://api.agicto.cn/v1, api_version: None, organization: None, proxy: None, cognitive_services_endpoint: None, deployment_name: None, model_supports_json: False, tokens_per_minute: 0, requests_per_minute: 0, max_retries: 10, max_retry_wait: 10.0, sleep_on_rate_limit_recommendation: True, concurrent_requests: 25}
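creating embedding llm client with {api_key: REDACTED,len51, type: openai_embedding, model: text-embedding-3-small, max_tokens: 4000, request_timeout: 180.0, api_base: https://api.agicto.cn/v1, api_version: None, organization: None, proxy: None, cognitive_services_endpoint: None, deployment_name: None, model_supports_json: None, tokens_per_minute: 0, requests_per_minute: 0, max_retries: 10, max_retry_wait: 10.0, sleep_on_rate_limit_recommendation: True, concurrent_requests: 25}

SUCCESS: Local Search Response: This article mainly describes Li Changgeng's core role in the Heavenly Court, his complex relationships with several key figures, and his important part in the scripture-seeking mission. A detailed overview follows.

### Li Changgeng's core role in the Heavenly Court
Li Changgeng holds several key duties in the Heavenly Court, including presiding over the Qiming Office (启明司) and designing the protection stratagems (护法锦囊). His actions and decisions directly affect the stability of the Heavenly Court and the progress of the scripture-seeking mission. His complex role and multiple duties give him great influence in the Heavenly Court and also carry considerable potential risk. [Data: Entities (5), Relationships (49, 82, 39, 58, 83, 138, 155, 74, 46)]

### Li Changgeng's complex relationship with Guanyin
The relationship between Li Changgeng and Guanyin is complex and multi-layered, involving cooperation, argument, and strategic interaction. They have been through many difficulties together and share a tacit understanding. Li Changgeng responds to Guanyin's threats through intangible influence, while Guanyin is dissatisfied with Li Changgeng's arrangements. This complex relationship directly affects the progress of the mission. [Data: Relationships (49)]

### Li Changgeng's close relationship with Sun Wukong
The relationship between Li Changgeng and Sun Wukong is complex and close, involving guidance, concern, and strategic interaction. Sun Wukong shows understanding of Li Changgeng's state of cultivation and the matters he cares about, while Li Changgeng reminds Sun Wukong to mind cause and effect. Direct communication and cooperation between the two are crucial to the progress of the mission. [Data: Relationships (82)]

### Li Changgeng's interactions with Xuanzang
A series of complex relationships and interactions exists between Li Changgeng and Xuanzang. Li Changgeng is planning an event related to Xuanzang's journey, which shows his strong interest in Xuanzang's experiences and achievements. The disputes and cooperation between the two have an important effect on the progress of the mission. [Data: Relationships (39)]

### Li Changgeng's relationship with the scripture-seeking party
Li Changgeng's relationship with the scripture-seeking party is complex and full of care. Although he does not take part directly in the party's activities, his discussions take place against the backdrop of those activities. Li Changgeng escorts the party through its tribulations, showing his loyalty to and support for the group. [Data: Relationships (58)]

Through a detailed description of Li Changgeng's role in the Heavenly Court and his relationships with Guanyin, Sun Wukong, Xuanzang, and the scripture-seeking party, the article shows his central position and important role in the Heavenly Court and in the mission.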
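Both responses cite their supporting records in `[Data: ...]` markers, so it can be handy to pull those ids out, for example to look up the cited reports or relationships afterwards. The following is a minimal regex-based sketch that assumes the markers always follow the shape seen in the output above, a kind name followed by a parenthesised list of numbers; `extract_citations` is a hypothetical helper, not part of GraphRAG.

```python
import re

# Matches one "[Data: ...]" marker and captures its inner text.
DATA_BLOCK = re.compile(r"\[Data: ([^\]]*)\]")
# Matches one "Kind (1, 2, 3)" group inside a marker.
SOURCE_GROUP = re.compile(r"(\w+) \(([^)]*)\)")


def extract_citations(response: str) -> dict[str, list[int]]:
    """Collect cited ids per kind, in first-seen order and without duplicates."""
    cited: dict[str, list[int]] = {}
    for block in DATA_BLOCK.findall(response):
        for kind, ids in SOURCE_GROUP.findall(block):
            bucket = cited.setdefault(kind, [])
            for token in re.findall(r"\d+", ids):  # skips non-numeric entries such as "more"
                number = int(token)
                if number not in bucket:
                    bucket.append(number)
    return cited


sample = (
    "... [Data: Entities (5), Relationships (49, 82, 39, 58, 83, 138, 155, 74, 46)] "
    "... [Data: Relationships (49)]"
)
print(extract_citations(sample))
```

Counting the distinct ids per response is also a rough way to compare how much supporting data each search mode drew on.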
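Comparing the two, local search does surface more detail. That is the main content of this article. In the second article I will pick a representative text and compare GraphRAG with conventional RAG in a practical scenario, and the third will cover GraphRAG's main implementation principles. I usually work overtime on weekdays and get home late, and I need to spend time with my family, so updates may be slow; I expect to publish the second article next weekend. For more of the theory, I recommend reading https://arxiv.org/pdf/2404.16130.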
Thanks for reading to the end. You are welcome to visit my personal blog when you have time, where I post more updates. No worries if you don't: I will keep posting whatever I find valuable on CSDN as well.