Building a Knowledge Graph from Text Data with LangChain

二师兄talks 2024-02-25


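The brain stores knowledge as a graph of information.

In this article, I will walk you through what a knowledge graph is and how to build one from your own text data.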

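What is a knowledge graph?

A knowledge graph, also called a semantic network, is a structure that stores data as nodes and edges. As shown in Figure 1, nodes represent objects and edges represent the relationships between them. Knowledge graphs are often expressed in the Resource Description Framework (RDF), a W3C standard that describes resources on the Web as subject-predicate-object triples.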
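To make the node-and-edge idea concrete, here is a minimal sketch in plain Python. The two triples are hypothetical sample data chosen for illustration, and the `describe` helper is not part of any library.

```python
from collections import defaultdict

# Hypothetical triples in (subject, predicate, object) form, for illustration only.
triples = [
    ("Paris", "is the capital of", "France"),
    ("Eiffel Tower", "is in", "Paris"),
]

# Nodes are every subject and object; edges are stored per source node.
nodes = set()
edges = defaultdict(list)
for subject, predicate, obj in triples:
    nodes.update((subject, obj))
    edges[subject].append((predicate, obj))


def describe(node):
    """Return a readable statement for every edge leaving `node`."""
    return [f"{node} {predicate} {obj}" for predicate, obj in edges.get(node, [])]


print(sorted(nodes))
print(describe("Paris"))
```

Each edge carries its predicate as a label, which is the same shape the extracted triples take later in this article.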

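Why do we need knowledge graphs?

In most data stories, a small number of data points essentially represent the whole dataset. A knowledge graph therefore stores only those important data points, which lowers both retrieval time and storage space.

Two of my favorite use cases for knowledge graphs are drug discovery and RAG-based virtual assistant chatbots.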

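Implementation

1. Install and import the packages

(Note: we will use OpenAI's GPT-3.5 to generate the entities and relations, so make sure you have your OpenAI API key ready.)

Install the packages with your preferred package manager. Here I use pip to install and manage the dependencies.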

pip install -q langchain openai pyvis gradio==3.39.0

Import the installed packages.

from langchain.prompts import PromptTemplate
from langchain.llms.openai import OpenAI
from langchain.chains import LLMChain
from langchain.graphs.networkx_graph import KG_TRIPLE_DELIMITER
from pprint import pprint
from pyvis.network import Network
import networkx as nx
import gradio as gr

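2. Set the API key

Set the API key using the value copied from the OpenAI platform dashboard. Here I pass it in through Colab secrets, so before running the cell, make sure you have created a secret named OPENAI_API_KEY that holds your key.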

from google.colab import userdata
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
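Outside Colab, the `google.colab` import above is not available. The following is a minimal sketch of a fallback, assuming the key is also exported as an environment variable of the same name; `load_openai_api_key` is a hypothetical helper, not part of LangChain or Colab.

```python
import os


def load_openai_api_key(name="OPENAI_API_KEY"):
    """Return the key from Colab secrets if available, else from the environment."""
    try:
        # Only importable inside Google Colab.
        from google.colab import userdata

        key = userdata.get(name)
    except Exception:
        # Not on Colab, or the secret is missing or not accessible.
        key = None
    key = key or os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set in Colab secrets or the environment")
    return key
```

With this helper, the assignment becomes `OPENAI_API_KEY = load_openai_api_key()`, and the rest of the notebook stays unchanged.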

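3. Define the prompt

Asking the LLM the right question is essential for getting the output we need. Here we add a few examples alongside the instructions to reduce hallucination at inference time. This style is known as few-shot prompting. Feel free to read through the prompt to see how it works.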

# Prompt template for knowledge triple extraction
_DEFAULT_KNOWLEDGE_TRIPLE_EXTRACTION_TEMPLATE = (
    "You are a networked intelligence helping a human track knowledge triples"
    " about all relevant people, things, concepts, etc. and integrating"
    " them with your knowledge stored within your weights"
    " as well as that stored in a knowledge graph."
    " Extract all of the knowledge triples from the text."
    " A knowledge triple is a clause that contains a subject, a predicate,"
    " and an object. The subject is the entity being described,"
    " the predicate is the property of the subject that is being"
    " described, and the object is the value of the property.\n\n"
    "EXAMPLE\n"
    "It's a state in the US. It's also the number 1 producer of gold in the US.\n\n"
    f"Output: (Nevada, is a, state){KG_TRIPLE_DELIMITER}(Nevada, is in, US)"
    f"{KG_TRIPLE_DELIMITER}(Nevada, is the number 1 producer of, gold)\n"
    "END OF EXAMPLE\n\n"
    "EXAMPLE\n"
    "I'm going to the store.\n\n"
    "Output: NONE\n"
    "END OF EXAMPLE\n\n"
    "EXAMPLE\n"
    "Oh huh. I know Descartes likes to drive antique scooters and play the mandolin.\n"
    f"Output: (Descartes, likes to drive, antique scooters){KG_TRIPLE_DELIMITER}(Descartes, plays, mandolin)\n"
    "END OF EXAMPLE\n\n"
    "EXAMPLE\n"
    "{text}"
    "Output:"
)


KNOWLEDGE_TRIPLE_EXTRACTION_PROMPT = PromptTemplate(
    input_variables=["text"],
    template=_DEFAULT_KNOWLEDGE_TRIPLE_EXTRACTION_TEMPLATE,
)
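The few-shot structure of the template can also be assembled from a list of examples, which makes it easier to see what the model receives. This is a sketch in plain Python rather than the LangChain API: `DELIMITER`, `EXAMPLES`, `format_output`, and `build_prompt` are hypothetical names, and the delimiter value assumes that `KG_TRIPLE_DELIMITER` is the string `<|>`.

```python
# Assumption: LangChain's KG_TRIPLE_DELIMITER is the string "<|>".
DELIMITER = "<|>"

# (input text, expected triples) pairs; an empty list is rendered as "NONE".
EXAMPLES = [
    ("I'm going to the store.", []),
    (
        "Oh huh. I know Descartes likes to drive antique scooters and play the mandolin.",
        [
            ("Descartes", "likes to drive", "antique scooters"),
            ("Descartes", "plays", "mandolin"),
        ],
    ),
]


def format_output(triples):
    """Render a list of triples the way the prompt examples show them."""
    if not triples:
        return "NONE"
    return DELIMITER.join(f"({s}, {p}, {o})" for s, p, o in triples)


def build_prompt(instruction, text):
    """Join the instruction, the worked examples, and the new text into one prompt."""
    blocks = [instruction, ""]
    for example_text, triples in EXAMPLES:
        blocks += [
            "EXAMPLE",
            example_text,
            f"Output: {format_output(triples)}",
            "END OF EXAMPLE",
            "",
        ]
    # The final, unanswered example is what the model is asked to complete.
    blocks += ["EXAMPLE", text, "Output:"]
    return "\n".join(blocks)


prompt = build_prompt(
    "Extract all of the knowledge triples from the text.",
    "Paris is the capital of France.",
)
print(prompt)
```

The prompt ends with an open `Output:` line, so the model's completion is expected to follow the pattern set by the answered examples.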

4. Initialize the chain

Initialize the chain with the LLMChain class, using the prompt defined above.

llm = OpenAI(
    api_key=OPENAI_API_KEY,
    temperature=0.9
    )


# Create an LLMChain with the knowledge triple extraction prompt
chain = LLMChain(llm=llm, prompt=KNOWLEDGE_TRIPLE_EXTRACTION_PROMPT)

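To build a knowledge graph, all you need is some relevant text data. Here I load the text from a string. Note, however, that you can also use LangChain's document loaders [3] to load data from popular formats such as PDF, JSON, and Markdown.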

# Run the chain with the specified text
text = "The city of Paris is the capital and most populous city of France. The Eiffel Tower is a famous landmark in Paris."
triples = chain.invoke(
    {'text' : text}
).get('text')
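For text longer than a couple of sentences, one option is to split it into smaller pieces and run the chain on each piece. The sketch below groups whole sentences into chunks using only the standard library; `split_into_chunks` and its `max_chars` parameter are hypothetical, and the sentence splitting is deliberately naive.

```python
import re


def split_into_chunks(text, max_chars=500):
    """Group whole sentences into chunks of at most roughly `max_chars` characters."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


sample = (
    "The city of Paris is the capital and most populous city of France. "
    "The Eiffel Tower is a famous landmark in Paris."
)
for chunk in split_into_chunks(sample, max_chars=80):
    print(chunk)
```

Each chunk could then be passed to `chain.invoke({'text': chunk})` and the returned triples collected into one list.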

Then parse the returned triples with this user-defined function:

def parse_triples(response, delimiter=KG_TRIPLE_DELIMITER):
    if not response:
        return []
    return response.split(delimiter)


triples_list = parse_triples(triples)


pprint(triples_list)

Output:

[' (Paris, is the capital of, France)',
 '(Paris, is the most populous city in, France)',
 '(Eiffel Tower, is a, famous landmark)',
 '(Eiffel Tower, is in, Paris)']
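The output above is a list of strings that still carry parentheses and stray whitespace. A stricter parser could turn each one into a `(subject, predicate, object)` tuple and drop anything malformed. This is a sketch with hypothetical names (`parse_triple_tuples`, `TRIPLE_PATTERN`), and it assumes the delimiter is `<|>`.

```python
import re

# Assumption: the delimiter matches LangChain's KG_TRIPLE_DELIMITER ("<|>").
DELIMITER = "<|>"
# Subject and predicate stop at the first comma; the object takes the rest.
TRIPLE_PATTERN = re.compile(r"^\((.+?),\s*(.+?),\s*(.+)\)$")


def parse_triple_tuples(response, delimiter=DELIMITER):
    """Turn '(s, p, o)<|>(s, p, o)' into a list of (s, p, o) tuples."""
    if not response or response.strip() == "NONE":
        return []
    result = []
    for chunk in response.split(delimiter):
        match = TRIPLE_PATTERN.match(chunk.strip())
        if match:  # skip anything that is not a well-formed triple
            result.append(tuple(part.strip() for part in match.groups()))
    return result


raw = " (Paris, is the capital of, France)<|>(Eiffel Tower, is in, Paris)"
print(parse_triple_tuples(raw))
```

Working with tuples avoids re-splitting strings on commas later, which matters when an object itself contains a comma.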

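5. Visualize the knowledge graph

Here we use PyVis to visualize the knowledge graph and the Gradio framework to display it interactively.

The following user-defined functions make the task easier: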

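def create_graph_from_triplets(triplets):
    G = nx.DiGraph()
    for triplet in triplets:
        # Drop the surrounding parentheses, e.g. "(Paris, is in, France)"
        parts = [p.strip() for p in triplet.strip().strip('()').split(',')]
        if len(parts) != 3:
            continue  # skip malformed triples instead of raising ValueError
        subject, predicate, obj = parts
        G.add_edge(subject, obj, label=predicate)
    return G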


def nx_to_pyvis(networkx_graph):
    pyvis_graph = Network(notebook=True, cdn_resources='remote')
    for node in networkx_graph.nodes():
        pyvis_graph.add_node(node)
    for edge in networkx_graph.edges(data=True):
        pyvis_graph.add_edge(edge[0], edge[1], label=edge[2]["label"])
    return pyvis_graph


def generateGraph():
    triplets = [t.strip() for t in triples_list if t.strip()]
    graph = create_graph_from_triplets(triplets)
    pyvis_network = nx_to_pyvis(graph)


    pyvis_network.toggle_hide_edges_on_drag(True)
    pyvis_network.toggle_physics(False)
    pyvis_network.set_edge_smooth('discrete')


    html = pyvis_network.generate_html()
    html = html.replace("'", "\"")


    return f"""<iframe style="width: 100%; height: 600px;margin:0 auto" name="result" allow="midi; geolocation; microphone; camera;
    display-capture; encrypted-media;" sandbox="allow-modals allow-forms
    allow-scripts allow-same-origin allow-popups
    allow-top-navigation-by-user-activation allow-downloads" allowfullscreen=""
    allowpaymentrequest="" frameborder="0" srcdoc='{html}'></iframe>"""
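The function above swaps single quotes for double quotes so that the page fits inside a single-quoted `srcdoc` attribute, which also alters any apostrophe in the page itself. A possible alternative, sketched here with the standard library only, is to escape the page with `html.escape`; `wrap_in_iframe` and the reduced attribute list are assumptions for illustration, not what the original code does.

```python
import html


def wrap_in_iframe(document_html, height_px=600):
    """Embed a full HTML document in an iframe through the srcdoc attribute."""
    # Escaping &, <, > and both quote characters keeps the attribute value intact;
    # the browser decodes the entities before rendering the embedded page.
    escaped = html.escape(document_html, quote=True)
    return (
        f'<iframe style="width: 100%; height: {height_px}px; margin: 0 auto" '
        f'frameborder="0" sandbox="allow-scripts allow-same-origin" '
        f'srcdoc="{escaped}"></iframe>'
    )


page = "<html><body><p id='a'>It's a \"graph\"</p></body></html>"
print(wrap_in_iframe(page))
```

Because the escaping is reversible, the embedded page keeps its original quotes once the browser decodes the attribute.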

Display the PyVis-generated HTML with Gradio:

demo = gr.Interface(
    generateGraph,
    inputs=None,
    outputs=gr.outputs.HTML(),
    title="Knowledge Graph",
    allow_flagging='never',
    live=True,
)


demo.launch(
    height=800,
    width="100%"
)

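Final output: we displayed our knowledge graph with the Gradio framework, so the page can also be shared with anyone online through a generated link. Just pass share=True to the launch method, i.e. demo.launch(share=True), to make the app visible to anyone.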

Reference links:

Full code (Colab): https://colab.research.google.com/drive/1OpoLyKAWTVpkhy0VgVduprYypIFTSIrL

IBM on knowledge graphs: https://www.ibm.com/topics/knowledge-graph?source=post_page

Knowledge graph use cases: https://www.wisecube.ai/blog/20-real-world-industrial-applications-of-knowledge-graphs/

LangChain document loaders: https://python.langchain.com/docs/modules/data_connection/document_loaders/

Original article: https://medium.com/@mahimairaja/build-knowledge-graph-from-textdata-using-langchain-under-2min-ce0d0d0e44e8
