
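Preface
If you are building a RAG (retrieval-augmented generation) system and running it in production, you certainly care about how well it performs.
This post covers both a qualitative picture of how the application behaves and the quantitative feedback that guides experiments and parameter choices (LLM, embedding model, chunk size, top K, and so on).
Evaluating a RAG system matters to every team that ships one, because reliable performance metrics are what show whether the project actually works.
This post covers three main topics:
Automatically generating a synthetic test set from the RAG's data
An overview of popular RAG metrics
Computing RAG metrics on the synthetic dataset with the Ragas package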
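I. Synthetic test set
Suppose you have built a RAG system and now want to evaluate it. For that you need an evaluation dataset with the following columns:
question: the question put to the RAG system
ground_truths: the reference answer(s) to the question
answer: the answer predicted by the RAG system
contexts: the list of retrieved passages the RAG system used to generate the answer
The first two columns are ground-truth data; the last two are RAG predictions.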
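As a rough illustration of this layout, here is a minimal sketch of one evaluation record in plain Python. The column names come from the list above; `make_row` is a hypothetical helper, and the answer and context values are placeholders rather than real RAG output.

```python
# Column names taken from the list above.
REQUIRED_COLUMNS = ("question", "ground_truths", "answer", "contexts")


def make_row(question, ground_truths, answer, contexts):
    """Build one evaluation record and check that list-valued columns are lists."""
    if not isinstance(ground_truths, list) or not isinstance(contexts, list):
        raise TypeError("ground_truths and contexts must be lists of strings")
    return {
        "question": question,
        "ground_truths": ground_truths,
        "answer": answer,
        "contexts": contexts,
    }


row = make_row(
    question="How often do bugfix releases occur?",
    ground_truths=["Bugfix releases occur about every 3 months."],
    answer="About every three months.",       # placeholder RAG prediction
    contexts=["...retrieved chunk text..."],  # placeholder retrieved passage
)
print(list(row))
```

Keeping ground_truths and contexts as lists matches the tables shown later in this post, where both columns hold Python lists of strings.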

To create such a dataset, we first generate (question, answer) tuples.
We then run those questions through the RAG system to obtain its predictions.
Generating questions and reference answers (results may vary in practice)
To generate (question, answer) tuples, we first prepare the RAG data: we split it into chunks and embed them into a vector database. We then instruct an LLM to generate num_questions questions on the chosen topic, which yields the (question, answer) tuples.
Generating questions and answers from a given context takes the following steps:
Pick a random chunk and use it as the root context
Retrieve the K most similar contexts from the vector database
Concatenate the text of the root context and its K neighbours to build a larger context
Pass this larger context and num_questions to the following prompt template to generate questions and answers
    """\
    Your task is to formulate exactly {num_questions} questions from given context and provide the answer to each one.

    End each question with a '?' character and then in a newline write the answer to that question using only
    the context provided.
    Separate each question/answer pair by "XXX"
    Each question must start with "question:".
    Each answer must start with "answer:".

    The question must satisfy the rules given below:
    1.The question should make sense to humans even when read without the given context.
    2.The question should be fully answered from the given context.
    3.The question should be framed from a part of context that contains important information. It can also be from tables,code,etc.
    4.The answer to the question should not contain any links.
    5.The question should be of moderate difficulty.
    6.The question must be reasonable and must be understood and responded by humans.
    7.Do no use phrases like 'provided context',etc in the question
    8.Avoid framing question using word "and" that can be decomposed into more than one question.
    9.The question should not contain more than 10 words, make of use of abbreviation wherever possible.

    context: {context}
    """
Repeat these steps num_count times, changing the context each time so that different questions are generated.
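The loop described above can be sketched in plain Python. This is only an illustration under stated assumptions: `word_overlap` is a toy stand-in for embedding similarity in a vector database, the chunks and the shortened template are made up, and `build_context` and `build_prompts` are hypothetical names, not the author's implementation.

```python
import random


def word_overlap(a, b):
    """Toy similarity: Jaccard overlap of lowercase word sets.
    Stand-in for embedding similarity in a vector database."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def build_context(chunks, k, rng):
    """Pick a random root chunk and append its k most similar neighbours."""
    root_idx = rng.randrange(len(chunks))
    root = chunks[root_idx]
    others = [c for i, c in enumerate(chunks) if i != root_idx]
    neighbours = sorted(others, key=lambda c: word_overlap(root, c), reverse=True)[:k]
    return "\n".join([root] + neighbours)


def build_prompts(chunks, template, num_count, num_questions, k=2, seed=0):
    """Repeat the context construction num_count times, one prompt per context."""
    rng = random.Random(seed)
    return [
        template.format(num_questions=num_questions, context=build_context(chunks, k, rng))
        for _ in range(num_count)
    ]


# Made-up chunks and a shortened template, for illustration only.
chunks = [
    "Python lists are mutable sequences.",
    "Python tuples are immutable sequences.",
    "MicroPython targets microcontrollers.",
    "Bugfix releases occur about every 3 months.",
]
template = "Formulate exactly {num_questions} questions from the context.\ncontext: {context}"
prompts = build_prompts(chunks, template, num_count=3, num_questions=2)
print(prompts[0])
```

Each prompt would then be sent to the generator LLM. Seeding the random generator makes the sampled contexts reproducible between runs.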
Following this workflow, here is a sample of the questions and answers I generated.
|    | question | ground_truths |
|---:|:---------|:--------------|
|  8 | What is the difference between lists and tuples in Python? | ['Lists are mutable and cannot be used as dictionary keys, while tuples are immutable and can be used as dictionary keys if all elements are immutable.'] |
|  4 | What is the name of the Python variant optimized for microcontrollers? | ['MicroPython and CircuitPython'] |
| 13 | What is the name of the programming language that Python was designed to replace? | ['ABC programming language'] |
| 17 | How often do bugfix releases occur? | ['Bugfix releases occur about every 3 months.'] |
|  3 | What is the significance of Python's release history? | ['Python 2.0 was released in 2000, while Python 3.0, a major revision with limited backward compatibility, was released in 2008.'] |
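The prompt asks the LLM to separate pairs with "XXX" and to prefix each part with "question:" and "answer:". A minimal sketch of turning such a response into (question, answer) pairs could look like this. `parse_qa_pairs` is a hypothetical helper, not a function from the author's code, and the sample response is made up for illustration.

```python
def parse_qa_pairs(llm_output):
    """Split an LLM response into (question, answer) pairs.

    Blocks that lack either prefix, or whose question does not end
    with '?', are skipped rather than repaired.
    """
    pairs = []
    for block in llm_output.split("XXX"):
        q_start = block.find("question:")
        a_start = block.find("answer:")
        if q_start == -1 or a_start == -1 or a_start < q_start:
            continue  # malformed block
        question = block[q_start + len("question:"):a_start].strip()
        answer = block[a_start + len("answer:"):].strip()
        if question.endswith("?") and answer:
            pairs.append((question, answer))
    return pairs


# Made-up response in the format the prompt requests.
sample_response = """question: How often do bugfix releases occur?
answer: Bugfix releases occur about every 3 months.
XXX
question: What language was Python designed to replace?
answer: ABC programming language
XXX
this block is malformed"""

qa_pairs = parse_qa_pairs(sample_response)
print(qa_pairs)
```

Dropping malformed blocks instead of guessing at their content keeps unreliable rows out of the ground truth.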
Code walkthrough
First, build a vector store containing the data the RAG system uses.
1. Load the data from Wikipedia
    from langchain.document_loaders import WikipediaLoader

    topic = "python programming"
    wikipedia_loader = WikipediaLoader(
        query=topic,
        load_max_docs=1,
        doc_content_chars_max=100000,
    )
    docs = wikipedia_loader.load()
    doc = docs[0]  # only one document is loaded
2. Once the data is loaded, split it into chunks.
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    CHUNK_SIZE = 512
    CHUNK_OVERLAP = 128
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
        separators=[". "],  # split on sentence boundaries
    )
    splits = splitter.split_documents([doc])
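To make the roles of chunk size and chunk overlap concrete, here is a simplified sketch of sentence-based chunking with overlap. It is not LangChain's algorithm: `split_with_overlap` is a hypothetical function that only approximates the idea, and it does not split a single sentence that is longer than the chunk size.

```python
def split_with_overlap(text, chunk_size, chunk_overlap, separator=". "):
    """Greedily pack sentences into chunks of at most chunk_size characters,
    carrying over trailing sentences (up to chunk_overlap characters)
    into the next chunk."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and len(candidate) > chunk_size:
            chunks.append(separator.join(current))
            # Keep only as many trailing sentences as fit in the overlap budget.
            while current and len(separator.join(current)) > chunk_overlap:
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


demo_chunks = split_with_overlap(
    "A one. B two. C three. D four. E five", chunk_size=20, chunk_overlap=8
)
print(demo_chunks)
```

With overlap, consecutive chunks share a sentence, so a fact that straddles a chunk boundary still appears whole in at least one chunk.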
3. Create an index in Pinecone.
    import os

    import pinecone

    pinecone.init(
        api_key=os.environ.get("PINECONE_API_KEY"),
        environment=os.environ.get("PINECONE_ENV"),
    )
    index_name = topic.replace(" ", "-")
    # Start from a clean index: delete any existing index with the same name
    if index_name in pinecone.list_indexes():
        pinecone.delete_index(index_name)
    # The dimension must match the size of the embedding vectors
    pinecone.create_index(index_name, dimension=768)
4. Use the LangChain wrapper to embed the splits and index them.
    from langchain.embeddings import VertexAIEmbeddings
    from langchain.vectorstores import Pinecone

    # Embedding model used to embed the splits; step 5 uses the same model
    embedding_model = VertexAIEmbeddings()
    docsearch = Pinecone.from_documents(
        splits,
        embedding_model,
        index_name=index_name,
    )
5. Generate the synthetic dataset
We initialize a TestsetGenerator object with the LLM, the document splits, the embedding model, and the Pinecone index name.
    from langchain.embeddings import VertexAIEmbeddings
    from langchain.llms import VertexAI
    # testset_generator is not a published package; presumably a project-local module
    from testset_generator import TestsetGenerator

    generator_llm = VertexAI(
        location="europe-west3",
        max_output_tokens=256,
        max_retries=20,
    )
    embedding_model = VertexAIEmbeddings()
    testset_generator = TestsetGenerator(
        generator_llm=generator_llm,
        documents=splits,
        embedding_model=embedding_model,
        index_name=index_name,
        key="text",
    )
6. Call the generate method with two arguments
    synthetic_dataset = testset_generator.generate(
        num_contexts=10,
        num_questions_per_context=2,
    )
7. The generated questions and answers look like this
|    | question | ground_truths |
|---:|:---------|:--------------|
|  8 | What is the difference between lists and tuples in Python? | ['Lists are mutable and cannot be used as dictionary keys, while tuples are immutable and can be used as dictionary keys if all elements are immutable.'] |
|  4 | What is the name of the Python variant optimized for microcontrollers? | ['MicroPython and CircuitPython'] |
| 13 | What is the name of the programming language that Python was designed to replace? | ['ABC programming language'] |
| 17 | How often do bugfix releases occur? | ['Bugfix releases occur about every 3 months.'] |
|  3 | What is the significance of Python's release history? | ['Python 2.0 was released in 2000, while Python 3.0, a major revision with limited backward compatibility, was released in 2008.'] |
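Next, we use the RAG system to predict an answer to each question and to return the list of contexts that support the response.
8. Initialize the RAG
    # Initialize the RAG (rag is presumably a project-local module)
    from rag import RAG

    rag = RAG(
        index_name,
        "text-bison",
        embedding_model,
        "text",
    )
9. Iterate over the synthetic dataset, call predict on each question, and collect the predictions
    rag_answers = []
    contexts = []
    for i, row in synthetic_dataset.iterrows():
        question = row["question"]
        prediction = rag.predict(question)
        rag_answer = prediction["answer"]
        rag_answers.append(rag_answer)
        source_documents = prediction["source_documents"]
        contexts.append([s.page_content for s in source_documents])

    synthetic_dataset_rag = synthetic_dataset.copy()
    synthetic_dataset_rag["answer"] = rag_answers
    # Store the retrieved contexts as well; the result table has a contexts column
    synthetic_dataset_rag["contexts"] = contexts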
10. The final result looks like this
|    | question | ground_truths | answer | contexts |
|---:|:---------|:--------------|:-------|:---------|
|  7 | What are the two types of classes that Python supported before version 3.0? | ['old-style and new-style'] | Before version 3.0, Python had two kinds of classes (both using the same syntax): old-style and new-style. | ['. New instances of classes are constructed by calling the class (for example, SpamClass() or EggsClass()), and the classes are instances of the metaclass type (itself an instance of itself), allowing metaprogramming and reflection.\nBefore version 3.0, Python had two kinds of classes (both using the same syntax): old-style and new-style, current Python versions only support the semantics new style.\nPython supports optio .......... |
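With these steps done, everything is in place to evaluate the RAG system. The next section explains how.
II. Popular RAG metrics
Before turning to the code, let us look at four basic metrics used to evaluate RAG (retrieval-augmented generation) systems. Each metric examines a different aspect, so it is important to consider several of them together to get a complete picture of your application.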
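1. Answer Relevancy: this metric measures how relevant the generated answer is to the question asked. Answers that are incomplete or contain redundant information score lower. It is computed from the question and the answer, and it usually ranges from 0 to 1, with higher scores indicating better relevance.
Example
Question: What are the main features of a healthy diet?
Low-relevancy answer: A healthy diet is very important for overall health.
High-relevancy answer: A healthy diet should include a variety of fruits, vegetables, whole grains, lean meats, and dairy products, providing the nutrients needed for optimal health.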
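One way to compute such a score, and to my understanding the approach the Ragas documentation describes, is to have an LLM regenerate questions from the answer and then average the cosine similarity between the original question's embedding and the regenerated ones. The sketch below covers only the final averaging step. The function names are hypothetical and the vectors are toy values, not real embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def answer_relevancy_score(question_vec, regenerated_question_vecs):
    """Mean cosine similarity between the original question and the
    questions regenerated from the answer."""
    sims = [cosine(question_vec, v) for v in regenerated_question_vecs]
    return sum(sims) / len(sims)


# Toy 2-d "embeddings": one regenerated question matches, one is unrelated.
score = answer_relevancy_score([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(score)
```

An answer that drifts from the question yields regenerated questions that point elsewhere in embedding space, which pulls the average down.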
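2. Faithfulness: this metric checks the factual accuracy of the generated answer with respect to the given context. It is computed by comparing the content of the answer against the retrieved contexts. The score also ranges from 0 to 1, with higher values meaning the answer is more consistent with the context.
Example
Question: What were Marie Curie's main achievements?
Context: Marie Curie (1867-1934) was a pioneering physicist and chemist. She was the first woman to win a Nobel Prize and the only woman to win Nobel Prizes in two different fields.
High-faithfulness answer: Marie Curie won Nobel Prizes in both physics and chemistry, making her the first woman to achieve this.
Low-faithfulness answer: Marie Curie won a Nobel Prize only in physics.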
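A common way to turn this idea into a number is to break the answer into individual claims, judge each claim against the context, and report the supported fraction. The sketch below assumes the claims and verdicts are already available. In practice an LLM would produce them, whereas here they are hand-labeled for the example above, and `faithfulness_score` is a hypothetical helper.

```python
def faithfulness_score(claim_verdicts):
    """Fraction of claims in the answer that the context supports.

    claim_verdicts maps each claim to True (supported) or False.
    """
    if not claim_verdicts:
        return 0.0
    return sum(claim_verdicts.values()) / len(claim_verdicts)


# Hand-labeled verdicts for the two example answers above.
high = faithfulness_score({
    "Marie Curie won Nobel Prizes in two fields": True,
    "She was the first woman to do so": True,
})
low = faithfulness_score({
    "Marie Curie won a Nobel Prize only in physics": False,
})
print(high, low)
```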
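3. Context Precision: this metric evaluates whether the retrieved context items that are relevant to the ground truth are ranked highly. Ideally, all relevant items appear at the top of the ranking. The score again ranges from 0 to 1, with higher scores reflecting better precision.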
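A rank-aware score of this kind can be sketched as the mean of precision@k over the ranks that hold a relevant item, which is how I understand the Ragas documentation to define it. The relevance flags below are hypothetical. In practice an LLM would judge each retrieved chunk against the ground truth.

```python
def context_precision(relevance):
    """relevance: 0/1 flags for the retrieved chunks, in ranked order.

    Averages precision@k over the ranks k that hold a relevant chunk,
    so relevant chunks near the top score higher.
    """
    hits, weighted = 0, 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted += hits / k  # precision@k at this relevant rank
    return weighted / hits if hits else 0.0


print(context_precision([1, 1, 0, 0]))  # relevant chunks ranked first
print(context_precision([0, 0, 1, 1]))  # same chunks ranked last
```

The same two relevant chunks give a perfect score when ranked first and a much lower one when ranked last, which is the ordering effect the metric is meant to capture.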
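4. Answer Correctness: this metric measures how closely the generated answer matches the ground-truth answer. It is computed by comparing the two, and it also usually ranges from 0 to 1, with higher scores indicating closer agreement between the generated answer and the ground truth.
Example:
Ground truth: The Eiffel Tower was completed in 1889 in Paris, France.
High answer correctness: The Eiffel Tower was completed in 1889 in Paris, France.
Low answer correctness: The Eiffel Tower was completed in 1889 and stands in London, England.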
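The configuration code later in this post ties answer_correctness to a faithfulness score and an answer_similarity score, which suggests the metric blends factual agreement with semantic similarity. The sketch below shows one plausible way to combine two such scores. The equal weights and the input values are assumptions for illustration, not values taken from Ragas.

```python
def answer_correctness_score(factual_score, similarity_score, weights=(0.5, 0.5)):
    """Weighted average of a factual-agreement score and a
    semantic-similarity score, both expected in [0, 1]."""
    w_factual, w_similarity = weights
    total = w_factual + w_similarity
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return (w_factual * factual_score + w_similarity * similarity_score) / total


# Hypothetical inputs: fully factual, only partly similar in wording.
print(answer_correctness_score(1.0, 0.5))
```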
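III. Evaluating RAG with Ragas
To evaluate the RAG system and compute these four metrics, we can use the Ragas framework.
Ragas (RAG Assessment) is a framework that helps you evaluate RAG pipelines.
It provides a set of metrics, along with utility functions for generating synthetic datasets.
1. Convert the data
To run Ragas on the dataset, first import the metrics and convert the synthetic DataFrame into a Dataset object.
    from datasets import Dataset
    from ragas.llms import LangchainLLM
    from ragas.metrics import (
        answer_correctness,
        answer_relevancy,
        answer_similarity,
        context_precision,
        context_recall,
        context_relevancy,
        faithfulness,
    )

    synthetic_ds_rag = Dataset.from_pandas(synthetic_dataset_rag)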
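2. Configure Ragas to use the VertexAI LLM and embeddings
This step matters because Ragas is configured to use OpenAI by default.
    metrics = [
        answer_relevancy,
        context_precision,
        faithfulness,
        answer_correctness,
        answer_similarity,
    ]
    # ragas_vertexai_llm and vertexai_embeddings are not defined in this post; they are
    # presumably a Ragas LLM wrapper (see the LangchainLLM import) around the VertexAI LLM
    # and a VertexAIEmbeddings instance.
    for m in metrics:
        m.llm = ragas_vertexai_llm
        if hasattr(m, "embeddings"):
            m.embeddings = vertexai_embeddings

    # answer_correctness relies on these two sub-metrics, so point it at the configured ones
    answer_correctness.faithfulness = faithfulness
    answer_correctness.answer_similarity = answer_similarity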
3. Call the evaluate function on the synthetic dataset and specify the metrics to compute:
    from ragas import evaluate

    results_rag = evaluate(
        synthetic_ds_rag,
        metrics=[
            answer_relevancy,
            context_precision,
            faithfulness,
            answer_correctness,
        ],
    )
4. Print the results
    {'answer_correctness': 0.86875,
     'answer_relevancy': 0.9709101875947284,
     'context_precision': 0.8541666666143055,
     'faithfulness': 0.9375}
5. Convert the results to a DataFrame to see the per-question scores
|    | question | contexts | answer | ground_truths | answer_relevancy | context_precision | faithfulness | answer_correctness |
|---:|:---------|:---------|:-------|:--------------|-----------------:|------------------:|-------------:|-------------------:|
|  1 | What is the difference between lists and tuples in Python? | ... | Lists are mutable, while tuples are immutable. This means that the elements of a list can be changed, while the elements of a tuple cannot. Additionally, tuples can be used as keys in dictionaries, while lists cannot. | ['Lists and tuples are both ordered sequences of elements in Python. However, lists are mutable, meaning their elements can be changed, while tuples are immutable, meaning their elements cannot be changed.'] | 0.987162 | 1 | 1 | 0.75 |
|  5 | What is the name of the Python runtime that uses just-in-time compilation? | ... | The name of the Python runtime that uses just-in-time compilation is Pyston. | ['Pyston'] | 1 | 1 | 1 | 1 |
| 13 | What is the name of the programming language that Python was designed to replace? | ... | The programming language that Python was designed to replace is called ABC. | ['ABC programming language'] | 0.982582 | 0.416667 | 1 | 0.5 |
|  2 | What is Python's approach to type checking? | ... | Python uses a combination of dynamic typing and duck typing, with optional static type checking available through the use of type annotations and the mypy type checker. | ["Python follows dynamic typing, where type constraints are not checked at compile time but may result in operational failures if an object's type is unsuitable."] | 0.890692 | 0.916667 | 1 | 0.666667 |
| 12 | Which programming language has been the most popular since 2003? | ... | According to the TIOBE Programming Community Index, Python has consistently ranked in the top ten most popular programming languages since 2003 and as of December 2022, it was the most popular language. | ['Python'] | 0.899585 | 0.5 | 1 | 0.75 |
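Per-question scores are most useful for finding the rows that drag a metric down. The sketch below copies the five rows of scores from the table above into a plain dictionary and lists the questions that fall under a threshold. `rows_below` is a hypothetical helper, and the thresholds are arbitrary choices for illustration.

```python
# Scores copied from the table above, keyed by row index.
scores = {
    1: {"answer_relevancy": 0.987162, "context_precision": 1.0,
        "faithfulness": 1.0, "answer_correctness": 0.75},
    5: {"answer_relevancy": 1.0, "context_precision": 1.0,
        "faithfulness": 1.0, "answer_correctness": 1.0},
    13: {"answer_relevancy": 0.982582, "context_precision": 0.416667,
         "faithfulness": 1.0, "answer_correctness": 0.5},
    2: {"answer_relevancy": 0.890692, "context_precision": 0.916667,
        "faithfulness": 1.0, "answer_correctness": 0.666667},
    12: {"answer_relevancy": 0.899585, "context_precision": 0.5,
         "faithfulness": 1.0, "answer_correctness": 0.75},
}


def rows_below(scores, metric, threshold):
    """Return the row indices whose score for `metric` is below `threshold`."""
    return [idx for idx, row in scores.items() if row[metric] < threshold]


print(rows_below(scores, "context_precision", 0.6))
print(rows_below(scores, "answer_correctness", 0.7))
```

In this sample, low context precision points at retrieval (relevant chunks ranked too low), while low answer correctness with perfect faithfulness suggests the answer is grounded in the context but worded differently from, or narrower than, the reference answer.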
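Summary
Generating a synthetic dataset to evaluate a RAG system is a good solution, especially when you have no access to labeled data.
However, this solution has its own problems.
Some of the generated answers:
may lack diversity
are redundant
are simple paraphrases of the source text, whereas realistic questions that require reasoning call for more complex content
may be too generic (especially in highly technical domains)
To address these problems, you can adjust the prompt, filter out irrelevant questions, create synthetic questions on specific topics, and use Ragas to generate the dataset.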
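As one example of such filtering, redundant questions can be removed with a simple string-similarity check from the standard library. This is a sketch under stated assumptions: `drop_near_duplicates` is a hypothetical helper, the 0.85 threshold is an arbitrary starting point, and character-level similarity will not catch paraphrases that use different wording.

```python
from difflib import SequenceMatcher


def drop_near_duplicates(questions, threshold=0.85):
    """Keep a question only if it is not too similar to one already kept."""
    kept = []
    for question in questions:
        normalized = question.lower().strip()
        is_duplicate = any(
            SequenceMatcher(None, normalized, k.lower().strip()).ratio() >= threshold
            for k in kept
        )
        if not is_duplicate:
            kept.append(question)
    return kept


# Made-up questions for illustration.
questions = [
    "How often do bugfix releases occur?",
    "How often do bugfix releases occur ?",
    "What is MicroPython?",
]
unique_questions = drop_near_duplicates(questions)
print(unique_questions)
```

Comparing embeddings instead of characters would also catch paraphrased duplicates, at the cost of an extra embedding call per question.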
References:
Project: https://github.com/mcks2000/llm_notebooks/blob/main/notebooks/Evaluate_RAG_on_Synthetic_Data.ipynb
Ragas source code: https://github.com/explodinggradients/ragas
Ragas documentation: https://docs.ragas.io/en/latest/concepts/metrics/index.html




