The rapid evolution of generative AI models such as OpenAI's ChatGPT has revolutionized natural language processing, enabling these systems to produce coherent, contextually relevant responses. Yet even the most advanced models face limitations when handling domain-specific queries or providing highly accurate information. This often leads to challenges such as hallucination, where the model produces inaccurate or fabricated details.
Retrieval-Augmented Generation (RAG) is an innovative framework designed to bridge this gap. By seamlessly integrating external data sources, RAG enables generative models to retrieve up-to-date, niche information, significantly improving their accuracy and reliability.
In this article, we will dive into the mechanics of RAG, explore its architecture, and discuss the limitations of traditional generative models that inspired its creation. We will also highlight practical implementations, advanced techniques, and evaluation methods, showing how RAG is transforming the way AI interacts with specialized data.
Retrieval-Augmented Generation (RAG) is an advanced framework that enhances the capabilities of generative AI models by integrating real-time retrieval of external data. While generative models excel at producing coherent, human-like text, they can falter when asked for accurate, up-to-date, or domain-specific information. This is where RAG steps in, ensuring that responses are not only creative but also grounded in reliable, relevant sources.
RAG operates by connecting a generative model to a retrieval mechanism, typically backed by a vector database or search system. When a query arrives, the retrieval component searches large external datasets for relevant information. The generative model then synthesizes this data to produce output that is both accurate and contextually insightful.
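To make this retrieve-then-generate flow concrete, here is a toy sketch. Keyword overlap stands in for embedding similarity, and the generation step is a placeholder where an LLM would normally be called; none of this is a specific library API.

# Toy retrieve-then-generate loop: real systems use an embedding model,
# a vector index, and an LLM instead of these stand-ins.
documents = [
    "RAG combines a retriever with a generator.",
    "FAISS is a library for efficient similarity search.",
    "Streamlit makes it easy to build data apps in Python.",
]

def retrieve(query, docs, k=2):
    # Rank documents by how many words they share with the query.
    def score(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(docs, key=score, reverse=True)[:k]

def generate(query, context):
    # Placeholder for an LLM call: show the prompt a model would receive.
    return f"Answer '{query}' using:\n" + "\n".join(context)

context = retrieve("What does RAG combine?", documents)
print(generate("What does RAG combine?", context))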
By addressing key challenges such as hallucination and limited domain knowledge, RAG unlocks the potential of generative models to excel in specialized domains. Its applications span many industries, from automated customer support that delivers precise answers to research tools that put curated knowledge at researchers' fingertips. RAG represents a significant step forward in making AI systems more intelligent, trustworthy, and useful in real-world scenarios.
A clear understanding of the RAG architecture is essential to unlocking its full potential and benefits. At its core, the framework is built on two main components, the retriever and the generator, which work together in a seamless information-processing flow.
The overall process is illustrated in the figure below:
Source: https://weaviate.io/blog/introduction-to-rag
All the stages and essential components of the RAG pipeline are shown in the figure below.
Source: https://www.griddynamics.com/blog/retrieval-augmented-generation-llm
Splitting documents into smaller chunks may seem simple, but it requires careful handling of semantics to avoid cutting sentences apart inappropriately, which can hurt downstream steps such as question answering. A naive fixed-size chunking approach can leave each chunk with incomplete information. Most document segmentation algorithms therefore work with a chunk size and an overlap, where the chunk size is measured in characters, words, or tokens, and the overlap shares text between adjacent chunks to ensure continuity. This strategy preserves semantic context across chunks.
Source: https://www.griddynamics.com/blog/retrieval-augmented-generation-llm
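As a minimal sketch of the chunk-size-plus-overlap idea, the sliding-window splitter below works on characters and uses 1,000-character chunks with 200 characters of overlap, the same values the tutorial passes to CharacterTextSplitter later on. It is only an illustration; the real implementation in this article relies on LangChain's splitter.

def split_with_overlap(text, chunk_size=1000, overlap=200):
    """Split text into fixed-size character chunks that share
    `overlap` characters with their neighbours."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

# Example: a 2,500-character text yields chunks starting at 0, 800, 1600 and 2400.
print([len(c) for c in split_with_overlap("a" * 2500)])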
Some of the important vector databases are:
Source: https://www.griddynamics.com/blog/retrieval-augmented-generation-llm
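To give a concrete feel for what a vector database does, here is a minimal sketch using FAISS, the library this tutorial builds on later. The dimensionality and the random vectors are placeholders; in practice the vectors would come from an embedding model.

import numpy as np
import faiss

dim = 8                                                 # embedding dimensionality (placeholder)
docs = np.random.random((100, dim)).astype("float32")  # pretend document embeddings
query = np.random.random((1, dim)).astype("float32")   # pretend query embedding

index = faiss.IndexFlatL2(dim)        # exact L2 (Euclidean) nearest-neighbour index
index.add(docs)                       # store the document vectors
distances, ids = index.search(query, 3)  # retrieve the 3 most similar documents
print(ids[0], distances[0])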
RAG (Retrieval-Augmented Generation) and fine-tuning are two key approaches to extending the capabilities of LLMs, and each suits different scenarios. Fine-tuning involves retraining an LLM on domain-specific data to perform specialized tasks, which makes it ideal for static, narrow use cases such as brand-voice or creative writing that demand a particular tone or style. However, it is costly, time-consuming, and poorly suited to dynamic, frequently updated data.
RAG, on the other hand, enhances LLMs by retrieving external data dynamically without modifying the model's weights, which makes it cost-effective and well suited to real-time, data-driven settings such as legal, financial, or customer-service applications. RAG lets LLMs work over large, unstructured internal document corpora, a significant advantage over traditional approaches to navigating messy data repositories.
Fine-tuning excels at producing nuanced, consistent output, while RAG delivers up-to-date, accurate information by drawing on external knowledge bases. In practice, RAG is usually the preferred choice for applications that need real-time, adaptive responses, especially in enterprises managing large volumes of unstructured data.
There are several types of Retrieval-Augmented Generation (RAG) approaches, each tailored to specific use cases and goals. The main types include:
Source: https://x.com/weaviate_io/status/1866528335884325070
The Retrieval-Augmented Generation (RAG) framework has a wide range of applications across industries thanks to its ability to dynamically integrate external knowledge into generative language models. Some prominent applications include:
In this section, we will develop a Streamlit application that can understand the content of a PDF and respond to user queries about that content using Retrieval-Augmented Generation (RAG). The implementation uses the LangChain platform to handle interactions with the LLM and the vector store. We will build a FAISS vector store using OpenAI's LLM and its embedding model to enable efficient information retrieval.
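Judging from the imports and the run command used later, the project boils down to three files; the layout below is an assumed sketch rather than a prescribed structure.

pdf-chat/
├── app.py            # Streamlit application (the code walked through in this tutorial)
├── htmlTemplates.py  # css, bot_template and user_template for the chat UI
└── .env              # OPENAI_API_KEY and model names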
Create and activate a virtual environment, then install the required packages:

python -m venv venv
source venv/bin/activate   # for ubuntu
venv/Scripts/activate      # for windows

pip install langchain langchain_community openai faiss-cpu PyPDF2 streamlit python-dotenv tiktoken
Store your OpenAI API key and model names in a .env file:

OPENAI_API_KEY=sk-proj-xcQxBf5LslO62At...
OPENAI_MODEL_NAME=gpt-3.5-turbo
OPENAI_EMBEDDING_MODEL_NAME=text-embedding-3-small

Then load them at startup with python-dotenv:

from dotenv import load_dotenv
import os

load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
OPENAI_MODEL_NAME = os.getenv("OPENAI_MODEL_NAME")
OPENAI_EMBEDDING_MODEL_NAME = os.getenv("OPENAI_EMBEDDING_MODEL_NAME")
Import the core libraries for building the application and handling PDFs, such as langchain, streamlit, and PyPDF2.
import streamlit as st
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from langchain.prompts import PromptTemplate
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
from langchain_community.chat_models import ChatOpenAI
from htmlTemplates import bot_template, user_template, css
Extract the raw text from the uploaded PDF files using PyPDF2:

def get_pdf_text(pdf_files):
    text = ""
    for pdf_file in pdf_files:
        reader = PdfReader(pdf_file)
        for page in reader.pages:
            text += page.extract_text()
    return text
Split the large text into smaller, manageable chunks using LangChain's CharacterTextSplitter.
def get_chunk_text(text):
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len
    )
    chunks = text_splitter.split_text(text)
    return chunks
Generate embeddings for the text chunks and store them in a vector database using FAISS.
def get_vector_store(text_chunks):
    embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY, model=OPENAI_EMBEDDING_MODEL_NAME)
    vectorstore = FAISS.from_texts(texts=text_chunks, embedding=embeddings)
    return vectorstore
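As an optional sanity check that is not part of the final app, you can query the store directly with LangChain's similarity_search; the sample question below is purely illustrative.

vector_store = get_vector_store(text_chunks)
docs = vector_store.similarity_search("What is this document about?", k=3)
for doc in docs:
    print(doc.page_content[:100])  # preview the first 100 characters of each retrieved chunk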
Define a chain that retrieves information from the vector store and interacts with the user through the LLM.
def get_conversation_chain(vector_store):
    llm = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model_name=OPENAI_MODEL_NAME, temperature=0)
    memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)

    system_template = """
    Use the following pieces of context and chat history to answer the question at the end.
    If you don't know the answer, just say that you don't know, don't try to make up an answer.

    Context: {context}

    Chat history: {chat_history}

    Question: {question}
    Helpful Answer:
    """

    prompt = PromptTemplate(
        template=system_template,
        input_variables=["context", "question", "chat_history"],
    )

    conversation_chain = ConversationalRetrievalChain.from_llm(
        verbose=True,
        llm=llm,
        retriever=vector_store.as_retriever(),
        memory=memory,
        combine_docs_chain_kwargs={"prompt": prompt}
    )
    return conversation_chain
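Outside of Streamlit, for instance in a quick script, the chain can be called with a question and the answer read from the returned dictionary; the question text here is just an example.

chain = get_conversation_chain(vector_store)
result = chain({'question': 'Summarize the document in two sentences.'})
print(result['answer'])        # the model's response
print(result['chat_history'])  # the accumulated conversation messages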
Handle user input, pass it to the conversation chain, and update the chat history.
def handle_user_input(question):
    try:
        response = st.session_state.conversation({'question': question})
        st.session_state.chat_history = response['chat_history']
    except Exception as e:
        st.error('Please select PDF and click on Process.')
To create a custom chat interface for the user and bot messages, design custom HTML templates and style them with CSS.
The htmlTemplates.py file holds the styles and the message templates. The bot and user templates below are kept minimal (the avatar image markup is left out) but they preserve the {{MSG}} placeholder and the CSS classes that the display code relies on:

css = '''
<style>
.chat-message {
    padding: 1rem; border-radius: 0.5rem; margin-bottom: 1rem; display: flex
}
.chat-message.user {
    background-color: #2b313e
}
.chat-message.bot {
    background-color: #475063
}
.chat-message .avatar {
    width: 10%;
}
.chat-message .avatar img {
    max-width: 30px;
    max-height: 30px;
    border-radius: 50%;
    object-fit: cover;
}
.chat-message .message {
    width: 90%;
    padding: 0 1rem;
    color: #fff;
}
</style>
'''

# Minimal message templates; the avatar markup from the original is omitted here.
bot_template = '''
<div class="chat-message bot">
    <div class="message">{{MSG}}</div>
</div>
'''

user_template = '''
<div class="chat-message user">
    <div class="message">{{MSG}}</div>
</div>
'''

Displaying chat history: show the user and AI conversation history in reverse order, using the HTML templates above for formatting.

def display_chat_history():
    if st.session_state.chat_history:
        reversed_history = st.session_state.chat_history[::-1]

        formatted_history = []
        for i in range(0, len(reversed_history), 2):
            chat_pair = {
                "AIMessage": reversed_history[i].content,
                "HumanMessage": reversed_history[i + 1].content
            }
            formatted_history.append(chat_pair)

        for i, message in enumerate(formatted_history):
            st.write(user_template.replace("{{MSG}}", message['HumanMessage']), unsafe_allow_html=True)
            st.write(bot_template.replace("{{MSG}}", message['AIMessage']), unsafe_allow_html=True)
Set up the application's main interface for file upload, question input, and chat-history display.
def main():
    st.set_page_config(page_title='Chat with PDFs', page_icon=':books:')
    st.write(css, unsafe_allow_html=True)

    if "conversation" not in st.session_state:
        st.session_state.conversation = None
    if "chat_history" not in st.session_state:
        st.session_state.chat_history = None

    st.header('Chat with PDFs :books:')

    question = st.text_input("Ask anything to your PDF:")
    if question:
        handle_user_input(question)

    if st.session_state.chat_history is not None:
        display_chat_history()

    with st.sidebar:
        st.subheader("Upload your Documents Here: ")
        pdf_files = st.file_uploader("Choose your PDF Files and Press Process button", type=['pdf'], accept_multiple_files=True)

        if pdf_files and st.button("Process"):
            with st.spinner("Processing your PDFs..."):
                try:
                    # Get PDF Text
                    raw_text = get_pdf_text(pdf_files)

                    # Get Text Chunks
                    text_chunks = get_chunk_text(raw_text)

                    # Create Vector Store
                    vector_store = get_vector_store(text_chunks)
                    st.success("Your PDFs have been processed successfully. You can ask questions now.")

                    # Create conversation chain
                    st.session_state.conversation = get_conversation_chain(vector_store)
                except Exception as e:
                    st.error(f"An error occurred: {e}")


if __name__ == '__main__':
    main()
Below is the complete code for the PDF chat application. It brings together the environment-variable setup, text extraction, vector store, and RAG functionality into a single streamlined solution:
from dotenv import load_dotenv
import os

load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
OPENAI_MODEL_NAME = os.getenv("OPENAI_MODEL_NAME")
OPENAI_EMBEDDING_MODEL_NAME = os.getenv("OPENAI_EMBEDDING_MODEL_NAME")

import streamlit as st
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from langchain.prompts import PromptTemplate
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
from langchain_community.chat_models import ChatOpenAI
from htmlTemplates import bot_template, user_template, css


def get_pdf_text(pdf_files):
    text = ""
    for pdf_file in pdf_files:
        reader = PdfReader(pdf_file)
        for page in reader.pages:
            text += page.extract_text()
    return text


def get_chunk_text(text):
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len
    )
    chunks = text_splitter.split_text(text)
    return chunks


def get_vector_store(text_chunks):
    embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY, model=OPENAI_EMBEDDING_MODEL_NAME)
    vectorstore = FAISS.from_texts(texts=text_chunks, embedding=embeddings)
    return vectorstore


def get_conversation_chain(vector_store):
    llm = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model_name=OPENAI_MODEL_NAME, temperature=0)
    memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)

    system_template = """
    Use the following pieces of context and chat history to answer the question at the end.
    If you don't know the answer, just say that you don't know, don't try to make up an answer.

    Context: {context}

    Chat history: {chat_history}

    Question: {question}
    Helpful Answer:
    """

    prompt = PromptTemplate(
        template=system_template,
        input_variables=["context", "question", "chat_history"],
    )

    conversation_chain = ConversationalRetrievalChain.from_llm(
        verbose=True,
        llm=llm,
        retriever=vector_store.as_retriever(),
        memory=memory,
        combine_docs_chain_kwargs={"prompt": prompt}
    )
    return conversation_chain


def handle_user_input(question):
    try:
        response = st.session_state.conversation({'question': question})
        st.session_state.chat_history = response['chat_history']
    except Exception as e:
        st.error('Please select PDF and click on Process.')


def display_chat_history():
    if st.session_state.chat_history:
        reversed_history = st.session_state.chat_history[::-1]

        formatted_history = []
        for i in range(0, len(reversed_history), 2):
            chat_pair = {
                "AIMessage": reversed_history[i].content,
                "HumanMessage": reversed_history[i + 1].content
            }
            formatted_history.append(chat_pair)

        for i, message in enumerate(formatted_history):
            st.write(user_template.replace("{{MSG}}", message['HumanMessage']), unsafe_allow_html=True)
            st.write(bot_template.replace("{{MSG}}", message['AIMessage']), unsafe_allow_html=True)


def main():
    st.set_page_config(page_title='Chat with PDFs', page_icon=':books:')
    st.write(css, unsafe_allow_html=True)

    if "conversation" not in st.session_state:
        st.session_state.conversation = None
    if "chat_history" not in st.session_state:
        st.session_state.chat_history = None

    st.header('Chat with PDFs :books:')

    question = st.text_input("Ask anything to your PDF:")
    if question:
        handle_user_input(question)

    if st.session_state.chat_history is not None:
        display_chat_history()

    with st.sidebar:
        st.subheader("Upload your Documents Here: ")
        pdf_files = st.file_uploader("Choose your PDF Files and Press Process button", type=['pdf'], accept_multiple_files=True)

        if pdf_files and st.button("Process"):
            with st.spinner("Processing your PDFs..."):
                try:
                    # Get PDF Text
                    raw_text = get_pdf_text(pdf_files)

                    # Get Text Chunks
                    text_chunks = get_chunk_text(raw_text)

                    # Create Vector Store
                    vector_store = get_vector_store(text_chunks)
                    st.success("Your PDFs have been processed successfully. You can ask questions now.")

                    # Create conversation chain
                    st.session_state.conversation = get_conversation_chain(vector_store)
                except Exception as e:
                    st.error(f"An error occurred: {e}")


if __name__ == '__main__':
    main()
Run the application with Streamlit using the following command.
streamlit run app.py
You will get the output as shown below.
Thank you for reading this article!!

Thanks to Gowri M Bhatt for reviewing the content.

If you enjoyed this article, please click the heart button ♥ and share it to help others find it!

The full source code for this tutorial can be found here:
codemaker2015/pdf-chat-using-RAG | github.com