How to Get Responses From Your Documents Using ChatGPT and LangChain
In a previous article, I explained a few approaches you can use to train a chat model on your own data. In this article, we'll use one of those approaches, embeddings, to get responses from ChatGPT based on our documents. Let's get started!
I'm using a text file containing the biography of a fictional character to create the embeddings. We'll use this file to generate answers from ChatGPT. Here's a snippet of the text file:
Early Life and Education:
Born on March 15th, 1985, in a small town named Greenridge,
John Anderson displayed an inquisitive nature from an early age.
Growing up in a supportive and nurturing family environment,
John was encouraged to pursue his interests and cultivate his talents.
Import the required packages
Let's start by importing the required packages. Note that if you want to use an Azure OpenAI endpoint, use the AzureOpenAI package; otherwise, use openai. Read more here. I'm using ChromaDB as the vector database to store the embeddings.
#Import required packages
#If you have an Azure OpenAI endpoint, use AzureOpenAI
#else use openai:
#import openai
from langchain.llms import AzureOpenAI
#This will help us create embeddings
from langchain.embeddings.openai import OpenAIEmbeddings
#Using ChromaDB as a vector store for the embeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.document_loaders import DirectoryLoader
import os
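If any of these imports fail, install the dependencies first. A rough install line that should cover this walkthrough (exact package requirements may vary with your LangChain version; unstructured is what the default .txt loader relies on):

pip install langchain openai chromadb tiktoken unstructured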
Set the API key and endpoint
Replace "OPENAI_API_KEY" with your secret API key. Read here for how to get a secret key. Replace "OPENAI_API_BASE" with your endpoint name. Read here for how to find these details.
os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["OPENAI_API_VERSION"] = "2022-12-01"
#Set your API endpoint (API BASE) here if you are using Azure OpenAI
#If you are using the standard openai endpoint then you do not need to set this.
os.environ["OPENAI_API_BASE"] = "OPENAI_API_BASE"
os.environ["OPENAI_API_KEY"] = "OPENAI_API_KEY"
Load the documents
Now we load the documents from a directory. The code below reads all the .txt files in the "docs" directory. If you want to read other file types, read the documentation here.
After loading the text from all the documents, we split it into smaller chunks to create the embeddings.
#Load all the .txt files from docs directory
loader = DirectoryLoader('./docs/', glob="**/*.txt")
docs = loader.load()
#Split the text into smaller chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(docs)
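The same pattern works for other file types. For example, here is a minimal sketch for loading PDFs instead, assuming the pypdf package is installed (PyPDFLoader and the loader_cls argument come from the same langchain modules used above):

#Sketch: load PDF files instead of .txt (assumes pypdf is installed)
from langchain.document_loaders import PyPDFLoader
pdf_loader = DirectoryLoader('./docs/', glob="**/*.pdf", loader_cls=PyPDFLoader)
pdf_docs = pdf_loader.load()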
Create the embeddings
Now that the documents are ingested, let's create the embeddings with OpenAIEmbeddings. If you're using Azure OpenAI, provide your deployment name; otherwise, omit this parameter. Read more about deployment names here.
Once the embeddings are created, we can store them in the Chroma vector database. They will be persisted to a "chromadb" directory created inside your working directory.
#Turn the text into embeddings
embeddings = OpenAIEmbeddings(deployment="NAME_OF_YOUR_MODEL_DEPLOYMENT", chunk_size=1) #This model should be able to generate embeddings. For example, text-embedding-ada-002
#Store the embeddings into chromadb directory
docsearch = Chroma.from_documents(documents=texts, embedding=embeddings, persist_directory="./chromadb")
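Depending on your ChromaDB version, you may need to explicitly flush the store to disk. A minimal sketch for persisting it and loading it back in a later session, so you don't have to re-embed the documents:

#Sketch: persist the vector store and reload it later without re-embedding
docsearch.persist() #May be unnecessary on newer Chroma versions
docsearch = Chroma(persist_directory="./chromadb", embedding_function=embeddings)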
Ask questions!
We've reached the coolest part of the code. We can now ask ChatGPT any question and, through these embeddings, get a response based on our own data.
#Use AzureOpenAI if you're using an Azure OpenAI endpoint
#This can be any QnA model. For example, davinci.
#Remember to provide the deployment name here, not the model name
llm = AzureOpenAI(deployment_name="NAME_OF_YOUR_MODEL_DEPLOYMENT")
#Use ChatOpenAI instead if you're not using an Azure OpenAI endpoint
#from langchain.chat_models import ChatOpenAI
#llm = ChatOpenAI(temperature=0.7, model_name='MODEL_NAME')
qa = RetrievalQA.from_chain_type(llm=llm,
chain_type="stuff",
retriever=docsearch.as_retriever(),
return_source_documents=False
)
query = "Where was John born?"
qa.run(query)
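If you also want to see which chunks of your documents the answer was grounded in, a small variation is to set return_source_documents=True and call the chain with a dict instead of qa.run (qa.run only supports chains with a single output):

#Sketch: also return the retrieved source chunks alongside the answer
qa = RetrievalQA.from_chain_type(llm=llm,
                                 chain_type="stuff",
                                 retriever=docsearch.as_retriever(),
                                 return_source_documents=True)
result = qa({"query": "Where was John born?"})
print(result["result"]) #The generated answer
print(result["source_documents"]) #The chunks retrieved from ChromaDB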
Putting it all together
Let's put everything together. You can find the complete notebook here.
#Import required packages
#If you have an Azure OpenAI endpoint, use AzureOpenAI
#else use openai:
#import openai
from langchain.llms import AzureOpenAI
#This will help us create embeddings
from langchain.embeddings.openai import OpenAIEmbeddings
#Using ChromaDB as a vector store for the embeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.document_loaders import DirectoryLoader
import os
os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["OPENAI_API_VERSION"] = "2022-12-01"
#Set your API endpoint (API BASE) here if you are using Azure OpenAI
#If you are using the standard openai endpoint then you do not need to set this.
os.environ["OPENAI_API_BASE"] = "OPENAI_API_BASE"
#Set your OPENAI API KEY here
os.environ["OPENAI_API_KEY"] = "OPENAI_API_KEY"
#Load all the .txt files from docs directory
loader = DirectoryLoader('./docs/', glob="**/*.txt")
docs = loader.load()
#Split the text into smaller chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(docs)
#Turn the text into embeddings
embeddings = OpenAIEmbeddings(deployment="NAME_OF_YOUR_MODEL_DEPLOYMENT", chunk_size=1) #This model should be able to generate embeddings. For example, text-embedding-ada-002
#Store the embeddings into chromadb directory
docsearch = Chroma.from_documents(documents=texts, embedding=embeddings, persist_directory="./chromadb")
#Use AzureOpenAI if you're using an Azure OpenAI endpoint
llm = AzureOpenAI(deployment_name="NAME_OF_YOUR_MODEL_DEPLOYMENT") #This can be any QnA model. For example, davinci.
#Use ChatOpenAI instead if you're not using an Azure OpenAI endpoint
#from langchain.chat_models import ChatOpenAI
#llm = ChatOpenAI(temperature=0.7, model_name='MODEL_NAME')
qa = RetrievalQA.from_chain_type(llm=llm,
chain_type="stuff",
retriever=docsearch.as_retriever(),
return_source_documents=False
)
query = "Where was John born?"
qa.run(query)