如何使用 LangChain 框架、IRIS 向量搜尋和 LLM 根據使用者提示產生 IRIS 相容 SQL 的實驗。
本文以此筆記本為基礎。您可以在 OpenExchange 中使用此應用程式在隨時可用的環境中運行它。
首先,我們需要安裝必要的函式庫:
接下來,我們導入所需的模組並設定環境:
我們將使用 SQLiteCache 來快取 LLM 呼叫:
設定IRIS資料庫連線參數:
如果環境中尚未設定 OpenAI API 金鑰,則提示使用者輸入:
為 IRIS 資料庫建立連接字串:
建立與 IRIS 資料庫的連線:
準備一個字典來保存系統提示的上下文資訊:
為了將使用者輸入轉換為與 IRIS 資料庫相容的 SQL 查詢,我們需要為語言模型建立有效的提示。我們從初始提示開始,它提供了產生 SQL 查詢的基本說明。此範本源自LangChain預設的MSSQL提示,並針對IRIS資料庫進行了客製化。
此基本提示將語言模型 (LLM) 配置為充當 SQL 專家,並為 IRIS 資料庫提供特定指導。接下來,我們提供一個輔助提示,其中包含有關資料庫架構的信息,以避免出現幻覺。
為了提高法學碩士回答的準確性,我們使用了一種稱為「少樣本提示」的技術。這涉及向法學碩士展示一些例子。
我們為少量範例定義範本:
我們使用少樣本模板建立使用者提示:
最後,我們組合所有提示來創建最終的提示:
此提示需要變數 example_value、input、table_info 和 top_k。
提示的結構如下:
為了視覺化提示將如何傳送到 LLM,我們可以使用所需變數的佔位符值:
現在,我們準備好透過提供必要的變數將此提示傳送給 LLM。準備好後,讓我們繼續下一步。
為了建立準確的 SQL 查詢,我們需要向語言模型 (LLM) 提供有關資料庫表的詳細資訊。如果沒有這些信息,法學碩士可能會產生看似合理但由於幻覺而不正確的查詢。因此,我們的第一步是建立一個從 IRIS 資料庫檢索表格定義的函數。
以下函數查詢 INFORMATION_SCHEMA 以取得指定模式的表定義。如果提供了特定的表,它將檢索該表的定義;否則,它將檢索架構中所有表的定義。
在此範例中,我們使用 Aviation 模式,可在此處取得。
此函數傳回 Aviation 模式中所有資料表的 CREATE TABLE 語句:
有了這些表定義,我們就可以繼續下一步,即將它們整合到我們的 LLM 提示中。這確保了 LLM 在產生 SQL 查詢時擁有有關資料庫架構的準確且全面的資訊。
使用資料庫時,尤其是較大的資料庫時,在提示中傳送所有資料表的資料定義語言 (DDL) 可能不切實際。雖然這種方法可能適用於小型資料庫,但現實世界的資料庫通常包含數百或數千個表,導致處理所有這些表的效率很低。
此外,語言模型不太可能需要了解資料庫中的每個表才能有效產生 SQL 查詢。為了應對這項挑戰,我們可以利用語義搜尋功能根據使用者的查詢僅選擇最相關的表格。
我們透過使用語義搜尋和 IRIS 向量搜尋來實現這一目標。請注意,如果 SQL 元素標識符(例如表、欄位和鍵)具有有意義的名稱,則此方法最有效。如果您的識別碼是任意程式碼,請考慮使用資料字典。
首先,將表定義提取到 pandas DataFrame 中:
# Retrieve table definitions into a pandas DataFrame table_def = get_table_definitions_array(cnx=cnx, schema='Aviation') table_df = pd.DataFrame(data=table_def, columns=["col_def"]) table_df["id"] = table_df.index + 1 table_df
The DataFrame (table_df) will look something like this:
col_def | id | |
---|---|---|
0 | CREATE TABLE Aviation.Aircraft (\n Event bigi... | 1 |
1 | CREATE TABLE Aviation.Crew (\n Aircraft varch... | 2 |
2 | CREATE TABLE Aviation.Event (\n ID bigint NOT... | 3 |
Next, split the table definitions into Langchain Documents. This step is crucial for handling large chunks of text and extracting text embeddings:
loader = DataFrameLoader(table_df, page_content_column="col_def") documents = loader.load() text_splitter = CharacterTextSplitter(chunk_size=400, chunk_overlap=20, separator="\n") tables_docs = text_splitter.split_documents(documents) tables_docs
The resulting tables_docs list contains split documents with metadata, like so:
[Document(metadata={'id': 1}, page_content='CREATE TABLE Aviation.Aircraft (\n Event bigint NOT NULL,\n ID varchar NOT NULL,\n ...'), Document(metadata={'id': 2}, page_content='CREATE TABLE Aviation.Crew (\n Aircraft varchar NOT NULL,\n ID varchar NOT NULL,\n ...'), Document(metadata={'id': 3}, page_content='CREATE TABLE Aviation.Event (\n ID bigint NOT NULL DEFAULT $i(^Aviation.EventD),\n ...')]
Now, use the IRISVector class from langchain-iris to extract embedding vectors and store them:
tables_vector_store = IRISVector.from_documents( embedding=OpenAIEmbeddings(), documents=tables_docs, connection_string=iris_conn_str, collection_name="sql_tables", pre_delete_collection=True )
Note: The pre_delete_collection flag is set to True for demonstration purposes to ensure a fresh collection in each test run. In a production environment, this flag should generally be set to False.
With the table embeddings stored, you can now query for relevant tables based on user input:
input_query = "List the first 2 manufacturers" relevant_tables_docs = tables_vector_store.similarity_search(input_query, k=3) relevant_tables_docs
For example, querying for manufacturers might return:
[Document(metadata={'id': 1}, page_content='GearType varchar,\n LastInspectionDate timestamp,\n ...'), Document(metadata={'id': 1}, page_content='AircraftModel varchar,\n AircraftRegistrationClass varchar,\n ...'), Document(metadata={'id': 3}, page_content='LocationSiteZipCode varchar,\n LocationState varchar,\n ...')]
From the metadata, you can see that only table ID 1 (Aviation.Aircraft) is relevant, which aligns with the query.
While this approach is generally effective, it may not always be perfect. For instance, querying for crash sites might also return less relevant tables:
input_query = "List the top 10 most crash sites" relevant_tables_docs = tables_vector_store.similarity_search(input_query, k=3) relevant_tables_docs
Results might include:
[Document(metadata={'id': 3}, page_content='LocationSiteZipCode varchar,\n LocationState varchar,\n ...'), Document(metadata={'id': 3}, page_content='InjuriesGroundSerious integer,\n InjuriesHighest varchar,\n ...'), Document(metadata={'id': 1}, page_content='CREATE TABLE Aviation.Aircraft (\n Event bigint NOT NULL,\n ID varchar NOT NULL,\n ...')]
Despite retrieving the correct Aviation.Event table twice, the Aviation.Aircraft table may also appear, which could be improved with additional filtering or thresholding. This is beyond the scope of this example and will be left for future implementations.
To automate this process, define a function to filter and return the relevant tables based on user input:
def get_relevant_tables(user_input, tables_vector_store, table_df): relevant_tables_docs = tables_vector_store.similarity_search(user_input) relevant_tables_docs_indices = [x.metadata["id"] for x in relevant_tables_docs] indices = table_df["id"].isin(relevant_tables_docs_indices) relevant_tables_array = [x for x in table_df[indices]["col_def"]] return relevant_tables_array
This function will help in efficiently retrieving only the relevant tables to send to the LLM, reducing the prompt length and improving overall query performance.
When working with language models (LLMs), providing them with relevant examples helps ensure accurate and contextually appropriate responses. These examples, referred to as "few-shot" examples, guide the LLM in understanding the structure and context of the queries it should handle.
In our case, we need to populate the examples_value variable with a diverse set of SQL queries that cover a broad spectrum of IRIS SQL syntax and the tables available in the database. This helps prevent the LLM from generating incorrect or irrelevant queries.
Below is a list of example queries designed to illustrate various SQL operations:
examples = [ {"input": "List all aircrafts.", "query": "SELECT * FROM Aviation.Aircraft"}, {"input": "Find all incidents for the aircraft with ID 'N12345'.", "query": "SELECT * FROM Aviation.Event WHERE EventId IN (SELECT EventId FROM Aviation.Aircraft WHERE ID = 'N12345')"}, {"input": "List all incidents in the 'Commercial' operation type.", "query": "SELECT * FROM Aviation.Event WHERE EventId IN (SELECT EventId FROM Aviation.Aircraft WHERE OperationType = 'Commercial')"}, {"input": "Find the total number of incidents.", "query": "SELECT COUNT(*) FROM Aviation.Event"}, {"input": "List all incidents that occurred in 'Canada'.", "query": "SELECT * FROM Aviation.Event WHERE LocationCountry = 'Canada'"}, {"input": "How many incidents are associated with the aircraft with AircraftKey 5?", "query": "SELECT COUNT(*) FROM Aviation.Aircraft WHERE AircraftKey = 5"}, {"input": "Find the total number of distinct aircrafts involved in incidents.", "query": "SELECT COUNT(DISTINCT AircraftKey) FROM Aviation.Aircraft"}, {"input": "List all incidents that occurred after 5 PM.", "query": "SELECT * FROM Aviation.Event WHERE EventTime > 1700"}, {"input": "Who are the top 5 operators by the number of incidents?", "query": "SELECT TOP 5 OperatorName, COUNT(*) AS IncidentCount FROM Aviation.Aircraft GROUP BY OperatorName ORDER BY IncidentCount DESC"}, {"input": "Which incidents occurred in the year 2020?", "query": "SELECT * FROM Aviation.Event WHERE YEAR(EventDate) = '2020'"}, {"input": "What was the month with most events in the year 2020?", "query": "SELECT TOP 1 MONTH(EventDate) EventMonth, COUNT(*) EventCount FROM Aviation.Event WHERE YEAR(EventDate) = '2020' GROUP BY MONTH(EventDate) ORDER BY EventCount DESC"}, {"input": "How many crew members were involved in incidents?", "query": "SELECT COUNT(*) FROM Aviation.Crew"}, {"input": "List all incidents with detailed aircraft information for incidents that occurred in the year 2012.", "query": "SELECT e.EventId, e.EventDate, a.AircraftManufacturer, a.AircraftModel, a.AircraftCategory FROM Aviation.Event e JOIN Aviation.Aircraft a ON e.EventId = a.EventId WHERE Year(e.EventDate) = 2012"}, {"input": "Find all incidents where there were more than 5 injuries and include the aircraft manufacturer and model.", "query": "SELECT e.EventId, e.InjuriesTotal, a.AircraftManufacturer, a.AircraftModel FROM Aviation.Event e JOIN Aviation.Aircraft a ON e.EventId = a.EventId WHERE e.InjuriesTotal > 5"}, {"input": "List all crew members involved in incidents with serious injuries, along with the incident date and location.", "query": "SELECT c.CrewNumber AS 'Crew Number', c.Age, c.Sex AS Gender, e.EventDate AS 'Event Date', e.LocationCity AS 'Location City', e.LocationState AS 'Location State' FROM Aviation.Crew c JOIN Aviation.Event e ON c.EventId = e.EventId WHERE c.Injury = 'Serious'"} ]
Given the ever-expanding list of examples, it’s impractical to provide the LLM with all of them. Instead, we use IRIS Vector Search along with the SemanticSimilarityExampleSelector class to identify the most relevant examples based on user prompts.
example_selector = SemanticSimilarityExampleSelector.from_examples( examples, OpenAIEmbeddings(), IRISVector, k=5, input_keys=["input"], connection_string=iris_conn_str, collection_name="sql_samples", pre_delete_collection=True )
Note: The pre_delete_collection flag is used here for demonstration purposes to ensure a fresh collection in each test run. In a production environment, this flag should be set to False to avoid unnecessary deletions.
To find the most relevant examples for a given input, use the selector as follows:
input_query = "Find all events in 2010 informing the Event Id and date, location city and state, aircraft manufacturer and model." relevant_examples = example_selector.select_examples({"input": input_query})
The results might look like this:
[{'input': 'List all incidents with detailed aircraft information for incidents that occurred in the year 2012.', 'query': 'SELECT e.EventId, e.EventDate, a.AircraftManufacturer, a.AircraftModel, a.AircraftCategory FROM Aviation.Event e JOIN Aviation.Aircraft a ON e.EventId = a.EventId WHERE Year(e.EventDate) = 2012'}, {'input': "Find all incidents for the aircraft with ID 'N12345'.", 'query': "SELECT * FROM Aviation.Event WHERE EventId IN (SELECT EventId FROM Aviation.Aircraft WHERE ID = 'N12345')"}, {'input': 'Find all incidents where there were more than 5 injuries and include the aircraft manufacturer and model.', 'query': 'SELECT e.EventId, e.InjuriesTotal, a.AircraftManufacturer, a.AircraftModel FROM Aviation.Event e JOIN Aviation.Aircraft a ON e.EventId = a.EventId WHERE e.InjuriesTotal > 5'}, {'input': 'List all aircrafts.', 'query': 'SELECT * FROM Aviation.Aircraft'}, {'input': 'Find the total number of distinct aircrafts involved in incidents.', 'query': 'SELECT COUNT(DISTINCT AircraftKey) FROM Aviation.Aircraft'}]
If you specifically need examples related to quantities, you can query the selector accordingly:
input_query = "What is the number of incidents involving Boeing aircraft." quantity_examples = example_selector.select_examples({"input": input_query})
The output may be:
[{'input': 'How many incidents are associated with the aircraft with AircraftKey 5?', 'query': 'SELECT COUNT(*) FROM Aviation.Aircraft WHERE AircraftKey = 5'}, {'input': 'Find the total number of distinct aircrafts involved in incidents.', 'query': 'SELECT COUNT(DISTINCT AircraftKey) FROM Aviation.Aircraft'}, {'input': 'How many crew members were involved in incidents?', 'query': 'SELECT COUNT(*) FROM Aviation.Crew'}, {'input': 'Find all incidents where there were more than 5 injuries and include the aircraft manufacturer and model.', 'query': 'SELECT e.EventId, e.InjuriesTotal, a.AircraftManufacturer, a.AircraftModel FROM Aviation.Event e JOIN Aviation.Aircraft a ON e.EventId = a.EventId WHERE e.InjuriesTotal > 5'}, {'input': 'List all incidents with detailed aircraft information for incidents that occurred in the year 2012.', 'query': 'SELECT e.EventId, e.EventDate, a.AircraftManufacturer, a.AircraftModel, a.AircraftCategory FROM Aviation.Event e JOIN Aviation.Aircraft a ON e.EventId = a.EventId WHERE Year(e.EventDate) = 2012'}]
This output includes examples that specifically address counting and quantities.
While the SemanticSimilarityExampleSelector is powerful, it’s important to note that not all selected examples may be perfect. Future improvements may involve adding filters or thresholds to exclude less relevant results, ensuring that only the most appropriate examples are provided to the LLM.
To assess the performance of the prompt and SQL query generation, we need to set up and run a series of tests. The goal is to evaluate how well the LLM generates SQL queries based on user inputs, with and without the use of example-based few shots.
We start by defining a function that uses the LLM to generate SQL queries based on the provided context, prompt, user input, and other parameters:
def get_sql_from_text(context, prompt, user_input, use_few_shots, tables_vector_store, table_df, example_selector=None, example_prompt=None): relevant_tables = get_relevant_tables(user_input, tables_vector_store, table_df) context["table_info"] = "\n\n".join(relevant_tables) examples = example_selector.select_examples({"input": user_input}) if example_selector else [] context["examples_value"] = "\n\n".join([ example_prompt.invoke(x).to_string() for x in examples ]) model = ChatOpenAI(model="gpt-3.5-turbo", temperature=0) output_parser = StrOutputParser() chain_model = prompt | model | output_parser response = chain_model.invoke({ "top_k": context["top_k"], "table_info": context["table_info"], "examples_value": context["examples_value"], "input": user_input }) return response
Test the prompt with and without examples:
# Prompt execution **with** few shots input = "Find all events in 2010 informing the Event Id and date, location city and state, aircraft manufacturer and model." response_with_few_shots = get_sql_from_text( context, prompt, user_input=input, use_few_shots=True, tables_vector_store=tables_vector_store, table_df=table_df, example_selector=example_selector, example_prompt=example_prompt, ) print(response_with_few_shots)
SELECT e.EventId, e.EventDate, e.LocationCity, e.LocationState, a.AircraftManufacturer, a.AircraftModel FROM Aviation.Event e JOIN Aviation.Aircraft a ON e.EventId = a.EventId WHERE Year(e.EventDate) = 2010
# Prompt execution **without** few shots input = "Find all events in 2010 informing the Event Id and date, location city and state, aircraft manufacturer and model." response_with_no_few_shots = get_sql_from_text( context, prompt, user_input=input, use_few_shots=False, tables_vector_store=tables_vector_store, table_df=table_df, ) print(response_with_no_few_shots)
SELECT TOP 3 "EventId", "EventDate", "LocationCity", "LocationState", "AircraftManufacturer", "AircraftModel" FROM Aviation.Event e JOIN Aviation.Aircraft a ON e.ID = a.Event WHERE e.EventDate >= '2010-01-01' AND e.EventDate < '2011-01-01'
To test the generated SQL queries, we define some utility functions:
def execute_sql_query(cnx, query): try: cursor = cnx.cursor() cursor.execute(query) rows = cursor.fetchall() return rows except: print('Error running query:') print(query) print('-'*80) return None def sql_result_equals(cnx, query, expected): rows = execute_sql_query(cnx, query) result = [set(row._asdict().values()) for row in rows or []] if result != expected and rows is not None: print('Result not as expected for query:') print(query) print('-'*80) return result == expected
# SQL test for prompt **with** few shots print("SQL is OK" if not execute_sql_query(cnx, response_with_few_shots) is None else "SQL is not OK")
SQL is OK
# SQL test for prompt **without** few shots print("SQL is OK" if not execute_sql_query(cnx, response_with_no_few_shots) is None else "SQL is not OK")
error on running query: SELECT TOP 3 "EventId", "EventDate", "LocationCity", "LocationState", "AircraftManufacturer", "AircraftModel" FROM Aviation.Event e JOIN Aviation.Aircraft a ON e.ID = a.Event WHERE e.EventDate >= '2010-01-01' AND e.EventDate < '2011-01-01' -------------------------------------------------------------------------------- SQL is not OK
Define a set of test cases and run them:
tests = [{ "input": "What were the top 3 years with the most recorded events?", "expected": [{128, 2003}, {122, 2007}, {117, 2005}] },{ "input": "How many incidents involving Boeing aircraft.", "expected": [{5}] },{ "input": "How many incidents that resulted in fatalities.", "expected": [{237}] },{ "input": "List event Id and date and, crew number, age and gender for incidents that occurred in 2013.", "expected": [{1, datetime.datetime(2013, 3, 4, 11, 6), '20130305X71252', 59, 'M'}, {1, datetime.datetime(2013, 1, 1, 15, 0), '20130101X94035', 32, 'M'}, {2, datetime.datetime(2013, 1, 1, 15, 0), '20130101X94035', 35, 'M'}, {1, datetime.datetime(2013, 1, 12, 15, 0), '20130113X42535', 25, 'M'}, {2, datetime.datetime(2013, 1, 12, 15, 0), '20130113X42535', 34, 'M'}, {1, datetime.datetime(2013, 2, 1, 15, 0), '20130203X53401', 29, 'M'}, {1, datetime.datetime(2013, 2, 15, 15, 0), '20130218X70747', 27, 'M'}, {1, datetime.datetime(2013, 3, 2, 15, 0), '20130303X21011', 49, 'M'}, {1, datetime.datetime(2013, 3, 23, 13, 52), '20130326X85150', 'M', None}] },{ "input": "Find the total number of incidents that occurred in the United States.", "expected": [{1178}] },{ "input": "List all incidents latitude and longitude coordinates with more than 5 injuries that occurred in 2010.", "expected": [{-78.76833333333333, 43.25277777777778}] },{ "input": "Find all incidents in 2010 informing the Event Id and date, location city and state, aircraft manufacturer and model.", "expected": [ {datetime.datetime(2010, 5, 20, 13, 43), '20100520X60222', 'CIRRUS DESIGN CORP', 'Farmingdale', 'New York', 'SR22'}, {datetime.datetime(2010, 4, 11, 15, 0), '20100411X73253', 'CZECH AIRCRAFT WORKS SPOL SRO', 'Millbrook', 'New York', 'SPORTCRUISER'}, {'108', datetime.datetime(2010, 1, 9, 12, 55), '20100111X41106', 'Bayport', 'New York', 'STINSON'}, {datetime.datetime(2010, 8, 1, 14, 20), '20100801X85218', 'A185F', 'CESSNA', 'New York', 'Newfane'} ] }]
Run the tests and calculate the accuracy:
def execute_tests(cnx, context, prompt, use_few_shots, tables_vector_store, table_df, example_selector, example_prompt): tests_generated_sql = [(x, get_sql_from_text( context, prompt, user_input=x['input'], use_few_shots=use_few_shots, tables_vector_store=tables_vector_store, table_df=table_df, example_selector=example_selector if use_few_shots else None, example_prompt=example_prompt if use_few_shots else None, )) for x in deepcopy(tests)] tests_sql_executions = [(x[0], sql_result_equals(cnx, x[1], x[0]['expected'])) for x in tests_generated_sql] accuracy = sum(1 for i in tests_sql_executions if i[1] == True) / len(tests_sql_executions) print(f'Accuracy: {accuracy}') print('-'*80)
# Accuracy tests for prompts executed **without** few shots use_few_shots = False execute_tests( cnx, context, prompt, use_few_shots, tables_vector_store, table_df, example_selector, example_prompt )
error on running query: SELECT "EventDate", COUNT("EventId") as "TotalEvents" FROM Aviation.Event GROUP BY "EventDate" ORDER BY "TotalEvents" DESC TOP 3; -------------------------------------------------------------------------------- error on running query: SELECT "EventId", "EventDate", "C"."CrewNumber", "C"."Age", "C"."Sex" FROM "Aviation.Event" AS "E" JOIN "Aviation.Crew" AS "C" ON "E"."ID" = "C"."EventId" WHERE "E"."EventDate" >= '2013-01-01' AND "E"."EventDate" < '2014-01-01' -------------------------------------------------------------------------------- result not expected for query: SELECT TOP 3 "e"."EventId", "e"."EventDate", "e"."LocationCity", "e"."LocationState", "a"."AircraftManufacturer", "a"."AircraftModel" FROM "Aviation"."Event" AS "e" JOIN "Aviation"."Aircraft" AS "a" ON "e"."ID" = "a"."Event" WHERE "e"."EventDate" >= '2010-01-01' AND "e"."EventDate" < '2011-01-01' -------------------------------------------------------------------------------- accuracy: 0.5714285714285714 --------------------------------------------------------------------------------
# Accuracy tests for prompts executed **with** few shots use_few_shots = True execute_tests( cnx, context, prompt, use_few_shots, tables_vector_store, table_df, example_selector, example_prompt )
error on running query: SELECT e.EventId, e.EventDate, e.LocationCity, e.LocationState, a.AircraftManufacturer, a.AircraftModel FROM Aviation.Event e JOIN Aviation.Aircraft a ON e.EventId = a.EventId WHERE Year(e.EventDate) = 2010 TOP 3 -------------------------------------------------------------------------------- accuracy: 0.8571428571428571 --------------------------------------------------------------------------------
The accuracy of SQL queries generated with examples (few shots) is approximately 49% higher compared to those generated without examples (85% vs. 57%).
以上是使用 LangChain 向 IRIS SQL 發送文本的詳細內容。更多資訊請關注PHP中文網其他相關文章!