How to Measure the Reliability of a Large Language Model's Response

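The basic principle of large language models (LLMs) is very simple: predict the next word (or token) in a sequence of words based on statistical patterns in the training data. This seemingly simple capability turns out to be remarkably sophisticated, since it lets LLMs perform many impressive tasks such as text summarization, idea generation, brainstorming, code generation, information processing, and content creation. That said, LLMs have no memory, and they do not actually "understand" anything beyond sticking to their basic function of predicting the next word.

The process of next-word prediction is probabilistic: the LLM has to select each word from a probability distribution. In the process, LLMs often generate false, fabricated, or inconsistent content in an attempt to produce a coherent response, filling gaps with plausible-looking but incorrect information. This phenomenon is called hallucination. It is an inevitable, well-known feature of LLMs that warrants validation and corroboration of their outputs.

Retrieval-augmented generation (RAG) methods, which make an LLM work with external knowledge sources, reduce hallucinations to some extent but cannot eradicate them completely. Advanced RAGs can provide in-text citations and URLs, but verifying these references can be hectic and time-consuming. We therefore need an objective criterion for assessing the reliability, or trustworthiness, of an LLM's response, whether it is generated from the model's own knowledge or from an external knowledge base (RAG).

This article discusses how an LLM's output can be assessed for trustworthiness by a trustworthy language model that assigns a score to that output. We first discuss how a trustworthy language model can be used to assign a score to an LLM's answer and to explain the trustworthiness. We then develop an example RAG with LlamaParse and LlamaIndex and assess the RAG's answers for trustworthiness. The entire code for this article is available in a Jupyter notebook on GitHub.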

Assigning a Trustworthiness Score to an LLM's Answer

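To demonstrate how to assign a trustworthiness score to an LLM's response, we use Cleanlab's Trustworthy Language Model (TLM). Such TLMs use a combination of uncertainty quantification and consistency analysis to compute trustworthiness scores and explanations for LLM responses.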
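To give a feel for the consistency-analysis half of that idea, here is a minimal sketch. It is not Cleanlab's actual algorithm: it simply scores an answer by the fraction of independently sampled answers that agree with it. The helper names `normalize` and `consistency_score` and the sample answers are assumptions made for illustration.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and strip whitespace/trailing periods so trivial variants match."""
    return answer.strip().lower().rstrip(".")


def consistency_score(answer: str, sampled_answers: list[str]) -> float:
    """Fraction of sampled answers that agree with `answer` (0.0 to 1.0).

    Hypothetical helper for illustration only; TLM also uses uncertainty
    quantification, which is not reproduced here.
    """
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(normalize(s) for s in sampled_answers)
    return counts[normalize(answer)] / len(sampled_answers)


# Pretend the same prompt was sampled five times from a model.
samples = ["5", "5", "6", "5.", "5"]
print(consistency_score("6", samples))  # 0.2
print(consistency_score("5", samples))  # 0.8
```

An answer that most samples disagree with receives a low score, which mirrors the "lack of consistency" wording that appears in TLM explanations.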

Cleanlab offers a free trial API key, which you can obtain by creating an account on its website. First, install Cleanlab's Python client:

pip install --upgrade cleanlab-studio


Cleanlab supports several proprietary models such as 'gpt-4o', 'gpt-4o-mini', 'o1-preview', 'claude-3-sonnet', 'claude-3.5-sonnet', and 'claude-3.5-sonnet-v2'. Here is how TLM assigns a trustworthiness score to GPT-4o's answer. The trustworthiness score ranges from 0 to 1, where higher values indicate greater trustworthiness.

from cleanlab_studio import Studio

studio = Studio("<CLEANLAB_API_KEY>")  # use the API key from your Cleanlab account
# "log": ["explanation"] asks TLM to return an explanation along with the score
tlm = studio.TLM(options={"log": ["explanation"], "model": "gpt-4o"})  # GPT, Claude, etc.
# Set the prompt
out = tlm.prompt("How many vowels are there in the word 'Abracadabra'.?")
# The TLM output contains the model's answer ('response'), the trustworthiness score, and the explanation
print(f"Model's response = {out['response']}")
print(f"Trustworthiness score = {out['trustworthiness_score']}")
print(f"Explanation = {out['log']['explanation']}")

The code above tests GPT-4o's response to the question "How many vowels are there in the word 'Abracadabra'?". The TLM output contains the model's answer (response), the trustworthiness score, and the explanation. The output of this code is:

Model's response = The word "Abracadabra" contains 6 vowels. The vowels are: A, a, a, a, a, and a.
Trustworthiness score = 0.6842228802750124
Explanation = This response is untrustworthy due to a lack of consistency in possible responses from the model. Here's one inconsistent alternate response that the model considered (which may not be accurate either):
5.

This shows how even a state-of-the-art language model can hallucinate on such a simple task and produce a wrong answer. Here are the response and trustworthiness score for the same question from claude-3.5-sonnet-v2:

Model's response = Let me count the vowels in 'Abracadabra':
A-b-r-a-c-a-d-a-b-r-a

The vowels are: A, a, a, a, a

There are 5 vowels in the word 'Abracadabra'.
Trustworthiness score = 0.9378276048845285
Explanation = Did not find a reason to doubt trustworthiness.

claude-3.5-sonnet-v2 produces the correct output. Let's compare the two models' answers to another question:

from cleanlab_studio import Studio
from IPython.display import Markdown, display

# Initialize Cleanlab Studio with the API key
studio = Studio("<CLEANLAB_API_KEY>")  # replace with your actual API key

# List of models to evaluate
models = ["gpt-4o", "claude-3.5-sonnet-v2"]

# Define the prompt
prompt_text = "Which one of 9.11 and 9.9 is bigger?"

# Loop through each model and evaluate
for model in models:
    tlm = studio.TLM(options={"log": ["explanation"], "model": model})
    out = tlm.prompt(prompt_text)

    # Render the answer, score, and explanation as markdown in the notebook
    md_content = f"""
## Model: {model}

**Response:** {out['response']}

**Trustworthiness Score:** {out['trustworthiness_score']}

**Explanation:** {out['log']['explanation']}

---
"""
    display(Markdown(md_content))

The loop displays each model's response together with its trustworthiness score and explanation.

We can also generate trustworthiness scores for open-source LLMs. Let's check the recent, much-hyped open-source LLM: DeepSeek-R1. We use deepseek-r1-distill-llama-70b, which is based on Meta's llama-3.3-70b-instruct model and distilled from DeepSeek's larger 671-billion-parameter Mixture-of-Experts (MoE) model. Knowledge distillation is a machine learning technique that aims to transfer what a large pre-trained model, the "teacher model", has learned to a smaller "student model".
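As a rough illustration of knowledge distillation, the sketch below computes the textbook soft-target loss: the KL divergence between the teacher's and the student's temperature-softened output distributions. This formulation is assumed here purely for illustration. The article does not say how DeepSeek's distilled models were actually trained, and all function names and logits are made up.

```python
import math


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Temperature-scaled softmax; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature: float = 2.0) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return kl * temperature ** 2


teacher = [4.0, 1.0, 0.5]       # illustrative logits from the large model
good_student = [3.8, 1.1, 0.4]  # close to the teacher
poor_student = [0.5, 1.0, 4.0]  # far from the teacher

# A student that mimics the teacher well incurs a smaller loss.
print(distillation_loss(teacher, good_student) < distillation_loss(teacher, poor_student))  # True
```

Minimizing this loss during training pushes the student's predictions toward the teacher's, which is the "transfer" that the definition above refers to.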

The code below queries the deepseek-r1-distill-llama-70b model through Groq and scores its answer with TLM:

import os

import streamlit as st
from cleanlab_studio import Studio
from IPython.display import Markdown, display
from langchain_groq.chat_models import ChatGroq

# Read the Groq API key from Streamlit secrets
os.environ["GROQ_API_KEY"] = st.secrets["GROQ_API_KEY"]

model = "deepseek-r1-distill-llama-70b"
# Initialize the DeepSeek-R1 distilled Llama model served by Groq
groq_llm = ChatGroq(model=model, temperature=0.5)
prompt = "Which one of 9.11 and 9.9 is bigger?"
# Get the response from the model
response = groq_llm.invoke(prompt)
# Initialize Cleanlab Studio (use a placeholder; never hard-code a real API key)
studio = Studio("<CLEANLAB_API_KEY>")
cleanlab_tlm = studio.TLM(options={"log": ["explanation"]})  # request explanations
# Score the response that was already generated instead of generating a new one
output = cleanlab_tlm.get_trustworthiness_score(prompt, response=response.content.strip())
md_content = f"""
## Model: {model}
**Response:** {response.content.strip()}
**Trustworthiness Score:** {output['trustworthiness_score']}
**Explanation:** {output['log']['explanation']}
---
"""
display(Markdown(md_content))

Running it displays the deepseek-r1-distill-llama-70b model's response together with its trustworthiness score and explanation.

Developing a Trustworthy RAG

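We now develop a RAG to demonstrate how to measure the trustworthiness of an LLM response in RAG. This RAG is built by scraping data from the given links, parsing it into markdown format, and creating a vector store.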

The following libraries need to be installed for the code that follows:

pip install llama-parse llama-index-core llama-index-embeddings-huggingface \
    llama-index-llms-cleanlab requests beautifulsoup4 pdfkit nest-asyncio

The wkhtmltopdf command-line tool, which pdfkit uses to render HTML as PDF, must also be installed. These libraries are then imported.

The next steps involve scraping data from the given URLs with Python's BeautifulSoup library, saving the scraped data as PDF file(s) with pdfkit, and parsing the data from the PDF(s) into a markdown file with LlamaParse, a document parsing platform built for LLM use cases.

We first configure the LLM to be used by CleanlabTLM and the embedding model (the HuggingFace embedding model BAAI/bge-small-en-v1.5), which computes the embeddings of the scraped data to create the vector store.


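We now define a custom event handler, GetTrustworthinessScore, which is derived from a base event handler class. This handler is triggered at the end of an LLM completion and extracts the trustworthiness score from the response metadata. A helper function, display_response, displays the LLM's response along with its trustworthiness score.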
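The pattern can be sketched without any framework. The example below is a library-free illustration and not the LlamaIndex API: `CompletionEndEvent` and `BaseHandler` are hypothetical stand-ins for the framework's event and base handler classes, and the event is fired by hand instead of by a dispatcher. Only the names GetTrustworthinessScore and display_response come from the article.

```python
from dataclasses import dataclass, field


@dataclass
class CompletionEndEvent:
    """Hypothetical stand-in for an 'LLM completion finished' event."""
    text: str
    metadata: dict = field(default_factory=dict)


class BaseHandler:
    """Hypothetical stand-in for a framework's base event handler class."""

    def handle(self, event) -> None:
        raise NotImplementedError


class GetTrustworthinessScore(BaseHandler):
    """Remembers the trustworthiness score of the most recent completion."""

    def __init__(self) -> None:
        self.trustworthiness_score = None

    def handle(self, event) -> None:
        # React only to completion-end events; ignore everything else.
        if isinstance(event, CompletionEndEvent):
            self.trustworthiness_score = event.metadata.get("trustworthiness_score")


def display_response(text: str, handler: GetTrustworthinessScore) -> str:
    """Format and print the response with the score captured by the handler."""
    line = f"Response: {text}\nTrustworthiness score: {handler.trustworthiness_score}"
    print(line)
    return line


handler = GetTrustworthinessScore()
# A real framework would dispatch this event itself; here it is fired by hand.
event = CompletionEndEvent("Paris", {"trustworthiness_score": 0.93})
handler.handle(event)
shown = display_response(event.text, handler)
```

Keeping the score in a handler means the query code does not need to change: any completion that passes through the framework updates the handler as a side effect.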

We now generate PDFs by scraping data from the given URLs. For demonstration, we scrape data only from the Wikipedia article on large language models (available under the Creative Commons Attribution-ShareAlike 4.0 License).

Note: Readers are advised to always double-check the status of the content/data they are about to scrape and to make sure they are allowed to do so.

The next piece of code scrapes data from the given URLs by making an HTTP request and using the BeautifulSoup Python library to parse the HTML content. The HTML content is cleaned by converting protocol-relative URLs to absolute ones. The scraped content is then converted into PDF file(s) using pdfkit.

After generating PDF(s) from the scraped data, we parse these PDFs using LlamaParse. We set the parsing instructions to extract the content in markdown format and parse the document(s) page by page, along with the document name and page number. These extracted entities (pages) are referred to as nodes. The parser iterates over the extracted nodes and updates each node's metadata by appending a citation header for later reference.

We then create a vector store and a query engine. We define a custom prompt template to guide the LLM's behavior when answering questions. Finally, we create a query engine with the created index to answer queries. For each query, we retrieve the top 3 nodes from the vector store based on their semantic similarity to the query. The LLM uses these retrieved nodes to generate the final answer.
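The retrieval step can be pictured with a small, library-free sketch. It assumes cosine similarity as the similarity measure and uses toy two-dimensional vectors; the function names, node ids, and numbers are hypothetical, and a real vector store would do this work internally.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_nodes(query_vec: list[float], node_vecs: dict, k: int = 3) -> list[str]:
    """Return the ids of the k nodes most similar to the query, best first."""
    ranked = sorted(
        node_vecs,
        key=lambda node_id: cosine_similarity(query_vec, node_vecs[node_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-D "embeddings"; a real embedding model produces hundreds of dimensions.
nodes = {
    "page_1": [1.0, 0.0],
    "page_2": [0.9, 0.1],
    "page_3": [0.0, 1.0],
    "page_4": [0.7, 0.7],
    "page_5": [-1.0, 0.0],
}
query = [1.0, 0.0]
print(top_k_nodes(query, nodes))  # ['page_1', 'page_2', 'page_4']
```

Only the three best-matching pages are handed to the LLM as context, so the quality of the final answer depends directly on this ranking.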

Now let's test the RAG on some queries and look at their corresponding trustworthiness scores.


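Assigning a trustworthiness score to an LLM's response, whether it is generated through direct inference or RAG, helps define the reliability of the AI's output and prioritize human verification where needed. This is particularly important for critical domains where a wrong or unreliable response could have severe consequences.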
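One simple way to act on these scores is a threshold-based triage, sketched below. The 0.8 threshold and the function name `needs_human_review` are assumptions for illustration; a suitable cut-off depends on the domain and should be calibrated on real data. The two scores are the ones reported for the 'Abracadabra' question earlier in this article.

```python
def needs_human_review(results: list[dict], threshold: float = 0.8) -> list[dict]:
    """Return the results scoring below `threshold`, least trustworthy first."""
    flagged = [r for r in results if r["trustworthiness_score"] < threshold]
    return sorted(flagged, key=lambda r: r["trustworthiness_score"])


results = [
    {"model": "gpt-4o", "trustworthiness_score": 0.6842228802750124},
    {"model": "claude-3.5-sonnet-v2", "trustworthiness_score": 0.9378276048845285},
]

for item in needs_human_review(results):
    print(item["model"], round(item["trustworthiness_score"], 2))  # gpt-4o 0.68
```

Sorting the flagged items puts the least trustworthy responses at the front of the review queue.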

That's all, folks! If you liked the article, please follow me on LinkedIn.
