Integrating Large Language Models in Production Applications

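In this practical guide, you will learn how to build a highly scalable model deployment solution with a built-in LLM for your applications.
In our examples, we will use Hugging Face's GPT-2 model, but you can easily plug in any other model, including GPT-4, Claude, and others.
Whether you are designing a new application with AI capabilities or improving an existing AI system, this guide will help you create a robust LLM integration step by step.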

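Understanding LLM Integration Fundamentals

Before we start writing code, let's work out what a production LLM integration requires. The API call is not the only thing to consider: you also need to think about reliability, cost, and stability. A production application has to cope with service outages, rate limits, and variable response times, all while keeping costs under control.
Here is what we will build together:

A robust API client that handles failures gracefully
A smart caching system to optimize cost and speed
A proper prompt management system
Comprehensive error handling and monitoring
A complete content moderation system as the example project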

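Prerequisites

Before we start coding, make sure you have:

Python 3.8 or later installed on your machine
A Redis Cloud account or a local Redis installation
Basic Python programming knowledge
A basic understanding of REST APIs
A Hugging Face API key (or a key for any other LLM provider)

Want to follow along? The complete code is available in the accompanying GitHub repository.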

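Setting Up Your Development Environment

Let's start by getting your development environment ready. We will create a clean project structure and install all the necessary packages.

First, create the project directory and set up a Python virtual environment. Open your terminal and run:

mkdir llm_integration && cd llm_integration
python3 -m venv env
source env/bin/activate

Now let's set up the project dependencies. Create a new requirements.txt file with these packages: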

transformers==4.36.0
huggingface-hub==0.19.4
redis==4.6.0
pydantic==2.5.0
pydantic-settings==2.1.0
tenacity==8.2.3
python-dotenv==1.0.0
fastapi==0.104.1
uvicorn==0.24.0
torch==2.1.0
numpy==1.24.3

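Let's break down why we need each of these packages:

transformers: Hugging Face's library, which we will use to interact with the model (GPT-2 in this guide).
huggingface-hub: handles model loading and versioning.
redis: used to implement request caching.
pydantic and pydantic-settings: used for data validation and settings.
tenacity: provides the retry functionality that improves reliability.
python-dotenv: loads environment variables.
fastapi: builds your API endpoints with a small amount of code.
uvicorn: runs the FastAPI application efficiently.
torch: runs the transformer models and handles machine learning operations.
numpy: used for numerical computing.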

Install all the packages with the following command:

pip install -r requirements.txt

Let's organize the project with a clean structure. Create these directories and files in your project directory:

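llm_integration/
├── core/
│   ├── llm_client.py       # Your main LLM interaction code
│   ├── prompt_manager.py   # Handles prompt templates
│   └── response_handler.py # Processes LLM responses
├── cache/
│   └── redis_manager.py    # Manages your caching system
├── config/
│   └── settings.py         # Configuration management
├── api/
│   └── routes.py           # API endpoints
├── utils/
│   ├── monitoring.py       # Usage tracking
│   └── rate_limiter.py     # Rate limiting logic
├── requirements.txt
├── main.py
└── usage_logs.json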

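Building the LLM Client

Let's start with the LLM client, the most important component of your application. This is where we interact with the GPT-2 model (or any other LLM you prefer). Add the following code to your core/llm_client.py file:

import logging
from typing import Dict, Optional

import torch
from tenacity import retry, stop_after_attempt, wait_exponential
from transformers import AutoModelForCausalLM, AutoTokenizer


class LLMClient:
    def __init__(self, model_name: str = "gpt2", timeout: int = 30):
        self.timeout = timeout
        self.logger = logging.getLogger(__name__)
        try:
            self.tokenizer = AutoTokenizer.from_pretrained(model_name)
            # device_map="auto" relies on the accelerate package being installed.
            # float16 halves memory use; some CPU operators lack half-precision
            # kernels, so float32 is the safer choice on CPU-only hosts.
            self.model = AutoModelForCausalLM.from_pretrained(
                model_name,
                device_map="auto",
                torch_dtype=torch.float16
            )
        except Exception as e:
            self.logger.error(f"Error loading model: {str(e)}")
            # Fallback to a simpler model if the specified one fails
            self.tokenizer = AutoTokenizer.from_pretrained("gpt2")
            self.model = AutoModelForCausalLM.from_pretrained("gpt2")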

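In this first part of the LLMClient class, we are laying the foundation:

We use AutoModelForCausalLM and AutoTokenizer from the transformers library to load the model
The device_map="auto" parameter automatically handles GPU/CPU allocation
We use torch.float16 to reduce memory usage while maintaining good performance

Now let's add the method that talks to the model: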

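    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=4, max=10),
        reraise=True
    )
    async def complete(self,
                       prompt: str,
                       temperature: float = 0.7,
                       max_tokens: Optional[int] = None) -> Dict:
        """Get completion from the model with automatic retries"""
        try:
            inputs = self.tokenizer(prompt, return_tensors="pt").to(
                self.model.device
            )

            # Inference only: skip gradient tracking to save memory
            with torch.no_grad():
                outputs = self.model.generate(
                    **inputs,
                    max_new_tokens=max_tokens or 100,
                    temperature=temperature,
                    do_sample=True
                )

            # The decoded text contains the prompt followed by the completion
            response_text = self.tokenizer.decode(
                outputs[0],
                skip_special_tokens=True
            )

            # Calculate token usage for monitoring
            input_tokens = len(inputs.input_ids[0])
            output_tokens = len(outputs[0]) - input_tokens

            return {
                'content': response_text,
                'usage': {
                    'prompt_tokens': input_tokens,
                    'completion_tokens': output_tokens,
                    'total_tokens': input_tokens + output_tokens
                },
                # Report the model that was actually loaded (may be the fallback)
                'model': self.model.config.name_or_path
            }

        except Exception as e:
            self.logger.error(f"Error in LLM completion: {str(e)}")
            raise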

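Let's break down what happens in this complete method:

The @retry decorator handles transient failures.
The torch.no_grad() context manager saves memory by disabling gradient computation.
Token usage is tracked for both input and output, which matters for cost calculation.
The method returns a structured dictionary containing the response and usage statistics.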
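
To make the retry behaviour more concrete, here is a minimal standard-library sketch of retrying with exponential backoff. It is not how tenacity is implemented; the names retry_with_backoff, backoff_delays, base_delay and max_delay are made up for this illustration, and the sleep function is injectable only so the example can run without waiting.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(attempts: int, base_delay: float, max_delay: float) -> List[float]:
    """Delays slept between attempts: base_delay * 2**n, capped at max_delay."""
    return [min(base_delay * (2 ** n), max_delay) for n in range(attempts - 1)]


def retry_with_backoff(func: Callable[[], T],
                       attempts: int = 3,
                       base_delay: float = 1.0,
                       max_delay: float = 10.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func until it succeeds or the attempts run out, then re-raise."""
    delays = backoff_delays(attempts, base_delay, max_delay)
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(delays[attempt])
    raise RuntimeError("attempts must be at least 1")
```

A call that fails twice and then succeeds would sleep for 1 and then 2 seconds with the defaults above before returning its result.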

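Building Your LLM Response Handler

Next, we need a response handler to parse and structure the raw output of the LLM. Add the following code to your core/response_handler.py file:

from typing import Dict
import logging

# Words in the model output that signal inappropriate content
FLAG_WORDS = ('inappropriate', 'unsafe', 'offensive', 'harmful')


class ResponseHandler:
    def __init__(self):
        self.logger = logging.getLogger(__name__)

    def parse_moderation_response(self, raw_response: str) -> Dict:
        """Parse and structure the raw LLM response for moderation"""
        try:
            # Default response structure
            structured_response = {
                "is_appropriate": True,
                "confidence_score": 0.0,
                "reason": None
            }

            # Simple keyword-based analysis
            lower_response = raw_response.lower()

            # Check for inappropriate content signals
            if any(word in lower_response for word in FLAG_WORDS):
                structured_response["is_appropriate"] = False
                structured_response["confidence_score"] = 0.9
                # Extract reason if present: the text from "because" up to the next period
                if "because" in lower_response:
                    reason_start = lower_response.find("because")
                    structured_response["reason"] = raw_response[reason_start:].split('.')[0].strip()
            else:
                structured_response["confidence_score"] = 0.95

            return structured_response

        except Exception as e:
            self.logger.error(f"Error parsing response: {str(e)}")
            return {
                "is_appropriate": True,
                "confidence_score": 0.5,
                "reason": "Failed to parse response"
            }

    def format_response(self, raw_response: Dict) -> Dict:
        """Format the final response with parsed content and usage stats"""
        try:
            return {
                "content": self.parse_moderation_response(raw_response["content"]),
                "usage": raw_response["usage"],
                "model": raw_response["model"]
            }
        except Exception as e:
            self.logger.error(f"Error formatting response: {str(e)}")
            raise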

Adding a Robust Caching System

Now let's build the caching system to improve application performance and reduce costs. Add the following code to your cache/redis_manager.py file:

import hashlib
import json
from typing import Optional

# The asyncio client keeps Redis calls from blocking the event loop
import redis.asyncio as redis


class CacheManager:
    def __init__(self, redis_url: str, ttl: int = 3600):
        self.redis = redis.from_url(redis_url)
        self.ttl = ttl  # Seconds before a cached response expires

    def _generate_key(self, prompt: str, params: dict) -> str:
        """Generate a unique cache key"""
        cache_data = {
            'prompt': prompt,
            'params': params
        }
        # sort_keys makes the key independent of dictionary ordering
        serialized = json.dumps(cache_data, sort_keys=True)
        return hashlib.sha256(serialized.encode()).hexdigest()

    async def get_cached_response(self,
                                  prompt: str,
                                  params: dict) -> Optional[dict]:
        """Retrieve cached LLM response"""
        key = self._generate_key(prompt, params)
        cached = await self.redis.get(key)
        return json.loads(cached) if cached else None

    async def cache_response(self,
                             prompt: str,
                             params: dict,
                             response: dict) -> None:
        """Cache LLM response"""
        key = self._generate_key(prompt, params)
        await self.redis.setex(
            key,
            self.ttl,
            json.dumps(response)
        )

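The CacheManager class handles all caching operations through:

A _generate_key method, which creates a unique cache key from the prompt and parameters
get_cached_response, which checks whether we have a cached response for a given prompt
cache_response, which stores successful responses for future use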
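
As a small illustration of why the key is built from a sorted JSON dump, the standalone sketch below hashes the same prompt with differently ordered parameters. The function name generate_key and the sample prompt and parameters are chosen for this example only.

```python
import hashlib
import json


def generate_key(prompt: str, params: dict) -> str:
    """Hash the prompt and parameters into a stable cache key."""
    serialized = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
    return hashlib.sha256(serialized.encode()).hexdigest()


key_a = generate_key("Is this OK?", {"temperature": 0.7, "max_tokens": 100})
key_b = generate_key("Is this OK?", {"max_tokens": 100, "temperature": 0.7})
key_c = generate_key("Is this OK?", {"temperature": 0.9, "max_tokens": 100})

print(key_a == key_b)  # True: same parameters, different insertion order
print(key_a == key_c)  # False: a different temperature yields a different key
print(len(key_a))      # 64: a SHA-256 digest in hexadecimal
```

Because every generation parameter is part of the key, a request with a different temperature or token limit never receives a response cached for another configuration.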

Building a Smart Prompt Manager

Let's create a prompt manager to handle the prompts for your LLM. This code belongs in your core/prompt_manager.py file.

Then create a sample prompt template for content moderation in your prompts/content_moderation.json file.

Your prompt manager can now load prompt templates from JSON files and return formatted prompts.
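
One way such a prompt manager could look is sketched below. This is an illustration under assumptions, not the article's listing: the method name format_prompt, the "template" field, and the wording of the sample template are all hypothetical, and the demo writes its template to a temporary directory instead of prompts/.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


class PromptManager:
    """Loads prompt templates from JSON files and fills in their placeholders."""

    def __init__(self, prompts_dir: str = "prompts"):
        self.prompts_dir = Path(prompts_dir)
        self.templates: Dict[str, dict] = {}
        # Each <name>.json file is registered as a template called <name>
        for path in sorted(self.prompts_dir.glob("*.json")):
            with open(path, encoding="utf-8") as f:
                self.templates[path.stem] = json.load(f)

    def format_prompt(self, name: str, **variables: str) -> str:
        """Return the named template with its placeholders substituted."""
        if name not in self.templates:
            raise KeyError(f"Unknown prompt template: {name}")
        return self.templates[name]["template"].format(**variables)


# Hypothetical template file, written to a temporary directory for the demo
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "content_moderation.json").write_text(
    json.dumps({
        "name": "content_moderation",
        "template": "Decide whether this content is appropriate.\n"
                    "Content: {content}\nAnswer:",
    }),
    encoding="utf-8",
)

manager = PromptManager(str(demo_dir))
print(manager.format_prompt("content_moderation", content="Hello, world!"))
```

Keeping templates in JSON files means the prompt wording can change without touching the Python code.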

Setting Up the Configuration Manager

To keep all LLM configuration in one place and reuse it easily across your application, let's create the configuration settings in your config/settings.py file.
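
The sketch below shows the general shape of such a settings module. To stay runnable without extra packages it uses a plain dataclass and os.environ in place of pydantic-settings. HUGGINGFACE_API_KEY and REDIS_URL come from this guide; every other field name and every default value is an assumption made for illustration.

```python
import os
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class Settings:
    huggingface_api_key: str
    redis_url: str
    model_name: str           # assumed setting
    cache_ttl: int            # seconds; assumed setting
    rate_limit_requests: int  # requests per period; assumed setting
    rate_limit_period: int    # seconds; assumed setting


@lru_cache()
def get_settings() -> Settings:
    """Read the settings from environment variables once and reuse them."""
    env = os.environ
    return Settings(
        huggingface_api_key=env.get("HUGGINGFACE_API_KEY", ""),
        redis_url=env.get("REDIS_URL", "redis://localhost:6379"),
        model_name=env.get("MODEL_NAME", "gpt2"),
        cache_ttl=int(env.get("CACHE_TTL", "3600")),
        rate_limit_requests=int(env.get("RATE_LIMIT_REQUESTS", "100")),
        rate_limit_period=int(env.get("RATE_LIMIT_PERIOD", "3600")),
    )


settings = get_settings()
print(settings.model_name, settings.cache_ttl)
```

Wrapping the loader in lru_cache means the environment is parsed once and the same Settings object is shared by every caller.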

Implementing Rate Limiting

Next, let's implement rate limiting to control how users access your application's resources. This logic goes in your utils/rate_limiter.py file.

The RateLimiter class provides a reusable check_rate_limit method that can be used in any route: you pass in the period and the number of requests each user is allowed within that period, and it handles the rate limiting.
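
A minimal sketch of a check_rate_limit method follows, assuming a sliding-window strategy with in-memory storage. The parameter names, the boolean return value, and the injectable clock are assumptions for this example; a deployment with several workers would need shared storage such as Redis instead of a local dictionary.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class RateLimiter:
    """Sliding-window limiter that remembers recent request times per user."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock  # injectable so tests can control time
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def check_rate_limit(self, user_id: str, max_requests: int, period: float) -> bool:
        """Return True and record the request if the user is under the limit."""
        now = self._clock()
        hits = self._hits[user_id]
        # Drop requests that have fallen out of the current window
        while hits and now - hits[0] >= period:
            hits.popleft()
        if len(hits) >= max_requests:
            return False
        hits.append(now)
        return True


limiter = RateLimiter()
for attempt in range(3):
    # With a limit of 2 requests per 60 seconds, the third call is rejected
    print(attempt, limiter.check_rate_limit("user-1", max_requests=2, period=60))
```

A route would typically translate a False result into an HTTP 429 response.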

Creating Your API Endpoints

Now let's create the API endpoints in the api/routes.py file to integrate the LLM into your application.

Here we define a /moderate endpoint on an APIRouter, which organizes the API routes. The @lru_cache decorator is applied to the dependency injection functions (get_llm_client, get_response_handler, get_cache_manager, and get_prompt_manager) so that the LLMClient, CacheManager, and PromptManager instances are cached for better performance. The moderate_content function, decorated with @router.post, defines a POST route for content moderation and uses FastAPI's Depends mechanism to inject these dependencies. Inside the function, the RateLimiter class is configured with the rate limit settings from the configuration and enforces the request limits.
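
The two ideas in this route, cached dependency providers and a cache-first lookup before calling the model, can be sketched without FastAPI. In the example below StubLLMClient and InMemoryCache are stand-ins invented for this illustration, and the simplified method signatures and prompt wording are assumptions.

```python
import asyncio
from functools import lru_cache
from typing import Dict, Optional


class StubLLMClient:
    """Stand-in for LLMClient so the sketch runs without loading a model."""

    def __init__(self) -> None:
        self.calls = 0

    async def complete(self, prompt: str) -> Dict:
        self.calls += 1
        return {"content": "This text looks appropriate.", "model": "stub"}


class InMemoryCache:
    """Stand-in for CacheManager; a dictionary replaces Redis."""

    def __init__(self) -> None:
        self._store: Dict[str, Dict] = {}

    async def get_cached_response(self, prompt: str) -> Optional[Dict]:
        return self._store.get(prompt)

    async def cache_response(self, prompt: str, response: Dict) -> None:
        self._store[prompt] = response


@lru_cache()
def get_llm_client() -> StubLLMClient:
    # lru_cache on a zero-argument function returns the same instance on every call
    return StubLLMClient()


@lru_cache()
def get_cache_manager() -> InMemoryCache:
    return InMemoryCache()


async def moderate_content(text: str) -> Dict:
    """Serve from the cache when possible; otherwise call the model and cache it."""
    llm, cache = get_llm_client(), get_cache_manager()
    prompt = f"Is the following content appropriate? {text}"
    cached = await cache.get_cached_response(prompt)
    if cached is not None:
        return {**cached, "cached": True}
    response = await llm.complete(prompt)
    await cache.cache_response(prompt, response)
    return {**response, "cached": False}


first = asyncio.run(moderate_content("Hello there"))
second = asyncio.run(moderate_content("Hello there"))
print(first["cached"], second["cached"], get_llm_client().calls)  # False True 1
```

The second identical request is answered from the cache, so the model is called only once.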

Finally, let's update your main.py to bring everything together.

main.py creates a FastAPI application and mounts the router from api.routes under the /api/v1 prefix. Logging is enabled so that informational messages are shown with timestamps. The application runs with Uvicorn on localhost:8000, with hot reloading enabled.

Running Your Application

Now that all the components are in place, let's get the application up and running. First, create a .env file in the project root directory and add your HUGGINGFACE_API_KEY and REDIS_URL to it.

Then make sure Redis is running on your machine. On most Unix-based systems, you can start it with the redis-server command.

Now you can start the application by running main.py.

Your FastAPI server will start running at http://localhost:8000. Automatic API documentation will be available at http://localhost:8000/docs, which is very helpful for testing your endpoints!

Testing Your Content Moderation API

Let's test the newly created API with a real request. Open a new terminal and use curl to send a POST request to the /api/v1/moderate endpoint.

The moderation result is then printed in your terminal.

Adding Monitoring and Analytics

Now let's add some monitoring to track how the application is performing and how many resources it uses. This code goes in your utils/monitoring.py file.

The UsageMonitor class does the following:

Tracks each API request with a timestamp
Records token usage for cost monitoring
Measures response time
Stores everything in a structured log file (replace it with a database before deploying the application to production)
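
The behaviour listed above could be sketched as follows. The method name log_request and the field names of each log entry are assumptions for this example, and the demo writes to a temporary file instead of the project's usage_logs.json.

```python
import json
import tempfile
import time
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, List


class UsageMonitor:
    """Appends one JSON record per API request to a log file."""

    def __init__(self, log_file: str = "usage_logs.json"):
        self.log_file = Path(log_file)

    def read_entries(self) -> List[Dict]:
        """Return all logged entries, or an empty list if nothing is logged yet."""
        if not self.log_file.exists():
            return []
        return json.loads(self.log_file.read_text(encoding="utf-8"))

    def log_request(self, endpoint: str, total_tokens: int, response_time: float) -> Dict:
        """Record the timestamp, token usage and response time of a request."""
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "endpoint": endpoint,
            "total_tokens": total_tokens,
            "response_time": round(response_time, 4),
        }
        entries = self.read_entries()
        entries.append(entry)
        self.log_file.write_text(json.dumps(entries, indent=2), encoding="utf-8")
        return entry


monitor = UsageMonitor(str(Path(tempfile.mkdtemp()) / "usage_logs.json"))
started = time.perf_counter()
# ... the LLM call would run here ...
entry = monitor.log_request("/moderate", total_tokens=42,
                            response_time=time.perf_counter() - started)
print(entry["endpoint"], entry["total_tokens"])
```

Rewriting the whole file on every request is acceptable for a demo, but it is one more reason to move to a database before production.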

Next, add a new method that calculates usage statistics.
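
A statistics calculation of this kind might look like the standalone function below. The function name summarize_usage, the entry fields (endpoint, total_tokens, response_time) and the shape of the returned dictionary are assumptions made for illustration.

```python
from typing import Dict, List


def summarize_usage(entries: List[Dict]) -> Dict:
    """Aggregate request count, token usage and average latency from log entries."""
    if not entries:
        return {
            "total_requests": 0,
            "total_tokens": 0,
            "avg_response_time": 0.0,
            "requests_by_endpoint": {},
        }
    by_endpoint: Dict[str, int] = {}
    for item in entries:
        by_endpoint[item["endpoint"]] = by_endpoint.get(item["endpoint"], 0) + 1
    return {
        "total_requests": len(entries),
        "total_tokens": sum(item["total_tokens"] for item in entries),
        "avg_response_time": sum(item["response_time"] for item in entries) / len(entries),
        "requests_by_endpoint": by_endpoint,
    }


sample = [
    {"endpoint": "/moderate", "total_tokens": 120, "response_time": 0.5},
    {"endpoint": "/moderate", "total_tokens": 80, "response_time": 1.5},
]
print(summarize_usage(sample))
```

For the two sample entries this reports 2 requests, 200 tokens in total, and an average response time of 1.0 seconds.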

Update your API to use the monitoring functionality from the UsageMonitor class.

Now test your /stats endpoint with a curl request.

The response shows statistics for the requests made to the /moderate endpoint.