Building a Multimodal RAG System with CLIP and an LLM - Artificial Intelligence - PHP中文網

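This article discusses how to build a retrieval-augmented generation (RAG) system using an open-source large multimodal language model. The focus is on doing so without relying on LangChain or LlamaIndex, to avoid adding further framework dependencies.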

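What Is RAG

In artificial intelligence, retrieval-augmented generation (RAG) has brought a significant improvement to large language models (LLMs). At its core, RAG lets a model dynamically retrieve up-to-date information from external sources and use it when generating a response. By retrieving and incorporating external information, the model can answer user requests more specifically, accurately, and completely. This widens the range of practical applications, including customer-service assistants, search, and question-answering systems, and marks a further step in the development of language models.

The architecture combines a dynamic retrieval step with generation, so the system can keep up with changing information in any domain. Unlike fine-tuning or retraining, RAG is a cost-effective way to give the model current, relevant information without modifying the model itself, which is an advantage when information changes quickly.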
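
The retrieve-then-generate flow described above can be sketched in a few lines. This is only a toy illustration, not the pipeline built later in this article: the word-overlap scoring, the `retrieve` and `build_prompt` helpers, and the sample documents are all assumptions made for the example. A real system would use embeddings and a vector database instead.

```python
import re

def tokenize(text):
    """Lowercase a string and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents, k=1):
    """Return the k documents sharing the most words with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_tokens & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query, context_docs):
    """Place the retrieved text in front of the question for the LLM."""
    context = "\n".join(context_docs)
    return f"Answer using this context:\n{context}\n\nQuestion: {query}"

# Hypothetical knowledge base; updating it requires no retraining.
knowledge_base = [
    "Tulips bloom in spring and grow from bulbs.",
    "Roses are woody perennials with prickly stems.",
]

question = "When do tulips bloom?"
top_docs = retrieve(question, knowledge_base)
prompt = build_prompt(question, top_docs)
print(prompt)
```

The prompt, not the model weights, carries the new information, which is why adding a document to `knowledge_base` is enough to change what the model can answer.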

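Benefits of RAG

1. Improved accuracy and reliability:

Pointing the large language model (LLM) at reliable knowledge sources reduces its unpredictability and lowers the risk of false or outdated information, making responses more accurate and dependable.

2. Greater transparency and trust:

Generative models such as LLMs often lack transparency, which makes their output hard to trust. RAG gives more control over the sources behind an answer, which helps address concerns about bias, reliability, and compliance.

3. Fewer hallucinations:

LLMs are prone to hallucination: responses that are coherent but inaccurate or fabricated. By grounding responses in authoritative sources, RAG reduces the risk of misleading advice in critical sectors.

4. Cost-effective adaptability:

RAG improves output quality without extensive retraining or fine-tuning. Specific details are fetched on demand, so the information stays current and relevant as it changes.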

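Multimodal Models

Multimodal models take several kinds of input and combine them into a single output. Take CLIP as an example: its training data consists of text-image pairs, and through contrastive learning the model learns which text matches which image.

As a result, the model produces the same (or very similar) embedding vectors for different inputs that represent the same thing.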
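
To make the idea concrete, here is a small sketch of a symmetric contrastive objective of the kind used in CLIP-style training, computed over toy embeddings. The vectors, the temperature value, and the helper names are assumptions for illustration; this is not CLIP's actual training code.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def cross_entropy_row(logits, target):
    """Negative log-softmax of the target entry in one row of logits."""
    max_logit = max(logits)
    log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum - logits[target]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image loss; image i matches text i."""
    n = len(image_embs)
    sims = [[cosine(img, txt) / temperature for txt in text_embs] for img in image_embs]
    image_to_text = sum(cross_entropy_row(sims[i], i) for i in range(n)) / n
    text_to_image = sum(
        cross_entropy_row([sims[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2

# Made-up 3-dimensional embeddings for two image-text pairs.
image_embs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9]]  # rose photo, tulip photo
text_embs = [[1.0, 0.0, 0.1], [0.1, 0.1, 1.0]]   # "a rose", "a tulip"

aligned = contrastive_loss(image_embs, text_embs)
shuffled = contrastive_loss(image_embs, text_embs[::-1])
print(f"aligned pairs: {aligned:.4f}, mismatched pairs: {shuffled:.4f}")
```

The loss is low when each image embedding sits closest to its own caption and high when the pairs are swapped, which is the pressure that pushes matching images and texts toward similar vectors.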

Multimodal Large Language Models (MLLMs)

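GPT-4V and Gemini Vision are multimodal large language models (MLLMs) that integrate several data types, including images, text, speech, and audio. Large language models (LLMs) such as GPT-3, BERT, and RoBERTa perform well on text-based tasks but struggle to understand and process other types of data. Multimodal models address this limitation by combining modalities, which gives them a fuller understanding of mixed data.

Multimodal large language models go beyond the traditional text-only approach. Models such as GPT-4 can process several data types, including images and text, and so understand information more completely.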

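Combining with RAG

Here we use CLIP to embed the images and the text and store these embeddings in a ChromaDB vector database. A large model then takes part in the user's chat session, drawing on the retrieved information.

We will use images from Kaggle and information from Wikipedia to build a flower-expert chatbot.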
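
The plan above can be outlined before touching any real model. The following is a minimal sketch, assuming a stand-in `toy_embed` function in place of CLIP and a hypothetical `InMemoryVectorStore` in place of ChromaDB; it only shows how embedding, storage, and retrieval fit together.

```python
import math

class InMemoryVectorStore:
    """Hypothetical stand-in for a vector database such as ChromaDB."""

    def __init__(self, embed):
        self.embed = embed  # function mapping an item to a vector
        self.items = []     # list of (id, vector, payload) tuples

    def add(self, ids, payloads):
        """Embed each payload and store it under its id."""
        for item_id, payload in zip(ids, payloads):
            self.items.append((item_id, self.embed(payload), payload))

    def query(self, item, n_results=1):
        """Return the payloads closest to `item` by Euclidean distance."""
        target = self.embed(item)
        ranked = sorted(self.items, key=lambda entry: math.dist(entry[1], target))
        return [payload for _, _, payload in ranked[:n_results]]

def toy_embed(text):
    """Toy embedding (an assumption): counts of a few flower names in the text."""
    lowered = text.lower()
    return [lowered.count("rose"), lowered.count("tulip"), lowered.count("daisy")]

text_store = InMemoryVectorStore(toy_embed)
text_store.add(
    ids=["id0", "id1"],
    payloads=["The rose is a woody perennial.", "The tulip grows from a bulb."],
)

# In the real system the query would be an image embedded by CLIP.
retrieved = text_store.query("a photo of a tulip", n_results=1)
llm_input = f"Context: {retrieved[0]}\nQuestion: How do I care for this flower?"
print(llm_input)
```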

First, install the packages:

!pip install -q timm einops wikipedia chromadb open_clip_torch
!pip install -q transformers==4.36.0
!pip install -q bitsandbytes==0.41.3 accelerate==0.25.0

Preprocessing the data is simple: put the images and the text files together in one folder.
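
The article does not show the preprocessing code, so the following is only a guess at what such a step might look like. The `build_data_folder` helper, its arguments, and the file-naming scheme are assumptions; `fetch_summary` stands for whatever returns the description text for a flower name.

```python
import shutil
import tempfile
from pathlib import Path

def build_data_folder(images_by_label, fetch_summary, out_dir):
    """Copy images into out_dir and write one .txt description per label.

    images_by_label: dict mapping a flower name to a list of image paths.
    fetch_summary:   callable returning a text description for a flower name.
    Returns the sorted names of the files now present in out_dir.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    for label, image_paths in images_by_label.items():
        for index, image_path in enumerate(image_paths):
            suffix = Path(image_path).suffix
            shutil.copy(image_path, out_path / f"{label}_{index}{suffix}")
        (out_path / f"{label}.txt").write_text(fetch_summary(label), encoding="utf-8")
    return sorted(p.name for p in out_path.iterdir())

# Demo with a throwaway directory and a fake image file.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "photo.jpg"
    source.write_bytes(b"not a real image")
    created = build_data_folder(
        {"rose": [source]},
        lambda name: f"Placeholder description of {name}.",
        Path(tmp) / "all_data",
    )
print(created)
```

In a notebook, `fetch_summary` could plausibly be `wikipedia.summary` from the `wikipedia` package installed earlier, with the Kaggle image paths grouped by flower name.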

You can use any vector database; here we use ChromaDB.

import chromadb
from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction
from chromadb.utils.data_loaders import ImageLoader
from chromadb.config import Settings

# Persist the database in a local folder named "DB"
client = chromadb.PersistentClient(path="DB")

# CLIP embeds both images and text into the same vector space
embedding_function = OpenCLIPEmbeddingFunction()
image_loader = ImageLoader()  # required when loading images from URIs

ChromaDB can also take a custom embedding function, which has the following shape:

from chromadb import Documents, EmbeddingFunction, Embeddings

class MyEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        # Embed the documents (or images) here and return one vector per item
        return embeddings
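
As an illustration of that interface, here is a toy embedding function with the same `__call__(input)` shape. It is a sketch under loud assumptions: the `HashingEmbeddingFunction` name and the hashing scheme are invented for this example, the vectors carry no semantic meaning, and to plug it into ChromaDB it would need to subclass `EmbeddingFunction` as in the skeleton above.

```python
import hashlib

class HashingEmbeddingFunction:
    """Toy embedder: maps each document to a fixed-length vector of hashed word counts."""

    def __init__(self, dimensions=8):
        self.dimensions = dimensions

    def __call__(self, input):
        embeddings = []
        for document in input:
            vector = [0.0] * self.dimensions
            for word in document.lower().split():
                # Hash the word to pick one of the vector's slots
                digest = hashlib.sha256(word.encode("utf-8")).digest()
                vector[digest[0] % self.dimensions] += 1.0
            embeddings.append(vector)
        return embeddings

embed = HashingEmbeddingFunction(dimensions=8)
vectors = embed(["red rose", "rose red", "yellow tulip"])
print(len(vectors), len(vectors[0]))
```

The key point is the contract: a list of inputs goes in, and a list of equal-length numeric vectors comes out, one per input.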

Here we create two collections, one for images and one for text.

import os

collection_images = client.create_collection(
    name='multimodal_collection_images',
    embedding_function=embedding_function,
    data_loader=image_loader,
)

collection_text = client.create_collection(
    name='multimodal_collection_text',
    embedding_function=embedding_function,
)

# Collect the image paths (every file in the folder that is not a .txt file)
IMAGE_FOLDER = '/kaggle/working/all_data'

image_uris = sorted(
    os.path.join(IMAGE_FOLDER, image_name)
    for image_name in os.listdir(IMAGE_FOLDER)
    if not image_name.endswith('.txt')
)
ids = [str(i) for i in range(len(image_uris))]

# The data loader reads each URI and the embedding function embeds the image
collection_images.add(ids=ids, uris=image_uris)

With CLIP, we can retrieve images using text like this:

from matplotlib import pyplot as plt

# Ask for the three images closest to the word "tulip"
retrieved = collection_images.query(query_texts=["tulip"], include=['data'], n_results=3)
for img in retrieved['data'][0]:
    plt.imshow(img)
    plt.axis("off")
    plt.show()

An image can also be used to retrieve related images.

The text collection is built as follows:

# Now the text collection
from chromadb.utils import embedding_functions
default_ef = embedding_functions.DefaultEmbeddingFunction()

# Collect the .txt files from the same folder
text_pth = sorted(
    os.path.join(IMAGE_FOLDER, image_name)
    for image_name in os.listdir(IMAGE_FOLDER)
    if image_name.endswith('.txt')
)

# Read each file into memory
list_of_text = []
for text in text_pth:
    with open(text, 'r') as f:
        text = f.read()
    list_of_text.append(text)

ids_txt_list = ['id' + str(i) for i in range(len(list_of_text))]

collection_text.add(documents=list_of_text, ids=ids_txt_list)

Then query the text collection with a text prompt:

results = collection_text.query(query_texts=["What is the bellflower?"], n_results=1)
results

The result is:

{'ids': [['id0']],'distances': [[0.6072186183744086]],'metadatas': [[None]],'embeddings': None,'documents': [['Campanula () is the type genus of the Campanulaceae family of flowering plants. Campanula are commonly known as bellflowers and take both their common and scientific names from the bell-shaped flowers—campanula is Latin for "little bell".\nThe genus includes over 500 species and several subspecies, distributed across the temperate and subtropical regions of the Northern Hemisphere, with centers of diversity in the Mediterranean region, Balkans, Caucasus and mountains of western Asia. The range also extends into mountains in tropical regions of Asia and Africa.\nThe species include annual, biennial and perennial plants, and vary in habit from dwarf arctic and alpine species under 5 cm high, to large temperate grassland and woodland species growing to 2 metres (6 ft 7 in) tall.']],'uris': None,'data': None}
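
The query result is a dictionary of lists nested one level per query, which is why the code in this article indexes it with `[0][0]`. A small helper can flatten it; `unpack_first_query` is a hypothetical name, and the sample dictionary below is a shortened copy of the output shown above.

```python
def unpack_first_query(result):
    """Return (id, distance, document) tuples for the first query in a result dict."""
    ids = result["ids"][0]
    distances = result["distances"][0]
    documents = result["documents"][0]
    return list(zip(ids, distances, documents))

# Shortened copy of the result printed above.
sample_result = {
    "ids": [["id0"]],
    "distances": [[0.6072186183744086]],
    "metadatas": [[None]],
    "embeddings": None,
    "documents": [["Campanula () is the type genus of the Campanulaceae family of flowering plants."]],
    "uris": None,
    "data": None,
}

hits = unpack_first_query(sample_result)
for doc_id, distance, document in hits:
    print(f"{doc_id} (distance {distance:.3f}): {document[:40]}...")
```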

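Or retrieve text using an image:

import numpy as np
from PIL import Image

query_image = '/kaggle/input/flowers/flowers/rose/00f6e89a2f949f8165d5222955a5a37d.jpg'
raw_image = Image.open(query_image)

# Embed the image itself (as a NumPy array) rather than its path string
query_embedding = embedding_function([np.array(raw_image)])

doc = collection_text.query(
    query_embeddings=query_embedding,
    n_results=1,
)['documents'][0][0]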

The text retrieved for this image is:

A rose is either a woody perennial flowering plant of the genus Rosa (), in the family Rosaceae (), or the flower it bears. There are over three hundred species and tens of thousands of cultivars. They form a group of plants that can be erect shrubs, climbing, or trailing, with stems that are often armed with sharp prickles. Their flowers vary in size and shape and are usually large and showy, in colours ranging from white through yellows and reds. Most species are native to Asia, with smaller numbers native to Europe, North America, and northwestern Africa. Species, cultivars and hybrids are all widely grown for their beauty and often are fragrant. Roses have acquired cultural significance in many societies. Rose plants range in size from compact, miniature roses, to climbers that can reach seven meters in height. Different species hybridize easily, and this has been used in the development of the wide range of garden roses.

That completes the matching of text and images. Everything so far has been done by CLIP; next we add the LLM.

from huggingface_hub import hf_hub_download

# Download the custom model code that ships with the checkpoint
repo_id = "visheratin/LLaVA-3b"
for filename in [
    "configuration_llava.py",
    "configuration_phi.py",
    "modeling_llava.py",
    "modeling_phi.py",
    "processing_llava.py",
]:
    hf_hub_download(repo_id=repo_id, filename=filename, local_dir="./", force_download=True)

We use visheratin/LLaVA-3b:

from modeling_llava import LlavaForConditionalGeneration
import torch

# Load the weights and move the model to the GPU
model = LlavaForConditionalGeneration.from_pretrained("visheratin/LLaVA-3b")
model = model.to("cuda")

Load the tokenizer:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("visheratin/LLaVA-3b")

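Then define the processor so that it is easy to call later:

from processing_llava import LlavaProcessor, OpenCLIPImageProcessor

# The processor pairs image preprocessing with text tokenization
image_processor = OpenCLIPImageProcessor(model.config.preprocess_config)
processor = LlavaProcessor(image_processor, tokenizer)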

Now everything can be used directly:

import numpy as np
from PIL import Image

question = 'Answer with organized answers: What type of rose is in the picture? Mention some of its characteristics and how to take care of it ?'

query_image = '/kaggle/input/flowers/flowers/rose/00f6e89a2f949f8165d5222955a5a37d.jpg'
raw_image = Image.open(query_image)

# Retrieve the article closest to the query image (embed the image, not its path)
doc = collection_text.query(
    query_embeddings=embedding_function([np.array(raw_image)]),
    n_results=1,
)['documents'][0][0]

# Show the query image, then the similar images from the collection
plt.imshow(raw_image)
plt.show()
imgs = collection_images.query(query_uris=query_image, include=['data'], n_results=3)
for img in imgs['data'][0][1:]:  # skip the first hit
    plt.imshow(img)
    plt.axis("off")
    plt.show()

The retrieved results contain most of the information we need.

That completes the integration. The last step is to create the chat template:

prompt = """system
A chat between a curious human and an artificial intelligence assistant.
The assistant is an expert in flowers, and gives helpful, detailed, and polite answers to the human's questions.
The assistant does not hallucinate and pays very close attention to the details.
user
<image>
{question}
Use the following article as an answer source. Do not write outside its scope unless you find your answer better
{article}
if you think your answer is better add it after document.
assistant
""".format(question=question, article=doc)
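
Wrapping the template in a function makes it reusable for every turn of the conversation. This is a sketch only: `build_prompt`, the `max_article_chars` cut-off, and the shortened template wording are assumptions, not part of the original code.

```python
PROMPT_TEMPLATE = (
    "system\n"
    "The assistant is an expert in flowers and answers from the given article.\n"
    "user\n"
    "<image>\n"
    "{question}\n"
    "Use the following article as an answer source.\n"
    "{article}\n"
    "assistant\n"
)

def build_prompt(question, article, max_article_chars=2000):
    """Fill the template, trimming long articles to keep the prompt short."""
    trimmed = article[:max_article_chars]
    return PROMPT_TEMPLATE.format(question=question, article=trimmed)

prompt = build_prompt(
    question="What type of rose is in the picture?",
    article="A rose is a woody perennial flowering plant of the genus Rosa.",
)
print(prompt)
```

Passing the question and the retrieved article as arguments avoids the easy mistake of formatting a literal string into the template instead of the variable.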

We will not go into detail here on how to build the chat loop; the complete code is available at:

//m.sbmmt.com/link/71eee742e4c6e094e6af364597af3f05
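
Because the chat loop is left out here, the following is a rough sketch of how one turn might be organized, not the author's implementation. `chat_turn`, `fake_retrieve`, and `fake_generate` are hypothetical names; in the real system the retrieval step would query the ChromaDB text collection and the generation step would call the LLaVA model.

```python
def chat_turn(question, image_path, retrieve_article, generate_answer, history):
    """Run one round: retrieve context for the image, ask the model, record the turn."""
    article = retrieve_article(image_path)
    prompt = (
        f"Article: {article}\n"
        f"Previous turns: {len(history)}\n"
        f"Question: {question}"
    )
    answer = generate_answer(prompt, image_path)
    history.append({"question": question, "answer": answer})
    return answer

# Stubs standing in for the vector database and the multimodal LLM.
def fake_retrieve(image_path):
    return "A rose is a woody perennial of the genus Rosa."

def fake_generate(prompt, image_path):
    return f"(model reply based on {len(prompt)} prompt characters)"

history = []
for user_question in ["What flower is this?", "How do I care for it?"]:
    reply = chat_turn(user_question, "rose.jpg", fake_retrieve, fake_generate, history)
    print(f"user: {user_question}\nassistant: {reply}")
```

Keeping retrieval and generation as injected callables lets the loop be tested with stubs before the GPU model is loaded.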
