How to Use Beautiful Soup to Extract Data from the Public Web

WBOY

Published: 2024-08-02 09:20:53

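Beautiful Soup is a Python library for extracting data from web pages. It builds a parse tree from an HTML or XML document, which makes it straightforward to pull out the information you need.

Beautiful Soup provides several key features for web scraping:

Navigating the parse tree: you can move through the parse tree and search it for elements, tags, and attributes.
Modifying the parse tree: you can add, remove, and update tags and attributes.
Output formatting: the parse tree can be converted back to a string, so modified content is easy to save.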
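
These three capabilities can be sketched on a small inline document. The HTML snippet, its tag names, and its class names are invented for illustration; the calls used (find, find_all, attribute assignment, new_tag, decompose, str) are standard Beautiful Soup methods.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration
html = """
<div id="posts">
  <h2 class="article-title"><a href="/a">First post</a></h2>
  <h2 class="article-title"><a href="/b">Second post</a></h2>
  <p class="ad">Sponsored</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Navigate: search by tag and class, then read text and attributes
titles = [h2.get_text(strip=True)
          for h2 in soup.find_all("h2", class_="article-title")]
first_link = soup.find("a")["href"]

# Modify: update an attribute, remove a tag, and add a new one
soup.find("a")["href"] = "/updated"
soup.find("p", class_="ad").decompose()
note = soup.new_tag("p")
note.string = "2 posts"
soup.find("div", id="posts").append(note)

# Output: convert the modified tree back to a string
result = str(soup)
print(titles)
print(result)
```

Because the modified tree serializes with str(), the result can be written straight to a file or passed to another tool.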

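To use Beautiful Soup, install the library itself. A parser is also required: html.parser ships with Python's standard library and needs no installation, while lxml is a separate package. Both Beautiful Soup and lxml can be installed with pip: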

# Install Beautiful Soup and the lxml parser using pip.
pip install beautifulsoup4 lxml
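
Since lxml is optional, a script may prefer it when present and otherwise fall back to the built-in parser. The helper below is a minimal sketch of that idea; make_soup is a hypothetical name, not part of Beautiful Soup. It relies on Beautiful Soup raising FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml when available, else the built-in parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; html.parser ships with Python itself
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>hello</p>")
print(soup.p.get_text())
```

The two parsers can build slightly different trees for malformed HTML, so it may be safer to pin one parser once a scraper depends on the exact structure.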

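Handling Pagination

When a website spreads its content across multiple pages, handling pagination is essential to scraping all of the data.

Identify the pagination structure: inspect the site to see how pagination works (for example, a "next page" button or numbered links).
Iterate over pages: use a loop to visit each page and scrape its data.
Update the URL or parameters: modify the URL or query parameters to fetch the next page's content.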

import requests
from bs4 import BeautifulSoup

base_url = 'https://example-blog.com/page/'
page_number = 1
all_titles = []

while True:
    # Construct the URL for the current page
    url = f'{base_url}{page_number}'
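    response = requests.get(url, timeout=10)
    if response.status_code != 200:
        break  # Stop when the page is missing (for example, a 404)
    soup = BeautifulSoup(response.content, 'html.parser')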

    # Find all article titles on the current page
    titles = soup.find_all('h2', class_='article-title')
    if not titles:
        break  # Exit the loop if no titles are found (end of pagination)

    # Extract and store the titles
    for title in titles:
        all_titles.append(title.get_text())

    # Move to the next page
    page_number += 1

# Print all collected titles
for title in all_titles:
    print(title)
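
The loop above assumes numbered URLs. When a site exposes only a "next page" link, the scraper can follow that link instead. The sketch below makes several assumptions: the link is marked up as an a tag with class "next", scrape_titles is a hypothetical helper name, and fetch_html is a caller-supplied function, so the example runs here against two fake pages rather than a live site.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def scrape_titles(start_url, fetch_html, max_pages=50):
    """Follow "next" links from start_url and collect article titles.

    fetch_html is a caller-supplied function that returns the HTML for a
    URL, or None when the page cannot be fetched.
    """
    titles, url, seen = [], start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)  # Guard against pagination loops
        html = fetch_html(url)
        if html is None:
            break
        soup = BeautifulSoup(html, "html.parser")
        titles += [h2.get_text(strip=True)
                   for h2 in soup.find_all("h2", class_="article-title")]
        # Assumed markup: <a class="next" href="...">; adjust per site
        next_link = soup.find("a", class_="next", href=True)
        url = urljoin(url, next_link["href"]) if next_link else None
    return titles


# Two fake pages standing in for a real site
pages = {
    "https://example-blog.com/page/1":
        '<h2 class="article-title">A</h2><a class="next" href="/page/2">Next</a>',
    "https://example-blog.com/page/2":
        '<h2 class="article-title">B</h2>',
}
print(scrape_titles("https://example-blog.com/page/1", pages.get))
```

With requests, fetch_html could wrap requests.get(url, timeout=10) and return response.text when the status code is 200. The seen set and max_pages cap keep the loop from running forever if a site links pages in a cycle.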

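Extracting Nested Data

Sometimes the data you need is nested inside several layers of tags. Nested extraction works as follows.

Navigate to the parent tag: find the parent tag that contains the nested data.
Extract the nested tags: within each parent tag, find and extract the nested tags.
Iterate over the nested tags: loop over them to pull out the information you need.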

import requests
from bs4 import BeautifulSoup

url = 'https://example-blog.com/post/123'
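response = requests.get(url, timeout=10)
response.raise_for_status()  # Stop on HTTP errors instead of parsing an error page
soup = BeautifulSoup(response.content, 'html.parser')

# Find the comments section (find() returns None if it is absent)
comments_section = soup.find('div', class_='comments')

# Extract individual comments, if the section exists
comments = comments_section.find_all('div', class_='comment') if comments_section else []

for comment in comments:
    # Extract author and content from each comment
    author = comment.find('span', class_='author')
    content = comment.find('p', class_='content')
    if author is None or content is None:
        continue  # Skip comments missing either tag
    print(f'Author: {author.get_text()}\nContent: {content.get_text()}\n')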
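
The same nested extraction can also be written with CSS selectors through select and select_one, which express the parent-child relationship in a single pattern. The inline HTML below is hypothetical and simply mirrors the class names used above; collecting results into a list of dictionaries is one possible design, not the only one.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mirroring the class names used above
html = """
<div class="comments">
  <div class="comment">
    <span class="author">Ann</span><p class="content">Great post.</p>
  </div>
  <div class="comment">
    <p class="content">No author here.</p>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

records = []
# "div.comments div.comment" matches comment divs nested in the section
for comment in soup.select("div.comments div.comment"):
    author = comment.select_one("span.author")
    content = comment.select_one("p.content")
    records.append({
        # Fall back to None when a nested tag is missing
        "author": author.get_text(strip=True) if author else None,
        "content": content.get_text(strip=True) if content else None,
    })

print(records)
```

Keeping the records as dictionaries makes it simple to write them to CSV or JSON later, and the None fallback keeps one malformed comment from stopping the whole run.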

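Handling AJAX Requests

Many modern websites load data dynamically with AJAX. Because that data is not in the initial HTML, scraping it calls for a different technique: use the browser's developer tools to monitor network requests, then replicate those requests in your scraper.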

import requests

# URL to the API endpoint providing the AJAX data
ajax_url = 'https://example.com/api/data?page=1'
response = requests.get(ajax_url, timeout=10)
response.raise_for_status()  # Fail early on HTTP errors
data = response.json()

# Extract and print data from the JSON response
for item in data['results']:
    print(item['field1'], item['field2'])
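
AJAX endpoints are often paginated as well, as the page=1 parameter above suggests. The sketch below walks such an endpoint page by page until it returns no items. It assumes the 'results' key from the example above and treats an empty page as the end of the data; collect_results and get_json are hypothetical names, and a real API may signal the last page differently. The fake responses let the example run without a network connection.

```python
def collect_results(get_json, first_page=1, max_pages=100):
    """Gather items from a paginated JSON endpoint.

    get_json is a caller-supplied function mapping a page number to the
    decoded JSON body (a dict). The 'results' key follows the example
    above; the real key depends on the API being replicated.
    """
    items = []
    for page in range(first_page, first_page + max_pages):
        data = get_json(page)
        batch = data.get("results", [])
        if not batch:
            break  # An empty page is treated as the end of the data
        items.extend(batch)
    return items


# Fake responses standing in for the real endpoint
fake_api = {
    1: {"results": [{"field1": "a", "field2": 1}]},
    2: {"results": [{"field1": "b", "field2": 2}]},
}
items = collect_results(lambda page: fake_api.get(page, {"results": []}))
print(items)
```

With requests, get_json could be a small function that calls requests.get('https://example.com/api/data', params={'page': page}, timeout=10) and returns response.json().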
