Scrape but Validate: Data scraping with Pydantic Validation

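Note: Not an output of ChatGPT/LLM

Data scraping is the process of collecting data from public web sources, and it is mostly done in an automated way using scripts. Because of the automation, the collected data often contains errors and has to be filtered and cleaned before use. It is better, however, if the scraped data can be validated while it is being scraped.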

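Because data validation is such a common requirement, most scraping frameworks, such as Scrapy, have built-in patterns that can be used for it. However, data scraping is often done with only general-purpose modules such as requests and BeautifulSoup. In those cases it is hard to validate the collected data, so this blog post explains a simple approach to data scraping with validation using Pydantic.
https://docs.pydantic.dev/latest/
Pydantic is a data validation module for Python. It is also the backbone of the popular API framework FastAPI. There are other Python modules, similar to Pydantic, that can be used for validation during data scraping. This blog explores Pydantic, but here is a link to an alternative package (as a learning exercise, you can try swapping Pydantic for another module):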

Cerberus is a lightweight and extensible data validation library for Python. https://pypi.org/project/Cerberus/
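Before moving on to scraping, it may help to see Pydantic validation in isolation. The following is a minimal sketch, assuming Pydantic v2 is installed. The Item model, its fields, and the failed_fields helper are hypothetical names used only for illustration and are not part of the scraper in this post.

```python
from pydantic import BaseModel, ValidationError


class Item(BaseModel):
    # Hypothetical model for illustration only
    name: str
    price: float


def failed_fields(raw: dict) -> list:
    """Return the names of the fields that fail validation (empty if valid)."""
    try:
        Item(**raw)
    except ValidationError as exc:
        # Each error entry carries the location (field name) of the failure
        return [str(err["loc"][0]) for err in exc.errors()]
    return []


# A numeric string is accepted for a float field and converted
good = Item(name="notebook", price="4.50")
print(good.price)

# A non-numeric price and a missing name are both reported
print(failed_fields({"name": "pen", "price": "free"}))
print(failed_fields({"price": 1.0}))
```

The point of the sketch is that a model either yields a clean object or raises a ValidationError naming the offending fields, which is the behaviour the scraping steps rely on.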

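Plan of scraping:

In this blog, we will scrape quotes from a quotes site.
We will use requests and BeautifulSoup to get the data, create a pydantic data class to validate each scraped item, and save the filtered and validated data to a json file.

For better organization and understanding, each step is implemented as a Python method that can be used in the main section.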

Basic imports

import requests # for web request
from bs4 import BeautifulSoup # cleaning html content

# pydantic for validation

from pydantic import BaseModel, field_validator, ValidationError

import json


1. Target site and getting the quotes

I am using http://quotes.toscrape.com/ to scrape the quotes. Each quote has three fields: quote_text, author, and tags.

The method below is a generic script that gets the HTML content for a given URL.

def get_html_content(page_url: str) -> str:
    page_content = ""
    # Send a GET request to the website
    response = requests.get(page_url)
    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Decoded HTML text (BeautifulSoup would also accept response.content)
        page_content = response.text
    else:
        page_content = f'Failed to retrieve the webpage. Status code: {response.status_code}'
    return page_content
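One caveat with the method above is that a failed request returns the error message in place of the HTML, so later steps would try to parse that message as a page. The sketch below shows one possible alternative that raises an exception instead. It is not the method used in this post: the name fetch_html, the timeout default, and the optional session parameter are assumptions introduced for illustration.

```python
import requests


def fetch_html(page_url: str, timeout: float = 10.0, session=None) -> str:
    """Return the page HTML, raising requests.HTTPError on a 4xx/5xx response."""
    # Use the given session if there is one, otherwise the requests module itself
    http = session or requests
    response = http.get(page_url, timeout=timeout)
    # Turn error status codes into an exception instead of a string
    response.raise_for_status()
    return response.text
```

A caller could wrap fetch_html("http://quotes.toscrape.com/") in a try/except for requests.RequestException and skip the page when it fails, so that only real HTML reaches the parser.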


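2. Getting the quote data from scraping

We use requests and BeautifulSoup to scrape the data from the given URL. The process is divided into three parts: 1) get the HTML content from the web, 2) extract the required HTML tags for each target field, 3) get the value from each tag.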
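Parts 2 and 3 can be tried without any network access. The following sketch runs them on a small inline HTML string. SAMPLE_HTML is made-up markup that only imitates the structure assumed in this post (a div with class quote containing text, author, and tags elements), and extract_quotes is a hypothetical name.

```python
from bs4 import BeautifulSoup

# Made-up markup that imitates one quote block
SAMPLE_HTML = """
<div class="quote">
  <span class="text">“A sample quote.”</span>
  <small class="author">Jane Doe</small>
  <div class="tags">
    <a class="tag">life</a>
    <a class="tag">books</a>
  </div>
</div>
"""


def extract_quotes(html: str) -> list:
    """Extract the tags for each field, then read the value of each tag."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for block in soup.find_all("div", class_="quote"):
        tags_div = block.find("div", class_="tags")
        records.append({
            "quote_text": block.find("span", class_="text").get_text(),
            "author": block.find("small", class_="author").get_text(),
            "tags": [link.get_text() for link in tags_div.find_all("a")],
        })
    return records


print(extract_quotes(SAMPLE_HTML))
```

Working against a fixed string like this makes it easy to check the selectors before pointing the scraper at the live site.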


The method below parses the HTML content and returns the div of every quote on the page.

def get_quotes_div(html_content: str) -> list:
    # Parse the page content with BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')

    # Find all the quotes on the page
    quotes = soup.find_all('div', class_='quote')

    return quotes

The script below gets the data points from each quote's div.

    # Loop through each quote and extract the text, author and tags
    for quote in quotes_div:
        quote_text = quote.find('span', class_='text').get_text()
        author = quote.find('small', class_='author').get_text()
        tags = get_tags(quote.find('div', class_='tags'))

        # collect the data in a dictionary
        quote_temp = {'quote_text': quote_text,
                      'author': author,
                      'tags': tags
        }

def get_tags(tags):
    # Collect the text of every tag link inside the quote's tags div
    tags = [tag.get_text() for tag in tags.find_all('a')]
    return tags

3. Create a Pydantic data class and validate the data for each quote

Following the fields of a quote, create a pydantic class and use that same class for data validation during scraping.

Pydantic model Quote

Below is the Quote class, extended from BaseModel, with three fields: quote_text, author, and tags. Of these, quote_text and author are of string (str) type and tags is a list.

There are two validator methods (with decorators).

1) tags_more_than_two(): checks that a quote has more than two tags. (This is just an example; any rule can go here.)

2) check_quote_text(): removes the surrounding “ ” quotation marks from the quote text.

class Quote(BaseModel):
    quote_text: str
    author: str
    tags: list

    @field_validator('tags')
    @classmethod
    def tags_more_than_two(cls, tags_list: list) -> list:
        if len(tags_list) <= 2:
            raise ValueError("There should be more than two tags.")
        return tags_list

    @field_validator('quote_text')
    @classmethod
    def check_quote_text(cls, quote_text: str) -> str:
        return quote_text.removeprefix('“').removesuffix('”')

Getting and validating data

Data validation is very easy with pydantic. For example, the code below passes the scraped data to the pydantic class Quote.

quote_data = Quote(**quote_temp)

4. Store the data

Once the data is validated, it is saved to a json file. (A generic method is written that converts a Python dictionary to a json file.)
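The generic dictionary-to-JSON method for the storage step is not shown here, so the following is only a sketch of what such a helper might look like, using the standard library alone. The name save_to_json, its parameters, and the choice to return the record count are assumptions made for illustration.

```python
import json
from pathlib import Path


def save_to_json(records: list, file_path: str) -> int:
    """Write a list of dictionaries to a JSON file and return how many were written."""
    path = Path(file_path)
    with path.open("w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps non-ASCII characters readable in the file
        json.dump(records, handle, ensure_ascii=False, indent=2)
    return len(records)
```

Since model_dump() on a validated Quote returns a plain dictionary, a list of those dictionaries can be passed to a helper like this directly.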

Putting it all together

Now that each part of the scraping is understood, you can put everything together and run the scraper to collect data.

def get_quotes_data(quotes_div: list) -> list:
    quotes_data = []

    # Loop through each quote and extract the text, author and tags
    for quote in quotes_div:
        quote_text = quote.find('span', class_='text').get_text()
        author = quote.find('small', class_='author').get_text()
        tags = get_tags(quote.find('div', class_='tags'))

        # collect the data in a dictionary
        quote_temp = {'quote_text': quote_text,
                      'author': author,
                      'tags': tags
        }

        # validate data with Pydantic model; keep only the records that pass
        try:
            quote_data = Quote(**quote_temp)
            quotes_data.append(quote_data.model_dump())
        except ValidationError as e:
            print(e.json())
    return quotes_data
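The method above prints each validation error as JSON and moves on. If the rejected records also need to be kept for inspection, one option is to split the input into valid and rejected lists. The sketch below assumes Pydantic v2. Quote here is a trimmed copy of the model with only the tags rule, and split_valid_rejected is a hypothetical helper name, not part of the original scraper.

```python
from pydantic import BaseModel, ValidationError, field_validator


class Quote(BaseModel):
    # Trimmed copy of the model, kept here so the sketch is self-contained
    quote_text: str
    author: str
    tags: list

    @field_validator("tags")
    @classmethod
    def tags_more_than_two(cls, tags_list: list) -> list:
        if len(tags_list) <= 2:
            raise ValueError("There should be more than two tags.")
        return tags_list


def split_valid_rejected(raw_records: list) -> tuple:
    """Return (valid, rejected); rejected entries carry the failure reasons."""
    valid, rejected = [], []
    for raw in raw_records:
        try:
            valid.append(Quote(**raw).model_dump())
        except ValidationError as exc:
            # errors() lists one entry per failed field with its location and message
            reasons = [f"{err['loc'][0]}: {err['msg']}" for err in exc.errors()]
            rejected.append({"record": raw, "reasons": reasons})
    return valid, rejected
```

Keeping the rejected records next to their reasons makes it easier to judge whether a rule such as the tag count is too strict for the site being scraped.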


Note: A revision is planned. Please let me know your ideas and suggestions to include in the revised version.

Links and resources: