如何使用 Selenium 抓取受登录保护的网站（分步指南）-Python教程-PHP中文网

How to Scrape Login-Protected Websites with Selenium (Step by Step Guide)

我抓取受密码保护的网站的步骤：

捕获 HTML 表单元素：用户名 ID、密码 ID 和登录按钮类
- 使用 requests 或 Selenium 等工具自动登录：填写用户名，等待，填写密码，等待，点击登录
- 存储会话 cookie 以进行身份验证
- 继续抓取经过身份验证的页面

免责声明：我已在 https://www.scrapewebapp.com/ 上为此特定用例构建了一个 API。因此，如果您想快速完成它，请使用它，否则请继续阅读。

让我们使用这个例子：假设我想从我的帐户 https://www.scrapewebapp.com/ 中抓取我自己的 API 密钥。在此页面上：https://app.scrapewebapp.com/account/api_key

1. 登录页面

首先，您需要找到登录页面。如果您尝试访问登录后的页面，大多数网站都会给您重定向 303，因此如果您尝试直接抓取 https://app.scrapewebapp.com/account/api_key，您将自动获取登录页面 https:// app.scrapewebapp.com/login。因此，如果尚未提供，这是自动查找登录页面的好方法。

好的，现在我们有了登录页面，我们需要找到添加用户名或电子邮件以及密码和实际登录按钮的位置。最好的方法是创建一个简单的脚本，使用类型“电子邮件”、“用户名”、“密码”查找输入的 ID，并查找类型为“提交”的按钮。我在下面为您编写了代码：

from bs4 import BeautifulSoup


def extract_login_form(html_content: str):
    """
    Extracts the login form elements from the given HTML content and returns their CSS selectors.
    """
    soup = BeautifulSoup(html_content, "html.parser")

    # Finding the username/email field
    username_email = (
        soup.find("input", {"type": "email"})
        or soup.find("input", {"name": "username"})
        or soup.find("input", {"type": "text"})
    )  # Fallback to input type text if no email type is found

    # Finding the password field
    password = soup.find("input", {"type": "password"})

    # Finding the login button
    # Searching for buttons/input of type submit closest to the password or username field
    login_button = None

    # First try to find a submit button within the same form
    if password:
        form = password.find_parent("form")
        if form:
            login_button = form.find("button", {"type": "submit"}) or form.find(
                "input", {"type": "submit"}
            )
    # If no button is found in the form, fall back to finding any submit button
    if not login_button:
        login_button = soup.find("button", {"type": "submit"}) or soup.find(
            "input", {"type": "submit"}
        )

    # Extracting CSS selectors
    def generate_css_selector(element, element_type):
        if "id" in element.attrs:
            return f"#{element['id']}"
        elif "type" in element.attrs:
            return f"{element_type}[type='{element['type']}']"
        else:
            return element_type

    # Generate CSS selectors with the updated logic
    username_email_css_selector = None
    if username_email:
        username_email_css_selector = generate_css_selector(username_email, "input")

    password_css_selector = None
    if password:
        password_css_selector = generate_css_selector(password, "input")

    login_button_css_selector = None
    if login_button:
        login_button_css_selector = generate_css_selector(
            login_button, "button" if login_button.name == "button" else "input"
        )

    return username_email_css_selector, password_css_selector, login_button_css_selector


def main(html_content: str):
    # Call the extract_login_form function and return its result
    return extract_login_form(html_content)

登录后复制

2。使用 Selenium 实际登录

现在您需要创建一个 selenium webdriver。我们将使用 chrome headless 来通过 Python 运行它。安装方法如下：

# Install selenium and chromium

!pip install selenium
!apt-get update 
!apt install chromium-chromedriver

!cp /usr/lib/chromium-browser/chromedriver /usr/bin
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')

登录后复制

然后实际登录我们的网站并保存 cookie。我们将保存所有 cookie，但您只能根据需要保存身份验证 cookie。

# Imports
from selenium import webdriver
from selenium.webdriver.common.by import By
import requests
import time

# Set up Chrome options
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

# Initialize the WebDriver
driver = webdriver.Chrome(options=chrome_options)

try:
    # Open the login page
    driver.get("https://app.scrapewebapp.com/login")

    # Find the email input field by ID and input your email
    email_input = driver.find_element(By.ID, "email")
    email_input.send_keys("******@gmail.com")

    # Find the password input field by ID and input your password
    password_input = driver.find_element(By.ID, "password")
    password_input.send_keys("*******")

    # Find the login button and submit the form
    login_button = driver.find_element(By.CSS_SELECTOR, "button[type='submit']")
    login_button.click()

    # Wait for the login process to complete
    time.sleep(5)  # Adjust this depending on your site's response time


finally:
    # Close the browser
    driver.quit()

登录后复制

3. 存储 Cookie

就像通过 driver.getcookies() 函数将它们保存到字典中一样简单。

def save_cookies(driver):
    """Save cookies from the Selenium WebDriver into a dictionary."""
    cookies = driver.get_cookies()
    cookie_dict = {}
    for cookie in cookies:
        cookie_dict[cookie['name']] = cookie['value']
    return cookie_dict

登录后复制

从 WebDriver 保存 cookie

cookie = save_cookies(驱动程序)

4. 从我们登录的会话中获取数据

在这部分中，我们将使用简单的库请求，但您也可以继续使用 selenium。

现在我们想从此页面获取实际的 API：https://app.scrapewebapp.com/account/api_key。

因此，我们从请求库创建一个会话并将每个 cookie 添加到其中。然后请求 URL 并打印响应文本。

def scrape_api_key(cookies):
    """Use cookies to scrape the /account/api_key page."""
    url = 'https://app.scrapewebapp.com/account/api_key'

    # Set up the session to persist cookies
    session = requests.Session()

    # Add cookies from Selenium to the requests session
    for name, value in cookies.items():
        session.cookies.set(name, value)

    # Make the request to the /account/api_key page
    response = session.get(url)

    # Check if the request is successful
    if response.status_code == 200:
        print("API Key page content:")
        print(response.text)  # Print the page content (could contain the API key)
    else:
        print(f"Failed to retrieve API key page, status code: {response.status_code}")

登录后复制

5. 获取您想要的实际数据（奖励）

我们得到了我们想要的页面文本，但是有很多我们不关心的数据。我们只想要 api_key。

最好、最简单的方法是使用像 ChatGPT（GPT4o 模型）这样的人工智能。

这样提示模型：“您是一名专家抓取工具，您只会提取从上下文中询问的信息。我需要来自 {context} 的 api-key 值”

from bs4 import BeautifulSoup


def extract_login_form(html_content: str):
    """
    Extracts the login form elements from the given HTML content and returns their CSS selectors.
    """
    soup = BeautifulSoup(html_content, "html.parser")

    # Finding the username/email field
    username_email = (
        soup.find("input", {"type": "email"})
        or soup.find("input", {"name": "username"})
        or soup.find("input", {"type": "text"})
    )  # Fallback to input type text if no email type is found

    # Finding the password field
    password = soup.find("input", {"type": "password"})

    # Finding the login button
    # Searching for buttons/input of type submit closest to the password or username field
    login_button = None

    # First try to find a submit button within the same form
    if password:
        form = password.find_parent("form")
        if form:
            login_button = form.find("button", {"type": "submit"}) or form.find(
                "input", {"type": "submit"}
            )
    # If no button is found in the form, fall back to finding any submit button
    if not login_button:
        login_button = soup.find("button", {"type": "submit"}) or soup.find(
            "input", {"type": "submit"}
        )

    # Extracting CSS selectors
    def generate_css_selector(element, element_type):
        if "id" in element.attrs:
            return f"#{element['id']}"
        elif "type" in element.attrs:
            return f"{element_type}[type='{element['type']}']"
        else:
            return element_type

    # Generate CSS selectors with the updated logic
    username_email_css_selector = None
    if username_email:
        username_email_css_selector = generate_css_selector(username_email, "input")

    password_css_selector = None
    if password:
        password_css_selector = generate_css_selector(password, "input")

    login_button_css_selector = None
    if login_button:
        login_button_css_selector = generate_css_selector(
            login_button, "button" if login_button.name == "button" else "input"
        )

    return username_email_css_selector, password_css_selector, login_button_css_selector


def main(html_content: str):
    # Call the extract_login_form function and return its result
    return extract_login_form(html_content)

登录后复制

如果您想要一个简单可靠的 API 来实现这一切，请尝试我的新产品 https://www.scrapewebapp.com/

如果你喜欢这篇文章，请给我鼓掌并关注我。确实有很大帮助！

以上是如何使用 Selenium 抓取受登录保护的网站（分步指南）的详细内容。更多信息请关注PHP中文网其他相关文章！