Python Web Scraping with Login: Example
1. Use requests.Session() to send the POST request and maintain a session when logging in to a site. 2. Parse the login page with BeautifulSoup to extract hidden fields such as the CSRF token. 3. Build the login payload from the username, password, and token, then submit it. 4. Verify the login succeeded by checking the response for markers like "Logout" or "Dashboard". 5. Once logged in, use the same session to visit protected pages and scrape the content you need. 6. For pages rendered dynamically with JavaScript, switch to Selenium and drive a real browser instead. 7. Always respect the site's robots.txt and terms of use, avoid hardcoding credentials in production (store sensitive values in environment variables), and make sure the scraping stays legal and compliant.
If you need to scrape a website that requires login, you'll typically need to send a POST request with your credentials first, maintain the session, and then access the protected pages. Here's a practical example using Python's `requests` and `BeautifulSoup` libraries to log in and scrape a page behind authentication.

We'll use a dummy login form structure (like many real sites) and show how to handle it.
✅ 1. Required Libraries
Install the needed packages if you haven't:

```bash
pip install requests beautifulsoup4
```
✅ 2. Example: Login and Scrape a Page
```python
import requests
from bs4 import BeautifulSoup

# Step 1: Start a session so cookies persist across requests
session = requests.Session()

# Step 2: URLs (placeholders - replace with the actual site's URLs)
login_url = 'https://example.com/login'
target_url = 'https://example.com/dashboard'  # Page you want to scrape after login

# Step 3: Get the login page (to extract hidden form fields like CSRF tokens if needed)
login_page = session.get(login_url)
soup = BeautifulSoup(login_page.content, 'html.parser')

# Optional: Extract hidden inputs (e.g., CSRF token); adjust the field name as needed
csrf_input = soup.find('input', {'name': 'csrf_token'})
csrf_token = csrf_input['value'] if csrf_input else None

# Step 4: Prepare the login payload
payload = {
    'username': 'your_username',
    'password': 'your_password',
}
if csrf_token:
    payload['csrf_token'] = csrf_token  # Include only if the form has one

# Step 5: Submit the login form
response = session.post(login_url, data=payload)

# Step 6: Check if login was successful
if "Logout" in response.text or "Dashboard" in response.text:
    print("✅ Login successful")
else:
    print("❌ Login failed")
    print(response.status_code)
    print(response.text[:500])  # Debug output
    raise SystemExit(1)

# Step 7: Scrape a protected page with the same session
protected_page = session.get(target_url)
soup = BeautifulSoup(protected_page.content, 'html.parser')

# Example: Extract the page title or specific content
print("Page Title:", soup.title.string)

# Or scrape data (adjust the selector for the target site)
data = soup.find_all('div', class_='content')
for item in data:
    print(item.get_text(strip=True))
```
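One practical wrinkle: many sites reject the default `python-requests` User-Agent before you ever reach the login form, so it is often worth giving the session browser-like headers up front. A minimal sketch, assuming nothing about the target site (the header values are just examples):

```python
import requests

session = requests.Session()

# Browser-like headers applied to every request this session makes.
# The User-Agent string is only an example; any realistic one works.
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Accept-Language': 'en-US,en;q=0.9',
})

resp = session.get('https://example.com/login')
```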
✅ 3. Key Points to Remember
- Session Persistence: Use `requests.Session()` to keep cookies and stay logged in.
- Inspect the Login Form: Use browser DevTools (F12) to:
  - Find the correct login URL (the form's `action` attribute)
  - Check the input field names (e.g., `username`, `email`, `password`, `csrf_token`)
- CSRF & Hidden Fields: Many sites require tokens; always check for hidden inputs (a generic sketch follows this list).
- HTTPS & Security: Never hardcode credentials in production. Use environment variables:

```python
import os

username = os.getenv('LOGIN_USER')
password = os.getenv('LOGIN_PASS')
```

- Respect `robots.txt` and Terms of Service: scraping may be prohibited.
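Hidden field names vary from site to site, so rather than hardcoding `csrf_token`, one generic approach is to seed the payload with every hidden input the form already carries and layer your credentials on top. A sketch along those lines (the URL and credential field names are placeholders):

```python
import requests
from bs4 import BeautifulSoup

session = requests.Session()
resp = session.get('https://example.com/login')  # placeholder URL
soup = BeautifulSoup(resp.text, 'html.parser')
form = soup.find('form')

# Seed the payload with every hidden input already present in the form
payload = {
    inp['name']: inp.get('value', '')
    for inp in form.find_all('input', type='hidden')
    if inp.get('name')
}

# Then add your credentials (field names depend on the site)
payload['username'] = 'your_username'
payload['password'] = 'your_password'
```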
✅ 4. Real-World Example (Generic Pattern)
```python
import requests
from bs4 import BeautifulSoup
import os

session = requests.Session()

# Load credentials from the environment
USER = os.getenv('USERNAME', 'test@example.com')
PASS = os.getenv('PASSWORD', 'secret')

# Fetch the login page
resp = session.get('https://example.com/login')
soup = BeautifulSoup(resp.text, 'html.parser')

# Extract the CSRF token (a Rails-style field name)
token = soup.find('input', {'name': 'authenticity_token'})['value']

# Login data
data = {
    'authenticity_token': token,
    'user[email]': USER,
    'user[password]': PASS,
    'commit': 'Log in'
}

# Post to the login endpoint
r = session.post('https://example.com/sessions', data=data)

# Now scrape
dashboard = session.get('https://example.com/my-account')
```
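This snippet doesn't verify the login, and string matching on the body (as in the first example) is brittle. Since most sites redirect away from the login page on success, and `requests` follows redirects by default, checking the final URL is a reasonable extra signal; a small sketch, continuing from the `r` above:

```python
# After the POST, landing somewhere other than the login page usually means success
if r.ok and 'login' not in r.url:
    print('Logged in, landed on:', r.url)
else:
    print('Login may have failed; still on:', r.url)
```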
✅ 5. Alternative: Use Selenium for JavaScript-heavy Sites
If the login is handled by JavaScript (e.g., React, Vue), use Selenium:
```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/login")

driver.find_element(By.NAME, "username").send_keys("your_user")
driver.find_element(By.NAME, "password").send_keys("your_pass")
driver.find_element(By.XPATH, "//button[@type='submit']").click()

# Wait, then go to the target page
driver.implicitly_wait(5)
driver.get("https://example.com/profile")

print(driver.page_source)
driver.quit()
```
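`implicitly_wait` is a blunt instrument: it sets a global polling timeout rather than waiting for anything specific. If the post-login page loads asynchronously, an explicit wait on a known element is usually more reliable. A sketch assuming the logged-in page contains an element with id `dashboard` (a hypothetical marker; substitute whatever the real page renders):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com/login")

driver.find_element(By.NAME, "username").send_keys("your_user")
driver.find_element(By.NAME, "password").send_keys("your_pass")
driver.find_element(By.XPATH, "//button[@type='submit']").click()

# Block (up to 10s) until a post-login element appears; 'dashboard' is a placeholder id
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "dashboard"))
)

print(driver.page_source)
driver.quit()
```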
Basically, for simple forms a requests `Session` works great; for dynamic sites, go with Selenium. Always test on a small scale and check the site's policies.
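For the policy check, Python's standard library can do part of the work: `urllib.robotparser` reads a site's robots.txt and answers whether a given user agent may fetch a URL. A minimal sketch (the URLs and agent name are placeholders):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

# can_fetch() returns True if the rules allow this agent to fetch the URL
if rp.can_fetch('my-scraper/1.0', 'https://example.com/dashboard'):
    print('Allowed by robots.txt')
else:
    print('Disallowed; do not scrape this URL')
```

Note that robots.txt covers crawling etiquette only; the site's terms of service still apply.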
