tags. For example:
<div role="rowheader">
<a href="/owner/repo/tree/main/folder-name">folder-name</a>
</div>
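As a quick sanity check, the selector can be exercised against a static copy of that snippet (a sketch; the live GitHub markup may differ from this structure):

```python
from bs4 import BeautifulSoup

# Static copy of the markup shown above (a hypothetical folder-listing entry).
html = """
<div role="rowheader">
  <a href="/owner/repo/tree/main/folder-name">folder-name</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
links = soup.select('div[role="rowheader"] a')
print(links[0].text.strip())  # → folder-name
print(links[0]["href"])       # → /owner/repo/tree/main/folder-name
```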
3. Implementing the Scraper
3.1. Recursive Crawl Function
The script crawls folders recursively and prints their structure. To limit recursion depth and avoid unnecessary load, we use a depth parameter.
import requests
from bs4 import BeautifulSoup
import time

def crawl_github_folder(url, depth=0, max_depth=3):
    """
    Recursively crawls a GitHub repository folder structure.

    Parameters:
    - url (str): URL of the GitHub folder to scrape.
    - depth (int): Current recursion depth.
    - max_depth (int): Maximum depth to recurse.
    """
    if depth > max_depth:
        return

    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        print(f"Failed to access {url} (Status code: {response.status_code})")
        return

    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract folder and file links
    items = soup.select('div[role="rowheader"] a')
    for item in items:
        item_name = item.text.strip()
        item_url = f"https://github.com{item['href']}"
        if '/tree/' in item_url:
            print(f"{'  ' * depth}Folder: {item_name}")
            crawl_github_folder(item_url, depth + 1, max_depth)
        elif '/blob/' in item_url:
            print(f"{'  ' * depth}File: {item_name}")

# Example usage
if __name__ == "__main__":
    repo_url = "https://github.com/<owner>/<repo>/tree/<branch>/<folder>"
    crawl_github_folder(repo_url)
4. How It Works
- Request headers: a User-Agent string mimics a browser to avoid being blocked.
- Recursive crawling: the function detects folders (/tree/) and recurses into them, and lists files (/blob/) without descending further.
- Indentation: reflects the folder hierarchy in the output.
- Depth limit: setting a maximum depth (max_depth) prevents excessive recursion.
5. Enhancements
These enhancements improve the crawler's functionality and reliability. They address common challenges such as exporting results, handling errors, and avoiding rate limits, keeping the tool efficient and user-friendly.
5.1. Exporting Results
Save the output to a structured JSON file for easier consumption.
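One lightweight approach (a sketch, with a hypothetical hard-coded structure standing in for real crawl results) is to collect folders as nested dicts, mark files with a sentinel value, and dump the whole thing with the standard json module:

```python
import json

# Hypothetical crawl result: folders become nested dicts, files map to "file".
structure = {
    "src": {
        "main.py": "file",
        "utils": {"helpers.py": "file"},
    },
    "README.md": "file",
}

# Write the nested structure to disk as pretty-printed JSON.
with open("output.json", "w") as fh:
    json.dump(structure, fh, indent=2)
```

Loading the file back with json.load gives the same nested dict, which makes the output easy to post-process in other tools.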
5.2. Error Handling
Add robust error handling for network failures and unexpected HTML changes:
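A minimal sketch, assuming the requests library from the script above: wrap the request in try/except so DNS failures, timeouts, and non-2xx responses are reported instead of crashing the crawl (fetch_page is a hypothetical helper name, not part of the original script):

```python
import requests

def fetch_page(url, timeout=10):
    """Return the page body, or None on any network error or bad status."""
    headers = {"User-Agent": "Mozilla/5.0"}
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()  # raises HTTPError for 4xx/5xx statuses
    except requests.exceptions.RequestException as exc:
        # Covers ConnectionError, Timeout, HTTPError, and other request failures.
        print(f"Failed to access {url}: {exc}")
        return None
    return response.text
```

The crawler can then simply skip any URL for which fetch_page returns None; unexpected HTML changes show up as empty select() results rather than exceptions.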
5.3. Rate Limiting
To avoid being rate limited by GitHub, introduce delays between requests:
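One simple way to do this (a sketch; the one-second default is an assumption, not a documented GitHub threshold) is a small wrapper that pauses after every call, which the crawler can apply to its fetch function:

```python
import time

def throttled(func, delay=1.0):
    """Return a version of func that sleeps for `delay` seconds after each call."""
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        time.sleep(delay)
        return result
    return wrapper

# Example: slow a (stand-in) fetch function down to roughly one call per second.
slow_fetch = throttled(lambda url: f"fetched {url}", delay=1.0)
```

Equivalently, a plain time.sleep(1) right after requests.get(...) inside crawl_github_folder has the same effect.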
6. Ethical Considerations
Written by Shpetim Haxhiu, an expert in software automation and ethical programming, this section ensures best practices are followed when using the GitHub crawler.
- Compliance: adhere to GitHub's Terms of Service.
- Minimize load: respect GitHub's servers by limiting requests and adding delays.
- Permissions: obtain permission before extensively crawling private repositories.
7. Complete Code
Here is the consolidated script with all features included:
import json

import requests
from bs4 import BeautifulSoup

def crawl_to_json(url, depth=0, max_depth=3):
    """Crawls and saves results as JSON."""
    result = {}
    if depth > max_depth:
        return result

    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        print(f"Failed to access {url}")
        return result

    soup = BeautifulSoup(response.text, 'html.parser')
    items = soup.select('div[role="rowheader"] a')
    for item in items:
        item_name = item.text.strip()
        item_url = f"https://github.com{item['href']}"
        if '/tree/' in item_url:
            # Folders become nested dicts via recursion.
            result[item_name] = crawl_to_json(item_url, depth + 1, max_depth)
        elif '/blob/' in item_url:
            # Files are marked with a sentinel value.
            result[item_name] = "file"
    return result

if __name__ == "__main__":
    repo_url = "https://github.com/<owner>/<repo>/tree/<branch>/<folder>"
    structure = crawl_to_json(repo_url)
    with open("output.json", "w") as file:
        json.dump(structure, file, indent=2)
    print("Repository structure saved to output.json")
By following this detailed guide, you can build a robust GitHub folder crawler. The tool can be adapted to many needs while remaining ethically compliant.
Feel free to leave a comment below! Also, don't forget to connect with me:
- Email: shpetim.h@gmail.com
- LinkedIn: linkedin.com/in/shpetimhaxhiu
- GitHub: github.com/shpetimhaxhiu