Upload a CSV file of URLs through an HTML page and use Flask to read the URLs you want to crawl

Question

I currently need to build a web-based system that can upload a CSV file containing a list of URLs. After the upload, the system reads the URLs line by line and passes them on to the next step, the crawl. The crawl requires logging in to the website first. I already have the source code for the login. The problem is that I want to connect an HTML page named "upload_page.html" to a Flask file named "upload_csv.py". Where should the login and scraping code go in the Flask file? upload_page.html<d

P粉207969787 · Answer

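import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# The code below runs inside the Flask view that handles the upload form
# (it also needs `from flask import request`)
csv_file = request.files['file']
# Load the CSV data into a DataFrame
df = pd.read_csv(csv_file)
final_data = []
# Initialize the web driver
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")
driver = webdriver.Chrome(options=chrome_options)
# Log in once before the loop; the same driver keeps the session afterwards
# Replace this with your own login code
driver.get("https://example.com/login")
# find_element_by_name was removed in Selenium 4; use find_element(By.NAME, ...)
username_field = driver.find_element(By.NAME, "username")
password_field = driver.find_element(By.NAME, "password")
username_field.send_keys("myusername")
password_field.send_keys("mypassword")
password_field.send_keys(Keys.RETURN)
# Wait for the login to complete
WebDriverWait(driver, 10).until(EC.url_changes("https://example.com/login"))
# Loop over the rows in the DataFrame and scrape each link
for index, row in df.iterrows():
    link = row['Link']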
    # Scrape the website
    driver.get(link)
    start = time.time()
    # Vertical scroll target in pixels, advanced on every pass of the loop
    finalScroll = 1000

    while True:
        # window.scrollTo(x, y) takes absolute coordinates, so keep x at 0
        # and move only the vertical offset
        driver.execute_script(f"window.scrollTo(0, {finalScroll})")
        finalScroll += 1000

        # Pause for 2 seconds so that lazily loaded data can appear
        time.sleep(2)
        end = time.time()
        # Stop scrolling after roughly 20 seconds
        if round(end - start) > 20:
            break
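
Because the snippet above starts from request.files['file'], it presumably belongs inside the Flask view that receives the upload form's POST request. The following is only a minimal sketch of how upload_csv.py might be laid out, not the answerer's own code. The route paths "/" and "/upload", the helper names read_links and scrape_links, and the use of the standard csv module in place of pandas are all assumptions made for illustration. It also assumes that upload_page.html lives in a templates folder and that its form posts with enctype="multipart/form-data" and a file input named "file".

```python
import csv
import io

from flask import Flask, render_template, request

app = Flask(__name__)


def read_links(file_storage, column="Link"):
    """Return the non-empty values of `column` from an uploaded CSV file."""
    # Uploaded files arrive as bytes; decode them before parsing
    # ("utf-8-sig" also tolerates a byte-order mark written by Excel).
    text = file_storage.read().decode("utf-8-sig")
    reader = csv.DictReader(io.StringIO(text))
    return [
        row[column].strip()
        for row in reader
        if (row.get(column) or "").strip()
    ]


def scrape_links(links):
    """Hypothetical placeholder: the Selenium login and scraping code goes here."""
    return [{"link": link} for link in links]


@app.route("/")
def upload_page():
    # Assumes templates/upload_page.html exists next to this file
    return render_template("upload_page.html")


@app.route("/upload", methods=["POST"])
def upload_csv():
    # "file" must match the name attribute of the form's file input
    uploaded = request.files.get("file")
    if uploaded is None or uploaded.filename == "":
        return {"error": "no CSV file uploaded"}, 400
    links = read_links(uploaded)
    return {"results": scrape_links(links)}
```

Keeping the scraping in its own function leaves the view short and lets the CSV parsing be checked without starting a browser. Since a 20-second scroll per link runs well beyond a typical request, moving scrape_links into a background job may be worth considering.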