I'm trying to crawl the titles of all tables from this URL: https://www.nature.com/articles/s41586-023-06192-4
I can find this HTML element on the website:
<b id="Tab1" data-test="table-caption">Table 1 Calculated Ct–M–Ct angles</b>
I cannot scrape this title because the element cannot be found. Even when I print the rendered HTML to the console, the element is not there.
I use the following code to print the HTML source:
from requests_html import HTMLSession
from bs4 import BeautifulSoup

url = 'https://www.nature.com/articles/s41586-023-06192-4'
session = HTMLSession()
response = session.get(url)
response.html.render()  # execute the page's JavaScript before parsing
soup = BeautifulSoup(response.html.raw_html.decode('utf-8'), 'html.parser')
print(soup.prettify())
Here is my scraping function using BeautifulSoup:
def get_tables(driver):
    tables = []
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    for i in range(1, 11):
        try:
            table_caption = soup.find('b', {'id': f'Tab{i}', 'data-test': 'table-caption'})
            table_text = table_caption.text if table_caption else "Not Available"
            if table_text != "Not Available":
                print(f"Found table {i}: {table_text}")
            else:
                print(f"Table {i} not found.")
            tables.append(table_text)
        except Exception as e:
            print(f"Error processing table {i}: {str(e)}")
            tables.append("Not Available")
    return tables
Here is my scraping function using Selenium:
from selenium.webdriver.common.by import By

def get_tables(driver):
    tables = []
    for i in range(1, 11):
        try:
            # find_element raises NoSuchElementException when nothing matches,
            # so the missing-table case is handled by the except block below
            table_caption = driver.find_element(By.CSS_SELECTOR, f'b#Tab{i}[data-test="table-caption"]')
            table_text = table_caption.text
            print(f"Found table {i}: {table_text}")
            tables.append(table_text)
        except Exception as e:
            print(f"Error processing table {i}: {str(e)}")
            tables.append("Not Available")
    return tables
I have tried both Selenium and BeautifulSoup to scrape the website. I've checked for iframes, and I delayed the fetch operation by 40 seconds to make sure the page loaded completely. Even GPT-4 cannot solve this problem.
The code you used looks fine. The likely problem is that the website loads the element you want to scrape via JavaScript or an XHR call, so when you send the request with the requests library, the element is simply not present in the response.
The way to solve this is to use Selenium: open the website with Selenium, let it render, and then load the page source into BeautifulSoup. After that, your parsing code should work normally.
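A minimal sketch of that approach: put the parsing in a small helper that takes the rendered page source, so it works the same whether the HTML comes from Selenium or anywhere else. The function name `extract_table_captions` is my own choice for illustration; the selector mirrors the one in your question.

```python
from bs4 import BeautifulSoup

def extract_table_captions(html: str) -> list[str]:
    """Return the text of every <b data-test="table-caption"> element in the page."""
    soup = BeautifulSoup(html, "html.parser")
    return [b.get_text(strip=True)
            for b in soup.find_all("b", {"data-test": "table-caption"})]

# Usage with Selenium (after the page has fully rendered), for example:
# driver.get("https://www.nature.com/articles/s41586-023-06192-4")
# captions = extract_table_captions(driver.page_source)
```

Feeding `driver.page_source` to BeautifulSoup only after rendering is the key step; parsing the raw response body from requests will miss anything injected by JavaScript.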
Note: load the page source into bs4 only once the entire page has finished loading. You will also need to create a login function using Selenium, as this website requires a login to view content.