Retrieving Links from Web Pages with Python and BeautifulSoup
This article demonstrates how to extract the links from a web page and collect their URLs using Python and the BeautifulSoup library.
Problem:
How do you extract the URLs of links embedded in a webpage using Python?
Solution:
To achieve this, you can use BeautifulSoup together with its SoupStrainer class, which restricts parsing to the elements you specify. The following code snippet illustrates the process:
import httplib2
from bs4 import BeautifulSoup, SoupStrainer

http = httplib2.Http()
# http.request returns a (response headers, body) pair
status, response = http.request('http://www.nytimes.com')

# Parse only the <a> tags and print the href of each one that has it
for link in BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        print(link['href'])
This code fetches the page at 'http://www.nytimes.com' with httplib2. BeautifulSoup then parses the returned HTML, and the SoupStrainer('a') filter passed as parse_only restricts parsing to 'a' tags, the elements that represent links. For each tag that has an 'href' attribute, the code prints that attribute's value, which is the link's URL. These values are printed exactly as they appear in the page, so some may be relative paths rather than full URLs.
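The same idea can be tried without a network connection by parsing an HTML string directly. The following is a minimal sketch, not part of the original solution: the helper name extract_links, the sample markup, and the base URL 'https://example.com' are all assumptions made for illustration. It also resolves relative links with urljoin from the standard library, since 'href' values are often relative paths.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup, SoupStrainer


def extract_links(html, base_url):
    """Return absolute URLs for every <a> tag in html that has an href.

    extract_links is a hypothetical helper written for this example.
    """
    # Restrict parsing to <a> tags, as in the snippet above.
    soup = BeautifulSoup(html, 'html.parser', parse_only=SoupStrainer('a'))
    # href=True keeps only the tags that actually carry an href attribute.
    # urljoin resolves relative paths against base_url and leaves
    # absolute URLs unchanged.
    return [urljoin(base_url, a['href']) for a in soup.find_all('a', href=True)]


# Made-up sample markup: one relative link, one anchor without href,
# and one absolute link nested inside a paragraph.
sample_html = """
<html><body>
  <a href="/section/world">World</a>
  <a name="top">No href here</a>
  <p>See <a href="https://example.org/about">about</a>.</p>
</body></html>
"""

links = extract_links(sample_html, 'https://example.com')
print(links)
# ['https://example.com/section/world', 'https://example.org/about']
```

Using find_all('a', href=True) instead of iterating over the parsed document directly means only tag objects are visited, so no has_attr check is needed.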