To read the text content of an HTML file: (1) load the HTML file; (2) parse the HTML; (3) extract the text using the text attribute or the get_text() method; (4) optionally clean the text (remove whitespace and special characters, convert to lowercase); (5) output the text (print it, write it to a file, etc.).
How to read text content in HTML files
To extract text content from an HTML file, you can use the following steps. The examples are written in Python and use the requests and Beautiful Soup (bs4) libraries:
1. Load the HTML file
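    import requests

    # Fetch the page over HTTP; the URL is a placeholder
    url = 'https://example.com'
    response = requests.get(url)
    response.raise_for_status()  # stop early on a 4xx/5xx response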
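Step 1 above fetches a page over HTTP. When the HTML is already saved on disk, a minimal sketch like the following could read it instead. The helper name load_html_file and the UTF-8 default are assumptions made for illustration, not part of the original steps.

```python
from pathlib import Path


def load_html_file(path, encoding="utf-8"):
    """Return the contents of a local HTML file as a string.

    Hypothetical helper: the name and the UTF-8 default are assumptions.
    """
    # errors="replace" turns undecodable bytes into U+FFFD instead of
    # raising UnicodeDecodeError, which suits best-effort text extraction.
    return Path(path).read_text(encoding=encoding, errors="replace")
```

The returned string can be passed to BeautifulSoup in place of response.text.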
2. Parse the HTML
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(response.text, 'html.parser')
3. Extract text content
There are two ways to extract text content:
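text attribute: returns all the text inside the tag and its descendants, with the markup itself removed.
    text = soup.text
get_text() method: returns the same text. The text attribute is shorthand for calling get_text() with no arguments; get_text() additionally accepts optional separator and strip arguments.
    text = soup.get_text()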
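As a quick check of how the two behave, the sketch below parses a small inline HTML string (made up for illustration) and compares the results. It also shows the separator and strip arguments of get_text(), which help when adjacent tags would otherwise run together.

```python
from bs4 import BeautifulSoup

# Sample markup invented for this illustration
html = "<html><body><h1>Title</h1><p>Hello <b>world</b>!</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Default call: text nodes are concatenated with nothing between them
plain = soup.get_text()
print(plain)               # TitleHello world!

# The text attribute gives the same result as get_text() with no arguments
print(soup.text == plain)  # True

# strip=True trims each text node; separator is placed between the nodes
spaced = soup.get_text(separator=" ", strip=True)
print(spaced)              # Title Hello world !
```

Note that the separator is inserted between every text node, including around inline tags such as b, which is why a space appears before the exclamation mark.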
4. Clean text content (optional)
If you need to further clean up text content, you can perform the following operations:
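Collapse whitespace. (Note that text.replace(' ', '') would delete every space and glue the words together, which is rarely what is wanted.)
    text = ' '.join(text.split())
Remove punctuation. string.punctuation covers ASCII punctuation only.
    import string
    text = text.translate(str.maketrans('', '', string.punctuation))
Convert to lowercase.
    text = text.lower()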
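The cleaning operations can be combined into one function. The sketch below is one possible arrangement, and the name clean_text is hypothetical. It removes punctuation before collapsing whitespace, because deleting a free-standing punctuation mark (for example a hyphen surrounded by spaces) would otherwise leave a double space behind.

```python
import string

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text):
    """Strip ASCII punctuation, collapse whitespace, and lowercase.

    Hypothetical helper written for illustration.
    """
    text = text.translate(_PUNCT_TABLE)
    # split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and drops leading/trailing whitespace.
    text = " ".join(text.split())
    return text.lower()


print(clean_text("Hello ,  World - again!\n"))  # hello world again
```

Whether punctuation and case should be removed at all depends on the use: they are often dropped for word counting or search indexing, and kept when the text is meant to be read.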
5. Output text content
You can output text content in a variety of ways:
print(text)
    # An explicit encoding avoids platform-dependent defaults
    with open('output.txt', 'w', encoding='utf-8') as f:
        f.write(text)
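Putting the steps together for an HTML file stored on disk, a minimal end-to-end sketch might look like this. The function name html_file_to_text, its parameters, and the choice of a single space as separator are assumptions made for illustration.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def html_file_to_text(src, dst, encoding="utf-8"):
    """Read HTML from src, write its visible text to dst, and return it.

    Hypothetical helper; names and defaults are assumptions.
    """
    html = Path(src).read_text(encoding=encoding)        # 1. load
    soup = BeautifulSoup(html, "html.parser")            # 2. parse
    text = soup.get_text(separator=" ", strip=True)      # 3. extract
    Path(dst).write_text(text, encoding=encoding)        # 5. output
    return text
```

A call such as html_file_to_text('page.html', 'output.txt') would then write the extracted text to output.txt; the optional cleaning step can be applied to the returned string before writing if needed.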