How Can I Efficiently Remove HTML Tags from Strings in Python?

Patricia Arquette
Release: 2024-12-06 22:47:11

Stripping HTML Tags from Strings in Python

In Python, you often need to remove HTML tags from a string so that only its text content remains, for example when processing pages fetched from the web. Let's explore a solution that uses only the standard library.

Suppose you retrieve HTML content using the mechanize library. Each line of the content mixes HTML tags with text. To extract only the text, we need to strip the tags.

One option is a custom function, strip_tags, built on the HTMLParser class. A subclass of HTMLParser overrides handle_data, which the parser calls for each run of text found between tags, and accumulates that text in a StringIO object. The tags themselves are never written to the buffer, so only the text is returned.

Here's the code snippet for Python 3:

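from io import StringIO
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    """HTML parser that keeps only the text found between tags."""
    def __init__(self):
        # super().__init__() already calls reset(). convert_charrefs=True
        # (the default since Python 3.5) decodes references such as &amp;.
        # The old "strict" attribute no longer exists and is not needed.
        super().__init__(convert_charrefs=True)
        self.text = StringIO()
    def handle_data(self, d):
        # Called by the parser for each run of text outside of tags.
        self.text.write(d)
    def get_data(self):
        return self.text.getvalue()

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # flush any text the parser is still buffering
    return s.get_data()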

For Python 2, use the following code:

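from HTMLParser import HTMLParser
from StringIO import StringIO

class MLStripper(HTMLParser):
    def __init__(self):
        # HTMLParser is an old-style class in Python 2, so super() cannot
        # be used here; reset() performs the parser's initialization.
        self.reset()
        self.text = StringIO()
    def handle_data(self, d):
        # Called for each run of text outside of tags. Character and entity
        # references (e.g. &amp;) are not passed here in Python 2; override
        # handle_entityref and handle_charref if you need to keep them.
        self.text.write(d)
    def get_data(self):
        return self.text.getvalue()

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # process any data still buffered by the parser
    return s.get_data()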

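Passing an HTML string to strip_tags returns its text content with the tags removed. Note that the text inside script and style elements is also delivered to handle_data, so it is kept in the result along with the visible text.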
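
As a rough illustration of the line-by-line scenario described earlier, the sketch below applies a strip_tags function to a few lines of markup. It repeats a compact Python 3 version of the stripper so that it runs on its own. The sample lines are hypothetical stand-ins for fetched page content and are not taken from any real mechanize response.

```python
from io import StringIO
from html.parser import HTMLParser


class MLStripper(HTMLParser):
    """Collects only the text seen between tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text = StringIO()

    def handle_data(self, d):
        self.text.write(d)

    def get_data(self):
        return self.text.getvalue()


def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # flush any text still buffered by the parser
    return s.get_data()


# Hypothetical sample lines (an assumption for illustration only).
lines = [
    "<p>Hello, <b>world</b>!</p>",
    '<a href="/about">About &amp; contact</a>',
    "<li>Item <i>one</i></li>",
]

for line in lines:
    print(strip_tags(line))
# Hello, world!
# About & contact
# Item one
```

A new MLStripper is created for every call, so text from one line never leaks into the next. The second line shows that &amp; is decoded to a plain ampersand because convert_charrefs is enabled.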

source:php.cn