Removing HTML Formatting from Strings in Python
Consider the task of extracting the text of an HTML document while discarding its formatting tags. For instance, an element such as <a>some text</a> should output only "some text", and <b>hello</b> should display "hello".
Solution
Python's standard library provides HTMLParser, which can be subclassed to collect only the text between tags and discard the markup itself:
For Python 3:
    from io import StringIO
    from html.parser import HTMLParser

    class MLStripper(HTMLParser):
        def __init__(self):
            super().__init__()
            self.reset()
            self.strict = False
            self.convert_charrefs = True
            self.text = StringIO()

        def handle_data(self, d):
            self.text.write(d)

        def get_data(self):
            return self.text.getvalue()

    def strip_tags(html):
        s = MLStripper()
        s.feed(html)
        return s.get_data()
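As a quick check of the Python 3 approach, the sketch below restates the stripper compactly so that it runs on its own, then applies it to two sample strings. The sample inputs are illustrative, and the call to close() is an addition that is not part of the code above; it is included on the assumption that the input may end in text the parser is still holding.

```python
from io import StringIO
from html.parser import HTMLParser


class MLStripper(HTMLParser):
    """Collects only the text that the parser reports between tags."""

    def __init__(self):
        # convert_charrefs=True asks the parser to decode references like &amp;
        super().__init__(convert_charrefs=True)
        self.text = StringIO()

    def handle_data(self, d):
        # Called for each run of text; tags themselves never reach this method.
        self.text.write(d)

    def get_data(self):
        return self.text.getvalue()


def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # addition in this sketch: flush any text still buffered
    return s.get_data()


print(strip_tags('<a href="#">some text</a>'))           # some text
print(strip_tags("<b>hello</b> &amp; <i>goodbye</i>"))   # hello & goodbye
```

Without close(), the parser may hold back trailing text that contains an unterminated "&", since it cannot yet tell whether a character reference is still arriving.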
For Python 2:
    from HTMLParser import HTMLParser
    from StringIO import StringIO

    class MLStripper(HTMLParser):
        def __init__(self):
            self.reset()
            self.text = StringIO()

        def handle_data(self, d):
            self.text.write(d)

        def get_data(self):
            return self.text.getvalue()

    def strip_tags(html):
        s = MLStripper()
        s.feed(html)
        return s.get_data()