[python tutorial] Web page text and content image extraction algorithm-Python Tutorial-php.cn


黄舟

Released: 2017-02-07 16:16:56


Regular-expression matching is usually used when crawling the content of a single website. However, page structures differ so much from site to site that a single regular expression can hardly match them all. The author of "General Web Page Text Extraction Algorithm Based on Line Block Distribution Function" surveyed the common methods for extracting article text from web pages, proposed an extraction algorithm based on line block distribution, and provided implementations in PHP, Java, and other languages. The algorithm rests on two observations: 1. Text area density: once all HTML tags are removed, the main text area has a higher character density and fewer runs of consecutive blank lines; 2. Line block length: content outside the main text is, on average, shorter within each tag (line block). The algorithm steps are as follows:

1. Remove all tags, including style sheets, JavaScript content, and so on, but retain the original line breaks (\n).
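As a rough illustration of step 1, the sketch below strips markup with regular expressions while leaving the newlines of the remaining text in place. The function name `strip_tags` and the exact patterns are assumptions made for this example, not the original author's code; a real crawler may prefer a proper HTML parser.

```python
import re


def strip_tags(html: str) -> str:
    """Remove markup from an HTML string but keep the remaining line breaks.

    Hypothetical helper for illustration only.
    """
    # Remove comments first, so a ">" inside a comment cannot confuse the tag pattern.
    html = re.sub(r"(?s)<!--.*?-->", "", html)
    # Remove <script> and <style> elements together with their content.
    html = re.sub(r"(?is)<(script|style)\b.*?>.*?</\1\s*>", "", html)
    # Remove every remaining tag; text between tags (including "\n") is untouched.
    return re.sub(r"(?is)<.*?>", "", html)


sample = (
    "<html><head><style>p {color: red}</style>\n"
    "<script>var a = 1;</script></head>\n"
    "<body><p>Hello</p>\n"
    "<p>World</p></body></html>"
)
text = strip_tags(sample)
```

Here `text` keeps the three line breaks of `sample`, so line numbers in the stripped text still correspond to lines in the source page.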


2. Split the page content into lines, define the line block $block_i$ as the text of lines $[i, i + blockSize]$ taken together, and compute the line block length as a function of the line number $i$.
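The distribution function in step 2 can be sketched as a sliding sum over per-line character counts. This example assumes a window of `block_size` consecutive lines starting at line `i`, counts only non-whitespace characters, and lets the window shrink at the end of the text; the name `block_lengths` and the default `block_size=3` are illustrative choices, not values taken from the paper.

```python
import re


def block_lengths(text: str, block_size: int = 3) -> list[int]:
    """Return the line block length for every line number i.

    Hypothetical helper: block i covers lines i .. i + block_size - 1.
    """
    # Length of each line, ignoring all whitespace.
    lengths = [len(re.sub(r"\s+", "", line)) for line in text.split("\n")]
    # Sliding sum; slices near the end are simply shorter.
    return [sum(lengths[i:i + block_size]) for i in range(len(lengths))]


stripped = "nav\n\n\nFirst paragraph\nSecond one\n\n\nfooter"
print(block_lengths(stripped))  # [3, 14, 23, 23, 9, 6, 6, 6]
```

The peak of this list marks the densest part of the page, which is what the next step looks for.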


3. The main text lies in the region around the longest line block: starting from that block, extend in both directions until the line block length drops to 0, and take the lines in between.
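A minimal sketch of step 3, under the same assumptions as before (window of `block_size` lines, non-whitespace character counts): find the longest line block, walk outward while neighbouring blocks are non-empty, and return the lines in that range. The function name `extract_main_text` is hypothetical, and the input is expected to be text whose tags have already been removed.

```python
import re


def extract_main_text(text: str, block_size: int = 3) -> str:
    """Return the lines around the longest line block (illustrative only)."""
    lines = [line.strip() for line in text.split("\n")]
    lengths = [len(re.sub(r"\s+", "", line)) for line in lines]
    blocks = [sum(lengths[i:i + block_size]) for i in range(len(lengths))]
    if max(blocks) == 0:
        return ""  # no text at all

    peak = blocks.index(max(blocks))
    # Walk left and right until the next block has length 0.
    start = peak
    while start > 0 and blocks[start - 1] > 0:
        start -= 1
    end = peak
    while end < len(blocks) - 1 and blocks[end + 1] > 0:
        end += 1

    # Drop blank lines that fall inside the selected range.
    return "\n".join(line for line in lines[start:end + 1] if line)


page_text = (
    "Home | About\n\n\n\n"
    "First paragraph of the article body.\n"
    "Second paragraph, also fairly long.\n"
    "\n\n\n"
    "Copyright"
)
body = extract_main_text(page_text)
```

In this sample the navigation line and the footer are each separated from the article by at least `block_size` blank lines, so the block length reaches 0 on both sides and only the two paragraphs end up in `body`.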


4. If you also need the images that appear in the text area, simply keep the <img> tags when removing tags in the first step.
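One possible way to realise step 4 is to exclude `<img>` from the tag-removal pattern with a negative lookahead and then read the `src` attributes from whatever text is kept. Both function names below are hypothetical, and the `src` pattern assumes quoted attribute values; this is a sketch, not the original implementation.

```python
import re


def strip_tags_keep_images(html: str) -> str:
    """Remove all markup except <img> tags (illustrative only)."""
    html = re.sub(r"(?s)<!--.*?-->", "", html)
    html = re.sub(r"(?is)<(script|style)\b.*?>.*?</\1\s*>", "", html)
    # The lookahead skips any tag whose name is "img".
    return re.sub(r"(?is)<(?!img\b).*?>", "", html)


def image_sources(text: str) -> list[str]:
    """Collect the quoted src values of the <img> tags left in the text."""
    return re.findall(r"""(?is)<img\b[^>]*?\bsrc\s*=\s*["']([^"']+)["']""", text)


html_sample = (
    "<div><p>Intro text</p>\n"
    '<img src="a.png" alt="A">\n'
    "<p>More text</p></div>"
)
kept = strip_tags_keep_images(html_sample)
sources = image_sources(kept)
```

Because the `<img>` tags stay on their original lines, they are selected together with the surrounding text when the main text range is cut out, and `image_sources` can then be applied to that range alone.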

