Detailed explanation of a perfect HTML parsing engine (Jumony)

Author: 零下一度 | Original | 2017-05-04 14:57:37 | 7201 views

Many people may think that existing HTML parsers are good enough, and that even simple regular expressions can meet the needs of manipulating HTML documents. It is true that the vast majority of HTML documents on the Internet largely conform to the XHTML specification, so parsing them does not require a powerful parser. But a powerful parser is one thing, and a perfect parser is another.

Jumony Core first of all provides a nearly perfect HTML parsing engine whose results come very close to those of a browser. Whether it is elements without end tags, elements with optional end tags, flag attributes (attributes without a value), or CSS selectors and styles: however a browser parses an HTML document, legal or not, Jumony parses it the same way. In other words, the result of Jumony's parsing matches the result of the browser's parsing, so you no longer have to worry about whether an HTML document can be recognized. If the browser can read it, Jumony can understand it.

There is only one step between powerful and perfect, but a perfect parser means you never have to care about the HTML source document.

The following is an incomplete list of features supported by the Jumony parser:

Attribute values in double quotes
Missing attribute value (with the equals sign present)
Whitespace before the attribute value

Besides parsing HTML from text, Jumony's API can fetch documents directly from the Internet for analysis and automatically detect the encoding from the HTTP headers:

new JumonyParser().LoadDocument( "m.sbmmt.com/" ).Find( ".post_item a.titlelnk" )
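
How an encoding might be chosen from an HTTP header can be sketched as follows. This is a minimal illustration and not Jumony's actual code: the names EncodingSketch and FromContentType are hypothetical, and the sketch relies only on the standard .NET types System.Net.Mime.ContentType and System.Text.Encoding.

```csharp
using System;
using System.Net.Mime;
using System.Text;

// Hypothetical helper, not part of Jumony.
static class EncodingSketch
{
    // Pick an encoding from a Content-Type header value such as
    // "text/html; charset=utf-8"; use the fallback when no usable
    // charset is present.
    public static Encoding FromContentType(string contentType, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(contentType))
            return fallback;
        try
        {
            string charset = new ContentType(contentType).CharSet;
            return string.IsNullOrEmpty(charset)
                ? fallback
                : Encoding.GetEncoding(charset);
        }
        catch (FormatException) { return fallback; }   // malformed header value
        catch (ArgumentException) { return fallback; } // unknown charset name
    }
}

static class Program
{
    static void Main()
    {
        // Charset given in the header.
        Console.WriteLine(EncodingSketch.FromContentType("text/html; charset=utf-8", Encoding.ASCII).WebName);
        // No charset in the header: the fallback is used.
        Console.WriteLine(EncodingSketch.FromContentType("text/html", Encoding.ASCII).WebName);
    }
}
```

A real crawler would usually also inspect the meta tags of the document when the header carries no charset; that step is left out here.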

HtmlAgilityPack, the open-source HTML parsing project that currently ranks second only to Jumony, has not been updated for a long time. After so many years, it still has problems parsing even the most basic elements.

2. CSS style setting support

Parsing HTML perfectly does not, by itself, bring much benefit. As mentioned above, most HTML documents can be handled by a second-rate parser, or even analyzed with simple regular expressions, so why do we need Jumony?

The answer is that an HTML engine is more than just parsing the DOM structure.

Consider this scenario: I need to set the display style of an element to none. In the browser, a simple element.style.display = "none" is enough. Now that we have obtained the DOM through the parser, do we really have to concatenate strings to set the style?

No. Jumony supports CSS style parsing and even recognizes some CSS shorthand rules. In Jumony, setting a style on an element is as simple as in the browser:

element.Style( "display", "none" )

Let's look at another example: what happens if we set padding-left: 0px on an element whose style already uses the padding shorthand?

In Jumony, the padding shorthand is automatically expanded into its individual properties, and padding-left receives the new value.
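
The shorthand expansion described above can be sketched in plain C#. This is not Jumony's implementation: the names PaddingSketch and ExpandPadding and the sample value 10px are hypothetical, and the sketch assumes only the standard CSS rule for one to four padding values (top, right, bottom, left).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Jumony.
static class PaddingSketch
{
    // Expand a "padding" shorthand (one to four values) into the four
    // longhand properties, following the CSS top/right/bottom/left rule.
    public static Dictionary<string, string> ExpandPadding(string shorthand)
    {
        string[] v = shorthand.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        if (v.Length < 1 || v.Length > 4)
            throw new FormatException("padding takes one to four values");

        string top = v[0];
        string right = v.Length > 1 ? v[1] : top;    // missing right copies top
        string bottom = v.Length > 2 ? v[2] : top;   // missing bottom copies top
        string left = v.Length > 3 ? v[3] : right;   // missing left copies right

        return new Dictionary<string, string>
        {
            ["padding-top"] = top,
            ["padding-right"] = right,
            ["padding-bottom"] = bottom,
            ["padding-left"] = left,
        };
    }
}

static class Program
{
    static void Main()
    {
        // Start from the shorthand, then override a single side.
        Dictionary<string, string> style = PaddingSketch.ExpandPadding("10px");
        style["padding-left"] = "0px";

        // Print in a fixed order so the output does not depend on
        // dictionary enumeration order.
        foreach (string name in new[] { "padding-top", "padding-right", "padding-bottom", "padding-left" })
            Console.WriteLine(name + ": " + style[name]);
    }
}
```

Expanding first and overriding afterwards keeps the other three sides intact, which is why a single padding-left assignment forces the shorthand to be split.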

3. CSS 3 selector support

CSS selectors are a popular query language in the HTML world. They are simple, powerful, and supported by many browsers. Jumony supports an almost complete set of CSS3 selectors (except runtime pseudo-classes and pseudo-elements). With the help of selectors, we can easily find the objects we are interested in within an HTML document. For example, to grab all the article titles on the Blog Park home page:

new JumonyParser().LoadDocument( "m.sbmmt.com/" ).Find( ".post_item a.titlelnk" )

Fetch, parse, select, all in one go. With just a little more code, we can print the captured data to the console:

 // Print the text of every matched title link
 foreach( var title in new JumonyParser().LoadDocument( "m.sbmmt.com/" ).Find( ".post_item a.titlelnk" ) )
  Console.WriteLine( title.InnerText() );

HTML parsing features supported by the Jumony parser:

Feature	Details
An isolated angle bracket is parsed as text	
Flag attributes (attributes without a value)	
Elements with a missing end tag	
Elements with optional end tags	"body", "colgroup", "dd", "dt", "head", "html", "li", "option", "p", "tbody", "td", "tfoot", "th", "thead", "tr"
Elements without end tags	"area", "base", "basefont", "br", "col", "frame", "hr", "img", "input", "isindex", "link", "meta", "param", "wbr", "bgsound", "spacer", "keygen"
CData elements (CDataElement)	"script", "style", "textarea", "title"
Preformatted elements	Content that begins with a space
Attribute values in single quotes	
HTML declaration (HTMLDeclaration)	

List of CSS3 selectors supported by Jumony:

Selector	Description
*	Select all elements
p a	Select descendant elements
p>a	Select child elements
p+a	Select adjacent elements
p~a	Select subsequent elements
[attr]	Attribute existence match
[attr=value]	Attribute value exact match
[attr~=value]	Attribute value approximate match
[attr^=value]	Attribute value starts-with match
[attr*=value]	Attribute value contains match
[attr$=value]	Attribute value ends-with match
[attr!=value]	Attribute value negative match
:not	Negation pseudo-class
:only-child	Only-child pseudo-class
:only-of-type	Only-of-type pseudo-class
:empty	Empty element pseudo-class
:nth-child	Structural pseudo-class
:nth-last-child	Structural pseudo-class
:nth-of-type	Structural pseudo-class
:nth-last-of-type	Structural pseudo-class
:first-child	Structural pseudo-class
:last-child	Structural pseudo-class
:first-of-type	Structural pseudo-class
:last-of-type	Structural pseudo-class
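
The structural pseudo-classes in the table, such as :nth-child, take an argument of the form an+b. The matching rule can be sketched in plain C# as below. It follows the standard CSS definition (an element at 1-based position i matches when i = a*n + b for some integer n >= 0); it is not taken from Jumony's source, and the names NthSketch and MatchesNth are hypothetical.

```csharp
using System;

// Hypothetical helper, not part of Jumony.
static class NthSketch
{
    // True when the 1-based position equals a*n + b for some integer n >= 0.
    public static bool MatchesNth(int a, int b, int position)
    {
        int diff = position - b;
        if (a == 0)
            return diff == 0;                 // plain ":nth-child(b)"
        // n = diff / a must be a whole, non-negative number.
        return diff % a == 0 && diff / a >= 0;
    }
}

static class Program
{
    static void Main()
    {
        // List which of the first six children match each argument.
        for (int position = 1; position <= 6; position++)
        {
            bool odd = NthSketch.MatchesNth(2, 1, position);         // 2n+1
            bool firstThree = NthSketch.MatchesNth(-1, 3, position); // -n+3
            Console.WriteLine(position + ": 2n+1=" + odd + ", -n+3=" + firstThree);
        }
    }
}
```

The same test serves :nth-last-child and the of-type variants; only the way the position is counted changes (from the end, or among siblings of the same element type).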

4. Powerful extensibility

Jumony Core 3 gives users the greatest possible extensibility. You can customize the HTML specification, implement your own parser, graft other DOM models onto the Jumony API, invent your own CSS selector pseudo-classes, or even switch to a different API style, such as jQuery style.

Jumony Core has many derivative projects, for tasks such as crawling websites, providing a jQuery-style API, developing websites, producing MHT files, and adding CSS selector support to HAP parsing results. All of these projects benefit from the extensibility of Jumony Core.
