Problem
Usually when we mention the use of XML, the most troublesome part is the verbosity of XML and the parsing speed of XML. This problem becomes particularly serious when large XML files need to be processed. What I mention here is the topic of how to optimize XML processing speed.
When we choose to process XML files, we generally have two options:
DOM, which is the W3C standard model, builds XML structural information in a tree shape , which provides interfaces and methods for traversing this tree.
SAX, a low-level parser, forward read-only processing element by element, does not contain structural information.
Both of the above options have their own pros and cons, but neither is a particularly good solution. Their pros and cons are as follows:
DOM
Advantages: Ease of use, because all XML structure information exists in memory, and traversal is simple, supporting XPath.
Disadvantages: Parsing speed is too slow, memory usage is too high (5x~10x of the original file), and it is almost impossible to use for large files.
SAX
Advantages: Parsing is fast, and the memory usage is not related to the size of XML (it can be done without increasing memory as XML grows).
Disadvantages: Poor usability, because there is no structural information, it cannot be traversed, and it does not support XPath. If you need a structure, you can only read a little bit and construct a little bit, which makes the maintainability very poor.
We can see that DOM and SAX are basically two opposite extremes, but neither one can meet most of our requirements well. We need to find another processing method. Note that the efficiency problem with XML is not a problem with XML itself, but a problem with the Parser that processes XML, just like the two methods we saw above have different efficiency trade-offs.
Thinking
We like the use of DOM-like methods because we can traverse, which means that XPath can be supported, which greatly enhances the ease of use, but the efficiency of DOM is very low. As we already know, the efficiency problem lies in the processing mechanism. So, what aspects of DOM affect its efficiency? Let us do a comprehensive dissection below:
In most platforms today based on virtual machine (hosted, or any similar mechanism) technology, the creation and destruction of objects is a time-consuming job (it is worth the main is the time consuming of Garbage Collection), the large number of object creation and destruction used in the DOM mechanism is undoubtedly one of the reasons that affects its efficiency (it will cause too many Garbage Collections).
Each object will have an additional 32 bits to store its memory address. When there are a large number of objects like DOM, this additional cost is not small.
The main efficiency issue causing the above two problems is that DOM and SAX are both extractive parsing modes. This parsing mode is destined to require a large number of creation (destroy) objects for both DOM and SAX, causing efficiency problems. The so-called extractive parsing means that when parsing XML, DOM or SAX will extract a part of the original file (generally a string), and then parse and construct it in memory (the output is naturally one or some objects). Take DOM as an example. DOM will parse each element, attribute, PRocessing-instruction, comment, etc. into an object and give it a structure. This is the so-called extractive parsing.
Another problem caused by the issue of extractive is update efficiency. In DOM (SAX does not support update, so we won’t mention it at all), every time we need to make changes, all we have to do is to update the object’s information. Then parse back the XML string. Note that this parsing is a complete parsing, that is to say, the original file is not used, but the DOM model is directly re-parsed completely into an XML string. In other words, DOM does not support Incremental Update (incremental update).
Another "small" problem that is likely not to be noticed is the encoding of XML. No matter what kind of parsing method is used, it needs to be able to handle the encoding of XML, that is, decoding when reading and writing when writing. when coding. Another efficiency problem with DOM is that when I only want to make a small modification to a large XML, it must first decode the entire file and then build the structure. Invisibly, it is another expense.
Let us summarize the problem. Simply put, the efficiency problem of DOM mainly lies in its extractive parsing mode (the same is true for SAX, which has the same problem). This has triggered a series of related problems. If these can be overcome If there is an efficiency bottleneck, then it is conceivable that the XML processing efficiency will be further improved. If the ease of use and processing efficiency of XML are greatly improved, then the application scope and application model of XML will be further sublimated, and perhaps many wonderful things that have never been thought of before will be produced. XML based products come.
The way out
VTD-XML is the answer given after thinking about the above problems. It is a non-extractive XML parser. Because of its excellent mechanism, it is a good solution (avoiding ) solves the various problems raised above, and also "incidentally" brings other non-extractive benefits, such as fast parsing and traversal, XPath support, Incremental Update, etc. I have a set of data here, taken from the official website of VTD-XML:
The parsing speed of VTD-XML is 1.5x~2.0x that of SAX (with NULL content handler). With NULL content handler means that no additional processing logic is inserted into SAX parsing, which is the maximum speed of SAX.
The memory usage of VTD-XML is 1.3x~1.5x that of the original XML (the 1.0x part is the original XML, and the 0.3x~0.5x part is the part occupied by VTD-XML), while the memory usage of the DOM is 1.3x~1.5x that of the original XML. 5x~10x of XML. For example, if the size of an XML is 50MB, then the memory occupied by VTD-XML will be between 65MB~75MB, while the memory occupied by DOM will be between 250M~500MB. Using DOM to process large XML files based on this data is almost an impossible option.
You may find it incredible, is it really possible to make an XML parser that is easier to use than DOM and faster than SAX? Don’t rush to a conclusion, let’s take a look at the principles of VTD-XML!
Basic Principle
Like most good products, the principle of VTD-XML is not complicated, but very clever. In order to achieve the purpose of non-extractive, it reads the original XML file into the memory unchanged in binary mode, without even decoding it, and then parses the position of each element on this byte array and records some information. Subsequent traversal operations are performed on these saved records. If the XML content needs to be extracted, the position and other information in the record are used to decode the original byte array and return a string. This all seems simple, but this simple process does have multiple performance details and hides several potential capabilities. Let's first describe each performance detail:
In order to avoid excessive object creation, VTD-XML decided to use the original numerical type as the record type, so that heap is not needed. The record mechanism of VTD-XML is called VTD (Virtual Token Descriptor). VTD solves the performance bottleneck in the tokenization stage, which is really a clever and thoughtful approach. VTD is a 64-bits length numerical type that records information such as the starting position (offset), length (length), depth (depth) and token type (type) of each element.
Note that VTD is of fixed length (officially decided to use 64bits). The purpose of this is to improve performance. Because the length is fixed, it is extremely efficient (O(1)) when reading, querying and other operations, that is, Arrays, an efficient structure that can be used to organize VTDs, greatly reduce performance problems caused by the large use of objects.
The superpower of VTD (no exaggeration at all) is that it can simply transform a tree-shaped data structure such as XML into an operation on a byte array, any operation you can imagine on a byte array All can be applied to XML. This is because the XML read in is binary (byte array), and VTD records the location of each element and other access information. When we find the VTD to be operated, we only need to use information such as offset and length. Perform any operation on the original byte array, or you can operate directly on the VTD. For example, if I want to find an element in a large XML and delete it, then I only need to find the VTD of this element (the traversal method will be discussed later), delete this VTD from the VTD array, and then use all Just write the VTD to another byte array. Because the deleted VTD marks the location of the element to be deleted, this element will not appear in the newly written byte array. Use VTD to write the new The byte array is actually a copy of the byte array, and its efficiency is quite high. This is the so-called incremental update.
Regarding the traversal method of VTD-XML, it uses LC (Location Cache), which is simply a tree-shaped table structure built using VTD with its depth as the standard. The entry of LC is also a 64-bit long numerical type. The first 32 bits represent the index of a VTD, and the last 32 bits represent the index of the first child of this VTD. You can use this information to calculate any position you want to reach. For specific traversal methods, please refer to the article on the official website. VTD-XML based on this traversal method has a different operating interface from DOM, which is understandable. Moreover, this traversal method of VTD-XML can take you to the place you need in the fewest steps. , the traversal performance is very outstanding.
Summary
As you can see above, VTD-XML has fascinating features, and now version 1.5 has added support for XPath (as long as it can be traversed, it can support XPath , this is a matter of time :-)), its practicality has exceeded the scope of what we imagine today. Another superpower of VTD-XML is that based on its current processing method, it can fully support the future Binary XML standard and push the application of XML to a higher level through Binaryization! This is what I'm looking forward to now! :-)
However, VTD-XML still has many areas that need improvement and perfection, and this aspect is worthy of our efforts and discussion.
The above is the introduction of the emerging XML processing method VTD-XML. For more related content, please pay attention to the PHP Chinese website (m.sbmmt.com)!