This post builds on the previous two:
Usage of single page collection function get_html based on curl data collection
Usage of single page parallel collection function get_htmls based on curl data collection
We now have the HTML files we need. The next step is to process them and extract the data we want to collect.
HTML cannot be parsed as strictly as XML, because HTML documents often contain unpaired tags and are loosely structured, so you need some other helper class to parse them. Simplehtmldom is a parsing class that operates on HTML documents in a jQuery-like way. It makes it very convenient to get the data you want, but unfortunately it is slow, and it is not the focus of this post. I mainly use regular expressions to match the data I need to collect, which lets me get the information quickly.
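To make the regular-expression approach concrete, here is a minimal sketch that pulls link URLs and titles out of a list page. The sample HTML, the function name extract_links, and the pattern itself are assumptions made for illustration, not code from the earlier posts. In practice $html would be whatever get_html returned, and the pattern would be tailored to the target page.

```php
<?php
// Hypothetical sample markup; in practice this would be the page returned by get_html().
$html = <<<HTML
<ul class="list">
  <li><a href="/post/1.html">First post</a></li>
  <li><a href='/post/2.html'>Second   post</a></li>
  <li><a class="ext" href="/post/3.html"><b>Third</b> post</a></li>
</ul>
HTML;

/**
 * Illustrative helper (not from the original posts): collect the href and
 * visible text of every <a> tag using one regular expression.
 */
function extract_links(string $html): array
{
    // (["\']) captures the opening quote and \1 requires the same closing quote;
    // the i flag ignores case and the s flag lets . match newlines.
    $pattern = '#<a\b[^>]*?\bhref\s*=\s*(["\'])(.*?)\1[^>]*>(.*?)</a>#is';

    $links = [];
    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            // Drop nested tags and collapse runs of whitespace in the link text.
            $title = trim(preg_replace('/\s+/', ' ', strip_tags($m[3])));
            $links[] = ['url' => $m[2], 'title' => $title];
        }
    }
    return $links;
}

$links = extract_links($html);
foreach ($links as $link) {
    echo $link['url'], ' => ', $link['title'], PHP_EOL;
}
```

A pattern like this is tied to the markup of one site, which is the trade-off against a DOM-style parser: it is fast, but it has to be adjusted whenever the page structure changes.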
get_html can check the data it returns, but get_htmls cannot, so I wrote the following two functions to make debugging and calling easier:
When collecting data, you often collect a list page first and then collect the content pages from the links found on it, sometimes across even more levels. This leads to a lot of nested loops, and the code becomes hard to control. So can we separate the code that collects list pages from the code that collects content pages (and any deeper levels), or even simplify the loops?