Web crawling is a well-known technology these days, but there is still plenty of complexity in it, and simple crawlers struggle to cope with modern websites built on technologies such as Ajax polling, XMLHttpRequest, WebSockets, Flash sockets, and so on.
Take our basic needs on the Hubdoc project as an example. In this project we scrape bill amounts, due dates, account numbers, and, most importantly, PDFs of recent bills from the websites of banks, utilities, and credit card companies. For this project I started with a very simple solution (holding off on the expensive commercial products we were evaluating): a simple crawler along the lines of one I had built before in Perl at MessageLabs/Symantec. The results were disastrous: spammers build websites that are far simpler than those of banks and utility companies.
So how do you solve this problem? We started off by using Mikeal's excellent request library: make the request in the browser, check in the Network panel which request headers are sent, and then copy those headers into the code. The process is simple: trace the flow from logging in through downloading the PDF file, and then simulate every request along the way. To make this kind of work easier, and to let web developers write crawlers more sanely, I exposed the fetched HTML as jQuery-style objects (using the lightweight cheerio library), which makes tasks like this straightforward and makes it easy to use CSS selectors to pick elements out of a page. The whole thing is wrapped in a framework that also does extra work such as fetching credentials from the database, loading the individual robots, and communicating with the UI via socket.io.
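As a rough illustration of that approach (the URLs, headers, form fields, and selectors below are placeholders for illustration, not the real Hubdoc robots), a request-plus-cheerio scrape looks roughly like this:

```js
var request = require('request');
var cheerio = require('cheerio');

// Keep cookies between requests, just like the browser session we copied from.
var jar = request.jar();

request({
  url: 'https://example-bank.com/login',        // placeholder URL
  method: 'POST',
  jar: jar,
  headers: {                                     // headers copied from the browser's Network panel
    'User-Agent': 'Mozilla/5.0',
    'Content-Type': 'application/x-www-form-urlencoded'
  },
  form: { username: 'user', password: 'secret' } // placeholder credentials
}, function (err, res, body) {
  if (err) throw err;

  // Fetch the bills page with the same cookie jar, then query it with CSS selectors.
  request({ url: 'https://example-bank.com/bills', jar: jar }, function (err, res, body) {
    if (err) throw err;
    var $ = cheerio.load(body);
    $('table.bills tr').each(function () {
      var amount = $(this).find('.amount').text().trim();
      var dueDate = $(this).find('.due-date').text().trim();
      console.log(amount, dueDate);
    });
  });
});
```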
This worked for some websites, but the real obstacle is the JavaScript these companies put on their sites, not my Node.js code. They have layered legacy on top of complexity, which makes it very hard to figure out exactly what has to happen before you reach the point of being logged in. For some sites I tried for several days to get there with the request() library alone, but in vain.
Just when I was about to give up, I discovered node-phantomjs, a library that lets me control the PhantomJS headless WebKit browser from Node (translator's note: "headless" means the page is rendered in the background, without a display device). This looks like an obvious solution, but PhantomJS still has some unavoidable problems that need to be worked around:
1. PhantomJS can only tell you whether the page has loaded; it cannot tell you whether a redirect via JavaScript or a meta tag is still going to happen, especially when the JavaScript delays it with setTimeout().
2. PhantomJS gives you a pageLoadStarted hook that lets you deal with the problem above, but only by keeping a count of the pages still loading, decrementing it when each page finishes, and providing a timeout for loads that never complete (since this does not always happen), so that your callback fires when the count reaches 0 (see the counter sketch after this list). The approach works, but it always feels a bit like a hack.
3. PhantomJS needs a completely separate process for every page it scrapes, because otherwise the cookies of different pages cannot be kept apart. If you use the same PhantomJS process, the logged-in session of one page leaks into another.
4. PhantomJS cannot download resources for you; you can only render the page to PNG or PDF. That is useful, but it means we have to fall back on request() to download the PDFs.
5. Because of the above, I had to find a way to hand the cookies from the PhantomJS session over to the request() session: pull out the document.cookie string, parse it, and inject it into the request() cookie jar (see the cookie-jar sketch after this list).
6. Injecting variables into the browser session is not easy. To do it, I had to build up a string that creates a JavaScript function.
7. Some websites are stuffed with calls such as console.log(), which have to be redefined so their output goes where we want it (a sketch of one way to do this follows the list).
8. I also had to cap the maximum number of concurrent browser sessions so we do not blow up the server (see the queue sketch below). That said, this limit is still far higher than what the expensive commercial solutions can offer.
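For item 2, here is a minimal sketch of that page-load counter, written as a plain PhantomJS page script using PhantomJS's own onLoadStarted/onLoadFinished hooks (the URL, settle timeout, and script structure are illustrative assumptions, not the actual Hubdoc framework):

```js
// Run with: phantomjs load-counter.js  (PhantomJS-side script, not Node)
var page = require('webpage').create();
var pagesLoading = 0;
var doneTimer = null;

function maybeDone() {
  // Wait a little after the last load finishes, in case a JS or meta redirect
  // (possibly hidden behind a setTimeout) kicks off another navigation.
  clearTimeout(doneTimer);
  doneTimer = setTimeout(function () {
    if (pagesLoading === 0) {
      console.log('navigation appears to have settled');
      phantom.exit();
    }
  }, 2000); // arbitrary settle timeout
}

page.onLoadStarted = function () { pagesLoading++; };
page.onLoadFinished = function () { pagesLoading--; maybeDone(); };

page.open('https://example-bank.com/login', function () {
  maybeDone();
});
```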
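For item 5, the cookie hand-off to request() can be sketched roughly like this on the Node side, assuming the PhantomJS session has already handed back the page's document.cookie string (the cookie values and URLs are placeholders):

```js
var request = require('request');

// cookieString is whatever document.cookie evaluated to inside PhantomJS,
// e.g. "SESSIONID=abc123; locale=en_US"
function jarFromDocumentCookie(cookieString, url) {
  var jar = request.jar();
  cookieString.split(';').forEach(function (pair) {
    var cookie = request.cookie(pair.trim()); // parses "name=value"
    if (cookie) jar.setCookie(cookie, url);
  });
  return jar;
}

var jar = jarFromDocumentCookie('SESSIONID=abc123; locale=en_US',
                                'https://example-bank.com/');
request({ url: 'https://example-bank.com/bills', jar: jar },
        function (err, res, body) { /* ... */ });
```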
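For item 7, one way to capture and redirect the page's console.log() output, again as a plain PhantomJS page script (the prefixes and destination are assumptions for illustration; the post does not show the author's exact code):

```js
var page = require('webpage').create();

// Everything the page prints with console.log() arrives here, where we can
// forward it wherever we want (a log file, the UI over socket.io, etc.).
page.onConsoleMessage = function (msg) {
  console.log('[page] ' + msg);
};

page.open('https://example-bank.com/login', function () {
  // Optionally redefine console.log inside the page itself, so we can tag
  // or filter the output before it reaches onConsoleMessage.
  page.evaluate(function () {
    var original = console.log;
    console.log = function () {
      original.apply(console, ['[intercepted]'].concat([].slice.call(arguments)));
    };
  });
});
```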
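For item 8, the concurrency cap can be sketched with something like the async library's queue; the worker body, script name, and the limit of 5 are assumptions, since the post does not show the framework's actual scheduling:

```js
var async = require('async');
var spawn = require('child_process').spawn;

// Each job spawns its own PhantomJS process (one per page, as noted above),
// and the queue ensures no more than 5 run at once.
var queue = async.queue(function (job, done) {
  var child = spawn('phantomjs', [job.script, job.url]);
  child.on('exit', function () { done(); });
}, 5);

queue.push({ script: 'robot.js', url: 'https://example-bank.com/login' });
```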
With all that work done, I had a decent PhantomJS + request() crawler solution: log in with PhantomJS, then hand off to request(), which uses the cookies set by PhantomJS to authenticate the logged-in session. This is a huge win, because we can use request()'s streams to download the PDF files.
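A minimal sketch of that final hand-off, assuming the cookie jar has already been populated from the PhantomJS session as shown earlier (URL and filename are placeholders):

```js
var fs = require('fs');
var request = require('request');

// jar was filled from the PhantomJS session's document.cookie, so this
// request is made as the logged-in user and the PDF is streamed to disk.
request
  .get({ url: 'https://example-bank.com/bills/latest.pdf', jar: jar })
  .on('error', function (err) { console.error(err); })
  .pipe(fs.createWriteStream('latest-bill.pdf'));
```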
The whole point of the design is to make it relatively easy for web developers who understand jQuery and CSS selectors to build crawlers for different websites. I have not yet proved that this idea is feasible, but I believe I will soon.