Actually, I don't quite agree with the person who wrote the DHT crawler.
Different languages naturally suit different jobs; arguing about which is better or worse without any context is pointless.
1. If you are crawling a handful of pages for fun, in a targeted way, and efficiency is not a core requirement, it hardly matters: any language will do and the performance differences are small. That said, if you run into a very complex page and your regular expressions get very complicated, the crawler's maintainability will suffer.
2. If you are doing targeted crawling and the target pages render their content with dynamic JS:
In that case, simply requesting the page and reading the response body will not work. You need a JS engine like the one in Firefox or Chrome to actually execute the page's JS. For this I'd recommend CasperJS + PhantomJS, or SlimerJS + PhantomJS.
3. If it is large-scale site crawling:
Then efficiency, scalability, maintainability, and so on all have to be considered.
Large-scale crawling touches many areas, such as distributed crawling, a deduplication mechanism, and task scheduling, and none of these is simple once you dig into it (a toy sketch of the dedup idea follows this comment).
Language choice matters a lot at this point.
Node.js: very efficient for crawling. It handles high concurrency well, what would be multi-threaded code elsewhere becomes simple loops and callbacks, and memory and CPU usage stay low; you do have to manage the callbacks carefully, though.
PHP: frameworks are everywhere, just grab one. But PHP's efficiency is genuinely a problem... I'll leave it at that.
Python: this is what I write most, and it has good support for all kinds of problems; the Scrapy framework is easy to use and has a lot going for it.
I don't think plain JS is a great fit for this... efficiency issues. I haven't written a crawler in it, so if I tried I'd probably run into plenty of trouble.
As far as I know, big companies use C++ too. In any case, most of them build on top of modified open-source frameworks; very few people truly reinvent the wheel. It isn't worth it.
I wrote this off the top of my head; corrections are welcome.
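To make the "deduplication mechanism" in point 3 a bit more concrete: at its core it just means never fetching the same URL twice. A toy, in-memory sketch in Python (a real large-scale crawler would swap the plain set for a Bloom filter or a shared store such as Redis; the example URLs are placeholders):

```python
import hashlib
from urllib.parse import urldefrag


class SeenUrls:
    """Toy URL dedup: hash each normalized URL and keep the digests in a set."""

    def __init__(self):
        self._seen = set()

    def add(self, url):
        """Return True if the URL is new, False if it was already scheduled."""
        normalized, _ = urldefrag(url.strip())  # drop any #fragment
        digest = hashlib.sha1(normalized.encode("utf-8")).digest()
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True


seen = SeenUrls()
print(seen.add("http://example.com/page#top"))  # True: first time we see this page
print(seen.add("http://example.com/page"))      # False: same page once the fragment is stripped
```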
You can try the jsoup library, which is written in Java.
Just use Node already. JavaScript is the language that understands HTML best.
nodejs +1
nodejs +1
Try pyspider: its performance is no worse than Scrapy's, it's more flexible, it comes with a web UI, and it supports crawling JS pages too~
You can play around with the demo yourself~
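For anyone who hasn't tried pyspider, a handler looks roughly like this (a minimal sketch modeled on its quick-start template; the seed URL is a placeholder):

```python
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)
    def on_start(self):
        # seed URL is a placeholder
        self.crawl('http://example.com/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # follow every outgoing link found on the page
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # whatever you return here is what gets stored per page
        return {
            'url': response.url,
            'title': response.doc('title').text(),
        }
```

Scheduling, re-crawl intervals, and results can then be watched from the web UI.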
selenium
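Selenium is indeed one way to handle the JS-rendered pages from point 2 of the long comment above. A minimal sketch with the Python bindings, assuming Chrome plus a matching chromedriver on the PATH (the URL and wait condition are placeholders):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("--headless")          # no visible browser window
driver = webdriver.Chrome(options=options)  # assumes chromedriver is on PATH
try:
    driver.get("http://example.com/")       # placeholder URL
    # wait until something rendered by client-side JS shows up (selector is a placeholder)
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "body"))
    )
    html = driver.page_source               # the rendered DOM, not the raw HTTP response
    print(driver.title, len(html))
finally:
    driver.quit()
```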
nodejs +1
No, I was wrong.
High-performance crawlers don't need concurrency the way servers do; for efficiency (cutting down duplicated work) they are better suited to parallelism than to concurrency.
Well, I was wrong again.
Concurrency and parallelism are almost the same for crawlers~
No, it’s different.
Forget it, nodejs +1.
Most people use Python, though plenty use Java and C++ as well. Python gets you going quickly and has big advantages for small and medium-sized jobs. For large-scale crawling you need to optimize, or rewrite the performance bottlenecks in C.
You can try Python's Scrapy.
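To show how little code the "easy to use" claims about Scrapy amount to, here is a minimal spider sketch (the seed URL and selectors are placeholders):

```python
import scrapy


class DemoSpider(scrapy.Spider):
    name = "demo"                                  # used as `scrapy crawl demo`
    start_urls = ["http://example.com/"]           # placeholder seed URL

    def parse(self, response):
        # one item per page; the CSS selectors are placeholders
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
        }
        # follow in-page links; Scrapy's scheduler dedups requests for us
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

Saved as a single file, it can be run with `scrapy runspider demo_spider.py -o items.json` without creating a full project.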