
19 Java open source web crawlers that you must use when playing with big data

黄舟
Release: 2017-01-18 15:20:14



A web crawler (also known as a web spider or web robot, and in the FOAF community more commonly called a web page chaser) is a program or script that automatically crawls information from the World Wide Web according to certain rules. Other less commonly used names include ant, auto-indexer, emulator, and worm.

Today I will introduce 19 Java open source web crawlers; if you need them, bookmark this list.


1. Heritrix

Heritrix is an open source web crawler developed in Java. Users can use it to crawl the resources they want from the Internet. Its best feature is its good extensibility, which makes it convenient for users to implement their own crawling logic.

Heritrix is an "archival crawler": its goal is to obtain a complete, accurate, deep copy of a site's content, including images and other non-text content. Relevant content is crawled and stored as-is; nothing is rejected and page content is not modified. Recrawling the same URL does not replace the previously stored copy. The crawler is launched, monitored and adjusted mainly through a web user interface, allowing flexible definition of the URLs to be fetched.

Heritrix crawls in a multi-threaded manner. The main thread assigns tasks to Teo threads (processing threads), and each Teo thread processes one URL at a time, running that URL through the URL processor chain once. The processor chain consists of the following five steps (a simplified sketch of this chain pattern follows the list).

(1) Pre-fetch chain: mainly does preparatory work, such as delaying or re-scheduling processing and vetoing subsequent operations.

(2) Fetch chain: mainly downloads the web page, performs DNS resolution, and fills in the request and response forms.

(3) Extractor chain: once fetching is complete, extracts the HTML and JavaScript of interest; this usually yields new URLs to be crawled.

(4) Write chain: stores the crawl results; full-text indexing can be done directly in this step. Heritrix provides an ARCWriterProcessor implementation that saves download results in the ARC format.

(5) Post-processing chain: performs the final operations related to this URL, checks which newly extracted URLs fall within the crawl scope, and submits them to the Frontier. The DNS cache information is also updated.
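The Heritrix processor-chain API itself is fairly involved, so the following is only a minimal, self-contained Java sketch of the general chain-of-processors pattern described above. It does not use Heritrix classes; the Processor interface, the CrawlTask holder and the stubbed fetch step are all illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: NOT Heritrix's real API, just the chain-of-processors
// pattern that each Teo thread applies to a single URL.
interface Processor {
    void process(CrawlTask task);
}

// Holds the per-URL state passed along the chain.
class CrawlTask {
    final String url;
    String responseBody;                       // filled in by the (stubbed) fetch step
    final List<String> discoveredUrls = new ArrayList<>();

    CrawlTask(String url) { this.url = url; }
}

public class ProcessorChainDemo {
    public static void main(String[] args) {
        // One processor per step of the chain described above.
        List<Processor> chain = List.<Processor>of(
            task -> System.out.println("pre-fetch checks for " + task.url),
            task -> task.responseBody = "<html>...</html>",        // fetch (stubbed, no real HTTP)
            task -> task.discoveredUrls.add(task.url + "/next"),   // extract new URLs
            task -> System.out.println("write result for " + task.url),
            task -> System.out.println("submit " + task.discoveredUrls + " to the frontier")
        );

        CrawlTask task = new CrawlTask("http://example.com");
        for (Processor p : chain) {
            p.process(task);                   // each URL passes through the whole chain once
        }
    }
}
```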



Heritrix system framework diagram


The process Heritrix follows when processing a URL

2. WebSPHINX

WebSPHINX is a Java class library and an interactive development environment for web crawlers. A web crawler (also called a robot or spider) is a program that automatically browses and processes Web pages. WebSPHINX consists of two parts: the crawler workbench and the WebSPHINX class library (a rough usage sketch of the class library follows the list below).


WebSPHINX – Purpose

1. Visually display a collection of pages

2. Download pages to local disk for offline browsing

3. Splice all pages into a single page for browsing or printing

4. Extract text strings from the page according to specific rules

5. Use Java or JavaScript to develop a custom crawler
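As a rough usage sketch of the class library part: you typically subclass the Crawler class, decide which links to follow in shouldVisit, and handle each downloaded page in visit. The class and method names below (websphinx.Crawler, Link, Page, setRoot, run) are written from memory of the WebSPHINX API and should be treated as assumptions rather than verified signatures.

```java
import java.net.URL;

import websphinx.Crawler;
import websphinx.Link;
import websphinx.Page;

// Assumed WebSPHINX usage: subclass Crawler and override shouldVisit/visit.
public class TitlePrinter extends Crawler {

    @Override
    public boolean shouldVisit(Link link) {
        // Hypothetical rule: stay on the seed host.
        return link.getURL().getHost().endsWith("example.com");
    }

    @Override
    public void visit(Page page) {
        System.out.println(page.getURL() + " -> " + page.getTitle());
    }

    public static void main(String[] args) throws Exception {
        TitlePrinter crawler = new TitlePrinter();
        crawler.setRoot(new Link(new URL("http://example.com/")));  // seed page
        crawler.run();                                              // crawl synchronously
    }
}
```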


3. WebLech

WebLech is a powerful free, open source tool for downloading and mirroring Web sites. It supports downloading a site according to functional requirements and mimics the behavior of a standard web browser as closely as possible. WebLech has a functional console and operates multi-threaded.

This crawler is simple enough that, if you are a beginner who wants to write a crawler, it can serve as an introductory reference, which is why I chose it to start my own research. For low-demand applications you can also give it a try. But if you are looking for a powerful tool, don't waste your time on WebLech.

Homepage of the project: http://weblech.sourceforge.net/

Features:

1) Open source and free

2) Written in pure Java, so it can be used on any platform that supports Java

3) Supports multi-threaded downloading of web pages

4) Can maintain the link information between web pages

5) Highly configurable: crawling can be depth-first or breadth-first; URL filters can be customized so that a single web server, a single directory, or the entire WWW can be crawled as needed; URL priorities can be set so that the pages we find interesting or important are crawled first; the program state can be saved at a breakpoint, and crawling can resume from that point after a restart.

4. Arale

Arale is mainly designed for personal use and does not focus on page indexing the way other crawlers do. Arale can download an entire web site or certain resources from a web site. It can also map dynamic pages to static pages.

5. JSpider

JSpider is a fully configurable and customizable Web Spider engine. You can use it to check a website for errors (internal server errors, etc.), check internal and external links, analyze the structure of the website (and create a site map), or download the entire website. You can also write JSpider plug-ins to extend the functions you need.

JSpider is a Web Spider implemented in Java. Its execution format is as follows:

jspider [URL] [ConfigName]

The URL must include the protocol name, such as http://, otherwise an error will be reported. If ConfigName is omitted, the default configuration is used.

JSpider's behavior is configured entirely through configuration files: which plug-ins are used, how results are stored, and so on are all set in the conf[ConfigName] directory. JSpider ships with very few default configurations, which are of little use on their own. However, JSpider is very easy to extend, and you can use it to build powerful web crawling and data analysis tools. To do so, you need a deep understanding of how JSpider works, and then develop plug-ins and write configuration files according to your own needs.

JSpider is:

A highly configurable and customizable web crawler

Developed under the LGPL open source license

100% pure Java implementation

You can use it to:

Check your website for errors (internal server errors, …)

Check for outgoing or internal links

Analyze the structure of your website (Create a sitemap, …)

Download an entire website

Implement any function you need by writing a JSpider plug-in.

The project homepage: http://j-spider.sourceforge.net/

6. spindle

spindle is a Web indexing/search tool built on the Lucene toolkit. It includes an HTTP spider for creating indexes and a search class for searching those indexes. The spindle project provides a set of JSP tag libraries so that JSP-based sites can add search functionality without developing any Java classes.

7. Arachnid

Arachnid is a Java-based web spider framework. It contains a simple HTML parser that can analyze input streams containing HTML content. By implementing a subclass of Arachnid you can develop a simple web spider, adding just a few lines of code that are called after each page on the website is parsed. The Arachnid download package contains two example spider applications that demonstrate how to use the framework.

Homepage of the project: http://arachnid.sourceforge.net/

8. LARM

LARM provides a pure Java search solution for users of the Jakarta Lucene search engine framework. It contains methods for indexing files and database tables, as well as crawlers for indexing Web sites.

The project homepage: http://larm.sourceforge.net/

9. JoBo

JoBo is a simple tool for downloading entire Web sites. It is essentially a Web Spider. Compared with other download tools, its main advantages are its ability to automatically fill in forms (for example, for automatic login) and its use of cookies to handle sessions. JoBo also has flexible download rules (for example, limiting downloads by URL, size, or MIME type).

10. snoics-reptile

1. What is snoics-reptile?

snoics-reptile is developed in pure Java and is a tool for capturing website mirrors. Starting from the URL entries given in a configuration file, it fetches via GET all the resources of the site that a browser could obtain and saves them locally, including web pages and files of all types, such as pictures, flash, mp3, zip, rar and exe files. An entire website can be downloaded completely to the hard drive with its original structure kept exactly unchanged. You only need to put the captured site on a web server (such as Apache) to obtain a complete mirror of the website.

2. Since other similar software already exists, why develop snoics-reptile?

Because some files are often fetched incorrectly during crawling, and many URLs controlled by JavaScript cannot be parsed correctly, snoics-reptile provides external interfaces and configuration files. For special URLs, you can freely extend the externally provided interfaces and adjust the configuration files, so that basically all web pages can be parsed and crawled correctly.

Homepage of the project: http://www.blogjava.net/snoics

11. Web-Harvest

Web-Harvest is a Java open source Web data extraction tool. It can collect specified web pages and extract useful data from them. Web-Harvest mainly uses technologies such as XSLT, XQuery and regular expressions to implement text/XML operations.

Web-Harvest is an open source Web data extraction tool written in Java. It provides a way to extract useful data from the desired pages. To achieve this, you may need to use related technologies such as XSLT, XQuery and regular expressions to manipulate text/XML. Web-Harvest mainly focuses on HTML/XML-based page content, which still accounts for the majority of web content. On the other hand, its extraction capabilities can easily be extended by writing custom Java methods.

The main purpose of Web-Harvest is to enhance the application of existing data extraction technologies. Its goal is not to invent a new method but to provide a better way to use and combine existing ones. It provides a set of processors for processing data and controlling the flow. Each processor is treated as a function: it takes parameters and returns a result after execution. Processors are combined into a pipeline so that they can be executed in a chained fashion. In addition, for easier data manipulation and reuse, Web-Harvest provides a variable context for storing declared variables.

To start Web-Harvest, you can simply double-click the jar package to run it; however, this way you cannot set the size of the Java virtual machine. The second method is to change into the web-harvest directory on the command line and run "java -jar -Xms400m webharvest_all_2.jar", which starts the tool with the Java virtual machine heap set to 400 MB.
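Web-Harvest can also be embedded in a Java program instead of being started from the GUI jar. Below is a minimal sketch, assuming the Web-Harvest 2.x classes ScraperConfiguration and Scraper and a placeholder configuration file named config.xml; treat the exact signatures as assumptions.

```java
import org.webharvest.definition.ScraperConfiguration;
import org.webharvest.runtime.Scraper;

public class WebHarvestDemo {
    public static void main(String[] args) throws Exception {
        // Load the XSLT/XQuery/regex extraction pipeline from an XML
        // configuration file ("config.xml" is a placeholder name).
        ScraperConfiguration config = new ScraperConfiguration("config.xml");

        // The second argument is the working directory for downloaded content.
        Scraper scraper = new Scraper(config, "work");

        // Execute every processor declared in the configuration, in order.
        scraper.execute();
    }
}
```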

The project homepage: http://web-harvest.sourceforge.net

12. ItSucks

ItSucks is an open source Java web crawler project. It can be flexibly customized and supports defining download rules through download templates and regular expressions. It provides both a console and a Swing GUI.

Features:

Multi-threading

Regular expression

Save/Load download work

Online help

HTTP/HTTPS support

HTTP proxy support

HTTP authentication

Cookie support

Configurable User Agent

Connection Limitation

Configurable behavior per HTTP response code

Bandwidth Limitation

Gzip Compression

The project homepage: http://itsucks.sourceforge.net/

13. Smart and Simple Web Crawler

Smart and Simple Web Crawler is a web crawler framework with integrated Lucene support. The crawler can start from a single link or an array of links and offers two traversal modes: maximum iterations and maximum depth. Filters can be set to limit the links that are crawled; three filters, ServerFilter, BeginningPathFilter and RegularExpressionFilter, are provided by default and can be combined with AND, OR and NOT. Listeners can be added before and after the parsing process and before and after page loading.

14. Crawler4j

crawler4j is an open source web crawler implemented in Java. It provides a simple and easy-to-use interface for creating a multi-threaded web crawler within minutes.

Using crawler4j mainly involves two steps:

Implement a crawler class that extends WebCrawler;

Run the implemented crawler class through CrawlController.

WebCrawler is an abstract class. If you extend it, you must implement two methods, shouldVisit and visit. Among them:

shouldVisit determines whether the current URL should be crawled (visited);

visit crawls the data of the page the URL points to; its incoming parameter is the Page object that encapsulates all of the page's data.

In addition, WebCrawler has other methods that can be overridden, and their naming follows a convention similar to Android's. For example, the getMyLocalData method returns data stored in the WebCrawler, and the onBeforeExit method is called before the WebCrawler finishes, where clean-up work such as releasing resources can be done.
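Put together, a minimal crawler4j sketch looks roughly like the following. It is based on the crawler4j 4.x API; the seed URL, storage folder, filter rule and thread count are placeholders.

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // Placeholder rule: only crawl pages under the seed domain.
        return url.getURL().startsWith("https://example.com/");
    }

    @Override
    public void visit(Page page) {
        // Page wraps all data of the fetched web page.
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData html = (HtmlParseData) page.getParseData();
            System.out.println(page.getWebURL().getURL() + " : "
                    + html.getText().length() + " chars of text");
        }
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl");   // intermediate crawl data

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer =
                new RobotstxtServer(new RobotstxtConfig(), pageFetcher);

        CrawlController controller =
                new CrawlController(config, pageFetcher, robotstxtServer);
        controller.addSeed("https://example.com/");
        controller.start(MyCrawler.class, 4);          // 4 crawler threads
    }
}
```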

License

Copyright (c) 2010-2015 Yasser Ganjisaffar

Released under Apache License 2.0

Open source address: https://github.com/yasserg/crawler4j

15. Ex-Crawler

Ex-Crawler is a web crawler developed in Java. The project is divided into two parts: a daemon process and a flexible, configurable web crawler. It uses a database to store web page information.

Ex-Crawler is divided into three parts (crawler daemon, GUI client and web search engine); combined, these three parts make a flexible and powerful crawler and search engine. The web search engine part is developed in PHP and includes a content management system (CMS) for maintaining the search engine.

Homepage of the project: http://ex-crawler.sourceforge.net/joomla/

16. Crawler

Crawler is a simple web crawler. It lets you avoid writing boring, error-prone code and focus only on the structure of the website you need to crawl. It is also very easy to use.

Homepage of the project: http://projetos.vidageek.net/crawler/crawler/

17. Encog

Encog is an advanced neural network and robot/crawler development class library. The two capabilities Encog provides can be used separately, to create neural networks or HTTP robots, and Encog also supports combining these two advanced functions. Encog supports creating feedforward neural networks, Hopfield neural networks and self-organizing maps.

Encog provides advanced HTTP robot/crawler programming capabilities. It supports storing the content produced by multi-threaded crawlers in memory or in a database, and supports HTML parsing as well as advanced form and cookie handling.

Encog is an advanced machine learning framework that supports a variety of advanced algorithms as well as data normalization and preprocessing. It supports machine learning algorithms such as support vector machines, artificial neural networks, genetic programming, Bayesian networks, hidden Markov models and genetic algorithms. Most Encog training algorithms are multi-threaded and scale well to multi-core hardware. Encog can also use a GPU to further speed up processing. A GUI-based workbench is provided to help model and train machine learning algorithms. Encog has been actively developed since 2008.
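As a small taste of the neural-network side, here is the classic XOR example roughly as it is usually written against the Encog 3.x Java API (class names recalled from memory, so treat them as assumptions).

```java
import org.encog.engine.network.activation.ActivationSigmoid;
import org.encog.ml.data.MLDataSet;
import org.encog.ml.data.basic.BasicMLDataSet;
import org.encog.neural.networks.BasicNetwork;
import org.encog.neural.networks.layers.BasicLayer;
import org.encog.neural.networks.training.propagation.resilient.ResilientPropagation;

public class EncogXor {
    public static void main(String[] args) {
        double[][] input = { {0, 0}, {0, 1}, {1, 0}, {1, 1} };
        double[][] ideal = { {0}, {1}, {1}, {0} };

        // A small feedforward network: 2 inputs, 3 hidden neurons, 1 output.
        BasicNetwork network = new BasicNetwork();
        network.addLayer(new BasicLayer(null, true, 2));
        network.addLayer(new BasicLayer(new ActivationSigmoid(), true, 3));
        network.addLayer(new BasicLayer(new ActivationSigmoid(), false, 1));
        network.getStructure().finalizeStructure();
        network.reset();

        // Train with resilient propagation until the error is small.
        MLDataSet trainingSet = new BasicMLDataSet(input, ideal);
        ResilientPropagation train = new ResilientPropagation(network, trainingSet);
        do {
            train.iteration();
        } while (train.getError() > 0.01);
        train.finishTraining();
    }
}
```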

Encog supports multiple languages, including C#, Java and C.

Source code for each of these languages is available on GitHub.

http://www.heatonresearch.com/encog

https://github.com/encog

18. Crawljax

Crawljax is an open source Java tool for automatically crawling and testing Ajax web applications. Crawljax can crawl any Ajax-based web application by triggering events and filling in data in forms.
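A minimal embedded Crawljax run, assuming the Crawljax 3.x builder API (class and method names recalled from memory; the URL and limits are placeholders):

```java
import com.crawljax.core.CrawljaxRunner;
import com.crawljax.core.configuration.CrawljaxConfiguration;
import com.crawljax.core.configuration.CrawljaxConfiguration.CrawljaxConfigurationBuilder;

public class CrawljaxDemo {
    public static void main(String[] args) throws Exception {
        // Build a configuration for the Ajax application to explore.
        CrawljaxConfigurationBuilder builder =
                CrawljaxConfiguration.builderFor("https://example.com/");
        builder.setMaximumStates(50);   // cap the size of the inferred state-flow graph
        builder.setMaximumDepth(3);     // cap how deep event sequences may go

        // Run the crawl: Crawljax fires events and fills in forms as it explores.
        new CrawljaxRunner(builder.build()).call();
    }
}
```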


The project homepage: http://crawljax.com/

Open source address: https://github.com/crawljax/crawljax


The above covers the 19 Java open source web crawlers that you must use when playing with big data. For more related content, please pay attention to the PHP Chinese website (m.sbmmt.com)!

