Java crawler application tutorial, practical data capture and analysis-javaTutorial-php.cn

In the Internet era, data has become essential to the success of enterprises and individuals, and its importance keeps growing. As a powerful tool for data acquisition, crawler technology is widely used across industries. This article introduces how to write a crawler in Java to capture and analyze data.

1. Prerequisite knowledge

Before learning to write Java crawlers, you should be familiar with the following:

Basics of the Java language: at a minimum, basic Java concepts such as classes, methods, and variables, and the ideas of object-oriented programming.
HTML Basics: the basic structure and tags of HTML, plus enough CSS and JavaScript to read simple styles and scripts.
HTTP Basics: how the GET and POST methods of the HTTP protocol work, and what HTTP headers such as Cookie and User-Agent are for.
Regular Expressions: Understand the basic syntax and usage of regular expressions.
Database operations: Master the basic knowledge of Java database operations, such as JDBC, Hibernate, MyBatis, etc.

2. Java crawler basics

A web crawler is an automated program that can simulate human behavior to access the Internet, extract information from web pages, and process it. The Java language has good network programming capabilities and powerful object-oriented features, so it is very suitable for writing crawler programs.

Java crawlers are generally divided into three parts: URL manager, web page downloader and web page parser.
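
As a rough illustration of how the three parts cooperate, the sketch below wires them into a single loop. The names MiniCrawler, crawl, downloader and parser are hypothetical, and the downloader and parser are passed in as plain functions so that the loop can be tried without network access.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch: the names are illustrative, not from any library.
class MiniCrawler {
    /**
     * Crawls breadth-first from seed and visits at most maxPages pages.
     * downloader: URL -> page content (null if the download failed)
     * parser:     page content -> URLs found in the page
     * Returns the URLs in the order they were visited.
     */
    static List<String> crawl(String seed, int maxPages,
                              Function<String, String> downloader,
                              Function<String, List<String>> parser) {
        Deque<String> toVisit = new ArrayDeque<>();   // URL manager: pending URLs
        Set<String> seen = new HashSet<>();           // URL manager: known URLs
        List<String> visited = new ArrayList<>();
        toVisit.add(seed);
        seen.add(seed);
        while (!toVisit.isEmpty() && visited.size() < maxPages) {
            String url = toVisit.poll();
            String page = downloader.apply(url);      // web page downloader
            if (page == null) {
                continue;                             // skip failed downloads
            }
            visited.add(url);
            for (String link : parser.apply(page)) {  // web page parser
                if (seen.add(link)) {                 // add() returns false for duplicates
                    toVisit.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // A fake three-page "site" stands in for the network:
        // each page's content is a space-separated list of links.
        Map<String, String> site = Map.of("a", "b c", "b", "a c", "c", "");
        List<String> order = crawl("a", 10, site::get,
                page -> page.isEmpty()
                        ? Collections.<String>emptyList()
                        : Arrays.asList(page.split(" ")));
        System.out.println(order);
    }
}
```

In a real crawler the two functions would be replaced by one of the downloaders and parsers described in the following subsections.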

URL Manager

The URL manager keeps track of the URLs the crawler has to visit: it records which URLs have already been crawled and which are still waiting. There are generally two ways to implement a URL manager:

(1) In-memory URL manager: Use a Set or Queue to record the URLs that have been crawled and the URLs to be crawled.

(2) Database URL manager: Store the URLs that have been crawled and those to be crawled in the database.
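
A minimal sketch of the in-memory variant, assuming a Queue for the pending URLs and a Set for de-duplication as described in (1). The class name UrlManager and its methods are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical in-memory URL manager; class and method names are illustrative.
class UrlManager {
    private final Queue<String> pending = new ArrayDeque<>(); // URLs waiting to be crawled
    private final Set<String> known = new HashSet<>();        // every URL ever added

    /** Queues the URL unless it was added before; returns true if it was queued. */
    boolean add(String url) {
        if (url == null || url.isEmpty() || !known.add(url)) {
            return false;
        }
        return pending.offer(url);
    }

    boolean hasNext() {
        return !pending.isEmpty();
    }

    /** Returns the next URL to crawl, or null if none is left. */
    String next() {
        return pending.poll();
    }

    public static void main(String[] args) {
        UrlManager manager = new UrlManager();
        manager.add("http://www.example.com/");
        manager.add("http://www.example.com/about");
        manager.add("http://www.example.com/");   // duplicate, ignored
        while (manager.hasNext()) {
            System.out.println(manager.next());
        }
    }
}
```

Because the known set is never cleared, a URL that has already been crawled is not queued again. A database-backed manager could expose the same three methods while keeping its state across restarts.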

Web page downloader

The web page downloader is the core part of the crawler and is responsible for downloading web pages from the Internet. Java crawlers generally have two implementation methods:

(1) URLConnection: implemented using the URLConnection class, which is relatively simple to use. The core code is as follows:

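import java.io.*;
import java.net.*;
import java.nio.charset.StandardCharsets;

URL url = new URL("http://www.example.com");
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setConnectTimeout(5000);   // milliseconds
conn.setReadTimeout(5000);
// try-with-resources closes the reader and its stream; UTF-8 is assumed here
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
    String line;
    while ((line = reader.readLine()) != null) {
        System.out.println(line);
    }
} finally {
    conn.disconnect();
}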

(2) HttpClient: implemented using the HttpClient framework, which is more powerful than URLConnection and offers more convenient handling of HTTP header information such as cookies and a custom User-Agent. The core code is as follows:

// Uses the legacy Apache Commons HttpClient 3.x API (org.apache.commons.httpclient)
HttpClient httpClient = new HttpClient();
GetMethod getMethod = new GetMethod("http://www.example.com");
try {
    int status = httpClient.executeMethod(getMethod);
    if (status == HttpStatus.SC_OK) {
        InputStream in = getMethod.getResponseBodyAsStream();
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
    }
} finally {
    getMethod.releaseConnection();   // always release the connection when done
}
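
The HttpClient, GetMethod and executeMethod calls above belong to the legacy Apache Commons HttpClient 3.x library. As an alternative that needs no extra dependency, the sketch below uses the java.net.http client that ships with the JDK since Java 11 to set a custom User-Agent and a Cookie header. This is a swapped-in technique, not the article's original method; the class name HeaderDemo and the header values are placeholders.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Sketch using the JDK's java.net.http client (Java 11+), swapped in here
// for the Commons HttpClient used in the article. Header values are placeholders.
class HeaderDemo {
    /** Builds a GET request that carries a custom User-Agent and a Cookie header. */
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(10))
                .header("User-Agent", "Mozilla/5.0 (compatible; DemoCrawler/1.0)")
                .header("Cookie", "sessionid=placeholder")
                .GET()
                .build();
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = buildRequest("http://www.example.com");
        // Sends the request and reads the whole response body as a String.
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() == 200) {
            System.out.println(response.body());
        }
    }
}
```

Keeping the request construction in its own method makes it easy to check which headers will be sent before any network traffic happens.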

Web page parser

After a web page has been downloaded, a web page parser is used to extract the data from it. There are generally two ways to implement one in Java:

(1) Regular expression: Use regular expressions to match data in web pages. The core code is as follows:

String pattern = "<title>(.*?)</title>";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(html);
if (m.find()) {
    System.out.println(m.group(1));
}
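
Calling find() in a loop collects every match instead of only the first one. The sketch below uses this to pull all link targets out of a page. The class name LinkRegex is hypothetical, and the pattern deliberately handles only double-quoted href values, so it should be read as an illustration rather than a robust HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkRegex {
    // Matches <a ... href="..."> with a double-quoted value only (an assumption
    // made to keep the pattern short); group 1 is the URL.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /** Returns the href value of every matching anchor tag, in document order. */
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {          // each find() advances to the next match
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"/a.html\">A</a> "
                + "<A class=\"x\" HREF=\"http://www.example.com/b\">B</A></p>";
        System.out.println(extractLinks(html));
    }
}
```

Regular expressions work well for small, predictable fragments like this, but they become fragile on irregular markup, which is where an HTML parser is the better fit.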

(2) HTML parser: implemented using the Jsoup library, which parses the web page into a DOM structure; the data can then be obtained through CSS selectors or XPath-like methods. The core code is as follows:

Document doc = Jsoup.connect("http://www.example.com").get();
Elements links = doc.select("a[href]");
for (Element link : links) {
    String text = link.text();
    String href = link.attr("href");
    System.out.println(text + " " + href);
}
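
The href values extracted this way are often relative, so they usually have to be turned into absolute URLs before they are handed back to the URL manager. Jsoup can do this itself through the "abs:href" attribute key. The sketch below shows the same idea with only the JDK's java.net.URI; the class and method names are hypothetical.

```java
import java.net.URI;

class LinkResolver {
    /**
     * Resolves href against the URL of the page it was found on.
     * URI.create throws IllegalArgumentException for malformed input,
     * for example an href that contains spaces.
     */
    static String toAbsolute(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.example.com/docs/index.html";
        System.out.println(toAbsolute(page, "intro.html"));      // sibling page
        System.out.println(toAbsolute(page, "../about.html"));   // parent directory
        System.out.println(toAbsolute(page, "/contact"));        // site root
        System.out.println(toAbsolute(page, "https://other.example.org/x")); // already absolute
    }
}
```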

3. Java crawler practice

Now that we understand the basic ideas and implementation methods of Java crawlers, we can write a simple crawler program that obtains data from a website and analyzes it.

Crawling data

We will crawl the Douban movie rankings. First, we need the URL of the rankings page, shown below:

https://movie.douban.com/chart

Then we can use the Jsoup framework to download the web page and extract the data from it. The code is as follows:

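// The selectors reflect the page layout assumed by this article; adjust them if the markup differs.
Document doc = Jsoup.connect("https://movie.douban.com/chart")
        .userAgent("Mozilla/5.0")   // some sites reject requests without a browser-like User-Agent
        .timeout(10000)             // milliseconds
        .get();
Elements items = doc.select("div.item");
List<Movie> movieList = new ArrayList<>();
for (Element item : items) {
    String title = item.select("div.info div.hd a").text();
    String rating = item.select("div.info div.bd div.star span.rating_num").text();
    Elements paragraphs = item.select("div.info div.bd p");
    if (rating.isEmpty()) {
        continue;   // no rating found; Double.parseDouble("") would throw
    }
    Movie movie = new Movie();
    movie.setTitle(title);
    movie.setRating(Double.parseDouble(rating));
    // "导演: " (director) and "主演: " (starring) are label prefixes in the page text
    movie.setDirector(paragraphs.eq(0).text().replace("导演: ", ""));
    movie.setActor(paragraphs.eq(1).text().replace("主演: ", ""));
    movieList.add(movie);
}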

Here we use a Movie class to store movie information.
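
The Movie class itself is not shown in this article. A minimal sketch, assuming only the four properties whose getters and setters are called by the code in this section, could look like this:

```java
// Assumed shape of the Movie class, inferred from the setters and getters
// that the article's code calls; the original definition is not shown.
class Movie {
    private String title;
    private double rating;
    private String director;
    private String actor;

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }

    public double getRating() { return rating; }
    public void setRating(double rating) { this.rating = rating; }

    public String getDirector() { return director; }
    public void setDirector(String director) { this.director = director; }

    public String getActor() { return actor; }
    public void setActor(String actor) { this.actor = actor; }

    @Override
    public String toString() {
        return title + " (" + rating + "), director: " + director + ", cast: " + actor;
    }

    public static void main(String[] args) {
        Movie movie = new Movie();
        movie.setTitle("Example Title");
        movie.setRating(8.5);
        movie.setDirector("Example Director");
        movie.setActor("Example Actor");
        System.out.println(movie);
    }
}
```

The rating is stored as a primitive double, which matches both the Double value passed to setRating and the setDouble call used when saving to the database.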

Storing data

We store the movie data in a database to make later analysis easier. Here we use JDBC to access the database. The code is as follows:

import java.sql.*;
import java.util.List;

public class DBHelper {
    // Connection settings are placeholders; replace them with your own.
    private static final String DB_URL = "jdbc:mysql://localhost:3306/db";
    private static final String USER = "root";
    private static final String PASS = "password";

    // JDBC 4+ drivers register themselves, so Class.forName is not needed.
    // Throwing SQLException lets callers handle a failed connection
    // instead of receiving null.
    public static Connection getConnection() throws SQLException {
        return DriverManager.getConnection(DB_URL, USER, PASS);
    }

    public static void saveMovies(List<Movie> movieList) {
        try (Connection conn = getConnection();
             PreparedStatement stmt = conn.prepareStatement(
                     "INSERT INTO movie(title,rating,director,actor) VALUES (?,?,?,?)"
             )) {
            for (Movie movie : movieList) {
                stmt.setString(1, movie.getTitle());
                stmt.setDouble(2, movie.getRating());
                stmt.setString(3, movie.getDirector());
                stmt.setString(4, movie.getActor());
                stmt.addBatch();
            }
            stmt.executeBatch();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Analyzing data

Once we have the data, we can analyze it. Here we count the number of films and compute the average rating for each director, keeping only directors with more than one film. The code is as follows:

public class MovieAnalyzer {
    public static void analyzeMovies() {
        try (Connection conn = DBHelper.getConnection();
             Statement stmt = conn.createStatement()) {
            String sql = "SELECT director, COUNT(*) AS cnt, AVG(rating) AS avg_rating " +
                    "FROM movie " +
                    "GROUP BY director " +
                    "HAVING cnt > 1 " +
                    "ORDER BY avg_rating DESC";
            ResultSet rs = stmt.executeQuery(sql);
            while (rs.next()) {
                String director = rs.getString("director");
                int cnt = rs.getInt("cnt");
                double avgRating = rs.getDouble("avg_rating");
                System.out.printf("%-20s %5d %7.2f%n", director, cnt, avgRating);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
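
For small data sets the same statistic can also be computed in memory, without the database. The sketch below mirrors the GROUP BY, HAVING and ORDER BY steps of the query with Java streams. The class name InMemoryAnalyzer and the Film stand-in type are hypothetical and are used only to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

class InMemoryAnalyzer {
    // Minimal stand-in for the article's Movie class (hypothetical).
    static class Film {
        final String director;
        final double rating;

        Film(String director, double rating) {
            this.director = director;
            this.rating = rating;
        }
    }

    /** Returns "director count average" lines for directors with more than one film. */
    static List<String> analyze(List<Film> films) {
        // GROUP BY director, collecting count and average in one pass
        Map<String, DoubleSummaryStatistics> stats = films.stream()
                .collect(Collectors.groupingBy((Film f) -> f.director,
                        Collectors.summarizingDouble((Film f) -> f.rating)));
        List<Map.Entry<String, DoubleSummaryStatistics>> rows =
                new ArrayList<>(stats.entrySet());
        rows.removeIf(e -> e.getValue().getCount() <= 1);            // HAVING cnt > 1
        rows.sort((a, b) -> Double.compare(b.getValue().getAverage(),
                a.getValue().getAverage()));                         // ORDER BY avg DESC
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, DoubleSummaryStatistics> e : rows) {
            lines.add(String.format(Locale.ROOT, "%s %d %.2f",
                    e.getKey(), e.getValue().getCount(), e.getValue().getAverage()));
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Film> films = List.of(
                new Film("Director A", 9.0), new Film("Director A", 8.0),
                new Film("Director B", 7.0),
                new Film("Director C", 9.5), new Film("Director C", 9.0));
        analyze(films).forEach(System.out::println);
    }
}
```

Doing the aggregation in SQL, as in the article's version, remains the better choice once the data no longer fits comfortably in memory.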

With the movie information stored in the database and analyzed, the practical Java crawler application is complete.

4. Summary

This article has introduced the basics of Java crawlers and a practical application, and we hope it helps readers better understand crawler technology and Java programming. In practice, you must respect legal and ethical norms: do not illegally obtain other people's private data or infringe copyrights. You also need to understand the anti-crawler measures that websites use, so that your crawler is not blocked and its IP address is not banned by the sites it visits.