Web crawler - How to use Java to crawl information and build a ranking system? - PHP Chinese Network Q&A

I happen to have an interesting project to work on while learning Java web development.
Our school requires students to check in by card for morning runs. The Sports Department provides a query website, but no API.
I want to build a website/WeChat backend that captures the information from the school website and stores it in a database. Users can then query their running records through my website/WeChat, and the site can display rankings and other features based on these records.

The query only requires the student number and name, and I already have this data.

I have implemented the simulated login with httpclient and can fetch the entire page.
The page displays the records in a table. What should I use to extract the data from the page?

On the Java web side, I only know how to write basic CRUD (create, read, update, delete) with jsp. I don't know much beyond that.

I want to build a backend that manages the crawling and returns results for user queries.
Where should I start learning? What technologies or frameworks are used for this?

About the query website:
One element is the number of runs. It is followed by the corresponding records.
Each record shows the time of the run, to the minute.

Fetching is not the hardest part. The problem is how to build such a management system; I have no idea how to develop a complete full-stack web application.
I just realized I can't comment...

Thanks!

All replies (4)

Ty80 2017-06-12 09:21:16 Floor 4

I'm only saying this casually, because I couldn't think of a better method.

Use Jsoup to extract the page data, haha


代言 2017-06-12 09:21:16 Floor 3

A few points come to mind; briefly:
1. Data capture: you can write your own crawler and define a schedule for when it crawls.
2. Data processing: after fetching the page, use jsoup or another method to extract the useful content, and design the data structure. The student ID should be unique. You can have a student table and a morning-run record table, related through the student ID.
3. Ranking: my personal view is to sort by the number of runs. Sorting by time seems unreasonable, because there is no way to judge the real morning-run time, so I will only discuss sorting by count here. You can store the run count as a field directly in the student table, which reduces queries against the record table and improves efficiency. The cost is that this field has to be maintained whenever data is processed.
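The table design and count-based ranking from points 2 and 3 could be sketched in Java roughly like this. It is only an in-memory illustration: the class name `RunRanking`, the `RunRecord` fields, and the sample student IDs are assumptions made up for the example, not the school site's data or a real schema.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the student table and the run record table.
class RunRanking {

    // One row of the assumed run record table, linked to a student by student ID.
    static final class RunRecord {
        final String studentId;
        final String runTime; // minute precision, e.g. "2017-06-12 06:45"

        RunRecord(String studentId, String runTime) {
            this.studentId = studentId;
            this.runTime = runTime;
        }
    }

    // Counts runs per student ID. This is the value a run-count column
    // in the student table would hold.
    static Map<String, Integer> countRuns(List<RunRecord> records) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (RunRecord r : records) {
            counts.merge(r.studentId, 1, Integer::sum);
        }
        return counts;
    }

    // Orders student IDs by run count, highest first; ties are broken by student ID.
    static List<String> rank(Map<String, Integer> counts) {
        List<String> ids = new ArrayList<>(counts.keySet());
        ids.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return ids;
    }

    public static void main(String[] args) {
        List<RunRecord> records = Arrays.asList(
                new RunRecord("S002", "2017-06-10 06:40"),
                new RunRecord("S001", "2017-06-10 06:42"),
                new RunRecord("S001", "2017-06-12 06:45"));
        Map<String, Integer> counts = countRuns(records);
        for (String id : rank(counts)) {
            System.out.println(id + " " + counts.get(id));
        }
    }
}
```

In a real database the same result would come from either a grouped count over the record table or the maintained count field in the student table; which one fits better depends on how often the crawler updates the data.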

三叔 2017-06-12 09:21:16 Floor 2

Generally speaking, tools like httpclient are used to fetch the response and parse the message entity (here, the html page). The next step is to use xpath, regular expressions, or jQuery-like methods to parse the DOM elements and get the data you want (for example, with the jsoup package). If that is still too much trouble, you can use the webmagic framework.

巴扎黑 2017-06-12 09:21:16 Floor 1

Simulated login: open the login page in a browser and find the url that receives the student ID and password; post the data to that url to simulate the login; then parse the Set-Cookie header from the response.
Data capture: send a get request to the sports data page (carrying the cookie obtained in the previous step), get the response, and then parse it with regular expressions to obtain the data.
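The two parsing steps described above (taking the cookie out of the Set-Cookie header, then pulling the table cells out of the page with a regular expression) might look something like the sketch below. The network calls are left out, and the class name `PageParsing`, the cookie name, and the sample html are assumptions for illustration; the real page markup may need a different pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the real header value and page markup are not known here.
class PageParsing {

    // Matches the text of each <td>...</td> cell.
    // Assumes simple cells with no nested tags inside them.
    private static final Pattern CELL =
            Pattern.compile("<td[^>]*>\\s*([^<]*?)\\s*</td>", Pattern.CASE_INSENSITIVE);

    // Keeps only the "name=value" part of a Set-Cookie header value,
    // dropping attributes such as Path or HttpOnly.
    static String cookiePair(String setCookieHeader) {
        int semicolon = setCookieHeader.indexOf(';');
        String pair = semicolon >= 0
                ? setCookieHeader.substring(0, semicolon)
                : setCookieHeader;
        return pair.trim();
    }

    // Returns the trimmed text of every table cell, in document order.
    static List<String> extractCells(String html) {
        List<String> cells = new ArrayList<>();
        Matcher m = CELL.matcher(html);
        while (m.find()) {
            cells.add(m.group(1));
        }
        return cells;
    }

    public static void main(String[] args) {
        // Sample inputs invented for the example.
        String header = "JSESSIONID=abc123; Path=/; HttpOnly";
        String html = "<table><tr><td>1</td><td> 2017-06-12 06:45 </td></tr></table>";
        System.out.println(cookiePair(header));
        System.out.println(extractCells(html));
    }
}
```

The value returned by `cookiePair` is what would be sent back in the `Cookie` request header on the get request. Regular expressions are fragile against markup changes, so an HTML parser such as jsoup is usually the safer choice once the page gets more complex.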

Recommendation: cache the data users query, for example for 2 hours; redis is recommended for this. The database can store the queried data. On each query, first try to get the data from redis; if it is not there, simulate the login to fetch fresh data. As for the database layer, I personally feel it is not strictly necessary, but if you have one, you can also do data analysis and so on.
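The "check the cache first, fetch on a miss" flow recommended above can be sketched as follows. Note that this uses a plain `ConcurrentHashMap` as a stand-in for redis so the example runs without a server; the class name `TtlCache` and the loader function are assumptions, and with real redis the expiry would be handled by the server (a key with a time-to-live) rather than by this code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// In-memory stand-in for redis, for illustration only.
class TtlCache {

    // A cached value together with the time at which it stops being valid.
    private static final class Entry {
        final String value;
        final long expiresAtMillis;

        Entry(String value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> store = new ConcurrentHashMap<>();
    private final long ttlMillis;

    TtlCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    // Returns the cached value if it has not expired; otherwise calls the loader
    // (e.g. simulated login plus page fetch), caches the result, and returns it.
    // The current time is passed in so the behaviour is easy to test.
    String get(String key, long nowMillis, Function<String, String> loader) {
        Entry cached = store.get(key);
        if (cached != null && nowMillis < cached.expiresAtMillis) {
            return cached.value;
        }
        String fresh = loader.apply(key);
        store.put(key, new Entry(fresh, nowMillis + ttlMillis));
        return fresh;
    }

    public static void main(String[] args) {
        TtlCache cache = new TtlCache(2L * 60 * 60 * 1000); // 2 hours
        Function<String, String> loader = id -> "records-for-" + id; // placeholder fetch
        System.out.println(cache.get("S001", System.currentTimeMillis(), loader));
    }
}
```

The first query for a student triggers the loader; further queries within the 2 hours are served from the cache, which keeps the load on the school website low.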