JSoup can send HTTP requests for you, capture the returned HTML in a Document object, and then provide a set of jQuery-like APIs for querying and parsing the information in that document.
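As a minimal sketch of that workflow (assuming jsoup is on the classpath; the HTML snippet and selectors here are made up for illustration), you can parse a document and query it with CSS selectors. In a real crawl you would fetch over the network with `Jsoup.connect(url).get()` instead of parsing an inline string:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupQueryDemo {
    public static void main(String[] args) {
        // Real crawl: Document doc = Jsoup.connect("https://example.com").get();
        // Inline snippet here so the example runs offline.
        String html = "<html><head><title>Demo</title></head>"
                + "<body><a href='/page/1'>first</a><a href='/page/2'>second</a></body></html>";
        Document doc = Jsoup.parse(html);

        System.out.println(doc.title());
        // jQuery-like CSS selector: every anchor that has an href attribute
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("href") + " -> " + link.text());
        }
    }
}
```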
Each site has its own URL scheme, or JSON/JSONP requests, for paging. You have to work out and handle that part yourself.
You can also use an HTTP library such as HttpClient to fetch the raw HTML, build a JSoup Document from it, let JSoup parse the content, and then save the result to whatever persistence layer you prefer (local file, database, in-memory store...).
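That fetch-parse-persist pipeline might look like the following sketch (the URL and file name are placeholders; it uses the JDK 11+ `java.net.http.HttpClient` for the fetch step, but Apache HttpClient would play the same role):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FetchParsePersist {
    // Step 1: fetch the raw HTML with an HTTP client (not with JSoup).
    static String fetch(String url) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Step 2: hand the raw HTML to JSoup and query the resulting Document.
    static String extractTitle(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    public static void main(String[] args) throws Exception {
        String html = fetch("https://example.com");   // placeholder URL
        String title = extractTitle(html);
        // Step 3: persist however you like -- a local file in this sketch.
        Files.writeString(Path.of("title.txt"), title);
    }
}
```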
Whether a page can be crawled at all, or whether it must be fetched through a proxy (that is, how to defeat anti-crawling measures), is not JSoup's job, just as HttpClient is responsible for fetching content but never parses it.
A crawler usually fetches a seed page first; the seed contains the URL patterns for all the other pages, and the crawler then reaches the rest of the site through it.
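The seed step can be sketched like this (the seed HTML, base URI, and page URLs are invented for illustration): extract every link from the seed page, resolving relative URLs against the site's base, then fetch each resulting URL in turn.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SeedCrawler {
    // Collect the URLs the seed page links to; each one would then be
    // fetched and parsed in the same way as the seed itself.
    static List<String> extractLinks(String seedHtml, String baseUri) {
        Document seed = Jsoup.parse(seedHtml, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element a : seed.select("a[href]")) {
            urls.add(a.absUrl("href"));   // resolve relative links against the base URI
        }
        return urls;
    }

    public static void main(String[] args) {
        String seedHtml = "<a href='/list?page=1'>p1</a><a href='/list?page=2'>p2</a>";
        List<String> urls = extractLinks(seedHtml, "https://example.com");
        // Each URL would now be fetched, e.g. with Jsoup.connect(url).get()
        System.out.println(urls);
    }
}
```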