If by "all the data" you mean everything under a single small domain, and you don't want to dig into the underlying principles, then just learn Scrapy.
If you mean the data of the entire web, and you want to reason about things like whether to crawl breadth-first or depth-first, then the first thing you need is 10,000+ servers.
If the site's structure is simple and repetitive, you can first work out the pattern of the page-number URLs, read the total number of pages from the first page, and then construct the URLs of the remaining pages yourself.
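For example, here is a minimal sketch of that approach in Java (the listing URL pattern and the "total-pages" marker are made up for illustration; a real site will have its own pattern to reverse-engineer):

import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.io.IOUtils;

public class PageNumberCrawler {
    public static void main(String[] args) throws Exception {
        // Hypothetical listing URL pattern: page N lives at /list?page=N.
        String base = "http://www.xxx.com/list?page=";

        // Fetch the first page and read the total page count out of it.
        // The "total-pages" marker is invented; use whatever the real page exposes.
        String firstPage = IOUtils.toString(new URL(base + 1), StandardCharsets.UTF_8);
        Matcher m = Pattern.compile("total-pages\">(\\d+)<").matcher(firstPage);
        int totalPages = m.find() ? Integer.parseInt(m.group(1)) : 1;

        // Construct every other page URL directly; no link discovery needed.
        for (int page = 2; page <= totalPages; page++) {
            String html = IOUtils.toString(new URL(base + page), StandardCharsets.UTF_8);
            System.out.println("page " + page + ": " + html.length() + " chars");
            // ... parse the records you need out of html here ...
        }
    }
}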
First, a quick outline of how crawling works. If the page URLs are very simple, like www.xxx.com/post/1.html, you can crawl them with a plain loop or a bit of recursion.
If the page URLs are not known in advance, you can parse the links out of the pages you have already crawled and continue from there. In the process you need to record every URL you have crawled, check each newly found link against that record, and only crawl it if it is new.
The overall loop is: fetch a URL -> parse new URLs out of the fetched content -> fetch those URLs -> ... -> stop once you have crawled a certain number of pages or no new links have turned up for a while. A minimal sketch of this loop follows.
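The sketch below uses a queue plus an in-memory visited set (breadth-first rather than literal recursion, which avoids deep call stacks); the start URL, the page limit, and the crude href regex are all placeholders to adapt to the real site:

import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.io.IOUtils;

public class BfsCrawler {
    // Crude extraction of absolute links; a real crawler would use an HTML parser.
    private static final Pattern HREF = Pattern.compile("href=\"(http[^\"]+)\"");

    public static void main(String[] args) throws Exception {
        String start = "http://www.xxx.com/";    // placeholder start URL
        int maxPages = 100;                      // stop condition: crawl at most this many pages

        Set<String> visited = new HashSet<>();   // URLs we have already fetched
        Deque<String> queue = new ArrayDeque<>();// URLs waiting to be fetched
        queue.add(start);

        while (!queue.isEmpty() && visited.size() < maxPages) {
            String url = queue.poll();
            if (!visited.add(url)) {
                continue;                        // already crawled, skip it
            }
            String html;
            try {
                html = IOUtils.toString(new URL(url), StandardCharsets.UTF_8);
            } catch (Exception e) {
                continue;                        // dead link or network error, move on
            }
            System.out.println("crawled " + url + " (" + html.length() + " chars)");

            // Parse new links out of the page and enqueue the unseen ones.
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                String link = m.group(1);
                if (!visited.contains(link)) {
                    queue.add(link);
                }
            }
        }
    }
}

A real crawler would also normalize URLs, stay within the target domain, and throttle requests; this only shows the fetch -> parse -> enqueue -> stop cycle described above.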
Finally, the Python world has a very powerful crawler framework, Scrapy. It packages up practically all the common crawling routines, so with a little study you can use it as your way in.
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import org.apache.commons.io.FileUtils;

public class SpiderDemo {
    public static void main(String[] args) throws IOException {
        // Step 1 (run once): download the homepage HTML to a local file.
        // URL url = new URL("http://www.zhongguoxinyongheimingdan.com");
        // URLConnection connection = url.openConnection();
        // InputStream in = connection.getInputStream();
        // File file = new File("F://a.txt");
        // FileUtils.copyInputStreamToFile(in, file);

        // Step 2: read the saved homepage and split it on "href=" to get the detail-page links.
        File srcDir = new File("F://a.txt");
        String str = FileUtils.readFileToString(srcDir, "UTF-8");
        String[] str1 = str.split("href=");

        // Skip the first few fragments (navigation links) and the trailing one.
        for (int i = 3; i < str1.length - 1; i++) {
            // The substring indexes are hard-coded for this site's fixed-length relative URLs.
            URL url = new URL("http://www.zhongguoxinyongheimingdan.com" + str1[i].substring(1, 27));
            File f = new File("F://abc//" + str1[i].substring(2, 22));
            if (!f.exists()) {
                f.mkdirs(); // create the per-page directory (and any missing parents)

                // Download the detail page into that directory.
                File desc1 = new File(f, str1[i].substring(1, 22) + ".txt");
                URLConnection connection = url.openConnection();
                InputStream in = connection.getInputStream();
                FileUtils.copyInputStreamToFile(in, desc1);

                // Split the detail page on '" src="' to pull out the image URLs.
                String str2 = FileUtils.readFileToString(desc1, "UTF-8");
                String[] str3 = str2.split("\" src=\"");
                for (int j = 1; j < str3.length - 2; j++) {
                    // The image URLs on this site also have a fixed length, so substring works.
                    URL url1 = new URL(str3[j].substring(0, 81));
                    URLConnection connection1 = url1.openConnection();
                    connection1.setDoInput(true);
                    InputStream in1 = connection1.getInputStream();

                    // Save each image, named after a fixed slice of its URL.
                    File desc2 = new File(f, str3[j].substring(44, 76) + ".jpg");
                    FileUtils.copyInputStreamToFile(in1, desc2);
                }
            }
        }
    }
}
The code above is a quick hack that saves all the photos from the China credit blacklist website to local disk. The site itself is simple, but it crashed on the spot while I was crawling it; I didn't know whether to laugh or cry.
Recursion, a message queue, and storage for the pages you have already crawled (Redis or a database).
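On the storage point, here is a rough sketch of keeping the "already crawled" set in Redis instead of in memory, so it can be shared between workers and survives restarts. It assumes the Jedis client and a Redis instance on localhost:6379; the key name crawler:visited is just an example:

import redis.clients.jedis.Jedis;

public class VisitedSet {
    private final Jedis jedis = new Jedis("localhost", 6379);

    // Record a URL as crawled; returns true if it was new,
    // false if it had already been seen (SADD returns 0 in that case).
    public boolean markCrawled(String url) {
        return jedis.sadd("crawler:visited", url) == 1;
    }

    // Check whether a URL has already been crawled.
    public boolean alreadyCrawled(String url) {
        return jedis.sismember("crawler:visited", url);
    }
}

A worker that pops URLs from the message queue would call alreadyCrawled before fetching and markCrawled right after, which is what keeps multiple workers from crawling the same page twice.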
If it is all on one website, just crawl it recursively; why would a single site be something you can't crawl to the end?