python - What is the general approach for a crawler to fetch all the data?
ringa_lee 2017-04-18 10:19:45

For example, a site has a "next page" link. How can I crawl through all of the next pages? Do I use recursion? Doesn't recursion have a depth limit? I'm a beginner and hoping for some pointers.

reply all (6)
大家讲道理

Recursion, a message queue, and storage of already-crawled pages (Redis or a database).
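A minimal sketch of that combination in Python: a plain queue of pending URLs plus a set of already-crawled pages (the set could just as well live in Redis or a database). The start URL and the same-site check are placeholders.

from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

START_URL = "http://www.example.com/"   # hypothetical start page

def crawl(start_url, max_pages=100):
    # Breadth-first crawl: a queue of pending URLs plus a set of pages already seen
    queue = deque([start_url])
    seen = {start_url}                  # in production this set could be Redis or a database table
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        html = requests.get(url, timeout=10).text
        # ... process/store the page content here ...
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link.startswith(start_url) and link not in seen:
                seen.add(link)
                queue.append(link)

if __name__ == "__main__":
    crawl(START_URL)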

巴扎黑

If "all the data" means everything under one small domain, and you don't want to study the underlying principles in detail, then just learn Scrapy.

If "all the data" means data from the entire web, and you want to dig into whether crawling should be breadth-first or depth-first and so on, then you'd first need 10,000+ servers.

刘奇

If it's all on the same website, just crawl it recursively. Why wouldn't you be able to crawl the same site to the end?

巴扎黑

If the site's structure is simple and repetitive, you can first analyze the pattern of the page-number URLs, read the total number of pages from the first page, and then construct the URLs of the other pages yourself.
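A hedged sketch of that approach, assuming a made-up page-number URL pattern like /list?page=N and that the total page count appears somewhere on the first page (the regex below is only an illustration and must be adapted to the real markup):

import re

import requests

BASE = "http://www.example.com/list?page={}"   # hypothetical page-number URL pattern

first_page = requests.get(BASE.format(1), timeout=10).text
# Illustrative only: assume the first page says something like "Page 1 of 42"
match = re.search(r"of (\d+)", first_page)
total_pages = int(match.group(1)) if match else 1

# Construct the remaining page URLs directly instead of following "next page" links
page_urls = [BASE.format(n) for n in range(1, total_pages + 1)]
for url in page_urls:
    html = requests.get(url, timeout=10).text
    # ... parse each listing page here ...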

洪涛

First, let's briefly talk about the idea of crawling. If the page links are very simple, like www.xxx.com/post/1.html, you can crawl them with a simple loop or recursion.
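For that sequentially numbered case, a loop is enough. A tiny sketch with a placeholder range (the domain is the example one from above):

import requests

# Hypothetical sequential post URLs: http://www.xxx.com/post/1.html, /post/2.html, ...
for n in range(1, 51):                      # the upper bound is just an example
    resp = requests.get(f"http://www.xxx.com/post/{n}.html", timeout=10)
    if resp.status_code == 404:             # stop once the posts run out
        break
    # ... parse resp.text here ...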

If the page links are not known in advance, you can parse the links out of each crawled page and continue from there. In the process you need to save the links you've already crawled, check each new link against that record to see whether it has been crawled before, and then keep crawling recursively.

Crawling idea: crawl a URL -> parse new URLs out of the crawled content -> crawl those URLs -> ... -> break out of the recursion once you've crawled a certain number of pages or no new links have appeared for a while.
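A rough sketch of that loop, written recursively to match the description (the site, limits, and stop condition are illustrative). Note that Python caps recursion depth at roughly 1000 frames by default, which is exactly the concern in the question, so the queue-based version shown above is usually safer for deep sites.

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

seen = set()             # URLs that have already been crawled
MAX_PAGES = 200          # illustrative stop condition: break out after this many pages

def crawl(url):
    if url in seen or len(seen) >= MAX_PAGES:
        return           # already crawled, or reached the page limit
    seen.add(url)
    html = requests.get(url, timeout=10).text
    # ... parse/store whatever you need from this page ...
    for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        link = urljoin(url, a["href"])
        if link.startswith("http://www.xxx.com/"):   # stay on the same (hypothetical) site
            crawl(link)

crawl("http://www.xxx.com/post/1.html")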

Finally, the Python world has a very powerful crawler framework, Scrapy. It basically encapsulates all the common crawler routines, and you can pick it up with a little learning.
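For reference, a minimal Scrapy spider along those lines; the site, selectors, and item fields are placeholders, and Scrapy itself deduplicates repeated requests:

import scrapy


class PostSpider(scrapy.Spider):
    name = "posts"
    start_urls = ["http://www.xxx.com/"]     # hypothetical start page

    def parse(self, response):
        # Yield whatever data you want from the current page
        yield {"url": response.url, "title": response.css("title::text").get()}
        # Follow every link on the page and parse it with this same callback
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

It can be run without creating a full project, for example with: scrapy runspider post_spider.py -o posts.json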

阿神

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

import org.apache.commons.io.FileUtils;

public class SpiderDemo {
    public static void main(String[] args) throws IOException {
        // First run (commented out): download the site's front page to F://a.txt
        // URL url = new URL("http://www.zhongguoxinyongheimingdan.com");
        // URLConnection connection = url.openConnection();
        // InputStream in = connection.getInputStream();
        // File file = new File("F://a.txt");
        // FileUtils.copyInputStreamToFile(in, file);

        // Read the saved front page and split it on href= to pick out the detail-page links
        File srcDir = new File("F://a.txt");
        String str = FileUtils.readFileToString(srcDir, "UTF-8");
        String[] str1 = str.split("href=");
        for (int i = 3; i < str1.length - 1; i++) {
            // The substring offsets below are hard-coded for this particular site's markup
            URL url = new URL("http://www.zhongguoxinyongheimingdan.com" + str1[i].substring(1, 27));
            File f = new File("F://abc//" + str1[i].substring(2, 22));
            if (!f.exists()) {
                f.mkdirs();    // create the per-page directory (and any missing parents)

                // Save the detail page as a text file inside that directory
                File desc1 = new File(f, str1[i].substring(1, 22) + ".txt");
                URLConnection connection = url.openConnection();
                InputStream in = connection.getInputStream();
                FileUtils.copyInputStreamToFile(in, desc1);

                // Split the detail page on " src=" to find the image URLs and download each one
                String str2 = FileUtils.readFileToString(desc1, "UTF-8");
                String[] str3 = str2.split("\" src=\"");
                for (int j = 1; j < str3.length - 2; j++) {
                    URL url1 = new URL(str3[j].substring(0, 81));
                    URLConnection connection1 = url1.openConnection();
                    connection1.setDoInput(true);
                    InputStream in1 = connection1.getInputStream();
                    File desc2 = new File(f, str3[j].substring(44, 76) + ".jpg");
                    FileUtils.copyInputStreamToFile(in1, desc2);
                }
            }
        }
    }
}

Simple code that saves all the photos from the China credit blacklist website to local disk. The site itself is simple! But it crashed on the spot while I was crawling it; I was speechless!
