python - Scrapy: how to make Scrapy crawl as many pages as possible
ringa_lee 2017-04-17 17:09:18

I'm using the Scrapy framework to crawl Sina Finance, downloading each article's title and body and saving them to a txt file. Then I want to extract URLs from the current page so Scrapy can keep crawling... The way I imagined it, as long as I don't stop the spider, it should keep looping: extract URLs, download titles and bodies, write them to files. The problem is that my spider only crawls a few hundred pages and then finishes. That's not what I intended, so what should I do to keep it crawling indefinitely?
------------------------ Update ------------------------
I've found that my spider only crawls the qualifying URLs contained on the homepage and never goes any deeper. How do I make it also crawl those nested URLs?
Here is my code:

# -*- coding:utf-8 -*-
# (Python 2 code, as posted)
from scrapy.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request

class Spider(CrawlSpider):
    num = 0  # record article number
    name = "sina"
    download_delay = 1
    allowed_domains = ["sina.com.cn"]
    start_urls = [
        "http://finance.sina.com.cn/"
    ]
    rules = [Rule(LinkExtractor(allow=()), callback='parse', follow=True)]

    def parse(self, response):
        URLgroup = LinkExtractor(allow=()).extract_links(response)
        for URL in URLgroup:
            if 'finance.sina.com.cn' in URL.url:
                # only crawl URLs with a fixed prefix
                yield Request(url=URL.url, callback=self.parse_content)

    def parse_content(self, response):
        content = Selector(response)
        text = content.xpath('//p[@id="artibody"]/p/text()').extract()  # extract body text
        title = content.xpath('//h1/text()').extract()  # extract title
        file_abs = r"C:/Temp/save/article"
        if title and text:
            with open(file_abs + str(self.num) + '.txt', 'w') as f:
                self.num += 1
                f.write('标题:\n')  # '标题' = 'Title'
                for t in title:
                    f.write(t.encode('utf-8'))
                f.write('\n')
                f.write('正文:\n')  # '正文' = 'Body'
                for t in text:
                    f.write(t.encode('utf-8'))
                f.write('\n')
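The stopping behavior described in the question comes down to the crawl frontier draining: once every link reachable from the start page has been visited and no callback yields new requests, the spider finishes. This is a minimal stdlib sketch of that loop, with a hypothetical `fetch_links` function standing in for downloading a page and running a LinkExtractor over it:

```python
from collections import deque

def crawl(start_urls, fetch_links, allowed_prefix, max_pages=1000):
    """Breadth-first crawl: follow links until the frontier is empty.

    fetch_links(url) -> list of URLs found on that page (a hypothetical
    stand-in for downloading the page and extracting its links).
    """
    seen = set(start_urls)
    frontier = deque(start_urls)
    crawled = []
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        crawled.append(url)
        for link in fetch_links(url):
            # Only follow links under the allowed prefix, and never revisit.
            if allowed_prefix in link and link not in seen:
                seen.add(link)
                frontier.append(link)
    return crawled

# Toy link graph: the homepage links to two pages; page "a" links deeper.
graph = {
    "http://finance.sina.com.cn/": ["http://finance.sina.com.cn/a",
                                    "http://other.example.com/x"],
    "http://finance.sina.com.cn/a": ["http://finance.sina.com.cn/b"],
    "http://finance.sina.com.cn/b": [],
}
pages = crawl(["http://finance.sina.com.cn/"],
              lambda u: graph.get(u, []),
              "finance.sina.com.cn")
```

In the question's code, `parse_content` never yields further requests, so links are only extracted one level away from the start page; once those are exhausted, the crawl ends, which matches the behavior described in the update.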
All replies (2)
小葫芦

Let me answer my own question. In the code above, I never continued extracting URLs from the pages I reached, so to make the crawler meet my requirements I need to use a LinkExtractor: extract the URLs, filter out the ones that don't qualify, and then call the parse function again on each remaining URL. That way the crawler keeps crawling.
A detailed description of LinkExtractor can be found here
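The filter-then-recurse step this answer describes can be sketched independently of Scrapy. Below is a hypothetical `is_allowed` predicate (not from the original post), assuming the same fixed-domain rule as the question's code, using only the standard library:

```python
from urllib.parse import urlparse

def is_allowed(url, allowed_host="finance.sina.com.cn"):
    """Keep only http(s) links whose host matches the allowed domain.

    This is the filtering the answer describes: after extracting candidate
    links from a page, discard the ones that do not qualify before yielding
    new requests back to the parse callback.
    """
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and parts.netloc == allowed_host
```

Parsing the host with `urlparse` is slightly stricter than the substring test in the question's code, which would also match a URL that merely contains the string in its path. Note too that Scrapy's documentation warns against naming a `Rule` callback `parse`, since `CrawlSpider` uses the `parse` method internally to implement its rule handling; pointing the `Rule` at a differently named callback lets the rules follow links on every page, not just the start page.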

    阿神

Use cron under Linux, or scheduled tasks (Task Scheduler) under Windows, to run the crawl on a schedule.
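As an illustration of the cron suggestion, a crontab entry might look like the line below; the project path, log path, and scrapy binary location are assumptions, and "sina" is the spider name from the question's code:

```shell
# Run the spider every day at 03:00 (hypothetical paths; adjust as needed).
0 3 * * * cd /home/user/sinaspider && /usr/local/bin/scrapy crawl sina >> /var/log/sina-crawl.log 2>&1
```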
