In the scrapy-redis framework, all of the requests stored in Redis under xxx:requests have already been crawled, but the program keeps running. How can I make it stop automatically instead of running on empty?
2017-07-03 09:17:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-07-03 09:18:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
You can stop the program by calling engine.close_spider(spider, 'reason'), for example from the scrapy-redis scheduler's next_request():
def next_request(self):
    block_pop_timeout = self.idle_before_close
    request = self.queue.pop(block_pop_timeout)
    if request and self.stats:
        self.stats.inc_value('scheduler/dequeued/redis', spider=self.spider)
    if request is None:
        # Nothing came back from Redis: ask the engine to close the spider.
        self.spider.crawler.engine.close_spider(self.spider, 'queue is empty')
    return request
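For completeness, here is a hedged sketch of how such an override might be wired into a project, assuming scrapy-redis is installed: put the modified next_request() on a subclass of scrapy_redis.scheduler.Scheduler and point the SCHEDULER setting at it. The module path myproject.scheduler and the class name ClosingScheduler are invented for the example, not taken from the question.

# myproject/scheduler.py  (hypothetical module)
from scrapy_redis.scheduler import Scheduler

class ClosingScheduler(Scheduler):
    """scrapy-redis scheduler with the next_request() override shown above."""

    def next_request(self):
        block_pop_timeout = self.idle_before_close
        request = self.queue.pop(block_pop_timeout)
        if request and self.stats:
            self.stats.inc_value('scheduler/dequeued/redis', spider=self.spider)
        if request is None:
            # Nothing left in Redis: ask the engine to close the spider.
            self.spider.crawler.engine.close_spider(self.spider, 'queue is empty')
        return request

Then point Scrapy at it in settings.py:

SCHEDULER = 'myproject.scheduler.ClosingScheduler'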
There is another question I don’t understand:
When the spider is closed via engine.close_spider(spider, 'reason'), several errors appear before it actually shuts down.
# Normal shutdown
2017-07-03 18:02:38 [scrapy.core.engine] INFO: Closing spider (queue is empty)
2017-07-03 18:02:38 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'finish_reason': 'queue is empty',
 'finish_time': datetime.datetime(2017, 7, 3, 10, 2, 38, 616021),
 'log_count/INFO': 8,
 'start_time': datetime.datetime(2017, 7, 3, 10, 2, 38, 600382)}
2017-07-03 18:02:38 [scrapy.core.engine] INFO: Spider closed (queue is empty)

# After this, a few more errors still appear before the spider closes. Could it be that the
# spider starts several threads crawling together, and once one of them closes the spider,
# the other threads can no longer find it and raise errors?
Unhandled Error
Traceback (most recent call last):
  File "D:/papp/project/launch.py", line 37, in <module>
    process.start()
  File "D:\Program Files\python3\lib\site-packages\scrapy\crawler.py", line 285, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "D:\Program Files\python3\lib\site-packages\twisted\internet\base.py", line 1243, in run
    self.mainLoop()
  File "D:\Program Files\python3\lib\site-packages\twisted\internet\base.py", line 1252, in mainLoop
    self.runUntilCurrent()
--- <exception caught here> ---
  File "D:\Program Files\python3\lib\site-packages\twisted\internet\base.py", line 878, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "D:\Program Files\python3\lib\site-packages\scrapy\utils\reactor.py", line 41, in __call__
    return self._func(*self._a, **self._kw)
  File "D:\Program Files\python3\lib\site-packages\scrapy\core\engine.py", line 137, in _next_request
    if self.spider_is_idle(spider) and slot.close_if_idle:
  File "D:\Program Files\python3\lib\site-packages\scrapy\core\engine.py", line 189, in spider_is_idle
    if self.slot.start_requests is not None:
builtins.AttributeError: 'NoneType' object has no attribute 'start_requests'
How do you know that all of the requests you pushed have been crawled? That is a condition you have to define yourself.
If the condition is not complicated, you can use the built-in CloseSpider extension to shut the spider down:
scrapy.contrib.closespider.CloseSpider
CLOSESPIDER_TIMEOUT
CLOSESPIDER_ITEMCOUNT
CLOSESPIDER_PAGECOUNT
CLOSESPIDER_ERRORCOUNT
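For example, in settings.py (the numbers below are placeholders, not recommendations; a value of 0 leaves that limit disabled):

CLOSESPIDER_TIMEOUT = 3600       # close the spider after it has been open for an hour
CLOSESPIDER_ITEMCOUNT = 10000    # ...or after this many items have been scraped
CLOSESPIDER_PAGECOUNT = 100000   # ...or after this many responses have been downloaded
CLOSESPIDER_ERRORCOUNT = 50      # ...or after this many errors have been raised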
http://scrapy-chs.readthedocs...
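Those counters close the spider after it has done a certain amount of work, but with scrapy-redis the spider often just sits idle waiting on Redis, so none of them fire. One common pattern for "the queue is really empty" is a small extension that listens for the spider_idle signal and closes the spider after several consecutive idle signals. The sketch below is an assumption, not code from this thread; the setting name MAX_IDLE_SIGNALS_BEFORE_CLOSE and the module path myproject.extensions are invented for the example.

# myproject/extensions.py  (hypothetical module)
from scrapy import signals
from scrapy.exceptions import NotConfigured


class CloseWhenRedisIdle:
    """Close the spider after N consecutive spider_idle signals.

    Scrapy fires spider_idle roughly every 5 seconds while the spider has
    nothing to do, so N = 6 means "about 30 seconds with no new requests".
    """

    def __init__(self, crawler, max_idle_signals):
        self.crawler = crawler
        self.max_idle_signals = max_idle_signals
        self.idle_count = 0

    @classmethod
    def from_crawler(cls, crawler):
        max_idle_signals = crawler.settings.getint('MAX_IDLE_SIGNALS_BEFORE_CLOSE', 0)
        if max_idle_signals <= 0:
            raise NotConfigured
        ext = cls(crawler, max_idle_signals)
        crawler.signals.connect(ext.request_scheduled, signal=signals.request_scheduled)
        crawler.signals.connect(ext.spider_idle, signal=signals.spider_idle)
        return ext

    def request_scheduled(self, request, spider):
        # Any request coming out of Redis resets the idle counter.
        self.idle_count = 0

    def spider_idle(self, spider):
        self.idle_count += 1
        if self.idle_count >= self.max_idle_signals:
            # The reason string shows up as finish_reason in the dumped stats.
            self.crawler.engine.close_spider(spider, 'redis queue exhausted')

Enable it in settings.py:

EXTENSIONS = {'myproject.extensions.CloseWhenRedisIdle': 500}
MAX_IDLE_SIGNALS_BEFORE_CLOSE = 6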