In the scrapy-redis framework, the requests stored in redis under xxx:requests have all been crawled, but the program keeps running. How can I make it stop automatically instead of running idle?
2017-07-03 09:17:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-07-03 09:18:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
You can stop the program via engine.close_spider(spider, 'reason').
def next_request(self):
    block_pop_timeout = self.idle_before_close
    request = self.queue.pop(block_pop_timeout)
    if request and self.stats:
        self.stats.inc_value('scheduler/dequeued/redis', spider=self.spider)
    if request is None:
        # Nothing left in the redis queue: ask the engine to close the spider.
        self.spider.crawler.engine.close_spider(self.spider, 'queue is empty')
    return request
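For context, this is the next_request() method of the scrapy-redis scheduler with an extra empty-queue check added. A minimal sketch of how it could be wired in, assuming you put the override in a Scheduler subclass (the module path myproject.scheduler and the class name AutoCloseScheduler are hypothetical):

from scrapy_redis.scheduler import Scheduler

class AutoCloseScheduler(Scheduler):
    # Same logic as above: stop the spider once redis returns no more requests.
    def next_request(self):
        request = self.queue.pop(self.idle_before_close)
        if request and self.stats:
            self.stats.inc_value('scheduler/dequeued/redis', spider=self.spider)
        if request is None:
            self.spider.crawler.engine.close_spider(self.spider, 'queue is empty')
        return request

# settings.py
SCHEDULER = 'myproject.scheduler.AutoCloseScheduler'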
There is another thing I don't understand: when the spider is closed via engine.close_spider(spider, 'reason'), several errors show up before it actually shuts down.
# Normal shutdown
2017-07-03 18:02:38 [scrapy.core.engine] INFO: Closing spider (queue is empty)
2017-07-03 18:02:38 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'finish_reason': 'queue is empty',
'finish_time': datetime.datetime(2017, 7, 3, 10, 2, 38, 616021),
'log_count/INFO': 8,
'start_time': datetime.datetime(2017, 7, 3, 10, 2, 38, 600382)}
2017-07-03 18:02:38 [scrapy.core.engine] INFO: Spider closed (queue is empty)
# After that, a few more errors appear before the spider closes. Does the spider start several
# threads crawling together at startup, so that once one thread closes the spider the others
# can no longer find it and raise errors?
Unhandled Error
Traceback (most recent call last):
File "D:/papp/project/launch.py", line 37, in <module>
process.start()
File "D:\Program Files\python3\lib\site-packages\scrapy\crawler.py", line 285, in start
reactor.run(installSignalHandlers=False) # blocking call
File "D:\Program Files\python3\lib\site-packages\twisted\internet\base.py", line 1243, in run
self.mainLoop()
File "D:\Program Files\python3\lib\site-packages\twisted\internet\base.py", line 1252, in mainLoop
self.runUntilCurrent()
--- <exception caught here> ---
File "D:\Program Files\python3\lib\site-packages\twisted\internet\base.py", line 878, in runUntilCurrent
call.func(*call.args, **call.kw)
File "D:\Program Files\python3\lib\site-packages\scrapy\utils\reactor.py", line 41, in __call__
return self._func(*self._a, **self._kw)
File "D:\Program Files\python3\lib\site-packages\scrapy\core\engine.py", line 137, in _next_request
if self.spider_is_idle(spider) and slot.close_if_idle:
File "D:\Program Files\python3\lib\site-packages\scrapy\core\engine.py", line 189, in spider_is_idle
if self.slot.start_requests is not None:
builtins.AttributeError: 'NoneType' object has no attribute 'start_requests'
How do you know that the queued requests have all been crawled? You have to define that condition yourself (see the sketch below).
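The AttributeError above most likely comes from closing the spider inside the scheduler's next_request(): the engine's _next_request loop can fire once more after close_spider() has already torn down the slot, so self.slot is None when spider_is_idle() runs. A gentler variant (a sketch, not part of scrapy or scrapy-redis; the class name, module path and settings defaults are assumptions) is to close the spider from the spider_idle signal, checking whether the default '<spider>:requests' key still exists in redis:

import redis
from scrapy import signals

class RedisQueueEmptyCloser(object):
    """Extension: close the spider once its redis requests key is empty."""

    def __init__(self, crawler):
        self.crawler = crawler
        self.redis = redis.StrictRedis(
            host=crawler.settings.get('REDIS_HOST', 'localhost'),
            port=crawler.settings.getint('REDIS_PORT', 6379))
        crawler.signals.connect(self.spider_idle, signal=signals.spider_idle)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def spider_idle(self, spider):
        # '<name>:requests' is the default scrapy-redis key pattern; an empty
        # list/zset key does not exist in redis, so exists() == 0 means "done".
        if not self.redis.exists('%s:requests' % spider.name):
            self.crawler.engine.close_spider(spider, 'redis queue is empty')

# settings.py (module path is hypothetical)
EXTENSIONS = {'myproject.extensions.RedisQueueEmptyCloser': 500}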
If your requirement is nothing complicated, you can use the built-in extension to shut the spider down:
scrapy.contrib.closespider.CloseSpider
CLOSESPIDER_TIMEOUT
CLOSESPIDER_ITEMCOUNT
CLOSESPIDER_PAGECOUNT
CLOSESPIDER_ERRORCOUNT
http://scrapy-chs.readthedocs...
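These are plain Scrapy settings; a sketch of example values for settings.py (the thresholds are arbitrary illustrations). Note that they close the spider on time or count limits, not on the redis queue actually being empty:

# settings.py -- example thresholds only
CLOSESPIDER_TIMEOUT = 3600       # close the spider after it has run for 3600 seconds
CLOSESPIDER_ITEMCOUNT = 10000    # ... or after 10000 items have been scraped
CLOSESPIDER_PAGECOUNT = 50000    # ... or after 50000 responses have been crawled
CLOSESPIDER_ERRORCOUNT = 10      # ... or after 10 errors have been raised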