我的爬蟲程式碼如下,其中rules無獲取,不知道是什麼問題?
#encoding: utf-8
import re
import requests
import time
from bs4 import BeautifulSoup
import scrapy
from scrapy.http import Request
from craler.items import CralerItem
import urllib2
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
class MoyanSpider(CrawlSpider):
try:
name = 'maoyan'
allowed_domains = ["http://maoyan.com"]
start_urls = ['http://maoyan.com/films']
rules = (
Rule(LinkExtractor(allow=(r"films/\d+.*")), callback='parse_item', follow=True),
)
except Exception, e:
print e.message
#
# def start_requests(self):
# for i in range(22863):
# url = self.start_urls + str(i*30)
#
# yield Request(url,self.parse, headers=self.headers)
def parse_item(self, response):
item = CralerItem()
# time.sleep(2)
# moveis = BeautifulSoup(response.text, 'lxml').find("p",class_="movies-list").find_all("dd")
try:
time.sleep(2)
item['name'] = response.find("p",class_="movie-brief-container").find("h3",class_="name").get_text()
item['score'] = response.find("p",class_="movie-index-content score normal-score").find("span",class_="stonefont").get_text()
url = "http://maoyan.com"+response.find("p",class_="channel-detail movie-item-title").find("a")["href"]
#item['url'] = url
item['id'] = response.url.split("/")[-1]
# html = requests.get(url).content
# soup = BeautifulSoup(html,'lxml')
temp= response.find("p","movie-brief-container").find("ul").get_text()
temp = temp.split('\n')
#item['cover'] = soup.find("p","avater-shadow").find("img")["src"]
item['tags'] = temp[1]
item['countries'] = temp[3].strip()
item['duration'] = temp[4].split('/')[-1]
item['time'] = temp[6]
#print item['name']
return item
except Exception, e:
print e.message
執行報錯的提醒:
C:\Python27\python.exe "C:\Program Files (x86)\JetBrains\PyCharm Community Edition 2016.2.2\helpers\pydev\pydevd.py" --multiproc --qt-support --client 127.0.0.1 --port 12779 --file D:/scrapy/craler/entrypoint.py
pydev debugger: process 30468 is connecting
Connected to pydev debugger (build 162.1967.10)
D:/scrapy/craler\craler\spiders\maoyan.py:12: ScrapyDeprecationWarning: Module `scrapy.contrib.linkextractors` is deprecated, use `scrapy.linkextractors` instead
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
D:/scrapy/craler\craler\spiders\maoyan.py:12: ScrapyDeprecationWarning: Module `scrapy.contrib.linkextractors.sgml` is deprecated, use `scrapy.linkextractors.sgml` instead
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
2017-05-08 21:58:14 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: craler)
2017-05-08 21:58:14 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'craler.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['craler.spiders'], 'HTTPCACHE_ENABLED': True, 'BOT_NAME': 'craler', 'COOKIES_ENABLED': False, 'DOWNLOAD_DELAY': 3}
2017-05-08 21:58:14 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2017-05-08 21:58:14 [py.warnings] WARNING: D:/scrapy/craler\craler\middlewares.py:11: ScrapyDeprecationWarning: Module `scrapy.contrib.downloadermiddleware.useragent` is deprecated, use `scrapy.downloadermiddlewares.useragent` instead
from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware
2017-05-08 21:58:14 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'craler.middlewares.RotateUserAgentMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats',
'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware']
2017-05-08 21:58:15 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-05-08 21:58:15 [scrapy.middleware] INFO: Enabled item pipelines:
['craler.pipelines.CralerPipeline']
2017-05-08 21:58:15 [scrapy.core.engine] INFO: Spider opened
2017-05-08 21:58:15 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-05-08 21:58:15 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-05-08 21:58:15 [root] INFO: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)
2017-05-08 21:58:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://maoyan.com/robots.txt> (referer: None) ['cached']
2017-05-08 21:58:15 [root] INFO: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50
2017-05-08 21:58:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://maoyan.com/films> (referer: None) ['cached']
2017-05-08 21:58:15 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'maoyan.com': <GET http://maoyan.com/films/248683>
2017-05-08 21:58:15 [scrapy.core.engine] INFO: Closing spider (finished)
2017-05-08 21:58:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 534,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 6913,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 5, 8, 13, 58, 15, 357000),
'httpcache/hit': 2,
'log_count/DEBUG': 4,
'log_count/INFO': 9,
'log_count/WARNING': 1,
'offsite/domains': 1,
'offsite/filtered': 30,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 5, 8, 13, 58, 15, 140000)}
2017-05-08 21:58:15 [scrapy.core.engine] INFO: Spider closed (finished)
Process finished with exit code 0
主要是
allow_domains
的問題,你的提取規則是沒問題的,程式碼這樣寫就能抓連結了主要就是
allow_domain
别带上http://
字串。另外,你的解析模組有點問題,我沒給你修改,有數據了自己應該也能改。
另外,吐槽一下前面的同學,根本就沒調試人家的程式碼,也這樣強答,明顯在誤導人嘛
有幾個模組組件已經棄用了,讓你換個別的相似模組使用
只是警告,沒有錯誤。可能你爬取的網站做了防爬蟲措施,導致你無法正常取得。