python - Using CrawlSpider in scrapy, urls cannot be matched
为情所困
为情所困 2017-05-18 10:51:02
0
3
871

My crawler code is as follows. The rules are not obtained. I don’t know what the problem is?

#encoding: utf-8
import re
import requests
import time
from bs4 import BeautifulSoup
import scrapy
from scrapy.http import Request
from craler.items import CralerItem
import urllib2
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
class MoyanSpider(CrawlSpider):
    try:
        name = 'maoyan'
        allowed_domains = ["http://maoyan.com"]
        start_urls = ['http://maoyan.com/films']
        
        rules = (
            Rule(LinkExtractor(allow=(r"films/\d+.*")), callback='parse_item', follow=True),
        )
    except Exception, e:
        print e.message
    # 
    # def start_requests(self):
    #     for i in range(22863):
    #         url = self.start_urls + str(i*30)
    #         
    #         yield Request(url,self.parse, headers=self.headers)

    def parse_item(self, response):
        item = CralerItem()
        # time.sleep(2)
        # moveis = BeautifulSoup(response.text, 'lxml').find("p",class_="movies-list").find_all("dd")
      
        try:
       
            time.sleep(2)
            item['name'] = response.find("p",class_="movie-brief-container").find("h3",class_="name").get_text()
            item['score'] = response.find("p",class_="movie-index-content score normal-score").find("span",class_="stonefont").get_text()
            url = "http://maoyan.com"+response.find("p",class_="channel-detail movie-item-title").find("a")["href"]
            #item['url'] = url
            item['id'] = response.url.split("/")[-1]
            # html = requests.get(url).content
            # soup = BeautifulSoup(html,'lxml')
            temp= response.find("p","movie-brief-container").find("ul").get_text()
            temp = temp.split('\n')
            #item['cover'] = soup.find("p","avater-shadow").find("img")["src"]
            item['tags'] = temp[1]
            item['countries'] = temp[3].strip()
            item['duration'] = temp[4].split('/')[-1]
            item['time'] = temp[6]
            #print item['name']
            return item
        except Exception, e:
            print e.message

            

Run error reminder:

C:\Python27\python.exe "C:\Program Files (x86)\JetBrains\PyCharm Community Edition 2016.2.2\helpers\pydev\pydevd.py" --multiproc --qt-support --client 127.0.0.1 --port 12779 --file D:/scrapy/craler/entrypoint.py
pydev debugger: process 30468 is connecting

Connected to pydev debugger (build 162.1967.10)
D:/scrapy/craler\craler\spiders\maoyan.py:12: ScrapyDeprecationWarning: Module `scrapy.contrib.linkextractors` is deprecated, use `scrapy.linkextractors` instead
  from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
D:/scrapy/craler\craler\spiders\maoyan.py:12: ScrapyDeprecationWarning: Module `scrapy.contrib.linkextractors.sgml` is deprecated, use `scrapy.linkextractors.sgml` instead
  from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
2017-05-08 21:58:14 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: craler)
2017-05-08 21:58:14 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'craler.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['craler.spiders'], 'HTTPCACHE_ENABLED': True, 'BOT_NAME': 'craler', 'COOKIES_ENABLED': False, 'DOWNLOAD_DELAY': 3}
2017-05-08 21:58:14 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-05-08 21:58:14 [py.warnings] WARNING: D:/scrapy/craler\craler\middlewares.py:11: ScrapyDeprecationWarning: Module `scrapy.contrib.downloadermiddleware.useragent` is deprecated, use `scrapy.downloadermiddlewares.useragent` instead
  from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware

2017-05-08 21:58:14 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'craler.middlewares.RotateUserAgentMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats',
 'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware']
2017-05-08 21:58:15 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-05-08 21:58:15 [scrapy.middleware] INFO: Enabled item pipelines:
['craler.pipelines.CralerPipeline']
2017-05-08 21:58:15 [scrapy.core.engine] INFO: Spider opened
2017-05-08 21:58:15 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-05-08 21:58:15 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-05-08 21:58:15 [root] INFO: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)
2017-05-08 21:58:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://maoyan.com/robots.txt> (referer: None) ['cached']
2017-05-08 21:58:15 [root] INFO: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50
2017-05-08 21:58:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://maoyan.com/films> (referer: None) ['cached']
2017-05-08 21:58:15 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'maoyan.com': <GET http://maoyan.com/films/248683>
2017-05-08 21:58:15 [scrapy.core.engine] INFO: Closing spider (finished)
2017-05-08 21:58:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 534,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 6913,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 5, 8, 13, 58, 15, 357000),
 'httpcache/hit': 2,
 'log_count/DEBUG': 4,
 'log_count/INFO': 9,
 'log_count/WARNING': 1,
 'offsite/domains': 1,
 'offsite/filtered': 30,
 'request_depth_max': 1,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 5, 8, 13, 58, 15, 140000)}
2017-05-08 21:58:15 [scrapy.core.engine] INFO: Spider closed (finished)

Process finished with exit code 0
为情所困
为情所困

reply all(3)
世界只因有你

The main problem is allow_domains. Your extraction rules are fine. If you write the code like this, you can capture the link

# encoding: utf-8
import time
from tutorial.items import CrawlerItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MoyanSpider(CrawlSpider):
    name = 'maoyan'
    allowed_domains = ["maoyan.com"]
    start_urls = ['http://maoyan.com/films']

    rules = (
        Rule(LinkExtractor(allow=(r"films/\d+.*")), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print(response.url)
        item = CrawlerItem()
        try:

            time.sleep(2)
            item['name'] = response.text.find("p", class_="movie-brief-container").find("h3", class_="name").get_text()
            item['score'] = response.text.find("p", class_="movie-index-content score normal-score").find("span",
                                                                                                       class_="stonefont").get_text()
            url = "http://maoyan.com" + response.text.find("p", class_="channel-detail movie-item-title").find("a")["href"]
            item['id'] = response.url.split("/")[-1]
            temp = response.text.find("p", "movie-brief-container").find("ul").get_text()
            temp = temp.split('\n')
            item['tags'] = temp[1]
            item['countries'] = temp[3].strip()
            item['duration'] = temp[4].split('/')[-1]
            item['time'] = temp[6]
            return item
        except Exception as e:
            print(e)

Mainly allow_domain别带上http://strings.

In addition, there is something wrong with your parsing module. I have not modified it for you. You should be able to modify it yourself once you have the data.

Also, let me complain about the previous classmate. He didn’t debug his code at all, and he still answered like this. It’s obviously misleading.

習慣沉默

Several module components have been deprecated, allowing you to use similar modules instead

阿神

Just a warning, no errors. Maybe the website you crawled has taken anti-crawler measures, causing you to be unable to obtain it normally.

Latest Downloads
More>
Web Effects
Website Source Code
Website Materials
Front End Template