Release: 2017-04-04
Development Environment:

Python 3.6.0 (currently the latest)

Scrapy 1.3.2 (currently the latest)

Spider
A spider is a class that defines how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e., follow links) and how to extract structured data from its pages (i.e., scrape items). In other words, a spider is where you define the custom behavior for crawling and parsing pages for a particular site (or, in some cases, a group of sites).

For spiders, the scraping cycle goes through something like this:

  1. You first generate the initial requests to crawl the first URLs, and specify a callback function to be called with the responses downloaded from those requests.

     The first requests to perform are obtained by calling the start_requests() method, which (by default) generates a Request for each URL specified in start_urls, with the parse method as the callback function for those requests.

  2. In the callback function, you parse the response (web page) and return dicts with the extracted data, Item objects, Request objects, or an iterable of these objects. Those requests will also contain a callback (possibly the same one); they are then downloaded by Scrapy and their responses handled by the specified callback.

  3. In callback functions, you usually parse the page content with selectors (but you can also use BeautifulSoup, lxml, or any mechanism you prefer) and generate items with the parsed data.

  4. Finally, the items returned from the spider are typically persisted to a database (in some item pipeline) or written to a file using a feed export.

Even though this cycle applies (more or less) to any kind of spider, there are different kinds of default spiders bundled into Scrapy for different purposes. We will talk about these types here.
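Before turning to the spider types, the cycle described above can be pictured with a toy model in plain Python. This is only a sketch and not Scrapy's scheduler: FAKE_SITE, crawl and the (url, callback) tuple standing in for a Request are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (title, outgoing links).
FAKE_SITE = {
    "http://www.example.com/": ("Home", ["http://www.example.com/a"]),
    "http://www.example.com/a": ("Page A", ["http://www.example.com/"]),
}


def parse(url, page):
    """Callback: yield one item (a dict), then follow-up requests."""
    title, links = page
    yield {"url": url, "title": title}
    for link in links:
        yield (link, parse)  # a "request" is modelled as (url, callback)


def crawl(start_urls):
    """Run the cycle: request -> download -> callback -> items/requests."""
    queue = deque((url, parse) for url in start_urls)
    seen, items = set(), []
    while queue:
        url, callback = queue.popleft()
        if url in seen or url not in FAKE_SITE:
            continue  # skip duplicates and unknown pages
        seen.add(url)
        for result in callback(url, FAKE_SITE[url]):
            if isinstance(result, dict):
                items.append(result)  # would go to an item pipeline
            else:
                queue.append(result)  # would be scheduled for download
    return items


print(crawl(["http://www.example.com/"]))
```

The callback returns both items and new requests from one generator, which is the same shape the real parse method takes in the examples later in this article.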
class scrapy.spiders.Spider

This is the simplest spider, and the one from which every other spider must inherit (including spiders that come bundled with Scrapy, as well as spiders that you write yourself). It doesn't provide any special functionality. It just provides a default start_requests() implementation which sends requests from the start_urls spider attribute and calls the spider's parse method for each of the resulting responses.

name
A string that defines the name of this spider. The spider name is how the spider is located (and instantiated) by Scrapy, so it must be unique. However, nothing prevents you from instantiating more than one instance of the same spider. This is the most important spider attribute and it is required.

If the spider scrapes a single domain, a common practice is to name the spider after the domain. So, for example, a spider that crawls mywebsite.com would often be called mywebsite.

Note: In Python 2, this must be ASCII only.


allowed_domains
An optional list of strings containing the domains that this spider is allowed to crawl. Requests for URLs outside the listed domains will not be followed.

start_urls
A list of URLs where the spider will begin to crawl from when no particular URLs are specified.
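The allowed-domains check described above can be pictured with a few lines of standard-library Python. This is a sketch of the idea only, assuming a URL is allowed when its host is a listed domain or one of its subdomains; is_allowed is a hypothetical helper, not a Scrapy function.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is a listed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


allowed = ["example.com"]
print(is_allowed("http://www.example.com/1.html", allowed))
print(is_allowed("http://notexample.com/", allowed))
```

Matching on "." + domain rather than on the bare suffix is what keeps notexample.com from slipping through.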

custom_settings
A dictionary of settings that override the project-wide configuration when running this spider. It must be defined as a class attribute, since the settings are updated before instantiation.

For a list of available built-in settings, see: Built-in settings reference.

crawler
This attribute is set by the from_crawler() class method after initializing the class, and links to the Crawler object to which this spider instance is bound.

Crawlers encapsulate a lot of components in the project for single-entry access (such as extensions, middlewares, signal managers, etc.). See the Crawler API for details.

settings
The configuration for running this spider. This is a Settings instance; see the Settings topic for a detailed introduction.

logger
A Python logger created with the spider's name. You can use it to send log messages, as described in Logging from Spiders.

from_crawler(crawler, *args, **kwargs)
This is the class method used by Scrapy to create your spiders.

You probably won't need to override this directly, since the default implementation acts as a proxy to the __init__() method, calling it with the given arguments args and named arguments kwargs.

Nonetheless, this method sets the crawler and settings attributes in the new instance so they can be accessed later inside the spider's code.

  • Parameters:

    • crawler (Crawler instance) - the crawler to which the spider will be bound

    • args (list) - arguments passed to the __init__() method

    • kwargs (dict) - keyword arguments passed to the __init__() method
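The proxy behaviour of from_crawler() can be sketched in plain Python. The classes below are simplified stand-ins written for illustration (FakeCrawler and SketchSpider are hypothetical names, and this is not Scrapy's own code); they only show the pattern of calling __init__() and then attaching crawler and settings.

```python
class FakeCrawler:
    """Hypothetical stand-in for Scrapy's Crawler; it only carries settings."""

    def __init__(self, settings):
        self.settings = settings


class SketchSpider:
    """Simplified model of a spider base class (not Scrapy's own code)."""

    name = "sketch"

    def __init__(self, *args, **kwargs):
        # Keyword arguments become instance attributes.
        self.__dict__.update(kwargs)

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = cls(*args, **kwargs)       # proxy to __init__()
        spider.crawler = crawler            # bind the crawler
        spider.settings = crawler.settings  # expose its settings
        return spider


spider = SketchSpider.from_crawler(
    FakeCrawler({"DOWNLOAD_DELAY": 2}), category="electronics"
)
print(spider.category, spider.settings["DOWNLOAD_DELAY"])
```

Because crawler and settings are attached after __init__() returns, they are not yet available inside __init__() itself in this model.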

start_requests()
This method must return an iterable with the first requests to crawl for this spider.

If you define start_requests() yourself, start_urls is not needed, and setting it has no effect unless your method reads it.

The default implementation generates a request for each URL in start_urls, but you can override start_requests() to change that. For example, if you need to start by logging in using a POST request, you can:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        # Send the login form first; logged_in handles the response.
        return [scrapy.FormRequest("http://www.example.com/login",
                                   formdata={'user': 'john', 'pass': 'secret'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        # here you would extract links to follow and return Requests for
        # each of them, with another callback
        pass

make_requests_from_url(url)
A method that receives a URL and returns a Request object (or a list of Request objects) to scrape. This method is used to construct the initial requests in the start_requests() method, and is typically used to convert URLs to requests.

Unless overridden, this method returns Requests with the parse() method as their callback function and with the dont_filter parameter enabled (see the Request class for more information).

parse(response)
This is the default callback used by Scrapy to process downloaded responses when their requests don't specify a callback.

The parse method is in charge of processing the response and returning scraped data and/or more URLs to follow. Other request callbacks have the same requirements as the Spider class.

This method, as well as any other request callback, must return an iterable of Request objects and/or dicts or Item objects.

  • Parameters:

    • response (Response) - the response to parse

log(message[, level, component])
A wrapper that sends a log message through the spider's logger, kept for backward compatibility. See Logging from Spiders for details.

closed(reason)
Called when the spider closes. This method provides a shortcut to signals.connect() for the spider_closed signal.

Let's look at an example:

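import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        # Log each response as it arrives.
        self.logger.info('A response from %s just arrived!', response.url)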

Returning multiple requests and items from a single callback:

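import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        # One item (a dict) per <h3> heading on the page.
        for h3 in response.xpath('//h3').extract():
            yield {"title": h3}

        # One follow-up request per link, handled by this same callback.
        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)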

Instead of start_urls you can use start_requests() directly, and to give the data more structure you can use Items:

import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']

    def start_requests(self):
        yield scrapy.Request('http://www.example.com/1.html', self.parse)
        yield scrapy.Request('http://www.example.com/2.html', self.parse)
        yield scrapy.Request('http://www.example.com/3.html', self.parse)

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)

        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)

Spider arguments

Spiders can receive arguments that modify their behavior. Some common uses for spider arguments are to define the start URLs or to restrict the crawl to certain sections of the site, but they can be used to configure any functionality of the spider.

Spider arguments are passed through the crawl command using the -a option. For example:

scrapy crawl myspider -a category=electronics

Spiders can access arguments in their __init__ method:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # Build the start URL from the -a category=... argument.
        self.start_urls = ['http://www.example.com/categories/%s' % category]
        # ...

The default __init__ method takes any spider arguments and copies them to the spider as attributes. The above example could also be written as follows:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        yield scrapy.Request('http://www.example.com/categories/%s' % self.category)

Keep in mind that spider arguments are only strings. The spider will not do any parsing on its own. If you want to set the start_urls attribute from the command line, you have to parse it into a list yourself, using something like ast.literal_eval or json.loads, and then set it as an attribute. Otherwise, you end up iterating over a start_urls string (a very common Python pitfall), so that each character is treated as a separate URL.
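The string pitfall is easy to reproduce outside Scrapy. The sketch below uses only the standard library; parse_url_list is a hypothetical helper name, and the fallback of treating unparseable input as a single URL is an assumption rather than anything Scrapy does for you.

```python
import ast
import json


def parse_url_list(raw):
    """Turn a command-line string into a list of URLs.

    Accepts a JSON or Python-literal list; anything else is treated
    as a single URL.
    """
    for loader in (json.loads, ast.literal_eval):
        try:
            value = loader(raw)
        except (ValueError, SyntaxError):
            continue  # not valid for this loader, try the next one
        if isinstance(value, (list, tuple)):
            return [str(url) for url in value]
        if isinstance(value, str):
            return [value]
    return [raw]


raw = '["http://www.example.com/a", "http://www.example.com/b"]'
print(parse_url_list(raw))
# Iterating over the raw string instead yields single characters:
print(list(raw)[:3])
```

Inside a spider, a helper like this would typically be called in __init__ on the argument value before assigning the result to self.start_urls.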

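A valid use case is to set the HTTP auth credentials used by HttpAuthMiddleware or the user agent used by UserAgentMiddleware:
scrapy crawl myspider -a http_user=myuser -a http_pass=mypassword -a user_agent=mybot

Spider arguments can also be passed through the Scrapyd schedule.json API. See the Scrapyd documentation.


Scrapy comes with some useful generic spiders that you can subclass your own spiders from. Their aim is to provide convenient functionality for a few common scraping cases, such as following all links on a site based on certain rules, crawling from sitemaps, or parsing an XML/CSV feed.

The examples for the following spiders assume a project with a TestItem declared in a myproject.items module: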


import scrapy

class TestItem(scrapy.Item):
    id = scrapy.Field()
    name = scrapy.Field()
    description = scrapy.Field()


class scrapy.spiders.CrawlSpider






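class scrapy.spiders.Rule(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None)

callback is a callable or a string (in which case a method from the spider object with that name will be used) to be called for each link extracted with the specified link_extractor. This callback receives a response as its first argument and must return a list containing Item and/or Request objects (or any subclass of them).


cb_kwargs is a dict containing the keyword arguments to be passed to the callback function.

follow is a boolean which specifies whether links should be followed from each response extracted with this rule. If callback is None, follow defaults to True; otherwise it defaults to False.


process_request is a callable or a string (in which case a method from the spider object with that name will be used) which will be called with every request extracted by this rule, and must return a request or None (to filter out the request).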
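The default for follow stated above can be written out as a tiny function. This is a sketch of the rule as described in the text, with effective_follow as a hypothetical name; it is not taken from Scrapy's source.

```python
def effective_follow(callback=None, follow=None):
    """Resolve the follow flag: explicit value wins, else it depends on callback."""
    if follow is None:
        return callback is None
    return bool(follow)


print(effective_follow())                                    # no callback
print(effective_follow(callback="parse_item"))               # callback given
print(effective_follow(callback="parse_item", follow=True))  # explicit follow
```

In practice this means a rule with a callback stops the crawl at the matched pages unless follow=True is passed explicitly.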



from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from myproject.items import TestItem

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        # TestItem declares the id, name and description fields used below;
        # a bare scrapy.Item() has no fields and would raise KeyError.
        item = TestItem()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item
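The allow and deny patterns in the rules above act as regular-expression filters on URLs. The following sketch shows that filtering idea with the standard re module, assuming a URL is kept when it matches any allow pattern (or allow is empty) and no deny pattern. filter_links and the sample URLs are hypothetical, and the real LinkExtractor does considerably more, such as extracting the links from the HTML in the first place.

```python
import re


def filter_links(urls, allow=(), deny=()):
    """Keep URLs matching any allow pattern (all, if empty) and no deny pattern."""
    allow_res = [re.compile(p) for p in allow]
    deny_res = [re.compile(p) for p in deny]
    kept = []
    for url in urls:
        if allow_res and not any(r.search(url) for r in allow_res):
            continue  # allow list given, but nothing matched
        if any(r.search(url) for r in deny_res):
            continue  # explicitly denied
        kept.append(url)
    return kept


urls = [
    "http://www.example.com/category.php?id=1",
    "http://www.example.com/category.php?page=subsection.php",
    "http://www.example.com/item.php?id=7",
]
print(filter_links(urls, allow=(r"category\.php",), deny=(r"subsection\.php",)))
```

Note the escaped dot in each pattern: an unescaped "." would match any character, so 'category.php' would also match 'categoryXphp'.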



class scrapy.spiders.XMLFeedSpider


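  • iterator

    • 'iternodes' - a fast iterator based on regular expressions

    • 'html' - an iterator which uses Selector. Keep in mind that this uses DOM parsing and must load the whole DOM in memory, which could be a problem for big feeds

    • 'xml' - an iterator which uses Selector. Keep in mind that this uses DOM parsing and must load the whole DOM in memory, which could be a problem for big feeds

itertag = 'product'

namespaces
A list of (prefix, uri) tuples which define the namespaces available in the document that will be processed with this spider. The prefix and uri are used to automatically register namespaces using the register_namespace() method.

You can then specify nodes with namespaces in the itertag attribute.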


class YourSpider(XMLFeedSpider):

    namespaces = [('n', 'http://www.sitemaps.org/schemas/sitemap/0.9')]
    itertag = 'n:url'
    # ...
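The idea of addressing namespaced nodes through a registered prefix can be tried with the standard library alone. This sketch uses xml.etree.ElementTree rather than XMLFeedSpider, and the sample sitemap content is made up for illustration; the prefix n is arbitrary and only has to agree between the mapping and the path.

```python
import xml.etree.ElementTree as ET

# Made-up sample document using the sitemap namespace.
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/</loc></url>
  <url><loc>http://www.example.com/about</loc></url>
</urlset>"""

# prefix -> uri, the same pairs a namespaces list would carry.
namespaces = {"n": "http://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.fromstring(SITEMAP)
locs = [
    url.findtext("n:loc", namespaces=namespaces)
    for url in root.findall("n:url", namespaces)
]
print(locs)
```

Without the prefix mapping, a plain search for 'url' finds nothing here, because every element in the document lives in the sitemap namespace.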



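parse_node(response, selector)
This method is called for the nodes matching the provided tag name (itertag). It receives the response and a Selector for each node. Overriding this method is mandatory; otherwise, your spider won't work. This method must return an Item object, a Request object, or an iterable containing any of them.

process_results(response, results)
This method is called for each result (item or request) returned by the spider, and it is intended to perform any last-minute processing required before returning the results to the framework core, for example setting the item IDs. It receives a list of results and the response which originated those results. It must return a list of results (Items or Requests).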



from scrapy.spiders import XMLFeedSpider
from myproject.items import TestItem

class MySpider(XMLFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.xml']
    iterator = 'iternodes'  # This is actually unnecessary, since it's the default value
    itertag = 'item'

    def parse_node(self, response, node):
        self.logger.info('Hi, this is a <%s> node!: %s', self.itertag, ''.join(node.extract()))

        item = TestItem()
        item['id'] = node.xpath('@id').extract()
        item['name'] = node.xpath('name').extract()
        item['description'] = node.xpath('description').extract()
        return item



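class scrapy.spiders.CSVFeedSpider

headers
A list of the column names in the CSV feed file, used to extract fields from each row.

parse_row(response, row)
Receives a response and a dict (representing each row) with a key for each provided header of the CSV file.

Let's look at an example similar to the previous one, but using CSVFeedSpider: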

from scrapy.spiders import CSVFeedSpider
from myproject.items import TestItem

class MySpider(CSVFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.csv']
    delimiter = ';'
    quotechar = "'"
    headers = ['id', 'name', 'description']

    def parse_row(self, response, row):
        self.logger.info('Hi, this is a row!: %r', row)

        item = TestItem()
        item['id'] = row['id']
        item['name'] = row['name']
        item['description'] = row['description']
        return item
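To see what the delimiter, quotechar and headers values in the example above mean for each row, the same values can be fed to the standard csv module. This is a stand-alone sketch with made-up feed content, not CSVFeedSpider itself.

```python
import csv
import io

# Made-up feed: ';' separates fields and single quotes wrap text values.
FEED = "1;'Widget';'A small widget'\n2;'Gadget';'A shiny; useful gadget'\n"

reader = csv.DictReader(
    io.StringIO(FEED),
    fieldnames=["id", "name", "description"],  # plays the role of headers
    delimiter=";",
    quotechar="'",
)
rows = [dict(row) for row in reader]
print(rows[1])
```

The second row shows why quotechar matters: the semicolon inside the quoted description stays part of the field instead of splitting it.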


class scrapy.spiders.SitemapSpider

It supports nested sitemaps and discovering sitemap URLs from robots.txt.
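Discovering sitemaps from robots.txt comes down to reading its Sitemap: lines. The sketch below shows that step in plain Python; sitemap_urls_from_robots is a hypothetical helper and the robots.txt body is made up, so treat it as an illustration of the idea rather than of Scrapy's parser.

```python
def sitemap_urls_from_robots(robots_text):
    """Collect the URLs declared on 'Sitemap:' lines of a robots.txt body."""
    urls = []
    for line in robots_text.splitlines():
        # Split at the first colon only, so the URL's own colons survive.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


ROBOTS = """User-agent: *
Disallow: /private/
Sitemap: http://www.example.com/sitemap_shops.xml
sitemap: http://www.example.com/sitemap_blog.xml
"""
print(sitemap_urls_from_robots(ROBOTS))
```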



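sitemap_rules
A list of tuples (regex, callback) where:

  • regex is a regular expression to match URLs extracted from sitemaps. regex can be either a str or a compiled regex object.

  • callback is the callback to use for processing the URLs that match the regular expression. callback can be a string (indicating the name of a spider method) or a callable.

For example:

sitemap_rules = [('/product/', 'parse_product')]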
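How a list of (regex, callback) pairs maps a URL to a callback can be sketched with the standard re module. The sketch assumes the rules are tried in order and the first match wins, which is how the Scrapy documentation describes sitemap_rules; pick_callback and the sample URLs are hypothetical.

```python
import re

sitemap_rules = [
    ("/product/", "parse_product"),
    (re.compile("/category/"), "parse_category"),  # compiled regex also works
]


def pick_callback(url, rules):
    """Return the callback of the first rule whose regex matches the URL."""
    for regex, callback in rules:
        if re.search(regex, url):
            return callback
    return None  # no rule matched, so the URL would be skipped


print(pick_callback("http://www.example.com/product/1", sitemap_rules))
print(pick_callback("http://www.example.com/about", sitemap_rules))
```

Because only the first match is used, more specific patterns should be listed before more general ones.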






    <xhtml:link rel="alternate" hreflang="de" href="http://example.com/de"/>

With sitemap_alternate_links set, this would retrieve both URLs. With sitemap_alternate_links disabled, only http://example.com/ would be retrieved.
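The effect of the alternate-links switch can be reproduced with the standard library. In this sketch the surrounding url and loc elements are assumed (the article only shows the xhtml:link line), and urls_for_entry is a hypothetical helper rather than part of SitemapSpider.

```python
import xml.etree.ElementTree as ET

# Assumed full entry around the xhtml:link line shown above.
ENTRY = """<url xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
     xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <loc>http://example.com/</loc>
  <xhtml:link rel="alternate" hreflang="de" href="http://example.com/de"/>
</url>"""

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}


def urls_for_entry(xml_text, alternate_links=False):
    """Return the entry's main URL, plus alternates when requested."""
    entry = ET.fromstring(xml_text)
    urls = [entry.findtext("sm:loc", namespaces=NS)]
    if alternate_links:
        urls += [link.get("href") for link in entry.findall("xhtml:link", NS)]
    return urls


print(urls_for_entry(ENTRY, alternate_links=True))
print(urls_for_entry(ENTRY, alternate_links=False))
```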



Simplest example: process all URLs discovered through sitemaps using the parse callback:

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']

    def parse(self, response):
        pass # ... scrape item here ...


Process some URLs with one callback and other URLs with a different callback:

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']
    # Each URL is routed to a callback by the regex it matches.
    sitemap_rules = [
        ('/product/', 'parse_product'),
        ('/category/', 'parse_category'),
    ]

    def parse_product(self, response):
        pass # ... scrape product ...

    def parse_category(self, response):
        pass # ... scrape category ...

Follow the sitemaps defined in the robots.txt file and only follow sitemaps whose URL contains /sitemap_shop:

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/robots.txt']
    sitemap_rules = [
        ('/shop/', 'parse_shop'),
    ]
    # Only sitemaps whose URL matches this pattern are followed.
    sitemap_follow = ['/sitemap_shops']

    def parse_shop(self, response):
        pass # ... scrape shop here ...


Combine SitemapSpider with other sources of URLs:

import scrapy
from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/robots.txt']
    sitemap_rules = [
        ('/shop/', 'parse_shop'),
    ]

    other_urls = ['http://www.example.com/about']

    def start_requests(self):
        # Start from the sitemap requests, then add the extra URLs.
        requests = list(super(MySpider, self).start_requests())
        requests += [scrapy.Request(x, self.parse_other) for x in self.other_urls]
        return requests

    def parse_shop(self, response):
        pass # ... scrape shop here ...

    def parse_other(self, response):
        pass # ... scrape other here ...


