python - Help with a Scrapy pipeline error
天蓬老师 2017-04-18 10:31:25

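I don't really understand how Scrapy passes data around, and I've been stuck on this for nearly half a month. I've read a lot of material and still don't get it. My fundamentals are weak, so I'm asking for help here.
Leaving customization aside, take Scrapy's default behavior as the example:
What format should the spider return?
A dict? {a:1,b:2,.....}
Or [{a:1,aa:11},{b:2,bb:22},{......}]?
Where does the returned value go?
Is it the item in the code below?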

class pipeline :
    def process_item(self, item, spider):

I'm very much a beginner, but I really want to learn, and I'd appreciate any help. My code is below; please point out its shortcomings.

spider:

# -*- coding: utf-8 -*-
import scrapy
from pm25.items import Pm25Item
import re


class InfospSpider(scrapy.Spider):
    name = "infosp"
    allowed_domains = ["pm25.com"]
    start_urls = ['http://www.pm25.com/rank/1day.html', ]

    def parse(self, response):
        item = Pm25Item()
        re_time = re.compile("\d+-\d+-\d+")
        date = response.xpath("/html/body/p[4]/p/p/p[2]/span").extract()[0] # extract the date separately
        # items = []

        selector = response.selector.xpath("/html/body/p[5]/p/p[3]/ul[2]/li") # select the rows to parse from the response
        for subselector in selector: # parse the rows one at a time
            try: # guard against [0] raising an error
                rank = subselector.xpath("span[1]/text()").extract()[0] 
                quality = subselector.xpath("span/em/text()")[0].extract()
                city = subselector.xpath("a/text()").extract()[0]
                province = subselector.xpath("span[3]/text()").extract()[0]
                aqi = subselector.xpath("span[4]/text()").extract()[0]
                pm25 = subselector.xpath("span[5]/text()").extract()[0]
            except IndexError:
                print(rank,quality,city,province,aqi,pm25)

            item['date'] = re_time.findall(date)[0]
            item['rank'] = rank
            item['quality'] = quality
            item['province'] = city
            item['city'] = province
            item['aqi'] = aqi
            item['pm25'] = pm25
            # items.append(item)

            yield item # I don't understand how to use this or what format comes out;
                       # some tutorials return items instead, so I'd appreciate guidance

pipeline:

import time

class Pm25Pipeline(object):

    def process_item(self, item, spider):
        today = time.strftime("%y%m%d",time.localtime())
        fname = str(today) + ".txt"

        with open(fname,"a") as f:
            for tmp in item: # not sure whether this is written correctly;
                             # my understanding is that what the spider returns is yielded dicts
                             # [{a:1,aa:11},{b:2,bb:22},{......}]
                f.write(tmp["date"] + '\t' +
                        tmp["rank"] + '\t' +
                        tmp["quality"] + '\t' +
                        tmp["province"] + '\t' +
                        tmp["city"] + '\t' +
                        tmp["aqi"] + '\t' +
                        tmp["pm25"] + '\n'
                        )
            f.close()
        return item

items:

import scrapy

class Pm25Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    date = scrapy.Field()
    rank = scrapy.Field()
    quality = scrapy.Field()
    province = scrapy.Field()
    city = scrapy.Field()
    aqi = scrapy.Field()
    pm25 = scrapy.Field()
    pass

Part of the error output from the run:

Traceback (most recent call last):
  File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
    tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '30',
 'city': '新疆',
 'date': '2017-04-02',
 'pm25': '13 ',
 'province': '伊犁哈萨克州',
 'quality': '优',
 'rank': '357'}
Traceback (most recent call last):
  File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
    tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '28',
 'city': '西藏',
 'date': '2017-04-02',
 'pm25': '11 ',
 'province': '林芝',
 'quality': '优',
 'rank': '358'}
Traceback (most recent call last):
  File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
    tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '28',
 'city': '云南',
 'date': '2017-04-02',
 'pm25': '11 ',
 'province': '丽江',
 'quality': '优',
 'rank': '359'}
Traceback (most recent call last):
  File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
    tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '27',
 'city': '云南',
 'date': '2017-04-02',
 'pm25': '15 ',
 'province': '玉溪',
 'quality': '优',
 'rank': '360'}
Traceback (most recent call last):
  File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
    tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '26',
 'city': '云南',
 'date': '2017-04-02',
 'pm25': '10 ',
 'province': '楚雄州',
 'quality': '优',
 'rank': '361'}
Traceback (most recent call last):
  File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
    tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '24',
 'city': '云南',
 'date': '2017-04-02',
 'pm25': '11 ',
 'province': '迪庆州',
 'quality': '优',
 'rank': '362'}
Traceback (most recent call last):
  File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
    tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '22',
 'city': '云南',
 'date': '2017-04-02',
 'pm25': '9 ',
 'province': '怒江州',
 'quality': '优',
 'rank': '363'}
Traceback (most recent call last):
  File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
    tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.engine] INFO: Closing spider (finished)
2017-04-03 10:23:14 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 328,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 38229,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 4, 3, 2, 23, 14, 972356),
 'log_count/DEBUG': 2,
 'log_count/ERROR': 363,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 4, 3, 2, 23, 13, 226730)}
2017-04-03 10:23:14 [scrapy.core.engine] INFO: Spider closed (finished)

I hope to get some help from you all. Thanks again!


All replies (4)
Ty80

Just write the fields directly; no loop is needed. Each item is processed individually, and it is not a list as you assumed:

import time

class Pm25Pipeline(object):

    def process_item(self, item, spider):
        today = time.strftime("%y%m%d", time.localtime())
        fname = str(today) + ".txt"

        with open(fname, "a") as f:
            f.write(item["date"] + '\t' +
                    item["rank"] + '\t' +
                    item["quality"] + '\t' +
                    item["province"] + '\t' +
                    item["city"] + '\t' +
                    item["aqi"] + '\t' +
                    item["pm25"] + '\n'
                    )
        return item
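
If reopening the file for every item becomes a concern, one possible variant is to open it once in `open_spider` and close it in `close_spider`, two optional hooks that Scrapy calls on item pipelines. This is only a sketch: the class name `Pm25FilePipeline`, the `FIELDS` tuple, and the explicit UTF-8 encoding are assumptions for illustration and are not part of the original code.

```python
import time

# Hypothetical helper: the column order used in the original pipeline.
FIELDS = ("date", "rank", "quality", "province", "city", "aqi", "pm25")


class Pm25FilePipeline(object):
    """Illustrative variant that keeps one file handle open for the whole crawl."""

    def open_spider(self, spider):
        # Called once when the spider starts.
        fname = time.strftime("%y%m%d") + ".txt"
        # Assumption: write UTF-8 explicitly so Chinese text is stored the same on any OS.
        self.file = open(fname, "a", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # Called once per yielded item; `item` is a single record, not a list.
        self.file.write("\t".join(item[name] for name in FIELDS) + "\n")
        return item
```

The pipeline still has to be listed under `ITEM_PIPELINES` in the project settings, the same as the original one.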
伊谢尔伦

Search for "TypeError: string indices must be integers" to understand what the error means, then go to the line number given in the traceback and fix the problem there.

大家讲道理

A Scrapy Item is similar to a Python dict, with some extended functionality.

By Scrapy's design, every time the spider produces an Item, that Item is passed to the pipeline for processing. The `for tmp in item` loop you wrote there iterates over the keys of the item dictionary. Those keys are strings, and indexing a string with another string (the `__getitem__` syntax, as in `tmp["pm25"]`) raises an error telling you to use an integer index rather than a string.
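
The flow described above can be imitated without Scrapy. In this rough sketch, `parse`, `CollectPipeline`, and `run` are made-up stand-ins, and `run` is only a simplified picture of what the engine does, not its actual code. The point is that `process_item` is called once per yielded item and receives one dict-like record each time.

```python
def parse(rows):
    """Stand-in for a spider callback: yields one record per table row."""
    for rank, city in rows:
        yield {"rank": rank, "city": city}


class CollectPipeline(object):
    """Hypothetical pipeline that records what process_item receives."""

    def __init__(self):
        self.seen = []

    def process_item(self, item, spider):
        # `item` is one record here, never the whole list of records.
        self.seen.append(item)
        return item


def run(rows, pipeline):
    # Simplified stand-in for the engine: hand over each yielded item in turn.
    for item in parse(rows):
        pipeline.process_item(item, spider=None)


pipeline = CollectPipeline()
run([("357", "新疆"), ("358", "西藏")], pipeline)
print(pipeline.seen)
```

Whether the callback yields items one by one or returns a list of them, each item still reaches `process_item` separately.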

小葫芦

You can think of the item as a dictionary: it is a dict-like class that implements the mapping interface, although it does not actually subclass dict. In your pipeline you iterate over the item directly, so each tmp you get is a key of that dictionary, and its type is str. An operation like tmp['pm25'] therefore raises TypeError: string indices must be integers.
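
The same error can be reproduced with a plain dict, assuming the item behaves like one for iteration and indexing. The sample values are taken from the log in the question; the variable names are only for illustration.

```python
item = {"date": "2017-04-02", "rank": "357", "pm25": "13 "}

# Iterating over a dict yields its keys, not its values or sub-dicts.
keys = [tmp for tmp in item]

error = None
try:
    for tmp in item:
        tmp["pm25"]  # tmp is a str such as "date", so this indexes a string
except TypeError as exc:
    # The exact wording of the message varies slightly between Python versions.
    error = str(exc)

# Indexing the item itself works as expected.
line = "\t".join([item["date"], item["rank"], item["pm25"]])

print(keys)
print(error)
```

Dropping the loop and indexing `item` directly, as in the first reply, avoids the error.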
