I don't really understand the data-passing mechanism, and I've been stuck on this Scrapy question for almost half a month. I've gone through a lot of material and still don't get it. My fundamentals are weak, so I'm here to ask the experienced folks for help!

Taking Scrapy's default setup as the example, with nothing customized:

What format should the object a spider returns have?
A dict, like {a: 1, b: 2, ...}?
Or a list, like [{a: 1, aa: 11}, {b: 2, bb: 22}, ...]?
Where does the returned object get passed to?
Is it the item in the code below?

class Pipeline:
    def process_item(self, item, spider):

I'm a real beginner, but I want to learn, and I hope to get help from you all! Below is my code; please point out its flaws.
spider:
# -*- coding: utf-8 -*-
import scrapy
from pm25.items import Pm25Item
import re

class InfospSpider(scrapy.Spider):
    name = "infosp"
    allowed_domains = ["pm25.com"]
    start_urls = ['http://www.pm25.com/rank/1day.html', ]

    def parse(self, response):
        item = Pm25Item()
        re_time = re.compile(r"\d+-\d+-\d+")
        date = response.xpath("/html/body/p[4]/p/p/p[2]/span").extract()[0]  # parse out the DATE separately
        # items = []
        selector = response.selector.xpath("/html/body/p[5]/p/p[3]/ul[2]/li")  # narrow the parsing scope from the response
        for subselector in selector:  # parse entry by entry within that scope
            try:  # guard against [0] raising IndexError
                rank = subselector.xpath("span[1]/text()").extract()[0]
                quality = subselector.xpath("span/em/text()")[0].extract()
                city = subselector.xpath("a/text()").extract()[0]
                province = subselector.xpath("span[3]/text()").extract()[0]
                aqi = subselector.xpath("span[4]/text()").extract()[0]
                pm25 = subselector.xpath("span[5]/text()").extract()[0]
            except IndexError:
                print(rank, quality, city, province, aqi, pm25)
            item['date'] = re_time.findall(date)[0]
            item['rank'] = rank
            item['quality'] = quality
            item['province'] = city
            item['city'] = province
            item['aqi'] = aqi
            item['pm25'] = pm25
            # items.append(item)
            yield item  # I don't understand how to use this, or what format comes out;
                        # some tutorials return items instead, so I'd appreciate guidance
pipeline:
import time

class Pm25Pipeline(object):
    def process_item(self, item, spider):
        today = time.strftime("%y%m%d", time.localtime())
        fname = str(today) + ".txt"
        with open(fname, "a") as f:
            for tmp in item:  # not sure whether this is right;
                # my understanding is that the item the spider returns is a yielded dict:
                # [{a:1,aa:11},{b:2,bb:22},{......}]
                f.write(tmp["date"] + '\t' +
                        tmp["rank"] + '\t' +
                        tmp["quality"] + '\t' +
                        tmp["province"] + '\t' +
                        tmp["city"] + '\t' +
                        tmp["aqi"] + '\t' +
                        tmp["pm25"] + '\n'
                        )
            f.close()
        return item
items:
import scrapy

class Pm25Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    date = scrapy.Field()
    rank = scrapy.Field()
    quality = scrapy.Field()
    province = scrapy.Field()
    city = scrapy.Field()
    aqi = scrapy.Field()
    pm25 = scrapy.Field()
Part of the error output from running it:
Traceback (most recent call last):
File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '30',
'city': '新疆',
'date': '2017-04-02',
'pm25': '13 ',
'province': '伊犁哈萨克州',
'quality': '优',
'rank': '357'}
2017-04-03 10:23:14 [scrapy.core.engine] INFO: Closing spider (finished)
2017-04-03 10:23:14 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 328,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 38229,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 4, 3, 2, 23, 14, 972356),
'log_count/DEBUG': 2,
'log_count/ERROR': 363,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 4, 3, 2, 23, 13, 226730)}
2017-04-03 10:23:14 [scrapy.core.engine] INFO: Spider closed (finished)
I hope to get help from you all; thanks again!
Just write directly; there is no need for a loop. Items are processed one at a time, not as a list the way you imagined.

Search for "TypeError: string indices must be integers" to understand the problem; find the line number from the traceback and fix it there.

A Scrapy Item is similar to a Python dict, with some extra features on top. By Scrapy's design, every time an Item is generated it can be handed to the pipeline for processing.
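To make that concrete, here is a minimal corrected pipeline, sketched under the assumptions of the question's code (the field names come from Pm25Item and the date-based filename is kept). process_item receives ONE item per call, so the fields are read from item directly, with no loop:

```python
import time

class Pm25Pipeline(object):
    def process_item(self, item, spider):
        # One call per yielded item -- no "for tmp in item" loop needed.
        fname = time.strftime("%y%m%d", time.localtime()) + ".txt"
        fields = ("date", "rank", "quality", "province", "city", "aqi", "pm25")
        line = "\t".join(item[k] for k in fields) + "\n"
        with open(fname, "a") as f:   # the with block closes the file; no f.close() needed
            f.write(line)
        return item                   # pass the item on to any later pipelines
```

The explicit f.close() in your version is also redundant: the with statement already closes the file when the block exits.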
The "for tmp in item" you wrote loops over the keys of the item dictionary, and those keys are strings; when you then apply __getitem__ syntax to one of them, Python expects an integer index, not another string. You can treat item as a dictionary — it is in fact a class derived from dict. When you iterate directly over item in the pipeline, each tmp you get is actually a dictionary key of type str, so an operation like tmp['pm25'] raises "TypeError: string indices must be integers".
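The behaviour is easy to reproduce with a plain dict (a scrapy.Item iterates the same way, since it is dict-like); the field values below are just sample data from your log:

```python
item = {'date': '2017-04-02', 'rank': '357', 'pm25': '13'}

for tmp in item:
    # tmp is each KEY of the dict, i.e. a str such as 'date'
    assert isinstance(tmp, str)

try:
    tmp['pm25']          # indexing a str with a str
except TypeError as e:
    print(e)             # e.g. "string indices must be integers"
```

That is exactly the line your traceback points at: tmp is the string 'pm25' (or another key), not a dict.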