Web crawler - Python beginner asking for help
天蓬老师 2017-04-17 16:22:41

# -*- coding: utf-8 -*-

import scrapy
import codecs
import re

class SinaSpider(scrapy.Spider):

    name = "sina"
    fileout = codecs.open('sina.txt', 'a', 'utf-8')
    allowed_domains = ["sina.cn"]
    start_urls = (
        'http://sports.sina.cn/?vt=4&pos=108&vs=3',
    )

    def parse(self, response):
        type_list = response.xpath('//p[@class="carditems"]//a/@href').extract()
        global url, i
        for i in type_list:
            url = i
            print url
            yield scrapy.Request(i, callback=self.parse_item)

    def parse_item(self, response):
        t = response.xpath('//section[@class="art_main_card j_article_main"]//h1//text()').extract()[0]
        strinfo = re.compile('\s')
        title = strinfo.sub('', t)
        leixing1 = response.xpath('//nav[@class="sinaHead"]//li//text()').extract()[0]
        strinfo = re.compile('\s')
        type1 = strinfo.sub('', leixing1)
        leixing2 = response.xpath('//nav[@class="sinaHead"]//li//text()').extract()[1]
        strinfo = re.compile('\s')
        type2 = strinfo.sub('', leixing2)
        type = type1 + '' + type2
        type2_list = response.xpath('//p[@class="comment-count"]//a//@href').extract()
        for b in type2_list:
            print b
            yield scrapy.Request(b, callback=self.parse2_item)
        self.fileout.write(
            title + '\001' + type + '\001' + comment
        )
        self.fileout.write('\n')

    def parse2_item(self, response):
        global comment
        comment = response.xpath('//p[@class="center_tips"]//p//text()').extract()[0]
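As a side note, `parse_item` above rebuilds the same `re.compile('\s')` pattern three times; it can be compiled once and reused. A small stdlib-only sketch of that cleanup (the sample string and the `strip_ws` helper name are illustrative, not from the spider):

```python
import re

# Compile the whitespace pattern once and reuse it for every field,
# instead of rebuilding it before each sub() call as the spider does.
WS = re.compile(r'\s')

def strip_ws(text):
    # Remove ALL whitespace (spaces, tabs, newlines), matching the
    # spider's strinfo.sub('', t) behaviour.
    return WS.sub('', text)

print(strip_ws('  NBA \t playoff\n title '))  # -> NBAplayofftitle
```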
    

"E:\Program Files (x86)\python27\python.exe" E:/study/CX/python/rexx/sina.py sina

Process finished with exit code 0
How can I solve this problem?

What I want this crawler to do is write each URL (i.e. `i` in `parse`) to the file, and also scrape the content that each URL points to. At the moment every record comes out with the same URL; I hope someone can help me fix this.
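The likely cause is that Scrapy downloads pages asynchronously, so the module-level globals (`url`, `comment`) are overwritten by whichever callback ran last; `parse_item` also writes to the file without waiting for `parse2_item`. The usual Scrapy fix is to pass per-request state through `Request(..., meta=...)` and read it back from `response.meta`. Below is a stdlib-only simulation of the difference (not real Scrapy code: plain dicts stand in for Request/Response, and the URLs and titles are made up):

```python
# Simulates why the spider's shared global ends up holding only the
# value from the last callback, while per-request state (Scrapy's
# response.meta) keeps each article paired with its own comment.

comment = None  # the module-level global the original spider relies on

def parse2_item_global(response):
    # Original approach: every callback clobbers the one shared global.
    global comment
    comment = response['body']

def parse2_item_meta(response):
    # Scrapy-style approach: keep per-request state in response.meta
    # (simulated here by the 'meta' dict), so nothing is shared.
    response['meta']['comment'] = response['body']
    return response['meta']

responses = [
    {'url': 'http://comments.sina.cn/a', 'body': 'comment A', 'meta': {'title': 'article A'}},
    {'url': 'http://comments.sina.cn/b', 'body': 'comment B', 'meta': {'title': 'article B'}},
]

for r in responses:
    parse2_item_global(r)
# The global now holds only the LAST comment, so every line written
# to sina.txt would carry the same value.
print(comment)  # -> comment B

results = [parse2_item_meta(r) for r in responses]
print([(m['title'], m['comment']) for m in results])
# -> [('article A', 'comment A'), ('article B', 'comment B')]
```

In the real spider this would mean yielding `scrapy.Request(b, meta={'title': title}, callback=self.parse2_item)` from `parse_item`, and doing the file write inside `parse2_item` using `response.meta`.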

