I am using Scrapy to crawl Ximalaya through its PC (desktop) URLs. The response for the entry link is fine, but the response to the follow-up request comes back from the mobile address.
The spider code is as follows:
    import scrapy
    from scrapy import Request


    class SpxmlySpider(scrapy.Spider):
        name = 'ximalaya'
        allowed_domains = ["ximalaya.com"]
        # Links to each listing page; limited to page 2 for now as a test
        start_urls = ['http://www.ximalaya.com/dq/all/{}'.format(num) for num in range(2, 3)]

        def parse(self, response):
            # Extract the album links
            print(response)
            mainurls = response.xpath('//p[@class="albumfaceOutter"]/a/@href').extract()
            # for url in mainurls:
            #     yield Request(url=url, callback=self.parse_details)
            print(mainurls[0])
            yield Request(url=mainurls[0], dont_filter=True, callback=self.parse_details)

        # TODO: why does a request made to the PC site end up at the mobile address?
        def parse_details(self, response):
            item = XimalayaItem()
            print(response)
            # ... (rest omitted)
Console output:
I have written a middlewares.RotateUserAgentMiddleware, and it is taking effect, as the output above shows.
Am I triggering some kind of anti-crawling mechanism?
It is probably because your request headers do not set a suitable User-Agent.
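One way to check this is to look at which User-Agent was actually sent for the request that ended up on the mobile site. Below is a minimal sketch, not taken from the original spider: the helper names `is_mobile_host` and `summarize_response` are made up for illustration, and the `m.` host prefix for the mobile site is an assumption you should confirm against the URL printed in your console.

```python
from urllib.parse import urlparse


def is_mobile_host(url, mobile_prefix="m."):
    """Return True if the URL's host starts with the assumed mobile prefix."""
    host = urlparse(url).hostname or ""
    return host.startswith(mobile_prefix)


def summarize_response(response):
    """Collect the final URL and the User-Agent that was sent for it.

    `response` is expected to look like a Scrapy response: it has `.url`
    and `.request.headers`. Scrapy returns header values as bytes, so the
    value is decoded before being returned.
    """
    user_agent = response.request.headers.get("User-Agent", b"")
    if isinstance(user_agent, bytes):
        user_agent = user_agent.decode("latin-1")
    return {
        "url": response.url,
        "user_agent": user_agent,
        "mobile": is_mobile_host(response.url),
    }
```

Calling `print(summarize_response(response))` at the top of `parse_details` would show whether the mobile response lines up with a mobile-looking User-Agent chosen by the rotating middleware.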
Configure the request headers carefully. Sites usually decide whether a client is a mobile device based on the User-Agent.
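If the rotating middleware draws from a pool that mixes desktop and mobile User-Agent strings, some requests will look like phones to the site. The sketch below shows one possible way to keep only desktop strings. It is not the asker's `RotateUserAgentMiddleware`: the class name `DesktopUserAgentMiddleware`, the setting name `USER_AGENT_POOL`, and the marker list are all hypothetical, and the substring check is only a rough heuristic.

```python
import random

# Substrings that commonly appear in mobile User-Agent strings (heuristic, not exhaustive).
MOBILE_MARKERS = ("Mobile", "Android", "iPhone", "iPad")


def is_mobile_user_agent(user_agent):
    """Rough check: does the User-Agent string contain a known mobile marker?"""
    return any(marker in user_agent for marker in MOBILE_MARKERS)


class DesktopUserAgentMiddleware:
    """Downloader middleware sketch that only ever sends desktop User-Agents."""

    def __init__(self, user_agents):
        # Drop every entry that looks like a mobile browser.
        self.user_agents = [ua for ua in user_agents if not is_mobile_user_agent(ua)]
        if not self.user_agents:
            raise ValueError("no desktop User-Agent strings left in the pool")

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENT_POOL is a hypothetical project setting holding a list of strings.
        return cls(crawler.settings.getlist("USER_AGENT_POOL"))

    def process_request(self, request, spider):
        # Overwrite the header on every outgoing request, then let Scrapy continue.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        return None
```

To try it, the class would need to be listed in `DOWNLOADER_MIDDLEWARES` in `settings.py` in place of the existing rotating middleware, with `USER_AGENT_POOL` defined alongside it.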
The fact that you can fetch the data without any special headers at all also suggests that the target site does not put much effort into anti-hotlinking protection.