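Even a simple request to Xueqiu through Scrapy fails. I know I have to visit Xueqiu once first, because the cookie information is needed before the link will actually open. Scrapy supposedly lets you not worry about cookies and picks them up automatically. I have already enabled cookies in the middleware following this link, http://stackoverflow.com/ques..., so why do I still get a 404 error? I have searched for days without finding an answer. It's frustrating. Could someone please share some simple code showing how to access it? Thanks.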
import scrapy


class XueqiuSpider(scrapy.Spider):
    name = "xueqiu"
    start_urls = ["https://xueqiu.com/stock/f10/finmainindex.json?symbol=SZ000001&page=1&size=1"]
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Connection": "keep-alive",
        "Host": "www.zhihu.com",  # note: does not match the xueqiu.com URL requested below
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/537.36",
    }

    def __init__(self, url=None, *args, **kwargs):
        super(XueqiuSpider, self).__init__(*args, **kwargs)
        self.user_url = url

    def start_requests(self):
        # Requests the JSON API directly, with no earlier visit to the home page.
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                headers=self.headers,
                meta={"cookiejar": 1},
                callback=self.request_captcha,
            )

    def request_captcha(self, response):
        print(response)
Error log:
2017-03-04 12:42:02 [scrapy.core.engine] INFO: Spider opened
2017-03-04 12:42:02 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-03-04 12:42:02 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
********Current UserAgent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6************
2017-03-04 12:42:12 [scrapy.downloadermiddlewares.cookies] DEBUG: Received cookies from: <200 https://xueqiu.com/robots.txt>
Set-Cookie: aliyungf_tc=AQAAAGFYbBEUVAQAPSHDc8pHhpYZKUem; Path=/; HttpOnly
2017-03-04 12:42:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://xueqiu.com/robots.txt> (referer: None)
********Current UserAgent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6************
2017-03-04 12:42:12 [scrapy.downloadermiddlewares.cookies] DEBUG: Received cookies from: <404 https://xueqiu.com/stock/f10/finmainindex.json?symbol=SZ000001&page=1&size=1>
Set-Cookie: aliyungf_tc=AQAAAPTfyyJNdQUAPSHDc8KmCkY5slST; Path=/; HttpOnly
2017-03-04 12:42:12 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://xueqiu.com/stock/f10/finmainindex.json?symbol=SZ000001&page=1&size=1> (referer: None)
2017-03-04 12:42:12 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 https://xueqiu.com/stock/f10/finmainindex.json?symbol=SZ000001&page=1&size=1>: HTTP status code is not handled or not allowed
2017-03-04 12:42:12 [scrapy.core.engine] INFO: Closing spider (finished)
2017-03-04 12:42:12 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
I tried it again, and you really don't need to log in; I was overthinking it. Just request xueqiu.com first, then request the API URL once you have the cookies. That's all.
============== Embarrassed dividing line ==============
As I have verified myself, you need to log in...
Also, the site does check the User-Agent, which can be set in settings.py.
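For instance, a settings.py fragment along these lines would set a browser-like User-Agent for every request. The User-Agent value is simply the one from the question's headers, and `COOKIES_DEBUG` is optional; it only makes Scrapy log the cookies it sends and receives.

```python
# settings.py (fragment)

# Browser-like User-Agent sent with every request; the value is an example.
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/48.0.2564.109 Safari/537.36"
)

# Cookie handling is on by default; stated here for clarity.
COOKIES_ENABLED = True

# Log Cookie / Set-Cookie headers to help debug the session.
COOKIES_DEBUG = True
```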
Of course, you can also set it yourself in the spider file. The password is an MD5-hashed string.
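As an illustration of the hashing step only, Python's standard library can produce the MD5 hex digest of a password. The helper name `md5_hex` is hypothetical, and whether the site expects lower-case or upper-case hex is not stated here, so compare against the real POST request in the browser.

```python
import hashlib


def md5_hex(password):
    """Return the MD5 hex digest (lower-case) of a UTF-8 encoded password."""
    return hashlib.md5(password.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Prints a 32-character hexadecimal string.
    print(md5_hex("my-secret-password"))
```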
Oh, and one more thing: because I registered with my mobile phone number, form_data has these fields. If you signed up another way, just use Chrome's developer tools to see which parameters the POST request carries and change the contents of form_data accordingly, and it will work.
Haha, thanks, that clears up several days of confusion. I did this with requests before, with no login needed; posting the code,