Web crawler - crawler hits a redirect (303 after POST) and I cannot complete the task with Python requests
ringa_lee 2017-04-17 17:10:53

I have recently been looking into how to scrape flight data from this airline: https://m.tigerair.com/booking/search

In Chrome DevTools I can see the client send two similar requests in a row:

curl 'https://m.tigerair.com/booking/search' -H 'Cookie: PLAY_SESSION="1d7f16c847d5a596f468c9c0f764a8eabf83f48c-id=1585d391-c8a4-431d-8e24-edaa7bbaef57"' --data 'departureStation=SHE&arrivalStation=MAA&roundtrip=false&departureDate=2016-02-11&returnDate=&adults=1&children=0&infants=0&currency=CNY' --compressed

curl 'https://m.tigerair.com/booking/select' -H 'Cookie: PLAY_SESSION="30fdab1a897ba9ee088cba84ca28835efca28372-id=1585d391-c8a4-431d-8e24-edaa7bbaef57&searchForm=%7B%22currency%22%3A%22CNY%22%2C%22departureStation%22%3A%22SHE%22%2C%22arrivalStation%22%3A%22MAA%22%2C%22departureDate%22%3A1455148800000%2C%22children%22%3A0%2C%22adults%22%3A1%2C%22roundtrip%22%3Afalse%2C%22returnDate%22%3Anull%2C%22infants%22%3A0%2C%22switchMyFligthEnabled%22%3Afalse%7D"' -H 'Accept-Encoding: gzip, deflate, sdch' -H 'Accept-Language: en-US,en;q=0.8' 
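The PLAY_SESSION value in the second request appears to be a signature, a hyphen, and then a payload of URL-encoded key=value pairs. Below is a minimal sketch of decoding it, assuming that layout holds. The helper name decode_play_session is made up for illustration, and the cookie string is copied from the curl command above.

```python
import json
from datetime import datetime, timezone
from urllib.parse import parse_qs

# Cookie value copied verbatim from the second curl command.
COOKIE_VALUE = (
    "30fdab1a897ba9ee088cba84ca28835efca28372"
    "-id=1585d391-c8a4-431d-8e24-edaa7bbaef57"
    "&searchForm=%7B%22currency%22%3A%22CNY%22"
    "%2C%22departureStation%22%3A%22SHE%22"
    "%2C%22arrivalStation%22%3A%22MAA%22"
    "%2C%22departureDate%22%3A1455148800000"
    "%2C%22children%22%3A0%2C%22adults%22%3A1"
    "%2C%22roundtrip%22%3Afalse%2C%22returnDate%22%3Anull"
    "%2C%22infants%22%3A0%2C%22switchMyFligthEnabled%22%3Afalse%7D"
)


def decode_play_session(value):
    """Split an assumed '<signature>-<urlencoded pairs>' cookie value (hypothetical helper)."""
    signature, _, payload = value.partition("-")
    # parse_qs returns a list per key; each key occurs once here.
    fields = {key: values[0] for key, values in parse_qs(payload).items()}
    return signature, fields


signature, fields = decode_play_session(COOKIE_VALUE)
search_form = json.loads(fields["searchForm"])
# departureDate is stored as milliseconds since the Unix epoch.
departure = datetime.fromtimestamp(search_form["departureDate"] / 1000, tz=timezone.utc)

print(fields["id"])
print(search_form)
print(departure.date())
```

If this reading is right, the server stores the submitted search form inside the cookie itself, which would explain why the GET to /booking/select only returns results when that cookie is sent. The departureDate timestamp corresponds to 2016-02-11 in UTC, matching the date in the POST body.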

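The first request looks like a POST that says hello to the server; the server sends back a cookie for the client to remember.
The second request is then a GET that fetches the actual data.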
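The POST, 303, cookie, GET sequence described here can be reproduced offline. The sketch below is not the real site: MockBooking is a hypothetical local server that mimics the flow, and the client uses only the standard library (urllib with a cookie jar) so that it runs anywhere. A requests session follows a 303 and carries cookies across the redirect in the same way.

```python
import threading
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer


class MockBooking(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the site: POST /search -> 303 -> GET /select."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)  # consume the submitted form body
        self.send_response(303)
        self.send_header("Location", "/select")
        self.send_header("Set-Cookie", "PLAY_SESSION=demo; Path=/")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_GET(self):
        # Only return "results" when the session cookie came back.
        has_cookie = "PLAY_SESSION=demo" in self.headers.get("Cookie", "")
        body = b"flights" if has_cookie else b"no session"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch_after_redirect():
    server = HTTPServer(("127.0.0.1", 0), MockBooking)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_port

    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    form = urllib.parse.urlencode(
        {"departureStation": "SHE", "arrivalStation": "MAA"}
    ).encode()

    # The opener stores the cookie from the 303 response, then re-requests
    # the Location URL with GET and sends the cookie along.
    with opener.open(base + "/search", data=form) as resp:
        result = resp.geturl(), resp.read().decode()

    server.shutdown()
    server.server_close()
    return result, [cookie.name for cookie in jar]


(final_url, body), cookie_names = fetch_after_redirect()
print(final_url, body, cookie_names)
```

The point of the sketch is that a cookie-aware client needs only the one POST: the redirect and the follow-up GET happen automatically. If the real site still fails, the difference is more likely in the headers or in a cookie obtained before the POST than in the redirect handling itself.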

It has been really frustrating: I have spent a whole week on this and still cannot reproduce this crawling behavior in Python.

Searching Google, I saw people suggest using requests.session, but that was no use at all either.

Could any experienced folks point me in the right direction? I am developing with Python requests.


reply all(1)
PHPzhong

Could you check whether this snippet runs for you? I copied it from a command-line session where I tried it, so it should work.

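import re

import requests

# A session keeps cookies (including PLAY_SESSION) across requests.
s = requests.session()
s.headers.update({
    'Host': 'm.tigerair.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:44.0) Gecko/20100101 Firefox/44.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
    # Only advertise encodings that requests can always decode.
    'Accept-Encoding': 'gzip, deflate',
    'Connection': 'keep-alive',
})

# Step 1: load the search page first so the server issues its initial cookie.
# verify=False disables TLS certificate checks for this request.
res = s.get(url='https://m.tigerair.com/booking/search', verify=False)
print(res)

post_data = {
    'adults': '1',
    'arrivalStation': 'HKG',
    'children': '0',
    'currency': 'SGD',
    'departureDate': '2016-02-08',
    'departureStation': 'SIN',
    'infants': '0',
    'returnDate': '',
    'roundtrip': 'false',
}
# Step 2: submit the search form; the session stores the updated cookie.
res = s.post(url='https://m.tigerair.com/booking/search', data=post_data)
print(res)

# Step 3: fetch the results page using the cookie from step 2.
res = s.get('https://m.tigerair.com/booking/select')
print(res)
# print(res.text)
# res.text is the decoded body, so it can be matched with a str pattern.
print(re.findall(r'data-journey-sell-key="([\s\S]+?)"', res.text))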