Web crawler - problem simulating a Douban login with Python
黄舟 2017-04-17 17:08:30

Semi-automatic simulated login to Douban

Code:

    # /usr/bin/python
    #coding:utf-8
    __author__ = 'eyu Fanne'
    import requests
    from bs4 import BeautifulSoup

    headers={
        "Host":"www.douban.com",
        "User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0",
        "Accept-Language":"zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
        "Accept-Encoding":"gzip, deflate",
        "Connection":"keep-alive"
    }

    s=requests.session()
    s.headers.update(headers)
    html_url = s.get('https://www.douban.com/accounts/login',headers=headers)
    print s.cookies.items()
    print "html_url code %s" %html_url.status_code
    html_txt = html_url.text
    html_soup = BeautifulSoup(html_txt,'lxml')
    img_soup = html_soup.find_all('img',class_="captcha_image")
    for img_i in img_soup:
        print img_i['src']
        cap_img=img_i['src']
    for i in html_soup.find_all("input",attrs={"name":"captcha-id"}):
        print i['value']
        cap_i = i['value']
    # The prompt string means "Enter the captcha:"
    captcha_solution=raw_input('输入验证码:')
    captcha_id=cap_i
    print captcha_solution
    print captcha_id
    url_data={
        "source":"index_nav",
        "form_email":"*********",
        "form_password":"*******",
        "captcha-solution":captcha_solution,
        "captcha-id":captcha_id,
    }
    s_login=s.post(html_url,data=url_data,headers=headers)
    print s.cookies.items()

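The account and password are replaced with asterisks. When the script runs it prints the captcha image URL, and I type the captcha in by hand.

Error message: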
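In the run logged in this post, the captcha-id form value is the same string as the id query parameter of the captcha image URL. Assuming that holds in general (it is only an observation from this one run), the id could also be read from the image URL with the standard library. This is a minimal sketch in Python 3, whereas the script in the post is Python 2; captcha_id_from_src is a hypothetical helper name.

```python
from urllib.parse import urlparse, parse_qs


def captcha_id_from_src(src):
    """Return the 'id' query parameter of a captcha image URL, or None."""
    query = parse_qs(urlparse(src).query)
    values = query.get("id")
    return values[0] if values else None


# Image URL copied from the run logged in this post.
src = "https://www.douban.com/misc/captcha?id=ArzwwQ6Yv33e0BU7MawrL62d:en&size=s"
print(captcha_id_from_src(src))
```

Reading the hidden captcha-id input, as the script does, is still the more direct route; this is only a cross-check.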

    [('bid', '"X1c3XEWFnhQ"')]
    html_url code 200
    https://www.douban.com/misc/captcha?id=ArzwwQ6Yv33e0BU7MawrL62d:en&size=s
    ArzwwQ6Yv33e0BU7MawrL62d:en
    输入验证码:thought
    thought
    ArzwwQ6Yv33e0BU7MawrL62d:en
    Traceback (most recent call last):
      File "D:/360_svn/eyugame_python_exercise/121_remote_pro/crawler_ex/get_douban_move/douban_login.py", line 48, in <module>
        s_login=s.post(html_url,data=url_data,headers=headers)
      File "C:\Python27_x86\lib\site-packages\requests\sessions.py", line 508, in post
        return self.request('POST', url, data=data, json=json, **kwargs)
      File "C:\Python27_x86\lib\site-packages\requests\sessions.py", line 451, in request
        prep = self.prepare_request(req)
      File "C:\Python27_x86\lib\site-packages\requests\sessions.py", line 382, in prepare_request
        hooks=merge_hooks(request.hooks, self.hooks),
      File "C:\Python27_x86\lib\site-packages\requests\models.py", line 304, in prepare
        self.prepare_url(url, params)
      File "C:\Python27_x86\lib\site-packages\requests\models.py", line 362, in prepare_url
        to_native_string(url, 'utf8')))
    requests.exceptions.MissingSchema: Invalid URL '': No schema supplied. Perhaps you meant http://?

    Process finished with exit code 1

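Where does the problem come from?

One more question, about requests:
http://docs.python-requests.org/en/latest/user/advanced/
s = requests.Session()
The documentation uses Session with a capital S, but in other places I have seen lowercase session. What is the difference?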
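On the Session question: in the requests source, Session is the class, and lowercase session() is a small module-level function that simply returns Session(). Its docstring marks it as deprecated since 1.0.0 and kept for backwards compatibility. A minimal check, assuming requests is installed (Python 3 syntax; the variable names are just for illustration):

```python
import requests

lower = requests.session()  # legacy factory function
upper = requests.Session()  # the class itself

# Both calls produce an instance of the same class.
print(type(lower) is type(upper))
print(isinstance(lower, requests.Session))

lower.close()
upper.close()
```

So the two spellings should behave the same in the script; the capitalized form is the one the documentation uses.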

===========
Update:

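The simulated-login problem is solved. The bug was in the final POST request: the first argument I passed was not a URL. I had passed html_url, which is the Response object returned by s.get(), instead of the login URL string.
Revised code: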

    # /usr/bin/python
    #coding:utf-8
    __author__ = 'eyu Fanne'
    import requests
    from bs4 import BeautifulSoup

    headers={
        "Host":"www.douban.com",
        "User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0",
        "Accept-Language":"zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
        "Accept-Encoding":"gzip, deflate",
        "Connection":"keep-alive"
    }

    s=requests.session()
    s.headers.update(headers)
    login_url=r'https://www.douban.com/accounts/login'
    html_url = s.get(login_url,headers=headers)
    print s.cookies.items()
    print "html_url code %s" %html_url.status_code
    html_txt = html_url.text
    html_soup = BeautifulSoup(html_txt,'lxml')
    img_soup = html_soup.find_all('img',class_="captcha_image")
    for img_i in img_soup:
        print img_i['src']
        cap_img=img_i['src']
    for i in html_soup.find_all("input",attrs={"name":"captcha-id"}):
        print i['value']
        cap_i = i['value']
    # The prompt string means "Enter the captcha:"
    captcha_solution=raw_input('输入验证码:')
    captcha_id=cap_i
    print captcha_solution
    print captcha_id
    url_data={
        "source":"index_nav",
        "form_email":"******",
        "form_password":"******",
        "captcha-solution":captcha_solution,
        "captcha-id":captcha_id,
    }
    s_login=s.post(login_url,data=url_data,headers=headers)
    print s.cookies.items()

Output:

    [('bid', '"Ojx9+4qSsdw"')]
    html_url code 200
    https://www.douban.com/misc/captcha?id=ryEmaBD2QermvX2BSPncxIuY:en&size=s
    ryEmaBD2QermvX2BSPncxIuY:en
    输入验证码:opposite
    opposite
    ryEmaBD2QermvX2BSPncxIuY:en
    [('bid', '"Ojx9+4qSsdw"'), ('ck', '"malX"'), ('dbcl2', '"41572135:JiIAk8PlKLw"'), ('ue', '"896661380@qq.com"')]

    Process finished with exit code 0
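In this successful run the session gains the cookies ck, dbcl2 and ue after the POST, while the failed run only ever shows bid. One rough way to check the result in code is to look for one of those new cookies. The sketch below assumes that dbcl2 marks a logged-in session, which is inferred only from the output above and not from any Douban documentation; looks_logged_in is a hypothetical helper, the cookie values are placeholders, and the code is Python 3.

```python
def looks_logged_in(cookie_items, marker="dbcl2"):
    """Return True if a cookie named `marker` is present.

    cookie_items is a list of (name, value) pairs, the same shape
    that s.cookies.items() prints in the script.
    """
    return any(name == marker for name, _value in cookie_items)


# Placeholder values; only the cookie names matter here.
before_login = [("bid", "placeholder")]
after_login = [("bid", "placeholder"), ("ck", "placeholder"), ("dbcl2", "placeholder")]

print(looks_logged_in(before_login))
print(looks_logged_in(after_login))
```

In the script this could be called as looks_logged_in(s.cookies.items()) right after the POST.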

I still have not figured out the Session versus session question.


All replies (1)
黄舟

Haha, looks like your hand slipped on this line:

s_login=s.post(html_url,data=url_data,headers=headers)
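The slip pointed out in this reply is that the first argument to s.post() was not the login URL string. A small guard like the one below could catch that kind of mix-up before the request is prepared. It is only a sketch: require_url and FakeResponse are hypothetical names, not part of requests, and the code is Python 3 using only the standard library.

```python
from urllib.parse import urlparse


def require_url(value):
    """Return value unchanged if it is an absolute http(s) URL string."""
    if not isinstance(value, str):
        raise TypeError("expected a URL string, got %s" % type(value).__name__)
    parts = urlparse(value)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError("not an absolute http(s) URL: %r" % value)
    return value


class FakeResponse(object):
    """Stand-in for the Response object that s.get() returns."""


login_url = "https://www.douban.com/accounts/login"
print(require_url(login_url))

try:
    require_url(FakeResponse())
except TypeError as exc:
    print("rejected:", exc)
```

With this in place the call would read s.post(require_url(login_url), data=url_data), and passing the response by mistake would fail with a message that names the wrong type.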