Querying Alibaba rankings in Python with the urllib module and pyquery

WBOY
Release: 2016-06-16 08:45:35

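This is an application of the basic urllib module: the class below fetches the HTML document for a URL, and the method it uses to obtain a proxy can be overridden internally.

The code is as follows: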

# Python 2 code (urllib2, cookielib, print statement).
import cookielib
import urllib2

# ProxyRobot and PROXY_ENABLE are not shown in this article;
# they are assumed to be defined elsewhere in the project.


class ProxyScrapy(object):
    def __init__(self):
        self.proxy_robot = ProxyRobot()
        self.current_proxy = None
        self.cookie = cookielib.CookieJar()

    def __builder_proxy_cookie_opener(self):
        # Always keep cookies; add a proxy handler only when proxies are enabled.
        cookie_handler = urllib2.HTTPCookieProcessor(self.cookie)
        handlers = [cookie_handler]

        if PROXY_ENABLE:
            self.current_proxy = ip_port = self.proxy_robot.get_random_proxy()
            # ip_port[7:] presumably drops a leading 'http://' (7 characters).
            proxy_handler = urllib2.ProxyHandler({'http': ip_port[7:]})
            handlers.append(proxy_handler)

        opener = urllib2.build_opener(*handlers)
        urllib2.install_opener(opener)
        return opener

    def get_html_body(self, url):
        # Build a fresh opener (and pick a new proxy) for every attempt.
        opener = self.__builder_proxy_cookie_opener()

        request = urllib2.Request(url)
        #request.add_header("Accept-Encoding", "gzip,deflate,sdch")
        #request.add_header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
        #request.add_header("Cache-Control", "no-cache")
        #request.add_header("Connection", "keep-alive")

        try:
            response = opener.open(request, timeout=2)

            http_code = response.getcode()
            if http_code == 200:
                if PROXY_ENABLE:
                    self.proxy_robot.handle_success_proxy(self.current_proxy)
                html = response.read()
                return html
            else:
                if PROXY_ENABLE:
                    self.proxy_robot.handle_double_proxy(self.current_proxy)
                # Retry; note that the number of retries is not limited.
                return self.get_html_body(url)
        except Exception as inst:
            print inst, self.current_proxy
            # Only report the proxy when one was actually used.
            if PROXY_ENABLE:
                self.proxy_robot.handle_double_proxy(self.current_proxy)
            return self.get_html_body(url)
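
The listing above targets Python 2 and retries without a limit. As a rough illustration only, the same idea might be sketched in Python 3 with a bounded retry loop. The function names build_opener and get_html_body, the make_opener callback, and the max_retries parameter are assumptions made for this sketch, not part of the original code, and the ProxyRobot bookkeeping is left out.

```python
import http.cookiejar
import urllib.request


def build_opener(cookie_jar, proxy=None):
    """Build an opener with cookie support and, optionally, an HTTP proxy."""
    handlers = [urllib.request.HTTPCookieProcessor(cookie_jar)]
    if proxy:
        # proxy is a "host:port" string (hypothetical value supplied by the caller)
        handlers.append(urllib.request.ProxyHandler({"http": proxy}))
    return urllib.request.build_opener(*handlers)


def get_html_body(url, make_opener, max_retries=3, timeout=2):
    """Fetch url, asking make_opener() for a fresh opener on each attempt.

    Returns the response body as bytes, or None once max_retries
    attempts have failed.
    """
    for _ in range(max_retries):
        opener = make_opener()
        try:
            response = opener.open(url, timeout=timeout)
            if response.getcode() == 200:
                return response.read()
        except OSError:
            # urllib.error.URLError and socket timeouts are OSError subclasses
            pass
    return None
```

A caller could pass something like `lambda: build_opener(jar, proxy)` as make_opener, choosing a different proxy on each call if a proxy pool is available.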

source:php.cn