How to write example crawler code in Python

coldplay.xixi
Release: 2020-08-11 13:58:52

This article walks through Python crawler code examples: first fetch pages with urllib and BeautifulSoup, using urlencode to build POST data; then install pymysql and store the scraped data in MySQL.


Python crawler code examples:

1. urllib and BeautifulSoup

Fetch page content

from urllib import request

req = request.urlopen("http://www.baidu.com")
print(req.read().decode("utf-8"))

Simulate a real browser: send a User-Agent header

(This prevents the server from recognizing the request as a crawler; if this browser information is not sent, the server may return an error.)

req = request.Request(url)     # url is the address of the target page
req.add_header(key, value)     # key is "User-Agent", value is the browser version string
resp = request.urlopen(req)
print(resp.read().decode("utf-8"))
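For a concrete illustration, here is the same pattern with an actual User-Agent string filled in (the exact value is just an example; any real browser string works). No request is actually sent here:

```python
from urllib import request

url = "http://www.baidu.com"
req = request.Request(url)
# Any real browser's User-Agent string will do; this value is only an example
req.add_header("User-Agent",
               "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
print(req.get_header("User-agent"))  # the header is now attached to the request
```

Note that urllib stores header names capitalized internally ("User-agent"), which is why the lookup spelling differs from the one passed to add_header.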


Use POST

Import parse from the urllib library

from urllib import parse

Use urlencode to generate post data

postData = parse.urlencode([
    (key1, val1),
    (key2, val2),
    (keyn, valn)
])
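As a quick check of what urlencode produces, here is the call with made-up field names and values:

```python
from urllib import parse

# Field names and values here are invented for illustration
postData = parse.urlencode([
    ("username", "alice"),
    ("password", "secret"),
])
print(postData)  # username=alice&password=secret
```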

Send the POST request

resp = request.urlopen(req, data=postData.encode("utf-8"))  # send a POST request carrying postData
resp.status   # HTTP status code of the response
resp.reason   # reason phrase returned by the server (e.g. "OK")
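The request object can also be inspected before it is sent; urllib infers the HTTP method from whether data is present. A small offline sketch (the URL is just a placeholder, and nothing is sent):

```python
from urllib import request, parse

postData = parse.urlencode([("q", "python")])
req = request.Request("http://httpbin.org/post",   # placeholder URL for illustration
                      data=postData.encode("utf-8"))
print(req.get_method())  # POST - inferred because data was supplied
print(req.data)          # b'q=python'
```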

Complete code example (crawling links from the Wikipedia main page)

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen
import re
import ssl

# Crawl Wikipedia entry links
ssl._create_default_https_context = ssl._create_unverified_context  # disable certificate verification globally

# Request the URL and decode the result as UTF-8
req = urlopen("https://en.wikipedia.org/wiki/Main_Page").read().decode("utf-8")

# Parse with BeautifulSoup
soup = bs(req, "html.parser")
# print(soup)

# Get all <a> tags whose href attribute starts with "/wiki/Special"
urllist = soup.findAll("a", href=re.compile("^/wiki/Special"))
for url in urllist:
    # Skip links ending in .jpg or .JPG
    if not re.search(r"\.(jpg|JPG)$", url["href"]):
        # get_text() returns all text inside the tag, including child tags;
        # string returns a single piece of text, or None if the tag has child tags
        print(url.get_text() + "----->" + url["href"])
        # print(url)
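The filtering logic in the example above can be exercised offline on a tiny HTML snippet (the links below are made up stand-ins for the downloaded page):

```python
from bs4 import BeautifulSoup as bs
import re

# Stand-in for the downloaded Wikipedia page
html = ('<a href="/wiki/Special:MyTalk">My talk</a>'
        '<a href="/wiki/Special:Photo.jpg">Photo</a>')
soup = bs(html, "html.parser")
urllist = soup.findAll("a", href=re.compile("^/wiki/Special"))
# Keep only links that do not end in .jpg or .JPG
kept = [a["href"] for a in urllist if not re.search(r"\.(jpg|JPG)$", a["href"])]
print(kept)  # ['/wiki/Special:MyTalk']
```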

2. Store data in MySQL

Install pymysql

Install via pip:

$ pip install pymysql

or install from source:

$ python setup.py install

Usage

# Import the package
import pymysql.cursors

# Get a database connection
connection = pymysql.connect(host="localhost",
                             user="root",
                             password="123456",
                             db="wikiurl",
                             charset="utf8mb4")
try:
    # Get a cursor
    with connection.cursor() as cursor:
        # Build the SQL statement
        sql = "insert into `tableName`(`urlname`,`urlhref`) values(%s,%s)"
        # Execute the SQL statement
        cursor.execute(sql, (url.get_text(), "https://en.wikipedia.org" + url["href"]))
        # Commit
        connection.commit()
finally:
    # Close the connection
    connection.close()
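The insert statement above assumes a database wikiurl containing a table with urlname and urlhref columns. One possible schema (the table and column names mirror the article's example; the id column and the length limits are arbitrary choices) is:

```sql
CREATE DATABASE IF NOT EXISTS wikiurl DEFAULT CHARACTER SET utf8mb4;

USE wikiurl;

CREATE TABLE IF NOT EXISTS `tableName` (
    `id`      INT AUTO_INCREMENT PRIMARY KEY,
    `urlname` VARCHAR(255)  NOT NULL,   -- link text
    `urlhref` VARCHAR(1000) NOT NULL    -- absolute URL
) DEFAULT CHARACTER SET utf8mb4;
```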

3. Precautions for crawlers

The Robots protocol (robot protocol, also known as the crawler protocol) is formally the "web crawler exclusion protocol". Through it, a website tells search engines which pages may be crawled and which may not. The file usually lives at the site root, e.g. https://en.wikipedia.org/robots.txt

Disallow: access not allowed
Allow: access allowed
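Python's standard library can evaluate these rules for you: urllib.robotparser parses a robots.txt and answers whether a URL may be fetched. A small offline sketch with a made-up policy:

```python
from urllib import robotparser

# A made-up robots.txt policy, supplied as lines of text
rules = """User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)
print(rp.can_fetch("*", "https://example.com/private/page"))  # False
print(rp.can_fetch("*", "https://example.com/index"))         # True
```

In real use you would call rp.set_url("https://en.wikipedia.org/robots.txt") followed by rp.read() instead of parsing a literal string.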



Source: php.cn