html - regular expression python crawler - PHP Chinese Network Q&A


html - regular expression python crawler

怪我咯 2017-06-22 11:51:19


import re
import urllib.request

req = urllib.request.urlopen('http://search.jd.com/Search?k...')

req
Out[3]:

buf = req.read()

buf = buf.decode('utf-8')

urllist = re.findall(r'//img.+.png', buf)
This returns the image URLs ending in .png, as expected.
urllist = re.findall(r'//img.+.jpg', buf)
This also works as expected.
urllist = re.findall(r'//img.+.(png|jpg)', buf)
This returns only the file extensions, like this:
'.jpg',
'.jpg',
'.png',
'.jpg',
'.jpg',
'.jpg',
'.jpg',
'.jpg',
'.jpg',
Why is this?


All replies (2)

阿神 2017-06-22 11:53:19 (floor 2)

This happens because of how re.findall treats groups. When the pattern contains no (), re.findall returns the full text of every match. Once you add (), it returns only what the () captured, which is why you see a list of jpg/png. To get the whole link back, wrap the entire pattern in () so that the full link is what gets captured. The inner alternation should then be written as (?:jpg|png): that part only needs to match jpg or png, not capture it, so use the non-capturing group syntax.
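A minimal sketch of that behaviour. The sample string and its URLs are made up for illustration (they are not data from the JD page), and the dot before the extension is escaped here:

```python
import re

# Hypothetical HTML fragment; the image URLs are made up for illustration.
sample = '<img src="//img10.example.com/a/1.jpg"> <img src="//img11.example.com/b/2.png">'

# No group: findall returns the full text of each match.
no_group = re.findall(r'//img.+?\.jpg', sample)

# One capturing group: findall returns only what the group captured.
one_group = re.findall(r'//img.+?\.(png|jpg)', sample)

# Non-capturing group: alternation without capture, so full matches come back.
non_capturing = re.findall(r'//img.+?\.(?:png|jpg)', sample)

print(no_group)       # full .jpg link only
print(one_group)      # just the extensions
print(non_capturing)  # both full links
```

With more than one capturing group in the pattern, findall returns a list of tuples instead, one tuple per match.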

# Modified code (the dot before the extension is escaped so it matches a literal ".")
urllist = re.findall(r'(//img.+?\.(?:png|jpg))', buf)
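If you would rather keep the capturing group, for example to get the extension alongside the link, one alternative (not part of the answer above) is re.finditer. It yields match objects, where group(0) is the whole match and group(1) is the captured extension. A rough sketch, again with a made-up sample string:

```python
import re

# Hypothetical HTML fragment; the URLs are assumptions for illustration.
sample = '<img src="//img10.example.com/a/1.jpg"> <img src="//img11.example.com/b/2.png">'

pattern = re.compile(r'//img.+?\.(png|jpg)')

# Each match object keeps both the whole match and the captured group.
pairs = [(m.group(0), m.group(1)) for m in pattern.finditer(sample)]

for url, ext in pairs:
    print(ext, url)
```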

For more about capturing and non-capturing groups, you can refer to: Link description


代言 2017-06-22 11:53:19 (floor 1)

Use [png|jpg] instead.

(png|jpg) is treated as a capturing group, so findall returns only the group.

import re
import requests

r = requests.get('http://search.jd.com/Search?keyword=%E6%96%87%E8%83%B8&enc=utf-8&wq=%E6%96%87%E8%83%B8&pvid=4anf50si.fbrh68')
# print() is called as a function so this runs on Python 3, like the code in the question
print(re.findall('//img.+.[png|jpg]', r.text))
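One caveat worth checking before relying on this: in Python's re syntax, square brackets define a character class, so [png|jpg] matches a single character out of p, n, g, | and j, not the word png or jpg. It avoids the grouping problem, but it can also accept links that are not images. The sketch below compares it with a non-capturing group; the sample text is made up for illustration:

```python
import re

# Hypothetical text: one image link and one non-image link, one per line.
sample = '//img10.example.com/a/1.jpg\n//img11.example.com/doc.json\n'

# Character class: the match only has to end in ONE of p, n, g, |, j.
char_class = re.findall(r'//img.+.[png|jpg]', sample)

# Non-capturing group: the match has to end in ".png" or ".jpg".
alternation = re.findall(r'//img.+\.(?:png|jpg)', sample)

print(char_class)   # includes the .json link, because it ends in "n"
print(alternation)  # only the .jpg link
```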
