html - Regular expressions in a Python crawler
怪我咯 2017-06-22 11:51:19

import re
import urllib.request

req = urllib.request.urlopen('http://search.jd.com/Search?k...')

buf = req.read()

buf = buf.decode('utf-8')

urllist = re.findall(r'//img.+?.png', buf)
This correctly returns the image URLs ending in .png.
urllist = re.findall(r'//img.+?.jpg', buf)
This also works as expected.
urllist = re.findall(r'//img.+?.(png|jpg)', buf)
This returns only the file extensions of the images, like this:
'.jpg',
'.jpg',
'.png',
'.jpg',
'.jpg',
'.jpg',
'.jpg',
'.jpg',
'.jpg',
Why is this?

All replies (2)
阿神

This happens because of how re.findall treats groups. When the pattern contains no (), re.findall returns the full text of every match. When the pattern contains a capturing group (), it returns only what that group captured, which is why you see a list of jpg/png. To get the full links back, wrap the whole pattern in () so that the entire link is the captured result. The choice between jpg and png still has to be grouped, but it should not be captured, so use the non-capturing group syntax (?:jpg|png) there.

# Modified code (the dot before the extension is escaped so it matches a literal '.')
urllist = re.findall(r'(//img.+?\.(?:png|jpg))', buf)

For more about capturing and non-capturing groups, you can refer to: Link description
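
The difference between the three forms can be checked on a small string without fetching any page. The following is only a sketch: the sample HTML and its example.com hostnames are made up for illustration, and the two tags sit on separate lines so that `.` (which does not match a newline by default) cannot run from one URL into the next.

```python
import re

# Hypothetical sample markup; hostnames and paths are invented for this demo.
sample = (
    '<img src="//img10.example.com/a/1.jpg">\n'
    '<img src="//img11.example.com/b/2.png">'
)

# No group: findall returns the whole match.
no_group = re.findall(r'//img.+?\.png', sample)

# One capturing group: findall returns only the text captured by the group.
capturing = re.findall(r'//img.+?\.(png|jpg)', sample)

# Non-capturing group: findall returns the whole match again.
non_capturing = re.findall(r'//img.+?\.(?:png|jpg)', sample)

# Alternative: keep the capturing group and ask each match for group(0).
full_matches = [m.group(0) for m in re.finditer(r'//img.+?\.(png|jpg)', sample)]

print(no_group)       # ['//img11.example.com/b/2.png']
print(capturing)      # ['jpg', 'png']
print(non_capturing)  # ['//img10.example.com/a/1.jpg', '//img11.example.com/b/2.png']
print(full_matches)   # same list as non_capturing
```

If the extension is also needed separately, re.finditer is probably the more convenient choice, since group(0) gives the full URL and group(1) gives the extension from the same match object.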

    代言

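    (png|jpg) is a capturing group, so re.findall returns only what the group matched.

    Note that [png|jpg] is not an equivalent replacement: it is a character class that matches a single character out of p, n, g, | and j, not the whole extension. Use the non-capturing group (?:png|jpg) instead.

    import re
    import requests

    r = requests.get('http://search.jd.com/Search?keyword=%E6%96%87%E8%83%B8&enc=utf-8&wq=%E6%96%87%E8%83%B8&pvid=4anf50si.fbrh68')
    # Non-capturing group, so findall returns the full matched URL
    print(re.findall(r'//img.+?\.(?:png|jpg)', r.text))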
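
To see why a character class such as [png|jpg] does not behave like an alternation, the two can be compared on a couple of made-up URLs. This is a sketch under assumptions: the example.com URLs are hypothetical, and the lazy `.+?` with an escaped dot is used in both patterns so that only the bracket form differs.

```python
import re

# Hypothetical URLs, invented for this demo; one per line.
text = '//img10.example.com/a/1.jpg\n//img11.example.com/b/2.gif'

# [png|jpg] is a character class: it matches ONE character from p, n, g, |, j.
char_class = re.findall(r'//img.+?\.[png|jpg]', text)

# (?:png|jpg) is an alternation: it matches the full extension png or jpg.
alternation = re.findall(r'//img.+?\.(?:png|jpg)', text)

print(char_class)   # ['//img10.example.com/a/1.j', '//img11.example.com/b/2.g']
print(alternation)  # ['//img10.example.com/a/1.jpg']
```

The character class cuts each URL off after the first letter of the extension and also accepts the .gif file, because g is one of the listed characters.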