
Python crawler for beginners: Crawling jokes

巴扎黑
Release: 2017-06-23 16:28:58

I recently started learning to write Python crawlers by following a blog. The blogger uses Python 2.7 while I use 3.5. The two versions are incompatible in many places, but that is not a problem: the affected code is easy to adapt yourself.

We want to filter a website's content and keep only the parts that interest us, for example picking out the adult pictures on some XX website and downloading them in bulk. Here we only do a simple implementation, using the jokes (plain text) on Budejie (百思不得姐, budejie.com) as the example. We want the following features:

  • Download several pages of jokes in one batch to a local file

  • Press a key to read the next joke

1. Get the web page code

First, import the urllib modules. In Python 3 this is written as:

import urllib.request
import urllib.parse
import re

re is the regular expression module; we use it later for matching.
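Since the blog I followed targets Python 2.7, the import is the first place where the two versions differ. A minimal sketch of an import that tolerates both, assuming the Python 2 names live in urllib2 (this fallback is my addition, not part of the original code):

```python
# Python 3 keeps these names in urllib.request; Python 2 kept them in urllib2.
try:
    from urllib.request import Request, urlopen
except ImportError:  # only reached on Python 2
    from urllib2 import Request, urlopen

import re  # same module name on both versions

# Build (but do not send) a request to confirm that the names resolve.
req = Request('http://www.budejie.com/text/1')
```

The rest of this article sticks to the plain Python 3 form.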

The URL of Budejie's joke page is url = 'http://www.budejie.com/text/1'; the trailing 1 means this is the first page. The following code returns the content of the web page.

    url = 'http://www.budejie.com/text/1'
    req = urllib.request.Request(url)
    # Add headers so the request looks like it comes from a browser
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 '
                                 '(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36')
    response = urllib.request.urlopen(req)
    # Get the page content; note that the bytes must be decoded with decode()
    html = response.read().decode('utf-8')

Running print(html) shows the following:

[Screenshot: the HTML printed by print(html)]

See that? Where are the jokes? Where are the jokes we wanted?!

Oh, and by the way, this is how to look up the headers: press F12 in the browser, and then... look at the screenshot.

[Screenshot: request headers shown after pressing F12]
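To make the header and decoding steps easier to reuse, they can be wrapped in small functions. This is only a sketch: the names build_request, decode_body and fetch, the timeout of 10 seconds and the errors='replace' choice are my own assumptions, not part of the original code.

```python
import urllib.request

# The same User-Agent string used in the snippet above.
USER_AGENT = ('Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36')


def build_request(url):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': USER_AGENT})


def decode_body(raw, charset='utf-8'):
    """Decode response bytes; undecodable bytes become U+FFFD instead of raising."""
    return raw.decode(charset, errors='replace')


def fetch(url, timeout=10):
    """Download a page and return it as text (needs network access)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return decode_body(response.read())
```

With these in place, html = fetch('http://www.budejie.com/text/1') would do the same job as the four lines shown earlier, provided the site is reachable.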

2. Extract the jokes with regular expressions

To end up with text that ordinary people can read (who could read it with the HTML tags still in?), we have to extract the jokes themselves. For this we define a pattern, match it against the whole page, and get back everything that matches. The tool for the job is the powerful regular expression. The syntax is described in the documentation of Python's re module.

Looking only at the page in this example, let's first see which part of the page source holds the jokes we need.

[Screenshot: page source around one joke]

You can see that each joke is wrapped in a tag like <div class="j-r-list-c-desc">(the content we want)</div>, so we only need to write a matching rule to extract it. The screenshot also shows a lot of whitespace before and after the joke text, which the pattern has to match as well.

pattern = re.compile(r'<div class="j-r-list-c-desc">\s+(.*)\s+</div>')
result = re.findall(pattern, html)

The rule is built with the compile() function of the re module.

  • \s+ matches one or more whitespace characters (spaces, tabs, newlines)

  • . matches any character except the newline \n, so (.*) captures text within a single line
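To see the two rules at work without touching the network, the pattern can be tried on a small hand-written string. The sample HTML below is made up to imitate the layout described above; it is not copied from the real page.

```python
import re

pattern = re.compile(r'<div class="j-r-list-c-desc">\s+(.*)\s+</div>')

# Made-up sample: each joke sits on its own line, padded with whitespace.
sample_html = (
    '<div class="j-r-list-c-desc">\n'
    '        first joke<br />second line\n'
    '    </div>\n'
    '<div class="j-r-list-c-desc">\n'
    '        another joke\n'
    '    </div>\n'
)

# findall() returns only the captured group, i.e. the text between the tags.
jokes = pattern.findall(sample_html)
for joke in jokes:
    print(repr(joke))
```

The leading \s+ swallows the newline and indentation after the opening tag, (.*) takes the joke line, and the trailing \s+ swallows the whitespace before </div>, so jokes holds the two bare texts, the first one still containing its <br /> tag.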

Now that we have the matches, let's take a look.

[Screenshot: the list returned by re.findall()]

Bingo! They were extracted, right?!

But there are some annoying <br /> tags left inside the text. No problem, a few lines of code take care of that. I won't show the cleaned-up output here; use your imagination, haha.

    for each in content:
        # If a joke contains <br />
        if '<br />' in each:
            # replace it with a newline and print
            new_each = re.sub(r'<br />', '\n', each)
            print(new_each)
        # otherwise print it as it is
        else:
            print(each)

Here content is the list returned by re.findall() (called result in the snippet above).
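Because <br /> is a fixed string, the same cleanup also works without a regular expression and without the if/else check. A minimal sketch using str.replace() instead of re.sub(); the helper name clean_joke and the sample data are made up for illustration.

```python
def clean_joke(text):
    """Turn literal <br /> tags into newlines; text without the tag is unchanged."""
    return text.replace('<br />', '\n')


content = ['line one<br />line two', 'no tag here']  # made-up data
cleaned = [clean_joke(each) for each in content]

for each in cleaned:
    print(each)
```

str.replace() simply returns the text unchanged when the tag is absent, which is why the membership test is not needed here.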

We now have the jokes we wanted to read! But what if you want to save them locally?

3. Save the jokes to a local file

You can define a save() function. Its num parameter is chosen by the user, so downloading the latest 100 pages is no problem! Some names used here have not been introduced yet; the full source code is given at the end.

# num is the number of pages to fetch
def save(num):
    # Open a text file for writing and store the list of jokes in it
    with open('a.txt', 'w', encoding='utf-8') as f:
        text = get_content(num)
        # Same <br /> handling as above
        for each in text:
            if '<br />' in each:
                new_each = re.sub(r'<br />', '\n', each)
                # end with a newline so the next joke starts on its own line
                f.write(new_each + '\n')
            else:
                f.write(str(each) + '\n')

After saving, the local file looks like this:

[Screenshot: contents of the saved a.txt]
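The saving step can also be tried on its own, without downloading anything. The sketch below is a variant of my own, not the save() shown above: the function save_jokes, its path parameter and the blank line between jokes are assumptions made for illustration.

```python
import os
import tempfile


def save_jokes(jokes, path):
    """Write the jokes to path, each followed by a blank line; return the count."""
    with open(path, 'w', encoding='utf-8') as f:
        for joke in jokes:
            f.write(joke.replace('<br />', '\n') + '\n\n')
    return len(jokes)


# Try it on made-up data in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'a.txt')
    written = save_jokes(['one<br />two', 'three'], target)
    with open(target, encoding='utf-8') as f:
        saved_text = f.read()
```

Passing the list and the path in as arguments keeps the file handling separate from the download, so it can be checked with any list of strings.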

4. Read the jokes one by one

That is a lot of jokes to look at all at once, and we would rather read them one by one. The program moves to the next joke on a key press and ends after the last one, and an exit key lets you quit at any time, here q. Because the code reads the key with input(), you have to press Enter to confirm. The complete code follows.

import urllib.request
import urllib.parse
import re

pattern = re.compile(r'<div class="j-r-list-c-desc">\s+(.*)\s+</div>')


# Return the content of the given web page
def open_url(url):
    req = urllib.request.Request(url)
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 '
                                 '(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36')
    response = urllib.request.urlopen(req)
    html = response.read().decode('utf-8')
    return html


# num is chosen by the user; returns the list of jokes from all pages
def get_content(num):
    # List that holds the jokes
    text_list = []
    # range(1, num) stops before num, so callers pass (pages + 1)
    for page in range(1, int(num)):
        address = 'http://www.budejie.com/text/' + str(page)
        html = open_url(address)
        result = re.findall(pattern, html)
        # Each page's result is a list; add its items to text_list
        for each in result:
            text_list.append(each)
    return text_list


# num is the number of pages to fetch
def save(num):
    # Open a text file for writing and store the list of jokes in it
    with open('a.txt', 'w', encoding='utf-8') as f:
        text = get_content(num)
        # Same <br /> handling as above
        for each in text:
            if '<br />' in each:
                new_each = re.sub(r'<br />', '\n', each)
                # end with a newline so the next joke starts on its own line
                f.write(new_each + '\n')
            else:
                f.write(str(each) + '\n')


if __name__ == '__main__':
    print('Press q at any time to quit')
    number = int(input('How many pages do you want to read: '))
    content = get_content(number + 1)
    for each in content:
        if '<br />' in each:
            new_each = re.sub(r'<br />', '\n', each)
            print(new_each)
        else:
            print(each)
        # Wait for user input
        user_input = input()
        # q in either case quits
        if user_input == 'q' or user_input == 'Q':
            break
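One detail in the full code is easy to miss: get_content() loops over range(1, num), which stops before num, and that is why the main block passes number + 1. A minimal sketch that moves the + 1 into a helper instead; page_urls and BASE_URL are hypothetical names of mine, not part of the original code.

```python
BASE_URL = 'http://www.budejie.com/text/'


def page_urls(pages):
    """Return the URLs of pages 1..pages, inclusive."""
    return [BASE_URL + str(page) for page in range(1, pages + 1)]


for address in page_urls(3):
    print(address)
```

With a helper like this, a caller can ask for exactly the number of pages it wants and does not need to remember the offset.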

Here is a demonstration of the result.

[Screenshot: the program in action]
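The reading loop can be separated from the download so that it is easy to try with any list of strings. This is a sketch under my own assumptions: the function read_one_by_one and its ask and show parameters are hypothetical and only exist so the loop can run without a real keyboard.

```python
def read_one_by_one(jokes, ask=input, show=print):
    """Show jokes one at a time; return how many were shown."""
    shown = 0
    for joke in jokes:
        show(joke.replace('<br />', '\n'))
        shown += 1
        # input() returns only after Enter; strip and lowercase so ' Q ' also quits.
        if ask().strip().lower() == 'q':
            break
    return shown
```

In the real program it would be called as read_one_by_one(get_content(number + 1)), which uses input() and print() by default.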

The program is not very useful, but as a beginner I am quite satisfied with it. If you are interested, dig deeper! There is much more to crawlers than this, and more advanced features are left to learn.


by @sunhaiyu

2016.8.15
