Python crawler - A multithreaded Python crawler: with daemon=False the main program cannot exit, with daemon=True it can
迷茫 2017-04-18 09:48:59

The code was tested under Python 2.7 and runs as-is.

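Could an expert please advise? I have been reading about daemon for a long time and still cannot figure it out. Please also look at my comments under shomy's answer, where I added some more details.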

Problem description

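If line 54 of mutiple.py is changed to t.daemon=False (this appears to correspond to the worker.daemon = True line in the listing below), then after all the images have been downloaded the program hangs at this point and never exits.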

$ python mutiple.py
一共下载了 253 张图片
Took 57.710124015808105s
...now it hangs and can only be killed with kill -9

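Next I ran $ pstree -h | grep python. Clearly the main thread and its child threads have not exited. Why is that? queue.join() has been called and the print statements ran successfully, so the child threads should already have finished their work.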

python(6591)-+-{python}(6596)
            |-{python}(6597)
            |-{python}(6598)
            |-{python}(6599)
            |-{python}(6600)
            |-{python}(6601)
            |-{python}(6602)
            '-{python}(6603)

Code of mutiple.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from Queue import Queue
from threading import Thread
from time import time
from itertools import chain
from download import setup_download_dir, get_links, download_link


class DownloadWorker(Thread):
    def __init__(self, queue):
        Thread.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            # Get the work from the queue and expand the tuple
            item = self.queue.get()
            if item is None:
                break
            directory, link = item
            download_link(directory, link)
            self.queue.task_done()


def main():
    ts = time()

    url1 = 'http://www.toutiao.com/a6333981316853907714'
    url2 = 'http://www.toutiao.com/a6334459308533350658'
    url3 = 'http://www.toutiao.com/a6313664289211924737'
    url4 = 'http://www.toutiao.com/a6334337170774458625'
    url5 = 'http://www.toutiao.com/a6334486705982996738'
    download_dir = setup_download_dir('thread_imgs')
    # Create a queue to communicate with the worker threads
    queue = Queue()

    links = list(chain(
        get_links(url1),
        get_links(url2),
        get_links(url3),
        get_links(url4),
        get_links(url5),
    ))

    # Create 8 worker threads
    for x in range(8):
        worker = DownloadWorker(queue)
        # Setting daemon to True will let the main thread exit even though the
        # workers are blocking
        worker.daemon = True
        worker.start()

    # Put the tasks into the queue as a tuple
    for link in links:
        queue.put((download_dir, link))

    # Causes the main thread to wait for the queue to finish processing all
    # the tasks
    queue.join()
    print u'一共下载了 {} 张图片'.format(len(links))
    print u'Took {}s'.format(time() - ts)


if __name__ == '__main__':
    main()

"""
一共下载了 253 张图片
Took 57.710124015808105s
"""

Code of download.py

#!/usr/bin/env python
import os
import requests
from pathlib import Path
from bs4 import BeautifulSoup


def get_links(url):
    '''
    return the links in a list
    '''
    req = requests.get(url)
    soup = BeautifulSoup(req.text, "html.parser")
    return [img.attrs.get('src') for img in
            soup.find_all('p', class_='img-wrap')
            if img.attrs.get('src') is not None]


def download_link(directory, link):
    '''
    download the img by the link and save it
    '''
    img_name = '{}.jpg'.format(os.path.basename(link))
    download_path = directory / img_name
    r = requests.get(link)
    with download_path.open('wb') as fd:
        fd.write(r.content)


def setup_download_dir(directory):
    '''
    set the dir and create a new dir if not exists
    '''
    download_dir = Path(directory)
    if not download_dir.exists():
        download_dir.mkdir()
    return download_dir

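A running program starts with one main thread. If the main thread creates a child thread, the two then run independently of each other. When the main thread finishes and wants to exit, the interpreter checks whether the child thread has finished. If it has not, the process waits for the child thread to finish before exiting. Sometimes, however, we want the child threads to exit together with the main thread as soon as the main thread is done, whether or not they have finished. That is what setDaemon(True) is for.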
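One way to see this behaviour is to run the same tiny script twice in a fresh interpreter, once with a daemon thread and once without, and time how long each process takes to exit. This is only a sketch: the CHILD script, run_child and the one-second sleep are made up for illustration and are not part of the crawler.

```python
import subprocess
import sys
import textwrap
import time

# Hypothetical child script: the main thread finishes immediately while a
# worker thread is still busy (here it just sleeps).
CHILD = textwrap.dedent("""
    import threading, time

    def worker():
        time.sleep({sleep})

    t = threading.Thread(target=worker)
    t.daemon = {daemon}
    t.start()
""")


def run_child(daemon, sleep):
    """Run CHILD in a new interpreter; return seconds until the process exits."""
    start = time.time()
    code = CHILD.format(daemon=daemon, sleep=sleep)
    subprocess.check_call([sys.executable, "-c", code])
    return time.time() - start


daemon_secs = run_child(daemon=True, sleep=1)    # exits without waiting
normal_secs = run_child(daemon=False, sleep=1)   # waits for the worker

print("daemon=True : %.2fs" % daemon_secs)
print("daemon=False: %.2fs" % normal_secs)
```

The daemon=False run should take at least as long as the sleep, because the process cannot exit until the worker returns. The daemon=True run only pays for interpreter start-up.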


All replies (1)
左手右手慢动作

My understanding is:

  1. setDaemon(True) marks a thread as a daemon thread: once it is set to True, the child thread is forced to exit when the main thread ends.

  2. queue.join() makes the main thread wait until the worker threads have finished every queued task (task_done() has been called for each item) before it continues. It does not wait for the threads themselves to exit.

  3. Thread objects do not provide a method for making a thread exit.
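Point 2 can be checked with a small stand-alone sketch. It assumes a simplified worker that only records each item instead of downloading it. The names worker, done and alive are illustrative, and the threads are made daemons here only so that the demo itself can exit.

```python
import threading
try:
    from queue import Queue   # Python 3
except ImportError:
    from Queue import Queue   # Python 2.7, as in the question


def worker(q, done):
    while True:
        item = q.get()        # blocks forever once the queue is empty
        done.append(item)     # stand-in for download_link(directory, link)
        q.task_done()


q = Queue()
done = []
workers = []
for _ in range(4):
    t = threading.Thread(target=worker, args=(q, done))
    t.daemon = True           # only so that this demo itself can exit
    t.start()
    workers.append(t)

for n in range(10):
    q.put(n)

q.join()                      # returns once task_done() was called 10 times
alive = [t.is_alive() for t in workers]
print("%d tasks done, workers alive: %s" % (len(done), alive))
```

After q.join() returns, all ten tasks are done, yet every worker still reports is_alive() as True: each one is back in q.get(), waiting for an item that never arrives.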

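To sum up the three points above: with setDaemon(False), the main thread waits for the child threads to exit before the program can end. In this code the workers never exit. Once the queue is empty, each one blocks forever in queue.get(), and nothing ever puts None on the queue to trigger the break. So the program hangs.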
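If the workers should stay non-daemon, one common way out is to send each worker a None sentinel after queue.join() and then join the threads. The sketch below is not the asker's final code: fake_download is a hypothetical stand-in for download_link, and the worker count and task list are arbitrary.

```python
import threading
try:
    from queue import Queue   # Python 3
except ImportError:
    from Queue import Queue   # Python 2.7, as in the question

NUM_WORKERS = 4


def fake_download(directory, link):
    """Hypothetical stand-in for download_link(); does no network I/O."""
    return "%s/%s.jpg" % (directory, link)


def worker(q, saved):
    while True:
        item = q.get()
        if item is None:      # sentinel: no more work, leave the loop
            q.task_done()
            break
        directory, link = item
        saved.append(fake_download(directory, link))
        q.task_done()


q = Queue()
saved = []
workers = [threading.Thread(target=worker, args=(q, saved))
           for _ in range(NUM_WORKERS)]
for t in workers:
    t.daemon = False          # non-daemon, the case that used to hang
    t.start()

for n in range(20):
    q.put(("thread_imgs", "img%d" % n))

q.join()                      # all 20 tasks processed
for _ in workers:
    q.put(None)               # one sentinel per worker
for t in workers:
    t.join()                  # every worker has now returned from its loop

still_alive = any(t.is_alive() for t in workers)
print("%d files, workers alive: %s" % (len(saved), still_alive))
```

Because every worker receives its own sentinel, each run() loop ends normally and the interpreter can exit without daemon=True or kill -9.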
