python crawler - A multithreaded Python crawler: with daemon=False the main program cannot exit; with daemon=True it can
迷茫 2017-04-18 09:48:59

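The code was tested under Python 2.7 and runs as-is.

Could an expert take a look? I have been reading about daemon for a long time and still do not understand it. Please also see my comments under shomy's answer, where I added some details.

Problem description

If line 54 of mutiple.py is changed to t.daemon=False (the worker.daemon = True assignment in the listing below), the program hangs at this point after all the images have been downloaded and never exits.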

$ python mutiple.py
一共下载了 253 张图片
Took 57.710124015808105s
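... it now hangs and can only be killed with kill -9

Next I ran $ pstree -h | grep python. Clearly the main thread and its child threads have not exited. Why is that? queue.join() was called and has returned, and the print statements ran successfully, so the child threads should already have finished their work.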

python(6591)-+-{python}(6596)
            |-{python}(6597)
            |-{python}(6598)
            |-{python}(6599)
            |-{python}(6600)
            |-{python}(6601)
            |-{python}(6602)
            '-{python}(6603)

Code for mutiple.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from Queue import Queue
from threading import Thread
from time import time
from itertools import chain
from download import setup_download_dir, get_links, download_link


class DownloadWorker(Thread):
    def __init__(self, queue):
        Thread.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            # Get the work from the queue and expand the tuple
            item = self.queue.get()
            if item is None:
                break
            directory, link = item
            download_link(directory, link)
            self.queue.task_done()


def main():
    ts = time()

    url1 = 'http://www.toutiao.com/a6333981316853907714'
    url2 = 'http://www.toutiao.com/a6334459308533350658'
    url3 = 'http://www.toutiao.com/a6313664289211924737'
    url4 = 'http://www.toutiao.com/a6334337170774458625'
    url5 = 'http://www.toutiao.com/a6334486705982996738'
    download_dir = setup_download_dir('thread_imgs')
    # Create a queue to communicate with the worker threads
    queue = Queue()

    links = list(chain(
        get_links(url1),
        get_links(url2),
        get_links(url3),
        get_links(url4),
        get_links(url5),
    ))

    # Create 8 worker threads
    for x in range(8):
        worker = DownloadWorker(queue)
        # Setting daemon to True will let the main thread exit even though the
        # workers are blocking
        worker.daemon = True
        worker.start()

    # Put the tasks into the queue as a tuple
    for link in links:
        queue.put((download_dir, link))

    # Causes the main thread to wait for the queue to finish processing all
    # the tasks
    queue.join()
    print u'一共下载了 {} 张图片'.format(len(links))
    print u'Took {}s'.format(time() - ts)


if __name__ == '__main__':
    main()

"""
一共下载了 253 张图片
Took 57.710124015808105s
"""

Code for download.py

#!/usr/bin/env python
import os
import requests
from pathlib import Path
from bs4 import BeautifulSoup


def get_links(url):
    '''
    return the links in a list
    '''
    req = requests.get(url)
    soup = BeautifulSoup(req.text, "html.parser")
    return [img.attrs.get('src') for img in
            soup.find_all('p', class_='img-wrap')
            if img.attrs.get('src') is not None]


def download_link(directory, link):
    '''
    download the img by the link and save it
    '''
    img_name = '{}.jpg'.format(os.path.basename(link))
    download_path = directory / img_name
    r = requests.get(link)
    with download_path.open('wb') as fd:
        fd.write(r.content)


def setup_download_dir(directory):
    '''
    set the dir and create a new dir if not exists
    '''
    download_dir = Path(directory)
    if not download_dir.exists():
        download_dir.mkdir()
    return download_dir

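A running program executes a main thread. If the main thread creates a child thread, the two go their separate ways and run independently. When the main thread finishes and wants to exit, it checks whether the child thread has finished; if it has not, the main thread waits for it to finish before exiting. Sometimes, though, we want the child threads to exit together with the main thread as soon as the main thread is done, whether or not they have finished. That is what setDaemon(True) (equivalently, thread.daemon = True) is for.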
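The behaviour described above can be observed from the outside with a small experiment. The following is only a sketch, written for Python 3 rather than the 2.7 used in the question: it starts a child interpreter whose single background thread blocks forever (standing in for a worker stuck in queue.get()), and reports whether that process exits by itself. The names CHILD and exits_on_its_own are made up for this illustration.

```python
import subprocess
import sys

# Source of a tiny child program. Its background thread waits on an Event
# that is never set, so the thread never finishes on its own.
CHILD = """
import threading
stop = threading.Event()
t = threading.Thread(target=stop.wait)
t.daemon = {daemon}
t.start()
print("main thread finished")
"""


def exits_on_its_own(daemon, timeout=3.0):
    """Return True if the child process terminates within `timeout` seconds."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD.format(daemon=daemon)],
        stdout=subprocess.PIPE,
    )
    try:
        proc.communicate(timeout=timeout)
        return True
    except subprocess.TimeoutExpired:
        # The interpreter is still waiting for the non-daemon thread.
        proc.kill()
        proc.communicate()
        return False


if __name__ == "__main__":
    print("daemon=True  exits:", exits_on_its_own(True))
    print("daemon=False exits:", exits_on_its_own(False))
```

In both cases the child's main thread reaches its last line; the only difference is whether the interpreter waits for the blocked thread before the process ends.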


All replies (1)
左手右手慢动作

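My understanding is this:

  1. setDaemon(True) marks a thread as a daemon thread: once it is set to True, the child thread is forcibly terminated when the main thread ends.

  2. queue.join() blocks the main thread until every item put on the queue has been marked with task_done(); only then does the main thread continue. It does not wait for the worker threads themselves to exit.

  3. Threads do not provide an exit function. Each worker here loops forever in run() and, once the queue is empty, blocks in queue.get(), because nothing ever puts the None it checks for.

Putting these three points together: with setDaemon(False), the main thread keeps waiting for the child threads to exit, which they never do. That is why the program hangs.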
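One common way to let non-daemon workers exit cleanly is to send each of them a sentinel value once the real work is done, then join the threads. The sketch below is an illustration under assumptions, not the asker's code: item * 2 stands in for download_link(directory, link), and NUM_WORKERS, worker and run_jobs are names invented here. The import fallback is only there so the same file runs on Python 2.7 and Python 3.

```python
import threading

try:
    from queue import Queue  # Python 3
except ImportError:
    from Queue import Queue  # Python 2.7, as in the question

NUM_WORKERS = 4  # arbitrary choice for this sketch


def worker(queue, results):
    while True:
        item = queue.get()
        if item is None:
            # Sentinel received: account for it, then leave the loop so
            # the thread actually terminates.
            queue.task_done()
            break
        results.append(item * 2)  # placeholder for the real download
        queue.task_done()


def run_jobs(items):
    queue = Queue()
    results = []
    threads = [
        threading.Thread(target=worker, args=(queue, results))
        for _ in range(NUM_WORKERS)
    ]
    for t in threads:
        t.daemon = False  # the setting that made the original program hang
        t.start()

    for item in items:
        queue.put(item)
    queue.join()  # every real item has been processed

    for _ in threads:
        queue.put(None)  # one sentinel per worker
    for t in threads:
        t.join()  # returns because each worker breaks out of its loop

    return sorted(results), [t.is_alive() for t in threads]


if __name__ == "__main__":
    done, alive = run_jobs(range(10))
    print(done)
    print(alive)
```

Applied to mutiple.py, the same idea would mean putting one None per worker on the queue after queue.join() returns, since DownloadWorker.run() already checks for None.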
