Python uses multi-threading to crawl web page information

巴扎黑
Release: 2017-08-09 11:03:41

This article introduces how to crawl web pages with multiple threads in Python. Based on a concrete example, it analyzes the relevant techniques and caveats of multi-threaded programming in Python, and it includes a demo implementation of multi-threaded page crawling for readers who need it.

The example in this article demonstrates how to implement multi-threaded web page crawling in Python. It is shared here for your reference; the details are as follows.

Recently I have been working on things related to web crawlers. I took a look at larbin, an open-source crawler written in C++, and read through its design ideas and the implementation of some key techniques:

1. Larbin deduplicates URLs with a very efficient Bloom filter (a minimal Python sketch of this idea follows below);
2. DNS resolution uses the open-source asynchronous component adns;
3. The URL queue keeps part of its contents cached in memory and writes the rest to files;
4. Larbin does a great deal of work around file operations;
5. Larbin maintains a connection pool: it creates sockets, sends HTTP GET requests to the target sites, fetches the content, and then parses the headers and so on;
6. It handles a large number of descriptors with I/O multiplexing via poll, which is very efficient;
7. Larbin is highly configurable;
8. Most of the data structures it uses were written by the author from the ground up, with almost no reliance on things like the STL.
...

There is much more; I will write an article summarizing it when I have time.
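As an aside, the Bloom filter in point 1 is worth spelling out. Below is a minimal sketch of Bloom-filter-based URL deduplication in Python, written only to illustrate the idea; the bit-array size, the number of hash functions, and the MD5-slicing trick are my own choices, not larbin's actual implementation.

import hashlib

class BloomFilter(object):
  '''A minimal Bloom filter for URL deduplication (illustration only).'''
  def __init__(self, size_in_bits=1<<20, num_hashes=4):
    self.size = size_in_bits
    self.num_hashes = num_hashes
    self.bits = bytearray(size_in_bits//8)
  def _positions(self, url):
    #derive several bit positions from one MD5 digest (an arbitrary choice)
    digest = hashlib.md5(url).hexdigest()
    for i in range(self.num_hashes):
      yield int(digest[i*8:(i+1)*8], 16) % self.size
  def add(self, url):
    for pos in self._positions(url):
      self.bits[pos//8] |= 1 << (pos%8)
  def __contains__(self, url):
    #may give a false positive, never a false negative
    return all(self.bits[pos//8] & (1 << (pos%8)) for pos in self._positions(url))

seen = BloomFilter()
if 'http://example.com/' not in seen:
  seen.add('http://example.com/')   #crawl it, then remember it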

In the past two days I wrote a multi-threaded page downloader in Python. For an I/O-bound application, multi-threading is clearly a good fit, and the thread pool I wrote earlier can be reused here. Fetching pages in Python is actually very easy: the urllib2 module is convenient, and a download takes basically two or three lines of code (a minimal example follows). But while ready-made modules solve the problem conveniently, they contribute little to one's own technical growth, because the key algorithms are implemented by someone else rather than by you, and many details stay out of sight. As technical people we should not only use modules and APIs written by others; implementing things ourselves is how we learn more.
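For reference, the "two or three lines" with urllib2 look roughly like this (a minimal sketch; the URL and output filename are placeholders):

import urllib2

#fetch one page and save it to disk
data = urllib2.urlopen('http://example.com/').read()
open('example.html','w').write(data)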

I decided to start from the socket level, encapsulating the GET request and parsing the response headers myself. This also lets me handle DNS resolution separately, for example with a DNS cache, so writing it myself gives more control and makes it easier to extend. For timeouts I use a global 5-second timeout; for redirects (301 or 302) I follow at most 3 hops, because during earlier testing I found that many sites redirect to themselves, which creates an infinite loop, so an upper limit is needed. The underlying idea is fairly simple; just read the code.

After finishing it I compared its performance with urllib2 and found that my own version was more efficient, while urllib2's error rate was slightly higher. I do not know why; some people online say urllib2 has minor problems in multi-threaded contexts, but I am not clear on the details.

First, the code:

fetchPage.py: download a page with the HTTP GET method and store it in a file


'''
Created on 2012-3-13
Get Page using GET method
Default using HTTP Protocol , http port 80
@author: xiaojay
'''
import socket
import statistics   # the author's own helper module (config values, result codes, counters), not a standard-library module
import datetime
import threading
socket.setdefaulttimeout(statistics.timeout)
class Error404(Exception):
  '''Can not find the page.'''
  pass
class ErrorOther(Exception):
  '''Some other exception'''
  def __init__(self,code):
    #print 'Code :',code
    pass
class ErrorTryTooManyTimes(Exception):
  '''try too many times'''
  pass
def downPage(hostname ,filename , trytimes=0):
  try :
    #To avoid too many tries .Try times can not be more than max_try_times
    if trytimes >= statistics.max_try_times :
      raise ErrorTryTooManyTimes
  except ErrorTryTooManyTimes :
    return statistics.RESULTTRYTOOMANY,hostname+filename
  try:
    s = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
    #DNS cache : a plain dict shared by all worker threads (no locking around it)
    if hostname in statistics.DNSCache:
      addr = statistics.DNSCache[hostname]
    else:
      addr = socket.gethostbyname(hostname)
      statistics.DNSCache[hostname] = addr
    #connect to http server ,default port 80
    s.connect((addr,80))
    msg = 'GET '+filename+' HTTP/1.0\r\n'
    msg += 'Host: '+hostname+'\r\n'
    msg += 'User-Agent:xiaojay\r\n\r\n'
    code = ''
    f = None
    s.sendall(msg)
    first = True
    while True:
      msg = s.recv(40960)
      if not len(msg):
        if f!=None:
          f.flush()
          f.close()
        break
      # Head information must be in the first recv buffer
      if first:
        first = False
        headpos = msg.index("\r\n\r\n")
        code,other = dealwithHead(msg[:headpos])
        if code=='200':
          #statistics.fetched_url += 1
          f = open('pages/'+str(abs(hash(hostname+filename))),'w')
          f.writelines(msg[headpos+4:])
        elif code=='301' or code=='302':
          #if code is 301 or 302 , close this socket and retry with the redirect location;
          #return the redirected download's result instead of falling through to RESULTFETCHED
          s.close()
          if other.startswith("http") :
            hname, fname = parse(other)
            return downPage(hname,fname,trytimes+1)
          else :
            return downPage(hostname,other,trytimes+1)
        elif code=='404':
          raise Error404
        else :
          raise ErrorOther(code)
      else:
        if f!=None :f.writelines(msg)
    s.shutdown(socket.SHUT_RDWR)
    s.close()
    return statistics.RESULTFETCHED,hostname+filename
  except Error404 :
    return statistics.RESULTCANNOTFIND,hostname+filename
  except ErrorOther:
    return statistics.RESULTOTHER,hostname+filename
  except socket.timeout:
    return statistics.RESULTTIMEOUT,hostname+filename
  except Exception, e:
    return statistics.RESULTOTHER,hostname+filename
def dealwithHead(head):
  '''deal with HTTP HEAD'''
  lines = head.splitlines()
  fstline = lines[0]
  code =fstline.split()[1]
  if code == '404' : return (code,None)
  if code == '200' : return (code,None)
  if code == '301' or code == '302' :
    for line in lines[1:]:
      p = line.index(':')
      key = line[:p]
      if key=='Location' :
        return (code,line[p+2:])
  return (code,None)
def parse(url):
  '''Parse a url to hostname+filename'''
  try:
    u = url.strip().strip('\n').strip('\r').strip('\t')
    if u.startswith('http://') :
      u = u[7:]
    elif u.startswith('https://'):
      u = u[8:]
    if u.find(':80')>0 :
      p = u.index(':80')
      p2 = p + 3
    else:
      if u.find('/')>0:
        p = u.index('/')
        p2 = p
      else:
        p = len(u)
        p2 = -1
    hostname = u[:p]
    if p2>0 :
      filename = u[p2:]
    else : filename = '/'
    return hostname, filename
  except Exception ,e:
    print "Parse wrong : " , url
    print e
def PrintDNSCache():
  '''print DNS dict'''
  n = 1
  for hostname in statistics.DNSCache.keys():
    print n,'\t',hostname, '\t',statistics.DNSCache[hostname]
    n+=1
def dealwithResult(res,url):
  '''Deal with the result of downPage'''
  statistics.total_url+=1
  if res==statistics.RESULTFETCHED :
    statistics.fetched_url+=1
    print statistics.total_url , '\t fetched :', url
  if res==statistics.RESULTCANNOTFIND :
    statistics.failed_url+=1
    print "Error 404 at : ", url
  if res==statistics.RESULTOTHER :
    statistics.other_url +=1
    print "Error Undefined at : ", url
  if res==statistics.RESULTTIMEOUT :
    statistics.timeout_url +=1
    print "Timeout ",url
  if res==statistics.RESULTTRYTOOMANY:
    statistics.trytoomany_url+=1
    print "Try too many times at", url
if __name__=='__main__':
  print 'Get Page using GET method'
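Note that the code imports a module named statistics which is not shown in the article; it is the author's own helper module holding configuration values, result codes and counters (it is not a standard-library module). A plausible minimal reconstruction, based only on how it is used above and on the 5-second timeout and 3-redirect limit mentioned earlier, might look like this:

'''statistics.py : configuration, result codes and counters (reconstructed, not from the original article)'''
#configuration
timeout = 5              #global socket timeout in seconds
max_try_times = 3        #maximum number of redirects to follow
#DNS cache shared by all worker threads
DNSCache = {}
#result codes returned by downPage
RESULTFETCHED = 0
RESULTCANNOTFIND = 1
RESULTOTHER = 2
RESULTTIMEOUT = 3
RESULTTRYTOOMANY = 4
#counters updated by dealwithResult and writeFile
total_url = 0
fetched_url = 0
failed_url = 0
other_url = 0
timeout_url = 0
trytoomany_url = 0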

Below, I use the thread pool from the previous article to implement parallel crawling with multiple threads, and compare the performance of the page-download method I wrote above against urllib2.


'''
Created on 2012-3-16
@author: xiaojay
'''
import fetchPage
import threadpool
import datetime
import statistics
import urllib2
'''one thread'''
def usingOneThread(limit):
  urlset = open("input.txt","r")
  start = datetime.datetime.now()
  for u in urlset:
    if limit <= 0 : break
    limit-=1
    hostname , filename = fetchPage.parse(u)   #parse lives in the fetchPage module
    res , url = fetchPage.downPage(hostname,filename,0)
    fetchPage.dealwithResult(res,url)          #dealwithResult expects (result code, url)
  end = datetime.datetime.now()
  print "Start at :\t" , start
  print "End at :\t" , end
  print "Total Cost :\t" , end - start
  print 'Total fetched :', statistics.fetched_url
'''threadpool and GET method'''
def callbackfunc(request,result):
  fetchPage.dealwithResult(result[0],result[1])
def usingThreadpool(limit,num_thread):
  urlset = open("input.txt","r")
  start = datetime.datetime.now()
  main = threadpool.ThreadPool(num_thread)
  for url in urlset :
    try :
      hostname , filename = fetchPage.parse(url)
      req = threadpool.WorkRequest(fetchPage.downPage,args=[hostname,filename],kwds={},callback=callbackfunc)
      main.putRequest(req)
    except Exception ,e:
      print e
  while True:
    try:
      main.poll()
      if statistics.total_url >= limit : break
    except threadpool.NoResultsPending:
      print "no pending results"
      break
    except Exception ,e:
      print e
  end = datetime.datetime.now()
  print "Start at :\t" , start
  print "End at :\t" , end
  print "Total Cost :\t" , end - start
  print 'Total url :',statistics.total_url
  print 'Total fetched :', statistics.fetched_url
  print 'Lost url :', statistics.total_url - statistics.fetched_url
  print 'Error 404 :' ,statistics.failed_url
  print 'Error timeout :',statistics.timeout_url
  print 'Error Try too many times ' ,statistics.trytoomany_url
  print 'Error Other faults ',statistics.other_url
  main.stop()
'''threadpool and urllib2 '''
def downPageUsingUrlib2(url):
  try:
    req = urllib2.Request(url)
    fd = urllib2.urlopen(req)
    f = open("pages3/"+str(abs(hash(url))),&#39;w&#39;)
    f.write(fd.read())
    f.flush()
    f.close()
    return url ,'success'
  except Exception:
    return url , None
def writeFile(request,result):
  statistics.total_url += 1
  if result[1]!=None :
    statistics.fetched_url += 1
    print statistics.total_url,'\tfetched :', result[0],
  else:
    statistics.failed_url += 1
    print statistics.total_url,'\tLost :',result[0],
def usingThreadpoolUrllib2(limit,num_thread):
  urlset = open("input.txt","r")
  start = datetime.datetime.now()
  main = threadpool.ThreadPool(num_thread)
  for url in urlset :
    try :
      req = threadpool.WorkRequest(downPageUsingUrlib2,args=[url],kwds={},callback=writeFile)
      main.putRequest(req)
    except Exception ,e:
      print e
  while True:
    try:
      main.poll()
      if statistics.total_url >= limit : break
    except threadpool.NoResultsPending:
      print "no pending results"
      break
    except Exception ,e:
      print e
  end = datetime.datetime.now()
  print "Start at :\t" , start
  print "End at :\t" , end
  print "Total Cost :\t" , end - start
  print 'Total url :',statistics.total_url
  print 'Total fetched :', statistics.fetched_url
  print 'Lost url :', statistics.total_url - statistics.fetched_url
  main.stop()
if __name__ =='__main__':
  '''too slow'''
  #usingOneThread(100)
  '''use Get method'''
  #usingThreadpool(3000,50)
  '''use urllib2'''
  usingThreadpoolUrllib2(3000,50)
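Both drivers read their URLs from input.txt, one URL per line, for example:

http://example.com/index.html
http://example.org/

Also note that the output directories pages/ (for the GET version) and pages3/ (for the urllib2 version) must exist beforehand, since the code opens files inside them without creating the directories; otherwise the open call fails and the page is counted as an error.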

Experimental analysis:

Experimental data: 3,000 URLs captured by larbin and processed through a Mercator-style queue model (which I implemented in C++; I will blog about it when I get the chance). The URL collection is random and representative. A thread pool of 50 threads is used.
Experimental environment: Ubuntu 10.04, a good network connection, Python 2.6
Storage: small files, one file per page
PS: Because the school charges for Internet access by traffic, web crawling really burns through the quota! In a few days I may run a large-scale download experiment with hundreds of thousands of URLs.

Experimental results:

Using urllib2 , usingThreadpoolUrllib2(3000,50)

Start at : 2012-03-16 22:18:20.956054
End at : 2012-03-16 22:22:15.203018
Total Cost : 0:03:54.246964
Total url : 3001
Total fetched : 2442
Lost url: 559

Physical storage size of the downloaded pages: 84,088 KB

Using my own GET-method downloader, usingThreadpool(3000,50)

Start at : 2012-03-16 22:23:40.206730
End at : 2012-03-16 22:26:26.843563
Total Cost : 0:02:46.636833
Total url : 3002
Total fetched : 2484
Lost url : 518
Error 404 : 94
Error timeout : 312
Error Try too many times 0
Error Other faults 112

Physical storage size of the downloaded pages: 87,168 KB

Summary: the download program I wrote myself is quite efficient and loses fewer pages. But on reflection there are still many places that could be optimized. For example, the files are too scattered: creating and releasing so many small files certainly costs performance, and naming them by hash adds a lot of computation as well; with a good strategy these costs could largely be avoided. As for DNS, there is no need to rely on Python's built-in resolution, because the default resolution is synchronous and DNS lookups are generally time-consuming; doing them asynchronously across multiple threads, combined with an appropriate DNS cache, could improve efficiency considerably (a small sketch follows). Beyond that, in real crawling there are huge numbers of URLs that cannot all be held in memory at once; they should be distributed according to some reasonable strategy or algorithm. In short, there is still plenty that needs to be done, and plenty that can be optimized, in the page-collection stage.
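As one concrete illustration of the DNS point above, here is a small sketch (my own addition, not part of the article's code) of a thread-safe DNS cache with a pre-resolution step run by several worker threads before crawling starts; the lock also avoids the unsynchronized access that the shared DNSCache dict in fetchPage.py currently gets away with:

import socket
import threading

dns_lock = threading.Lock()
dns_cache = {}

def resolve(hostname):
  '''Resolve a hostname once and cache the result under a lock.'''
  with dns_lock:
    if hostname in dns_cache:
      return dns_cache[hostname]
  try:
    addr = socket.gethostbyname(hostname)   #still a blocking call, but done before crawling
  except socket.gaierror:
    addr = None
  with dns_lock:
    dns_cache[hostname] = addr
  return addr

def prefetch(hostnames, num_threads=10):
  '''Warm the cache with several resolver threads before the crawl starts.'''
  pending = list(set(hostnames))
  pending_lock = threading.Lock()
  def worker():
    while True:
      with pending_lock:
        if not pending:
          return
        hostname = pending.pop()
      resolve(hostname)
  threads = [threading.Thread(target=worker) for i in range(num_threads)]
  for t in threads: t.start()
  for t in threads: t.join()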

