(1) urllib2 + BeautifulSoup: capturing Google search result links
I recently joined a project that needs to process Google search results. I had already learned the usual Python tools for handling web pages, and in practice I used urllib2 and BeautifulSoup to crawl them. However, when crawling Google search results, I found that parsing the page source of the results page directly yields many "dirty" links.
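For illustration, here is a minimal sketch of that naive approach, assuming Python 2 with BeautifulSoup 3; the query URL and User-Agent header are just examples, and Google may block or reshape responses to scripted requests. It simply dumps every href found on the results page, which is exactly where the "dirty" links come from.

import urllib2, urllib
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3

query = 'titanic james'
url = 'http://www.google.com/search?q=' + urllib.quote(query)
request = urllib2.Request(url, None, {'User-Agent': 'Mozilla/5.0'})
page = urllib2.urlopen(request).read()

soup = BeautifulSoup(page)
for a in soup.findAll('a', href=True):
    print a['href']  # real result links mixed with navigation, cache and ad links (the "dirty" links)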
The screenshot below shows the results of a search for "titanic james":
The links marked in red are not wanted; the ones marked in blue are what needs to be captured and processed.
Of course, these "dirty links" could be filtered out with hand-written rules, but that would make the program considerably more complex. Just as I was frowning over the filtering rules, a classmate reminded me that Google probably provides a relevant API, and suddenly everything clicked.
(2) Google Web Search API + Multithreading
The API documentation gives an example of searching with Python:
import urllib2
import simplejson

# The request also includes the userip parameter which provides the end
# user's IP address. Doing so will help distinguish this legitimate
# server-side traffic from traffic which doesn't come from an end-user.
url = ('https://ajax.googleapis.com/ajax/services/search/web'
       '?v=1.0&q=Paris%20Hilton&userip=USERS-IP-ADDRESS')

request = urllib2.Request(
    url, None, {'Referer': 'http://www.example.com'})  # enter the URL of your site here
response = urllib2.urlopen(request)

# Process the JSON string.
results = simplejson.load(response)

# now have some fun with the results...
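As a small follow-on sketch (not part of the official snippet), the parsed JSON can be traversed roughly like this; the hits live under responseData -> results, and field names such as url and titleNoFormatting come from the API's web result objects:

# Assumes the 'results' variable from the snippet above.
for hit in results['responseData']['results']:
    print hit['titleNoFormatting'], hit['url']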
In a real application you may need to fetch many Google result pages, so multithreading is used to share the crawling work. For a detailed introduction to the Google Web Search API, please see here (the Standard URL Arguments are described there). In addition, pay special attention to the rsz parameter in the URL: it must be 8 or less; if it is greater than 8, the API reports an error.
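To make the rsz/start interplay concrete, here is a tiny sketch with assumed example values: each request returns at most rnum_perpage hits, and start is the offset of that page.

import urllib

keywords = 'titanic james'   # example query
rnum_perpage = 8             # rsz: must stay at 8 or below, otherwise the API reports an error
for x in range(3):           # three pages, just for illustration
    start = x * rnum_perpage
    print ('https://ajax.googleapis.com/ajax/services/search/web'
           '?v=1.0&q=%s&rsz=%s&start=%s') % (urllib.quote(keywords), rnum_perpage, start)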
(3) Code implementation
The implementation still has problems: it runs, but its robustness is poor and it needs improvement. I am new to Python, so I would be grateful if experienced readers could point out my mistakes.
#-*-coding:utf-8-*-
import urllib2, urllib
import simplejson
import os, time, threading
import common, html_filter

#input the keywords
keywords = raw_input('Enter the keywords: ')

#define rnum_perpage, pages
rnum_perpage = 8
pages = 8

#define the worker thread function
def thread_scratch(url, rnum_perpage, page):
    url_set = []
    try:
        request = urllib2.Request(url, None, {'Referer': 'http://www.sina.com'})
        response = urllib2.urlopen(request)
        # Process the JSON string.
        results = simplejson.load(response)
        info = results['responseData']['results']
    except Exception, e:
        print 'error occurred'
        print e
    else:
        for minfo in info:
            url_set.append(minfo['url'])
            print minfo['url']
    #process the fetched links
    i = 0
    for u in url_set:
        try:
            request_url = urllib2.Request(u, None, {'Referer': 'http://www.sina.com'})
            request_url.add_header('User-agent', 'CSC')
            response_data = urllib2.urlopen(request_url).read()
            #filter the page
            #content_data = html_filter.filter_tags(response_data)
            #write to file
            filenum = i + page
            filename = dir_name + '/related_html_' + str(filenum)
            print '  write start: related_html_' + str(filenum)
            f = open(filename, 'w+', -1)
            f.write(response_data)
            #print content_data
            f.close()
            print '  write down: related_html_' + str(filenum)
        except Exception, e:
            print 'error occurred 2'
            print e
        i = i + 1
    return

#create the output directory
dir_name = 'related_html_' + urllib.quote(keywords)
if os.path.exists(dir_name):
    print 'exists file'
    common.delete_dir_or_file(dir_name)
os.makedirs(dir_name)

#scrape the web pages
print 'start to scratch web pages:'
for x in range(pages):
    print "page:%s" % (x + 1)
    page = x * rnum_perpage
    url = ('https://ajax.googleapis.com/ajax/services/search/web'
           '?v=1.0&q=%s&rsz=%s&start=%s') % (urllib.quote(keywords), rnum_perpage, page)
    print url
    t = threading.Thread(target=thread_scratch, args=(url, rnum_perpage, page))
    t.start()

#the main thread waits for the worker threads to finish
main_thread = threading.currentThread()
for t in threading.enumerate():
    if t is main_thread:
        continue
    t.join()