
Python regular analysis of nginx access logs

高洛峰 · Released: 2017-02-21 10:46:40

Preface

The script in this article analyzes an nginx access log, mainly to count the number of visits to each site URI. The results are provided to the R&D team for reference. Since the analysis relies on regular expressions, readers who have not yet encountered them should read up on the topic first; regular expressions are too large a subject to cover properly in one or two articles, so they are not expanded on here.

Before we begin, let’s take a look at the log structure to be analyzed:

127.0.0.1 - - [19/Jun/2012:09:16:22 +0100] "GET /GO.jpg HTTP/1.1" 499 0 "http://domain.com/htm_data/7/1206/758536.html" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; SE 2.X MetaSr 1.0)"
127.0.0.1 - - [19/Jun/2012:09:16:25 +0100] "GET /Zyb.gif HTTP/1.1" 499 0 "http://domain.com/htm_data/7/1206/758536.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; QQDownload 711; SV1; .NET4.0C; .NET4.0E; 360SE)"

This is a modified log: sensitive content has been deleted or replaced, but that does not affect the analysis. The exact format does not matter much, because nginx access logs are customizable and every company's setup may differ slightly. The key is to understand the script and adapt it to your own logs; the format shown here is only a reference, and the logs on your own servers will almost certainly look different. With the log format understood, we can start writing the script.
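As a quick sanity check before reading the full script, here is a minimal sketch that applies a regex of the kind the script below uses to one of the sample lines above, so you can see what each capture group holds (the sample line is abbreviated for readability):

```python
import re

# Pattern matching the combined-style log line shown above.
pattern = (r'(\d+\.\d+\.\d+\.\d+)\s-\s-\s'  # IP address
           r'\[(.+)\]\s'                    # datetime
           r'"GET\s(.+)\s\w+/.+"\s'         # requested file
           r'(\d+)\s'                       # status
           r'(\d+)\s'                       # bandwidth
           r'"(.+)"\s'                      # referrer
           r'"(.+)"')                       # user agent

line = ('127.0.0.1 - - [19/Jun/2012:09:16:22 +0100] "GET /GO.jpg HTTP/1.1" '
        '499 0 "http://domain.com/htm_data/7/1206/758536.html" '
        '"Mozilla/4.0 (compatible; MSIE 7.0)"')

match = re.findall(pattern, line)
print(match[0][0])  # 127.0.0.1
print(match[0][5])  # http://domain.com/htm_data/7/1206/758536.html
```

Each line yields a 7-tuple; index 5 is the referrer, which is what the script counts.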

I will post the code first and explain it later:

import re
from operator import itemgetter

def parser_logfile(logfile):
    """Parse the log file and return a list of per-line match results."""
    pattern = (r'(\d+\.\d+\.\d+\.\d+)\s-\s-\s'  # IP address
               r'\[(.+)\]\s'                    # datetime
               r'"GET\s(.+)\s\w+/.+"\s'         # requested file
               r'(\d+)\s'                       # status
               r'(\d+)\s'                       # bandwidth
               r'"(.+)"\s'                      # referrer
               r'"(.+)"')                       # user agent
    url_list = []
    with open(logfile, 'r') as fi:
        for line in fi:
            url_list.append(re.findall(pattern, line))
    return url_list

def parser_urllist(url_list):
    """Extract the referrer URL (group index 5) from each matched line."""
    urls = []
    for url in url_list:
        for r in url:
            urls.append(r[5])
    return urls

def get_urldict(urls):
    """Count how many times each URL appears."""
    d = {}
    for url in urls:
        d[url] = d.get(url, 0) + 1
    return d

def url_count(logfile):
    url_list = parser_logfile(logfile)
    urls = parser_urllist(url_list)
    totals = get_urldict(urls)
    return totals

if __name__ == '__main__':
    urls_with_counts = url_count('example.log')
    sorted_by_count = sorted(urls_with_counts.items(), key=itemgetter(1), reverse=True)
    print(sorted_by_count)

A brief explanation of the script: parser_logfile() parses the log and returns a list of matches per line; the regular expression itself is not explained here, since the comments show what each part matches. parser_urllist() extracts the URL each visitor came from, and get_urldict() returns a dictionary keyed by URL, adding 1 to the value each time a key repeats, so the returned dictionary maps each URL to its total number of visits. url_count() simply ties the previous functions together. In the main section, the one thing worth discussing is itemgetter, which lets you sort by a specified element; an example makes it clear:

>>> from operator import itemgetter
>>> a=[('b',2),('a',1),('c',0)] 
>>> s=sorted(a,key=itemgetter(1))
>>> s
[('c', 0), ('a', 1), ('b', 2)]
>>> s=sorted(a,key=itemgetter(0))
>>> s
[('a', 1), ('b', 2), ('c', 0)]
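For readers less familiar with the operator module: itemgetter(1) is interchangeable with a lambda that pulls out index 1. A small sketch, using the same illustrative list as above:

```python
from operator import itemgetter

a = [('b', 2), ('a', 1), ('c', 0)]

# itemgetter(1) and a lambda extracting index 1 produce identical orderings.
assert sorted(a, key=itemgetter(1)) == sorted(a, key=lambda t: t[1])

print(sorted(a, key=itemgetter(1), reverse=True))  # [('b', 2), ('a', 1), ('c', 0)]
```

itemgetter is marginally faster and arguably reads better, but either works in the script's sorted() call.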

The reverse=True parameter sorts in descending order, from largest to smallest. Running the script produces:

[('http://domain.com/htm_data/7/1206/758536.html', 141), ('http://domain.com/?q=node&page=12', 3), ('http://website.net/htm_data/7/1206/758536.html', 1)]
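As an aside (not part of the original script), the counting and sorting steps can be collapsed into one call with the standard library's collections.Counter, whose most_common() already returns (item, count) pairs sorted in descending order. A sketch with hypothetical URLs:

```python
from collections import Counter

# Stand-in for the list parser_urllist() would return.
urls = [
    'http://domain.com/a.html',
    'http://domain.com/a.html',
    'http://domain.com/b.html',
]

# Counter replaces get_urldict(); most_common() replaces sorted(..., reverse=True).
counts = Counter(urls)
print(counts.most_common())  # [('http://domain.com/a.html', 2), ('http://domain.com/b.html', 1)]
```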

