Preface
The script in this article analyzes an nginx access log, mainly to count the number of visits to each URI on the site; the results are handed to the R&D team for reference. Because the analysis relies on regular expressions, readers who have never worked with them should brush up on the basics first. Regular expressions are too large a topic to cover here; they cannot be explained properly in an article or two.
Before we begin, let’s take a look at the log structure to be analyzed:
127.0.0.1 - - [19/Jun/2012:09:16:22 +0100] "GET /GO.jpg HTTP/1.1" 499 0 "http://domain.com/htm_data/7/1206/758536.html" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; SE 2.X MetaSr 1.0)"
127.0.0.1 - - [19/Jun/2012:09:16:25 +0100] "GET /Zyb.gif HTTP/1.1" 499 0 "http://domain.com/htm_data/7/1206/758536.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; QQDownload 711; SV1; .NET4.0C; .NET4.0E; 360SE)"
The log has been modified and sensitive content deleted or replaced, but that does not affect the analysis. The exact format does not matter much: nginx access logs are customizable and every company's will differ slightly, so the key is to understand the script and adapt it to your own logs. The format I show here is only a reference; the logs on your company's servers will almost certainly look a little different. With the log format out of the way, we can start writing the script.
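For context, entries in this layout come from something very close to nginx's built-in "combined" log format. The directive below is only a sketch (your nginx.conf may name and define it differently); it is here so you can map each field in the sample lines back to an nginx variable:

# sketch of a log_format matching the sample above; nginx's built-in
# "combined" format is essentially this - adjust to whatever your server uses
log_format main '$remote_addr - $remote_user [$time_local] '
                '"$request" $status $body_bytes_sent '
                '"$http_referer" "$http_user_agent"';
access_log /var/log/nginx/access.log main;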
I will post the code first and explain it later:
import re
from operator import itemgetter

def parser_logfile(logfile):
    pattern = (r'(\d+\.\d+\.\d+\.\d+)\s-\s-\s'   # IP address
               r'\[(.+)\]\s'                     # datetime
               r'"GET\s(.+)\s\w+/.+"\s'          # requested file
               r'(\d+)\s'                        # status
               r'(\d+)\s'                        # bandwidth
               r'"(.+)"\s'                       # referrer
               r'"(.+)"')                        # user agent
    fi = open(logfile, 'r')
    url_list = []
    for line in fi:
        url_list.append(re.findall(pattern, line))
    fi.close()
    return url_list

def parser_urllist(url_list):
    urls = []
    for url in url_list:
        for r in url:
            urls.append(r[5])
    return urls

def get_urldict(urls):
    d = {}
    for url in urls:
        d[url] = d.get(url, 0) + 1
    return d

def url_count(logfile):
    url_list = parser_logfile(logfile)
    urls = parser_urllist(url_list)
    totals = get_urldict(urls)
    return totals

if __name__ == '__main__':
    urls_with_counts = url_count('example.log')
    sorted_by_count = sorted(urls_with_counts.items(), key=itemgetter(1), reverse=True)
    print(sorted_by_count)
Script explanation: parser_logfile() parses the log file and returns, for each line, the list of matched fields. I won't dissect the regular expression itself; the comment next to each part tells you what it matches.
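To make the field order concrete, here is a quick check, a sketch run against the first sample entry above (user agent shortened for readability), of what re.findall() returns for a single line. Index 5 of the tuple is the referrer, which is why parser_urllist() reads r[5]:

import re

pattern = (r'(\d+\.\d+\.\d+\.\d+)\s-\s-\s'   # same pattern as in the script
           r'\[(.+)\]\s'
           r'"GET\s(.+)\s\w+/.+"\s'
           r'(\d+)\s'
           r'(\d+)\s'
           r'"(.+)"\s'
           r'"(.+)"')

# first sample entry from above, with the user agent shortened
line = ('127.0.0.1 - - [19/Jun/2012:09:16:22 +0100] "GET /GO.jpg HTTP/1.1" 499 0 '
        '"http://domain.com/htm_data/7/1206/758536.html" "Mozilla/4.0 (compatible; MSIE 7.0)"')

print(re.findall(pattern, line))
# [('127.0.0.1', '19/Jun/2012:09:16:22 +0100', '/GO.jpg', '499', '0',
#   'http://domain.com/htm_data/7/1206/758536.html', 'Mozilla/4.0 (compatible; MSIE 7.0)')]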
parser_urllist() pulls out the URL the visitor came from, i.e. the referrer field (r[5]). get_urldict() returns a dictionary keyed by URL: each time a URL appears again, its value is incremented by 1, so the result maps every URL to its total number of visits. url_count() simply chains the three functions above together. In the main block, the only thing worth a word is itemgetter, which lets sorted() order items by a chosen element. An example makes it clear:
>>> from operator import itemgetter
>>> a = [('b', 2), ('a', 1), ('c', 0)]
>>> s = sorted(a, key=itemgetter(1))
>>> s
[('c', 0), ('a', 1), ('b', 2)]
>>> s = sorted(a, key=itemgetter(0))
>>> s
[('a', 1), ('b', 2), ('c', 0)]
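(As a side note, key=lambda kv: kv[1] would produce the same ordering; itemgetter(1) is simply the more idiomatic, and typically slightly faster, way to write it.)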
The reverse=True argument means descending order, that is, sorting from largest to smallest. Running the script gives:
[('http://domain.com/htm_data/7/1206/758536.html', 141), ('http://domain.com/?q=node&page=12', 3), ('http://website.net/htm_data/7/1206/758536.html', 1)]
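As a closing design note, the tally that get_urldict() builds by hand is exactly what collections.Counter provides out of the box. The sketch below is not part of the original script, just the standard-library equivalent of the counting and sorting steps:

from collections import Counter

def get_urldict(urls):
    # equivalent to the manual d[url] = d.get(url, 0) + 1 loop above
    return Counter(urls)

# Counter also replaces the sorted(..., key=itemgetter(1), reverse=True) call:
# Counter(urls).most_common() returns (url, count) pairs already ordered
# from most visited to least visited.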