Webmasters and site operators commonly rely on website analytics tools such as Google Analytics, Baidu Statistics, and Tencent Analytics. Before any statistics can be produced, the raw data must first be collected. Below we analyze the principles behind data collection and then build a data collection system.
Data Collection Principle Analysis
Simply put, a website analytics tool needs to collect the behavior of users browsing the target website (opening a page, clicking a button, adding an item to the shopping cart, and so on) together with the data attached to that behavior (for example, the order amount generated by a purchase). Early website statistics often collected only one user behavior: opening a page; anything the user did within the page could not be captured. That strategy still supports common analysis perspectives such as basic traffic analysis, referrer analysis, content analysis, and visitor attributes. However, with the spread of ajax and of e-commerce websites, the demand for analyzing e-commerce goals grew ever stronger, and the traditional collection strategy could no longer cope.
Later, Google innovatively introduced customizable data collection scripts in Google Analytics. Through its extensible interface, users only need to write a small amount of javascript to track custom events and custom metrics. Products such as Baidu Statistics and Sogou Analytics have since adopted the Google Analytics model.
The basic principle and process of the two collection modes are in fact the same; the latter simply collects more information through javascript. Let's look at the basic data collection principles shared by today's website statistics tools.
Process Overview
First, a user's behavior triggers an http request from the browser to the page being measured; let's take opening a web page as the example behavior. When the page opens, the javascript snippet embedded in it is executed. Anyone who has used such tools knows that a website statistics tool asks you to add a small piece of javascript to your pages. This snippet typically creates a script element dynamically and points its src to a separate js file, which the browser then requests and executes (the green node in Figure 1). This separate js file is the real data collection script. Once collection finishes, the script sends the data to a back-end collection endpoint (the backend in Figure 1) as http parameters. The back end is usually a dynamic script disguised as an image, written in php, python, or another server-side language; it parses the parameters, records them in a fixed format in the access log, and may also plant some tracking cookies in the http response to the client.
The above is a general process of data collection. The following uses Google Analytics as an example to conduct a relatively detailed analysis of each stage.
Tracking code execution phase
To use Google Analytics (hereinafter GA), you insert a javascript snippet it provides into your page. This snippet is often called the tracking code (or embedding code).
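A representative version of GA's classic asynchronous tracking code looks like this (UA-XXXXX-X is a placeholder for your account ID):

var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-XXXXX-X']);
_gaq.push(['_trackPageview']);

(function() {
  // create a script element and load ga.js asynchronously,
  // choosing the src host according to the page's protocol
  var ga = document.createElement('script');
  ga.type = 'text/javascript';
  ga.async = true;
  ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
  var s = document.getElementsByTagName('script')[0];
  s.parentNode.insertBefore(ga, s);
})();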
_gaq is GA's global array, used to place various configurations. The format of each configuration is:
_gaq.push(['Action', 'param1', 'param2', …]);
Action specifies the configuration action, followed by its parameters. The default tracking code given by GA contains two preset configurations: _setAccount sets the website identification ID, which is assigned when you register with GA, and _trackPageview tells GA to track a page view. For more configuration options, see https://developers.google.com/analytics/devguides/collection/gajs/. In fact, _gaq is used as a FIFO queue, so configuration pushes do not have to appear before the tracking code; see the instructions at the link above for details.
For this article, the mechanics of _gaq are not the focus; the focus is the anonymous function that follows it, which is what the tracking code really needs to do. Its main purpose is to introduce an external js file (ga.js): it creates a script element via document.createElement, points its src at the appropriate ga.js for the current protocol (http or https), and finally inserts the element into the page's DOM tree.
Note that ga.async = true means the external js file is loaded asynchronously: it does not block the browser's parsing, and it executes once the download completes. This attribute was newly introduced in HTML5.
Data collection script execution phase
The data collection script (ga.js) will be executed after being requested. This script generally does the following things:
1. Collect information through the browser's built-in javascript objects, such as the page title (via document.title), the referrer (the previous URL, via document.referrer), the user's screen resolution (via window.screen), and cookie information (via document.cookie).
2. Parse _gaq to collect configuration information. This may include user-defined event tracking, business data (such as product numbers on e-commerce websites, etc.).
3. Parse and splice the data collected in the above two steps according to the predefined format.
4. Request a back-end script, carrying the information to it as http request parameters.
The only tricky part is step 4. The usual way for javascript to call a back-end script is ajax, but ajax cannot make cross-domain requests: ga.js runs in the domain of the website being measured, while the back-end script lives in another domain (GA's back-end collection script is http://www.google-analytics.com/__utm.gif), so ajax does not work. A common workaround is to create an Image object in the js script and point its src attribute at the back-end script, with the data as query parameters; this achieves the cross-domain request, and it is why back-end scripts are often disguised as gif files (a minimal sketch of this beacon pattern appears below). Through http packet capture, you can see ga.js's request to __utm.gif.
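A minimal sketch of the beacon pattern, assuming a hypothetical endpoint stats.example.com:

// the browser issues a plain GET for the image with no same-origin
// restriction, so the endpoint may live in any domain
var img = new Image(1, 1);
img.src = 'http://stats.example.com/1.gif?foo=' + encodeURIComponent('bar');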
You can see that ga.js sends a lot of information when requesting __utm.gif; for example, utmsr=1280×1024 is the screen resolution, and utmac=UA-35712773-1 is the GA identification ID parsed from _gaq.
It is worth noting that __utm.gif is not requested only when the tracking code executes: if event tracking has been configured with _trackEvent, it is also requested each time the event occurs.
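For example, a push like the following (the category, action, and label strings here are illustrative) causes another __utm.gif request each time the event fires:

_gaq.push(['_trackEvent', 'Video', 'Play', 'intro-clip']);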
Since ga.js is compressed and obfuscated and very hard to read, I will not analyze it; instead, I implement a script with similar functionality in the implementation section below.
Back-end script execution phase
GA's __utm.gif is a script disguised as a gif image. A back-end script of this kind generally needs to do the following:
1. Parse the http request parameters.
2. Obtain from the web server information that the client cannot get, such as the visitor's IP.
3. Write the information to the log according to the format.
4. Generate a 1×1 empty gif image as the response content and set the Content-type of the response header to image/gif.
5. Set some required cookie information in the response header through Set-cookie.
The reason for setting cookies is unique-visitor tracking: if the request carries no designated tracking cookie, a globally unique cookie is generated according to some rule and planted on the user; otherwise the tracking cookie received in the request is placed back in Set-cookie, keeping the same user's cookie unchanged (see Figure 4).
Although this approach is not perfect (for example, a user who clears cookies or changes browsers is counted as two users), it is the widely used method today. Note that if there is no need to track the same user across sites, the cookie can be planted under the measured website's domain via js (this is what GA does); to identify a user uniquely across the whole network, it should be planted under the collection server's domain by the back-end script (which is what our implementation below does).
Design and implementation of the system
Based on the principles above, I built my own access log collection system, which I call MyAnalytics.
Determine the information collected
For simplicity, I am not going to implement GA's complete data collection model, but only collect the following information: time, client IP, domain, URL, page title, referrer, screen resolution (height and width), color depth, language, client information (user agent), a user tracking ID, and a website account ID.
Tracking code
The tracking code borrows from the GA model, although for now the configuration object is not used as a FIFO queue.
I name the statistics script ma.js and use the subdomain analytics.codinglabs.org. There is one small problem: I do not have an https server, so the code will break if deployed on an https site; let's ignore that here.
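The embedding snippet itself is not reproduced above; a sketch consistent with the ma.js parser below (and with the U-1-1 website ID that shows up in the test log later) would be:

var _maq = _maq || [];
_maq.push(['_setAccount', 'U-1-1']);

(function() {
  // load ma.js asynchronously; http only, per the caveat above
  var ma = document.createElement('script');
  ma.type = 'text/javascript';
  ma.async = true;
  ma.src = 'http://analytics.codinglabs.org/ma.js';
  var s = document.getElementsByTagName('script')[0];
  s.parentNode.insertBefore(ma, s);
})();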
Front-end statistics script
I wrote a statistics script, ma.js, that is not fully featured but does the basic job:
(function () {
    var params = {};
    // Data from the Document object
    if (document) {
        params.domain = document.domain || '';
        params.url = document.URL || '';
        params.title = document.title || '';
        params.referrer = document.referrer || '';
    }
    // Data from the Window object
    if (window && window.screen) {
        params.sh = window.screen.height || 0;
        params.sw = window.screen.width || 0;
        params.cd = window.screen.colorDepth || 0;
    }
    // Data from the navigator object
    if (navigator) {
        params.lang = navigator.language || '';
    }
    // Parse the _maq configuration (guarded via window so a missing _maq does not throw)
    if (window._maq) {
        for (var i in _maq) {
            switch (_maq[i][0]) {
                case '_setAccount':
                    params.account = _maq[i][1];
                    break;
                default:
                    break;
            }
        }
    }
    // Splice the parameters into a query string
    var args = '';
    for (var i in params) {
        if (args != '') {
            args += '&';
        }
        args += i + '=' + encodeURIComponent(params[i]);
    }
    // Request the back-end script via an Image object
    var img = new Image(1, 1);
    img.src = 'http://analytics.codinglabs.org/1.gif?' + args;
})();
The entire script is wrapped in an anonymous function to avoid polluting the global namespace. The logic was explained in the principles section and is not repeated here. 1.gif is the back-end script.
Log format
The log uses one record per line, with fields separated by the invisible character ^A (ASCII 0x01; on Linux it can be typed with ctrl-v ctrl-a; below, the visible string "^A" stands for this character). The format is:
Time^AIP^ADomain^AURL^APage title^AReferrer^AResolution height^AResolution width^AColor depth^ALanguage^AClient info^AUser ID^AWebsite ID
Back-end script
For simplicity and efficiency, I decided to use nginx's access_log for log collection. The problem is that nginx configuration by itself has limited expressive power, so I chose OpenResty. OpenResty is a high-performance application development platform built on nginx, with many useful modules integrated; at its core is the ngx_lua module, which embeds Lua so that business logic can be expressed in Lua inside the nginx configuration file. I won't introduce the platform at length here; see its official website http://openresty.org/, or a very nice slide deck introducing OpenResty by its author Zhang Yichun (agentzh): http://agentzh.org/misc/slides/ngx-openresty-ecosystem/. For ngx_lua, see https://github.com/chaoslawful/lua-nginx-module.
First, define the log format in the nginx configuration file:
log_format tick "$msec^A$remote_addr^A$u_domain^A$u_url^A$u_title^A$u_referrer^A$u_sh^A$u_sw^A$u_cd^A$u_lang^A$http_user_agent^A$u_utrace^A$u_account";
Note that the variables starting with u_ are ones we will define ourselves below; the rest are nginx built-in variables.
Then the two core locations:
location /1.gif {
    # masquerade as a gif file
    default_type image/gif;
    # disable access_log here; logging is done via a subrequest
    access_log off;

    access_by_lua "
        -- the user tracking cookie is named __utrace
        local uid = ngx.var.cookie___utrace
        if not uid then
            -- if absent, generate a tracking cookie: md5(timestamp + IP + user agent)
            uid = ngx.md5(ngx.now() .. ngx.var.remote_addr .. ngx.var.http_user_agent)
        end
        ngx.header['Set-Cookie'] = {'__utrace=' .. uid .. '; path=/'}
        if ngx.var.arg_domain then
            -- log via a subrequest to /i-log, passing along the args and the tracking cookie
            ngx.location.capture('/i-log?' .. ngx.var.args .. '&utrace=' .. uid)
        end
    ";

    # do not cache this response
    add_header Expires "Fri, 01 Jan 1980 00:00:00 GMT";
    add_header Pragma "no-cache";
    add_header Cache-Control "no-cache, max-age=0, must-revalidate";

    # return an empty 1×1 gif image
    empty_gif;
}

location /i-log {
    # internal location; not accessible from outside
    internal;

    # set variables; note these need to be unescaped
    set_unescape_uri $u_domain $arg_domain;
    set_unescape_uri $u_url $arg_url;
    set_unescape_uri $u_title $arg_title;
    set_unescape_uri $u_referrer $arg_referrer;
    set_unescape_uri $u_sh $arg_sh;
    set_unescape_uri $u_sw $arg_sw;
    set_unescape_uri $u_cd $arg_cd;
    set_unescape_uri $u_lang $arg_lang;
    set_unescape_uri $u_utrace $arg_utrace;
    set_unescape_uri $u_account $arg_account;

    # enable subrequest logging
    log_subrequest on;
    # log to ma.log in the tick format (in real deployments, add a buffer)
    access_log /path/to/logs/directory/ma.log tick;

    # output an empty string
    echo '';
}
Fully explaining every detail of this configuration is beyond the scope of this article, and it relies on several third-party nginx modules (all bundled with OpenResty). The key points are marked with comments; you do not need to understand every line, only to see that this configuration implements the back-end logic described in the principles section.
Log rotation
A log collection system handles a large volume of access logs, and over time the file grows rapidly, making a single file unwieldy. Logs are therefore usually split by time period, for example one log per day or per hour. To make the effect obvious, I split one log per hour here, using crontab to periodically invoke a shell script with the following contents:
_prefix="/path/to/nginx"
time=`date +%Y%m%d%H`
mv ${_prefix}/logs/ma.log ${_prefix}/logs/ma/ma-${time}.log
kill -USR1 `cat ${_prefix}/logs/nginx.pid`
This script moves ma.log into the target directory, renaming it ma-{yyyymmddhh}.log, and then sends nginx the USR1 signal so that it reopens its log file.
Then add a line to /etc/crontab:
59 * * * * root /path/to/directory/rotatelog.sh
This runs the rotation script at minute 59 of every hour.
Testing
Now we can test whether the system works. Yesterday I embedded the tracking code in my blog; an http packet capture shows that ma.js and 1.gif are requested correctly.
We can also look at the request parameters of 1.gif: the relevant information is indeed carried there.
Then I tailed the log file and refreshed the page; since no access log buffer was configured, a new record appeared immediately:
1351060731.360^A0.0.0.0^Awww.codinglabs.org^Ahttp://www.codinglabs.org/^ACodingLabs^A^A1024^A1280^A24^Azh-CN^AMozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/22.0.1229.94 Safari/537.4^A4d612be64366768d32e623d594e82678^AU-1-1
Note that the ^A in the real log is invisible; I have replaced it with the visible string ^A here for readability, and replaced the IP with 0.0.0.0 for privacy.
About analysis
The analysis and development above should give a good sense of how a website statistics log collection system works. With these logs, subsequent analysis becomes possible. This article focuses on log collection, so I will not say much about analysis.
Note that the raw log should preserve as much information as possible rather than being filtered or processed heavily. For example, MyAnalytics above keeps the millisecond timestamp instead of a formatted time; formatting is the job of downstream systems, not the log collection system. Downstream systems can derive a lot from the raw log: an IP database can locate the visitor's region, the user agent reveals the operating system and browser, and combined with more elaborate analysis models you can analyze traffic, referrers, visitors, regions, paths, and more. Of course, the raw log is generally not analyzed directly; it is first cleaned, formatted, and loaded into another store such as MySQL or HBase.
Much of the analysis work can use open-source infrastructure: Storm for real-time analysis, Hadoop for offline analysis. When the logs are small, simple analysis can also be done with shell commands; for example, the following three commands compute my blog's page views (PV), unique visitors (UV), and unique IPs (IP) between 8 and 9 a.m. today:
awk -F'^A' '{print $1}' ma-2012102409.log | wc -l
awk -F'^A' '{print $12}' ma-2012102409.log | sort | uniq | wc -l
awk -F'^A' '{print $2}' ma-2012102409.log | sort | uniq | wc -l

(The ^A here is the literal 0x01 character. Note that uniq only collapses adjacent duplicates, hence the sort before it.)