3. Introduction to the offline data analysis process

Note: This section is mainly meant to give a feel for the macro concepts and processing flow of a data analysis system, and an initial understanding of where Hadoop and other frameworks fit into it. Don't pay too much attention to code details.

A widely used data analysis system: "web log data mining"

3.1 Requirements analysis

3.1.1 Case name

"Website or APP clickstream log data mining system".

3.1.2 Case requirement description

Web clickstream logs contain information that is very important for website operation. Through log analysis, we can learn the number of visits to the website, which page has the most visitors, which page is the most valuable, the advertising conversion rate, visitor source information, visitor terminal information, and so on.
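As a small illustration of the kind of indicators listed above, the sketch below computes page views, unique visitors, the most visited page, and visitor sources from a handful of already-parsed records. It is only a sketch: the record fields (`ip`, `path`, `referer`), the sample values, and the `summarize` function are hypothetical and are not part of the case itself.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical, already-parsed click records (values invented for illustration).
records = [
    {"ip": "10.0.0.1", "path": "/index.html", "referer": "http://search.example.com/?q=a"},
    {"ip": "10.0.0.2", "path": "/index.html", "referer": "-"},
    {"ip": "10.0.0.1", "path": "/about.html", "referer": "http://search.example.com/?q=b"},
    {"ip": "10.0.0.3", "path": "/index.html", "referer": "http://ads.example.net/banner"},
]


def summarize(records):
    """Compute a few simple traffic indicators from parsed records."""
    hits_per_page = Counter(r["path"] for r in records)
    # "-" means the request carried no referrer, so count it as direct traffic.
    sources = Counter(
        urlparse(r["referer"]).netloc if r["referer"] != "-" else "(direct)"
        for r in records
    )
    return {
        "page_views": len(records),
        "unique_visitors": len({r["ip"] for r in records}),
        "top_page": hits_per_page.most_common(1)[0] if records else None,
        "sources": dict(sources),
    }


summary = summarize(records)
print(summary)
```

Counting distinct IP addresses is only a rough stand-in for unique visitors; a production system would more likely rely on a cookie or user identifier.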

3.1.3 Data source

The data in this case consists mainly of records of users' click behavior.

How to obtain: a JS program is embedded in the page in advance and binds events to the tags that are to be monitored. Whenever the user clicks on or moves over such a tag, an ajax request is sent to a back-end servlet program, which records the event information with log4j. This produces a continuously growing log file on the web server (nginx, Tomcat, etc.).
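Each record in such a log file can be split into named fields before any analysis. The following is a minimal sketch, assuming the log uses the common combined access-log layout that the sample record in section 3.1.3 appears to follow. The field names (`ip`, `referer`, `agent`, and so on) and the `parse_line` helper are labels chosen here for illustration, not names from the case.

```python
import re
from datetime import datetime

# Assumed layout: ip ident user [time] "request" status size "referer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" '
    r'"(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # The request field normally looks like "METHOD PATH PROTOCOL";
    # pad with None so malformed requests do not raise an error.
    parts = record["request"].split()
    record["method"], record["path"], record["protocol"] = (parts + [None] * 3)[:3]
    record["status"] = int(record["status"])
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    return record


sample = (
    '58.215.204.118 - - [18/Sep/2013:06:51:35 +0000] '
    '"GET /wp-includes/js/jquery/jquery.js?ver=1.10.2 HTTP/1.1" 304 0 '
    '"http://blog.fens.me/nodejs-socketio-chat/" '
    '"Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"'
)

record = parse_line(sample)
print(record["ip"], record["method"], record["path"], record["status"])
```

Lines that do not match the pattern return `None`, so a preprocessing step can drop or count them instead of failing.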

Form:

58.215.204.118 - - [18/Sep/2013:06:51:35 +0000] "GET /wp-includes/js/jquery/jquery.js?ver=1.10.2 HTTP/1.1" 304 0 "http://blog.fens.me/nodejs-socketio-chat/" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"

3.2 Data processing flow

3.2.1 Flow chart analysis

This case closely resembles a typical BI system. The overall process is as follows:

[Figure: overall process flow chart]

However, because this case is premised on handling massive amounts of data, the technologies used at each stage of the process are completely different from those of traditional BI. Subsequent courses will explain them one by one:

1) Data collection: a custom-developed collection program, or the open source framework Flume

2) Data preprocessing: custom-developed MapReduce programs running on the Hadoop cluster

3) Data warehouse technology: Hive, built on Hadoop

4) Data export: Sqoop, a data import and export tool based on Hadoop

5) Data visualization: custom-developed web programs, or products such as Kettle

6) Scheduling of the entire process: the Oozie tool in the Hadoop ecosystem, or other similar open source products

3.2.2 Project technical architecture diagram

3.2.3 Project related screenshots (for a general impression only)

a) MapReduce program running

b) Querying data in Hive

c) Importing statistical results into MySQL

./sqoop export --connect jdbc:mysql://localhost:3306/weblogdb --username root --password root --table t_display_xx --export-dir /user/hive/warehouse/uv/dt=2014-08-03
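To get a feel for what the preprocessing and statistics stages compute, the map/reduce idea can be imitated on a single machine. The sketch below is plain Python, not Hadoop code, and it is not the case's actual job: the input tuples, the function names, and the choice of a daily unique-visitor count are all assumptions made for illustration (the `uv` directory and `dt=` partition in the Sqoop command hint at such a statistic, but the source does not confirm it).

```python
from collections import defaultdict

# Hypothetical preprocessed records: (date, visitor ip). Values are invented.
cleaned = [
    ("2014-08-03", "10.0.0.1"),
    ("2014-08-03", "10.0.0.2"),
    ("2014-08-03", "10.0.0.1"),
    ("2014-08-04", "10.0.0.1"),
]


def map_phase(records):
    """Emit one (key, value) pair per record, keyed by date."""
    for date, ip in records:
        yield date, ip


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Count distinct visitors for each date."""
    return {date: len(set(ips)) for date, ips in groups.items()}


uv_by_day = reduce_phase(shuffle(map_phase(cleaned)))
print(uv_by_day)
```

On a real cluster the map and reduce functions run in parallel across many machines and the grouping step is handled by the framework; the single-process version only shows the data flow.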

3.3 Final effect of the project

After the complete data processing flow, reports on the various statistical indicators are output periodically. In production practice, the report data ultimately needs to be presented in visual form. This case uses a web program to implement the data visualization.

The effect is as follows:
