Python data analysis with pandas: counting requests by real IP

WBOY

Dec 05, 2016 pm 01:27 PM

pandas python Tutorial data analysis

Foreword

pandas is a data analysis package built on NumPy that provides higher-level data structures and tools. Just as NumPy centers on the ndarray, pandas centers on two core data structures: Series, a one-dimensional labeled sequence, and DataFrame, a two-dimensional table. The conventional imports are:

from pandas import Series,DataFrame
import pandas as pd
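
As a quick illustration of the two structures, the sketch below builds a Series and a DataFrame by hand. The IPs and counts are made-up placeholder values, not data from this article.

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
counts = pd.Series([3, 1, 2],
                   index=['10.0.0.1', '10.0.0.2', '10.0.0.3'],
                   name='count')

# A DataFrame is a two-dimensional table whose columns are Series.
table = pd.DataFrame({
    'real_ip': ['10.0.0.1', '10.0.0.2', '10.0.0.3'],
    'count': [3, 1, 2],
})

print(counts['10.0.0.1'])    # label-based lookup on the Series
print(table.shape)           # (rows, columns) of the DataFrame
print(table['count'].sum())  # a DataFrame column is itself a Series
```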

1.1. Pandas analysis steps

1. Load log data

2. Load area_ip data

3. Count the requests per real_ip and look up the address of each IP. The equivalent SQL looks roughly like this:

SELECT inet_aton(l.real_ip),
  count(*),
  a.addr
FROM log AS l
INNER JOIN area_ip AS a
  ON a.start_ip_num <= inet_aton(l.real_ip)
  AND a.end_ip_num >= inet_aton(l.real_ip)
GROUP BY real_ip
ORDER BY count(*) DESC
LIMIT 0, 100;
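
A rough pandas counterpart of this query can be sketched as follows: count the rows per real_ip, convert each IP to an integer, then attach the address range it falls into. The sample rows, the helper name `ip_to_num`, and the use of `pd.merge_asof` for the range match are assumptions made for illustration; the script in section 1.2 filters the range table once per IP instead. Column names follow the SQL above.

```python
import socket
import struct

import pandas as pd


def ip_to_num(ip):
    """Rough equivalent of MySQL inet_aton(): dotted quad -> unsigned int."""
    return struct.unpack('!I', socket.inet_aton(ip))[0]


# Made-up sample data standing in for the log and area_ip tables.
log = pd.DataFrame({'real_ip': ['1.0.0.5', '1.0.0.5', '1.0.1.9',
                                '1.0.0.5', '1.0.1.9', '9.9.9.9']})
area_ip = pd.DataFrame({
    'start_ip_num': [ip_to_num('1.0.0.0'), ip_to_num('1.0.1.0')],
    'end_ip_num': [ip_to_num('1.0.0.255'), ip_to_num('1.0.1.255')],
    'addr': ['region A', 'region B'],
})

# GROUP BY real_ip with count(*), keeping the 100 highest counts.
counts = (log.groupby('real_ip').size()
             .reset_index(name='count')
             .nlargest(100, 'count'))
counts['ip_num'] = counts['real_ip'].map(ip_to_num)

# Range join: pick the last range that starts at or before ip_num,
# then keep only rows where that range also ends at or after ip_num.
joined = pd.merge_asof(counts.sort_values('ip_num'),
                       area_ip.sort_values('start_ip_num'),
                       left_on='ip_num', right_on='start_ip_num')
joined = joined[joined['ip_num'] <= joined['end_ip_num']]

result = joined.sort_values('count', ascending=False)[['real_ip', 'count', 'addr']]
print(result)
```

Like the INNER JOIN, this drops IPs that fall outside every range (9.9.9.9 in the sample data). The approach assumes the ranges do not overlap.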

1.2. Code

cat pd_ng_log_stat.py
#!/usr/bin/env python
#-*- coding: utf-8 -*-
 
from ng_line_parser import NgLineParser
 
import pandas as pd
import socket
import struct
 
class PDNgLogStat(object):
 
  def __init__(self):
    self.ng_line_parser = NgLineParser()
 
  def _log_line_iter(self, pathes):
    """Parse every line of each file and yield one dict per line."""
    for path in pathes:
      with open(path, 'r') as f:
        for line in f:
          self.ng_line_parser.parse(line)
          yield self.ng_line_parser.to_dict()
 
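  def _ip2num(self, ip):
    """Convert a dotted-quad IP string to an integer (-1 if it cannot be parsed)."""
    try:
      # "!I" unpacks the 4 packed bytes as a network-order unsigned int
      return struct.unpack("!I", socket.inet_aton(str(ip)))[0]
    except Exception:
      # Not a valid IPv4 string, e.g. the "-" placeholder
      return -1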
 
  def _get_addr_by_ip(self, ip):
    """Look up the address for the given IP (None if no range matches)."""
    ip_num = self._ip2num(ip)
 
    # Keep the ranges with ip_start_num <= ip_num <= ip_end_num
    addr_df = self.ip_addr_df[(self.ip_addr_df.ip_start_num <= ip_num) &
                   (ip_num <= self.ip_addr_df.ip_end_num)]
    if addr_df.empty:
      return None
    return addr_df['addr'].iloc[0]
           
  def load_data(self, path):
    """Load the log files at the given paths into a DataFrame."""
    self.df = pd.DataFrame(self._log_line_iter(path))
 
 
  def uv_real_ip(self, top = 100):
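    """Count requests per real IP and return the `top` most frequent."""
    group_by_cols = ['real_ip'] # column to group by; only this column is counted and shown
     
    # Count the occurrences directly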
    url_req_grp = self.df[group_by_cols].groupby(
                   self.df['real_ip'])
    return url_req_grp.agg(['count'])['real_ip'].nlargest(top, 'count')
     
  def uv_real_ip_addr(self, top = 100):
    """Count requests per real IP and attach each IP's address."""
    cnt_df = self.uv_real_ip(top)
 
    # Add the IP address column
    cnt_df.insert(len(cnt_df.columns),
           'addr',
           cnt_df.index.map(self._get_addr_by_ip))
    return cnt_df
     
  def load_ip_addr(self, path):
    """Load the table that maps IP ranges to addresses."""
    cols = ['id', 'ip_start_num', 'ip_end_num',
        'ip_start', 'ip_end', 'addr', 'operator']
    self.ip_addr_df = pd.read_csv(path, sep='\t', names=cols, index_col='id')
    return self.ip_addr_df
 
def main():
  file_pathes = ['www.ttmark.com.access.log']
 
  pd_ng_log_stat = PDNgLogStat()
  pd_ng_log_stat.load_data(file_pathes)
 
  # Load the IP address table
  area_ip_path = 'area_ip.csv'
  pd_ng_log_stat.load_ip_addr(area_ip_path)
 
  # Count requests per real user IP and look up each address
  print(pd_ng_log_stat.uv_real_ip_addr())
 
if __name__ == '__main__':
  main()
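
To sanity-check the IP conversion in isolation, the snippet below is a standalone equivalent of `_ip2num`. The function name `ip2num` and the sample inputs are assumptions for illustration. It shows why a non-IP value such as `-` maps to -1 and therefore never falls inside an address range.

```python
import socket
import struct


def ip2num(ip):
    """Dotted-quad IPv4 string -> unsigned integer, or -1 if it cannot be parsed."""
    try:
        # "!I" reads the 4 packed bytes as a big-endian (network order) unsigned int.
        return struct.unpack('!I', socket.inet_aton(str(ip)))[0]
    except (socket.error, struct.error):
        return -1


print(ip2num('0.0.0.1'))  # 1
print(ip2num('1.0.0.0'))  # 16777216
print(ip2num('-'))        # -1
```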

Running the script and its output

python pd_ng_log_stat.py
 
         count  addr
real_ip            
60.191.123.80  101013 浙江省杭州市
-        32691  None
218.30.118.79  22523   北京市
......
136.243.152.18   889   德国
157.55.39.219   889   美国
66.249.65.170   888   美国
 
[100 rows x 2 columns]
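
The `-` row has no address because `-` is not a valid IP; it is presumably the placeholder logged when no real IP was available. If such rows should not appear in the ranking, they can be dropped before counting. The following is a minimal sketch on made-up rows, using `value_counts` rather than the `groupby` call from the script:

```python
import pandas as pd

# Made-up rows standing in for the parsed log DataFrame.
df = pd.DataFrame({'real_ip': ['1.0.0.5', '-', '1.0.0.5', '-', '-', '1.0.1.9']})

# Drop placeholder rows, then count requests per IP (largest first).
valid = df[df['real_ip'] != '-']
top = valid['real_ip'].value_counts().head(100)
print(top)
```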

Summary

That concludes this article. I hope it is helpful for your study or work. If you have any questions, feel free to leave a comment.
