What are the methods for Dataframe query in pandas-Python Tutorial-php.cn

This time I will bring you the DataframeQuerymethods in pandas, and theNotesof Dataframe query in pandas. The following is a practical case, let’s take a look.

pandas provides us with a variety of slicing methods, but if you don't know much about these methods, it's often easy to get confused. The following examples illustrate these slicing methods.

Data introduction

First randomly generate a set of data:

In [5]: rnd_1 = [random.randrange(1,20) for x in xrange(1000)] ...: rnd_2 = [random.randrange(1,20) for x in xrange(1000)] ...: rnd_3 = [random.randrange(1,20) for x in xrange(1000)] ...: fecha = pd.date_range('2012-4-10', '2015-1-4') ...: ...: data = pd.DataFrame({'fecha':fecha, 'rnd_1': rnd_1, 'rnd_2': rnd_2, 'rnd_3': rnd_3}) In [6]: data.describe() Out[6]: rnd_1 rnd_2 rnd_3 count 1000.000000 1000.000000 1000.000000 mean 9.946000 9.825000 9.894000 std 5.553911 5.559432 5.423484 min 1.000000 1.000000 1.000000 25% 5.000000 5.000000 5.000000 50% 10.000000 10.000000 10.000000 75% 15.000000 15.000000 14.000000 max 19.000000 19.000000 19.000000

Copy after login

[]Slicing method

Use square brackets to slice DataFrame, somewhat similar topython's list slicing. Row selection or column selection or block selection can be achieved according to the index.

# 行选择 In [7]: data[1:5] Out[7]: fecha rnd_1 rnd_2 rnd_3 1 2012-04-11 1 16 3 2 2012-04-12 7 6 1 3 2012-04-13 2 16 7 4 2012-04-14 4 17 7 # 列选择 In [10]: data[['rnd_1', 'rnd_3']] Out[10]: rnd_1 rnd_3 0 8 12 1 1 3 2 7 1 3 2 7 4 4 7 5 12 8 6 2 12 7 9 8 8 13 17 9 4 7 10 14 14 11 19 16 12 2 12 13 15 18 14 13 18 15 13 11 16 17 7 17 14 10 18 9 6 19 11 15 20 16 13 21 18 9 22 1 18 23 4 3 24 6 11 25 2 13 26 7 17 27 11 8 28 3 12 29 4 2 .. ... ... 970 8 14 971 19 5 972 13 2 973 8 10 974 8 17 975 6 16 976 3 2 977 12 6 978 12 10 979 15 13 980 8 4 981 17 3 982 1 17 983 11 5 984 7 7 985 13 14 986 6 19 987 13 9 988 3 15 989 19 6 990 7 11 991 11 7 992 19 12 993 2 15 994 10 4 995 14 13 996 12 11 997 11 15 998 17 14 999 3 8 [1000 rows x 2 columns] # 区块选择 In [11]: data[:7][['rnd_1', 'rnd_2']] Out[11]: rnd_1 rnd_2 0 8 17 1 1 16 2 7 6 3 2 16 4 4 17 5 12 19 6 2 7

Copy after login

However, for multi-column selection, you cannot use the 1:5 method like when selecting rows.

In [12]: data[['rnd_1':'rnd_3']] File "", line 1 data[['rnd_1':'rnd_3']] ^ SyntaxError: invalid syntax

Copy after login

loc

loc allows you to select rows and columns based on index.

In [13]: data.loc[1:5] Out[13]: fecha rnd_1 rnd_2 rnd_3 1 2012-04-11 1 16 3 2 2012-04-12 7 6 1 3 2012-04-13 2 16 7 4 2012-04-14 4 17 7 5 2012-04-15 12 19 8

Copy after login

It should be noted here that the difference between loc and the first method is that it will also select the 5th row, while the first method will only select the 4th row.

data.loc[2:4, ['rnd_2', 'fecha']] Out[14]: rnd_2 fecha 2 6 2012-04-12 3 16 2012-04-13 4 17 2012-04-14

Copy after login

loc can select data between two specific dates. It should be noted that both dates must be in the index.

In [15]: data_fecha = data.set_index('fecha') ...: data_fecha.head() Out[15]: rnd_1 rnd_2 rnd_3 fecha 2012-04-10 8 17 12 2012-04-11 1 16 3 2012-04-12 7 6 1 2012-04-13 2 16 7 2012-04-14 4 17 7 In [16]: # 生成两个特定日期 ...: fecha_1 = dt.datetime(2013, 4, 14) ...: fecha_2 = dt.datetime(2013, 4, 18) ...: ...: # 生成切片数据 ...: data_fecha.loc[fecha_1: fecha_2] Out[16]: rnd_1 rnd_2 rnd_3 fecha 2013-04-14 17 10 5 2013-04-15 14 4 9 2013-04-16 1 2 18 2013-04-17 9 15 1 2013-04-18 16 7 17

Copy after login

Update:If there are no special needs, it is strongly recommended to use loc and use [] as little as possible, because loc is reprocessing the DataFrame. Chained indexing problems will be avoided during assignment operations. When using [], the compiler is likely to give a warning about SettingWithCopy.

For details, please refer to the official documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

iloc

If loc is selected according to the value of the index, then iloc is selected according to the position of the index. iloc doesn't care about the specific value of the index, it only cares about the position, so when using iloc, only numerical values can be used in square brackets.

# 行选择 In [17]: data_fecha[10: 15] Out[17]: rnd_1 rnd_2 rnd_3 fecha 2012-04-20 14 6 14 2012-04-21 19 14 16 2012-04-22 2 6 12 2012-04-23 15 8 18 2012-04-24 13 8 18 # 列选择 In [18]: data_fecha.iloc[:,[1,2]].head() Out[18]: rnd_2 rnd_3 fecha 2012-04-10 17 12 2012-04-11 16 3 2012-04-12 6 1 2012-04-13 16 7 2012-04-14 17 7 # 切片选择 In [19]: data_fecha.iloc[[1,12,34],[0,2]] Out[19]: rnd_1 rnd_3 fecha 2012-04-11 1 3 2012-04-22 2 12 2012-05-14 17 10

Copy after login

The usage of at is similar to loc, but it has faster access to data than loc, and can only access A single element, multiple elements cannot be accessed.

In [20]: timeit data_fecha.at[fecha_1,'rnd_1'] The slowest run took 3783.11 times longer than the fastest. This could mean that an intermediate result is being cached. 100000 loops, best of 3: 11.3 µs per loop In [21]: timeit data_fecha.loc[fecha_1,'rnd_1'] The slowest run took 121.24 times longer than the fastest. This could mean that an intermediate result is being cached. 10000 loops, best of 3: 192 µs per loop In [22]: data_fecha.at[fecha_1,'rnd_1'] Out[22]: 17

Copy after login

iat

iat is to iloc what at is to loc, a faster option based on index position Method, like at, can only access a single element.

In [23]: data_fecha.iat[1,0] Out[23]: 1 In [24]: timeit data_fecha.iat[1,0] The slowest run took 6.23 times longer than the fastest. This could mean that an intermediate result is being cached. 100000 loops, best of 3: 8.77 µs per loop In [25]: timeit data_fecha.iloc[1,0] 10000 loops, best of 3: 158 µs per loop

Copy after login

The methods mentioned above all require that the rank of the query is in the index, or the position does not exceed the length range, and ix allows you to get data that is not in the DataFrame index.

In [28]: date_1 = dt.datetime(2013, 1, 10, 8, 30) ...: date_2 = dt.datetime(2013, 1, 13, 4, 20) ...: ...: # 生成切片数据 ...: data_fecha.ix[date_1: date_2] Out[28]: rnd_1 rnd_2 rnd_3 fecha 2013-01-11 19 17 19 2013-01-12 10 9 17 2013-01-13 15 3 10

Copy after login

As shown in the above example, January 10, 2013 was not selected because this time point is regarded as 0:00, which is earlier than 8:30.

I believe you have mastered the method after reading the case in this article. For more exciting information, please pay attention to other related articles on the php Chinese website!