
How can I efficiently group and count occurrences of terms within Pandas DataFrames by ID and group without using loops?

DDD

Release: 2024-11-29 00:01:12


Grouping and Counting Values in Pandas DataFrames

In data analysis, it is often necessary to group data and count the occurrences of specific values or terms. This type of aggregation can be easily achieved using the groupby and size functions in Pandas.

Problem:

Suppose you have a DataFrame df with the following columns: id, group, and term. The goal is to count the number of occurrences of each unique term for each combination of id and group, without using loops.

Solution:

To achieve this, we can use the following steps:

Groupby: Group the DataFrame by the three columns using groupby(['id', 'group', 'term']). This partitions the data into groups based on the unique combinations of these columns.
Size: Apply the size function to the grouped data. size calculates the number of rows in each group, which effectively counts the number of occurrences of each unique term within each id-group pair.
Unstack: The result of size is a Series with a hierarchical index (MultiIndex) over id, group, and term. To create a table with rows representing id-group pairs and columns representing unique terms, call unstack(fill_value=0). This moves the innermost index level (term) into columns and fills combinations that never occur with 0.
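
To make the intermediate shapes concrete, the sketch below runs the steps one at a time and prints both the counted Series and the final table. The four sample rows are made up purely for illustration.

```python
import pandas as pd

# Hypothetical sample data, chosen only to illustrate the steps.
df = pd.DataFrame({
    'id':    [1, 1, 1, 2],
    'group': [1, 1, 2, 2],
    'term':  ['a', 'a', 'b', 'a'],
})

# Groupby + size: count rows for each (id, group, term) combination.
# The result is a Series indexed by a three-level MultiIndex.
counts = df.groupby(['id', 'group', 'term']).size()

# Unstack: move the innermost index level (term) into columns,
# writing 0 where an id-group pair never contains a term.
table = counts.unstack(fill_value=0)

print(counts)
print(table)
```

Printing counts before unstacking shows that only the combinations present in the data appear; the zeros exist only after unstack(fill_value=0).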

The resulting DataFrame has one row per id-group pair and one column per unique term; each cell holds the number of times that term occurs for that pair.

Example Code:

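import pandas as pd

# Sample data: one row per (id, group, term) observation
df = pd.DataFrame([
    (1, 1, 'term1'),
    (1, 2, 'term2'),
    (1, 1, 'term1'),
    (1, 1, 'term2'),
    (2, 2, 'term3'),
    (2, 3, 'term1'),
    (2, 2, 'term1')
], columns=['id', 'group', 'term'])

# Count rows per (id, group, term), then pivot term into columns
result = df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0)

print(result)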


Output:

term      term1  term2  term3
id group                     
1  1          2      1      0
   2          0      1      0
2  2          1      0      1
   3          1      0      0
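
As a cross-check, pandas also provides pd.crosstab, which builds the same kind of contingency table in a single call. The sketch below is an alternative and not the approach this article describes; it rebuilds the example DataFrame and compares the two results cell by cell.

```python
import pandas as pd

# Same rows as the example DataFrame above.
df = pd.DataFrame([
    (1, 1, 'term1'), (1, 2, 'term2'), (1, 1, 'term1'), (1, 1, 'term2'),
    (2, 2, 'term3'), (2, 3, 'term1'), (2, 2, 'term1'),
], columns=['id', 'group', 'term'])

# Approach from this article: groupby + size + unstack.
via_groupby = df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0)

# Alternative: cross-tabulate (id, group) pairs against term.
via_crosstab = pd.crosstab([df['id'], df['group']], df['term'])

# Compare the count matrices element by element.
same = (via_groupby.values == via_crosstab.values).all()
print(same)
```

Which of the two is faster depends on the data and the pandas version, so it is worth timing both on your own DataFrame before choosing.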


Performance:

The groupby and size operations are vectorized, but on large datasets they still take measurable time. The following snippet, meant to be run in IPython or Jupyter (%timeit is an IPython magic command), times the grouping and counting on a DataFrame with 1,000,000 rows:

import numpy as np
import pandas as pd

# 1,000,000 rows of random integer ids, groups, and terms
df = pd.DataFrame(dict(id=np.random.choice(100, 1000000),
                       group=np.random.choice(20, 1000000),
                       term=np.random.choice(10, 1000000)))

%timeit df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0)
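
Outside IPython, a rough equivalent can be sketched with the standard-library timeit module. The helper name count_terms, the fixed random seed, and the repeat count below are illustrative assumptions, and the measured time will vary with hardware and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

n = 1_000_000
rng = np.random.default_rng(0)  # fixed seed (an assumption) for reproducible data
df = pd.DataFrame({
    'id': rng.integers(0, 100, n),
    'group': rng.integers(0, 20, n),
    'term': rng.integers(0, 10, n),
})


def count_terms(frame):
    """Count term occurrences per id-group pair (hypothetical helper)."""
    return frame.groupby(['id', 'group', 'term']).size().unstack(fill_value=0)


# Run the operation three times and keep the best wall-clock time.
best = min(timeit.repeat(lambda: count_terms(df), number=1, repeat=3))
print(f"best of 3: {best:.3f} s")
```

Taking the minimum of several runs reduces the influence of background activity on the measurement.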

