In data analysis, it is often necessary to group data and count the occurrences of specific values or terms. In pandas, this kind of aggregation can be done with the groupby and size methods, followed by unstack to turn the counts into a table.
Problem:
Suppose you have a DataFrame df with the following columns: id, group, and term. The goal is to count the number of occurrences of each unique term for each combination of id and group, without using loops.
Solution:
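To achieve this, chain three pandas operations:
1. Group the rows by id, group, and term with df.groupby(['id', 'group', 'term']).
2. Call size() to count the rows in each group.
3. Call unstack(fill_value=0) to move the term level into columns, filling combinations that never occur with 0.
The resulting DataFrame has one row per (id, group) pair and one column per term, and each cell holds the count of that term.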
Example Code:
import pandas as pd

df = pd.DataFrame([
    (1, 1, 'term1'),
    (1, 2, 'term2'),
    (1, 1, 'term1'),
    (1, 1, 'term2'),
    (2, 2, 'term3'),
    (2, 3, 'term1'),
    (2, 2, 'term1'),
], columns=['id', 'group', 'term'])

# Count rows per (id, group, term), then pivot term into columns
result = df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0)
print(result)
Output:
term      term1  term2  term3
id group
1  1          2      1      0
   2          0      1      0
2  2          1      0      1
   3          1      0      0
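To make the chain easier to follow, the sketch below splits it into its intermediate results. It also cross-checks the table against pd.crosstab, which is a different pandas function that builds a frequency table directly. The variable names counts, wide, and check are illustrative and not part of the original solution.

```python
import pandas as pd

df = pd.DataFrame(
    [(1, 1, 'term1'), (1, 2, 'term2'), (1, 1, 'term1'), (1, 1, 'term2'),
     (2, 2, 'term3'), (2, 3, 'term1'), (2, 2, 'term1')],
    columns=['id', 'group', 'term'],
)

# Steps 1 and 2: a Series with one count per (id, group, term)
# combination that actually occurs in the data.
counts = df.groupby(['id', 'group', 'term']).size()

# Step 3: move the innermost index level (term) into columns;
# (id, group, term) combinations that never occur are filled with 0.
wide = counts.unstack(fill_value=0)

# Cross-check (an alternative, not the method used above):
# pd.crosstab computes the same frequency table in one call.
check = pd.crosstab([df['id'], df['group']], df['term'])

print(counts)
print(wide)
```

The long-format counts Series is often the more convenient shape for further filtering or merging, while the unstacked table is easier to read.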
Performance:
On large datasets, the grouping and counting can take noticeable time, so it is worth measuring. The following code times the operation on a DataFrame with 1,000,000 rows using the %timeit magic, which is available in IPython and Jupyter:
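import numpy as np
import pandas as pd

# 1,000,000 rows: 100 ids, 20 groups, 10 terms, drawn at random
df = pd.DataFrame(dict(
    id=np.random.choice(100, 1000000),
    group=np.random.choice(20, 1000000),
    term=np.random.choice(10, 1000000),
))

# %timeit is an IPython/Jupyter magic, not plain Python syntax
%timeit df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0)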
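Outside IPython, the same measurement can be sketched with the standard-library timeit module. The helper names make_frame and count_terms, the fixed seed, and the choice of three repeats are assumptions made for illustration. No timing figures are shown because they depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(n_rows, seed=0):
    """Build a random test frame (hypothetical helper; seeded for repeatability)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        'id': rng.integers(0, 100, n_rows),
        'group': rng.integers(0, 20, n_rows),
        'term': rng.integers(0, 10, n_rows),
    })


def count_terms(frame):
    """Count terms per (id, group) with the groupby/size/unstack chain."""
    return frame.groupby(['id', 'group', 'term']).size().unstack(fill_value=0)


big = make_frame(1_000_000)

# Run the operation three times, once per measurement, and keep the fastest.
best = min(timeit.repeat(lambda: count_terms(big), number=1, repeat=3))
print(f"best of 3 runs: {best:.3f} s")
```

Taking the minimum of several runs reduces the influence of unrelated background load on the measurement.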