Suppose you have a data frame with multiple string columns. Each combination of the first two columns should have only one valid value in the third column. You need to clean the data consistently by grouping the data frame by the first two columns and selecting the most common value of the third column for each combination.
The following code demonstrates an attempt to achieve this:
import pandas as pd
from scipy import stats

source = pd.DataFrame({
    'Country': ['USA', 'USA', 'Russia', 'USA'],
    'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
    'Short name': ['NY', 'New', 'Spb', 'NY']})
source.groupby(['Country','City']).agg(lambda x: stats.mode(x['Short name'])[0])
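However, the last line of code fails with a KeyError: agg passes each column to the lambda as a Series, so x['Short name'] is treated as an index-label lookup rather than a column selection. How can you fix this issue?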
For Pandas versions 0.16 and later, use the following code:
source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)
This code selects the 'Short name' column first and then applies pd.Series.mode, which is available in Pandas 0.16 and later, to find the most common value in each group.
The Series.mode function also handles ties: when several values in a group are equally common, it returns all of them instead of raising an error.
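As a minimal sketch of the tie case, the example below uses a hypothetical data frame, source2 (not part of the original question), in which 'NY' and 'New' each occur twice for New-York. It first collects every mode per group, then shows one possible way to keep exactly one value per group by taking the first mode. The choice of the first mode as the tie-breaker is an assumption. Pick whatever rule suits your data.

```python
import pandas as pd

# Hypothetical data: the New-York group has a tie between 'NY' and 'New'.
source2 = pd.DataFrame({
    'Country': ['USA', 'USA', 'Russia', 'USA', 'USA'],
    'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York', 'New-York'],
    'Short name': ['NY', 'New', 'Spb', 'NY', 'New']})

grouped = source2.groupby(['Country', 'City'])['Short name']

# Series.mode returns every tied value, so collect them as a list per group.
all_modes = grouped.agg(lambda x: x.mode().tolist())

# Assumed tie-breaker: keep only the first mode (Series.mode returns sorted values).
single_mode = grouped.agg(lambda x: x.mode().iloc[0])

print(all_modes)
print(single_mode)
```

Taking .iloc[0] guarantees one scalar per group, which is usually what a cleaning step needs before the result is merged back into the original frame.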
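You could also use statistics.mode from the Python standard library, but it is a poor fit for ties: before Python 3.8 it raised a StatisticsError when the data had more than one mode, and from 3.8 onward it returns only the first mode encountered. Hence, it's not recommended here.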
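To illustrate the difference, here is a small sketch of statistics.mode on a tied list. The list tied is a made-up example, and the try/except is there only so the snippet runs the same way on older and newer Python versions.

```python
import statistics

# Made-up sample with two equally common values.
tied = ['NY', 'New', 'NY', 'New']

try:
    # Python 3.8+: returns the first mode encountered and hides the tie.
    result = statistics.mode(tied)
except statistics.StatisticsError:
    # Python < 3.8: a tie raises instead of returning a value.
    result = None

print(result)
```

Either way the tie itself is lost, whereas Series.mode reports both values.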