I'm working on getting the n most frequent items from a pandas DataFrame similar to:
+----+-----+-------+------+------+------+
| cod| name|sum_vol| date| lat| lon|
+----+-----+-------+------+------+------+
|aggc|23124| 37|201610|-15.42|-32.11|
|aggc|23124| 19|201611|-15.42|-32.11|
| abc| 231| 22|201610|-26.42|-43.11|
| abc| 231| 22|201611|-26.42|-43.11|
| ttx| 231| 10|201610|-22.42|-46.11|
| ttx| 231| 10|201611|-22.42|-46.11|
| tty| 231| 25|201610|-25.42|-42.11|
| tty| 231| 45|201611|-25.42|-42.11|
|xptx| 124| 62|201611|-26.43|-43.21|
|xptx| 124| 260|201610|-26.43|-43.21|
|xptx|23124| 50|201610|-26.43|-43.21|
|xptx|23124| 50|201611|-26.43|-43.21|
+----+-----+-------+------+------+------+
I'm able to do it using the following code:
import pandas as pd
df = pd.DataFrame({'cod':['aggc','abc'], 'name':[23124,23124],
'sum_vol':[37,19], 'date':[201610,201611],
'lat':[-15.42, -15.42], 'lon':[-32.11, -32.11]})
gg = df.groupby(['name','date']).cod.value_counts().to_frame()
gg = gg.rename(columns={'cod':'count_cod'}).reset_index()
df_top_freq = gg.groupby(['name', 'date']).head(5)
But this code is slow and very cumbersome. Is there a way to do it in a more flexible and straightforward way?
Using the `agg` function allows you to calculate the frequency for each group using the standard library function `len`. Sorting the result by the aggregated `code_count` column in descending order, then selecting the top n records with `head` and resetting the index, will produce the top n most frequent records:
import pandas as pd
data_values = [['aggc', 23124, 37, 201610, -15.42, -32.11],
               ['aggc', 23124, 19, 201611, -15.42, -32.11],
               ['abc', 231, 22, 201610, -26.42, -43.11],
               ['abc', 231, 22, 201611, -26.42, -43.11],
               ['ttx', 231, 10, 201610, -22.42, -46.11],
               ['ttx', 231, 10, 201611, -22.42, -46.11],
               ['tty', 231, 25, 201610, -25.42, -42.11],
               ['tty', 231, 45, 201611, -25.42, -42.11],
               ['xptx', 124, 62, 201611, -26.43, -43.21],
               ['xptx', 124, 260, 201610, -26.43, -43.21],
               ['xptx', 23124, 50, 201610, -26.43, -43.21],
               ['xptx', 23124, 50, 201611, -26.43, -43.21]]
data_cols = ['cod', 'name', 'sum_vol', 'date', 'lat', 'lon']
df = pd.DataFrame(data_values, columns=data_cols)
n = 5
df_top_freq = (df.groupby(['date', 'name'])['cod']
               .agg(code_count=len)
               .sort_values('code_count', ascending=False)
               .head(n)
               .reset_index())
The `df_top_freq` frame will look like this:
| | date | name | code_count |
|----+--------+--------+--------------|
| 0 | 201610 | 231 | 3 |
| 1 | 201611 | 231 | 3 |
| 2 | 201610 | 23124 | 2 |
| 3 | 201611 | 23124 | 2 |
| 4 | 201610 | 124 | 1 |
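Note that this gives the n most frequent records overall. If instead you want the top n codes *within each* (name, date) group, as your original `head(5)` suggests, the chain can stay compact: grouped `value_counts` already sorts descending within each group, so a per-group `head` suffices. A minimal sketch on toy data (smaller than your frame, and `n = 2` just so the trimming is visible):

```python
import pandas as pd

# Toy data in the shape of the question's frame; the extra columns
# (sum_vol, lat, lon) are omitted since they don't affect the counting.
df = pd.DataFrame({
    'cod':  ['aggc', 'xptx', 'abc', 'ttx', 'tty', 'xptx'],
    'name': [23124, 23124, 231, 231, 231, 124],
    'date': [201610] * 6,
})

n = 2
# value_counts sorts descending within each (name, date) group,
# so head(n) keeps the n most frequent codes per group.
counts = df.groupby(['name', 'date'])['cod'].value_counts()
top_per_group = counts.groupby(level=['name', 'date']).head(n)
```

`GroupBy.head` preserves the original index, so no `reset_index`/`rename` round-trip is needed until you want flat columns.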
You're using `groupby` twice unnecessarily. Instead, define a helper function to `apply` with. Also, `value_counts` sorts results by descending count by default, so calling `head` directly afterwards is perfect.
def top_value_count(x, n=5):
    return x.value_counts().head(n)

gb = df.groupby(['name', 'date']).cod
df_top_freq = gb.apply(top_value_count).reset_index()
df_top_freq = df_top_freq.rename(columns=dict(level_2='cod', cod='count_cod'))
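As a side note, on pandas 1.1.0 or later the grouped counting itself can collapse into a single `DataFrame.value_counts` call, which counts each combination of the given columns and returns the result already sorted by descending count. A minimal sketch on toy data:

```python
import pandas as pd

# Toy data; the real frame has more columns, which value_counts ignores
# once subset= restricts the combination being counted.
df = pd.DataFrame({
    'cod':  ['aggc', 'aggc', 'abc'],
    'name': [23124, 23124, 231],
    'date': [201610, 201610, 201610],
})

# Count each (name, date, cod) combination; sorted descending by default.
counts = df.value_counts(subset=['name', 'date', 'cod'])
top = counts.head(5).reset_index(name='count_cod')
```

This gives the overall top records, like the first answer; for per-group top n you would still need a second grouped `head`.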
Even though the `df` abbreviation is commonly understood to mean "dataframe", I'd advise you to post at least the imports with your code. Additional context will never hurt either. Unlike Stack Overflow, Code Review needs to look at concrete code in a real context. Please see Why is hypothetical example code off-topic for CR?