I'm working on getting the n most frequent items from a pandas DataFrame similar to:
+----+-----+-------+------+------+------+
| cod| name|sum_vol| date| lat| lon|
+----+-----+-------+------+------+------+
|aggc|23124| 37|201610|-15.42|-32.11|
|aggc|23124| 19|201611|-15.42|-32.11|
| abc| 231| 22|201610|-26.42|-43.11|
| abc| 231| 22|201611|-26.42|-43.11|
| ttx| 231| 10|201610|-22.42|-46.11|
| ttx| 231| 10|201611|-22.42|-46.11|
| tty| 231| 25|201610|-25.42|-42.11|
| tty| 231| 45|201611|-25.42|-42.11|
|xptx| 124| 62|201611|-26.43|-43.21|
|xptx| 124| 260|201610|-26.43|-43.21|
|xptx|23124| 50|201610|-26.43|-43.21|
|xptx|23124| 50|201611|-26.43|-43.21|
+----+-----+-------+------+------+------+
I'm able to do it using the following code:
import pandas as pd
df = pd.DataFrame({'cod':['aggc','abc'], 'name':[23124,23124],
'sum_vol':[37,19], 'date':[201610,201611],
'lat':[-15.42, -15.42], 'lon':[-32.11, -32.11]})
gg = df.groupby(['name','date']).cod.value_counts().to_frame()
gg = gg.rename(columns={'cod':'count_cod'}).reset_index()
df_top_freq = gg.groupby(['name', 'date']).head(5)
But this code is slow and very cumbersome. Is there a way to do it in a more flexible and straightforward way?
Using the `agg` function allows you to calculate the frequency for each group using the standard library function `len`. Sorting the result by the aggregated `code_count` column in descending order, then selecting the top n records with `head` and resetting the index, will produce the top n most frequent records:
import pandas as pd
data_values = [['aggc', 23124, 37, 201610, -15.42, -32.11],
               ['aggc', 23124, 19, 201611, -15.42, -32.11],
               ['abc', 231, 22, 201610, -26.42, -43.11],
               ['abc', 231, 22, 201611, -26.42, -43.11],
               ['ttx', 231, 10, 201610, -22.42, -46.11],
               ['ttx', 231, 10, 201611, -22.42, -46.11],
               ['tty', 231, 25, 201610, -25.42, -42.11],
               ['tty', 231, 45, 201611, -25.42, -42.11],
               ['xptx', 124, 62, 201611, -26.43, -43.21],
               ['xptx', 124, 260, 201610, -26.43, -43.21],
               ['xptx', 23124, 50, 201610, -26.43, -43.21],
               ['xptx', 23124, 50, 201611, -26.43, -43.21]]
data_cols = ['cod', 'name', 'sum_vol', 'date', 'lat', 'lon']
df = pd.DataFrame(data_values, columns=data_cols)
n = 5
df_top_freq = (df.groupby(['date', 'name'])['cod']
               .agg(code_count=len)
               .sort_values('code_count', ascending=False)
               .head(n)
               .reset_index())
The `df_top_freq` frame will look like this:
| | date | name | code_count |
|----+--------+--------+--------------|
| 0 | 201610 | 231 | 3 |
| 1 | 201611 | 231 | 3 |
| 2 | 201610 | 23124 | 2 |
| 3 | 201611 | 23124 | 2 |
| 4 | 201610 | 124 | 1 |
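Note that this gives the n most frequent records overall. If instead you want the top n codes *within each* (name, date) group, as your original `head(5)` suggests, the chain can stay compact: grouped `value_counts` already sorts descending within each group, so a per-group `head` suffices. A minimal sketch on toy data (smaller than your frame, and `n = 2` just so the trimming is visible):

```python
import pandas as pd

# Toy data in the shape of the question's frame; the extra columns
# (sum_vol, lat, lon) are omitted since they don't affect the counting.
df = pd.DataFrame({
    'cod':  ['aggc', 'xptx', 'abc', 'ttx', 'tty', 'xptx'],
    'name': [23124, 23124, 231, 231, 231, 124],
    'date': [201610] * 6,
})

n = 2
# value_counts sorts descending within each (name, date) group,
# so head(n) keeps the n most frequent codes per group.
counts = df.groupby(['name', 'date'])['cod'].value_counts()
top_per_group = counts.groupby(level=['name', 'date']).head(n)
```

`GroupBy.head` preserves the original index, so no `reset_index`/`rename` round-trip is needed until you want flat columns.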
You're using `groupby` twice unnecessarily. Instead, define a helper function to `apply` with. Also, `value_counts` sorts results by descending count by default, so calling `head` directly afterwards is perfect.
def top_value_count(x, n=5):
    return x.value_counts().head(n)

gb = df.groupby(['name', 'date']).cod
df_top_freq = gb.apply(top_value_count).reset_index()
df_top_freq = df_top_freq.rename(columns=dict(level_2='cod', cod='count_cod'))
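As a side note, on pandas 1.1.0 or later the grouped counting itself can collapse into a single `DataFrame.value_counts` call, which counts each combination of the given columns and returns the result already sorted by descending count. A minimal sketch on toy data:

```python
import pandas as pd

# Toy data; the real frame has more columns, which value_counts ignores
# once subset= restricts the combination being counted.
df = pd.DataFrame({
    'cod':  ['aggc', 'aggc', 'abc'],
    'name': [23124, 23124, 231],
    'date': [201610, 201610, 201610],
})

# Count each (name, date, cod) combination; sorted descending by default.
counts = df.value_counts(subset=['name', 'date', 'cod'])
top = counts.head(5).reset_index(name='count_cod')
```

This gives the overall top records, like the first answer; for per-group top n you would still need a second grouped `head`.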
Even though the `df` abbreviation is commonly understood to mean "dataframe", I'd advise you to post at least the imports with your code. Additional context will never hurt either. Unlike Stack Overflow, Code Review needs to look at concrete code in a real context. Please see Why is hypothetical example code off-topic for CR?