
I'm working on getting the n most frequent items from a pandas DataFrame similar to this one:

+----+-----+-------+------+------+------+
| cod| name|sum_vol| date| lat| lon|
+----+-----+-------+------+------+------+
|aggc|23124| 37|201610|-15.42|-32.11|
|aggc|23124| 19|201611|-15.42|-32.11|
| abc| 231| 22|201610|-26.42|-43.11|
| abc| 231| 22|201611|-26.42|-43.11|
| ttx| 231| 10|201610|-22.42|-46.11|
| ttx| 231| 10|201611|-22.42|-46.11|
| tty| 231| 25|201610|-25.42|-42.11|
| tty| 231| 45|201611|-25.42|-42.11|
|xptx| 124| 62|201611|-26.43|-43.21|
|xptx| 124| 260|201610|-26.43|-43.21|
|xptx|23124| 50|201610|-26.43|-43.21|
|xptx|23124| 50|201611|-26.43|-43.21|
+----+-----+-------+------+------+------+

I'm able to do it using the following code:

import pandas as pd

# a small sample of the data shown above
df = pd.DataFrame({'cod': ['aggc', 'abc'], 'name': [23124, 23124],
                   'sum_vol': [37, 19], 'date': [201610, 201611],
                   'lat': [-15.42, -15.42], 'lon': [-32.11, -32.11]})

# count occurrences of each cod within every (name, date) group
gg = df.groupby(['name', 'date']).cod.value_counts().to_frame()
gg = gg.rename(columns={'cod': 'count_cod'}).reset_index()

# keep the 5 most frequent cod values per (name, date) group
df_top_freq = gg.groupby(['name', 'date']).head(5)

But this code is slow and very cumbersome. Is there a way to do it in a more flexible and straightforward way?

Tolani
asked Dec 8, 2016 at 14:02
  • While the pandas regulars will recognize the df abbreviation to be from DataFrame, I'd advise you to post at least the imports with your code. Additional context will never hurt either. Unlike Stack Overflow, Code Review needs to look at concrete code in a real context. Please see Why is hypothetical example code off-topic for CR? – Commented Dec 8, 2016 at 14:08

2 Answers


Using the agg function allows you to calculate the frequency for each group using the built-in function len.

Sorting the result by the aggregated code_count column in descending order, taking the top n records with head, and then resetting the index produces the n most frequent records:

import pandas as pd

data_values = [['aggc', 23124, 37, 201610, -15.42, -32.11],
               ['aggc', 23124, 19, 201611, -15.42, -32.11],
               ['abc', 231, 22, 201610, -26.42, -43.11],
               ['abc', 231, 22, 201611, -26.42, -43.11],
               ['ttx', 231, 10, 201610, -22.42, -46.11],
               ['ttx', 231, 10, 201611, -22.42, -46.11],
               ['tty', 231, 25, 201610, -25.42, -42.11],
               ['tty', 231, 45, 201611, -25.42, -42.11],
               ['xptx', 124, 62, 201611, -26.43, -43.21],
               ['xptx', 124, 260, 201610, -26.43, -43.21],
               ['xptx', 23124, 50, 201610, -26.43, -43.21],
               ['xptx', 23124, 50, 201611, -26.43, -43.21]]
data_cols = ['cod', 'name', 'sum_vol', 'date', 'lat', 'lon']
df = pd.DataFrame(data_values, columns=data_cols)

n = 5  # number of top records to keep

# Count the cod values in each (date, name) group, sort the counts in
# descending order and keep the n most frequent records.
# Keyword (named-aggregation) form; the dict-renaming form of agg is
# rejected by recent pandas versions.
df_top_freq = (df.groupby(['date', 'name'])['cod']
                 .agg(code_count=len)
                 .sort_values('code_count', ascending=False)
                 .head(n)
                 .reset_index())

The df_top_freq frame will then look like this:

 | | date | name | code_count |
 |----+--------+--------+--------------|
 | 0 | 201610 | 231 | 3 |
 | 1 | 201611 | 231 | 3 |
 | 2 | 201610 | 23124 | 2 |
 | 3 | 201611 | 23124 | 2 |
 | 4 | 201610 | 124 | 1 |
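
As a side note, a minimal sketch of an equivalent approach (assuming pandas 0.25 or later, reusing the df and n defined above): GroupBy.size counts the rows in each group directly, so the same code_count column can be produced without passing a callable to agg.

# Same per-group counts via size(); reuses df and n from the code above.
df_top_freq = (df.groupby(['date', 'name'])
                 .size()
                 .rename('code_count')
                 .sort_values(ascending=False)
                 .head(n)
                 .reset_index())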
answered Jan 4, 2017 at 21:10

You're using groupby twice unnecessarily. Instead, define a helper function to apply with.

Also, value_counts by default sorts results by descending count. So using head directly afterwards is perfect.

def top_value_count(x, n=5):
    # value_counts() already sorts in descending order, so head(n)
    # returns the n most frequent values in the group
    return x.value_counts().head(n)

gb = df.groupby(['name', 'date']).cod
df_top_freq = gb.apply(top_value_count).reset_index()
df_top_freq = df_top_freq.rename(columns=dict(level_2='cod', cod='count_cod'))
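
A small usage sketch (an assumption-based addition, not from the original answer): GroupBy.apply forwards extra keyword arguments to the helper, so the cutoff can be changed per call without editing top_value_count.

# n=3 is forwarded to top_value_count by apply; df_top3 is just an
# illustrative name for the result.
df_top3 = gb.apply(top_value_count, n=3).reset_index()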
answered Jan 6, 2017 at 0:08
