In my real case I have a set of time series, each associated with a different ID, stored in a single DataFrame.
Some consist of 400 samples, some of 1000, some of 2000.
They are all stored in the same df, and:
I would like to drop all the IDs made up of time series shorter than a custom length.
I wrote the following code, but I think it is very ugly and inefficient.
import pandas as pd
import numpy as np
dict={"samples":[1,2,3,4,5,6,7,8,9],"id":["a","b","c","b","b","b","c","c","c"]}
df=pd.DataFrame(dict)
df_id=pd.DataFrame()
for i in set(df.id):
    df_filtered=df[df.id==i]
    len_id=len(df_filtered.samples)
    if len_id>3: # 3 is just a random choice for this example
        df_id=df_id.append(df_filtered)
print(df_id)
Output:
samples id
2 3 c
6 7 c
7 8 c
8 9 c
1 2 b
3 4 b
4 5 b
5 6 b
How can I improve it in a more Pythonic way? Thanks
2 Answers
Good answer by Juho. Another option is a groupby-filter:
df.groupby('id').filter(lambda group: len(group) > 3)
# samples id
# 1 2 b
# 2 3 c
# 3 4 b
# 4 5 b
# 5 6 b
# 6 7 c
# 7 8 c
# 8 9 c
To match your output order exactly, add a descending id sort: .sort_values('id', ascending=False)
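For completeness, a sketch of the two calls chained together (using the same toy df as above):
# Filter out the small groups, then sort descending on id so the "c" rows
# come before the "b" rows, matching the order shown in the question.
df.groupby('id').filter(lambda group: len(group) > 3).sort_values('id', ascending=False)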
Comment (Juho, Apr 2, 2021): "This is neat as well."
There are many solutions. For example, you can use a groupby-transform and drop the "small" samples. The most appropriate solution depends on your exact requirements, e.g. will you do the preprocessing once and then drop different samples afterwards?
Anyway, consider:
import pandas as pd

df = pd.DataFrame({"samples": [1,2,3,4,5,6,7,8,9], "id": ["a","b","c","b","b","b","c","c","c"]})
# Attach each group's size to every row, then keep only rows belonging to large groups.
df["counts"] = df.groupby("id")["samples"].transform("count")
df[df["counts"] > 3]
# Or, if you want to get rid of the helper column afterwards:
df[df["counts"] > 3].drop(columns="counts")
By the way, avoid using dict as a variable name: it shadows the built-in dict type.
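For illustration, a minimal sketch of what goes wrong once the built-in name is shadowed (the dictionary contents here are just an example):
dict = {"samples": [1, 2], "id": ["a", "b"]}  # shadows the built-in dict type
dict(samples=[1, 2])  # TypeError: 'dict' object is not callable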