In my real case I have a set of time series, each associated with a different ID, stored in a single DataFrame.
Some consist of 400 samples, some of 1000, some of 2000.
They are all stored in the same df, and:
I would like to drop all the IDs made up of time series shorter than a custom length.
I wrote the following code, but I think it is very ugly and inefficient.
import pandas as pd
import numpy as np
dict={"samples":[1,2,3,4,5,6,7,8,9],"id":["a","b","c","b","b","b","c","c","c"]}
df=pd.DataFrame(dict)
df_id=pd.DataFrame()
for i in set(df.id):
    df_filtered=df[df.id==i]
    len_id=len(df_filtered.samples)
    if len_id>3: # 3 is just a random choice for this example
        df_id=df_id.append(df_filtered)
print(df_id)
Output:
samples id
2 3 c
6 7 c
7 8 c
8 9 c
1 2 b
3 4 b
4 5 b
5 6 b
How can I improve it in a more Pythonic way? Thanks
2 Answers
Good answer by Juho. Another option is a groupby-filter:
df.groupby('id').filter(lambda group: len(group) > 3)
# samples id
# 1 2 b
# 2 3 c
# 3 4 b
# 4 5 b
# 5 6 b
# 6 7 c
# 7 8 c
# 8 9 c
To match your output order exactly, add a descending id sort: .sort_values('id', ascending=False)
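For completeness, a sketch of the two calls chained together (using the same toy df as above):
# Filter out the small groups, then sort descending on id so the "c" rows
# come before the "b" rows, matching the order shown in the question.
df.groupby('id').filter(lambda group: len(group) > 3).sort_values('id', ascending=False)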
Comment (Juho, Apr 2, 2021): "This is neat as well."
There are many solutions. For example, you can use a groupby-transform and drop the "small" samples. The most appropriate solution depends on your exact requirements, e.g. will you do the preprocessing once and then drop different samples afterwards?
Anyway, consider:
import pandas as pd

df = pd.DataFrame({"samples": [1,2,3,4,5,6,7,8,9], "id": ["a","b","c","b","b","b","c","c","c"]})
# Attach each group's size to every row, then keep only rows belonging to large groups.
df["counts"] = df.groupby("id")["samples"].transform("count")
df[df["counts"] > 3]
# Or, if you want to get rid of the helper column afterwards:
df[df["counts"] > 3].drop(columns="counts")
By the way, avoid using dict as a variable name: it shadows the built-in dict type.
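For illustration, a minimal sketch of what goes wrong once the built-in name is shadowed (the dictionary contents here are just an example):
dict = {"samples": [1, 2], "id": ["a", "b"]}  # shadows the built-in dict type
dict(samples=[1, 2])  # TypeError: 'dict' object is not callable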