Match lists and get the value of one column based on values of other columns from a dataframe (optimization)
I have an input dataframe like the one below, where 'ID' is a unique identifier, 'Signals_in_Group' is a derived field containing the list of all unique 'Signal' values present in a 'Group', and 'Signals_Count' is another derived field whose values are the count of items in 'Signals_in_Group'.
groups_df
    ID   Timestamp Signal  Group Signals_in_Group  Signals_Count
1    5  1590662170      A      1        [A, B, C]              3
2    2  1590662169      B      1        [A, B, C]              3
3    6  1590662169      C      1        [A, B, C]              3
4    8  1590662171      D      2           [A, D]              2
5    7  1590662172      A      2           [A, D]              2
6   10  1590662185      B      3     [A, B, C, D]              4
7    9  1590662185      D      3     [A, B, C, D]              4
8    3  1590662188      C      3     [A, B, C, D]              4
9    1  1590662186      D      3     [A, B, C, D]              4
10  11  1590662189      A      3     [A, B, C, D]              4
11   4  1590662192      C      4           [C, D]              2
12  12  1590662192      D      4           [C, D]              2
13  13  1590662204      B      5           [B, C]              2
14  14  1590662204      C      5           [B, C]              2
15  15  1590662204      B      5           [B, C]              2
Below is another input, which is a list of lists:
clusters = [['A', 'B'], ['B', 'C'], ['A', 'D'], ['A', 'B', 'C'], ['B', 'C', 'D'], ['A', 'B', 'C', 'D'], ['C', 'D', 'E', 'F']]
I need to find, for each group, whether 'Signals_in_Group' contains any of the clusters. For each cluster matched in a group, find the first-occurring 'Signal' based on 'Timestamp'. If the 'Timestamp' is the same for more than one row, take the 'Signal' with the lowest 'ID'.
Example for Group 1: 'Signals_in_Group' ([A, B, C]) contains the clusters [A, B], [B, C] and [A, B, C]. For cluster [A, B] in Group 1, the rows with index 1 and 2 match; row 2 has the lowest 'Timestamp' among them, so its 'Signal' value 'B' becomes the output. For cluster [B, C], the rows with index 2 and 3 match; both have the same timestamp, so the lowest 'ID' among them is 2 and the corresponding 'Signal' value 'B' becomes the output. For cluster [A, B, C], rows 1, 2 and 3 match; rows 2 and 3 share the lowest timestamp, so again the lowest 'ID' is 2 and the output is 'B'. The same should be done for all groups.
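To state the rule as code, the selection for one matched group/cluster pair is essentially the following sketch (names are illustrative):

candidates = group_df[group_df['Signal'].isin(cluster)]
# Earliest 'Timestamp' wins; ties are broken by the lowest 'ID'
first_signal = candidates.sort_values(['Timestamp', 'ID']).iloc[0]['Signal']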
The output should look like below: each item in 'clusters' becomes a column name and each row is one 'Group'.
      A,B  B,C  A,D A,B,C B,C,D A,B,C,D C,D,E,F
Group
1       B    B  NaN     B   NaN     NaN     NaN
2     NaN  NaN    D   NaN   NaN     NaN     NaN
3       B    B    D     B     D       D     NaN
5     NaN    B  NaN   NaN   NaN     NaN     NaN
I achieved this with the code below, first iterating over groups and then, for each group, iterating over clusters. However, it takes too long to complete, so I'm looking for a more Pythonic and optimized solution to make it faster. I tested it on 763k rows with 52k groups and 200 clusters; it took around 4 hours.
Any suggestion to improve the runtime would be appreciated. Thanks.
cls = [','.join(ele) for ele in clusters]
cls.insert(0, 'Group')
result = pd.DataFrame(columns=cls)
result.set_index('Group', inplace=True)
groups = groups_df['Group'].unique()
for group in groups:
    # Get all records belonging to the group
    group_df = groups_df.groupby(['Group']).get_group(group)
    # Remove clusters containing more items than 'Signals_in_Group'
    clusters_fil = [x for x in clusters if len(x) <= group_df['Signals_Count'].iloc[0]]
    for cluster in clusters_fil:
        if all(elem in group_df['Signals_in_Group'].iloc[0] for elem in cluster):
            cluster_df = group_df[group_df['Signal'].isin(cluster)]
            inter = cluster_df.loc[cluster_df['Timestamp'] == cluster_df['Timestamp'].min()]
            result.loc[group, ','.join(cluster)] = inter.loc[inter.ID == inter.ID.min(), 'Signal'].iat[0]
1 Answer
'Signals_Count' is also a derived field whose values are count of items in 'Signals_in_Group'
This is not necessary to perform your operation, so you should drop it.
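For instance, a one-line sketch, assuming the question's groups_df name:

# Signals_Count is derivable from Signals_in_Group, so drop it up front
groups_df = groups_df.drop(columns=['Signals_Count'])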
'Timestamp' should be coerced to a pandas.Timestamp; run-time performance will be the same, but printing and debugging will be better.
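That coercion is a single call (the same line appears in load_data() below):

df['Timestamp'] = pd.to_datetime(df['Timestamp'], unit='s', origin='unix')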
I think you did a good job coming up with a logic explanation and sample data that illustrate your edge cases.
I'm looking for a more Pythonic and optimized solution
The code being Pythonic here is kind of secondary; your primary concerns should be good API usage and vectorisation. Indeed, you have four loops here when you should have zero.
There are a few ways to go about this, and they won't all perform the same way. The way I demonstrate treats the set membership test as an inner join followed by a count comparison. There is another way I can think of where, instead of an inner join, you perform a right outer join followed by a grouped .all(), which will evaluate to false if there are any nulls. I think this second way would have more rows but fewer columns; to know which one better suits your full-scale data you would need to run large benchmarks (I have not done that).
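For reference, that alternative could look roughly like the sketch below. This is an unbenchmarked assumption on my part: it reuses the groups and clusters_long frames built in the code that follows, and the cross join needs pandas 1.2+.

# Sketch only: every (cluster member, Group) combination, left-joined onto
# the actual group signals; a cluster is contained when all members matched.
all_pairs = (
    clusters_long.reset_index()[['cluster_id', 'Signal']]
    .merge(groups.reset_index()[['Group']].drop_duplicates(), how='cross')
    .merge(groups.reset_index(), on=['Group', 'Signal'], how='left')
)
contained = (
    all_pairs.assign(found=all_pairs['ID'].notna())
    .groupby(['Group', 'cluster_id'])['found']
    .all()  # False as soon as any cluster member is missing from the group
)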
Aside from some column order differences, the following is equivalent:
import io
import typing
import pandas as pd
def load_data() -> pd.DataFrame:
    with io.StringIO(
        '''
, ID, Timestamp, Signal, Group, Signals_Count
1, 5, 1590662170, A, 1, 3
2, 2, 1590662169, B, 1, 3
3, 6, 1590662169, C, 1, 3
4, 8, 1590662171, D, 2, 2
5, 7, 1590662172, A, 2, 2
6, 10, 1590662185, B, 3, 4
7, 9, 1590662185, D, 3, 4
8, 3, 1590662188, C, 3, 4
9, 1, 1590662186, D, 3, 4
10, 11, 1590662189, A, 3, 4
11, 4, 1590662192, C, 4, 2
12, 12, 1590662192, D, 4, 2
13, 13, 1590662204, B, 5, 2
14, 14, 1590662204, C, 5, 2
15, 15, 1590662204, B, 5, 2
''') as f:
        df = pd.read_csv(f, skipinitialspace=True, index_col=0)
    df['Timestamp'] = pd.to_datetime(df['Timestamp'], unit='s', origin='unix')
    df['Signals_in_Group'] = [
        ['A', 'B', 'C'],
        ['A', 'B', 'C'],
        ['A', 'B', 'C'],
        ['A', 'D'],
        ['A', 'D'],
        ['A', 'B', 'C', 'D'],
        ['A', 'B', 'C', 'D'],
        ['A', 'B', 'C', 'D'],
        ['A', 'B', 'C', 'D'],
        ['A', 'B', 'C', 'D'],
        ['C', 'D'],
        ['C', 'D'],
        ['B', 'C'],
        ['B', 'C'],
        ['B', 'C'],
    ]
    return df
def process(
    group_df: pd.DataFrame, clusters: typing.Sequence[typing.Sequence[str]],
) -> pd.DataFrame:
    """
    For each cluster (output column) and Group (output row):
    output the signal having the earliest timestamp (if Signals_in_Group
    contains the cluster), or NaN otherwise.
    """
    # Eliminate some signal duplicates and unneeded columns
    groups = (
        group_df[['Group', 'Timestamp', 'ID', 'Signal']]
        # Timestamp, then ID, are tiebreakers
        .sort_values(['Group', 'Signal', 'Timestamp', 'ID'])
        .groupby(['Group', 'Signal'])
        .first()
    )

    # Arbitrary numeric cluster ID and cluster lists
    clusters_short = pd.DataFrame(
        index=pd.RangeIndex(name='cluster_id', stop=len(clusters)),
        data={'Signal': clusters, 'Cluster': clusters},
    )
    # Add length for set membership logic. str.len() also works on lists.
    clusters_short['cluster_count'] = clusters_short['Cluster'].str.len()
    # Break out per signal
    clusters_long = clusters_short[['Signal', 'cluster_count']].explode('Signal')

    merged = (
        # One row per matching group signal-cluster signal
        pd.merge(
            left=groups.reset_index(), right=clusters_long.reset_index(),
            on='Signal', sort=True,
        )
        # We need to sort again because we're about to take another first(), but
        # this time it won't be per signal; it will be per Group & cluster
        .sort_values(['Group', 'cluster_id', 'Timestamp', 'ID'])
        # Group by 'Group' and cluster, to take the earliest signal
        .groupby(['Group', 'cluster_id'])
        # Signal/first is the output; Signal/count and cluster_count are used in a filter
        .agg({'Signal': ['first', 'count'], 'cluster_count': 'first'})
    )

    # Filter on clusters that are fully contained in the group
    filtered = merged[
        merged['Signal', 'count'] == merged['cluster_count', 'first']
    ]['Signal', 'first'].unstack('cluster_id')

    # Re-label columns with comma-joined cluster names
    filtered.columns = clusters_short.loc[
        filtered.columns, 'Cluster',
    ].str.join(',')
    return filtered
def main() -> None:
    df = load_data()
    clusters = [['A', 'B'], ['B', 'C'], ['A', 'D'], ['A', 'B', 'C'], ['B', 'C', 'D'],
                ['A', 'B', 'C', 'D'], ['C', 'D', 'E', 'F']]
    merged = process(df, clusters)
    print(merged)


if __name__ == '__main__':
    main()
Cluster  A,B  B,C A,B,C  A,D B,C,D A,B,C,D
Group
1          B    B     B  NaN   NaN     NaN
2        NaN  NaN   NaN    D   NaN     NaN
3          B    B     B    D     D       D
5        NaN    B   NaN  NaN   NaN     NaN
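If you need the question's exact column order (including the all-NaN 'C,D,E,F' column), a final reindex, sketched here, should restore it:

# Missing clusters become all-NaN columns, matching the expected output
merged = merged.reindex(columns=[','.join(c) for c in clusters])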