Subset pandas DataFrame based on two columns ignoring what order the match happens in

Question 1

I have two pandas DataFrames, and I want to subset df_all based on the values in to_keep. Unfortunately, this isn't a straightforward pd.merge() or df.join(), because I want to match on multiple columns and I don't care in which order the match happens.

I don't care if df_all['from'] matches in either to_keep['source'] OR to_keep['target'],
and then df_all['to'] matches in either to_keep['source'] OR to_keep['target'].

What I have below currently works, but it seems like a lot of work and hopefully this operation could be optimized.

import pandas as pd
import numpy as np
# create sample dataframe
df_all = pd.DataFrame({'from': ['a', 'a', 'b', 'a', 'b', 'c', 'd', 'd', 'd'], 
 'to': ['b', 'b', 'a', 'c', 'c', 'd', 'c', 'f', 'e'], 
 'time': np.random.randint(50, size=9),
 'category': np.random.randn(9)
 })
# create a key based on from & to
df_all['key'] = df_all['from'] + '-' + df_all['to']
df_all
   category from  time to  key
0  0.374312    a    38  b  a-b
1 -0.425700    a     0  b  a-b
2  0.928008    b    34  a  b-a
3 -0.160849    a    44  c  a-c
4  0.462712    b     4  c  b-c
5 -0.223074    c    33  d  c-d
6 -0.778988    d    47  c  d-c
7 -1.392306    d     0  f  d-f
8  0.910363    d    34  e  d-e
# create another sample dataframe
to_keep = pd.DataFrame({'source': ['a', 'a', 'b'], 
 'target': ['b', 'c', 'c'] 
 })
to_keep
  source target
0      a      b
1      a      c
2      b      c
# create a copy of to_keep
to_keep_flipped = to_keep.copy()
# flip source and target column names
to_keep_flipped.rename(columns={'source': 'target', 'target': 'source'}, inplace=True)
# extend to_keep with flipped version
to_keep_all = pd.concat([to_keep, to_keep_flipped], ignore_index=True)
to_keep_all
  source target
0      a      b
1      a      c
2      b      c
3      b      a
4      c      a
5      c      b
# create a key based on source & target
keys = to_keep_all['source'] + '-' + to_keep_all['target']
keys
0 a-b
1 a-c
2 b-c
3 b-a
4 c-a
5 c-b
dtype: object
df_all[df_all['key'].isin(keys)]
   category from  time to  key
0  0.374312    a    38  b  a-b
1 -0.425700    a     0  b  a-b
2  0.928008    b    34  a  b-a
3 -0.160849    a    44  c  a-c
4  0.462712    b     4  c  b-c

Answer 1

First, some comments on your question-asking: in your initial statement, you talk about matching 'col1' etc., but in your actual code, you have 'from', 'to', 'source', and 'target'. You talk about subsetting, but then you talk about pd.merge(), and those are completely different things. Your formatting is poor, so your column names do not line up with the actual columns when you show the output in your code. When you generate random sample data, you should set a seed so that people get the same data and can check whether their code is doing the same thing as yours. Your test data is poorly chosen. For instance, does the row in which 'from' matches one of the to_keep columns have to be the same row in which 'to' matches, or can they match different rows? Can they both match the same column (e.g. 'from' and 'to' both match 'target'), or do they have to match different columns? Neither your test cases nor your problem description are clear on those points; one has to go through your code to figure out what you mean.

Assuming that they have to match on the same row, and in different columns, this code should work:
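
To illustrate the point about seeding, here is a minimal sketch of how the sample data could be made reproducible. The helper name make_sample and the seed value 0 are my own choices for illustration, not something from the question, and it uses NumPy's Generator API rather than the np.random.randint / np.random.randn calls in the question.

```python
import numpy as np
import pandas as pd

def make_sample(seed=0):
    """Build the question's sample frame from a seeded generator (hypothetical helper)."""
    rng = np.random.default_rng(seed)  # same seed -> same random columns
    return pd.DataFrame({
        'from': ['a', 'a', 'b', 'a', 'b', 'c', 'd', 'd', 'd'],
        'to': ['b', 'b', 'a', 'c', 'c', 'd', 'c', 'f', 'e'],
        'time': rng.integers(50, size=9),      # integers in [0, 50)
        'category': rng.standard_normal(9),    # standard normal floats
    })

sample_1 = make_sample()
sample_2 = make_sample()
# Two calls with the same seed give identical frames
print(sample_1.equals(sample_2))
```

Anyone who runs this with the same seed and NumPy version should see the same 'time' and 'category' values, which makes it possible to compare outputs directly.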

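def check(row):
    # True where a to_keep row matches in the given order (from -> source, to -> target)
    forward = (to_keep['source'] == row['from']) & (to_keep['target'] == row['to'])
    # True where a to_keep row matches in the swapped order
    reverse = (to_keep['source'] == row['to']) & (to_keep['target'] == row['from'])
    return bool(forward.any() or reverse.any())

# iterrows() yields (index, row) pairs, so unpack and pass only the row
kept_df = df_all.loc[[check(row) for _, row in df_all.iterrows()]]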
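
If the row-by-row loop turns out to be slow on larger data, one possible vectorized alternative under the same assumption (same row, different columns) is to build an order-insensitive key on both frames and filter with isin. This is only a sketch: the helper name unordered_key is hypothetical, the frames below are cut-down copies of the question's data without the random columns, and, like the key in the question, it assumes the labels themselves never contain '-'.

```python
import pandas as pd

# Cut-down copies of the question's frames (random columns omitted)
df_all = pd.DataFrame({
    'from': ['a', 'a', 'b', 'a', 'b', 'c', 'd', 'd', 'd'],
    'to': ['b', 'b', 'a', 'c', 'c', 'd', 'c', 'f', 'e'],
})
to_keep = pd.DataFrame({'source': ['a', 'a', 'b'],
                        'target': ['b', 'c', 'c']})

def unordered_key(frame, col_a, col_b):
    """Return 'x-y' per row with the two labels sorted, so a-b and b-a give the same key."""
    a, b = frame[col_a], frame[col_b]
    in_order = a <= b               # rows where the pair is already sorted
    low = a.where(in_order, b)      # smaller label of each pair
    high = b.where(in_order, a)     # larger label of each pair
    return low + '-' + high

wanted = unordered_key(to_keep, 'source', 'target')
mask = unordered_key(df_all, 'from', 'to').isin(wanted)
kept = df_all[mask]
print(kept)
```

On the sample data this keeps rows 0 through 4, the same rows as the flipped-copy approach in the question, without having to concatenate a flipped copy of to_keep.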

Acccumulation · Answer 1 · 2018-06-15 21:59:13Z
