3
\$\begingroup\$

I have two Pandas DataFrames and I want to subset df_all based on the values within to_keep. Unfortunately this isn't a straightforward pd.merge() or df.join(), because I want to match on multiple columns and I don't care in which order the match happens.

  • I don't care if df_all['from'] matches in either to_keep['source'] OR to_keep['target']
  • And then df_all['to'] matches in either to_keep['source'] OR to_keep['target'].

What I have below currently works, but it seems like a lot of work and hopefully this operation could be optimized.


import pandas as pd
import numpy as np
# create sample dataframe
df_all = pd.DataFrame({'from': ['a', 'a', 'b', 'a', 'b', 'c', 'd', 'd', 'd'], 
 'to': ['b', 'b', 'a', 'c', 'c', 'd', 'c', 'f', 'e'], 
 'time': np.random.randint(50, size=9),
 'category': np.random.randn(9)
 })
# create a key based on from & to
df_all['key'] = df_all['from'] + '-' + df_all['to']
df_all
 category from time to key
0 0.374312 a 38 b a-b
1 -0.425700 a 0 b a-b
2 0.928008 b 34 a b-a
3 -0.160849 a 44 c a-c
4 0.462712 b 4 c b-c
5 -0.223074 c 33 d c-d
6 -0.778988 d 47 c d-c
7 -1.392306 d 0 f d-f
8 0.910363 d 34 e d-e
# create another sample dataframe
to_keep = pd.DataFrame({'source': ['a', 'a', 'b'], 
 'target': ['b', 'c', 'c'] 
 })
to_keep
 source target
0 a b
1 a c
2 b c
# create a copy of to_keep
to_keep_flipped = to_keep.copy()
# flip source and target column names
to_keep_flipped.rename(columns={'source': 'target', 'target': 'source'}, inplace=True)
# extend to_keep with flipped version
to_keep_all = pd.concat([to_keep, to_keep_flipped], ignore_index=True)
to_keep_all
 source target
0 a b
1 a c
2 b c
3 b a
4 c a
5 c b
# create a key based on source & target
keys = to_keep_all['source'] + '-' + to_keep_all['target']
keys
0 a-b
1 a-c
2 b-c
3 b-a
4 c-a
5 c-b
dtype: object
df_all[df_all['key'].isin(keys)]
 category from time to key
0 0.374312 a 38 b a-b
1 -0.425700 a 0 b a-b
2 0.928008 b 34 a b-a
3 -0.160849 a 44 c a-c
4 0.462712 b 4 c b-c
asked Jun 15, 2018 at 18:57
\$\endgroup\$

1 Answer

1
\$\begingroup\$

First, some comments on your question-asking: in your initial statement, you talk about matching 'col1' etc., but in your actual code, you have 'from', 'to', 'source', and 'target'. You talk about subsetting, but then you talk about pd.merge(), and those are completely different things. Your formatting is poor, resulting in your column names not lining up with the actual columns when you show the output in your code. When you generate random sample data, you should set a seed so that people will get the same data, and can check whether their code is doing the same thing as yours. Your test data is poorly chosen. For instance, does the row of to_keep that 'from' matches have to be the same row that 'to' matches, or can they match different rows? Can they both match the same column (e.g. 'from' and 'to' both match 'target'), or do they have to match different columns? Neither your test cases nor your problem description are clear on those points; one has to go through your code to figure out what you mean.
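On the point about seeding, here is a minimal sketch of how the sample data could be made reproducible. It reuses the same np.random calls as the question; the helper name make_sample and the seed value 0 are just assumptions for illustration.

```python
import numpy as np
import pandas as pd


def make_sample(seed=0):
    """Build the question's sample frame with a fixed seed (hypothetical helper)."""
    # Seeding the global NumPy generator makes the random columns repeatable.
    np.random.seed(seed)
    return pd.DataFrame({
        'from': ['a', 'a', 'b', 'a', 'b', 'c', 'd', 'd', 'd'],
        'to': ['b', 'b', 'a', 'c', 'c', 'd', 'c', 'f', 'e'],
        'time': np.random.randint(50, size=9),
        'category': np.random.randn(9),
    })


first = make_sample()
second = make_sample()
# Same seed, same data: anyone running this gets identical frames.
print(first.equals(second))
```

With the seed fixed, readers can compare their output against yours row by row.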

Assuming that 'from' and 'to' have to match on the same row of to_keep, and in different columns, this code should work:

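def check(row):
    # forward: (from, to) equals (source, target) on the same row of to_keep
    forward = (to_keep['source'] == row['from']) & (to_keep['target'] == row['to'])
    # reverse: (from, to) equals (target, source) on the same row of to_keep
    reverse = (to_keep['source'] == row['to']) & (to_keep['target'] == row['from'])
    return forward.any() or reverse.any()

# iterrows() yields (index, row) pairs, so unpack and pass only the row
kept_df = df_all.loc[[check(row) for _, row in df_all.iterrows()]]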
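If the row-by-row comparison above turns out to be slow on larger frames, one possible alternative is to turn each pair into an order-insensitive key and test membership in a set. This is only a sketch under the same assumption (same row of to_keep, different columns); the name keep_pairs is made up for illustration, and the 'time' values are fixed placeholders rather than the question's random data.

```python
import pandas as pd

# Simplified copy of the question's data (fixed 'time' values, no random column).
df_all = pd.DataFrame({
    'from': ['a', 'a', 'b', 'a', 'b', 'c', 'd', 'd', 'd'],
    'to': ['b', 'b', 'a', 'c', 'c', 'd', 'c', 'f', 'e'],
    'time': [38, 0, 34, 44, 4, 33, 47, 0, 34],
})
to_keep = pd.DataFrame({
    'source': ['a', 'a', 'b'],
    'target': ['b', 'c', 'c'],
})

# A frozenset ignores order, so ('a', 'b') and ('b', 'a') give the same key.
keep_pairs = {frozenset(pair) for pair in zip(to_keep['source'], to_keep['target'])}

# One set lookup per row of df_all instead of a full comparison against to_keep.
mask = [frozenset(pair) in keep_pairs for pair in zip(df_all['from'], df_all['to'])]
kept_df = df_all.loc[mask]
print(kept_df)
```

This keeps rows 0 through 4 of the sample data, the same rows the question's key-based approach selects, without building the flipped copy of to_keep.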
answered Jun 15, 2018 at 21:59
\$\endgroup\$
