I have two different pandas Series and have written a nested for loop that checks whether the values of the first Series are in the other Series. But this is time consuming, and I cannot work out how to change it to a pandas method. I thought of using the apply function, but it did not work with method chaining. My original nested for loops look like this, and they work:
for x in df_one['ser_one']:
    print(x)
    for y in df_two['ser_two']:
        if 'MBTS' not in y and x in y:
            if 'L' in y:
                print(y)
Is there a way to make this less time consuming?
Here is what I attempted using apply:
df_two['ser_two'].apply(lambda x: x if 'MBTS' not in df_one['ser_one'].apply(lambda y:y) and x in df_one['ser_one'].apply(lambda y:y))
Example input:
df_one.head()
Out[136]:
type ser_one
0 MBTS VUMX1234
1 MBTS VUMX6436
2 MBTS VUMX5745
3 MBTS VUMX5802
4 MBTS VUMX8091
df_two.head()
Out[137]:
ser_two
0 VUMX8091
1 VUMX8091L
2 VUMX1234
3 VUMX1234L
4 VUMX5838
1 Answer
Disclaimer: I am not the best at pandas, and I'm absolutely sure there is a far more readable way to accomplish this, but the following will rid you of your for loops and nested if statements, which are slower than vectorized numpy/pandas operations.
Your filter if 'MBTS' not in y won't work the way you think it will, at least given the limited sample input: y is a value taken from the column ser_two, not from type, so it never contains 'MBTS'. Let's assume that's an easy fix, so in pseudocode it should be something like:
for x in df_one.itertuples():  # iterate over df_one's rows so you get both ser_one and type
    for y in df_two['ser_two']:
        if x.type != 'MBTS' and x.ser_one in y:
            if 'L' not in y:
                print(y)
This is a bit clunky, and pandas is great for vectorizing these sorts of operations, so let's filter it down to just Series operations. I'm working with a slightly extended version of your dataframes; as a sanity check, they look like this:
df_one
ser_one type
0 VUMX1234 MBTS
1 VUMX6436 MBTS
2 VUMX5745 MBTS
3 VUMX5802 MBTS
4 VUMX8091 MBTS
5 VUMX1234 XXXX
6 VUMX1234L XXXX
df_two
ser_two
0 VUMX8091
1 VUMX8091L
2 VUMX1234
3 VUMX1234L
4 VUMX5838
I added a few entries that were non-MBTS to fit your problem.
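If you want to run the steps below yourself, here is a minimal sketch that recreates these two frames (the values are copied straight from the tables above):
import pandas as pd

df_one = pd.DataFrame({
    'ser_one': ['VUMX1234', 'VUMX6436', 'VUMX5745', 'VUMX5802',
                'VUMX8091', 'VUMX1234', 'VUMX1234L'],
    'type': ['MBTS', 'MBTS', 'MBTS', 'MBTS', 'MBTS', 'XXXX', 'XXXX'],
})

df_two = pd.DataFrame({
    'ser_two': ['VUMX8091', 'VUMX8091L', 'VUMX1234', 'VUMX1234L', 'VUMX5838'],
})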
For the first bit, you want to find where 'MBTS' is not in df_one.type, but we want to filter the entire dataframe on that condition. df.loc will give you the rows that pass a given filter:
df_one.loc[df_one['type'] == 'MBTS']
ser_one type
0 VUMX1234 MBTS
1 VUMX6436 MBTS
2 VUMX5745 MBTS
3 VUMX5802 MBTS
4 VUMX8091 MBTS
# or
df_one.loc[df_one['type'] != 'MBTS']
ser_one type
5 VUMX1234 XXXX
6 VUMX1234L XXXX
Now you can check if the results of ser_one are contained within ser_two, since the output of that previous check is a Series, like so:
df_one.loc[df_one['type'] != 'MBTS']['ser_one'].isin(df_two['ser_two'])
5 True
6 True
Keep only the entries where that check is True, feed their index back into .loc, and you should be left with two records in this example:
matches = df_one.loc[df_one['type'] != 'MBTS', 'ser_one'].isin(df_two['ser_two'])
df_one.loc[matches[matches].index]
ser_one type
5 VUMX1234 XXXX
6 VUMX1234L XXXX
It might be a bit easier to do the filtering against any ser_one that contains 'L' ahead of time:
df_one[~df_one['ser_one'].str.contains("L")]
ser_one type
0 VUMX1234 MBTS
1 VUMX6436 MBTS
2 VUMX5745 MBTS
3 VUMX5802 MBTS
4 VUMX8091 MBTS
5 VUMX1234 XXXX
Now, combining all of that into one big gigantic horrible chain:
matches = (df_one[~df_one['ser_one'].str.contains("L")]
           .loc[df_one['type'] != 'MBTS', 'ser_one']
           .isin(df_two['ser_two']))
df_one.loc[matches[matches].index]
ser_one type
5 VUMX1234 XXXX
The outer loc takes an array of index values, as returned by the .index call at the end; matches[matches] keeps only the entries where the isin check came back True. The rest is just chained filters, which are native pandas operations implemented in C, and fast.
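If the round trip through .index feels indirect, the same filter can also be written as a single boolean mask. The sketch below assumes the df_one / df_two sample frames built earlier; it is an equivalent formulation rather than a different result:
# Same filter as one boolean mask (sketch, using the sample frames above).
# Each condition is a boolean Series aligned on df_one's index, so they can
# be combined with & and handed straight to .loc.
mask = ((df_one['type'] != 'MBTS')
        & ~df_one['ser_one'].str.contains("L")
        & df_one['ser_one'].isin(df_two['ser_two']))

df_one.loc[mask]
ser_one type
5 VUMX1234 XXXX
This avoids the chained indexing and the index round trip, reads closer to the original if conditions, and stays fully vectorized.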