I have two different pandas Series and have written a nested for loop that checks whether the values of the first Series are in the other Series. But this is time consuming, and I cannot work out how to change it to a pandas method. I thought of using the apply function, but it did not work with method chaining. My original nested for loops look like this, and they work:
for x in df_one['ser_one']:
    print(x)
    for y in df_two['ser_two']:
        if 'MBTS' not in y and x in y:
            if 'L' in y:
                print(y)
Is there a way to make this less time consuming?
Here is what I attempted using apply:
df_two['ser_two'].apply(lambda x: x if 'MBTS' not in df_one['ser_one'].apply(lambda y:y) and x in df_one['ser_one'].apply(lambda y:y))
Example input:
df_one.head()
Out[136]:
type ser_one
0 MBTS VUMX1234
1 MBTS VUMX6436
2 MBTS VUMX5745
3 MBTS VUMX5802
4 MBTS VUMX8091
df_two.head()
Out[137]:
ser_two
0 VUMX8091
1 VUMX8091L
2 VUMX1234
3 VUMX1234L
4 VUMX5838
1 Answer
Disclaimer: I am not the best at pandas, and I'm absolutely sure there is a far more readable way to accomplish this, but the following will rid you of your for loops and nested if statements, which are slower than vectorized numpy/pandas operations.
Your filter if 'MBTS' not in y won't work the way you think it will, at least given the limited sample input: y is a value taken from the column ser_two, not from type, so it never contains 'MBTS'. Let's assume that's an easy fix, so in pseudocode it should be something like:
for x in df_one.itertuples():  # iterate over df_one's rows so you get both ser_one and type
    for y in df_two['ser_two']:
        if x.type != 'MBTS' and x.ser_one in y:
            if 'L' not in y:
                print(y)
This is a bit clunky, and pandas is great for vectorizing these sorts of operations, so let's filter it down to just Series operations. I'm working with a slightly extended version of your dataframes; as a sanity check, they look like this:
df_one
ser_one type
0 VUMX1234 MBTS
1 VUMX6436 MBTS
2 VUMX5745 MBTS
3 VUMX5802 MBTS
4 VUMX8091 MBTS
5 VUMX1234 XXXX
6 VUMX1234L XXXX
df_two
ser_two
0 VUMX8091
1 VUMX8091L
2 VUMX1234
3 VUMX1234L
4 VUMX5838
I added a few entries that were non-MBTS to fit your problem.
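If you want to run the steps below yourself, here is a minimal sketch that recreates these two frames (the values are copied straight from the tables above):
import pandas as pd

df_one = pd.DataFrame({
    'ser_one': ['VUMX1234', 'VUMX6436', 'VUMX5745', 'VUMX5802',
                'VUMX8091', 'VUMX1234', 'VUMX1234L'],
    'type': ['MBTS', 'MBTS', 'MBTS', 'MBTS', 'MBTS', 'XXXX', 'XXXX'],
})

df_two = pd.DataFrame({
    'ser_two': ['VUMX8091', 'VUMX8091L', 'VUMX1234', 'VUMX1234L', 'VUMX5838'],
})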
For the first bit, you want to find where 'MBTS' is not in df_one.type, but we want to filter the entire dataframe on that condition. df.loc will give you the rows that pass a given filter:
df_one.loc[df_one['type'] == 'MBTS']
ser_one type
0 VUMX1234 MBTS
1 VUMX6436 MBTS
2 VUMX5745 MBTS
3 VUMX5802 MBTS
4 VUMX8091 MBTS
# or
df_one.loc[df_one['type'] != 'MBTS']
ser_one type
5 VUMX1234 XXXX
6 VUMX1234L XXXX
Now you can check if the results of ser_one are contained within ser_two, since the output of that previous check is a Series, like so:
df_one.loc[df_one['type'] != 'MBTS']['ser_one'].isin(df_two['ser_two'])
5 True
6 True
Keep only the entries where that check is True, feed their index back into .loc, and you should be left with two records in this example:
matches = df_one.loc[df_one['type'] != 'MBTS', 'ser_one'].isin(df_two['ser_two'])
df_one.loc[matches[matches].index]
ser_one type
5 VUMX1234 XXXX
6 VUMX1234L XXXX
It might be a bit easier to do the filtering against any ser_one that contains 'L' ahead of time:
df_one[~df_one['ser_one'].str.contains("L")]
ser_one type
0 VUMX1234 MBTS
1 VUMX6436 MBTS
2 VUMX5745 MBTS
3 VUMX5802 MBTS
4 VUMX8091 MBTS
5 VUMX1234 XXXX
Now, combining all of that into one big gigantic horrible chain:
matches = (df_one[~df_one['ser_one'].str.contains("L")]
           .loc[df_one['type'] != 'MBTS', 'ser_one']
           .isin(df_two['ser_two']))
df_one.loc[matches[matches].index]
ser_one type
5 VUMX1234 XXXX
The outer loc takes an array of index values, as returned by the .index call at the end; matches[matches] keeps only the entries where the isin check came back True. The rest is just chained filters, which are native pandas operations implemented in C, and fast.
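If the round trip through .index feels indirect, the same filter can also be written as a single boolean mask. The sketch below assumes the df_one / df_two sample frames built earlier; it is an equivalent formulation rather than a different result:
# Same filter as one boolean mask (sketch, using the sample frames above).
# Each condition is a boolean Series aligned on df_one's index, so they can
# be combined with & and handed straight to .loc.
mask = ((df_one['type'] != 'MBTS')
        & ~df_one['ser_one'].str.contains("L")
        & df_one['ser_one'].isin(df_two['ser_two']))

df_one.loc[mask]
ser_one type
5 VUMX1234 XXXX
This avoids the chained indexing and the index round trip, reads closer to the original if conditions, and stays fully vectorized.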