Matching rows between two dataframes

Question 1

I have managed to write this piece of code, but as I am using it on big data sets it ends up being quite slow. I am pretty sure it could be optimized, but I am very new to coding and don't really know where to start. I think getting rid of the for loops would be one way, but honestly I'm lost. A little bit of help would be greatly appreciated!

Basically, the point is to check whether a row of the 'data' dataframe matches a row of the 'ref' dataframe. I use np.isclose to allow for small differences, as I know my 'data' values can differ slightly from the 'ref' values.

Also, because my rows can contain a lot of NaN values, I first use np.isnan to find the index of the last 'real' value in the row, and then compare only the 'actual' values. I thought it would speed things up, but I'm not sure it did...

match = []
checklist = set()
for read in data.itertuples():
    for ref in ref.itertuples():
        x = np.isnan(read[3:]).argmax(axis=0)
        if x == 2:
            if np.isclose(read[4:6],ref[7:9],atol=5, equal_nan=True).all() == True and np.isnan(ref[6:]).argmax(axis=0) == x:
                if not read[1] in checklist:
                    match.append([read[1], ref[5]])
                    checklist.add(read[1])
        if x > 2:
            read_pos = 3+x-1
            ref_pos = 6+x-1
            if np.isclose(read[4:read_pos],ref[7:ref_pos],atol=5, equal_nan=True).all() == True and np.isnan(ref[6:]).argmax(axis=0) == x:
                if not read[1] in checklist:
                    match.append([read[1], ref[5]])
                    checklist.add(read[1])
    if read[1] not in checklist:
        match.append([read[1], "not found"])
        checklist.add(read[1])

Thanks in advance !

EDIT:

To download samples of data and ref tables: https://we.tl/RF6lxDZBjt

Short example of the dataframes:

import numpy as np
import pandas as pd

value_cols = ['start 1', 'end 1', 'start 2', 'end 2']

ref = pd.DataFrame({'name': ['a-1', 'a-2', 'b-1'],
                    'start 1': [100, 100, 100],
                    'end 1': [200, 200, 500],
                    'start 2': [300, np.nan, 600],
                    'end 2': [400, np.nan, 700]},
                   columns=['name'] + value_cols)
# Cast only the numeric columns; 'name' holds strings and cannot be float64.
ref = ref.astype({col: 'float64' for col in value_cols})

  name  start 1  end 1  start 2  end 2
0  a-1    100.0  200.0    300.0  400.0
1  a-2    100.0  200.0      NaN    NaN
2  b-1    100.0  500.0    600.0  700.0

data = pd.DataFrame({'name': ['read 1', 'read 2', 'read 3', 'read 4', 'read 5'],
                     'start 1': [100, 102, 100, 103, 600],
                     'end 1': [198, 504, 500, 200, 702],
                     'start 2': [np.nan, 600, 650, 601, np.nan],
                     'end 2': [np.nan, 699, 700, 702, np.nan]},
                    columns=['name'] + value_cols)
data = data.astype({col: 'float64' for col in value_cols})

     name  start 1  end 1  start 2  end 2
0  read 1    100.0  198.0      NaN    NaN
1  read 2    102.0  504.0    600.0  699.0
2  read 3    100.0  500.0    650.0  700.0
3  read 4    103.0  200.0    601.0  702.0
4  read 5    600.0  702.0      NaN    NaN

Question 2

@Graipher Did you happen to have time to look at it? If not, I totally understand! If you have even a small idea of things I could try, I'm willing to try it myself. I tried looking into vectorization, but I really don't know where to start. Thanks anyway for your time, it's greatly appreciated :)

Question 3

Not yet, but I will have some time tomorrow. Yes, writing good (in other words vectorised) code in numpy/pandas is a whole new world if you only know vanilla Python.

Question 4

Do you think it's possible to remove the outer for loop by using apply instead, and to put the vectorization of the inner loop in a function?
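One possible direction for the vectorisation discussed in these comments is NumPy broadcasting: compare every 'data' row against every 'ref' row in a single np.isclose call. The sketch below is only an illustration and not the approach from the answer. It assumes the small sample frames from the question, compares all four value columns rather than the slices used in the original loop, and relies on equal_nan=True so that a row only matches a reference row with NaN in the same positions. The names match_rows and value_cols are hypothetical.

```python
import numpy as np
import pandas as pd

value_cols = ['start 1', 'end 1', 'start 2', 'end 2']

ref = pd.DataFrame({
    'name': ['a-1', 'a-2', 'b-1'],
    'start 1': [100.0, 100.0, 100.0],
    'end 1': [200.0, 200.0, 500.0],
    'start 2': [300.0, np.nan, 600.0],
    'end 2': [400.0, np.nan, 700.0],
})
data = pd.DataFrame({
    'name': ['read 1', 'read 2', 'read 3', 'read 4', 'read 5'],
    'start 1': [100.0, 102.0, 100.0, 103.0, 600.0],
    'end 1': [198.0, 504.0, 500.0, 200.0, 702.0],
    'start 2': [np.nan, 600.0, 650.0, 601.0, np.nan],
    'end 2': [np.nan, 699.0, 700.0, 702.0, np.nan],
})


def match_rows(data, ref, cols, atol=5):
    """For each data row, return the name of the first close ref row."""
    d = data[cols].to_numpy(dtype=float)[:, None, :]  # shape (n_data, 1, n_cols)
    r = ref[cols].to_numpy(dtype=float)[None, :, :]   # shape (1, n_ref, n_cols)
    # close[i, j] is True when every column of data row i is within the
    # tolerance of ref row j; NaN only matches NaN because of equal_nan=True.
    close = np.isclose(d, r, atol=atol, equal_nan=True).all(axis=2)
    found = close.any(axis=1)
    first = close.argmax(axis=1)  # index of the first True (0 if none)
    names = ref['name'].to_numpy()[first]
    return pd.DataFrame({'read': data['name'].to_numpy(),
                         'match': np.where(found, names, 'not found')})


result = match_rows(data, ref, value_cols)
print(result)
```

The intermediate boolean array has n_data × n_ref × n_cols elements, so for very large tables it would probably need to be processed in chunks of 'data' rows.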

Question 5

Invariants

for read in data.itertuples():
    for ref in ref.itertuples():
        x = np.isnan(read[3:]).argmax(axis=0)

x doesn't change in the inner loop, so you can move it out of the inner loop, and not execute it repeatedly.

for read in data.itertuples():
    x = np.isnan(read[3:]).argmax(axis=0)
    for ref in ref.itertuples():
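To make it concrete what x holds, here is a small sketch using made-up rows shaped like the sample data. It shows that np.isnan(...).argmax() gives the position of the first NaN, and that a row without any NaN also yields 0, which may be worth guarding against. The helper first_nan_index is a hypothetical name, not part of the original code.

```python
import numpy as np


def first_nan_index(values):
    """Index of the first NaN in values, or len(values) if there is none."""
    mask = np.isnan(np.asarray(values, dtype=float))
    return int(mask.argmax()) if mask.any() else len(values)


partial = (100.0, 198.0, np.nan, np.nan)
complete = (102.0, 504.0, 600.0, 699.0)

# argmax on a boolean array returns the position of the first True.
x_partial = int(np.isnan(partial).argmax(axis=0))
# With no NaN at all the mask is all False, and argmax falls back to 0.
x_complete = int(np.isnan(complete).argmax(axis=0))

print(x_partial, x_complete)
print(first_nan_index(partial), first_nan_index(complete))
```

Because the value depends only on the 'data' row, computing it once per outer iteration is enough.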

The following two lines are identical, apart from the end-points of the slices:

if np.isclose(read[4: 6 ],ref[7: 9 ],atol=5, equal_nan=True).all() == True and np.isnan(ref[6:]).argmax(axis=0) == x:
if np.isclose(read[4:read_pos],ref[7:ref_pos],atol=5, equal_nan=True).all() == True and np.isnan(ref[6:]).argmax(axis=0) == x:

You already have a variable for the end-points. Why not use it for the first line as well, and only have one case?

read_pos = 3+x-1 if x > 2 else 6
ref_pos = 6+x-1 if x > 2 else 9
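As a quick sanity check, the two conditional expressions can be wrapped in a function and compared against both branches of the original code. This is only a sketch; slice_ends is a hypothetical helper name.

```python
def slice_ends(x):
    """End-points of the read and ref slices for a given first-NaN index x."""
    # The conditional expression binds loosest: (3 + x - 1) if x > 2 else 6.
    read_pos = 3 + x - 1 if x > 2 else 6
    ref_pos = 6 + x - 1 if x > 2 else 9
    return read_pos, ref_pos


# x == 2 reproduces the hard-coded slices read[4:6] and ref[7:9].
print(slice_ends(2))
# x > 2 reproduces read[4:3+x-1] and ref[7:6+x-1].
print(slice_ends(6))
```

Note that with these offsets x == 4 gives the same end-points as x == 2, so it may be worth double-checking that the original slice arithmetic does what was intended.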

Once you've found your target, you can't ever add it again ...

if not read[1] in checklist:
    match.append([read[1], ref[5]])
    checklist.add(read[1])

... but you don't break out of your inner search, which is now pointless.
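A minimal sketch of the early exit, assuming plain lists of values shaped like the sample data rather than the itertuples rows of the original code. Python's for/else runs the else branch only when the loop ends without break, which also covers the "not found" case. The function find_first_match and the ref_rows list are hypothetical names for illustration.

```python
import numpy as np


def find_first_match(read_values, ref_rows, atol=5):
    """Return the name of the first reference row close to read_values."""
    for name, ref_values in ref_rows:
        if np.isclose(read_values, ref_values, atol=atol, equal_nan=True).all():
            match = name
            break  # stop scanning: later rows cannot change the result
    else:
        match = "not found"  # reached only if the loop never hit break
    return match


ref_rows = [("a-1", [100.0, 200.0, 300.0, 400.0]),
            ("a-2", [100.0, 200.0, np.nan, np.nan]),
            ("b-1", [100.0, 500.0, 600.0, 700.0])]

print(find_first_match([100.0, 198.0, np.nan, np.nan], ref_rows))
print(find_first_match([600.0, 702.0, np.nan, np.nan], ref_rows))
```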

If I haven't made any errors, this should be a little faster:

match = []
checklist = set()
for read in data.itertuples():
    x = np.isnan(read[3:]).argmax(axis=0)
    if x >= 2 and read[1] not in checklist:
        read_pos = 3+x-1 if x > 2 else 6
        ref_pos = 6+x-1 if x > 2 else 9
        # Loop variable renamed to ref_row so it no longer shadows the ref dataframe.
        for ref_row in ref.itertuples():
            if np.isclose(read[4:read_pos],ref_row[7:ref_pos],atol=5, equal_nan=True).all() == True and np.isnan(ref_row[6:]).argmax(axis=0) == x:
                match.append([read[1], ref_row[5]])
                checklist.add(read[1])
                break
    if read[1] not in checklist:
        match.append([read[1], "not found"])
        checklist.add(read[1])

Question 6

Thanks! It didn't improve the speed that much, but at least it made me realize some mistakes I made, and it is much easier to read and understand, so that's already a nice improvement!

AJNeufeld · Answer 1 · 2018-07-19 22:48:59Z
