Match pattern in Polars (Python) native API vs the slow map_elements

Question 1

I have a Polars dataframe in which the cells contain a sequence of single digits as a string of characters, and I want to find the number of differences between the single-digits elements of the string. For example:

df = pl.DataFrame({"pop_1": ["100","0021"],"pop_2":["11002","0000",]})

| pop_1 | pop_2 |
|_______|_________|
| "100" | "11002" |
| "0021"| "0000" |

col_1 row1 has 2 differences with itself (1 different than 0 two times); col_2 row1 has 8 differences with itself; col_1 and col_2 at row1 have 9 differences between them.

The naive implementation of this is:

def get_dxy(str1,str2):
 diffs = 0
 for x in str1:
 for y in str2:
 if x!=y:
 diffs+=1
 return diffs
def get_pi(str1):
 diffs = 0
 for i in range(len(str1)-1):
 for j in range(i+1,len(str1)):
 if str1[i]!=str1[j]:
 diffs+=1
 return diffs

I need to report these differences in separate columns. I am able to do this by using map_elements:

df = df.with_columns( pl.col('pop_1')
 .map_elements(get_pi, return_dtype=pl.Int64)
 .alias('pi_1') 
 )
df = df.with_columns( pl.col('pop_2')
 .map_elements(get_pi, return_dtype=pl.Int64)
 .alias('pi_2') 
 )
df = df.with_columns(
 pl.struct("pop_1", "pop_2")
 .map_elements(lambda cols: get_dxy(cols["pop_1"], cols["pop_2"]), return_dtype=pl.Int64)
 .alias("dxy")
 )
df

| pop_1 | pop_2 | pi_1 | pi_2 | dxy | 
|_______|_________|_______|_________|_______|
| "100" | "11002" | 2 | 8 | 9 |
| "0021"| "0000" | 5 | 0 | 8 |

However, my data is too large, and this method is not very fast. I was wondering what the fastest way to accomplish this using Polars native API (like regex) would be.

Could I get some hints in how to do this?

Question 2

wondering what the fastest way to accomplish this

You presented a quadratic algorithm, but you want an approach that scales as O(N log N). Number of mismatches is total elements minus the matches.

Sort your inputs. If need be, do that on (value, index) pairs, so you know where each element came from. Do K-way (2-way) merge to quickly identify matches. Subtract to learn how many mismatches there are.

An RDBMS JOIN, perhaps in SQLite, would be one way to accomplish this. Polars offers several methods for conveniently writing to temp tables, and for retrieving JOIN results.

J_H J_HJ_H 41.4k3 gold badges38 silver badges157 bronze badges · Answer 1 · 2024-09-19 15:36:51Z

wondering what the fastest way to accomplish this

You presented a quadratic algorithm, but you want an approach that scales as O(N log N). Number of mismatches is total elements minus the matches.

Sort your inputs. If need be, do that on (value, index) pairs, so you know where each element came from. Do K-way (2-way) merge to quickly identify matches. Subtract to learn how many mismatches there are.

An RDBMS JOIN, perhaps in SQLite, would be one way to accomplish this. Polars offers several methods for conveniently writing to temp tables, and for retrieving JOIN results.

Stack Exchange Network

Match pattern in Polars (Python) native API vs the slow map_elements

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Match pattern in Polars (Python) native API vs the slow map_elements

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions