1
\$\begingroup\$

I have a Polars dataframe in which the cells contain a sequence of single digits as a string of characters, and I want to find the number of differences between the single-digits elements of the string. For example:

df = pl.DataFrame({"pop_1": ["100","0021"],"pop_2":["11002","0000",]})
| pop_1 | pop_2 |
|_______|_________|
| "100" | "11002" |
| "0021"| "0000" |

col_1 row1 has 2 differences with itself (1 different than 0 two times); col_2 row1 has 8 differences with itself; col_1 and col_2 at row1 have 9 differences between them.

The naive implementation of this is:

def get_dxy(str1,str2):
 diffs = 0
 for x in str1:
 for y in str2:
 if x!=y:
 diffs+=1
 return diffs
def get_pi(str1):
 diffs = 0
 for i in range(len(str1)-1):
 for j in range(i+1,len(str1)):
 if str1[i]!=str1[j]:
 diffs+=1
 return diffs

I need to report these differences in separate columns. I am able to do this by using map_elements:

df = df.with_columns( pl.col('pop_1')
 .map_elements(get_pi, return_dtype=pl.Int64)
 .alias('pi_1') 
 )
df = df.with_columns( pl.col('pop_2')
 .map_elements(get_pi, return_dtype=pl.Int64)
 .alias('pi_2') 
 )
df = df.with_columns(
 pl.struct("pop_1", "pop_2")
 .map_elements(lambda cols: get_dxy(cols["pop_1"], cols["pop_2"]), return_dtype=pl.Int64)
 .alias("dxy")
 )
df
| pop_1 | pop_2 | pi_1 | pi_2 | dxy | 
|_______|_________|_______|_________|_______|
| "100" | "11002" | 2 | 8 | 9 |
| "0021"| "0000" | 5 | 0 | 8 | 

However, my data is too large, and this method is not very fast. I was wondering what the fastest way to accomplish this using Polars native API (like regex) would be.

Could I get some hints in how to do this?

asked Sep 19, 2024 at 4:25
\$\endgroup\$

1 Answer 1

2
\$\begingroup\$

wondering what the fastest way to accomplish this

You presented a quadratic algorithm, but you want an approach that scales as O(N log N). Number of mismatches is total elements minus the matches.

Sort your inputs. If need be, do that on (value, index) pairs, so you know where each element came from. Do K-way (2-way) merge to quickly identify matches. Subtract to learn how many mismatches there are.

An RDBMS JOIN, perhaps in SQLite, would be one way to accomplish this. Polars offers several methods for conveniently writing to temp tables, and for retrieving JOIN results.

answered Sep 19, 2024 at 15:36
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.