Optimization of startswith, list intersection with duplicates and substring search

Question 1

def check_rule(token, training_data, mode):
    if mode == "frag":
        children = training_data["words"]
        remove = list.remove
        def in_child(x):
            for i in token:
                if i not in x:
                    return False
                remove(x, i)
            return True
        for i in children:
            if in_child(i):
                return False
        return True
        # is_present = children.map(in_child)
    elif mode == "wild":
        # is_present = training_data[1].map(methodcaller("__contains__", token))
        for i in training_data[1]:
            if token in i:
                return False
        return True
    elif mode == "prefix":
        # is_present = training_data[1].str.startswith(token)
        token_len = len(token)
        for i in training_data[1]:
            if i[:token_len] == token:
                return False
        return True
        # token_len = len(token)
        # is_present = training_data[1].str[:token_len] == token
        # is_present = training_data[1].map(methodcaller("startswith", token))
        # if is_present.any():
        #     return False
        # return True

This function takes in:

token: a string
training_data: a pandas DataFrame with two columns, ID and vendor name
mode: one of prefix, wild, or frag

I have the following data structures:

Vendor data: vendor names in one column (there are other columns as well, but only the name column is relevant).
Sub vendor data: a dict that maps each <vendor name> to a list of <sub-vendor>.
Rule data: for each mode, a dict with <vendor name><sub vendor name> as the key and a string rule as the value.

For each vendor and sub-vendor, I take the whole data minus the rows belonging to that vendor, and check whether the corresponding rule is present in any of the remaining rows. This rule checking is done by the code segment above.

The commented-out parts are the optimizations I tried after profiling, but it still takes a lot of time. Can this be optimized further?

Comment

The time it takes to run on different datasets can be seen here: stackoverflow.com/q/41139223/2650427

Answer 1

You can replace your function signature with

import typing

import pandas as pd


def check_rule(
    token: str,
    training_data: pd.DataFrame,
    mode: typing.Literal['frag', 'wild', 'prefix'],
) -> bool:

and your mode branches with

    match mode:
        case 'frag':
            # ...
        case _:
            raise ValueError(f'Unsupported mode {mode}')

"A pandas dataframe with 2 columns having ID and vendor name columns"

This is a lie, because the frag branch also expects it to have a words column (training_data["words"]).

All of this:


        remove = list.remove
        def in_child(x):
            for i in token:
                if i not in x:
                    return False
                remove(x, i)
            return True
        for i in children:
            if in_child(i):
                return False

needs to be deleted and replaced with vectorised methods. It's very difficult to say exactly how, because you've provided no data; but I can smell poorly-expressed Pandas content here if e.g. words is a column of lists. In a case like that, you should start with explode() to make it not a column of lists.
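
Separately from vectorisation, one thing the loop does is call list.remove on the lists stored in training_data["words"], so every call shortens the data that later calls will see. The sketch below is not the vectorised approach recommended above. It is a plain-Python stand-in for the same "contains every item, counting duplicates" test that leaves its input untouched, using collections.Counter. The function name frag_rule_holds and its parameters are hypothetical, and it assumes each child is an iterable of hashable items.

```python
from collections import Counter
from collections.abc import Iterable


def frag_rule_holds(
    parts: Iterable[str],                # assumption: the token's fragments
    children: Iterable[Iterable[str]],   # assumption: the "words" lists
) -> bool:
    """Return True when no child contains every part, counting duplicates."""
    need = Counter(parts)
    for words in children:
        # Counter subtraction keeps only positive counts, so an empty
        # result means this child supplies everything that is needed.
        missing = need - Counter(words)
        if not missing:
            return False
    return True
```

For example, the parts ['a', 'a', 'b'] are not contained in ['a', 'b', 'c'] because only one 'a' is available, so the rule holds for that child. Because new Counter objects are built instead of removing items, the child lists are the same after the call as before it.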

Reinderien · Answer 1 · 2025-06-28 16:12:18Z
