3
\$\begingroup\$
def check_rule(token, training_data, mode):
    if mode == "frag":
        children = training_data["words"]
        remove = list.remove
        def in_child(x):
            for i in token:
                if i not in x:
                    return False
                remove(x, i)
            return True
        for i in children:
            if in_child(i):
                return False
        return True
        # is_present = children.map(in_child)
    elif mode == "wild":
        # is_present = training_data[1].map(methodcaller("__contains__", token))
        for i in training_data[1]:
            if token in i:
                return False
        return True
    elif mode == "prefix":
        # is_present = training_data[1].str.startswith(token)
        token_len = len(token)
        for i in training_data[1]:
            if i[:token_len] == token:
                return False
        return True
        # token_len = len(token)
        # is_present = training_data[1].str[:token_len] == token
        # is_present = training_data[1].map(methodcaller("startswith", token))
        # if is_present.any():
        #     return False
        # return True

This function takes in:

token: string
training_data: A pandas dataframe with 2 columns having ID and vendor name columns
mode: One of "prefix", "wild", or "frag".

I have the following data structures:

  1. Vendor data: Vendor names in one column (there are other columns as well, but only the name is relevant).

  2. Sub vendor data: A dict that maps each <vendor name> to a list of <sub-vendor> names.

  3. Rule data: For each mode there is a dict whose keys are <vendor name><sub vendor name> and whose values are string rules.

So, for each vendor and sub vendor, I take the whole data minus the rows belonging to that vendor and check whether the corresponding rule is present in any of the remaining rows. This rule checking is done by the code segment I have provided.

The commented-out lines are the optimizations I tried after profiling, but it still takes a lot of time. Can this be optimized further?

asked Dec 14, 2016 at 14:47
\$\endgroup\$
  • This question is incomplete. To help reviewers give you better answers, please add sufficient context to your question. The more you tell us about what your code does and what the purpose of doing that is, the easier it will be for reviewers to help you. Questions should include a description of what the code does. Commented Dec 14, 2016 at 15:24
  • The time it takes to run on different datasets can be seen here - stackoverflow.com/q/41139223/2650427 Commented Dec 15, 2016 at 7:39

1 Answer

2
\$\begingroup\$

You can replace your function signature with

def check_rule(
    token: str,
    training_data: pd.DataFrame,
    mode: typing.Literal['frag', 'wild', 'prefix'],
) -> bool:

and your mode branches with

    match mode:
        case 'frag':
            # ...
        case _:
            raise ValueError(f'Unsupported mode {mode}')

"A pandas dataframe with 2 columns having ID and vendor name columns"

This is inaccurate, because the "frag" branch also expects it to have a words column.

All of this:


    remove = list.remove
    def in_child(x):
        for i in token:
            if i not in x:
                return False
            remove(x, i)
        return True
    for i in children:
        if in_child(i):
            return False

needs to be deleted and replaced with vectorised methods. It's very difficult to say exactly how, because you've provided no sample data; but I can smell poorly expressed Pandas here if, e.g., words is a column of lists. In a case like that, you should start with explode() so that each list element gets its own row instead of sitting inside a list.
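
As a rough illustration only: assuming column 1 holds the vendor names as strings and words holds lists of single characters (neither is confirmed by the question), the three checks could be vectorised along these lines. The function names and the sample frame are made up for the example.

```python
from collections import Counter

import numpy as np
import pandas as pd


def wild_ok(token: str, names: pd.Series) -> bool:
    # regex=False makes this a plain substring test, like `token in name`.
    return not names.str.contains(token, regex=False).any()


def prefix_ok(token: str, names: pd.Series) -> bool:
    return not names.str.startswith(token).any()


def frag_ok(token: str, words: pd.Series) -> bool:
    need = Counter(token)
    # explode() gives every list element its own row and repeats the
    # original index, which is kept here as the 'row' column.
    exploded = words.explode().rename('char').rename_axis('row').reset_index()
    # Count how often each character occurs in each original row.
    counts = (
        exploded.groupby(['row', 'char'])
        .size()
        .unstack(fill_value=0)
        .reindex(columns=list(need), fill_value=0)
    )
    required = np.array([need[c] for c in need])
    # A row matches if it has at least the required count of every character.
    return not (counts.to_numpy() >= required).all(axis=1).any()


# Hypothetical sample data, only to show the expected shapes.
df = pd.DataFrame({
    0: [1, 2],
    1: ['acme corp', 'globex'],
    'words': [list('acme'), list('globex')],
})
```

Whether this actually beats the plain loops depends on the size and shape of your data, so profile both before committing to either.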

answered Jun 28 at 16:12
\$\endgroup\$
