def check_rule(token, training_data, mode):
    if mode == "frag":
        children = training_data["words"]
        remove = list.remove
        def in_child(x):
            for i in token:
                if i not in x:
                    return False
                remove(x, i)
            return True
        for i in children:
            if in_child(i):
                return False
        return True
        # is_present = children.map(in_child)
    elif mode == "wild":
        # is_present = training_data[1].map(methodcaller("__contains__", token))
        for i in training_data[1]:
            if token in i:
                return False
        return True
    elif mode == "prefix":
        # is_present = training_data[1].str.startswith(token)
        token_len = len(token)
        for i in training_data[1]:
            if i[:token_len] == token:
                return False
        return True
        # token_len = len(token)
        # is_present = training_data[1].str[:token_len] == token
        # is_present = training_data[1].map(methodcaller("startswith", token))
    # if is_present.any():
    #     return False
    # return True

This function takes in:

- `token`: a string
- `training_data`: a pandas DataFrame with 2 columns, ID and vendor name
- `mode`: one of `prefix`, `wild`, `frag`

I have 5 data structures:
- Vendor data: has vendor names in one column (there are other columns as well, but only the name is relevant).
- Sub vendor data: a `dict` that holds, for each `<vendor name>`, a list of `<sub-vendor>`.
- Rule data: for each `mode` there is a dict with `<vendor name><sub vendor name>` as key and a string rule as value.

So, for each vendor and sub vendor, the corresponding rule is checked against the whole data minus the rows belonging to that vendor, to see whether it is present in any of the remaining rows. This rule checking is done by the code segment I have provided.
The commented-out parts are the optimizations I tried after profiling, but it still takes a lot of time. Can this be optimized further?
- This question is incomplete. To help reviewers give you better answers, please add sufficient context to your question. The more you tell us about what your code does and what the purpose of doing that is, the easier it will be for reviewers to help you. Questions should include a description of what the code does. (301_Moved_Permanently, Dec 14, 2016 at 15:24)
- The time it takes to run on different datasets can be seen here: stackoverflow.com/q/41139223/2650427 (TrigonaMinima, Dec 15, 2016 at 7:39)

1 Answer
You can replace your function signature with

    def check_rule(
        token: str,
        training_data: pd.DataFrame,
        mode: typing.Literal['frag', 'wild', 'prefix'],
    ) -> bool:

and your `mode` branches with

    match mode:
        case 'frag':
            # ...
        case _:
            raise ValueError(f'Unsupported mode {mode}')

> A pandas dataframe with 2 columns having ID and vendor name columns

This is a lie, because you also expect it to have `words`.

All of this:

    remove = list.remove
    def in_child(x):
        for i in token:
            if i not in x:
                return False
            remove(x, i)
        return True
    for i in children:
        if in_child(i):
            return False

needs to be deleted and replaced with vectorised methods. It's very difficult to say exactly how, because you've provided no data; but I can smell poorly-expressed Pandas content here if e.g. `words` is a column of lists. In a case like that, you should start with `explode()` to make it not a column of lists.
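
Since no data was provided, the following is only a sketch of what an `explode()`-based version of the `frag` check might look like. It assumes that `words` is a Series whose values are lists of single characters, that the Series index is unique, and that the intended test is "every character of `token` occurs in the row at least as many times as it occurs in `token`". The function name `frag_rule_ok` is hypothetical.

```python
from collections import Counter

import pandas as pd


def frag_rule_ok(token: str, words: pd.Series) -> bool:
    """Return True when no row holds every character of `token`, counting repeats."""
    needed = dict(Counter(token))  # character -> required count
    if not needed:
        # An empty token trivially matches any row, as in the original loop.
        return len(words) == 0

    # One row per (original row, item) instead of one list per row.
    long = words.explode().rename('item').rename_axis('row').reset_index()
    long = long[long['item'].isin(list(needed))]
    if long.empty:
        return True

    # How often each relevant character occurs in each original row.
    counts = long.groupby(['row', 'item']).size().reset_index(name='have')
    counts['need'] = counts['item'].map(needed)
    enough = counts[counts['have'] >= counts['need']]

    # A row matches only when every distinct token character is satisfied.
    satisfied = enough.groupby('row')['item'].nunique()
    return not (satisfied == len(needed)).any()
```

Unlike the loop with `list.remove`, this version does not modify the lists stored in the frame, so repeated calls see the same data. Whether it is actually faster depends on the size and shape of the real data, which would need profiling.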