Culling strings from a list that are substrings of other list elements

Question 1

I'm bothered by my answer to this SO question.

I'm pretty sure it's more efficient than the sort-and-cull implementations, but I'm having trouble expressing that in a way that I trust.

Also, it's rather convoluted by python standards.
Is there a way to express the same process more clearly, without pages of comments?

The problem statement:

Given a subject list of strings, make a list containing every element of the subject that isn't a substring of a different element of the subject.

(削除) Original (削除ここまで) Current solution:

from itertools import groupby
from operator import itemgetter
from typing import Dict, Generator, Iterable, List, Union
# Exploded is a recursive data type representing a culled list of strings as a tree of character-by-character common prefixes. The leaves are the non-common suffixes.
Exploded = None
Exploded = Dict[str, Union[Exploded, str]]
def explode(subject:typing.Iterable[str])->Exploded:
 heads_to_tails = dict()
 for s in subject:
 if s:
 head = s[0]
 tail = s[1:]
 if head in heads_to_tails:
 heads_to_tails[head].append(tail)
 else:
 heads_to_tails[head] = [tail]
 return {
 h: prune_or_follow(t)
 for (h, t) in heads_to_tails.items()
 }
def prune_or_follow(tails: List[str]) -> Union[Exploded, str]:
 if 1 < len(tails):
 return explode(tails)
 else: #we just assume it's not empty.
 return tails[0]
def implode(e: Exploded) -> Generator[str, None, None]:
 for (head, continued) in e.items():
 if isinstance(continued, str):
 yield head + continued
 else:
 for tail in implode(continued):
 yield head + tail
def cull(subject: List[str]) -> List[str]:
 return list(implode(explode(subject)))
print(cull(['a','ab','ac','add']))
print(cull([ 'a boy ran' , 'green apples are worse' , 'a boy ran towards the mill' , ' this is another sentence ' , 'a boy ran towards the mill and fell']))
print(cull(['a', 'ab', 'ac', 'b', 'add']))

I know that it works for the ~~(削除) two (削除ここまで)~~ three test cases.

Question 2

IMHO a lot of the code's perceived verbosity comes from the type hints. Formatting them with a little bit of well placed whitespace here and there might already lead to code that looks friendlier.

Question 3

As a quick comment: using typing is not a perfect substitute for documentation: the logic is non-trivial and deserves to be explained in a comment. It looks like you are building some kind of Trie. Is it correct ?

Question 4

Ok, since nobody had left any answers yet, i've edited the question to reflect what the two of you have mentioned so far.

Question 5

This doesn't work. print(cull(['a', 'ab', 'ac', 'b', 'add'])) outputs ['add', 'b']. groupby() does not sort it's input, so when the dictionary comprehension in explode() gets to 'add', it overwrites the entry for 'a' that was created when processing `['a', 'ab', 'ac'].

Question 6

mother f*er. I spent like five minutes after confirming the bug with your input trying to figure out why groupby would sort, and why we'd care if it did. This isn't even the first time I've been hit by this. In absolute seriousness, who in hell publishes a function like that to a public library?

Question 7

All in all this looks good already.

A few tips:

`continue`

If you change the test in explode from if s: to

if not s:
 continue

You can save a level of indentation for the rest of the loop

`collections.defaultdict`

checking whether a key is in a dict, and adding it if it isn't is why there is collections.defaultdict which can simplify explode a lot:

def explode(subject: Iterable[str]) -> Exploded:
 heads_to_tails = defaultdict(list)
 for s in subject:
 if not s:
 continue
 heads_to_tails[s[0]].append(s[1:])
 return {h: prune_or_follow(t) for (h, t) in heads_to_tails.items()}

tuple unpacking

A matter of style, but in most situations where you use my_list[0] and my_list[1:], you can use tuple unpacking. You can use this in prune_or_follow

def prune_or_follow(tails: List[str]) -> Union[Exploded, str]:
 start, *rest = tails
 if rest:
 return explode(tails)
 else: 
 return start

Whether this is more clear is a matter of taste.

This tuple unpacking does not work as clean in explode because it converts the string into a list of 1-character strings.

`Exploded`

You can use a string in typing so you can do:

Exploded = Dict[str, Union["Exploded", str]]

without the Exploded = None

`cull` typing

cull can accept any iterable. For the result, I don't think the order of strings is important, so there is no need to limit you to a list. The signature can then be def cull(subject: Iterable[str]) -> Collection[str]:. Whether you return a list or a set is a matter of implementation.

If the result is just used for iteration, you can even forgo the list call, and just return the implode generator, and let the caller decide whether he needs in in a list, set or whatever data structure he wants.

implode

now you use e as argument name. One-letter variable names are unclear in general, and to be avoided most of the time*. Since it is a kind of a tree, I would change this variable.

A slightly different take in implode which doesn't use string concatenation but tuple unpacking:

def implode(tree: Exploded, previous=()) -> Generator[str, None, None]:
 for root, branches in tree.items():
 if isinstance(branches, str):
 yield "".join((*previous, root, *branches))
 else:
 yield from implode(branches, *previous, root)

or equivalently:

def implode(tree: Exploded, previous=()) -> Generator[str, None, None]:
 for root, branches in tree.items():
 if isinstance(branches, str):
 yield "".join((*previous, root, *branches))
 else:
 yield from implode(branches, (*previous, root))

The choice between your version and this is a matter of taste, but I wanted to present the different possibilities.

Another variation, is using try-except ArgumentError instead of the isinstance:

def implode_except(tree: Exploded, previous=()) -> Generator[str, None, None]:
 for root, branches in tree.items():
 try:
 yield from implode(branches, (*previous, root))
 except AttributeError:
 yield "".join((*previous, root, *branches))

'* I find i, j etc acceptable for a counter or index, and x and y for coordinates, but in general I try not to use one-letter variable names.

Question 8

Thanks; i've implemented most of this over at the parent question!

Maarten Fabré Maarten Fabré 9,3901 gold badge15 silver badges27 bronze badges · Accepted Answer · 2019-06-27 10:50:04Z

All in all this looks good already.

A few tips:

`continue`

If you change the test in explode from if s: to

if not s:
 continue

You can save a level of indentation for the rest of the loop

`collections.defaultdict`

checking whether a key is in a dict, and adding it if it isn't is why there is collections.defaultdict which can simplify explode a lot:

def explode(subject: Iterable[str]) -> Exploded:
 heads_to_tails = defaultdict(list)
 for s in subject:
 if not s:
 continue
 heads_to_tails[s[0]].append(s[1:])
 return {h: prune_or_follow(t) for (h, t) in heads_to_tails.items()}

tuple unpacking

A matter of style, but in most situations where you use my_list[0] and my_list[1:], you can use tuple unpacking. You can use this in prune_or_follow

def prune_or_follow(tails: List[str]) -> Union[Exploded, str]:
 start, *rest = tails
 if rest:
 return explode(tails)
 else: 
 return start

Whether this is more clear is a matter of taste.

This tuple unpacking does not work as clean in explode because it converts the string into a list of 1-character strings.

`Exploded`

You can use a string in typing so you can do:

Exploded = Dict[str, Union["Exploded", str]]

without the Exploded = None

`cull` typing

cull can accept any iterable. For the result, I don't think the order of strings is important, so there is no need to limit you to a list. The signature can then be def cull(subject: Iterable[str]) -> Collection[str]:. Whether you return a list or a set is a matter of implementation.

If the result is just used for iteration, you can even forgo the list call, and just return the implode generator, and let the caller decide whether he needs in in a list, set or whatever data structure he wants.

implode

now you use e as argument name. One-letter variable names are unclear in general, and to be avoided most of the time*. Since it is a kind of a tree, I would change this variable.

A slightly different take in implode which doesn't use string concatenation but tuple unpacking:

def implode(tree: Exploded, previous=()) -> Generator[str, None, None]:
 for root, branches in tree.items():
 if isinstance(branches, str):
 yield "".join((*previous, root, *branches))
 else:
 yield from implode(branches, *previous, root)

or equivalently:

def implode(tree: Exploded, previous=()) -> Generator[str, None, None]:
 for root, branches in tree.items():
 if isinstance(branches, str):
 yield "".join((*previous, root, *branches))
 else:
 yield from implode(branches, (*previous, root))

The choice between your version and this is a matter of taste, but I wanted to present the different possibilities.

Another variation, is using try-except ArgumentError instead of the isinstance:

def implode_except(tree: Exploded, previous=()) -> Generator[str, None, None]:
 for root, branches in tree.items():
 try:
 yield from implode(branches, (*previous, root))
 except AttributeError:
 yield "".join((*previous, root, *branches))

'* I find i, j etc acceptable for a counter or index, and x and y for coordinates, but in general I try not to use one-letter variable names.

Thanks; i've implemented most of this over at the parent question!

Stack Exchange Network

Culling strings from a list that are substrings of other list elements

The problem statement:

(削除) Original (削除ここまで) Current solution:

1 Answer 1

`continue`

`collections.defaultdict`

tuple unpacking

`Exploded`

`cull` typing

implode

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Culling strings from a list that are substrings of other list elements

The problem statement:

(削除) Original (削除ここまで) Current solution:

1 Answer 1

continue

collections.defaultdict

tuple unpacking

Exploded

cull typing

implode

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions

`continue`

`collections.defaultdict`

`Exploded`

`cull` typing