I needed to write a Python generator that lazily generates, in lexicographical order, strings composed of a certain alphabet (e.g. lowercase English letters).
My first thought was to make an infinite generator, but an infinite iterator would be pretty useless because it would just generate strings composed of only the character a
. This is because the lexicographical ordering of strings is not a well-order; it can be thought of as composed of an infinite sequence of infinitely nested sequences: (a
, (aa
, ...), (ab
, ...), ...), (b
, (ba
, ...), (bb
, ...), ...), ... The generator would never reach ab
since it has an infinite amount of predecessors.
Thus I went for limiting the maximum length of the strings. I.e., the full specification is a generator that lazily generates all strings in a given alphabet of up to a certain length in lexicographical order. E.g., for a maximum length of 3 with ASCII lowercase letters, it should generate, in this order: ""
, "a"
, "aa"
, "aaa"
, "aab"
, ..., "aaz"
, "ab"
, "aba
", ..., "zzz"
.
The solution I came up with is elegant in the abstract; it uses the intrinsic recursive relationship of lexicographical ordering: if you take the set of all strings—including infinite strings—and remove the first character from each, you're still left with the set of all strings. But it's probably not very efficient in practice, as for each iteration it uses Python's string concatenation operation +
\$m\$ times to build a string of length \$m\$ (\$m - 1\$ times in between characters plus the "terminating" empty string, as you'll see below), and strings are immutable so copies are made at every intermediate step, resulting in \$O(m^2)\$ time complexity at each iteration.
Here is my solution:
import string
def lexstrings(max_length: int, alphabet=string.ascii_lowercase):
yield ""
if max_length == 0: return
for first in alphabet:
for suffix in lexstrings(max_length - 1, alphabet=alphabet):
yield first + suffix
Example:
>>> g = lexstrings(max_length=3, alphabet="ab")
>>> list(g)
['',
'a',
'aa',
'aaa',
'aab',
'ab',
'aba',
'abb',
'b',
'ba',
'baa',
'bab',
'bb',
'bba',
'bbb']
This implementation also "supports" the infinite version by supplying a negative maximum length:
>>> g = lexstrings(-1)
>>> next(g)
''
>>> next(g)
'a'
>>> next(g)
'aa'
>>> next(g)
'aaa'
...
Some benchmarking confirms a single iteration of this is quadratic in terms of the string size. Here is a plot of execution time for generating a string of maximum length (in a single iteration) vs the maximum length:
Benchmarks show quadratic time complexity per iteration
The code to generate this plot using perfplot
follows. (I modified the order of generation so that the first iteration yields a max-length string.)
import string
import perfplot
import sys
import numpy as np
def lexstrings(max_length: int, alphabet=string.ascii_lowercase):
if max_length == 0:
return
for first in alphabet:
for suffix in lexstrings(max_length - 1, alphabet=alphabet):
yield first + suffix
yield ""
n_range = np.linspace(1, sys.getrecursionlimit() - 500)
benchmark = perfplot.bench(
setup=lambda n: n,
kernels=[
lambda n: next(lexstrings(n)),
],
labels=[
"lexstrings",
],
n_range=n_range,
xlabel="maximum string length",
title="Single iteration (max-length string)",
target_time_per_measurement=0.1,
equality_check=None,
)
benchmark.show(logx=False, logy=False)
-
1\$\begingroup\$ Hey, why don't you plot graphs of alphabet size and max length against execution time, then you'll be able to see exactly what complexity these 2 variables contribute \$\endgroup\$Greedo– Greedo2020年11月06日 07:40:39 +00:00Commented Nov 6, 2020 at 7:40
-
1\$\begingroup\$ @Greedo Good idea, I've added a plot for time vs length of generated string. (The alphabet size only influences the number of iterations, not the time each takes to execute, so I skipped that.) \$\endgroup\$Anakhand– Anakhand2020年11月06日 17:43:24 +00:00Commented Nov 6, 2020 at 17:43
4 Answers 4
One approach to avoid string copying and improve the time complexity from \$\Theta(m^2)\$ to \$\Theta(m)\$ (\$m\$ is the output string length) for each output string is to store characters of strings in a list and modify it in-place.
from typing import List, Iterable
def lex_string_gen_recursive(cur_depth: int, ch_list: List[str], alphabet: Iterable[str]) -> Iterable[str]:
yield "".join(ch_list)
if cur_depth < len(ch_list):
next_depth = cur_depth + 1
for ch in alphabet:
ch_list[cur_depth] = ch
yield from lex_string_gen_recursive(next_depth, ch_list, alphabet)
ch_list[cur_depth] = ""
def lex_string_gen(max_length: int, alphabet: Iterable[str]) -> Iterable[str]:
yield from lex_string_gen_recursive(0, [""] * max_length, sorted(alphabet))
Test:
list(lex_string_gen(3, "ab"))
Output:
['',
'a',
'aa',
'aaa',
'aab',
'ab',
'aba',
'abb',
'b',
'ba',
'baa',
'bab',
'bb',
'bba',
'bbb']
On a possible non-recursive solution, as a possible improvement (per Reinderien's answer).
The general idea here is that if we build a tree of these strings in the obvious way*, then the lexicographical ordering is nothing but a DFS on this tree.
*Let the root be the empty string, then its neighbours are the one character long strings for each character of the alphabet, and so on. For example on the alphabet ab
with max_length = 2
, the tree would look like the following:
<root>
|
+- a
| |
| +- aa
| |
| `- ab
|
`- b
|
+- ba
|
`- bb
Then we can sort of build the tree while doing the DFS like so:
def lexstrings(max_length=3, alphabet="ab"):
reversed_alphabet = list(reversed(alphabet))
nodes_to_visit = ['']
while nodes_to_visit:
current_node = nodes_to_visit.pop()
if len(current_node) > max_length:
continue
yield current_node
nodes_to_visit.extend(current_node + tc for tc in reversed_alphabet)
-
\$\begingroup\$ Ah! So basically a DFS on a "sorted" trie? \$\endgroup\$Anakhand– Anakhand2020年11月08日 17:38:35 +00:00Commented Nov 8, 2020 at 17:38
-
1\$\begingroup\$ How about starting with
nodes_to_visit = ['']
? Less code, and would support even negativemax_length
(at the moment you do yield''
for that). \$\endgroup\$superb rain– superb rain2020年11月08日 17:59:45 +00:00Commented Nov 8, 2020 at 17:59 -
\$\begingroup\$ @Anakhand I'm not sure, but I have added an example. I hope it makes my intentions clearer. \$\endgroup\$Eman Yalpsid– Eman Yalpsid2020年11月08日 19:15:23 +00:00Commented Nov 8, 2020 at 19:15
It's a great question, and the only improvements I can see are superficial ones:
- Your method has incomplete type hints.
alphabet
should beIterable[str]
, and it should return-> Iterable[str]
as well. - PEP8 discourages multi-statement lines (your
return
).
Otherwise, you should reconsider making this a recursive implementation. I see that you were careful to check sys.getrecursionlimit()
in your test code, so you're already aware of the fact that this will break given sufficient depth. Python also does not have tail optimization so an iterative implementation will potentially yield performance improvements.
-
1\$\begingroup\$ I've thought of how to write this without recursion but honestly I wasn't able to come up with anything... Any hints? \$\endgroup\$Anakhand– Anakhand2020年11月07日 10:17:29 +00:00Commented Nov 7, 2020 at 10:17
First... go read itertools. It's in the standard library, you're reinventing some wheels here. Note that there's a long 'recipes' section at the bottom.
In particular, the strings on a given alphabet of length n
is just itertools.combinations_with_replacement(alphabet, r)
.
import itertools
def lexstrings(max_length: int, alphabet=string.ascii_lowercase):
return itertools.chain(itertools.combinations_with_replacement(alphabet, r)
for r in range(0, max_length+1))
or an abstract infinite one:
import itertools
def lexstrings(max_length: int, alphabet=string.ascii_lowercase):
return itertools.chain(itertools.combinations_with_replacement(alphabet, r)
for r in itertools.count())
In the future, you should also know about python 3.3 introducing yield from.
Edit: Last, if it's not lexicographical order, rename it from lexstrings
.
Edit 2: Oh, looks like your original does preserve order. Keep my tips and ignore my (different order) rewrite.
-
1\$\begingroup\$ 1. I know itertools and yield from, if I had found a way (and a need) to use them I would have. 2. Your generators don't generate strings in lexicographical order as defined in the question. \$\endgroup\$Anakhand– Anakhand2020年11月07日 10:14:27 +00:00Commented Nov 7, 2020 at 10:14