Lazy generator of strings in lexicographical order

Question 1

I needed to write a Python generator that lazily generates, in lexicographical order, strings composed of a certain alphabet (e.g. lowercase English letters).

My first thought was to make an infinite generator, but an infinite iterator would be pretty useless because it would just generate strings composed of only the character a. This is because the lexicographical ordering of strings is not a well-order; it can be thought of as composed of an infinite sequence of infinitely nested sequences: (a, (aa, ...), (ab, ...), ...), (b, (ba, ...), (bb, ...), ...), ... The generator would never reach ab since it has an infinite amount of predecessors.

Thus I went for limiting the maximum length of the strings. I.e., the full specification is a generator that lazily generates all strings in a given alphabet of up to a certain length in lexicographical order. E.g., for a maximum length of 3 with ASCII lowercase letters, it should generate, in this order: "", "a", "aa", "aaa", "aab", ..., "aaz", "ab", "aba", ..., "zzz".

The solution I came up with is elegant in the abstract; it uses the intrinsic recursive relationship of lexicographical ordering: if you take the set of all strings—including infinite strings—and remove the first character from each, you're still left with the set of all strings. But it's probably not very efficient in practice, as for each iteration it uses Python's string concatenation operation + \$m\$ times to build a string of length \$m\$ (\$m - 1\$ times in between characters plus the "terminating" empty string, as you'll see below), and strings are immutable so copies are made at every intermediate step, resulting in \$O(m^2)\$ time complexity at each iteration.

Here is my solution:

import string
def lexstrings(max_length: int, alphabet=string.ascii_lowercase):
 yield ""
 if max_length == 0: return
 for first in alphabet:
 for suffix in lexstrings(max_length - 1, alphabet=alphabet):
 yield first + suffix

Example:

>>> g = lexstrings(max_length=3, alphabet="ab")
>>> list(g)
['',
 'a',
 'aa',
 'aaa',
 'aab',
 'ab',
 'aba',
 'abb',
 'b',
 'ba',
 'baa',
 'bab',
 'bb',
 'bba',
 'bbb']

This implementation also "supports" the infinite version by supplying a negative maximum length:

>>> g = lexstrings(-1)
>>> next(g)
''
>>> next(g)
'a'
>>> next(g)
'aa'
>>> next(g)
'aaa'
...

Some benchmarking confirms a single iteration of this is quadratic in terms of the string size. Here is a plot of execution time for generating a string of maximum length (in a single iteration) vs the maximum length:

Benchmarks show quadratic time complexity per iteration

The code to generate this plot using perfplot follows. (I modified the order of generation so that the first iteration yields a max-length string.)

import string
import perfplot
import sys
import numpy as np
def lexstrings(max_length: int, alphabet=string.ascii_lowercase):
 if max_length == 0:
 return
 for first in alphabet:
 for suffix in lexstrings(max_length - 1, alphabet=alphabet):
 yield first + suffix
 yield ""
n_range = np.linspace(1, sys.getrecursionlimit() - 500)
benchmark = perfplot.bench(
 setup=lambda n: n,
 kernels=[
 lambda n: next(lexstrings(n)),
 ],
 labels=[
 "lexstrings",
 ],
 n_range=n_range,
 xlabel="maximum string length",
 title="Single iteration (max-length string)",
 target_time_per_measurement=0.1,
 equality_check=None,
)
benchmark.show(logx=False, logy=False)

Question 2

Hey, why don't you plot graphs of alphabet size and max length against execution time, then you'll be able to see exactly what complexity these 2 variables contribute

Question 3

@Greedo Good idea, I've added a plot for time vs length of generated string. (The alphabet size only influences the number of iterations, not the time each takes to execute, so I skipped that.)

Question 4

One approach to avoid string copying and improve the time complexity from \$\Theta(m^2)\$ to \$\Theta(m)\$ (\$m\$ is the output string length) for each output string is to store characters of strings in a list and modify it in-place.

from typing import List, Iterable
def lex_string_gen_recursive(cur_depth: int, ch_list: List[str], alphabet: Iterable[str]) -> Iterable[str]:
 yield "".join(ch_list)
 if cur_depth < len(ch_list):
 next_depth = cur_depth + 1
 for ch in alphabet:
 ch_list[cur_depth] = ch
 yield from lex_string_gen_recursive(next_depth, ch_list, alphabet)
 ch_list[cur_depth] = ""
def lex_string_gen(max_length: int, alphabet: Iterable[str]) -> Iterable[str]:
 yield from lex_string_gen_recursive(0, [""] * max_length, sorted(alphabet))

Test:

list(lex_string_gen(3, "ab"))

Output:

['',
 'a',
 'aa',
 'aaa',
 'aab',
 'ab',
 'aba',
 'abb',
 'b',
 'ba',
 'baa',
 'bab',
 'bb',
 'bba',
 'bbb']

Question 5

On a possible non-recursive solution, as a possible improvement (per Reinderien's answer).

The general idea here is that if we build a tree of these strings in the obvious way*, then the lexicographical ordering is nothing but a DFS on this tree.

*Let the root be the empty string, then its neighbours are the one character long strings for each character of the alphabet, and so on. For example on the alphabet ab with max_length = 2, the tree would look like the following:

<root>
|
+- a
| |
| +- aa
| |
| `- ab
|
`- b
 |
 +- ba
 |
 `- bb

Then we can sort of build the tree while doing the DFS like so:

def lexstrings(max_length=3, alphabet="ab"):
 reversed_alphabet = list(reversed(alphabet))
 nodes_to_visit = ['']
 while nodes_to_visit:
 current_node = nodes_to_visit.pop()
 
 if len(current_node) > max_length:
 continue
 yield current_node
 nodes_to_visit.extend(current_node + tc for tc in reversed_alphabet)

Question 6

Ah! So basically a DFS on a "sorted" trie?

Question 7

How about starting with nodes_to_visit = ['']? Less code, and would support even negative max_length (at the moment you do yield '' for that).

Question 8

@Anakhand I'm not sure, but I have added an example. I hope it makes my intentions clearer.

Question 9

It's a great question, and the only improvements I can see are superficial ones:

Your method has incomplete type hints. alphabet should be Iterable[str], and it should return -> Iterable[str] as well.
PEP8 discourages multi-statement lines (your return).

Otherwise, you should reconsider making this a recursive implementation. I see that you were careful to check sys.getrecursionlimit() in your test code, so you're already aware of the fact that this will break given sufficient depth. Python also does not have tail optimization so an iterative implementation will potentially yield performance improvements.

Question 10

I've thought of how to write this without recursion but honestly I wasn't able to come up with anything... Any hints?

Question 11

First... go read itertools. It's in the standard library, you're reinventing some wheels here. Note that there's a long 'recipes' section at the bottom.

In particular, the strings on a given alphabet of length n is just itertools.combinations_with_replacement(alphabet, r).

import itertools
def lexstrings(max_length: int, alphabet=string.ascii_lowercase):
 return itertools.chain(itertools.combinations_with_replacement(alphabet, r) 
 for r in range(0, max_length+1))

or an abstract infinite one:

import itertools
def lexstrings(max_length: int, alphabet=string.ascii_lowercase):
 return itertools.chain(itertools.combinations_with_replacement(alphabet, r) 
 for r in itertools.count())

In the future, you should also know about python 3.3 introducing yield from.

Edit: Last, if it's not lexicographical order, rename it from lexstrings. Edit 2: Oh, looks like your original does preserve order. Keep my tips and ignore my (different order) rewrite.

Question 12

1. I know itertools and yield from, if I had found a way (and a need) to use them I would have. 2. Your generators don't generate strings in lexicographical order as defined in the question.

GZ0 2,3618 silver badges19 bronze badges · Answer 1 · 2020-11-08 13:21:36Z

One approach to avoid string copying and improve the time complexity from \$\Theta(m^2)\$ to \$\Theta(m)\$ (\$m\$ is the output string length) for each output string is to store characters of strings in a list and modify it in-place.

from typing import List, Iterable
def lex_string_gen_recursive(cur_depth: int, ch_list: List[str], alphabet: Iterable[str]) -> Iterable[str]:
 yield "".join(ch_list)
 if cur_depth < len(ch_list):
 next_depth = cur_depth + 1
 for ch in alphabet:
 ch_list[cur_depth] = ch
 yield from lex_string_gen_recursive(next_depth, ch_list, alphabet)
 ch_list[cur_depth] = ""
def lex_string_gen(max_length: int, alphabet: Iterable[str]) -> Iterable[str]:
 yield from lex_string_gen_recursive(0, [""] * max_length, sorted(alphabet))

Test:

list(lex_string_gen(3, "ab"))

Output:

['',
 'a',
 'aa',
 'aaa',
 'aab',
 'ab',
 'aba',
 'abb',
 'b',
 'ba',
 'baa',
 'bab',
 'bb',
 'bba',
 'bbb']

Eman Yalpsid 1,56911 silver badges16 bronze badges · Answer 2 · 2020-11-08 11:53:18Z

On a possible non-recursive solution, as a possible improvement (per Reinderien's answer).

The general idea here is that if we build a tree of these strings in the obvious way*, then the lexicographical ordering is nothing but a DFS on this tree.

*Let the root be the empty string, then its neighbours are the one character long strings for each character of the alphabet, and so on. For example on the alphabet ab with max_length = 2, the tree would look like the following:

<root>
|
+- a
| |
| +- aa
| |
| `- ab
|
`- b
 |
 +- ba
 |
 `- bb

Then we can sort of build the tree while doing the DFS like so:

def lexstrings(max_length=3, alphabet="ab"):
 reversed_alphabet = list(reversed(alphabet))
 nodes_to_visit = ['']
 while nodes_to_visit:
 current_node = nodes_to_visit.pop()
 
 if len(current_node) > max_length:
 continue
 yield current_node
 nodes_to_visit.extend(current_node + tc for tc in reversed_alphabet)

How about starting with nodes_to_visit = ['']? Less code, and would support even negative max_length (at the moment you do yield '' for that).
@Anakhand I'm not sure, but I have added an example. I hope it makes my intentions clearer.

Reinderien 71.2k5 gold badges76 silver badges257 bronze badges · Answer 3 · 2020-11-06 18:40:08Z

It's a great question, and the only improvements I can see are superficial ones:

Your method has incomplete type hints. alphabet should be Iterable[str], and it should return -> Iterable[str] as well.
PEP8 discourages multi-statement lines (your return).

Otherwise, you should reconsider making this a recursive implementation. I see that you were careful to check sys.getrecursionlimit() in your test code, so you're already aware of the fact that this will break given sufficient depth. Python also does not have tail optimization so an iterative implementation will potentially yield performance improvements.

I've thought of how to write this without recursion but honestly I wasn't able to come up with anything... Any hints?

Zachary Vance 1,5739 silver badges20 bronze badges · Answer 4 · 2020-11-07 06:59:12Z

First... go read itertools. It's in the standard library, you're reinventing some wheels here. Note that there's a long 'recipes' section at the bottom.

In particular, the strings on a given alphabet of length n is just itertools.combinations_with_replacement(alphabet, r).

import itertools
def lexstrings(max_length: int, alphabet=string.ascii_lowercase):
 return itertools.chain(itertools.combinations_with_replacement(alphabet, r) 
 for r in range(0, max_length+1))

or an abstract infinite one:

import itertools
def lexstrings(max_length: int, alphabet=string.ascii_lowercase):
 return itertools.chain(itertools.combinations_with_replacement(alphabet, r) 
 for r in itertools.count())

In the future, you should also know about python 3.3 introducing yield from.

Edit: Last, if it's not lexicographical order, rename it from lexstrings. Edit 2: Oh, looks like your original does preserve order. Keep my tips and ignore my (different order) rewrite.

1. I know itertools and yield from, if I had found a way (and a need) to use them I would have. 2. Your generators don't generate strings in lexicographical order as defined in the question.

Stack Exchange Network

Lazy generator of strings in lexicographical order

4 Answers 4

You must log in to answer this question.

Hot Network Questions

Lazy generator of strings in lexicographical order

4 Answers 4

You must log in to answer this question.

Related

Hot Network Questions