I want to encode a non-negative integer to ASCII letters and digits. I wrote below code for that, that works, but which I assume too immature to use in production.
#! /usr/bin/env python3
from string import digits, ascii_letters
def encode(number: int, pool: str) -> str:
"""Encode a non-negative integer as string."""
result = ''
pool_size = len(pool)
while number:
number, remainder = divmod(number, pool_size)
result += pool[remainder]
return result
def main():
size = (1 << (8 * 2)) - 1
codes = set()
for i in range(size):
print(i, code := encode(i, digits+ascii_letters))
codes.add(code)
print('Unique:', len(codes) == size)
if __name__ == '__main__':
main()
I also tried the encoding functions in the base64
library to no avail. Most of them will create strings containing other than the desired characters at some point and b16encode()
has a too small alphabet, resulting in too long codes (no lower-case).
If you can provide a better solution to convert an arbitrary non-negative integer into a string of characters exclusively from string.ascii_letters + string.digits
, that would be great.
Update
Project on GitHub
-
\$\begingroup\$ What is the motivation for doing this? And why are the default base 64 characters unfavourable? \$\endgroup\$Reinderien– Reinderien2022年02月24日 13:46:05 +00:00Commented Feb 24, 2022 at 13:46
-
\$\begingroup\$ The motivation is to encode integer IDs into easy-to-remember and easy-to-type PINs for users within a web application for smartphones. \$\endgroup\$Richard Neumann– Richard Neumann2022年02月24日 13:48:17 +00:00Commented Feb 24, 2022 at 13:48
-
\$\begingroup\$ Is your maximum encodable ID actually sixteen bits? Serially generated IDs will quickly exceed this. \$\endgroup\$Reinderien– Reinderien2022年02月24日 13:51:44 +00:00Commented Feb 24, 2022 at 13:51
-
\$\begingroup\$ Yes, it's currently two bytes. But the en-/decoding code should be generalized to an arbitrary amount of bytes. \$\endgroup\$Richard Neumann– Richard Neumann2022年02月24日 14:03:25 +00:00Commented Feb 24, 2022 at 14:03
1 Answer 1
Your current encode
doesn't have validation, and won't produce meaningful output for number
below 1.
size
and the way you use it in a range
don't entirely make sense; I would sooner expect size = 1 << (8 * 2)
so that 65535 is included. Also your range
should start at 1 and not 0, since 0 produces an empty string. (Or modify your encode
so that 0 does indeed produce a non-empty string.)
Given your goal of
easy-to-remember and easy-to-type PINs
I don't think that digits, ascii_letters
constitute an appropriate pool. Disambiguation of zero and capital O, lower-case L and upper-case I, etc. are problematic. Best to write your own pool literal that avoids these.
Make a judgement call as to which is easier to remember: a shorter string that includes mixed case, or a longer string with one case only. With a pool of 54 upper-case, lower-case and numeric characters, and an ID of at most 16 bits, the maximum encoded length is
$$ \lceil \frac {16 \log2} {\log54} \rceil = 3 $$
In the other direction, if you want to limit the number of encoded characters to 3, the minimum pool size is
$$ \lceil 2^{\frac {16} 3} \rceil = 41 $$
Suggested
import math
from typing import Iterator
LEGIBLE = (
'ABCDEFGHJKLMNPRTUVWXYZ'
'abcdefghijkmnopqrstuvwxyz'
'2346789'
)
# One above the maximum ID.
ID_LIMIT = 1 << (8 * 2)
def encode_each(number: int, pool: str = LEGIBLE) -> Iterator[str]:
if number < 0:
raise ValueError(f'{number} is non-encodable')
pool_size = len(pool)
while True:
number, remainder = divmod(number, pool_size)
yield pool[remainder]
if not number:
break
def encode(number: int, pool: str = LEGIBLE) -> str:
return ''.join(encode_each(number, pool))
def all_codes(size: int = ID_LIMIT) -> Iterator[str]:
for i in range(size):
yield encode(i)
def test() -> None:
codes = tuple(all_codes())
assert len(codes) == 65536
assert len(set(codes)) == 65536
n_symbols = math.ceil(math.log(65536) / math.log(len(LEGIBLE)))
for code in codes:
assert 0 < len(code) <= n_symbols
def demo() -> None:
ids = 0, 1, 100, 1000, 60000, 65535
for i in ids:
print(i, '->', encode(i))
if __name__ == '__main__':
test()
demo()
Output
0 -> A
1 -> B
100 -> zB
1000 -> gW
60000 -> GjY
65535 -> mda
Explore related questions
See similar questions with these tags.