Python 3 arbitrary base converter and base36 encoder

Question 1

This is a script that can convert any numerical value to any arbitrary radix representation and back, well, technically it can convert a number to any base, but because the output is a string representation with each a character as a digit, and possible digits are 10 Arabic numerals plus 26 basic Latin (lowercase) letters, the base can only be an integer from 2 to 36 (including both ends).

The possible digits are 0..9 + a..z, just like how hexadecimal orders them, I used bit shifting to convert to and from a base that is a power of two, and divmod for every other base. In my testing, bit-shifting is actually a bit slower than divmoding, I don't know why.

I have written a function that converts a decimal number to a base between 2 and 36, and another function that converts a number from a base between 2 and 36 to decimal, both functions validate the possibility of conversion before the actual conversion.

I have also written three sets of function that converts binary data to and from base-36.

The first set encode and decode the hexadecimal value of the binary data as a whole, rather than individual bytes, so the result can be more compact.

The second set encode and decode each character as two base-36 bits, with the highest possible two-bit being zz which in decimal is 1295 or (36 ^ 2) - 1, as such it can be accurate up to \u050f or ԏ, but extended ASCII has only 256 code points the second set is sufficient for non-UNICODE characters.

The third set encode and decode each byte as two base-36 bits, it can therefore represent any UNICODE code point, and its output is same as the second set for code points below 1296.

The code:

from string import ascii_lowercase
from string import digits
def log2(n):
 return n.bit_length() - 1
def power_of_2(n):
 return (n & (n-1) == 0) and n != 0
glyphs = digits + ascii_lowercase
def parser(s, base):
 sep = ','
 if base == 60:
 sep = ':'
 elif base == 256:
 sep = '.'
 if 2 <= base <= 36:
 return [glyphs.index(i) for i in s]
 return s.split(sep)
def to_base(num, base):
 if base < 2 or not isinstance(base, int):
 return
 
 sep = ','
 if base == 60:
 sep = ':'
 elif base == 256:
 sep = '.'
 
 if power_of_2(base):
 l = log2(base)
 powers = range(num.bit_length() // l + (num.bit_length() % l != 0))
 places = [(num >> l * i) % base for i in powers]
 
 else:
 if num == 0:
 return ('0', base)
 
 places = []
 while num:
 n, p = divmod(num, base)
 places.append(p)
 num = n
 if base > 36:
 return (sep.join(map(str, reversed(places))), base)
 return (''.join([glyphs[p] for p in reversed(places)]), base)
def from_base(s, base):
 if base < 2 or not isinstance(base, int):
 return
 
 sep = ','
 if base == 60:
 sep = ':'
 elif base == 256:
 sep = '.'
 
 if base <= 36:
 for i in s:
 if glyphs.index(i) >= base:
 return
 
 else:
 for i in s.split(sep):
 if int(i) >= base:
 return
 
 places = parser(s, base)
 
 if power_of_2(base):
 l = log2(base)
 return sum([int(n) << l * p for p, n in enumerate(reversed(places))])
 
 powers = reversed([base ** i for i in range(len(places))])
 return sum(int(a) * b for a, b in zip(places, powers))
def b36encode(s):
 msg = s.encode('utf8').hex()
 n = int(msg, 16)
 return to_base(n, 36)[0]
def b36decode(m):
 n = from_base(m, 36)
 h = hex(n).lstrip('0x')
 return bytes.fromhex(h).decode('utf8')
def b36_encode(s):
 msg = [ord(i) for i in s]
 return ''.join([to_base(n, 36)[0].zfill(2) for n in msg])
def b36_decode(m):
 s = [from_base(n, 36) for n in [m[i:i+2] for i in range(0, len(m), 2)]]
 return ''.join(chr(n) for n in s)
def base36_encode(s):
 msg = s.encode('utf8')
 return ''.join([to_base(n, 36)[0] for n in msg])
def base36_decode(m):
 s = [from_base(n, 36) for n in [m[i:i+2] for i in range(0, len(m), 2)]]
 return bytearray(s).decode('utf8')

Sample usage:

In [282]: to_base(16777215, 3)
Out[282]: ('1011120101000100', 3)
In [283]: from_base(*to_base(16777215, 3))
Out[283]: 16777215
In [284]: to_base(2**32-1, 3)
Out[284]: ('102002022201221111210', 3)
In [285]: from_base(*to_base(2**32-1, 3))
Out[285]: 4294967295
In [286]: from_base(*to_base(2**64-1, 36))
Out[286]: 18446744073709551615
In [287]: to_base(2**64-1, 36)
Out[287]: ('3w5e11264sgsf', 36)
In [288]: to_base(46610, 16)
Out[288]: ('b612', 16)
In [289]: to_base(54, 13)
Out[289]: ('42', 13)
In [290]: from_base('zzz', 36)
Out[290]: 46655
In [291]: from_base('computer', 36)
Out[291]: 993986429283
In [292]: to_base(from_base('computer', 36),36)
Out[292]: ('computer', 36)
In [293]: '\u1800'
Out[293]: '᠀'
In [294]: b36encode('whoami')
Out[294]: '1ajdznl1ex'
In [295]: b36decode(b36encode('whoami'))
Out[295]: 'whoami'
In [296]: b36decode(b36encode('\u1800\u1800\u1800'))
Out[296]: '᠀᠀᠀'
In [297]: base36_decode(base36_encode('\u1800\u1800\u1800'))
Out[297]: '᠀᠀᠀'

What improvement can be made to my code?

Update:

I have improved representation scheme to make it able to represent numbers in any base.

For bases bigger than 36, the resultant string is a mixed radix representation, with the digits represented in their decimal form instead of having a single character representing its value, and a separator delimit the digits.

The separator is determined by the base, for base 60 the separator is a colon (:), similar to how time is represented, for base 256 the separator is a dot (.), similar to how IPv4 addresses are represented, and a comma (,) for any other base.

I have considered using the Greek alphabet after Latin alphabet (after Arabic numerals) to make single character notation representation form able to represent numbers in base-60, however many Greek letters and Latin letters are homoglyphs I abandoned the idea to avoid ambiguity.

Examples:

In [52]: to_base(16777216,64)
Out[52]: ('1,0,0,0,0', 64)
In [53]: to_base(16777215,64)
Out[53]: ('63,63,63,63', 64)
In [54]: to_base(33554431,64)
Out[54]: ('1,63,63,63,63', 64)
In [55]: to_base(86399,60)
Out[55]: ('23:59:59', 60)
In [56]: to_base(86399,365)
Out[56]: ('236,259', 365)
In [57]: to_base(2**32-1,256)
Out[57]: ('255.255.255.255', 256)

Question 2

You realize that from_base can be entirely replaced with a call to int(), right?

Question 3

@Reinderien Of course I do realize from_base can be entirely replaced with a call to int, I even make it accept the same types of positional arguments in the same order, while using int is far more performant than from_base, using int doesn't demonstrate that I know how to do the conversion.

Question 4

If your goal is to have good code, most of this shouldn't exist and you should make better use of built-ins, but it seems like that isn't your goal.

If your goal is to demonstrate that you know how to write functions equivalent to the built-ins, you haven't quite hit your mark, because the built-ins are able to deal with negative numbers and your functions are not.

As for a review laundry-list:

You need PEP484 type hints
glyphs should be capitalized since it's a global constant
You have a bunch of verbatim-repeated code, such as your selection of separators, for which you should factor out functions
parser looks up a separator even when it doesn't need one. Only look up the separator if base is greater than 36.
Do not return on failure; raise instead.
There is no value in returning the base as a second tuple element from to_base; the caller knows what base it's going to be.
lstrip('0x') does not do what you think it does. Hint: what is 'xxx000xxx'.lstrip('0x') ? Use removeprefix instead.
More repeated code in the comprehensions for your b36_decode and base36_decode functions that need to be factored out.
Convert your sample usage into unit tests.

Suggested

from string import ascii_lowercase
from string import digits
from typing import List, Iterable
GLYPHS = digits + ascii_lowercase
def log2(n: int) -> int:
 return n.bit_length() - 1
def power_of_2(n: int) -> bool:
 return (n & (n-1) == 0) and n != 0
def sep_for_base(base: int) -> str:
 return {
 60: ':',
 256: '.',
 }.get(base, ',')
def parser(s: str, base: int) -> List[int]:
 if 2 <= base <= 36:
 return [GLYPHS.index(i) for i in s]
 return s.split(sep_for_base(base))
def to_base(num: int, base: int) -> str:
 if base < 2 or not isinstance(base, int):
 raise ValueError()
 if power_of_2(base):
 l = log2(base)
 powers = range(num.bit_length() // l + (num.bit_length() % l != 0))
 places = [(num >> l * i) % base for i in powers]
 else:
 if num == 0:
 return '0'
 places = []
 while num:
 n, p = divmod(num, base)
 places.append(p)
 num = n
 if base > 36:
 sep = sep_for_base(base)
 return sep.join(map(str, reversed(places)))
 return ''.join([GLYPHS[p] for p in reversed(places)])
def from_base(s: str, base: int) -> int:
 if base < 2 or not isinstance(base, int):
 raise ValueError()
 if base <= 36:
 for i in s:
 if GLYPHS.index(i) >= base:
 raise ValueError()
 else:
 sep = sep_for_base(base)
 for i in s.split(sep):
 if int(i) >= base:
 raise ValueError()
 places = parser(s, base)
 if power_of_2(base):
 l = log2(base)
 return sum([int(n) << l * p for p, n in enumerate(reversed(places))])
 powers = reversed([base ** i for i in range(len(places))])
 return sum(int(a) * b for a, b in zip(places, powers))
def b36encode(s: str) -> str:
 msg = s.encode('utf8').hex()
 n = int(msg, 16)
 return to_base(n, 36)
def b36decode(m: str) -> str:
 n = from_base(m, 36)
 h = hex(n).removeprefix('0x')
 return bytes.fromhex(h).decode('utf8')
def b36_encode(s: str) -> str:
 msg = [ord(i) for i in s]
 return ''.join([to_base(n, 36).zfill(2) for n in msg])
def b36_digits(m: str) -> Iterable[int]:
 for i in range(0, len(m), 2):
 n = m[i:i+2]
 yield from_base(n, 36)
def b36_decode(m) -> str:
 s = b36_digits(m)
 return ''.join(chr(n) for n in s)
def base36_encode(s: str) -> str:
 msg = s.encode('utf8')
 return ''.join([to_base(n, 36) for n in msg])
def base36_decode(m: str) -> str:
 s = b36_digits(m)
 return bytearray(s).decode('utf8')
def assert_from_base(s: str, base: int, exp: int) -> None:
 assert int(s, base) == exp
 assert from_base(s, base) == exp
def round_trip(x: int, base: int, s: str) -> None:
 s_actual = to_base(x, base)
 assert s == s_actual
 assert_from_base(s, base, x)
def test() -> None:
 round_trip(2**24-1, 3, '1011120101000100')
 round_trip(2**32-1, 3, '102002022201221111210')
 round_trip(2**64-1, 36, '3w5e11264sgsf')
 assert to_base(46610, 16) == 'b612'
 assert to_base(54, 13) == '42'
 assert_from_base('zzz', 36, 46655)
 round_trip(993986429283, 36, 'computer')
 assert b36encode('whoami') == '1ajdznl1ex'
 assert b36decode('1ajdznl1ex') == 'whoami'
 assert '᠀᠀᠀' == '\u1800\u1800\u1800'
 assert b36encode('᠀᠀᠀') == 'oedjywcl6i78g0'
 assert b36decode('oedjywcl6i78g0') == '᠀᠀᠀'
 assert base36_encode('᠀᠀᠀') == '694g3k694g3k694g3k'
 assert base36_decode('694g3k694g3k694g3k') == '᠀᠀᠀'
if __name__ == '__main__':
 test()

Question 5

nit: "def parser" should be "def parse"

Reinderien Reinderien 71k5 gold badges76 silver badges256 bronze badges · Accepted Answer · 2021-09-10 20:00:06Z

If your goal is to have good code, most of this shouldn't exist and you should make better use of built-ins, but it seems like that isn't your goal.

If your goal is to demonstrate that you know how to write functions equivalent to the built-ins, you haven't quite hit your mark, because the built-ins are able to deal with negative numbers and your functions are not.

As for a review laundry-list:

You need PEP484 type hints
glyphs should be capitalized since it's a global constant
You have a bunch of verbatim-repeated code, such as your selection of separators, for which you should factor out functions
parser looks up a separator even when it doesn't need one. Only look up the separator if base is greater than 36.
Do not return on failure; raise instead.
There is no value in returning the base as a second tuple element from to_base; the caller knows what base it's going to be.
lstrip('0x') does not do what you think it does. Hint: what is 'xxx000xxx'.lstrip('0x') ? Use removeprefix instead.
More repeated code in the comprehensions for your b36_decode and base36_decode functions that need to be factored out.
Convert your sample usage into unit tests.

Suggested

from string import ascii_lowercase
from string import digits
from typing import List, Iterable
GLYPHS = digits + ascii_lowercase
def log2(n: int) -> int:
 return n.bit_length() - 1
def power_of_2(n: int) -> bool:
 return (n & (n-1) == 0) and n != 0
def sep_for_base(base: int) -> str:
 return {
 60: ':',
 256: '.',
 }.get(base, ',')
def parser(s: str, base: int) -> List[int]:
 if 2 <= base <= 36:
 return [GLYPHS.index(i) for i in s]
 return s.split(sep_for_base(base))
def to_base(num: int, base: int) -> str:
 if base < 2 or not isinstance(base, int):
 raise ValueError()
 if power_of_2(base):
 l = log2(base)
 powers = range(num.bit_length() // l + (num.bit_length() % l != 0))
 places = [(num >> l * i) % base for i in powers]
 else:
 if num == 0:
 return '0'
 places = []
 while num:
 n, p = divmod(num, base)
 places.append(p)
 num = n
 if base > 36:
 sep = sep_for_base(base)
 return sep.join(map(str, reversed(places)))
 return ''.join([GLYPHS[p] for p in reversed(places)])
def from_base(s: str, base: int) -> int:
 if base < 2 or not isinstance(base, int):
 raise ValueError()
 if base <= 36:
 for i in s:
 if GLYPHS.index(i) >= base:
 raise ValueError()
 else:
 sep = sep_for_base(base)
 for i in s.split(sep):
 if int(i) >= base:
 raise ValueError()
 places = parser(s, base)
 if power_of_2(base):
 l = log2(base)
 return sum([int(n) << l * p for p, n in enumerate(reversed(places))])
 powers = reversed([base ** i for i in range(len(places))])
 return sum(int(a) * b for a, b in zip(places, powers))
def b36encode(s: str) -> str:
 msg = s.encode('utf8').hex()
 n = int(msg, 16)
 return to_base(n, 36)
def b36decode(m: str) -> str:
 n = from_base(m, 36)
 h = hex(n).removeprefix('0x')
 return bytes.fromhex(h).decode('utf8')
def b36_encode(s: str) -> str:
 msg = [ord(i) for i in s]
 return ''.join([to_base(n, 36).zfill(2) for n in msg])
def b36_digits(m: str) -> Iterable[int]:
 for i in range(0, len(m), 2):
 n = m[i:i+2]
 yield from_base(n, 36)
def b36_decode(m) -> str:
 s = b36_digits(m)
 return ''.join(chr(n) for n in s)
def base36_encode(s: str) -> str:
 msg = s.encode('utf8')
 return ''.join([to_base(n, 36) for n in msg])
def base36_decode(m: str) -> str:
 s = b36_digits(m)
 return bytearray(s).decode('utf8')
def assert_from_base(s: str, base: int, exp: int) -> None:
 assert int(s, base) == exp
 assert from_base(s, base) == exp
def round_trip(x: int, base: int, s: str) -> None:
 s_actual = to_base(x, base)
 assert s == s_actual
 assert_from_base(s, base, x)
def test() -> None:
 round_trip(2**24-1, 3, '1011120101000100')
 round_trip(2**32-1, 3, '102002022201221111210')
 round_trip(2**64-1, 36, '3w5e11264sgsf')
 assert to_base(46610, 16) == 'b612'
 assert to_base(54, 13) == '42'
 assert_from_base('zzz', 36, 46655)
 round_trip(993986429283, 36, 'computer')
 assert b36encode('whoami') == '1ajdznl1ex'
 assert b36decode('1ajdznl1ex') == 'whoami'
 assert '᠀᠀᠀' == '\u1800\u1800\u1800'
 assert b36encode('᠀᠀᠀') == 'oedjywcl6i78g0'
 assert b36decode('oedjywcl6i78g0') == '᠀᠀᠀'
 assert base36_encode('᠀᠀᠀') == '694g3k694g3k694g3k'
 assert base36_decode('694g3k694g3k694g3k') == '᠀᠀᠀'
if __name__ == '__main__':
 test()

\$\begingroup\$ nit: "def parser" should be "def parse" \$\endgroup\$

milahu
– milahu

2023年06月28日 08:15:04 +00:00
Commented Jun 28, 2023 at 8:15

Stack Exchange Network

Python 3 arbitrary base converter and base36 encoder

1 Answer 1

Suggested

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Python 3 arbitrary base converter and base36 encoder

1 Answer 1

Suggested

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions