This is a script that can convert any numerical value to any arbitrary radix representation and back, well, technically it can convert a number to any base, but because the output is a string representation with each a character as a digit, and possible digits are 10 Arabic numerals plus 26 basic Latin (lowercase) letters, the base can only be an integer from 2 to 36 (including both ends).
The possible digits are 0..9 + a..z, just like how hexadecimal orders them, I used bit shifting to convert to and from a base that is a power of two, and divmod for every other base. In my testing, bit-shifting is actually a bit slower than divmoding, I don't know why.
I have written a function that converts a decimal number to a base between 2 and 36, and another function that converts a number from a base between 2 and 36 to decimal, both functions validate the possibility of conversion before the actual conversion.
I have also written three sets of function that converts binary data to and from base-36.
The first set encode and decode the hexadecimal value of the binary data as a whole, rather than individual bytes, so the result can be more compact.
The second set encode and decode each character as two base-36 bits, with the highest possible two-bit being zz
which in decimal is 1295 or (36 ^ 2) - 1, as such it can be accurate up to \u050f
or ԏ
, but extended ASCII has only 256 code points the second set is sufficient for non-UNICODE characters.
The third set encode and decode each byte as two base-36 bits, it can therefore represent any UNICODE code point, and its output is same as the second set for code points below 1296.
The code:
from string import ascii_lowercase
from string import digits
def log2(n):
return n.bit_length() - 1
def power_of_2(n):
return (n & (n-1) == 0) and n != 0
glyphs = digits + ascii_lowercase
def parser(s, base):
sep = ','
if base == 60:
sep = ':'
elif base == 256:
sep = '.'
if 2 <= base <= 36:
return [glyphs.index(i) for i in s]
return s.split(sep)
def to_base(num, base):
if base < 2 or not isinstance(base, int):
return
sep = ','
if base == 60:
sep = ':'
elif base == 256:
sep = '.'
if power_of_2(base):
l = log2(base)
powers = range(num.bit_length() // l + (num.bit_length() % l != 0))
places = [(num >> l * i) % base for i in powers]
else:
if num == 0:
return ('0', base)
places = []
while num:
n, p = divmod(num, base)
places.append(p)
num = n
if base > 36:
return (sep.join(map(str, reversed(places))), base)
return (''.join([glyphs[p] for p in reversed(places)]), base)
def from_base(s, base):
if base < 2 or not isinstance(base, int):
return
sep = ','
if base == 60:
sep = ':'
elif base == 256:
sep = '.'
if base <= 36:
for i in s:
if glyphs.index(i) >= base:
return
else:
for i in s.split(sep):
if int(i) >= base:
return
places = parser(s, base)
if power_of_2(base):
l = log2(base)
return sum([int(n) << l * p for p, n in enumerate(reversed(places))])
powers = reversed([base ** i for i in range(len(places))])
return sum(int(a) * b for a, b in zip(places, powers))
def b36encode(s):
msg = s.encode('utf8').hex()
n = int(msg, 16)
return to_base(n, 36)[0]
def b36decode(m):
n = from_base(m, 36)
h = hex(n).lstrip('0x')
return bytes.fromhex(h).decode('utf8')
def b36_encode(s):
msg = [ord(i) for i in s]
return ''.join([to_base(n, 36)[0].zfill(2) for n in msg])
def b36_decode(m):
s = [from_base(n, 36) for n in [m[i:i+2] for i in range(0, len(m), 2)]]
return ''.join(chr(n) for n in s)
def base36_encode(s):
msg = s.encode('utf8')
return ''.join([to_base(n, 36)[0] for n in msg])
def base36_decode(m):
s = [from_base(n, 36) for n in [m[i:i+2] for i in range(0, len(m), 2)]]
return bytearray(s).decode('utf8')
Sample usage:
In [282]: to_base(16777215, 3)
Out[282]: ('1011120101000100', 3)
In [283]: from_base(*to_base(16777215, 3))
Out[283]: 16777215
In [284]: to_base(2**32-1, 3)
Out[284]: ('102002022201221111210', 3)
In [285]: from_base(*to_base(2**32-1, 3))
Out[285]: 4294967295
In [286]: from_base(*to_base(2**64-1, 36))
Out[286]: 18446744073709551615
In [287]: to_base(2**64-1, 36)
Out[287]: ('3w5e11264sgsf', 36)
In [288]: to_base(46610, 16)
Out[288]: ('b612', 16)
In [289]: to_base(54, 13)
Out[289]: ('42', 13)
In [290]: from_base('zzz', 36)
Out[290]: 46655
In [291]: from_base('computer', 36)
Out[291]: 993986429283
In [292]: to_base(from_base('computer', 36),36)
Out[292]: ('computer', 36)
In [293]: '\u1800'
Out[293]: '᠀'
In [294]: b36encode('whoami')
Out[294]: '1ajdznl1ex'
In [295]: b36decode(b36encode('whoami'))
Out[295]: 'whoami'
In [296]: b36decode(b36encode('\u1800\u1800\u1800'))
Out[296]: '᠀᠀᠀'
In [297]: base36_decode(base36_encode('\u1800\u1800\u1800'))
Out[297]: '᠀᠀᠀'
What improvement can be made to my code?
Update:
I have improved representation scheme to make it able to represent numbers in any base.
For bases bigger than 36, the resultant string is a mixed radix representation, with the digits represented in their decimal form instead of having a single character representing its value, and a separator delimit the digits.
The separator is determined by the base, for base 60 the separator is a colon (:
), similar to how time is represented, for base 256 the separator is a dot (.
), similar to how IPv4 addresses are represented, and a comma (,
) for any other base.
I have considered using the Greek alphabet after Latin alphabet (after Arabic numerals) to make single character notation representation form able to represent numbers in base-60, however many Greek letters and Latin letters are homoglyphs I abandoned the idea to avoid ambiguity.
Examples:
In [52]: to_base(16777216,64)
Out[52]: ('1,0,0,0,0', 64)
In [53]: to_base(16777215,64)
Out[53]: ('63,63,63,63', 64)
In [54]: to_base(33554431,64)
Out[54]: ('1,63,63,63,63', 64)
In [55]: to_base(86399,60)
Out[55]: ('23:59:59', 60)
In [56]: to_base(86399,365)
Out[56]: ('236,259', 365)
In [57]: to_base(2**32-1,256)
Out[57]: ('255.255.255.255', 256)
1 Answer 1
If your goal is to have good code, most of this shouldn't exist and you should make better use of built-ins, but it seems like that isn't your goal.
If your goal is to demonstrate that you know how to write functions equivalent to the built-ins, you haven't quite hit your mark, because the built-ins are able to deal with negative numbers and your functions are not.
As for a review laundry-list:
- You need PEP484 type hints
glyphs
should be capitalized since it's a global constant- You have a bunch of verbatim-repeated code, such as your selection of separators, for which you should factor out functions
parser
looks up a separator even when it doesn't need one. Only look up the separator if base is greater than 36.- Do not
return
on failure;raise
instead. - There is no value in returning the base as a second tuple element from
to_base
; the caller knows what base it's going to be. lstrip('0x')
does not do what you think it does. Hint: what is'xxx000xxx'.lstrip('0x')
? Useremoveprefix
instead.- More repeated code in the comprehensions for your
b36_decode
andbase36_decode
functions that need to be factored out. - Convert your sample usage into unit tests.
Suggested
from string import ascii_lowercase
from string import digits
from typing import List, Iterable
GLYPHS = digits + ascii_lowercase
def log2(n: int) -> int:
return n.bit_length() - 1
def power_of_2(n: int) -> bool:
return (n & (n-1) == 0) and n != 0
def sep_for_base(base: int) -> str:
return {
60: ':',
256: '.',
}.get(base, ',')
def parser(s: str, base: int) -> List[int]:
if 2 <= base <= 36:
return [GLYPHS.index(i) for i in s]
return s.split(sep_for_base(base))
def to_base(num: int, base: int) -> str:
if base < 2 or not isinstance(base, int):
raise ValueError()
if power_of_2(base):
l = log2(base)
powers = range(num.bit_length() // l + (num.bit_length() % l != 0))
places = [(num >> l * i) % base for i in powers]
else:
if num == 0:
return '0'
places = []
while num:
n, p = divmod(num, base)
places.append(p)
num = n
if base > 36:
sep = sep_for_base(base)
return sep.join(map(str, reversed(places)))
return ''.join([GLYPHS[p] for p in reversed(places)])
def from_base(s: str, base: int) -> int:
if base < 2 or not isinstance(base, int):
raise ValueError()
if base <= 36:
for i in s:
if GLYPHS.index(i) >= base:
raise ValueError()
else:
sep = sep_for_base(base)
for i in s.split(sep):
if int(i) >= base:
raise ValueError()
places = parser(s, base)
if power_of_2(base):
l = log2(base)
return sum([int(n) << l * p for p, n in enumerate(reversed(places))])
powers = reversed([base ** i for i in range(len(places))])
return sum(int(a) * b for a, b in zip(places, powers))
def b36encode(s: str) -> str:
msg = s.encode('utf8').hex()
n = int(msg, 16)
return to_base(n, 36)
def b36decode(m: str) -> str:
n = from_base(m, 36)
h = hex(n).removeprefix('0x')
return bytes.fromhex(h).decode('utf8')
def b36_encode(s: str) -> str:
msg = [ord(i) for i in s]
return ''.join([to_base(n, 36).zfill(2) for n in msg])
def b36_digits(m: str) -> Iterable[int]:
for i in range(0, len(m), 2):
n = m[i:i+2]
yield from_base(n, 36)
def b36_decode(m) -> str:
s = b36_digits(m)
return ''.join(chr(n) for n in s)
def base36_encode(s: str) -> str:
msg = s.encode('utf8')
return ''.join([to_base(n, 36) for n in msg])
def base36_decode(m: str) -> str:
s = b36_digits(m)
return bytearray(s).decode('utf8')
def assert_from_base(s: str, base: int, exp: int) -> None:
assert int(s, base) == exp
assert from_base(s, base) == exp
def round_trip(x: int, base: int, s: str) -> None:
s_actual = to_base(x, base)
assert s == s_actual
assert_from_base(s, base, x)
def test() -> None:
round_trip(2**24-1, 3, '1011120101000100')
round_trip(2**32-1, 3, '102002022201221111210')
round_trip(2**64-1, 36, '3w5e11264sgsf')
assert to_base(46610, 16) == 'b612'
assert to_base(54, 13) == '42'
assert_from_base('zzz', 36, 46655)
round_trip(993986429283, 36, 'computer')
assert b36encode('whoami') == '1ajdznl1ex'
assert b36decode('1ajdznl1ex') == 'whoami'
assert '᠀᠀᠀' == '\u1800\u1800\u1800'
assert b36encode('᠀᠀᠀') == 'oedjywcl6i78g0'
assert b36decode('oedjywcl6i78g0') == '᠀᠀᠀'
assert base36_encode('᠀᠀᠀') == '694g3k694g3k694g3k'
assert base36_decode('694g3k694g3k694g3k') == '᠀᠀᠀'
if __name__ == '__main__':
test()
-
\$\begingroup\$ nit: "def parser" should be "def parse" \$\endgroup\$milahu– milahu2023年06月28日 08:15:04 +00:00Commented Jun 28, 2023 at 8:15
Explore related questions
See similar questions with these tags.
from_base
can be entirely replaced with a call toint()
, right? \$\endgroup\$from_base
can be entirely replaced with a call toint
, I even make it accept the same types of positional arguments in the same order, while usingint
is far more performant thanfrom_base
, usingint
doesn't demonstrate that I know how to do the conversion. \$\endgroup\$