Encode and decode string as base-32 number

Question 1

I wrote a program that can take any English text and transform it into an integer and vice versa.

My method

I use two dictionaries to map between letters and numbers. I chose base 32 because it is a multiple of 2, so I guessed that would be easy to work with.

The pairing is

0 ⟷
1 ⟷ a
⋮
26 ⟷ z
⋮ (some other characters)
31 ⟷ ⍰ (any character not included in 0-30 is coded as 31.)

Concerns

My current encoding method is really slow as it goes through every character and treats each as a base 32, using the reference dictionary to calculate the result.

The decoding method is also not very fast because again it uses lots of divmod() and the number of iterations increase with the length of the string.

How can I improve my code? (Maybe a different algorithm, design choice, other things I should try?)

I am thinking about utilizing the right shift operator or implementing classes but I am not quite sure.

The code

import string
alpha_to_num = {
 " ":0,
 ",":27,
 ".":28,
 "!":29,
 "'":30,
 '"':30,
 "⍰":31,
 "?":31,
}
num_to_alpha = {
 0:"⎵",
 27:",",
 28:".",
 29:"!",
 30:"'",
 31:"⍰",
}
counter = 1
for o in string.ascii_lowercase:
 alpha_to_num[o] = counter
 num_to_alpha[counter] = o
 counter+=1
def encode(string:str):
 string = string.lower()
 _sum = 0
 _pow = 0
 for o in string:
 if o in alpha_to_num:
 index = alpha_to_num[o]
 else:
 index = 31
 _sum += 32**_pow*index
 _pow += 1
 return _sum
def decode(integer:int):
 string = []
 while True:
 if not integer:
 break
 integer, _chr = divmod(integer,32)
 string.append(num_to_alpha[_chr])
 return ''.join(string)
# example usage
foo = encode("lol")
print(foo)
print(decode(foo))

Question 2

The problem with encoding text as an integer is that Python integers become much slower to work with as they get longer, because arbitrary-precision numbers inevitably slow as they scale.

This is why crypto algorithms such as RSA or elliptic-curve cryptography are not used directly to encrypt text, but instead operate on the shorter values that hold symmetric-encryption keys, and those keys are used for the actual encryption of the text.

A specific problem occurs here:

_pow = 0
for o in string:
 ⋮
 _sum += 32**_pow*index
 _pow += 1

Instead of computing 32**_pow from scratch each time around the loop, it's more efficient to remember that value and multiply it by 32 each time:

multiplier = 1
for o in string:
 ⋮
 sum += multiplier * index
 multiplier *= 32

This still won't eliminate the inefficiencies of using big integers, though.

Some other issues:

Use whitespace around operators. counter+=1 shouldn't be packed together like that; it makes it hard to read for Python people. 32**_pow*index is particularly hard to unpack!
Don't use initial underscore for local variables. There are Python conventions for using initial underscore in class members, so they shouldn't be used in other places. Other variable names seem mostly okay (although string is a poor choice, as it's also the name of a data type).
Use a plain array type (probably a tuple) for mapping from integer.
Use a defaultdict (or dict.get() with a default) to avoid throwing an exception when encoding text with a character (such as @) that isn't in alpha_to_num.
~~(削除) Instead of gathering characters into a list, and then join()ing them into a string, why not gather into a string directly? (削除ここまで)~~
A string that ends with spaces will lose them in the round-trip. Think about how you could avoid that.

Question 3

Gather into a list and then using join is the preferred way to build strings. Using += on a string in a loop generally results in quadratic runtime. In some cases, CPython can optimise it in a way that results in a linear runtime, but it's best not to depend on that.

Question 4

How does list.append() avoid the quadratic performance problem that affects string.__iadd__()? I would have thought them to have similar implementations, and now I'm intrigued.

Question 5

Because in Python lists are mutable, if you call append, lists can allocate more memory than required on the assumption it may be used later, which if you append items in a loop means you need significantly fewer reallocations. So a single call to list.append may or may not be (relatively) expensive, but the amortized time for that method is O(1), while s1 += s2 where they are both strings always needs a reallocation, which is at least O(len(s1 + s2)).

Question 6

Thanks Jasmijn - that's a good explanation. I'd forgotten that Python strings are immutable (perhaps I've done too much C++ lately).

Question 7

Base-32 to Integer

Enumerate

Your loop ...

 _sum = 0
 _pow = 0
 for o in string:
 if o in alpha_to_num:
 index = alpha_to_num[o]
 else:
 index = 31
 _sum += 32**_pow*index
 _pow += 1

... can be greatly simplified. First, instead of manually maintaining the _pow counter, use Python's enumerate method, which does this for you.

 _sum = 0
 for _pow, o in enumerate(string):
 if o in alpha_to_num:
 index = alpha_to_num[o]
 else:
 index = 31
 _sum += 32**_pow*index

get with default

Next, use a dict.get(key [,default]) to eliminate the if o in alpha_to_num: ... else: ... statement.

 _sum = 0
 for _pow, o in enumerate(string):
 index = alpha_to_num.get(o, 31)
 _sum += 32**_pow*index

sum

Finally, you have a sum of a simple loop iteration, so we can now replace this with a single sum(...) statement:

 _sum = sum(32 ** _pow * alpha_to_num.get(o, 31) for _pow, o in enumerate(string))

built-in

Python has a built-in function for converting base-2 through base-36 strings into integers. It is the int(value, base) function. If we passed in a suitable value, then Python will perform the required \$\Sigma_i(d_i * 32^i)\$ calculation itself.

Reworked code

First, you want to transform your input string into a base-32 equivalent string according to the encoding:

abcdefghijklmnopqrstuvwxyz,.!'? ⇒ 0123456789abcdefghijklmnopqrstuv

Plus uppercase letters to the same values as the lowercase ones, plus double-quote to the same as a single-quote, plus any unknown values to the equivalent as the question-mark. The str.translate(table) built-in method will do this for us in one step, provided we give it the proper table. The table must map from the ordinal character values to the replacement strings. We'll use a defaultdict to return 'v' (the encoding for a '?') for any unknown characters.

from collections import defaultdict
PLAIN = " abcdefghijklmnopqrstuvwxyz,.!'?"
CRYPT = "0123456789abcdefghijklmnopqrstuv"
ENCODE = defaultdict(lambda: 'v')
ENCODE.update((ord(c), p) for c, p in zip(PLAIN, CRYPT))
ENCODE.update((ord(c), p) for c, p in zip(PLAIN.upper(), CRYPT))
ENCODE[ord('"')] = 'u' # Encode double-quote to same as single quote.
def encode(msg: str) -> int:
 encoded = msg.translate(ENCODE)
 ...

With the string properly translated to a base-32 string, we can now use int(..., 32) to convert it to an integer. The only catch is we need to reverse the string, since the original encode method uses a little-endian base-32 to integer conversion.

 return int(encoded[::-1], 32)

Update

The above code will crash with a ValueError if given an empty string. We can prevent this by changing encoded to "0" when encoding an empty string.

 encoded = msg.translate(ENCODE) or "0"

Integer to Base-32

Unfortunately, there is no simple int-to-base-n string conversion function built-in to Python. If there was, we could use the same str.translate(table) function to convert it to the desired output text. We can even use the str.maketrans(x[, y[, z]]) to create the necessary table for us, because we have no infinite set of unknown values to convert to a specific value.

CRYPT = "0123456789abcdefghijklmnopqrstuv"
KEY32 = "⎵abcdefghijklmnopqrstuvwxyz,.!'⍰"
DECODE = str.maketrans(CRYPT, KEY32)
def decode(code: int) -> str:
 encoded = ...
 return encoded[::-1].translate(DECODE)

I said no simple int-to-base-n string conversion function built-in to Python. There are base32 encoding and decoding functions built-in to the base64 standard library. We can use that to do the heavy lifting.

But first, we need to know how long our message is going to be ... which is the number of 5-bit groups in the integer message:

 length = (code.bit_length() + 4) // 5

Then we'll need to pad the message to an integral number of bytes.

 bits = 5 * length
 if bits % 8 != 0:
 pad = 8 - bits % 8
 bits += pad
 code <<= pad

At this point, we can turn the integer into a series of bytes:

 data = code.to_bytes(bits // 8, 'big')

Now we can use the base64.b32hexencode() method to convert that into a series of characters 0-9 and A-V representing 5-bit groupings of those bytes. The data is padded with equal characters (=) at the end to ensure a consistent multiple of 40 bit structure, so we need to truncate the result to length characters to get rid of the padding. Also, b32hexencode uses uppercase characters, where as we want lowercase, so we need to call .lower() on the result. Finally, the output is a bytes object where as we desire a str, so we'll want to .decode("utf-8") the result:

 encoded = base64.b32hexencode(data)[:length].lower().decode("utf-8")

As I said, it wasn't simple.

Reworked code

Changing the CRYPT to uppercase allows us to remove a .lower() call, for slightly simpler code.

from base64 import b32hexencode
from collections import defaultdict
PLAIN = " abcdefghijklmnopqrstuvwxyz,.!'?"
CRYPT = "0123456789ABCDEFGHIJKLMNOPQRSTUV"
KEY32 = "⎵abcdefghijklmnopqrstuvwxyz,.!'⍰"
ENCODE = defaultdict(lambda: 'V')
ENCODE.update((ord(c), p) for c, p in zip(PLAIN, CRYPT))
ENCODE.update((ord(c), p) for c, p in zip(PLAIN.upper(), CRYPT))
ENCODE[ord('"')] = 'U'
DECODE = str.maketrans(CRYPT, KEY32)
def encode(msg: str) -> int:
 encoded = msg.translate(ENCODE) or "0"
 return int(encoded[::-1], 32)
def decode(code: int) -> str:
 length = (code.bit_length() + 4) // 5
 bits = 5 * length
 if bits % 8 != 0:
 pad = 8 - bits % 8
 bits += pad
 code <<= pad
 
 data = code.to_bytes(bits // 8, 'big')
 encoded = b32hexencode(data)[:length].decode("utf-8")
 
 return encoded[::-1].translate(DECODE)
if __name__ == '__main__':
 msg = '''The quick "brown" fox jumped 'over' the lazy dog!'''
 code = encode(msg)
 decrypted = decode(code)
 print(code)
 print(decrypted)
 from timeit import timeit
 t_enc = timeit('encode(msg)', globals=globals())
 print(f"Encryption takes: {t_enc:.2f} μs")
 t_dec = timeit('decode(code)', globals=globals())
 print(f"Decryption takes: {t_dec:.2f} μs")

Output:

51651161261624696839988483915180468301590365415316947842058570259607065876
the⎵quick⎵'brown'⎵fox⎵jumped⎵'over'⎵the⎵lazy⎵dog!
Encryption takes: 1.54 μs
Decryption takes: 7.78 μs

Toby Speight Toby Speight 87.9k14 gold badges104 silver badges325 bronze badges · Answer 1 · 2022-08-17 07:23:44Z

The problem with encoding text as an integer is that Python integers become much slower to work with as they get longer, because arbitrary-precision numbers inevitably slow as they scale.

This is why crypto algorithms such as RSA or elliptic-curve cryptography are not used directly to encrypt text, but instead operate on the shorter values that hold symmetric-encryption keys, and those keys are used for the actual encryption of the text.

A specific problem occurs here:

_pow = 0
for o in string:
 ⋮
 _sum += 32**_pow*index
 _pow += 1

Instead of computing 32**_pow from scratch each time around the loop, it's more efficient to remember that value and multiply it by 32 each time:

multiplier = 1
for o in string:
 ⋮
 sum += multiplier * index
 multiplier *= 32

This still won't eliminate the inefficiencies of using big integers, though.

Some other issues:

Use whitespace around operators. counter+=1 shouldn't be packed together like that; it makes it hard to read for Python people. 32**_pow*index is particularly hard to unpack!
Don't use initial underscore for local variables. There are Python conventions for using initial underscore in class members, so they shouldn't be used in other places. Other variable names seem mostly okay (although string is a poor choice, as it's also the name of a data type).
Use a plain array type (probably a tuple) for mapping from integer.
Use a defaultdict (or dict.get() with a default) to avoid throwing an exception when encoding text with a character (such as @) that isn't in alpha_to_num.
~~(削除) Instead of gathering characters into a list, and then join()ing them into a string, why not gather into a string directly? (削除ここまで)~~
A string that ends with spaces will lose them in the round-trip. Think about how you could avoid that.

Gather into a list and then using join is the preferred way to build strings. Using += on a string in a loop generally results in quadratic runtime. In some cases, CPython can optimise it in a way that results in a linear runtime, but it's best not to depend on that.
How does list.append() avoid the quadratic performance problem that affects string.__iadd__()? I would have thought them to have similar implementations, and now I'm intrigued.
Because in Python lists are mutable, if you call append, lists can allocate more memory than required on the assumption it may be used later, which if you append items in a loop means you need significantly fewer reallocations. So a single call to list.append may or may not be (relatively) expensive, but the amortized time for that method is O(1), while s1 += s2 where they are both strings always needs a reallocation, which is at least O(len(s1 + s2)).
Thanks Jasmijn - that's a good explanation. I'd forgotten that Python strings are immutable (perhaps I've done too much C++ lately).

AJNeufeld AJNeufeld 35.3k5 gold badges41 silver badges103 bronze badges · Answer 2 · 2022-08-17 19:42:26Z

Base-32 to Integer

Enumerate

Your loop ...

 _sum = 0
 _pow = 0
 for o in string:
 if o in alpha_to_num:
 index = alpha_to_num[o]
 else:
 index = 31
 _sum += 32**_pow*index
 _pow += 1

... can be greatly simplified. First, instead of manually maintaining the _pow counter, use Python's enumerate method, which does this for you.

 _sum = 0
 for _pow, o in enumerate(string):
 if o in alpha_to_num:
 index = alpha_to_num[o]
 else:
 index = 31
 _sum += 32**_pow*index

get with default

Next, use a dict.get(key [,default]) to eliminate the if o in alpha_to_num: ... else: ... statement.

 _sum = 0
 for _pow, o in enumerate(string):
 index = alpha_to_num.get(o, 31)
 _sum += 32**_pow*index

sum

Finally, you have a sum of a simple loop iteration, so we can now replace this with a single sum(...) statement:

 _sum = sum(32 ** _pow * alpha_to_num.get(o, 31) for _pow, o in enumerate(string))

built-in

Python has a built-in function for converting base-2 through base-36 strings into integers. It is the int(value, base) function. If we passed in a suitable value, then Python will perform the required \$\Sigma_i(d_i * 32^i)\$ calculation itself.

Reworked code

First, you want to transform your input string into a base-32 equivalent string according to the encoding:

abcdefghijklmnopqrstuvwxyz,.!'? ⇒ 0123456789abcdefghijklmnopqrstuv

Plus uppercase letters to the same values as the lowercase ones, plus double-quote to the same as a single-quote, plus any unknown values to the equivalent as the question-mark. The str.translate(table) built-in method will do this for us in one step, provided we give it the proper table. The table must map from the ordinal character values to the replacement strings. We'll use a defaultdict to return 'v' (the encoding for a '?') for any unknown characters.

from collections import defaultdict
PLAIN = " abcdefghijklmnopqrstuvwxyz,.!'?"
CRYPT = "0123456789abcdefghijklmnopqrstuv"
ENCODE = defaultdict(lambda: 'v')
ENCODE.update((ord(c), p) for c, p in zip(PLAIN, CRYPT))
ENCODE.update((ord(c), p) for c, p in zip(PLAIN.upper(), CRYPT))
ENCODE[ord('"')] = 'u' # Encode double-quote to same as single quote.
def encode(msg: str) -> int:
 encoded = msg.translate(ENCODE)
 ...

With the string properly translated to a base-32 string, we can now use int(..., 32) to convert it to an integer. The only catch is we need to reverse the string, since the original encode method uses a little-endian base-32 to integer conversion.

 return int(encoded[::-1], 32)

Update

The above code will crash with a ValueError if given an empty string. We can prevent this by changing encoded to "0" when encoding an empty string.

 encoded = msg.translate(ENCODE) or "0"

Integer to Base-32

Unfortunately, there is no simple int-to-base-n string conversion function built-in to Python. If there was, we could use the same str.translate(table) function to convert it to the desired output text. We can even use the str.maketrans(x[, y[, z]]) to create the necessary table for us, because we have no infinite set of unknown values to convert to a specific value.

CRYPT = "0123456789abcdefghijklmnopqrstuv"
KEY32 = "⎵abcdefghijklmnopqrstuvwxyz,.!'⍰"
DECODE = str.maketrans(CRYPT, KEY32)
def decode(code: int) -> str:
 encoded = ...
 return encoded[::-1].translate(DECODE)

I said no simple int-to-base-n string conversion function built-in to Python. There are base32 encoding and decoding functions built-in to the base64 standard library. We can use that to do the heavy lifting.

But first, we need to know how long our message is going to be ... which is the number of 5-bit groups in the integer message:

 length = (code.bit_length() + 4) // 5

Then we'll need to pad the message to an integral number of bytes.

 bits = 5 * length
 if bits % 8 != 0:
 pad = 8 - bits % 8
 bits += pad
 code <<= pad

At this point, we can turn the integer into a series of bytes:

 data = code.to_bytes(bits // 8, 'big')

Now we can use the base64.b32hexencode() method to convert that into a series of characters 0-9 and A-V representing 5-bit groupings of those bytes. The data is padded with equal characters (=) at the end to ensure a consistent multiple of 40 bit structure, so we need to truncate the result to length characters to get rid of the padding. Also, b32hexencode uses uppercase characters, where as we want lowercase, so we need to call .lower() on the result. Finally, the output is a bytes object where as we desire a str, so we'll want to .decode("utf-8") the result:

 encoded = base64.b32hexencode(data)[:length].lower().decode("utf-8")

As I said, it wasn't simple.

Reworked code

Changing the CRYPT to uppercase allows us to remove a .lower() call, for slightly simpler code.

from base64 import b32hexencode
from collections import defaultdict
PLAIN = " abcdefghijklmnopqrstuvwxyz,.!'?"
CRYPT = "0123456789ABCDEFGHIJKLMNOPQRSTUV"
KEY32 = "⎵abcdefghijklmnopqrstuvwxyz,.!'⍰"
ENCODE = defaultdict(lambda: 'V')
ENCODE.update((ord(c), p) for c, p in zip(PLAIN, CRYPT))
ENCODE.update((ord(c), p) for c, p in zip(PLAIN.upper(), CRYPT))
ENCODE[ord('"')] = 'U'
DECODE = str.maketrans(CRYPT, KEY32)
def encode(msg: str) -> int:
 encoded = msg.translate(ENCODE) or "0"
 return int(encoded[::-1], 32)
def decode(code: int) -> str:
 length = (code.bit_length() + 4) // 5
 bits = 5 * length
 if bits % 8 != 0:
 pad = 8 - bits % 8
 bits += pad
 code <<= pad
 
 data = code.to_bytes(bits // 8, 'big')
 encoded = b32hexencode(data)[:length].decode("utf-8")
 
 return encoded[::-1].translate(DECODE)
if __name__ == '__main__':
 msg = '''The quick "brown" fox jumped 'over' the lazy dog!'''
 code = encode(msg)
 decrypted = decode(code)
 print(code)
 print(decrypted)
 from timeit import timeit
 t_enc = timeit('encode(msg)', globals=globals())
 print(f"Encryption takes: {t_enc:.2f} μs")
 t_dec = timeit('decode(code)', globals=globals())
 print(f"Decryption takes: {t_dec:.2f} μs")

Output:

51651161261624696839988483915180468301590365415316947842058570259607065876
the⎵quick⎵'brown'⎵fox⎵jumped⎵'over'⎵the⎵lazy⎵dog!
Encryption takes: 1.54 μs
Decryption takes: 7.78 μs

Stack Exchange Network

Encode and decode string as base-32 number

My method

Concerns

The code

2 Answers 2

Base-32 to Integer

Enumerate

get with default

sum

built-in

Reworked code

Integer to Base-32

Reworked code

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Encode and decode string as base-32 number

My method

Concerns

The code

2 Answers 2

Base-32 to Integer

Enumerate

get with default

sum

built-in

Reworked code

Integer to Base-32

Reworked code

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions