I wrote a program that can take any English text and transform it into an integer and vice versa.
My method
I use two dictionaries to map between letters and numbers. I chose base 32 because it is a multiple of 2, so I guessed that would be easy to work with.
The pairing is
0
⟷1
⟷a
- ⋮
26
⟷z
- ⋮ (some other characters)
31
⟷⍰
(any character not included in 0-30 is coded as 31.)
Concerns
My current encoding method is really slow as it goes through every character and treats each as a base 32, using the reference dictionary to calculate the result.
The decoding method is also not very fast because again it uses lots of divmod()
and the number of iterations increase with the length of the string.
How can I improve my code? (Maybe a different algorithm, design choice, other things I should try?)
I am thinking about utilizing the right shift operator or implementing classes but I am not quite sure.
The code
import string
alpha_to_num = {
" ":0,
",":27,
".":28,
"!":29,
"'":30,
'"':30,
"⍰":31,
"?":31,
}
num_to_alpha = {
0:"⎵",
27:",",
28:".",
29:"!",
30:"'",
31:"⍰",
}
counter = 1
for o in string.ascii_lowercase:
alpha_to_num[o] = counter
num_to_alpha[counter] = o
counter+=1
def encode(string:str):
string = string.lower()
_sum = 0
_pow = 0
for o in string:
if o in alpha_to_num:
index = alpha_to_num[o]
else:
index = 31
_sum += 32**_pow*index
_pow += 1
return _sum
def decode(integer:int):
string = []
while True:
if not integer:
break
integer, _chr = divmod(integer,32)
string.append(num_to_alpha[_chr])
return ''.join(string)
# example usage
foo = encode("lol")
print(foo)
print(decode(foo))
2 Answers 2
The problem with encoding text as an integer is that Python integers become much slower to work with as they get longer, because arbitrary-precision numbers inevitably slow as they scale.
This is why crypto algorithms such as RSA or elliptic-curve cryptography are not used directly to encrypt text, but instead operate on the shorter values that hold symmetric-encryption keys, and those keys are used for the actual encryption of the text.
A specific problem occurs here:
_pow = 0
for o in string:
⋮
_sum += 32**_pow*index
_pow += 1
Instead of computing 32**_pow
from scratch each time around the loop, it's more efficient to remember that value and multiply it by 32 each time:
multiplier = 1
for o in string:
⋮
sum += multiplier * index
multiplier *= 32
This still won't eliminate the inefficiencies of using big integers, though.
Some other issues:
- Use whitespace around operators.
counter+=1
shouldn't be packed together like that; it makes it hard to read for Python people.32**_pow*index
is particularly hard to unpack! - Don't use initial underscore for local variables. There are Python conventions for using initial underscore in class members, so they shouldn't be used in other places. Other variable names seem mostly okay (although
string
is a poor choice, as it's also the name of a data type). - Use a plain array type (probably a tuple) for mapping from integer.
- Use a
defaultdict
(ordict.get()
with a default) to avoid throwing an exception when encoding text with a character (such as@
) that isn't inalpha_to_num
. (削除) Instead of gathering characters into a list, and thenjoin()
ing them into a string, why not gather into a string directly? (削除ここまで)- A string that ends with spaces will lose them in the round-trip. Think about how you could avoid that.
-
\$\begingroup\$ Gather into a list and then using
join
is the preferred way to build strings. Using+=
on a string in a loop generally results in quadratic runtime. In some cases, CPython can optimise it in a way that results in a linear runtime, but it's best not to depend on that. \$\endgroup\$Jasmijn– Jasmijn2022年08月17日 22:36:32 +00:00Commented Aug 17, 2022 at 22:36 -
\$\begingroup\$ How does
list.append()
avoid the quadratic performance problem that affectsstring.__iadd__()
? I would have thought them to have similar implementations, and now I'm intrigued. \$\endgroup\$Toby Speight– Toby Speight2022年08月18日 16:47:44 +00:00Commented Aug 18, 2022 at 16:47 -
\$\begingroup\$ Because in Python lists are mutable, if you call
append
, lists can allocate more memory than required on the assumption it may be used later, which if you append items in a loop means you need significantly fewer reallocations. So a single call tolist.append
may or may not be (relatively) expensive, but the amortized time for that method is O(1), whiles1 += s2
where they are both strings always needs a reallocation, which is at least O(len(s1 + s2)). \$\endgroup\$Jasmijn– Jasmijn2022年08月19日 01:31:00 +00:00Commented Aug 19, 2022 at 1:31 -
\$\begingroup\$ Thanks Jasmijn - that's a good explanation. I'd forgotten that Python strings are immutable (perhaps I've done too much C++ lately). \$\endgroup\$Toby Speight– Toby Speight2022年08月19日 06:01:12 +00:00Commented Aug 19, 2022 at 6:01
Base-32 to Integer
Enumerate
Your loop ...
_sum = 0
_pow = 0
for o in string:
if o in alpha_to_num:
index = alpha_to_num[o]
else:
index = 31
_sum += 32**_pow*index
_pow += 1
... can be greatly simplified. First, instead of manually maintaining the _pow
counter, use Python's enumerate
method, which does this for you.
_sum = 0
for _pow, o in enumerate(string):
if o in alpha_to_num:
index = alpha_to_num[o]
else:
index = 31
_sum += 32**_pow*index
get with default
Next, use a dict.get(key [,default])
to eliminate the if o in alpha_to_num: ... else: ...
statement.
_sum = 0
for _pow, o in enumerate(string):
index = alpha_to_num.get(o, 31)
_sum += 32**_pow*index
sum
Finally, you have a sum of a simple loop iteration, so we can now replace this with a single sum(...)
statement:
_sum = sum(32 ** _pow * alpha_to_num.get(o, 31) for _pow, o in enumerate(string))
built-in
Python has a built-in function for converting base-2 through base-36 strings into integers. It is the int(value, base)
function. If we passed in a suitable value, then Python will perform the required \$\Sigma_i(d_i * 32^i)\$ calculation itself.
Reworked code
First, you want to transform your input string into a base-32 equivalent string according to the encoding:
abcdefghijklmnopqrstuvwxyz,.!'?
⇒ 0123456789abcdefghijklmnopqrstuv
Plus uppercase letters to the same values as the lowercase ones, plus double-quote to the same as a single-quote, plus any unknown values to the equivalent as the question-mark. The str.translate(table)
built-in method will do this for us in one step, provided we give it the proper table. The table must map from the ordinal character values to the replacement strings. We'll use a defaultdict
to return 'v'
(the encoding for a '?'
) for any unknown characters.
from collections import defaultdict
PLAIN = " abcdefghijklmnopqrstuvwxyz,.!'?"
CRYPT = "0123456789abcdefghijklmnopqrstuv"
ENCODE = defaultdict(lambda: 'v')
ENCODE.update((ord(c), p) for c, p in zip(PLAIN, CRYPT))
ENCODE.update((ord(c), p) for c, p in zip(PLAIN.upper(), CRYPT))
ENCODE[ord('"')] = 'u' # Encode double-quote to same as single quote.
def encode(msg: str) -> int:
encoded = msg.translate(ENCODE)
...
With the string properly translated to a base-32 string, we can now use int(..., 32)
to convert it to an integer. The only catch is we need to reverse the string, since the original encode
method uses a little-endian base-32 to integer conversion.
return int(encoded[::-1], 32)
Update
The above code will crash with a ValueError
if given an empty string. We can prevent this by changing encoded
to "0"
when encoding an empty string.
encoded = msg.translate(ENCODE) or "0"
Integer to Base-32
Unfortunately, there is no simple int-to-base-n string conversion function built-in to Python. If there was, we could use the same str.translate(table)
function to convert it to the desired output text. We can even use the str.maketrans(x[, y[, z]])
to create the necessary table for us, because we have no infinite set of unknown values to convert to a specific value.
CRYPT = "0123456789abcdefghijklmnopqrstuv"
KEY32 = "⎵abcdefghijklmnopqrstuvwxyz,.!'⍰"
DECODE = str.maketrans(CRYPT, KEY32)
def decode(code: int) -> str:
encoded = ...
return encoded[::-1].translate(DECODE)
I said no simple int-to-base-n string conversion function built-in to Python. There are base32 encoding and decoding functions built-in to the base64
standard library. We can use that to do the heavy lifting.
But first, we need to know how long our message is going to be ... which is the number of 5-bit groups in the integer message:
length = (code.bit_length() + 4) // 5
Then we'll need to pad the message to an integral number of bytes.
bits = 5 * length
if bits % 8 != 0:
pad = 8 - bits % 8
bits += pad
code <<= pad
At this point, we can turn the integer into a series of bytes:
data = code.to_bytes(bits // 8, 'big')
Now we can use the base64.b32hexencode()
method to convert that into a series of characters 0
-9
and A
-V
representing 5-bit groupings of those bytes. The data is padded with equal characters (=
) at the end to ensure a consistent multiple of 40 bit structure, so we need to truncate the result to length
characters to get rid of the padding. Also, b32hexencode
uses uppercase characters, where as we want lowercase, so we need to call .lower()
on the result. Finally, the output is a bytes
object where as we desire a str
, so we'll want to .decode("utf-8")
the result:
encoded = base64.b32hexencode(data)[:length].lower().decode("utf-8")
As I said, it wasn't simple.
Reworked code
Changing the CRYPT
to uppercase allows us to remove a .lower()
call, for slightly simpler code.
from base64 import b32hexencode
from collections import defaultdict
PLAIN = " abcdefghijklmnopqrstuvwxyz,.!'?"
CRYPT = "0123456789ABCDEFGHIJKLMNOPQRSTUV"
KEY32 = "⎵abcdefghijklmnopqrstuvwxyz,.!'⍰"
ENCODE = defaultdict(lambda: 'V')
ENCODE.update((ord(c), p) for c, p in zip(PLAIN, CRYPT))
ENCODE.update((ord(c), p) for c, p in zip(PLAIN.upper(), CRYPT))
ENCODE[ord('"')] = 'U'
DECODE = str.maketrans(CRYPT, KEY32)
def encode(msg: str) -> int:
encoded = msg.translate(ENCODE) or "0"
return int(encoded[::-1], 32)
def decode(code: int) -> str:
length = (code.bit_length() + 4) // 5
bits = 5 * length
if bits % 8 != 0:
pad = 8 - bits % 8
bits += pad
code <<= pad
data = code.to_bytes(bits // 8, 'big')
encoded = b32hexencode(data)[:length].decode("utf-8")
return encoded[::-1].translate(DECODE)
if __name__ == '__main__':
msg = '''The quick "brown" fox jumped 'over' the lazy dog!'''
code = encode(msg)
decrypted = decode(code)
print(code)
print(decrypted)
from timeit import timeit
t_enc = timeit('encode(msg)', globals=globals())
print(f"Encryption takes: {t_enc:.2f} μs")
t_dec = timeit('decode(code)', globals=globals())
print(f"Decryption takes: {t_dec:.2f} μs")
Output:
51651161261624696839988483915180468301590365415316947842058570259607065876
the⎵quick⎵'brown'⎵fox⎵jumped⎵'over'⎵the⎵lazy⎵dog!
Encryption takes: 1.54 μs
Decryption takes: 7.78 μs