I'm trying to write a Python program to predict the most likely substitution for a Caesar cipher. I managed to get it working, but it just seems very clunky and slow, especially because I need to preserve cases and non-alphabetical characters.
- Would it be cleaner / more pythonic to use the
chr()
andord()
functions instead of theALPHABET
list? - Any suggestions on how I could improve it overall (while also maintaining relative readability)?
Thank you so much!
ALPHABET = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
FREQUENCIES = [.0817, .0149, .0278, .0425, .1270, .0223, .0202, .0609, .0697, .0015, .0077, .0402, .0241, .0675, .0751, .0193, .0009, .0599, .0633, .0906, .0276, .0098, .0236, .0015, .0197, .0007]
def getGoodness(msg):
msg_length = sum([1 for i in msg if i.isalpha()])
msg_frqs = [round(msg.count(a) / msg_length, 4) for a in ALPHABET]
diffs = [abs(a - b) for a, b in zip(msg_frqs, FREQUENCIES)]
avg_diff = round(sum(diffs) / len(diffs), 4)
return avg_diff
def shift(string, n):
shifted = [i for i in string]
for char in range(len(shifted)):
if shifted[char].isalpha():
new_char = ALPHABET[(ALPHABET.index(shifted[char].lower()) + n) % len(ALPHABET)]
if shifted[char].isupper():
new_char = new_char.upper()
shifted[char] = new_char
return ''.join(shifted)
#########################################
msg = "Oqqcfrwbu hc ozz ybckb zokg ct ojwohwcb, hvsfs wg bc kom o pss gvcizr ps opzs hc tzm. Whg kwbug ofs hcc gaozz hc ush whg toh zwhhzs pcrm ctt hvs ufcibr. Hvs pss, ct qcifgs, tzwsg obmkom psqoigs pssg rcb'h qofs kvoh viaobg hvwby wg wadcggwpzs. Mszzck, pzoqy. Mszzck, pzoqy. Mszzck, pzoqy. Mszzck, pzoqy. Ccv, pzoqy obr mszzck! Zsh'g gvoys wh id o zwhhzs. Poffm! Pfsoytogh wg fsorm!"
goodness = []
shifted = []
for i in range(26):
goodness.append(getGoodness(shift(msg, i)))
shifted.append(shift(msg, i))
gdict = dict(zip(goodness, shifted))
print(gdict.get(min(gdict.keys())))
3 Answers 3
The shift()
function was fairly easy to understand, in no small part because
you've chosen reasonable and clear variable names. Here are a few relatively
minor suggestions regarding the function. (1) Strings are iterable, so you can
convert to characters directly. (2) Python makes iteration so easy that looping
over data via an index is usually not needed. Instead, just iterate directly.
And for those cases when an index is needed, enumerate()
is often the best
option, because it provides both index and value -- which helps your code by
reducing its visual weight and thus aiding readability. (3) The assignment to
new_char
is a moderately complex expression. One way to improve code
readability is to such computations apart. (4) A dict
to lookup a letter's
index would eliminate the need to scan ALPHABET
repeatedly.
# 4
ALPHA_IDXS = {a:i for i, a in enumerate(ALPHABET)}
def shift(string, n):
# 1
shifted = list(string)
# 2
for i, char in enumerate(shifted):
if char.isalpha():
# 3 and 4
j = (ALPHA_IDXS[char.lower()] + n) % len(ALPHABET)
new_char = ALPHABET[j]
if char.isupper():
new_char = new_char.upper()
shifted[i] = new_char # The only place we use index.
return ''.join(shifted)
Another technique to consider is converting shift()
into a generator function.
This simplifies the code even more because the function no longer has to
assemble a return data collection -- although at a small cost to the caller,
which must do the assembling. Here's what that might look like.
def shift(string, n):
for i, char in enumerate(string):
if char.isalpha():
j = (ALPHA_IDXS[char.lower()] + n) % len(ALPHABET)
new = ALPHABET[j]
char = new if char.islower() else new.upper()
yield char
You did a nice job in getGoodness()
with keeping the steps of the algorithm
very clear. Here are a few suggestions: (1) Because bool
is a subclass of
int
, you can treat True
and False
as integers -- just add up the True
s.
Also, functions like sum()
take any iterable, so you don't need an
intermediate list. (2) msg_length
seems a bit misleading: it is not the
length of msg
; it is the number of alpha characters. (3) A Counter
can
compute letter frequencies for you. (4)
Shouldn't this calculation be done on a lowercase version of the text?
from collections import Counter
def getGoodness(msg):
# 1 and 2
n_alphas = sum(i.isalpha() for i in msg)
# 3 and 4
freqs = Counter(msg.lower())
msg_frqs = [round(freqs[a] / n_alphas, 4) for a in ALPHABET]
diffs = [abs(a - b) for a, b in zip(msg_frqs, FREQUENCIES)]
avg_diff = round(sum(diffs) / len(diffs), 4)
return avg_diff
Finally, some comments about the orchestration code. (1) Put your code in
functions. (2) Set yourself up for testing. It helps a lot during the
development and debugging lifecycle. (3) In the loop to compute goodnesses for
the shifted strings, a tuple
seems like a handier data structure than a
dict
. It simplifies the code and is better suited for min()
. (4) I
recommend the habit of setting up all scripts so they can take arguments --
again, useful during development and sometimes usage.
# 1
def main(args):
# 2
TESTS = (
(
"Oqqcfrwbu hc ozz ybckb zokg ct ojwohwcb, hvsfs wg bc kom o pss gvcizr ps opzs hc tzm. Whg kwbug ofs hcc gaozz hc ush whg toh zwhhzs pcrm ctt hvs ufcibr. Hvs pss, ct qcifgs, tzwsg obmkom psqoigs pssg rcb'h qofs kvoh viaobg hvwby wg wadcggwpzs. Mszzck, pzoqy. Mszzck, pzoqy. Mszzck, pzoqy. Mszzck, pzoqy. Ccv, pzoqy obr mszzck! Zsh'g gvoys wh id o zwhhzs. Poffm! Pfsoytogh wg fsorm!",
"According to all known laws of aviation, there is no way a bee should be able to fly. Its wings are too small to get its fat little body off the ground. The bee, of course, flies anyway because bees don't care what humans think is impossible. Yellow, black. Yellow, black. Yellow, black. Yellow, black. Ooh, black and yellow! Let's shake it up a little. Barry! Breakfast is ready!",
),
(
"Ebiil, tloia!",
"Hello, world!",
),
)
for msg, exp in TESTS:
# 3
scores = []
for i in range(26):
txt = ''.join(shift(msg, i))
g = (getGoodness(txt), txt)
scores.append(g)
got = min(scores)[1]
# 2 Always be testing.
print(exp == got, '=>', got[0:30])
# 4
if __name__ == '__main__':
main(sys.argv[1:])
-
1\$\begingroup\$ thank you so much for your help! I have a question though - is a
Counter
necessarily faster or better in some way than the.count()
function? thanks! \$\endgroup\$tommy2111111111– tommy21111111112021年03月10日 18:41:00 +00:00Commented Mar 10, 2021 at 18:41 -
\$\begingroup\$ @tommy2111111111 No, not faster (often slower). But it can simplify code in a variety of situations. It's not clear how much simplicity it provides in this case -- not a lot -- but it is a useful data structure to know about. \$\endgroup\$FMc– FMc2021年03月10日 19:29:48 +00:00Commented Mar 10, 2021 at 19:29
You're diffing msg_frqs
with FREQUENCIES
. The latter's sum is 1, the former's isn't, because you're counting the upper case letters in msg_length
but not in msg_frqs
. That seems inappropriate. So better start getGoodness
with msg = msg.lower()
.
And I wouldn't round
. Why throw away information? Especially using extra code and time.
-
\$\begingroup\$ thank you! this was helpful. totally forgot about the
msg = msg.lower()
\$\endgroup\$tommy2111111111– tommy21111111112021年03月10日 18:41:41 +00:00Commented Mar 10, 2021 at 18:41
Your part
goodness = []
shifted = []
for i in range(26):
goodness.append(getGoodness(shift(msg, i)))
shifted.append(shift(msg, i))
gdict = dict(zip(goodness, shifted))
print(gdict.get(min(gdict.keys())))
is rather complicated. You could just use min-by-goodness:
shifts = (shift(msg, i) for i in range(26))
print(min(shifts, key=getGoodness))
Also takes less memory, as it just keeps the best shift so far, doesn't build a list+dict containing all of them like yours.
-
\$\begingroup\$ oh got it, yeah just figured it out when I tested out the code lol. how would we get the corresponding number of shifts? \$\endgroup\$tommy2111111111– tommy21111111112021年03月10日 19:34:43 +00:00Commented Mar 10, 2021 at 19:34
-
\$\begingroup\$ Not sure what you mean with that... \$\endgroup\$Manuel– Manuel2021年03月10日 19:37:21 +00:00Commented Mar 10, 2021 at 19:37
-
\$\begingroup\$ like the second line
print(min(shifts, key=getGoodness))
outputs the most likely plaintext, right? how would I get the number of places the ciphertext was shifted to the right to produce that most likely output? \$\endgroup\$tommy2111111111– tommy21111111112021年03月10日 19:39:57 +00:00Commented Mar 10, 2021 at 19:39 -
\$\begingroup\$ Ah, so something your original doesn't do? You could compute it separately with
min(range(26), key=lambda i: getGoodness(shift(msg, i)))
(didn't test this) or you could take the input message and the output message and find the first difference. \$\endgroup\$Manuel– Manuel2021年03月10日 19:44:19 +00:00Commented Mar 10, 2021 at 19:44