Implementation of a Caesar cipher is a popular exercise and there are many implementations posted here on Code Review.
My version is intended to be efficient and portable (subject to some limitations, below).
#include <algorithm>
#include <array>
#include <cctype>
#include <climits>
#include <cstdlib>
#include <iostream>
#include <iterator>
#include <numeric>
#include <stdexcept>
class caesar_rotator {
using char_table = std::array<char, UCHAR_MAX+1>;
const char_table table;
public:
caesar_rotator(int rotation) noexcept
: table{create_table(rotation)}
{}
char operator()(char c) const noexcept
{
return table[static_cast<unsigned char>(c)];
};
private:
#define LETTERS "abcdefghijklmnopqrstuvwxyz"
static char_table create_table(int rotation)
{
constexpr int len = (sizeof LETTERS) - 1; // don't count the terminating null
static const auto* alpha2 = reinterpret_cast<const unsigned char*>(LETTERS LETTERS);
// normalise to the smallest positive equivalent
rotation = (rotation % len + len) % len;
char_table table;
// begin with a identity mapping
std::iota(table.begin(), table.end(), 0);
// change the mapping of letters
for (auto i = 0; i < len; ++i) {
table[alpha2[i]] = alpha2[i+rotation];
table[std::toupper(alpha2[i])] = static_cast<char>(std::toupper(alpha2[i+rotation]));
}
return table;
}
#undef LETTERS
};
int main(int argc, char **argv)
{
constexpr int default_rotation = 13;
// Parse arguments
int rotation;
if (argc <= 1) {
rotation = default_rotation;
} else if (argc == 2) {
try {
std::size_t end;
rotation = std::stoi(argv[1], &end);
if (argv[1][end]) { throw std::invalid_argument(""); }
} catch (...) {
std::cerr << "Invalid Caesar shift value: " << argv[1] << " (integer required)\n";
return EXIT_FAILURE;
}
} else {
std::cerr << "Usage: " << argv[0]
<< " [NUMBER]\nCaesar-shift letters in standard input by NUMBER places (default "
<< default_rotation <<")\n";
return EXIT_FAILURE;
}
// Now filter the input
std::transform(std::istreambuf_iterator<char>{std::cin},
std::istreambuf_iterator<char>{},
std::ostreambuf_iterator<char>{std::cout},
caesar_rotator{rotation});
}
Features and limitations
- It's written using portable C++ which compiles with any compiler supporting C++11 or later. (But don't let that inhibit reviews from suggesting improvements that require newer versions - I am still interested!)
- We can shift by any representable integer (negative shift decodes a positive shift, and vice versa).
- It works with all single-byte character sets. Unlike many implementations, it works on systems where alphabetic characters are discontinuous, such as EBCDIC.
- It works with multi-byte character sets where the alphabetic characters are invariant (e.g. UTF-8). When using codings that shift out the alpha-chars, then non-alphabetic characters may be modified - but using the same program for the decoding will correctly restore the original input text.
- Because we use a table to produce output, we need storage for that table, sufficient to represent all character values. No problem for systems where
CHAR_BIT
is 8, but it may prevent use on systems with much wider character type. - There is an overhead to constructing the table each run, which won't pay for itself with short input streams. But it's still small compared with the overhead of creating a process, and it's a big win for large inputs.
- I may write a separate wide-char implementation that addresses the limitations.
The ugly stuff
- I don't have access to an EBCDIC system to confirm my claim that it's correct on such platforms.
- Most of the ugliness is in
create_table()
, which in an earlier version had been an immediately-invoked lambda expression in the constructor's initialiser list; I moved it out to a named private function to make it easier to read and easier to skip when perusing the public interface. - I don't like using a macro to concatenate two copies of the alphabet. But I couldn't find an alternative
constexpr
way to do that, even in C++20, other than writing the letters twice, which I didn't want to do. (Andalpha2
still isn'tconstexpr
, because of the cast, but at least it enables reasonable compilers to store the string literal in.text
segment or equivalent.) - The
reinterpret_cast
offends me. And thestatic_cast
s, though to a lesser extent. But we need unsigned values forstd::toupper()
and for array indexing, and plainchar
for the stream handling and string literal. - I don't much like using
%
twice to get the positive modulo; is there a simpler way? I consideredrotation %= len; if (rotation < 0) rotation += len;
but I'm not convinced that's better. - In the main function, I throw an exception when argument has anything following an initial number, so that I handle all parsing errors in one place. Is that okay, or is there something better I could do?
-
\$\begingroup\$ An answer that improves any one of the "ugly" things is very welcome (even if it's just confirmation that the code functions correctly using EBCDIC). \$\endgroup\$Toby Speight– Toby Speight2021年10月30日 12:24:20 +00:00Commented Oct 30, 2021 at 12:24
4 Answers 4
Avoid macros
There is absolutely no need for a macro here. Just write:
static const char *letters = "abcdefghijklmnopqrstuvwxyz";
auto alpha = reinterpret_cast<const unsigned char *>(letters);
And to access it safely:
table[alpha2[i]] = alpha2[(i + rotation) % len];
Alternatively, you could create a mutable string array holding the letters, and then use std::rotate()
to create a rotated version of it.
Avoiding the reinterpret_cast
Instead of creating a reinterpret-casted pointer to the string holding the letters, you can static_cast
the individual characters. That's safer but arguably uglier because you then need more casts.
A much better way would be to make char_table
a class with an operator[]
that takes a char
as an argument, and which returns a reference. That way you can write:
char_table table;
...
for (auto i = 0; i < len; i++) {
table[letters[i]] = letters[(i + rotation) % len];
...
}
Of course, that pesky std::toupper()
still gives trouble. Maybe you can create a helper function that takes care of doing toupper()
safely on char
s.
Optimizing table construction
Most of the inefficiency is in the modulo operation. But it can be largely avoided: first ensure rot
is between 0 and 26, then you can make use of the fact that i + rot
is between 0 and 52, and use i + rot < 26 ? i + rot : i + rot - 26
as an index. Again, you could make a little helper function for this to make the code more readable.
Apart from that, consider making build_table()
a constexpr
function. This is rather trivial with C++17
and later (the only thing to worry about is std::iota()
, which isn't constexpr
before C++20). It might be possible to make it constexpr
in earlier C++ versions as well. With this, if rot
is a compile-time constant, the table can also be built at compile-time.
If you can build one table at compile-time, you could even build all 26 of them at compile-time, making it cheap to do the Caesar substitution with a rotation that you only know at runtime.
Based on suggestions from G. Sliepen's answer, I was able to eliminate the macro and localise most of the casts:
private:
static constexpr int upper_int(char c)
{
return std::toupper(static_cast<unsigned char>(c));
}
static constexpr char upper_char(char c)
{
return static_cast<char>(upper_int(c));
}
static char_table create_table(int rotation)
{
constexpr auto* letters = "abcdefghijklmnopqrstuvwxyz";
constexpr int len = std::strlen(letters);
// normalise to the smallest positive equivalent
rotation = (rotation % len + len) % len;
char_table table;
// begin with a identity mapping
std::iota(table.begin(), table.end(), 0);
// change the mapping of letters
for (auto i = 0, target = rotation; i < len; ++i, ++target) {
if (target == len) {
target = 0;
}
table[letters[i]] = letters[target];
table[upper_int(letters[i])] = upper_char(letters[target]);
}
return table;
}
Not only that, but these changes reduce the maximum line length, for easier reading.
:-)
Compared to the answer that inspired this, I did the computation of target
differently: here, I help branch prediction by taking the forward path all but once. Yes, that is unnecessary micro-optimisation, but I'm here for a learning experience.
For C++20 onwards, create_table()
can be declared constexpr
, though there's little point, as we never call it with a compile-time constant argument. With GCC, this even works with C++17, before std::iota()
became constexpr
- I think it's smart enough to examine the resultant code after expanding the template.
-
2\$\begingroup\$ That looks great! \$\endgroup\$G. Sliepen– G. Sliepen2021年10月30日 19:18:06 +00:00Commented Oct 30, 2021 at 19:18
I suggest a different approach:
Nail the locale to the C locale, write the 26 possible translation-table as read-only data, and let the constructor just choose the right one.
It probably takes a bit more space in the executable (6.5K data for 8bit char), but it will certainly be faster and needs less per-instance space at runtime.
#include <array>
#include <numeric>
constexpr auto translation_tables = []{
const char* bands[]{
"abcdefghijklmnopqrstuvwxyz",
"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
};
std::array<std::array<unsigned char, 1 + (unsigned char)-1>, 26> r;
for (int rotate = 0; rotate < 26; ++rotate) {
auto& table = r[rotate];
std::iota(table.begin(), table.end(), (unsigned char)'0円');
for (auto band : bands)
for (int i = 0; i < 26; ++i)
table[band[i]] = table[band[(i + rotate) % 26]];
}
return r;
}();
constexpr auto caesar_rotator(int shift) noexcept {
if ((shift %= 26) < 0)
shift += 26;
auto table = translation_tables[shift].data();
return [=](unsigned char c) noexcept { return table[c]; };
}
-
\$\begingroup\$ You only need two consecutive copies of 'a'..'Z', not 26. Just point a pointer to the starting point based on the shift. That is, work the same as the old "decoding rings". Having the list repeat prevents the modulo operation, which is your real performance point. \$\endgroup\$JDługosz– JDługosz2021年11月01日 16:41:00 +00:00Commented Nov 1, 2021 at 16:41
-
\$\begingroup\$ @JDługosz You know the two shifted rings are embedded in a block of 256 symbols (for 8-bit, aka minimal, bytes), the rest being identity-mapped? Thus, really need one mapping per shift-value, of which there are 26 different ones, as the rings are 26 unique symbols long. \$\endgroup\$Deduplicator– Deduplicator2021年11月01日 17:23:42 +00:00Commented Nov 1, 2021 at 17:23
-
\$\begingroup\$ I see. I thought you were just avoiding the expensive modulo operation, not the range test for ischar. \$\endgroup\$JDługosz– JDługosz2021年11月01日 22:33:25 +00:00Commented Nov 1, 2021 at 22:33
You can avoid the reinterpret_cast
without changing the logic, and at the same time drastically limit the scope of your macro:
#define LETTERS "abcdefghijklmnopqrstuvwxyz"
+ static constexpr auto alphabet_size = sizeof LETTERS - 1;
+ static constexpr unsigned char alpha2[sizeof LETTERS * 2 - 1] = LETTERS LETTERS;
+#undef LETTERS
+
static char_table create_table(int rotation)
{
- constexpr int len = (sizeof LETTERS) - 1; // don't count the terminating null
- static const auto* alpha2 = reinterpret_cast<const unsigned char*>(LETTERS LETTERS);
+ constexpr int len = alphabet_size;
@@ -44,3 +47,2 @@
}
-#undef LETTERS
};
Still far from ideal (especially since it can’t be combined with AA style, which I’d otherwise follow) but much better.
Secondly, and this is certainly subjective, but I really dislike implicit conversions to bool
. I just about accept omitting != nullptr
; but implicit int
to bool
conversions just make the code less readable. Case in point, I had to actively think which exact condition was tested by if (argv[1][end])
. I find it vastly more readable to write this explicitly, i.e.:
if (argv[1][end] != '0円')
I’d probably also give argv[1]
a dedicated name in this scenario since it’s used repeatedly, and requires nested array access. And I’d probably make rotation
const
, and use a conditional expression with a nested IIFE to initialise it. However, this will require a function-level try
block (and admittedly this makes the code longer, which might not be a good trade-off in this simple case).
Lastly, we probably don’t want to just catch ...
because if there’s an unexpected exception that’s a bug in the code, and catching the exception obliterates the stack trace.
-
1\$\begingroup\$ I tried initializing a
std::array<unsigned char>
from string literal, but that's an error. I didn't realise I could initialise aunsigned char[]
without a warning. I guess the array size should bealphabet_size * 2
orsizeof LETTERS * 2 - 2
to not have a couple of wasted elements? Remember thatsizeof
measures the whole string including the final null character that we don't need here. \$\endgroup\$Toby Speight– Toby Speight2021年10月31日 15:59:51 +00:00Commented Oct 31, 2021 at 15:59 -
\$\begingroup\$ I did originally "scope"
LETTERS
around the two lines within the function, but that was hard to read. Your suggestion of extracting the constants is much better. (I only changed to a class quite late - originally,rotate()
just returned a lambda containing the translation table). \$\endgroup\$Toby Speight– Toby Speight2021年10月31日 16:01:13 +00:00Commented Oct 31, 2021 at 16:01 -
\$\begingroup\$ @TobySpeight Yes, of course, my mistake. It does need to be
sizeof LETTERS - 1
though, otherwise the array will be too small to hold the string literal incl. zero termination. \$\endgroup\$Konrad Rudolph– Konrad Rudolph2021年10月31日 16:01:53 +00:00Commented Oct 31, 2021 at 16:01 -
\$\begingroup\$ @TobySpeight That’s only valid in C, not in C++! \$\endgroup\$Konrad Rudolph– Konrad Rudolph2021年10月31日 16:08:08 +00:00Commented Oct 31, 2021 at 16:08
-
\$\begingroup\$ Ah,
dcl.init.string
, and the consensus interpretation is that you're right: What language standards allow ignoring null terminators on fixed size arrays? \$\endgroup\$Toby Speight– Toby Speight2021年10月31日 16:19:55 +00:00Commented Oct 31, 2021 at 16:19