Improving the speed of an MD5 implementation from scratch in C++

Question 1

I wrote an implementation of the MD5 hashing algorithm from scratch in C++, following the original RFC: https://www.ietf.org/rfc/rfc1321.txt . It works as expected, and the outputs match all the test cases mentioned in the RFC as well. I timed this for a couple of test cases, and it seems it takes around 0.25 msecs on average for a 26 character text (or 1 block essentially). This, admittedly, is not bad. But given that I was able to scrounge together an implementation of the same in Python relatively easily, which is actually faster (nanosecond range), I feel there are obviously some implementation mistakes or inefficiencies that are present here. This is literally the first C++ program that I have written (except for hello world of course), as I am primarily a Python developer, so I'm looking for ways to improve the code structure and efficiency. I would like to know where I could improve here, and what exactly is causing it to be slow in the first place. Please excuse my naming conventions. Just borrowed from Python! I would also like to point out that there are possibly some things in this code that are specifically only C++20. Here is the code:

pstl_cryptography.h

#pragma once
#include <string>
#include <array>
#include <vector>
namespace pstl::cryptography::hashing
{
 class md5
 {
 public:
 static std::string digest(std::string str);
 protected:
 md5() = delete;
 ~md5() = delete;
 private:
 constexpr static std::array<uint32_t, 64> make_k_array();
 static std::vector<char> padder(std::string str);
 static void init();
 private:
 static const std::array<uint32_t, 64> k_array;
 static const std::array<uint32_t, 64> s_array;
 inline static std::array<uint64_t, 16> m_array;
 inline static uint32_t a0 = 0x67452301;
 inline static uint32_t b0 = 0xefcdab89;
 inline static uint32_t c0 = 0x98badcfe;
 inline static uint32_t d0 = 0x10325476;
 };
}

pstl_hashes.cpp

#include <iostream>
#include <array>
#include <string>
#include <cmath>
#include <vector>
#include <format>
#include "pstl_cryptography.h"
const std::array<uint32_t, 64> pstl::cryptography::hashing::md5::s_array = {
 7, 12, 17, 22, 7, 12, 17, 22, 7, 12, 17, 22, 7, 12, 17, 22,
 5, 9, 14, 20, 5, 9, 14, 20, 5, 9, 14, 20, 5, 9, 14, 20,
 4, 11, 16, 23, 4, 11, 16, 23, 4, 11, 16, 23, 4, 11, 16, 23,
 6, 10, 15, 21, 6, 10, 15, 21, 6, 10, 15, 21, 6, 10, 15, 21
};
constexpr std::array<uint32_t, 64> pstl::cryptography::hashing::md5::make_k_array()
{
 std::array<uint32_t, 64> output;
 for (int i = 0; i < output.max_size(); i++)
 output[i] = std::floor(0x100000000 * std::abs(std::sin(i + 1)));
 return output;
}
const std::array<uint32_t, 64> pstl::cryptography::hashing::md5::k_array = pstl::cryptography::hashing::md5::make_k_array();
std::vector<char> pstl::cryptography::hashing::md5::padder(std::string str)
{
 std::vector<char> padded(str.begin(), str.end());
 union length_pad
 {
 uint64_t whole;
 uint8_t parts[8];
 } length;
 length.whole = padded.size() * 8;
 padded.emplace_back(0x80);
 int extra_over_chunks = (padded.size() % 64);
 int zero_pad = extra_over_chunks < 56 ? 56 - extra_over_chunks : 56 - extra_over_chunks + 64;
 for (int i = 0; i < zero_pad; i++)
 padded.emplace_back(0x00);
 for (auto& length_byte : length.parts)
 padded.emplace_back(length_byte);
 return padded;
}
void pstl::cryptography::hashing::md5::init()
{
 a0 = 0x67452301;
 b0 = 0xefcdab89;
 c0 = 0x98badcfe;
 d0 = 0x10325476;
}
// Based on RFC 1321
// https://www.ietf.org/rfc/rfc1321.txt
// 1992
std::string pstl::cryptography::hashing::md5::digest(std::string str)
{
 init();
 std::vector<char> padded_message = padder(str);
 uint32_t num_chunks = padded_message.size() / 64;
 for (int i = 0; i < num_chunks; i++)
 {
 uint32_t A = a0;
 uint32_t B = b0;
 uint32_t C = c0;
 uint32_t D = d0;
 union chunk
 {
 char in_chars[64];
 uint32_t in_ints[16];
 } current_chunk;
 for (int k = 0; k < 64; k++)
 current_chunk.in_chars[k] = padded_message[k + i * 64];
 
 for (int k = 0; k < 16; k++)
 m_array[k] = current_chunk.in_ints[k];
 for (int j = 0; j < 64; j++)
 {
 uint32_t F, g;
 if (j < 16)
 { 
 F = (B & C) | ((~B) & D);
 g = j;
 }
 else if (j < 32)
 {
 F = (B & D) | (C & (~D));
 g = (5 * j + 1) % 16;
 }
 else if (j < 48)
 {
 F = B ^ C ^ D;
 g = (3 * j + 5) % 16;
 }
 else if (j < 64)
 {
 F = C ^ (B | (~D));
 g = (7 * j) % 16;
 }
 F = F + A + m_array[g] + k_array[j];
 A = D;
 D = C;
 C = B;
 B = B + std::rotl(F, s_array[j]);
 }
 a0 += A;
 b0 += B;
 c0 += C;
 d0 += D;
 }
 std::string output;
 for (auto var : { &a0, &b0, &c0, &d0 })
 {
 *var = (((*var) >> 24) | (((*var) & 0x00FF0000) >> 8) | (((*var) & 0x0000FF00) << 8) | ((*var) << 24));
 output += std::format("{:08x}", *var);
 }
 return output;
}

main.cpp

#include <iostream>
#include "pstl_cryptography.h"
int main()
{
 std::cout << pstl::cryptography::hashing::md5::digest("") << std::endl;
 std::cout << pstl::cryptography::hashing::md5::digest("a") << std::endl;
 std::cout << pstl::cryptography::hashing::md5::digest("message digest") << std::endl;
 std::cout << pstl::cryptography::hashing::md5::digest("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789") << std::endl;
}

Question 2

Micro-review: You're missing <cstdint>, which declares std::uint32_t and the other fixed-width types.

Question 3

@TobySpeight Ah yes! Thanks for that! But I have a huge issue with this entire thing actually. I am using Visual Studio 2022 for this, and it seems to never fail compilation for missing headers as long as I have included iostream apparently. I don't really know how to ensure that I have included all the required headers for other compilers, as I'm not yet well versed in knowing which STL functions belong in which header. And VS is not helping in this regard. Any Idea how to deal with that?

Question 4

I recently mentioned include-what-you-use in another answer. It requires you to compile your code with LLVM for its diagnostics; I don't know how feasible it is to do that on code that's laid out for Microsoft's compiler and build system.

Question 5

First Look:

In these two functions:

 static std::string digest(std::string str);
 static std::vector<char> padder(std::string str);

You are passing the str by value. So you are making a copy of the string. Which of course uses memory management which is slow. Pass by const reference.

 static std::string digest(std::string const& str);
 static std::vector<char> padder(std::string const& str);

This will prevent a copy and prevent you modifying the original.

You make an additional copy of string here.

std::vector<char> padded(str.begin(), str.end());

That seems like an inefficiency to me. The whole reason for padder() function seems redundant. If you have gone past the end of the data you know then you calculate the padding value in place.

Couple of other places where there seems to be redundant copying.

 for (int k = 0; k < 64; k++)
 current_chunk.in_chars[k] = padded_message[k + i * 64];
 for (int k = 0; k < 16; k++)
 m_array[k] = current_chunk.in_ints[k];

Second Loop

You are deleting the constructor and destructor!

protected:
 md5() = delete;
 ~md5() = delete;

All the public functions are static (as are the members). So looks like you have a bunch of free functions and some shared state between them. If that's what you want (free functions with no state (i.e. not object)) just use a namespace

namespace md5
{
 std::string digest(std::string str);
 namespace implementation_details
 {
 // Put other stuff in here.
 }
}

But after reading this. You do store some state. Which makes your code unuseful unless you totally serialize all usage of your MD5 code (otherwise they will interact).

So I would keep the class design, but distinguish shared state (the const stuff) and the state that belongs to the the hash of the current bit of string.

class md5
{
 public:
 std::string digest(std::string str);
 private:
 
 // Private Static State.
 // Used by all instances.
 static std::array<uint32_t, 64> make_k_array();
 static const std::array<uint32_t, 64> k_array;
 static const std::array<uint32_t, 64> s_array;
 // State used by each instance.
 // Don't need an init() as you initialize the object here.
 std::array<uint64_t, 16> m_array;
 uint32_t a0 = 0x67452301;
 uint32_t b0 = 0xefcdab89;
 uint32_t c0 = 0x98badcfe;
 uint32_t d0 = 0x10325476;
 static std::vector<char> padder(std::string str);
};

Third Look:

Minor grumbles:

Actually trying to compile it.
Should have mentioned you need C++20 to compile it.

You have a missing header:

#include <bit>

I get a compiler error for this:

constexpr static std::array<uint32_t, 64> make_k_array();

It's complaining this is not a constexpr function. So I just deleted the constexpr part. Not that worried about it (probably my old compiler).

Bad Design:

The digest should not return a string:

static std::string digest(std::string str);

The return value is simply a 128 bit integer.

OK you can put that 128 bit integer into 16 byte string for ease of use. But you don't; you encode that 128 bit into a string of hash digits. That seems like a waste of time. It's a 128 bit integer treat it as such. Yes, convert it to hex for display to a human but that conversion to hex digits is not part of the algorithm.

using Hash = std::array<unsigned char, 16>;
Hash digest(std::string const& str);
std::ostream& operator<<(std::ostream& str, Hash const& data)
{
 for (auto const& v: data) {
 str << std::format("{:08x}", v);
 }
 return str;
}

Question 6

Thanks for the extensive reply. I find a couple of points interesting here. Yes I agree that the const reference string would be good and it seems I missed that, so will definitely add it. As for the C++ 20 thing, I think i did mention it in the original post. But what is concerning is the constexpr part. I added constexpr because I intended to calculate that array at compile time, as it is quite an extensive calculation. But if that is not happening, then that may be where the slowdown is. Why is the array not being calculated at compile time using that syntax?

Question 7

Most implementations just directly use a predefined array for K. I decided to create it using the actual formula, but have it created at compile time. Looks like that hasn't worked correctly. And as for the constructor/destructor, I took this approach because I don't want someone to be able to create 100s of instances of this class and create 100s of copies of the internally stored arrays. I just want it to function like a singleton, with static variables available to it.

Question 8

Also, for the init, that is actually required in this design if I intend to call the digest function multiple times in the same program, as each digest needs to have the same initial values. Those initial values get modified in every run, and need to be reset. Actually it should be called reset instead of init. This again is due to the singleton design. Although, I'm not entirely sure if that was a good idea to begin with

Question 9

@MartinYork Thanks for this! There are a couple of things here that I'm not fully aware of and will need to learn. But I'll try and understand it, and implement it, to see if it leads to any improvement in performance.

Question 10

@PrashantSinha Please post a follow up question with any new code and any improvements.

Loki Astari Loki Astari 97.7k5 gold badges126 silver badges341 bronze badges · Accepted Answer · 2022-09-18 18:46:48Z

First Look:

In these two functions:

 static std::string digest(std::string str);
 static std::vector<char> padder(std::string str);

You are passing the str by value. So you are making a copy of the string. Which of course uses memory management which is slow. Pass by const reference.

 static std::string digest(std::string const& str);
 static std::vector<char> padder(std::string const& str);

This will prevent a copy and prevent you modifying the original.

You make an additional copy of string here.

std::vector<char> padded(str.begin(), str.end());

That seems like an inefficiency to me. The whole reason for padder() function seems redundant. If you have gone past the end of the data you know then you calculate the padding value in place.

Couple of other places where there seems to be redundant copying.

 for (int k = 0; k < 64; k++)
 current_chunk.in_chars[k] = padded_message[k + i * 64];
 for (int k = 0; k < 16; k++)
 m_array[k] = current_chunk.in_ints[k];

Second Loop

You are deleting the constructor and destructor!

protected:
 md5() = delete;
 ~md5() = delete;

All the public functions are static (as are the members). So looks like you have a bunch of free functions and some shared state between them. If that's what you want (free functions with no state (i.e. not object)) just use a namespace

namespace md5
{
 std::string digest(std::string str);
 namespace implementation_details
 {
 // Put other stuff in here.
 }
}

But after reading this. You do store some state. Which makes your code unuseful unless you totally serialize all usage of your MD5 code (otherwise they will interact).

So I would keep the class design, but distinguish shared state (the const stuff) and the state that belongs to the the hash of the current bit of string.

class md5
{
 public:
 std::string digest(std::string str);
 private:
 
 // Private Static State.
 // Used by all instances.
 static std::array<uint32_t, 64> make_k_array();
 static const std::array<uint32_t, 64> k_array;
 static const std::array<uint32_t, 64> s_array;
 // State used by each instance.
 // Don't need an init() as you initialize the object here.
 std::array<uint64_t, 16> m_array;
 uint32_t a0 = 0x67452301;
 uint32_t b0 = 0xefcdab89;
 uint32_t c0 = 0x98badcfe;
 uint32_t d0 = 0x10325476;
 static std::vector<char> padder(std::string str);
};

Third Look:

Minor grumbles:

Actually trying to compile it.
Should have mentioned you need C++20 to compile it.

You have a missing header:

#include <bit>

I get a compiler error for this:

constexpr static std::array<uint32_t, 64> make_k_array();

It's complaining this is not a constexpr function. So I just deleted the constexpr part. Not that worried about it (probably my old compiler).

Bad Design:

The digest should not return a string:

static std::string digest(std::string str);

The return value is simply a 128 bit integer.

OK you can put that 128 bit integer into 16 byte string for ease of use. But you don't; you encode that 128 bit into a string of hash digits. That seems like a waste of time. It's a 128 bit integer treat it as such. Yes, convert it to hex for display to a human but that conversion to hex digits is not part of the algorithm.

using Hash = std::array<unsigned char, 16>;
Hash digest(std::string const& str);
std::ostream& operator<<(std::ostream& str, Hash const& data)
{
 for (auto const& v: data) {
 str << std::format("{:08x}", v);
 }
 return str;
}

Thanks for the extensive reply. I find a couple of points interesting here. Yes I agree that the const reference string would be good and it seems I missed that, so will definitely add it. As for the C++ 20 thing, I think i did mention it in the original post. But what is concerning is the constexpr part. I added constexpr because I intended to calculate that array at compile time, as it is quite an extensive calculation. But if that is not happening, then that may be where the slowdown is. Why is the array not being calculated at compile time using that syntax?
Most implementations just directly use a predefined array for K. I decided to create it using the actual formula, but have it created at compile time. Looks like that hasn't worked correctly. And as for the constructor/destructor, I took this approach because I don't want someone to be able to create 100s of instances of this class and create 100s of copies of the internally stored arrays. I just want it to function like a singleton, with static variables available to it.
Also, for the init, that is actually required in this design if I intend to call the digest function multiple times in the same program, as each digest needs to have the same initial values. Those initial values get modified in every run, and need to be reset. Actually it should be called reset instead of init. This again is due to the singleton design. Although, I'm not entirely sure if that was a good idea to begin with
@MartinYork Thanks for this! There are a couple of things here that I'm not fully aware of and will need to learn. But I'll try and understand it, and implement it, to see if it leads to any improvement in performance.
@PrashantSinha Please post a follow up question with any new code and any improvements.

Stack Exchange Network

Improving the speed of an MD5 implementation from scratch in C++

1 Answer 1

First Look:

Second Loop

Third Look:

Bad Design:

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Improving the speed of an MD5 implementation from scratch in C++

1 Answer 1

First Look:

Second Loop

Third Look:

Bad Design:

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions