It requires no explanation, so here's the code.
Code:
#include <filesystem>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <optional>
#include <sstream>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

#include <openssl/evp.h>
static std::optional<std::string> compute_file_hash(const std::string &unhashed)
{
    unsigned char hash[EVP_MAX_MD_SIZE];
    unsigned int hash_len{0};

    EVP_MD_CTX *context = EVP_MD_CTX_new();
    if (context == nullptr ||
        EVP_DigestInit_ex(context, EVP_md5(), nullptr) == 0 ||
        EVP_DigestUpdate(context, unhashed.c_str(), unhashed.length()) == 0 ||
        EVP_DigestFinal_ex(context, hash, &hash_len) == 0) {
        EVP_MD_CTX_free(context);
        return std::nullopt;
    }

    std::stringstream ss{};
    for (unsigned int i = 0; i < hash_len; ++i) {
        ss << std::hex << std::setw(2) << std::setfill('0')
           << static_cast<int>(hash[i]);
    }

    EVP_MD_CTX_free(context);
    return ss.str();
}

static void find_duplicates(
    std::unordered_map<std::string, std::vector<std::string>> &file_hashes,
    std::string_view arg)
{
    for (const auto &entry :
         std::filesystem::recursive_directory_iterator(arg)) {
        if (entry.is_regular_file() && !entry.is_symlink()) {
            std::string entry_path{entry.path().string()};
            std::ifstream ifs(entry_path);
            std::string content((std::istreambuf_iterator<char>(ifs)),
                                (std::istreambuf_iterator<char>()));
            const auto res = compute_file_hash(content);
            /* Should we take an openssl failure to be fatal, or just
             * continue on to the next file? */
            if (res != std::nullopt) {
                const auto it = file_hashes.find(res.value());
                it != file_hashes.end()
                    ? it->second.push_back(entry_path)
                    : file_hashes[res.value()].push_back(entry_path);
            }
        }
    }
}

static void print_duplicates(
    std::unordered_map<std::string, std::vector<std::string>> &file_hashes)
{
    for (const auto &[key, value] : file_hashes) {
        if (value.size() > 1) {
            std::cout << key << "\n";
            for (const auto &path : value) {
                std::cout << "\t" << path << "\n";
            }
        }
    }
}

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; ++i) {
        std::unordered_map<std::string, std::vector<std::string>> file_hashes{};
        find_duplicates(file_hashes, argv[i]);
        print_duplicates(file_hashes);
    }
}
Formatted with clang-format and built with:
g++-13 -std=c++23 -Wall -Werror -Wpedantic -O2 dedup.cpp -o dedup -lcrypto
Review Request:
Style, bad practices, et cetera. How do I make it more robust?
It likely won't perform well for large files, and files may change while the program is scanning them.
1 Answer
There's a clear bug in that files are considered to be equal if they have the same MD5 hash. Since hash collisions are definitely a thing, files with equal hashes need to be further compared to determine whether their contents are actually identical.
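A minimal sketch of that confirmation step might look like this (untested; reading both files in fixed-size chunks, where the 64 KiB buffer size and the same_contents name are arbitrary choices):

#include <algorithm>
#include <array>
#include <fstream>
#include <string>

// Sketch: confirm a hash match by comparing the two files byte-by-byte.
static bool same_contents(const std::string &a, const std::string &b)
{
    std::ifstream fa(a, std::ios::binary);
    std::ifstream fb(b, std::ios::binary);
    if (!fa || !fb) {
        return false;  // unreadable: report it and treat as "not a duplicate"
    }
    std::array<char, 64 * 1024> buf_a;
    std::array<char, 64 * 1024> buf_b;
    for (;;) {
        fa.read(buf_a.data(), static_cast<std::streamsize>(buf_a.size()));
        fb.read(buf_b.data(), static_cast<std::streamsize>(buf_b.size()));
        const std::streamsize na = fa.gcount();
        const std::streamsize nb = fb.gcount();
        if (na != nb ||
            !std::equal(buf_a.data(), buf_a.data() + na, buf_b.data())) {
            return false;
        }
        if (na == 0) {
            return true;  // both files ended at the same point with equal bytes
        }
    }
}

Checking the file sizes up front (std::filesystem::file_size) would let us skip most of these comparisons entirely.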
The speed of this program is quite poor. There are a number of improvements that will make it more efficient:
I don't think you need to read each file into a std::string to hash it. You should be able to do it in chunks, which will reduce the memory footprint of your program. We can use a local array for the buffer, which will eliminate most of the memory allocations.

If you're willing to go slightly non-portable, then consider memory-mapping each file when possible (consider using Boost so you don't have to implement it for each platform) and fall back to reading only on platforms that don't provide mapped files. That will greatly reduce the amount of data copying that takes place.
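A rough sketch of the chunked approach (hash_file_chunked is just an illustrative name, the 64 KiB buffer is arbitrary, and the raw digest bytes are returned rather than hex, a choice that is picked up again below):

#include <array>
#include <fstream>
#include <optional>
#include <string>

#include <openssl/evp.h>

// Sketch: hash a file in fixed-size chunks instead of slurping it whole.
static std::optional<std::string> hash_file_chunked(const std::string &path)
{
    std::ifstream ifs(path, std::ios::binary);
    if (!ifs) {
        return std::nullopt;
    }

    EVP_MD_CTX *context = EVP_MD_CTX_new();
    if (context == nullptr ||
        EVP_DigestInit_ex(context, EVP_md5(), nullptr) == 0) {
        EVP_MD_CTX_free(context);
        return std::nullopt;
    }

    std::array<char, 64 * 1024> buffer;  // local buffer: no per-file allocation
    while (ifs) {
        ifs.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        const std::streamsize n = ifs.gcount();
        if (n > 0 && EVP_DigestUpdate(context, buffer.data(),
                                      static_cast<std::size_t>(n)) == 0) {
            EVP_MD_CTX_free(context);
            return std::nullopt;
        }
    }

    unsigned char digest[EVP_MAX_MD_SIZE];
    unsigned int digest_len{0};
    if (EVP_DigestFinal_ex(context, digest, &digest_len) == 0) {
        EVP_MD_CTX_free(context);
        return std::nullopt;
    }
    EVP_MD_CTX_free(context);

    // Return the raw digest bytes; hex conversion is discussed below.
    return std::string(reinterpret_cast<const char *>(digest), digest_len);
}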
This implementation is likely to be limited by filesystem throughput; consider using multiple threads to do the hashing in parallel. You could use a work-queue arrangement for hashing files; perhaps also for traversing the filesystem hierarchy.
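One fairly simple arrangement (a sketch, not tuned; it uses a shared counter rather than a full work queue, assumes a hash_file_chunked() like the sketch above, and hash_in_parallel is just an illustrative name):

#include <algorithm>
#include <atomic>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// Sketch: hash a pre-collected list of paths on several threads.
static void hash_in_parallel(
    const std::vector<std::string> &paths,
    std::unordered_map<std::string, std::vector<std::string>> &file_hashes)
{
    std::atomic<std::size_t> next{0};
    std::mutex map_mutex;
    const unsigned n_threads =
        std::max(1u, std::thread::hardware_concurrency());

    std::vector<std::jthread> workers;
    for (unsigned t = 0; t < n_threads; ++t) {
        workers.emplace_back([&] {
            for (std::size_t i = next++; i < paths.size(); i = next++) {
                const auto digest = hash_file_chunked(paths[i]);
                if (digest) {
                    const std::lock_guard<std::mutex> lock(map_mutex);
                    file_hashes[*digest].push_back(paths[i]);
                }
            }
        });
    }
    // std::jthread joins automatically when workers goes out of scope.
}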
I don't think we need to create a hex representation of every hash. If we use the raw hash as key in the map, then we only need to hexify the ones that we print.
In other words, we can do something like:
std::basic_string<unsigned char> hash(EVP_MAX_MD_SIZE, '\0');
hash.resize(hash_len);
return hash;
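where EVP_DigestFinal_ex() writes straight into hash.data() and sets hash_len. A small helper can then hex-encode only the digests that actually get printed, for example (a sketch; to_hex is just an illustrative name). One caveat: std::hash has no standard specialization for std::basic_string<unsigned char>, so using it directly as an unordered_map key needs a custom hasher, or the raw bytes can be held in an ordinary std::string instead.

#include <iomanip>
#include <sstream>
#include <string>

// Sketch: hex-encode a raw digest only when we are about to print it.
static std::string to_hex(const std::basic_string<unsigned char> &digest)
{
    std::ostringstream oss;
    for (const unsigned char byte : digest) {
        oss << std::hex << std::setw(2) << std::setfill('0')
            << static_cast<int>(byte);
    }
    return oss.str();
}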
It's inefficient to allocate and free a new digest context for each file. We should be able to re-use a single context (or one per thread) if we EVP_MD_CTX_reset() it at the start of each file.
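Something along these lines (a sketch; hash_many is a placeholder name and the digest calls are elided):

#include <fstream>
#include <string>
#include <vector>

#include <openssl/evp.h>

// Sketch: one digest context for the whole run, reset before each file.
static void hash_many(const std::vector<std::string> &paths)
{
    EVP_MD_CTX *context = EVP_MD_CTX_new();
    if (context == nullptr) {
        return;
    }
    for (const auto &path : paths) {
        EVP_MD_CTX_reset(context);  // drop any state left from the previous file
        if (EVP_DigestInit_ex(context, EVP_md5(), nullptr) == 0) {
            continue;  // report to std::cerr and skip this file
        }
        std::ifstream ifs(path, std::ios::binary);
        // ... feed ifs to EVP_DigestUpdate() in chunks, then EVP_DigestFinal_ex() ...
    }
    EVP_MD_CTX_free(context);
}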
Instead of mapping each hash to a vector of filenames, consider using std::unordered_multimap, since the order of the entries is not significant.
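A sketch of what printing could look like with a multimap: entries with equal keys are stored adjacently, so equal_range() walks each group. The key is assumed here to be a printable digest string; with raw bytes, hex-encode it before printing.

#include <iostream>
#include <iterator>
#include <string>
#include <unordered_map>

using hash_map = std::unordered_multimap<std::string, std::string>;

// Sketch: one map entry per file, grouped by digest only when printing.
static void print_duplicates(const hash_map &file_hashes)
{
    for (auto it = file_hashes.begin(); it != file_hashes.end();) {
        const auto [first, last] = file_hashes.equal_range(it->first);
        if (std::distance(first, last) > 1) {
            std::cout << it->first << "\n";
            for (auto match = first; match != last; ++match) {
                std::cout << "\t" << match->second << "\n";
            }
        }
        it = last;  // jump to the next group
    }
}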
If any of the digest operations fails, we just ignore that file. We should at least emit a warning to std::cerr so that the user knows the scan is incomplete.
We also ignore unreadable files and directories - we should similarly inform the user that we are skipping those.
A usability concern - we do nothing if no arguments are passed. That's probably not what's wanted. Consider an error message or a sensible default (i.e. the current directory).
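For example (a sketch that keeps the existing per-argument behaviour and reuses the original find_duplicates()/print_duplicates()):

// Sketch: fall back to the current directory when no arguments are given.
int main(int argc, char **argv)
{
    std::vector<std::string> roots(argv + 1, argv + argc);
    if (roots.empty()) {
        roots.emplace_back(".");
    }
    for (const auto &root : roots) {
        std::unordered_map<std::string, std::vector<std::string>> file_hashes{};
        find_duplicates(file_hashes, root);
        print_duplicates(file_hashes);
    }
}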
We might want to treat matches specially if they are just different names for the same file ("hard links") rather than actual copies. That will require facilities outside the standard library.
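On POSIX, for instance, the (device, inode) pair from stat() identifies the underlying file, so two paths with the same pair are hard links to one file rather than copies. A POSIX-only sketch, with file_identity as an illustrative name:

#include <string>
#include <utility>

#include <sys/stat.h>
#include <sys/types.h>

// Sketch (POSIX-only): two paths with the same (device, inode) are the same file.
static std::pair<dev_t, ino_t> file_identity(const std::string &path)
{
    struct stat sb {};
    if (stat(path.c_str(), &sb) != 0) {
        return {0, 0};  // unreadable: caller should treat the identity as unknown
    }
    return {sb.st_dev, sb.st_ino};
}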
- MD5 is a bad choice of hash because there are actual known collisions, but if they had used EVP_blake2b512, for instance, then the chance of a false positive from a hash collision would be vanishingly smaller than the chance of a random false positive from a cosmic ray, so there would be no added value in comparing the file contents. – benrg, Apr 13, 2024 at 23:06
- Since we have to do a full comparison on a hash collision anyway, do we even need to hash the whole file? Sure, this will get you more clashes, but it is only significant in pathological cases; under normal conditions I can see this generating only slightly more. The hashing is just there to narrow down the number of files we have to do a full difference on. I am not sure how to choose an optimal size, but hashing only the first nK of a file will save some time. – Loki Astari, Apr 14, 2024 at 17:16
- @Martin, it's quite common for files to share a common prefix, so we could improve on that by hashing a subset of each file (perhaps 1K every 1M or something like that). Perhaps even fold the file length into the hash? – Toby Speight, Apr 15, 2024 at 10:51