It requires no explanation, so here's the code.
Code:
#include <filesystem>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <optional>
#include <sstream>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

#include <openssl/evp.h>
static std::optional<std::string> compute_file_hash(const std::string &unhashed)
{
    unsigned char hash[EVP_MAX_MD_SIZE];
    unsigned int hash_len{0};

    EVP_MD_CTX *context = EVP_MD_CTX_new();
    if (context == nullptr ||
        EVP_DigestInit_ex(context, EVP_md5(), nullptr) == 0 ||
        EVP_DigestUpdate(context, unhashed.c_str(), unhashed.length()) == 0 ||
        EVP_DigestFinal_ex(context, hash, &hash_len) == 0) {
        EVP_MD_CTX_free(context);
        return std::nullopt;
    }

    std::stringstream ss{};
    for (unsigned int i = 0; i < hash_len; ++i) {
        ss << std::hex << std::setw(2) << std::setfill('0')
           << static_cast<int>(hash[i]);
    }

    EVP_MD_CTX_free(context);
    return ss.str();
}

static void find_duplicates(
    std::unordered_map<std::string, std::vector<std::string>> &file_hashes,
    std::string_view arg)
{
    for (const auto &entry :
         std::filesystem::recursive_directory_iterator(arg)) {
        if (entry.is_regular_file() && !entry.is_symlink()) {
            std::string entry_path{entry.path().string()};
            std::ifstream ifs(entry_path);
            std::string content((std::istreambuf_iterator<char>(ifs)),
                                (std::istreambuf_iterator<char>()));
            const auto res = compute_file_hash(content);
            /* Should we take an openssl failure to be fatal, or just
             * continue on to the next file? */
            if (res != std::nullopt) {
                const auto it = file_hashes.find(res.value());
                it != file_hashes.end()
                    ? it->second.push_back(entry_path)
                    : file_hashes[res.value()].push_back(entry_path);
            }
        }
    }
}

static void print_duplicates(
    std::unordered_map<std::string, std::vector<std::string>> &file_hashes)
{
    for (const auto &[key, value] : file_hashes) {
        if (value.size() > 1) {
            std::cout << key << "\n";
            for (const auto &path : value) {
                std::cout << "\t" << path << "\n";
            }
        }
    }
}

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; ++i) {
        std::unordered_map<std::string, std::vector<std::string>> file_hashes{};
        find_duplicates(file_hashes, argv[i]);
        print_duplicates(file_hashes);
    }
}
Formatted with clang-format and built with:
g++-13 -std=c++23 -Wall -Werror -Wpedantic -O2 dedup.cpp -o dedup -lcrypto
Review Request:
Style, bad practices, et cetera. How do I make it more robust?
It likely won't perform well for large files, and files may change while the program is scanning them.
1 Answer
There's a clear bug in that files are considered to be equal if they have the same MD5 hash. Since hash collisions are definitely a thing, files with equal hashes need to be further compared to determine whether their contents are actually identical.
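A minimal sketch of that confirmation step might look like this (untested; reading both files in fixed-size chunks, where the 64 KiB buffer size and the same_contents name are arbitrary choices):

#include <algorithm>
#include <array>
#include <fstream>
#include <string>

// Sketch: confirm a hash match by comparing the two files byte-by-byte.
static bool same_contents(const std::string &a, const std::string &b)
{
    std::ifstream fa(a, std::ios::binary);
    std::ifstream fb(b, std::ios::binary);
    if (!fa || !fb) {
        return false;  // unreadable: report it and treat as "not a duplicate"
    }
    std::array<char, 64 * 1024> buf_a;
    std::array<char, 64 * 1024> buf_b;
    for (;;) {
        fa.read(buf_a.data(), static_cast<std::streamsize>(buf_a.size()));
        fb.read(buf_b.data(), static_cast<std::streamsize>(buf_b.size()));
        const std::streamsize na = fa.gcount();
        const std::streamsize nb = fb.gcount();
        if (na != nb ||
            !std::equal(buf_a.data(), buf_a.data() + na, buf_b.data())) {
            return false;
        }
        if (na == 0) {
            return true;  // both files ended at the same point with equal bytes
        }
    }
}

Checking the file sizes up front (std::filesystem::file_size) would let us skip most of these comparisons entirely.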
The speed of this program is quite poor. There are a number of improvements that will make it more efficient:
I don't think you need to read each file into a std::string to hash it. You should be able to do it in chunks, which will reduce the memory footprint of your program. We can use a local array for the buffer, which will eliminate most of the memory allocations.

If you're willing to go slightly non-portable, then consider memory-mapping each file when possible (consider using Boost so you don't have to implement it for each platform) and fall back to reading only on platforms that don't provide mapped files. That will greatly reduce the amount of data copying that takes place.
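A rough sketch of the chunked approach (hash_file_chunked is just an illustrative name, the 64 KiB buffer is arbitrary, and the raw digest bytes are returned rather than hex, a choice that is picked up again below):

#include <array>
#include <fstream>
#include <optional>
#include <string>

#include <openssl/evp.h>

// Sketch: hash a file in fixed-size chunks instead of slurping it whole.
static std::optional<std::string> hash_file_chunked(const std::string &path)
{
    std::ifstream ifs(path, std::ios::binary);
    if (!ifs) {
        return std::nullopt;
    }

    EVP_MD_CTX *context = EVP_MD_CTX_new();
    if (context == nullptr ||
        EVP_DigestInit_ex(context, EVP_md5(), nullptr) == 0) {
        EVP_MD_CTX_free(context);
        return std::nullopt;
    }

    std::array<char, 64 * 1024> buffer;  // local buffer: no per-file allocation
    while (ifs) {
        ifs.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        const std::streamsize n = ifs.gcount();
        if (n > 0 && EVP_DigestUpdate(context, buffer.data(),
                                      static_cast<std::size_t>(n)) == 0) {
            EVP_MD_CTX_free(context);
            return std::nullopt;
        }
    }

    unsigned char digest[EVP_MAX_MD_SIZE];
    unsigned int digest_len{0};
    if (EVP_DigestFinal_ex(context, digest, &digest_len) == 0) {
        EVP_MD_CTX_free(context);
        return std::nullopt;
    }
    EVP_MD_CTX_free(context);

    // Return the raw digest bytes; hex conversion is discussed below.
    return std::string(reinterpret_cast<const char *>(digest), digest_len);
}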
This implementation is likely to be limited by filesystem throughput; consider using multiple threads to do the hashing in parallel. You could use a work-queue arrangement for hashing files; perhaps also for traversing the filesystem hierarchy.
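One fairly simple arrangement (a sketch, not tuned; it uses a shared counter rather than a full work queue, assumes a hash_file_chunked() like the sketch above, and hash_in_parallel is just an illustrative name):

#include <algorithm>
#include <atomic>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// Sketch: hash a pre-collected list of paths on several threads.
static void hash_in_parallel(
    const std::vector<std::string> &paths,
    std::unordered_map<std::string, std::vector<std::string>> &file_hashes)
{
    std::atomic<std::size_t> next{0};
    std::mutex map_mutex;
    const unsigned n_threads =
        std::max(1u, std::thread::hardware_concurrency());

    std::vector<std::jthread> workers;
    for (unsigned t = 0; t < n_threads; ++t) {
        workers.emplace_back([&] {
            for (std::size_t i = next++; i < paths.size(); i = next++) {
                const auto digest = hash_file_chunked(paths[i]);
                if (digest) {
                    const std::lock_guard<std::mutex> lock(map_mutex);
                    file_hashes[*digest].push_back(paths[i]);
                }
            }
        });
    }
    // std::jthread joins automatically when workers goes out of scope.
}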
I don't think we need to create a hex representation of every hash. If we use the raw hash as key in the map, then we only need to hexify the ones that we print.
In other words, we can do something like:
std::basic_string<unsigned char> hash(EVP_MAX_MD_SIZE, '\0');
hash.resize(hash_len);
return hash;
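where EVP_DigestFinal_ex() writes straight into hash.data() and sets hash_len. A small helper can then hex-encode only the digests that actually get printed, for example (a sketch; to_hex is just an illustrative name). One caveat: std::hash has no standard specialization for std::basic_string<unsigned char>, so using it directly as an unordered_map key needs a custom hasher, or the raw bytes can be held in an ordinary std::string instead.

#include <iomanip>
#include <sstream>
#include <string>

// Sketch: hex-encode a raw digest only when we are about to print it.
static std::string to_hex(const std::basic_string<unsigned char> &digest)
{
    std::ostringstream oss;
    for (const unsigned char byte : digest) {
        oss << std::hex << std::setw(2) << std::setfill('0')
            << static_cast<int>(byte);
    }
    return oss.str();
}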
It's inefficient to allocate and free a new digest context for each file. We should be able to re-use a single context (or one per thread) if we EVP_MD_CTX_reset() it at the start of each file.
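Something along these lines (a sketch; hash_many is a placeholder name and the digest calls are elided):

#include <fstream>
#include <string>
#include <vector>

#include <openssl/evp.h>

// Sketch: one digest context for the whole run, reset before each file.
static void hash_many(const std::vector<std::string> &paths)
{
    EVP_MD_CTX *context = EVP_MD_CTX_new();
    if (context == nullptr) {
        return;
    }
    for (const auto &path : paths) {
        EVP_MD_CTX_reset(context);  // drop any state left from the previous file
        if (EVP_DigestInit_ex(context, EVP_md5(), nullptr) == 0) {
            continue;  // report to std::cerr and skip this file
        }
        std::ifstream ifs(path, std::ios::binary);
        // ... feed ifs to EVP_DigestUpdate() in chunks, then EVP_DigestFinal_ex() ...
    }
    EVP_MD_CTX_free(context);
}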
Instead of mapping each hash to a vector of filenames, consider using std::unordered_multimap, since the order of the entries is not significant.
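A sketch of what printing could look like with a multimap: entries with equal keys are stored adjacently, so equal_range() walks each group. The key is assumed here to be a printable digest string; with raw bytes, hex-encode it before printing.

#include <iostream>
#include <iterator>
#include <string>
#include <unordered_map>

using hash_map = std::unordered_multimap<std::string, std::string>;

// Sketch: one map entry per file, grouped by digest only when printing.
static void print_duplicates(const hash_map &file_hashes)
{
    for (auto it = file_hashes.begin(); it != file_hashes.end();) {
        const auto [first, last] = file_hashes.equal_range(it->first);
        if (std::distance(first, last) > 1) {
            std::cout << it->first << "\n";
            for (auto match = first; match != last; ++match) {
                std::cout << "\t" << match->second << "\n";
            }
        }
        it = last;  // jump to the next group
    }
}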
If any of the digest operations fails, we just ignore that file. We should at least emit a warning to std::cerr so that the user knows the scan is incomplete.
We also ignore unreadable files and directories - we should similarly inform the user that we are skipping those.
A usability concern - we do nothing if no arguments are passed. That's probably not what's wanted. Consider an error message or a sensible default (i.e. the current directory).
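For example (a sketch that keeps the existing per-argument behaviour and reuses the original find_duplicates()/print_duplicates()):

// Sketch: fall back to the current directory when no arguments are given.
int main(int argc, char **argv)
{
    std::vector<std::string> roots(argv + 1, argv + argc);
    if (roots.empty()) {
        roots.emplace_back(".");
    }
    for (const auto &root : roots) {
        std::unordered_map<std::string, std::vector<std::string>> file_hashes{};
        find_duplicates(file_hashes, root);
        print_duplicates(file_hashes);
    }
}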
We might want to treat matches specially if they are just different names for the same file ("hard links") rather than actual copies. That will require facilities outside the standard library.
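On POSIX, for instance, the (device, inode) pair from stat() identifies the underlying file, so two paths with the same pair are hard links to one file rather than copies. A POSIX-only sketch, with file_identity as an illustrative name:

#include <string>
#include <utility>

#include <sys/stat.h>
#include <sys/types.h>

// Sketch (POSIX-only): two paths with the same (device, inode) are the same file.
static std::pair<dev_t, ino_t> file_identity(const std::string &path)
{
    struct stat sb {};
    if (stat(path.c_str(), &sb) != 0) {
        return {0, 0};  // unreadable: caller should treat the identity as unknown
    }
    return {sb.st_dev, sb.st_ino};
}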
- MD5 is a bad choice of hash because there are actual known collisions, but if they had used EVP_blake2b512, for instance, then the chance of a false positive from a hash collision would be vanishingly smaller than the chance of a random false positive from a cosmic ray, so there would be no added value in comparing the file contents. – benrg, Apr 13, 2024 at 23:06
- Since we have to do a full comparison on a hash collision anyway, do we even need to hash the whole file? Sure, this will get you more clashes, but it is only significant in pathological cases; under normal conditions I can see this generating only slightly more. The hashing is just there to narrow down the number of files we have to do a full difference on. I am not sure how to choose an optimal size, but hashing only the first nK of a file will save some time. – Loki Astari, Apr 14, 2024 at 17:16
- @Martin, it's quite common for files to share a common prefix, so we could improve on that by hashing a subset of each file (perhaps 1K every 1M or something like that). Perhaps even fold the file length into the hash? – Toby Speight, Apr 15, 2024 at 10:51