What's changed since the first version:
Instead of hashing each and every file, the sizes of the files are first compared, and if those are equal, the files are then hashed.
A more collision-resistant hash function, EVP_blake2b512(), is used in place of EVP_md5().
Files are also read in chunks now. (Previously each file was read into a string.)
Code:
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <filesystem>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <optional>
#include <sstream>
#include <string>
#include <string_view>
#include <system_error>
#include <unordered_map>
#include <openssl/evp.h>
// Hash the contents of fpath with BLAKE2b-512, returning the raw digest
// bytes, or std::nullopt on failure.
static std::optional<std::string> compute_file_hash(EVP_MD_CTX* context,
                                                    const std::string& fpath)
{
EVP_MD_CTX_reset(context);
unsigned char hash[EVP_MAX_MD_SIZE];
unsigned int hash_len {0};
if (EVP_DigestInit_ex(context, EVP_blake2b512(), nullptr) == 0) {
return std::nullopt;
}
    std::ifstream file {fpath, std::ios::binary}; // binary: hash every byte as-is
if (!file) {
std::cerr << "error: failed to read " << std::quoted(fpath) << ": "
<< std::error_code {errno, std::generic_category()}.message()
<< "\n";
return std::nullopt;
}
constexpr std::size_t bufsize {65536};
char buf[bufsize];
while (file) {
file.read(buf, bufsize);
const std::size_t count {static_cast<std::size_t>(file.gcount())};
if (count == 0) {
break;
}
if (EVP_DigestUpdate(context, buf, count) == 0) {
return std::nullopt;
}
}
if (EVP_DigestFinal_ex(context, hash, &hash_len) == 0) {
return std::nullopt;
}
return std::string {reinterpret_cast<char*>(hash), hash_len};
}
// Record every regular file under dir in size_map, keyed by file size.
static bool populate_size_map(std::unordered_multimap<std::uintmax_t, std::string>& size_map,
                              const std::string_view& dir)
{
for (const auto& entry :
std::filesystem::recursive_directory_iterator(dir)) {
if (entry.is_symlink()
|| entry.is_fifo()
|| entry.is_socket()
|| entry.is_other()) {
std::cerr << "Skipping entry: " << entry.path().string() << ".\n";
continue;
}
if (entry.is_regular_file()) {
size_map.emplace(std::filesystem::file_size(entry.path()),
entry.path().string());
}
}
return true;
}
// Hash only the files whose size collides with at least one other file's.
static bool populate_hash_map(const std::unordered_multimap<std::uintmax_t, std::string>& size_map,
                              std::unordered_multimap<std::string, std::string>& hash_map)
{
EVP_MD_CTX* const context {EVP_MD_CTX_new()};
if (context == nullptr) {
std::cerr << "error: "
<< std::error_code {errno, std::generic_category()}.message()
<< "\n";
return false;
}
for (auto it {size_map.begin()}; it != size_map.end(); ++it) {
const auto res {size_map.equal_range(it->first)};
if (std::distance(res.first, res.second) > 1) {
const auto hash {compute_file_hash(context, it->second)};
if (hash != std::nullopt) {
hash_map.emplace(hash.value(), it->second);
}
}
}
EVP_MD_CTX_free(context);
return true;
}
// Convert a raw digest to a lowercase hex string.
static std::string str_to_hex(const std::string& str)
{
    std::stringstream ss {};
    for (const unsigned char c : str) {
        ss << std::hex << std::setw(2) << std::setfill('0')
           << static_cast<int>(c);
    }
    return ss.str();
}
// Print each digest shared by two or more paths, followed by those paths.
static void print_duplicates(std::unordered_multimap<std::string, std::string>& hash_map)
{
std::string curr_key {""};
for (auto it {hash_map.begin()}; it != hash_map.end(); ++it) {
if (curr_key == it->first) {
continue;
}
else {
curr_key = it->first;
}
const auto res {hash_map.equal_range(it->first)};
if (std::distance(res.first, res.second) > 1) {
std::cout << str_to_hex(it->first) << "\n";
for (auto it {res.first}; it != res.second; ++it) {
std::cout << "\t" << it->second << "\n";
}
}
}
}
int main(int argc, char** argv)
{
if (argc < 2) {
std::cerr << "Error: no argument provided.\n"
<< "Usage: " << (argv[0] ? argv[0] : "dedup ")
<< "file1 file2 ...\n";
return EXIT_FAILURE;
}
std::unordered_multimap<std::uintmax_t, std::string> size_map {};
std::unordered_multimap<std::string, std::string> hash_map {};
for (int i {1}; i < argc; ++i) {
if (!populate_size_map(size_map, argv[i])
|| !populate_hash_map(size_map, hash_map)) {
return EXIT_FAILURE;
}
}
print_duplicates(hash_map);
}
I haven't looked at using multiple threads yet.
Review Request:
Have I introduced any new bugs/problems into the code?
Style, incorrect data-structures, bad practices.
Anything, everything.
- If and when you implement a pool of hasher threads, use a heap / pqueue. Set priority equal to file length. That way you do the big ones first, minimizing stragglers. // Also, it would be worthwhile to cache hash values, with modified mtime or inode number causing a cache invalidate. – J_H, Apr 15, 2024 at 17:16
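A minimal sketch of the queue that comment describes (illustrative only, not code from the thread): a max-heap keyed on file size, so worker threads pick up the largest files first:

#include <cstdint>
#include <queue>
#include <string>
#include <vector>

struct pending_file {
    std::uintmax_t size;
    std::string path;
};

struct smaller_size {
    bool operator()(const pending_file& a, const pending_file& b) const
    {
        return a.size < b.size; // top() is then the largest file
    }
};

// Workers would pop from this (under a mutex) and hash, biggest files first.
std::priority_queue<pending_file, std::vector<pending_file>, smaller_size> work;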
1 Answer
Error handling
There are several things wrong with the way you handle errors. Error handling can be an annoying thing to do, but it is very important you do it correctly. Here are some of the issues:
- std::optional is not really meant to convey errors. I recommend that you either throw exceptions derived from std::runtime_error, use C++23's std::expected, or a library implementing the latter if you cannot use C++23 yet (a sketch of the std::expected route follows this list).
- Be consistent. In compute_file_hash(), sometimes you print an error message to std::cerr, sometimes you do not.
- Don't ignore errors. You do this in various places. Consider that there might be an I/O error reading a file, even after opening it successfully. You ignore that scenario and just calculate the wrong hash value. Furthermore, in populate_hash_map(), you do check whether compute_file_hash() returns a std::nullopt, but if it does you just ignore that file. Is that really what you want? Also, even for regular files, std::filesystem::file_size() could throw an exception.
- If something cannot fail, don't return a value that might indicate an error code. For example, populate_size_map() returns a bool, but it will always be true. Just return void here.
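A minimal sketch of the std::expected route applied to compute_file_hash(), assuming a C++23 standard library; the hash_error enum is an illustrative choice, not part of the original code. Note that it also checks file.bad() after the read loop, so a mid-read I/O error no longer silently produces a wrong hash:

#include <expected>

enum class hash_error { init_failed, open_failed, read_failed, digest_failed };

static std::expected<std::string, hash_error>
compute_file_hash(EVP_MD_CTX* context, const std::string& fpath)
{
    EVP_MD_CTX_reset(context);
    if (EVP_DigestInit_ex(context, EVP_blake2b512(), nullptr) == 0) {
        return std::unexpected(hash_error::init_failed);
    }
    std::ifstream file {fpath, std::ios::binary};
    if (!file) {
        return std::unexpected(hash_error::open_failed);
    }
    constexpr std::size_t bufsize {65536};
    char buf[bufsize];
    while (file) {
        file.read(buf, bufsize);
        const auto count {static_cast<std::size_t>(file.gcount())};
        if (count == 0) {
            break;
        }
        if (EVP_DigestUpdate(context, buf, count) == 0) {
            return std::unexpected(hash_error::digest_failed);
        }
    }
    if (file.bad()) { // a real I/O error, not just end-of-file
        return std::unexpected(hash_error::read_failed);
    }
    unsigned char hash[EVP_MAX_MD_SIZE];
    unsigned int hash_len {0};
    if (EVP_DigestFinal_ex(context, hash, &hash_len) == 0) {
        return std::unexpected(hash_error::digest_failed);
    }
    return std::string {reinterpret_cast<char*>(hash), hash_len};
}

The caller can then branch on hash.error() instead of guessing what a std::nullopt meant.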
Use else where appropriate
In populate_size_map() you first check if entry is a symlink, fifo, socket or other file type, then you have another if-statement to check if it is a regular file. But are you sure you covered all the possible file types? What about block or character files?

Avoid these kinds of problems by only checking for one condition, and using else to handle all other cases without having to explicitly list them all.
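Applied to the loop body in populate_size_map(), that restructuring could look like this (the extra is_directory() check keeps the iterator's directory entries quiet, as in the original):

for (const auto& entry :
     std::filesystem::recursive_directory_iterator(dir)) {
    if (entry.is_regular_file()) {
        size_map.emplace(std::filesystem::file_size(entry.path()),
                         entry.path().string());
    } else if (!entry.is_directory()) {
        // Symlinks, fifos, sockets, device files, ... - anything not hashed.
        std::cerr << "Skipping entry: " << entry.path().string() << ".\n";
    }
}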
Prepare for multiple threads
Your code uses a single thread, and sometimes that's just good enough: most computers can hash roughly 1 gigabyte/second using Blake2b on a single core, and that's not much slower than the maximum speed a fast SSD can handle.
If you do want to parallelize though, then you want to remove as many sequential steps from your algorithm as possible (see also Amdahl's law). This can even have benefits for single-threaded code though.
You are first populating a map of sizes, then one of hashes, and only then checking whether there were duplicates. You can change your code to do all these steps at once for each file: first check the file size, and if it's not in a std::unordered_map of sizes, add it. If it was in there, hash the file that's already in the map and add it to a std::unordered_map of hashes, then hash the file you are currently checking and compare it against that map. If it's not a match, add it; if it is, you know you found a duplicate and can already print the names.
For multi-threaded code, this has the advantage that you are not blocked on populating maps, threads can immediately start processing files. For single-threaded code, the advantage is that you don't have to wait so long before the first results can be printed.
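A rough single-threaded sketch of that shape, reusing compute_file_hash() from the question; the map names and the cleared-string "already hashed" marker are illustrative choices, not the only way to do it:

std::unordered_map<std::uintmax_t, std::string> first_by_size;
std::unordered_map<std::string, std::string> first_by_hash;

void check_file(EVP_MD_CTX* context, const std::string& path,
                std::uintmax_t size)
{
    const auto [it, inserted] = first_by_size.try_emplace(size, path);
    if (inserted) {
        return; // first file of this size, nothing to compare against yet
    }
    if (!it->second.empty()) {
        // Second file of this size: hash the one seen earlier, exactly once.
        if (const auto h {compute_file_hash(context, it->second)}) {
            first_by_hash.emplace(*h, it->second);
        }
        it->second.clear(); // mark this size as already hashed
    }
    if (const auto h {compute_file_hash(context, path)}) {
        const auto [match, fresh] = first_by_hash.try_emplace(*h, path);
        if (!fresh) {
            std::cout << path << " duplicates " << match->second << "\n";
        }
    }
}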
What about duplicate paths?
What if you run your program and provide it with the same filename twice on the command line? Or what if you pass a parent directory and one of its subdirectories? You probably want to catch this situation and only check each unique path once.
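One possible sketch (not in the original code): canonicalize each path and remember what has already been visited, for example in a std::unordered_set:

#include <unordered_set>

std::unordered_set<std::string> seen; // canonical paths already processed

for (const auto& entry :
     std::filesystem::recursive_directory_iterator(dir)) {
    std::error_code ec;
    // canonical() resolves symlinks, "..", and relative components, so the
    // same file reached through two different paths maps to a single key.
    const auto canon {std::filesystem::canonical(entry.path(), ec)};
    if (ec || !seen.insert(canon.string()).second) {
        continue; // unresolvable, or already visited via another path
    }
    // ... existing size/hash handling ...
}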
- The top-voted answer for std::optional on StackOverflow suggested using it for such a situation: stackoverflow.com/a/16861022 If this is not an actual use case, what are? – Madagascar, Apr 15, 2024 at 20:50
- For many of those examples for std::optional it is not unexpected to return std::nullopt. If you do a lookup in a container and you don't find the item, is that an error or just something that can be expected to happen (even if infrequently)? std::expected is much better than std::optional as it also allows you to return an error code, so the caller can decide what to do based on the actual error. The StackOverflow post you linked to is also from 2013, before std::optional was even standardized, and when there wasn't a std::experimental::expected. – G. Sliepen, Apr 15, 2024 at 21:19
- A std::unordered_set of paths would certainly work. – G. Sliepen, Apr 15, 2024 at 21:22
- 1 GB/s is only like 30% of what my (normal, not very fast) SSD can handle, and less than 10% of a fast one. – user555045, Apr 16, 2024 at 5:22
- The recommendation to throw is nice, but note that OP didn't encapsulate the EVP_MD_CTX in a class and frees it manually, and the lack of RAII is not going to play well with exceptions (or early returns). – Matthieu M., Apr 16, 2024 at 9:55