What's changed since the first version:
Instead of hashing each and every file, the sizes of the files are first compared, and if those are equal, the files are then hashed.
A more collision-resistant hash function, EVP_blake2b512(), is used in place of EVP_md5().
Files are also read in chunks now. (Previously each file was read into a string.)
Code:
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <filesystem>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <optional>
#include <sstream>
#include <string>
#include <string_view>
#include <system_error>
#include <unordered_map>
#include <openssl/evp.h>
// Hash the contents of fpath with BLAKE2b-512, returning the raw digest
// bytes, or std::nullopt on failure.
static std::optional<std::string> compute_file_hash(EVP_MD_CTX* context,
                                                    const std::string& fpath)
{
EVP_MD_CTX_reset(context);
unsigned char hash[EVP_MAX_MD_SIZE];
unsigned int hash_len {0};
if (EVP_DigestInit_ex(context, EVP_blake2b512(), nullptr) == 0) {
return std::nullopt;
}
    std::ifstream file {fpath, std::ios::binary}; // binary: hash every byte as-is
if (!file) {
std::cerr << "error: failed to read " << std::quoted(fpath) << ": "
<< std::error_code {errno, std::generic_category()}.message()
<< "\n";
return std::nullopt;
}
constexpr std::size_t bufsize {65536};
char buf[bufsize];
while (file) {
file.read(buf, bufsize);
const std::size_t count {static_cast<std::size_t>(file.gcount())};
if (count == 0) {
break;
}
if (EVP_DigestUpdate(context, buf, count) == 0) {
return std::nullopt;
}
}
if (EVP_DigestFinal_ex(context, hash, &hash_len) == 0) {
return std::nullopt;
}
return std::string {reinterpret_cast<char*>(hash), hash_len};
}
// Record every regular file under dir in size_map, keyed by file size.
static bool populate_size_map(std::unordered_multimap<std::uintmax_t, std::string>& size_map,
                              const std::string_view& dir)
{
for (const auto& entry :
std::filesystem::recursive_directory_iterator(dir)) {
if (entry.is_symlink()
|| entry.is_fifo()
|| entry.is_socket()
|| entry.is_other()) {
std::cerr << "Skipping entry: " << entry.path().string() << ".\n";
continue;
}
if (entry.is_regular_file()) {
size_map.emplace(std::filesystem::file_size(entry.path()),
entry.path().string());
}
}
return true;
}
// Hash only the files whose size collides with at least one other file's.
static bool populate_hash_map(const std::unordered_multimap<std::uintmax_t, std::string>& size_map,
                              std::unordered_multimap<std::string, std::string>& hash_map)
{
EVP_MD_CTX* const context {EVP_MD_CTX_new()};
if (context == nullptr) {
std::cerr << "error: "
<< std::error_code {errno, std::generic_category()}.message()
<< "\n";
return false;
}
for (auto it {size_map.begin()}; it != size_map.end(); ++it) {
const auto res {size_map.equal_range(it->first)};
if (std::distance(res.first, res.second) > 1) {
const auto hash {compute_file_hash(context, it->second)};
if (hash != std::nullopt) {
hash_map.emplace(hash.value(), it->second);
}
}
}
EVP_MD_CTX_free(context);
return true;
}
// Convert a raw digest to a lowercase hex string.
static std::string str_to_hex(const std::string& str)
{
    std::stringstream ss {};
    for (const unsigned char c : str) {
        ss << std::hex << std::setw(2) << std::setfill('0')
           << static_cast<int>(c);
    }
    return ss.str();
}
// Print each digest shared by two or more paths, followed by those paths.
static void print_duplicates(std::unordered_multimap<std::string, std::string>& hash_map)
{
std::string curr_key {""};
for (auto it {hash_map.begin()}; it != hash_map.end(); ++it) {
if (curr_key == it->first) {
continue;
}
else {
curr_key = it->first;
}
const auto res {hash_map.equal_range(it->first)};
if (std::distance(res.first, res.second) > 1) {
std::cout << str_to_hex(it->first) << "\n";
for (auto it {res.first}; it != res.second; ++it) {
std::cout << "\t" << it->second << "\n";
}
}
}
}
int main(int argc, char** argv)
{
if (argc < 2) {
std::cerr << "Error: no argument provided.\n"
<< "Usage: " << (argv[0] ? argv[0] : "dedup ")
<< "file1 file2 ...\n";
return EXIT_FAILURE;
}
std::unordered_multimap<std::uintmax_t, std::string> size_map {};
std::unordered_multimap<std::string, std::string> hash_map {};
for (int i {1}; i < argc; ++i) {
if (!populate_size_map(size_map, argv[i])
|| !populate_hash_map(size_map, hash_map)) {
return EXIT_FAILURE;
}
}
print_duplicates(hash_map);
}
I haven't looked at using multiple threads yet.
Review Request:
Have I introduced any new bugs/problems into the code?
Style, incorrect data-structures, bad practices.
Anything, everything.
- If and when you implement a pool of hasher threads, use a heap / pqueue. Set priority equal to file length. That way you do the big ones first, minimizing stragglers. // Also, it would be worthwhile to cache hash values, with modified mtime or inode number causing a cache invalidate. – J_H, Apr 15, 2024 at 17:16
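A minimal sketch of the queue that comment describes (illustrative only, not code from the thread): a max-heap keyed on file size, so worker threads pick up the largest files first:

#include <cstdint>
#include <queue>
#include <string>
#include <vector>

struct pending_file {
    std::uintmax_t size;
    std::string path;
};

struct smaller_size {
    bool operator()(const pending_file& a, const pending_file& b) const
    {
        return a.size < b.size; // top() is then the largest file
    }
};

// Workers would pop from this (under a mutex) and hash, biggest files first.
std::priority_queue<pending_file, std::vector<pending_file>, smaller_size> work;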
1 Answer
Error handling
There are several things wrong with the way you handle errors. Error handling can be an annoying thing to do, but it is very important you do it correctly. Here are some of the issues:
- std::optional is not really meant to convey errors. I recommend that you either throw exceptions derived from std::runtime_error, use C++23's std::expected, or a library implementing the latter if you cannot use C++23 yet (a sketch of the std::expected route follows this list).
- Be consistent. In compute_file_hash(), sometimes you print an error message to std::cerr, sometimes you do not.
- Don't ignore errors. You do this in various places. Consider that there might be an I/O error reading a file, even after opening it successfully. You ignore that scenario and just calculate the wrong hash value. Furthermore, in populate_hash_map(), you do check whether compute_file_hash() returns a std::nullopt, but if it does you just ignore that file. Is that really what you want? Also, even for regular files, std::filesystem::file_size() could throw an exception.
- If something cannot fail, don't return a value that might indicate an error code. For example, populate_size_map() returns a bool, but it will always be true. Just return void here.
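A minimal sketch of the std::expected route applied to compute_file_hash(), assuming a C++23 standard library; the hash_error enum is an illustrative choice, not part of the original code. Note that it also checks file.bad() after the read loop, so a mid-read I/O error no longer silently produces a wrong hash:

#include <expected>

enum class hash_error { init_failed, open_failed, read_failed, digest_failed };

static std::expected<std::string, hash_error>
compute_file_hash(EVP_MD_CTX* context, const std::string& fpath)
{
    EVP_MD_CTX_reset(context);
    if (EVP_DigestInit_ex(context, EVP_blake2b512(), nullptr) == 0) {
        return std::unexpected(hash_error::init_failed);
    }
    std::ifstream file {fpath, std::ios::binary};
    if (!file) {
        return std::unexpected(hash_error::open_failed);
    }
    constexpr std::size_t bufsize {65536};
    char buf[bufsize];
    while (file) {
        file.read(buf, bufsize);
        const auto count {static_cast<std::size_t>(file.gcount())};
        if (count == 0) {
            break;
        }
        if (EVP_DigestUpdate(context, buf, count) == 0) {
            return std::unexpected(hash_error::digest_failed);
        }
    }
    if (file.bad()) { // a real I/O error, not just end-of-file
        return std::unexpected(hash_error::read_failed);
    }
    unsigned char hash[EVP_MAX_MD_SIZE];
    unsigned int hash_len {0};
    if (EVP_DigestFinal_ex(context, hash, &hash_len) == 0) {
        return std::unexpected(hash_error::digest_failed);
    }
    return std::string {reinterpret_cast<char*>(hash), hash_len};
}

The caller can then branch on hash.error() instead of guessing what a std::nullopt meant.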
Use else where appropriate
In populate_size_map() you first check if entry is a symlink, fifo, socket or other file type, then you have another if-statement to check if it is a regular file. But are you sure you covered all the possible file types? What about block or character files?

Avoid these kinds of problems by only checking for one condition, and using else to handle all other cases without having to explicitly list them all.
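Applied to the loop body in populate_size_map(), that restructuring could look like this (the extra is_directory() check keeps the iterator's directory entries quiet, as in the original):

for (const auto& entry :
     std::filesystem::recursive_directory_iterator(dir)) {
    if (entry.is_regular_file()) {
        size_map.emplace(std::filesystem::file_size(entry.path()),
                         entry.path().string());
    } else if (!entry.is_directory()) {
        // Symlinks, fifos, sockets, device files, ... - anything not hashed.
        std::cerr << "Skipping entry: " << entry.path().string() << ".\n";
    }
}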
Prepare for multiple threads
Your code uses a single thread, and sometimes that's just good enough: most computers can hash roughly 1 gigabyte/second using Blake2b on a single core, and that's not much slower than the maximum speed a fast SSD can handle.
If you do want to parallelize though, then you want to remove as many sequential steps from your algorithm as possible (see also Amdahl's law). This can even have benefits for single-threaded code though.
You are first populating a map of sizes, then one of hashes, and only then checking whether there were duplicates. You can change your code to do all these steps at once for each file: first check the file size, and if it's not in a std::unordered_map of sizes, add it. If it was in there, hash the file that's already in the map and add it to a std::unordered_map of hashes, then hash the file you are currently checking and compare it against that map. If it's not a match, add it; if it is, you know you found a duplicate and can already print the names.
For multi-threaded code, this has the advantage that you are not blocked on populating maps, threads can immediately start processing files. For single-threaded code, the advantage is that you don't have to wait so long before the first results can be printed.
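A rough single-threaded sketch of that shape, reusing compute_file_hash() from the question; the map names and the cleared-string "already hashed" marker are illustrative choices, not the only way to do it:

std::unordered_map<std::uintmax_t, std::string> first_by_size;
std::unordered_map<std::string, std::string> first_by_hash;

void check_file(EVP_MD_CTX* context, const std::string& path,
                std::uintmax_t size)
{
    const auto [it, inserted] = first_by_size.try_emplace(size, path);
    if (inserted) {
        return; // first file of this size, nothing to compare against yet
    }
    if (!it->second.empty()) {
        // Second file of this size: hash the one seen earlier, exactly once.
        if (const auto h {compute_file_hash(context, it->second)}) {
            first_by_hash.emplace(*h, it->second);
        }
        it->second.clear(); // mark this size as already hashed
    }
    if (const auto h {compute_file_hash(context, path)}) {
        const auto [match, fresh] = first_by_hash.try_emplace(*h, path);
        if (!fresh) {
            std::cout << path << " duplicates " << match->second << "\n";
        }
    }
}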
What about duplicate paths?
What if you run your program and provide it with the same filename twice on the command line? Or what if you pass a parent directory and one of its subdirectories? You probably want to catch this situation and only check each unique path once.
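One possible sketch (not in the original code): canonicalize each path and remember what has already been visited, for example in a std::unordered_set:

#include <unordered_set>

std::unordered_set<std::string> seen; // canonical paths already processed

for (const auto& entry :
     std::filesystem::recursive_directory_iterator(dir)) {
    std::error_code ec;
    // canonical() resolves symlinks, "..", and relative components, so the
    // same file reached through two different paths maps to a single key.
    const auto canon {std::filesystem::canonical(entry.path(), ec)};
    if (ec || !seen.insert(canon.string()).second) {
        continue; // unresolvable, or already visited via another path
    }
    // ... existing size/hash handling ...
}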
- The top-voted answer for std::optional on StackOverflow suggested using it for such a situation: stackoverflow.com/a/16861022 If this is not an actual use case, what are? – Madagascar, Apr 15, 2024 at 20:50
- For many of those examples for std::optional it is not unexpected to return std::nullopt. If you do a lookup in a container and you don't find the item, is that an error or just something that can be expected to happen (even if infrequently)? std::expected is much better than std::optional as it also allows you to return an error code, so the caller can decide what to do based on the actual error. The StackOverflow post you linked to is also from 2013, before std::optional was even standardized, and when there wasn't a std::experimental::expected. – G. Sliepen, Apr 15, 2024 at 21:19
- A std::unordered_set of paths would certainly work. – G. Sliepen, Apr 15, 2024 at 21:22
- 1 GB/s is only like 30% of what my (normal, not very fast) SSD can handle, and less than 10% of a fast one. – user555045, Apr 16, 2024 at 5:22
- The recommendation to throw is nice, but note that OP didn't encapsulate the EVP_MD_CTX in a class and frees it manually, and the lack of RAII is not going to play well with exceptions (or early returns). – Matthieu M., Apr 16, 2024 at 9:55