The code takes in two arguments, a path to search and an output file name. It searches the supplied path (including sub directories) and writes out all files that have the same size as any other files in the path. This will form part of a multi-step process to identify duplicate files within the file system.
Example output, with 3 files 29 bytes long, 3 files 113 bytes long and 2 files 114 bytes long:
29 c:\Program Files\Git\mingw64\libexec\git-core\mergetools\gvimdiff 29 c:\Program Files\Git\mingw64\libexec\git-core\mergetools\gvimdiff2 29 c:\Program Files\Git\mingw64\libexec\git-core\mergetools\gvimdiff3 113 c:\Program Files\Git\mingw64\lib\tcl8.6\tzdata\Etc\GMT-7 113 c:\Program Files\Git\mingw64\lib\tcl8.6\tzdata\Etc\GMT-8 113 c:\Program Files\Git\mingw64\lib\tcl8.6\tzdata\Etc\GMT-9 114 c:\Program Files\Git\mingw64\lib\tcl8.6\tzdata\Etc\GMT+4 114 c:\Program Files\Git\mingw64\lib\tcl8.6\tzdata\Etc\GMT+5
#include "stdafx.h"
#include <iostream>
#include <fstream>
#include <filesystem>
#include <map>
#include <set>
using namespace std::experimental::filesystem;
std::map<uintmax_t, std::set<path>> build_filesize_map(const std::string &pathToSearch);
void write_possible_duplicate_files(const std::string &outputFileName, const std::map<uintmax_t, std::set<path>> &sizeMap);
int main(int argc, char *argv[])
{
if (argc != 3) {
std::cout << "Usage:" << argv[0] << " <folder to scan> <output file>";
return EXIT_FAILURE;
}
std::string pathToSearch(argv[1]);
std::string outputFileName(argv[2]);
auto sizeMap = build_filesize_map(pathToSearch);
// Write out a list of all files that share their size with other files
write_possible_duplicate_files(outputFileName, sizeMap);
}
// Search directory, any files with size > 0 are considered. They are added to a set
// within the map, which is keyed on the file size
std::map<uintmax_t, std::set<path>> build_filesize_map(const std::string &pathToSearch) {
auto sizeMap = std::map<uintmax_t, std::set<path>>();
for (recursive_directory_iterator next(pathToSearch), end; next != end; ++next) {
auto currentPath = next->path();
auto currentFileSize = is_regular_file(currentPath) ? file_size(currentPath) : 0;
if (0 != currentFileSize) {
auto existingEntry = sizeMap.find(currentFileSize);
if (existingEntry == sizeMap.end()) {
std::set<path> fileSet;
fileSet.emplace(currentPath);
sizeMap.emplace(currentFileSize, fileSet);
}
else {
existingEntry->second.emplace(currentPath);
}
}
}
return sizeMap;
}
// Write out a list of all files that share their size with other files
void write_possible_duplicate_files(const std::string &outputFileName, const std::map<uintmax_t, std::set<path>> &sizeMap) {
std::ofstream outputFile(outputFileName, std::ios_base::trunc);
for (auto sizeIterator = sizeMap.begin(); sizeIterator != sizeMap.end(); ++sizeIterator) {
if (sizeIterator->second.size() == 1) continue;
for (auto pathIteror = sizeIterator->second.begin(); pathIteror != sizeIterator->second.end(); ++pathIteror) {
outputFile << sizeIterator->first << "\t" << *pathIteror << std::endl;
}
}
}
Things I'm particularly interested in (anything else is welcome):
- Naming - I generally write C# these days and everywhere I've written C++ has had its own naming convention. Is there an industry standard or is consistency still the most important convention?
- STL usage - I've used Boost, RogueWave and lots of custom classes but have little experience with the STL. Specifically, I'm adding local instance to the collections, rather than
new
ing them up. Is that OK, or should I be adding pointers instead? - My
build_filesize_map
is returning amap
instance. Is that OK? Historically, I'd have new'd up the map and returned a pointer to it. Which is the better approach?
-
\$\begingroup\$ While a lot of other languages have a convention nowadays, C++ does not. There are a couple of best-practices around but those aren't set in stone. \$\endgroup\$Mast– Mast ♦2016年06月24日 08:48:34 +00:00Commented Jun 24, 2016 at 8:48
2 Answers 2
To complement the other review from Jan Korous, here are a few other things that may help you improve your code.
General portability
This code could be made portable if you omit the Windows-only include file #include "stdafx.h"
.
Be consistent with experimental portions
The std::experimental::filesystem
namespace has been accepted into the C++17 specification and so the corresponding namespace will become std::filesystem
and will be in #include <filesystem>
but until then, it should be #include <experimental/filesystem>
.
Consider using range-for syntax
The range-for syntax makes it much easier to write code simply. For instance, instead of the complicated loops within write_possible_duplicate_file()
, I'd write this instead:
for (const auto &item : sizeMap) {
if (item.second.size() > 1) {
for (const auto &file : item.second) {
outputFile << item.first << '\t' << file << '\n';
}
}
}
Prefer stream reference to filenames
Instead of having the write_possible_duplicate_files()
take a string for a filename, I would advocate changing that to both take and return a std::ostream
reference. This would allow the program to easily write to std::cout
if desired and more completely abstracts the purpose of the function.
Use a typedef
to simplify declarations
Instead of using the long form everywhere it appears, I'd recommend using a typedef
for the map type:
typedef std::map<uintmax_t, std::set<path>> MyMap;
Now one can simply refer to MyMap
everywhere the code had originally had a long std::map
declaration.
Think of the user
It may be that the user is eventually another program, but for now, if we are examining a list of possible duplicate files, I think I'd want to see the biggest files first so that eliminating one would have the biggest impact on freeing space. That could easily be done by making a small change to the map
typedef mentioned above.
typedef std::map<uintmax_t, std::set<path>, std::greater<uintmax_t>> MyMap;
Consider catching exceptions
The version of file_size()
that you're using can throw an exception (as, for example, if the file cannot be read with the current user's permissions) which will cause the program to abort with a somewhat cryptic (to the user) error message. Consider catching the exception and skipping the file or directory or at least emitting a user-centric message.
Simplify the code using in-place constructors
After the check assures that this is a new entry, the build_filesize_map()
code uses these three lines:
std::set<path> fileSet;
fileSet.emplace(currentPath);
sizeMap.emplace(currentFileSize, fileSet);
This is more succinctly written like this:
sizeMap.emplace(currentFileSize, std::set<path>{currentPath});
Clean nice code, reasonable data structure. Good job!
Don't worry too much about returning the std::map
as even historically RVO would help you.
See: https://en.wikipedia.org/wiki/Return_value_optimization
iterator naming
Since you asked about naming convention I have to admit I wouldn't name this iterator next
. More common name might be it
but something more explicit like dir_entry
would be probably even better.
It especially caught my eyes that "current path" is "next path" which is kind of strange.
for (recursive_directory_iterator next(pathToSearch), end; next != end; ++next) {
auto currentPath = next->path();
const
My opinion is that there's never enough const
. I would define all variables not intended to be changed later as const.
const auto currentPath = next->path();
const auto currentFileSize = is_regular_file(currentPath) ? file_size(currentPath) : 0;
if (0 != currentFileSize) {
const auto existingEntry = sizeMap.find(currentFileSize);
std::map::operator[]
Most of the time I am frustrated by map operator[]
behavior. You on the other hand are lucky as it might simplify your code.
auto existingEntry = sizeMap.find(currentFileSize);
if (existingEntry == sizeMap.end()) {
std::set<path> fileSet;
fileSet.emplace(currentPath);
sizeMap.emplace(currentFileSize, fileSet);
}
else {
existingEntry->second.emplace(currentPath);
}
could be done as
sizeMap[currentFileSize].emplace(currentPath)
Reason is that in case key given as argument to std::map::operator[]
is not found new pair is inserted as {key, default value of value_type}
which in your case means empty set.
See: http://en.cppreference.com/w/cpp/container/map/operator_at
iterating over map keys> 1
Since you are using an ordered map you might take advantage of that with std::map::lower_bound()
and skip the first part with count = 1.
See: http://en.cppreference.com/w/cpp/container/map/lower_bound
for (auto sizeIterator = sizeMap.begin(); sizeIterator != sizeMap.end(); ++sizeIterator) {
if (sizeIterator->second.size() == 1) continue;
for (auto pathIteror = sizeIterator->second.begin(); pathIteror != sizeIterator->second.end(); ++pathIteror) {
could be done like this
for (auto sizeIterator = sizeMap.lower_bound(1); sizeIterator != sizeMap.end(); ++sizeIterator) {
for (auto pathIteror = sizeIterator->second.begin(); pathIteror != sizeIterator->second.end(); ++pathIteror) {