Identify files within folder structure that have common file sizes

Question 1

The code takes in two arguments, a path to search and an output file name. It searches the supplied path (including sub directories) and writes out all files that have the same size as any other files in the path. This will form part of a multi-step process to identify duplicate files within the file system.

Example output, with 3 files 29 bytes long, 3 files 113 bytes long and 2 files 114 bytes long:

29 c:\Program Files\Git\mingw64\libexec\git-core\mergetools\gvimdiff
29 c:\Program Files\Git\mingw64\libexec\git-core\mergetools\gvimdiff2
29 c:\Program Files\Git\mingw64\libexec\git-core\mergetools\gvimdiff3
113 c:\Program Files\Git\mingw64\lib\tcl8.6\tzdata\Etc\GMT-7
113 c:\Program Files\Git\mingw64\lib\tcl8.6\tzdata\Etc\GMT-8
113 c:\Program Files\Git\mingw64\lib\tcl8.6\tzdata\Etc\GMT-9
114 c:\Program Files\Git\mingw64\lib\tcl8.6\tzdata\Etc\GMT+4
114 c:\Program Files\Git\mingw64\lib\tcl8.6\tzdata\Etc\GMT+5

#include "stdafx.h"
#include <iostream>
#include <fstream>
#include <filesystem>
#include <map>
#include <set>
using namespace std::experimental::filesystem;
std::map<uintmax_t, std::set<path>> build_filesize_map(const std::string &pathToSearch);
void write_possible_duplicate_files(const std::string &outputFileName, const std::map<uintmax_t, std::set<path>> &sizeMap);
int main(int argc, char *argv[])
{
 if (argc != 3) {
 std::cout << "Usage:" << argv[0] << " <folder to scan> <output file>";
 return EXIT_FAILURE;
 }
 std::string pathToSearch(argv[1]);
 std::string outputFileName(argv[2]);
 auto sizeMap = build_filesize_map(pathToSearch);
 // Write out a list of all files that share their size with other files
 write_possible_duplicate_files(outputFileName, sizeMap);
}
// Search directory, any files with size > 0 are considered. They are added to a set
// within the map, which is keyed on the file size
std::map<uintmax_t, std::set<path>> build_filesize_map(const std::string &pathToSearch) {
 auto sizeMap = std::map<uintmax_t, std::set<path>>();
 for (recursive_directory_iterator next(pathToSearch), end; next != end; ++next) {
 auto currentPath = next->path();
 auto currentFileSize = is_regular_file(currentPath) ? file_size(currentPath) : 0;
 if (0 != currentFileSize) {
 auto existingEntry = sizeMap.find(currentFileSize);
 if (existingEntry == sizeMap.end()) {
 std::set<path> fileSet;
 fileSet.emplace(currentPath);
 sizeMap.emplace(currentFileSize, fileSet);
 }
 else {
 existingEntry->second.emplace(currentPath);
 }
 }
 }
 return sizeMap;
}
// Write out a list of all files that share their size with other files
void write_possible_duplicate_files(const std::string &outputFileName, const std::map<uintmax_t, std::set<path>> &sizeMap) {
 std::ofstream outputFile(outputFileName, std::ios_base::trunc);
 for (auto sizeIterator = sizeMap.begin(); sizeIterator != sizeMap.end(); ++sizeIterator) {
 if (sizeIterator->second.size() == 1) continue;
 for (auto pathIteror = sizeIterator->second.begin(); pathIteror != sizeIterator->second.end(); ++pathIteror) {
 outputFile << sizeIterator->first << "\t" << *pathIteror << std::endl;
 }
 }
}

Things I'm particularly interested in (anything else is welcome):

Naming - I generally write C# these days and everywhere I've written C++ has had its own naming convention. Is there an industry standard or is consistency still the most important convention?
STL usage - I've used Boost, RogueWave and lots of custom classes but have little experience with the STL. Specifically, I'm adding local instance to the collections, rather than newing them up. Is that OK, or should I be adding pointers instead?
My build_filesize_map is returning a map instance. Is that OK? Historically, I'd have new'd up the map and returned a pointer to it. Which is the better approach?

Question 2

While a lot of other languages have a convention nowadays, C++ does not. There are a couple of best-practices around but those aren't set in stone.

Question 3

To complement the other review from Jan Korous, here are a few other things that may help you improve your code.

General portability

This code could be made portable if you omit the Windows-only include file #include "stdafx.h".

Be consistent with experimental portions

The std::experimental::filesystem namespace has been accepted into the C++17 specification and so the corresponding namespace will become std::filesystem and will be in #include <filesystem> but until then, it should be #include <experimental/filesystem>.

Consider using range-for syntax

The range-for syntax makes it much easier to write code simply. For instance, instead of the complicated loops within write_possible_duplicate_file(), I'd write this instead:

for (const auto &item : sizeMap) {
 if (item.second.size() > 1) {
 for (const auto &file : item.second) {
 outputFile << item.first << '\t' << file << '\n';
 }
 }
}

Prefer stream reference to filenames

Instead of having the write_possible_duplicate_files() take a string for a filename, I would advocate changing that to both take and return a std::ostream reference. This would allow the program to easily write to std::cout if desired and more completely abstracts the purpose of the function.

Use a `typedef` to simplify declarations

Instead of using the long form everywhere it appears, I'd recommend using a typedef for the map type:

typedef std::map<uintmax_t, std::set<path>> MyMap;

Now one can simply refer to MyMap everywhere the code had originally had a long std::map declaration.

Think of the user

It may be that the user is eventually another program, but for now, if we are examining a list of possible duplicate files, I think I'd want to see the biggest files first so that eliminating one would have the biggest impact on freeing space. That could easily be done by making a small change to the map typedef mentioned above.

typedef std::map<uintmax_t, std::set<path>, std::greater<uintmax_t>> MyMap;

Consider catching exceptions

The version of file_size() that you're using can throw an exception (as, for example, if the file cannot be read with the current user's permissions) which will cause the program to abort with a somewhat cryptic (to the user) error message. Consider catching the exception and skipping the file or directory or at least emitting a user-centric message.

Simplify the code using in-place constructors

After the check assures that this is a new entry, the build_filesize_map() code uses these three lines:

std::set<path> fileSet;
fileSet.emplace(currentPath);
sizeMap.emplace(currentFileSize, fileSet);

This is more succinctly written like this:

sizeMap.emplace(currentFileSize, std::set<path>{currentPath});

Question 4

Clean nice code, reasonable data structure. Good job!

Don't worry too much about returning the std::map as even historically RVO would help you.

See: https://en.wikipedia.org/wiki/Return_value_optimization

iterator naming

Since you asked about naming convention I have to admit I wouldn't name this iterator next. More common name might be it but something more explicit like dir_entry would be probably even better. It especially caught my eyes that "current path" is "next path" which is kind of strange.

for (recursive_directory_iterator next(pathToSearch), end; next != end; ++next) {
 auto currentPath = next->path();

const

My opinion is that there's never enough const. I would define all variables not intended to be changed later as const.

const auto currentPath = next->path();
const auto currentFileSize = is_regular_file(currentPath) ? file_size(currentPath) : 0;
 if (0 != currentFileSize) {
 const auto existingEntry = sizeMap.find(currentFileSize);

std::map::operator[]

Most of the time I am frustrated by map operator[] behavior. You on the other hand are lucky as it might simplify your code.

 auto existingEntry = sizeMap.find(currentFileSize);
 if (existingEntry == sizeMap.end()) {
 std::set<path> fileSet;
 fileSet.emplace(currentPath);
 sizeMap.emplace(currentFileSize, fileSet);
 }
 else {
 existingEntry->second.emplace(currentPath);
 }

could be done as

 sizeMap[currentFileSize].emplace(currentPath)

Reason is that in case key given as argument to std::map::operator[] is not found new pair is inserted as {key, default value of value_type} which in your case means empty set.

See: http://en.cppreference.com/w/cpp/container/map/operator_at

iterating over map keys> 1

Since you are using an ordered map you might take advantage of that with std::map::lower_bound() and skip the first part with count = 1.

See: http://en.cppreference.com/w/cpp/container/map/lower_bound

for (auto sizeIterator = sizeMap.begin(); sizeIterator != sizeMap.end(); ++sizeIterator) {
 if (sizeIterator->second.size() == 1) continue;
 for (auto pathIteror = sizeIterator->second.begin(); pathIteror != sizeIterator->second.end(); ++pathIteror) {

could be done like this

for (auto sizeIterator = sizeMap.lower_bound(1); sizeIterator != sizeMap.end(); ++sizeIterator) {
 for (auto pathIteror = sizeIterator->second.begin(); pathIteror != sizeIterator->second.end(); ++pathIteror) {

Edward Edward 67.2k4 gold badges120 silver badges284 bronze badges · Accepted Answer · 2016-06-24 13:22:48Z

To complement the other review from Jan Korous, here are a few other things that may help you improve your code.

General portability

This code could be made portable if you omit the Windows-only include file #include "stdafx.h".

Be consistent with experimental portions

The std::experimental::filesystem namespace has been accepted into the C++17 specification and so the corresponding namespace will become std::filesystem and will be in #include <filesystem> but until then, it should be #include <experimental/filesystem>.

Consider using range-for syntax

The range-for syntax makes it much easier to write code simply. For instance, instead of the complicated loops within write_possible_duplicate_file(), I'd write this instead:

for (const auto &item : sizeMap) {
 if (item.second.size() > 1) {
 for (const auto &file : item.second) {
 outputFile << item.first << '\t' << file << '\n';
 }
 }
}

Prefer stream reference to filenames

Instead of having the write_possible_duplicate_files() take a string for a filename, I would advocate changing that to both take and return a std::ostream reference. This would allow the program to easily write to std::cout if desired and more completely abstracts the purpose of the function.

Use a `typedef` to simplify declarations

Instead of using the long form everywhere it appears, I'd recommend using a typedef for the map type:

typedef std::map<uintmax_t, std::set<path>> MyMap;

Now one can simply refer to MyMap everywhere the code had originally had a long std::map declaration.

Think of the user

It may be that the user is eventually another program, but for now, if we are examining a list of possible duplicate files, I think I'd want to see the biggest files first so that eliminating one would have the biggest impact on freeing space. That could easily be done by making a small change to the map typedef mentioned above.

typedef std::map<uintmax_t, std::set<path>, std::greater<uintmax_t>> MyMap;

Consider catching exceptions

The version of file_size() that you're using can throw an exception (as, for example, if the file cannot be read with the current user's permissions) which will cause the program to abort with a somewhat cryptic (to the user) error message. Consider catching the exception and skipping the file or directory or at least emitting a user-centric message.

Simplify the code using in-place constructors

After the check assures that this is a new entry, the build_filesize_map() code uses these three lines:

std::set<path> fileSet;
fileSet.emplace(currentPath);
sizeMap.emplace(currentFileSize, fileSet);

This is more succinctly written like this:

sizeMap.emplace(currentFileSize, std::set<path>{currentPath});

Stack Exchange Network

Identify files within folder structure that have common file sizes

2 Answers 2

General portability

Be consistent with experimental portions

Consider using range-for syntax

Prefer stream reference to filenames

Use a `typedef` to simplify declarations

Think of the user

Consider catching exceptions

Simplify the code using in-place constructors

iterator naming

const

std::map::operator[]

iterating over map keys> 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Identify files within folder structure that have common file sizes

2 Answers 2

General portability

Be consistent with experimental portions

Consider using range-for syntax

Prefer stream reference to filenames

Use a typedef to simplify declarations

Think of the user

Consider catching exceptions

Simplify the code using in-place constructors

iterator naming

const

std::map::operator[]

iterating over map keys> 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions

Use a `typedef` to simplify declarations