The task is to implement a simple two-threaded "copy" tool.
- The tool should accept 2 parameters: the source filename and the target filename.
- The copying logic should be organized with the help of 2 threads.
- The first thread should read the data from the source file.
- The second thread should write the data to the target file.

For this I made a linked list of 1 MB chunks and added a condition variable for synchronization between the copy and paste threads. I doubt this is a good program. Please help me out by rating my code.
    #include <cstring>
    #include <iostream>
    #include <fstream>
    #include <thread>
    #include <cerrno>
    #include <array>
    #include <atomic>
    #include <mutex>
    #include <condition_variable>
    #include <list>
    #include <chrono>
    #include <benchmark/benchmark.h>

    #define BUFSIZE_1MB 1048576

    typedef struct chunk
    {
        std::array<char, BUFSIZE_1MB> chk;
        unsigned int size = 0;
    } chunk;

    std::mutex mtx;
    std::atomic_bool eof = false;
    std::condition_variable cv;
    std::list<chunk> buffer;

    void copy(std::ifstream &inputFile)
    {
        chunk chk;
        while (!eof)
        {
            inputFile.read(chk.chk.data(), BUFSIZE_1MB);
            chk.size = inputFile.gcount();
            {
                std::scoped_lock<std::mutex> mm(mtx);
                buffer.push_back(chk);
            }
            if (inputFile.eof())
                eof = true;
            cv.notify_one();
        }
    }

    void paste(std::ofstream &outputFile)
    {
        int chunkIndex = 0;
        while (true)
        {
            std::unique_lock<std::mutex> mm(mtx);
            while (buffer.size())
            {
                mm.unlock();
                outputFile.write(buffer.front().chk.data(), buffer.front().size);
                mm.lock();
                buffer.pop_front();
            }
            if (eof)
                break;
            cv.wait(mm);
        }
    }

    void p4_copyFiles_thread(const char *source, const char *destination)
    {
        std::ifstream inputFile(source);
        if (!inputFile)
        {
            std::cout << std::strerror(errno) << " Error opening file " << source << "\n";
            return;
        }
        std::ofstream outputFile(destination);
        if (!outputFile)
        {
            std::cout << std::strerror(errno) << " Error opening file " << destination << "\n";
            return;
        }
        std::thread copyT(copy, std::ref(inputFile));
        std::thread pasteT(paste, std::ref(outputFile));
        copyT.join();
        pasteT.join();
    }

    int main(int argc, char* argv[])
    {
        if (argc < 3)
            return 0;
        srand(static_cast<unsigned>(time(0)));
        auto start = std::chrono::high_resolution_clock::now();
        p4_copyFiles_thread(argv[1], argv[2]);
        auto end = std::chrono::high_resolution_clock::now();
        auto duration = end - start;
        std::cout << "Time taken " << duration.count() << " ms" << std::endl;
        return 0;
    }
- Note that if this weren't a learning exercise, then using a memory map to read the data in and out might be more efficient than streaming through buffers, especially if the page sizes are large. One thread to load a memory-map page into RAM, and one thread to copy to the destination memory map. Depending on the SSD type, making that "multi-threaded" may or may not also improve performance. – Mooing Duck, May 28, 2024
- @MooingDuck: Actually, outside a learning exercise, I would try to punt it all to the OS. – Matthieu M., May 28, 2024
- True! The OS has more direct knowledge of the hardware and is almost certainly faster than anything I could do. And it's definitely less buggy than anything I'd do. – Mooing Duck, May 28, 2024
2 Answers
Missing and incorrect error handling

The most important problem with your code is that you either don't handle errors, or in some other cases handle them incorrectly.

Already at the start of `main()` you check if at least two arguments are passed. If not, you silently `return 0`, which means success! You should instead first print an error message to `std::cerr`, so the user knows what is wrong, and then `return EXIT_FAILURE`, so that in case your program is called from a shell script, the script can notice the error and won't continue as if nothing was wrong.

The next problem is that when it comes to I/O, almost any operation can go wrong. So while you check if the files are opened correctly, you don't check if the read and write operations actually succeed. This can cause silent data corruption, and the read thread can actually go into an infinite loop (a read error halfway through the file will not set the EOF flag).

Make sure you check after each read/write operation if the stream is still `good()`.
`eof` must be guarded by the mutex

You are only guarding `buffer` using the mutex, but the state of `eof` is something that should also be updated atomically together with `buffer`. Currently there is a bug where the write thread can be suspended right between the end of the `while()` loop and before it checks `eof`. The read thread might then read some more chunks and then reach the end of the file. It will push the chunks to the list and set `eof`. Now the write thread resumes, checks `eof` and thinks it is already done, when there are still unwritten chunks in the list.
Unbounded memory use
If writing the output file is slower than reading the input file, the linked list will become longer and longer. This can be a problem, since disks are typically much larger than the amount of RAM in a computer, and so it is quite possible for a file to be larger than you can fit in RAM. There is also no performance benefit from having lots of chunks in memory.
I recommend you limit the number of chunks to just three: one for reading, one for writing, and one spare to absorb temporary read/write speed variations. You can add more, but any benefits greatly decrease after that.
Unnecessary copying in memory

In `copy()`, you first read data into the local variable `chk`. When the chunk is full, you then copy that chunk into the linked list. This can be avoided in several ways:

- Use `std::unique_ptr<Chunk>`s, so you can `std::move` them from a local variable into the list without actually copying the contents.
- Use `std::vector` instead of `std::array`, as `std::vector`s can be efficiently moved by themselves.
- Use a `std::list<Chunk>` for the local variable as well, and then `splice()` that into `buffer`.
- Just like `paste()` works on a `Chunk` that is stored inside the linked list, you can first `emplace_back()` an element to the list, and then `read()` into that.
- Prepopulate `buffer` with a fixed number of chunks, and treat it as a ring buffer.
Avoid manual lock/unlock operations

Calling `lock()` and `unlock()` manually is often a cause of bugs. It's easy to forget to do this in the right order, and even if it looks correct like in your code, consider what would happen if an exception was thrown by `outputFile.write()`. (Well, nothing bad luckily, but did you consider it to begin with?)

This also brings me to:
Use `wait()` with a predicate

It's almost always better to use the version of `wait()` that takes a predicate. This makes it explicit what you want to wait for, and the check for it will be done at exactly the right moment.
Pass `std::istream` instead of `std::ifstream` where possible

`copy()` and `paste()` don't care whether they are copying from a file on disk or from any other type of stream. So you can make them more flexible by having them take references to `std::istream` and `std::ostream` instead of references to `std::ifstream` and `std::ofstream`.
What if you want to copy multiple files?

Your program can only copy one file at a time. Not just because you only take two command line arguments, but also because you are using global variables, so you cannot safely call `p4_copyFiles_thread()` from multiple threads. Ideally you'd store all the state related to a copying operation in a `struct` or `class`, and pass a reference to that to the threads.
Minor issues

- There is an unnecessary call to `srand()`. Avoid C random number functions anyway, and prefer the ones from C++.
- Unused variable `chunkIndex`.
- Use `'\n'` instead of `std::endl`.
- Don't use `int` to store sizes. Consider that `gcount()` returns a `std::streamsize`, which is likely larger than an `int`. Even if you never expect this to be larger than 1048576, it's best to use the right type. This might also avoid some compiler warnings.
- `copy()` and `paste()` are very generic names. It would help to make them more unique, or put them into a `namespace` to avoid naming conflicts in larger programs.
- The name `p4_copyFiles_thread()` is quite bad. What does the `p4` stand for? Why does it use the plural `Files` when it can only copy one file? I also wouldn't add `thread` to the name; the function doesn't return a thread or use just one thread, and as far as the caller is concerned the function blocks, so the fact that threads are used internally is not relevant to the caller.
- C++17 introduced `std::filesystem::copy_file()`. The standard doesn't say whether it uses threads, but that doesn't mean it is not efficient. It would be interesting to compare it against your code.
- In general, two buffers is usually plenty: one for reading and one for writing. As long as reading and writing don't change speeds much, then that's plenty. If reading is sometimes faster and sometimes slower than writing, then adding a third buffer may help a teeny bit, but more than that is unlikely to help at all. – Mooing Duck, May 28, 2024
Do not return a successful exit code on failure:

    if (argc < 3)
        return 0;

If an incorrect number of arguments was provided, you should return `EXIT_FAILURE` from `<cstdlib>` and print an error message to standard error, not silently return success.

Furthermore, prefer `EXIT_SUCCESS` and `EXIT_FAILURE` from `<cstdlib>` instead of using magic values like 0, 1, et cetera.
You are doing the same thing in `p4_copyFiles_thread()`: when the files cannot be opened, you silently return, and the program still prints the time it took for the whole copy. Consider changing the return type of `p4_copyFiles_thread()` to `bool`, and return `false` if the files can't be opened. This way, the caller (`main()` here) can detect whether the operation was successful or not.
You also need not `return 0;` at the end of `main()`, as that is the default behavior.
Errors should be printed to standard error:

    std::cout << std::strerror(errno) << " Error opening file " << destination << "\n";

If standard output is redirected to a file or a pipe, the error message would not appear where the user expects it. This should use `std::cerr` instead.
Do not unnecessarily flush standard output:

`std::endl` flushes the internal buffer along with printing a newline. That doesn't seem to be the intended behavior here (and also is not required), and can be replaced with a simple `'\n'`. Standard output is automatically flushed on program exit.