The task is to implement a simple two-threaded "copy" tool.
- The tool should accept 2 parameters: the source filename and the target filename.
- The copying logic should be organized with the help of 2 threads.
- The first thread should read the data from the source file.
- The second thread should write the data to the target file.

For this I made a linked list of 1 MB chunks and added a condition variable for synchronization between the copy and paste threads. I doubt this is a good program. Please help me out by rating my code.
    #include <cstring>
    #include <iostream>
    #include <fstream>
    #include <thread>
    #include <cerrno>
    #include <array>
    #include <atomic>
    #include <mutex>
    #include <condition_variable>
    #include <list>
    #include <chrono>
    #include <benchmark/benchmark.h>

    #define BUFSIZE_1MB 1048576

    typedef struct chunk
    {
        std::array<char, BUFSIZE_1MB> chk;
        unsigned int size = 0;
    } chunk;

    std::mutex mtx;
    std::atomic_bool eof = false;
    std::condition_variable cv;
    std::list<chunk> buffer;

    void copy(std::ifstream &inputFile)
    {
        chunk chk;
        while (!eof)
        {
            inputFile.read(chk.chk.data(), BUFSIZE_1MB);
            chk.size = inputFile.gcount();
            {
                std::scoped_lock<std::mutex> mm(mtx);
                buffer.push_back(chk);
            }
            if (inputFile.eof())
                eof = true;
            cv.notify_one();
        }
    }

    void paste(std::ofstream &outputFile)
    {
        int chunkIndex = 0;
        while (true)
        {
            std::unique_lock<std::mutex> mm(mtx);
            while (buffer.size())
            {
                mm.unlock();
                outputFile.write(buffer.front().chk.data(), buffer.front().size);
                mm.lock();
                buffer.pop_front();
            }
            if (eof)
                break;
            cv.wait(mm);
        }
    }

    void p4_copyFiles_thread(const char *source, const char *destination)
    {
        std::ifstream inputFile(source);
        if (!inputFile)
        {
            std::cout << std::strerror(errno) << " Error opening file " << source << "\n";
            return;
        }
        std::ofstream outputFile(destination);
        if (!outputFile)
        {
            std::cout << std::strerror(errno) << " Error opening file " << destination << "\n";
            return;
        }
        std::thread copyT(copy, std::ref(inputFile));
        std::thread pasteT(paste, std::ref(outputFile));
        copyT.join();
        pasteT.join();
    }

    int main(int argc, char* argv[])
    {
        if (argc < 3)
            return 0;
        srand(static_cast<unsigned>(time(0)));
        auto start = std::chrono::high_resolution_clock::now();
        p4_copyFiles_thread(argv[1], argv[2]);
        auto end = std::chrono::high_resolution_clock::now();
        auto duration = end - start;
        std::cout << "Time taken " << duration.count() << " ms" << std::endl;
        return 0;
    }
- Note that if this weren't a learning exercise, then using a memory map to read the data in and out might be more efficient than streaming through buffers, especially if the page sizes are large. One thread to load a memory-map page into RAM, and one thread to copy to the destination memory map. Depending on the SSD type, making that "multi-threaded" may or may not also improve performance. – Mooing Duck, May 28, 2024
- @MooingDuck: Actually, outside a learning exercise, I would try to punt it all to the OS. – Matthieu M., May 28, 2024
- True! The OS has more direct knowledge of the hardware and is almost certainly faster than anything I could do. And it's definitely less buggy than anything I'd do. – Mooing Duck, May 28, 2024
2 Answers
Missing and incorrect error handling

The most important problem with your code is that you either don't handle errors, or in some other cases handle them incorrectly.

Already at the start of `main()` you check if at least two arguments are passed. If not, you silently `return 0`, which means success! You should instead first print an error message to `std::cerr`, so the user knows what is wrong, and then `return EXIT_FAILURE`, so that in case your program is called from a shell script, the script can notice the error and won't continue as if nothing was wrong.

The next problem is that when it comes to I/O, almost any operation can go wrong. So while you check if the files are opened correctly, you don't check if the read and write operations actually succeed. This can cause silent data corruption, and the read thread can actually go into an infinite loop (a read error halfway through the file will not set the EOF flag).

Make sure you check after each read/write operation if the stream is still `good()`.
`eof` must be guarded by the mutex

You are only guarding `buffer` using the mutex, but the state of `eof` is something that should also be updated atomically together with `buffer`. Currently there is a bug where the write thread can be suspended right between the end of the `while()` loop and before it checks `eof`. The read thread might then read some more chunks and then reach the end of the file. It will push the chunks to the list and set `eof`. Now the write thread resumes, checks `eof` and thinks it is already done, when there are still unwritten chunks in the list.
Unbounded memory use
If writing the output file is slower than reading the input file, the linked list will become longer and longer. This can be a problem, since disks are typically much larger than the amount of RAM in a computer, and so it is quite possible for a file to be larger than you can fit in RAM. There is also no performance benefit from having lots of chunks in memory.
I recommend you limit the number of chunks to just three: one for reading, one for writing, and one spare to absorb temporary read/write speed variations. You can add more, but any benefits greatly decrease after that.
Unnecessary copying in memory

In `copy()`, you first read data into the local variable `chk`. When the chunk is full, you then copy that chunk into the linked list. This can be avoided in several ways:

- Use `std::unique_ptr<Chunk>`s, so you can `std::move` them from a local variable into the list without actually copying the contents.
- Use `std::vector` instead of `std::array`, as `std::vector`s can be efficiently moved by themselves.
- Use a `std::list<Chunk>` for the local variable as well, and then `splice()` that into `buffer`.
- Just like `paste()` works on a `Chunk` that is stored inside the linked list, you can first `emplace_back()` an element to the list, and then `read()` into that.
- Prepopulate `buffer` with a fixed number of chunks, and treat it as a ring buffer.
Avoid manual lock/unlock operations

Calling `lock()` and `unlock()` manually is often a cause of bugs. It's easy to forget to do this in the right order, and even if it looks correct like in your code, consider what would happen if an exception was thrown by `outputFile.write()`. (Well, nothing bad luckily, but did you consider it to begin with?)

This also brings me to:
Use `wait()` with a predicate

It's almost always better to use the version of `wait()` that takes a predicate. This makes it explicit what you want to wait for, and the check for it will be done at exactly the right moment.
Pass `std::istream` instead of `std::ifstream` where possible

`copy()` and `paste()` don't care whether they are copying from a file on disk or from any other type of stream. So you can make them more flexible by having them take references to `std::istream` and `std::ostream` instead of references to `std::ifstream` and `std::ofstream`.
What if you want to copy multiple files?

Your program can only copy one file at a time. Not just because you only take two command line arguments, but also because you are using global variables, so you cannot safely call `p4_copyFiles_thread()` from multiple threads. Ideally you'd store all the state related to a copying operation in a `struct` or `class`, and pass a reference to that to the threads.
Minor issues

- There is an unnecessary call to `srand()`. Avoid C random number functions anyway, and prefer the ones from C++.
- Unused variable `chunkIndex`.
- Use `'\n'` instead of `std::endl`.
- Don't use `int` to store sizes. Consider that `gcount()` returns a `std::streamsize`, which is likely larger than an `int`. Even if you never expect this to be larger than 1048576, it's best to use the right type. This might also avoid some compiler warnings.
- `copy()` and `paste()` are very generic names. It would help to make them more unique, or put them into a `namespace` to avoid naming conflicts in larger programs.
- The name `p4_copyFiles_thread()` is quite bad. What does the `p4` stand for? Why does it use the plural `Files` when it can only copy one file? I also wouldn't add `thread` to the name; the function doesn't return a thread or use just one thread, and as far as the caller is concerned the function blocks, so the fact that threads are used internally is not relevant to the caller.
- C++17 introduced `std::filesystem::copy_file()`. The standard doesn't say whether it uses threads, but that doesn't mean it is not efficient. It would be interesting to compare it against your code.
- In general, two buffers is usually plenty: one for reading and one for writing. As long as reading and writing don't change speeds much, then that's plenty. If reading is sometimes faster and sometimes slower than writing, then adding a third buffer may help a teeny bit, but more than that is unlikely to help at all. – Mooing Duck, May 28, 2024
Do not return a successful exit code on failure:

    if (argc < 3)
        return 0;

If an incorrect number of arguments was provided, you should return `EXIT_FAILURE` from `<cstdlib>` and print an error message to standard error, not silently return success.

Furthermore, prefer `EXIT_SUCCESS` and `EXIT_FAILURE` from `<cstdlib>` instead of using magic values like 0, 1, et cetera.
You are doing the same thing in `p4_copyFiles_thread()`: when the files cannot be opened, you silently return, and the program still prints the time it took for the whole copy. Consider changing the return type of `p4_copyFiles_thread()` to `bool`, and return `false` if the files can't be opened. This way, the caller (`main()` here) can detect whether the operation was successful or not.
You also need not `return 0;` at the end of `main()`, as that is the default behavior.
Errors should be printed to standard error:

    std::cout << std::strerror(errno) << " Error opening file " << destination << "\n";

If standard output is redirected to a file or a pipe, the error message would not appear where the user expects it. This should use `std::cerr` instead.
Do not unnecessarily flush standard output:

`std::endl` flushes the internal buffer along with printing a newline. That doesn't seem to be the intended behavior here (and also is not required), and can be replaced with a simple `'\n'`. Standard output is automatically flushed on program exit.