Here is my code. It reads a file and returns the sum of all the bytes in the file:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <malloc.h>
#include <iostream>
#include <chrono>
#include <thread>
#include <mutex>
#include <condition_variable>
#define BUFSIZE 134217728
// globals
std::mutex mut;
unsigned char* buffers[12]; // global array of pointers to buffers where file will be read
int bytes_read[12] = {0};
std::condition_variable cv;
// write_head is the shared variable associated with cv
int write_head = 0; // index of buffer currently being written to
void producer_thread()
{
int fd;
const char* fname = "1GB.txt";
if ((fd = open(fname, O_RDONLY|O_DIRECT)) < 0) {
printf("%s: cannot open %s\n", fname);
exit(2);
}
for (int i = 0; i < 12; ++i){
unsigned char* buf = buffers[i];
int n = read(fd,buf,BUFSIZE);
bytes_read[i] = n;
// wake up consumer thread
{
std::lock_guard<std::mutex> lk(mut);
write_head = i + 1;
}
cv.notify_all();
if ( n == 0 ){ // if we have reached end of file
std::cout << "Read to end of file" << std::endl;
std::cout << "Buffers used: " << i << std::endl;
return;
}
}
}
void consumer_thread(){
unsigned long result = 0;
for (int i = 0; i < 12; ++i){
// wait for buffer to become available for reading
{
std::unique_lock<std::mutex> lk(mut);
cv.wait(lk, [&]() { return i < write_head; });
}
int n = bytes_read[i];
if ( n == 0 ) {
std::cout << "Result: " << result;
return ;
}
// now process the data
unsigned char* buf = buffers[i];
for (int j=0; j<n; ++j)
result += buf[j];
}
}
int main (int argc, char* argv[]) {
using std::chrono::high_resolution_clock;
using std::chrono::duration_cast;
using std::chrono::duration;
using std::chrono::milliseconds;
puts("Allocating buffers");
auto start = high_resolution_clock::now();
int alignment = 4096;
// allocate 10 buffers and put them into the global buffers array
for (int i = 0; i < 10; ++i){
unsigned char* buf = (unsigned char*) memalign(alignment, BUFSIZE);
buffers[i] = buf;
}
auto end = high_resolution_clock::now();
/* Getting number of milliseconds as a double. */
duration<double, std::milli> ms_double = end - start;
puts("finished allocating buffers");
std::cout << "time taken: " << ms_double.count() << "ms\n";
// start producer and consumer threads
std::thread t1(producer_thread), t2(consumer_thread);
t1.join();
t2.join();
return 0;
}
(Note: replacing unsigned long
with int
seems to reduce running time by about 0.06
seconds, I'm not sure why. There is no integer overflow going on since the input file is all zeroes.)
This code takes around 0.97 to 1.03 seconds to run on my machine, which is the same as the fread
version.
I thought O_DIRECT would be faster as it bypasses cache, but it seems like there is no difference, maybe because I'm using disk encryption.
Is there any way to make this code faster?
P.S I'm pretty sure there is no race condition but if you notice any please say!
1 Answer 1
Bypassing the cache might not be wise
I thought O_DIRECT would be faster as it bypasses cache,
You do know that caches are made to make things faster, right? Bypassing it might thus make your code even slower. There are a few reasons why you might want to bypass a cache:
- You know for sure that nothing else will read that file a second time. In that case, using
O_DIRECT
avoids cache memory being wasted. It won't make reading the file the first time any faster though. - You know much better what the access pattern of the file is than the kernel can predict. In that case, you can avoid the kernel doing incorrect prefetching for example, which wastes cache space and can increase latency of subsequent reads. You can perhaps also ensure cached data is not evicted before it is reused. So this would have some performance benefit.
However, you are just reading the whole file in linearly. The kernel can easily see what the access pattern is, and start preloading the next chunk even before the second read()
call happens in the for
-loop.
You've already seen that using O_DIRECT
makes no difference, so you are better of not using it.
Avoid unnecessary C-isms
You're already making quite good use of C++ functionality, however there are some C-isms that can be avoided.
Instead of using macros to define constants, use constexpr
:
static constexpr std::size_t BUFSIZE = 128 * 1024 * 1024;
Use std::cout
instead of printf()
. Note that since C++20 there's std::format()
that helps format strings, and in C++23 we will get the best of both worlds with std::print()
.
Use std::vector
and std::array
to allocate memory instead of using C functions. You can use alignas()
to specify the required alignment:
struct aligned_buffer {
alignas(4096) std::array<char, BUFSIZE> data;
};
std::vector<aligned_buffer> buffers(12);
If you don't need O_DIRECT
anymore, then use std::fstream
instead of open()
or fopen()
.
When including standard header files for C functions in C++ code, use the C++ version of those header files:
#include <cstdio>
#include <cstdlib>
Waiting for the producer
You have implemented a correct thread-safe algorithm for the consumer to block for the producer to have read in more buffers. However, since C++20 some new concurrency primitives have been introduced that allow you to do the same in a somewhat simpler way.
First, there is std::counting_semaphore
. The producer can simply call release()
when it has finished reading in a buffer, and the consumer can call acquire()
to wait for a buffer to be ready. It correctly handles the case where the producer is multiple buffers ahead of the consumer.
Second, you can use a std::atomic_int
to count how far the producer is. Since C++20 you can call wait()
on an atomic variable to wait until it has changed its value. However, I would recommend using the std::counting_semaphore
instead, as it does exactly what you want, and is harder to use incorrectly.
-
1\$\begingroup\$ Although O_DIRECT is a pessimization, using OS-provided
open
andread
does give better performance thanfstream
. The difference is less for binary I/O than for formatted input and output, but it's still real. Unlikefopen
, OSopen
is not a C-ism, it's OS API vs language support library. \$\endgroup\$Ben Voigt– Ben Voigt2023年06月30日 21:32:56 +00:00Commented Jun 30, 2023 at 21:32 -
2\$\begingroup\$ It's important to note that
<cstdio>
and<cstdlib>
do not guarantee to provide access to the C functions of<stdio.h>
and<stdlib.h>
. In particular,std::puts()
will exist, but::puts()
may not. Andstd::abs()
doesn't widen its argument like::abs()
, because it has overloads for narrower types. Switching C++ code from using C library functions to using C++ library functions inspired by C requires more care than just changing a few includes. \$\endgroup\$Ben Voigt– Ben Voigt2023年06月30日 21:34:45 +00:00Commented Jun 30, 2023 at 21:34
unsigned long
andint
, when very likely the size is the important difference. Trysigned long
orunsigned int
to find out for sure... \$\endgroup\$end
. \$\endgroup\$signed long
performs poorly (~1.03s on average) compared toint
(~0.97s average).unsigned int
seems to perform slightly worse (~1.00s on average) thanint
. There is some variance between the readings but there is definitely a difference on average. I ran the program about 5-7 times each with the different types, the output fromtime
is roughly consistent. \$\endgroup\$-O3
) and are therefore invalid... \$\endgroup\$