Multithreaded O_DIRECT file read-and-process program

Question 1

Here is my code. It reads a file and returns the sum of all the bytes in the file:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <malloc.h>
#include <iostream>
#include <chrono>
#include <thread>
#include <mutex>
#include <condition_variable>
#define BUFSIZE 134217728
// globals
std::mutex mut;
unsigned char* buffers[12]; // global array of pointers to buffers where file will be read
int bytes_read[12] = {0};
std::condition_variable cv;
// write_head is the shared variable associated with cv
int write_head = 0; // index of buffer currently being written to
void producer_thread()
{
 int fd;
 const char* fname = "1GB.txt";
 if ((fd = open(fname, O_RDONLY|O_DIRECT)) < 0) {
 printf("%s: cannot open %s\n", fname);
 exit(2);
 }
 for (int i = 0; i < 12; ++i){
 unsigned char* buf = buffers[i];
 int n = read(fd,buf,BUFSIZE);
 bytes_read[i] = n;
 // wake up consumer thread
 {
 std::lock_guard<std::mutex> lk(mut);
 write_head = i + 1;
 }
 cv.notify_all();
 if ( n == 0 ){ // if we have reached end of file
 std::cout << "Read to end of file" << std::endl;
 std::cout << "Buffers used: " << i << std::endl;
 return;
 }
 }
}
void consumer_thread(){
 unsigned long result = 0;
 for (int i = 0; i < 12; ++i){
 // wait for buffer to become available for reading
 {
 std::unique_lock<std::mutex> lk(mut);
 cv.wait(lk, [&]() { return i < write_head; });
 }
 int n = bytes_read[i];
 if ( n == 0 ) {
 std::cout << "Result: " << result;
 return ;
 }
 // now process the data
 unsigned char* buf = buffers[i];
 for (int j=0; j<n; ++j)
 result += buf[j];
 }
}
int main (int argc, char* argv[]) {
 using std::chrono::high_resolution_clock;
 using std::chrono::duration_cast;
 using std::chrono::duration;
 using std::chrono::milliseconds;
 puts("Allocating buffers");
 auto start = high_resolution_clock::now();
 int alignment = 4096;
 
 // allocate 10 buffers and put them into the global buffers array
 for (int i = 0; i < 10; ++i){
 unsigned char* buf = (unsigned char*) memalign(alignment, BUFSIZE);
 buffers[i] = buf;
 }
 auto end = high_resolution_clock::now();
 /* Getting number of milliseconds as a double. */
 duration<double, std::milli> ms_double = end - start;
 puts("finished allocating buffers");
 std::cout << "time taken: " << ms_double.count() << "ms\n";
 // start producer and consumer threads
 std::thread t1(producer_thread), t2(consumer_thread);
 t1.join();
 t2.join();
 return 0;
}

(Note: replacing unsigned long with int seems to reduce running time by about 0.06 seconds, I'm not sure why. There is no integer overflow going on since the input file is all zeroes.)

This code takes around 0.97 to 1.03 seconds to run on my machine, which is the same as the fread version.

I thought O_DIRECT would be faster as it bypasses cache, but it seems like there is no difference, maybe because I'm using disk encryption.

Is there any way to make this code faster?

P.S I'm pretty sure there is no race condition but if you notice any please say!

Question 2

You're assuming that the signedness is the important difference between unsigned long and int, when very likely the size is the important difference. Try signed long or unsigned int to find out for sure...

Question 3

Also your timing measurement appears to be surrounding the wrong code. The I/O doesn't happen until after you store end.

Question 4

@BenVoigt signed long performs poorly (~1.03s on average) compared to int (~0.97s average). unsigned int seems to perform slightly worse (~1.00s on average) than int. There is some variance between the readings but there is definitely a difference on average. I ran the program about 5-7 times each with the different types, the output from time is roughly consistent.

Question 5

Note: The above results might have been from an un-optimized build (built without -O3) and are therefore invalid...

Question 6

Bypassing the cache might not be wise

I thought O_DIRECT would be faster as it bypasses cache,

You do know that caches are made to make things faster, right? Bypassing it might thus make your code even slower. There are a few reasons why you might want to bypass a cache:

You know for sure that nothing else will read that file a second time. In that case, using O_DIRECT avoids cache memory being wasted. It won't make reading the file the first time any faster though.
You know much better what the access pattern of the file is than the kernel can predict. In that case, you can avoid the kernel doing incorrect prefetching for example, which wastes cache space and can increase latency of subsequent reads. You can perhaps also ensure cached data is not evicted before it is reused. So this would have some performance benefit.

However, you are just reading the whole file in linearly. The kernel can easily see what the access pattern is, and start preloading the next chunk even before the second read() call happens in the for-loop.

You've already seen that using O_DIRECT makes no difference, so you are better of not using it.

Avoid unnecessary C-isms

You're already making quite good use of C++ functionality, however there are some C-isms that can be avoided.

Instead of using macros to define constants, use constexpr:

static constexpr std::size_t BUFSIZE = 128 * 1024 * 1024;

Use std::cout instead of printf(). Note that since C++20 there's std::format() that helps format strings, and in C++23 we will get the best of both worlds with std::print().

Use std::vector and std::array to allocate memory instead of using C functions. You can use alignas() to specify the required alignment:

struct aligned_buffer {
 alignas(4096) std::array<char, BUFSIZE> data;
};
std::vector<aligned_buffer> buffers(12);

If you don't need O_DIRECT anymore, then use std::fstream instead of open() or fopen().

When including standard header files for C functions in C++ code, use the C++ version of those header files:

#include <cstdio>
#include <cstdlib>

Waiting for the producer

You have implemented a correct thread-safe algorithm for the consumer to block for the producer to have read in more buffers. However, since C++20 some new concurrency primitives have been introduced that allow you to do the same in a somewhat simpler way.

First, there is std::counting_semaphore. The producer can simply call release() when it has finished reading in a buffer, and the consumer can call acquire() to wait for a buffer to be ready. It correctly handles the case where the producer is multiple buffers ahead of the consumer.

Second, you can use a std::atomic_int to count how far the producer is. Since C++20 you can call wait() on an atomic variable to wait until it has changed its value. However, I would recommend using the std::counting_semaphore instead, as it does exactly what you want, and is harder to use incorrectly.

Question 7

Although O_DIRECT is a pessimization, using OS-provided open and read does give better performance than fstream. The difference is less for binary I/O than for formatted input and output, but it's still real. Unlike fopen, OS open is not a C-ism, it's OS API vs language support library.

Question 8

It's important to note that <cstdio> and <cstdlib> do not guarantee to provide access to the C functions of <stdio.h> and <stdlib.h>. In particular, std::puts() will exist, but ::puts() may not. And std::abs() doesn't widen its argument like ::abs(), because it has overloads for narrower types. Switching C++ code from using C library functions to using C++ library functions inspired by C requires more care than just changing a few includes.

G. Sliepen G. Sliepen 68.7k3 gold badges74 silver badges179 bronze badges · Accepted Answer · 2023-06-30 21:20:58Z

Bypassing the cache might not be wise

I thought O_DIRECT would be faster as it bypasses cache,

You do know that caches are made to make things faster, right? Bypassing it might thus make your code even slower. There are a few reasons why you might want to bypass a cache:

You know for sure that nothing else will read that file a second time. In that case, using O_DIRECT avoids cache memory being wasted. It won't make reading the file the first time any faster though.
You know much better what the access pattern of the file is than the kernel can predict. In that case, you can avoid the kernel doing incorrect prefetching for example, which wastes cache space and can increase latency of subsequent reads. You can perhaps also ensure cached data is not evicted before it is reused. So this would have some performance benefit.

However, you are just reading the whole file in linearly. The kernel can easily see what the access pattern is, and start preloading the next chunk even before the second read() call happens in the for-loop.

You've already seen that using O_DIRECT makes no difference, so you are better of not using it.

Avoid unnecessary C-isms

You're already making quite good use of C++ functionality, however there are some C-isms that can be avoided.

Instead of using macros to define constants, use constexpr:

static constexpr std::size_t BUFSIZE = 128 * 1024 * 1024;

Use std::cout instead of printf(). Note that since C++20 there's std::format() that helps format strings, and in C++23 we will get the best of both worlds with std::print().

Use std::vector and std::array to allocate memory instead of using C functions. You can use alignas() to specify the required alignment:

struct aligned_buffer {
 alignas(4096) std::array<char, BUFSIZE> data;
};
std::vector<aligned_buffer> buffers(12);

If you don't need O_DIRECT anymore, then use std::fstream instead of open() or fopen().

When including standard header files for C functions in C++ code, use the C++ version of those header files:

#include <cstdio>
#include <cstdlib>

Waiting for the producer

You have implemented a correct thread-safe algorithm for the consumer to block for the producer to have read in more buffers. However, since C++20 some new concurrency primitives have been introduced that allow you to do the same in a somewhat simpler way.

First, there is std::counting_semaphore. The producer can simply call release() when it has finished reading in a buffer, and the consumer can call acquire() to wait for a buffer to be ready. It correctly handles the case where the producer is multiple buffers ahead of the consumer.

Second, you can use a std::atomic_int to count how far the producer is. Since C++20 you can call wait() on an atomic variable to wait until it has changed its value. However, I would recommend using the std::counting_semaphore instead, as it does exactly what you want, and is harder to use incorrectly.

Although O_DIRECT is a pessimization, using OS-provided open and read does give better performance than fstream. The difference is less for binary I/O than for formatted input and output, but it's still real. Unlike fopen, OS open is not a C-ism, it's OS API vs language support library.
It's important to note that <cstdio> and <cstdlib> do not guarantee to provide access to the C functions of <stdio.h> and <stdlib.h>. In particular, std::puts() will exist, but ::puts() may not. And std::abs() doesn't widen its argument like ::abs(), because it has overloads for narrower types. Switching C++ code from using C library functions to using C++ library functions inspired by C requires more care than just changing a few includes.

Stack Exchange Network

Multithreaded O_DIRECT file read-and-process program

1 Answer 1

Bypassing the cache might not be wise

Avoid unnecessary C-isms

Waiting for the producer

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Multithreaded O_DIRECT file read-and-process program

1 Answer 1

Bypassing the cache might not be wise

Avoid unnecessary C-isms

Waiting for the producer

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions