This is a variation of an interesting problem I am currently dealing with. We have a large input file which is being continuously written to (size: 10-20 GB). We need to write a log filter which reads content from the file under the following conditions:
- It looks for specific keywords in each line, and if they are present, it writes that line to another output file.
- We only read the first X characters of each line (e.g. X <= 5000); everything else up to the next newline can be ignored. Be mindful that we should not load the whole line into memory; we have tight memory constraints.
- We read from a file which is continuously being written, so if we reach the end of the file, we wait until new data is available.
Here's the sample code I came up with:
bool Contains(const std::string& input) {
    for (const auto& keyword : KEY_WORDS) {
        if (keyword.length() > input.length())
            continue;  // keyword cannot fit in this line
        for (std::size_t i = 0; i + keyword.length() <= input.length(); i++) {
            std::string_view view{ input.c_str() + i, keyword.length() };
            if (view == keyword) {
                return true;
            }
        }
    }
    return false;
}
void ClearFlags(std::ifstream& input_file) {
    input_file.clear();
    input_file.seekg(0, std::ios::end);
}
void ProcessFile(std::ifstream& input_file) {
    // seek to the beginning but we will change this later
    input_file.seekg(0, std::ios::beg);
    while (true) {
        std::string buffer;
        int ch = 0;  // int so it can hold EOF as well as any char value
        // Read until either a newline is found or we have the first 5000 characters
        while (buffer.length() < BUFFER_SIZE) {
            if (input_file.good()) {
                ch = input_file.get();
                if (ch == '\n')
                    break;
                if (ch != EOF)
                    buffer += static_cast<char>(ch);
            } else {
                ClearFlags(input_file);
                std::this_thread::sleep_for(std::chrono::milliseconds(1));
            }
        }
        // Process the line
        if (buffer.length() && Contains(buffer)) {
            sink.write(buffer);
        }
        if (ch == '\n')
            continue;
        // We have more characters, so we read one by one until we reach the newline
        while (ch != '\n') {
            if (input_file.good()) {
                ch = input_file.get();
            } else {
                ClearFlags(input_file);
                std::this_thread::sleep_for(std::chrono::milliseconds(1));
            }
        }
    }
}
I would like to get some comments, possible improvements, and edge cases I might be missing here. Currently, I sleep for 1 millisecond, clear the flags, and repeat the EOF check. I think it's not optimal, and any suggestions there would be highly appreciated.
1 Answer
Make use of standard library functionality
You are performing a lot of operations manually, but the standard library comes with many functions that you can use instead, and that often perform better.
To read a line up to a certain number of characters, use std::istream's getline() member function. Use gcount() to check how many characters were actually read. Once you have read (part of) a line into a buffer, use std::string's find() member function to search for keywords. If you read the first 5000 characters of a very long line, use ignore() to skip the remainder until the newline.
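As a minimal sketch of reading one capped line this way (using the 5000-character limit from the question; "ERROR" stands in for your actual keyword set, and sink is shown as a plain output stream):

// Requires <limits> for std::numeric_limits.
char line[5001];                                   // 5000 characters + '0円'
input_file.getline(line, sizeof line);             // stops at '\n' or at the cap
if (input_file.fail() && input_file.gcount() > 0) {
    // gcount() == 5000 here: the line was longer than the cap, so recover
    // from the failbit and skip the rest of the line up to the newline.
    input_file.clear();
    input_file.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
}
std::string buffer(line);                          // what was actually stored
if (buffer.find("ERROR") != std::string::npos)
    sink << buffer << '\n';

Note that getline() sets the failbit both when the buffer fills up and when it extracts nothing at all; checking gcount() distinguishes the two cases.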
Consider using regular expressions
If you have a lot of keywords, checking for them one at a time in each line will become very slow. Regular expressions are a solution then: they are compiled into a state machine that can check for many keywords much more efficiently. C++ comes with a standard regular expression library, although it is not the fastest, and you might want to use an external library, like Hana Dusíková's CTRE library.
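A sketch of what that could look like with std::regex (the keyword list here is made up):

#include <regex>
#include <string>

// Compiled once; the alternation becomes a single state machine that
// checks for all keywords in one pass over the line.
static const std::regex kKeywords("ERROR|WARN|FATAL");

bool Contains(const std::string& input) {
    return std::regex_search(input, kKeywords);
}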
Waiting for the file to grow
Currently, I sleep for 1 millisecond, clear the flags, and repeat the EOF check. I think it's not optimal, and any suggestions there would be highly appreciated.
Indeed, this is not optimal. By sleeping only for 1 millisecond, you probably prevent the CPU from going into a low power mode if there is no activity. Unfortunately, there is no standard way in C++ to wait for a file to grow. However, there are various alternatives, see this StackOverflow question. If you don't mind an external dependency, I recommend using the SimpleFileWatcher library to avoid having to add platform-specific code.
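If you don't mind staying platform-specific instead, a blocking wait on Linux looks roughly like this with inotify (the path is made up, error handling is elided, and in real code you would create the watch once and reuse it rather than per call):

#include <sys/inotify.h>
#include <unistd.h>

// Blocks in the kernel until the watched file is written to, instead of
// waking the process up every millisecond.
void WaitForGrowth(const char* path) {
    int fd = inotify_init1(IN_CLOEXEC);
    inotify_add_watch(fd, path, IN_MODIFY);
    char event_buf[4096];
    read(fd, event_buf, sizeof event_buf);  // returns once the file changes
    close(fd);
}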
Race condition when reaching the end of the file
Once you hit the end of the file, you call ClearFlags(), which in turn calls seekg(0, std::ios::end). However, consider that between reaching the end and seeking, the file might have grown. The seek will cause you to skip over the characters that have just been added, and might cause you to lose some lines.
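One way to avoid this, as discussed in the comments below, is to not seek to the end at all; just clear the flags, so the stream resumes from wherever reading stopped:

void ClearFlags(std::ifstream& input_file) {
    input_file.clear();  // reset eofbit; the read position is unchanged
    // No seekg(0, std::ios::end): anything appended between hitting EOF
    // and the next read is picked up instead of being skipped. If you do
    // need to seek, use tellg() to checkpoint how far you have read.
}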
Avoid global variables
It looks like KEY_WORDS and sink are global variables. I would avoid that, or at the very least make sure you pass the list of keywords and the output stream via function parameters.
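For example (the names here are illustrative, not from your code):

#include <iostream>
#include <string>
#include <vector>

// Dependencies are passed in explicitly instead of being globals, which
// makes the function testable and its inputs obvious at the call site.
void ProcessFile(std::istream& input_file,
                 std::ostream& sink,
                 const std::vector<std::string>& keywords);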
- Thanks for the suggestions. I checked the getline() documentation and it seems like it reads the whole line into memory, up to the delimiter we introduce. I am not sure how we can use that to replace the get() which I am using right now. And regarding the race condition, it seems like the best way is to run a file watcher and trigger a callback once the update event is triggered. Do you recommend any other, better way? I am going to replace the global variables and also do the keyword search using regex. – Rohith Uppala, Apr 23, 2023 at 12:58
- I'm referring to std::istream::getline(), which allows you to specify the maximum size to read in. As for the race condition, it doesn't matter how you wait, but you cannot seek to the end and assume you did not skip anything. If you need to seek at all, seek to the number of characters read so far. – G. Sliepen, Apr 23, 2023 at 21:01
- Note that std::istream::tellg allows you to determine the current input position. – ruds, May 23, 2023 at 20:02