This is a variation of an interesting problem I am currently dealing with. We have a large input file which is being continuously written to (size: 10-20 GB). We need to write a log filter which reads content from the file under the following conditions:
- It looks for specific keywords in each line, and if they are present, it writes that line to another output file.
- We only read the first X characters of each line (e.g. X <= 5000); everything else up to the next newline can be ignored. Be mindful that we should not load the whole line into memory; we have tight memory constraints.
- We read from a file which is continuously being written, so if we reach the end of the file, we wait until new data is available.
Here's the sample code I came up with:
bool Contains(const std::string& input) {
    for (const auto& keyword : KEY_WORDS) {
        if (keyword.length() > input.length())
            continue;  // keyword cannot fit in this line
        for (std::size_t i = 0; i + keyword.length() <= input.length(); i++) {
            std::string_view view{ input.c_str() + i, keyword.length() };
            if (view == keyword) {
                return true;
            }
        }
    }
    return false;
}
void ClearFlags(std::ifstream& input_file) {
    input_file.clear();
    input_file.seekg(0, std::ios::end);
}
void ProcessFile(std::ifstream& input_file) {
    // seek to the beginning but we will change this later
    input_file.seekg(0, std::ios::beg);
    while (true) {
        std::string buffer;
        int ch = 0;  // int so it can hold EOF as well as any char value
        // Read until either a newline is found or we have the first 5000 characters
        while (buffer.length() < BUFFER_SIZE) {
            if (input_file.good()) {
                ch = input_file.get();
                if (ch == '\n')
                    break;
                if (ch != EOF)
                    buffer += static_cast<char>(ch);
            } else {
                ClearFlags(input_file);
                std::this_thread::sleep_for(std::chrono::milliseconds(1));
            }
        }
        // Process the line
        if (buffer.length() && Contains(buffer)) {
            sink.write(buffer);
        }
        if (ch == '\n')
            continue;
        // We have more characters, so we read one by one until we reach the newline
        while (ch != '\n') {
            if (input_file.good()) {
                ch = input_file.get();
            } else {
                ClearFlags(input_file);
                std::this_thread::sleep_for(std::chrono::milliseconds(1));
            }
        }
    }
}
I would like to get some comments, possible improvements, and edge cases I might be missing here. Currently, I sleep for 1 millisecond, clear the flags, and repeat the EOF check. I think it's not optimal, and any suggestions there would be highly appreciated.
1 Answer
Make use of standard library functionality
You are performing a lot of operations manually, but the standard library comes with many functions that you can use instead, and that often perform better.
To read a line up to a certain number of characters, use std::istream's getline() member function. Use gcount() to check how many characters were actually read. Once you have read (part of) a line into a buffer, use std::string's find() member function to search for keywords. If you read the first 5000 characters of a very long line, use ignore() to skip the remainder until the newline.
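As a minimal sketch of reading one capped line this way (using the 5000-character limit from the question; "ERROR" stands in for your actual keyword set, and sink is shown as a plain output stream):

// Requires <limits> for std::numeric_limits.
char line[5001];                                   // 5000 characters + '0円'
input_file.getline(line, sizeof line);             // stops at '\n' or at the cap
if (input_file.fail() && input_file.gcount() > 0) {
    // gcount() == 5000 here: the line was longer than the cap, so recover
    // from the failbit and skip the rest of the line up to the newline.
    input_file.clear();
    input_file.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
}
std::string buffer(line);                          // what was actually stored
if (buffer.find("ERROR") != std::string::npos)
    sink << buffer << '\n';

Note that getline() sets the failbit both when the buffer fills up and when it extracts nothing at all; checking gcount() distinguishes the two cases.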
Consider using regular expressions
If you have a lot of keywords, checking for them one at a time in each line will become very slow. Regular expressions are a solution then: they are compiled into a state machine that can check for many keywords much more efficiently. C++ comes with a standard regular expression library, although it is not the fastest, and you might want to use an external library, like Hana Dusíková's CTRE library.
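A sketch of what that could look like with std::regex (the keyword list here is made up):

#include <regex>
#include <string>

// Compiled once; the alternation becomes a single state machine that
// checks for all keywords in one pass over the line.
static const std::regex kKeywords("ERROR|WARN|FATAL");

bool Contains(const std::string& input) {
    return std::regex_search(input, kKeywords);
}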
Waiting for the file to grow
Currently, I sleep for 1 millisecond, clear the flags, and repeat the EOF check. I think it's not optimal, and any suggestions there would be highly appreciated.
Indeed, this is not optimal. By sleeping only for 1 millisecond, you probably prevent the CPU from going into a low power mode if there is no activity. Unfortunately, there is no standard way in C++ to wait for a file to grow. However, there are various alternatives, see this StackOverflow question. If you don't mind an external dependency, I recommend using the SimpleFileWatcher library to avoid having to add platform-specific code.
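If you don't mind staying platform-specific instead, a blocking wait on Linux looks roughly like this with inotify (the path is made up, error handling is elided, and in real code you would create the watch once and reuse it rather than per call):

#include <sys/inotify.h>
#include <unistd.h>

// Blocks in the kernel until the watched file is written to, instead of
// waking the process up every millisecond.
void WaitForGrowth(const char* path) {
    int fd = inotify_init1(IN_CLOEXEC);
    inotify_add_watch(fd, path, IN_MODIFY);
    char event_buf[4096];
    read(fd, event_buf, sizeof event_buf);  // returns once the file changes
    close(fd);
}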
Race condition when reaching the end of the file
Once you hit the end of the file, you call ClearFlags(), which in turn calls seekg(0, std::ios::end). However, consider that between reaching the end and seeking, the file might have grown. The seek will cause you to skip over the characters that have just been added, and might cause you to lose some lines.
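One way to avoid this, as discussed in the comments below, is to not seek to the end at all; just clear the flags, so the stream resumes from wherever reading stopped:

void ClearFlags(std::ifstream& input_file) {
    input_file.clear();  // reset eofbit; the read position is unchanged
    // No seekg(0, std::ios::end): anything appended between hitting EOF
    // and the next read is picked up instead of being skipped. If you do
    // need to seek, use tellg() to checkpoint how far you have read.
}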
Avoid global variables
It looks like KEY_WORDS and sink are global variables. I would avoid that, or at the very least make sure you pass the list of keywords and the output stream via function parameters.
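For example (the names here are illustrative, not from your code):

#include <iostream>
#include <string>
#include <vector>

// Dependencies are passed in explicitly instead of being globals, which
// makes the function testable and its inputs obvious at the call site.
void ProcessFile(std::istream& input_file,
                 std::ostream& sink,
                 const std::vector<std::string>& keywords);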
- Thanks for the suggestions. I checked the getline() documentation and it seems like it reads the whole line into memory, up to the delimiter we introduce. I am not sure how we can use that to replace the get() which I am using right now. And regarding the race condition, it seems like the best way is to run a file watcher and trigger a callback once the update event is triggered. Do you recommend any other, better way? I am going to replace the global variables and also do the keyword search using regex. – Rohith Uppala, Apr 23, 2023 at 12:58
- I'm referring to std::istream::getline(), which allows you to specify the maximum size to read in. As for the race condition, it doesn't matter how you wait, but you cannot seek to the end and assume you did not skip anything. If you need to seek at all, seek to the number of characters read so far. – G. Sliepen, Apr 23, 2023 at 21:01
- Note that std::istream::tellg allows you to determine the current input position. – ruds, May 23, 2023 at 20:02