Parse a file containing key-value pairs followed by coordinates of nodes in a graph

Question 1

I've a file which is divided into two parts:

The header
The data part

The header has of course a different format than the data part. The header part contains lines that look like this KEY: VALUE or KEY : VALUE or KEY:VALUE, in theory, clearly.

The data part starts with a delimiter, which in my specific case should always be NODE_COORD_SECTION, and ends with another delimiter, which should always be EOF.

In the header, I only need to keep track of two key-value pairs: the DIMENSION and the BEST_KNOWN.

Each line or entry in the data section should be formatted as

INDEX X_COORDINATE Y_COORDINATE

There could be more than one space between the values. INDEX should be an int, X_COORDINATE and Y_COORDINATE should be floats or doubles.

I'm parsing the header and the data section differently. The first one is with getline with the help of the find method of the string class, and the second with the overloaded operator >> of the ifstream object. I've read on a Stack Overflow's post that mixing these two ways of parsing lines could be dangerous and thus should be avoided, but I thought that for my case it would be the best way to do it.

I want my code first to be correct, but I would like it to also be as efficient and fast as possible, because this parsing of the file, structured as I've been describing above, is part of a program which must be as fast as possible.

Note, I'm throwing exceptions whenever a line is not formatted as I was expecting. I think this a good way of handling these kinds of errors in general, but I don't know if in C++ this is actually a good idea. Notice that I've already read something about exception handling in C++.

I think this is the relevant code for you:

int dimension = -1;
int best_known = -1;
vector <unique_ptr<Node>> nodes;
void read_file(const char *file_name) {
 ifstream file(file_name);
 if (file.fail()) {
 throw runtime_error("file failed to open");
 }
 string line;
 size_t found_at;
 while (getline(file, line)) {
 if (line.find("BEST_KNOWN") != string::npos) {
 found_at = line.find(':');
 if (found_at == string::npos) {
 throw runtime_error("line containing 'BEST_KNOWN' formatted not as expected");
 }
 best_known = stoi(line.substr(found_at + 1));
 } else if (line.find("DIMENSION") != string::npos) {
 found_at = line.find(':');
 if (found_at == string::npos) {
 throw runtime_error("line containing 'DIMENSION' formatted not as expected");
 }
 dimension = stoi(line.substr(found_at + 1));
 } else if (line.find("NODE_COORD_SECTION") != string::npos) {
 if (dimension == -1 || best_known == -1) {
 throw runtime_error("dimension and best known result should have already been read");
 }
 unsigned index;
 double x, y;
 while (file >> index >> x >> y) {
 unique_ptr <Node> p(new Node(index, x, y));
 nodes.push_back(move(p));
 }
 break;
 }
 }
 file.close();
}

Node is a class that represents a node in a graph. I'm using unique_ptrs here to handle memory release automatically, but I don't know if this has a bad impact on the overall performance. Honestly I would be using C if it wasn't for the classes that C++ provides, because I feel that with C++ I will be losing performance.

Question 2

So by definition you problem is divided into two sections. This should be reflected in your code, i.e. your read_file function should call a read_header and afterwards a read_data function. This helps encapsulation and turns the code more readable

Question 3

@miscco That's a good observation that should be in an answer (even if it's all you have to say). Comments are for seeking clarification to the question, and may be deleted.

Question 4

I wouldnt think so, as each function would only be called once. It would be different if you defined a function readNextBlock() or similar but that would most certainly be inlined anyway

Question 5

One problem we all know regarding speed is caching. A std::vector<T> has a nice property, namely that &vec[i] + 1 == &vec[i+1], e.g. the memory is continuous. So whenever you have a loop like

for(size_t i = 0; i < values.size(); ++i) {
 values[i] += 2;
}

the values values[i+1], values[i+2] ... will already reside in your cache, and you don't have to access your main memory for those. However if you store only pointers in your vector<T>, ie.

std::vector <std::unique_ptr<Node>> nodes;

you lose that nice property, since we know have to access *vec[i], and &(*vec[i]) != &(*vec[i+1]). The above code becomes

for(size_t i = 0; i < values.size(); ++i) {
 *(values[i]) += 2;
}

and we really want to keep the number of * low. Luckily, unless you want to allow nodes[i] == nullptr for some i, there is no need to use a layer of indirection. Just use

std::vector <Node> nodes;

and create the Node at the end of your vector with std::vector<T>::emplace_back:

 while (file >> index >> x >> y) {
 nodes.emplace_back(index, x, y);
 }

Next, you usually don't want to use the whole namespace std. It's considered bad practice. And last but not least, you want to use variables as local as possible, to get rid of errors that are hard to debug (e.g. found_at should be declared/initialized in your if).

If we follow all of these points, we end up with

int dimension = -1;
int best_known = -1;
std::vector <Node> nodes;
void read_file(const char *file_name) {
 std::ifstream file(file_name);
 if (file.fail()) {
 throw std::runtime_error("file failed to open");
 }
 std::string line;
 while (std::getline(file, line)) {
 if (line.find("BEST_KNOWN") != std::string::npos) {
 const auto found_at = line.find(':');
 if (found_at == std::string::npos) {
 throw std::runtime_error("line containing 'BEST_KNOWN' formatted not as expected");
 }
 best_known = std::stoi(line.substr(found_at + 1));
 } else if (line.find("DIMENSION") != std::string::npos) {
 const auto found_at = line.find(':');
 if (found_at == std::string::npos) {
 throw std::runtime_error("line containing 'DIMENSION' formatted not as expected");
 }
 dimension = std::stoi(line.substr(found_at + 1));
 } else if (line.find("NODE_COORD_SECTION") != std::string::npos) {
 break;
 }
 }
 if (dimension == -1 || best_known == -1) {
 throw std::runtime_error("dimension and best known result should have already been read");
 }
 unsigned index;
 double x, y;
 while (file >> index >> x >> y) {
 nodes.emplace_back(index, x, y);
 }
 file.close();
}

Note that I've moved the parse of the single node coordinates after the initial getline approach.

Concerning your additional questions: yes, exceptions provide a way to control your flow, but so would error return codes, e.g.

enum READ_ERROR { BEST_KNOWN_FORMAT, DIMENSION_FORMAT, BEST_DIM_NOT_READ, READ_OK };
....
 if (dimension == -1 || best_known == -1) {
 return BEST_DIM_NOT_READ;
 }

The choice is yours, but exception handling and stack unwinding is often a hassle. The compiler can use some optimizations if it knows that a certain part of your code will never throw. However, since we're using nodes.emplace_back, we're already using at least one function that can possibly throw.

Also note that you slightly violate the DRY principle in your line.find("DIMENSION") and line.find("BEST_KNOWN") lines, but another layer of indirection would make things slightly more confusing (for the reader).

Other than that, well done. Your code doesn't fall into other pitfalls that usually happen, e.g.

while(file >> index >> x >> y)

is exactly how you should handle formatted extraction with iostream.

Question 6

I think some best practices have already been addressed, so I'll comment pretty much only on performance.

1: `string::find` is suboptimal for finding the header key

if (line.find("BEST_KNOWN") != std::string::npos)

This searches for the string BEST_KNOWN anywhere in the line. You know that string must start the time, so write a startsWith function which checks if a string starts with another string.

2: `string::find` is suboptimal for finding the ':'

const auto found_at = line.find(':');

Again, you're searching the entire line for the first ':'. You know it must occur after the key name, and you know if you hit anything except whitespace before hitting it, the formatting is wrong. Start iterating after the key name in search of the : and throw an error if a character isn't isspace(c) before finding ':'

3: Reduce allocations

best_known = std::stoi(line.substr(found_at + 1));

substr creates a string. Strings (probably) allocate memory. All you're doing is calling std::stoi. Instead, just use atoi, it takes a char*. Pass it the start of the value string, and avoid an allocation.

4: File IO speed

You're reading the header line by line with std::getline(file, line). That's a lot of reads. Those calls are slow. Similarly, calling file >> index >> x >> y is probably even worse. If you know you have the memory to store the entire file in memory at once, read the entire file into a buffer in 1 read and then parse it while it sits in memory. Otherwise, read it in large chunks and parse the chunks (but I can tell you this will cause significant parsing-code complexity). Otherwise, otherwise, consider memory mapped files, which wouldn't cause code complexity but will give you nice speed benefits. The downside is that handling that will be platform specific and not standard-c++. To be clear, file io speed likely dwarfs the other optimizations listed here.

I'm sure you can make this file reading/parsing 10x faster or more with the above optimizations (especially changing your file io).

Zeta ZetaZeta 19.6k2 gold badges57 silver badges90 bronze badges · Answer 1 · 2016-09-29 12:22:19Z

One problem we all know regarding speed is caching. A std::vector<T> has a nice property, namely that &vec[i] + 1 == &vec[i+1], e.g. the memory is continuous. So whenever you have a loop like

for(size_t i = 0; i < values.size(); ++i) {
 values[i] += 2;
}

the values values[i+1], values[i+2] ... will already reside in your cache, and you don't have to access your main memory for those. However if you store only pointers in your vector<T>, ie.

std::vector <std::unique_ptr<Node>> nodes;

you lose that nice property, since we know have to access *vec[i], and &(*vec[i]) != &(*vec[i+1]). The above code becomes

for(size_t i = 0; i < values.size(); ++i) {
 *(values[i]) += 2;
}

and we really want to keep the number of * low. Luckily, unless you want to allow nodes[i] == nullptr for some i, there is no need to use a layer of indirection. Just use

std::vector <Node> nodes;

and create the Node at the end of your vector with std::vector<T>::emplace_back:

 while (file >> index >> x >> y) {
 nodes.emplace_back(index, x, y);
 }

Next, you usually don't want to use the whole namespace std. It's considered bad practice. And last but not least, you want to use variables as local as possible, to get rid of errors that are hard to debug (e.g. found_at should be declared/initialized in your if).

If we follow all of these points, we end up with

int dimension = -1;
int best_known = -1;
std::vector <Node> nodes;
void read_file(const char *file_name) {
 std::ifstream file(file_name);
 if (file.fail()) {
 throw std::runtime_error("file failed to open");
 }
 std::string line;
 while (std::getline(file, line)) {
 if (line.find("BEST_KNOWN") != std::string::npos) {
 const auto found_at = line.find(':');
 if (found_at == std::string::npos) {
 throw std::runtime_error("line containing 'BEST_KNOWN' formatted not as expected");
 }
 best_known = std::stoi(line.substr(found_at + 1));
 } else if (line.find("DIMENSION") != std::string::npos) {
 const auto found_at = line.find(':');
 if (found_at == std::string::npos) {
 throw std::runtime_error("line containing 'DIMENSION' formatted not as expected");
 }
 dimension = std::stoi(line.substr(found_at + 1));
 } else if (line.find("NODE_COORD_SECTION") != std::string::npos) {
 break;
 }
 }
 if (dimension == -1 || best_known == -1) {
 throw std::runtime_error("dimension and best known result should have already been read");
 }
 unsigned index;
 double x, y;
 while (file >> index >> x >> y) {
 nodes.emplace_back(index, x, y);
 }
 file.close();
}

Note that I've moved the parse of the single node coordinates after the initial getline approach.

Concerning your additional questions: yes, exceptions provide a way to control your flow, but so would error return codes, e.g.

enum READ_ERROR { BEST_KNOWN_FORMAT, DIMENSION_FORMAT, BEST_DIM_NOT_READ, READ_OK };
....
 if (dimension == -1 || best_known == -1) {
 return BEST_DIM_NOT_READ;
 }

The choice is yours, but exception handling and stack unwinding is often a hassle. The compiler can use some optimizations if it knows that a certain part of your code will never throw. However, since we're using nodes.emplace_back, we're already using at least one function that can possibly throw.

Also note that you slightly violate the DRY principle in your line.find("DIMENSION") and line.find("BEST_KNOWN") lines, but another layer of indirection would make things slightly more confusing (for the reader).

Other than that, well done. Your code doesn't fall into other pitfalls that usually happen, e.g.

while(file >> index >> x >> y)

is exactly how you should handle formatted extraction with iostream.

David DavidDavid 1,0586 silver badges10 bronze badges · Answer 2 · 2016-09-29 18:26:01Z

I think some best practices have already been addressed, so I'll comment pretty much only on performance.

1: `string::find` is suboptimal for finding the header key

if (line.find("BEST_KNOWN") != std::string::npos)

This searches for the string BEST_KNOWN anywhere in the line. You know that string must start the time, so write a startsWith function which checks if a string starts with another string.

2: `string::find` is suboptimal for finding the ':'

const auto found_at = line.find(':');

Again, you're searching the entire line for the first ':'. You know it must occur after the key name, and you know if you hit anything except whitespace before hitting it, the formatting is wrong. Start iterating after the key name in search of the : and throw an error if a character isn't isspace(c) before finding ':'

3: Reduce allocations

best_known = std::stoi(line.substr(found_at + 1));

substr creates a string. Strings (probably) allocate memory. All you're doing is calling std::stoi. Instead, just use atoi, it takes a char*. Pass it the start of the value string, and avoid an allocation.

4: File IO speed

You're reading the header line by line with std::getline(file, line). That's a lot of reads. Those calls are slow. Similarly, calling file >> index >> x >> y is probably even worse. If you know you have the memory to store the entire file in memory at once, read the entire file into a buffer in 1 read and then parse it while it sits in memory. Otherwise, read it in large chunks and parse the chunks (but I can tell you this will cause significant parsing-code complexity). Otherwise, otherwise, consider memory mapped files, which wouldn't cause code complexity but will give you nice speed benefits. The downside is that handling that will be platform specific and not standard-c++. To be clear, file io speed likely dwarfs the other optimizations listed here.

I'm sure you can make this file reading/parsing 10x faster or more with the above optimizations (especially changing your file io).

Stack Exchange Network

Parse a file containing key-value pairs followed by coordinates of nodes in a graph

2 Answers 2

1: `string::find` is suboptimal for finding the header key

2: `string::find` is suboptimal for finding the ':'

3: Reduce allocations

4: File IO speed

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Parse a file containing key-value pairs followed by coordinates of nodes in a graph

2 Answers 2

1: string::find is suboptimal for finding the header key

2: string::find is suboptimal for finding the ':'

3: Reduce allocations

4: File IO speed

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions

1: `string::find` is suboptimal for finding the header key

2: `string::find` is suboptimal for finding the ':'