I've a file which is divided into two parts:
- The header
- The data part
The header has of course a different format than the data part. The header part contains lines that look like this KEY: VALUE
or KEY : VALUE
or KEY:VALUE
, in theory, clearly.
The data part starts with a delimiter, which in my specific case should always be NODE_COORD_SECTION
, and ends with another delimiter, which should always be EOF
.
In the header, I only need to keep track of two key-value pairs: the DIMENSION
and the BEST_KNOWN
.
Each line or entry in the data section should be formatted as
INDEX X_COORDINATE Y_COORDINATE
There could be more than one space between the values. INDEX
should be an int
, X_COORDINATE
and Y_COORDINATE
should be floats or doubles.
I'm parsing the header and the data section differently. The first one is with getline
with the help of the find
method of the string
class, and the second with the overloaded operator >>
of the ifstream
object. I've read on a Stack Overflow's post that mixing these two ways of parsing lines could be dangerous and thus should be avoided, but I thought that for my case it would be the best way to do it.
I want my code first to be correct, but I would like it to also be as efficient and fast as possible, because this parsing of the file, structured as I've been describing above, is part of a program which must be as fast as possible.
Note, I'm throwing exceptions whenever a line is not formatted as I was expecting. I think this a good way of handling these kinds of errors in general, but I don't know if in C++ this is actually a good idea. Notice that I've already read something about exception handling in C++.
I think this is the relevant code for you:
int dimension = -1;
int best_known = -1;
vector <unique_ptr<Node>> nodes;
void read_file(const char *file_name) {
ifstream file(file_name);
if (file.fail()) {
throw runtime_error("file failed to open");
}
string line;
size_t found_at;
while (getline(file, line)) {
if (line.find("BEST_KNOWN") != string::npos) {
found_at = line.find(':');
if (found_at == string::npos) {
throw runtime_error("line containing 'BEST_KNOWN' formatted not as expected");
}
best_known = stoi(line.substr(found_at + 1));
} else if (line.find("DIMENSION") != string::npos) {
found_at = line.find(':');
if (found_at == string::npos) {
throw runtime_error("line containing 'DIMENSION' formatted not as expected");
}
dimension = stoi(line.substr(found_at + 1));
} else if (line.find("NODE_COORD_SECTION") != string::npos) {
if (dimension == -1 || best_known == -1) {
throw runtime_error("dimension and best known result should have already been read");
}
unsigned index;
double x, y;
while (file >> index >> x >> y) {
unique_ptr <Node> p(new Node(index, x, y));
nodes.push_back(move(p));
}
break;
}
}
file.close();
}
Node
is a class that represents a node in a graph. I'm using unique_ptr
s here to handle memory release automatically, but I don't know if this has a bad impact on the overall performance. Honestly I would be using C if it wasn't for the classes that C++ provides, because I feel that with C++ I will be losing performance.
-
\$\begingroup\$ So by definition you problem is divided into two sections. This should be reflected in your code, i.e. your read_file function should call a read_header and afterwards a read_data function. This helps encapsulation and turns the code more readable \$\endgroup\$miscco– miscco2016年09月29日 12:42:04 +00:00Commented Sep 29, 2016 at 12:42
-
\$\begingroup\$ @miscco That's a good observation that should be in an answer (even if it's all you have to say). Comments are for seeking clarification to the question, and may be deleted. \$\endgroup\$200_success– 200_success2016年09月29日 18:50:52 +00:00Commented Sep 29, 2016 at 18:50
-
\$\begingroup\$ I wouldnt think so, as each function would only be called once. It would be different if you defined a function readNextBlock() or similar but that would most certainly be inlined anyway \$\endgroup\$miscco– miscco2016年10月01日 08:22:55 +00:00Commented Oct 1, 2016 at 8:22
2 Answers 2
One problem we all know regarding speed is caching. A std::vector<T>
has a nice property, namely that &vec[i] + 1 == &vec[i+1]
, e.g. the memory is continuous. So whenever you have a loop like
for(size_t i = 0; i < values.size(); ++i) {
values[i] += 2;
}
the values values[i+1]
, values[i+2]
... will already reside in your cache, and you don't have to access your main memory for those. However if you store only pointers in your vector<T>
, ie.
std::vector <std::unique_ptr<Node>> nodes;
you lose that nice property, since we know have to access *vec[i]
, and &(*vec[i]) != &(*vec[i+1])
. The above code becomes
for(size_t i = 0; i < values.size(); ++i) {
*(values[i]) += 2;
}
and we really want to keep the number of *
low. Luckily, unless you want to allow nodes[i] == nullptr
for some i
, there is no need to use a layer of indirection. Just use
std::vector <Node> nodes;
and create the Node
at the end of your vector with std::vector<T>::emplace_back
:
while (file >> index >> x >> y) {
nodes.emplace_back(index, x, y);
}
Next, you usually don't want to use the whole namespace std
. It's considered bad practice. And last but not least, you want to use variables as local as possible, to get rid of errors that are hard to debug (e.g. found_at
should be declared/initialized in your if
).
If we follow all of these points, we end up with
int dimension = -1;
int best_known = -1;
std::vector <Node> nodes;
void read_file(const char *file_name) {
std::ifstream file(file_name);
if (file.fail()) {
throw std::runtime_error("file failed to open");
}
std::string line;
while (std::getline(file, line)) {
if (line.find("BEST_KNOWN") != std::string::npos) {
const auto found_at = line.find(':');
if (found_at == std::string::npos) {
throw std::runtime_error("line containing 'BEST_KNOWN' formatted not as expected");
}
best_known = std::stoi(line.substr(found_at + 1));
} else if (line.find("DIMENSION") != std::string::npos) {
const auto found_at = line.find(':');
if (found_at == std::string::npos) {
throw std::runtime_error("line containing 'DIMENSION' formatted not as expected");
}
dimension = std::stoi(line.substr(found_at + 1));
} else if (line.find("NODE_COORD_SECTION") != std::string::npos) {
break;
}
}
if (dimension == -1 || best_known == -1) {
throw std::runtime_error("dimension and best known result should have already been read");
}
unsigned index;
double x, y;
while (file >> index >> x >> y) {
nodes.emplace_back(index, x, y);
}
file.close();
}
Note that I've moved the parse of the single node coordinates after the initial getline
approach.
Concerning your additional questions: yes, exceptions provide a way to control your flow, but so would error return codes, e.g.
enum READ_ERROR { BEST_KNOWN_FORMAT, DIMENSION_FORMAT, BEST_DIM_NOT_READ, READ_OK };
....
if (dimension == -1 || best_known == -1) {
return BEST_DIM_NOT_READ;
}
The choice is yours, but exception handling and stack unwinding is often a hassle. The compiler can use some optimizations if it knows that a certain part of your code will never throw. However, since we're using nodes.emplace_back
, we're already using at least one function that can possibly throw.
Also note that you slightly violate the DRY principle in your line.find("DIMENSION")
and line.find("BEST_KNOWN")
lines, but another layer of indirection would make things slightly more confusing (for the reader).
Other than that, well done. Your code doesn't fall into other pitfalls that usually happen, e.g.
while(file >> index >> x >> y)
is exactly how you should handle formatted extraction with iostream
.
I think some best practices have already been addressed, so I'll comment pretty much only on performance.
1: string::find
is suboptimal for finding the header key
if (line.find("BEST_KNOWN") != std::string::npos)
This searches for the string BEST_KNOWN
anywhere in the line. You know that string must start the time, so write a startsWith
function which checks if a string starts with another string.
2: string::find
is suboptimal for finding the ':'
const auto found_at = line.find(':');
Again, you're searching the entire line for the first ':'
. You know it must occur after the key name, and you know if you hit anything except whitespace before hitting it, the formatting is wrong. Start iterating after the key name in search of the :
and throw an error if a character isn't isspace(c)
before finding ':'
3: Reduce allocations
best_known = std::stoi(line.substr(found_at + 1));
substr
creates a string. Strings (probably) allocate memory. All you're doing is calling std::stoi
. Instead, just use atoi
, it takes a char*
. Pass it the start of the value
string, and avoid an allocation.
4: File IO speed
You're reading the header line by line with std::getline(file, line)
. That's a lot of reads. Those calls are slow. Similarly, calling file >> index >> x >> y
is probably even worse. If you know you have the memory to store the entire file in memory at once, read the entire file into a buffer in 1 read and then parse it while it sits in memory. Otherwise, read it in large chunks and parse the chunks (but I can tell you this will cause significant parsing-code complexity). Otherwise, otherwise, consider memory mapped files, which wouldn't cause code complexity but will give you nice speed benefits. The downside is that handling that will be platform specific and not standard-c++. To be clear, file io speed likely dwarfs the other optimizations listed here.
I'm sure you can make this file reading/parsing 10x faster or more with the above optimizations (especially changing your file io).