I'm new to programming (1.5 years experience at university), and I'm working on my first computer vision related project! My goal is to take a file which might have CSV (comma separated) or txt format with XYZ coordinates and process the data. I want to round the X and Y coordinates into integers and leave the Z coordinate as a floating point.
Each file ranges from 200000 to 1400000 lines of XYZ data is is formatted as so:
# information about .txt file
253.9999929 58.0428367 -21.3930063253
.9999929 59.0435773 -21.2499391255
etc.....
Do note that the file might have commas, like a CSV file.
The function that is my main bottleneck is:
Loadfile(string filename);
I'm pretty sure my bottleneck is from the ifstream
for reading the file one line at a time, but I'm having trouble finding ways to improve it. One idea I had was to convert the file into an array of bytes and read split it up into different domains and give it to multiple threads to read concurrently, but I'm not sure if that's a good practice.
Here is the my code so far:
class XYZCoordinate{
public:
int x;
int y;
float z;
XYZCoordinate(int X, int Y, float Z){
x = X;
y = Y;
z = Z;
}
};
class Loadfileoutput
{
public:
int xmin;
int xmax;
int ymin;
int ymax;
vector<XYZCoordinate> values;
Loadfileoutput(int xmn, int xmx, int ymn, int ymx, vector<XYZCoordinate> val){
xmin = xmn;
xmax = xmx;
ymin = ymn;
ymax = ymx;
values = val;
}
};
Loadfileoutput Loadfile(string filename)
{
std::ifstream in(filename);
if (!in) {
cout << "Cannot open input file.\n";
exit(EXIT_FAILURE);
}
std::string str;
bool isfirst = true;
int xmin, xmax, ymin, ymax;
vector<XYZCoordinate> file = vector<XYZCoordinate>();
while (std::getline(in, str)) //potential cause of bottleneck
{
if (str[0] == '#') //.skipping the .txt file header
continue;
vector<string> v = vector<string>();
boost::split(v, str, boost::is_any_of(" ,"));
string xstr = v[0];
string ystr = v[1];
string zstr = v[2];
int xint, yint;
float x,y,z;
stringstream(v[0]) >> x;
xint = (int)round(x);
stringstream(v[1]) >> y;
yint = (int)round(y);
stringstream(v[2]) >> z;
XYZCoordinate temp = XYZCoordinate(xint, yint, z);
file.push_back(temp);
//get bounds
if (isfirst)
{
xmin = xint;
xmax = xint;
ymin = yint;
ymax = yint;
isfirst = false;
}
else
{
//set xmin/max for scaling
if (xint > xmax)
xmax = xint;
else if (xint < xmin)
xmin = xint;
//set ymin/max for scaling
if (yint > ymax)
ymax = yint;
else if (yint < ymin)
ymin = yint;
}
}
Loadfileoutput output = Loadfileoutput(xmin,xmax,ymin,ymax,file);
return output;
}
5 Answers 5
There are a number of things that may help you improve your program.
Don't abuse using namespace std
Putting using namespace std
at the top of every program is a bad habit that you'd do well to avoid.
Use the required #include
s
The code uses a number of #include
s that are not listed. It was not difficult to infer, but it helps reviewers if the code is complete. After a bit of looking, it seems that this is what's needed:
#include <vector>
#include <string>
#include <iostream>
#include <fstream>
#include <sstream>
#include <cmath>
#include <boost/algorithm/string.hpp>
Prefer modern initializers for constructors
The constructor use the more modern initializer style rather than the old style you're currently using. Instead of this:
XYZCoordinate(int X, int Y, float Z){
x = X;
y = Y;
z = Z;
}
I would write this:
XYZCoordinate(int X, int Y, float Z) : x{X}, y{Y}, z{Z} {}
However, in this case, for such a simple class, in real code I wouldn't write a constructor at all and would simply delete the line. The class can still be initialized like this: XYZCoordinate xyz{3, 5, -23.22};
via aggregate initialization.
Use better variable names
The variable names xmax
, xmin
, etc. are good, but the name file
is not. The first names explain something about what the variable means within the context of the code, but the latter is only confusing. A better name might be coordinates
. Similarly, Loadfileoutput
is another poor name -- I'd suggest Coordinates
for the name of that class.
Use C++ idiom
The code currently contains this:
vector<XYZCoordinate> file = vector<XYZCoordinate>();
That's really not necessary. Just write this instead:
vector<XYZCoordinate> coordinates{};
Prefer a struct
to a class
for objects with no invariants
C++ has two primary means to encapsulate some behavior and data: struct
and class
. The only difference is that a struct
has its members public
by default, while the class
members are private
by default. If there is no invariant, (a thing that must always be true for the class to be coherent), then it often makes sense to use a struct
instead. In this case the XYZCoordinate
class has no invariants and is probably better suited as a struct
.
Write a custom extractor
Here's the code I'd prefer to write to read in and create the desired object:
std::ifstream in(filename);
Coordinates c;
in >> c;
This can be done by writing an extractor for the data. In this case, I'd write two extractors. One for the Coordinates
class and one for the XYZCoordinate
class. Here's one way to write one for the XYZCoordinate
class. Here's the complete definition for that class, which I have converted to a struct
:
struct XYZCoordinate{
int x;
int y;
float z;
friend std::istream& operator>>(std::istream& in, XYZCoordinate &xyz) {
float a, b;
in >> a >> b >> xyz.z;
xyz.x = std::rint(a);
xyz.y = std::rint(b);
return in;
}
};
Now we can write an extractor for the Coordinates
class. Here's that complete class:
class Coordinates
{
int xmin;
int xmax;
int ymin;
int ymax;
std::vector<XYZCoordinate> values;
public:
friend std::ostream& operator<<(std::ostream& out, const Coordinates &c) {
return out << "[ " << c.xmin << ", " << c.xmax << "], [ "
<< c.ymin << ", " << c.ymax << "], with " << c.values.size() << " points\n";
}
friend std::istream& operator>>(std::istream& in, Coordinates &c) {
// skip comment lines
while (in.peek() == '#') {
in.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
}
XYZCoordinate xyz;
while (in >> xyz) {
c.values.push_back(xyz);
c.xmin = std::min<int>(c.xmin, xyz.x);
c.xmax = std::max<int>(c.xmax, xyz.x);
c.ymin = std::min<int>(c.xmin, xyz.y);
c.ymax = std::max<int>(c.xmax, xyz.y);
}
return in;
}
};
Note that the values of xmin
, xmax
, etc. are invariants for this class, so we leave it as a class
with only private data members. If you need to access the items in the vector, you can add custom functions to handle that.
Provide a test driver
This is more about getting good reviews than the code itself, but it's often very useful if you provide some simple main
that exercises the code. This helps reviewers (and yourself!) understand how it's to be used in context. In this case, I wrote this:
#include <vector>
#include <iostream>
#include <fstream>
#include <locale>
#include <limits>
#include <algorithm>
// all of the class code above goes here
int main(int argc, char *argv[]) {
if (argc != 2) {
std::cerr << "Usage: xyz infile.txt\n";
return 1;
}
std::ifstream in(argv[1]);
in.imbue(std::locale(std::locale(), new csv_reader()));
Coordinates c;
in >> c;
std::cout << c;
}
The call to imbue
uses a locale that treats commas as whitespace. That code is in this excellent answer. Now you don't have to explicitly check for commas or spaces and don't need boost. Everything here is standard C++.
Results
I wrote a program to create a set of 20 millon sets of random coordinates. Using that input, your original program took 41.9 seconds on my machine, while the one presented above took 14.8 seconds.
-
1\$\begingroup\$ Out of curiosity, why
x{X}
instead ofx(X)
? Similarly, whyvector<...> coordinates {};
instead of justvector<...> coordinates;
? \$\endgroup\$anon– anon2018年08月15日 21:18:26 +00:00Commented Aug 15, 2018 at 21:18 -
\$\begingroup\$ Additional question, would it be considered standard practice to include the the modern constructor in the header file or should I keep it in the cpp file? \$\endgroup\$Casey Poon– Casey Poon2018年08月15日 21:49:46 +00:00Commented Aug 15, 2018 at 21:49
-
3\$\begingroup\$ @NicHartley: I use the C++11 uniform initialization pretty much whenever I can, for the reasons listed in the link. \$\endgroup\$Edward– Edward2018年08月15日 22:21:18 +00:00Commented Aug 15, 2018 at 22:21
-
1\$\begingroup\$ @CaseyPoon: Good question -- for an extremely simple initialization like the one shown here for
XYZCoordinate
, I'd actually put it in neither file and let the compiler autogenerate it. \$\endgroup\$Edward– Edward2018年08月15日 22:23:08 +00:00Commented Aug 15, 2018 at 22:23 -
1\$\begingroup\$ Small correction: OP isn’t using "old style initialisation". They’re not using intialisation, but instead assignment. This is a fundamental difference (it won’t work correctly with references and
const
variables, for instance). \$\endgroup\$Konrad Rudolph– Konrad Rudolph2018年08月16日 09:59:19 +00:00Commented Aug 16, 2018 at 9:59
Well, lets start from the CodeReview:
Since this is not really a library, I will omit my typical "Make it easy to use correctly, hard to use incorrectly", but it is still important thing to keep in mind.
- Using classes instead of structs
In this case, it seems like struct
should be used:
struct point {
int x;
int y;
double z;
};
Better name, like Point3D
might be applied. XYZCoordinate
, although explicit, is a bit verbose. Matemathicians might argue that names miss coordinate system names, but I guess it is out of problem domain at the moment.
- Passing by value
Although it is fine to accept by value and move at the call site, passing such big objects by value is rarely used, especially since read only view is needed. Prefer to use const T&
for read only view, where T
is the type of passed in object.
Do not call default constructor unless required
vector<XYZCoordinate> file = vector<XYZCoordinate>();
should be just
vector<XYZCoordinate> file;
Calling default constructor explicitly increases chances of running into vexing parse. To call default constructor, just use {}
:
std::vector<int>{};
Try to find better names
class Loadfileoutput;
Is somewhat confusing. One would think that there should a file in it somewhere, but there is no file. Something like Dataset
, or CollectedData
might be better.
- Knowledge of standard library
@R.Sahu showed you some suggestions on improving the performance of the reading part in .txt
files. Here is an implementation which somewhat incorporates the first suggestion (misses the part about dealing with errors):
#include <iostream>
#include <cctype>
#include <vector>
#include <limits>
struct point {
int x;
int y;
double z;
};
std::istream& operator>>(std::istream& is, point& p) {
double x, y, z;
is >> x >> y >> z;
p = {static_cast<int>(x),
static_cast<int>(y),
z}; //explicit conversion, as implicit is not allowed
return is;
}
std::ostream& operator<<(std::ostream& os, const point& p) {
return os << p.x << ' ' << p.y << ' ' << p.z;
}
struct custom_classification : std::ctype<char> {
custom_classification() : ctype(make_table()) { }
private:
static mask* make_table() {
const mask* classic = classic_table();
static std::vector<mask> v(classic, classic + table_size);
v[','] |= space;
return &v[0];
}
};
std::vector<point> read_points(std::istream& is) {
auto old_locale = is.imbue(std::locale(is.getloc(), new custom_classification));
is.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
auto points = std::vector(std::istream_iterator<point>(is), {});
is.imbue(old_locale);
return points;
}
int main() {
auto points = read_points(std::cin);
for (auto&& a_point: points) {
std::cout << a_point << '\n';
}
}
There are several things going on here:
point
is standalone.It knows how to read itself from a stream, how to print itself into stream, and it handles that narrowing conversion on its own. Do note the use of aggregate initializer.
Ignoring commas and other formatting issues are delegated to stream. The old locale is restored, to behave as a good citizen and avoid surprising behaior when code gets modified.
Iterators are used extensively to reduce boilerplate and naked loops, although at some cost of being beginner friendly
To search for min-max while reading, one could use idiomatic while (first != last)
loop:
dataset dset; //contains vector, min and max
auto first = std::istream_iterator<point>(is);
auto last = std::istream_iteratior<point>{};
//initialize dset to reasonable defaults, for min-max search
while (first != last) {
auto next_point = *first++;
if (next_point.x > dset.xmax) {
dset.xmax = next_point.x;
} else if (next_point.x < dset.xmin) {
dset.xmin = next_point.x;
}
//ditto for y
dset.values.push_back(next_point);
}
-
1\$\begingroup\$ Good review, but I think this misses the fact that there may be comment lines in the file. \$\endgroup\$Edward– Edward2018年08月15日 19:49:31 +00:00Commented Aug 15, 2018 at 19:49
Suggestion 1: Don't split the lines
I don't see the need for splitting each line into tokens and extracting the numbers from each token. You can extract all the numbers from a line by using a istringstream
with the entire line.
Replace the lines
boost::split(v, str, boost::is_any_of(" ,"));
string xstr = v[0];
string ystr = v[1];
string zstr = v[2];
int xint, yint;
float x,y,z;
stringstream(v[0]) >> x;
xint = (int)round(x);
stringstream(v[1]) >> y;
yint = (int)round(y);
stringstream(v[2]) >> z;
by
std:istringstream istr(str);
if ( istr >> x >> y >> z )
{
// Process the data
}
else
{
// Deal with the error.
}
Suggestion 2: Use a char []
instead of std::string
for lines
Constructing and destrucing std::string
s for each line is potentially expensive. Use char []
of sufficient size.
const size_t MAX_LINE_LENGTH = 1000; // Make it large enough for your app.
char line[MXX_LINE_LENGTH];
while ( in.getline(line, MAX_LINE_LENGTH) )
{
// Use line
}
-
1\$\begingroup\$ It seems like problem description is incomplete. \$\endgroup\$Incomputable– Incomputable2018年08月15日 17:58:56 +00:00Commented Aug 15, 2018 at 17:58
-
\$\begingroup\$ @Incomputable, after reading the comment about CSV files, it does seem that way. \$\endgroup\$R Sahu– R Sahu2018年08月15日 18:03:43 +00:00Commented Aug 15, 2018 at 18:03
-
1\$\begingroup\$ I guess I'll upvote it anyway. Newcomers' posts are usually half baked on the start anyway, as they are new to the post format and community. \$\endgroup\$Incomputable– Incomputable2018年08月15日 18:05:33 +00:00Commented Aug 15, 2018 at 18:05
-
1\$\begingroup\$ Actually, I learnt a lot from your contributions on SO, so it should be me thanking you. In some sense, I was raised here :) \$\endgroup\$Incomputable– Incomputable2018年08月15日 18:09:08 +00:00Commented Aug 15, 2018 at 18:09
class XYZCoordinate{
public:
int x;
int y;
float z;
XYZCoordinate(int X, int Y, float Z){
x = X;
y = Y;
z = Z;
}
};
Integrate a formatting utility, like clang-format or astyle, into your tool-chain. Your indentations and spacing are inconsistent.
Do you plan on working with x
and y
as float
s? Avoid transformations until you absolutely need them.
For constructors, prefer initialization over assignment.
class XYZCoordinate {
public:
XYZCoordinate(int x, int y, float z) : x{x}, y{y}, z{z} {}
int x;
int y;
float z;
};
Consider leveraging the type system to safely operate on the dimensions of your coordinate class. You can read up on strong types here and here
Loadfileoutput(int xmn, int xmx, int ymn, int ymx, vector<XYZCoordinate> val){
Your coordinate vector argument is being copied here. Treat vector<XYZCoordinate>
either as a input-only argument or as a sink argument.
// In-Only - Pass by reference to const then copy
Loadfileoutput(/* ... */, const std::vector<XYZCoordinate>& val)
: /* ... */
, values{val} // copied!
{}
// Sink - Pass by value and move into place
Loadfileoutput(/* ... */, std::vector<XYZCoordinate> val)
: /* ... */
, values{std::move(val)} // moved!
{}
std::string str;
Poorly-chosen names can mislead the reader and cause bugs. It is important to use descriptive names that match the semantics and role of the underlying entities, within reason. Your intent is to use the string as a buffer, so just name it buffer
.
bool isfirst = true;
int xmin, xmax, ymin, ymax;
Any time you have a situation where you need a boolean flag to distinguish between a value being set or not, consider using an optional type (boost::optional
, std::optional
, etc).
vector<XYZCoordinate> file = vector<XYZCoordinate>();
Don't repeat yourself.
std::vector<XYZCoordinate> coordinates();
^ ^
Type mentioned once Has an appropriate name
while (std::getline(in, str))
if (str[0] == '#') //.skipping the .txt file header
std::getline
does not skip leading whitespace. Use the IO manipulator std::ws
:
while (std::getline(in >> std::ws, str))
if (str[0] == '#') //.skipping the .txt file header
If you really want to use std::getline
and a string buffer, consider using string_views liberally to access the buffer. Trim and split are easy to implement for string views. Numeric conversions already exists (std::from_chars
).
vector<string> v = vector<string>();
std::vector
doesn't have a small buffer optimization to take advantage of, so every loop on a line builds up and tears down this buffer. Move it outside the loop. That will also mean that you need to appropriately size it (3 records + remaining) and you'll need to wrap either boost::split
or boost::is_any_of
with that limits to 3 splits.
std::string
does have a small string optimization, but it's implementation defined on how much space you have before you are forced to allocate.
SSO Capacities for the big three
The longest coordinate from your example above, including sign and decimal, is 14 characters in length. If it's possible for coordinates to be longer than the SSO supported capacity of your implementation, consider using a string view instead (boost::string_view
, std::string_view
) and use from_chars()
to convert to your numeric type.
boost::split(v, str, boost::is_any_of(" ,"));
Separators are an interesting detail you must take care with. Locales using the SI style may use spaces for its thousands separator. Almost half of the countries using the hindu-arabic numeral system use period for its decimal separator. A similar amount use the comma as a decimal separator. The rest use some combination of both and other characters. Unless you are sure everyone uses the same locale, you just can't split on comma. Similarly, CSV fields typically allow commas to be part of the value, as long as they are part of an escaped sequence,
212311231,3.14,"3,14"
^ ^ ^
| | Don't split on this
Split on these commas
If you want to support CSVs, then support the full specification or document the extent by which you support comma separated values.
string xstr = v[0];
string ystr = v[1];
string zstr = v[2];
These are unused variables and should be removed. Your compiler didn't mention it because copying may have intended side effects for non-pod types. Keep in mind that you have pointers and references if you need to locally refer to something.
int xint, yint;
float x,y,z;
stringstream(v[0]) >> x;
xint = (int)round(x);
stringstream(v[1]) >> y;
yint = (int)round(y);
stringstream(v[2]) >> z;
You construct and destruct 3 instances of std::stringstream
, which is expensive.
Make sure you are using the C++ version of std::round
by including <cmath>
.
Calls to round
can fail. Check to see if std::round
flagged an error with std::fetestexcept
.
XYZCoordinate temp = XYZCoordinate(xint, yint, z);
file.push_back(temp);
You've created a named temporary object and you copy it into your vector. Since you no longer need temp
, just move it into the vector.
XYZCoordinate temp(xint, yint, z);
file.push_back(std::move(temp));
You can shorten this up by avoiding the named temporary.
file.push_back(XYZCoordinate(xint, yint, z));
You can avoid the temporary by directly emplacing the objects arguments.
file.emplace_back(xint, yint, z);
Consider another approach that doesn't rely on string conversions and the temporary shuffling. Others have mentioned implementing operator>>(istream&, XYZCoordinate&)
which does a formatted read and the rounded conversion.
friend std::istream& operator>>(istream& in, XYZCoordinate& c) {
float f_x;
float f_y;
if (!(in >> f_x >> f_y >> c.z)) {
return in;
}
c.x = static_cast<int>(std::round(f_x)); // check the error!
c.y = static_cast<int>(std::round(f_y)); // check the error!
return in;
}
Back in the line reading loop, you simply move the sentry in the file, skipping whitespace and unnecessary portions (your data is 1 record per line, so skip everything after the third value).
while (in >> std::ws) { // skip any whitespace
if (in.peek() == '#') { // skip comment lines
consume_line(in); // by skipping to '\n'
continue;
}
auto& coord = coordinates.emplace_back(); // create the object
if (!(in >> coord)) { // write its values
// Read failed: Log, throw, something
}
consume_line(in); // the record has been read, to the next record!
// use coord.x and coord.y for minmax finding.
}
As the comment says, consume_line
just skips all the remaining characters upto and including the end-of-line character.
std::istream& consume_line(std::istream& stream) {
return stream.ignore(std::numeric_limits<std::streamsize>::max(), stream.widen('\n'));
}
Loadfileoutput output = Loadfileoutput(xmin,xmax,ymin,ymax,file);
return output;
Similarly on simplifying things, you can directly initialize output
.
Loadfileoutput output(xmin, xmax, ymin, ymax, file);
return output;
You don't need a named return variable as you just returning it immediately and making no changes.
return Loadfileoutput(xmin, xmax, ymin, ymax, file);
Your compiler will still be able to use return value optimization and construct the returned value in-place.
Preallocate space in your vector
for storage of your points, with .reserve(size)
. Without preallocation, .push_back()
may need to reallocate and copy your data points several times, which takes time.
Instead of counting lines, a reasonable estimate could be based on the file size ... say, size of file divided by 32, for 3 nine-digit values per line, plus two spaces, two commas and a new line. Over estimating is probably better than underestimating; resize down to actual size after you’ve read all the data.
Explore related questions
See similar questions with these tags.
file >> x >> y >> z
instead of reading a line, putting it into stringstream, thein reading them? It also looks like input file contains commas, whereas your example does not show that. \$\endgroup\$