I have known about std::variant
for a while but had dismissed it as not very useful (I have revised that opinion now). But I would like a second opinion on my usage to see if I can improve this.
The original class looked like this:
namespace ThorsAnvil::Serialize
{
class ParserInterface
{
public:
ParserInterface(std::istream& stream)
: input(stream)
{}
virtual ~ParserInterface() {}
virtual ParserToken getNextToken() = 0;
virtual std::string getKey() = 0;
virtual void getValue(short int&) = 0;
virtual void getValue(int&) = 0;
virtual void getValue(long int&) = 0;
virtual void getValue(long long int&) = 0;
virtual void getValue(unsigned short int&) = 0;
virtual void getValue(unsigned int&) = 0;
virtual void getValue(unsigned long int&) = 0;
virtual void getValue(unsigned long long int&)= 0;
virtual void getValue(float&) = 0;
virtual void getValue(double&) = 0;
virtual void getValue(long double&) = 0;
virtual void getValue(bool&) = 0;
virtual void getValue(std::string&) = 0;
virtual bool isValueNull() = 0;
virtual std::string getRawValue() = 0;
protected:
bool read(char* dst, std::size_t size)
{
return static_cast<bool>(input.read(dst, size));
}
bool readTo(std::string& dst, char delim)
{
return static_cast<bool>(std::getline(input, dst, delim));
}
std::size_t lastReadCount() const
{
return input.gcount();
}
std::streampos getPos()
{
return input.tellg();
}
int get()
{
return input.get();
}
int peek()
{
return input.peek();
}
void ignore(std::size_t size)
{
input.ignore(size);
}
void clear()
{
input.clear();
}
void unget()
{
input.unget();
}
bool ok() const
{
return !input.fail();
}
void setFail()
{
input.setstate(std::ios::failbit);
}
template<typename T>
bool readValue(T& value)
{
return static_cast<bool>(input >> value);
}
private:
std::istream& input;
};
}
I then have implementations for Json/Yaml/Bson that override the virtual functions to provide the actual implementation. But all these implementations use the protected methods to read from the stream.
I did some experiments and found that I this could be much more efficient with a std::string
as the underlying storage. I know I can move a std::string
into a std::stringstream
and simply use the code above, but the overhead of the string stream is very high. So I decided I wanted to modify the class to support strings directly.
This is where std::variant
comes in. Without changing anything (drastically) I made the input type std::variant<std::istream&, StringInput>
. Note: I had to change std::istream&
to std::istream*
as std::variant
does not support references (and I could not move the stream). Also StringInput
is a very basic wrapper around std::string_view
that simply supports the methods I need.
Thus the class I would like reviewed is:
namespace ThorsAnvil::Serialize
{
class ParserInterface
{
public:
// Added a new constructor
// So it can take a string in addition to a stream.
ParserInterface(std::string_view const& str)
: input(str)
{}
ParserInterface(std::istream& stream, ParserConfig config = ParserConfig{})
: input(&stream)
{}
virtual ~ParserInterface() {}
virtual ParserToken getNextToken() = 0;
virtual std::string getKey() = 0;
virtual void getValue(short int&) = 0;
virtual void getValue(int&) = 0;
virtual void getValue(long int&) = 0;
virtual void getValue(long long int&) = 0;
virtual void getValue(unsigned short int&) = 0;
virtual void getValue(unsigned int&) = 0;
virtual void getValue(unsigned long int&) = 0;
virtual void getValue(unsigned long long int&)= 0;
virtual void getValue(float&) = 0;
virtual void getValue(double&) = 0;
virtual void getValue(long double&) = 0;
virtual void getValue(bool&) = 0;
virtual void getValue(std::string&) = 0;
virtual bool isValueNull() = 0;
virtual std::string getRawValue() = 0;
protected:
// The protected functions are all modified to read from
// either a string or stream using the `std::visit()` method.
bool read(char* dst, std::size_t size)
{
struct Read
{
char* dst;
std::size_t size;
Read(char* dst, std::size_t size):dst(dst),size(size){}
bool operator()(std::istream* input) {return static_cast<bool>(input->read(dst, size));}
bool operator()(StringInput& input) {return input.read(dst, size);}
};
return std::visit(Read{dst, size}, input);
}
bool readTo(std::string& dst, char delim)
{
struct ReadTo
{
std::string& dst;
char delim;
ReadTo(std::string& dst, char delim):dst(dst),delim(delim){}
bool operator()(std::istream* input) {return static_cast<bool>(std::getline((*input), dst, delim));}
bool operator()(StringInput& input) {return input.readTo(dst, delim);}
};
return std::visit(ReadTo(dst, delim), input);
}
std::size_t lastReadCount() const
{
struct LastReadCount
{
std::size_t operator()(std::istream const* input) {return input->gcount();}
std::size_t operator()(StringInput const& input) {return input.getLastReadCount();}
};
return std::visit(LastReadCount{}, input);
}
std::streampos getPos()
{
struct GetPos
{
std::streampos operator()(std::istream* input) {return input->tellg();}
std::streampos operator()(StringInput& input) {return input.getPos();}
};
return std::visit(GetPos{}, input);
}
int get()
{
struct Get
{
int operator()(std::istream* input) {return input->get();}
int operator()(StringInput& input) {return input.get();}
};
return std::visit(Get{}, input);
}
int peek()
{
struct Peek
{
int operator()(std::istream* input) {return input->peek();}
int operator()(StringInput& input) {return input.peek();}
};
return std::visit(Peek{}, input);
}
void ignore(std::size_t size)
{
struct Ignore
{
std::size_t size;
Ignore(std::size_t size): size(size) {}
void operator()(std::istream* input) {input->ignore(size);}
void operator()(StringInput& input) {input.ignore(size);}
};
std::visit(Ignore{size}, input);
}
void clear()
{
struct Clear
{
void operator()(std::istream* input) {input->clear();}
void operator()(StringInput& input) {input.clear();}
};
std::visit(Clear{}, input);
}
void unget()
{
struct Unget
{
void operator()(std::istream* input) {input->unget();}
void operator()(StringInput& input) {input.unget();}
};
std::visit(Unget{}, input);
}
bool ok() const
{
struct OK
{
bool operator()(std::istream const* input) {return !input->fail();}
bool operator()(StringInput const& input) {return input.isOk();}
};
return std::visit(OK{}, input);
}
void setFail()
{
struct SetFail
{
void operator()(std::istream* input) {input->setstate(std::ios::failbit);}
void operator()(StringInput& input) {input.setFail();}
};
std::visit(SetFail{}, input);
}
template<typename T>
bool readValue(T& value)
{
struct ReadValue
{
T& value;
ReadValue(T& value) :value(value) {}
bool operator()(std::istream* input) {return static_cast<bool>((*input) >> value);}
bool operator()(StringInput& input) {input.readValue(value);return true;}
};
return std::visit(ReadValue{value}, input);
}
private:
using DataInputStream = std::variant<std::istream*, StringInput>;
DataInputStream input;
};
}
For completness the StringInput class:
namespace ThorsAnvil::Serialize
{
struct StringInput
{
std::string_view data;
std::size_t position;
std::size_t lastRead;
public:
StringInput(std::string_view const& view)
: data(view)
, position(0)
, lastRead(0)
{}
bool read(char* dst, std::size_t size)
{
std::size_t copySize = std::min(size, data.size() - position);
std::copy(&data[position], &data[position + copySize], dst);
position += copySize;
lastRead = copySize;
return position <= data.size();
}
bool readTo(std::string& dst, char delim)
{
auto find = data.find(delim, position);
if (find == std::string::npos) {
find = data.size();
}
auto size = find - position;
dst.resize(size);
std::copy(&data[position], &data[position + size], &dst[0]);
position += (size + 1);
return position <= data.size();
}
std::size_t getLastReadCount() const {return lastRead;}
std::streampos getPos() {return position;}
int get() {return data[position++];}
int peek() {return data[position];}
void ignore(std::size_t size) {position += size;}
void clear() {position = 0;}
void unget() {--position;}
bool isOk() const {return position <= data.size();{
void setFail() {position = data.size() + 1;}
template<typename T>
void readValue(T& value)
{
using std::from_chars;
#if defined(NO_STD_SUPPORT_FROM_CHAR_DOUBLE) && (NO_STD_SUPPORT_FROM_CHAR_DOUBLE >= 1)
using fast_float::from_chars;
#endif
auto start = &data[position];
auto result = from_chars(start, &data[0] + data.size(), value);
if (result.ec != std::errc::invalid_argument)
{
lastRead = (result.ptr - start);
position+= lastRead;
}
}
void readValue(char& value)
{
while (position < data.size() && std::isspace(data[position])) {
++position;
}
value = (position < data.size()) ? data[position] : -1;
++position;
}
}
}
1 Answer 1
I can hardly criticize the overall design all that much, because I’ve done more or less the same thing. 😏
It sounds like you’re primarily interested in the use of std::variant
, so I’ll focus on that.
So, as I mentioned, I’ve done more or less the same thing. The difference is that I made my StringInput
a implementation detail of the parser class. I regret that choice, but I also don’t think that it makes much sense as an independent type. What I planned to do (never got around to it), was to build a generalized input shim class, and that would wrap the istream
, string_view
, and any other input sources, in a pleasant and useful universal interface. In other words, I made StringInput
an implementation detail... you made DataInputStream
an implementation detail... but I think the right solution would be make DataInputStream
(or something much like it) an independent, reusable type.
If DataInputStream
were a completely independent type, then any complexity inside it wouldn’t be that big a deal; the point of a type is to wrap complexity and insulate other business logic from it. So all those visitor classes that take up so much space in ParserInterface
would be a wash.
For now, let’s see if we can reduce the footprint variant
takes up in ParserInterface
.
To start, I notice that there are several functions where the only difference between the stream action and the string action is whether the call is direct or indirect:
int get()
{
struct Get
{
int operator()(std::istream* input) {return input->get();}
int operator()(StringInput& input) {return input.get();}
};
return std::visit(Get{}, input);
}
int peek()
{
struct Peek
{
int operator()(std::istream* input) {return input->peek();}
int operator()(StringInput& input) {return input.peek();}
};
return std::visit(Peek{}, input);
}
// etc. ...
An easy way to simplify this near-duplication is with a wrapper that overloads the indirection operator. It’s basically just 3 lines that turn multiple functions from this:
int get()
{
struct Get
{
int operator()(std::istream* input) {return input->get();}
int operator()(StringInput& input) {return input.get();}
};
return std::visit(Get{}, input);
}
... to this:
int get()
{
return std::visit([](auto& input) { return input->get(); }, input);
}
And it shouldn’t add any noticeable overhead, because both the compiler and the CPU can trivially see through that extra indirection, so prefetch is on your side. (You’re already paying for a hidden branch when selecting the active variant member.)
Generally, I would suggest this pattern:
auto func(...)
{
struct Func
{
auto operator()(T1& t) { return t.func(...); }
auto operator()(T2& t) { return t.func(...); }
auto operator()(T3& t) { return t.func(...); }
// ... etc.
// ... data members ...
};
return std::visit(Func{...}, v);
}
... is not a good idea.
The big hazard is that an unfortunate typo could potentially silently "work". Consider what might happen if there happened to be a completely unrelated but visible class named Fnuc
, and—by accident—you typed the visit
line above as return std::visit(Fnuc{}, v);
, and Fnuc
happens to have an operator()
defined that happens to accept the same types that are in the variant (not so far fetched if the call operator is a template that just accepts everything). Who knows what could happen. It could even appear to work.
The above scenario is not likely... but it is also not exactly ridiculous. That’s what makes it concerning for me. If it were a likely problem, you’d be inclined to check for it. But because it is an unlikely problem, you probably won’t bother to check all the time, and then when it silently "works"....
The other thing that bothers me about the above pattern is how non-intuitive it is. Normally you read a function downward—top to bottom—and follow the logic step-by-step as you go. (This is obviously a simplification, because loops, but it is part of the reason why "goto
considered harmful" became a thing.) But to read the function above, you kinda have to jump around a bit. First you read this struct
nested in the function, and be like... "ooookay, what’s this about?". And then you see the visit()
call and have the "aha" moment... and then you have to go back up, now that you know what’s going on and understand that the multiple call operator functions are branches of a pattern-matching visit operation.
Whenever I use variants, I always use either a lambda that is defined in situ:
auto func(...)
{
return std::visit([...](auto& t)
{
using T = std::remove_cvref_t<decltype(t)>;
if constexpr (std::is_same_v<T, T1>) { return t.func(...); }
else if constexpr (std::is_same_v<T, T2>) { return t.func(...); }
else if constexpr (std::is_same_v<T, T3>) { return t.func(...); }
// ... etc.
// Optional catch-all:
else { ... }
},
v);
}
... or the "overload" pattern:
// Assuming:
template<typename... Ts>
struct overloaded : Ts... { using Ts::operator()...; };
auto func(...)
{
return std::visit(overloaded{
[](T1& t) { return t.func(...); },
[](T2& t) { return t.func(...); },
[](T3& t) { return t.func(...); },
// ... etc.
// Optional catch-all:
[](auto& t) { ... }
},
v);
}
Either way I think is easier to read and understand, because the logic flows naturally. std::visit()
basically becomes the head of a pattern-matching tree—or a switch
on steroids—and the pattern branches are all right there in order. You’re not just confronted with a set of (what you will later realize are) match patterns with no context.
Also, it makes the hazard of a typo creating a silent match to some external class impossible.
Also, you can use member function templates (either as the in situ lambda itself, of as a catch-all matcher), which you can’t do with local classes.
(Also, keep in mind that the future is pattern matching, which probably will look like this:
auto func(...)
{
return v match
{
T1: let t => t.func(...);
T2: let t => t.func(...);
T3: let t => t.func(...);
// ... etc.
// Optional catch-all:
let t => ...;
};
}
... which looks a lot more like either the in situ lambda or the "overload" version than it does the version with the local class.)
Of the three versions, I would recommend the "overload" version, because it is both simpler and shorter. Having a type like overloaded
defined in your library would be generally useful in any case.
Some more before and after.
// Before:
bool read(char* dst, std::size_t size)
{
struct Read
{
char* dst;
std::size_t size;
Read(char* dst, std::size_t size):dst(dst),size(size){}
bool operator()(std::istream* input) {return static_cast<bool>(input->read(dst, size));}
bool operator()(StringInput& input) {return input.read(dst, size);}
};
return std::visit(Read{dst, size}, input);
}
// After:
bool read(char* dst, std::size_t size)
{
return std::visit(overloaded{
[dst, size](std::istream* input) { return static_cast<bool>(input->read(dst, size)); },
[dst, size](StringInput& input) { return input.read(dst, size); }
}, input);
}
// Or, assuming a wrapper with the indirection operator...:
bool read(char* dst, std::size_t size)
{
// Don't even need overloaded if there's only one branch:
return std::visit([dst, size](auto& input) { return bool(input->read(dst, size)); }, input);
}
// ---------------------------------------------
// Before:
bool readTo(std::string& dst, char delim)
{
struct ReadTo
{
std::string& dst;
char delim;
ReadTo(std::string& dst, char delim):dst(dst),delim(delim){}
bool operator()(std::istream* input) {return static_cast<bool>(std::getline((*input), dst, delim));}
bool operator()(StringInput& input) {return input.readTo(dst, delim);}
};
return std::visit(ReadTo(dst, delim), input);
}
// After:
bool readTo(std::string& dst, char delim)
{
return std::visit(overloaded{
// Fine to use single-letter identifiers, because the scope is so tiny:
[&dst, delim](std::istream* i) { return bool(std::getline(*i, dst, delim)); },
[&dst, delim](StringInput& i) { return i.readTo(dst, delim); }
}, input);
}
... and so on.
Hope this helps!
-
\$\begingroup\$ Thanks that was useful. \$\endgroup\$Loki Astari– Loki Astari2024年09月27日 17:00:28 +00:00Commented Sep 27, 2024 at 17:00
-
\$\begingroup\$ I agree that readability is slightly lower the way I am doing it. So I will look into your other forms. \$\endgroup\$Loki Astari– Loki Astari2024年09月27日 17:01:56 +00:00Commented Sep 27, 2024 at 17:01
-
\$\begingroup\$ I am really surprised (Impressed) by how efficient this technique is. I have been trying all sorts of optimizations but the code generated by the visitor pattern is as good as any other technique I have tried). And most of the other techniques are a lot messier in code. \$\endgroup\$Loki Astari– Loki Astari2024年09月27日 17:03:57 +00:00Commented Sep 27, 2024 at 17:03
-
\$\begingroup\$ The one thing I have done since this is convert
Read
andReadTo
,Ignore
andReadValue
into stateless classes (no constructor or members). This does mean the parameters to visit become a bit more funky (as they all need to be variants) but withusing
it is still readable. \$\endgroup\$Loki Astari– Loki Astari2024年09月27日 17:05:46 +00:00Commented Sep 27, 2024 at 17:05 -
\$\begingroup\$ Take everything I say next with a heavy grain of salt, because my focus is on high-level code architecture, and not low-level code gen or CPU architecture: It seems to me that visiting a variant, no matter how you do it, should be more efficient than a classic OO polymorphic implementation... or, in the worst case, not less efficient. 1/5 \$\endgroup\$indi– indi2024年09月28日 12:26:42 +00:00Commented Sep 28, 2024 at 12:26
std::variant
can store references if you usestd::reference_wrapper
. But if you already working withstd::istream
s, then to read from a string the caller can just make astd::istringstream
or since C++23 make astd::ispanstream
for an existingstd::string
. \$\endgroup\$std::istringstream
. But I am optimizing for performance. The stream interface has an excissively high overhead. By adding the theStringInput
class I have increased throughput by a factor of 4. \$\endgroup\$reference_wrapper
is really just a pointer. Might as well just store the pointer. \$\endgroup\$