I want to return an iterable collection of all q-grams of a provided string. This means all sub-words of length Q
. Let Q
be 3 for example:
normalized
as an already whitespace-stripped and upcased word or string. PREFIX3
and SUFFIX3
are constant strings ^^
and$$
that are put surrounding the string before splitting. That is a common trick to get more q-words.
For the word INDEX
the resulting q-grams are therefore:
^^I ^IN IND NDE DEX EX$ X$$
My version uses several C++14 constructs. This is intended, because the code is part of a lesson for teaching about C++14.
Is there anything I could do better, especially with respect to C++14-ish code?
vector<string> qwordify(const string& normalized) const
{
auto word = PREFIX3 + normalized + SUFFIX3;
vector<string> result {};
auto left = word.cbegin();
auto right = word.cbegin() + Q; // valid, because |"^^"|+|"$$"| => 3.
for( ; right <= word.end(); ++left, ++right) {
result.emplace_back(left, right);
}
return result;
}
Here are the things I considered especially:
- Despite of returning a value,
result
is not copied, because of RVO. - Lots of
auto
, which I like. - Use of
cbegin()
andcend()
, because I declaredauto word
and notconst auto word
. I could have done, but for this demonstration I liked to show thec
-variants. - Use of
emplace_back
, which really saves a copy here, correct?
I feel like some places could be improved, maybe. The iterators maybe? Getting rid of the loop in favor of an algorithm? Maybe even not returning a full vector
at all, but only an iterable proxy-object, but that would too long a shot for this small example, I think.
1 Answer 1
Things I like:
- (Mostly) a pretty neatly packaged, self-contained component.
- Written primarily to emphasize simplicity
- and let the compiler deal with making it fast
Things I don't like so well:
Q
,PREFIX3
, andSUFFIX3
are apparently globals (or maybe class variables?) so I can't see what they really are.- Subtle interaction:
PREFIX3
andSUFFIX3
both (seem to) implicitly assume thatQ = 3
. qword
has a well-known meaning in Intel assembly language. Unless you're sure this will only be used by an audience that immediately knows what you intend, another name might be better.- Written non-generically, for no particularly good reason.
- It almost seems like
qwordify
is really two separate things shoved together: a generic "generate all the results in a sliding window", and a specialized "generate all the qwords from a string". - If you can use the new
string_view
, it would be a good choice here.
Ignoring point 6 (since it's not really C++14), we could end up with code on this order:
// the generic algorithm:
template <class InIt, class OutIt, class size_type>
void window(InIt b, InIt e, OutIt d, size_type len) {
for (auto s = std::next(b, len); s != e; ++s, ++b, ++d)
*d = { b, s };
}
// The piece specifically for q-grams
std::vector<std::string> generate_qgram(std::string const &in, size_t q) {
std::string input = std::string(q-1, '^') + in + std::string(q, '$');
std::vector<std::string> words;
window(input.begin(), input.end(), std::back_inserter(words), q);
return words;
}
// and a little demo code:
int main() {
auto ret = generate_qgram("INDEX", 3);
std::copy(ret.begin(), ret.end(), std::ostream_iterator<std::string>(std::cout, "\n"));
}
-
\$\begingroup\$
PREFIX3
etcstatic const string
in this file, yes. Yes, they assume Q=3. I was thinking of to generate them, but that is hard with string as constexpr, I thought. So I tried withstatic_asserion
onPREFIX3.length() == Q-1
, but of course, could not (notconstexpr
). I didn't think if using the c'torstring(char,size_t)
, thank you. Oops, good point about "qword", I will rename it toqgram
andqgramify
. Also good point about non-genericness. I didn't think of it, but still, I probably leave it because it is teaching code. I like your more separate version usingwindow
. Very good. \$\endgroup\$towi– towi2016年08月23日 08:05:25 +00:00Commented Aug 23, 2016 at 8:05
const
modifier. \$\endgroup\$