I'd like to get some expert's view on this first piece of code especially in the area of
- performance - (e.g. avoidable copies)
- good practice (e.g. static members?)
The final goal is the provide some abstractions for text manipulation like cleansing, stop-word removal and tokenization. The code base should grow substantially in the future so a lot more string manipulation methods should be supported. I understand that this should be bundled in a library in the end, however, I'd like to get feedback on the general code style first.
#include <fstream>
#include <iostream>
#include <regex>
#include <sstream>
#include <string>
#include <vector>
class TextManipulator {
public:
static std::string replace(std::string &string, const std::string pattern,
const std::string toreplace) {
return std::regex_replace(string, std::regex(pattern), toreplace);
}
static std::string toLowerCase(std::string &string) {
std::locale loc;
std::string res;
for (auto elem : string) {
res += std::tolower(elem, loc);
}
return res;
}
static std::vector<std::string> split(const std::string &text,
char seperator) {
std::vector<std::string> tokens;
std::size_t start = 0, end = 0;
while ((end = text.find(seperator, start)) != std::string::npos) {
std::string extr = text.substr(start, end - start);
if (!extr.empty()) {
tokens.push_back(extr);
}
start = end + 1;
}
tokens.push_back(text.substr(start));
return tokens;
}
};
int main() {
std::ifstream input(
"/example_docs/Pride_and_Prejudice.txt");
std::stringstream buffer;
buffer << input.rdbuf();
std::string text = buffer.str();
text = TextManipulator::replace(text, "\r\n", " ");
text = TextManipulator::replace(text, ",", "");
text = TextManipulator::replace(text, ":", "");
text = TextManipulator::replace(text, ".", "");
text = TextManipulator::toLowerCase(text);
std::vector<std::string> tokens = TextManipulator::split(text, ' ');
return 0;
}
2 Answers 2
Don't use a class
There is no need to create a class
just to hold a collection of static
functions. Instead, just declare the functions in the global namespace. If you want to avoid polluting the global namespace, consider putting them in a namespace
instead, like so:
namespace TextManipulation {
std::string replace(...) {
...
};
...
}
Don't write unnecessary trivial wrapper functions
There is no need to create TextManipulator::replace()
, when all it does is call std::regex_replace
with almost exactly the same arguments. Why not call std::regex_replace()
directly instead?
The name replace()
also hides the fact that pattern
is a regular expression.
Consider renaming toLowerCase()
to tolower()
It's a bit annoying to have different naming conventions in one program, and since you already have tolower()
for single characters, it would be very convenient to reuse that for whole strings as well. So:
std::string tolower(const std::string &str) {
...
}
Note: you forget to add const
to the function parameter here, as well as for the first parameter of replace()
, but you did it correctly for the other string parameters.
Whenever possible, use boost::string_ref in your code, this will reduce the number of copy operations with std::string
. Check the places where you have copy operations and replace them with boost::string_ref
operations if possible.
Explore related questions
See similar questions with these tags.