This is an attempt at a fast "document distance" implementation in C++. For reference, I found this problem in MIT's OpenCourseware 6.006 algorithms course.
A quick look at the callgrind
output shows that my performance killers are, in order:
- the
transfrom
call informatToken
- Constructing
inputString
inwordOccurences
- Indexing into the
unordered_map
(callingoperator[]
) - reading from the
stringstream
in order to split the string
Here's the code:
#include <algorithm>
#include <fstream>
#include <functional>
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>
#include <unordered_map>
using namespace std;
string const g_source = "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz1234567890";
string const g_destination = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890";
bool containsKey(unordered_map<string, int> const &wordMap, string const &key)
{
return wordMap.find(key) != wordMap.end();
}
template <typename ForwardIterator>
unordered_map<char, char> translationTable(
ForwardIterator first,
ForwardIterator last,
ForwardIterator dFirst)
{
unordered_map<char, char> table;
for(; first != last; ++first, ++dFirst) {
table[*dFirst] = *first;
}
return table;
}
void formatToken(string &rawString)
{
static unordered_map<char, char> charTranslationTable = translationTable(
g_source.begin(),
g_source.end(),
g_destination.begin());
transform(
rawString.begin(),
rawString.end(),
rawString.begin(),
[](char c) { char ret = charTranslationTable[c]; return ret ? ret : ' '; });
}
unordered_map<string, int> wordOccurences(ifstream &is)
{
unordered_map<string, int> result;
string inputString{istreambuf_iterator<char>(is), istreambuf_iterator<char>()};
formatToken(inputString);
istringstream stream(inputString);
string token;
while (stream) {
stream >> token;
result[token] += 1;
}
return result;
}
double documentMagnitude(unordered_map<string, int> const &document)
{
double magnitude = 0.0;
for (auto const &word : document) {
magnitude += double(word.second * word.second);
}
return sqrt(magnitude);
}
int main(int argc, char *argv[])
{
string filename1 = arg[1];
string filename2 = arg[2];
ifstream file1(filename1);
ifstream file2(filename2);
unordered_map<string, int> document1 = wordOccurences(file1);
unordered_map<string, int> document2 = wordOccurences(file2);
double magnitude1 = documentMagnitude(document1);
double magnitude2 = documentMagnitude(document2);
double innerProduct = 0.0;
for (auto const &word : document2) {
if (containsKey(document1, word.first)) {
innerProduct += double(word.second * document1[word.first]);
}
}
double cosAngle = innerProduct / magnitude1 / magnitude2;
cout << "Distance: " << acos(cosAngle) << '\n';
}
I'm running at speeds comparable to the CPython implementation from the aforementioned course. If I try to pre-allocate the string size to the stream-size, I get less than 1% improvement in performance.
1 Answer 1
Speed/optimization
Unfortunately, istream_buf_iterator
s aren't a particularly fast way to read the entirety of a file into a string. One simple way to improve speed is to use the istream's read
instead. To do that, you can replace this:
string inputString{istreambuf_iterator<char>(is), istreambuf_iterator<char>()};
...with something like this:
string inputString;
is.seekg(0, std::ios::end);
size_t length = is.tellg();
inputString.resize(length);
is.seekg(0, std::ios::beg);
is.read(&inputString[0], length);
At least in my testing, this reduces the time from around 800 ms to around 600 ms for a pair of test files I threw together (about 6.5 megabytes, mostly from project Gutenberg, with one chapter of one book edited out of one of the files).
If you're running this under Linux, you'll probably also benefit from adding a line like this toward the beginning of main
:
std::ios::sync_with_stdio(false);
On Windows this has essentially no effect, but on Linux it can make quite a bit of difference.
Correctness
Command line arguments
You declare the command line arguments as argc
/argv
like usual, but attempt to refer to two arguments as arg[1]
and arg[2]
(should be argv
instead of arg
).
You should also check that argc > 2
before attempting to use argv[2]
.
EOF detection
This code:
while (stream) {
stream >> token;
result[token] += 1;
}
...is broken. It doesn't detect the end of the file correctly. For accurate EOF detection, consider using something like this instead:
while (stream >> token)
++result[token];
Translation
I'm pretty sure you transposed these two strings:
string const g_source = "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz1234567890";
string const g_destination = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890";
I'm pretty sure the intent was the either upper-case or lower-case input would be mapped to lower-case output. As it stands, you seem to be trying to map lower-case input to both lower-case and upper-case output.
Style
Simplicity
Although it probably won't make a huge difference in speed, I'd rather use a simple std::array
for the translation table, and just fill in the result character for every possible input:
static const size = std::numeric_limits<unsigned char>::max();
std::array<char, size> translation_table;
for (int i=0; i<size; i++)
if (std::isalpha(i))
translation_table[i] = std::tolower(i);
else if (std::isdigit(i))
translation_table[i] = i;
else
translation_table[i] = ' ';
Then translating a character becomes utterly trivial:
transform(
rawString.begin(),
rawString.end(),
rawString.begin(),
[](unsigned char c) { return translation_table[c]; });
Also note the change to unsigned char
, so if we have an input that's negative when viewed as a char
, we don't try to index outside the table.
Encapsulation
Taking the next step from the last point, I'd rather hide the implementation of the translation a bit more by wrapping it up into a class. For example, something like this might be usable:
class formatter {
class xlat_table {
static const int table_size = std::numeric_limits<unsigned char>::max();
std::array<char, table_size> table;
public:
xlat_table() {
for (int i = 0; i < table_size; i++)
if (std::isalpha(i))
table[i] = std::tolower(i);
else if (std::isdigit(i))
table[i] = i;
else table[i] = ' ';
}
char operator[](size_t n) { return table[n]; }
};
static xlat_table table;
public:
void operator()(std::string &rawString) {
transform(
rawString.begin(),
rawString.end(),
rawString.begin(),
[this](unsigned char c) { return table[c]; });
}
};
formatter::xlat_table formatter::table;
This creates the xlat_table
as a static member of formatter
, so we can create formatter
objects freely, without initializing a new translation table every time.
Conclusion
This should improve speed at least a little bit, but this will probably be I/O bound in most cases, so just translating the code to C++ isn't necessarily going to give any huge speed improvement. On the other hand, if you have something that can do really fast I/O (e.g., a reasonably high-end SSD) you might be more likely to see substantial gains.
My final code putting the pieces together looks like this:
#include <algorithm>
#include <fstream>
#include <functional>
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>
#include <unordered_map>
#include <chrono>
#include <array>
#include <cctype>
using namespace std;
bool containsKey(unordered_map<string, int> const &wordMap, string const &key)
{
return wordMap.find(key) != wordMap.end();
}
class formatter {
class xlat_table {
static const int table_size = std::numeric_limits<unsigned char>::max();
std::array<char, table_size> table;
public:
xlat_table() {
for (int i = 0; i < table_size; i++)
if (std::isalpha(i))
table[i] = std::tolower(i);
else if (std::isdigit(i))
table[i] = i;
else table[i] = ' ';
}
char operator[](size_t n) { return table[n]; }
};
static xlat_table table;
public:
void operator()(std::string &rawString) {
transform(
rawString.begin(),
rawString.end(),
rawString.begin(),
[this](unsigned char c) { return table[c]; });
}
};
formatter::xlat_table formatter::table;
unordered_map<string, int> wordOccurences(istream &is)
{
unordered_map<string, int> result;
string inputString;
is.seekg(0, std::ios::end);
size_t length = is.tellg();
inputString.resize(length);
is.seekg(0, std::ios::beg);
is.read(&inputString[0], length);
formatter()(inputString);
istringstream stream(inputString);
string token;
while (stream >> token) {
++result[token];
}
return result;
}
double documentMagnitude(unordered_map<string, int> const &document)
{
double magnitude = 0.0;
for (auto const &word : document) {
magnitude += double(word.second * word.second);
}
return sqrt(magnitude);
}
double cos_dist(std::istream &file1, std::istream &file2) {
using namespace std::chrono;
auto begin = high_resolution_clock::now();
unordered_map<string, int> document1 = wordOccurences(file1);
unordered_map<string, int> document2 = wordOccurences(file2);
double magnitude1 = documentMagnitude(document1);
double magnitude2 = documentMagnitude(document2);
double innerProduct = 0.0;
for (auto const &word : document2) {
if (containsKey(document1, word.first)) {
innerProduct += double(word.second * document1[word.first]);
}
}
double cosAngle = innerProduct / magnitude1 / magnitude2;
auto end = high_resolution_clock::now();
std::cout << "Time: " << duration_cast<milliseconds>(end - begin).count() << " ms\n";
return cosAngle;
}
int main(int argc, char *argv[])
{
std::ios::sync_with_stdio(false);
string filename1 = argv[1];
string filename2 = argv[2];
ifstream file1(filename1);
ifstream file2(filename2);
double cosAngle = cos_dist(file1, file2);
cout << "Distance: " << acos(cosAngle) << '\n';
}
With all the pieces put together, this reduces the time (on my test files) from about 800 milliseconds to about 450-475 milliseconds. I haven't done any detailed profiling to find what parts of the code are consuming that time, so there might be some fairly obvious places to get more improvement.
-
\$\begingroup\$ This has a lot of insightful suggestions. I'll just make a couple of comments for future viewers: the Translation error you pointed out is better identified as a poor naming, because if you take a look at the original code, the translation table is correctly implemented, however I labeled the source as the destination and vice versa. In addition to this, I assumed as you did that I would be IO-bound, but even after the changes, this is still not the case; in fact, using istream::read() (as you suggested) makes this a total non-issue! \$\endgroup\$Nasser Al-Shawwa– Nasser Al-Shawwa2016年05月09日 22:02:39 +00:00Commented May 9, 2016 at 22:02