Fast Document Distance, C++

Question 1

This is an attempt at a fast "document distance" implementation in C++. For reference, I found this problem in MIT's OpenCourseware 6.006 algorithms course.

A quick look at the callgrind output shows that my performance killers are, in order:

the transfrom call in formatToken
Constructing inputString in wordOccurences
Indexing into the unordered_map (calling operator[])
reading from the stringstream in order to split the string

Here's the code:

#include <algorithm>
#include <fstream>
#include <functional>
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>
#include <unordered_map>
using namespace std;
string const g_source = "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz1234567890";
string const g_destination = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890";
bool containsKey(unordered_map<string, int> const &wordMap, string const &key)
{
 return wordMap.find(key) != wordMap.end();
}
template <typename ForwardIterator>
unordered_map<char, char> translationTable(
 ForwardIterator first,
 ForwardIterator last,
 ForwardIterator dFirst)
{
 unordered_map<char, char> table;
 for(; first != last; ++first, ++dFirst) {
 table[*dFirst] = *first;
 }
 return table;
}
void formatToken(string &rawString)
{
 static unordered_map<char, char> charTranslationTable = translationTable(
 g_source.begin(),
 g_source.end(),
 g_destination.begin());
 transform(
 rawString.begin(),
 rawString.end(),
 rawString.begin(),
 [](char c) { char ret = charTranslationTable[c]; return ret ? ret : ' '; });
}
unordered_map<string, int> wordOccurences(ifstream &is)
{
 unordered_map<string, int> result;
 string inputString{istreambuf_iterator<char>(is), istreambuf_iterator<char>()};
 formatToken(inputString);
 istringstream stream(inputString);
 string token;
 while (stream) {
 stream >> token;
 result[token] += 1;
 }
 return result;
}
double documentMagnitude(unordered_map<string, int> const &document)
{
 double magnitude = 0.0;
 for (auto const &word : document) {
 magnitude += double(word.second * word.second);
 }
 return sqrt(magnitude);
}
int main(int argc, char *argv[])
{
 string filename1 = arg[1];
 string filename2 = arg[2];
 ifstream file1(filename1);
 ifstream file2(filename2);
 unordered_map<string, int> document1 = wordOccurences(file1);
 unordered_map<string, int> document2 = wordOccurences(file2);
 double magnitude1 = documentMagnitude(document1);
 double magnitude2 = documentMagnitude(document2);
 double innerProduct = 0.0;
 for (auto const &word : document2) {
 if (containsKey(document1, word.first)) {
 innerProduct += double(word.second * document1[word.first]);
 }
 }
 double cosAngle = innerProduct / magnitude1 / magnitude2;
 cout << "Distance: " << acos(cosAngle) << '\n';
}

I'm running at speeds comparable to the CPython implementation from the aforementioned course. If I try to pre-allocate the string size to the stream-size, I get less than 1% improvement in performance.

Question 2

Speed/optimization

Unfortunately, istream_buf_iterators aren't a particularly fast way to read the entirety of a file into a string. One simple way to improve speed is to use the istream's read instead. To do that, you can replace this:

string inputString{istreambuf_iterator<char>(is), istreambuf_iterator<char>()};

...with something like this:

string inputString;
is.seekg(0, std::ios::end);
size_t length = is.tellg();
inputString.resize(length);
is.seekg(0, std::ios::beg);
is.read(&inputString[0], length);

At least in my testing, this reduces the time from around 800 ms to around 600 ms for a pair of test files I threw together (about 6.5 megabytes, mostly from project Gutenberg, with one chapter of one book edited out of one of the files).

If you're running this under Linux, you'll probably also benefit from adding a line like this toward the beginning of main:

std::ios::sync_with_stdio(false);

On Windows this has essentially no effect, but on Linux it can make quite a bit of difference.

Correctness

Command line arguments

You declare the command line arguments as argc/argv like usual, but attempt to refer to two arguments as arg[1] and arg[2] (should be argv instead of arg).

You should also check that argc > 2 before attempting to use argv[2].

EOF detection

This code:

while (stream) {
 stream >> token;
 result[token] += 1;
}

...is broken. It doesn't detect the end of the file correctly. For accurate EOF detection, consider using something like this instead:

while (stream >> token)
 ++result[token];

Translation

I'm pretty sure you transposed these two strings:

string const g_source = "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz1234567890";
string const g_destination = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890";

I'm pretty sure the intent was the either upper-case or lower-case input would be mapped to lower-case output. As it stands, you seem to be trying to map lower-case input to both lower-case and upper-case output.

Style

Simplicity

Although it probably won't make a huge difference in speed, I'd rather use a simple std::array for the translation table, and just fill in the result character for every possible input:

static const size = std::numeric_limits<unsigned char>::max();
std::array<char, size> translation_table;
for (int i=0; i<size; i++)
 if (std::isalpha(i))
 translation_table[i] = std::tolower(i);
 else if (std::isdigit(i))
 translation_table[i] = i;
 else
 translation_table[i] = ' ';

Then translating a character becomes utterly trivial:

transform(
 rawString.begin(),
 rawString.end(),
 rawString.begin(),
 [](unsigned char c) { return translation_table[c]; });

Also note the change to unsigned char, so if we have an input that's negative when viewed as a char, we don't try to index outside the table.

Encapsulation

Taking the next step from the last point, I'd rather hide the implementation of the translation a bit more by wrapping it up into a class. For example, something like this might be usable:

class formatter {
 class xlat_table {
 static const int table_size = std::numeric_limits<unsigned char>::max();
 std::array<char, table_size> table;
 public:
 xlat_table() {
 for (int i = 0; i < table_size; i++)
 if (std::isalpha(i))
 table[i] = std::tolower(i);
 else if (std::isdigit(i))
 table[i] = i;
 else table[i] = ' ';
 }
 char operator[](size_t n) { return table[n]; }
 };
 static xlat_table table;
public:
 void operator()(std::string &rawString) { 
 transform(
 rawString.begin(),
 rawString.end(),
 rawString.begin(),
 [this](unsigned char c) { return table[c]; });
 }
};
formatter::xlat_table formatter::table;

This creates the xlat_table as a static member of formatter, so we can create formatter objects freely, without initializing a new translation table every time.

Conclusion

This should improve speed at least a little bit, but this will probably be I/O bound in most cases, so just translating the code to C++ isn't necessarily going to give any huge speed improvement. On the other hand, if you have something that can do really fast I/O (e.g., a reasonably high-end SSD) you might be more likely to see substantial gains.

My final code putting the pieces together looks like this:

#include <algorithm>
#include <fstream>
#include <functional>
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>
#include <unordered_map>
#include <chrono>
#include <array>
#include <cctype>
using namespace std;
bool containsKey(unordered_map<string, int> const &wordMap, string const &key)
{
 return wordMap.find(key) != wordMap.end();
}
class formatter {
 class xlat_table {
 static const int table_size = std::numeric_limits<unsigned char>::max();
 std::array<char, table_size> table;
 public:
 xlat_table() {
 for (int i = 0; i < table_size; i++)
 if (std::isalpha(i))
 table[i] = std::tolower(i);
 else if (std::isdigit(i))
 table[i] = i;
 else table[i] = ' ';
 }
 char operator[](size_t n) { return table[n]; }
 };
 static xlat_table table;
public:
 void operator()(std::string &rawString) { 
 transform(
 rawString.begin(),
 rawString.end(),
 rawString.begin(),
 [this](unsigned char c) { return table[c]; });
 }
};
formatter::xlat_table formatter::table;
unordered_map<string, int> wordOccurences(istream &is)
{
 unordered_map<string, int> result;
 string inputString;
 is.seekg(0, std::ios::end);
 size_t length = is.tellg();
 inputString.resize(length);
 is.seekg(0, std::ios::beg);
 is.read(&inputString[0], length);
 formatter()(inputString);
 istringstream stream(inputString);
 string token;
 while (stream >> token) {
 ++result[token];
 }
 return result;
}
double documentMagnitude(unordered_map<string, int> const &document)
{
 double magnitude = 0.0;
 for (auto const &word : document) {
 magnitude += double(word.second * word.second);
 }
 return sqrt(magnitude);
}
double cos_dist(std::istream &file1, std::istream &file2) {
 using namespace std::chrono;
 auto begin = high_resolution_clock::now();
 unordered_map<string, int> document1 = wordOccurences(file1);
 unordered_map<string, int> document2 = wordOccurences(file2);
 double magnitude1 = documentMagnitude(document1);
 double magnitude2 = documentMagnitude(document2);
 double innerProduct = 0.0;
 for (auto const &word : document2) {
 if (containsKey(document1, word.first)) {
 innerProduct += double(word.second * document1[word.first]);
 }
 }
 double cosAngle = innerProduct / magnitude1 / magnitude2;
 auto end = high_resolution_clock::now();
 std::cout << "Time: " << duration_cast<milliseconds>(end - begin).count() << " ms\n";
 return cosAngle;
}
int main(int argc, char *argv[])
{
 std::ios::sync_with_stdio(false);
 string filename1 = argv[1];
 string filename2 = argv[2];
 ifstream file1(filename1);
 ifstream file2(filename2);
 double cosAngle = cos_dist(file1, file2);
 cout << "Distance: " << acos(cosAngle) << '\n';
}

With all the pieces put together, this reduces the time (on my test files) from about 800 milliseconds to about 450-475 milliseconds. I haven't done any detailed profiling to find what parts of the code are consuming that time, so there might be some fairly obvious places to get more improvement.

Question 3

This has a lot of insightful suggestions. I'll just make a couple of comments for future viewers: the Translation error you pointed out is better identified as a poor naming, because if you take a look at the original code, the translation table is correctly implemented, however I labeled the source as the destination and vice versa. In addition to this, I assumed as you did that I would be IO-bound, but even after the changes, this is still not the case; in fact, using istream::read() (as you suggested) makes this a total non-issue!

Jerry Coffin Jerry Coffin 34.1k4 gold badges77 silver badges145 bronze badges · Accepted Answer · 2016-05-09 09:44:47Z

Speed/optimization

Unfortunately, istream_buf_iterators aren't a particularly fast way to read the entirety of a file into a string. One simple way to improve speed is to use the istream's read instead. To do that, you can replace this:

string inputString{istreambuf_iterator<char>(is), istreambuf_iterator<char>()};

...with something like this:

string inputString;
is.seekg(0, std::ios::end);
size_t length = is.tellg();
inputString.resize(length);
is.seekg(0, std::ios::beg);
is.read(&inputString[0], length);

At least in my testing, this reduces the time from around 800 ms to around 600 ms for a pair of test files I threw together (about 6.5 megabytes, mostly from project Gutenberg, with one chapter of one book edited out of one of the files).

If you're running this under Linux, you'll probably also benefit from adding a line like this toward the beginning of main:

std::ios::sync_with_stdio(false);

On Windows this has essentially no effect, but on Linux it can make quite a bit of difference.

Correctness

Command line arguments

You declare the command line arguments as argc/argv like usual, but attempt to refer to two arguments as arg[1] and arg[2] (should be argv instead of arg).

You should also check that argc > 2 before attempting to use argv[2].

EOF detection

This code:

while (stream) {
 stream >> token;
 result[token] += 1;
}

...is broken. It doesn't detect the end of the file correctly. For accurate EOF detection, consider using something like this instead:

while (stream >> token)
 ++result[token];

Translation

I'm pretty sure you transposed these two strings:

string const g_source = "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz1234567890";
string const g_destination = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890";

I'm pretty sure the intent was the either upper-case or lower-case input would be mapped to lower-case output. As it stands, you seem to be trying to map lower-case input to both lower-case and upper-case output.

Style

Simplicity

Although it probably won't make a huge difference in speed, I'd rather use a simple std::array for the translation table, and just fill in the result character for every possible input:

static const size = std::numeric_limits<unsigned char>::max();
std::array<char, size> translation_table;
for (int i=0; i<size; i++)
 if (std::isalpha(i))
 translation_table[i] = std::tolower(i);
 else if (std::isdigit(i))
 translation_table[i] = i;
 else
 translation_table[i] = ' ';

Then translating a character becomes utterly trivial:

transform(
 rawString.begin(),
 rawString.end(),
 rawString.begin(),
 [](unsigned char c) { return translation_table[c]; });

Also note the change to unsigned char, so if we have an input that's negative when viewed as a char, we don't try to index outside the table.

Encapsulation

Taking the next step from the last point, I'd rather hide the implementation of the translation a bit more by wrapping it up into a class. For example, something like this might be usable:

class formatter {
 class xlat_table {
 static const int table_size = std::numeric_limits<unsigned char>::max();
 std::array<char, table_size> table;
 public:
 xlat_table() {
 for (int i = 0; i < table_size; i++)
 if (std::isalpha(i))
 table[i] = std::tolower(i);
 else if (std::isdigit(i))
 table[i] = i;
 else table[i] = ' ';
 }
 char operator[](size_t n) { return table[n]; }
 };
 static xlat_table table;
public:
 void operator()(std::string &rawString) { 
 transform(
 rawString.begin(),
 rawString.end(),
 rawString.begin(),
 [this](unsigned char c) { return table[c]; });
 }
};
formatter::xlat_table formatter::table;

This creates the xlat_table as a static member of formatter, so we can create formatter objects freely, without initializing a new translation table every time.

Conclusion

This should improve speed at least a little bit, but this will probably be I/O bound in most cases, so just translating the code to C++ isn't necessarily going to give any huge speed improvement. On the other hand, if you have something that can do really fast I/O (e.g., a reasonably high-end SSD) you might be more likely to see substantial gains.

My final code putting the pieces together looks like this:

#include <algorithm>
#include <fstream>
#include <functional>
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>
#include <unordered_map>
#include <chrono>
#include <array>
#include <cctype>
using namespace std;
bool containsKey(unordered_map<string, int> const &wordMap, string const &key)
{
 return wordMap.find(key) != wordMap.end();
}
class formatter {
 class xlat_table {
 static const int table_size = std::numeric_limits<unsigned char>::max();
 std::array<char, table_size> table;
 public:
 xlat_table() {
 for (int i = 0; i < table_size; i++)
 if (std::isalpha(i))
 table[i] = std::tolower(i);
 else if (std::isdigit(i))
 table[i] = i;
 else table[i] = ' ';
 }
 char operator[](size_t n) { return table[n]; }
 };
 static xlat_table table;
public:
 void operator()(std::string &rawString) { 
 transform(
 rawString.begin(),
 rawString.end(),
 rawString.begin(),
 [this](unsigned char c) { return table[c]; });
 }
};
formatter::xlat_table formatter::table;
unordered_map<string, int> wordOccurences(istream &is)
{
 unordered_map<string, int> result;
 string inputString;
 is.seekg(0, std::ios::end);
 size_t length = is.tellg();
 inputString.resize(length);
 is.seekg(0, std::ios::beg);
 is.read(&inputString[0], length);
 formatter()(inputString);
 istringstream stream(inputString);
 string token;
 while (stream >> token) {
 ++result[token];
 }
 return result;
}
double documentMagnitude(unordered_map<string, int> const &document)
{
 double magnitude = 0.0;
 for (auto const &word : document) {
 magnitude += double(word.second * word.second);
 }
 return sqrt(magnitude);
}
double cos_dist(std::istream &file1, std::istream &file2) {
 using namespace std::chrono;
 auto begin = high_resolution_clock::now();
 unordered_map<string, int> document1 = wordOccurences(file1);
 unordered_map<string, int> document2 = wordOccurences(file2);
 double magnitude1 = documentMagnitude(document1);
 double magnitude2 = documentMagnitude(document2);
 double innerProduct = 0.0;
 for (auto const &word : document2) {
 if (containsKey(document1, word.first)) {
 innerProduct += double(word.second * document1[word.first]);
 }
 }
 double cosAngle = innerProduct / magnitude1 / magnitude2;
 auto end = high_resolution_clock::now();
 std::cout << "Time: " << duration_cast<milliseconds>(end - begin).count() << " ms\n";
 return cosAngle;
}
int main(int argc, char *argv[])
{
 std::ios::sync_with_stdio(false);
 string filename1 = argv[1];
 string filename2 = argv[2];
 ifstream file1(filename1);
 ifstream file2(filename2);
 double cosAngle = cos_dist(file1, file2);
 cout << "Distance: " << acos(cosAngle) << '\n';
}

With all the pieces put together, this reduces the time (on my test files) from about 800 milliseconds to about 450-475 milliseconds. I haven't done any detailed profiling to find what parts of the code are consuming that time, so there might be some fairly obvious places to get more improvement.

This has a lot of insightful suggestions. I'll just make a couple of comments for future viewers: the Translation error you pointed out is better identified as a poor naming, because if you take a look at the original code, the translation table is correctly implemented, however I labeled the source as the destination and vice versa. In addition to this, I assumed as you did that I would be IO-bound, but even after the changes, this is still not the case; in fact, using istream::read() (as you suggested) makes this a total non-issue!

Stack Exchange Network

Fast Document Distance, C++

1 Answer 1

Speed/optimization

Correctness

Command line arguments

EOF detection

Translation

Style

Simplicity

Encapsulation

Conclusion

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Fast Document Distance, C++

1 Answer 1

Speed/optimization

Correctness

Command line arguments

EOF detection

Translation

Style

Simplicity

Encapsulation

Conclusion

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions