## Optimizing the algorithm
All tests were done on a 4.7 MB file containing a copy of a history of the US and a copy of a medical manual (around 1,400 pages' worth of content).
Timings are the average of 10 single runs through the file. The code was compiled under VC++ 2015 x64.
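The timings below were collected with a simple average-over-runs harness along these lines (this helper is my own sketch of the methodology, not code from the actual test program):

```cpp
#include <chrono>
#include <functional>

// Average wall-clock time of `work` over `runs` executions, in milliseconds.
// steady_clock is monotonic, so it is the right clock for interval timing.
double average_ms(const std::function<void()>& work, int runs) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < runs; ++i)
        work();
    const std::chrono::duration<double, std::milli> total = clock::now() - start;
    return total.count() / runs;
}
```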
Here's a link to the test program; simply substitute your own files where appropriate.

http://coliru.stacked-crooked.com/a/28907bc68796a0bf

## Better containers for lookup
You're using a `std::map<>` to count word occurrences. You should use a `std::unordered_map<>` instead, as it provides amortized O(1) lookup. This is highly beneficial for large documents. We replace:

```cpp
std::map<std::string, int> counts;
```

with:

```cpp
std::unordered_map<std::string, int> counts;
```
This reduced the runtime from 524 ms to 352 ms.
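For reference, a minimal counting loop over a word stream might look like this (`count_words` is my own illustration of the idea, not code from your program):

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>

// Count every whitespace-separated token read from the stream.
std::unordered_map<std::string, int> count_words(std::istream& in) {
    std::unordered_map<std::string, int> counts;
    std::string word;
    while (in >> word)
        ++counts[word];  // operator[] value-initializes the count to 0 on first insert
    return counts;
}
```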
Similarly, you're using a `std::set<>` to look up stopwords in O(log n) time. You should use a `std::unordered_set<>` instead, as it provides amortized O(1) lookup. We replace:

```cpp
std::set<std::string> stopwords{ ... };
```

with:

```cpp
std::unordered_set<std::string> stopwords{ ... };
```
I've tested this with the following `stopwords.txt` file:

```text
a an the am are is was were and then for I
```
This reduced the runtime from 352 ms to 325 ms.
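The membership test itself is a one-liner; a hedged sketch of how the stopword check might be factored out (the `is_stopword` helper is my own naming, not from your code):

```cpp
#include <string>
#include <unordered_set>

// Average O(1) hash lookup, versus O(log n) for a std::set.
bool is_stopword(const std::string& word,
                 const std::unordered_set<std::string>& stopwords) {
    return stopwords.count(word) != 0;
}
```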
## Optimize tolower()
We can cut execution time by modifying the string through a reference instead of copying it in. Thus, this function:
```cpp
std::string tolower(std::string in) {
    for (auto &c : in) {
        c = std::tolower(static_cast<unsigned char>(c));
    }
    return in;
}
```
Can be implemented like so:
```cpp
void tolower(std::string& in) {
    for (auto &c : in) {
        c = std::tolower(static_cast<unsigned char>(c));
    }
}
```
Furthermore, since your input only contains ASCII characters, the following provides a bigger speed boost than calling `std::tolower()` for every individual character:
```cpp
// Assumes the input contains only ASCII letters and apostrophes,
// so any character below 'a' (other than '\'') is an uppercase letter.
void tolower_ref( std::string& in )
{
    static constexpr int diff{ 'a' - 'A' };
    for ( auto &c : in )
        if ( c < 'a' && c != '\'' )
            c += diff;  // shift 'A'..'Z' down into 'a'..'z'
}
```
This reduced the runtime from 325 ms to 274 ms.
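Under the stated input assumption (letters and apostrophes only), the fast path agrees with a portable `std::tolower()` conversion; a small self-check along these lines can guard the trick (`tolower_std` is my own reference helper for comparison):

```cpp
#include <cctype>
#include <string>

// The ASCII fast path from above: assumes input holds only letters
// and apostrophes, so anything below 'a' (except '\'') is uppercase.
void tolower_ref(std::string& in)
{
    static constexpr int diff{ 'a' - 'A' };
    for (auto &c : in)
        if (c < 'a' && c != '\'')
            c += diff;
}

// Portable reference conversion using std::tolower.
std::string tolower_std(std::string in) {
    for (auto &c : in)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return in;
}
```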
# Conclusion
By applying a handful of small changes, the runtime was cut from 524 ms to 274 ms, a reduction of roughly 48% (about a 1.9x speedup).