## Optimizing the algorithm
All tests were done on a 4.7 MB file containing a copy of a history of the US and a copy of a medical manual (around 1,400 pages' worth of content).
Timings are the average of 10 single runs through the file. The code was compiled under VC++ 2015 x64.
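The timings below were collected with a simple average-over-runs harness along these lines (this helper is my own sketch of the methodology, not code from the actual test program):

```cpp
#include <chrono>
#include <functional>

// Average wall-clock time of `work` over `runs` executions, in milliseconds.
// steady_clock is monotonic, so it is the right clock for interval timing.
double average_ms(const std::function<void()>& work, int runs) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < runs; ++i)
        work();
    const std::chrono::duration<double, std::milli> total = clock::now() - start;
    return total.count() / runs;
}
```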
Here's a link to the test program; simply substitute your own files where appropriate.

http://coliru.stacked-crooked.com/a/28907bc68796a0bf

## Better containers for lookup
You're using a `std::map<>` to count word occurrences. You should use a `std::unordered_map<>` instead, as it provides amortized O(1) lookup. This is highly beneficial for large documents. We replace:

```cpp
std::map<std::string, int> counts;
```

with:

```cpp
std::unordered_map<std::string, int> counts;
```
This reduced the runtime from 524 ms to 352 ms.
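For reference, a minimal counting loop over a word stream might look like this (`count_words` is my own illustration of the idea, not code from your program):

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>

// Count every whitespace-separated token read from the stream.
std::unordered_map<std::string, int> count_words(std::istream& in) {
    std::unordered_map<std::string, int> counts;
    std::string word;
    while (in >> word)
        ++counts[word];  // operator[] value-initializes the count to 0 on first insert
    return counts;
}
```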
Similarly, you're using a `std::set<>` to look up stopwords in O(log n) time. You should use a `std::unordered_set<>` instead, as it provides amortized O(1) lookup. We replace:

```cpp
std::set<std::string> stopwords{ ... };
```

with:

```cpp
std::unordered_set<std::string> stopwords{ ... };
```
I've tested this with the following `stopwords.txt` file:

```text
a an the am are is was were and then for I
```
This reduced the runtime from 352 ms to 325 ms.
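The membership test itself is a one-liner; a hedged sketch of how the stopword check might be factored out (the `is_stopword` helper is my own naming, not from your code):

```cpp
#include <string>
#include <unordered_set>

// Average O(1) hash lookup, versus O(log n) for a std::set.
bool is_stopword(const std::string& word,
                 const std::unordered_set<std::string>& stopwords) {
    return stopwords.count(word) != 0;
}
```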
## Optimize tolower()
We can cut execution time by modifying the string through a reference instead of copying it in. Thus, this function:
```cpp
std::string tolower(std::string in) {
    for (auto &c : in) {
        c = std::tolower(static_cast<unsigned char>(c));
    }
    return in;
}
```
Can be implemented like so:
```cpp
void tolower(std::string& in) {
    for (auto &c : in) {
        c = std::tolower(static_cast<unsigned char>(c));
    }
}
```
Furthermore, since your input only contains ASCII characters, the following provides a bigger speed boost than calling `std::tolower()` for every individual character:
```cpp
// Assumes the input contains only ASCII letters and apostrophes,
// so any character below 'a' (other than '\'') is an uppercase letter.
void tolower_ref( std::string& in )
{
    static constexpr int diff{ 'a' - 'A' };
    for ( auto &c : in )
        if ( c < 'a' && c != '\'' )
            c += diff;  // shift 'A'..'Z' down into 'a'..'z'
}
```
This reduced the runtime from 325 ms to 274 ms.
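Under the stated input assumption (letters and apostrophes only), the fast path agrees with a portable `std::tolower()` conversion; a small self-check along these lines can guard the trick (`tolower_std` is my own reference helper for comparison):

```cpp
#include <cctype>
#include <string>

// The ASCII fast path from above: assumes input holds only letters
// and apostrophes, so anything below 'a' (except '\'') is uppercase.
void tolower_ref(std::string& in)
{
    static constexpr int diff{ 'a' - 'A' };
    for (auto &c : in)
        if (c < 'a' && c != '\'')
            c += diff;
}

// Portable reference conversion using std::tolower.
std::string tolower_std(std::string in) {
    for (auto &c : in)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return in;
}
```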
# Conclusion
By applying a handful of small changes, the runtime was cut from 524 ms to 274 ms, a reduction of roughly 48% (about a 1.9x speedup).