I am trying to recognize a sentence that I have read through optical character recognition software. This code will eventually run on a Raspberry Pi.
I know for certain that it's meant to be one of a few thousand sentences which I have written down in a file (sentencelist.txt), but the character recognition has messed up a few of the words.
For example:
The quick brown fox jumps over the lazy dog
has been read as
He quick broom fax jumps over the lazy dig
I want to compare the incorrect sentence to every sentence in my list, and then figure out which one it's meant to be.
I currently have a working program, but it is just too slow! I have about 10,000 entries in my sentence file, and the whole process takes over a minute.
I initially used the Levenshtein algorithm to compare, but am now using a different algorithm which compares words rather than characters to speed it up.
How can I speed this baby up?
// C++ source for sentence lookup from an OCR string. In main, you can set the distance calculator to be used.
#include "stdafx.h"
#include <iterator>
#include <string>
#include <vector>
#include <list>
#include <iostream>
#include <fstream>
#include <sstream>
#include <algorithm>
using namespace std;
// Levenshtein distance (character level) - callable from main.
// Single-row dynamic programming: only the previous row of the DP matrix is kept.
size_t LevenshteinDistance(const std::string &s1, const std::string &s2)
{
    const size_t m(s1.size());
    const size_t n(s2.size());
    if (m == 0) return n;
    if (n == 0) return m;
    size_t *costs = new size_t[n + 1];
    for (size_t k = 0; k <= n; k++) costs[k] = k;
    size_t i = 0;
    for (std::string::const_iterator it1 = s1.begin(); it1 != s1.end(); ++it1, ++i)
    {
        costs[0] = i + 1;
        size_t corner = i; // d[i-1][j-1] from the previous row
        size_t j = 0;
        for (std::string::const_iterator it2 = s2.begin(); it2 != s2.end(); ++it2, ++j)
        {
            size_t upper = costs[j + 1]; // d[i-1][j]
            if (*it1 == *it2)
            {
                costs[j + 1] = corner;
            }
            else
            {
                size_t t(upper < corner ? upper : corner);
                costs[j + 1] = (costs[j] < t ? costs[j] : t) + 1;
            }
            corner = upper;
        }
    }
    size_t result = costs[n];
    delete[] costs;
    return result;
}
// Edit distance (Levenshtein distance over words) - callable from main.
typedef std::vector<std::string> Sentence;

// Split a string on a delimiter into a vector of words.
Sentence &split(const std::string &s, char delim, Sentence &elems) {
    std::stringstream ss(s);
    std::string item;
    while (std::getline(ss, item, delim)) {
        elems.push_back(item);
    }
    return elems;
}

Sentence split(const std::string &s, char delim) {
    Sentence elems;
    split(s, delim, elems);
    return elems;
}
unsigned int edit_distance(const Sentence& s1, const Sentence& s2)
{
    const std::size_t len1 = s1.size(), len2 = s2.size();
    // Full (len1+1) x (len2+1) dynamic-programming matrix, rebuilt on every call.
    std::vector<std::vector<unsigned int>> d(len1 + 1, std::vector<unsigned int>(len2 + 1));
    d[0][0] = 0;
    for (unsigned int i = 1; i <= len1; ++i) d[i][0] = i;
    for (unsigned int i = 1; i <= len2; ++i) d[0][i] = i;
    for (unsigned int i = 1; i <= len1; ++i)
        for (unsigned int j = 1; j <= len2; ++j)
        {
            d[i][j] = std::min(d[i - 1][j] + 1, d[i][j - 1] + 1);
            d[i][j] = std::min(d[i][j], d[i - 1][j - 1] + (s1[i - 1] == s2[j - 1] ? 0 : 1));
        }
    return d[len1][len2];
}
int main()
{
    // Tesseract outputs the recognized sentence. We call this string s1.
    string s1 = "He quick broom fax jumps over the lazy dig";
    // String s2 is our comparison string, read from the sentence database.
    string s2;
    // Split the strings into vectors of words.
    Sentence sen1 = split(s1, ' ');
    Sentence sen2;
    int i = 0; // Counter
    int num = 13309; // Number of entries in sentence file
    string sentencevec[13309]; // Array where each entry is a sentence string from the file
    int distancevec[13309]; // Array where each entry is the distance between s1 and the corresponding sentence
    // Read from the sentence file and populate the arrays above with data.
    string line;
    ifstream myfile("sentencelist.txt");
    if (myfile.is_open())
    {
        while (getline(myfile, line))
        {
            string s2 = line;
            sen2 = split(s2, ' ');
            i++;
            distancevec[i - 1] = edit_distance(sen1, sen2); // Here is where you set the distance calculator. Currently the word-level edit_distance; swap in LevenshteinDistance(s1, s2) to compare characters instead.
            sentencevec[i - 1] = s2;
        }
        myfile.close();
    }
    else cout << "Unable to open file";
    // Find the smallest entry in distancevec.
    int smallestsentence = 0; // Index into sentencevec of the matched sentence
    int smallestdistance = distancevec[0]; // Distance to the matched sentence
    for (i = 0; i < num; i++) {
        if (distancevec[i] < smallestdistance) {
            smallestdistance = distancevec[i];
            smallestsentence = i;
        }
    }
    // Print the recognized sentence and pause so the program doesn't immediately exit.
    cout << '\n' << '\n' << sentencevec[smallestsentence] << '\n' << '\n';
    system("pause");
    return 0;
}
- As we all want to make our code more efficient or improve it in one way or another, try to write a title that summarizes what your code does, not what you want to get out of a review. For examples of good titles, check out Best of Code Review 2014 - Best Question Title Category. You may also want to read How to get the best value out of Code Review - Asking Questions. (Mathieu Guindon, Jul 26, 2015 at 13:20)
- Cross-posted from Stack Overflow. (Mathieu Guindon, Jul 26, 2015 at 13:25)
- The first step in speeding up any piece of code is to profile it using a tool that will help you identify the bottlenecks. Have you done that? What were the results? Are the bottlenecks in places you can do something about? (nhgrif, Jul 26, 2015 at 13:58)
2 Answers
- I'm not sure if the manually-allocated array in LevenshteinDistance() is necessary when you can probably still use std::vector (a sketch follows after this list). If this is still necessary, then you could consider accounting for failure with new if it could help make your program more robust. Still, having to do something like this suggests that you should avoid manual memory allocation as much as possible in C++.
- Try to avoid single-character variable names, unless they're loop counters. The names m and n may work in a mathematical sense, but you shouldn't assume that the user (or even you in several years) will not be confused by this.
- Consider better names than s1 and s2, based on their intended uses (you already have info on this in some comments in main()).
- Keep your whitespace use consistent and don't add it where unnecessary. For instance, there's a lot of excess whitespace before and within main() for some reason. This does nothing to improve readability.
- This is quite unclear and is practically a magic number:
  int num = 13309;
  The comment next to it describes this variable, so just rename it as such, and make it const since it's a constant. If you need a comment to describe something like this, then you may need to rethink your naming. Comments should mostly be needed for more complex explanations.
- You attempt to open a file for reading, but only display an error on failure and continue executing the program. You should instead terminate the program with a valid error code after displaying the error. The error itself should also be printed to std::cerr instead of std::cout.
- Prefer a better alternative to a "pause" when running the code through certain IDEs, such as:
  std::cin.get();
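To illustrate the std::vector point, here is a minimal sketch; the question's single-row algorithm is unchanged, only the manual new/delete is gone:

size_t LevenshteinDistance(const std::string &s1, const std::string &s2)
{
    const size_t m = s1.size();
    const size_t n = s2.size();
    if (m == 0) return n;
    if (n == 0) return m;
    std::vector<size_t> costs(n + 1); // released automatically, even if an exception is thrown
    for (size_t k = 0; k <= n; ++k) costs[k] = k;
    for (size_t i = 0; i < m; ++i)
    {
        costs[0] = i + 1;
        size_t corner = i; // d[i-1][j-1] from the previous row
        for (size_t j = 0; j < n; ++j)
        {
            size_t upper = costs[j + 1]; // d[i-1][j]
            if (s1[i] == s2[j])
                costs[j + 1] = corner;
            else
                costs[j + 1] = std::min(std::min(upper, corner), costs[j]) + 1;
            corner = upper;
        }
    }
    return costs[n];
}

And the file-opening point could look along these lines:

std::ifstream myfile("sentencelist.txt");
if (!myfile.is_open())
{
    std::cerr << "Unable to open file\n";
    return 1; // stop with an error code instead of running on with no data
}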
One way of accelerating the code is enumerating all the words in a preprocessing step. Since the only thing you really care about when comparing two words is whether they are equal or not, you can replace each word with a unique ID.
In order to do so, preprocess your whole dictionary and generate a std::unordered_map<std::string, int> wordsMap which maps each word in the dictionary to some unique integer. Then create a compressed dictionary which consists of sequences of integers (i.e. std::vector<int>) instead of sequences of strings. When you search for a sentence, first replace each word with its ID from wordsMap; if a word is not present there, give it some nonexistent ID (for instance -1). Then you solve the edit-distance problem between two sequences of ints, so every word comparison becomes a cheap integer comparison instead of a string comparison.
Regarding code quality, I suggest avoiding memory allocation/deallocation in the innermost loop. Try to preallocate all the memory, or make your vectors static/global, etc.
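For instance, a sketch of one way to apply this to edit_distance: a static matrix that only ever grows, so after the first few of the roughly 10,000 calls no allocation happens at all.

unsigned int edit_distance(const Sentence &s1, const Sentence &s2)
{
    const std::size_t len1 = s1.size(), len2 = s2.size();
    // Persists between calls and only grows; rows and columns beyond
    // len1/len2 are simply left unused on smaller inputs.
    static std::vector<std::vector<unsigned int>> d;
    if (d.size() < len1 + 1)
        d.resize(len1 + 1);
    for (std::size_t i = 0; i <= len1; ++i)
        if (d[i].size() < len2 + 1)
            d[i].resize(len2 + 1);
    d[0][0] = 0;
    for (unsigned int i = 1; i <= len1; ++i) d[i][0] = i;
    for (unsigned int i = 1; i <= len2; ++i) d[0][i] = i;
    for (unsigned int i = 1; i <= len1; ++i)
        for (unsigned int j = 1; j <= len2; ++j)
        {
            d[i][j] = std::min(d[i - 1][j] + 1, d[i][j - 1] + 1);
            d[i][j] = std::min(d[i][j], d[i - 1][j - 1] + (s1[i - 1] == s2[j - 1] ? 0 : 1));
        }
    return d[len1][len2];
}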
EDIT: Given that you have many sentences to check, I think you should introduce some pruning. Perhaps you should look at this question and implement some data structure for closest-word search.
Here is one simple idea for pruning. You can create, for each word, a list of the sentences it appears in. When searching for a sentence, merge these lists for all the words in the query. As a result you'll have a list of all sentences that have at least one word in common with yours; hopefully it is much smaller than the whole dictionary. Of all the other sentences (which have no common words), it is enough to check only the shortest one: when two sentences share no words, their word-level edit distance is simply the length of the longer one, so the shortest such sentence already achieves the smallest possible distance among them. A sketch follows below.
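A sketch of that inverted index, building on the word IDs above (indexSentence and candidates are invented names; merging via std::set is one possible choice and needs <set>):

// For each word ID, the indices of the dictionary sentences containing it.
std::vector<std::vector<int>> sentencesByWord;

void indexSentence(int sentenceIndex, const IdSentence &ids)
{
    for (size_t k = 0; k < ids.size(); ++k)
    {
        const int id = ids[k];
        if (id >= (int)sentencesByWord.size())
            sentencesByWord.resize(id + 1);
        // A repeated word adds a duplicate index; harmless, the set below deduplicates.
        sentencesByWord[id].push_back(sentenceIndex);
    }
}

// Union of the per-word lists: every sentence sharing at least one word
// with the query. Only these (plus the single shortest non-candidate)
// need a full edit_distance call.
std::set<int> candidates(const IdSentence &query)
{
    std::set<int> result;
    for (size_t k = 0; k < query.size(); ++k)
    {
        const int id = query[k];
        if (id < 0 || id >= (int)sentencesByWord.size())
            continue; // unknown word: appears in no dictionary sentence
        result.insert(sentencesByWord[id].begin(), sentencesByWord[id].end());
    }
    return result;
}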
- Thanks for this - I'll try your wordsMap suggestion. In terms of your second suggestion, I'm not entirely sure what you mean. Could you please elaborate? (Tarrare, Jul 26, 2015 at 13:44)
- @Tarrare In each call of edit_distance the matrix d is created from scratch. It means that len1 + 2 memory allocations are done at the moment of creation. Also, they are deallocated at the end of the call. Both operations take significant time. I suggest making a global matrix of sufficiently large size instead of d. (stgatilov, Jul 26, 2015 at 13:59)
- Cut time by half. Thanks! Do you think I could put the list of sentences in a hash table to speed things up? (Tarrare, Jul 26, 2015 at 14:57)
- @Tarrare Added some more information regarding pruning. (stgatilov, Jul 26, 2015 at 15:14)