I am trying to recognize a sentence that I have read through optical character recognition software. This code will eventually run on a Raspberry Pi.
I know for certain that it's meant to be one of a few thousand sentences which I have written down in a file (sentencelist.txt), but the character recognition has messed up a few of the words.
For example:
The quick brown fox jumps over the lazy dog
has been read as
He quick broom fax jumps over the lazy dig
I want to compare the incorrect sentence to every sentence in my list, and then figure out which one it's meant to be.
I currently have a working program, but it is just too slow! I have about 10,000 entries in my sentence file, and the whole process takes over a minute.
I initially used the Levenshtein algorithm to compare, but am now using a different algorithm which compares words rather than characters to speed it up.
How can I speed this baby up?
// C++ source for sentence lookup from an OCR string. In main, you can set the distance calculator to be used.
#include "stdafx.h"
#include <iterator>
#include <string>
#include <vector>
#include <list>
#include <iostream>
#include <fstream>
#include <sstream>
#include <algorithm>
using namespace std;
// Levenshtein distance (character level) - callable from main.
// Single-row dynamic programming: only the previous row of the DP matrix is kept.
size_t LevenshteinDistance(const std::string &s1, const std::string &s2)
{
    const size_t m(s1.size());
    const size_t n(s2.size());
    if (m == 0) return n;
    if (n == 0) return m;
    size_t *costs = new size_t[n + 1];
    for (size_t k = 0; k <= n; k++) costs[k] = k;
    size_t i = 0;
    for (std::string::const_iterator it1 = s1.begin(); it1 != s1.end(); ++it1, ++i)
    {
        costs[0] = i + 1;
        size_t corner = i; // d[i-1][j-1] from the previous row
        size_t j = 0;
        for (std::string::const_iterator it2 = s2.begin(); it2 != s2.end(); ++it2, ++j)
        {
            size_t upper = costs[j + 1]; // d[i-1][j]
            if (*it1 == *it2)
            {
                costs[j + 1] = corner;
            }
            else
            {
                size_t t(upper < corner ? upper : corner);
                costs[j + 1] = (costs[j] < t ? costs[j] : t) + 1;
            }
            corner = upper;
        }
    }
    size_t result = costs[n];
    delete[] costs;
    return result;
}
// Edit distance (Levenshtein distance over words) - callable from main.
typedef std::vector<std::string> Sentence;

// Split a string on a delimiter into a vector of words.
Sentence &split(const std::string &s, char delim, Sentence &elems) {
    std::stringstream ss(s);
    std::string item;
    while (std::getline(ss, item, delim)) {
        elems.push_back(item);
    }
    return elems;
}

Sentence split(const std::string &s, char delim) {
    Sentence elems;
    split(s, delim, elems);
    return elems;
}
unsigned int edit_distance(const Sentence& s1, const Sentence& s2)
{
    const std::size_t len1 = s1.size(), len2 = s2.size();
    // Full (len1+1) x (len2+1) dynamic-programming matrix, rebuilt on every call.
    std::vector<std::vector<unsigned int>> d(len1 + 1, std::vector<unsigned int>(len2 + 1));
    d[0][0] = 0;
    for (unsigned int i = 1; i <= len1; ++i) d[i][0] = i;
    for (unsigned int i = 1; i <= len2; ++i) d[0][i] = i;
    for (unsigned int i = 1; i <= len1; ++i)
        for (unsigned int j = 1; j <= len2; ++j)
        {
            d[i][j] = std::min(d[i - 1][j] + 1, d[i][j - 1] + 1);
            d[i][j] = std::min(d[i][j], d[i - 1][j - 1] + (s1[i - 1] == s2[j - 1] ? 0 : 1));
        }
    return d[len1][len2];
}
int main()
{
    // Tesseract outputs the recognized sentence. We call this string s1.
    string s1 = "He quick broom fax jumps over the lazy dig";
    // String s2 is our comparison string, read from the sentence database.
    string s2;
    // Split the strings into vectors of words.
    Sentence sen1 = split(s1, ' ');
    Sentence sen2;
    int i = 0; // Counter
    int num = 13309; // Number of entries in sentence file
    string sentencevec[13309]; // Array where each entry is a sentence string from the file
    int distancevec[13309]; // Array where each entry is the distance between s1 and the corresponding sentence
    // Read from the sentence file and populate the arrays above with data.
    string line;
    ifstream myfile("sentencelist.txt");
    if (myfile.is_open())
    {
        while (getline(myfile, line))
        {
            string s2 = line;
            sen2 = split(s2, ' ');
            i++;
            distancevec[i - 1] = edit_distance(sen1, sen2); // Here is where you set the distance calculator. Currently the word-level edit_distance; swap in LevenshteinDistance(s1, s2) to compare characters instead.
            sentencevec[i - 1] = s2;
        }
        myfile.close();
    }
    else cout << "Unable to open file";
    // Find the smallest entry in distancevec.
    int smallestsentence = 0; // Index into sentencevec of the matched sentence
    int smallestdistance = distancevec[0]; // Distance to the matched sentence
    for (i = 0; i < num; i++) {
        if (distancevec[i] < smallestdistance) {
            smallestdistance = distancevec[i];
            smallestsentence = i;
        }
    }
    // Print the recognized sentence and pause so the program doesn't immediately exit.
    cout << '\n' << '\n' << sentencevec[smallestsentence] << '\n' << '\n';
    system("pause");
    return 0;
}
- As we all want to make our code more efficient or improve it in one way or another, try to write a title that summarizes what your code does, not what you want to get out of a review. For examples of good titles, check out Best of Code Review 2014 - Best Question Title Category. You may also want to read How to get the best value out of Code Review - Asking Questions. (Mathieu Guindon, Jul 26, 2015 at 13:20)
- Cross-posted from Stack Overflow. (Mathieu Guindon, Jul 26, 2015 at 13:25)
- The first step in speeding up any piece of code is to profile it using a tool that will help you identify the bottlenecks. Have you done that? What were the results? Are the bottlenecks in places you can do something about? (nhgrif, Jul 26, 2015 at 13:58)
2 Answers
- I'm not sure if the manually-allocated array in LevenshteinDistance() is necessary when you can probably still use std::vector (a sketch follows after this list). If this is still necessary, then you could consider accounting for failure with new if it could help make your program more robust. Still, having to do something like this suggests that you should avoid manual memory allocation as much as possible in C++.
- Try to avoid single-character variable names, unless they're loop counters. The names m and n may work in a mathematical sense, but you shouldn't assume that the user (or even you in several years) will not be confused by this.
- Consider better names than s1 and s2, based on their intended uses (you already have info on this in some comments in main()).
- Keep your whitespace use consistent and don't add it where unnecessary. For instance, there's a lot of excess whitespace before and within main() for some reason. This does nothing to improve readability.
- This is quite unclear and is practically a magic number:
  int num = 13309;
  The comment next to it describes this variable, so just rename it as such, and make it const since it's a constant. If you need a comment to describe something like this, then you may need to rethink your naming. Comments should mostly be needed for more complex explanations.
- You attempt to open a file for reading, but only display an error on failure and continue executing the program. You should instead terminate the program with a valid error code after displaying the error. The error itself should also be printed to std::cerr instead of std::cout.
- Prefer a better alternative to a "pause" when running the code through certain IDEs, such as:
  std::cin.get();
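To illustrate the std::vector point, here is a minimal sketch; the question's single-row algorithm is unchanged, only the manual new/delete is gone:

size_t LevenshteinDistance(const std::string &s1, const std::string &s2)
{
    const size_t m = s1.size();
    const size_t n = s2.size();
    if (m == 0) return n;
    if (n == 0) return m;
    std::vector<size_t> costs(n + 1); // released automatically, even if an exception is thrown
    for (size_t k = 0; k <= n; ++k) costs[k] = k;
    for (size_t i = 0; i < m; ++i)
    {
        costs[0] = i + 1;
        size_t corner = i; // d[i-1][j-1] from the previous row
        for (size_t j = 0; j < n; ++j)
        {
            size_t upper = costs[j + 1]; // d[i-1][j]
            if (s1[i] == s2[j])
                costs[j + 1] = corner;
            else
                costs[j + 1] = std::min(std::min(upper, corner), costs[j]) + 1;
            corner = upper;
        }
    }
    return costs[n];
}

And the file-opening point could look along these lines:

std::ifstream myfile("sentencelist.txt");
if (!myfile.is_open())
{
    std::cerr << "Unable to open file\n";
    return 1; // stop with an error code instead of running on with no data
}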
One way of accelerating the code is enumerating all the words in a preprocessing step. Since the only thing you really care about when comparing two words is whether they are equal or not, you can replace each word with a unique ID.
In order to do so, preprocess your whole dictionary and generate a std::unordered_map<std::string, int> wordsMap which maps each word in the dictionary to some unique integer. Then create a compressed dictionary which consists of sequences of integers (i.e. std::vector<int>) instead of sequences of strings. When you search for a sentence, first replace each word with its ID from wordsMap; if a word is not present there, give it some nonexistent ID (for instance -1). Then you solve the edit-distance problem between two sequences of ints, so every word comparison becomes a cheap integer comparison instead of a string comparison.
Regarding code quality, I suggest avoiding memory allocation/deallocation in the innermost loop. Try to preallocate all the memory, or make your vectors static/global, etc.
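For instance, a sketch of one way to apply this to edit_distance: a static matrix that only ever grows, so after the first few of the roughly 10,000 calls no allocation happens at all.

unsigned int edit_distance(const Sentence &s1, const Sentence &s2)
{
    const std::size_t len1 = s1.size(), len2 = s2.size();
    // Persists between calls and only grows; rows and columns beyond
    // len1/len2 are simply left unused on smaller inputs.
    static std::vector<std::vector<unsigned int>> d;
    if (d.size() < len1 + 1)
        d.resize(len1 + 1);
    for (std::size_t i = 0; i <= len1; ++i)
        if (d[i].size() < len2 + 1)
            d[i].resize(len2 + 1);
    d[0][0] = 0;
    for (unsigned int i = 1; i <= len1; ++i) d[i][0] = i;
    for (unsigned int i = 1; i <= len2; ++i) d[0][i] = i;
    for (unsigned int i = 1; i <= len1; ++i)
        for (unsigned int j = 1; j <= len2; ++j)
        {
            d[i][j] = std::min(d[i - 1][j] + 1, d[i][j - 1] + 1);
            d[i][j] = std::min(d[i][j], d[i - 1][j - 1] + (s1[i - 1] == s2[j - 1] ? 0 : 1));
        }
    return d[len1][len2];
}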
EDIT: Given that you have many sentences to check, I think you should introduce some pruning. Perhaps you should look at this question and implement some data structure for closest-word search.
Here is one simple idea for pruning. You can create, for each word, a list of the sentences it appears in. When searching for a sentence, merge these lists for all the words in the query. As a result you'll have a list of all sentences that have at least one word in common with yours; hopefully it is much smaller than the whole dictionary. Of all the other sentences (which have no common words), it is enough to check only the shortest one: when two sentences share no words, their word-level edit distance is simply the length of the longer one, so the shortest such sentence already achieves the smallest possible distance among them. A sketch follows below.
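A sketch of that inverted index, building on the word IDs above (indexSentence and candidates are invented names; merging via std::set is one possible choice and needs <set>):

// For each word ID, the indices of the dictionary sentences containing it.
std::vector<std::vector<int>> sentencesByWord;

void indexSentence(int sentenceIndex, const IdSentence &ids)
{
    for (size_t k = 0; k < ids.size(); ++k)
    {
        const int id = ids[k];
        if (id >= (int)sentencesByWord.size())
            sentencesByWord.resize(id + 1);
        // A repeated word adds a duplicate index; harmless, the set below deduplicates.
        sentencesByWord[id].push_back(sentenceIndex);
    }
}

// Union of the per-word lists: every sentence sharing at least one word
// with the query. Only these (plus the single shortest non-candidate)
// need a full edit_distance call.
std::set<int> candidates(const IdSentence &query)
{
    std::set<int> result;
    for (size_t k = 0; k < query.size(); ++k)
    {
        const int id = query[k];
        if (id < 0 || id >= (int)sentencesByWord.size())
            continue; // unknown word: appears in no dictionary sentence
        result.insert(sentencesByWord[id].begin(), sentencesByWord[id].end());
    }
    return result;
}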
- Thanks for this - I'll try your wordsMap suggestion. In terms of your second suggestion, I'm not entirely sure what you mean. Could you please elaborate? (Tarrare, Jul 26, 2015 at 13:44)
- @Tarrare In each call of edit_distance the matrix d is created from scratch. It means that len1 + 2 memory allocations are done at the moment of creation. Also, they are deallocated at the end of the call. Both operations take significant time. I suggest making a global matrix of sufficiently large size instead of d. (stgatilov, Jul 26, 2015 at 13:59)
- Cut time by half. Thanks! Do you think I could put the list of sentences in a hash table to speed things up? (Tarrare, Jul 26, 2015 at 14:57)
- @Tarrare Added some more information regarding pruning. (stgatilov, Jul 26, 2015 at 15:14)