My text comparison algorithm works very slow

Question 1

I tried to create an algorithm (a Translation Memory) that finds similarities between texts and tries to predict the closest possible translation of a text, but the algorithm's running time is a big problem.

Example timings on a 2.6 GHz processor:

Matching 250 source sections against 250 target sections takes 3.36 seconds.
Matching 7000 source sections against 250 target sections takes 539.78 seconds.
Matching 7000 source sections against 7000 target sections takes 35042.76 seconds.

The match function loops over the sentences in the source array. For each source sentence, it then loops over the target arrays and searches them for the highest possible match.

This is the matching function:

    public function match($source,$target,$limit){
        if(isset($source) && is_array($source) && count($source) > 0){
            if(isset($target) && is_array($target) && count($target) > 0){
                $match = array();
                $array = array();
                foreach($source as $k => $v){
                    $v = trim(preg_replace('/'.$this->regex.'/', null, $v));
                    $similarity = $this->fetch($v,$target,$limit);
                    if($similarity[0]){
                        $array[$k]['data'] = $similarity[1]['data'];
                        $array[$k]['match'] = $similarity[1]['match'];
                        $match[] = $similarity[1]['match'];
                    }else{
                        $array[$k]['data'] = null;
                        $array[$k]['match'] = null;
                    }
                    unset($source[$k]);
                }
                if(isset($match) && is_array($match) && count($match) > 0){
                    return array(true,$array,round((array_sum($match)/count($match)),2));
                }else{
                    return array(true,$array,0);
                }
            }else{
                return array(false,'source memory is empty');
            }
        }else{
            return array(false,'source word is empty');
        }
    }

This is the fetching function:

    private function fetch($text,$array,$limit){
        $status = false;
        $match = 0;
        $data = null;
        foreach($array as $k1 => $v1){
            foreach($v1 as $k2 => $v2){
                $old = $match;
                $match = $this->similarity($text, $v2['source']);
                if($match == 100){
                    $data = $v2['target'];
                    break;
                }elseif($match > $old){
                    $data = $v2['target'];
                }
                unset($array[$k1][$k2]);
            }
            unset($array[$k1]);
        }
        if($match >= $limit){
            $status = true;
        }
        return array($status,array('data'=>$data,'match'=>$match));
    }
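
One detail in fetch as posted may be worth checking: $match is overwritten on every iteration and compared only with the previous score, so the value returned appears to be the score of the last entry compared rather than the best one, and break leaves only the inner loop. Below is a minimal sketch of a variant that tracks the best score. The names fetchBest and $similarity are hypothetical, introduced here for illustration, and the sketch assumes the same nested array shape (groups of entries with 'source' and 'target' keys).

```php
<?php
// Hypothetical variant of fetch(): remembers the best score seen so far.
// $similarity is any callable taking (string, string) and returning 0..100.
function fetchBest(string $text, array $memory, float $limit, callable $similarity): array
{
    $best = 0.0;
    $data = null;
    foreach ($memory as $group) {
        foreach ($group as $entry) {
            $score = $similarity($text, $entry['source']);
            if ($score > $best) {
                $best = $score;
                $data = $entry['target'];
                if ($best >= 100) {
                    break 2; // exact match: leave both loops
                }
            }
        }
    }
    return array($best >= $limit, array('data' => $data, 'match' => $best));
}
```

The sketch also drops the unset() calls inside the loops. They modify the by-value $array parameter, which may force PHP to copy the array, so removing them could help the running time, though that would need measuring.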

The matching algorithm calculates the similarity distance by breaking the sentence up into words, just as the Levenshtein (matrix distance) algorithm breaks a word into letters. It then expresses this distance as a percentage. For example, "I am going to school today" compared with "I am going home today" scores about 67%: two word edits out of six source words.
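
The word-level distance described above can be sketched in a compact form. This is a sketch under assumptions, not the posted implementation: wordDistance and wordSimilarity are hypothetical names, both word arrays are assumed to be 0-indexed lists, and only two rows of the matrix are kept in memory instead of the full table.

```php
<?php
// Word-level edit distance; keeps two rows instead of the full matrix.
// Assumes $a and $b are 0-indexed lists of words.
function wordDistance(array $a, array $b): int
{
    $prev = range(0, count($b));
    foreach ($a as $i => $wordA) {
        $curr = array($i + 1);
        foreach ($b as $j => $wordB) {
            $cost = ($wordA === $wordB) ? 0 : 1;
            $curr[] = min($prev[$j + 1] + 1, $curr[$j] + 1, $prev[$j] + $cost);
        }
        $prev = $curr;
    }
    return $prev[count($b)];
}

// Same percentage formula as the posted similarity():
// 100 - 100 * distance / number of source words.
function wordSimilarity(string $source, string $target): float
{
    $a = preg_split('/\s+/u', mb_strtolower(trim($source), 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY);
    $b = preg_split('/\s+/u', mb_strtolower(trim($target), 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY);
    if (count($a) === 0) {
        return 0.0;
    }
    return 100 - (100 * wordDistance($a, $b)) / count($a);
}

echo round(wordSimilarity('I am going to school today', 'I am going home today'), 2), PHP_EOL; // 66.67
```

Working the formula through on this pair: two word edits (drop "to", replace "school" with "home") over six source words gives 100 - 200/6, which rounds to 66.67.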

This is the similarity function:

    private function similarity($source,$target){
        $source = mb_strtolower($source,'UTF-8');
        $target = mb_strtolower($target,'UTF-8');
        $sourceAR = preg_split('/[\s]+/ui', $source);
        $targetAR = preg_split('/[\s]+/ui', $target);
        $sourceCount = count($sourceAR);
        $targetCount = count($targetAR);
        $difference = 0;
        $matrix = array_fill(0, $sourceCount + 1, array_fill(0, $targetCount + 1, 0));
        for ($i = 1; $i < $sourceCount + 1; $i++){
            $matrix[$i][0] = $i;
        }
        for ($j = 1; $j < $targetCount + 1; $j++){
            $matrix[0][$j] = $j;
        }
        /* $i = column / x axis */
        for ($i = 1; $i <= $sourceCount; $i++){
            /* $j = row / y axis */
            for ($j = 1; $j <= $targetCount; $j++){
                /* calculation of cost (not match , missing/more) */
                if($sourceAR[$i - 1] == $targetAR[$j - 1]){
                    $c = 0;
                }else{
                    $c = 1;
                }
                /* calculation of cost (not match , missing/more) */
                $matrix[$i][$j] = min($matrix[$i - 1][$j] + 1, $matrix[$i][$j - 1] + 1, $matrix[$i - 1][$j - 1] + $c);
                $difference = $matrix[$i][$j];
            }
            /* $j = row / y axis */
        }
        /* $i = column / x axis */
        return (100 - ((100*$difference)/$sourceCount));
    }
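
Because the score depends on word counts, some comparisons can be ruled out before the matrix is built. The edit distance between a sentence of n words and one of m words is at least |n - m|, so the posted formula can never exceed 100 - 100 * |n - m| / n. The sketch below illustrates that bound. The function names similarityUpperBound and worthComparing are hypothetical, and whether the pruning pays off on real data is an assumption that would need measuring.

```php
<?php
// Upper bound on the posted similarity score, from word counts alone.
// The edit distance between n and m words is at least |n - m|.
function similarityUpperBound(int $sourceCount, int $targetCount): float
{
    if ($sourceCount === 0) {
        return 0.0;
    }
    return 100 - (100 * abs($sourceCount - $targetCount)) / $sourceCount;
}

// Hypothetical helper: true only if this target could still beat $bestSoFar,
// so the full word-by-word comparison is worth running.
function worthComparing(array $sourceWords, array $targetWords, float $bestSoFar): bool
{
    return similarityUpperBound(count($sourceWords), count($targetWords)) > $bestSoFar;
}
```

Combined with splitting and lowercasing each target sentence once up front rather than once per source sentence, a check like this could be called before each similarity() call to skip targets whose length alone rules them out.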

Since the algorithm is written in PHP, the solution must also work in PHP.

For each source sentence, the output of the match must give the target sentence with the highest match and the percentage of that match.

Question 2

This hurts my eyes. There's a built-in function for this that I'm sure is a heck of a lot faster: similar_text(). There's a built-in function for Levenshtein distance as well: levenshtein().
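
A minimal sketch of the two built-ins mentioned in this comment, applied to whole sentences. Note the assumption involved: both functions compare characters (bytes) rather than words, so their results are not equivalent to the word-level percentage in the question, and before PHP 8.0 levenshtein() only accepted strings of up to 255 characters.

```php
<?php
$a = 'I am going to school today';
$b = 'I am going home today';

// Character-level edit distance (operates on bytes, not words).
$distance = levenshtein($a, $b);

// similar_text() returns the number of matching characters and
// writes a similarity percentage into the third argument.
$common = similar_text($a, $b, $percent);

printf("levenshtein: %d, similar_text: %d chars, %.2f%%\n", $distance, $common, $percent);
```

Whether either one is faster in practice depends on sentence length. The PHP manual describes similar_text() as O(N**3) in the length of the longest string, so it is worth benchmarking on real data.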

Question 3

How do you intend to treat punctuation? Your preg_split() pattern splits on whitespace characters only (the character class brackets and the i flag are not necessary), so all punctuation stays attached to its neighboring word. Is this satisfactory, or do you also want to strip these characters off the words?
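
If stripping is wanted, one possible approach is sketched below. The function name tokenize is hypothetical, and the choice to remove only leading and trailing punctuation (so that an apostrophe inside "don't" survives) is an assumption about the desired behavior, not something stated in the question.

```php
<?php
// Hypothetical tokenizer: lowercase, split on whitespace, then strip
// punctuation from the start and end of each word.
function tokenize(string $text): array
{
    $words = preg_split('/\s+/u', mb_strtolower(trim($text), 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY);
    $tokens = array();
    foreach ($words as $word) {
        $word = preg_replace('/^\p{P}+|\p{P}+$/u', '', $word);
        if ($word !== '') {
            $tokens[] = $word;
        }
    }
    return $tokens;
}

print_r(tokenize("Hello, world! Don't stop."));
```

With this in place, "today." and "today" compare as equal words instead of counting as a mismatch.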

Question 4

PHP has a couple of phonetic functions that seem worth using here: either soundex() or metaphone(), combined with similar_text() or levenshtein().

soundex(): calculates the soundex key of a string. https://www.php.net/manual/en/function.soundex.php e.g.

    soundex("Euler") == soundex("Ellery"); // E460

metaphone(): calculates the metaphone key of a string. It is based on English pronunciation rules, so it is more precise than soundex(), but of limited use on multilingual sites. https://www.php.net/manual/en/function.metaphone.php e.g.

    var_dump(metaphone('programming'));
    string(7) "PRKRMNK"

For example, you could use metaphone() together with levenshtein() to compare words. The levenshtein() function measures the minimum number of characters you have to replace, insert, or delete to transform one string into another. https://www.php.net/manual/en/function.levenshtein

With those functions you could build a good matching function.
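
A minimal sketch of that combination, assuming single English words as input. The helper name phoneticDistance is hypothetical. It compares the metaphone keys of two words rather than their spelling.

```php
<?php
// Hypothetical helper: edit distance between the metaphone keys of two words.
function phoneticDistance(string $a, string $b): int
{
    return levenshtein(metaphone($a), metaphone($b));
}

var_dump(soundex('Euler') === soundex('Ellery')); // bool(true), both are E460
var_dump(phoneticDistance('Thompson', 'Thomson'));
```

A distance of 0 means the two words share the same phonetic key. Small values suggest similar pronunciation, which may or may not be what a translation memory wants.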

All that being said, the best solution would likely use machine learning to rank results and be built outside PHP, for example with a framework such as TensorFlow or with some of the tools offered by AWS and similar providers.

I hope this helps!

hxtree · Answer 1 · 2020-01-21 00:59:33Z
