I tried to create an algorithm (a Translation Memory) that finds similarities between texts and tries to predict the closest possible translation of a text, but its running time is a big problem.
Example timings on a 2.6 GHz processor:
A match for "250(source) * 250(target) section" takes 3.36 seconds.
A match for "7000(source) * 250(target) section" takes 539.78 seconds.
A match for "7000(source) * 7000(target) section" takes 35042.76 seconds.
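To put these figures in perspective, the total time can be divided by the number of sentence pairs compared. This is only a rough sketch: it assumes each run compares every source sentence with every target entry (so 250 * 250 = 62,500 pairs, and so on), and microsecondsPerPair() is a made-up helper name.

```php
<?php
// Hypothetical helper: average time per sentence pair, in microseconds.
// Assumes every source sentence is compared with every target entry.
function microsecondsPerPair(int $sources, int $targets, float $seconds): float
{
    return round($seconds / ($sources * $targets) * 1e6, 2);
}

echo microsecondsPerPair(250, 250, 3.36), "\n";        // 53.76
echo microsecondsPerPair(7000, 250, 539.78), "\n";     // 308.45
echo microsecondsPerPair(7000, 7000, 35042.76), "\n";  // 715.16
```

Under that assumption the cost per pair is not constant: it grows from roughly 54 to roughly 715 microseconds, so the slowdown is worse than the pair count alone would predict.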
The match function loops over the sentences in the source array. For each source sentence, it then walks through the target arrays and searches them for the entry with the highest possible match.
This is the matching function:
public function match($source, $target, $limit){
    if(isset($source) && is_array($source) && count($source) > 0){
        if(isset($target) && is_array($target) && count($target) > 0){
            $match = array(); /* percentages of the accepted matches */
            $array = array(); /* best result for each source key */
            foreach($source as $k => $v){
                /* strip unwanted characters ('' instead of null avoids a deprecation notice in PHP 8.1+) */
                $v = trim(preg_replace('/'.$this->regex.'/', '', $v));
                $similarity = $this->fetch($v, $target, $limit);
                if($similarity[0]){
                    $array[$k]['data'] = $similarity[1]['data'];
                    $array[$k]['match'] = $similarity[1]['match'];
                    $match[] = $similarity[1]['match'];
                }else{
                    $array[$k]['data'] = null;
                    $array[$k]['match'] = null;
                }
                unset($source[$k]);
            }
            /* third element: average percentage over the accepted matches */
            if(isset($match) && is_array($match) && count($match) > 0){
                return array(true, $array, round((array_sum($match)/count($match)), 2));
            }else{
                return array(true, $array, 0);
            }
        }else{
            return array(false, 'source memory is empty');
        }
    }else{
        return array(false, 'source word is empty');
    }
}
This is the fetching function:
private function fetch($text, $array, $limit){
    $status = false;
    $match = 0;   /* best similarity found so far */
    $data = null; /* target text of the best entry so far */
    foreach($array as $k1 => $v1){
        foreach($v1 as $k2 => $v2){
            $current = $this->similarity($text, $v2['source']);
            /* compare against the best score so far, not the previous entry's score */
            if($current > $match){
                $match = $current;
                $data = $v2['target'];
            }
            if($match == 100){
                break 2; /* exact match: leave both loops */
            }
            unset($array[$k1][$k2]);
        }
        unset($array[$k1]);
    }
    if($match >= $limit){
        $status = true;
    }
    return array($status, array('data' => $data, 'match' => $match));
}
The matching algorithm calculates the similarity distance by breaking the sentence into words (just as the Levenshtein matrix distance algorithm breaks a word into letters). It then expresses this distance as a percentage of the source word count (for example, the sentence "I am going to school today" is about 67% similar to "I am going home today": two word edits out of six source words).
This is the similarity function:
private function similarity($source, $target){
    $source = mb_strtolower($source, 'UTF-8');
    $target = mb_strtolower($target, 'UTF-8');
    /* split both sentences into words on whitespace */
    $sourceAR = preg_split('/[\s]+/ui', $source);
    $targetAR = preg_split('/[\s]+/ui', $target);
    $sourceCount = count($sourceAR);
    $targetCount = count($targetAR);
    $difference = 0;
    /* first index = source word position, second index = target word position */
    $matrix = array_fill(0, $sourceCount + 1, array_fill(0, $targetCount + 1, 0));
    for ($i = 1; $i < $sourceCount + 1; $i++){
        $matrix[$i][0] = $i;
    }
    for ($j = 1; $j < $targetCount + 1; $j++){
        $matrix[0][$j] = $j;
    }
    for ($i = 1; $i <= $sourceCount; $i++){
        for ($j = 1; $j <= $targetCount; $j++){
            /* substitution cost: 0 if the words match, 1 otherwise */
            if($sourceAR[$i - 1] == $targetAR[$j - 1]){
                $c = 0;
            }else{
                $c = 1;
            }
            /* cheapest of deletion, insertion and substitution */
            $matrix[$i][$j] = min($matrix[$i - 1][$j] + 1, $matrix[$i][$j - 1] + 1, $matrix[$i - 1][$j - 1] + $c);
            $difference = $matrix[$i][$j];
        }
    }
    /* after the loops, $difference holds the full word-level edit distance */
    return (100 - ((100*$difference)/$sourceCount));
}
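As a worked example of the word-level distance described above, here is a minimal standalone sketch. wordSimilarity() is a hypothetical name, not part of the original class, and it keeps only two rows of the matrix instead of the full table; the scoring is otherwise meant to mirror the function above (distance divided by the source word count).

```php
<?php
// Standalone sketch of a word-level Levenshtein similarity.
// wordSimilarity() is a hypothetical helper, not the original method.
function wordSimilarity(string $source, string $target): float
{
    $a = preg_split('/\s+/u', mb_strtolower(trim($source), 'UTF-8'));
    $b = preg_split('/\s+/u', mb_strtolower(trim($target), 'UTF-8'));

    // Row for the empty source prefix: j insertions reach j target words.
    $prev = range(0, count($b));
    foreach ($a as $i => $wordA) {
        // First column: deleting the first $i + 1 source words.
        $curr = [$i + 1];
        foreach ($b as $j => $wordB) {
            $cost = ($wordA === $wordB) ? 0 : 1;
            $curr[] = min(
                $prev[$j + 1] + 1, // delete a source word
                $curr[$j] + 1,     // insert a target word
                $prev[$j] + $cost  // keep or substitute
            );
        }
        $prev = $curr;
    }
    $distance = $prev[count($b)];

    return 100 - (100 * $distance) / count($a);
}

// Two word edits (drop "to", swap "school" for "home") over six source words.
echo round(wordSimilarity('I am going to school today', 'I am going home today'), 2), "\n"; // 66.67
```

Because the distance is divided by the source word count only, the score is not symmetric and can drop below zero when the target is much longer than the source.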
Since the algorithm is written in PHP, the solution must also work in PHP.
The output of the match must give, for each source sentence, the target sentence with the highest match and the percentage of that match.
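The required output can be pictured with a small sketch. Everything here is illustrative: bestMatches() is a hypothetical function, the memory is a flat list rather than the nested arrays used above, the sample entries are invented, and PHP's built-in similar_text() stands in for the word-level scorer.

```php
<?php
// Hypothetical sketch: best translation-memory hit for each source sentence.
// similar_text() is a stand-in for the word-level similarity in the question.
function bestMatches(array $sources, array $memory): array
{
    $results = [];
    foreach ($sources as $key => $sentence) {
        $best = ['data' => null, 'match' => 0.0];
        foreach ($memory as $entry) {
            similar_text($sentence, $entry['source'], $percent);
            if ($percent > $best['match']) {
                $best = ['data' => $entry['target'], 'match' => $percent];
            }
        }
        $results[$key] = $best;
    }
    return $results;
}

// Invented sample data: placeholder strings instead of real translations.
$memory = [
    ['source' => 'I am going home today', 'target' => 'translation A'],
    ['source' => 'The weather is nice',   'target' => 'translation B'],
];

print_r(bestMatches(['I am going home today', 'The weather is fine'], $memory));
```

Each result carries the best target text under 'data' and its percentage under 'match', which is the same pair of keys the match function above fills in.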
PHP has a couple of phonetic functions that seem well suited to this: either soundex() or metaphone(), combined with similar_text() or levenshtein().
soundex() - Soundex value of a string. https://www.php.net/manual/en/function.soundex.php e.g.
soundex("Euler") == soundex("Ellery"); // E460
metaphone() - Metaphone key of a string. It is based on English pronunciation rules, so it is more precise than the soundex() function but of limited use on multilingual sites. https://www.php.net/manual/en/function.metaphone.php e.g.
var_dump(metaphone('programming'));
string(7) "PRKRMNK"
For example, you could use metaphone() with the levenshtein() function to compare words. The levenshtein() function measures the minimum number of characters that must be replaced, inserted, or deleted to transform one string into another. https://www.php.net/manual/en/function.levenshtein
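A minimal sketch of that combination might look like the following. phoneticDistance() is a hypothetical helper name; soundex(), metaphone() and levenshtein() are the built-in PHP functions mentioned above.

```php
<?php
// Sketch: compare two words by the edit distance between their metaphone keys.
// phoneticDistance() is a hypothetical helper, not a PHP built-in.
function phoneticDistance(string $a, string $b): int
{
    return levenshtein(metaphone($a), metaphone($b));
}

// Both names share the soundex code E460, as in the example above.
var_dump(soundex('Euler') === soundex('Ellery')); // bool(true)

// A distance of 0 means the two words have the same metaphone key.
var_dump(phoneticDistance('school', 'school'));   // int(0)
var_dump(phoneticDistance('school', 'today'));
```

A lower distance means the words sound more alike under English pronunciation rules, so this is only a reasonable fit when the source text is English.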
With those functions you could build a good matching function.
All that being said, the best solution would likely use machine learning to rank the results, built outside PHP with a framework such as TensorFlow or with a number of AWS (or similar) tools.
I hope this helps!
The preg_split() pattern splits on whitespace characters (the character class brackets and the i flag are not necessary), so all punctuation stays stuck to its neighboring word. Is this satisfactory, or do you also want to strip these characters off of the words?