[Python-Dev] Why is soundex marked obsolete?

Eric S. Raymond esr@thyrsus.com
2001年1月14日 14:09:01 -0500


Tim Peters <tim.one@home.com>:
> > I think you've just made an argument for replacing your
> > SequenceMatcher with simil.ratcliff.
>> Actually, I'm certain they're the same algorithm now, except the C is
> showing through in ratcliff to the floating-point eye <wink>.

Take a look:
/*****************************************************************************
 *
 * Ratcliff-Obershelp common-subpattern similarity.
 *
 * This code first appeared in a letter to the editor in Doctor
 * Dobbs's Journal, 11/1988. The original article on the algorithm,
 * "Pattern Matching by Gestalt" by John Ratcliff, had appeared in the
 * July 1988 issue (#181) but the algorithm was presented in assembly.
 * The main drawback of the Ratcliff-Obershelp algorithm is the cost
 * of the pairwise comparisons. It is significantly more expensive
 * than stemming, Hamming distance, soundex, and the like.
 *
 * Running time quadratic in the data size, memory usage constant.
 *
 *****************************************************************************/
static int RatcliffObershelp(char *st1, char *end1, char *st2, char *end2)
{
 register char *a1, *a2;
 char *b1, *b2; 
 char *s1 = st1, *s2 = st2;	/* initializations are just to pacify GCC */
 short max, i;
 if (end1 <= st1 || end2 <= st2)
	return(0);
 if (end1 == st1 + 1 && end2 == st2 + 1)
	return(0);
		
 max = 0;
 b1 = end1; b2 = end2;
	
 for (a1 = st1; a1 < b1; a1++)
 {
	for (a2 = st2; a2 < b2; a2++)
	{
	 if (*a1 == *a2)
	 {
		/* determine length of common substring */
		for (i = 1; a1[i] && (a1[i] == a2[i]); i++) 
		 continue;
		if (i > max)
		{
		 max = i; s1 = a1; s2 = a2;
		 b1 = end1 - max; b2 = end2 - max;
		}
	 }
	}
 }
 if (!max)
	return(0);
 max += RatcliffObershelp(s1 + max, end1, s2 + max, end2);	/* rhs */
 max += RatcliffObershelp(st1, s1, st2, s2);			/* lhs */
 return max;
}
static float ratcliff(char *s1, char *s2)
/* compute Ratcliff-Obershelp similarity of two strings */
{
 short l1, l2;
 l1 = strlen(s1);
 l2 = strlen(s2);
	
 /* exact match end-case */
 if (l1 == 1 && l2 == 1 && *s1 == *s2)
	return(1.0);
			
 return 2.0 * RatcliffObershelp(s1, s1 + l1, s2, s2 + l2) / (l1 + l2);
}
static PyObject *
simil_ratcliff(PyObject *self, PyObject *args)
{
 char *str1, *str2;
 
 if(!PyArg_ParseTuple(args, "ss:ratcliff", &str1, &str2))
 return NULL;
 return Py_BuildValue("f", ratcliff(str1, str2));
}
-- 
		<a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>
"Taking my gun away because I might shoot someone is like cutting my tongue
out because I might yell `Fire!' in a crowded theater."
 -- Peter Venetoklis

AltStyle によって変換されたページ (->オリジナル) /