Commit 4946f58

Merge pull request #55 from denmase/master
Implementation of Ratcliff-Obershelp algorithm
2 parents eeb33dc + f6c7aad commit 4946f58

3 files changed

+291
-0
lines changed

‎README.md

Lines changed: 34 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -22,6 +22,7 @@ A library implementing different string similarity and distance measures. A doze
2222
* [Cosine similarity](#shingle-n-gram-based-algorithms)
2323
* [Jaccard index](#shingle-n-gram-based-algorithms)
2424
* [Sorensen-Dice coefficient](#shingle-n-gram-based-algorithms)
25+
* [Ratcliff-Obershelp](#ratcliff-obershelp)
2526
* [Experimental](#experimental)
2627
* [SIFT4](#sift4)
2728
* [Users](#users)
@@ -58,6 +59,7 @@ The main characteristics of each implemented algorithm are presented below. The
5859
| [Cosine similarity](#cosine-similarity) |similarity<br>distance | Yes | No | Profile | O(m+n) | |
5960
| [Jaccard index](#jaccard-index) |similarity<br>distance | Yes | Yes | Set | O(m+n) | |
6061
| [Sorensen-Dice coefficient](#sorensen-dice-coefficient) |similarity<br>distance | Yes | No | Set | O(m+n) | |
62+
| [Ratcliff-Obershelp](#ratcliff-obershelp) |similarity<br>distance | Yes | No | | ? | |
6163

6264
[1] In this library, Levenshtein edit distance, LCS distance and their siblings are computed using the **dynamic programming** method, which has a cost O(m.n). For Levenshtein distance, the algorithm is sometimes called the **Wagner-Fischer algorithm** ("The string-to-string correction problem", 1974). The original algorithm uses a matrix of size m x n to store the Levenshtein distance between string prefixes.
6365

@@ -443,6 +445,38 @@ Similar to Jaccard index, but this time the similarity is computed as 2 * |V1 in
443445

444446
Distance is computed as 1 - similarity.
445447

448+
## Ratcliff-Obershelp
449+
Ratcliff/Obershelp Pattern Recognition, also known as Gestalt Pattern Matching, is a string-matching algorithm for determining the similarity of two strings. It was developed in 1983 by John W. Ratcliff and John A. Obershelp and published in Dr. Dobb's Journal in July 1988.
450+
451+
Ratcliff/Obershelp computes the similarity between 2 strings, and the returned value lies in the interval [0.0, 1.0].
452+
453+
The distance is computed as 1 - Ratcliff/Obershelp similarity.
454+
455+
```java
456+
import info.debatty.java.stringsimilarity.*;
457+
458+
public class MyApp {
459+
460+
461+
public static void main(String[] args) {
462+
RatcliffObershelp ro = new RatcliffObershelp();
463+
464+
// substitution of s and t
465+
System.out.println(ro.similarity("My string", "My tsring"));
466+
467+
// substitution of s and n
468+
System.out.println(ro.similarity("My string", "My ntrisg"));
469+
}
470+
}
471+
```
472+
473+
will produce:
474+
475+
```
476+
0.8888888888888888
477+
0.7777777777777778
478+
```
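
As a worked instance of the formula behind the first value above: the matched substrings for "My string" and "My tsring" are "ring" (4), "My " (3) and "s" (1). The symbol $K_m$ for the number of matching characters is notation assumed here for illustration and does not appear in the README.

```latex
\mathrm{sim}(s_1, s_2) = \frac{2 K_m}{|s_1| + |s_2|}
                       = \frac{2\,(4 + 3 + 1)}{9 + 9}
                       = \frac{16}{18} \approx 0.8889
```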
479+
446480
## Experimental
447481

448482
### SIFT4
Lines changed: 133 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,133 @@
1+
/*
2+
* The MIT License
3+
*
4+
* Copyright 2015 Thibault Debatty.
5+
*
6+
* Permission is hereby granted, free of charge, to any person obtaining a copy
7+
* of this software and associated documentation files (the "Software"), to deal
8+
* in the Software without restriction, including without limitation the rights
9+
* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
10+
* copies of the Software, and to permit persons to whom the Software is
11+
* furnished to do so, subject to the following conditions:
12+
*
13+
* The above copyright notice and this permission notice shall be included in
14+
* all copies or substantial portions of the Software.
15+
*
16+
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
17+
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
18+
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
19+
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
20+
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
21+
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
22+
* THE SOFTWARE.
23+
*/
24+
package info.debatty.java.stringsimilarity;
25+
26+
import info.debatty.java.stringsimilarity.interfaces.NormalizedStringSimilarity;
27+
import info.debatty.java.stringsimilarity.interfaces.NormalizedStringDistance;
28+
import java.util.List;
29+
import java.util.ArrayList;
30+
import java.util.Iterator;
31+
32+
import net.jcip.annotations.Immutable;
33+
34+
/**
35+
* Ratcliff/Obershelp pattern recognition
36+
* The Ratcliff/Obershelp algorithm computes the similarity of two strings as
37+
* the doubled number of matching characters divided by the total number of
38+
* characters in the two strings. Matching characters are those in the longest
39+
* common substring plus, recursively, matching characters in the unmatched
40+
* region on either side of the longest common substring.
41+
* The Ratcliff/Obershelp distance is computed as 1 - Ratcliff/Obershelp
42+
* similarity.
43+
*
44+
* @author Ligi https://github.com/dxpux (as a patch for fuzzystring)
45+
* Ported to Java from .NET by denmase
46+
*/
47+
@Immutable
48+
public class RatcliffObershelp implements
49+
NormalizedStringSimilarity, NormalizedStringDistance {
50+
51+
/**
52+
* Compute the Ratcliff-Obershelp similarity between strings.
53+
*
54+
* @param s1 The first string to compare.
55+
* @param s2 The second string to compare.
56+
* @return The RatcliffObershelp similarity in the range [0, 1]
57+
* @throws NullPointerException if s1 or s2 is null.
58+
*/
59+
public final double similarity(final String s1, final String s2) {
60+
if (s1 == null) {
61+
throw new NullPointerException("s1 must not be null");
62+
}
63+
64+
if (s2 == null) {
65+
throw new NullPointerException("s2 must not be null");
66+
}
67+
68+
if (s1.equals(s2)) {
69+
return 1.0d;
70+
}
71+
72+
List<String> matches = getMatchList(s1, s2);
73+
int sumofmatches = 0;
74+
Iterator<String> it = matches.iterator();
75+
76+
while (it.hasNext()) {
77+
String element = it.next();
78+
sumofmatches += element.length();
79+
}
80+
81+
return 2.0d * sumofmatches / (s1.length() + s2.length());
82+
}
83+
84+
/**
85+
* Return 1 - similarity.
86+
*
87+
* @param s1 The first string to compare.
88+
* @param s2 The second string to compare.
89+
* @return 1 - similarity
90+
* @throws NullPointerException if s1 or s2 is null.
91+
*/
92+
public final double distance(final String s1, final String s2) {
93+
return 1.0d - similarity(s1, s2);
94+
}
95+
96+
private static List<String> getMatchList(final String s1, final String s2) {
97+
List<String> list = new ArrayList<String>();
98+
String match = frontMaxMatch(s1, s2);
99+
100+
if (match.length() > 0) {
101+
String frontsource = s1.substring(0, s1.indexOf(match));
102+
String fronttarget = s2.substring(0, s2.indexOf(match));
103+
List<String> frontqueue = getMatchList(frontsource, fronttarget);
104+
105+
String endsource = s1.substring(s1.indexOf(match) + match.length());
106+
String endtarget = s2.substring(s2.indexOf(match) + match.length());
107+
List<String> endqueue = getMatchList(endsource, endtarget);
108+
109+
list.add(match);
110+
list.addAll(frontqueue);
111+
list.addAll(endqueue);
112+
}
113+
114+
return list;
115+
}
116+
117+
private static String frontMaxMatch(final String s1, final String s2) {
118+
int longest = 0;
119+
String longestsubstring = "";
120+
121+
for (int i = 0; i < s1.length(); ++i) {
122+
for (int j = i + 1; j <= s1.length(); ++j) {
123+
String substring = s1.substring(i, j);
124+
if (s2.contains(substring) && substring.length() > longest) {
125+
longest = substring.length();
126+
longestsubstring = substring;
127+
}
128+
}
129+
}
130+
131+
return longestsubstring;
132+
}
133+
}
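
The `frontMaxMatch` method above enumerates every substring of `s1` and calls `contains` on `s2` for each one. As a hedged alternative that is not part of this commit, the sketch below finds the longest common substring with the standard dynamic-programming recurrence (O(m·n) work per call) and recurses on index ranges instead of allocating substrings. The names `RatcliffObershelpSketch` and `countMatches` are hypothetical, and the tie-breaking (earliest match in `s1`, then earliest occurrence in `s2`) is intended to mirror the committed code rather than guaranteed to be identical in every case.

```java
final class RatcliffObershelpSketch {

    /** Similarity = 2 * matched characters / (|s1| + |s2|). */
    static double similarity(final String s1, final String s2) {
        if (s1.equals(s2)) {
            return 1.0d; // also covers two empty strings
        }
        int matched = countMatches(s1, 0, s1.length(), s2, 0, s2.length());
        return 2.0d * matched / (s1.length() + s2.length());
    }

    /** Counts matching characters in s1[lo1, hi1) and s2[lo2, hi2). */
    static int countMatches(final String s1, final int lo1, final int hi1,
                            final String s2, final int lo2, final int hi2) {
        if (lo1 >= hi1 || lo2 >= hi2) {
            return 0;
        }
        // prev[k] and curr[k] hold the length of the common suffix of the
        // prefixes ending at the previous/current character of s1 and at
        // the k-th character of the s2 range (1-based; index 0 stays 0).
        int width = hi2 - lo2;
        int[] prev = new int[width + 1];
        int[] curr = new int[width + 1];
        int best = 0;
        int end1 = 0;
        int end2 = 0;
        for (int i = lo1; i < hi1; i++) {
            for (int j = lo2; j < hi2; j++) {
                int k = j - lo2 + 1;
                if (s1.charAt(i) == s2.charAt(j)) {
                    curr[k] = prev[k - 1] + 1;
                    if (curr[k] > best) { // strict: keep the first longest
                        best = curr[k];
                        end1 = i + 1;
                        end2 = j + 1;
                    }
                } else {
                    curr[k] = 0;
                }
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        if (best == 0) {
            return 0; // no common character in these ranges
        }
        // The longest block plus, recursively, the regions on either side.
        return best
                + countMatches(s1, lo1, end1 - best, s2, lo2, end2 - best)
                + countMatches(s1, end1, hi1, s2, end2, hi2);
    }
}
```

Calling `RatcliffObershelpSketch.similarity("WIKIMEDIA", "WIKIMANIA")` matches "WIKIM" and then "IA", giving 14/18. Unlike the committed class, this sketch omits the explicit null checks.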
Lines changed: 124 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,124 @@
1+
/*
2+
* The MIT License
3+
*
4+
* Copyright 2015 Thibault Debatty.
5+
*
6+
* Permission is hereby granted, free of charge, to any person obtaining a copy
7+
* of this software and associated documentation files (the "Software"), to deal
8+
* in the Software without restriction, including without limitation the rights
9+
* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
10+
* copies of the Software, and to permit persons to whom the Software is
11+
* furnished to do so, subject to the following conditions:
12+
*
13+
* The above copyright notice and this permission notice shall be included in
14+
* all copies or substantial portions of the Software.
15+
*
16+
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
17+
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
18+
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
19+
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
20+
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
21+
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
22+
* THE SOFTWARE.
23+
*/
24+
25+
package info.debatty.java.stringsimilarity;
26+
27+
import info.debatty.java.stringsimilarity.testutil.NullEmptyTests;
28+
import org.junit.Test;
29+
import static org.junit.Assert.*;
30+
31+
/**
32+
*
33+
* @author Agung Nugroho
34+
*/
35+
public class RatcliffObershelpTest {
36+
37+
38+
/**
39+
* Test of similarity method, of class RatcliffObershelp.
40+
*/
41+
@Test
42+
public final void testSimilarity() {
43+
System.out.println("similarity");
44+
RatcliffObershelp instance = new RatcliffObershelp();
45+
46+
// test data from other algorithms
47+
// "My string" vs "My tsring"
48+
// Substrings:
49+
// "ring" ==> 4, "My " ==> 3, "s" ==> 1
50+
// Ratcliff-Obershelp = 2*(sum of substrings)/(length of s1 + length of s2)
51+
// = 2*(4 + 3 + 1) / (9 + 9)
52+
// = 16/18
53+
// = 0.888888
54+
assertEquals(
55+
0.888888,
56+
instance.similarity("My string", "My tsring"),
57+
0.000001);
58+
59+
// test data from other algorithms
60+
// "My string" vs "My ntrisg"
61+
// Substrings:
62+
// "My " ==> 3, "tri" ==> 3, "g" ==> 1
63+
// Ratcliff-Obershelp = 2*(sum of substrings)/(length of s1 + length of s2)
64+
// = 2*(3 + 3 + 1) / (9 + 9)
65+
// = 14/18
66+
// = 0.777778
67+
assertEquals(
68+
0.777778,
69+
instance.similarity("My string", "My ntrisg"),
70+
0.000001);
71+
72+
// test data from essay by Ilya Ilyankou
73+
// "Comparison of Jaro-Winkler and Ratcliff/Obershelp algorithms
74+
// in spell check"
75+
// https://ilyankou.files.wordpress.com/2015/06/ib-extended-essay.pdf
76+
// p13, expected result is 0.857
77+
assertEquals(
78+
0.857,
79+
instance.similarity("MATEMATICA", "MATHEMATICS"),
80+
0.001);
81+
82+
// test data from stringmetric
83+
// https://github.com/rockymadden/stringmetric
84+
// expected output is 0.7368421052631579
85+
assertEquals(
86+
0.736842,
87+
instance.similarity("aleksander", "alexandre"),
88+
0.000001);
89+
90+
// test data from stringmetric
91+
// https://github.com/rockymadden/stringmetric
92+
// expected output is 0.6666666666666666
93+
assertEquals(
94+
0.666666,
95+
instance.similarity("pennsylvania", "pencilvaneya"),
96+
0.000001);
97+
98+
// test data from wikipedia
99+
// https://en.wikipedia.org/wiki/Gestalt_Pattern_Matching
100+
// expected output is 14/18 = 0.7777777777777778
101+
assertEquals(
102+
0.777778,
103+
instance.similarity("WIKIMEDIA", "WIKIMANIA"),
104+
0.000001);
105+
106+
// test data from wikipedia
107+
// https://en.wikipedia.org/wiki/Gestalt_Pattern_Matching
108+
// expected output is 24/40 = 0.6
109+
assertEquals(
110+
0.6,
111+
instance.similarity("GESTALT PATTERN MATCHING", "GESTALT PRACTICE"),
112+
0.000001);
113+
114+
NullEmptyTests.testSimilarity(instance);
115+
}
116+
117+
@Test
118+
public final void testDistance() {
119+
RatcliffObershelp instance = new RatcliffObershelp();
120+
NullEmptyTests.testDistance(instance);
121+
122+
// TODO: regular (non-null/empty) distance tests
123+
}
124+
}
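
For the TODO in `testDistance`, candidate expected values follow from the similarities asserted in `testSimilarity`, since the distance is defined as 1 - similarity. The figures below are derived here from that test data and are not taken from the commit:

```latex
\begin{aligned}
d(\text{My string}, \text{My tsring}) &= 1 - \tfrac{16}{18} = \tfrac{2}{18} \approx 0.111111 \\
d(\text{My string}, \text{My ntrisg}) &= 1 - \tfrac{14}{18} = \tfrac{4}{18} \approx 0.222222 \\
d(\text{WIKIMEDIA}, \text{WIKIMANIA}) &= 1 - \tfrac{14}{18} = \tfrac{4}{18} \approx 0.222222 \\
d(\text{GESTALT PATTERN MATCHING}, \text{GESTALT PRACTICE}) &= 1 - \tfrac{24}{40} = 0.4
\end{aligned}
```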
