The Jaccard distance between two sets is the size of their intersection divided by the size of their union. For example, the distance between {1, 2, 3}
and {2, 3, 4}
is 2 ({2,3}
) / 4 ({1,2,3,4}
) = 0.5.
The Jaccard distance can be used for string similarity by slicing a string into word or character n-grams. For example, with 2-character n-grams:
"Pear" vs "Peach"
`{"Pe", "ea", "ar"}` vs `{"Pe", "ea", "ac", "ch"}` = 2/5 = 0.4
I've written an implementation in Rust as an attempt to learn something about the language.
extern crate unidecode;
use unidecode::unidecode;
use std::collections::HashSet;
fn normalize(s: &str) -> String {
unidecode(s).to_lowercase()
}
fn shingles(s: String) -> HashSet<String> {
let mut xs = HashSet::new();
let it = s.chars();
let mut pk = it.peekable();
loop {
let cur = pk.next();
let nxt = pk.peek();
match (cur, nxt) {
(Some(i), Some(nxt_i)) => {
xs.insert(format!("{}{}", i, nxt_i));
}
_ => { break }
}
}
return xs
}
// Intersection of the sets divided by the size of the union of the
// sets.
fn jaccard_distance(s1: String, s2: String) -> f64 {
let s1_shingles = shingles(s1);
let s2_shingles = shingles(s2);
s1_shingles.len() as f64;
let inter: HashSet<_> = s1_shingles.intersection(&s2_shingles).collect();
let union: HashSet<_> = s1_shingles.union(&s2_shingles).collect();
(inter.len() as f64) / (union.len() as f64)
}
fn comp_and_print(s1: &str, s2: &str) {
let normal_s1 = normalize(s1);
let normal_s2 = normalize(s2);
println!("'{}' vs '{}' ... \t {}", s1, s2,
jaccard_distance(normal_s1, normal_s2));
}
fn main() {
comp_and_print("Apple sauce", "Apple Trees");
comp_and_print("Apple pie", "Grandma's Apple pie");
comp_and_print("Pear", "Peach");
}
The shingles method seems particularly bad. I'd like it to be more flexible (for 3 or 4 character n-grams, for instance) and the use of format!
rather than join
on a collection like I'd do in Ruby is ugly.
edit
Here's Ruby code for shingles
since I can express what I mean easier in Ruby. My Rust implementation only handles n = 2 because I had trouble figuring out how to make it dynamic.
def shingles(str, n = 2)
(0..(str.length - n)).map { |idx| str[idx..idx+(n-1)] }.to_set
end
-
\$\begingroup\$ For the record, a "2-character ngram" is called a bigram. \$\endgroup\$anon– anon2015年11月02日 12:21:19 +00:00Commented Nov 2, 2015 at 12:21
-
\$\begingroup\$ @QPaysTaxes I wrote it that way because it also makes sense to use ngrams on longer length, but good point. \$\endgroup\$spike– spike2015年11月02日 14:08:04 +00:00Commented Nov 2, 2015 at 14:08
-
1\$\begingroup\$ Fair enough. I just wanted to point out that, say, Googling "2-character ngram" is going to get you a lot of results for ngrams in general, not bigrams. At least, as far as I know. I haven't tried. \$\endgroup\$anon– anon2015年11月02日 20:04:00 +00:00Commented Nov 2, 2015 at 20:04
2 Answers 2
Let's start by removing unused code, (削除) the whole the line normalize
function and (削除ここまで)s1_shingles.len() as f64
.
Then, in order to actually be able to call your functions easily, we change all String
arguments into &str
. You usually want to accept &str
as more types can provide it. In this case, string literals are the big example.
Let's do shingles
in a big bang!
fn shingles(s: &str) -> HashSet<String> {
let chars: Vec<_> = s.chars().collect();
chars.windows(2).map(|w| w.iter().cloned().collect()).collect()
}
We start by getting all the characters once into a vector. Now that we have a slice (through the Vec), we can use windows
, which gives overlapping views into the slice. We then convert each window into an iterator of char
using cloned
and then into a string using collect
. We convert the overall iterator into a HashSet
, also using collect
. Look at that type inference go!
Note that now we have a parameter to windows
that allows you to extend to arbitrary-sized n-grams!
The other optimization is to avoid collecting into a collection if we just want the count of elements:
let inter = s1_shingles.intersection(&s2_shingles).count();
let union = s1_shingles.union(&s2_shingles).count();
All together, it now looks like:
use std::collections::HashSet;
fn shingles(s: &str) -> HashSet<String> {
let chars: Vec<_> = s.chars().collect();
chars.windows(2).map(|w| w.iter().cloned().collect()).collect()
}
fn jaccard_distance(s1: &str, s2: &str) -> f64 {
let s1_shingles = shingles(s1);
let s2_shingles = shingles(s2);
let inter = s1_shingles.intersection(&s2_shingles).count();
let union = s1_shingles.union(&s2_shingles).count();
(inter as f64) / (union as f64)
}
fn main() {
println!("{}", jaccard_distance("Pear", "Peach"));
}
Edit
And of course now I see that if I were to scroll down a bit, I'd see that you do use normalize
. Oops! You should still accept a &str
, the only change you need to make is to take a reference to the String
s:
println!("'{}' vs '{}' ... \t {}", s1, s2,
jaccard_distance(&normal_s1, &normal_s2));
-
\$\begingroup\$ Wow, this is just an incredible answer. Thank you so much. P.S. your version runs 30% faster than mine with
cargo bench
on my machine. \$\endgroup\$spike– spike2015年11月03日 00:56:45 +00:00Commented Nov 3, 2015 at 0:56
shingles
shingles
should be named pairwise_with_repetitions
as that is what it does. I would implement it by zipping the string with itself minus the first character:
s.zip(s.remove_first()) // Pseudocode as I am not proficient in Rust
temp overuse
I suggest avoiding temporary variables when practical, the printing function could have the function calls inlined.
Does nothing, right?
I think that:
s1_shingles.len() as f64;
Does nothing and should be removed.
-
\$\begingroup\$ I agree about your second two points. I tried to use
zip
+skip
(like drop in ruby) but had rust-related troubles. I also want to be able to extendshingles
to take an argument for how many characters to repeat. I've added ruby code for my idealshingles
method - how to write in idiomatic rust? \$\endgroup\$spike– spike2015年11月02日 14:20:25 +00:00Commented Nov 2, 2015 at 14:20
Explore related questions
See similar questions with these tags.