Jaccard distance between strings in Rust

Question 1

The Jaccard distance between two sets is the size of their intersection divided by the size of their union. For example, the distance between {1, 2, 3} and {2, 3, 4} is 2 ({2,3}) / 4 ({1,2,3,4}) = 0.5.

The Jaccard distance can be used for string similarity by slicing a string into word or character n-grams. For example, with 2-character n-grams:

"Pear" vs "Peach" 
`{"Pe", "ea", "ar"}` vs `{"Pe", "ea", "ac", "ch"}` = 2/5 = 0.4

I've written an implementation in Rust as an attempt to learn something about the language.

extern crate unidecode;
use unidecode::unidecode;
use std::collections::HashSet;
fn normalize(s: &str) -> String {
 unidecode(s).to_lowercase()
}
fn shingles(s: String) -> HashSet<String> {
 let mut xs = HashSet::new();
 let it = s.chars();
 let mut pk = it.peekable();
 loop {
 let cur = pk.next();
 let nxt = pk.peek();
 match (cur, nxt) {
 (Some(i), Some(nxt_i)) => {
 xs.insert(format!("{}{}", i, nxt_i));
 }
 _ => { break }
 }
 }
 return xs
}
// Intersection of the sets divided by the size of the union of the
// sets.
fn jaccard_distance(s1: String, s2: String) -> f64 {
 let s1_shingles = shingles(s1);
 let s2_shingles = shingles(s2);
 s1_shingles.len() as f64;
 let inter: HashSet<_> = s1_shingles.intersection(&s2_shingles).collect();
 let union: HashSet<_> = s1_shingles.union(&s2_shingles).collect();
 (inter.len() as f64) / (union.len() as f64)
}
fn comp_and_print(s1: &str, s2: &str) {
 let normal_s1 = normalize(s1);
 let normal_s2 = normalize(s2);
 println!("'{}' vs '{}' ... \t {}", s1, s2,
 jaccard_distance(normal_s1, normal_s2));
}
fn main() {
 comp_and_print("Apple sauce", "Apple Trees");
 comp_and_print("Apple pie", "Grandma's Apple pie");
 comp_and_print("Pear", "Peach");
}

The shingles method seems particularly bad. I'd like it to be more flexible (for 3 or 4 character n-grams, for instance) and the use of format! rather than join on a collection like I'd do in Ruby is ugly.

edit

Here's Ruby code for shingles since I can express what I mean easier in Ruby. My Rust implementation only handles n = 2 because I had trouble figuring out how to make it dynamic.

def shingles(str, n = 2)
 (0..(str.length - n)).map { |idx| str[idx..idx+(n-1)] }.to_set
end

Question 2

For the record, a "2-character ngram" is called a bigram.

Question 3

@QPaysTaxes I wrote it that way because it also makes sense to use ngrams on longer length, but good point.

Question 4

Fair enough. I just wanted to point out that, say, Googling "2-character ngram" is going to get you a lot of results for ngrams in general, not bigrams. At least, as far as I know. I haven't tried.

Question 5

Let's start by removing unused code, ~~(削除) the whole normalize function and (削除ここまで)~~ the line s1_shingles.len() as f64.

Then, in order to actually be able to call your functions easily, we change all String arguments into &str. You usually want to accept &str as more types can provide it. In this case, string literals are the big example.

Let's do shingles in a big bang!

fn shingles(s: &str) -> HashSet<String> {
 let chars: Vec<_> = s.chars().collect();
 chars.windows(2).map(|w| w.iter().cloned().collect()).collect()
}

We start by getting all the characters once into a vector. Now that we have a slice (through the Vec), we can use windows, which gives overlapping views into the slice. We then convert each window into an iterator of char using cloned and then into a string using collect. We convert the overall iterator into a HashSet, also using collect. Look at that type inference go!

Note that now we have a parameter to windows that allows you to extend to arbitrary-sized n-grams!

The other optimization is to avoid collecting into a collection if we just want the count of elements:

let inter = s1_shingles.intersection(&s2_shingles).count();
let union = s1_shingles.union(&s2_shingles).count();

All together, it now looks like:

use std::collections::HashSet;
fn shingles(s: &str) -> HashSet<String> {
 let chars: Vec<_> = s.chars().collect();
 chars.windows(2).map(|w| w.iter().cloned().collect()).collect()
}
fn jaccard_distance(s1: &str, s2: &str) -> f64 {
 let s1_shingles = shingles(s1);
 let s2_shingles = shingles(s2);
 let inter = s1_shingles.intersection(&s2_shingles).count();
 let union = s1_shingles.union(&s2_shingles).count();
 (inter as f64) / (union as f64)
}
fn main() {
 println!("{}", jaccard_distance("Pear", "Peach"));
}

Edit

And of course now I see that if I were to scroll down a bit, I'd see that you do use normalize. Oops! You should still accept a &str, the only change you need to make is to take a reference to the Strings:

println!("'{}' vs '{}' ... \t {}", s1, s2,
 jaccard_distance(&normal_s1, &normal_s2));

Question 6

Wow, this is just an incredible answer. Thank you so much. P.S. your version runs 30% faster than mine with cargo bench on my machine.

Question 7

shingles

shingles should be named pairwise_with_repetitions as that is what it does. I would implement it by zipping the string with itself minus the first character:

s.zip(s.remove_first()) // Pseudocode as I am not proficient in Rust

temp overuse

I suggest avoiding temporary variables when practical, the printing function could have the function calls inlined.

Does nothing, right?

I think that:

s1_shingles.len() as f64;

Does nothing and should be removed.

Question 8

I agree about your second two points. I tried to use zip + skip (like drop in ruby) but had rust-related troubles. I also want to be able to extend shingles to take an argument for how many characters to repeat. I've added ruby code for my ideal shingles method - how to write in idiomatic rust?

Shepmaster Shepmaster 8,77827 silver badges28 bronze badges · Accepted Answer · 2015-11-03 00:14:51Z

Let's start by removing unused code, ~~(削除) the whole normalize function and (削除ここまで)~~ the line s1_shingles.len() as f64.

Then, in order to actually be able to call your functions easily, we change all String arguments into &str. You usually want to accept &str as more types can provide it. In this case, string literals are the big example.

Let's do shingles in a big bang!

fn shingles(s: &str) -> HashSet<String> {
 let chars: Vec<_> = s.chars().collect();
 chars.windows(2).map(|w| w.iter().cloned().collect()).collect()
}

We start by getting all the characters once into a vector. Now that we have a slice (through the Vec), we can use windows, which gives overlapping views into the slice. We then convert each window into an iterator of char using cloned and then into a string using collect. We convert the overall iterator into a HashSet, also using collect. Look at that type inference go!

Note that now we have a parameter to windows that allows you to extend to arbitrary-sized n-grams!

The other optimization is to avoid collecting into a collection if we just want the count of elements:

let inter = s1_shingles.intersection(&s2_shingles).count();
let union = s1_shingles.union(&s2_shingles).count();

All together, it now looks like:

use std::collections::HashSet;
fn shingles(s: &str) -> HashSet<String> {
 let chars: Vec<_> = s.chars().collect();
 chars.windows(2).map(|w| w.iter().cloned().collect()).collect()
}
fn jaccard_distance(s1: &str, s2: &str) -> f64 {
 let s1_shingles = shingles(s1);
 let s2_shingles = shingles(s2);
 let inter = s1_shingles.intersection(&s2_shingles).count();
 let union = s1_shingles.union(&s2_shingles).count();
 (inter as f64) / (union as f64)
}
fn main() {
 println!("{}", jaccard_distance("Pear", "Peach"));
}

Edit

And of course now I see that if I were to scroll down a bit, I'd see that you do use normalize. Oops! You should still accept a &str, the only change you need to make is to take a reference to the Strings:

println!("'{}' vs '{}' ... \t {}", s1, s2,
 jaccard_distance(&normal_s1, &normal_s2));

Wow, this is just an incredible answer. Thank you so much. P.S. your version runs 30% faster than mine with cargo bench on my machine.

Stack Exchange Network

Jaccard distance between strings in Rust

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Jaccard distance between strings in Rust

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions