is that possible to group a array of string based on the similarity in ruby

Question 1

I have implemented a custom ruby method which groups similar text using loops,

 array = ["South East Queensland", "Wide Bay Burnett", "Margaret River", "Port Pirie", "Gippsland", "Elizabeth", "Barossa"]
similarity_group = []
similarity_percentage = 60.0
array.each do |first_text|
 results.each do |second_text|
 result = first_text.upcase.similar(second_text.upcase)
 if result >= similarity_percentage
 ...
 ...
 ...
 end 
 end
end

consider the above implementation for 2000 element,then to group them it will cost 4000000 iteration because, each element will check with each other.

is there any performant solution or gems or library like grouping a bulk array based on their similarity.

(I need to use the same array element for similarity check)

sample expectation : [array1].similarity([array1])

Question 2

Where does similar come from? Is it from a gem? Which one?

Question 3

Note : You use first_text twice. I suppose similarity will always be 1 :) One obvious optimization would be to check 2 strings only if string1 >= string2. It cuts the iterations in half.

Question 4

I think you're looking for clustering based on string distance (e.g. Levensthein). From your code, it looks like similar is already implemented, so it must be clear which distance you're considering.

For clustering, these two gems could help.

The problem is hard though, so your O(n**2) is actually not so bad in comparison. You could avoid comparing string pairs twice by just checking that string1 > string2 before calculating the distance. You'd need (2000*1999)/2 iterations instead of 2000**2.

Question 5

Thanks @Eric Duminil, yes similar is a gem which helps to find similarity percentage between two string. and sorry it's a typo. I updated my question.

Question 6

I have a matrix solution I used in the past to compare ebooks, since these are quite large and take time to process I solved that in the first place with a matrix, later on I used an index on a value I derived from the books which was as accurate and much faster. I could search the matrix solution for you but I have another simple routine based on levenshtein which looks much as what you want to accomplish, creating a similarity array (a hash here).

I publish the working version of my script as in my scriptlibrary so you'll probably want to adapt some of it. I used it as an answer previously on this question. You could eg build an array in the same way.

require 'levenshtein'
MAX_DISTANCE, COMPENSATION = 3, 5
strings = [
 "Hazard Const. Company",
 "hazard construction company",
 "PETERSON-CHASE GENERAL ENGINEERING CONSTRUCTION INC",
 "peterson-chase general engineering construction inc",
 "TRAFFIC DEVELOPMENT SERVICES ",
 "traffic development services"
]
result = {}
strings.each do |s|
 s.downcase!
 similar = result.keys.select { |key| Levenshtein.distance(key, s) < MAX_DISTANCE+(s.length/COMPENSATION) }
 if similar.any?
 result[similar.first] += 1
 else
 result.merge!({s => 1})
 end
end
p result
# {"hazard const. company"=>2, "peterson-chase general engineering construction inc"=>2, "traffic development services "=>2}

Question 7

Thanks a lot for you and @Eric Duminil. let me try both of your solution.

score 4 · Accepted Answer · 2017-01-30 14:07:10Z

I think you're looking for clustering based on string distance (e.g. Levensthein). From your code, it looks like similar is already implemented, so it must be clear which distance you're considering.

For clustering, these two gems could help.

The problem is hard though, so your O(n**2) is actually not so bad in comparison. You could avoid comparing string pairs twice by just checking that string1 > string2 before calculating the distance. You'd need (2000*1999)/2 iterations instead of 2000**2.

Thanks @Eric Duminil, yes similar is a gem which helps to find similarity percentage between two string. and sorry it's a typo. I updated my question.

CollectivesTM on Stack Overflow

is that possible to group a array of string based on the similarity in ruby

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

CollectivesTM on Stack Overflow

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related