1

I have implemented a custom ruby method which groups similar text using loops,

 array = ["South East Queensland", "Wide Bay Burnett", "Margaret River", "Port Pirie", "Gippsland", "Elizabeth", "Barossa"]
similarity_group = []
similarity_percentage = 60.0
array.each do |first_text|
 results.each do |second_text|
 result = first_text.upcase.similar(second_text.upcase)
 if result >= similarity_percentage
 ...
 ...
 ...
 end 
 end
end

consider the above implementation for 2000 element,then to group them it will cost 4000000 iteration because, each element will check with each other.

is there any performant solution or gems or library like grouping a bulk array based on their similarity.

(I need to use the same array element for similarity check)

sample expectation : [array1].similarity([array1])

asked Jan 30, 2017 at 13:50
2
  • Where does similar come from? Is it from a gem? Which one? Commented Jan 30, 2017 at 13:53
  • Note : You use first_text twice. I suppose similarity will always be 1 :) One obvious optimization would be to check 2 strings only if string1 >= string2. It cuts the iterations in half. Commented Jan 30, 2017 at 13:56

2 Answers 2

4

I think you're looking for clustering based on string distance (e.g. Levensthein). From your code, it looks like similar is already implemented, so it must be clear which distance you're considering.

For clustering, these two gems could help.

The problem is hard though, so your O(n**2) is actually not so bad in comparison. You could avoid comparing string pairs twice by just checking that string1 > string2 before calculating the distance. You'd need (2000*1999)/2 iterations instead of 2000**2.

answered Jan 30, 2017 at 14:07
1
  • Thanks @Eric Duminil, yes similar is a gem which helps to find similarity percentage between two string. and sorry it's a typo. I updated my question. Commented Jan 30, 2017 at 17:54
1

I have a matrix solution I used in the past to compare ebooks, since these are quite large and take time to process I solved that in the first place with a matrix, later on I used an index on a value I derived from the books which was as accurate and much faster. I could search the matrix solution for you but I have another simple routine based on levenshtein which looks much as what you want to accomplish, creating a similarity array (a hash here).

I publish the working version of my script as in my scriptlibrary so you'll probably want to adapt some of it. I used it as an answer previously on this question. You could eg build an array in the same way.

require 'levenshtein'
MAX_DISTANCE, COMPENSATION = 3, 5
strings = [
 "Hazard Const. Company",
 "hazard construction company",
 "PETERSON-CHASE GENERAL ENGINEERING CONSTRUCTION INC",
 "peterson-chase general engineering construction inc",
 "TRAFFIC DEVELOPMENT SERVICES ",
 "traffic development services"
]
result = {}
strings.each do |s|
 s.downcase!
 similar = result.keys.select { |key| Levenshtein.distance(key, s) < MAX_DISTANCE+(s.length/COMPENSATION) }
 if similar.any?
 result[similar.first] += 1
 else
 result.merge!({s => 1})
 end
end
p result
# {"hazard const. company"=>2, "peterson-chase general engineering construction inc"=>2, "traffic development services "=>2}
answered Jan 30, 2017 at 17:17
1
  • Thanks a lot for you and @Eric Duminil. let me try both of your solution. Commented Jan 30, 2017 at 17:58

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.