
I'm trying to implement a small full-text search in Ruby. I feel like I've laid down the foundation, but I've hit a blocker that has me thinking something about the design is incorrect.

The general concept is that a Document is the basic unit. Multiple Documents form a Collection. The InvertedIndex takes a collection to build the index, which is just a hash of stemmed tokens mapped to their respective document ids, for example:

inverted_index = {
  "new"      => [1, 4],       # These are document ids.
  "home"     => [1, 2, 3, 4], # The keys are stemmed and some stop-words
  "sale"     => [1, 2, 3, 4], # are removed.
  "top"      => [1],
  "forecast" => [1],
  "rise"     => [2, 4],
  "juli"     => [2, 3, 4],
  "increas"  => [3]
}

document.rb

module Rankrb
  class Document
    attr_accessor :id, :body, :rank

    def initialize(params={})
      @id   = params.fetch :id, nil
      @body = params.fetch :body, ''
      @rank = params.fetch :rank, nil
    end

    def length
      tokens.join(' ').length
    end

    def include?(term)
      tokens.include? term_to_token(term)
    end

    def term_freq(term)
      tokens.count term_to_token(term)
    end

    def tokens
      Rankrb::Tokenizer.new(@body).tokenize
    end

    def uniq_tokens
      tokens.uniq
    end

    private

    def term_to_token(term)
      Rankrb::Tokenizer.new(term).tokenize.shift
    end
  end
end
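
For reference, a single Document behaves like this (a quick sketch; the expected values assume the stemming shown in the inverted-index example above):

d = Rankrb::Document.new body: "new home sales top forecasts", id: 1
d.tokens             # => ["new", "home", "sale", "top", "forecast"]
d.include?("sales")  # => true ("sales" is stemmed to "sale" first)
d.term_freq("sales") # => 1
d.length             # => 26 (characters in the joined tokens, not a token count)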

tokenizer.rb

module Rankrb
  # The same tokenizer should be used for document
  # tokenization and query tokenization to ensure that
  # the same terms are being searched and returned.
  class Tokenizer
    attr_accessor :str
    attr_reader :tokens

    def initialize(str='')
      @str = str
      @tokens = Array.new
      @stopwords = Rankrb.configuration.stopwords
      @lang = Rankrb.configuration.language
    end

    def tokenize
      regex = /[^\s\p{Alnum}\p{Han}\p{Katakana}\p{Hiragana}\p{Hangul}]/
      @tokens = @str.gsub(regex, '')
                    .downcase
                    .split
                    .delete_if { |token| @stopwords.include?(token) }
                    .map { |w| Lingua.stemmer(w, :language => @lang) }
      @tokens
    end
  end
end
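
For example, tokenizing one of the sample bodies used further down should give the following (assuming English stemming and a stop-word list that includes "in"):

Rankrb::Tokenizer.new("home sales rise in july").tokenize
# => ["home", "sale", "rise", "juli"]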

collection.rb

module Rankrb
  class Collection
    attr_accessor :query, :docs

    def initialize(params={})
      @docs = params.fetch(:docs, [])
      @query = params.fetch(:query, nil)

      def @docs.<<(arg)
        self.push arg
      end
    end

    def remove_doc(doc)
      @docs.delete_if do |curr_doc|
        curr_doc == doc
      end
    end

    def containing_term(term)
      @docs.count { |doc| doc.include?(term) }
    end

    def avg_dl
      # Use a float so the average (and the BM25 length
      # normalization below) isn't truncated by integer division.
      @docs.map(&:length).inject(:+).to_f / total_docs
    end

    def total_docs
      @docs.size
    end

    def idf(term)
      numerator = total_docs - containing_term(term) + 0.5
      denominator = containing_term(term) + 0.5
      Math.log(numerator / denominator)
    end

    def bm25(params={:k => 1.2, :b => 0.75, :delta => 1.0})
      @k = params[:k]
      @b = params[:b]
      @delta = params[:delta]

      @docs.each do |doc|
        score = 0
        dl = doc.length
        query_terms = @query.split

        query_terms.each do |term|
          dtf = doc.term_freq(term)
          numerator = dtf * (@k + 1)
          denominator = dtf + @k * (1 - @b + @b * (dl / avg_dl))
          # BM25+ adds delta to the per-term component inside
          # the product with idf, not to the whole sum.
          score += idf(term) * (numerator / denominator + @delta)
        end

        doc.rank = score
      end

      # Highest-scoring documents first.
      @docs.sort { |a, b| b.rank <=> a.rank }
    end
  end
end
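
For ranking, the query attribute drives bm25. A quick usage sketch, using documents like the ones in the example run further down (exact scores depend on the default k, b, and delta):

coll = Rankrb::Collection.new docs: [d1, d2, d3, d4], query: 'home sales'
ranked = coll.bm25 # assigns a rank to every doc, returns them best-first
ranked.map(&:id)   # => document ids ordered by BM25+ score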

inverted_index.rb

require 'json'

module Rankrb
  class InvertedIndex
    attr_accessor :collection, :iidx

    def initialize(params={})
      @collection = params.fetch(:collection, Rankrb::Collection.new)
      @index_file = 'db/index.json'
      @iidx = Hash.new
    end

    def build
      @collection.docs.each do |doc|
        # Make the inverted index hash
        doc.uniq_tokens.each do |token|
          if @iidx[token]
            @iidx[token] << doc.id
          else
            @iidx[token] = [doc.id]
          end
        end
      end
      # Now sort the document ids and return the inverted index!
      @iidx.each { |k, v| @iidx[k] = v.sort }
    end

    def remove_doc(doc)
      doc.tokens.each do |token|
        # Remove the document id
        @iidx[token].delete(doc.id)
        # Then remove the key from the hash if
        # there are no more docs.
        @iidx.delete(token) if @iidx[token].empty?
      end
      # Once all tokens have been removed,
      # remove the document from the collection.
      @collection.remove_doc(doc)
      @iidx
    end

    # Returns an array of document ids.
    def find(str)
      Rankrb::Tokenizer.new(str)
                       .tokenize
                       .map { |token| @iidx[token] }
                       .compact
                       .flatten
                       .uniq
                       .sort
    end

    # Define query_or and query_and methods.
    %w(and or).each do |op|
      define_method("query_#{op}") do |word_ary|
        doc_ids = Array.new
        word_ary.each { |word| doc_ids << find(word) }
        case op
        when 'and'
          symbol = :& # Conjunctive query
        when 'or'
          symbol = :| # Disjunctive query
        end
        doc_ids.inject(symbol)
      end
    end

    def commit!
      if File.exist?(@index_file)
        file = File.read @index_file
        # Merge the new tokens, unioning the id lists for
        # tokens that already exist in the file.
        index = JSON.parse(file).merge(@iidx) { |_token, old, new| (old | new).sort }
        File.open(@index_file, 'w+') { |f| f.write(index.to_json) }
      else
        # Create & write to file for the first time,
        # serializing as JSON rather than Ruby's Hash#to_s.
        File.open(@index_file, 'w') { |f| f.write(@iidx.to_json) }
      end
    end
  end
end

You'd run it as follows:

d1 = Rankrb::Document.new body: "new home sales top forecasts", id: 1
d2 = Rankrb::Document.new body: "home sales rise in july", id: 2
d3 = Rankrb::Document.new body: "increase in home sales in july", id: 3
d4 = Rankrb::Document.new body: "july new home sales rise", id: 4
coll = Rankrb::Collection.new docs: [d1, d2, d3, d4]
index = Rankrb::InvertedIndex.new collection: coll
index.build # Inverted-index gets built and stored into @iidx
index.find('top sales') # => [1, 2, 3, 4]

This is where I'm a bit lost. The current process does the following:

  1. The find method in InvertedIndex returns an array of doc ids.
  2. The doc ids need to be found within Collection (which currently holds ALL documents)
  3. These documents need to be ranked.
  4. Collection needs to return only the ranked Documents back to find, so it can respond to the query (a rough sketch of this glue follows below).
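
To make that flow concrete, the glue I have in mind looks roughly like this (a sketch only; this search method doesn't exist in the code above, and the name is just a placeholder):

module Rankrb
  class InvertedIndex
    def search(str)
      ids  = find(str)                                          # 1. matching doc ids
      docs = @collection.docs.select { |d| ids.include?(d.id) } # 2. look them up
      # 3 + 4. rank only the matches and return them, best first
      Rankrb::Collection.new(docs: docs, query: str).bm25
    end
  end
end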

The questions:

  • Storing all Documents in memory within Collection is the first thing that seems wrong to me, as this will eat up tons of RAM. However, I need them there so that find can do something with the array of document ids it returns. I can't exactly remove the Documents from memory since, as this case shows, the query matched tokens in all of them. Is there a better way to handle this situation?
  • In order to rank Documents, I need to iterate over the array returned by find (e.g. [1, 2, 3, 4]). This means I'd need to iterate over all of those documents and hand a new array of Documents back to find, so that each id and rank can be preserved and returned.

Am I wrong to think this is slightly unreasonable? Is this design incorrect?

Comments:

  • As far as I can tell, this doesn't contain all of your code. I say this because, when I try to use it, (a) it gives me NoMethodErrors for configuration and that stuff, and (b) there's no single file which also requires the rest, which is typical for a rather large library like this.
  • Also, where's Lingua? That's not in there as far as I can tell.
  • Isn't this question kind of off-topic?

1 Answer


I gave a talk at RailsConf a few years back where I did a walk-through of some fundamentally similar code. In my talk I actually used JRuby and OpenNLP, but you can achieve the same result in Ruby MRI using some of the other libraries I call out in my deck.

https://speakerdeck.com/brandonblack/natural-language-processing-in-ruby?slide=31

In my example, I'm doing a poor man's summary of a block of text by identifying and grabbing the top "most significant" sentences in the document. It's a different goal, but many of the basic ideas translate over to what you're trying to do.

For a basic text search, you need to build an index that returns the most relevant items based on a given search phrase. Once you've built that index, searching it is simple.

For each segment of searchable text:

  1. Tokenize all the input blocks of text.
  2. Filter out stop words that have low value. You can do this with a library, or for simple demo purposes a regex or stop-word blacklist will do.
  3. Use a word stemmer to identify the root word for each token in the text block.
  4. In your database, store an association between the word stem, the document being searched and a weighting for how many times that word stem appears in the text.
  5. In your search method, follow the same process (tokenize, filter, stem) on the search phrase, then return the documents your index holds for those stems, sorted by weight descending (see the sketch after this list).
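
To make those steps concrete, here's a minimal sketch of that pipeline in plain Ruby. The stop-word list and the stem method are crude placeholders; in a real system you'd use a proper list and stemmer (e.g. the Lingua stemmer the question's code already uses):

STOPWORDS = %w[a an and in of the].freeze # toy list for illustration

# Crude stand-in for a real stemmer (e.g. Lingua.stemmer).
def stem(word)
  word.sub(/(?:ing|es|s)\z/, '')
end

# Steps 1-4: tokenize, filter, stem, and store stem => {doc_id => weight}.
def index_document(index, doc_id, text)
  stems = text.downcase.scan(/\p{Alnum}+/)   # 1. tokenize
              .reject { |t| STOPWORDS.include?(t) } # 2. filter stop words
              .map { |t| stem(t) }           # 3. stem
  stems.tally.each do |stem_word, weight|    # 4. weight = occurrence count
    (index[stem_word] ||= {})[doc_id] = weight
  end
  index
end

# Step 5: same pipeline on the query, then sort matching
# documents by accumulated weight, descending.
def search(index, phrase)
  stems = phrase.downcase.scan(/\p{Alnum}+/)
                .reject { |t| STOPWORDS.include?(t) }
                .map { |t| stem(t) }
  scores = Hash.new(0)
  stems.each { |s| (index[s] || {}).each { |id, w| scores[id] += w } }
  scores.sort_by { |_id, w| -w }.map(&:first)
end

index = {}
index_document(index, 1, "new home sales top forecasts")
index_document(index, 2, "home sales rise in july")
search(index, "july sales") # => [2, 1]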

In the database, a row in the search index might look something like this:

stem_word_id, document_id, weight

If you had a table full of those associations, you'd just need to index that table across all three columns, with weight in descending order. For best results, consider using a database that leverages in-memory mapping (e.g. MongoDB) for this index (similar to keeping the index in cache).

I hope that helps draw a rough picture of the process.

That being said, there are many more nuances to full-text search. This is a very simplistic, mostly naïve approach, and without an appropriate data backend it would not scale very far (you're spot on about storing everything in memory being a losing strategy long-term). If you're considering this for a production system, I would recommend using a database with built-in full-text search, or something like Apache Lucene, which has Ruby-based adapters.

