I'm trying to implement a small full-text search in Ruby. I feel like I've laid the foundation, but I've hit a blocker that has me thinking something in the design is wrong.
The general concept is that a Document is the basic unit. Multiple Documents form a Collection. The InvertedIndex takes a collection and builds the index, which is just a hash of stemmed tokens mapped to their respective document ids, for example:
inverted_index = {
"new" => [1, 4], # These are document ids.
"home" => [1, 2, 3, 4], # The key is stemmed and some stop-words
"sale" => [1, 2, 3, 4], # are being removed.
"top" => [1],
"forecast" => [1],
"rise" => [2, 4],
"juli" => [2, 3, 4],
"increas" => [3]
}
document.rb
module Rankrb
class Document
attr_accessor :id, :body, :rank
def initialize(params={})
@id = params.fetch :id, nil
@body = params.fetch :body, ''
@rank = params.fetch :rank, nil
end
def length
tokens.join(' ').length
end
def include?(term)
tokens.include? term_to_token(term)
end
def term_freq(term)
tokens.count term_to_token(term)
end
def tokens
Rankrb::Tokenizer.new(@body).tokenize
end
def uniq_tokens
tokens.uniq
end
private
def term_to_token(term)
Rankrb::Tokenizer.new(term).tokenize.shift
end
end
end
tokenizer.rb
require 'lingua/stemmer' # Lingua.stemmer comes from the ruby-stemmer gem.

module Rankrb
# The same tokenizer should be used for document
# tokenization and query tokenization to ensure that
# the same terms are being searched and returned.
class Tokenizer
attr_accessor :str
attr_reader :tokens
def initialize(str='')
@str = str
@tokens = Array.new
@stopwords = Rankrb.configuration.stopwords
@lang = Rankrb.configuration.language
end
    def tokenize
      # Keep whitespace, alphanumerics, and CJK/Hangul characters; strip everything else.
      regex = /[^\s\p{Alnum}\p{Han}\p{Katakana}\p{Hiragana}\p{Hangul}]/
@tokens = @str.gsub(regex,'')
.downcase
.split
.delete_if {|token| @stopwords.include?(token)}
.map {|w| Lingua.stemmer(w, :language => @lang)}
@tokens
end
end
end
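For example, assuming an English stemmer and a stop word list that includes "in":
Rankrb::Tokenizer.new('Increase in home sales in July').tokenize
# => ["increas", "home", "sale", "juli"]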
collection.rb
module Rankrb
class Collection
attr_accessor :query, :docs
def initialize(params={})
@docs = params.fetch(:docs, [])
@query = params.fetch(:query, nil)
def @docs.<<(arg)
self.push arg
end
end
def remove_doc(doc)
@docs.delete_if do |curr_doc|
curr_doc == doc
end
end
def containing_term(term)
@docs.count {|doc| doc.include?(term)}
end
    # Average document length across the collection.
    # Convert to float so integer division doesn't truncate the average.
    def avg_dl
      @docs.map(&:length).inject(:+).to_f / total_docs
    end
def total_docs
@docs.size
end
    # Okapi BM25 inverse document frequency:
    # log((N - n + 0.5) / (n + 0.5)), where N is the total number of
    # documents and n is the number of documents containing the term.
    def idf(term)
      numerator = total_docs - containing_term(term) + 0.5
      denominator = containing_term(term) + 0.5
      Math.log(numerator / denominator)
    end
    def bm25(params={})
      # Merge defaults so passing only one option doesn't nil out the others.
      params = {:k => 1.2, :b => 0.75, :delta => 1.0}.merge(params)
      @k = params[:k]
      @b = params[:b]
      @delta = params[:delta]
@docs.each do |doc|
score = 0
dl = doc.length
query_terms = @query.split
query_terms.each do |term|
dtf = doc.term_freq(term)
numerator = dtf * (@k + 1)
denominator = dtf + @k * (1 - @b + @b * (dl / avg_dl))
score += idf(term) * (numerator/denominator) + @delta
end
doc.rank = score
end
      # Highest-scoring documents first.
      @docs.sort { |a, b| b.rank <=> a.rank }
end
end
end
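For reference, the per-term score the bm25 method computes is (tf is the term's frequency in the document, dl the document length):

$$\mathrm{score} \mathrel{+}= \mathrm{idf}(t)\cdot\frac{tf\,(k+1)}{tf + k\left(1 - b + b\,\frac{dl}{avgdl}\right)} + \delta$$

Note that standard BM25+ adds the delta inside the idf product, i.e. idf(t) * (tf component + delta), so it's worth double-checking which variant is intended here.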
inverted_index.rb
require 'json' # Used by commit! to persist the index.

module Rankrb
class InvertedIndex
attr_accessor :collection, :iidx
def initialize(params={})
@collection = params.fetch(:collection, Rankrb::Collection.new)
@index_file = 'db/index.json'
@iidx = Hash.new
end
def build
@collection.docs.each do |doc|
# Make the inverted index hash
doc.uniq_tokens.each do |token|
if @iidx[token]
@iidx[token] << doc.id
else
@iidx[token] = [doc.id]
end
end
end
# Now sort the document ids and return the inverted index!
@iidx.each {|k, v| @iidx[k] = v.sort}
end
    def remove_doc(doc)
      # Iterate unique tokens so a repeated token isn't processed
      # again after its key has already been deleted.
      doc.uniq_tokens.each do |token|
        next unless @iidx[token]
        # Remove the document id
        @iidx[token].delete(doc.id)
# Then remove the key from the hash if
# there are no more docs.
@iidx.delete(token) if @iidx[token].empty?
end
# Once all tokens have been removed,
# remove the document from the collection.
@collection.remove_doc(doc)
@iidx
end
# Returns an array of document ids.
def find(str)
Rankrb::Tokenizer.new(str)
.tokenize
.map {|token| @iidx[token]}
.compact
.flatten
.uniq
.sort
end
# Define query_or and query_and methods.
%w(and or).each do |op|
define_method("query_#{op}") do |word_ary|
doc_ids = Array.new
word_ary.each {|word| doc_ids << find(word) }
case op
when 'and'
symbol = :& # Conjunctive query
when 'or'
symbol = :| # Disjunctive query
end
doc_ids.inject(symbol)
end
end
def commit!
if File.exist?(@index_file)
file = File.read @index_file
# Merge the new tokens
index = JSON.parse(file).merge(@iidx)
File.open(@index_file, 'w+') { |f| f.write(index.to_json) }
else
        # Create & write to file for the first time
        File.open(@index_file, 'w') { |f| f.write(@iidx.to_json) }
end
end
end
end
You'd run it as follows:
d1 = Rankrb::Document.new body: "new home sales top forecasts", id: 1
d2 = Rankrb::Document.new body: "home sales rise in july", id: 2
d3 = Rankrb::Document.new body: "increase in home sales in july", id: 3
d4 = Rankrb::Document.new body: "july new home sales rise", id: 4
coll = Rankrb::Collection.new docs: [d1, d2, d3, d4]
index = Rankrb::InvertedIndex.new collection: coll
index.build # Inverted-index gets built and stored into @iidx
index.find('top sales') # => [1, 2, 3, 4]
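The generated query_and / query_or methods work the same way for multiple words (the expected results below are my reading of the define_method block above):
index.query_and(['sales', 'top'])   # => [1]    (docs containing both stems)
index.query_or(['top', 'increase']) # => [1, 3] (docs containing either stem)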
This is where I'm a bit lost. The current process does the following:
- The find method in InvertedIndex returns an array of doc ids.
- The doc ids need to be found within Collection (which currently holds ALL documents).
- These documents need to be ranked.
- Collection needs to return only the list of Documents that were ranked back to find, so it can respond to the query.
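In code, the flow I have in mind looks roughly like this (the search method is hypothetical glue, not something that exists yet):
module Rankrb
  class InvertedIndex
    # Hypothetical: tie find, Collection and bm25 together.
    def search(query)
      ids = find(query)                                # tokens -> doc ids
      matched = @collection.docs.select { |doc| ids.include?(doc.id) }
      subset = Rankrb::Collection.new(docs: matched, query: query)
      subset.bm25                                      # rank and return Documents
    end
  end
end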
The questions:
- Storing all Documents in memory within Collection is the first thing that seems wrong to me, as this will eat up tons of RAM. However, I need them there in order for find to do something with the array of document ids that it returns. I can't exactly remove the Documents from memory since, as you can see from this case, the query matched tokens in all of them. Is there a better way to handle this situation?
- In order to rank Documents, I need to iterate over the array returned by find (e.g. [1, 2, 3, 4]). This means I'd need to iterate over all these documents and return a new array of documents to find, so the id and rank can be preserved and returned.
Am I wrong to think this is slightly unreasonable? Is this design incorrect?
1 Answer
I gave a talk at RailsConf a few years back where I walked through some fundamentally similar code. In my talk I used JRuby and OpenNLP, but you can achieve the same result in Ruby MRI using some of the other libraries I call out in my deck.
https://speakerdeck.com/brandonblack/natural-language-processing-in-ruby?slide=31
In my example, I'm doing a poor man's summary of a block of text by identifying and grabbing the top "most significant" sentences in the document. It's a different goal, but many of the basic ideas translate over to what you're trying to do.
For a basic text search, you need to build an index that returns the most relevant items based on a given search phrase. Once you've built that index, searching it is simple.
For each segment of searchable text:
- Tokenize all the input blocks of text.
- Filter out stop words that have low value. You can do this with a library, or for simple demo purposes a basic regex or blacklist will do.
- Use a word stemmer to identify the root word for each token in the text block.
- In your database, store an association between the word stem, the document being searched, and a weighting for how many times that word stem appears in the text.
- In your search method, follow the same process (tokenize, filter, stem) on the search phrase, then return the documents found in your index for those stem words, sorted by weight descending. There's a sketch of this pipeline below.
In the database, a row in the search index might look something like this:
stem_word_id, document_id, weight
If you had a table full of those associations, you'd just need to index that table across all 3 columns with a weight descending direction. For best results, consider using a database that leverages in-memory mapping (ex. MongoDB) for this index (similar to keeping this index in cache).
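To make that concrete, here's a minimal sketch of the pipeline in plain Ruby. A hash stands in for the database table above; the stop word list is a placeholder, and I'm using the same ruby-stemmer gem your Tokenizer already uses:
require 'lingua/stemmer'

STOPWORDS = %w[in the a an of].freeze # placeholder list

# Normalize a block of text into stem words.
def stems(text)
  text.downcase.split
      .reject { |token| STOPWORDS.include?(token) }
      .map { |token| Lingua.stemmer(token, language: 'en') }
end

# Record stem -> { document_id => weight } associations, standing in
# for rows of (stem_word_id, document_id, weight).
def index_document(index, doc_id, text)
  stems(text).tally.each do |stem, count|
    (index[stem] ||= {})[doc_id] = count
  end
end

# Tokenize, filter and stem the phrase, then return document ids
# sorted by accumulated weight, descending.
def search(index, phrase)
  scores = Hash.new(0)
  stems(phrase).each do |stem|
    (index[stem] || {}).each { |doc_id, weight| scores[doc_id] += weight }
  end
  scores.sort_by { |_id, weight| -weight }.map(&:first)
end

index = {}
index_document(index, 1, "new home sales top forecasts")
index_document(index, 2, "home sales rise in july")
search(index, "july sales") # => [2, 1]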
I hope that helps draw a rough picture of the process.
That being said, there are many more nuances to full-text search. This is a very simplistic, mostly naïve approach, and without an appropriate data backend it would not scale very far (you're spot on about storing these in memory being a long-term losing strategy). If you're considering this for a production system, I would recommend using a database with full-text search, or something like Apache Lucene, for which there are Ruby-based adapters.