purlu
purlu is a full-text search engine.
Introduction
purlu is designed for collections with a relatively small number of documents (let's say fewer than five thousand).
What's more, it's designed to run on relatively low-resource machines (like an office PC with 4 gigabytes of RAM and a Pentium that's been turned into a nice server).
However, the idea is to offer plenty of cool features. purlu isn't simple, but it tries to be light.
purlu can be used via its CLI, which exposes its search features through a JSON HTTP API (there is a container image in the packages of this repository), or by embedding its Rust library directly into a project, as you might do with SQLite.
Text analysis
purlu works by matching query terms with document terms.
So, for example, the query bunny may match the document A very cute bunny!.
But how does purlu obtain these terms? Thanks to the text analyzer, which is applied to both queries and documents.
For example, if we run the analyzer on this text:
"Be gay, do crime!"
We may get back (depending on the analyzer's configuration):
["be", "gay", "do", "crime"]
purlu's analyzer can optionally:
- extract text from HTML using swc's HTML parser;
- segment text into words using unicode-segmentation;
- lowercase text;
- apply regex replacements;
- apply stemming using rust-stemmers;
- and apply ASCII folding using asciifolding.
Disabling all these features can be useful. For example, we may want to store external identifiers as is.
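To make the pipeline concrete, here is a minimal sketch, in Python and for illustration only (purlu itself is written in Rust and uses the crates listed above), of what the tokenize and lowercase steps conceptually do:

import re

# Toy analyzer: segment into words, then lowercase.
# purlu uses unicode-segmentation for word boundaries; the regex
# below is only a rough stand-in.
def analyze(text: str) -> list[str]:
    words = re.findall(r"\w+", text)
    return [word.lower() for word in words]

print(analyze("Be gay, do crime!"))  # ['be', 'gay', 'do', 'crime']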
I think the Snowball project's explanation on its What is Stemming? page is great, so I'll just quote it here:
Stemming maps different forms of the same word to a common "stem" - for example, the English stemmer maps connection, connections, connective, connected, and connecting to connect. So a search for connected would also find documents which only have the other forms.
This stem form is often a word itself, but this is not always the case as this is not a requirement for text search systems, which are the intended field of use. We also aim to conflate words with the same meaning, rather than all words with a common linguistic root (so awe and awful don't have the same stem), and over-stemming is more problematic than under-stemming so we tend not to stem in cases that are hard to resolve. If you want to always reduce words to a root form and/or get a root form which is itself a word then Snowball's stemming algorithms likely aren't the right answer.
Scoring
purlu assigns a relevance score to documents that match a query, and then sorts the results by this score in descending order.
It uses the BM25F ranking function, as described in the article Integrating the Probabilistic Model BM25/BM25F into Lucene.
It's actually an extension of BM25 supporting documents with multiple fields (like a title and a description, for example).
This ranking function is quite cool:
- The more frequent a query term is in a document, the higher its score.
- But not too much either: the increase is sublinear, so a document that repeats a term many times doesn't become disproportionately important.
- The frequency of terms in the entire collection is taken into account. If a term is rare in the collection, its importance in the score will be higher, and if a term is frequent in the collection, its importance in the score will be lower.
- Document length is also taken into account. So, if a document is long, the importance of its terms in the score will be lower. This compensates for the fact that a long document is likely to have more terms that match the query.
Thanks to its multi-field support, purlu queries can take as parameters weights to be assigned to each field (known as boosts).
These weights can be used to give more importance to a title than a description, for example.
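To illustrate the properties above, here is a simplified single-field BM25 term score in Python. This is a sketch using common default constants (k1 and b), not purlu's code; purlu implements the BM25F variant described in the article above.

import math

# Contribution of one query term to a document's BM25 score.
def bm25_term_score(tf, doc_len, avg_doc_len, doc_count, doc_freq, k1=1.2, b=0.75):
    # Rare terms in the collection weigh more (IDF).
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    # Long documents are penalized (length normalization).
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # Term frequency saturates: repetitions help, but sublinearly.
    return idf * (tf * (k1 + 1)) / (tf + k1 * norm)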
Indexes
purlu indexes are immutable, stored in memory and not persisted on disk.
This means that if you want to add a document to an index, or delete one, you'll have to recreate the index.
It also means that if purlu is restarted, the indexes will disappear.
This is possible because purlu is designed for small datasets.
With such small datasets, it may be desirable anyway to re-index from scratch each time you wish to synchronize a collection of documents (stored in PostgreSQL, for example) with purlu.
This helps avoid bugs where, for example, a change has been missed and is only taken into account days later, when a re-indexation finally takes place.
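For example, a synchronization job can rebuild the whole index periodically. Here is a sketch in Python using requests; the index name, fields and rows are made up for the example, and fetching the rows from PostgreSQL is left out:

import requests

def sync(rows):
    body = {
        "fields": [
            {"name": "id", "stored": True},
            {
                "name": "title",
                "analyzer": {"tokenize": True},
                "indexed": True,
                "stored": True,
            },
        ],
        "documents": [{"id": str(row["id"]), "title": row["title"]} for row in rows],
    }
    # POST /indexes/:index_id creates the index, replacing any
    # existing one (see the HTTP JSON API section below).
    requests.post("http://localhost:1312/indexes/articles", json=body).raise_for_status()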
Another important thing is that purlu indexes need a schema.
This schema actually consists of a list of fields. For each one, you can specify:
- whether it will be indexed or not: if it is indexed, its terms will be used in the search;
- whether it will be stored or not: if it is stored, its values will be returned with the results.
Queries
We already know that queries take as a parameter a text that will be analyzed.
We also know that queries can assign weights to fields.
Well, queries also support:
- ordering by score or document id, and ascending or descending;
- filtering using filter (supported operators: equal, superset, and, or and not);
- pagination using offset and limit;
- prefix matching using the prefix option, on all terms (all) or only the last one (last, this is inspired by Xapian's partially entered query matching);
- counting the number of documents grouped by unique values for some fields, that is aggregation;
- highlighting results using the highlight field names list;
- and optionally returning all documents if the query is empty (using the all_if_empty option).
By determining the order of the list of documents when creating an index, we can then order by document id to sort by a property of our choice (this is inspired by Xapian).
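For example, if the documents were inserted sorted by release date, a request body like this one (the format is detailed in the HTTP JSON API section below) returns matches in that order instead of by score:

{
  "query": "rabbit",
  "order": "document_id:asc"
}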
The movies example
This repository contains a movies folder.
Inside, there is a movies.json file. It's the Movies Dataset from Meilisearch.
It weighs 19MB and contains 31968 movie descriptions.
There is also an index.py script. It will index the title and overview fields.
It expects a purlu server to be available at localhost:1312.
Finally, there is a search.html document that allows searching the indexed dataset with a simple interface.
On my laptop (which is a potato), the query naruto finishes in less than a millisecond.
HTTP JSON API
analyzer objects
Schema fields and queries optionally take an analyzer object.
This object is used to configure text analysis.
All its fields are optional and false by default (this also applies if the object is not defined at all).
- a html boolean, if true the text will be parsed as HTML and the resulting document text will be extracted;
- a tokenize boolean, if true the text will be segmented into words;
- a lowercase boolean, if true the text will be lowercased;
- a replacements list of replacement objects;
- a stemmer_language string, if defined the text will be lowercased (overriding the lowercase boolean) and stemmed;
- a stemmed_only boolean, if true only stemmed words will be returned (by default both unstemmed and stemmed words are returned);
- an ascii_folding boolean, if true characters which are not ASCII will be converted into their ASCII equivalents, if one exists.
A replacement object is made of:
- a pattern string, the regex;
- an optional replacement string, the replacement (by default matches will be removed);
- and an optional all boolean, if true all matches will be replaced (false by default).
Here is an example replacement object for the elision of French articles:
{
  "pattern": "^(l|m|t|qu|n|s|j|d|c|jusqu|quoiqu|lorsqu|puisqu)['’]"
}
(This list of French articles is from Lucene's FrenchAnalyzer.)
Here is the list of available stemmer_language:
arabic, danish, dutch, english, french, german, greek, hungarian, italian, norwegian, portuguese, romanian, russian, spanish, swedish, tamil, turkish
Here is an example analyzer object:
{
"lowercase": true
}
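And here is a fuller, made-up one, combining several of the options described above:

{
  "html": true,
  "tokenize": true,
  "stemmer_language": "french",
  "ascii_folding": true,
  "replacements": [
    {
      "pattern": "^(l|m|t|qu|n|s|j|d|c|jusqu|quoiqu|lorsqu|puisqu)['’]"
    }
  ]
}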
POST /indexes/:index_id
This route will create an index, and if an index with the same identifier already exists, replace it.
purlu expects the request body to be a JSON object containing:
- a schema, specifically a list of fields, which are objects made up of:
  - a name string;
  - optionally an analyzer object;
  - optionally an indexed boolean, which by default is false;
  - optionally a stored boolean, which by default is false;
- and a documents list.
Here is an example request:
{
"fields": [
{
"name": "id",
"stored": true
},
{
"name": "title",
"analyzer": {
"tokenize": true,
"stemmer_language": "english"
},
"indexed": true,
"stored": true
},
{
"name": "description",
"analyzer": {
"tokenize": true,
"stemmer_language": "english"
},
"indexed": true
}
],
"documents": [
{
"id": "12",
"title": "On cute rabbits",
"description": "Cute rabbits are so cute!",
"tags": ["13", "26", "12", "24"],
}
]
}
There is no reserved field name.
You may omit any field in the documents.
If a document has fields with a name not declared in the schema (the fields list), these fields will be ignored.
Document fields can only be strings or lists of strings.
If you wish to store a number, for example an identifier from your main database, you'll need to send it as a string.
If a field contains a list:
- analysis will be applied to each element, then the results will be combined into a single sequence of tokens;
- aggregation will make groups for each element and not for the list itself;
- highlighting will be applied to each element, and the result will be the list of highlighted elements.
DELETE /indexes/:index_id
This route will delete an index if it exists and do nothing otherwise.
POST /indexes/:index_id/search
This route will search an index.
purlu expects the request body to be a JSON object containing:
- a query string;
- optionally an analyzer object;
- optionally an all_if_empty boolean (by default false);
- optionally a boosts object mapping field names (those not declared in the schema will be ignored) to a weight (by default 1.0);
- optionally an order string (by default score:desc; the first part can also be document_id, and the second part can also be asc);
- optionally a filter object;
- optionally an offset (by default 0) and a limit (by default no limit);
- optionally a prefix string: set it to all to enable prefix matching for all query terms, or to last to enable it for the last one only;
- optionally an aggregate list of stored field names (by default an empty list);
- and optionally a highlight list of stored field names (those not declared in the schema as stored (or at all) will be ignored; by default an empty list).
A filter object must contain a single entry whose key is the operator:
{"equal": [<field name>, <value>]}{"superset": [<field name>, <value>]}(do all texts of<value>exist in this document<field name>? this is useful for drilldown){"and": <list of filter objects>}{"or": <list of filter objects>}{"not": <filter object>}
Here is an example request:
{
"query": "Where are all the cute rabbits?",
"analyzer": {
"tokenize": true,
"stemmer_language": "english"
},
"boosts": {
"title": 2.0
},
"prefix": "all",
"highlight": ["title"]
}
And an example response:
{
"count": 1,
"hits": [
{
"score": 1.23456789,
"values": {
"id": "12",
"title": "On cute rabbits"
},
"highlighted": {
"title": "On <mark>cute</mark> <mark>rabbits</mark>"
}
}
]
}
Only stored fields will be present in values objects.
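Putting it together, here is how the example request above could be sent from Python with requests (the index identifier is made up):

import requests

body = {
    "query": "Where are all the cute rabbits?",
    "analyzer": {"tokenize": True, "stemmer_language": "english"},
    "boosts": {"title": 2.0},
    "prefix": "all",
    "highlight": ["title"],
}
response = requests.post("http://localhost:1312/indexes/rabbits/search", json=body)
# The response contains a count and a list of hits (see above).
for hit in response.json()["hits"]:
    print(hit["score"], hit["values"]["title"])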
POST /indexes/:index_id/highlight
This route will highlight texts with a query on an index.
This can be used to highlight structured content, such as HTML, as we can extract the textual parts first.
The highlighted texts will be in the same order as those in the request.
purlu expects the request body to be a JSON object containing:
- a texts string list;
- a query string;
- optionally an analyzer object;
- and optionally a prefix string.
Here is an example request:
{
"texts": [
"Rabbits are cute.",
"Everything about rabbits."
],
"query": "rabbit",
"analyzer": {
"tokenize": true,
"stemmer_language": "english"
},
"prefix": "all"
}
And an example response:
{
"highlighted": [
"<mark>Rabbits</mark> are cute.",
"Everything about <mark>rabbits</mark>."
]
}