purlu
purlu is a full-text search engine.
Introduction
purlu is designed for collections with a relatively small number of documents (let's say fewer than five thousand).
What's more, it's designed to run on relatively low-resource machines (like an office PC with 4 gigabytes of RAM and a Pentium that's been turned into a nice server).
However, the idea is to offer plenty of cool features. purlu isn't simple, but it tries to be light.
purlu can be used via its CLI, which exposes its search features through a JSON HTTP API (there is a container image in the packages of this repository), or by embedding its Rust library directly into a project, as you might do with SQLite.
Text analysis
purlu works by matching query terms with document terms.
So, for example, the query bunny may match the document A very cute bunny!.
But how does purlu obtain these terms? Thanks to the text analyzer, which is applied to both queries and documents.
For example, if we run the analyzer on this text:
"Be gay, do crime!"
We may get back (depending on the analyzer's configuration):
["be", "gay", "do", "crime"]
purlu's analyzer can optionally:
- extract text from HTML using swc's HTML parser;
- segment text into words using unicode-segmentation;
- lowercase text;
- apply regex replacements;
- apply stemming using rust-stemmers;
- and apply ASCII folding using asciifolding.
Disabling all these features can be useful. For example, we may want to store external identifiers as is.
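To make the pipeline concrete, here is a minimal sketch, in Python and for illustration only (purlu itself is written in Rust and uses the crates listed above), of what the tokenize and lowercase steps conceptually do:

import re

# Toy analyzer: segment into words, then lowercase.
# purlu uses unicode-segmentation for word boundaries; the regex
# below is only a rough stand-in.
def analyze(text: str) -> list[str]:
    words = re.findall(r"\w+", text)
    return [word.lower() for word in words]

print(analyze("Be gay, do crime!"))  # ['be', 'gay', 'do', 'crime']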
I think the Snowball project's explanation on its What is Stemming? page is great, so I'll just quote it here:
Stemming maps different forms of the same word to a common "stem" - for example, the English stemmer maps connection, connections, connective, connected, and connecting to connect. So a search for connected would also find documents which only have the other forms.
This stem form is often a word itself, but this is not always the case as this is not a requirement for text search systems, which are the intended field of use. We also aim to conflate words with the same meaning, rather than all words with a common linguistic root (so awe and awful don't have the same stem), and over-stemming is more problematic than under-stemming so we tend not to stem in cases that are hard to resolve. If you want to always reduce words to a root form and/or get a root form which is itself a word then Snowball's stemming algorithms likely aren't the right answer.
Scoring
purlu assigns a relevance score to documents that match a query, and then sorts the results by this score in descending order.
It uses the BM25F ranking function, as described in the article Integrating the Probabilistic Model BM25/BM25F into Lucene.
It's actually an extension of BM25 supporting documents with multiple fields (like a title and a description, for example).
This ranking function is quite cool:
- The more frequent a query term is in a document, the higher its score.
- But not too much either: the increase is sublinear, so a document that repeats a term many times doesn't become disproportionately important.
- The frequency of terms in the entire collection is taken into account. If a term is rare in the collection, its importance in the score will be higher, and if a term is frequent in the collection, its importance in the score will be lower.
- Document length is also taken into account. So, if a document is long, the importance of its terms in the score will be lower. This compensates for the fact that a long document is likely to have more terms that match the query.
Thanks to its multi-field support, purlu queries can take as parameters weights to be assigned to each field (known as boosts).
These weights can be used to give more importance to a title than a description, for example.
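To illustrate the properties above, here is a simplified single-field BM25 term score in Python. This is a sketch using common default constants (k1 and b), not purlu's code; purlu implements the BM25F variant described in the article above.

import math

# Contribution of one query term to a document's BM25 score.
def bm25_term_score(tf, doc_len, avg_doc_len, doc_count, doc_freq, k1=1.2, b=0.75):
    # Rare terms in the collection weigh more (IDF).
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    # Long documents are penalized (length normalization).
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # Term frequency saturates: repetitions help, but sublinearly.
    return idf * (tf * (k1 + 1)) / (tf + k1 * norm)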
Indexes
purlu indexes are immutable, stored in memory and not persisted on disk.
This means that if you want to add a document to an index, or delete one, you'll have to recreate the index.
It also means that if purlu is restarted, the indexes will disappear.
This is possible because purlu is designed for small datasets.
With such small datasets, it may be desirable anyway to re-index from scratch each time you wish to synchronize a collection of documents (stored in PostgreSQL, for example) with purlu.
This helps avoid bugs where, for example, a change has been missed and is only taken into account days later, when a re-indexation finally takes place.
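For example, a synchronization job can rebuild the whole index periodically. Here is a sketch in Python using requests; the index name, fields and rows are made up for the example, and fetching the rows from PostgreSQL is left out:

import requests

def sync(rows):
    body = {
        "fields": [
            {"name": "id", "stored": True},
            {
                "name": "title",
                "analyzer": {"tokenize": True},
                "indexed": True,
                "stored": True,
            },
        ],
        "documents": [{"id": str(row["id"]), "title": row["title"]} for row in rows],
    }
    # POST /indexes/:index_id creates the index, replacing any
    # existing one (see the HTTP JSON API section below).
    requests.post("http://localhost:1312/indexes/articles", json=body).raise_for_status()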
Another important thing is that purlu indexes need a schema.
This schema actually consists of a list of fields. For each one, you can specify:
- whether it will be indexed or not: if it is indexed, its terms will be used in the search;
- whether it will be stored or not: if it is stored, its values will be returned with the results.
Queries
We already know that queries take as a parameter a text that will be analyzed.
We also know that queries can assign weights to fields.
Well, queries also support:
- ordering by score or document id, and ascending or descending;
- filtering using filter (supported operators: equal, superset, and, or and not);
- pagination using offset and limit;
- prefix matching using the prefix option, on all terms (all) or only the last one (last, this is inspired by Xapian's partially entered query matching);
- counting the number of documents grouped by unique values for some fields, that is aggregation;
- highlighting results using the highlight field names list;
- and optionally returning all documents if the query is empty (using the all_if_empty option).
By determining the order of the list of documents when creating an index, we can then order by document id to sort by a property of our choice (this is inspired by Xapian).
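For example, if the documents were inserted sorted by release date, a request body like this one (the format is detailed in the HTTP JSON API section below) returns matches in that order instead of by score:

{
  "query": "rabbit",
  "order": "document_id:asc"
}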
The movies example
This repository contains a movies folder.
Inside, there is a movies.json file. It's the Movies Dataset from Meilisearch.
It weighs 19MB and contains 31968 movie descriptions.
There is also an index.py script. It will index the title and overview fields.
It expects a purlu server to be available at localhost:1312.
Finally, there is a search.html document that allows searching the indexed dataset with a simple interface.
On my laptop (which is a potato), the query naruto finishes in less than a millisecond.
HTTP JSON API
analyzer objects
Schema fields and queries optionally take an analyzer object.
This object is used to configure text analysis.
All its fields are optional and false by default (this also applies if the object is not defined at all).
- a html boolean, if true the text will be parsed as HTML and the resulting document text will be extracted;
- a tokenize boolean, if true the text will be segmented into words;
- a lowercase boolean, if true the text will be lowercased;
- a replacements list of replacement objects;
- a stemmer_language string, if defined the text will be lowercased (overriding the lowercase boolean) and stemmed;
- a stemmed_only boolean, if true only stemmed words will be returned (by default both unstemmed and stemmed words are returned);
- an ascii_folding boolean, if true characters which are not ASCII will be converted into their ASCII equivalents, if one exists.
A replacement object is made of:
- a pattern string, the regex;
- an optional replacement string, the replacement (by default matches will be removed);
- and an optional all boolean, if true all matches will be replaced (false by default).
Here is an example replacement object for the elision of French articles:
{
  "pattern": "^(l|m|t|qu|n|s|j|d|c|jusqu|quoiqu|lorsqu|puisqu)['’]"
}
(This list of French articles is from Lucene's FrenchAnalyzer.)
Here is the list of available stemmer_language:
arabic, danish, dutch, english, french, german, greek, hungarian, italian, norwegian, portuguese, romanian, russian, spanish, swedish, tamil, turkish
Here is an example analyzer object:
{
"lowercase": true
}
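And here is a fuller, made-up one, combining several of the options described above:

{
  "html": true,
  "tokenize": true,
  "stemmer_language": "french",
  "ascii_folding": true,
  "replacements": [
    {
      "pattern": "^(l|m|t|qu|n|s|j|d|c|jusqu|quoiqu|lorsqu|puisqu)['’]"
    }
  ]
}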
POST /indexes/:index_id
This route will create an index, and if an index with the same identifier already exists, replace it.
purlu expects the request body to be a JSON object containing:
- a schema, specifically a list of fields, which are objects made up of:
  - a name string;
  - optionally an analyzer object;
  - optionally an indexed boolean, which by default is false;
  - optionally a stored boolean, which by default is false;
- and a documents list.
Here is an example request:
{
"fields": [
{
"name": "id",
"stored": true
},
{
"name": "title",
"analyzer": {
"tokenize": true,
"stemmer_language": "english"
},
"indexed": true,
"stored": true
},
{
"name": "description",
"analyzer": {
"tokenize": true,
"stemmer_language": "english"
},
"indexed": true
}
],
"documents": [
{
"id": "12",
"title": "On cute rabbits",
"description": "Cute rabbits are so cute!",
"tags": ["13", "26", "12", "24"],
}
]
}
There is no reserved field name.
You may omit any field in the documents.
If a document has fields with a name not declared in the schema (the fields list), these fields will be ignored.
Document fields can only be strings or lists of strings.
If you wish to store a number, for example an identifier from your main database, you'll need to send it as a string.
If a field contains a list:
- analysis will be applied to each element, then the results will be combined into a single sequence of tokens;
- aggregation will make groups for each element and not for the list itself;
- highlighting will be applied to each element, and the result will be the list of highlighted elements.
DELETE /indexes/:index_id
This route will delete an index if it exists and do nothing otherwise.
POST /indexes/:index_id/search
This route will search an index.
purlu expects the request body to be a JSON object containing:
- a query string;
- optionally an analyzer object;
- optionally an all_if_empty boolean (by default false);
- optionally a boosts object mapping field names (those not declared in the schema will be ignored) to a weight (by default 1.0);
- optionally an order string (by default score:desc; the first part can also be document_id, and the second part can also be asc);
- optionally a filter object;
- optionally an offset (by default 0) and a limit (by default no limit);
- optionally a prefix string: set it to all to enable prefix matching for all query terms, or to last to enable it for the last one only;
- optionally an aggregate list of stored field names (by default an empty list);
- and optionally a highlight list of stored field names (those not declared in the schema as stored (or at all) will be ignored; by default an empty list).
A filter object must contain a single entry whose key is the operator:
{"equal": [<field name>, <value>]}{"superset": [<field name>, <value>]}(do all texts of<value>exist in this document<field name>? this is useful for drilldown){"and": <list of filter objects>}{"or": <list of filter objects>}{"not": <filter object>}
Here is an example request:
{
"query": "Where are all the cute rabbits?",
"analyzer": {
"tokenize": true,
"stemmer_language": "english"
},
"boosts": {
"title": 2.0
},
"prefix": "all",
"highlight": ["title"]
}
And an example response:
{
"count": 1,
"hits": [
{
"score": 1.23456789,
"values": {
"id": "12",
"title": "On cute rabbits"
},
"highlighted": {
"title": "On <mark>cute</mark> <mark>rabbits</mark>"
}
}
]
}
Only stored fields will be present in values objects.
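Putting it together, here is how the example request above could be sent from Python with requests (the index identifier is made up):

import requests

body = {
    "query": "Where are all the cute rabbits?",
    "analyzer": {"tokenize": True, "stemmer_language": "english"},
    "boosts": {"title": 2.0},
    "prefix": "all",
    "highlight": ["title"],
}
response = requests.post("http://localhost:1312/indexes/rabbits/search", json=body)
# The response contains a count and a list of hits (see above).
for hit in response.json()["hits"]:
    print(hit["score"], hit["values"]["title"])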
POST /indexes/:index_id/highlight
This route will highlight texts with a query on an index.
This can be used to highlight structured content, such as HTML, as we can extract the textual parts first.
The highlighted texts will be in the same order as those in the request.
purlu expects the request body to be a JSON object containing:
- a texts string list;
- a query string;
- optionally an analyzer object;
- and optionally a prefix string.
Here is an example request:
{
"texts": [
"Rabbits are cute.",
"Everything about rabbits."
],
"query": "rabbit",
"analyzer": {
"tokenize": true,
"stemmer_language": "english"
},
"prefix": "all"
}
And an example response:
{
"highlighted": [
"<mark>Rabbits</mark> are cute.",
"Everything about <mark>rabbits</mark>."
]
}