leiless/sqlite3-ngram

sqlite3-ngram

ngram is a SQLite3 FTS5 n-gram tokenizer. It tokenizes input text at the computational-linguistics level.

For the input text Hello 新 世界:

  • ngram = 1

    Hello, , ,

  • ngram = 2

    Hello, , 新世, 世界

  • ngram = 3

    Hello, , 新世, 新世界

Tokenization is based on UTF-8 characters and character-category boundaries.

The supported ngram values are currently in the range [1, 4]. Larger values could be supported, but they are usually unnecessary.

This tokenizer extension can be used as a generic fallback tokenizer for full-text search (FTS).
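The idea of splitting on character-category boundaries and then emitting n-grams can be sketched in Python. This is only an illustration under stated assumptions, not the extension's actual algorithm: the names `char_class`, `split_runs` and `ngram_tokens` are hypothetical, the three coarse categories are a simplification, and whitespace and punctuation are simply dropped here. The real tokenizer may treat whitespace, short runs and other categories differently.

```python
import unicodedata
from itertools import groupby


def char_class(ch: str) -> str:
    """Coarse character category (an assumption made for this sketch)."""
    if ord(ch) < 128 and ch.isalnum():
        return "ascii_word"
    if unicodedata.category(ch).startswith(("L", "N")):
        return "other_word"  # non-ASCII letters and digits, e.g. CJK ideographs
    return "other"  # whitespace, punctuation, symbols


def split_runs(text: str):
    """Group consecutive characters that share the same coarse category."""
    return [(cls, "".join(chars)) for cls, chars in groupby(text, key=char_class)]


def ngram_tokens(text: str, n: int = 2):
    """Keep ASCII words whole; slide an n-character window over other word runs."""
    if not 1 <= n <= 4:
        raise ValueError("n must be in the range [1, 4]")
    tokens = []
    for cls, run in split_runs(text):
        if cls == "ascii_word":
            tokens.append(run)
        elif cls == "other_word":
            if len(run) <= n:
                tokens.append(run)  # run shorter than the window: emit as-is
            else:
                tokens.extend(run[i:i + n] for i in range(len(run) - n + 1))
        # runs of class "other" are dropped in this sketch
    return tokens


print(ngram_tokens("Hello 新世界", 2))  # ['Hello', '新世', '世界']
```

Keeping ASCII words whole while windowing over CJK runs reflects the examples above, where Hello stays a single token for every N.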

Build

# Tested under podman; docker should also work.
container/build.sh

Usage

-- First, load the ngram extension
.load build/libngram.so
-- By default N = 2; valid N is in the range [1, 4]
CREATE VIRTUAL TABLE t1 USING fts5(x, tokenize = 'ngram');
-- Alternatively, set N explicitly (replace N with a number in [1, 4])
CREATE VIRTUAL TABLE t2 USING fts5(x, tokenize = 'ngram gram N');
-- Or see sql/load-ext.sql for example usage:
-- sqlite3 < sql/load-ext.sql

Advanced usage

You can combine this tokenizer with SQLite's built-in porter tokenizer:

.load build/libngram.so
CREATE VIRTUAL TABLE t1 USING fts5(x, tokenize = 'porter ngram gram N');

In that case, if you index the word direct, then directed, directing, direction, directly, and so on can all be coalesced into direct and thus match.
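The stemming behaviour of the porter wrapper can be tried from Python's standard sqlite3 module. The sketch below is a stand-in, not this project's setup: it wraps SQLite's built-in unicode61 tokenizer instead of ngram, because the extension is not loaded here, and it assumes the Python build ships SQLite with FTS5 enabled. The helper name `stemmed_matches` is hypothetical.

```python
import sqlite3


def stemmed_matches(query: str):
    """Index a few words behind the porter wrapper and return the rows that match."""
    conn = sqlite3.connect(":memory:")
    # Stand-in tokenizer: 'unicode61' takes the place of 'ngram gram N' here.
    conn.execute(
        "CREATE VIRTUAL TABLE t1 USING fts5(x, tokenize = 'porter unicode61')"
    )
    conn.executemany(
        "INSERT INTO t1(x) VALUES (?)",
        [("directed",), ("directing",), ("direction",), ("unrelated",)],
    )
    rows = conn.execute(
        "SELECT x FROM t1 WHERE t1 MATCH ? ORDER BY rowid", (query,)
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


print(stemmed_matches("direct"))
```

With the ngram extension available, the same pattern should apply after calling `conn.enable_load_extension(True)` and `conn.load_extension("build/libngram.so")`, then using 'porter ngram gram N' in the tokenize clause.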

Limitation

Currently, only UTF-8 strings are supported for tokenization, though this is usually not a big concern.

Credits

This project was inspired by the following project:

wangfenjin/simple - A SQLite FTS5 extension supporting Chinese (Simplified and Traditional) and Pinyin

TODO

  • Implement ngram_highlight() function
  • Add more test cases
  • Enable build & test CI

About

SQLite3 FTS5 n-gram tokenizer (WIP)
