sbo provides utilities for building and evaluating text predictors based on Stupid Back-off N-gram models in R. It includes functions such as:
- kgram_freqs(): Extract k-gram frequency tables from a text corpus.
- sbo_predictor(): Train a next-word predictor via Stupid Back-off.
- eval_sbo_predictor(): Test text predictions against an independent corpus.
You can install the latest release of sbo from CRAN:
    install.packages("sbo")

You can install the development version of sbo from GitHub:
    # install.packages("devtools")
    devtools::install_github("vgherard/sbo")
This example shows how to build a text predictor with sbo:
    library(sbo)
    p <- sbo_predictor(sbo::twitter_train, # 50k tweets, example dataset
                       N = 3, # Train a 3-gram model
                       dict = sbo::twitter_dict, # Top 1k words appearing in corpus
                       .preprocess = sbo::preprocess, # Preprocessing transformation
                       EOS = ".?!:;" # End-Of-Sentence characters
                       )
The object p can now be used to generate predictive text as follows:
    predict(p, "i love") # a character vector
    #> [1] "you" "it" "my"
    predict(p, "you love") # another character vector
    #> [1] "<EOS>" "me" "the"
    predict(p, c("i love", "you love", "she loves", "we love", "you love", "they love")) # a character matrix
    #>      [,1]    [,2]  [,3]
    #> [1,] "you"   "it"  "my"
    #> [2,] "<EOS>" "me"  "the"
    #> [3,] "you"   "my"  "me"
    #> [4,] "you"   "our" "it"
    #> [5,] "<EOS>" "me"  "the"
    #> [6,] "to"    "you" "and"
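To give a rough idea of what a Stupid Back-off score looks like, here is a minimal base-R sketch on a toy corpus. It is not sbo's actual implementation: the names `toy_corpus`, `count_kgrams()`, `get_count()` and `sbo_score()` are hypothetical, and the back-off penalty `lambda = 0.4` is simply the value commonly quoted in the literature, assumed here for illustration. The score of a word is its relative frequency after the full context if that k-gram was observed; otherwise the context is shortened by one word and the score is multiplied by `lambda`.

```r
# Toy illustration of Stupid Back-off scoring (hypothetical helpers, not sbo internals).
toy_corpus <- c("i love you", "i love it", "you love me", "i like you")

# Count every k-gram (k = 1..N) in the sentences, plus the total number of tokens.
count_kgrams <- function(sentences, N) {
  counts <- list()
  n_tokens <- 0
  for (s in sentences) {
    words <- strsplit(s, " ", fixed = TRUE)[[1]]
    n_tokens <- n_tokens + length(words)
    for (k in seq_len(min(N, length(words)))) {
      for (i in seq_len(length(words) - k + 1)) {
        key <- paste(words[i:(i + k - 1)], collapse = " ")
        counts[[key]] <- get_count(counts, key) + 1
      }
    }
  }
  list(counts = counts, n_tokens = n_tokens)
}

# Look up a k-gram count, returning 0 for unseen k-grams.
get_count <- function(counts, key) {
  value <- counts[[key]]
  if (is.null(value)) 0 else value
}

# Stupid Back-off score of `word` given `context` (a character vector of preceding words).
sbo_score <- function(model, context, word, lambda = 0.4) {
  if (length(context) == 0) {
    # Base case: unigram relative frequency.
    return(get_count(model$counts, word) / model$n_tokens)
  }
  full <- get_count(model$counts, paste(c(context, word), collapse = " "))
  if (full > 0) {
    # The k-gram was observed: use its relative frequency given the context.
    return(full / get_count(model$counts, paste(context, collapse = " ")))
  }
  # Otherwise drop the oldest context word and apply the penalty.
  lambda * sbo_score(model, context[-1], word, lambda)
}

model <- count_kgrams(toy_corpus, N = 3)
sbo_score(model, c("i", "love"), "you")  # observed trigram: 1 / 2
sbo_score(model, c("i", "love"), "me")   # backs off once: 0.4 * (1 / 3)
```

Note that these scores are not normalized probabilities; they are only used to rank candidate words, which is what makes the method cheap to compute.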
For more general-purpose utilities to work with n-gram models, you can also check out my package {kgrams}.
For help, see the sbo website.