Introduction to the RcppML package

Zach DeBruine

2021年09月21日

The ‘RcppML’ package provides high-performance machine learning algorithms using Rcpp with a focus on matrix factorization.

Installation

Install the latest development version of RcppML from github:

library(devtools)
install_github("zdebruine/RcppML")

library(RcppML)
library(Matrix)
#> Warning: package 'Matrix' was built under R version 4.0.5

Non-Negative Least Squares

RcppML contains extremely fast NNLS solvers. Use the nnls function to solve systems of equations subject to non-negativity constraints.

The RcppML::solve function solves the equation for where is symmetric positive definite matrix of dimensions and is a vector of length or a matrix of dimensions .

# construct a system of equations
X <-matrix(rnorm(2000),100,20)
btrue <-runif(20)
y <-X %*%btrue +rnorm(100)
a <-crossprod(X)
b <-crossprod(X, y)

# solve the system of equations
x <-RcppML::nnls(a, b)

# use only coordinate descent
x <-RcppML::nnls(a, b, fast_nnls = FALSE, cd_maxit = 1000, cd_tol = 1e-8)

RcppML::solve implements a new and fastest-in-class algorithm for non-negative least squares:

initialization is done by solving for the unconstrained least squares solution.
forward active set tuning (FAST) provides a near-exact solution (often exact for well-conditioned systems) by setting all negative values in the unconstrained solution to zero, re-solving the system for only positive values, and repeating the process until the solution for values not constrained to zero is strictly positive. Set cd_maxit = 0 to use only the FAST solver.
Coordinate descent refines the FAST solution and finds the best solution discoverable by gradient descent. The coordinate descent solution is only used if it gives a better error than the FAST solution. Generally, coordinate descent re-introduces variables constrained to zero by FAST back into the feasible set, but does not dramatically change the solution.

Projecting Linear Models

Project dense linear factor models onto real-valued sparse matrices (or any matrix coercible to Matrix::dgCMatrix) using RcppML::project.

RcppML::project solves the equation for .

# simulate a sparse matrix
A <-rsparsematrix(1000, 100, 0.1)

# simulate a linear factor model
w <-matrix(runif(1000 *10), 1000, 10)

# project the model
h <-RcppML::project(A, w)

Non-negative Matrix Factorization

RcppML::nmf finds a non-negative matrix factorization by alternating least squares (alternating projections of linear models and ).

There are several ways in which the NMF algorithm differs from other currently available methods:

Diagonalized scaling of factors to sum to 1, permitting convex L1 regularization along the entire solution path
Fast stopping criteria, based on correlation between models across consecutive iterations
Extremely fast algorithms using the Eigen C++ library, optimized for matrices that are >90% sparse
Support for NMF or unconstrained matrix factorization
Parallelized using OpenMP multithreading

The following example runs rank-10 NMF on a random 1000 x 1000 matrix that is 90% sparse:

A <-rsparsematrix(100, 100, 0.1)
model <-RcppML::nmf(A, 10, verbose = F)

w <-model$w
d <-model$d
h <-model$h
model_tolerance <-tail(model$tol, 1)

Tolerance is simply a measure of the average correlation between \eqn{w_{i-1} and and and for a given iteration .

For symmetric factorizations (when ), tolerance becomes a measure of the correlation between and , and diagonalization is automatically performed to enforce symmetry:

A_sym <-as(crossprod(A), "dgCMatrix")

model <-RcppML::nmf(A_sym, 10, verbose = F)

Mean squared error of a factorization can be calculated for a given model using the RcppML::mse function:

RcppML::mse(A_sym, model$w, model$d, model$h)
#> [1] 1.319477