I have some nominal variables encoded as integers (the integers carry no order), which I would like to binary-encode (not dummy variables and not one-hot encoding!). The following code is what I came up with, adapted from other code I found. Is this a valid and scalable approach? Thanks!
library(binaryLogic)
df <- data.frame(x1 = c(1, 1, 2, 3), x2 = c(1, 2, 3, 4))
encode_binary <- function(x, name = "binary_") {
  x2 <- as.binary(x)
  maxlen <- max(sapply(x2, length))
  x2 <- lapply(x2, function(y) {
    l <- length(y)
    if (l < maxlen) {
      y <- c(rep(0, (maxlen - l)), y)
    }
    y
  })
  d <- as.data.frame(t(as.data.frame(x2)))
  rownames(d) <- NULL
  colnames(d) <- paste0(name, 1:maxlen)
  d
}
df <- cbind(df, encode_binary(df[["x1"]], name = "binary_x1_"))
df <- cbind(df, encode_binary(df[["x2"]], name = "binary_x2_"))
df
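To make the target encoding concrete, here is a minimal base-R sketch of what the columns should contain: each code is written in base 2, one column per bit, so the number of columns grows with the logarithm of the largest code rather than with the number of levels. The helper names `bits_needed` and `bit_pattern` are hypothetical and exist only for this illustration; they assume non-negative whole-number codes.

```r
# Hypothetical helper: how many bit columns are needed for codes up to max_code.
bits_needed <- function(max_code) {
  n <- 1L
  while (2^n <= max_code) n <- n + 1L
  n
}

# Hypothetical helper: bits of one code, most significant bit first,
# left-padded with zeros to `width` positions.
bit_pattern <- function(code, width) {
  as.integer((code %/% 2^((width - 1):0)) %% 2)
}

bits_needed(3)     # x1 has codes up to 3
bits_needed(4)     # x2 has codes up to 4
bit_pattern(3, 3)  # the code 3 written with three bit columns
```

With the example data above, `x1` therefore needs two bit columns and `x2` needs three.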
1 Answer
If we test on a larger vector, your approach is quite slow:
test_vec <- 1:1e5
system.time(v1 <- encode_binary(test_vec, name = "binary_x1_"))
# user system elapsed
# 22.23 0.08 22.37
Based on this SO question, I managed to write code that runs a lot faster:
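encode_binary2 <- function(x, name = "binary_") {
  # one column per element of x: 32 bits each, most significant bit first
  m <- sapply(x, function(v) rev(as.integer(intToBits(v))))
  tm <- t(m)
  # remove empty leading bit cols
  i <- which(colSums(tm) != 0L)[1]
  # drop = FALSE keeps tm a matrix when x has one element or one bit col remains
  tm <- tm[, i:ncol(tm), drop = FALSE]
  # save to data.frame
  d <- as.data.frame(tm)
  rownames(d) <- NULL
  colnames(d) <- paste0(name, 1:ncol(d))
  d
}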
system.time(v2 <- encode_binary2(test_vec, name = "binary_x1_"))
# user system elapsed
# 0.61 0.02 0.63
# test that results are equal:
all.equal(v1, v2)
# [1] TRUE
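A further variation, offered as a rough sketch rather than a benchmarked result: instead of calling `intToBits` once per element, the bits can be computed one column at a time with integer division and modulo, so the arithmetic is vectorized over `x`. The name `encode_binary3` is hypothetical, and the sketch assumes non-negative whole numbers with no missing values.

```r
# Hypothetical alternative: compute each bit column with vectorized arithmetic.
# Assumes x holds non-negative whole numbers and contains no NA.
encode_binary3 <- function(x, name = "binary_") {
  stopifnot(is.numeric(x), length(x) > 0, !anyNA(x),
            all(x >= 0), all(x == floor(x)))
  # smallest number of bits that can hold max(x); at least one column
  nbits <- 1L
  while (2^nbits <= max(x)) nbits <- nbits + 1L
  # place values from most significant to least significant bit
  place <- 2^((nbits - 1):0)
  # outer() always returns a length(x) by nbits matrix, even for one element
  m <- outer(x, place, function(a, b) (a %/% b) %% 2)
  storage.mode(m) <- "integer"
  d <- as.data.frame(m)
  colnames(d) <- paste0(name, seq_len(nbits))
  d
}

encode_binary3(c(1, 2, 3, 4), name = "binary_x2_")
```

I have not timed this against the versions above. For positive codes the column layout (most significant bit first, no empty leading columns) is meant to match theirs.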
answered Sep 2, 2020 at 8:23