I tried writing a short function that expands brackets [] within a regular expression. Given a regular expression, the function expands the brackets and returns a vector of strings that explicitly spell out each match.
I attempted to account for two cases: 1) a regular expression with a single range (e.g. ^405[0-3L-O]$), and 2) a regular expression with multiple ranges, where the patterns containing the ranges are separated by | (e.g. ^W3812$|^405[0-3L-O]$|^N17[04][9FK]Z$). I also added a feature: if show_expanded is set to TRUE, the result is a named vector in which each name is the value substituted for the expanded range.
Below is the code.
#' Regular Expression Bracket Expander
#'
#' Given a regular expression with brackets, expands the expression with explicit matches.
#' Returns a vector of explicit matches.
#'
#' @param rex a regular expression
#' @param show_expanded if set to TRUE, the resulting vector will show each value in the expanded range as names.
#' @examples
#' r <- "^W3812$|^405[0-3L-O]$|^N17[04][9FK][0-3]Z$"
#' regex_expander(r)
regex_expander <- function(rex, show_expanded=TRUE){
  alpha_nums <- c(0:9, letters, LETTERS)
  rex_split <- strsplit(rex, split="\\|")[[1]]
  # extract range patterns
  range_pattern <- stringr::str_extract_all(rex_split, "\\[.*?\\]")
  # expand range
  expanded_patterns <- lapply(range_pattern, function(rng){
    if(length(rng) == 1){
      grep(rng, alpha_nums, value=TRUE)
    } else if(length(rng) > 1){
      # if more than 1 range, get every possible combination
      expanded <- lapply(rng, function(v) grep(v, alpha_nums, value=TRUE))
      apply(expand.grid(expanded), 1, function(x) paste0(x, collapse=""))
    } else{
      # no range in the pattern
      NULL
    }
  })
  # replace ranges with explicit nums/alphabets
  res <- mapply(function(rex, rng, expt){
    if(length(rng) == 1){
      sapply(expt, function(p) gsub(pattern=rng, replacement=p, x=rex, fixed=TRUE),
             USE.NAMES=show_expanded)
    } else if(length(rng) > 1){
      rng <- paste0(rng, collapse="")
      sapply(expt, function(p) gsub(pattern=rng, replacement=p, x=rex, fixed=TRUE),
             USE.NAMES=show_expanded)
    } else{
      warning("The expression ", rex, " does not contain any ranges.")
      rex
    }
  },
  rex_split, range_pattern, expanded_patterns,
  USE.NAMES=FALSE)
  # regex with no "|" separator
  if(is.matrix(res)) {
    rnms <- rownames(res)
    res <- as.vector(res)
    names(res) <- rnms
    # return
    res
  }
  # otherwise, return
  else unlist(res)
}
I tested with two examples.
- single range
r <- "^405[0-3L-O]$"
regex_expander(r, show_expanded=FALSE)
# output
# [1] "^4050$" "^4051$" "^4052$" "^4053$" "^405L$" "^405M$" "^405N$" "^405O$"
regex_expander(r, show_expanded=TRUE)
# output
#        0        1        2        3        L        M        N        O
# "^4050$" "^4051$" "^4052$" "^4053$" "^405L$" "^405M$" "^405N$" "^405O$"
- multiple ranges separated by |
r <- "^W3812$|^405[0-3L-O]$|^N17[04][9FK][0-3]Z$"
regex_expander(r, show_expanded=FALSE)
# output
# [1] "^W3812$" "^4050$" "^4051$" "^4052$" "^4053$" "^405L$"
# [7] "^405M$" "^405N$" "^405O$" "^N17090Z$" "^N17490Z$" "^N170F0Z$"
# [13] "^N174F0Z$" "^N170K0Z$" "^N174K0Z$" "^N17091Z$" "^N17491Z$" "^N170F1Z$"
# [19] "^N174F1Z$" "^N170K1Z$" "^N174K1Z$" "^N17092Z$" "^N17492Z$" "^N170F2Z$"
# [25] "^N174F2Z$" "^N170K2Z$" "^N174K2Z$" "^N17093Z$" "^N17493Z$" "^N170F3Z$"
# [31] "^N174F3Z$" "^N170K3Z$" "^N174K3Z$"
regex_expander(r, show_expanded=TRUE)
# output
#                       0           1           2           3           L           M
#   "^W3812$"    "^4050$"    "^4051$"    "^4052$"    "^4053$"    "^405L$"    "^405M$"
#           N           O         090         490         0F0         4F0         0K0
#    "^405N$"    "^405O$" "^N17090Z$" "^N17490Z$" "^N170F0Z$" "^N174F0Z$" "^N170K0Z$"
#         4K0         091         491         0F1         4F1         0K1         4K1
# "^N174K0Z$" "^N17091Z$" "^N17491Z$" "^N170F1Z$" "^N174F1Z$" "^N170K1Z$" "^N174K1Z$"
#         092         492         0F2         4F2         0K2         4K2         093
# "^N17092Z$" "^N17492Z$" "^N170F2Z$" "^N174F2Z$" "^N170K2Z$" "^N174K2Z$" "^N17093Z$"
#         493         0F3         4F3         0K3         4K3
# "^N17493Z$" "^N170F3Z$" "^N174F3Z$" "^N170K3Z$" "^N174K3Z$"
I feel like there should be a better way to handle the check for one versus multiple ranges. Any suggestions?
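One possible direction, offered as a rough sketch rather than a tested replacement: split each alternative into literal pieces and bracket pieces, expand every piece to its candidate strings, and let expand.grid build the combinations, so that zero, one, or many ranges all follow the same path. The function name expand_brackets and the tokenizing regex are assumptions made up for this illustration, and the show_expanded naming feature is left out.

```r
# Hypothetical sketch: expand_brackets is an illustrative name, not part of
# the function above. It uses base R only.
expand_brackets <- function(rex) {
  alpha_nums <- c(0:9, letters, LETTERS)

  expand_one <- function(pattern) {
    # Tokenize into bracket expressions ("[...]") and runs of other characters
    tokens <- regmatches(pattern, gregexpr("\\[[^]]*\\]|[^[]+", pattern))[[1]]

    # A bracket token becomes the characters it matches; a literal token
    # stays as a single choice
    choices <- lapply(tokens, function(tok) {
      if (startsWith(tok, "[")) grep(tok, alpha_nums, value = TRUE) else tok
    })

    # Every combination of choices, pasted back together column by column
    combos <- expand.grid(choices, stringsAsFactors = FALSE)
    do.call(paste0, unname(as.list(combos)))
  }

  alternatives <- strsplit(rex, "|", fixed = TRUE)[[1]]
  unlist(lapply(alternatives, expand_one))
}

expand_brackets("^W3812$|^405[0-3L-O]$|^N17[04][9FK][0-3]Z$")
```

Because a pattern without brackets is just a single literal token, it needs no special case. As far as I can tell, this should also cope with ranges that are not adjacent (e.g. ^A[12]B[34]$), which the paste0(rng, collapse="") step in my version does not.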
Thank you for reviewing and I would appreciate any feedback!