I recently wrote some R code, which I would normally not do, to find repeats in DNA fasta files.
Here is an example fasta file:
>seq0
ATCGGGGACGA
>seq1
ATTTTCGGGGACGA
>seq2
ATGCATCGATCG
The code takes in a fasta file as input and identifies repeats:
suppressPackageStartupMessages(library("Biostrings"))
library(Biostrings) # read a file in
library(stringr) # string manipulation
library(jsonlite) # save as JSON for pretty printing
#-----------------
# usage: Rscript test.R yourfile.fa
#-----------------
args <- commandArgs()
dna <- readDNAStringSet(args[6])
# Assuming 'dna' is S4 object with strings
dna_strings <- as.character(dna) # Convert S4 object to character vector
if (TRUE %in% duplicated(dna_strings)) {
stop('Duplicate input!')
}
# Split each string using the regular expression pattern
split_dna <- strsplit(dna_strings, '(A{2,}|C{2,}|G{2,}|T{2,})', perl = TRUE)
# Find all matches and their starting positions; using 0-based indices
matches <- gregexpr('(A{2,}|C{2,}|G{2,}|T{2,})', dna_strings, perl = TRUE)
# Extract the matched substrings based on the starting positions and lengths
#repeating_strings <- regmatches(dna_strings, matches)
# save these lists for making a dataframe later
seqs <- c()
starts_ends <- c()
all_capture_starts <- lapply(matches, function(match) {
as.integer(attr(match, "capture.start"))
})
all_capture_lengths <- lapply(matches, function(match) {
as.integer(attr(match, "capture.length"))
})
for (i in seq_along(all_capture_starts)) {
seqs <- append(seqs, dna_strings[i])
locs <- c()
for (j in seq_along(all_capture_starts[[i]])) {
locs[j] <- c()
stsp <- c(all_capture_starts[[i]][j]-1, all_capture_starts[[i]][j] + all_capture_lengths[[i]][j]-2)
locs <- append(locs, list(stsp))
}
if (-2 %in% locs[[1]]) {
starts_ends <- append(starts_ends, 'None')
} else {
starts_ends <- append(starts_ends, toJSON(locs))
}
# for (att in c('match.length', 'capture.start', 'capture.length', 'index.type', 'capture.names')) {
# cat(att, attr(matches[[i]], att), "\n", sep = "\t")
# }
}
df <- data.frame('Sequences' = seqs, 'Starts/Ends' = starts_ends)
print(df)
and the results look like:
Sequences Starts.Ends
seq0 ATCGGGGACGA [[3,6]]
seq1 ATTTTCGGGGACGA [[1,4],[6,9]]
seq2 ATGCATCGATCG None
I am very much used to more functional programming languages like C and Perl; writing this in R was very difficult for me, and took me 3 days, but the script does exactly what it was supposed to do.
How would a better R programmer write this?
1 Answer 1
Here are some suggestions:
- There is a duplicate
library
call toBiostrings
. It has no impact because R is smart enough to avoid reloading in this case, but it can be removed.
# read a file in
suppressPackageStartupMessages(library(Biostrings))
library(stringr) # string manipulation
library(jsonlite) # save as JSON for pretty printing
- You can use
anyDuplicated
to make the initial check more efficient:
if (anyDuplicated(dna_strings)) {
stop('Duplicate input!')
}
- You can create
df
withdna_strings
(named character) instead ofseqs
(character). - I would split the
for
loop into two stages: one for creating the data structure and the other for converting to JSON. This makes it easy to inspect intermediate results (e.g. in RStudio). We can usevapply
on the second stage for better performance / error checking. - R allows appending to lists by assignment. If necessary, you can even preallocate the list size for improved performance, as shown in the example below. It might be worth doing
capture_starts <- all_capture_starts[[i]]
for readability, but this can ruin performance if R incorrectly applies copy-on-modify. - You might have noticed that
Starts/Ends
is converted toStarts.Ends
by thedata.frame
. If this is not desired, you might want to look at the documentation formake.names
and pick an alternative.
Here is my implementation of suggestions 3-5:
n_matches <- length(matches)
all_locs <- vector(mode = "list", length = n_matches)
for (i in seq_len(n_matches)) {
locs <- list()
for (j in seq_along(all_capture_starts[[i]])) {
locs[[j]] <- c(
all_capture_starts[[i]][j] - 1L,
all_capture_starts[[i]][j] + all_capture_lengths[[i]][j] - 2L
)
}
all_locs[[i]] <- locs
}
starts_ends <- vapply(all_locs, function(locs) {
if (-2L %in% locs[[1]])
return('None')
toJSON(locs)
}, "")
df <- data.frame('Sequences' = dna_strings, 'Starts/Ends' = starts_ends)
print(df)
For what it's worth, I think your code was relatively well-written.