Finding repeats in DNA in R

Question 1

I recently wrote some R code, which I would normally not do, to find repeats in DNA fasta files.

Here is an example fasta file:

>seq0
ATCGGGGACGA
>seq1
ATTTTCGGGGACGA
>seq2
ATGCATCGATCG

The code takes in a fasta file as input and identifies repeats:

suppressPackageStartupMessages(library("Biostrings"))
library(Biostrings) # read a file in
library(stringr) # string manipulation
library(jsonlite) # save as JSON for pretty printing
#-----------------
# usage: Rscript test.R yourfile.fa
#-----------------
args <- commandArgs()
dna <- readDNAStringSet(args[6])
# Assuming 'dna' is S4 object with strings
dna_strings <- as.character(dna) # Convert S4 object to character vector
if (TRUE %in% duplicated(dna_strings)) {
 stop('Duplicate input!')
}
# Split each string using the regular expression pattern
split_dna <- strsplit(dna_strings, '(A{2,}|C{2,}|G{2,}|T{2,})', perl = TRUE)
# Find all matches and their starting positions; using 0-based indices
matches <- gregexpr('(A{2,}|C{2,}|G{2,}|T{2,})', dna_strings, perl = TRUE)
# Extract the matched substrings based on the starting positions and lengths
#repeating_strings <- regmatches(dna_strings, matches)
# save these lists for making a dataframe later
seqs <- c()
starts_ends <- c()
all_capture_starts <- lapply(matches, function(match) {
 as.integer(attr(match, "capture.start"))
})
all_capture_lengths <- lapply(matches, function(match) {
 as.integer(attr(match, "capture.length"))
})
for (i in seq_along(all_capture_starts)) {
 seqs <- append(seqs, dna_strings[i])
 locs <- c()
 for (j in seq_along(all_capture_starts[[i]])) {
 locs[j] <- c()
 stsp <- c(all_capture_starts[[i]][j]-1, all_capture_starts[[i]][j] + all_capture_lengths[[i]][j]-2)
 locs <- append(locs, list(stsp))
 }
 if (-2 %in% locs[[1]]) {
 starts_ends <- append(starts_ends, 'None')
 } else {
 starts_ends <- append(starts_ends, toJSON(locs))
 }
# for (att in c('match.length', 'capture.start', 'capture.length', 'index.type', 'capture.names')) {
# cat(att, attr(matches[[i]], att), "\n", sep = "\t")
# }
}
df <- data.frame('Sequences' = seqs, 'Starts/Ends' = starts_ends)
print(df)

and the results look like:

 Sequences Starts.Ends
seq0 ATCGGGGACGA [[3,6]]
seq1 ATTTTCGGGGACGA [[1,4],[6,9]]
seq2 ATGCATCGATCG None

I am very much used to more functional programming languages like C and Perl; writing this in R was very difficult for me, and took me 3 days, but the script does exactly what it was supposed to do.

How would a better R programmer write this?

Question 2

Here are some suggestions:

There is a duplicate library call to Biostrings. It has no impact because R is smart enough to avoid reloading in this case, but it can be removed.

# read a file in
suppressPackageStartupMessages(library(Biostrings)) 
library(stringr) # string manipulation
library(jsonlite) # save as JSON for pretty printing

You can use anyDuplicated to make the initial check more efficient:

if (anyDuplicated(dna_strings)) {
 stop('Duplicate input!')
}

You can create df with dna_strings (named character) instead of seqs (character).
I would split the for loop into two stages: one for creating the data structure and the other for converting to JSON. This makes it easy to inspect intermediate results (e.g. in RStudio). We can use vapply on the second stage for better performance / error checking.
R allows appending to lists by assignment. If necessary, you can even preallocate the list size for improved performance, as shown in the example below. It might be worth doing capture_starts <- all_capture_starts[[i]] for readability, but this can ruin performance if R incorrectly applies copy-on-modify.
You might have noticed that Starts/Ends is converted to Starts.Ends by the data.frame. If this is not desired, you might want to look at the documentation for make.names and pick an alternative.

Here is my implementation of suggestions 3-5:

n_matches <- length(matches)
all_locs <- vector(mode = "list", length = n_matches)
for (i in seq_len(n_matches)) {
 locs <- list()
 for (j in seq_along(all_capture_starts[[i]])) {
 locs[[j]] <- c(
 all_capture_starts[[i]][j] - 1L, 
 all_capture_starts[[i]][j] + all_capture_lengths[[i]][j] - 2L
 )
 }
 
 all_locs[[i]] <- locs
}
starts_ends <- vapply(all_locs, function(locs) {
 if (-2L %in% locs[[1]])
 return('None')
 toJSON(locs)
}, "")
df <- data.frame('Sequences' = dna_strings, 'Starts/Ends' = starts_ends)
print(df)

For what it's worth, I think your code was relatively well-written.

Justin Chang Justin Chang 2,4473 silver badges14 bronze badges · Accepted Answer · 2024-11-02 19:14:39Z

Here are some suggestions:

There is a duplicate library call to Biostrings. It has no impact because R is smart enough to avoid reloading in this case, but it can be removed.

# read a file in
suppressPackageStartupMessages(library(Biostrings)) 
library(stringr) # string manipulation
library(jsonlite) # save as JSON for pretty printing

You can use anyDuplicated to make the initial check more efficient:

if (anyDuplicated(dna_strings)) {
 stop('Duplicate input!')
}

You can create df with dna_strings (named character) instead of seqs (character).
I would split the for loop into two stages: one for creating the data structure and the other for converting to JSON. This makes it easy to inspect intermediate results (e.g. in RStudio). We can use vapply on the second stage for better performance / error checking.
R allows appending to lists by assignment. If necessary, you can even preallocate the list size for improved performance, as shown in the example below. It might be worth doing capture_starts <- all_capture_starts[[i]] for readability, but this can ruin performance if R incorrectly applies copy-on-modify.
You might have noticed that Starts/Ends is converted to Starts.Ends by the data.frame. If this is not desired, you might want to look at the documentation for make.names and pick an alternative.

Here is my implementation of suggestions 3-5:

n_matches <- length(matches)
all_locs <- vector(mode = "list", length = n_matches)
for (i in seq_len(n_matches)) {
 locs <- list()
 for (j in seq_along(all_capture_starts[[i]])) {
 locs[[j]] <- c(
 all_capture_starts[[i]][j] - 1L, 
 all_capture_starts[[i]][j] + all_capture_lengths[[i]][j] - 2L
 )
 }
 
 all_locs[[i]] <- locs
}
starts_ends <- vapply(all_locs, function(locs) {
 if (-2L %in% locs[[1]])
 return('None')
 toJSON(locs)
}, "")
df <- data.frame('Sequences' = dna_strings, 'Starts/Ends' = starts_ends)
print(df)

For what it's worth, I think your code was relatively well-written.

Stack Exchange Network

Finding repeats in DNA in R

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Finding repeats in DNA in R

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions