How can I can do better with my R code to analyze word counts

Question 1

I am new to R and to learn I start to code to do things that interest me. I have a biological background and I am very interested in bioinformatics. I know some python, but I want to use R as tool to apply some statistical analysis, once R is build in that thinking.

K-mer are substrings/words counted in a slide window (in my case of length 1, it means they overlap). Example:

AGCATTGGGACAT
AGC
 GCA
 CAT
 ATT
 TTG
 TGG
 GGG
 GGA
 GAC
 ACA
 CAT

The we count each ord ou kmer. I count kmers/words/n-grams of length (k) 1-10 in some genomes and my goal is to find under/over represented kmers in this genomes. I write some R code, but seems to me it is not very efficient. It takes to long to read the csv files, select the necessary data and to the stats.

Hope some of you guys would have with some ideias or tips to the code do better.

I compare the counts with the expected values, obtained by a High Markov Order, than calculate the variance, standard deviation, the z-scores, etc... Then I selected the words by a e-value (alfa) with 0.01. For 4^6 (4096 possible words) I get more or less 200 words.

This is a example of input:

A 167981
C 77361
G 77560
T 167983
AA 65544
AC 18941
AG 28260
AT 55235
CA 23043
CC 18508
CG 7602
CT 28208
GA 23256
GC 16484
GG 18645
GT 19175
TA 56138
TC 23428
TG 23053
TT 65364
AAA 27001
AAC 8762
AAG 10349
...

This is a example of out put:

kmer observed expected observed_freq Zscore Pval Eval rank
AAAAAA 1139 1376.16784735997 0.00232029905171272 -9.99779788296386 1.55823528020429e-23 6.38253170771679e-20 1
GAAAAT 338 494.315292339662 0.000688552308585514 -9.05021175302625 1.42691358118129e-19 5.84463802851856e-16 2
TTTTTT 1188 1397.96314403911 0.00242011876508755 -8.80934951714221 1.2587414031676e-18 5.1558047873745e-15 3
TTTAAA 695 810.032973280273 0.00141581022031637 -6.64539757722037 3.0239968595903e-11 1.23862911368819e-07 4
AACTCC 65 103.301775147929 0.000132413905497214 -6.05507022362132 1.40356421623009e-09 5.74899902967844e-06 5
...

I think the code works right, but it is slow.

So I really appreciate any tip to improve it.

Thank you by your attention and time.

Paulo.

 select_data <- function(counts, k){
 df_obs <- subset(counts, nchar(kmer) == k)
 return(df_obs)
 }
expected_high_markov_var <- function(kmer, counts){
 kmer_exp <- numeric(0)
 n <- nchar(kmer)
 pref <- substr(kmer, 1, n - 1)
 p <- as.double(counts[which(counts$kmer == pref), 'observed'])
 suf <- substr(kmer, 2, n)
 s <- as.double(counts[which(counts$kmer == suf), 'observed'])
 mid <- substr(kmer, 2, n - 1)
 m <- as.double(counts[which(counts$kmer == mid), 'observed'])
 exp <- (p * s) / m
 kmer_exp <- exp
 kmer_var <- exp * (m - p) * (m - s) / (m^2)
 return(list(kmer = kmer, expected = kmer_exp, variance = kmer_var))
}
generate_all_kmers <- function (k, 
 seq = '', 
 alphabet = c('A', 'C', 'G', 'T')) {
 kmers_list <- seq
 for (i in 1:k) {
 kmers_list <- unlist(lapply(kmers_list, function (kmer) {
 paste0(kmer, alphabet)
 }))
 }
 kmers_list
}
get_kmer_expected_var <- function(counts, kmer_list, k){
 df <- select_data(counts, k)
 n <- length(kmer_list)
 expected <- data.frame(kmer = character(), expected = numeric(0), variance = numeric(0))
 for(i in 1:n){
 exp_var <- expected_high_markov_var(kmer_list[i], counts)
 expected[i, ] <- exp_var
 }
 merged_df <- merge(df, expected, by = "kmer")
 return(merged_df)
}
kmer_frequency <- function(df, k, col){
 new_name <- paste(col, "_", "freq", sep = "", collapse = "")
 num_rows <- nrow(df)
 total <- sum(df[, col]) + k -1
 for(i in 1:num_rows){
 freq <- df[i, col] / total
 df[i, "frequency"] <- freq
 }
 names(df)[names(df) == "frequency"] <- new_name
 return(df)
}
get_kmer_sd <- function(kmer_list, kmer_data){
 n <- nrow(kmer_data)
 for(i in 1:n){
 var <- kmer_data[which(kmer_data$kmer == kmer_list[i]), "variance"]
 sd <- sqrt(var)
 kmer_data[i, "sd"] <- sd
 }
 return(kmer_data)
}
get_kmer_Zscores <- function(kmer_list, kmer_data){
 n <- nrow(kmer_data)
 for(i in 1:n){
 obs <- kmer_data[which(kmer_data$kmer == kmer_list[i]), "observed"]
 exp <- kmer_data[which(kmer_data$kmer == kmer_list[i]), "expected"]
 sd <- kmer_data[which(kmer_data$kmer == kmer_list[i]), "sd"]
 zscr <- (obs - exp) / sd
 kmer_data[i, "Zscore"] <- zscr
 }
 return(kmer_data)
}
pval <- function(kmer_list, kmer_data){
 n <- nrow(kmer_data)
 for(i in 1:n){
 zsc <- kmer_data[which(kmer_data$kmer == kmer_list[i]), "Zscore"]
 pval <- pnorm(-abs(zsc)) * 2
 kmer_data[i, "Pval"] <- pval
 }
 return(kmer_data)
}
get_e_vals <- function(kmer_list, kmer_data){
 n <- nrow(kmer_data)
 for(i in 1:n){
 pval <- kmer_data[which(kmer_data$kmer == kmer_list[i]), "Pval"]
 eval <- n * pval
 kmer_data[i, "Eval"] <- eval
 }
 return(kmer_data)
}
# get the exceptional words by a thresholder e-value
exceptional_kmers <- function(kmer_data, eval=0.05){
 exceptional_kmers <- kmer_data[which(kmer_data$Eval < eval), ]
 exceptional_kmers <- exceptional_kmers[order(exceptional_kmers$Zscore), ]
 exceptional_kmers$rank <- rank(exceptional_kmers$Zscore)
 return(exceptional_kmers)
}
analyse_kmer_data <- function(filenames, 
 kmer_list, 
 k,
 output_data){
 colunm_to_keep <- c("kmer", "observed", "expected", 
 "observed_freq", "Zscore",
 "Pval", "Eval", "rank")
 for(filename in filenames) {
 name <- basename(filename)
 message("Calculating the statistics from file: ", name)
 final_name <- paste0(substr(name, 1, 15), "_kmer", k, ".csv")
 oligos <- read.csv(filename, 
 header = FALSE, 
 col.names=c("kmer", "observed"))
 df <- get_kmer_expected_var(oligos, kmer_list, k)
 rm(oligos)
 df <- kmer_frequency(df, k, "observed")
 df <- get_kmer_sd(kmer_list, df)
 df <- get_kmer_Zscores(kmer_list, df)
 df <- pval(kmer_list, df)
 df <- get_e_vals(kmer_list, df)
 df <- exceptional_kmers(df, eval = 0.001)
 final_data <- subset(df, select = colunm_to_keep)
 out <- file.path(output_data, final_name)
 write.csv(final_data, file = out, row.names = FALSE)
 rm(df)
 }
}

kmer_list it is all possible kmer of length k, eg., the genome alphabet is {A,C,G,T}, so if I want all kmers of length (k) = 2, my list will be the all possible 4^2 (4^k) combinations of the letters of the alphabet, in this case = 16: AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT.

Question 2

I would avoid "rolling your own" versions of these types of programs. Most of what you are trying to do is available with the BioConductor project for R (bioconductor.org/install). For example, you can easily obtain counts of DNA k-mers in given DNA sequence with the countDnaKmers in the BioConductor setTools pakage. The k-mers are searched in a set of search windows, which are defined by start and width parameter. For example: dna_sequence<-"AGCATTGGGACAT" countDnaKmers(dna_sequence, 3, 1:3, 3) returns: AGC 1 0 0 ATT 0 0 1 CAG 0 0 0 CAT 0 1 1 etc.

Question 3

@3209Cigs but I got your point. Thank you.

Question 4

What is kmer_list ? what is k? we can not test the code...

But, that said, I adjusted 3 functions, without testing them, maybe they give some speedup:

expected_high_markov_var <- function(kmer, counts){
 kmer_exp <- numeric(0)
 n <- nchar(kmer)
 pref <- substr(kmer, 1, n - 1)
 counts2 <- as.double(counts$observed)
 p <- counts2[which(counts$kmer == pref)]
 suf <- substr(kmer, 2, n)
 s <- counts2[which(counts$kmer == suf)]
 mid <- substr(kmer, 2, n - 1)
 m <- counts2[which(counts$kmer == mid)]
 exp <- (p * s) / m
 kmer_exp <- exp
 kmer_var <- exp * (m - p) * (m - s) / (m^2)
 return(list(kmer = kmer, expected = kmer_exp, variance = kmer_var))
}
kmer_frequency <- function(df, k, col){
 new_name <- paste(col, "_", "freq", sep = "", collapse = "")
 num_rows <- nrow(df)
 total <- sum(df[, col]) + k -1
 rez <- sapply(1:num_rows, function(i) df[[col]][i] / total)
 df[, "frequency"] <- rez
 names(df)[names(df) == "frequency"] <- new_name
 return(df)
}
get_kmer_sd <- function(kmer_list, kmer_data){
 n <- nrow(kmer_data)
 rez <- sapply(1:n, function(i)
 sqrt(kmer_data$variance[kmer_data$kmer == kmer_list[i]]))
 kmer_data[, "sd"] <- rez
 return(kmer_data)
}

p.s. there are probably a lot more that could be rewritten...

Question 5

Hi there, hope now it gets more clear. Thanks

Question 6

Ordinarily we take answer invalidation quite seriously, but it seems your answer was written before the question was clear enough for a proper review. The question is much clarified by the edits, so can I ask you to modify the answer instead?

Question 7

@ Mast I am sorry about the missing information. Some times I write as if the subject is know for everyone and that is not true. I will try to give more detailed information in future questions. Thank you.

minem minem 1,0021 gold badge9 silver badges12 bronze badges · Answer 1 · 2022-11-04 13:21:30Z

What is kmer_list ? what is k? we can not test the code...

But, that said, I adjusted 3 functions, without testing them, maybe they give some speedup:

expected_high_markov_var <- function(kmer, counts){
 kmer_exp <- numeric(0)
 n <- nchar(kmer)
 pref <- substr(kmer, 1, n - 1)
 counts2 <- as.double(counts$observed)
 p <- counts2[which(counts$kmer == pref)]
 suf <- substr(kmer, 2, n)
 s <- counts2[which(counts$kmer == suf)]
 mid <- substr(kmer, 2, n - 1)
 m <- counts2[which(counts$kmer == mid)]
 exp <- (p * s) / m
 kmer_exp <- exp
 kmer_var <- exp * (m - p) * (m - s) / (m^2)
 return(list(kmer = kmer, expected = kmer_exp, variance = kmer_var))
}
kmer_frequency <- function(df, k, col){
 new_name <- paste(col, "_", "freq", sep = "", collapse = "")
 num_rows <- nrow(df)
 total <- sum(df[, col]) + k -1
 rez <- sapply(1:num_rows, function(i) df[[col]][i] / total)
 df[, "frequency"] <- rez
 names(df)[names(df) == "frequency"] <- new_name
 return(df)
}
get_kmer_sd <- function(kmer_list, kmer_data){
 n <- nrow(kmer_data)
 rez <- sapply(1:n, function(i)
 sqrt(kmer_data$variance[kmer_data$kmer == kmer_list[i]]))
 kmer_data[, "sd"] <- rez
 return(kmer_data)
}

p.s. there are probably a lot more that could be rewritten...

Ordinarily we take answer invalidation quite seriously, but it seems your answer was written before the question was clear enough for a proper review. The question is much clarified by the edits, so can I ask you to modify the answer instead?
@ Mast I am sorry about the missing information. Some times I write as if the subject is know for everyone and that is not true. I will try to give more detailed information in future questions. Thank you.

Stack Exchange Network

How can I can do better with my R code to analyze word counts

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

How can I can do better with my R code to analyze word counts

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions