0
\$\begingroup\$

I am new to R and to learn I start to code to do things that interest me. I have a biological background and I am very interested in bioinformatics. I know some python, but I want to use R as tool to apply some statistical analysis, once R is build in that thinking.

K-mer are substrings/words counted in a slide window (in my case of length 1, it means they overlap). Example:

AGCATTGGGACAT
AGC
 GCA
 CAT
 ATT
 TTG
 TGG
 GGG
 GGA
 GAC
 ACA
 CAT

The we count each ord ou kmer. I count kmers/words/n-grams of length (k) 1-10 in some genomes and my goal is to find under/over represented kmers in this genomes. I write some R code, but seems to me it is not very efficient. It takes to long to read the csv files, select the necessary data and to the stats.

Hope some of you guys would have with some ideias or tips to the code do better.

I compare the counts with the expected values, obtained by a High Markov Order, than calculate the variance, standard deviation, the z-scores, etc... Then I selected the words by a e-value (alfa) with 0.01. For 4^6 (4096 possible words) I get more or less 200 words.

This is a example of input:

A 167981
C 77361
G 77560
T 167983
AA 65544
AC 18941
AG 28260
AT 55235
CA 23043
CC 18508
CG 7602
CT 28208
GA 23256
GC 16484
GG 18645
GT 19175
TA 56138
TC 23428
TG 23053
TT 65364
AAA 27001
AAC 8762
AAG 10349
...

This is a example of out put:

kmer observed expected observed_freq Zscore Pval Eval rank
AAAAAA 1139 1376.16784735997 0.00232029905171272 -9.99779788296386 1.55823528020429e-23 6.38253170771679e-20 1
GAAAAT 338 494.315292339662 0.000688552308585514 -9.05021175302625 1.42691358118129e-19 5.84463802851856e-16 2
TTTTTT 1188 1397.96314403911 0.00242011876508755 -8.80934951714221 1.2587414031676e-18 5.1558047873745e-15 3
TTTAAA 695 810.032973280273 0.00141581022031637 -6.64539757722037 3.0239968595903e-11 1.23862911368819e-07 4
AACTCC 65 103.301775147929 0.000132413905497214 -6.05507022362132 1.40356421623009e-09 5.74899902967844e-06 5
...

I think the code works right, but it is slow.

So I really appreciate any tip to improve it.

Thank you by your attention and time.

Paulo.

 select_data <- function(counts, k){
 df_obs <- subset(counts, nchar(kmer) == k)
 return(df_obs)
 }
expected_high_markov_var <- function(kmer, counts){
 kmer_exp <- numeric(0)
 n <- nchar(kmer)
 pref <- substr(kmer, 1, n - 1)
 p <- as.double(counts[which(counts$kmer == pref), 'observed'])
 suf <- substr(kmer, 2, n)
 s <- as.double(counts[which(counts$kmer == suf), 'observed'])
 mid <- substr(kmer, 2, n - 1)
 m <- as.double(counts[which(counts$kmer == mid), 'observed'])
 exp <- (p * s) / m
 kmer_exp <- exp
 kmer_var <- exp * (m - p) * (m - s) / (m^2)
 return(list(kmer = kmer, expected = kmer_exp, variance = kmer_var))
}
generate_all_kmers <- function (k, 
 seq = '', 
 alphabet = c('A', 'C', 'G', 'T')) {
 kmers_list <- seq
 for (i in 1:k) {
 kmers_list <- unlist(lapply(kmers_list, function (kmer) {
 paste0(kmer, alphabet)
 }))
 }
 kmers_list
}
get_kmer_expected_var <- function(counts, kmer_list, k){
 df <- select_data(counts, k)
 n <- length(kmer_list)
 expected <- data.frame(kmer = character(), expected = numeric(0), variance = numeric(0))
 for(i in 1:n){
 exp_var <- expected_high_markov_var(kmer_list[i], counts)
 expected[i, ] <- exp_var
 }
 merged_df <- merge(df, expected, by = "kmer")
 return(merged_df)
}
kmer_frequency <- function(df, k, col){
 new_name <- paste(col, "_", "freq", sep = "", collapse = "")
 num_rows <- nrow(df)
 total <- sum(df[, col]) + k -1
 for(i in 1:num_rows){
 freq <- df[i, col] / total
 df[i, "frequency"] <- freq
 }
 names(df)[names(df) == "frequency"] <- new_name
 return(df)
}
get_kmer_sd <- function(kmer_list, kmer_data){
 n <- nrow(kmer_data)
 for(i in 1:n){
 var <- kmer_data[which(kmer_data$kmer == kmer_list[i]), "variance"]
 sd <- sqrt(var)
 kmer_data[i, "sd"] <- sd
 }
 return(kmer_data)
}
get_kmer_Zscores <- function(kmer_list, kmer_data){
 n <- nrow(kmer_data)
 for(i in 1:n){
 obs <- kmer_data[which(kmer_data$kmer == kmer_list[i]), "observed"]
 exp <- kmer_data[which(kmer_data$kmer == kmer_list[i]), "expected"]
 sd <- kmer_data[which(kmer_data$kmer == kmer_list[i]), "sd"]
 zscr <- (obs - exp) / sd
 kmer_data[i, "Zscore"] <- zscr
 }
 return(kmer_data)
}
pval <- function(kmer_list, kmer_data){
 n <- nrow(kmer_data)
 for(i in 1:n){
 zsc <- kmer_data[which(kmer_data$kmer == kmer_list[i]), "Zscore"]
 pval <- pnorm(-abs(zsc)) * 2
 kmer_data[i, "Pval"] <- pval
 }
 return(kmer_data)
}
get_e_vals <- function(kmer_list, kmer_data){
 n <- nrow(kmer_data)
 for(i in 1:n){
 pval <- kmer_data[which(kmer_data$kmer == kmer_list[i]), "Pval"]
 eval <- n * pval
 kmer_data[i, "Eval"] <- eval
 }
 return(kmer_data)
}
# get the exceptional words by a thresholder e-value
exceptional_kmers <- function(kmer_data, eval=0.05){
 exceptional_kmers <- kmer_data[which(kmer_data$Eval < eval), ]
 exceptional_kmers <- exceptional_kmers[order(exceptional_kmers$Zscore), ]
 exceptional_kmers$rank <- rank(exceptional_kmers$Zscore)
 return(exceptional_kmers)
}
analyse_kmer_data <- function(filenames, 
 kmer_list, 
 k,
 output_data){
 colunm_to_keep <- c("kmer", "observed", "expected", 
 "observed_freq", "Zscore",
 "Pval", "Eval", "rank")
 for(filename in filenames) {
 name <- basename(filename)
 message("Calculating the statistics from file: ", name)
 final_name <- paste0(substr(name, 1, 15), "_kmer", k, ".csv")
 oligos <- read.csv(filename, 
 header = FALSE, 
 col.names=c("kmer", "observed"))
 df <- get_kmer_expected_var(oligos, kmer_list, k)
 rm(oligos)
 df <- kmer_frequency(df, k, "observed")
 df <- get_kmer_sd(kmer_list, df)
 df <- get_kmer_Zscores(kmer_list, df)
 df <- pval(kmer_list, df)
 df <- get_e_vals(kmer_list, df)
 df <- exceptional_kmers(df, eval = 0.001)
 final_data <- subset(df, select = colunm_to_keep)
 out <- file.path(output_data, final_name)
 write.csv(final_data, file = out, row.names = FALSE)
 rm(df)
 }
}

kmer_list it is all possible kmer of length k, eg., the genome alphabet is {A,C,G,T}, so if I want all kmers of length (k) = 2, my list will be the all possible 4^2 (4^k) combinations of the letters of the alphabet, in this case = 16: AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT.

asked Nov 2, 2022 at 22:45
\$\endgroup\$
2
  • \$\begingroup\$ I would avoid "rolling your own" versions of these types of programs. Most of what you are trying to do is available with the BioConductor project for R (bioconductor.org/install). For example, you can easily obtain counts of DNA k-mers in given DNA sequence with the countDnaKmers in the BioConductor setTools pakage. The k-mers are searched in a set of search windows, which are defined by start and width parameter. For example: dna_sequence<-"AGCATTGGGACAT" countDnaKmers(dna_sequence, 3, 1:3, 3) returns: AGC 1 0 0 ATT 0 0 1 CAG 0 0 0 CAT 0 1 1 etc. \$\endgroup\$ Commented Nov 15, 2022 at 7:00
  • \$\begingroup\$ @3209Cigs but I got your point. Thank you. \$\endgroup\$ Commented Nov 16, 2022 at 8:38

1 Answer 1

1
\$\begingroup\$

What is kmer_list ? what is k? we can not test the code...

But, that said, I adjusted 3 functions, without testing them, maybe they give some speedup:

expected_high_markov_var <- function(kmer, counts){
 kmer_exp <- numeric(0)
 n <- nchar(kmer)
 pref <- substr(kmer, 1, n - 1)
 counts2 <- as.double(counts$observed)
 p <- counts2[which(counts$kmer == pref)]
 suf <- substr(kmer, 2, n)
 s <- counts2[which(counts$kmer == suf)]
 mid <- substr(kmer, 2, n - 1)
 m <- counts2[which(counts$kmer == mid)]
 exp <- (p * s) / m
 kmer_exp <- exp
 kmer_var <- exp * (m - p) * (m - s) / (m^2)
 return(list(kmer = kmer, expected = kmer_exp, variance = kmer_var))
}
kmer_frequency <- function(df, k, col){
 new_name <- paste(col, "_", "freq", sep = "", collapse = "")
 num_rows <- nrow(df)
 total <- sum(df[, col]) + k -1
 rez <- sapply(1:num_rows, function(i) df[[col]][i] / total)
 df[, "frequency"] <- rez
 names(df)[names(df) == "frequency"] <- new_name
 return(df)
}
get_kmer_sd <- function(kmer_list, kmer_data){
 n <- nrow(kmer_data)
 rez <- sapply(1:n, function(i)
 sqrt(kmer_data$variance[kmer_data$kmer == kmer_list[i]]))
 kmer_data[, "sd"] <- rez
 return(kmer_data)
}

p.s. there are probably a lot more that could be rewritten...

answered Nov 4, 2022 at 13:21
\$\endgroup\$
3
  • \$\begingroup\$ Hi there, hope now it gets more clear. Thanks \$\endgroup\$ Commented Nov 4, 2022 at 18:29
  • 1
    \$\begingroup\$ Ordinarily we take answer invalidation quite seriously, but it seems your answer was written before the question was clear enough for a proper review. The question is much clarified by the edits, so can I ask you to modify the answer instead? \$\endgroup\$ Commented Nov 4, 2022 at 18:39
  • \$\begingroup\$ @ Mast I am sorry about the missing information. Some times I write as if the subject is know for everyone and that is not true. I will try to give more detailed information in future questions. Thank you. \$\endgroup\$ Commented Nov 4, 2022 at 21:06

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.