I am new to R and to learn I start to code to do things that interest me. I have a biological background and I am very interested in bioinformatics. I know some python, but I want to use R as tool to apply some statistical analysis, once R is build in that thinking.
K-mer are substrings/words counted in a slide window (in my case of length 1, it means they overlap). Example:
AGCATTGGGACAT
AGC
GCA
CAT
ATT
TTG
TGG
GGG
GGA
GAC
ACA
CAT
The we count each ord ou kmer. I count kmers/words/n-grams of length (k) 1-10 in some genomes and my goal is to find under/over represented kmers in this genomes. I write some R code, but seems to me it is not very efficient. It takes to long to read the csv files, select the necessary data and to the stats.
Hope some of you guys would have with some ideias or tips to the code do better.
I compare the counts with the expected values, obtained by a High Markov Order, than calculate the variance, standard deviation, the z-scores, etc... Then I selected the words by a e-value (alfa) with 0.01. For 4^6 (4096 possible words) I get more or less 200 words.
This is a example of input:
A 167981
C 77361
G 77560
T 167983
AA 65544
AC 18941
AG 28260
AT 55235
CA 23043
CC 18508
CG 7602
CT 28208
GA 23256
GC 16484
GG 18645
GT 19175
TA 56138
TC 23428
TG 23053
TT 65364
AAA 27001
AAC 8762
AAG 10349
...
This is a example of out put:
kmer observed expected observed_freq Zscore Pval Eval rank
AAAAAA 1139 1376.16784735997 0.00232029905171272 -9.99779788296386 1.55823528020429e-23 6.38253170771679e-20 1
GAAAAT 338 494.315292339662 0.000688552308585514 -9.05021175302625 1.42691358118129e-19 5.84463802851856e-16 2
TTTTTT 1188 1397.96314403911 0.00242011876508755 -8.80934951714221 1.2587414031676e-18 5.1558047873745e-15 3
TTTAAA 695 810.032973280273 0.00141581022031637 -6.64539757722037 3.0239968595903e-11 1.23862911368819e-07 4
AACTCC 65 103.301775147929 0.000132413905497214 -6.05507022362132 1.40356421623009e-09 5.74899902967844e-06 5
...
I think the code works right, but it is slow.
So I really appreciate any tip to improve it.
Thank you by your attention and time.
Paulo.
select_data <- function(counts, k){
df_obs <- subset(counts, nchar(kmer) == k)
return(df_obs)
}
expected_high_markov_var <- function(kmer, counts){
kmer_exp <- numeric(0)
n <- nchar(kmer)
pref <- substr(kmer, 1, n - 1)
p <- as.double(counts[which(counts$kmer == pref), 'observed'])
suf <- substr(kmer, 2, n)
s <- as.double(counts[which(counts$kmer == suf), 'observed'])
mid <- substr(kmer, 2, n - 1)
m <- as.double(counts[which(counts$kmer == mid), 'observed'])
exp <- (p * s) / m
kmer_exp <- exp
kmer_var <- exp * (m - p) * (m - s) / (m^2)
return(list(kmer = kmer, expected = kmer_exp, variance = kmer_var))
}
generate_all_kmers <- function (k,
seq = '',
alphabet = c('A', 'C', 'G', 'T')) {
kmers_list <- seq
for (i in 1:k) {
kmers_list <- unlist(lapply(kmers_list, function (kmer) {
paste0(kmer, alphabet)
}))
}
kmers_list
}
get_kmer_expected_var <- function(counts, kmer_list, k){
df <- select_data(counts, k)
n <- length(kmer_list)
expected <- data.frame(kmer = character(), expected = numeric(0), variance = numeric(0))
for(i in 1:n){
exp_var <- expected_high_markov_var(kmer_list[i], counts)
expected[i, ] <- exp_var
}
merged_df <- merge(df, expected, by = "kmer")
return(merged_df)
}
kmer_frequency <- function(df, k, col){
new_name <- paste(col, "_", "freq", sep = "", collapse = "")
num_rows <- nrow(df)
total <- sum(df[, col]) + k -1
for(i in 1:num_rows){
freq <- df[i, col] / total
df[i, "frequency"] <- freq
}
names(df)[names(df) == "frequency"] <- new_name
return(df)
}
get_kmer_sd <- function(kmer_list, kmer_data){
n <- nrow(kmer_data)
for(i in 1:n){
var <- kmer_data[which(kmer_data$kmer == kmer_list[i]), "variance"]
sd <- sqrt(var)
kmer_data[i, "sd"] <- sd
}
return(kmer_data)
}
get_kmer_Zscores <- function(kmer_list, kmer_data){
n <- nrow(kmer_data)
for(i in 1:n){
obs <- kmer_data[which(kmer_data$kmer == kmer_list[i]), "observed"]
exp <- kmer_data[which(kmer_data$kmer == kmer_list[i]), "expected"]
sd <- kmer_data[which(kmer_data$kmer == kmer_list[i]), "sd"]
zscr <- (obs - exp) / sd
kmer_data[i, "Zscore"] <- zscr
}
return(kmer_data)
}
pval <- function(kmer_list, kmer_data){
n <- nrow(kmer_data)
for(i in 1:n){
zsc <- kmer_data[which(kmer_data$kmer == kmer_list[i]), "Zscore"]
pval <- pnorm(-abs(zsc)) * 2
kmer_data[i, "Pval"] <- pval
}
return(kmer_data)
}
get_e_vals <- function(kmer_list, kmer_data){
n <- nrow(kmer_data)
for(i in 1:n){
pval <- kmer_data[which(kmer_data$kmer == kmer_list[i]), "Pval"]
eval <- n * pval
kmer_data[i, "Eval"] <- eval
}
return(kmer_data)
}
# get the exceptional words by a thresholder e-value
exceptional_kmers <- function(kmer_data, eval=0.05){
exceptional_kmers <- kmer_data[which(kmer_data$Eval < eval), ]
exceptional_kmers <- exceptional_kmers[order(exceptional_kmers$Zscore), ]
exceptional_kmers$rank <- rank(exceptional_kmers$Zscore)
return(exceptional_kmers)
}
analyse_kmer_data <- function(filenames,
kmer_list,
k,
output_data){
colunm_to_keep <- c("kmer", "observed", "expected",
"observed_freq", "Zscore",
"Pval", "Eval", "rank")
for(filename in filenames) {
name <- basename(filename)
message("Calculating the statistics from file: ", name)
final_name <- paste0(substr(name, 1, 15), "_kmer", k, ".csv")
oligos <- read.csv(filename,
header = FALSE,
col.names=c("kmer", "observed"))
df <- get_kmer_expected_var(oligos, kmer_list, k)
rm(oligos)
df <- kmer_frequency(df, k, "observed")
df <- get_kmer_sd(kmer_list, df)
df <- get_kmer_Zscores(kmer_list, df)
df <- pval(kmer_list, df)
df <- get_e_vals(kmer_list, df)
df <- exceptional_kmers(df, eval = 0.001)
final_data <- subset(df, select = colunm_to_keep)
out <- file.path(output_data, final_name)
write.csv(final_data, file = out, row.names = FALSE)
rm(df)
}
}
kmer_list it is all possible kmer of length k, eg., the genome alphabet is {A,C,G,T}, so if I want all kmers of length (k) = 2, my list will be the all possible 4^2 (4^k) combinations of the letters of the alphabet, in this case = 16: AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT.
1 Answer 1
What is kmer_list
? what is k
? we can not test the code...
But, that said, I adjusted 3 functions, without testing them, maybe they give some speedup:
expected_high_markov_var <- function(kmer, counts){
kmer_exp <- numeric(0)
n <- nchar(kmer)
pref <- substr(kmer, 1, n - 1)
counts2 <- as.double(counts$observed)
p <- counts2[which(counts$kmer == pref)]
suf <- substr(kmer, 2, n)
s <- counts2[which(counts$kmer == suf)]
mid <- substr(kmer, 2, n - 1)
m <- counts2[which(counts$kmer == mid)]
exp <- (p * s) / m
kmer_exp <- exp
kmer_var <- exp * (m - p) * (m - s) / (m^2)
return(list(kmer = kmer, expected = kmer_exp, variance = kmer_var))
}
kmer_frequency <- function(df, k, col){
new_name <- paste(col, "_", "freq", sep = "", collapse = "")
num_rows <- nrow(df)
total <- sum(df[, col]) + k -1
rez <- sapply(1:num_rows, function(i) df[[col]][i] / total)
df[, "frequency"] <- rez
names(df)[names(df) == "frequency"] <- new_name
return(df)
}
get_kmer_sd <- function(kmer_list, kmer_data){
n <- nrow(kmer_data)
rez <- sapply(1:n, function(i)
sqrt(kmer_data$variance[kmer_data$kmer == kmer_list[i]]))
kmer_data[, "sd"] <- rez
return(kmer_data)
}
p.s. there are probably a lot more that could be rewritten...
-
\$\begingroup\$ Hi there, hope now it gets more clear. Thanks \$\endgroup\$Paulo Sergio Schlogl– Paulo Sergio Schlogl2022年11月04日 18:29:21 +00:00Commented Nov 4, 2022 at 18:29
-
1\$\begingroup\$ Ordinarily we take answer invalidation quite seriously, but it seems your answer was written before the question was clear enough for a proper review. The question is much clarified by the edits, so can I ask you to modify the answer instead? \$\endgroup\$2022年11月04日 18:39:35 +00:00Commented Nov 4, 2022 at 18:39
-
\$\begingroup\$ @ Mast I am sorry about the missing information. Some times I write as if the subject is know for everyone and that is not true. I will try to give more detailed information in future questions. Thank you. \$\endgroup\$Paulo Sergio Schlogl– Paulo Sergio Schlogl2022年11月04日 21:06:07 +00:00Commented Nov 4, 2022 at 21:06
countDnaKmers
in the BioConductorsetTools
pakage. The k-mers are searched in a set of search windows, which are defined by start and width parameter. For example:dna_sequence<-"AGCATTGGGACAT" countDnaKmers(dna_sequence, 3, 1:3, 3)
returns:AGC 1 0 0 ATT 0 0 1 CAG 0 0 0 CAT 0 1 1
etc. \$\endgroup\$