I previously published an article on a new method for analysing RNA-seq data as part of my PhD studies, and I'm now working on making this into an R package. I don't come from a programming background, so this is quite challenging and, thankfully, both fun as well as rewarding. There was a part of my code that was written in Python, since that particular part would be slow in R, but now I have to convert that part into R as well, in order for it to go into the R package.
I have successfully done this conversion, but it is, as I guessed, horribly slow. I'm hoping this is only because of the way I did it, and that there is room for optimisation; hence, my question here. In essence, the script takes a VCF file (containing data on mutations from one or more samples), performs some filtering on data quality criteria and extracts the relevant information to another file. This is done to save time, as the resulting file is going to be analysed up to several hundreds of times (depending on the study) by other downstream scripts.
This is an example record in the VCF file (long; scrolling needed):
chr1 158097044 rs6427420 A G 290.77 PASS AC=2;AF=1.00;AN=2;DP=10;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=29.08;SOR=1.085;ANN=G|3_prime_UTR_variant|MODIFIER|KIRREL|ENSG00000183853|transcript|ENST00000368173.7|protein_coding|13/13|c.*1924A>G|||||1924|,G|3_prime_UTR_variant|MODIFIER|KIRREL|ENSG00000183853|transcript|ENST00000368172.1|protein_coding|11/11|c.*1924A>G|||||1924|,G|downstream_gene_variant|MODIFIER|KIRREL|ENSG00000183853|transcript|ENST00000359209.10|protein_coding||c.*1924A>G|||||1391|,G|downstream_gene_variant|MODIFIER|KIRREL|ENSG00000183853|transcript|ENST00000360089.8|protein_coding||c.*1924A>G|||||993|;ASP;CAF=0.497,0.503;COMMON=1;G5;G5A;GNO;HD;KGPhase1;KGPhase3;RS=6427420;RSPOS=158097044;SAO=0;SLO;SSR=0;U3;VC=SNV;VLD;VP=0x05010080000517053f000100;WGT=1;dbSNPBuildID=116 GT:AD:DP:GQ:PL 1/1:0,10:10:30:319,30,0
The main issue is the ANN
(mutation annotation) field in the 8th (extremely large) column. I can relatively easy get the following format, using the VariantAnnotation
package:
[1] ANN=G|3_prime_UTR_variant|MODIFIER|KIRREL|ENSG00000183853|transcript|ENST00000368173.7|protein_coding|13/13|c.*1924A>G|||||1924|
[2] G|truncation|HIGH|KIRREL|ENSG00000183853|transcript|ENST00000368172.1|protein_coding|11/11|c.*1924A>G|||||1924|
...
[n] ANN=...
There is thus several levels of ANN
per mutation, because a single mutation might effect several genes and transcripts. The goal is to separate all ANN
fields by |
, keep those mutations with the highest predicted effect on protein function (going in descending order from HIGH
, MODERATE
, LOW
and MODIFIER
) and output them together with the data for the specific mutation. I need to go from this ...
chr1 123 rs456 A G 789 PASS ANN=...gene1-HIGH..., ...gene2-HIGH..., ...gene3-LOW... GT:AD 1/0:12,14
... to this ...
chr1 123 rs456 A G 789 PASS ANN=...gene1-HIGH... GT:AD 1/0:12,14
chr1 123 rs456 A G 789 PASS ANN=...gene2-HIGH... GT:AD 1/0:12,14
The LOW
gene annotation is gone, and each of the HIGH
annotations now gets its own row with identical information (except the ANN
field).
The script below works and does its job, yielding equivalent results to the previous Python script, but it is slower. A very small file containing only a hundred variants took around 30 seconds (1 second with Python), a normal-sized file up to 5 minutes (5-20 seconds with Python), and I gave up on a very large file after an hour (5 minutes with Python).
So, how could I optimise my script? Am I doing it in a very slow manner? I didn't think that I could use apply
, as I need to go from a single line into many, and thus went for a for
loop. But maybe I can? Or maybe there's something else? Any ideas are greatly appreciated!
extract_variants = function(vcf_file,
sample,
output_file,
filter_depth=10) {
# Read VCF file
vcf = VariantAnnotation::readVcf(vcf_file)
# Gather relevant information to data GRanges object
gr = SummarizedExperiment::rowRanges(vcf)
gr$ANN = VariantAnnotation::info(vcf)$ANN
gr$DP = as.data.frame(VariantAnnotation::geno(vcf)$DP)[[sample]]
gr$AD = as.data.frame(VariantAnnotation::geno(vcf)$AD)[[sample]]
gr$GT = as.data.frame(VariantAnnotation::geno(vcf)$GT)[[sample]]
# Set ALT as character
gr$ALT = S4Vectors::unstrsplit(IRanges::CharacterList(gr$ALT))
# Remove variants not passing variant calling filters
gr = gr[gr$FILTER == 'PASS', ]
gr$FILTER = NULL
# Remove variants below the given depth threshold
gr = gr[gr$DP >= filter_depth & !is.na(gr$DP), ]
# Convert to data frame
data = GenomicRanges::as.data.frame(gr)
# Remove non-SNVs
data = data[nchar(data$REF) == 1 &
nchar(data$ALT) == 1, ]
# Get rsIDs if existing
data$rsID = row.names(data)
data[!grepl('^rs[0-9]+', data$rsID), 'rsID'] = 'None'
# Remove unwanted columns
row.names(data) = NULL
data$end = NULL
data$width = NULL
data$strand = NULL
data$paramRangeID = NULL
data$QUAL = NULL
# Separate allelic depths
data = tidyr::separate(data=data, col=AD, sep='\,円\\ ',
into=c('AD1', 'AD2'), remove=TRUE)
data$AD1 = gsub('c\\(', '', data$AD1)
data$AD2 = gsub('\\)', '', data$AD2)
# Add alleles
data = tidyr::separate(data=data, col=GT, sep='/', into=c('A1', 'A2'),
remove=TRUE)
data[data$A1 == 0, 'A1'] = data[data$A1 == 0, 'REF']
data[data$A1 == 1, 'A1'] = data[data$A1 == 1, 'ALT']
data[data$A2 == 0, 'A2'] = data[data$A2 == 0, 'REF']
data[data$A2 == 1, 'A2'] = data[data$A2 == 1, 'ALT']
######### PROBLEMATIC PART STARTS HERE #########
# Initialise empty data frame for final results
results = data.frame(effect=character(),
impact=character(),
gene=character(),
ENSGID=character(),
feature=character(),
ENSTID=character(),
biotype=character(),
warnings=character(),
seqnames=integer(),
start=integer(),
rsID=character(),
REF=character(),
ALT=character(),
DP=integer(),
AD1=integer(),
AD2=integer(),
A1=character(),
A2=character(), stringsAsFactors=FALSE)
# Loop over each SNV
for (n in c(1:nrow(data))) {
# Get annotation data for current SNV
ann = data[n, 'ANN'][[1]]
# Separate into columns
ann = tidyr::separate(as.data.frame(ann), col=ann, sep='\\|',
into=c('ALT', 'effect', 'impact', 'gene', 'ENSGID', 'feature',
'ENSTID', 'biotype', 'rank', 'HGSV.c', 'HGSV.p',
'cDNA.pos', 'CDS.pos', 'protein.pos', 'distance',
'warnings'), remove=TRUE)
# Remove unwanted data columns
ann$ALT = NULL
ann$rank = NULL
ann$HGSV.c = NULL
ann$HGSV.p = NULL
ann$cDNA.pos = NULL
ann$CDS.pos = NULL
ann$protein.pos = NULL
ann$distance = NULL
# Keep only the highest impact SNV(s)
impacts = unique(ann$impact)
if ('HIGH' %in% impacts) {
ann = ann[ann$impact == 'HIGH', ]
} else if ('MODERATE' %in% impacts) {
ann = ann[ann$impact == 'MODERATE', ]
} else if ('LOW' %in% impacts) {
ann = ann[ann$impact == 'LOW', ]
}
# SNV data columns
data.cols = c('seqnames', 'start', 'rsID', 'REF', 'ALT', 'DP',
'AD1', 'AD2', 'A1', 'A2')
# Add SNV data to each annotation
for (col in data.cols) {
ann[[col]] = data[n, col]
}
# Append to final results data frame
results = rbind(results, ann)
}
######### PROBLEMATIC PART ENDS HERE #########
# Finalise output
results = results[c('seqnames', 'start', 'rsID', 'REF', 'ALT',
'gene', 'ENSGID', 'ENSTID', 'impact', 'effect',
'feature', 'biotype', 'DP', 'AD1', 'AD2', 'A1',
'A2', 'warnings')]
names(results) = c('chr', 'pos', names(results)[3:18])
# Write results to file
write.table(results, output_file, sep='\t', row.names=FALSE,
quote=FALSE)
}
}
P.S. I would also welcome feedback on the general level of the code, style, legibility, etc., as it is my first R package and I don't have any previous experience with such.
1 Answer 1
Just a few thoughts:
- The loop makes the code slow, and it should be easily possible to get rid of it. Without your data (or at least a reproducible example) it is hard for me to try it out; but you should be able to call
tidyr::separate()
on the whole data frame, not just a single row. There are two great packages to wrangle data (e.g., filter observations or variables), namely, dplyr and data.table. Choose one of them (Here is a comprehensive comparison.) This will make your code faster and easier to both write and read. For example:
# instead of data$end = NULL data$width = NULL data$strand = NULL
# I would do data <- dplyr::select(data, -end, -width, -strand)
If you fill, for example, a matrix using a loop, it is faster and saver to initialize the full matrix first. Specify all rows and columns you need and fill them with
NA
s. For example,data.frame(id = 1:100, a = rep(NA, 100), b = rep(NA_character_, 100))
.- Using
=
instead of<-
is often regarded bad practice. - I tend to use something like
ii
instead ofi
infor (ii in seq_len(nrow(data)))
, because it's easier to find (and replace) if needed.
-
\$\begingroup\$ Thank you for that! I will see if I can get
separate
to work differenty, and change the wrangling. For the matrix filling, I can't really do much, though, seeing as I don't know how many rows I will have at the end (due to filtering, removing low impacts, etc.) Regarding=
and<-
, I did a bit of reading on that when I started to useR
some time ago, and I got the impression that it didn't really matter and that it was more of a personal taste kind of thing. This is not the case? \$\endgroup\$erikfas– erikfas2017年07月17日 15:10:53 +00:00Commented Jul 17, 2017 at 15:10 -
\$\begingroup\$ @hplieninger
I tend to use something like ii instead of i
: In most IDEs you can find a feature to replace variables. In RStudio it's calledRename in Scope
. (another option: ctrl+F, and checkWhole word
) \$\endgroup\$Scarabee– Scarabee2017年07月21日 22:13:58 +00:00Commented Jul 21, 2017 at 22:13 -
\$\begingroup\$ There is an extended discussion on
<-
vs.=
on stackoverflow.com/q/1741820/2563804. \$\endgroup\$hplieninger– hplieninger2017年07月27日 08:50:33 +00:00Commented Jul 27, 2017 at 8:50