Extraction of mutational data from large files in R

Question 1

I previously published an article on a new method for analysing RNA-seq data as part of my PhD studies, and I'm now working on making this into an R package. I don't come from a programming background, so this is quite challenging and, thankfully, both fun as well as rewarding. There was a part of my code that was written in Python, since that particular part would be slow in R, but now I have to convert that part into R as well, in order for it to go into the R package.

I have successfully done this conversion, but it is, as I guessed, horribly slow. I'm hoping this is only because of the way I did it, and that there is room for optimisation; hence, my question here. In essence, the script takes a VCF file (containing data on mutations from one or more samples), performs some filtering on data quality criteria and extracts the relevant information to another file. This is done to save time, as the resulting file is going to be analysed up to several hundreds of times (depending on the study) by other downstream scripts.

This is an example record in the VCF file (long; scrolling needed):

chr1 158097044 rs6427420 A G 290.77 PASS AC=2;AF=1.00;AN=2;DP=10;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=29.08;SOR=1.085;ANN=G|3_prime_UTR_variant|MODIFIER|KIRREL|ENSG00000183853|transcript|ENST00000368173.7|protein_coding|13/13|c.*1924A>G|||||1924|,G|3_prime_UTR_variant|MODIFIER|KIRREL|ENSG00000183853|transcript|ENST00000368172.1|protein_coding|11/11|c.*1924A>G|||||1924|,G|downstream_gene_variant|MODIFIER|KIRREL|ENSG00000183853|transcript|ENST00000359209.10|protein_coding||c.*1924A>G|||||1391|,G|downstream_gene_variant|MODIFIER|KIRREL|ENSG00000183853|transcript|ENST00000360089.8|protein_coding||c.*1924A>G|||||993|;ASP;CAF=0.497,0.503;COMMON=1;G5;G5A;GNO;HD;KGPhase1;KGPhase3;RS=6427420;RSPOS=158097044;SAO=0;SLO;SSR=0;U3;VC=SNV;VLD;VP=0x05010080000517053f000100;WGT=1;dbSNPBuildID=116 GT:AD:DP:GQ:PL 1/1:0,10:10:30:319,30,0

The main issue is the ANN (mutation annotation) field in the 8th (extremely large) column. I can relatively easy get the following format, using the VariantAnnotation package:

[1] ANN=G|3_prime_UTR_variant|MODIFIER|KIRREL|ENSG00000183853|transcript|ENST00000368173.7|protein_coding|13/13|c.*1924A>G|||||1924|
[2] G|truncation|HIGH|KIRREL|ENSG00000183853|transcript|ENST00000368172.1|protein_coding|11/11|c.*1924A>G|||||1924|
...
[n] ANN=...

There is thus several levels of ANN per mutation, because a single mutation might effect several genes and transcripts. The goal is to separate all ANN fields by |, keep those mutations with the highest predicted effect on protein function (going in descending order from HIGH, MODERATE, LOW and MODIFIER) and output them together with the data for the specific mutation. I need to go from this ...

chr1 123 rs456 A G 789 PASS ANN=...gene1-HIGH..., ...gene2-HIGH..., ...gene3-LOW... GT:AD 1/0:12,14

... to this ...

chr1 123 rs456 A G 789 PASS ANN=...gene1-HIGH... GT:AD 1/0:12,14
chr1 123 rs456 A G 789 PASS ANN=...gene2-HIGH... GT:AD 1/0:12,14

The LOW gene annotation is gone, and each of the HIGH annotations now gets its own row with identical information (except the ANN field).

The script below works and does its job, yielding equivalent results to the previous Python script, but it is slower. A very small file containing only a hundred variants took around 30 seconds (1 second with Python), a normal-sized file up to 5 minutes (5-20 seconds with Python), and I gave up on a very large file after an hour (5 minutes with Python).

So, how could I optimise my script? Am I doing it in a very slow manner? I didn't think that I could use apply, as I need to go from a single line into many, and thus went for a for loop. But maybe I can? Or maybe there's something else? Any ideas are greatly appreciated!

extract_variants = function(vcf_file,
 sample,
 output_file,
 filter_depth=10) {
 # Read VCF file
 vcf = VariantAnnotation::readVcf(vcf_file)
 # Gather relevant information to data GRanges object
 gr = SummarizedExperiment::rowRanges(vcf)
 gr$ANN = VariantAnnotation::info(vcf)$ANN
 gr$DP = as.data.frame(VariantAnnotation::geno(vcf)$DP)[[sample]]
 gr$AD = as.data.frame(VariantAnnotation::geno(vcf)$AD)[[sample]]
 gr$GT = as.data.frame(VariantAnnotation::geno(vcf)$GT)[[sample]]
 # Set ALT as character
 gr$ALT = S4Vectors::unstrsplit(IRanges::CharacterList(gr$ALT))
 # Remove variants not passing variant calling filters
 gr = gr[gr$FILTER == 'PASS', ]
 gr$FILTER = NULL
 # Remove variants below the given depth threshold
 gr = gr[gr$DP >= filter_depth & !is.na(gr$DP), ]
 # Convert to data frame
 data = GenomicRanges::as.data.frame(gr)
 # Remove non-SNVs
 data = data[nchar(data$REF) == 1 &
 nchar(data$ALT) == 1, ]
 # Get rsIDs if existing
 data$rsID = row.names(data)
 data[!grepl('^rs[0-9]+', data$rsID), 'rsID'] = 'None'
 # Remove unwanted columns
 row.names(data) = NULL
 data$end = NULL
 data$width = NULL
 data$strand = NULL
 data$paramRangeID = NULL
 data$QUAL = NULL
 # Separate allelic depths
 data = tidyr::separate(data=data, col=AD, sep='\,円\\ ',
 into=c('AD1', 'AD2'), remove=TRUE)
 data$AD1 = gsub('c\\(', '', data$AD1)
 data$AD2 = gsub('\\)', '', data$AD2)
 # Add alleles
 data = tidyr::separate(data=data, col=GT, sep='/', into=c('A1', 'A2'),
 remove=TRUE)
 data[data$A1 == 0, 'A1'] = data[data$A1 == 0, 'REF']
 data[data$A1 == 1, 'A1'] = data[data$A1 == 1, 'ALT']
 data[data$A2 == 0, 'A2'] = data[data$A2 == 0, 'REF']
 data[data$A2 == 1, 'A2'] = data[data$A2 == 1, 'ALT']
 ######### PROBLEMATIC PART STARTS HERE #########
 # Initialise empty data frame for final results
 results = data.frame(effect=character(),
 impact=character(),
 gene=character(),
 ENSGID=character(),
 feature=character(),
 ENSTID=character(),
 biotype=character(),
 warnings=character(),
 seqnames=integer(),
 start=integer(),
 rsID=character(),
 REF=character(),
 ALT=character(),
 DP=integer(),
 AD1=integer(),
 AD2=integer(),
 A1=character(),
 A2=character(), stringsAsFactors=FALSE)
 # Loop over each SNV
 for (n in c(1:nrow(data))) {
 # Get annotation data for current SNV
 ann = data[n, 'ANN'][[1]]
 # Separate into columns
 ann = tidyr::separate(as.data.frame(ann), col=ann, sep='\\|',
 into=c('ALT', 'effect', 'impact', 'gene', 'ENSGID', 'feature',
 'ENSTID', 'biotype', 'rank', 'HGSV.c', 'HGSV.p',
 'cDNA.pos', 'CDS.pos', 'protein.pos', 'distance',
 'warnings'), remove=TRUE)
 # Remove unwanted data columns
 ann$ALT = NULL
 ann$rank = NULL
 ann$HGSV.c = NULL
 ann$HGSV.p = NULL
 ann$cDNA.pos = NULL
 ann$CDS.pos = NULL
 ann$protein.pos = NULL
 ann$distance = NULL
 # Keep only the highest impact SNV(s)
 impacts = unique(ann$impact)
 if ('HIGH' %in% impacts) {
 ann = ann[ann$impact == 'HIGH', ]
 } else if ('MODERATE' %in% impacts) {
 ann = ann[ann$impact == 'MODERATE', ]
 } else if ('LOW' %in% impacts) {
 ann = ann[ann$impact == 'LOW', ]
 }
 # SNV data columns
 data.cols = c('seqnames', 'start', 'rsID', 'REF', 'ALT', 'DP',
 'AD1', 'AD2', 'A1', 'A2')
 # Add SNV data to each annotation
 for (col in data.cols) {
 ann[[col]] = data[n, col]
 }
 # Append to final results data frame
 results = rbind(results, ann)
 }
 ######### PROBLEMATIC PART ENDS HERE #########
 # Finalise output
 results = results[c('seqnames', 'start', 'rsID', 'REF', 'ALT',
 'gene', 'ENSGID', 'ENSTID', 'impact', 'effect',
 'feature', 'biotype', 'DP', 'AD1', 'AD2', 'A1',
 'A2', 'warnings')]
 names(results) = c('chr', 'pos', names(results)[3:18])
 # Write results to file
 write.table(results, output_file, sep='\t', row.names=FALSE, 
 quote=FALSE)
 }
}

P.S. I would also welcome feedback on the general level of the code, style, legibility, etc., as it is my first R package and I don't have any previous experience with such.

Question 2

Just a few thoughts:

The loop makes the code slow, and it should be easily possible to get rid of it. Without your data (or at least a reproducible example) it is hard for me to try it out; but you should be able to call tidyr::separate() on the whole data frame, not just a single row.
There are two great packages to wrangle data (e.g., filter observations or variables), namely, dplyr and data.table. Choose one of them (Here is a comprehensive comparison.) This will make your code faster and easier to both write and read. For example:

# instead of data$end = NULL data$width = NULL data$strand = NULL

# I would do data <- dplyr::select(data, -end, -width, -strand)
If you fill, for example, a matrix using a loop, it is faster and saver to initialize the full matrix first. Specify all rows and columns you need and fill them with NAs. For example, data.frame(id = 1:100, a = rep(NA, 100), b = rep(NA_character_, 100)).
Using = instead of <- is often regarded bad practice.
I tend to use something like ii instead of i in for (ii in seq_len(nrow(data))), because it's easier to find (and replace) if needed.

Question 3

Thank you for that! I will see if I can get separate to work differenty, and change the wrangling. For the matrix filling, I can't really do much, though, seeing as I don't know how many rows I will have at the end (due to filtering, removing low impacts, etc.) Regarding = and <-, I did a bit of reading on that when I started to use R some time ago, and I got the impression that it didn't really matter and that it was more of a personal taste kind of thing. This is not the case?

Question 4

@hplieninger I tend to use something like ii instead of i: In most IDEs you can find a feature to replace variables. In RStudio it's called Rename in Scope. (another option: ctrl+F, and check Whole word)

Question 5

There is an extended discussion on <- vs. = on stackoverflow.com/q/1741820/2563804.

hplieninger hplieninger 1947 bronze badges · Accepted Answer · 2017-07-13 15:12:07Z

Just a few thoughts:

The loop makes the code slow, and it should be easily possible to get rid of it. Without your data (or at least a reproducible example) it is hard for me to try it out; but you should be able to call tidyr::separate() on the whole data frame, not just a single row.
There are two great packages to wrangle data (e.g., filter observations or variables), namely, dplyr and data.table. Choose one of them (Here is a comprehensive comparison.) This will make your code faster and easier to both write and read. For example:

# instead of data$end = NULL data$width = NULL data$strand = NULL

# I would do data <- dplyr::select(data, -end, -width, -strand)
If you fill, for example, a matrix using a loop, it is faster and saver to initialize the full matrix first. Specify all rows and columns you need and fill them with NAs. For example, data.frame(id = 1:100, a = rep(NA, 100), b = rep(NA_character_, 100)).
Using = instead of <- is often regarded bad practice.
I tend to use something like ii instead of i in for (ii in seq_len(nrow(data))), because it's easier to find (and replace) if needed.

Thank you for that! I will see if I can get separate to work differenty, and change the wrangling. For the matrix filling, I can't really do much, though, seeing as I don't know how many rows I will have at the end (due to filtering, removing low impacts, etc.) Regarding = and <-, I did a bit of reading on that when I started to use R some time ago, and I got the impression that it didn't really matter and that it was more of a personal taste kind of thing. This is not the case?
@hplieninger I tend to use something like ii instead of i: In most IDEs you can find a feature to replace variables. In RStudio it's called Rename in Scope. (another option: ctrl+F, and check Whole word)
There is an extended discussion on <- vs. = on stackoverflow.com/q/1741820/2563804.

Stack Exchange Network

Extraction of mutational data from large files in R

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Extraction of mutational data from large files in R

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions