These notes were created during the course, and server as a transcript of topics covered.

Intro to sequencing

Workflow

  1. Experimental design
  2. Wet lab sample prep, etc
  3. Sequencing
    • FASTQ file of reads and their quality scores
    • Quality assessment (FASTQ program), trimming or removing contanimants, removing optical duplicates (FASTX, trimomatic)
    • Quality with respect to your research question
  4. Alignment / (assembly)
    • BAM file of aligned reads to a known reference genome
    • Aligners: vary from simple to use to hard to use, from ‘good enough’ alignments (for RNA-seq of known genes, ChIP-seq) to high-quality (e.g., DNA-seq calling variants)
    • Bowtie2 (easy, good enough), gmap (excellent, hard to use).
    • Purpose-built tools that align and reduce. E.g., RNA-seq known gene differential expression – kalisto, sailfish
  5. Reduction
    • BED of called peaks in a ChIP-seq experiment (e.g., MACS, FindPeaks)
    • VCF of called variants (GATK, bcftools)
    • Count table (e.g., tsv) in an RNA-seq experiment (python htseq2; GenomicFeatures::summarizeOverlaps())
  6. (Statistical) analysis
    • Why statistical analysis? data is fundamentally huge; biological questions are framed in terms of classical statistics, e.g., designed experiments, hypothesis testing; technical and other artifacts, e.g., GC bias, mapability, batch effects
    • Appropriate tools: able to cope with statistics; access to advanced statistical methods; analysis has to be reproducible (some sort of scripting); processing large amounts of data is not the primary criterion.
    • R / Bioconductor is the best most awesome tool.
  7. Comprehension
    • .Rmd or similar documenting the work flow, including inputs, analysis steps, tables, figures, interpertation...

FASTQ and BAM files

View from the Linux command line...

  • zcat *fastq.gz | less
  • samtools view -h *bam

... or within R / Bioconductor: fastq files

library(ShortRead)
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## 
## The following objects are masked from 'package:parallel':
## 
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## 
## The following objects are masked from 'package:stats':
## 
## IQR, mad, xtabs
## 
## The following objects are masked from 'package:base':
## 
## anyDuplicated, append, as.data.frame, as.vector, cbind,
## colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
## grep, grepl, intersect, is.unsorted, lapply, lengths, Map,
## mapply, match, mget, order, paste, pmax, pmax.int, pmin,
## pmin.int, Position, rank, rbind, Reduce, rownames, sapply,
## setdiff, sort, table, tapply, union, unique, unlist, unsplit
## 
## Loading required package: BiocParallel
## Loading required package: Biostrings
## Loading required package: S4Vectors
## Loading required package: stats4
## Loading required package: IRanges
## Loading required package: XVector
## Loading required package: Rsamtools
## Loading required package: GenomeInfoDb
## Loading required package: GenomicRanges
## Loading required package: GenomicAlignments
## Loading required package: SummarizedExperiment
## Loading required package: Biobase
## Welcome to Bioconductor
## 
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
strm = FastqStreamer("bigdata/SRR1039508_1.fastq.gz", 100000)
fq = yield(strm)
fq
## class: ShortReadQ
## length: 100000 reads; width: 63 cycles
sread(fq)
## A DNAStringSet instance of length 100000
## width seq
## [1] 63 CATTGCTGATACCAANNNNNNNNGCATTC...GTCTTCCTCCTTCCCTTACGGAATTACA
## [2] 63 CCCTGGACTGCTTCTTGAAAAGTGCCATC...CTATCTTTGGGGAGAGTATGATAGAGAT
## [3] 63 TCGATCCATCGATTGGAAGGCACTGATCT...TCAGGTTGGTGGTCTTATTTGCAAGTCC
## [4] 63 GAAGAGTTAGCAGCGACCGTGACAGACCA...GCTCCCAACTCCAGGGTGCCAATCCGAT
## [5] 63 CGTGCAGGAGATCATGATCCCCGCGGGCA...GCCTGGTCATTGGCAAGGGCGGGGAGAC
## ... ... ...
## [99996] 63 GAGAGAAGCTTTGTATGGCTGTCATGCTT...TGATTCCTGCAACTTGACCTTCAGGCTG
## [99997] 63 TTATGGTGCAGACATGGCCAAGTCCAAGA...CCACACACAACCAGTCCCGAAAATGGCA
## [99998] 63 TTAAAGTAGAGCATCTAGTTTGAGAAATA...AATTATTAAAGATGTCTTTTTTCTACCC
## [99999] 63 TCCCAACTGTAGGCTGAGTGACCTGAAGG...AGACTGCCGAAGTCCAAAAGCTTCAGCA
## [100000] 63 GTGTTTTCTGGTATCGTCCCTTCGTGGTT...AAAAAATGGTACTGGAAAGGGGTCCCAA
quality(fq)
## class: FastqQuality
## quality:
## A BStringSet instance of length 100000
## width seq
## [1] 63 HJJJJJJJJJJJJJJ########00?GHI...JIJJJJJJJJJJJJJJJJJHHHFFFFFD
## [2] 63 HJJJJJJJJJJJJJJJJJIIJIGHIJJJJ...JJJJJJJJJJJJGHHIDHIJJHHHHHHF
## [3] 63 HJJJJJJJJJJJJJJJJJJJJJJJJJJJJ...GHIJJBGIJCGIAHIJHHHHHHHFFFFF
## [4] 63 HIJJJJIIJJJJJJJJJJJIJJJJJJJJJ...IHHHHHHFFFFEEEEDC@DDDDDDDDDD
## [5] 63 HIGGIIIIIIIGHIIIGIHIIIIJGIFAC...@@DDBDDCCDECCDDDB?BBBBBD@B;<
## ... ... ...
## [99996] 63 HJJJJJJJJJJJJGIJJJJJJGGIJJGHH...CHJJJGGHIJJJJJIJJJJJJJJIHHHH
## [99997] 63 HJJJIJHHIIJJJJIJJJJJIJIJJIJJI...HHFFFFDDDDDDDDCDDDDD@DDDDDDD
## [99998] 63 HJJJJJJHIJJJJJJJJJJJJJIJJJJJJ...JJJJJJJJJJJJJJJJJJJJJJJJJJIJ
## [99999] 63 HJJJJJJJJHIJJJJJJJGHIJJJJJJJJ...JJJJJJJJJJJJIJJJJHHHHFFFFFFF
## [100000] 63 HAEFHIJJJJJHIJJJJJJJJJJIHIJFH...IJJJJJIJHHHHHHFFFFFEDD>BDDDD

R

  • Statistical programming language
  • Vectorized (works efficiently on vectors; vector notation is very expressive and compact)
  • Objects help to coordinate management of related data
  • Introspection helps discover what can be done with objects.
x = rnorm(1000)
y = x + rnorm(1000, sd=.5)
df = data.frame(x=x, y=y)
plot(y ~ x, df)
fit = lm(y ~ x, df)
class(fit)
## [1] "lm"
methods(class=class(fit))
## [1] add1 alias anova case.names 
## [5] coerce confint cooks.distance deviance 
## [9] dfbeta dfbetas drop1 dummy.coef 
## [13] effects extractAIC family formula 
## [17] hatvalues influence initialize kappa 
## [21] labels logLik model.frame model.matrix 
## [25] nobs plot predict print 
## [29] proj qr residuals rstandard 
## [33] rstudent show simulate slotsFromS3 
## [37] summary variable.names vcov 
## see '?methods' for accessing help and source code
methods("anova")
## [1] anova.glm* anova.glmlist* anova.lm* anova.lmlist* 
## [5] anova.loess* anova.mlm* anova.nls* 
## see '?methods' for accessing help and source code

Help!

?log
?plot # generic 'plot'
?plot.lm # plot for objects of class 'lm'

Bioconductor

  • Main web site, including biocViews
  • Package landing pages, e.g., ChIPseeker
  • The support forum
  • 1100+ packages for analysis and comprehension of high-throughput genomic data: sequencing (RNA, ChIP, variants, ...), microarray (expression, methylation, copy number, etc), flow cytometry, proteomics, imaging, ...

Extensive use of ‘S4’ classes

  • fit (from lm()) is an example of an S3 class
  • sread(fq) returned a DNAStringSet, an example of an S4 class
library(ShortRead)
strm = FastqStreamer("bigdata/SRR1039508_1.fastq.gz", 100000)
fq = yield(strm) # 'ShortReadQ' S4 class
class(fq) # introspection
## [1] "ShortReadQ"
## attr(,"package")
## [1] "ShortRead"
methods(class=class(fq)) 
## [1] [ [<- alphabetByCycle 
## [4] alphabetScore append clean 
## [7] coerce detail dustyScore 
## [10] id length narrow 
## [13] pairwiseAlignment qa renew 
## [16] renewable reverse reverseComplement
## [19] show srdistance srduplicated 
## [22] sread srorder srrank 
## [25] srsort tables trimEnds 
## [28] trimLRPatterns trimTails trimTailw 
## [31] width writeFasta writeFastq 
## see '?methods' for accessing help and source code
reads = sread(fq) # accessor -- get the reads
reads # 'DNAStringSet' S4 class
## A DNAStringSet instance of length 100000
## width seq
## [1] 63 CATTGCTGATACCAANNNNNNNNGCATTC...GTCTTCCTCCTTCCCTTACGGAATTACA
## [2] 63 CCCTGGACTGCTTCTTGAAAAGTGCCATC...CTATCTTTGGGGAGAGTATGATAGAGAT
## [3] 63 TCGATCCATCGATTGGAAGGCACTGATCT...TCAGGTTGGTGGTCTTATTTGCAAGTCC
## [4] 63 GAAGAGTTAGCAGCGACCGTGACAGACCA...GCTCCCAACTCCAGGGTGCCAATCCGAT
## [5] 63 CGTGCAGGAGATCATGATCCCCGCGGGCA...GCCTGGTCATTGGCAAGGGCGGGGAGAC
## ... ... ...
## [99996] 63 GAGAGAAGCTTTGTATGGCTGTCATGCTT...TGATTCCTGCAACTTGACCTTCAGGCTG
## [99997] 63 TTATGGTGCAGACATGGCCAAGTCCAAGA...CCACACACAACCAGTCCCGAAAATGGCA
## [99998] 63 TTAAAGTAGAGCATCTAGTTTGAGAAATA...AATTATTAAAGATGTCTTTTTTCTACCC
## [99999] 63 TCCCAACTGTAGGCTGAGTGACCTGAAGG...AGACTGCCGAAGTCCAAAAGCTTCAGCA
## [100000] 63 GTGTTTTCTGGTATCGTCCCTTCGTGGTT...AAAAAATGGTACTGGAAAGGGGTCCCAA
methods(class=class(reads))
## [1] ! != 
## [3] [ [[ 
## [5] [[<- [<- 
## [7] %in% < 
## [9] <= == 
## [11] > >= 
## [13] $ $<- 
## [15] aggregate alphabetFrequency 
## [17] anyNA append 
## [19] as.character as.complex 
## [21] as.data.frame as.env 
## [23] as.integer as.list 
## [25] as.logical as.matrix 
## [27] as.numeric as.raw 
## [29] as.vector c 
## [31] chartr clean 
## [33] coerce compact 
## [35] compare compareStrings 
## [37] complement consensusMatrix 
## [39] consensusString countOverlaps 
## [41] countPattern countPDict 
## [43] dinucleotideFrequencyTest do.call 
## [45] droplevels duplicated 
## [47] dustyScore elementLengths 
## [49] elementMetadata elementMetadata<- 
## [51] elementType endoapply 
## [53] eval expand 
## [55] extractAt extractROWS 
## [57] Filter Find 
## [59] findOverlaps hasOnlyBaseLetters 
## [61] head high2low 
## [63] ifelse intersect 
## [65] is.na is.unsorted 
## [67] isEmpty isMatchingEndingAt 
## [69] isMatchingStartingAt lapply 
## [71] length lengths 
## [73] letterFrequency Map 
## [75] match 
## [ reached getOption("max.print") -- omitted 102 entries ]
## see '?methods' for accessing help and source code
gc = letterFrequency(reads, "GC", as.prob=TRUE)
hist(gc)

Help!

?DNAStringSet # class, and often frequently used methods
?letterFrequency # generic
methods("letterFrequency")
?"letterFrequency,XStringSet-method"

And...

Key software packages...

... and classes

  • DNAStringSet, DNAString for sequence data
  • GRanges, GRangesList for representing coordinates in genome space
  • SummarizedExperiment (ExpressionSet): integrated data contains: rows x columns (features x samples)
    • assays()
    • rowRanges() for annotations on rows
    • colData() for column annotations

Annotation

  • Pure ‘data’ packages
  • Identifier mapping org.* packages
  • Gene models with TxDb.* packages
  • Whole genome sequences BSgenome.* packages
  • biomaRt for accessing ENSEMBL-based biomarts; AnnotationHub for genome-scale annotation resources

Strategies for working with big data

  • Write efficient R code – vectorized
  • Process data in chunks, e.g., FastqStreamer(), Rsamtools::BamFile(..., yieldSize=1000000); GenomicFiles::reduceByYield() (see examples on ?reduceByYield)
  • Process in parallel BiocParallel

All material on the course materials page

AltStyle によって変換されたページ (->オリジナル) /