1 R
2 Bioconductor
3 End matter
- 3.1 Session Info
- 3.2 Acknowledgements

Author: Martin Morgan
Date: 22 July, 2019

1 R

1.1 History of R and CRAN

Statistical programming language. Conceived 1992, initial version 1996, stable beta version in 2000; an implementation of S. CRAN started in 1997.
‘Free’ software: no cost, open source, broad use.
Extensible: packages (15,000 on CRAN, 1750 on Bioconductor)
Key features
- Intrinsic statistical concepts
- Vectorized computation
- ‘Old-school’ scripts rather than graphical user interface – great for reproducibility!
- (Advanced) copy-on-change semantics
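
Copy-on-change can be seen in a few lines of base R. This is a minimal sketch; the variable names are illustrative only.

```r
x <- c(1, 2, 3)
y <- x          # no copy yet: x and y refer to the same values
y[1] <- 10      # changing y triggers the copy; x is left untouched
x               # still 1 2 3
y               # now 10 2 3
```

The practical consequence is that a function can modify its arguments without changing the caller's variables.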

1.2 Vectors and data frames

1 + 2

## [1] 3

x = c(1, 2, 3)
1:3 # sequence of integers from 1 to 3

## [1] 1 2 3

x + c(4, 5, 6) # vectorized

## [1] 5 7 9

x + 4 # recycling

## [1] 5 6 7
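
Recycling also applies when the shorter vector has more than one element. A small sketch, with values chosen only for illustration: the shorter vector is repeated to the length of the longer one, and R warns when the longer length is not a multiple of the shorter.

```r
x <- c(1, 2, 3, 4)
x + c(10, 20)   # c(10, 20) is recycled to c(10, 20, 10, 20): 11 22 13 24
x * 2           # a length-1 vector is recycled to every element: 2 4 6 8

# Lengths that do not divide evenly still recycle, but R signals a warning;
# here the warning message is captured instead of printed
res <- tryCatch(x + c(1, 2, 3), warning = function(w) conditionMessage(w))
res
```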

Vectors

numeric(), character(), logical(), integer(), complex(), ...
NA: ‘not available’
factor(): values from restricted set of ‘levels’.
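
A short sketch of how NA and factor() behave in practice. The vectors age and group are made up for this example.

```r
age <- c(23, NA, 31)
is.na(age)                # FALSE TRUE FALSE
mean(age)                 # NA: missing values propagate through computations
mean(age, na.rm = TRUE)   # 27, after dropping the missing value

group <- factor(c("ctrl", "trt", "ctrl"), levels = c("ctrl", "trt"))
levels(group)             # the restricted set of allowed values
table(group)              # counts per level

# A value outside the levels cannot be represented and becomes NA
is.na(factor("other", levels = levels(group)))
```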

Operations

numeric: ==, <, <=, >, >=, ...
logical: | (or), & (and), ! (not)
subset: [, e.g., x[c(2, 3)]
assignment: [<-, e.g., x[c(1, 3)] = c(10, 30)
other: is.na()
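
The operations above combine naturally: a comparison produces a logical vector, logical operators refine it, and the result is used for subsetting or assignment. A minimal sketch with made-up values:

```r
x <- c(5, NA, 12, 8)
x > 6                         # FALSE NA TRUE TRUE: comparison with NA is NA
keep <- !is.na(x) & (x > 6)   # combine conditions element-wise
x[keep]                       # 12 8
x[is.na(x)] <- 0              # subset-assignment replaces only the selected elements
x                             # 5 0 12 8
```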

Functions

x = rnorm(100)
y = x + rnorm(100)
plot(x, y)

Many!

data.frame

df <- data.frame(Independent = x, Dependent = y)
head(df)

##   Independent  Dependent
## 1  -0.4338047 -0.5779168
## 2  -0.2769985 -1.0665115
## 3  -1.6966211 -1.8769578
## 4  -0.6481076 -0.9540841
## 5  -2.1015776 -1.1166887
## 6   0.7109163 -0.3363154

df[1:5, 1:2]

##   Independent  Dependent
## 1  -0.4338047 -0.5779168
## 2  -0.2769985 -1.0665115
## 3  -1.6966211 -1.8769578
## 4  -0.6481076 -0.9540841
## 5  -2.1015776 -1.1166887

df[1:5, ]

##   Independent  Dependent
## 1  -0.4338047 -0.5779168
## 2  -0.2769985 -1.0665115
## 3  -1.6966211 -1.8769578
## 4  -0.6481076 -0.9540841
## 5  -2.1015776 -1.1166887

plot(Dependent ~ Independent, df) # 'formula' interface

List of equal-length vectors
Vectors can be of different type
Two-dimensional subset and assignment
Column access: df[, 1], df[, "Independent"], df[[1]], df[["Independent"]], df$Independent
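
The points above can be checked on a small data.frame. This is a sketch; the samples data.frame and its columns are invented for illustration.

```r
# Equal-length vectors of different types, one per column
samples <- data.frame(
    id = c("s1", "s2", "s3"),
    count = c(10L, 25L, 7L),
    treated = c(FALSE, TRUE, TRUE),
    stringsAsFactors = FALSE
)
sapply(samples, class)   # character, integer, logical

# The different column-access styles return the same vector
identical(samples[, 2], samples[["count"]])
identical(samples[, "count"], samples$count)

# Two-dimensional subset: rows by condition, columns by name
samples[samples$treated, c("id", "count")]

# Two-dimensional assignment: change one cell
samples[samples$id == "s3", "count"] <- 9L
samples$count
```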

Exercise: plot only values with Dependent > 0, Independent > 0

Select rows

ridx <- (df$Dependent > 0) & (df$Independent > 0)

Plot subset

plot(Dependent ~ Independent, df[ridx, ])

Skin the cat another way

plot(
 Dependent ~ Independent, df,
 subset = (Dependent > 0) & (Independent > 0)
)
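
A third option is subset(), which evaluates the condition inside the data.frame. The sketch below rebuilds df so that it runs on its own; set.seed(123) is an assumption added for reproducibility and is not part of the original example, so the values differ from those shown earlier.

```r
set.seed(123)   # assumption: any seed will do
x <- rnorm(100)
y <- x + rnorm(100)
df <- data.frame(Independent = x, Dependent = y)

# Logical index, as in the exercise
ridx <- (df$Dependent > 0) & (df$Independent > 0)
by_index <- df[ridx, ]

# Same selection with subset(); column names are looked up in df
by_subset <- subset(df, Dependent > 0 & Independent > 0)

all.equal(by_index, by_subset)
plot(Dependent ~ Independent, by_subset)
```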

1.3 Analysis: functions, classes, methods

fit <- lm(Dependent ~ Independent, df) # linear model -- regression
anova(fit) # summary table

## Analysis of Variance Table
## 
## Response: Dependent
##             Df  Sum Sq Mean Sq F value    Pr(>F)
## Independent  1  92.664  92.664   70.32 3.787e-13 ***
## Residuals   98 129.139   1.318
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

plot(Dependent ~ Independent, df)
abline(fit)

lm(): plain-old function
fit: an object of class "lm"
anova(): a generic with a specific method for class "lm"

class(fit)

## [1] "lm"

methods(class="lm")

## [1] add1 alias anova case.names 
## [5] coerce confint cooks.distance deviance 
## [9] dfbeta dfbetas drop1 dummy.coef 
## [13] effects extractAIC family formula 
## [17] hatvalues influence initialize kappa 
## [21] labels logLik model.frame model.matrix 
## [25] nobs plot predict print 
## [29] proj qr residuals rstandard 
## [33] rstudent show simulate slotsFromS3 
## [37] summary variable.names vcov 
## see '?methods' for accessing help and source code
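
The relationship between a generic and its methods can be sketched with a toy S3 generic. The names describe(), describe.default() and describe.lm() are invented for this illustration; the cars data set ships with R.

```r
# A generic only dispatches on the class of its first argument
describe <- function(x, ...) UseMethod("describe")

# Fallback used when no class-specific method exists
describe.default <- function(x, ...) {
    paste("object of class", class(x)[1])
}

# Method selected for objects of class "lm"
describe.lm <- function(x, ...) {
    paste("linear model with", length(coef(x)), "coefficients")
}

fit <- lm(dist ~ speed, cars)
describe(fit)    # dispatches to describe.lm()
describe(1:3)    # no method for "integer", so describe.default() is used
```

This is the same mechanism by which anova(fit) ends up calling the method written for class "lm".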

1.4 Help!

?"plot" # plain-old-function or generic
?"plot.formula" # method
?"plot.lm" # method for object of class 'lm', plot(fit)

1.5 Packages

library(ggplot2)
ggplot(df, aes(x = Independent, y = Dependent)) +
 geom_point() + geom_smooth(method = "lm")

General purpose: >15,000 packages on CRAN
Gain contributor’s domain expertise and weird (or other) idiosyncrasies
Installation (once only per computer) versus load (via library(ggplot2), once per session)
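
The install-once, load-every-session pattern might look like the following sketch. ensure_installed() is a hypothetical helper written for this example, not an R or CRAN function; it is demonstrated with the stats package, which ships with R, so nothing is downloaded.

```r
# Hypothetical helper: install a CRAN package only if it is missing
ensure_installed <- function(pkg) {
    if (!requireNamespace(pkg, quietly = TRUE))
        install.packages(pkg)   # needed once per computer
    invisible(requireNamespace(pkg, quietly = TRUE))
}

ensure_installed("stats")   # already installed, so no installation happens
library(stats)              # needed once per session
```

For a package such as ggplot2 the same call would be ensure_installed("ggplot2") followed by library(ggplot2).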

2 Bioconductor

Started 2002 as a platform for understanding analysis of microarray data

2.1 Packages

1,750 packages. Domains of expertise:

Sequencing (RNASeq, ChIPSeq, single-cell, called variants, ...)
Microarrays (methylation, expression, copy number, ...)
flow cytometry
proteomics
...

Important themes

Reproducible research
Interoperability between packages & work flows
Usability

Resources

https://bioconductor.org
https://bioconductor.org/packages – software, annotation, experiment, workflow
https://support.bioconductor.org
Community slack (sign-up)

2.2 Objects

A distinctive feature of Bioconductor – use of objects for representing data

library(Biostrings)
dna <- DNAStringSet(c("AACTCC", "CTGCA"))
dna

##   A DNAStringSet instance of length 2
##     width seq
## [1]     6 AACTCC
## [2]     5 CTGCA

reverseComplement(dna)

##   A DNAStringSet instance of length 2
##     width seq
## [1]     6 GGAGTT
## [2]     5 TGCAG

Biostrings: DNA, RNA, AA representation and manipulation
GenomicRanges: Coordinates in genome space
SummarizedExperiment: coordinating ‘assay’ data (e.g., counts from an RNASeq experiment) with row and column annotations (e.g., information about samples and experimental treatments).
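
The idea of coordinating assay data with row and column annotations can be sketched with a toy S4 class in base R. MiniExperiment and subsetSamples() are invented for this illustration; they are not the SummarizedExperiment API, which is considerably richer.

```r
library(methods)

# Made-up class: an assay matrix plus annotations on its rows and columns
setClass("MiniExperiment",
    slots = c(assay = "matrix", rowData = "data.frame", colData = "data.frame"),
    validity = function(object) {
        msg <- character()
        if (nrow(object@assay) != nrow(object@rowData))
            msg <- c(msg, "rows of 'assay' and 'rowData' differ")
        if (ncol(object@assay) != nrow(object@colData))
            msg <- c(msg, "columns of 'assay' and rows of 'colData' differ")
        if (length(msg)) msg else TRUE
    }
)

# Subset samples; assay columns and sample annotations stay in sync
subsetSamples <- function(x, j) {
    new("MiniExperiment",
        assay = x@assay[, j, drop = FALSE],
        rowData = x@rowData,
        colData = x@colData[j, , drop = FALSE])
}

counts <- matrix(
    c(10L, 0L, 25L, 3L, 7L, 12L), nrow = 2,
    dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3"))
)
me <- new("MiniExperiment",
    assay = counts,
    rowData = data.frame(symbol = c("A", "B"), row.names = rownames(counts)),
    colData = data.frame(treatment = c("ctrl", "trt", "trt"),
                         row.names = colnames(counts))
)

treated <- subsetSamples(me, me@colData$treatment == "trt")
dim(treated@assay)   # 2 genes, 2 treated samples
treated@colData      # annotations for the same 2 samples
```

Because the validity check runs whenever an object is created, counts and annotations cannot silently drift out of step, which is the main reason for using objects rather than separate matrices and data.frames.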

2.3 High-throughput sequence work flow

Web site, https://bioconductor.org

1750 ‘software’ packages, https://bioconductor.org/packages

Sequence analysis (RNASeq, ChIPSeq, called variants, copy number, single cell)
Microarrays (methylation, copy number, classical expression, ...)
Annotation (more about annotations later this morning...)
Flow cytometry
Proteomics, image analysis, ...

Discovery and use, e.g., DESeq2

Landing pages: title, description (abstract), installation instructions, badges
Vignettes!

Also:

‘Annotation’ packages
‘Experiment data’ packages
Workflows
Course material, ...

3 End matter

3.1 Session Info

sessionInfo()

## R version 3.6.1 Patched (2019-07-16 r76845)
## Platform: x86_64-apple-darwin17.7.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS: /Users/ma38727/bin/R-3-6-branch/lib/libRblas.dylib
## LAPACK: /Users/ma38727/bin/R-3-6-branch/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets 
## [8] methods base 
## 
## other attached packages:
## [1] Biostrings_2.53.2 XVector_0.25.0 IRanges_2.19.10 
## [4] S4Vectors_0.23.17 BiocGenerics_0.31.5 ggplot2_3.2.0 
## [7] BiocStyle_2.13.2 
## 
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.1 pillar_1.4.2 compiler_3.6.1 
## [4] BiocManager_1.30.4 zlibbioc_1.31.0 tools_3.6.1 
## [7] digest_0.6.20 evaluate_0.14 tibble_2.1.3 
## [10] gtable_0.3.0 pkgconfig_2.0.2 rlang_0.4.0 
## [13] yaml_2.2.0 xfun_0.8 withr_2.1.2 
## [16] stringr_1.4.0 dplyr_0.8.3 knitr_1.23 
## [19] grid_3.6.1 tidyselect_0.2.5 glue_1.3.1 
## [22] R6_2.4.0 rmarkdown_1.14 bookdown_0.12 
## [25] purrr_0.3.2 magrittr_1.5 scales_1.0.0 
## [28] codetools_0.2-16 htmltools_0.3.6 assertthat_0.2.1 
## [31] colorspace_1.4-1 labeling_0.3 stringi_1.4.3 
## [34] lazyeval_0.2.2 munsell_0.5.0 crayon_1.3.4

3.2 Acknowledgements

Research reported in this tutorial was supported by the National Human Genome Research Institute and the National Cancer Institute of the National Institutes of Health under award numbers U41HG004059 and U24CA180996.

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 633974).

Lecture 1 – Introduction to R / Bioconductor
