Author: Martin Morgan
Date: 22 July, 2019

1 R

1.1 History of R and CRAN

  • Statistical programming language. Conceived 1992, initial version 1996, stable beta version in 2000; an implementation of S. CRAN started in 1997.
  • ‘Free’ software: no cost, open source, broad use.
  • Extensible: packages (15,000 on CRAN, 1,750 on Bioconductor)
  • Key features
    • Intrinsic statistical concepts
    • Vectorized computation
    • ‘Old-school’ scripts rather than graphical user interface – great for reproducibility!
    • (Advanced) copy-on-change semantics
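
Two of these features, vectorized computation and copy-on-change, can be seen in a few lines. This is a minimal sketch with made-up values; the names `a`, `b`, and `heights_cm` are illustrative and not part of the tutorial.

```r
# Copy-on-change: 'b' starts out with the same values as 'a',
# but modifying 'b' leaves 'a' untouched.
a <- c(1, 2, 3)
b <- a
b[1] <- 10
a    # still 1 2 3
b    # now 10 2 3

# Vectorized computation: one expression converts every element,
# with no explicit loop.
heights_cm <- c(150, 165, 180)   # made-up example data
heights_m <- heights_cm / 100
heights_m
```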

1.2 Vectors and data frames

1 + 2
## [1] 3
x = c(1, 2, 3)
1:3 # sequence of integers from 1 to 3
## [1] 1 2 3
x + c(4, 5, 6) # vectorized
## [1] 5 7 9
x + 4 # recycling
## [1] 5 6 7
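
Recycling applies to any shorter operand, not only to a single value. As a small sketch (the vectors `long` and `short` are illustrative), a length-2 vector is reused in order to match a length-4 vector; R warns when the longer length is not a multiple of the shorter one.

```r
long <- c(1, 2, 3, 4)
short <- c(10, 20)
result <- long + short   # 'short' is reused as 10, 20, 10, 20
result
## [1] 11 22 13 24
```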

Vectors

  • numeric(), character(), logical(), integer(), complex(), ...
  • NA: ‘not available’
  • factor(): values from restricted set of ‘levels’.
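
The behaviour of NA and factor() can be sketched as follows. The data are made up, and the names `measurements` and `dose` are illustrative.

```r
# NA marks a missing value and propagates through most computations
measurements <- c(1.5, NA, 3.5)
is.na(measurements)                              # FALSE TRUE FALSE
total <- sum(measurements)                       # NA: one value is unknown
total_known <- sum(measurements, na.rm = TRUE)   # 5, ignoring the NA

# A factor stores values drawn from a fixed set of levels
dose <- factor(c("low", "high", "low"), levels = c("low", "high"))
levels(dose)                # "low" "high"
codes <- as.integer(dose)   # 1 2 1: position of each value in the levels
```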

Operations

  • numeric: ==, <, <=, >, >=, ...
  • logical: | (or), & (and), ! (not)
  • subset: [, e.g., x[c(2, 3)]
  • assignment: [<-, e.g., x[c(1, 3)] = x[c(1, 3)]
  • other: is.na()
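
These operations combine naturally. The sketch below, using illustrative values, builds a logical vector from comparisons, uses it to subset, and then replaces elements with subset-assignment.

```r
x <- c(1, 2, 3)

keep <- (x > 1) & (x <= 3)   # element-wise comparisons, then logical 'and'
keep                         # FALSE TRUE TRUE
selected <- x[keep]          # subset with a logical vector: 2 3

x[c(1, 3)] <- c(10, 30)      # subset-assignment replaces elements 1 and 3
x                            # 10 2 30

x[2] <- NA
missing <- is.na(x)          # FALSE TRUE FALSE
```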

Functions

x = rnorm(100)
y = x + rnorm(100)
plot(x, y)
  • Many!

data.frame

df <- data.frame(Independent = x, Dependent = y)
head(df)
##   Independent  Dependent
## 1  -0.4338047 -0.5779168
## 2  -0.2769985 -1.0665115
## 3  -1.6966211 -1.8769578
## 4  -0.6481076 -0.9540841
## 5  -2.1015776 -1.1166887
## 6   0.7109163 -0.3363154
df[1:5, 1:2]
##   Independent  Dependent
## 1  -0.4338047 -0.5779168
## 2  -0.2769985 -1.0665115
## 3  -1.6966211 -1.8769578
## 4  -0.6481076 -0.9540841
## 5  -2.1015776 -1.1166887
df[1:5, ]
##   Independent  Dependent
## 1  -0.4338047 -0.5779168
## 2  -0.2769985 -1.0665115
## 3  -1.6966211 -1.8769578
## 4  -0.6481076 -0.9540841
## 5  -2.1015776 -1.1166887
plot(Dependent ~ Independent, df) # 'formula' interface
  • List of equal-length vectors
  • Vectors can be of different type
  • Two-dimensional subset and assignment
  • Column access: df[, 1], df[, "Independent"], df[[1]], df[["Independent"]], df$Independent
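
The column-access forms can be checked against one another on a small data frame. This is a sketch only; `toy` and its values are made up for illustration.

```r
toy <- data.frame(Independent = c(1, 2, 3), Dependent = c(2, 4, 6))

by_position <- toy[, 1]
by_name     <- toy[, "Independent"]
by_list_pos <- toy[[1]]
by_list_nm  <- toy[["Independent"]]
by_dollar   <- toy$Independent

# All five forms return the same numeric vector
identical(by_position, by_name)
identical(by_list_pos, by_list_nm)
identical(by_list_nm, by_dollar)

# '[[' requires the full column name; an abbreviation finds nothing
is.null(toy[["Indep"]])

# Two-dimensional subset-assignment: change a single cell
toy[2, "Dependent"] <- 40
toy
```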

Exercise: plot only values with Dependent > 0, Independent > 0

  1. Select rows

    ridx <- (df$Dependent > 0) & (df$Independent > 0)
  2. Plot subset

    plot(Dependent ~ Independent, df[ridx, ])
  3. Skin the cat another way

    plot(
     Dependent ~ Independent, df,
     subset = (Dependent > 0) & (Independent > 0)
    )
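
A further variant extracts the rows with subset() before plotting. The sketch below uses a small made-up data frame (`toy` is illustrative) so that the result is easy to check by eye, and confirms that subset() selects the same rows as the logical index from step 1.

```r
toy <- data.frame(
    Independent = c(-1,  0.5, 2, 3),
    Dependent   = c( 1, -0.5, 4, 5)
)

# Logical row index, as in step 1
ridx <- (toy$Dependent > 0) & (toy$Independent > 0)
by_index <- toy[ridx, ]

# subset() evaluates the condition using the data frame's own columns
by_subset <- subset(toy, (Dependent > 0) & (Independent > 0))

all.equal(by_index, by_subset)
nrow(by_subset)    # 2 rows satisfy both conditions
```

The subsetted data frame can then be passed to plot() in place of df[ridx, ].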

1.3 Analysis: functions, classes, methods

fit <- lm(Dependent ~ Independent, df) # linear model -- regression
anova(fit) # summary table
## Analysis of Variance Table
## 
## Response: Dependent
##             Df  Sum Sq Mean Sq F value    Pr(>F)
## Independent  1  92.664  92.664   70.32 3.787e-13 ***
## Residuals   98 129.139   1.318
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(Dependent ~ Independent, df)
abline(fit)
  • lm(): plain-old function
  • fit: an object of class "lm"
  • anova(): a generic with a specific method for class "lm"
class(fit)
## [1] "lm"
methods(class="lm")
##  [1] add1           alias          anova          case.names
##  [5] coerce         confint        cooks.distance deviance
##  [9] dfbeta         dfbetas        drop1          dummy.coef
## [13] effects        extractAIC     family         formula
## [17] hatvalues      influence      initialize     kappa
## [21] labels         logLik         model.frame    model.matrix
## [25] nobs           plot           predict        print
## [29] proj           qr             residuals      rstandard
## [33] rstudent       show           simulate       slotsFromS3
## [37] summary        variable.names vcov
## see '?methods' for accessing help and source code
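
The same generic-and-method mechanism is available to user code. The sketch below defines a hypothetical generic, describe(), with one method for class "lm" and a default method. These names, and the `toy` data, are invented for illustration and are not part of R or of this tutorial.

```r
# A generic dispatches on the class of its first argument
describe <- function(x, ...) UseMethod("describe")

# Method used for objects of class "lm"
describe.lm <- function(x, ...) {
    paste("linear model with", length(coef(x)), "coefficients")
}

# Fallback for objects of any other class
describe.default <- function(x, ...) {
    paste("object of class", class(x)[1])
}

toy <- data.frame(
    Independent = c(1, 2, 3, 4),
    Dependent   = c(2.1, 3.9, 6.2, 7.8)
)
toy_fit <- lm(Dependent ~ Independent, toy)

describe(toy_fit)   # dispatches to describe.lm()
describe(1:3)       # dispatches to describe.default()
```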

1.4 Help!

?"plot" # plain-old-function or generic
?"plot.formula" # method
?"plot.lm" # method for object of class 'lm', plot(fit)
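
When the name of a function or method is uncertain, it can also be looked up programmatically. A short sketch using functions shipped with R:

```r
# Retrieve the function that implements anova() for class "lm"
anova_lm <- getS3method("anova", "lm")
is.function(anova_lm)

# List objects on the search path whose names contain "anova"
matches <- apropos("anova")
matches

# Keyword search of installed documentation (left commented out
# because it opens a results listing):
# help.search("linear model")
```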

1.5 Packages

library(ggplot2)
ggplot(df, aes(x = Independent, y = Dependent)) +
 geom_point() + geom_smooth(method = "lm")
  • General purpose: >15,000 packages on CRAN
  • Gain contributor’s domain expertise and weird (or other) idiosyncrasies
  • Installation (once only per computer) versus load (via library(ggplot2), once per session)
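
The distinction between installing and loading can be sketched with a small helper. The function `install_if_missing()` is hypothetical, written here for illustration; it relies only on requireNamespace() and install.packages() from base R.

```r
# Hypothetical helper: install a CRAN package only when it is not
# already present on this computer.
install_if_missing <- function(pkg) {
    if (!requireNamespace(pkg, quietly = TRUE))
        install.packages(pkg)
    invisible(requireNamespace(pkg, quietly = TRUE))
}

# 'stats' ships with R, so nothing is downloaded here
available <- install_if_missing("stats")
available
```

For a CRAN package such as ggplot2, one would call install_if_missing("ggplot2") once per computer and then library(ggplot2) in each new session. Bioconductor packages are usually installed with BiocManager::install() rather than install.packages().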

2 Bioconductor

Started in 2002 as a platform for analyzing and understanding microarray data

2.1 Packages

1,750 packages. Domains of expertise:

  • Sequencing (RNASeq, ChIPSeq, single-cell, called variants, ...)
  • Microarrays (methylation, expression, copy number, ...)
  • Flow cytometry
  • Proteomics
  • ...

Important themes

  • Reproducible research
  • Interoperability between packages & work flows
  • Usability

Resources

2.2 Objects

A distinctive feature of Bioconductor – use of objects for representing data

library(Biostrings)
dna <- DNAStringSet(c("AACTCC", "CTGCA"))
dna
##   A DNAStringSet instance of length 2
##     width seq
## [1]     6 AACTCC
## [2]     5 CTGCA
reverseComplement(dna)
##   A DNAStringSet instance of length 2
##     width seq
## [1]     6 GGAGTT
## [2]     5 TGCAG
  • Biostrings: DNA, RNA, AA representation and manipulation
  • GenomicRanges: Coordinates in genome space
  • SummarizedExperiment: coordinating ‘assay’ data (e.g., counts from an RNASeq experiment) with row and column annotations (e.g., information about samples and experimental treatments).
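
To make the reverseComplement() result above concrete, here is a base-R sketch of the same computation on plain character vectors. The function `reverse_complement()` is hypothetical and is not how Biostrings implements the operation. It assumes upper-case A, C, G, T only and, unlike a DNAStringSet, does nothing to reject invalid letters.

```r
reverse_complement <- function(dna) {
    # Complement each base: A <-> T, C <-> G
    complemented <- chartr("ACGT", "TGCA", dna)
    # Split each sequence into letters, reverse, and paste back together
    bases <- strsplit(complemented, "")
    vapply(bases, function(b) paste(rev(b), collapse = ""), character(1))
}

reverse_complement(c("AACTCC", "CTGCA"))
## [1] "GGAGTT" "TGCAG"
```

A dedicated class carries this kind of knowledge (the valid alphabet, the meaning of complement) with the data, which is part of the appeal of the object-based approach.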

2.3 High-throughput sequence work flow

Web site, https://bioconductor.org

1,750 ‘software’ packages, https://bioconductor.org/packages

  • Sequence analysis (RNASeq, ChIPSeq, called variants, copy number, single cell)
  • Microarrays (methylation, copy number, classical expression, ...)
  • Annotation (more about annotations later this morning...)
  • Flow cytometry
  • Proteomics, image analysis, ...

Discovery and use, e.g., DESeq2

  • Landing pages: title, description (abstract), installation instructions, badges
  • Vignettes!

Also:

  • ‘Annotation’ packages
  • ‘Experiment data’ packages
  • Workflows
  • Course material, ...

3 End matter

3.1 Session Info

sessionInfo()
## R version 3.6.1 Patched (2019-07-16 r76845)
## Platform: x86_64-apple-darwin17.7.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS: /Users/ma38727/bin/R-3-6-branch/lib/libRblas.dylib
## LAPACK: /Users/ma38727/bin/R-3-6-branch/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets 
## [8] methods base 
## 
## other attached packages:
## [1] Biostrings_2.53.2 XVector_0.25.0 IRanges_2.19.10 
## [4] S4Vectors_0.23.17 BiocGenerics_0.31.5 ggplot2_3.2.0 
## [7] BiocStyle_2.13.2 
## 
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.1 pillar_1.4.2 compiler_3.6.1 
## [4] BiocManager_1.30.4 zlibbioc_1.31.0 tools_3.6.1 
## [7] digest_0.6.20 evaluate_0.14 tibble_2.1.3 
## [10] gtable_0.3.0 pkgconfig_2.0.2 rlang_0.4.0 
## [13] yaml_2.2.0 xfun_0.8 withr_2.1.2 
## [16] stringr_1.4.0 dplyr_0.8.3 knitr_1.23 
## [19] grid_3.6.1 tidyselect_0.2.5 glue_1.3.1 
## [22] R6_2.4.0 rmarkdown_1.14 bookdown_0.12 
## [25] purrr_0.3.2 magrittr_1.5 scales_1.0.0 
## [28] codetools_0.2-16 htmltools_0.3.6 assertthat_0.2.1 
## [31] colorspace_1.4-1 labeling_0.3 stringi_1.4.3 
## [34] lazyeval_0.2.2 munsell_0.5.0 crayon_1.3.4

3.2 Acknowledgements

Research reported in this tutorial was supported by the National Human Genome Research Institute and the National Cancer Institute of the National Institutes of Health under award numbers U41HG004059 and U24CA180996.

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 633974).
