1 Read the data in to R
- 1.1 Use read.table() to read ALLphenoData.tsv
- 1.2 Use read.table() to read the expression values
2 Make a SummarizedExperiment object
3 Show off your skills
4 Document your work
5 Resources
- 5.1 sessionInfo()

The material in this course requires R version 3.2 and Bioconductor version 3.2

stopifnot(
 getRversion() >= '3.2' && getRversion() < '3.3',
 BiocInstaller::biocVersion() == "3.2"
)

Your boss has been working on Acute Lymphocytic Lukemia (ALL) for many years. One data set consists of microarray gene expression values for 12625 genes in 128 different samples. Your boss would like to analyze different subsets of the data, and has given you a couple of tab-delimited files. One file (ALLphenoData.tsv) describes the samples, the other (ALLassay.tsv) contains pre-processed gene expression data. You are supposed to come up with a way to create the subsets your boss asks you for. You realize that you could read the data in to Excel and manipulate it there, but you’re concerned about being able to do reproducible research and you’re nervous about the bookkeeping errors that always seem to come up. So you think you’ll give Bioconductor a try...

1 Read the data in to R

Download the ALLphenoData.tsv and ALLassay.tsv files to the current workign directory, getwd().

1.1 Use `read.table()` to read ALLphenoData.tsv

fname = "ALLphenoData.tsv" ## use file.choose() to find the file
pdata = read.table(fname)

Check out the help page ?read.delim for input options, and explore basic properties of the object you’ve created, for instance...

class(pdata)

## [1] "data.frame"

colnames(pdata)

## [1] "cod" "diagnosis" "sex" "age" "BT" 
## [6] "remission" "CR" "date.cr" "t.4.11." "t.9.22." 
## [11] "cyto.normal" "citog" "mol.biol" "fusion.protein" "mdr" 
## [16] "kinet" "ccr" "relapse" "transplant" "f.u" 
## [21] "date.last.seen"

dim(pdata)

## [1] 128 21

head(pdata)

## cod diagnosis sex age BT remission CR date.cr t.4.11. t.9.22. cyto.normal citog
## 01005 1005 5/21/1997 M 53 B2 CR CR 8/6/1997 FALSE TRUE FALSE t(9;22)
## 01010 1010 3/29/2000 M 19 B2 CR CR 6/27/2000 FALSE FALSE FALSE simple alt.
## 03002 3002 6/24/1998 F 52 B4 CR CR 8/17/1998 NA NA NA <NA>
## 04006 4006 7/17/1997 M 38 B1 CR CR 9/8/1997 TRUE FALSE FALSE t(4;11)
## 04007 4007 7/22/1997 M 57 B2 CR CR 9/17/1997 FALSE FALSE FALSE del(6q)
## 04008 4008 7/30/1997 M 17 B1 CR CR 9/27/1997 FALSE FALSE FALSE complex alt.
## mol.biol fusion.protein mdr kinet ccr relapse transplant f.u date.last.seen
## 01005 BCR/ABL p210 NEG dyploid FALSE FALSE TRUE BMT / DEATH IN CR <NA>
## 01010 NEG <NA> POS dyploid FALSE TRUE FALSE REL 8/28/2000
## 03002 BCR/ABL p190 NEG dyploid FALSE TRUE FALSE REL 10/15/1999
## 04006 ALL1/AF4 <NA> NEG dyploid FALSE TRUE FALSE REL 1/23/1998
## 04007 NEG <NA> NEG dyploid FALSE TRUE FALSE REL 11/4/1997
## 04008 NEG <NA> NEG hyperd. FALSE TRUE FALSE REL 12/15/1997

summary(pdata$sex)

## F M NA's 
## 42 83 3

summary(pdata$cyto.normal)

## Mode FALSE TRUE NA's 
## logical 69 24 35

Remind yourselves about various ways to subset and access columns of a data.frame

pdata[1:5, 3:4]

## sex age
## 01005 M 53
## 01010 M 19
## 03002 F 52
## 04006 M 38
## 04007 M 57

pdata[1:5, ]

## cod diagnosis sex age BT remission CR date.cr t.4.11. t.9.22. cyto.normal citog
## 01005 1005 5/21/1997 M 53 B2 CR CR 8/6/1997 FALSE TRUE FALSE t(9;22)
## 01010 1010 3/29/2000 M 19 B2 CR CR 6/27/2000 FALSE FALSE FALSE simple alt.
## 03002 3002 6/24/1998 F 52 B4 CR CR 8/17/1998 NA NA NA <NA>
## 04006 4006 7/17/1997 M 38 B1 CR CR 9/8/1997 TRUE FALSE FALSE t(4;11)
## 04007 4007 7/22/1997 M 57 B2 CR CR 9/17/1997 FALSE FALSE FALSE del(6q)
## mol.biol fusion.protein mdr kinet ccr relapse transplant f.u date.last.seen
## 01005 BCR/ABL p210 NEG dyploid FALSE FALSE TRUE BMT / DEATH IN CR <NA>
## 01010 NEG <NA> POS dyploid FALSE TRUE FALSE REL 8/28/2000
## 03002 BCR/ABL p190 NEG dyploid FALSE TRUE FALSE REL 10/15/1999
## 04006 ALL1/AF4 <NA> NEG dyploid FALSE TRUE FALSE REL 1/23/1998
## 04007 NEG <NA> NEG dyploid FALSE TRUE FALSE REL 11/4/1997

head(pdata[, 3:5])

## sex age BT
## 01005 M 53 B2
## 01010 M 19 B2
## 03002 F 52 B4
## 04006 M 38 B1
## 04007 M 57 B2
## 04008 M 17 B1

tail(pdata[, 3:5], 3)

## sex age BT
## 65003 M 30 T3
## 83001 M 29 T2
## LAL4 <NA> NA T

head(pdata$age)

## [1] 53 19 52 38 57 17

head(pdata$sex)

## [1] M M F M M M
## Levels: F M

head(pdata[pdata$age > 21,])

## cod diagnosis sex age BT remission CR date.cr t.4.11. t.9.22. cyto.normal citog
## 01005 1005 5/21/1997 M 53 B2 CR CR 8/6/1997 FALSE TRUE FALSE t(9;22)
## 03002 3002 6/24/1998 F 52 B4 CR CR 8/17/1998 NA NA NA <NA>
## 04006 4006 7/17/1997 M 38 B1 CR CR 9/8/1997 TRUE FALSE FALSE t(4;11)
## 04007 4007 7/22/1997 M 57 B2 CR CR 9/17/1997 FALSE FALSE FALSE del(6q)
## 08001 8001 1/15/1997 M 40 B2 CR CR 3/26/1997 FALSE FALSE FALSE del(p15)
## 08011 8011 8/21/1998 M 33 B3 CR CR 10/8/1998 FALSE FALSE FALSE del(p15/p16)
## mol.biol fusion.protein mdr kinet ccr relapse transplant f.u date.last.seen
## 01005 BCR/ABL p210 NEG dyploid FALSE FALSE TRUE BMT / DEATH IN CR <NA>
## 03002 BCR/ABL p190 NEG dyploid FALSE TRUE FALSE REL 10/15/1999
## 04006 ALL1/AF4 <NA> NEG dyploid FALSE TRUE FALSE REL 1/23/1998
## 04007 NEG <NA> NEG dyploid FALSE TRUE FALSE REL 11/4/1997
## 08001 BCR/ABL p190 NEG <NA> FALSE TRUE FALSE REL 7/11/1997
## 08011 BCR/ABL p190/p210 NEG dyploid FALSE FALSE TRUE BMT / DEATH IN CR <NA>

1.2 Use `read.table()` to read the expression values

fname <- "ALLassay.tsv"
exprs <- as.matrix(read.table(fname, check.names=FALSE))

Use dim() to figure out the number of rows and columns in the expression data. Use subscripts to look at the first few rows and columns exprs[1:5, 1:5]. What are the row names? Do the column names agree with the row names of the pdata object? What is the range() of the expression data? Can you create a histogram (hint: hist()) of the data? What is plot(density(exprs))? Can you use plot() and lines() to plot the density of each sample, in a single figure?

2 Make a SummarizedExperiment object

You could work with the matrix and data frame directly, but it is better to put these related parts of the data into a single object, a SummarizedExperiment.

Load the appropriate Bioconductor package

if (BiocInstaller::biocVersion() >= "3.2") {
 library(SummarizedExperiment)
} else {
 library(GenomicRanges)
}

and create a single SummarizedExperiment object from the two parts of the data. Some Bioconductor objects enhance the behavior of base R objects; an example of this is DataFrame()

se <- SummarizedExperiment(exprs, colData=DataFrame(pdata))

Explore the object, noting that you can retrieve the original elements, and can subset in a coordinated fashion.

head(colData(se))

## DataFrame with 6 rows and 21 columns
## cod diagnosis sex age BT remission CR date.cr t.4.11.
## <factor> <factor> <factor> <integer> <factor> <factor> <factor> <factor> <logical>
## 01005 1005 5/21/1997 M 53 B2 CR CR 8/6/1997 FALSE
## 01010 1010 3/29/2000 M 19 B2 CR CR 6/27/2000 FALSE
## 03002 3002 6/24/1998 F 52 B4 CR CR 8/17/1998 NA
## 04006 4006 7/17/1997 M 38 B1 CR CR 9/8/1997 TRUE
## 04007 4007 7/22/1997 M 57 B2 CR CR 9/17/1997 FALSE
## 04008 4008 7/30/1997 M 17 B1 CR CR 9/27/1997 FALSE
## t.9.22. cyto.normal citog mol.biol fusion.protein mdr kinet ccr
## <logical> <logical> <factor> <factor> <factor> <factor> <factor> <logical>
## 01005 TRUE FALSE t(9;22) BCR/ABL p210 NEG dyploid FALSE
## 01010 FALSE FALSE simple alt. NEG NA POS dyploid FALSE
## 03002 NA NA NA BCR/ABL p190 NEG dyploid FALSE
## 04006 FALSE FALSE t(4;11) ALL1/AF4 NA NEG dyploid FALSE
## 04007 FALSE FALSE del(6q) NEG NA NEG dyploid FALSE
## 04008 FALSE FALSE complex alt. NEG NA NEG hyperd. FALSE
## relapse transplant f.u date.last.seen
## <logical> <logical> <factor> <factor>
## 01005 FALSE TRUE BMT / DEATH IN CR NA
## 01010 TRUE FALSE REL 8/28/2000
## 03002 TRUE FALSE REL 10/15/1999
## 04006 TRUE FALSE REL 1/23/1998
## 04007 TRUE FALSE REL 11/4/1997
## 04008 TRUE FALSE REL 12/15/1997

assay(se)[1:5, 1:5]

## 01005 01010 03002 04006 04007
## 1000_at 7.597323 7.479445 7.567593 7.384684 7.905312
## 1001_at 5.046194 4.932537 4.799294 4.922627 4.844565
## 1002_f_at 3.900466 4.208155 3.886169 4.206798 3.416923
## 1003_s_at 5.903856 6.169024 5.860459 6.116890 5.687997
## 1004_at 5.925260 5.912780 5.893209 6.170245 5.615210

se$sex %in% "M"

## [1] TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
## [31] FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## [46] TRUE FALSE TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE FALSE
## [61] TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE FALSE
## [76] TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
## [91] TRUE TRUE FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
## [106] TRUE FALSE FALSE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE
## [121] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE

males <- se[,se$sex %in% "M"]
males

## class: SummarizedExperiment0 
## dim: 12625 83 
## metadata(0):
## assays(1): ''
## rownames(12625): 1000_at 1001_at ... AFFX-YEL021w/URA3_at AFFX-YEL024w/RIP1_at
## metadata column names(0):
## colnames(83): 01005 01010 ... 65003 83001
## colData names(21): cod diagnosis ... f.u date.last.seen

assay(males)[1:5, 1:5]

## 01005 01010 04006 04007 04008
## 1000_at 7.597323 7.479445 7.384684 7.905312 7.065914
## 1001_at 5.046194 4.932537 4.922627 4.844565 5.147762
## 1002_f_at 3.900466 4.208155 4.206798 3.416923 3.945869
## 1003_s_at 5.903856 6.169024 6.116890 5.687997 6.208061
## 1004_at 5.925260 5.912780 6.170245 5.615210 5.923487

Use vignette("SummarizedExperiment") to read about other operations on SummarizedExperiment.

3 Show off your skills

Quickly create the following subsets of data for your boss:

All women in the study.
All women over 40
An object bcrabl containing individuals with mol.biol belonging either to "BCR/ABL" or "NEG".

Can you...?

Create a new column that simplifies the BT column (which lists different B- and T-cell subtypes) to contain just B or T, e.g., re-coding B, B1, B2, B3 and B4 to simply B, and likewise for T?
Use aggregate() to calculate the average age of males and females in the BCR/ABL and NEG treatment groups?
Use t.test() to compare the age of individuals in the BCR/ABL versus NEG groups; visualize the results using boxplot(). In both cases, use the formula interface. Consult the help page ?t.test and re-do the test assuming that variance of ages in the two groups is identical. What parts of the test output change?

4 Document your work

Summarize the exercises above in a simple script. Can you figure out how to write a ‘markdown’ document that includes R code chunks, as well as text describing what you did, and figures and tables showing the results?

5 Resources

Acknowledgements

Core (Seattle): Sonali Arora, Marc Carlson, Nate Hayden, Jim Hester, Valerie Obenchain, Hervé Pagès, Paul Shannon, Dan Tenenbaum.
The research reported in this presentation was supported by the National Cancer Institute and the National Human Genome Research Institute of the National Institutes of Health under Award numbers U24CA180996 and U41HG004059, and the National Science Foundation under Award number 1247813. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the National Science Foundation.

5.1 `sessionInfo()`

sessionInfo()

## R version 3.2.2 (2015年08月14日)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux stretch/sid
## 
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 
## [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C 
## [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C 
## 
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets methods base 
## 
## other attached packages:
## [1] ALL_1.11.0 org.Hs.eg.db_3.2.3 
## [3] RSQLite_1.0.0 DBI_0.3.1 
## [5] ggplot2_1.0.1 airway_0.103.1 
## [7] limma_3.25.18 DESeq2_1.9.51 
## [9] RcppArmadillo_0.6.100.0.0 Rcpp_0.12.1 
## [11] BSgenome.Hsapiens.UCSC.hg19_1.4.0 BSgenome_1.37.6 
## [13] rtracklayer_1.29.28 TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [15] GenomicFeatures_1.21.33 AnnotationDbi_1.31.19 
## [17] SummarizedExperiment_0.3.11 Biobase_2.29.1 
## [19] GenomicRanges_1.21.32 GenomeInfoDb_1.5.16 
## [21] microbenchmark_1.4-2 Biostrings_2.37.8 
## [23] XVector_0.9.4 IRanges_2.3.26 
## [25] S4Vectors_0.7.23 BiocGenerics_0.15.11 
## [27] BiocStyle_1.7.9 
## 
## loaded via a namespace (and not attached):
## [1] splines_3.2.2 Formula_1.2-1 latticeExtra_0.6-26 
## [4] Rsamtools_1.21.21 yaml_2.1.13 lattice_0.20-33 
## [7] digest_0.6.8 RColorBrewer_1.1-2 colorspace_1.2-6 
## [10] sandwich_2.3-4 htmltools_0.2.6 plyr_1.8.3 
## [13] XML_3.98-1.3 biomaRt_2.25.3 genefilter_1.51.1 
## [16] zlibbioc_1.15.0 xtable_1.7-4 mvtnorm_1.0-3 
## [19] scales_0.3.0 BiocParallel_1.3.54 annotate_1.47.4 
## [22] TH.data_1.0-6 nnet_7.3-11 proto_0.3-10 
## [25] survival_2.38-3 magrittr_1.5 evaluate_0.8 
## [28] MASS_7.3-44 foreign_0.8-66 BiocInstaller_1.19.14 
## [31] tools_3.2.2 formatR_1.2.1 multcomp_1.4-1 
## [34] stringr_1.0.0 munsell_0.4.2 locfit_1.5-9.1 
## [37] cluster_2.0.3 lambda.r_1.1.7 futile.logger_1.4.1 
## [40] grid_3.2.2 RCurl_1.95-4.7 labeling_0.3 
## [43] bitops_1.0-6 rmarkdown_0.8.1 gtable_0.1.2 
## [46] codetools_0.2-14 reshape2_1.4.1 GenomicAlignments_1.5.18
## [49] gridExtra_2.0.0 zoo_1.7-12 knitr_1.11 
## [52] Hmisc_3.17-0 futile.options_1.0.0 stringi_0.5-5 
## [55] geneplotter_1.47.0 rpart_4.1-10 acepack_1.3-3.3

Working With Data: `SummarizedExperiment`

Martin Morgan (martin.morgan@roswellpark.org)
Roswell Park Cancer Institute, Buffalo, NY
5 - 9 October, 2015

Contents

1 Read the data in to R

1.1 Use `read.table()` to read ALLphenoData.tsv

1.2 Use `read.table()` to read the expression values

2 Make a SummarizedExperiment object

3 Show off your skills

4 Document your work

5 Resources

5.1 `sessionInfo()`

Working With Data: SummarizedExperiment

Martin Morgan (martin.morgan@roswellpark.org) Roswell Park Cancer Institute, Buffalo, NY 5 - 9 October, 2015

Contents

1 Read the data in to R

1.1 Use read.table() to read ALLphenoData.tsv

1.2 Use read.table() to read the expression values

2 Make a SummarizedExperiment object

3 Show off your skills

4 Document your work

5 Resources

5.1 sessionInfo()

Working With Data: `SummarizedExperiment`

Martin Morgan (martin.morgan@roswellpark.org)
Roswell Park Cancer Institute, Buffalo, NY
5 - 9 October, 2015

1.1 Use `read.table()` to read ALLphenoData.tsv

1.2 Use `read.table()` to read the expression values

5.1 `sessionInfo()`