Motivation

Is expression of genes in a gene set associated with experimental condition?

E.g., Are there unusually many up-regulated genes in the gene set?

Many methods, a recent review is Kharti et al., 2012.

Over-representation analysis (ORA) – are differentially expressed (DE) genes in the set more common than expected?
Functional class scoring (FCS) – summarize statistic of DE of genes in a set, and compare to null
Pathway topology (PT) – include pathway knowledge in assessing DE of genes in a set

What is a gene set?

Any a priori classification of `genes’ into biologically relevant groups

Members of same biochemical pathway
Proteins expressed in identical cellular compartments
Co-expressed under certain conditions
Targets of the same regulatory elements
On the same cytogenic band
...

Sets do not need to be...

Exhaustive
Disjoint

Collections of gene sets

Gene Ontology (GO) Annotation (GOA)

CC Cellular Components
BP Biological Processes
MF Molecular Function

Pathways

MSigDb
KEGG (no longer freely available)
reactome
PantherDB
...

E.g., MSigDb

c1 Positional gene sets – chromosome & cytogenic band
c2 Curated Gene Sets from online pathway databases, publications in PubMed, and knowledge of domain experts.
c3 motif gene sets based on conserved cis-regulatory motifs from a comparative analysis of the human, mouse, rat, and dog genomes.
c4 computational gene sets defined by mining large collections of cancer-oriented microarray data.
c5 GO gene sets consist of genes annotated by the same GO terms.
c6 oncogenic signatures defined directly from microarray gene expression data from cancer gene perturbations.
c7 immunologic signatures defined directly from microarray gene expression data from immunologic studies.

Statistical approaches

Initially based on a presentation by Simon Anders, CSAMA 2010

Approach 1: hypergeometric tests

Steps

Classify each gene as ‘differentially expressed’ DE or not, e.g., based on P < 0.05
Are DE genes in the set more common than DE genes not in the set?

Yes	No
In gene set?
Differentially	Yes	k	K
expressed?	No	n - k	N - K

Fisher hypergeometric test, via fiser.test() or GOstats

Notes

Conditional hypergeometric to accommodate GO DAG, GOstats
But: artificial division into two groups

Approach 2: enrichment score

Mootha et al., 2003; modified Subramanian et al., 2005.

Steps

Sort genes by log fold change
Calculate running sum: incremented when gene in set, decremented when not.
Maximum of the running sum is enrichment score ES; large ES means that genes in set are toward top of list.
Permuting subject labels for signficance

Approach 3: category \(t\)-test

E.g., Jiang & Gentleman, 2007;

Summarize \(t\) (or other) statistic in each set
Test for significance by permuting the subject labels
Much more straight-forward to implement

Expression in NEG vs BCR/ABL samples for genes in the ‘ribosome’ KEGG pathway; Category vignette.

Competitive versus self-contained null hypothesis

Goemann & Bühlmann, 2007

Competitive null: The genes in the gene set do not have stronger association with the subject condition than other genes. (Approach 1, 2)
Self-contained null: The genes in the gene set do not have any association with the subject condition. (Approach 3)
Probably, self-contained null is closer to actual question of interest
Permuting subjects (rather than genes) is appropriate

Approach 4: linear models

E.g., Hummel et al., 2008,

Colorectal tumors have good (‘stage II’) or bad (‘stage III’) prognosis. Do genes in the p53 pathway (just one gene set!) show different activity at the two stages?
Linear model incorporates covariates – sex of patient, location of tumor

limma

Majewski et al., 2010 romer() and Wu & Smythe 2012 camera() for enrichment (competitive null) linear models
Wu et al., 2010: roast(), mroast() for self-contained null linear models

Approach 5: pathway topology

E.g., Tarca et al., 2009,

Incorporate pathway topology (e.g., interactions between gene products) into signficance testing
Signaling Pathway Impact Analysis
Combined evidence: pathway over-representation \(P_{NDE}\); unusual signaling \(P_{PERT}\) (equation 1 of Tarca et al.)

Evidence plot, colorectal cancer. Points: pathway gene sets. Significant after Bonferroni (red) or FDR (blue) correction.

Issues with sequence data?

All else being equal, long genes receive more reads than short genes
Per-gene \(P\) values proportional to gene size

E.g., Young et al., 2010, goseq

Hypergeometric, weighted by gene size
Substantial differences
Better: read depth??

DE genes vs. transcript length. Points: bins of 300 genes. Line: fitted probability weighting function.

Approach 6: de novo discovery

So far: analogous to supervised machine learning, where pathways are known in advance
What about unsupervised discovery?

Example: Langfelder & Hovarth, WGCNA

Weighted correlation network analysis
Described in Langfelder & Horvath, 2008

Representing gene sets in R

Named list(), where names of the list are sets, and each element of the list is a vector of genes in the set.
data.frame() of set name / gene name pairs
GSEABase

Conclusions

Gene set enrichment classifications

Kharti et al: Over-representation analysis; functional class scoring; pathway topology
Goemann & Bühlmann: Competitive vs. self-contained null

Selected Packages

Approach	Packages
Hypergeometric	GOstats , topGO
Enrichment	limma `::romer()`
Category \(t\)-test	Category
Linear model	GlobalAncova , GSEAlm , limma `::roast()`
Pathway topology	SPIA
Sequence-specific	goseq
de novo	WGCNA

Practical

This practical is based on section 6 of the goseq vignette.

1-6 Experimental design, ..., Analysis of gene differential expression

This (relatively old) experiment examined the effects of androgen stimulation on a human prostate cancer cell line, LNCaP (Li et al., 2008). The experiment used short (35bp) single-end reads from 4 control and 3 untreated lines. Reads were aligned to hg19 using Bowtie, and counted using ENSEMBL 54 gene models.

Input the data to edgeR ’s DGEList data structure.

library(edgeR)
path <- system.file(package="goseq", "extdata", "Li_sum.txt")
table.summary <- read.table(path, sep='\t', header=TRUE, stringsAsFactors=FALSE)
counts <- table.summary[,-1]
rownames(counts) <- table.summary[,1]
grp <- factor(rep(c("Control","Treated"), times=c(4,3)))
summarized <- DGEList(counts, lib.size=colSums(counts), group=grp)

Use a ‘common’ dispersion estimate, and compare the two groups using an exact test

disp <- estimateCommonDisp(summarized)
tested <- exactTest(disp)
topTags(tested)

## Comparison of groups: Treated-Control 
## logFC logCPM PValue FDR
## ENSG00000127954 11.557868 6.680748 2.574972e-80 1.274766e-75
## ENSG00000151503 5.398963 8.499530 1.781732e-65 4.410322e-61
## ENSG00000096060 4.897600 9.446705 7.983756e-60 1.317479e-55
## ENSG00000091879 5.737627 6.282646 1.207655e-54 1.494654e-50
## ENSG00000132437 -5.880436 7.951910 2.950042e-52 2.920896e-48
## ENSG00000166451 4.564246 8.458467 7.126763e-52 5.880292e-48
## ENSG00000131016 5.254737 6.607957 1.066807e-51 7.544766e-48
## ENSG00000163492 7.085400 5.128514 2.716461e-45 1.681014e-41
## ENSG00000113594 4.051053 8.603264 9.272066e-44 5.100255e-40
## ENSG00000116285 4.108522 7.864773 6.422468e-43 3.179507e-39

7. Comprehension

Start by extracting all P values, then correcting for multiple comparison using p.adjust(). Classify the genes as differentially expressed or not.

padj <- with(tested$table, {
 keep <- logFC != 0
 value <- p.adjust(PValue[keep], method="BH")
 setNames(value, rownames(tested)[keep])
})
genes <- padj < 0.05
table(genes)

## genes
## FALSE TRUE 
## 19535 3208

Gene symbol to pathway

Under the hood, goseq uses Bioconductor annotation packages (in this case org.Hs.eg.db and r Biocannopkg("GO.db") to map from gene symbols to GO pathways.

Expore these packages through the columns() and select() functions. Can you map between ENSEMBL gene identifiers (the row names of topTable()) to GO pathway? What about ‘drilling down’ on particular GO identifiers to discover the term’s definition?

Probability weighting function

Calculate the weighting for each gene. This looks up the gene lengths in a pre-defined table (how could these be calculated using TxDb packages? What challenges are associated with calculating these ‘weights’, based on the knowledge that genes typically consist of several transcripts, each expressed differently?)

pwf <- nullp(genes,"hg19","ensGene")

## Loading hg19 length data...

## Warning in pcls(G): initial point very close to some inequality
## constraints

head(pwf)

## DEgenes bias.data pwf
## ENSG00000230758 FALSE 247 0.03757470
## ENSG00000182463 FALSE 3133 0.20436865
## ENSG00000124208 FALSE 1978 0.16881769
## ENSG00000230753 FALSE 466 0.06927243
## ENSG00000224628 FALSE 1510 0.15903532
## ENSG00000125835 FALSE 954 0.12711992

Over- and under-representation

Perform the main analysis. This includes association of genes to GO pathway

GO.wall <- goseq(pwf, "hg19", "ensGene")

## Fetching GO annotations...
## For 9751 genes, we could not find any categories. These genes will be excluded.
## To force their use, please run with use_genes_without_cat=TRUE (see documentation).
## This was the default behavior for version 1.15.1 and earlier.
## Calculating the p-values...

head(GO.wall)

## category over_represented_pvalue under_represented_pvalue
## 10729 GO:0044763 8.237627e-15 1
## 10708 GO:0044699 2.079753e-14 1
## 2453 GO:0005737 2.956026e-10 1
## 3004 GO:0006614 6.131543e-09 1
## 7499 GO:0031982 1.101818e-08 1
## 2372 GO:0005576 1.339207e-08 1
## numDEInCat numInCat
## 10729 1893 8355
## 10708 2054 9201
## 2453 1790 8080
## 3004 39 107
## 7499 638 2675
## 2372 669 2836
## term ontology
## 10729 single-organism cellular process BP
## 10708 single-organism process BP
## 2453 cytoplasm CC
## 3004 SRP-dependent cotranslational protein targeting to membrane BP
## 7499 vesicle CC
## 2372 extracellular region CC

What if we’d ignored gene length?

Here we do the same operation, but ignore gene lengths

GO.nobias <- goseq(pwf,"hg19","ensGene",method="Hypergeometric")

## Fetching GO annotations...
## For 9751 genes, we could not find any categories. These genes will be excluded.
## To force their use, please run with use_genes_without_cat=TRUE (see documentation).
## This was the default behavior for version 1.15.1 and earlier.
## Calculating the p-values...

Compare the over-represented P-values for each set, under the different methods

idx <- match(GO.nobias$category, GO.wall$category)
plot(log10(GO.nobias[, "over_represented_pvalue"]) ~
 log10(GO.wall[idx, "over_represented_pvalue"]),
 xlab="Wallenius", ylab="Hypergeometric",
 xlim=c(-5, 0), ylim=c(-5, 0))
abline(0, 1, col="red", lwd=2)

References

Khatri et al., 2012, PLoS Comp Biol 8.2: e1002375.
Subramanian et al., 2005, PNAS 102.43: 15545-15550.
Jiang & Gentleman, 2007, Bioinformatics Feb 1;23(3):306-13.
Goeman & B"uhlmann, 2007, Bioinformatics 23.8: 980-987.
Hummel et al., 2008, Bioinformatics 24.1: 78-85.
Wu & Smyth 2012, Nucleic Acids Research 40, e133.
Wu et al., 2010 Bioinformatics 26, 2176-2182.
Majewski et al., 2010, Blood, published online 5 May 2010.
Tarca et al., 2009, Bioinformatics 25.1: 75-82.
Young et al., 2010, Genome Biology 11:R14.

E. Gene Set Enrichiment

Martin Morgan (mtmorgan@fredhutch.org)

2015年04月07日