16.8.13
Reproducible Methods (or bad bioinformatics methods sections)
Science is built on the idea of reproducible results: if I publish something, it should be possible for someone else to reproduce it using the description of the methods in the publication. As the biological sciences have become increasingly reliant on computational methods, this has become a bigger and bigger issue, especially as results come to depend on independently developed code, or on sophisticated software packages that come in multiple versions and have a variety of settings that can affect the output. For further discussion on this issue, you might want to read 1, 2.
I recently read a couple of publications that made me realize just how big a problem this is. I want to spend some time showing what the problem is in these publications, and why we should be concerned about the current state of computational analytical reproducibility in the life sciences.
In both of the articles discussed below, I do not believe that I, or anyone not associated with the project, would be able to generate even approximately similar results based solely on the raw data and the description of methods provided. Ultimately, this is a failure both of those doing the analysis and of the reviewers who reviewed the work, and it is a rather deplorable situation for a field that prides itself on verification of results. This is why I'm calling these bad bioinformatics methods sections.
Puthanveettil et al., Synaptic Transcriptome
Puthanveettil et al. (2013) recently published a paper in PNAS titled "A strategy to capture and characterize the synaptic transcriptome". Although the primary development reported is a new method for characterizing RNA complexes carried by kinesin, much of the subsequent analysis is bioinformatic in nature.
For example, they used BLAST searches to identify the RNA molecules, and a cutoff value is reported in the results. However, functional characterization using Gene Ontology (GO) was carried out by "Bioinformatics analyses" (see the top of pg 3 in the PDF). There is no mention of where the GO terms came from, which annotation source was used, or which software was used; not in the results, discussion, methods, or supplemental methods. The microarray data analysis isn't too badly described, but the processing of the 454 sequencing data isn't really described at all.
My point is that, even given their raw data, I'm not sure I would be able to even approximate their results based on what is reported in the methods section.
Gulsuner et al., Schizophrenia SNPs
Gulsuner et al. published a paper in Cell in August 2013 titled "Spatial and Temporal Mapping of De Novo Mutations in Schizophrenia to a Fetal Prefrontal Cortical Network". This one also looks really nice: they look for de novo mutations (i.e. new mutations in offspring not present in parents or siblings) that disrupt genes belonging to a heavily connected network, and also examine gene co-expression over brain developmental time-scales. It sounds really cool, and the results seem legitimate, based on my reading of the manuscript. I was really impressed that they even used randomly generated networks to control the false discovery rate!
However, almost all of the analysis again depends on a lot of different bioinformatic software. I do have to give the authors props: they actually give the full version of each tool used. But there is no mention of tool-specific settings (which can generate vastly different results; see the Exome Sequencing subsection of the methods).
Then there is this bombshell: "The predicted functional impact of each candidate de novo missense variant was assessed with in silico tools." (near the top of pg 525 of the PDF). Rrrreeeaaaalllly now. There is no explicit statement of which tools were used, although the subsequent wording and the references provided imply that they were PolyPhen2, SIFT, and the Grantham method. But shouldn't that have been stated up front, along with any settings that were changed from the defaults??
There is no raw data available, only their reported SNPs. There is not even a list of all the SNPs that were potentially considered, so that I could at least start from those and re-run the later analysis. I have to take their word for it (although I am glad that at least the SNPs used in the later analyses are reported).
Finally, there is the random network generation. I'd like to be able to see that code, go over it, and verify that it does what it is supposed to. It likely does, based on the histograms provided, but this is exactly where small errors creep in and produce invalid results.
As above, even if the raw data were available (I didn't see an SRA accession or any other download link), I'm not sure I could reproduce or verify the results.
What to do??
How do we fix this problem? I think the scripts and workflows used to run any type of bioinformatic analysis have to become first-class research objects, and we have to teach scientists to write and use them in a way that makes them first-class research objects. So, in the same way that a biologist might ask for verification of immunostaining, bioinformaticians should ask that, given known input, a script generates reasonable output.
I know there has been discussion and disagreement about this before, especially given the exploratory nature of research. However, once you've got something working right, you should be able to test it. Reviewers should be asking whether it is testable, or the code should be available for others to test.
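As a minimal sketch of what that kind of check could look like in R: the function countSNPOverlaps, the toy data, and the expected values here are all hypothetical, and this assumes the testthat package is installed.

library(testthat)

# hypothetical analysis step: count how many SNP positions fall inside any gene region
countSNPOverlaps <- function(snpPos, geneStarts, geneEnds) {
    sum(vapply(snpPos, function(p) any(p >= geneStarts & p <= geneEnds), logical(1)))
}

# given known input, the script should generate known (reasonable) output
test_that("overlap counting is correct on a toy example", {
    expect_equal(countSNPOverlaps(c(5, 50, 150), geneStarts = c(1, 100), geneEnds = c(10, 120)), 1)
    expect_equal(countSNPOverlaps(numeric(0), geneStarts = 1, geneEnds = 10), 0)
})

If a published analysis shipped with even a handful of checks like this, a reviewer or reader could re-run them against the raw data and see whether the pipeline still behaves as described.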
I also think we as a community should do more to point out the problem, i.e. when we see it, point it out to others. I've done that here, but I don't know how formal it should be. Maybe we need a new hashtag, #badbioinfomethodsection, with links to papers that do this. Conversely, we should also point to examples where it is done right (#goodbioinfomethodsection??). And if you are a bioinformatician, or a biologist who does a lot of coding, share your code, and at least supply it as supplemental material. Oh, and maybe take a Software Carpentry class, and look up git.
Posted on August 16, 2013 at http://robertmflight.blogspot.com/2013/08/reproducible-methods-or-bad.html, raw markdown at https://github.com/rmflight/blogPosts/blob/master/reproducible_methods.md
5.4.13
Reproducible publications as R packages... and markdown vignettes!
R packages are an interesting way to distribute reproducible papers. You have functions in the R directory, any required data files in data, a DESCRIPTION file that describes the package and all its dependencies, and finally you have the vignette, generally in inst/doc. Vignettes in R traditionally had to be written in LaTeX and used Sweave to mix the written and code parts.

It is very easy to imagine that the vignette produced could be a journal article, and therefore, to get the reproducible article and access to the underlying data and functions, you simply need to install the package. I don't know why I hadn't realized this before. It is actually a really neat, and potentially powerful, idea. I would not be surprised if this was actually part of the logic of incorporating Sweave vignettes in R packages.

One wonders why it isn't more common to distribute papers as packages. I am guessing that it is largely due to the requirement to use LaTeX for writing the document. I've written a vignette, and I can't say I really cared for the experience, basically because getting the LaTeX part to work was a painful process.
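As a concrete sketch, a paper distributed as a package might be laid out something like this (the package and file names are hypothetical):

myPaperPkg/
    DESCRIPTION                # title, authors, and all package dependencies
    R/
        analysis_functions.R   # the functions used by the analysis
    data/
        experimentData.rda     # the processed data objects used in the paper
    inst/doc/
        myPaper.Rnw            # the vignette, i.e. the paper itself (Sweave/LaTeX)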
However, with the release of R 3.0.0, it is now possible to define alternative vignette engines. Currently the only one I know of is knitr, which supports generating PDF from Rnw (LaTeX with an alternative syntax to Sweave) and HTML from Rmd, or R markdown, files. Markdown is so much easier to write, and in my mind the generated HTML is much easier to read. In addition, customizing the look using CSS is probably much more familiar to many people who are programming nowadays, which is another big plus.

Besides having to write in LaTeX, the process of changing, building, loading, and documenting packages used to be pretty cumbersome. However, Hadley Wickham has been doing a lot to change that with his devtools package, which makes it quite easy to re-build a package and load it back up. This has now been integrated into the latest versions of RStudio, making it rather easy to work on a package and immediately work with any changes. In addition, the testthat package makes it easier to run automated tests, and roxygen2 makes it easy to document the custom functions used by your vignette.

So, I know I will be using Yihui's guide to switch my own package to use a markdown vignette, and will probably try to do my next paper as a self-contained R package as well. How about you??

Edit: As Carl pointed out below, pandoc is very useful for converting markdown to other formats that may be required for formal submission to a journal.
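For reference, here is a rough sketch of what declaring a knitr markdown vignette involves, together with the devtools calls I mean. The package and file names are hypothetical; the DESCRIPTION fields and the %\VignetteEngine metadata are the standard knitr vignette-engine mechanism.

# In the package DESCRIPTION (excerpt):
#     Suggests: knitr
#     VignetteBuilder: knitr

# Near the top of vignettes/myPaper.Rmd, inside an HTML comment:
#     <!--
#     %\VignetteEngine{knitr::knitr}
#     %\VignetteIndexEntry{My reproducible paper}
#     -->

# Typical devtools workflow while working on the package:
library(devtools)
load_all("myPaperPkg")    # load the current code without re-installing
document("myPaperPkg")    # rebuild documentation from roxygen2 comments
test("myPaperPkg")        # run the testthat tests
build("myPaperPkg")       # build the package tarball, including the vignette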
9.10.12
Writing papers using R Markdown
I have been watching the activity in RStudio and knitr for a while, and have even been using Rmd (R markdown) files in my own work as a way to easily provide commentary on an actual dataset analysis. Yihui has proposed writing papers in markdown and posting them to a blog as a way to host a statistics journal, and lots of people are now using knitr as a way to create reproducible blog posts that include code (including yours truly).
The idea of writing a paper that actually includes the code necessary to perform the analysis, that is readable in its raw form, and that someone else could actually run was pretty appealing. Unfortunately, I had not had the time or opportunity to try it until recently, when our group submitted a conference paper that included a lot of analysis in R; it seemed like the perfect opportunity. (I will link to the paper here when I hear more, or get clearance from my PI.) Originally we wrote the paper in Microsoft® Word, but after submission I decided to see what it would have taken to write it as an Rmd document that could then generate markdown or html.
It turned out that it was not that hard, but it did force me to do some things differently. This is what I want to discuss here.
Advantages
I actually found it much easier to have the text together with the analysis (in contrast to keeping them separate in a Word document), and in doing the conversion I discovered some possible numerical errors that had crept in because numerical results had to be copied over separately (being able to insert variables directly into the text is one of the nice things about this approach). In addition, the Word template for the submission didn't play nice with automatic table and figure numbering, so our table and figure numbers got messed up in the submission. So overall, I'd say it worked out better with the Rmd file, even with having to create functions to handle table and figure numbering properly myself (see below).
Tables and Figures
As I'm sure most of you know, Word (and other WYSIWYG editors) have the ability to keep track of your object numbers, which is especially nice for keeping figure and table numbers straight. Of course, there is no such ability built into a static text file, but I found it was easy to write a couple of functions for this. The way I came up with is to have a variable that contains a label for each figure or table, a function that increments the counter when a new figure or table is added, and a function that prints the associated number for a particular label. This does require a bit of forethought on the part of the writer, because you may have to add a table or figure label to the variable long before you actually create it, but as long as you use sane (i.e. descriptive) labels, it shouldn't be a big deal. Let me show you what I mean.
Counting
incCount <- function(inObj, useName) {
    # add a new named entry to a counter vector, one higher than the current maximum
    nObj <- length(inObj)
    useNum <- max(inObj) + 1
    inObj <- c(inObj, useNum)
    names(inObj)[nObj + 1] <- useName
    inObj
}
figCount <- c(`_` = 0)
tableCount <- c(`_` = 0)
The incCount function is very simple: it takes an object, checks the maximum count, and then adds an incremented value with the supplied name. In this example, I initialized the figCount and tableCount objects with a nonsensical named value of zero.
Now, in the process of writing, I decide I'm going to need a table on the amount of time spent by post-docs writing blog posts in different years of their post-doc training. Let's call this t.blogPostDocs. Notice that this is a fairly descriptive name. We can assign it a number like so:
tableCount <- incCount(tableCount, "t.blogPostDocs")
tableCount
## _ t.blogPostDocs
## 0 1
Inserting
So now we have a variable with a named number we can refer to. But how do we insert it into the text? We are going to use another function that will let us insert either the text with a link, or just the text itself.
pasteLabel <- function(preText, inObj, objName, insLink = TRUE) {
    # look up the number stored under objName and paste it after preText,
    # optionally wrapping the result in a markdown link to the anchor #objName
    objNum <- inObj[objName]
    useText <- paste(preText, objNum, sep = " ")
    if (insLink) {
        useText <- paste("[", useText, "](#", objName, ")", sep = "")
    }
    useText
}
This function allows us to insert the table number like so:
r I(pasteLabel("Table", tableCount, "t.blogPostDocs"))
This would be inserted into a normal inline code block. The I makes sure that the text appears as normal text and does not get formatted as a code block. The default behavior is to insert the text as a relative link, which makes it possible to link from where a table or figure is mentioned to its actual location. For example, we can insert the anchor like so:
<a id="t.blogPostDocs"></a>
Markdown Tables
This should be followed by the actual table text, which brings up the subject of markdown tables. I also wrote a function (thanks to Yihui again) that transforms a normal R data.frame into a markdown table.
tableCat <- function(inFrame) {
    # build a character vector of markdown table rows: header, separator, then data rows
    outText <- paste(names(inFrame), collapse = " | ")
    outText <- c(outText, paste(rep("---", ncol(inFrame)), collapse = " | "))
    invisible(apply(inFrame, 1, function(inRow) {
        outText <<- c(outText, paste(inRow, collapse = " | "))
    }))
    return(outText)
}
Let's see it in action.
postDocBlogs <- data.frame(PD = c("p1", "p2", "p3"), NBlog = c(4, 10, 2), Year = c(1,
4, 2))
postDocBlogs
## PD NBlog Year
## 1 p1 4 1
## 2 p2 10 4
## 3 p3 2 2
postDocInsert <- tableCat(postDocBlogs)
postDocInsert
## [1] "PD | NBlog | Year" "--- | --- | ---" "p1 | 4 | 1"
## [4] "p2 | 10 | 4" "p3 | 2 | 2"
To actually insert it into the text, use a code chunk with results='asis' and echo=FALSE.
cat(postDocInsert, sep = "\n")
PD | NBlog | Year
--- | --- | ---
p1 | 4 | 1
p2 | 10 | 4
p3 | 2 | 2
Before inserting the table, though, you might want an inline code entry with the table number, followed by the caption, like this:
I(pasteLabel("Table", tableCount, "t.blogPostDocs", FALSE))
This is the number of blog posts and year of training for post-docs.
Table 1 This is the number of blog posts and year of training for post-docs.
Remember, for captions, to set the insLink argument to FALSE so that you don't generate a link from the caption.
Figures
Oftentimes, you will have code that generates the figure, and then you want to insert the figure at a different point. This is accomplished by judicious use of the echo and include chunk options.
For example, we can create a ggplot2 figure and store it in a variable in one chunk, and then print it in a later chunk to actually insert it into the text body.
library(ggplot2)
plotData <- data.frame(x = rnorm(1000, 1, 5), y = rnorm(1000, 0, 2))
plotKeep <- ggplot(plotData, aes(x = x, y = y)) + geom_point()
figCount <- incCount(figCount, "f.randomFigure")
And now we decide to actually insert it using print(plotKeep) with the option echo=FALSE:
[Figure: the scatterplot produced by the figureInsert chunk]
Figure 1. A random figure.
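For reference, here is a minimal sketch of what those two chunks could look like in the Rmd source (the chunk name figureGen is hypothetical; figureInsert matches the chunk name visible in the output above):

```{r figureGen, include=FALSE}
library(ggplot2)
plotData <- data.frame(x = rnorm(1000, 1, 5), y = rnorm(1000, 0, 2))
plotKeep <- ggplot(plotData, aes(x = x, y = y)) + geom_point()
```

... any amount of intervening text ...

```{r figureInsert, echo=FALSE}
print(plotKeep)
```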
Numerical result formatting
When R prints a number, it normally likes to do so with lots of digits. This is probably not what you want, either in a table or when reporting a number in a sentence. You can control this using the format function: either store the formatted value in a new variable and insert that, or call format on the variable directly where it is used, so that only the desired number of digits is displayed.
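As a minimal illustration (the variable names here are arbitrary):

meanDiff <- 0.123456789

# store a formatted version to insert into the text later
niceDiff <- format(meanDiff, digits = 3)
niceDiff
## [1] "0.123"

# or format at the point of use, e.g. inline with `r format(meanDiff, digits = 3)`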
Echo and Include
This brings up the issue of how to keep the code from appearing in the text body. I found that, depending on the particulars, either echo=FALSE or include=FALSE would do the job. This is meant to be a paper, a reproducible one, but a paper nonetheless, and therefore the code should not end up in the text body.
References
One thing I haven't done yet is convert all the references. I am planning to try using the knitcitations package. I will probably post on that experience.
HTML generation
Because I use RStudio, I set up a modified function for generating a full html version of the paper, changing the default RStudio markdown render options like so:
# start from the default markdown-to-HTML options, but drop hard line wrapping
htmlOptions <- markdownHTMLOptions(defaults = TRUE)
htmlOptions <- htmlOptions[htmlOptions != "hard_wrap"]
markdownToHTML(inputFile, outputFile, options = htmlOptions)
This should be added to a .Rprofile file, either in your home directory or in the directory you start R in (this is especially useful for modification on a per-project basis).
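As a sketch, the .Rprofile entry could be a small helper along these lines (the function name renderNoWrap is hypothetical; it assumes the markdown package is installed):

library(markdown)

# convert a knitted markdown file to HTML without hard line wrapping
renderNoWrap <- function(inputFile, outputFile) {
    htmlOptions <- markdownHTMLOptions(defaults = TRUE)
    htmlOptions <- htmlOptions[htmlOptions != "hard_wrap"]
    markdownToHTML(inputFile, outputFile, options = htmlOptions)
}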
I do this because, when I write my documents, I want the source to be readable online. If the document is in a github-hosted repo, that means being displayed in the github file browser, which does not do line wrapping. So I set a 120-character line width in my editor, and try very hard to stick to that.
Function source
You can find the previously mentioned functions in a github gist here.
Post source
The source files for this blog post can be found at: Rmd, md, and html.
Posted on October 9, 2012, at http://robertmflight.blogspot.com/2012/10/writing-papers-using-r-markdown.html
Edit: added section on formatting numerical results
Edit: added session info
R version 2.15.0 (2012-03-30)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggplot2_0.9.2.1 knitr_0.8.1
loaded via a namespace (and not attached):
[1] colorspace_1.1-1 dichromat_1.2-4 digest_0.5.2
[4] evaluate_0.4.2 formatR_0.6 grid_2.15.0
[7] gtable_0.1.1 labeling_0.1 MASS_7.3-21
[10] memoise_0.1 munsell_0.4 plyr_1.7.1
[13] proto_0.3-9.2 RColorBrewer_1.0-5 reshape2_1.2.1
[16] scales_0.2.2 stringr_0.6.1 tools_2.15.0