DOGMA: a new tool for assessing the quality of proteomes and transcriptomes

A new tool, recently published in Nucleic Acids Research, caught my eye this week:

The tool, by a team from the University of Münster, uses protein domains and domain arrangements in order to assess 'completeness' of a proteome or transcriptome. From the abstract...

Even in the era of next generation sequencing, in which bioinformatics tools abound, annotating transcriptomes and proteomes remains a challenge. This can have major implications for the reliability of studies based on these datasets. Therefore, quality assessment represents a crucial step prior to downstream analyses on novel transcriptomes and proteomes. DOGMA allows such a quality assessment to be carried out. The data of interest are evaluated based on a comparison with a core set of conserved protein domains and domain arrangements. Depending on the studied species, DOGMA offers precomputed core sets for different phylogenetic clades

Unlike CEGMA and BUSCO, which run against unannotated assemblies, DOGMA first requires a set of gene annotations. The paper focuses on the web server version of DOGMA but you can also access the source code online.
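
To make the idea of a domain-based completeness check concrete, here is a minimal Python sketch. It is not DOGMA's actual algorithm or scoring scheme: the function name, the exact-match rule, and the placeholder domain labels (domA, domB, ...) are all assumptions made purely for illustration. It simply reports what fraction of a core set of domain arrangements turns up somewhere in an annotated proteome.

```python
def completeness(core_arrangements, proteome_domains):
    """Fraction of core domain arrangements observed in a proteome.

    core_arrangements: list of tuples of domain labels (hypothetical core set).
    proteome_domains:  dict mapping a protein ID to its ordered domain labels.
    Returns (score, missing), where missing lists the core arrangements
    that no protein matched exactly.
    """
    if not core_arrangements:
        raise ValueError("core set must not be empty")

    # Collect every domain arrangement seen in the annotated proteome.
    observed = {tuple(domains) for domains in proteome_domains.values()}

    # Assumption: an arrangement counts only if some protein matches it exactly.
    missing = [arr for arr in core_arrangements if tuple(arr) not in observed]
    score = 1 - len(missing) / len(core_arrangements)
    return score, missing


# Made-up example data: four core arrangements, four annotated proteins.
core = [("domA",), ("domA", "domB"), ("domC",), ("domD", "domE")]
proteome = {
    "p1": ["domA"],
    "p2": ["domA", "domB"],
    "p3": ["domX"],
    "p4": ["domC"],
}

score, missing = completeness(core, proteome)
print(f"{score:.0%} complete; missing: {missing}")
```

A real tool would need to decide how to treat partial or rearranged matches and would pick a core set suited to the clade being studied. The sketch ignores both.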

It's good to see that other groups are continuing to look at new ways of assessing the quality of large genome/transcriptome/proteome datasets.

What's in a name?

Initially, I thought the name was just a word that both echoed 'CEGMA' and reinforced the central dogma of molecular biology. Hooray, I thought: a bioinformatics tool that just has a regular word as a name without relying on contrived acronyms.

Then I saw the website...

  • DOGMA: DOmain-based General Measure for transcriptome and proteome quality Assessment

This is even more tenuous than the older, unrelated version of DOGMA:

  • DOGMA: Dual Organellar GenoMe Annotator

Beyond Generations: My Vocabulary for Sequencing Tech

Many writers have attempted to divide Next Generation Sequencing into Second Generation Sequencing and Third Generation Sequencing. Personally, I think it isn't helpful and just confuses matters. I'm not the biggest fan of Next Generation Sequencing (NGS) to start with, as like "post-modern architecture" (or heck, "modern architecture") it isn't future-proofed.

Keith Robison gives an interesting deep dive on how sequencing technologies have been named and potentially could be named.

This post reminded me of my previous takes on the confusing and inconsistent labelling of these technologies:

Reflections on the 2019 Festival of Genomics conference in London

For the third year in a row, I attended the Festival of Genomics conference in London. This year saw the conference change venue, moving from the ExCel Arena to the Business Design Centre in Islington.

The new venue was notably smaller, leading to many sessions being heavily overcrowded. There were also fewer 'fun' activities compared to previous years: no graffiti wall and no recharging stations (massage stands and power points for phones).

The opening keynote was given by Professor Mark Caulfield (Chief Scientist at Genomics England).

From 100K to 500K

Reflecting on the completion of the 100,000 Genomes Project, Professor Caulfield revealed that the 100,000th genome was completed at 2:40 am on the 2nd December.

He also shared details that at the peak, the project was completing 6,000 genomes a month and it has now reached 103,311 genomes.

The next phase will see 500,000 genomes completed within the NHS over the next five years, with an 'ambition' to go on to sequence five million genomes.

Looking at the global picture of human genome sequencing, Professor Caulfield projected that there will be 60 million completed genomes by 2023.

I wrote more about the conference in a blog post for The Institute of Cancer Research:

The 100,000 Genomes Project has finished

This week I helped write a blog post for The Institute of Cancer Research to mark the completion of the 100,000 Genomes Project. This blog post was co-written by a former colleague, Dr Sam Dick, who wrote the majority of the article:

Read the blog post:

Reflecting on this milestone achievement, I also took to Twitter this week for a lengthy (and admittedly rambling) thread that reflected on how far genomics has come as a field. Click on the tweet below to see the full Twitter thread:

Looks like I picked a terrible day to launch my new 100,000 Gnomes Project!

But seriously, many congratulations to all involved with the amazing effort by @GenomicsEngland and all associated with the #Genomes100K project! pic.twitter.com/5JtqwvbLL7

— Keith Bradnam (@kbradnam) December 5, 2018

Chromosome-Scale Scaffolds And The State of Genome Assembly

Keith Robison has written another fantastic post on his Omics! Omics! blog which is a great read for two reasons.

First, he looks at the issues regarding chromosome-size scaffolds that can now be produced with Hi-C sequencing approaches. He then goes on to provide a brilliant overview of what the latest sequencing and mapping technologies mean for the field of genome assembly:

For high quality de novo genomes, the technology options appear to be converging for the moment on five basic technologies which can be mixed-and-matched.

  • Hi-C (in vitro or in vivo)
  • Rapid Physical Maps (BioNano Genomics)
  • Linked Reads (10X, iGenomX)
  • Oxford Nanopore
  • Pacific Biosciences
  • vanilla Illumina paired end

This second section should be required reading for anyone interested in genome assembly, particularly if you've been away from the field for a while.

Read the post: Chromosome-Scale Scaffolds And The State of Genome Assembly

What did I learn at the Festival of Genomics conference?

Last week I attended the excellent Festival of Genomics conference in London, organised by Front Line Genomics. This was the first time I had been to a conference as a communications person rather than as a scientist...something that felt quite strange.

In addition to live-tweeting many talks for The Institute of Cancer Research where I work, I also recorded some videos of ICR scientists on the conference floor. All were asked to respond to the same simple question: Why is genomics important for cancer research? You can see the video responses on the ICR's YouTube channel.

I also made a very short video to highlight one unusual aspect of the conference...the talks were pretty much silent. Wireless headphones worn by all audience members meant that there was no need to amplify the speakers...and therefore no need for the four different 'lecture theatres' to actually have any walls!

My first ICR blog post!

My final task was to write a blog post about some aspect of the conference. Before the conference started, I thought I might write something that was more focused on genomics technologies. However, I was surprised by how much of the conference covered genomics as part of healthcare.

In particular, I was left with the sense that genomics is finally delivering on some of the promises made back in 2003 when the human genome sequence was published. One of the target areas that was mentioned in this 2003 NIH press release was 'New methods for the early detection of disease'.

This is something that is now possible with whole genome sequencing being deployed as part of the 100,000 Genomes Project (undertaken by Genomics England). The ability to screen a patient for all known genetic diseases leads to many concerns and challenges (you should see Gattaca if you haven't already done so), but it was heartening to see how much groundwork has been put in to stay on top of some of these issues.

This is my first proper blog post for the ICR, and if you are interested in finding out more, please read my post on the ICR's Science Talk blog:

We have not yet reached 'peak CEGMA': record number of citations in 2016

Over the last few weeks, I've been closely watching the number of citations to our original 2007 CEGMA paper. Despite making it very clear on the CEGMA webpage that it has been 'discontinued' and despite leaving a comment in PubMed Commons that people should consider alternative tools, citations continue to rise.

This week we passed a milestone with the paper getting more citations in 2016 than in 2015. As the paper's Google Scholar page clearly shows, the citations have increased year-on-year ever since it was published:

While it is somewhat flattering to see research that I was involved in so highly cited (I can't imagine that many papers show this pattern of citation growth over such a long period), I really hope that 2016 marks 'peak CEGMA'.

CEGMA development started in 2005, a year that pre-dates technologies such as Solexa sequencing! People should really stop using this tool and try using something like BUSCO instead.

Assembling a Twitter following: people continue to be interested in genome assembly

Late in 2010, I was asked to help organise what would initially become The Assemblathon and then more formally Assemblathon 1. One of the very first things I did was to come up with the name itself — more here on naming bioinformatics projects — register the domain name, and secure the Twitter account @Assemblathon.

The original goal was to use the website and Twitter account to promote the contest and then share details of how the competition was unfolding. This is exactly what we did, all the way through to the publication of the Assemblathon 1 paper in late 2011. Around this time it seemed to make sense to also use the Twitter account to promote anything else related to the field of genome assembly and that is exactly what I did.

As well as tweeting a lot about Assemblathon 2 and a little bit about the aborted but oh-so-close-to-launching Assemblathon 3, I have found time to tweet (and retweet) several thousand links to many relevant publications and software tools.

It seems that people are finding this useful as the account keeps gaining a steady trickle of followers. The graph below shows data from when I started tracking the follower growth in early 2014:

All of which leaves me to make two concluding remarks:

  1. There can be tremendous utility in having an outlet — such as a Twitter account — to focus on a very niche subject (maybe some would say that genome assembly is no longer a niche field?).
  2. Although I am no longer working on the Assemblathon projects — I'm not even a researcher any more — I'm happy to keep posting to this account as long as people find it useful.