Bad bioinformatics software names revisited

I have recently been sorting through lots of old note files, including many from my time as a genomics researcher at UC Davis. One of these files was called ‘Strategies for naming bioinformatics software’ and I initially assumed it was one of the posts published on this blog.

However, I couldn’t find it as an actual post, and when I did a quick web search I instead discovered this episode of ‘The Bioinformatics Lab’ podcast from earlier this year:

I have been out of the field of genomics/bioinformatics for many years now and didn’t know about The Bioinformatics Lab podcast which describes itself as ‘ramblings on all things bioinformatics’.

The conversation between the hosts (Kevin Libuit and Andrew Page) is good and listening to it brought back lots of memories from the many things I’ve written about on this blog. At the end of the episode, Andrew concludes:

"It’s kind of hard. People should bit a bit of effort into it"

100% this! Naming software should definitely not be an afterthought. Andrew goes on:

"Before you do any development on anything, go and choose a really good name and make sure it doesn’t conflict with any trademarks or existing tools, you can Google it easily and it’s not offensive in any language."

These are the types of things that I have written about extensively on this blog. If you are interested, perhaps start with:

Then you can read any one of the nearly forty posts I wrote that handed out ‘JABBA awards’ (JABBA stands for ‘Just Another Bogus Bioinformatics Acronym’).

This award series started all the way back in 2013 and the inaugural award went to a tool with the crazy capitalisation of 'BeAtMuSiC'.

There’s also a series of posts on duplicate names in bioinformatics where people haven’t checked whether their software name is stepping on someone else’s toes.

This includes a post about the audacious attempt to name a new piece of bioinformatics software BLAST. There is also a post about the five different tools that are all called ‘SNAP’.

Admittedly, I’ve been out of the loop for so long that there may well be many more SNAPs out there by now!

The moral of this blog post is that names are important, and it is very easy to mess them up, which could mean that fewer people ever discover your tool in the first place.
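If nothing else, a quick check against a couple of package registries can catch the most obvious name clashes before you commit. Below is a minimal sketch of such a check: the PyPI JSON endpoint is a real, public API, but the Bioconda lookup via the anaconda.org API is an assumption on my part, and neither is any substitute for a proper web and trademark search.

```python
# Minimal sketch of a pre-flight "is this name already taken?" check.
# The PyPI JSON endpoint is public; the Bioconda lookup via the
# anaconda.org API is assumed and should be verified before relying on it.
import urllib.error
import urllib.request


def name_exists(url: str) -> bool:
    """Return True if the URL resolves (HTTP 200), False on a 404."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise


def check_candidate(name: str) -> None:
    registries = {
        "PyPI": f"https://pypi.org/pypi/{name}/json",
        "Bioconda": f"https://api.anaconda.org/package/bioconda/{name}",  # assumed endpoint
    }
    for registry, url in registries.items():
        status = "already taken" if name_exists(url) else "appears free"
        print(f"{name!r} on {registry}: {status}")


if __name__ == "__main__":
    check_candidate("blast")                  # an existing, famous tool name
    check_candidate("my-shiny-new-tool-123")  # hypothetical candidate name
```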

CEGMA is dying...just very, very slowly

This is my first post on this blog in almost three years and it is now almost nine years since I could legitimately call myself a genomics researcher or bioinformatician.

However, I feel that I need to 'come out of retirement' for one quick blog post on a topic that has spanned many other posts...CEGMA.

As I outlined in my last post on this blog, the CEGMA tool, which I helped develop back in 2005 and which was first published in 2007, continues to be used.

This is despite many attempts to tell/remind people not to use it anymore! There are better tools out there (probably many that I'm not even aware of). Fundamentally, the weakness of CEGMA is that it is based on a set of orthologs that was published over two decades ago.

And yet, every week I receive Google Scholar alerts that tell me that someone else has cited the tool again. We (myself and Ian Korf) should perhaps take some of the blame for keeping the software available on the Korf Lab website (I wonder how many other bioinformatics tools from 2007 can still be downloaded and successfully run?).

CEGMA citations (2011-2024)

When I saw that citations had peaked in 2017 and when I saw better tools come along, I thought it would be only a couple of years until the death knell tolled for CEGMA. I was wrong. It is dying...just very, very slowly. There were 119 citations last year and there have been 88 so far this year.

Academics (including former academics) obviously love to see their work cited. It is good to know that you have built tools that were actively used. But please, stop using CEGMA now! My co-authors and I no longer need the citations to justify our existence.

Come back to this blog in another three years when I will no doubt write yet another post about CEGMA ('For the love of all that is holy why won't you just curl up and die!').

New BUSCO vs (very old) CEGMA

If I’m only going to write one or two blog posts a year on this blog, then it makes sense to return to my recurring theme of don’t use CEGMA, use BUSCO!

In 2015 I was foolishly optimistic that the development of BUSCO would mean that people would stop using CEGMA — a tool that we started developing in 2005 and which used a set of orthologs published in 2003! — and that we would reach ‘peak-CEGMA’ citations that year.

That didn’t happen. At the end of 2017, I again asked the question have we reached peak-CEGMA? because we had seen ten consecutive years of increasing citations.

Well I’m happy to announce that 2017 did indeed see citations to our 2007 CEGMA paper finally peak:

CEGMA citations by year (from Google Scholar)

Although we have definitely passed peak CEGMA, it still receives over 100 citations a year and people really should be using tools like BUSCO instead.

This neatly leads me to mention that a recent publication in Molecular Biology and Evolution describes an update to BUSCO:

From the introduction:

With respect to v3, the last BUSCO version, v5, features: 1) a major upgrade of the underlying data sets in sync with OrthoDB v10; 2) an updated workflow for the assessment of prokaryotic and viral genomes using the gene predictor Prodigal (Hyatt et al. 2010); 3) an alternative workflow for the assessment of eukaryotic genomes using the gene predictor MetaEuk (Levy Karin et al. 2020); 4) a workflow to automatically select the most appropriate BUSCO data set, enabling the analysis of sequences of unknown origin; 5) an option to run batch analysis of multiple inputs to facilitate high-throughput assessments of large data sets and metagenomic bins; and 6) a major refactoring of the code, and maintenance of two distribution channels on Bioconda (Grüning et al. 2018) and Docker (Merkel 2014).
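For anyone making the switch, here is a hedged sketch of what a minimal BUSCO run might look like when driven from Python. The flags shown (-i, -m, -o, -c and --auto-lineage) reflect my understanding of the BUSCO v5 command line rather than anything stated in the paper, so do check `busco --help` for whichever version you install (via Bioconda or Docker, as the authors suggest).

```python
# Hedged sketch: run BUSCO v5 on an assembly via its command-line interface.
# The flags below are my understanding of the v5 CLI; verify with `busco --help`.
import subprocess

subprocess.run(
    [
        "busco",
        "-i", "my_assembly.fasta",  # hypothetical input assembly
        "-m", "genome",             # assess a genome assembly (not proteins/transcriptome)
        "--auto-lineage",           # let BUSCO choose the most appropriate data set
        "-o", "busco_out",          # name of the output run
        "-c", "8",                  # number of CPU threads
    ],
    check=True,  # raise an error if BUSCO exits with a non-zero status
)
```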

Please, please, please...don’t use CEGMA anymore! It is enjoying a well-earned retirement at the Sunnyvale Home for Senior Bioinformatics Tools.

Three cheers for JABBA awards

These days, I mostly think of this blog as a time capsule to my past life as a scientist. Every so often though, I’m tempted out of retirement for one more post. This time I’ve actually been asked to bring back my JABBA awards by Martin Hunt (@martibartfast)...and with good reason!

There is a new preprint on bioRxiv...

I’m almost lost for words about this one. You know that it is a tenuous attempt at an acronym or initialism when you don’t use any letters from the 2nd, 3rd, 4th, or 5th words of the full software name!

The approach here is very close to just choosing a random five-letter word. The authors could also have had:

CLAMP: hierarChical taxonomic cLassification for virAl Metagenomic data via deeP learning

HOTEL: hierarcHical taxOnomic classificaTion for viral mEtagenomic data via deep Learning

RAVEN: hieraRchical tAxonomic classification for Viral metagenomic data via dEep learNing

ALIEN: hierArchical taxonomic cLassification for vIral metagEnomic data via deep learniNg

LARVA: hierarchicaL taxonomic classificAtion for viRal metagenomic data Via deep leArning
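In case anyone wants to play the same game at home, here is a toy sketch that checks whether a candidate word can be spelled out, in order, from a tool’s full description, which is about all these tenuous acronyms really require:

```python
# Toy sketch: can `candidate` be spelled out, in order, from `description`?
def is_tenuous_acronym(candidate: str, description: str) -> bool:
    pos = 0
    description = description.lower()
    for letter in candidate.lower():
        pos = description.find(letter, pos)
        if pos == -1:
            return False
        pos += 1
    return True


description = ("hierarchical taxonomic classification for viral "
               "metagenomic data via deep learning")
for word in ["CHEER", "CLAMP", "HOTEL", "RAVEN", "ALIEN", "LARVA", "JABBA"]:
    print(word, is_tenuous_acronym(word, description))  # everything but JABBA passes
```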

Okay, as this might be my only blog post of 2020, I’ll say CHEERio!

Damn and blast...I can't think of what to name my software

As many people have pointed out on Twitter this week, there is a new preprint on bioRxiv that merits some discussion:

The full name of the test that is the subject of this article is the Bron/Lyon Attention Stability Test. You have to admit that 'BLAST' is a punchy and catchy acronym for a software tool.

It's just a shame that it is also an acronym for another piece of software that you may have come across.

It's a bold move to give your software the same name as another tool that has only been cited 135,000 times!

This is not the first, nor will it be the last, example of duplicate names in bioinformatics software, many of which I have written about before.

Hopping back for another JABBA award

So I was meant to have retired from handing out JABBA awards to recognise instances of ‘Just Another Bogus Bioinformatics Acronym’. However, I saw something this week which clearly merits an award.

And so, from a new paper recently published in PLoS ONE I give you:

The three lower-case letters signal that there is going to be some name wrangling going on...so let’s see how the authors arrive at this name:

GRASShopPER: GPU overlap GRaph ASSembler using Paired End Reads

That’s how it is described in the paper, so I guess it could have also been called ‘GOGAUPER’? This is another example of a clumsily constructed acronym that could have been avoided altogether.

‘Grasshopper’ is a cool, and catchy, name for any software tool and it doesn’t really need to be retconned into an awkward acronym.

It does, however, give me one new animal for the JABBA menagerie!

The changing landscape of sequencing platforms that underpin genome assembly

From Flickr user itsrick208. CC BY-NC 2.0

In my last blog post I looked at the amazing growth over the last two decades in publications that relate to genome assembly.

In this post, I try to see whether Google Scholar can also shed any light on which sequencing technologies have been used to help understand, and improve, genome assembly.

Here is a rough overview of the major sequencing platforms that have underpinned genome assembly over the years. I’ve focused on time points when there were sequencing instruments that people were actually using rather than when the technology was first invented or described. This is why I start Sanger sequencing at 1995 with the ABI 310 sequencer rather than 1977.

Timeline of the major sequencing platforms underpinning genome assembly (click to enlarge)

Return to Google Scholar

So how can you find publications which concern genome assembly using these technologies? Well here are my Google Scholar searches that I used to try to identify relevant publications.

  1. Sanger — "genome assembly"|"de novo assembly" sanger -sanger.ac.uk — I had to exclude the Sanger’s website address as this was used in many papers that might not be talking about Sanger sequencing per se.
  2. Roche 454 — "genome assembly"|"de novo assembly" 454 (roche|pyrosequencing) — another tricky one as ‘454’ alone was not a suitable keyword for searching.
  3. Illumina — "genome assembly"|"de novo assembly" (illumina|solexa) — obviously need to include Solexa in this search as well.
  4. ABI SOLiD — "genome assembly"|"de novo assembly" "ABI solid"
  5. Ion Torrent — "genome assembly"|"de novo assembly" "ion torrent"
  6. PacBio — "genome assembly"|"de novo assembly" ("PacBio"|"Pacific Biosciences")
  7. Oxford Nanopore Technologies — "genome assembly"|"de novo assembly" "Oxford Nanopore"

Now obviously, many of these searches are flawed and are going to miss publications or include false positives. This makes comparing the absolute numbers of publications between technologies potentially misleading. However, it should still be illuminating to look at the trends of how publications for each of these technologies have changed over time.
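For anyone wanting to repeat the exercise, here is a rough sketch of how these per-year searches could be composed as Google Scholar URLs. The query parameters used here (q, as_ylo and as_yhi) are an assumption about Scholar’s public search URL format, and the hit counts still have to be read off the results pages by hand, as Scholar has no official API.

```python
# Rough sketch: build one Google Scholar URL per platform per year.
# The q/as_ylo/as_yhi parameters are assumed; counts are read off manually.
from urllib.parse import urlencode

QUERIES = {
    "Sanger": '"genome assembly"|"de novo assembly" sanger -sanger.ac.uk',
    "Roche 454": '"genome assembly"|"de novo assembly" 454 (roche|pyrosequencing)',
    "Illumina": '"genome assembly"|"de novo assembly" (illumina|solexa)',
    "ABI SOLiD": '"genome assembly"|"de novo assembly" "ABI solid"',
    "Ion Torrent": '"genome assembly"|"de novo assembly" "ion torrent"',
    "PacBio": '"genome assembly"|"de novo assembly" ("PacBio"|"Pacific Biosciences")',
    "Oxford Nanopore": '"genome assembly"|"de novo assembly" "Oxford Nanopore"',
}


def scholar_url(query: str, year: int) -> str:
    """Build a Google Scholar search URL restricted to a single year."""
    params = {"q": query, "as_ylo": year, "as_yhi": year}
    return "https://scholar.google.com/scholar?" + urlencode(params)


for platform, query in QUERIES.items():
    for year in range(1997, 2018):  # example year range
        print(platform, year, scholar_url(query, year))
```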

The results

As in my last graph, I plot the number of publications on a log scale.

Genome assembly publications per year for each sequencing platform, log scale (click to enlarge)

Observations

  1. Publications about genome assembly that mention Sanger sequencing dominate the first decade of this graph before being overtaken by Illumina in 2009.
  2. The growth of publications for Sanger is starting to slow down.
  3. Publications for Roche 454 peaked in 2015 and have started to decline.
  4. Publications concerning Ion Torrent peaked a year later, in 2016.
  5. ABI SOLiD shows the clearest ‘rise and fall’ pattern, with five years now of declining publications about genome assembly.
  6. The rate of growth for PacBio publications has been pretty solid but may have just slowed a little in 2017.
  7. Oxford Nanopore, the newest kid on the block — in terms of commercially available products — has been on a solid period of exponential growth and looks set to overtake Ion Torrent (and maybe Roche 454) this year.

Are we about to reach ‘peak genome assembly’?

Sanger Peak. Image from Google Maps.

The ever-declining costs of DNA sequencing technologies — no, I’m not going to show that graph — have meant that the field of genome assembly has exploded over the last decade.

Plummeting costs are obviously not the only reason behind this. The evolving nature of sequencing technologies means that this year we have been pushed into the brave new era of megabase-pair read lengths!

Think of the poor budding yeast: the first eukaryotic species to have its (12 Mbp) genome sequenced. There was a time when the sequencing of individual yeast chromosomes would merit their own Nature publication! Now only chromosome IV remains as the last yeast chromosome whose length couldn’t be exceeded by a single Oxford Nanopore read (but probably not for much longer!). Update 2018-09-12: a 2.2 Mbp Nanopore read means that chromosome IV's length has now been eclipsed!

Looking for genome assembly publications

I turned to the font of all (academic) knowledge, Google Scholar, for answers. I wanted to know whether interest in genome assembly had reached a peak, and by ‘interest’ I mean publications or patents that specifically mention either ‘genome assembly’ or ‘de novo assembly’.

Some obvious caveats:

  1. Google Scholar is not a perfect source of publications: some papers are missing, some appear multiple times, and occasionally some are associated with the wrong year.
  2. Publications are increasing in many fields due to more scientists being around and the inexorable rise of if-you-pay-us-money-and-randomly-hit-keys-on-your-keyboard-we-will-publish-it publishing. So a rise in publications in topic 'X' does not necessarily reflect more interest in that topic.
  3. Not all publications concerning genome assembly will contain the phrases ‘genome assembly’ or ‘de novo assembly’.

Caveats aside, let’s see what Google thinks about the state of genome assembly:

Genome assembly publications per year, from Google Scholar (click to enlarge)

Does this tell us anything?

So there’s clearly been a pretty explosive growth in publications concerning genome assembly over the last couple of decades. Interestingly, the data from 2017 suggest that the period of exponential growth is starting to slow just a little bit. However, it would seem that we have not reached ‘peak genome assembly’ just yet.

There are, no doubt, countless hundreds (thousands?) of publications that concern technical aspects of genome assembly which have reached dead ends or which have become obsolete (pipelines for your ABI SOLiD data?).

Maybe we are starting to reach an era where the trio of leading technologies (Illumina, Pacific Biosciences, and Oxford Nanopore) are good enough to facilitate — alone, or in combination — easier (or maybe less troublesome) genome assemblies. I’ve previously pointed out how there are more ‘improved’ assemblies being published than ever before.

Maybe the field has finally moved the focus away from ‘how do we get this to work properly?’ to ‘what shall we assemble next?’. In a follow-up post, I’ll be looking at the rise and fall of different sequencing technologies throughout this era.

Update 2018-08-13: Thanks to Neil Saunders for crunching the numbers in a more rigorous manner and applying a correction for the total number of publications per year. The results are, as he notes, broadly similar.
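For the curious, the kind of correction Neil applied boils down to something like the sketch below: divide the yearly ‘genome assembly’ hit counts by the total number of publications Google Scholar reports for that year. The numbers here are made-up placeholders rather than real Scholar counts.

```python
# Sketch of a per-year correction: normalise assembly-related hits by the
# total number of publications for that year. All counts below are invented.
assembly_hits = {2015: 9_000, 2016: 11_000, 2017: 12_000}         # hypothetical
total_pubs = {2015: 2_500_000, 2016: 2_600_000, 2017: 2_700_000}  # hypothetical

for year in sorted(assembly_hits):
    rate = assembly_hits[year] / total_pubs[year]
    print(f"{year}: {rate * 1_000_000:.0f} assembly papers per million publications")
```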

Genomic makeovers: the number of ‘improved’ genome sequences is increasing

Image from flickr user londonmatt. Licensed under Creative Commons CC BY 2.0 license

Excluding viruses, the genome that can claim to have been completed before any other was that of the bacterium Haemophilus influenzae, the sequence of which was described in Science on July 28, 1995.

I still find it pretty amazing to recall that just over a year later, the world saw the publication of the first complete eukaryotic genome sequence, that of the yeast Saccharomyces cerevisiae.

The fields of genomics and genome sequencing have continued to grow at breakneck speed and the days of a genome sequence automatically meriting a front-cover story in Nature or Science are long gone.

Complete vs Draft vs Improved

I’ve written previously about the fact that although more genomes than ever are being sequenced, fewer seem to be ‘complete’. I’ve also written a series of blog posts that address the rise of ‘draft genomes’.

Today I want to highlight another changing aspect of genome sequencing, that of the increasing number of publications that describe ‘improved’ genomes. Some recent examples:

Improving genomes is an increasing trend

To check whether there really are more ‘improved’ sequences being described, I looked in Google Scholar to see how many papers feature the terms ‘complete genome|assembly’ vs ‘draft genome|assembly’ vs ‘improved genome|assembly’ (these Google Scholar links reveal the slightly more complex query that I used). In gathering data I went back to 1995 (the date of the first published genome sequence).

As always with Google Scholar, these are not perfect search terms and they all pull in matches which are not strictly what I’m after, but it does reveal an interesting picture:

Number of publications in Google Scholar referencing complete vs draft vs improved genomes/assemblies

It is clear that the number of publications referencing ‘complete’ genomes/assemblies has been increasing at a steady rate. In contrast, publications describing ‘draft’ genomes have grown rapidly in the last decade but the rate of increase is slowing. When it comes to ‘improved’ genomes, it looks like we are in a period where many more papers are being published that describe improved versions of existing genomes (in 2017 there was a 54% increase in such papers compared to 2016).
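As a back-of-the-envelope illustration of that year-on-year comparison (using made-up counts rather than the real Scholar numbers):

```python
# Illustration only: how a year-on-year percentage increase is calculated.
# The two counts are hypothetical, not the actual Google Scholar figures.
improved_2016 = 650
improved_2017 = 1_000
increase = (improved_2017 - improved_2016) / improved_2016 * 100
print(f"Year-on-year increase: {increase:.0f}%")  # -> 54%
```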

Why improve a genome?

I wonder how much of this growth reflects the sad truth that many genomes that were published in the post-Sanger, pre-nanopore era (approximately 2005–2015) were just not very good. Many people rushed to adopt the powerful new sequencing technologies provided by Illumina and others, and many genomes published with those technologies are now being given makeovers by applying newer sequencing, scaffolding, and mapping technologies.

The updated pine genome (the last publication on the list above) says as much in its abstract (emphasis mine):

The 22-gigabase genome of loblolly pine (Pinus taeda) is one of the largest ever sequenced. The draft assembly published in 2014 was built entirely from short Illumina reads, with lengths ranging from 100 to 250 base pairs (bp). The assembly was quite fragmented, containing over 11 million contigs whose weighted average (N50) size was 8206 bp. To improve this result, we generated approximately 12-fold coverage in long reads using the Single Molecule Real Time sequencing technology developed at Pacific Biosciences. We assembled the long and short reads together using the MaSuRCA mega-reads assembly algorithm, which produced a substantially better assembly, P. taeda version 2.0. The new assembly has an N50 contig size of 25 361, more than three times as large as achieved in the original assembly, and an N50 scaffold size of 107 821, 61% larger than the previous assembly.
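For anyone less familiar with the N50 metric quoted above, here is a minimal sketch of how it is usually calculated; the contig lengths in the example are invented purely for illustration.

```python
# Minimal sketch of the N50 calculation: sort lengths from longest to
# shortest and report the length at which the running total first reaches
# half of the total assembly size.
def n50(lengths: list[int]) -> int:
    """Return the N50 of a list of contig (or scaffold) lengths."""
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
    return 0


print(n50([8000, 6000, 5000, 4000, 2000, 1000]))  # -> 6000
```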

Perhaps I’m being a bit harsh in saying that the first versions of many of these genomes that have been subsequently improved were not very good. The more important lesson to bear in mind is that, in reality, a genome is never finished and that all published sequences represent ‘works in progress’.

Fun at the Festival: a mini-report of the 2018 Festival of Genomics in London

Graffiti wall at the Festival of Genomics

Last week I once again attended the excellent Festival of Genomics conference in London. As before, I was attending in order to produce some coverage for The Institute of Cancer Research (where I work).

I really enjoy the mixture of talks at this conference which always has a strong leaning towards medicine in general and the rapid integration of genomics in the NHS in particular. This was a topic I explored in more detail in a blog post last year for the ICR.

I like how the conference organisers, Front Line Genomics, make an effort to ensure the conference is fun and engaging. It is easy to dismiss things like 'graffiti walls' and 'recharge stations' (where you can power up your mobile phone and get a massage) as gimmicks, but I think they add to a feeling that this is a modern and vibrant conference.

NHS meets NGS

Opening the conference was a presentation from Sir Malcolm Grant, the Chairman of NHS England. He presented an update on Genomics England's 100,000 Genomes Project.

Sir Malcolm noted that consent is such an important part of this project as participants are consenting to provide information that may affect others, e.g. their children and heirs. He stressed the importance of ensuring public trust and support as the project moves forwards.

Although initial progress towards achieving those 100,000 genomes may have been slower than some would have liked, work has been accelerating. The project has taken almost five years to reach the halfway point but is now on course to reach the 100K milestone within the next 12 months.

The following day saw Genomics England's Chairman, Sir John Chisholm, take to the stage for a chat with Carl Smith (Managing Editor of Front Line Genomics). He stressed that people should think of the 100,000 Genomes Project as a "pilot for the main game", i.e. the routine sequencing of patients within the NHS.

Rigged for silent running

The conference has four 'stages' but, as the whole area at the ExCeL centre is just one big open space, they make use of wireless headphones to create conference areas which are effectively silent to people walking past.

In addition to headphones being left on each seat, there are also many additional headsets that can be given out to people who are just standing by the sides of the 'stages' to more casually listen in to each session.

When genomics meets radiotherapy

This year the ICR was honoured with our own conference session in which four early-career researchers talked about how they used genomics data in their own areas of cancer research.

I have written a Science Talk blog post for the ICR that focuses on a presentation at the conference by Dr James Campbell, who is a Lead Bioinformatician at the ICR. He is using genome data from almost 2,000 patients who have undergone radiotherapy treatment for prostate cancer, in order to develop a model which predicts how well a new patient — given their particular set of genotypes and clinical factors — will respond to radiotherapy.

You can read the blog post here: