In 2016, during Christmas break, I got bored and decided to experiment with an old Raspberry Pi that I had lying around. At that time, my colleague and I were in the middle of writing our first comparative genomics paper, in which we compared an in-house sequenced bacterial isolate with all publicly available closely related genomes (more about that in this blogpost). While doing this analysis, I was in constant fear that new genomes of our species of interest would be uploaded, meaning our study would already be outdated and we would have to run our whole pipeline all over again. So, to keep track of the number of publicly available genomes, I created a Raspberry Pi-driven Twitter robot which I (very creatively) named: @Lactobot.
For the past year, @Lactobot has checked NCBI’s Bacterial Assembly database every morning at 9 and reported to all of its 14 followers how many new Lactobacillus genomes were added to the database that day. As you can see in one of its latest tweets, the number of genomes has risen from around 1,000, when I first created @Lactobot, to more than 1,600 today.
This made me wonder: what was the situation before January 2017? Did the number of genomes increase much more slowly than today? And who has been uploading all of these genomes?
To answer these questions, I will of course use my favourite toolkit: R and the Tidyverse! I’ve left out the code to increase the readability of the post, but for those interested, you can find the full code here on GitHub. All data used in this blog post is freely available on the web.
The Lactobacillus genus
This graph shows a few cool things:
- The first Lactobacillus genome was released in 2004. The second one was uploaded at the end of 2005, about 1.5 years later.
- From 2015 on, the number of publicly available genomes started rising much faster.
- The first 500 genomes were uploaded over a timespan of 11 years, while genomes 500 – 1,000 were uploaded in 1.5 years and genomes 1,000 – 1,500 were uploaded within a single year.
The last observation really impressed me. This makes me feel lucky to be doing genomic research today and not 5 years ago!
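The curve discussed above is essentially a running total of release dates. Here is a minimal sketch in Python (not the R/Tidyverse code linked above), assuming each assembly comes with a release date. The function names and the toy dates are hypothetical and only serve as illustration:

```python
from collections import Counter
from datetime import date
from itertools import accumulate


def cumulative_counts(release_dates):
    """Return sorted (day, running total of genomes) pairs."""
    per_day = Counter(release_dates)
    days = sorted(per_day)
    totals = accumulate(per_day[d] for d in days)
    return list(zip(days, totals))


def date_reached(cumulative, n):
    """First day on which the running total reached at least n, else None."""
    for day, total in cumulative:
        if total >= n:
            return day
    return None


# Made-up release dates, for illustration only (not NCBI data).
toy_dates = [date(2004, 1, 1), date(2005, 6, 1), date(2005, 6, 1), date(2006, 3, 1)]
curve = cumulative_counts(toy_dates)
print(curve)
print(date_reached(curve, 2))
```

With real data, calling date_reached for 500, 1,000 and 1,500 would give the milestone dates behind the third bullet point.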
Another clear observation from the above graph is the sudden jump of more than 150 genomes in a single day at the end of 2015. What happened there? Who uploaded this giant batch of genomes all at once?
The graph above shows that a company called Shanghai Majorbio uploaded 174 different Lactobacillus genomes in a single day. That’s impressive, so a sincere ‘Thank You’ to them! Who else do I need to thank?
Well, apparently Shanghai Majorbio submitted/released all of their sequences at once and nobody else topped that number. They are followed by the University of Alberta and the University of Minnesota in second and third place, each of which submitted around half as many genomes as Shanghai Majorbio. The first European institute in this list is the Institute of Functional Genomics Lyon in fourth place, followed closely by ETH Zurich, the second European institute. To all of them (and all other submitters): Thank you for sharing your valuable data and keep on submitting!
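Both the ranking and the single-day peak come down to counting. Here is a minimal Python sketch, assuming each assembly is reduced to a (submitter, release date) pair. The institute names, dates and function names are hypothetical placeholders, not NCBI data or the original R code:

```python
from collections import Counter


def top_submitters(records, n=3):
    """Rank submitters by their total number of assemblies."""
    return Counter(submitter for submitter, _ in records).most_common(n)


def biggest_single_day(records):
    """Return (submitter, day, count) for the largest one-day upload."""
    (submitter, day), count = Counter(records).most_common(1)[0]
    return submitter, day, count


# Hypothetical records, for illustration only (not NCBI data).
toy_records = (
    [("Institute A", "2015-11-01")] * 4
    + [("Institute B", "2014-02-10")] * 2
    + [("Institute B", "2016-07-05")]
    + [("Institute C", "2013-05-20")]
)

print(top_submitters(toy_records))
print(biggest_single_day(toy_records))
```

Counting (submitter, date) pairs rather than submitters alone is what separates a one-off bulk upload from a steady stream of submissions.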
Top 20 most commonly sequenced bacterial genera
Well, enough about these lactobacilli. Let’s have a more general look at NCBI’s bacterial database. My first question is similar to the last one above: which institute spent the most research money on sequencing bacterial genomes?
Wow. Just wow.
This graph blows my mind. There’s one single institute that sequenced more than 22,000 genomes. If we were to sequence this number of bacterial strains today, at the current rate of roughly € 150 per genome (assuming that we use Illumina technology), this would add up to at least € 3.3 million! And I’m pretty sure that this is still a severe underestimation of the total cost. The best thing is: all of this data is publicly available, free for all of us to use!
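For those who want to check the back-of-the-envelope estimate, here it is as a tiny Python sketch. The genome count and the price per genome are the figures quoted above. The function name is just a hypothetical helper:

```python
def estimated_cost_eur(n_genomes, cost_per_genome_eur):
    """Lower-bound sequencing cost: number of genomes times price per genome."""
    return n_genomes * cost_per_genome_eur


total = estimated_cost_eur(22_000, 150)
print(f"€ {total:,}")  # prints "€ 3,300,000"
```

This only covers the per-genome sequencing price, which is why the real cost is likely higher.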
At first, I thought that ‘SC’ referred to some kind of parsing error, but a few clicks from the Assembly database to the BioProject and BioSample databases showed that all of these assemblies originated from the Wellcome Trust Sanger Institute. They are followed by the Broad Institute (+/- 9,000 genomes) and the University of Queensland (+/- 7,500 genomes).
Next question: on what kind of bugs are we spending this research money? In other words, which bacterial genera have received the most attention?
Can you spot your favourite genus?
If you’re a little bit into microbiology, this list probably does not surprise you. The top 5 most sequenced bacterial genera contain some very nasty pathogenic species. In fact, on a second look, every genus in the top 20 has at least one pathogenic species, except for two genera in 16th and 20th place: Lactobacillus and Streptomyces, respectively.
Finally, I wanted to know whether Streptococcus has always been the most popular bacterial genus on NCBI. So, as a final plot for this blog post I recreated the @Lactobot plot for all top 12 genera:
Look at those dynamics!
The sequencing adventure starts in 2001, with Vibrio cholerae as the first bacterial genome ever released on NCBI. (UPDATE: see below). Many others followed in 2002, such as Streptococcus pyogenes, Streptococcus agalactiae, Streptococcus pneumoniae, Staphylococcus epidermidis, Bacillus anthracis and Escherichia coli. From then on, it seems that Streptococcus kept the lead as the most commonly sequenced genus, until it was surpassed by Escherichia and later also by Staphylococcus in 2013-2014. However, some big sequencing efforts in 2015 and again in 2016 made Streptococcus the big overall winner.
I think it’s fair to say that the popularity of bacterial genome sequencing has gotten a significant boost since 2015. The number of publicly available genomes has been rising sharply since then and it does not seem to be stopping any time soon.
Moreover, this analysis made me realise that I can be extremely happy to be working with Lactobacillus, the most commonly sequenced non-pathogenic genus. This allows us to compare our own in-house sequenced isolates with a wide range of closely related strains, which just isn’t a possibility for many other bacterial genera.
Finally, I’m super curious what this graph will look like next year… See you then?
Code and data availability
All data used in this blog post are publicly available from NCBI’s Assembly database. You can find the full analysis here on GitHub.
UPDATE: As Prof. Salzberg pointed out in the comments, my statement regarding Vibrio cholerae is incorrect. I indeed overlooked some genomes because I filtered on the 20 most abundant genera to create the last plot. Therefore, I must correct this statement by saying:
The adventure really started in 1999 (I was 8 years old then), with 5 genomes suddenly released at once:
- Treponema pallidum
- Helicobacter pylori
- Haemophilus influenzae
- Aquifex aeolicus
- Chlamydophila pneumoniae
One year later in 2000 (I celebrated my 9th birthday that year) another genome was released:
- Deinococcus radiodurans
Only then do we see Vibrio cholerae popping up.
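The mistake is easy to reproduce: if you keep only the most abundant genera before looking for the earliest release, any early genome from a rarely sequenced genus silently disappears. Here is a minimal Python sketch of that pitfall. The toy records are made up (only the 1999 and 2001 years echo the text) and the function names are hypothetical:

```python
from collections import Counter


def earliest(records):
    """Return the (genus, year) record with the smallest year."""
    return min(records, key=lambda r: r[1])


def keep_top_genera(records, n):
    """Keep only records belonging to the n most frequently sequenced genera."""
    counts = Counter(genus for genus, _ in records)
    top = {genus for genus, _ in counts.most_common(n)}
    return [r for r in records if r[0] in top]


# Toy (genus, release year) records, for illustration only (not NCBI data).
toy = [
    ("Treponema", 1999),
    ("Vibrio", 2001),
    ("Vibrio", 2003),
    ("Streptococcus", 2002),
    ("Streptococcus", 2004),
    ("Streptococcus", 2005),
]

print(earliest(toy))                      # ('Treponema', 1999)
print(earliest(keep_top_genera(toy, 2)))  # ('Vibrio', 2001)
```

Answering "what came first?" on the unfiltered table, and filtering only for plotting, avoids the problem.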
In addition, Dr. Andrew Page from the Wellcome Sanger Institute pointed out that my numbers were still an underestimation of the number of genomes they’ve uploaded to NCBI. For this blogpost I only used the Assembly database. If I added the SRA database, I would end up with around 350,000 bacteria sequenced by them, which is even more astonishing.