The Entrez Direct toolkit is great for programmatic access to all of NCBI’s resources. The snippet below finds all of the RefSeq representative genomes for a given NCBI taxonomy ID, writes a short summary of the genomes downloaded, and uses wget
to download the GenBank files from the Assembly FTP site. Change the initial query in the first call to esearch
to change which genomes are downloaded.
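One piece of that download step can be sketched in Python: turning an assembly's FTP directory path into the URL of its GenBank flat file. This is a minimal sketch, not the original Entrez Direct snippet. The function name `genbank_url` is hypothetical, and the code assumes the usual Assembly FTP layout in which the file is named after the last component of the directory path with `_genomic.gbff.gz` appended.

```python
def genbank_url(ftp_path):
    """Build the URL of the GenBank flat file for one assembly directory.

    Assumes the file is named <directory name>_genomic.gbff.gz, which is
    an assumption about the Assembly FTP layout, not a guarantee.
    """
    base = ftp_path.rstrip("/")          # tolerate a trailing slash
    name = base.rsplit("/", 1)[-1]       # last path component, e.g. GCF_..._ASM584v2
    return f"{base}/{name}_genomic.gbff.gz"


if __name__ == "__main__":
    path = "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2"
    print(genbank_url(path))
```

The resulting URL is what you would hand to wget, one per genome.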
I feel like I’m on a life-long quest to make all of my phylogenetic tree figures completely programmatically. The best tool I’ve found for making them is the ete library for the Python programming language. I’ve already figured out how to get trees drawn in the style that I like, but there was still one thing left to do: getting organism names to italicize correctly. I work with microorganisms, where the convention is for the genus and species names to be italicized but the strain name to be in regular font.
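The naming convention itself is easy to express in code. Here is a minimal sketch, assuming the organism name is a plain string that starts with the genus and species followed by the strain; the helper name `split_organism_name` is hypothetical and is not part of ete. It also treats the placeholder "sp." as regular text, since only the genus is italicized in that case.

```python
def split_organism_name(name):
    """Split an organism name into (italic_part, regular_part).

    Assumes the first two words are genus and species, unless the
    second word is a placeholder such as "sp.", in which case only
    the genus is italicized.
    """
    words = name.split()
    n_italic = 2
    if len(words) > 1 and words[1] in {"sp.", "spp."}:
        n_italic = 1
    italic = " ".join(words[:n_italic])
    regular = " ".join(words[n_italic:])
    return italic, regular


if __name__ == "__main__":
    print(split_organism_name("Escherichia coli K-12 MG1655"))
    print(split_organism_name("Desulfovibrio sp. G20"))
```

Each part can then be drawn as its own text element, one in italics and one in a regular font.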
So. You’ve got yourself a nice new genome sequence and you want to know what kind of metabolism it has. There is a good chance that you have some idea already — you think it’s a nitrogen fixer or a sulfate reducer etc. — based on other analyses and now it’s time to strengthen your paper with a bit of genomic evidence.
Getting an initial annotation

The vast majority of the genes in the genome are going to be hypothetical proteins; of the rest, a sizable chunk are going to be genes with a general sort of annotation like “ABC transporter” (which says nothing about what it’s transporting), and the rest are going to be metabolic genes that you probably care about.
ete3 has support for PhyloXML, which I use with the Archaeopteryx tree viewer for a lot of my day-to-day phylogenetics visualisation. My main reason for using PhyloXML is convenience: I have a script that easily adds the proper organism name onto the tree, and I think that Archaeopteryx is a really good basic tree viewer. I wanted to draw a tree from PhyloXML in ete using my own style and have the proper organism name rendered.
Bioinformatics… Or ‘advanced file copying’ as I like to call it.
— Nick Loman (@pathogenomenick) January 29, 2014
Get ready for some advanced file copying! I recently had to clean up some data from the supplementary material of Pereira et al. 2011, which is a very nice table of manually annotated genes in sulfate-reducing bacteria. The only problem is that the table is designed for maximum human readability, which made it a real pain to parse.
I do a lot of work in phylogenetics, which means that for just about every paper I’ve written I’ve had at least one figure that is a phylogenetic tree. Making pretty-looking trees for a publication is tedious. My previous workflow involved using ARB to draw the tree and produce an initial PostScript file, and then loading that into Adobe Illustrator to make everything beautiful. The problem is that this is not an automated process, so any time I need to change the tree I need to redo all of the ‘beautifying’ manually.
Genome bins coming off automated pipelines can be contaminated
with parts of other genomes. As part of my workflow I use
CheckM
(I’m biased since I’m a coauthor) to assess the contamination of
genome bins using single-copy marker genes. If you’re lucky then
the genome bins that you’re interested in will be relatively complete
without much contamination. Unfortunately that isn’t always the
case. In this blog post I’m going to run through some of the analyses
that I did on a genome bin that was 90% complete but 70% contaminated.
This is exploratory analysis to see if I can manually improve the
bin over what the automated tools can do.
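To give a feel for how a bin can be both mostly complete and heavily contaminated, here is a deliberately simplified sketch of a marker-based estimate. It is not CheckM's actual algorithm, which works with collocated marker sets chosen for the lineage. The function name `estimate_quality` and the input format (a dict mapping each expected single-copy marker to the number of copies found) are assumptions made for illustration.

```python
def estimate_quality(marker_counts):
    """Return (completeness, contamination) as percentages.

    Simplified illustration only: completeness is the share of expected
    markers seen at least once, and contamination is the number of extra
    copies relative to the number of expected markers.
    """
    n = len(marker_counts)
    if n == 0:
        raise ValueError("need at least one expected marker")
    present = sum(1 for c in marker_counts.values() if c >= 1)
    extra = sum(c - 1 for c in marker_counts.values() if c > 1)
    return 100.0 * present / n, 100.0 * extra / n


if __name__ == "__main__":
    # Made-up counts for four hypothetical markers.
    counts = {"m1": 1, "m2": 2, "m3": 0, "m4": 3}
    print(estimate_quality(counts))
```

Because extra copies are counted independently of missing markers, the two numbers do not have to add up to 100%.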
At the pace science moves, what seemed like top stuff three years ago is now an
order of magnitude smaller than what has just been published.
The question that most people ask when looking at a metagenomic draft genome bin
is: should this gene really be there? The answer is that sometimes it’s not easy to know.
Every PhD student can contribute to open science by uploading their thesis literature review onto Wikipedia!
My thesis was entitled “Phage-host evolution in a model ecosystem”, where I tracked phage genome evolution and the evolution of bacterial defense mechanisms using metagenomics. When I was writing my thesis I spent a lot of time writing up the section on CRISPRs, which are a type of bacterial adaptive immune system.