Formatting scientific names in R

Here is a function that takes a character string in R and returns an expression for plot labels that properly italicizes scientific names. The syntax for doing this is quite horrible, but this is how R does it.

split_by: splitting files on the command line based on content

Unix has so many great ways to manipulate text, but one niche that hasn’t been filled is splitting a tabular file into pieces based on the contents of certain columns. Two commands, split and csplit, play a similar role. split breaks a file into pieces of a given number of bytes or lines; csplit uses a regular expression to determine where to split the file. Often neither of these tools is a good fit for my purposes, and what I really want is an equivalent to the “group by” clause in SQL databases.
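To make the “group by” idea concrete, here is a minimal Python sketch. It is not the split_by tool itself: the function name `split_by_column`, its parameters, and the `<key>.tsv` output naming are all assumptions made for illustration. It also assumes the key values are safe to use as file names and that the whole file fits in memory.

```python
from collections import defaultdict
from pathlib import Path


def split_by_column(in_path, out_dir, column=0, delimiter="\t"):
    """Group the lines of a delimited file by the value in one column.

    Hypothetical helper: writes one file per distinct key into out_dir
    and returns a dict mapping each key to the path that was written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Collect lines per key, preserving their original order.
    groups = defaultdict(list)
    with open(in_path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            key = line.split(delimiter)[column]
            groups[key].append(line)

    # Write one output file per key (assumes keys are valid file names).
    written = {}
    for key, lines in groups.items():
        out_path = out_dir / f"{key}.tsv"
        out_path.write_text("\n".join(lines) + "\n")
        written[key] = out_path
    return written
```

Calling `split_by_column("table.tsv", "pieces", column=0)` would write one file per distinct value in the first column. Unlike split and csplit, the pieces are defined by the column contents rather than by position in the file, so rows with the same key end up together even if they are not adjacent.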

Downloading NCBI genomes from a given taxonomy

The Entrez Direct toolkit is great for programmatic access to all of NCBI’s resources. The little snippet below finds all of the RefSeq representative genomes for a given NCBI taxonomy ID, makes a little summary of the genomes downloaded, and uses wget to download the GenBank files from the Assembly FTP. Change the initial query in the first call to esearch to change which genomes are downloaded.

Python ete3: formatting organism names the way I want

I feel like I’m on a life-long quest to make all of my phylogenetic tree figures completely programmatically. The best tool I’ve found for making them is the ete library for the Python programming language. I’ve already figured out how to get trees drawn in the style that I like, but there was still one thing left to do: making organism names italicize correctly. I work with microorganisms, where the convention is for the genus and species names to be italicized but the strain name to be in regular font.
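The naming convention described above can be sketched as a small string-splitting step. This is only an illustration, not the code from the post: the function name `split_organism_name` is hypothetical, and it assumes the first two whitespace-separated words are the genus and species and everything after them is the strain. Names such as “Candidatus ...” or “Genus sp. strain” would need extra handling.

```python
def split_organism_name(name):
    """Split an organism name into (italic_part, regular_part).

    Assumption: the first two words are genus and species (italic),
    and any remaining words are the strain designation (regular font).
    """
    tokens = name.split()
    italic_part = " ".join(tokens[:2])
    regular_part = " ".join(tokens[2:])
    return italic_part, regular_part


# Example: the strain designation is separated from the binomial.
italic, regular = split_organism_name("Escherichia coli K-12")
print(italic)   # Escherichia coli
print(regular)  # K-12
```

Each piece could then be drawn as its own text element on the tree, one styled italic and one styled normal, for example with two ete3 `TextFace` objects placed side by side.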

My guide to annotating proteins and pathways

So. You’ve got yourself a nice new genome sequence and you want to know what kind of metabolism it has. There is a good chance that you already have some idea (you think it’s a nitrogen fixer or a sulfate reducer, etc.) based on other analyses, and now it’s time to strengthen your paper with a bit of genomic evidence. Getting an initial annotation: the vast majority of the genes in the genome are going to be hypothetical proteins; of the rest, a sizable chunk are going to be genes with a general sort of annotation like “ABC transporter” (which says nothing about what it’s transporting), and the remainder are going to be the metabolic genes that you probably care about.

Messing around with Phyloxml in ete3

ete3 has support for PhyloXML, which I use with the Archaeopteryx tree viewer for a lot of my day-to-day phylogenetics visualisation. My main reason for using PhyloXML is convenience: I have a script that easily adds the proper organism name onto the tree, and I think that Archaeopteryx is a really good basic tree viewer. I wanted to draw a tree from PhyloXML in ete using my own style and have the proper organism name rendered.

Going from a messy supplementary table to good clean data

“Bioinformatics… Or ‘advanced file copying’ as I like to call it.” (Nick Loman, @pathogenomenick, January 29, 2014) Get ready for some advanced file copying! I recently had to clean up some data from the supplementary material of Pereira et al. 2011, which is a very nice table of manually annotated genes in sulfate-reducing bacteria. The only problem is that the table is designed for maximum human readability, which made it a real pain to parse out the data.

Drawing Phylogenetic Trees, Connor Style

I do a lot of work in phylogenetics, which means that for just about every paper I’ve written I’ve had at least one figure that is a phylogenetic tree. Making pretty-looking trees for a publication is tedious. My previous workflow involved using ARB to draw the tree and produce an initial PostScript file, and then loading that into Adobe Illustrator to make everything beautiful. The problem with this is that it is not an automated process, so any time I need to change the tree I need to redo all of the ‘beautifying’ manually.

Genome Bin Decontamination

Genome bins coming off automated pipelines can be contaminated with parts of other genomes. As part of my workflow I use CheckM (I’m biased since I’m a coauthor) to assess the contamination of genome bins using single-copy marker genes. If you’re lucky, the genome bins that you’re interested in will be relatively complete without much contamination. Unfortunately, that isn’t always the case. In this blog post I’m going to run through some of the analyses that I did on a genome bin that was 90% complete but 70% contaminated. This is an exploratory analysis to see if I can manually improve the bin beyond what the automated tools can do.

The pace of genome binning from metagenomes

At the current pace of science, what seemed like top stuff three years ago is now an order of magnitude smaller than what just got published.