Exploratory data analysis with Pharo Smalltalk

The first time I heard about Smalltalk was while reading through the Wikipedia page for Ruby, which mentioned it as an influence. At the time I was just a few months into my transition from wet-lab biologist to bioinformatician and trying to decide between Perl, Python, and Ruby as a scripting language to learn. Python became my language of choice after a long battle with Perl (this was some years ago and Perl was much more relevant).

A Make for URIs

Make has been one of the key tools in my arsenal for getting things done. Although it was developed for compiling code, its functionality can be generalized to any process that requires files to be generated based on dependencies. I recommend you look at these slides by Vince Buffalo as a good introduction to using make for scientific workflows. Make works by creating a dependency graph of files and their prerequisites, using the last time each file was modified to determine if a file needs to be remade.
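The timestamp check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Make itself is implemented; the function name `needs_rebuild` is made up for this example.

```python
import os


def needs_rebuild(target, prerequisites):
    """Return True if target is missing or older than any prerequisite.

    Hypothetical helper: it mirrors the modification-time comparison
    that Make performs for a single rule, nothing more.
    """
    if not os.path.exists(target):
        # A target that does not exist always has to be made.
        return True
    target_mtime = os.path.getmtime(target)
    # Remake if any prerequisite was modified after the target was.
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)
```

A real dependency graph would apply this check recursively, remaking prerequisites before the targets that depend on them.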

Drawing KEGG pathway maps using biopython and matplotlib

I use KEGG a lot to understand microbial metabolism. KEGG is one of the largest resources of enzymes, biochemical reactions, genes, and molecules, all cross-linked and organized into what are called metabolic maps. These maps are well-constructed images of enzymes that function together for the same overall purpose, like amino acid synthesis or the metabolism of glucose. One of the great things about the website is the ability to color your own data onto their metabolic maps.

Checking back in on CRISPRs

As part of my PhD thesis I studied an emerging field of bacterial adaptive immunity, known as CRISPR. At the time I was interested in tracking this type of immune system in bacterial communities to follow co-evolution between bacteria and their viruses. For anyone interested and brave enough, here is a link to my thesis. After I finished up writing, submitting, and ultimately obtaining my PhD, I realised that large portions of the literature review section would never be published in a scientific journal.

Using Amazon Neptune full text search

I’ve been trying out Amazon Neptune’s full text search feature. Overall it’s been a great experience, although there are a few caveats when searching that mean you’ll have to craft your queries carefully to make full use of the feature. The TinkerPop standard has some text searching features; however, it lacks any advanced features such as searching using regular expressions or even case-insensitive searching. It’s left to different implementations to augment this text searching capability.

An update on Bioawk

Bioawk is a great project started by Heng Li some years ago. The aim was to take the awk source code and modify it slightly for use with common biological formats, adding in some new functions along the way. Heng’s original doesn’t accept too many pull requests, so to add in some features I maintain my own fork that has a few improvements. Long ago I added in a translate function, and recently I added in a function that will take the attribute string from a GFF file and turn it into an awk array.
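To give a feel for what turning a GFF attribute string into an array involves, here is a rough Python sketch. It is not the Bioawk function (which is written for awk) and the name `parse_gff_attributes` is invented for illustration; it assumes GFF3-style `key=value` pairs separated by semicolons, with percent-encoded values.

```python
from urllib.parse import unquote


def parse_gff_attributes(attr_string):
    """Turn a GFF3 attribute column into a dict of key -> value.

    Assumes 'key=value' pairs separated by ';'. Fields without
    an '=' are skipped rather than guessed at.
    """
    attributes = {}
    for field in attr_string.strip().split(";"):
        if not field:
            # Tolerate a trailing semicolon or empty field.
            continue
        key, sep, value = field.partition("=")
        if not sep:
            continue
        # GFF3 percent-encodes reserved characters in values.
        attributes[key.strip()] = unquote(value.strip())
    return attributes
```

For example, `parse_gff_attributes("ID=gene1;Name=dnaA")` gives a dict with the keys `ID` and `Name`, much like the awk array would be indexed by attribute name.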

Getting data from NCBI assembly using the accession number only

NCBI’s assembly database is a great one-stop shop for genomic data and annotations, but it’s actually kind of difficult to download data if you only know the accession number of an assembly. The documentation says that the assembly database is integrated with entrez-direct, a great set of command line utilities for accessing NCBI data. Most of the databases have an option to download data based on the ID, so I thought that something like the following would work

Creating huge metabolic overviews for comparative genomics

I love looking at KEGG maps and using them to understand an organism’s metabolism, but they have their limitations. For starters, you’re obviously stuck with how they are drawn, which in most cases includes many variations on a particular pathway. Secondly, the tools for mapping your own genes onto a pathway are limited to one organism at a time. What I really wanted was a way of quickly comparing tens of genomes across a common set of pathways, in this case amino acid synthesis pathways.

Formatting scientific names in R

Here is a function that will take a character string in R and return an expression for fancy formatting in plots that properly italicizes scientific names. The syntax for doing this is truly quite horrible, but this is how R does it.

split_by: splitting files on the command line based on content

Unix has so many great ways to perform text manipulation, but one niche which hasn’t been filled is splitting a tabular file into pieces based on the contents of certain columns. There are two commands, split and csplit, that perform a similar role. split can split a file into pieces of a certain number of bytes or lines; csplit uses a regular expression to determine where to split the file. Often for my purposes neither of these tools is a good fit, and what I really want is an equivalent to the “group by” clause in SQL databases.
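The “group by” behaviour I’m after can be sketched in Python as below. This is just a minimal illustration of the idea and not the implementation of split_by itself; the function name and its parameters are assumptions made for the example.

```python
from collections import defaultdict


def group_lines_by_column(lines, column, delimiter="\t"):
    """Group the lines of a tabular file by the value in one column.

    'column' is a zero-based index. Returns a dict mapping each
    distinct value to the list of original lines that contain it,
    in the order they were seen.
    """
    groups = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        groups[fields[column]].append(line)
    return dict(groups)
```

Each group could then be written to its own file named after the key, which is the piece that split and csplit can’t do on their own.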