Can you remember the first useful thing that you ever coded? I sure can, and I’m thankful for it every day.
I’ve recently finished writing a little program called fxtract
(which I’ve blogged about before)
that acts like grep but returns whole fasta or fastq records from a file. It’s
taken me a very long time to write this thing, primarily cause I’m writing it
in C++ but now that it’s pretty much done I’m feeling nostalgic about why I’m
writing it in the first place.
Back in 2010 I was just starting my PhD with Gene Tyson
and learning some bioinformatics on a “toy” dataset (which became my first
paper)
that was a huge 10 Mbp of viral sequence data! At the time I was learning Perl
using this great online tutorial
but I found that I needed some more “real-world” examples to really get to grips
with the language. I can’t fully remember what I was doing but I had a tabular
blast output file and I wanted to get the sequence of some contigs with blast
hits. I jumped at the opportunity to solve this with my nasent coding skills and
so I started writing my first useful bit of code:
contig_extractor.pl
.
In the beginning it did only what I first wrote it to do: take a blast file and
a fasta file and return some contigs. Over time though it morphed into something
more general - a way of getting some subset of a fasta file using a list of
identifiers. Over the years I added functionality for different file formats and
allowed searching using regular expressions.
This one piece of code eventually disseminated throughout the whole lab group
as new people came in and needed to solve the same problems. I think that the
introduction to github for many of the PhD students who came after me
was downloading my random collection of scripts just so they
could get their hands on contig_extractor.pl
.
Of course every bioinformatician has probably written the same piece of code to solve the same problem. If I had been smart enough to look I’m sure that I would have found one, but using a script doesn’t quite give the same level of satisfaction as writing it yourself (and finding that it’s useful by all your colleages).
There came a time though when contig_extractor.pl
reached it’s limit. I was
trying to extract a few thousand records from a fastq file with 100 million and
Perl just wasn’t cutting it anymore. And so fxtract
was born to do all the good
things from contig_extractor.pl
just a heap faster.The sun is setting on my
first useful piece of code, at the moment only my finger memory and force of habit
keeps it being used. But when I think about it, it’s been an amazing four year run for something that I
probably thought was going to be a one-off script, hopefully fxtract
can get
as much love from me as well.