My First Useful Script

Can you remember the first useful thing that you ever coded? I sure can, and I’m thankful for it every day. I’ve recently finished writing a little program called fxtract (which I’ve blogged about before) that acts like grep but returns whole fasta or fastq records from a file. It’s taken me a very long time to write this thing, primarily because I’m writing it in C++, but now that it’s pretty much done I’m feeling nostalgic about why I’m writing it in the first place.
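
To make the "grep, but for whole records" idea concrete, here is a minimal Python sketch. It is not fxtract's actual implementation (that is C++); the function names `read_fasta` and `grep_records` are made up for illustration, and it only handles fasta, not fastq.

```python
import re
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from fasta-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def grep_records(lines: Iterable[str], pattern: str) -> Iterator[Tuple[str, str]]:
    """Yield every whole record whose sequence matches the pattern."""
    regex = re.compile(pattern)
    for header, seq in read_fasta(lines):
        # Sequence lines are joined first, so a motif that spans a
        # line break in the file is still found.
        if regex.search(seq):
            yield header, seq


example = [
    ">read1", "ACGTACGT", "TTGACA",
    ">read2", "GGGGCCCC",
    ">read3", "AAATTGACAGG",
]
matches = list(grep_records(example, "TTGACA"))
for header, seq in matches:
    print(f">{header}\n{seq}")
```

The difference from plain grep is that the unit of output is the record rather than the line: a hit anywhere in the sequence returns the header along with the full sequence.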

Finding 16S/18S reads in metagenomes

Got a metagenome? Want to know what the community looks like? rRNA operons are typically poorly assembled in metagenomic datasets because their sequences are highly conserved. More targeted assembly approaches may be necessary to obtain accurate reconstructions from short-read datasets. There are a few ways to extract reads originating from either 16S or 18S rRNA genes, and a number of programs can do it: SSU-ALIGN, rRNASelector, riboPicker, SortMeRNA, blast, bowtie and bwa, to name a few.

Poking around inside grep

Playing around with the grep source code to make it output fasta/fastq records. Check out the code here. I’m quite interested in string searching algorithms, as I’ve written a program, crass, which uses a few of them to search for CRISPR elements. Crass is pretty fast, but I want it to be faster. Specifically, there is one point in crass where it searches for exact matches to many thousands of patterns.

Genome Scaffolders Suck

Experiences using a variety of contig scaffolding tools. It was not a good experience. Recently in our lab we’ve been getting some Illumina mate-pair data to improve some metagenomic assemblies. The sequencing has been going well and we’ve been generating a good number of mate-pairs without too much duplication, but we’ve had quite a bit of trouble with the bioinformatic part of actually using this data to improve the assemblies. There are a number of software tools available to link contigs together after an assembly has been done; however, many assume that you are scaffolding a genome, not a metagenome.

Testing out Seqan's Multipattern Search Implementations

I recently discovered Seqan, a header-only C++ library for bioinformatics. I’ve been playing around with the toolkit to make some small programs just to see whether I want to use it in a larger project. So far I’ve written prepmate, an adaptor trimming program for Illumina’s Nextera mate-pair libraries, and fxtract, a grep-like program for extracting fasta/fastq records from large files. One of the things I need in fxtract and in another program I’ve written, crass, is to search for multiple patterns simultaneously (in this case a number of different DNA motifs). Seqan implements a number of algorithms for multipattern matching (check out their tutorial page); however, they don’t give many clues as to why you would choose one algorithm over another. So I decided to take a few of these implementations out for a spin using fxtract.
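
For anyone who hasn’t met multipattern search before, here is a rough Python sketch of Aho-Corasick, one classic algorithm for finding many patterns in a single pass over the text. This is only an illustration of the idea, not Seqan’s code or what fxtract does internally; the function names `build_automaton` and `find_all` and the example motifs are made up.

```python
from collections import deque
from typing import Dict, List, Tuple


def build_automaton(patterns: List[str]):
    """Build the trie, failure links and output lists for the patterns."""
    goto: List[Dict[str, int]] = [{}]   # goto[state][char] -> next state
    fail: List[int] = [0]               # longest proper suffix that is also a trie prefix
    out: List[List[int]] = [[]]         # indices of patterns ending at each state

    # Phase 1: insert every pattern into the trie.
    for idx, pat in enumerate(patterns):
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(idx)

    # Phase 2: breadth-first pass to fill in the failure links.
    queue = deque(goto[0].values())     # depth-1 states fail back to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit patterns that end at the suffix state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text: str, patterns: List[str]) -> List[Tuple[int, str]]:
    """Return (start position, pattern) for every occurrence in text."""
    goto, fail, out = build_automaton(patterns)
    hits: List[Tuple[int, str]] = []
    state = 0
    for pos, ch in enumerate(text):
        # Follow failure links until this character can be consumed.
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for idx in out[state]:
            pat = patterns[idx]
            hits.append((pos - len(pat) + 1, pat))
    return hits


motifs = ["ATTA", "TACA", "GAT", "CCC"]
hits = find_all("GATTACAGATC", motifs)
print(hits)
```

The text is scanned once no matter how many motifs there are, which is why this family of algorithms is attractive when the pattern list runs into the thousands. Which algorithm wins in practice depends on things like pattern length and alphabet size, and that is exactly the kind of question a benchmark is for.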