Can you remember the first useful thing that you ever coded? I sure can, and I’m thankful for it every day.
I’ve recently finished writing a little program called fxtract (which I’ve blogged about before) that acts like grep but returns whole fasta or fastq records from a file. It’s taken me a very long time to write this thing, primarily because I’m writing it in C++, but now that it’s pretty much done I’m feeling nostalgic about why I’m writing it in the first place.
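To make the "grep, but for whole records" idea concrete, here is a minimal Python sketch. It is not how fxtract itself is implemented (that is C++), and the names `read_fasta` and `grep_records` are made up for illustration. It handles fasta only and assumes the input is well formed.

```python
import re
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of fasta lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None and line:
            # Sequence lines are joined so a match can span a line break.
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def grep_records(lines: Iterable[str], pattern: str,
                 search_header: bool = False) -> Iterator[Tuple[str, str]]:
    """Yield every whole record whose sequence (or header) matches pattern."""
    regex = re.compile(pattern)
    for header, seq in read_fasta(lines):
        target = header if search_header else seq
        if regex.search(target):
            yield header, seq


if __name__ == "__main__":
    demo = [">read1 sample", "ACGTTTGA", "CCGG", ">read2 sample", "TTTTAAAA"]
    for header, seq in grep_records(demo, "GACC"):
        print(f">{header}\n{seq}")
```

In the demo, the motif `GACC` straddles the line break inside `read1`, so plain line-based grep would miss it. Joining the sequence lines before searching is what makes the record the unit of matching.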
Got a metagenome? Want to know what the community looks like?
rRNA operons are typically poorly assembled in metagenomic datasets due to highly conserved sequences. More targeted assembly approaches may be necessary to obtain accurate reconstructions from short read datasets. There are a few ways to extract reads originating from 16S or 18S rRNA genes, and a number of programs can do it: SSU-ALIGN, rRNASelector, riboPicker, SortMeRNA, blast, bowtie or bwa, to name a few.
Playing around with the grep source code to make it output fasta/fastq records. Check out the code here.
I’m quite interested in string searching algorithms as I’ve written a program, crass, which uses a few of them to search for CRISPR elements. Crass is pretty fast, but I want it to be faster. Specifically, there is one point in crass where it searches for exact matches to many thousands of patterns.
Experiences using a variety of contig scaffolding tools. It was not a good experience.
Recently in our lab we’ve been getting some Illumina mate-pair data to improve some metagenomic assemblies. The sequencing has been going well and we’ve been generating a good number of mate-pairs without too much duplication, but we’ve had quite a bit of trouble with the bioinformatic part of actually using this data to improve the assemblies. There are a number of software tools available to link contigs together after an assembly has been done; however, many assume that you are scaffolding a genome, not a metagenome.
I recently discovered Seqan, a header-only C++ library for bioinformatics. I’ve been playing around with the toolkit to make some small programs just to see whether I want to use it in a larger project. So far I’ve written prepmate, an adaptor trimming program for Illumina’s Nextera mate-pair libraries; and fxtract, a grep-like program for extracting fasta/fastq records from large files. One of the things I need to do in fxtract, and in another program I’ve written, crass, is search for multiple patterns simultaneously (in this case a number of different DNA motifs). Seqan implements a number of algorithms for multi-pattern matching (check out their tutorial page); however, they don’t give many clues as to why you would choose one algorithm over another. So I decided to take a few of these implementations out for a spin using fxtract.
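For readers who haven't met multi-pattern matching before, here is a small Python sketch of Aho-Corasick, one classic algorithm of this kind. It is only an illustration under my own assumptions: it is not Seqan's implementation or API, it is not the code in fxtract or crass, and the class name and layout are invented here. Patterns are assumed to be non-empty.

```python
from collections import deque
from typing import Dict, Iterator, List, Tuple


class AhoCorasick:
    """Find all occurrences of many non-empty patterns in one pass over the text."""

    def __init__(self, patterns: List[str]) -> None:
        self.patterns = patterns
        self.goto: List[Dict[str, int]] = [{}]  # trie edges per state
        self.fail: List[int] = [0]              # failure link per state
        self.out: List[List[int]] = [[]]        # pattern indices ending at state
        for idx, pat in enumerate(patterns):
            self._insert(pat, idx)
        self._build_failure_links()

    def _insert(self, pattern: str, idx: int) -> None:
        state = 0
        for ch in pattern:
            nxt = self.goto[state].get(ch)
            if nxt is None:
                nxt = len(self.goto)
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[state][ch] = nxt
            state = nxt
        self.out[state].append(idx)

    def _build_failure_links(self) -> None:
        # Breadth-first, so shallower states are finished before deeper ones.
        queue = deque(self.goto[0].values())  # depth-1 states fail to the root
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                # Inherit patterns that end at the longest proper suffix.
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def search(self, text: str) -> Iterator[Tuple[int, str]]:
        """Yield (start_position, pattern) for every match in text."""
        state = 0
        for pos, ch in enumerate(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for idx in self.out[state]:
                pat = self.patterns[idx]
                yield pos - len(pat) + 1, pat
```

Usage looks like `AhoCorasick(["ACG", "CGT"]).search("ACGTTT")`, which yields each match as a start position and the pattern found. The appeal for a job like searching thousands of motifs is that the text is scanned once regardless of how many patterns there are; the cost moves into building the automaton up front.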