Playing around with the grep source code to make it output fasta/fastq records. Check out the code here.
I’m quite interested in string searching algorithms as I’ve written a
program, crass, which uses a few of them
to search for CRISPR elements. Crass is pretty fast, but I want it to be faster,
specifically there is one point in crass where it searches for exact matches to many
thousands of patterns.
In a previous post
I tried out a number of different ‘multi-pattern matchers’ from seqan
and was fairly unimpressed with their speed. In this testing though I did not try
out the most widely used implementations of multi-pattern matching: GNU grep (using
-Ff options). I don’t have any graphs to show, but it was faster, like a lot faster.
It wasn’t a fair fight though, as grep works very differently to my test program fxtract. There is a great mailing list post by one of the original grep authors on ways to speed things up. It basically comes down to I/O (not that the search algorithms aren’t highly optimised themselves), fast input from a file and not copying the data in memory is the key. Grep doesn’t try to parse the file instead it just loads it into a big buffer in memory and searches the whole thing, if it finds a match then it figures out the boundaries of the line the match is on and prints it out. This is in contrast to programs like fxtract or crass that parse the file first to get the individual portions of each record before the search is performed.
This got me thinking that I could drastically speed crass up if I switched over to the grep way of processing files. I was a little worried though that determining the boundaries of a fasta/fastq record from an anonymous buffer might be a bit tricky so before I modified crass I chose to modify grep so that it would output fasta or fastq records.
The printing functions in grep live in
main.c and start with the function
grepbuf, which in-turn calls
prtext etc. The code is surprisingly simple,
grep reads from the file and fills a buffer;
one of the search functions on that buffer; if a match is found, a pointer to the first
character in the line the match was found is returned; and then the printing functions
take over. The printing functions get a pointer to the start of the line and the end
of the line of the match and pass that through to
fwrite. Everything is handled using
pointer arithmetic for determining the start and end of where to print.
This is great since it’s easy to change the pointer to the beginning a end of a record, rather
than a line. So that just left the logic for me to write in to find the limits of
a record. Below is a code snippet from grep where I’ve added in the logic. Fasta is
easy to implement as the
> symbol is generally unique. Fastq on the other hand takes
a bit more work, since the
@ symbol can also be found in the quality string.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
The fastq format parsing has fairly obvious corner cases since with this code there can be no text on the ‘comment’ line and the whole record must be of four lines. This version of fastq is the recommended formatting from the official fastq publication, which seems to have been adopted by Illumina and others, so hopefully this simple parsing will work most of the time.
It’s been a heap of fun looking at the way this very mature piece of software works and I’ve gotten a usable tool out of it. Now it’s onto the main event of ripping out parts of the source code that I want for crass