Got a Metagenome? want to know what the community looks like?
rRNA operons are typically poorly assembled in metagenomic datasets due to highly conserved sequences. More targeted assembly approaches may be necessary to obtain accurate reconstructions from short read datasets. There are a few ways in which we can extract reads originating from either 16S or 18S reads and there are a number of programs (SSU-ALIGN, rRNASelector, riboPicker, SortMeRNA, blast, bowtie or bwa) to name a few.
There are a other ways of looking at the communty from raw metagenomic reads. These are mostly kmer-based approaches like kraken or extended marker gene approaches like MetaPhlAn or phylosift. I’m not against any of them, just never used them, so I’m going to keep the following post specific to 16s/18s community composition
Extracting 16S/18S reads
I’ve used SortMeRNA and bowtie/bwa in my workflows with good success. The difference between these two methods is that SortMeRNA that uses a kmer searching method using an index created from a database of previously sequenced genes whereas bowtie/bwa use a local alignment method to compare the query sequence to a previously made database. Below are instructions on how to use both methods.
Installing and using SortMeRNA
From the source page for SortMeRNA on github copy the ssh clone url and then open up a terminal window and type
1 2 3 4
Mac OSX install note: I found that there was an error with my install
configure complaining that it was missing
install-sh. I solved this by
copying autogen.sh into the same
configure, execute it and then try running configure again.
The commands above should install SortMeRNA in the directory which you downloaded it to.
Building indexes for SortMeRNA
SortMeRNA comes packaged with 8 different rRNA databases, all derived from SILVA. Build either the 16S or 18S database, depending on what you want. Below is the command used to generate the 18S index (assuming that you’ve installed in the download directory)
ls the index directory you should see four files generated.
Now we can use this index to extract the reads that may come from 18S using the
sortmerna command. I’m going to assume that there are files containing raw
reads from an Illumina run called
Unfortunately one of the limitations of SortMeRNA is that it requires that you
only give it a single file and that the file is uncompressed. So to start with,
unzip the files using
gzip and then combine them into a single file using
merge-paired-reads.sh found in the scripts directory of the SortMeRNA source
To save space, it’s probably best to re-zip the files to save on harddrive space
Extracting the reads
Now we can run
sortmerna, saving matched reads (and their mates) to a file
with the prefix ‘matched-18S’ in fastq format using 4 processors.
1 2 3 4 5 6
Installing and using BWA
My personal preference is to use bwa over bowtie but the merits of either are debatable (there are also 50 other programs out there that try to solve the same problem so the choice is yours). Download bwa from github by copying the ssh clone url and typing the following into the terminal. (Change directories out of the SortMeRNA source directory before you do this)
1 2 3
bwa also requires that an index of the database be made. For simplicity, lets use the same database that was included in SortMeRNA. To make the bwa index copy the files that you want from the SortMeRNA source directory into a new location.
This should make a whole bunch of files that have extra file extensions appended to the fasta file. With the index created we can now align the reads
(Notice that there is no preprocessing steps necessary for the reads)
Postprocessing the sam file
The output of bwa is a standardized format called sam which many programs will now output. This format essentially describes the alignment of each of the query sequences to the reference sequences. This is not exactly what we want, which is the reads that were successful hits to the reference in fasta/q format. To go from a sam file to a fasta/q file is a little complicated (I wish it wasn’t). To start with, download samtools and fxtract from github, and download them into new source directories (bonus points for getting them installed without a walkthrough). First convert the sam file into its equivalent binary format and filter out unaligned sequences:
Now get the names of all the reads that aligned:
And finally extract those reads from the original fastq files: