This tutorial walks you through the stages of running Crass and interpreting its output. It does not cover installation; for that, refer to the manual.
Crass accepts both fasta and fastq input files, and they can even be gzipped. The only requirement is that reads be between 76 and 2000 bp long, which means that Crass can work with data from Sanger, 454, Ion Torrent and Illumina sequencing platforms. For this tutorial we will use data from an Enhanced Biological Phosphorus Removal (EBPR) reactor community (but feel free to work through with your own metagenomes if you want). The data archive can be found in NCBI's Sequence Read Archive under the accession SRX117744.
This will download an SRA archive that you will need to convert into a fastq file. NCBI provides a toolkit for this kind of conversion. Use the
fastq-dump tool to convert the SRA archive to fastq format.
$ fastq-dump -F SRX117744.sra
The above command should produce a file called
The -F option makes fastq-dump keep the original read identifiers in the output rather than the SRA read identifiers.
Crass is very easy to run. The most basic command is:
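The read-length requirement mentioned earlier (76 to 2000 bp) can be checked on a fastq file before running Crass. The following is a minimal sketch, not part of Crass: the function names are made up for illustration, the bounds are assumed to be inclusive, and the parser assumes plain four-line fastq records. It only counts reads; it says nothing about what Crass itself does with reads outside the range.

```python
import gzip

# Bounds quoted in the tutorial text; treating them as inclusive is an assumption.
MIN_LEN, MAX_LEN = 76, 2000


def open_maybe_gzip(path):
    """Open a plain or gzip-compressed text file, detected from its magic bytes."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "rt")


def fastq_length_summary(lines):
    """Return (inside, outside): read counts within and beyond the length bounds.

    Assumes simple four-line fastq records (no wrapped sequence lines).
    """
    inside = outside = 0
    for index, line in enumerate(lines):
        if index % 4 == 1:  # the sequence is the second line of each record
            length = len(line.strip())
            if MIN_LEN <= length <= MAX_LEN:
                inside += 1
            else:
                outside += 1
    return inside, outside


# Tiny in-memory example: one 80 bp read and one 4 bp read.
example = [
    "@read1\n", "A" * 80 + "\n", "+\n", "I" * 80 + "\n",
    "@read2\n", "ACGT\n", "+\n", "IIII\n",
]
example_counts = fastq_length_summary(example)
print(example_counts)
```

To run it on a real file, pass an open handle, for example `fastq_length_summary(open_maybe_gzip("SRX117744.fastq"))`.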
$ crass SRX117744.fastq
This will run crass with the default settings on the fastq file created from the SRA archive. This will output a whole heap of files to the current directory. By default Crass will output two files for each CRISPR that it finds: a
.gv (graphviz) file and a
.fa (fasta) file. The fasta file contains all of the reads that contained the CRISPR, or more specifically the direct repeat. The graphviz file is a textual representation of the spacer arrangement and can be visualized in a number of programs; I use either Gephi or the Graphviz software tools to create a graphical representation. There will also be two extra files. The first is a log file that gives a little information about the parameters and results; by default it is not very detailed, but you can increase the logging verbosity. The other file will be called
crass.crispr. This is an xml file containing all of the information about the CRISPRs that were discovered during the run. You can use the crisprtools software package to manipulate and extract information from this file.
Crass can take any number of input files; just be aware that they will all be processed the same way and assumed to be from the same source:
$ crass SRX117744.1.fastq SRX117744.2.fastq
You can also give Crass fasta files:
$ crass SRX117744.1.fasta SRX117744.2.fasta
Crass accepts paired-end reads but does not use the pairing information; each read is treated as unpaired.
The default options set by Crass are pretty good, but there are two things that I generally do when using Crass: increase the logging verbosity and control the output directory.
$ crass -o crass_out -l 4 SRX117744.fastq
The command above will output (
-o) all the results to a folder called
crass_out and increase the log level (
-l) to 4. You don't have to create the directory yourself; Crass will do that for you if it doesn't exist. Now the results will be nicely tucked away in their own folder and the log file will contain some useful information about how Crass went about finding the CRISPRs.
Crass contains a whole heap of options for controlling the way that it identifies and assembles CRISPR loci; please refer to the manual for more information about what each one does.
Crass outputs a graph of the spacer arrangement that it identifies. The image below gives a nice primer on how this kind of output compares with a more traditional approach used in current literature. Basically each spacer is a circle and the arrows between the circles tell you how the spacers sit one after another. Therefore if you start at the leader sequence and follow the arrows to the end, you would have travelled through one of the strains that compose this CRISPR.
Comparison between different visualization approaches used for CRISPRs. (A) A traditional alignment approach whereby each spacer is represented as a coloured rectangle; spacers that are common across multiple strains are aligned on top of each other. (B) The graph approach used in Crass; here each spacer is represented as a single circle, whether it was found in multiple strains or not. The arrows from the leader sequence show which spacers are next to each other.
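The idea of reading strains off the graph can be sketched in a few lines of code. This is an illustration only, not how Crass works internally: the graph below is hypothetical (two strains that share their first and last spacer), and the function assumes the graph has no cycles.

```python
def enumerate_paths(edges, start):
    """Return every path from `start` to a node that has no outgoing arrows.

    `edges` maps a node name to the list of nodes its arrows point to.
    Assumes the graph is acyclic.
    """
    paths = []
    stack = [[start]]
    while stack:
        path = stack.pop()
        successors = edges.get(path[-1], [])
        if not successors:
            paths.append(path)  # reached the end of one strain
        for successor in successors:
            stack.append(path + [successor])
    return sorted(paths)


# Hypothetical spacer graph: spacers B and C are strain-specific.
example_edges = {
    "leader": ["A"],
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
}

strains = enumerate_paths(example_edges, "leader")
for strain in strains:
    print(" -> ".join(strain))
```

Each printed line corresponds to one route from the leader to the end of the array, which is what a row in the traditional alignment view (A) represents.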
Crass also outputs an xml file called
crass.crispr. This file contains all the information about the direct repeats, spacers and flanks found for each CRISPR, as well as the information needed to reconstruct the graph, should you lose that file. Below is an example file containing a single CRISPR with a single spacer, just to give you an idea of the structure of
.crispr files. Luckily there is already software, called crisprtools, available to manipulate the
<?xml version="1.0" encoding="ISO8859-1" standalone="no" ?>
<crispr version="1.1">
  <group drseq="AGTTGGGATGTTTCCAATGTGACTAATATGAGAG" gid="G11">
    <data>
      <sources>
        <source accession="CAM_READ_0232711161" soid="SO2"/>
      </sources>
      <drs>
        <dr drid="DR1" seq="AGTTGGGATGTTTCCAATGTGACTAATATGAGAG"/>
      </drs>
      <spacers>
        <spacer cov="1" seq="GAATGTTTGGGCATAGTGAATTCAATCAAAATATTGGC" spid="SP12">
          <source soid="SO2"/>
        </spacer>
      </spacers>
    </data>
    <metadata>
      <program>
        <name>crass</name>
        <version>0.2.13</version>
        <command>crass -o crass_0.2.13 -K 9 raw/combined.fa </command>
      </program>
      <notes>Run on 20_05_2012_165537</notes>
      <file type="log" url="crass.20_05_2012_165537.log"/>
      <file type="data" url="Spacers_11_AGTTGGGATGTTTCCAATGTGACTAATATGAGAG.gv"/>
      <file type="sequence" url="Group_11_AGTTGGGATGTTTCCAATGTGACTAATATGAGAG.fa"/>
    </metadata>
    <assembly>
      <contig cid="C1">
        <cspacer spid="SP12">
          <bspacers>
            <bs drconf="0" drid="DR1" spid="SP10"/>
          </bspacers>
          <fspacers>
            <fs drconf="0" drid="DR1" spid="SP15"/>
          </fspacers>
        </cspacer>
      </contig>
    </assembly>
  </group>
</crispr>
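crisprtools is the intended way to work with these files, but the structure is simple enough to read with any XML library. The sketch below uses Python's standard `xml.etree.ElementTree` on a trimmed copy of the example above; the element and attribute names come from that example, while the function name and the shape of the returned summary are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example file: one group, one direct repeat, one spacer.
SAMPLE = """<crispr version="1.1">
  <group drseq="AGTTGGGATGTTTCCAATGTGACTAATATGAGAG" gid="G11">
    <data>
      <drs>
        <dr drid="DR1" seq="AGTTGGGATGTTTCCAATGTGACTAATATGAGAG"/>
      </drs>
      <spacers>
        <spacer cov="1" seq="GAATGTTTGGGCATAGTGAATTCAATCAAAATATTGGC" spid="SP12">
          <source soid="SO2"/>
        </spacer>
      </spacers>
    </data>
  </group>
</crispr>"""


def summarize_groups(root):
    """Collect the group ID, direct repeat and spacers of every CRISPR group."""
    summary = []
    for group in root.findall("group"):
        spacers = [
            (spacer.get("spid"), int(spacer.get("cov")), spacer.get("seq"))
            for spacer in group.findall("./data/spacers/spacer")
        ]
        summary.append({
            "gid": group.get("gid"),
            "drseq": group.get("drseq"),
            "spacers": spacers,
        })
    return summary


groups = summarize_groups(ET.fromstring(SAMPLE))
for group in groups:
    print(group["gid"], group["drseq"], len(group["spacers"]), "spacer(s)")
```

For a real run, `ET.parse("crass.crispr").getroot()` would replace the `ET.fromstring(SAMPLE)` call.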
The graph shown above is an image rendered using Graphviz from one of the
.gv files generated by Crass. Each spacer is coloured based on its coverage, on a scale from blue to red. In the top left-hand corner there is a zoomed-in view of two nodes in the graph: the circle is a spacer and the diamond is a flanking sequence. They are labelled using a four-field scheme, where each field is separated by an "_" character. The fields are:
nodeType will be either
fl for spacer or flanker. The next three fields are the unique identification number, coverage of each flanker or spacer followed by the contig (linear segment) that the node was found in.
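Splitting a node label into its four fields is straightforward. The sketch below is an assumption-laden illustration: the label `fl_3_7_2` is hypothetical (only the `fl` prefix is taken from the text), the field names are made up, and every field is kept as a string because the exact format of each one is not specified here.

```python
from typing import NamedTuple


class NodeLabel(NamedTuple):
    node_type: str  # e.g. "fl" for a flanker
    node_id: str    # unique identification number
    coverage: str   # coverage of the spacer or flanker
    contig: str     # contig (linear segment) the node was found in


def parse_node_label(label):
    """Split a four-field, underscore-separated node label."""
    fields = label.split("_")
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields separated by '_', got {len(fields)}: {label!r}")
    return NodeLabel(*fields)


# Hypothetical label, used only to show the field order.
node = parse_node_label("fl_3_7_2")
print(node.node_type, node.node_id, node.coverage, node.contig)
```

The last field is the one needed later on, when choosing which segments to hand to the assembler.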
From the image of the graph above it's pretty obvious that this CRISPR has multiple strains. If you would like to return the linear sequence of one or many of these strains it is possible to use the
crass assembler. The user (you) will need to tell Crass which strain in the graph needs to be assembled. For this you will need to give crass a list of linear segments through the graph. Each node on the graph above has a segment number associated with it; it is the fourth field in the naming scheme.
The arrangement of spacers produced by Graphviz can sometimes be a little confusing (like in the graph above). Unfortunately, Graphviz will only give you a static image. Other programs such as Gephi produce interactive graphs that you can manipulate, which gives you a better idea of the strain variation and of which segments to assemble together.
$ crass-assembler [--velvet | --cap3] -x crass.crispr -g NUM -s s1,s2,s3,s4,s5 -i DIR
    --velvet    Use the wrapper for the velvet genome assembler
    --cap3      Use the wrapper for the cap3 genome assembler
    -x          The crispr file produced by Crass containing the CRISPR you would like to assemble
    -g          The group number of the CRISPR produced by Crass that you would like to assemble
    -s          A comma separated list of segments from the group that you would like to assemble
    -i          The directory containing the output results from crass
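If you assemble several strains, it can help to build the command line programmatically. The sketch below only constructs and prints the command; it does not run anything. The flags are the ones listed in the usage text, while the function name, the group number 11, the segment numbers and the `crass_out` paths are hypothetical values chosen for illustration.

```python
import shlex


def build_assembler_command(crispr_file, group, segments, crass_dir, assembler="velvet"):
    """Build a crass-assembler argument list from the options in the usage text."""
    if assembler not in ("velvet", "cap3"):
        raise ValueError(f"unsupported assembler wrapper: {assembler!r}")
    return [
        "crass-assembler",
        f"--{assembler}",
        "-x", crispr_file,
        "-g", str(group),
        "-s", ",".join(str(segment) for segment in segments),
        "-i", crass_dir,
    ]


# Hypothetical values: group 11, segments 1-3, results stored in crass_out.
command = build_assembler_command("crass_out/crass.crispr", 11, [1, 2, 3], "crass_out")
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once you are happy with it.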
Once you've figured out which segments in the graph go together to form individual strains, you can pass the corresponding segment numbers to the crass assembler along with the group number. The crispr file produced by Crass stores which spacers were found in which reads, so the assembler can extract the reads for each of these spacers and pass them along to an external assembly algorithm. At this point in time the only external programs supported are