Overview:

The first step in processing raw data from 16S Illumina libraries is to split the multiplexed libraries, trim the sequences to the same length and filter for quality. I've found that trimming to the same length is best, because the same sequence might cluster differently depending on its length. I've also included the parameters that I've found most useful. The script also makes a first-pass OTU classification at 97% against a reference database (closed-reference OTU picking). This is good for taking a first-pass look at your data.
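To make the "same length" idea concrete, here is a minimal sketch in shell. It is not the step that RunQiime.csh runs; the function name trim_fastq and the example length are hypothetical, and it assumes a plain 4-lines-per-record fastq stream. Reads shorter than the target length are dropped so that every kept sequence has exactly the same length.

```shell
# Hypothetical helper (not part of RunQiime.csh): trim every read in a
# 4-line-per-record fastq stream to exactly LEN bases. Reads shorter than
# LEN are dropped so that all kept sequences have the same length.
trim_fastq() {
    len="$1"
    awk -v len="$len" '
        NR % 4 == 1 { header = $0 }   # @read_id line
        NR % 4 == 2 { seq = $0 }      # bases
        NR % 4 == 3 { plus = $0 }     # separator line
        NR % 4 == 0 {                 # quality string: record is complete
            if (length(seq) >= len) {
                print header
                print substr(seq, 1, len)
                print plus
                print substr($0, 1, len)
            }
        }
    '
}

# Example use (file names are placeholders):
#   trim_fastq 76 < reads.fastq > reads.trimmed.fastq
```

The quality string is cut to the same length as the bases so the record stays a valid fastq entry.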

Process:

The easiest way to process fastq files from 16S Illumina data is to use Qiime. Follow the link for a tutorial:

http://qiime.org/tutorials/tutorial.html

Qiime is installed on the darwin cluster (beagle) and can be loaded with the command:

module add qiime-default

Only the dependencies required for the default Qiime pipeline are loaded. Other tools might not be available.

To process the data, I've written a script that automates the process. The script needs to be edited to set the path to your solexa file and the other variables described below.

This can be submitted to the cluster with the following command (the RunQiime.csh file should be attached):

qsub -cwd RunQiime.csh

Below I've elaborated on what each of the variables means:

#define your file paths
#where is the fastq file you want to process
SOLFILEF=

This is the path to your solexa file as a fastq file.

#Where is the associated mapping file
MAPFILE=

The mapping file is expanded on in the Qiime webpage tutorial above.

#the oligo file
OLIGO=

This is a file required by mothur to remove the primer and diversity region from the library construct. Mothur looks for the exact sequence, and only keeps sequences in which the primer sequence is found. As a note, if you are using Phusion or another proof-reading polymerase, mismatches in the primer site within 4 bases of the 3' end are reverted to the template sequence (actually, NEB says that it will change anything within 10 bp of the 3' end). Reads with reverted bases no longer match the primer exactly, so they would be discarded at the trim.seqs step. I typically change the last 4 bases at the 3' end of the primer to N's to be safe. Filtering out sequences that don't match the primer sequence has been shown to improve the quality of the data (Zhou_2011.pdf).
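The N-masking described above can be done by hand, but a small helper avoids miscounting bases. This is only a sketch: the function name mask_primer_3prime is hypothetical, and the primer in the example is a made-up placeholder, not the primer used in these libraries.

```shell
# Hypothetical helper: replace the last N bases (the 3' end) of a primer
# with "N". N defaults to 4, matching the note above.
mask_primer_3prime() {
    printf '%s\n' "$1" | awk -v n="${2:-4}" '{
        keep = length($0) - n          # number of 5-prime bases left unchanged
        if (keep < 0) keep = 0         # primer shorter than n: mask everything
        out = substr($0, 1, keep)
        for (i = keep; i < length($0); i++) out = out "N"
        print out
    }'
}

# Placeholder primer, for illustration only:
mask_primer_3prime ACGTACGTACGT 4   # prints ACGTACGTNNNN
```

The masked sequence is what you would then put in the oligo file in place of the original primer.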

#where the programs are called from
BIN=/data/spacocha/bin

This is a folder where you should put the following files:

fastq2Qiime_barcode.pl

revert_names_mothur.pl

#where you want to put the output of the analysis
UNIQUE=

It's a unique name for this analysis. It can include folders (which must exist), but the final part of the name is used as a prefix for all output files.

#reference fasta file (latest greengenes OTUs)

REFERENCEFA=

These can be downloaded from the top-right corner of the Qiime blog homepage.

#reference taxonomies

REFERENCETAX=

This will be in the same download as above, only in the taxonomy folder.
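Since a mistyped path usually only shows up after the job has been queued, it can help to check the variables before running qsub. The following is a minimal sketch, assuming a Bourne-style shell (the VAR= assignments above are written in that syntax). The function check_inputs is hypothetical and is not part of RunQiime.csh; it only tests the variables and helper scripts named in this section.

```shell
# Hypothetical pre-flight check, not part of RunQiime.csh.
# Returns 0 if every input named above is in place, 1 otherwise.
check_inputs() {
    status=0
    # Input files that must exist
    for f in "${SOLFILEF:-}" "${MAPFILE:-}" "${OLIGO:-}" \
             "${REFERENCEFA:-}" "${REFERENCETAX:-}"; do
        if [ -z "$f" ] || [ ! -f "$f" ]; then
            echo "missing input file: '$f'" >&2
            status=1
        fi
    done
    # Helper scripts that should be in the BIN folder
    for script in fastq2Qiime_barcode.pl revert_names_mothur.pl; do
        if [ ! -f "${BIN:-}/$script" ]; then
            echo "missing helper script: ${BIN:-}/$script" >&2
            status=1
        fi
    done
    # UNIQUE may include folders, which must already exist
    if [ -z "${UNIQUE:-}" ]; then
        echo "UNIQUE is not set" >&2
        status=1
    elif [ ! -d "$(dirname "$UNIQUE")" ]; then
        echo "output folder does not exist: $(dirname "$UNIQUE")" >&2
        status=1
    fi
    return "$status"
}

# Example use, after setting the variables:
#   check_inputs && qsub -cwd RunQiime.csh
```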

Output:

The output files that are created are the following (where the UNIQUE would be replaced by the variable defined above):

UNIQUE_output/ucrC/seqs_otus.mat
-This is a matrix of your OTUs, mapped to the reference.

UNIQUE_output/seqs.trim.names.fasta
-These are the trimmed, cleaned fasta sequences that were clustered.

UNIQUE_output/split_library_log.txt
-This contains stats about how your libraries were processed.
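Once the job has finished, a quick way to confirm that it ran to completion is to check that the three files listed above exist and are not empty. This is a sketch only: the function expected_outputs is hypothetical, and it simply builds the paths from the UNIQUE prefix as described in this section.

```shell
# Hypothetical helper: print the output paths listed above for a given
# UNIQUE prefix, one per line.
expected_outputs() {
    prefix="$1"
    printf '%s\n' \
        "${prefix}_output/ucrC/seqs_otus.mat" \
        "${prefix}_output/seqs.trim.names.fasta" \
        "${prefix}_output/split_library_log.txt"
}

# Example use: report any output that is missing or empty.
#   expected_outputs "$UNIQUE" | while read -r f; do
#       [ -s "$f" ] || echo "missing or empty: $f"
#   done
```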

Additional Notes:

Please let me know if there is anything that you don't understand about this process. I am happy to help. I think everything that you need is there, but I might have missed something.