Blog

SmileTrain: wrapper to get from raw data to reference based OTUs with one command

Introduction: Scott's script consolidates all of the OTU-calling scripts and submits them to the cluster with one command.

(smile_train: gets you where you want to go, and you're happy)

It works on coyote and maybe at the Broad, but possibly not on other clusters.

You should install it in lib (bin gets crowded); each person should install their own copy.

Make a lib directory unless it already exists:

mkdir lib

Clone this into the lib folder you just made using git:

git clone https://github.com/almlab/SmileTrain 

To get some information, go to the wiki and follow the directions:

https://github.com/almlab/SmileTrain/wiki

Alter train.cfg (you will have to point it to usearch, etc.) so the following entries use your own paths (for example, I changed these):

[User]

username=spacocha

tmp_directory=/net/radiodurans/alm/spacocha/tmp

library=/net/radiodurans/alm/spacocha/lib/SmileTrain

bashrc=/net/radiodurans/alm/spacocha/lib/SmileTrain/bashrcs/coyote.sh

Then, make the tmp_directory:

mkdir  /net/radiodurans/alm/spacocha/tmp

You can get a detailed description of any script by using --help.

This will source the bashrc script that Scott made.

Submit from the head node (when you log on), and if the job runs long, you're going to have to use screen:

ssh coyote

screen

Now you are inside screen (screen -ls will list the sessions you are running).

You can name the screens:

screen -S SPtest

You can detach and keep the session running:

RUN COMMAND

Ctrl-A D

(or type man screen to get information)

and to reattach a screen:

screen -R SPtest

but then you have to actually stop a session by typing exit from within screen:

exit

PCR, real-time PCR, primers, column and SPRI clean-up of reactions


 

Preparation of PCR primer stocks and working solutions

 

stocks

spin freeze-dried stocks, 1min full speed

add sterile H2O (molecular biology grade) to a final concentration of 100µM

working solutions

485µl sterile H2O (molecular biology grade)

+ 15µl primer stock

--> 3µM  

 

DNA column purification (Qiagen PCR clean-up / Qiaquick Gel Extraction)

Loading

mix reaction + 5Vol PBI buffer in Epi

place column in collection tube

load on column

spin 1min, full speed, RT

discard flowthrough

 

alternatively for gel extraction of DNA bands

cut bands on blue light table (DeLong lab) or on UV transilluminator (Thompson lab) with a clean razor blade

transfer into Epi

weigh Epis (~1g) and gel slices

+3Vol/weight (300µl/100mg) GC buffer

incubate tubes at 50˚C, 10min, vortex gently every 2min

+ 1 Vol/weight (100µl/100mg) isopropanol (improves yield, especially for fragments below 500bp or above 4kb)

mix by inverting tube

place column in collection tube

load on column (if volume is too big for the column, then load, spin, discard flowthrough, and load the rest)

spin 1min, full speed, RT

discard flowthrough

 

Washing

+750µl PE buffer (seal bottle tight to avoid EtOH evaporation)

spin 1min, full speed, RT

discard flowthrough

place column back in empty collection tube

Drying

spin (to dry) 30sec, full speed, RT

turn column in collection tube by 180˚

spin (to dry) 1min, full speed, RT

discard flowthrough and collection tube

place column in new Epi

dry open column under laminar air flow for 2min

Elution

+35-50µl EB or sterile H2O (molecular biology grade)

incubate at least for 5min

spin (to elute DNA) 30sec, full speed, RT

turn column in Epi by 180˚

spin (to elute DNA) 1min, full speed, RT

discard column and store Epi with DNA

 

SPRI clean up and primer/dimer removal (Agencourt AMPure XP beads)

Preparations:

adjust PCR reaction to 50µl with EB

vortex SPRI beads 1600rpm, 10sec

aliquot 45µl of beads into one 1.5ml tube per library

Binding to beads

add 50µl PCR reaction to beads

mix by pipetting/vortex 1600rpm

incubate for 5-7min

separate on magnet for 2min

remove and discard SN while tube stays on magnet

Wash - removal of salts, enzymes and low molecular weight DNA

wash beads carefully twice with 70% EtOH while tube stays on magnet (do not disturb bead pellet)

incubate for 30sec

remove all SN

repeat

Dry

air dry on magnet for 15min

Elution

remove tube from magnet, add 20µl EB

vortex 1600rpm, 10sec

incubate at RT for 5min

separate on magnet for 2min

collect SN and transfer into new 1.5ml tube

 

Quantitative real-time PCR

Mastermix

    for 200µl (8x) reaction:

 

    10.525µl    H2O

    5µl           5x Phusion Pol buffer

    0.5µl        dNTP mix 10mM

    3.3+3.3µl  primers, 3µM

    2µl           template (try different 10 fold dilutions)

    0.125µl     SYBR Green I

    0.25µl        Pol (Phusion)

 

    prepare the reaction in PCR tubes (or a 96-well plate) with optical covers that fit the real-time PCR machine being used

 

cycle at         1) 98˚C, 20''

               2) 98˚C, 15''

               3) specific Annealing Temp˚C, 20''

               4) 72˚C, 20'' (for fragments shorter than 1kb)

               5) go back to step 2 45x

 

    always use at least 3 replicates per sample

    always include 3 replicates of a non-template (H2O) control

16S by hand library prep

 

16S By Hand Library Preparation

Materials:

●      Agencourt Ampure XP, A63881 (60mL, $300)

●      2 Roche LightCycler480 384-well plates

●      1:100 dilution of SYBR stock (Invitrogen S7563, 10,000x)

●      Step 1/ Initial QPCR Primers ( PE_16s_v4U515-F OR PE_16s_v4_Bar#,  PE_16s_v4_E786_R)

●      Step 2 primers ( PE-III-PCR-F, PE-IV-PCR-XXX)

●      Final QPCR primers (PE Seq F, PE Seq R)

●      HF Phusion (NEB, M0530L)

●      KAPA SYBR 2xMM for final QPCR

●      Invitrogen Super magnet (16 or 8 sample capacity)

Determination of  Step 1 Cycle Time and Sample Check:

Materials used:

○   Contents of MM

○   P200 multi-channel pipette

○    96 well QPCR plate (96 well for opticon stocked in lab)

○   Clear QPCR plate covers

Initial QPCR master mix (MM)

 

Reagent                     X1 RXN (uL)
H2O                         12.1
HF Buffer                   5
dNTPs                       0.5
PE16s_V4_U515_F (3uM)       2.5
PE16S_V4_E786_R (3uM)       2.5
Template                    2
SYBR green (1/100 dilu)     0.125
Phusion                     0.25

 

 

Run this step in duplicate or triplicate to best estimate proper cycling time

Initial QPCR Program (Opticon):

Heat:

98°C – 30 seconds

Amplify:

98°C – 30 seconds

52°C – 30 seconds

72°C – 30 seconds

Cool:

4°C - continuous

Use the Ct (bottom of curve, not mid-log) of the curves to determine dilutions for step 1 amplification (google docs, Illumina Library QPCR and Multiplexing)

Breakdown of QPCR amplification math (done to normalize each sample):

○   delta Ct = Sample Ct - lowest Ct in sample set

○   fold = 1.75^(delta Ct)

○   dilution needed = fold

○  note: input is 2uL per RXN, so the sample with the lowest Ct gets 2uL undiluted

 Please note – samples may fail due to too little material, too much material, or a poor reaction. It is recommended that failed samples be re-run before moving forward
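The normalization math above can be sketched in a few lines of Python, useful as a sanity check before pipetting (the 1.75 base and the formulas come from the protocol; the sample names and Ct values below are made up):

```python
def step1_dilutions(ct_values, base=1.75):
    """Fold dilution per sample, following the protocol math above:
    delta Ct = sample Ct - lowest Ct in the set; fold = base ** delta Ct.
    The sample with the lowest Ct gets fold 1 (2uL undiluted per RXN)."""
    lowest = min(ct_values.values())
    return {name: base ** (ct - lowest) for name, ct in ct_values.items()}

# Hypothetical Ct values for three samples:
print(step1_dilutions({"sampleA": 14.0, "sampleB": 16.0, "sampleC": 14.5}))
# sampleA is undiluted (fold 1.0); sampleB is diluted 1.75**2 ≈ 3.06-fold
```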

Library Preparation:

Step 1

Please Note: Samples are run as four 25uL reactions that are pooled at end of cycling

1st step Master Mix 25uL RXN (MM1)

 

Reagent                     X1 RXN (uL)
H2O                         12.25
HF Buffer                   5
dNTP                        0.5
PE16S_V4_U515_F (3uM)       2.5
PE16S_V4_E786_R (3uM)       2.5
Template                    2
Phusion                     0.25

 

 

16s Step 1 Program:

Heat:

98°C – 30 seconds

Amplify:

98°C – 30 seconds

52°C – 30 seconds

72°C – 30 seconds

Cool:

4°C - continuous

Run the amplification cycle number determined via QPCR (no more than 20 cycles allowed)

After cycling, pool the replicate reactions; you now have one 100uL reaction per sample

SPRI Clean Up

Materials used:

○   SPRI beads

○   70% EtOH

○   EB

○   Invitrogen super magnet

- Vortex cold AmpureXP beads, pool DNA from PCR tubes (~100uL)

- Aliquot 85.5uL beads into Epi’s – let equilibrate to RT

- Add DNA (take 95uL) + beads (85.5uL) = 180.5 uL

- Incubate 13’ @ RT

- Separate ON magnet 2’

- While ON magnet, remove/discard SN

- Wash beads 2x with 70% EtOH, 500uL each wash

- Air dry beads for 15-20’ on magnet

- Remove from magnet, elute in 40uL H2O, vortex to resuspend

- Incubate (at least 7’)

-  Separate on magnet 2’

- Collect 35-40 ul and save SN
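The bead volumes above correspond to a 0.9x bead-to-sample ratio (85.5uL beads for 95uL pooled DNA). A small helper makes the ratio explicit if you carry forward a different sample volume (a sketch; the 0.9x ratio is inferred from the volumes in this protocol):

```python
def spri_bead_volume(sample_ul, ratio=0.9):
    """Volume of AMPure XP beads for a given bead-to-sample ratio."""
    return round(sample_ul * ratio, 2)

print(spri_bead_volume(95))   # 85.5, as used above
```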

Sample Re-Aliquoting and Step 2

Please Note: Samples are run as four 25uL reactions that are pooled at end of cycling

2nd step Master Mix  25uL RXNs (MM2)

 

Reagents                    X1 RXN (uL)
H2O                         8.65
HF Buffer                   5
dNTPs                       0.5
PE-PCR-III-F (3uM)          3.3
PE-PCR-IV-XXX (3uM)         3.3
Template                    4
Phusion                     0.25

 

 

16s Step 2 Program:

Heat:

98°C – 30 seconds

Amplify:

98°C – 30 seconds

83°C – 30 seconds

72°C – 30 seconds

Cool:

4°C - continuous

Run 9 cycles of amplification

- After cycling, pool the replicate reactions; you now have one 100uL reaction per sample

SPRI Clean Up

Materials used:

○   SPRI beads

○   70% EtOH

○   EB

○   Invitrogen super magnet

- Vortex cold AmpureXP beads, pool DNA from PCR tubes (~100uL)

- Aliquot 85.5uL beads into Epi’s – let equilibrate to RT

- Add DNA (take 95uL) + beads (85.5uL) = 180.5 uL

- Incubate 13’ @ RT

- Separate ON magnet 2’

- While ON magnet, remove/discard SN

- Wash beads 2x with 70% EtOH, 500uL each wash

- Air dry beads for 15-20’ on magnet

- Remove from magnet, elute in 40uL H2O, vortex to resuspend

- Incubate (at least 7’)

-  Separate on magnet 2’

- Collect 35-40 ul and save SN

Final QPCR

Once you have a substantial number of your samples (or all of them) prepared, you can run a final QPCR to determine dilutions and volumes for multiplexing. This step also confirms that the library preparation was successful.

QPCR Master Mix (QPCR MM, 20uL RXN)

 

Reagents                    X1 RXN (uL)    X345 RXN (uL)
H2O                         7.2            2,484
PE Seq Primer – F (10uM)    0.4            138
PE Seq Primer – R (10uM)    0.4            138
KAPA SYBRgreen MM           10             3,450
Template                    2              -
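The X345 column is just the per-reaction volume times 345, with template excluded since it is added per well. A quick check in Python:

```python
def scale_master_mix(per_rxn_ul, n_rxns):
    """Scale per-reaction master mix volumes to n_rxns reactions."""
    return {reagent: vol * n_rxns for reagent, vol in per_rxn_ul.items()}

qpcr_mm = {"H2O": 7.2, "PE Seq F": 0.4, "PE Seq R": 0.4, "KAPA SYBR MM": 10}
print(scale_master_mix(qpcr_mm, 345))
# H2O: 2484, each primer: 138, KAPA: 3450 -- matching the X345 column
```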

 

 

Final QPCR Program (Opticon)

Heat:

95°C – 5 minutes

Amplify:

95°C – 10 seconds

60°C – 20 seconds

72°C – 30 seconds

Melting Curve:

95°C – 5 seconds

65°C – 1 minute

97°C - continuous

Cool:

40°C – 10 seconds

Run 35 cycles of amplification

Use mid-log phase of curves to determine volumes for multiplexing (google docs, Illumina Library QPCR and Multiplexing)

Please note – samples may fail due to too little material, too much material, or a poor reaction. It is recommended that failed samples be re-run before moving forward

Breakdown of QPCR multiplexing math (done to normalize each sample):

○  delta Ct = Sample Ct - lowest Ct in sample set

○  fold = 1.75^(delta Ct)

○  ratio = 1/fold

○  volume to mix b/c of ratio = X*ratio (X = minimum desired volume per sample)

○  how to dilute = fold

○  note: the sample with the lowest Ct gets X uL undiluted in the final multiplex; X can be raised or lowered to accommodate the needed volumes of the other samples
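Following the multiplexing math as written above, the per-sample volumes can be computed like this (a sketch with made-up Ct values; the 1.75 base and the formulas are taken verbatim from the protocol):

```python
def multiplex_volumes(ct_values, x_ul=2.0, base=1.75):
    """Per-sample volume for the final multiplex, as described above:
    delta Ct = sample Ct - lowest Ct; fold = base ** delta Ct;
    ratio = 1 / fold; volume = x_ul * ratio.
    The lowest-Ct sample contributes x_ul undiluted."""
    lowest = min(ct_values.values())
    return {name: x_ul / (base ** (ct - lowest))
            for name, ct in ct_values.items()}

# Hypothetical final-QPCR Ct values for three libraries:
print(multiplex_volumes({"lib1": 12.0, "lib2": 13.0, "lib3": 12.0}))
```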

 

Sample Multiplexing and Submission for Sequencing:

- Once samples have been multiplexed, aliquot ~20uL of the final mix and submit it to the BioMicro Center for sequencing

 

 

 

 

 

Running AdaptML

Introduction

This is the general outline for running AdaptML. This guide was generated by Sarah Preheim, not Lawrence David, so keep that in mind. It is meant to complement the user guide provided on the official web page, but it also includes some additional information for non-experts.


Download

Download AdaptML from the Alm Lab website and install as directed:

http://almlab.mit.edu/adaptml.html


Input tree

An unrooted tree is used to run AdaptML. I usually make trees with PhyML for 2,000 or fewer sequences:

phyml_v2.4.4/exe/phyml_linux all_u.phy 0 i 1 100 GTR e e 4 e BIONJ y y


Running AdaptML

Here is an example of how to run AdaptML:

python ../../latest_adaptml/habitats/trunk/AdaptML_Analyze.py tree=./All_simple_7_ur_particle.phy_phyml_tree.txt2 hab0=16 outgroup=ECK_1 write=./ thresh=0.025

 

Finding Stable Habitats

Although a set of habitats was predicted, you want to determine how often those same habitats would be predicted across 100 different iterations of AdaptML.

For example:

Now standardize with 100 runs (tcsh):

foreach f ( 0 1 2 3 4 5 6 7 8 9 )

foreach? foreach d (0 1 2 3 4 5 6 7 8 9)

foreach? mkdir ${f}${d}_dir

foreach? python ../../latest_adaptml/habitats/trunk/AdaptML_Analyze.py tree=./All_simple_7_ur_particle.phy_phyml_tree.txt2 hab0=16 outgroup=ECK_1 write=./${f}${d}_dir/ thresh=0.025

foreach? perl parse_migration3.pl ./${f}${d}_dir/habitat.matrix > ./${f}${d}_dir/habitat.matrix.tab

foreach? end

 

Make a list of all of the migration.tab files

ls *_dir/habitat.matrix.tab > migration_list.txt2

Then use temp_dist2.pl to get the habitat number distribution.

perl ~/bin/temp_dist2.pl migration_list.txt2

4    5

5    21

6    53

7    20

8    1
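The summary printed by temp_dist2.pl (habitat count, then number of runs) is a simple tally; assuming each habitat.matrix.tab yields one habitat count per run, it can be reproduced like this:

```python
from collections import Counter

def habitat_distribution(counts_per_run):
    """Tally how many runs predicted each habitat number."""
    return dict(sorted(Counter(counts_per_run).items()))

# Habitat counts from 100 hypothetical AdaptML runs:
runs = [4] * 5 + [5] * 21 + [6] * 53 + [7] * 20 + [8] * 1
print(habitat_distribution(runs))
# {4: 5, 5: 21, 6: 53, 7: 20, 8: 1} -- the distribution shown above
```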

 

The original migration.tab file in ./040609_dir/ has 6 habitats, so just use that to see the percent matching.

Then use get_stable_habitats6.pl; you might have to change the percent matching so it's as high as it can go before it dies with double hits.

Results with 0.024:

11    39

3    100

13    100

10    80

12    100

14    99

 

Make clusters

Once you have found a set of stable habitats you want to go with, then make the clusters. This example is for the tree ./example_noroot.phy_phyml_tree.txt3, outgroup ECK_1 using 9900.file to make the final clusters (using python 2.5.1 and you may have to change the paths to point to the AdaptML software):

python ~/AdaptML_dir/latest_AdaptML/habitats/trunk/AdaptML_Analyze.py tree=./example_noroot.phy_phyml_tree.txt3 hab0=16 outgroup=ECK_1 write=./example_dir thresh=0.05

perl ~/bin/migration2color_codes.pl ./example_dir/migration.matrix color_template.txt > ./example_dir/color.file

mkdir ./example_dir/rand_trials

python ~/AdaptML_dir/latest_AdaptML/clusters/getstats/rand_JointML.py habitats=./example_dir/migration.matrix mu=./example_dir/mu.val tree=./example_noroot.phy_phyml_tree.txt3 outgroup=ECK_1 write=./example_dir/rand_trials/

python ~/AdaptML_dir/latest_AdaptML/clusters/getstats/GetLikelihoods.py ./example_dir/rand_trials/ ./example_dir/

python ~/AdaptML_dir/latest_AdaptML/clusters/trunk/JointML.py habitats=./example_dir/migration.matrix mu=./example_dir/mu.val  color=./example_dir/color.file tree=./example_noroot.phy_phyml_tree.txt3 write=./example_dir/ outgroup=ECK_1  thresh=./example_dir/9900.file

 

Processing overlapping MiSeq 16S rRNA libraries

Alm lab Protocol for processing overlapping 16S rRNA reads from a MiSeq run

Experimental design

The specifics of the sequencing set-up and molecular construct will determine exactly how the data needs to be sequenced and processed. There are a few different designs that the Alm lab has set up:

1.) (Standard) Multiplexing different samples together to be sequenced in one lane of Illumina, marking each unique sample with a unique barcode on the reverse primer of step 2 that is read during the indexing read. This is common for up to 96 samples (or 105 including additional barcodes that are not in the 96-well format). The sequencing should be done, not with the standard Illumina indexing primer, but with the reverse complement of the 2nd read sequencing primer. This is a custom barcode that should be included in the sequencing set-up. See sequencing section below for protocol.

2.) Multiplexing multiple different plates of samples together using a barcode located 5' to the primer used in genome amplification (typically U515) and a reverse barcode that is read during the indexing read. This is not typical for the Alm lab MiSeq protocol, since getting only about 12-25 million reads from MiSeq is sufficient for about 100 samples, not more. However, it is a possible scenario for samples which do not require high coverage.

3.) Mixing both genome sequencing and 16S rRNA amplicon sequencing together in one lane. Adding genome library preps to 16S amplicon lanes improves the quality of the base calling by adding diversity without losing much to phiX sequencing. The genome library constructs typically contain barcodes in the forward and reverse read sequences, and do not typically have an indexing read associated with them. However, adding them to a lane which does have an index read is fine.

4.) An experimental set-up using both forward and reverse orientation of the 16S rRNA among different samples, and staggering the diversity region 5' of both primers used in genome amplification allows for sufficient base diversity to run 16S rRNA libraries without wasting phiX data. In this case, half of the samples begin by sequencing from the U515 primer in the forward read, and half begin by sequencing from the U786 from the forward read. Additionally, the number of bases before the primer sequence varies from 4-9 bp.

Sequencing

Sequencing our construct on MiSeq is slightly different from standard Illumina MiSeq sequencing. Load the provided sample sheet (which arbitrarily specifies a 250 paired-end run with an 8nt barcode read) and spike 15uL of the anti-reverse BMC index primer @ 100uM (5' AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG 3') into tube 13 of the cartridge. This should provide three reads (forward, index and reverse) of 250, 8 and 250 bp, respectively.

(The following is courtesy of the Shapiro lab):

To generate an indexing file, you have to change the setup of MiSeq Reporter, because by default MSR doesn't generate a barcodes_reads.fastq.
In order to change that:
First, turn off the MSR service: Task Manager, Services tab, right-click on MiSeq Reporter and click Stop.
Use Notepad to edit the MiSeqReporter.exe.config file, which can be found in C:\Illumina\MiSeq Reporter

The following needs to be included in the top portion of the file (the <appSettings> section)

<add key="CreateFastqForIndexReads" value="1" />

Save then close.

You will then want to restart the service: right-click on the Windows task bar, select "Start Task Manager", select the "Services" tab, find MiSeq Reporter in the list, and then stop and start the service.

You can re-queue your run using the sample sheet WITH the index information on the sample. In our case, we used a very simple sample_sheet with one index like ATATATAT.

De-multiplexing

You can demultiplex at various stages. If there are multiple unrelated projects in the same run, I will pull out all of the reads that map to barcodes for only one project, so that I don't have to process extra data. You also have the option of removing unwanted barcodes at the QIIME split_libraries_fastq.py step by providing a mapping file containing only the barcodes you want, but you may waste time overlapping them if there are a lot of them. Do not use the following step if you will eventually work with all of the data; only use it if you never need to work with the other data, since it doesn't make sense to process it at all.

Run this program to parse out sequences from the raw data according to the following order:

1.) Single barcodes that are unique in the sample. These are possibly control samples or extra barcodes that were done uniquely, and the barcode is the only piece of information indicating which sample the sequence came from.

2.) Next, it looks for the presence of the forward and reverse barcodes in the forward and reverse read. These are typically from genome sequences.

3.) Finally, it looks for paired data, pulling out reads that have both the forward barcode and the indexing barcodes that match input samples.

All other samples that do not match are discarded.

Program:

perl parse_Illumina_multiplex2.pl <Solexa File1> <Solexa File2> <mapping> <output_prefix>

<mapping> input file should have the following fields (tab delimited, here's an example file):

Barcode construction, output name, forward barcode name, forward barcode seq, forward barcode orientation, index barcode name, index barcode seq, index barcode orientation, reverse barcode name, reverse barcode seq, reverse barcode orientation

Samples with the same output name will be in the same file. Barcode construction must be one of the following exact fields: single, double or forbar. Use single for option 1 above (single barcodes identify the samples), the 2

The output will be forward and reverse files labeled output_prefix.output_name.1 and output_prefix.output_name.2, respectively.

These can be used as the fastq files in downstream processes.
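For reference, the 11 tab-delimited mapping columns listed above can be read like this (a sketch; the field order is taken from the description, and details of the actual Perl script may differ):

```python
MAPPING_FIELDS = [
    "construction", "output_name",
    "fwd_name", "fwd_seq", "fwd_orient",
    "index_name", "index_seq", "index_orient",
    "rev_name", "rev_seq", "rev_orient",
]

def read_mapping(path):
    """Parse the tab-delimited barcode mapping file into one dict per row."""
    rows = []
    with open(path) as fh:
        for line in fh:
            if line.strip():
                rows.append(dict(zip(MAPPING_FIELDS,
                                     line.rstrip("\n").split("\t"))))
    return rows
```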

You can also use just the mapping files that would be the input to QIIME (not in the example above, but the default for QIIME), and the index read generated as previously described, if you just want to limit the data to a set of barcodes in your mapping file. In that case, run the following command:

perl parse_Illumina_multiplex_from_map_index.pl <Solexa File1> <Solexa File2> <mapping> <output_prefix> <index read>

The fastq files will contain only those found in your mapping file and can be used in downstream analysis.

Overlapping the reads

You may have sufficient length to overlap the forward and reverse reads to create a longer sequence. This process is time consuming, but it gains phylogenetic resolution and can be useful for many applications. We use SHE-RA, which was created to make a sophisticated calculation of quality for an overlapped base, given the quality of each overlapped base and whether or not they match. Other software exists (and is faster), but will do multiple things at once, including trimming the sequences for quality, and will not provide as good an estimate of the quality of the overlapped bases. If other programs are used, it might be necessary to de-multiplex the samples another way afterwards. With SHE-RA, we overlap paired-end sequences, then re-generate the fastq files to use with QIIME split_libraries_fastq.py.

First, divide up your samples into about 1 million reads per file, forward and reverse reads separately (SHERA has code for parallelization, but I couldn't get it to work).

general form-

perl split_fastq_qiime_1.8.pl <read> <number needed> <output prefix>

Example-

perl ~/bin/split_fastq_qiime_1.8.pl 131001Alm_D13-4961_1_sequence.fastq 100 131001Alm_D13-4961_1_sequence.split

perl ~/bin/split_fastq_qiime_1.8.pl 131001Alm_D13-4961_2_sequence.fastq 100 131001Alm_D13-4961_2_sequence.split
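If split_fastq_qiime_1.8.pl isn't at hand, an equivalent split is easy to sketch in Python (this round-robin assignment is an assumption about what the Perl script does; record order within each chunk is preserved):

```python
def split_fastq(in_path, n_files, out_prefix):
    """Split a fastq (4 lines per record) into n_files round-robin chunks,
    written as <out_prefix>.<i>.fastq."""
    outs = [open(f"{out_prefix}.{i}.fastq", "w") for i in range(n_files)]
    record, rec_idx = [], 0
    with open(in_path) as fh:
        for line in fh:
            record.append(line)
            if len(record) == 4:           # one full fastq record
                outs[rec_idx % n_files].writelines(record)
                record, rec_idx = [], rec_idx + 1
    for out in outs:
        out.close()
```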

Then, overlap each of the 100 files with SHERA, where ${PBS_ARRAYID} is the process number for parallel processing (remember to change the lib path in the code of concatReads.pl so it can run from any folder: with a text editor like emacs, change the second line to the directory containing the .pm files, and save):

general form-

perl concatReads.pl fastq_1 fastq2 --qualityScaling sanger

example of actual command-

perl concatReads.pl 131001Alm_D13-4961_1_sequence.split.${PBS_ARRAYID}.fastq 131001Alm_D13-4961_2_sequence.split.${PBS_ARRAYID}.fastq --qualityScaling sanger

Filter out the bad overlaps from the fa and quala generated with SHERA:

perl filterReads.pl 131001Alm_D13-4961_1_sequence.split.${PBS_ARRAYID}.fa 131001Alm_D13-4961_1_sequence.split.${PBS_ARRAYID}.quala 0.8

Use mothur to re-generate the fastq files:

mothur "#make.fastq(fasta=131001Alm_D13-4961_1_sequence.split.${PBS_ARRAYID}.filter_0.8.fa, qfile=131001Alm_D13-4961_1_sequence.split.${PBS_ARRAYID}.filter_0.8.quala)"

Now, you will either have to fix the index file to contain only the reads in your file (if the index read is a separate file):

perl fix_index.pl 131001Alm_D13-4961_1_sequence.split.${PBS_ARRAYID}.filter_0.8.fastq 131001Alm_D13-4961_3_sequence.fastq > 131001Alm_D13-4961_3_${PBS_ARRAYID}.filter_0.8.fastq

Or, if you have to generate it from the header (if the index is already present in the header):

perl fastq2Qiime_barcode2.pl 131001Alm_D13-4961_1_sequence.split.${PBS_ARRAYID}.filter_0.8.fastq > 131001Alm_D13-4961_1_sequence.split.index.filter_0.8.fastq

This applies when the fastq headers look like the example below: the script pulls out the longest string of base letters (ATGCKMRYSWBVHDNX) after the #, in this case TGGGACCT, and creates a fake quality string using the lowercase of each barcode letter:

@MISEQ:1:2106:21797:11095#TGGGACCT_204bp_199.2_0.90

TGTAGTGCCAGCCGCCGCGGTAATACGTAGGTGGCGAGCGTTGTTCGGATTTATTGGGCGTAAAGGGTCCGCAGGGGGTT
CGCTAAGTCTGATGTGAAATCCCGGAGCTCAACTCCGGAACTGCATTGGAGACTGGTGGACTAGAGTATCGGAGAGGTAA
GCGGAATTCCAGGTGTAGCGGTGGAATGCGTAGATATCTGGAAGAACACCGAAAGCGAAGGCAGCTTACTGGACGGTAAC
TGACCCTCAGGGACGAAAGCGTGGGGATCAAACAGGATTAGAAACCCCTGTAGTCC

Result is:

@MISEQ:1:2106:21797:11095#TGGGACCT_204bp_199.2_0.90

TGGGACCT

+

tgggacct
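In Python, what fastq2Qiime_barcode2.pl is described to do looks roughly like this (a sketch based on the description above, not on the script itself):

```python
import re

BASES = r"[ATGCKMRYSWBVHDNX]+"

def index_record(header):
    """Build a fake index-read fastq record from a SHE-RA style header:
    take the longest run of base letters after '#' as the barcode, and use
    its lowercase form as the per-base quality string."""
    after_hash = header.split("#", 1)[1]
    barcode = max(re.findall(BASES, after_hash), key=len)
    return f"{header}\n{barcode}\n+\n{barcode.lower()}\n"

print(index_record("@MISEQ:1:2106:21797:11095#TGGGACCT_204bp_199.2_0.90"))
```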

De-multiplex

Now the re-created fastq and index read can be used as normal with QIIME or other software of your choice:

split_libraries_fastq.py  -i 131001Alm_D13-4961_1_sequence.split.${PBS_ARRAYID}.filter_0.8.fastq -m mapping_file.txt  -b 131001Alm_D13-4961_1_sequence.split.${PBS_ARRAYID}.index.filter_0.8.fastq --barcode_type 8 --rev_comp_barcode --min_per_read_length .8 -q 10 --max_bad_run_length 0 -o unique_output_${PBS_ARRAYID} --phred_offset 33

This will create a seqs.fna file which can be used in downstream analysis.

Alm lab sequencing effort

Please fill in as appropriate, also on google docs spreadsheet that Carrie maintains.

Listed below are all of the sequencing runs and what was sequenced:

Nov 18  2012 121114Alm
Feb 22  2013 121217Alm
Mar 14  2013 130308Alm
Apr 12  2013 130320Alm

?

May 16  2013 130423Alm

?

Jul 24 18:59 130719Alm -Sequenced by Sarah Preheim with Julie Khodor for the Brandeis high school Genesis (?) program

071813SP1_map.xlsx

Sep  2 20:49 130823Alm- Sequenced by Sarah Preheim with all environmental samples

130823Alm_mapping_file.xlsx

Oct  7 10:47 131001Alm - Sequenced by Sarah Preheim with all environmental samples and one William's pond sample

131001Alm_mapping.xlsx

Oct 17 15:05 131011Alm - Sequenced by Sarah Preheim with all environmental samples testing the staggered primers (?)

131011Alm_mapping_file.xlsx

Dec  2 12:04 131114AlmA -Sequenced by Sarah Preheim with environmental samples, Spence TR samples, testing the staggered and flipped primers

Mapping_files_112613.xlsx

Nov 25 12:23 131114Alm

?

Dec 13 23:05 131126Alm - Sequenced by Sarah Preheim with environmental samples, and mouse IGA samples with staggered and flipped primers

131126Alm_mapping.xlsx

Materials Needed:

-       Bioruptor (Parson's 3rd floor, between the Polz and Chisholm labs)

-       Quick blunting and ligation kit (NEB, E0542L 100RXNs $473.60)

-       10mM dNTP mix ( NEB, N0447L $216.00)

  • also need a 1:10 dilution for 1mM dNTPs

-       IGA adapter A# (10uM working solution)

-       IGA adapter B#-PE (10uM working solution)

-       SPRI beads (Beckman Coulter, A63882, 450mL $4,100)

-       BST Polymerase large fragment (NEB, M0275S 1,600U $49.60)

-       IGA-PCR-PE-F primer (40uM working solution)

-       IGA-PCR-PE-R primer (40uM working solution)

-       Phusion, with HF buffer (NEB, M0530L 500U $329.60 )

-       SybrGreen (Invitrogen S7563 500uL 10,000x $235.00 )

-       Qiaquick PCR cleanup column (Qiagen, 50 columns: 28104, $98.94; 250 columns: 28106, $465.60)

-       MinElute Reaction clean up column (Qiagen 50 columns: 28204, $109.61; 250 columns: 28206, $503.43)

Protocol for whole genome library construction

 

  1. Shear DNA by sonication. Make sure your sample is in 50ul of solution; start with 2-20ug of DNA. Fill the Bioruptor with water (up to 0.5 inches from the line) and then ice up to the line. Do 6 cycles, then replace the ice; repeat for a total of 18-20 cycles of 30 seconds on/off on the “H” setting. Aim for an average of 200-400 base pairs. Use an Agilent Bioanalyzer to confirm shear size.

 2.     End-repair

  • Blunt and 5’-phosphorylate the DNA from step 1 using the Quick blunting kit.
  • Mix:   

sheared DNA (2μg)                 45.5μl

10x Blunting Buffer                6μl

1mM dNTP Mix                    6μl

Blunt enzyme mix                   2.5μl

TOTAL                                  60μl

  • Incubate at RT for 30 minutes
  • Purify using Qiagen MinElute column (these are kept in the fridge.) Elute in 12μl. 

3.     Ligate Solexa adaptors

  • Solexa adapters must be hybridized before use. Heat to 95˚C for 5 minutes, then cool slowly to room temperature.
  • Ligate adaptors, using a 10x molar excess of each one, and as much DNA as possible.
  • Mix:

                                    End-repaired DNA                                         10μl 

                                    100μM IGA adapter A#                                 1.25μl

                                    100μM IGA adapter B#-PE                           1.25μl 

                                    2X Quick Ligation Reaction Buffer (NEB)       15μl

                                    Quick T4 Ligase (NEB)                                 2.5μl

                                    TOTAL                                                          30μl

  • Incubate at RT for 15 minutes.

 4.     Size selection and purification using SPRI beads.

  • Mix DNA and beads to appropriate ratio: 0.65X SPRI beads: Add 19.5 μl of SPRI beads to 30μl reaction from step 3.
  • Incubate at RT for 20 minutes.
  • Place tubes on magnet for 6 minutes.
  • Transfer all SN to new tube. Discard beads.
  • Mix DNA and beads to appropriate ratio, 1X SPRI beads: Add 10.5 μl SPRI beads to 49.5μl reaction.
  • Vortex, spin.
  • Incubate at RT for 7-20 minutes.
  • Place tubes on magnet for 6 minutes.
  • Remove all SN, keep beads.
  • Wash with 500μl 70% EtOH, incubate for 30 seconds, remove all SN.
  • Repeat: Wash with 500μl 70% EtOH, incubate for 30 seconds, remove all SN.
  • Let dry completely for 15 minutes. Remove from magnet.
  • Elute in 30μl EB.
  • Vortex.
  • Incubate at RT for 2 minutes.
  • Put on magnet for 2 minutes
  • Transfer SN to new tube. 
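The two bead additions in step 4 implement a double-sided size selection: 0.65x beads first (large fragments bind and are discarded with the beads), then topping up to 1x total (the desired fragments bind). The volumes follow from the original 30μl reaction:

```python
def double_sided_spri(sample_ul, low_ratio=0.65, high_ratio=1.0):
    """Bead volumes for a double-sided SPRI size selection.
    Returns (first addition, second addition) in uL."""
    first = sample_ul * low_ratio
    second = sample_ul * high_ratio - first
    return first, second

print(double_sided_spri(30))   # (19.5, 10.5), the volumes used above
```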

5.     Nick translation

  • Bst polymerase can be used for nick translation: it works at elevated temperatures, which helps melt secondary structure, and it lacks both 3’-5’ and 5’-3’ exonuclease activity.
  • Mix:

Purified DNA                                                 14 μl

10X Buffer (NEB)                              2μl

10mM dNTPs                                                0.4μl

1mg/ml BSA                                        2μl

Water                                                  0.6μl

Bst polymerase (Enzymatics)                        1μl

TOTAL                                              20μl

  • Incubate at 65 degrees, 25 minutes.

 6.     Library Enrichment by PCR.

  • Perform 2 25μl reactions:
  • Mix:

H2O                                        16.6μl

5X HF Buffer                           5μl

dNTPs (10mM)                        0.5μl

40μM Solexa PCR-A-PE           0.25μl

40μM Solexa PCR-B-PE           0.25μl

SybrGreenI                             0.125μl

Nick-translated DNA                2μl

Phusion                                0.25μl

TOTAL                                  25μl

  • Program:       
  1. 98˚C    70sec
  2. 98˚C    15sec
  3. 65˚C    20sec
  4. 72˚C    30sec
  5. Go to step 2 34 more times.
  6. 72˚C    5 min
  7. 4˚C      Forever
  • These 2 reactions are to check the cycle number only. Look at the real-time curves and use the mid-log point to pick the final cycle number.
  • Prep the PCR as above, but in 2 100μl reactions using 8μl of sample in each, and run the cycle number determined above.
  • Mix:

H2O                                        66.8μl

5X HF Buffer                           20μl

dNTPs(10mM)                         2μl

40μM Solexa PCR-A-PE           1μl

40μM Solexa PCR-B-PE           1μl

Nick-translated DNA                 8μl

Phusion                                   1μl

                                                      TOTAL                                  100μl

  • Run on a Qiaquick column. Elute in 50ul. (You could also do a single SPRI clean-up; check the ratio of beads to reaction volume.)
  • Analyze using Bioanalyzer.

Computational information for newbies

Introduction

I've collected information about tricks that a newbie might not know, but which is useful to help you get around computational work. I'll try to keep adding stuff as I learn it. Please add your tricks too!

Parallel computing

I'm trying to get a better sense of how to design parallel scripts. Typically, I've inherited someone's code and made it work for me. However, I have been looking for a good basic resource, and I've found at least one site that looks promising. It has a few free courses that look relevant, like "Parallel computing explained", "Introduction to MPI" and "Intermediate MPI". I found this by looking at an MIT computing course which pointed to this site.

http://www.citutor.org/index.php

Although it's got a lot of basic information, it's hard to figure out how it helps, because I'm really not sure what type of cluster I'm actually using (i.e. which parts are relevant to me). It didn't really help me do any actual coding yet, although some of the background about computers was semi-interesting.

How to find stuff out about computing clusters

I wanted to know whether there was a website where you could just find out how to run stuff on a computer cluster (e.g. beagle, aces, or coyote). Basically, Scott said that only the sys admin knows all of the specific rules associated with each cluster, and if you don't pick their brain about it, you won't really know how to use it right. I will hopefully pick brains for you and put the answers on this website in other posts about each system. That's a work in progress.

You can find out about specifics of aces queues with:

qstat -Qf | more

or

qstat -q

which results in this on aces:

server: login

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
--------------- ---- ------ ------ --  -- -- -  -----
geom               -      -       -      -    0   0 --   E R
one                -      -    06:00:00     1   8 319 10   E R
four-twelve        -      -    12:00:00   --    8   4 10   E R
four               -      -    02:00:00    16   8 437 10   E R
long               -      -    24:00:00    16   1   0 10   E R
all                -      -    02:00:00  1024   0   0  4   E R
mchen              -      -    02:00:00  1024   0   0  4   E R
mediumlong         -      -    96:00:00    30   0   0 10   E R
special            -      -       -       36   0   0 -   E R
toolong            -      -    168:00:0     4   0   0 10   E R
                                               ---- ----
                                                  25   760

And this on coyote:

server: wiley

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
--------------- ---- ------ ------ --  -- -- -  -----
speedy             -      -    00:30:00   -    0   0 -   E R
short              -      -    12:00:00   -    2  -2 -   E R
long               -      -    48:00:00   -   68  46 -   E R
quick              -      -    03:00:00   -    0   0 -   E R
be320              -      -    00:30:00   -    0   0 -   E R
ultra              -      -    336:00:0   -    2   0 -   E R
                                               ---- ----
                                                  72    44

You can also use this to find more information about qsub (do it from somewhere like the head node, because not all nodes have the same qsub manual):

man qsub

You can find out more about the various flags you can use with qsub.

Queuing system on clusters

Never run anything on the head node!!! When you log into a cluster, you need to submit jobs to a queue or work interactively on a dedicated interactive node. The dedicated interactive nodes will have different names, so you just have to find them. Sometimes you can request a node with qsub -I, or ssh to a dedicated interactive node (qubert on aces and super-genius on coyote), but this also depends on your system.

So, on any given cluster, there might be different queues (i.e. short, long, ultra-long) that you want to submit your jobs to. To find out (if you don't know already), you can run qstat and the queue names will be in the last column. It might be obvious what each queue means from its name and the walltimes of the things running in it (short < 12 hours, ultra-long > 2500 hours), but this is likely just something you need to find out from someone who knows the cluster, or from the sys admin again. Then if you want to submit to a specific queue, use something like this (I think, but I actually haven't done it exactly like this):

qsub -q short ....
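
The queue flag can also go inside the job script itself as a #PBS directive. Here is a minimal sketch (the script name myjob.sh, the job name, and the echo line are all made up; the queue name depends on your cluster):

```shell
# Write a minimal job script pinned to the "short" queue via a #PBS directive.
# The #PBS lines are read by qsub; the shell itself treats them as comments.
cat > myjob.sh <<'EOF'
#!/bin/sh
#PBS -N myjob
#PBS -q short
#PBS -l nodes=1
echo "job started"
EOF
# On the cluster you would then submit it with:
#   qsub myjob.sh
# which should be equivalent to qsub -q short myjob.sh without the directive.
sh myjob.sh    # run locally just to show the script body is plain shell
```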

Shortcut for ssh'ing

Scott also told me how to set up your computer to automatically fill in ssh information so you don't have to type it each time. You have a folder ~/.ssh/ and a file ~/.ssh/config, which should be modified to contain the following for each host:

Host aces

   Hostname login.acesgrid.org

   User spacocha

Then each time you want to ssh just type:

ssh aces

Works for scp too (and presumably other things).

Downloading directly to the clusters

You can get stuff from a website using wget, for example:

wget https://github.com/swo/lake_matlab_sens/archive/master.zip

 

Running something on a detached screen:

use screen. This will help you figure stuff out:

man screen

or

screen --help

This starts screen:

screen -S SPPtest

This detaches but keeps it running:

hold "control" and "A" keys then type "D"

To reattach to detached screen:

screen -R  SPPtest

To get rid of the screen altogether type this from within a screen:

exit

Statistical tests used for microbial community analysis

Introduction

I'm always trying to figure out which statistical test to use to analyze data, and I wish there was some sort of summary of when to use which statistic, what the limitations are and how to use it correctly. I'm going to try to add stuff including when to use each test, with the hypothesis you are interested in testing and how you can implement it. This is a work in progress, and I'm going to try to keep adding stuff that I learn about.

Background and good references

I found a pretty good book that I thought might be useful, but I haven't gotten it yet. It's not in the MIT library, so I ordered it.

Statistical analysis in microbiology : statnotes

Richard A Armstrong; Anthony C Hilton

Book

Contingency tables

I have run into situations where I want to test my observations against a model. The observations are the counts of an OTU found at a bunch of discrete depths. The model I want to test is whether this new OTU has the same distribution with depth as another, more abundant and closely related OTU, as applied in distribution-based clustering:

http://aem.asm.org/content/79/21/6593.full?sid=76732af5-84eb-4f2b-8465-bd1c66283323

Here's a good basic video explaining the chi square test and the use of contingency tables:

http://tll.mit.edu/help/genetics-and-statistics

I use R to calculate the chi-square value. These are my basic cheat-sheet notes for using R:

> alleles <- matrix(c(12, 4, 15, 17, 25, 4), nr=3,
+ dimnames=list(c("A1A1", "A1A2", "A2A2"), c("A1", "A2")))
> alleles
     A1 A2
A1A1 12 17
A1A2  4 25
A2A2 15  4
> chisq.test(alleles)

        Pearson's Chi-squared test

data:  alleles
X-squared = 20.2851, df = 2, p-value = 3.937e-05

However, new information that I'm getting is that for very large counts (like Illumina reads), your model will never fit, because you have so much information that even small variations will be significant. I found this to be true, so I got around it by determining whether the information content was the same (using the square root of the JSD), which is basically a workaround. However, I'm also looking into the Root Mean Square Error of Approximation (although I haven't tried it yet) to get around these problems with big numbers like Illumina count data.

http://www.rasch.org/rmt/rmt254d.htm

Determining a bug of importance from 16S data

Alex Sheh, a postdoc in the Fox lab, was looking at changes in the microbiome associated with cancer. He had an output from the BioMicro Center bioinformatics pipeline that indicated two significant bugs: one type of Clostridia associated with cancer and a Bacteroidetes associated with wildtype (or health, I'm not sure). In another analysis, using PLS-DA in the software package SIMCA, two bugs seemed to be significantly associated with protection and with cancer. I suggested that he figure out whether the two results were similar (initially we thought they might be). I wasn't sure which test was (or could have been) applied and how to interpret the data. I suggested he use Slime to figure out which bugs were associated with disease and protection, but I wasn't sure whether that used the same tests as the other two, or if it would be an additional independent confirmation of the other results (by yet another test). Below, I plan to outline which tests to do, what the caveats are, when to apply these tests and when not to, and how to figure out whether the results are worth investing more money to verify.

Computing on coyote

This is all about how to compute on coyote. There's another site with other information at:

https://wikis.mit.edu/confluence/display/ParsonsLabMSG/Coyote

Gaining access:

I'm pretty sure that [greg] still helps out with this, you would probably need to ask him for access. Also, if you are working off campus, you need to log into your on-campus athena account first then ssh onto coyote. If you have questions about this, just ask me. Also, put yourself on the mailing list by checking out the other link above.

What is available:


You can find out with "module avail"

modules of interest (to me):

module add python/2.7.3
module add atlas/3.8.3
module add suitesparse/20110224
module add numpy/1.5.1
module add scipy/0.11.0
module add biopython

module add matplotlib/1.1.0

module add matlab

(the matlab module above is the 2009 version)

module add matlab/2012b

QIIME has been installed (run both of the following commands, in order!):

  module add python/2.7.6
  module add qiime/1.8.0

Interactive computing:

I was able to get onto a node with:

qsub -I -l nodes=1

But when I tried to use matlab (with module add matlab) it didn't work (although it did work upon ssh'ing).

To run matlab with the window, first log in with X11:

ssh -X user@coyote.mit.edu

ssh -X super-genius

module add matlab/2012b

matlab

Submitting multiple jobs:

Before running the program below, make sure to load the following modules (I tried without and I got an error loading argparse):

module add python/2.7.3
module add atlas/3.8.3
module add suitesparse/20110224
module add biopython
module add java/1.6.0_21

You can also just source csmillie's .bashrc to make sure it works (if you didn't do anything else to yours that you need).

Also, there are different timed queues, so make sure if you get this working that it submits to the right queue. If you type qstat -q you can see a list of queues and how many running and queued items each has.  At the time I checked there are six queues: speedy, short, long, quick, be320, and ultra.  These have different allowed runtimes.

From Mark and Chris-

I've been using a script Chris wrote which works pretty well: /home/csmillie/bin/ssub
What it does
It streamlines job submission. If you give it a list of commands, it will (1) create scripts for them, and (2) submit them as a job array. You can give it the list of commands as command line arguments or through a pipe.
Quick examples
1. Submit a single command to the cluster: ssub "python /path/to/script.py > /path/to/output.txt"
2. Submit multiple commands to the cluster (use semicolon separator): ssub "python /path/to/script1.py; python /path/to/script2.py"
3. Submit a list of commands to the cluster (newline separator): cat /list/of/commands.txt | ssub
Detailed example: /home/csmillie/alm/mammals/aln/95/
In this directory, I have 12,352 fasta files I want to align. I can do this on 100 nodes quite easily:
1. First, I create a list of commands: for x in `ls *fst`; do y=${x%.*}; echo muscle -in $x -out $y.aln; done > commands.txt

The output looks like this:
...
muscle -in O95_9990.fst -out O95_9990.aln
muscle -in O95_9991.fst -out O95_9991.aln
muscle -in O95_9992.fst -out O95_9992.aln
muscle -in O95_9993.fst -out O95_9993.aln
...
2. Then I submit these commands as a job array: cat commands.txt | ssub
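
The ${x%.*} in that loop is plain shell suffix-stripping (it removes the file extension). Here is a self-contained sketch with two dummy .fst files, just to show how commands.txt gets built (no muscle needed for this part):

```shell
# Create two empty dummy input files, then build a commands.txt like the one above.
touch O95_0001.fst O95_0002.fst
for x in `ls *fst`; do
    y=${x%.*}                        # strip the extension: O95_0001.fst -> O95_0001
    echo muscle -in $x -out $y.aln
done > commands.txt
cat commands.txt
```

Piping the resulting file into ssub (cat commands.txt | ssub) is what submits it as a job array.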
How to configure it
Copy it to your ~/bin (or wherever). Then edit the top of the script:

uname = your username
tmpdir = directory where scripts are created
max_size = number of nodes you want to use
Other things 

It automatically creates random filenames for your scripts and job arrays. These files are created in the directory specified by "tmpdir". It can also submit individual scripts instead of a job array.

Coyote queue

qstat -Qf | more

This will tell you the specifics of each queue. There is also no priority allocation, so please be polite and choose the right queue for your job.

Submitting multiple files to process in the same job

Example from Spence -

I wanted to write bash files that would submit multiple files for the same analysis command on coyote.  I used PBS_ARRAYID, which will take on values that you designate with the -t option of qsub.

I got access to qiime functions by adding the following line to the bottom of my .bashrc file:

export PATH="$PATH:/srv/pkg/python/python-2.7.6/pkg/qiime/qiime-1.8.0/qiime-1.8.0-release/bin"

Then I made my .bashrc file the source in my submission script (see below).  The DATA variable just shortens a directory where I store my data.

To run all this, I created the file below then typed the following at the command line:
$ module add python/2.7.6
$ module add qiime/1.8.0 
$ qsub -q long -t 1-10 pickRepSet.sh

(the -t option will vary my PBS_ARRAYID variable from 1 to 10, iterating through my 10 experimental files).

#!/bin/sh
#filename: pickRepSet.sh
#
# PBS script to run a job on the myrinet-3 cluster.
# The lines beginning #PBS set various queuing parameters.
#
#    -N Job Name
#PBS -N pickRepSet
#
#    -l resource lists that control where job goes
#PBS -l nodes=1
#
#       Where to write output
#PBS -e stderr
#PBS -o stdout
#
#    Export all my environment variables to the job
#PBS -V
#
source /home/sjspence/.bashrc

DATA="/net/radiodurans/alm/sjspence/data/140509_ML1/"

pick_rep_set.py -i ${DATA}fwd_otu/uclust_picked_otus${PBS_ARRAYID}/ML1-${PBS_ARRAYID}_filt_otus.txt -f ${DATA}fwd_filt/dropletBC/ML1-${PBS_ARRAYID}_filt.fna -o ${DATA}fwd_otu/repSet/ML1-${PBS_ARRAYID}_rep.fna
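
To preview what one array task will actually run, you can expand the variables by hand. PBS_ARRAYID only exists inside a job, so this sketch fakes it (index 3 is an arbitrary example; the DATA path is the one from the script above):

```shell
# Simulate what array task 3 sees: qsub -t sets PBS_ARRAYID inside the job,
# so we set it by hand here to preview the expanded input filename.
PBS_ARRAYID=3
DATA="/net/radiodurans/alm/sjspence/data/140509_ML1/"
echo "${DATA}fwd_filt/dropletBC/ML1-${PBS_ARRAYID}_filt.fna"
```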

 

Installing and Running Slime on coyote (also note the trick about installing packages below):

ssh onto super-genius

Clone slime into the ~/lib/:

git clone https://github.com/cssmillie/slime.git

Then add r/3.0.2

module add r/3.0.2

I wanted to install some packages in R, but I couldn't get them directly, so I did the following:

In R:

> Sys.setenv(http_proxy="http://10.0.2.1:3128")
Then it should work.

install.packages('optparse')

install.packages('randomForest')

install.packages('ggplot2')

install.packages('plyr')

install.packages('reshape')

install.packages('car')

(However, I'm still having trouble running slime)

If I exit out now and try to run these on the command line, I had some success with this (although Chris said I need to be in the slime folder, because I edited run.r to include the path to until.r to make it work):

Rscript ~/lib/slime/run.r -m setup -x unique.f5.final.mat.transpose2 -y enigma.meta -o output > commands.sh

 

Computing on the Aces cluster

http://acesgrid.org/
Overview

ACES stands for the Alliance for Computational Earth Science, and I got access from [greg] by emailing him; he provided access within 24 hours. This came with an email from the grid system itself with a link to their website, which has lots of useful information, including the times for office hours (Tuesdays from 11:30-1:30) and a list of software.

Information

Their website has some useful information, although it's not perfect. If you have any specific questions, you might find it there.

  http://acesgrid.org/getting_started.html

Some of the information might be old. For example, there are not currently office hours and you should just email [greg] if you need help with something specific. If you sign up for the aces-support list, you will get some emails (although I imagine very infrequently) about this.

I'll just summarize a few things I am aware of.

Storage

Because we don't have our own Alm lab storage and back-up system, you are only allowed 1GB on your home drive. There is quite a lot of space available on a scratch drive, but it is for short-term storage and will be deleted automatically. So this might be a good option when you need a lot of computational power to finish processing something; you can then move the results off and store them elsewhere. Look at the webpage for specifics.

Interactive computing

After normal login to the head node, ssh to a dedicated compute node with:

ssh qubert

or get a node through the queue interactively:

qsub -I -l nodes=1

The Queue system

The website has some examples of how to submit a job. Try to follow their examples.

From the login (head) node, you can find out about the queue system with the command:

qstat -q

And from the login (head) node, you can find out about the qsub command-line options and what they mean with (if you do this from an interactive node you will get a different manual):

man qsub

Note: You cannot qsub when you are logged into an interactive node (qubert); it says "qsub: command not found". Instead, qsub from the head node upon logging in.

This script ran on aces by qsubbing the file below like this:
qsub multiple_scripts1-10

multiple_scripts1-10 is:
#!/bin/csh
#filename: pbs_script
#PBS -N pbs_script
#PBS -l nodes=1
#PBS -e stderr
#PBS -o stdout
# o Export all my environment variables to the job
#PBS -V

if ( -f /etc/profile.d/modules.csh ) then
    source /etc/profile.d/modules.csh
endif

module load matlab
matlab -nojvm -r "run(0.2985,1.0,75.0,10000.0,600.0,25.0,0.1,'rates_0_1.csv'); exit;"

MatLAB

I was specifically looking for a new way to use Matlab, since I was using it on beagle but that went down. This is how to get matlab working on aces:

http://acesgrid.org/matlab.html

However, if you want to run interactively, you reserve a node:

qsub -I -l nodes=1

Then in the next terminal window, you login as you would normally with -X (maybe not as directed on the above webpage):

ssh -X aces

Then you ssh from aces onto the reserved node:

ssh -X reserved.node.name

This should work and bring up X11 (that's what I'm using):

module add matlab

matlab

QIIME

You can also get qiime to work, using the qsub command above to get interactive computing and use:

module add qiime

It's QIIME version 1.6 (I think), so it's got a few quirks that are different from version 1.3, which was on beagle. You might need to change some variable names etc., but the QIIME documentation should be helpful for this.

Space

So I was able to make my own directory in /data/ and /scratch/, but it didn't exist before I made it:

mkdir /scratch/spacocha

mkdir /data/spacocha

I'm able to write to that folder, so I think it should work fine.

File transfer

Although I can make the /scratch/spacocha folder, I can't scp to it. Instead, I can scp big files, like fastq files, to /data/spacocha/ just fine.

Notes

I have been having trouble getting an interactive node using qsub for a few days. It seems like the cluster fills up occasionally, making it difficult to get stuff done in a hurry.

Summary

In summary, this cluster might be good for a few specific tasks, but it's not good for long term storage, and it's geared towards the earth science crowd and modeling ocean circulation etc. It might have some functional capabilities that you could use (i.e. Matlab, QIIME), but be careful not to leave data on their scratch because it will be deleted.

Forward Barcoded primers for Illumina 16S Libraries

This information is specific for the 16S Illumina Libraries. Multiplexed genome libraries should follow the information for the genome barcodes.

Outline:

In order to multiplex more than 96 samples into a lane, a forward barcode is required. This is because the reverse barcodes cost a lot of money to make, and you can get more bang for your buck by using the same reverse barcodes again with a different forward barcode. The forward barcode is a 5 bp sequence before the U515 F primer sequence. The forward primer must also include homology to the second step forward primer sequence. The entire construct is depicted in Fig S1.pdf.
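
Reading the construct back out is then just a matter of splitting off the first 5 bp of each forward read. A toy sketch (the read and the filename reads.fq are made up; real demultiplexing would also match each barcode against your barcode list):

```shell
# Write one fake fastq record, then print its 5 bp forward barcode
# (the first 5 bases of each sequence line; sequence lines are lines 2, 6, 10, ...).
cat > reads.fq <<'EOF'
@read1
ACGTAGTGCCAGCMGCCGCGGTAA
+
IIIIIIIIIIIIIIIIIIIIIIII
EOF
awk 'NR % 4 == 2 { print substr($0, 1, 5) }' reads.fq
```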

Forward Primer Barcode Sequences:

The forward barcode sequences that we currently have are here: Manually_copied_forward_barcodes.xls

8 Steps Illumina genome Library prep_older version
  • From Ilana Brito from Alm lab (posted by Sarah Preheim)

Protocol for library whole genome construction

 

  1. Shear DNA by sonication. Make sure your sample is in 50ul of solution. Start with 2-20ug of DNA. Fill the BioRupter with water (up to 0.5 inches from the line) and then ice up to the line. Do 6 cycles, replace ice. Repeat for a total of 18-20 cycles of 30 seconds on/off with the “H” setting. Average 200-400 base pairs.

 

  2.     End-repair
  • Blunt and 5’-phosphorylate the sheared DNA using the Quick blunting kit.
  • Mix:   

sheared DNA (2μg)                 45.5μl

10x Blunting Buffer                6μl

1mM dNTP Mix                    6μl

Blunt enzyme mix                   2.5μl

TOTAL                                  60μl

  • Incubate at RT for 30 minutes
  • Purify using Qiagen MinElute column (these are kept in the fridge.) Elute in 12μl.
  3.     Ligate Solexa adaptors
  • Solexa adapters must be hybridized before use. Heat to 95˚C for 5 minutes, cool slowly to room temperature.
  • Ligate adaptors, using a 10x molar excess of each one, and as much DNA as possible.
  • Mix:

                                    End-repaired DNA                                         10μl (12.5 pmol)

                                    100μM IGA adapter A#                                 1.25μl (125 pmol)

                                    100μM IGA adapter B#-PE                           1.25μl (125 pmol)

                                    2X Quick Ligation Reaction Buffer (NEB)    15μl

                                    Quick T4 Ligase (NEB)                                  2.5μl

                                    TOTAL                                                          30μl

  • Incubate at RT for 15 minutes.

 

  4.     Size selection and purification using SPRI beads.
  • Mix DNA and beads to appropriate ratio: 0.65X SPRI beads: Add 19.5 μl of SPRI beads to 30μl reaction from step 3.
  • Incubate at RT for 20 minutes.
  • Place tubes on magnet for 6 minutes.
  • Transfer all SN to new tube. Discard beads.
  • Mix DNA and beads to appropriate ratio, 1X SPRI beads: Add 10.5 μl SPRI beads to 49.5μl reaction.
  • Vortex, spin.
  • Incubate at RT for 7-20 minutes.
  • Place tubes on magnet for 6 minutes.
  • Remove all SN, keep beads.
  • Wash with 500μl 70% EtOH, incubate for 30 seconds, remove all SN.
  • Repeat: Wash with 500μl 70% EtOH, incubate for 30 seconds, remove all SN.
  • Let dry completely for 15 minutes. Remove from magnet.
  • Elute in 30μl EM.
  • Vortex.
  • Incubate at RT for 2 minutes.
  • Put on magnet for 2 minutes
  • Transfer SN to new tube.
  5.     Nick translation
  • Bst polymerase can be used for nick translation---it can be used at elevated temperatures, which is good for melting out secondary structures, and it lacks both 3’-5’ and 5’-3’ exonuclease activity.
  • Mix:

Purified DNA                                                 14 μl

10X Buffer (NEB)                              2μl

10mM dNTPs                                                0.4μl

1mg/ml BSA                                        2μl

Water                                                  0.6μl

Bst polymerase (Enzymatics)                        1μl

TOTAL                                              20μl

  • Incubate at 65 degrees, 25 minutes.

 

  6.     Library Enrichment by PCR.
  • Perform 2 25μl reactions: (100μM primer)
  • Mix:

H2O                                        19.125μl

5X Pfu Turbo buffer               5μl

dNTPs (10mM)                        0.5μl

40μM Solexa PCR-A-PE           0.25μl

40μM Solexa PCR-B-PE           0.25μl

SybrGreenI                             0.125μl

Nick-translated DNA             2μl

Pfu Turbo                               0.25μl

TOTAL                                  25μl

  • Program:       
  1. 95˚C    120sec
  2. 95˚C    30sec
  3. 60˚C    30sec
  4. 72˚C    60sec
  5. 95˚C    120sec
  6. Go to step 2 34 more times.
  7. 72˚C    5 min
  8. 4˚C      Forever
  • These 2 reactions are to check cycle number only. Look at the amplification curves---use the mid-log point to pick the final cycle number.
  • Prep PCR as above, but in 2 100μl reactions using 8μl of sample in each, and run with the cycle number determined above.
  • Mix:

H2O                                        77μl

5X Pfu Turbo buffer               20μl

dNTPs (10mM)                         2μl

40μM Solexa PCR-A-PE           1μl

40μM Solexa PCR-B-PE           1μl

Nick-translated DNA             8μl

Pfu Turbo                               1μl

TOTAL                                  100μl

  • Run on a QIAElute column. Elute in 50ul. (You could also do a single SPRI---check the ratios of beads to reaction volume)
  • Analyze using Bioanalyzer.
Barcoded Reverse Primers for Illumina Libraries

Overview

We have designed barcodes to multiplex samples together in a single Illumina lane. Currently, only three reads are supported by Illumina: a forward read, a reverse read and a barcode read. However, we have incorporated an additional barcode read into the first read as well. The current design is outlined in Fig S1.pdf.

Designing Illumina amplicon libraries

Any PCR amplicon (16S, TCR-beta, etc.) can be used with this scheme, since it was designed to be modular. The first step primers must contain the following:

1.) The genomic DNA primer binding sites (to attach and extend the PCR product)

2.) The forward primer must contain some site diversity. This diversity is important for cluster identification, and having the first read begin with conserved primer sequence will severely impact the quality of the data. In Fig_S1.pdf, this diversity region is a string of YRYR (N's cannot be used -with IDT anyway- unless you specify an equal ratio of the four bases, and that might be costly). However, you can additionally add another set of barcodes and order different step one primers with the forward barcodes attached. The only caveat with this method is that you need at least four different forward barcodes in one lane to get enough diversity. The barcodes should be added to the samples relatively evenly, in a ratio of 1:1:1:1 of each barcode. More than four barcodes in the forward read should increase the quality of the calls.
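
The four-barcode minimum is about base diversity per sequencing cycle: with four evenly mixed barcodes you can see all four bases at each of the first positions. A toy check with four made-up 5 bp barcodes:

```shell
# Count the distinct bases seen at cycle 1 (position 1) across four
# hypothetical, evenly mixed barcodes.
printf 'ACGTA\nCGTAC\nGTACG\nTACGT\n' > barcodes.txt
cut -c1 barcodes.txt | sort -u | wc -l
```

With these four barcodes, every position sees all four bases, which is the kind of balance the 1:1:1:1 mixing is meant to give.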

Specs

Here are specs for the most recent reverse barcodes:

Uri Laserson_6957574_6123588.XLS

In addition, there are 9 barcodes outside of the 96 in the plate above: 097-105. These can be used for separately multiplexing mock or control samples into your lane.

Name                Sequence

PE-IV-PCR-097       CATTTCGCT
PE-IV-PCR-098       TTGCTCGTG
PE-IV-PCR-099       TCCGCTCAC
PE-IV-PCR-100       CCCAACAAA
PE-IV-PCR-101       GCAGACCAA
PE-IV-PCR-102       TGGCGATAT
PE-IV-PCR-103       TGGTTCTGC
PE-IV-PCR-104       GGTACGAGT
PE-IV-PCR-105       ACCCGTTCG

16S Library Preparation (Manual)

How to use the ParsonsLabMSG wiki

Overview

In order to make this site more useful to all, here are a few tips on how to use this site and how to make a Wiki blog post so others will be able to find the information they need. This should make the system more user friendly for all.

How to use this site to find information

Labels heatmap:

In order to find information that you want, look at the bottom of the home page for the specific labels containing the words that you are interested in. For example, if you want to learn about how to process raw fastq data from an Illumina library, you might start by clicking the Illumina label at the bottom of the home page.

Theme pages:

Alternatively, there are some theme pages which list all posts that have a particular theme. You can go to the Bioinformatics link on the left-hand side of the home page to get to this page, which lists all of the posts with the Bioinformatics label. Look through all of those posts.

Search box:

You can also search the whole Wiki from the search box at the top right.

How to add information to this site

Blog posts:

The easiest way to have the pages be self-organizing is to input your information as blog posts and use the appropriate labels. When choosing labels, consider each word as a meaningful label (if you want to put a space in a term like 16S library, use an underscore, as in 16S_library, since library by itself might not be the most useful term). Try to use labels that others have already chosen if possible, but add new labels if it seems appropriate.

Theme pages:

In addition to the list of all labels on the front page, it might be nice to make pages that have similar themes. Some of these major themes are already there, but feel free to add a theme when it is appropriate (for example, if there are any sampling protocols, we might want to make a field work page or something).

Additional Information:

For those of you who have interest, explore all of the options available on the Wiki, which includes calendars and the like. If you have a need, please use these tools to increase the utility of this site.

How to make a blog post

It's easy to make a blog post. Just go to the top right hand corner of any page where it says "Add" and choose "Blog Post". The great thing is that you can add attachments, links and images to the post with the insert button. There are also a lot of great macros to choose from; if you have something specific in mind, you might be able to find one. But most importantly, you should add the labels at the bottom under "Labels:". This is an important step in order for the site to be self-organizing and for information to be readily accessible to others. You can add pages as well, but this might take some organization. Pages and blogs seem to me to be identical in the way they are created, so the same things apply if you want to make a page.

Thanks for sharing your expertise with the group!