Blog from November, 2013

Materials Needed:

-       Bioruptor (Parsons 3rd floor, in between the Polz and Chisholm labs)

-       Quick blunting and ligation kit (NEB, E0542L 100RXNs $473.60)

-       10mM dNTP mix ( NEB, N0447L $216.00)

  • also need a 1:10 dilution for 1mM dNTPs

-       IGA adapter A# (10uM working solution)

-       IGA adapter B#-PE (10uM working solution)

-       SPRI beads (Beckman Coulter, A63882, 450mL $4,100)

-       BST Polymerase large fragment (NEB, M0275S 1,600U $49.60)

-       IGA-PCR-PE-F primer (40uM working solution)

-       IGA-PCR-PE-R primer (40uM working solution)

-       Phusion, with HF buffer (NEB, M0530L 500U $329.60 )

-       SybrGreen (Invitrogen S7563 500uL 10,000x $235.00 )

-       Qiaquick PCR cleanup column (Qiagen, 50 columns: 28104, $98.94; 250 columns: 28106, $465.60)

-       MinElute Reaction clean up column (Qiagen 50 columns: 28204, $109.61; 250 columns: 28206, $503.43)

Protocol for library whole genome construction

 

  1. Shear DNA by sonication. Make sure your sample is in 50μl of solution. Start with 2-20μg of DNA. Fill the Bioruptor with water (up to 0.5 inches from the line) and then add ice up to the line. Do 6 cycles, then replace the ice. Repeat for a total of 18-20 cycles of 30 seconds on/off on the “H” setting. This should give an average fragment size of 200-400 base pairs. Use the Agilent Bioanalyzer to confirm the shear size.

 2.     End-repair

  • Blunt and 5’-phosphorylate the DNA from step 1 using the Quick Blunting kit.
  • Mix:   

sheared DNA (2μg)                 45.5μl

10x Blunting Buffer                6μl

1mM dNTP Mix                    6μl

Blunt enzyme mix                   2.5μl

TOTAL                                  60μl

  • Incubate at RT for 30 minutes
  • Purify using Qiagen MinElute column (these are kept in the fridge.) Elute in 12μl. 

3.     Ligate Solexa adaptors

  • Solexa adapters must be hybridized before use. Heat to 95°C for 5 minutes, then cool slowly to room temperature.
  • Ligate adaptors, using a 10x molar excess of each one, and as much DNA as possible.
  • Mix:

                                    End-repaired DNA                                         10μl 

                                    100μM IGA adapter A#                                 1.25μl

                                    100μM IGA adapter B#-PE                           1.25μl 

                                    2X Quick Ligation Reaction Buffer (NEB)       15μl

                                    Quick T4 Ligase (NEB)                                 2.5μl

                                    TOTAL                                                          30μl

  • Incubate at RT for 15 minutes.

 4.     Size selection and purification using SPRI beads.

  • Mix DNA and beads to appropriate ratio: 0.65X SPRI beads: Add 19.5 μl of SPRI beads to 30μl reaction from step 3.
  • Incubate at RT for 20 minutes.
  • Place tubes on magnet for 6 minutes.
  • Transfer all SN to new tube. Discard beads.
  • Mix DNA and beads to appropriate ratio, 1X SPRI beads: Add 10.5 μl SPRI beads to 49.5μl reaction.
  • Vortex, spin.
  • Incubate at RT for 7-20 minutes.
  • Place tubes on magnet for 6 minutes.
  • Remove all SN, keep beads.
  • Wash with 500μl 70% EtOH, incubate for 30 seconds, remove all SN.
  • Repeat: Wash with 500μl 70% EtOH, incubate for 30 seconds, remove all SN.
  • Let dry completely for 15 minutes. Remove from magnet.
  • Elute in 30μl EM.
  • Vortex.
  • Incubate at RT for 2 minutes.
  • Put on magnet for 2 minutes
  • Transfer SN to new tube. 

5.     Nick translation

  • Bst polymerase can be used for nick translation---it works at elevated temperatures, which helps melt out secondary structure, and it lacks both 3’-5’ and 5’-3’ exonuclease activity.
  • Mix:

Purified DNA                                                 14 μl

10X Buffer (NEB)                              2μl

10mM dNTPs                                                0.4μl

1mg/ml BSA                                        2μl

Water                                                  0.6μl

Bst polymerase (Enzymatics)                        1μl

TOTAL                                              20μl

  • Incubate at 65°C for 25 minutes.

 6.     Library Enrichment by PCR.

  • Perform two 25μl reactions:
  • Mix:

H2O                                        16.6μl

5X HF Buffer                           5μl

dNTPs (10mM)                        0.5μl

40μM Solexa PCR-A-PE           0.25μl

40μM Solexa PCR-B-PE           0.25μl

SybrGreenI                             0.125μl

Nick-translated DNA                2μl

Phusion                                0.25μl

TOTAL                                  25μl

  • Program:       
  1. 98˚C    70sec
  2. 98˚C    15sec
  3. 65˚C    20sec
  4. 72˚C    30sec
  5. Go to step 2 34 more times.
  6. 72˚C    5 min
  7. 4˚C      Forever
  • These two reactions are only for determining the optimal cycle number. Look at the amplification curves and use the mid-log point to pick the final cycle number.
  • Prep the PCR as above, but in two 100μl reactions using 8μl of sample in each, and run them for the cycle number determined above.
  • Mix:

H2O                                        66.8μl

5X HF Buffer                           20μl

dNTPs(10mM)                         2μl

40μM Solexa PCR-A-PE           1μl

40μM Solexa PCR-B-PE           1μl

Nick-translated DNA                 8μl

Phusion                                   1μl

                                                      TOTAL                                  100μl

  • Run on a QIAquick PCR cleanup column. Elute in 50μl. (You could also do a single SPRI cleanup---check the ratio of beads to reaction volume.)
  • Analyze using Bioanalyzer.
Computational information for newbies

Introduction

I've collected information about tricks that a newbie might not know, but which is useful to help you get around computational work. I'll try to keep adding stuff as I learn it. Please add your tricks too!

Parallel computing

I'm trying to get a better sense about designing parallel scripts. Typically, I've inherited someone's code and I've made it work for me. However, I have been looking for a good basic resource and I've found at least one site that looks promising. They have a few free courses that look relevant like "Parallel computing explained", "Introduction to MPI" and "Intermediate MPI". I found this by looking at an MIT computing course which pointed to this site.

http://www.citutor.org/index.php

Although it's got a lot of basic information, it's hard to figure out how it helps, because I'm really not sure what type of clusters I'm actually using (i.e. which parts are relevant to me). It didn't really help me do any actual coding yet, although some of the background about computers was semi-interesting.

How to find stuff out about computing clusters

I wanted to know whether there was a website where you could just find out about how to run stuff on a computer cluster (i.e. beagle, aces, or coyote). Basically, Scott said that only the sys admin knows all of the specific rules associated with each cluster and if you don't pick their brain about it, you won't really know how to use it right. I will hopefully pick brains for you and put it on this website in another post about each system. That's a work in progress.

You can find out about specifics of aces queues with:

qstat -Qf | more

or

qstat -q

Which results in this on aces:

server: login

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
--------------- ---- ------ ------ --  -- -- -  -----
geom               -      -       -      -    0   0 --   E R
one                -      -    06:00:00     1   8 319 10   E R
four-twelve        -      -    12:00:00   --    8   4 10   E R
four               -      -    02:00:00    16   8 437 10   E R
long               -      -    24:00:00    16   1   0 10   E R
all                -      -    02:00:00  1024   0   0  4   E R
mchen              -      -    02:00:00  1024   0   0  4   E R
mediumlong         -      -    96:00:00    30   0   0 10   E R
special            -      -       -       36   0   0 -   E R
toolong            -      -    168:00:0     4   0   0 10   E R
                                               ---- ----
                                                  25   760

And this on coyote:

server: wiley

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
--------------- ---- ------ ------ --  -- -- -  -----
speedy             -      -    00:30:00   -    0   0 -   E R
short              -      -    12:00:00   -    2  -2 -   E R
long               -      -    48:00:00   -   68  46 -   E R
quick              -      -    03:00:00   -    0   0 -   E R
be320              -      -    00:30:00   -    0   0 -   E R
ultra              -      -    336:00:0   -    2   0 -   E R
                                               ---- ----
                                                  72    44

You can also use this to find more information about qsub (run it from the head node, because not all nodes have the same qsub documentation):

man qsub

You can find out more about the various flags you can use with qsub.

Queuing system on clusters

Never run anything on the head node!!! When you log into a cluster, you need to submit jobs to a queue or work interactively on a dedicated interactive node. The dedicated interactive nodes will have different names, so you just have to find them. Sometimes you can request a node with qsub -I, or ssh to a dedicated interactive node (qubert on aces and super-genius on coyote), but these also depend on your system.

So, on any given cluster, there might be different queues (e.g. short, long, ultra-long) that you want to submit your jobs to. To find out (if you don't know already), you can run qstat and the queue name will be in the last column. It might be obvious what each of these queues means from its name and how long things run in each queue (short < 12 hours, ultra-long > 2500 hours), but this is likely just something you need to find out from someone who knows the cluster, or from the sys admin again. Then, if you want to submit to a specific queue, use something like this (I think, but I actually haven't done it exactly like this):

qsub -q short ....
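
For example, a full submission might look something like this (myjob.sh is just a hypothetical script name, and the exact flags can vary by cluster, so treat this as a sketch rather than a recipe):

qsub -q short -l nodes=1 myjob.sh

You can also put the queue choice inside a PBS submission script as a #PBS -q short line instead of giving it on the command line.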

Shortcut for ssh'ing

Scott also told me how to set up your computer to automatically fill in ssh information so you don't have to type it in each time. You have a folder ~/.ssh/ and a file ~/.ssh/config, which should be modified to contain the following for each host:

Host aces

   Hostname login.acesgrid.org

   User spacocha

Then each time you want to ssh just type:

ssh aces

Works for scp too (and presumably other things).
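
For example, copying a file up to your aces home directory becomes (results.txt is just a placeholder name):

scp results.txt aces:~/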

Downloading directly to the clusters

You can get stuff from a website using wget, for example:

wget https://github.com/swo/lake_matlab_sens/archive/master.zip
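
If what you downloaded is a zip file like the one above, you can usually unpack it right there (assuming unzip is installed on the cluster):

unzip master.zip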

 

Running something on a detached screen:

Use screen. This will help you figure stuff out:

man screen

or

screen --help

This starts screen:

screen -S SPPtest

This detaches but keeps it running:

hold “Control” and press “a”, then press “d”

To reattach to detached screen:

screen -R  SPPtest

To get rid of the screen altogether type this from within a screen:

exit
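
One more standard screen command that I find handy: to list all of your screens (attached or detached), use:

screen -ls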

Statistical tests used for microbial community analysis

Introduction

I'm always trying to figure out which statistical test to use to analyze data, and I wish there was some sort of summary of when to use which statistic, what the limitations are and how to use it correctly. I'm going to try to add stuff including when to use each test, with the hypothesis you are interested in testing and how you can implement it. This is a work in progress, and I'm going to try to keep adding stuff that I learn about.

Background and good references

I found a pretty good book that I thought might be useful, but I haven't gotten it yet. It's not in the MIT library, so I ordered it elsewhere.

Statistical Analysis in Microbiology: StatNotes, by Richard A. Armstrong and Anthony C. Hilton (book).

Contingency tables

I have run into situations where I want to test my observations against a model. The observations I have are the counts of an OTU found at a bunch of discrete depths. The model I want to test is whether this new OTU has the same distribution with depth as another, more abundant and closely related OTU, as applied in distribution-based clustering:

http://aem.asm.org/content/79/21/6593.full?sid=76732af5-84eb-4f2b-8465-bd1c66283323

Here's a good basic video explaining the chi square test and the use of contingency tables:

http://tll.mit.edu/help/genetics-and-statistics

I use R to calculate the chi-square value. These are my basic cheat-sheet notes for using R:

> alleles <- matrix(c(12, 4, 15, 17, 25, 4), nr=3,
+ dimnames=list(c("A1A1", "A1A2", "A2A2"), c("A1", "A2")))
> alleles
     A1 A2
A1A1 12 17
A1A2  4 25
A2A2 15  4
> chisq.test(alleles)

        Pearson's Chi-squared test

data:  alleles
X-squared = 20.2851, df = 2, p-value = 3.937e-05

However, new information that I'm getting is that for very large counts (like Illumina reads), your model will never fit, because you have so much information that even small deviations will be statistically significant. I found this problem to be true, so I got around it by determining whether the information content was the same (using the square root of the Jensen-Shannon divergence, SQRT JSD), which is basically a workaround. I'm also looking into the Root Mean Square Error of Approximation (RMSEA), although I haven't tried it yet, to get around these problems with big numbers like Illumina count data.

http://www.rasch.org/rmt/rmt254d.htm

Determining a bug of importance from 16S data

Alex Sheh, a postdoc in the Fox lab, was looking at changes in the microbiome associated with cancer. He had output from the Biomicro Center bioinformatics pipeline that indicated two significant bugs: one type of Clostridia associated with cancer and a Bacteroidetes associated with wildtype (or health, I'm not sure). In another analysis, using PLS-DA in the software package SIMCA, two bugs seemed to be significantly associated with protection and with cancer. I suggested that he figure out whether the two results were similar (initially we thought they might be). I wasn't sure which test was (or could have been) applied and how to interpret the data. I suggested he use Slime to figure out which bugs were associated with disease and protection, but I wasn't sure whether that used the same tests as the other two, or if it would be an additional independent confirmation of the other results (by yet another test). Below, I plan to outline which tests to do, what the caveats are, when to apply these tests, and when not to apply them. Also, to figure out whether the results are worth investing more money to verify.

Computing on coyote

This is all about how to compute on coyote. There's another site with other information at:

https://wikis.mit.edu/confluence/display/ParsonsLabMSG/Coyote

Gaining access:

I'm pretty sure that [greg] still helps out with this; you would probably need to ask him for access. Also, if you are working off campus, you need to log into your on-campus athena account first and then ssh onto coyote. If you have questions about this, just ask me. Also, put yourself on the mailing list by checking out the other link above.

What is available:


You can find out with "module avail"

modules of interest (to me):

module add python/2.7.3
module add atlas/3.8.3
module add suitesparse/20110224
module add numpy/1.5.1
module add scipy/0.11.0
module add biopython

module add matplotlib/1.1.0

module add matlab

(the matlab module above is the 2009 version)

module add matlab/2012b

QIIME has been installed (run both of the following commands, in order!):

  module add python/2.7.6
  module add qiime/1.8.0

Interactive computing:

I was able to get onto a node with:

qsub -I -l nodes=1

But when I tried to use matlab (via module add matlab) it didn't work (although it did work upon ssh'ing).

To run matlab with the window, first log in with X11:

ssh -X user@coyote.mit.edu

ssh -X super-genius

module add matlab/2012b

matlab

Submitting multiple jobs:

Before running the program below, make sure to load the following modules (I tried without and I got an error loading argparse):

module add python/2.7.3
module add atlas/3.8.3
module add suitesparse/20110224
module add biopython
module add java/1.6.0_21

You can also just source csmillie's .bashrc to make sure it works (if you didn't do anything else to yours that you need).

Also, there are different timed queues, so make sure if you get this working that it submits to the right queue. If you type qstat -q you can see a list of queues and how many running and queued items each has.  At the time I checked there are six queues: speedy, short, long, quick, be320, and ultra.  These have different allowed runtimes.

From Mark and Chris-

I've been using a script Chris wrote which works pretty well: /home/csmillie/bin/ssub

What it does:

It streamlines job submission. If you give it a list of commands, it will (1) create scripts for them, and (2) submit them as a job array. You can give it the list of commands as command line arguments or through a pipe.

Quick examples:

1. Submit a single command to the cluster:

ssub "python /path/to/script.py > /path/to/output.txt"

2. Submit multiple commands to the cluster (semicolon separator):

ssub "python /path/to/script1.py; python /path/to/script2.py"

3. Submit a list of commands to the cluster (newline separator):

cat /list/of/commands.txt | ssub

Detailed example: /home/csmillie/alm/mammals/aln/95/

In this directory, I have 12,352 fasta files I want to align. I can do this on 100 nodes quite easily:

1. First, I create a list of commands:

for x in `ls *fst`; do y=${x%.*}; echo muscle -in $x -out $y.aln; done > commands.txt

The output looks like this:
...
muscle -in O95_9990.fst -out O95_9990.aln
muscle -in O95_9991.fst -out O95_9991.aln
muscle -in O95_9992.fst -out O95_9992.aln
muscle -in O95_9993.fst -out O95_9993.aln
...
2. Then I submit these commands as a job array:

cat commands.txt | ssub

How to configure it:

Copy it to your ~/bin (or wherever). Then edit the top of the script:

uname = your username
tmpdir = directory where scripts are created
max_size = number of nodes you want to use

Other things 

It automatically creates random filenames for your scripts and job arrays. These files are created in the directory specified by "tmpdir". It can also submit individual scripts instead of a job array.
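
I haven't looked inside Chris's script, but here is a minimal sketch of the kind of PBS job-array script a wrapper like ssub might generate (the commands.txt path, the queue, and the array size are all assumptions, not what ssub actually uses):

#!/bin/bash
#PBS -N job_array_sketch
#PBS -l nodes=1
#PBS -q long
#PBS -t 1-100
# Each array element pulls one line out of commands.txt and runs it.
CMD=$(sed -n "${PBS_ARRAYID}p" /path/to/commands.txt)
eval "$CMD"

The -t range is what makes it a job array; PBS_ARRAYID takes a different value for each element, which is the same trick used in Spence's example further down this page.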

Coyote queue

qstat -Qf | more

This will tell you the specifics of each queue. There is also no priority allocation, so please be polite and choose the right queue for your job.

Submitting multiple files to process in the same job

Example from Spence -

I wanted to write bash files that would submit multiple files for the same analysis command on coyote.  I used PBS_ARRAYID, which will take on values that you designate with the -t option of qsub.

I got access to qiime functions by adding the following line to the bottom of my .bashrc file:

export PATH="$PATH:/srv/pkg/python/python-2.7.6/pkg/qiime/qiime-1.8.0/qiime-1.8.0-release/bin"

Then I source my .bashrc file in my submission script (see below). The DATA variable just shortens the path to the directory where I store my data.

To run all this, I created the file below then typed the following at the command line:
$ module add python/2.7.6
$ module add qiime/1.8.0 
$ qsub -q long -t 1-10 pickRepSet.sh

(the -t option will vary my PBS_ARRAYID variable from 1 to 10, iterating through my 10 experimental files).

#!/bin/sh
#filename: pickRepSet.sh
#
# PBS script to run a job on the myrinet-3 cluster.
# The lines beginning #PBS set various queuing parameters.
#
#    -N Job Name
#PBS -N pickRepSet
#
#    -l resource lists that control where job goes
#PBS -l nodes=1
#
#       Where to write output
#PBS -e stderr
#PBS -o stdout
#
#    Export all my environment variables to the job
#PBS -V
#
source /home/sjspence/.bashrc

DATA="/net/radiodurans/alm/sjspence/data/140509_ML1/"

pick_rep_set.py -i ${DATA}fwd_otu/uclust_picked_otus${PBS_ARRAYID}/ML1-${PBS_ARRAYID}_filt_otus.txt -f ${DATA}fwd_filt/dropletBC/ML1-${PBS_ARRAYID}_filt.fna -o ${DATA}fwd_otu/repSet/ML1-${PBS_ARRAYID}_rep.fna

 

Installing and Running Slime on coyote (also note the trick about installing packages below):

ssh onto super-genius

Clone slime into the ~/lib/:

git clone https://github.com/cssmillie/slime.git

Then add r/3.0.2

module add r/3.0.2

I wanted to install some packages in R, but I couldn't get them directly, so I did the following:

In R:

> Sys.setenv(http_proxy="http://10.0.2.1:3128")
Then it should work.

install.packages('optparse')

install.packages('randomForest')

install.packages('ggplot2')

install.packages('plyr')

install.packages('reshape')

install.packages('car')

(However, I'm still having trouble running slime)

If I exit out and try to run these on the command line, I had some success with this (although Chris said I need to be in the slime folder, because I edited run.r to include the path to until.r to get it to work):

Rscript ~/lib/slime/run.r -m setup -x unique.f5.final.mat.transpose2 -y enigma.meta -o output > commands.sh

 

Computing on the Aces cluster

http://acesgrid.org/
Overview

ACES stands for Alliance for Computational Earth Science. I got access from [greg] by emailing him, and he provided access within 24 hours. This came with an email from the grid system itself with a link to their website, which has lots of useful information, including the times for office hours (Tuesdays from 11:30-1:30) and a list of software.

Information

Their website has some useful information, although it's not perfect. If you have any specific questions, you might find it there.

  http://acesgrid.org/getting_started.html

Some of the information might be old. For example, there are currently no office hours, and you should just email [greg] if you need help with something specific. If you sign up for the aces-support list, you will get some emails (although I imagine very infrequently) about this.

I'll just summarize a few things I am aware of.

Storage

Because we don't have our own Alm lab storage and back-up system, you are only allowed 1GB on your home drive. There is quite a lot of space available on a scratch drive, but it is for short-term storage and will be deleted automatically. So scratch is a good option when you need a lot of space while you finish processing something; then you can move the results off and store them elsewhere. Look at the webpage for specifics.
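
To keep an eye on how much of that space you are using, the standard disk-usage commands work (nothing aces-specific about these):

du -sh ~
df -h /scratch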

Interactive computing

After normal login to the head node, ssh to a dedicated compute node with:

ssh qubert

or get a node through the queue interactively:

qsub -I -l nodes=1

The Queue system

The website has some examples of how to submit a job. Try to follow their examples.

From the login (head) node, you can find out about the queue system with the command:

qstat -q

And from the login (head) node, you can find out about the qsub command-line options and what they mean with (if you do this from an interactive node you will get a different manual):

man qsub

Note: You cannot qsub when you are logged into an interactive node (qubert); it says "qsub: command not found". Instead, qsub from the head node after logging in.

I ran this script on aces by qsubbing the file below like this:
qsub multiple_scripts1-10

multiple_scripts1-10 is:
#!/bin/csh
#filename: pbs_script
#PBS -N pbs_script
#PBS -l nodes=1
#PBS -e stderr
#PBS -o stdout
# o Export all my environment variables to the job
#PBS -V

if ( -f /etc/profile.d/modules.csh ) then
    source /etc/profile.d/modules.csh
endif

module load matlab
matlab -nojvm -r "run(0.2985,1.0,75.0,10000.0,600.0,25.0,0.1,'rates_0_1.csv'); exit;"

MatLAB

I was specifically looking for a new way to use Matlab, since I had been using it on beagle but that cluster went down. This is how to get matlab working on aces:

http://acesgrid.org/matlab.html

However, if you want to run interactively, you reserve a node:

qsub -I -l nodes=1

Then, in the next terminal window, you log in as you would normally with -X (maybe not as directed on the above webpage):

ssh -X aces

Then you ssh from aces onto the reserved node:

ssh -X reserved.node.name

This should work and bring up X11 (that's what I'm using):

module add matlab

matlab

QIIME

You can also get QIIME to work: use the qsub command above to get an interactive node, then run:

module add qiime

It's QIIME version 1.6 (I think), so it has a few quirks that are different from version 1.3, which was on beagle. You might need to change some variable names etc., but the QIIME documentation should be helpful for this.
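
If you want to check exactly which version and configuration the module gave you, QIIME ships a script that prints this (assuming the module put QIIME's scripts on your PATH):

print_qiime_config.py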

Space

So I was able to make my own directory in /data/ and /scratch/, but it didn't exist before I made it:

mkdir /scratch/spacocha

mkdir /data/spacocha

I'm able to write to that folder, so I think it should work fine.

File transfer

Although I can make the /scratch/spacocha folder, I can't scp to it. Instead, I can scp big files, like fastq files, to /data/spacocha/ just fine.
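
For example, using the ssh shortcut described earlier on this page, something like this works (mydata.fastq is just a placeholder name):

scp mydata.fastq aces:/data/spacocha/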

Notes

I have been having trouble getting an interactive node using qsub for a few days. It seems like the cluster fills up occasionally, making it difficult to get stuff done in a hurry.

Summary

In summary, this cluster might be good for a few specific tasks, but it's not good for long-term storage, and it's geared towards the earth science crowd and modeling ocean circulation etc. It might have some functional capabilities that you could use (e.g. Matlab, QIIME), but be careful not to leave data on their scratch, because it will be deleted.