Toast 2015 qiime_talk

QIIME: Quantitative Insights Into Microbial Ecology (part 1)

Thomas Jeffries, Federico M. Lauro, Grazia Marina Quero, Tiziano Minuzzo

The Omics Analysis Sydney Tutorial

Australian Museum 23rd-24th February 2015

QIIME

• Open source software package for taxonomic analysis of 16S rRNA sequences

• Developed at the University of Colorado Boulder & Northern Arizona University

• www.qiime.org (great resource)

• Good community support

• Can google most problems

• Multi-platform

• Widely used

[Photos: Caporaso, Knight]

Getting QIIME

Linux: https://github.com/qiime/qiime-deploy

Mac: http://www.wernerlab.org/software/macqiime

Ubuntu virtualbox: http://qiime.org/install/virtual_box.html

Linux remote machine, e.g. UTS FEIT cluster, NeCTAR: http://nectar.org.au/research-cloud

http://qiime.org/install/install.html

Data formats

• 454:

DNA sequences (FASTA, .fna)

Quality (.qual)

Mapping file (.txt)

• Illumina

Sequences and quality in same file (.fastq)

Also supports paired end

Getting into QIIME

• Command line interface

• Some very basic commands needed for QIIME:

example:

/folder$ programme.py -i file_in -o file_out

ls : lists files in the working directory

cd : changes directory

cd .. : goes back to the parent directory

‘tab’ key : auto-completes file names

mkdir : makes a directory

pwd : tells you where you are

QIIME tutorial and example data

• Many tutorials @ http://qiime.org/tutorials/index.html

• Good place to start: http://qiime.org/tutorials/tutorial.html

• Great Microbial Ecology course (includes QIIME): http://edamame-course.org/

• A few of the commands have changed in the new version – the current commands are in this talk - and I have renamed the files to make it easier to follow

Some useful terminology

α-diversity

Alpha diversity is the diversity within ONE sample

α-diversity: Richness

α-diversity: Evenness

Common metric: Pielou’s evenness
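As a rough illustration of the two ideas above, the sketch below computes richness (number of observed OTUs) and Pielou's evenness J' = H' / ln(S) from a list of OTU counts. The function names and the example counts are made up for illustration, and returning 0.0 when fewer than two OTUs are present is just a convention chosen here.

```python
import math

def shannon(counts):
    """Shannon diversity H' (natural log) from a list of OTU counts."""
    total = float(sum(counts))
    props = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in props)

def richness(counts):
    """Richness: the number of OTUs observed at least once."""
    return sum(1 for c in counts if c > 0)

def pielou_evenness(counts):
    """Pielou's J' = H' / ln(S), where S is the richness."""
    s = richness(counts)
    if s < 2:
        return 0.0  # J' is undefined for S < 2; 0.0 is an assumption of this sketch
    return shannon(counts) / math.log(s)

# Two hypothetical samples with the same richness but different evenness.
even_sample = [25, 25, 25, 25]
uneven_sample = [97, 1, 1, 1]
print(richness(even_sample), pielou_evenness(even_sample))
print(richness(uneven_sample), pielou_evenness(uneven_sample))
```

Both samples have four OTUs, but the second is dominated by one of them, so its evenness is much closer to 0 than to 1.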

Tutorial dataset


1. Check mapping file format

• Checks that format of mapping file is ok

validate_mapping_file.py -m my_mapping_file.txt -o validate_mapping_file_output

“No errors or warnings were found in mapping file”

1. Check mapping file

Name (ID) of sample

Primer

Sequencing barcode

Sample categories (treatments)

Tab separated !!!
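To make the layout above concrete, here is a minimal sketch of the kind of checks a mapping file has to pass. It is not a replacement for validate_mapping_file.py, which checks far more; it only assumes the standard QIIME header (#SampleID, BarcodeSequence, LinkerPrimerSequence, ..., Description), and the sample IDs, barcodes and primer in the example are invented.

```python
import csv
import io

REQUIRED_FIRST = ["#SampleID", "BarcodeSequence", "LinkerPrimerSequence"]

def check_mapping(text):
    """Return a list of problems found in tab-separated mapping file text."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows or not rows[0]:
        return ["missing header line"]
    header, body = rows[0], rows[1:]
    problems = []
    if header[:3] != REQUIRED_FIRST:
        problems.append("first three columns must be " + ", ".join(REQUIRED_FIRST))
    if header[-1] != "Description":
        problems.append("last column must be Description")
    seen = set()
    for line_no, row in enumerate(body, start=2):
        if len(row) != len(header):
            # Typical symptom of spaces having been used instead of tabs.
            problems.append("line %d: expected %d tab-separated fields, found %d"
                            % (line_no, len(header), len(row)))
            continue
        if row[0] in seen:
            problems.append("line %d: duplicate sample ID %s" % (line_no, row[0]))
        seen.add(row[0])
    return problems

# Hypothetical two-sample mapping file.
example = (
    "#SampleID\tBarcodeSequence\tLinkerPrimerSequence\tTreatment\tDescription\n"
    "S1\tAACC\tGTGCCAGC\tcontrol\tfirst_sample\n"
    "S2\tGGTT\tGTGCCAGC\ttreated\tsecond_sample\n"
)
print(check_mapping(example) or "No problems found")
```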

Hands on – validate your mapping file

validate_mapping_file.py -o moving_pictures_tutorial-1.8.0/illumina/cid_l1/ -m moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l1.txt

2. De-multiplex - 454

• Using sample specific barcodes, identify each sequence with a sample (renames sequences)

• Performs some QC:

Removes sequences < 200 bp

Removes sequences with a mean quality score < 25

Removes sequences with > 6 ambiguous bases or a homopolymer run longer than 6 bases

split_libraries.py -m my_mapping_file.txt -f my_sequence_file.fna -q my_quality_file.qual -o split_library_output

• Produces seqs.fna
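The idea behind de-multiplexing can be sketched in a few lines. This is a simplified illustration, not what split_libraries.py actually runs: it assumes the barcode sits at the very start of each read, ignores primers, ambiguous bases and homopolymers, and uses invented barcodes and reads. The renaming scheme (sample ID, underscore, running number) follows the style of the IDs in seqs.fna.

```python
def demultiplex(reads, barcode_to_sample, min_length=200, min_mean_qual=25):
    """Assign (sequence, qualities) reads to samples by their leading barcode.

    Returns (kept, rejected): kept is a list of (new_id, sequence without
    barcode) and rejected is the number of reads that failed a check.
    """
    barcode_len = len(next(iter(barcode_to_sample)))
    kept, rejected = [], 0
    per_sample = {}
    for seq, quals in reads:
        sample = barcode_to_sample.get(seq[:barcode_len])
        mean_qual = sum(quals) / float(len(quals)) if quals else 0.0
        if sample is None or len(seq) < min_length or mean_qual < min_mean_qual:
            rejected += 1
            continue
        index = per_sample.get(sample, 0)
        per_sample[sample] = index + 1
        kept.append(("%s_%d" % (sample, index), seq[barcode_len:]))
    return kept, rejected

# Toy reads (far shorter than real data, so min_length is lowered here).
barcodes = {"AACC": "S1", "GGTT": "S2"}
reads = [
    ("AACCGGGTTT", [30] * 10),  # matches S1
    ("GGTTACGTAC", [30] * 10),  # matches S2
    ("TTTTACGTAC", [30] * 10),  # unknown barcode
    ("AACCACGTAC", [10] * 10),  # low quality
]
kept, rejected = demultiplex(reads, barcodes, min_length=8)
print(kept, rejected)
```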

2. De-multiplex - Illumina (Step 1)

• If the samples contain paired-end reads, you first need to join them and update the barcodes using:

join_paired_ends.py -f my_forw_reads.fastq -r my_rev_reads.fastq -b my_barcodes.fastq -o my_joined/

2. De-multiplex - Illumina (Step 2)

Then you can proceed to the split libraries step. If the sequences are NOT paired-end, go directly to split_libraries_fastq.py. This step also performs QC on the Illumina reads:

split_libraries_fastq.py -m my_mapping_file.txt -i my_sequence_file.fastq -b my_barcodes.fastq -o split_library_output

• Data from multiple lanes can be processed together by separating inputs with a comma (,)

• Produces seqs.fna

Hands on: split your libraries

[...]1.8.0/illumina/raw/subsampled_s_1_sequence.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_2_sequence.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_3_sequence.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_4_sequence.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_5_sequence.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_6_sequence.fastq -b moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_1_sequence_barcodes.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_2_sequence_barcodes.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_3_sequence_barcodes.fastq,moving_pictures_tutorial-1.8.0/[...]illumina/raw/subsampled_s_6_sequence_barcodes.fastq -m moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l1.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l2.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l3.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l4.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l5.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l6.txt

count_seqs.py -i moving_pictures_tutorial-1.8.0/illumina/slout/seqs.fna

3. OTU picking strategies

• De Novo OTU picking: clustering of sequences at 97%

Overlapping sequences

No reference database necessary

computationally expensive

• Closed-Reference

non overlapping reads

needs reference database

discards sequences with no match (e.g. erroneous reads)

• Open-reference

Overlapping reads

reads clustered against reference and non matching reads are clustered de-novo

Hands on – picking O.T.U.s

pick_open_reference_otus.py -o moving_pictures_tutorial-1.8.0/illumina/otus/ -i moving_pictures_tutorial-1.8.0/illumina/slout/seqs.fna -r gg_13_8_otus/rep_set/97_otus.fasta -p moving_pictures_tutorial-1.8.0/uc_fast_params.txt

pick_de_novo_otus.py -o moving_pictures_tutorial-1.8.0/illumina/otus_denovo/ -i moving_pictures_tutorial-1.8.0/illumina/slout/seqs.fna

3. Pick OTUs

Note: the following steps can be automated with a single workflow script (this is what we are doing):

pick_de_novo_otus.py -i seqs.fna -o otus

pick_otus.py -i seqs.fna -o picked_otus_default

• Will cluster your sequences at 97% similarity (can change this if you wish) and produce ‘seqs_otus.txt’ which maps each sequence to a cluster

• Uses UCLUST algorithm (Edgar, 2010, Bioinformatics)

3. Pick OTUs

Generate OTUs by clustering reads based on similarity (default is 97%)

Sort reads according to size (long -> short)

Cluster: OTU1, OTU2, OTU3, OTU4, OTU5
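The sort-then-cluster idea can be sketched as a greedy centroid clustering. This is only a toy version of the approach: UCLUST uses alignment-based identity and fast search heuristics, whereas the identity function here simply compares positions of unaligned sequences. All names and sequences are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions over the shorter sequence (no alignment)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x == y) / float(n)

def greedy_cluster(seqs, threshold=0.97):
    """Visit reads from longest to shortest; each read joins the first
    centroid it matches at >= threshold, otherwise it founds a new OTU."""
    clusters = []  # list of (centroid_sequence, [member_ids])
    ordered = sorted(seqs.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for seq_id, seq in ordered:
        for centroid, members in clusters:
            if identity(seq, centroid) >= threshold:
                members.append(seq_id)
                break
        else:
            clusters.append((seq, [seq_id]))
    return {"OTU%d" % (i + 1): members for i, (_, members) in enumerate(clusters)}

base = "ACGT" * 25  # 100 bp toy sequence
reads = {
    "r1": base,
    "r2": base[:-2] + "AA",  # 2 mismatches out of 100 = 98% identical
    "r3": "TGCA" * 25,       # unrelated sequence
}
print(greedy_cluster(reads))
print(greedy_cluster(reads, threshold=0.99))
```

Raising the threshold from 97% to 99% splits r1 and r2 into separate OTUs, which is the effect of changing the similarity cut-off.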

4. Pick representative sequences

• We want a representative sequence for each OTU: it is time consuming to annotate every sequence, and they are already clustered

• This will take the most abundant sequence in each OTU and make a file that has 1 sequence for each OTU (rep_set1.fna)

pick_rep_set.py -i seqs_otus.txt -f seqs.fna -o rep_set1.fna
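A minimal sketch of the "most abundant sequence per OTU" rule described above. It assumes the OTU map has already been read into a dictionary; the read IDs and sequences are invented, and the tie-breaking rule is an arbitrary choice made so the result is deterministic. pick_rep_set.py offers several picking methods, so check which one your run actually used.

```python
from collections import Counter

def pick_rep_set(otu_map, seqs):
    """Return {otu_id: most abundant sequence among that OTU's members}."""
    rep_set = {}
    for otu_id, members in otu_map.items():
        counts = Counter(seqs[m] for m in members)
        # Highest count wins; ties are broken alphabetically by sequence.
        best_seq, _ = max(counts.items(), key=lambda kv: (kv[1], kv[0]))
        rep_set[otu_id] = best_seq
    return rep_set

# Hypothetical OTU map (as in seqs_otus.txt) and sequences (as in seqs.fna).
otu_map = {"OTU1": ["S1_0", "S1_1", "S2_0"], "OTU2": ["S2_1"]}
seqs = {"S1_0": "ACGT", "S1_1": "ACGA", "S2_0": "ACGT", "S2_1": "TTTT"}
print(pick_rep_set(otu_map, seqs))
```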

5. Annotate (assign taxonomy to each OTU)

• Compare each representative sequence to a database using one of several algorithms: UCLUST, BLAST, RDP Classifier, and others

• New default: UCLUST against the Greengenes database

assign_taxonomy.py -i rep_set1.fna

(output in directory: uclust_assigned_taxonomy)

• BLAST example (reference sequences and taxonomy downloaded from database):

assign_taxonomy.py -i rep_set1.fna -r ref_seq_set.fna -t id_to_taxonomy.txt -m blast

5. Annotate

• Some useful databases that are compatible with QIIME:

Greengenes: http://greengenes.secondgenome.com

Good for everything and default in QIIME

UNITE: http://unite.ut.ee

Fungal Internal Transcribed Spacer (ITS)

Good for soil fungi

SILVA: http://www.arb-silva.de

Contains both 16S and 18S rRNA (Eukaryotes…)

Good representation of marine taxa

Recap

[Figure: mixed amplicons from Species A, B and C in Samples 1, 2 and 3. Split library into samples using barcodes; used clustering to choose OTUs (OTU 1, OTU 2, OTU 3); picked representative sequences and assigned taxonomy against a reference database.]

6. Putting it all together: making an OTU table

• Need to combine the OTU identity with the abundance information in the clusters and link back to each sample so we can do ECOLOGY

make_otu_table.py -i seqs_otus.txt -t rep_set1_tax_assignments.txt -o otu_table.biom

• The table is in .biom format: http://biom-format.org/documentation/biom_format.html

• Convert to text file:

biom convert -i otu_table.biom -o otu_table.txt --table-type "otu table" --header-key taxonomy -b
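What an OTU table contains can be shown with a small sketch: count, for each OTU, how many of its member reads came from each sample. It relies on the read IDs written by the split libraries step having the form SampleID_number; the OTU map below is invented, taxonomy is left out, and the text layout is only an approximation of the classic tab-separated table.

```python
from collections import defaultdict

def make_otu_table(otu_map):
    """Count reads per (OTU, sample); the sample is the read ID up to the
    last underscore, e.g. 'S1_12' belongs to sample 'S1'."""
    table = defaultdict(lambda: defaultdict(int))
    for otu_id, members in otu_map.items():
        for read_id in members:
            sample = read_id.rsplit("_", 1)[0]
            table[otu_id][sample] += 1
    return {otu: dict(samples) for otu, samples in table.items()}

def to_text(table, samples):
    """Render the table as tab-separated text, one row per OTU."""
    lines = ["#OTU ID\t" + "\t".join(samples)]
    for otu in sorted(table):
        row = [str(table[otu].get(s, 0)) for s in samples]
        lines.append(otu + "\t" + "\t".join(row))
    return "\n".join(lines)

# Hypothetical OTU map, as would be read from seqs_otus.txt.
otu_map = {"OTU1": ["S1_0", "S1_1", "S2_0"], "OTU2": ["S2_1"]}
table = make_otu_table(otu_map)
print(to_text(table, ["S1", "S2"]))
```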

Closed reference O.T.U. picking

pick_closed_reference_otus.py -i seqs.fna -r reference.fna -o otus_w_tax/ -t taxa_map.txt

• Reference is a database, i.e. the Greengenes unaligned 97% OTUs and the matching taxa map (same files as for BLAST)

• Output has all of your sequences matched to Greengenes and an OTU table

• So this picks OTUs and assigns taxonomy in 1 step (but you lose non-matching sequences. Do we care? For taxa summaries no, for beta-diversity maybe)

• Quick: good for Illumina
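The closed-reference logic can be sketched as follows: each read is compared to every reference sequence, takes the identity and taxonomy of its best hit if that hit reaches the threshold, and is discarded otherwise. This is a naive illustration with a position-by-position identity on unaligned toy sequences, not the search that QIIME runs; the reference IDs and taxonomy strings are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions over the shorter sequence (no alignment)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x == y) / float(n)

def closed_reference(reads, reference, taxonomy, threshold=0.97):
    """Map each read to its best reference hit; discard reads below threshold."""
    assignments, discarded = {}, []
    for read_id, seq in reads.items():
        best_id, best_score = None, 0.0
        for ref_id in sorted(reference):
            score = identity(seq, reference[ref_id])
            if score > best_score:
                best_id, best_score = ref_id, score
        if best_id is not None and best_score >= threshold:
            assignments[read_id] = (best_id, taxonomy[best_id])
        else:
            discarded.append(read_id)  # lost in closed-reference picking
    return assignments, sorted(discarded)

reference = {"ref1": "ACGT" * 25, "ref2": "TGCA" * 25}
taxonomy = {"ref1": "taxon A", "ref2": "taxon B"}
reads = {
    "S1_0": "ACGT" * 25,
    "S1_1": ("ACGT" * 25)[:-2] + "AA",  # 98% identical to ref1
    "S2_0": "GGGG" * 25,                # matches nothing well
}
print(closed_reference(reads, reference, taxonomy))
```

In an open-reference run, the reads that end up in the discarded list would instead be clustered de novo.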

7. Aligning sequences

• Back to our representative sequences….

• How closely related are the organisms present in the samples, i.e. what is the phylogeny of our community and how does this shift between samples?

• Default: PyNAST to align samples to a reference set of pre-aligned sequences (e.g. Greengenes ALIGNED), which is more computationally efficient than de novo alignment

• Can also select other methods, e.g. MUSCLE

align_seqs.py -i rep_set1.fna -o pynast_aligned/

7. Aligning sequences

• Not all regions of the rRNA gene are informative or useful for phylogenetic inference

• Gaps – short length sequence vs full length rRNA gene

filter_alignment.py -i rep_set1_aligned.fasta -o filtered_alignment/

• Optional lanemask template that defines informative regions for some databases

filter_alignment.py -i seqs_rep_set_aligned.fasta -m lanemask_in_1s_and_0s -o filtered_alignment/

• If you are going to use this alignment for making a phylogenetic tree this step is essential
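What alignment filtering does can be sketched on a toy alignment: columns switched off by a lanemask of 1s and 0s are dropped, and so are columns that contain only gaps. This is a simplified illustration (filter_alignment.py has further options, such as gap and entropy thresholds); the sequences and mask are invented.

```python
def filter_alignment(aligned, lanemask=None):
    """Drop columns masked with '0' and columns that are gaps in every sequence.

    aligned: {name: aligned sequence}, all of the same length.
    lanemask: optional string of '1' (keep) and '0' (drop), same length.
    """
    seqs = list(aligned.values())
    length = len(seqs[0])
    keep = []
    for col in range(length):
        if lanemask is not None and lanemask[col] != "1":
            continue
        if all(s[col] in "-." for s in seqs):
            continue
        keep.append(col)
    return {name: "".join(seq[c] for c in keep) for name, seq in aligned.items()}

aligned = {"OTU1": "AC--GT-A", "OTU2": "A---GTTA"}
print(filter_alignment(aligned))
print(filter_alignment(aligned, lanemask="11110011"))
```

Short amplicons aligned against a full-length rRNA template are mostly gap columns, which is why this step matters before tree building.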

A note on chimera removal

• Chimeras: sequences formed from DNA of 2 or more organisms (an artifact of PCR amplification)

• QIIME uses ChimeraSlayer to detect chimeric sequences using your alignment and a reference database

• You should then remove these OTUs from your OTU table and alignment before proceeding with tree building and visualization of results:

• Use -e chimeric_seqs.txt when making the OTU table, and filter_fasta.py for the alignment

identify_chimeric_seqs.py -m ChimeraSlayer -i rep_set_aligned.fasta -a reference_set1_aligned.fasta -o chimeric_seqs.txt

8. Make a phylogenetic tree

make_phylogeny.py -i rep_set1_aligned_pfiltered.fasta -o rep_phylo.tre

• Builds a tree from the alignment using FastTree

• Outputs a tree in newick format (.tre) which can be opened with software such as FigTree or can be used to calculate phylogenetic metrics

• Also filter chimeras from the tree

We now have 2 final outputs:

• OTU Table

1. Taxonomic composition

2. α-diversity (e.g. ‘species’ richness)

3. β-diversity (e.g. abundance similarity between samples)

• Phylogenetic tree

1. Phylogenetic β-diversity

QIIME has powerful visualization and statistical tools

Hands on – reformatting outputs

We have automated (piped) most of the steps I have talked about. We need to convert the OTU table to a text file and filter the alignment:

biom convert -i moving_pictures_tutorial-1.8.0/illumina/otus_denovo/otu_table.biom -o moving_pictures_tutorial-1.8.0/illumina/otus_denovo/otu_table.txt --table-type "otu table" --header-key taxonomy -b

filter_alignment.py -i moving_pictures_tutorial-1.8.0/illumina/otus_denovo/pynast_aligned_seqs/seqs_rep_set_aligned.fasta -o moving_pictures_tutorial-1.8.0/illumina/otus_denovo/pynast_aligned_seqs/filtered_alignment

9. Merging the mapping files

• We started with 6 lanes of Illumina but now we have a single OTU table. The merged mapping file will have duplicated barcodes but these are not used anymore (already demultiplexed):

• merge_mapping_files.py -o combined_mapping_file.txt -m mapfile1.txt,mapfile2.txt…,mapfilexxx.txt

Hands on – merge your mapping files

merge_mapping_files.py -o moving_pictures_tutorial-1.8.0/illumina/combined_mapping_file.txt -m moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l1.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l2.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l3.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l4.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l5.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l6.txt

biom summarize-table -i moving_pictures_tutorial-1.8.0/illumina/otus_denovo/otu_table.biom -o moving_pictures_tutorial-1.8.0/illumina/otus_denovo/otu_table.summary

Visualizing diversity 1 – community composition

biom summarize-table -i otu_table.biom -o otu_table_summary.txt

Counts/Sample detail:

L3S237: 138.0

L3S235: 187.0

L3S372: 205.0

L3S373: 228.0

L3S367: 259.0

L3S370: 273.0

L3S368: 274.0

L3S369: 284.0

• Summary of OTU table: we want to standardize the number of sequences (sampling depth) to allow accurate comparison, i.e. 138 sequences, the lowest count listed above

single_rarefaction.py -i otu_table.biom -o otu_table_even138.biom -d 138

alpha_rarefaction.py -i otu_table.biom -m combined_mapping_file.txt -o rarefaction/ -t rep_set.tre

• How ‘deep’ do we need to go to adequately sample the community? This is what rarefaction analysis answers

• The number of species increases until a point where producing more sequence does not significantly increase the number of observed species

• Repeated subsampling of your data at different intervals. Plots subsamples against the number of observed species. If curves flatten, then you have sequenced at sufficient depth.

• Rarefaction is a trade-off between keeping samples that fall below a given sequence cut-off and losing diversity from the deeper samples
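The subsampling behind rarefaction can be sketched with the standard library. The first function draws a fixed number of reads without replacement from one sample (the idea behind single_rarefaction.py); the second repeats this at several depths and averages the number of observed OTUs, which gives the points of a rarefaction curve. The OTU counts are invented, and QIIME's own implementation and defaults differ in detail.

```python
import random

def rarefy(counts, depth, rng):
    """Subsample `depth` reads without replacement from {otu: count}.

    Returns None if the sample has fewer than `depth` reads, i.e. the
    sample would be dropped at this depth.
    """
    pool = [otu for otu, n in sorted(counts.items()) for _ in range(n)]
    if depth > len(pool):
        return None
    sub = {}
    for otu in rng.sample(pool, depth):
        sub[otu] = sub.get(otu, 0) + 1
    return sub

def rarefaction_curve(counts, depths, reps=10, seed=0):
    """Mean number of observed OTUs at each depth over `reps` subsamples."""
    rng = random.Random(seed)
    total = sum(counts.values())
    curve = []
    for depth in depths:
        if depth > total:
            continue  # cannot subsample deeper than the sample itself
        observed = [len(rarefy(counts, depth, rng)) for _ in range(reps)]
        curve.append((depth, sum(observed) / float(reps)))
    return curve

# Hypothetical sample with 138 reads spread unevenly over 4 OTUs.
sample = {"OTU1": 100, "OTU2": 30, "OTU3": 5, "OTU4": 3}
for depth, mean_observed in rarefaction_curve(sample, [10, 50, 100, 138]):
    print(depth, mean_observed)
```

Rare OTUs are often missed at shallow depths, so the mean rises with depth and levels off once most OTUs have been seen.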


Hands on - Rarefaction

single_rarefaction.py -i moving_pictures_tutorial-1.8.0/illumina/otus_denovo/otu_table.biom -o moving_pictures_tutorial-1.8.0/illumina/otus_denovo/otu_table_even138.biom -d 138

alpha_rarefaction.py -i moving_pictures_tutorial-1.8.0/illumina/otus_denovo/otu_table.biom -o moving_pictures_tutorial-1.8.0/illumina/otus_denovo/rarefaction/ -m moving_pictures_tutorial-1.8.0/illumina/combined_mapping_file.txt -t moving_pictures_tutorial-1.8.0/illumina/otus_denovo/rep_set.tre

Tomorrow……

Visualizing and comparing diversity

Software references:

QIIME Caporaso et al 2010. QIIME allows analysis of high-throughput community sequencing data. Nature Methods 7(5): 335-336.

UCLUST Edgar RC. 2010. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26(19):2460-2461.

BLAST Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool. J Mol Biol 215(3):403-410.

Greengenes McDonald et al 2012. An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea. ISME J 6(3): 610-618.

RDP Classifier Wang Q, Garrity GM, Tiedje JM, Cole JR. 2007. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microb 73(16): 5261-5267.

PyNAST Caporaso JG et al 2010. PyNAST: a flexible tool for aligning sequences to a template alignment. Bioinformatics 26:266-267.

ChimeraSlayer Haas BJ, Gevers D, Earl AM, Feldgarden M, Ward DV, Giannoukos G, et al. 2011. Chimeric 16S rRNA sequence formation and detection in Sanger and 454-pyrosequenced PCR amplicons. Genome Research 21:494-504.

MUSCLE Edgar RC. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32(5):1792-1797.

FastTree Price MN, Dehal PS, Arkin AP. 2010. FastTree 2-Approximately Maximum-Likelihood Trees for Large Alignments. PLoS One 5(3).

UNIFRAC Lozupone C, Knight R. 2005. UniFrac: a new phylogenetic method for comparing microbial communities. Appl Environ Microbiol 71(12): 8228-8235.

Emperor Vazquez-Baeza Y, Pirrung M, Gonzalez A, Knight R. 2013. Emperor: A tool for visualizing high-throughput microbial community data. Gigascience 2(1):16.
