QIIME: Quantitative Insights Into Microbial Ecology (part 1)

Thomas Jeffries, Federico M. Lauro, Grazia Marina Quero, Tiziano Minuzzo

The Omics Analysis Sydney Tutorial

Australian Museum, 23rd-24th February 2015



Page 2: Toast 2015 qiime_talk

QIIME

• Open source software package for taxonomic analysis of 16S rRNA sequences

• Developed at the University of Colorado Boulder and Northern Arizona University

• www.qiime.org (great resource…..)

• Good community support

• Can google most problems

• Multi-platform

• Widely used

Caporaso

Knight

Page 3: Toast 2015 qiime_talk

Getting QIIME

Linux: https://github.com/qiime/qiime-deploy

Mac: http://www.wernerlab.org/software/macqiime

Ubuntu virtualbox: http://qiime.org/install/virtual_box.html

Linux remote machine, e.g. UTS FEIT cluster or NECTAR: http://nectar.org.au/research-cloud

http://qiime.org/install/install.html

Page 4: Toast 2015 qiime_talk

Data formats

• 454:

DNA sequences (FASTA, .fna)

Quality (.qual)

Mapping file (.txt)

• Illumina

Sequences and quality in same file (.fastq)

Also supports paired end

Page 5: Toast 2015 qiime_talk

Getting into QIIME

• Command line interface

• Some very basic commands needed for QIIME:

ls : lists files in working directory

cd : changes directory

cd .. : goes back to parent directory

‘tab’ key : magically fills out file names

mkdir : makes a directory

pwd : tells you where you are

• Example of a QIIME command:

/folder$ programme.py -i file_in -o file_out

Page 6: Toast 2015 qiime_talk

QIIME tutorial and example data

• Many tutorials @ http://qiime.org/tutorials/index.html

• Good place to start: http://qiime.org/tutorials/tutorial.html

• Great Microbial Ecology course (includes QIIME): http://edamame-course.org/

• A few of the commands have changed in the new version of QIIME. This talk uses the current commands, and I have renamed the files to make them easier to follow

Page 7: Toast 2015 qiime_talk

Some useful terminology

Page 8: Toast 2015 qiime_talk

αDiversity

Alpha diversity is the diversity within ONE sample

Page 9: Toast 2015 qiime_talk

αDiversity

Page 10: Toast 2015 qiime_talk

αDiversity: Richness

Page 11: Toast 2015 qiime_talk

αDiversity: Evenness

Page 12: Toast 2015 qiime_talk

αDiversity: Evenness

Common metric: Pielou’s evenness
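As a rough illustration of the metric, Pielou's evenness J' is the Shannon diversity H' divided by its maximum possible value ln(S), where S is the number of OTUs observed in the sample. The sketch below is plain Python rather than QIIME's own code, and the two count vectors are made-up examples.

```python
import math

def shannon(counts):
    """Shannon diversity H' (natural log) from a list of OTU counts."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in props)

def pielou_evenness(counts):
    """Pielou's J' = H' / ln(S), with S the number of observed OTUs."""
    richness = sum(1 for c in counts if c > 0)
    if richness < 2:
        # Evenness is undefined for a single OTU; returning 0.0 is a
        # convention chosen for this sketch only.
        return 0.0
    return shannon(counts) / math.log(richness)

even_sample = [25, 25, 25, 25]    # hypothetical: four equally abundant OTUs
uneven_sample = [97, 1, 1, 1]     # hypothetical: one dominant OTU

print(pielou_evenness(even_sample))    # close to 1: perfectly even
print(pielou_evenness(uneven_sample))  # much lower: dominated by one OTU
```

Both samples have the same richness (4 OTUs), so the difference in J' comes from evenness alone.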

Page 13: Toast 2015 qiime_talk

Tutorial dataset

Page 14: Toast 2015 qiime_talk

Tutorial dataset

Page 15: Toast 2015 qiime_talk

1. Check mapping file format

• Checks that format of mapping file is ok

validate_mapping_file.py -m my_mapping_file.txt -o validate_mapping_file_output

“No errors or warnings were found in mapping file”

Page 16: Toast 2015 qiime_talk

1. Check mapping file

Name (ID) of sample

Primer

Sequencing barcode

Sample categories (treatments)

Tab separated !!!
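To make the layout concrete, here is a minimal sketch of the kind of checks involved. It is not the logic of validate_mapping_file.py, which does much more. The header names (#SampleID, BarcodeSequence, LinkerPrimerSequence, Description) follow the usual QIIME 1 mapping file layout; the sample rows and the Treatment column are invented for illustration.

```python
import csv
import io

REQUIRED_FIRST = ["#SampleID", "BarcodeSequence", "LinkerPrimerSequence"]

def check_mapping(text):
    """Return a list of problems found in a tab-separated mapping file."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows or not rows[0]:
        return ["header line is missing"]
    header = rows[0]
    problems = []
    if header[:3] != REQUIRED_FIRST:
        problems.append("first three columns should be " + ", ".join(REQUIRED_FIRST))
    if header[-1] != "Description":
        problems.append("last column should be Description")
    seen_ids = set()
    for line_no, row in enumerate(rows[1:], start=2):
        if not row:
            continue  # ignore blank lines
        if len(row) != len(header):
            problems.append(f"line {line_no}: expected {len(header)} fields, found {len(row)}")
            continue
        if row[0] in seen_ids:
            problems.append(f"line {line_no}: duplicate sample ID {row[0]}")
        seen_ids.add(row[0])
    return problems

# Hypothetical two-sample mapping file (values are made up).
good = (
    "#SampleID\tBarcodeSequence\tLinkerPrimerSequence\tTreatment\tDescription\n"
    "S1\tAGCACGAGCCTA\tGTGCCAGCMGCCGCGGTAA\tControl\tfirst_sample\n"
    "S2\tAACTCGTCGATG\tGTGCCAGCMGCCGCGGTAA\tFast\tsecond_sample\n"
)

print(check_mapping(good))                     # no problems expected
print(check_mapping(good.replace("\t", " ")))  # spaces instead of tabs are flagged
```

The second call shows why the tab separation matters: with spaces, the whole header collapses into a single field.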

Page 17: Toast 2015 qiime_talk

Hands on – validate your mapping file

validate_mapping_file.py -o moving_pictures_tutorial-1.8.0/illumina/cid_l1/ -m moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l1.txt

Page 18: Toast 2015 qiime_talk

2. De-multiplex - 454

• Using sample-specific barcodes, assign each sequence to a sample (sequences are renamed accordingly)

• Performs some QC:

Removes sequences shorter than 200 bp

Removes sequences with a mean quality score < 25

Removes sequences with > 6 ambiguous bases or a homopolymer run longer than 6 bases

split_libraries.py -m my_mapping_file.txt -f my_sequence_file.fna -q my_quality_file.qual -o split_library_output

• Produces seqs.fna
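The QC rules listed above can be sketched in a few lines of Python. This is only an illustration of the thresholds on the slide, not the implementation in split_libraries.py (which applies further checks); the function names and test reads are hypothetical.

```python
def longest_homopolymer(seq):
    """Length of the longest run of one repeated base in seq."""
    longest = run = 1 if seq else 0
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest

def passes_qc(seq, quals, min_len=200, min_mean_qual=25,
              max_ambig=6, max_homopolymer=6):
    """Apply the four filters from the slide to one read."""
    seq = seq.upper()
    if not seq or len(seq) < min_len:
        return False                       # too short
    if sum(quals) / len(quals) < min_mean_qual:
        return False                       # mean quality too low
    ambiguous = sum(1 for base in seq if base not in "ACGT")
    if ambiguous > max_ambig:
        return False                       # too many ambiguous bases (e.g. N)
    return longest_homopolymer(seq) <= max_homopolymer

good_read = "ACGT" * 60                    # 240 bp, no long runs
print(passes_qc(good_read, [30] * 240))            # kept
print(passes_qc("ACGT" * 10, [30] * 40))           # rejected: shorter than 200 bp
print(passes_qc(good_read, [20] * 240))            # rejected: mean quality < 25
print(passes_qc("A" * 7 + good_read, [30] * 247))  # rejected: homopolymer run > 6
```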

Page 19: Toast 2015 qiime_talk

2. De-multiplex - Illumina (Step 1)

• If the samples contain paired-end reads, you first need to join them and update the barcodes using:

join_paired_ends.py -f my_forw_reads.fastq -r my_rev_reads.fastq -b my_barcodes.fastq -o my_joined.fastq

Page 20: Toast 2015 qiime_talk

2. De-multiplex - Illumina (Step 2)

Then you can proceed to the split libraries step. If the sequences are NOT paired-end, go directly to split_libraries_fastq.py. This step also performs QC on the Illumina reads:

split_libraries_fastq.py -m my_mapping_file.txt -i my_sequence_file.fastq -b my_barcodes.fastq -o split_library_output

• Data from multiple lanes can be processed together by separating inputs with a comma (,)

• Produces seqs.fna
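Conceptually, demultiplexing looks up each read's barcode in the mapping file and renames the read after its sample. The sketch below shows that idea only; the SampleID_N naming and the numbering starting at 0 are assumptions made for the example, and the barcodes and reads are made up.

```python
def demultiplex(reads, barcode_to_sample):
    """Assign reads to samples by barcode.

    reads: iterable of (barcode, sequence) pairs.
    barcode_to_sample: dict built from the mapping file.
    Returns a list of (new_id, sequence); reads whose barcode is not in
    the mapping file are dropped.
    """
    counters = {}
    labelled = []
    for barcode, seq in reads:
        sample = barcode_to_sample.get(barcode)
        if sample is None:
            continue  # unknown barcode
        n = counters.get(sample, 0)
        counters[sample] = n + 1
        labelled.append((f"{sample}_{n}", seq))
    return labelled

# Hypothetical barcodes and reads.
barcodes = {"AAAA": "S1", "CCCC": "S2"}
reads = [("AAAA", "ACGT"), ("CCCC", "GGGG"), ("AAAA", "TTTT"), ("GGGG", "CCCC")]

for new_id, seq in demultiplex(reads, barcodes):
    print(f">{new_id}\n{seq}")
```

The output is FASTA-like, which is the role seqs.fna plays in the later steps.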

Page 21: Toast 2015 qiime_talk

Hands on: split your libraries

split_libraries_fastq.py -o moving_pictures_tutorial-1.8.0/illumina/slout/ -i moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_1_sequence.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_2_sequence.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_3_sequence.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_4_sequence.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_5_sequence.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_6_sequence.fastq -b moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_1_sequence_barcodes.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_2_sequence_barcodes.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_3_sequence_barcodes.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_4_sequence_barcodes.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_5_sequence_barcodes.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_6_sequence_barcodes.fastq -m moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l1.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l2.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l3.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l4.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l5.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l6.txt

count_seqs.py -i moving_pictures_tutorial-1.8.0/illumina/slout/seqs.fna

Page 22: Toast 2015 qiime_talk

3. OTU picking strategies

• De novo OTU picking: clustering of sequences at 97% similarity

Reads must overlap (same amplicon region)

No reference database necessary

Computationally expensive

• Closed-reference

Reads do not need to overlap

Needs a reference database

Discards sequences with no match in the reference, so erroneous reads are dropped, but so are genuinely novel sequences

• Open-reference

Reads must overlap

Reads are clustered against the reference first; non-matching reads are then clustered de novo
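The relationship between the three strategies can be sketched as follows: open-reference picking is closed-reference picking followed by de novo clustering of whatever did not match. This is a toy model only. The position-by-position `identity` function is a stand-in for real alignment-based similarity, and all names and sequences are hypothetical.

```python
def identity(a, b):
    """Toy similarity: fraction of matching positions (not a real alignment)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x == y) / longest

def closed_reference(reads, reference, threshold=0.97):
    """Assign reads to their best reference hit; return (otus, unmatched)."""
    otus, unmatched = {}, []
    for read_id, seq in reads:
        best_id, best_score = None, 0.0
        for ref_id, ref_seq in reference.items():
            score = identity(seq, ref_seq)
            if score > best_score:
                best_id, best_score = ref_id, score
        if best_id is not None and best_score >= threshold:
            otus.setdefault(best_id, []).append(read_id)
        else:
            unmatched.append((read_id, seq))  # closed-reference discards these
    return otus, unmatched

def de_novo(reads, threshold=0.97):
    """Greedy clustering with no reference: first unmatched read seeds a new OTU."""
    centroids, otus = [], {}
    for read_id, seq in reads:
        for otu_id, centroid in centroids:
            if identity(seq, centroid) >= threshold:
                otus[otu_id].append(read_id)
                break
        else:
            otu_id = f"denovo{len(centroids)}"
            centroids.append((otu_id, seq))
            otus[otu_id] = [read_id]
    return otus

def open_reference(reads, reference, threshold=0.97):
    """Closed-reference first, then de novo clustering of the leftovers."""
    otus, unmatched = closed_reference(reads, reference, threshold)
    otus.update(de_novo(unmatched, threshold))
    return otus

reference = {"ref1": "A" * 100}
reads = [
    ("r1", "A" * 99 + "C"),    # 99% identical to ref1
    ("r2", "G" * 100),         # no reference match
    ("r3", "G" * 98 + "TT"),   # 98% identical to r2
]
print(closed_reference(reads, reference)[0])
print(open_reference(reads, reference))
```

With these reads, closed-reference keeps only r1, while open-reference also places r2 and r3 together in a new de novo OTU.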

Page 23: Toast 2015 qiime_talk

Hands on – picking O.T.U.s

pick_open_reference_otus.py -o moving_pictures_tutorial-1.8.0/illumina/otus/ -i moving_pictures_tutorial-1.8.0/illumina/slout/seqs.fna -r gg_13_8_otus/rep_set/97_otus.fasta -p moving_pictures_tutorial-1.8.0/uc_fast_params.txt

pick_de_novo_otus.py -o moving_pictures_tutorial-1.8.0/illumina/otus_denovo/ -i moving_pictures_tutorial-1.8.0/illumina/slout/seqs.fna

Page 24: Toast 2015 qiime_talk

3. Pick OTUs

Note: the following steps can be automated with a single workflow script (this is what we are doing in the hands-on):

pick_de_novo_otus.py -i seqs.fna -o otus

pick_otus.py -i seqs.fna -o picked_otus_default

• Clusters your sequences at 97% similarity (you can change this threshold) and produces ‘seqs_otus.txt’, which maps each sequence to a cluster

• Uses the UCLUST algorithm (Edgar, 2010, Bioinformatics)

Page 25: Toast 2015 qiime_talk

3. Pick OTUs

Generate OTUs by clustering reads based on similarity (default is 97%)

Sort reads according to size (long -> short)

Cluster

OTU1

OTU2

OTU3

OTU4

OTU5
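The sort-then-cluster idea on this slide can be sketched as a greedy loop: reads are visited from longest to shortest, and each read either joins the first existing cluster whose seed it matches at the threshold or starts a new cluster. This is a toy illustration and not UCLUST itself; the `identity` function is a simplistic position-by-position comparison and the sequences are made up.

```python
def identity(a, b):
    """Toy similarity: fraction of matching positions (not a real alignment)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x == y) / longest

def greedy_cluster(seqs, threshold=0.97):
    """Cluster sequences greedily.

    seqs: dict of sequence id -> sequence.
    Returns a list of clusters (lists of ids); the first id in each
    cluster is the seed.
    """
    # Sort long -> short so that the longest reads become seeds.
    ordered = sorted(seqs.items(), key=lambda item: len(item[1]), reverse=True)
    clusters = []  # list of (seed_sequence, member_ids)
    for seq_id, seq in ordered:
        for seed, members in clusters:
            if identity(seq, seed) >= threshold:
                members.append(seq_id)
                break
        else:
            clusters.append((seq, [seq_id]))
    return [members for _, members in clusters]

seqs = {
    "a": "ACGT" * 25,            # 100 bp
    "b": "ACGT" * 24 + "ACGA",   # 1 mismatch to "a" (99% identity)
    "c": "T" * 90,               # unrelated, shorter read
}
print(greedy_cluster(seqs))
```

Each resulting list plays the role of one OTU (OTU1, OTU2, ...) in the diagram.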

Page 26: Toast 2015 qiime_talk

4. Pick representative sequences

• We want one representative sequence for each OTU: annotating every sequence would be time consuming, and the sequences are already clustered

• This makes a file with 1 sequence for each OTU (rep_set1.fna). By default pick_rep_set.py takes the first sequence listed for each OTU; pass -m most_abundant to take the most abundant sequence instead

pick_rep_set.py -i seqs_otus.txt -f seqs.fna -o rep_set1.fna
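As an illustration of choosing the most abundant sequence in each OTU, the sketch below counts identical sequences within each cluster and keeps the commonest one. It assumes an in-memory OTU map and sequence dictionary with made-up contents, and it is not the code behind pick_rep_set.py.

```python
from collections import Counter

def pick_rep_set(otu_map, seqs):
    """Pick the most abundant sequence in each OTU.

    otu_map: OTU id -> list of sequence ids (as in seqs_otus.txt).
    seqs: sequence id -> sequence (as in seqs.fna).
    """
    rep_set = {}
    for otu_id, members in otu_map.items():
        counts = Counter(seqs[member] for member in members)
        rep_set[otu_id] = counts.most_common(1)[0][0]
    return rep_set

# Hypothetical clusters and reads.
otu_map = {"0": ["S1_0", "S1_1", "S2_0"], "1": ["S2_1"]}
seqs = {"S1_0": "ACGT", "S1_1": "ACGA", "S2_0": "ACGT", "S2_1": "GGGG"}

for otu_id, seq in pick_rep_set(otu_map, seqs).items():
    print(f">{otu_id}\n{seq}")
```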

Page 27: Toast 2015 qiime_talk

5. Annotate (assign taxonomy to each OTU)

• Compare each representative sequence to a database using one of several algorithms:

• UCLUST, BLAST, RDP Classifier and others

• New default: UCLUST against the Greengenes database

assign_taxonomy.py -i rep_set1.fna

(output in directory: uclust_assigned_taxonomy)

• BLAST example (reference sequences and taxonomy downloaded from database):

assign_taxonomy.py -i rep_set1.fna -r ref_seq_set.fna -t id_to_taxonomy.txt -m blast

Page 28: Toast 2015 qiime_talk

5. Annotate

• Some useful databases that are compatible with QIIME:

http://greengenes.secondgenome.com

Good for everything and default in QIIME

http://unite.ut.ee

Fungal Internal Transcribed Spacer (ITS)

Good for soil fungi

http://www.arb-silva.de

Contains both 16S and 18S rRNA (Eukaryotes…)

Good representation of marine taxa

Page 29: Toast 2015 qiime_talk

Recap

Page 30: Toast 2015 qiime_talk

[Figure: mixed amplicons from Species A, B and C in Samples 1-3 → split library into samples using barcodes → used clustering to choose OTUs (OTU 1, OTU 2, OTU 3) → picked representative sequences and assigned taxonomy against a reference database]

Page 31: Toast 2015 qiime_talk

6. Putting it all together: making an OTU table

• Need to combine the OTU identity with the abundance information in the clusters and link back to each sample so we can do ECOLOGY

• The table is in .biom format:

• http://biom-format.org/documentation/biom_format.html

• Make the table:

make_otu_table.py -i seqs_otus.txt -t rep_set1_tax_assignments.txt -o otu_table.biom

• Convert to text file:

biom convert -i otu_table.biom -o otu_table.txt --table-type "otu table" --header-key taxonomy -b
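What the OTU table contains can be sketched from the two inputs: the OTU map tells us which sequences belong to each OTU, and the sequence IDs carry the sample name. The sketch assumes sequence IDs of the form SampleID_N (as produced by demultiplexing) and a tab-separated OTU map. It builds a plain text-style table, not a .biom file, and the example data are invented.

```python
from collections import defaultdict

def make_otu_table(otu_map_text, taxonomy):
    """Build an OTU x sample count table.

    otu_map_text: lines of 'OTU_id<TAB>seq_id<TAB>seq_id...'.
    taxonomy: OTU id -> taxonomy string.
    Returns (header, rows).
    """
    table = defaultdict(lambda: defaultdict(int))
    for line in otu_map_text.strip().splitlines():
        otu_id, *seq_ids = line.split("\t")
        for seq_id in seq_ids:
            # 'S1_12' -> sample 'S1'; rsplit keeps underscores inside sample names.
            sample = seq_id.rsplit("_", 1)[0]
            table[otu_id][sample] += 1
    samples = sorted({s for counts in table.values() for s in counts})
    rows = [
        [otu_id] + [counts.get(s, 0) for s in samples]
        + [taxonomy.get(otu_id, "Unassigned")]
        for otu_id, counts in table.items()
    ]
    return ["#OTU ID"] + samples + ["taxonomy"], rows

# Hypothetical OTU map and taxonomy assignments.
otu_map_text = "0\tS1_0\tS1_1\tS2_0\n1\tS2_1\n"
taxonomy = {"0": "k__Bacteria; p__Firmicutes"}

header, rows = make_otu_table(otu_map_text, taxonomy)
print("\t".join(header))
for row in rows:
    print("\t".join(str(value) for value in row))
```

Each row is an OTU, each column a sample, and the last column links the counts back to the taxonomy.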

Page 32: Toast 2015 qiime_talk

Closed reference O.T.U. picking

pick_closed_reference_otus.py -i seqs.fna -r reference.fna -o otus_w_tax/ -t taxa_map.txt

• The reference is a database, e.g. the Greengenes unaligned 97% OTUs and the matching taxa map (same files as for BLAST)

• Output has all of your sequences matched to Greengenes, plus an OTU table

• So this picks OTUs and assigns taxonomy in 1 step, but you lose non-matching sequences. Do we care? For taxa summaries no, for beta-diversity maybe.

•Quick – good for illumina

Page 33: Toast 2015 qiime_talk

7. Aligning sequences

• Back to our representative sequences….

• How closely related are the organisms present in the samples? I.e. what is the phylogeny of our community and how does it shift between samples?

• Default: PyNAST aligns the sequences to a reference set of pre-aligned sequences (e.g. Greengenes ALIGNED), which is more computationally efficient than de novo alignment

• Can also select other methods, e.g. MUSCLE

align_seqs.py -i rep_set1.fna -o pynast_aligned/

Page 34: Toast 2015 qiime_talk

7. Aligning sequences

• Not all regions of the rRNA gene are informative or useful for phylogenetic inference

• Gaps – short length sequence vs full length rRNA gene

• filter_alignment.py -i rep_set1_aligned.fasta -o filtered_alignment/

• Optional lanemask template that defines informative regions for some databases

• filter_alignment.py -i seqs_rep_set_aligned.fasta -m lanemask_in_1s_and_0s -o filtered_alignment/

• If you are going to use this alignment for making a phylogenetic tree this step is essential…..
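The filtering step can be pictured as dropping alignment columns: columns that are gaps in (almost) every sequence, and, if a lanemask is supplied, columns the mask marks with 0. The sketch below illustrates this idea only; the gap threshold value is an assumption for the example rather than a statement about filter_alignment.py, and the tiny alignment is made up.

```python
def filter_alignment(aligned, lanemask=None, max_gap_fraction=0.999999):
    """Remove uninformative columns from an alignment.

    aligned: dict of id -> aligned sequence (all the same length).
    lanemask: optional string of '1' (keep) and '0' (drop), one per column.
    max_gap_fraction: columns with a larger gap fraction are dropped
    (assumed value, chosen for this sketch).
    """
    sequences = list(aligned.values())
    if not sequences:
        return {}
    keep = []
    for col in range(len(sequences[0])):
        if lanemask is not None and lanemask[col] == "0":
            continue  # masked out as uninformative
        gaps = sum(1 for seq in sequences if seq[col] in "-.")
        if gaps / len(sequences) > max_gap_fraction:
            continue  # gap in essentially every sequence
        keep.append(col)
    return {name: "".join(seq[c] for c in keep) for name, seq in aligned.items()}

# Hypothetical 5-column alignment; column 3 is all gaps.
aligned = {"a": "AC-G-", "b": "AC-GT", "c": "A--G-"}

print(filter_alignment(aligned))
print(filter_alignment(aligned, lanemask="11110"))
```

Short reads aligned against a full-length rRNA template produce many all-gap columns, which is why this step matters before tree building.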

Page 35: Toast 2015 qiime_talk

A note on chimera removal

• Chimeras: sequences formed from the DNA of 2 or more organisms (an artifact of PCR amplification)

• QIIME uses ChimeraSlayer to detect chimeric sequences using your alignment and a reference database

• You should then remove these OTUs from your OTU table and alignment before proceeding with tree building and visualization of results:

• Pass -e chimeric_seqs.txt to make_otu_table.py when making the OTU table, and use filter_fasta.py for the alignment

identify_chimeric_seqs.py -m ChimeraSlayer -i rep_set_aligned.fasta -a reference_set1_aligned.fasta -o chimeric_seqs.txt

Page 36: Toast 2015 qiime_talk

8. Make a phylogenetic tree

make_phylogeny.py -i rep_set1_aligned_pfiltered.fasta -o rep_phylo.tre

• Builds a tree from the alignment using FastTree

• Outputs a tree in newick format (.tre) which can be opened with software such as FigTree or can be used to calculate phylogenetic metrics

• Also filter Chimeras from tree

Page 37: Toast 2015 qiime_talk

We now have 2 final outputs:

• OTU Table

1. Taxonomic composition

2. α-diversity (e.g. ‘species’ richness)

3. β-diversity (e.g. abundance similarity between samples)

• Phylogenetic tree

1. Phylogenetic β-diversity

QIIME has powerful visualization and statistical tools

Page 38: Toast 2015 qiime_talk

Hands on – reformatting outputs

We have automated (piped) most of the steps I have talked about. We need to convert the OTU table to a text file and filter the alignment:

biom convert -i moving_pictures_tutorial-1.8.0/illumina/otus_denovo/otu_table.biom -o moving_pictures_tutorial-1.8.0/illumina/otus_denovo/otu_table.txt --table-type "otu table" --header-key taxonomy -b

filter_alignment.py -i moving_pictures_tutorial-1.8.0/illumina/otus_denovo/pynast_aligned_seqs/seqs_rep_set_aligned.fasta -o moving_pictures_tutorial-1.8.0/illumina/otus_denovo/pynast_aligned_seqs/filtered_alignment

Page 39: Toast 2015 qiime_talk

9. Merging the mapping files

• We started with 6 lanes of Illumina but now we have a single OTU table. The merged mapping file will have duplicated barcodes but these are not used anymore (already demultiplexed):

• merge_mapping_files.py -o combined_mapping_file.txt -m mapfile1.txt,mapfile2.txt…,mapfilexxx.txt

Page 40: Toast 2015 qiime_talk

Hands on – merge your mapping files

merge_mapping_files.py -o moving_pictures_tutorial-1.8.0/illumina/combined_mapping_file.txt -m moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l1.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l2.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l3.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l4.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l5.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l6.txt

biom summarize-table -i moving_pictures_tutorial-1.8.0/illumina/otus_denovo/otu_table.biom -o moving_pictures_tutorial-1.8.0/illumina/otus_denovo/otu_table.summary

Page 41: Toast 2015 qiime_talk

Visualizing diversity 1 – community composition

biom summarize-table -i otu_table.biom -o otu_table_summary.txt

Counts/Sample detail:

L3S237: 138.0
L3S235: 187.0
L3S372: 205.0
L3S373: 228.0
L3S367: 259.0
L3S370: 273.0
L3S368: 274.0
L3S369: 284.0

• Summary of OTU table: we want to standardize the number of sequences (sampling depth) to allow accurate comparison, i.e. 138 sequences, the smallest sample shown above

single_rarefaction.py -i otu_table.biom -o otu_table_even138.biom -d 138

alpha_rarefaction.py -i otu_table.biom -m combined_mapping_file.txt -o rarefaction/ -t rep_set.tre
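Rarefying to an even depth means randomly drawing the same number of sequences from every sample, without replacement, and dropping samples that have fewer sequences than the chosen depth. The sketch below shows the idea for a single sample. The OTU composition is invented (only the total of 187 echoes the summary above), and the code is not taken from single_rarefaction.py.

```python
import random

def rarefy(counts, depth, seed=None):
    """Randomly subsample one sample to `depth` sequences.

    counts: OTU id -> number of sequences in this sample.
    Returns a new OTU -> count dict, or None if the sample is too small.
    """
    if sum(counts.values()) < depth:
        return None  # sample is dropped from the rarefied table
    # One entry per sequence, labelled with its OTU.
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    picked = random.Random(seed).sample(pool, depth)  # without replacement
    rarefied = {}
    for otu in picked:
        rarefied[otu] = rarefied.get(otu, 0) + 1
    return rarefied

# Hypothetical sample with 187 sequences in three OTUs.
sample = {"otu1": 100, "otu2": 80, "otu3": 7}

even = rarefy(sample, depth=138, seed=1)
print(sum(even.values()))                 # 138
print(rarefy({"otu1": 100}, depth=138))   # None: fewer than 138 sequences
```

Rare OTUs such as otu3 may shrink or disappear in the subsample, which is the cost of standardizing depth.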

Page 42: Toast 2015 qiime_talk

Visualizing diversity 1 – community composition

• How ‘deep’ do we need to go to adequately sample the community? = Rarefaction analysis

• The number of species increases up to a point where producing more sequence does not significantly increase the number of observed species

• Repeated subsampling of your data at different depths, plotting each subsampling depth against the number of observed species. If the curves flatten, you have sequenced at sufficient depth.

• Choosing a rarefaction depth is a trade-off: a low sequence cut-off keeps more samples but loses diversity
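A rarefaction curve can be sketched by subsampling one sample repeatedly at increasing depths and averaging the number of OTUs observed at each depth. This is a minimal illustration with made-up counts, not the procedure implemented by alpha_rarefaction.py; the step size and number of repeats are arbitrary choices.

```python
import random

def rarefaction_curve(counts, step, repeats=10, seed=0):
    """Mean number of observed OTUs at increasing subsampling depths.

    counts: OTU id -> number of sequences in one sample.
    Returns a list of (depth, mean_observed_otus).
    """
    rng = random.Random(seed)
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    curve = []
    for depth in range(step, len(pool) + 1, step):
        observed = [len(set(rng.sample(pool, depth))) for _ in range(repeats)]
        curve.append((depth, sum(observed) / repeats))
    return curve

# Hypothetical sample: 100 sequences, 5 OTUs, two of them rare.
sample = {"a": 50, "b": 30, "c": 15, "d": 4, "e": 1}

for depth, observed in rarefaction_curve(sample, step=20):
    print(depth, observed)
```

If the mean number of observed OTUs stops rising as depth increases, the curve has flattened and deeper sequencing would add little.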

Page 43: Toast 2015 qiime_talk

Hands on - Rarefaction

single_rarefaction.py -i moving_pictures_tutorial-1.8.0/illumina/otus_denovo/otu_table.biom -o moving_pictures_tutorial-1.8.0/illumina/otus_denovo/otu_table_even138.biom -d 138

alpha_rarefaction.py -i moving_pictures_tutorial-1.8.0/illumina/otus_denovo/otu_table.biom -o moving_pictures_tutorial-1.8.0/illumina/otus_denovo/rarefaction/ -m moving_pictures_tutorial-1.8.0/illumina/combined_mapping_file.txt -t moving_pictures_tutorial-1.8.0/illumina/otus_denovo/rep_set.tre

Page 44: Toast 2015 qiime_talk

Tomorrow……

Visualizing and comparing diversity

Page 45: Toast 2015 qiime_talk

Software references: QIIME Caporaso et al 2010. QIIME allows analysis of high-throughput community sequencing data. Nature Methods 7(5): 335-336.

UCLUST Edgar RC. 2010. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26(19):2460-2461.

BLAST Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool. J Mol Biol 215(3):403-410.

Greengenes McDonald et al 2012. An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea. ISME J 6(3): 610–618.

RDP Classifier Wang Q, Garrity GM, Tiedje JM, Cole JR. 2007. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microb 73(16): 5261-5267.

PyNAST Caporaso JG et al 2010. PyNAST: a flexible tool for aligning sequences to a template alignment. Bioinformatics 26:266-267.

ChimeraSlayer Haas BJ, Gevers D, Earl AM, Feldgarden M, Ward DV, Giannoukos G, et al. 2011. Chimeric 16S rRNA sequence formation and detection in Sanger and 454-pyrosequenced PCR amplicons. Genome Research 21:494-504.

MUSCLE Edgar RC. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32(5):1792-1797.

FastTree Price MN, Dehal PS, Arkin AP. 2010. FastTree 2 - Approximately Maximum-Likelihood Trees for Large Alignments. PLoS One 5(3):e9490.

UniFrac Lozupone C, Knight R. 2005. UniFrac: a new phylogenetic method for comparing microbial communities. Appl Environ Microbiol 71(12): 8228-8235.

Emperor Vazquez-Baeza Y, Pirrung M, Gonzalez A, Knight R. 2013. Emperor: A tool for visualizing high-throughput microbial community data. Gigascience 2(1):16.