65
Assembly I C. Titus Brown [email protected] Aug 7, 2013

2013 stamps-intro-assembly

Embed Size (px)

Citation preview

Page 1: 2013 stamps-intro-assembly

Assembly IC. Titus Brown

[email protected]

Aug 7, 2013

Page 2: 2013 stamps-intro-assembly

whoami?

Asst Prof, Microbiology & Computer Science, Michigan State University

Page 3: 2013 stamps-intro-assembly

Some definitions

Bioinformaticians write the software that takes your perfectly good data and produces bad clusters and assemblies.

Page 4: 2013 stamps-intro-assembly

Some definitions

Bioinformaticians write the software that takes your perfectly good data and produces bad clusters and assemblies.

Statisticians are the people who take your perfectly good clusters and tell you that your results are not statistically significant.

Page 5: 2013 stamps-intro-assembly

whoami?

Primary scientific interest:

Effective analysis and generation of high-quality biological hypotheses from

sequencing data.

Now entirely dry lab.

Page 6: 2013 stamps-intro-assembly

whoami?

Cross-cutting interests:Good (efficient, accurate, remixable)

software development, on the open source model.

Reproducibility.Open science; preprints, blogging,

Twitter.

Page 7: 2013 stamps-intro-assembly

The Plan

Th lecture: intro to metagenome assembly

Th eve lab: k-mer & assembly basics

Fri lecture/lab: assembling environmental metagenomes

Fri afternoon lab: doing metagenome assembly

Fri evening: reproduce, you fools!

Page 8: 2013 stamps-intro-assembly

Outline

1) Dealing with shotgun metagenomics

2) Short-read annotation & why/why not

3) Annotating soil reads – a story

4) Assembly stories!

Page 9: 2013 stamps-intro-assembly

Outline

1) Dealing with shotgun metagenomics

2) Short-read annotation & why/why not

3) Annotating soil reads – a story

4) Assembly stories!

Page 10: 2013 stamps-intro-assembly

Shotgun metagenomics

Collect samples;

Extract DNA;

Feed into sequencer;

Computationally analyze.

Wikipedia: Environmental shotgun sequencing.png

Page 11: 2013 stamps-intro-assembly

FASTQ etc.

@SRR606249.17/1GAGTATGTTCTCATAGAGGTTGGTANNNNT+B@BDDFFFHHHHHJIJJJJGHIJHJ####[email protected]/2CGAANNNNNNNNNNNNNNNNNCCTGGCTCA+CCCF#################22@GHIJJJ

Name

Quality score

Note: /1 and /2 => interleaved

Page 12: 2013 stamps-intro-assembly

Shotgun sequencing & assembly

Randomly fragment & sequence from DNA;

reassemble computationally.

UMD assembly primer (cbcb.umd.edu)

Page 13: 2013 stamps-intro-assembly

Dealing with shotgun metagenome reads.

1) Annotate/analyze individual reads.

2) Assemble reads into contigs, genomes, etc.

3) A middle ground (I’ll mention on Friday)

Page 14: 2013 stamps-intro-assembly

Outline

1) Dealing with shotgun metagenomics

2) Short-read annotation & why/why not

3) Annotating soil reads – a story

4) Assembly stories!

Page 15: 2013 stamps-intro-assembly

Annotating individual reads

Works really well when you have EITHER(a) evolutionarily close references(b) rather long sequences

(This is obvious, right?)

Page 16: 2013 stamps-intro-assembly

Annotating individual reads #2

We have found that this does not work well with Illumina samples from unexplored environments (e.g. soil).

Sensitivity is fine (correct match is usually there)

Specificity is bad (correct match may be drowned out by incorrect matches)

Page 17: 2013 stamps-intro-assembly

Recommendation:

For reads < 200-300 bp,

Annotate individual reads for human-associated samples, or exploration of well-studied systems.

For everything else, look to assembly.

Page 18: 2013 stamps-intro-assembly

So, why assemble?

Increase your ability to assign homology/orthology correctly!!

Essentially all functional annotation systems depend on sequence similarity to assign homology. This is why you want to assemble your data.

Page 19: 2013 stamps-intro-assembly

Why else would you want to assemble?

Assemble new “reference”.

Look for large-scale variation from reference – pathogenicity islands, etc.

Discriminate between different members of gene families.

Discover operon assemblages & annotate on co-incidence of genes.

Reduce size of data!!

Page 20: 2013 stamps-intro-assembly

Why don’t you want to assemble??

Abundance threshold – low-coverage filter.

Strain variation

Chimerism

Page 21: 2013 stamps-intro-assembly

Outline

1) Dealing with shotgun metagenomics

2) Short-read annotation & why/why not

3) Annotating soil reads – a story

4) Assembly stories!

Page 22: 2013 stamps-intro-assembly

A story: looking at land management with 454 shotgun

Tracy Teal, Vicente Gomez-Alvarado, & Tom Schmidt

Ask detailed questions of @tracykteal on Twitter, please :)

Page 23: 2013 stamps-intro-assembly
Page 24: 2013 stamps-intro-assembly
Page 25: 2013 stamps-intro-assembly
Page 26: 2013 stamps-intro-assembly
Page 27: 2013 stamps-intro-assembly
Page 28: 2013 stamps-intro-assembly
Page 29: 2013 stamps-intro-assembly
Page 30: 2013 stamps-intro-assembly
Page 31: 2013 stamps-intro-assembly

Annotating soil reads - thoughts

Possible to find well-known genes using long (454) reads.

Normalize for organism abundance!

Primer independence can be important!

Note, replicates give you error bars…

Page 32: 2013 stamps-intro-assembly

A few assembly stories

Low complexity/Osedax symbionts

High complexity/soil.

Page 33: 2013 stamps-intro-assembly

Osedax symbiontsGoffredi et al., submitted.

Page 34: 2013 stamps-intro-assembly

16s

shotgun

Page 35: 2013 stamps-intro-assembly

Metagenomic assembly followed by binning enabled isolation of fairly complete genomes

Page 36: 2013 stamps-intro-assembly

Assembly allowed genomic content comparisons to nearest cultured relative

Page 37: 2013 stamps-intro-assembly

Osedax assembly story

Conclusions include:

Osedax symbionts have genes needed for free living stage;

metabolic versatility in carbon, phosphate, and iron uptake strategies;

Genome includes mechanisms for intracellular survival, and numerous potential virulence capabilities.

Page 38: 2013 stamps-intro-assembly

Osedax assembly story

Low diversity metagenome!

Physical isolation => MDA => sequencing => diginorm => binning => 94% complete Rs166-89% complete Rs2

Note: many interesting critters are hard to isolate => so, basically, metagenomes.

Page 39: 2013 stamps-intro-assembly

Human-associated communities

“Time series community genomics analysis reveals rapid shifts in bacterial species, strains, and phage during infant gut colonization.” Sharon et al. (Banfield lab);Genome Res. 2013 23: 111-120

Page 40: 2013 stamps-intro-assembly

Setup

Collected 11 fecal samples from premature female, days 15-24.

260m 100-bp paired-end Illumina HiSeq reads; 400 and 900 base fragments.

Assembled > 96% of reads into contigs > 500bp; 8 complete genomes; reconstructed genes down to 0.05% of population abundance.

Page 41: 2013 stamps-intro-assembly

Sharon et al., 2013; pmid 22936250

Page 42: 2013 stamps-intro-assembly

Key strategy: abundance binning

Sharon et al., 2013; pmid 22936250

Page 43: 2013 stamps-intro-assembly

Sharon et al., 2013; pmid 22936250

Page 44: 2013 stamps-intro-assembly

Tracking abundance

Sharon et al., 2013; pmid 22936250

Page 45: 2013 stamps-intro-assembly

Conclusions

Recovered strain variation, phage variation, abundance variation, lateral gene transfer.

Claim that “recovered genomes are superior to draft genomes generated in most isolate genome sequencing projects.”

Page 46: 2013 stamps-intro-assembly

Environmental metagenomics: Deepwater Horizon spill

“Transcriptional response of bathypelagic marine bacterioplankton to the Deepwater Horizon oil spill.” Rivers et al., 2013, Moran lab. Pmid 23902988.

Page 47: 2013 stamps-intro-assembly

Sequencing strategy

Page 48: 2013 stamps-intro-assembly

Protocol for messenger RNA extraction, assembly, downstream analysis, including differential expression.

Note: used overlapping paired-end reads.

Page 49: 2013 stamps-intro-assembly

Great Prairie Grand Challenge - soil

Together with Janet Jansson, Jim Tiedje, et al.

“Can we make sense of soil with deep Illumina sequencing?”

Page 50: 2013 stamps-intro-assembly

Great Prairie Grand Challenge - soil

What ecosystem level functions are present, and how do microbes do them?

How does agricultural soil differ from native soil?

How does soil respond to climate perturbation?

Questions that are not easy to answer without shotgun sequencing:What kind of strain-level heterogeneity is present in

the population?What does the phage and viral population look like?What species are where?

Page 51: 2013 stamps-intro-assembly

A “Grand Challenge” dataset (DOE/JGI)

Page 52: 2013 stamps-intro-assembly

Approach 1: Digital normalization(a computational version of library normalization)

Suppose you have a dilution factor of A (10) to B(1). To get 10x of B you need to get 100x of A! Overkill!!

This 100x will consume disk

space and, because of

errors, memory.

We can discard it for you…

Page 53: 2013 stamps-intro-assembly

Approach 2: Data partitioning(a computational version of cell sorting)

Split reads into “bins” belonging to different source species.

Can do this based almost entirely on connectivity of sequences.

“Divide and conquer”

Memory-efficient implementation helps to scale assembly.

Page 54: 2013 stamps-intro-assembly

Putting it in perspective:Total equivalent of ~1200 bacterial genomesHuman genome ~3 billion bp

Assembly results for Iowa corn and prairie(2x ~300 Gbp soil metagenomes)

Total Assembly

Total Contigs(> 300 bp)

% Reads Assemble

d

Predicted protein coding

2.5 bill 4.5 mill 19% 5.3 mill

3.5 bill 5.9 mill 22% 6.8 mill

Adina Howe

Page 55: 2013 stamps-intro-assembly

Resulting contigs are low coverage.

Page 56: 2013 stamps-intro-assembly

Corn Prairie

Iowa prairie & corn - very even.

Page 57: 2013 stamps-intro-assembly

Taxonomy– Iowa prairie

Note: this is predicted taxonomy of contigs w/o considering abundance or length. (MG-RAST)

Page 58: 2013 stamps-intro-assembly

Taxonomy – Iowa corn

Note: this is predicted taxonomy of contigs w/o considering abundance or length. (MG-RAST)

Page 59: 2013 stamps-intro-assembly

Strain variation?To

p t

wo a

llele

fr

eq

uenci

es

Position within contig

Of 5000 most abundantcontigs, only 1 has apolymorphism rate > 5%

Can measure by read mapping.

Page 60: 2013 stamps-intro-assembly

Concluding thoughts on assembly (I)

There’s no standard approach yet; almost every paper uses a specialized pipeline of some sort.More like genome assemblyBut unlike transcriptomics…

Page 61: 2013 stamps-intro-assembly

Concluding thoughts on assembly (II)

Anecdotally, everyone worries about strain variation.Some groups (e.g. Banfield, us) have found

that this is not a problem in their system so far.

Others (viral metagenomes! HMP!) have found this to be a big concern.

Page 62: 2013 stamps-intro-assembly

Concluding thoughts on assembly (III)

Some groups have found metagenome assembly to be very useful.

Others (us! soil!) have not yet proven its utility.

Page 63: 2013 stamps-intro-assembly

Questions that will be addressed tomorrow morning.

How much sequencing should I do?

How do I evaluate metagenome assemblies?

Which assembler is best?

Page 64: 2013 stamps-intro-assembly

High coverage is critical.

Low coverage is the dominant problem blocking assembly of

metagenomes

Page 65: 2013 stamps-intro-assembly

Materials

In addition to materials for this class, see:

http://software-carpentry.org/v4/

http://ged.msu.edu/angus/