
Genome Characterization

Assembly/sequencing

BIO520 Bioinformatics Jim Lund

Assigned reading: Ch 9

The (original) genome sequencing process

1. Organism Selection
2. Library Creation
3. Sequencing
4. Assembly
5. Gap Closure
6. Finishing
7. Annotation

The (current) genome sequencing process

1. Organism Selection
2. Sequencing
3. Assembly
4. Annotation

• Next-generation random sequencing lets library generation be skipped.
• Gap closure and finishing are often skipped, at least for now.

Contigs, Islands

[Figure: diagram labeling contigs and an island]

Assembly pipeline

1. Sequence reads.
2. Phred: base calling.
3. cross_match: screen out vector, E. coli sequence.
4. Phrap: assemble contigs.
5. Consed: view assembly, correct problems.
6. Finishing.

Assembly Methods

• Strip out vector (or contaminant)
• Mask known repeats
• Trim off unreliable data
• Find matches (n seq x n seq comparisons)
  – how long (what k-tuple; 10 is common)
  – how perfect (reliability index)
  – where to look? (ends only vs. entire)
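The "find matches" step can be sketched as a k-tuple lookup: index every k-tuple in every read, then report pairs of reads that share at least one. This is a minimal illustration, not how phrap or any other assembler actually does it. The function names and the toy reads are made up for the example, and reverse-complement matches and match quality are ignored.

```python
from collections import defaultdict


def ktuple_index(reads, k=10):
    """Map each k-tuple to the set of read indices that contain it."""
    index = defaultdict(set)
    for i, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].add(i)
    return index


def candidate_pairs(reads, k=10):
    """Return pairs (i, j), i < j, of reads sharing at least one k-tuple."""
    pairs = set()
    for members in ktuple_index(reads, k).values():
        ordered = sorted(members)
        for a in range(len(ordered)):
            for b in range(a + 1, len(ordered)):
                pairs.add((ordered[a], ordered[b]))
    return pairs


# Toy reads (hypothetical): the end of read 0 overlaps the start of read 1.
reads = [
    "ACGTACGTTAGCCGATAGGCT",
    "TAGCCGATAGGCTTTACGGAT",
    "GGGGGGGGGGCCCCCCCCCCA",
]
print(candidate_pairs(reads, k=10))
```

Indexing k-tuples avoids aligning every read against every other read; only the candidate pairs would go on to a full overlap alignment.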

Assembly Programs

• PHRAP FAMILY
  – phred/phrap/consed/cross_match
  – Developed by Phil Green, U of Wash.
• Other assemblers
  – phrap, kangaroo, phrapo,
  – CAP, TIGRAssembler, ...

http://www.phrap.org/

Assembly

• Phred: reads DNA sequencing trace files, calls bases, and assigns a quality value to each called base.
  – The quality value is a log-transformed error probability, specifically $Q = -10 \log_{10}(P_e)$.
  – $Q$ = quality value, $P_e$ = error probability.
  – Q = 20 -> 1% chance of miscall, Q = 30 -> 0.1% chance of miscall.
• Phrap: assembles shotgun DNA sequence data.
• Consed/Autofinish: view, edit, and finish sequence assemblies created with phrap.
  – Allows the user to pick primers and templates.
  – Suggests additional sequencing reactions.
  – Suggests digests and forward/reverse pair information to check accuracy of assembly.
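The quality-value formula can be checked with a few lines of Python. This is a small sketch of the arithmetic only. The function names are chosen for this example and are not part of phred.

```python
import math


def phred_to_error(q):
    """Convert a quality value Q to an error probability Pe = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_to_phred(pe):
    """Convert an error probability Pe to a quality value Q = -10 log10(Pe)."""
    return -10 * math.log10(pe)


for q in (20, 30, 40):
    print(f"Q = {q}: chance of miscall = {phred_to_error(q):.4%}")
```

Each 10-point increase in Q corresponds to a tenfold drop in the chance of a miscall.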

Poisson statistics for sequencing completion

$P_0 = e^{-LN/G}$

L = read length
N = number of reads
G = genome size

[Plot: % not sequenced versus coverage (1 = 1-fold = 1X); labeled values: E. coli 15 kb, H. sapiens 900 kb]
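As a worked example of the Poisson formula, the sketch below computes fold coverage LN/G and the expected fraction of bases left unsequenced, P0. The function names are illustrative, and the model assumes reads land uniformly at random on the target.

```python
import math


def coverage(read_length, n_reads, genome_size):
    """Fold coverage c = L * N / G."""
    return read_length * n_reads / genome_size


def fraction_unsequenced(read_length, n_reads, genome_size):
    """Poisson estimate of the fraction of bases hit by no read: P0 = e^(-LN/G)."""
    return math.exp(-coverage(read_length, n_reads, genome_size))


# Example values: a 150 kb target sequenced with 500 bp reads.
for n_reads in (300, 1500, 2400):
    c = coverage(500, n_reads, 150_000)
    p0 = fraction_unsequenced(500, n_reads, 150_000)
    print(f"{n_reads} reads: {c:.0f}X coverage, {p0:.4%} not sequenced")
```

Because P0 falls off exponentially with coverage, the first few fold of coverage close most of the target, while the last fraction of a percent needs far more reads.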

Number of Gaps = $N e^{-c}$

N = number of reads
c = fold coverage

150 kb target clone, 500 bp reads

Coverage   Reads
1          300
5          1500
8          2400
10         3000
50         15000
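The read counts for the 150 kb clone, and the expected number of gaps at each coverage, can be recomputed with a short script. This is a sketch under the same Poisson assumption. The function names are made up for the example.

```python
import math


def reads_needed(target_size, read_length, fold_coverage):
    """Number of reads N that gives the requested fold coverage c = L * N / G."""
    return round(fold_coverage * target_size / read_length)


def expected_gaps(n_reads, fold_coverage):
    """Expected number of gaps = N * e^(-c)."""
    return n_reads * math.exp(-fold_coverage)


# 150 kb target clone, 500 bp reads.
print("Coverage  Reads  Expected gaps")
for c in (1, 5, 8, 10, 50):
    n = reads_needed(150_000, 500, c)
    print(f"{c:<9} {n:<6} {expected_gaps(n, c):.3g}")
```

At 1X the formula gives about 110 gaps and at 5X about 10. By 8X it predicts fewer than one gap for a clone of this size.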

Gaps

Number of Gaps = $N e^{-c}$

N = number of reads
c = fold coverage

Human genome, 3 Gb, 1,000 bp reads

Coverage   Reads    Gaps
1          3e6      1,000,000
5          1.5e7    100,000
8          2.4e7
10         3e7

454 Seq, 400 bp reads

Coverage   Reads
50         3.75e8


Finishing

• GOALS
  – >95% coverage on BOTH strands
  – every base covered 3X
  – resolve ambiguities
• Finish when random sequencing is no longer productive (~8X range)

Sequence finishing. How?

• Identify gaps, ambiguities
  – Captured gaps: the gap is contained in a clone
• Extend from end of contigs
  – Resequencing, new chemistry.
  – Specific primers
  – Subcloning and sequencing.
• Uncaptured gaps.
  – New specific primers
  – PCR across gap, sequence PCR product.
• Resolve ambiguities
  – Consensus or resequence
• Specific primers, different chemistry

Large clone sequencing process

• Phase 1: Unfinished, may be unordered/unoriented contigs, with gaps.

• Phase 2: Unfinished, fully oriented and ordered sequence, may contain gaps and low-quality sequences.

• Phase 3: Finished, no gaps.

Genome assembly after initial contigs are made

• Order clones/contig sequences:
  – Sequence overlaps.
    • Clone/contig end sequences.
  – Clone fingerprints.
  – Anchor using other maps
    • Sequence-based markers on genetic or physical maps.
    • Conserved synteny to other genomes.
• Easiest when re-sequencing, e.g., another human genome!

Process Control

• LIMS
  – Laboratory information management system
• AIMS
  – Analysis information management system

Hard genome sequencing problems

• Repeats
• Complex genome structures

Where does a clone from a repetitive region map?

Approaches to sequence repeat problems

• Multiple fragment sizes in 1 project
• Use length/distance info
• New assemblers, e.g., ARACHNE

Results of Multi-length Fragment Assembly

• Contigs

• “Supercontigs”

• Clone links for finishing

• Clone map

DOE Joint Genome Institute (JGI) Prokaryote Finishing Standards

• All low-quality areas (<Q30) are reviewed and resequenced.
• The final error rate must be less than 0.2 per 10 Kb.
• No single-clone coverage is permitted (minimum of 2x depth everywhere).
• Single-stranded regions are manually inspected and quantified.
• All positions where an aligned high-quality read (>Q29) disagrees with the consensus base are checked.
• All strings of xxxx are resolved in the final sequence.
• All repeats are verified.
• The ends of final contigs (chromosomes, plasmids) are checked.
• The final assembly is given a manual QC check.
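Some of these standards lend themselves to simple automated checks. The sketch below is a hypothetical illustration, not JGI's actual tooling. It assumes per-base consensus quality values and per-base clone depth are available as plain Python lists, and it estimates the error rate from the quality values using Pe = 10^(-Q/10), which is an assumption of this example.

```python
def low_quality_positions(quals, threshold=30):
    """Positions whose consensus quality is below the threshold (<Q30)."""
    return [i for i, q in enumerate(quals) if q < threshold]


def single_coverage_positions(depths, min_depth=2):
    """Positions covered by fewer than min_depth clones."""
    return [i for i, d in enumerate(depths) if d < min_depth]


def expected_errors_per_10kb(quals):
    """Expected errors per 10 Kb, summing Pe = 10^(-Q/10) over all bases."""
    total = sum(10 ** (-q / 10) for q in quals)
    return total / len(quals) * 10_000


# Hypothetical per-base data for a short stretch of consensus.
quals = [50, 50, 28, 50, 45, 50]
depths = [3, 2, 1, 4, 4, 2]

print("Review/resequence:", low_quality_positions(quals))
print("Single-clone coverage:", single_coverage_positions(depths))
print("Meets error-rate target:", expected_errors_per_10kb(quals) < 0.2)
```

Checks such as verifying repeats or inspecting contig ends still need manual review, and this sketch does not attempt them.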

Completed genomes

23 complete, 329 in assembly, in progress 389

Categories: Plants, Animals, Protists, Fungi

Arabidopsis thaliana
Caenorhabditis elegans
Candida glabrata
Cryptococcus neoformans
Cyanidioschyzon merolae
Debaryomyces hansenii
Drosophila melanogaster
Encephalitozoon cuniculi
Entamoeba histolytica
Eremothecium gossypii
Homo sapiens
Kluyveromyces lactis
Leishmania major Friedlin
Mus musculus
Oryza sativa
Saccharomyces cerevisiae
Schizosaccharomyces pombe
Trypanosoma cruzi
Yarrowia lipolytica

http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi

Genomes Complete

• Eukaryotes: 23 complete, 329 in assembly, in progress 389
  – Human, mouse, rat, zebrafish
  – Homo sapiens neanderthalensis
  – Drosophila, Anopheles, Caenorhabditis
  – Arabidopsis, oat, corn, barley, rice, tomato
  – Saccharomyces, Schizosaccharomyces, Magnaporthe, Cryptococcus, Candida…
  – Encephalitozoon cuniculi, Guillardia theta
  – Toxoplasma, Plasmodium
  – And many more…

Eubacteria and Archaea genomes

• 608 Bacteria and 48 Archaea completed
• Comprehensive Microbial Resource
  – http://pathema.tigr.org/tigr-scripts/CMR/CmrHomePage.cgi
• Joint Genome Institute
  – http://www.jgi.doe.gov/genome-projects/
  – 2065 genome projects underway or completed!
• NCBI Genomes

Genome Centers

• Joint Genome Institute (DOE)
• Whitehead Institute (MIT)
• TIGR
• Washington University (St. Louis)
• Celera
• Sanger Institute (the other UK)
• RIKEN (Japan)
• Beijing Genomics Institute (China)
• Max Planck (Germany)…

Where do you find Genomic data?

• NCBI
  – Entrez (by clone, by RefSeq)
  – Genome (view and search map)
• Genome center sites
• Organism genome project sites
• Annotation projects
  – UCSC Genome Browser
  – Ensembl Genome Browser

Arabidopsis

http://mips.helmholtz-muenchen.de/plant/athal/index.jsp

C. elegans (nematode)

http://wormbase.org
