Transcript
Page 1: Sequencing the Maize (B73) Genome Genome Sequencing Center Maize Genome Sequencing Consortium

Sequencing the Maize (B73) Genome

Genome Sequencing Center

Maize Genome Sequencing Consortium

Page 2:

The Team

• WU Genome Sequencing Center (R. Wilson, PI)
- Bob Fulton, Pat Minx, Sandy Clifton
• Arizona Genomics Institute (R. Wing)
• Cold Spring Harbor Laboratory
- D. Ware, L. Stein
- R. McCombie, R. Martienssen
• Iowa State University (P. Schnable & S. Aluru)
• The Maize research community

Page 3:

The Plan

Page 4:

Progress Through Pipeline (progress as of 9/30/06)

[Bar chart; axis: 0% to 100%; clone counts by status]

library_done: 4311
shotgun_done: 3106
prefin_done: 2261
finished: 411

Page 5:

Progress Through Pipeline Across Time

[Line chart; x-axis: week (1 to 25); y-axis: number of clones (0 to 5000); series: library_done, shotgun_done, prefin_done, finished]

Page 6:

Agenda

9:00 – 9:15 Introductions and Project Overview (Rick Wilson)

9:15 – 10:15 Plans and Progress – WU/AGI/CSHL/ISU Project
- Map and Tile Path Selection (Rod Wing)
- Library Construction and Production (Lucinda Fulton)
- Sequence Improvement (Bob Fulton, Dick McCombie, Rod Wing)
- Data Submission (Joanne Nelson)
- Annotation and Data Display (Doreen Ware)
- Outreach (Rick Wilson)

10:15 – 10:30 Break

10:30 – 11:00 Plans and Progress – DOE Project (Dan Rokhsar)

11:00 – 11:30 Future Plans and Collaborations
- Pat Schnable (by phone) - retrotransposons

11:30 – Noon Executive Session

Noon – 1:00 Working Lunch and Discussion

1:00 Depart for Airport

Page 7:

BAC-by-BAC Strategy to Sequence the Maize Genome

Maize B73 Genome (2300 Mb)

BAC library construction (Hind III, EcoR I/MboI ; 27X deep ; 150kb avg. insert)

BAC End Sequencing

~800,000

Genetic Anchoring in silico, overgo hybridization

Fingerprinting ~460,000 BACs

STC database

BAC physical maps (HICF & Agarose)

FPC databases (Agarose and HICF)

Choose a seed BAC

Shotgun sequencing and finishing

STC database search, FP comparison

Determine minimum overlap BACs

Complete maize genome sequence
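The library parameters on this slide (2300 Mb genome, 27X depth, 150 kb average insert) imply a clone count that can be checked with simple arithmetic. The sketch below is only an illustration; the function name is hypothetical, and it assumes "27X deep" means total insert length divided by genome size.

```python
def clones_for_coverage(genome_mb: float, depth: float, avg_insert_kb: float) -> int:
    """Clones needed so that (clones x average insert) = depth x genome size.

    Assumes depth is defined as total insert length / genome length.
    """
    genome_kb = genome_mb * 1000  # 1 Mb = 1000 kb
    return round(depth * genome_kb / avg_insert_kb)


if __name__ == "__main__":
    # Values taken from the slide: 2300 Mb genome, 27X, 150 kb inserts.
    print(clones_for_coverage(2300, 27, 150))
```

Under these assumptions the result is about 414,000 clones, the same order as the ~460,000 BACs listed for fingerprinting.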

Page 8:

Map Summary

1. Total Assembled Contigs: 721
– Equal to 2,150 Mb, 93.5% coverage of 2300 Mb genome
– Anchored: 421 ctgs, 86.1% of the genome
– Average anchored contig size: 4.7 Mb
– Unanchored: 300 ctgs, 7.4% coverage
– Average unanchored contig size: 0.56 Mb
– 189 of the 300 unanchored contigs are less than 10 clones
– Largest anchored contig: 22.9 Mb in Chr9
– Largest unanchored contig: 6.7 Mb

2. Total FPC Markers: 25,924
– STS markers: 9,129
– Overgo Markers: 14,877
– Anchored markers: 1,918
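The figures in this summary are mutually consistent, which a short check makes explicit. This is a sketch for the reader; the variable and function names are not from the project, and the only inputs are the numbers on the slide.

```python
GENOME_MB = 2300  # genome size quoted on the slide

ANCHORED = {"contigs": 421, "pct_of_genome": 86.1}
UNANCHORED = {"contigs": 300, "pct_of_genome": 7.4}
MARKERS = {"sts": 9129, "overgo": 14877, "anchored": 1918}


def avg_contig_mb(pct_of_genome: float, contigs: int, genome_mb: float = GENOME_MB) -> float:
    """Average contig size implied by a coverage percentage and a contig count."""
    return pct_of_genome / 100 * genome_mb / contigs


total_contigs = ANCHORED["contigs"] + UNANCHORED["contigs"]
total_pct = ANCHORED["pct_of_genome"] + UNANCHORED["pct_of_genome"]
total_markers = sum(MARKERS.values())

if __name__ == "__main__":
    print(total_contigs, round(total_pct, 1), total_markers)
    print(round(avg_contig_mb(**ANCHORED), 2), round(avg_contig_mb(**UNANCHORED), 2))
```

The implied averages (about 4.70 Mb and 0.57 Mb) agree with the 4.7 Mb and 0.56 Mb quoted above to within rounding.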

Page 9:

MTP Selection

•Seed BACs: 4000, done

•Mega Contig: 197, done

•Clone Walking from Seed BACs: 2,800 done; in progress

•Total clones picked = 6,997

•On track to deliver 1000 clones/month until maize MTP is complete
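The tally on this slide can be reproduced, and the delivery rate turned into a rough time estimate. The sketch below assumes the 19,000 BACs listed under production sequencing is the eventual MTP size; the slides do not state that explicitly, so treat the projection as illustrative only. All names are hypothetical.

```python
import math

PICKED = {"seed_bacs": 4000, "mega_contig": 197, "clone_walking": 2800}
total_picked = sum(PICKED.values())


def months_to_target(target: int, done: int, per_month: int = 1000) -> int:
    """Whole months needed to pick the remaining clones at a fixed monthly rate."""
    remaining = max(target - done, 0)
    return math.ceil(remaining / per_month)


if __name__ == "__main__":
    print(total_picked)
    # ASSUMPTION: 19,000 is used as the target only because it is the
    # shotgun BAC count quoted elsewhere in the deck.
    print(months_to_target(19000, total_picked))
```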

Page 10:

Flowchart for MTP picking and Library Construction

Clone selection (combine seed BAC and BAC end sequences with fingerprinting and trace files)

Clone picking (Resource Center)

MTP sequencing

GenBank BAC end sequence database

Library DNA production

Hfq sequencing

Clone verification

Clone shipping

Continue shotgun library construction at WashU

DNA shearing

Seed BAC database

MTP BAC end database

Library DNA production

Page 11:

Seed BAC Walking

In Agarose and HICF map, selecting large clones next to seed BAC

Blastn search of BAC end sequences against seed BAC sequences

Check blastn alignment for candidate clones

Check trace file for Dye blob

Check the Sulston score in HICF map for overlap

Check Agarose fingerprints to avoid overlap with large bands

Choose walking clone
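One of the checks above uses the Sulston score from the HICF map. As it is commonly described for FPC, the score is the probability that the observed number of shared fingerprint bands between two clones would arise by chance, so a smaller value suggests a real overlap. The sketch below follows that usual binomial formulation; the tolerance and gel length values are placeholders, not the project's settings.

```python
from math import comb


def sulston_score(n_low: int, n_high: int, matches: int,
                  tolerance: float, gel_length: float) -> float:
    """Probability of at least `matches` coincidental band matches.

    n_low / n_high: band counts of the clone with fewer / more bands.
    tolerance, gel_length: matching window and gel size in the same units
    (illustrative values only in the example below).
    """
    # Chance that one band misses a given band of the other clone.
    miss = 1 - 2 * tolerance / gel_length
    # Chance that one band matches at least one of the n_high bands.
    p = 1 - miss ** n_high
    return sum(comb(n_low, m) * p ** m * (1 - p) ** (n_low - m)
               for m in range(matches, n_low + 1))


if __name__ == "__main__":
    for shared in (5, 10, 15):
        print(shared, sulston_score(20, 25, shared, tolerance=7, gel_length=3300))
```

More shared bands give a smaller score, which is why candidate walking clones are compared on this value before the agarose fingerprints are inspected.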

Page 12:

Minimum Tile Path Pipeline

• BAC End Sequences of potential BACs are BLASTed against the Seed BACs

• Results are classified based on location on the FPC

• A table for each BAC is created of filtered BLAST results with links to CMap and GBrowse

• Blast results are imported into CMap and GBrowse with additional information such as trace files and FPCs
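The filtering step in this pipeline can be pictured as a pass over tabular BLAST output. The sketch below assumes the standard 12-column tabular format (query, subject, percent identity, alignment length, and so on, with e-value in column 11); the identity and length cut-offs and the sequence identifiers are invented for illustration and are not the project's actual criteria.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    query: str      # BAC end sequence identifier
    subject: str    # seed BAC identifier
    pident: float   # percent identity
    length: int     # alignment length
    evalue: float


def parse_tabular(lines: Iterable[str]) -> List[Hit]:
    """Parse 12-column tab-separated BLAST hits, skipping blanks and comments."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[3]), float(f[10])))
    return hits


def filter_hits(hits: Iterable[Hit], min_identity: float = 98.0,
                min_length: int = 200) -> List[Hit]:
    """Keep hits passing the (hypothetical) identity and length thresholds."""
    return [h for h in hits if h.pident >= min_identity and h.length >= min_length]


# Hypothetical example rows; identifiers are made up.
SAMPLE = [
    "\t".join(["bes_0001", "seed_A", "99.50", "612", "3", "0", "1", "612", "1001", "1612", "0.0", "1100"]),
    "\t".join(["bes_0002", "seed_A", "91.20", "580", "51", "2", "1", "580", "5001", "5580", "1e-150", "700"]),
    "\t".join(["bes_0003", "seed_A", "99.80", "85", "0", "0", "1", "85", "9001", "9085", "2e-30", "150"]),
]

if __name__ == "__main__":
    for hit in filter_hits(parse_tabular(SAMPLE)):
        print(hit.query, hit.subject, hit.pident, hit.length)
```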

Page 13:

Minimum Tile Path Pipeline Usage

• A table of alignments between the seed BAC and the BAC end sequences contains links to CMap and GBrowse.

• CMap displays the FPC data for the seed BAC and the potential next BACs.

• GBrowse provides an alignment of the BES with the seed sequence and displays the trace data.

Page 14:

Blast Results Table

Page 15:

Maize Production Sequencing

Shotgun of 19,000 BACs

Fosmid End Sequencing of 1 Million Reads

BAC End Sequencing of 220,000 clones

Page 16:

Maize BAC shotgun

BAC DNA received from AGI or prepared at the GSC

Small Scale Library Construction

Production Sequencing - 1,536 reads/project

Automated Shotgun_done

Page 17:
Page 18:

Maize BAC Shotgun Reads

[Chart; x-axis: months (Feb-06 to Sep-06); y-axis: reads per month (0 to 1,400,000)]

To date 3,106 BAC clones are shotgun_done

Page 19:

Maize Fosmid Sequencing

Fosmid trays 0001 to 0471 were received from the Messing lab

Initial QC was fine, but the bulk shipment has failed to grow

Stamping results of the original trays show no growth

85 Fosmid ligations, which represent ~250,000 clones, were received from the Messing lab; plating is underway

GSC Fosmid library construction has been completed and represents 1M clones

Expected completion date is November of this year.

Page 20:
Page 21:

Maize BAC End Sequencing

BAC end sequencing will be completed next week

Total of 440,000 reads from two different libraries

Pass rate of 75% with an average read length of 600 bases

Paired end read rate is ~70%
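The BAC end sequencing figures above translate into an approximate yield. The sketch below uses only the numbers on this slide plus the 220,000 clones quoted for BAC end sequencing; it assumes the ~70% paired rate applies to clones (both ends passing), which the slide does not spell out.

```python
TOTAL_READS = 440_000      # two end reads per clone
CLONES = 220_000           # from the production sequencing slide
PASS_PCT = 75
AVG_READ_LEN = 600         # bases
PAIRED_PCT = 70            # ASSUMPTION: fraction of clones with both ends passing

# Integer arithmetic avoids floating-point rounding in the percentages.
passing_reads = TOTAL_READS * PASS_PCT // 100
passing_bases = passing_reads * AVG_READ_LEN
paired_clones = CLONES * PAIRED_PCT // 100

if __name__ == "__main__":
    print(passing_reads, passing_bases, paired_clones)
```

Under these assumptions the passing reads amount to roughly 198 Mb of end sequence, with about 154,000 clones contributing a usable read pair.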

Page 22:

Sequence Improvement Pipeline

•Shotgun_done triggers the prefinishing pipeline

•Initial identification of “do finish” regions

•Manual sorting and use of autoedit (Gordon) to break apart misassembly.

•Autofinish (Gordon) used to choose directed reactions for all gaps and regions of low quality in “do finish” regions

•Reassembly and 2nd iteration of prefinishing pipeline

•Final identification of “do finish” regions and handoff to finishing pipeline

Page 23:

Clone Improvement through the Prefinishing Pipeline

[Bar chart; x-axis: contigs per clone in bins (1-5, 6-10, 11-15, 16-20, 21-25, 26-30, 31-35, 35-40, 40+ ctg); y-axis: 0 to 800; series: before prefinish, after prefinish]

Page 24:
Page 25:

End

Spanning Plasmids

Coverage (green)

Page 26:

Repeat Tags

Do Finish

GSS sequence

EST sequence

Page 27:

Alignment with cDNA read pairs

Alignment with End Sequences

Page 28:

Future Plans for Improved Throughput

•Automated Shotgun-done status assigning

•Overlap Evaluation at Prefinishing

•Addition of Fosmid End Pairs at Prefinishing

•Direct Sequencing for Unspanned Gaps

•Additional Finishing Staff Hired at all 3 Centers

Page 29:

Maize clone submissions

clone status            submission keywords
shotgun complete        HTGS_PHASE1; HTGS_FULLTOP
2 rounds of prefinish   HTGS_PHASE1; HTGS_PREFIN
in finishing            HTGS_PHASE1; HTGS_ACTIVEFIN
finished                HTGS_PHASE1; HTGS_IMPROVED

Query GenBank by keywords:
zea mays[ORGN] AND HTGS_PREFIN[KYWD] AND WUGSC[CNTR]
zea mays[ORGN] AND HTGS_IMPROVED[KYWD] AND WUGSC[CNTR]

Restrict by date range:
zea mays[ORGN] AND WUGSC[CNTR] AND HTGS_FULLTOP[KYWD] AND 2006/09[PDAT]
zea mays[ORGN] AND WUGSC[CNTR] AND HTGS_FULLTOP[KYWD] AND 2006/09/26:2006/10/03[PDAT]
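The query strings on this slide follow a fixed pattern, so they can be assembled programmatically. The helper below is a hypothetical convenience function, not part of any NCBI tool; it only builds the search term, which can then be pasted into the GenBank search box or passed to whatever Entrez client is in use.

```python
from typing import Optional


def htgs_query(keyword: str, organism: str = "zea mays",
               center: str = "WUGSC", pdat: Optional[str] = None) -> str:
    """Build a GenBank search term in the style shown on the slide.

    pdat may be a month ("2006/09") or a range ("2006/09/26:2006/10/03").
    """
    parts = [f"{organism}[ORGN]", f"{center}[CNTR]", f"{keyword}[KYWD]"]
    if pdat:
        parts.append(f"{pdat}[PDAT]")
    return " AND ".join(parts)


if __name__ == "__main__":
    print(htgs_query("HTGS_IMPROVED"))
    print(htgs_query("HTGS_FULLTOP", pdat="2006/09/26:2006/10/03"))
```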

Page 30:

HTGS_IMPROVED submissions

Pick a clonename, any clonename:

DEFINITION Zea mays chromosome 4 clone CH201-11H16; ZMMBBc0011H16

Center project name: Z_AF-11H16

Improved sequence is annotated on submission record

Where possible, contigs have been ordered and oriented based on read pairing, and these regions are designated as scaffolds.

Small contigs (<2kb) that don’t represent a clone end, don’t contain improved sequence, or are not part of a scaffold are removed from the final submission.

Contigs are screened for bacterial contamination
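The small-contig rule above can be read as a simple predicate. The sketch below is one interpretation: a contig under 2 kb is dropped only if it is not a clone end, carries no improved sequence, and is not in a scaffold. The class and field names are hypothetical and the centers' actual submission code is not shown in the slides.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Contig:
    name: str
    length: int                       # bases
    is_clone_end: bool = False
    has_improved_sequence: bool = False
    in_scaffold: bool = False


def keep_contig(c: Contig, min_length: int = 2000) -> bool:
    """Keep large contigs, and small ones that carry useful information."""
    if c.length >= min_length:
        return True
    return c.is_clone_end or c.has_improved_sequence or c.in_scaffold


def contigs_for_submission(contigs: List[Contig]) -> List[Contig]:
    return [c for c in contigs if keep_contig(c)]


if __name__ == "__main__":
    # Hypothetical contigs for illustration.
    example = [
        Contig("Contig1", 36440, is_clone_end=True),
        Contig("Contig2", 1500),
        Contig("Contig3", 1200, in_scaffold=True),
    ]
    print([c.name for c in contigs_for_submission(example)])
```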

Page 31:

FEATURES             Location/Qualifiers
     source          1..173904
                     /organism="Zea mays"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:4577"
                     /chromosome="unknown"
                     /clone="CH201-112C8; ZMMBBc0112C08"
     misc_feature    1..51940
                     /note="scaffold_name:Scaffold1"
     misc_feature    1..36440
                     /note="assembly_name:Contig245 clone_end:left vector_side:T7"
     gap             36441..36540
                     /estimated_length=unknown
     misc_feature    36541..51940
                     /note="assembly_name:Contig240"
     misc_feature    51941..129231
                     /note="scaffold_name:Scaffold2"
     gap             51941..52040
                     /estimated_length=unknown
     misc_feature    52041..59371
                     /note="assembly_name:Contig250"
     ...........
     misc_feature    120342..122491
                     /note="Improved sequence."
     misc_feature    128142..129231
                     /note="Improved sequence."
     misc_feature    129232..139656
                     /note="scaffold_name:Scaffold3"
     .....

Page 32:

1005 HTGS_FULLTOP
254 PREFIN_DONE
1532 ACTIVE_FIN
357 HTGS_IMPROVED

GenBank

Page 33:

Ongoing work at CSHL

• BAC Annotations Levels

• Data Analysis

• Display

• Project Management

• Collaborations

Page 34:

BAC Data Analysis

• Ensembl Pipeline
• 3 inclusive phases of annotation
– Level I: Display BAC information
– Level II: Sequence-based annotations
– Level III: Integrative annotations

Shiran Pasternak, Apurva Narechania, Joshua Stein

Page 35:

Application of Mathematical Repeat Analysis

• Identifies novel repeats w/o dependence on curation.
– Based on frequency of 20-mers in JGI WGS sequence

• Correlates with presence of retroelements.

• Can modulate threshold to optimize application.
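The idea behind this k-mer based repeat analysis can be sketched in a few lines: count how often each 20-mer occurs in a reference read set, then flag positions in a target sequence covered by k-mers above a frequency threshold. The code below is a toy illustration, not the Vmatch-based method used by the project. It assumes the threshold is expressed as log10 of the k-mer count and ignores reverse complements.

```python
from collections import Counter
from math import log10
from typing import Iterable, List


def count_kmers(seqs: Iterable[str], k: int = 20) -> Counter:
    """Count every overlapping k-mer in the reference sequences."""
    counts: Counter = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def repeat_mask(seq: str, counts: Counter, k: int = 20,
                log_threshold: float = 1.25) -> List[bool]:
    """Mark bases covered by any k-mer whose log10(count) meets the threshold."""
    seq = seq.upper()
    mask = [False] * len(seq)
    for i in range(len(seq) - k + 1):
        c = counts.get(seq[i:i + k], 0)
        if c > 0 and log10(c) >= log_threshold:
            for j in range(i, i + k):
                mask[j] = True
    return mask


if __name__ == "__main__":
    # Tiny demonstration with k=3 so the example stays readable.
    reference = ["ACG" * 20]
    counts = count_kmers(reference, k=3)
    print(repeat_mask("TTTACGTTT", counts, k=3))
```

Raising or lowering `log_threshold` trades sensitivity against the risk of masking genuine low-copy sequence, which is the tuning the last bullet refers to.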

Exon Coverage at Log 1.25 Repeat Level

[Bar chart; y-axis: % Coverage (0 to 100); categories: TE, non-TE; bar labels: 90.21, 0.15]

Apurva Narechania, Joshua Stein

Page 36:

Retroelement Annotation

• Classify retroelement families
• Current list covers ~68% of genome
• Ten most prevalent account for ~80% of retroelement sequences

– Ji, huck, opie, zeon, cinful, prem1, grande, xilon, gyma, giepum

Collaboration with Jeff Bennetzen and Philip SanMiguel

Goal is to visualize the history of transpositions

Giepum element interrupted by ji and opie in AC148166

Joshua Stein

Page 37:

Whole Genome Alignments

• Wobble Aware Bulk Aligner (WABA)*
– TIGR Transcripts Rice
– WABA alignments Maize
• Distinguishes between:
– low-similarity regions (grey)
– high-similarity regions (medium blue)
– high-similarity regions w/ wobble-base mismatch of coding regions (green)

*Kent, WJ & Zahler, A.M. (2000). Genome Res. 10:1115-25

Joshua Stein

Page 38:

Whole Genome Alignments

• BLASTZ* with AXTCHAIN** & CHAINNET**
– Sensitive gapped BLAST algorithm designed for aligning long sequences.
– Accommodates long gaps & overlapping gaps, inversions, translocations, & duplications

*Schwartz, S et al. (2003). Genome Res. 13:103-7
**Kent, WJ, et al. (2003). PNAS 100:11484-11489

Example of BLASTZ(net) display in Ensembl.

Page 39:

www.maizesequence.org

Sequenced BAC

FPC Contig

Virtual Bin

Core Bin Marker

Chromosome

Synteny Views

Main Navigation bar is accessible from every page

Contains multiple entry points to the genome

Page 40:

MapView

Displays statistics by chromosome and provides entry points based on a single chromosome

Page 41:

CytoView

Provides detailed information on features anchored to the FPC map.

The side bar highlights the location on the chromosome and provides page-specific functionality, including data export.

The Detailed view is customizable; tracks can be added or removed by the users.

Features have drop-down menus that contain general information as well as internal and external links.

Page 42:

ContigView

This view is based on BAC coordinates and displays annotation levels II and III.

The header contains the Clone name in the physical map, GenBank Accession, and Chromosome and FPC contig information.

The Detailed view offers semantic zooming, is customizable, and provides links to other views and information resources.

Page 43:

SyntenyView

Page 44:

Upcoming Features

• Release
– October 2006
• BlastView
– December 2006
• BAC Annotation
– Level II: January, 2007
– Level III annotation: April, 2007
– WG alignments: June, 2007
• BioMart
– January, 2007
• NSF collaborations
– TwinScan annotations: March, 2007
– Maize Optical Map: July, 2007
– Full-length cDNAs: December, 2007
• Notification System
– Users are notified
  • When a region of interest is updated
  • When markers are aligned to a specific sequence
– January, 2007

Page 45:

Hardware Environments

• Software
– Developed locally
– Managed with source control
– Frequent releases to staging environment
– Quarterly production releases
• Data
– Timed analysis on staging environment
– Mirrored weekly on production

Shiran Pasternak, Apurva Narechania

Page 46:

Quality Assurance

• Unit-testing framework
– Binary assertions
– Failure report and automatic notification
• Software Quality Control
– e.g., code retrieves correct data from the database
• Data Quality Control
– e.g., clone in GenBank record exists in FPC map
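A data quality check of the kind listed above ("clone in GenBank record exists in FPC map") could be written as a binary assertion in a unit test. The sketch below uses Python's standard `unittest` module as a stand-in; the slides do not say which framework the group uses, and the helper function and in-memory clone lists are hypothetical substitutes for real database queries.

```python
import unittest
from typing import Iterable, List


def clones_missing_from_map(genbank_clones: Iterable[str],
                            fpc_clones: Iterable[str]) -> List[str]:
    """Return GenBank clone names that have no entry in the FPC map."""
    return sorted(set(genbank_clones) - set(fpc_clones))


class DataQualityTest(unittest.TestCase):
    # In a real check these lists would come from the databases;
    # here they are hard-coded with clone names that appear in the slides.
    genbank = ["ZMMBBc0011H16", "ZMMBBc0112C08"]
    fpc_map = ["ZMMBBc0011H16", "ZMMBBc0112C08"]

    def test_every_genbank_clone_is_in_fpc_map(self):
        missing = clones_missing_from_map(self.genbank, self.fpc_map)
        self.assertEqual(missing, [], f"clones absent from FPC map: {missing}")


if __name__ == "__main__":
    unittest.main()
```

A failing assertion names the offending clones, which is the information a failure report or automatic notification would need to carry.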

Shiran Pasternak

Page 47:

Project Management

• Mantis Bug Tracker
– Manage tasks using priorities, severities, and resource allocations
– Automated submission of issues using feedback form
– Generation of progress reports

Page 48:

Project Management

• Wiki
– Enhances group communication
– Meeting notes, flowcharts, specification documents
– Maintains history of specifications and design decisions
– Seamless editing

Page 49:

Collaborations

• MaizeGDB (Iowa State University, University of Missouri)

– C. Lawrence

• Maize Optical Map (University of Wisconsin)

– D. Schwartz

• Maize Transposon Annotation (University of Georgia, Purdue)

– J. Bennetzen, P. SanMiguel

• Ensembl (EBI)

– E. Birney

• Vmatch for Mathematical Repeats (University of Hamburg)

– S. Kurtz

• Maize Full Length cDNA project (Arizona Genomics Institute)

– Y. Yu

• TwinScan (Danforth Plant Science Center)

– B. Barbazuk

