Visualizing Genes and Evolution
Jim Kent
Genome Bioinformatics Group
University of California Santa Cruz

VisiGene

• Image browser for in-situ and other gene-oriented pictures
• Hopefully in the long run will have a million images covering almost all vertebrate genes.
• Currently has 12000 images
  – Mouse transcription factor in-situs from Paul Gray
  – Imagery from the literature collected by Jackson Labs
• Features:
  – Interactive zooming and scrolling
  – Extensive database for captions
  – Integration into UCSC Genome Browser web site

Current VisiGene: http://hgwdev-kent.cse.ucsc.edu/cgi-bin/hgVisiGene

Clicking on an area of interest zooms in on it by 2x.

Zooming and scrolling are fast: only the visible area is transmitted to users.

Caption under image gives specimen details, credits and links to data providers.

For well-studied genes, literature images from Jackson Labs abound.

Integration with Genome Browser is via Known Genes Track

A new link in the Known Genes details page

Gene sorter also links to VisiGene. Here the sorter is used to search for genes expressed in the substantia nigra (a target of Parkinson’s disease).

VisiGene Challenges

• Virtual microscope zooming and scrolling
  – Precomputed pyramid scheme for storing pictures at various scales
  – Javascript to interactively fetch just parts of images needed. (Javascript portability issues!)
• Defining orthologous genes between species.
• Image storage - will terabytes suffice?
• Database for captions and other annotations of images.
• Collecting imagery and data from a wide variety of projects… including Allen Institute?
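
The pyramid scheme and partial fetching mentioned above can be sketched roughly as follows. This is a minimal illustration, not VisiGene's actual code: the 512-pixel tile size, the halving rule, and the function names `pyramid_levels` and `visible_tiles` are all assumptions made for the example.

```python
def pyramid_levels(width, height, tile=512):
    """Return (width, height) for each level, halving until one tile covers the image.

    The tile size of 512 is an assumed value for illustration.
    """
    levels = [(width, height)]
    while levels[-1][0] > tile or levels[-1][1] > tile:
        w, h = levels[-1]
        # Round up so odd dimensions never lose their last pixel column or row.
        levels.append((max(1, (w + 1) // 2), max(1, (h + 1) // 2)))
    return levels


def visible_tiles(view_x, view_y, view_w, view_h, level_w, level_h, tile=512):
    """Return (col, row) indices of the tiles that overlap the viewport at one level."""
    # Clip the viewport to the image so out-of-range requests fetch nothing extra.
    x0, y0 = max(0, view_x), max(0, view_y)
    x1 = min(level_w, view_x + view_w)
    y1 = min(level_h, view_y + view_h)
    if x0 >= x1 or y0 >= y1:
        return []
    cols = range(x0 // tile, (x1 - 1) // tile + 1)
    rows = range(y0 // tile, (y1 - 1) // tile + 1)
    return [(c, r) for r in rows for c in cols]


levels = pyramid_levels(4096, 2048)
tiles = visible_tiles(500, 0, 600, 400, 4096, 2048)
```

Under these assumptions a client would request only the tiles returned by `visible_tiles` for the current zoom level, which is one way to keep zooming and scrolling fast regardless of the full image size.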

The Spectrum of Cells

A single fertilized egg develops into a human body, brain and all. During this process perhaps 300 different types of cells arise. Understanding and controlling this process of development is critical for stem cell based medicine.

Cell Lineage Tree of C. elegans, focusing on the gut.

For some animals, such as C. elegans, the complete lineage and type of every cell is worked out.

Full cell lineage for C. elegans, worked out by John Sulston and colleagues using microscope, eye, sketchbook and patience.

Vertebrate development is too complex to work out by hand

Mouse embryos, days 7-10

Some cell lineages, such as those leading from the hematopoietic stem cell to the various types of cells that make up the blood stream and much of the immune system, are well worked out in vertebrates.

A good deal of stem cell research involves finding marker genes that distinguish between different cell types at various branches in the differentiation tree. In many cases the discovery of new markers has resulted in the definition of new cell types.

Blood cell lineages were worked out with surface markers and a cell sorter.

Many types of cells don’t sort so easily, but usually cell nuclei can be resolved even in fairly complex embryos by microscopy, especially confocal microscopy.

Multiple markers can be used to classify cells into different types with microscopy as well as with sorters. Using quantum dots it is now possible to label simultaneously with a dozen markers.

[Figure: table of cells versus emission channels at 450, 475, 500, 525, 550, 575, 600, 625 and 650 nm]

Note: the last two cells might look the same to the eye but not to sensors, which have more than three channels.

Ideal Markers

• Unlike cell sorting experiments, we would try to pick markers that are each present in 1/3 to 2/3 of cell types rather than markers present in only one cell type.
• The markers would be chosen so that their expression patterns were relatively independent of each other, using resources such as the Allen Brain Atlas.
• In an ideal world, 8 perfect markers could distinguish 256 cell types; in the real world we’d hope 12 or 15 well-chosen markers would be enough.
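
The arithmetic behind the 256 figure is that 8 on/off markers give 2^8 distinct signatures. The sketch below illustrates that, along with the 1/3 to 2/3 selection rule. The 0.5 threshold and the names `signature`, `distinct_types` and `is_informative` are hypothetical choices for this example, not anything specified in the talk.

```python
from itertools import product


def signature(levels, threshold=0.5):
    """Turn continuous marker levels into a tuple of on/off bits (threshold is assumed)."""
    return tuple(1 if v >= threshold else 0 for v in levels)


def distinct_types(cells, threshold=0.5):
    """Count how many different on/off signatures appear among the measured cells."""
    return len({signature(c, threshold) for c in cells})


def is_informative(marker_bits):
    """True if the marker is on in between 1/3 and 2/3 of the cell types (inclusive)."""
    n = len(marker_bits)
    on = sum(marker_bits)
    # Integer comparison avoids floating-point trouble at exactly 1/3 or 2/3.
    return n > 0 and 3 * on >= n and 3 * on <= 2 * n


# Every possible signature for 8 perfect binary markers.
all_signatures = list(product((0, 1), repeat=8))
```

With real, noisy markers some signatures would collide or be ambiguous, which is presumably why 12 or 15 markers are hoped for rather than the theoretical minimum.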

Clustering and Beyond

• Once we have measured marker levels on cells, we can use clustering software, as is used for microarrays, to define cell types.

• Since markers don’t change instantaneously, we should observe in embryos ‘trails’ in color space between clusters, linking together two cell types.

• We can also have information about what cells are near each other, and potentially information about cell shape for more sophisticated algorithms.
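
As a rough illustration of clustering cells by marker color, here is a small k-means sketch. k-means is swapped in only because it is short to write; the talk does not say which clustering method would be used, and the initial centers are assumed to be supplied by the caller.

```python
import math


def kmeans(points, centers, iterations=20):
    """Plain Lloyd's algorithm over marker-level vectors.

    `centers` holds the caller's initial guesses; returns (labels, centers).
    """
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(iterations):
        # Assign each cell to the nearest center in color space.
        labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Move each center to the mean of the cells assigned to it.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centers.append(old)  # keep an empty cluster's center in place
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


cells = [(0.0, 0.1), (0.1, 0.0), (0.9, 1.0), (1.0, 0.9)]
labels, centers = kmeans(cells, [(0.0, 0.0), (1.0, 1.0)])
```

Cells lying on a 'trail' between two clusters would sit far from both centers, so distance to the nearest center could serve as a crude flag for intermediate forms.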

Cell Spectrum Summary

1) Use brain and gene atlas data to find 10 to 20 nuclear markers with distinct but overlapping expression patterns.
2) Label antibodies with quantum dots.
3) Stain thick slices with labeled antibodies.
4) Capture images with multichannel confocal microscope.
5) Identify nuclei and assign colors to them.
6) Cluster based on color to define cell types.
7) Construct tree of cell types by looking at spatial and temporal data, and looking for intermediate forms.

Comparative Genomics

Webb Miller

Comparative Genomics at BMP10

Conservation of Gene Features

Conservation pattern across 3165 mappings of human RefSeq mRNAs to the genome. A program sampled 200 evenly spaced bases across 500 bases upstream of transcription, the 5’ UTR, the first coding exon, introns, middle coding exons, introns, the 3’ UTR and 500 bases after polyadenylation. There are peaks of conservation at the transition from one region to another.

[Figure: aligning identity (y-axis, 50% to 100%) plotted across the sampled gene regions]
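
Sampling a fixed number of evenly spaced bases lets regions of different lengths be averaged on a common scale. A minimal sketch follows, assuming the sampling is applied to one region at a time and using half-open coordinates; the function name `sample_positions` is hypothetical.

```python
def sample_positions(start, end, n=200):
    """Return n evenly spaced integer positions in the half-open interval [start, end).

    Regions shorter than n bases yield repeated positions, which still puts
    every region on the same n-point scale.
    """
    length = end - start
    if length <= 0 or n <= 0:
        return []
    return [start + (i * length) // n for i in range(n)]


upstream = sample_positions(0, 500)    # e.g. a 500-base upstream region
short = sample_positions(100, 110, 5)  # small case that is easy to check by eye
```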

Chaining Alignments

• Chaining bridges the gulf between syntenic blocks and base-by-base alignments.

• Local alignments tend to break at transposon insertions, inversions, duplications, etc.

• Global alignments tend to force non-homologous bases to align.

• Chaining is a rigorous way of joining together local alignments into larger structures.

Chains join together related local alignments

Protease Regulatory Subunit 3

Affine penalties are too harsh for long gaps

Log count of gaps vs. size of gaps in mouse/human alignment correlated with sizes of transposon relics. Affine gap scores model red/blue plots as straight lines.
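
To see why an affine penalty is harsh on long gaps, compare it with a cost that grows more slowly. The numbers below and the logarithmic form are purely illustrative assumptions; the slides do not state the gap cost actually used for chaining.

```python
import math


def affine_gap(length, gap_open=400, gap_extend=30):
    """Affine cost: a fixed opening charge plus a constant charge per gapped base."""
    return gap_open + gap_extend * length


def log_gap(length, gap_open=400, scale=300):
    """Hypothetical sub-linear cost that grows with the logarithm of gap length."""
    return gap_open + scale * math.log(length)


for size in (1, 100, 10000):
    print(size, affine_gap(size), round(log_gap(size)))
```

Because the affine cost is linear in gap length, it corresponds to a straight line on a log-count plot, so it cannot follow the bumps at transposon sizes and charges a transposon-sized gap far more than a slower-growing cost would.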

Before and After Chaining

Chaining Algorithm

• Input - blocks of gapless alignments from blastz
• Dynamic program based on the recurrence relationship:

$$\mathrm{score}(B_i) = \max_{j<i}\bigl(\mathrm{score}(B_j) + \mathrm{match}(B_i) - \mathrm{gap}(B_i, B_j)\bigr)$$

• Uses Miller’s KD-tree algorithm to minimize which parts of dynamic programming graph to traverse. Timing is O(N log N), where N is number of blocks (which is in hundreds of thousands)
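
The recurrence can be sketched with a straightforward quadratic dynamic program. This version deliberately omits the KD-tree speedup, so it runs in O(N²) rather than O(N log N). The `Block` fields, the rule that a block may also start a new chain on its own, and the `gap_cost` formula are assumptions for illustration, not the real scoring.

```python
from typing import List, NamedTuple, Tuple


class Block(NamedTuple):
    q_start: int   # start on the query sequence
    t_start: int   # start on the target sequence
    size: int      # gapless length, identical on both sequences
    match: float   # alignment score of the block itself


def gap_cost(prev: Block, cur: Block) -> float:
    """Hypothetical gap cost: charge indel size heavily, shared distance lightly."""
    dq = cur.q_start - (prev.q_start + prev.size)
    dt = cur.t_start - (prev.t_start + prev.size)
    return 1.0 * abs(dq - dt) + 0.1 * min(dq, dt)


def chain_blocks(blocks: List[Block]) -> Tuple[float, List[Block]]:
    """Return the best chain score and its blocks using the O(N^2) recurrence."""
    blocks = sorted(blocks, key=lambda b: (b.q_start, b.t_start))
    if not blocks:
        return 0.0, []
    score = [b.match for b in blocks]   # base case: a block starts its own chain
    back = [-1] * len(blocks)
    for i, cur in enumerate(blocks):
        for j in range(i):
            prev = blocks[j]
            # A predecessor must end before cur begins on both sequences.
            if (prev.q_start + prev.size <= cur.q_start
                    and prev.t_start + prev.size <= cur.t_start):
                cand = score[j] + cur.match - gap_cost(prev, cur)
                if cand > score[i]:
                    score[i], back[i] = cand, j
    best = max(range(len(blocks)), key=score.__getitem__)
    chain, i = [], best
    while i != -1:
        chain.append(blocks[i])
        i = back[i]
    chain.reverse()
    return score[best], chain


a, b, c = Block(0, 0, 10, 100.0), Block(12, 12, 10, 100.0), Block(5, 500, 10, 50.0)
best_score, best_chain = chain_blocks([a, b, c])
```

In the example, blocks `a` and `b` are collinear and join into one chain, while `c` overlaps `a` on the query and so stays on its own.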

Netting Alignments

• Commonly multiple mouse alignments can be found for a particular human region, particularly for coding regions.
• Net finds the best mouse match for each human region.
• Highest scoring chains are used first.
• Lower scoring chains fill in gaps within chains, inducing a natural hierarchy.
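
The greedy idea in these bullets can be sketched as follows: process chains from highest to lowest score and let each one claim only the reference bases not already claimed. This is a simplified illustration that does not record the nesting levels of the hierarchy. The tuple layout `(name, score, blocks)` and the function names are assumptions.

```python
def subtract(interval, covered):
    """Return the parts of `interval` not overlapped by sorted, disjoint `covered`."""
    start, end = interval
    free = []
    for c_start, c_end in covered:
        if c_end <= start or c_start >= end:
            continue                      # no overlap with what is left
        if c_start > start:
            free.append((start, c_start))
        start = max(start, c_end)
        if start >= end:
            break
    if start < end:
        free.append((start, end))
    return free


def build_net(chains):
    """Assign each reference base to the highest-scoring chain that aligns it.

    `chains` is a list of (name, score, blocks), where blocks are half-open
    (start, end) intervals on the reference (human) sequence.
    """
    covered = []
    net = []
    for name, _score, blocks in sorted(chains, key=lambda c: -c[1]):
        for block in blocks:
            for piece_start, piece_end in subtract(block, covered):
                net.append((piece_start, piece_end, name))
        # Pieces never overlap earlier ones, so the list stays disjoint.
        covered = sorted((s, e) for s, e, _ in net)
    return sorted(net)


chains = [
    ("chainA", 1000, [(0, 40), (60, 100)]),   # best chain, with a gap at 40-60
    ("chainB", 300, [(30, 70)]),              # weaker chain overlapping the gap
]
net = build_net(chains)
```

Here `chainB` ends up covering only 40-60, the gap inside `chainA`, which mirrors how lower scoring chains fill gaps in higher scoring ones.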

Net Focuses on Ortholog

Net highlights rearrangements

A large gap in the top level of the net is filled by an inversion containing two genes. Numerous smaller gaps are filled in by local duplications and processed pseudo-genes.

Useful in finding pseudogenes

Ensembl and Fgenesh++ automatic gene predictions are confounded by numerous processed pseudogenes. Domain structure of the resulting predicted protein must be interesting!

Mouse/Human Rearrangement Statistics

Number of rearrangements of given type per megabase excluding known transposons.

A Rearrangement Hot Spot

Rearrangements are not evenly distributed. Roughly 5% of the genome is in hot spots of rearrangements such as this one. This 350,000 base region is between two very long chains on chromosome 7.

Conservation Track at UCSC

• Based on Multiz alignments using tools from Miller lab.
• Conservation score is calculated by Adam Siepel’s PhyloHMM.
• Highlights exons, promoters, enhancers, etc.
• We should have a 10-species vertebrate multiple alignment up soon at UCSC, currently have 8-species.

Reconstructed ancestral (boreutherian) genome for one chromosome

Acknowledgements

• UCSC Bioinformatics: Galt Barber, Robert Baertsch, Gill Bejerano, Mathieu Blanchette, Ron Chao, Hiram Clawson, Mark Diekhans, Jorge Garcia, Patrick Gavin, Rachel Harte, Angie Hinrichs, Fan Hsu, Jennifer Jackson, Donna Karolchik, Robert Kuhn, Yontao Lu, Crystal Lynch, Webb Miller, Jakob Pedersen, Andy Pohl, Katie Pollard, Brian Raney, Kate Rosenbloom, Krishna Roskin, Adam Siepel, Chuck Sugnet, Ali Sultan-Qurraie, Paul Tatarsky, Daryl Thomas, Heather Trumbower, David Haussler

• Penn State Comparative Genomics: Webb Miller, Ross Hardison, Belinda Giardine, Scott Schwartz, Cathy Riemer, Minmei Hou, LouXin Zhang, Jian Ma

• Sequence Data: Sanger Institute, Washington University, Broad Institute, Baylor College of Medicine, Joint Genome Institute, International Human Genome Sequencing Consortium.

• Images: Jackson Labs, Mahoney Lab, Company of Biologists.

• Funding: NHGRI, HHMI, NCI, QB3, UC Santa Cruz

The End