Visualizing Genes and Evolution Jim Kent Genome Bioinformatics Group University of California Santa Cruz

Visualizing Genes and Evolution

Jim Kent
Genome Bioinformatics Group

University of California Santa Cruz

VisiGene

• Image browser for in-situ and other gene-oriented pictures
• Hopefully in the long run will have a million images covering almost all vertebrate genes.
• Currently has 12000 images
– Mouse transcription factor in-situs from Paul Gray
– Imagery from the literature collected by Jackson Labs
• Features:
– Interactive zooming and scrolling
– Extensive database for captions
– Integration into UCSC Genome Browser web site

Current VisiGene: http://hgwdev-kent.cse.ucsc.edu/cgi-bin/hgVisiGene

Clicking on area of interest zooms in on it by 2x

Zooming and scrolling are fast: only the visible area is transmitted to the user.

Caption under image gives specimen details, credits and links to data providers.

For well-studied genes, literature images from Jackson Labs abound.

Integration with Genome Browser is via Known Genes Track


A new link in known gene details page

Gene sorter also links to VisiGene. Here the sorter is used to search for genes expressed in the substantia nigra (target of Parkinson’s disease).

VisiGene Challenges

• Virtual microscope zooming and scrolling
– Precomputed pyramid scheme for storing pictures at various scales
– Javascript to interactively fetch just parts of images needed. (Javascript portability issues!)
• Defining orthologous genes between species.
• Image storage - will terabytes suffice?
• Database for captions and other annotations of images.
• Collecting imagery and data from a wide variety of projects… including Allen Institute?
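The pyramid scheme and partial fetching listed above can be illustrated with a small sketch. It is not VisiGene's code: the tile size, the halving rule and the function names are assumptions made for illustration. It shows how an image might be stored at successively halved scales, and how a viewer could work out which tiles overlap the visible area so that only those are sent.

```python
import math

TILE = 512  # hypothetical tile edge in pixels


def pyramid_levels(width, height, tile=TILE):
    """Return (level, width, height) tuples, halving until the image fits in one tile."""
    levels = []
    level = 0
    while True:
        levels.append((level, width, height))
        if width <= tile and height <= tile:
            break
        width = max(1, math.ceil(width / 2))
        height = max(1, math.ceil(height / 2))
        level += 1
    return levels


def visible_tiles(x, y, view_w, view_h, level_w, level_h, tile=TILE):
    """List (col, row) tiles that overlap a viewport with top-left corner (x, y)."""
    # Clip the viewport to the image at this pyramid level.
    x0, y0 = max(0, x), max(0, y)
    x1, y1 = min(level_w, x + view_w), min(level_h, y + view_h)
    if x0 >= x1 or y0 >= y1:
        return []
    cols = range(x0 // tile, (x1 - 1) // tile + 1)
    rows = range(y0 // tile, (y1 - 1) // tile + 1)
    return [(c, r) for r in rows for c in cols]


levels = pyramid_levels(4096, 2048)
tiles = visible_tiles(500, 0, 100, 100, 4096, 2048)
```

With these assumed numbers a 4096 x 2048 image needs four levels, and a 100 x 100 viewport straddling a tile boundary needs only two tiles rather than the whole picture.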

The Spectrum of Cells

A single fertilized egg develops into a human body, brain and all. During this process perhaps 300 different types of cells arise. Understanding and controlling this process of development is critical for stem cell based medicine.

Cell Lineage Tree of C. elegans, focusing on the gut.

For some animals, such as C. elegans, the complete lineage and type of every cell is worked out.

Full cell lineage for C. elegans, worked out by John Sulston and colleagues using microscope, eye, sketchbook and patience.

Vertebrate development is too complex to work out by hand

Mouse embryos days 7 - 10

Some cell lineages, such as those leading from the hematopoietic stem cell to the various types of cells that make up the blood stream and much of the immune system, are well worked out in vertebrates.

A good deal of stem cell research involves finding marker genes that distinguish between different cell types at various branches in the differentiation tree. In many cases the discovery of new markers has resulted in the definition of new cell types.

Blood cell lineages were worked out with surface markers and a cell sorter.

Many types of cells don’t sort so easily, but usually cell nuclei can be resolved even in fairly complex embryos by microscopy, especially confocal microscopy.

Multiple markers can be used to classify cells into different types with microscopy as well as with sorters. Using quantum dots it is now possible to label simultaneously with a dozen markers.

[Figure: table of example cells with a value in each of nine emission channels, 450 nm to 650 nm in 25 nm steps.]

Note: last two might look same to eye but not to sensors, which have more than three channels.

Ideal Markers

• Unlike cell sorting experiments, would try to pick markers that are each present in 1/3 to 2/3 of cell types rather than markers present in only one cell type.
• The markers would be chosen so that their expression patterns were relatively independent of each other, using resources such as Allen Brain Atlas
• In ideal world, 8 perfect markers could distinguish 256 cell types, in real world we’d hope 12 or 15 well chosen markers would be enough.
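The arithmetic behind these bullets can be checked with a short sketch. It assumes idealized on/off markers that are independent of each other; the function names are invented for illustration. Each perfect marker contributes at most one bit, so n markers separate at most 2^n types, and a marker present in about half of the cell types carries far more information than one present in a single type.

```python
import math


def distinguishable_types(n_markers):
    """Upper bound on cell types separable by n perfect, independent on/off markers."""
    return 2 ** n_markers


def min_markers(n_types):
    """Smallest number of perfect on/off markers giving every type a unique code."""
    return max(1, (n_types - 1).bit_length())


def marker_bits(fraction_present):
    """Information (in bits) from one marker present in the given fraction of types."""
    p = fraction_present
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


print(distinguishable_types(8))   # 256, as on the slide
print(min_markers(300))           # 9 for roughly 300 cell types
print(marker_bits(1 / 2), marker_bits(1 / 3), marker_bits(1 / 300))
```

A marker present in one third of types still yields about 0.92 bits, while a marker specific to one type out of 300 yields about 0.03 bits, which is one way to see why broadly expressed markers suit this approach better than the single-type markers used in sorting.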

Clustering and Beyond

• Once marker levels have been measured on cells, clustering software like that used for microarrays can be used to define cell types.

• Since markers don’t change instantaneously we should observe in embryos ‘trails’ in color space between clusters linking together two cell types.

• We can also have information about what cells are near each other, and potentially information about cell shape for more sophisticated algorithms.
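As a rough illustration of clustering cells by marker levels, here is a minimal single-linkage sketch in plain Python. The choice of single linkage, the distance threshold and the toy marker vectors are all assumptions; the slides do not say which clustering method would be used.

```python
import math


def cluster_cells(cells, max_dist):
    """Single-linkage clustering: cells within max_dist of each other share a cluster.

    cells is a list of equal-length marker-level tuples; the result is a sorted
    list of clusters, each a list of cell indices.
    """
    parent = list(range(len(cells)))

    def find(i):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(cells)):
        for j in range(i + 1, len(cells)):
            if math.dist(cells[i], cells[j]) <= max_dist:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(cells)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical three-marker levels for four nuclei.
cells = [(0.9, 0.1, 0.0), (1.0, 0.0, 0.1), (0.1, 0.9, 0.8), (0.0, 1.0, 0.9)]
print(cluster_cells(cells, max_dist=0.3))
```

Note that single linkage would merge two cell types joined by a trail of intermediate cells, so a real analysis would probably need a method that treats such trails separately from the clusters themselves.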

Cell Spectrum Summary

1) Use brain and gene atlas data to find 10 to 20 nuclear markers with distinct but overlapping expression patterns.
2) Label antibodies with quantum dots.
3) Stain thick slices with labeled antibodies.
4) Capture images with multichannel confocal microscope.
5) Identify nuclei and assign colors to them.
6) Cluster based on color to define cell types.
7) Construct tree of cell types by looking at spatial and temporal data, and looking for intermediate forms.

Comparative Genomics

Webb Miller

Comparative Genomics at BMP10

Conservation of Gene Features

Conservation pattern across 3165 mappings of human RefSeq mRNAs to the genome. A program sampled 200 evenly spaced bases across 500 bases upstream of transcription, the 5’ UTR, the first coding exon, introns, middle coding exons, introns, the 3’ UTR and 500 bases after polyadenylation. There are peaks of conservation at the transition from one region to another.

[Figure: conservation plot; y-axis from 50% to 100%, series labeled aligning and identity.]
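The sampling step described in the caption can be sketched as follows. This is an assumed reading of "200 evenly spaced bases", not the program that produced the plot; the function name and the half-open coordinate convention are illustrative.

```python
def evenly_spaced_positions(start, end, n_samples=200):
    """Return n_samples positions spread evenly over the half-open range [start, end)."""
    length = end - start
    if length <= 0:
        raise ValueError("region must contain at least one base")
    # Integer arithmetic keeps every position inside the region.
    return [start + (i * length) // n_samples for i in range(n_samples)]


# A hypothetical 500-base upstream region starting at position 1000.
positions = evenly_spaced_positions(1000, 1500)
```

Sampling every region at the same number of points is what lets regions of very different lengths, such as a short UTR and a long intron, be averaged on a common x-axis. Regions shorter than the sample count simply repeat some positions.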

Chaining Alignments

• Chaining bridges the gulf between syntenic blocks and base-by-base alignments.

• Local alignments tend to break at transposon insertions, inversions, duplications, etc.

• Global alignments tend to force non-homologous bases to align.

• Chaining is a rigorous way of joining together local alignments into larger structures.

Chains join together related local alignments

Protease Regulatory Subunit 3

Affine penalties are too harsh for long gaps

Log count of gaps vs. size of gaps in mouse/human alignment correlated with sizes of transposon relics. Affine gap scores model red/blue plots as straight lines.
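To make the point about long gaps concrete, the sketch below compares an affine penalty with a penalty that grows more slowly for long gaps. The parameter values and the logarithmic form are hypothetical, chosen only to show the shape of the difference; they are not the scoring used for the alignments in these slides.

```python
import math


def affine_gap_cost(length, gap_open=400, gap_extend=30):
    """Affine penalty: fixed opening cost plus a constant cost per gapped base."""
    return gap_open + gap_extend * length


def concave_gap_cost(length, gap_open=400, scale=300):
    """Hypothetical concave penalty: each extra base costs less as the gap grows."""
    return gap_open + scale * math.log(length)


for size in (1, 10, 300, 10000):
    print(size, affine_gap_cost(size), round(concave_gap_cost(size)))
```

Under the affine rule a 10,000-base gap costs about a hundred times more than under the concave rule with these numbers, so a chain would almost never be allowed to cross a large insertion such as a transposon relic.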

Before and After Chaining

Chaining Algorithm

• Input - blocks of gapless alignments from blastz
• Dynamic program based on the recurrence relationship:

score(B_i) = max_{j<i} ( score(B_j) + match(B_i) - gap(B_i, B_j) )

• Uses Miller’s KD-tree algorithm to minimize which parts of dynamic programming graph to traverse. Timing is O(N log N), where N is number of blocks (which is in hundreds of thousands)
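The recurrence can be illustrated with a minimal sketch. It uses a plain O(N^2) scan over earlier blocks instead of the KD-tree search, and the `Block` fields and `gap_cost` function are assumptions made for illustration. A block that has no worthwhile predecessor starts a new chain with score match(B_i).

```python
from collections import namedtuple

# Hypothetical gapless block: target and query ranges (half-open) plus its match score.
Block = namedtuple("Block", "t_start t_end q_start q_end match")


def gap_cost(prev, cur):
    """Hypothetical penalty that grows with the larger of the two gaps."""
    return 10 + max(cur.t_start - prev.t_end, cur.q_start - prev.q_end)


def chain_blocks(blocks):
    """Return (best_score, chain) using score(Bi) = max_j(score(Bj) + match(Bi) - gap)."""
    blocks = sorted(blocks, key=lambda b: (b.t_start, b.q_start))
    if not blocks:
        return 0, []
    score = [b.match for b in blocks]  # base case: the block starts its own chain
    back = [None] * len(blocks)
    for i, cur in enumerate(blocks):
        for j in range(i):
            prev = blocks[j]
            # A predecessor must end before cur begins in both sequences.
            if prev.t_end <= cur.t_start and prev.q_end <= cur.q_start:
                candidate = score[j] + cur.match - gap_cost(prev, cur)
                if candidate > score[i]:
                    score[i], back[i] = candidate, j
    best = max(range(len(blocks)), key=score.__getitem__)
    chain, i = [], best
    while i is not None:
        chain.append(blocks[i])
        i = back[i]
    chain.reverse()
    return score[best], chain


a = Block(0, 100, 0, 100, 100)
b = Block(150, 250, 160, 260, 100)
c = Block(120, 140, 5000, 5020, 20)   # far off the diagonal
d = Block(300, 400, 310, 410, 100)
best_score, chain = chain_blocks([d, c, b, a])
```

Here the best chain joins a, b and d with a score of 170, while the off-diagonal block c is left out because the gap penalty to reach it exceeds its match score.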

Netting Alignments

• Commonly multiple mouse alignments can be found for a particular human region, particularly for coding regions.

• Net finds the best mouse match for each human region.
• Highest scoring chains are used first.
• Lower scoring chains fill in gaps within chains, inducing a natural hierarchy.
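The "highest scoring chains first, lower scoring chains fill gaps" rule can be sketched as a greedy pass over the human coordinate. This simplified version only records which chain covers each human interval; it does not track nesting levels, and the data layout and names are assumptions for illustration.

```python
def net_chains(chains):
    """Assign each human base to the best-scoring chain that aligns it.

    chains is a list of (name, score, blocks), where blocks are half-open
    (start, end) intervals on the human sequence. Returns sorted
    (start, end, name) intervals that do not overlap.
    """
    covered = []
    for name, _score, blocks in sorted(chains, key=lambda c: -c[1]):
        for start, end in blocks:
            pieces = [(start, end)]
            # Trim away everything already claimed by higher-scoring chains.
            for c_start, c_end, _ in covered:
                remaining = []
                for p_start, p_end in pieces:
                    if c_end <= p_start or p_end <= c_start:
                        remaining.append((p_start, p_end))
                        continue
                    if p_start < c_start:
                        remaining.append((p_start, c_start))
                    if c_end < p_end:
                        remaining.append((c_end, p_end))
                pieces = remaining
            covered.extend((s, e, name) for s, e in pieces)
    return sorted(covered)


chains = [
    ("chainA", 9000, [(0, 100), (200, 300)]),  # best chain, with a gap at 100-200
    ("chainB", 3000, [(80, 220)]),             # overlaps chainA and spans its gap
    ("chainC", 500, [(90, 120)]),              # fully shadowed by better chains
]
print(net_chains(chains))
```

In this toy input chainB ends up covering only the 100-200 gap left by chainA, and chainC contributes nothing because better chains already cover its range.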

Net Focuses on Ortholog

Net highlights rearrangements

A large gap in the top level of the net is filled by an inversion containing two genes. Numerous smaller gaps are filled in by local duplications and processed pseudo-genes.

Useful in finding pseudogenes

Ensembl and Fgenesh++ automatic gene predictions confounded by numerous processed pseudogenes. Domain structure of resulting predicted protein must be interesting!

Mouse/Human Rearrangement Statistics

Number of rearrangements of given type per megabase excluding known transposons.

A Rearrangement Hot Spot

Rearrangements are not evenly distributed. Roughly 5% of the genome is in hot spots of rearrangements such as this one. This 350,000 base region is between two very long chains on chromosome 7.

Conservation Track at UCSC

• Based on Multiz alignments using tools from Miller lab.
• Conservation score is calculated by Adam Siepel’s PhyloHMM.
• Highlights exons, promoters, enhancers, etc.
• We should have a 10-species vertebrate multiple alignment up soon at UCSC; currently have 8-species.

Reconstructed ancestral (boreoeutherian) genome for one chromosome

Acknowledgements

• UCSC Bioinformatics: Galt Barber, Robert Baertsch, Gill Bejerano, Mathieu Blanchette, Ron Chao, Hiram Clawson, Mark Diekhans, Jorge Garcia, Patrick Gavin, Rachel Harte, Angie Hinrichs, Fan Hsu, Jennifer Jackson, Donna Karolchik, Robert Kuhn, Yontao Lu, Crystal Lynch, Webb Miller, Jakob Pedersen, Andy Pohl, Katie Pollard, Brian Raney, Kate Rosenbloom, Krishna Roskin, Adam Siepel, Chuck Sugnet, Ali Sultan-Qurraie, Paul Tatarsky, Daryl Thomas, Heather Trumbower, David Haussler

• Penn State Comparative Genomics: Webb Miller, Ross Hardison, Belinda Giardine, Scott Schwartz, Cathy Riemer, Minmei Hou, LouXin Zhang, Jian Ma

• Sequence Data: Sanger Institute, Washington University, Broad Institute, Baylor College of Medicine, Joint Genome Institute, International Human Genome Sequencing Consortium.

• Images: Jackson Labs, Mahoney Lab, Company of Biologists.

• Funding: NHGRI, HHMI, NCI, QB3, UC Santa Cruz

The End
