Visualizing Genes and Evolution
Jim Kent
Genome Bioinformatics Group
University of California Santa Cruz
VisiGene
• Image browser for in-situ and other gene-oriented pictures
• Hopefully in the long run will have a million images covering almost all vertebrate genes.
• Currently has 12,000 images
– Mouse transcription factor in-situs from Paul Gray
– Imagery from the literature collected by Jackson Labs
• Features:
– Interactive zooming and scrolling
– Extensive database for captions
– Integration into UCSC Genome Browser web site
Current VisiGene: http://hgwdev-kent.cse.ucsc.edu/cgi-bin/hgVisiGene
Clicking on area of interest zooms in on it by 2x
Zooming & scrolling fast - only visible area is transmitted to users.
Caption under image gives specimen details, credits and links to data providers.
For well-studied genes, literature images from Jackson Labs abound.
Integration with Genome Browser is via Known Genes Track
A new link in known gene details page
Gene sorter also links to VisiGene. Here using sorter to search for genes expressed in substantia nigra (target of Parkinson’s disease)
VisiGene Challenges
• Virtual microscope zooming and scrolling
– Precomputed pyramid scheme for storing pictures at various scales
– Javascript to interactively fetch just parts of images needed. (Javascript portability issues!)
• Defining orthologous genes between species.
• Image storage - will terabytes suffice?
• Database for captions and other annotations of images.
• Collecting imagery and data from a wide variety of projects… including Allen Institute?
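The pyramid scheme and partial fetching can be illustrated with a minimal sketch. It assumes each level halves the previous one and that images are cut into fixed-size square tiles; the function names, the 256-pixel tile size and the stopping rule are hypothetical choices for illustration, not a description of how VisiGene itself stores pictures.

```python
def pyramid_levels(width, height, min_size=256):
    """Return (width, height) for each level, halving until the image fits min_size."""
    levels = [(width, height)]
    while max(width, height) > min_size:
        # Round up so an odd dimension still covers the whole image.
        width = max(1, (width + 1) // 2)
        height = max(1, (height + 1) // 2)
        levels.append((width, height))
    return levels


def tiles_for_viewport(x, y, view_w, view_h, tile=256):
    """Return (col, row) indices of the tiles that overlap a viewport at one level."""
    first_col, first_row = x // tile, y // tile
    last_col = (x + view_w - 1) // tile
    last_row = (y + view_h - 1) // tile
    return [(col, row)
            for row in range(first_row, last_row + 1)
            for col in range(first_col, last_col + 1)]


if __name__ == "__main__":
    print(pyramid_levels(1024, 512))
    print(tiles_for_viewport(300, 0, 256, 256))
```

Only the tiles returned for the current viewport need to be sent to the browser, which is why zooming and scrolling can stay fast even for very large images.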
The Spectrum of Cells
A single fertilized egg develops into a human body, brain and all. During this process perhaps 300 different types of cells arise. Understanding and controlling this process of development is critical for stem cell based medicine.
Cell Lineage Tree of C. elegans, focusing on the gut.
For some animals, such as C. elegans, the complete lineage and type of every cell is worked out.
Full cell lineage for C. elegans, worked out by John Sulston and colleagues using microscope, eye, sketchbook and patience.
Vertebrate development is too complex to work out by hand
Mouse embryos days 7 - 10
Some cell lineages, such as those leading from the hematopoietic stem cell to the various types of cells that make up the blood stream and much of the immune system, are well worked out in vertebrates.
A good deal of stem cell research involves finding marker genes that distinguish between different cell types at various branches in the differentiation tree. In many cases the discovery of new markers has resulted in the definition of new cell types.
Blood cell lineages were worked out with surface markers and a cell sorter.
Many types of cells don’t sort so easily, but usually cell nuclei can be resolved even in fairly complex embryos by microscopy, especially confocal microscopy.
Multiple markers can be used to classify cells into different types with microscopy as well as with sorters. Using quantum dots it is now possible to label simultaneously with a dozen markers.
[Figure: per-cell signal across emission channels from 450 nm to 650 nm in 25 nm steps.]
Note: last two might look same to eye but not to sensors, which have more than three channels.
Ideal Markers
• Unlike cell sorting experiments, would try to pick markers that are each present in 1/3 to 2/3 of cell types rather than markers present in only one cell type.
• The markers would be chosen so that their expression patterns were relatively independent of each other, using resources such as Allen Brain Atlas
• In ideal world, 8 perfect markers could distinguish 256 cell types, in real world we’d hope 12 or 15 well chosen markers would be enough.
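The 256 figure is just 2^8: each perfect marker is either on or off, so n markers give at most 2^n distinct on/off signatures. The sketch below assumes markers are strictly binary and uses made-up cell-type names and signatures; it only illustrates the counting argument and the "present in 1/3 to 2/3 of cell types" criterion.

```python
def max_cell_types(n_markers):
    """Upper bound on cell types separable by n on/off markers."""
    return 2 ** n_markers


def all_distinguishable(signatures):
    """True if every cell type has a unique on/off marker signature."""
    return len(set(signatures.values())) == len(signatures)


def marker_fraction(signatures, index):
    """Fraction of cell types in which the marker at `index` is on."""
    on = sum(sig[index] for sig in signatures.values())
    return on / len(signatures)


if __name__ == "__main__":
    # Hypothetical signatures for four cell types and two markers.
    toy = {"type_a": (0, 0), "type_b": (0, 1), "type_c": (1, 0), "type_d": (1, 1)}
    print(max_cell_types(8))          # 256
    print(all_distinguishable(toy))   # True
    print(marker_fraction(toy, 0))    # 0.5
```

A marker present in about half of the cell types carries the most information per marker, which is why markers seen in only one cell type are a poor fit for this approach.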
Clustering and Beyond
• Once marker levels have been measured on cells, clustering software like that used for microarrays can be used to define cell types.
• Since markers don’t change instantaneously, we should observe ‘trails’ in color space in embryos, linking the clusters of two cell types.
• We can also have information about what cells are near each other, and potentially information about cell shape for more sophisticated algorithms.
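As one possible illustration of clustering on marker levels, here is a minimal k-means sketch in plain Python. K-means is a stand-in chosen for brevity; the slides do not name a specific clustering method, and the marker values and starting centers below are invented.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two marker-level vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Index of the nearest center for each point."""
    return [min(range(len(centers)), key=lambda i: squared_distance(p, centers[i]))
            for p in points]


def kmeans(points, centers, iterations=20):
    """Plain k-means from given starting centers; returns (centers, labels)."""
    centers = [tuple(c) for c in centers]
    for _ in range(iterations):
        labels = assign(points, centers)
        new_centers = []
        for i, old in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(old)  # keep an empty cluster's center in place
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)


if __name__ == "__main__":
    # Hypothetical two-marker intensities for four nuclei.
    nuclei = [(0.9, 0.1), (1.0, 0.0), (0.1, 0.9), (0.0, 1.0)]
    centers, labels = kmeans(nuclei, [(1.0, 0.0), (0.0, 1.0)])
    print(labels)
```

Cells on a 'trail' between two clusters would sit far from both centers, so distance to the nearest center is one simple signal for spotting intermediate forms.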
Cell Spectrum Summary1) Use brain and gene atlas data to find 10 to 20
nuclear markers with distinct but overlapping expression patterns.
2) Label antibodies with quantum dots.3) Stain thick slices with labeled antibodies.4) Capture images with multichannel confocal
microscope.5) Identify nuclei and assign colors to them.6) Cluster based on color to define cell types.7) Construct tree of cell types by looking at spatial and
temporal data, and looking for intermediate forms.
Comparative Genomics
Webb Miller
Comparative Genomics at BMP10
Conservation of Gene Features
Conservation pattern across 3165 mappings of human RefSeq mRNAs to the genome. A program sampled 200 evenly spaced bases across 500 bases upstream of transcription, the 5’ UTR, the first coding exon, introns, middle coding exons, introns, the 3’ UTR and 500 bases after polyadenylation. There are peaks of conservation at the transition from one region to another.
[Figure: conservation profile across gene features; y-axis is aligning identity, 50% to 100%.]
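Sampling a fixed number of evenly spaced bases lets regions of different lengths be averaged position by position. The sketch below shows one way this could be done; the function names and the integer rounding rule are assumptions, since the slide does not describe the program's internals.

```python
def evenly_spaced(start, end, n=200):
    """Return n positions spread evenly over the half-open interval [start, end)."""
    length = end - start
    return [start + (i * length) // n for i in range(n)]


def mean_profile(profiles):
    """Average several equal-length profiles position by position."""
    return [sum(column) / len(column) for column in zip(*profiles)]


if __name__ == "__main__":
    positions = evenly_spaced(0, 500)
    print(len(positions), positions[0], positions[-1])
    print(mean_profile([[1.0, 0.0], [0.0, 1.0]]))
```

Regions shorter than the number of samples would yield repeated positions under this rounding rule, so very short features need separate handling.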
Chaining Alignments
• Chaining bridges the gulf between syntenic blocks and base-by-base alignments.
• Local alignments tend to break at transposon insertions, inversions, duplications, etc.
• Global alignments tend to force non-homologous bases to align.
• Chaining is a rigorous way of joining together local alignments into larger structures.
Chains join together related local alignments
Protease Regulatory Subunit 3
Affine penalties are too harsh for long gaps
Log count of gaps vs. size of gaps in mouse/human alignment correlated with sizes of transposon relics. Affine gap scores model red/blue plots as straight lines.
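To see why affine penalties punish long gaps so heavily, compare an affine cost with a cost that grows more slowly with length. Both functions and all the constants below are hypothetical, chosen only to show the shapes; they are not the gap costs used in the actual alignments.

```python
import math


def affine_gap(length, open_cost=400, extend_cost=30):
    """Affine cost: fixed opening cost plus a constant cost per gap base."""
    return open_cost + extend_cost * length


def concave_gap(length, open_cost=400, scale=300):
    """Illustrative concave cost: grows with the log of the gap length (length >= 1)."""
    return open_cost + scale * math.log(length)


if __name__ == "__main__":
    for length in (1, 10, 100, 1000, 10000):
        print(length, affine_gap(length), round(concave_gap(length)))
```

Under the affine cost a 10,000-base gap, such as one left by a transposon insertion, costs roughly a thousand times more than a 10-base gap, while the concave cost rises only modestly.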
Before and After Chaining
Chaining Algorithm
• Input - blocks of gapless alignments from blastz
• Dynamic program based on the recurrence relationship:
score(Bi) = max over j<i of (score(Bj) + match(Bi) - gap(Bi, Bj))
• Uses Miller’s KD-tree algorithm to limit which parts of the dynamic programming graph are traversed. Timing is O(N log N), where N is the number of blocks (which is in the hundreds of thousands)
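The recurrence above can be sketched directly as a quadratic-time dynamic program. This version checks every earlier block instead of using a KD-tree, so it is O(N^2) rather than O(N log N); the `Block` type, the linear `gap_cost` and its per-base constant are hypothetical placeholders.

```python
from typing import NamedTuple


class Block(NamedTuple):
    t_start: int   # start on the target (e.g. human) sequence
    q_start: int   # start on the query (e.g. mouse) sequence
    size: int      # gapless, so the same length on both sequences
    match: float   # score of the block itself


def gap_cost(prev, cur, per_base=0.1):
    """Hypothetical linear cost for the unaligned bases between two blocks."""
    dt = cur.t_start - (prev.t_start + prev.size)
    dq = cur.q_start - (prev.q_start + prev.size)
    return per_base * (dt + dq)


def best_chain(blocks):
    """Return (score, chain) for the highest-scoring chain of blocks."""
    blocks = sorted(blocks, key=lambda b: (b.t_start, b.q_start))
    if not blocks:
        return 0.0, []
    score = [b.match for b in blocks]   # a chain may start at any block
    back = [None] * len(blocks)
    for i, cur in enumerate(blocks):
        for j in range(i):
            prev = blocks[j]
            # A predecessor must end before cur starts on both sequences.
            if (prev.t_start + prev.size <= cur.t_start
                    and prev.q_start + prev.size <= cur.q_start):
                candidate = score[j] + cur.match - gap_cost(prev, cur)
                if candidate > score[i]:
                    score[i], back[i] = candidate, j
    i = max(range(len(blocks)), key=score.__getitem__)
    best = score[i]
    chain = []
    while i is not None:
        chain.append(blocks[i])
        i = back[i]
    chain.reverse()
    return best, chain


if __name__ == "__main__":
    a = Block(0, 0, 10, 10.0)
    b = Block(20, 20, 10, 10.0)
    c = Block(15, 100, 5, 3.0)
    print(best_chain([a, b, c]))
```

In the toy input, block `c` lies far off the diagonal on the query, so the best chain joins `a` and `b` and leaves `c` out.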
Netting Alignments
• Commonly multiple mouse alignments can be found for a particular human region, particularly for coding regions.
• Net finds the best mouse match for each human region.
• Highest scoring chains are used first.
• Lower scoring chains fill in gaps within chains, inducing a natural hierarchy.
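The "highest scoring chains first, lower scoring chains fill the gaps" idea can be sketched as a greedy pass over intervals on the human sequence. This is a simplified illustration only: it records which chain covers each stretch but does not build the nested hierarchy, and the data layout is an assumption.

```python
def subtract(interval, covered):
    """Parts of (start, end) that do not overlap any of the disjoint covered intervals."""
    start, end = interval
    free = []
    for c_start, c_end in sorted(covered):
        if c_end <= start or c_start >= end:
            continue  # no overlap with what is left of the interval
        if c_start > start:
            free.append((start, c_start))
        start = max(start, c_end)
        if start >= end:
            break
    if start < end:
        free.append((start, end))
    return free


def greedy_net(chains):
    """chains: list of (name, score, blocks), blocks being (start, end) on the human sequence.

    Returns sorted (start, end, name) pieces, each base assigned to the
    highest-scoring chain that aligns to it.
    """
    assigned = []
    covered = []
    for name, _score, blocks in sorted(chains, key=lambda chain: -chain[1]):
        for block in blocks:
            for piece_start, piece_end in subtract(block, covered):
                assigned.append((piece_start, piece_end, name))
        covered = [(s, e) for s, e, _ in assigned]
    return sorted(assigned)


if __name__ == "__main__":
    chains = [("A", 100, [(0, 10), (20, 30)]), ("B", 50, [(5, 25)])]
    print(greedy_net(chains))
```

Here chain B only keeps the stretch from 10 to 20, which is the gap inside the higher-scoring chain A.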
Net Focuses on Ortholog
Net highlights rearrangements
A large gap in the top level of the net is filled by an inversion containing two genes. Numerous smaller gaps are filled in by local duplications and processed pseudo-genes.
Useful in finding pseudogenes
Ensembl and Fgenesh++ automatic gene predictions confounded by numerous processed pseudogenes. Domain structure of resulting predicted protein must be interesting!
Mouse/Human Rearrangement Statistics
Number of rearrangements of given type per megabase excluding known transposons.
A Rearrangement Hot Spot
Rearrangements are not evenly distributed. Roughly 5% of the genome is in hot spots of rearrangements such as this one. This 350,000 base region is between two very long chains on chromosome 7.
Conservation Track at UCSC
• Based on Multiz alignments using tools from Miller lab.
• Conservation score is calculated by Adam Siepel’s PhyloHMM.
• Highlights exons, promoters, enhancers, etc.
• We should have a 10-species vertebrate multiple alignment up soon at UCSC, currently have 8-species.
Reconstructed ancestral (boreutherian) genome for one chromosome
Acknowledgements
• UCSC Bioinformatics: Galt Barber, Robert Baertsch, Gill Bejerano, Mathieu Blanchette, Ron Chao, Hiram Clawson, Mark Diekhans, Jorge Garcia, Patrick Gavin, Rachel Harte, Angie Hinrichs, Fan Hsu, Jennifer Jackson, Donna Karolchik, Robert Kuhn, Yontao Lu, Crystal Lynch, Webb Miller, Jakob Pedersen, Andy Pohl, Katie Pollard, Brian Raney, Kate Rosenbloom, Krishna Roskin, Adam Siepel, Chuck Sugnet, Ali Sultan-Qurraie, Paul Tatarsky, Daryl Thomas, Heather Trumbower, David Haussler
• Penn State Comparative Genomics: Webb Miller, Ross Hardison, Belinda Giardine, Scott Schwartz, Cathy Riemer, Minmei Hou, LouXin Zhang, Jian Ma
• Sequence Data: Sanger Institute, Washington University, Broad Institute, Baylor College of Medicine, Joint Genome Institute, International Human Genome Sequencing Consortium.
• Images: Jackson Labs, Mahoney Lab, Company of Biologists.
• Funding: NHGRI, HHMI, NCI, QB3, UC Santa Cruz
The End