SyMAP Master's Thesis Presentation

  • Published on
    27-Jan-2015

DESCRIPTION

My master's thesis on SyMAP, a synteny mapping and analysis program.

Transcript

<ul><li> 1. SyMAP: Synteny Mapping and Analysis Program, Austin Shoemaker</li></ul> <p> 2. SyMAP Team </p> <ul><li>Dr. Cari Soderlund </li></ul> <ul><li>Dr. Will Nelson </li></ul> <ul><li>Austin Shoemaker </li></ul> <ul><li><ul><li>Interactive SyMAP views </li></ul></li></ul> <ul><li><ul><li>Sytry </li></ul></li></ul> <ul><li><ul><li><ul><li>Testing environment for synteny finding algorithms </li></ul></li></ul></li></ul> <ul><li><ul><li>Worked with the team on: </li></ul></li></ul> <ul><li><ul><li><ul><li>The synteny finding algorithm </li></ul></li></ul></li></ul> <ul><li><ul><li><ul><li>MySQL database schema </li></ul></li></ul></li></ul> <p> 3. Background </p> <ul><li>Comparative Genomics </li></ul> <ul><li>Physical Map </li></ul> <ul><li>Computing Synteny </li></ul> <ul><li>Properties of FPC to Genome Synteny </li></ul> <p> 4. Comparative Genomics </p> <ul><li>Compare genomes of different species </li></ul> <ul><li>Knowledge of one helps understand the other </li></ul> <ul><li><ul><li>Gene Function </li></ul></li></ul> <ul><li><ul><li><ul><li>Organism \(O_1\) has a gene \(G_1\) </li></ul></li></ul></li></ul> <ul><li><ul><li><ul><li>Organism \(O_2\) has a gene \(G_2\) with a sequence similar to \(G_1\) </li></ul></li></ul></li></ul> <ul><li><ul><li><ul><li>\(G_1\) and \(G_2\) may have similar functions </li></ul></li></ul></li></ul> <ul><li><ul><li>Evolutionary History </li></ul></li></ul> <ul><li><ul><li><ul><li>Genome rearrangements </li></ul></li></ul></li></ul> <p> 5. Genome Rearrangements </p> <p>[Table: Rearrangement, Scenario, and Result for inversion, duplication, insertion, and deletion; e.g., inversion turns A BCD E into A DCB E]</p> <p> 6. Whole-Genome Duplication </p> <ul><li>mya (million years ago) </li></ul> <p>[Figure: timeline from the last common ancestor; rice and maize diverged 50-70 mya; duplications at 70 mya and 11 mya]</p> <p> 7. 
Synteny </p> <ul><li>At least two pairs of genes with similar structure and function on the same chromosome </li></ul> <ul><li><ul><li>Order does not need to be conserved </li></ul></li></ul> <ul><li>Often found using sequenced genomes </li></ul> <ul><li>We use a physical map and a genomic sequence </li></ul> <p>[Figure: genes c, d, e, f, g shown on Genome A and on Genome B]</p> <p> 8. Physical Map </p> <ul><li>Expensive to sequence large genomes </li></ul> <ul><li>A physical map provides partial ordering of pieces of DNA and pieces of genes </li></ul> <p> 9. FPC Map </p> <ul><li>FingerPrinted Contigs </li></ul> <ul><li><ul><li><ul><li>Soderlund et al. 1997 </li></ul></li></ul></li></ul> <ul><li>Type of physical map </li></ul> <ul><li>Made up of clones </li></ul> <ul><li><ul><li>Snippets of DNA </li></ul></li></ul> <ul><li><ul><li>We use BAC clones </li></ul></li></ul> <ul><li><ul><li><ul><li>Bacterial artificial chromosome clones </li></ul></li></ul></li></ul> <ul><li><ul><li>Stored in clone libraries </li></ul></li></ul> <p> 10. Making a BAC Clone Library </p> <ul><li>Take thousands of copies of a genome </li></ul> <ul><li>Cut them into overlapping pieces (~150,000 base pairs) </li></ul> <ul><li><ul><li>Restriction enzymes </li></ul></li></ul> <ul><li><ul><li><ul><li>Proteins that cut at specific DNA sequences </li></ul></li></ul></li></ul> <ul><li><ul><li>Partial digestion </li></ul></li></ul> <ul><li><ul><li><ul><li>Restriction enzymes are not allowed to cut at every possible site, so the clones overlap </li></ul></li></ul></li></ul> <p> 11. Clones </p> <ul><li>Each clone is stored in a well on a microtiter plate </li></ul> <ul><li>The order of the clones and the position of each clone on the chromosome are unknown </li></ul> <p> 12. Clone Fingerprinting </p> <ul><li>Clones are fingerprinted to gather more information about each clone </li></ul> <ul><li>Fully digest a clone using restriction enzymes </li></ul> <ul><li>If two clones share many fragments, they may overlap </li></ul> <p> 13. 
Clone Fingerprinting </p> <ul><li>Fragments are run on a gel </li></ul> <ul><li><ul><li>Shorter fragments migrate faster </li></ul></li></ul> <ul><li><ul><li>Measure migration rate </li></ul></li></ul> <ul><li>False positives and false negatives </li></ul> <p> 14. FPC </p> <ul><li>Assembles fingerprinted clones into contigs </li></ul> <ul><li><ul><li>Contig -&gt; contiguous overlapping clones </li></ul></li></ul> <ul><li>Assembles into many contigs instead of one large contig</li></ul> <ul><li><ul><li>Unclonable regions </li></ul></li></ul> <ul><li><ul><li>Uneven distribution </li></ul></li></ul> <p> 15. Markers </p> <ul><li>Markers are pieces of DNA </li></ul> <ul><li><ul><li>~300 base pairs </li></ul></li></ul> <ul><li>Hybridization </li></ul> <ul><li><ul><li>A marker hybridizes to a clone when the clone contains the marker </li></ul></li></ul> <p> 16. BESs </p> <ul><li>Expensive to sequence entire clones </li></ul> <ul><li>BAC End Sequences </li></ul> <ul><li><ul><li>BESs are sequences from the ends of BAC clones </li></ul></li></ul> <ul><li><ul><li>~800 base pairs </li></ul></li></ul> <ul><li><ul><li>Do not know which end the sequence comes from </li></ul></li></ul> <ul><li><ul><li>There are errors in the sequence </li></ul></li></ul> <p> 17. Anchors </p> <ul><li>Locations in two genomes found to be similar through a comparison of DNA sequences </li></ul> <ul><li>We use marker sequences and BESs searched against a known genome sequence </li></ul> <ul><li><ul><li>Maize has an FPC map with markers and BESs </li></ul></li></ul> <ul><li><ul><li>The rice genome is sequenced </li></ul></li></ul> <p>[Figure: alignment of two similar sequences, GGCCGTGGTGCTCTTTGCAATGGG and GGCTGTGGTGCTCTTCGCAATGGG]</p> <p> 18. Component Summary </p> <p> 19. Finding Chains </p> <p> 20. Key Synteny Finding Algorithms </p> <ul><li>Vandepoele et al. 
(2002) ADHoRe </li></ul> <ul><li><ul><li>Variable gap size </li></ul></li></ul> <ul><li><ul><li>Coefficient of determination used to judge the quality of a synteny block </li></ul></li></ul> <ul><li>Haas et al. (2004) DAGchainer </li></ul> <ul><li><ul><li>Directed acyclic graph </li></ul></li></ul> <ul><li><ul><li>Dynamic programming </li></ul></li></ul> <ul><li><ul><li>Gap penalty </li></ul></li></ul> <p> 21. Other Synteny Finding Algorithms </p> <ul><li>Key characteristics for us: </li></ul> <ul><li><ul><li>Dynamic programming </li></ul></li></ul> <ul><li><ul><li><ul><li>Ordering the anchors to form a DAG </li></ul></li></ul></li></ul> <ul><li><ul><li>Gap penalty </li></ul></li></ul> <ul><li><ul><li>Variable gap size </li></ul></li></ul> <ul><li>Not appropriate for finding synteny using an FPC map </li></ul> <ul><li><ul><li>Do not consider the error conditions that arise </li></ul></li></ul> <p> 22. FPC to Genome Synteny </p> <ul><li>Properties associated with FPC </li></ul> <ul><li><ul><li>FPC maps do not cover the entire genome </li></ul></li></ul> <ul><li><ul><li>False-positive and false-negative marker hybridizations </li></ul></li></ul> <ul><li><ul><li>FPC coordinates are approximate </li></ul></li></ul> <ul><li><ul><li>Which end of the parent clone a BES belongs to is unknown </li></ul></li></ul> <p> 23. FPC Synteny Properties </p> <p>[Figure: anchors between Genome A (FPC map) and Genome B (sequenced genome)]</p> <p> 24. Noise </p> <p> 25. 
SyMAP Algorithm </p> <ul><li>Anchor \((a_k, b_l)\)</li></ul> <ul><li><ul><li>\(a_k\) is the location on the FPC map of genome \(G_A\) </li></ul></li></ul> <ul><li><ul><li>\(b_l\) is the location on the genomic sequence of \(G_B\) </li></ul></li></ul> <ul><li>Directed Acyclic Graph </li></ul> <ul><li><ul><li>\(E = \{(u, v) : |a_k - a_i| \le M_A \text{ and } 0 \le b_l - b_j \le M_B\}\)</li></ul></li></ul> <ul><li><ul><li><ul><li>where \(u = (a_i, b_j)\), \(v = (a_k, b_l)\) are anchors </li></ul></li></ul></li></ul> <ul><li><ul><li>Allows edges decreasing along \(G_A\) </li></ul></li></ul> <ul><li><ul><li><ul><li>Catch off-diagonal anchors </li></ul></li></ul></li></ul> <ul><li><ul><li><ul><li>Some inversions </li></ul></li></ul></li></ul> <p> 26. SyMAP Algorithm </p> <ul><li>Manhattan distance function with scaling </li></ul> <ul><li><ul><li>\(D(u, v) = |a_k - a_i| / t_A + |b_l - b_j| / t_B\) </li></ul></li></ul> <ul><li><ul><li>Average distance between anchors may differ between \(G_A\) and \(G_B\) </li></ul></li></ul> <ul><li>Dynamic Programming </li></ul> <ul><li><ul><li>\(\mathrm{Node}(v) = 1 + \max\bigl(0, \max_{u \in P(v)} (\mathrm{Node}(u) - D(u, v))\bigr)\)</li></ul></li></ul> <ul><li><ul><li><ul><li>\(P(v)\) is the set of nodes \(u\) with an edge \((u, v) \in E\) </li></ul></li></ul></li></ul> <ul><li><ul><li>1 is the score given to an individual anchor </li></ul></li></ul> <ul><li><ul><li>Plus the maximum path score for a previous node </li></ul></li></ul> <ul><li><ul><li>Penalized by the distance between the nodes </li></ul></li></ul> <p> 27. 
SyMAP Algorithm </p> <ul><li>Chains must satisfy constraints </li></ul> <ul><li><ul><li><ul><li>Number of anchors </li></ul></li></ul></li></ul> <ul><li><ul><li><ul><li>Strength of line</li></ul></li></ul></li></ul> <ul><li><ul><li><ul><li><ul><li>Pearson correlation coefficient </li></ul></li></ul></li></ul></li></ul> <ul><li><ul><li>Chains must be more precisely linear the closer they are to the minimum number of anchors </li></ul></li></ul> <ul><li><ul><li>Exception for small and dense chains </li></ul></li></ul> <ul><li><ul><li><ul><li>Lower correlation due to errors in the assignment of BES ends or clone ordering within a contig </li></ul></li></ul></li></ul> <p> 28. Sytry </p> <ul><li>Tool for testing synteny finding algorithms </li></ul> <ul><li>Allows for modifying the parameters of an algorithm and rerunning </li></ul> <ul><li>Results are shown as a dot plot </li></ul> <ul><li><ul><li>Results need to be confirmed visually </li></ul></li></ul> <ul><li><ul><li>"Correct" is what looks right to the user </li></ul></li></ul> <p> 29. Automated Parameter Setting </p> <ul><li>Difficult to set parameters (e.g., \(t_A\) and \(t_B\)) </li></ul> <ul><li><ul><li>Effects of changes can be unclear </li></ul></li></ul> <ul><li><ul><li>Dependent on average distance between anchors and noise </li></ul></li></ul> <ul><li><ul><li><ul><li>Optimal values vary between regions </li></ul></li></ul></li></ul> <ul><li>Have the algorithm set the gap parameters </li></ul> <ul><li><ul><li>Attempt to optimize \(t_x\) for each chain </li></ul></li></ul> <p> 30. Sub-Chains </p> <ul><li>Overall orientation of a synteny chain may not be accurate for sub-chains </li></ul> <p> 31. Sub-Chain Finder </p> <ul><li>Use only anchors that are part of a chain </li></ul> <ul><li>Define the distance between two anchors as the number of anchors that fall between them </li></ul> <ul><li>A significant gap signals the start of a possible inversion </li></ul> <p> 32. 
Sub-Chains </p> <ul><li>Evolutionary history </li></ul> <ul><li><ul><li>e.g., total number of inversions </li></ul></li></ul> <ul><li>Assigning an accurate orientation to all anchors in a chain </li></ul> <ul><li><ul><li>Beneficial for fixing the clone end assignment of BESs </li></ul></li></ul> <p> 33. BES Clone End Assignments </p> <ul><li>BESs are arbitrarily assigned to clone ends </li></ul> <ul><li><ul><li>Algorithm takes this into account </li></ul></li></ul> <ul><li><ul><li>However, the displayed synteny can be distorted </li></ul></li></ul> <ul><li>Orientation can be used to correct BES assignments </li></ul> <p> 34. BES Clone End Assignments </p> <ul><li>positive orientation -&gt; lines should not cross </li></ul> <p>[Figure: BES anchors between A and B, positive orientation]</p> <p> 35. BES Clone End Assignments </p> <ul><li>negative orientation -&gt; lines should cross </li></ul> <p>[Figure: BES anchors between A and B, negative orientation]</p> <p> 36. SyMAP Views </p> <ul><li>Accessible through a web browser </li></ul> <ul><li>Static views </li></ul> <ul><li><ul><li>All synteny blocks to sequenced chromosomes </li></ul></li></ul> <ul><li><ul><li>Synteny blocks to sequenced chromosome </li></ul></li></ul> <ul><li>Interactive views </li></ul> <ul><li><ul><li>Dot plot view </li></ul></li></ul> <ul><li><ul><li><ul><li>Genome to genome </li></ul></li></ul></li></ul> <ul><li><ul><li><ul><li>Chromosome to chromosome </li></ul></li></ul></li></ul> <ul><li><ul><li>Alignment view </li></ul></li></ul> <ul><li><ul><li><ul><li>FPC to sequenced chromosome </li></ul></li></ul></li></ul> <ul><li><ul><li><ul><li>FPC to FPC </li></ul></li></ul></li></ul> <ul><li><ul><li><ul><li>FPC to sequenced chromosome to FPC </li></ul></li></ul></li></ul> <ul><li><ul><li>Close-up view </li></ul></li></ul> <ul><li><ul><li><ul><li>FPC to sequenced chromosome </li></ul></li></ul></li></ul> <p> 37. All Blocks to Sequenced Chromosomes </p> <p> 38. Blocks to Sequenced Chromosome </p> <p> 39. Genome to Genome Dot Plot </p> <p> 40. 
Chromosome to Chromosome Dot Plot </p> <p> 41. Block to Sequenced Chromosome </p> <p> 42. Subset Flipped </p> <p> 43. Contig to Sequenced Chromosome </p> <p> 44. Filters and Controls </p> <p> 45. FPC to Sequenced Chromosome to FPC </p> <p> 46. FPC to FPC </p> <p> 47. Close-up of Gene </p> <p> 48. SyMAP Implementation </p> <ul><li>Caching is needed: </li></ul> <ul><li><ul><li>Downloads large amounts of data from remote database </li></ul></li></ul> <ul><li><ul><li>History feature </li></ul></li></ul> <ul><li><ul><li><ul><li>Navigating back and forth between the same views </li></ul></li></ul></li></ul> <ul><li>Soft References </li></ul> <ul><li><ul><li>Cached objects remain alive as long as memory is available </li></ul></li></ul> <ul><li>Data objects </li></ul> <ul><li><ul><li>Hold data in a compact form </li></ul></li></ul> <ul><li><ul><li>Converted to view objects when needed </li></ul></li></ul> <p> 49. Results </p> <ul><li>www.agcol.arizona.edu/symap </li></ul> <ul><li><ul><li>Maize and sorghum aligned to rice </li></ul></li></ul> <ul><li><ul><li>Maize FPC aligned to sorghum FPC </li></ul></li></ul> <ul><li>Used in editing the maize FPC maps based on their alignment to rice (Wei et al., in preparation)</li></ul> <ul><li>Alignment of maize to rice chromosome 3 </li></ul> <ul><li><ul><li>Buell et al. (2005)</li></ul></li></ul> <ul><li>Used in OMAP project</li></ul> <ul><li><ul><li>Aligning 12 species of rice to the sequenced genome of rice (Wing et al., in preparation) </li></ul></li></ul> <p> 50. Acknowledgements </p> <ul><li>Thesis Committee </li></ul> <ul><li><ul><li>Dr. Cari Soderlund, thesis advisor </li></ul></li></ul> <ul><li><ul><li>Dr. Peter Downey </li></ul></li></ul> <ul><li><ul><li>Dr. Kobus Barnard </li></ul></li></ul> <ul><li>This work is funded in part by NSF DBI #0115903 </li></ul> <ul><li>www.agcol.arizona.edu/symap </li></ul>
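<p>Slide 48 describes caching downloaded data behind soft references so that moving back and forth through the view history does not trigger another download from the remote database. The Java sketch below illustrates that idea only. The class name SoftCache, its loader function, and the loadCount method are hypothetical names chosen for this example and are not taken from the SyMAP source.</p>

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * Illustrative cache whose values are held through soft references.
 * All names here are assumptions made for the example.
 */
final class SoftCache<K, V> {
    // Each value is wrapped so the garbage collector may reclaim it
    // when the JVM is running low on memory.
    private final Map<K, SoftReference<V>> entries = new HashMap<>();
    // Stand-in for the expensive step, e.g. a query to a remote database.
    private final Function<K, V> loader;
    private int loads = 0;

    SoftCache(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Returns the cached value, reloading it if it was never loaded or was reclaimed. */
    synchronized V get(K key) {
        SoftReference<V> ref = entries.get(key);
        V value = (ref == null) ? null : ref.get();
        if (value == null) {
            value = loader.apply(key);
            loads++;
            entries.put(key, new SoftReference<>(value));
        }
        return value;
    }

    /** Number of times the loader was invoked; useful for checking cache hits. */
    synchronized int loadCount() {
        return loads;
    }
}
```

<p>A caller would construct the cache once with a loader and then request the same key each time a view is revisited. Because a soft reference can be cleared at any time under memory pressure, get always has to be prepared to reload, which is why the loader is kept inside the cache rather than supplied only on the first request.</p>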