Combinatorial Algorithms for Maximum Likelihood Tag SNP Selection and Haplotype Inference

Ion Mandoiu

University of Connecticut

CS&E Department

Outline

• Biological background

• Maximum likelihood tag SNP selection

• Maximum likelihood population haplotyping

• Ongoing and future work

Genomic Variation and SNPs

• Human genome: ≈ 3 × 10^9 base pairs

• Main form of variation between individual genomes: single nucleotide polymorphisms (SNPs)

– Single-base changes in the genome sequence that occur in a significant proportion (more than 1 percent) of the population

– Most SNPs are bi-allelic

• Total number of SNPs: ≈ 1 × 10^7

• Difference between any two individuals: ≈ 3 × 10^6 SNPs (≈ 0.1% of the entire genome)

Haplotypes and Genotypes

• Diploid organisms: cells have two homologous sets of chromosomes

• Haplotype: description of SNP alleles on a chromosome

– 0/1 vector, e.g., 00110101 (0 is for the major allele, 1 is for the minor allele)

• Genotype: combined description of SNP alleles on pairs of homologous chromosomes

– 0/1/2 vector, e.g., 01122110 (0 = 0+0, 1 = 1+1, 2 = 0+1 or 1+0)

– Each genotype with k ≥ 1 2's can be explained by 2^(k-1) pairs of haplotypes

Example: the haplotype pairs (1101, 0001) and (1001, 0101) both yield the genotype 2201.

Computational Challenges

• Limitations of current technologies:

– High cost per (user-selected) SNP → tag SNP selection problem

– Find genotypes, not haplotypes → haplotype inference problem

• Effective solutions require combining accurate probabilistic models with scalable combinatorial optimization techniques!

Two-Stage Sampling Methodology

• Pilot Study

– All SNPs of interest are genotyped in a small sample of the population

– Common haplotypes are inferred using statistical methods

– A set of tag SNPs is selected

• Population Study

– Tag SNPs are genotyped in the remaining population

– Statistical methods are used to infer haplotypes over the tag SNPs

– Haplotypes over the tag SNPs are extrapolated to full haplotypes

[Figure: Flow 1: Haplotype-Extrapolation. Panels: Pilot Study, Population Study. Boxes: Population Sample, Genotypes (all SNPs), Sample haplotypes (with frequencies), Tag SNP Set, Remaining Population, Genotypes (tag SNPs), Haplotype pairs (tag SNPs), Haplotype pairs (all SNPs). Steps: Phasing, Tag SNP Selection, Extrapolation.]

[Figure: Flow 2: Genotype-Extrapolation. Panels: Pilot Study, Population Study. Boxes: Population Sample, Genotypes (all SNPs), Sample haplotypes (with frequencies), Tag SNP Set, Remaining Population, Genotypes (tag SNPs), Haplotype pairs (all SNPs). Steps: Phasing, Tag SNP Selection, Extrapolation.]

Previous Works on Tag SNP Selection

• Statistical correlation based methods

– Poor control over the number of tag SNPs

• [Bafna et al. 03] Informative SNP Set Problem

– Find set of k SNPs with maximum “informativeness”

• [Sebastiani et al. 03] Best Enumeration of SNP Tags (BEST)

– Finds minimum number of SNPs that distinguishes all given haplotypes

– No control over the number of tag SNPs!

Fully Informative Tag SNP Set Selection by Integer Programming

• Given: haplotypes h1, h2, …, hm over n SNPs

• Find: minimum number of tag SNPs

• Such that: every two distinct haplotypes differ in at least one tag SNP

Integer Program Formulation

• 0/1 variable x_j for every SNP j

- x_j = 1 if SNP j is selected as a tag SNP

- x_j = 0 otherwise

$$\min \sum_{j=1}^{n} x_j$$

$$\text{s.t.} \quad \sum_{j \,:\, h_i(j) \neq h_{i'}(j)} x_j \ge 1, \qquad 1 \le i < i' \le m$$

• Can be solved efficiently using general-purpose solvers such as CPLEX

- In practice significantly faster than BEST
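To make the covering constraint concrete, the sketch below solves tiny instances of the fully informative tag SNP problem by exhaustive search. It is a stand-in for illustration only: it does not call CPLEX or any other integer programming solver, and the function name `min_tag_snps` is hypothetical.

```python
from itertools import combinations

def min_tag_snps(haplotypes):
    """Smallest set of SNP indices on which all distinct haplotypes differ.

    Brute-force substitute for the integer program: one covering
    constraint per pair of distinct haplotypes, subsets tried by size.
    """
    n = len(haplotypes[0])
    distinct = sorted(set(haplotypes))
    # For each pair of distinct haplotypes, the SNPs where they differ;
    # at least one of these must be chosen as a tag.
    constraints = [
        {j for j in range(n) if a[j] != b[j]}
        for a, b in combinations(distinct, 2)
    ]
    for size in range(n + 1):
        for tags in combinations(range(n), size):
            chosen = set(tags)
            if all(chosen & c for c in constraints):
                return list(tags)
    # Not reached: selecting all n SNPs satisfies every constraint.
    raise AssertionError("unreachable")

print(min_tag_snps(["0000", "0011", "1100", "1111"]))  # [0, 2]
```

Exhaustive search is exponential in the number of SNPs, so it is only useful for checking small examples; the point of the integer program is that a general-purpose solver handles realistic sizes.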

Extrapolation Approaches

• [Halperin et al. 05]

– Each SNP genotype predicted individually

– Only immediate neighbor tag SNPs used in prediction

• [He&Zelikovsky 06]

– Each SNP genotype predicted individually

– All tag SNPs used in prediction

• Maximum likelihood

– Pick the most likely full genotype compatible with the short genotype over tag SNPs

– Full genotype predicted in a single step
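The maximum likelihood rule can be sketched in a few lines, assuming the genotype frequencies are available as a dictionary from full genotype strings to probabilities. The function name `ml_extrapolate` and the data layout are assumptions made for this illustration.

```python
def ml_extrapolate(tag_genotype, tags, genotype_freq):
    """Most likely full genotype that agrees with tag_genotype on the tag SNPs.

    tags          -- list of tag SNP indices
    tag_genotype  -- observed 0/1/2 values at those indices, in the same order
    genotype_freq -- dict mapping full genotype strings to probabilities
    Returns None when no known genotype is compatible.
    """
    compatible = {
        g: p
        for g, p in genotype_freq.items()
        if all(g[j] == t for j, t in zip(tags, tag_genotype))
    }
    if not compatible:
        return None
    # The whole genotype is predicted in one step, not SNP by SNP.
    return max(compatible, key=compatible.get)

freq = {"0120": 0.5, "0121": 0.2, "2120": 0.3}
print(ml_extrapolate("01", [0, 1], freq))  # 0120
```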

Tag Selection for Maximum Likelihood Genotype Extrapolation

Idea: Select K tag SNPs maximizing correct prediction probability
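One way to read this idea: under maximum likelihood extrapolation, all full genotypes that look the same on the tag SNPs receive the same prediction, namely the most frequent one among them, so the probability of a correct prediction is the sum of those per-group maxima. The sketch below computes that quantity and picks the best K tags by exhaustive search. The function names and the dictionary-based frequency table are assumptions, and exhaustive search is used only for illustration.

```python
from itertools import combinations

def correct_prediction_probability(tags, genotype_freq):
    """Probability that ML extrapolation from the given tags is correct."""
    best = {}
    for g, p in genotype_freq.items():
        key = tuple(g[j] for j in tags)
        # Within each group of genotypes sharing the same tag values,
        # only the most frequent one is predicted correctly.
        best[key] = max(best.get(key, 0.0), p)
    return sum(best.values())

def select_tags(genotype_freq, k):
    """Exhaustively pick k SNP indices maximizing the probability above."""
    n = len(next(iter(genotype_freq)))
    return max(
        combinations(range(n), k),
        key=lambda tags: correct_prediction_probability(tags, genotype_freq),
    )

freq = {"00": 0.5, "01": 0.3, "10": 0.2}
print(select_tags(freq, 1))  # (1,)
```

In this toy table, tagging SNP 0 cannot separate 00 from 01 and is correct with probability 0.5 + 0.2, while tagging SNP 1 cannot separate 00 from 10 and is correct with probability 0.5 + 0.3, so SNP 1 is preferred.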

[Figure: example with two tag SNPs (Tag SNP 1, Tag SNP 2)]

Experimental Setup

• Synthetic datasets generated following [Forton et al. 05]

- 2 populations (European and West African) and 2 genomic regions (IL8 and 5q31)

- For each of the 4 population/region combinations, we used haplotypes and frequencies inferred in [Forton et al. 05] from the real data to generate 5 datasets containing between 200 and 1000 individuals

- Fixed block size of 10 SNPs

- For each dataset we picked 5 random samples of size 50

• Maximum likelihood (ML) flows 1 and 2 were compared to the Multivariate Linear Regression (MLR) algorithm of [He&Zelikovsky 06]

- Genotype frequencies were estimated either from the haplotype frequencies used to generate the datasets (pop) or from haplotype frequencies inferred from the sample using PHASE (phase)

[Figure: Haplotype Accuracy. Horizontal axis ticks 1 to 9; legend entries: ML2-pop, ML2-phase]

[Figure: Genotype Accuracy. Horizontal axis ticks 1 to 9; legend entries: ML2-pop, ML2-phase]

Population Haplotyping Problem

Given the set G of genotypes observed in a population of individuals, infer a set H of haplotypes explaining G

Numerous approaches: entropy minimization, perfect phylogeny, Bayesian networks, pure parsimony, …

Maximum likelihood approach:

1. Estimate for each haplotype h its probability ph in the population under study

2. Find set H that explains G and has maximum likelihood

Graph Theoretical Reformulation

• Haplotypes → graph vertices

- Weight of vertex h = -log(p_h)

• Genotypes → edge colors

- Edge (h, h') with color g iff g can be explained by haplotypes h and h'

Minimum Weight Multi-Colored Subgraph Problem (MWMCSP): find a minimum-weight set of vertices that induces at least one edge of each given color

[Figure: example graph; labels: 11012201, 10012201, 1201, 2101]
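The reformulation can be tried on a toy instance. The sketch below searches all vertex subsets for a minimum-weight set that induces every genotype color, with vertex weights -log(p_h). It makes two assumptions that are not stated on the slide: a vertex paired with itself counts as an edge (so a genotype without 2's can be explained by two copies of one haplotype), and the candidate haplotypes with their probabilities are given as a dictionary. The function names are illustrative, and the exhaustive search is not one of the algorithms cited in this talk.

```python
from itertools import combinations
from math import log

def genotype(h1, h2):
    """0/1/2 genotype explained by the haplotype pair (h1, h2)."""
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))

def ml_haplotype_set(genotypes, hap_prob):
    """Exhaustive search for a min-weight haplotype set explaining every genotype."""
    haps = sorted(hap_prob)
    weight = {h: -log(hap_prob[h]) for h in haps}
    needed = set(genotypes)
    best, best_weight = None, float("inf")
    for size in range(1, len(haps) + 1):
        for subset in combinations(haps, size):
            # Colors of all edges induced by the subset, loops included (assumption).
            colors = {genotype(a, b) for a in subset for b in subset}
            w = sum(weight[h] for h in subset)
            if needed <= colors and w < best_weight:
                best, best_weight = set(subset), w
    return best, best_weight

probs = {"1101": 0.4, "0001": 0.3, "1001": 0.2, "0101": 0.1}
print(ml_haplotype_set(["2201", "1201"], probs))
```

Here genotype 1201 forces the pair (1101, 1001), and 2201 is then explained more cheaply by adding 0001 (probability 0.3) than by adding 0101 (probability 0.1), so the search returns the set containing 1101, 1001 and 0001.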

Approximation Algorithms

• [Lancia et al. 02]

- Algorithms with approximation factors of (for unweighted version) and q, where n is the number of genotypes and q is the maximum number of haplotype pairs compatible with a genotype

• [Huang et al. 05]

- O(log n) approximation using semidefinite programming, but the big-O constant hides a factor of q

• [Hassin&Segev 05]

- Greedy algorithm with approximation factor of

• [Hajiaghayi et al. 06]

- LP-rounding algorithm with approximation factor of ) log O( nq

Integer Program Formulation

• Extends formulation of [Gusfield 03]

• 0/1 variable x_u for every vertex u

- x_u is set to 1 if u is selected, 0 otherwise

• 0/1 variable y_e for every edge e

- y_e is set to 1 if e is induced by selected vertices, 0 otherwise
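A natural way to write an objective and constraints over these variables is shown below. It is offered as an assumption consistent with the MWMCSP definition, not as the exact program from the slide. Here w_u = -log(p_u) is the vertex weight, G is the set of genotypes (colors), and E_g, a symbol introduced for this sketch, is the set of edges with color g.

```latex
\begin{align*}
\min \quad & \sum_{u} w_u \, x_u \\
\text{s.t.} \quad & \sum_{e \in E_g} y_e \ge 1 && \text{for every genotype } g \in G \\
& y_e \le x_u && \text{for every edge } e \text{ and each endpoint } u \text{ of } e \\
& x_u,\, y_e \in \{0, 1\}
\end{align*}
```

The first constraint asks for at least one induced edge of each color; the second allows an edge to count as induced only when both of its endpoints are selected.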

Haplotype Frequency Estimation

• Accurate haplotype frequency estimation becomes key to overall accuracy of likelihood maximization methods

• Important to capture frequencies of haplotypes that may not appear in the sample – phasing and counting gives poor estimates

• Existing high-quality algorithms, e.g., Haplofreq [Halperin&Hazan 05], do not scale well in running time

HMM-Based Frequency Estimation

• Hidden Markov Models (HMMs) are uniquely suited for modeling haplotype frequencies in a population

• Recently used very successfully in haplotype inference [Rastas et al. 05] and disease association [Kimmel&Shamir 05]

– Main computational bottleneck: HMM training based on genotype data

HMM-Based Frequency Estimation

• Good compromise in the context of two-stage experiments

– Sample consisting of trios (child, mother, and father)

– Sample phased using fast trio-aware phasing method (e.g., entropy phasing [Pasaniuc&M 06])

– HMM trained on resulting (highly accurate) haplotypes

– Haplotype frequencies computed efficiently using k-shortest paths algorithm
