2002Wang_Haplotype Inference by Maximum Parsimony


  • 8/3/2019 2002Wang_Haplotype Inference by Maximum Parsimony


    Haplotype Inference by Maximum Parsimony

    Lusheng Wang

    Department of Computer Science

    City University of Hong Kong

    Kowloon, Hong Kong, China

    E-mail: [email protected]

    Ying Xu

    Department of Computer Science

    Peking University

    Beijing, 100871, China

    E-mail: [email protected]

    Abstract

    Motivation: Haplotypes have been attracting increasing attention because of their importance in the analysis of fine-scale molecular-genetics data. Since direct sequencing of haplotypes via experimental methods is both time-consuming and expensive, methods that infer haplotypes from genotype samples have become attractive alternatives.

    Results: We propose a new model for haplotype inference that finds a minimum-size set of haplotypes explaining the genotype samples. Experiments on both real data and simulation data support the new model. Our new approach outperforms the existing methods in many cases.

    Availability: The software HAPAR is free for non-commercial uses. Available upon request

    ([email protected]).


    1 Introduction

    Single nucleotide polymorphisms (SNPs) are the most frequent form of human genetic variation. The SNP sequence information from each of the two copies of a given chromosome in a diploid genome is a haplotype. Haplotype information has been attracting great attention in recent years because of its importance in the analysis of fine-scale molecular-genetics data, for example in mapping complex disease genes, inferring population histories, and drug design. However, current routine sequencing methods typically provide genotype information rather than haplotype information. A genotype gives, at each position, the unordered pair of values found on the two copies of a chromosome in a diploid organism, but not the connection between adjacent positions, that is, which values lie on the same copy. Since direct sequencing of haplotypes via experimental methods is both time-consuming and expensive, in silico haplotyping methods have become attractive alternatives.

    The haplotype inference problem is as follows. We are given an $n \times k$ genotype matrix in which each cell has value 0, 1, or 2. Each of the $n$ rows is a vector associated with the sites of interest on the two copies of a chromosome of a diploid organism. The state of any site on a single copy of a chromosome is either 0 or 1. A cell $(i, j)$ in the $i$-th row has value 0 or 1 if the $j$-th site has that state on both copies, and value 2 if both states are present at the site (one on each copy). A cell is resolved if it has value 0 or 1, and ambiguous if it has value 2. The goal is to determine, under some mathematical model, which copy of the chromosome has value 1 and which has value 0 at each site with value 2.
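The 0/1/2 encoding can be illustrated with a minimal Python sketch. The function names `conflate` and `ambiguous_sites` and the string representation of haplotypes are assumptions made for this illustration; they are not taken from HAPAR.

```python
def conflate(h1, h2):
    """Combine two 0/1 haplotype strings into one 0/1/2 genotype string."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same number of sites")
    # A site keeps its state if both copies agree; otherwise it becomes 2.
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


def ambiguous_sites(genotype):
    """Return the indices of the ambiguous cells (value 2) of a genotype."""
    return [j for j, value in enumerate(genotype) if value == "2"]


genotype = conflate("0110", "0101")
print(genotype)                   # 0122
print(ambiguous_sites(genotype))  # [2, 3]
```

Haplotype inference is the inverse of `conflate`: given only the genotype `0122`, decide which pair of haplotypes produced it.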

    In 1990, Clark first observed that genotypes from population samples are useful in reconstructing haplotypes and proposed an inference method. Since then, many algorithms and programs have been developed for the haplotype inference problem. The existing algorithms can be divided into four primary categories. The first category is Clark's inference rule approach, exemplified in Clark (1990) and extended in Gusfield (2001), which tries to maximize the number of resolved vectors. The second category is the expectation-maximization (EM) method, which looks for the set of haplotypes maximizing the posterior probability of the given genotypes (Excoffier and Slatkin, 1995; Hawley and Kidd, 1995; Long et al., 1995; Chiano and Clayton, 1998). The third category consists of several recently proposed statistical methods based on Bayesian estimators and Gibbs sampling (Stephens et al., 2001; Niu et al., 2002; Zhang et al., 2001). Finally, adopting the no-recombination assumption, Gusfield proposed a model that finds a set of haplotypes forming a perfect phylogeny (Gusfield, 2002; Bafna et al., 2002).

    In this paper, we propose a new model for the haplotype inference problem that finds a minimum-size set of haplotypes explaining the given genotypes. Based on the new model, an exact algorithm is designed and implemented. Experiments on both real data and simulation data support the new model. Simulation results show that our approach outperforms existing methods in most cases.


    2 The new model

    Given an $n \times k$ genotype matrix $M$, where each cell has value 0, 1, or 2, a $2n \times k$ haplotype matrix $Q$ that explains $M$ is obtained as follows: (1) duplicate each row $i$ in $M$ to create a pair of rows $i$ and $i'$ (for $i = 1, 2, \ldots, n$) in $Q$, and (2) re-set each pair of cells $Q(i, j) = 2$ and $Q(i', j) = 2$ to be either $Q(i, j) = 0$ and $Q(i', j) = 1$, or $Q(i, j) = 1$ and $Q(i', j) = 0$, in the resulting $2n \times k$ matrix $Q$. Each row in $M$ is called a genotype. For a genotype $m_i$ (the $i$-th row in $M$), the resulting pair of rows $Q_i$ and $Q_{i'}$ forms a resolution of $m_i$. We also say that $Q_i$ and $Q_{i'}$ resolve the genotype $m_i$. For a given genotype matrix $M$, if there are $h$ 2's in $M$, then any of the $2^h$ possible haplotype matrices can explain $M$. Thus, without further assumptions, it is hard to infer the haplotypes.

    Here we propose a new model that finds a minimum-size set of haplotypes explaining the genotype samples: given an $n \times k$ genotype matrix $M$, find a $2n \times k$ haplotype matrix $Q$ such that the number of distinct rows in $Q$ is minimized. Here $n$ is often referred to as the sample size. The computational problem is called the minimum number of origins problem.
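As a rough illustration of these definitions, the sketch below lists the resolutions of one genotype and counts the distinct rows in a choice of resolutions. The names `resolutions` and `solution_size` are hypothetical. Because the two rows of a pair are interchangeable, the sketch lists unordered pairs, so a genotype with $t \ge 1$ ambiguous sites yields $2^{t-1}$ of them.

```python
from itertools import product


def resolutions(genotype):
    """List the unordered haplotype pairs that resolve a 0/1/2 genotype string."""
    amb = [j for j, value in enumerate(genotype) if value == "2"]
    if not amb:
        # No ambiguous site: the only resolution is the genotype paired with itself.
        return [(genotype, genotype)]
    pairs = []
    # Fix the first ambiguous site of the first haplotype to 0 so that
    # (a, b) and (b, a) are not both listed.
    for bits in product("01", repeat=len(amb) - 1):
        first, second = list(genotype), list(genotype)
        for j, bit in zip(amb, ("0",) + bits):
            first[j] = bit
            second[j] = "1" if bit == "0" else "0"
        pairs.append(("".join(first), "".join(second)))
    return pairs


def solution_size(chosen):
    """Number of distinct haplotypes (distinct rows of Q) in the chosen resolutions."""
    return len({h for pair in chosen for h in pair})


print(resolutions("0122"))
# [('0100', '0111'), ('0101', '0110')]
print(solution_size([("0100", "0111"), ("0100", "0100")]))  # 2
```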

    2.1 Supports of the new model

    Our new method is based on the parsimony principle, which attempts to minimize the total number of haplotypes observed in the sample. The parsimony principle is one of the most basic principles in nature and has been applied to numerous biological problems. In fact, Clark's inference algorithm, which has been extensively used in practice and shown to be useful (Clark et al., 1998; Rieder et al., 1999; Drysdale et al., 2000), can also be viewed as a sort of parsimony approach. However, to apply Clark's algorithm, there must be homozygotes or single-site heterozygotes in the sample. Our method overcomes this obstacle by proposing a global optimization goal.

    The characteristics of real biological data also justify our method. The number of haplotypes existing in a large population is actually very small, whereas the genotypes derived from this limited number of haplotypes show great diversity. Theoretically, given $m$ haplotypes, there are $m(m-1)/2$ possible pairs to form genotypes. (Even if Hardy-Weinberg equilibrium is violated and some combinations are rare, the number is still quite large.) When a population is studied, the number of haplotypes can be taken as a fixed constant, while the number of distinct genotypes is determined by the size of the sequenced sample, which is relatively large. Intuitively, when the genotype sample size $n$ is large enough, the corresponding number of haplotypes $m$ is relatively small. Thus, our approach has a good chance of correctly recovering the whole haplotype set.

    A real example strongly supports the above arguments.


    Nucleotide  -1023  -709  -654  -468  -367   -47   -20    46    79   252   491   523   Representation
    Alleles       G/A   C/A   G/A   C/G   T/C   T/C   T/C   G/A   C/G   G/A   C/T   C/A
    h1              A     C     G     C     T     T     T     A     C     G     C     C   100000010000
    h2              A     C     G     G     C     C     C     G     G     G     C     C   100111101000
    h3              G     A     A     C     T     T     T     A     C     G     C     C   011000010000
    h4              G     C     A     C     T     T     T     A     C     G     C     C   001000010000
    h5              G     C     A     C     T     T     T     G     C     G     C     C   001000000000
    h6              G     C     G     C     T     T     T     G     C     A     C     A   000000000101
    h7              G     C     G     C     T     T     T     G     C     A     T     A   000000000111
    h8              G     C     A     C     T     T     T     A     C     A     C     A   001000010101
    h9              G     C     G     C     T     T     T     G     C     A     C     C   000000000100
    h10             G     C     G     C     T     T     T     G     C     G     C     C   000000000000

    Table 1: 10 haplotypes of β2AR genes. Nucleotide number is the position of the site, based on the first nucleotide of the starting codon being +1. Alleles gives the two nucleotide possibilities at each SNP site. These data are from Drysdale et al. (2000). The original paper gave 12 haplotypes of 13 SNP sites. In this table, we list only the 10 haplotypes that were found in the asthmatic cohort, and we exclude one rare SNP site that did not show ambiguity in the sample. The last column of each haplotype is the representation of that haplotype. Each haplotype is a vector of SNP values. For each SNP, we assume the first nucleotide in Alleles to be the wild type (represented with 0) and the second one to be the mutant (represented with 1).

    2.2 A real example exactly fitting the new model

    β2-adrenergic receptors (β2ARs) are G protein-coupled receptors that mediate the actions of catecholamines in multiple tissues. In Drysdale et al. (2000), 13 variable sites within a span of 1.6 kb were reported in the human β2AR gene. Only 10 haplotypes were found in the studied asthmatic cohort, far fewer than the $2^{13} = 8192$ theoretically possible combinations. Eighteen distinct genotypes were identified in the sample of 121 individuals. The 10 haplotypes and 18 genotypes are listed in Table 1 and Table 2, respectively. In this data set, the number of genotypes (18) is relatively large with respect to the number of haplotypes (10). Computation shows that the minimum number of haplotypes needed to generate the 18 genotypes is 10, and given the 18 genotypes as input, the set of haplotypes inferred by our algorithm (see next section) is exactly the original set.
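The relation between the representations in Table 1 and the genotype values in Table 2 can be checked mechanically. The sketch below does so for a handful of rows copied from the two tables; the helper `conflate` and the dictionary layout are assumptions made for this illustration only.

```python
# Representations of four haplotypes, copied from Table 1.
HAPLOTYPES = {
    "h2": "100111101000",
    "h4": "001000010000",
    "h6": "000000000101",
    "h7": "000000000111",
}

# (true resolution, value) for four genotypes, copied from Table 2.
GENOTYPES = {
    "g1": (("h2", "h4"), "202222222000"),
    "g3": (("h2", "h6"), "200222202202"),
    "g16": (("h4", "h7"), "002000020222"),
    "g18": (("h6", "h7"), "000000000121"),
}


def conflate(h1, h2):
    """Combine two 0/1 haplotype strings into one 0/1/2 genotype string."""
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


# Every listed genotype value should equal the conflation of its resolution.
mismatches = [
    name
    for name, ((a, b), value) in GENOTYPES.items()
    if conflate(HAPLOTYPES[a], HAPLOTYPES[b]) != value
]
print(mismatches)  # []
```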

    3 An exact algorithm

    To verify our model with more real data and simulation data, we design an exact algorithm to find

    the optimal solution for the minimum number of origins problem. The basic idea is very simple.

    Given a set of genotypes (rows in the genotype matrix) M = {m1, m2, . . . , mn}, the algorithm works


    genotype  resolution  value
    g1        (h2,h4)     202222222000
    g2        (h2,h2)     100111101000
    g3        (h2,h6)     200222202202
    g4        (h4,h4)     001000010000
    g5        (h4,h6)     002000020202
    g6        (h2,h5)     202222202000
    g7        (h4,h9)     002000020200
    g8        (h1,h4)     202000010000
    g9        (h1,h6)     200000020202
    g10       (h2,h10)    200222202000
    g11       (h2,h3)     222222222000
    g12       (h2,h7)     200222202222
    g13       (h2,h8)     202222222202
    g14       (h3,h4)     021000010000
    g15       (h4,h5)     001000020000
    g16       (h4,h7)     002000020222
    g17       (h4,h8)     001000010202
    g18       (h6,h7)     000000000121

    Table 2: 18 genotypes of β2AR genes. The second column of each genotype is the (true) resolution of that genotype. For example, genotype g1 is resolved by haplotypes h2 and h4. The third column is the representation of that genotype. Each genotype is a vector of SNP values, and the value of each SNP is 0, 1 or 2, standing for homozygous wild type, homozygous mutant, or heterozygous, respectively.


    Figure 1: An example of a search tree containing three levels, i.e., there are three rows

    (genotypes) in the given genotype matrix. s1 = 4, s2 = 2, and s3 = 3.

    as follows: (1) list all possible resolutions for each genotype; (2) construct a haplotype matrix by choosing one resolution for each genotype; and (3) examine all such haplotype matrices by a branch-and-bound search and output a matrix with the minimum number of distinct haplotypes.

    Consider a genotype $m_i$ (the $i$-th row in $M$). Let $s_i$ be the number of possible resolutions to $m_i$. All possible resolutions for $m_i$ are stored in a list $\{r^i_1 = (h^i_{1,1}, h^i_{1,2}), \ldots, r^i_{s_i} = (h^i_{s_i,1}, h^i_{s_i,2})\}$. A haplotype matrix $(r^1_{j_1}, r^2_{j_2}, \ldots, r^n_{j_n})$ consists of $n$ resolutions, one for each genotype $m_i$ in $M$. A partial solution consists of $i$ resolutions $(r^1_{j_1}, \ldots, r^i_{j_i})$, where $i \le n$. The size of a solution is defined as the number of distinct haplotypes contained in these $n$ resolutions. We are looking for a solution with the minimum size.

    We use depth-first search to exhaustively enumerate all $\prod_{i=1}^{n} s_i$ possible solutions and find an optimal one. Figure 1 gives an example of a search tree containing $\prod_{i=1}^{n} s_i$ possible solutions (leaves). The search tree is too large to explore in full, so we use a branch-and-bound approach to reduce the search space. Assume that we have already obtained a solution of size $x$. If, during the search, we obtain a partial solution of size at least $x$, then we do not extend it any further, since it cannot lead to a solution smaller than $x$. This can save a lot of time.
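The depth-first search with this pruning rule might be sketched as follows. This is an illustrative Python sketch under our own assumptions, not the C++ code of HAPAR: the function name `min_haplotypes` is hypothetical, and each genotype is represented only by its list of candidate resolutions.

```python
def min_haplotypes(resolution_lists):
    """Return (size, chosen) for a choice of one resolution per genotype
    that minimizes the number of distinct haplotypes.

    resolution_lists[i] is the list of candidate pairs for genotype i.
    """
    best = {"size": float("inf"), "chosen": None}

    def search(level, chosen, used):
        # Bound: a partial solution already as large as the best known
        # solution cannot be extended to a strictly smaller one.
        if len(used) >= best["size"]:
            return
        if level == len(resolution_lists):
            best["size"], best["chosen"] = len(used), list(chosen)
            return
        for pair in resolution_lists[level]:
            chosen.append(pair)
            search(level + 1, chosen, used | set(pair))
            chosen.pop()

    search(0, [], frozenset())
    return best["size"], best["chosen"]


# Toy input: genotypes 22, 02 and 20 with their possible resolutions.
lists = [
    [("00", "11"), ("01", "10")],  # genotype 22
    [("00", "01")],                # genotype 02
    [("00", "10")],                # genotype 20
]
size, chosen = min_haplotypes(lists)
print(size)    # 3
print(chosen)  # [('01', '10'), ('00', '01'), ('00', '10')]
```

On this toy input the first leaf reached has size 4; the second branch then improves the bound to 3.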

    Theoretically, the running time of the above described algorithm could still be exponential in

    terms of the input size. In order to make our program efficient enough for practical use, we made

    some further improvements to reduce the running time.

    Improvement 1: Choosing a tight initial bound

    If we have a good (small) solution, then we are able to prune many unnecessary search paths. Before executing the branch-and-bound search, we run a greedy algorithm to obtain a solution and use its size as the initial bound. The coverage of a haplotype is the number of genotypes that the haplotype can resolve. The coverage of a resolution is the sum of the coverages of its two haplotypes. The greedy algorithm simply chooses for each genotype a resolution with maximum coverage to form a solution. This heuristic algorithm can often give a solution with size close to the optimum.
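A minimal sketch of such a greedy initial bound is given below. The name `greedy_solution` is hypothetical, and breaking ties by taking the first resolution with maximum coverage is an assumption, since the text does not specify a tie-breaking rule.

```python
from collections import Counter


def greedy_solution(resolution_lists):
    """Pick, for each genotype, a resolution of maximum coverage."""
    # coverage[h] = number of genotypes having a resolution that contains h.
    coverage = Counter()
    for candidates in resolution_lists:
        for h in {h for pair in candidates for h in pair}:
            coverage[h] += 1
    # max() returns the first maximal pair, which is our assumed tie-break.
    chosen = [
        max(candidates, key=lambda pair: coverage[pair[0]] + coverage[pair[1]])
        for candidates in resolution_lists
    ]
    size = len({h for pair in chosen for h in pair})
    return size, chosen


lists = [
    [("00", "11"), ("01", "10")],  # genotype 22
    [("00", "01")],                # genotype 02
    [("00", "10")],                # genotype 20
]
print(greedy_solution(lists))
# (4, [('00', '11'), ('00', '01'), ('00', '10')])
```

Here both resolutions of genotype 22 have coverage 4, and the tie-break happens to pick the worse one, giving size 4 although a solution of size 3 exists. The greedy result is therefore only an upper bound that the subsequent search may improve.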

    Improvement 2: Reducing the search space

    Since we only report one optimal solution, if some possible resolutions are equally good, we can keep only one representative and discard the others. We only consider two cases in our program.

    Case 1: Two resolutions to the same genotype $m_i$ both have coverage 2. In this case, none of the four haplotypes contained in the two resolutions appears in any other genotype. Thus, we keep just one of the resolutions in the resolution list of $m_i$.

    Case 2: Consider two genotypes $m_i$ and $m_j$. Suppose $m_i$ has two resolutions (h1, h2) and (h4, h5), and $m_j$ has two resolutions (h2, h3) and (h5, h6). If h1, h3, h4 and h6 have coverage 1, and h2 and h5 have coverage 2, then we only have to keep the combination (h1, h2) and (h2, h3) and delete the combination (h4, h5) and (h5, h6).
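Case 1 can be sketched as a simple filter over the resolution lists. The function name `prune_case1` is hypothetical, and keeping the first of several equivalent resolutions is an arbitrary choice made for this sketch. Case 2 is not covered here.

```python
from collections import Counter


def prune_case1(resolution_lists):
    """For each genotype, keep only one resolution of coverage 2."""
    # coverage[h] = number of genotypes having a resolution that contains h.
    coverage = Counter()
    for candidates in resolution_lists:
        for h in {h for pair in candidates for h in pair}:
            coverage[h] += 1
    pruned = []
    for candidates in resolution_lists:
        kept, seen_private = [], False
        for pair in candidates:
            # Coverage 2 means both haplotypes occur in this genotype only.
            private = coverage[pair[0]] + coverage[pair[1]] == 2
            if private and seen_private:
                continue  # interchangeable with a resolution already kept
            seen_private = seen_private or private
            kept.append(pair)
        pruned.append(kept)
    return pruned


lists = [
    [("000", "111"), ("001", "110"), ("010", "101"), ("011", "100")],  # genotype 222
    [("000", "001")],                                                  # genotype 002
]
print(prune_case1(lists))
# [[('000', '111'), ('001', '110'), ('010', '101')], [('000', '001')]]
```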

    The idea can be extended to more sophisticated cases. However, those cases do not happen very often and may not help much in practice. After applying this trick, the number of possible resolutions and the number of candidate haplotypes (haplotypes contained in resolutions) are dramatically cut down. For example, when we run our program on the ACE data containing 11 individuals with 52 SNPs (see Section 4 for details), the number of candidate haplotypes is only 483, far fewer than the total of $2^{52}$ possible haplotypes.

    We implemented the above algorithm in C++. The program, HAPAR, is available upon request. It takes a file containing genotype data as input and outputs the resolved haplotypes. With all these improvements, our program is fairly efficient. For example, HAPAR takes only 2.25 minutes on our computer to process the ACE data. In contrast, PHASE, a program based on Gibbs sampling, takes 12 minutes to process the ACE data in the same environment. Programs based on the EM method cannot handle this data set at all.

    4 Results

    We ran our program HAPAR on a large amount of real biological data as well as simulation data to demonstrate its performance. We also compared our program with four well-known existing programs: HAPINFERX, Emdecoder, PHASE, and Haplotyper. HAPINFERX is an implementation of Clark's algorithm (Clark, 1990) and was kindly provided by A. G. Clark. Emdecoder uses an EM algorithm and was downloaded from J. Liu's homepage (http://www.people.fas.harvard.edu/junliu/). PHASE is a Gibbs sampler for haplotyping (Stephens et al., 2001) and was downloaded from M. Stephens' homepage. Haplotyper is a Bayesian method and was downloaded from J. Liu's homepage. We discuss the different data sets in the following subsections.


    program      result haplotype number   correct haplotype number   recovery rate   accuracy rate
    HAPAR        7                         7                          0.54            1
    Haplotyper   13                        7 or 9                     0.54 or 0.69    0.54 or 0.69
    HAPINFERX    7                         7                          0.54            1
    PHASE        13                        7                          0.54            0.54

    Table 3: Comparison of performance of four programs on ACE data set. Haplotyper has two different accuracy rates and recovery rates because it gives different results in multiple runs.

    4.1 Angiotensin converting enzyme

    Angiotensin converting enzyme (encoded by the gene DCP1, also known as ACE) catalyses the conversion of angiotensin I to the physiologically active peptide angiotensin II. Due to its key function in the renin-angiotensin system, many association studies have been performed with DCP1. Rieder et al. (1999) completed the genomic sequencing of DCP1 from 11 individuals and identified 78 varying sites in 22 chromosomes. Of the 78 varying sites, 52 are non-unique polymorphic sites, and complete data on these 52 biallelic markers are available. Thirteen distinct haplotypes were resolved from the sample.

    We ran the four programs, HAPAR, Haplotyper, HAPINFERX and PHASE, on the ACE data set (11 individuals/10 genotypes, 52 SNPs, 13 haplotypes). (Emdecoder is limited in the number of SNPs in the genotype data and is thus excluded.) The results are summarized in Table 3. The recovery rate is defined as

    $$\text{recovery rate} = \frac{\text{no. of correctly detected haplotypes}}{\text{total no. of true haplotypes}}.$$

    The accuracy rate is defined as

    $$\text{accuracy rate} = \frac{\text{no. of correctly detected haplotypes}}{\text{total no. of inferred haplotypes}}.$$
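The two rates can be computed as in the following sketch, where haplotypes are compared as sets. The function names and the toy haplotype sets are assumptions for illustration; the last line only re-derives two rounded values from the counts reported in Table 3.

```python
def recovery_rate(inferred, true):
    """Fraction of the true haplotypes that were correctly detected."""
    inferred, true = set(inferred), set(true)
    return len(inferred & true) / len(true)


def accuracy_rate(inferred, true):
    """Fraction of the inferred haplotypes that are correct."""
    inferred, true = set(inferred), set(true)
    return len(inferred & true) / len(inferred)


# Toy case: a program that reports only two haplotypes, both correct.
true_haps = {"000", "011", "101", "110"}
inferred_haps = {"000", "011"}
print(recovery_rate(inferred_haps, true_haps))  # 0.5
print(accuracy_rate(inferred_haps, true_haps))  # 1.0

# Counts from Table 3: 7 or 9 correct haplotypes out of 13.
print(round(7 / 13, 2), round(9 / 13, 2))  # 0.54 0.69
```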

    Most programs recover 7 of the 13 original haplotypes. The low performance is due to the relatively small number of distinct genotypes. In fact, 3 genotypes are resolved by 6 haplotypes each of which appears only once, so there is not enough information for any of the 4 programs to resolve those 3 genotypes successfully. In some runs, Haplotyper guesses one resolution correctly (thus 9 haplotypes are correctly recovered), but it does not give a consistent result. Note that, unlike statistical methods (e.g. Haplotyper and PHASE), HAPAR and HAPINFERX do not try to guess resolutions for genotypes when there is not enough information. Therefore, both of them report only 7 haplotypes, with a 100% accuracy rate. In contrast, Haplotyper and PHASE report 13 haplotypes, some of which are inaccurate.


    4.2 Simulations on random data sets

    In this subsection, we use simulation data to evaluate the different programs. First, m haplotypes, each containing k SNP sites, were randomly generated. Then a sample of n genotypes was generated, each formed by randomly picking two haplotypes and conflating them. (Here n is the sample size, which may be larger than the number of distinct genotypes.) A haplotyping program resolved those genotypes and output the inferred haplotypes, which were then compared with the original haplotypes to evaluate the performance of the program.

    An important question is how to generate random haplotypes. Since some algorithms impose particular assumptions on the evolutionary model, simulation data sets generated under a given model will favor some programs and penalize others. As pointed out in Niu et al. (2002), "The performances of different in silico haplotyping methods is subtler than it appears to be - that is, the model underlying the simulation study can greatly affect the conclusion." We therefore decided not to adopt any particular evolutionary model when generating haplotypes. We simply set every bit in every haplotype randomly and independently to 0 or 1 with some probability. (To test the applicability of our method under different conditions, we also ran simulations on data sets favoring particular evolutionary models, which are discussed later in this paper.)
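The data generation for this simulation might look like the sketch below. Several details are assumptions, because the text leaves them open: the bit probability `p` defaults to 0.5, the m generated haplotypes are forced to be distinct, and the two haplotypes of a genotype are drawn uniformly with replacement (so homozygous genotypes can occur). All function names are hypothetical.

```python
import random


def conflate(h1, h2):
    """Combine two 0/1 haplotype strings into one 0/1/2 genotype string."""
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


def random_haplotypes(m, k, p=0.5, rng=random):
    """Draw m distinct haplotypes of k sites; each bit is 1 with probability p."""
    if not 0 < p < 1 or m > 2 ** k:
        raise ValueError("need 0 < p < 1 and m <= 2**k")
    haplotypes = set()
    while len(haplotypes) < m:
        haplotypes.add("".join("1" if rng.random() < p else "0" for _ in range(k)))
    return sorted(haplotypes)


def sample_genotypes(haplotypes, n, rng=random):
    """Form n genotypes, each by conflating two randomly picked haplotypes."""
    return [conflate(rng.choice(haplotypes), rng.choice(haplotypes)) for _ in range(n)]


rng = random.Random(0)  # fixed seed so that a run is reproducible
haps = random_haplotypes(m=10, k=10, rng=rng)
genos = sample_genotypes(haps, n=12, rng=rng)
print(len(haps), len(genos))  # 10 12
```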

    The performance of a haplotype inference program has two aspects: (1) the accuracy rate and (2) the recovery rate. Figure 2 (a) and Figure 2 (b) show the accuracy rate and recovery rate of the different programs under the parameter setting m = 10 and k = 10. It can be seen from the figures that for HAPAR, Haplotyper, Emdecoder and PHASE the two measurements differ little, whereas HAPINFERX has an accuracy rate better than its recovery rate (this is because HAPINFERX leaves some orphan genotypes unresolved when they are not resolvable by existing haplotypes). This observation is also supported by other simulation results. Therefore, in the rest of the section, we use the arithmetic mean of these two values, (accuracy rate + recovery rate) × 0.5, as the measurement of performance.

    Also, in the rest of the section, 100 data sets were generated for each parameter setting, and

    performance was calculated by taking the average of the performance values in the 100 runs.

    We conducted simulations with different parameter settings and compared the performance of the five programs HAPAR, HAPINFERX, Haplotyper, Emdecoder and PHASE. Two sets of parameters were used: (1) m = 10, k = 10, n ranges from 5 to 24 (see Figure 2 (a) to Figure 2 (c)), and (2) m = 18, k = 8, n ranges from 9 to 40 (see Figure 2 (d)). As shown by the figures, HAPAR outperforms the other four programs in almost all cases. When n is as small as m/2 (every haplotype appears in only one genotype), any combination of resolutions is an optimal solution for the minimum number of origins problem, so HAPAR cannot identify the correct resolutions. In this case, the other programs also perform poorly due to the lack of information. As n becomes larger, the performance of all five programs improves, and HAPAR shows an obvious advantage over the others. When n is large enough (n = 12 for m = 10, k = 10; and n = 30 for m = 18, k = 8), HAPAR, Haplotyper, Emdecoder and PHASE can all recover the original haplotypes successfully with high probability (the accuracy rate and recovery rate are both greater than 0.9).

    [Figure 2 plots, four panels. Each panel shows one curve per program (HAPAR, HAPINFERX, Haplotyper, Emdecoder, PHASE); the vertical axis runs from 0 to 1.]
    (a) Accurate rate (m = 10; k = 10): accuracy rate versus sample size (4 to 24).
    (b) Recovery rate (m = 10; k = 10): recovery rate versus sample size (4 to 24).
    (c) Performance (m = 10; k = 10): performance versus sample size (4 to 24).
    (d) Performance (m = 18; k = 8): performance versus sample size (10 to 40).

    Figure 2: Comparison of the performance of five programs on random data. In Figure 2

    (a) to (c), originally there are 10 haplotypes of 10 SNPs; In Figure 2 (d), originally there are 18

    haplotypes of 8 SNPs. Performance in Figure 2 (c) and (d) is the arithmetic mean of accurate rate

    and recovery rate. For each parameter setting, 100 data sets were generated, and performance,

    accurate rate and recovery rate were calculated by taking average of 100 runs.

    4.3 Simulations on maize data set

    The maize data were used in Wang and Miao (2002) as one of the benchmarks to evaluate haplotyping programs. At locus 14 of the maize profile, which contains 17 SNP sites, 4 haplotypes (with frequencies 9, 17, 8 and 1) were identified. We randomly generated a sample of n genotypes from these haplotypes, each formed by randomly picking 2 haplotypes according to their frequencies and conflating them. The results are summarized in Table 4. According to our experiments, HAPAR, Haplotyper, Emdecoder and PHASE all recover the haplotypes correctly when the sample size n is greater than or equal to 4, while HAPINFERX does not produce a satisfactory result until the sample size n reaches 10. When the sample size is 3, HAPAR behaves slightly better than the other programs, but none of them produces satisfactory results.

    sample size   HAPAR   Haplotyper   Emdecoder   HAPINFERX   PHASE
    3             0.60    0.55         0.38        0.38        0.56
    4             1       1            1           0.625       1
    7             1       1            1           0.625       1
    10            1       1            1           0.9         1

    Table 4: Comparison of performance of five programs on Maize data set (m = 4, k = 17).

    4.4 Simulations on haplotypes forming a perfect phylogeny

    Coalescence is one of the evolutionary models most commonly used in population genetics. The coalescent model of haplotype evolution says that, without recombination, haplotypes fit into a perfect phylogeny (Gusfield, 2001). Jin et al. (1999) found a 565 bp region of chromosome 21 near the MX1 gene which contains 12 polymorphic sites. This region is unaffected by recombination and recurrent mutation. The genotypes determined from sequence data of 354 human individuals were resolved into 10 haplotypes, whose evolutionary history can be modeled by a perfect phylogeny. These 10 haplotypes were used to generate genotype samples of different sizes for evaluation.

    The performance of the five programs on the MX1 data is compared in Figure 3. PHASE performs better than the other programs because it incorporates the coalescent model, which fits this data set. Among the remaining four programs, which do not adopt the coalescent assumption, HAPAR performs better than the others when the sample size is relatively large (n > 20). Note that the performance of the other four programs remains almost unchanged once the sample size exceeds 20, whereas that of HAPAR continues to increase with the sample size. When the sample size is as large as 40, HAPAR even beats PHASE. This supports our hypothesis that when the sample size is large enough, our algorithm is likely to give accurate results.

    4.5 Simulations on haplotypes with recombination hotspots

    [Figure 3 plot: performance (0 to 1) versus sample size (5 to 40), one curve per program: HAPAR, HAPINFERX, Haplotyper, Emdecoder, PHASE.]

    Figure 3: Comparison of performance of five programs on MX1 data set (m = 10, k = 12).

    To verify the robustness of our method in the presence of recombination events, we conducted simulations using the data on chromosome 5q31 studied in Daly et al. (2001). They reported a high-resolution analysis of the haplotype structure across 500 kb on chromosome 5q31 using 103 SNPs in a European-derived population. The result showed a picture of discrete haplotype blocks (of tens to hundreds of kilobases), each with limited diversity. The discrete blocks are separated by intervals in which several independent historical recombination events seem to have occurred, giving rise to greater haplotype diversity for regions spanning the blocks.

    We use the haplotypes from block 9 (with 6 sites and 4 haplotypes) and block 10 (with 7 sites, 6 of which are complete, and 3 haplotypes). There is a recombination spot between the two blocks, which is estimated to have a haplotype exchange rate of 27%. Nine new haplotypes with 12 sites were generated by connecting pairs of haplotypes from block 9 and block 10 that were observed to have common recombination events, and their frequencies were normalized. Genotype samples of different sizes were randomly generated and used for evaluation.

    According to the experimental results shown in Figure 4, Emdecoder, HAPAR, Haplotyper and PHASE have similar performance on these data sets. A notable phenomenon is that their performance increases very slowly after it reaches approximately 0.9. This is due to the confusion caused by recombination. Take a sequence of two SNP sites as an example. There are 4 possible haplotypes: 00, 01, 10 and 11. Both the combinations (00, 11) and (01, 10) result in genotype 22. The haplotype 11 is rare since both of its sites are mutant, while haplotype 00 is the most common. Given these frequencies, 11 is rarely observed, and even if it appears, it is probably in genotype 22, which will likely be resolved into (01, 10). Unless the sample size is large enough that a genotype formed by (11, 01) or (11, 10) is sequenced, the existence of haplotype 11 will not be detected by haplotype inference programs. That is why the performance curves in Figure 4 increase slowly at around 0.9. As mentioned above, if the sample size is large enough, there is a good chance that the rare haplotypes are recovered. In this data set, when the sample size is raised to 50, the performance of HAPAR reaches 0.96.
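The two-site example can be checked with a short sketch that lists, for a given genotype, every pair of haplotypes that explains it. The helper names are hypothetical.

```python
def conflate(h1, h2):
    """Combine two 0/1 haplotype strings into one 0/1/2 genotype string."""
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


def explaining_pairs(genotype, haplotypes):
    """Unordered pairs from `haplotypes` whose conflation equals `genotype`."""
    hs = sorted(haplotypes)
    return [
        (a, b)
        for i, a in enumerate(hs)
        for b in hs[i:]
        if conflate(a, b) == genotype
    ]


ALL_HAPLOTYPES = ["00", "01", "10", "11"]
# Genotype 22 is ambiguous: it does not prove that haplotype 11 exists.
print(explaining_pairs("22", ALL_HAPLOTYPES))  # [('00', '11'), ('01', '10')]
# Genotypes 12 and 21 can only be explained with haplotype 11.
print(explaining_pairs("12", ALL_HAPLOTYPES))  # [('10', '11')]
print(explaining_pairs("21", ALL_HAPLOTYPES))  # [('01', '11')]
```

Only genotypes such as 12 or 21, formed by (11, 10) or (11, 01), force haplotype 11 into every explanation, which is why a larger sample is needed before the rare haplotype is detected.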


    [Figure 4 plot: performance (0 to 1) versus sample size (4 to 20), one curve per program: HAPAR, HAPINFERX, Haplotyper, Emdecoder, PHASE.]

    Figure 4: Comparison of performance of five programs on 5q31 data set (m = 9, k = 12).

    4.6 Influence of the number of haplotypes and the number of SNP sites on performance

    We also examined the influence of m, the number of haplotypes, and k, the number of SNP sites,

    on the performance of our algorithm. Original haplotype sets and then genotypes were generated

    randomly.

    Figure 5 (a) to Figure 5 (c) illustrate the influence of the number of original haplotypes m on performance when k is fixed. In the three figures, we fix k = 6, 10, 12, respectively; and in each figure, the performance under three different numbers of haplotypes, 8, 10 and 12, is compared. All three figures show that, with the same sample size n and the same number of SNP sites k, the performance gets better when the number of haplotypes m decreases, which is consistent with intuition. In each figure, the three performance curves have similar shapes. At first, the performance increases fast as the sample size n increases; then, once the sample size n is large enough and the performance is fairly high (above 0.9), the rate of increase slows down. (This also holds for all the data sets above.) Moreover, comparing the three figures, we can see that the curves in Figure 5 (c) are steeper than those in Figure 5 (a). That is, the performance increases faster when the number of SNP sites is larger. This relationship is more clearly illustrated in Figure 5 (d) to Figure 5 (f).

    Figure 5 (d) to Figure 5 (f) show the influence of the number of SNP sites on performance when m is fixed. In the three figures, we fix m = 8, 10 and 12, respectively, and each figure compares performance for four different numbers of SNP sites: 6, 10, 12 and 15. The larger the number of SNP sites, the faster performance increases as the sample size increases. When the sample size is reasonably large, for the same sample size n and the same original haplotype number m, HAPAR is more likely to detect the correct haplotypes when the number of SNP sites k is larger. This is because when k is small, it is more likely that all four patterns 00, 01, 10 and 11 exist simultaneously in the original haplotype set, which confuses HAPAR.
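One reading of why the four patterns cause trouble is that a genotype heterozygous at two sites can then be explained by two different haplotype pairs that are both present in the original set. The sketch below enumerates every haplotype pair consistent with a genotype. The encoding (0 or 1 for a homozygous site, 2 for a heterozygous site) and the function name are assumptions made for illustration, not notation taken from the paper.

```python
from itertools import product

def resolutions(genotype):
    """List every unordered haplotype pair consistent with a genotype.

    Assumed encoding: 0 or 1 = homozygous site, 2 = heterozygous site.
    """
    het = [i for i, g in enumerate(genotype) if g == 2]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(genotype), list(genotype)
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b          # complementary alleles at a het site
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)

print(resolutions((2, 2)))
# [((0, 0), (1, 1)), ((0, 1), (1, 0))]
print(resolutions((2, 1)))
# [((0, 1), (1, 1))]
```

The doubly heterozygous genotype has two valid resolutions, (00, 11) and (01, 10), so genotype data alone cannot tell them apart when all four haplotypes are plausible. A genotype with a single heterozygous site has only one resolution.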

    5 Discussion

    Haplotypes are the raw material of many genetic analyses, but the rapid growth of high-throughput genotyping techniques has not been matched by similar advances in cheap experimental haplotype determination. In this paper, we have introduced a new method for this problem. Experiments on both real and simulated data support the new method. Our method imposes no assumption on the evolutionary history of the population. While integrating a particular evolutionary model into a haplotyping algorithm may favor a special class of data, such an algorithm will fail to infer haplotypes that do not fit the model. In contrast, our method, which assumes no such model, is more widely applicable. Experiments show that HAPAR is comparable to the best haplotyping programs on almost all data sets, and that it outperforms the other programs when the simulation data are generated fully randomly.
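The parsimony criterion behind the method can be made concrete on a toy instance: given genotypes and a set of candidate haplotypes, look for the smallest subset of candidates that resolves every genotype. The sketch below does this by plain exhaustive search over subsets of increasing size. It is not HAPAR's branch-and-bound procedure, it takes exponential time, and the genotype encoding (2 for a heterozygous site) and function names are assumptions made for illustration.

```python
from itertools import combinations

def explains(h1, h2, genotype):
    """True if haplotypes h1 and h2 combine into the genotype (2 = heterozygous)."""
    return all((a != b) if g == 2 else (a == b == g)
               for a, b, g in zip(h1, h2, genotype))

def min_haplotype_subset(genotypes, candidates):
    """Smallest subset of candidates resolving every genotype (exhaustive search)."""
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            if all(any(explains(h1, h2, g) for h1 in subset for h2 in subset)
                   for g in genotypes):
                return list(subset)
    return None  # the candidates cannot resolve all genotypes

candidates = [(0, 0), (0, 1), (1, 0), (1, 1)]
genotypes = [(2, 2), (0, 2), (2, 1)]
print(min_haplotype_subset(genotypes, candidates))
# [(0, 0), (0, 1), (1, 1)]
```

In this toy case the genotype (2, 2) alone is ambiguous, but the other two genotypes force haplotypes 00, 01 and 11 into the solution, and 00 with 11 then resolves (2, 2) without adding a fourth haplotype.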

    Our new method raises some interesting topics for further investigation. Does there exist a polynomial-time algorithm for the minimum number of origins problem? (We conjecture that the problem is NP-hard.) If no polynomial-time algorithm exists, an efficient approximation algorithm with a reasonable performance guarantee is also desirable. Moreover, it would be interesting to design an efficient algorithm to replace the branch-and-bound search procedure; that is, given a set of genotypes and a set of candidate haplotypes, find a minimum-size subset of the candidate haplotypes that can resolve all genotypes. Although the size of the candidate haplotype set could be exponential in the original problem size in the worst case, the size is acceptable in many practical examples. Therefore, we are interested in accelerating the search procedure.

    Another interesting problem is under what conditions the parsimony algorithm is able to infer all haplotypes correctly with high probability. All simulations show that the performance of HAPAR increases monotonically as the sample size becomes larger, and we conjecture that when the sample size is big enough, our algorithm can recover the original haplotype set exactly. With the number of original haplotypes m and the number of SNP sites k fixed, we say that a sample size can recover haplotypes with high probability if the average performance over 100 runs under that parameter setting is greater than 0.9, and we refer to the minimum such sample size as the threshold sample size. Simulations were conducted to examine the relationship between the threshold sample size and the number of haplotypes m. The results for k = 10 are summarized in Table 5. When m increases, the threshold sample size increases. Figure 5 shows that for the same value of m, the threshold sample size decreases when k increases. Further statistical analysis may help to explain this relationship.
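The definition of the threshold sample size can be written as a short computation. The sketch below assumes the per-run performance values are already available as a mapping from sample size to a list of values; the function name and the toy numbers are made up for illustration and are not results from the paper.

```python
from statistics import mean

def threshold_sample_size(runs_by_n, cutoff=0.9):
    """Smallest sample size whose mean performance over runs exceeds cutoff."""
    for n in sorted(runs_by_n):
        if mean(runs_by_n[n]) > cutoff:
            return n
    return None  # no tested sample size reaches the cutoff

# Made-up performance values, three runs per sample size (the paper uses 100).
toy_runs = {
    6: [0.70, 0.80, 0.75],
    8: [0.85, 0.90, 0.88],
    10: [0.92, 0.95, 0.91],
    12: [0.97, 0.96, 0.98],
}
print(threshold_sample_size(toy_runs))  # 10
```

Returning the first sample size above the cutoff matches the definition only if performance does not fall back below the cutoff at larger sizes, which is what the monotonic behaviour reported in the simulations suggests.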

    [Figure 5 plots: six panels, each showing performance (vertical axis, 0 to 1) against sample size (horizontal axis).]
    (a) performance under different m (k=6): curves for 8, 10 and 12 haplotypes; sample size 5 to 30
    (b) performance under different m (k=10): curves for 8, 10 and 12 haplotypes; sample size 5 to 30
    (c) performance under different m (k=12): curves for 8, 10 and 12 haplotypes; sample size 4 to 24
    (d) performance under different k (m=8): curves for 6, 10, 12 and 15 SNP sites; sample size 4 to 18
    (e) performance under different k (m=10): curves for 6, 10, 12 and 15 SNP sites; sample size 5 to 30
    (f) performance under different k (m=12): curves for 6, 10, 12 and 15 SNP sites; sample size 5 to 30

    Figure 5: Comparison of performance under different numbers of haplotypes and numbers of SNP sites. (a) to (c) illustrate the influence of different m on performance. (d) to (f) illustrate the influence of different k on performance.

    original haplotype number (m)    5    6    8   10   12   14   15
    threshold sample size            7    7    9   11   14   17   18

    Table 5: Relationship between the threshold sample size and the original haplotype number (k = 10).

    Acknowledgements

    We thank A. Clark for kindly giving us HAPINFERX. We also thank J. Liu and M. Stephens for providing software on their web pages. The work is fully supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China [Project No. CityU 1047/01E].

    References

    Bafna,V., Gusfield,D., Lancia,G. and Yooseph,S. (2002) Haplotyping as perfect phylogeny: A direct approach. Technical Report UCDavis CSE-2002-21.

    Chiano,M. and Clayton,D. (1998) Fine genetic mapping using haplotype analysis and the missing data problem. Am. J. Hum. Genet., 62, 55-60.

    Clark,A. (1990) Inference of haplotypes from PCR-amplified samples of diploid populations. Mol.

    Biol. Evol., 7, 111-122.

    Clark,A., Weiss,K., Nickerson,D., Taylor,S., Buchanan,A., Stengard,J., Salomaa,V., Vartiainen,E.,

    Perola,M., Boerwinkle,E. and Sing,C. (1998) Haplotype structure and population genetic inferences

    from nucleotide-sequence variation in human lipoprotein lipase. Am. J. Hum. Genet., 63, 595-612.

    Drysdale,C., McGraw,D., Stack,C., Stephens,J., Judson,R., Nandabalan,K., Arnold,K., Ruano,G. and Liggett,S. (2000) Complex promoter and coding region β2-adrenergic receptor haplotypes alter receptor expression and predict in vivo responsiveness. Proc. Natl. Acad. Sci. USA, 97, 10483-10488.

    Daly,M., Rioux,J., Schaffner,S., Hudson,T. and Lander,E. (2001) High-resolution haplotype structure in the human genome. Nat. Genet., 29, 229-232.

    Excoffier,L. and Slatkin,M. (1995) Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol. Biol. Evol., 12, 921-927.

    Gusfield,D. (2001) Inference of haplotypes from samples of diploid populations: complexity and

    algorithms. J. Comput. Biol., 8, 305-323.

    Gusfield,D. (2002) Haplotyping as perfect phylogeny: conceptual framework and efficient solutions. RECOMB02.

    Hawley,M. and Kidd,K. (1995) HAPLO: a program using the EM algorithm to estimate the frequencies of multi-site haplotypes. J. Hered., 86, 409-411.

    Jin,L., Underhill,P., Doctor,V., Davis,R., Shen,P., Cavalli-Sforza,L. and Oefner,P. (1999) Distribution of haplotypes from a chromosome 21 region distinguished multiple prehistoric human migrations. Proc. Natl. Acad. Sci. USA, 96, 3796-3800.

    Long,J., Williams,R. and Urbanek,M. (1995) An E-M algorithm and testing strategy for multi-locus haplotypes. Am. J. Hum. Genet., 56, 799-810.

    Niu,T., Qin,Z., Xu,X. and Liu,J. (2002) Bayesian haplotype inference for multiple linked single-

    nucleotide polymorphisms. Am. J. Hum. Genet., 70, 157-169.

    Rieder,M., Taylor,S., Clark,A. and Nickerson,D. (1999) Sequence variation in the human angiotensin converting enzyme. Nat. Genet., 22, 59-62.

    Stephens,M., Smith,N. and Donnelly,P. (2001) A new statistical method for haplotype reconstruction. Am. J. Hum. Genet., 68, 978-989.

    Wang,X. and Miao,J. (2002) In-silico haplotyping: state-of-the-art. UDEL technical reports.

    Zhang,J., Vingron,M. and Hoehe,M. (2001) On haplotype reconstruction for diploid populations.

    EURANDOM report 2001-026.
