2002Wang_Haplotype Inference by Maximum Parsimony


Haplotype Inference by Maximum Parsimony

Lusheng Wang

Department of Computer Science

City University of Hong Kong

Kowloon, Hong Kong, China

E-mail: [email protected]

Ying Xu

Department of Computer Science

Peking University

Beijing, 100871, China

E-mail: [email protected]

Abstract

Motivation: Haplotypes have been attracting increasing attention because of their importance in the analysis of many fine-scale molecular-genetics data. Since direct sequencing of haplotypes via experimental methods is both time-consuming and expensive, haplotype inference methods that infer haplotypes from genotype samples have become attractive alternatives.

Results: We propose a new model for haplotype inference that finds a minimum-size set of haplotypes that explains the genotype samples. Experiments on both real data and simulation data support the new model. Our new approach outperforms the existing methods in many cases.

Availability: The software HAPAR is free for non-commercial use and is available upon request ([email protected]).


1 Introduction

Single nucleotide polymorphisms (SNPs) are the most frequent form of human genetic variation. The SNP sequence information from each of the two copies of a given chromosome in a diploid genome is a haplotype. Haplotype information has been attracting great attention in recent years because of its importance in the analysis of many fine-scale molecular-genetics data, such as in the mapping of complex disease genes, inferring population histories and drug design. However, current routine sequencing methods typically provide genotype information rather than haplotype information. (A genotype consists of the pair of haplotype values at each position of the two copies of a chromosome in a diploid organism, but the connection between two adjacent positions is not known.) Since direct sequencing of haplotypes via experimental methods is both time-consuming and expensive, in silico haplotyping methods have become attractive alternatives.

The haplotype inference problem is as follows. We are given an $n \times k$ genotype matrix in which each cell has value 0, 1, or 2. Each of the $n$ rows in the matrix is a vector associated with sites of interest on the two copies of a chromosome for diploid organisms. The state of any site on a single copy of a chromosome is either 0 or 1. A cell $(i, j)$ in the $i$-th row has value 0 or 1 if the chromosome site has that state (0 or 1) on both copies, and it has value 2 if both states are present at this site (one for each copy). A cell is resolved if it has value 0 or 1, and ambiguous if it has value 2. The goal is to determine, based on some mathematical model, which copy of the chromosome has value 1 and which copy has value 0 at the sites with value 2.
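The encoding described above can be illustrated with a minimal Python sketch. The function names `conflate` and `ambiguous_sites` and the string representation of haplotypes are assumptions made for illustration; they are not taken from the HAPAR software.

```python
def conflate(h1: str, h2: str) -> str:
    """Merge two haplotypes (strings of '0'/'1') into one genotype string."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same sites")
    # Equal states stay resolved (0 or 1); different states become ambiguous (2).
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


def ambiguous_sites(genotype: str) -> list:
    """Return the 0-based indices of the ambiguous (value 2) cells."""
    return [j for j, cell in enumerate(genotype) if cell == "2"]


# Two copies of a chromosome over four sites (hypothetical values).
genotype = conflate("0101", "0011")
print(genotype, ambiguous_sites(genotype))
```

Haplotype inference is the inverse of `conflate`: only the genotype string is observed, and the two input strings have to be reconstructed.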

In 1990, Clark first discovered that genotypes from population samples were useful in reconstructing haplotypes and proposed an inference method. Since then, many algorithms and programs have been developed to solve the haplotype inference problem. The existing algorithms can be divided into four primary categories. The first category is Clark's inference rule approach, which is exemplified in Clark (1990) and extended in Gusfield (2001) by trying to maximize the number of resolved vectors. The second category is the expectation-maximization (EM) method, which looks for the set of haplotypes maximizing the posterior probability of the given genotypes (Excoffier and Slatkin, 1995; Hawley and Kidd, 1995; Long et al., 1995; Chiano and Clayton, 1998). The third category consists of several recently proposed statistical methods based on Bayesian estimators and Gibbs sampling (Stephens et al., 2001; Niu et al., 2002; Zhang et al., 2001). Finally, adopting the no-recombination assumption, Gusfield proposed a model that finds a set of haplotypes forming a perfect phylogeny (Gusfield, 2002; Bafna et al., 2002).

In this paper, we propose a new model for the haplotype inference problem that finds a set of

minimum number of haplotypes that explains the given genotypes. Based on the new model, an

exact algorithm is designed and implemented. Experiments on both real data and simulation data

confirm the new model. Simulation results show that our approach outperforms existing methods

in most cases.


2 The new model

Given an $n \times k$ genotype matrix $M$, where each cell has value 0, 1, or 2, a $2n \times k$ haplotype matrix $Q$ that explains $M$ is obtained as follows: (1) duplicate each row $i$ in $M$ to create a pair of rows $i$ and $i'$ (for $i = 1, 2, \ldots, n$) in $Q$, and (2) re-set each pair of cells $Q(i, j) = 2$ and $Q(i', j) = 2$ to be either $Q(i, j) = 0$ and $Q(i', j) = 1$, or $Q(i, j) = 1$ and $Q(i', j) = 0$, in the resulting $2n \times k$ matrix $Q$. Each row in $M$ is called a genotype. For a genotype $m_i$ (the $i$-th row in $M$), the pair of resulting rows $Q_i$ and $Q_{i'}$ form a resolution of $m_i$. We also say that $Q_i$ and $Q_{i'}$ resolve the genotype $m_i$. For a given genotype matrix $M$, if there are $h$ 2s in $M$, then any of the $2^h$ possible haplotype matrices can explain $M$. Thus, without any further assumptions, it is hard to infer the haplotypes.

Here we propose a new model that finds a minimum-size set of haplotypes that explains the genotype samples: given an $n \times k$ genotype matrix $M$, find a $2n \times k$ haplotype matrix $Q$ such that the number of distinct rows in $Q$ is minimized. $n$ is often referred to as the sample size. The computational problem is called the minimum number of origins problem.
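As an illustration of the definitions above, the following Python sketch enumerates the resolutions of a single genotype and counts the distinct rows of a haplotype matrix. The names `resolutions` and `solution_size` are hypothetical. The sketch lists unordered pairs, so a genotype with $h \ge 1$ ambiguous sites yields $2^{h-1}$ resolutions, because swapping the two rows of a pair gives the same resolution.

```python
from itertools import product


def resolutions(genotype: str) -> list:
    """List all unordered haplotype pairs that resolve one genotype."""
    sites = [j for j, cell in enumerate(genotype) if cell == "2"]
    if not sites:
        # A fully resolved genotype has exactly one resolution (h, h).
        return [(genotype, genotype)]
    pairs = []
    # Fix the first ambiguous site to 0 in the first haplotype so that
    # (a, b) and (b, a) are not both listed.
    for rest in product("01", repeat=len(sites) - 1):
        a, b = list(genotype), list(genotype)
        for j, bit in zip(sites, ("0",) + rest):
            a[j] = bit
            b[j] = "1" if bit == "0" else "0"
        pairs.append(("".join(a), "".join(b)))
    return pairs


def solution_size(chosen: list) -> int:
    """Number of distinct rows in the haplotype matrix built from `chosen`."""
    return len({h for pair in chosen for h in pair})


print(resolutions("0221"))
```

The model asks for one resolution per genotype such that `solution_size` of the chosen list is as small as possible.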

2.1 Supports of the new model

Our new method is based on the parsimony principle, which attempts to minimize the total number of haplotypes observed in the sample. The parsimony principle is one of the most basic principles in nature, and has been applied to numerous biological problems. In fact, Clark's inference algorithm, which has been extensively used in practice and shown to be useful (Clark et al., 1998; Rieder et al., 1999; Drysdale et al., 2000), can also be viewed as a sort of parsimony approach. However, to apply Clark's algorithm, there must be homozygotes or single-site heterozygotes in the sample. Our method overcomes this obstacle by proposing a global optimization goal.

The characteristics of real biological data also provide justification for our method. The number of haplotypes existing in a large population is actually very small, whereas the genotypes derived from this limited number of haplotypes exhibit great diversity. Theoretically, given $m$ haplotypes, there are $m(m - 1)/2$ possible pairs to form genotypes. (Even if Hardy-Weinberg equilibrium is violated and some combinations are rare, the number is still quite large.) When some population is to be studied, the number of haplotypes can be taken as a fixed constant, while the number of distinct genotypes is decided by the size of the sequenced sample, which is relatively large. Intuitively, when the genotype sample size $n$ is large enough, the corresponding number of haplotypes $m$ will be relatively small. Thus, our approach has a good chance of correctly recovering the whole haplotype set.

A real example strongly supports the above arguments.


Nucleotide  -1023  -709   -654   -468   -367   -47    -20    46     79     252    491    523    Representation
Alleles     G/A    C/A    G/A    C/G    T/C    T/C    T/C    G/A    C/G    G/A    C/T    C/A
h1          A      C      G      C      T      T      T      A      C      G      C      C      100000010000
h2          A      C      G      G      C      C      C      G      G      G      C      C      100111101000
h3          G      A      A      C      T      T      T      A      C      G      C      C      011000010000
h4          G      C      A      C      T      T      T      A      C      G      C      C      001000010000
h5          G      C      A      C      T      T      T      G      C      G      C      C      001000000000
h6          G      C      G      C      T      T      T      G      C      A      C      A      000000000101
h7          G      C      G      C      T      T      T      G      C      A      T      A      000000000111
h8          G      C      A      C      T      T      T      A      C      A      C      A      001000010101
h9          G      C      G      C      T      T      T      G      C      A      C      C      000000000100
h10         G      C      G      C      T      T      T      G      C      G      C      C      000000000000

Table 1: 10 haplotypes of β2AR genes. Nucleotide number is the position of the site, based on the first nucleotide of the starting codon being +1. Alleles gives the two nucleotide possibilities at each SNP site. These data are from Drysdale et al. (2000). The original paper gave 12 haplotypes of 13 SNP sites. In this table, we list only the 10 haplotypes that were found in the asthmatic cohort, and one rare SNP site that did not show ambiguity in the sample is excluded. The last column of each haplotype is the representation of that haplotype. Each haplotype is a vector of SNP values. For each SNP, we assume the first nucleotide in Alleles to be the wild type (represented by 0) and the second one to be the mutant (represented by 1).

2.2 A real example exactly fitting the new model

β2-adrenergic receptors (β2ARs) are G protein-coupled receptors that mediate the actions of catecholamines in multiple tissues. In Drysdale et al. (2000), 13 variable sites within a span of 1.6 kb were reported in the human β2AR gene. Only 10 haplotypes were found to exist in the studied asthmatic cohort, far fewer than the $2^{13} = 8192$ theoretically possible combinations. 18 distinct genotypes were identified in the sample consisting of 121 individuals. Those 10 haplotypes and 18 genotypes are listed in Table 1 and Table 2, respectively. In this data set, the number of genotypes (18) is relatively large with respect to the number of haplotypes (10). Computation shows that the minimum number of haplotypes needed to generate the 18 genotypes is 10, and given the 18 genotypes as input, the set of haplotypes inferred by our algorithm (see next section) is exactly the original set.
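The relationship between Table 1 and Table 2 can be checked mechanically. The sketch below copies the 0/1 representations from Table 1 and the resolutions and values from Table 2, and confirms that conflating each listed pair of haplotypes reproduces the listed genotype value. It only checks consistency of the two tables; it does not reproduce the minimum-size computation reported above. The names `HAPLOTYPES`, `GENOTYPES` and `conflate` are chosen here for illustration.

```python
# Representations from Table 1.
HAPLOTYPES = {
    "h1": "100000010000", "h2": "100111101000", "h3": "011000010000",
    "h4": "001000010000", "h5": "001000000000", "h6": "000000000101",
    "h7": "000000000111", "h8": "001000010101", "h9": "000000000100",
    "h10": "000000000000",
}

# (first haplotype, second haplotype, genotype value) from Table 2.
GENOTYPES = {
    "g1": ("h2", "h4", "202222222000"), "g2": ("h2", "h2", "100111101000"),
    "g3": ("h2", "h6", "200222202202"), "g4": ("h4", "h4", "001000010000"),
    "g5": ("h4", "h6", "002000020202"), "g6": ("h2", "h5", "202222202000"),
    "g7": ("h4", "h9", "002000020200"), "g8": ("h1", "h4", "202000010000"),
    "g9": ("h1", "h6", "200000020202"), "g10": ("h2", "h10", "200222202000"),
    "g11": ("h2", "h3", "222222222000"), "g12": ("h2", "h7", "200222202222"),
    "g13": ("h2", "h8", "202222222202"), "g14": ("h3", "h4", "021000010000"),
    "g15": ("h4", "h5", "001000020000"), "g16": ("h4", "h7", "002000020222"),
    "g17": ("h4", "h8", "001000010202"), "g18": ("h6", "h7", "000000000121"),
}


def conflate(h1: str, h2: str) -> str:
    """Merge two haplotypes into a genotype: differing sites become 2."""
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


# Genotypes whose listed value disagrees with the conflated resolution.
mismatches = [
    name for name, (a, b, value) in GENOTYPES.items()
    if conflate(HAPLOTYPES[a], HAPLOTYPES[b]) != value
]
# Haplotypes that take part in at least one true resolution.
used = {h for a, b, _ in GENOTYPES.values() for h in (a, b)}
print("mismatches:", mismatches, "haplotypes used:", len(used))
```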

3 An exact algorithm

To verify our model with more real data and simulation data, we design an exact algorithm to find the optimal solution for the minimum number of origins problem. The basic idea is very simple. Given a set of genotypes (rows in the genotype matrix) $M = \{m_1, m_2, \ldots, m_n\}$, the algorithm works as follows: (1) list all possible resolutions for each genotype; (2) construct a haplotype matrix by choosing one resolution for each genotype; and (3) examine all such haplotype matrices with a branch-and-bound search and output a matrix with the minimum number of distinct haplotypes.

genotype  resolution  value
g1        (h2,h4)     202222222000
g2        (h2,h2)     100111101000
g3        (h2,h6)     200222202202
g4        (h4,h4)     001000010000
g5        (h4,h6)     002000020202
g6        (h2,h5)     202222202000
g7        (h4,h9)     002000020200
g8        (h1,h4)     202000010000
g9        (h1,h6)     200000020202
g10       (h2,h10)    200222202000
g11       (h2,h3)     222222222000
g12       (h2,h7)     200222202222
g13       (h2,h8)     202222222202
g14       (h3,h4)     021000010000
g15       (h4,h5)     001000020000
g16       (h4,h7)     002000020222
g17       (h4,h8)     001000010202
g18       (h6,h7)     000000000121

Table 2: 18 genotypes of β2AR genes. The second column of each genotype is the (true) resolution of that genotype. For example, genotype g1 is resolved by haplotypes h2 and h4. The third column is the representation of that genotype. Each genotype is a vector of SNP values, and the value of each SNP is 0, 1 or 2, standing for homozygous wild type, homozygous mutant, or heterozygous, respectively.

Figure 1: An example of a search tree containing three levels, i.e., there are three rows (genotypes) in the given genotype matrix. $s_1 = 4$, $s_2 = 2$, and $s_3 = 3$.

Consider a genotype $m_i$ (the $i$-th row in $M$). Let $s_i$ be the number of possible resolutions of $m_i$. All possible resolutions of $m_i$ are stored in a list $\{r_{i,1} = (h_{i,1,1}, h_{i,1,2}), \ldots, r_{i,s_i} = (h_{i,s_i,1}, h_{i,s_i,2})\}$. A haplotype matrix $(r_{1,j_1}, r_{2,j_2}, \ldots, r_{n,j_n})$ consists of $n$ resolutions, one for each genotype $m_i$ in $M$. A partial solution consists of $i$ resolutions $(r_{1,j_1}, \ldots, r_{i,j_i})$, where $i \le n$. The size of a solution is defined as the number of distinct haplotypes contained in these $n$ resolutions. We are looking for a solution with the minimum size.

We use depth-first search to exhaustively enumerate all $\prod_{i=1}^{n} s_i$ possible solutions and find an optimal one. Figure 1 gives an example of a search tree containing $\prod_{i=1}^{n} s_i$ possible solutions (leaves). The search tree is usually too big to enumerate completely, so we use a branch-and-bound approach to reduce the search space. Assume that we have obtained a solution of size $x$. If, during the search, we obtain a partial solution of size at least $x$, then we do not extend that partial solution any further, because no completion of it can be smaller than $x$. This can save a lot of time.
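The depth-first search with pruning described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the HAPAR implementation (which is written in C++): genotypes are strings over 0/1/2, the helper names `resolutions` and `min_haplotypes` are hypothetical, and none of the further improvements discussed later are included.

```python
from itertools import product


def resolutions(genotype: str) -> list:
    """List all unordered haplotype pairs that resolve one genotype."""
    sites = [j for j, cell in enumerate(genotype) if cell == "2"]
    if not sites:
        return [(genotype, genotype)]
    pairs = []
    for rest in product("01", repeat=len(sites) - 1):
        a, b = list(genotype), list(genotype)
        for j, bit in zip(sites, ("0",) + rest):
            a[j] = bit
            b[j] = "1" if bit == "0" else "0"
        pairs.append(("".join(a), "".join(b)))
    return pairs


def min_haplotypes(genotypes: list) -> tuple:
    """Return (size, chosen resolutions) of a minimum-size solution."""
    res_lists = [resolutions(g) for g in genotypes]
    best = {"size": float("inf"), "choice": None}

    def dfs(i: int, used: set, choice: list) -> None:
        # Bound: a partial solution at least as large as the best known
        # complete solution cannot lead to an improvement.
        if len(used) >= best["size"]:
            return
        if i == len(res_lists):
            best["size"], best["choice"] = len(used), list(choice)
            return
        for pair in res_lists[i]:
            added = [h for h in set(pair) if h not in used]
            used.update(added)
            choice.append(pair)
            dfs(i + 1, used, choice)
            choice.pop()
            used.difference_update(added)  # undo before trying the next branch

    dfs(0, set(), [])
    return best["size"], best["choice"]


print(min_haplotypes(["202", "001", "100"]))
```

In this small example the ambiguous genotype 202 can be resolved as (000, 101) or (001, 100); the second choice reuses the haplotypes of the two homozygous genotypes, so the search returns a solution of size 2.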

Theoretically, the running time of the above described algorithm could still be exponential in

terms of the input size. In order to make our program efficient enough for practical use, we made

some further improvements to reduce the running time.

Improvement 1: Choosing a tight initial bound

If we have a good (small-size) solution, then we are able to prune many unnecessary search paths. Before executing the branch-and-bound search, we run a greedy algorithm to obtain a solution, and use this solution as an initial bound in the branch-and-bound search. The coverage of a haplotype is the number of genotypes that the haplotype can resolve. The coverage of a resolution is the sum of the coverages of the two haplotypes of the resolution. The greedy algorithm simply chooses for each genotype a resolution with maximum coverage to form a solution. This heuristic algorithm can often give a solution with size close to the optimum.
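A minimal Python sketch of this greedy bound is given below. It assumes that a haplotype "can resolve" a genotype when it agrees with the genotype on every resolved site, since the complementary haplotype then completes a resolution. The function names and the precomputed `res_lists` argument (one list of candidate resolutions per genotype) are assumptions for illustration; ties are broken by taking the first maximal resolution.

```python
def compatible(haplotype: str, genotype: str) -> bool:
    """True if the haplotype matches the genotype on all resolved sites."""
    return all(g == "2" or g == h for h, g in zip(haplotype, genotype))


def greedy_bound(genotypes: list, res_lists: list) -> tuple:
    """Pick a maximum-coverage resolution per genotype; return (size, choice)."""
    def coverage(haplotype: str) -> int:
        # Number of genotypes in the sample that this haplotype can resolve.
        return sum(compatible(haplotype, g) for g in genotypes)

    chosen = [
        max(candidates, key=lambda r: coverage(r[0]) + coverage(r[1]))
        for candidates in res_lists
    ]
    size = len({h for pair in chosen for h in pair})
    return size, chosen


genotypes = ["202", "001", "100"]
res_lists = [
    [("000", "101"), ("001", "100")],  # candidate resolutions of 202
    [("001", "001")],                  # 001 is already resolved
    [("100", "100")],                  # 100 is already resolved
]
print(greedy_bound(genotypes, res_lists))
```

The returned size can be used as the initial value of the bound in the branch-and-bound search, so that pruning starts before the first leaf is reached.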

Improvement 2: Reducing the search space

Since we only report one optimal solution, if some possible resolutions are equally good, we can keep only one representative and discard the others. We only consider two cases in our program.

Case 1: Two resolutions of the same genotype $m_i$ both have coverage 2. In this case, none of the four haplotypes contained in the two resolutions appears in any other genotype. Thus, we just keep one of the resolutions in the resolution list of $m_i$.

Case 2: Consider two genotypes $m_i$ and $m_j$. Suppose $m_i$ has two resolutions $(h_1, h_2)$ and $(h_4, h_5)$, and $m_j$ has two resolutions $(h_2, h_3)$ and $(h_5, h_6)$. If $h_1$, $h_3$, $h_4$ and $h_6$ have coverage 1, and $h_2$ and $h_5$ have coverage 2, then we only have to keep the combination $(h_1, h_2)$ and $(h_2, h_3)$ and delete the combination $(h_4, h_5)$ and $(h_5, h_6)$.
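Case 1 can be sketched as a simple filter over the resolution list of one genotype. The function name `prune_case1` and the `coverage` dictionary (haplotype to coverage, assumed to be computed beforehand) are hypothetical; Case 2 is not covered by this sketch.

```python
def prune_case1(candidates: list, coverage: dict) -> list:
    """Keep only the first resolution whose coverage is exactly 2."""
    kept = []
    seen_private = False
    for a, b in candidates:
        # Coverage 2 means both haplotypes have coverage 1, i.e. neither
        # of them can resolve any other genotype in the sample.
        if coverage[a] + coverage[b] == 2:
            if seen_private:
                continue  # interchangeable with the one already kept
            seen_private = True
        kept.append((a, b))
    return kept


# Hypothetical coverages for the candidate resolutions of one genotype.
coverage = {"000": 1, "101": 1, "001": 2, "100": 1, "010": 1, "111": 1}
candidates = [("000", "101"), ("001", "100"), ("010", "111")]
print(prune_case1(candidates, coverage))
```

Any two coverage-2 resolutions add exactly two new haplotypes that nothing else can reuse, so dropping all but one of them cannot change the optimal size.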

The idea can be extended to more sophisticated cases. However, those cases do not happen very often and may not help much in practice. After applying this trick, the number of possible resolutions and the number of candidate haplotypes (haplotypes contained in resolutions) are dramatically cut down. For example, when we run our program on the ACE data containing 11 individuals and 52 SNPs (see Section 4 for details), the number of candidate haplotypes is only 483, which is far less than the total of $2^{52}$ possible haplotypes.

We implemented the above algorithm in C++. The program, HAPAR, is available upon request. It takes a file containing genotype data as input, and outputs resolved haplotypes. With all these improvements, our program is fairly efficient. For example, HAPAR takes only 2.25 minutes on our computer to process the ACE data. In contrast, PHASE, a program based on Gibbs sampling, takes 12 minutes to process the ACE data in the same environment. Programs based on the EM method cannot even handle this data set.

4 Results

We ran our program HAPAR on a large amount of real biological data as well as simulation data to demonstrate its performance. We also compared our program with four well-known existing programs: HAPINFERX, Emdecoder, PHASE, and Haplotyper. HAPINFERX is an implementation of Clark's algorithm (Clark, 1990), and was kindly provided by A. G. Clark. Emdecoder uses an EM algorithm, and was downloaded from J. Liu's homepage (http://www.people.fas.harvard.edu/junliu/). PHASE is a Gibbs sampler for haplotyping (Stephens et al., 2001), and was downloaded from M. Stephens's homepage. Haplotyper is a Bayesian method, and was downloaded from J. Liu's homepage.

We discuss the different data sets in the following subsections.


program     result haplotype number  correct haplotype number  recovery rate  accuracy rate
HAPAR       7                        7                         0.54           1
Haplotyper  13                       7 or 9                    0.54 or 0.69   0.54 or 0.69
HAPINFERX   7                        7                         0.54           1
PHASE       13                       7                         0.54           0.54

Table 3: Comparison of performance of four programs on the ACE data set. Haplotyper has two different accurate rates and recovery rates because it gives different results in multiple runs.

4.1 Angiotensin converting enzyme

Angiotensin converting enzyme (encoded by the gene DCP1, also known as ACE) catalyses the conversion of angiotensin I to the physiologically active peptide angiotensin II. Due to its key function in the renin-angiotensin system, many association studies have been performed with DCP1. Rieder et al. (1999) completed the genomic sequencing of DCP1 from 11 individuals, and identified 78 varying sites in 22 chromosomes. 52 out of the 78 varying sites are non-unique polymorphic sites, and complete data on these 52 biallelic markers are available. 13 distinct haplotypes were resolved from the sample.

We ran the four programs, HAPAR, Haplotyper, HAPINFERX and PHASE, on the ACE data set (11 individuals/10 genotypes, 52 SNPs, 13 haplotypes). (Emdecoder is limited in the number of SNPs in genotype data and thus is excluded.) The results are summarized in Table 3. The recovery rate is defined as

\[ \text{recovery rate} = \frac{\text{no. of correctly detected haplotypes}}{\text{total no. of true haplotypes}}. \]

The accurate rate is defined as

\[ \text{accurate rate} = \frac{\text{no. of correctly detected haplotypes}}{\text{total no. of inferred haplotypes}}. \]

Most programs can recover 7 of the 13 original haplotypes. The low performance is due to the relatively small number of distinct genotypes. In fact, 3 genotypes are resolved by 6 haplotypes, each of which appears only once, so there is not enough information for any of the 4 programs to resolve those 3 genotypes successfully. In some runs, Haplotyper can guess one resolution correctly (thus 9 haplotypes are correctly recovered), but it cannot get a consistent result. Note that unlike statistical methods (e.g. Haplotyper and PHASE), HAPAR and HAPINFERX do not try to guess resolutions for genotypes when there is not enough information. Therefore, both of them report only 7 haplotypes, with a 100% accurate rate. In contrast, Haplotyper and PHASE report 13 haplotypes, some of which are inaccurate.
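The two rates can be computed directly from the sets of inferred and true haplotypes. The sketch below uses hypothetical function names and placeholder haplotype labels (`t0` to `t12` for true haplotypes, `x0` to `x5` for wrong ones) chosen only to reproduce the counts reported in Table 3; the actual ACE haplotypes are not included here.

```python
def recovery_rate(inferred, true) -> float:
    """Correctly detected haplotypes divided by the number of true haplotypes."""
    true_set = set(true)
    return len(set(inferred) & true_set) / len(true_set)


def accurate_rate(inferred, true) -> float:
    """Correctly detected haplotypes divided by the number of inferred haplotypes."""
    inferred_set = set(inferred)
    return len(inferred_set & set(true)) / len(inferred_set)


true = [f"t{i}" for i in range(13)]               # 13 true haplotypes
cautious = true[:7]                               # 7 reported, all correct
eager = true[:7] + [f"x{i}" for i in range(6)]    # 13 reported, 7 correct

for name, inferred in (("cautious", cautious), ("eager", eager)):
    print(name,
          round(recovery_rate(inferred, true), 2),
          round(accurate_rate(inferred, true), 2))
```

A program that reports only the 7 haplotypes it can justify has the same recovery rate as one that reports 13 with 7 correct, but a higher accurate rate.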


4.2 Simulations on random data sets

In this subsection, we use simulation data to evaluate the different programs. First, $m$ haplotypes, each containing $k$ SNP sites, were randomly generated. Then a sample of $n$ genotypes was generated, each of which was formed by randomly picking two haplotypes and conflating them. (Here $n$ is the sample size, which may be larger than the number of distinct genotypes.) A haplotyping program resolved those genotypes and output inferred haplotypes, which were then compared with the original haplotypes to evaluate the performance of the program.
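The data generation step can be sketched in Python as follows. Several details are assumptions made for illustration and are not stated in the text: the $m$ haplotypes are forced to be distinct, each bit is 1 with probability `p = 0.5`, and the two haplotypes of a genotype are drawn uniformly with replacement. All function names are hypothetical.

```python
import random


def conflate(h1: str, h2: str) -> str:
    """Merge two haplotypes into a genotype: differing sites become 2."""
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


def random_haplotypes(m: int, k: int, p: float = 0.5, rng=None) -> list:
    """Draw m distinct haplotypes of k sites; each bit is 1 with probability p."""
    if rng is None:
        rng = random.Random()
    if m > 2 ** k or not 0 < p < 1:
        raise ValueError("need m <= 2**k and 0 < p < 1")
    haplotypes = set()
    while len(haplotypes) < m:
        haplotypes.add("".join("1" if rng.random() < p else "0" for _ in range(k)))
    return sorted(haplotypes)


def sample_genotypes(haplotypes: list, n: int, rng=None) -> list:
    """Form n genotypes, each from two uniformly chosen haplotypes."""
    if rng is None:
        rng = random.Random()
    return [conflate(rng.choice(haplotypes), rng.choice(haplotypes))
            for _ in range(n)]


rng = random.Random(0)  # fixed seed so that a run can be repeated
haplotypes = random_haplotypes(m=10, k=10, rng=rng)
sample = sample_genotypes(haplotypes, n=24, rng=rng)
print(len(haplotypes), len(sample), len(set(sample)))
```

The inferred haplotypes returned by a program for `sample` would then be compared with `haplotypes` to score the run.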

An important question is how to generate random haplotypes. Since some algorithms impose particular assumptions on the evolutionary model, simulation data sets generated from a particular model will certainly favor one program while impairing others. As pointed out in Niu et al. (2002), the performance of different in silico haplotyping methods is subtler than it appears to be; that is, the model underlying the simulation study can greatly affect the conclusion. We decided not to adopt any particular evolutionary model when generating haplotypes. We simply set every bit in every haplotype randomly and independently to 0 or 1 with some probability. (To test the applicability of our method under different conditions, we also ran simulations on data sets favoring particular evolutionary models, which are discussed later in this paper.)

The performance of a haplotype inference program has two aspects: (1) the accurate rate and (2) the recovery rate. Figure 2 (a) and Figure 2 (b) illustrate the accurate rate and recovery rate of the different programs under the parameter setting $m = 10$ and $k = 10$. It can be seen from the figures that for HAPAR, Haplotyper, Emdecoder and PHASE, the two measurements differ little, whereas HAPINFERX has an accurate rate better than its recovery rate (this is because HAPINFERX leaves some orphan genotypes unresolved when they are not resolvable by existing haplotypes). This fact was also supported by other simulation results. Therefore, in the rest of the section, we use the arithmetic mean of these two values, $(\text{accurate rate} + \text{recovery rate}) \times 0.5$, as the measurement of performance.

Also, in the rest of the section, 100 data sets were generated for each parameter setting, and

performance was calculated by taking the average of the performance values in the 100 runs.

We conducted simulations with different parameter settings and compared the performance of the five programs HAPAR, HAPINFERX, Haplotyper, Emdecoder and PHASE. Two sets of parameters were used: (1) $m = 10$, $k = 10$, with $n$ ranging from 5 to 24 (see Figure 2 (a) to Figure 2 (c)), and (2) $m = 18$, $k = 8$, with $n$ ranging from 9 to 40 (see Figure 2 (d)). As shown by the figures, HAPAR outperforms the other four programs in almost all cases. When $n$ is as small as $m/2$ (every haplotype appears in only one genotype), any resolution combination would be an optimal solution for the minimum number of origins problem, so HAPAR cannot identify the correct resolutions. In this case, other programs also have poor performance due to the lack of information. When $n$ becomes larger, the performance of all five programs improves, and HAPAR shows an obvious advantage over the others. When $n$ is large enough ($n = 12$ for $m = 10$, $k = 10$; and $n = 30$ for $m = 18$, $k = 8$), HAPAR, Haplotyper, Emdecoder and PHASE can all recover the original haplotypes successfully with high probability (the accurate rate and recovery rate are both greater than 0.9).

[Figure 2 plots omitted. Panels: (a) Accurate rate (m = 10; k = 10); (b) Recovery rate (m = 10; k = 10); (c) Performance (m = 10; k = 10); (d) Performance (m = 18; k = 8). The horizontal axis of each panel is the sample size, and the vertical axis is the accurate rate, recovery rate, or performance, ranging from 0 to 1. Curves: HAPAR, HAPINFERX, Haplotyper, Emdecoder, PHASE.]

Figure 2: Comparison of the performance of five programs on random data. In Figure 2 (a) to (c), originally there are 10 haplotypes of 10 SNPs; in Figure 2 (d), originally there are 18 haplotypes of 8 SNPs. Performance in Figure 2 (c) and (d) is the arithmetic mean of the accurate rate and recovery rate. For each parameter setting, 100 data sets were generated, and performance, accurate rate and recovery rate were calculated by taking the average over the 100 runs.

4.3 Simulations on maize data set

The maize data were used in Wang and Miao (2002) as one of the benchmarks to evaluate haplotyping programs. In locus 14 of the maize profile, 17 SNP sites and 4 haplotypes (with frequencies 9, 17, 8 and 1) were identified. We randomly generated a sample of $n$ genotypes from these haplotypes, each of which was formed by randomly picking 2 haplotypes according to their frequencies and conflating them. The results are summarized in Table 4. According to our experiments, HAPAR, Haplotyper, Emdecoder and PHASE can all recover the haplotypes correctly when the sample size $n$ is greater than or equal to 4, while HAPINFERX does not produce a satisfactory result until the sample size $n$ reaches 10. When the sample size is 3, HAPAR performs slightly better than the other programs, but none of them produces satisfactory results.

sample size  HAPAR  Haplotyper  Emdecoder  HAPINFERX  PHASE
3            0.60   0.55        0.38       0.38       0.56
4            1      1           1          0.625      1
7            1      1           1          0.625      1
10           1      1           1          0.9        1

Table 4: Comparison of performance of five programs on the maize data set (m = 4, k = 17).

4.4 Simulations on haplotypes forming a perfect phylogeny

Coalescence is one of the evolutionary models most commonly used in population genetics. The coalescent model of haplotype evolution says that without recombination, haplotypes can fit into a perfect phylogeny (Gusfield, 2001). Jin et al. (1999) found a 565 bp chromosome 21 region near the MX1 gene, which contains 12 polymorphic sites. This region is unaffected by recombination and recurrent mutation. The genotypes determined from sequence data of 354 human individuals were resolved into 10 haplotypes, the evolutionary history of which can be modeled by a perfect phylogeny. These 10 haplotypes were used to generate genotype samples of different sizes for evaluation.

The performance of the five programs on the MX1 data is compared in Figure 3. PHASE performs better than the other programs, because it incorporates the coalescent model, which fits this data set. Among the remaining four programs, which do not adopt the coalescent assumption, HAPAR performs better than the others when the sample size is relatively large ($n > 20$). Note that the performance of the other four programs remains almost unchanged when the sample size is greater than 20, whereas the performance of HAPAR continues to increase as the sample size increases. When the sample size is as large as 40, HAPAR even beats PHASE. This supports our hypothesis that when the sample size is large enough, our algorithm is likely to give accurate results.

[Figure 3 plot omitted. Horizontal axis: sample size (5 to 40); vertical axis: performance (0 to 1).]

Figure 3: Comparison of performance of five programs on the MX1 data set (m = 10, k = 12).

4.5 Simulations on haplotypes with recombination hotspots

To verify the robustness of our method in the presence of recombination events, we conducted simulations using the data on chromosome 5q31 studied in Daly et al. (2001). They reported a high-resolution analysis of the haplotype structure across 500 kb on chromosome 5q31 using 103 SNPs in a European-derived population. The results showed a picture of discrete haplotype blocks (of tens to thousands of kilobases), each with limited diversity. The discrete blocks are separated by intervals in which several independent historical recombination events seem to have occurred, giving rise to greater haplotype diversity for regions spanning the blocks.

We use the haplotypes from block 9 (with 6 sites and 4 haplotypes) and block 10 (with 7 sites, 6 of which are complete, and 3 haplotypes). There is a recombination spot between the two blocks, which is estimated to have a haplotype exchange rate of 27%. 9 new haplotypes with 12 sites were generated by connecting two haplotypes from block 9 and block 10 that were observed to have common recombination events, and their frequencies were normalized. Genotype samples of different sizes were randomly generated and used for evaluation.

According to the experimental results illustrated in Figure 4, Emdecoder, HAPAR, Haplotyper and PHASE have similar performance on these data sets. A strange phenomenon is that their performance increases very slowly after it reaches approximately 0.9. This is due to the confusion caused by recombination. Take a sequence of two SNP sites as an example. There are 4 haplotypes, 00, 01, 10 and 11. Both the combinations (00, 11) and (01, 10) result in genotype 22. The haplotype 11 is rare since its two sites are both mutant, while haplotype 00 is the most common. Given these frequencies, 11 is difficult to observe, and even if it appears, it is probably in genotype 22, which will likely be resolved into (01, 10). Unless the sample size is large enough that a genotype formed by (11, 01) or (11, 10) is sequenced, the existence of haplotype 11 will not be detected by haplotype inference programs. That is why the performance curves in Figure 4 increase slowly at around 0.9. As mentioned above, if the sample size is large enough, there is a good chance that the rare haplotypes are recovered. In this data set, when the sample size is raised to 50, the performance of HAPAR can reach 0.96.
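The two-site example can be checked with a few lines of Python. The helper `conflate` is a hypothetical name used only for this illustration.

```python
def conflate(h1: str, h2: str) -> str:
    """Merge two haplotypes into a genotype: differing sites become 2."""
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


# (00, 11) and (01, 10) are indistinguishable at the genotype level,
# while pairing 11 with 01 or 10 leaves a resolved site that reveals it.
for pair in (("00", "11"), ("01", "10"), ("11", "01"), ("11", "10")):
    print(pair, "->", conflate(*pair))
```

Genotype 22 gives no evidence for haplotype 11, whereas genotypes 21 and 12 each contain a homozygous mutant site that can only come from a pair that includes 11.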

[Figure 4 plot omitted. Horizontal axis: sample size (4 to 20); vertical axis: performance (0 to 1).]

Figure 4: Comparison of performance of five programs on the 5q31 data set (m = 9, k = 12).

4.6 Influence of the number of haplotypes and the number of SNP sites on performance

We also examined the influence of $m$, the number of haplotypes, and $k$, the number of SNP sites, on the performance of our algorithm. Original haplotype sets and then genotypes were generated randomly.

Figure 5 (a) to Figure 5 (c) illustrate the influence of the number of original haplotypes $m$ on performance when $k$ is fixed. In the three figures, we fix $k = 6, 10, 12$, respectively; and in each figure, performance under three different numbers of haplotypes, 8, 10 and 12, is compared. All three figures show that, with the same sample size $n$ and the same number of SNP sites $k$, the performance gets better when the number of haplotypes $m$ decreases, which is consistent with intuition. In each figure, the three performance curves have similar shapes. At first, the performance increases fast as the sample size $n$ increases; then, after the sample size $n$ becomes large enough and the performance becomes fairly high (above 0.9), the pace of increase slows down. (This also holds for all the data sets above.) Moreover, if we compare the three figures, we can see that the curves in Figure 5 (c) are steeper than those in Figure 5 (a). That is, the performance increases faster when the number of SNP sites is larger. This relationship is more clearly illustrated in Figure 5 (d) to Figure 5 (f).

Figure 5 (d) to Figure 5 (f) show the influence of different numbers of SNP sites on performance when $m$ is fixed. In the three figures, we fix $m = 8, 10, 12$, respectively; and in each figure, the performance under four different numbers of SNP sites, 6, 10, 12 and 15, is compared. The larger the number of SNP sites is, the faster the performance increases as the sample size increases. When the sample size is reasonably large, with the same sample size $n$ and the same original haplotype number $m$, HAPAR is more likely to detect the correct haplotypes when the number of SNP sites $k$ is larger. This is because when $k$ is small, it is more likely that the four patterns 00, 01, 10 and 11 exist simultaneously in the original haplotype set, which causes confusion for HAPAR.

5 Discussion

Haplotypes are the raw material of many genetic analyses, but the rapid growth of high-throughput genotyping techniques has not been matched by similar advances in cheap experimental haplotype determination. In this paper, we have introduced a new method for this problem. Experiments on both real data and simulation data support this new method. Our method imposes no assumption on the evolutionary history of the population. While integrating a particular evolutionary model into a haplotyping algorithm may favor a special class of data, such an algorithm may fail to infer haplotypes that do not fit the model. In contrast, our method is widely applicable. Experiments show that HAPAR is comparable to the best haplotyping programs on almost all data sets, and when the simulation data are generated fully randomly, it outperforms the other programs.

Our new method raises some interesting topics for further investigation. Does there exist a polynomial-time algorithm for the minimum number of origins problem? (We conjecture that the problem is NP-hard.) If no polynomial-time algorithm exists, an efficient approximation algorithm with a reasonable performance guarantee would also be desirable. Moreover, it would be interesting to design an efficient algorithm to replace the branch-and-bound search procedure; that is, given a set of genotypes and a set of candidate haplotypes, find a minimum-size subset of the candidate haplotypes that resolves all genotypes. Although the size of the candidate haplotype set can be exponential in the original problem size in the worst case, it is acceptable in many practical examples. Therefore, we are interested in accelerating the search procedure.

Another interesting problem is under what conditions the parsimony algorithm is able to infer all haplotypes correctly with high probability. All simulations show that the performance of HAPAR increases monotonically with the sample size, and we conjecture that when the sample size is large enough, our algorithm recovers the original haplotype set exactly. With the number of original haplotypes m and the number of SNP sites k fixed, we say that a sample size recovers the haplotypes with high probability if the average performance over 100 runs under that parameter setting is greater than 0.9; the smallest such sample size is referred to as the threshold sample size. Simulations have been conducted to examine the relationship between the threshold sample size and the number of haplotypes m. The result for k = 10 is summarized in Table 5: when m increases, the threshold sample size increases. Figure 5 shows that, for the same value of m, the threshold sample size decreases when k increases. Further statistical analysis may help to explain this relationship.
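The definition of the threshold sample size can be written down directly. The sketch below assumes a caller-supplied function `run_once(n)` (hypothetical) that simulates one data set of n genotypes for fixed m and k, runs the inference, and returns the performance score for that run; it does not invoke HAPAR itself. The defaults of 100 runs and a cutoff of 0.9 follow the text.

```python
from typing import Callable, Iterable, Optional


def threshold_sample_size(
    sample_sizes: Iterable[int],
    run_once: Callable[[int], float],
    runs: int = 100,
    cutoff: float = 0.9,
) -> Optional[int]:
    """Smallest sample size whose mean performance over `runs` runs exceeds `cutoff`.

    `run_once` is a hypothetical stand-in for one simulate-infer-score cycle.
    Returns None if no sample size in `sample_sizes` reaches the cutoff.
    """
    for n in sorted(sample_sizes):
        mean = sum(run_once(n) for _ in range(runs)) / runs
        if mean > cutoff:
            return n
    return None


# Toy stand-in for a simulation: performance grows linearly with n, capped at 1.
print(threshold_sample_size(range(5, 31), lambda n: min(1.0, n / 12)))  # 11
```

Scanning the sample sizes in increasing order returns the minimum qualifying size without relying on performance being monotone in n.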

[Figure 5: six line plots of performance (vertical axis, 0 to 1) against sample size (horizontal axis). Panels: (a) performance under different m (k=6); (b) performance under different m (k=10); (c) performance under different m (k=12); (d) performance under different k (m=8); (e) performance under different k (m=10); (f) performance under different k (m=12). Legend shown in panel (a): 8, 10 and 12 haplotypes. Legend shown in panel (d): 6, 10, 12 and 15 SNP sites.]

Figure 5: Comparison of performance under different numbers of haplotypes and numbers of SNP sites. (a) to (c) illustrate the influence of different m on performance. (d) to (f) illustrate the influence of different k on performance.

original haplotype number    5    6    8   10   12   14   15
threshold sample size        7    7    9   11   14   17   18

Table 5: Relationship between the threshold sample size and the original haplotype number (k = 10).

Acknowledgements

We thank A. Clark for kindly giving us HAPINFERX. We also thank J. Liu and M. Stephens for providing software on their web pages. The work is fully supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China [Project No. CityU 1047/01E].

References

Bafna,V., Gusfield,D., Lancia,G. and Yooseph,S. (2002) Haplotyping as perfect phylogeny: A direct approach. Technical Report UCDavis CSE-2002-21.

Chiano,M. and Clayton,D. (1998) Fine genetic mapping using haplotype analysis and the missing data problem. Am. J. Hum. Genet., 62, 55-60.

Clark,A. (1990) Inference of haplotypes from PCR-amplified samples of diploid populations. Mol.

Biol. Evol., 7, 111-122.

Clark,A., Weiss,K., Nickerson,D., Taylor,S., Buchanan,A., Stengard,J., Salomaa,V., Vartiainen,E.,

Perola,M., Boerwinkle,E. and Sing,C. (1998) Haplotype structure and population genetic inferences

from nucleotide-sequence variation in human lipoprotein lipase. Am. J. Hum. Genet., 63, 595-612.

Drysdale,C., McGraw,D., Stack,C., Stephens,J., Judson,R., Nandabalan,K., Arnold,K., Ruano,G. and Liggett,S. (2000) Complex promoter and coding region β2-adrenergic receptor haplotypes alter receptor expression and predict in vivo responsiveness. Proc. Natl. Acad. Sci. USA, 97, 10483-10488.

Daly,M., Rioux,J., Schaffner,S., Hudson,T. and Lander,E. (2001) High-resolution haplotype structure in the human genome. Nat. Genet., 29, 229-232.

Excoffier,L. and Slatkin,M. (1995) Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol. Biol. Evol., 12, 921-927.

Gusfield,D. (2001) Inference of haplotypes from samples of diploid populations: complexity and

algorithms. J. Comput. Biol., 8, 305-323.

Gusfield,D. (2002) Haplotyping as perfect phylogeny: conceptual framework and efficient solutions. RECOMB02.

Hawley,M. and Kidd,K. (1995) HAPLO: a program using the EM algorithm to estimate the frequencies of multi-site haplotypes. J. Hered., 86, 409-411.

Jin,L., Underhill,P., Doctor,V., Davis,R., Shen,P., Cavalli-Sforza,L. and Oefner,P. (1999) Distribution of haplotypes from a chromosome 21 region distinguishes multiple prehistoric human migrations. Proc. Natl. Acad. Sci. USA, 96, 3796-3800.

Long,J., Williams,R. and Urbanek,M. (1995) An E-M algorithm and testing strategy for multi-locus haplotypes. Am. J. Hum. Genet., 56, 799-810.

Niu,T., Qin,Z., Xu,X. and Liu,J. (2002) Bayesian haplotype inference for multiple linked single-

nucleotide polymorphisms. Am. J. Hum. Genet., 70, 157-169.

Rieder,M., Taylor,S., Clark,A. and Nickerson,D. (1999) Sequence variation in the human angiotensin converting enzyme. Nat. Genet., 22, 59-62.

Stephens,M., Smith,N. and Donnelly,P. (2001) A new statistical method for haplotype reconstruction. Am. J. Hum. Genet., 68, 978-989.

Wang,X. and Miao,J. (2002) In-silico haplotyping: state-of-the-art. UDEL technical reports.

Zhang,J., Vingron,M. and Hoehe,M. (2001) On haplotype reconstruction for diploid populations.

EURANDOM report 2001-026.
