Detecting Epistasis in the Human Genome

Tara Eicher and Vinay Karagod

Algorithms for Detecting Epistasis in the Human Genome

Abstract

Epistasis is believed to play a significant role in the genetic characterization of many human diseases. This paper focuses on modeling epistasis from a statistical perspective using various types of epistasis models, a challenging task due to the magnitude of genome data and the complexity of finding correlations between multiple genes. In this paper, six algorithms are surveyed: BOOST, MDR/MDR-SP, TEAM, SNPRuler, AntEpiSeeker, and MegaSNPHunter. The mechanisms of these algorithms are described, and each algorithm is analyzed for computational complexity, detection power, and robustness to missing data, genotyping error, phenocopy, and population stratification. It is found that each of the algorithms can be applied effectively within its respective domain but that none of them performs best in all domains.

Introduction

Broadly speaking, epistasis can be defined as the interaction between two genes at different locations (or loci) on a chromosome to produce a specific phenotype. It has been shown that epistasis likely plays a role in several diseases, including diabetes, hypertension, obesity [1], breast cancer, coronary heart disease, and Alzheimer’s disease [2]. Clearly, epistasis is an important phenomenon in genetics. However, due to a lack of genetic information and computational power, investigating epistasis was not feasible until recently. Today, it has become possible to search for epistatic interactions in the human genome using GWASs (Genome Wide Association Studies).

When discussing gene-gene interactions, it is common to speak in terms of SNPs, or "single nucleotide polymorphisms", rather than trying to analyze genes as atomic entities. As the name implies, a SNP is a nucleotide variation (A, T, C, or G) at a particular locus. For instance, the condition of having nucleotide A at locus 12345 is a SNP, while C at locus 12345 is another, and G at 12347 yet another. SNPs are reference points or "markers" for genetic variations [3]. To computationally analyze epistasis in terms of SNP interactions, one must first characterize this phenomenon mathematically with respect to the penetrance, or “probability of developing disease given genotype” [4], of the individual and combined SNPs. This is not an easy task, as SNPs can interact to produce penetrance in accordance with many different mathematical models [4]. The simplest is the additive model, which corresponds to the summation of the effects of the individual SNPs [4], called the main effects [5] or marginal effects [2]. Other models include the multiplicative model, the heterogeneous model [4], otherwise known as the XOR model [1], and combinations of these. Some algorithms aim to detect epistatic interactions including main effects, while others do not. Epistasis following a model without main effects is called pure epistasis [5]. Likewise, some algorithms aim to detect interactions between only two SNPs, while others aim for higher-order interactions. When an epistatic interaction has been found such that no subset of SNPs involved in the model can produce the same effect, this is known as strict epistasis [5].

Epistasis characterized using mathematical models and penetrance measures is known as “statistical epistasis” [4]. Several technical definitions exist. When introducing their algorithm SNPRuler, which only detects pure epistasis, Wan et al define epistasis as “departure from marginal effects of different genetic loci in a way that they combine to cause disease” [2]. In a study aimed at detecting pure, two-locus (or pairwise) interactions, Wan et al define it as “the joint behavior of two loci as the statistical deviation from their additive effects” [6]. Many other works do not provide an explicit definition but contain implicit definitions in the mechanism of the algorithm.

There are about 10 million SNPs in the human genome [6]. This high number of SNPs illustrates why epistasis detection falls under the domain of big data. Detecting epistasis between two or more SNPs from such a large domain is not possible by using exhaustive search, especially because this requires O(2^P) running time, where P is the size of the domain [2]. Even solving a subset of this problem is challenging. For instance, detecting pairwise interactions for 500,000 SNPs requires 1.25 x 10^11 statistical tests [4].
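The pairwise figure can be checked with a short calculation. The sketch below counts unordered SNP sets of a given interaction order using the binomial coefficient; the function name is illustrative and not taken from any of the surveyed tools.

```python
from math import comb


def count_snp_sets(num_snps: int, order: int) -> int:
    """Number of distinct unordered SNP sets of the given interaction order."""
    return comb(num_snps, order)


# Pairwise tests for 500,000 SNPs: 500000 * 499999 / 2, roughly 1.25 x 10^11.
pairwise_tests = count_snp_sets(500_000, 2)

# Three-way interactions grow far faster, which is why exhaustive
# higher-order search quickly becomes impractical.
three_way_tests = count_snp_sets(500_000, 3)

print(f"pairwise:  {pairwise_tests:.3e}")
print(f"three-way: {three_way_tests:.3e}")
```

Moving from second to third order multiplies the number of tests by roughly N/3, so each added order costs about five further orders of magnitude at this scale.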

In addition to the characterization and computational complexity, there are other problematic aspects of epistasis detection. Missing data, genotyping errors, phenocopy (the influence of environmental factors on disease) [7], and population stratification [8] (the tendency toward a specific SNP in a specific subpopulation) [9] can be difficult to handle in a dataset and can skew the results of the statistical tests. Furthermore, once an algorithm has been proposed, it is difficult to find a gold standard for measuring the effectiveness of that algorithm. The GAMETES program described by Urbanowicz et al, which classifies pure, strict, two-locus epistasis models by shape, has been proposed as a gold standard for determining the detection power of an algorithm [5]. Typically, the detection power of an algorithm is determined by comparing it to another algorithm or (if testing on a small data set) to the brute force method. Performance is usually evaluated on the basis of a comparison to an earlier method.

In general, all modern methods for detecting epistasis follow a specific protocol. First, they make use of datasets containing genotype data for each SNP and phenotype data for many individuals, in which the phenotype data consists of case and control labels for particular diseases, where a case label indicates that the individual has the disease, and a control label indicates that the individual does not have the disease. Based on these datasets, the algorithm in use finds statistical correlations between a set of SNPs and a case label for a specific disease without performing the full statistical test for all possible SNP sets in order to save time. This can be accomplished using either exhaustive methods or non-exhaustive methods. Typically, exhaustive methods consist of a data structure containing some statistics for each SNP set [7], a screening or pruning procedure involving analysis of the data structure, and an in-depth testing procedure on the remaining SNP pairs. Non-exhaustive methods consist of a random or heuristic [7] procedure that involves calculating statistics on some of the SNP sets and returning likely candidates for epistasis, followed by an in-depth testing procedure run on the likely candidates. While many exhaustive and non-exhaustive algorithms exist, this paper focuses on three exhaustive algorithms (BOOST, MDR/MDR-SP, and TEAM) and three non-exhaustive algorithms (SNPRuler, AntEpiSeeker, and MegaSNPHunter).

Exhaustive Algorithms

BOOST

BOOST, which stands for "BOolean Operation-based Screening and Testing", was proposed by Wan et al in 2010 [1]. It is an exhaustive algorithm that screens potential interactions using approximate logistic regression models and makes use of Boolean values to represent genetic data, which reduces computational complexity [1]. BOOST only detects pure epistasis involving two-locus interactions [7]. The basic idea behind BOOST is to build a logistic regression model for the complete interaction for each pair (including main effects) and another logistic regression model for only the main effects. For each of these models, a statistic called the log-likelihood is computed, represented by L_M for the main-effect model and L_F for the full model. Because the full model contains the main-effect model, L_F is never smaller than L_M, and the interactive effect, represented by L_F - L_M, must be greater than a specified value τ in order to be considered for full statistical evaluation. Those pairs whose interactive effects exceed τ undergo a χ² test to determine whether the interaction is epistatic [1].

However, logistic regression is too time-consuming to compute for each SNP pair, so BOOST actually approximates the logistic regression models using log-linear models. To do this, it creates a contingency table for three variables: SNP pair member one, SNP pair member two, and the phenotype value (either “case” or “control”). For each cell, the number of occurrences n_ijk is stored, and π_ijk is the probability that a member of the dataset falls into that cell. The log-likelihood of the full model is given in [1, Equation 7]. The log-likelihood of the model with only main effects is too computationally expensive to compute directly. For this reason, a value called the Kirkwood superposition approximation, given in [1, Equation 13], is used in its place [1].

Notably, BOOST’s method for implementing the contingency table is not as straightforward as the normal method for implementing such a table. Rather than storing counts as single integers, BOOST uses two bit-strings for each count: one for cases, the other for controls. Because Boolean data is easy to store and lends itself naturally to logical operations, this implementation saves both space and time [1].
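The bit-string idea can be illustrated with a minimal sketch. It is not BOOST's actual implementation: Python integers stand in for the bit-strings, all function names are hypothetical, and the two log-likelihood functions assume the textbook forms (the saturated log-linear fit, with estimated cell probability n_ijk / n, and the normalized Kirkwood superposition approximation built from pairwise and single-variable marginals). The exact expressions in [1, Equation 7] and [1, Equation 13] may differ in notation.

```python
from math import log

CELLS = [(i, j, k) for i in range(3) for j in range(3) for k in range(2)]


def encode_snp(genotypes, labels):
    """masks[k][g] has bit s set when individual s has phenotype k
    (0 = control, 1 = case) and genotype g (0, 1, or 2)."""
    masks = [[0, 0, 0], [0, 0, 0]]
    for s, (g, k) in enumerate(zip(genotypes, labels)):
        masks[k][g] |= 1 << s
    return masks


def contingency_table(masks_a, masks_b):
    """n[i][j][k]: each count is a bitwise AND followed by a bit count."""
    return [[[bin(masks_a[k][i] & masks_b[k][j]).count("1") for k in range(2)]
             for j in range(3)] for i in range(3)]


def saturated_loglik(n):
    """Log-likelihood of the saturated model (assumed form)."""
    total = sum(n[i][j][k] for i, j, k in CELLS)
    return sum(n[i][j][k] * log(n[i][j][k] / total)
               for i, j, k in CELLS if n[i][j][k] > 0)


def ksa_loglik(n):
    """Log-likelihood under the normalized Kirkwood superposition
    approximation (assumed form): p_ij * p_ik * p_jk / (p_i * p_j * p_k)."""
    n_ij = [[sum(n[i][j]) for j in range(3)] for i in range(3)]
    n_ik = [[sum(n[i][j][k] for j in range(3)) for k in range(2)] for i in range(3)]
    n_jk = [[sum(n[i][j][k] for i in range(3)) for k in range(2)] for j in range(3)]
    n_i = [sum(row) for row in n_ij]
    n_j = [sum(n_ij[i][j] for i in range(3)) for j in range(3)]
    n_k = [sum(n_ik[i][k] for i in range(3)) for k in range(2)]

    def raw(i, j, k):
        denom = n_i[i] * n_j[j] * n_k[k]
        # The sample-size factors cancel, so counts can be used directly.
        return n_ij[i][j] * n_ik[i][k] * n_jk[j][k] / denom if denom else 0.0

    eta = sum(raw(i, j, k) for i, j, k in CELLS)  # normalizing constant
    return sum(n[i][j][k] * log(raw(i, j, k) / eta)
               for i, j, k in CELLS if n[i][j][k] > 0)


# Toy data: eight individuals, two SNPs, 0 = control and 1 = case.
snp_a = [0, 1, 2, 0, 1, 2, 0, 1]
snp_b = [0, 0, 1, 1, 2, 2, 0, 1]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
table = contingency_table(encode_snp(snp_a, labels), encode_snp(snp_b, labels))
print(saturated_loglik(table), ksa_loglik(table))
```

The saturated log-likelihood can never fall below the approximate one, because the saturated fit maximizes the likelihood over all cell distributions; the gap between the two is the kind of quantity a screening stage would compare against a threshold.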

The creators of BOOST tested its performance against older algorithms not covered in this paper and against real genetic data from the Wellcome Trust Case Control Consortium (WTCCC). When applied to WTCCC data for the purpose of detecting bipolar disorder, coronary artery disease, Crohn’s disease, hypertension, Rheumatoid Arthritis, type 1 diabetes, and type 2 diabetes, BOOST was able to find many epistatic interactions related to type 1 diabetes, but none related to any other disease [1]. This may be because no other epistatic interactions exist, or it may be because interactions that do exist involve more than two loci or include main effects. When tested by Shang et al against other contemporary algorithms, including TEAM, SNPRuler, and AntEpiSeeker, using simulated data, BOOST had higher detection power for epistatic interactions without main effects than any other algorithm and was faster than any other algorithm [7]. BOOST is not capable of handling missing data, but handled genotyping errors and phenocopy perfectly in Shang et al’s study [7]. It is not designed to handle population stratification [1].

MDR/MDR-SP

Unlike BOOST, which can only handle two-locus interactions, Multifactor Dimensionality Reduction, or MDR, is designed to detect high-order interactions [8], and it is also designed for handling sparse data [8]. In MDR, multi-locus interactions are pooled into high-risk and low-risk groups [8], making it possible to reduce the multidimensional genotype predictor to a single dimension. The algorithm then chooses the best loci based on the one-dimensional values. Several variations of this algorithm exist [8].
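The pooling step can be sketched as follows. The text above does not state the rule used to assign a genotype combination to a group, so the sketch assumes the commonly described one: a combination is high-risk when its case-to-control ratio exceeds a threshold (1.0 here, which suits a balanced data set). The function names and toy data are illustrative only.

```python
from collections import Counter


def mdr_risk_labels(genotype_combos, labels, threshold=1.0):
    """Map each observed multi-locus genotype to 1 (high risk) or 0 (low risk).

    Assumed rule: high risk when cases / controls exceeds `threshold`.
    """
    cases = Counter(g for g, y in zip(genotype_combos, labels) if y == 1)
    controls = Counter(g for g, y in zip(genotype_combos, labels) if y == 0)
    risk = {}
    for combo in set(cases) | set(controls):
        # A combination seen only in cases has an unbounded ratio.
        if controls[combo]:
            ratio = cases[combo] / controls[combo]
        else:
            ratio = float("inf")
        risk[combo] = 1 if ratio > threshold else 0
    return risk


def reduce_to_one_dimension(genotype_combos, risk):
    """Replace each multi-locus genotype by its single risk label."""
    return [risk[g] for g in genotype_combos]


# Toy data: two-locus genotypes for eight individuals (1 = case, 0 = control).
combos = [(0, 0), (0, 0), (0, 1), (0, 1), (1, 1), (1, 1), (1, 1), (2, 0)]
status = [0, 0, 1, 0, 1, 1, 0, 1]

risk = mdr_risk_labels(combos, status)
one_dim = reduce_to_one_dimension(combos, risk)
print(risk)
print(one_dim)
```

After the reduction, every individual carries one binary predictor no matter how many loci were combined, which is what allows models of different orders to be compared on the same footing.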

MDR has a high false positive rate in the presence of population stratification. In order to overcome this issue, Niu et al defined a more efficient and robust variation: MDR-SP [8]. In this method, adjustment for population stratification is done by calculating an eigenvector on the variance-covariance matrix of all individuals in the data set. The first K components of the eigenvector are considered as the genetic background for a population [8]. The data set is then divided into a training set and a testing set. In the training set, genotype scores of multiple loci are mapped onto binary vectors of length j for each individual, where j is the total number of genotype combinations and the binary values represent whether or not the individual has that respective genotype [8]. The vectors and trait values are then adjusted by the background information using linear regression, and a probabilistic score T_j^2 is calculated on all individuals to determine whether there is a correlation between j and the given trait. A genotype is considered high-risk only if T_j^2 is greater than 0. Otherwise, it is considered a low-risk genotype, reducing the values to one dimension.

The one-dimensional values are also adjusted by the background information using linear regression. These values are used to determine the fitness of all models built on the training set and the testing set, and cross-validation is performed between the two. In order to cross-validate the results of the training set with the testing set, the prediction error is calculated by comparing the output of the training sets with the testing sets, and this process is repeated until all the possible sets are calculated [8]. The best models are then used to choose candidates for final analysis.

In [8, Figure 1], two-locus genotypes are considered. Darker-shaded regions are treated as high-risk genotypes, lightly-shaded regions as low-risk genotypes, and empty cells are the ones for which no data was observed. This adjusted output is what is used to predict the trait values [8].

Experiments reveal a high false positive rate for MDR [10] but a lower one for MDR-SP [8]. MDR also has a heavy computational burden [11]. Upstill-Goddard et al found MDR to handle 5% genotyping error and 5% missing data effectively, but its power dropped for phenocopy [10]. Significantly, MDR is not dependent on a particular model, which makes it applicable to different genetic scenarios.

TEAM

The final exhaustive algorithm is TEAM, or Tree-based Epistasis Association Mapping, which was proposed by Zhang et al in 2010 [12]. It is similar to BOOST in that it only detects two-locus interactions [7] and uses a contingency table [12]. In the case of TEAM, however, an undirected minimum spanning tree is initially used to represent the data, and this structure allows for a quick updating procedure for the contingency table. TEAM also makes use of a permutation test to minimize false positives [7]. In the tree, the nodes are SNPs, and each edge is weighted by “the number of individuals having different genotypes in the two SNPs” [7], i.e. the number of individuals whose genotype pair for SNP one and SNP two is (0, 1), (0, 2), (1, 2), …, or (2, 1) rather than (0, 0), (1, 1), or (2, 2). As Zhang et al prove, it is possible to update a contingency table cell for two SNPs X_i and X’_j based on the following three factors: the set of individuals with both X_i and the desired phenotype, the value of the edge between some X_j and X’_j, and the value of the cell corresponding to X_i and X_j [7]. For each leaf, TEAM uses available information to update the contingency table cells corresponding to connections between the leaf and its ancestors and then deletes the leaf. This process is repeated with each leaf using breadth-first search and then, because the leaves at that level have been deleted, it continues for the parents of those leaves, which are the new leaves [7]. This allows for a runtime of O(MNK + MN^2 + W_T NK) rather than the brute force runtime of O(MN^2 K), where M represents the number of individuals, N the number of SNPs, K the number of permutations, and W_T the weight of the spanning tree [7]. The contingency table is then used to detect interactions using a statistical test. Zhang et al’s experiments use a chi-squared test, but this is not a requirement, as long as one uses a statistical test that can be applied to a contingency table [8].
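The tree construction can be sketched as below. The edge weight follows the definition quoted above (the number of individuals whose genotypes differ between two SNPs); the use of Prim's algorithm, the function names, and the toy data are assumptions made for illustration. The sketch does not reproduce TEAM's incremental contingency-table update or its permutation test.

```python
def genotype_difference(snp_a, snp_b):
    """Edge weight: number of individuals whose genotypes differ."""
    return sum(1 for a, b in zip(snp_a, snp_b) if a != b)


def minimum_spanning_tree(snps):
    """Prim's algorithm on the complete SNP graph.

    `snps` is a list of genotype vectors, one per SNP.
    Returns a list of (parent, child, weight) edges rooted at SNP 0.
    """
    # best[v] = (cheapest known weight linking v to the tree, tree endpoint)
    best = {v: (genotype_difference(snps[0], snps[v]), 0)
            for v in range(1, len(snps))}
    edges = []
    while best:
        child = min(best, key=lambda v: best[v][0])
        weight, parent = best.pop(child)
        edges.append((parent, child, weight))
        # The newly added SNP may offer cheaper links to the remaining SNPs.
        for v in best:
            w = genotype_difference(snps[child], snps[v])
            if w < best[v][0]:
                best[v] = (w, child)
    return edges


# Toy data: four SNPs genotyped in six individuals.
snps = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 1],
    [2, 1, 0, 2, 1, 0],
    [2, 1, 0, 2, 1, 1],
]
tree = minimum_spanning_tree(snps)
tree_weight = sum(w for _, _, w in tree)
print(tree, tree_weight)
```

A light tree means neighbouring SNPs differ in few individuals, so only a few individuals need to be revisited when moving from one SNP's table to the next; this is the intuition behind the W_T term in the runtime.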

Unlike BOOST, TEAM detects interactions with and without main effects [7]. Zhang et al’s study demonstrates that this algorithm is one to two orders of magnitude faster than the brute force approach [12], but this is still slower than any of the algorithms analyzed in Shang et al’s study, largely because of the permutation test TEAM uses [7]. Although TEAM detected all simulated interactions in Zhang et al's study [12], Shang et al found that the detection power is good but model-sensitive [7]. Likewise, its robustness to genotyping error and phenocopy is also model-sensitive. TEAM was not tested for robustness to missing data because it uses external tools to handle this [18]. There is no mention of population stratification in Zhang et al’s work, so it can be assumed that TEAM does not handle this.

Non-Exhaustive Algorithms

As explained previously, all of the above methods must perform some statistical analysis on all SNP pairs, which is why they are labeled as exhaustive. Other algorithms exist that do not examine all SNP pairs in an effort to reduce complexity. SNPRuler, AntEpiSeeker, and MegaSNPHunter are three of these algorithms.

SNPRuler

SNPRuler was proposed by Wan et al in 2009 [2]. It focuses on pure epistasis [7], detects higher-order interactions in addition to two-locus, and is model-independent [2]. SNPRuler is unique in that it uses machine learning to find predictive rules [2]. An example of the type of rule that would be created by a SNPRuler algorithm is as follows: (SNP i has value v and SNP j has value w and ...) → this is a case sample. These rules are then ranked in order of relevance [2].

Like TEAM, SNPRuler represents data in the form of a tree, where SNPs are nodes. However, this is a simple tree with no weights. SNPRuler uses paths along the tree to represent possible interactions, and it builds the rules using bounded depth-first search. During the traversal, SNPRuler adds each SNP to the left side of each rule r. A contingency table already exists for r with the following four counts: the number a of controls that follow from r, the number b of cases that follow from r, the number c of controls that do not follow from r, and the number d of cases that do not follow from r. This table is updated after adding the new SNP to obtain a’, b’, c’, and d’. The utility value of r is then measured using [2, Equation 6], in which m = min(a, a’) [2]. If adding the new SNP is found to have increased the utility of r, then it is permanently added to r. Otherwise, a new rule is spawned with the new SNP on the left side of the implication. This process continues until the bound has been reached [2]. At this point, the rules are ranked by utility value, and the highest-valued rules are selected. The second stage of the algorithm performs a χ² test on interactions from derived rules [2].
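The bookkeeping behind a rule can be sketched as follows. The sketch interprets "follows from r" as "the sample satisfies the left side of r", represents a rule as a hypothetical mapping from SNP index to required genotype, and uses the standard Pearson χ² statistic for a 2x2 table in place of the second-stage test. The utility function of [2, Equation 6] is not reproduced here.

```python
def rule_matches(rule, sample):
    """True when the sample has the required genotype at every SNP in the rule."""
    return all(sample[snp] == value for snp, value in rule.items())


def rule_counts(rule, samples, labels):
    """Return (a, b, c, d): matching controls, matching cases,
    non-matching controls, non-matching cases."""
    a = b = c = d = 0
    for sample, is_case in zip(samples, labels):
        if rule_matches(rule, sample):
            if is_case:
                b += 1
            else:
                a += 1
        elif is_case:
            d += 1
        else:
            c += 1
    return a, b, c, d


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


# Toy data: six samples with three SNPs each (1 = case, 0 = control).
samples = [(0, 1, 2), (0, 1, 0), (1, 1, 2), (0, 1, 1), (2, 0, 0), (0, 1, 2)]
labels = [1, 1, 0, 1, 0, 0]

rule = {0: 0, 1: 1}          # SNP 0 has value 0 and SNP 1 has value 1
extended = {**rule, 2: 2}    # the same rule after adding SNP 2 with value 2

counts = rule_counts(rule, samples, labels)
extended_counts = rule_counts(extended, samples, labels)
statistic = chi_squared_2x2(*counts)
print(counts, extended_counts, statistic)
```

Adding a SNP can only shrink the set of matching samples, so a’ <= a and b’ <= b always hold; the utility measure decides whether the more specific rule is worth keeping.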

The performance of SNPRuler depends on the genotype variance because the utility function is based on genotype variance. The worst case is when there is no variance, in which case all SNPs are evaluated. However, the experiments performed by Wan et al reveal that 99.1% of all branches are pruned on average [2]. In Shang et al’s study, SNPRuler handled missing data and genotyping error well, and its detection power actually increased in the presence of phenocopy because models with main effects built on datasets with phenocopy tend to resemble models with no main effects, for which SNPRuler is designed [7]. The running time is low and increases at a moderate rate, but the memory requirements are high [7]. As with TEAM, the creators of SNPRuler make no references to population stratification.


AntEpiSeeker

Another non-exhaustive algorithm capable of detecting high-order epistatic interactions is AntEpiSeeker, which was proposed by Wang et al in 2010 [13]. Like MDR, AntEpiSeeker detects interactions with and without main effects [7]. It is an ant colony algorithm [13], which means that it uses a probability distribution function (PDF) that parallel processing units update by weights. This mimics the process of ants leaving pheromones along paths toward food, making it easier for other ants to find the correct path. The PDF is defined as follows:

[13, Equation 1]

Here, the PDF depends on the pheromone level for locus k, on α, the weight of importance given to the pheromone levels, and on a term representing prior information (here set to 1). For this algorithm, each processing unit (ant) focuses on a set of SNPs larger than the expected order of interaction [13]. It computes the χ² value for that set and updates the PDF using the formula in [13, Equation 2], where i is the iteration and the increment term equals 0.1 χ². In the next iteration, the ants choose new, smaller sets based on the PDF. This continues for a fixed number of iterations and results in a list of highly probable sets. The next step is to perform exhaustive search on all sets in the list. Because AntEpiSeeker has a high false positive rate, a procedure is often followed to minimize false positives. In this procedure, the first step is to initialize a set EI_m to null. Then, for each interaction detected, the procedure tries to add it to EI_m. If it overlaps with a preexisting interaction in EI_m, it checks the p-values of this and the overlapping interaction. The one with the smaller p-value, i.e. the more significant one, is kept in EI_m, and EI_m is returned as the final solution [13].
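One ant's sampling and update step might look like the sketch below. Because the equations of [13] are not reproduced in this text, the sketch assumes the generic ant colony form: a locus is drawn with probability proportional to its pheromone level raised to α (the prior term is 1 and drops out), and the update evaporates every level by a rate before depositing 0.1 χ² on the loci of the evaluated set. The evaporation rate, the function names, and the toy values are all assumptions.

```python
import random


def selection_probabilities(pheromone, alpha=1.0):
    """Probability of picking each locus, proportional to pheromone ** alpha."""
    weights = [level ** alpha for level in pheromone]
    total = sum(weights)
    return [w / total for w in weights]


def choose_snp_set(pheromone, set_size, rng, alpha=1.0):
    """Draw `set_size` distinct loci, one at a time, weighted by pheromone."""
    available = list(range(len(pheromone)))
    chosen = []
    for _ in range(set_size):
        weights = [pheromone[k] ** alpha for k in available]
        pick = rng.choices(available, weights=weights, k=1)[0]
        available.remove(pick)
        chosen.append(pick)
    return sorted(chosen)


def update_pheromone(pheromone, snp_set, chi2, evaporation=0.1):
    """Evaporate all levels, then deposit 0.1 * chi2 on the evaluated loci.

    The evaporation rate is an assumed parameter, not taken from the text.
    """
    updated = [(1.0 - evaporation) * level for level in pheromone]
    for k in snp_set:
        updated[k] += 0.1 * chi2
    return updated


rng = random.Random(0)                 # fixed seed for repeatability
pheromone = [1.0] * 10                 # uniform start: every locus equally likely
snp_set = choose_snp_set(pheromone, 4, rng)
pheromone = update_pheromone(pheromone, snp_set, chi2=20.0)
probabilities = selection_probabilities(pheromone)
print(snp_set, probabilities)
```

After one update, the loci in the evaluated set hold more pheromone than the rest, so later ants are more likely to revisit them; this is the reinforcement that the pheromone analogy describes.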

When tested by Wang et al using three different models, it was found that AntEpiSeeker had an acceptable false positive rate (after performing the minimization procedure) [13]. Wang et al also found that AntEpiSeeker identified more epistatic interactions related to Rheumatoid Arthritis in the WTCCC than preexisting algorithms [13] and that performance was better in the presence of population stratification than for preexisting algorithms [13]. In Shang et al’s study, AntEpiSeeker performed well for models with and without main effects [7]. However, detection power was inversely proportional to the size of the data set due to the nature of the algorithm [7]. AntEpiSeeker handled missing data, genotyping errors, and phenocopy well, and its running time was moderate, though slower than SNPRuler and BOOST [7].

MegaSNPHunter

The last non-exhaustive algorithm, MegaSNPHunter, is significant because it ranks SNPs instead of filtering out ones that have weak main effects. Initially, the genome is divided into sub-genomes, where every sub-genome covers the respective haplotype effects. For each sub-genome, MegaSNPHunter builds a boosting tree classifier using several regression trees [14]. The contribution of each SNP to the boosting tree classifier is calculated by an importance measure in which T_j is the boosting tree for sub-genome j and e_v is the error reduction on SNP S_i [14]. The SNPs are ranked according to this importance value. SNPs with ranks above a certain cutoff point are passed on to the next iteration. This process continues until only a small set of SNPs is left over [14]. Then, rather than using exhaustive search on the remaining set (which is time-consuming), all paths are extracted from the final classifier tree, and they are ranked based on a statistical function H, which measures the fraction of the interactive effect not caused by marginal effects. The highest-ranked SNP sets are returned as the epistatic interactions.


Wan et al performed a comparative study between MegaSNPHunter and an older algorithm, BEAM. In the case of simulated data and real data on Parkinson's disease and Rheumatoid Arthritis, MegaSNPHunter was more powerful in identifying SNP interaction than BEAM. The running time was faster than BEAM [14], which in turn is faster than MDR [11]. However, there was a high rate of false positives when tested on randomly selected SNPs [14]. To the authors’ knowledge, there have been no studies regarding the effectiveness of MegaSNPHunter in the presence of genotyping error, phenocopy, population stratification, or missing data.

Conclusions

Each of the algorithms studied is designed for specific sub-problems in epistasis and works well within its domain. However, none of them can be considered an all-purpose epistasis detection algorithm for use in GWAS. In summary, BOOST and TEAM focus on pairwise interactions only, while the other four algorithms are capable of detecting higher-order interactions. BOOST and SNPRuler focus on only pure epistasis, whereas MDR, TEAM, MegaSNPHunter, and AntEpiSeeker detect interactions with and without main effects.

BOOST is the fastest algorithm, followed by SNPRuler. MegaSNPHunter and AntEpiSeeker have more moderate performance, while TEAM and MDR are both slow. MDR, MegaSNPHunter, and AntEpiSeeker are all known for having high false positive rates, whereas those of the other algorithms are lower. MDR, AntEpiSeeker, and SNPRuler all handle missing data effectively; BOOST, AntEpiSeeker, and SNPRuler are robust to phenocopy; and nearly all of the algorithms handle genotyping error well. While MDR-SP was designed for population stratification, AntEpiSeeker is also effective in handling this phenomenon. Each algorithm exhibits certain other peculiarities that must be considered as well. For instance, the detection power of TEAM seems to be limited to certain epistasis models, SNPRuler’s performance depends on genotype variation, and AntEpiSeeker consumes a large amount of memory and performs best when applied to small data sets.

Therefore, which method to use depends on the criteria of the test. For instance, if time is an important constraint and the interactions sought are likely to be pure and strict, then BOOST would likely be an ideal choice. If memory is not a large constraint, but higher-order interactions with possible main effects are sought, AntEpiSeeker is probably a good choice. If time is not a serious constraint and the goal is to find pairwise interactions that may include main effects, the tester may want to consider TEAM. Finally, in the presence of population stratification, MDR-SP would likely perform well. While the goal of finding the perfect epistasis detection algorithm has yet to be reached, these algorithms and others provide a good means for detecting certain types of epistasis.

References

[1] X. Wan et al, “BOOST: A Fast Approach to Detecting Gene-Gene Interactions in Genome-wide Case-Control Studies,” in The American Journal of Human Genetics, 87, The American Society of Human Genetics, Sep. 10, 2010. Pp. 325-340. Available: AJHC, www.cell.com. [Accessed: Apr. 12, 2015].

[2] X. Wan et al, “Predictive rule inference for epistatic interaction detection in genome-wide association studies,” in Bioinformatics, vol. 26 no. 1, Oxford University Press, Oct. 30, 2009. Pp. 30-37. Available: Bioinformatics, http://bioinformatics.oxfordjournals.org. [Accessed: Apr. 12, 2015].

[3] Lister Hill National Center for Biomedical Communications, “What are single nucleotide polymorphisms (SNPs)?” April 28, 2015. Available: Genetics Home Reference, http://ghr.nlm.nih.gov. [Accessed: Apr. 26, 2015].

[4] H. Cordell, “Epistasis: what it means, what it doesn’t mean, and statistical methods to detect it in humans,” in Human Molecular Genetics, vol. 11, no. 20, Oxford University Press, 2002. Pp. 2463–2468. Available: Oxford Journals, http://hmg.oxfordjournals.org. [Accessed: Apr. 2, 2015].

[5] R. Urbanowicz, “A classification and characterization of two-locus, pure, strict, epistatic models for simulation and detection,” BioData Mining 2014, 7:8, Lebanon, OH, June 9, 2014. Available: EBSCOhost, https://www.ebscohost.com. [Accessed: Apr. 2, 2015].

[6] X. Wan et al, “The complete compositional epistasis detection in genome-wide association studies,” in BMC Genetics, 14:7, BioMed Central, 2013. Available: EBSCOhost, https://www.ebscohost.com. [Accessed: Apr. 2, 2015].

[7] J. Shang et al, “Performance analysis of novel methods for detecting epistasis,” in BMC Bioinformatics, 12:475, BioMed Central, 2011. Available: BMC Bioinformatics, www.biomedcentral.com. [Accessed: Mar. 22, 2015].

[8] A. Niu et al, “A Novel Method to Detect Gene–Gene Interactions in Structured Populations: MDR-SP” in Annals of Human Genetics, 75, Houghton: Blackwell Publishing, 2011. Pp. 742–754. Available: EBSCOhost, https://www.ebscohost.com. [Accessed: Apr. 2, 2015].

[9] “Population stratification,” Wikipedia, 29 October 2014. Available: http://en.wikipedia.org. [Accessed: May 4, 2015].

[10] R. Upstill-Goddard et al, “Machine learning approaches for the discovery of gene-gene interactions in disease data” in Briefings in Bioinformatics, vol. 14 no. 2, Oxford University Press, 2012. Pp 251-260. Available: EBSCOhost, https://www.ebscohost.com. [Accessed: Apr. 2, 2015].

[11] L. Chen, “A Ground Truth Based Comparative Study on Detecting Epistatic SNPs,” in Proceedings (IEEE Int Conf Bioinformatics Biomed), Nov. 1, 2009. Pp. 26-31. Available: NCBI, http://www.ncbi.nlm.nih.gov. [Accessed: Apr. 2, 2015].

[12] X. Zhang et al, “TEAM: efficient two-locus epistasis tests in human genome-wide association study,” in ISMB 2010, vol. 26, Oxford University Press, 2010. pp i217–i227. Available: Bioinformatics, http://bioinformatics.oxfordjournals.org. [Accessed: Apr. 12, 2015].

[13] Y. Wang et al, “AntEpiSeeker: detecting epistatic interactions for case-control studies using a two-stage ant colony optimization algorithm,” in BMC Research Notes 2010, 3:17, BioMed Central, 2010. Available: BMC Research Notes, http://www.biomedcentral.com. [Accessed: Apr. 2, 2015].

[14] X. Wan et al, “MegaSNPHunter: a learning approach to detect disease predisposition SNPs and high level interactions in genome wide association study,” in BMC Bioinformatics, 2009, 10:13, BioMed Central, 2009. Available: EBSCOhost, https://www.ebscohost.com. [Accessed: Apr. 2, 2015].
