MACHINE LEARNING METHODOLOGIES IN THE DISCOVERY OF THE INTERACTION
BETWEEN GENES IN COMPLEX DISEASES
RICARDO JOSÉ MOREIRA PINHO
MASTER'S DISSERTATION PRESENTED TO THE FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO
M 2014
FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO
Machine Learning methodologies in the discovery of the interaction between
genes in complex diseases
Ricardo Pinho
DISSERTATION
Mestrado Integrado em Engenharia Informática e Computação
Supervisor: Rui Camacho
Co-Supervisor: Alexessander Alves (Imperial College London, UK)
July 2014
Machine Learning methodologies in the discovery of the interaction between genes in complex diseases
Ricardo Pinho
Mestrado Integrado em Engenharia Informática e Computação
Approved in oral examination by the committee:
Chair: Assistant Professor Ana Cristina Ramada Paiva
External Examiner: Assistant Researcher Sérgio Guilherme Aleixo de Matos (Instituto de Engenharia Electrónica e Telemática de Aveiro)
Supervisor: Associate Professor Rui Carlos Camacho de Sousa Ferreira da Silva
July 2014
Abstract
In recent years, there has been substantial research on gene-gene interactions to analyze how complex diseases are affected by the genome. Many Genome Wide Association Studies (GWAS) have been performed with interesting results. This new interest is due to the computing power that is available today. Machine Learning methodologies quickly became a successful tool to find previously unknown genetic relations. The popularity of this field increased greatly after discovering the potential value of gene-gene studies in detecting and understanding how phenotypes are expressed.
The information available in the DNA of the human genome can be divided into functional subgroups that code different phenotypes. These subgroups are the genes, which can have different presentations and still exhibit the same behaviour. However, if a certain position within a gene changes its behaviour, that position is called a Single Nucleotide Polymorphism (SNP). These SNPs interact with each other to affect how genes work, which in turn affects the phenotypes that are expressed.
The purpose of this dissertation is to increase the knowledge obtained from these studies by detecting more interactions related to the manifestation of complex diseases. This is achieved by testing algorithms in a comprehensive empirical study, and by adding a new and improved Ensemble approach that shows better results than the existing state-of-the-art algorithms.
To achieve this goal, there are two main stages. The first stage consists of a comparison study amongst the most recent statistical and Machine Learning methodologies, using simulated data sets containing generated epistatic interactions. The algorithms BEAM3.0, BOOST, MBMDR, Screen and Clean, SNPHarvester, SNPRuler, and TEAM were run on many different data set configurations. The results showed that, with the exception of Screen and Clean and MBMDR, all algorithms displayed good results in relation to Power, Type I Error Rate, and Scalability.
The second stage is the creation of a new algorithm based on the results obtained in the first stage. This new algorithm is an aggregation of the previously tested methodologies, of which 5 algorithms were chosen. This new Ensemble approach manages to maintain the Power of the best algorithm, while decreasing the Type I Error Rate.
Acknowledgements
First and foremost I would like to thank my Supervisor, Professor Rui Camacho, for the effort, patience and dedication to this project, which would have been impossible to accomplish without his support. I would also like to thank my co-supervisor Alexessander Alves, for saving me a lot of trouble and helping me understand the specifics of the project. Considering that this area is very new to me, his expertise was very much needed and appreciated.
I want to thank my family, especially my parents, for giving me the opportunity to learn and work on something that I love and for always believing in me.
Ricardo Pinho
"An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem."
John Wilder Tukey
Contents
1 Introduction
  1.1 Context
  1.2 Project
  1.3 Motivation and Goals
  1.4 Structure of the Report

2 State-of-the-Art
  2.1 Biological concepts
  2.2 Statistical and Machine Learning Algorithms
  2.3 Data analysis evaluation procedures and measures
  2.4 Data Simulation and Analysis Software
  2.5 Chapter Conclusions

3 A Comparative Study of Epistasis and Main Effect Analysis Algorithms
  3.1 Introduction
  3.2 Methods
    3.2.1 Algorithms for interaction analysis
  3.3 Simulation Design
  3.4 Chapter Conclusions

4 Ensemble Approach
  4.1 Introduction
  4.2 Experiments
  4.3 Chapter Conclusions

5 Conclusions
  5.1 Contribution summary

References

A Glossary
  A.1 Biology related terms
  A.2 Data mining terms
  A.3 Lab Notes
List of Figures
2.1 An illustration of the interior of a cell. [cel14]
2.2 Bayesian Network. Nodes represent SNPs. [JNBV11]
2.3 An example of a Neural Network. [HK06]
2.4 A logit transformation and a possible logistic regression function resulting from the logit transformation. [WFH11]
2.5 An example of a ROC curve.
2.6 The main stages of the KDD process. [FPSS96]
2.7 The CRISP-DM life cycle. [CCK00]
2.8 A diagram of the ATHENA software package [HDF+13].
2.9 The Weka Explorer interface.
3.1 Results of epistasis detection by population size.
3.2 Results of main effect detection by population size.
3.3 Results of full effect detection by population size.
3.4 Results of epistasis detection by minor allele frequency.
3.5 Results of main effect detection by minor allele frequency.
3.6 Results of full effect detection by minor allele frequency.
4.1 Results of epistasis detection by population size, with a 0.1 minor allele frequency, 2.0 odds ratio, and 0.02 prevalence.
4.2 Results of main effect detection by population size, with a 0.1 minor allele frequency, 2.0 odds ratio, and 0.02 prevalence.
4.3 Results of full effect detection by population size, with a 0.1 minor allele frequency, 2.0 odds ratio, and 0.02 prevalence.
4.4 Results of epistasis detection by minor allele frequency, with 2000 individuals, 2.0 odds ratio, and 0.02 prevalence.
4.5 Results of main effect detection by minor allele frequency, with 2000 individuals, 2.0 odds ratio, and 0.02 prevalence.
4.6 Results of full effect detection by minor allele frequency, with 2000 individuals, 2.0 odds ratio, and 0.02 prevalence.
4.7 Results of epistasis detection by odds ratio, with a 0.1 minor allele frequency, 2000 individuals, and 0.02 prevalence.
4.8 Results of main effect detection by odds ratio, with a 0.1 minor allele frequency, 2000 individuals, and 0.02 prevalence.
4.9 Results of full effect detection by odds ratio, with a 0.1 minor allele frequency, 2000 individuals, and 0.02 prevalence.
4.10 Results of epistasis detection by prevalence, with a 0.1 minor allele frequency, 2000 individuals, and 2.0 odds ratio.
4.11 Results of main effect detection by prevalence, with a 0.1 minor allele frequency, 2000 individuals, and 2.0 odds ratio.
4.12 Results of full effect detection by prevalence, with a 0.1 minor allele frequency, 2000 individuals, and 2.0 odds ratio.
List of Tables
2.1 An example of a penetrance table.
2.2 A description of each data selection algorithm.
2.3 A description of each model creation algorithm designed for this problem.
2.4 A description of each generic model creation algorithm.
2.5 A description of auxiliary algorithms used in model creation or data selection.
2.6 A description of data analysis measures and how they are calculated.
2.7 A description of data analysis procedures.
2.8 A comparison between the different procedures [AS08].
2.9 A comparison of the most relevant features of data simulation tools.
2.10 A comparison of data mining tools.
2.11 Similarities and differences between BEAM3, BOOST, MBMDR, Screen & Clean, SNPHarvester, SNPRuler, and TEAM.
3.1 The values of each parameter used. Each configuration has a unique set of the parameters used.
3.2 Scalability test containing the average running time, CPU usage, and memory usage by data set population size.
4.1 Scalability test containing the average running time, CPU usage, and memory usage by data set population size for epistasis detection.
4.2 Scalability test containing the average running time, CPU usage, and memory usage by data set population size for main effect detection.
4.3 Scalability test containing the average running time, CPU usage, and memory usage by data set population size for full effect detection.
Abbreviations
A         Adenine
ACO       Ant Colony Optimization
ALEPH     A Learning Engine for Proposing Hypotheses
API       Application Programming Interface
ATHENA    Analysis Tool for Heritable and Environmental Network Associations
AUC       Area Under the Curve
BDA       Backward Dropping Algorithm
BEAM      Bayesian Epistasis Association Mapping
BOOST     Boolean Operation-based Screening and Testing
C         Cytosine
CRISP-DM  Cross Industry Standard Process for Data Mining
DAG       Directed Acyclic Graph
DM        Data Mining
FOL       First Order Logic
G         Guanine
GENN      Grammatical Evolution Neural Networks
GPNN      Genetically Programmed Neural Networks
GPAS      Genetic Programming for Association Studies
GUI       Graphical User Interface
GWAS      Genome Wide Association Study
IDE       Integrated Development Environment
IIM       Information Interaction Method
ILP       Inductive Logic Programming
K-NN      K-Nearest Neighbor
KEEL      Knowledge Extraction Evolutionary Learning
KDD       Knowledge Discovery in Databases
KNIME     Konstanz Information Miner
MAGENTA   Meta-Analysis Gene-set Enrichment of variaNT Associations
MCMC      Markov Chain Monte Carlo
MDR       Multifactor Dimensionality Reduction
MDS       MultiDimensional Scaling
ML        Machine Learning
NB        Naïve Bayes
NN        Neural Network
OS        Operating System
PDIS      Dissertation Planning
PMML      Predictive Model Markup Language
ROC       Receiver Operating Characteristic
S&C       Screen & Clean
SAS       Statistical Analysis System
SEMMA     Sample, Explore, Modify, Model and Assess
SNP       Single Nucleotide Polymorphism
SVM       Support Vector Machine
T         Thymine
TEAM      Tree-based Epistasis Association Mapping
Chapter 1
Introduction
This chapter introduces the context by discussing the evolution of epistasis research. The next section contains a discussion of the project and the contribution of the dissertation. It is followed by the motivation and goals for this work. The last section describes the structure of the document.
1.1 Context
Epistasis is the interaction between genes that work together to affect the manifestation of a complex disease. The study of epistasis to determine the expression of phenotypes has long been known to yield interesting results, since most phenotypes cannot be explained by simple correlations with Single Nucleotide Polymorphisms (SNPs) or even with a specific gene. Only recently, however, have technological advances such as increased computer processing power allowed whole genome studies. These studies quickly became a popular tool to discover genetic patterns in the manifestation of several phenotypes, including certain SNP configurations associated with a high risk of developing complex diseases. This allowed a better understanding of many complex diseases that were otherwise undetectable until the manifestation of symptoms.
From studies of allergic sensitization [BMP+13] and obesity predisposition [FTW+07] to diabetes [SGM+10] and breast cancer [RHR+01], many associations between genes and the expression of complex diseases have been successfully identified.
Considering that these advances in Genome Wide Association Studies (GWAS) are very recent, there is not yet a well-established method to test and find significant results. Therefore, many statistical and machine learning approaches have since been developed.
The study of epistatic interactions is a high dimensionality problem: there are millions of possible combinations of SNPs, and each interaction can involve a varying number of SNPs. Because of this, the correct identification of interactions becomes a problem, not only due to outliers and noise, but also because of the many possible configurations. Another issue related to this complexity is the identification of the correlation between interactions and the actual manifestation of the phenotype in question. There is also an error associated with every data mining problem, which in this case can be explained by mutations or ambiguity in SNPs.
Very recently, many different algorithms have been proposed to tackle the problems inherent to GWAS. Recent machine learning algorithms have addressed them by simplifying the problem, reducing its inherent dimensionality.
1.2 Project
The project of this dissertation consists of two empirical studies. Initially, there is a review of the literature to identify a range of different algorithms that are likely to produce good results. Artificial data sets are then created to test these algorithms. Based on the results obtained, a new empirical study is made with a newly created algorithm, aiming to find an approach that may obtain better results than the existing algorithms. This new methodology is a combination of the state-of-the-art algorithms.
Dissertation Contribution
This dissertation contains empirical studies of the state-of-the-art algorithms, which enable a broad analysis of the factors affecting the performance of these algorithms using relevant evaluation measures. Based on this information, new studies can have a better understanding of what to expect from each method. With the introduction of new algorithms, this dissertation may produce methodologies that better suit the needs of the domain problem.
1.3 Motivation and Goals
Genome wide association studies have made a big impact on SNP identification and analysis in recent years. They allowed the discovery of how genes interact with each other and how each gene affects phenotypes. By mapping epistatic interactions and gene behavior, it is possible to find risk factors associated with complex diseases. These risk factors can be identified by certain SNP configurations or genotypes.
From a machine learning standpoint, there is a lot of room for improvement. Considering that this problem has only recently started to be studied, using very different methodologies, it is still not solved as efficiently or as accurately as it could be. With better adapted and more efficient algorithms, the relevance of GWAS can increase considerably. Due to the inherent dimensional complexity, scalability is a very important requirement for the developed methodologies. Algorithms used in typical prediction problems, such as classification and regression, now need to be adapted to fit the requirements of this specific problem, which does not fall within the classical convention of prediction problems. This requires an adaptation of the data and of the output, generating results that are understandable by specialists in the genetics field.
The main goal of epistatic studies is to find SNPs responsible for the expression of phenotypes,
which in this case are related to complex diseases. This means that the loci and the alleles that are
active in complex diseases and contribute to their manifestation need to be identified. Their behavior and the interactions relevant to the disease need to be monitored and assessed. This information provides a better understanding of the disease in question and can be used in a medical scenario to mark specific genotypes that have a high probability of manifesting genetically related complex diseases, which can then be preemptively monitored and treated.
1.4 Structure of the Report
The rest of this report is divided into three main chapters: state-of-the-art, work planning and
conclusions.
Chapter 2 begins with a brief introduction to the topic, followed by the relevant biological concepts in Section 2.1. Section 2.2 consists of a description of the state-of-the-art machine learning and statistical algorithms that have been used in data selection, model creation, and other auxiliary tasks. Section 2.3 contains evaluation measures and procedures that are relevant to estimating and optimizing the data mining process. There is also a description of the many data mining tools in Section 2.4, including software tools specifically designed for epistasis detection analysis and data simulation software. Section 2.5 contains the conclusions extracted from the study of the existing algorithms, tools, evaluation measures, and procedures.

Chapter 3 is composed of an introduction to stage I of the experiments (Section 3.1), a description of the data and methodologies used in this project (Section 3.2), the experimental procedure and results (Section 3.3), and finally, the conclusions of the chapter (Section 3.4).

In Chapter 4, Section 4.1 contains the introduction to stage II of the experiments. Section 4.2 presents the experimental procedure, results, and discussion of the stage II experiments. Section 4.3 contains the final conclusions of the chapter.

Chapter 5 contains a brief summary with the final conclusions from the empirical studies, and a summary of the contributions of this dissertation in Section 5.1.
Chapter 2
State-of-the-Art
Over the last 5 years, many methodologies have surfaced to find a solution for genome wide studies. With the development of computing power, algorithms that were previously infeasible are now valid options for the identification of diseases related to SNPs. Most of these algorithms are based on well-known data mining approaches like prediction, clustering, and rule-based algorithms. Some software tools were also developed specifically for this purpose.
In this chapter we first introduce the basic biological background required to understand the main issues and specifications of the dissertation. The concepts of epistasis and Genome Wide Association Studies are introduced in that section.
Data Mining algorithms are introduced in Section 2.2. These include data selection algorithms, model creation algorithms specifically designed for epistasis studies, and generic model creation algorithms related to specific algorithm implementations. Some auxiliary algorithms used in model creation are also included in this section.
In Section 2.3 the most relevant procedures and measures for evaluating the state-of-the-art methodologies are discussed. These include relevant evaluation metrics for model testing and machine learning approaches to provide better results for the generated solutions. That section also contains a description of the most commonly used data mining methodologies.
Section 2.4 describes data mining software tools, some of which are specifically designed for this problem. It also covers the data simulation software used for the artificial generation of data sets with epistasis and main effects.
2.1 Biological concepts
Basic concepts
Most human beings have 46 chromosomes within the nucleus of each cell. These chromosomes are divided into 23 groups, with every group having a pair of similar chromosomes. Each chromosome is composed of a very large double helix structure: the deoxyribonucleic acid (DNA). DNA is subdivided into sections called genes, which are regions that code for proteins. Figure 2.1 illustrates these structures.
A gene is composed of several nucleotide bases. These bases can be either adenine, cytosine, guanine, or thymine. Each base binds to a complementary base. This means that at each position of a nucleotide base, also known as a locus, there is a pairing of adenine-thymine or cytosine-guanine. Each variation of the nucleotide bases is called an allele. Allele and locus can also refer to a variation and a position of a whole gene, rather than a single nucleotide base. The pair of alleles at the same locus, one on each chromosome of a pair, is called the genotype. Considering that there are usually two alleles for each locus, there is usually a dominant allele and a recessive allele, which means that one allele is expressed more often than the other. The expression of a physical trait or the creation of a protein is called a phenotype. For each locus, there are usually three different genotype configurations: two dominant alleles, two recessive alleles (having two equal alleles in a genotype is called homozygous), or a dominant and a recessive allele (having different alleles in a genotype is called heterozygous). A recessive allele is only expressed in a homozygous recessive genotype.
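The genotype concepts above can be made concrete with a small sketch (the encoding and function names are illustrative, not from the thesis): genotypes at a biallelic locus are commonly coded as the number of copies of the minor allele.

```python
# Illustrative sketch: 'A' is the dominant allele, 'a' the recessive one.
# The usual GWAS coding counts copies of the minor allele: 0, 1, or 2.

def encode_genotype(allele1: str, allele2: str, minor: str = "a") -> int:
    """Count copies of the minor allele in a genotype."""
    return (allele1 == minor) + (allele2 == minor)

def is_homozygous(allele1: str, allele2: str) -> bool:
    """A genotype with two equal alleles is homozygous."""
    return allele1 == allele2

def expresses_recessive(allele1: str, allele2: str, minor: str = "a") -> bool:
    """The recessive phenotype requires the homozygous recessive genotype."""
    return allele1 == minor and allele2 == minor

print(encode_genotype("A", "a"))      # 1 (heterozygous)
print(is_homozygous("a", "a"))        # True
print(expresses_recessive("A", "a"))  # False: the dominant allele masks it
```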
Figure 2.1: An illustration of the interior of a cell. [cel14]
Single Nucleotide Polymorphisms
A single nucleotide polymorphism (SNP) is a specific nucleotide base, within a gene, that changes what the gene expresses. This means that a different allele at the SNP will create a different variant of the gene that will express a different protein or physical trait. Not all nucleotide bases in a gene are relevant to this process; in this context, SNPs are the most important part of the gene.
Genetic Markers
Genetic markers are specific sets of SNPs, genes, or DNA sequences that are used to identify specific traits, individuals, or species. In our study, genetic markers are used to identify specific traits within complex diseases. In recent years, genetic markers have no longer been limited to genes that encode visible characteristics. Due to the genetic mapping of the human genome, patterns of SNPs can be related to traits without directly encoding a specific characteristic, including dominant/recessive or co-dominant markers [Avi94].
Main Effect
In this context, the main effect is related to the influence of a SNP on the expression of a phenotype,
in this case a complex disease. Any SNP has a main effect if it has a direct impact on the disease
expression. Multiple SNPs can have a main effect on the same phenotype expression, without
having a relation between them.
Epistasis
The concept of epistasis was first described by Bateson [Mud09] as the control of the manifestation of the effect of one allele at one locus by another allele at another locus [Cor02]. This definition has since changed its meaning and subdivided into different, often conflicting definitions. According to Phillips [Phi08], there are three major categories into which the term “epistasis” can be subdivided: Functional Epistasis, Compositional Epistasis, and Statistical Epistasis.
Functional epistasis refers to the functional applications of the molecular interactions between
genes. The focus is on the proteins that are created by these interactions and on their effects.
Compositional epistasis is used to describe the traditional usage of the term "epistasis". It describes the interaction between two loci, with specific alleles. This interaction affects the phenotype expression.
Statistical epistasis describes the average deviation of the effect resulting from the interaction of a set of alleles at different loci from the effect of those alleles considered independently [Fis19]. It is an additive expectation of the epistatic effect on the allelic function.
Genome Wide Association Studies
The search for genetic markers has helped to determine previously unknown aspects of complex diseases. Previous studies focused on single-locus analysis and provided underwhelming results [Cor09]. By changing this approach to include complex relations between genes in the effect on phenotypes, new information about biological and biochemical pathways has surfaced, and these studies have since become a powerful tool in understanding the diseases.
Penetrance Tables
Diseases can affect different proportions of the individuals in a population who carry the disease-related genetic markers. This proportion is the penetrance of a disease [FES+98]. With a high disease penetrance, most of the individuals with a given disease-associated genetic marker will manifest that disease. Penetrance is estimated from the disease allele frequency and the affected individuals: it is the proportion of individuals with the disease-associated SNP, among the population, that develop the disease. From these results, a table can be created showing the percentage of individuals affected by a disease, given their genotypes. Each genotypic configuration has an associated penetrance value. Table 2.1 shows an example of a penetrance table.
Genotype Configuration    Penetrance
AABB                      0.068
AABb                      0.064
AAbb                      0.040
AaBB                      0.055
AaBb                      0.047
Aabb                      0.103
aaBB                      0.039
aaBb                      0.103
aabb                      0.004

Table 2.1: An example of a penetrance table.
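Table 2.1 can be read as a lookup from two-locus genotype to P(disease | genotype). The sketch below (helper names are mine, not from the thesis) uses the table's own values and shows how penetrance combines with genotype frequencies into an overall disease prevalence, here under a hypothetical uniform genotype distribution.

```python
# The penetrance table of Table 2.1 as a simple lookup:
# each two-locus genotype maps to P(disease | genotype).
PENETRANCE = {
    "AABB": 0.068, "AABb": 0.064, "AAbb": 0.040,
    "AaBB": 0.055, "AaBb": 0.047, "Aabb": 0.103,
    "aaBB": 0.039, "aaBb": 0.103, "aabb": 0.004,
}

def disease_prevalence(genotype_freqs: dict) -> float:
    """Overall prevalence = sum over genotypes of P(genotype) * penetrance."""
    return sum(freq * PENETRANCE[g] for g, freq in genotype_freqs.items())

# Hypothetical population where every genotype is equally frequent:
uniform = {g: 1 / len(PENETRANCE) for g in PENETRANCE}
print(round(disease_prevalence(uniform), 4))  # 0.0581
```

In a real study the genotype frequencies would come from the population's allele frequencies rather than a uniform distribution.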
2.2 Statistical and Machine Learning Algorithms
The disciplines of Statistics and Machine Learning (ML) have been studying these problems for the last few years. We now survey a set of algorithms from both Statistics and ML applied to epistasis problems.
Feature Selection Algorithms
Ant Colony Optimization
Ant Colony Optimization (ACO) [Dor92] is a search wrapper that exploits the same mechanism ant colonies use to find shortest paths. This means that it uses a particular classifier to score subsets of variables based on their relation to the class variable. Essentially, it transforms the optimization problem into a problem of finding the best path on a weighted graph [Ste12]. In this context, it uses expert knowledge, encoded in the "pheromones", to select SNPs with better expert knowledge scores, calculated using fitness functions [GWM08]. ACO is thus used to filter out SNPs by randomly searching for SNPs and choosing only the ones that are the most relevant to the phenotype.
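The pheromone mechanism can be illustrated with a toy sketch (the names and update rule are simplifications of my own, not the cited ACO variant): SNP indices are sampled in proportion to their pheromone, and the trails of well-scoring subsets are reinforced after evaporation.

```python
import random

# Toy pheromone-based subset selection, illustrative only.

def pick_subset(pheromone: list, k: int, rng: random.Random) -> list:
    """Roulette-wheel sampling of k distinct SNP indices, biased by pheromone."""
    chosen = []
    weights = pheromone[:]
    for _ in range(k):
        total = sum(weights)
        r, acc = rng.random() * total, 0.0
        for i, w in enumerate(weights):
            acc += w
            if acc >= r:
                chosen.append(i)
                weights[i] = 0.0  # no repeats within a subset
                break
    return chosen

def reinforce(pheromone: list, subset: list, score: float, evap: float = 0.1):
    """Evaporate all trails, then deposit pheromone proportional to the score."""
    for i in range(len(pheromone)):
        pheromone[i] *= (1 - evap)
    for i in subset:
        pheromone[i] += score

rng = random.Random(0)
pheromone = [1.0] * 5
reinforce(pheromone, pick_subset(pheromone, 2, rng), score=0.8)
print(max(pheromone) > min(pheromone))  # True: reinforced SNPs stand out
```

Over many iterations, SNPs that repeatedly appear in high-scoring subsets accumulate pheromone and are sampled more often, which is the filtering behavior described above.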
Classification Trees
Classification trees can be used as a feature selection algorithm by creating an upside-down tree [ZB00] where each node represents a test on an attribute or a set of attributes, and each edge represents a possible outcome value of the test in the "parent" node [NKFS01]. This representation of a tree emphasizes the connections between attributes and therefore it can capture possible relations between a disease and a combination of attributes [CKS04]. However, since it can only use selections of attributes that somehow connect to each other, it skips attributes that might have a pure interaction with the disease [Cor09].
ReliefF
ReliefF [RSK03] and its modified version Tuned ReliefF (TuRF) [MW07] are filtering algorithms. The basic idea of ReliefF is to estimate the quality of an attribute by how well its genotype values separate instances in a neighborhood. If a neighbor with the same class label has a different value for the attribute, the attribute separates two instances of the same class, which lowers its quality estimate. Conversely, if a neighbor with a different class has the same attribute value, the quality is also lowered, but if that neighbor has a different attribute value, the quality increases. ReliefF can deal with incomplete and noisy data: it searches for k neighbors with the same class and k neighbors with different classes. Tuned ReliefF improves on the original algorithm by removing the worst attributes and recalculating the attribute estimates at each step.
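The weighting scheme described above can be sketched as a simplified Relief (one nearest hit and one nearest miss per instance, binary class, no missing-value handling); this is an illustrative reduction of my own, not the full ReliefF algorithm.

```python
import numpy as np

def relief(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Simplified Relief: weight attributes by nearest-hit/nearest-miss differences."""
    n, m = X.shape
    w = np.zeros(m)
    for i in range(n):
        dist = np.abs(X - X[i]).sum(axis=1)
        dist[i] = np.inf
        same = (y == y[i])
        same[i] = False
        hit = np.argmin(np.where(same, dist, np.inf))    # nearest same-class instance
        miss = np.argmin(np.where(~same, dist, np.inf))  # nearest other-class instance
        # attributes differing on the nearest hit lose weight,
        # attributes differing on the nearest miss gain weight
        w -= np.abs(X[i] - X[hit]) / n
        w += np.abs(X[i] - X[miss]) / n
    return w

# Toy SNP data (genotypes coded 0/1/2): the first column determines the class,
# the second is noise, so it should receive the higher weight.
X = np.array([[0, 1], [0, 2], [2, 1], [2, 0]], dtype=float)
y = np.array([0, 0, 1, 1])
w = relief(X, y)
print(w[0] > w[1])  # True
```

Full ReliefF extends this by averaging over k nearest hits and misses per class, which is what gives it robustness to noise.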
Evaporative Cooling
Based on the ReliefF [KR92] algorithm, evaporative cooling removes ΔN of the least informative attributes using classification accuracy [MRW+07][MCGT09]. The energy in the system is given by:

    ⟨ε⟩_N / ⟨ε⟩_{N0} = (N / N0)^η                    (2.1)

where ⟨ε⟩ is the average “energy” density of the system, N0 is the number of attributes before the evaporation step, and η is an adjustable parameter related to the evaporation rate, allowing for a slow evaporation for higher values and a fast evaporation for lower values, which can lead to a collection of suboptimal attributes. Evaporative Cooling is often used as a wrapper filter for attribute selection. This means that the variables are scored based on their predictive power.
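Equation 2.1 is easy to evaluate numerically; the helper below (names are mine, not from the cited work) shows the energy ratio for a fixed fraction of remaining attributes under two values of η.

```python
# Numeric illustration of Equation 2.1: the ratio of average "energy"
# densities after evaporating attributes follows (N / N0) ** eta.

def energy_ratio(n_remaining: int, n_initial: int, eta: float) -> float:
    """Compute <e>_N / <e>_N0 = (N / N0) ** eta from Equation 2.1."""
    return (n_remaining / n_initial) ** eta

# Keeping 80 of 100 attributes, for two settings of eta:
print(round(energy_ratio(80, 100, eta=0.5), 3))  # 0.894
print(round(energy_ratio(80, 100, eta=2.0), 3))  # 0.64
```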
Genetic Programming for Association Studies
Genetic Programming for Association Studies (GPAS) works by searching variables and mapping them to new boolean variables. Starting from randomly chosen individuals, each consisting of one randomly selected literal, a form of genetic algorithm is applied that uses a fitness function to score the current generation of the population [Nun08]. A new, customized form of GPAS that detects interactions involving a higher number of SNPs has since been developed [NBS+07].
Random Forest
The purpose of random forest in this context is to select, according to the various trees formed by the algorithm, the main attributes that are important to a disease. Random forest is a fast algorithm that can be run in parallel, with many customizable parameters, such as the number of trees, the number of instances to be used at each split, and the number of permutations used to assess variable importance [Bre01]. Random forest is an ensemble algorithm that creates several bootstrap samples from a data set, each with the same size as the original sample. For each bootstrap sample, a tree is grown, considering only a small random set of attributes at each node [LHSV04]. Instances that were not used in the training phase are then used to estimate the prediction error. By using a random forest instead of a single tree, there is an improvement in the classification accuracy [BDF+05].

Random Jungle (RJ) [SKZ10] is an improved approach to random forest that uses parallel processing, making it faster and more viable on larger data sets. Even without parallel processing, RJ is faster and uses less memory than the standard random forest, implementing backward variable elimination.
Summary Table
Table 2.2 shows the summary of all the data selections algorithms discussed.
Algorithms DescriptionACO Search based on the behavior of ants. Optimization problem be-
comes a problem of finding the best path based on positive feedback.Class. Trees Construction of trees that split nodes based on rules that represent a
good division in the outcome variable.Rand. Forest Ensemble approach to Classification Trees.ReliefF Calculates the value of each attribute based on its value in neighbor-
hood individuals that have the same outcome variable.Tuned ReliefF Modified version of ReliefF that removes the worst attributes and
recalculates the weight in remaining attributes.Evap. Cooling Based on ReliefF, removes the least informative attributes with an
adjustable parameter related to the evaporation rate.GPAS Searches and maps attributes from random individuals to boolean
variables and uses a fitness function to evaluate them.Table 2.2: A description of each data selection algorithm.
Specific Model Creation Algorithms
Backward Dropping Algorithm
Backward Dropping Algorithm (BDA) tries to find the subset of attributes that have the biggest
impact on the class variable [WLZH12]. The class is assumed to be binary and all other attributes
are assumed to be discrete. The explanatory variables are segregated into partitions of subsets,
which are then used to calculate I-score as:
I = ∑j∈Pk
n2j (Yj− Y )2 (2.2)
where P is the partition selected with k variables, n is the number of observations, Yj and Y is the
average of Y observations in the j partition and overall average respectively. In the training set,
a large group of explanatory variables are selected to be sampled into subsets. After computing
10
State-of-the-Art
the I-score, the variable which contributes less to the I-score is dropped. In each round another
variable is dropped until there is only one variable left. The subset which has the highest I-score in
the whole dropping process is returned. This subset represents the set of variables than contribute
the most to a positive state of Y , the response variable.
Bayesian Epistasis Association Mapping
Bayesian Epistasis Association Mapping (BEAM) receives genotype markers as input and deter-
mines the probability of each marker being associated with the disease, through a Markov Chain
Monte Carlo (MCMC), independently or in epistasis with another marker, and creates partitions
with those markers [ZL07]. It classifies these markers into three categories: SNPs unassociated
with the disease, SNPs associated with the disease independently and SNPs jointly associated with
the disease in epistasis [WYY12]. A B statistic was developed to show the statistic relevance of
the associations made with the disease. BEAM searches for epistasis with interactions of 3 or 2
SNPs. This is a hypothesis-testing procedure, testing each marker for significant interactions. The
B statistic is defined by:
BM = lnPA(DM,UM)
P0(DM,UM)= ln
Pjoin(DM)[Pind(UM)+Pjoin(UM)]
Pind(DM,UM)+Pjoin(DM,UM)(2.3)
where M represents each set of k markers, representing different complexities of interactions. DM
and UM are genotype data from M cases and controls and P0(DM,UM) and PA(DM,UM) are the
Bayes factors. Pind is the distribution that assumes independence among markers in M and Pjoin is
a saturated joint distribution of genotype combinations among all markers in M.
BEAM3.0 is the third iteration of the BEAM algorithm and introduces multi-SNP associations and
high-order interactions flexibility, using graphs, reducing the complexity and increasing the power.
BEAM3 produces cleaner results with improved mapping sensitivity and specificity [ZL07]. The
algorithm is written in C++.
BNMBL
BNMBL is a Bayesian Network that assumes SNPs can either be Adenine(A) and Guanine(G) or
Cytosine(C) and Thymine (T), depending on the nucleotide base, and therefore can only assume
three possible values in the genotype: AA, GG or AG, because A is the same as T, and C is
the same as G, in this context. A directed acyclic graph (DAG) model is created for each data
item to assign a probability of the relationships between SNP. Figure 2.2 shows an example of
a probabilistic model of the relationship between SNPs and the disease D. Using only 12 log2m
bits for each conditional probability, where m is the number of data items, the penalty calculated
in Equation 2.4 is applied in the scoring phase to each DAG, where k is the number of SNPs
[JNBV11].3k
2log2
m3k +
2k2
log2m (2.4)
11
State-of-the-Art
D
S1 S2 S3 S4
S6 S7 S8
Figure 2.2: Bayesian Network. Nodes represent SNPs. [JNBV11]
Boolean Operation-based Screening and Testing
Boolean Operation-based Screening and Testing (BOOST) converts the data representation into a
boolean type, using logic operators [Weg60], which allows faster operations and a smaller usage of
memory [WYY+10a]. The algorithm uses a pruning approach by removing interactions which are
statistically irrelevant. The ratio at which the pruning occurs is based on the difference between
the full logistic regression model:
logP(Y = 1|Xl1 = i,Xl2 = j)P(Y = 2|Xl1 = i,Xl2 = j)
= β0 +βXl1i +β
Xl2i +β
Xl1 Xl2i j (2.5)
and the main logistic regression model:
logP(Y = 1|Xl1 = i,Xl2 = j)P(Y = 2|Xl1 = i,Xl2 = j)
= β0 +βXl1i +β
Xl2i +β
Xl1 Xl2i j (2.6)
where Xl1 and Xl2 are genotype variables and i and j are one of the three possible states (0,1,2)
[WLFW11]. The algorithm is written in C. A GPU version of the algorithm was developed
(GBOOST) [YYWY11] providing a 40-fold speedup compared to that of BOOST running in a
CPU.
Grammatically Evolution Neural Networks
Based on neural networks, Grammatically Evolution Neural Networks (GENN) uses instructions
and a fitness function to train for classification problems related to genetics [TDR10a]. Based on
genetic algorithms, the populations within the data are heterogeneous and go through a process of
pairing, crossover and mutation to find the best Neural Network (NN) solution, which translates to
finding influential SNPs and correctly evaluating network weights. As the name suggests, linear
genomes and grammars are used to define the population. Grammar is used to increase diversity,
by separating the genotype from the phenotype [TDR10b]. GENN uses a Genetically Programmed
Neural Networks (GPNN) approach to optimize the NN selection, which is an improvement on
the NN architecture using genetic programming. This means that there are binary expression trees
that are evolved in a tree-like structure, fitting into the NN architecture.
12
State-of-the-Art
Information Interaction Method
Information Interaction Method (IIM) is an exhaustive algorithm that searches for all possible
pairs of SNPs to find relations between them and the expression of the phenotype [OSL13]. If
the synergy between the pair and the phenotype is above a user-defined threshold, then there is a
possible correlation between the pair and the phenotype. This is revealed by the Equation 2.7.
I(A;B;Y ) = I(A;B|Y )− I(A;B)
= I(A;Y |B)− I(A;Y )
= I(B;Y |A)− I(B;Y )
(2.7)
Associations between single SNPs and a given phenotype are also tested by applying a mutual
information method, explained in Equation 2.8.
I(X ;Y ) = H(X)−H(X |Y )= H(Y )−H(Y |X)
= H(X)+H(Y )−H(H,Y )
= H(X ,Y )−H(X |Y )−H(Y |X)
(2.8)
Meta-Analysis Gene-set Enrichment of variaNT Associations
Meta-Analysis Gene-set Enrichment of variaNT Associations (MAGENTA) consists of four steps:
mapping SNPs to genes, assigning a score to each gene association, applying a correction of am-
biguous gene association scores and finally a statistical test is made to find predefined biologically
relevant gene sets in the association scores, compared to randomly sampled gene sets of identical
size [SGM+10]. Instead of receiving genotype data, MAGENTA receives p-values of SNPs as an
input. The gene association score is done based on regional SNP p-values.
Multifactor Dimensionality Reduction
Multifactor Dimensionality Reduction (MDR) is one of the most popular methods for the detection
of interactions between SNPs. MDR receives two parameters: the N number of attributes with
the strongest connection to the disease to be selected and the T threshold ratio for affected to
unaffected individuals to distinguish high risk from low risk genotype combinations [HRM03].
MDR uses cross validation and in the training data set of each fold determines the high/low risk
groups. After calculating the misclassification error using the test data, the resulting prediction
error rate is the average of all the folds. In the end, the n-order combination with the minimum
average prediction error and the maximum cross-validation accuracy from all the dimensions is
selected [CLEP07]. The odds ratio for the best combinations is used to generate bootstrap data.
After calculating the odds ratio for the best combination, the confidence intervals are constructed
by using empirical distribution of the odds ratio [Moo04]. The combinations of high risk loci are
the ones that have a stronger connection in the disease outcome [RHR+01].
13
State-of-the-Art
Model-Based Multifactor Dimensionality Reduction
Model-Based MDR (MB-MDR)[MVV11] tries to overcome many of the drawbacks from the
original algorithm MDR such as missing important interactions due to sampling too many cells
together and only analysing at most one significant epistasis model. MB-MDR merges multi-locus
genotypes that have a significant high or low risk based on testing, rather than a threshold value.
Unassociated loci are placed in a ’No Evidence for risk’ class. This algorithm uses a significance
assessment, correcting type I errors, and evaluating each SNP with a Walt statistic test [MVV11].
The algorithm is written in R. MB-MDR process can be divided into the following steps:
1. Multi-locus cell prioritization - Each two-locus genotype is assigned to either High risk,
Low risk or No Evidence of risk categories.
2. Association test on lower-dimensional construct - The result of the first step creates a new
variable with a value correlated to one of the categories. This new variable is then compared
with the original label to find the weight of high and low risk genotype cells.
3. Significance assessment - This stage tries to correct the inflation of type I errors after the
combination of cells into the weight of High risk and Low risk.
MB-MDR can also be adjusted to consider main effects within interactions.
Screen and Clean
Screen and clean (S&C) [WDR+10] is a recent algorithm divided into two main parts: screening
part and cleaning part. The algorithm creates a dictionary with all SNPs and splits the data into
stage 1 data for screening and stage 2 data for cleaning. In stage 1, the data is screened using the
logistic regression model in Equation 2.9 to find SNPs.
g(E [Y |X ]) = β0 +N
∑j=1
β jX j (2.9)
where X j is the encoded genotype value 0, 1 or 2 and Y is the encoded phenotype, 0 or 1, N is
the number of measured SNPs, g is an appropriate link function, and S ={
j : β j 6= 0, j ∈ 1, ...,N}
are the set of terms associated with the phenotype as main effect [WLFW11]. According to the
selected SNPs, Screen and Clean tries to find relevant interacting SNPs that fit into the following
interaction model:
g(E [Y |X ]) = β0 +N
∑j=1
β jX j + ∑i< j;i, j=1,...,N
βi jXiX j (2.10)
where S = {(i, j) : β i j 6= 0,(i, j) ∈ 1, ...,N} are the set of terms associated with the phenotype as
epistasis. In stage 2, clean, controls false positives by using the stage 2 data and removing SNPs
with p-values higher than a predetermined threshold (α). The algorithm is written in R.
14
State-of-the-Art
SNPHarvester
SNPHarvester is a stochastic search algorithm. It divides the SNPs into three different categories:
unrelated to the disease, related independently and contributes jointly to the disease with no effect
independently. SNPHarvester is based on a multiple path generation with a generic score function
[YHW+09]. The first point in each path is generated randomly. Using a created local search algo-
rithm, SNPHarvester finds the local optimum, usually in two or three iterations, and the significant
groups of SNPs in each path, according to a scoring function. This function is a popular χ2 value
score function [RHR+01]. After the scoring, randomly picks k SNPs to form an active set, leaving
the rest as a candidate set. Each SNP in the active set is then substituted with one from the can-
didate sets in order to maximize χ2. After finding the maximized candidate, removes the selected
SNP group and repeats the procedure to identify M groups which is a predetermined parameter.
The selected M groups are then fitted into the L2 penalized logistic regression model
L(β0,β ,λ ) =−l(β0,β )+λ
2‖β‖ (2.11)
where l(β0,β ) is the binomial log-likelihood and λ is a regularization parameter [WLFW11].
SNPHarvester is written in Java.
SNPRuler
SNPRuler is a rule-based algorithm. Epistatic interactions promote a set of rules. These rules are
implications of an interaction between SNPs and the disease. To find the rules, SNPRuler uses
trees that represent genotypes in each node, with the leaves representing the phenotypes, creating
a path of epistatic interaction. For each rule, a 3x3 table is generated based on the probability of
each possible genotype combination and phenotype [WYY+10b]. In a big number of SNPs, there
is an upper bound limit to the tree, pruning it instead of an exhaustive search. This threshold is a χ2
test statistic [RHR+01]. However, this pruning can lead to a wrongful prune of many true-positive
epistatic interactions. This algorithm was developed in Java.
TEAM
Tree-based Epistasis Association Mapping (TEAM) is essentially an exhaustive algorithm. TEAM
uses a permutation test to create a contingency table with all the calculated p-values [ZHZW10].
To reduce computation costs, if there are two SNPs with very frequent genotype values, then it
is shared for each individual with the same genotype. TEAM only works with two-loci interac-
tions. It uses a tree-based representation, where nodes contain SNPs 1,2,3 and the edges represent
the number of individuals with different values on the two SNPs [WLFW11], further reducing
computation costs when the values are the same. This algorithm is written in C++.
Summary Table
Table 2.3 shows a summary of all model creation algorithms previously discussed.
15
State-of-the-Art
Algorithms DescriptionBDA Uses an iterative selection process where the most significant SNPs to the
disease are selected.BEAM Determines the probability of a given SNP to be associated with a disease
independently, or in epistasis with N SNPs.BNMBL Creates a DAG with the probabilistic model of the relationship between
SNPs and the disease.BOOST Converts data into a boolean type, pruning statistically irrelevant SNPs.GENN Creates NNs which are evolved, based on a genetic algorithm approach,
find the NN with the best accuracy.IIM Searches all possible pairs of SNPs, finding a relation between each pair
and the phenotype above a specified threshold.MAGENTA Calculates the statistic relevance of SNPs based on regional SNP p-values
instead of genotype data.MDR Applies cross-validation training the data to find high risk groups of SNPs.S&C Creates a dictionary of all the SNPs and divides the data into two stages:
screening, to select SNPs according to a logistic regression model, andcleaning, to decrease false positives.
SNPHarvester Stochastic search algorithm classifying SNPs according to their relationwith the disease, using a random local search algorithm.
SNPRuler Rule-based algorithm, defining rules based on epistatic interactions.TEAM Exhaustive algorithm, creating a table with p-values of each pair of SNPs,
and uses a tree-based representation to place the results.Table 2.3: A description of each model creation algorithm designed for this problem.
Generic Model Creation Algorithms
Ensembles
There are many types of ensemble algorithms created by joining several kinds of model creation
algorithms to try and make a more accurate and reliable model. In this context, one of the most
recent ensemble methods [YHZZ10] was created using a genetic algorithm together with a few
classifier algorithms. Several subsets of SNPs are selected by applying the genetic algorithm a
predetermined number of times. These subsets are analyzed and ranked based on the number
of times each SNP combination appears in the selected subsets. After acquiring the fitness for
every SNP subset, the chromosome with the highest fitness is selected, represented by the SNP
subset contained in the chromosome. The genetic algorithm then applies selection, crossover and
mutation to the chosen subset. Considering the large amount of SNPs, in order to reduce the noise
and optimize the process, two classifier strategies and a diversity promoting strategy are used to
preselect and evaluate the SNPs. Blocking uses M classification algorithms to eliminate differences
caused by noise. Voting is used to balance and increase accuracy in evaluating the fitness of SNPs.
Double fault diversity tries to evaluate the diversity between classifiers by calculating the fitness
of misclassified SNPs, focusing on the diversity between them. This particular approach uses
decision-tree-based classifiers and instance based classifiers.
16
State-of-the-Art
ILP
Inductive Logic Programming (ILP) algorithms work by creating hypotheses that are encoded as
First Order clauses. ILP is characterized as an expressive representation language (First order
Logic - FOL) to represent both data and hypotheses. ILPs are very used in the bioinformatics field
and produce good results but have a slow runtime, therefore affecting the scalability.
Initially, ILP algorithms create background knowledge of the problem by logic propositions.
The training is then made with positive and negative examples. Hypotheses are then generated by
creating new logic propositions using the background knowledge and trained examples.
Success is measured by the classification accuracy of a given hypothesis and the transparency
of a formulated hypothesis, which means the ability to be understood by humans [LD94].
There are many implementations of this type of algorithm. One of the most used systems is A
Learning Engine for Proposing Hypotheses (Aleph) [Sri01]. This algorithm works in 4 different
steps:
1. Select example. Selects an example to be generalized and stops if none exist.
2. Build most-specific clause. Based on the example selected, the most specific clause that
respects the language restrictions is constructed.
3. Search. After creating the most specific clause, a more generalized clause is searched for
in a subset of the clauses in the previous clause.
4. Remove redundant. The clause with the best score is then added to the theory, removing
the redundant examples.
K-Nearest Neighbor
K-Nearest Neighbor (K-NN) is a classification and regression algorithm that determines the value
of new data based on its approximation to other instances [HK06]. In classification, the closest
neighbors to the new instance determine the class of that instance. In regression, the result is
the average of the nearest neighbors. K is the number of the nearest neighbors to be used in the
calculation of new results. For this context, K-NN is mostly used as an attribute selection method.
Methods such as ReliefF use an approach based on the K-NN algorithm.
Naïve Bayes
Bayesian approaches are amongst the most common in this context. The naïve approach assumes
independence between features. There are many optimizations to reduce the naivety [PV08], such
as selecting subsets of attributes that are considered to be conditionally independent [LIT92], or
extending the structure of Naïve Bayes (NB) to represent dependencies in attributes [WBW05].
NB works using Bayesian networks by assigning probabilities to each event, using the model
trained previously. The final result is then chosen based on the most probable outcome. Specific
implementations of this nature can be seen in BEAM and BNMBL.
17
State-of-the-Art
Neural Networks
NNs are a type of classification and regression algorithm based on the neurological system of the
central nervous system. An example of these NNs is a Multilayer Perceptron, whic is the most
used type of NNs. In a Multilayer Perceptron, there is an input layer, which proceeds to a second
layer of nodes that represent neurons. These intermediate layers are also called hidden layers. The
last hidden layer, or output layer, represents the prediction of the network. Each layer is densely
connected. Each connection is weighted based on the relations between nodes in the training phase
[HK06]. An example of this is illustrated in Figure 2.3.
1
2
3
4
5
6
w_1
w_2
w_3
w_14
w_15
w_24
w_25
w_34
w_35
w_46
w_56
Figure 2.3: An example of a Neural Network. [HK06]
NNs can have multiple layers, which can be used in nonlinear problems [DHS01]. There
are some recent implementations of NNs in the discovery of epistatic relations, such as GENN
[HDF+13].
Support Vector Machines
Support Vector Machines (SVM) is a classification algorithm that divides data based on pattern
recognition methods. In the training phase, data is divided into two parts, mapping them accord-
ing to their attributes. SVM then tries to find the best nonlinear mapping to separate data by a
hyperplane [DHS01]. This hyperplane is mapped in order to find the best separation possible
by increasing the distance in the gap between the classes. If a linear classification is not possi-
ble, SVM can use the kernel trick to divide the data by increasing the dimension of the problem
[ABR64]. A regression or multiple class alternative of SVM is also available by transforming
the problem into multiple binary class problems [DKN05].There are no specific implementation
of SVM in this context, however there are many methods that use pattern recognition in their
implementation.
18
State-of-the-Art
Summary Table
Table 2.4 contains a summary of all the generic algorithms discussed. A more technical table is
available in Figure 2.11.
Algorithms DescriptionEnsemble Many model creation algorithms are joined to "vote" on the most probable
outcome to increase the accuracy and creating a more reliable meta model.ILP Uses logic programming, representing positive and negative examples, back-
ground knowledge and hypotheses that use trained examples and backgroundknowledge to classify accurately and transparently.
K-NN Uses trained data to classify new instances based on the proximity to a givenneighbor previously classified. The outcome is obtained from the closestneighborhood of the new instance.
Naive Bayes Creates bayesian networks, calculating the probability of a relation betweenevents, assuming independence between attributes.
NN Based on the central nervous system, creates a graph using the trained dataand, receiving an input, calculates the most weighed path to an outcome nodebased on its connection to other nodes.
SVM Searches and maps attributes from random individuals to boolean variablesand uses a fitness function to evaluate them.
Table 2.4: A description of each generic model creation algorithm.
Statistical Methods
Bonferroni Correction
Bonferroni Correction is a conservative approach to multiple comparison testing [BH95]. It is the
simplest correction for selecting a predetermined number of hypotheses based on their statistical
relevance. However there is no assumption of dependency. For SNPs, the p value is calculated
using
pcorrected = 1− (1− puncorrected)n (2.12)
where n is the number of hypothesis tested. this can be further simplified to
pcorrected = npuncorrected (2.13)
when npuncorrected � 1 [Cor09].
Linear Regression
Linear regression models try to fit a straight line on the data points. As in every regression problem,
the label is numeric. This label is modeled as a linear function, as shown in Equation 2.14 of
another random variable. The weights attributed to each variable are calculated in the training
19
State-of-the-Art
0 0.2 0.4 0.6 0.8 1
−4
−2
0
2
4
(a) logit transformation
−10 −5 0 5 100
0.2
0.4
0.6
0.8
1
(b) logistic regression function
Figure 2.4: A logit transformation and a possible logistic regression function resultant of the logittransformation.[WFH11]
data [WFH11].
x = w0 +w1a1 +w2a2 + ...+wkak (2.14)
where x is the class outcome, ai is each attribute, and wi is the weight of the attribute. In case of
a multiple linear regression, more than one variable is involved in predicting the label [SGLT12].
Linear regression models are usually fitted using the least squares approach to fit data into linear
equations and minimizing the squared errors between the observed values and the fitted values
[HK06]. In this context, linear regression models are often used as fitness functions to test the
score of SNPs and their statistical relevance.
Logistic Regression
Linear functions can be used in classification problems by assigning 1 to instances in the training
that belong to the class and 0 for the instances that do not belong the class [WFH11]. A linear
function is still applied to new instances and the class closest to the resulting value is selected.
Since these values are not constrained in the interval from 0 to 1, a logit transformation is applied
by transforming the variable into a value ranging from 0 to 1. Figure 2.15 illustrates a relation of
the logit transformation and the final logistic regression function. To evaluate the logistic regres-
sion model, log likelihood is used instead of calculating the squared error. The formula used to
evaluate the model is
n
∑i=1
(1− x(i))log(1−Pr[1|a(i)1 ,a(i)2 , ...,a(i)k ])+ x(i)log(Pr[1|a(i)1 ,a(i)2 , ...,a(i)k ]) (2.15)
where x(i) is either 0 or 1 and ai represents each attribute.
Logistic regression, in this case, is used as a penalizing model [Ste12] [NCS05], and as a
statistical model when the outcome is binary [PH08][TJZ06][Cor09].
20
State-of-the-Art
A variation of this model called multinomial regression are used when there is more than two
possible outcomes.
Markov Chain Monte Carlo
MCMC algorithm is used to sample models within a high dimension surface. MCMC finds models
by using a random walk, trying to converge to a target equilibrium distribution [Smi84], creating
a sample of the population to be analyzed. This algorithm is often used in bayesian statistics
[SWS10].
Summary Table
Table 2.5 shows a brief description of the auxilliary algorithms.
Algorithms DescriptionBonferroni Correction Selects a predetermined amount of hypotheses based on their statistic
relevance. Used in model creation algorithms.Linear Regression Creates a straight line connecting SNPs to find their relevancy. Used
for numerical values.Logistic Regression Applies a linear function to assert how new data is evaluated, based
on trained data. Used for binary values.MCMC Uses a random walk to find statistical relevancy of SNPs. Used in
bayesian models.Table 2.5: A description of auxiliary algorithms used in model creation or data selection.
2.3 Data analysis evaluation procedures and measures
Data analysis evaluation measures
Type I error and Type II error
Type I errors refer to the acceptance of a false relation. In this case, it refers to the acceptance of
a relation between an SNP or interaction of SNPs and the disease that does not exist in fact. This
can also be referred to as a false positive. Type II errors refer to the rejection of true relations. This
can also be referred to as a false negative.
Accuracy
The accuracy for classifier algorithms can be determined by how accurately a given classifier will
correctly predict future data. This may sometimes be misleading when overfitting occurs. To
prevent this, data evaluation procedures are employed and the final accuracy is the average of the
accuracies obtained from each iteration [Pow11]. For ensemble methods, a voting process takes
21
State-of-the-Art
place and the final result is the most voted outcome.
Accuracy =true positives+ true negatives
true positives+ false positives+ true negatives+ false negatives(2.16)
Precision
Precision is the measure of the relation between the relevant results and the returned results by the
model [Pow11]. In this context, this means the number of SNPs or genotypes correctly identified
as related to a disease by the model, in relation to all the SNPs or genotypes identified as related
to a disease by the model.
Precision =true positives
true positives+ false positives(2.17)
Recall
Recall measures the fraction of the relevant results in relation to the retrieved relevant results
[Pow11]. In this context this means the SNPs or genotypes correctly identified as related to a
disease by the model, in relation to all the SNPs or genotypes that are actually related to a disease.
Recall =true positives
true positives+ false negatives(2.18)
F-measure
The f-measure is the relation between precision and recall. In the F1 measure, this creates some
problems due to the similar weight to precision and recall which may not have the same relevancy.
This is true for the epistasis problem, where type 1 errors should be prioritized [Pow11].
F1 = 2 · precision · recallprecision+ recall
(2.19)
ROC curve
The receiver operating characteristic (ROC) is a graphical representation of the relation between
true positives and false positives of a binary classifier. Multiple classifiers can be plotted to com-
pare results [Pow11]. The area under the curve (AUC) corresponds to a higher probability of
selecting a true positive than a false positive [WFH11]. Figure 2.5 shows an example of the ROC
curve. The greater the AUC the better. A representation of the ROC Curve is given by the relation
between sensitivity and specificity which can be seen in Equations 2.20 and 2.21.
Sensitivity =Number of true positives
number of true positives+number of false negatives(2.20)
Speci f icity =Number of true negatives
number of true negatives+number of false positives(2.21)
22
State-of-the-Art
0 0.2 0.4 0.6 0.8 1
0
0.2
0.4
0.6
0.8
1
False Positive Rate (Specificity)
True
Posi
tive
Rat
e(1−
Sens
itivi
ty)
Figure 2.5: An example of a ROC curve.
Summary Table
Table 2.6 shows the summary of the various evaluation measures.
Data analysis evaluation procedures
Bootstrapping
Given a dataset of n instances, a bootstrap sample is a selection of a new dataset with size n by
sampling from the original dataset with replacement, therefore creating a different dataset from the
original one. The probability for any given instance not to be chosen is (1−1/n)n ≈ e−1 ≈ 0.368.
Due to the high chance that some instances from the original dataset are not included in the new
dataset, these instances will be used for testing [Koh95]. Considering the high percentage of
probable testing instances, the bootstrap procedure is then repeated several times with different
samples and the final results are averaged [WFH11].
Cross-Validation
The dataset is split randomly into k mutually exclusive subsets or folds. Each fold has approx-
imately the same size and is tested once and trained k− 1 times. This means that there are k
iterations of model creation and evaluation. In each iteration, a new subset is selected to become
the test set, while the other k−1 subsets are used for training. The accuracy is estimated by
acc_cv = (1/n) · Σ_{〈v_i, y_i〉 ∈ D} δ(L(D \ D_(i), v_i), y_i)    (2.22)

where D_(i) is the test set (fold) that includes instance x_i = 〈v_i, y_i〉, L is the learning algorithm, and n is the number of instances in D [Koh95].
The error rates of the different iterations are averaged to yield the overall error rate [WFH11]. The
dataset can be stratified to place in each fold the same proportions of labels as the original dataset.
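The k-fold procedure and Equation 2.22 can be illustrated with a deliberately trivial learner (a majority-class classifier, chosen only to keep the sketch self-contained):

```python
import random

def majority_label(train):
    # the constant classifier that minimizes 0/1 loss on the training folds
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)

def cv_accuracy(data, k, rng):
    idx = list(range(len(data)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal-sized folds
    correct = 0
    for fold in folds:                     # each fold is the test set once
        held_out = set(fold)
        train = [data[i] for i in idx if i not in held_out]
        label = majority_label(train)      # "model" trained on the k-1 folds
        correct += sum(1 for i in fold if data[i][1] == label)
    return correct / len(data)             # Equation 2.22: mean 0/1 score

data = [(x, 0) for x in range(70)] + [(x, 1) for x in range(30)]
print(cv_accuracy(data, k=10, rng=random.Random(1)))  # 0.7
```

With a 70/30 class split, removing any single fold cannot flip the training majority, so the majority learner is right on exactly the 70 majority-class instances and the estimate is 0.7 regardless of the shuffle.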
Measure | Description
Type I error | false positives
Type II error | false negatives
Accuracy | (true positives + true negatives) / (true positives + false positives + true negatives + false negatives)
Precision | true positives / (true positives + false positives)
Recall | true positives / (true positives + false negatives)
F-Measure | (2 · precision · recall) / (precision + recall)
ROC curve | Sensitivity = true positives / (true positives + false negatives); Specificity = true negatives / (true negatives + false positives)
Table 2.6: A description of data analysis measures and how they are calculated.
Leave-one-out
Leave-one-out is an n-fold cross validation where n is the number of instances in the dataset. In
each iteration, a new instance is left out as a test, while all the others are used in training. For
classification algorithms, this means that the success rate in each fold is either 0% or 100%. This
approach allows for the maximum use of data for training, which presumably increases accuracy,
and is a deterministic process because there is no random sampling for each fold. However, this process is computationally expensive and cannot be stratified [WFH11].
Hold-out method
The hold-out method consists of strictly reserving a portion of the data for testing. This means that only part of the dataset is used for training, usually 70%, leaving 30% for testing
purposes [Koh95]. This method works well for large datasets but can underestimate the accuracy on small ones.
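A sketch of the split described above; the 70/30 ratio and a single shuffle are the only ingredients:

```python
import random

# Hold-out: shuffle once, then strictly reserve the tail of the data for
# testing. train_fraction = 0.7 reproduces the usual 70/30 split.
def holdout_split(data, train_fraction, rng):
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train, test = holdout_split(list(range(100)), 0.7, random.Random(42))
print(len(train), len(test))  # 70 30
```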
Summary Table
Table 2.7 shows the various data testing methods.
Procedure | Description
Bootstrapping | Creates a new dataset with the same number of instances as the original by sampling with replacement, which allows repetition of the same instance. This can be repeated multiple times.
Cross-Validation | Divides the original data into small subsets of equal size, leaving one subset out for testing. The test subset changes with each iteration, cycling through all subsets.
Leave-one-out | Similar to Cross-Validation, but with subsets of size 1, leaving only 1 instance out for testing and iterating through all instances.
Hold-out method | Reserves a specific amount of data for testing, using the rest for training.
Table 2.7: A description of data analysis procedures.
Data Analysis Methodologies
KDD Process
The Knowledge Discovery in Databases (KDD) Process is the extraction of knowledge from data using DM methods, guided by the specification of measures and thresholds [AS08].
The KDD process is interactive and iterative, divided into many components which can be
summarized into 5 steps illustrated in Figure 2.6.
1. The selection step consists of learning the application domain and creating a target data set or selecting data samples for knowledge discovery.

2. The pre-processing stage consists of data cleaning and handling missing data fields.
3. The transformation step allows for data reduction methods to reduce the dimensionality
and adapt the data for the model creation algorithms.
4. The data mining stage is where the algorithm for model creation is selected and applied.
5. The final stage, interpretation/evaluation, is where the discovered patterns are interpreted
and the performance is measured [FPSS96].
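The five steps above can be traced on a toy tabular dataset; everything below (the records, the missing-value rule, the threshold model) is an invented placeholder used only to make each stage concrete:

```python
# A toy walk through the five KDD stages. All data and rules are invented.
raw = [
    {"snp1": 2, "snp2": 1, "label": "case"},
    {"snp1": 1, "snp2": None, "label": "control"},
    {"snp1": 0, "snp2": 0, "label": "control"},
    {"snp1": 2, "snp2": 2, "label": "unknown"},
]

# 1. Selection: keep only the samples relevant to the study.
data = [r for r in raw if r["label"] in ("case", "control")]

# 2. Pre-processing: handle missing data fields.
for r in data:
    if r["snp2"] is None:
        r["snp2"] = 0

# 3. Transformation: adapt the data for the model creation algorithm.
for r in data:
    r["y"] = 1 if r["label"] == "case" else 0

# 4. Data mining: apply a (trivial) model creation algorithm.
def model(r):
    return 1 if r["snp1"] >= 2 else 0

# 5. Interpretation/evaluation: measure the performance of the mined pattern.
accuracy = sum(model(r) == r["y"] for r in data) / len(data)
print(accuracy)
```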
Figure 2.6: The main stages of the KDD process. [FPSS96]
CRISP-DM
The Cross Industry Standard Process for Data Mining (CRISP-DM) methodology is a model of the life cycle of a data mining project. This life cycle is illustrated in Figure 2.7. There are 6 main phases in a project [CCK00].
1. The business understanding phase is the initial phase, where the main focus is the un-
derstanding of the objectives and requirements of the project from a business perspective,
defining an initial plan to solve the data mining problem.
2. Data understanding is the collection and comprehension of the data itself. This is where
the first data characteristics become apparent.
3. The data preparation phase is where preprocessing filters and feature selection methods are applied.
4. During the modeling phase, one or more modeling techniques are applied to create specific models. Each technique has its own data preparation requirements.
5. At the evaluation stage, the models developed in the earlier phase are then put to the test
using different types of evaluation methods. The results are then analyzed.
6. Finally, the deployment stage is where the knowledge created with the data mining process
is put in practice, either by generating a report or by implementing a repeatable data mining
process for the customer.
SEMMA
SEMMA stands for Sample, Explore, Modify, Model and Assess. Like CRISP-DM, SEMMA follows a data mining life cycle, with five stages corresponding to the letters of the acronym.
1. The sample stage is where sampling of the data and data selection takes place. This stage is
optional.
2. The explore stage consists of searching the data for anomalies to gain understanding of the
data.
3. Modify stage consists of transforming the data and shaping it to serve the model selection
process needs.
Figure 2.7: The CRISP-DM life cycle. [CCK00]
4. The model stage is where the model creation takes place.
5. The assess stage exists to evaluate the usefulness and the reliability of the created model.
Although the process is independent of the data mining (DM) tool, there are some guidelines tied to the Statistical Analysis System (SAS) Enterprise Miner software [AS08].
Summary Table
Table 2.8 contains the discussed procedures and their various stages.
KDD | SEMMA | CRISP-DM
Pre KDD | — | Business understanding
Selection | Sample | Data understanding
Pre-processing | Explore | Data understanding
Transformation | Modify | Data preparation
Data mining | Model | Modeling
Interpretation/Evaluation | Assessment | Evaluation
Post KDD | — | Deployment
Table 2.8: A comparison between the different procedures [AS08].
2.4 Data Simulation and Analysis Software
Data Simulation Software
HAPGEN
HAPGEN is used to simulate case-control datasets of SNP markers. These datasets can encode the main effect and interactions of multiple disease SNPs, and can be further customized to
allow for a change in the number of individuals and SNPs. To create the simulation of phenotypes
with interaction between disease SNPs, an R package is supplied, using the data with independent
disease SNPs generated from HAPGEN [SSDM09].
GenomeSIMLA
genomeSIMLA can be divided into two main programs: SIMLA (SIMulation of Linkage and
Association) and simPEN. SIMLA generates large scale populations used in the samples for case-
control datasets while simPEN creates penetrance tables for the disease specifications. A forward-
time population simulation is used to specify many gene related parameters, allowing a controlled
evolutionary process [EBT+08]. It can be used for family studies and for studies of unrelated individuals. Penetrance models can be generated with specific allelic frequencies, for purely epistatic interactions or interactions associated with main effects. Each interaction may contain various SNPs. The number
of chromosomes, SNPs, and individuals in each dataset is configurable. The prevalence and odds
ratio of the disease can be adjusted to allow a more realistic manifestation in the disease model.
The software contains 4 main steps:
1. Pool Generation contains the evolution of the population, together with their chromosomes and allelic frequencies.
2. When a generation contains the desired characteristics, a Locus Selection is made to allo-
cate the disease model.
3. The Penetrance Specification is used to measure the risk associated with each configuration of the disease alleles.
4. Finally the Data Simulation creates the datasets, according to the specified configurations.
Gene-Environment iNteraction Simulator 2
Gene-Environment iNteraction Simulator 2 (GENS2) simulates interactions between two genetic SNPs and one environmental factor [MSA+12]. Initially, the population used to generate the datasets is produced by a simuPOP script [PA10]. This population evolves through a time simulation, based on the desired number of individuals and allele frequencies. The second step involves the simulation of the disease penetrance model. Finally, according to the risk assessment, a disease status is assigned randomly. Customization options include the number of individuals, allelic frequencies, prevalence, and associated risk. A GUI is provided to allow swift customization and configuration of the data.
Summary Table
Table 2.9 summarizes the data simulation tools discussed previously.
Feature | HAPGEN | GenomeSIMLA | GENS2
Dataset types | case-control | case-control, pedigree and family | case-control
Interaction types | main effect and epistasis | main effect and/or epistasis | epistasis and gene-environment
Order of interactions | X SNPs | X SNPs | 2 SNPs and 1 environment factor
GUI | No | Generates HTML to illustrate the population | Yes
Population | not generated | forward-time population simulation | time simulation
Customizable number of individuals | Yes | Yes | Yes
Customizable number of SNPs | Yes | Yes | No
Customizable allelic frequencies | Yes | Yes | Yes
Table 2.9: A comparison of the most relevant features of data simulation tools.
Data Analysis Software
Analysis Tool for Heritable and Environmental Network Associations
ATHENA (Analysis Tool for Heritable and Environmental Network Associations) is a software
package designed to create models by analyzing various types of data. The organization of the
software package can be seen in Figure 2.8.
ATHENA receives various types of input and applies filtering methods for feature selection, followed by analytical methods for model creation. The final model is the best of the generated analytical models.
The main usage of ATHENA is in feature selection and model creation. As a filtering method, it uses the Random Jungle algorithm, a bootstrap, tree-based variable selection variant of Random Forests. For modeling, ATHENA uses computational evolution modeling techniques such as GENN [HDF+13], as well as other common regression algorithms.
Figure 2.8: A diagram of the ATHENA software package [HDF+13]. The input can have many different formats, involving different kinds of data (SNPs, microarray, proteomics, sequence data, biomarkers, clinical data). In the filtering step (e.g. Random Jungle), variables are prioritized based on their known biological functions. The analytical methods (symbolic regression, Bayesian networks, SVM, GENN) currently consist of computational evolution modeling techniques, but will be further developed to allow more methods. This analysis allows the combination of different types of data in order to identify multi-variable prediction models that include data from different parts of the whole biological process.
Knowledge Extraction Evolutionary Learning
Like the name suggests, Knowledge Extraction Evolutionary Learning (KEEL) is a software tool,
containing many evolutionary algorithms, and is used in many typical data mining problems.
KEEL contains the most well-known models in evolutionary learning. These can be used for research purposes, using the built-in automation of experiments, or as an educational tool, with emphasis on execution time and a real-time view of the algorithms during the data mining process [AFSG+09].
The currently available function blocks are:
1. Data Management is used for importing or exporting data into other formats, data edition
and visualization.
2. Design of Experiments is where the experimentation takes place, applying the selected
model, type of validation and type of learning on the selected data sets. This module is
available off-line.
3. Educational Experiments works in a similar way to the previous function block, but can be closely monitored, displaying the learning process of the selected model algorithm. This module is available on-line [AFFL+11].
Konstanz Information Miner
The Konstanz Information Miner (KNIME) is a Java graphical workflow editor. The architecture was developed around three main aspects: a visual interactive framework; modularity of the process, so that the development of different algorithms can be distributed; and easy expandability, to add new processing nodes or views [BCD+08].
Version 2.0 adds support for loops in the workflow, new database ports, and the Predictive Model Markup Language (PMML), used for storing and exchanging predictive models in XML format [BCD09].
Orange
Developed in Python, Orange is a machine learning and data mining toolbox containing many hierarchically-organized data mining components. The main hierarchical blocks are: data management and preprocessing for feature selection and data input, classification, regression, association (rules), ensembles (such as bagging and boosting), clustering, evaluation, and projections.
Classification algorithms include Bayesian approaches, SVM, rule induction approaches, classification trees, and random forests. Regression methods include linear and lasso regression, partial least squares regression, multivariate regression, and regression trees or forests. Evaluation contains
the various procedures for testing and scoring the quality of prediction methods or estimation of
reliability. The projections block is where the visual analysis takes place, with multi-dimensional
scaling and self-organizing maps.
Orange can be scripted from the Python shell, which means that new methods can be created and existing machine learning components can be combined [DCE13].
PLINK
PLINK is a C/C++ tool set designed to handle GWAS datasets. Due to their high complexity and size, simple methods that can achieve good results with more data are preferred. PLINK measures allele, genotype, and haplotype frequencies.
PLINK offers tools for clustering a population into homogeneous subsets, for the classical multidimensional scaling (MDS) algorithm, and for outlier detection. MDS helps to find similarities by plotting objects in many dimensions while trying to preserve the distances between them [PNTB+07].
A graphical user interface (GUI), gPLINK, offers a framework to manage projects. gPLINK
also provides integration with Haploview, which is a tool used in tabulating, filtering, sorting,
merging and visualizing PLINK GWAS output files [BFMD05].
R
The R project is a statistical computing system. R has a command-line-driven interpreter for
the S language, with many extension packages available [Rip01]. The advantage of R is the
flexibility to create new algorithms instead of using implemented approaches, where the source
code is not available. R can also produce high quality graphics and mathematical symbols. Some
user interfaces are available as packages or by using Integrated Development Environments (IDEs)
and adding R as a plugin [VL12]. R also contains many algorithms encoded in packages, such as
NN, SVM and MBMDR.
RapidMiner
RapidMiner is a DM software tool which contains many algorithms for all DM problems and
business analysis. It contains a GUI for creation and editing of data mining processes, following
the CRISP-DM methodology [Jun09].
A modular and pipelined view of the process consists of four stages: an input stage, where many formats of data can be imported; a preprocessing stage, where filtering and data processing begin; a learning stage, using the selected algorithm; and an evaluation stage, which contains the performance results of the process. RapidMiner can be extended with plug-ins, which developers can use to create new algorithms.
Weka
Weka is a machine learning workbench and application programming interface (API). Weka has
four interfaces: command line, Knowledge Flow, Explorer and Experimenter [FHT+04].
Explorer is the main interface in Weka. It contains different tabs with different types of meth-
ods. The tab Preprocess contains filtering methods. Classify contains classifier and regression
algorithms. Cluster and Associate contain clustering algorithms and rule association methods, respectively. The Select attributes tab contains methods for identifying subsets of attributes that are predictive of other attributes. The final panel, Visualize, allows plotting pairs of attributes with many customizable options. The user interface of the Explorer can be seen in Figure 2.9.
In the context of bioinformatics, Weka provides a wide variety of algorithms for classification,
regression, clustering and feature selection.
A recent update added many new methods and reduced the execution time by using just-in-
time compilers [HFH+09].
Summary Table
Table 2.10 contains a summary of all the discussed data analysis software.
Figure 2.9: The Weka Explorer interface.
Tool | GUI | Allows scripting | No. of integrated algorithms
ATHENA | No | Yes | few algorithms
KEEL | Yes | Yes | few algorithms (evolutionary learning algorithms)
KNIME | Yes | Yes | many algorithms and I/O converters
Orange | Yes | Yes | many algorithms
PLINK | Yes (gPLINK) | Yes | PLINK
R | No | Yes | many algorithms (packages)
RapidMiner | Yes | No | many algorithms and I/O converters
Weka | Yes (3) | Yes | many algorithms
Table 2.10: A comparison of data mining tools.
2.5 Chapter Conclusions
This chapter can be divided into 4 main categories: biology background, statistical and machine
learning algorithms, evaluation measures and procedures, and Data Simulation and Analysis
tools.
The study of the biology concepts and background knowledge is vital to understanding the problem. It improves data understanding, which translates into a better approach to the problem. Knowing how DNA is organized into chromosomes, and divided into genes and SNPs, is very important for understanding how epistasis works.
The statistical and machine learning algorithms are divided into feature selection algorithms
and model creation algorithms. The feature selection algorithms may produce different results
depending on the generated model. This means that model creation algorithms need to be adapted
with specific feature selection approaches. This is true for most of the model creation algorithms,
where these feature selection approaches are already embedded. Considering the large number of model creation algorithms, a pre-selection is necessary. Based on previous results [WLFW11], algorithms like PLINK, MDR and BEAM are set aside in favour of BOOST, S & C, SNPHarvester, SNPRuler and TEAM. In the last year, an interesting study [OSL13] revealed that IIM was better than BEAM and SNPHarvester, making it an interesting approach to test. However, this algorithm does not yield a χ2 score for the significant SNPs, and BEAM has since been improved and is now in its third iteration [ZL07]. Furthermore, considering that MDR is one of the first and most popular approaches to GWAS, its new iteration, MBMDR, is also a good algorithm to test.
To optimize the results, machine learning procedures should be used. Cross-validation and bootstrapping are the most popular approaches, due to their high ratio of training to test instances. Hold-out is also very popular for large datasets. As a DM methodology, CRISP-DM is the most widely adopted approach, for being independent of tools and industries.
There are many tools available, including specifically designed ones. However, some include only a small number of algorithms and do not allow the implementation of new ones, which is important for testing existing approaches and creating new ones. A data analysis tool that allows scripting, such as R, is very useful for creating scripts that evaluate existing algorithms based on the chosen statistical relevancy tests.
The algorithms selected for the empirical study of state-of-the-art model creation algorithms regarding epistasis and main effect detection are summarized in Table 2.11, along with the main characteristics of each selected algorithm.
Table 2.11: Similarities and differences between BEAM3, BOOST, MBMDR, Screen & Clean, SNPHarvester, SNPRuler, and TEAM.

Features | BEAM 3 | BOOST | MBMDR | Screen & Clean
Search | Stochastic | Exhaustive | Exhaustive | Heuristic
Permutation Test | √ | − | √ | −
Chi-square Test | −* | √ | −* | −*
Tree/Graph Structure | √ | − | − | −
Bonferroni Correction | − | √ | − | √
Interactive Effect | √ | √ | √ | √
Main Effect | √ | √ | √ | √
Full Effect | √ | √ | √ | √
Programming Language | C++ | C | R | R

Features | SNPHarvester | SNPRuler | TEAM
Search | Stochastic | Heuristic | Exhaustive
Permutation Test | − | − | √
Chi-square Test | √ | √ | −*
Tree Structure | − | √ | √
Bonferroni Correction | √ | √ | −
Interactive Effect | √ | √ | √
Main Effect | √ | − | −
Full Effect | √ | − | −
Programming Language | Java | Java | C++

The Chi-square Test is done for each SNP in main effect detection, and for each SNP interaction in epistasis detection. Full effect is a disease model with both main effect and epistasis detection.
*Although BEAM3 can evaluate interactive and full effects, its evaluation test is not comparable between methods; only single SNPs are evaluated with the χ2 test. TEAM outputs a χ2 test score from the contingency tables, but does not output the individual SNP χ2 score. MBMDR and Screen & Clean results are comparable with the other algorithms.
Chapter 3
A Comparative Study of Epistasis and Main Effect Analysis Algorithms
In this chapter, the experimental setup of an empirical analysis with existing epistasis detection
algorithms is presented.
3.1 Introduction
The experiments can be divided into two stages: the empirical analysis of existing methods and
the comparison between a new approach and the existing algorithms.
For stage 1, several algorithms were selected based on the previous state-of-the-art study, covering very different approaches. The algorithms selected are: BEAM 3.0 [Zha12]; BOOST [WYY+10a]; MBMDR [MVV11]; Screen and Clean [WDR+10]; SNPHarvester [YHW+09]; SNPRuler [WYY+10b]; and TEAM [ZHZW10]. The purpose of this study is to evaluate the
results of each algorithm and select the best algorithms according to the evaluation measures for
stage 2.
Stage 2 consists of creating an Ensemble approach based on the characteristics of each algorithm. The existing algorithms are evaluated according to their Power, Scalability, and Type I Error Rate. Each algorithm is analyzed with each measure, for each parameter configuration.
This allows correlating evaluation measures with data set parameters for each algorithm, giving a greater understanding of the usability of each algorithm according to the parameter settings.
In both of these studies, generated data sets were used, with many different configurations and varying values. The configurations use different values for: Population size; Minor Allele Frequency; Odds Ratio; Prevalence; and different types of Disease Models. These artificial data sets were created using genomeSIMLA, an open source data generator with generation evolution capabilities and many parametrization options.
In this chapter, the data sets and their parameters are explained in more detail in Section 3.2. This section also contains the input, output, and parameters for each algorithm. Section
3.3 contains the experimental procedure used for stage 1 experiments and the obtained results are
discussed. Finally, Section 3.4 contains the conclusions made from stage 1 experiments.
3.2 Methods
Data sets
In these experiments, there are a total of 270 different configurations of data sets. For each configuration there are 100 data sets, yielding 27,000 results for each algorithm.
Each data set is created in genomeSIMLA, which generates a population of 1,000,000 individuals evolved over 1750 generations. The growth of the initial population, consisting of 10,000 individuals, follows a logistic growth rate, allowing an organic evolution of SNP allele frequencies. Each data set contains 300 SNPs divided into 2 chromosomes. The first chromosome contains 20 blocks of 10 SNPs each, while the second chromosome contains 10 blocks of
10 SNPs. The alleles infused with disease related genotypes are chosen from different blocks in
different chromosomes.
The following parameters are used to generate different configurations of data sets:
• Allele Frequency - The frequency of the minor allele of the disease SNPs. Considering the
allele frequency of all 300 SNPs, the chosen SNPs that affect the disease are selected among
the SNPs closest to the desired minor allele frequency. The allele frequencies can be seen
in the lab notes [PC14a].
• Population - Number of individuals sampled in the data set. According to each data set, a
given number of individuals are selected from the generated population mentioned earlier.
The ratio of cases to controls is determined by the disease prevalence.
• Disease Model - Type of disease model: main effect, epistasis interaction, and full effect.
The main effect model consists of 2 SNPs that independently affect the phenotype expres-
sion. The epistasis interaction model is determined by 2 SNPs that interact with each other
and affect the phenotype expression only when both disease alleles are present. Full effect
is determined by 2 SNPs that affect the phenotype expression by epistasis interaction and
by their main effect.
• Odds ratio - Relation between disease SNPs. Probability of one disease SNP being present,
given the presence of the other disease SNP.
• Prevalence - The proportion of a population with the disease. Affects the number of cases
and controls in a data set. A prevalence of 0.0001 corresponds to 30% of cases while a
prevalence of 0.02 corresponds to 50% of cases.
For these experiments, the parameters chosen are illustrated in Table 3.1.
Parameters | Values
Minor Allele Frequency (0–1) | 0.01; 0.05; 0.1; 0.3; 0.5
Population (Number of Individuals) | 500; 1000; 2000
Disease Model | Main Effect; Epistasis; Full Effect
Odds Ratio | 1.1; 1.5; 2.0
Prevalence | 0.001; 0.02
Table 3.1: The values of each parameter used. Each configuration has a unique set of the parameters used.
3.2.1 Algorithms for interaction analysis
The following algorithms were selected for these experiments. These algorithms were selected
because of their unique approach and previous results obtained. A more detailed description and
additional result analysis of the algorithms is available in the lab notes of these experiments: BEAM3.0 [PC14b]; BOOST [PC14c]; Screen and Clean [PC14d]; SNPRuler [PC14f]; SNPHarvester [PC14g]; TEAM [PC14h]; and MBMDR [PC14i].
BEAM3
The algorithm allows filtering of SNPs with many missing genotypes, and setting a specific number of interactions for the MCMC as well as its initial temperature. There is also a prior probability of each SNP being associated with the disease. The default value is p = 5/L, where L is the number of SNPs. This was changed to p = 2/L, considering that there are 2 disease-affected SNPs.
BOOST
The algorithm contains no options to be customized. Considering the transformation of the data
into a Boolean type, the χ2 tests for interaction analysis have only 4 degrees of freedom.
MBMDR
This algorithm was processed in a different computer setting. The computer used for this algorithm has an Intel(R) Core(TM)2 Quad CPU Q9400 2.66 GHz processor and 16.00 GB of RAM.
Screen and Clean
The parameters chosen for this algorithm are:
• L - number of SNPs to be retained with the smallest p-values. Since there are 300 SNPs,
this is the value chosen.
• K_pairs - Number of pairwise interactions to be retained by the lasso. The selected value is
100.
• response - The type of phenotype, which can be binomial or Gaussian. The phenotypes here are binomial.
• alpha - The Bonferroni correction lower-bound limit for retention of SNPs. For this experiment, α = 0.05.
• standardize - If true, the genotypes, coded as 0, 1, or 2, are centered to mean 0 and standard deviation 1. The data must be standardized to run the Screen & Clean procedure, so this is enabled.
SNPHarvester
This algorithm has two modes: a "Threshold-Based" mode, which outputs all the significant SNPs above a specified significance threshold, and a "Top-K Based" mode, which outputs a specified number of SNP interactions. It is possible to choose the minimum and maximum number of interacting SNPs. For these experiments, the mode used is the "Threshold-Based" mode, with a significance level of α = 0.05, a minimum number of interacting SNPs of 1, which also tests main effects of SNPs, and a maximum of 2.
SNPRuler
The results are already limited by a threshold of 0.3, and further reduced to 0.05 with a Bonferroni correction. There are 3 configurable parameters:
• listSize - The expected number of interactions.
• depth - Order of interaction. Number of interacting SNPs.
• updateRatio - The step size of updating a rule. Takes a value between 0 and 1, 0 being not
updated and 1 updating a rule at each step.
The maximum number of rules is 50000, the length of each rule is 2 and the pruning threshold
is 0, to allow for all possible combinations.
TEAM
For this experiment, the χ2 score was calculated from the contingency tables. The number of permutations used in the significance test is set to 100 and the false discovery rate is set to 1. This is used to control the error rate using the permutation test, instead of a Bonferroni correction.
3.3 Simulation Design
This section contains the evaluation measures for the obtained results, the experimental method-
ology used in the experiments, the obtained results, and a discussion of the results.
Experimental Procedures
The results obtained from the various algorithms are evaluated according to their Power, Scalability, and Type I Error Rate.
In each data set, true positives and false positives are determined from the p-values that are significant at α = 0.05 in the statistical test, after a Bonferroni correction.
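As an illustration of a Bonferroni-corrected χ2 significance test of the kind used by several of the algorithms above, the sketch below scores one SNP's genotype-by-phenotype contingency table. The counts and the number of tests are invented, and the closed form exp(−x/2) for the p-value holds only for the 2 degrees of freedom of a 2×3 table:

```python
import math

# Pearson chi-square statistic of an r x c contingency table.
def chi2_statistic(table):
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return sum(
        (table[i][j] - row_totals[i] * col_totals[j] / total) ** 2
        / (row_totals[i] * col_totals[j] / total)
        for i in range(len(table))
        for j in range(len(table[0]))
    )

# Invented genotype counts (AA, Aa, aa) for one SNP.
cases = [60, 30, 10]
controls = [40, 40, 20]
x2 = chi2_statistic([cases, controls])

p_value = math.exp(-x2 / 2)  # chi-square survival function, valid for df = 2 only
n_tests = 300                # e.g. one single-SNP test per marker
significant = p_value < 0.05 / n_tests  # Bonferroni-corrected alpha
print(round(x2, 3), significant)
```

Here the uncorrected p-value (about 0.013) would pass α = 0.05 on its own, but fails the Bonferroni-corrected threshold of 0.05/300.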
The Power of a configuration is the percentage of its 100 data sets in which the true positives were found. If the Power is 100%, the disease-affected SNPs were found in every data set of the configuration.
The Type I Error Rate is calculated similarly to Power. For each configuration, the Type I
Error Rate is the percentage of data sets that contain false positives out of the 100 data sets in the
configuration. If the Type I Error Rate is 100%, all the data sets contain at least 1 false positive, i.e., at least 1 SNP or interaction of SNPs that is considered statistically significant but is not related to the disease.
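Aggregating the two measures into per-configuration percentages can be sketched as follows; the per-data-set booleans are invented stand-ins for real algorithm output:

```python
# Power and Type I Error Rate over one configuration's data sets.
# Each entry: (disease SNP pair detected, any significant false positive).
# The values below are invented for illustration.
results = [
    (True, False),
    (True, True),
    (False, True),
    (True, False),
]

power = 100 * sum(found for found, _ in results) / len(results)
type1_error_rate = 100 * sum(fp for _, fp in results) / len(results)
print(power, type1_error_rate)  # percentages: 75.0 50.0
```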
Scalability is evaluated in three different ways: Running Time; CPU Usage; and Memory Usage. Each of these measures is calculated for each data set and then averaged over each configuration. Running Time is measured in seconds, CPU Usage in percentage, and Memory Usage in MBytes. All these measures are recorded from the moment
the algorithm is started until it has finished.
For these experiments, the Data Mining process selected is CRISP-DM. The scripts used to
run each algorithm are written in the Unix shell Bash. Each algorithm was implemented in a
specific language. For the comparison of results, the R language is used in the statistical
significance test, selecting only the relevant results.
For each allele frequency configuration, a different SNP pair is used, choosing the SNPs that
are closest to the desired minor allele frequency. The SNPs selected according to their minor allele
frequency (MAF) are as follows:
• MAF 0.01 - SNP112 (0.01329) and SNP267 (0.010001)
• MAF 0.05 - SNP4 (0.05239) and SNP239 (0.048355)
• MAF 0.1 - SNP135 (0.09855) and SNP230 (0.089905)
• MAF 0.3 - SNP197 (0.274662) and SNP266 (0.31648)
• MAF 0.5 - SNP80 (0.439337) and SNP229 (0.50654)
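Selecting the SNPs closest to a desired minor allele frequency can be sketched as below; `closest_snps` is an illustrative helper, and the MAF values are taken from the list above:

```python
def closest_snps(maf_by_snp, target, k=2):
    """Pick the k SNPs whose minor allele frequency is closest to `target`."""
    return sorted(maf_by_snp, key=lambda s: abs(maf_by_snp[s] - target))[:k]

# Illustrative subset of the simulated SNPs and their observed MAFs
mafs = {"SNP112": 0.01329, "SNP267": 0.010001,
        "SNP4": 0.05239, "SNP239": 0.048355}
print(closest_snps(mafs, 0.01))  # → ['SNP267', 'SNP112']
```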
The penetrance tables are created differently for each allele frequency, altering the proportions
of each genotype in the disease SNPs.
Initially, the algorithms are tested on the most extreme configurations (minimum and maximum
MAF) to check that the results obtained are as expected. After this is confirmed, the algorithms
are executed for all configurations, according to the capabilities of each algorithm.
For each data set, a file containing the scalability measures is created. For each configuration,
a file summarizing all the data sets is created for Power, Scalability, and Type I Error Rate.
The computer used for these experiments ran the 64-bit Ubuntu 13.10 operating system, with
an Intel(R) Core(TM)2 Quad CPU Q6600 2.40 GHz processor and 8.00 GB of RAM. The
results were obtained using parallel processing.
Results and Discussion
Figures 3.1, 3.2, and 3.3 show the Power and Type I Error Results for each algorithm according
to each population size, while Figures 3.4, 3.5, and 3.6 display the results according to each
minor allele frequency. Not all the algorithms are used for each disease model, due to algorithm
limitations and properties, as discussed earlier (Figure 2.11). Further results, relating to
different parameters, can be seen in Lab Note 9 of these experiments [PC14e]. The lab notes are
available in the appendices.
For epistasis detection (Figure 3.1) by population size, in data sets with 500 individuals (a),
no algorithm has a Power above the Type I Error Rate, which is as high as 14%. The Power for
almost all algorithms is 0%, with the exception of BOOST, which is 1%. MBMDR and SNPRuler
have 0% Type I Error Rate, while Screen and Clean has the highest error rate of all algorithms,
closely followed by BOOST and SNPHarvester. For 1000 individuals (b), almost all algorithms
have Power higher than Type I Error Rate, with the exception of Screen and Clean. The algorithm
with the highest Power is BOOST, with 41%, with SNPHarvester and TEAM behind, both with
21%. MBMDR and Screen and Clean have very little Power. Screen and Clean has the highest
Type I Error Rate, with 16%, followed by SNPHarvester, BOOST and TEAM. Both MBMDR
and SNPRuler have 0% error rate. In the data sets with 2000 individuals (c), there are several
algorithms with high Power. BOOST has the best Power with 94%, closely followed by TEAM
with 92% and SNPHarvester with 85%. The worst algorithm by Power is Screen and Clean with
6%. Type I Error Rate is relatively low overall, with TEAM having the highest value with 28%.
Screen and Clean, BOOST, and SNPHarvester are close behind, with 21%, 21%, and 19%, respectively.
The algorithm with the lowest error rate is MBMDR.
In main effect detection (Figure 3.2), for 500 individuals (a), nearly all algorithms present
0% Power, with the exception of BOOST with 2%. Type I Error Rate is high, with Screen and
Clean having the highest value with 21%, followed by BOOST and SNPHarvester with 12% and
11% respectively. BEAM3 has the lowest error rate with 9%. For data sets with 1000 individuals
(b), the algorithm with the highest Power is BOOST with 43%, with SNPHarvester and BEAM3
close behind at 38% and 32%, respectively. Screen and Clean has 0% Power. The Type I Error Rate
is similar across all algorithms, with BOOST and Screen and Clean slightly ahead, with 23%.
For 2000 individuals (c), BOOST has the highest Power with 97%, while BEAM3 and SNPHarvester
have 93%. Screen and Clean has 39% Power, but also the lowest error rate, 36%, while
SNPHarvester has the highest error rate, with 79%.
The data sets with the full effect disease model (Figure 3.3), for 500 individuals (a), show that
BOOST has 1% Power and the other algorithms have 0%. The algorithm with the highest Type I
Error Rate is Screen and Clean, with 19%, and SNPHarvester has the lowest, with 9%. For 1000
individuals (b), BOOST has the most Power with 42%, SNPHarvester has 32% and Screen and
Figure 3.1: Epistasis detection by population size. The data sets have a 0.1 minor allele frequency, a 2.0 odds ratio, and 0.02 prevalence. Each bar corresponds to the Power and Type I Error Rate of BOOST, MBMDR, Screen and Clean, SNPHarvester, SNPRuler, and TEAM, in data sets with 500 individuals (a), 1000 individuals (b), and 2000 individuals (c). [Bar charts not reproduced.]
Clean remains at 0%. Type I Error Rates are higher for BOOST, with 38%. Screen and Clean
and SNPHarvester have 28% and 27%, respectively. For 2000 individuals (c), the best algorithm
is, once again, BOOST with 98% Power, and SNPRuler closely behind with 95%. Screen and
Clean has 0% Power, but also has the lowest error rate, with 33%. BOOST has 81% error rate,
and SNPHarvester has 79% error rate.
In evaluating data set results by minor allele frequency, for epistasis detection (Figure 3.4),
there is 0% Power for all algorithms at 0.01 allele frequency (a). The Type I Error Rate is as
high as 19% for Screen and Clean. The algorithms with the lowest error rate are MBMDR and
SNPRuler, with 0%. At 0.05 allele frequency (b), TEAM has the highest Power, with 43%, and
all other algorithms have a Power lower than 20%. TEAM also has the highest error rate, with
37%, and SNPRuler is the algorithm with the lowest error rate, with only 1%. For data sets with
0.1 minor allele frequency (c), BOOST and TEAM are the best algorithms with 94% and 92%
Power, respectively. Screen and Clean is the algorithm with the lowest Power, at 3%. MBMDR
has the lowest Type I Error Rate, while TEAM has the highest error rate, with 28%. For 0.3 allele
Figure 3.2: Main effect detection by population size. The data sets have a 0.1 minor allele frequency, a 2.0 odds ratio, and 0.02 prevalence. Each bar corresponds to the Power and Type I Error Rate of BEAM3, BOOST, Screen and Clean, and SNPHarvester, in data sets with 500 individuals (a), 1000 individuals (b), and 2000 individuals (c). [Bar charts not reproduced.]
frequency (d), BOOST has 100% Power, with TEAM close behind at 92%, and SNPHarvester
with 85%. Screen and Clean has the lowest Power, at 2%. SNPRuler has the lowest error rate,
with only 2%, while SNPHarvester has the highest, with 19%. Finally, for 0.5 allele frequency
(e), algorithms BOOST, TEAM and SNPRuler have the highest Power, with 100%, 95% and 92%,
respectively. Once again, Screen and Clean has the lowest Power with 0%. SNPHarvester has the
highest error rate, with 11%, and MBMDR together with SNPRuler have the lowest, with 0%.
For main effect detection (Figure 3.5), in 0.01 allele frequency (a), Power is 0% in all algo-
rithms. Type I Error Rate is highest in Screen and Clean, with 13%. In 0.05 allele frequency (b),
Power is nearly 0% for all algorithms except BOOST, with 14%. SNPHarvester has the highest
Type I Error Rate, with 24%, followed by Screen and Clean, with 22%. BOOST has the lowest
error rate with 11%. For 0.1 allele frequency (c), the most powerful algorithm is BOOST (97%),
closely followed by BEAM3 (92%) and SNPHarvester (92%). SNPHarvester has the highest error
rate with 79%, and Screen and Clean has the lowest, with 36%. In data sets with 0.3 allele fre-
quency (d), all algorithms have 100% Power, with the exception of Screen and Clean with only
58%. All algorithms have a 100% Type I Error Rate, except Screen and Clean with 38%. The results are
Figure 3.3: Full effect detection by population size. The data sets have a 0.1 minor allele frequency, a 2.0 odds ratio, and 0.02 prevalence. Each bar corresponds to the Power and Type I Error Rate of BOOST, Screen and Clean, and SNPHarvester, in data sets with 500 individuals (a), 1000 individuals (b), and 2000 individuals (c). [Bar charts not reproduced.]
the same in 0.5 minor allele frequency (e), with the exception for Screen and Clean, with 62%
Power and 48% Type I Error Rate.
For full effect detection (Figure 3.6), in 0.01 allele frequency (a), there is 0% Power in all
algorithms, with Screen and Clean having the highest Type I Error Rate (14%), and SNPHarvester
having the lowest (1%). For 0.05 minor allele frequency (b), only BOOST has any Power, with
15%. Screen and Clean has the highest error rate with 21%, followed by SNPHarvester with 20%
and BOOST with 17%. For 0.1 (c), BOOST and SNPHarvester have a high Power percentage,
with 98% and 95%, respectively. Screen and Clean once again has 0% Power. However, Screen
and Clean has the lowest error rate (33%), while BOOST has the highest (81%), followed by
SNPHarvester (79%). At the 0.3 (d) and 0.5 (e) minor allele frequencies, both BOOST and SNPHarvester
have the same values, with 100% Power and Type I Error Rate. Screen and Clean has a
Power of 40% and 91% and a Type I Error Rate of 68% and 84% for the 0.3 and 0.5 allele frequencies, respectively.
Table 3.2 contains the scalability analysis. Screen and Clean is revealed to be the slowest
algorithm, followed by SNPHarvester. TEAM and BEAM3 have similar values, with SNPRuler
having close to half of their running time. BOOST is the fastest algorithm, with less than 1 second
Figure 3.4: Epistasis detection by minor allele frequency. The data sets have 2000 individuals, a 2.0 odds ratio, and 0.02 prevalence. Each bar corresponds to the Power and Type I Error Rate of BOOST, MBMDR, Screen and Clean, SNPHarvester, SNPRuler, and TEAM, in data sets with 0.01 (a), 0.05 (b), 0.1 (c), 0.3 (d), and 0.5 (e) allele frequencies. [Bar charts not reproduced.]
of running time in the biggest data sets. Screen and Clean also has the biggest increase in running
time, followed by SNPHarvester. SNPRuler is the most expensive algorithm in CPU usage, with
a usage above 100%, which means that the algorithm uses more than one core to process
each data set. In memory usage, SNPRuler again has the highest consumption, closely followed
by TEAM, Screen and Clean, SNPHarvester, BEAM3, and finally BOOST far behind.
Figure 3.5: Main effect detection by minor allele frequency. The data sets have 2000 individuals, a 2.0 odds ratio, and 0.02 prevalence. Each bar corresponds to the Power and Type I Error Rate of BEAM3, BOOST, Screen and Clean, and SNPHarvester, in data sets with 0.01 (a), 0.05 (b), 0.1 (c), 0.3 (d), and 0.5 (e) allele frequencies. [Bar charts not reproduced.]
Figure 3.6: Full effect detection by minor allele frequency. The data sets have 2000 individuals, a 2.0 odds ratio, and 0.02 prevalence. Each bar corresponds to the Power and Type I Error Rate of BOOST, Screen and Clean, and SNPHarvester, in data sets with 0.01 (a), 0.05 (b), 0.1 (c), 0.3 (d), and 0.5 (e) allele frequencies. [Bar charts not reproduced.]
              Running Time (s)       CPU Usage (%)          Memory Usage (MB)
              500    1000   2000     500    1000   2000     500    1000   2000
BEAM3         4.9    7      8        87.8   96.3   95.5     4      4.3    5.8
BOOST         0.16   0.22   0.34     95.7   98.79  97.87    0.98   1      1.2
MBMDR*        −      −      −        −      −      −        −      −      −
SnC           8.05   18.65  34.65    75.7   98.99  77.25    129.8  137.2  152.5
SNPHarvester  9.29   25.89  33       102.1  86.5   101.6    68.35  71.3   76.86
SNPRuler      2.7    3.09   4.1      130.2  141.9  156.28   312.7  316    320.2
TEAM          3.28   5.28   9.81     66.99  69.71  74.75    162.7  176    228.1

Table 3.2: Scalability results: average running time, CPU usage, and memory usage by data set population size. The BOOST, Screen and Clean, and SNPHarvester values relate to full effect detection, TEAM and SNPRuler to epistasis detection, and BEAM3 to main effect detection. *MBMDR has no scalability results because they were obtained on computers with different hardware from all other results; its average running time per data set exceeded 3600 seconds. The data sets have a 0.5 minor allele frequency, a 2.0 odds ratio, and 0.02 prevalence.
3.4 Chapter Conclusions
The results show that BOOST is the best algorithm overall in terms of Power, but has a high Type
I Error Rate. SNPRuler has a low Type I Error Rate, but not very high Power and only works with
epistasis detection. Screen and Clean has very low Power in general, but has a relatively low error
rate, especially in data sets with a high number of individuals or a high minor allele frequency, in
main effect or full effect disease models. BEAM3 has high Power and slightly lower error rate than
BOOST, but only works with main effect. SNPHarvester has low Power, but also low Type I Error
Rate overall. MBMDR has very low Type I Error Rate with high Power in certain configurations,
but only works with epistasis and has a very high running time. TEAM has very high Power and
low Type I Error Rate, with the exception of certain configurations, particularly of lower number
of individuals and lower minor allele frequency. However, it only works for epistasis detection.
BOOST is the most scalable algorithm, followed by SNPRuler and BEAM3. This is important
for the next stage of the experiments, with an ensemble approach. Based on the data obtained,
we can conclude that some of the algorithms used would not be useful in an ensemble approach,
either because of their scalability, or because they would not add Power without compromising
Type I Error Rate.
These experiments show results similar to those of previous studies [WYY+10b, SZS+11];
however, they cover a much wider variety of data set types and algorithms than those
studies. These experiments can be viewed from different perspectives, using different
parameters, and the results can be analyzed according to their Power, Type I Error Rate, and Scal-
ability. Furthermore, the results obtained are available in the lab notes. The lab notes and the
created scripts are available at https://github.com/ei09045/EpistasisStudy.
Chapter 4
Ensemble Approach
In this chapter, a new Ensemble approach is discussed. This new approach uses algorithms from
the previous empirical study to improve results.
4.1 Introduction
The results from the empirical study of existing epistasis detection algorithms showed unique
properties in each algorithm. Considering Power and Type I Error Rate, the purpose of this stage
is to create a new approach that maintains the Power of the best algorithms and lowers the Type I
Error Rate associated with them, which is usually high.
For this purpose, a new approach joining algorithms was developed. The algorithms are:
BEAM 3.0 [Zha12]; BOOST [WYY+10a]; SNPRuler [WYY+10b]; SNPHarvester [YHW+09];
and TEAM [ZHZW10]. These algorithms were selected based on their Power to Type I Error Rate
ratio, and their scalability. BOOST is used for both epistasis detection and main effect detection,
which means that a total of three algorithms is used for each detection type, with the exception
of full effect, which uses all algorithms.
In Section 4.2 the experimental procedure for stage 2 is discussed, involving the process of
selecting and using a voting system for the Ensemble approach. This section also shows the
results obtained from the Ensemble approach and the comparison between the existing algorithms.
Finally, Section 4.3 shows the conclusions from the discussion of the results.
4.2 Experiments
Experimental Procedure
For these experiments, the same data sets discussed in the previous chapter were used. The same
evaluation measures are used to evaluate the new results. The new approach is an Ensemble,
where each algorithm votes with its relevant SNPs and SNP pairs, and a unified system then
chooses the relevant main effect SNPs and epistasis interactions.
For this purpose the algorithms selected for main effect detection are BEAM 3.0, BOOST,
and SNPHarvester. For epistasis detection the algorithms selected are BOOST, SNPRuler, and
TEAM. The Ensemble approach collects the relevant results reported by each algorithm and
selects SNPs and pairs of SNPs that are common to at least two algorithms. The algorithms
selected for main effect only work with single SNPs, and the algorithms selected for epistasis
detection only work with SNP pairs. BOOST works for both models, so the results enter the
voting stage of main effect and epistasis detection. The results obtained from each algorithm are
converted into a unified format, so they can be interpreted for the voting stage. In the full effect
detection, both main effect and epistasis detection algorithms intervene in the voting stage. This
helps to reduce the Type I Error Rate while maintaining Power: an interaction that is truly related
to the phenotype will be reported by most algorithms, while unrelated interactions will not.
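The voting scheme just described can be sketched as follows; `ensemble_vote` and the example result sets are illustrative, not the actual implementation:

```python
def ensemble_vote(results_by_algorithm, min_votes=2):
    """Vote over the significant findings of several algorithms.
    `results_by_algorithm` maps an algorithm name to the set of SNPs or
    SNP pairs it reported; a finding is kept when at least `min_votes`
    algorithms agree on it."""
    votes = {}
    for findings in results_by_algorithm.values():
        for hit in findings:
            votes[hit] = votes.get(hit, 0) + 1
    return {hit for hit, count in votes.items() if count >= min_votes}

# Illustrative epistasis results: only the pair reported by two algorithms survives
reported = {
    "BOOST":    {("SNP135", "SNP230"), ("SNP12", "SNP77")},
    "SNPRuler": {("SNP135", "SNP230")},
    "TEAM":     {("SNP3", "SNP9")},
}
print(ensemble_vote(reported))  # → {('SNP135', 'SNP230')}
```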
The computer used for these experiments ran the 64-bit Debian testing (jessie) operating
system, with an Intel(R) Core(TM)2 Quad CPU Q9400 2.66 GHz processor and 16.00 GB of
RAM.
Results and Discussion
The results obtained from previous experiments are used to compare the performance of existing
algorithms with the new ensemble approach. Figures 4.1, 4.2, and 4.3 show the Power and Type
I Error Results for each algorithm according to each population size; Figures 4.4, 4.5, and 4.6
display the results according to each minor allele frequency; Figures 4.7, 4.8, and 4.9 show the
results according to each odds ratio tested; and Figures 4.10, 4.11, and 4.12 contain the results
regarding both prevalence values.
The results in epistasis detection by population size for data sets with 500 individuals (a) show
0% Power but also 0% Type I Error Rate for the Ensemble approach. BOOST has the most Power
with 1% but has 7% Type I Error Rate. In data sets with 1000 individuals (b), the Ensemble
approach has 23% Power and 0% error rate, while the algorithm with the most Power is BOOST
with 41% and 5% error rate. For 2000 individuals (c) in data sets, the Ensemble has 92% Power
and 15% error rate, while BOOST has 94% Power but 21% error rate.
In main effect detection, for 500 individuals (a), the Ensemble has 0% Power and an 11% Type I Error Rate.
BOOST has 2% Power but 12% error rate, while BEAM3 has only 9% error rate. For 1000
individuals (b), Ensemble has 37% Power and 19% error rate. BOOST has 43% Power and 23%
error rate while BEAM3 has the least error rate with 18% but has only 32% Power. Finally in data
sets with 2000 individuals (c), Ensemble has 92% Power and a Type I Error Rate of 71%. BOOST has
the most Power with 97% and has 74% error rate.
Full effect detection results show 0% Power and 11% Type I Error Rate in the Ensemble
approach for data sets with 500 individuals (a). SNPHarvester has the lowest error rate, with 9%. For
data sets with 1000 individuals (b), Ensemble has 0% Power and an 11% error rate, but SNPHarvester
has a lower error rate, 9%. For 2000 individuals (c), the Ensemble approach has 95% Power and
a 75% Type I Error Rate, having the highest Power and the lowest error rate.
Figure 4.1: Epistasis detection by population size, with a 0.1 minor allele frequency, a 2.0 odds ratio, and 0.02 prevalence. Each bar corresponds to the Power and Type I Error Rate of BOOST, SNPHarvester, SNPRuler, TEAM, and Ensemble, in data sets with 500 individuals (a), 1000 individuals (b), and 2000 individuals (c). [Bar charts not reproduced.]
Figure 4.2: Main effect detection by population size, with a 0.1 minor allele frequency, a 2.0 odds ratio, and 0.02 prevalence. Each bar corresponds to the Power and Type I Error Rate of BEAM3, BOOST, SNPHarvester, and Ensemble, in data sets with 500 individuals (a), 1000 individuals (b), and 2000 individuals (c). [Bar charts not reproduced.]
Figure 4.3: Full effect detection by population size, with a 0.1 minor allele frequency, a 2.0 odds ratio, and 0.02 prevalence. Each bar corresponds to the Power and Type I Error Rate of BOOST, SNPHarvester, and Ensemble, in data sets with 500 individuals (a), 1000 individuals (b), and 2000 individuals (c). [Bar charts not reproduced.]
In minor allele frequency analysis, for epistasis detection, Ensemble shows 0% Power and
0% Type I Error Rate in data sets with 0.01 allele frequency (a). For 0.05 allele frequency (b),
the Ensemble approach has 6% Power and a 1% error rate, while TEAM has 43% Power and a 37% error
rate. In 0.1 allele frequency (c), Ensemble has 92% Power and 15% error rate. BOOST has 94%
Power but 21% error rate. For 0.3 allele frequency (d), Ensemble has 99% Power and 1% error
rate, while BOOST has 100% Power but 6% error rate. Finally in 0.5 minor allele frequency (e),
Ensemble has 100% Power and 0% Type I Error Rate.
Figure 4.4: Epistasis detection by minor allele frequency, with 2000 individuals, a 2.0 odds ratio, and 0.02 prevalence. Each bar corresponds to the Power and Type I Error Rate of BOOST, SNPHarvester, SNPRuler, TEAM, and Ensemble, in data sets with 0.01 (a), 0.05 (b), 0.1 (c), 0.3 (d), and 0.5 (e) allele frequencies. [Bar charts not reproduced.]
In main effect detection, for 0.01 allele frequency (a), Ensemble has 0% Power and a 1% error rate.
In 0.05 allele frequency (b), Ensemble has 1% Power and 20% Type I Error Rate. BOOST is the
best algorithm in this setting, with 14% Power and 11% Type I Error Rate. For 0.1 minor allele
frequency (c), Ensemble has 92% Power and 71% error rate. BOOST has 97% Power but has 74%
error rate. BEAM3 has the same Power as Ensemble, but a slightly lower error rate, with 67%. For 0.3
(d) and 0.5 (e) allele frequency, all the approaches have 100% Power and Type I error Rate.
Figure 4.5: Main effect detection by minor allele frequency, with 2000 individuals, a 2.0 odds ratio, and 0.02 prevalence. Each bar corresponds to the Power and Type I Error Rate of BEAM3, BOOST, SNPHarvester, and Ensemble, in data sets with 0.01 (a), 0.05 (b), 0.1 (c), 0.3 (d), and 0.5 (e) allele frequencies. [Bar charts not reproduced.]
In full effect detection, for 0.01 allele frequency (a), no algorithm has any Power, but Ensemble
has the lowest error rate, with only 1%. For 0.05 allele frequency (b), Ensemble has the lowest error
rate, with 16%, but BOOST has the most Power, with 15%. For 0.1 (c), the Ensemble approach has 95%
Power and 75% error rate. BOOST has 98% Power but 81% error rate. Finally, all algorithms have
100% Power and Type I Error Rate for 0.3 (d) and 0.5 (e) minor allele frequencies.
Analysing the results by odds ratio, for epistasis detection, at 1.1 odds ratio (a), Ensemble
has 1% Power and a 0% error rate. BOOST has 27% Power, but a 5% Type I Error Rate. In 1.5 odds
ratio (b), Ensemble has 84% Power and 3% Type I Error Rate, while BOOST has 95% Power, but
Figure 4.6: Full effect detection by minor allele frequency, with 2000 individuals, a 2.0 odds ratio, and 0.02 prevalence. Each bar corresponds to the Power and Type I Error Rate of BOOST, SNPHarvester, and Ensemble, in data sets with 0.01 (a), 0.05 (b), 0.1 (c), 0.3 (d), and 0.5 (e) allele frequencies. [Bar charts not reproduced.]
9% error rate. For 2.0 odds ratio (c), Ensemble has 92% Power and 15% error rate. BOOST has
slightly more Power, with 94%, but once again has higher error rate, with 21%.
In main effect detection, at 1.1 odds ratio (a), all algorithms have the same Power, 2%,
but BEAM3 has a lower error rate, 8%, while Ensemble has 10%. At 1.5 odds ratio (b), Ensemble
has the most Power, with 25%, and a 17% error rate, but BEAM3, with 16%, has the lowest
error rate. For 2.0 odds ratio (c), Ensemble has 92% Power and a 71% error rate.
BEAM3 has the same Power but a lower error rate, 67%, and BOOST has more Power (97%)
but also a higher Type I Error Rate (74%).
Finally, for full effect detection by odds ratio, at 1.1 odds ratio (a), Ensemble has the lowest
Power, with 3%, but also the lowest error rate, with 7%. SNPHarvester has the most Power at
10%, and has 9% error rate. At 1.5 odds ratio (b), BOOST is the algorithm with the highest Power,
Figure 4.7: Epistasis detection by odds ratio, with a 0.1 minor allele frequency, 2000 individuals, and 0.02 prevalence. Each bar corresponds to the Power and Type I Error Rate of BOOST, SNPHarvester, SNPRuler, TEAM, and Ensemble, in data sets with 1.1 (a), 1.5 (b), and 2.0 (c) odds ratios. [Bar charts not reproduced.]
Figure 4.8: Main effect detection by odds ratio, with a 0.1 minor allele frequency, 2000 individuals, and 0.02 prevalence. Each bar corresponds to the Power and Type I Error Rate of BEAM3, BOOST, SNPHarvester, and Ensemble, in data sets with 1.1 (a), 1.5 (b), and 2.0 (c) odds ratios. [Bar charts not reproduced.]
with 72%, and the highest Type I Error Rate, with 51%. Ensemble has the lowest Power, with 65%, but
also the lowest error rate, with 40%. For 2.0 odds ratio (c), Ensemble has the lowest error rate,
75%, and 95% Power, while BOOST has 98% Power but an 81% error rate.
Figure 4.9: Full effect detection by odds ratio, with a 0.1 minor allele frequency, 2000 individuals, and 0.02 prevalence. Each bar corresponds to the Power and Type I Error Rate of BOOST, SNPHarvester, and Ensemble, in data sets with 1.1 (a), 1.5 (b), and 2.0 (c) odds ratios. [Bar charts not reproduced.]
Looking at the results of the data sets by disease prevalence, in epistasis detection with 0.0001 prevalence (a), Ensemble has 86% Power and a 2% error rate. SNPRuler is the only algorithm with a lower error rate, at 0%, but it has much lower Power. BOOST has 91% Power but a 7% error rate. For 0.02 prevalence (b), the Ensemble approach has 92% Power and a 15% error rate. SNPRuler has an 8% error rate but only 32% Power. BOOST has 94% Power and a Type I Error Rate of 21%.
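For reference, the Power and Type I Error Rate percentages quoted in this chapter can be computed from per-data-set detection results. The sketch below is an illustration, not the evaluation code used in the experiments; it assumes Power is the fraction of simulated data sets in which the ground-truth SNP pair is reported, and Type I Error Rate the fraction in which at least one spurious pair is reported. All names are hypothetical.

```python
def power_and_type1(results, truth):
    """results: one set of reported SNP pairs per simulated data set.
    truth: the ground-truth SNP pair embedded in every data set.
    Returns (Power %, Type I Error Rate %)."""
    n = len(results)
    # Power: data sets where the true pair was found.
    hits = sum(1 for reported in results if truth in reported)
    # Type I error: data sets reporting at least one false pair.
    false_alarms = sum(1 for reported in results if reported - {truth})
    return 100.0 * hits / n, 100.0 * false_alarms / n

# Toy example: 4 data sets, ground truth is the pair (3, 17).
truth = (3, 17)
results = [{(3, 17)}, {(3, 17), (5, 9)}, {(5, 9)}, {(3, 17)}]
print(power_and_type1(results, truth))  # (75.0, 50.0)
```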
Regarding the main effect results, for 0.0001 prevalence (a), Ensemble has 98% Power and a 77% error rate. BEAM3 is the best in this configuration, with 99% Power and a 76% error rate. For 0.02 prevalence (b), Ensemble has 92% Power and a 71% error rate. BEAM3 is the algorithm with the lowest error rate, at 67%, with the same Power as Ensemble. BOOST has 97% Power and a 74% error rate.
For the full effect analysis by prevalence, with 0.0001 prevalence (a), Ensemble has 99% Power and 99%
Figure 4.10: These results correspond to epistasis detection by prevalence, with a 0.1 minor allele frequency, 2000 individuals, and 2.0 odds ratio, showing the Power and Type I Error Rate of BOOST, SNPHarvester, SNPRuler, TEAM, and Ensemble. Each subfigure contains the values for all algorithms in data sets with 0.0001 prevalence (a), and 0.02 prevalence (b).
Figure 4.11: These results correspond to main effect detection by prevalence, with a 0.1 minor allele frequency, 2000 individuals, and 2.0 odds ratio, showing the Power and Type I Error Rate of BEAM3, BOOST, SNPHarvester, and Ensemble. Each subfigure contains the values for all algorithms in data sets with 0.0001 prevalence (a), and 0.02 prevalence (b).
Type I Error Rate, while SNPHarvester has the same error rate but 100% Power. For 0.02 prevalence (b), Ensemble is the best algorithm, with 95% Power and a 75% error rate.
Figure 4.12: These results correspond to full effect detection by prevalence, with a 0.1 minor allele frequency, 2000 individuals, and 2.0 odds ratio, showing the Power and Type I Error Rate of BOOST, SNPHarvester, and Ensemble. Each subfigure contains the values for all algorithms in data sets with 0.0001 prevalence (a), and 0.02 prevalence (b).
In order to evaluate the scalability of the Ensemble algorithm in relation to the other algorithms, scalability measures were taken both from each algorithm individually, while being executed within the Ensemble approach, and from the overall Ensemble algorithm, including the voting stage. Tables 4.1, 4.2, and 4.3 show the total running time of all the algorithms, the running time of the Ensemble approach (with the voting stage), and the difference between them. The average CPU usage and memory usage throughout the Ensemble run are also recorded. The full effect disease model data sets were chosen because they are likely to produce the most statistically significant results per data set, which demands the longest running time and the most memory.
The results show an increase in the difference between the total running time of all algorithms and the Ensemble running time, and this difference clearly grows with the data set size. However, for epistasis and main effect, the added time does not grow as a share of the total, which means that the difference remains almost the same percentage of the Ensemble running time, independently of the data set size. In epistasis detection, there is a near 30% increase relative to the total running time of the algorithms, but the difference in seconds is smaller than for main effect and full effect, which show near 14.5% and 9.6% increases at 2000 individuals. There is no relation between CPU usage and data set size, but there is a small increase in memory usage with the data set size, especially in full effect detection.
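These relative overhead figures follow directly from the Difference and Total rows of Tables 4.1 to 4.3 at 2000 individuals:

```python
# (Difference, Total) running times in seconds at 2000 individuals,
# taken from Tables 4.1-4.3.
overhead = {
    "epistasis":   (5.3, 18.6),
    "main effect": (8.0, 55.0),
    "full effect": (22.7, 236.7),
}
for model, (diff, total) in overhead.items():
    print(f"{model}: {100 * diff / total:.1f}% overhead")
# epistasis: 28.5% overhead
# main effect: 14.5% overhead
# full effect: 9.6% overhead
```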
              Running Time (s)    CPU Usage (%)          Memory Usage (MB)
              500   1000  2000    500    1000   2000     500    1000   2000
BEAM3         0.5   0.6   0.8     87.8   85.3   88       2.2    2.6    3.4
BOOST         0.1   0.2   0.3     98.6   96.3   96.9     1      1      1.1
SNPHarvester  1.9   3.1   6       119.9  108.2  104.7    52.2   60.2   78.3
SNPRuler      2.3   2.6   3.4     181.2  143.8  136.9    352.4  352.4  353.5
TEAM          2.7   4.6   8.1     99     98.6   98.8     162.7  177    228.1
Total*        7.5   11.1  18.6
Ensemble*     9.8   14.3  23.9    110.6  103    102.1    352.4  352.4  353.5
Difference*   2.3   3.2   5.3
Table 4.1: Scalability test containing the average running time, CPU usage, and memory usage by data set population size. The data sets have a minor allele frequency of 0.5, 2.0 odds ratio, and 0.02 prevalence, and the disease model is epistasis detection. *Total is the summed running time of all algorithms; Ensemble is the time for all algorithms in the Ensemble approach, including the voting stage; Difference is the running time increase between them. CPU usage and memory usage are averages along the process, so their totals are not relevant.
              Running Time (s)    CPU Usage (%)          Memory Usage (MB)
              500   1000  2000    500    1000   2000     500    1000   2000
BEAM3         2.9   4.1   4.6     94.5   94.6   97.3     2.9    3.4    4.2
BOOST         0.1   0.2   0.3     97.9   98.6   98.9     1      1      1.1
SNPHarvester  2.5   11.6  39.4    117.9  105.5  102      59.6   92.5   103.6
SNPRuler      2.3   2.3   3       168    147    160.9    352.2  352.2  354.3
TEAM          2.9   4.2   7.7     98.7   99     99       162.7  177    227.9
Total*        10.7  22.4  55
Ensemble*     13.1  26    63      89.8   88     77.3     349.2  352.2  354.3
Difference*   2.4   3.6   8
Table 4.2: Scalability test containing the average running time, CPU usage, and memory usage by data set population size. The data sets have a minor allele frequency of 0.5, 2.0 odds ratio, and 0.02 prevalence, and the disease model is main effect. *Total is the summed running time of all algorithms; Ensemble is the time for all algorithms in the Ensemble approach, including the voting stage; Difference is the running time increase between them. CPU usage and memory usage are averages along the process, so their totals are not relevant.
              Running Time (s)      CPU Usage (%)         Memory Usage (MB)
              500    1000   2000    500    1000   2000    500    1000   2000
BEAM3         110.8  90.9   196.7   99     90.9   99      37.3   29.8   106.2
BOOST         0.1    0.2    0.3     98.6   97.7   98.5    0.9    1      1.2
SNPHarvester  7.9    20.3   28.9    111    103.8  102.9   98.9   101.3  102.8
SNPRuler      2.3    2.3    2.8     197.7  149.9  154.8   337.4  304.7  353.7
TEAM          2.7    4.5    8       99     98.6   98.9    162.7  176.8  227.8
Total*        123.8  118.2  236.7
Ensemble*     126.8  126.3  259.4   95     79.2   74.6    337.4  304.7  353.7
Difference*   3      8.1    22.7
Table 4.3: Scalability test containing the average running time, CPU usage, and memory usage by data set population size. The data sets have a minor allele frequency of 0.5, 2.0 odds ratio, and 0.02 prevalence, and the disease model is full effect. *Total is the summed running time of all algorithms; Ensemble is the time for all algorithms in the Ensemble approach, including the voting stage; Difference is the running time increase between them. CPU usage and memory usage are averages along the process, so their totals are not relevant.
4.3 Chapter Conclusions
In this chapter, a new epistasis and main effect detection approach is discussed. This is an Ensemble approach, using 5 of the best algorithms from the empirical study presented in Chapter 3. The new Ensemble approach uses 3 algorithms to evaluate relevant epistatic interactions and 3 to evaluate relevant main effects of SNPs. If there is a majority in the voting stage, the SNP or SNP pair is selected as a relevant result.
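The voting stage described above can be illustrated with a short sketch. This is a hypothetical reconstruction, not the implementation used in the experiments: each algorithm contributes a set of candidate SNP pairs, and a pair is kept only if a strict majority of the algorithms report it.

```python
from collections import Counter

def majority_vote(candidate_sets):
    """Keep every SNP pair reported by a strict majority of the algorithms."""
    votes = Counter(pair for s in candidate_sets for pair in s)
    quorum = len(candidate_sets) // 2 + 1  # 2 out of 3 algorithms
    return {pair for pair, n in votes.items() if n >= quorum}

# Hypothetical outputs of the three epistasis detectors.
boost    = {(3, 17), (5, 9)}
snpruler = {(3, 17)}
team     = {(3, 17), (8, 21)}
print(majority_vote([boost, snpruler, team]))  # {(3, 17)}
```

Pairs reported by only one algorithm are discarded, which is what drives the reduction in Type I Error Rate at a small cost in Power.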
From the results obtained, we can see that, for epistasis detection, the Ensemble method has slightly less Power than the best algorithm, but it is the algorithm with the lowest Type I Error Rate, with the exception of SNPRuler in some configurations, although SNPRuler has much lower Power than the Ensemble method. In main effect detection, Ensemble is amongst the algorithms with the lowest Type I Error Rate, behind BEAM3 in some configurations, but it has consistently higher Power. BOOST has more Power than the Ensemble method, but also a higher error rate. For full effect, Ensemble is the algorithm with the lowest error rate, but it has less Power than BOOST in some configurations.
The goal of these experiments was to create a more efficient method that is able to find the ground truth SNPs related to the disease while reducing false positives. The Ensemble method fulfills these requirements. The scalability test shows that running time and memory usage grow with the size of the data. However, given that only the 5 most scalable algorithms were selected, and that resource and time consumption increase only steadily, it is easy to estimate the time and resources needed for big data sets, and given the Power and Type I Error Rate results, the Ensemble is far better than any single algorithm.
Chapter 5
Conclusions
This dissertation was created with the purpose of improving the detection of genes, specifically SNPs, that cause the expression of complex diseases. These diseases have a genetic basis that increases susceptibility. This means that, given an individual genotype, it is possible to assume a greater risk of developing a complex disease if it contains an SNP allele that is connected to the disease manifestation. It is therefore very important to find the genotype configurations that are connected to a given disease.
A state-of-the-art study was made of the recent work related to this dissertation. For the methodologies, the data selection and model creation algorithms were studied, together with the more generic algorithms on which the specific model creation algorithms are based. Some auxiliary algorithms, used in different stages by the specific model creation algorithms, were also studied. Furthermore, data analysis evaluation procedures and measures were studied; these represent, respectively, how the data is used for training and testing, and which statistically relevant measures are taken from the results. Data Mining processes and software were also studied. CRISP-DM was the procedure selected for the experiments. R was the software used in the experiments, except for algorithms implemented in other programming languages.
Initially, a group of algorithms was selected based on the state-of-the-art study. These algorithms are the most recent approaches that showed the most promise and compatibility between them, which made them easier to test. For this study, a large amount of data was generated using genomeSIMLA, to create different types of data sets and reveal a wide range of results from each algorithm. The data sets selected contained different disease model types, each type being compatible with a subgroup of the selected algorithms. The purpose of this initial empirical study was to find the best algorithms overall, according to their Power, Type I Error Rate, and scalability. The results showed that BOOST was the best algorithm overall in terms of Power, and that SNPRuler and MBMDR had the lowest error rates, although MBMDR scaled very poorly. BOOST was also the most scalable.
Out of the 7 algorithms selected for the comparison study, 3 were chosen for main effect detection and 3 for epistasis detection. In each stage, the results are chosen by a majority of these algorithms. For main effect detection, the algorithms chosen were: BEAM3; BOOST; and
SNPHarvester. For epistasis detection, the algorithms chosen were: BOOST; SNPRuler; and
TEAM. BOOST was selected twice because of its high Power results in both disease models.
A new methodology, the Ensemble, was then created with the selected algorithms. The purpose of this methodology is to maintain the Power of the individual algorithms while reducing the Type I Error Rate.
The observed results showed that the Type I Error Rates were lowered significantly, especially in epistasis detection. However, the Power of the Ensemble was slightly lower than BOOST's in some configurations. The scalability results showed some difference between the running time of the Ensemble and that of the selected algorithms, due to the voting stage, but this difference is stable and did not show a clear increase with the data set size, which means that only a small percentage of the overall running time is dedicated to the voting stage.
The main conclusion of the empirical study of the state-of-the-art algorithms is that, even if some algorithms show more dominant results, there is no absolute best algorithm for all types of diseases. These results used small artificial data sets, so for large realistic data sets the results also depend heavily on the scalability of each algorithm, which limits the types of configurations each algorithm can process. It is very difficult to obtain true positives without false positives in a viable period of time. For this purpose, the Ensemble approach was created to maintain the epistasis and main effect detections without such high numbers of false positives, but the necessary running time is greater than that of all the algorithms combined, which may not be viable for larger data sets. Nevertheless, the Ensemble was the most accurate of the approaches evaluated.
5.1 Contribution summary
The main contributions of this dissertation are as follows:
• Creation of a vast number of data sets with many different configurations, altering different parameters that affect the data and, consequently, the results of the algorithms. This allows for a more complete evaluation of a given algorithm.
• An empirical study of the 7 most recent epistasis and main effect detection algorithms,
across many different configurations.
• Creation and evaluation of a new methodology, Ensemble, based on existing state-of-the-art
algorithms. This new methodology was able to yield good results, while decreasing the
Type I Error Rate.
Appendix A
Glossary
A.1 Biology related terms
• Allele - One of the alternative forms of a gene found at a given position (locus) of a specific chromosome.
• Cell - The most basic structural unit of any organism that is capable of independent functioning.
• Chromosome - Genetic material stored in the nucleus of eukaryotic cells. Contains the hereditary information. In humans, there are 23 pairs of chromosomes.
• DNA - Deoxyribonucleic Acid. The molecule where the genetic material is stored. It is capable of self-replication and serves as the template for RNA synthesis.
• Dominant Gene - The allele that manifests itself in the phenotype when two different alleles are present in the genotype (heterozygous cases), as well as in homozygous cases. An allele may be dominant to one allele but recessive to another.
• Epistasis - The interaction between SNPs (or genes) in the expression of a phenotype. There are 3 main types of epistasis:

– Compositional Epistasis - The blocking of the effect of an allele by an allele at another locus.

– Functional Epistasis - The direct molecular interaction between the products of different genes.

– Statistical Epistasis - A deviation from additivity in the combined effects of alleles at different loci on the phenotype.
• Eukaryote - A cell with a defined nucleus, whose membrane separates the nucleus from the rest of the cell's contents.
• Gene - The basic unit of hereditary information in DNA or RNA. May undergo mutations.

• Genotype - The genetic constitution of a specific trait. A combination of alleles at corresponding loci on a pair of chromosomes that determines a trait.
• GWAS - Genome Wide Association Study. A study of the entire genome to find SNPs that are associated with specific traits — in this case, complex diseases.
• Heterozygous - A genotype composed of two different alleles.

• Homozygous - A genotype composed of two copies of the same allele.
• Locus - The position in the DNA where a given gene is located.
• Mutation - A change in a chromosome, either by the change of a gene or rearrangement of
a part of the chromosome.
• Nucleotide Bases - Different types of molecules that combine with each other to form DNA
and RNA.
– Adenine - Nucleotide base. Links to Thymine in DNA or Uracil in RNA.
– Cytosine - Nucleotide base. Links to Guanine.
– Guanine - Nucleotide base. Links to Cytosine.
– Thymine - Nucleotide base specific to DNA. Links to Adenine.
– Uracil - Nucleotide base specific to RNA. Links to Adenine.
• Phenotype - Expression of a specific trait. The manifestation of a certain gene or interaction
of various genes.
• Recessive Gene - An allele that does not manifest in the phenotype except in homozygous cases. May be recessive to one allele but dominant to another.
• Ribosome - Molecular machine used in protein synthesis from encoded RNA.
• RNA - Ribonucleic Acid. A molecule transcribed from DNA; messenger RNA is translated by the ribosome to express genes.
• SNP - Single nucleotide polymorphism. A variation at a single nucleotide position in the genome that occurs among individuals of the same species.
A.2 Data mining terms
• Association Rules - Relations between variables that are relevant in a significant number of
instances.
• Bayesian Networks - A probabilistic graphical model that represents the conditional dependencies between a set of random variables.
• Classification - A type of Data Mining prediction problem, where the predicted value is nominal. In specific cases, the predicted variable is binary.
• Clustering - Tries to find similarities between instances and joins them together in groups,
or clusters.
• Data Mining - A broad field of computer science concerned with identifying patterns in large data sets.
• Data Set - A collection of data composed of Attributes (columns) and Instances (rows).
Attributes are different variables of the recorded data and each Instance is a new member of
the data set.
• Machine Learning - An area of Artificial Intelligence in which algorithms and systems are developed to learn from data. It can be used to solve problems such as Clustering, Classification, Association Rules, Regression, etc.
• Model - Data Mining models are created by using a specific algorithm on a specific data
set. The result is a model specifically designed to predict or find relations in data, based on
the learned patterns of the data set used.
• Overfitting - When a model fits the training data so closely that it captures patterns that do not generalize to the full dataset or to future data.
• Pre-processing - The adaptation of data to fit certain criteria, either by transforming the data type or by reducing the dimensionality in attributes or instances.

– Filter methods - Pre-processing methods that select subsets of variables independently of the model creation algorithm.

– Wrapper methods - Methods that score subsets of variables using the predictive performance of a specific learning algorithm.
– Embedded Methods - Feature selection that occurs during the training of a given
model.
• Pruning - Cutting a connection in tree-based methods because the branch is not relevant to the final result. This increases the efficiency of the algorithm but can also wrongfully remove significant branches.
• Regression - A type of Data Mining prediction problem, where the predicted value is continuous.
• Supervised method - A method that learns from data in which the true value of the class variable is provided for each instance.
A.3 Lab Notes
Laboratory Note
Genetic Epistasis I - Materials and Methods
LN-1-2014
Ricardo Pinho and Rui Camacho
FEUP
Rua Dr. Roberto Frias, s/n, 4200-465 Porto, Portugal
Fax: (+351) 22 508 1440
e-mail: [email protected], [email protected]
www: http://www.fe.up.pt/∼ei09045, http://www.fe.up.pt/∼rcamacho
May 2014
Abstract
Based on literature results, we have selected 7 epistatic detection methods. The selected methods were empirically evaluated and compared using data generated with genomeSimla to simulate genome wide studies at a smaller scale. The simulated data includes 270 different configurations of datasets to cover a wide array of disease models. The selected algorithms are BEAM 3.0, BOOST, MBMDR, Screen and Clean, SNPRuler, SNPHarvester, and TEAM. These algorithms are evaluated according to their Power, scalability, and Type I Error Rate.
1 Introduction
The search for genetic predisposition to diseases has been going on for a long time. However, most early studies focused only on single-SNP analyses to determine disease predisposition. This is not the case in most complex diseases: generally, a complex disease involves thousands or millions of SNPs interacting with each other on a large scale. Due to the complexity of these interactions, the computational cost of epistasis detection was infeasible until recently.

The main objective of the following experiments is to empirically evaluate the following algorithms: BEAM 3.0 [Zha12], BOOST [WYY+10a], MBMDR [MVV11], Screen and Clean [WDR+10], SNPRuler [WYY+10b], SNPHarvester [YHW+09], and TEAM [ZHZW10]. These algorithms will be evaluated according to their Power, scalability, and Type I Error Rate. Each algorithm will be executed on many data sets that simulate diseases with many different parameters. These data sets are generated with genomeSimla, an open source data generator with many useful parameters for realistically simulating complex diseases.

The rest of this lab note is structured as follows. Section 2 gives a brief description of the data sets used in the experiments, including the application used to generate them. Section 3 describes the evaluation measures used in these experiments. Section 4 presents the experimental methodology. Section 5 summarizes the experiments that will be detailed in the next lab notes.
2 The Data sets for the Experiments
The data sets were created specifically for these experiments. The program used to generate them was genomeSimla [EBT+08]. In total, 270 different configurations were generated. Each configuration consists of 100 data sets, which means that each algorithm was executed 27,000 times.
Data Generation Application
The data generation application used for these experiments was genomeSimla. Due to its ability to evolve a population and achieve the desired allele frequencies, with any number of SNPs distributed over as many chromosomes as desired, genomeSimla is an adequate application for this kind of experiment. The evolution of the population can follow linear, exponential, or logistic growth, the last being the preferred model.

Aside from generating and evolving a population for as many iterations as required, genomeSimla allows the allele frequencies of the population to be observed and, based on those frequencies, the disease SNPs to be allocated, choosing a priori how many chromosomes and blocks of SNPs per chromosome each individual has. After the generation of the population according to the selected parameters, genomeSimla can then be used to generate data sets, sampling as many individuals as necessary from the population pool. The disease model can be further customized with the desired odds ratio, prevalence of the disease, and type of disease model. Based on these values, a penetrance table is generated for each desired parameter combination.
• Allele Frequency - The frequency of the minor allele of the disease SNPs.
• Population - Number of individuals sampled in the data set.
• Disease Model - Type of disease model: main effect, epistatic interaction, or full effect.
• Odds ratio - The strength of the relation between the disease SNPs: the odds of one disease SNP being present given the presence of the other disease SNP.
• Prevalence - The proportion of a population with the disease. Affects the number of cases and controls in a data set.
With this data, data sets can be generated using a configuration file, embedding the disease model into the desired alleles.
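The odds ratio parameter above can be illustrated with a standard 2×2 contingency-table computation. This is a generic sketch of how an odds ratio is obtained from case/control counts, not genomeSimla's internal parameterization, and the counts are invented for illustration.

```python
def odds_ratio(cases_with, cases_without, controls_with, controls_without):
    """Odds ratio: odds of carrying the risk genotype in cases
    divided by the same odds in controls."""
    return (cases_with / cases_without) / (controls_with / controls_without)

# Toy counts: 60 of 100 cases carry the risk genotype vs. 40 of 100 controls.
print(odds_ratio(60, 40, 40, 60))  # (60/40) / (40/60) = 2.25
```

An odds ratio of 1.0 means the genotype carries no disease risk, which is why the experiments use values of 1.1, 1.5, and 2.0 to represent increasingly strong effects.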
Data Set
The data sets were created using many different parameter combinations, to maximize the diversity of disease models and assess which algorithms are best for which scenarios. The data consist of simulated genotypes and phenotypes. For each individual, the attributes are the genotypes associated with each SNP, taking one of 3 states: homozygous dominant, heterozygous, and homozygous recessive. The label is binary, corresponding to an affected or unaffected individual.

In each data set, a total of 2 pairs of chromosomes were generated. The first chromosome contains 20 blocks of 10 SNPs and the second contains 10 blocks of 10 SNPs, for 300 SNPs in total. There are two disease alleles, placed on different chromosomes according to the desired allele frequency. The generated data sets contain 3 different numbers of individuals: 500, 1000, and 2000. The disease alleles take 5 different minor allele frequencies: 0.01, 0.05, 0.1, 0.3, and 0.5. Three different disease models are used: data sets with marginal effects and no epistatic relations; without marginal effects and with epistatic relations; and with both marginal effects and epistatic relations. The odds ratio associated with the two disease-related alleles is 1.1, 1.5, or 2.0. The prevalence of the disease is also configured to either 0.0001 or 0.02, which also influences the number of cases and controls.
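The genotype coding just described (three states per SNP, one row per individual) can be mimicked with a small simulator. This is an illustrative stand-in written for this note, not genomeSimla: it samples genotypes under Hardy–Weinberg proportions, which is an assumption, and it embeds no disease model.

```python
import random

def simulate_genotypes(n_individuals, mafs, seed=0):
    """Sample genotypes coded 0 (homozygous major), 1 (heterozygous),
    2 (homozygous minor) under Hardy-Weinberg proportions; one column per SNP."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_individuals):
        # Two independent allele draws per SNP; count the minor alleles.
        data.append([sum(rng.random() < maf for _ in range(2)) for maf in mafs])
    return data

# 500 individuals, 300 SNPs, minor allele frequency 0.3 (one of the MAFs used above).
genotypes = simulate_genotypes(500, [0.3] * 300)
```

A real genomeSimla run additionally evolves the population and applies a penetrance table to assign the binary case/control label.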
3 Evaluation Measures
The evaluation measures used for these experiments are Power, scalability, and Type I Error Rate.

Power is estimated, for each configuration, as the proportion of the 100 data sets in which the ground-truth interaction is correctly identified as the most significant SNP pair, where SNPs are ranked according to their importance to the phenotype using statistical hypothesis tests (the χ2 test).

Scalability is evaluated as the average execution time per data set within each configuration.

The Type I Error Rate is the proportion of the 100 data sets in each configuration in which non-disease-related SNPs are classified as a statistically relevant SNP pair according to the χ2 test.
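The χ2-based counting above can be sketched as follows. This is a minimal stand-alone illustration, assuming one 2×3 case/control-by-genotype contingency table per data set and the χ2 critical value for 2 degrees of freedom at α = 0.05; the counts in the example are invented.

```python
def chi2_statistic(table):
    """Pearson chi-square statistic for an r x c contingency table."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = rows[i] * cols[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat

CRIT_DF2_005 = 5.991  # chi-square critical value, 2 degrees of freedom, alpha = 0.05

def power(per_dataset_tables):
    """Fraction of data sets whose ground-truth table is significant at alpha = 0.05."""
    hits = sum(chi2_statistic(t) > CRIT_DF2_005 for t in per_dataset_tables)
    return hits / len(per_dataset_tables)

# Genotype (0/1/2) counts for cases (first row) and controls (second row).
table = [[30, 50, 20],
         [60, 30, 10]]
print(round(chi2_statistic(table), 2))  # → 18.33, well above 5.991
```

The Type I Error Rate is computed the same way, but counting significant results among SNP pairs that are *not* disease related.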
4 Experimental Methodology
Initially, the population for the data sets is generated using genomeSimla. The population is generated using a logistic growth rate, with an initial population of 10,000 and a maximum capacity of 1,000,000. The population chosen for the data sets is picked from reported generations, based on the allele frequencies desired for the experiment. Generation 1750 was selected for this purpose. Two SNPs are selected for each configuration, according to their minor allele frequency (MAF), as follows:
• MAF 0.01 - SNP112 and SNP267
• MAF 0.05 - SNP4 and SNP239
• MAF 0.1 - SNP135 and SNP230
• MAF 0.3 - SNP197 and SNP266
• MAF 0.5 - SNP80 and SNP229
The first 200 SNPs belong to chromosome 1, whereas the last 100 correspond to chromosome 2. The tables with all the allele frequencies can be seen in the annexes: Table 1 contains the chromosome 1 allele frequencies and Table 2 the chromosome 2 allele frequencies.

The penetrance tables are created from the allele frequencies in the population, following the configurations discussed earlier. The data sets are then created, using each unique configuration file to create 100 data sets, covering all the configurations mentioned before.

With the data sets generated, the algorithms are first tested on the most extreme configurations (minimum and maximum MAF) to check that the results are valid. Upon asserting the validity of the experiment, all algorithms are then executed on all configurations to analyze the potential of each algorithm. For each algorithm on each data set, a file containing the SNPs ranked according to statistical relevance is generated, together with information about the time and memory used in the execution of each test. The Power and Type I Error Rates are taken from the results that present a statistical relevance of α < 0.05.

The computer used for these experiments ran the 64-bit Ubuntu 13.10 operating system, with an Intel(R) Core(TM)2 Quad CPU Q6600 2.40GHz processor and 8 GB of RAM.
A Loci Frequencies
Chromosome 1
Table 1: Allele frequencies of the generated population for chromosome 1.
Label Freq Al1 Freq Al2 Map Dist. Position
RL0-1 0.704448 0.295552 0.0002523144 253
RL0-2 0.467747 0.532253 3.65488E-006 256
RL0-3 0.856627 0.143373 3.86582E-006 259
RL0-4 0.94761 0.05239 1.18175E-006 260
RL0-5 0.747191 0.252809 1.23056E-006 261
RL0-6 0.868644 0.131356 5.41858E-006 266
RL0-7 0.869881 0.130119 9.49181E-006 275
RL0-8 0.634084 0.365916 1.72337E-006 276
RL0-9 0.616899 0.383101 4.81936E-006 280
RL0-10 0.603205 0.396795 8.88582E-006 288
RL0-11 0.951322 0.048678 0.000118908 406
RL0-12 0.928004 0.071996 9.53558E-006 415
RL0-13 0.7257 0.2743 3.20447E-006 418
RL0-14 0.547945 0.452055 3.96875E-006 421
RL0-15 0.735312 0.264688 5.03938E-006 426
RL0-16 0.983344 0.016655 4.10188E-006 430
RL0-17 0.809402 0.190598 7.11582E-006 437
RL0-18 0.908173 0.091827 2.25726E-006 439
RL0-19 0.628892 0.371108 2.13406E-006 441
RL0-20 0.824863 0.175137 9.99491E-006 450
RL0-21 0.640543 0.359457 0.000229233 679
RL0-22 0.542639 0.457361 5.61457E-006 684
RL0-23 0.776321 0.223679 5.05623E-006 689
RL0-24 0.925422 0.074578 4.39722E-006 693
RL0-25 0.596454 0.403546 9.48707E-006 702
RL0-26 0.80071 0.19929 7.38516E-006 709
RL0-27 0.712163 0.287837 3.95139E-006 712
RL0-28 0.91426 0.08574 5.07943E-006 717
RL0-29 0.902589 0.097411 0.000006668 723
RL0-30 0.933652 0.066348 2.4885E-006 725
RL0-31 0.486126 0.513874 0.000296081 1021
RL0-32 0.553701 0.446299 8.33422E-006 1029
RL0-33 0.887238 0.112762 4.95048E-006 1033
RL0-34 0.93165 0.06835 5.32692E-006 1038
RL0-35 0.887583 0.112417 2.23131E-006 1040
RL0-36 0.824546 0.175454 5.40611E-006 1045
RL0-37 1 0 7.03837E-006 1052
RL0-38 0.817039 0.182961 1.44855E-006 1053
RL0-39 0.762831 0.237169 9.89044E-006 1062
RL0-40 0.623942 0.376058 2.53856E-006 1064
RL0-41 0.886716 0.113284 0.0003574 1421
RL0-42 0.603873 0.396127 5.73344E-006 1426
RL0-43 0.708144 0.291856 7.18489E-006 1433
RL0-44 0.722182 0.277818 6.17693E-006 1439
RL0-45 0.59756 0.40244 6.57155E-006 1445
RL0-46 0.810217 0.189783 3.25347E-006 1448
RL0-47 0.679944 0.320056 8.28564E-006 1456
RL0-48 0.467092 0.532908 4.45383E-006 1460
RL0-49 0.518637 0.481363 2.97358E-006 1462
RL0-50 0.918397 0.081603 4.58774E-006 1466
RL0-51 0.979136 0.020864 0.0003772277 1843
RL0-52 0.571337 0.428663 3.32175E-006 1846
RL0-53 0.615734 0.384266 2.23233E-006 1848
RL0-54 0.695586 0.304414 2.98606E-006 1850
RL0-55 0.660442 0.339558 4.02315E-006 1854
RL0-56 0.910148 0.089852 5.9643E-006 1859
RL0-57 0.445087 0.554913 6.82648E-006 1865
RL0-58 0.470733 0.529267 9.5693E-006 1874
RL0-59 0.858588 0.141412 0.000004337 1878
RL0-60 0.681468 0.318532 2.15272E-006 1880
RL0-61 0.870466 0.129534 0.0003573156 2237
RL0-62 0.646194 0.353806 7.20147E-006 2244
RL0-63 0.763207 0.236793 3.17006E-006 2247
RL0-64 0.931087 0.068913 8.01624E-006 2255
RL0-65 0.7151 0.2849 6.35415E-006 2261
RL0-66 0.670911 0.329089 2.38872E-006 2263
RL0-67 0.888122 0.111878 2.52589E-006 2265
RL0-68 0.694165 0.305835 0.000008364 2273
RL0-69 0.864311 0.135689 7.35972E-006 2280
RL0-70 0.838895 0.161105 2.46709E-006 2282
RL0-71 0.823928 0.176073 0.0001992617 2481
RL0-72 0.583947 0.416053 6.33832E-006 2487
RL0-73 0.841979 0.158021 9.79685E-006 2496
RL0-74 0.6003 0.3997 7.07911E-006 2503
RL0-75 0.892639 0.107361 5.16523E-006 2508
RL0-76 0.761561 0.238439 2.85138E-006 2510
RL0-77 0.900447 0.099553 1.53824E-006 2511
RL0-78 0.599257 0.400743 3.89272E-006 2514
RL0-79 0.972086 0.027914 6.53018E-006 2520
RL0-80 0.560663 0.439337 8.62124E-006 2528
RL0-81 0.554206 0.445794 0.000199997 2727
RL0-82 0.93403 0.06597 8.61757E-006 2735
RL0-83 0.542574 0.457426 9.10087E-006 2744
RL0-84 0.837702 0.162298 1.23079E-006 2745
RL0-85 0.909783 0.090217 6.84162E-006 2751
RL0-86 0.91318 0.08682 4.48263E-006 2755
RL0-87 0.725569 0.274431 0.000001848 2756
RL0-88 0.90355 0.09645 2.79894E-006 2758
RL0-89 0.716186 0.283814 4.00443E-006 2762
RL0-90 0.612835 0.387165 6.94976E-006 2768
RL0-91 0.582162 0.417838 0.0003616833 3129
RL0-92 0.83582 0.16418 0.000009529 3138
RL0-93 0.558802 0.441198 9.02466E-006 3147
RL0-94 0.86217 0.13783 5.29547E-006 3152
RL0-95 0.617906 0.382094 7.09319E-006 3159
RL0-96 0.801595 0.198405 6.73657E-006 3165
RL0-97 0.676978 0.323022 6.97316E-006 3171
RL0-98 0.738348 0.261652 7.87644E-006 3178
RL0-99 0.591386 0.408614 3.67391E-006 3181
RL0-100 0.521751 0.478249 4.20054E-006 3185
RL0-101 0.508844 0.491156 9.09917E-005 3275
RL0-102 0.565387 0.434613 9.41043E-006 3284
RL0-103 0.479309 0.520691 7.40872E-006 3291
RL0-104 0.745518 0.254482 3.35237E-006 3294
RL0-105 0.532452 0.467548 4.28727E-006 3298
RL0-106 0.935416 0.064584 9.89425E-006 3307
RL0-107 0.662617 0.337383 8.74864E-006 3315
RL0-108 0.658306 0.341694 2.01241E-006 3317
RL0-109 0.712991 0.287009 5.8733E-006 3322
RL0-110 0.665501 0.334499 6.69027E-006 3328
RL0-111 0.568289 0.431711 8.718047E-005 3415
RL0-112 0.98671 0.01329 8.66949E-006 3423
RL0-113 0.79789 0.20211 5.05033E-006 3428
RL0-114 0.553154 0.446846 9.60618E-006 3437
RL0-115 0.667399 0.332601 6.92172E-006 3443
RL0-116 0.700185 0.299815 9.52134E-006 3452
RL0-117 0.610748 0.389252 5.60877E-006 3457
RL0-118 0.661102 0.338898 6.63784E-006 3463
RL0-119 0.820744 0.179256 3.09427E-006 3466
RL0-120 0.912926 0.087073 4.1968E-006 3470
RL0-121 0.68335 0.31665 0.0003871028 3857
RL0-122 0.707937 0.292063 5.00312E-006 3862
RL0-123 0.589477 0.410523 2.13525E-006 3864
RL0-124 0.745493 0.254507 9.8212E-006 3873
RL0-125 0.698088 0.301912 7.02674E-006 3880
RL0-126 0.424467 0.575533 5.18827E-006 3885
RL0-127 0.787719 0.212281 4.74483E-006 3889
RL0-128 0.860644 0.139356 5.22368E-006 3894
RL0-129 0.638396 0.361604 3.96526E-006 3897
RL0-130 0.731953 0.268047 8.71207E-006 3905
RL0-131 0.744233 0.255766 0.0002181738 4123
RL0-132 1 0 1.69539E-006 4124
RL0-133 0.771704 0.228296 9.71469E-006 4133
RL0-134 0.878927 0.121073 0.000002233 4135
RL0-135 0.90145 0.09855 4.28905E-006 4139
RL0-136 0.648369 0.351631 0.00000754 4146
RL0-137 0.80335 0.19665 8.70869E-006 4154
RL0-138 0.856866 0.143134 9.44719E-006 4163
RL0-139 0.615518 0.384482 3.60345E-006 4166
RL0-140 0.788087 0.211913 0.000002436 4168
RL0-141 0.678961 0.321039 0.0002748812 4442
RL0-142 0.771435 0.228565 5.86447E-006 4447
RL0-143 0.503258 0.496742 3.67578E-006 4450
RL0-144 0.795211 0.204789 2.75252E-006 4452
RL0-145 0.490144 0.509856 4.10642E-006 4456
RL0-146 0.488492 0.511508 4.30833E-006 4460
RL0-147 0.667302 0.332698 7.3961E-006 4467
RL0-148 0.643159 0.356841 2.3613E-006 4469
RL0-149 0.673992 0.326008 9.5407E-006 4478
RL0-150 0.788535 0.211465 5.39342E-006 4483
RL0-151 0.781059 0.218941 0.0002359844 4718
RL0-152 0.502629 0.497371 5.62238E-006 4723
RL0-153 0.466542 0.533458 2.22743E-006 4725
RL0-154 0.538982 0.461018 3.21068E-006 4728
RL0-155 0.841056 0.158944 2.43989E-006 4730
RL0-156 0.462765 0.537235 7.40954E-006 4737
RL0-157 0.90605 0.09395 3.96506E-006 4740
RL0-158 0.681072 0.318928 2.10963E-006 4742
RL0-159 0.596135 0.403865 6.71541E-006 4748
RL0-160 0.855496 0.144504 0.00000768 4755
RL0-161 0.727272 0.272728 0.0002969833 5051
RL0-162 0.774272 0.225728 2.62789E-006 5053
RL0-163 0.791941 0.208059 6.76876E-006 5059
RL0-164 0.644252 0.355748 0.000005599 5064
RL0-165 0.549582 0.450418 8.32549E-006 5072
RL0-166 0.428749 0.571251 8.10471E-006 5080
RL0-167 0.376485 0.623515 9.96927E-006 5089
RL0-168 0.535948 0.464052 9.47661E-006 5098
RL0-169 0.514295 0.485705 3.16517E-006 5101
RL0-170 0.700045 0.299955 5.98168E-006 5106
RL0-171 0.571955 0.428045 0.0003862553 5492
RL0-172 0.586523 0.413477 2.88618E-006 5494
RL0-173 0.783275 0.216725 7.29982E-006 5501
RL0-174 0.610016 0.389985 9.43182E-006 5510
RL0-175 0.866664 0.133336 7.05865E-006 5517
RL0-176 0.75876 0.24124 7.56181E-006 5524
RL0-177 0.600093 0.399907 1.005344E-005 5534
RL0-178 0.577467 0.422533 2.42474E-006 5536
RL0-179 0.789476 0.210524 6.1728E-006 5542
RL0-180 0.590153 0.409847 5.99256E-006 5547
RL0-181 0.422633 0.577367 9.624393E-005 5643
RL0-182 0.526449 0.473551 1.007159E-005 5653
RL0-183 0.83354 0.16646 3.23814E-006 5656
RL0-184 0.737217 0.262783 8.58028E-006 5664
RL0-185 0.650092 0.349908 9.27841E-006 5673
RL0-186 0.56464 0.43536 5.87977E-006 5678
RL0-187 0.717536 0.282464 3.16557E-006 5681
RL0-188 0.961919 0.038081 2.93894E-006 5683
RL0-189 0.84241 0.15759 8.25314E-006 5691
RL0-190 0.817398 0.182602 4.0069E-006 5695
RL0-191 1 0 0.0002386956 5933
RL0-192 1 0 4.98276E-006 5937
RL0-193 0.709334 0.290666 0.000002811 5939
RL0-194 0.78411 0.21589 0.000008052 5947
RL0-195 0.932612 0.067388 2.89373E-006 5949
RL0-196 0.865947 0.134053 8.6839E-006 5957
RL0-197 0.725338 0.274662 5.21764E-006 5962
RL0-198 0.795964 0.204036 7.8731E-006 5969
RL0-199 0.583016 0.416984 4.61094E-006 5973
RL0-200 0.803726 0.196274 8.37366E-006 5981
Chromosome 2
Table 2: Allele frequencies of the generated population for chromosome 2.
Label Freq Al1 Freq Al2 Map Dist. Position
RL1-201 0.893976 0.106024 0.0003986369 399
RL1-202 0.584141 0.415859 2.05934E-006 401
RL1-203 0.422083 0.577917 0.000005955 406
RL1-204 0.73351 0.26649 5.58855E-006 411
RL1-205 0.694034 0.305966 4.1723E-006 415
RL1-206 0.765355 0.234645 2.06415E-006 417
RL1-207 0.965014 0.034986 7.44318E-006 424
RL1-208 0.668517 0.331483 9.60649E-006 433
RL1-209 0.634885 0.365115 8.56251E-006 441
RL1-210 0.725027 0.274973 6.14954E-006 447
RL1-211 0.698398 0.301602 7.386583E-005 520
RL1-212 0.595985 0.404015 9.7547E-006 529
RL1-213 0.710597 0.289403 1.58667E-006 530
RL1-214 0.663247 0.336753 4.37889E-006 534
RL1-215 0.75663 0.24337 7.38782E-006 541
RL1-216 0.936743 0.063257 8.35938E-006 549
RL1-217 0.663784 0.336216 1.64064E-006 550
RL1-218 0.680104 0.319896 9.16445E-006 559
RL1-219 0.688756 0.311244 0.000007628 566
RL1-220 0.9333 0.0667 7.01934E-006 573
RL1-221 0.742415 0.257585 0.0003420352 915
RL1-222 0.799322 0.200678 4.01391E-006 919
RL1-223 0.709122 0.290879 0.000002737 921
RL1-224 0.565597 0.434403 6.28353E-006 927
RL1-225 0.863029 0.136971 9.64911E-006 936
RL1-226 0.752561 0.247439 6.74076E-006 942
RL1-227 0.676998 0.323002 1.004539E-005 952
RL1-228 0.840474 0.159526 1.71067E-006 953
RL1-229 0.49346 0.50654 0.000001589 954
RL1-230 0.910095 0.089905 7.41687E-006 961
RL1-231 0.960868 0.039132 0.0002261121 1187
RL1-232 0.933743 0.066257 1.91042E-006 1188
RL1-233 0.760953 0.239047 4.80473E-006 1192
RL1-234 0.748072 0.251928 7.04549E-006 1199
RL1-235 0.663473 0.336527 8.21959E-006 1207
RL1-236 0.964783 0.035217 2.82873E-006 1209
RL1-237 0.905525 0.094475 0.000007663 1216
RL1-238 0.691349 0.308651 5.04876E-006 1221
RL1-239 0.951645 0.048355 1.59639E-006 1222
RL1-240 0.989216 0.010784 8.73616E-006 1230
RL1-241 0.738781 0.261219 0.0003243203 1554
RL1-242 0.795527 0.204473 8.89964E-006 1562
RL1-243 0.795563 0.204437 2.02264E-006 1564
RL1-244 0.703822 0.296178 3.36477E-006 1567
RL1-245 0.57285 0.42715 9.19778E-006 1576
RL1-246 0.767369 0.232631 7.29139E-006 1583
RL1-247 0.645825 0.354175 2.43094E-006 1585
RL1-248 0.802402 0.197598 1.73925E-006 1586
RL1-249 0.944397 0.055603 9.46653E-006 1595
RL1-250 0.622399 0.377601 8.82309E-006 1603
RL1-251 0.630848 0.369152 0.0002217582 1824
RL1-252 0.818129 0.181871 1.91494E-006 1825
RL1-253 0.484804 0.515196 1.6334E-006 1826
RL1-254 0.676497 0.323503 1.59652E-006 1827
RL1-255 0.880815 0.119185 3.35782E-006 1830
RL1-256 0.959511 0.040489 2.75846E-006 1832
RL1-257 0.784072 0.215928 3.03069E-006 1835
RL1-258 0.52286 0.47714 6.06819E-006 1841
RL1-259 0.623466 0.376534 6.91131E-006 1847
RL1-260 0.874709 0.125291 7.25071E-006 1854
RL1-261 0.803013 0.196987 0.000331411 2185
RL1-262 0.545178 0.454822 4.47452E-006 2189
RL1-263 0.815965 0.184035 4.89193E-006 2193
RL1-264 0.818366 0.181634 1.90565E-006 2194
RL1-265 0.724692 0.275308 1.45521E-006 2195
RL1-266 0.68352 0.31648 1.001287E-005 2205
RL1-267 0.989999 0.010001 3.48414E-006 2208
RL1-268 0.985774 0.014226 9.2895E-006 2217
RL1-269 0.642113 0.357887 4.82072E-006 2221
RL1-270 0.464929 0.535071 0.000002507 2223
RL1-271 0.734131 0.265869 0.0003870134 2610
RL1-272 0.632632 0.367368 8.94209E-006 2618
RL1-273 0.553081 0.446919 0.000004175 2622
RL1-274 0.764977 0.235023 5.60863E-006 2627
RL1-275 0.464551 0.535449 9.08894E-006 2636
RL1-276 0.851137 0.148863 1.002911E-005 2646
RL1-277 0.739427 0.260573 9.30477E-006 2655
RL1-278 0.555538 0.444462 6.07683E-006 2661
RL1-279 0.551021 0.448979 1.71434E-006 2662
RL1-280 0.593129 0.406871 6.07637E-006 2668
RL1-281 0.79749 0.20251 0.0002724436 2940
RL1-282 0.848332 0.151668 7.00277E-006 2947
RL1-283 0.812696 0.187304 5.19928E-006 2952
RL1-284 0.715573 0.284426 6.3489E-006 2958
RL1-285 0.578981 0.421019 3.26024E-006 2961
RL1-286 0.786632 0.213368 0.000008282 2969
RL1-287 0.64689 0.35311 8.91268E-006 2977
RL1-288 0.600677 0.399323 2.59076E-006 2979
RL1-289 0.552264 0.447736 7.46941E-006 2986
RL1-290 0.836774 0.163226 7.75812E-006 2993
RL1-291 0.910408 0.089592 6.30604E-005 3056
RL1-292 0.705616 0.294384 9.07012E-006 3065
RL1-293 0.833055 0.166945 7.11207E-006 3072
RL1-294 0.55822 0.44178 9.56152E-006 3081
RL1-295 0.684736 0.315264 2.78967E-006 3083
RL1-296 0.973315 0.026685 9.43315E-006 3092
RL1-297 0.676965 0.323035 5.50475E-006 3097
RL1-298 0.698511 0.301489 4.26716E-006 3101
RL1-299 0.514109 0.485891 9.42184E-006 3110
RL1-300 0.895842 0.104158 9.44663E-006 3119
References
[EBT+08] Todd L Edwards, William S Bush, Stephen D Turner, Scott M Dudek, Eric S Torstenson, Mike Schmidt, Eden Martin, and Marylyn D Ritchie. Generating Linkage Disequilibrium Patterns in Data Simulations using genomeSIMLA. Lecture Notes in Computer Science, 4973:24–35, 2008.

[MVV11] Jestinah M Mahachie John, Francois Van Lishout, and Kristel Van Steen. Model-Based Multifactor Dimensionality Reduction to detect epistasis for quantitative traits in the presence of error-free and noisy data. European Journal of Human Genetics, 19(6):696–703, June 2011.

[WDR+10] Jing Wu, Bernie Devlin, Steven Ringquist, Massimo Trucco, and Kathryn Roeder. Screen and clean: a tool for identifying interactions in genome-wide association studies. Genetic Epidemiology, 34:275–285, 2010.

[WYY+10a] Xiang Wan, Can Yang, Qiang Yang, Hong Xue, Xiaodan Fan, Nelson L S Tang, and Weichuan Yu. BOOST: A fast approach to detecting gene-gene interactions in genome-wide case-control studies. American Journal of Human Genetics, 87:325–340, 2010.

[WYY+10b] Xiang Wan, Can Yang, Qiang Yang, Hong Xue, Nelson L S Tang, and Weichuan Yu. Predictive rule inference for epistatic interaction detection in genome-wide association studies. Bioinformatics (Oxford, England), 26:30–37, 2010.

[YHW+09] Can Yang, Zengyou He, Xiang Wan, Qiang Yang, Hong Xue, and Weichuan Yu. SNPHarvester: a filtering-based approach for detecting epistatic interactions in genome-wide association studies. Bioinformatics (Oxford, England), 25:504–511, 2009.

[Zha12] Yu Zhang. A novel Bayesian graphical model for genome-wide multi-SNP association mapping. Genetic Epidemiology, 36:36–47, 2012.

[ZHZW10] Xiang Zhang, Shunping Huang, Fei Zou, and Wei Wang. TEAM: efficient two-locus epistasis tests in human genome-wide association study. Bioinformatics (Oxford, England), 26:i217–i227, 2010.
Laboratory Note
Genetic Epistasis II - Assessing Algorithm BEAM 3.0
LN-2-2014
Ricardo Pinho and Rui Camacho
FEUP
Rua Dr. Roberto Frias, s/n, 4200-465 Porto, Portugal
Fax: (+351) 22 508 1440
e-mail: [email protected], [email protected]
www: http://www.fe.up.pt/∼ei09045, http://www.fe.up.pt/∼rcamacho
May 2014
Abstract
In this lab note, the algorithm BEAM 3.0 is presented and tested for main effect detection. This is a Bayesian algorithm that creates a graph of SNPs and the relations between them and the disease expression. The results obtained reveal high detection rates for data sets with higher allele frequencies. Power also grows with population size; however, larger populations increase the Type I Error Rate as well, so Power values are nearly equal to the error rates. The algorithm scales well on the data sets used, and may scale to large genome wide association studies.
1 Introduction
The Bayesian Epistasis Association Mapping (BEAM) algorithm [ZL07] is a stochastic algorithm that uses Markov chain Monte Carlo (MCMC) [ADH10] to compute the posterior probability that each marker is associated with the disease phenotype.
Instead of the standard epistatic detection using χ2 statistic, BEAM usesa new B statistic. The B statistic is defined by:
BM = lnPA(DM , UM)
P0(DM , UM)= ln
Pjoin(DM)[Pind(UM) + Pjoin(UM)]
Pind(DM , UM) + Pjoin(DM , UM)(1)
where M represents a set of k markers, capturing interactions of different complexities. D_M and U_M are the genotype data at the markers in M for cases and controls, and P_0(D_M, U_M) and P_A(D_M, U_M) are the Bayes factors of the null and association models. P_ind is the distribution that assumes independence among the markers in M, and P_join is a saturated joint distribution over the genotype combinations of all markers in M.

BEAM3 [Zha12] introduces multi-SNP associations and flexible high-order interactions using graphs, reducing the complexity and increasing the Power, and produces cleaner results with improved mapping sensitivity and specificity. Initially, the disease graph is built based on the probability that a given genotype configuration is related to the phenotype, considering the frequencies of that genotype in controls and cases. Cliques (non-overlapping groups of SNPs) are then generated based on the disease-related SNPs. A joint probability model and MCMC are used to update the disease graph and create undirected edges between dependent SNPs.
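The sign of B_M in Eq. (1) can be read off directly once the component likelihoods are available. The sketch below simply evaluates the log ratio for invented probability values; BEAM computes these quantities internally, so the function and its inputs are purely illustrative.

```python
import math

def beam_b_statistic(p_join_d, p_ind_u, p_join_u, p_ind_du, p_join_du):
    """Evaluate Eq. (1): the log ratio of the association model's evidence
    to the null model's for a marker set M. All arguments are hypothetical
    likelihood values, not quantities produced by the BEAM software."""
    return math.log(p_join_d * (p_ind_u + p_join_u) / (p_ind_du + p_join_du))

# A positive B favours association of the marker set with the disease.
print(beam_b_statistic(1e-3, 1e-4, 1e-4, 1e-8, 1e-8) > 0)  # → True (B ≈ ln 10)
```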
1.1 Input files
The input file contains the phenotypes of all the individuals in the first rowand the genotypes of each SNP on the subsequent rows.
1.2 Output files
The algorithm outputs 3 files: posterior file; g.dot file; and chi.txt. Theposterior file contains the posterior probabilities of marginal and interaction
ID   Chr   Pos   0 1 0 0 1
rs1  chr1  1     1 0 2 0 1
rs2  chr1  2     1 2 1 1 0
rs3  chr1  3     1 2 2 0 1
Table 1: An example of the input file, containing the index of each SNP, the chromosome it belongs to, and its position. The first row holds the phenotypes, and each subsequent row holds the genotypes of one SNP for all individuals.
associations per SNP. The g.dot file contains the disease graph; viewing it requires graph visualization software such as GraphViz. The chi.txt file contains the χ² results together with the allele counts.
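The input layout shown in Table 1 can be read with a short parser like the sketch below (assuming whitespace-separated columns as in the example; the real files may differ in detail):

```python
def parse_beam_input(text):
    """Parse a BEAM-style input file: the first row holds the phenotypes
    (after the ID/Chr/Pos headers), subsequent rows hold one SNP each.

    Returns (phenotypes, snps) where snps maps SNP id -> (chrom, pos, genotypes).
    """
    lines = [l.split() for l in text.strip().splitlines() if l.strip()]
    phenotypes = [int(x) for x in lines[0][3:]]
    snps = {}
    for row in lines[1:]:
        snp_id, chrom, pos = row[0], row[1], int(row[2])
        genotypes = [int(g) for g in row[3:]]
        assert len(genotypes) == len(phenotypes), "column count mismatch"
        snps[snp_id] = (chrom, pos, genotypes)
    return phenotypes, snps

example = """\
ID Chr Pos 0 1 0 0 1
rs1 chr1 1 1 0 2 0 1
rs2 chr1 2 1 2 1 1 0
rs3 chr1 3 1 2 2 0 1
"""
phenotypes, snps = parse_beam_input(example)
```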
1.3 Parameters
There are some options available to the user:
• "-filter k": tells the program to filter out SNPs with too many missing genotypes.

• "-sample burnin mcmc": specifies the number of burn-in and sampling iterations of the MCMC. The default value is 100.

• "-prior p": specifies how likely each SNP is to be associated with the disease. By default, p = 5/L, where L is the number of SNPs.

• "-T t": specifies the temperature at which the MCMC starts running. With a high temperature, the program can jump out of local modes within few iterations; however, it can make the program very slow in the first iterations.
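The options above could be assembled into an argument list as in this sketch (the executable name `./BEAM` and the exact argument order are assumptions; check the tool's own documentation before use):

```python
def beam_command(input_file, burnin=None, mcmc=None,
                 filter_k=None, prior=None, temperature=None):
    """Assemble an argument list for a BEAM3 run from the documented options.

    The prior defaults to p = 5/L inside the tool (L = number of SNPs), so it
    is only passed explicitly when overridden; the experiments here use 2/L.
    """
    cmd = ["./BEAM", input_file]            # executable name is assumed
    if filter_k is not None:
        cmd += ["-filter", str(filter_k)]
    if burnin is not None and mcmc is not None:
        cmd += ["-sample", str(burnin), str(mcmc)]
    if prior is not None:
        cmd += ["-prior", str(prior)]
    if temperature is not None:
        cmd += ["-T", str(temperature)]
    return cmd

# Override the prior to 2/L for a data set with L = 1000 SNPs.
cmd = beam_command("data.txt", prior=2 / 1000)
```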
2 Experimental Settings
The datasets used in the experiments are characterized in Lab Note 1. The computer used for these experiments ran the 64-bit Ubuntu 13.10 operating system, with an Intel(R) Core(TM)2 Quad Q6600 CPU at 2.40 GHz and 8.00 GB of RAM. The parameters used in this experiment are the defaults, with the exception of "-prior p", which was set to p = 2/L.
[Bar chart: Power (%) vs. allele frequency, one series each for 500, 1000, and 2000 individuals.]
Figure 1: Power by allele frequency. For each frequency, three sizes of data sets were used to measure the Power, with odds ratio of 2.0 and prevalence of 0.02. The Power is measured by the number of data sets where the ground truth was amongst the most relevant results, out of all 100 data sets.
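The Power and Type I Error Rate measures used throughout these notes can be computed as in this sketch (the field layout of `results` is illustrative):

```python
def power_and_type1(results, truth, top_k=1):
    """Power: percentage of data sets whose top-ranked SNPs include the
    ground-truth SNP.  Type I Error Rate: percentage of data sets whose
    top-ranked SNPs include at least one SNP outside the ground truth.

    `results` is a list (one entry per data set) of (snp_id, score) pairs;
    `truth` is the set of causal SNP ids.
    """
    hits = errors = 0
    for ranked in results:
        top = {snp for snp, _ in sorted(ranked, key=lambda r: -r[1])[:top_k]}
        hits += bool(top & truth)       # ground truth among the top results
        errors += bool(top - truth)     # a false positive among the top results
    n = len(results)
    return 100.0 * hits / n, 100.0 * errors / n

# Two toy "data sets": rs1 is causal; the second ranks a false SNP first.
results = [[("rs1", 9.1), ("rs2", 1.0)], [("rs3", 7.0), ("rs1", 6.5)]]
power, type1 = power_and_type1(results, truth={"rs1"})
```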
3 Results
The results of the algorithm's epistasis detection consist of posterior probabilities, which are not comparable with χ² tests; therefore only main effect detection is considered in this experiment. Figure 1 shows near 0% Power for allele frequencies lower than 0.1, but Power increases greatly, reaching 100% for frequencies of 0.3 and 0.5. There is also a clear growth with population size, especially in data sets with 0.1 minor allele frequency.
The running time (a) of these experiments shows a steady increase, with a difference of nearly 3 seconds between data sets with 500 individuals and data sets with 2000 individuals. The increase in running time is not very significant, which may carry over to larger data sets. The same holds for memory usage (c), with only a 1.5 MB increase from 500 to 2000 individuals per data set. The CPU usage (b) increases by nearly 10% from 500 to 1000 individuals, lowering slightly for 2000 individuals.
The error rate results in Figure 3 contain high numbers of false positives. The Type I Error Rate is higher than the Power for smaller allele frequencies. At frequencies above 0.1 the Type I Error Rate is lower than the Power, but the difference between the two percentages decreases as the number of individuals increases. This means that for larger data sets it is more likely to find the ground truth, but the finding is also more likely to be accompanied by false positives.
[Plots: (a) average running time (seconds), (b) average CPU usage (%), and (c) average memory usage (Mbytes), each vs. number of individuals.]
Figure 2: Comparison of scalability measures between different sized data sets. The figure shows the average running time, CPU usage, and memory usage for each data set size. The data sets have a minor allele frequency of 0.5, an odds ratio of 2.0, and a prevalence of 0.02.
The distribution of Power by odds ratio reveals a large increase of Power as the odds ratio grows in Figure 5, similar to the growth of Power with population size in Figure 4. Data sets with low allele frequencies have near 0% Power. With 0.1 minor allele frequency there is a significant increase, reaching 92% Power, and 100% for higher allele frequencies in Figure 7. There is no clear difference in Power with prevalence changes in Figure 6.
[Bar chart: Type I Error Rate (%) vs. allele frequency, one series each for 500, 1000, and 2000 individuals.]
Figure 3: Type I Error Rate by allele frequency and population size, with odds ratio of 2.0 and prevalence of 0.02. The Type I Error Rate is measured by the number of data sets where false positives were amongst the most relevant results, out of all 100 data sets.
4 Summary
BEAM3 is the third iteration of a Bayesian algorithm that uses posterior probabilities to detect epistasis. BEAM3 generates a disease graph representing multi-SNP associations that have a high probability of being related to the disease phenotype expression; this graph is updated using MCMC. This version of BEAM also outputs χ² values for single SNPs, which are comparable with other algorithms; for this reason the results consist of main effect detection only. The Power obtained reveals similar values for Power and Type I Error Rate, both increasing with allele frequency and population size, although Type I errors are lower relative to Power in data sets with high allele frequency and low population size. The scalability of the algorithm is promising.
References
[ADH10] Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72:269–342, 2010.
[Zha12] Yu Zhang. A novel Bayesian graphical model for genome-wide multi-SNP association mapping. Genetic Epidemiology, 36:36–47, 2012.
[ZL07] Yu Zhang and Jun S Liu. Bayesian inference of epistatic interactions in case-control studies. Nature Genetics, 39:1167–1173, 2007.
A Bar Graphs
[Bar chart: Power (%) by population size (500, 1000, 2000 individuals).]
Figure 4: Distribution of the Power by population. The allele frequency is 0.1, the odds ratio is 2.0, and the prevalence is 0.02.
[Bar chart: Power (%) by odds ratio (1.1, 1.5, 2.0).]
Figure 5: Distribution of the Power by odds ratios. The allele frequency is 0.1, the number of individuals is 2000, and the prevalence is 0.02.
[Bar chart: Power (%) by prevalence (0.0001, 0.02).]
Figure 6: Distribution of the Power by prevalence. The allele frequency is 0.1, the number of individuals is 2000, and the odds ratio is 2.0.
[Bar chart: Power (%) by allele frequency (0.01–0.5).]
Figure 7: Distribution of the Power by allele frequency. The number of individuals is 2000, the odds ratio is 2.0, and the prevalence is 0.02.
B Table of Results
Table 2: A table containing the percentage of true positives and false positives in each configuration. The first column contains the description of the configuration. The second and third columns contain the number of data sets with true positives and false positives, respectively, out of all 100 data sets per configuration.
Configuration*            TP (%)  FP (%)
0.5,500,ME,2.0,0.02       100     99
0.5,500,ME,2.0,0.0001     100     95
0.5,500,ME,1.5,0.02       100     53
0.5,500,ME,1.5,0.0001     100     57
0.5,500,ME,1.1,0.02       80      20
0.5,500,ME,1.1,0.0001     79      22
0.5,2000,ME,2.0,0.02      100     100
0.5,2000,ME,2.0,0.0001    100     100
0.5,2000,ME,1.5,0.02      100     100
0.5,2000,ME,1.5,0.0001    100     100
0.5,2000,ME,1.1,0.02      100     100
0.5,2000,ME,1.1,0.0001    100     98
0.5,1000,ME,2.0,0.02      100     100
0.5,1000,ME,2.0,0.0001    100     100
0.5,1000,ME,1.5,0.02      100     100
0.5,1000,ME,1.5,0.0001    100     97
0.5,1000,ME,1.1,0.02      100     57
0.5,1000,ME,1.1,0.0001    100     60
0.3,500,ME,2.0,0.02       100     71
0.3,500,ME,2.0,0.0001     100     79
0.3,500,ME,1.5,0.02       88      24
0.3,500,ME,1.5,0.0001     89      30
0.3,500,ME,1.1,0.02       21      11
0.3,500,ME,1.1,0.0001     23      6
0.3,2000,ME,2.0,0.02      100     100
0.3,2000,ME,2.0,0.0001    100     100
0.3,2000,ME,1.5,0.02      100     99
0.3,2000,ME,1.5,0.0001    100     100
0.3,2000,ME,1.1,0.02      100     54
0.3,2000,ME,1.1,0.0001    100     50
0.3,1000,ME,2.0,0.02      100     99
0.3,1000,ME,2.0,0.0001    100     100
0.3,1000,ME,1.5,0.02      100     68
0.3,1000,ME,1.5,0.0001    100     63
0.3,1000,ME,1.1,0.02      90      25
0.3,1000,ME,1.1,0.0001    81      25
0.1,500,ME,2.0,0.02       0       9
0.1,500,ME,2.0,0.0001     12      17
0.1,500,ME,1.5,0.02       0       5
0.1,500,ME,1.5,0.0001     0       6
0.1,500,ME,1.1,0.02       0       6
0.1,500,ME,1.1,0.0001     0       5
0.1,2000,ME,2.0,0.02      92      67
0.1,2000,ME,2.0,0.0001    99      76
0.1,2000,ME,1.5,0.02      24      16
0.1,2000,ME,1.5,0.0001    44      29
0.1,2000,ME,1.1,0.02      2       8
0.1,2000,ME,1.1,0.0001    1       7
0.1,1000,ME,2.0,0.02      32      18
0.1,1000,ME,2.0,0.0001    59      38
0.1,1000,ME,1.5,0.02      1       6
0.1,1000,ME,1.5,0.0001    6       10
0.1,1000,ME,1.1,0.02      0       7
0.1,1000,ME,1.1,0.0001    0       5
0.05,500,ME,2.0,0.02      0       3
0.05,500,ME,2.0,0.0001    0       6
0.05,500,ME,1.5,0.02      0       4
0.05,500,ME,1.5,0.0001    0       4
0.05,500,ME,1.1,0.02      0       5
0.05,500,ME,1.1,0.0001    0       1
0.05,2000,ME,2.0,0.02     1       17
0.05,2000,ME,2.0,0.0001   7       25
0.05,2000,ME,1.5,0.02     0       3
0.05,2000,ME,1.5,0.0001   0       13
0.05,2000,ME,1.1,0.02     0       5
0.05,2000,ME,1.1,0.0001   0       6
0.05,1000,ME,2.0,0.02     0       3
0.05,1000,ME,2.0,0.0001   1       18
0.05,1000,ME,1.5,0.02     0       2
0.05,1000,ME,1.5,0.0001   0       5
0.05,1000,ME,1.1,0.02     0       7
0.05,1000,ME,1.1,0.0001   0       3
0.01,500,ME,2.0,0.02      0       0
0.01,500,ME,2.0,0.0001    0       6
0.01,500,ME,1.5,0.02      0       0
0.01,500,ME,1.5,0.0001    0       6
0.01,500,ME,1.1,0.02      0       0
0.01,500,ME,1.1,0.0001    0       6
0.01,2000,ME,2.0,0.02     0       1
0.01,2000,ME,2.0,0.0001   0       3
0.01,2000,ME,1.5,0.02     0       3
0.01,2000,ME,1.5,0.0001   0       2
0.01,2000,ME,1.1,0.02     0       3
0.01,2000,ME,1.1,0.0001   0       2
0.01,1000,ME,2.0,0.02     0       6
0.01,1000,ME,2.0,0.0001   0       3
0.01,1000,ME,1.5,0.02     0       7
0.01,1000,ME,1.5,0.0001   0       3
0.01,1000,ME,1.1,0.02     0       3
0.01,1000,ME,1.1,0.0001   0       4
*MAF,POP,MOD,OR,PREV where MAF represents the minor allele frequency, POP is the number of individuals, MOD is the used model (with or without main effect and with or without epistasis effect), OR is the odds ratio and PREV is the prevalence of the disease.
Table 3: A table containing the running time, CPU usage and memory usage in each configuration.
Configuration*            Running Time (s)  CPU Usage (%)  Memory Usage (KB)
0.5,500,ME,2.0,0.02       04.90             87.81          4152.80
0.5,500,ME,2.0,0.0001     03.30             87.16          3446.24
0.5,500,ME,1.5,0.02       02.16             86.74          2723.76
0.5,500,ME,1.5,0.0001     02.15             82.36          2757.20
0.5,500,ME,1.1,0.02       01.82             80.97          2566.12
0.5,500,ME,1.1,0.0001     01.73             83.54          2556.08
0.5,2000,ME,2.0,0.02      08.02             95.53          5986.72
0.5,2000,ME,2.0,0.0001    05.17             94.16          4108.72
0.5,2000,ME,1.5,0.02      02.78             92.74          3512.88
0.5,2000,ME,1.5,0.0001    02.59             93.39          3508.48
0.5,2000,ME,1.1,0.02      02.34             93.38          3493.44
0.5,2000,ME,1.1,0.0001    02.30             93.32          3492.60
0.5,1000,ME,2.0,0.02      06.96             96.31          4437.08
0.5,1000,ME,2.0,0.0001    03.79             95.00          3240.00
0.5,1000,ME,1.5,0.02      02.38             93.54          2771.80
0.5,1000,ME,1.5,0.0001    02.25             93.99          2729.16
0.5,1000,ME,1.1,0.02      02.10             93.08          2686.12
0.5,1000,ME,1.1,0.0001    02.02             93.41          2665.64
0.3,500,ME,2.0,0.02       02.60             94.60          2970.00
0.3,500,ME,2.0,0.0001     02.32             93.51          2917.44
0.3,500,ME,1.5,0.02       01.93             93.41          2615.88
0.3,500,ME,1.5,0.0001     01.83             92.49          2607.24
0.3,500,ME,1.1,0.02       01.17             89.70          2483.28
0.3,500,ME,1.1,0.0001     01.09             88.25          2476.68
0.3,2000,ME,2.0,0.02      02.77             94.79          3534.72
0.3,2000,ME,2.0,0.0001    02.95             95.25          3563.44
0.3,2000,ME,1.5,0.02      02.38             94.49          3493.60
0.3,2000,ME,1.5,0.0001    02.32             94.27          3492.92
0.3,2000,ME,1.1,0.02      02.30             94.73          3491.44
0.3,2000,ME,1.1,0.0001    02.28             94.56          3490.44
0.3,1000,ME,2.0,0.02      02.42             94.03          2886.64
0.3,1000,ME,2.0,0.0001    02.45             94.19          2831.72
0.3,1000,ME,1.5,0.02      02.04             93.96          2675.80
0.3,1000,ME,1.5,0.0001    02.04             93.88          2671.00
0.3,1000,ME,1.1,0.02      01.82             93.43          2665.28
0.3,1000,ME,1.1,0.0001    01.76             92.86          2662.68
0.1,500,ME,2.0,0.02       0.80              85.95          2471.00
0.1,500,ME,2.0,0.0001     0.95              88.33          2520.12
0.1,500,ME,1.5,0.02       0.61              82.27          2383.04
0.1,500,ME,1.5,0.0001     0.64              84.21          2432.96
0.1,500,ME,1.1,0.02       0.57              82.88          2367.56
0.1,500,ME,1.1,0.0001     0.58              81.66          2408.72
0.1,2000,ME,2.0,0.02      02.24             93.47          3493.84
0.1,2000,ME,2.0,0.0001    02.26             94.12          3492.40
0.1,2000,ME,1.5,0.02      01.37             90.66          3489.68
0.1,2000,ME,1.5,0.0001    01.45             91.55          3484.24
0.1,2000,ME,1.1,0.02      01.02             90.22          3482.16
0.1,2000,ME,1.1,0.0001    0.99              90.46          3483.44
0.1,1000,ME,2.0,0.02      01.38             89.81          2681.04
0.1,1000,ME,2.0,0.0001    01.50             91.44          2696.48
0.1,1000,ME,1.5,0.02      0.78              88.49          2655.24
0.1,1000,ME,1.5,0.0001    0.83              88.49          2653.08
0.1,1000,ME,1.1,0.02      0.69              83.77          2652.16
0.1,1000,ME,1.1,0.0001    0.68              89.10          2648.20
0.05,500,ME,2.0,0.02      0.59              81.11          2380.88
0.05,500,ME,2.0,0.0001    0.93              84.09          2439.40
0.05,500,ME,1.5,0.02      0.57              81.72          2361.20
0.05,500,ME,1.5,0.0001    0.60              81.99          2390.04
0.05,500,ME,1.1,0.02      0.59              79.48          2361.20
0.05,500,ME,1.1,0.0001    0.57              81.46          2381.48
0.05,2000,ME,2.0,0.02     01.18             89.59          3485.56
0.05,2000,ME,2.0,0.0001   01.19             89.80          3484.76
0.05,2000,ME,1.5,0.02     0.98              89.07          3480.08
0.05,2000,ME,1.5,0.0001   0.98              89.80          3480.16
0.05,2000,ME,1.1,0.02     0.94              89.82          3479.28
0.05,2000,ME,1.1,0.0001   0.94              90.33          3481.12
0.05,1000,ME,2.0,0.02     0.70              85.95          2651.56
0.05,1000,ME,2.0,0.0001   0.81              86.89          2653.84
0.05,1000,ME,1.5,0.02     0.67              81.01          2647.16
0.05,1000,ME,1.5,0.0001   0.70              82.83          2648.96
0.05,1000,ME,1.1,0.02     0.66              84.68          2648.20
0.05,1000,ME,1.1,0.0001   0.69              80.38          2647.76
0.01,500,ME,2.0,0.02      0.55              77.93          2340.40
0.01,500,ME,2.0,0.0001    0.59              79.62          2391.20
0.01,500,ME,1.5,0.02      0.54              81.51          2345.64
0.01,500,ME,1.5,0.0001    0.58              79.48          2387.76
0.01,500,ME,1.1,0.02      0.55              78.36          2349.92
0.01,500,ME,1.1,0.0001    0.59              79.67          2393.76
0.01,2000,ME,2.0,0.02     0.91              85.28          3476.88
0.01,2000,ME,2.0,0.0001   0.93              91.10          3479.40
0.01,2000,ME,1.5,0.02     0.91              91.18          3478.80
0.01,2000,ME,1.5,0.0001   0.92              91.62          3480.64
0.01,2000,ME,1.1,0.02     0.91              91.07          3477.44
0.01,2000,ME,1.1,0.0001   0.93              91.07          3478.96
0.01,1000,ME,2.0,0.02     0.66              86.84          2645.76
0.01,1000,ME,2.0,0.0001   0.67              89.19          2649.60
0.01,1000,ME,1.5,0.02     06.55             88.46          6100.36
0.01,1000,ME,1.5,0.0001   0.67              80.52          2646.28
0.01,1000,ME,1.1,0.02     0.66              84.18          2644.68
0.01,1000,ME,1.1,0.0001   0.66              81.46          2645.48
*MAF,POP,MOD,OR,PREV where MAF represents the minor allele frequency, POP is the number of individuals, MOD is the used model (with or without main effect and with or without epistasis effect), OR is the odds ratio and PREV is the prevalence of the disease.
Laboratory Note
Genetic Epistasis III - Assessing Algorithm BOOST
LN-3-2014
Ricardo Pinho and Rui Camacho
FEUP
Rua Dr Roberto Frias, s/n, 4200-465 Porto, Portugal
Fax: (+351) 22 508 1440
e-mail: [email protected]
www: http://www.fe.up.pt/~
e-mail: [email protected]
www: http://www.fe.up.pt/~rcamacho
May 2014
Abstract
In this lab note, the algorithm BOOST is discussed. Its main features are transforming the genotype data representation into a Boolean type, operating on it with logic operations, and pruning statistically irrelevant epistatic interactions. The results show a higher Power for main effect detection than for epistasis detection, but also a much higher Type I Error Rate; the same holds for full effect detection. The scalability of the algorithm is very good, revealing only a slight increase in resource usage and running time as the population size grows.
1 Introduction
BOOST (BOolean Operation-based Screening and Testing) [WYY+10] transforms the data representation into a Boolean type, making logic operations more efficient, and prunes insignificant epistatic interactions using an upper bound on the likelihood ratio test statistic. BOOST works in two stages:

• Stage 1: Screening. All pairwise interactions are evaluated using contingency tables collected by Boolean operations, removing interactions that fail to meet a predefined threshold. The evaluation at this stage is based on the Kullback-Leibler divergence D = N · D_KL(π ∥ ρ), where π is the joint distribution under the full logistic regression model M_S = β_0 + β_i^{x_1} + β_j^{x_2} + β_{ij}^{x_1 x_2}, and ρ is the approximate joint distribution under the main-effects logistic regression model M_H = β_0 + β_i^{x_1} + β_j^{x_2}, obtained with the Kirkwood superposition approximation.

• Stage 2: Testing. Two statistical tests are used: a likelihood ratio test, fitting the log-linear models M_H and M_S, and a χ² test with four degrees of freedom. The p value is adjusted with a Bonferroni correction.
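The Boolean representation behind Stage 1 can be sketched as follows: each (SNP, genotype value) pair becomes a bitmask over the individuals, and a 3×3×2 genotype-by-phenotype contingency table is then filled with bitwise ANDs and popcounts. This is a simplified illustration of the idea, not BOOST's actual data layout:

```python
def to_bitmasks(genotypes):
    """One bitmask per genotype value in {0,1,2}: bit i is set when
    individual i carries that genotype."""
    masks = [0, 0, 0]
    for i, g in enumerate(genotypes):
        masks[g] |= 1 << i
    return masks

def pairwise_table(snp1, snp2, phenotype):
    """3x3x2 contingency table for a SNP pair, filled with AND + popcount."""
    m1, m2 = to_bitmasks(snp1), to_bitmasks(snp2)
    cases = 0
    for i, p in enumerate(phenotype):
        cases |= p << i
    controls = ~cases & ((1 << len(phenotype)) - 1)
    table = [[[0, 0] for _ in range(3)] for _ in range(3)]
    for a in range(3):
        for b in range(3):
            both = m1[a] & m2[b]                     # individuals with (a, b)
            table[a][b][0] = bin(both & controls).count("1")
            table[a][b][1] = bin(both & cases).count("1")
    return table

snp1 = [0, 1, 2, 1, 0]
snp2 = [1, 1, 0, 2, 0]
pheno = [0, 1, 0, 0, 1]
table = pairwise_table(snp1, snp2, pheno)
```

Because the counts come from word-level bit operations instead of per-individual loops, an optimized implementation can evaluate all pairwise tables far faster than a naive scan.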
1.1 Input files
The input data consist of a file with the SNP and phenotype information, and a file containing the names of all data set files. In the data files, the first column corresponds to the phenotype, taking values in {0,1}. The second to last columns correspond to the SNPs, taking values in {0,1,2}.
1.2 Output files
The output consists of two files. The interaction results file lists all SNP pairs with a χ² result above 30 and contains the following columns:
• Index: number of the interaction. Begins with 0.

• SNP1: first SNP in the interaction. Numeration begins with 0.

• SNP2: second SNP in the interaction. Numeration begins with 0.

• SinglelocusAssoc1: value of the marginal effect for the first associated SNP.

• SinglelocusAssoc2: value of the marginal effect for the second associated SNP.

• InteractionBOOST: the statistical value of BOOST from the χ² test.

• InteractionPLINK: value obtained using the statistic of PLINK.
The second file contains the marginal effect value for every SNP. The file contains two columns:
• SNPindex : number of the SNP.
• Single-locusTestValue : value of the χ2 test.
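Stage 2's significance test can be sketched with a closed-form χ² survival function, since for 4 degrees of freedom P(X > x) = e^{-x/2}(1 + x/2), combined with a Bonferroni correction over all tested pairs. This is a sketch under those stated assumptions, not BOOST's internal code:

```python
from math import exp

def chi2_sf_4df(x):
    """Survival function of the chi-square distribution with 4 degrees of
    freedom: P(X > x) = exp(-x/2) * (1 + x/2)."""
    return exp(-x / 2.0) * (1.0 + x / 2.0)

def significant_pairs(stats, n_snps, alpha=0.05):
    """Keep SNP pairs whose interaction statistic survives a Bonferroni
    correction over all C(n_snps, 2) tested pairs.

    `stats` maps (snp_i, snp_j) -> chi-square statistic with 4 df.
    """
    n_tests = n_snps * (n_snps - 1) // 2
    threshold = alpha / n_tests            # Bonferroni-corrected p cutoff
    return {pair: s for pair, s in stats.items() if chi2_sf_4df(s) < threshold}

# Toy statistics for the 3 pairs of a 3-SNP data set.
stats = {(0, 1): 45.0, (0, 2): 20.0, (1, 2): 3.0}
kept = significant_pairs(stats, n_snps=3)
```

With α = 0.05 and three tests, the per-pair cutoff is p < 0.0167; the statistics 45.0 and 20.0 clear it easily, while 3.0 (p ≈ 0.56) does not.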
2 Experimental Settings
The datasets used in the experiments are characterized in Lab Note 1. The computer used for these experiments ran the 64-bit Ubuntu 13.10 operating system, with an Intel(R) Core(TM)2 Quad Q6600 CPU at 2.40 GHz and 8.00 GB of RAM. BOOST is a C program; there are no settings for BOOST.
3 Results
The Power displayed in epistasis detection (a) is inferior to the Power of main effect detection (b), and the best results were obtained by full effect detection (c) in almost all configurations. In epistasis detection, Figure 1 shows an increase of Power with population size at nearly all allele frequencies. Increasing the allele frequency also increases the Power.
Figure 2 shows a varying CPU usage (b) with a very slight increase in running time (a) and memory usage (c). This increase is not significant, which reveals a good scalability.
The Type I Error Rate shows a maximum value of 21% for epistasis detection, but reaches 100% in the data sets with the highest population size and allele frequency for main effect and full effect detection. Most of the Type I errors in epistasis detection are below 10%, so there is a bigger gap between Power and Type I errors in epistasis detection. For main effect and full effect, the Type I Error Rate increases with data set size and minor allele
[Bar charts: Power (%) vs. allele frequency for (a) epistasis, (b) main effect, and (c) full effect detection, one series each for 500, 1000, and 2000 individuals.]
Figure 1: Power by allele frequency. For each frequency, three sizes of data sets were used to measure the Power, with odds ratio of 2.0 and prevalence of 0.02. The Power is measured by the number of data sets where the ground truth was amongst the most relevant results, out of all 100 data sets. (a), (b), and (c) represent epistatic, main effect, and main effect + epistatic interactions.
[Plots: (a) average running time (seconds), (b) average CPU usage (%), and (c) average memory usage (Kbytes), each vs. number of individuals.]
Figure 2: Comparison of scalability measures between different sized data sets. The data sets have a minor allele frequency of 0.5, an odds ratio of 2.0, a prevalence of 0.02, and use the full effect disease model.
frequency.
The Power distributions by population (Figure 4) and by odds ratio (Figure 5) show a big increase with higher population sizes and odds ratios. The prevalence results reveal very similar values for both prevalences (Figure 6), and the distribution by allele frequency (Figure 7) increases slightly at 0.05 minor allele frequency and greatly at 0.1, reaching 100% for higher allele frequencies.
[Bar charts: Type I Error Rate (%) vs. allele frequency for (a) epistasis, (b) main effect, and (c) full effect detection, one series each for 500, 1000, and 2000 individuals.]
Figure 3: Type I Error Rate by allele frequency. For each frequency, three sizes of data sets were used, with odds ratio of 2.0 and prevalence of 0.02. The Type I Error Rate is measured by the number of data sets where false positives were amongst the most relevant results, out of all 100 data sets. (a), (b), and (c) represent epistatic, main effect, and main effect + epistatic interactions.
4 Summary
BOOST is an exhaustive algorithm that converts the data into a binary format and prunes irrelevant interactions using contingency tables collected by Boolean operations. The results show very good scalability, with a slight but irrelevant increase in running time, memory usage and CPU usage. The gap between Power and Type I Error Rate is larger in epistasis detection, but the overall Power is lower than for main effect and full effect detection.
References
[WYY+10] Xiang Wan, Can Yang, Qiang Yang, Hong Xue, Xiaodan Fan, Nelson L S Tang, and Weichuan Yu. BOOST: A fast approach to detecting gene-gene interactions in genome-wide case-control studies. American Journal of Human Genetics, 87:325–340, 2010.
A Bar Graphs
[Bar charts: Power (%) by population for (a) epistasis, (b) main effect, and (c) full effect detection.]
Figure 4: Distribution of the Power by population for all disease models. The allele frequency is 0.1, the odds ratio is 2.0, and the prevalence is 0.02.
[Bar charts: Power (%) by odds ratio for (a) epistasis, (b) main effect, and (c) full effect detection.]
Figure 5: Distribution of the Power by odds ratios for all disease models. The allele frequency is 0.1, the population size is 2000 individuals, and the prevalence is 0.02.
[Bar charts: Power (%) by prevalence for (a) epistasis, (b) main effect, and (c) full effect detection.]
Figure 6: Distribution of the Power by prevalence for all disease models. The allele frequency is 0.1, the odds ratio is 2.0, and the population size is 2000 individuals.
[Bar charts: Power (%) by allele frequency for (a) epistasis, (b) main effect, and (c) full effect detection.]
Figure 7: Distribution of the Power by allele frequency for all disease models. The population size is 2000 individuals, the odds ratio is 2.0, and the prevalence is 0.02.
B Table of Results
Table 1: A table containing the percentage of true positives and false positives in each configuration. The first column contains the description of the configuration. The second and third columns contain the number of data sets with true positives and false positives, respectively, out of all 100 data sets per configuration.
Configuration*              TP (%)  FP (%)
0.5,500,I,2.0,0.02          26      4
0.5,500,I,2.0,0.0001        20      3
0.5,500,I,1.5,0.02          19      2
0.5,500,I,1.5,0.0001        1       4
0.5,500,I,1.1,0.02          0       1
0.5,500,I,1.1,0.0001        0       3
0.5,2000,I,2.0,0.02         100     8
0.5,2000,I,2.0,0.0001       100     17
0.5,2000,I,1.5,0.02         100     11
0.5,2000,I,1.5,0.0001       67      2
0.5,2000,I,1.1,0.02         15      6
0.5,2000,I,1.1,0.0001       7       7
0.5,1000,I,2.0,0.02         91      8
0.5,1000,I,2.0,0.0001       90      7
0.5,1000,I,1.5,0.02         79      8
0.5,1000,I,1.5,0.0001       13      4
0.5,1000,I,1.1,0.02         1       5
0.5,1000,I,1.1,0.0001       0       6
0.3,500,I,2.0,0.02          14      6
0.3,500,I,2.0,0.0001        46      6
0.3,500,I,1.5,0.02          15      3
0.3,500,I,1.5,0.0001        12      2
0.3,500,I,1.1,0.02          0       4
0.3,500,I,1.1,0.0001        0       1
0.3,2000,I,2.0,0.02         100     6
0.3,2000,I,2.0,0.0001       100     10
0.3,2000,I,1.5,0.02         100     30
0.3,2000,I,1.5,0.0001       100     10
0.3,2000,I,1.1,0.02         7       4
0.3,2000,I,1.1,0.0001       8       6
0.3,1000,I,2.0,0.02         66      2
0.3,1000,I,2.0,0.0001       100     8
0.3,1000,I,1.5,0.02         81      10
0.3,1000,I,1.5,0.0001       71      6
0.3,1000,I,1.1,0.02         0       2
0.3,1000,I,1.1,0.0001       0       3
0.1,500,I,2.0,0.02          1       7
0.1,500,I,2.0,0.0001        0       1
0.1,500,I,1.5,0.02          1       3
0.1,500,I,1.5,0.0001        1       5
0.1,500,I,1.1,0.02          0       7
0.1,500,I,1.1,0.0001        0       4
0.1,2000,I,2.0,0.02         94      21
0.1,2000,I,2.0,0.0001       91      7
0.1,2000,I,1.5,0.02         95      9
0.1,2000,I,1.5,0.0001       77      6
0.1,2000,I,1.1,0.02         27      5
0.1,2000,I,1.1,0.0001       24      2
0.1,1000,I,2.0,0.02         41      5
0.1,1000,I,2.0,0.0001       13      7
0.1,1000,I,1.5,0.02         36      4
0.1,1000,I,1.5,0.0001       10      3
0.1,1000,I,1.1,0.02         0       6
0.1,1000,I,1.1,0.0001       0       5
0.05,500,I,2.0,0.02         0       7
0.05,500,I,2.0,0.0001       1       2
0.05,500,I,1.5,0.02         0       8
0.05,500,I,1.5,0.0001       0       6
0.05,500,I,1.1,0.02         0       11
0.05,500,I,1.1,0.0001       0       3
0.05,2000,I,2.0,0.02        7       2
0.05,2000,I,2.0,0.0001      70      35
0.05,2000,I,1.5,0.02        65      49
0.05,2000,I,1.5,0.0001      47      28
0.05,2000,I,1.1,0.02        0       7
0.05,2000,I,1.1,0.0001      0       0
0.05,1000,I,2.0,0.02        0       4
0.05,1000,I,2.0,0.0001      8       8
0.05,1000,I,1.5,0.02        11      7
0.05,1000,I,1.5,0.0001      1       5
0.05,1000,I,1.1,0.02        0       4
0.05,1000,I,1.1,0.0001      0       6
0.01,500,I,2.0,0.02         0       4
0.01,500,I,2.0,0.0001       0       2
0.01,500,I,1.5,0.02         0       5
0.01,500,I,1.5,0.0001       0       4
0.01,500,I,1.1,0.02         0       5
0.01,500,I,1.1,0.0001       0       7
0.01,2000,I,2.0,0.02        0       2
0.01,2000,I,2.0,0.0001      1       6
0.01,2000,I,1.5,0.02        0       4
0.01,2000,I,1.5,0.0001      1       1
0.01,2000,I,1.1,0.02        0       2
0.01,2000,I,1.1,0.0001      0       6
0.01,1000,I,2.0,0.02        0       7
0.01,1000,I,2.0,0.0001      0       4
0.01,1000,I,1.5,0.02        0       2
0.01,1000,I,1.5,0.0001      0       4
0.01,1000,I,1.1,0.02        0       5
0.01,1000,I,1.1,0.0001      0       8
0.5,500,ME,2.0,0.02         100     97
0.5,500,ME,2.0,0.0001       100     95
0.5,500,ME,1.5,0.02         100     63
0.5,500,ME,1.5,0.0001       100     61
0.5,500,ME,1.1,0.02         81      30
0.5,500,ME,1.1,0.0001       78      22
0.5,2000,ME,2.0,0.02        100     100
0.5,2000,ME,2.0,0.0001      100     100
0.5,2000,ME,1.5,0.02        100     100
0.5,2000,ME,1.5,0.0001      100     100
0.5,2000,ME,1.1,0.02        100     100
0.5,2000,ME,1.1,0.0001      100     97
0.5,1000,ME,2.0,0.02        100     100
0.5,1000,ME,2.0,0.0001      100     100
0.5,1000,ME,1.5,0.02        100     100
0.5,1000,ME,1.5,0.0001      100     97
0.5,1000,ME,1.1,0.02        100     61
0.5,1000,ME,1.1,0.0001      100     59
0.3,500,ME,2.0,0.02         100     78
0.3,500,ME,2.0,0.0001       100     82
0.3,500,ME,1.5,0.02         90      29
0.3,500,ME,1.5,0.0001       86      33
0.3,500,ME,1.1,0.02         25      17
0.3,500,ME,1.1,0.0001       18      10
0.3,2000,ME,2.0,0.02        100     100
0.3,2000,ME,2.0,0.0001      100     100
0.3,2000,ME,1.5,0.02        100     99
0.3,2000,ME,1.5,0.0001      100     100
0.3,2000,ME,1.1,0.02        100     57
0.3,2000,ME,1.1,0.0001      100     51
0.3,1000,ME,2.0,0.02        100     99
0.3,1000,ME,2.0,0.0001      100     100
0.3,1000,ME,1.5,0.02        100     69
0.3,1000,ME,1.5,0.0001      100     66
0.3,1000,ME,1.1,0.02        92      28
0.3,1000,ME,1.1,0.0001      77      27
0.1,500,ME,2.0,0.02         2       12
0.1,500,ME,2.0,0.0001       8       14
0.1,500,ME,1.5,0.02         0       7
0.1,500,ME,1.5,0.0001       0       8
0.1,500,ME,1.1,0.02         0       7
0.1,500,ME,1.1,0.0001       0       4
0.1,2000,ME,2.0,0.02        97      74
0.1,2000,ME,2.0,0.0001      98      71
0.1,2000,ME,1.5,0.02        33      22
0.1,2000,ME,1.5,0.0001      34      23
0.1,2000,ME,1.1,0.02        2       11
0.1,2000,ME,1.1,0.0001      1       5
0.1,1000,ME,2.0,0.02        43      23
0.1,1000,ME,2.0,0.0001      50      32
0.1,1000,ME,1.5,0.02        5       9
0.1,1000,ME,1.5,0.0001      1       9
0.1,1000,ME,1.1,0.02        0       10
0.1,1000,ME,1.1,0.0001      0       5
0.05,500,ME,2.0,0.02        0       1
0.05,500,ME,2.0,0.0001      0       4
0.05,500,ME,1.5,0.02        0       1
0.05,500,ME,1.5,0.0001      0       2
0.05,500,ME,1.1,0.02        0       3
0.05,500,ME,1.1,0.0001      0       2
0.05,2000,ME,2.0,0.02       14      11
0.05,2000,ME,2.0,0.0001     13      10
0.05,2000,ME,1.5,0.02       2       2
0.05,2000,ME,1.5,0.0001     1       6
0.05,2000,ME,1.1,0.02       0       3
0.05,2000,ME,1.1,0.0001     2       3
0.05,1000,ME,2.0,0.02       1       3
0.05,1000,ME,2.0,0.0001     4       6
0.05,1000,ME,1.5,0.02       0       0
0.05,1000,ME,1.5,0.0001     1       3
0.05,1000,ME,1.1,0.02       0       3
0.05,1000,ME,1.1,0.0001     0       2
0.01,500,ME,2.0,0.02        0       1
0.01,500,ME,2.0,0.0001      0       7
0.01,500,ME,1.5,0.02        0       1
0.01,500,ME,1.5,0.0001      0       7
0.01,500,ME,1.1,0.02        0       1
0.01,500,ME,1.1,0.0001      0       7
0.01,2000,ME,2.0,0.02       0       1
0.01,2000,ME,2.0,0.0001     0       4
0.01,2000,ME,1.5,0.02       0       5
0.01,2000,ME,1.5,0.0001     0       5
0.01,2000,ME,1.1,0.02       0       5
0.01,2000,ME,1.1,0.0001     0       5
0.01,1000,ME,2.0,0.02       0       7
0.01,1000,ME,2.0,0.0001     0       3
0.01,1000,ME,1.5,0.02       0       8
0.01,1000,ME,1.5,0.0001     0       3
0.01,1000,ME,1.1,0.02       0       4
0.01,1000,ME,1.1,0.0001     0       2
0.5,500,ME+I,2.0,0.02       100     100
0.5,500,ME+I,2.0,0.0001     100     100
0.5,500,ME+I,1.5,0.02       100     100
0.5,500,ME+I,1.5,0.0001     100     100
0.5,500,ME+I,1.1,0.02       100     89
0.5,500,ME+I,1.1,0.0001     100     86
0.5,2000,ME+I,2.0,0.02      100     100
0.5,2000,ME+I,2.0,0.0001    100     100
0.5,2000,ME+I,1.5,0.02      100     100
0.5,2000,ME+I,1.5,0.0001    100     100
0.5,2000,ME+I,1.1,0.02      100     100
0.5,2000,ME+I,1.1,0.0001    100     100
0.5,1000,ME+I,2.0,0.02      100     100
0.5,1000,ME+I,2.0,0.0001    100     100
0.5,1000,ME+I,1.5,0.02      100     100
0.5,1000,ME+I,1.5,0.0001    100     100
0.5,1000,ME+I,1.1,0.02      100     100
0.5,1000,ME+I,1.1,0.0001    100     100
0.3,500,ME+I,2.0,0.02       100     100
0.3,500,ME+I,2.0,0.0001     100     100
0.3,500,ME+I,1.5,0.02       100     79
0.3,500,ME+I,1.5,0.0001     100     93
0.3,500,ME+I,1.1,0.02       79      28
0.3,500,ME+I,1.1,0.0001     89      25
0.3,2000,ME+I,2.0,0.02      100     100
0.3,2000,ME+I,2.0,0.0001    100     100
0.3,2000,ME+I,1.5,0.02      100     100
0.3,2000,ME+I,1.5,0.0001    100     100
0.3,2000,ME+I,1.1,0.02      100     97
0.3,2000,ME+I,1.1,0.0001    100     99
0.3,1000,ME+I,2.0,0.02      100     100
0.3,1000,ME+I,2.0,0.0001    100     100
0.3,1000,ME+I,1.5,0.02      100     99
0.3,1000,ME+I,1.5,0.0001    100     100
0.3,1000,ME+I,1.1,0.02      100     63
0.3,1000,ME+I,1.1,0.0001    100     66
0.1,500,ME+I,2.0,0.02       1       15
0.1,500,ME+I,2.0,0.0001     30      34
0.1,500,ME+I,1.5,0.02       0       10
0.1,500,ME+I,1.5,0.0001     1       10
0.1,500,ME+I,1.1,0.02       0       5
0.1,500,ME+I,1.1,0.0001     1       9
0.1,2000,ME+I,2.0,0.02      98      81
0.1,2000,ME+I,2.0,0.0001    100     100
0.1,2000,ME+I,1.5,0.02      72      51
0.1,2000,ME+I,1.5,0.0001    46      38
0.1,2000,ME+I,1.1,0.02      4       14
0.1,2000,ME+I,1.1,0.0001    2       16
0.1,1000,ME+I,2.0,0.02      42      38
0.1,1000,ME+I,2.0,0.0001    91      70
0.1,1000,ME+I,1.5,0.02      2       18
0.1,1000,ME+I,1.5,0.0001    8       18
0.1,1000,ME+I,1.1,0.02      0       13
0.1,1000,ME+I,1.1,0.0001    0       14
0.05,500,ME+I,2.0,0.02      0       4
0.05,500,ME+I,2.0,0.0001    0       11
0.05,500,ME+I,1.5,0.02      0       3
0.05,500,ME+I,1.5,0.0001    1       11
0.05,500,ME+I,1.1,0.02      0       8
0.05,500,ME+I,1.1,0.0001    0       7
0.05,2000,ME+I,2.0,0.02     15      17
0.05,2000,ME+I,2.0,0.0001   27      30
0.05,2000,ME+I,1.5,0.02     1       13
0.05,2000,ME+I,1.5,0.0001   4       8
0.05,2000,ME+I,1.1,0.02     0       7
0.05,2000,ME+I,1.1,0.0001   0       7
0.05,1000,ME+I,2.0,0.02     2       16
0.05,1000,ME+I,2.0,0.0001   8       11
0.05,1000,ME+I,1.5,0.02     1       6
0.05,1000,ME+I,1.5,0.0001   1       5
0.05,1000,ME+I,1.1,0.02     0       5
0.05,1000,ME+I,1.1,0.0001   0       6
0.01,500,ME+I,2.0,0.02      0       10
0.01,500,ME+I,2.0,0.0001    0       13
0.01,500,ME+I,1.5,0.02      0       13
0.01,500,ME+I,1.5,0.0001    0       12
0.01,500,ME+I,1.1,0.02      0       7
0.01,500,ME+I,1.1,0.0001    0       11
0.01,2000,ME+I,2.0,0.02     0       7
0.01,2000,ME+I,2.0,0.0001   0       12
0.01,2000,ME+I,1.5,0.02     0       13
0.01,2000,ME+I,1.5,0.0001   0       8
0.01,2000,ME+I,1.1,0.02     0       16
0.01,2000,ME+I,1.1,0.0001   0       8
0.01,1000,ME+I,2.0,0.02     0       11
0.01,1000,ME+I,2.0,0.0001   0       8
0.01,1000,ME+I,1.5,0.02     0       13
0.01,1000,ME+I,1.5,0.0001   0       7
0.01,1000,ME+I,1.1,0.02     0       10
0.01,1000,ME+I,1.1,0.0001   0       3
*MAF,POP,MOD,OR,PREV where MAF represents the minor allele frequency, POP is the number of individuals, MOD is the used model (with or without main effect and with or without epistasis effect), OR is the odds ratio and PREV is the prevalence of the disease.
Table 2: A table containing the running time, CPU usage and memory usage in each configuration.
Configuration* Running Time (s) CPU Usage (%) Memory Usage (KB)
0.5,500,ME+I,2.0,0.02 0.16 95.70 1003.04
0.5,500,ME+I,2.0,0.0001 0.16 96.05 1003.56
0.5,500,ME+I,1.5,0.02 0.16 96.07 1001.04
0.5,500,ME+I,1.5,0.0001 0.16 95.08 991.88
0.5,500,ME+I,1.1,0.02 0.16 96.93 995.92
0.5,500,ME+I,1.1,0.0001 0.16 95.35 973.64
0.5,500,ME,2.0,0.02 0.16 96.97 997.92
0.5,500,ME,2.0,0.0001 0.16 95.79 980.40
0.5,500,ME,1.5,0.02 0.16 97.07 996.76
0.5,500,ME,1.5,0.0001 0.16 98.08 972.08
0.5,500,ME,1.1,0.02 0.16 97.83 993.84
0.5,500,ME,1.1,0.0001 0.16 97.93 971.52
0.5,500,I,2.0,0.02 0.16 98.11 996.16
0.5,500,I,2.0,0.0001 0.16 98.02 973.60
0.5,500,I,1.5,0.02 0.16 97.67 998.08
0.5,500,I,1.5,0.0001 0.16 98.41 970.00
0.5,500,I,1.1,0.02 0.16 97.59 995.36
0.5,500,I,1.1,0.0001 0.16 97.38 967.72
0.5,2000,ME+I,2.0,0.02 0.34 97.87 1226.44
0.5,2000,ME+I,2.0,0.0001 0.34 97.56 1217.20
0.5,2000,ME+I,1.5,0.02 0.32 97.89 1190.48
0.5,2000,ME+I,1.5,0.0001 0.32 97.26 1188.16
0.5,2000,ME+I,1.1,0.02 0.31 97.13 1171.92
0.5,2000,ME+I,1.1,0.0001 0.31 97.29 1171.28
0.5,2000,ME,2.0,0.02 0.32 97.95 1175.04
0.5,2000,ME,2.0,0.0001 0.32 97.69 1174.32
0.5,2000,ME,1.5,0.02 0.31 97.78 1169.32
0.5,2000,ME,1.5,0.0001 0.31 98.06 1168.92
0.5,2000,ME,1.1,0.02 0.31 97.20 1155.96
0.5,2000,ME,1.1,0.0001 0.31 97.86 1142.44
0.5,2000,I,2.0,0.02 0.31 98.14 1146.04
0.5,2000,I,2.0,0.0001 0.31 97.71 1134.80
0.5,2000,I,1.5,0.02 0.31 97.82 1155.12
0.5,2000,I,1.5,0.0001 0.31 98.39 1132.64
0.5,2000,I,1.1,0.02 0.31 97.77 1149.60
0.5,2000,I,1.1,0.0001 0.31 98.57 1127.76
0.5,1000,ME+I,2.0,0.02 0.22 98.79 1071.92
0.5,1000,ME+I,2.0,0.0001 0.22 98.28 1053.88
0.5,1000,ME+I,1.5,0.02 0.22 95.92 1059.68
0.5,1000,ME+I,1.5,0.0001 0.22 96.97 1046.96
0.5,1000,ME+I,1.1,0.02 0.21 97.21 1053.52
0.5,1000,ME+I,1.1,0.0001 0.21 98.45 1028.16
0.5,1000,ME,2.0,0.02 0.22 98.66 1056.92
0.5,1000,ME,2.0,0.0001 0.21 98.77 1036.44
0.5,1000,ME,1.5,0.02 0.21 98.88 1049.04
0.5,1000,ME,1.5,0.0001 0.21 98.09 1016.32
0.5,1000,ME,1.1,0.02 0.21 98.19 1043.88
0.5,1000,ME,1.1,0.0001 0.21 98.18 1002.92
0.5,1000,I,2.0,0.02 0.21 97.86 1045.96
0.5,1000,I,2.0,0.0001 0.21 97.54 1003.00
0.5,1000,I,1.5,0.02 0.21 96.90 1048.20
0.5,1000,I,1.5,0.0001 0.21 97.46 1006.16
0.5,1000,I,1.1,0.02 0.21 96.44 1044.28
0.5,1000,I,1.1,0.0001 0.21 97.60 1000.88
0.3,500,ME+I,2.0,0.02 0.16 98.13 998.44
0.3,500,ME+I,2.0,0.0001 0.16 98.23 1003.00
0.3,500,ME+I,1.5,0.02 0.16 97.97 996.72
0.3,500,ME+I,1.5,0.0001 0.16 98.37 983.32
0.3,500,ME+I,1.1,0.02 0.16 98.12 997.60
0.3,500,ME+I,1.1,0.0001 0.16 97.90 966.76
0.3,500,ME,2.0,0.02 0.16 98.13 998.60
0.3,500,ME,2.0,0.0001 0.16 98.44 971.80
0.3,500,ME,1.5,0.02 0.16 98.78 997.36
0.3,500,ME,1.5,0.0001 0.16 98.05 969.24
0.3,500,ME,1.1,0.02 0.16 98.11 996.00
0.3,500,ME,1.1,0.0001 0.16 98.36 968.72
0.3,500,I,2.0,0.02 0.16 98.39 997.20
0.3,500,I,2.0,0.0001 0.16 98.55 978.32
0.3,500,I,1.5,0.02 0.16 98.36 995.88
0.3,500,I,1.5,0.0001 0.16 97.51 964.56
0.3,500,I,1.1,0.02 0.16 97.83 997.64
0.3,500,I,1.1,0.0001 0.16 97.29 968.68
0.3,2000,ME+I,2.0,0.02 0.32 97.82 1184.36
0.3,2000,ME+I,2.0,0.0001 0.34 96.01 1225.96
0.3,2000,ME+I,1.5,0.02 0.31 96.61 1171.32
0.3,2000,ME+I,1.5,0.0001 0.32 96.50 1175.60
0.3,2000,ME+I,1.1,0.02 0.31 96.16 1150.92
0.3,2000,ME+I,1.1,0.0001 0.31 97.24 1148.96
0.3,2000,ME,2.0,0.02 0.31 97.27 1170.44
0.3,2000,ME,2.0,0.0001 0.31 95.77 1171.48
0.3,2000,ME,1.5,0.02 0.31 96.48 1160.12
0.3,2000,ME,1.5,0.0001 0.31 97.31 1151.52
0.3,2000,ME,1.1,0.02 0.31 96.43 1150.12
0.3,2000,ME,1.1,0.0001 0.31 95.85 1129.60
0.3,2000,I,2.0,0.02 0.31 95.56 1153.76
0.3,2000,I,2.0,0.0001 0.31 95.81 1131.24
0.3,2000,I,1.5,0.02 0.31 97.12 1154.00
0.3,2000,I,1.5,0.0001 0.31 98.11 1134.88
0.3,2000,I,1.1,0.02 0.31 97.10 1147.48
0.3,2000,I,1.1,0.0001 0.31 95.79 1128.28
0.3,1000,ME+I,2.0,0.02 0.21 94.97 1058.36
0.3,1000,ME+I,2.0,0.0001 0.22 96.59 1059.72
0.3,1000,ME+I,1.5,0.02 0.21 96.59 1053.20
0.3,1000,ME+I,1.5,0.0001 0.21 96.29 1043.68
0.3,1000,ME+I,1.1,0.02 0.21 98.29 1047.48
0.3,1000,ME+I,1.1,0.0001 0.21 97.93 1006.04
0.3,1000,ME,2.0,0.02 0.21 97.33 1047.56
0.3,1000,ME,2.0,0.0001 0.21 98.02 1025.24
0.3,1000,ME,1.5,0.02 0.21 95.34 1048.40
0.3,1000,ME,1.5,0.0001 0.21 95.54 1006.60
0.3,1000,ME,1.1,0.02 0.21 96.92 1040.96
0.3,1000,ME,1.1,0.0001 0.21 97.23 1001.80
0.3,1000,I,2.0,0.02 0.21 97.65 1046.96
0.3,1000,I,2.0,0.0001 0.21 96.54 1010.36
0.3,1000,I,1.5,0.02 0.21 95.78 1051.16
0.3,1000,I,1.5,0.0001 0.21 96.49 1006.16
0.3,1000,I,1.1,0.02 0.21 96.77 1045.64
0.3,1000,I,1.1,0.0001 0.21 96.72 999.52
0.1,500,ME+I,2.0,0.02 0.16 96.61 995.84
0.1,500,ME+I,2.0,0.0001 0.16 95.81 971.08
0.1,500,ME+I,1.5,0.02 0.16 95.41 995.32
0.1,500,ME+I,1.5,0.0001 0.16 97.03 966.40
0.1,500,ME+I,1.1,0.02 0.16 96.58 995.56
0.1,500,ME+I,1.1,0.0001 0.16 97.02 973.00
0.1,500,ME,2.0,0.02 0.16 96.29 995.56
0.1,500,ME,2.0,0.0001 0.16 96.80 968.88
0.1,500,ME,1.5,0.02 0.16 95.68 999.52
0.1,500,ME,1.5,0.0001 0.16 96.15 968.08
0.1,500,ME,1.1,0.02 0.16 95.78 996.88
0.1,500,ME,1.1,0.0001 0.16 96.61 968.36
0.1,500,I,2.0,0.02 0.16 96.69 996.84
0.1,500,I,2.0,0.0001 0.16 96.31 970.08
0.1,500,I,1.5,0.02 0.16 96.01 996.00
0.1,500,I,1.5,0.0001 0.16 95.68 970.40
0.1,500,I,1.1,0.02 0.16 95.82 996.20
0.1,500,I,1.1,0.0001 0.16 96.19 969.84
0.1,2000,ME+I,2.0,0.02 0.31 96.47 1154.56
0.1,2000,ME+I,2.0,0.0001 0.31 97.14 1157.40
0.1,2000,ME+I,1.5,0.02 0.31 97.21 1151.28
0.1,2000,ME+I,1.5,0.0001 0.31 98.25 1131.32
0.1,2000,ME+I,1.1,0.02 0.31 98.12 1148.64
0.1,2000,ME+I,1.1,0.0001 0.31 96.25 1125.76
0.1,2000,ME,2.0,0.02 0.31 97.41 1153.04
0.1,2000,ME,2.0,0.0001 0.31 97.62 1139.76
0.1,2000,ME,1.5,0.02 0.31 97.73 1149.56
0.1,2000,ME,1.5,0.0001 0.31 97.73 1131.12
0.1,2000,ME,1.1,0.02 0.31 97.87 1148.72
0.1,2000,ME,1.1,0.0001 0.31 97.22 1129.48
0.1,2000,I,2.0,0.02 0.31 97.24 1150.88
0.1,2000,I,2.0,0.0001 0.31 95.68 1133.92
0.1,2000,I,1.5,0.02 0.31 94.91 1151.36
0.1,2000,I,1.5,0.0001 0.31 95.62 1135.04
0.1,2000,I,1.1,0.02 0.31 94.25 1148.12
0.1,2000,I,1.1,0.0001 0.31 96.06 1132.36
0.1,1000,ME+I,2.0,0.02 0.21 97.34 1043.80
0.1,1000,ME+I,2.0,0.0001 0.21 97.42 1012.40
0.1,1000,ME+I,1.5,0.02 0.21 97.26 1044.76
0.1,1000,ME+I,1.5,0.0001 0.21 97.05 1000.32
0.1,1000,ME+I,1.1,0.02 0.21 96.66 1042.76
0.1,1000,ME+I,1.1,0.0001 0.21 98.06 1000.16
0.1,1000,ME,2.0,0.02 0.21 97.68 1043.00
0.1,1000,ME,2.0,0.0001 0.21 97.66 1003.92
0.1,1000,ME,1.5,0.02 0.21 97.74 1043.64
0.1,1000,ME,1.5,0.0001 0.21 97.69 1001.60
0.1,1000,ME,1.1,0.02 0.21 97.95 1046.68
0.1,1000,ME,1.1,0.0001 0.21 97.83 1002.68
0.1,1000,I,2.0,0.02 0.21 96.10 1046.00
0.1,1000,I,2.0,0.0001 0.21 95.81 1003.88
0.1,1000,I,1.5,0.02 0.21 96.72 1041.84
0.1,1000,I,1.5,0.0001 0.21 96.96 1000.64
0.1,1000,I,1.1,0.02 0.21 96.48 1045.88
0.1,1000,I,1.1,0.0001 0.21 96.97 1003.84
0.05,500,ME+I,2.0,0.02 0.16 97.24 997.76
0.05,500,ME+I,2.0,0.0001 0.16 96.63 969.04
0.05,500,ME+I,1.5,0.02 0.16 97.31 995.32
0.05,500,ME+I,1.5,0.0001 0.16 97.56 966.84
0.05,500,ME+I,1.1,0.02 0.16 97.33 995.60
0.05,500,ME+I,1.1,0.0001 0.16 97.15 969.08
0.05,500,ME,2.0,0.02 0.16 96.23 997.08
0.05,500,ME,2.0,0.0001 0.16 97.46 971.00
0.05,500,ME,1.5,0.02 0.16 97.55 994.76
0.05,500,ME,1.5,0.0001 0.16 96.71 967.52
0.05,500,ME,1.1,0.02 0.16 95.80 995.36
0.05,500,ME,1.1,0.0001 0.16 96.52 969.68
0.05,500,I,2.0,0.02 0.16 98.23 995.56
0.05,500,I,2.0,0.0001 0.16 96.40 967.08
0.05,500,I,1.5,0.02 0.16 96.81 998.12
0.05,500,I,1.5,0.0001 0.16 96.51 971.36
0.05,500,I,1.1,0.02 0.16 96.55 996.88
0.05,500,I,1.1,0.0001 0.16 96.97 973.32
0.05,2000,ME+I,2.0,0.02 0.31 98.08 1149.36
0.05,2000,ME+I,2.0,0.0001 0.31 97.88 1131.84
0.05,2000,ME+I,1.5,0.02 0.31 97.97 1145.80
0.05,2000,ME+I,1.5,0.0001 0.31 97.98 1127.56
0.05,2000,ME+I,1.1,0.02 0.31 97.77 1145.32
0.05,2000,ME+I,1.1,0.0001 0.31 98.04 1127.08
0.05,2000,ME,2.0,0.02 0.31 98.05 1149.92
0.05,2000,ME,2.0,0.0001 0.31 98.14 1128.48
0.05,2000,ME,1.5,0.02 0.31 98.21 1146.80
0.05,2000,ME,1.5,0.0001 0.31 98.15 1128.40
0.05,2000,ME,1.1,0.02 0.31 98.11 1148.16
0.05,2000,ME,1.1,0.0001 0.31 97.26 1124.52
0.05,2000,I,2.0,0.02 0.31 97.86 1126.84
0.05,2000,I,2.0,0.0001 0.31 97.84 1135.60
0.05,2000,I,1.5,0.02 0.31 98.56 1155.56
0.05,2000,I,1.5,0.0001 0.31 98.07 1134.72
0.05,2000,I,1.1,0.02 0.31 97.36 1145.32
0.05,2000,I,1.1,0.0001 0.31 98.04 1127.44
0.05,1000,ME+I,2.0,0.02 0.21 96.67 1042.08
0.05,1000,ME+I,2.0,0.0001 0.21 96.80 999.68
0.05,1000,ME+I,1.5,0.02 0.21 95.98 1043.12
0.05,1000,ME+I,1.5,0.0001 0.21 97.60 1000.24
0.05,1000,ME+I,1.1,0.02 0.21 97.59 1044.92
0.05,1000,ME+I,1.1,0.0001 0.21 97.92 998.92
0.05,1000,ME,2.0,0.02 0.21 95.85 1044.24
0.05,1000,ME,2.0,0.0001 0.21 96.70 1001.08
0.05,1000,ME,1.5,0.02 0.21 97.88 1041.64
0.05,1000,ME,1.5,0.0001 0.21 97.31 998.80
0.05,1000,ME,1.1,0.02 0.21 97.01 1043.12
0.05,1000,ME,1.1,0.0001 0.21 98.04 1000.96
0.05,1000,I,2.0,0.02 0.21 97.76 1039.76
0.05,1000,I,2.0,0.0001 0.21 97.61 1006.80
0.05,1000,I,1.5,0.02 0.22 97.33 1045.32
0.05,1000,I,1.5,0.0001 0.21 97.58 1000.04
0.05,1000,I,1.1,0.02 0.21 97.67 1043.52
0.05,1000,I,1.1,0.0001 0.21 97.16 1000.44
0.01,500,ME+I,2.0,0.02 0.16 95.84 995.28
0.01,500,ME+I,2.0,0.0001 0.16 96.53 967.96
0.01,500,ME+I,1.5,0.02 0.16 97.68 995.84
0.01,500,ME+I,1.5,0.0001 0.16 97.60 971.52
0.01,500,ME+I,1.1,0.02 0.16 96.47 995.80
0.01,500,ME+I,1.1,0.0001 0.16 97.49 965.56
0.01,500,ME,2.0,0.02 0.16 96.47 995.40
0.01,500,ME,2.0,0.0001 0.16 97.93 965.68
0.01,500,ME,1.5,0.02 0.16 97.54 995.92
0.01,500,ME,1.5,0.0001 0.16 96.80 965.56
0.01,500,ME,1.1,0.02 0.16 98.11 995.92
0.01,500,ME,1.1,0.0001 0.16 97.90 965.64
0.01,500,I,2.0,0.02 0.16 96.09 997.20
0.01,500,I,2.0,0.0001 0.16 96.16 968.28
0.01,500,I,1.5,0.02 0.16 95.26 997.04
0.01,500,I,1.5,0.0001 0.16 97.09 968.28
0.01,500,I,1.1,0.02 0.16 97.41 994.24
0.01,500,I,1.1,0.0001 0.16 97.18 966.48
0.01,2000,ME+I,2.0,0.02 0.31 97.00 1146.36
0.01,2000,ME+I,2.0,0.0001 0.31 96.89 1128.12
0.01,2000,ME+I,1.5,0.02 0.31 96.86 1140.88
0.01,2000,ME+I,1.5,0.0001 0.31 97.57 1125.52
0.01,2000,ME+I,1.1,0.02 0.31 97.07 1148.24
0.01,2000,ME+I,1.1,0.0001 0.31 98.56 1125.16
0.01,2000,ME,2.0,0.02 0.31 98.16 1145.76
0.01,2000,ME,2.0,0.0001 0.31 97.32 1128.20
0.01,2000,ME,1.5,0.02 0.31 97.36 1140.92
0.01,2000,ME,1.5,0.0001 0.31 97.76 1125.64
0.01,2000,ME,1.1,0.02 0.31 98.06 1140.92
0.01,2000,ME,1.1,0.0001 0.31 97.91 1125.76
0.01,2000,I,2.0,0.02 0.31 98.10 1144.24
0.01,2000,I,2.0,0.0001 0.31 98.33 1128.08
0.01,2000,I,1.5,0.02 0.31 97.82 1148.32
0.01,2000,I,1.5,0.0001 0.31 97.97 1128.08
0.01,2000,I,1.1,0.02 0.31 97.91 1144.64
0.01,2000,I,1.1,0.0001 0.31 98.36 1125.80
0.01,1000,ME+I,2.0,0.02 0.21 97.16 1046.00
0.01,1000,ME+I,2.0,0.0001 0.21 97.67 1001.36
0.01,1000,ME+I,1.5,0.02 0.21 97.29 1043.36
0.01,1000,ME+I,1.5,0.0001 0.21 97.29 998.12
0.01,1000,ME+I,1.1,0.02 0.21 97.45 1046.16
0.01,1000,ME+I,1.1,0.0001 0.21 97.45 1000.64
0.01,1000,ME,2.0,0.02 0.21 97.59 1043.72
0.01,1000,ME,2.0,0.0001 0.21 97.19 998.16
0.01,1000,ME,1.5,0.02 0.21 96.83 1042.20
0.01,1000,ME,1.5,0.0001 0.21 97.27 998.08
0.01,1000,ME,1.1,0.02 0.21 97.37 1046.36
0.01,1000,ME,1.1,0.0001 0.21 97.50 1000.84
0.01,1000,I,2.0,0.02 0.21 97.94 1044.48
0.01,1000,I,2.0,0.0001 0.21 96.81 1001.80
0.01,1000,I,1.5,0.02 0.21 97.98 1043.52
0.01,1000,I,1.5,0.0001 0.21 98.02 1000.44
0.01,1000,I,1.1,0.02 0.21 97.05 1042.88
0.01,1000,I,1.1,0.0001 0.21 97.81 1002.12
*MAF,POP,MOD,OR,PREV, where MAF represents the minor allele frequency, POP is the number of individuals, MOD is the model used (with or without main effect and with or without epistasis effect), OR is the odds ratio, and PREV is the prevalence of the disease.
Laboratory Note
Genetic Epistasis IV - Assessing Algorithm Screen and Clean
LN-4-2014
Ricardo Pinho and Rui Camacho
FEUP
Rua Dr. Roberto Frias, s/n, 4200-465 Porto
Portugal
Fax: (+351) 22 508 1440
e-mail: [email protected]
www: http://www.fe.up.pt/∼[email protected]
www: http://www.fe.up.pt/∼rcamacho
May 2014
Abstract
In this lab note, the algorithm Screen and Clean is presented. As the name indicates, the algorithm screens all relevant SNPs and fits them using regression models for main effects and interactions. The second stage cleans the previously selected terms using a separate portion of the data, removing possible false positives. The results show that the algorithm is nearly incapable of finding epistatic interactions, but produces reasonable results in main effect and full effect detection for data sets with high allele frequencies. The scalability of the algorithm is poor due to the steep increase in running time.
1 Introduction
Screen and Clean [WDR+10] is a two-stage algorithm that creates a dictionary of disease-related SNPs and SNP interactions, which contracts or expands during a multi-step statistical procedure and is then revised to control the Type I Error Rate. In the beginning, a dictionary including all SNPs with minor allele frequency above 0.01 is created. If the number of SNPs is greater than the specified upper limit of covariates allowed to enter the screening process, the SNPs with the lowest marginal p-values are selected. The data is divided between step 1 (screen) and step 2 (clean). In step 1, a screening stage is applied to restrict the number of terms. This restriction is applied using regression models for main effects or interactions. For main effect models, the function used is
g(E[Y|X]) = β0 + Σ_{j=1}^{N} βj Xj    (1)
where g is an appropriate link function, Xj is the encoded genotype value (0, 1, or 2), and Y is the encoded phenotype (0 or 1). According to the selected SNPs, the algorithm then tries to find relevant interacting SNPs that fit the following interaction model:
g(E[Y|X]) = β0 + Σ_{j=1}^{N} βj Xj + Σ_{i<j; i,j=1,...,N} βij Xi Xj    (2)
where S = {j : βj ≠ 0, j ∈ 1,...,L} ∪ {(i,j) : βij ≠ 0, (i,j) ∈ 1,...,L} is the set of terms associated with the phenotype as main effects or interactions. Cross-validation is applied at this stage to impose a further restriction. In stage 2, the resulting dictionary is cleaned, retaining only terms with p-values < α. This is done using the traditional t-statistic obtained from a least squares analysis of the screened model.
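The two-stage logic can be sketched in a deliberately simplified form. The sketch below is illustrative only, not the authors' R implementation: the even data split, the marginal-correlation screen (a stand-in for the lasso screen), the quantitative "gaussian" response, and all function names are assumptions, and the Bonferroni correction on α is omitted.

```python
import numpy as np
from math import erf, sqrt

def screen_and_clean(X, y, n_screen=10, alpha=0.05):
    """Toy two-stage sketch: screen SNPs on one half of the data,
    then clean them with least-squares t-tests on the other half."""
    n, p = X.shape
    half = n // 2
    Xs, ys = X[:half], y[:half]   # screening split (step 1)
    Xc, yc = X[half:], y[half:]   # cleaning split (step 2)

    # Step 1 (screen): rank SNPs by absolute marginal correlation
    # with the phenotype and keep the n_screen best candidates.
    r = np.array([abs(np.corrcoef(Xs[:, j], ys)[0, 1]) for j in range(p)])
    keep = np.argsort(r)[::-1][:n_screen]

    # Step 2 (clean): multivariate least squares on the held-out half;
    # retain SNPs whose coefficient t-statistic has p-value < alpha.
    A = np.column_stack([np.ones(len(yc)), Xc[:, keep]])
    beta, *_ = np.linalg.lstsq(A, yc, rcond=None)
    dof = len(yc) - A.shape[1]
    sigma2 = float(np.sum((yc - A @ beta) ** 2)) / dof
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(A.T @ A)))
    # two-sided p-values via a normal approximation to the t-distribution
    pvals = [2 * (1 - 0.5 * (1 + erf(abs(b / s) / sqrt(2))))
             for b, s in zip(beta, se)]
    # pvals[0] belongs to the intercept, hence the j + 1 offset
    return [int(keep[j]) for j in range(len(keep)) if pvals[j + 1] < alpha]
```

Simulating, say, 20 SNPs where only the first carries a main effect, the cleaned set should contain that SNP: the screen keeps it among the top candidates, and the clean confirms it on independent data.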
1.1 Input files
The input consists of two files: one containing the phenotype of all individuals and one containing the genotype, with all SNPs, for all individuals.
1.2 Output files
There are many outputs available:
(a) Genotype

rs1, rs2, rs3, rs4, rs5
0, 2, 0, 0, 0
0, 1, 1, 0, 0
0, 1, 1, 0, 0
1, 1, 0, 1, 0
0, 1, 1, 1, 1
0, 1, 2, 1, 0

(b) Phenotype

Label
0
1
0
0
1
1

Table 1: An example of the input files containing genotype and phenotype information for 5 SNPs and 6 individuals. Genotypes 0, 1, and 2 correspond to homozygous dominant, heterozygous, and homozygous recessive. Phenotypes 0 and 1 correspond to control and case, respectively.
• snp screen - a vector of the column names of the SNPs picked by the screen.

• snp screen2 - a vector of the SNP pairs (at most K pairs) retained by the second lasso screen.

• snp clean - a vector of screened SNPs also retained by the multivariate regression clean.

• clean - a data frame with regression output for all of the screened SNPs. The snp2 column contains the pairwise interaction or "NA" for main effects.

• final - a data frame with output from the regression of the phenotype on the final cleaned SNPs.
1.3 Parameters
The following parameters can be configured:
• L - number of SNPs to be retained with the smallest p-values.
• K pairs - Number of pairwise interactions to be retained by the lasso.
• response - The type of phenotype. Can be binomial or gaussian.
• alpha - The Bonferroni correction lower bound limit for retention ofSNPs.
• snp fix - Index of SNPs that are forced into the lasso and multivariateregression models. Optional.
• cov struct - Matrix of covariates that are forced in every model fit byScreen & Clean. Optional.
• standardize - If true, the genotypes coded as 0, 1, or 2 are centered to mean 0 and scaled to standard deviation 1. The data must be standardized to run the Screen & Clean procedure.
2 Experimental Settings
The data sets used in the experiments are characterized in Lab Note 1. The computer used for these experiments ran the 64-bit Ubuntu 13.10 operating system, with an Intel(R) Core(TM)2 Quad Q6600 2.40 GHz processor and 8.00 GB of RAM. The parameters used for these experiments were: L = 200, K pairs = 100, response = "binomial", standardize = TRUE, alpha = 0.05.
3 Results
The results in Figure 1 are interesting. For epistasis detection, the Power is nearly 0 in all configurations. In main effect detection, the ground truth is detected in data sets with an allele frequency higher than 0.05. In full effect detection, only data sets with an allele frequency of 0.3 or higher show any Power. There is no clear pattern between Power and data set size.
The scalability test shows a clear increase in running time as the number of individuals in the data sets grows, which may become a serious obstacle for larger data sets. Memory usage also increases with data set size, but not as significantly as running time. CPU usage shows no clear increase.
The Type I Error Rate in epistatic detection shows a seemingly random distribution; overall, the error rate is fairly constant. For main effect detection, the error rate increases with population size and allele frequency. This is even clearer in full effect detection, reaching a maximum of 84% in the configuration with 2000 individuals and 0.5 allele frequency.
From Figure 4 and Figure 5 there is only an indication of Power for data sets with 2000 individuals, except for full effect data sets, exactly the same as with the odds ratio variation. The prevalence variation shows a small Power
[Figure 1 bar charts. Power (%) by allele frequency (0.01, 0.05, 0.1, 0.3, 0.5) for 500, 1000, and 2000 individuals.
(a) Epistasis - 500: 0, 0, 0, 0, 0; 1000: 0, 0, 0, 0, 0; 2000: 0, 0, 6, 2, 0.
(b) Main Effect - 500: 0, 0, 0, 20, 54; 1000: 0, 0, 0, 54, 70; 2000: 0, 0, 39, 58, 62.
(c) Full Effect - 500: 0, 0, 0, 30, 49; 1000: 0, 0, 0, 58, 73; 2000: 0, 0, 0, 40, 91.]
Figure 1: Average Power by allele frequency. For each frequency, three data set sizes were used to measure the Power, with an odds ratio of 2.0 and a prevalence of 0.02. The Power is measured as the number of data sets, out of all 100 data sets, where the ground truth was amongst the most relevant results. (a), (b), and (c) represent epistatic, main effect, and main effect + epistatic interactions.
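As defined in the captions of Figures 1 and 3, Power and Type I Error Rate reduce to two proportions over the 100 simulated data sets of a configuration. A minimal sketch (the helper name and the shape of the `reported` argument are hypothetical):

```python
def power_and_type1(reported, ground_truth):
    """reported: for each simulated data set, the list of SNP pairs
    the algorithm flagged as most relevant.
    Power = % of data sets whose reported pairs include the ground truth.
    Type I Error Rate = % of data sets reporting at least one false positive."""
    n = len(reported)
    power = 100.0 * sum(ground_truth in pairs for pairs in reported) / n
    type1 = 100.0 * sum(any(p != ground_truth for p in pairs)
                        for pairs in reported) / n
    return power, type1
```

With 100 data sets per configuration, a configuration where the true pair is recovered in 39 of them yields a Power of 39%, matching how the bar values in Figures 1 and 3 are read.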
[Figure 2 plots vs. number of individuals (500, 1000, 2000): (a) average running time (seconds), (b) average CPU usage (%), (c) average memory usage (Mbytes).]
Figure 2: Comparison of scalability measures between different sized data sets. The data sets have a minor allele frequency of 0.5, an odds ratio of 2.0, a prevalence of 0.02, and use the full effect disease model.
[Figure 3 bar charts. Type I Error Rate (%) by allele frequency (0.01, 0.05, 0.1, 0.3, 0.5) for 500, 1000, and 2000 individuals.
(a) Epistasis - 500: 15, 15, 14, 19, 18; 1000: 17, 22, 16, 15, 22; 2000: 18, 20, 21, 16, 14.
(b) Main Effect - 500: 14, 17, 21, 23, 15; 1000: 14, 21, 23, 28, 30; 2000: 13, 22, 36, 38, 48.
(c) Full Effect - 500: 18, 15, 19, 19, 37; 1000: 14, 21, 28, 35, 45; 2000: 14, 21, 33, 68, 84.]
Figure 3: Type I Error Rate by allele frequency. For each frequency, three data set sizes were used, with an odds ratio of 2.0 and a prevalence of 0.02. The Type I Error Rate is measured as the number of data sets, out of all 100 data sets, where false positives were amongst the most relevant results. (a), (b), and (c) represent epistatic, main effect, and main effect + epistatic interactions.
change in Figure 6, decreasing as the prevalence increases. Figure 7 reveals that there is only relevant Power at higher allele frequencies, and only for main and full effects.
4 Summary
Screen and Clean is a heuristic algorithm that applies regression models for main effect and epistatic detection and prunes statistically irrelevant interactions. The obtained results show very low Power for epistatic detection. In main effect detection, Power increases, especially for larger data sets; Power also increases with higher allele frequencies. In full effect detection, there is only Power for data sets with allele frequencies above 0.1. The scalability is poor, due to the large increase in running time across data set sizes. The Type I Error Rate does not vary significantly with population size or allele frequency in epistasis detection, contrary to main effect and full effect detection.
References
[WDR+10] Jing Wu, Bernie Devlin, Steven Ringquist, Massimo Trucco, and Kathryn Roeder. Screen and clean: a tool for identifying interactions in genome-wide association studies. Genetic Epidemiology, 34:275-285, 2010.
A Bar Graphs
[Figure 4 bar charts. Power (%) by population (500, 1000, 2000): (a) Epistasis: 0, 0, 6; (b) Main Effect: 0, 0, 39; (c) Full Effect: 0, 0, 0.]
Figure 4: Distribution of the Power by population for all disease models.The allele frequency is 0.1, the odds ratio is 2.0, and the prevalence is 0.02.
[Figure 5 bar charts. Power (%) by odds ratio (1.1, 1.5, 2.0): (a) Epistasis: 0, 0, 6; (b) Main Effect: 0, 0, 39; (c) Full Effect: 0, 0, 0.]
Figure 5: Distribution of the Power by odds ratios for all disease models. The allele frequency is 0.1, the population size is 2000 individuals, and the prevalence is 0.02.
[Figure 6 bar charts. Power (%) by prevalence (0.0001, 0.02): (a) Epistasis: 12, 0; (b) Main Effect: 74, 62; (c) Full Effect: 93, 91.]
Figure 6: Distribution of the Power by prevalence for all disease models. The allele frequency is 0.5, the odds ratio is 2.0, and the population size is 2000 individuals.
[Figure 7 bar charts. Power (%) by allele frequency (0.01, 0.05, 0.1, 0.3, 0.5): (a) Epistasis: 0, 0, 6, 2, 0; (b) Main Effect: 0, 0, 39, 58, 62; (c) Full Effect: 0, 0, 0, 40, 91.]
Figure 7: Distribution of the Power by allele frequency for all disease models. The population size is 2000 individuals, the odds ratio is 2.0, and the prevalence is 0.02.
B Table of Results
Table 2: Percentages of true positives and false positives for each configuration. The first column describes the configuration. The second and third columns contain the number of data sets with true positives and false positives, respectively, out of the 100 data sets per configuration.
Configuration* TP (%) FP (%)
0.5,500,I,2.0,0.02 0 18
0.5,500,I,2.0,0.0001 0 9
0.5,500,I,1.5,0.02 0 11
0.5,500,I,1.5,0.0001 0 17
0.5,500,I,1.1,0.02 0 18
0.5,500,I,1.1,0.0001 0 25
0.5,2000,I,2.0,0.02 0 14
0.5,2000,I,2.0,0.0001 12 19
0.5,2000,I,1.5,0.02 5 25
0.5,2000,I,1.5,0.0001 0 17
0.5,2000,I,1.1,0.02 1 20
0.5,2000,I,1.1,0.0001 0 10
0.5,1000,I,2.0,0.02 0 22
0.5,1000,I,2.0,0.0001 3 17
0.5,1000,I,1.5,0.02 2 11
0.5,1000,I,1.5,0.0001 0 17
0.5,1000,I,1.1,0.02 0 15
0.5,1000,I,1.1,0.0001 0 13
0.3,500,I,2.0,0.02 0 19
0.3,500,I,2.0,0.0001 0 14
0.3,500,I,1.5,0.02 0 14
0.3,500,I,1.5,0.0001 0 11
0.3,500,I,1.1,0.02 0 18
0.3,500,I,1.1,0.0001 0 14
0.3,2000,I,2.0,0.02 2 16
0.3,2000,I,2.0,0.0001 0 19
0.3,2000,I,1.5,0.02 0 20
0.3,2000,I,1.5,0.0001 1 11
0.3,2000,I,1.1,0.02 0 19
0.3,2000,I,1.1,0.0001 0 14
0.3,1000,I,2.0,0.02 0 15
0.3,1000,I,2.0,0.0001 0 14
0.3,1000,I,1.5,0.02 1 10
0.3,1000,I,1.5,0.0001 0 11
0.3,1000,I,1.1,0.02 0 13
0.3,1000,I,1.1,0.0001 0 12
0.1,500,I,2.0,0.02 0 14
0.1,500,I,2.0,0.0001 0 11
0.1,500,I,1.5,0.02 0 17
0.1,500,I,1.5,0.0001 0 13
0.1,500,I,1.1,0.02 0 8
0.1,500,I,1.1,0.0001 0 16
0.1,2000,I,2.0,0.02 6 21
0.1,2000,I,2.0,0.0001 1 19
0.1,2000,I,1.5,0.02 0 13
0.1,2000,I,1.5,0.0001 0 15
0.1,2000,I,1.1,0.02 0 17
0.1,2000,I,1.1,0.0001 0 16
0.1,1000,I,2.0,0.02 0 16
0.1,1000,I,2.0,0.0001 0 20
0.1,1000,I,1.5,0.02 0 11
0.1,1000,I,1.5,0.0001 0 12
0.1,1000,I,1.1,0.02 0 17
0.1,1000,I,1.1,0.0001 0 10
0.05,500,I,2.0,0.02 0 15
0.05,500,I,2.0,0.0001 0 5
0.05,500,I,1.5,0.02 0 23
0.05,500,I,1.5,0.0001 0 21
0.05,500,I,1.1,0.02 0 19
0.05,500,I,1.1,0.0001 0 17
0.05,2000,I,2.0,0.02 0 20
0.05,2000,I,2.0,0.0001 0 14
0.05,2000,I,1.5,0.02 2 25
0.05,2000,I,1.5,0.0001 1 19
0.05,2000,I,1.1,0.02 0 15
0.05,2000,I,1.1,0.0001 0 14
0.05,1000,I,2.0,0.02 0 22
0.05,1000,I,2.0,0.0001 0 15
0.05,1000,I,1.5,0.02 0 14
0.05,1000,I,1.5,0.0001 0 12
0.05,1000,I,1.1,0.02 0 22
0.05,1000,I,1.1,0.0001 0 16
0.01,500,I,2.0,0.02 0 15
0.01,500,I,2.0,0.0001 0 13
0.01,500,I,1.5,0.02 0 15
0.01,500,I,1.5,0.0001 0 19
0.01,500,I,1.1,0.02 0 13
0.01,500,I,1.1,0.0001 0 11
0.01,2000,I,2.0,0.02 0 18
0.01,2000,I,2.0,0.0001 0 13
0.01,2000,I,1.5,0.02 0 11
0.01,2000,I,1.5,0.0001 0 19
0.01,2000,I,1.1,0.02 0 19
0.01,2000,I,1.1,0.0001 0 11
0.01,1000,I,2.0,0.02 0 17
0.01,1000,I,2.0,0.0001 0 11
0.01,1000,I,1.5,0.02 0 11
0.01,1000,I,1.5,0.0001 0 17
0.01,1000,I,1.1,0.02 0 13
0.01,1000,I,1.1,0.0001 0 11
0.5,500,ME,2.0,0.02 54 15
0.5,500,ME,2.0,0.0001 35 23
0.5,500,ME,1.5,0.02 40 17
0.5,500,ME,1.5,0.0001 21 19
0.5,500,ME,1.1,0.02 15 14
0.5,500,ME,1.1,0.0001 10 27
0.5,2000,ME,2.0,0.02 62 48
0.5,2000,ME,2.0,0.0001 74 35
0.5,2000,ME,1.5,0.02 77 31
0.5,2000,ME,1.5,0.0001 83 17
0.5,2000,ME,1.1,0.02 91 17
0.5,2000,ME,1.1,0.0001 82 21
0.5,1000,ME,2.0,0.02 70 30
0.5,1000,ME,2.0,0.0001 72 24
0.5,1000,ME,1.5,0.02 71 22
0.5,1000,ME,1.5,0.0001 77 18
0.5,1000,ME,1.1,0.02 53 21
0.5,1000,ME,1.1,0.0001 50 20
0.3,500,ME,2.0,0.02 20 23
0.3,500,ME,2.0,0.0001 16 18
0.3,500,ME,1.5,0.02 7 20
0.3,500,ME,1.5,0.0001 4 21
0.3,500,ME,1.1,0.02 0 17
0.3,500,ME,1.1,0.0001 0 15
0.3,2000,ME,2.0,0.02 58 38
0.3,2000,ME,2.0,0.0001 48 53
0.3,2000,ME,1.5,0.02 62 37
0.3,2000,ME,1.5,0.0001 49 40
0.3,2000,ME,1.1,0.02 33 19
0.3,2000,ME,1.1,0.0001 29 29
0.3,1000,ME,2.0,0.02 54 28
0.3,1000,ME,2.0,0.0001 52 25
0.3,1000,ME,1.5,0.02 25 23
0.3,1000,ME,1.5,0.0001 17 15
0.3,1000,ME,1.1,0.02 11 11
0.3,1000,ME,1.1,0.0001 0 21
0.1,500,ME,2.0,0.02 0 21
0.1,500,ME,2.0,0.0001 0 20
0.1,500,ME,1.5,0.02 0 17
0.1,500,ME,1.5,0.0001 0 16
0.1,500,ME,1.1,0.02 0 17
0.1,500,ME,1.1,0.0001 0 14
0.1,2000,ME,2.0,0.02 39 36
0.1,2000,ME,2.0,0.0001 32 50
0.1,2000,ME,1.5,0.02 0 24
0.1,2000,ME,1.5,0.0001 0 28
0.1,2000,ME,1.1,0.02 0 21
0.1,2000,ME,1.1,0.0001 0 15
0.1,1000,ME,2.0,0.02 0 23
0.1,1000,ME,2.0,0.0001 0 19
0.1,1000,ME,1.5,0.02 0 9
0.1,1000,ME,1.5,0.0001 0 10
0.1,1000,ME,1.1,0.02 0 13
0.1,1000,ME,1.1,0.0001 0 15
0.05,500,ME,2.0,0.02 0 17
0.05,500,ME,2.0,0.0001 0 24
0.05,500,ME,1.5,0.02 0 13
0.05,500,ME,1.5,0.0001 0 12
0.05,500,ME,1.1,0.02 0 16
0.05,500,ME,1.1,0.0001 0 16
0.05,2000,ME,2.0,0.02 0 22
0.05,2000,ME,2.0,0.0001 0 18
0.05,2000,ME,1.5,0.02 0 10
0.05,2000,ME,1.5,0.0001 0 12
0.05,2000,ME,1.1,0.02 0 16
0.05,2000,ME,1.1,0.0001 0 15
0.05,1000,ME,2.0,0.02 0 21
0.05,1000,ME,2.0,0.0001 0 17
0.05,1000,ME,1.5,0.02 0 16
0.05,1000,ME,1.5,0.0001 0 13
0.05,1000,ME,1.1,0.02 0 15
0.05,1000,ME,1.1,0.0001 0 16
0.01,500,ME,2.0,0.02 0 14
0.01,500,ME,2.0,0.0001 0 22
0.01,500,ME,1.5,0.02 0 14
0.01,500,ME,1.5,0.0001 0 22
0.01,500,ME,1.1,0.02 0 12
0.01,500,ME,1.1,0.0001 0 24
0.01,2000,ME,2.0,0.02 0 13
0.01,2000,ME,2.0,0.0001 0 16
0.01,2000,ME,1.5,0.02 0 20
0.01,2000,ME,1.5,0.0001 0 14
0.01,2000,ME,1.1,0.02 0 19
0.01,2000,ME,1.1,0.0001 0 14
0.01,1000,ME,2.0,0.02 0 14
0.01,1000,ME,2.0,0.0001 0 15
0.01,1000,ME,1.5,0.02 0 13
0.01,1000,ME,1.5,0.0001 0 16
0.01,1000,ME,1.1,0.02 0 18
0.01,1000,ME,1.1,0.0001 0 17
0.5,500,ME+I,2.0,0.02 49 37
0.5,500,ME+I,2.0,0.0001 37 33
0.5,500,ME+I,1.5,0.02 54 33
0.5,500,ME+I,1.5,0.0001 43 30
0.5,500,ME+I,1.1,0.02 47 20
0.5,500,ME+I,1.1,0.0001 33 27
0.5,2000,ME+I,2.0,0.02 91 84
0.5,2000,ME+I,2.0,0.0001 93 76
0.5,2000,ME+I,1.5,0.02 94 78
0.5,2000,ME+I,1.5,0.0001 96 81
0.5,2000,ME+I,1.1,0.02 89 60
0.5,2000,ME+I,1.1,0.0001 88 72
0.5,1000,ME+I,2.0,0.02 73 45
0.5,1000,ME+I,2.0,0.0001 69 56
0.5,1000,ME+I,1.5,0.02 78 51
0.5,1000,ME+I,1.5,0.0001 80 50
0.5,1000,ME+I,1.1,0.02 78 33
0.5,1000,ME+I,1.1,0.0001 75 39
0.3,500,ME+I,2.0,0.02 30 19
0.3,500,ME+I,2.0,0.0001 37 28
0.3,500,ME+I,1.5,0.02 14 20
0.3,500,ME+I,1.5,0.0001 18 22
0.3,500,ME+I,1.1,0.02 5 17
0.3,500,ME+I,1.1,0.0001 3 14
0.3,2000,ME+I,2.0,0.02 40 68
0.3,2000,ME+I,2.0,0.0001 92 92
0.3,2000,ME+I,1.5,0.02 61 50
0.3,2000,ME+I,1.5,0.0001 77 74
0.3,2000,ME+I,1.1,0.02 50 35
0.3,2000,ME+I,1.1,0.0001 61 35
0.3,1000,ME+I,2.0,0.02 58 35
0.3,1000,ME+I,2.0,0.0001 72 57
0.3,1000,ME+I,1.5,0.02 45 34
0.3,1000,ME+I,1.5,0.0001 50 37
0.3,1000,ME+I,1.1,0.02 23 19
0.3,1000,ME+I,1.1,0.0001 12 18
0.1,500,ME+I,2.0,0.02 0 19
0.1,500,ME+I,2.0,0.0001 0 20
0.1,500,ME+I,1.5,0.02 0 23
0.1,500,ME+I,1.5,0.0001 0 11
0.1,500,ME+I,1.1,0.02 0 12
0.1,500,ME+I,1.1,0.0001 0 23
0.1,2000,ME+I,2.0,0.02 0 33
0.1,2000,ME+I,2.0,0.0001 0 20
0.1,2000,ME+I,1.5,0.02 0 23
0.1,2000,ME+I,1.5,0.0001 0 17
0.1,2000,ME+I,1.1,0.02 0 15
0.1,2000,ME+I,1.1,0.0001 0 10
0.1,1000,ME+I,2.0,0.02 0 28
0.1,1000,ME+I,2.0,0.0001 0 23
0.1,1000,ME+I,1.5,0.02 0 15
0.1,1000,ME+I,1.5,0.0001 0 12
0.1,1000,ME+I,1.1,0.02 0 8
0.1,1000,ME+I,1.1,0.0001 0 19
0.05,500,ME+I,2.0,0.02 0 15
0.05,500,ME+I,2.0,0.0001 0 19
0.05,500,ME+I,1.5,0.02 0 16
0.05,500,ME+I,1.5,0.0001 0 11
0.05,500,ME+I,1.1,0.02 0 15
0.05,500,ME+I,1.1,0.0001 0 21
0.05,2000,ME+I,2.0,0.02 0 21
0.05,2000,ME+I,2.0,0.0001 0 22
0.05,2000,ME+I,1.5,0.02 0 19
0.05,2000,ME+I,1.5,0.0001 0 18
0.05,2000,ME+I,1.1,0.02 0 18
0.05,2000,ME+I,1.1,0.0001 0 13
0.05,1000,ME+I,2.0,0.02 0 21
0.05,1000,ME+I,2.0,0.0001 0 15
0.05,1000,ME+I,1.5,0.02 0 12
0.05,1000,ME+I,1.5,0.0001 0 20
0.05,1000,ME+I,1.1,0.02 0 14
0.05,1000,ME+I,1.1,0.0001 0 18
0.01,500,ME+I,2.0,0.02 0 18
0.01,500,ME+I,2.0,0.0001 0 13
0.01,500,ME+I,1.5,0.02 0 15
0.01,500,ME+I,1.5,0.0001 0 15
0.01,500,ME+I,1.1,0.02 0 14
0.01,500,ME+I,1.1,0.0001 0 23
0.01,2000,ME+I,2.0,0.02 0 14
0.01,2000,ME+I,2.0,0.0001 0 18
0.01,2000,ME+I,1.5,0.02 0 22
0.01,2000,ME+I,1.5,0.0001 0 14
0.01,2000,ME+I,1.1,0.02 0 16
0.01,2000,ME+I,1.1,0.0001 0 15
0.01,1000,ME+I,2.0,0.02 0 14
0.01,1000,ME+I,2.0,0.0001 0 9
0.01,1000,ME+I,1.5,0.02 0 15
0.01,1000,ME+I,1.5,0.0001 0 15
0.01,1000,ME+I,1.1,0.02 0 18
0.01,1000,ME+I,1.1,0.0001 0 17
*MAF,POP,MOD,OR,PREV, where MAF represents the minor allele frequency, POP is the number of individuals, MOD is the model used (with or without main effect and with or without epistasis effect), OR is the odds ratio, and PREV is the prevalence of the disease.
Table 3: Running time, CPU usage, and memory usage for each configuration.
Configuration* Running Time (s) CPU Usage (%) Memory Usage (KB)
0.5,500,ME+I,2.0,0.02 8.05 75.72 132928.28
0.5,500,ME+I,2.0,0.0001 8.10 75.95 133723.36
0.5,500,ME+I,1.5,0.02 9.37 75.86 132094.44
0.5,500,ME+I,1.5,0.0001 9.03 75.38 132148.28
0.5,500,ME+I,1.1,0.02 11.23 75.14 133080.40
0.5,500,ME+I,1.1,0.0001 10.48 75.46 131997.88
0.5,500,ME,2.0,0.02 10.43 76.02 132144.56
0.5,500,ME,2.0,0.0001 9.98 76.18 132479.40
0.5,500,ME,1.5,0.02 11.88 75.85 133979.16
0.5,500,ME,1.5,0.0001 11.01 80.20 132260.32
0.5,500,ME,1.1,0.02 13.19 77.06 135044.20
0.5,500,ME,1.1,0.0001 12.12 77.23 133516.64
0.5,500,I,2.0,0.02 14.39 76.70 133500.72
0.5,500,I,2.0,0.0001 12.97 76.82 132901.40
0.5,500,I,1.5,0.02 14.32 76.82 133835.88
0.5,500,I,1.5,0.0001 13.16 76.95 132729.96
0.5,500,I,1.1,0.02 14.44 77.07 133436.20
0.5,500,I,1.1,0.0001 13.06 76.99 132833.76
0.5,2000,ME+I,2.0,0.02 34.65 77.25 156137.440.5,2000,ME+I,2.0,0.0001 31.37 76.78 153924.360.5,2000,ME+I,1.5,0.02 67.20 98.96 156944.56
0.5,2000,ME+I,1.5,0.0001 51.19 98.96 156487.440.5,2000,ME+I,1.1,0.02 106.54 99.00 157200.08
0.5,2000,ME+I,1.1,0.0001 76.74 99.00 157071.080.5,2000,ME,2.0,0.02 86.58 99.00 157251.32
0.5,2000,ME,2.0,0.0001 65.83 98.98 156912.400.5,2000,ME,1.5,0.02 115.43 99.00 156853.92
0.5,2000,ME,1.5,0.0001 84.74 99.00 157327.680.5,2000,ME,1.1,0.02 141.42 98.99 156264.16
0.5,2000,ME,1.1,0.0001 101.23 98.99 156739.320.5,2000,I,2.0,0.02 177.72 98.97 155566.84
0.5,2000,I,2.0,0.0001 121.39 98.99 155160.080.5,2000,I,1.5,0.02 175.38 99.00 155246.92
0.5,2000,I,1.5,0.0001 121.20 99.00 155048.640.5,2000,I,1.1,0.02 175.60 99.00 155732.32
0.5,2000,I,1.1,0.0001 121.17 99.00 155220.240.5,1000,ME+I,2.0,0.02 18.65 98.99 140519.08
0.5,1000,ME+I,2.0,0.0001 17.79 98.96 140777.400.5,1000,ME+I,1.5,0.02 27.78 98.93 140225.08
0.5,1000,ME+I,1.5,0.0001 23.17 98.90 140721.960.5,1000,ME+I,1.1,0.02 36.70 98.96 139726.28
0.5,1000,ME+I,1.1,0.0001 30.35 98.96 139558.000.5,1000,ME,2.0,0.02 31.89 98.98 139635.88
0.5,1000,ME,2.0,0.0001 27.30 98.98 140585.160.5,1000,ME,1.5,0.02 38.06 99.00 140003.76
0.5,1000,ME,1.5,0.0001 31.63 99.00 139501.320.5,1000,ME,1.1,0.02 43.46 99.00 139474.84
0.5,1000,ME,1.1,0.0001 36.17 99.00 139679.120.5,1000,I,2.0,0.02 52.11 98.93 138762.56
0.5,1000,I,2.0,0.0001 42.34 98.70 138701.960.5,1000,I,1.5,0.02 52.57 98.75 138734.88
0.5,1000,I,1.5,0.0001 41.92 98.96 138545.880.5,1000,I,1.1,0.02 52.19 98.94 138834.24
0.5,1000,I,1.1,0.0001 41.54 98.92 138252.560.3,500,ME+I,2.0,0.02 10.43 77.27 132217.92
0.3,500,ME+I,2.0,0.0001 8.47 77.17 132292.96
0.3,500,ME+I,1.5,0.02 12.12 76.84 134113.200.3,500,ME+I,1.5,0.0001 10.55 76.87 131771.240.3,500,ME+I,1.1,0.02 13.41 76.91 133854.48
0.3,500,ME+I,1.1,0.0001 12.22 76.76 132673.560.3,500,ME,2.0,0.02 11.96 76.68 134613.12
0.3,500,ME,2.0,0.0001 10.87 76.92 132077.720.3,500,ME,1.5,0.02 13.01 78.76 134016.68
0.3,500,ME,1.5,0.0001 11.80 80.22 133295.480.3,500,ME,1.1,0.02 13.89 76.88 134210.72
0.3,500,ME,1.1,0.0001 12.64 76.67 133055.160.3,500,I,2.0,0.02 14.56 76.78 133087.44
0.3,500,I,2.0,0.0001 13.11 76.36 133321.080.3,500,I,1.5,0.02 14.48 76.47 133449.28
0.3,500,I,1.5,0.0001 13.12 76.53 132977.320.3,500,I,1.1,0.02 14.62 76.97 132934.04
0.3,500,I,1.1,0.0001 12.91 79.52 132643.120.3,2000,ME+I,2.0,0.02 80.13 77.12 157067.76
0.3,2000,ME+I,2.0,0.0001 38.01 76.67 156089.920.3,2000,ME+I,1.5,0.02 108.52 77.82 156891.04
0.3,2000,ME+I,1.5,0.0001 68.37 79.27 157415.120.3,2000,ME+I,1.1,0.02 133.62 76.53 156605.84
0.3,2000,ME+I,1.1,0.0001 93.15 76.23 156330.200.3,2000,ME,2.0,0.02 94.83 72.29 145398.52
0.3,2000,ME,2.0,0.0001 73.01 76.34 157256.600.3,2000,ME,1.5,0.02 129.27 77.26 156373.32
0.3,2000,ME,1.5,0.0001 93.37 77.30 156781.520.3,2000,ME,1.1,0.02 148.69 76.90 155918.68
0.3,2000,ME,1.1,0.0001 104.23 77.27 156126.560.3,2000,I,2.0,0.02 163.18 76.81 155362.60
0.3,2000,I,2.0,0.0001 112.34 76.76 155090.440.3,2000,I,1.5,0.02 165.19 77.06 155644.08
0.3,2000,I,1.5,0.0001 113.22 76.96 155052.120.3,2000,I,1.1,0.02 165.32 76.90 155570.60
0.3,2000,I,1.1,0.0001 112.89 77.25 155295.480.3,1000,ME+I,2.0,0.02 29.58 76.65 140007.28
0.3,1000,ME+I,2.0,0.0001 18.54 76.70 140461.080.3,1000,ME+I,1.5,0.02 35.96 76.58 139731.16
0.3,1000,ME+I,1.5,0.0001 26.77 76.25 139746.88
0.3,1000,ME+I,1.1,0.02 39.73 74.45 135312.520.3,1000,ME+I,1.1,0.0001 33.51 60.15 138464.68
0.3,1000,ME,2.0,0.02 35.14 76.79 139728.880.3,1000,ME,2.0,0.0001 28.44 76.98 139477.160.3,1000,ME,1.5,0.02 39.75 76.60 139325.40
0.3,1000,ME,1.5,0.0001 32.14 78.60 139321.040.3,1000,ME,1.1,0.02 41.93 72.53 134819.68
0.3,1000,ME,1.1,0.0001 34.67 76.61 139146.120.3,1000,I,2.0,0.02 44.55 76.64 137672.52
0.3,1000,I,2.0,0.0001 35.88 76.58 138353.480.3,1000,I,1.5,0.02 45.07 77.84 138644.36
0.3,1000,I,1.5,0.0001 36.22 76.64 138355.680.3,1000,I,1.1,0.02 45.38 76.60 138536.52
0.3,1000,I,1.1,0.0001 36.54 76.65 138406.040.1,500,ME+I,2.0,0.02 14.39 76.21 133049.56
0.1,500,ME+I,2.0,0.0001 13.63 96.56 133010.000.1,500,ME+I,1.5,0.02 15.68 99.00 133365.44
0.1,500,ME+I,1.5,0.0001 13.83 99.00 132876.080.1,500,ME+I,1.1,0.02 15.71 98.92 133175.88
0.1,500,ME+I,1.1,0.0001 13.91 99.00 132622.440.1,500,ME,2.0,0.02 15.48 99.00 133278.80
0.1,500,ME,2.0,0.0001 13.81 98.98 132322.280.1,500,ME,1.5,0.02 15.56 98.99 133141.08
0.1,500,ME,1.5,0.0001 14.08 98.96 132418.280.1,500,ME,1.1,0.02 15.56 99.00 133151.84
0.1,500,ME,1.1,0.0001 14.01 98.98 133099.920.1,500,I,2.0,0.02 15.56 98.99 133436.84
0.1,500,I,2.0,0.0001 14.02 98.98 133378.840.1,500,I,1.5,0.02 15.87 99.00 133192.84
0.1,500,I,1.5,0.0001 14.09 98.98 132645.000.1,500,I,1.1,0.02 15.77 99.00 133209.40
0.1,500,I,1.1,0.0001 14.07 98.97 133120.920.1,2000,ME+I,2.0,0.02 158.73 77.20 155763.80
0.1,2000,ME,1.5,0.02 179.10 99.00 155353.080.1,2000,ME,1.5,0.0001 121.32 99.00 155054.960.1,2000,ME,1.1,0.02 179.21 94.98 155592.84
0.1,2000,ME,1.1,0.0001 123.67 94.92 155266.520.1,2000,I,2.0,0.02 179.43 93.90 155530.20
0.1,2000,I,2.0,0.0001 122.25 99.00 155230.160.1,2000,I,1.5,0.02 178.88 98.99 155394.24
0.1,2000,I,1.5,0.0001 122.16 99.00 155728.400.1,2000,I,1.1,0.02 177.35 99.00 155322.72
0.1,2000,I,1.1,0.0001 121.66 99.00 155222.800.1,1000,ME+I,2.0,0.02 48.96 99.00 139186.68
0.1,1000,ME+I,2.0,0.0001 39.09 99.00 138626.880.1,1000,ME+I,1.5,0.02 50.81 95.48 138308.36
0.1,1000,ME+I,1.5,0.0001 40.85 96.90 138674.000.1,1000,ME+I,1.1,0.02 51.26 97.28 138804.08
0.1,1000,ME+I,1.1,0.0001 41.67 92.17 138171.240.1,1000,ME,2.0,0.02 50.57 95.22 138776.32
0.1,1000,ME,2.0,0.0001 39.97 97.55 138625.880.1,1000,ME,1.5,0.02 50.41 98.21 138871.48
0.1,1000,ME,1.5,0.0001 40.33 98.03 138225.840.1,1000,ME,1.1,0.02 50.04 98.85 138539.48
0.1,1000,ME,1.1,0.0001 40.06 98.99 138220.520.1,1000,I,2.0,0.02 49.73 99.00 138651.80
0.1,1000,I,2.0,0.0001 40.38 98.19 138314.800.1,1000,I,1.5,0.02 50.46 98.66 139010.80
0.1,1000,I,1.5,0.0001 39.94 98.98 138255.480.1,1000,I,1.1,0.02 49.87 99.00 138711.12
0.1,1000,I,1.1,0.0001 40.13 98.99 138267.920.05,500,ME+I,2.0,0.02 16.57 98.78 133539.84
0.05,500,ME+I,2.0,0.0001 14.88 98.78 132866.200.05,500,ME+I,1.5,0.02 16.49 98.75 133801.12
0.05,500,ME+I,1.5,0.0001 14.89 98.83 132811.360.05,500,ME+I,1.1,0.02 16.64 98.82 133192.68
0.05,500,ME+I,1.1,0.0001 15.06 98.96 133419.440.05,500,ME,2.0,0.02 16.73 98.89 133147.12
0.05,500,ME,2.0,0.0001 15.19 98.79 132763.080.05,500,ME,1.5,0.02 16.43 98.87 133810.48
0.05,500,ME,1.5,0.0001 13.11 76.92 133189.440.05,500,ME,1.1,0.02 14.49 76.75 133125.68
0.05,500,ME,1.1,0.0001 13.12 76.65 133707.680.05,500,I,2.0,0.02 14.35 76.96 133531.60
0.05,500,I,2.0,0.0001 13.03 78.87 133306.000.05,500,I,1.5,0.02 14.30 79.83 133015.40
0.05,500,I,1.5,0.0001 12.98 78.90 133355.080.05,500,I,1.1,0.02 14.35 78.62 133549.00
0.05,500,I,1.1,0.0001 12.98 79.54 133045.080.05,2000,ME+I,2.0,0.02 190.54 99.00 155711.80
0.05,2000,ME+I,2.0,0.0001 131.34 99.00 155593.600.05,2000,ME+I,1.5,0.02 180.23 99.00 155621.12
0.05,2000,ME+I,1.5,0.0001 123.29 98.98 155334.440.05,2000,ME+I,1.1,0.02 178.12 99.00 155591.40
0.05,2000,ME+I,1.1,0.0001 124.22 99.00 155121.640.05,2000,ME,2.0,0.02 178.90 99.00 155657.64
0.05,2000,ME,2.0,0.0001 123.05 99.00 155412.040.05,2000,ME,1.5,0.02 178.36 99.00 155709.28
0.05,2000,ME,1.5,0.0001 123.93 99.00 155206.400.05,2000,ME,1.1,0.02 180.79 98.53 155466.84
0.05,2000,ME,1.1,0.0001 123.34 99.00 155389.400.05,2000,I,2.0,0.02 119.66 99.00 155255.48
0.05,2000,I,2.0,0.0001 122.52 99.00 155137.480.05,2000,I,1.5,0.02 178.53 99.00 155502.28
0.05,2000,I,1.5,0.0001 121.41 99.00 155484.520.05,2000,I,1.1,0.02 178.34 99.00 155644.44
0.05,2000,I,1.1,0.0001 122.37 98.98 155444.680.05,1000,ME+I,2.0,0.02 50.33 98.97 138882.24
0.05,1000,ME+I,2.0,0.0001 40.63 98.91 138206.720.05,1000,ME+I,1.5,0.02 50.46 98.92 138581.16
0.05,1000,ME+I,1.5,0.0001 40.53 98.89 138158.840.05,1000,ME+I,1.1,0.02 50.30 98.92 138835.84
0.05,1000,ME+I,1.1,0.0001 40.63 97.77 138088.720.05,1000,ME,2.0,0.02 50.16 99.00 138770.92
0.05,1000,ME,2.0,0.0001 40.17 98.59 138376.080.05,1000,ME,1.5,0.02 49.89 98.97 138975.00
0.05,1000,ME,1.5,0.0001 40.33 98.91 138648.360.05,1000,ME,1.1,0.02 49.85 98.98 138953.24
0.05,1000,ME,1.1,0.0001 39.79 99.00 138266.520.05,1000,I,2.0,0.02 49.24 98.88 138535.88
0.05,1000,I,2.0,0.0001 40.29 98.85 138387.880.05,1000,I,1.5,0.02 50.96 92.36 138616.68
0.05,1000,I,1.5,0.0001 41.18 88.89 137906.160.05,1000,I,1.1,0.02 50.33 97.64 138686.16
0.05,1000,I,1.1,0.0001 40.04 98.91 138828.360.01,500,ME+I,2.0,0.02 15.85 98.98 133601.64
0.01,500,ME+I,2.0,0.0001 14.01 98.96 132918.960.01,500,ME+I,1.5,0.02 15.65 98.99 132987.16
0.01,500,ME+I,1.5,0.0001 13.97 98.94 133053.600.01,500,ME+I,1.1,0.02 15.61 98.98 133764.08
0.01,500,ME+I,1.1,0.0001 14.17 98.99 132439.240.01,500,ME,2.0,0.02 15.55 98.97 133469.20
0.01,500,ME,2.0,0.0001 13.99 98.94 132721.640.01,500,ME,1.5,0.02 15.52 98.96 133784.12
0.01,500,ME,1.5,0.0001 13.96 98.95 132771.040.01,500,ME,1.1,0.02 15.65 98.92 133548.44
0.01,500,ME,1.1,0.0001 14.10 98.95 132695.160.01,500,I,2.0,0.02 15.71 98.97 132971.68
0.01,500,I,2.0,0.0001 14.17 98.97 133037.360.01,500,I,1.5,0.02 15.64 98.95 133494.84
0.01,500,I,1.5,0.0001 13.96 98.90 132756.080.01,500,I,1.1,0.02 15.63 98.94 133574.76
0.01,500,I,1.1,0.0001 14.00 98.93 133471.800.01,2000,ME+I,2.0,0.02 164.42 77.57 155881.28
0.01,2000,ME+I,2.0,0.0001 113.09 77.42 155392.920.01,2000,ME+I,1.5,0.02 162.51 77.55 155558.80
0.01,2000,ME+I,1.5,0.0001 112.62 77.64 155313.160.01,2000,ME+I,1.1,0.02 164.66 77.83 155642.44
0.01,2000,ME+I,1.1,0.0001 123.62 99.00 155250.160.01,2000,ME,2.0,0.02 165.13 77.53 155705.44
0.01,2000,ME,2.0,0.0001 111.69 79.50 155394.840.01,2000,ME,1.5,0.02 162.59 78.46 155294.12
0.01,2000,ME,1.5,0.0001 113.86 76.92 155176.880.01,2000,ME,1.1,0.02 164.27 76.97 155396.96
0.01,2000,ME,1.1,0.0001 113.76 76.75 155034.040.01,2000,I,2.0,0.02 164.08 77.22 155443.76
0.01,2000,I,2.0,0.0001 113.61 77.24 154994.480.01,2000,I,1.5,0.02 163.14 78.94 155548.28
0.01,2000,I,1.5,0.0001 111.07 79.09 155093.920.01,2000,I,1.1,0.02 162.67 77.36 155497.36
0.01,2000,I,1.1,0.0001 109.01 75.25 150508.880.01,1000,ME+I,2.0,0.02 50.30 98.92 138738.04
0.01,1000,ME+I,2.0,0.0001 40.31 99.00 138451.800.01,1000,ME+I,1.5,0.02 50.50 99.00 138824.72
0.01,1000,ME+I,1.5,0.0001 40.31 98.97 138218.680.01,1000,ME+I,1.1,0.02 50.41 99.00 138501.24
0.01,1000,ME+I,1.1,0.0001 40.45 99.00 138278.320.01,1000,ME,2.0,0.02 50.41 98.99 138407.60
0.01,1000,ME,2.0,0.0001 40.44 99.00 138331.400.01,1000,ME,1.5,0.02 50.07 98.99 138824.28
0.01,1000,ME,1.5,0.0001 40.31 99.00 138470.760.01,1000,ME,1.1,0.02 49.94 98.96 138802.64
0.01,1000,ME,1.1,0.0001 40.09 98.99 138732.520.01,1000,I,2.0,0.02 49.99 98.99 138941.92
0.01,1000,I,2.0,0.0001 40.10 98.95 138575.560.01,1000,I,1.5,0.02 49.52 98.92 138679.60
0.01,1000,I,1.5,0.0001 40.10 98.89 138683.160.01,1000,I,1.1,0.02 50.00 98.97 138552.00
0.01,1000,I,1.1,0.0001 40.43 97.41 138510.32
*MAF,POP,MOD,OR,PREV, where MAF represents the minor allele frequency, POP is the number of individuals, MOD is the model used (with or without main effect and with or without epistasis effect), OR is the odds ratio, and PREV is the prevalence of the disease.
Laboratory Note
Genetic Epistasis V - Assessing Algorithm SNPRuler

LN-5-2014

Ricardo Pinho and Rui Camacho
FEUP
Rua Dr. Roberto Frias, s/n,
4200-465 PORTO
Portugal
Fax: (+351) 22 508 1440
e-mail: [email protected]
www: http://www.fe.up.pt/∼
e-mail: [email protected]
www: http://www.fe.up.pt/∼rcamacho
May 2014
Abstract
In this lab note, the algorithm SNPRuler is presented. SNPRuler is an epistasis detection algorithm written in Java that creates rules based on the epistatic interactions detected in data sets. Using many configurations of data sets, the results obtained show a correlation between Power and the number of sampled individuals, and a correlation between Power and minor allele frequency. The algorithm therefore has very high accuracy under optimal conditions, but very low accuracy below them. The algorithm is very scalable across different numbers of individuals, with only a slight increase in running time and memory usage. The Type I Error Rate is very low in all configurations.
1 Introduction
SNPRuler [WYY+10] is a rule-based algorithm that, based on the relations between SNPs and the phenotype related to the expression of a disease, creates association rules between SNPs and the phenotype expression. The order of these interactions, i.e. the number of SNPs involved, is unbounded. For each rule, a 3x3 table is generated, relating the probability of each possible genotype combination to the phenotype expression. The way rules are defined is described in the following steps:
1. Literal - A literal s is an index-value pair (i, v), with i denoting an index and v a value in {1, 2, 3} representing the possible genotypes. A sample satisfies a literal (i, v) if and only if its i-th SNP has the value v.
2. Predictive Rule - A predictive rule (r, ζ): s1 ∩ s2 ∩ ... ∩ sn → ζ is an association between a conjunction of n literals, denoted r, and a class label ζ. A sample satisfies (r, ζ) if and only if it satisfies all literals in r and its class label is ζ.
3. Literal Relevance - Given a predictive rule (r, ζ) and a utility function U(r, ζ) for rule measurement, a literal si in the rule r is relevant if and only if U(r, ζ) > U(r − si, ζ). Here, r − si means removing si from r.
4. Closed Rule - A predictive rule (r, ζ) is closed if and only if there is no literal si satisfying U(r + si, ζ) > U(r, ζ). Here, r + si means adding si to r.
The measure of rule relevance is the χ2 statistic. Since most epistatic interactions involve many SNPs, before creating rules an upper bound is used to determine whether a new SNP will prove significant to a rule. This immensely decreases the number of rules created compared to exhaustive searches. A branch-and-bound approach is used for this purpose.
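The rule definitions above can be sketched in a few lines. The following is a minimal illustration, not the SNPRuler implementation: a rule is a list of (snp_index, genotype_value) literals, the utility U is taken as a 2x2 χ2 statistic of rule satisfaction against the class label, and genotypes use the input file's {0, 1, 2} coding rather than the {1, 2, 3} of the definition. All names are ours.

```python
def satisfies(sample, rule):
    # A sample satisfies a rule iff every literal (i, v) matches its i-th SNP.
    return all(sample[i] == v for i, v in rule)

def chi2_utility(rule, label, samples, labels):
    # 2x2 chi-square statistic of "satisfies rule" against "has class label".
    a = b = c = d = 0
    for s, l in zip(samples, labels):
        if satisfies(s, rule):
            if l == label:
                a += 1
            else:
                b += 1
        elif l == label:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def is_closed(rule, label, samples, labels, n_snps, genotypes=(0, 1, 2)):
    # Closed rule: no additional literal increases the utility.
    base = chi2_utility(rule, label, samples, labels)
    used = {i for i, _ in rule}
    return not any(
        chi2_utility(rule + [(i, v)], label, samples, labels) > base
        for i in range(n_snps) if i not in used
        for v in genotypes
    )
```

On a toy data set of four two-SNP samples, a rule such as [(0, 1)] ("SNP 0 has genotype 1") can be scored and tested for closure directly with these functions; SNPRuler's branch-and-bound pruning on top of this scoring is not reproduced here.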
1.1 Input files
The algorithm is written in Java and receives a file containing the genotype and the phenotype expressed for each individual. The first row contains the identifier of each SNP, and the final column corresponds to the label. Each subsequent row contains an individual's genotypes {0, 1, 2}, corresponding to the homozygous dominant genotype (AA), the heterozygous genotype (Aa), and the homozygous recessive genotype (aa). The label {0, 1} corresponds to control and disease-affected, respectively.
X1 X2 X3 X4 Label
1 1 0 2 0
1 0 1 1 1
1 2 2 0 1
Table 1: An example of the input file containing genotype and phenotypeinformation with 4 SNPs and 3 individuals.
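The format above is simple enough to read with a few lines of code. This is a sketch under the stated format (whitespace-separated columns, header row, label last); the function name is ours, not part of SNPRuler.

```python
def parse_genotype_file(lines):
    # First row: one name per SNP plus the final label column.
    # Each following row: genotypes 0 (AA), 1 (Aa), 2 (aa) and a 0/1 label
    # (0 = control, 1 = disease affected).
    header = lines[0].split()
    snp_names = header[:-1]
    genotypes, labels = [], []
    for line in lines[1:]:
        fields = [int(x) for x in line.split()]
        genotypes.append(fields[:-1])
        labels.append(fields[-1])
    return snp_names, genotypes, labels
```

Applied to the three individuals of Table 1, this yields four SNP names, three genotype rows, and the labels [0, 1, 1].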
1.2 Output files
The output is a list of interactions ranked by their significance in the χ2 test. A post-processing step calculates the P-value of these interactions, adjusting the χ2 test with a Bonferroni correction and a significance threshold of 0.3.
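The post-processing step can be sketched as follows. This is an illustration of a Bonferroni correction applied to a χ2 statistic, not SNPRuler's own code; the degrees of freedom depend on the contingency table used (a pairwise genotype-by-phenotype table would be an assumption of the caller), so `dof` is passed explicitly, and the survival function below uses the closed form that exists only for even degrees of freedom.

```python
import math

def chi2_sf(x, dof):
    # Survival function P(X > x) of a chi-square distribution, closed form
    # for even degrees of freedom: exp(-x/2) * sum_{i < dof/2} (x/2)^i / i!.
    assert dof % 2 == 0, "closed form used here covers even dof only"
    total, term = 0.0, 1.0
    for i in range(dof // 2):
        if i > 0:
            term *= (x / 2.0) / i
        total += term
    return math.exp(-x / 2.0) * total

def bonferroni_significant(chi2_stat, n_tests, dof, alpha=0.3):
    # Multiply the raw p-value by the number of tests (capped at 1) and
    # compare against the lab note's significance threshold of 0.3.
    p_adjusted = min(1.0, chi2_sf(chi2_stat, dof) * n_tests)
    return p_adjusted, p_adjusted < alpha
```

With many candidate rules, `n_tests` grows quickly, which is why only strong χ2 values survive the correction.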
1.3 Parameters
There are 3 configurable parameters:
• listSize - The expected number of interactions.
• depth - Order of interaction. Number of interacting SNPs.
• updateRatio - The step size for updating a rule. Takes a value between 0 and 1: 0 means rules are never updated, 1 updates a rule at each step.
2 Experimental Settings
The data sets used in the experiments are characterized in Lab Note 1. The computer used for these experiments ran the 64-bit Ubuntu 13.10 operating system, with an Intel(R) Core(TM)2 Quad CPU Q6600 2.40GHz processor and 8.00 GB of RAM. The algorithm settings consist of a -Xmx7000M heap size, with the maximum number of rules set to 50 000. The length of the rules is 2, since the data sets used contain ground truths of pairs of SNPs. The pruning threshold is 0, which means that all possible combinations are tested.
3 Results
SNPRuler is used exclusively for the interaction effect; therefore, data sets with main effect and full effect won't be analyzed. In Figure 1, the Power obtained for each allele frequency with different
population sizes is displayed. For data sets with 500 individuals, the Power is nearly 0 for all allele frequencies. However, as the population size increases, the Power starts to rise in data sets with allele frequencies higher than 0.1. This also holds for data sets with 2000 individuals, which show slightly higher Power than the smaller data sets. The configuration with the most Power corresponds to the data sets with a population of 2000 and a minor allele frequency of 0.5.
[Bar graph: Power (%) by allele frequency (0.01, 0.05, 0.1, 0.3, 0.5) for 500, 1000, and 2000 individuals. Readable values — 500 individuals: 0, 0, 0, 3, 6; 1000: 0, 0, 10, 35, 71; 2000: 0, 0, 32, 44, 92.]
Figure 1: Power by allele frequency. For each frequency, three sizes of data sets were used to measure the Power, with an odds ratio of 2.0 and a prevalence of 0.02. The Power is measured as the number of data sets, out of all 100, where the ground truth was amongst the most relevant results.
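The Power measure defined in the caption can be computed mechanically. This is a sketch of that counting procedure, with names of our own choosing; each data set contributes one ranked list of reported SNP pairs, and a hit means the ground-truth pair appears in that list (in either order).

```python
def empirical_power(results_per_dataset, ground_truth):
    # results_per_dataset: one list of reported SNP pairs per data set.
    # Power: percentage of data sets whose reported interactions contain
    # the ground-truth pair (order of the two SNPs does not matter).
    truth = frozenset(ground_truth)
    hits = sum(
        any(frozenset(pair) == truth for pair in results)
        for results in results_per_dataset
    )
    return 100.0 * hits / len(results_per_dataset)
```

The Type I Error Rate of Figure 3 is the analogous count over false positives instead of the ground truth.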
In Figure 2, the average running time, percentage of CPU usage, and memory usage are displayed by the number of individuals in the data set, to evaluate the scalability of the algorithm. The results show a slight increase in running time when applied to larger data sets; in these results, the increase is not very significant. The CPU usage increases with the data set size, with all data sets having a CPU usage higher than 100%, meaning that more than one core was used for each data set. The memory usage results show an increase of nearly 10 megabytes. This increase may be significant in more complex data sets, but is not as significant as the running time increase or the CPU usage.
For the Type I Error Rate test, Figure 3 shows that the Type I Error Rate is relatively small across all data sets, with outliers at an allele frequency of 0.1 and 2000 individuals. This is the only group of configurations that yields a Type I Error Rate higher than 1%.
According to Figure 4, we can conclude that the number of individuals has a big influence on the Power of the algorithm. This is also true for the allele frequency. With a very small number of individuals, the Power is nearly 0. The
[Bar graphs by number of individuals (500, 1000, 2000): (a) average running time (seconds), (b) average CPU usage (%), (c) average memory usage (Mbytes).]
Figure 2: Comparison of scalability measures between different-sized data sets. This figure shows the average running time, CPU usage, and memory usage for each data set size. The data sets have a minor allele frequency of 0.5, an odds ratio of 2.0, and a prevalence of 0.02.
Power also increases with the frequency of the alleles containing the ground truth. In Figures 5 and 6, the influence of the odds ratio (through the penetrance table of the disease) and of the prevalence of the disease is inconclusive. There is an increase in Power at an odds ratio of 1.5, but it decreases for an odds ratio of 2.0. The difference in prevalence does not show a very significant difference in Power. Figure 7 shows the Power by frequency, independent of population size. Overall, the algorithm shows very high Power in certain configurations with optimal conditions, but very low Power in many others.
[Bar graph: Type I Error Rate (%) by allele frequency (0.01, 0.05, 0.1, 0.3, 0.5) for 500, 1000, and 2000 individuals. Readable values — all zero except for 2000 individuals at 0.05 (1%) and 0.1 (8%).]
Figure 3: Type I Error Rate by allele frequency and population size, with an odds ratio of 2.0 and a prevalence of 0.02. The Type I Error Rate is measured as the number of data sets, out of all 100, where false positives were amongst the most relevant results.
4 Summary
In this lab note, the algorithm SNPRuler was presented and tested on its ability to detect the epistatic interactions that manifest complex diseases, using generated data sets. The results obtained showed that the number of individuals is important for epistasis detection, and that diseases with ground truths in high-frequency SNPs are easier to detect. The scalability test revealed a significant increase in the use of computer resources and in running time as the number of individuals grows, which may have a significant impact on data sets with a higher number of SNPs. The Type I Error Rate results show a very low error rate in all configurations.
References
[WYY+10] Xiang Wan, Can Yang, Qiang Yang, Hong Xue, Nelson L. S. Tang, and Weichuan Yu. Predictive rule inference for epistatic interaction detection in genome-wide association studies. Bioinformatics (Oxford, England), 26:30–37, 2010.
A Bar graphs
[Bar graph, Power by Population: 500 → 0%, 1000 → 10%, 2000 → 32%.]
Figure 4: Distribution of the Power by population. The allele frequency is0.1, the odds ratio is 2.0, and the prevalence is 0.02.
[Bar graph, Power by Odds Ratio: 1.1 → 0%, 1.5 → 67%, 2.0 → 32%.]
Figure 5: Distribution of the Power by odds ratios. The allele frequency is0.1, the number of individuals is 2000, and the prevalence is 0.02.
[Bar graph, Power by Prevalence: 0.0001 → 29%, 0.02 → 32%.]
Figure 6: Distribution of the Power by prevalence. The allele frequency is0.1, the number of individuals is 2000, and the odds ratio is 2.0.
[Bar graph, Power by Frequency: 0.01 → 0%, 0.05 → 0%, 0.1 → 32%, 0.3 → 44%, 0.5 → 92%.]
Figure 7: Distribution of the Power by allele frequency. The number ofindividuals is 2000, the odds ratio is 2.0, and the prevalence is 0.02.
B Table of Results
Table 2: A table containing the percentage of true positives and false positives in each configuration. The first column contains the description of the configuration. The second and third columns contain the number of data sets with true positives and false positives, respectively, out of all 100 data sets per configuration.
Configuration* TP (%) FP (%)
0.5,2000,I,2.0,0.02 92 0
0.3,2000,I,1.5,0.02 89 7
0.3,2000,I,1.5,0.0001 84 30.5,2000,I,1.5,0.02 81 10.5,1000,I,2.0,0.02 71 00.1,2000,I,1.5,0.02 67 2
0.5,2000,I,2.0,0.0001 52 80.5,1000,I,2.0,0.0001 50 10.05,2000,I,2.0,0.0001 50 19
0.3,2000,I,2.0,0.02 44 00.3,1000,I,1.5,0.02 41 10.5,1000,I,1.5,0.02 40 0
0.3,1000,I,2.0,0.0001 36 00.3,1000,I,2.0,0.02 35 0
0.05,2000,I,1.5,0.0001 35 120.5,2000,I,1.5,0.0001 34 10.1,2000,I,2.0,0.02 32 8
0.3,1000,I,1.5,0.0001 29 10.1,2000,I,2.0,0.0001 29 00.1,2000,I,1.5,0.0001 29 10.05,2000,I,1.5,0.02 23 140.3,2000,I,2.0,0.0001 12 10.1,1000,I,2.0,0.02 10 00.5,500,I,2.0,0.02 6 0
0.3,500,I,2.0,0.0001 6 00.5,2000,I,1.1,0.02 5 10.5,500,I,2.0,0.0001 4 00.5,1000,I,1.5,0.0001 3 0
0.3,500,I,2.0,0.02 3 00.5,2000,I,1.1,0.0001 2 00.3,500,I,1.5,0.0001 2 0
0.3,2000,I,1.1,0.0001 2 00.1,1000,I,1.5,0.02 2 0
0.05,1000,I,2.0,0.0001 2 00.5,500,I,1.5,0.02 1 00.3,2000,I,1.1,0.02 1 0
0.1,2000,I,1.1,0.0001 1 10.1,1000,I,2.0,0.0001 1 00.5,500,I,1.5,0.0001 0 00.5,500,I,1.1,0.02 0 0
0.5,500,I,1.1,0.0001 0 00.5,1000,I,1.1,0.02 0 0
0.5,1000,I,1.1,0.0001 0 00.3,500,I,1.5,0.02 0 00.3,500,I,1.1,0.02 0 0
0.3,500,I,1.1,0.0001 0 00.3,1000,I,1.1,0.02 0 0
0.3,1000,I,1.1,0.0001 0 00.1,500,I,2.0,0.02 0 0
0.1,500,I,2.0,0.0001 0 00.1,500,I,1.5,0.02 0 0
0.1,500,I,1.5,0.0001 0 00.1,500,I,1.1,0.02 0 0
0.1,500,I,1.1,0.0001 0 00.1,2000,I,1.1,0.02 0 0
0.1,1000,I,1.5,0.0001 0 00.1,1000,I,1.1,0.02 0 0
0.1,1000,I,1.1,0.0001 0 00.05,500,I,2.0,0.02 0 0
0.05,500,I,2.0,0.0001 0 00.05,500,I,1.5,0.02 0 0
0.05,500,I,1.5,0.0001 0 00.05,500,I,1.1,0.02 0 0
0.05,500,I,1.1,0.0001 0 00.05,2000,I,2.0,0.02 0 10.05,2000,I,1.1,0.02 0 0
0.05,2000,I,1.1,0.0001 0 10.05,1000,I,2.0,0.02 0 00.05,1000,I,1.5,0.02 0 0
0.05,1000,I,1.5,0.0001 0 00.05,1000,I,1.1,0.02 0 0
0.05,1000,I,1.1,0.0001 0 00.01,500,I,2.0,0.02 0 0
0.01,500,I,2.0,0.0001 0 00.01,500,I,1.5,0.02 0 0
0.01,500,I,1.5,0.0001 0 10.01,500,I,1.1,0.02 0 0
0.01,500,I,1.1,0.0001 0 00.01,2000,I,2.0,0.02 0 0
0.01,2000,I,2.0,0.0001 0 10.01,2000,I,1.5,0.02 0 0
0.01,2000,I,1.5,0.0001 0 00.01,2000,I,1.1,0.02 0 0
0.01,2000,I,1.1,0.0001 0 10.01,1000,I,2.0,0.02 0 0
0.01,1000,I,2.0,0.0001 0 00.01,1000,I,1.5,0.02 0 0
0.01,1000,I,1.5,0.0001 0 00.01,1000,I,1.1,0.02 0 0
0.01,1000,I,1.1,0.0001 0 1
*MAF,POP,MOD,OR,PREV, where MAF represents the minor allele frequency, POP is the number of individuals, MOD is the model used (with or without main effect and with or without epistasis effect), OR is the odds ratio, and PREV is the prevalence of the disease.
Table 3: A table containing the running time, CPU usage, and memory usage in each configuration.
Configuration* Running Time (s) CPU Usage (%) Memory Usage (KB)
0.5,500,I,2.0,0.02 2.70 130.19 320211.28
0.5,500,I,2.0,0.0001 2.69 136.88 319311.360.5,500,I,1.5,0.02 2.68 140.78 319508.72
0.5,500,I,1.5,0.0001 2.69 141.46 320285.240.5,500,I,1.1,0.02 2.73 136.88 320504.08
0.5,500,I,1.1,0.0001 2.70 136.47 319897.040.5,2000,I,2.0,0.02 4.10 156.28 327876.12
0.5,2000,I,2.0,0.0001 4.16 143.03 330393.48
0.5,2000,I,1.5,0.02 4.10 140.41 329206.280.5,2000,I,1.5,0.0001 4.01 136.85 327414.840.5,2000,I,1.1,0.02 3.96 125.00 325492.92
0.5,2000,I,1.1,0.0001 3.97 126.28 325792.920.5,1000,I,2.0,0.02 3.09 141.88 323600.36
0.5,1000,I,2.0,0.0001 3.12 139.30 324334.680.5,1000,I,1.5,0.02 3.08 141.47 323865.08
0.5,1000,I,1.5,0.0001 3.11 140.43 323880.440.5,1000,I,1.1,0.02 3.09 142.06 323780.88
0.5,1000,I,1.1,0.0001 3.12 141.69 323507.800.3,500,I,2.0,0.02 2.75 148.18 321318.64
0.3,500,I,2.0,0.0001 2.73 149.82 319605.000.3,500,I,1.5,0.02 2.73 149.43 321487.72
0.3,500,I,1.5,0.0001 2.75 150.12 320878.400.3,500,I,1.1,0.02 2.74 150.35 320952.24
0.3,500,I,1.1,0.0001 2.74 150.21 319914.160.3,2000,I,2.0,0.02 4.05 124.62 325950.12
0.3,2000,I,2.0,0.0001 4.04 119.74 325417.160.3,2000,I,1.5,0.02 4.04 122.47 325669.04
0.3,2000,I,1.5,0.0001 4.07 126.54 326147.320.3,2000,I,1.1,0.02 4.12 125.71 325679.80
0.3,2000,I,1.1,0.0001 4.11 123.02 325735.240.3,1000,I,2.0,0.02 3.07 118.96 322399.76
0.3,1000,I,2.0,0.0001 3.10 127.03 323056.560.3,1000,I,1.5,0.02 3.07 124.95 322673.52
0.3,1000,I,1.5,0.0001 3.11 131.41 323709.600.3,1000,I,1.1,0.02 3.09 134.61 323485.68
0.3,1000,I,1.1,0.0001 3.09 138.13 323444.760.1,500,I,2.0,0.02 2.75 119.13 320066.32
0.1,500,I,2.0,0.0001 2.74 119.29 319312.120.1,500,I,1.5,0.02 2.73 118.35 320222.28
0.1,500,I,1.5,0.0001 2.77 119.58 319002.320.1,500,I,1.1,0.02 2.77 118.50 320626.68
0.1,500,I,1.1,0.0001 2.76 121.01 320034.200.1,2000,I,2.0,0.02 4.01 119.18 325869.52
0.1,2000,I,2.0,0.0001 4.05 122.05 325484.960.1,2000,I,1.5,0.02 4.07 127.11 326038.04
0.1,2000,I,1.5,0.0001 4.09 126.69 326636.80
0.1,2000,I,1.1,0.02 4.10 127.66 326390.360.1,2000,I,1.1,0.0001 4.12 126.83 326720.760.1,1000,I,2.0,0.02 3.13 128.79 323402.72
0.1,1000,I,2.0,0.0001 3.13 128.00 323800.640.1,1000,I,1.5,0.02 3.12 126.40 323558.52
0.1,1000,I,1.5,0.0001 3.14 125.43 323584.040.1,1000,I,1.1,0.02 3.14 126.95 323569.56
0.1,1000,I,1.1,0.0001 3.14 126.27 323193.080.05,500,I,2.0,0.02 2.73 135.34 319177.48
0.05,500,I,2.0,0.0001 2.76 139.71 320980.880.05,500,I,1.5,0.02 2.73 131.66 320560.40
0.05,500,I,1.5,0.0001 2.76 139.02 320381.200.05,500,I,1.1,0.02 2.75 137.41 320737.96
0.05,500,I,1.1,0.0001 2.77 132.74 320620.160.05,2000,I,2.0,0.02 3.85 128.39 325633.16
0.05,2000,I,2.0,0.0001 3.93 135.36 324273.960.05,2000,I,1.5,0.02 3.87 144.42 326558.92
0.05,2000,I,1.5,0.0001 3.88 137.91 325713.840.05,2000,I,1.1,0.02 3.99 131.54 325690.40
0.05,2000,I,1.1,0.0001 3.94 131.49 324629.080.05,1000,I,2.0,0.02 2.94 147.28 323110.24
0.05,1000,I,2.0,0.0001 3.00 149.84 323443.360.05,1000,I,1.5,0.02 3.00 146.13 323144.92
0.05,1000,I,1.5,0.0001 3.02 143.14 323136.720.05,1000,I,1.1,0.02 3.00 143.31 323410.08
0.05,1000,I,1.1,0.0001 3.02 146.23 323356.000.01,500,I,2.0,0.02 2.63 154.11 320784.96
0.01,500,I,2.0,0.0001 2.65 150.07 320432.160.01,500,I,1.5,0.02 2.64 126.83 320529.56
0.01,500,I,1.5,0.0001 2.75 129.40 319814.800.01,500,I,1.1,0.02 2.76 129.15 320633.56
0.01,500,I,1.1,0.0001 2.72 182.19 321332.200.01,2000,I,2.0,0.02 3.99 130.97 325901.32
0.01,2000,I,2.0,0.0001 4.03 129.72 325971.000.01,2000,I,1.5,0.02 4.06 126.38 325816.40
0.01,2000,I,1.5,0.0001 4.02 110.41 324423.520.01,2000,I,1.1,0.02 4.00 121.24 325429.32
0.01,2000,I,1.1,0.0001 4.04 128.06 326333.80
0.01,1000,I,2.0,0.02 3.06 127.62 323421.920.01,1000,I,2.0,0.0001 3.07 127.73 323639.960.01,1000,I,1.5,0.02 3.07 126.69 323483.56
0.01,1000,I,1.5,0.0001 3.05 156.55 325006.560.01,1000,I,1.1,0.02 3.03 163.41 324945.28
0.01,1000,I,1.1,0.0001 3.01 156.46 320749.28
*MAF,POP,MOD,OR,PREV, where MAF represents the minor allele frequency, POP is the number of individuals, MOD is the model used (with or without main effect and with or without epistasis effect), OR is the odds ratio, and PREV is the prevalence of the disease.
Laboratory Note
Genetic Epistasis VI - Assessing Algorithm SNPHarvester

LN-6-2014

Ricardo Pinho and Rui Camacho
FEUP
Rua Dr. Roberto Frias, s/n,
4200-465 PORTO
Portugal
Fax: (+351) 22 508 1440
e-mail: [email protected]
www: http://www.fe.up.pt/∼
e-mail: [email protected]
www: http://www.fe.up.pt/∼rcamacho
May 2014
Abstract
In this lab note, the algorithm SNPHarvester is presented and tested. The algorithm takes a stochastic approach that searches for SNPs relevant to main effect and epistatic interactions, using a PathSeeker algorithm and identifying relevant results with the χ2 test. The results show that both the Power and the Type 1 Error Rate of the algorithm are high for main effect and full effect detection, while epistasis detection shows good results with very low error rates. The scalability test suggests that the algorithm may have problems with larger data sets.
1 Introduction
SNPHarvester [YHW+09] works as a stochastic algorithm, generating multiple paths among the many SNPs, which are joined into groups. Significant groups are selected if their scores are above a predetermined statistical threshold. The score function used to measure the association between a k-SNP group, where k is the number of SNPs in the epistatic interaction, and the phenotype is the χ2 test. For this purpose, a PathSeeker algorithm was developed: it randomly starts a new path and, for each group, tries to increase the score by changing only one SNP in the active set at a time, converging to a local optimum, typically in two or three iterations. The evaluation is based on the χ2 value, with a significance threshold of α = 0.05 after Bonferroni correction. A post-processing stage is applied to eliminate k-SNP groups that may be significant only due to a sub-group, and SNPs that may show a falsely strong association due to a small marginal effect. An L2-penalized logistic regression is used to filter out these interactions:
L(β0, β, λ) = −l(β0, β) + (λ/2)‖β‖²    (1)
where l(β0, β) is the binomial log-likelihood and λ is a regularization parameter. The difference between SNPHarvester and the other algorithms is that SNPHarvester focuses on local optima instead of a global optimum. Each local optimum is significant because there are usually multiple interaction patterns. SNPHarvester also uses sequential rather than parallel optimization, removing local optima during the search process, so the search space becomes smaller in later stages. Finally, SNPHarvester uses a model-free approach, randomly creating paths to directly detect significant associations.
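The one-SNP-at-a-time local search described above can be sketched as a first-improvement greedy loop. This is an illustration of the idea, not the released PathSeeker code: `score` stands for any group-scoring function (e.g. the χ2 statistic of the group against the phenotype), and all names are ours.

```python
import random

def path_seeker(n_snps, k, score, rng=random):
    # Start from a random k-SNP group, then repeatedly swap a single SNP
    # for any replacement that improves the score, stopping when no swap
    # helps, i.e. at a local optimum.
    group = rng.sample(range(n_snps), k)
    best = score(group)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(n_snps):
                if cand in group:
                    continue
                trial = group[:pos] + [cand] + group[pos + 1:]
                s = score(trial)
                if s > best:
                    group, best = trial, s
                    improved = True
    return sorted(group), best
```

Restarting this loop from many random paths yields the multiple local optima the algorithm collects; the Bonferroni-corrected χ2 threshold and the L2-penalized logistic regression filter are then applied on top and are not shown here.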
1.1 Input files
The input of SNPHarvester consists of the name of each column in the first row, i.e. the SNPs, with the label (phenotype) as the last column. The following rows contain the genotypes {0, 1, 2} as homozygous dominant, heterozygous, and homozygous recessive, respectively. The label {0, 1} corresponds to control and case, respectively.
X1, X2, X3, X4, Label
1, 1, 0, 2, 0
1, 1, 2, 1, 1
1, 2, 2, 0, 1
Table 1: An example of the input file containing genotype and phenotypeinformation with 4 SNPs and 3 individuals.
1.2 Output files
The algorithm's output contains the final extracted single or interacting SNPs, with the χ2 value of each specific interaction or single SNP, and the running time of the algorithm.
1.3 Parameters
There are two modes: a "Threshold-Based" mode, where the program outputs all of the significant SNPs above a user-specified significance threshold, and a "Top-K Based" mode, where the program outputs a specified number of SNP interactions regardless of their significance level. Both modes have parameters to choose the minimum and maximum number of interacting SNPs to be detected. If the minimum is 1, main effects of SNPs are also tested.
2 Experimental Settings
The data sets used in the experiments are characterized in Lab Note 1. The computer used for these experiments ran the 64-bit Ubuntu 13.10 operating system, with an Intel(R) Core(TM)2 Quad CPU Q6600 2.40GHz processor and 8.00 GB of RAM. SNPHarvester provides a Java program. The "Threshold-Based" mode was chosen for this analysis, with a significance level of α = 0.05. The heap size is set to -Xmx7000M. Main effects and pairwise interactions are tested in this experiment.
3 Results
SNPHarvester performs epistasis detection, main effect detection, and full effect detection. All data set configurations were used in this experiment. Figure 1 shows the Power obtained in relation to allele frequency and population size for epistasis (a), main effect (b), and epistasis + main effect (c). The
results show that the Power is higher in main effect detection than in epistasis detection overall, reaching 100% Power in data sets with 0.3 and 0.5 allele frequency. Epistasis detection shows a much lower result, with 0.1 allele frequency and 2000 individuals having the best Power for epistasis detection, at 85%. However, significant results can be seen with allele frequency as low as 0.05, which is not true for main effect detection. In full effect detection, the results are very similar to main effect detection.
Regarding scalability, Figure 2 shows a very significant difference in running time (a) between the data sets with 500 individuals, running for an average of 9.29 seconds, and the data sets with 2000 individuals, with an average of 33 seconds. This roughly linear growth, together with the slight increase in memory usage (c), reveals a scalability problem. The CPU usage (b) is near 100% across all data set sizes.
Type 1 Error Rates show a concerning increase in main effect and full effect data sets, relative to epistasis detection. This disproportion is due to the ease of main effect detection, which reveals highly valued ground truths, but also increases the chances of detecting false positives, even if their statistical significance is much lower than that of the ground truth. The Power is still higher than the Type 1 Error Rate in most cases, with the exception of high allele frequencies and large populations, which reveal error rates of 100%. This is not true for epistasis detection, which has a maximum error rate of 27% for data sets with 2000 individuals and an allele frequency of 0.05. The other configurations show a slight increase in error rate with the increase in data set population size. There is no clear Type 1 Error Rate difference between allele frequencies for epistasis detection.
The relation of Power with population and allele frequency is reinforced in Figures 4 and 7. However, the Power by allele frequency in epistasis detection shows a peak at 0.1 minor allele frequency and a descent for higher allele frequencies. Figure 6 shows a slight but not significant increase of Power with prevalence for epistasis detection and a slight decrease for main and full effect detection. The linear increase in Power with odds ratio shown in Figure 5 is similar to the distribution of Power by population.
[Bar graphs: Power (%) by allele frequency (0.01, 0.05, 0.1, 0.3, 0.5) for 500, 1000, and 2000 individuals. Panels: (a) Epistasis, (b) Main Effect, (c) Full Effect.]
Figure 1: Power by allele frequency. For each frequency, three sizes of data sets were used to measure the Power, with an odds ratio of 2.0 and a prevalence of 0.02. The Power is measured by the amount of data sets where the ground truth was amongst the most relevant results, out of all 100 data sets. (a), (b), and (c) represent epistatic, main effect, and main effect + epistatic interactions.
[Bar graphs by number of individuals (500, 1000, 2000). Panels: (a) Average Running Time (seconds), (b) Average CPU Usage (%), (c) Average Memory Usage (Mbytes).]
Figure 2: Comparison of scalability measures between different sized data sets. The data sets have a minor allele frequency of 0.5, an odds ratio of 2.0, a prevalence of 0.02, and use the full effect disease model.
[Bar graphs: Type 1 Error Rate (%) by allele frequency (0.01, 0.05, 0.1, 0.3, 0.5) for 500, 1000, and 2000 individuals. Panels: (a) Epistasis, (b) Main Effect, (c) Full Effect.]
Figure 3: Type 1 Error Rate by allele frequency. For each frequency, three sizes of data sets were used to measure the Type 1 Error Rate, with an odds ratio of 2.0 and a prevalence of 0.02. The Type 1 Error Rate is measured by the amount of data sets where false positives were amongst the most relevant results, out of all 100 data sets. (a), (b), and (c) represent epistatic, main effect, and main effect + epistatic interactions.
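The Power and Type 1 Error Rate measures used in these figures can be computed from per-data-set outcomes. A minimal sketch (the function name and input encoding are ours):

```python
def power_and_type1(results):
    """results: one (found_ground_truth, reported_false_positive) pair
    of booleans per simulated data set. Power = % of data sets where
    the ground-truth interaction was among the reported results;
    Type 1 Error Rate = % of data sets reporting any false positive."""
    n = len(results)
    power = 100.0 * sum(tp for tp, _ in results) / n
    type1 = 100.0 * sum(fp for _, fp in results) / n
    return power, type1
```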
4 Summary
In this experiment, SNPHarvester was tested using many data sets with significantly different configurations. The results show that the algorithm has a high Power in main effect and full effect detection, but also a high Type 1 Error Rate. For epistasis, the Power is lower but the Type 1 Error Rate values are very low. There is a linear increase of Power with the number of individuals and with odds ratio, and a significant increase with allele frequency. The algorithm shows scalability problems, due to the large increase in running time, which may be critical in genome wide studies.
References
[YHW+09] Can Yang, Zengyou He, Xiang Wan, Qiang Yang, Hong Xue, and Weichuan Yu. SNPHarvester: a filtering-based approach for detecting epistatic interactions in genome-wide association studies. Bioinformatics (Oxford, England), 25:504–511, 2009.
A Bar Graph
[Bar graphs: Power (%) by population (500, 1000, 2000). Panels: (a) Epistasis, (b) Main Effect, (c) Full Effect.]
Figure 4: Distribution of the Power by population for all disease models. The allele frequency is 0.1, the odds ratio is 2.0, and the prevalence is 0.02.
[Bar graphs: Power (%) by odds ratio (1.1, 1.5, 2.0). Panels: (a) Epistasis, (b) Main Effect, (c) Full Effect.]
Figure 5: Distribution of the Power by odds ratio for all disease models. The allele frequency is 0.1, the population size is 2000 individuals, and the prevalence is 0.02.
[Bar graphs: Power (%) by prevalence (0.0001, 0.02). Panels: (a) Epistasis, (b) Main Effect, (c) Full Effect.]
Figure 6: Distribution of the Power by prevalence for all disease models. The allele frequency is 0.1, the odds ratio is 2.0, and the population size is 2000 individuals.
[Bar graphs: Power (%) by allele frequency (0.01, 0.05, 0.1, 0.3, 0.5). Panels: (a) Epistasis, (b) Main Effect, (c) Full Effect.]
Figure 7: Distribution of the Power by allele frequency for all disease models. The population size is 2000 individuals, the odds ratio is 2.0, and the prevalence is 0.02.
B Table of Results
Table 2: A table containing the percentage of true positives and false positives in each configuration. The first column contains the description of the configuration. The second and third columns contain the number of data sets with true positives and false positives, respectively, out of all 100 data sets per configuration.
Configuration* TP (%) FP (%)
0.01,500,ME+I,2.0,0.02 0 2
0.01,500,ME+I,2.0,0.0001 0 50.01,500,ME+I,1.5,0.02 0 7
0.01,500,ME+I,1.5,0.0001 0 50.01,500,ME+I,1.1,0.02 0 1
0.01,500,ME+I,1.1,0.0001 0 60.01,500,ME,2.0,0.02 0 1
0.01,500,ME,2.0,0.0001 0 60.01,500,ME,1.5,0.02 0 1
0.01,500,ME,1.5,0.0001 0 60.01,500,ME,1.1,0.02 0 1
0.01,500,ME,1.1,0.0001 0 60.01,500,I,2.0,0.02 0 4
0.01,500,I,2.0,0.0001 0 50.01,500,I,1.5,0.02 0 5
0.01,500,I,1.5,0.0001 0 100.01,500,I,1.1,0.02 0 3
0.01,500,I,1.1,0.0001 0 60.01,2000,ME+I,2.0,0.02 0 1
0.01,2000,ME+I,2.0,0.0001 0 30.01,2000,ME+I,1.5,0.02 0 6
0.01,2000,ME+I,1.5,0.0001 0 40.01,2000,ME+I,1.1,0.02 0 6
0.01,2000,ME+I,1.1,0.0001 0 60.01,2000,ME,2.0,0.02 0 1
0.01,2000,ME,2.0,0.0001 0 30.01,2000,ME,1.5,0.02 0 6
0.01,2000,ME,1.5,0.0001 0 40.01,2000,ME,1.1,0.02 0 6
0.01,2000,ME,1.1,0.0001 0 40.01,2000,I,2.0,0.02 0 2
0.01,2000,I,2.0,0.0001 0 90.01,2000,I,1.5,0.02 0 8
0.01,2000,I,1.5,0.0001 0 120.01,2000,I,1.1,0.02 0 7
0.01,2000,I,1.1,0.0001 0 60.01,1000,ME+I,2.0,0.02 0 4
0.01,1000,ME+I,2.0,0.0001 0 30.01,1000,ME+I,1.5,0.02 0 10
0.01,1000,ME+I,1.5,0.0001 0 50.01,1000,ME+I,1.1,0.02 0 3
0.01,1000,ME+I,1.1,0.0001 0 50.01,1000,ME,2.0,0.02 0 10
0.01,1000,ME,2.0,0.0001 0 50.01,1000,ME,1.5,0.02 0 11
0.01,1000,ME,1.5,0.0001 0 50.01,1000,ME,1.1,0.02 0 3
0.01,1000,ME,1.1,0.0001 0 50.01,1000,I,2.0,0.02 0 4
0.01,1000,I,2.0,0.0001 0 40.01,1000,I,1.5,0.02 0 4
0.01,1000,I,1.5,0.0001 0 50.01,1000,I,1.1,0.02 0 6
0.01,1000,I,1.1,0.0001 0 50.05,500,ME+I,2.0,0.02 0 8
0.05,500,ME+I,2.0,0.0001 0 80.05,500,ME+I,1.5,0.02 0 4
0.05,500,ME+I,1.5,0.0001 0 50.05,500,ME+I,1.1,0.02 0 3
0.05,500,ME+I,1.1,0.0001 0 70.05,500,ME,2.0,0.02 0 5
0.05,500,ME,2.0,0.0001 0 70.05,500,ME,1.5,0.02 0 5
0.05,500,ME,1.5,0.0001 0 60.05,500,ME,1.1,0.02 0 7
0.05,500,ME,1.1,0.0001 0 40.05,500,I,2.0,0.02 0 4
0.05,500,I,2.0,0.0001 0 60.05,500,I,1.5,0.02 0 5
0.05,500,I,1.5,0.0001 0 110.05,500,I,1.1,0.02 0 5
0.05,500,I,1.1,0.0001 0 30.05,2000,ME+I,2.0,0.02 0 20
0.05,2000,ME+I,2.0,0.0001 26 420.05,2000,ME+I,1.5,0.02 0 13
0.05,2000,ME+I,1.5,0.0001 1 160.05,2000,ME+I,1.1,0.02 0 4
0.05,2000,ME+I,1.1,0.0001 0 90.05,2000,ME,2.0,0.02 1 24
0.05,2000,ME,2.0,0.0001 9 300.05,2000,ME,1.5,0.02 0 5
0.05,2000,ME,1.5,0.0001 0 150.05,2000,ME,1.1,0.02 0 6
0.05,2000,ME,1.1,0.0001 0 80.05,2000,I,2.0,0.02 18 27
0.05,2000,I,2.0,0.0001 45 50.05,2000,I,1.5,0.02 39 9
0.05,2000,I,1.5,0.0001 40 90.05,2000,I,1.1,0.02 0 5
0.05,2000,I,1.1,0.0001 0 60.05,1000,ME+I,2.0,0.02 0 8
0.05,1000,ME+I,2.0,0.0001 0 210.05,1000,ME+I,1.5,0.02 0 1
0.05,1000,ME+I,1.5,0.0001 0 70.05,1000,ME+I,1.1,0.02 0 4
0.05,1000,ME+I,1.1,0.0001 0 50.05,1000,ME,2.0,0.02 0 4
0.05,1000,ME,2.0,0.0001 1 260.05,1000,ME,1.5,0.02 0 2
0.05,1000,ME,1.5,0.0001 0 60.05,1000,ME,1.1,0.02 0 7
0.05,1000,ME,1.1,0.0001 0 40.05,1000,I,2.0,0.02 0 13
0.05,1000,I,2.0,0.0001 2 50.05,1000,I,1.5,0.02 0 4
0.05,1000,I,1.5,0.0001 1 40.05,1000,I,1.1,0.02 0 3
0.05,1000,I,1.1,0.0001 0 110.1,500,ME+I,2.0,0.02 0 9
0.1,500,ME+I,2.0,0.0001 41 380.1,500,ME+I,1.5,0.02 0 4
0.1,500,ME+I,1.5,0.0001 1 90.1,500,ME+I,1.1,0.02 0 3
0.1,500,ME+I,1.1,0.0001 1 40.1,500,ME,2.0,0.02 0 11
0.1,500,ME,2.0,0.0001 13 200.1,500,ME,1.5,0.02 0 7
0.1,500,ME,1.5,0.0001 0 70.1,500,ME,1.1,0.02 0 6
0.1,500,ME,1.1,0.0001 0 60.1,500,I,2.0,0.02 0 7
0.1,500,I,2.0,0.0001 1 30.1,500,I,1.5,0.02 0 5
0.1,500,I,1.5,0.0001 0 10.1,500,I,1.1,0.02 0 5
0.1,500,I,1.1,0.0001 0 70.1,2000,ME+I,2.0,0.02 95 79
0.1,2000,ME+I,2.0,0.0001 100 990.1,2000,ME+I,1.5,0.02 66 40
0.1,2000,ME+I,1.5,0.0001 61 480.1,2000,ME+I,1.1,0.02 3 8
0.1,2000,ME+I,1.1,0.0001 3 170.1,2000,ME,2.0,0.02 92 79
0.1,2000,ME,2.0,0.0001 99 880.1,2000,ME,1.5,0.02 25 20
0.1,2000,ME,1.5,0.0001 48 410.1,2000,ME,1.1,0.02 2 11
0.1,2000,ME,1.1,0.0001 1 70.1,2000,I,2.0,0.02 85 19
0.1,2000,I,2.0,0.0001 74 110.1,2000,I,1.5,0.02 41 9
0.1,2000,I,1.5,0.0001 23 60.1,2000,I,1.1,0.02 0 7
0.1,2000,I,1.1,0.0001 2 90.1,1000,ME+I,2.0,0.02 32 27
0.1,1000,ME+I,2.0,0.0001 97 740.1,1000,ME+I,1.5,0.02 1 11
0.1,1000,ME+I,1.5,0.0001 13 120.1,1000,ME+I,1.1,0.02 0 7
0.1,1000,ME+I,1.1,0.0001 0 80.1,1000,ME,2.0,0.02 38 22
0.1,1000,ME,2.0,0.0001 59 430.1,1000,ME,1.5,0.02 2 9
0.1,1000,ME,1.5,0.0001 6 120.1,1000,ME,1.1,0.02 0 7
0.1,1000,ME,1.1,0.0001 0 120.1,1000,I,2.0,0.02 21 9
0.1,1000,I,2.0,0.0001 9 90.1,1000,I,1.5,0.02 1 2
0.1,1000,I,1.5,0.0001 1 40.1,1000,I,1.1,0.02 0 6
0.1,1000,I,1.1,0.0001 0 120.3,500,ME+I,2.0,0.02 100 100
0.3,500,ME+I,2.0,0.0001 100 1000.3,500,ME+I,1.5,0.02 100 75
0.3,500,ME+I,1.5,0.0001 100 960.3,500,ME+I,1.1,0.02 77 21
0.3,500,ME+I,1.1,0.0001 93 280.3,500,ME,2.0,0.02 100 78
0.3,500,ME,2.0,0.0001 100 890.3,500,ME,1.5,0.02 89 27
0.3,500,ME,1.5,0.0001 89 370.3,500,ME,1.1,0.02 25 13
0.3,500,ME,1.1,0.0001 25 90.3,500,I,2.0,0.02 4 3
0.3,500,I,2.0,0.0001 20 80.3,500,I,1.5,0.02 1 3
0.3,500,I,1.5,0.0001 3 60.3,500,I,1.1,0.02 0 3
0.3,500,I,1.1,0.0001 0 60.3,2000,ME+I,2.0,0.02 100 100
0.3,2000,ME+I,2.0,0.0001 100 1000.3,2000,ME+I,1.5,0.02 100 100
0.3,2000,ME+I,1.5,0.0001 100 1000.3,2000,ME+I,1.1,0.02 100 99
0.3,2000,ME+I,1.1,0.0001 100 1000.3,2000,ME,2.0,0.02 100 100
0.3,2000,ME,2.0,0.0001 100 1000.3,2000,ME,1.5,0.02 100 100
0.3,2000,ME,1.5,0.0001 100 1000.3,2000,ME,1.1,0.02 100 67
0.3,2000,ME,1.1,0.0001 100 620.3,2000,I,2.0,0.02 70 11
0.3,2000,I,2.0,0.0001 73 200.3,2000,I,1.5,0.02 58 8
0.3,2000,I,1.5,0.0001 53 70.3,2000,I,1.1,0.02 1 5
0.3,2000,I,1.1,0.0001 1 80.3,1000,ME+I,2.0,0.02 100 100
0.3,1000,ME+I,2.0,0.0001 100 1000.3,1000,ME+I,1.5,0.02 100 99
0.3,1000,ME+I,1.5,0.0001 100 1000.3,1000,ME+I,1.1,0.02 100 66
0.3,1000,ME+I,1.1,0.0001 100 690.3,1000,ME,2.0,0.02 100 99
0.3,1000,ME,2.0,0.0001 100 1000.3,1000,ME,1.5,0.02 100 78
0.3,1000,ME,1.5,0.0001 100 750.3,1000,ME,1.1,0.02 93 30
0.3,1000,ME,1.1,0.0001 84 330.3,1000,I,2.0,0.02 43 9
0.3,1000,I,2.0,0.0001 79 90.3,1000,I,1.5,0.02 30 3
0.3,1000,I,1.5,0.0001 27 90.3,1000,I,1.1,0.02 0 4
0.3,1000,I,1.1,0.0001 0 50.5,500,ME+I,2.0,0.02 100 100
0.5,500,ME+I,2.0,0.0001 100 1000.5,500,ME+I,1.5,0.02 100 100
0.5,500,ME+I,1.5,0.0001 100 1000.5,500,ME+I,1.1,0.02 100 79
0.5,500,ME+I,1.1,0.0001 100 890.5,500,ME,2.0,0.02 100 99
0.5,500,ME,2.0,0.0001 100 970.5,500,ME,1.5,0.02 100 63
0.5,500,ME,1.5,0.0001 100 620.5,500,ME,1.1,0.02 80 27
0.5,500,ME,1.1,0.0001 79 280.5,500,I,2.0,0.02 2 4
0.5,500,I,2.0,0.0001 4 50.5,500,I,1.5,0.02 1 4
0.5,500,I,1.5,0.0001 0 70.5,500,I,1.1,0.02 0 1
0.5,500,I,1.1,0.0001 0 80.5,2000,ME+I,2.0,0.02 100 100
0.5,2000,ME+I,2.0,0.0001 99 990.5,2000,ME+I,1.5,0.02 100 100
0.5,2000,ME+I,1.5,0.0001 100 1000.5,2000,ME+I,1.1,0.02 100 100
0.5,2000,ME+I,1.1,0.0001 100 1000.5,2000,ME,2.0,0.02 100 100
0.5,2000,ME,2.0,0.0001 100 1000.5,2000,ME,1.5,0.02 100 100
0.5,2000,ME,1.5,0.0001 100 1000.5,2000,ME,1.1,0.02 100 100
0.5,2000,ME,1.1,0.0001 100 980.5,2000,I,2.0,0.02 33 5
0.5,2000,I,2.0,0.0001 78 90.5,2000,I,1.5,0.02 65 2
0.5,2000,I,1.5,0.0001 21 20.5,2000,I,1.1,0.02 7 8
0.5,2000,I,1.1,0.0001 2 70.5,1000,ME+I,2.0,0.02 100 100
0.5,1000,ME+I,2.0,0.0001 100 1000.5,1000,ME+I,1.5,0.02 100 100
0.5,1000,ME+I,1.5,0.0001 100 1000.5,1000,ME+I,1.1,0.02 100 100
0.5,1000,ME+I,1.1,0.0001 100 1000.5,1000,ME,2.0,0.02 100 100
0.5,1000,ME,2.0,0.0001 100 1000.5,1000,ME,1.5,0.02 100 100
0.5,1000,ME,1.5,0.0001 100 970.5,1000,ME,1.1,0.02 100 64
0.5,1000,ME,1.1,0.0001 100 690.5,1000,I,2.0,0.02 14 3
0.5,1000,I,2.0,0.0001 52 60.5,1000,I,1.5,0.02 28 5
0.5,1000,I,1.5,0.0001 1 50.5,1000,I,1.1,0.02 0 3
0.5,1000,I,1.1,0.0001 0 9
*MAF,POP,MOD,OR,PREV where MAF represents the minor allele frequency, POP is the number of individuals, MOD is the used model (with or without main effect and with or without epistasis effect), OR is the odds ratio, and PREV is the prevalence of the disease.
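A configuration key such as those in Table 2 can be decoded mechanically. A sketch (the function name is ours; field names follow the footnote above):

```python
def parse_configuration(key):
    """Parse a 'MAF,POP,MOD,OR,PREV' configuration key from the
    results tables into a dictionary of typed fields."""
    maf, pop, mod, odds, prev = key.split(",")
    return {
        "maf": float(maf),          # minor allele frequency
        "population": int(pop),     # number of individuals
        "model": mod,               # ME, I, or ME+I
        "odds_ratio": float(odds),
        "prevalence": float(prev),
    }
```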
Table 3: A table containing the running time, CPU usage, and memory usage for each configuration.
Configuration* Running Time (s) CPU Usage (%) Memory Usage (KB)
0.5,500,ME+I,2.0,0.02 0 75.44 8975.12
0.5,500,ME+I,2.0,0.0001 0 76.83 8975.000.5,500,ME+I,1.5,0.02 0.07 73.91 9593.72
0.5,500,ME+I,1.5,0.0001 0 74.17 8975.280.5,500,ME+I,1.1,0.02 0 78.78 8975.20
0.5,500,ME+I,1.1,0.0001 0 76.66 8974.800.5,500,ME,2.0,0.02 0 77.81 8974.84
0.5,500,ME,2.0,0.0001 0 77.69 8975.600.5,500,ME,1.5,0.02 0 78.18 8975.36
0.5,500,ME,1.5,0.0001 0 76.23 8975.160.5,500,ME,1.1,0.02 0 80.54 8975.24
0.5,500,ME,1.1,0.0001 0 78.98 8975.040.5,500,I,2.0,0.02 0 76.74 8975.16
0.5,500,I,2.0,0.0001 0 75.78 8974.760.5,500,I,1.5,0.02 0 77.19 8975.40
0.5,500,I,1.5,0.0001 0 78.27 8974.640.5,500,I,1.1,0.02 0 78.71 8975.20
0.5,500,I,1.1,0.0001 0 78.75 8975.00
0.5,2000,ME+I,2.0,0.02 1.03 84.06 11814.040.5,2000,ME+I,2.0,0.0001 35.91 100.13 76053.200.5,2000,ME+I,1.5,0.02 45.54 99.68 77689.64
0.5,2000,ME+I,1.5,0.0001 47.30 99.04 78211.720.5,2000,ME+I,1.1,0.02 53.11 99.33 78391.92
0.5,2000,ME+I,1.1,0.0001 51.93 98.62 79691.560.5,2000,ME,2.0,0.02 54.63 100.05 77422.96
0.5,2000,ME,2.0,0.0001 54.31 99.80 79153.360.5,2000,ME,1.5,0.02 44.44 101.10 76040.16
0.5,2000,ME,1.5,0.0001 39.89 101.33 75383.840.5,2000,ME,1.1,0.02 20.10 100.50 72422.88
0.5,2000,ME,1.1,0.0001 18.37 101.77 72461.680.5,2000,I,2.0,0.02 13.30 101.99 71876.44
0.5,2000,I,2.0,0.0001 14.32 101.73 70963.000.5,2000,I,1.5,0.02 14.52 102.26 70528.40
0.5,2000,I,1.5,0.0001 12.63 102.34 71372.920.5,2000,I,1.1,0.02 12.16 102.54 73187.88
0.5,2000,I,1.1,0.0001 11.82 101.52 71635.040.5,1000,ME+I,2.0,0.02 25.89 86.51 73035.92
0.5,1000,ME+I,2.0,0.0001 26.33 89.30 73462.040.5,1000,ME+I,1.5,0.02 26.16 98.22 73189.12
0.5,1000,ME+I,1.5,0.0001 25.22 100.91 73075.680.5,1000,ME+I,1.1,0.02 14.21 103.72 71475.92
0.5,1000,ME+I,1.1,0.0001 12.89 104.18 71463.920.5,1000,ME,2.0,0.02 19.63 102.59 72507.32
0.5,1000,ME,2.0,0.0001 17.44 103.13 72036.120.5,1000,ME,1.5,0.02 10.19 105.15 70972.68
0.5,1000,ME,1.5,0.0001 9.25 105.96 70377.840.5,1000,ME,1.1,0.02 6.62 107.41 69163.76
0.5,1000,ME,1.1,0.0001 6.88 109.39 69143.720.5,1000,I,2.0,0.02 6.60 108.50 69107.64
0.5,1000,I,2.0,0.0001 7.38 108.82 68815.080.5,1000,I,1.5,0.02 7.03 100.62 68200.44
0.5,1000,I,1.5,0.0001 6.38 101.91 67529.520.5,1000,I,1.1,0.02 6.47 102.15 68962.80
0.5,1000,I,1.1,0.0001 6.54 102.20 68464.080.3,500,ME+I,2.0,0.02 6.26 104.90 68340.96
0.3,500,ME+I,2.0,0.0001 12.21 102.21 70501.56
0.3,500,ME+I,1.5,0.02 4.13 106.37 66912.320.3,500,ME+I,1.5,0.0001 5.38 105.98 68096.640.3,500,ME+I,1.1,0.02 3.74 105.94 65277.00
0.3,500,ME+I,1.1,0.0001 3.82 105.86 65443.960.3,500,ME,2.0,0.02 4.08 106.11 65608.68
0.3,500,ME,2.0,0.0001 4.43 106.19 67463.800.3,500,ME,1.5,0.02 3.73 106.90 65098.52
0.3,500,ME,1.5,0.0001 3.80 109.81 65059.360.3,500,ME,1.1,0.02 3.67 109.74 64770.44
0.3,500,ME,1.1,0.0001 3.73 105.00 65267.440.3,500,I,2.0,0.02 3.79 102.17 65143.80
0.3,500,I,2.0,0.0001 3.90 105.07 65876.560.3,500,I,1.5,0.02 3.68 105.09 65198.44
0.3,500,I,1.5,0.0001 3.70 100.02 65044.640.3,500,I,1.1,0.02 3.67 105.56 64918.80
0.3,500,I,1.1,0.0001 3.75 103.87 65366.360.3,2000,ME+I,2.0,0.02 52.54 97.45 78867.88
0.3,2000,ME+I,2.0,0.0001 33.90 101.81 75939.520.3,2000,ME+I,1.5,0.02 50.68 100.31 76332.04
0.3,2000,ME+I,1.5,0.0001 54.60 101.14 75759.680.3,2000,ME+I,1.1,0.02 18.08 104.45 71770.00
0.3,2000,ME+I,1.1,0.0001 22.20 103.66 72230.520.3,2000,ME,2.0,0.02 47.70 101.73 76081.80
0.3,2000,ME,2.0,0.0001 51.00 98.52 77481.440.3,2000,ME,1.5,0.02 23.96 98.85 73314.16
0.3,2000,ME,1.5,0.0001 23.65 99.04 72820.920.3,2000,ME,1.1,0.02 13.11 96.47 71377.16
0.3,2000,ME,1.1,0.0001 12.94 100.18 72271.400.3,2000,I,2.0,0.02 14.49 99.58 70987.12
0.3,2000,I,2.0,0.0001 13.89 100.05 70761.040.3,2000,I,1.5,0.02 14.63 102.00 71405.12
0.3,2000,I,1.5,0.0001 14.30 99.79 71175.520.3,2000,I,1.1,0.02 11.88 100.25 72587.36
0.3,2000,I,1.1,0.0001 12.30 100.42 72231.520.3,1000,ME+I,2.0,0.02 24.07 99.29 72695.04
0.3,1000,ME+I,2.0,0.0001 25.39 99.83 72831.760.3,1000,ME+I,1.5,0.02 18.57 98.83 71546.72
0.3,1000,ME+I,1.5,0.0001 24.64 98.82 72522.36
0.3,1000,ME+I,1.1,0.02 11.58 98.75 71113.480.3,1000,ME+I,1.1,0.0001 11.98 98.73 71254.12
0.3,1000,ME,2.0,0.02 16.77 98.95 71090.840.3,1000,ME,2.0,0.0001 19.44 98.35 71731.320.3,1000,ME,1.5,0.02 12.06 98.94 71088.80
0.3,1000,ME,1.5,0.0001 12.44 98.89 70971.280.3,1000,ME,1.1,0.02 11.00 98.78 71013.08
0.3,1000,ME,1.1,0.0001 11.27 98.53 70955.920.3,1000,I,2.0,0.02 12.35 98.54 71127.96
0.3,1000,I,2.0,0.0001 13.31 98.89 70009.720.3,1000,I,1.5,0.02 12.28 98.94 71714.48
0.3,1000,I,1.5,0.0001 12.17 98.91 71538.400.3,1000,I,1.1,0.02 11.20 98.96 71779.92
0.3,1000,I,1.1,0.0001 11.12 98.62 71854.360.1,500,ME+I,2.0,0.02 6.07 99.82 70315.16
0.1,500,ME+I,2.0,0.0001 6.13 99.08 69049.480.1,500,ME+I,1.5,0.02 6.06 99.41 70168.84
0.1,500,ME+I,1.5,0.0001 6.12 98.96 69809.040.1,500,ME+I,1.1,0.02 6.13 99.57 70186.84
0.1,500,ME+I,1.1,0.0001 6.14 99.51 70018.720.1,500,ME,2.0,0.02 6.12 99.33 70033.16
0.1,500,ME,2.0,0.0001 6.15 99.98 69367.680.1,500,ME,1.5,0.02 6.16 99.87 70112.36
0.1,500,ME,1.5,0.0001 6.16 99.11 70216.160.1,500,ME,1.1,0.02 6.12 99.47 70135.28
0.1,500,ME,1.1,0.0001 6.13 99.55 70127.360.1,500,I,2.0,0.02 6.11 99.21 70007.60
0.1,500,I,2.0,0.0001 6.11 98.95 70187.960.1,500,I,1.5,0.02 6.16 98.99 70377.76
0.1,500,I,1.5,0.0001 6.08 99.29 70226.680.1,500,I,1.1,0.02 6.11 98.84 70228.80
0.1,500,I,1.1,0.0001 6.15 99.60 70123.920.1,2000,ME+I,2.0,0.02 24.39 98.80 72612.80
0.1,2000,ME+I,2.0,0.0001 35.63 98.93 74894.120.1,2000,ME+I,1.5,0.02 21.49 98.70 72437.88
0.1,2000,ME+I,1.5,0.0001 22.66 98.90 72547.360.1,2000,ME+I,1.1,0.02 20.98 98.79 73179.28
0.1,2000,ME+I,1.1,0.0001 21.20 98.72 73060.00
0.1,2000,ME,2.0,0.02 23.49 98.81 72854.640.1,2000,ME,2.0,0.0001 27.55 98.84 73112.760.1,2000,ME,1.5,0.02 20.96 98.65 72501.96
0.1,2000,ME,1.5,0.0001 22.40 98.84 72125.240.1,2000,ME,1.1,0.02 20.94 98.07 73506.48
0.1,2000,ME,1.1,0.0001 21.46 99.20 73131.600.1,2000,I,2.0,0.02 24.99 98.67 71017.84
0.1,2000,I,2.0,0.0001 25.33 98.95 71315.960.1,2000,I,1.5,0.02 24.72 99.68 72858.24
0.1,2000,I,1.5,0.0001 23.07 99.88 72926.080.1,2000,I,1.1,0.02 21.07 101.62 73633.20
0.1,2000,I,1.1,0.0001 21.58 101.93 73469.880.1,1000,ME+I,2.0,0.02 11.19 104.18 71543.48
0.1,1000,ME+I,2.0,0.0001 13.23 103.98 71146.120.1,1000,ME+I,1.5,0.02 11.13 104.34 71897.60
0.1,1000,ME+I,1.5,0.0001 11.32 104.25 71671.520.1,1000,ME+I,1.1,0.02 11.01 104.52 72502.08
0.1,1000,ME+I,1.1,0.0001 11.26 104.54 72783.760.1,1000,ME,2.0,0.02 11.10 104.39 71333.52
0.1,1000,ME,2.0,0.0001 11.59 104.16 71271.440.1,1000,ME,1.5,0.02 11.11 104.37 71968.00
0.1,1000,ME,1.5,0.0001 11.11 104.28 71837.240.1,1000,ME,1.1,0.02 11.00 104.53 72532.48
0.1,1000,ME,1.1,0.0001 11.18 104.47 72076.520.1,1000,I,2.0,0.02 11.52 104.40 71641.20
0.1,1000,I,2.0,0.0001 11.29 104.52 72135.840.1,1000,I,1.5,0.02 11.12 104.56 72310.20
0.1,1000,I,1.5,0.0001 11.12 104.58 72486.120.1,1000,I,1.1,0.02 11.11 104.71 72571.52
0.1,1000,I,1.1,0.0001 11.16 104.38 72365.840.05,500,ME+I,2.0,0.02 6.23 108.59 69799.12
0.05,500,ME+I,2.0,0.0001 6.14 108.27 69907.080.05,500,ME+I,1.5,0.02 6.13 108.57 70177.28
0.05,500,ME+I,1.5,0.0001 6.15 108.44 70259.120.05,500,ME+I,1.1,0.02 6.20 108.71 70280.92
0.05,500,ME+I,1.1,0.0001 6.13 108.78 69942.240.05,500,ME,2.0,0.02 6.19 108.97 70123.56
0.05,500,ME,2.0,0.0001 6.13 108.19 69865.56
0.05,500,ME,1.5,0.02 6.14 109.00 70221.360.05,500,ME,1.5,0.0001 6.14 108.88 69994.160.05,500,ME,1.1,0.02 6.12 108.32 69700.40
0.05,500,ME,1.1,0.0001 6.11 108.61 70141.160.05,500,I,2.0,0.02 6.14 108.80 70028.96
0.05,500,I,2.0,0.0001 6.16 108.74 70185.160.05,500,I,1.5,0.02 6.09 108.90 69967.76
0.05,500,I,1.5,0.0001 6.18 108.73 69698.040.05,500,I,1.1,0.02 6.22 107.68 69844.80
0.05,500,I,1.1,0.0001 6.21 101.49 69872.320.05,2000,ME+I,2.0,0.02 21.74 96.81 72821.72
0.05,2000,ME+I,2.0,0.0001 22.90 92.35 72652.160.05,2000,ME+I,1.5,0.02 21.44 97.61 73224.84
0.05,2000,ME+I,1.5,0.0001 21.35 97.65 73010.560.05,2000,ME+I,1.1,0.02 21.32 100.62 73583.24
0.05,2000,ME+I,1.1,0.0001 21.79 100.34 73084.720.05,2000,ME,2.0,0.02 21.33 102.42 72800.80
0.05,2000,ME,2.0,0.0001 22.45 100.06 72805.600.05,2000,ME,1.5,0.02 22.20 97.06 73334.84
0.05,2000,ME,1.5,0.0001 21.67 101.75 73099.680.05,2000,ME,1.1,0.02 21.64 102.07 73360.48
0.05,2000,ME,1.1,0.0001 21.36 102.47 73394.800.05,2000,I,2.0,0.02 21.75 101.65 72672.52
0.05,2000,I,2.0,0.0001 24.67 99.33 72954.520.05,2000,I,1.5,0.02 23.76 99.50 72027.12
0.05,2000,I,1.5,0.0001 24.41 99.04 72224.640.05,2000,I,1.1,0.02 21.99 98.53 73594.44
0.05,2000,I,1.1,0.0001 21.83 99.55 73442.160.05,1000,ME+I,2.0,0.02 11.21 102.98 71934.28
0.05,1000,ME+I,2.0,0.0001 11.22 102.77 71696.960.05,1000,ME+I,1.5,0.02 11.07 103.08 72868.56
0.05,1000,ME+I,1.5,0.0001 11.12 101.16 72054.760.05,1000,ME+I,1.1,0.02 10.98 103.88 72243.88
0.05,1000,ME+I,1.1,0.0001 11.04 105.79 71986.880.05,1000,ME,2.0,0.02 10.99 105.88 72255.96
0.05,1000,ME,2.0,0.0001 11.11 105.82 71831.640.05,1000,ME,1.5,0.02 10.91 105.93 72275.60
0.05,1000,ME,1.5,0.0001 11.01 105.99 72655.04
0.05,1000,ME,1.1,0.02 10.91 106.09 72631.160.05,1000,ME,1.1,0.0001 10.88 106.08 72100.20
0.05,1000,I,2.0,0.02 10.89 105.87 72380.600.05,1000,I,2.0,0.0001 10.98 105.90 72264.560.05,1000,I,1.5,0.02 11.06 105.79 72350.44
0.05,1000,I,1.5,0.0001 11.02 106.11 72392.640.05,1000,I,1.1,0.02 11.07 105.81 72310.56
0.05,1000,I,1.1,0.0001 11.17 105.82 72267.920.01,500,ME+I,2.0,0.02 6.10 110.76 70623.48
0.01,500,ME+I,2.0,0.0001 6.19 100.90 70173.600.01,500,ME+I,1.5,0.02 6.23 88.83 70297.80
0.01,500,ME+I,1.5,0.0001 6.19 92.12 70435.840.01,500,ME+I,1.1,0.02 6.21 95.10 70178.08
0.01,500,ME+I,1.1,0.0001 6.18 94.75 69908.240.01,500,ME,2.0,0.02 6.17 100.01 70107.48
0.01,500,ME,2.0,0.0001 6.20 98.25 70174.400.01,500,ME,1.5,0.02 6.20 96.74 70149.52
0.01,500,ME,1.5,0.0001 6.20 97.67 69945.920.01,500,ME,1.1,0.02 6.21 100.55 70039.24
0.01,500,ME,1.1,0.0001 6.24 91.17 69905.400.01,500,I,2.0,0.02 6.23 102.47 70315.16
0.01,500,I,2.0,0.0001 6.17 103.53 70046.480.01,500,I,1.5,0.02 6.25 101.20 69696.88
0.01,500,I,1.5,0.0001 6.18 101.27 69886.680.01,500,I,1.1,0.02 6.30 99.65 70166.36
0.01,500,I,1.1,0.0001 6.27 99.71 70019.800.01,2000,ME+I,2.0,0.02 21.61 99.05 73708.60
0.01,2000,ME+I,2.0,0.0001 21.65 98.32 73273.960.01,2000,ME+I,1.5,0.02 21.57 99.21 73551.20
0.01,2000,ME+I,1.5,0.0001 21.52 98.94 73723.640.01,2000,ME+I,1.1,0.02 21.20 99.76 73592.64
0.01,2000,ME+I,1.1,0.0001 21.51 96.31 73391.080.01,2000,ME,2.0,0.02 21.70 92.67 73853.88
0.01,2000,ME,2.0,0.0001 21.51 99.72 73350.920.01,2000,ME,1.5,0.02 21.44 99.98 73758.00
0.01,2000,ME,1.5,0.0001 21.44 99.25 73351.200.01,2000,ME,1.1,0.02 21.53 97.16 73346.52
0.01,2000,ME,1.1,0.0001 21.16 101.37 73605.40
0.01,2000,I,2.0,0.02 21.07 101.02 73651.600.01,2000,I,2.0,0.0001 21.55 100.96 73217.440.01,2000,I,1.5,0.02 21.19 100.46 73303.00
0.01,2000,I,1.5,0.0001 21.60 99.60 73325.360.01,2000,I,1.1,0.02 21.43 100.51 73260.24
0.01,2000,I,1.1,0.0001 21.91 97.81 73582.960.01,1000,ME+I,2.0,0.02 11.24 98.26 71785.68
0.01,1000,ME+I,2.0,0.0001 11.18 97.90 72111.080.01,1000,ME+I,1.5,0.02 11.29 98.82 71760.68
0.01,1000,ME+I,1.5,0.0001 11.29 99.32 71912.760.01,1000,ME+I,1.1,0.02 11.19 98.55 71999.88
0.01,1000,ME+I,1.1,0.0001 11.28 98.35 72015.680.01,1000,ME,2.0,0.02 11.34 97.87 71920.68
0.01,1000,ME,2.0,0.0001 11.35 99.01 72120.600.01,1000,ME,1.5,0.02 11.34 95.71 71681.44
0.01,1000,ME,1.5,0.0001 11.39 96.87 71781.120.01,1000,ME,1.1,0.02 11.21 99.33 71747.48
0.01,1000,ME,1.1,0.0001 11.35 98.86 71964.160.01,1000,I,2.0,0.02 11.19 98.84 71847.96
0.01,1000,I,2.0,0.0001 11.36 97.89 71583.320.01,1000,I,1.5,0.02 11.25 97.67 72072.00
0.01,1000,I,1.5,0.0001 11.29 96.01 71814.480.01,1000,I,1.1,0.02 11.28 97.48 71709.20
0.01,1000,I,1.1,0.0001 11.36 97.21 71803.64
*MAF,POP,MOD,OR,PREV where MAF represents the minor allele frequency, POP is the number of individuals, MOD is the used model (with or without main effect and with or without epistasis effect), OR is the odds ratio, and PREV is the prevalence of the disease.
Laboratory Note
Genetic Epistasis VII - Assessing Algorithm TEAM
LN-7-2014
Ricardo Pinho and Rui CamachoFEUP
Rua Dr Roberto Frias, s/n, 4200-465 PORTO, Portugal
Fax: (+351) 22 508 1440
e-mail: [email protected]
www: http://www.fe.up.pt/∼[email protected]
www: http://www.fe.up.pt/∼rcamacho
May 2014
Abstract
In this lab note, the algorithm TEAM is presented. TEAM is an exhaustive algorithm that works by updating contingency tables and a minimum spanning tree built from the SNPs. The results obtained show an increase in Power with population size, allele frequency, and odds ratio. There is also an increase in Type 1 Error Rate with population size, but no clear indicator for allele frequency. The scalability of the algorithm is questionable, considering that there is a big increase in the running time required by data sets with different population sizes, which is not relevant for these experiments but may be problematic for larger data sets.
1 Introduction
Tree-based epistasis association mapping (TEAM) [ZHZW10] is an exhaustive algorithm that computes all two-locus pairs to obtain a permutation test, which is applicable to all statistical relevancy tests, due to the contingency table generated. TEAM also uses the family-wise error rate (FWER) and the false discovery rate (FDR) to control the error rate using the permutation test, which is better than Bonferroni correction but also more computationally expensive. The algorithm builds a minimum spanning tree containing SNPs as nodes, in which the edges represent the genotype difference between two SNPs. This tree is used to update the contingency tables, allowing many individuals to be pruned.
The algorithm receives the SNP genotypes and the phenotype of each individual, creating a specified number of phenotype permutations. The contingency tables for each single locus are generated. The minimum spanning tree is built, using the genotype differences associated with each edge. The tree is then updated for each leaf node with the information related to the contingency table for the genotype relation between SNPs. The test values are then calculated, using the contingency tables.
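The tree construction can be sketched with Prim's algorithm, using the number of individuals whose genotype differs between two SNPs as the edge weight. This is an illustrative sketch of the idea, not the authors' code: light edges mean few differing individuals, hence few contingency-table updates along the tree.

```python
def genotype_distance(a, b):
    """Number of individuals whose genotype differs between SNPs a and b
    (the edge weight: only these individuals force table updates)."""
    return sum(x != y for x, y in zip(a, b))

def minimum_spanning_tree(snps):
    """Prim's algorithm over the complete SNP graph; returns a list of
    (parent, child, weight) edges rooted at SNP 0."""
    n = len(snps)
    edges = []
    # best[v] = (cheapest known distance into the tree, its parent)
    best = {v: (genotype_distance(snps[0], snps[v]), 0) for v in range(1, n)}
    while best:
        v = min(best, key=lambda u: best[u][0])
        w, parent = best.pop(v)
        edges.append((parent, v, w))
        for u in list(best):
            d = genotype_distance(snps[v], snps[u])
            if d < best[u][0]:
                best[u] = (d, v)
    return edges
```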
     S1 S2 S3 S4 S5 S6 S7 S8 S9 S10
X1   0  0  0  1  2  0  2  0  2  0
X2   2  0  0  2  0  2  0  1  2  1
X3   2  2  0  1  2  2  0  1  1  0
X4   0  2  2  0  0  0  0  1  0  1
X5   2  1  2  0  1  2  0  1  0  2
Y1   1  1  1  0  1  0  1  1  1  0
Y2   0  0  0  1  1  0  1  0  1  0
Y3   1  0  1  1  1  0  1  0  1  0
Table 1: An example of the input data, consisting of 5 SNPs X1,...,X5, the original phenotype Y1, and two permutations Y2, Y3, for 10 individuals S1,...,S10.
1.1 Input files
The input consists of 2 files: a file containing the genotype information and another containing the phenotype information for each individual.
                  Xi=0                          Xi=1                          Xi=2
         Xj=0     Xj=1     Xj=2        Xj=0     Xj=1     Xj=2        Xj=0     Xj=1     Xj=2     Total
Yk=0   Event a1 Event a2 Event a3    Event b1 Event b2 Event b3    Event e1 Event e2 Event e3
Yk=1   Event c1 Event c2 Event c3    Event d1 Event d2 Event d3    Event f1 Event f2 Event f3
Total                                                                                             M
Table 2: The contingency table between two SNPs Xi and Xj for a given phenotype Yk. M refers to the total number of individuals.
(a) Genotype

0011001121
1212111121
1001000102
2202121111

(b) Phenotype

0000000010
Table 3: An example of the input files containing genotype and phenotype information with 4 SNPs and 10 individuals. Genotypes 0, 1, 2 correspond to homozygous dominant, heterozygous, and homozygous recessive. The phenotypes 0 and 1 correspond to control and case, respectively.
1.2 Output files
The output consists of a list of every SNP pair and the relevant test score. The score can be calculated for any statistic defined over the contingency table. In this experiment, the test score corresponds to the chi-square statistic.
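As an illustration of computing the chi-square statistic from such a contingency table (phenotype rows × genotype-pair columns), a minimal sketch:

```python
def chi_square(table):
    """Pearson chi-square for a contingency table given as a list of
    rows of counts (here: 2 phenotype rows x 9 genotype-pair columns)."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            if expected:  # skip all-zero columns
                stat += (obs - expected) ** 2 / expected
    return stat
```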
1.3 Parameters
The customizable parameters are as follows:
• individual - The number of individuals in the data. In this case it depends on the data set parameters.

• SNPs - The number of SNPs in the data. In this case it is determined by the data sets (fixed at 300).

• permutation - The number of permutations used in the significance test.
• fdrthreshold - The FDR threshold for significance.
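The permutation test behind these parameters can be sketched as follows: for each permuted phenotype, the maximum pairwise score is recorded, and the empirical (1 − α) quantile of those maxima gives a FWER-controlling significance threshold. The score function is a parameter here; this is an illustrative sketch, not the TEAM implementation (which updates the contingency tables incrementally along the tree).

```python
import random
from itertools import combinations

def permutation_threshold(snps, labels, score, permutations=100, alpha=0.05):
    """FWER-controlling critical value: the empirical (1 - alpha)
    quantile of the per-permutation maximum pairwise score."""
    maxima = []
    shuffled = list(labels)  # copy so the caller's labels are untouched
    for _ in range(permutations):
        random.shuffle(shuffled)
        maxima.append(max(score(snps[i], snps[j], shuffled)
                          for i, j in combinations(range(len(snps)), 2)))
    maxima.sort()
    return maxima[int((1 - alpha) * permutations) - 1]
```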
2 Experimental Settings
The datasets used in the experiments are characterized in Lab Note 1. The computer used for these experiments ran the 64-bit Ubuntu 13.10 operating system, with an Intel(R) Core(TM)2 Quad CPU Q6600 2.40GHz processor and 8.00 GB of RAM.
TEAM provides a C++ program that takes as parameters the genotype file, the phenotype file, the number of individuals, the number of SNPs, the number of permutations for the significance test, and the FDR threshold. The number of permutations is set to 100 and the FDR threshold is set to 1.
3 Results
The algorithm only outputs pairwise relations between SNPs. Because of this, only epistasis detection is evaluated.
The Power observed in Figure 1 shows a maximum value of 8% for data sets with 500 individuals, 65% for 1000 individuals, and 95% for 2000 individuals. There is a strong correlation between the Power and the size of the data sets. However, for frequencies smaller than 0.1 there is near 0% Power for most configurations, with the exception of data sets with 2000 individuals and 0.05 minor allele frequency. These values also increase with allele frequency, with the exception of 0.5 allele frequency.
[Bar chart: Power (%) by allele frequency (0.01, 0.05, 0.1, 0.3, 0.5). 500 individuals: 0, 0, 0, 6, 8; 1000 individuals: 0, 1, 21, 47, 65; 2000 individuals: 0, 43, 92, 92, 95.]
Figure 1: Power by allele frequency. For each frequency, three sizes of data sets were used to measure the Power, with an odds ratio of 2.0 and prevalence of 0.02. The Power is measured by the amount of data sets where the ground truth was amongst the most relevant results, out of all 100 data sets.
The Type 1 Error Rate in Figure 2 has an interesting pattern.
[Bar chart: Type 1 Error Rate (%) by allele frequency (0.01, 0.05, 0.1, 0.3, 0.5). 500 individuals: 0, 0, 2, 0, 1; 1000 individuals: 2, 4, 5, 1, 0; 2000 individuals: 1, 37, 28, 10, 1.]
Figure 2: Type 1 Error Rate by allele frequency and population size. The Type 1 Error Rate is measured by the amount of data sets where the false positives were amongst the most relevant results, out of all 100 data sets.
The error rate clearly grows with the population size. However, the error does not necessarily increase with allele frequency, reaching a maximum of 37% in data sets with a 0.05 allele frequency and 2000 individuals. There is also a decrease at higher allele frequencies for data sets with 2000 individuals. Therefore, the relation between error rate and allele frequency remains undetermined.
There is a 10% difference in CPU usage (Figure 3b) and a 7-second difference in running time (Figure 3a), with maxima of 74% and 10 seconds, respectively. The memory usage increases from 162 MB to 228 MB, a roughly 40% increase. The most relevant increase is in running time: the running time for 2000 individuals is triple that for 500 individuals, which is a problem for big data sets.
There is a clear increase in Power with the odds ratio in Figure 5, especially from a 1.1 to a 1.5 odds ratio, and with population size in Figure 4, with emphasis on the difference between 1000 and 2000 individuals. The prevalence test shows very little difference between disease prevalences in Figure 6, and the allele frequency test shows growth with increasing minor allele frequency.
[Three bar charts by number of individuals (500, 1000, 2000): (a) average running time (seconds), (b) average CPU usage (%), (c) average memory usage (Mbytes).]
Figure 3: Comparison of scalability measures between different sized data sets. The data sets have a minor allele frequency of 0.5, an odds ratio of 2.0, and a prevalence of 0.02.
4 Summary
TEAM is an exhaustive algorithm that uses permutation tests to generate contingency tables, to which any relevancy test can then be applied. The results show an increase in Power with increasing population size and allele frequency. The scalability test shows that the running time for data sets with the largest population size is triple that for data sets with the smallest population size. The Type 1 Error Rate increases with the population size, but the relation between error rate and allele frequency remains undetermined. The results of data set configurations by population and allele frequency confirm the previously discussed results. Increasing the odds ratio yields a clear increase in Power, but increasing the prevalence yields nearly the same Power.
References
[ZHZW10] Xiang Zhang, Shunping Huang, Fei Zou, and Wei Wang. TEAM:efficient two-locus epistasis tests in human genome-wide associa-tion study. Bioinformatics (Oxford, England), 26:i217–i227, 2010.
A Bar Graphs
[Bar chart "Power by Population": 500 individuals: 0%; 1000: 21%; 2000: 92%.]
Figure 4: Distribution of the Power by population. The allele frequency is 0.1, the odds ratio is 2.0, and the prevalence is 0.02.
[Bar chart "Power by Odds Ratio": 1.1: 1%; 1.5: 81%; 2.0: 92%.]
Figure 5: Distribution of the Power by odds ratios. The allele frequency is 0.1, the number of individuals is 2000, and the prevalence is 0.02.
[Bar chart "Power by Prevalence": 0.0001: 89%; 0.02: 92%.]
Figure 6: Distribution of the Power by prevalence. The allele frequency is 0.1, the number of individuals is 2000, and the odds ratio is 2.0.
[Bar chart "Power by Frequency": 0.01: 0%; 0.05: 43%; 0.1: 92%; 0.3: 92%; 0.5: 95%.]
Figure 7: Distribution of the averaged Power by allele frequency. The number of individuals is 2000, the odds ratio is 2.0, and the prevalence is 0.02.
B Table of Results
Table 4: A table containing the percentage of true positives and false positives in each configuration. The first column contains the description of the configuration. The second and third columns contain the number of data sets with true positives and false positives respectively, out of all 100 data sets per configuration.
Configuration*          TP (%)  FP (%)
0.5,500,I,2.0,0.02      8       1
0.5,500,I,2.0,0.0001    9       0
0.5,500,I,1.5,0.02      3       0
0.5,500,I,1.5,0.0001    0       1
0.5,500,I,1.1,0.02      0       0
0.5,500,I,1.1,0.0001    0       3
0.5,2000,I,2.0,0.02     95      1
0.5,2000,I,2.0,0.0001   100     22
0.5,2000,I,1.5,0.02     93      7
0.5,2000,I,1.5,0.0001   47      1
0.5,2000,I,1.1,0.02     10      3
0.5,2000,I,1.1,0.0001   3       2
0.5,1000,I,2.0,0.02     65      0
0.5,1000,I,2.0,0.0001   79      5
0.5,1000,I,1.5,0.02     53      3
0.5,1000,I,1.5,0.0001   4       0
0.5,1000,I,1.1,0.02     0       1
0.5,1000,I,1.1,0.0001   0       3
0.3,500,I,2.0,0.02      6       0
0.3,500,I,2.0,0.0001    26      3
0.3,500,I,1.5,0.02      2       0
0.3,500,I,1.5,0.0001    5       1
0.3,500,I,1.1,0.02      0       0
0.3,500,I,1.1,0.0001    0       4
0.3,2000,I,2.0,0.02     92      10
0.3,2000,I,2.0,0.0001   100     56
0.3,2000,I,1.5,0.02     95      15
0.3,2000,I,1.5,0.0001   98      10
0.3,2000,I,1.1,0.02     2       1
0.3,2000,I,1.1,0.0001   2       4
0.3,1000,I,2.0,0.02     47      1
0.3,1000,I,2.0,0.0001   100     12
0.3,1000,I,1.5,0.02     40      3
0.3,1000,I,1.5,0.0001   49      5
0.3,1000,I,1.1,0.02     0       0
0.3,1000,I,1.1,0.0001   0       1
0.1,500,I,2.0,0.02      0       2
0.1,500,I,2.0,0.0001    1       1
0.1,500,I,1.5,0.02      1       0
0.1,500,I,1.5,0.0001    0       1
0.1,500,I,1.1,0.02      0       1
0.1,500,I,1.1,0.0001    0       1
0.1,2000,I,2.0,0.02     92      28
0.1,2000,I,2.0,0.0001   89      20
0.1,2000,I,1.5,0.02     81      5
0.1,2000,I,1.5,0.0001   42      3
0.1,2000,I,1.1,0.02     1       3
0.1,2000,I,1.1,0.0001   5       2
0.1,1000,I,2.0,0.02     21      5
0.1,1000,I,2.0,0.0001   12      6
0.1,1000,I,1.5,0.02     5       0
0.1,1000,I,1.5,0.0001   1       1
0.1,1000,I,1.1,0.02     0       2
0.1,1000,I,1.1,0.0001   0       4
0.05,500,I,2.0,0.02     0       0
0.05,500,I,2.0,0.0001   0       1
0.05,500,I,1.5,0.02     0       0
0.05,500,I,1.5,0.0001   0       3
0.05,500,I,1.1,0.02     0       1
0.05,500,I,1.1,0.0001   0       1
0.05,2000,I,2.0,0.02    43      37
0.05,2000,I,2.0,0.0001  57      21
0.05,2000,I,1.5,0.02    40      24
0.05,2000,I,1.5,0.0001  43      19
0.05,2000,I,1.1,0.02    0       3
0.05,2000,I,1.1,0.0001  0       3
0.05,1000,I,2.0,0.02    1       4
0.05,1000,I,2.0,0.0001  3       2
0.05,1000,I,1.5,0.02    0       1
0.05,1000,I,1.5,0.0001  1       3
0.05,1000,I,1.1,0.02    0       1
0.05,1000,I,1.1,0.0001  0       6
0.01,500,I,2.0,0.02     0       0
0.01,500,I,2.0,0.0001   0       2
0.01,500,I,1.5,0.02     0       0
0.01,500,I,1.5,0.0001   0       4
0.01,500,I,1.1,0.02     0       1
0.01,500,I,1.1,0.0001   0       0
0.01,2000,I,2.0,0.02    0       1
0.01,2000,I,2.0,0.0001  0       5
0.01,2000,I,1.5,0.02    0       3
0.01,2000,I,1.5,0.0001  0       4
0.01,2000,I,1.1,0.02    0       1
0.01,2000,I,1.1,0.0001  0       2
0.01,1000,I,2.0,0.02    0       2
0.01,1000,I,2.0,0.0001  0       1
0.01,1000,I,1.5,0.02    0       0
0.01,1000,I,1.5,0.0001  0       2
0.01,1000,I,1.1,0.02    0       2
0.01,1000,I,1.1,0.0001  0       3
*MAF,POP,MOD,OR,PREV where MAF represents the minor allele frequency, POP is the number of individuals, MOD is the used model (with or without main effect and with or without epistasis effect), OR is the odds ratio and PREV is the prevalence of the disease.
Table 5: A table containing the running time, CPU usage, and memory usage in each configuration.
Configuration*          Running Time (s)  CPU Usage (%)  Memory Usage (KB)
0.5,500,I,2.0,0.02      3.28              66.99          166590.64
0.5,500,I,2.0,0.0001    3.81              54.75          166590.28
0.5,500,I,1.5,0.02      3.07              74.40          166590.44
0.5,500,I,1.5,0.0001    3.76              57.98          166590.60
0.5,500,I,1.1,0.02      3.08              68.52          161592.60
0.5,500,I,1.1,0.0001    3.91              55.00          166590.72
0.5,2000,I,2.0,0.02     9.81              74.75          233543.92
0.5,2000,I,2.0,0.0001   11.00             72.09          233802.28
0.5,2000,I,1.5,0.02     9.83              72.85          233535.72
0.5,2000,I,1.5,0.0001   10.98             66.89          233821.76
0.5,2000,I,1.1,0.02     9.82              73.74          233562.12
0.5,2000,I,1.1,0.0001   10.99             69.46          233832.84
0.5,1000,I,2.0,0.02     5.28              69.71          181210.92
0.5,1000,I,2.0,0.0001   6.08              68.02          181210.72
0.5,1000,I,1.5,0.02     5.53              66.72          181210.60
0.5,1000,I,1.5,0.0001   6.10              66.35          181210.64
0.5,1000,I,1.1,0.02     5.40              68.68          181210.64
0.5,1000,I,1.1,0.0001   6.09              65.91          181210.84
0.3,500,I,2.0,0.02      3.12              71.53          166590.44
0.3,500,I,2.0,0.0001    3.79              56.19          166590.60
0.3,500,I,1.5,0.02      3.13              70.93          166590.72
0.3,500,I,1.5,0.0001    3.77              56.60          166590.72
0.3,500,I,1.1,0.02      3.08              72.65          166590.68
0.3,500,I,1.1,0.0001    3.78              56.01          166590.40
0.3,2000,I,2.0,0.02     9.84              72.54          233557.00
0.3,2000,I,2.0,0.0001   10.94             73.45          233778.48
0.3,2000,I,1.5,0.02     9.92              72.03          233546.36
0.3,2000,I,1.5,0.0001   10.95             73.75          233801.92
0.3,2000,I,1.1,0.02     9.95              72.35          233546.48
0.3,2000,I,1.1,0.0001   11.00             70.49          233828.96
0.3,1000,I,2.0,0.02     5.34              67.05          181210.88
0.3,1000,I,2.0,0.0001   6.09              63.97          181210.80
0.3,1000,I,1.5,0.02     5.35              69.00          181210.56
0.3,1000,I,1.5,0.0001   6.12              63.37          181210.80
0.3,1000,I,1.1,0.02     5.44              67.27          181210.44
0.3,1000,I,1.1,0.0001   6.11              65.06          181210.68
0.1,500,I,2.0,0.02      3.28              65.33          166590.76
0.1,500,I,2.0,0.0001    3.78              56.18          166590.60
0.1,500,I,1.5,0.02      3.13              71.07          166590.60
0.1,500,I,1.5,0.0001    3.81              55.52          166590.84
0.1,500,I,1.1,0.02      3.22              67.56          166590.64
0.1,500,I,1.1,0.0001    3.84              54.26          166590.52
0.1,2000,I,2.0,0.02     9.91              72.77          233527.88
0.1,2000,I,2.0,0.0001   10.95             73.18          233788.92
0.1,2000,I,1.5,0.02     9.94              71.34          233538.28
0.1,2000,I,1.5,0.0001   10.97             71.25          233803.52
0.1,2000,I,1.1,0.02     9.82              69.45          231225.76
0.1,2000,I,1.1,0.0001   10.76             73.08          233841.76
0.1,1000,I,2.0,0.02     5.46              66.40          181210.92
0.1,1000,I,2.0,0.0001   6.10              65.14          181210.64
0.1,1000,I,1.5,0.02     5.41              67.52          181210.80
0.1,1000,I,1.5,0.0001   6.17              63.74          181210.80
0.1,1000,I,1.1,0.02     5.49              65.42          181210.52
0.1,1000,I,1.1,0.0001   6.25              57.76          181210.68
0.05,500,I,2.0,0.02     3.06              74.66          166590.52
0.05,500,I,2.0,0.0001   3.67              63.00          166590.84
0.05,500,I,1.5,0.02     3.10              73.32          166590.60
0.05,500,I,1.5,0.0001   3.70              60.99          166590.68
0.05,500,I,1.1,0.02     3.09              74.07          166590.96
0.05,500,I,1.1,0.0001   3.74              60.54          166590.84
0.05,2000,I,2.0,0.02    10.87             75.38          233762.72
0.05,2000,I,2.0,0.0001  10.88             76.33          233830.32
0.05,2000,I,1.5,0.02    9.76              75.67          233551.64
0.05,2000,I,1.5,0.0001  10.84             77.87          233818.16
0.05,2000,I,1.1,0.02    9.76              76.10          233559.48
0.05,2000,I,1.1,0.0001  10.89             78.26          233821.16
0.05,1000,I,2.0,0.02    5.45              69.61          181210.40
0.05,1000,I,2.0,0.0001  6.01              69.13          181210.88
0.05,1000,I,1.5,0.02    5.24              74.24          181210.68
0.05,1000,I,1.5,0.0001  6.04              68.74          181211.00
0.05,1000,I,1.1,0.02    5.34              71.82          181210.52
0.05,1000,I,1.1,0.0001  5.99              68.72          181210.64
0.01,500,I,2.0,0.02     3.13              72.69          166590.40
0.01,500,I,2.0,0.0001   3.69              60.83          166590.96
0.01,500,I,1.5,0.02     3.02              76.70          166590.72
0.01,500,I,1.5,0.0001   3.77              59.05          166590.52
0.01,500,I,1.1,0.02     3.20              69.81          166590.60
0.01,500,I,1.1,0.0001   3.72              60.54          166590.64
0.01,2000,I,2.0,0.02    9.70              77.18          233557.00
0.01,2000,I,2.0,0.0001  10.87             76.75          233813.88
0.01,2000,I,1.5,0.02    9.76              76.94          233554.60
0.01,2000,I,1.5,0.0001  10.83             77.23          233817.52
0.01,2000,I,1.1,0.02    9.81              81.09          233562.32
0.01,2000,I,1.1,0.0001  10.86             76.11          233839.32
0.01,1000,I,2.0,0.02    5.28              73.35          181210.76
0.01,1000,I,2.0,0.0001  6.00              69.20          181210.72
0.01,1000,I,1.5,0.02    5.35              71.68          181210.80
0.01,1000,I,1.5,0.0001  6.05              68.01          181210.36
0.01,1000,I,1.1,0.02    5.35              71.71          181210.84
0.01,1000,I,1.1,0.0001  5.99              68.05          181211.00
*MAF,POP,MOD,OR,PREV where MAF represents the minor allele frequency, POP is the number of individuals, MOD is the used model (with or without main effect and with or without epistasis effect), OR is the odds ratio and PREV is the prevalence of the disease.
Laboratory Note
Genetic Epistasis
VIII - Assessing Algorithm MBMDR
LN-8-2014
Ricardo Pinho and Rui Camacho
FEUP
Rua Dr Roberto Frias, s/n,
4200-465 PORTO
Portugal
Fax: (+351) 22 508 1440
e-mail: [email protected]
www: http://www.fe.up.pt/∼ei09045
e-mail: [email protected]
www: http://www.fe.up.pt/∼rcamacho
May 2014
Abstract
Model-Based Multifactor Dimensionality Reduction (MBMDR) is an algorithm that builds on the previous MDR methodology, which consists of dividing SNPs into two clusters based on their risk of determining the disease. Instead of using a predetermined threshold based on the frequency of SNPs in the data, MBMDR uses a testing approach followed by a significance assessment. The results show high Power only for large data sets and a very low Type 1 Error Rate for all configurations. The running time makes the algorithm not viable for larger data sets.
1 Introduction
Multifactor Dimensionality Reduction (MDR) [CLEP07] is one of the most referenced algorithms for epistasis detection. MDR filters SNPs based on their frequency in case-control data, dividing them into high-risk or low-risk groups using a predetermined threshold. Using cross-validation and permutations to determine the high/low risk groups, the algorithm returns the high-risk loci that have the strongest connection to the disease outcome. However, it samples many SNPs together, analysing at most one significant epistasis model and skipping other possible SNP groups that may not have such a significant connection but may also be related to the disease. Model-Based Multifactor Dimensionality Reduction [MVV11] merges multi-locus genotypes that have significantly high or low risk based on association testing, rather than on a threshold value. The MB-MDR process can be divided into the following steps:
1. Multi-locus cell prioritization - Each two-locus genotype is assignedto either High risk, Low risk or No Evidence of risk categories.
2. Association test on lower-dimensional construct - The result of the first step creates a new variable whose value corresponds to one of the categories. This new variable is then tested against the original label to find the weight of the high- and low-risk genotype cells.
3. Significance assessment - This stage tries to correct the inflation oftype I errors after the combination of cells into the weight of High riskand Low risk. This is done using the Wald statistic.
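A minimal sketch of step 1 may help. Note that the real MB-MDR assigns the High/Low/No Evidence categories with an association test; here that is approximated by a fixed margin around the overall case proportion, so the function name and threshold are illustrative only:

```python
from collections import defaultdict

def categorize_cells(geno_a, geno_b, pheno, threshold=0.1):
    """Label each two-locus genotype cell H (high risk), L (low risk) or
    O (no evidence) by comparing its case proportion with the overall one.
    The fixed margin stands in for MB-MDR's association test."""
    cells = defaultdict(lambda: [0, 0])            # cell -> [controls, cases]
    for ga, gb, y in zip(geno_a, geno_b, pheno):
        cells[(ga, gb)][y] += 1
    overall = sum(pheno) / len(pheno)              # overall case proportion
    labels = {}
    for cell, (ctrl, case) in cells.items():
        prop = case / (ctrl + case)
        if prop > overall + threshold:
            labels[cell] = "H"
        elif prop < overall - threshold:
            labels[cell] = "L"
        else:
            labels[cell] = "O"
    return labels

labels = categorize_cells([0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 1, 1])
print(labels[(0, 0)], labels[(1, 1)])   # → L H
```

Step 2 then collapses the H and L cells into a single constructed variable, and step 3 corrects the resulting inflation of type I errors with the Wald statistic.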
1.1 Input files
The input file consists of the index and phenotype in the first two columns, and the genotype of each SNP in the following columns. The first row corresponds to the name of each column.
"","Y","SNP1","SNP2","SNP3","SNP4","SNP5"
"0", 0, 1, 2, 0, 0, 0
"1", 0, 0, 2, 1, 2, 0
"2", 1, 1, 0, 1, 0, 1
"3", 1, 1, 1, 2, 1, 0
Table 1: An example of the input file containing genotype and phenotype information with 5 SNPs and 4 individuals.
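A file in this layout can be read back with a few lines of code; the sketch below assumes the comma-separated layout shown in Table 1 and is illustrative, not part of MBMDR:

```python
import csv, io

def read_input(text):
    """Parse an MBMDR-style input: first column index, second the phenotype
    column "Y", remaining columns one genotype (0/1/2) per SNP."""
    rows = list(csv.reader(io.StringIO(text), skipinitialspace=True))
    snp_names = rows[0][2:]                      # "SNP1", "SNP2", ...
    pheno = []
    genos = {name: [] for name in snp_names}
    for row in rows[1:]:
        pheno.append(int(row[1]))
        for name, g in zip(snp_names, row[2:]):
            genos[name].append(int(g))
    return pheno, genos

sample = ('"","Y","SNP1","SNP2","SNP3","SNP4","SNP5"\n'
          '"0", 0, 1, 2, 0, 0, 0\n'
          '"1", 0, 0, 2, 1, 2, 0\n')
pheno, genos = read_input(sample)
print(pheno, genos["SNP3"])   # → [0, 0] [0, 1]
```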
1.2 Output files
The output consists of a list of selected SNP interactions with the following columns for each interaction:
1. SNP1...SNPx - Names of the SNPs in the interaction.
2. NH - Number of significant High risk genotypes in the interaction.
3. betaH - Regression coefficient of step 2 for the High risk exposure.
4. WH - Wald statistic for High risk category.
5. PH - P-value of the Wald test for the High risk category.
6. NL - Number of significant Low risk genotypes in the interaction.
7. betaL - Regression coefficient of step 2 for the Low risk exposure.
8. WL - Wald statistic for Low risk category.
9. PL - P-value of the Wald test for the Low risk category.
10. MIN.P - Minimum p-value (min(PH, PL)) for the interaction model.
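Reading this output back, interactions would typically be ranked by MIN.P and filtered at a significance level. A hypothetical parsing sketch, where the dictionary keys simply mirror the column names above:

```python
def rank_interactions(rows, alpha=0.05):
    """rows: list of dicts with keys 'snps', 'PH' and 'PL' (names assumed
    to mirror the MB-MDR output columns). Keeps rows with MIN.P < alpha,
    best first."""
    for r in rows:
        r["MIN.P"] = min(r["PH"], r["PL"])       # as defined in the output
    significant = [r for r in rows if r["MIN.P"] < alpha]
    return sorted(significant, key=lambda r: r["MIN.P"])

out = rank_interactions([
    {"snps": ("SNP1", "SNP3"), "PH": 0.002, "PL": 0.40},
    {"snps": ("SNP2", "SNP4"), "PH": 0.30, "PL": 0.21},
])
print([r["snps"] for r in out])   # → [('SNP1', 'SNP3')]
```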
1.3 Parameters
MBMDR accepts the following arguments:
• order - dimension of interactions to be analyzed.
• covar - (Optional) a data frame containing the covariates for adjustingregression models.
• exclude - (Optional) Value/s of missing data.
• risk.threshold - Threshold used to define the risk category of a multi-locus genotype. The default value is 0.1.
• adjust - (Optional) Types of regression adjustment. Can be ”none”,”covariates”, ”main effects” or ”both”. The default value is ”none”.
• first.model - Specifies the first interaction to be tested. Useful when the analysis was stopped before finishing.
• list.models - (Optional) Exhaustive list of models to be analyzed. Onlypossible interactions in this list will be analyzed.
• use.logistf - Boolean value indicating whether or not the logistf package should be used. The default value is TRUE.
• printStep1 - Boolean value that prints every model obtained if thevalue is TRUE. The default value is FALSE.
2 Experimental Settings
The datasets used in the experiments are characterized in Lab Note 1. The number of interactions selected is limited to 2, considering that the ground truth is a pairwise interaction, and all of the SNPs are tested against each other for pairwise interactions.
3 Results
The algorithm only outputs the statistical relevancy test of interactions between SNPs. Due to this, only epistatic disease model data sets are used in this experiment. Because of time constraints, several computers were used to obtain results, so it is not possible to compare scalability results. Figure 1 reveals a large increase in Power with population size for data sets with a minor allele frequency higher than 0.01. There is a big increase for data sets with 2000 individuals from a minor allele frequency of 0.05 to 0.1. Data sets with smaller population sizes have much lower Power, with zero Power for almost all data sets with 500 individuals. There is also a clear increase with minor allele frequency.
According to Figure 2 the Type 1 Error Rate is very low across all allelefrequencies and data set sizes, having a maximum of 6% and 2% for 0.05minor allele frequency with 2000 and 1000 individuals respectively. For otherallele frequencies, only 0.1 and 0.3 contain false positives for data sets with2000 individuals.
Figures 3 and 6 show the same results as Figure 1 from a different perspective. Figure 4 also shows an increase in Power with increasing odds ratio. Figure 5 shows a smaller increase in Power with increasing prevalence.
[Bar chart: Power (%) by allele frequency (0.01, 0.05, 0.1, 0.3, 0.5). 500 individuals: 0, 0, 0, 0, 1; 1000 individuals: 0, 0, 2, 7, 12; 2000 individuals: 0, 14, 54, 71, 85.]
Figure 1: Power by allele frequency. For each frequency, three sizes of data sets were used to measure the Power, with an odds ratio of 2.0 and prevalence of 0.02. The Power is measured by the amount of data sets where the ground truth was amongst the most relevant results, out of all 100 data sets.
[Bar chart: Type 1 Error Rate (%) by allele frequency (0.01, 0.05, 0.1, 0.3, 0.5). 500 individuals: 0, 0, 0, 0, 0; 1000 individuals: 0, 2, 0, 0, 0; 2000 individuals: 0, 6, 3, 2, 0.]
Figure 2: Type 1 Error Rate by allele frequency and population size. The Type 1 Error Rate is measured by the amount of data sets where the false positives were amongst the most relevant results, out of all 100 data sets.
4 Summary
MBMDR is an algorithm based on the popular MDR approach, clustering SNPs by high or low risk of determining the disease phenotype. The results show very high Power for data sets with 2000 individuals, but very low Power for all other configurations. The Type 1 Error Rate is very low, reaching a maximum of only 6% for a 0.05 allele frequency and 2000 individuals. No scalability results were obtained because of the expected running time of the algorithm, which already shows that it is not viable to use this algorithm on big data sets that might contain thousands or millions of SNPs.
References
[CLEP07] Yujin Chung, Seung Yeoun Lee, Robert C Elston, and Tae-sung Park. Odds ratio based multifactor-dimensionality reduc-tion method for detecting gene-gene interactions. Bioinformatics(Oxford, England), 23:71–76, 2007.
[MVV11] Jestinah M Mahachie John, Francois Van Lishout, and KristelVan Steen. Model-Based Multifactor Dimensionality Reductionto detect epistasis for quantitative traits in the presence of error-free and noisy data. Eur J Hum Genet, 19(6):696–703, June 2011.
A Bar Graphs
[Bar chart "Power by Population": 500 individuals: 0%; 1000: 2%; 2000: 54%.]
Figure 3: Distribution of the Power by population. The allele frequency is 0.1, the odds ratio is 2.0, and the prevalence is 0.02.
[Bar chart "Power by Odds Ratio": 1.1: 0%; 1.5: 1%; 2.0: 54%.]
Figure 4: Distribution of the Power by odds ratios. The allele frequency is 0.1, the number of individuals is 2000, and the prevalence is 0.02.
[Bar chart "Power by Prevalence": 0.0001: 36%; 0.02: 54%.]
Figure 5: Distribution of the Power by prevalence. The allele frequency is 0.1, the number of individuals is 2000, and the odds ratio is 2.0.
[Bar chart "Power by Frequency": 0.01: 0%; 0.05: 43%; 0.1: 92%; 0.3: 92%; 0.5: 95%.]
Figure 6: Distribution of the Power by allele frequency. The number of individuals is 2000, the odds ratio is 2.0, and the prevalence is 0.02.
B Table of Results
Table 2: A table containing the percentage of true positives and false positives in each configuration. The first column contains the description of the configuration. The second and third columns contain the number of data sets with true positives and false positives respectively, out of all 100 data sets per configuration.
Configuration*          TP (%)  FP (%)
0.5,500,I,2.0,0.02      1       0
0.5,500,I,2.0,0.0001    2       0
0.5,500,I,1.5,0.0001    0       0
0.5,500,I,1.1,0.02      0       0
0.5,500,I,1.1,0.0001    0       0
0.5,2000,I,2.0,0.02     85      0
0.5,2000,I,2.0,0.0001   91      2
0.5,2000,I,1.5,0.02     17      1
0.5,2000,I,1.5,0.0001   2       0
0.5,2000,I,1.1,0.02     0       0
0.5,2000,I,1.1,0.0001   0       0
0.5,1000,I,2.0,0.02     12      0
0.5,1000,I,2.0,0.0001   26      0
0.5,1000,I,1.5,0.02     0       0
0.5,1000,I,1.5,0.0001   0       10
0.3,500,I,2.0,0.02      0       0
0.3,500,I,2.0,0.0001    11      0
0.3,500,I,1.5,0.02      0       0
0.3,500,I,1.5,0.0001    0       0
0.3,500,I,1.1,0.02      0       0
0.3,500,I,1.1,0.0001    0       0
0.3,2000,I,2.0,0.02     71      2
0.3,2000,I,2.0,0.0001   100     8
0.3,2000,I,1.5,0.02     5       0
0.3,2000,I,1.5,0.0001   43      2
0.3,2000,I,1.1,0.02     0       0
0.3,2000,I,1.1,0.0001   0       0
0.3,1000,I,2.0,0.02     7       0
0.3,1000,I,2.0,0.0001   62      0
0.3,1000,I,1.5,0.02     0       0
0.3,1000,I,1.5,0.0001   5       0
0.3,1000,I,1.1,0.02     0       0
0.3,1000,I,1.1,0.0001   0       0
0.1,500,I,2.0,0.02      0       0
0.1,500,I,2.0,0.0001    0       0
0.1,500,I,1.5,0.02      0       0
0.1,500,I,1.5,0.0001    0       0
0.1,500,I,1.1,0.02      0       0
0.1,500,I,1.1,0.0001    0       0
0.1,2000,I,2.0,0.02     54      3
0.1,2000,I,2.0,0.0001   36      2
0.1,2000,I,1.5,0.02     1       0
0.1,2000,I,1.5,0.0001   0       0
0.1,2000,I,1.1,0.02     0       0
0.1,2000,I,1.1,0.0001   0       0
0.1,1000,I,2.0,0.02     2       0
0.1,1000,I,2.0,0.0001   1       0
0.1,1000,I,1.5,0.02     0       0
0.1,1000,I,1.5,0.0001   0       0
0.1,1000,I,1.1,0.02     0       0
0.1,1000,I,1.1,0.0001   0       0
0.05,500,I,2.0,0.02     0       0
0.05,500,I,2.0,0.0001   0       0
0.05,500,I,1.5,0.02     0       0
0.05,500,I,1.5,0.0001   0       0
0.05,500,I,1.1,0.02     0       0
0.05,500,I,1.1,0.0001   0       0
0.05,2000,I,2.0,0.02    14      6
0.05,2000,I,2.0,0.0001  3       1
0.05,2000,I,1.5,0.02    7       3
0.05,2000,I,1.5,0.0001  17      7
0.05,2000,I,1.1,0.02    0       0
0.05,2000,I,1.1,0.0001  0       0
0.05,1000,I,2.0,0.02    0       2
0.05,1000,I,2.0,0.0001  0       0
0.05,1000,I,1.5,0.02    0       0
0.05,1000,I,1.5,0.0001  0       0
0.05,1000,I,1.1,0.02    0       0
0.05,1000,I,1.1,0.0001  0       1
0.01,500,I,2.0,0.02     0       0
0.01,500,I,2.0,0.0001   0       0
0.01,500,I,1.5,0.02     0       0
0.01,500,I,1.5,0.0001   0       1
0.01,500,I,1.1,0.02     0       0
0.01,500,I,1.1,0.0001   0       0
0.01,1000,I,1.5,0.0001  0       0
0.01,1000,I,1.1,0.0001  0       0
*MAF,POP,MOD,OR,PREV where MAF represents the minor allele fre-quency, POP is the number of individuals, MOD is the used model (withor without main effect and with or without epistasis effect), OR is the oddsratio and PREV is the prevalence of the disease.
Laboratory Note
Genetic Epistasis
IX - Comparative Assessment of the Algorithms
LN-9-2014
Ricardo Pinho and Rui Camacho
FEUP
Rua Dr Roberto Frias, s/n,
4200-465 PORTO
Portugal
Fax: (+351) 22 508 1440
e-mail: [email protected]
www: http://www.fe.up.pt/∼ei09045
e-mail: [email protected]
www: http://www.fe.up.pt/∼rcamacho
May 2014
Abstract
This lab note contains the results obtained with the algorithms discussed in the previous lab notes. All algorithms are compared by their characteristics and by their Power, scalability, and Type 1 Error Rates in epistasis detection, main effect detection, and full effect detection. From the results obtained, we can see that BOOST has the highest Power in epistasis detection and main effect detection, but has a high error rate. Screen and Clean has a constant but high error rate overall, very low Power in epistasis detection, and average Power for the other models. SNPHarvester and SNPRuler have relatively low Power, but low error rates. TEAM has good Power, but a high error rate. MBMDR has good Power and a low Type 1 Error Rate, but very bad scalability. BEAM3 has high Power in main effect detection, but also a high error rate. In terms of scalability, BOOST is the most scalable and MBMDR the least.
1 Introduction
In this lab note, the epistasis detection algorithms used in earlier lab notes([PC14b][PC14c] [PC14d] [PC14e] [PC14f] [PC14g] [PC14h]) will be compared, usingthe results from the data sets and measurements discussed in Lab Note LN-1-2014 [PC14a].The algorithms used in this empirical study are BEAM 3.0 [Zha12]; BOOST[WYY+10a]; MBMDR [MVV11]; Screen and Clean [WDR+10]; SNPRuler[WYY+10b]; SNPHarvester [YHW+09]; and TEAM [ZHZW10]. Table 1 andTable 2 show the main characteristics of the search methods, scoring tech-niques, types of disease models detected, and the programming language ofthe tested algorithms [SZS+11].
Table 1: Similarities and differences between BEAM3, BOOST, MBMDR, and Screen & Clean.
Features                BEAM 3        BOOST        MBMDR        Screen & Clean
Search                  Stochastic    Exhaustive   Exhaustive   Heuristic
Permutation Test        √             −            √            −
Chi-square Test         −*            √            −*           −*
Tree/Graph Structure    √             −            −            −
Bonferroni Correction   −             √            −            √
Interactive Effect      √             √            √            √
Main Effect             √             √            √            √
Full Effect             √             √            √            √
Programming Language    C++           C            R            R
*Although BEAM3 can evaluate interactive and full effects, the evaluationtest is not comparable between methods. Only single SNPs are evaluatedwith χ2 test. MBMDR and Screen & Clean results are comparable withother algorithms.
Features                SNPHarvester   SNPRuler    TEAM
Search                  Stochastic     Heuristic   Exhaustive
Permutation Test        −              −           √
Chi-square Test         √              √           −
Tree Structure          −              √           √
Bonferroni Correction   √              √           −
Interactive Effect      √              √           √
Main Effect             √              −           −
Full Effect             √              −           −
Programming Language    Java           Java        C++
2 Comparative Assessment
The measures used to assess the quality of each algorithm are: Power; Scal-ability; and Type 1 Error Rate.
2.1 Power
The Power of an algorithm is related to its ability to find the ground truth of the disease. In this case, the Power is evaluated as the number of data sets, out of 100, where the algorithm finds the ground truth, and is measured as a percentage for each data set configuration. In each data set, the most significant interactions, i.e. α < 0.05, are selected.
2.2 Scalability
Scalability is determined by 3 main factors: execution time, CPU usage, and memory usage. Execution time is measured in seconds, CPU usage is measured as the percentage of processor usage by the algorithm, and memory usage is measured in kilobytes of RAM used by the algorithm. All measures are averaged over the 100 data sets in each data set configuration.
2.3 Type 1 Error Rate
Similar to the Power, the Type 1 Error Rate is determined by the number of data sets, out of 100, containing false positives within the most significant interactions, i.e. α < 0.05.
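Both measures can be expressed as simple counts over the 100 data sets of one configuration. A small illustrative helper (not code from the study):

```python
def power_and_t1er(results, ground_truth):
    """results: one list of reported SNP pairs per data set.
    Power = % of data sets whose output contains the ground-truth pair.
    T1ER  = % of data sets whose output contains at least one false positive."""
    n = len(results)
    hits = sum(1 for found in results if ground_truth in found)
    fps = sum(1 for found in results if any(p != ground_truth for p in found))
    return 100.0 * hits / n, 100.0 * fps / n

# Four toy "data sets": two recover the truth, one also reports a false pair.
p, e = power_and_t1er(
    [[("SNP1", "SNP2")], [("SNP1", "SNP2"), ("SNP3", "SNP9")], [], []],
    ("SNP1", "SNP2"))
print(p, e)   # → 50.0 25.0
```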
3 Experimental Procedure
As mentioned in Lab Note LN-1-2014 [PC14a], there are 270 different configurations of data sets, with different parameters: allele frequency (0.01, 0.05, 0.1, 0.3, and 0.5); population size (500, 1000, and 2000); odds ratio (1.1, 1.5, and 2.0); prevalence (0.0001 and 0.02); and disease model (Epistasis, Main Effect, and Epistasis + Main Effect). To test the Power and Type 1 Error Rate of the algorithms, the outputs of each algorithm are gathered for each data set configuration and the corresponding confusion matrix is created. The output of each algorithm is filtered, selecting only interactions with a statistical relevancy of 5%. From these confusion matrices, the number of data sets with true positives and false positives within each configuration is obtained and used to compare Power and Type 1 Error Rate, respectively. For scalability, the built-in shell command time was used to obtain all the scalability measures for all algorithms.
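The shell's time builtin was used in the study; a rough POSIX-only Python analogue that captures the same three measures might look like this (hypothetical helper, with the cumulative child-usage caveat noted in the comments):

```python
import subprocess, sys, time, resource  # resource is POSIX-only

def measure(cmd):
    """Run one command and report elapsed wall time (s), CPU usage (%) and
    peak child memory (KB) - an analogue of the shell's `time` builtin.
    Note: getrusage(RUSAGE_CHILDREN) is cumulative over all children run so
    far, so in a real harness each measurement should use a fresh process."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    elapsed = time.perf_counter() - start
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    cpu_pct = 100.0 * (usage.ru_utime + usage.ru_stime) / elapsed
    return elapsed, cpu_pct, usage.ru_maxrss   # ru_maxrss is in KB on Linux

# Demo on a trivial child process (the algorithm binary would go here).
elapsed, cpu_pct, mem_kb = measure([sys.executable, "-c", "pass"])
```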
4 Results
To compare each criterion, Tables 2, 3, and 4 represent the Power and Type 1 Error Rate of each algorithm, by number of individuals and allele frequency, for epistasis, main effect, and full effect detection respectively. Table 5 shows the results of the scalability measures used to evaluate each algorithm. For epistasis detection, we can see that, for data sets with 500 individuals, no algorithm has a Power above 26%. This shows a big difficulty in detecting epistasis with few individuals. The algorithm with the best Power for these data sets is BOOST, followed by TEAM, SNPRuler, and SNPHarvester. In error rate, however, the algorithm with the lowest values is SNPRuler, followed by TEAM, SNPHarvester, and BOOST. For data sets with 1000 individuals, there is a big increase in Power in all algorithms, reaching a maximum of 91%. BOOST has the best Power at all allele frequencies, followed by TEAM, SNPRuler, and SNPHarvester. SNPRuler is once again the algorithm with the lowest Type 1 Error Rate, followed by TEAM, BOOST, and SNPHarvester. For 2000 individuals, BOOST has the best Power with a maximum of 100%, followed by TEAM and SNPHarvester, with SNPRuler being better than SNPHarvester for a 0.5 minor allele frequency. The lowest error rate is achieved by SNPRuler. Each of the other algorithms has a high Type 1 Error Rate in at least one setting. Screen and Clean is clearly the worst algorithm, due to its lack of Power and high Type 1 Error Rate across all data set sizes. The Power shows an increase with allele frequency in each
algorithm, reaching its maximum at a 0.5 allele frequency. There is no clear correlation between error rate and allele frequency for any algorithm.
POP: 500 individuals
MAF      0.01        0.05        0.1         0.3         0.5
         P    T1ER   P    T1ER   P    T1ER   P    T1ER   P    T1ER
BOOST    0%   4%     0%   7%     1%   7%     14%  6%     26%  4%
SnC      0%   15%    0%   15%    0%   14%    0%   19%    0%   18%
SNPH     0%   4%     0%   4%     0%   7%     4%   3%     2%   4%
SNPR     0%   0%     0%   0%     0%   0%     3%   0%     6%   0%
TEAM     0%   0%     0%   0%     0%   2%     6%   0%     8%   1%

POP: 1000 individuals
MAF      0.01        0.05        0.1         0.3         0.5
         P    T1ER   P    T1ER   P    T1ER   P    T1ER   P    T1ER
BOOST    0%   7%     0%   4%     41%  5%     66%  2%     91%  8%
SnC      0%   17%    0%   22%    0%   16%    0%   15%    0%   22%
SNPH     0%   4%     0%   13%    21%  9%     43%  9%     14%  3%
SNPR     0%   0%     0%   0%     10%  0%     35%  0%     71%  0%
TEAM     0%   2%     1%   4%     21%  5%     47%  1%     65%  0%

POP: 2000 individuals
MAF      0.01        0.05        0.1         0.3         0.5
         P    T1ER   P    T1ER   P    T1ER   P    T1ER   P    T1ER
BOOST    0%   2%     7%   2%     94%  21%    100% 6%     100% 8%
SnC      0%   18%    0%   20%    6%   21%    2%   16%    0%   14%
SNPH     0%   2%     18%  27%    85%  19%    70%  11%    33%  5%
SNPR     0%   0%     0%   1%     32%  8%     44%  0%     92%  0%
TEAM     0%   1%     43%  37%    92%  28%    92%  10%    95%  1%
Table 2: This table contains the results for epistasis detection. A comparison between the tested algorithms: BOOST, Screen and Clean, SNPHarvester, SNPRuler, and TEAM. The table is organized by population size (POP) and minor allele frequency (MAF), with an odds ratio of 2.0 and a prevalence of 0.02. For each allele frequency, there are two columns: the Power (P) obtained, and the Type 1 Error Rate (T1ER).
In main effect detection, for 500 individuals, the best algorithm is BEAM3, closely followed by BOOST and SNPHarvester, with Screen and Clean far behind. The Type 1 Error Rate is lowest for MBMDR and Screen and Clean; BEAM3, SNPHarvester, and BOOST are very close to each other with very high error rates, BOOST having the highest. For 1000 individuals, BOOST has better Power than BEAM3, followed by SNPHarvester and Screen and Clean, with MBMDR far behind. The Type 1 Error Rate is highest for BOOST, very closely followed by BEAM3, SNPHarvester, and Screen and Clean, with MBMDR having the lowest error rate. For data sets with 2000 individuals, BOOST and MBMDR have better Power for allele frequencies lower than 0.1, while BEAM3, BOOST, and SNPHarvester are equally good at allele frequencies above 0.1. The error rate is generally lowest for MBMDR, followed by Screen and Clean.
Table 4 shows the full effect detection for BOOST, Screen and Clean, and SNPHarvester. BOOST and SNPHarvester have the highest Power for all allele frequencies, but also a high Type 1 Error Rate. Screen and Clean has high Power for high allele frequencies, but zero Power for configurations below 0.3 and a higher Type 1 Error Rate for configurations below 0.1. At the same time, Screen and Clean has the lowest overall Type 1 Error Rate, but also the worst Power. BOOST has the best ratio of Power to Type 1 Error Rate.
Table 5 shows the running time, CPU usage, and memory usage of all algorithms as scalability measures. Screen and Clean is the slowest recorded algorithm, followed by SNPHarvester, TEAM, BEAM3, and SNPRuler, with BOOST being the fastest. Screen and Clean also has the highest increase in running time, followed by SNPHarvester and TEAM, with BOOST, BEAM3, and SNPRuler far behind. SNPRuler is the algorithm with the highest CPU usage, having to resort to more than one core to finish each task. SNPHarvester, BOOST, BEAM3, and Screen and Clean are all close to 100%, with TEAM requiring the least CPU. BEAM3, BOOST, and TEAM show an increase of CPU usage with data set size, TEAM having the highest increase. In memory usage, SNPRuler shows the highest usage, closely followed by TEAM, Screen and Clean, SNPHarvester, and BEAM3, with BOOST far behind.
POP    500 individuals
MAF    0.01       0.05       0.1        0.3        0.5
       P    T1ER  P    T1ER  P    T1ER  P    T1ER  P    T1ER
BEAM3  0%   0%    0%   3%    0%   9%    100% 71%   100% 99%
BOOST  0%   1%    0%   1%    2%   12%   100% 78%   100% 97%
MBMDR  0%   0%    0%   0%    0%   0%    0%   0%    1%   0%
SnC    0%   14%   0%   17%   0%   21%   20%  23%   54%  15%
SNPH   0%   1%    0%   5%    0%   11%   100% 78%   100% 99%

POP    1000 individuals
MAF    0.01       0.05       0.1        0.3        0.5
       P    T1ER  P    T1ER  P    T1ER  P    T1ER  P    T1ER
BEAM3  0%   6%    0%   3%    32%  18%   100% 99%   100% 100%
BOOST  0%   7%    1%   3%    43%  23%   100% 99%   100% 100%
MBMDR  0%   0%    0%   2%    2%   0%    7%   0%    12%  0%
SnC    0%   14%   0%   21%   0%   23%   54%  28%   70%  30%
SNPH   0%   10%   0%   4%    38%  22%   100% 99%   100% 100%

POP    2000 individuals
MAF    0.01       0.05       0.1        0.3        0.5
       P    T1ER  P    T1ER  P    T1ER  P    T1ER  P    T1ER
BEAM3  0%   1%    1%   17%   92%  67%   100% 100%  100% 100%
BOOST  0%   1%    14%  11%   97%  74%   100% 100%  100% 100%
MBMDR  0%   0%    14%  6%    54%  3%    71%  2%    85%  0%
SnC    0%   13%   0%   22%   39%  36%   58%  38%   62%  48%
SNPH   0%   1%    1%   24%   92%  79%   100% 100%  100% 100%
Table 3: Results for main effect detection, comparing the tested algorithms BEAM3, BOOST, MBMDR, Screen and Clean (SnC), and SNPHarvester (SNPH). The table is organized by population size (POP) and minor allele frequency (MAF), with an odds ratio of 2.0 and a prevalence of 0.02. For each allele frequency there are two columns: the Power obtained (P) and the Type 1 Error Rate (T1ER).
POP    500 individuals
MAF    0.01       0.05       0.1        0.3        0.5
       P    T1ER  P    T1ER  P    T1ER  P    T1ER  P    T1ER
BOOST  0%   10%   0%   4%    1%   15%   100% 100%  100% 100%
SnC    0%   18%   0%   15%   0%   19%   30%  19%   49%  37%
SNPH   0%   2%    0%   8%    0%   9%    100% 100%  100% 100%

POP    1000 individuals
MAF    0.01       0.05       0.1        0.3        0.5
       P    T1ER  P    T1ER  P    T1ER  P    T1ER  P    T1ER
BOOST  0%   11%   2%   16%   42%  38%   100% 100%  100% 100%
SnC    0%   14%   0%   21%   0%   28%   58%  35%   73%  45%
SNPH   0%   4%    0%   8%    32%  27%   100% 100%  100% 100%

POP    2000 individuals
MAF    0.01       0.05       0.1        0.3        0.5
       P    T1ER  P    T1ER  P    T1ER  P    T1ER  P    T1ER
BOOST  0%   7%    15%  17%   98%  81%   100% 100%  100% 100%
SnC    0%   14%   0%   21%   0%   33%   40%  68%   91%  84%
SNPH   0%   1%    0%   20%   95%  79%   100% 100%  100% 100%
Table 4: Results for full effect detection, comparing the tested algorithms BOOST, Screen and Clean (SnC), and SNPHarvester (SNPH). The table is organized by population size (POP) and minor allele frequency (MAF), with an odds ratio of 2.0 and a prevalence of 0.02. For each allele frequency there are two columns: the Power obtained (P) and the Type 1 Error Rate (T1ER).
             Running Time (s)     CPU Usage (%)           Memory Usage (MB)
             500    1000   2000   500     1000    2000    500     1000    2000
BEAM3        4.9    7      8      87.8    96.3    95.5    4       4.3     5.8
BOOST        0.16   0.22   0.34   95.7    98.79   97.87   0.98    1       1.2
MBMDR*       -      -      -      -       -       -       -       -       -
SnC          8.05   18.65  34.65  75.7    98.99   77.25   129.8   137.2   152.5
SNPHarvester 9.29   25.89  33     102.1   86.5    101.6   68.35   71.3    76.86
SNPRuler     2.7    3.09   4.1    130.2   141.9   156.28  312.7   316     320.2
TEAM         3.28   5.28   9.81   66.99   69.71   74.75   162.7   176     228.1
Table 5: Scalability test containing the average running time, CPU usage, and memory usage by data set population size. *MBMDR has no scalability results because they were obtained on computers with hardware settings different from all other results. The data sets have a minor allele frequency of 0.5, a 2.0 odds ratio, and a 0.02 prevalence.
5 Results Discussion
The results obtained from all the different algorithms show interesting qualities. BOOST is clearly the algorithm with the highest Power, but it has a high Type 1 Error Rate. SNPRuler has a low Type 1 Error Rate, but not very high Power, and it only works for epistasis detection. Screen and Clean is ineffective in most settings, but has a relatively low Type 1 Error Rate and high Power for main effect and full effect detection in data sets with a high allele frequency. BEAM3 only works for main effect detection, but has high Power with a slightly lower error rate than BOOST. SNPHarvester has low Power, but also a low Type 1 Error Rate in all model types. MBMDR has good Power for certain configurations and a very low Type 1 Error Rate; however, it has a very high running time for each data set. TEAM has good Power, with a slightly high Type 1 Error Rate in certain configurations.

BOOST is the most scalable algorithm, followed by SNPRuler and BEAM3. This is especially important for large data sets and for the ability to work in an ensemble approach. In epistasis detection, considering the Power, Screen and Clean and SNPHarvester show the worst potential. For main effect detection, the Power is lowest for Screen and Clean and MBMDR. For full effect detection, Screen and Clean is once again the weakest algorithm.

With this information, the best algorithms for each scenario can be used together to maximize Power and lower the Type 1 Error Rate.

These experiments use more configurations than any previous empirical study. The configurations were processed by 7 of the state-of-the-art algorithms, which yielded interesting results. The contribution of these experiments is an unprecedentedly large comparison study, using various relevant measures, which allows for a full understanding of each algorithm.
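One way to realize the suggested combination of algorithms is a voting ensemble over the SNP pairs each algorithm reports: requiring more votes lowers the Type 1 Error Rate at the cost of Power. This scheme is an illustrative sketch (with made-up SNP identifiers), not a method evaluated in these experiments:

```python
from collections import Counter

def ensemble(reports, min_votes):
    """Keep the SNP pairs reported by at least `min_votes` algorithms.
    min_votes=1 gives the union (maximum Power); min_votes=len(reports)
    gives the intersection (lowest Type 1 Error Rate)."""
    votes = Counter(pair for pairs in reports for pair in set(pairs))
    return {pair for pair, v in votes.items() if v >= min_votes}

boost = {("rs1", "rs7"), ("rs3", "rs4")}
beam3 = {("rs1", "rs7")}
snph = {("rs1", "rs7"), ("rs5", "rs6")}
print(ensemble([boost, beam3, snph], min_votes=2))  # {('rs1', 'rs7')}
```

In practice the vote threshold would be chosen per scenario, using the per-algorithm Power and Type 1 Error Rate profiles measured above.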
References
[MVV11] Jestinah M Mahachie John, Francois Van Lishout, and Kristel Van Steen. Model-Based Multifactor Dimensionality Reduction to detect epistasis for quantitative traits in the presence of error-free and noisy data. Eur J Hum Genet, 19(6):696–703, June 2011.

[PC14a] Ricardo Pinho and Rui Camacho. Genetic Epistasis I - Materials and methods. 2014.

[PC14b] Ricardo Pinho and Rui Camacho. Genetic Epistasis II - Assessing Algorithm BEAM 3.0. 2014.

[PC14c] Ricardo Pinho and Rui Camacho. Genetic Epistasis III - Assessing Algorithm BOOST. 2014.

[PC14d] Ricardo Pinho and Rui Camacho. Genetic Epistasis IV - Assessing Algorithm Screen and Clean. 2014.

[PC14e] Ricardo Pinho and Rui Camacho. Genetic Epistasis V - Assessing Algorithm SNPRuler. 2014.

[PC14f] Ricardo Pinho and Rui Camacho. Genetic Epistasis VI - Assessing Algorithm SNPHarvester. 2014.

[PC14g] Ricardo Pinho and Rui Camacho. Genetic Epistasis VII - Assessing Algorithm TEAM. 2014.

[PC14h] Ricardo Pinho and Rui Camacho. Genetic Epistasis VIII - Assessing Algorithm MBMDR. 2014.

[SZS+11] Junliang Shang, Junying Zhang, Yan Sun, Dan Liu, Daojun Ye, and Yaling Yin. Performance analysis of novel methods for detecting epistasis, 2011.

[WDR+10] Jing Wu, Bernie Devlin, Steven Ringquist, Massimo Trucco, and Kathryn Roeder. Screen and clean: a tool for identifying interactions in genome-wide association studies. Genetic Epidemiology, 34:275–285, 2010.

[WYY+10a] Xiang Wan, Can Yang, Qiang Yang, Hong Xue, Xiaodan Fan, Nelson L S Tang, and Weichuan Yu. BOOST: A fast approach to detecting gene-gene interactions in genome-wide case-control studies. American Journal of Human Genetics, 87:325–340, 2010.

[WYY+10b] Xiang Wan, Can Yang, Qiang Yang, Hong Xue, Nelson L S Tang, and Weichuan Yu. Predictive rule inference for epistatic interaction detection in genome-wide association studies. Bioinformatics, 26:30–37, 2010.

[YHW+09] Can Yang, Zengyou He, Xiang Wan, Qiang Yang, Hong Xue, and Weichuan Yu. SNPHarvester: a filtering-based approach for detecting epistatic interactions in genome-wide association studies. Bioinformatics, 25:504–511, 2009.

[Zha12] Yu Zhang. A novel Bayesian graphical model for genome-wide multi-SNP association mapping. Genetic Epidemiology, 36:36–47, 2012.

[ZHZW10] Xiang Zhang, Shunping Huang, Fei Zou, and Wei Wang. TEAM: efficient two-locus epistasis tests in human genome-wide association study. Bioinformatics, 26:i217–i227, 2010.
A Bar Graphs
A.1 Population size
[Bar graphs: Power and Type 1 Error Rate (%) per algorithm; panels (a) 500, (b) 1000, and (c) 2000 individuals.]

Figure 1: These results correspond to epistasis detection by population size, with a 0.1 minor allele frequency, a 2.0 odds ratio, and a 0.02 prevalence. Each subfigure shows the Power and Type 1 Error Rate of BOOST, MBMDR, Screen and Clean, SNPHarvester, SNPRuler, and TEAM for data sets with 500 (a), 1000 (b), and 2000 (c) individuals.
[Bar graphs: Power and Type 1 Error Rate (%) per algorithm; panels (a) 500, (b) 1000, and (c) 2000 individuals.]

Figure 2: These results correspond to main effect detection by population size, with a 0.1 minor allele frequency, a 2.0 odds ratio, and a 0.02 prevalence. Each subfigure shows the Power and Type 1 Error Rate of BEAM3, BOOST, Screen and Clean, and SNPHarvester for data sets with 500 (a), 1000 (b), and 2000 (c) individuals.
[Bar graphs: Power and Type 1 Error Rate (%) per algorithm; panels (a) 500, (b) 1000, and (c) 2000 individuals.]

Figure 3: These results correspond to full effect detection by population size, with a 0.1 minor allele frequency, a 2.0 odds ratio, and a 0.02 prevalence. Each subfigure shows the Power and Type 1 Error Rate of BOOST, Screen and Clean, and SNPHarvester for data sets with 500 (a), 1000 (b), and 2000 (c) individuals.
A.2 Frequency
[Bar graphs: Power (P) and Type 1 Error Rate (T1ER) (%) per algorithm; panels (a)-(e) by minor allele frequency.]

Figure 4: These results correspond to epistasis detection by minor allele frequency, with 2000 individuals, a 2.0 odds ratio, and a 0.02 prevalence. Each subfigure shows the Power and Type 1 Error Rate of BOOST, MBMDR, Screen and Clean, SNPHarvester, SNPRuler, and TEAM for data sets with 0.01 (a), 0.05 (b), 0.1 (c), 0.3 (d), and 0.5 (e) allele frequencies.
[Bar graphs: Power (P) and Type 1 Error Rate (T1ER) (%) per algorithm; panels (a)-(e) by minor allele frequency.]

Figure 5: These results correspond to main effect detection by minor allele frequency, with 2000 individuals, a 2.0 odds ratio, and a 0.02 prevalence. Each subfigure shows the Power and Type 1 Error Rate of BEAM3, BOOST, Screen and Clean, and SNPHarvester for data sets with 0.01 (a), 0.05 (b), 0.1 (c), 0.3 (d), and 0.5 (e) allele frequencies.
[Bar graphs: Power (P) and Type 1 Error Rate (T1ER) (%) per algorithm; panels (a)-(e) by minor allele frequency.]

Figure 6: These results correspond to full effect detection by minor allele frequency, with 2000 individuals, a 2.0 odds ratio, and a 0.02 prevalence. Each subfigure shows the Power and Type 1 Error Rate of BOOST, Screen and Clean, and SNPHarvester for data sets with 0.01 (a), 0.05 (b), 0.1 (c), 0.3 (d), and 0.5 (e) allele frequencies.
A.3 Odds Ratio
[Bar graphs: Power and Type 1 Error Rate (%) per algorithm; panels (a) 1.1, (b) 1.5, and (c) 2.0 odds ratio.]

Figure 7: These results correspond to epistasis detection by odds ratio, with a minor allele frequency of 0.1, 2000 individuals, and a 0.02 prevalence. Each subfigure shows the Power and Type 1 Error Rate of BOOST, MBMDR, Screen and Clean, SNPHarvester, SNPRuler, and TEAM for data sets with a 1.1 (a), 1.5 (b), and 2.0 (c) odds ratio.
[Bar graphs: Power and Type 1 Error Rate (%) per algorithm; panels (a) 1.1, (b) 1.5, and (c) 2.0 odds ratio.]

Figure 8: These results correspond to main effect detection by odds ratio, with a minor allele frequency of 0.1, 2000 individuals, and a 0.02 prevalence. Each subfigure shows the Power and Type 1 Error Rate of BEAM3, BOOST, Screen and Clean, and SNPHarvester for data sets with a 1.1 (a), 1.5 (b), and 2.0 (c) odds ratio.
[Bar graphs: Power and Type 1 Error Rate (%) per algorithm; panels (a) 1.1, (b) 1.5, and (c) 2.0 odds ratio.]

Figure 9: These results correspond to full effect detection by odds ratio, with a minor allele frequency of 0.1, 2000 individuals, and a 0.02 prevalence. Each subfigure shows the Power and Type 1 Error Rate of BOOST, Screen and Clean, and SNPHarvester for data sets with a 1.1 (a), 1.5 (b), and 2.0 (c) odds ratio.
A.4 Prevalence
[Bar graphs: Power and Type 1 Error Rate (%) per algorithm; panels (a) 0.0001 and (b) 0.02 prevalence.]

Figure 10: These results correspond to epistasis detection by prevalence, with a minor allele frequency of 0.1, 2000 individuals, and a 2.0 odds ratio. Each subfigure shows the Power and Type 1 Error Rate of BOOST, MBMDR, Screen and Clean, SNPHarvester, SNPRuler, and TEAM for data sets with a 0.0001 (a) and 0.02 (b) prevalence.
[Bar graphs: Power and Type 1 Error Rate (%) per algorithm; panels (a) 0.0001 and (b) 0.02 prevalence.]

Figure 11: These results correspond to main effect detection by prevalence, with a minor allele frequency of 0.1, 2000 individuals, and a 2.0 odds ratio. Each subfigure shows the Power and Type 1 Error Rate of BEAM3, BOOST, Screen and Clean, and SNPHarvester for data sets with a 0.0001 (a) and 0.02 (b) prevalence.
[Bar graphs: Power and Type 1 Error Rate (%) per algorithm; panels (a) 0.0001 and (b) 0.02 prevalence.]

Figure 12: These results correspond to full effect detection by prevalence, with a minor allele frequency of 0.1, 2000 individuals, and a 2.0 odds ratio. Each subfigure shows the Power and Type 1 Error Rate of BOOST, Screen and Clean, and SNPHarvester for data sets with a 0.0001 (a) and 0.02 (b) prevalence.