MACHINE LEARNING METHODOLOGIES IN THE DISCOVERY OF THE INTERACTION
BETWEEN GENES IN COMPLEX DISEASES
RICARDO JOSÉ MOREIRA PINHO
MASTER'S DISSERTATION PRESENTED TO THE FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO
M 2014
FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO
Machine Learning methodologies in the discovery of the interaction between
genes in complex diseases
Ricardo Pinho
DISSERTATION
Mestrado Integrado em Engenharia Informática e Computação
Supervisor: Rui Camacho
Co-Supervisor: Alexessander Alves (Imperial College London, UK)
July 2014
Machine Learning methodologies in the discovery of the interaction between genes in complex diseases
Ricardo Pinho
Mestrado Integrado em Engenharia Informática e Computação
Approved in oral examination by the committee:
Chair: Assistant Professor Ana Cristina Ramada Paiva
External Examiner: Assistant Researcher Sérgio Guilherme Aleixo de Matos (Instituto de Engenharia Electrónica e Telemática de Aveiro)
Supervisor: Associate Professor Rui Carlos Camacho de Sousa Ferreira da Silva
July 2014
Abstract
In recent years, there has been substantial research on gene-gene interactions to analyze how complex diseases are affected by the genome. Many Genome Wide Association Studies (GWAS) have been performed with interesting results. This new interest is due to the computing power that is available today. Machine Learning methodologies quickly became a successful tool to find previously unknown genetic relations. The popularity of this field increased greatly after discovering the potential value of gene-gene studies in detecting and understanding how phenotypes are expressed.
The information available in the DNA of the human genome can be divided into functional subgroups that code different phenotypes. These subgroups are the genes, which can have different presentations and still exhibit the same behaviour. However, if a certain position within a gene changes its behaviour, that position is called a Single Nucleotide Polymorphism (SNP). These SNPs interact with each other to affect how genes work, which in turn affects the phenotypes that are expressed.
The purpose of this dissertation is to increase the knowledge obtained from these studies by detecting more interactions related to the manifestation of complex diseases. This is achieved by testing algorithms in a comprehensive empirical study, and by adding a new and improved Ensemble approach that shows better results than the existing state-of-the-art algorithms.
To achieve this goal, there are two main stages. The first stage consists of a comparison study amongst the most recent statistical and Machine Learning methodologies, using simulated data sets containing generated epistatic interactions. The algorithms BEAM3.0, BOOST, MBMDR, Screen and Clean, SNPHarvester, SNPRuler, and TEAM were run on many different data set configurations. The results showed that, with the exception of Screen and Clean and MBMDR, all algorithms displayed good results in relation to Power, Type I Error Rate, and Scalability.
The second stage is the creation of a new algorithm based on the results obtained in the first stage. This new algorithm is an aggregation of the previously tested methodologies, of which 5 algorithms were chosen. This new Ensemble approach manages to maintain the Power of the best algorithm, while decreasing the Type I Error Rate.
Acknowledgements
First and foremost I would like to thank my Supervisor, Professor Rui Camacho, for the effort, patience and dedication to this project, which would have been impossible to accomplish without his support. I would also like to thank my co-supervisor Alexessander Alves, for saving me a lot of trouble and helping me understand the specifics of the project. Considering that this area is very new to me, his expertise was very much needed and appreciated.
I want to thank my family, especially my parents, for giving me the opportunity to learn and work on something that I love and for always believing in me.
Ricardo Pinho
"An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem."
John Wilder Tukey
Contents
1 Introduction
  1.1 Context
  1.2 Project
  1.3 Motivation and Goals
  1.4 Structure of the Report

2 State-of-the-Art
  2.1 Biological concepts
  2.2 Statistical and Machine Learning Algorithms
  2.3 Data analysis evaluation procedures and measures
  2.4 Data Simulation and Analysis Software
  2.5 Chapter Conclusions

3 A Comparative Study of Epistasis and Main Effect Analysis Algorithms
  3.1 Introduction
  3.2 Methods
    3.2.1 Algorithms for interaction analysis
  3.3 Simulation Design
  3.4 Chapter Conclusions

4 Ensemble Approach
  4.1 Introduction
  4.2 Experiments
  4.3 Chapter Conclusions

5 Conclusions
  5.1 Contribution summary

References

A Glossary
  A.1 Biology related terms
  A.2 Data mining terms
  A.3 Lab Notes
List of Figures
2.1 An illustration of the interior of a cell. [cel14]
2.2 Bayesian Network. Nodes represent SNPs. [JNBV11]
2.3 An example of a Neural Network. [HK06]
2.4 A logit transformation and a possible logistic regression function resulting from the logit transformation. [WFH11]
2.5 An example of a ROC curve.
2.6 The main stages of the KDD process. [FPSS96]
2.7 The CRISP-DM life cycle. [CCK00]
2.8 A diagram of the ATHENA software package [HDF+13].
2.9 The Weka Explorer interface.
3.1 Results of epistasis detection by population size.
3.2 Results of main effect detection by population size.
3.3 Results of full effect detection by population size.
3.4 Results of epistasis detection by minor allele frequency.
3.5 Results of main effect detection by minor allele frequency.
3.6 Results of full effect detection by minor allele frequency.
4.1 Results of epistasis detection by population size, with a 0.1 minor allele frequency, 2.0 odds ratio, and 0.02 prevalence.
4.2 Results of main effect detection by population size, with a 0.1 minor allele frequency, 2.0 odds ratio, and 0.02 prevalence.
4.3 Results of full effect detection by population size, with a 0.1 minor allele frequency, 2.0 odds ratio, and 0.02 prevalence.
4.4 Results of epistasis detection by minor allele frequency, with 2000 individuals, 2.0 odds ratio, and 0.02 prevalence.
4.5 Results of main effect detection by minor allele frequency, with 2000 individuals, 2.0 odds ratio, and 0.02 prevalence.
4.6 Results of full effect detection by minor allele frequency, with 2000 individuals, 2.0 odds ratio, and 0.02 prevalence.
4.7 Results of epistasis detection by odds ratio, with a 0.1 minor allele frequency, 2000 individuals, and 0.02 prevalence.
4.8 Results of main effect detection by odds ratio, with a 0.1 minor allele frequency, 2000 individuals, and 0.02 prevalence.
4.9 Results of full effect detection by odds ratio, with a 0.1 minor allele frequency, 2000 individuals, and 0.02 prevalence.
4.10 Results of epistasis detection by prevalence, with a 0.1 minor allele frequency, 2000 individuals, and 2.0 odds ratio.
4.11 Results of main effect detection by prevalence, with a 0.1 minor allele frequency, 2000 individuals, and 2.0 odds ratio.
4.12 Results of full effect detection by prevalence, with a 0.1 minor allele frequency, 2000 individuals, and 2.0 odds ratio.
List of Tables
2.1 An example of a penetrance table.
2.2 A description of each data selection algorithm.
2.3 A description of each model creation algorithm designed for this problem.
2.4 A description of each generic model creation algorithm.
2.5 A description of auxiliary algorithms used in model creation or data selection.
2.6 A description of data analysis measures and how they are calculated.
2.7 A description of data analysis procedures.
2.8 A comparison between the different procedures [AS08].
2.9 A comparison of the most relevant features of data simulation tools.
2.10 A comparison of data mining tools.
2.11 Similarities and differences between BEAM3, BOOST, MBMDR, Screen & Clean, SNPHarvester, SNPRuler, and TEAM.
3.1 The values of each parameter used. Each configuration has a unique set of the parameters used.
3.2 Scalability test containing the average running time, CPU usage, and memory usage by data set population size.
4.1 Scalability test containing the average running time, CPU usage, and memory usage by data set population size for epistasis detection.
4.2 Scalability test containing the average running time, CPU usage, and memory usage by data set population size for main effect detection.
4.3 Scalability test containing the average running time, CPU usage, and memory usage by data set population size for full effect detection.
Abbreviations
A         Adenine
ACO       Ant Colony Optimization
ALEPH     A Learning Engine for Proposing Hypotheses
API       Application Programming Interface
ATHENA    Analysis Tool for Heritable and Environmental Network Associations
AUC       Area Under the Curve
BDA       Backward Dropping Algorithm
BEAM      Bayesian Epistasis Association Mapping
BOOST     Boolean Operation-based Screening and Testing
C         Cytosine
CRISP-DM  Cross Industry Standard Process for Data Mining
DAG       Directed Acyclic Graph
DM        Data Mining
FOL       First Order Logic
G         Guanine
GENN      Grammatical Evolution Neural Networks
GPNN      Genetically Programmed Neural Networks
GPAS      Genetic Programming for Association Studies
GUI       Graphical User Interface
GWAS      Genome Wide Association Study
IDE       Integrated Development Environment
IIM       Information Interaction Method
ILP       Inductive Logic Programming
K-NN      K-Nearest Neighbor
KEEL      Knowledge Extraction Evolutionary Learning
KDD       Knowledge Discovery in Databases
KNIME     Konstanz Information Miner
MAGENTA   Meta-Analysis Gene-set Enrichment of variaNT Associations
MCMC      Markov Chain Monte Carlo
MDR       Multifactor Dimensionality Reduction
MDS       MultiDimensional Scaling
ML        Machine Learning
NB        Naïve Bayes
NN        Neural Network
OS        Operating System
PDIS      Dissertation Planning
PMML      Predictive Model Markup Language
ROC       Receiver Operating Characteristic
S&C       Screen & Clean
SAS       Statistical Analysis System
SEMMA     Sample, Explore, Modify, Model and Assess
SNP       Single Nucleotide Polymorphism
SVM       Support Vector Machine
T         Thymine
TEAM      Tree-based Epistasis Association Mapping
Chapter 1
Introduction
This chapter introduces the context by discussing the evolution of epistasis research. The next section contains a discussion of the project and the contribution of the dissertation. It is followed by the motivation and goals for this work. The last section describes the structure of the document.
1.1 Context
Epistasis is the interaction between genes that work together to affect the manifestation of a complex disease. The study of epistasis to determine the expression of phenotypes has long been known to yield interesting results, since most phenotypes cannot be explained by simple correlations with Single Nucleotide Polymorphisms (SNPs) or even with a specific gene. Only recently, however, have technological advances such as increased computer processing power allowed whole genome studies. These studies quickly became a popular tool to discover genetic patterns in the manifestation of several phenotypes, including certain SNP configurations associated with a high risk of developing complex diseases. This allowed a better understanding of many complex diseases that were otherwise undetectable until the manifestation of symptoms.
From studies of allergic sensitization [BMP+13] and obesity predisposition [FTW+07] to diabetes [SGM+10] and breast cancer [RHR+01], many associations between genes and the expression of complex diseases have been successfully identified.
Considering that these advances in Genome Wide Association Studies (GWAS) are very recent, there is not yet a well-established method to test and find significant results. Therefore, many statistical and machine learning approaches have since been developed.
The study of epistatic interactions is a high dimensionality problem: there are millions of possible combinations of SNPs, and each interaction can involve a varying number of SNPs. Because of this, the correct identification of interactions becomes a problem, not only due to outliers and noise, but also because of the many possible configurations. Another issue related to this complexity is the identification of the correlation between interactions and the actual manifestation of the phenotype in question. There is also an error associated with every data mining problem, which in this case can be explained by mutations or ambiguity in SNPs.
Very recently, many different algorithms have been proposed to tackle the problems inherent to GWAS. Recent machine learning algorithms have addressed them by simplifying the problem, reducing its inherent dimensionality.
1.2 Project
The project of this dissertation consists of two empirical studies. Initially, there is a review of the literature to identify a range of different algorithms that are likely to produce good results. Artificial data sets are then created to test these algorithms. Based on the results obtained, a new empirical study is made with a newly created algorithm, aiming to find an approach that may obtain better results than the existing algorithms. This new methodology is a combination of the state-of-the-art algorithms.
Dissertation Contribution
This dissertation contains empirical studies of the state-of-the-art algorithms, which enable a broad analysis of the factors affecting the performance of these algorithms using relevant evaluation measures. Based on this information, new studies can have a better understanding of what to expect from each method. With the introduction of new algorithms, this dissertation may produce methodologies that better suit the needs of the domain problem.
1.3 Motivation and Goals
Genome wide association studies have made a big impact on SNP identification and analysis in recent years. They allowed the discovery of how genes interact with each other and how each gene affects phenotypes. By mapping epistatic interactions and gene behavior, it is possible to find risk factors associated with complex diseases. These risk factors can be identified by certain SNP configurations or genotypes.
From a machine learning standpoint, there is a lot of room for improvement. Considering that this problem has only recently started to be studied, using very different methodologies, it is still not solved as efficiently or as accurately as it could be. With better adapted and more efficient algorithms, the relevance of GWAS can increase considerably. Due to the inherent dimensional complexity, scalability is a very important requirement for the developed methodologies. Algorithms used in typical prediction problems, such as classification and regression, now need to be adapted to fit the requirements of this specific problem, which does not fall within the classical convention of prediction problems. This requires an adaptation of the data and of the output, generating results that are understandable by specialists in the genetics field.
The main goal of epistatic studies is to find SNPs responsible for the expression of phenotypes,
which in this case are related to complex diseases. This means that the loci and the alleles that are
active in complex diseases and contribute to their manifestation need to be identified. Their behavior and the interactions relevant to the disease need to be monitored and assessed. This information provides a better understanding of the disease in question and can be used in a medical scenario to mark specific genotypes that have a high probability of manifesting genetically related complex diseases, which can then be preemptively monitored and treated.
1.4 Structure of the Report
The rest of this report is divided into three main chapters: state-of-the-art, work planning and
conclusions.
Chapter 2 begins with a brief introduction to the topic, followed by the relevant biological concepts in Section 2.1. Section 2.2 consists of a description of the state-of-the-art machine learning and statistical algorithms that have been used in data selection, model creation, and other auxiliary tasks. Section 2.3 contains evaluation measures and procedures that are relevant to estimating and optimizing the data mining process. There is also a description of the many data mining tools in Section 2.4, including software tools specifically designed for epistasis detection analysis and data simulation software. Section 2.5 contains the conclusions extracted from the study of the existing algorithms, tools, evaluation measures, and procedures.

Chapter 3 is composed of an introduction to stage I of the experiments (Section 3.1), a description of the data and methodologies used in this project (Section 3.2), the experimental procedure and results (Section 3.3), and finally, the conclusions of the chapter (Section 3.4).

In Chapter 4, Section 4.1 contains the introduction to stage II of the experiments. Section 4.2 presents the experimental procedure, results, and discussion of the stage II experiments. Section 4.3 contains the final conclusions of the chapter.

Chapter 5 contains a brief summary with the final conclusions from the empirical studies, and a summary of the contributions of this dissertation in Section 5.1.
Chapter 2
State-of-the-Art
Over the last 5 years, many methodologies have surfaced to find a solution for genome wide studies. With the development of computing power, algorithms that were previously infeasible are now valid options for the identification of diseases related to SNPs. Most of these algorithms are based on well-known data mining approaches like prediction, clustering, and rule-based algorithms. Some software tools were also developed specifically for this purpose.
In this chapter we first introduce the basic biological background required to understand the main issues and specifications of the dissertation. The concepts of epistasis and Genome Wide Association Studies are introduced in that section.
Data Mining algorithms are introduced in Section 2.2. These include data selection algorithms, model creation algorithms specifically designed for epistasis studies, and generic model creation algorithms related to specific algorithm implementations. Some auxiliary algorithms used in model creation are also included in this section.
In Section 2.3 the most relevant procedures and measures for evaluating the state-of-the-art methodologies are discussed. These include relevant evaluation metrics for model testing and machine learning approaches to provide better results for the generated solutions. That section also contains a description of the most commonly used data mining methodologies.
Section 2.4 describes data mining software tools, some of which are specifically designed for this problem. It also covers the data simulation software used for the artificial generation of data sets with epistasis and main effects.
2.1 Biological concepts
Basic concepts
Most human beings have 46 chromosomes within the nucleus of each cell. These chromosomes are divided into 23 groups, with every group having a pair of similar chromosomes. Each chromosome is composed of a very large double helix structure: the deoxyribonucleic acid (DNA). DNA is subdivided into sections called genes, which are regions that code for proteins. Figure 2.1 illustrates these structures.
A gene is composed of several nucleotide bases. These bases can be either adenine, cytosine, guanine, or thymine. Each base binds to a complementary base. This means that at each position of a nucleotide base, also known as a locus, there is a pairing of adenine-thymine or cytosine-guanine. Each variation of the nucleotide bases is called an allele. Allele and locus can also refer to a variation and a position of a whole gene, rather than a single nucleotide base. The pair of alleles at the same locus, one on each chromosome of a pair, is called the genotype. Considering that there are usually two alleles for each locus, there is usually a dominant allele and a recessive allele, which means that one allele is expressed more often than the other. The expression of a physical trait or the creation of a protein is called a phenotype. For each locus, there are usually three different genotype configurations: two dominant alleles, two recessive alleles (having two equal alleles in a genotype is called homozygous), or a dominant and a recessive allele (having different alleles in a genotype is called heterozygous). A recessive allele is only expressed in a homozygous recessive genotype.
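The genotype concepts above can be made concrete with a small sketch (the encoding and function names are illustrative, not from the thesis): genotypes at a biallelic locus are commonly coded as the number of copies of the minor allele.

```python
# Illustrative sketch: 'A' is the dominant allele, 'a' the recessive one.
# The usual GWAS coding counts copies of the minor allele: 0, 1, or 2.

def encode_genotype(allele1: str, allele2: str, minor: str = "a") -> int:
    """Count copies of the minor allele in a genotype."""
    return (allele1 == minor) + (allele2 == minor)

def is_homozygous(allele1: str, allele2: str) -> bool:
    """A genotype with two equal alleles is homozygous."""
    return allele1 == allele2

def expresses_recessive(allele1: str, allele2: str, minor: str = "a") -> bool:
    """The recessive phenotype requires the homozygous recessive genotype."""
    return allele1 == minor and allele2 == minor

print(encode_genotype("A", "a"))      # 1 (heterozygous)
print(is_homozygous("a", "a"))        # True
print(expresses_recessive("A", "a"))  # False: the dominant allele masks it
```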
Figure 2.1: An illustration of the interior of a cell. [cel14]
Single Nucleotide Polymorphisms
A single nucleotide polymorphism (SNP) is a specific nucleotide base, within a gene, that changes what the gene expresses. This means that a different allele at the SNP will create a different variant of the gene that will express a different protein or physical trait. Not all nucleotide bases in a gene are relevant to this process; in this context, SNPs are the most important part of the gene.
Genetic Markers
Genetic markers are specific sets of SNPs, genes, or DNA sequences that are used to identify specific traits, individuals, or species. In our study, genetic markers are used to identify specific traits within complex diseases. In recent years, genetic markers have no longer been limited to genes that encode visible characteristics. Due to the genetic mapping of the human genome, patterns of SNPs can be related to traits without directly encoding a specific characteristic, including dominant/recessive or co-dominant markers [Avi94].
Main Effect
In this context, the main effect is related to the influence of a SNP on the expression of a phenotype,
in this case a complex disease. Any SNP has a main effect if it has a direct impact on the disease
expression. Multiple SNPs can have a main effect on the same phenotype expression, without
having a relation between them.
Epistasis
The concept of epistasis was first described by Bateson [Mud09] as the control of the manifestation of the effect of one allele at one locus by another allele at another locus [Cor02]. This definition has since changed its meaning and subdivided into different, often conflicting definitions. According to Phillips [Phi08], there are three major categories into which the term “epistasis” can be subdivided: Functional Epistasis, Compositional Epistasis, and Statistical Epistasis.
Functional epistasis refers to the functional applications of the molecular interactions between
genes. The focus is on the proteins that are created by these interactions and on their effects.
Compositional epistasis is used to describe the traditional usage of the term "epistasis". It describes the interaction between two loci, with specific alleles. This interaction affects the phenotype expression.
Statistical epistasis describes the average deviation of the effect resulting from the interaction of a set of alleles at different loci from the effect of those alleles considered independently [Fis19]. It is an additive expectation of the epistatic effect on the allelic function.
Genome Wide Association Studies
The search for genetic markers has helped to determine previously unknown aspects of complex diseases. Previous studies focused on single-locus analysis and provided underwhelming results [Cor09]. By changing this approach to include complex relations between genes in the effect on phenotypes, new information about biological and biochemical pathways has surfaced, and these studies have since become a powerful tool in understanding the diseases.
Penetrance Tables
Diseases can affect different proportions of the individuals in a population who carry the disease-related genetic markers. This proportion is the penetrance of a disease [FES+98]. With a high disease penetrance, most of the individuals with a given disease-associated genetic marker will manifest that disease. Penetrance is estimated from the disease allele frequency and the affected individuals: it is the proportion of individuals with the disease-associated SNP, among the population, that develop the disease. From these results, a table can be created showing the percentage of individuals affected by a disease, given their genotypes. Each genotypic configuration has an associated penetrance value. Table 2.1 shows an example of a penetrance table.
Genotype Configuration    Penetrance
AABB                      0.068
AABb                      0.064
AAbb                      0.040
AaBB                      0.055
AaBb                      0.047
Aabb                      0.103
aaBB                      0.039
aaBb                      0.103
aabb                      0.004

Table 2.1: An example of a penetrance table.
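Table 2.1 can be read as a lookup from two-locus genotype to P(disease | genotype). The sketch below (helper names are mine, not from the thesis) uses the table's own values and shows how penetrance combines with genotype frequencies into an overall disease prevalence, here under a hypothetical uniform genotype distribution.

```python
# The penetrance table of Table 2.1 as a simple lookup:
# each two-locus genotype maps to P(disease | genotype).
PENETRANCE = {
    "AABB": 0.068, "AABb": 0.064, "AAbb": 0.040,
    "AaBB": 0.055, "AaBb": 0.047, "Aabb": 0.103,
    "aaBB": 0.039, "aaBb": 0.103, "aabb": 0.004,
}

def disease_prevalence(genotype_freqs: dict) -> float:
    """Overall prevalence = sum over genotypes of P(genotype) * penetrance."""
    return sum(freq * PENETRANCE[g] for g, freq in genotype_freqs.items())

# Hypothetical population where every genotype is equally frequent:
uniform = {g: 1 / len(PENETRANCE) for g in PENETRANCE}
print(round(disease_prevalence(uniform), 4))  # 0.0581
```

In a real study the genotype frequencies would come from the population's allele frequencies rather than a uniform distribution.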
2.2 Statistical and Machine Learning Algorithms
The disciplines of Statistics and Machine Learning (ML) have been studying these problems for the last few years. We now survey a set of algorithms from both Statistics and ML applied to epistasis problems.
Feature Selection Algorithms
Ant Colony Optimization
Ant Colony Optimization (ACO) [Dor92] is a search wrapper that exploits the same mechanism ant colonies use to find shortest paths. This means that it uses a particular classifier to score subsets of variables based on their relation to the class variable. Essentially, it transforms the optimization problem into a problem of finding the best path on a weighted graph [Ste12]. In this context, it uses expert knowledge, encoded in the "pheromones", to select SNPs with better expert knowledge scores, calculated using fitness functions [GWM08]. ACO is thus used to filter out SNPs by randomly searching for SNPs and choosing only the ones that are the most relevant to the phenotype.
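The pheromone mechanism can be illustrated with a toy sketch (the names and update rule are simplifications of my own, not the cited ACO variant): SNP indices are sampled in proportion to their pheromone, and the trails of well-scoring subsets are reinforced after evaporation.

```python
import random

# Toy pheromone-based subset selection, illustrative only.

def pick_subset(pheromone: list, k: int, rng: random.Random) -> list:
    """Roulette-wheel sampling of k distinct SNP indices, biased by pheromone."""
    chosen = []
    weights = pheromone[:]
    for _ in range(k):
        total = sum(weights)
        r, acc = rng.random() * total, 0.0
        for i, w in enumerate(weights):
            acc += w
            if acc >= r:
                chosen.append(i)
                weights[i] = 0.0  # no repeats within a subset
                break
    return chosen

def reinforce(pheromone: list, subset: list, score: float, evap: float = 0.1):
    """Evaporate all trails, then deposit pheromone proportional to the score."""
    for i in range(len(pheromone)):
        pheromone[i] *= (1 - evap)
    for i in subset:
        pheromone[i] += score

rng = random.Random(0)
pheromone = [1.0] * 5
reinforce(pheromone, pick_subset(pheromone, 2, rng), score=0.8)
print(max(pheromone) > min(pheromone))  # True: reinforced SNPs stand out
```

Over many iterations, SNPs that repeatedly appear in high-scoring subsets accumulate pheromone and are sampled more often, which is the filtering behavior described above.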
Classification Trees
Classification trees can be used as a feature selection algorithm by creating an upside-down tree [ZB00] where each node represents a test on an attribute or a set of attributes, and each edge represents a possible outcome value of the test in the "parent" node [NKFS01]. This representation of a tree emphasizes the connections between attributes and therefore it can capture possible relations between a disease and a combination of attributes [CKS04]. However, since it can only use selections of attributes that somehow connect to each other, it skips attributes that might have a pure interaction with the disease [Cor09].
ReliefF
ReliefF [RSK03] and its modified version Tuned ReliefF (TuRF) [MW07] are filtering algorithms. The basic idea of ReliefF is to estimate the quality of an attribute by how well its genotype values separate instances in a neighborhood. If a neighbor with the same class label has a different value for the attribute, the attribute separates two instances of the same class, which lowers its quality estimate. Conversely, if a neighbor with a different class has the same attribute value, the quality is also lowered, but if that neighbor has a different attribute value, the quality increases. ReliefF can deal with incomplete and noisy data: it searches for k neighbors with the same class and k neighbors with different classes. Tuned ReliefF improves on the original algorithm by removing the worst attributes and recalculating the attribute estimates at each step.
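The weighting scheme described above can be sketched as a simplified Relief (one nearest hit and one nearest miss per instance, binary class, no missing-value handling); this is an illustrative reduction of my own, not the full ReliefF algorithm.

```python
import numpy as np

def relief(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Simplified Relief: weight attributes by nearest-hit/nearest-miss differences."""
    n, m = X.shape
    w = np.zeros(m)
    for i in range(n):
        dist = np.abs(X - X[i]).sum(axis=1)
        dist[i] = np.inf
        same = (y == y[i])
        same[i] = False
        hit = np.argmin(np.where(same, dist, np.inf))    # nearest same-class instance
        miss = np.argmin(np.where(~same, dist, np.inf))  # nearest other-class instance
        # attributes differing on the nearest hit lose weight,
        # attributes differing on the nearest miss gain weight
        w -= np.abs(X[i] - X[hit]) / n
        w += np.abs(X[i] - X[miss]) / n
    return w

# Toy SNP data (genotypes coded 0/1/2): the first column determines the class,
# the second is noise, so it should receive the higher weight.
X = np.array([[0, 1], [0, 2], [2, 1], [2, 0]], dtype=float)
y = np.array([0, 0, 1, 1])
w = relief(X, y)
print(w[0] > w[1])  # True
```

Full ReliefF extends this by averaging over k nearest hits and misses per class, which is what gives it robustness to noise.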
Evaporative Cooling
Based on the ReliefF [KR92] algorithm, evaporative cooling removes ΔN of the least informative attributes using classification accuracy [MRW+07][MCGT09]. The energy in the system is given by:

    ⟨ε⟩_N / ⟨ε⟩_{N0} = (N / N0)^η                    (2.1)

where ⟨ε⟩ is the average “energy” density of the system, N0 is the number of attributes before the evaporation step, and η is an adjustable parameter related to the evaporation rate, allowing for a slow evaporation for higher values and a fast evaporation for lower values, which can lead to a collection of suboptimal attributes. Evaporative Cooling is often used as a wrapper filter for attribute selection. This means that the variables are scored based on their predictive power.
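Equation 2.1 is easy to evaluate numerically; the helper below (names are mine, not from the cited work) shows the energy ratio for a fixed fraction of remaining attributes under two values of η.

```python
# Numeric illustration of Equation 2.1: the ratio of average "energy"
# densities after evaporating attributes follows (N / N0) ** eta.

def energy_ratio(n_remaining: int, n_initial: int, eta: float) -> float:
    """Compute <e>_N / <e>_N0 = (N / N0) ** eta from Equation 2.1."""
    return (n_remaining / n_initial) ** eta

# Keeping 80 of 100 attributes, for two settings of eta:
print(round(energy_ratio(80, 100, eta=0.5), 3))  # 0.894
print(round(energy_ratio(80, 100, eta=2.0), 3))  # 0.64
```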
Genetic Programming for Association Studies
Genetic Programming for Association Studies (GPAS) works by searching variables and mapping them to new boolean variables. Starting from randomly chosen individuals, each consisting of one randomly selected literal, a form of genetic algorithm is applied that uses a fitness function to score the current generation of the population [Nun08]. A new, customized form of GPAS that detects interactions involving a higher number of SNPs has since been developed [NBS+07].
Random Forest
The purpose of random forest in this context is to select, according to the various trees formed by the algorithm, the main attributes that are important to a disease. Random forest is a fast algorithm that can be run in parallel, with many customizable parameters, such as the number of trees, the number of instances to be used at each split, and the number of permutations used to assess variable importance [Bre01]. Random forest is an ensemble algorithm that creates several bootstrap samples from a data set, each with the same size as the original sample. For each bootstrap sample, a tree is grown, considering only a small random set of attributes at each node [LHSV04]. Instances that were not used in the training phase are then used to estimate the prediction error. By using a random forest instead of a single tree, there is an improvement in the classification accuracy [BDF+05].

Random Jungle (RJ) [SKZ10] is an improved approach to random forest that uses parallel processing, making it faster and more viable on larger data sets. Even without parallel processing, RJ is faster and uses less memory than the standard random forest, implementing backward variable elimination.
Summary Table
Table 2.2 shows the summary of all the data selections algorithms discussed.
Algorithms DescriptionACO Search based on the behavior of ants. Optimization problem be-
comes a problem of finding the best path based on positive feedback.Class. Trees Construction of trees that split nodes based on rules that represent a
good division in the outcome variable.Rand. Forest Ensemble approach to Classification Trees.ReliefF Calculates the value of each attribute based on its value in neighbor-
hood individuals that have the same outcome variable.Tuned ReliefF Modified version of ReliefF that removes the worst attributes and
recalculates the weight in remaining attributes.Evap. Cooling Based on ReliefF, removes the least informative attributes with an
adjustable parameter related to the evaporation rate.GPAS Searches and maps attributes from random individuals to boolean
variables and uses a fitness function to evaluate them.Table 2.2: A description of each data selection algorithm.
Specific Model Creation Algorithms
Backward Dropping Algorithm
Backward Dropping Algorithm (BDA) tries to find the subset of attributes that have the biggest
impact on the class variable [WLZH12]. The class is assumed to be binary and all other attributes
are assumed to be discrete. The explanatory variables are segregated into partitions of subsets,
which are then used to calculate I-score as:
I = ∑j∈Pk
n2j (Yj− Y )2 (2.2)
where P is the partition selected with k variables, n is the number of observations, Yj and Y is the
average of Y observations in the j partition and overall average respectively. In the training set,
a large group of explanatory variables are selected to be sampled into subsets. After computing
10
State-of-the-Art
the I-score, the variable which contributes less to the I-score is dropped. In each round another
variable is dropped until there is only one variable left. The subset which has the highest I-score in
the whole dropping process is returned. This subset represents the set of variables than contribute
the most to a positive state of Y , the response variable.
Bayesian Epistasis Association Mapping
Bayesian Epistasis Association Mapping (BEAM) receives genotype markers as input and deter-
mines the probability of each marker being associated with the disease, through a Markov Chain
Monte Carlo (MCMC), independently or in epistasis with another marker, and creates partitions
with those markers [ZL07]. It classifies these markers into three categories: SNPs unassociated
with the disease, SNPs associated with the disease independently and SNPs jointly associated with
the disease in epistasis [WYY12]. A B statistic was developed to show the statistic relevance of
the associations made with the disease. BEAM searches for epistasis with interactions of 3 or 2
SNPs. This is a hypothesis-testing procedure, testing each marker for significant interactions. The
B statistic is defined by:
BM = lnPA(DM,UM)
P0(DM,UM)= ln
Pjoin(DM)[Pind(UM)+Pjoin(UM)]
Pind(DM,UM)+Pjoin(DM,UM)(2.3)
where M represents each set of k markers, representing different complexities of interactions. DM
and UM are genotype data from M cases and controls and P0(DM,UM) and PA(DM,UM) are the
Bayes factors. Pind is the distribution that assumes independence among markers in M and Pjoin is
a saturated joint distribution of genotype combinations among all markers in M.
BEAM3.0 is the third iteration of the BEAM algorithm and introduces multi-SNP associations and
high-order interactions flexibility, using graphs, reducing the complexity and increasing the power.
BEAM3 produces cleaner results with improved mapping sensitivity and specificity [ZL07]. The
algorithm is written in C++.
BNMBL
BNMBL is a Bayesian Network that assumes SNPs can either be Adenine(A) and Guanine(G) or
Cytosine(C) and Thymine (T), depending on the nucleotide base, and therefore can only assume
three possible values in the genotype: AA, GG or AG, because A is the same as T, and C is
the same as G, in this context. A directed acyclic graph (DAG) model is created for each data
item to assign a probability of the relationships between SNP. Figure 2.2 shows an example of
a probabilistic model of the relationship between SNPs and the disease D. Using only 12 log2m
bits for each conditional probability, where m is the number of data items, the penalty calculated
in Equation 2.4 is applied in the scoring phase to each DAG, where k is the number of SNPs
[JNBV11].3k
2log2
m3k +
2k2
log2m (2.4)
11
State-of-the-Art
D
S1 S2 S3 S4
S6 S7 S8
Figure 2.2: Bayesian Network. Nodes represent SNPs. [JNBV11]
Boolean Operation-based Screening and Testing
Boolean Operation-based Screening and Testing (BOOST) converts the data representation into a
boolean type, using logic operators [Weg60], which allows faster operations and a smaller usage of
memory [WYY+10a]. The algorithm uses a pruning approach by removing interactions which are
statistically irrelevant. The ratio at which the pruning occurs is based on the difference between
the full logistic regression model:
logP(Y = 1|Xl1 = i,Xl2 = j)P(Y = 2|Xl1 = i,Xl2 = j)
= β0 +βXl1i +β
Xl2i +β
Xl1 Xl2i j (2.5)
and the main logistic regression model:
logP(Y = 1|Xl1 = i,Xl2 = j)P(Y = 2|Xl1 = i,Xl2 = j)
= β0 +βXl1i +β
Xl2i +β
Xl1 Xl2i j (2.6)
where Xl1 and Xl2 are genotype variables and i and j are one of the three possible states (0,1,2)
[WLFW11]. The algorithm is written in C. A GPU version of the algorithm was developed
(GBOOST) [YYWY11] providing a 40-fold speedup compared to that of BOOST running in a
CPU.
Grammatically Evolution Neural Networks
Based on neural networks, Grammatically Evolution Neural Networks (GENN) uses instructions
and a fitness function to train for classification problems related to genetics [TDR10a]. Based on
genetic algorithms, the populations within the data are heterogeneous and go through a process of
pairing, crossover and mutation to find the best Neural Network (NN) solution, which translates to
finding influential SNPs and correctly evaluating network weights. As the name suggests, linear
genomes and grammars are used to define the population. Grammar is used to increase diversity,
by separating the genotype from the phenotype [TDR10b]. GENN uses a Genetically Programmed
Neural Networks (GPNN) approach to optimize the NN selection, which is an improvement on
the NN architecture using genetic programming. This means that there are binary expression trees
that are evolved in a tree-like structure, fitting into the NN architecture.
12
State-of-the-Art
Information Interaction Method
Information Interaction Method (IIM) is an exhaustive algorithm that searches for all possible
pairs of SNPs to find relations between them and the expression of the phenotype [OSL13]. If
the synergy between the pair and the phenotype is above a user-defined threshold, then there is a
possible correlation between the pair and the phenotype. This is revealed by the Equation 2.7.
I(A;B;Y ) = I(A;B|Y )− I(A;B)
= I(A;Y |B)− I(A;Y )
= I(B;Y |A)− I(B;Y )
(2.7)
Associations between single SNPs and a given phenotype are also tested by applying a mutual
information method, explained in Equation 2.8.
I(X ;Y ) = H(X)−H(X |Y )= H(Y )−H(Y |X)
= H(X)+H(Y )−H(H,Y )
= H(X ,Y )−H(X |Y )−H(Y |X)
(2.8)
Meta-Analysis Gene-set Enrichment of variaNT Associations
Meta-Analysis Gene-set Enrichment of variaNT Associations (MAGENTA) consists of four steps:
mapping SNPs to genes, assigning a score to each gene association, applying a correction of am-
biguous gene association scores and finally a statistical test is made to find predefined biologically
relevant gene sets in the association scores, compared to randomly sampled gene sets of identical
size [SGM+10]. Instead of receiving genotype data, MAGENTA receives p-values of SNPs as an
input. The gene association score is done based on regional SNP p-values.
Multifactor Dimensionality Reduction
Multifactor Dimensionality Reduction (MDR) is one of the most popular methods for the detection
of interactions between SNPs. MDR receives two parameters: the N number of attributes with
the strongest connection to the disease to be selected and the T threshold ratio for affected to
unaffected individuals to distinguish high risk from low risk genotype combinations [HRM03].
MDR uses cross validation and in the training data set of each fold determines the high/low risk
groups. After calculating the misclassification error using the test data, the resulting prediction
error rate is the average of all the folds. In the end, the n-order combination with the minimum
average prediction error and the maximum cross-validation accuracy from all the dimensions is
selected [CLEP07]. The odds ratio for the best combinations is used to generate bootstrap data.
After calculating the odds ratio for the best combination, the confidence intervals are constructed
by using empirical distribution of the odds ratio [Moo04]. The combinations of high risk loci are
the ones that have a stronger connection in the disease outcome [RHR+01].
13
State-of-the-Art
Model-Based Multifactor Dimensionality Reduction
Model-Based MDR (MB-MDR)[MVV11] tries to overcome many of the drawbacks from the
original algorithm MDR such as missing important interactions due to sampling too many cells
together and only analysing at most one significant epistasis model. MB-MDR merges multi-locus
genotypes that have a significant high or low risk based on testing, rather than a threshold value.
Unassociated loci are placed in a ’No Evidence for risk’ class. This algorithm uses a significance
assessment, correcting type I errors, and evaluating each SNP with a Walt statistic test [MVV11].
The algorithm is written in R. MB-MDR process can be divided into the following steps:
1. Multi-locus cell prioritization - Each two-locus genotype is assigned to either High risk,
Low risk or No Evidence of risk categories.
2. Association test on lower-dimensional construct - The result of the first step creates a new
variable with a value correlated to one of the categories. This new variable is then compared
with the original label to find the weight of high and low risk genotype cells.
3. Significance assessment - This stage tries to correct the inflation of type I errors after the
combination of cells into the weight of High risk and Low risk.
MB-MDR can also be adjusted to consider main effects within interactions.
Screen and Clean
Screen and clean (S&C) [WDR+10] is a recent algorithm divided into two main parts: screening
part and cleaning part. The algorithm creates a dictionary with all SNPs and splits the data into
stage 1 data for screening and stage 2 data for cleaning. In stage 1, the data is screened using the
logistic regression model in Equation 2.9 to find SNPs.
g(E [Y |X ]) = β0 +N
∑j=1
β jX j (2.9)
where X j is the encoded genotype value 0, 1 or 2 and Y is the encoded phenotype, 0 or 1, N is
the number of measured SNPs, g is an appropriate link function, and S ={
j : β j 6= 0, j ∈ 1, ...,N}
are the set of terms associated with the phenotype as main effect [WLFW11]. According to the
selected SNPs, Screen and Clean tries to find relevant interacting SNPs that fit into the following
interaction model:
g(E [Y |X ]) = β0 +N
∑j=1
β jX j + ∑i< j;i, j=1,...,N
βi jXiX j (2.10)
where S = {(i, j) : β i j 6= 0,(i, j) ∈ 1, ...,N} are the set of terms associated with the phenotype as
epistasis. In stage 2, clean, controls false positives by using the stage 2 data and removing SNPs
with p-values higher than a predetermined threshold (α). The algorithm is written in R.
14
State-of-the-Art
SNPHarvester
SNPHarvester is a stochastic search algorithm. It divides the SNPs into three different categories:
unrelated to the disease, related independently and contributes jointly to the disease with no effect
independently. SNPHarvester is based on a multiple path generation with a generic score function
[YHW+09]. The first point in each path is generated randomly. Using a created local search algo-
rithm, SNPHarvester finds the local optimum, usually in two or three iterations, and the significant
groups of SNPs in each path, according to a scoring function. This function is a popular χ2 value
score function [RHR+01]. After the scoring, randomly picks k SNPs to form an active set, leaving
the rest as a candidate set. Each SNP in the active set is then substituted with one from the can-
didate sets in order to maximize χ2. After finding the maximized candidate, removes the selected
SNP group and repeats the procedure to identify M groups which is a predetermined parameter.
The selected M groups are then fitted into the L2 penalized logistic regression model
L(β0,β ,λ ) =−l(β0,β )+λ
2‖β‖ (2.11)
where l(β0,β ) is the binomial log-likelihood and λ is a regularization parameter [WLFW11].
SNPHarvester is written in Java.
SNPRuler
SNPRuler is a rule-based algorithm. Epistatic interactions promote a set of rules. These rules are
implications of an interaction between SNPs and the disease. To find the rules, SNPRuler uses
trees that represent genotypes in each node, with the leaves representing the phenotypes, creating
a path of epistatic interaction. For each rule, a 3x3 table is generated based on the probability of
each possible genotype combination and phenotype [WYY+10b]. In a big number of SNPs, there
is an upper bound limit to the tree, pruning it instead of an exhaustive search. This threshold is a χ2
test statistic [RHR+01]. However, this pruning can lead to a wrongful prune of many true-positive
epistatic interactions. This algorithm was developed in Java.
TEAM
Tree-based Epistasis Association Mapping (TEAM) is essentially an exhaustive algorithm. TEAM
uses a permutation test to create a contingency table with all the calculated p-values [ZHZW10].
To reduce computation costs, if there are two SNPs with very frequent genotype values, then it
is shared for each individual with the same genotype. TEAM only works with two-loci interac-
tions. It uses a tree-based representation, where nodes contain SNPs 1,2,3 and the edges represent
the number of individuals with different values on the two SNPs [WLFW11], further reducing
computation costs when the values are the same. This algorithm is written in C++.
Summary Table
Table 2.3 shows a summary of all model creation algorithms previously discussed.
15
State-of-the-Art
Algorithms DescriptionBDA Uses an iterative selection process where the most significant SNPs to the
disease are selected.BEAM Determines the probability of a given SNP to be associated with a disease
independently, or in epistasis with N SNPs.BNMBL Creates a DAG with the probabilistic model of the relationship between
SNPs and the disease.BOOST Converts data into a boolean type, pruning statistically irrelevant SNPs.GENN Creates NNs which are evolved, based on a genetic algorithm approach,
find the NN with the best accuracy.IIM Searches all possible pairs of SNPs, finding a relation between each pair
and the phenotype above a specified threshold.MAGENTA Calculates the statistic relevance of SNPs based on regional SNP p-values
instead of genotype data.MDR Applies cross-validation training the data to find high risk groups of SNPs.S&C Creates a dictionary of all the SNPs and divides the data into two stages:
screening, to select SNPs according to a logistic regression model, andcleaning, to decrease false positives.
SNPHarvester Stochastic search algorithm classifying SNPs according to their relationwith the disease, using a random local search algorithm.
SNPRuler Rule-based algorithm, defining rules based on epistatic interactions.TEAM Exhaustive algorithm, creating a table with p-values of each pair of SNPs,
and uses a tree-based representation to place the results.Table 2.3: A description of each model creation algorithm designed for this problem.
Generic Model Creation Algorithms
Ensembles
There are many types of ensemble algorithms created by joining several kinds of model creation
algorithms to try and make a more accurate and reliable model. In this context, one of the most
recent ensemble methods [YHZZ10] was created using a genetic algorithm together with a few
classifier algorithms. Several subsets of SNPs are selected by applying the genetic algorithm a
predetermined number of times. These subsets are analyzed and ranked based on the number
of times each SNP combination appears in the selected subsets. After acquiring the fitness for
every SNP subset, the chromosome with the highest fitness is selected, represented by the SNP
subset contained in the chromosome. The genetic algorithm then applies selection, crossover and
mutation to the chosen subset. Considering the large amount of SNPs, in order to reduce the noise
and optimize the process, two classifier strategies and a diversity promoting strategy are used to
preselect and evaluate the SNPs. Blocking uses M classification algorithms to eliminate differences
caused by noise. Voting is used to balance and increase accuracy in evaluating the fitness of SNPs.
Double fault diversity tries to evaluate the diversity between classifiers by calculating the fitness
of misclassified SNPs, focusing on the diversity between them. This particular approach uses
decision-tree-based classifiers and instance based classifiers.
16
State-of-the-Art
ILP
Inductive Logic Programming (ILP) algorithms work by creating hypotheses that are encoded as
First Order clauses. ILP is characterized as an expressive representation language (First order
Logic - FOL) to represent both data and hypotheses. ILPs are very used in the bioinformatics field
and produce good results but have a slow runtime, therefore affecting the scalability.
Initially, ILP algorithms create background knowledge of the problem by logic propositions.
The training is then made with positive and negative examples. Hypotheses are then generated by
creating new logic propositions using the background knowledge and trained examples.
Success is measured by the classification accuracy of a given hypothesis and the transparency
of a formulated hypothesis, which means the ability to be understood by humans [LD94].
There are many implementations of this type of algorithm. One of the most used systems is A
Learning Engine for Proposing Hypotheses (Aleph) [Sri01]. This algorithm works in 4 different
steps:
1. Select example. Selects an example to be generalized and stops if none exist.
2. Build most-specific clause. Based on the example selected, the most specific clause that
respects the language restrictions is constructed.
3. Search. After creating the most specific clause, a more generalized clause is searched for
in a subset of the clauses in the previous clause.
4. Remove redundant. The clause with the best score is then added to the theory, removing
the redundant examples.
K-Nearest Neighbor
K-Nearest Neighbor (K-NN) is a classification and regression algorithm that determines the value
of new data based on its approximation to other instances [HK06]. In classification, the closest
neighbors to the new instance determine the class of that instance. In regression, the result is
the average of the nearest neighbors. K is the number of the nearest neighbors to be used in the
calculation of new results. For this context, K-NN is mostly used as an attribute selection method.
Methods such as ReliefF use an approach based on the K-NN algorithm.
Naïve Bayes
Bayesian approaches are amongst the most common in this context. The naïve approach assumes
independence between features. There are many optimizations to reduce the naivety [PV08], such
as selecting subsets of attributes that are considered to be conditionally independent [LIT92], or
extending the structure of Naïve Bayes (NB) to represent dependencies in attributes [WBW05].
NB works using Bayesian networks by assigning probabilities to each event, using the model
trained previously. The final result is then chosen based on the most probable outcome. Specific
implementations of this nature can be seen in BEAM and BNMBL.
17
State-of-the-Art
Neural Networks
NNs are a type of classification and regression algorithm based on the neurological system of the
central nervous system. An example of these NNs is a Multilayer Perceptron, whic is the most
used type of NNs. In a Multilayer Perceptron, there is an input layer, which proceeds to a second
layer of nodes that represent neurons. These intermediate layers are also called hidden layers. The
last hidden layer, or output layer, represents the prediction of the network. Each layer is densely
connected. Each connection is weighted based on the relations between nodes in the training phase
[HK06]. An example of this is illustrated in Figure 2.3.
1
2
3
4
5
6
w_1
w_2
w_3
w_14
w_15
w_24
w_25
w_34
w_35
w_46
w_56
Figure 2.3: An example of a Neural Network. [HK06]
NNs can have multiple layers, which can be used in nonlinear problems [DHS01]. There
are some recent implementations of NNs in the discovery of epistatic relations, such as GENN
[HDF+13].
Support Vector Machines
Support Vector Machines (SVM) is a classification algorithm that divides data based on pattern
recognition methods. In the training phase, data is divided into two parts, mapping them accord-
ing to their attributes. SVM then tries to find the best nonlinear mapping to separate data by a
hyperplane [DHS01]. This hyperplane is mapped in order to find the best separation possible
by increasing the distance in the gap between the classes. If a linear classification is not possi-
ble, SVM can use the kernel trick to divide the data by increasing the dimension of the problem
[ABR64]. A regression or multiple class alternative of SVM is also available by transforming
the problem into multiple binary class problems [DKN05].There are no specific implementation
of SVM in this context, however there are many methods that use pattern recognition in their
implementation.
18
State-of-the-Art
Summary Table
Table 2.4 contains a summary of all the generic algorithms discussed. A more technical table is
available in Figure 2.11.
Algorithms DescriptionEnsemble Many model creation algorithms are joined to "vote" on the most probable
outcome to increase the accuracy and creating a more reliable meta model.ILP Uses logic programming, representing positive and negative examples, back-
ground knowledge and hypotheses that use trained examples and backgroundknowledge to classify accurately and transparently.
K-NN Uses trained data to classify new instances based on the proximity to a givenneighbor previously classified. The outcome is obtained from the closestneighborhood of the new instance.
Naive Bayes Creates bayesian networks, calculating the probability of a relation betweenevents, assuming independence between attributes.
NN Based on the central nervous system, creates a graph using the trained dataand, receiving an input, calculates the most weighed path to an outcome nodebased on its connection to other nodes.
SVM Searches and maps attributes from random individuals to boolean variablesand uses a fitness function to evaluate them.
Table 2.4: A description of each generic model creation algorithm.
Statistical Methods
Bonferroni Correction
Bonferroni Correction is a conservative approach to multiple comparison testing [BH95]. It is the
simplest correction for selecting a predetermined number of hypotheses based on their statistical
relevance. However there is no assumption of dependency. For SNPs, the p value is calculated
using
pcorrected = 1− (1− puncorrected)n (2.12)
where n is the number of hypothesis tested. this can be further simplified to
pcorrected = npuncorrected (2.13)
when npuncorrected � 1 [Cor09].
Linear Regression
Linear regression models try to fit a straight line on the data points. As in every regression problem,
the label is numeric. This label is modeled as a linear function, as shown in Equation 2.14 of
another random variable. The weights attributed to each variable are calculated in the training
19
State-of-the-Art
0 0.2 0.4 0.6 0.8 1
−4
−2
0
2
4
(a) logit transformation
−10 −5 0 5 100
0.2
0.4
0.6
0.8
1
(b) logistic regression function
Figure 2.4: A logit transformation and a possible logistic regression function resultant of the logittransformation.[WFH11]
data [WFH11].
x = w0 +w1a1 +w2a2 + ...+wkak (2.14)
where x is the class outcome, ai is each attribute, and wi is the weight of the attribute. In case of
a multiple linear regression, more than one variable is involved in predicting the label [SGLT12].
Linear regression models are usually fitted using the least squares approach to fit data into linear
equations and minimizing the squared errors between the observed values and the fitted values
[HK06]. In this context, linear regression models are often used as fitness functions to test the
score of SNPs and their statistical relevance.
Logistic Regression
Linear functions can be used in classification problems by assigning 1 to instances in the training
that belong to the class and 0 for the instances that do not belong the class [WFH11]. A linear
function is still applied to new instances and the class closest to the resulting value is selected.
Since these values are not constrained in the interval from 0 to 1, a logit transformation is applied
by transforming the variable into a value ranging from 0 to 1. Figure 2.15 illustrates a relation of
the logit transformation and the final logistic regression function. To evaluate the logistic regres-
sion model, log likelihood is used instead of calculating the squared error. The formula used to
evaluate the model is
n
∑i=1
(1− x(i))log(1−Pr[1|a(i)1 ,a(i)2 , ...,a(i)k ])+ x(i)log(Pr[1|a(i)1 ,a(i)2 , ...,a(i)k ]) (2.15)
where x(i) is either 0 or 1 and ai represents each attribute.
Logistic regression, in this case, is used as a penalizing model [Ste12] [NCS05], and as a
statistical model when the outcome is binary [PH08][TJZ06][Cor09].
20
State-of-the-Art
A variation of this model called multinomial regression are used when there is more than two
possible outcomes.
Markov Chain Monte Carlo
MCMC algorithm is used to sample models within a high dimension surface. MCMC finds models
by using a random walk, trying to converge to a target equilibrium distribution [Smi84], creating
a sample of the population to be analyzed. This algorithm is often used in bayesian statistics
[SWS10].
Summary Table
Table 2.5 shows a brief description of the auxilliary algorithms.
Algorithms DescriptionBonferroni Correction Selects a predetermined amount of hypotheses based on their statistic
relevance. Used in model creation algorithms.Linear Regression Creates a straight line connecting SNPs to find their relevancy. Used
for numerical values.Logistic Regression Applies a linear function to assert how new data is evaluated, based
on trained data. Used for binary values.MCMC Uses a random walk to find statistical relevancy of SNPs. Used in
bayesian models.Table 2.5: A description of auxiliary algorithms used in model creation or data selection.
2.3 Data analysis evaluation procedures and measures
Data analysis evaluation measures
Type I error and Type II error
Type I errors refer to the acceptance of a false relation. In this case, it refers to the acceptance of
a relation between an SNP or interaction of SNPs and the disease that does not exist in fact. This
can also be referred to as a false positive. Type II errors refer to the rejection of true relations. This
can also be referred to as a false negative.
Accuracy
The accuracy for classifier algorithms can be determined by how accurately a given classifier will
correctly predict future data. This may sometimes be misleading when overfitting occurs. To
prevent this, data evaluation procedures are employed and the final accuracy is the average of the
accuracies obtained from each iteration [Pow11]. For ensemble methods, a voting process takes
21
State-of-the-Art
place and the final result is the most voted outcome.
Accuracy =true positives+ true negatives
true positives+ false positives+ true negatives+ false negatives(2.16)
Precision
Precision is the measure of the relation between the relevant results and the returned results by the
model [Pow11]. In this context, this means the number of SNPs or genotypes correctly identified
as related to a disease by the model, in relation to all the SNPs or genotypes identified as related
to a disease by the model.
Precision =true positives
true positives+ false positives(2.17)
Recall
Recall measures the fraction of the relevant results in relation to the retrieved relevant results
[Pow11]. In this context this means the SNPs or genotypes correctly identified as related to a
disease by the model, in relation to all the SNPs or genotypes that are actually related to a disease.
Recall =true positives
true positives+ false negatives(2.18)
F-measure
The f-measure is the relation between precision and recall. In the F1 measure, this creates some
problems due to the similar weight to precision and recall which may not have the same relevancy.
This is true for the epistasis problem, where type 1 errors should be prioritized [Pow11].
F1 = 2 · precision · recallprecision+ recall
(2.19)
ROC curve
The receiver operating characteristic (ROC) is a graphical representation of the relation between
true positives and false positives of a binary classifier. Multiple classifiers can be plotted to com-
pare results [Pow11]. The area under the curve (AUC) corresponds to a higher probability of
selecting a true positive than a false positive [WFH11]. Figure 2.5 shows an example of the ROC
curve. The greater the AUC the better. A representation of the ROC Curve is given by the relation
between sensitivity and specificity which can be seen in Equations 2.20 and 2.21.
Sensitivity =Number of true positives
number of true positives+number of false negatives(2.20)
Speci f icity =Number of true negatives
number of true negatives+number of false positives(2.21)
22
State-of-the-Art
0 0.2 0.4 0.6 0.8 1
0
0.2
0.4
0.6
0.8
1
False Positive Rate (Specificity)
True
Posi
tive
Rat
e(1−
Sens
itivi
ty)
Figure 2.5: An example of a ROC curve.
Summary Table
Table 2.6 shows the summary of the various evaluation measures.
Data analysis evaluation procedures
Bootstrapping
Given a dataset of n instances, a bootstrap sample is a selection of a new dataset with size n by
sampling from the original dataset with replacement, therefore creating a different dataset from the
original one. The probability for any given instance not to be chosen is (1−1/n)n ≈ e−1 ≈ 0.368.
Due to the high chance that some instances from the original dataset are not included in the new
dataset, these instances will be used for testing [Koh95]. Considering the high percentage of
probable testing instances, the bootstrap procedure is then repeated several times with different
samples and the final results are averaged [WFH11].
Cross-Validation
The dataset is split randomly into k mutually exclusive subsets or folds. Each fold has approx-
imately the same size and is tested once and trained k− 1 times. This means that there are k
iterations of model creation and evaluation. In each iteration, a new subset is selected to become
the test set, while the other k−1 subsets are used for training. The accuracy is estimated by
acc_cv = (1/n) · Σ_{〈v_i, y_i〉 ∈ D} δ(L(D \ D_(i), v_i), y_i)    (2.22)

where D_(i) is the test set (fold) that includes instance x_i = 〈v_i, y_i〉, L is the learning algorithm, and n is the number of instances in D [Koh95].
The error rates of the different iterations are averaged to yield the overall error rate [WFH11]. The
dataset can be stratified to place in each fold the same proportions of labels as the original dataset.
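The k-fold procedure and Equation 2.22 can be illustrated with a deliberately trivial learner (a majority-class classifier, chosen only to keep the sketch self-contained):

```python
import random

def majority_label(train):
    # the constant classifier that minimizes 0/1 loss on the training folds
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)

def cv_accuracy(data, k, rng):
    idx = list(range(len(data)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal-sized folds
    correct = 0
    for fold in folds:                     # each fold is the test set once
        held_out = set(fold)
        train = [data[i] for i in idx if i not in held_out]
        label = majority_label(train)      # "model" trained on the k-1 folds
        correct += sum(1 for i in fold if data[i][1] == label)
    return correct / len(data)             # Equation 2.22: mean 0/1 score

data = [(x, 0) for x in range(70)] + [(x, 1) for x in range(30)]
print(cv_accuracy(data, k=10, rng=random.Random(1)))  # 0.7
```

With a 70/30 class split, removing any single fold cannot flip the training majority, so the majority learner is right on exactly the 70 majority-class instances and the estimate is 0.7 regardless of the shuffle.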
Measure | Description
Type I error | false positives
Type II error | false negatives
Accuracy | (true positives + true negatives) / (true positives + false positives + true negatives + false negatives)
Precision | true positives / (true positives + false positives)
Recall | true positives / (true positives + false negatives)
F-Measure | (2 · precision · recall) / (precision + recall)
ROC curve | Sensitivity = true positives / (true positives + false negatives); Specificity = true negatives / (true negatives + false positives)
Table 2.6: A description of data analysis measures and how they are calculated.
Leave-one-out
Leave-one-out is an n-fold cross validation where n is the number of instances in the dataset. In
each iteration, a new instance is left out as a test, while all the others are used in training. For
classification algorithms, this means that the success rate in each fold is either 0% or 100%. This
approach allows for the maximum use of data for training, which presumably increases accuracy,
and is a deterministic process because there is no random sampling for each fold. However, this process is computationally expensive and cannot be stratified [WFH11].
Hold-out method
The hold-out method consists of strictly reserving a portion of the data for testing. This means that only part of the dataset is used for training, usually 70%, leaving 30% for testing
purposes [Koh95]. This method works well for large datasets but can underestimate the accuracy on small ones.
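A sketch of the split described above; the 70/30 ratio and a single shuffle are the only ingredients:

```python
import random

# Hold-out: shuffle once, then strictly reserve the tail of the data for
# testing. train_fraction = 0.7 reproduces the usual 70/30 split.
def holdout_split(data, train_fraction, rng):
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train, test = holdout_split(list(range(100)), 0.7, random.Random(42))
print(len(train), len(test))  # 70 30
```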
Summary Table
Table 2.7 shows the various data testing methods.
Procedure | Description
Bootstrapping | Creates a new dataset with the same number of instances as the original by sampling with replacement, which allows repetition of the same instance. This can be repeated multiple times.
Cross-Validation | Divides the original data into small subsets of equal size, leaving one subset out for testing. The test subset changes with each iteration, cycling through all subsets.
Leave-one-out | Similar to Cross-Validation, but with subsets of size 1, leaving only 1 instance out for testing and iterating through all instances.
Hold-out method | Reserves a specific amount of data for testing, using the rest for training.
Table 2.7: A description of data analysis procedures.
Data Analysis Methodologies
KDD Process
The Knowledge Discovery in Databases (KDD) Process is the extraction of knowledge from data using DM methods, guided by the specification of measures and thresholds [AS08].
The KDD process is interactive and iterative, divided into many components which can be
summarized into 5 steps illustrated in Figure 2.6.
1. The selection step consists of learning the application domain and creating a target data set or selecting data samples for knowledge discovery.

2. The pre-processing stage consists of data cleaning and handling missing data fields.
3. The transformation step allows for data reduction methods to reduce the dimensionality
and adapt the data for the model creation algorithms.
4. The data mining stage is where the algorithm for model creation is selected and applied.
5. The final stage, interpretation/evaluation, is where the discovered patterns are interpreted
and the performance is measured [FPSS96].
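The five steps above can be traced on a toy tabular dataset; everything below (the records, the missing-value rule, the threshold model) is an invented placeholder used only to make each stage concrete:

```python
# A toy walk through the five KDD stages. All data and rules are invented.
raw = [
    {"snp1": 2, "snp2": 1, "label": "case"},
    {"snp1": 1, "snp2": None, "label": "control"},
    {"snp1": 0, "snp2": 0, "label": "control"},
    {"snp1": 2, "snp2": 2, "label": "unknown"},
]

# 1. Selection: keep only the samples relevant to the study.
data = [r for r in raw if r["label"] in ("case", "control")]

# 2. Pre-processing: handle missing data fields.
for r in data:
    if r["snp2"] is None:
        r["snp2"] = 0

# 3. Transformation: adapt the data for the model creation algorithm.
for r in data:
    r["y"] = 1 if r["label"] == "case" else 0

# 4. Data mining: apply a (trivial) model creation algorithm.
def model(r):
    return 1 if r["snp1"] >= 2 else 0

# 5. Interpretation/evaluation: measure the performance of the mined pattern.
accuracy = sum(model(r) == r["y"] for r in data) / len(data)
print(accuracy)
```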
Figure 2.6: The main stages of the KDD process. [FPSS96]
CRISP-DM
The Cross Industry Standard Process for Data Mining (CRISP-DM) methodology is a model of the life cycle of a data mining project. This life cycle is illustrated in Figure 2.7. There are 6 main phases in a project [CCK00].
1. The business understanding phase is the initial phase, where the main focus is the un-
derstanding of the objectives and requirements of the project from a business perspective,
defining an initial plan to solve the data mining problem.
2. Data understanding is the collection and comprehension of the data itself. This is where
the first data characteristics become apparent.
3. The data preparation phase is where preprocessing filters and feature selection methods are applied.
4. During the modeling phase, one or more modeling techniques are applied to create specific models. Each technique has its own data preparation requirements.
5. At the evaluation stage, the models developed in the earlier phase are then put to the test
using different types of evaluation methods. The results are then analyzed.
6. Finally, the deployment stage is where the knowledge created with the data mining process
is put in practice, either by generating a report or by implementing a repeatable data mining
process for the customer.
SEMMA
SEMMA stands for Sample, Explore, Modify, Model and Assess. Like CRISP-DM, SEMMA follows a data mining life cycle, with five stages corresponding to the letters of the acronym.
1. The sample stage is where sampling of the data and data selection takes place. This stage is
optional.
2. The explore stage consists of searching the data for anomalies to gain understanding of the
data.
3. Modify stage consists of transforming the data and shaping it to serve the model selection
process needs.
Figure 2.7: The CRISP-DM life cycle. [CCK00]
4. The model stage is where the model creation takes place.
5. The assess stage exists to evaluate the usefulness and the reliability of the created model.
Although the process is independent of the data mining (DM) tool, there are some guidelines tied to the Statistical Analysis System (SAS) Enterprise Miner software [AS08].
Summary Table
Table 2.8 contains the discussed procedures and their various stages.
KDD | SEMMA | CRISP-DM
Pre KDD | — | Business understanding
Selection | Sample | Data understanding
Pre-processing | Explore | Data understanding
Transformation | Modify | Data preparation
Data mining | Model | Modeling
Interpretation/Evaluation | Assessment | Evaluation
Post KDD | — | Deployment
Table 2.8: A comparison between the different procedures [AS08].
2.4 Data Simulation and Analysis Software
Data Simulation Software
HAPGEN
HAPGEN is used to simulate case-control datasets of SNP markers. These datasets can encode the main effect and interactions of multiple disease SNPs, and can be further customized to
allow for a change in the number of individuals and SNPs. To create the simulation of phenotypes
with interaction between disease SNPs, an R package is supplied, using the data with independent
disease SNPs generated from HAPGEN [SSDM09].
GenomeSIMLA
genomeSIMLA can be divided into two main programs: SIMLA (SIMulation of Linkage and
Association) and simPEN. SIMLA generates large scale populations used in the samples for case-
control datasets while simPEN creates penetrance tables for the disease specifications. A forward-
time population simulation is used to specify many gene related parameters, allowing a controlled
evolutionary process [EBT+08]. It can be used for family studies and for studies of unrelated individuals. Penetrance models can be generated with specific allelic frequencies, for purely epistatic interactions or interactions associated with main effects. Each interaction may contain various SNPs. The number
of chromosomes, SNPs, and individuals in each dataset is configurable. The prevalence and odds
ratio of the disease can be adjusted to allow a more realistic manifestation in the disease model.
The software contains 4 main steps:
1. Pool Generation contains the evolution of the population, together with their chromosomes and allelic frequencies.
2. When a generation contains the desired characteristics, a Locus Selection is made to allo-
cate the disease model.
3. The Penetrance Specification is used to measure the risk associated with each configuration of the disease alleles.
4. Finally the Data Simulation creates the datasets, according to the specified configurations.
Gene-Environment iNteraction Simulator 2
Gene-Environment iNteraction Simulator 2 (GENS2) simulates interactions between two genetic SNPs and one environmental factor [MSA+12]. Initially, the population used to generate the datasets is produced by a simuPOP script [PA10]. This population evolves through a time simulation, based on the desired number of individuals and allele frequencies. The second step involves the simulation of the disease penetrance model. Finally, according to the risk assessment, a disease status is assigned randomly. Customization options include the number of individuals, allelic frequencies, prevalence, and associated risk. A GUI is provided to allow swift customization and configuration of the data.
Summary Table
Table 2.9 summarizes the data simulation tools discussed previously.
Feature | HAPGEN | GenomeSIMLA | GENS2
Dataset types | case-control | case-control, pedigree and family | case-control
Interaction types | main effect and epistasis | main effect and/or epistasis | epistasis and gene-environment
Order of interactions | X SNPs | X SNPs | 2 SNPs and 1 environment factor
GUI | No | Generates HTML to illustrate the population | Yes
Population | not generated | forward-time population simulation | time simulation
Customizable number of individuals | Yes | Yes | Yes
Customizable number of SNPs | Yes | Yes | No
Customizable allelic frequencies | Yes | Yes | Yes
Table 2.9: A comparison of the most relevant features of data simulation tools.
Data Analysis Software
Analysis Tool for Heritable and Environmental Network Associations
ATHENA (Analysis Tool for Heritable and Environmental Network Associations) is a software
package designed to create models by analyzing various types of data. The organization of the
software package can be seen in Figure 2.8.
ATHENA receives various types of input and applies filtering methods for feature selection, followed by analytical methods for model creation. The final model is the best of the generated analytical models.
The main usage of ATHENA is in feature selection and model creation. As a filtering method, it uses the Random Jungle algorithm, a bootstrap, tree-based variable selection variant of Random Forests. For modeling, ATHENA uses computational evolution modeling techniques such as GENN [HDF+13], as well as other common regression algorithms.
Figure 2.8: A diagram of the ATHENA software package [HDF+13]. The input can have many different formats, involving different kinds of data (SNPs, microarray, proteomics, sequence data, biomarkers, clinical data). In the filtering step (e.g. Random Jungle), variables are prioritized based on their known biological functions. The analytical methods (symbolic regression, Bayesian networks, SVM, GENN) currently consist of computational evolution modeling techniques, but will be further developed to allow more methods. This analysis allows the combination of different types of data in order to identify multi-variable prediction models that include data from different parts of the whole biological process.
Knowledge Extraction Evolutionary Learning
Like the name suggests, Knowledge Extraction Evolutionary Learning (KEEL) is a software tool,
containing many evolutionary algorithms, and is used in many typical data mining problems.
KEEL contains the most well-known models in evolutionary learning. These can be used for research purposes, using the built-in automation of experiments, or as an educational tool, with emphasis on execution time and a real-time view of the algorithms during the data mining process [AFSG+09].
The currently available function blocks are:
1. Data Management is used for importing or exporting data into other formats, data edition
and visualization.
2. Design of Experiments is where the experimentation takes place, applying the selected
model, type of validation and type of learning on the selected data sets. This module is
available off-line.
3. Educational Experiments works in a similar way to the previous function block, but can be closely monitored, displaying the learning process of the selected model algorithm. This module is available on-line [AFFL+11].
Konstanz Information Miner
The Konstanz Information Miner (KNIME) is a Java graphical workflow editor. The architecture was developed around three main aspects: a visual interactive framework; modularity of the process, so that the development of different algorithms can be distributed; and easy expandability, to add new processing nodes or views [BCD+08].
Version 2.0 adds support for loops in the workflow, new database ports, and the Predictive Model Markup Language (PMML), used for storing and exchanging predictive models in XML format [BCD09].
Orange
Developed in Python, Orange is a machine learning and data mining toolbox containing many hierarchically-organized data mining components. The main hierarchical blocks are: data management and preprocessing for feature selection and data input, classification, regression, association (rules), ensembles (such as bagging and boosting), clustering, evaluation, and projections.
Classification algorithms include Bayesian approaches, SVM, rule induction approaches, classification trees, and random forests. Regression methods include linear and lasso regression, partial least squares regression, multivariate regression, and regression trees or forests. Evaluation contains
the various procedures for testing and scoring the quality of prediction methods or estimation of
reliability. The projections block is where the visual analysis takes place, with multi-dimensional
scaling and self-organizing maps.
Orange can be scripted from the Python shell, which means that new methods can be created and existing machine learning components can be combined [DCE13].
PLINK
PLINK is a C/C++ tool set designed to handle GWAS datasets. Due to their high complexity and size, simple methods that can achieve good results with more data are preferred. PLINK measures allele, genotype, and haplotype frequencies.
PLINK offers tools for clustering a population into homogeneous subsets, for the classical multidimensional scaling (MDS) algorithm, and for outlier detection. MDS helps to find similarities by plotting objects in many dimensions while trying to preserve the distances between them [PNTB+07].
A graphical user interface (GUI), gPLINK, offers a framework to manage projects. gPLINK
also provides integration with Haploview, which is a tool used in tabulating, filtering, sorting,
merging and visualizing PLINK GWAS output files [BFMD05].
R
The R project is a statistical computing system. R has a command-line-driven interpreter for
the S language, with many extension packages available [Rip01]. The advantage of R is the
flexibility to create new algorithms instead of using implemented approaches, where the source
code is not available. R can also produce high quality graphics and mathematical symbols. Some
user interfaces are available as packages or by using Integrated Development Environments (IDEs)
and adding R as a plugin [VL12]. R also contains many algorithms encoded in packages, such as
NN, SVM and MBMDR.
RapidMiner
RapidMiner is a DM software tool which contains many algorithms for all DM problems and
business analysis. It contains a GUI for creation and editing of data mining processes, following
the CRISP-DM methodology [Jun09].
A modular and pipelined view of the process consists of four stages: an input stage, where many formats of data can be imported; a preprocessing stage, where filtering and data processing begin; a learning stage, using the selected algorithm; and an evaluation stage, which contains the performance results of the process. RapidMiner can be extended with plug-ins, which developers can use to create new algorithms.
Weka
Weka is a machine learning workbench and application programming interface (API). Weka has
four interfaces: command line, Knowledge Flow, Explorer and Experimenter [FHT+04].
Explorer is the main interface in Weka. It contains different tabs with different types of meth-
ods. The tab Preprocess contains filtering methods. Classify contains classifier and regression
algorithms. Cluster and Associate contain clustering algorithms and rule association methods, respectively. The Select attributes tab contains methods for identifying subsets of attributes that are predictive of other attributes. The final panel, Visualize, allows plotting pairs of attributes with many customizable options. The user interface of the Explorer can be seen in Figure 2.9.
In the context of bioinformatics, Weka provides a wide variety of algorithms for classification,
regression, clustering and feature selection.
A recent update added many new methods and reduced the execution time by using just-in-
time compilers [HFH+09].
Summary Table
Table 2.10 contains a summary of all the discussed data analysis software.
Figure 2.9: The Weka Explorer interface.
Tool | GUI | Allows scripting | No. of integrated algorithms
ATHENA | No | Yes | few algorithms
KEEL | Yes | Yes | few algorithms (evolutionary learning algorithms)
KNIME | Yes | Yes | many algorithms and I/O converters
Orange | Yes | Yes | many algorithms
PLINK | Yes (gPLINK) | Yes | PLINK
R | No | Yes | many algorithms (packages)
RapidMiner | Yes | No | many algorithms and I/O converters
Weka | Yes (3) | Yes | many algorithms
Table 2.10: A comparison of data mining tools.
2.5 Chapter Conclusions
This chapter can be divided into 4 main categories: biology background, statistical and machine
learning algorithms, evaluation measures and procedures, and Data Simulation and Analysis
tools.
The study of the biology concepts and background knowledge is vital to understanding the problem. It improves data understanding, which translates into a better approach to the problem. Knowing how DNA is organized into chromosomes, and divided into genes and SNPs, is very important for understanding how epistasis works.
The statistical and machine learning algorithms are divided into feature selection algorithms
and model creation algorithms. The feature selection algorithms may produce different results
depending on the generated model. This means that model creation algorithms need to be adapted
with specific feature selection approaches. This is true for most of the model creation algorithms,
where these feature selection approaches are already embedded. Considering the large number of model creation algorithms, a pre-selection is necessary. Based on previous results [WLFW11], algorithms like PLINK, MDR and BEAM are set aside in favour of BOOST, S & C, SNPHarvester, SNPRuler and TEAM. In the last year, an interesting study [OSL13] revealed that IIM was better than BEAM and SNPHarvester, making it an interesting approach to test. However, this algorithm does not yield a χ2 score for the significant SNPs, and BEAM has since been improved and is now in its third iteration [ZL07]. Furthermore, considering that MDR is one of the first and most popular approaches to GWAS, its new iteration, MBMDR, is also a good algorithm to test.
To optimize the results, machine learning procedures should be used. Cross-validation and bootstrapping are the most popular approaches, due to their high ratio of training to test instances. Hold-out is also very popular for large datasets. As a DM methodology, CRISP-DM is the most widely adopted approach, for being independent of tools and industries.
There are many tools available, including specifically designed ones. However, some include only a small number of algorithms and do not allow the implementation of new ones, which is important for testing existing approaches and creating new ones. A data analysis tool that allows scripting, such as R, is very useful for creating scripts that evaluate existing algorithms based on the chosen statistical relevancy tests.
The algorithms selected for the empirical study of state-of-the-art model creation algorithms regarding epistasis and main effect detection are summarized in Table 2.11, along with the main characteristics of each selected algorithm.
Table 2.11: Similarities and differences between BEAM3, BOOST, MBMDR, Screen & Clean, SNPHarvester, SNPRuler, and TEAM.

Features | BEAM 3 | BOOST | MBMDR | Screen & Clean
Search | Stochastic | Exhaustive | Exhaustive | Heuristic
Permutation Test | √ | − | √ | −
Chi-square Test | −* | √ | −* | −*
Tree/Graph Structure | √ | − | − | −
Bonferroni Correction | − | √ | − | √
Interactive Effect | √ | √ | √ | √
Main Effect | √ | √ | √ | √
Full Effect | √ | √ | √ | √
Programming Language | C++ | C | R | R

Features | SNPHarvester | SNPRuler | TEAM
Search | Stochastic | Heuristic | Exhaustive
Permutation Test | − | − | √
Chi-square Test | √ | √ | −*
Tree Structure | − | √ | √
Bonferroni Correction | √ | √ | −
Interactive Effect | √ | √ | √
Main Effect | √ | − | −
Full Effect | √ | − | −
Programming Language | Java | Java | C++

The Chi-square Test is done for each SNP in main effect detection, and for each SNP interaction in epistasis detection. Full effect is a disease model with both main effect and epistasis detection.
*Although BEAM3 can evaluate interactive and full effects, its evaluation test is not comparable between methods; only single SNPs are evaluated with the χ2 test. TEAM outputs a χ2 test score from the contingency tables, but does not output the individual SNP χ2 score. MBMDR and Screen & Clean results are comparable with the other algorithms.
Chapter 3
A Comparative Study of Epistasis and Main Effect Analysis Algorithms
In this chapter, the experimental setup of an empirical analysis with existing epistasis detection
algorithms is presented.
3.1 Introduction
The experiments can be divided into two stages: the empirical analysis of existing methods and
the comparison between a new approach and the existing algorithms.
For stage 1, several algorithms were selected based on the previous state-of-the-art study, covering very different approaches. The algorithms selected are: BEAM 3.0 [Zha12]; BOOST [WYY+10a]; MBMDR [MVV11]; Screen and Clean [WDR+10]; SNPHarvester [YHW+09]; SNPRuler [WYY+10b]; and TEAM [ZHZW10]. The purpose of this study is to evaluate the
results of each algorithm and select the best algorithms according to the evaluation measures for
stage 2.
Stage 2 consists of creating an Ensemble approach based on the characteristics of each algorithm. The existing algorithms are evaluated according to their Power, Scalability, and Type I Error Rate. Each algorithm is analyzed with each measure, for each parameter configuration.
This allows correlating evaluation measures with data set parameters for each algorithm, giving a greater understanding of the usability of each algorithm according to the parameter settings.
In both of these studies, generated data sets were used, with many different configurations and varying values. The configurations use different values for: Population size; Minor Allele Frequency; Odds Ratio; Prevalence; and different types of Disease Models. These artificial data sets were created using genomeSIMLA, an open source data generator with generation evolution capabilities and many parametrization options.
In this chapter, the data sets and their parameters are explained in more detail in Section 3.2. This section also contains the input, output, and parameters for each algorithm. Section
3.3 contains the experimental procedure used for stage 1 experiments and the obtained results are
discussed. Finally, Section 3.4 contains the conclusions made from stage 1 experiments.
3.2 Methods
Data sets
In these experiments, there are a total of 270 different configurations of data sets. For each configuration there are 100 data sets, yielding 27,000 results for each algorithm.
Each data set is created in genomeSIMLA, which generates a population of 1,000,000 individuals evolved over 1750 generations. The growth of the initial population, consisting of 10,000 individuals, follows a logistic growth rate, allowing an organic evolution of SNP allele frequencies. Each data set contains 300 SNPs divided into 2 chromosomes. The first chromosome contains 20 blocks of 10 SNPs each, while the second chromosome contains 10 blocks of
10 SNPs. The alleles infused with disease related genotypes are chosen from different blocks in
different chromosomes.
The following parameters are used to generate different configurations of data sets:
• Allele Frequency - The frequency of the minor allele of the disease SNPs. Considering the
allele frequency of all 300 SNPs, the chosen SNPs that affect the disease are selected among
the SNPs closest to the desired minor allele frequency. The allele frequencies can be seen
in the lab notes [PC14a].
• Population - Number of individuals sampled in the data set. According to each data set, a
given number of individuals are selected from the generated population mentioned earlier.
The ratio of cases to controls is determined by the disease prevalence.
• Disease Model - Type of disease model: main effect, epistasis interaction, and full effect.
The main effect model consists of 2 SNPs that independently affect the phenotype expres-
sion. The epistasis interaction model is determined by 2 SNPs that interact with each other
and affect the phenotype expression only when both disease alleles are present. Full effect
is determined by 2 SNPs that affect the phenotype expression by epistasis interaction and
by their main effect.
• Odds ratio - Relation between disease SNPs. Probability of one disease SNP being present,
given the presence of the other disease SNP.
• Prevalence - The proportion of a population with the disease. Affects the number of cases
and controls in a data set. A prevalence of 0.0001 corresponds to 30% of cases while a
prevalence of 0.02 corresponds to 50% of cases.
For these experiments, the parameters chosen are illustrated in Table 3.1.
Parameters | Values
Minor Allele Frequency (0–1) | 0.01; 0.05; 0.1; 0.3; 0.5
Population (Number of Individuals) | 500; 1000; 2000
Disease Model | Main Effect; Epistasis; Full Effect
Odds Ratio | 1.1; 1.5; 2.0
Prevalence | 0.001; 0.02
Table 3.1: The values of each parameter used. Each configuration has a unique set of the parameters used.
3.2.1 Algorithms for interaction analysis
The following algorithms were selected for these experiments. These algorithms were selected
because of their unique approach and previous results obtained. A more detailed description and
additional result analysis of the algorithms is available in the lab notes of these experiments: BEAM3.0 [PC14b]; BOOST [PC14c]; Screen and Clean [PC14d]; SNPRuler [PC14f]; SNPHarvester [PC14g]; TEAM [PC14h]; and MBMDR [PC14i].
BEAM3
The algorithm allows filtering of SNPs with many missing genotypes, and setting a specific number of interactions for the MCMC as well as its initial temperature. There is also a prior probability of each SNP being associated with the disease. The default value is p = 5/L, where L is the number of SNPs. This was changed to p = 2/L, considering that there are 2 disease-affected SNPs.
BOOST
The algorithm contains no options to be customized. Considering the transformation of the data
into a Boolean type, the χ2 tests for interaction analysis have only 4 degrees of freedom.
MBMDR
This algorithm was processed in a different computer setting. The computer used for this algorithm has an Intel(R) Core(TM)2 Quad CPU Q9400 2.66 GHz processor and 16.00 GB of RAM.
Screen and Clean
The parameters chosen for this algorithm are:
• L - number of SNPs to be retained with the smallest p-values. Since there are 300 SNPs,
this is the value chosen.
• K_pairs - Number of pairwise interactions to be retained by the lasso. The selected value is
100.
• response - The type of phenotype, which can be binomial or Gaussian. The phenotypes here are binomial.
• alpha - The Bonferroni correction lower-bound limit for retention of SNPs. For this experiment, α = 0.05.
• standardize - If true, the genotypes, coded as 0, 1, or 2, are centered to mean 0 and standard deviation 1. The data must be standardized to run the Screen & Clean procedure, so this is enabled.
SNPHarvester
This algorithm has two modes: a "Threshold-Based" mode, which outputs all the significant SNPs above a specified significance threshold, and a "Top-K Based" mode, which outputs a specified number of SNP interactions. It is possible to choose the minimum and maximum number of interacting SNPs. For these experiments, the mode used is the "Threshold-Based" mode, with a significance level of α = 0.05, a minimum number of interacting SNPs of 1, which also tests main effects of SNPs, and a maximum of 2.
SNPRuler
The results are already limited by a threshold of 0.3, and further reduced to 0.05 with a Bonferroni correction. There are 3 configurable parameters:
• listSize - The expected number of interactions.
• depth - Order of interaction. Number of interacting SNPs.
• updateRatio - The step size of updating a rule. Takes a value between 0 and 1, 0 being not
updated and 1 updating a rule at each step.
The maximum number of rules is 50000, the length of each rule is 2 and the pruning threshold
is 0, to allow for all possible combinations.
TEAM
For this experiment, the χ2 score was calculated from the contingency tables. The number of permutations used in the significance test is set to 100 and the false discovery rate is set to 1. This is used to control the error rate using the permutation test, instead of a Bonferroni correction.
3.3 Simulation Design
This section contains the evaluation measures for the obtained results, the experimental method-
ology used in the experiments, the obtained results, and a discussion of the results.
Experimental Procedures
The results obtained from the various algorithms are evaluated according to their Power, Scalability, and Type I Error Rate.
In each data set, true positives and false positives are determined from the p-values that are significant at α = 0.05 in the statistical test, after a Bonferroni correction.
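As an illustration of a Bonferroni-corrected χ2 significance test of the kind used by several of the algorithms above, the sketch below scores one SNP's genotype-by-phenotype contingency table. The counts and the number of tests are invented, and the closed form exp(−x/2) for the p-value holds only for the 2 degrees of freedom of a 2×3 table:

```python
import math

# Pearson chi-square statistic of an r x c contingency table.
def chi2_statistic(table):
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return sum(
        (table[i][j] - row_totals[i] * col_totals[j] / total) ** 2
        / (row_totals[i] * col_totals[j] / total)
        for i in range(len(table))
        for j in range(len(table[0]))
    )

# Invented genotype counts (AA, Aa, aa) for one SNP.
cases = [60, 30, 10]
controls = [40, 40, 20]
x2 = chi2_statistic([cases, controls])

p_value = math.exp(-x2 / 2)  # chi-square survival function, valid for df = 2 only
n_tests = 300                # e.g. one single-SNP test per marker
significant = p_value < 0.05 / n_tests  # Bonferroni-corrected alpha
print(round(x2, 3), significant)
```

Here the uncorrected p-value (about 0.013) would pass α = 0.05 on its own, but fails the Bonferroni-corrected threshold of 0.05/300.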
The Power of a configuration is the percentage of its 100 data sets in which the true positives were found. If the Power is 100%, the disease-affected SNPs were found in every data set of the configuration.
The Type I Error Rate is calculated similarly to Power. For each configuration, the Type I
Error Rate is the percentage of data sets that contain false positives out of the 100 data sets in the
configuration. If the Type I Error Rate is 100%, all the data sets contain at least 1 false positive, i.e., at least 1 SNP or interaction of SNPs that is considered statistically significant but is not related to the disease.
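Aggregating the two measures into per-configuration percentages can be sketched as follows; the per-data-set booleans are invented stand-ins for real algorithm output:

```python
# Power and Type I Error Rate over one configuration's data sets.
# Each entry: (disease SNP pair detected, any significant false positive).
# The values below are invented for illustration.
results = [
    (True, False),
    (True, True),
    (False, True),
    (True, False),
]

power = 100 * sum(found for found, _ in results) / len(results)
type1_error_rate = 100 * sum(fp for _, fp in results) / len(results)
print(power, type1_error_rate)  # percentages: 75.0 50.0
```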
Scalability is evaluated in three different ways: Running Time; CPU Usage; and Memory Usage. Each of these measures is calculated for each data set and then averaged over each configuration. Running Time is measured in seconds, CPU Usage in percentage, and Memory Usage in MBytes. All these measures are recorded from the moment
the algorithm is started until it has finished.
For these experiments, the Data Mining process selected is CRISP-DM. The scripts used to
run each algorithm are written in the Unix shell Bash. Each algorithm was implemented in a
specific language. For the comparison of results, the R language is used in the statistical
significance test, selecting only the relevant results.
For each allele frequency configuration, a different SNP pair is used, choosing the SNPs that
are closest to the desired minor allele frequency. The SNPs selected according to their minor allele
frequency (MAF) are as follows:
• MAF 0.01 - SNP112 (0.01329) and SNP267 (0.010001)
• MAF 0.05 - SNP4 (0.05239) and SNP239 (0.048355)
• MAF 0.1 - SNP135 (0.09855) and SNP230 (0.089905)
• MAF 0.3 - SNP197 (0.274662) and SNP266 (0.31648)
• MAF 0.5 - SNP80 (0.439337) and SNP229 (0.50654)
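Selecting the SNPs closest to a desired minor allele frequency can be sketched as below; `closest_snps` is an illustrative helper, and the MAF values are taken from the list above:

```python
def closest_snps(maf_by_snp, target, k=2):
    """Pick the k SNPs whose minor allele frequency is closest to `target`."""
    return sorted(maf_by_snp, key=lambda s: abs(maf_by_snp[s] - target))[:k]

# Illustrative subset of the simulated SNPs and their observed MAFs
mafs = {"SNP112": 0.01329, "SNP267": 0.010001,
        "SNP4": 0.05239, "SNP239": 0.048355}
print(closest_snps(mafs, 0.01))  # → ['SNP267', 'SNP112']
```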
The penetrance tables are created differently for each allele frequency, altering the proportions
of each genotype in the disease SNPs.
Initially, the algorithms are tested on the most extreme configurations (minimum and maximum
MAF) to check that the results obtained are as expected. After this is confirmed, the algorithms
are executed for all configurations, according to the capabilities of each algorithm.
For each data set, a file containing the scalability measures is created. For each configuration,
a file summarizing all the data sets is created for Power, Scalability, and Type I Error Rate.
The computer used for these experiments ran the 64-bit Ubuntu 13.10 operating system, with
an Intel(R) Core(TM)2 Quad CPU Q6600 2.40 GHz processor and 8.00 GB of RAM. The
results were obtained using parallel processing.
Results and Discussion
Figures 3.1, 3.2, and 3.3 show the Power and Type I Error Results for each algorithm according
to each population size, while Figures 3.4, 3.5, and 3.6 display the results according to each
minor allele frequency. Not all the algorithms are used for each disease model, due to algorithm
limitations and properties, as discussed earlier (Figure 2.11). Further results, relating to
different parameters, can be seen in Lab Note 9 of these experiments [PC14e]. The lab notes are
available in the appendices.
For epistasis detection (Figure 3.1) by population size, in data sets with 500 individuals (a),
no algorithm has a Power above the Type I Error Rate, which is as high as 14%. The Power for
almost all algorithms is 0%, with the exception of BOOST, which is 1%. MBMDR and SNPRuler
have 0% Type I Error Rate, while Screen and Clean has the highest error rate of all algorithms,
closely followed by BOOST and SNPHarvester. For 1000 individuals (b), almost all algorithms
have Power higher than Type I Error Rate, with the exception of Screen and Clean. The algorithm
with the highest Power is BOOST, with 41%, with SNPHarvester and TEAM behind, both with
21%. MBMDR and Screen and Clean have very little Power. Screen and Clean has the highest
Type I Error Rate, with 16%, followed by SNPHarvester, BOOST and TEAM. Both MBMDR
and SNPRuler have 0% error rate. In the data sets with 2000 individuals (c), there are several
algorithms with high Power. BOOST has the best Power with 94%, closely followed by TEAM
with 92% and SNPHarvester with 85%. The worst algorithm by Power is Screen and Clean with
6%. Type I Error Rate is relatively low overall, with TEAM having the highest value with 28%.
Screen and Clean, BOOST, and SNPHarvester are close behind, with 21%, 21%, and 19%, respectively.
The algorithm with the lowest error rate is MBMDR.
In main effect detection (Figure 3.2), for 500 individuals (a), nearly all algorithms present
0% Power, with the exception of BOOST with 2%. Type I Error Rate is high, with Screen and
Clean having the highest value with 21%, followed by BOOST and SNPHarvester with 12% and
11% respectively. BEAM3 has the lowest error rate with 9%. For data sets with 1000 individuals
(b), the algorithm with the highest Power is BOOST with 43%, with SNPHarvester and BEAM3
close behind at 38% and 32%, respectively. Screen and Clean has 0% Power. The Type I Error Rate
is similar across all algorithms, with BOOST and Screen and Clean slightly ahead, with 23%.
For 2000 individuals (c), BOOST has the highest Power with 97%, while BEAM3 and SNPHarvester
have 93%. Screen and Clean has 39% Power, but also the lowest error rate, 36%, while
SNPHarvester has the highest error rate, with 79%.
The data sets with the full effect disease model (Figure 3.3), for 500 individuals (a), show that
BOOST has 1% Power and the other algorithms have 0%. The algorithm with the highest Type I
Error Rate is Screen and Clean, with 19%, and SNPHarvester has the lowest, with 9%. For 1000
individuals (b), BOOST has the most Power with 42%, SNPHarvester has 32% and Screen and
Figure 3.1: Epistasis detection by population size. The data sets have a 0.1 minor allele frequency, a 2.0 odds ratio, and 0.02 prevalence. Each bar corresponds to the Power and Type I Error Rate of BOOST, MBMDR, Screen and Clean, SNPHarvester, SNPRuler, and TEAM, in data sets with 500 individuals (a), 1000 individuals (b), and 2000 individuals (c). [Bar charts not reproduced.]
Clean remains at 0%. Type I Error Rates are higher for BOOST, with 38%. Screen and Clean
and SNPHarvester have 28% and 27%, respectively. For 2000 individuals (c), the best algorithm
is, once again, BOOST with 98% Power, and SNPRuler closely behind with 95%. Screen and
Clean has 0% Power, but also has the lowest error rate, with 33%. BOOST has 81% error rate,
and SNPHarvester has 79% error rate.
In evaluating data set results by minor allele frequency, for epistasis detection (Figure 3.4),
there is 0% Power for all algorithms at 0.01 allele frequency (a). The Type I Error Rate is as
high as 19% for Screen and Clean. The algorithms with the lowest error rate are MBMDR and
SNPRuler, with 0%. At 0.05 allele frequency (b), TEAM has the highest Power, with 43%, and
all other algorithms have a Power lower than 20%. TEAM also has the highest error rate, with
37%, and SNPRuler is the algorithm with the lowest error rate, with only 1%. For data sets with
0.1 minor allele frequency (c), BOOST and TEAM are the best algorithms with 94% and 92%
Power, respectively. Screen and Clean is the algorithm with the lowest Power, at 3%. MBMDR
has the lowest Type I Error Rate, while TEAM has the highest error rate, with 28%. For 0.3 allele
Figure 3.2: Main effect detection by population size. The data sets have a 0.1 minor allele frequency, a 2.0 odds ratio, and 0.02 prevalence. Each bar corresponds to the Power and Type I Error Rate of BEAM3, BOOST, Screen and Clean, and SNPHarvester, in data sets with 500 individuals (a), 1000 individuals (b), and 2000 individuals (c). [Bar charts not reproduced.]
frequency (d), BOOST has 100% Power, with TEAM close behind at 92%, and SNPHarvester
with 85%. Screen and Clean has the lowest Power, at 2%. SNPRuler has the lowest error rate,
with only 2%, while SNPHarvester has the highest, with 19%. Finally, for 0.5 allele frequency
(e), algorithms BOOST, TEAM and SNPRuler have the highest Power, with 100%, 95% and 92%,
respectively. Once again, Screen and Clean has the lowest Power with 0%. SNPHarvester has the
highest error rate, with 11%, and MBMDR together with SNPRuler have the lowest, with 0%.
For main effect detection (Figure 3.5), in 0.01 allele frequency (a), Power is 0% in all algo-
rithms. Type I Error Rate is highest in Screen and Clean, with 13%. In 0.05 allele frequency (b),
Power is nearly 0% for all algorithms except BOOST, with 14%. SNPHarvester has the highest
Type I Error Rate, with 24%, followed by Screen and Clean, with 22%. BOOST has the lowest
error rate with 11%. For 0.1 allele frequency (c), the most powerful algorithm is BOOST (97%),
closely followed by BEAM3 (92%) and SNPHarvester (92%). SNPHarvester has the highest error
rate with 79%, and Screen and Clean has the lowest, with 36%. In data sets with 0.3 allele fre-
quency (d), all algorithms have 100% Power, with the exception of Screen and Clean with only
58%. All algorithms have a 100% Type I Error Rate, except Screen and Clean with 38%. The results are
Figure 3.3: Full effect detection by population size. The data sets have a 0.1 minor allele frequency, a 2.0 odds ratio, and 0.02 prevalence. Each bar corresponds to the Power and Type I Error Rate of BOOST, Screen and Clean, and SNPHarvester, in data sets with 500 individuals (a), 1000 individuals (b), and 2000 individuals (c). [Bar charts not reproduced.]
the same in 0.5 minor allele frequency (e), with the exception for Screen and Clean, with 62%
Power and 48% Type I Error Rate.
For full effect detection (Figure 3.6), in 0.01 allele frequency (a), there is 0% Power in all
algorithms, with Screen and Clean having the highest Type I Error Rate (14%), and SNPHarvester
having the lowest (1%). For 0.05 minor allele frequency (b), only BOOST has any Power, with
15%. Screen and Clean has the highest error rate with 21%, followed by SNPHarvester with 20%
and BOOST with 17%. For 0.1 (c), BOOST and SNPHarvester have a high Power percentage,
with 98% and 95%, respectively. Screen and Clean once again has 0% Power. However, Screen
and Clean has the lowest error rate (33%), while BOOST has the highest (81%), followed by
SNPHarvester (79%). At the 0.3 (d) and 0.5 (e) minor allele frequencies, both BOOST and SNPHarvester
have the same values, with 100% Power and Type I Error Rate. Screen and Clean has a
Power of 40% and 91% and a Type I Error Rate of 68% and 84% for the 0.3 and 0.5 allele frequencies, respectively.
Table 3.2 contains the scalability analysis. Screen and Clean is revealed to be the slowest
algorithm, followed by SNPHarvester. TEAM and BEAM3 have similar values, with SNPRuler
having close to half of their running time. BOOST is the fastest algorithm, with less than 1 second
Figure 3.4: Epistasis detection by minor allele frequency. The data sets have 2000 individuals, a 2.0 odds ratio, and 0.02 prevalence. Each bar corresponds to the Power and Type I Error Rate of BOOST, MBMDR, Screen and Clean, SNPHarvester, SNPRuler, and TEAM, in data sets with 0.01 (a), 0.05 (b), 0.1 (c), 0.3 (d), and 0.5 (e) allele frequencies. [Bar charts not reproduced.]
of running time in the biggest data sets. Screen and Clean also has the biggest increase in running
time, followed by SNPHarvester. SNPRuler is the most expensive algorithm in CPU usage, with
a usage above 100%, which means that the algorithm uses more than one core to process
each data set. In memory usage, SNPRuler again has the highest consumption, closely followed
by TEAM, Screen and Clean, SNPHarvester, BEAM3, and finally BOOST far behind.
Figure 3.5: Main effect detection by minor allele frequency. The data sets have 2000 individuals, a 2.0 odds ratio, and 0.02 prevalence. Each bar corresponds to the Power and Type I Error Rate of BEAM3, BOOST, Screen and Clean, and SNPHarvester, in data sets with 0.01 (a), 0.05 (b), 0.1 (c), 0.3 (d), and 0.5 (e) allele frequencies. [Bar charts not reproduced.]
Figure 3.6: Full effect detection by minor allele frequency. The data sets have 2000 individuals, a 2.0 odds ratio, and 0.02 prevalence. Each bar corresponds to the Power and Type I Error Rate of BOOST, Screen and Clean, and SNPHarvester, in data sets with 0.01 (a), 0.05 (b), 0.1 (c), 0.3 (d), and 0.5 (e) allele frequencies. [Bar charts not reproduced.]
              Running Time (s)       CPU Usage (%)          Memory Usage (MB)
              500    1000   2000     500    1000   2000     500    1000   2000
BEAM3         4.9    7      8        87.8   96.3   95.5     4      4.3    5.8
BOOST         0.16   0.22   0.34     95.7   98.79  97.87    0.98   1      1.2
MBMDR*        −      −      −        −      −      −        −      −      −
SnC           8.05   18.65  34.65    75.7   98.99  77.25    129.8  137.2  152.5
SNPHarvester  9.29   25.89  33       102.1  86.5   101.6    68.35  71.3   76.86
SNPRuler      2.7    3.09   4.1      130.2  141.9  156.28   312.7  316    320.2
TEAM          3.28   5.28   9.81     66.99  69.71  74.75    162.7  176    228.1

Table 3.2: Scalability results: average running time, CPU usage, and memory usage by data set population size. The BOOST, Screen and Clean, and SNPHarvester values relate to full effect detection, TEAM and SNPRuler to epistasis detection, and BEAM3 to main effect detection. *MBMDR has no scalability results because they were obtained on computers with different hardware from all other results; its average running time per data set exceeded 3600 seconds. The data sets have a 0.5 minor allele frequency, a 2.0 odds ratio, and 0.02 prevalence.
3.4 Chapter Conclusions
The results show that BOOST is the best algorithm overall in terms of Power, but has a high Type
I Error Rate. SNPRuler has a low Type I Error Rate, but not very high Power and only works with
epistasis detection. Screen and Clean has very low Power in general, but has a relatively low error
rate, especially in data sets with a high number of individuals or a high minor allele frequency, in
main effect or full effect disease models. BEAM3 has high Power and slightly lower error rate than
BOOST, but only works with main effect. SNPHarvester has low Power, but also low Type I Error
Rate overall. MBMDR has very low Type I Error Rate with high Power in certain configurations,
but only works with epistasis and has a very high running time. TEAM has very high Power and
low Type I Error Rate, with the exception of certain configurations, particularly of lower number
of individuals and lower minor allele frequency. However, it only works for epistasis detection.
BOOST is the most scalable algorithm, followed by SNPRuler and BEAM3. This is important
for the next stage of the experiments, with an ensemble approach. Based on the data obtained,
we can conclude that some of the algorithms used would not be useful in an ensemble approach,
either because of their scalability, or because they would not add Power without compromising
Type I Error Rate.
These experiments show results similar to those of previous studies [WYY+10b, SZS+11];
however, they cover a much wider variety of data set types and algorithms than those
studies. These experiments can be viewed from different perspectives, using different
parameters, and the results can be analyzed according to their Power, Type I Error Rate, and Scal-
ability. Furthermore, the results obtained are available in the lab notes. The lab notes and the
created scripts are available at https://github.com/ei09045/EpistasisStudy.
Chapter 4
Ensemble Approach
In this chapter, a new Ensemble approach is discussed. This new approach uses algorithms from
the previous empirical study to improve results.
4.1 Introduction
The results from the empirical study of existing epistasis detection algorithms showed unique
properties in each algorithm. Considering Power and Type I Error Rate, the purpose of this stage
is to create a new approach that maintains the Power of the best algorithms and lowers the Type I
Error Rate associated with them, which is usually high.
For this purpose, a new approach joining algorithms was developed. The algorithms are:
BEAM 3.0 [Zha12]; BOOST [WYY+10a]; SNPRuler [WYY+10b]; SNPHarvester [YHW+09];
and TEAM [ZHZW10]. These algorithms were selected based on their Power to Type I Error Rate
ratio, and their scalability. BOOST is used for both epistasis detection and main effect detection,
which means that a total of three algorithms is used for each detection type, with the exception
of full effect, which uses all algorithms.
In Section 4.2 the experimental procedure for stage 2 is discussed, involving the process of
selecting and using a voting system for the Ensemble approach. This section also shows the
results obtained from the Ensemble approach and the comparison between the existing algorithms.
Finally, Section 4.3 shows the conclusions from the discussion of the results.
4.2 Experiments
Experimental Procedure
For these experiments, the same data sets discussed in the previous chapter were used. The same
evaluation measures are used to evaluate the new results. The new approach is an Ensemble,
where each algorithm votes with its relevant SNPs and SNP pairs, and a unified system then
chooses the relevant main effect SNPs and epistasis interactions.
For this purpose the algorithms selected for main effect detection are BEAM 3.0, BOOST,
and SNPHarvester. For epistasis detection the algorithms selected are BOOST, SNPRuler, and
TEAM. The Ensemble approach collects the relevant results reported by each algorithm and
selects SNPs and pairs of SNPs that are common to at least two algorithms. The algorithms
selected for main effect only work with single SNPs, and the algorithms selected for epistasis
detection only work with SNP pairs. BOOST works for both models, so the results enter the
voting stage of main effect and epistasis detection. The results obtained from each algorithm are
converted into a unified format, so they can be interpreted for the voting stage. In the full effect
detection, both main effect and epistasis detection algorithms intervene in the voting stage. This
helps to reduce the Type I Error Rate while maintaining Power: an interaction that is truly related
to the phenotype will be reported by most algorithms, while unrelated interactions will not.
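The voting scheme just described can be sketched as follows; `ensemble_vote` and the example result sets are illustrative, not the actual implementation:

```python
def ensemble_vote(results_by_algorithm, min_votes=2):
    """Vote over the significant findings of several algorithms.
    `results_by_algorithm` maps an algorithm name to the set of SNPs or
    SNP pairs it reported; a finding is kept when at least `min_votes`
    algorithms agree on it."""
    votes = {}
    for findings in results_by_algorithm.values():
        for hit in findings:
            votes[hit] = votes.get(hit, 0) + 1
    return {hit for hit, count in votes.items() if count >= min_votes}

# Illustrative epistasis results: only the pair reported by two algorithms survives
reported = {
    "BOOST":    {("SNP135", "SNP230"), ("SNP12", "SNP77")},
    "SNPRuler": {("SNP135", "SNP230")},
    "TEAM":     {("SNP3", "SNP9")},
}
print(ensemble_vote(reported))  # → {('SNP135', 'SNP230')}
```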
The computer used for these experiments ran the 64-bit Debian testing (jessie) operating
system, with an Intel(R) Core(TM)2 Quad CPU Q9400 2.66 GHz processor and 16.00 GB of
RAM.
Results and Discussion
The results obtained from previous experiments are used to compare the performance of existing
algorithms with the new ensemble approach. Figures 4.1, 4.2, and 4.3 show the Power and Type
I Error Results for each algorithm according to each population size; Figures 4.4, 4.5, and 4.6
display the results according to each minor allele frequency; Figures 4.7, 4.8, and 4.9 show the
results according to each odds ratio tested; and Figures 4.10, 4.11, and 4.12 contain the results
regarding both prevalence values.
The results in epistasis detection by population size for data sets with 500 individuals (a) show
0% Power but also 0% Type I Error Rate for the Ensemble approach. BOOST has the most Power
with 1% but has 7% Type I Error Rate. In data sets with 1000 individuals (b), the Ensemble
approach has 23% Power and 0% error rate, while the algorithm with the most Power is BOOST
with 41% and 5% error rate. For 2000 individuals (c) in data sets, the Ensemble has 92% Power
and 15% error rate, while BOOST has 94% Power but 21% error rate.
In main effect detection, for 500 individuals (a), the Ensemble has 0% Power and an 11% Type I Error Rate.
BOOST has 2% Power but 12% error rate, while BEAM3 has only 9% error rate. For 1000
individuals (b), Ensemble has 37% Power and 19% error rate. BOOST has 43% Power and 23%
error rate while BEAM3 has the least error rate with 18% but has only 32% Power. Finally in data
sets with 2000 individuals (c), Ensemble has 92% Power and a Type I Error Rate of 71%. BOOST has
the most Power with 97% and has 74% error rate.
Full effect detection results show 0% Power and 11% Type I Error Rate in the Ensemble
approach for data sets with 500 individuals (a). SNPHarvester has the lowest error rate, with 9%. For
data sets with 1000 individuals (b), Ensemble has 0% Power and an 11% error rate, but SNPHarvester
has a lower error rate, 9%. For 2000 individuals (c), the Ensemble approach has 95% Power and
a 75% Type I Error Rate, having the highest Power and the lowest error rate.
Figure 4.1: Epistasis detection by population size, with a 0.1 minor allele frequency, a 2.0 odds ratio, and 0.02 prevalence. Each bar corresponds to the Power and Type I Error Rate of BOOST, SNPHarvester, SNPRuler, TEAM, and Ensemble, in data sets with 500 individuals (a), 1000 individuals (b), and 2000 individuals (c). [Bar charts not reproduced.]
Figure 4.2: Main effect detection by population size, with a 0.1 minor allele frequency, a 2.0 odds ratio, and 0.02 prevalence. Each bar corresponds to the Power and Type I Error Rate of BEAM3, BOOST, SNPHarvester, and Ensemble, in data sets with 500 individuals (a), 1000 individuals (b), and 2000 individuals (c). [Bar charts not reproduced.]
Figure 4.3: Full effect detection by population size, with a 0.1 minor allele frequency, a 2.0 odds ratio, and 0.02 prevalence. Each bar corresponds to the Power and Type I Error Rate of BOOST, SNPHarvester, and Ensemble, in data sets with 500 individuals (a), 1000 individuals (b), and 2000 individuals (c). [Bar charts not reproduced.]
In minor allele frequency analysis, for epistasis detection, Ensemble shows 0% Power and
0% Type I Error Rate in data sets with 0.01 allele frequency (a). For 0.05 allele frequency (b),
the Ensemble approach has 6% Power and a 1% error rate, while TEAM has 43% Power and a 37% error
rate. In 0.1 allele frequency (c), Ensemble has 92% Power and 15% error rate. BOOST has 94%
Power but 21% error rate. For 0.3 allele frequency (d), Ensemble has 99% Power and 1% error
rate, while BOOST has 100% Power but 6% error rate. Finally in 0.5 minor allele frequency (e),
Ensemble has 100% Power and 0% Type I Error Rate.
Figure 4.4: Epistasis detection by minor allele frequency, with 2000 individuals, a 2.0 odds ratio, and 0.02 prevalence. Each bar corresponds to the Power and Type I Error Rate of BOOST, SNPHarvester, SNPRuler, TEAM, and Ensemble, in data sets with 0.01 (a), 0.05 (b), 0.1 (c), 0.3 (d), and 0.5 (e) allele frequencies. [Bar charts not reproduced.]
In main effect detection, for 0.01 allele frequency (a), Ensemble has 0% Power and a 1% error rate.
In 0.05 allele frequency (b), Ensemble has 1% Power and 20% Type I Error Rate. BOOST is the
best algorithm in this setting, with 14% Power and 11% Type I Error Rate. For 0.1 minor allele
frequency (c), Ensemble has 92% Power and 71% error rate. BOOST has 97% Power but has 74%
error rate. BEAM3 has the same Power as Ensemble, but a slightly lower error rate, with 67%. For 0.3
(d) and 0.5 (e) allele frequency, all the approaches have 100% Power and Type I error Rate.
Figure 4.5: Main effect detection by minor allele frequency, with 2000 individuals, a 2.0 odds ratio, and 0.02 prevalence. Each bar corresponds to the Power and Type I Error Rate of BEAM3, BOOST, SNPHarvester, and Ensemble, in data sets with 0.01 (a), 0.05 (b), 0.1 (c), 0.3 (d), and 0.5 (e) allele frequencies. [Bar charts not reproduced.]
In full effect detection, for 0.01 allele frequency (a), no algorithm has any Power, but Ensemble
has the lowest error rate, with only 1%. For 0.05 allele frequency (b), Ensemble has the lowest error
rate, with 16%, but BOOST has the most Power, with 15%. For 0.1 (c), the Ensemble approach has 95%
Power and 75% error rate. BOOST has 98% Power but 81% error rate. Finally, all algorithms have
100% Power and Type I Error Rate for 0.3 (d) and 0.5 (e) minor allele frequencies.
Analysing the results by odds ratio, for epistasis detection, at 1.1 odds ratio (a), Ensemble
has 1% Power and a 0% error rate. BOOST has 27% Power, but a 5% Type I Error Rate. In 1.5 odds
ratio (b), Ensemble has 84% Power and 3% Type I Error Rate, while BOOST has 95% Power, but
Figure 4.6: Full effect detection by minor allele frequency, with 2000 individuals, a 2.0 odds ratio, and 0.02 prevalence. Each bar corresponds to the Power and Type I Error Rate of BOOST, SNPHarvester, and Ensemble, in data sets with 0.01 (a), 0.05 (b), 0.1 (c), 0.3 (d), and 0.5 (e) allele frequencies. [Bar charts not reproduced.]
9% error rate. For 2.0 odds ratio (c), Ensemble has 92% Power and 15% error rate. BOOST has
slightly more Power, with 94%, but once again has higher error rate, with 21%.
In main effect detection, at 1.1 odds ratio (a), all algorithms have the same Power, 2%,
but BEAM3 has a lower error rate, 8%, while Ensemble has 10%. At 1.5 odds ratio (b), Ensemble
has the most Power, with 25%, and a 17% error rate, but BEAM3, with 16%, has the lowest
error rate. For 2.0 odds ratio (c), Ensemble has 92% Power and a 71% error rate.
BEAM3 has the same Power but a lower error rate, 67%, and BOOST has more Power (97%)
but also a higher Type I Error Rate (74%).
Finally, for full effect detection by odds ratio, at 1.1 odds ratio (a), Ensemble has the lowest
Power, with 3%, but also the lowest error rate, with 7%. SNPHarvester has the most Power at
10%, and has 9% error rate. At 1.5 odds ratio (b), BOOST is the algorithm with the highest Power,
Figure 4.7: Epistasis detection by odds ratio, with a 0.1 minor allele frequency, 2000 individuals, and 0.02 prevalence. Each bar corresponds to the Power and Type I Error Rate of BOOST, SNPHarvester, SNPRuler, TEAM, and Ensemble, in data sets with 1.1 (a), 1.5 (b), and 2.0 (c) odds ratios. [Bar charts not reproduced.]
Figure 4.8: Main effect detection by odds ratio, with a 0.1 minor allele frequency, 2000 individuals, and 0.02 prevalence. Each bar corresponds to the Power and Type I Error Rate of BEAM3, BOOST, SNPHarvester, and Ensemble, in data sets with 1.1 (a), 1.5 (b), and 2.0 (c) odds ratios. [Bar charts not reproduced.]
with 72%, and the highest Type I Error Rate, with 51%. Ensemble has the lowest Power, with 65%, but
also the lowest error rate, with 40%. For 2.0 odds ratio (c), Ensemble has the lowest error rate,
75%, and 95% Power, while BOOST has 98% Power but an 81% error rate.
Figure 4.9: Full effect detection by odds ratio, with a 0.1 minor allele frequency, 2000 individuals, and 0.02 prevalence. Each bar corresponds to the Power and Type I Error Rate of BOOST, SNPHarvester, and Ensemble, in data sets with 1.1 (a), 1.5 (b), and 2.0 (c) odds ratios. [Bar charts not reproduced.]
Looking at the results of the data sets by disease prevalence, in epistasis detection with 0.0001 prevalence (a), Ensemble has 86% Power and a 2% error rate. SNPRuler is the only algorithm with a lower error rate, at 0%, but it has much lower Power. BOOST has 91% Power but a 7% error rate. For 0.02 prevalence (b), the Ensemble approach has 92% Power and a 15% error rate. SNPRuler has an 8% error rate but only 32% Power. BOOST has 94% Power and a Type I Error Rate of 21%.
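For reference, the Power and Type I Error Rate percentages quoted in this chapter can be computed from per-data-set detection results. The sketch below is an illustration, not the evaluation code used in the experiments; it assumes Power is the fraction of simulated data sets in which the ground-truth SNP pair is reported, and Type I Error Rate the fraction in which at least one spurious pair is reported. All names are hypothetical.

```python
def power_and_type1(results, truth):
    """results: one set of reported SNP pairs per simulated data set.
    truth: the ground-truth SNP pair embedded in every data set.
    Returns (Power %, Type I Error Rate %)."""
    n = len(results)
    # Power: data sets where the true pair was found.
    hits = sum(1 for reported in results if truth in reported)
    # Type I error: data sets reporting at least one false pair.
    false_alarms = sum(1 for reported in results if reported - {truth})
    return 100.0 * hits / n, 100.0 * false_alarms / n

# Toy example: 4 data sets, ground truth is the pair (3, 17).
truth = (3, 17)
results = [{(3, 17)}, {(3, 17), (5, 9)}, {(5, 9)}, {(3, 17)}]
print(power_and_type1(results, truth))  # (75.0, 50.0)
```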
Regarding the main effect results, for 0.0001 prevalence (a), Ensemble has 98% Power and a 77% error rate. BEAM3 is the best in this configuration, with 99% Power and a 76% error rate. For 0.02 prevalence (b), Ensemble has 92% Power and a 71% error rate. BEAM3 is the algorithm with the lowest error rate, at 67%, with the same Power as Ensemble. BOOST has 97% Power and a 74% error rate.
For the full effect analysis by prevalence, with 0.0001 prevalence (a), Ensemble has 99% Power and 99%
Figure 4.10: These results correspond to epistasis detection by prevalence, with a 0.1 minor allele frequency, 2000 individuals, and 2.0 odds ratio, showing the Power and Type I Error Rate of BOOST, SNPHarvester, SNPRuler, TEAM, and Ensemble. Each subfigure contains the values for all algorithms in data sets with 0.0001 prevalence (a), and 0.02 prevalence (b).
Figure 4.11: These results correspond to main effect detection by prevalence, with a 0.1 minor allele frequency, 2000 individuals, and 2.0 odds ratio, showing the Power and Type I Error Rate of BEAM3, BOOST, SNPHarvester, and Ensemble. Each subfigure contains the values for all algorithms in data sets with 0.0001 prevalence (a), and 0.02 prevalence (b).
Type I Error Rate, while SNPHarvester has the same error rate but 100% Power. For 0.02 prevalence (b), Ensemble is the best algorithm, with 95% Power and a 75% error rate.
Figure 4.12: These results correspond to full effect detection by prevalence, with a 0.1 minor allele frequency, 2000 individuals, and 2.0 odds ratio, showing the Power and Type I Error Rate of BOOST, SNPHarvester, and Ensemble. Each subfigure contains the values for all algorithms in data sets with 0.0001 prevalence (a), and 0.02 prevalence (b).
In order to evaluate the scalability of the Ensemble algorithm in relation to the other algorithms, scalability measures were taken both from each algorithm individually, while being executed within the Ensemble approach, and from the overall Ensemble algorithm, including the voting stage. Tables 4.1, 4.2, and 4.3 show the total running time of all the algorithms, the running time of the Ensemble approach (with the voting stage), and the difference between them. The average CPU usage and memory usage throughout the Ensemble run are also recorded. The full effect disease model data sets were chosen because they are likely to produce the most statistically significant results per data set, which demands the longest running time and the most memory.
The results show an increase in the difference between the total running time of all algorithms and the Ensemble running time, and this difference clearly grows with the data set size. However, for epistasis and main effect, the added time does not grow as a share of the total, which means that the difference remains almost the same percentage of the Ensemble running time, independently of the data set size. In epistasis detection, there is a near 30% increase relative to the total running time of the algorithms, but the difference in seconds is smaller than for main effect and full effect, which show near 14.5% and 9.6% increases at 2000 individuals. There is no relation between CPU usage and data set size, but there is a small increase in memory usage with the data set size, especially in full effect detection.
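These relative overhead figures follow directly from the Difference and Total rows of Tables 4.1 to 4.3 at 2000 individuals:

```python
# (Difference, Total) running times in seconds at 2000 individuals,
# taken from Tables 4.1-4.3.
overhead = {
    "epistasis":   (5.3, 18.6),
    "main effect": (8.0, 55.0),
    "full effect": (22.7, 236.7),
}
for model, (diff, total) in overhead.items():
    print(f"{model}: {100 * diff / total:.1f}% overhead")
# epistasis: 28.5% overhead
# main effect: 14.5% overhead
# full effect: 9.6% overhead
```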
              Running Time (s)    CPU Usage (%)          Memory Usage (MB)
              500   1000  2000    500    1000   2000     500    1000   2000
BEAM3         0.5   0.6   0.8     87.8   85.3   88       2.2    2.6    3.4
BOOST         0.1   0.2   0.3     98.6   96.3   96.9     1      1      1.1
SNPHarvester  1.9   3.1   6       119.9  108.2  104.7    52.2   60.2   78.3
SNPRuler      2.3   2.6   3.4     181.2  143.8  136.9    352.4  352.4  353.5
TEAM          2.7   4.6   8.1     99     98.6   98.8     162.7  177    228.1
Total*        7.5   11.1  18.6
Ensemble*     9.8   14.3  23.9    110.6  103    102.1    352.4  352.4  353.5
Difference*   2.3   3.2   5.3
Table 4.1: Scalability test containing the average running time, CPU usage, and memory usage by data set population size. The data sets have a minor allele frequency of 0.5, 2.0 odds ratio, and 0.02 prevalence, and the disease model is epistasis detection. *Total is the summed running time of all algorithms; Ensemble is the time for all algorithms in the Ensemble approach, including the voting stage; Difference is the running time increase between them. CPU usage and memory usage are averages along the process, so their totals are not relevant.
              Running Time (s)    CPU Usage (%)          Memory Usage (MB)
              500   1000  2000    500    1000   2000     500    1000   2000
BEAM3         2.9   4.1   4.6     94.5   94.6   97.3     2.9    3.4    4.2
BOOST         0.1   0.2   0.3     97.9   98.6   98.9     1      1      1.1
SNPHarvester  2.5   11.6  39.4    117.9  105.5  102      59.6   92.5   103.6
SNPRuler      2.3   2.3   3       168    147    160.9    352.2  352.2  354.3
TEAM          2.9   4.2   7.7     98.7   99     99       162.7  177    227.9
Total*        10.7  22.4  55
Ensemble*     13.1  26    63      89.8   88     77.3     349.2  352.2  354.3
Difference*   2.4   3.6   8
Table 4.2: Scalability test containing the average running time, CPU usage, and memory usage by data set population size. The data sets have a minor allele frequency of 0.5, 2.0 odds ratio, and 0.02 prevalence, and the disease model is main effect. *Total is the summed running time of all algorithms; Ensemble is the time for all algorithms in the Ensemble approach, including the voting stage; Difference is the running time increase between them. CPU usage and memory usage are averages along the process, so their totals are not relevant.
              Running Time (s)      CPU Usage (%)         Memory Usage (MB)
              500    1000   2000    500    1000   2000    500    1000   2000
BEAM3         110.8  90.9   196.7   99     90.9   99      37.3   29.8   106.2
BOOST         0.1    0.2    0.3     98.6   97.7   98.5    0.9    1      1.2
SNPHarvester  7.9    20.3   28.9    111    103.8  102.9   98.9   101.3  102.8
SNPRuler      2.3    2.3    2.8     197.7  149.9  154.8   337.4  304.7  353.7
TEAM          2.7    4.5    8       99     98.6   98.9    162.7  176.8  227.8
Total*        123.8  118.2  236.7
Ensemble*     126.8  126.3  259.4   95     79.2   74.6    337.4  304.7  353.7
Difference*   3      8.1    22.7
Table 4.3: Scalability test containing the average running time, CPU usage, and memory usage by data set population size. The data sets have a minor allele frequency of 0.5, 2.0 odds ratio, and 0.02 prevalence, and the disease model is full effect. *Total is the summed running time of all algorithms; Ensemble is the time for all algorithms in the Ensemble approach, including the voting stage; Difference is the running time increase between them. CPU usage and memory usage are averages along the process, so their totals are not relevant.
4.3 Chapter Conclusions
In this chapter, a new epistasis and main effect detection approach is discussed. This is an Ensemble approach, using 5 of the best algorithms from the empirical study presented in Chapter 3. The new Ensemble approach uses 3 algorithms to evaluate relevant epistatic interactions and 3 to evaluate relevant main effects of SNPs. If there is a majority in the voting stage, the SNP or SNP pair is selected as a relevant result.
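The voting stage described above can be illustrated with a short sketch. This is a hypothetical reconstruction, not the implementation used in the experiments: each algorithm contributes a set of candidate SNP pairs, and a pair is kept only if a strict majority of the algorithms report it.

```python
from collections import Counter

def majority_vote(candidate_sets):
    """Keep every SNP pair reported by a strict majority of the algorithms."""
    votes = Counter(pair for s in candidate_sets for pair in s)
    quorum = len(candidate_sets) // 2 + 1  # 2 out of 3 algorithms
    return {pair for pair, n in votes.items() if n >= quorum}

# Hypothetical outputs of the three epistasis detectors.
boost    = {(3, 17), (5, 9)}
snpruler = {(3, 17)}
team     = {(3, 17), (8, 21)}
print(majority_vote([boost, snpruler, team]))  # {(3, 17)}
```

Pairs reported by only one algorithm are discarded, which is what drives the reduction in Type I Error Rate at a small cost in Power.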
From the results obtained, we can see that, for epistasis detection, the Ensemble method has slightly less Power than the best algorithm, but it is the algorithm with the lowest Type I Error Rate, with the exception of SNPRuler in some configurations, although SNPRuler has much lower Power than the Ensemble method. In main effect detection, Ensemble is amongst the algorithms with the lowest Type I Error Rate, behind BEAM3 in some configurations, but it has consistently higher Power. BOOST has more Power than the Ensemble method, but also a higher error rate. For full effect, Ensemble is the algorithm with the lowest error rate, but it has less Power than BOOST in some configurations.
The goal of these experiments was to create a more efficient method that is able to find the ground truth SNPs related to the disease while reducing false positives. The Ensemble method fulfills these requirements. The scalability test shows that running time and memory usage grow with the size of the data. However, given that only the 5 most scalable algorithms were selected, and that resource and time consumption increase only steadily, it is easy to estimate the time and resources needed for big data sets, and given the Power and Type I Error Rate results, the Ensemble is far better than any single algorithm.
Chapter 5
Conclusions
This dissertation was created with the purpose of improving the detection of genes, specifically SNPs, that cause the expression of complex diseases. These diseases have a genetic basis that increases susceptibility. This means that, given an individual genotype, it is possible to assume a greater risk of developing a complex disease if it contains an SNP allele that is connected to the disease manifestation. It is therefore very important to find the genotype configurations that are connected to a given disease.
A state-of-the-art study was made of the recent work related to this dissertation. For the methodologies, the data selection and model creation algorithms were studied, together with the more generic algorithms on which the specific model creation algorithms are based. Some auxiliary algorithms, used in different stages by the specific model creation algorithms, were also studied. Furthermore, data analysis evaluation procedures and measures were studied; these represent, respectively, how the data is used for training and testing, and which statistically relevant measures are taken from the results. Data Mining processes and software were also studied. CRISP-DM was the procedure selected for the experiments. R was the software used in the experiments, except for algorithms implemented in other programming languages.
Initially, a group of algorithms was selected based on the state-of-the-art study. These algorithms are the most recent approaches that showed the most promise and compatibility between them, which made them easier to test. For this study, a large amount of data was generated using genomeSIMLA, to create different types of data sets and reveal a wide range of results from each algorithm. The data sets selected contained different disease model types, each type being compatible with a subgroup of the selected algorithms. The purpose of this initial empirical study was to find the best algorithms overall, according to their Power, Type I Error Rate, and scalability. The results showed that BOOST was the best algorithm overall in terms of Power, and that SNPRuler and MBMDR had the lowest error rates, although MBMDR scaled very poorly. BOOST was also the most scalable.
Out of the 7 algorithms selected for the comparison study, 3 were chosen for main effect detection and 3 for epistasis detection. In each stage, the results are chosen by a majority of these algorithms. For main effect detection, the algorithms chosen were: BEAM3; BOOST; and
SNPHarvester. For epistasis detection, the algorithms chosen were: BOOST; SNPRuler; and
TEAM. BOOST was selected twice because of its high Power results in both disease models.
A new methodology, the Ensemble, was then created with the selected algorithms. The purpose of this methodology is to maintain the Power of the individual algorithms while reducing the Type I Error Rate.
The observed results showed that the Type I Error Rates were lowered significantly, especially in epistasis detection. However, the Power of the Ensemble was slightly lower than BOOST's in some configurations. The scalability results showed some difference between the running time of the Ensemble and that of the selected algorithms, due to the voting stage, but this difference is stable and did not show a clear increase with the data set size, which means that only a small percentage of the overall running time is dedicated to the voting stage.
The main conclusion of the empirical study of the state-of-the-art algorithms is that, even if some algorithms show more dominant results, there is no absolute best algorithm for all types of diseases. These results used small artificial data sets, so for large realistic data sets the results also depend heavily on the scalability of each algorithm, which limits the types of configurations each algorithm can process. It is very difficult to obtain true positives without false positives in a viable period of time. For this purpose, the Ensemble approach was created to maintain the epistasis and main effect detections without such high numbers of false positives, but the necessary running time is greater than that of all the algorithms combined, which may not be viable for larger data sets. Nevertheless, the Ensemble was the most accurate of the approaches evaluated.
5.1 Contribution summary
The main contributions of this dissertation are as follows:
• Creation of a vast number of data sets with many different configurations, altering different parameters that affect the data and, consequently, the results of the algorithms. This allows for a more complete evaluation of a given algorithm.
• An empirical study of the 7 most recent epistasis and main effect detection algorithms,
across many different configurations.
• Creation and evaluation of a new methodology, Ensemble, based on existing state-of-the-art
algorithms. This new methodology was able to yield good results, while decreasing the
Type I Error Rate.
Appendix A
Glossary
A.1 Biology related terms
• Allele - One of the alternative forms of a gene found at a given position (locus) of a specific chromosome.
• Cell - The most basic structural unit of any organism that is capable of independent functioning.
• Chromosome - Genetic material stored in the nucleus of eukaryotic cells. Contains the hereditary information. In humans, there are 23 pairs of chromosomes.
• DNA - Deoxyribonucleic Acid. The molecule where the genetic material is stored. It is capable of self-replication and serves as the template for RNA synthesis.
• Dominant Gene - The allele that manifests itself in the phenotype when two different alleles are present in the genotype (heterozygous cases), as well as in homozygous cases. An allele may be dominant to one allele but recessive to another.
• Epistasis - The interaction between SNPs (or genes) in the expression of a phenotype. There are 3 main types of epistasis:

– Compositional Epistasis - The blocking of the effect of an allele by an allele at another locus.

– Functional Epistasis - The direct molecular interaction between the products of different genes.

– Statistical Epistasis - A deviation from additivity in the combined effects of alleles at different loci on the phenotype.
• Eukaryote - A cell with a defined nucleus, whose membrane separates the nucleus from the rest of the cell's contents.
• Gene - The basic unit of hereditary information in DNA or RNA. May undergo mutations.

• Genotype - The genetic constitution of a specific trait. A combination of alleles at corresponding loci on a pair of chromosomes that determines a trait.
• GWAS - Genome Wide Association Study. A study of the entire genome to find SNPs that are associated with specific traits — in this case, complex diseases.
• Heterozygous - A genotype composed of two different alleles.

• Homozygous - A genotype composed of two copies of the same allele.
• Locus - The position in the DNA where a given gene is located.
• Mutation - A change in a chromosome, either by the change of a gene or rearrangement of
a part of the chromosome.
• Nucleotide Bases - Different types of molecules that combine with each other to form DNA
and RNA.
– Adenine - Nucleotide base. Links to Thymine in DNA or Uracil in RNA.
– Cytosine - Nucleotide base. Links to Guanine.
– Guanine - Nucleotide base. Links to Cytosine.
– Thymine - Nucleotide base specific to DNA. Links to Adenine.
– Uracil - Nucleotide base specific to RNA. Links to Adenine.
• Phenotype - Expression of a specific trait. The manifestation of a certain gene or interaction
of various genes.
• Recessive Gene - An allele that does not manifest in the phenotype except in homozygous cases. May be recessive to one allele but dominant to another.
• Ribosome - Molecular machine used in protein synthesis from encoded RNA.
• RNA - Ribonucleic Acid. A molecule transcribed from DNA; messenger RNA is translated by the ribosome to express genes.
• SNP - Single nucleotide polymorphism. A variation at a single nucleotide position in the genome that occurs among individuals of the same species.
A.2 Data mining terms
• Association Rules - Relations between variables that are relevant in a significant number of
instances.
• Bayesian Networks - A probabilistic graphical model that represents the conditional dependencies between a set of random variables.
• Classification - A type of Data Mining prediction problem, where the predicted value is nominal. In specific cases, the predicted variable is binary.
• Clustering - Tries to find similarities between instances and joins them together in groups,
or clusters.
• Data Mining - A broad field of computer science concerned with identifying patterns in large data sets.
• Data Set - A collection of data composed of Attributes (columns) and Instances (rows).
Attributes are different variables of the recorded data and each Instance is a new member of
the data set.
• Machine Learning - An area of Artificial Intelligence in which algorithms and systems are developed to learn from data. It can be used to solve problems such as Clustering, Classification, Association Rules, Regression, etc.
• Model - Data Mining models are created by using a specific algorithm on a specific data
set. The result is a model specifically designed to predict or find relations in data, based on
the learned patterns of the data set used.
• Overfitting - When a model fits the training data so closely that it captures patterns that do not generalize to the full dataset or to future data.
• Pre-processing - The adaptation of data to fit certain criteria, either by transforming the data type or by reducing the dimensionality in attributes or instances.

– Filter methods - Pre-processing methods that select subsets of variables independently of the model creation algorithm.

– Wrapper methods - Methods that score subsets of variables using the predictive performance of a specific learning algorithm.
– Embedded Methods - Feature selection that occurs during the training of a given
model.
• Pruning - Cutting a connection in tree-based methods because the branch is not relevant to the final result. This increases the efficiency of the algorithm but can also wrongfully remove significant branches.
• Regression - A type of Data Mining prediction problem, where the predicted value is continuous.
• Supervised method - A method that learns from data in which the true value of the class variable is provided for each instance.
A.3 Lab Notes
Laboratory Note
Genetic Epistasis I - Materials and Methods
LN-1-2014
Ricardo Pinho and Rui Camacho
FEUP
Rua Dr. Roberto Frias, s/n, 4200-465 Porto, Portugal
Fax: (+351) 22 508 1440
e-mail: [email protected], [email protected]
www: http://www.fe.up.pt/∼ei09045, http://www.fe.up.pt/∼rcamacho
May 2014
Abstract
Based on literature results, we have selected 7 epistatic detection methods. The selected methods were empirically evaluated and compared using data generated with genomeSimla to simulate genome wide studies at a smaller scale. The simulated data includes 270 different configurations of datasets to cover a wide array of disease models. The selected algorithms are BEAM 3.0, BOOST, MBMDR, Screen and Clean, SNPRuler, SNPHarvester, and TEAM. These algorithms are evaluated according to their Power, scalability, and Type I Error Rate.
1 Introduction
The search for genetic predisposition to diseases has been going on for a long time. However, most early studies focused only on single-SNP analyses to determine disease predisposition. This is not the case in most complex diseases: generally, a complex disease involves thousands or millions of SNPs interacting with each other on a large scale. Due to the complexity of these interactions, the computational cost of epistasis detection was infeasible until recently.

The main objective of the following experiments is to empirically evaluate the following algorithms: BEAM 3.0 [Zha12], BOOST [WYY+10a], MBMDR [MVV11], Screen and Clean [WDR+10], SNPRuler [WYY+10b], SNPHarvester [YHW+09], and TEAM [ZHZW10]. These algorithms will be evaluated according to their Power, scalability, and Type I Error Rate. Each algorithm will be executed on many data sets that simulate diseases with many different parameters. These data sets are generated with genomeSimla, an open source data generator with many useful parameters for realistically simulating complex diseases.

The rest of this lab note is structured as follows. Section 2 gives a brief description of the data sets used in the experiments, including the application used to generate them. Section 3 describes the evaluation measures used in these experiments. Section 4 presents the experimental methodology. Section 5 summarizes the experiments that will be detailed in the next lab notes.
2 The Data sets for the Experiments
The data sets were created specifically for these experiments. The program used to generate them was genomeSimla [EBT+08]. In total, 270 different configurations were generated. Each configuration consists of 100 data sets, which means that each algorithm was executed 27,000 times.
Data Generation Application
The data generation application used for these experiments was genomeSimla. Due to its ability to evolve a population and achieve the desired allele frequencies, with any number of SNPs distributed over as many chromosomes as desired, genomeSimla is an adequate application for this kind of experiment. The evolution of the population can follow linear, exponential, or logistic growth, the last being the preferred model.

Aside from generating and evolving a population for as many iterations as required, genomeSimla allows the allele frequencies of the population to be observed and, based on those frequencies, the disease SNPs to be allocated, choosing a priori how many chromosomes and blocks of SNPs per chromosome each individual has. After the generation of the population according to the selected parameters, genomeSimla can then be used to generate data sets, sampling as many individuals as necessary from the population pool. The disease model can be further customized with the desired odds ratio, prevalence of the disease, and type of disease model. Based on these values, a penetrance table is generated for each desired parameter combination.
• Allele Frequency - The frequency of the minor allele of the disease SNPs.
• Population - Number of individuals sampled in the data set.
• Disease Model - Type of disease model: main effect, epistatic interaction, or full effect.
• Odds ratio - The strength of the relation between the disease SNPs: the odds of one disease SNP being present given the presence of the other disease SNP.
• Prevalence - The proportion of a population with the disease. Affects the number of cases and controls in a data set.
With this data, data sets can be generated using a configuration file, embedding the disease model into the desired alleles.
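The odds ratio parameter above can be illustrated with a standard 2×2 contingency-table computation. This is a generic sketch of how an odds ratio is obtained from case/control counts, not genomeSimla's internal parameterization, and the counts are invented for illustration.

```python
def odds_ratio(cases_with, cases_without, controls_with, controls_without):
    """Odds ratio: odds of carrying the risk genotype in cases
    divided by the same odds in controls."""
    return (cases_with / cases_without) / (controls_with / controls_without)

# Toy counts: 60 of 100 cases carry the risk genotype vs. 40 of 100 controls.
print(odds_ratio(60, 40, 40, 60))  # (60/40) / (40/60) = 2.25
```

An odds ratio of 1.0 means the genotype carries no disease risk, which is why the experiments use values of 1.1, 1.5, and 2.0 to represent increasingly strong effects.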
Data Set
The data sets were created using many different parameter combinations, to maximize the diversity of disease models and assess which algorithms are best for which scenarios. The data consist of simulated genotypes and phenotypes. For each individual, the attributes are the genotypes associated with each SNP, taking one of 3 states: homozygous dominant, heterozygous, and homozygous recessive. The label is binary, corresponding to an affected or unaffected individual.

In each data set, a total of 2 pairs of chromosomes were generated. The first chromosome contains 20 blocks of 10 SNPs and the second contains 10 blocks of 10 SNPs, for 300 SNPs in total. There are two disease alleles, placed on different chromosomes according to the desired allele frequency. The generated data sets contain 3 different numbers of individuals: 500, 1000, and 2000. The disease alleles take 5 different minor allele frequencies: 0.01, 0.05, 0.1, 0.3, and 0.5. Three different disease models are used: data sets with marginal effects and no epistatic relations; without marginal effects and with epistatic relations; and with both marginal effects and epistatic relations. The odds ratio associated with the two disease-related alleles is 1.1, 1.5, or 2.0. The prevalence of the disease is also configured to either 0.0001 or 0.02, which also influences the number of cases and controls.
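The genotype coding just described (three states per SNP, one row per individual) can be mimicked with a small simulator. This is an illustrative stand-in written for this note, not genomeSimla: it samples genotypes under Hardy–Weinberg proportions, which is an assumption, and it embeds no disease model.

```python
import random

def simulate_genotypes(n_individuals, mafs, seed=0):
    """Sample genotypes coded 0 (homozygous major), 1 (heterozygous),
    2 (homozygous minor) under Hardy-Weinberg proportions; one column per SNP."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_individuals):
        # Two independent allele draws per SNP; count the minor alleles.
        data.append([sum(rng.random() < maf for _ in range(2)) for maf in mafs])
    return data

# 500 individuals, 300 SNPs, minor allele frequency 0.3 (one of the MAFs used above).
genotypes = simulate_genotypes(500, [0.3] * 300)
```

A real genomeSimla run additionally evolves the population and applies a penetrance table to assign the binary case/control label.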
3 Evaluation Measures
The evaluation measures used for these experiments are Power, scalability, and Type I Error Rate.

Power is estimated, for each configuration, as the proportion of the 100 data sets in which the ground-truth interaction is correctly identified as the most significant SNP pair, where SNPs are ranked according to their importance to the phenotype using statistical hypothesis tests (the χ2 test).

Scalability is evaluated as the average execution time per data set within each configuration.

The Type I Error Rate is the proportion of the 100 data sets in each configuration in which non-disease-related SNPs are classified as a statistically relevant SNP pair according to the χ2 test.
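The χ2-based counting above can be sketched as follows. This is a minimal stand-alone illustration, assuming one 2×3 case/control-by-genotype contingency table per data set and the χ2 critical value for 2 degrees of freedom at α = 0.05; the counts in the example are invented.

```python
def chi2_statistic(table):
    """Pearson chi-square statistic for an r x c contingency table."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = rows[i] * cols[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat

CRIT_DF2_005 = 5.991  # chi-square critical value, 2 degrees of freedom, alpha = 0.05

def power(per_dataset_tables):
    """Fraction of data sets whose ground-truth table is significant at alpha = 0.05."""
    hits = sum(chi2_statistic(t) > CRIT_DF2_005 for t in per_dataset_tables)
    return hits / len(per_dataset_tables)

# Genotype (0/1/2) counts for cases (first row) and controls (second row).
table = [[30, 50, 20],
         [60, 30, 10]]
print(round(chi2_statistic(table), 2))  # → 18.33, well above 5.991
```

The Type I Error Rate is computed the same way, but counting significant results among SNP pairs that are *not* disease related.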
4 Experimental Methodology
Initially, the population for the data sets is generated using genomeSimla. The population is generated using a logistic growth rate, with an initial population of 10,000 and a maximum capacity of 1,000,000. The population chosen for the data sets is picked from reported generations, based on the allele frequencies desired for the experiment. Generation 1750 was selected for this purpose. Two SNPs are selected for each configuration, according to their minor allele frequency (MAF), as follows:
• MAF 0.01 - SNP112 and SNP267
• MAF 0.05 - SNP4 and SNP239
• MAF 0.1 - SNP135 and SNP230
• MAF 0.3 - SNP197 and SNP266
• MAF 0.5 - SNP80 and SNP229
The first 200 SNPs belong to chromosome 1, whereas the last 100 correspond to chromosome 2. The tables with all the allele frequencies can be seen in the annexes: Table 1 contains the chromosome 1 allele frequencies and Table 2 the chromosome 2 allele frequencies.

The penetrance tables are created from the allele frequencies in the population, following the configurations discussed earlier. The data sets are then created, using each unique configuration file to create 100 data sets, covering all the configurations mentioned before.

With the data sets generated, the algorithms are first tested on the most extreme configurations (minimum and maximum MAF) to check that the results are valid. Upon asserting the validity of the experiment, all algorithms are then executed on all configurations to analyze the potential of each algorithm. For each algorithm on each data set, a file containing the SNPs ranked according to statistical relevance is generated, together with information about the time and memory used in the execution of each test. The Power and Type I Error Rates are taken from the results that present a statistical relevance of α < 0.05.

The computer used for these experiments ran the 64-bit Ubuntu 13.10 operating system, with an Intel(R) Core(TM)2 Quad CPU Q6600 2.40GHz processor and 8 GB of RAM.
A Loci Frequencies
Chromosome 1
Table 1: Allele frequencies of the generated population for chromosome 1.
Label Freq Al1 Freq Al2 Map Dist. Position
RL0-1 0.704448 0.295552 0.0002523144 253
RL0-2 0.467747 0.532253 3.65488E-006 256
RL0-3 0.856627 0.143373 3.86582E-006 259
RL0-4 0.94761 0.05239 1.18175E-006 260
RL0-5 0.747191 0.252809 1.23056E-006 261
RL0-6 0.868644 0.131356 5.41858E-006 266
RL0-7 0.869881 0.130119 9.49181E-006 275
RL0-8 0.634084 0.365916 1.72337E-006 276
RL0-9 0.616899 0.383101 4.81936E-006 280
RL0-10 0.603205 0.396795 8.88582E-006 288
RL0-11 0.951322 0.048678 0.000118908 406
RL0-12 0.928004 0.071996 9.53558E-006 415
RL0-13 0.7257 0.2743 3.20447E-006 418
RL0-14 0.547945 0.452055 3.96875E-006 421
RL0-15 0.735312 0.264688 5.03938E-006 426
RL0-16 0.983344 0.016655 4.10188E-006 430
RL0-17 0.809402 0.190598 7.11582E-006 437
RL0-18 0.908173 0.091827 2.25726E-006 439
RL0-19 0.628892 0.371108 2.13406E-006 441
RL0-20 0.824863 0.175137 9.99491E-006 450
RL0-21 0.640543 0.359457 0.000229233 679
RL0-22 0.542639 0.457361 5.61457E-006 684
RL0-23 0.776321 0.223679 5.05623E-006 689
RL0-24 0.925422 0.074578 4.39722E-006 693
RL0-25 0.596454 0.403546 9.48707E-006 702
RL0-26 0.80071 0.19929 7.38516E-006 709
RL0-27 0.712163 0.287837 3.95139E-006 712
RL0-28 0.91426 0.08574 5.07943E-006 717
RL0-29 0.902589 0.097411 0.000006668 723
RL0-30 0.933652 0.066348 2.4885E-006 725
RL0-31 0.486126 0.513874 0.000296081 1021
RL0-32 0.553701 0.446299 8.33422E-006 1029
RL0-33 0.887238 0.112762 4.95048E-006 1033
RL0-34 0.93165 0.06835 5.32692E-006 1038
RL0-35 0.887583 0.112417 2.23131E-006 1040
RL0-36 0.824546 0.175454 5.40611E-006 1045
RL0-37 1 0 7.03837E-006 1052
RL0-38 0.817039 0.182961 1.44855E-006 1053
RL0-39 0.762831 0.237169 9.89044E-006 1062
RL0-40 0.623942 0.376058 2.53856E-006 1064
RL0-41 0.886716 0.113284 0.0003574 1421
RL0-42 0.603873 0.396127 5.73344E-006 1426
RL0-43 0.708144 0.291856 7.18489E-006 1433
RL0-44 0.722182 0.277818 6.17693E-006 1439
RL0-45 0.59756 0.40244 6.57155E-006 1445
RL0-46 0.810217 0.189783 3.25347E-006 1448
RL0-47 0.679944 0.320056 8.28564E-006 1456
RL0-48 0.467092 0.532908 4.45383E-006 1460
RL0-49 0.518637 0.481363 2.97358E-006 1462
RL0-50 0.918397 0.081603 4.58774E-006 1466
RL0-51 0.979136 0.020864 0.0003772277 1843
RL0-52 0.571337 0.428663 3.32175E-006 1846
RL0-53 0.615734 0.384266 2.23233E-006 1848
RL0-54 0.695586 0.304414 2.98606E-006 1850
RL0-55 0.660442 0.339558 4.02315E-006 1854
RL0-56 0.910148 0.089852 5.9643E-006 1859
RL0-57 0.445087 0.554913 6.82648E-006 1865
RL0-58 0.470733 0.529267 9.5693E-006 1874
RL0-59 0.858588 0.141412 0.000004337 1878
RL0-60 0.681468 0.318532 2.15272E-006 1880
RL0-61 0.870466 0.129534 0.0003573156 2237
RL0-62 0.646194 0.353806 7.20147E-006 2244
RL0-63 0.763207 0.236793 3.17006E-006 2247
RL0-64 0.931087 0.068913 8.01624E-006 2255
RL0-65 0.7151 0.2849 6.35415E-006 2261
RL0-66 0.670911 0.329089 2.38872E-006 2263
RL0-67 0.888122 0.111878 2.52589E-006 2265
RL0-68 0.694165 0.305835 0.000008364 2273
RL0-69 0.864311 0.135689 7.35972E-006 2280
RL0-70 0.838895 0.161105 2.46709E-006 2282
RL0-71 0.823928 0.176073 0.0001992617 2481
RL0-72 0.583947 0.416053 6.33832E-006 2487
RL0-73 0.841979 0.158021 9.79685E-006 2496
RL0-74 0.6003 0.3997 7.07911E-006 2503
RL0-75 0.892639 0.107361 5.16523E-006 2508
RL0-76 0.761561 0.238439 2.85138E-006 2510
RL0-77 0.900447 0.099553 1.53824E-006 2511
RL0-78 0.599257 0.400743 3.89272E-006 2514
RL0-79 0.972086 0.027914 6.53018E-006 2520
RL0-80 0.560663 0.439337 8.62124E-006 2528
RL0-81 0.554206 0.445794 0.000199997 2727
RL0-82 0.93403 0.06597 8.61757E-006 2735
RL0-83 0.542574 0.457426 9.10087E-006 2744
RL0-84 0.837702 0.162298 1.23079E-006 2745
RL0-85 0.909783 0.090217 6.84162E-006 2751
RL0-86 0.91318 0.08682 4.48263E-006 2755
RL0-87 0.725569 0.274431 0.000001848 2756
RL0-88 0.90355 0.09645 2.79894E-006 2758
RL0-89 0.716186 0.283814 4.00443E-006 2762
RL0-90 0.612835 0.387165 6.94976E-006 2768
RL0-91 0.582162 0.417838 0.0003616833 3129
RL0-92 0.83582 0.16418 0.000009529 3138
RL0-93 0.558802 0.441198 9.02466E-006 3147
RL0-94 0.86217 0.13783 5.29547E-006 3152
RL0-95 0.617906 0.382094 7.09319E-006 3159
RL0-96 0.801595 0.198405 6.73657E-006 3165
RL0-97 0.676978 0.323022 6.97316E-006 3171
RL0-98 0.738348 0.261652 7.87644E-006 3178
RL0-99 0.591386 0.408614 3.67391E-006 3181
RL0-100 0.521751 0.478249 4.20054E-006 3185
RL0-101 0.508844 0.491156 9.09917E-005 3275
RL0-102 0.565387 0.434613 9.41043E-006 3284
RL0-103 0.479309 0.520691 7.40872E-006 3291
RL0-104 0.745518 0.254482 3.35237E-006 3294
RL0-105 0.532452 0.467548 4.28727E-006 3298
RL0-106 0.935416 0.064584 9.89425E-006 3307
RL0-107 0.662617 0.337383 8.74864E-006 3315
RL0-108 0.658306 0.341694 2.01241E-006 3317
RL0-109 0.712991 0.287009 5.8733E-006 3322
RL0-110 0.665501 0.334499 6.69027E-006 3328
RL0-111 0.568289 0.431711 8.718047E-005 3415
RL0-112 0.98671 0.01329 8.66949E-006 3423
RL0-113 0.79789 0.20211 5.05033E-006 3428
RL0-114 0.553154 0.446846 9.60618E-006 3437
RL0-115 0.667399 0.332601 6.92172E-006 3443
RL0-116 0.700185 0.299815 9.52134E-006 3452
RL0-117 0.610748 0.389252 5.60877E-006 3457
RL0-118 0.661102 0.338898 6.63784E-006 3463
RL0-119 0.820744 0.179256 3.09427E-006 3466
RL0-120 0.912926 0.087073 4.1968E-006 3470
RL0-121 0.68335 0.31665 0.0003871028 3857
RL0-122 0.707937 0.292063 5.00312E-006 3862
RL0-123 0.589477 0.410523 2.13525E-006 3864
RL0-124 0.745493 0.254507 9.8212E-006 3873
RL0-125 0.698088 0.301912 7.02674E-006 3880
RL0-126 0.424467 0.575533 5.18827E-006 3885
RL0-127 0.787719 0.212281 4.74483E-006 3889
RL0-128 0.860644 0.139356 5.22368E-006 3894
RL0-129 0.638396 0.361604 3.96526E-006 3897
RL0-130 0.731953 0.268047 8.71207E-006 3905
RL0-131 0.744233 0.255766 0.0002181738 4123
RL0-132 1 0 1.69539E-006 4124
RL0-133 0.771704 0.228296 9.71469E-006 4133
RL0-134 0.878927 0.121073 0.000002233 4135
RL0-135 0.90145 0.09855 4.28905E-006 4139
RL0-136 0.648369 0.351631 0.00000754 4146
RL0-137 0.80335 0.19665 8.70869E-006 4154
RL0-138 0.856866 0.143134 9.44719E-006 4163
RL0-139 0.615518 0.384482 3.60345E-006 4166
RL0-140 0.788087 0.211913 0.000002436 4168
RL0-141 0.678961 0.321039 0.0002748812 4442
RL0-142 0.771435 0.228565 5.86447E-006 4447
RL0-143 0.503258 0.496742 3.67578E-006 4450
RL0-144 0.795211 0.204789 2.75252E-006 4452
RL0-145 0.490144 0.509856 4.10642E-006 4456
RL0-146 0.488492 0.511508 4.30833E-006 4460
RL0-147 0.667302 0.332698 7.3961E-006 4467
RL0-148 0.643159 0.356841 2.3613E-006 4469
RL0-149 0.673992 0.326008 9.5407E-006 4478
RL0-150 0.788535 0.211465 5.39342E-006 4483
RL0-151 0.781059 0.218941 0.0002359844 4718
RL0-152 0.502629 0.497371 5.62238E-006 4723
RL0-153 0.466542 0.533458 2.22743E-006 4725
RL0-154 0.538982 0.461018 3.21068E-006 4728
RL0-155 0.841056 0.158944 2.43989E-006 4730
RL0-156 0.462765 0.537235 7.40954E-006 4737
RL0-157 0.90605 0.09395 3.96506E-006 4740
RL0-158 0.681072 0.318928 2.10963E-006 4742
RL0-159 0.596135 0.403865 6.71541E-006 4748
RL0-160 0.855496 0.144504 0.00000768 4755
RL0-161 0.727272 0.272728 0.0002969833 5051
RL0-162 0.774272 0.225728 2.62789E-006 5053
RL0-163 0.791941 0.208059 6.76876E-006 5059
RL0-164 0.644252 0.355748 0.000005599 5064
RL0-165 0.549582 0.450418 8.32549E-006 5072
RL0-166 0.428749 0.571251 8.10471E-006 5080
RL0-167 0.376485 0.623515 9.96927E-006 5089
RL0-168 0.535948 0.464052 9.47661E-006 5098
RL0-169 0.514295 0.485705 3.16517E-006 5101
RL0-170 0.700045 0.299955 5.98168E-006 5106
RL0-171 0.571955 0.428045 0.0003862553 5492
RL0-172 0.586523 0.413477 2.88618E-006 5494
RL0-173 0.783275 0.216725 7.29982E-006 5501
RL0-174 0.610016 0.389985 9.43182E-006 5510
RL0-175 0.866664 0.133336 7.05865E-006 5517
RL0-176 0.75876 0.24124 7.56181E-006 5524
RL0-177 0.600093 0.399907 1.005344E-005 5534
RL0-178 0.577467 0.422533 2.42474E-006 5536
RL0-179 0.789476 0.210524 6.1728E-006 5542
RL0-180 0.590153 0.409847 5.99256E-006 5547
RL0-181 0.422633 0.577367 9.624393E-005 5643
RL0-182 0.526449 0.473551 1.007159E-005 5653
RL0-183 0.83354 0.16646 3.23814E-006 5656
RL0-184 0.737217 0.262783 8.58028E-006 5664
RL0-185 0.650092 0.349908 9.27841E-006 5673
RL0-186 0.56464 0.43536 5.87977E-006 5678
RL0-187 0.717536 0.282464 3.16557E-006 5681
RL0-188 0.961919 0.038081 2.93894E-006 5683
RL0-189 0.84241 0.15759 8.25314E-006 5691
RL0-190 0.817398 0.182602 4.0069E-006 5695
RL0-191 1 0 0.0002386956 5933
RL0-192 1 0 4.98276E-006 5937
RL0-193 0.709334 0.290666 0.000002811 5939
RL0-194 0.78411 0.21589 0.000008052 5947
RL0-195 0.932612 0.067388 2.89373E-006 5949
RL0-196 0.865947 0.134053 8.6839E-006 5957
RL0-197 0.725338 0.274662 5.21764E-006 5962
RL0-198 0.795964 0.204036 7.8731E-006 5969
RL0-199 0.583016 0.416984 4.61094E-006 5973
RL0-200 0.803726 0.196274 8.37366E-006 5981
Chromosome 2
Table 2: Allele frequencies of the generated population for chromosome 2.
Label Freq Al1 Freq Al2 Map Dist. Position
RL1-201 0.893976 0.106024 0.0003986369 399
RL1-202 0.584141 0.415859 2.05934E-006 401
RL1-203 0.422083 0.577917 0.000005955 406
RL1-204 0.73351 0.26649 5.58855E-006 411
RL1-205 0.694034 0.305966 4.1723E-006 415
RL1-206 0.765355 0.234645 2.06415E-006 417
RL1-207 0.965014 0.034986 7.44318E-006 424
RL1-208 0.668517 0.331483 9.60649E-006 433
RL1-209 0.634885 0.365115 8.56251E-006 441
RL1-210 0.725027 0.274973 6.14954E-006 447
RL1-211 0.698398 0.301602 7.386583E-005 520
RL1-212 0.595985 0.404015 9.7547E-006 529
RL1-213 0.710597 0.289403 1.58667E-006 530
RL1-214 0.663247 0.336753 4.37889E-006 534
RL1-215 0.75663 0.24337 7.38782E-006 541
RL1-216 0.936743 0.063257 8.35938E-006 549
RL1-217 0.663784 0.336216 1.64064E-006 550
RL1-218 0.680104 0.319896 9.16445E-006 559
RL1-219 0.688756 0.311244 0.000007628 566
RL1-220 0.9333 0.0667 7.01934E-006 573
RL1-221 0.742415 0.257585 0.0003420352 915
RL1-222 0.799322 0.200678 4.01391E-006 919
RL1-223 0.709122 0.290879 0.000002737 921
RL1-224 0.565597 0.434403 6.28353E-006 927
RL1-225 0.863029 0.136971 9.64911E-006 936
RL1-226 0.752561 0.247439 6.74076E-006 942
RL1-227 0.676998 0.323002 1.004539E-005 952
RL1-228 0.840474 0.159526 1.71067E-006 953
RL1-229 0.49346 0.50654 0.000001589 954
RL1-230 0.910095 0.089905 7.41687E-006 961
RL1-231 0.960868 0.039132 0.0002261121 1187
RL1-232 0.933743 0.066257 1.91042E-006 1188
RL1-233 0.760953 0.239047 4.80473E-006 1192
RL1-234 0.748072 0.251928 7.04549E-006 1199
RL1-235 0.663473 0.336527 8.21959E-006 1207
RL1-236 0.964783 0.035217 2.82873E-006 1209
RL1-237 0.905525 0.094475 0.000007663 1216
RL1-238 0.691349 0.308651 5.04876E-006 1221
RL1-239 0.951645 0.048355 1.59639E-006 1222
RL1-240 0.989216 0.010784 8.73616E-006 1230
RL1-241 0.738781 0.261219 0.0003243203 1554
RL1-242 0.795527 0.204473 8.89964E-006 1562
RL1-243 0.795563 0.204437 2.02264E-006 1564
RL1-244 0.703822 0.296178 3.36477E-006 1567
RL1-245 0.57285 0.42715 9.19778E-006 1576
RL1-246 0.767369 0.232631 7.29139E-006 1583
RL1-247 0.645825 0.354175 2.43094E-006 1585
RL1-248 0.802402 0.197598 1.73925E-006 1586
RL1-249 0.944397 0.055603 9.46653E-006 1595
RL1-250 0.622399 0.377601 8.82309E-006 1603
RL1-251 0.630848 0.369152 0.0002217582 1824
RL1-252 0.818129 0.181871 1.91494E-006 1825
RL1-253 0.484804 0.515196 1.6334E-006 1826
RL1-254 0.676497 0.323503 1.59652E-006 1827
RL1-255 0.880815 0.119185 3.35782E-006 1830
RL1-256 0.959511 0.040489 2.75846E-006 1832
RL1-257 0.784072 0.215928 3.03069E-006 1835
RL1-258 0.52286 0.47714 6.06819E-006 1841
RL1-259 0.623466 0.376534 6.91131E-006 1847
RL1-260 0.874709 0.125291 7.25071E-006 1854
RL1-261 0.803013 0.196987 0.000331411 2185
RL1-262 0.545178 0.454822 4.47452E-006 2189
RL1-263 0.815965 0.184035 4.89193E-006 2193
RL1-264 0.818366 0.181634 1.90565E-006 2194
RL1-265 0.724692 0.275308 1.45521E-006 2195
RL1-266 0.68352 0.31648 1.001287E-005 2205
RL1-267 0.989999 0.010001 3.48414E-006 2208
RL1-268 0.985774 0.014226 9.2895E-006 2217
RL1-269 0.642113 0.357887 4.82072E-006 2221
RL1-270 0.464929 0.535071 0.000002507 2223
RL1-271 0.734131 0.265869 0.0003870134 2610
RL1-272 0.632632 0.367368 8.94209E-006 2618
RL1-273 0.553081 0.446919 0.000004175 2622
RL1-274 0.764977 0.235023 5.60863E-006 2627
RL1-275 0.464551 0.535449 9.08894E-006 2636
RL1-276 0.851137 0.148863 1.002911E-005 2646
RL1-277 0.739427 0.260573 9.30477E-006 2655
RL1-278 0.555538 0.444462 6.07683E-006 2661
RL1-279 0.551021 0.448979 1.71434E-006 2662
RL1-280 0.593129 0.406871 6.07637E-006 2668
RL1-281 0.79749 0.20251 0.0002724436 2940
RL1-282 0.848332 0.151668 7.00277E-006 2947
RL1-283 0.812696 0.187304 5.19928E-006 2952
RL1-284 0.715573 0.284426 6.3489E-006 2958
RL1-285 0.578981 0.421019 3.26024E-006 2961
RL1-286 0.786632 0.213368 0.000008282 2969
RL1-287 0.64689 0.35311 8.91268E-006 2977
RL1-288 0.600677 0.399323 2.59076E-006 2979
RL1-289 0.552264 0.447736 7.46941E-006 2986
RL1-290 0.836774 0.163226 7.75812E-006 2993
RL1-291 0.910408 0.089592 6.30604E-005 3056
RL1-292 0.705616 0.294384 9.07012E-006 3065
RL1-293 0.833055 0.166945 7.11207E-006 3072
RL1-294 0.55822 0.44178 9.56152E-006 3081
RL1-295 0.684736 0.315264 2.78967E-006 3083
RL1-296 0.973315 0.026685 9.43315E-006 3092
RL1-297 0.676965 0.323035 5.50475E-006 3097
RL1-298 0.698511 0.301489 4.26716E-006 3101
RL1-299 0.514109 0.485891 9.42184E-006 3110
RL1-300 0.895842 0.104158 9.44663E-006 3119
References
[EBT+08] Todd L Edwards, William S Bush, Stephen D Turner, Scott M Dudek, Eric S Torstenson, Mike Schmidt, Eden Martin, and Marylyn D Ritchie. Generating Linkage Disequilibrium Patterns in Data Simulations using genomeSIMLA. Lecture Notes in Computer Science, 4973:24–35, 2008.

[MVV11] Jestinah M Mahachie John, Francois Van Lishout, and Kristel Van Steen. Model-Based Multifactor Dimensionality Reduction to detect epistasis for quantitative traits in the presence of error-free and noisy data. European Journal of Human Genetics, 19(6):696–703, June 2011.

[WDR+10] Jing Wu, Bernie Devlin, Steven Ringquist, Massimo Trucco, and Kathryn Roeder. Screen and clean: a tool for identifying interactions in genome-wide association studies. Genetic Epidemiology, 34:275–285, 2010.

[WYY+10a] Xiang Wan, Can Yang, Qiang Yang, Hong Xue, Xiaodan Fan, Nelson L S Tang, and Weichuan Yu. BOOST: A fast approach to detecting gene-gene interactions in genome-wide case-control studies. American Journal of Human Genetics, 87:325–340, 2010.

[WYY+10b] Xiang Wan, Can Yang, Qiang Yang, Hong Xue, Nelson L S Tang, and Weichuan Yu. Predictive rule inference for epistatic interaction detection in genome-wide association studies. Bioinformatics (Oxford, England), 26:30–37, 2010.

[YHW+09] Can Yang, Zengyou He, Xiang Wan, Qiang Yang, Hong Xue, and Weichuan Yu. SNPHarvester: a filtering-based approach for detecting epistatic interactions in genome-wide association studies. Bioinformatics (Oxford, England), 25:504–511, 2009.

[Zha12] Yu Zhang. A novel Bayesian graphical model for genome-wide multi-SNP association mapping. Genetic Epidemiology, 36:36–47, 2012.

[ZHZW10] Xiang Zhang, Shunping Huang, Fei Zou, and Wei Wang. TEAM: efficient two-locus epistasis tests in human genome-wide association study. Bioinformatics (Oxford, England), 26:i217–i227, 2010.
Laboratory Note
Genetic Epistasis II - Assessing Algorithm BEAM 3.0
LN-2-2014
Ricardo Pinho and Rui Camacho
FEUP
Rua Dr. Roberto Frias, s/n, 4200-465 Porto, Portugal
Fax: (+351) 22 508 1440
e-mail: [email protected], [email protected]
www: http://www.fe.up.pt/∼ei09045, http://www.fe.up.pt/∼rcamacho
May 2014
Abstract
In this lab note, the algorithm BEAM 3.0 is presented and tested for main effect detection. This is a Bayesian algorithm that creates a graph of SNPs and the relations between them and the disease expression. The results obtained reveal high detection rates for data sets with higher allele frequencies. Power also grows with population size; however, larger populations increase the Type I Error Rate as well, so Power values are nearly equal to the error rates. The algorithm scales well on the data sets used, and may scale to large genome wide association studies.
1 Introduction
The Bayesian Epistasis Association Mapping (BEAM) algorithm [ZL07] is a stochastic algorithm that uses Markov chain Monte Carlo (MCMC) [ADH10] to compute the posterior probability that each marker is associated with the disease phenotype.
Instead of the standard epistatic detection using χ2 statistic, BEAM usesa new B statistic. The B statistic is defined by:
BM = lnPA(DM , UM)
P0(DM , UM)= ln
Pjoin(DM)[Pind(UM) + Pjoin(UM)]
Pind(DM , UM) + Pjoin(DM , UM)(1)
where M represents a set of k markers, capturing interactions of different complexities. D_M and U_M are the genotype data at the markers in M for cases and controls, and P_0(D_M, U_M) and P_A(D_M, U_M) are the Bayes factors of the null and association models. P_ind is the distribution that assumes independence among the markers in M, and P_join is a saturated joint distribution over the genotype combinations of all markers in M.

BEAM3 [Zha12] introduces multi-SNP associations and flexible high-order interactions using graphs, reducing the complexity and increasing the Power, and produces cleaner results with improved mapping sensitivity and specificity. Initially, the disease graph is built based on the probability that a given genotype configuration is related to the phenotype, considering the frequencies of that genotype in controls and cases. Cliques (non-overlapping groups of SNPs) are then generated based on the disease-related SNPs. A joint probability model and MCMC are used to update the disease graph and create undirected edges between dependent SNPs.
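The sign of B_M in Eq. (1) can be read off directly once the component likelihoods are available. The sketch below simply evaluates the log ratio for invented probability values; BEAM computes these quantities internally, so the function and its inputs are purely illustrative.

```python
import math

def beam_b_statistic(p_join_d, p_ind_u, p_join_u, p_ind_du, p_join_du):
    """Evaluate Eq. (1): the log ratio of the association model's evidence
    to the null model's for a marker set M. All arguments are hypothetical
    likelihood values, not quantities produced by the BEAM software."""
    return math.log(p_join_d * (p_ind_u + p_join_u) / (p_ind_du + p_join_du))

# A positive B favours association of the marker set with the disease.
print(beam_b_statistic(1e-3, 1e-4, 1e-4, 1e-8, 1e-8) > 0)  # → True (B ≈ ln 10)
```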
1.1 Input files
The input file contains the phenotypes of all the individuals in the first rowand the genotypes of each SNP on the subsequent rows.
1.2 Output files
The algorithm outputs 3 files: posterior file; g.dot file; and chi.txt. Theposterior file contains the posterior probabilities of marginal and interaction
ID   Chr   Pos   0 1 0 0 1
rs1  chr1  1     1 0 2 0 1
rs2  chr1  2     1 2 1 1 0
rs3  chr1  3     1 2 2 0 1
Table 1: An example of the input file, containing the index of each SNP, the chromosome it belongs to, and its position. The first row holds the phenotypes, and each subsequent row holds the genotypes of one SNP for all individuals.
associations per SNP. The g.dot file contains the disease graph; viewing it requires graph visualization software such as GraphViz. The chi.txt file contains the χ² results together with the allele counts.
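The input layout shown in Table 1 can be read with a short parser like the sketch below (assuming whitespace-separated columns as in the example; the real files may differ in detail):

```python
def parse_beam_input(text):
    """Parse a BEAM-style input file: the first row holds the phenotypes
    (after the ID/Chr/Pos headers), subsequent rows hold one SNP each.

    Returns (phenotypes, snps) where snps maps SNP id -> (chrom, pos, genotypes).
    """
    lines = [l.split() for l in text.strip().splitlines() if l.strip()]
    phenotypes = [int(x) for x in lines[0][3:]]
    snps = {}
    for row in lines[1:]:
        snp_id, chrom, pos = row[0], row[1], int(row[2])
        genotypes = [int(g) for g in row[3:]]
        assert len(genotypes) == len(phenotypes), "column count mismatch"
        snps[snp_id] = (chrom, pos, genotypes)
    return phenotypes, snps

example = """\
ID Chr Pos 0 1 0 0 1
rs1 chr1 1 1 0 2 0 1
rs2 chr1 2 1 2 1 1 0
rs3 chr1 3 1 2 2 0 1
"""
phenotypes, snps = parse_beam_input(example)
```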
1.3 Parameters
There are some options available to the user:
• "-filter k": tells the program to filter out SNPs with too many missing genotypes.

• "-sample burnin mcmc": specifies the number of burn-in and sampling iterations of the MCMC. The default value is 100.

• "-prior p": specifies how likely each SNP is to be associated with the disease. By default, p = 5/L, where L is the number of SNPs.

• "-T t": specifies the temperature at which the MCMC starts running. With a high temperature, the program can jump out of local modes within few iterations; however, it can make the program very slow in the first iterations.
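The options above could be assembled into an argument list as in this sketch (the executable name `./BEAM` and the exact argument order are assumptions; check the tool's own documentation before use):

```python
def beam_command(input_file, burnin=None, mcmc=None,
                 filter_k=None, prior=None, temperature=None):
    """Assemble an argument list for a BEAM3 run from the documented options.

    The prior defaults to p = 5/L inside the tool (L = number of SNPs), so it
    is only passed explicitly when overridden; the experiments here use 2/L.
    """
    cmd = ["./BEAM", input_file]            # executable name is assumed
    if filter_k is not None:
        cmd += ["-filter", str(filter_k)]
    if burnin is not None and mcmc is not None:
        cmd += ["-sample", str(burnin), str(mcmc)]
    if prior is not None:
        cmd += ["-prior", str(prior)]
    if temperature is not None:
        cmd += ["-T", str(temperature)]
    return cmd

# Override the prior to 2/L for a data set with L = 1000 SNPs.
cmd = beam_command("data.txt", prior=2 / 1000)
```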
2 Experimental Settings
The datasets used in the experiments are characterized in Lab Note 1. The computer used for these experiments ran the 64-bit Ubuntu 13.10 operating system, with an Intel(R) Core(TM)2 Quad Q6600 CPU at 2.40 GHz and 8.00 GB of RAM. The parameters used in this experiment are the defaults, with the exception of "-prior p", which was set to p = 2/L.
[Bar chart: Power (%) vs. allele frequency, one series each for 500, 1000, and 2000 individuals.]
Figure 1: Power by allele frequency. For each frequency, three sizes of data sets were used to measure the Power, with odds ratio of 2.0 and prevalence of 0.02. The Power is measured by the number of data sets where the ground truth was amongst the most relevant results, out of all 100 data sets.
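The Power and Type I Error Rate measures used throughout these notes can be computed as in this sketch (the field layout of `results` is illustrative):

```python
def power_and_type1(results, truth, top_k=1):
    """Power: percentage of data sets whose top-ranked SNPs include the
    ground-truth SNP.  Type I Error Rate: percentage of data sets whose
    top-ranked SNPs include at least one SNP outside the ground truth.

    `results` is a list (one entry per data set) of (snp_id, score) pairs;
    `truth` is the set of causal SNP ids.
    """
    hits = errors = 0
    for ranked in results:
        top = {snp for snp, _ in sorted(ranked, key=lambda r: -r[1])[:top_k]}
        hits += bool(top & truth)       # ground truth among the top results
        errors += bool(top - truth)     # a false positive among the top results
    n = len(results)
    return 100.0 * hits / n, 100.0 * errors / n

# Two toy "data sets": rs1 is causal; the second ranks a false SNP first.
results = [[("rs1", 9.1), ("rs2", 1.0)], [("rs3", 7.0), ("rs1", 6.5)]]
power, type1 = power_and_type1(results, truth={"rs1"})
```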
3 Results
The results of the algorithm's epistasis detection consist of posterior probabilities, which are not comparable with χ² tests; therefore only main effect detection is considered in this experiment. Figure 1 shows near 0% Power for allele frequencies lower than 0.1, but Power increases greatly, reaching 100% for frequencies of 0.3 and 0.5. There is also a clear growth with population size, especially in data sets with 0.1 minor allele frequency.
The running time (a) of these experiments shows a steady increase, with a difference of nearly 3 seconds between data sets with 500 individuals and data sets with 2000 individuals. The increase in running time is not very significant, which may carry over to larger data sets. The same holds for memory usage (c), with only a 1.5 MB increase from 500 to 2000 individuals per data set. The CPU usage (b) increases by nearly 10% from 500 to 1000 individuals, lowering slightly for 2000 individuals.
The error rate results in Figure 3 contain high numbers of false positives. The Type I Error Rate is higher than the Power for smaller allele frequencies. At frequencies above 0.1 the Type I Error Rate is lower than the Power, but the difference between the two percentages decreases as the number of individuals increases. This means that for larger data sets it is more likely to find the ground truth, but the finding is also more likely to be accompanied by false positives.
[Plots: (a) average running time (seconds), (b) average CPU usage (%), and (c) average memory usage (Mbytes), each vs. number of individuals.]
Figure 2: Comparison of scalability measures between different sized data sets. The figure shows the average running time, CPU usage, and memory usage for each data set size. The data sets have a minor allele frequency of 0.5, an odds ratio of 2.0, and a prevalence of 0.02.
The distribution of Power by odds ratio reveals a large increase of Power as the odds ratio grows in Figure 5, similar to the growth of Power with population size in Figure 4. Data sets with low allele frequencies have near 0% Power. With 0.1 minor allele frequency there is a significant increase, reaching 92% Power, and 100% for higher allele frequencies in Figure 7. There is no clear difference in Power with prevalence changes in Figure 6.
[Bar chart: Type I Error Rate (%) vs. allele frequency, one series each for 500, 1000, and 2000 individuals.]
Figure 3: Type I Error Rate by allele frequency and population size, with odds ratio of 2.0 and prevalence of 0.02. The Type I Error Rate is measured by the number of data sets where false positives were amongst the most relevant results, out of all 100 data sets.
4 Summary
BEAM3 is the third iteration of a Bayesian algorithm that uses posterior probabilities to detect epistasis. BEAM3 generates a disease graph representing multi-SNP associations that have a high probability of being related to the disease phenotype expression; this graph is updated using MCMC. This version of BEAM also outputs χ² values for single SNPs, which are comparable with other algorithms; for this reason the results consist of main effect detection only. The Power obtained reveals similar values for Power and Type I Error Rate, both increasing with allele frequency and population size, although Type I errors are lower relative to Power in data sets with high allele frequency and low population size. The scalability of the algorithm is promising.
References
[ADH10] Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72:269–342, 2010.
[Zha12] Yu Zhang. A novel Bayesian graphical model for genome-wide multi-SNP association mapping. Genetic Epidemiology, 36:36–47, 2012.
[ZL07] Yu Zhang and Jun S Liu. Bayesian inference of epistatic interactions in case-control studies. Nature Genetics, 39:1167–1173, 2007.
A Bar Graphs
[Bar chart: Power (%) by population size (500, 1000, 2000 individuals).]
Figure 4: Distribution of the Power by population. The allele frequency is 0.1, the odds ratio is 2.0, and the prevalence is 0.02.
[Bar chart: Power (%) by odds ratio (1.1, 1.5, 2.0).]
Figure 5: Distribution of the Power by odds ratios. The allele frequency is 0.1, the number of individuals is 2000, and the prevalence is 0.02.
[Bar chart: Power (%) by prevalence (0.0001, 0.02).]
Figure 6: Distribution of the Power by prevalence. The allele frequency is 0.1, the number of individuals is 2000, and the odds ratio is 2.0.
[Bar chart: Power (%) by allele frequency (0.01–0.5).]
Figure 7: Distribution of the Power by allele frequency. The number of individuals is 2000, the odds ratio is 2.0, and the prevalence is 0.02.
B Table of Results
Table 2: A table containing the percentage of true positives and false positives in each configuration. The first column contains the description of the configuration. The second and third columns contain the number of data sets with true positives and false positives, respectively, out of all 100 data sets per configuration.
Configuration*            TP (%)  FP (%)
0.5,500,ME,2.0,0.02       100     99
0.5,500,ME,2.0,0.0001     100     95
0.5,500,ME,1.5,0.02       100     53
0.5,500,ME,1.5,0.0001     100     57
0.5,500,ME,1.1,0.02       80      20
0.5,500,ME,1.1,0.0001     79      22
0.5,2000,ME,2.0,0.02      100     100
0.5,2000,ME,2.0,0.0001    100     100
0.5,2000,ME,1.5,0.02      100     100
0.5,2000,ME,1.5,0.0001    100     100
0.5,2000,ME,1.1,0.02      100     100
0.5,2000,ME,1.1,0.0001    100     98
0.5,1000,ME,2.0,0.02      100     100
0.5,1000,ME,2.0,0.0001    100     100
0.5,1000,ME,1.5,0.02      100     100
0.5,1000,ME,1.5,0.0001    100     97
0.5,1000,ME,1.1,0.02      100     57
0.5,1000,ME,1.1,0.0001    100     60
0.3,500,ME,2.0,0.02       100     71
0.3,500,ME,2.0,0.0001     100     79
0.3,500,ME,1.5,0.02       88      24
0.3,500,ME,1.5,0.0001     89      30
0.3,500,ME,1.1,0.02       21      11
0.3,500,ME,1.1,0.0001     23      6
0.3,2000,ME,2.0,0.02      100     100
0.3,2000,ME,2.0,0.0001    100     100
0.3,2000,ME,1.5,0.02      100     99
0.3,2000,ME,1.5,0.0001    100     100
0.3,2000,ME,1.1,0.02      100     54
0.3,2000,ME,1.1,0.0001    100     50
0.3,1000,ME,2.0,0.02      100     99
0.3,1000,ME,2.0,0.0001    100     100
0.3,1000,ME,1.5,0.02      100     68
0.3,1000,ME,1.5,0.0001    100     63
0.3,1000,ME,1.1,0.02      90      25
0.3,1000,ME,1.1,0.0001    81      25
0.1,500,ME,2.0,0.02       0       9
0.1,500,ME,2.0,0.0001     12      17
0.1,500,ME,1.5,0.02       0       5
0.1,500,ME,1.5,0.0001     0       6
0.1,500,ME,1.1,0.02       0       6
0.1,500,ME,1.1,0.0001     0       5
0.1,2000,ME,2.0,0.02      92      67
0.1,2000,ME,2.0,0.0001    99      76
0.1,2000,ME,1.5,0.02      24      16
0.1,2000,ME,1.5,0.0001    44      29
0.1,2000,ME,1.1,0.02      2       8
0.1,2000,ME,1.1,0.0001    1       7
0.1,1000,ME,2.0,0.02      32      18
0.1,1000,ME,2.0,0.0001    59      38
0.1,1000,ME,1.5,0.02      1       6
0.1,1000,ME,1.5,0.0001    6       10
0.1,1000,ME,1.1,0.02      0       7
0.1,1000,ME,1.1,0.0001    0       5
0.05,500,ME,2.0,0.02      0       3
0.05,500,ME,2.0,0.0001    0       6
0.05,500,ME,1.5,0.02      0       4
0.05,500,ME,1.5,0.0001    0       4
0.05,500,ME,1.1,0.02      0       5
0.05,500,ME,1.1,0.0001    0       1
0.05,2000,ME,2.0,0.02     1       17
0.05,2000,ME,2.0,0.0001   7       25
0.05,2000,ME,1.5,0.02     0       3
0.05,2000,ME,1.5,0.0001   0       13
0.05,2000,ME,1.1,0.02     0       5
0.05,2000,ME,1.1,0.0001   0       6
0.05,1000,ME,2.0,0.02     0       3
0.05,1000,ME,2.0,0.0001   1       18
0.05,1000,ME,1.5,0.02     0       2
0.05,1000,ME,1.5,0.0001   0       5
0.05,1000,ME,1.1,0.02     0       7
0.05,1000,ME,1.1,0.0001   0       3
0.01,500,ME,2.0,0.02      0       0
0.01,500,ME,2.0,0.0001    0       6
0.01,500,ME,1.5,0.02      0       0
0.01,500,ME,1.5,0.0001    0       6
0.01,500,ME,1.1,0.02      0       0
0.01,500,ME,1.1,0.0001    0       6
0.01,2000,ME,2.0,0.02     0       1
0.01,2000,ME,2.0,0.0001   0       3
0.01,2000,ME,1.5,0.02     0       3
0.01,2000,ME,1.5,0.0001   0       2
0.01,2000,ME,1.1,0.02     0       3
0.01,2000,ME,1.1,0.0001   0       2
0.01,1000,ME,2.0,0.02     0       6
0.01,1000,ME,2.0,0.0001   0       3
0.01,1000,ME,1.5,0.02     0       7
0.01,1000,ME,1.5,0.0001   0       3
0.01,1000,ME,1.1,0.02     0       3
0.01,1000,ME,1.1,0.0001   0       4
*MAF,POP,MOD,OR,PREV where MAF represents the minor allele frequency, POP is the number of individuals, MOD is the used model (with or without main effect and with or without epistasis effect), OR is the odds ratio and PREV is the prevalence of the disease.
Table 3: A table containing the running time, CPU usage and memory usage in each configuration.
Configuration*            Running Time (s)  CPU Usage (%)  Memory Usage (KB)
0.5,500,ME,2.0,0.02       04.90             87.81          4152.80
0.5,500,ME,2.0,0.0001     03.30             87.16          3446.24
0.5,500,ME,1.5,0.02       02.16             86.74          2723.76
0.5,500,ME,1.5,0.0001     02.15             82.36          2757.20
0.5,500,ME,1.1,0.02       01.82             80.97          2566.12
0.5,500,ME,1.1,0.0001     01.73             83.54          2556.08
0.5,2000,ME,2.0,0.02      08.02             95.53          5986.72
0.5,2000,ME,2.0,0.0001    05.17             94.16          4108.72
0.5,2000,ME,1.5,0.02      02.78             92.74          3512.88
0.5,2000,ME,1.5,0.0001    02.59             93.39          3508.48
0.5,2000,ME,1.1,0.02      02.34             93.38          3493.44
0.5,2000,ME,1.1,0.0001    02.30             93.32          3492.60
0.5,1000,ME,2.0,0.02      06.96             96.31          4437.08
0.5,1000,ME,2.0,0.0001    03.79             95.00          3240.00
0.5,1000,ME,1.5,0.02      02.38             93.54          2771.80
0.5,1000,ME,1.5,0.0001    02.25             93.99          2729.16
0.5,1000,ME,1.1,0.02      02.10             93.08          2686.12
0.5,1000,ME,1.1,0.0001    02.02             93.41          2665.64
0.3,500,ME,2.0,0.02       02.60             94.60          2970.00
0.3,500,ME,2.0,0.0001     02.32             93.51          2917.44
0.3,500,ME,1.5,0.02       01.93             93.41          2615.88
0.3,500,ME,1.5,0.0001     01.83             92.49          2607.24
0.3,500,ME,1.1,0.02       01.17             89.70          2483.28
0.3,500,ME,1.1,0.0001     01.09             88.25          2476.68
0.3,2000,ME,2.0,0.02      02.77             94.79          3534.72
0.3,2000,ME,2.0,0.0001    02.95             95.25          3563.44
0.3,2000,ME,1.5,0.02      02.38             94.49          3493.60
0.3,2000,ME,1.5,0.0001    02.32             94.27          3492.92
0.3,2000,ME,1.1,0.02      02.30             94.73          3491.44
0.3,2000,ME,1.1,0.0001    02.28             94.56          3490.44
0.3,1000,ME,2.0,0.02      02.42             94.03          2886.64
0.3,1000,ME,2.0,0.0001    02.45             94.19          2831.72
0.3,1000,ME,1.5,0.02      02.04             93.96          2675.80
0.3,1000,ME,1.5,0.0001    02.04             93.88          2671.00
0.3,1000,ME,1.1,0.02      01.82             93.43          2665.28
0.3,1000,ME,1.1,0.0001    01.76             92.86          2662.68
0.1,500,ME,2.0,0.02       0.80              85.95          2471.00
0.1,500,ME,2.0,0.0001     0.95              88.33          2520.12
0.1,500,ME,1.5,0.02       0.61              82.27          2383.04
0.1,500,ME,1.5,0.0001     0.64              84.21          2432.96
0.1,500,ME,1.1,0.02       0.57              82.88          2367.56
0.1,500,ME,1.1,0.0001     0.58              81.66          2408.72
0.1,2000,ME,2.0,0.02      02.24             93.47          3493.84
0.1,2000,ME,2.0,0.0001    02.26             94.12          3492.40
0.1,2000,ME,1.5,0.02      01.37             90.66          3489.68
0.1,2000,ME,1.5,0.0001    01.45             91.55          3484.24
0.1,2000,ME,1.1,0.02      01.02             90.22          3482.16
0.1,2000,ME,1.1,0.0001    0.99              90.46          3483.44
0.1,1000,ME,2.0,0.02      01.38             89.81          2681.04
0.1,1000,ME,2.0,0.0001    01.50             91.44          2696.48
0.1,1000,ME,1.5,0.02      0.78              88.49          2655.24
0.1,1000,ME,1.5,0.0001    0.83              88.49          2653.08
0.1,1000,ME,1.1,0.02      0.69              83.77          2652.16
0.1,1000,ME,1.1,0.0001    0.68              89.10          2648.20
0.05,500,ME,2.0,0.02      0.59              81.11          2380.88
0.05,500,ME,2.0,0.0001    0.93              84.09          2439.40
0.05,500,ME,1.5,0.02      0.57              81.72          2361.20
0.05,500,ME,1.5,0.0001    0.60              81.99          2390.04
0.05,500,ME,1.1,0.02      0.59              79.48          2361.20
0.05,500,ME,1.1,0.0001    0.57              81.46          2381.48
0.05,2000,ME,2.0,0.02     01.18             89.59          3485.56
0.05,2000,ME,2.0,0.0001   01.19             89.80          3484.76
0.05,2000,ME,1.5,0.02     0.98              89.07          3480.08
0.05,2000,ME,1.5,0.0001   0.98              89.80          3480.16
0.05,2000,ME,1.1,0.02     0.94              89.82          3479.28
0.05,2000,ME,1.1,0.0001   0.94              90.33          3481.12
0.05,1000,ME,2.0,0.02     0.70              85.95          2651.56
0.05,1000,ME,2.0,0.0001   0.81              86.89          2653.84
0.05,1000,ME,1.5,0.02     0.67              81.01          2647.16
0.05,1000,ME,1.5,0.0001   0.70              82.83          2648.96
0.05,1000,ME,1.1,0.02     0.66              84.68          2648.20
0.05,1000,ME,1.1,0.0001   0.69              80.38          2647.76
0.01,500,ME,2.0,0.02      0.55              77.93          2340.40
0.01,500,ME,2.0,0.0001    0.59              79.62          2391.20
0.01,500,ME,1.5,0.02      0.54              81.51          2345.64
0.01,500,ME,1.5,0.0001    0.58              79.48          2387.76
0.01,500,ME,1.1,0.02      0.55              78.36          2349.92
0.01,500,ME,1.1,0.0001    0.59              79.67          2393.76
0.01,2000,ME,2.0,0.02     0.91              85.28          3476.88
0.01,2000,ME,2.0,0.0001   0.93              91.10          3479.40
0.01,2000,ME,1.5,0.02     0.91              91.18          3478.80
0.01,2000,ME,1.5,0.0001   0.92              91.62          3480.64
0.01,2000,ME,1.1,0.02     0.91              91.07          3477.44
0.01,2000,ME,1.1,0.0001   0.93              91.07          3478.96
0.01,1000,ME,2.0,0.02     0.66              86.84          2645.76
0.01,1000,ME,2.0,0.0001   0.67              89.19          2649.60
0.01,1000,ME,1.5,0.02     06.55             88.46          6100.36
0.01,1000,ME,1.5,0.0001   0.67              80.52          2646.28
0.01,1000,ME,1.1,0.02     0.66              84.18          2644.68
0.01,1000,ME,1.1,0.0001   0.66              81.46          2645.48
*MAF,POP,MOD,OR,PREV where MAF represents the minor allele frequency, POP is the number of individuals, MOD is the used model (with or without main effect and with or without epistasis effect), OR is the odds ratio and PREV is the prevalence of the disease.
Laboratory Note
Genetic Epistasis III - Assessing Algorithm BOOST
LN-3-2014
Ricardo Pinho and Rui Camacho
FEUP
Rua Dr Roberto Frias, s/n, 4200-465 Porto, Portugal
Fax: (+351) 22 508 1440
e-mail: [email protected]
www: http://www.fe.up.pt/~
e-mail: [email protected]
www: http://www.fe.up.pt/~rcamacho
May 2014
Abstract
In this lab note, the algorithm BOOST is discussed. Its main features are transforming the genotype data representation into a Boolean type, operating on it with logic operations, and pruning statistically irrelevant epistatic interactions. The results show a higher Power for main effect detection than for epistasis detection, but also a much higher Type I Error Rate; the same holds for full effect detection. The scalability of the algorithm is very good, revealing only a slight increase in resource usage and running time as the population size grows.
1 Introduction
BOOST (BOolean Operation-based Screening and Testing) [WYY+10] transforms the data representation into a Boolean type, making logic operations more efficient, and prunes insignificant epistatic interactions using an upper bound on the likelihood ratio test statistic. BOOST works in two stages:

• Stage 1: Screening. All pairwise interactions are evaluated using contingency tables collected by Boolean operations, removing interactions that fail to meet a predefined threshold. The evaluation at this stage is based on the Kullback-Leibler divergence D = N · D_KL(π ∥ ρ), where π is the joint distribution under the full logistic regression model M_S = β_0 + β_i^{x_1} + β_j^{x_2} + β_{ij}^{x_1 x_2}, and ρ is the approximate joint distribution under the main-effects logistic regression model M_H = β_0 + β_i^{x_1} + β_j^{x_2}, obtained with the Kirkwood superposition approximation.

• Stage 2: Testing. Two statistical tests are used: a likelihood ratio test, fitting the log-linear models M_H and M_S, and a χ² test with four degrees of freedom. The p value is adjusted with a Bonferroni correction.
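The Boolean representation behind Stage 1 can be sketched as follows: each (SNP, genotype value) pair becomes a bitmask over the individuals, and a 3×3×2 genotype-by-phenotype contingency table is then filled with bitwise ANDs and popcounts. This is a simplified illustration of the idea, not BOOST's actual data layout:

```python
def to_bitmasks(genotypes):
    """One bitmask per genotype value in {0,1,2}: bit i is set when
    individual i carries that genotype."""
    masks = [0, 0, 0]
    for i, g in enumerate(genotypes):
        masks[g] |= 1 << i
    return masks

def pairwise_table(snp1, snp2, phenotype):
    """3x3x2 contingency table for a SNP pair, filled with AND + popcount."""
    m1, m2 = to_bitmasks(snp1), to_bitmasks(snp2)
    cases = 0
    for i, p in enumerate(phenotype):
        cases |= p << i
    controls = ~cases & ((1 << len(phenotype)) - 1)
    table = [[[0, 0] for _ in range(3)] for _ in range(3)]
    for a in range(3):
        for b in range(3):
            both = m1[a] & m2[b]                     # individuals with (a, b)
            table[a][b][0] = bin(both & controls).count("1")
            table[a][b][1] = bin(both & cases).count("1")
    return table

snp1 = [0, 1, 2, 1, 0]
snp2 = [1, 1, 0, 2, 0]
pheno = [0, 1, 0, 0, 1]
table = pairwise_table(snp1, snp2, pheno)
```

Because the counts come from word-level bit operations instead of per-individual loops, an optimized implementation can evaluate all pairwise tables far faster than a naive scan.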
1.1 Input files
The input data consist of a file with the SNP and phenotype information, and a file containing the names of all data set files. In the data files, the first column corresponds to the phenotype, taking values in {0,1}. The second to last columns correspond to the SNPs, taking values in {0,1,2}.
1.2 Output files
The output consists of two files. The interaction results file lists all SNP pairs with a χ² result above 30 and contains the following columns:
• Index: number of the interaction. Begins with 0.

• SNP1: first SNP in the interaction. Numeration begins with 0.

• SNP2: second SNP in the interaction. Numeration begins with 0.

• SinglelocusAssoc1: value of the marginal effect for the first associated SNP.

• SinglelocusAssoc2: value of the marginal effect for the second associated SNP.

• InteractionBOOST: the statistical value of BOOST from the χ² test.

• InteractionPLINK: value obtained using the statistic of PLINK.
The second file contains the marginal effect value for every SNP. The file contains two columns:
• SNPindex : number of the SNP.
• Single-locusTestValue : value of the χ2 test.
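Stage 2's significance test can be sketched with a closed-form χ² survival function, since for 4 degrees of freedom P(X > x) = e^{-x/2}(1 + x/2), combined with a Bonferroni correction over all tested pairs. This is a sketch under those stated assumptions, not BOOST's internal code:

```python
from math import exp

def chi2_sf_4df(x):
    """Survival function of the chi-square distribution with 4 degrees of
    freedom: P(X > x) = exp(-x/2) * (1 + x/2)."""
    return exp(-x / 2.0) * (1.0 + x / 2.0)

def significant_pairs(stats, n_snps, alpha=0.05):
    """Keep SNP pairs whose interaction statistic survives a Bonferroni
    correction over all C(n_snps, 2) tested pairs.

    `stats` maps (snp_i, snp_j) -> chi-square statistic with 4 df.
    """
    n_tests = n_snps * (n_snps - 1) // 2
    threshold = alpha / n_tests            # Bonferroni-corrected p cutoff
    return {pair: s for pair, s in stats.items() if chi2_sf_4df(s) < threshold}

# Toy statistics for the 3 pairs of a 3-SNP data set.
stats = {(0, 1): 45.0, (0, 2): 20.0, (1, 2): 3.0}
kept = significant_pairs(stats, n_snps=3)
```

With α = 0.05 and three tests, the per-pair cutoff is p < 0.0167; the statistics 45.0 and 20.0 clear it easily, while 3.0 (p ≈ 0.56) does not.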
2 Experimental Settings
The datasets used in the experiments are characterized in Lab Note 1. The computer used for these experiments ran the 64-bit Ubuntu 13.10 operating system, with an Intel(R) Core(TM)2 Quad Q6600 CPU at 2.40 GHz and 8.00 GB of RAM. BOOST is a C program; there are no settings for BOOST.
3 Results
The Power displayed in epistasis detection (a) is inferior to the Power of main effect detection (b), and the best results were obtained by full effect detection (c) in almost all configurations. In epistasis detection, Figure 1 shows an increase of Power with population size at nearly all allele frequencies. Increasing the allele frequency also increases the Power.
Figure 2 shows a varying CPU usage (b) with a very slight increase in running time (a) and memory usage (c). This increase is not significant, which reveals a good scalability.
The Type I Error Rate shows a maximum value of 21% for epistasis detection, but reaches 100% in the data sets with the highest population size and allele frequency for main effect and full effect detection. Most of the Type I errors in epistasis detection are below 10%, so there is a bigger gap between Power and Type I errors in epistasis detection. For main effect and full effect, the Type I Error Rate increases with data set size and minor allele
[Bar charts: Power (%) vs. allele frequency for (a) epistasis, (b) main effect, and (c) full effect detection, one series each for 500, 1000, and 2000 individuals.]
Figure 1: Power by allele frequency. For each frequency, three sizes of data sets were used to measure the Power, with odds ratio of 2.0 and prevalence of 0.02. The Power is measured by the number of data sets where the ground truth was amongst the most relevant results, out of all 100 data sets. (a), (b), and (c) represent epistatic, main effect, and main effect + epistatic interactions.
[Plots: (a) average running time (seconds), (b) average CPU usage (%), and (c) average memory usage (Kbytes), each vs. number of individuals.]
Figure 2: Comparison of scalability measures between different sized data sets. The data sets have a minor allele frequency of 0.5, an odds ratio of 2.0, a prevalence of 0.02, and use the full effect disease model.
frequency.
The Power distributions by population (Figure 4) and by odds ratio (Figure 5) show a big increase with higher population sizes and odds ratios. The prevalence results reveal very similar values for both prevalences (Figure 6), and the distribution by allele frequency (Figure 7) increases slightly at 0.05 minor allele frequency and greatly at 0.1, reaching 100% for higher allele frequencies.
[Bar charts: Type I Error Rate (%) vs. allele frequency for (a) epistasis, (b) main effect, and (c) full effect detection, one series each for 500, 1000, and 2000 individuals.]
Figure 3: Type I Error Rate by allele frequency. For each frequency, three sizes of data sets were used, with odds ratio of 2.0 and prevalence of 0.02. The Type I Error Rate is measured by the number of data sets where false positives were amongst the most relevant results, out of all 100 data sets. (a), (b), and (c) represent epistatic, main effect, and main effect + epistatic interactions.
4 Summary
BOOST is an exhaustive algorithm that converts the data into a binary format and prunes irrelevant interactions using contingency tables collected by Boolean operations. The results show very good scalability, with a slight but irrelevant increase in running time, memory usage and CPU usage. The gap between Power and Type I Error Rate is larger in epistasis detection, but the overall Power is lower than for main effect and full effect detection.
References
[WYY+10] Xiang Wan, Can Yang, Qiang Yang, Hong Xue, Xiaodan Fan, Nelson L S Tang, and Weichuan Yu. BOOST: A fast approach to detecting gene-gene interactions in genome-wide case-control studies. American Journal of Human Genetics, 87:325–340, 2010.
A Bar Graphs
[Bar charts: Power (%) by population for (a) epistasis, (b) main effect, and (c) full effect detection.]
Figure 4: Distribution of the Power by population for all disease models. The allele frequency is 0.1, the odds ratio is 2.0, and the prevalence is 0.02.
[Bar charts: Power (%) by odds ratio for (a) epistasis, (b) main effect, and (c) full effect detection.]
Figure 5: Distribution of the Power by odds ratios for all disease models. The allele frequency is 0.1, the population size is 2000 individuals, and the prevalence is 0.02.
[Bar charts: Power (%) by prevalence for (a) epistasis, (b) main effect, and (c) full effect detection.]
Figure 6: Distribution of the Power by prevalence for all disease models. The allele frequency is 0.1, the odds ratio is 2.0, and the population size is 2000 individuals.
[Bar charts: Power (%) by allele frequency for (a) epistasis, (b) main effect, and (c) full effect detection.]
Figure 7: Distribution of the Power by allele frequency for all disease models. The population size is 2000 individuals, the odds ratio is 2.0, and the prevalence is 0.02.
B Table of Results
Table 1: A table containing the percentage of true positives and false positives in each configuration. The first column contains the description of the configuration. The second and third columns contain the number of data sets with true positives and false positives, respectively, out of all 100 data sets per configuration.
Configuration*              TP (%)  FP (%)
0.5,500,I,2.0,0.02          26      4
0.5,500,I,2.0,0.0001        20      3
0.5,500,I,1.5,0.02          19      2
0.5,500,I,1.5,0.0001        1       4
0.5,500,I,1.1,0.02          0       1
0.5,500,I,1.1,0.0001        0       3
0.5,2000,I,2.0,0.02         100     8
0.5,2000,I,2.0,0.0001       100     17
0.5,2000,I,1.5,0.02         100     11
0.5,2000,I,1.5,0.0001       67      2
0.5,2000,I,1.1,0.02         15      6
0.5,2000,I,1.1,0.0001       7       7
0.5,1000,I,2.0,0.02         91      8
0.5,1000,I,2.0,0.0001       90      7
0.5,1000,I,1.5,0.02         79      8
0.5,1000,I,1.5,0.0001       13      4
0.5,1000,I,1.1,0.02         1       5
0.5,1000,I,1.1,0.0001       0       6
0.3,500,I,2.0,0.02          14      6
0.3,500,I,2.0,0.0001        46      6
0.3,500,I,1.5,0.02          15      3
0.3,500,I,1.5,0.0001        12      2
0.3,500,I,1.1,0.02          0       4
0.3,500,I,1.1,0.0001        0       1
0.3,2000,I,2.0,0.02         100     6
0.3,2000,I,2.0,0.0001       100     10
0.3,2000,I,1.5,0.02         100     30
0.3,2000,I,1.5,0.0001       100     10
0.3,2000,I,1.1,0.02         7       4
0.3,2000,I,1.1,0.0001       8       6
0.3,1000,I,2.0,0.02         66      2
0.3,1000,I,2.0,0.0001       100     8
0.3,1000,I,1.5,0.02         81      10
0.3,1000,I,1.5,0.0001       71      6
0.3,1000,I,1.1,0.02         0       2
0.3,1000,I,1.1,0.0001       0       3
0.1,500,I,2.0,0.02          1       7
0.1,500,I,2.0,0.0001        0       1
0.1,500,I,1.5,0.02          1       3
0.1,500,I,1.5,0.0001        1       5
0.1,500,I,1.1,0.02          0       7
0.1,500,I,1.1,0.0001        0       4
0.1,2000,I,2.0,0.02         94      21
0.1,2000,I,2.0,0.0001       91      7
0.1,2000,I,1.5,0.02         95      9
0.1,2000,I,1.5,0.0001       77      6
0.1,2000,I,1.1,0.02         27      5
0.1,2000,I,1.1,0.0001       24      2
0.1,1000,I,2.0,0.02         41      5
0.1,1000,I,2.0,0.0001       13      7
0.1,1000,I,1.5,0.02         36      4
0.1,1000,I,1.5,0.0001       10      3
0.1,1000,I,1.1,0.02         0       6
0.1,1000,I,1.1,0.0001       0       5
0.05,500,I,2.0,0.02         0       7
0.05,500,I,2.0,0.0001       1       2
0.05,500,I,1.5,0.02         0       8
0.05,500,I,1.5,0.0001       0       6
0.05,500,I,1.1,0.02         0       11
0.05,500,I,1.1,0.0001       0       3
0.05,2000,I,2.0,0.02        7       2
0.05,2000,I,2.0,0.0001      70      35
0.05,2000,I,1.5,0.02        65      49
0.05,2000,I,1.5,0.0001      47      28
0.05,2000,I,1.1,0.02        0       7
0.05,2000,I,1.1,0.0001      0       0
0.05,1000,I,2.0,0.02        0       4
0.05,1000,I,2.0,0.0001      8       8
0.05,1000,I,1.5,0.02        11      7
0.05,1000,I,1.5,0.0001      1       5
0.05,1000,I,1.1,0.02        0       4
0.05,1000,I,1.1,0.0001      0       6
0.01,500,I,2.0,0.02         0       4
0.01,500,I,2.0,0.0001       0       2
0.01,500,I,1.5,0.02         0       5
0.01,500,I,1.5,0.0001       0       4
0.01,500,I,1.1,0.02         0       5
0.01,500,I,1.1,0.0001       0       7
0.01,2000,I,2.0,0.02        0       2
0.01,2000,I,2.0,0.0001      1       6
0.01,2000,I,1.5,0.02        0       4
0.01,2000,I,1.5,0.0001      1       1
0.01,2000,I,1.1,0.02        0       2
0.01,2000,I,1.1,0.0001      0       6
0.01,1000,I,2.0,0.02        0       7
0.01,1000,I,2.0,0.0001      0       4
0.01,1000,I,1.5,0.02        0       2
0.01,1000,I,1.5,0.0001      0       4
0.01,1000,I,1.1,0.02        0       5
0.01,1000,I,1.1,0.0001      0       8
0.5,500,ME,2.0,0.02         100     97
0.5,500,ME,2.0,0.0001       100     95
0.5,500,ME,1.5,0.02         100     63
0.5,500,ME,1.5,0.0001       100     61
0.5,500,ME,1.1,0.02         81      30
0.5,500,ME,1.1,0.0001       78      22
0.5,2000,ME,2.0,0.02        100     100
0.5,2000,ME,2.0,0.0001      100     100
0.5,2000,ME,1.5,0.02        100     100
0.5,2000,ME,1.5,0.0001      100     100
0.5,2000,ME,1.1,0.02        100     100
0.5,2000,ME,1.1,0.0001      100     97
0.5,1000,ME,2.0,0.02        100     100
0.5,1000,ME,2.0,0.0001      100     100
0.5,1000,ME,1.5,0.02        100     100
0.5,1000,ME,1.5,0.0001      100     97
0.5,1000,ME,1.1,0.02        100     61
0.5,1000,ME,1.1,0.0001      100     59
0.3,500,ME,2.0,0.02         100     78
0.3,500,ME,2.0,0.0001       100     82
0.3,500,ME,1.5,0.02         90      29
0.3,500,ME,1.5,0.0001       86      33
0.3,500,ME,1.1,0.02         25      17
0.3,500,ME,1.1,0.0001       18      10
0.3,2000,ME,2.0,0.02        100     100
0.3,2000,ME,2.0,0.0001      100     100
0.3,2000,ME,1.5,0.02        100     99
0.3,2000,ME,1.5,0.0001      100     100
0.3,2000,ME,1.1,0.02        100     57
0.3,2000,ME,1.1,0.0001      100     51
0.3,1000,ME,2.0,0.02        100     99
0.3,1000,ME,2.0,0.0001      100     100
0.3,1000,ME,1.5,0.02        100     69
0.3,1000,ME,1.5,0.0001      100     66
0.3,1000,ME,1.1,0.02        92      28
0.3,1000,ME,1.1,0.0001      77      27
0.1,500,ME,2.0,0.02         2       12
0.1,500,ME,2.0,0.0001       8       14
0.1,500,ME,1.5,0.02         0       7
0.1,500,ME,1.5,0.0001       0       8
0.1,500,ME,1.1,0.02         0       7
0.1,500,ME,1.1,0.0001       0       4
0.1,2000,ME,2.0,0.02        97      74
0.1,2000,ME,2.0,0.0001      98      71
0.1,2000,ME,1.5,0.02        33      22
0.1,2000,ME,1.5,0.0001      34      23
0.1,2000,ME,1.1,0.02        2       11
0.1,2000,ME,1.1,0.0001      1       5
0.1,1000,ME,2.0,0.02        43      23
0.1,1000,ME,2.0,0.0001      50      32
0.1,1000,ME,1.5,0.02        5       9
0.1,1000,ME,1.5,0.0001      1       9
0.1,1000,ME,1.1,0.02        0       10
0.1,1000,ME,1.1,0.0001      0       5
0.05,500,ME,2.0,0.02        0       1
0.05,500,ME,2.0,0.0001      0       4
0.05,500,ME,1.5,0.02        0       1
0.05,500,ME,1.5,0.0001      0       2
0.05,500,ME,1.1,0.02        0       3
0.05,500,ME,1.1,0.0001      0       2
0.05,2000,ME,2.0,0.02       14      11
0.05,2000,ME,2.0,0.0001     13      10
0.05,2000,ME,1.5,0.02       2       2
0.05,2000,ME,1.5,0.0001     1       6
0.05,2000,ME,1.1,0.02       0       3
0.05,2000,ME,1.1,0.0001     2       3
0.05,1000,ME,2.0,0.02       1       3
0.05,1000,ME,2.0,0.0001     4       6
0.05,1000,ME,1.5,0.02       0       0
0.05,1000,ME,1.5,0.0001     1       3
0.05,1000,ME,1.1,0.02       0       3
0.05,1000,ME,1.1,0.0001     0       2
0.01,500,ME,2.0,0.02        0       1
0.01,500,ME,2.0,0.0001      0       7
0.01,500,ME,1.5,0.02        0       1
0.01,500,ME,1.5,0.0001      0       7
0.01,500,ME,1.1,0.02        0       1
0.01,500,ME,1.1,0.0001      0       7
0.01,2000,ME,2.0,0.02       0       1
0.01,2000,ME,2.0,0.0001     0       4
0.01,2000,ME,1.5,0.02       0       5
0.01,2000,ME,1.5,0.0001     0       5
0.01,2000,ME,1.1,0.02       0       5
0.01,2000,ME,1.1,0.0001     0       5
0.01,1000,ME,2.0,0.02       0       7
0.01,1000,ME,2.0,0.0001     0       3
0.01,1000,ME,1.5,0.02       0       8
0.01,1000,ME,1.5,0.0001     0       3
0.01,1000,ME,1.1,0.02       0       4
0.01,1000,ME,1.1,0.0001     0       2
0.5,500,ME+I,2.0,0.02       100     100
0.5,500,ME+I,2.0,0.0001     100     100
0.5,500,ME+I,1.5,0.02       100     100
0.5,500,ME+I,1.5,0.0001     100     100
0.5,500,ME+I,1.1,0.02       100     89
0.5,500,ME+I,1.1,0.0001     100     86
0.5,2000,ME+I,2.0,0.02      100     100
0.5,2000,ME+I,2.0,0.0001    100     100
0.5,2000,ME+I,1.5,0.02      100     100
0.5,2000,ME+I,1.5,0.0001    100     100
0.5,2000,ME+I,1.1,0.02      100     100
0.5,2000,ME+I,1.1,0.0001    100     100
0.5,1000,ME+I,2.0,0.02      100     100
0.5,1000,ME+I,2.0,0.0001    100     100
0.5,1000,ME+I,1.5,0.02      100     100
0.5,1000,ME+I,1.5,0.0001    100     100
0.5,1000,ME+I,1.1,0.02      100     100
0.5,1000,ME+I,1.1,0.0001    100     100
0.3,500,ME+I,2.0,0.02       100     100
0.3,500,ME+I,2.0,0.0001     100     100
0.3,500,ME+I,1.5,0.02       100     79
0.3,500,ME+I,1.5,0.0001     100     93
0.3,500,ME+I,1.1,0.02       79      28
0.3,500,ME+I,1.1,0.0001     89      25
0.3,2000,ME+I,2.0,0.02      100     100
0.3,2000,ME+I,2.0,0.0001    100     100
0.3,2000,ME+I,1.5,0.02      100     100
0.3,2000,ME+I,1.5,0.0001    100     100
0.3,2000,ME+I,1.1,0.02      100     97
0.3,2000,ME+I,1.1,0.0001    100     99
0.3,1000,ME+I,2.0,0.02      100     100
0.3,1000,ME+I,2.0,0.0001    100     100
0.3,1000,ME+I,1.5,0.02      100     99
0.3,1000,ME+I,1.5,0.0001    100     100
0.3,1000,ME+I,1.1,0.02      100     63
0.3,1000,ME+I,1.1,0.0001    100     66
0.1,500,ME+I,2.0,0.02       1       15
0.1,500,ME+I,2.0,0.0001     30      34
0.1,500,ME+I,1.5,0.02       0       10
0.1,500,ME+I,1.5,0.0001     1       10
0.1,500,ME+I,1.1,0.02       0       5
0.1,500,ME+I,1.1,0.0001     1       9
0.1,2000,ME+I,2.0,0.02      98      81
0.1,2000,ME+I,2.0,0.0001    100     100
0.1,2000,ME+I,1.5,0.02      72      51
0.1,2000,ME+I,1.5,0.0001    46      38
0.1,2000,ME+I,1.1,0.02      4       14
0.1,2000,ME+I,1.1,0.0001    2       16
0.1,1000,ME+I,2.0,0.02      42      38
0.1,1000,ME+I,2.0,0.0001    91      70
0.1,1000,ME+I,1.5,0.02      2       18
0.1,1000,ME+I,1.5,0.0001    8       18
0.1,1000,ME+I,1.1,0.02      0       13
0.1,1000,ME+I,1.1,0.0001    0       14
0.05,500,ME+I,2.0,0.02      0       4
0.05,500,ME+I,2.0,0.0001    0       11
0.05,500,ME+I,1.5,0.02      0       3
0.05,500,ME+I,1.5,0.0001    1       11
0.05,500,ME+I,1.1,0.02      0       8
0.05,500,ME+I,1.1,0.0001    0       7
0.05,2000,ME+I,2.0,0.02     15      17
0.05,2000,ME+I,2.0,0.0001   27      30
0.05,2000,ME+I,1.5,0.02     1       13
0.05,2000,ME+I,1.5,0.0001   4       8
0.05,2000,ME+I,1.1,0.02     0       7
0.05,2000,ME+I,1.1,0.0001   0       7
0.05,1000,ME+I,2.0,0.02     2       16
0.05,1000,ME+I,2.0,0.0001   8       11
0.05,1000,ME+I,1.5,0.02     1       6
0.05,1000,ME+I,1.5,0.0001   1       5
0.05,1000,ME+I,1.1,0.02     0       5
0.05,1000,ME+I,1.1,0.0001   0       6
0.01,500,ME+I,2.0,0.02      0       10
0.01,500,ME+I,2.0,0.0001    0       13
0.01,500,ME+I,1.5,0.02      0       13
0.01,500,ME+I,1.5,0.0001    0       12
0.01,500,ME+I,1.1,0.02      0       7
0.01,500,ME+I,1.1,0.0001    0       11
0.01,2000,ME+I,2.0,0.02     0       7
0.01,2000,ME+I,2.0,0.0001   0       12
0.01,2000,ME+I,1.5,0.02     0       13
0.01,2000,ME+I,1.5,0.0001   0       8
0.01,2000,ME+I,1.1,0.02     0       16
0.01,2000,ME+I,1.1,0.0001   0       8
0.01,1000,ME+I,2.0,0.02     0       11
0.01,1000,ME+I,2.0,0.0001   0       8
0.01,1000,ME+I,1.5,0.02     0       13
0.01,1000,ME+I,1.5,0.0001   0       7
0.01,1000,ME+I,1.1,0.02     0       10
0.01,1000,ME+I,1.1,0.0001   0       3
*MAF,POP,MOD,OR,PREV where MAF represents the minor allele frequency, POP is the number of individuals, MOD is the used model (with or without main effect and with or without epistasis effect), OR is the odds ratio and PREV is the prevalence of the disease.
Table 2: A table containing the running time, CPU usage and memory usage in each configuration.
Configuration* Running Time (s) CPU Usage (%) Memory Usage (KB)
0.5,500,ME+I,2.0,0.02 0.16 95.70 1003.04
0.5,500,ME+I,2.0,0.0001 0.16 96.05 1003.56
0.5,500,ME+I,1.5,0.02 0.16 96.07 1001.04
0.5,500,ME+I,1.5,0.0001 0.16 95.08 991.88
0.5,500,ME+I,1.1,0.02 0.16 96.93 995.92
0.5,500,ME+I,1.1,0.0001 0.16 95.35 973.64
0.5,500,ME,2.0,0.02 0.16 96.97 997.92
0.5,500,ME,2.0,0.0001 0.16 95.79 980.40
0.5,500,ME,1.5,0.02 0.16 97.07 996.76
0.5,500,ME,1.5,0.0001 0.16 98.08 972.08
0.5,500,ME,1.1,0.02 0.16 97.83 993.84
0.5,500,ME,1.1,0.0001 0.16 97.93 971.52
0.5,500,I,2.0,0.02 0.16 98.11 996.16
0.5,500,I,2.0,0.0001 0.16 98.02 973.60
0.5,500,I,1.5,0.02 0.16 97.67 998.08
0.5,500,I,1.5,0.0001 0.16 98.41 970.00
0.5,500,I,1.1,0.02 0.16 97.59 995.36
0.5,500,I,1.1,0.0001 0.16 97.38 967.72
0.5,2000,ME+I,2.0,0.02 0.34 97.87 1226.44
0.5,2000,ME+I,2.0,0.0001 0.34 97.56 1217.20
0.5,2000,ME+I,1.5,0.02 0.32 97.89 1190.48
0.5,2000,ME+I,1.5,0.0001 0.32 97.26 1188.16
0.5,2000,ME+I,1.1,0.02 0.31 97.13 1171.92
0.5,2000,ME+I,1.1,0.0001 0.31 97.29 1171.28
0.5,2000,ME,2.0,0.02 0.32 97.95 1175.04
0.5,2000,ME,2.0,0.0001 0.32 97.69 1174.32
0.5,2000,ME,1.5,0.02 0.31 97.78 1169.32
0.5,2000,ME,1.5,0.0001 0.31 98.06 1168.92
0.5,2000,ME,1.1,0.02 0.31 97.20 1155.96
0.5,2000,ME,1.1,0.0001 0.31 97.86 1142.44
0.5,2000,I,2.0,0.02 0.31 98.14 1146.04
0.5,2000,I,2.0,0.0001 0.31 97.71 1134.80
0.5,2000,I,1.5,0.02 0.31 97.82 1155.12
0.5,2000,I,1.5,0.0001 0.31 98.39 1132.64
0.5,2000,I,1.1,0.02 0.31 97.77 1149.60
0.5,2000,I,1.1,0.0001 0.31 98.57 1127.76
0.5,1000,ME+I,2.0,0.02 0.22 98.79 1071.92
0.5,1000,ME+I,2.0,0.0001 0.22 98.28 1053.88
0.5,1000,ME+I,1.5,0.02 0.22 95.92 1059.68
0.5,1000,ME+I,1.5,0.0001 0.22 96.97 1046.96
0.5,1000,ME+I,1.1,0.02 0.21 97.21 1053.52
0.5,1000,ME+I,1.1,0.0001 0.21 98.45 1028.16
0.5,1000,ME,2.0,0.02 0.22 98.66 1056.92
0.5,1000,ME,2.0,0.0001 0.21 98.77 1036.44
0.5,1000,ME,1.5,0.02 0.21 98.88 1049.04
0.5,1000,ME,1.5,0.0001 0.21 98.09 1016.32
0.5,1000,ME,1.1,0.02 0.21 98.19 1043.88
0.5,1000,ME,1.1,0.0001 0.21 98.18 1002.92
0.5,1000,I,2.0,0.02 0.21 97.86 1045.96
0.5,1000,I,2.0,0.0001 0.21 97.54 1003.00
0.5,1000,I,1.5,0.02 0.21 96.90 1048.20
0.5,1000,I,1.5,0.0001 0.21 97.46 1006.16
0.5,1000,I,1.1,0.02 0.21 96.44 1044.28
0.5,1000,I,1.1,0.0001 0.21 97.60 1000.88
0.3,500,ME+I,2.0,0.02 0.16 98.13 998.44
0.3,500,ME+I,2.0,0.0001 0.16 98.23 1003.00
0.3,500,ME+I,1.5,0.02 0.16 97.97 996.72
0.3,500,ME+I,1.5,0.0001 0.16 98.37 983.32
0.3,500,ME+I,1.1,0.02 0.16 98.12 997.60
0.3,500,ME+I,1.1,0.0001 0.16 97.90 966.76
0.3,500,ME,2.0,0.02 0.16 98.13 998.60
0.3,500,ME,2.0,0.0001 0.16 98.44 971.80
0.3,500,ME,1.5,0.02 0.16 98.78 997.36
0.3,500,ME,1.5,0.0001 0.16 98.05 969.24
0.3,500,ME,1.1,0.02 0.16 98.11 996.00
0.3,500,ME,1.1,0.0001 0.16 98.36 968.72
0.3,500,I,2.0,0.02 0.16 98.39 997.20
0.3,500,I,2.0,0.0001 0.16 98.55 978.32
0.3,500,I,1.5,0.02 0.16 98.36 995.88
0.3,500,I,1.5,0.0001 0.16 97.51 964.56
0.3,500,I,1.1,0.02 0.16 97.83 997.64
0.3,500,I,1.1,0.0001 0.16 97.29 968.68
0.3,2000,ME+I,2.0,0.02 0.32 97.82 1184.36
0.3,2000,ME+I,2.0,0.0001 0.34 96.01 1225.96
0.3,2000,ME+I,1.5,0.02 0.31 96.61 1171.32
0.3,2000,ME+I,1.5,0.0001 0.32 96.50 1175.60
0.3,2000,ME+I,1.1,0.02 0.31 96.16 1150.92
0.3,2000,ME+I,1.1,0.0001 0.31 97.24 1148.96
0.3,2000,ME,2.0,0.02 0.31 97.27 1170.44
0.3,2000,ME,2.0,0.0001 0.31 95.77 1171.48
0.3,2000,ME,1.5,0.02 0.31 96.48 1160.12
0.3,2000,ME,1.5,0.0001 0.31 97.31 1151.52
0.3,2000,ME,1.1,0.02 0.31 96.43 1150.12
0.3,2000,ME,1.1,0.0001 0.31 95.85 1129.60
0.3,2000,I,2.0,0.02 0.31 95.56 1153.76
0.3,2000,I,2.0,0.0001 0.31 95.81 1131.24
0.3,2000,I,1.5,0.02 0.31 97.12 1154.00
0.3,2000,I,1.5,0.0001 0.31 98.11 1134.88
0.3,2000,I,1.1,0.02 0.31 97.10 1147.48
0.3,2000,I,1.1,0.0001 0.31 95.79 1128.28
0.3,1000,ME+I,2.0,0.02 0.21 94.97 1058.36
0.3,1000,ME+I,2.0,0.0001 0.22 96.59 1059.72
0.3,1000,ME+I,1.5,0.02 0.21 96.59 1053.20
0.3,1000,ME+I,1.5,0.0001 0.21 96.29 1043.68
0.3,1000,ME+I,1.1,0.02 0.21 98.29 1047.48
0.3,1000,ME+I,1.1,0.0001 0.21 97.93 1006.04
0.3,1000,ME,2.0,0.02 0.21 97.33 1047.56
0.3,1000,ME,2.0,0.0001 0.21 98.02 1025.24
0.3,1000,ME,1.5,0.02 0.21 95.34 1048.40
0.3,1000,ME,1.5,0.0001 0.21 95.54 1006.60
0.3,1000,ME,1.1,0.02 0.21 96.92 1040.96
0.3,1000,ME,1.1,0.0001 0.21 97.23 1001.80
0.3,1000,I,2.0,0.02 0.21 97.65 1046.96
0.3,1000,I,2.0,0.0001 0.21 96.54 1010.36
0.3,1000,I,1.5,0.02 0.21 95.78 1051.16
0.3,1000,I,1.5,0.0001 0.21 96.49 1006.16
0.3,1000,I,1.1,0.02 0.21 96.77 1045.64
0.3,1000,I,1.1,0.0001 0.21 96.72 999.52
0.1,500,ME+I,2.0,0.02 0.16 96.61 995.84
0.1,500,ME+I,2.0,0.0001 0.16 95.81 971.08
0.1,500,ME+I,1.5,0.02 0.16 95.41 995.32
0.1,500,ME+I,1.5,0.0001 0.16 97.03 966.40
0.1,500,ME+I,1.1,0.02 0.16 96.58 995.56
0.1,500,ME+I,1.1,0.0001 0.16 97.02 973.00
0.1,500,ME,2.0,0.02 0.16 96.29 995.56
0.1,500,ME,2.0,0.0001 0.16 96.80 968.88
0.1,500,ME,1.5,0.02 0.16 95.68 999.52
0.1,500,ME,1.5,0.0001 0.16 96.15 968.08
0.1,500,ME,1.1,0.02 0.16 95.78 996.88
0.1,500,ME,1.1,0.0001 0.16 96.61 968.36
0.1,500,I,2.0,0.02 0.16 96.69 996.84
0.1,500,I,2.0,0.0001 0.16 96.31 970.08
0.1,500,I,1.5,0.02 0.16 96.01 996.00
0.1,500,I,1.5,0.0001 0.16 95.68 970.40
0.1,500,I,1.1,0.02 0.16 95.82 996.20
0.1,500,I,1.1,0.0001 0.16 96.19 969.84
0.1,2000,ME+I,2.0,0.02 0.31 96.47 1154.56
0.1,2000,ME+I,2.0,0.0001 0.31 97.14 1157.40
0.1,2000,ME+I,1.5,0.02 0.31 97.21 1151.28
0.1,2000,ME+I,1.5,0.0001 0.31 98.25 1131.32
0.1,2000,ME+I,1.1,0.02 0.31 98.12 1148.64
0.1,2000,ME+I,1.1,0.0001 0.31 96.25 1125.76
0.1,2000,ME,2.0,0.02 0.31 97.41 1153.04
0.1,2000,ME,2.0,0.0001 0.31 97.62 1139.76
0.1,2000,ME,1.5,0.02 0.31 97.73 1149.56
0.1,2000,ME,1.5,0.0001 0.31 97.73 1131.12
0.1,2000,ME,1.1,0.02 0.31 97.87 1148.72
0.1,2000,ME,1.1,0.0001 0.31 97.22 1129.48
0.1,2000,I,2.0,0.02 0.31 97.24 1150.88
0.1,2000,I,2.0,0.0001 0.31 95.68 1133.92
0.1,2000,I,1.5,0.02 0.31 94.91 1151.36
0.1,2000,I,1.5,0.0001 0.31 95.62 1135.04
0.1,2000,I,1.1,0.02 0.31 94.25 1148.12
0.1,2000,I,1.1,0.0001 0.31 96.06 1132.36
0.1,1000,ME+I,2.0,0.02 0.21 97.34 1043.80
0.1,1000,ME+I,2.0,0.0001 0.21 97.42 1012.40
0.1,1000,ME+I,1.5,0.02 0.21 97.26 1044.76
0.1,1000,ME+I,1.5,0.0001 0.21 97.05 1000.32
0.1,1000,ME+I,1.1,0.02 0.21 96.66 1042.76
0.1,1000,ME+I,1.1,0.0001 0.21 98.06 1000.16
0.1,1000,ME,2.0,0.02 0.21 97.68 1043.00
0.1,1000,ME,2.0,0.0001 0.21 97.66 1003.92
0.1,1000,ME,1.5,0.02 0.21 97.74 1043.64
0.1,1000,ME,1.5,0.0001 0.21 97.69 1001.60
0.1,1000,ME,1.1,0.02 0.21 97.95 1046.68
0.1,1000,ME,1.1,0.0001 0.21 97.83 1002.68
0.1,1000,I,2.0,0.02 0.21 96.10 1046.00
0.1,1000,I,2.0,0.0001 0.21 95.81 1003.88
0.1,1000,I,1.5,0.02 0.21 96.72 1041.84
0.1,1000,I,1.5,0.0001 0.21 96.96 1000.64
0.1,1000,I,1.1,0.02 0.21 96.48 1045.88
0.1,1000,I,1.1,0.0001 0.21 96.97 1003.84
0.05,500,ME+I,2.0,0.02 0.16 97.24 997.76
0.05,500,ME+I,2.0,0.0001 0.16 96.63 969.04
0.05,500,ME+I,1.5,0.02 0.16 97.31 995.32
0.05,500,ME+I,1.5,0.0001 0.16 97.56 966.84
0.05,500,ME+I,1.1,0.02 0.16 97.33 995.60
0.05,500,ME+I,1.1,0.0001 0.16 97.15 969.08
0.05,500,ME,2.0,0.02 0.16 96.23 997.08
0.05,500,ME,2.0,0.0001 0.16 97.46 971.00
0.05,500,ME,1.5,0.02 0.16 97.55 994.76
0.05,500,ME,1.5,0.0001 0.16 96.71 967.52
0.05,500,ME,1.1,0.02 0.16 95.80 995.36
0.05,500,ME,1.1,0.0001 0.16 96.52 969.68
0.05,500,I,2.0,0.02 0.16 98.23 995.56
0.05,500,I,2.0,0.0001 0.16 96.40 967.08
0.05,500,I,1.5,0.02 0.16 96.81 998.12
0.05,500,I,1.5,0.0001 0.16 96.51 971.36
0.05,500,I,1.1,0.02 0.16 96.55 996.88
0.05,500,I,1.1,0.0001 0.16 96.97 973.32
0.05,2000,ME+I,2.0,0.02 0.31 98.08 1149.36
0.05,2000,ME+I,2.0,0.0001 0.31 97.88 1131.84
0.05,2000,ME+I,1.5,0.02 0.31 97.97 1145.80
0.05,2000,ME+I,1.5,0.0001 0.31 97.98 1127.56
0.05,2000,ME+I,1.1,0.02 0.31 97.77 1145.32
0.05,2000,ME+I,1.1,0.0001 0.31 98.04 1127.08
0.05,2000,ME,2.0,0.02 0.31 98.05 1149.92
0.05,2000,ME,2.0,0.0001 0.31 98.14 1128.48
0.05,2000,ME,1.5,0.02 0.31 98.21 1146.80
0.05,2000,ME,1.5,0.0001 0.31 98.15 1128.40
0.05,2000,ME,1.1,0.02 0.31 98.11 1148.16
0.05,2000,ME,1.1,0.0001 0.31 97.26 1124.52
0.05,2000,I,2.0,0.02 0.31 97.86 1126.84
0.05,2000,I,2.0,0.0001 0.31 97.84 1135.60
0.05,2000,I,1.5,0.02 0.31 98.56 1155.56
0.05,2000,I,1.5,0.0001 0.31 98.07 1134.72
0.05,2000,I,1.1,0.02 0.31 97.36 1145.32
0.05,2000,I,1.1,0.0001 0.31 98.04 1127.44
0.05,1000,ME+I,2.0,0.02 0.21 96.67 1042.08
0.05,1000,ME+I,2.0,0.0001 0.21 96.80 999.68
0.05,1000,ME+I,1.5,0.02 0.21 95.98 1043.12
0.05,1000,ME+I,1.5,0.0001 0.21 97.60 1000.24
0.05,1000,ME+I,1.1,0.02 0.21 97.59 1044.92
0.05,1000,ME+I,1.1,0.0001 0.21 97.92 998.92
0.05,1000,ME,2.0,0.02 0.21 95.85 1044.24
0.05,1000,ME,2.0,0.0001 0.21 96.70 1001.08
0.05,1000,ME,1.5,0.02 0.21 97.88 1041.64
0.05,1000,ME,1.5,0.0001 0.21 97.31 998.80
0.05,1000,ME,1.1,0.02 0.21 97.01 1043.12
0.05,1000,ME,1.1,0.0001 0.21 98.04 1000.96
0.05,1000,I,2.0,0.02 0.21 97.76 1039.76
0.05,1000,I,2.0,0.0001 0.21 97.61 1006.80
0.05,1000,I,1.5,0.02 0.22 97.33 1045.32
0.05,1000,I,1.5,0.0001 0.21 97.58 1000.04
0.05,1000,I,1.1,0.02 0.21 97.67 1043.52
0.05,1000,I,1.1,0.0001 0.21 97.16 1000.44
0.01,500,ME+I,2.0,0.02 0.16 95.84 995.28
0.01,500,ME+I,2.0,0.0001 0.16 96.53 967.96
0.01,500,ME+I,1.5,0.02 0.16 97.68 995.84
0.01,500,ME+I,1.5,0.0001 0.16 97.60 971.52
0.01,500,ME+I,1.1,0.02 0.16 96.47 995.80
0.01,500,ME+I,1.1,0.0001 0.16 97.49 965.56
0.01,500,ME,2.0,0.02 0.16 96.47 995.40
0.01,500,ME,2.0,0.0001 0.16 97.93 965.68
0.01,500,ME,1.5,0.02 0.16 97.54 995.92
0.01,500,ME,1.5,0.0001 0.16 96.80 965.56
0.01,500,ME,1.1,0.02 0.16 98.11 995.92
0.01,500,ME,1.1,0.0001 0.16 97.90 965.64
0.01,500,I,2.0,0.02 0.16 96.09 997.20
0.01,500,I,2.0,0.0001 0.16 96.16 968.28
0.01,500,I,1.5,0.02 0.16 95.26 997.04
0.01,500,I,1.5,0.0001 0.16 97.09 968.28
0.01,500,I,1.1,0.02 0.16 97.41 994.24
0.01,500,I,1.1,0.0001 0.16 97.18 966.48
0.01,2000,ME+I,2.0,0.02 0.31 97.00 1146.36
0.01,2000,ME+I,2.0,0.0001 0.31 96.89 1128.12
0.01,2000,ME+I,1.5,0.02 0.31 96.86 1140.88
0.01,2000,ME+I,1.5,0.0001 0.31 97.57 1125.52
0.01,2000,ME+I,1.1,0.02 0.31 97.07 1148.24
0.01,2000,ME+I,1.1,0.0001 0.31 98.56 1125.16
0.01,2000,ME,2.0,0.02 0.31 98.16 1145.76
0.01,2000,ME,2.0,0.0001 0.31 97.32 1128.20
0.01,2000,ME,1.5,0.02 0.31 97.36 1140.92
0.01,2000,ME,1.5,0.0001 0.31 97.76 1125.64
0.01,2000,ME,1.1,0.02 0.31 98.06 1140.92
0.01,2000,ME,1.1,0.0001 0.31 97.91 1125.76
0.01,2000,I,2.0,0.02 0.31 98.10 1144.24
0.01,2000,I,2.0,0.0001 0.31 98.33 1128.08
0.01,2000,I,1.5,0.02 0.31 97.82 1148.32
0.01,2000,I,1.5,0.0001 0.31 97.97 1128.08
0.01,2000,I,1.1,0.02 0.31 97.91 1144.64
0.01,2000,I,1.1,0.0001 0.31 98.36 1125.80
0.01,1000,ME+I,2.0,0.02 0.21 97.16 1046.00
0.01,1000,ME+I,2.0,0.0001 0.21 97.67 1001.36
0.01,1000,ME+I,1.5,0.02 0.21 97.29 1043.36
0.01,1000,ME+I,1.5,0.0001 0.21 97.29 998.12
0.01,1000,ME+I,1.1,0.02 0.21 97.45 1046.16
0.01,1000,ME+I,1.1,0.0001 0.21 97.45 1000.64
0.01,1000,ME,2.0,0.02 0.21 97.59 1043.72
0.01,1000,ME,2.0,0.0001 0.21 97.19 998.16
0.01,1000,ME,1.5,0.02 0.21 96.83 1042.20
0.01,1000,ME,1.5,0.0001 0.21 97.27 998.08
0.01,1000,ME,1.1,0.02 0.21 97.37 1046.36
0.01,1000,ME,1.1,0.0001 0.21 97.50 1000.84
0.01,1000,I,2.0,0.02 0.21 97.94 1044.48
0.01,1000,I,2.0,0.0001 0.21 96.81 1001.80
0.01,1000,I,1.5,0.02 0.21 97.98 1043.52
0.01,1000,I,1.5,0.0001 0.21 98.02 1000.44
0.01,1000,I,1.1,0.02 0.21 97.05 1042.88
0.01,1000,I,1.1,0.0001 0.21 97.81 1002.12
*MAF,POP,MOD,OR,PREV, where MAF represents the minor allele frequency, POP is the number of individuals, MOD is the model used (with or without main effect and with or without epistasis effect), OR is the odds ratio, and PREV is the prevalence of the disease.
Laboratory Note
Genetic Epistasis IV - Assessing Algorithm Screen and Clean
LN-4-2014
Ricardo Pinho and Rui Camacho
FEUP
Rua Dr. Roberto Frias, s/n, 4200-465 Porto
Portugal
Fax: (+351) 22 508 1440
e-mail: [email protected]
www: http://www.fe.up.pt/∼[email protected]
www: http://www.fe.up.pt/∼rcamacho
May 2014
Abstract
In this lab note, the algorithm Screen and Clean is presented. As the name indicates, the algorithm screens all relevant SNPs and fits them using regression models for main effects and interactions. The second stage cleans the previously selected terms using a separate portion of the data, removing possible false positives. The results show that the algorithm is nearly incapable of finding epistatic interactions, but produces reasonable results in main effect and full effect detection for data sets with high allele frequencies. The scalability of the algorithm is poor due to the steep increase in running time.
1 Introduction
Screen and Clean [WDR+10] is a two-stage algorithm that creates a dictionary of disease-related SNPs and SNP interactions, which contracts or expands during a multi-step statistical procedure and is then revised to control the Type I Error Rate. In the beginning, a dictionary including all SNPs with minor allele frequency above 0.01 is created. If the number of SNPs is greater than the specified upper limit of covariates allowed to enter the screening process, the SNPs with the lowest marginal p-values are selected. The data is divided between step 1 (screen) and step 2 (clean). In step 1, a screening stage is applied to restrict the number of terms. This restriction is applied using regression models for main effects or interactions. For main effect models, the function used is
g(E[Y|X]) = β0 + Σ_{j=1}^{N} βj Xj    (1)
where g is an appropriate link function, Xj is the encoded genotype value (0, 1, or 2), and Y is the encoded phenotype (0 or 1). According to the selected SNPs, the algorithm then tries to find relevant interacting SNPs that fit the following interaction model:
g(E[Y|X]) = β0 + Σ_{j=1}^{N} βj Xj + Σ_{i<j; i,j=1,...,N} βij Xi Xj    (2)
where S = {j : βj ≠ 0, j ∈ 1,...,L} ∪ {(i,j) : βij ≠ 0, (i,j) ∈ 1,...,L} is the set of terms associated with the phenotype as main effects or interactions. Cross-validation is applied at this stage to impose a further restriction. In stage 2, the resulting dictionary is cleaned, retaining only terms with p-values < α. This is done using the traditional t-statistic obtained from a least squares analysis of the screened model.
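The two-stage logic can be sketched in a deliberately simplified form. The sketch below is illustrative only, not the authors' R implementation: the even data split, the marginal-correlation screen (a stand-in for the lasso screen), the quantitative "gaussian" response, and all function names are assumptions, and the Bonferroni correction on α is omitted.

```python
import numpy as np
from math import erf, sqrt

def screen_and_clean(X, y, n_screen=10, alpha=0.05):
    """Toy two-stage sketch: screen SNPs on one half of the data,
    then clean them with least-squares t-tests on the other half."""
    n, p = X.shape
    half = n // 2
    Xs, ys = X[:half], y[:half]   # screening split (step 1)
    Xc, yc = X[half:], y[half:]   # cleaning split (step 2)

    # Step 1 (screen): rank SNPs by absolute marginal correlation
    # with the phenotype and keep the n_screen best candidates.
    r = np.array([abs(np.corrcoef(Xs[:, j], ys)[0, 1]) for j in range(p)])
    keep = np.argsort(r)[::-1][:n_screen]

    # Step 2 (clean): multivariate least squares on the held-out half;
    # retain SNPs whose coefficient t-statistic has p-value < alpha.
    A = np.column_stack([np.ones(len(yc)), Xc[:, keep]])
    beta, *_ = np.linalg.lstsq(A, yc, rcond=None)
    dof = len(yc) - A.shape[1]
    sigma2 = float(np.sum((yc - A @ beta) ** 2)) / dof
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(A.T @ A)))
    # two-sided p-values via a normal approximation to the t-distribution
    pvals = [2 * (1 - 0.5 * (1 + erf(abs(b / s) / sqrt(2))))
             for b, s in zip(beta, se)]
    # pvals[0] belongs to the intercept, hence the j + 1 offset
    return [int(keep[j]) for j in range(len(keep)) if pvals[j + 1] < alpha]
```

Simulating, say, 20 SNPs where only the first carries a main effect, the cleaned set should contain that SNP: the screen keeps it among the top candidates, and the clean confirms it on independent data.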
1.1 Input files
The input consists of two files: one containing the phenotype of all individuals and one containing the genotype, with all SNPs, for all individuals.
1.2 Output files
There are many outputs available:
(a) Genotype

rs1, rs2, rs3, rs4, rs5
0, 2, 0, 0, 0
0, 1, 1, 0, 0
0, 1, 1, 0, 0
1, 1, 0, 1, 0
0, 1, 1, 1, 1
0, 1, 2, 1, 0

(b) Phenotype

Label
0
1
0
0
1
1

Table 1: An example of the input files containing genotype and phenotype information for 5 SNPs and 6 individuals. Genotypes 0, 1, and 2 correspond to homozygous dominant, heterozygous, and homozygous recessive. Phenotypes 0 and 1 correspond to control and case, respectively.
• snp screen - a vector of the column names of the SNPs picked by the screen.

• snp screen2 - a vector of the SNP pairs (at most K pairs) retained by the second lasso screen.

• snp clean - a vector of screened SNPs also retained by the multivariate regression clean.

• clean - a data frame with regression output for all of the screened SNPs. The snp2 column contains the pairwise interaction or "NA" for main effects.

• final - a data frame with output from the regression of the phenotype on the final cleaned SNPs.
1.3 Parameters
The following parameters can be configured:
• L - number of SNPs to be retained with the smallest p-values.
• K pairs - Number of pairwise interactions to be retained by the lasso.
• response - The type of phenotype. Can be binomial or gaussian.
• alpha - The Bonferroni correction lower bound limit for retention ofSNPs.
• snp fix - Index of SNPs that are forced into the lasso and multivariateregression models. Optional.
• cov struct - Matrix of covariates that are forced in every model fit byScreen & Clean. Optional.
• standardize - If true, the genotypes coded as 0, 1, or 2 are centered to mean 0 and scaled to standard deviation 1. The data must be standardized to run the Screen & Clean procedure.
2 Experimental Settings
The data sets used in the experiments are characterized in Lab Note 1. The computer used for these experiments ran the 64-bit Ubuntu 13.10 operating system, with an Intel(R) Core(TM)2 Quad Q6600 2.40 GHz processor and 8.00 GB of RAM. The parameters used for these experiments were: L = 200, K pairs = 100, response = "binomial", standardize = TRUE, alpha = 0.05.
3 Results
The results in Figure 1 are interesting. For epistasis detection, the Power is nearly 0 in all configurations. In main effect detection, the ground truth is detected in data sets with an allele frequency higher than 0.05. In full effect detection, only data sets with an allele frequency of 0.3 or higher show any Power. There is no clear pattern between Power and data set size.
The scalability test shows a clear increase in running time as the number of individuals in the data sets grows, which may become a serious obstacle for larger data sets. Memory usage also increases with data set size, but not as significantly as running time. CPU usage shows no clear increase.
The Type I Error Rate in epistatic detection shows a seemingly random distribution; overall, the error rate is fairly constant. For main effect detection, the error rate increases with population size and allele frequency. This is even clearer in full effect detection, reaching a maximum of 84% in the configuration with 2000 individuals and 0.5 allele frequency.
From Figure 4 and Figure 5 there is only an indication of Power for data sets with 2000 individuals, except for full effect data sets, exactly the same as with the odds ratio variation. The prevalence variation shows a small Power
[Figure 1 bar charts. Power (%) by allele frequency (0.01, 0.05, 0.1, 0.3, 0.5) for 500, 1000, and 2000 individuals.
(a) Epistasis - 500: 0, 0, 0, 0, 0; 1000: 0, 0, 0, 0, 0; 2000: 0, 0, 6, 2, 0.
(b) Main Effect - 500: 0, 0, 0, 20, 54; 1000: 0, 0, 0, 54, 70; 2000: 0, 0, 39, 58, 62.
(c) Full Effect - 500: 0, 0, 0, 30, 49; 1000: 0, 0, 0, 58, 73; 2000: 0, 0, 0, 40, 91.]
Figure 1: Average Power by allele frequency. For each frequency, three data set sizes were used to measure the Power, with an odds ratio of 2.0 and a prevalence of 0.02. The Power is measured as the number of data sets, out of all 100 data sets, where the ground truth was amongst the most relevant results. (a), (b), and (c) represent epistatic, main effect, and main effect + epistatic interactions.
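As defined in the captions of Figures 1 and 3, Power and Type I Error Rate reduce to two proportions over the 100 simulated data sets of a configuration. A minimal sketch (the helper name and the shape of the `reported` argument are hypothetical):

```python
def power_and_type1(reported, ground_truth):
    """reported: for each simulated data set, the list of SNP pairs
    the algorithm flagged as most relevant.
    Power = % of data sets whose reported pairs include the ground truth.
    Type I Error Rate = % of data sets reporting at least one false positive."""
    n = len(reported)
    power = 100.0 * sum(ground_truth in pairs for pairs in reported) / n
    type1 = 100.0 * sum(any(p != ground_truth for p in pairs)
                        for pairs in reported) / n
    return power, type1
```

With 100 data sets per configuration, a configuration where the true pair is recovered in 39 of them yields a Power of 39%, matching how the bar values in Figures 1 and 3 are read.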
[Figure 2 plots vs. number of individuals (500, 1000, 2000): (a) average running time (seconds), (b) average CPU usage (%), (c) average memory usage (Mbytes).]
Figure 2: Comparison of scalability measures between different sized data sets. The data sets have a minor allele frequency of 0.5, an odds ratio of 2.0, a prevalence of 0.02, and use the full effect disease model.
[Figure 3 bar charts. Type I Error Rate (%) by allele frequency (0.01, 0.05, 0.1, 0.3, 0.5) for 500, 1000, and 2000 individuals.
(a) Epistasis - 500: 15, 15, 14, 19, 18; 1000: 17, 22, 16, 15, 22; 2000: 18, 20, 21, 16, 14.
(b) Main Effect - 500: 14, 17, 21, 23, 15; 1000: 14, 21, 23, 28, 30; 2000: 13, 22, 36, 38, 48.
(c) Full Effect - 500: 18, 15, 19, 19, 37; 1000: 14, 21, 28, 35, 45; 2000: 14, 21, 33, 68, 84.]
Figure 3: Type I Error Rate by allele frequency. For each frequency, three data set sizes were used, with an odds ratio of 2.0 and a prevalence of 0.02. The Type I Error Rate is measured as the number of data sets, out of all 100 data sets, where false positives were amongst the most relevant results. (a), (b), and (c) represent epistatic, main effect, and main effect + epistatic interactions.
change in Figure 6, decreasing as the prevalence increases. Figure 7 reveals that there is only relevant Power at higher allele frequencies, and only for main and full effects.
4 Summary
Screen and Clean is a heuristic algorithm that applies regression models for main effect and epistatic detection and prunes statistically irrelevant interactions. The obtained results show very low Power for epistatic detection. In main effect detection, Power increases, especially for larger data sets; Power also increases with higher allele frequencies. In full effect detection, there is only Power for data sets with allele frequencies above 0.1. The scalability is poor, due to the large increase in running time across data set sizes. The Type I Error Rate does not vary significantly with population size or allele frequency in epistasis detection, contrary to main effect and full effect detection.
References
[WDR+10] Jing Wu, Bernie Devlin, Steven Ringquist, Massimo Trucco, and Kathryn Roeder. Screen and clean: a tool for identifying interactions in genome-wide association studies. Genetic Epidemiology, 34:275-285, 2010.
A Bar Graphs
[Figure 4 bar charts. Power (%) by population (500, 1000, 2000): (a) Epistasis: 0, 0, 6; (b) Main Effect: 0, 0, 39; (c) Full Effect: 0, 0, 0.]
Figure 4: Distribution of the Power by population for all disease models.The allele frequency is 0.1, the odds ratio is 2.0, and the prevalence is 0.02.
[Figure 5 bar charts. Power (%) by odds ratio (1.1, 1.5, 2.0): (a) Epistasis: 0, 0, 6; (b) Main Effect: 0, 0, 39; (c) Full Effect: 0, 0, 0.]
Figure 5: Distribution of the Power by odds ratios for all disease models. The allele frequency is 0.1, the population size is 2000 individuals, and the prevalence is 0.02.
[Figure 6 bar charts. Power (%) by prevalence (0.0001, 0.02): (a) Epistasis: 12, 0; (b) Main Effect: 74, 62; (c) Full Effect: 93, 91.]
Figure 6: Distribution of the Power by prevalence for all disease models. The allele frequency is 0.5, the odds ratio is 2.0, and the population size is 2000 individuals.
[Figure 7 bar charts. Power (%) by allele frequency (0.01, 0.05, 0.1, 0.3, 0.5): (a) Epistasis: 0, 0, 6, 2, 0; (b) Main Effect: 0, 0, 39, 58, 62; (c) Full Effect: 0, 0, 0, 40, 91.]
Figure 7: Distribution of the Power by allele frequency for all disease models. The population size is 2000 individuals, the odds ratio is 2.0, and the prevalence is 0.02.
B Table of Results
Table 2: Percentages of true positives and false positives for each configuration. The first column describes the configuration. The second and third columns contain the number of data sets with true positives and false positives, respectively, out of the 100 data sets per configuration.
Configuration* TP (%) FP (%)
0.5,500,I,2.0,0.02 0 18
0.5,500,I,2.0,0.0001 0 9
0.5,500,I,1.5,0.02 0 11
0.5,500,I,1.5,0.0001 0 17
0.5,500,I,1.1,0.02 0 18
0.5,500,I,1.1,0.0001 0 25
0.5,2000,I,2.0,0.02 0 14
0.5,2000,I,2.0,0.0001 12 19
0.5,2000,I,1.5,0.02 5 25
0.5,2000,I,1.5,0.0001 0 17
0.5,2000,I,1.1,0.02 1 20
0.5,2000,I,1.1,0.0001 0 10
0.5,1000,I,2.0,0.02 0 22
0.5,1000,I,2.0,0.0001 3 17
0.5,1000,I,1.5,0.02 2 11
0.5,1000,I,1.5,0.0001 0 17
0.5,1000,I,1.1,0.02 0 15
0.5,1000,I,1.1,0.0001 0 13
0.3,500,I,2.0,0.02 0 19
0.3,500,I,2.0,0.0001 0 14
0.3,500,I,1.5,0.02 0 14
0.3,500,I,1.5,0.0001 0 11
0.3,500,I,1.1,0.02 0 18
0.3,500,I,1.1,0.0001 0 14
0.3,2000,I,2.0,0.02 2 16
0.3,2000,I,2.0,0.0001 0 19
0.3,2000,I,1.5,0.02 0 20
0.3,2000,I,1.5,0.0001 1 11
0.3,2000,I,1.1,0.02 0 19
0.3,2000,I,1.1,0.0001 0 14
0.3,1000,I,2.0,0.02 0 15
0.3,1000,I,2.0,0.0001 0 14
0.3,1000,I,1.5,0.02 1 10
0.3,1000,I,1.5,0.0001 0 11
0.3,1000,I,1.1,0.02 0 13
0.3,1000,I,1.1,0.0001 0 12
0.1,500,I,2.0,0.02 0 14
0.1,500,I,2.0,0.0001 0 11
0.1,500,I,1.5,0.02 0 17
0.1,500,I,1.5,0.0001 0 13
0.1,500,I,1.1,0.02 0 8
0.1,500,I,1.1,0.0001 0 16
0.1,2000,I,2.0,0.02 6 21
0.1,2000,I,2.0,0.0001 1 19
0.1,2000,I,1.5,0.02 0 13
0.1,2000,I,1.5,0.0001 0 15
0.1,2000,I,1.1,0.02 0 17
0.1,2000,I,1.1,0.0001 0 16
0.1,1000,I,2.0,0.02 0 16
0.1,1000,I,2.0,0.0001 0 20
0.1,1000,I,1.5,0.02 0 11
0.1,1000,I,1.5,0.0001 0 12
0.1,1000,I,1.1,0.02 0 17
0.1,1000,I,1.1,0.0001 0 10
0.05,500,I,2.0,0.02 0 15
0.05,500,I,2.0,0.0001 0 5
0.05,500,I,1.5,0.02 0 23
0.05,500,I,1.5,0.0001 0 21
0.05,500,I,1.1,0.02 0 19
0.05,500,I,1.1,0.0001 0 17
0.05,2000,I,2.0,0.02 0 20
0.05,2000,I,2.0,0.0001 0 14
0.05,2000,I,1.5,0.02 2 25
0.05,2000,I,1.5,0.0001 1 19
0.05,2000,I,1.1,0.02 0 15
0.05,2000,I,1.1,0.0001 0 14
0.05,1000,I,2.0,0.02 0 22
0.05,1000,I,2.0,0.0001 0 15
0.05,1000,I,1.5,0.02 0 14
0.05,1000,I,1.5,0.0001 0 12
0.05,1000,I,1.1,0.02 0 22
0.05,1000,I,1.1,0.0001 0 16
0.01,500,I,2.0,0.02 0 15
0.01,500,I,2.0,0.0001 0 13
0.01,500,I,1.5,0.02 0 15
0.01,500,I,1.5,0.0001 0 19
0.01,500,I,1.1,0.02 0 13
0.01,500,I,1.1,0.0001 0 11
0.01,2000,I,2.0,0.02 0 18
0.01,2000,I,2.0,0.0001 0 13
0.01,2000,I,1.5,0.02 0 11
0.01,2000,I,1.5,0.0001 0 19
0.01,2000,I,1.1,0.02 0 19
0.01,2000,I,1.1,0.0001 0 11
0.01,1000,I,2.0,0.02 0 17
0.01,1000,I,2.0,0.0001 0 11
0.01,1000,I,1.5,0.02 0 11
0.01,1000,I,1.5,0.0001 0 17
0.01,1000,I,1.1,0.02 0 13
0.01,1000,I,1.1,0.0001 0 11
0.5,500,ME,2.0,0.02 54 15
0.5,500,ME,2.0,0.0001 35 23
0.5,500,ME,1.5,0.02 40 17
0.5,500,ME,1.5,0.0001 21 19
0.5,500,ME,1.1,0.02 15 14
0.5,500,ME,1.1,0.0001 10 27
0.5,2000,ME,2.0,0.02 62 48
0.5,2000,ME,2.0,0.0001 74 35
0.5,2000,ME,1.5,0.02 77 31
0.5,2000,ME,1.5,0.0001 83 17
0.5,2000,ME,1.1,0.02 91 17
0.5,2000,ME,1.1,0.0001 82 21
0.5,1000,ME,2.0,0.02 70 30
0.5,1000,ME,2.0,0.0001 72 24
0.5,1000,ME,1.5,0.02 71 22
0.5,1000,ME,1.5,0.0001 77 18
0.5,1000,ME,1.1,0.02 53 21
0.5,1000,ME,1.1,0.0001 50 20
0.3,500,ME,2.0,0.02 20 23
0.3,500,ME,2.0,0.0001 16 18
0.3,500,ME,1.5,0.02 7 20
0.3,500,ME,1.5,0.0001 4 21
0.3,500,ME,1.1,0.02 0 17
0.3,500,ME,1.1,0.0001 0 15
0.3,2000,ME,2.0,0.02 58 38
0.3,2000,ME,2.0,0.0001 48 53
0.3,2000,ME,1.5,0.02 62 37
0.3,2000,ME,1.5,0.0001 49 40
0.3,2000,ME,1.1,0.02 33 19
0.3,2000,ME,1.1,0.0001 29 29
0.3,1000,ME,2.0,0.02 54 28
0.3,1000,ME,2.0,0.0001 52 25
0.3,1000,ME,1.5,0.02 25 23
0.3,1000,ME,1.5,0.0001 17 15
0.3,1000,ME,1.1,0.02 11 11
0.3,1000,ME,1.1,0.0001 0 21
0.1,500,ME,2.0,0.02 0 21
0.1,500,ME,2.0,0.0001 0 20
0.1,500,ME,1.5,0.02 0 17
0.1,500,ME,1.5,0.0001 0 16
0.1,500,ME,1.1,0.02 0 17
0.1,500,ME,1.1,0.0001 0 14
0.1,2000,ME,2.0,0.02 39 36
0.1,2000,ME,2.0,0.0001 32 50
0.1,2000,ME,1.5,0.02 0 24
0.1,2000,ME,1.5,0.0001 0 28
0.1,2000,ME,1.1,0.02 0 21
0.1,2000,ME,1.1,0.0001 0 15
0.1,1000,ME,2.0,0.02 0 23
0.1,1000,ME,2.0,0.0001 0 19
0.1,1000,ME,1.5,0.02 0 9
0.1,1000,ME,1.5,0.0001 0 10
0.1,1000,ME,1.1,0.02 0 13
0.1,1000,ME,1.1,0.0001 0 15
0.05,500,ME,2.0,0.02 0 17
0.05,500,ME,2.0,0.0001 0 24
0.05,500,ME,1.5,0.02 0 13
0.05,500,ME,1.5,0.0001 0 12
0.05,500,ME,1.1,0.02 0 16
0.05,500,ME,1.1,0.0001 0 16
0.05,2000,ME,2.0,0.02 0 22
0.05,2000,ME,2.0,0.0001 0 18
0.05,2000,ME,1.5,0.02 0 10
0.05,2000,ME,1.5,0.0001 0 12
0.05,2000,ME,1.1,0.02 0 16
0.05,2000,ME,1.1,0.0001 0 15
0.05,1000,ME,2.0,0.02 0 21
0.05,1000,ME,2.0,0.0001 0 17
0.05,1000,ME,1.5,0.02 0 16
0.05,1000,ME,1.5,0.0001 0 13
0.05,1000,ME,1.1,0.02 0 15
0.05,1000,ME,1.1,0.0001 0 16
0.01,500,ME,2.0,0.02 0 14
0.01,500,ME,2.0,0.0001 0 22
0.01,500,ME,1.5,0.02 0 14
0.01,500,ME,1.5,0.0001 0 22
0.01,500,ME,1.1,0.02 0 12
0.01,500,ME,1.1,0.0001 0 24
0.01,2000,ME,2.0,0.02 0 13
0.01,2000,ME,2.0,0.0001 0 16
0.01,2000,ME,1.5,0.02 0 20
0.01,2000,ME,1.5,0.0001 0 14
0.01,2000,ME,1.1,0.02 0 19
0.01,2000,ME,1.1,0.0001 0 14
0.01,1000,ME,2.0,0.02 0 14
0.01,1000,ME,2.0,0.0001 0 15
0.01,1000,ME,1.5,0.02 0 13
0.01,1000,ME,1.5,0.0001 0 16
0.01,1000,ME,1.1,0.02 0 18
0.01,1000,ME,1.1,0.0001 0 17
0.5,500,ME+I,2.0,0.02 49 37
0.5,500,ME+I,2.0,0.0001 37 33
0.5,500,ME+I,1.5,0.02 54 33
0.5,500,ME+I,1.5,0.0001 43 30
0.5,500,ME+I,1.1,0.02 47 20
0.5,500,ME+I,1.1,0.0001 33 27
0.5,2000,ME+I,2.0,0.02 91 84
0.5,2000,ME+I,2.0,0.0001 93 76
0.5,2000,ME+I,1.5,0.02 94 78
0.5,2000,ME+I,1.5,0.0001 96 81
0.5,2000,ME+I,1.1,0.02 89 60
0.5,2000,ME+I,1.1,0.0001 88 72
0.5,1000,ME+I,2.0,0.02 73 45
0.5,1000,ME+I,2.0,0.0001 69 56
0.5,1000,ME+I,1.5,0.02 78 51
0.5,1000,ME+I,1.5,0.0001 80 50
0.5,1000,ME+I,1.1,0.02 78 33
0.5,1000,ME+I,1.1,0.0001 75 39
0.3,500,ME+I,2.0,0.02 30 19
0.3,500,ME+I,2.0,0.0001 37 28
0.3,500,ME+I,1.5,0.02 14 20
0.3,500,ME+I,1.5,0.0001 18 22
0.3,500,ME+I,1.1,0.02 5 17
0.3,500,ME+I,1.1,0.0001 3 14
0.3,2000,ME+I,2.0,0.02 40 68
0.3,2000,ME+I,2.0,0.0001 92 92
0.3,2000,ME+I,1.5,0.02 61 50
0.3,2000,ME+I,1.5,0.0001 77 74
0.3,2000,ME+I,1.1,0.02 50 35
0.3,2000,ME+I,1.1,0.0001 61 35
0.3,1000,ME+I,2.0,0.02 58 35
0.3,1000,ME+I,2.0,0.0001 72 57
0.3,1000,ME+I,1.5,0.02 45 34
0.3,1000,ME+I,1.5,0.0001 50 37
0.3,1000,ME+I,1.1,0.02 23 19
0.3,1000,ME+I,1.1,0.0001 12 18
0.1,500,ME+I,2.0,0.02 0 19
0.1,500,ME+I,2.0,0.0001 0 20
0.1,500,ME+I,1.5,0.02 0 23
0.1,500,ME+I,1.5,0.0001 0 11
0.1,500,ME+I,1.1,0.02 0 12
0.1,500,ME+I,1.1,0.0001 0 23
0.1,2000,ME+I,2.0,0.02 0 33
0.1,2000,ME+I,2.0,0.0001 0 20
0.1,2000,ME+I,1.5,0.02 0 23
0.1,2000,ME+I,1.5,0.0001 0 17
0.1,2000,ME+I,1.1,0.02 0 15
0.1,2000,ME+I,1.1,0.0001 0 10
0.1,1000,ME+I,2.0,0.02 0 28
0.1,1000,ME+I,2.0,0.0001 0 23
0.1,1000,ME+I,1.5,0.02 0 15
0.1,1000,ME+I,1.5,0.0001 0 12
0.1,1000,ME+I,1.1,0.02 0 8
0.1,1000,ME+I,1.1,0.0001 0 19
0.05,500,ME+I,2.0,0.02 0 15
0.05,500,ME+I,2.0,0.0001 0 19
0.05,500,ME+I,1.5,0.02 0 16
0.05,500,ME+I,1.5,0.0001 0 11
0.05,500,ME+I,1.1,0.02 0 15
0.05,500,ME+I,1.1,0.0001 0 21
0.05,2000,ME+I,2.0,0.02 0 21
0.05,2000,ME+I,2.0,0.0001 0 22
0.05,2000,ME+I,1.5,0.02 0 19
0.05,2000,ME+I,1.5,0.0001 0 18
0.05,2000,ME+I,1.1,0.02 0 18
0.05,2000,ME+I,1.1,0.0001 0 13
0.05,1000,ME+I,2.0,0.02 0 21
0.05,1000,ME+I,2.0,0.0001 0 15
0.05,1000,ME+I,1.5,0.02 0 12
0.05,1000,ME+I,1.5,0.0001 0 20
0.05,1000,ME+I,1.1,0.02 0 14
0.05,1000,ME+I,1.1,0.0001 0 18
0.01,500,ME+I,2.0,0.02 0 18
0.01,500,ME+I,2.0,0.0001 0 13
0.01,500,ME+I,1.5,0.02 0 15
0.01,500,ME+I,1.5,0.0001 0 15
0.01,500,ME+I,1.1,0.02 0 14
0.01,500,ME+I,1.1,0.0001 0 23
0.01,2000,ME+I,2.0,0.02 0 14
0.01,2000,ME+I,2.0,0.0001 0 18
0.01,2000,ME+I,1.5,0.02 0 22
0.01,2000,ME+I,1.5,0.0001 0 14
0.01,2000,ME+I,1.1,0.02 0 16
0.01,2000,ME+I,1.1,0.0001 0 15
0.01,1000,ME+I,2.0,0.02 0 14
0.01,1000,ME+I,2.0,0.0001 0 9
0.01,1000,ME+I,1.5,0.02 0 15
0.01,1000,ME+I,1.5,0.0001 0 15
0.01,1000,ME+I,1.1,0.02 0 18
0.01,1000,ME+I,1.1,0.0001 0 17
*MAF,POP,MOD,OR,PREV, where MAF represents the minor allele frequency, POP is the number of individuals, MOD is the model used (with or without main effect and with or without epistasis effect), OR is the odds ratio, and PREV is the prevalence of the disease.
Table 3: Running time, CPU usage, and memory usage for each configuration.
Configuration* Running Time (s) CPU Usage (%) Memory Usage (KB)
0.5,500,ME+I,2.0,0.02 8.05 75.72 132928.28
0.5,500,ME+I,2.0,0.0001 8.10 75.95 133723.36
0.5,500,ME+I,1.5,0.02 9.37 75.86 132094.44
0.5,500,ME+I,1.5,0.0001 9.03 75.38 132148.28
0.5,500,ME+I,1.1,0.02 11.23 75.14 133080.40
0.5,500,ME+I,1.1,0.0001 10.48 75.46 131997.88
0.5,500,ME,2.0,0.02 10.43 76.02 132144.56
0.5,500,ME,2.0,0.0001 9.98 76.18 132479.40
0.5,500,ME,1.5,0.02 11.88 75.85 133979.16
0.5,500,ME,1.5,0.0001 11.01 80.20 132260.32
0.5,500,ME,1.1,0.02 13.19 77.06 135044.20
0.5,500,ME,1.1,0.0001 12.12 77.23 133516.64
0.5,500,I,2.0,0.02 14.39 76.70 133500.72
0.5,500,I,2.0,0.0001 12.97 76.82 132901.40
0.5,500,I,1.5,0.02 14.32 76.82 133835.88
0.5,500,I,1.5,0.0001 13.16 76.95 132729.96
0.5,500,I,1.1,0.02 14.44 77.07 133436.20
0.5,500,I,1.1,0.0001 13.06 76.99 132833.76
0.5,2000,ME+I,2.0,0.02 34.65 77.25 156137.440.5,2000,ME+I,2.0,0.0001 31.37 76.78 153924.360.5,2000,ME+I,1.5,0.02 67.20 98.96 156944.56
0.5,2000,ME+I,1.5,0.0001 51.19 98.96 156487.440.5,2000,ME+I,1.1,0.02 106.54 99.00 157200.08
0.5,2000,ME+I,1.1,0.0001 76.74 99.00 157071.080.5,2000,ME,2.0,0.02 86.58 99.00 157251.32
0.5,2000,ME,2.0,0.0001 65.83 98.98 156912.400.5,2000,ME,1.5,0.02 115.43 99.00 156853.92
0.5,2000,ME,1.5,0.0001 84.74 99.00 157327.680.5,2000,ME,1.1,0.02 141.42 98.99 156264.16
0.5,2000,ME,1.1,0.0001 101.23 98.99 156739.320.5,2000,I,2.0,0.02 177.72 98.97 155566.84
0.5,2000,I,2.0,0.0001 121.39 98.99 155160.080.5,2000,I,1.5,0.02 175.38 99.00 155246.92
0.5,2000,I,1.5,0.0001 121.20 99.00 155048.640.5,2000,I,1.1,0.02 175.60 99.00 155732.32
0.5,2000,I,1.1,0.0001 121.17 99.00 155220.240.5,1000,ME+I,2.0,0.02 18.65 98.99 140519.08
0.5,1000,ME+I,2.0,0.0001 17.79 98.96 140777.400.5,1000,ME+I,1.5,0.02 27.78 98.93 140225.08
0.5,1000,ME+I,1.5,0.0001 23.17 98.90 140721.960.5,1000,ME+I,1.1,0.02 36.70 98.96 139726.28
0.5,1000,ME+I,1.1,0.0001 30.35 98.96 139558.000.5,1000,ME,2.0,0.02 31.89 98.98 139635.88
0.5,1000,ME,2.0,0.0001 27.30 98.98 140585.160.5,1000,ME,1.5,0.02 38.06 99.00 140003.76
0.5,1000,ME,1.5,0.0001 31.63 99.00 139501.320.5,1000,ME,1.1,0.02 43.46 99.00 139474.84
0.5,1000,ME,1.1,0.0001 36.17 99.00 139679.120.5,1000,I,2.0,0.02 52.11 98.93 138762.56
0.5,1000,I,2.0,0.0001 42.34 98.70 138701.960.5,1000,I,1.5,0.02 52.57 98.75 138734.88
0.5,1000,I,1.5,0.0001 41.92 98.96 138545.880.5,1000,I,1.1,0.02 52.19 98.94 138834.24
0.5,1000,I,1.1,0.0001 41.54 98.92 138252.560.3,500,ME+I,2.0,0.02 10.43 77.27 132217.92
0.3,500,ME+I,2.0,0.0001 8.47 77.17 132292.96
0.3,500,ME+I,1.5,0.02 12.12 76.84 134113.200.3,500,ME+I,1.5,0.0001 10.55 76.87 131771.240.3,500,ME+I,1.1,0.02 13.41 76.91 133854.48
0.3,500,ME+I,1.1,0.0001 12.22 76.76 132673.560.3,500,ME,2.0,0.02 11.96 76.68 134613.12
0.3,500,ME,2.0,0.0001 10.87 76.92 132077.720.3,500,ME,1.5,0.02 13.01 78.76 134016.68
0.3,500,ME,1.5,0.0001 11.80 80.22 133295.480.3,500,ME,1.1,0.02 13.89 76.88 134210.72
0.3,500,ME,1.1,0.0001 12.64 76.67 133055.160.3,500,I,2.0,0.02 14.56 76.78 133087.44
0.3,500,I,2.0,0.0001 13.11 76.36 133321.080.3,500,I,1.5,0.02 14.48 76.47 133449.28
0.3,500,I,1.5,0.0001 13.12 76.53 132977.320.3,500,I,1.1,0.02 14.62 76.97 132934.04
0.3,500,I,1.1,0.0001 12.91 79.52 132643.120.3,2000,ME+I,2.0,0.02 80.13 77.12 157067.76
0.3,2000,ME+I,2.0,0.0001 38.01 76.67 156089.920.3,2000,ME+I,1.5,0.02 108.52 77.82 156891.04
0.3,2000,ME+I,1.5,0.0001 68.37 79.27 157415.120.3,2000,ME+I,1.1,0.02 133.62 76.53 156605.84
0.3,2000,ME+I,1.1,0.0001 93.15 76.23 156330.200.3,2000,ME,2.0,0.02 94.83 72.29 145398.52
0.3,2000,ME,2.0,0.0001 73.01 76.34 157256.600.3,2000,ME,1.5,0.02 129.27 77.26 156373.32
0.3,2000,ME,1.5,0.0001 93.37 77.30 156781.520.3,2000,ME,1.1,0.02 148.69 76.90 155918.68
0.3,2000,ME,1.1,0.0001 104.23 77.27 156126.560.3,2000,I,2.0,0.02 163.18 76.81 155362.60
0.3,2000,I,2.0,0.0001 112.34 76.76 155090.440.3,2000,I,1.5,0.02 165.19 77.06 155644.08
0.3,2000,I,1.5,0.0001 113.22 76.96 155052.120.3,2000,I,1.1,0.02 165.32 76.90 155570.60
0.3,2000,I,1.1,0.0001 112.89 77.25 155295.480.3,1000,ME+I,2.0,0.02 29.58 76.65 140007.28
0.3,1000,ME+I,2.0,0.0001 18.54 76.70 140461.080.3,1000,ME+I,1.5,0.02 35.96 76.58 139731.16
0.3,1000,ME+I,1.5,0.0001 26.77 76.25 139746.88
0.3,1000,ME+I,1.1,0.02 39.73 74.45 135312.520.3,1000,ME+I,1.1,0.0001 33.51 60.15 138464.68
0.3,1000,ME,2.0,0.02 35.14 76.79 139728.880.3,1000,ME,2.0,0.0001 28.44 76.98 139477.160.3,1000,ME,1.5,0.02 39.75 76.60 139325.40
0.3,1000,ME,1.5,0.0001 32.14 78.60 139321.040.3,1000,ME,1.1,0.02 41.93 72.53 134819.68
0.3,1000,ME,1.1,0.0001 34.67 76.61 139146.120.3,1000,I,2.0,0.02 44.55 76.64 137672.52
0.3,1000,I,2.0,0.0001 35.88 76.58 138353.480.3,1000,I,1.5,0.02 45.07 77.84 138644.36
0.3,1000,I,1.5,0.0001 36.22 76.64 138355.680.3,1000,I,1.1,0.02 45.38 76.60 138536.52
0.3,1000,I,1.1,0.0001 36.54 76.65 138406.040.1,500,ME+I,2.0,0.02 14.39 76.21 133049.56
0.1,500,ME+I,2.0,0.0001 13.63 96.56 133010.000.1,500,ME+I,1.5,0.02 15.68 99.00 133365.44
0.1,500,ME+I,1.5,0.0001 13.83 99.00 132876.080.1,500,ME+I,1.1,0.02 15.71 98.92 133175.88
0.1,500,ME+I,1.1,0.0001 13.91 99.00 132622.440.1,500,ME,2.0,0.02 15.48 99.00 133278.80
0.1,500,ME,2.0,0.0001 13.81 98.98 132322.280.1,500,ME,1.5,0.02 15.56 98.99 133141.08
0.1,500,ME,1.5,0.0001 14.08 98.96 132418.280.1,500,ME,1.1,0.02 15.56 99.00 133151.84
0.1,500,ME,1.1,0.0001 14.01 98.98 133099.920.1,500,I,2.0,0.02 15.56 98.99 133436.84
0.1,500,I,2.0,0.0001 14.02 98.98 133378.840.1,500,I,1.5,0.02 15.87 99.00 133192.84
0.1,500,I,1.5,0.0001 14.09 98.98 132645.000.1,500,I,1.1,0.02 15.77 99.00 133209.40
0.1,500,I,1.1,0.0001 14.07 98.97 133120.920.1,2000,ME+I,2.0,0.02 158.73 77.20 155763.80
0.1,2000,ME,1.5,0.02 179.10 99.00 155353.080.1,2000,ME,1.5,0.0001 121.32 99.00 155054.960.1,2000,ME,1.1,0.02 179.21 94.98 155592.84
0.1,2000,ME,1.1,0.0001 123.67 94.92 155266.520.1,2000,I,2.0,0.02 179.43 93.90 155530.20
0.1,2000,I,2.0,0.0001 122.25 99.00 155230.160.1,2000,I,1.5,0.02 178.88 98.99 155394.24
0.1,2000,I,1.5,0.0001 122.16 99.00 155728.400.1,2000,I,1.1,0.02 177.35 99.00 155322.72
0.1,2000,I,1.1,0.0001 121.66 99.00 155222.800.1,1000,ME+I,2.0,0.02 48.96 99.00 139186.68
0.1,1000,ME+I,2.0,0.0001 39.09 99.00 138626.880.1,1000,ME+I,1.5,0.02 50.81 95.48 138308.36
0.1,1000,ME+I,1.5,0.0001 40.85 96.90 138674.000.1,1000,ME+I,1.1,0.02 51.26 97.28 138804.08
0.1,1000,ME+I,1.1,0.0001 41.67 92.17 138171.240.1,1000,ME,2.0,0.02 50.57 95.22 138776.32
0.1,1000,ME,2.0,0.0001 39.97 97.55 138625.880.1,1000,ME,1.5,0.02 50.41 98.21 138871.48
0.1,1000,ME,1.5,0.0001 40.33 98.03 138225.840.1,1000,ME,1.1,0.02 50.04 98.85 138539.48
0.1,1000,ME,1.1,0.0001 40.06 98.99 138220.520.1,1000,I,2.0,0.02 49.73 99.00 138651.80
0.1,1000,I,2.0,0.0001 40.38 98.19 138314.800.1,1000,I,1.5,0.02 50.46 98.66 139010.80
0.1,1000,I,1.5,0.0001 39.94 98.98 138255.480.1,1000,I,1.1,0.02 49.87 99.00 138711.12
0.1,1000,I,1.1,0.0001 40.13 98.99 138267.920.05,500,ME+I,2.0,0.02 16.57 98.78 133539.84
0.05,500,ME+I,2.0,0.0001 14.88 98.78 132866.200.05,500,ME+I,1.5,0.02 16.49 98.75 133801.12
0.05,500,ME+I,1.5,0.0001 14.89 98.83 132811.360.05,500,ME+I,1.1,0.02 16.64 98.82 133192.68
0.05,500,ME+I,1.1,0.0001 15.06 98.96 133419.440.05,500,ME,2.0,0.02 16.73 98.89 133147.12
0.05,500,ME,2.0,0.0001 15.19 98.79 132763.080.05,500,ME,1.5,0.02 16.43 98.87 133810.48
0.05,500,ME,1.5,0.0001 13.11 76.92 133189.440.05,500,ME,1.1,0.02 14.49 76.75 133125.68
0.05,500,ME,1.1,0.0001 13.12 76.65 133707.680.05,500,I,2.0,0.02 14.35 76.96 133531.60
0.05,500,I,2.0,0.0001 13.03 78.87 133306.000.05,500,I,1.5,0.02 14.30 79.83 133015.40
0.05,500,I,1.5,0.0001 12.98 78.90 133355.080.05,500,I,1.1,0.02 14.35 78.62 133549.00
0.05,500,I,1.1,0.0001 12.98 79.54 133045.080.05,2000,ME+I,2.0,0.02 190.54 99.00 155711.80
0.05,2000,ME+I,2.0,0.0001 131.34 99.00 155593.600.05,2000,ME+I,1.5,0.02 180.23 99.00 155621.12
0.05,2000,ME+I,1.5,0.0001 123.29 98.98 155334.440.05,2000,ME+I,1.1,0.02 178.12 99.00 155591.40
0.05,2000,ME+I,1.1,0.0001 124.22 99.00 155121.640.05,2000,ME,2.0,0.02 178.90 99.00 155657.64
0.05,2000,ME,2.0,0.0001 123.05 99.00 155412.040.05,2000,ME,1.5,0.02 178.36 99.00 155709.28
0.05,2000,ME,1.5,0.0001 123.93 99.00 155206.400.05,2000,ME,1.1,0.02 180.79 98.53 155466.84
0.05,2000,ME,1.1,0.0001 123.34 99.00 155389.400.05,2000,I,2.0,0.02 119.66 99.00 155255.48
0.05,2000,I,2.0,0.0001 122.52 99.00 155137.480.05,2000,I,1.5,0.02 178.53 99.00 155502.28
0.05,2000,I,1.5,0.0001 121.41 99.00 155484.520.05,2000,I,1.1,0.02 178.34 99.00 155644.44
0.05,2000,I,1.1,0.0001 122.37 98.98 155444.680.05,1000,ME+I,2.0,0.02 50.33 98.97 138882.24
0.05,1000,ME+I,2.0,0.0001 40.63 98.91 138206.720.05,1000,ME+I,1.5,0.02 50.46 98.92 138581.16
0.05,1000,ME+I,1.5,0.0001 40.53 98.89 138158.840.05,1000,ME+I,1.1,0.02 50.30 98.92 138835.84
0.05,1000,ME+I,1.1,0.0001 40.63 97.77 138088.720.05,1000,ME,2.0,0.02 50.16 99.00 138770.92
0.05,1000,ME,2.0,0.0001 40.17 98.59 138376.080.05,1000,ME,1.5,0.02 49.89 98.97 138975.00
0.05,1000,ME,1.5,0.0001 40.33 98.91 138648.360.05,1000,ME,1.1,0.02 49.85 98.98 138953.24
0.05,1000,ME,1.1,0.0001 39.79 99.00 138266.520.05,1000,I,2.0,0.02 49.24 98.88 138535.88
0.05,1000,I,2.0,0.0001 40.29 98.85 138387.880.05,1000,I,1.5,0.02 50.96 92.36 138616.68
0.05,1000,I,1.5,0.0001 41.18 88.89 137906.160.05,1000,I,1.1,0.02 50.33 97.64 138686.16
0.05,1000,I,1.1,0.0001 40.04 98.91 138828.360.01,500,ME+I,2.0,0.02 15.85 98.98 133601.64
0.01,500,ME+I,2.0,0.0001 14.01 98.96 132918.960.01,500,ME+I,1.5,0.02 15.65 98.99 132987.16
0.01,500,ME+I,1.5,0.0001 13.97 98.94 133053.600.01,500,ME+I,1.1,0.02 15.61 98.98 133764.08
0.01,500,ME+I,1.1,0.0001 14.17 98.99 132439.240.01,500,ME,2.0,0.02 15.55 98.97 133469.20
0.01,500,ME,2.0,0.0001 13.99 98.94 132721.640.01,500,ME,1.5,0.02 15.52 98.96 133784.12
0.01,500,ME,1.5,0.0001 13.96 98.95 132771.040.01,500,ME,1.1,0.02 15.65 98.92 133548.44
0.01,500,ME,1.1,0.0001 14.10 98.95 132695.160.01,500,I,2.0,0.02 15.71 98.97 132971.68
0.01,500,I,2.0,0.0001 14.17 98.97 133037.360.01,500,I,1.5,0.02 15.64 98.95 133494.84
0.01,500,I,1.5,0.0001 13.96 98.90 132756.080.01,500,I,1.1,0.02 15.63 98.94 133574.76
0.01,500,I,1.1,0.0001 14.00 98.93 133471.800.01,2000,ME+I,2.0,0.02 164.42 77.57 155881.28
0.01,2000,ME+I,2.0,0.0001 113.09 77.42 155392.920.01,2000,ME+I,1.5,0.02 162.51 77.55 155558.80
0.01,2000,ME+I,1.5,0.0001 112.62 77.64 155313.160.01,2000,ME+I,1.1,0.02 164.66 77.83 155642.44
0.01,2000,ME+I,1.1,0.0001 123.62 99.00 155250.160.01,2000,ME,2.0,0.02 165.13 77.53 155705.44
0.01,2000,ME,2.0,0.0001 111.69 79.50 155394.840.01,2000,ME,1.5,0.02 162.59 78.46 155294.12
0.01,2000,ME,1.5,0.0001 113.86 76.92 155176.880.01,2000,ME,1.1,0.02 164.27 76.97 155396.96
0.01,2000,ME,1.1,0.0001 113.76 76.75 155034.040.01,2000,I,2.0,0.02 164.08 77.22 155443.76
0.01,2000,I,2.0,0.0001 113.61 77.24 154994.480.01,2000,I,1.5,0.02 163.14 78.94 155548.28
0.01,2000,I,1.5,0.0001 111.07 79.09 155093.920.01,2000,I,1.1,0.02 162.67 77.36 155497.36
0.01,2000,I,1.1,0.0001 109.01 75.25 150508.880.01,1000,ME+I,2.0,0.02 50.30 98.92 138738.04
0.01,1000,ME+I,2.0,0.0001 40.31 99.00 138451.800.01,1000,ME+I,1.5,0.02 50.50 99.00 138824.72
0.01,1000,ME+I,1.5,0.0001 40.31 98.97 138218.680.01,1000,ME+I,1.1,0.02 50.41 99.00 138501.24
0.01,1000,ME+I,1.1,0.0001 40.45 99.00 138278.320.01,1000,ME,2.0,0.02 50.41 98.99 138407.60
0.01,1000,ME,2.0,0.0001 40.44 99.00 138331.400.01,1000,ME,1.5,0.02 50.07 98.99 138824.28
0.01,1000,ME,1.5,0.0001 40.31 99.00 138470.760.01,1000,ME,1.1,0.02 49.94 98.96 138802.64
0.01,1000,ME,1.1,0.0001 40.09 98.99 138732.520.01,1000,I,2.0,0.02 49.99 98.99 138941.92
0.01,1000,I,2.0,0.0001 40.10 98.95 138575.560.01,1000,I,1.5,0.02 49.52 98.92 138679.60
0.01,1000,I,1.5,0.0001 40.10 98.89 138683.160.01,1000,I,1.1,0.02 50.00 98.97 138552.00
0.01,1000,I,1.1,0.0001 40.43 97.41 138510.32
*MAF,POP,MOD,OR,PREV, where MAF represents the minor allele frequency, POP is the number of individuals, MOD is the model used (with or without main effect and with or without epistasis effect), OR is the odds ratio, and PREV is the prevalence of the disease.
Laboratory Note
Genetic Epistasis V - Assessing Algorithm SNPRuler

LN-5-2014

Ricardo Pinho and Rui Camacho
FEUP
Rua Dr. Roberto Frias, s/n,
4200-465 PORTO
Portugal
Fax: (+351) 22 508 1440
e-mail: [email protected]
www: http://www.fe.up.pt/∼
e-mail: [email protected]
www: http://www.fe.up.pt/∼rcamacho
May 2014
Abstract
In this lab note, the algorithm SNPRuler is presented. SNPRuler is an epistasis detection algorithm written in Java that creates rules based on the epistatic interactions detected in data sets. Using many configurations of data sets, the results obtained show a correlation between Power and the number of sampled individuals, and a correlation between Power and minor allele frequency. The algorithm therefore has very high accuracy under optimal conditions, but very low accuracy below them. The algorithm is very scalable across different numbers of individuals, with only a slight increase in running time and memory usage. The Type I Error Rate is very low in all configurations.
1 Introduction
SNPRuler [WYY+10] is a rule-based algorithm that, based on the relations between SNPs and the phenotype related to the expression of a disease, creates association rules between SNPs and the phenotype expression. The order of these interactions, i.e. the number of SNPs involved, is unbounded. For each rule, a 3x3 table is generated, relating the probability of each possible genotype combination to the phenotype expression. The way rules are defined is described in the following steps:
1. Literal - A literal s is an index-value pair (i, v), with i denoting an index and v a value in {1, 2, 3} representing the possible genotypes. A sample satisfies a literal (i, v) if and only if its i-th SNP has the value v.
2. Predictive Rule - A predictive rule (r, ζ): s1 ∩ s2 ∩ ... ∩ sn → ζ is an association between a conjunction of n literals, denoted r, and a class label ζ. A sample satisfies (r, ζ) if and only if it satisfies all literals in r and its class label is ζ.
3. Literal Relevance - Given a predictive rule (r, ζ) and a utility function U(r, ζ) for rule measurement, a literal si in the rule r is relevant if and only if U(r, ζ) > U(r − si, ζ). Here, r − si means removing si from r.
4. Closed Rule - A predictive rule (r, ζ) is closed if and only if there is no literal si satisfying U(r + si, ζ) > U(r, ζ). Here, r + si means adding si to r.
The measure of rule relevance is the χ2 statistic. Since most epistatic interactions involve many SNPs, before creating rules an upper bound is used to determine whether a new SNP will prove significant to a rule. This immensely decreases the number of rules created compared to exhaustive searches. A branch-and-bound approach is used for this purpose.
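The rule definitions above can be sketched in a few lines. The following is a minimal illustration, not the SNPRuler implementation: a rule is a list of (snp_index, genotype_value) literals, the utility U is taken as a 2x2 χ2 statistic of rule satisfaction against the class label, and genotypes use the input file's {0, 1, 2} coding rather than the {1, 2, 3} of the definition. All names are ours.

```python
def satisfies(sample, rule):
    # A sample satisfies a rule iff every literal (i, v) matches its i-th SNP.
    return all(sample[i] == v for i, v in rule)

def chi2_utility(rule, label, samples, labels):
    # 2x2 chi-square statistic of "satisfies rule" against "has class label".
    a = b = c = d = 0
    for s, l in zip(samples, labels):
        if satisfies(s, rule):
            if l == label:
                a += 1
            else:
                b += 1
        elif l == label:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def is_closed(rule, label, samples, labels, n_snps, genotypes=(0, 1, 2)):
    # Closed rule: no additional literal increases the utility.
    base = chi2_utility(rule, label, samples, labels)
    used = {i for i, _ in rule}
    return not any(
        chi2_utility(rule + [(i, v)], label, samples, labels) > base
        for i in range(n_snps) if i not in used
        for v in genotypes
    )
```

On a toy data set of four two-SNP samples, a rule such as [(0, 1)] ("SNP 0 has genotype 1") can be scored and tested for closure directly with these functions; SNPRuler's branch-and-bound pruning on top of this scoring is not reproduced here.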
1.1 Input files
The algorithm is written in Java and receives a file containing the genotype and the phenotype expressed for each individual. The first row contains the identifier of each SNP, and the final column corresponds to the label. Each subsequent row contains an individual's genotypes {0, 1, 2}, corresponding to the homozygous dominant genotype (AA), the heterozygous genotype (Aa), and the homozygous recessive genotype (aa). The label {0, 1} corresponds to control and disease-affected, respectively.
X1 X2 X3 X4 Label
1 1 0 2 0
1 0 1 1 1
1 2 2 0 1
Table 1: An example of the input file containing genotype and phenotypeinformation with 4 SNPs and 3 individuals.
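The format above is simple enough to read with a few lines of code. This is a sketch under the stated format (whitespace-separated columns, header row, label last); the function name is ours, not part of SNPRuler.

```python
def parse_genotype_file(lines):
    # First row: one name per SNP plus the final label column.
    # Each following row: genotypes 0 (AA), 1 (Aa), 2 (aa) and a 0/1 label
    # (0 = control, 1 = disease affected).
    header = lines[0].split()
    snp_names = header[:-1]
    genotypes, labels = [], []
    for line in lines[1:]:
        fields = [int(x) for x in line.split()]
        genotypes.append(fields[:-1])
        labels.append(fields[-1])
    return snp_names, genotypes, labels
```

Applied to the three individuals of Table 1, this yields four SNP names, three genotype rows, and the labels [0, 1, 1].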
1.2 Output files
The output is a list of interactions ranked by their significance in the χ2 test. A post-processing step calculates the P-value of these interactions, adjusting the χ2 test with a Bonferroni correction and a significance threshold of 0.3.
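The post-processing step can be sketched as follows. This is an illustration of a Bonferroni correction applied to a χ2 statistic, not SNPRuler's own code; the degrees of freedom depend on the contingency table used (a pairwise genotype-by-phenotype table would be an assumption of the caller), so `dof` is passed explicitly, and the survival function below uses the closed form that exists only for even degrees of freedom.

```python
import math

def chi2_sf(x, dof):
    # Survival function P(X > x) of a chi-square distribution, closed form
    # for even degrees of freedom: exp(-x/2) * sum_{i < dof/2} (x/2)^i / i!.
    assert dof % 2 == 0, "closed form used here covers even dof only"
    total, term = 0.0, 1.0
    for i in range(dof // 2):
        if i > 0:
            term *= (x / 2.0) / i
        total += term
    return math.exp(-x / 2.0) * total

def bonferroni_significant(chi2_stat, n_tests, dof, alpha=0.3):
    # Multiply the raw p-value by the number of tests (capped at 1) and
    # compare against the lab note's significance threshold of 0.3.
    p_adjusted = min(1.0, chi2_sf(chi2_stat, dof) * n_tests)
    return p_adjusted, p_adjusted < alpha
```

With many candidate rules, `n_tests` grows quickly, which is why only strong χ2 values survive the correction.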
1.3 Parameters
There are 3 configurable parameters:
• listSize - The expected number of interactions.
• depth - Order of interaction. Number of interacting SNPs.
• updateRatio - The step size for updating a rule. Takes a value between 0 and 1: 0 means rules are never updated, 1 updates a rule at each step.
2 Experimental Settings
The data sets used in the experiments are characterized in Lab Note 1. The computer used for these experiments ran the 64-bit Ubuntu 13.10 operating system, with an Intel(R) Core(TM)2 Quad CPU Q6600 2.40GHz processor and 8.00 GB of RAM. The algorithm settings consist of a -Xmx7000M heap size, with the maximum number of rules set to 50 000. The length of the rules is 2, since the data sets used contain ground truths of pairs of SNPs. The pruning threshold is 0, which means that all possible combinations are tested.
3 Results
SNPRuler is used exclusively for the interaction effect; therefore, data sets with main effect and full effect won't be analyzed. In Figure 1, the Power obtained for each allele frequency with different
population sizes is displayed. For data sets with 500 individuals, the Power is nearly 0 for all allele frequencies. However, as the population size increases, the Power starts to rise in data sets with allele frequencies higher than 0.1. This also holds for data sets with 2000 individuals, which show slightly higher Power than the smaller data sets. The configuration with the most Power corresponds to the data sets with a population of 2000 and a minor allele frequency of 0.5.
[Bar graph: Power (%) by allele frequency (0.01, 0.05, 0.1, 0.3, 0.5) for 500, 1000, and 2000 individuals. Readable values — 500 individuals: 0, 0, 0, 3, 6; 1000: 0, 0, 10, 35, 71; 2000: 0, 0, 32, 44, 92.]
Figure 1: Power by allele frequency. For each frequency, three sizes of data sets were used to measure the Power, with an odds ratio of 2.0 and a prevalence of 0.02. The Power is measured as the number of data sets, out of all 100, where the ground truth was amongst the most relevant results.
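The Power measure defined in the caption can be computed mechanically. This is a sketch of that counting procedure, with names of our own choosing; each data set contributes one ranked list of reported SNP pairs, and a hit means the ground-truth pair appears in that list (in either order).

```python
def empirical_power(results_per_dataset, ground_truth):
    # results_per_dataset: one list of reported SNP pairs per data set.
    # Power: percentage of data sets whose reported interactions contain
    # the ground-truth pair (order of the two SNPs does not matter).
    truth = frozenset(ground_truth)
    hits = sum(
        any(frozenset(pair) == truth for pair in results)
        for results in results_per_dataset
    )
    return 100.0 * hits / len(results_per_dataset)
```

The Type I Error Rate of Figure 3 is the analogous count over false positives instead of the ground truth.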
In Figure 2, the average running time, percentage of CPU usage, and memory usage are displayed by the number of individuals in the data set, to evaluate the scalability of the algorithm. The results show a slight increase in running time when applied to larger data sets; in these results, the increase is not very significant. The CPU usage increases with the data set size, with all data sets having a CPU usage higher than 100%, meaning that more than one core was used for each data set. The memory usage results show an increase of nearly 10 megabytes. This increase may be significant in more complex data sets, but is not as significant as the running time increase or the CPU usage.
For the Type I Error Rate test, Figure 3 shows that the Type I Error Rate is relatively small across all data sets, with outliers at an allele frequency of 0.1 and 2000 individuals. This is the only group of configurations that yields a Type I Error Rate higher than 1%.
According to Figure 4, we can conclude that the number of individuals has a big influence on the Power of the algorithm. This is also true for the allele frequency. With a very small number of individuals, the Power is nearly 0. The
[Bar graphs by number of individuals (500, 1000, 2000): (a) average running time (seconds), (b) average CPU usage (%), (c) average memory usage (Mbytes).]
Figure 2: Comparison of scalability measures between different-sized data sets. This figure shows the average running time, CPU usage, and memory usage for each data set size. The data sets have a minor allele frequency of 0.5, an odds ratio of 2.0, and a prevalence of 0.02.
Power also increases with the frequency of the alleles containing the ground truth. In Figures 5 and 6, the influence of the odds ratio (through the penetrance table of the disease) and of the prevalence of the disease is inconclusive. There is an increase in Power at an odds ratio of 1.5, but it decreases for an odds ratio of 2.0. The difference in prevalence does not show a very significant difference in Power. Figure 7 shows the Power by frequency, independent of population size. Overall, the algorithm shows very high Power in certain configurations with optimal conditions, but very low Power in many others.
[Bar graph: Type I Error Rate (%) by allele frequency (0.01, 0.05, 0.1, 0.3, 0.5) for 500, 1000, and 2000 individuals. Readable values — all zero except for 2000 individuals at 0.05 (1%) and 0.1 (8%).]
Figure 3: Type I Error Rate by allele frequency and population size, with an odds ratio of 2.0 and a prevalence of 0.02. The Type I Error Rate is measured as the number of data sets, out of all 100, where false positives were amongst the most relevant results.
4 Summary
In this lab note, the algorithm SNPRuler was presented and tested on its ability to detect the epistatic interactions that manifest complex diseases, using generated data sets. The results obtained showed that the number of individuals is important for epistasis detection, and that diseases with ground truths in high-frequency SNPs are easier to detect. The scalability test revealed a significant increase in the use of computer resources and in running time as the number of individuals grows, which may have a significant impact on data sets with a higher number of SNPs. The Type I Error Rate results show a very low error rate in all configurations.
References
[WYY+10] Xiang Wan, Can Yang, Qiang Yang, Hong Xue, Nelson L. S. Tang, and Weichuan Yu. Predictive rule inference for epistatic interaction detection in genome-wide association studies. Bioinformatics (Oxford, England), 26:30–37, 2010.
A Bar graphs
[Bar graph, Power by Population: 500 → 0%, 1000 → 10%, 2000 → 32%.]
Figure 4: Distribution of the Power by population. The allele frequency is0.1, the odds ratio is 2.0, and the prevalence is 0.02.
[Bar graph, Power by Odds Ratio: 1.1 → 0%, 1.5 → 67%, 2.0 → 32%.]
Figure 5: Distribution of the Power by odds ratios. The allele frequency is0.1, the number of individuals is 2000, and the prevalence is 0.02.
[Bar graph, Power by Prevalence: 0.0001 → 29%, 0.02 → 32%.]
Figure 6: Distribution of the Power by prevalence. The allele frequency is0.1, the number of individuals is 2000, and the odds ratio is 2.0.
[Bar graph, Power by Frequency: 0.01 → 0%, 0.05 → 0%, 0.1 → 32%, 0.3 → 44%, 0.5 → 92%.]
Figure 7: Distribution of the Power by allele frequency. The number ofindividuals is 2000, the odds ratio is 2.0, and the prevalence is 0.02.
B Table of Results
Table 2: A table containing the percentage of true positives and false positives in each configuration. The first column contains the description of the configuration. The second and third columns contain the number of data sets with true positives and false positives, respectively, out of all 100 data sets per configuration.
Configuration* TP (%) FP (%)
0.5,2000,I,2.0,0.02 92 0
0.3,2000,I,1.5,0.02 89 7
0.3,2000,I,1.5,0.0001 84 30.5,2000,I,1.5,0.02 81 10.5,1000,I,2.0,0.02 71 00.1,2000,I,1.5,0.02 67 2
0.5,2000,I,2.0,0.0001 52 80.5,1000,I,2.0,0.0001 50 10.05,2000,I,2.0,0.0001 50 19
0.3,2000,I,2.0,0.02 44 00.3,1000,I,1.5,0.02 41 10.5,1000,I,1.5,0.02 40 0
0.3,1000,I,2.0,0.0001 36 00.3,1000,I,2.0,0.02 35 0
0.05,2000,I,1.5,0.0001 35 120.5,2000,I,1.5,0.0001 34 10.1,2000,I,2.0,0.02 32 8
0.3,1000,I,1.5,0.0001 29 10.1,2000,I,2.0,0.0001 29 00.1,2000,I,1.5,0.0001 29 10.05,2000,I,1.5,0.02 23 140.3,2000,I,2.0,0.0001 12 10.1,1000,I,2.0,0.02 10 00.5,500,I,2.0,0.02 6 0
0.3,500,I,2.0,0.0001 6 00.5,2000,I,1.1,0.02 5 10.5,500,I,2.0,0.0001 4 00.5,1000,I,1.5,0.0001 3 0
0.3,500,I,2.0,0.02 3 00.5,2000,I,1.1,0.0001 2 00.3,500,I,1.5,0.0001 2 0
0.3,2000,I,1.1,0.0001 2 00.1,1000,I,1.5,0.02 2 0
0.05,1000,I,2.0,0.0001 2 00.5,500,I,1.5,0.02 1 00.3,2000,I,1.1,0.02 1 0
0.1,2000,I,1.1,0.0001 1 10.1,1000,I,2.0,0.0001 1 00.5,500,I,1.5,0.0001 0 00.5,500,I,1.1,0.02 0 0
0.5,500,I,1.1,0.0001 0 00.5,1000,I,1.1,0.02 0 0
0.5,1000,I,1.1,0.0001 0 00.3,500,I,1.5,0.02 0 00.3,500,I,1.1,0.02 0 0
0.3,500,I,1.1,0.0001 0 00.3,1000,I,1.1,0.02 0 0
0.3,1000,I,1.1,0.0001 0 00.1,500,I,2.0,0.02 0 0
0.1,500,I,2.0,0.0001 0 00.1,500,I,1.5,0.02 0 0
0.1,500,I,1.5,0.0001 0 00.1,500,I,1.1,0.02 0 0
0.1,500,I,1.1,0.0001 0 00.1,2000,I,1.1,0.02 0 0
0.1,1000,I,1.5,0.0001 0 00.1,1000,I,1.1,0.02 0 0
0.1,1000,I,1.1,0.0001 0 00.05,500,I,2.0,0.02 0 0
0.05,500,I,2.0,0.0001 0 00.05,500,I,1.5,0.02 0 0
0.05,500,I,1.5,0.0001 0 00.05,500,I,1.1,0.02 0 0
0.05,500,I,1.1,0.0001 0 00.05,2000,I,2.0,0.02 0 10.05,2000,I,1.1,0.02 0 0
0.05,2000,I,1.1,0.0001 0 10.05,1000,I,2.0,0.02 0 00.05,1000,I,1.5,0.02 0 0
0.05,1000,I,1.5,0.0001 0 00.05,1000,I,1.1,0.02 0 0
0.05,1000,I,1.1,0.0001 0 00.01,500,I,2.0,0.02 0 0
0.01,500,I,2.0,0.0001 0 00.01,500,I,1.5,0.02 0 0
0.01,500,I,1.5,0.0001 0 10.01,500,I,1.1,0.02 0 0
0.01,500,I,1.1,0.0001 0 00.01,2000,I,2.0,0.02 0 0
0.01,2000,I,2.0,0.0001 0 10.01,2000,I,1.5,0.02 0 0
0.01,2000,I,1.5,0.0001 0 00.01,2000,I,1.1,0.02 0 0
0.01,2000,I,1.1,0.0001 0 10.01,1000,I,2.0,0.02 0 0
0.01,1000,I,2.0,0.0001 0 00.01,1000,I,1.5,0.02 0 0
0.01,1000,I,1.5,0.0001 0 00.01,1000,I,1.1,0.02 0 0
0.01,1000,I,1.1,0.0001 0 1
*MAF,POP,MOD,OR,PREV, where MAF represents the minor allele frequency, POP is the number of individuals, MOD is the model used (with or without main effect and with or without epistasis effect), OR is the odds ratio, and PREV is the prevalence of the disease.
Table 3: A table containing the running time, CPU usage, and memory usage in each configuration.
Configuration* Running Time (s) CPU Usage (%) Memory Usage (KB)
0.5,500,I,2.0,0.02 2.70 130.19 320211.28
0.5,500,I,2.0,0.0001 2.69 136.88 319311.360.5,500,I,1.5,0.02 2.68 140.78 319508.72
0.5,500,I,1.5,0.0001 2.69 141.46 320285.240.5,500,I,1.1,0.02 2.73 136.88 320504.08
0.5,500,I,1.1,0.0001 2.70 136.47 319897.040.5,2000,I,2.0,0.02 4.10 156.28 327876.12
0.5,2000,I,2.0,0.0001 4.16 143.03 330393.48
0.5,2000,I,1.5,0.02 4.10 140.41 329206.280.5,2000,I,1.5,0.0001 4.01 136.85 327414.840.5,2000,I,1.1,0.02 3.96 125.00 325492.92
0.5,2000,I,1.1,0.0001 3.97 126.28 325792.920.5,1000,I,2.0,0.02 3.09 141.88 323600.36
0.5,1000,I,2.0,0.0001 3.12 139.30 324334.680.5,1000,I,1.5,0.02 3.08 141.47 323865.08
0.5,1000,I,1.5,0.0001 3.11 140.43 323880.440.5,1000,I,1.1,0.02 3.09 142.06 323780.88
0.5,1000,I,1.1,0.0001 3.12 141.69 323507.800.3,500,I,2.0,0.02 2.75 148.18 321318.64
0.3,500,I,2.0,0.0001 2.73 149.82 319605.000.3,500,I,1.5,0.02 2.73 149.43 321487.72
0.3,500,I,1.5,0.0001 2.75 150.12 320878.400.3,500,I,1.1,0.02 2.74 150.35 320952.24
0.3,500,I,1.1,0.0001 2.74 150.21 319914.160.3,2000,I,2.0,0.02 4.05 124.62 325950.12
0.3,2000,I,2.0,0.0001 4.04 119.74 325417.160.3,2000,I,1.5,0.02 4.04 122.47 325669.04
0.3,2000,I,1.5,0.0001 4.07 126.54 326147.320.3,2000,I,1.1,0.02 4.12 125.71 325679.80
0.3,2000,I,1.1,0.0001 4.11 123.02 325735.240.3,1000,I,2.0,0.02 3.07 118.96 322399.76
0.3,1000,I,2.0,0.0001 3.10 127.03 323056.560.3,1000,I,1.5,0.02 3.07 124.95 322673.52
0.3,1000,I,1.5,0.0001 3.11 131.41 323709.600.3,1000,I,1.1,0.02 3.09 134.61 323485.68
0.3,1000,I,1.1,0.0001 3.09 138.13 323444.760.1,500,I,2.0,0.02 2.75 119.13 320066.32
0.1,500,I,2.0,0.0001 2.74 119.29 319312.120.1,500,I,1.5,0.02 2.73 118.35 320222.28
0.1,500,I,1.5,0.0001 2.77 119.58 319002.320.1,500,I,1.1,0.02 2.77 118.50 320626.68
0.1,500,I,1.1,0.0001 2.76 121.01 320034.200.1,2000,I,2.0,0.02 4.01 119.18 325869.52
0.1,2000,I,2.0,0.0001 4.05 122.05 325484.960.1,2000,I,1.5,0.02 4.07 127.11 326038.04
0.1,2000,I,1.5,0.0001 4.09 126.69 326636.80
0.1,2000,I,1.1,0.02 4.10 127.66 326390.360.1,2000,I,1.1,0.0001 4.12 126.83 326720.760.1,1000,I,2.0,0.02 3.13 128.79 323402.72
0.1,1000,I,2.0,0.0001 3.13 128.00 323800.640.1,1000,I,1.5,0.02 3.12 126.40 323558.52
0.1,1000,I,1.5,0.0001 3.14 125.43 323584.040.1,1000,I,1.1,0.02 3.14 126.95 323569.56
0.1,1000,I,1.1,0.0001 3.14 126.27 323193.080.05,500,I,2.0,0.02 2.73 135.34 319177.48
0.05,500,I,2.0,0.0001 2.76 139.71 320980.880.05,500,I,1.5,0.02 2.73 131.66 320560.40
0.05,500,I,1.5,0.0001 2.76 139.02 320381.200.05,500,I,1.1,0.02 2.75 137.41 320737.96
0.05,500,I,1.1,0.0001 2.77 132.74 320620.160.05,2000,I,2.0,0.02 3.85 128.39 325633.16
0.05,2000,I,2.0,0.0001 3.93 135.36 324273.960.05,2000,I,1.5,0.02 3.87 144.42 326558.92
0.05,2000,I,1.5,0.0001 3.88 137.91 325713.840.05,2000,I,1.1,0.02 3.99 131.54 325690.40
0.05,2000,I,1.1,0.0001 3.94 131.49 324629.080.05,1000,I,2.0,0.02 2.94 147.28 323110.24
0.05,1000,I,2.0,0.0001 3.00 149.84 323443.360.05,1000,I,1.5,0.02 3.00 146.13 323144.92
0.05,1000,I,1.5,0.0001 3.02 143.14 323136.720.05,1000,I,1.1,0.02 3.00 143.31 323410.08
0.05,1000,I,1.1,0.0001 3.02 146.23 323356.000.01,500,I,2.0,0.02 2.63 154.11 320784.96
0.01,500,I,2.0,0.0001 2.65 150.07 320432.160.01,500,I,1.5,0.02 2.64 126.83 320529.56
0.01,500,I,1.5,0.0001 2.75 129.40 319814.800.01,500,I,1.1,0.02 2.76 129.15 320633.56
0.01,500,I,1.1,0.0001 2.72 182.19 321332.200.01,2000,I,2.0,0.02 3.99 130.97 325901.32
0.01,2000,I,2.0,0.0001 4.03 129.72 325971.000.01,2000,I,1.5,0.02 4.06 126.38 325816.40
0.01,2000,I,1.5,0.0001 4.02 110.41 324423.520.01,2000,I,1.1,0.02 4.00 121.24 325429.32
0.01,2000,I,1.1,0.0001 4.04 128.06 326333.80
0.01,1000,I,2.0,0.02 3.06 127.62 323421.920.01,1000,I,2.0,0.0001 3.07 127.73 323639.960.01,1000,I,1.5,0.02 3.07 126.69 323483.56
0.01,1000,I,1.5,0.0001 3.05 156.55 325006.560.01,1000,I,1.1,0.02 3.03 163.41 324945.28
0.01,1000,I,1.1,0.0001 3.01 156.46 320749.28
*MAF,POP,MOD,OR,PREV, where MAF represents the minor allele frequency, POP is the number of individuals, MOD is the model used (with or without main effect and with or without epistasis effect), OR is the odds ratio, and PREV is the prevalence of the disease.
Laboratory Note
Genetic Epistasis VI - Assessing Algorithm SNPHarvester

LN-6-2014

Ricardo Pinho and Rui Camacho
FEUP
Rua Dr. Roberto Frias, s/n,
4200-465 PORTO
Portugal
Fax: (+351) 22 508 1440
e-mail: [email protected]
www: http://www.fe.up.pt/∼
e-mail: [email protected]
www: http://www.fe.up.pt/∼rcamacho
May 2014
Abstract
In this lab note, the algorithm SNPHarvester is presented and tested. The algorithm takes a stochastic approach that searches for SNPs relevant to main effect and epistatic interactions, using a PathSeeker algorithm and identifying relevant results with the χ2 test. The results show that both the Power and the Type 1 Error Rate of the algorithm are high for main effect and full effect detection, while epistasis detection shows good results with very low error rates. The scalability test suggests that the algorithm may have problems with larger data sets.
1 Introduction
SNPHarvester [YHW+09] works as a stochastic algorithm, generating multiple paths among the many SNPs, which are joined into groups. Significant groups are selected if their scores are above a predetermined statistical threshold. The score function used to measure the association between a k-SNP group, where k is the number of SNPs in the epistatic interaction, and the phenotype is the χ2 test. For this purpose, a PathSeeker algorithm was developed: it randomly starts a new path and, for each group, tries to increase the score by changing only one SNP in the active set at a time, converging to a local optimum, typically in two or three iterations. The evaluation is based on the χ2 value, with a significance threshold of α = 0.05 after Bonferroni correction. A post-processing stage is applied to eliminate k-SNP groups that may be significant only due to a sub-group, and SNPs that may show a falsely strong association due to a small marginal effect. An L2-penalized logistic regression is used to filter out these interactions:
L(β0, β, λ) = −l(β0, β) + (λ/2)‖β‖²    (1)
where l(β0, β) is the binomial log-likelihood and λ is a regularization parameter. The difference between SNPHarvester and the other algorithms is that SNPHarvester focuses on local optima instead of a global optimum. Each local optimum is significant because there are usually multiple interaction patterns. SNPHarvester also uses sequential rather than parallel optimization, removing local optima during the search process, so the search space becomes smaller in later stages. Finally, SNPHarvester uses a model-free approach, randomly creating paths to directly detect significant associations.
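The one-SNP-at-a-time local search described above can be sketched as a first-improvement greedy loop. This is an illustration of the idea, not the released PathSeeker code: `score` stands for any group-scoring function (e.g. the χ2 statistic of the group against the phenotype), and all names are ours.

```python
import random

def path_seeker(n_snps, k, score, rng=random):
    # Start from a random k-SNP group, then repeatedly swap a single SNP
    # for any replacement that improves the score, stopping when no swap
    # helps, i.e. at a local optimum.
    group = rng.sample(range(n_snps), k)
    best = score(group)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(n_snps):
                if cand in group:
                    continue
                trial = group[:pos] + [cand] + group[pos + 1:]
                s = score(trial)
                if s > best:
                    group, best = trial, s
                    improved = True
    return sorted(group), best
```

Restarting this loop from many random paths yields the multiple local optima the algorithm collects; the Bonferroni-corrected χ2 threshold and the L2-penalized logistic regression filter are then applied on top and are not shown here.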
1.1 Input files
The input of SNPHarvester consists of the name of each column in the first row, i.e. the SNPs, with the label (phenotype) as the last column. The following rows contain the genotypes {0, 1, 2} as homozygous dominant, heterozygous, and homozygous recessive, respectively. The label {0, 1} corresponds to control and case, respectively.
X1, X2, X3, X4, Label
1, 1, 0, 2, 0
1, 1, 2, 1, 1
1, 2, 2, 0, 1
Table 1: An example of the input file containing genotype and phenotypeinformation with 4 SNPs and 3 individuals.
1.2 Output files
The algorithm's output contains the final extracted single or interacting SNPs, with the χ2 value of each specific interaction or single SNP, and the running time of the algorithm.
1.3 Parameters
There are two modes: a "Threshold-Based" mode, where the program outputs all of the significant SNPs above a user-specified significance threshold, and a "Top-K Based" mode, where the program outputs a specified number of SNP interactions regardless of their significance level. Both modes have parameters to choose the minimum and maximum number of interacting SNPs to be detected. If the minimum is 1, main effects of SNPs are also tested.
2 Experimental Settings
The data sets used in the experiments are characterized in Lab Note 1. The computer used for these experiments ran the 64-bit Ubuntu 13.10 operating system, with an Intel(R) Core(TM)2 Quad CPU Q6600 2.40GHz processor and 8.00 GB of RAM. SNPHarvester provides a Java program. The "Threshold-Based" mode was chosen for this analysis, with a significance level of α = 0.05. The heap size is set to -Xmx7000M. Main effects and pairwise interactions are tested in this experiment.
3 Results
SNPHarvester performs epistasis detection, main effect detection, and full effect detection. All data set configurations were used in this experiment. Figure 1 shows the Power obtained in relation to allele frequency and population size for epistasis (a), main effect (b), and epistasis + main effect (c). The
results show that the Power is higher in main effect detection than in epistasis detection overall, reaching 100% Power in data sets with 0.3 and 0.5 allele frequency. Epistasis detection shows a much lower result, with 0.1 allele frequency and 2000 individuals having the best Power for epistasis detection, at 85%. However, significant results can be seen with allele frequency as low as 0.05, which is not true for main effect detection. In full effect detection, the results are very similar to main effect detection.
Regarding scalability, Figure 2 shows a very significant difference in running time (a) between the data sets with 500 individuals, running for an average of 9.29 seconds, and the data sets with 2000 individuals, with an average of 33 seconds. This roughly linear growth, together with the slight increase in memory usage (c), reveals a scalability problem. The CPU usage (b) is near 100% across all data set sizes.
Type 1 Error Rates show a concerning increase in main effect and full effect data sets, relative to epistasis detection. This disproportion is due to the ease of main effect detection, which reveals highly valued ground truths, but also increases the chances of detecting false positives, even if their statistical significance is much lower than that of the ground truth. The Power is still higher than the Type 1 Error Rate in most cases, with the exception of high allele frequencies and large populations, which reveal error rates of 100%. This is not true for epistasis detection, which has a maximum error rate of 27% for data sets with 2000 individuals and an allele frequency of 0.05. The other configurations show a slight increase in error rate with the increase in data set population size. There is no clear Type 1 Error Rate difference between allele frequencies for epistasis detection.
The relation of Power with population and allele frequency is reinforced in Figures 4 and 7. However, the Power by allele frequency in epistasis detection shows a peak at 0.1 minor allele frequency and a descent for higher allele frequencies. Figure 6 shows a slight but not significant increase of Power with prevalence for epistasis detection and a slight decrease for main and full effect detection. The linear increase in Power with odds ratio shown in Figure 5 is similar to the distribution of Power by population.
[Bar graphs: Power (%) by allele frequency (0.01, 0.05, 0.1, 0.3, 0.5) for 500, 1000, and 2000 individuals. Panels: (a) Epistasis, (b) Main Effect, (c) Full Effect.]
Figure 1: Power by allele frequency. For each frequency, three sizes of data sets were used to measure the Power, with an odds ratio of 2.0 and a prevalence of 0.02. The Power is measured by the amount of data sets where the ground truth was amongst the most relevant results, out of all 100 data sets. (a), (b), and (c) represent epistatic, main effect, and main effect + epistatic interactions.
[Bar graphs by number of individuals (500, 1000, 2000). Panels: (a) Average Running Time (seconds), (b) Average CPU Usage (%), (c) Average Memory Usage (Mbytes).]
Figure 2: Comparison of scalability measures between different sized data sets. The data sets have a minor allele frequency of 0.5, an odds ratio of 2.0, a prevalence of 0.02, and use the full effect disease model.
[Bar graphs: Type 1 Error Rate (%) by allele frequency (0.01, 0.05, 0.1, 0.3, 0.5) for 500, 1000, and 2000 individuals. Panels: (a) Epistasis, (b) Main Effect, (c) Full Effect.]
Figure 3: Type 1 Error Rate by allele frequency. For each frequency, three sizes of data sets were used to measure the Type 1 Error Rate, with an odds ratio of 2.0 and a prevalence of 0.02. The Type 1 Error Rate is measured by the amount of data sets where false positives were amongst the most relevant results, out of all 100 data sets. (a), (b), and (c) represent epistatic, main effect, and main effect + epistatic interactions.
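The Power and Type 1 Error Rate measures used in these figures can be computed from per-data-set outcomes. A minimal sketch (the function name and input encoding are ours):

```python
def power_and_type1(results):
    """results: one (found_ground_truth, reported_false_positive) pair
    of booleans per simulated data set. Power = % of data sets where
    the ground-truth interaction was among the reported results;
    Type 1 Error Rate = % of data sets reporting any false positive."""
    n = len(results)
    power = 100.0 * sum(tp for tp, _ in results) / n
    type1 = 100.0 * sum(fp for _, fp in results) / n
    return power, type1
```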
4 Summary
In this experiment, SNPHarvester was tested using many data sets with significantly different configurations. The results show that the algorithm has a high Power in main effect and full effect detection, but also a high Type 1 Error Rate. For epistasis, the Power is lower but the Type 1 Error Rate values are very low. There is a linear increase of Power with the number of individuals and with odds ratio, and a significant increase with allele frequency. The algorithm shows scalability problems, due to the large increase in running time, which may be critical in genome wide studies.
References
[YHW+09] Can Yang, Zengyou He, Xiang Wan, Qiang Yang, Hong Xue, and Weichuan Yu. SNPHarvester: a filtering-based approach for detecting epistatic interactions in genome-wide association studies. Bioinformatics (Oxford, England), 25:504–511, 2009.
A Bar Graph
[Bar graphs: Power (%) by population (500, 1000, 2000). Panels: (a) Epistasis, (b) Main Effect, (c) Full Effect.]
Figure 4: Distribution of the Power by population for all disease models. The allele frequency is 0.1, the odds ratio is 2.0, and the prevalence is 0.02.
[Bar graphs: Power (%) by odds ratio (1.1, 1.5, 2.0). Panels: (a) Epistasis, (b) Main Effect, (c) Full Effect.]
Figure 5: Distribution of the Power by odds ratio for all disease models. The allele frequency is 0.1, the population size is 2000 individuals, and the prevalence is 0.02.
[Bar graphs: Power (%) by prevalence (0.0001, 0.02). Panels: (a) Epistasis, (b) Main Effect, (c) Full Effect.]
Figure 6: Distribution of the Power by prevalence for all disease models. The allele frequency is 0.1, the odds ratio is 2.0, and the population size is 2000 individuals.
[Bar graphs: Power (%) by allele frequency (0.01, 0.05, 0.1, 0.3, 0.5). Panels: (a) Epistasis, (b) Main Effect, (c) Full Effect.]
Figure 7: Distribution of the Power by allele frequency for all disease models. The population size is 2000 individuals, the odds ratio is 2.0, and the prevalence is 0.02.
B Table of Results
Table 2: A table containing the percentage of true positives and false positives in each configuration. The first column contains the description of the configuration. The second and third columns contain the number of data sets with true positives and false positives, respectively, out of all 100 data sets per configuration.
Configuration* TP (%) FP (%)
0.01,500,ME+I,2.0,0.02 0 2
0.01,500,ME+I,2.0,0.0001 0 50.01,500,ME+I,1.5,0.02 0 7
0.01,500,ME+I,1.5,0.0001 0 50.01,500,ME+I,1.1,0.02 0 1
0.01,500,ME+I,1.1,0.0001 0 60.01,500,ME,2.0,0.02 0 1
0.01,500,ME,2.0,0.0001 0 60.01,500,ME,1.5,0.02 0 1
0.01,500,ME,1.5,0.0001 0 60.01,500,ME,1.1,0.02 0 1
0.01,500,ME,1.1,0.0001 0 60.01,500,I,2.0,0.02 0 4
0.01,500,I,2.0,0.0001 0 50.01,500,I,1.5,0.02 0 5
0.01,500,I,1.5,0.0001 0 100.01,500,I,1.1,0.02 0 3
0.01,500,I,1.1,0.0001 0 60.01,2000,ME+I,2.0,0.02 0 1
0.01,2000,ME+I,2.0,0.0001 0 30.01,2000,ME+I,1.5,0.02 0 6
0.01,2000,ME+I,1.5,0.0001 0 40.01,2000,ME+I,1.1,0.02 0 6
0.01,2000,ME+I,1.1,0.0001 0 60.01,2000,ME,2.0,0.02 0 1
0.01,2000,ME,2.0,0.0001 0 30.01,2000,ME,1.5,0.02 0 6
0.01,2000,ME,1.5,0.0001 0 40.01,2000,ME,1.1,0.02 0 6
0.01,2000,ME,1.1,0.0001 0 40.01,2000,I,2.0,0.02 0 2
0.01,2000,I,2.0,0.0001 0 90.01,2000,I,1.5,0.02 0 8
0.01,2000,I,1.5,0.0001 0 120.01,2000,I,1.1,0.02 0 7
0.01,2000,I,1.1,0.0001 0 60.01,1000,ME+I,2.0,0.02 0 4
0.01,1000,ME+I,2.0,0.0001 0 30.01,1000,ME+I,1.5,0.02 0 10
0.01,1000,ME+I,1.5,0.0001 0 50.01,1000,ME+I,1.1,0.02 0 3
0.01,1000,ME+I,1.1,0.0001 0 50.01,1000,ME,2.0,0.02 0 10
0.01,1000,ME,2.0,0.0001 0 50.01,1000,ME,1.5,0.02 0 11
0.01,1000,ME,1.5,0.0001 0 50.01,1000,ME,1.1,0.02 0 3
0.01,1000,ME,1.1,0.0001 0 50.01,1000,I,2.0,0.02 0 4
0.01,1000,I,2.0,0.0001 0 40.01,1000,I,1.5,0.02 0 4
0.01,1000,I,1.5,0.0001 0 50.01,1000,I,1.1,0.02 0 6
0.01,1000,I,1.1,0.0001 0 50.05,500,ME+I,2.0,0.02 0 8
0.05,500,ME+I,2.0,0.0001 0 80.05,500,ME+I,1.5,0.02 0 4
0.05,500,ME+I,1.5,0.0001 0 50.05,500,ME+I,1.1,0.02 0 3
0.05,500,ME+I,1.1,0.0001 0 70.05,500,ME,2.0,0.02 0 5
0.05,500,ME,2.0,0.0001 0 70.05,500,ME,1.5,0.02 0 5
0.05,500,ME,1.5,0.0001 0 60.05,500,ME,1.1,0.02 0 7
0.05,500,ME,1.1,0.0001 0 40.05,500,I,2.0,0.02 0 4
0.05,500,I,2.0,0.0001 0 60.05,500,I,1.5,0.02 0 5
0.05,500,I,1.5,0.0001 0 110.05,500,I,1.1,0.02 0 5
0.05,500,I,1.1,0.0001 0 30.05,2000,ME+I,2.0,0.02 0 20
0.05,2000,ME+I,2.0,0.0001 26 420.05,2000,ME+I,1.5,0.02 0 13
0.05,2000,ME+I,1.5,0.0001 1 160.05,2000,ME+I,1.1,0.02 0 4
0.05,2000,ME+I,1.1,0.0001 0 90.05,2000,ME,2.0,0.02 1 24
0.05,2000,ME,2.0,0.0001 9 300.05,2000,ME,1.5,0.02 0 5
0.05,2000,ME,1.5,0.0001 0 150.05,2000,ME,1.1,0.02 0 6
0.05,2000,ME,1.1,0.0001 0 80.05,2000,I,2.0,0.02 18 27
0.05,2000,I,2.0,0.0001 45 50.05,2000,I,1.5,0.02 39 9
0.05,2000,I,1.5,0.0001 40 90.05,2000,I,1.1,0.02 0 5
0.05,2000,I,1.1,0.0001 0 60.05,1000,ME+I,2.0,0.02 0 8
0.05,1000,ME+I,2.0,0.0001 0 210.05,1000,ME+I,1.5,0.02 0 1
0.05,1000,ME+I,1.5,0.0001 0 70.05,1000,ME+I,1.1,0.02 0 4
0.05,1000,ME+I,1.1,0.0001 0 50.05,1000,ME,2.0,0.02 0 4
0.05,1000,ME,2.0,0.0001 1 260.05,1000,ME,1.5,0.02 0 2
0.05,1000,ME,1.5,0.0001 0 60.05,1000,ME,1.1,0.02 0 7
0.05,1000,ME,1.1,0.0001 0 40.05,1000,I,2.0,0.02 0 13
0.05,1000,I,2.0,0.0001 2 50.05,1000,I,1.5,0.02 0 4
0.05,1000,I,1.5,0.0001 1 40.05,1000,I,1.1,0.02 0 3
0.05,1000,I,1.1,0.0001 0 110.1,500,ME+I,2.0,0.02 0 9
0.1,500,ME+I,2.0,0.0001 41 380.1,500,ME+I,1.5,0.02 0 4
0.1,500,ME+I,1.5,0.0001 1 90.1,500,ME+I,1.1,0.02 0 3
0.1,500,ME+I,1.1,0.0001 1 40.1,500,ME,2.0,0.02 0 11
0.1,500,ME,2.0,0.0001 13 200.1,500,ME,1.5,0.02 0 7
0.1,500,ME,1.5,0.0001 0 70.1,500,ME,1.1,0.02 0 6
0.1,500,ME,1.1,0.0001 0 60.1,500,I,2.0,0.02 0 7
0.1,500,I,2.0,0.0001 1 30.1,500,I,1.5,0.02 0 5
0.1,500,I,1.5,0.0001 0 10.1,500,I,1.1,0.02 0 5
0.1,500,I,1.1,0.0001 0 70.1,2000,ME+I,2.0,0.02 95 79
0.1,2000,ME+I,2.0,0.0001 100 990.1,2000,ME+I,1.5,0.02 66 40
0.1,2000,ME+I,1.5,0.0001 61 480.1,2000,ME+I,1.1,0.02 3 8
0.1,2000,ME+I,1.1,0.0001 3 170.1,2000,ME,2.0,0.02 92 79
0.1,2000,ME,2.0,0.0001 99 880.1,2000,ME,1.5,0.02 25 20
0.1,2000,ME,1.5,0.0001 48 410.1,2000,ME,1.1,0.02 2 11
0.1,2000,ME,1.1,0.0001 1 70.1,2000,I,2.0,0.02 85 19
0.1,2000,I,2.0,0.0001 74 110.1,2000,I,1.5,0.02 41 9
0.1,2000,I,1.5,0.0001 23 60.1,2000,I,1.1,0.02 0 7
0.1,2000,I,1.1,0.0001 2 90.1,1000,ME+I,2.0,0.02 32 27
0.1,1000,ME+I,2.0,0.0001 97 740.1,1000,ME+I,1.5,0.02 1 11
0.1,1000,ME+I,1.5,0.0001 13 120.1,1000,ME+I,1.1,0.02 0 7
0.1,1000,ME+I,1.1,0.0001 0 80.1,1000,ME,2.0,0.02 38 22
0.1,1000,ME,2.0,0.0001 59 430.1,1000,ME,1.5,0.02 2 9
0.1,1000,ME,1.5,0.0001 6 120.1,1000,ME,1.1,0.02 0 7
0.1,1000,ME,1.1,0.0001 0 120.1,1000,I,2.0,0.02 21 9
0.1,1000,I,2.0,0.0001 9 90.1,1000,I,1.5,0.02 1 2
0.1,1000,I,1.5,0.0001 1 40.1,1000,I,1.1,0.02 0 6
0.1,1000,I,1.1,0.0001 0 120.3,500,ME+I,2.0,0.02 100 100
0.3,500,ME+I,2.0,0.0001 100 1000.3,500,ME+I,1.5,0.02 100 75
0.3,500,ME+I,1.5,0.0001 100 960.3,500,ME+I,1.1,0.02 77 21
0.3,500,ME+I,1.1,0.0001 93 280.3,500,ME,2.0,0.02 100 78
0.3,500,ME,2.0,0.0001 100 890.3,500,ME,1.5,0.02 89 27
0.3,500,ME,1.5,0.0001 89 370.3,500,ME,1.1,0.02 25 13
0.3,500,ME,1.1,0.0001 25 90.3,500,I,2.0,0.02 4 3
0.3,500,I,2.0,0.0001 20 80.3,500,I,1.5,0.02 1 3
0.3,500,I,1.5,0.0001 3 60.3,500,I,1.1,0.02 0 3
0.3,500,I,1.1,0.0001 0 60.3,2000,ME+I,2.0,0.02 100 100
0.3,2000,ME+I,2.0,0.0001 100 1000.3,2000,ME+I,1.5,0.02 100 100
0.3,2000,ME+I,1.5,0.0001 100 1000.3,2000,ME+I,1.1,0.02 100 99
0.3,2000,ME+I,1.1,0.0001 100 1000.3,2000,ME,2.0,0.02 100 100
0.3,2000,ME,2.0,0.0001 100 1000.3,2000,ME,1.5,0.02 100 100
0.3,2000,ME,1.5,0.0001 100 1000.3,2000,ME,1.1,0.02 100 67
0.3,2000,ME,1.1,0.0001 100 620.3,2000,I,2.0,0.02 70 11
0.3,2000,I,2.0,0.0001 73 200.3,2000,I,1.5,0.02 58 8
0.3,2000,I,1.5,0.0001 53 70.3,2000,I,1.1,0.02 1 5
0.3,2000,I,1.1,0.0001 1 80.3,1000,ME+I,2.0,0.02 100 100
0.3,1000,ME+I,2.0,0.0001 100 1000.3,1000,ME+I,1.5,0.02 100 99
0.3,1000,ME+I,1.5,0.0001 100 1000.3,1000,ME+I,1.1,0.02 100 66
0.3,1000,ME+I,1.1,0.0001 100 690.3,1000,ME,2.0,0.02 100 99
0.3,1000,ME,2.0,0.0001 100 1000.3,1000,ME,1.5,0.02 100 78
0.3,1000,ME,1.5,0.0001 100 750.3,1000,ME,1.1,0.02 93 30
0.3,1000,ME,1.1,0.0001 84 330.3,1000,I,2.0,0.02 43 9
0.3,1000,I,2.0,0.0001 79 90.3,1000,I,1.5,0.02 30 3
0.3,1000,I,1.5,0.0001 27 90.3,1000,I,1.1,0.02 0 4
0.3,1000,I,1.1,0.0001 0 50.5,500,ME+I,2.0,0.02 100 100
0.5,500,ME+I,2.0,0.0001 100 1000.5,500,ME+I,1.5,0.02 100 100
0.5,500,ME+I,1.5,0.0001 100 1000.5,500,ME+I,1.1,0.02 100 79
0.5,500,ME+I,1.1,0.0001 100 890.5,500,ME,2.0,0.02 100 99
0.5,500,ME,2.0,0.0001 100 970.5,500,ME,1.5,0.02 100 63
0.5,500,ME,1.5,0.0001 100 620.5,500,ME,1.1,0.02 80 27
0.5,500,ME,1.1,0.0001 79 280.5,500,I,2.0,0.02 2 4
0.5,500,I,2.0,0.0001 4 50.5,500,I,1.5,0.02 1 4
0.5,500,I,1.5,0.0001 0 70.5,500,I,1.1,0.02 0 1
0.5,500,I,1.1,0.0001 0 80.5,2000,ME+I,2.0,0.02 100 100
0.5,2000,ME+I,2.0,0.0001 99 990.5,2000,ME+I,1.5,0.02 100 100
0.5,2000,ME+I,1.5,0.0001 100 1000.5,2000,ME+I,1.1,0.02 100 100
0.5,2000,ME+I,1.1,0.0001 100 1000.5,2000,ME,2.0,0.02 100 100
0.5,2000,ME,2.0,0.0001 100 1000.5,2000,ME,1.5,0.02 100 100
0.5,2000,ME,1.5,0.0001 100 1000.5,2000,ME,1.1,0.02 100 100
0.5,2000,ME,1.1,0.0001 100 980.5,2000,I,2.0,0.02 33 5
0.5,2000,I,2.0,0.0001 78 90.5,2000,I,1.5,0.02 65 2
0.5,2000,I,1.5,0.0001 21 20.5,2000,I,1.1,0.02 7 8
0.5,2000,I,1.1,0.0001 2 70.5,1000,ME+I,2.0,0.02 100 100
0.5,1000,ME+I,2.0,0.0001 100 1000.5,1000,ME+I,1.5,0.02 100 100
0.5,1000,ME+I,1.5,0.0001 100 1000.5,1000,ME+I,1.1,0.02 100 100
0.5,1000,ME+I,1.1,0.0001 100 1000.5,1000,ME,2.0,0.02 100 100
0.5,1000,ME,2.0,0.0001 100 1000.5,1000,ME,1.5,0.02 100 100
0.5,1000,ME,1.5,0.0001 100 970.5,1000,ME,1.1,0.02 100 64
0.5,1000,ME,1.1,0.0001 100 690.5,1000,I,2.0,0.02 14 3
0.5,1000,I,2.0,0.0001 52 60.5,1000,I,1.5,0.02 28 5
0.5,1000,I,1.5,0.0001 1 50.5,1000,I,1.1,0.02 0 3
0.5,1000,I,1.1,0.0001 0 9
*MAF,POP,MOD,OR,PREV where MAF represents the minor allele frequency, POP is the number of individuals, MOD is the used model (with or without main effect and with or without epistasis effect), OR is the odds ratio, and PREV is the prevalence of the disease.
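A configuration key such as those in Table 2 can be decoded mechanically. A sketch (the function name is ours; field names follow the footnote above):

```python
def parse_configuration(key):
    """Parse a 'MAF,POP,MOD,OR,PREV' configuration key from the
    results tables into a dictionary of typed fields."""
    maf, pop, mod, odds, prev = key.split(",")
    return {
        "maf": float(maf),          # minor allele frequency
        "population": int(pop),     # number of individuals
        "model": mod,               # ME, I, or ME+I
        "odds_ratio": float(odds),
        "prevalence": float(prev),
    }
```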
Table 3: A table containing the running time, CPU usage, and memory usage for each configuration.
Configuration* Running Time (s) CPU Usage (%) Memory Usage (KB)
0.5,500,ME+I,2.0,0.02 0 75.44 8975.12
0.5,500,ME+I,2.0,0.0001 0 76.83 8975.000.5,500,ME+I,1.5,0.02 0.07 73.91 9593.72
0.5,500,ME+I,1.5,0.0001 0 74.17 8975.280.5,500,ME+I,1.1,0.02 0 78.78 8975.20
0.5,500,ME+I,1.1,0.0001 0 76.66 8974.800.5,500,ME,2.0,0.02 0 77.81 8974.84
0.5,500,ME,2.0,0.0001 0 77.69 8975.600.5,500,ME,1.5,0.02 0 78.18 8975.36
0.5,500,ME,1.5,0.0001 0 76.23 8975.160.5,500,ME,1.1,0.02 0 80.54 8975.24
0.5,500,ME,1.1,0.0001 0 78.98 8975.040.5,500,I,2.0,0.02 0 76.74 8975.16
0.5,500,I,2.0,0.0001 0 75.78 8974.760.5,500,I,1.5,0.02 0 77.19 8975.40
0.5,500,I,1.5,0.0001 0 78.27 8974.640.5,500,I,1.1,0.02 0 78.71 8975.20
0.5,500,I,1.1,0.0001 0 78.75 8975.00
0.5,2000,ME+I,2.0,0.02 1.03 84.06 11814.040.5,2000,ME+I,2.0,0.0001 35.91 100.13 76053.200.5,2000,ME+I,1.5,0.02 45.54 99.68 77689.64
0.5,2000,ME+I,1.5,0.0001 47.30 99.04 78211.720.5,2000,ME+I,1.1,0.02 53.11 99.33 78391.92
0.5,2000,ME+I,1.1,0.0001 51.93 98.62 79691.560.5,2000,ME,2.0,0.02 54.63 100.05 77422.96
0.5,2000,ME,2.0,0.0001 54.31 99.80 79153.360.5,2000,ME,1.5,0.02 44.44 101.10 76040.16
0.5,2000,ME,1.5,0.0001 39.89 101.33 75383.840.5,2000,ME,1.1,0.02 20.10 100.50 72422.88
0.5,2000,ME,1.1,0.0001 18.37 101.77 72461.680.5,2000,I,2.0,0.02 13.30 101.99 71876.44
0.5,2000,I,2.0,0.0001 14.32 101.73 70963.000.5,2000,I,1.5,0.02 14.52 102.26 70528.40
0.5,2000,I,1.5,0.0001 12.63 102.34 71372.920.5,2000,I,1.1,0.02 12.16 102.54 73187.88
0.5,2000,I,1.1,0.0001 11.82 101.52 71635.040.5,1000,ME+I,2.0,0.02 25.89 86.51 73035.92
0.5,1000,ME+I,2.0,0.0001 26.33 89.30 73462.040.5,1000,ME+I,1.5,0.02 26.16 98.22 73189.12
0.5,1000,ME+I,1.5,0.0001 25.22 100.91 73075.680.5,1000,ME+I,1.1,0.02 14.21 103.72 71475.92
0.5,1000,ME+I,1.1,0.0001 12.89 104.18 71463.920.5,1000,ME,2.0,0.02 19.63 102.59 72507.32
0.5,1000,ME,2.0,0.0001 17.44 103.13 72036.120.5,1000,ME,1.5,0.02 10.19 105.15 70972.68
0.5,1000,ME,1.5,0.0001 9.25 105.96 70377.840.5,1000,ME,1.1,0.02 6.62 107.41 69163.76
0.5,1000,ME,1.1,0.0001 6.88 109.39 69143.720.5,1000,I,2.0,0.02 6.60 108.50 69107.64
0.5,1000,I,2.0,0.0001 7.38 108.82 68815.080.5,1000,I,1.5,0.02 7.03 100.62 68200.44
0.5,1000,I,1.5,0.0001 6.38 101.91 67529.520.5,1000,I,1.1,0.02 6.47 102.15 68962.80
0.5,1000,I,1.1,0.0001 6.54 102.20 68464.080.3,500,ME+I,2.0,0.02 6.26 104.90 68340.96
0.3,500,ME+I,2.0,0.0001 12.21 102.21 70501.56
0.3,500,ME+I,1.5,0.02 4.13 106.37 66912.320.3,500,ME+I,1.5,0.0001 5.38 105.98 68096.640.3,500,ME+I,1.1,0.02 3.74 105.94 65277.00
0.3,500,ME+I,1.1,0.0001 3.82 105.86 65443.960.3,500,ME,2.0,0.02 4.08 106.11 65608.68
0.3,500,ME,2.0,0.0001 4.43 106.19 67463.800.3,500,ME,1.5,0.02 3.73 106.90 65098.52
0.3,500,ME,1.5,0.0001 3.80 109.81 65059.360.3,500,ME,1.1,0.02 3.67 109.74 64770.44
0.3,500,ME,1.1,0.0001 3.73 105.00 65267.440.3,500,I,2.0,0.02 3.79 102.17 65143.80
0.3,500,I,2.0,0.0001 3.90 105.07 65876.560.3,500,I,1.5,0.02 3.68 105.09 65198.44
0.3,500,I,1.5,0.0001 3.70 100.02 65044.640.3,500,I,1.1,0.02 3.67 105.56 64918.80
0.3,500,I,1.1,0.0001 3.75 103.87 65366.360.3,2000,ME+I,2.0,0.02 52.54 97.45 78867.88
0.3,2000,ME+I,2.0,0.0001 33.90 101.81 75939.520.3,2000,ME+I,1.5,0.02 50.68 100.31 76332.04
0.3,2000,ME+I,1.5,0.0001 54.60 101.14 75759.680.3,2000,ME+I,1.1,0.02 18.08 104.45 71770.00
0.3,2000,ME+I,1.1,0.0001 22.20 103.66 72230.520.3,2000,ME,2.0,0.02 47.70 101.73 76081.80
0.3,2000,ME,2.0,0.0001 51.00 98.52 77481.440.3,2000,ME,1.5,0.02 23.96 98.85 73314.16
0.3,2000,ME,1.5,0.0001 23.65 99.04 72820.920.3,2000,ME,1.1,0.02 13.11 96.47 71377.16
0.3,2000,ME,1.1,0.0001 12.94 100.18 72271.400.3,2000,I,2.0,0.02 14.49 99.58 70987.12
0.3,2000,I,2.0,0.0001 13.89 100.05 70761.040.3,2000,I,1.5,0.02 14.63 102.00 71405.12
0.3,2000,I,1.5,0.0001 14.30 99.79 71175.520.3,2000,I,1.1,0.02 11.88 100.25 72587.36
0.3,2000,I,1.1,0.0001 12.30 100.42 72231.520.3,1000,ME+I,2.0,0.02 24.07 99.29 72695.04
0.3,1000,ME+I,2.0,0.0001 25.39 99.83 72831.760.3,1000,ME+I,1.5,0.02 18.57 98.83 71546.72
0.3,1000,ME+I,1.5,0.0001 24.64 98.82 72522.36
0.3,1000,ME+I,1.1,0.02 11.58 98.75 71113.480.3,1000,ME+I,1.1,0.0001 11.98 98.73 71254.12
0.3,1000,ME,2.0,0.02 16.77 98.95 71090.840.3,1000,ME,2.0,0.0001 19.44 98.35 71731.320.3,1000,ME,1.5,0.02 12.06 98.94 71088.80
0.3,1000,ME,1.5,0.0001 12.44 98.89 70971.280.3,1000,ME,1.1,0.02 11.00 98.78 71013.08
0.3,1000,ME,1.1,0.0001 11.27 98.53 70955.920.3,1000,I,2.0,0.02 12.35 98.54 71127.96
0.3,1000,I,2.0,0.0001 13.31 98.89 70009.720.3,1000,I,1.5,0.02 12.28 98.94 71714.48
0.3,1000,I,1.5,0.0001 12.17 98.91 71538.400.3,1000,I,1.1,0.02 11.20 98.96 71779.92
0.3,1000,I,1.1,0.0001 11.12 98.62 71854.360.1,500,ME+I,2.0,0.02 6.07 99.82 70315.16
0.1,500,ME+I,2.0,0.0001 6.13 99.08 69049.480.1,500,ME+I,1.5,0.02 6.06 99.41 70168.84
0.1,500,ME+I,1.5,0.0001 6.12 98.96 69809.040.1,500,ME+I,1.1,0.02 6.13 99.57 70186.84
0.1,500,ME+I,1.1,0.0001 6.14 99.51 70018.720.1,500,ME,2.0,0.02 6.12 99.33 70033.16
0.1,500,ME,2.0,0.0001 6.15 99.98 69367.680.1,500,ME,1.5,0.02 6.16 99.87 70112.36
0.1,500,ME,1.5,0.0001 6.16 99.11 70216.160.1,500,ME,1.1,0.02 6.12 99.47 70135.28
0.1,500,ME,1.1,0.0001 6.13 99.55 70127.360.1,500,I,2.0,0.02 6.11 99.21 70007.60
0.1,500,I,2.0,0.0001 6.11 98.95 70187.960.1,500,I,1.5,0.02 6.16 98.99 70377.76
0.1,500,I,1.5,0.0001 6.08 99.29 70226.680.1,500,I,1.1,0.02 6.11 98.84 70228.80
0.1,500,I,1.1,0.0001 6.15 99.60 70123.920.1,2000,ME+I,2.0,0.02 24.39 98.80 72612.80
0.1,2000,ME+I,2.0,0.0001 35.63 98.93 74894.120.1,2000,ME+I,1.5,0.02 21.49 98.70 72437.88
0.1,2000,ME+I,1.5,0.0001 22.66 98.90 72547.360.1,2000,ME+I,1.1,0.02 20.98 98.79 73179.28
0.1,2000,ME+I,1.1,0.0001 21.20 98.72 73060.00
0.1,2000,ME,2.0,0.02 23.49 98.81 72854.640.1,2000,ME,2.0,0.0001 27.55 98.84 73112.760.1,2000,ME,1.5,0.02 20.96 98.65 72501.96
0.1,2000,ME,1.5,0.0001 22.40 98.84 72125.240.1,2000,ME,1.1,0.02 20.94 98.07 73506.48
0.1,2000,ME,1.1,0.0001 21.46 99.20 73131.600.1,2000,I,2.0,0.02 24.99 98.67 71017.84
0.1,2000,I,2.0,0.0001 25.33 98.95 71315.960.1,2000,I,1.5,0.02 24.72 99.68 72858.24
0.1,2000,I,1.5,0.0001 23.07 99.88 72926.080.1,2000,I,1.1,0.02 21.07 101.62 73633.20
0.1,2000,I,1.1,0.0001 21.58 101.93 73469.880.1,1000,ME+I,2.0,0.02 11.19 104.18 71543.48
0.1,1000,ME+I,2.0,0.0001 13.23 103.98 71146.120.1,1000,ME+I,1.5,0.02 11.13 104.34 71897.60
0.1,1000,ME+I,1.5,0.0001 11.32 104.25 71671.520.1,1000,ME+I,1.1,0.02 11.01 104.52 72502.08
0.1,1000,ME+I,1.1,0.0001 11.26 104.54 72783.760.1,1000,ME,2.0,0.02 11.10 104.39 71333.52
0.1,1000,ME,2.0,0.0001 11.59 104.16 71271.440.1,1000,ME,1.5,0.02 11.11 104.37 71968.00
0.1,1000,ME,1.5,0.0001 11.11 104.28 71837.240.1,1000,ME,1.1,0.02 11.00 104.53 72532.48
0.1,1000,ME,1.1,0.0001 11.18 104.47 72076.520.1,1000,I,2.0,0.02 11.52 104.40 71641.20
0.1,1000,I,2.0,0.0001 11.29 104.52 72135.840.1,1000,I,1.5,0.02 11.12 104.56 72310.20
0.1,1000,I,1.5,0.0001 11.12 104.58 72486.120.1,1000,I,1.1,0.02 11.11 104.71 72571.52
0.1,1000,I,1.1,0.0001 11.16 104.38 72365.840.05,500,ME+I,2.0,0.02 6.23 108.59 69799.12
0.05,500,ME+I,2.0,0.0001 6.14 108.27 69907.080.05,500,ME+I,1.5,0.02 6.13 108.57 70177.28
0.05,500,ME+I,1.5,0.0001 6.15 108.44 70259.120.05,500,ME+I,1.1,0.02 6.20 108.71 70280.92
0.05,500,ME+I,1.1,0.0001 6.13 108.78 69942.240.05,500,ME,2.0,0.02 6.19 108.97 70123.56
0.05,500,ME,2.0,0.0001 6.13 108.19 69865.56
0.05,500,ME,1.5,0.02 6.14 109.00 70221.360.05,500,ME,1.5,0.0001 6.14 108.88 69994.160.05,500,ME,1.1,0.02 6.12 108.32 69700.40
0.05,500,ME,1.1,0.0001 6.11 108.61 70141.160.05,500,I,2.0,0.02 6.14 108.80 70028.96
0.05,500,I,2.0,0.0001 6.16 108.74 70185.160.05,500,I,1.5,0.02 6.09 108.90 69967.76
0.05,500,I,1.5,0.0001 6.18 108.73 69698.040.05,500,I,1.1,0.02 6.22 107.68 69844.80
0.05,500,I,1.1,0.0001 6.21 101.49 69872.320.05,2000,ME+I,2.0,0.02 21.74 96.81 72821.72
0.05,2000,ME+I,2.0,0.0001 22.90 92.35 72652.160.05,2000,ME+I,1.5,0.02 21.44 97.61 73224.84
0.05,2000,ME+I,1.5,0.0001 21.35 97.65 73010.560.05,2000,ME+I,1.1,0.02 21.32 100.62 73583.24
0.05,2000,ME+I,1.1,0.0001 21.79 100.34 73084.720.05,2000,ME,2.0,0.02 21.33 102.42 72800.80
0.05,2000,ME,2.0,0.0001 22.45 100.06 72805.600.05,2000,ME,1.5,0.02 22.20 97.06 73334.84
0.05,2000,ME,1.5,0.0001 21.67 101.75 73099.680.05,2000,ME,1.1,0.02 21.64 102.07 73360.48
0.05,2000,ME,1.1,0.0001 21.36 102.47 73394.800.05,2000,I,2.0,0.02 21.75 101.65 72672.52
0.05,2000,I,2.0,0.0001 24.67 99.33 72954.520.05,2000,I,1.5,0.02 23.76 99.50 72027.12
0.05,2000,I,1.5,0.0001 24.41 99.04 72224.640.05,2000,I,1.1,0.02 21.99 98.53 73594.44
0.05,2000,I,1.1,0.0001 21.83 99.55 73442.160.05,1000,ME+I,2.0,0.02 11.21 102.98 71934.28
0.05,1000,ME+I,2.0,0.0001 11.22 102.77 71696.960.05,1000,ME+I,1.5,0.02 11.07 103.08 72868.56
0.05,1000,ME+I,1.5,0.0001 11.12 101.16 72054.760.05,1000,ME+I,1.1,0.02 10.98 103.88 72243.88
0.05,1000,ME+I,1.1,0.0001 11.04 105.79 71986.880.05,1000,ME,2.0,0.02 10.99 105.88 72255.96
0.05,1000,ME,2.0,0.0001 11.11 105.82 71831.640.05,1000,ME,1.5,0.02 10.91 105.93 72275.60
0.05,1000,ME,1.5,0.0001 11.01 105.99 72655.04
0.05,1000,ME,1.1,0.02 10.91 106.09 72631.160.05,1000,ME,1.1,0.0001 10.88 106.08 72100.20
0.05,1000,I,2.0,0.02 10.89 105.87 72380.600.05,1000,I,2.0,0.0001 10.98 105.90 72264.560.05,1000,I,1.5,0.02 11.06 105.79 72350.44
0.05,1000,I,1.5,0.0001 11.02 106.11 72392.640.05,1000,I,1.1,0.02 11.07 105.81 72310.56
0.05,1000,I,1.1,0.0001 11.17 105.82 72267.920.01,500,ME+I,2.0,0.02 6.10 110.76 70623.48
0.01,500,ME+I,2.0,0.0001 6.19 100.90 70173.600.01,500,ME+I,1.5,0.02 6.23 88.83 70297.80
0.01,500,ME+I,1.5,0.0001 6.19 92.12 70435.840.01,500,ME+I,1.1,0.02 6.21 95.10 70178.08
0.01,500,ME+I,1.1,0.0001 6.18 94.75 69908.240.01,500,ME,2.0,0.02 6.17 100.01 70107.48
0.01,500,ME,2.0,0.0001 6.20 98.25 70174.400.01,500,ME,1.5,0.02 6.20 96.74 70149.52
0.01,500,ME,1.5,0.0001 6.20 97.67 69945.920.01,500,ME,1.1,0.02 6.21 100.55 70039.24
0.01,500,ME,1.1,0.0001 6.24 91.17 69905.400.01,500,I,2.0,0.02 6.23 102.47 70315.16
0.01,500,I,2.0,0.0001 6.17 103.53 70046.480.01,500,I,1.5,0.02 6.25 101.20 69696.88
0.01,500,I,1.5,0.0001 6.18 101.27 69886.680.01,500,I,1.1,0.02 6.30 99.65 70166.36
0.01,500,I,1.1,0.0001 6.27 99.71 70019.800.01,2000,ME+I,2.0,0.02 21.61 99.05 73708.60
0.01,2000,ME+I,2.0,0.0001 21.65 98.32 73273.960.01,2000,ME+I,1.5,0.02 21.57 99.21 73551.20
0.01,2000,ME+I,1.5,0.0001 21.52 98.94 73723.640.01,2000,ME+I,1.1,0.02 21.20 99.76 73592.64
0.01,2000,ME+I,1.1,0.0001 21.51 96.31 73391.080.01,2000,ME,2.0,0.02 21.70 92.67 73853.88
0.01,2000,ME,2.0,0.0001 21.51 99.72 73350.920.01,2000,ME,1.5,0.02 21.44 99.98 73758.00
0.01,2000,ME,1.5,0.0001 21.44 99.25 73351.200.01,2000,ME,1.1,0.02 21.53 97.16 73346.52
0.01,2000,ME,1.1,0.0001 21.16 101.37 73605.40
0.01,2000,I,2.0,0.02 21.07 101.02 73651.600.01,2000,I,2.0,0.0001 21.55 100.96 73217.440.01,2000,I,1.5,0.02 21.19 100.46 73303.00
0.01,2000,I,1.5,0.0001 21.60 99.60 73325.360.01,2000,I,1.1,0.02 21.43 100.51 73260.24
0.01,2000,I,1.1,0.0001 21.91 97.81 73582.960.01,1000,ME+I,2.0,0.02 11.24 98.26 71785.68
0.01,1000,ME+I,2.0,0.0001 11.18 97.90 72111.080.01,1000,ME+I,1.5,0.02 11.29 98.82 71760.68
0.01,1000,ME+I,1.5,0.0001 11.29 99.32 71912.760.01,1000,ME+I,1.1,0.02 11.19 98.55 71999.88
0.01,1000,ME+I,1.1,0.0001 11.28 98.35 72015.680.01,1000,ME,2.0,0.02 11.34 97.87 71920.68
0.01,1000,ME,2.0,0.0001 11.35 99.01 72120.600.01,1000,ME,1.5,0.02 11.34 95.71 71681.44
0.01,1000,ME,1.5,0.0001 11.39 96.87 71781.120.01,1000,ME,1.1,0.02 11.21 99.33 71747.48
0.01,1000,ME,1.1,0.0001 11.35 98.86 71964.160.01,1000,I,2.0,0.02 11.19 98.84 71847.96
0.01,1000,I,2.0,0.0001 11.36 97.89 71583.320.01,1000,I,1.5,0.02 11.25 97.67 72072.00
0.01,1000,I,1.5,0.0001 11.29 96.01 71814.480.01,1000,I,1.1,0.02 11.28 97.48 71709.20
0.01,1000,I,1.1,0.0001 11.36 97.21 71803.64
*MAF,POP,MOD,OR,PREV where MAF represents the minor allele frequency, POP is the number of individuals, MOD is the used model (with or without main effect and with or without epistasis effect), OR is the odds ratio, and PREV is the prevalence of the disease.
Laboratory Note
Genetic Epistasis VII - Assessing Algorithm TEAM
LN-7-2014
Ricardo Pinho and Rui CamachoFEUP
Rua Dr Roberto Frias, s/n, 4200-465 PORTO, Portugal
Fax: (+351) 22 508 1440
e-mail: [email protected]
www: http://www.fe.up.pt/∼[email protected]
www: http://www.fe.up.pt/∼rcamacho
May 2014
Abstract
In this lab note, the algorithm TEAM is presented. TEAM is an exhaustive algorithm that works by updating contingency tables and a minimum spanning tree built from the SNPs. The results obtained show an increase in Power with population size, allele frequency, and odds ratio. There is also an increase in Type 1 Error Rate with population size, but no clear indicator for allele frequency. The scalability of the algorithm is questionable, considering that there is a big increase in the running time required by data sets with different population sizes, which is not relevant for these experiments but may be problematic for larger data sets.
1 Introduction
Tree-based epistasis association mapping (TEAM) [ZHZW10] is an exhaustive algorithm that computes all two-locus pairs to obtain a permutation test, which is applicable to all statistical relevancy tests, due to the contingency table generated. TEAM also uses the family-wise error rate (FWER) and the false discovery rate (FDR) to control the error rate using the permutation test, which is better than Bonferroni correction but also more computationally expensive. The algorithm builds a minimum spanning tree containing SNPs as nodes, in which the edges represent the genotype difference between two SNPs. This tree is used to update the contingency tables, allowing many individuals to be pruned.
The algorithm receives the SNP genotypes and the phenotype of each individual, creating a specified number of phenotype permutations. The contingency tables for each single locus are generated. The minimum spanning tree is built, using the genotype differences associated with each edge. The tree is then updated for each leaf node with the information related to the contingency table for the genotype relation between SNPs. The test values are then calculated, using the contingency tables.
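The tree construction can be sketched with Prim's algorithm, using the number of individuals whose genotype differs between two SNPs as the edge weight. This is an illustrative sketch of the idea, not the authors' code: light edges mean few differing individuals, hence few contingency-table updates along the tree.

```python
def genotype_distance(a, b):
    """Number of individuals whose genotype differs between SNPs a and b
    (the edge weight: only these individuals force table updates)."""
    return sum(x != y for x, y in zip(a, b))

def minimum_spanning_tree(snps):
    """Prim's algorithm over the complete SNP graph; returns a list of
    (parent, child, weight) edges rooted at SNP 0."""
    n = len(snps)
    edges = []
    # best[v] = (cheapest known distance into the tree, its parent)
    best = {v: (genotype_distance(snps[0], snps[v]), 0) for v in range(1, n)}
    while best:
        v = min(best, key=lambda u: best[u][0])
        w, parent = best.pop(v)
        edges.append((parent, v, w))
        for u in list(best):
            d = genotype_distance(snps[v], snps[u])
            if d < best[u][0]:
                best[u] = (d, v)
    return edges
```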
     S1 S2 S3 S4 S5 S6 S7 S8 S9 S10
X1   0  0  0  1  2  0  2  0  2  0
X2   2  0  0  2  0  2  0  1  2  1
X3   2  2  0  1  2  2  0  1  1  0
X4   0  2  2  0  0  0  0  1  0  1
X5   2  1  2  0  1  2  0  1  0  2
Y1   1  1  1  0  1  0  1  1  1  0
Y2   0  0  0  1  1  0  1  0  1  0
Y3   1  0  1  1  1  0  1  0  1  0
Table 1: An example of the input data, consisting of 5 SNPs X1,...,X5, the original phenotype Y1, and two permutations Y2, Y3, for 10 individuals S1,...,S10.
1.1 Input files
The input consists of 2 files: a file containing the genotype information and another containing the phenotype information for each individual.
                  Xi=0                          Xi=1                          Xi=2
         Xj=0     Xj=1     Xj=2        Xj=0     Xj=1     Xj=2        Xj=0     Xj=1     Xj=2     Total
Yk=0   Event a1 Event a2 Event a3    Event b1 Event b2 Event b3    Event e1 Event e2 Event e3
Yk=1   Event c1 Event c2 Event c3    Event d1 Event d2 Event d3    Event f1 Event f2 Event f3
Total                                                                                             M
Table 2: The contingency table between two SNPs Xi and Xj for a given phenotype Yk. M refers to the total number of individuals.
(a) Genotype

0011001121
1212111121
1001000102
2202121111

(b) Phenotype

0000000010
Table 3: An example of the input files containing genotype and phenotype information with 4 SNPs and 10 individuals. Genotypes 0, 1, 2 correspond to homozygous dominant, heterozygous, and homozygous recessive. The phenotypes 0 and 1 correspond to control and case, respectively.
1.2 Output files
The output consists of a list of every SNP pair and the relevant test score. The score can be calculated for any statistic defined over the contingency table. In this experiment, the test score corresponds to the chi-square statistic.
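As an illustration of computing the chi-square statistic from such a contingency table (phenotype rows × genotype-pair columns), a minimal sketch:

```python
def chi_square(table):
    """Pearson chi-square for a contingency table given as a list of
    rows of counts (here: 2 phenotype rows x 9 genotype-pair columns)."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            if expected:  # skip all-zero columns
                stat += (obs - expected) ** 2 / expected
    return stat
```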
1.3 Parameters
The customizable parameters are as follows:
• individual - The number of individuals in the data. In this case it depends on the data set parameters.

• SNPs - The number of SNPs in the data. In this case it is determined by the data sets (fixed at 300).

• permutation - The number of permutations used in the significance test.
• fdrthreshold - The FDR threshold for significance.
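The permutation test behind these parameters can be sketched as follows: for each permuted phenotype, the maximum pairwise score is recorded, and the empirical (1 − α) quantile of those maxima gives a FWER-controlling significance threshold. The score function is a parameter here; this is an illustrative sketch, not the TEAM implementation (which updates the contingency tables incrementally along the tree).

```python
import random
from itertools import combinations

def permutation_threshold(snps, labels, score, permutations=100, alpha=0.05):
    """FWER-controlling critical value: the empirical (1 - alpha)
    quantile of the per-permutation maximum pairwise score."""
    maxima = []
    shuffled = list(labels)  # copy so the caller's labels are untouched
    for _ in range(permutations):
        random.shuffle(shuffled)
        maxima.append(max(score(snps[i], snps[j], shuffled)
                          for i, j in combinations(range(len(snps)), 2)))
    maxima.sort()
    return maxima[int((1 - alpha) * permutations) - 1]
```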
2 Experimental Settings
The datasets used in the experiments are characterized in Lab Note 1. The computer used for these experiments ran the 64-bit Ubuntu 13.10 operating system, with an Intel(R) Core(TM)2 Quad CPU Q6600 2.40GHz processor and 8.00 GB of RAM.
TEAM provides a C++ program that takes as parameters the genotype file, the phenotype file, the number of individuals, the number of SNPs, the number of permutations for the significance test, and the FDR threshold. The number of permutations is set to 100 and the FDR threshold is set to 1.
3 Results
The algorithm only outputs pairwise relations between SNPs. Because of this, only epistasis detection is evaluated.
The Power observed in Figure 1 shows a maximum value of 8% for data sets with 500 individuals, 65% for 1000 individuals, and 95% for 2000 individuals. There is a strong correlation between the Power and the size of the data sets. However, for frequencies smaller than 0.1 there is near 0% Power for most configurations, with the exception of data sets with 2000 individuals and 0.05 minor allele frequency. These values also increase with allele frequency, with the exception of 0.5 allele frequency.
[Bar chart: Power (%) by allele frequency (0.01, 0.05, 0.1, 0.3, 0.5). 500 individuals: 0, 0, 0, 6, 8; 1000 individuals: 0, 1, 21, 47, 65; 2000 individuals: 0, 43, 92, 92, 95.]
Figure 1: Power by allele frequency. For each frequency, three sizes of data sets were used to measure the Power, with an odds ratio of 2.0 and prevalence of 0.02. The Power is measured by the amount of data sets where the ground truth was amongst the most relevant results, out of all 100 data sets.
The Type 1 Error Rate in Figure 2 has an interesting pattern.
[Bar chart: Type 1 Error Rate (%) by allele frequency (0.01, 0.05, 0.1, 0.3, 0.5). 500 individuals: 0, 0, 2, 0, 1; 1000 individuals: 2, 4, 5, 1, 0; 2000 individuals: 1, 37, 28, 10, 1.]
Figure 2: Type 1 Error Rate by allele frequency and population size. The Type 1 Error Rate is measured by the amount of data sets where the false positives were amongst the most relevant results, out of all 100 data sets.
The error rate clearly grows with the population size. However, the error does not necessarily increase with allele frequency, reaching a maximum of 37% in data sets with a 0.05 allele frequency and 2000 individuals. There is also a decrease at higher allele frequencies for data sets with 2000 individuals. Therefore, the relation between error rate and allele frequency remains undetermined.
There is a 10% difference in CPU usage (Figure 3b) and a 7-second difference in running time (Figure 3a), with maxima of 74% and 10 seconds, respectively. The memory usage increases from 162 MB to 228 MB, a roughly 40% increase. The most relevant increase is in running time: the running time for 2000 individuals is triple that for 500 individuals, which is a problem for big data sets.
There is a clear increase in Power with the odds ratio in Figure 5, especially from a 1.1 to a 1.5 odds ratio, and with population size in Figure 4, with emphasis on the difference between 1000 and 2000 individuals. The prevalence test shows very little difference between disease prevalences in Figure 6, and the allele frequency test shows growth with increasing minor allele frequency.
[Three bar charts by number of individuals (500, 1000, 2000): (a) average running time (seconds), (b) average CPU usage (%), (c) average memory usage (Mbytes).]
Figure 3: Comparison of scalability measures between different sized data sets. The data sets have a minor allele frequency of 0.5, an odds ratio of 2.0, and a prevalence of 0.02.
4 Summary
TEAM is an exhaustive algorithm that uses permutation tests to generate contingency tables, to which any relevancy test can then be applied. The results show an increase in Power with increasing population size and allele frequency. The scalability test shows that the running time for data sets with the largest population size is triple that for data sets with the smallest population size. The Type 1 Error Rate increases with the population size, but the relation between error rate and allele frequency remains undetermined. The results of data set configurations by population and allele frequency confirm the previously discussed results. Increasing the odds ratio yields a clear increase in Power, but increasing the prevalence yields nearly the same Power.
References
[ZHZW10] Xiang Zhang, Shunping Huang, Fei Zou, and Wei Wang. TEAM:efficient two-locus epistasis tests in human genome-wide associa-tion study. Bioinformatics (Oxford, England), 26:i217–i227, 2010.
A Bar Graphs
[Bar chart "Power by Population": 500 individuals: 0%; 1000: 21%; 2000: 92%.]
Figure 4: Distribution of the Power by population. The allele frequency is 0.1, the odds ratio is 2.0, and the prevalence is 0.02.
[Bar chart "Power by Odds Ratio": 1.1: 1%; 1.5: 81%; 2.0: 92%.]
Figure 5: Distribution of the Power by odds ratios. The allele frequency is 0.1, the number of individuals is 2000, and the prevalence is 0.02.
[Bar chart "Power by Prevalence": 0.0001: 89%; 0.02: 92%.]
Figure 6: Distribution of the Power by prevalence. The allele frequency is 0.1, the number of individuals is 2000, and the odds ratio is 2.0.
[Bar chart "Power by Frequency": 0.01: 0%; 0.05: 43%; 0.1: 92%; 0.3: 92%; 0.5: 95%.]
Figure 7: Distribution of the averaged Power by allele frequency. The number of individuals is 2000, the odds ratio is 2.0, and the prevalence is 0.02.
B Table of Results
Table 4: A table containing the percentage of true positives and false positives in each configuration. The first column contains the description of the configuration. The second and third columns contain the number of data sets with true positives and false positives respectively, out of all 100 data sets per configuration.
Configuration*          TP (%)  FP (%)
0.5,500,I,2.0,0.02      8       1
0.5,500,I,2.0,0.0001    9       0
0.5,500,I,1.5,0.02      3       0
0.5,500,I,1.5,0.0001    0       1
0.5,500,I,1.1,0.02      0       0
0.5,500,I,1.1,0.0001    0       3
0.5,2000,I,2.0,0.02     95      1
0.5,2000,I,2.0,0.0001   100     22
0.5,2000,I,1.5,0.02     93      7
0.5,2000,I,1.5,0.0001   47      1
0.5,2000,I,1.1,0.02     10      3
0.5,2000,I,1.1,0.0001   3       2
0.5,1000,I,2.0,0.02     65      0
0.5,1000,I,2.0,0.0001   79      5
0.5,1000,I,1.5,0.02     53      3
0.5,1000,I,1.5,0.0001   4       0
0.5,1000,I,1.1,0.02     0       1
0.5,1000,I,1.1,0.0001   0       3
0.3,500,I,2.0,0.02      6       0
0.3,500,I,2.0,0.0001    26      3
0.3,500,I,1.5,0.02      2       0
0.3,500,I,1.5,0.0001    5       1
0.3,500,I,1.1,0.02      0       0
0.3,500,I,1.1,0.0001    0       4
0.3,2000,I,2.0,0.02     92      10
0.3,2000,I,2.0,0.0001   100     56
0.3,2000,I,1.5,0.02     95      15
0.3,2000,I,1.5,0.0001   98      10
0.3,2000,I,1.1,0.02     2       1
0.3,2000,I,1.1,0.0001   2       4
0.3,1000,I,2.0,0.02     47      1
0.3,1000,I,2.0,0.0001   100     12
0.3,1000,I,1.5,0.02     40      3
0.3,1000,I,1.5,0.0001   49      5
0.3,1000,I,1.1,0.02     0       0
0.3,1000,I,1.1,0.0001   0       1
0.1,500,I,2.0,0.02      0       2
0.1,500,I,2.0,0.0001    1       1
0.1,500,I,1.5,0.02      1       0
0.1,500,I,1.5,0.0001    0       1
0.1,500,I,1.1,0.02      0       1
0.1,500,I,1.1,0.0001    0       1
0.1,2000,I,2.0,0.02     92      28
0.1,2000,I,2.0,0.0001   89      20
0.1,2000,I,1.5,0.02     81      5
0.1,2000,I,1.5,0.0001   42      3
0.1,2000,I,1.1,0.02     1       3
0.1,2000,I,1.1,0.0001   5       2
0.1,1000,I,2.0,0.02     21      5
0.1,1000,I,2.0,0.0001   12      6
0.1,1000,I,1.5,0.02     5       0
0.1,1000,I,1.5,0.0001   1       1
0.1,1000,I,1.1,0.02     0       2
0.1,1000,I,1.1,0.0001   0       4
0.05,500,I,2.0,0.02     0       0
0.05,500,I,2.0,0.0001   0       1
0.05,500,I,1.5,0.02     0       0
0.05,500,I,1.5,0.0001   0       3
0.05,500,I,1.1,0.02     0       1
0.05,500,I,1.1,0.0001   0       1
0.05,2000,I,2.0,0.02    43      37
0.05,2000,I,2.0,0.0001  57      21
0.05,2000,I,1.5,0.02    40      24
0.05,2000,I,1.5,0.0001  43      19
0.05,2000,I,1.1,0.02    0       3
0.05,2000,I,1.1,0.0001  0       3
0.05,1000,I,2.0,0.02    1       4
0.05,1000,I,2.0,0.0001  3       2
0.05,1000,I,1.5,0.02    0       1
0.05,1000,I,1.5,0.0001  1       3
0.05,1000,I,1.1,0.02    0       1
0.05,1000,I,1.1,0.0001  0       6
0.01,500,I,2.0,0.02     0       0
0.01,500,I,2.0,0.0001   0       2
0.01,500,I,1.5,0.02     0       0
0.01,500,I,1.5,0.0001   0       4
0.01,500,I,1.1,0.02     0       1
0.01,500,I,1.1,0.0001   0       0
0.01,2000,I,2.0,0.02    0       1
0.01,2000,I,2.0,0.0001  0       5
0.01,2000,I,1.5,0.02    0       3
0.01,2000,I,1.5,0.0001  0       4
0.01,2000,I,1.1,0.02    0       1
0.01,2000,I,1.1,0.0001  0       2
0.01,1000,I,2.0,0.02    0       2
0.01,1000,I,2.0,0.0001  0       1
0.01,1000,I,1.5,0.02    0       0
0.01,1000,I,1.5,0.0001  0       2
0.01,1000,I,1.1,0.02    0       2
0.01,1000,I,1.1,0.0001  0       3
*MAF,POP,MOD,OR,PREV where MAF represents the minor allele frequency, POP is the number of individuals, MOD is the used model (with or without main effect and with or without epistasis effect), OR is the odds ratio and PREV is the prevalence of the disease.
Table 5: A table containing the running time, CPU usage, and memory usage in each configuration.
Configuration*          Running Time (s)  CPU Usage (%)  Memory Usage (KB)
0.5,500,I,2.0,0.02      3.28              66.99          166590.64
0.5,500,I,2.0,0.0001    3.81              54.75          166590.28
0.5,500,I,1.5,0.02      3.07              74.40          166590.44
0.5,500,I,1.5,0.0001    3.76              57.98          166590.60
0.5,500,I,1.1,0.02      3.08              68.52          161592.60
0.5,500,I,1.1,0.0001    3.91              55.00          166590.72
0.5,2000,I,2.0,0.02     9.81              74.75          233543.92
0.5,2000,I,2.0,0.0001   11.00             72.09          233802.28
0.5,2000,I,1.5,0.02     9.83              72.85          233535.72
0.5,2000,I,1.5,0.0001   10.98             66.89          233821.76
0.5,2000,I,1.1,0.02     9.82              73.74          233562.12
0.5,2000,I,1.1,0.0001   10.99             69.46          233832.84
0.5,1000,I,2.0,0.02     5.28              69.71          181210.92
0.5,1000,I,2.0,0.0001   6.08              68.02          181210.72
0.5,1000,I,1.5,0.02     5.53              66.72          181210.60
0.5,1000,I,1.5,0.0001   6.10              66.35          181210.64
0.5,1000,I,1.1,0.02     5.40              68.68          181210.64
0.5,1000,I,1.1,0.0001   6.09              65.91          181210.84
0.3,500,I,2.0,0.02      3.12              71.53          166590.44
0.3,500,I,2.0,0.0001    3.79              56.19          166590.60
0.3,500,I,1.5,0.02      3.13              70.93          166590.72
0.3,500,I,1.5,0.0001    3.77              56.60          166590.72
0.3,500,I,1.1,0.02      3.08              72.65          166590.68
0.3,500,I,1.1,0.0001    3.78              56.01          166590.40
0.3,2000,I,2.0,0.02     9.84              72.54          233557.00
0.3,2000,I,2.0,0.0001   10.94             73.45          233778.48
0.3,2000,I,1.5,0.02     9.92              72.03          233546.36
0.3,2000,I,1.5,0.0001   10.95             73.75          233801.92
0.3,2000,I,1.1,0.02     9.95              72.35          233546.48
0.3,2000,I,1.1,0.0001   11.00             70.49          233828.96
0.3,1000,I,2.0,0.02     5.34              67.05          181210.88
0.3,1000,I,2.0,0.0001   6.09              63.97          181210.80
0.3,1000,I,1.5,0.02     5.35              69.00          181210.56
0.3,1000,I,1.5,0.0001   6.12              63.37          181210.80
0.3,1000,I,1.1,0.02     5.44              67.27          181210.44
0.3,1000,I,1.1,0.0001   6.11              65.06          181210.68
0.1,500,I,2.0,0.02      3.28              65.33          166590.76
0.1,500,I,2.0,0.0001    3.78              56.18          166590.60
0.1,500,I,1.5,0.02      3.13              71.07          166590.60
0.1,500,I,1.5,0.0001    3.81              55.52          166590.84
0.1,500,I,1.1,0.02      3.22              67.56          166590.64
0.1,500,I,1.1,0.0001    3.84              54.26          166590.52
0.1,2000,I,2.0,0.02     9.91              72.77          233527.88
0.1,2000,I,2.0,0.0001   10.95             73.18          233788.92
0.1,2000,I,1.5,0.02     9.94              71.34          233538.28
0.1,2000,I,1.5,0.0001   10.97             71.25          233803.52
0.1,2000,I,1.1,0.02     9.82              69.45          231225.76
0.1,2000,I,1.1,0.0001   10.76             73.08          233841.76
0.1,1000,I,2.0,0.02     5.46              66.40          181210.92
0.1,1000,I,2.0,0.0001   6.10              65.14          181210.64
0.1,1000,I,1.5,0.02     5.41              67.52          181210.80
0.1,1000,I,1.5,0.0001   6.17              63.74          181210.80
0.1,1000,I,1.1,0.02     5.49              65.42          181210.52
0.1,1000,I,1.1,0.0001   6.25              57.76          181210.68
0.05,500,I,2.0,0.02     3.06              74.66          166590.52
0.05,500,I,2.0,0.0001   3.67              63.00          166590.84
0.05,500,I,1.5,0.02     3.10              73.32          166590.60
0.05,500,I,1.5,0.0001   3.70              60.99          166590.68
0.05,500,I,1.1,0.02     3.09              74.07          166590.96
0.05,500,I,1.1,0.0001   3.74              60.54          166590.84
0.05,2000,I,2.0,0.02    10.87             75.38          233762.72
0.05,2000,I,2.0,0.0001  10.88             76.33          233830.32
0.05,2000,I,1.5,0.02    9.76              75.67          233551.64
0.05,2000,I,1.5,0.0001  10.84             77.87          233818.16
0.05,2000,I,1.1,0.02    9.76              76.10          233559.48
0.05,2000,I,1.1,0.0001  10.89             78.26          233821.16
0.05,1000,I,2.0,0.02    5.45              69.61          181210.40
0.05,1000,I,2.0,0.0001  6.01              69.13          181210.88
0.05,1000,I,1.5,0.02    5.24              74.24          181210.68
0.05,1000,I,1.5,0.0001  6.04              68.74          181211.00
0.05,1000,I,1.1,0.02    5.34              71.82          181210.52
0.05,1000,I,1.1,0.0001  5.99              68.72          181210.64
0.01,500,I,2.0,0.02     3.13              72.69          166590.40
0.01,500,I,2.0,0.0001   3.69              60.83          166590.96
0.01,500,I,1.5,0.02     3.02              76.70          166590.72
0.01,500,I,1.5,0.0001   3.77              59.05          166590.52
0.01,500,I,1.1,0.02     3.20              69.81          166590.60
0.01,500,I,1.1,0.0001   3.72              60.54          166590.64
0.01,2000,I,2.0,0.02    9.70              77.18          233557.00
0.01,2000,I,2.0,0.0001  10.87             76.75          233813.88
0.01,2000,I,1.5,0.02    9.76              76.94          233554.60
0.01,2000,I,1.5,0.0001  10.83             77.23          233817.52
0.01,2000,I,1.1,0.02    9.81              81.09          233562.32
0.01,2000,I,1.1,0.0001  10.86             76.11          233839.32
0.01,1000,I,2.0,0.02    5.28              73.35          181210.76
0.01,1000,I,2.0,0.0001  6.00              69.20          181210.72
0.01,1000,I,1.5,0.02    5.35              71.68          181210.80
0.01,1000,I,1.5,0.0001  6.05              68.01          181210.36
0.01,1000,I,1.1,0.02    5.35              71.71          181210.84
0.01,1000,I,1.1,0.0001  5.99              68.05          181211.00
*MAF,POP,MOD,OR,PREV where MAF represents the minor allele frequency, POP is the number of individuals, MOD is the used model (with or without main effect and with or without epistasis effect), OR is the odds ratio and PREV is the prevalence of the disease.
Laboratory Note
Genetic Epistasis
VIII - Assessing Algorithm MBMDR
LN-8-2014
Ricardo Pinho and Rui Camacho
FEUP
Rua Dr Roberto Frias, s/n,
4200-465 PORTO
Portugal
Fax: (+351) 22 508 1440
e-mail: [email protected]
www: http://www.fe.up.pt/∼ei09045
e-mail: [email protected]
www: http://www.fe.up.pt/∼rcamacho
May 2014
Abstract
Model-Based Multifactor Dimensionality Reduction (MBMDR) is an algorithm that builds on the previous MDR methodology, which consists of dividing SNPs into two clusters based on their risk of determining the disease. Instead of using a predetermined threshold based on the frequency of SNPs in the data, MBMDR uses a testing approach followed by a significance assessment. The results show high Power only for large data sets and a very low Type 1 Error Rate for all configurations. The running time makes the algorithm not viable for larger data sets.
1 Introduction
Multifactor Dimensionality Reduction (MDR) [CLEP07] is one of the most referenced algorithms for epistasis detection. MDR filters SNPs based on their frequency in case-control data, dividing them into high-risk or low-risk groups using a predetermined threshold. Using cross-validation and permutations to determine the high/low risk groups, the algorithm returns the high-risk loci that have the strongest connection to the disease outcome. However, it samples many SNPs together, analysing at most one significant epistasis model and skipping other possible SNP groups that may not have such a significant connection but may also be related to the disease. Model-Based Multifactor Dimensionality Reduction [MVV11] merges multi-locus genotypes that have significantly high or low risk based on association testing, rather than on a threshold value. The MB-MDR process can be divided into the following steps:
1. Multi-locus cell prioritization - Each two-locus genotype is assignedto either High risk, Low risk or No Evidence of risk categories.
2. Association test on lower-dimensional construct - The result of the first step creates a new variable whose value corresponds to one of the categories. This new variable is then tested against the original label to find the weight of the high- and low-risk genotype cells.
3. Significance assessment - This stage tries to correct the inflation oftype I errors after the combination of cells into the weight of High riskand Low risk. This is done using the Wald statistic.
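A minimal sketch of step 1 may help. Note that the real MB-MDR assigns the High/Low/No Evidence categories with an association test; here that is approximated by a fixed margin around the overall case proportion, so the function name and threshold are illustrative only:

```python
from collections import defaultdict

def categorize_cells(geno_a, geno_b, pheno, threshold=0.1):
    """Label each two-locus genotype cell H (high risk), L (low risk) or
    O (no evidence) by comparing its case proportion with the overall one.
    The fixed margin stands in for MB-MDR's association test."""
    cells = defaultdict(lambda: [0, 0])            # cell -> [controls, cases]
    for ga, gb, y in zip(geno_a, geno_b, pheno):
        cells[(ga, gb)][y] += 1
    overall = sum(pheno) / len(pheno)              # overall case proportion
    labels = {}
    for cell, (ctrl, case) in cells.items():
        prop = case / (ctrl + case)
        if prop > overall + threshold:
            labels[cell] = "H"
        elif prop < overall - threshold:
            labels[cell] = "L"
        else:
            labels[cell] = "O"
    return labels

labels = categorize_cells([0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 1, 1])
print(labels[(0, 0)], labels[(1, 1)])   # → L H
```

Step 2 then collapses the H and L cells into a single constructed variable, and step 3 corrects the resulting inflation of type I errors with the Wald statistic.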
1.1 Input files
The input file consists of the index and phenotype in the first two columns, and the genotype of each SNP in the following columns. The first row corresponds to the name of each column.
"","Y","SNP1","SNP2","SNP3","SNP4","SNP5"
"0", 0, 1, 2, 0, 0, 0
"1", 0, 0, 2, 1, 2, 0
"2", 1, 1, 0, 1, 0, 1
"3", 1, 1, 1, 2, 1, 0
Table 1: An example of the input file containing genotype and phenotype information with 5 SNPs and 4 individuals.
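A file in this layout can be read back with a few lines of code; the sketch below assumes the comma-separated layout shown in Table 1 and is illustrative, not part of MBMDR:

```python
import csv, io

def read_input(text):
    """Parse an MBMDR-style input: first column index, second the phenotype
    column "Y", remaining columns one genotype (0/1/2) per SNP."""
    rows = list(csv.reader(io.StringIO(text), skipinitialspace=True))
    snp_names = rows[0][2:]                      # "SNP1", "SNP2", ...
    pheno = []
    genos = {name: [] for name in snp_names}
    for row in rows[1:]:
        pheno.append(int(row[1]))
        for name, g in zip(snp_names, row[2:]):
            genos[name].append(int(g))
    return pheno, genos

sample = ('"","Y","SNP1","SNP2","SNP3","SNP4","SNP5"\n'
          '"0", 0, 1, 2, 0, 0, 0\n'
          '"1", 0, 0, 2, 1, 2, 0\n')
pheno, genos = read_input(sample)
print(pheno, genos["SNP3"])   # → [0, 0] [0, 1]
```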
1.2 Output files
The output consists of a list of selected SNP interactions with the following columns for each interaction:
1. SNP1...SNPx - Names of the SNPs in the interaction.
2. NH - Number of significant High risk genotypes in the interaction.
3. betaH - Regression coefficient of step 2 for the High risk exposure.
4. WH - Wald statistic for High risk category.
5. PH - P-value of the Wald test for the High risk category.
6. NL - Number of significant Low risk genotypes in the interaction.
7. betaL - Regression coefficient of step 2 for the Low risk exposure.
8. WL - Wald statistic for Low risk category.
9. PL - P-value of the Wald test for the Low risk category.
10. MIN.P - Minimum p-value (min(PH, PL)) for the interaction model.
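Reading this output back, interactions would typically be ranked by MIN.P and filtered at a significance level. A hypothetical parsing sketch, where the dictionary keys simply mirror the column names above:

```python
def rank_interactions(rows, alpha=0.05):
    """rows: list of dicts with keys 'snps', 'PH' and 'PL' (names assumed
    to mirror the MB-MDR output columns). Keeps rows with MIN.P < alpha,
    best first."""
    for r in rows:
        r["MIN.P"] = min(r["PH"], r["PL"])       # as defined in the output
    significant = [r for r in rows if r["MIN.P"] < alpha]
    return sorted(significant, key=lambda r: r["MIN.P"])

out = rank_interactions([
    {"snps": ("SNP1", "SNP3"), "PH": 0.002, "PL": 0.40},
    {"snps": ("SNP2", "SNP4"), "PH": 0.30, "PL": 0.21},
])
print([r["snps"] for r in out])   # → [('SNP1', 'SNP3')]
```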
1.3 Parameters
MBMDR accepts the following arguments:
• order - dimension of interactions to be analyzed.
• covar - (Optional) a data frame containing the covariates for adjustingregression models.
• exclude - (Optional) Value/s of missing data.
• risk.threshold - Threshold used to define the risk category of a multi-locus genotype. The default value is 0.1.
• adjust - (Optional) Types of regression adjustment. Can be ”none”,”covariates”, ”main effects” or ”both”. The default value is ”none”.
• first.model - Specifies the first interaction to be tested. Useful when the analysis was stopped before finishing.
• list.models - (Optional) Exhaustive list of models to be analyzed. Onlypossible interactions in this list will be analyzed.
• use.logistf - Boolean value indicating whether or not the logistf package should be used. The default value is TRUE.
• printStep1 - Boolean value that prints every model obtained if thevalue is TRUE. The default value is FALSE.
2 Experimental Settings
The datasets used in the experiments are characterized in Lab Note 1. The number of interactions selected is limited to 2, considering that the ground truth is a pairwise interaction, and all of the SNPs are tested against each other for pairwise interactions.
3 Results
The algorithm only outputs the statistical relevancy test of interactions between SNPs. Due to this, only epistatic disease model data sets are used in this experiment. Because of time constraints, several computers were used to obtain results, so it is not possible to compare scalability results. Figure 1 reveals a large increase in Power with population size for data sets with a minor allele frequency higher than 0.01. There is a big increase for data sets with 2000 individuals from a minor allele frequency of 0.05 to 0.1. Data sets with smaller population sizes have much lower Power, with zero Power for almost all data sets with 500 individuals. There is also a clear increase with minor allele frequency.
According to Figure 2 the Type 1 Error Rate is very low across all allelefrequencies and data set sizes, having a maximum of 6% and 2% for 0.05minor allele frequency with 2000 and 1000 individuals respectively. For otherallele frequencies, only 0.1 and 0.3 contain false positives for data sets with2000 individuals.
Figures 3 and 6 show the same results as Figure 1 from a different perspective. Figure 4 also shows an increase in Power with increasing odds ratio. Figure 5 shows a smaller increase in Power with increasing prevalence.
[Bar chart: Power (%) by allele frequency (0.01, 0.05, 0.1, 0.3, 0.5). 500 individuals: 0, 0, 0, 0, 1; 1000 individuals: 0, 0, 2, 7, 12; 2000 individuals: 0, 14, 54, 71, 85.]
Figure 1: Power by allele frequency. For each frequency, three sizes of data sets were used to measure the Power, with an odds ratio of 2.0 and prevalence of 0.02. The Power is measured by the amount of data sets where the ground truth was amongst the most relevant results, out of all 100 data sets.
[Bar chart: Type 1 Error Rate (%) by allele frequency (0.01, 0.05, 0.1, 0.3, 0.5). 500 individuals: 0, 0, 0, 0, 0; 1000 individuals: 0, 2, 0, 0, 0; 2000 individuals: 0, 6, 3, 2, 0.]
Figure 2: Type 1 Error Rate by allele frequency and population size. The Type 1 Error Rate is measured by the amount of data sets where the false positives were amongst the most relevant results, out of all 100 data sets.
4 Summary
MBMDR is an algorithm based on the popular MDR approach, clustering SNPs by high or low risk of determining the disease phenotype. The results show very high Power for data sets with 2000 individuals, but very low Power for all other configurations. The Type 1 Error Rate is very low, reaching a maximum of only 6% for a 0.05 allele frequency and 2000 individuals. No scalability results were obtained because of the expected running time of the algorithm, which already shows that it is not viable to use this algorithm on big data sets that might contain thousands or millions of SNPs.
References
[CLEP07] Yujin Chung, Seung Yeoun Lee, Robert C Elston, and Tae-sung Park. Odds ratio based multifactor-dimensionality reduc-tion method for detecting gene-gene interactions. Bioinformatics(Oxford, England), 23:71–76, 2007.
[MVV11] Jestinah M Mahachie John, Francois Van Lishout, and KristelVan Steen. Model-Based Multifactor Dimensionality Reductionto detect epistasis for quantitative traits in the presence of error-free and noisy data. Eur J Hum Genet, 19(6):696–703, June 2011.
A Bar Graphs
[Bar chart "Power by Population": 500 individuals: 0%; 1000: 2%; 2000: 54%.]
Figure 3: Distribution of the Power by population. The allele frequency is 0.1, the odds ratio is 2.0, and the prevalence is 0.02.
[Bar chart "Power by Odds Ratio": 1.1: 0%; 1.5: 1%; 2.0: 54%.]
Figure 4: Distribution of the Power by odds ratios. The allele frequency is 0.1, the number of individuals is 2000, and the prevalence is 0.02.
[Bar chart "Power by Prevalence": 0.0001: 36%; 0.02: 54%.]
Figure 5: Distribution of the Power by prevalence. The allele frequency is 0.1, the number of individuals is 2000, and the odds ratio is 2.0.
[Bar chart "Power by Frequency": 0.01: 0%; 0.05: 43%; 0.1: 92%; 0.3: 92%; 0.5: 95%.]
Figure 6: Distribution of the Power by allele frequency. The number of individuals is 2000, the odds ratio is 2.0, and the prevalence is 0.02.
B Table of Results
Table 2: A table containing the percentage of true positives and false positives in each configuration. The first column contains the description of the configuration. The second and third columns contain the number of data sets with true positives and false positives respectively, out of all 100 data sets per configuration.
Configuration*          TP (%)  FP (%)
0.5,500,I,2.0,0.02      1       0
0.5,500,I,2.0,0.0001    2       0
0.5,500,I,1.5,0.0001    0       0
0.5,500,I,1.1,0.02      0       0
0.5,500,I,1.1,0.0001    0       0
0.5,2000,I,2.0,0.02     85      0
0.5,2000,I,2.0,0.0001   91      2
0.5,2000,I,1.5,0.02     17      1
0.5,2000,I,1.5,0.0001   2       0
0.5,2000,I,1.1,0.02     0       0
0.5,2000,I,1.1,0.0001   0       0
0.5,1000,I,2.0,0.02     12      0
0.5,1000,I,2.0,0.0001   26      0
0.5,1000,I,1.5,0.02     0       0
0.5,1000,I,1.5,0.0001   0       10
0.3,500,I,2.0,0.02      0       0
0.3,500,I,2.0,0.0001    11      0
0.3,500,I,1.5,0.02      0       0
0.3,500,I,1.5,0.0001    0       0
0.3,500,I,1.1,0.02      0       0
0.3,500,I,1.1,0.0001    0       0
0.3,2000,I,2.0,0.02     71      2
0.3,2000,I,2.0,0.0001   100     8
0.3,2000,I,1.5,0.02     5       0
0.3,2000,I,1.5,0.0001   43      2
0.3,2000,I,1.1,0.02     0       0
0.3,2000,I,1.1,0.0001   0       0
0.3,1000,I,2.0,0.02     7       0
0.3,1000,I,2.0,0.0001   62      0
0.3,1000,I,1.5,0.02     0       0
0.3,1000,I,1.5,0.0001   5       0
0.3,1000,I,1.1,0.02     0       0
0.3,1000,I,1.1,0.0001   0       0
0.1,500,I,2.0,0.02      0       0
0.1,500,I,2.0,0.0001    0       0
0.1,500,I,1.5,0.02      0       0
0.1,500,I,1.5,0.0001    0       0
0.1,500,I,1.1,0.02      0       0
0.1,500,I,1.1,0.0001    0       0
0.1,2000,I,2.0,0.02     54      3
0.1,2000,I,2.0,0.0001   36      2
0.1,2000,I,1.5,0.02     1       0
0.1,2000,I,1.5,0.0001   0       0
0.1,2000,I,1.1,0.02     0       0
0.1,2000,I,1.1,0.0001   0       0
0.1,1000,I,2.0,0.02     2       0
0.1,1000,I,2.0,0.0001   1       0
0.1,1000,I,1.5,0.02     0       0
0.1,1000,I,1.5,0.0001   0       0
0.1,1000,I,1.1,0.02     0       0
0.1,1000,I,1.1,0.0001   0       0
0.05,500,I,2.0,0.02     0       0
0.05,500,I,2.0,0.0001   0       0
0.05,500,I,1.5,0.02     0       0
0.05,500,I,1.5,0.0001   0       0
0.05,500,I,1.1,0.02     0       0
0.05,500,I,1.1,0.0001   0       0
0.05,2000,I,2.0,0.02    14      6
0.05,2000,I,2.0,0.0001  3       1
0.05,2000,I,1.5,0.02    7       3
0.05,2000,I,1.5,0.0001  17      7
0.05,2000,I,1.1,0.02    0       0
0.05,2000,I,1.1,0.0001  0       0
0.05,1000,I,2.0,0.02    0       2
0.05,1000,I,2.0,0.0001  0       0
0.05,1000,I,1.5,0.02    0       0
0.05,1000,I,1.5,0.0001  0       0
0.05,1000,I,1.1,0.02    0       0
0.05,1000,I,1.1,0.0001  0       1
0.01,500,I,2.0,0.02     0       0
0.01,500,I,2.0,0.0001   0       0
0.01,500,I,1.5,0.02     0       0
0.01,500,I,1.5,0.0001   0       1
0.01,500,I,1.1,0.02     0       0
0.01,500,I,1.1,0.0001   0       0
0.01,1000,I,1.5,0.0001  0       0
0.01,1000,I,1.1,0.0001  0       0
*MAF,POP,MOD,OR,PREV where MAF represents the minor allele fre-quency, POP is the number of individuals, MOD is the used model (withor without main effect and with or without epistasis effect), OR is the oddsratio and PREV is the prevalence of the disease.
Laboratory Note
Genetic Epistasis
IX - Comparative Assessment of the Algorithms
LN-9-2014
Ricardo Pinho and Rui Camacho
FEUP
Rua Dr Roberto Frias, s/n,
4200-465 PORTO
Portugal
Fax: (+351) 22 508 1440
e-mail: [email protected]
www: http://www.fe.up.pt/∼ei09045
e-mail: [email protected]
www: http://www.fe.up.pt/∼rcamacho
May 2014
Abstract
This lab note contains the results obtained with the algorithms discussed in the previous lab notes. All algorithms are compared by their characteristics and by their Power, scalability, and Type 1 Error Rates in epistasis detection, main effect detection, and full effect detection. From the results obtained, we can see that BOOST has the highest Power in epistasis detection and main effect detection, but has a high error rate. Screen and Clean has a constant but high error rate overall, very low Power in epistasis detection, and average Power for the other models. SNPHarvester and SNPRuler have relatively low Power, but low error rates. TEAM has good Power, but a high error rate. MBMDR has good Power and a low Type 1 Error Rate, but very bad scalability. BEAM3 has high Power in main effect detection, but also a high error rate. In terms of scalability, BOOST is the most scalable and MBMDR the least.
1 Introduction
In this lab note, the epistasis detection algorithms used in earlier lab notes([PC14b][PC14c] [PC14d] [PC14e] [PC14f] [PC14g] [PC14h]) will be compared, usingthe results from the data sets and measurements discussed in Lab Note LN-1-2014 [PC14a].The algorithms used in this empirical study are BEAM 3.0 [Zha12]; BOOST[WYY+10a]; MBMDR [MVV11]; Screen and Clean [WDR+10]; SNPRuler[WYY+10b]; SNPHarvester [YHW+09]; and TEAM [ZHZW10]. Table 1 andTable 2 show the main characteristics of the search methods, scoring tech-niques, types of disease models detected, and the programming language ofthe tested algorithms [SZS+11].
Table 1: Similarities and differences between BEAM3, BOOST, MBMDR, and Screen & Clean.
Features                BEAM 3        BOOST        MBMDR        Screen & Clean
Search                  Stochastic    Exhaustive   Exhaustive   Heuristic
Permutation Test        √             −            √            −
Chi-square Test         −*            √            −*           −*
Tree/Graph Structure    √             −            −            −
Bonferroni Correction   −             √            −            √
Interactive Effect      √             √            √            √
Main Effect             √             √            √            √
Full Effect             √             √            √            √
Programming Language    C++           C            R            R
*Although BEAM3 can evaluate interactive and full effects, the evaluationtest is not comparable between methods. Only single SNPs are evaluatedwith χ2 test. MBMDR and Screen & Clean results are comparable withother algorithms.
Features                SNPHarvester   SNPRuler    TEAM
Search                  Stochastic     Heuristic   Exhaustive
Permutation Test        −              −           √
Chi-square Test         √              √           −
Tree Structure          −              √           √
Bonferroni Correction   √              √           −
Interactive Effect      √              √           √
Main Effect             √              −           −
Full Effect             √              −           −
Programming Language    Java           Java        C++
2 Comparative Assessment
The measures used to assess the quality of each algorithm are: Power; Scal-ability; and Type 1 Error Rate.
2.1 Power
The Power of an algorithm is related to its ability to find the ground truth of the disease. In this case, the Power is evaluated as the number of data sets, out of 100, where the algorithm finds the ground truth, and is measured as a percentage for each data set configuration. In each data set, the most significant interactions, i.e. α < 0.05, are selected.
2.2 Scalability
Scalability is determined by 3 main factors: execution time, CPU usage, and memory usage. Execution time is measured in seconds, CPU usage is measured as the percentage of processor usage by the algorithm, and memory usage is measured in kilobytes of RAM used by the algorithm. All measures are averaged over the 100 data sets in each data set configuration.
2.3 Type 1 Error Rate
Similar to the Power, the Type 1 Error Rate is determined by the number of data sets, out of 100, containing false positives within the most significant interactions, i.e. α < 0.05.
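Both measures can be expressed as simple counts over the 100 data sets of one configuration. A small illustrative helper (not code from the study):

```python
def power_and_t1er(results, ground_truth):
    """results: one list of reported SNP pairs per data set.
    Power = % of data sets whose output contains the ground-truth pair.
    T1ER  = % of data sets whose output contains at least one false positive."""
    n = len(results)
    hits = sum(1 for found in results if ground_truth in found)
    fps = sum(1 for found in results if any(p != ground_truth for p in found))
    return 100.0 * hits / n, 100.0 * fps / n

# Four toy "data sets": two recover the truth, one also reports a false pair.
p, e = power_and_t1er(
    [[("SNP1", "SNP2")], [("SNP1", "SNP2"), ("SNP3", "SNP9")], [], []],
    ("SNP1", "SNP2"))
print(p, e)   # → 50.0 25.0
```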
3 Experimental Procedure
As mentioned in Lab Note LN-1-2014 [PC14a], there are 270 different configurations of data sets, with different parameters: allele frequency (0.01, 0.05, 0.1, 0.3, and 0.5); population size (500, 1000, and 2000); odds ratio (1.1, 1.5, and 2.0); prevalence (0.0001 and 0.02); and disease model (Epistasis, Main Effect, and Epistasis + Main Effect). To test the Power and Type 1 Error Rate of the algorithms, the outputs of each algorithm are gathered for each data set configuration and the corresponding confusion matrix is created. The output of each algorithm is filtered, selecting only interactions with a statistical relevancy of 5%. From these confusion matrices, the number of data sets with true positives and false positives within each configuration is obtained and used to compare Power and Type 1 Error Rate, respectively. For scalability, the built-in shell command time was used to obtain all the scalability measures for all algorithms.
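The shell's time builtin was used in the study; a rough POSIX-only Python analogue that captures the same three measures might look like this (hypothetical helper, with the cumulative child-usage caveat noted in the comments):

```python
import subprocess, sys, time, resource  # resource is POSIX-only

def measure(cmd):
    """Run one command and report elapsed wall time (s), CPU usage (%) and
    peak child memory (KB) - an analogue of the shell's `time` builtin.
    Note: getrusage(RUSAGE_CHILDREN) is cumulative over all children run so
    far, so in a real harness each measurement should use a fresh process."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    elapsed = time.perf_counter() - start
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    cpu_pct = 100.0 * (usage.ru_utime + usage.ru_stime) / elapsed
    return elapsed, cpu_pct, usage.ru_maxrss   # ru_maxrss is in KB on Linux

# Demo on a trivial child process (the algorithm binary would go here).
elapsed, cpu_pct, mem_kb = measure([sys.executable, "-c", "pass"])
```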
4 Results
To compare each criterion, Tables 2, 3, and 4 represent the Power and Type 1 Error Rate of each algorithm, by number of individuals and allele frequency, for epistasis, main effect, and full effect detection respectively. Table 5 shows the results of the scalability measures used to evaluate each algorithm. For epistasis detection, we can see that, for data sets with 500 individuals, no algorithm has a Power above 26%. This shows a big difficulty in detecting epistasis with few individuals. The algorithm with the best Power for these data sets is BOOST, followed by TEAM, SNPRuler, and SNPHarvester. In error rate, however, the algorithm with the lowest values is SNPRuler, followed by TEAM, SNPHarvester, and BOOST. For data sets with 1000 individuals, there is a big increase in Power in all algorithms, reaching a maximum of 91%. BOOST has the best Power at all allele frequencies, followed by TEAM, SNPRuler, and SNPHarvester. SNPRuler is once again the algorithm with the lowest Type 1 Error Rate, followed by TEAM, BOOST, and SNPHarvester. For 2000 individuals, BOOST has the best Power with a maximum of 100%, followed by TEAM and SNPHarvester, with SNPRuler being better than SNPHarvester for a 0.5 minor allele frequency. The lowest error rate is achieved by SNPRuler. Each of the other algorithms has a high Type 1 Error Rate in at least one setting. Screen and Clean is clearly the worst algorithm, due to its lack of Power and high Type 1 Error Rate across all data set sizes. The Power shows an increase with allele frequency in each
algorithm, reaching its maximum at a 0.5 allele frequency. There is no clear correlation between error rate and allele frequency for any algorithm.
POP: 500 individuals
MAF      0.01        0.05        0.1         0.3         0.5
         P    T1ER   P    T1ER   P    T1ER   P    T1ER   P    T1ER
BOOST    0%   4%     0%   7%     1%   7%     14%  6%     26%  4%
SnC      0%   15%    0%   15%    0%   14%    0%   19%    0%   18%
SNPH     0%   4%     0%   4%     0%   7%     4%   3%     2%   4%
SNPR     0%   0%     0%   0%     0%   0%     3%   0%     6%   0%
TEAM     0%   0%     0%   0%     0%   2%     6%   0%     8%   1%

POP: 1000 individuals
MAF      0.01        0.05        0.1         0.3         0.5
         P    T1ER   P    T1ER   P    T1ER   P    T1ER   P    T1ER
BOOST    0%   7%     0%   4%     41%  5%     66%  2%     91%  8%
SnC      0%   17%    0%   22%    0%   16%    0%   15%    0%   22%
SNPH     0%   4%     0%   13%    21%  9%     43%  9%     14%  3%
SNPR     0%   0%     0%   0%     10%  0%     35%  0%     71%  0%
TEAM     0%   2%     1%   4%     21%  5%     47%  1%     65%  0%

POP: 2000 individuals
MAF      0.01        0.05        0.1         0.3         0.5
         P    T1ER   P    T1ER   P    T1ER   P    T1ER   P    T1ER
BOOST    0%   2%     7%   2%     94%  21%    100% 6%     100% 8%
SnC      0%   18%    0%   20%    6%   21%    2%   16%    0%   14%
SNPH     0%   2%     18%  27%    85%  19%    70%  11%    33%  5%
SNPR     0%   0%     0%   1%     32%  8%     44%  0%     92%  0%
TEAM     0%   1%     43%  37%    92%  28%    92%  10%    95%  1%
Table 2: This table contains the results for epistasis detection. A comparison between the tested algorithms: BOOST, Screen and Clean, SNPHarvester, SNPRuler, and TEAM. The table is organized by population size (POP) and minor allele frequency (MAF), with an odds ratio of 2.0 and a prevalence of 0.02. For each allele frequency, there are two columns: the Power (P) obtained, and the Type 1 Error Rate (T1ER).
In main effect detection, for 500 individuals, the best algorithm is BEAM3, closely followed by BOOST and SNPHarvester, with Screen and Clean far behind. The Type 1 Error Rate is lowest for MBMDR and Screen and Clean; BEAM3, SNPHarvester, and BOOST are very close to each other with very high error rates, BOOST having the highest. For 1000 individuals, BOOST has better Power than BEAM3, followed by SNPHarvester and Screen and Clean, with MBMDR far behind. The Type 1 Error Rate is highest for BOOST, very closely followed by BEAM3, SNPHarvester, and Screen and Clean, with MBMDR having the lowest error rate. For data sets with 2000 individuals, BOOST and MBMDR have better Power for allele frequencies lower than 0.1, while BEAM3, BOOST, and SNPHarvester are equally good at allele frequencies above 0.1. The error rate is generally lowest for MBMDR, followed by Screen and Clean.
Table 4 shows the full effect detection for BOOST, Screen and Clean, and SNPHarvester. BOOST and SNPHarvester have the highest Power for all allele frequencies, but also a high Type 1 Error Rate. Screen and Clean has high Power for high allele frequencies, but zero Power for configurations below 0.3 and a higher Type 1 Error Rate for configurations below 0.1. At the same time, Screen and Clean has the lowest overall Type 1 Error Rate, but also the worst Power. BOOST has the best ratio of Power to Type 1 Error Rate.
Table 5 shows the running time, CPU usage, and memory usage of all algorithms as scalability measures. Screen and Clean is the slowest recorded algorithm, followed by SNPHarvester, TEAM, BEAM3, and SNPRuler, with BOOST being the fastest. Screen and Clean also has the highest increase in running time, followed by SNPHarvester and TEAM, with BOOST, BEAM3, and SNPRuler far behind. SNPRuler is the algorithm with the highest CPU usage, having to resort to more than one core to finish each task. SNPHarvester, BOOST, BEAM3, and Screen and Clean are all close to 100%, with TEAM requiring the least CPU. BEAM3, BOOST, and TEAM show an increase of CPU usage with data set size, TEAM having the highest increase. In memory usage, SNPRuler shows the highest usage, closely followed by TEAM, Screen and Clean, SNPHarvester, and BEAM3, with BOOST far behind.
POP    500 individuals
MAF    0.01       0.05       0.1        0.3        0.5
       P    T1ER  P    T1ER  P    T1ER  P    T1ER  P    T1ER
BEAM3  0%   0%    0%   3%    0%   9%    100% 71%   100% 99%
BOOST  0%   1%    0%   1%    2%   12%   100% 78%   100% 97%
MBMDR  0%   0%    0%   0%    0%   0%    0%   0%    1%   0%
SnC    0%   14%   0%   17%   0%   21%   20%  23%   54%  15%
SNPH   0%   1%    0%   5%    0%   11%   100% 78%   100% 99%

POP    1000 individuals
MAF    0.01       0.05       0.1        0.3        0.5
       P    T1ER  P    T1ER  P    T1ER  P    T1ER  P    T1ER
BEAM3  0%   6%    0%   3%    32%  18%   100% 99%   100% 100%
BOOST  0%   7%    1%   3%    43%  23%   100% 99%   100% 100%
MBMDR  0%   0%    0%   2%    2%   0%    7%   0%    12%  0%
SnC    0%   14%   0%   21%   0%   23%   54%  28%   70%  30%
SNPH   0%   10%   0%   4%    38%  22%   100% 99%   100% 100%

POP    2000 individuals
MAF    0.01       0.05       0.1        0.3        0.5
       P    T1ER  P    T1ER  P    T1ER  P    T1ER  P    T1ER
BEAM3  0%   1%    1%   17%   92%  67%   100% 100%  100% 100%
BOOST  0%   1%    14%  11%   97%  74%   100% 100%  100% 100%
MBMDR  0%   0%    14%  6%    54%  3%    71%  2%    85%  0%
SnC    0%   13%   0%   22%   39%  36%   58%  38%   62%  48%
SNPH   0%   1%    1%   24%   92%  79%   100% 100%  100% 100%
Table 3: Results for main effect detection, comparing the tested algorithms BEAM3, BOOST, MBMDR, Screen and Clean (SnC), and SNPHarvester (SNPH). The table is organized by population size (POP) and minor allele frequency (MAF), with an odds ratio of 2.0 and a prevalence of 0.02. For each allele frequency there are two columns: the Power obtained (P) and the Type 1 Error Rate (T1ER).
POP    500 individuals
MAF    0.01       0.05       0.1        0.3        0.5
       P    T1ER  P    T1ER  P    T1ER  P    T1ER  P    T1ER
BOOST  0%   10%   0%   4%    1%   15%   100% 100%  100% 100%
SnC    0%   18%   0%   15%   0%   19%   30%  19%   49%  37%
SNPH   0%   2%    0%   8%    0%   9%    100% 100%  100% 100%

POP    1000 individuals
MAF    0.01       0.05       0.1        0.3        0.5
       P    T1ER  P    T1ER  P    T1ER  P    T1ER  P    T1ER
BOOST  0%   11%   2%   16%   42%  38%   100% 100%  100% 100%
SnC    0%   14%   0%   21%   0%   28%   58%  35%   73%  45%
SNPH   0%   4%    0%   8%    32%  27%   100% 100%  100% 100%

POP    2000 individuals
MAF    0.01       0.05       0.1        0.3        0.5
       P    T1ER  P    T1ER  P    T1ER  P    T1ER  P    T1ER
BOOST  0%   7%    15%  17%   98%  81%   100% 100%  100% 100%
SnC    0%   14%   0%   21%   0%   33%   40%  68%   91%  84%
SNPH   0%   1%    0%   20%   95%  79%   100% 100%  100% 100%
Table 4: Results for full effect detection, comparing the tested algorithms BOOST, Screen and Clean (SnC), and SNPHarvester (SNPH). The table is organized by population size (POP) and minor allele frequency (MAF), with an odds ratio of 2.0 and a prevalence of 0.02. For each allele frequency there are two columns: the Power obtained (P) and the Type 1 Error Rate (T1ER).
             Running Time (s)     CPU Usage (%)           Memory Usage (MB)
             500    1000   2000   500     1000    2000    500     1000    2000
BEAM3        4.9    7      8      87.8    96.3    95.5    4       4.3     5.8
BOOST        0.16   0.22   0.34   95.7    98.79   97.87   0.98    1       1.2
MBMDR*       -      -      -      -       -       -       -       -       -
SnC          8.05   18.65  34.65  75.7    98.99   77.25   129.8   137.2   152.5
SNPHarvester 9.29   25.89  33     102.1   86.5    101.6   68.35   71.3    76.86
SNPRuler     2.7    3.09   4.1    130.2   141.9   156.28  312.7   316     320.2
TEAM         3.28   5.28   9.81   66.99   69.71   74.75   162.7   176     228.1
Table 5: Scalability test containing the average running time, CPU usage, and memory usage by data set population size. *MBMDR has no scalability results because they were obtained on computers with hardware settings different from all other results. The data sets have a minor allele frequency of 0.5, a 2.0 odds ratio, and a 0.02 prevalence.
5 Results Discussion
The results obtained from all the different algorithms show interesting qualities. BOOST is clearly the algorithm with the highest Power, but it has a high Type 1 Error Rate. SNPRuler has a low Type 1 Error Rate, but not very high Power, and it only works for epistasis detection. Screen and Clean is ineffective in most settings, but has a relatively low Type 1 Error Rate and high Power for main effect and full effect detection in data sets with a high allele frequency. BEAM3 only works for main effect detection, but has high Power with a slightly lower error rate than BOOST. SNPHarvester has low Power, but also a low Type 1 Error Rate in all model types. MBMDR has good Power for certain configurations and a very low Type 1 Error Rate; however, it has a very high running time for each data set. TEAM has good Power, with a slightly high Type 1 Error Rate in certain configurations.

BOOST is the most scalable algorithm, followed by SNPRuler and BEAM3. This is especially important for large data sets and for the ability to work in an ensemble approach. In epistasis detection, considering the Power, Screen and Clean and SNPHarvester show the worst potential. For main effect detection, the Power is lowest for Screen and Clean and MBMDR. For full effect detection, Screen and Clean is once again the weakest algorithm.

With this information, the best algorithms for each scenario can be used together to maximize Power and lower the Type 1 Error Rate.

These experiments use more configurations than any previous empirical study. The configurations were processed by 7 of the state-of-the-art algorithms, which yielded interesting results. The contribution of these experiments is an unprecedentedly large comparison study, using various relevant measures, which allows for a full understanding of each algorithm.
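One way to realize the suggested combination of algorithms is a voting ensemble over the SNP pairs each algorithm reports: requiring more votes lowers the Type 1 Error Rate at the cost of Power. This scheme is an illustrative sketch (with made-up SNP identifiers), not a method evaluated in these experiments:

```python
from collections import Counter

def ensemble(reports, min_votes):
    """Keep the SNP pairs reported by at least `min_votes` algorithms.
    min_votes=1 gives the union (maximum Power); min_votes=len(reports)
    gives the intersection (lowest Type 1 Error Rate)."""
    votes = Counter(pair for pairs in reports for pair in set(pairs))
    return {pair for pair, v in votes.items() if v >= min_votes}

boost = {("rs1", "rs7"), ("rs3", "rs4")}
beam3 = {("rs1", "rs7")}
snph = {("rs1", "rs7"), ("rs5", "rs6")}
print(ensemble([boost, beam3, snph], min_votes=2))  # {('rs1', 'rs7')}
```

In practice the vote threshold would be chosen per scenario, using the per-algorithm Power and Type 1 Error Rate profiles measured above.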
References
[MVV11] Jestinah M Mahachie John, Francois Van Lishout, and Kristel Van Steen. Model-Based Multifactor Dimensionality Reduction to detect epistasis for quantitative traits in the presence of error-free and noisy data. Eur J Hum Genet, 19(6):696–703, June 2011.

[PC14a] Ricardo Pinho and Rui Camacho. Genetic Epistasis I - Materials and methods. 2014.

[PC14b] Ricardo Pinho and Rui Camacho. Genetic Epistasis II - Assessing Algorithm BEAM 3.0. 2014.

[PC14c] Ricardo Pinho and Rui Camacho. Genetic Epistasis III - Assessing Algorithm BOOST. 2014.

[PC14d] Ricardo Pinho and Rui Camacho. Genetic Epistasis IV - Assessing Algorithm Screen and Clean. 2014.

[PC14e] Ricardo Pinho and Rui Camacho. Genetic Epistasis V - Assessing Algorithm SNPRuler. 2014.

[PC14f] Ricardo Pinho and Rui Camacho. Genetic Epistasis VI - Assessing Algorithm SNPHarvester. 2014.

[PC14g] Ricardo Pinho and Rui Camacho. Genetic Epistasis VII - Assessing Algorithm TEAM. 2014.

[PC14h] Ricardo Pinho and Rui Camacho. Genetic Epistasis VIII - Assessing Algorithm MBMDR. 2014.

[SZS+11] Junliang Shang, Junying Zhang, Yan Sun, Dan Liu, Daojun Ye, and Yaling Yin. Performance analysis of novel methods for detecting epistasis, 2011.

[WDR+10] Jing Wu, Bernie Devlin, Steven Ringquist, Massimo Trucco, and Kathryn Roeder. Screen and clean: a tool for identifying interactions in genome-wide association studies. Genetic Epidemiology, 34:275–285, 2010.

[WYY+10a] Xiang Wan, Can Yang, Qiang Yang, Hong Xue, Xiaodan Fan, Nelson L S Tang, and Weichuan Yu. BOOST: A fast approach to detecting gene-gene interactions in genome-wide case-control studies. American Journal of Human Genetics, 87:325–340, 2010.

[WYY+10b] Xiang Wan, Can Yang, Qiang Yang, Hong Xue, Nelson L S Tang, and Weichuan Yu. Predictive rule inference for epistatic interaction detection in genome-wide association studies. Bioinformatics, 26:30–37, 2010.

[YHW+09] Can Yang, Zengyou He, Xiang Wan, Qiang Yang, Hong Xue, and Weichuan Yu. SNPHarvester: a filtering-based approach for detecting epistatic interactions in genome-wide association studies. Bioinformatics, 25:504–511, 2009.

[Zha12] Yu Zhang. A novel Bayesian graphical model for genome-wide multi-SNP association mapping. Genetic Epidemiology, 36:36–47, 2012.

[ZHZW10] Xiang Zhang, Shunping Huang, Fei Zou, and Wei Wang. TEAM: efficient two-locus epistasis tests in human genome-wide association study. Bioinformatics, 26:i217–i227, 2010.
A Bar Graphs
A.1 Population size
[Bar graphs: Power and Type 1 Error Rate (%) per algorithm; panels (a) 500, (b) 1000, and (c) 2000 individuals.]

Figure 1: These results correspond to epistasis detection by population size, with a 0.1 minor allele frequency, a 2.0 odds ratio, and a 0.02 prevalence. Each subfigure shows the Power and Type 1 Error Rate of BOOST, MBMDR, Screen and Clean, SNPHarvester, SNPRuler, and TEAM for data sets with 500 (a), 1000 (b), and 2000 (c) individuals.
[Bar graphs: Power and Type 1 Error Rate (%) per algorithm; panels (a) 500, (b) 1000, and (c) 2000 individuals.]

Figure 2: These results correspond to main effect detection by population size, with a 0.1 minor allele frequency, a 2.0 odds ratio, and a 0.02 prevalence. Each subfigure shows the Power and Type 1 Error Rate of BEAM3, BOOST, Screen and Clean, and SNPHarvester for data sets with 500 (a), 1000 (b), and 2000 (c) individuals.
[Bar graphs: Power and Type 1 Error Rate (%) per algorithm; panels (a) 500, (b) 1000, and (c) 2000 individuals.]

Figure 3: These results correspond to full effect detection by population size, with a 0.1 minor allele frequency, a 2.0 odds ratio, and a 0.02 prevalence. Each subfigure shows the Power and Type 1 Error Rate of BOOST, Screen and Clean, and SNPHarvester for data sets with 500 (a), 1000 (b), and 2000 (c) individuals.
A.2 Frequency
[Bar graphs: Power (P) and Type 1 Error Rate (T1ER) (%) per algorithm; panels (a)-(e) by minor allele frequency.]

Figure 4: These results correspond to epistasis detection by minor allele frequency, with 2000 individuals, a 2.0 odds ratio, and a 0.02 prevalence. Each subfigure shows the Power and Type 1 Error Rate of BOOST, MBMDR, Screen and Clean, SNPHarvester, SNPRuler, and TEAM for data sets with 0.01 (a), 0.05 (b), 0.1 (c), 0.3 (d), and 0.5 (e) allele frequencies.
[Bar graphs: Power (P) and Type 1 Error Rate (T1ER) (%) per algorithm; panels (a)-(e) by minor allele frequency.]

Figure 5: These results correspond to main effect detection by minor allele frequency, with 2000 individuals, a 2.0 odds ratio, and a 0.02 prevalence. Each subfigure shows the Power and Type 1 Error Rate of BEAM3, BOOST, Screen and Clean, and SNPHarvester for data sets with 0.01 (a), 0.05 (b), 0.1 (c), 0.3 (d), and 0.5 (e) allele frequencies.
[Bar graphs: Power (P) and Type 1 Error Rate (T1ER) (%) per algorithm; panels (a)-(e) by minor allele frequency.]

Figure 6: These results correspond to full effect detection by minor allele frequency, with 2000 individuals, a 2.0 odds ratio, and a 0.02 prevalence. Each subfigure shows the Power and Type 1 Error Rate of BOOST, Screen and Clean, and SNPHarvester for data sets with 0.01 (a), 0.05 (b), 0.1 (c), 0.3 (d), and 0.5 (e) allele frequencies.
A.3 Odds Ratio
[Bar graphs: Power and Type 1 Error Rate (%) per algorithm; panels (a) 1.1, (b) 1.5, and (c) 2.0 odds ratio.]

Figure 7: These results correspond to epistasis detection by odds ratio, with a minor allele frequency of 0.1, 2000 individuals, and a 0.02 prevalence. Each subfigure shows the Power and Type 1 Error Rate of BOOST, MBMDR, Screen and Clean, SNPHarvester, SNPRuler, and TEAM for data sets with a 1.1 (a), 1.5 (b), and 2.0 (c) odds ratio.
[Bar graphs: Power and Type 1 Error Rate (%) per algorithm; panels (a) 1.1, (b) 1.5, and (c) 2.0 odds ratio.]

Figure 8: These results correspond to main effect detection by odds ratio, with a minor allele frequency of 0.1, 2000 individuals, and a 0.02 prevalence. Each subfigure shows the Power and Type 1 Error Rate of BEAM3, BOOST, Screen and Clean, and SNPHarvester for data sets with a 1.1 (a), 1.5 (b), and 2.0 (c) odds ratio.
[Bar graphs: Power and Type 1 Error Rate (%) per algorithm; panels (a) 1.1, (b) 1.5, and (c) 2.0 odds ratio.]

Figure 9: These results correspond to full effect detection by odds ratio, with a minor allele frequency of 0.1, 2000 individuals, and a 0.02 prevalence. Each subfigure shows the Power and Type 1 Error Rate of BOOST, Screen and Clean, and SNPHarvester for data sets with a 1.1 (a), 1.5 (b), and 2.0 (c) odds ratio.
A.4 Prevalence
[Bar graphs: Power and Type 1 Error Rate (%) per algorithm; panels (a) 0.0001 and (b) 0.02 prevalence.]

Figure 10: These results correspond to epistasis detection by prevalence, with a minor allele frequency of 0.1, 2000 individuals, and a 2.0 odds ratio. Each subfigure shows the Power and Type 1 Error Rate of BOOST, MBMDR, Screen and Clean, SNPHarvester, SNPRuler, and TEAM for data sets with a 0.0001 (a) and 0.02 (b) prevalence.
[Bar graphs: Power and Type 1 Error Rate (%) per algorithm; panels (a) 0.0001 and (b) 0.02 prevalence.]

Figure 11: These results correspond to main effect detection by prevalence, with a minor allele frequency of 0.1, 2000 individuals, and a 2.0 odds ratio. Each subfigure shows the Power and Type 1 Error Rate of BEAM3, BOOST, Screen and Clean, and SNPHarvester for data sets with a 0.0001 (a) and 0.02 (b) prevalence.
[Bar graphs: Power and Type 1 Error Rate (%) per algorithm; panels (a) 0.0001 and (b) 0.02 prevalence.]

Figure 12: These results correspond to full effect detection by prevalence, with a minor allele frequency of 0.1, 2000 individuals, and a 2.0 odds ratio. Each subfigure shows the Power and Type 1 Error Rate of BOOST, Screen and Clean, and SNPHarvester for data sets with a 0.0001 (a) and 0.02 (b) prevalence.