Data Mining and Bioinformatics

Sebastian Kropp

27 May 2004

Monash University

Faculty of Information Technology

Caulfield, VIC

Abstract

This paper looks at the use of data mining in the domain of bioinformatics. Knowledge-discovery techniques are becoming more and more important as the amount of collected data increases. Future progress in biology is made possible by advances in machine learning. The broad use of data mining and its applicability in the different areas of bioinformatics are evaluated. The areas include the genome project, the prediction of protein structures and the struggle of neurobiology to understand the human brain.

Contents

1 Introduction

2 Brain Functionality

3 Protein structure prediction

4 Discussion and conclusions

1 Introduction

Computer science and biology fuse in the relatively new discipline of bioinformatics. This interdisciplinary work is driven by the need to analyse and make sense of the vast amount of data that is produced when biological systems are studied. Data mining has already been applied successfully to business problems. Insurance companies use it to assess insurance risks [1], and highly competitive markets such as the telecommunication industry use it to predict customer churn. Similar knowledge-discovery methods are used throughout the economy to optimise productivity, and the role of data mining as a tool for optimisation is fairly well understood in this area. Attempts are now being made to carry those positive experiences over to science. Science, and especially biology, produces vast, complex and noisy data on an unprecedented scale. An example is the Human Genome Project: the sequence of the whole human DNA poses a new challenge for data mining and computer science. Many data mining and machine learning algorithms have high computational complexity, in some cases exponential, and sometimes require parallel computation.

Before we look at examples of data mining in biology, we need to define what the term actually means. Data mining, also known as Knowledge Discovery in Databases (KDD), is explained in the book Principles of Data Mining [2] as "The nontrivial extraction of implicit, previously unknown, and potentially useful information from data" and "The science of extracting useful information from large data sets or databases". This definition is quite general. In some cases it is extended to include every available means of knowledge extraction that helps to gain the best possible understanding of the data. There are many ways in which data can be exploited. Data mining can be divided into two main groups: supervised and unsupervised techniques. Supervised algorithms require prior knowledge of and experience with the data, typically in the form of labelled examples. Classification and decision trees are examples of this approach and can be used to verify a hypothesis. Unsupervised techniques do not need such knowledge; they discover relations on their own. Clustering, for example, detects similarities without any labels, and the Apriori algorithm for association rule mining is fundamental to data mining. Such statistical approaches usually lack the ability to detect non-linear relations but provide understandable results (decision trees, for example). Newer advances in artificial intelligence, such as neural networks and genetic algorithms, support the pattern recognition process in finding non-linear relations. There are many patterns in biology that are not yet understood, and data mining helps to discover novel and hopefully useful information. Data mining is used to predict gene relations in a genome, to understand the relations behind region activation in the brain, and to predict the protein folding that results from changes in the DNA.
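
The Apriori algorithm mentioned above can be illustrated with a minimal sketch in Python. It finds all itemsets that occur in at least a given number of transactions, building larger candidates only from smaller itemsets that are already known to be frequent. The function name `apriori`, the parameter `min_support` and the gene labels in the usage note are assumptions made for this illustration; they are not taken from any of the cited works.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset found in at least
    min_support transactions (min_support is an absolute count)."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count the individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        candidates = set()
        # Join step: unions of two frequent (k-1)-itemsets with exactly k items.
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step: every (k-1)-subset of a candidate must be frequent.
            if all(frozenset(sub) in current
                   for sub in combinations(union, k - 1)):
                candidates.add(union)
        # Count how many transactions contain each surviving candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent
```

Called on five hypothetical transactions such as `[{"geneA", "geneB", "geneC"}, {"geneA", "geneB"}, {"geneA", "geneC"}, {"geneB", "geneC"}, {"geneA", "geneB", "geneC"}]` with `min_support=3`, the sketch keeps the three single items and the three pairs, and discards the triple because it occurs in only two transactions.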

2 Brain Functionality

Understanding the human brain and the functional composition of brain activity is one of the challenging tasks of biology today. Research in this area is heavily dependent on image recognition. Functional Magnetic Resonance Imaging (fMRI) is used as the basis of data retrieval. The resulting 3D images show locations (Regions of Interest, RoI) of increased brain activity. Two kinds of functional associations in the human brain are of interest in the international study called "Computationally Intelligent Methods for Mining 3D Medical Images" [3]. One is to understand the association between damaged brain regions and the resulting neuropsychological deficits. This might be of interest for assessing probable damage before brain surgery. The second is to identify activation patterns for different tasks. Subjects are asked to perform different tasks and the activation of brain regions is measured. This helps to identify the regions necessary for a specific task (for example, learning). Current techniques are either too computationally expensive or not accurate enough. The study [3] tries to tackle that problem in two ways. Adaptive recursive partitioning is used to reduce the domain, and a neural network is used to classify the resulting data. To identify discriminative regions in Alzheimer's disease patients, statistical methods, adaptive statistical methods and neural networks are compared. The neural network outperformed both statistical methods in the accuracy of the prediction of affected regions.
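
The idea of reducing the domain by recursive partitioning can be sketched roughly as follows. This is not the procedure of the study [3]: it works on 2D grids instead of 3D volumes, and it replaces a proper statistical test with a plain difference of group means. The names `region_mean`, `group_mean`, `discriminative_regions` and the threshold `min_gap` are all assumptions made for this illustration.

```python
def region_mean(image, r0, r1, c0, c1):
    """Mean intensity of one image over rows r0..r1-1 and columns c0..c1-1."""
    total = sum(image[r][c] for r in range(r0, r1) for c in range(c0, c1))
    return total / ((r1 - r0) * (c1 - c0))


def group_mean(images, r0, r1, c0, c1):
    """Average of the region means over all images of one group."""
    return sum(region_mean(img, r0, r1, c0, c1) for img in images) / len(images)


def discriminative_regions(group_a, group_b, r0, r1, c0, c1, min_gap):
    """Return the regions (r0, r1, c0, c1) in which the two groups differ.

    A region is accepted when the group means differ by at least min_gap
    (a simplified stand-in for a statistical test). Otherwise it is split
    into smaller regions, down to single cells.
    """
    gap = abs(group_mean(group_a, r0, r1, c0, c1)
              - group_mean(group_b, r0, r1, c0, c1))
    if gap >= min_gap:
        return [(r0, r1, c0, c1)]
    if r1 - r0 <= 1 and c1 - c0 <= 1:
        return []  # a single cell that does not discriminate

    # Halve every dimension that still has more than one cell.
    rm, cm = (r0 + r1) // 2, (c0 + c1) // 2
    row_spans = [(r0, rm), (rm, r1)] if r1 - r0 > 1 else [(r0, r1)]
    col_spans = [(c0, cm), (cm, c1)] if c1 - c0 > 1 else [(c0, c1)]

    found = []
    for ra, rb in row_spans:
        for ca, cb in col_spans:
            found.extend(
                discriminative_regions(group_a, group_b, ra, rb, ca, cb, min_gap))
    return found
```

Only the regions returned by such a procedure would then be passed on to a classifier, which is what keeps the input of the classification step small.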

3 Protein structure prediction

The aim of protein structure prediction is to determine the three-dimensional structure of proteins from their amino acid sequence [4]. Combining this information with knowledge of the structure of useful proteins leads to rational drug design and speeds up drug research. Determining protein structures experimentally is tedious and expensive, and molecular spectroscopy is needed to verify the resulting structure. Several factors make it extremely difficult to predict the structure. The most important is probably that the physical stability of the molecule is not fully understood. This is where prediction comes into play, since generating the structures by simulation is not possible. Mohammed J. Zaki [5] has written an interesting paper called "Mining Protein Contact Maps". The sequence of amino acids (the linear structure) determines the way a protein is folded. Since the physical model behind this is not understood, similarities between sequences and their three-dimensional structures can help to understand and predict the structural outcome of a protein. Such data-driven approaches are generally useful when the physical model is not understood. The Protein Data Bank holds records of the position of each atom in a known protein. Clustering, classification, association rules, hidden Markov models and many more data mining algorithms are applied to predict the structure that a sequence produces. These heuristic approaches deliver only a probability and not a certainty, which seems to be enough for now. Unarguably, knowing the physical model would lead to exact results. But even if the model were known, simulating the construction of a protein would be very complex. The probabilistic approach yields results faster.

Those measures are applied to protein contact maps. These are matrices that record which amino acids in a protein are in contact with each other. Mohammed J. Zaki used the hidden Markov model HMMSTR to predict whether two amino acids are likely to be in contact.
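
To make the notion of a contact map concrete, the following Python sketch builds one from known atom positions. It assumes one representative coordinate per amino acid, given in angstroms, and treats two amino acids as being in contact when they lie closer together than a distance threshold. The function name `contact_map`, the default threshold of 7.0 and the coordinates in the usage note are assumptions made for this illustration. The sketch only shows how a map is derived from a known structure; predicting a map from the sequence alone is the task addressed in [5].

```python
import math


def contact_map(coords, threshold=7.0):
    """Build a symmetric 0/1 contact matrix.

    coords holds one (x, y, z) position per amino acid. Entry [i][j] is 1
    when amino acids i and j are closer than threshold. The diagonal is
    left at 0 because a residue's contact with itself carries no information.
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < threshold:
                cmap[i][j] = 1
                cmap[j][i] = 1
    return cmap
```

For four made-up positions `[(0, 0, 0), (3.8, 0, 0), (7.6, 0, 0), (3.8, 3.8, 0)]`, the first and third amino acids are 7.6 apart and therefore not in contact, while every other pair is.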

4 Discussion and conclusions

Data mining in bioinformatics has a revolutionary impact on biology. Research that does not apply data mining methods where the underlying model is not known risks missing essential discoveries. The data in genome and protein databases is growing constantly, and new computer clusters are crunching quantities of numbers like never before. This has in turn led to new approaches in data mining, optimising the algorithms, and the combinations of them, that are applied to the biological data. Advances in artificial intelligence play a growing role in those techniques because in most cases the data is not understood, and self-organizing maps (neural networks) and genetic algorithms continuously search for similarities and optimisations in an unsupervised manner.

References

[1] C. Apte, E. Grossman, E. Pednault, B. Rosen, F. Tipu, and B. White. Probabilistic estimation based data mining for discovering insurance risks. Technical report, IBM Corporation, Yorktown Heights, NY, September 1999.

[2] D. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, Cambridge, 2001.

[3] Despina Kontos, Vasileios Megalooikonomou, and Filia Makedon. Computationally intelligent methods for mining 3D medical images. Technical report, Temple University, Department of Computer Science, Dartmouth College, University of the Aegean, 2002.

[4] Wikipedia. Protein structure prediction. World Wide Web page [http://en.wikipedia.org/wiki/Protein_structure_prediction].

[5] Mohammed J. Zaki. Mining protein contact maps. Technical report, Rensselaer Polytechnic Institute, Computer Science Department, 2000.
