Advanced Algorithms for Biological Data Analysis

Center for Bioinformation Technology (CBIT) & Biointelligence Laboratory

School of Computer Science and Engineering, Seoul National University

http://bi.snu.ac.kr/ http://cbit.snu.ac.kr/

2

Lecture Schedule

Day 1: Introduction to Machine Learning
Day 2: Neural Networks
Day 3: Hidden Markov Models
Day 4: Principal Component Analysis
Day 5: Clustering Analysis

Introduction to Machine Learning Algorithms in Bioinformatics

Byoung-Tak Zhang

Center for Bioinformation Technology (CBIT) & Biointelligence Laboratory

School of Computer Science and Engineering, Seoul National University

E-mail: [email protected]

http://bi.snu.ac.kr/ http://cbit.snu.ac.kr/

4

Outline

Part I

Concept of Machine Learning (ML)

Machine Learning Algorithms and Applications

Applications in Bioinformatics

Part II

Version Space Learning

Decision Tree Learning

5

6

What is Artificial Intelligence (AI)?

Design and study of computer programs that behave intelligently.

Designing computer programs to make computers smarter.

Study of how to make computers do things at which, at the moment, people are better.

(No satisfactory definition of AI)

7

Research Areas and Approaches

Artificial Intelligence

Research
  Learning Algorithms
  Inference Mechanisms
  Knowledge Representation
  Intelligent System Architecture

Paradigm
  Rationalism (Logical)
  Empiricism (Statistical)
  Connectionism (Neural)
  Evolutionary (Genetic)
  Biological (Molecular)

Application
  Intelligent Agents
  Information Retrieval
  Electronic Commerce
  Data Mining
  Bioinformatics
  Natural Language Proc.
  Expert Systems

8

Concept of Machine Learning

9

10

Context

[Figure: Machine Learning shown in the context of Information Theory, Computer Science (AI), Cognitive Science, and Statistics.]

11

Why Machine Learning?

Recent progress in algorithms and theory
Growing flood of online data
Computational power is available
Budding industry

Three niches for machine learning
  Data mining: using historical data to improve decisions
    Medical records --> medical knowledge
  Software applications we can’t program by hand
    Autonomous driving
    Speech recognition
  Self-customizing programs
    Newsreader that learns user interests

12

Brief History of Machine Learning

1950s: Samuel’s checker player.

1960s: Neural networks, perceptron; pattern recognition; learning in the limit theory; Minsky & Papert.

1970s: Symbolic concept induction; Winston’s arch learner; knowledge acquisition bottleneck; Quinlan’s ID3; Michalski’s AQ and soybean diagnosis results; scientific discovery with BACON; mathematical discovery with AM.

1980s: Continued progress on decision-tree and rule learning; explanation-based learning; speedup learning; utility problem; analogy; resurgence of connectionism (PDP, ANN); Valiant’s PAC learning; experimental evaluation.

1990s: Data mining; adaptive software agents & IR; reinforcement learning; theory refinement; inductive logic programming; voting, bagging, boosting, and stacking; learning Bayesian networks.

13

Learning: Definition

Definition: Learning is the improvement of performance in some environment through the acquisition of knowledge resulting from experience in that environment.

  the improvement of behavior
  on some performance task
  through acquisition of knowledge
  based on partial task experience

14

A Learning Problem: EnjoySport

Sky    Temp  Humid   Wind    Water  Forecast  EnjoySport
Sunny  Warm  Normal  Strong  Warm   Same      Yes
Sunny  Warm  High    Strong  Warm   Same      Yes
Rainy  Cold  High    Strong  Warm   Change    No
Sunny  Warm  High    Strong  Cool   Change    Yes

What is the general concept?
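One simple way to look for a general concept in the EnjoySport data is to start from the first positive example and relax every attribute that disagrees with a later positive example. The sketch below is an illustration under that assumption, not a method stated on the slide; the function names and the use of "?" for "any value" are hypothetical choices.

```python
# Training examples from the EnjoySport table: six attribute values, then the label.
EXAMPLES = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]


def most_specific_hypothesis(examples):
    """Generalize over the positive examples only; '?' means 'any value'."""
    hypothesis = None
    for attributes, label in examples:
        if not label:
            continue  # negative examples are ignored by this simple procedure
        if hypothesis is None:
            hypothesis = list(attributes)
        else:
            # Keep an attribute value only if this positive example agrees with it.
            hypothesis = [h if h == a else "?" for h, a in zip(hypothesis, attributes)]
    return hypothesis


def matches(hypothesis, attributes):
    """True if the hypothesis classifies the attribute tuple as positive."""
    return all(h == "?" or h == a for h, a in zip(hypothesis, attributes))


h = most_specific_hypothesis(EXAMPLES)
print(h)  # ['Sunny', 'Warm', '?', 'Strong', '?', '?']
```

The resulting hypothesis covers all three positive days and rejects the rainy day, so it is consistent with the four rows of the table.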

15

Possible Uses of Machine Learning

configuration and design
planning and scheduling
language understanding
vision and speech
execution and control
diagnostic reasoning
data mining and knowledge discovery

16

Metaphors and Methods

Metaphor               Method
Neurobiology           Connectionist Learning
Biological Evolution   Genetic Learning
Heuristic Search       Tree / Rule Induction
Statistical Inference  Probabilistic Induction
Memory and Retrieval   Case-Based Learning

17

Learning: Components

Components of a learning system
  Performance: accuracy, efficiency, understandability
  Environment: external setting to the learner
  Knowledge: internal data structure
  Experience: perception, action, mental traces
  Improvement: desirable change in performance

18

Learning System

[Figure: block diagram of a learning system with four components: Environment, Learning, Knowledge, and Performance. Arrow labels: get data, acquired knowledge, get knowledge, improve behavior, problem, solution.]

19

What is the Learning Problem?

Learning = improving with experience at some task
  Improve over task T,
  With respect to performance measure P,
  Based on experience E.

E.g., Learn to play checkers
  T: Play checkers
  P: % of games won in world tournament
  E: opportunity to play against self

20

Machine Learning: Tasks

Supervised Learning
  Estimate an unknown mapping from known input-output pairs
  Learn $f_w$ from training set $D = \{(x, y)\}$ s.t. $f_w(x) = y = f(x)$
  Classification: y is discrete
  Regression: y is continuous

Unsupervised Learning
  Only input values are provided
  Learn $f_w$ from $D = \{(x)\}$ s.t. $f_w(x) = x$
  Compression
  Clustering

Reinforcement Learning
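As a concrete instance of the supervised regression task, the sketch below learns a parametric function f_w from input-output pairs. It assumes the simplest possible model, a straight line f_w(x) = w0 + w1 x fitted by least squares; the model choice, the toy data, and the function names are illustrative assumptions, not part of the slides.

```python
def fit_line(pairs):
    """Least-squares fit of f_w(x) = w0 + w1 * x to (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    w1 = sxy / sxx
    w0 = mean_y - w1 * mean_x
    return w0, w1


def predict(w, x):
    """Evaluate the learned function f_w at a new input x."""
    w0, w1 = w
    return w0 + w1 * x


# Toy training set D = {(x, y)} generated from the unknown mapping y = 2x + 1.
D = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w = fit_line(D)
print(w)                # (1.0, 2.0)
print(predict(w, 4.0))  # 9.0
```

Because y is continuous here, this is regression; with a discrete y the same setup becomes classification.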

21

Machine Learning: Strategies

Rote learning
Concept learning
Learning from examples
Learning by instruction
Inductive learning
Deductive learning
Explanation-based learning (EBL)
Learning by analogy
Learning by observation

22

Supervised Learning

Given a sequence of input/output pairs of the form <xi, yi>, where xi is a possible input and yi is the output associated with xi.

Learn a function f that accounts for the examples seen so far, f(xi) = yi for all i, and that makes a good guess for the outputs of the inputs that it has not seen.
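A very small learner with both properties (it reproduces every example seen so far and still guesses on unseen inputs) is the nearest-neighbor rule. The sketch below is one possible illustration, assuming numeric input vectors with distinct training inputs; the data and names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest_neighbor_predict(examples, x):
    """Return the output y_i of the stored example whose input x_i is closest to x."""
    _, label = min(examples, key=lambda example: squared_distance(example[0], x))
    return label


# Input/output pairs <x_i, y_i>; the class names are made up for illustration.
examples = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((4.0, 4.2), "B"),
    ((3.8, 4.0), "B"),
]

# f(x_i) = y_i holds for every stored example ...
print(all(nearest_neighbor_predict(examples, x) == y for x, y in examples))  # True
# ... and unseen inputs receive the label of the most similar stored input.
print(nearest_neighbor_predict(examples, (1.1, 0.9)))  # A
print(nearest_neighbor_predict(examples, (4.1, 3.9)))  # B
```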

23

Examples of Input-Output Pairs

Task                   Inputs                                          Outputs
Recognition            Descriptions of objects                         Classes that the objects belong to
Action                 Descriptions of situations                      Actions or predictions
Janitor robot problem  Descriptions of offices (floor, prof’s office)  Yes or No (indicating whether or not the office contains a recycling bin)

24

Classification and Concept Learning

Classification
  If the function is discrete valued, then the outputs are called classes

Concept learning
  Learned function has only two possible outputs

25

Unsupervised Learning

Clustering
  A clustering algorithm partitions the inputs into a fixed number of subsets or clusters so that inputs in the same cluster are close to one another.

Discovery learning
  The objective is to uncover new relations in the data.

Reinforcement learning
  Uses a feedback signal (not the target output) that gives the learning program an indication of whether or not what it has learned is correct.
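To make the clustering item above concrete, here is a minimal k-means sketch: it partitions inputs into a fixed number k of clusters so that each input ends up with its closest center. The slides do not name a specific clustering algorithm, so k-means is an assumed stand-in; initializing the centers from the first k points is also an assumption, and a different initialization can give a different partition.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, k, iterations=20):
    """Partition points into k clusters; returns (centers, clusters)."""
    centers = [points[i] for i in range(k)]  # assumed init: the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        for j, members in enumerate(clusters):
            if members:
                centers[j] = tuple(sum(v) / len(members) for v in zip(*members))
    return centers, clusters


points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.8), (0.8, 1.2), (4.8, 5.2)]
centers, clusters = kmeans(points, k=2)
print(clusters[0])  # the three points near (1, 1)
print(clusters[1])  # the three points near (5, 5)
```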

26

Online and Batch Learning

Batch methods
  Process large sets of examples all at once.

Online (incremental) methods
  Process examples one at a time.
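The difference can be seen on the simplest possible "model", an estimate of the mean of the data. The sketch below is only an illustration under that assumption: the batch version needs the whole data set at once, while the online version updates its estimate after each example and never stores the data.

```python
def batch_mean(values):
    """Batch method: process the full set of examples at once."""
    return sum(values) / len(values)


class OnlineMean:
    """Online method: update the estimate one example at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        self.count += 1
        # Move the estimate toward the new example by a shrinking step size.
        self.mean += (value - self.mean) / self.count
        return self.mean


data = [2.0, 4.0, 6.0, 8.0]
learner = OnlineMean()
for x in data:
    learner.update(x)

print(batch_mean(data))  # 5.0
print(learner.mean)      # 5.0
```

Both versions arrive at the same estimate here; the online one can also be used while data is still arriving.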

27

Machine Learning Algorithms and Applications

28

Machine Learning Algorithms (1/2)

Symbolic Learning (covered on Day 1)
  Version Space Learning
  Case-Based Learning

Neural Learning (covered on Day 2)
  Multilayer Perceptrons (MLPs)
  Self-Organizing Maps (SOMs)
  Support Vector Machines (SVMs)

Evolutionary Learning (very briefly explained on Day 1)
  Evolution Strategies
  Evolutionary Programming
  Genetic Algorithms
  Genetic Programming

29

Machine Learning Algorithms (2/2)

Probabilistic Learning (covered on Days 3 and 5)
  Bayesian Networks (BNs)
  Helmholtz Machines (HMs)
  Latent Variable Models (LVMs)
  Generative Topographic Mapping (GTM)

Other Machine Learning Methods (partially covered on Days 1 and 4)
  Decision Trees (DTs)
  Reinforcement Learning (RL)
  Boosting Algorithms
  Mixture of Experts (ME)
  Independent Component Analysis (ICA)

30

Example Applications of ML (1/2)

Banking & Investment
  Credit card fraud
  Delinquent accounts
  Authorization of purchases
  Predict stock market

Health Care
  Disease diagnosis
  Managing resources
  Look for causal relationships between environment and disease

Marketing
  Credit card applications
  Use past buying habits to predict likelihood of customer purchasing some new product

Textual Data Mining

31

Example Applications of ML (2/2)

Astronomy
Bioinformatics
Chemistry
Human resources: evaluating job performance
Insurance & Finance
Manufacturing: process control
Signal and image processing
Speech recognition
…

32

Neural Nets for Handwritten Digit Recognition

[Figure: digit images pass through pre-processing into a neural network with input units, hidden units, and output units labeled 0, 1, 2, 3, …, 9. Two panels: Training and Test; in the Test panel the output is marked "?".]

33

ALVINN System: Neural Network Learning to Steer an Autonomous Vehicle

34

Learning to Navigate a Vehicle by Observing a Human Expert (1/2)

Inputs
  The images produced by a camera mounted on the vehicle

Outputs
  The actions taken by the human driver to steer the vehicle or adjust its speed.

Result of learning
  A function mapping images to control actions

35

Learning to Navigate a Vehicle by Observing a Human Expert (2/2)

36

Data Recorrection by a Hopfield Network

[Figure: five image panels, in order]
  Original target data
  Corrupted input data
  Recorrected data after 10 iterations
  Recorrected data after 20 iterations
  Fully recorrected data after 35 iterations
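The recorrection idea can be sketched with a tiny Hopfield network. The code below assumes the textbook setup (units with values +1/-1, Hebbian weights, asynchronous sign updates) and an 8-unit pattern invented for illustration; it is not the network or the data from the slide, and with such a small pattern the recall finishes in far fewer sweeps than the iteration counts shown there.

```python
def train_hopfield(patterns):
    """Hebbian weights for patterns with entries +1/-1; no self-connections."""
    n = len(patterns[0])
    weights = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    weights[i][j] += p[i] * p[j] / n
    return weights


def recall(weights, state, sweeps=5):
    """Asynchronous updates: each unit takes the sign of its input field."""
    s = list(state)
    for _ in range(sweeps):
        for i in range(len(s)):
            field = sum(weights[i][j] * s[j] for j in range(len(s)))
            if field != 0:
                s[i] = 1 if field > 0 else -1
    return s


target = [1, -1, 1, -1, 1, 1, -1, -1]      # original target data (hypothetical)
corrupted = [-1, -1, 1, 1, 1, 1, -1, -1]   # two units flipped
weights = train_hopfield([target])
restored = recall(weights, corrupted)
print(restored == target)  # True
```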

37

Predicting the Sunspot Number with Neural Networks

38

ANN for Face Recognition

960 x 3 x 4 network is trained on gray-level images of faces to predict whether a person is looking to their left, right, ahead, or up.

39

Data Mining

[Figure: the data mining process as a pipeline]
  Database / data warehouse
  --> Selection & Sampling --> Target data
  --> Preprocessing & Cleaning --> Cleaned data
  --> Transformation & reduction --> Transformed data
  --> Data Mining --> Patterns / model
  --> Interpretation / Evaluation --> Knowledge
  --> Performance system

40

Customer Relationship Management (CRM)

Increased Customer Lifetime Value
Increased Wallet Share
Improved Customer Retention
Segmentation of Customers by Profitability
Segmentation of Customers by Risk of Default
Integrating Data Mining into the Full Marketing Process

41

Hot Water Flashing Nozzle with Evolutionary Algorithms

[Figure: nozzle shape, labeled "Start". Hot water entering; at throat: Mach 1 and onset of flashing; steam and droplet at exit.]

Hans-Paul Schwefel performed the original experiments
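The nozzle experiment itself cannot be reproduced in code here, but the search principle behind an evolution strategy can be sketched. The example below assumes the simplest variant, a (1+1) evolution strategy with a fixed mutation step, and uses a toy objective (the sum of squares) as a stand-in for nozzle performance; the step size, seed, and function names are illustrative assumptions.

```python
import random


def one_plus_one_es(objective, start, sigma=0.5, generations=200, seed=0):
    """(1+1)-ES: mutate the parent, keep the child only if it is no worse."""
    rng = random.Random(seed)
    parent = list(start)
    parent_cost = objective(parent)
    for _ in range(generations):
        # Mutation: add Gaussian noise to every design parameter.
        child = [x + rng.gauss(0.0, sigma) for x in parent]
        child_cost = objective(child)
        # Selection: the better of parent and child survives.
        if child_cost <= parent_cost:
            parent, parent_cost = child, child_cost
    return parent, parent_cost


def sphere(x):
    """Toy objective to minimize; stands in for a measured design quality."""
    return sum(v * v for v in x)


start = [3.0, -2.0]
best, best_cost = one_plus_one_es(sphere, start)
print(best_cost <= sphere(start))  # True: selection never accepts a worse design
```

Because a child replaces the parent only when it is at least as good, the cost can never increase from one generation to the next.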

42

Case-Based Reasoning (Aamodt & Plaza, 1994)

[Figure: the case-based reasoning cycle, drawing on a Case Base and General Knowledge]
  Input: New Problem
  1. Retrieve --> Retrieved Cases
  2. Reuse --> Retrieved Solution
  3. Revise
  4. Retain --> Learned Case
  Output

43

Machine Learning Applications in Bioinformatics

44

Bioinformatics

What is Bioinformatics?

Bioinformatics is a new term referring to the discipline that employs computers to store, retrieve, analyze and assist in understanding biological information.

The application of information technology and computer science to the study of biological systems.

The analysis of the massive (and constantly increasing) amount of genetic information

Sophisticated computer technologies to enable discovery in all fields of life sciences.

45

Problems in Bioinformatics

Structure analysis
  Protein structure comparison
  Protein structure prediction
  RNA structure modeling

Pathway analysis
  Metabolic pathway
  Regulatory networks

Sequence analysis
  Sequence alignment
  Structure and function prediction
  Gene finding

Expression analysis
  Gene expression analysis
  Gene clustering

46

Applications of Bioinformatics

Drug design
Identification of genetic risk factors
Gene therapy
Genetic modification of food crops and animals
Forensics
Biological warfare
Personalized Medicine
E-Doctor

47

Machine Learning and Bioinformatics

[Figure: machine learning extracts knowledge from biological databases (Bio DB). Surrounding labels: Drug Development, Medical therapy research, Pharmacology, Ecology.]

48

Machine Learning Techniques for Bio Data Mining

Sequence Alignment
  Simulated Annealing
  Genetic Algorithms

Structure and Function Prediction
  Hidden Markov Models
  Multilayer Perceptrons
  Decision Trees

Molecular Clustering and Classification
  Support Vector Machines
  Nearest Neighbor Algorithms

Expression (DNA Chip Data) Analysis
  Self-Organizing Maps
  Bayesian Networks

49

Structure and Function Prediction

Protein structure prediction

Protein modeling

Gene finding and gene prediction

50

Effect and Applications of Biological Data Mining

Biological Data Mining: store, retrieve, analyze and assist in understanding biological information

  Biocomputing
  Diagnosis with Chip SNP (Single Nucleotide Polymorphism)
  Customized Drug
  Increase and Improvement of Farm Products
  Renewable Energy

51

Hidden Markov Models for Protein Modeling

Alphabet of 20 letters (the 20 amino acids)
m0: start state, m5: end state, mk: match states
ik: insertion states, dk: deletion states
T(s2|s1): transition probabilities
P(x|mk): letter-generating (emission) probabilities (x: a letter, i.e. an amino acid)

52

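A Simple Example of Hidden Markov Models

[Figure: a small HMM between a start state S and an end state E. Recovered labels: transition probabilities of 0.5; one emission distribution with probability 0.25 for each nucleotide; one emission distribution with probabilities 0.1, 0.1, 0.1, 0.7. Example sequence: ATCCTTTTTTTCA]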
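How such a model assigns a probability to a sequence can be sketched with the forward algorithm. The figure's exact topology is not recoverable, so the code below assumes a two-state model: both states are equally likely at the start, every transition has probability 0.5, one state emits all nucleotides uniformly (0.25 each), and the other favors T (0.7, with 0.1 for the rest). Which letter receives the 0.7 and the omission of an explicit end state are assumptions.

```python
STATES = ("uniform", "t_rich")  # hypothetical state names

START = {"uniform": 0.5, "t_rich": 0.5}
TRANS = {s: {"uniform": 0.5, "t_rich": 0.5} for s in STATES}
EMIT = {
    "uniform": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "t_rich": {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},  # assumed: T gets 0.7
}


def sequence_likelihood(seq):
    """Forward algorithm: probability that the model generates seq."""
    # alpha[s] = probability of the prefix read so far, ending in state s.
    alpha = {s: START[s] * EMIT[s][seq[0]] for s in STATES}
    for symbol in seq[1:]:
        alpha = {
            s: EMIT[s][symbol] * sum(alpha[r] * TRANS[r][s] for r in STATES)
            for s in STATES
        }
    return sum(alpha.values())


print(sequence_likelihood("ATCCTTTTTTTCA"))
```

Summing over all hidden state paths this way costs time linear in the sequence length, instead of enumerating every path.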

53

Clustering of Related Gene Expressions

54

Non-negative Matrix Factorization: Clustering Gene Expression Data

[Figure: factorization of the gene expression matrix G into W and H. Recovered labels: 7,129 genes; 38 samples; 2 factors; encoding; H1, H2; g1, g2, g3, g4, …, g7,129.]

Factors can capture the correlations between the genes using the values of expression level.

Cluster training samples into 2 groups by NMF
  Assign each sample to the factor (class) which has the higher encoding value.
  Accuracy: 0 to 1 errors on the training data set
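The clustering rule (assign each sample to the factor with the higher encoding value) can be sketched on a toy matrix. The code assumes the orientation G (genes x samples) ≈ W (genes x factors) · H (factors x samples) and the standard multiplicative update rules for squared error; the slide does not say which NMF algorithm was used, and the 4 x 4 matrix below is invented for illustration.

```python
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def squared_error(g, w, h):
    approx = matmul(w, h)
    return sum((gv - av) ** 2 for gr, ar in zip(g, approx) for gv, av in zip(gr, ar))


def nmf(g, factors, iterations=200, seed=0, eps=1e-9):
    """Multiplicative updates; W and H stay non-negative throughout."""
    rng = random.Random(seed)
    w = [[rng.random() + 0.1 for _ in range(factors)] for _ in g]
    h = [[rng.random() + 0.1 for _ in g[0]] for _ in range(factors)]
    for _ in range(iterations):
        wt = transpose(w)
        num, den = matmul(wt, g), matmul(matmul(wt, w), h)
        h = [[hv * n / (d + eps) for hv, n, d in zip(hr, nr, dr)]
             for hr, nr, dr in zip(h, num, den)]
        ht = transpose(h)
        num, den = matmul(g, ht), matmul(w, matmul(h, ht))
        w = [[wv * n / (d + eps) for wv, n, d in zip(wr, nr, dr)]
             for wr, nr, dr in zip(w, num, den)]
    return w, h


# Toy expression matrix: rows are genes, columns are samples (hypothetical values).
G = [
    [5.0, 4.0, 0.5, 0.2],
    [4.0, 5.0, 0.3, 0.4],
    [0.2, 0.5, 6.0, 5.0],
    [0.4, 0.1, 5.0, 6.0],
]
W, H = nmf(G, factors=2)
# Each sample (column of H) goes to the factor with the larger encoding value.
labels = [max(range(2), key=lambda k: H[k][j]) for j in range(len(G[0]))]
print(labels)
```

On the real data set the same rule would be applied to the 38 columns of H.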

55

Bayesian Networks for Gene Expression Analysis

Learning
  Data --> Preprocessing --> Processed data --> Learning algorithm --> network over Gene A, Gene B, Gene C, Gene D, and Target

Inference
  The values of Gene C and Gene B are given.
  Belief propagation
  Probability for the target is computed.
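The inference step can be illustrated on a very small network. The sketch below swaps belief propagation for plain enumeration, which gives the same answer on a network this small, and it assumes a structure (Gene A influences Gene B and Target; Gene C influences Target) and probability tables that are invented purely for illustration. Given the values of Gene B and Gene C, the unobserved Gene A is summed out to obtain the probability of the target.

```python
# Hypothetical conditional probability tables; 1 = expressed, 0 = not expressed.
P_A = {1: 0.3, 0: 0.7}                       # P(A = a)
P_B_GIVEN_A = {1: 0.8, 0: 0.1}               # P(B = 1 | A = a)
P_T_GIVEN_A_C = {                            # P(Target = 1 | A = a, C = c)
    (1, 1): 0.9, (1, 0): 0.6,
    (0, 1): 0.4, (0, 0): 0.05,
}


def posterior_target(b, c):
    """P(Target = 1 | B = b, C = c), summing out the hidden Gene A."""
    numerator = 0.0
    evidence = 0.0
    for a in (0, 1):
        p_b = P_B_GIVEN_A[a] if b == 1 else 1.0 - P_B_GIVEN_A[a]
        weight = P_A[a] * p_b          # P(A = a, B = b); P(C = c) cancels out
        evidence += weight
        numerator += weight * P_T_GIVEN_A_C[(a, c)]
    return numerator / evidence


print(posterior_target(b=1, c=1))
```

Observing Gene B changes the belief about Gene A, which in turn changes the predicted probability of the target.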

56

Multilayer Perceptrons for Gene Finding and Prediction

[Figure: a multilayer perceptron for exon prediction. Inputs: coding potential value, GC composition, length, donor, acceptor, intron vocabulary. Output: exon score between 0 and 1. Other recovered labels: bases, sequence, score, discrete exon score.]
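One of the listed input features, GC composition, is easy to compute and shows what a feature extractor for such a network might look like. The sketch below is a minimal illustration; the sliding-window variant and the window size are assumptions, not the feature definition used by the network on the slide.

```python
def gc_content(sequence):
    """Fraction of bases in the sequence that are G or C."""
    sequence = sequence.upper()
    return sum(base in "GC" for base in sequence) / len(sequence)


def windowed_gc(sequence, window):
    """GC composition of every window of the given length along the sequence."""
    return [
        gc_content(sequence[i:i + window])
        for i in range(len(sequence) - window + 1)
    ]


print(gc_content("ATGCTCGAAGCT"))  # 0.5
print(windowed_gc("ATGC", 2))      # [0.0, 0.5, 1.0]
```

A per-window value like this could be fed to the network together with the other features for each candidate region.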

57

Self-Organizing Maps for DNA Microarray Data Analysis

[Figure: a self-organizing map. Labels: input; bundle of synaptic connections; two-dimensional array of postsynaptic neurons; winning neurons.]
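The role of the winning neuron can be sketched in a few lines: for each input, the neuron whose weight vector is closest wins, and it (with its grid neighbors) moves toward the input. The code below is a minimal single-step illustration on a hypothetical 2 x 2 map; the learning rate, the neighborhood radius, and the square-neighborhood rule are assumptions.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def best_matching_unit(weights, x):
    """Grid position of the neuron whose weight vector is closest to x."""
    return min(weights, key=lambda pos: squared_distance(weights[pos], x))


def som_update(weights, x, learning_rate=0.5, radius=0):
    """Move the winner and its grid neighbors within `radius` toward x."""
    winner = best_matching_unit(weights, x)
    for pos, w in weights.items():
        grid_distance = max(abs(pos[0] - winner[0]), abs(pos[1] - winner[1]))
        if grid_distance <= radius:
            weights[pos] = [wi + learning_rate * (xi - wi) for wi, xi in zip(w, x)]
    return winner


# A 2 x 2 map; keys are grid positions, values are weight vectors.
weights = {
    (0, 0): [0.0, 0.0],
    (0, 1): [0.0, 1.0],
    (1, 0): [1.0, 0.0],
    (1, 1): [1.0, 1.0],
}
winner = som_update(weights, [0.9, 0.8])
print(winner)           # (1, 1)
print(weights[winner])  # moved halfway toward the input
```

For microarray data, each input would be the expression profile of one gene, so genes with similar profiles end up at the same or neighboring neurons.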

58

Biological Information Extraction

[Figure: information extraction pipeline from text data to a database. Recovered labels: Text Data; Data Analysis & Field Identification; Data Classification & Field Extraction; Field Property Identification & Learning; Information Extraction; Database Template Filling; DB Record (Location, Date); DB.]

59

Biomolecular Computing

011001101010001 ATGCTCGAAGCT

60

More information on biological data mining and related research can be found at

http://cbit.snu.ac.kr/
http://bi.snu.ac.kr/