Advanced Algorithms for Biological Data Analysis
Center for Bioinformation Technology (CBIT) & Biointelligence Laboratory
School of Computer Science and Engineering, Seoul National University
http://bi.snu.ac.kr/ http://cbit.snu.ac.kr/

Lecture Schedule
Day 1: Introduction to Machine Learning
Day 2: Neural Networks
Day 3: Hidden Markov Models
Day 4: Principal Component Analysis
Day 5: Clustering Analysis

Introduction to Machine Learning Algorithms in Bioinformatics
Byoung-Tak Zhang
Center for Bioinformation Technology (CBIT) & Biointelligence Laboratory
School of Computer Science and Engineering, Seoul National University
E-mail: [email protected]
http://bi.snu.ac.kr/ http://cbit.snu.ac.kr/

Outline
Part I
Concept of Machine Learning (ML)
Machine Learning Algorithms and Applications
Applications in Bioinformatics
Part II
Version Space Learning
Decision Tree Learning

What is Artificial Intelligence (AI)?
Design and study of computer programs that behave intelligently.
Designing computer programs to make computers smarter.
Study of how to make computers do things at which, at the moment, people are better.
(No satisfactory definition of AI)

Research Areas and Approaches
Artificial Intelligence
Research: Learning Algorithms, Inference Mechanisms, Knowledge Representation, Intelligent System Architecture
Paradigm: Rationalism (Logical), Empiricism (Statistical), Connectionism (Neural), Evolutionary (Genetic), Biological (Molecular)
Application: Intelligent Agents, Information Retrieval, Electronic Commerce, Data Mining, Bioinformatics, Natural Language Proc., Expert Systems

Context
[Figure: Machine Learning shown in the context of Information Theory, Computer Science (AI), Cognitive Science, and Statistics]

Why Machine Learning?
Recent progress in algorithms and theory
Growing flood of online data
Computational power is available
Budding industry
Three niches for machine learning
  Data mining: using historical data to improve decisions
    Medical records --> medical knowledge
  Software applications we can’t program by hand
    Autonomous driving
    Speech recognition
  Self-customizing programs
    Newsreader that learns user interests

Brief History of Machine Learning
1950’s: Samuel’s checkers player
1960’s: Neural networks, perceptron; pattern recognition; learning-in-the-limit theory; Minsky & Papert.
1970’s: Symbolic concept induction; Winston’s arch learner; knowledge acquisition bottleneck; Quinlan’s ID3; Michalski’s AQ and soybean diagnosis results; scientific discovery with BACON; mathematical discovery with AM.
1980’s: Continued progress on decision-tree and rule learning; explanation-based learning; speedup learning; utility problem; analogy; resurgence of connectionism (PDP, ANN); Valiant’s PAC learning; experimental evaluation.
1990’s: Data mining; adaptive software agents & IR; reinforcement learning; theory refinement; inductive logic programming; voting, bagging, boosting, and stacking; learning Bayesian networks.

Learning: Definition
Definition: Learning is the improvement of performance in some environment through the acquisition of knowledge resulting from experience in that environment.
  the improvement of behavior
  on some performance task
  through acquisition of knowledge
  based on partial task experience

A Learning Problem: EnjoySport
Sky    Temp  Humid   Wind    Water  Forecast  EnjoySport
Sunny  Warm  Normal  Strong  Warm   Same      Yes
Sunny  Warm  High    Strong  Warm   Same      Yes
Rainy  Cold  High    Strong  Warm   Change    No
Sunny  Warm  High    Strong  Cool   Change    Yes
What is the general concept?
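
One way to approach the question "What is the general concept?" is a specific-to-general search over conjunctive hypotheses. The sketch below uses the Find-S procedure, which is an assumption on our part (the slide does not name an algorithm); the four examples are copied from the EnjoySport table, and the function names are illustrative.

```python
# Find-S: a minimal sketch of specific-to-general concept learning.
ATTRIBUTES = ["Sky", "Temp", "Humid", "Wind", "Water", "Forecast"]

# Training examples copied from the EnjoySport table.
EXAMPLES = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]


def find_s(examples):
    """Return the most specific conjunctive hypothesis covering all positives.

    "?" means that any value is acceptable for that attribute.
    """
    hypothesis = None  # None stands for "no positive example seen yet"
    for attributes, label in examples:
        if not label:
            continue  # Find-S ignores negative examples
        if hypothesis is None:
            hypothesis = list(attributes)
            continue
        for i, value in enumerate(attributes):
            if hypothesis[i] != value:
                hypothesis[i] = "?"  # generalize the mismatching attribute
    return hypothesis


def matches(hypothesis, attributes):
    """True if the hypothesis classifies the example as positive."""
    return all(h == "?" or h == a for h, a in zip(hypothesis, attributes))


learned = find_s(EXAMPLES)
print(dict(zip(ATTRIBUTES, learned)))
```

On these four examples the learned hypothesis keeps Sky=Sunny, Temp=Warm and Wind=Strong and generalizes the other three attributes to "?". It also rejects the single negative example, although Find-S itself never looks at negatives.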

Possible Uses of Machine Learning
configuration and design
planning and scheduling
language understanding
vision and speech
execution and control
diagnostic reasoning
data mining and knowledge discovery

Metaphors and Methods
Metaphor -> Method
Neurobiology -> Connectionist Learning
Biological Evolution -> Genetic Learning
Heuristic Search -> Tree / Rule Induction
Memory and Retrieval -> Case-Based Learning
Statistical Inference -> Probabilistic Induction

Learning: Components
Components of a learning system
  Performance: accuracy, efficiency, understandability
  Environment: external setting to the learner
  Knowledge: internal data structure
  Experience: perception, action, mental traces
  Improvement: desirable change in performance

Learning System
[Figure: a learning system with the components Environment, Learning, Knowledge, and Performance; arrows labeled "get data", "acquired knowledge", "get knowledge", "improve behavior", "problem", and "solution"]

What is the Learning Problem?
Learning = improving with experience at some task
  Improve over task T,
  With respect to performance measure P,
  Based on experience E.
E.g., Learn to play checkers
  T: Play checkers
  P: % of games won in world tournament
  E: opportunity to play against self

Machine Learning: Tasks
Supervised Learning
  Estimate an unknown mapping from known input-output pairs
  Learn $f_w$ from training set $D = \{(\mathbf{x}, y)\}$ s.t. $f_w(\mathbf{x}) = y = f(\mathbf{x})$
  Classification: y is discrete
  Regression: y is continuous
Unsupervised Learning
  Only input values are provided
  Learn $f_w$ from $D = \{(\mathbf{x})\}$ s.t. $f_w(\mathbf{x}) = \mathbf{x}$
  Compression
  Clustering
Reinforcement Learning

Machine Learning: Strategies
Rote learning
Concept learning
Learning from examples
Learning by instruction
Inductive learning
Deductive learning
Explanation-based learning (EBL)
Learning by analogy
Learning by observation

Supervised Learning
Given a sequence of input/output pairs of the form <xi, yi>, where xi is a possible input and yi is the output associated with xi.
Learn a function f that accounts for the examples seen so far, f(xi) = yi for all i, and that makes a good guess for the outputs of the inputs that it has not seen.
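
As a minimal sketch of this definition, the code below builds a function f that reproduces every training pair and guesses outputs for unseen inputs. It uses a 1-nearest-neighbor rule, chosen here only because it is short; the data points and names are hypothetical, and f(xi) = yi holds only when the training inputs are distinct.

```python
def fit_nearest_neighbor(pairs):
    """Return a function f that predicts the output of the closest stored input."""
    training = list(pairs)

    def f(x):
        # Squared Euclidean distance is enough for picking the nearest input.
        _, nearest_y = min(
            training,
            key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], x)),
        )
        return nearest_y

    return f


# Hypothetical input/output pairs <x_i, y_i>.
TRAINING = [
    ((0.0, 0.0), "no"),
    ((0.0, 1.0), "no"),
    ((5.0, 5.0), "yes"),
    ((6.0, 5.0), "yes"),
]

f = fit_nearest_neighbor(TRAINING)
print(f((5.5, 4.0)))  # an input the learner has not seen
```

Memorizing the examples accounts for the pairs seen so far; the distance rule is what supplies the "good guess" for inputs that were not seen.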

Examples of Input-Output Pairs
Task: Recognition
  Inputs: Descriptions of objects
  Outputs: Classes that the objects belong to
Task: Action
  Inputs: Descriptions of situations
  Outputs: Actions or predictions
Task: Janitor robot problem
  Inputs: Descriptions of offices (floor, prof’s office)
  Outputs: Yes or No (indicating whether or not the office contains a recycling bin)

Classification and Concept Learning
Classification
  If the function is discrete valued, then the outputs are called classes
Concept learning
  Learned function has only two possible outputs

Unsupervised Learning
Clustering
  A clustering algorithm partitions the inputs into a fixed number of subsets or clusters so that inputs in the same cluster are close to one another.
Discovery learning
  The objective is to uncover new relations in the data.
Reinforcement learning
  Uses a feedback signal (not the target output) that gives the learning program an indication of whether or not what it has learned is correct.
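
The clustering description above can be sketched with k-means, which is one possible algorithm and not one the slide names. The points are made up, and seeding the centers with the first k points is a simplification assumed here to keep the example deterministic.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Partition points into k clusters; returns (assignment, centers)."""
    centers = [list(p) for p in points[:k]]  # assumption: first k points as seeds
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its closest center.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers


# Hypothetical 2-D inputs forming two well-separated groups.
POINTS = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.8, 1.1), (7.9, 8.3)]

assignment, centers = kmeans(POINTS, k=2)
print(assignment)
```

The number of clusters is fixed in advance (k), matching the "fixed number of subsets" in the definition; closeness is measured here by Euclidean distance.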

Online and Batch Learning
Batch methods
  Process large sets of examples all at once.
Online (incremental) methods
  Process examples one at a time.
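
The difference between the two modes can be illustrated on the simplest possible "model", the mean of the observed values. This is only an illustrative sketch; the class and function names are hypothetical.

```python
def batch_mean(values):
    """Batch method: needs the whole data set at once."""
    return sum(values) / len(values)


class OnlineMean:
    """Online method: keeps a running estimate and sees one example at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        self.count += 1
        # Move the estimate toward the new value by a shrinking step size.
        self.mean += (value - self.mean) / self.count
        return self.mean


VALUES = [2.0, 4.0, 6.0, 8.0]

online = OnlineMean()
for v in VALUES:
    online.update(v)

print(batch_mean(VALUES), online.mean)
```

Both approaches reach the same estimate here, but the online version never stores the examples, which is what makes incremental methods attractive for data streams.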

Machine Learning Algorithms (1/2)
Symbolic Learning (covered on Day 1)
  Version Space Learning
  Case-Based Learning
Neural Learning (covered on Day 2)
  Multilayer Perceptrons (MLPs)
  Self-Organizing Maps (SOMs)
  Support Vector Machines (SVMs)
Evolutionary Learning (very briefly explained on Day 1)
  Evolution Strategies
  Evolutionary Programming
  Genetic Algorithms
  Genetic Programming

Machine Learning Algorithms (2/2)
Probabilistic Learning (covered on Days 3 and 5)
  Bayesian Networks (BNs)
  Helmholtz Machines (HMs)
  Latent Variable Models (LVMs)
  Generative Topographic Mapping (GTM)
Other Machine Learning Methods (partially covered on Days 1 and 4)
  Decision Trees (DTs)
  Reinforcement Learning (RL)
  Boosting Algorithms
  Mixture of Experts (ME)
  Independent Component Analysis (ICA)

Example Applications of ML (1/2)
Banking & Investment
  Credit card fraud
  Delinquent accounts
  Authorization of purchases
  Predict stock market
Health Care
  Disease diagnosis
  Managing resources
  Look for causal relationships between environment and disease
Marketing
  Credit card applications
  Use past buying habits to predict likelihood of customer purchasing some new product
Textual Data Mining

Example Applications of ML (2/2)
Astronomy
Bioinformatics
Chemistry
Human resources: evaluating job performance
Insurance & Finance
Manufacturing: process control
Signal and image processing
Speech recognition
…

Neural Nets for Handwritten Digit Recognition
[Figure: digit images pass through pre-processing into a network with input units, hidden units, and output units for the digits 0, 1, 2, 3, ..., 9; shown once for training and once for test, where the output digit is unknown (?)]

ALVINN System: Neural Network Learning to Steer an Autonomous Vehicle

Learning to Navigate a Vehicle by Observing a Human Expert (1/2)
Inputs
  The images produced by a camera mounted on the vehicle
Outputs
  The actions taken by the human driver to steer the vehicle or adjust its speed.
Result of learning
  A function mapping images to control actions

Learning to Navigate a Vehicle by Observing a Human Expert (2/2)

Data Recorrection by a Hopfield Network
[Figure panels: original target data; corrupted input data; recorrected data after 10 iterations; recorrected data after 20 iterations; fully recorrected data after 35 iterations]
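
The kind of correction pictured on this slide can be sketched with a tiny Hopfield network. The code assumes Hebbian weights, +1/-1 units and asynchronous updates in index order; none of these details are given on the slide, and the 8-unit pattern is made up.

```python
def train_hopfield(patterns):
    """Hebbian weight matrix for +1/-1 patterns, without self-connections."""
    n = len(patterns[0])
    weights = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    weights[i][j] += p[i] * p[j] / n
    return weights


def recall(weights, state, max_sweeps=10):
    """Update units one at a time until a full sweep changes nothing."""
    state = list(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(state)):
            activation = sum(w * s for w, s in zip(weights[i], state))
            new_value = 1 if activation >= 0 else -1
            if new_value != state[i]:
                state[i] = new_value
                changed = True
        if not changed:
            break
    return state


# Hypothetical stored pattern and a copy with two flipped units.
TARGET = [1, -1, 1, -1, 1, 1, -1, -1]
CORRUPTED = [-1, -1, 1, -1, 1, -1, -1, -1]

weights = train_hopfield([TARGET])
restored = recall(weights, CORRUPTED)
print(restored)
```

With a single stored pattern and only two corrupted units, one sweep is enough to restore the target; larger images with several stored patterns typically need more iterations, as the figure panels suggest.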

Predicting the Sunspot Number with Neural Networks

ANN for Face Recognition
A 960 x 3 x 4 network is trained on gray-level images of faces to predict whether a person is looking to their left, right, ahead, or up.
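
To make the "960 x 3 x 4" notation concrete, the sketch below counts the trainable parameters of a fully connected network with 960 input units, 3 hidden units and 4 output units (one per viewing direction). Whether the original network used bias terms is not stated on the slide, so the bias count is an assumption and can be switched off.

```python
def count_parameters(layer_sizes, with_bias=True):
    """Number of weights (and optionally biases) in a fully connected network."""
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out  # one weight per connection
        if with_bias:
            total += fan_out  # assumption: one bias per non-input unit
    return total


LAYERS = [960, 3, 4]  # input units, hidden units, output units

print(count_parameters(LAYERS))
print(count_parameters(LAYERS, with_bias=False))
```

Almost all parameters sit between the input and hidden layers (960 x 3 = 2,880 weights), so the three hidden units act as a narrow bottleneck between the image and the four outputs.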

Data Mining
[Figure: the data mining process. Database/data warehouse -> Selection & Sampling -> Target data -> Preprocessing & Cleaning -> Cleaned data -> Transformation & reduction -> Transformed data -> Data Mining -> Patterns/model -> Interpretation/Evaluation -> Knowledge -> Performance system]

Customer Relationship Management (CRM)
Increased Customer Lifetime Value
Increased Wallet Share
Improved Customer Retention
Segmentation of Customers by Profitability
Segmentation of Customers by Risk of Default
Integrating Data Mining into the Full Marketing Proce

Hot Water Flashing Nozzle with Evolutionary Algorithms
Start
Hot water entering
Steam and droplet at exit
At throat: Mach 1 and onset of flashing
Hans-Paul Schwefel performed the original experiments

Case-Based Reasoning (Aamodt & Plaza, 1994)
[Figure: the case-based reasoning cycle, drawing on the Case Base and General Knowledge. Input: New Problem. 1. Retrieve: Retrieved Cases. 2. Reuse: Retrieved Solution. 3. Revise: Retrieved Solution. 4. Retain: Learned Case. Output.]

Bioinformatics
What is Bioinformatics?
Bioinformatics is a new term referring to the discipline that employs computers to store, retrieve, analyze and assist in understanding biological information.
The application of information technology and computer science to the study of biological systems.
The analysis of the massive (and constantly increasing) amount of genetic information
Sophisticated computer technologies to enable discovery in all fields of life sciences.

Problems in Bioinformatics
Structure analysis
  Protein structure comparison
  Protein structure prediction
  RNA structure modeling
Pathway analysis
  Metabolic pathway
  Regulatory networks
Sequence analysis
  Sequence alignment
  Structure and function prediction
  Gene finding
Expression analysis
  Gene expression analysis
  Gene clustering

Applications of Bioinformatics
Drug design
Identification of genetic risk factors
Gene therapy
Genetic modification of food crops and animals
Forensics
Biological warfare
Personalized Medicine
E-Doctor

Machine Learning and Bioinformatics
[Figure: machine learning turns Bio DB content into knowledge, which feeds Drug Development, Medical therapy research, Pharmacology, and Ecology]

Machine Learning Techniques for Bio Data Mining
Sequence Alignment
  Simulated Annealing
  Genetic Algorithms
Structure and Function Prediction
  Hidden Markov Models
  Multilayer Perceptrons
  Decision Trees
Molecular Clustering and Classification
  Support Vector Machines
  Nearest Neighbor Algorithms
Expression (DNA Chip Data) Analysis
  Self-Organizing Maps
  Bayesian Networks

Structure and Function Prediction
Protein structure prediction
Protein modeling
Gene finding and gene prediction

Effect and Applications of Biological Data Mining
[Figure: Biological Data Mining (store, retrieve, analyze and assist in understanding biological information) at the center, linked to: Biocomputing; Diagnosis with Chip SNP (Single Nucleotide Polymorphism); Customized Drug; Increase and Improvement of Farm Products; Renewable Energy]

Hidden Markov Models for Protein Modeling
Alphabet of 20 letters (the 20 amino acids)
m0: start state, m5: end state, mk: match states
ik: insertion states, dk: deletion states
T(s2|s1): transition probabilities
P(x|mk): letter-generating (emission) probabilities (x: a letter, i.e. an amino acid)
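
A Simple Example of Hidden Markov Models
[Figure: a small HMM between a start state S and an end state E. Probability labels shown: 0.5 (four times), 0.25 (twice), and two four-value sets, (0.25, 0.25, 0.25, 0.25) and (0.1, 0.1, 0.1, 0.7). Example sequence: ATCCTTTTTTTCA]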
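
The probability that an HMM generates a sequence can be computed with the forward algorithm. The sketch below uses a hypothetical two-state model loosely inspired by the numbers on this slide: every start and transition probability is 0.5, one state emits the four letters uniformly, and the other favors T. The exact topology of the original figure, the end state, and the assignment of 0.7 to T are assumptions.

```python
import math


def forward_probability(sequence, start, transition, emission):
    """P(sequence) under the HMM, summed over all hidden state paths."""
    states = list(start)
    # alpha[s] = P(symbols so far, current state = s)
    alpha = {s: start[s] * emission[s][sequence[0]] for s in states}
    for symbol in sequence[1:]:
        alpha = {
            s: emission[s][symbol]
            * sum(alpha[r] * transition[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical model parameters (see the assumptions in the lead-in).
START = {"uniform": 0.5, "t_rich": 0.5}
TRANSITION = {
    "uniform": {"uniform": 0.5, "t_rich": 0.5},
    "t_rich": {"uniform": 0.5, "t_rich": 0.5},
}
EMISSION = {
    "uniform": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "t_rich": {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
}

SEQUENCE = "ATCCTTTTTTTCA"  # example sequence from the slide
probability = forward_probability(SEQUENCE, START, TRANSITION, EMISSION)
print(probability)
```

Because every transition is 0.5 in this toy model, the result factorizes into one term per letter (0.175 for A, C or G and 0.475 for T), which gives an easy sanity check; with unequal transitions the same function still applies but no such shortcut exists.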

Non-negative Matrix Factorization: Clustering Gene Expression Data
[Figure: the gene expression matrix G (38 samples, 7,129 genes) is factorized into W(?) and H(?) with 2 factors; H holds the encoding, with rows H1 and H2, and W spans the genes g1, g2, ..., g7,129]
Factors can capture the correlations between the genes using the values of expression level.
Cluster training samples into 2 groups by NMF
  Assign each sample to the factor (class) which has the higher encoding value.
  Accuracy: 0 to 1 errors on the training data set
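
The clustering rule on this slide can be sketched in a few lines. The code below factorizes a small non-negative matrix with the Lee-Seung multiplicative updates for squared error, then assigns each sample to the factor with the larger encoding value. The update rule, the toy 4-gene by 6-sample matrix, the orientation (genes as rows), and all names are assumptions; the slide does not say which NMF variant was used.

```python
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def squared_error(v, w, h):
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rwh in zip(v, wh) for x, y in zip(rv, rwh))


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Approximate v (genes x samples) by w (genes x rank) times h (rank x samples)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h


def assign_samples(h):
    """Index of the factor with the highest encoding value, per sample."""
    return [max(range(len(h)), key=lambda k: h[k][j]) for j in range(len(h[0]))]


# Hypothetical expression matrix: rows are genes, columns are samples.
V = [
    [5.0, 4.0, 6.0, 0.0, 0.0, 0.0],
    [10.0, 8.0, 12.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 6.0, 5.0, 4.0],
    [0.0, 0.0, 0.0, 18.0, 15.0, 12.0],
]

W, H = nmf(V, rank=2)
labels = assign_samples(H)
print(labels)
```

The multiplicative form keeps every entry of W and H non-negative as long as the initial values are positive, which is why no explicit projection step is needed. Factor indices are arbitrary, so only the grouping of samples is meaningful, not which group is called 0 or 1.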

Bayesian Networks for Gene Expression Analysis
[Figure: Data -> Preprocessing -> Processed data -> Learning algorithm -> Bayesian network over Gene A, Gene B, Gene C, Gene D, and Target]
Learning
Inference
  The values of Gene C and Gene B are given.
  Belief propagation
  Probability for the target is computed.

Multilayer Perceptrons for Gene Finding and Prediction
[Figure: a multilayer perceptron with the inputs Coding potential value, GC Composition, Length, Donor, Acceptor, and Intron vocabulary; the output is an exon score between 0 and 1. A second plot shows the score along the sequence (axis: bases) together with a discrete exon score]

Self-Organizing Maps for DNA Microarray Data Analysis
[Figure: Input connected through a bundle of synaptic connections to a two-dimensional array of postsynaptic neurons, with the winning neurons highlighted]

Biological Information Extraction
[Figure: information extraction pipeline from Text Data to a DB Record (fields such as Location, Date) stored in the DB. Stages shown: Data Analysis & Field Identification; Data Classification & Field Extraction; Field Property Identification & Learning; Information Extraction; Database Template Filling]