Intelligent Data Mining Techniques (tutorial presented at ANNIE´2003)
Iveta Mrázová
Department of Software Engineering, Charles University Prague
Content outline
Intelligent Data Mining: introduction and overview of Intelligent Data Mining Techniques (20 min)
Selected Data Mining Techniques: principles and examples
– undirected DM-techniques: Market Basket Analysis (MBA) (20 min); Link Analysis and Scale-Free Networks (10 min); Automatic Cluster Detection and Fuzzy Systems: Clustering the World Bank Data (20 min)
– directed DM-techniques: Internal Knowledge Representation in BP-Networks (20 min); Modular Networks, Sensitivity Analysis and Feature Selection (20 min); Neural Networks and Decision Trees: Students' Questionnaire (20 min); Genetic Algorithms and BP-networks: Generating Melodies (10 min)
Conclusions, Questions + Answers (10 min)
Intelligent Data Mining: References
M. J. A. Berry, G. Linoff: Data Mining Techniques for Marketing, Sales, and Customer Support, John Wiley & Sons, 1997
M. J. A. Berry, G. Linoff: Mastering Data Mining, John Wiley & Sons, 2000
J. Han, M. Kamber: Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2001
D. Hand, H. Mannila, P. Smyth: Principles of Data Mining, The MIT Press, 2001
D. Pyle: Data Preparation for Data Mining, Morgan Kaufmann Press, 1999
I. H. Witten, E. Frank: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann Publishers, 2000
http://www.mkp.com/datamining
http://www.cs.waikato.ac.nz/ml/weka
What is Data Mining?
Discovering patterns in data:
– the discovered patterns should be meaningful
– they should lead to some advantage, e.g. economic
– they allow making non-trivial predictions on new data
The data is present in substantial quantities; the process is automatic or semi-automatic.
Two extremes for the form of discovered patterns:
– black box - e.g. neural networks
– transparent box - more structured, captures the decision structure in an explicit way
Building models for the data
– Classification model: assigns an existing classification to new records
– Predictive model (e.g. a time-series model)
– Clustering model
(Diagram: INPUTS → Model → Output, with a confidence level)
Data Analysis: Influence of other disciplines
– statistics: interpret observations
– sampling: reduce the size of data
– regression analysis (linear regression): inter- and extrapolate observations (fit a line to observed data)
– correlation analysis: mutual occurrence of observations
– memory-based reasoning: directly from AI
– link analysis: graph theory
– genetic algorithms and neural networks: model biological processes
Intelligent DM-Techniques: an overview
– Market Basket Analysis (MBA)
– Memory-Based Reasoning (MBR)
– Automatic Cluster Detection
– Fuzzy Systems (FS)
– Link Analysis
– Decision Trees
– Artificial Neural Networks (ANN)
– Genetic Algorithms (GA)
Market Basket Analysis (MBA)
Analyses in the retail industry: What items occur together in a "basket"?
Results:
– expressed as rules
– highly actionable
Applications:
– planning store layouts
– offering coupons, limiting specials
– bundling products
Memory-Based Reasoning (MBR)
Look for the nearest "known" neighbor to classify or predict a value!
– applicable to virtually any data
– new instances are learned by adding them to the data set
– the distance to the neighbors estimates the correctness of the results
Key elements in MBR (see the sketch below):
– distance function - to find the nearest neighbors
– combination function - combine the values at the nearest neighbors to classify or predict
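To make the two key elements concrete, here is a minimal nearest-neighbor sketch in Python; the Euclidean distance, the majority-vote combination and all names are our own illustrative choices, not code from the tutorial:

```python
import math
from collections import Counter

def euclidean(a, b):
    # distance function: how "near" two records are
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mbr_classify(known, query, k=3):
    """Classify `query` from `known` = [(features, label), ...] by
    combining the labels of its k nearest neighbors (majority vote)."""
    neighbors = sorted(known, key=lambda rec: euclidean(rec[0], query))[:k]
    votes = Counter(label for _, label in neighbors)   # combination function
    return votes.most_common(1)[0][0]

# adding a new labeled instance to `known` is all the "learning" MBR needs
known = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.1), "B"), ((4.8, 5.3), "B")]
print(mbr_classify(known, (1.1, 0.9)))   # -> "A"
```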
Link Analysis
Goals:
– find patterns in relationships between records
– visualize the links
Application areas:
– telecommunications
– law enforcement - clues about crimes are linked together to solve them
– marketing - relationships between customers
Automatic Cluster Detection
Goal: Find previously unknown similarities in the data!
– Build models that find data records similar to each other
– Good as an initial analysis of the data
– Undirected data mining
Decision Trees and Rule Induction
Divide the data into disjoint subsets characterized by simple rules!
– Directed data mining (classification)
– Explainable rules applicable directly to new records
Techniques:
– Classification And Regression Trees (CART)
– Chi-squared Automatic Interaction Detection (CHAID)
– C4.5
Artificial Neural Networks (ANN)
Detect patterns in the data in a way "similar" to human thinking!
– Directed data mining (classification and prediction)
– Applicable also to undirected data mining (SOMs)
Two major drawbacks:
– difficulty in understanding the models they produce
– sensitivity to the format of incoming data
Genetic Algorithms (GA)
Apply genetics and natural selection to find optimal parameters of a predictive function!
GAs use "genetic" operators to evolve successive generations of solutions:
– selection
– crossover
– mutation
The best candidates "survive" to further generations until convergence is achieved.
– Directed data mining
On-Line Analytic Processing (OLAP)
– an important tool for extracting and presenting information
– facilitates understanding of the data and the important patterns inside it
– a way of presenting relational data to users
– multi-dimensional databases (MDDs): a representation of data that allows users to drill down into the data and understand various important summarizations
Market Basket Analysis (MBA)
Analyses in the retail industry: What items occur together in a "basket"?
Results:
– expressed as rules
– highly actionable
Applications:
– planning store layouts
– offering coupons, limiting specials
– bundling products
Association rules
How do the products relate to each other?
Association rules should be:
– easy to understand: once the pattern is found, it is easy to justify it
– useful: contain actionable information leading to interventions
Association rules should not be:
– trivial: the results are already known by anyone familiar with the business
– inexplicable: they seem to have no explanation and do not suggest any action
MBA to compare stores
Virtual items:
– specify which group the transaction comes from
– do not correspond to a product or service
Comparison between new and existing stores:
1. Gather data for a specific period from store openings
2. Gather about the same amount of data from existing stores
3. Apply MBA to find association rules in each set
4. Consider especially the association rules containing the virtual items
MBA - how does it work?
– Items - products or service offerings
– Transactions contain one or more items
– Co-occurrence table:
  – indicates the number of times that any two items co-occur in a transaction (i.e. these products were purchased together)
  – values along the diagonal represent the number of transactions containing just that one item
MBA - example
Grocery transactions:

Customer  Items
1         bread, butter
2         milk, bread, butter
3         bread, coffee
4         bread, butter, coffee
5         coffee, butter

Co-occurrence of products:

          bread  butter  milk  coffee
bread       4      3      1      2
butter      3      4      1      2
milk        1      1      1      0
coffee      2      2      0      3

Sales patterns apparent from the co-occurrence table: milk is never purchased with coffee; bread and butter are likely to be purchased together. (A sketch computing such a table follows.)
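A minimal sketch of how such a co-occurrence table can be computed from the five baskets above (the variable names are ours):

```python
from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"bread", "coffee"},
    {"bread", "butter", "coffee"},
    {"coffee", "butter"},
]

cooc = Counter()
for t in transactions:
    for item in t:                              # diagonal entries
        cooc[(item, item)] += 1
    for a, b in combinations(sorted(t), 2):     # items purchased together
        cooc[(a, b)] += 1
        cooc[(b, a)] += 1

print(cooc[("bread", "butter")])   # -> 3
print(cooc[("milk", "coffee")])    # -> 0
```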
MBA - Association rules
Rule: IF Condition THEN Result. (Rule_r: IF Item_i THEN Item_j.)
Questions:
– How good are the found association rules? (support, confidence, improvement)
– How can association rules be found automatically?
Support and confidence
Support: How frequently can the rule be applied?
Support(Rule_r) = Nr_of_Transactions_containing_i_and_j / Number_of_all_Transactions · 100 %
Confidence: How much can we rely on the result of the rule?
Confidence(Rule_r) = Nr_of_Transactions_containing_i_and_j / Nr_of_Transactions_containing_i · 100 %
Support and confidence - example
Rule 1: If a customer purchases bread, then the customer also purchases butter.
Rule 2: If a customer purchases coffee, then the customer also purchases butter.
Support(Rule_1) = 3 / 5 · 100 % = 60 %
Confidence(Rule_1) = 3 / 4 · 100 % = 75 %
Support(Rule_2) = 2 / 5 · 100 % = 40 %
Confidence(Rule_2) = 2 / 3 · 100 % ≈ 67 %
Improvement of a rule
Improvement: How much better is a rule at predicting the result than just assuming it?
Improvement(Rule_r) = p(i_and_j) / ( p(i) · p(j) )
If Improvement < 1:
– the rule is worse at predicting the result than random chance
– NEGATING the result might produce a better rule: IF Condition THEN NOT Result.
Improvement of a rule - example
Rule: If a customer purchases milk, then the customer also purchases butter.
Support(Rule) = 1 / 5 · 100 % = 20 %
Confidence(Rule) = 1 / 1 · 100 % = 100 %
Improvement(Rule) = ( 1 / 5 ) / ( ( 1 / 5 ) · ( 4 / 5 ) ) = 5 / 4 = 1.25
(A sketch computing all three measures follows.)
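All three measures can be computed directly from the transactions; a sketch (names are ours) that reproduces the numbers above:

```python
def rule_metrics(transactions, i, j):
    """Support, confidence and improvement of the rule IF i THEN j."""
    n = len(transactions)
    n_i = sum(1 for t in transactions if i in t)
    n_j = sum(1 for t in transactions if j in t)
    n_ij = sum(1 for t in transactions if i in t and j in t)
    support = n_ij / n
    confidence = n_ij / n_i
    improvement = support / ((n_i / n) * (n_j / n))
    return support, confidence, improvement

transactions = [
    {"bread", "butter"}, {"milk", "bread", "butter"}, {"bread", "coffee"},
    {"bread", "butter", "coffee"}, {"coffee", "butter"},
]
print(rule_metrics(transactions, "milk", "butter"))   # -> (0.2, 1.0, 1.25)
```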
Basic steps of MBA
1. Choose the right set of items and the right level
2. Generate rules by deciphering the co-occurrence matrix:
– calculate the probabilities and joint probabilities of items and their combinations in transactions
– limit the search with thresholds set on support
3. Analyze the probabilities to determine the best rules:
– overcome the limits imposed by the number of items and their combinations in "interesting" transactions
(A sketch of the generate-and-filter loop follows.)
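A sketch of that generate-and-filter loop under illustrative thresholds (names and default values are our assumptions):

```python
from itertools import permutations

def generate_rules(transactions, min_support=0.2, min_improvement=1.0):
    """Enumerate the rules IF i THEN j and keep those above the thresholds."""
    items = set().union(*transactions)
    n = len(transactions)
    rules = []
    for i, j in permutations(items, 2):
        n_i = sum(1 for t in transactions if i in t)
        n_j = sum(1 for t in transactions if j in t)
        n_ij = sum(1 for t in transactions if i in t and j in t)
        support = n_ij / n
        improvement = support / ((n_i / n) * (n_j / n)) if n_ij else 0.0
        if support >= min_support and improvement > min_improvement:
            rules.append((i, j, support, n_ij / n_i, improvement))
    return rules
```

On the five grocery baskets above, these thresholds keep only milk → bread and milk → butter (improvement 1.25); raising min_support to 0.3 prunes even those, which illustrates the minimum-support pruning discussed below.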
MBA - the choice of the right items
Gathering transaction data:
– often of bad quality, requiring extensive pre-processing
– the items of interest may change over time
The right level of detail:
– a growing number of item combinations
– actionable results (specific items)
– rules with sufficient support (frequent occurrence in the data set)
Taxonomies: hierarchical categories
MBA - complexity of generated rules:
– use more general items initially
– then generate rules for more specific items, using only the transactions containing these items
MBA - actionable results:
Items should occur in roughly the same number of transactions:
– roll up rare items to higher levels in the taxonomy (to become more frequent)
– keep more common items at lower levels (to prevent rules from being dominated by the most common items)
Virtual items: go beyond the taxonomy
– cross product boundaries of the original items (e.g. designer labels - Calvin Klein)
– may include information about the transaction itself:
  – anonymous (day of week, time, etc.)
  – signed (info about customers and their behavior over time)
– might be a cause of redundant rules:
  – items from the taxonomy are associated with just one virtual item ("If Coke product then Coke.")
  – virtual and generalized items appear together in a rule ("If Coke product and diet soda then pretzels" instead of "If diet coke then pretzels")
MBA - generating rules
Compute the co-occurrence table:
– provides the information about which combinations of items occur most commonly in the transactions
– applicable for evaluating the basic probabilities necessary to evaluate the importance of generated rules
Provide useful rules:
– the improvement should be greater than 1
– a low improvement can be increased by negating the rules; negated rules might be less actionable than the original rules
– reduce the number of generated rules - PRUNING
Minimum support pruning
Eliminate less frequent items:
– actions should affect enough transactions
– two possibilities:
  – eliminate rare items from consideration (then eliminate their respective association rules)
  – use the taxonomy to generalize items (then the resulting generalized items should meet the threshold criteria)
– variable minimum support - a cascading effect
MBA - Dissociation rules
Rule: IF A AND NOT B THEN C.
– Introduce new items inverse to the original ones
– Each transaction will contain an inverse item if it does not contain the original one
Drawbacks:
– doubled number of items
– growing size of transactions
– inverse items tend to occur more frequently than the original ones (leading to less actionable rules with all items inverted: "IF NOT A AND NOT B THEN NOT C.")
Time-series analysis with MBA
Analyze causes and effects:
– time- or sequencing information to determine when transactions occurred relative to each other
– usually requires some way of identifying the customer
Conversions to an MBA-problem:
– include in transactions the items before the event of interest (for causes) or after the event of interest (for effects); then remove the duplicate items from the transaction
– time-window: a "snapshot" of all items that occur within a certain period (e.g. all transactions within a month)
– trends for rare items
Strengths of MBA
Produces clear and understandable results:
– actionable IF-THEN rules
Supports undirected data mining:
– important when approaching large data sets with no prior knowledge
Works on variable-length data.
Computations are easy to understand:
– computational costs grow exponentially with the number of items!
Weaknesses of MBA
Exponentially growing computational costs:
– necessity for item taxonomies and virtual items
Limited support for attributes of the data:
– pruning of less actionable general items
Difficult to determine the right number of items:
– items should have approximately the same frequency
Discounts rare items:
– variable thresholds for minimum support pruning
– higher levels in item taxonomies
Link Analysis
Goals:
– find patterns in relationships between records
– visualize the links
Application areas:
– telecommunications
– law enforcement - clues about crimes are linked together to solve them
– marketing - relationships between customers
Scale-Free Networks
– Some nodes have an extremely large number of links (edges) to other nodes - hubs
– Most nodes have just a few links to other nodes
– Robust against accidental failures
– Vulnerable to coordinated attacks
New application areas:
– preventing computer viruses from spreading through the Internet
– medicine (vaccinations)
– business (marketing)
Scale-Free Networks
(Figure: distributions of edges - number of nodes vs. number of edges - for a random graph and a scale-free network; adapted from A. L. Barabasi and E. Bonabeau: Scale-Free Networks, Scientific American, May 2003)
Examples of Scale-Free Networks
Social networks:
– research collaboration (scientists, co-authorship of papers)
– Hollywood (actors, appearance in the same movie)
Biological networks:
– cellular metabolism (molecules involved in energy production, participation in the same biological reaction)
– protein regulatory network (proteins controlling cell activity, interactions among proteins)
Socio-technical networks:
– Internet (routers, optical or other connections)
– World Wide Web (Web-pages and URLs)
Scale-Free Networks: basic characteristics
Two basic mechanisms (see the sketch below):
– growth
– preferential attachment
"The rich get richer" (hubs):
– new nodes tend to connect to the more connected sites
– "popular locations" acquire more links over time than their less connected neighbors
Reliability:
– accidental failures (80% of randomly selected nodes can fail without fragmenting the cluster)
– coordinated attacks (eliminating 5-15% of all hubs can crash the system)
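Both mechanisms fit in a few lines; a sketch of preferential-attachment growth in the style of the Barabási-Albert model (the parameters and names are ours):

```python
import random
from collections import Counter

def grow_scale_free(n_nodes, m=2, seed=0):
    """Grow a network in which each new node attaches m edges, preferring
    already well-connected nodes ("the rich get richer")."""
    rng = random.Random(seed)
    edges = [(0, 1), (1, 2), (2, 0)]            # small seed network
    # every endpoint occurrence is one "ticket": drawing a ticket uniformly
    # picks a node with probability proportional to its degree
    tickets = [v for e in edges for v in e]
    for new in range(3, n_nodes):
        targets = set()
        while len(targets) < m:                 # m distinct preferential targets
            targets.add(rng.choice(tickets))
        for t in targets:
            edges.append((new, t))
            tickets += [new, t]
    return edges

deg = Counter(v for e in grow_scale_free(1000) for v in e)
print(max(deg.values()), min(deg.values()))     # a few hubs, many small nodes
```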
Scale-Free Networks
(Figure: before/after views of a random network under an accidental node failure, a scale-free network under an accidental node failure, and a scale-free network under an attack on its hubs; adapted from A. L. Barabasi and E. Bonabeau: Scale-Free Networks, Scientific American, May 2003)
Implications of Scale-Free Networks
Computing:
– networks with scale-free architectures
Medicine:
– vaccination campaigns and new drugs
Business:
– cascading financial failures
– marketing
Implications of Scale-Free Networks
Computing: computer networks with scale-free architectures (e.g. WWW)
– highly resistant to accidental failures
– very vulnerable to deliberate attacks and sabotage
– eradicating viruses from the Internet will be effectively impossible
Implications of Scale-Free Networks
Medicine:
– vaccination campaigns against serious viruses focused on hubs: people with many connections to others; it is difficult to identify such people
– new drugs targeting the hub molecules involved in certain diseases
– control the side-effects of drugs with maps of the networks within cells
Implications of Scale-Free Networks
Business:
– financial failures: understand how companies, industries and economies are inter-linked; monitor and avoid cascading financial failures
– marketing: study the spread of a contagion on a scale-free network; more efficient ways of propagating consumer buzz about new products
Automatic Cluster Detection
Goal: Find previously unknown similarities in the data!
– Build models that find data records similar to each other
– Good as an initial analysis of the data
– Undirected data mining
Mining the World Bank Data: the Fuzzy c-means Clustering Approach
with Cihan H. Dagli, Engineering Management Department, University of Missouri - Rolla
(Figure: economies grouped according to their results - purchasing power parity, gross national product, and GDP growth rates)
FCM-clustering: introduction
World Development Indicators (WDI):
– published annually by the World Bank
– reflect the development process in the countries
– incomplete and imprecise data
Previously applied techniques:
– regression analysis - linear relationships
– US-based grouping of countries (G. Ip, Wall Street Journal)
– GDP-based grouping of economies (World Bank)
– self-organizing feature maps (T. Kohonen, S. Kaski, G. Deboeck)
Poverty maps - T. Kohonen
• more neurons than countries
• only local geometric relations are important
• countries mapped close to each other have a similar state of development
(adapted from T. Kohonen: Self-Organizing Maps, 3rd Edition, Springer-Verlag, 2001)
Poverty maps - T. Kohonen, S. Kaski
U-matrix:
– illustrates the "boundaries" between clusters
– represents the average distances between neighboring neurons in a gray scale: a small average distance ⇒ a light shade, a large average distance ⇒ a dark shade
(adapted from T. Kohonen: Self-Organizing Maps, 3rd Edition, Springer-Verlag, 2001)
Our goal
– Cluster imprecise data efficiently → fuzzy c-means clustering (FCM)
– Estimate the number of clusters → cluster validity indicators
– Visualize the results → spreadsheet-like form
– Interpret the results → find "landmarks"
The objective function
corresponds to the weighted distance between the input patterns and the cluster centers:

$J_s(U, \mathbf{v}) = \sum_{p=1}^{P} \sum_{i=1}^{c} (u_{ip})^s \sum_{j=1}^{n} (x_{pj} - v_{ij})^2$

where $u_{ip}$ is the membership degree, $s$ the fuzziness parameter, $\vec{x}_p$ the input patterns and $\vec{v}_i$ the cluster centers.
– membership degrees between 0 and 1: $0 \le u_{ip} \le 1$
– the total membership of a pattern equals 1: $\forall p: \sum_{i=1}^{c} u_{ip} = 1$
– no empty or full clusters: $\forall i: 0 < \sum_{p=1}^{P} u_{ip} < P$
Fuzzy c-means Clustering (FCM)
Step 1: Initialize $c$, $s$, $\varepsilon$ and $t$. Choose $U^{(0)}$ randomly.
Step 2: Determine the new fuzzy cluster centers:
$\vec{v}_i^{(t)} = \sum_{p} (u_{ip}^{(t)})^s \, \vec{x}_p \, / \, \sum_{p} (u_{ip}^{(t)})^s$
Step 3: Calculate the new partition matrix $U^{(t+1)}$:
$u_{ip}^{(t+1)} = \dfrac{\left( 1 / \|\vec{x}_p - \vec{v}_i^{(t)}\|^2 \right)^{1/(s-1)}}{\sum_{k=1}^{c} \left( 1 / \|\vec{x}_p - \vec{v}_k^{(t)}\|^2 \right)^{1/(s-1)}}$
Step 4: Evaluate $\Delta = \|U^{(t+1)} - U^{(t)}\| = \max_{i,p} |u_{ip}^{(t+1)} - u_{ip}^{(t)}|$.
If $\Delta > \varepsilon$ then set $t = t + 1$ and go to Step 2. If $\Delta \le \varepsilon$ then Stop.
END of FCM
(A numpy sketch of these steps follows.)
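The four steps translate almost directly into numpy; a compact sketch under our own naming (not the original experimental code):

```python
import numpy as np

def fcm(X, c, s=1.4, eps=0.05, seed=0):
    """Fuzzy c-means. X: (P, n) patterns; returns memberships U (c, P)
    and cluster centers v (c, n)."""
    rng = np.random.default_rng(seed)
    U = rng.random((c, X.shape[0]))
    U /= U.sum(axis=0)                                   # Step 1: random U(0)
    while True:
        Us = U ** s
        v = (Us @ X) / Us.sum(axis=1, keepdims=True)     # Step 2: new centers
        d2 = ((X[None, :, :] - v[:, None, :]) ** 2).sum(-1)
        inv = (1.0 / np.maximum(d2, 1e-12)) ** (1.0 / (s - 1.0))
        U_new = inv / inv.sum(axis=0)                    # Step 3: new partition
        if np.max(np.abs(U_new - U)) <= eps:             # Step 4: Delta <= eps
            return U_new, v
        U = U_new
```

U[i, p] is the degree to which pattern p belongs to cluster i; a hard assignment, when needed, is U.argmax(axis=0).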
Iveta Mrázová, ANNIE´03 56
Cluster validity criteria
Partition coefficient:
$F(U; c) = \frac{1}{P} \sum_{p=1}^{P} \sum_{i=1}^{c} u_{ip}^2$
Partition entropy:
$H(U; c) = -\frac{1}{P} \sum_{p=1}^{P} \sum_{i=1}^{c} u_{ip} \ln(u_{ip})$, with $u_{ip} \ln(u_{ip}) = 0$ for $u_{ip} = 0$
Windham's proportion exponent:
$W(U; c) = -\ln \left( \prod_{p=1}^{P} \sum_{j=1}^{\lfloor \mu_p^{-1} \rfloor} (-1)^{j+1} \binom{c}{j} \left( 1 - j \mu_p \right)^{c-1} \right)$, where $\mu_p = \max \{ u_{ip} ; \; 1 \le i \le c \}$
(the $u_{ip}$ are the membership degrees of the patterns in the clusters)
How many clusters?
Partition coefficient: $\max_{2 \le c \le P-1} \{ \max_U [ F(U; c) ] \}$
Partition entropy: $\min_{2 \le c \le P-1} \{ \min_U [ H(U; c) ] \}$
Windham's proportion exponent: $\max_{2 \le c \le P-1} \{ \max_U [ W(U; c) ] \}$
(the optimum is taken over the partitions $U$ and the number of clusters $c$)
Supporting experiments - artificial data
(Figures: cluster validity indicators for artificial data and a fuzzy 4-partition of the data; 21 input patterns, s = 1.4, ε = 0.05; '×' indicates the cluster centers, patterns from the same cluster are labeled identically)
(Figures: fuzzy 6-partition and fuzzy 8-partition of the data; '×' indicates the cluster centers, patterns from the same cluster are labeled identically)
Interpret the results!
Characteristic features for detected clusters:
– cluster centers - "fictive" patterns outside the data set
– "calibrate" the clusters with the "most representative" patterns from the data set - based on just one pattern → fuzzy c-landmarks
Determine the outstanding properties of clusters:
– compared to other properties within the cluster
– compared to the properties of other clusters
– exception: "border areas"
Automatic landmark selection
Fuzzy c-landmark $\vec{x}_{j^*}$ for cluster $i^*$:
– its "fuzzy distance" from the cluster center should be small
– its "fuzzy distance" from all other cluster centers should be large

$j^* = \arg\min_{1 \le j \le n} \dfrac{\sum_{p=1}^{P} u_{i^*p}^s \, \|\vec{x}_j - \vec{v}_{i^*}\|^2 \, / \, \sum_{p=1}^{P} u_{i^*p}^s}{\min_{1 \le i \le c, \, i \ne i^*} \left( \sum_{p=1}^{P} u_{ip}^s \, \|\vec{x}_j - \vec{v}_i\|^2 \, / \, \sum_{p=1}^{P} u_{ip}^s \right)}$

($u$: membership degrees, $\vec{v}$: cluster centers, $\vec{x}$: input patterns, $i$: clusters, $j$: inputs)
Supporting experiments: The World Bank Data
– 99 state economies with the 16 (latest) indicators for each country
– the economic and social potential of countries and their citizens
– all indicators are relative to population
– element-wise transformation to (0,1) with
$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$ and $x'' = \frac{1}{1 + e^{-k(x' - 1/2)}}$,
where $x_{\max}$ and $x_{\min}$ are the maximum and minimum over all patterns (a sketch follows)
– the choice of the other parameters: k = 4, s = 1.4, ε = 0.05
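A sketch of this two-step squashing applied column-wise to an indicator matrix (the names are ours):

```python
import numpy as np

def squash(X, k=4.0):
    """Map every indicator column into (0, 1): min-max scaling
    followed by a logistic centered at 1/2."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)   # extremes over all patterns
    x_prime = (X - x_min) / (x_max - x_min)       # x' in [0, 1]
    return 1.0 / (1.0 + np.exp(-k * (x_prime - 0.5)))   # x'' in (0, 1)
```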
Used Development Indicators
– GNP per capita
– Purchasing Power Parity
– Growth rate of GDP per capita
– GDP implicit deflator
– External debt (% of GNP)
– Total debt service (% of exports of goods and services)
– High technology exports (% of manufactured exports)
– Military expenditures (% of GNP)
– Expenditures for R&D (% of GNP)
– Total expenditures on health (% of GDP)
– Public expenditures on education (% of GNP)
– Male life expectancy at birth
– Fertility rates
– GINI index (distribution of income/consumption)
– Internet hosts per 10,000 people
– Mobile phones per 1,000 people
Supporting experiments: the World Bank data
(Figures: cluster validity indicators for the WB data; 99 countries with 16 indicators; s = 1.1, ε = 0.05 and s = 1.4, ε = 0.05)
Fuzzy 7-partition of the WB data
(Figure: a part of the fuzzy 7-partition of the World Bank data; 99 countries with 16 indicators; s = 1.4, ε = 0.05)
Landmarks for the WB data
(Figure: "representative patterns" and fuzzy 7-landmarks for the World Bank data; 99 countries with 16 indicators; s = 1.4, ε = 0.05)
FCM-Clustering: conclusions
FCM-clustering:
– efficiency and cluster validity
– choice of the fuzziness parameter
– grouping of country economies (World Bank, Ip, Kohonen, Deboeck)
Visualization:
– membership degree
– topological relationships
Landmarks and interpretation of the results:
– formulation of "class-discriminating" criteria
From FCM towards Fuzzy Systems
Rule extraction: Characteristics → Economical_results
– fuzzy inference systems
– (feed-forward) neural networks: back-propagation, RBF-networks
Neuro-fuzzy systems with adaptive inputs:
– detection of significant input patterns
– influence of the internal knowledge representation
– speed-up of the training and recall process
Artificial Neural Networks (ANN)
Detect patterns in the data in a way "similar" to human thinking!
– Directed data mining (classification and prediction)
– Applicable also to undirected data mining (SOMs)
Two major drawbacks:
– difficulty in understanding the models they produce
– sensitivity to the format of incoming data

Back-Propagation and GREN-networks
Introduction
Multi-layer feed-forward networks (BP-networks):
– one of the most often used models
– relatively simple training algorithm
– relatively good results
Limits of the considered model:
– the speed of the training process
– convergence and local minima
– generalization and "over-training"
→ additional demands on the desired network behavior
Back-Propagation training algorithm
The error function corresponds to the difference between the actual and the desired network output:
$E = \frac{1}{2} \sum_{p} \sum_{j} \left( y_{j,p} - d_{j,p} \right)^2$
where $y_{j,p}$ is the actual output and $d_{j,p}$ the desired output of output neuron $j$ for pattern $p$. The goal of the training process is to minimize this difference on the given training set.
The Back-Propagation training algorithm
– computes the actual output for a given training pattern
– compares the desired and the actual output
– adapts the weights and the thresholds: against the gradient of the error function, backwards from the output layer towards the input layer (see the sketch below)
(Figure: a layered feed-forward network from INPUT to OUTPUT)
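A minimal one-hidden-layer illustration of these steps with sigmoid units (our own notation; biases are omitted for brevity, and this is not the network used in the experiments):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bp_epoch(X, D, W1, W2, alpha=0.5):
    """One gradient step on E = 1/2 * sum((y - d)^2); X: (P, n) inputs,
    D: (P, m) desired outputs, W1: (n, h), W2: (h, m)."""
    H = sigmoid(X @ W1)                            # forward: hidden activities
    Y = sigmoid(H @ W2)                            # forward: actual outputs
    delta_out = (Y - D) * Y * (1 - Y)              # error terms at the output
    delta_hid = (delta_out @ W2.T) * H * (1 - H)   # back-propagated error terms
    W2 -= alpha * H.T @ delta_out                  # adapt against the gradient
    W1 -= alpha * X.T @ delta_hid
    return 0.5 * np.sum((Y - D) ** 2)              # current value of E
```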
Iveta Mrázová, ANNIE´03 74
Drawbacks of the standard BP-model
The error function:
– correspondence to the desired behavior
– the form of the training set: requires the knowledge of the desired network outputs; better performance for "larger" and "well-balanced" training sets
Generalization abilities:
– ability to interpret and evaluate the "gained" experience
– retraining for modified and/or developing task domains
Desired properties of trained networks
– Robustness against small deviations of those input patterns lying "close to the separating hyper-plane"
– A transparent network structure with a suitable internal knowledge representation
– A possible reuse of already trained networks under changed conditions
Condensed internal representation
Interpret the activity of hidden neurons:
– 1 active: YES
– 0 passive: NO
– 1/2 silent: "no decision possible"
Goals:
– "clear" the inner network structure
– detect superfluous neurons and prune them
(Figure: a layered network between INPUT and OUTPUT with hidden activities 1, 0 and 1/2)
How to force the condensed internal representation?
Formulate "the desired properties" in the form of an objective function combining the standard error function $E$ with a representation error function $F$:
$G = E + c_F \, F$
The local minima of the representation error function correspond to the active, passive and silent states:
$F = \sum_{h} \sum_{p} y_{h,p}^s \left( 1 - y_{h,p} \right)^s \left( y_{h,p} - 0.5 \right)^2$
where $h$ runs over the hidden neurons and $p$ over the patterns.
(Figures: the influence of F on G and the shape of F)
Influence of parameters
– slower forcing of the internal representation and the desired network function
– stability of the forced internal representation and an optimal network architecture
– the shape of the representation error function, the speed of the representation-forcing process and its form
– the time-overhead of the weight adjustment
$w_{ij}(t+1) = w_{ij}(t) + \alpha \, \delta_j \, y_i + \alpha_r \, \rho_j \, y_i + \alpha_m \left( w_{ij}(t) - w_{ij}(t-1) \right)$

Shape of the representation function:
$F(y) = y^s (1 - y)^s (y - 0.5)^t$
(Figures: the representation error vs. the actual neuron output for s = 1, t = 2; s = 8, t = 2; s = 4, t = 2; s = 5, t = 4; a sketch evaluating F follows)
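The shape of the forcing term is easy to inspect numerically; a tiny sketch evaluating F for the (s, t) pairs above:

```python
import numpy as np

def F(y, s, t):
    # representation error: zero at the passive (0), silent (0.5) and active (1) states
    return (y ** s) * ((1 - y) ** s) * ((y - 0.5) ** t)

y = np.linspace(0.0, 1.0, 101)
for s, t in [(1, 2), (4, 2), (8, 2), (5, 4)]:
    # larger s and t flatten the curve around the three preferred states
    print(s, t, float(F(y, s, t).max()))
```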
Further modifications of the representation function
Discrete internal representation ($S$ allowed output values $r_1, \ldots, r_S$ for the neurons from the last hidden layer):
$F = \sum_{j} \sum_{p} \left( y_{j,p} - r_1 \right)^2 \cdots \left( y_{j,p} - r_S \right)^2 = \sum_{j} \sum_{p} \prod_{s=1}^{S} \left( y_{j,p} - r_s \right)^2$
Condensed internal representation for all hidden layers $l$:
$F = \sum_{l} \sum_{j} \sum_{p} \left( y_{j,p}^{l} \right)^s \left( 1 - y_{j,p}^{l} \right)^s \left( y_{j,p}^{l} - 0.5 \right)^2$
Unambiguous internal representation
Patterns with highly different outputs should form highly different internal representations.
Formulate the requirements as a modified objective function: $G = E + F + H$
Ambiguity criterion for the internal representation:
$H = -\frac{1}{2} \sum_{p} \sum_{q \ne p} \sum_{j} \sum_{o} \left( d_{o,p} - d_{o,q} \right)^2 \left( y_{j,p} - y_{j,q} \right)^2$
where $p, q$ run over the patterns, $j$ over the hidden neurons and $o$ over the output neurons (the desired outputs $d_{o,p}$ are constant for a given $p$).
Modular structure of BP-networks
– Decompose the task into the particular subtasks
– Propose and form the modular architecture: a strategy for extracting ε-equivalent BP-modules
  – elimination of superfluous hidden and/or input neurons
  – suitable for "already trained" networks
  – a compromise between the desired accuracy of the extracted module and its optimal architecture
– Communication between the particular modules: serial and parallel composition of BP-networks
Extracting BP-modules - allowed potential deviations
– The potential change $\delta_r^-(\xi)$ is in this case smaller than the potential change $\delta_r^+(\xi)$
– The potential should change "towards the separating hyper-plane"
– The changed potential should preserve the location of the input pattern in the same half-space
– The allowed potential changes should be independent of each particular input pattern (from S)
(Figure: the transfer function $f(\xi)$ with the tolerance band $f(\xi) \pm \varepsilon_r$ for $\varepsilon_r = 0.2$, and the allowed potential deviations $\delta_r^+(\xi)$ and $\delta_r^-(\xi)$)
Notes on the construction of an ε-equivalent network
– possible improvements of the network properties: an "egalitarian" versus a "differentiated" approach
– the relationship of the construction to "more robust" networks: the necessary knowledge of the ε_r-boundary regions; preserve the created internal representation
(Figure: a layered network between INPUT and OUTPUT)
Desired properties of "experts" for training (modular) BP-networks
An "expert" should:
– evaluate the error connected with the actual response of a BP-network
– "explain" to the BP-network its error during training
– not require the knowledge of the desired network output, but recognize a correct behavior
– "suggest" a "better" behavior
GREN-networks: Generalized relief error networks
– assign the error to the pairs [input pattern, actual output]
– trained e.g. by the standard BP-training algorithm
– should have good approximation and generalization abilities
– "approximate" the error function by:
$E = \sum_{p} \sum_{e} e_{e,p}^{GR_B}$
where $e$ runs over the output neurons of the GREN-network (its output values) and $p$ over the patterns.
(Diagram: INPUT PATTERN and ACTUAL OUTPUT → GREN-network → ERROR VALUES)
A modular system for training BP-networks with GREN-networks
(Diagram: the input pattern feeds the adapted BP-network; the input pattern together with the BP-network's actual output feeds the GREN-network, which produces the error values)
Training with a GREN-network
– Applies the basic idea of Back-Propagation
– How to determine the error terms at the output of the trained BP-network? Use the error terms back-propagated from the GREN-network (see the sketch below)
– Weight adjustment rules similar to the standard Back-Propagation
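The mechanism can be illustrated generically: treat the trained GREN-network as a frozen error model over [input pattern, actual output] and back-propagate its output error down to the inputs that carry the BP-network's output. The following sketch is only our illustration of that idea for a one-hidden-layer GREN-network with sigmoid units, not the original implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gren_error_terms(x, y_B, V1, V2):
    """Frozen GREN-network GR_B: input = [pattern x, actual output y_B],
    output = error values e, with E = sum(e). Returns dE/dy_B, obtained by
    back-propagating through GR_B down to the y_B part of its input."""
    z = np.concatenate([x, y_B])
    H = sigmoid(V1 @ z)                    # GREN hidden activities
    e = sigmoid(V2 @ H)                    # GREN error values
    delta_out = e * (1 - e)                # dE/d(potential), since dE/de = 1
    delta_hid = (V2.T @ delta_out) * H * (1 - H)
    dE_dz = V1.T @ delta_hid               # gradient w.r.t. all GREN inputs
    return dE_dz[len(x):]                  # the part feeding back to y_B
```

These dE/dy_B terms then stand in for the usual (y - d) factor at the output layer of the trained BP-network, so no desired outputs are required.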
Training with a GREN-network
Applies the basic idea of Back-Propagation. How to determine $\partial E / \partial y_j^B$ at the output layer of the BP-network $B$?
$\Delta w_{ij}^B = -\frac{\partial E}{\partial w_{ij}^B} = -\frac{\partial E}{\partial y_j^B} \, \frac{\partial y_j^B}{\partial \xi_j^B} \, \frac{\partial \xi_j^B}{\partial w_{ij}^B}$
where $\xi_j$ is the potential of neuron $j$, $w_{ij}^B$ a weight of the BP-network $B$, $y_j^B$ the actual output, and $E$ the error computed by the GREN-network.
Weight adjustment rules
– Use the error terms back-propagated from the GREN-network
– Rules similar to the standard Back-Propagation:
$w_{ij}^B(\text{new}) = w_{ij}^B(\text{old}) + \alpha \, \delta_j^B \, y_i^B$
where $\alpha$ is the learning rate, $\delta_j^B$ the error term and $y_i^B$ the actual output of neuron $i$.
– For output neurons, compute $\delta_j^B$ by means of the terms $\delta_k^{GR_B}$ propagated from the GREN-network:
$\delta_k^{GR_B} = -\frac{\partial E}{\partial y_k^{GR_B}} \, \frac{\partial y_k^{GR_B}}{\partial \xi_k^{GR_B}}$
where $y_k^{GR_B}$ is the actual output and $\xi_k^{GR_B}$ the potential of neuron $k$ in the GREN-network $GR_B$, and $E$ the error.
Error terms for the trained BP-network
With $E = \sum_{e} e_e$, the back-propagated error terms $\delta_j^B$ correspond to:
$\delta_j^B = -f'(\xi_j^B) \sum_{e} w_{je}^{GR_B} \, e_e^{GR_B} \left( 1 - e_e^{GR_B} \right)$ for an output neuron of $B$ and a $GR_B$ with no hidden layer,
$\delta_j^B = -f'(\xi_j^B) \sum_{k} w_{jk}^{GR_B} \, \delta_k^{GR_B}$ for an output neuron of $B$ and a $GR_B$ with hidden layers,
$\delta_j^B = f'(\xi_j^B) \sum_{k} w_{jk}^{B} \, \delta_k^{B}$ for a hidden neuron of $B$.
Is the GREN-network an "expert"?
It does not have to "know the right answer", but it should "recognize the correct answer":
– for an input pattern, it yields the minimum error for only one actual output - the right one
A simple test for "problematic" GREN-experts:
– zero weights from the actual output $y^B$
– zero "y-terms" of the potentials in the first hidden layer
– "too many large negative weights": $\sum_i w_i^- \gg \sum_i w_i^+$
Find "better" input patterns!
Sought: input patterns of the GREN-network that are "similar" to those presented to and recalled by the BP-network, but with a smaller error.
– minimize the error at the output of the GREN-network, e.g. by back-propagation
– adjust the input patterns against the gradient of the GREN-network error function
Avoid "problematic" GREN-networks!
Insensitive to the outputs of trained BP-networks:
– inadequately small error terms back-propagated by the GREN-network
Incapable of training further BP-networks:
– small error terms even for large errors
Our goal: increase the sensitivity of GREN-networks to their inputs!
How to handle the sensitivity of BP-networks?
Increase their robustness:
– over-fitting leads to functions with a lot of structure and a relatively high curvature
– favor "smoother" network functions
– alternative formulations of the objective function:
  – penalizing large second-order derivatives of the network function
  – penalizing large second-order derivatives of the transfer function for hidden neurons
  – weight-decay regularizers
Controlled learning of GREN-networks
– Require GREN-networks sensitive to their inputs: non-zero error terms for incorrect BP-network outputs
– Favor larger values of the error terms
– Minimize during training:
$E^{REG} = \sum_{q} E_q^{REG} = -\sum_{q} \sum_{s} \sum_{r > n} \left( \frac{\partial y_{s,q}}{\partial y_{r,q}} \right)^2$
where $q$ runs over the patterns, $s$ over the output neurons and $r$ over the controlled input neurons ($y_s$: output values, $y_r$: controlled input values).
Weight adjustment rules
Regularization by means of the sensitivity terms $\partial y_s / \partial y_r$:
$\Delta w_{ij}^{E^{REG}} = -\frac{\partial E^{REG}}{\partial w_{ij}} = \sum_{s} \sum_{r > n} \frac{\partial}{\partial w_{ij}} \left( \frac{\partial y_s}{\partial y_r} \right)^2$
Rules similar to the standard Back-Propagation:
$w_{ij}^{GR_B}(T+1) = w_{ij}^{GR_B}(T) + \alpha \, \Delta w_{ij}^{E} + \alpha_c \, \Delta w_{ij}^{E^{REG}} + \alpha_m \left( w_{ij}^{GR_B}(T) - w_{ij}^{GR_B}(T-1) \right)$
($\Delta w^{E}$: BP-weight adjustment, $\Delta w^{E^{REG}}$: controlled weight adjustment, $\alpha_m$: momentum term)
Characteristics of the method
– Applicable to any BP-network and/or input neuron
– Quicker training of "actual" BP-networks: larger "sensitivity terms" $\partial y_s / \partial y_r$ transfer the errors from the GREN-network better
– Oscillations during the training of "actual" BP-networks: due to the "linear" nature of the GREN-specified error function $E = \sum_{p} \sum_{e} e_{e,p}^{GR_B}$ ($e$: output neurons / output values of the GREN-network, $p$: patterns)
Modification of the method
– Use "quadratic" GREN-specified error terms for training "actual" BP-networks:
$\hat{E} = \sum_{p} \sum_{e} \left( e_{e,p}^{GR_B} \right)^2$
– Considers both the GREN-network outputs $e_{e,p}^{GR_B}$ and the "sensitivity" terms $\partial e_{e,p}^{GR_B} / \partial y_{j,p}^B$
– Crucial for a low sensitivity to erroneous training patterns
Supporting experiments
(Figures: the output surface of the standard BP-network over the x- and y-coordinates, 3000 cycles, SSE = 0.89; the output surface of the BP-network trained with a GREN-network, 3000 cycles, SSE = 0.05)
Supporting experiments
(Figures: the BP-network output for a constant y = 0.25, with errorbars corresponding to the GREN-error; GREN-adjusted input/output patterns for a constant y = 0.25, starting from the initial I/O patterns [0, 0.25, 0.197], [0.5, 0.25, 0.388] and [1, 0.25, 0.932])
Supporting experiments
(Figures: the output of the standard BP-network with 8 hidden neurons, 1500 cycles, SSE = 0.51; the output of the GREN-trained BP-network with 8 hidden neurons, 1500 cycles, SSE = 0.06, GREN-error = 1.2)
Supporting experiments
(Figures: sensitivity and error (SSE over the training cycles) for a standard BP-trained GREN-network and for a controlled-trained GREN-network, control rates = 0.2)
Supporting experiments
(Figures: sensitivities and error for a controlled-trained GREN-network (control rates = 0.2) and for an over-trained GREN-network: the network error, the network sensitivity and the sensitivity to the BP-output over the training cycles)
GREN-networks: conclusions
– GREN-networks can train BP-networks without the knowledge of their desired outputs
– A simple detection of "problematic" GREN-experts
– GREN-networks can find "similar" input patterns with a lower error
Conclusions: Sensitivity of GREN-networks
– Increase the sensitivity of trained GREN-networks to their inputs
– Detect "over-training" in GREN-networks
– Train BP-networks more efficiently by minimizing the squared GREN-network outputs instead of the "linear" ones
– Further research: simplified sensitivity control

Acoustic Emission and Feature Selection Based on Sensitivity Analysis
with M. Chlada and Z. Převorovský, Institute of Thermomechanics, Academy of Sciences
Acoustic Emission and Feature Selection Based on Sensitivity Analysis
BP-networks and sensitivity analysis:
– larger "sensitivity terms" $|\partial y_j / \partial x_i|$ indicate a higher importance of the feature $i$ (see the sketch below)
Numerical experiments:
– acoustic emission: classification of simulated AE data
– feature selection: reduction of the original input parameters (from 14 to 6); model dependence between parameters
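Such sensitivity terms can be estimated for any trained network, e.g. by central finite differences averaged over the data set; a sketch (the names are ours):

```python
import numpy as np

def sensitivity(net, X, h=1e-3):
    """Mean |dy_j/dx_i| over all patterns. `net` maps an n-vector to an
    m-vector; returns an (n, m) matrix of sensitivity coefficients."""
    P, n = X.shape
    m = len(net(X[0]))
    S = np.zeros((n, m))
    for x in X:
        for i in range(n):
            dx = np.zeros(n)
            dx[i] = h
            S[i] += np.abs(net(x + dx) - net(x - dx)) / (2.0 * h)
    return S / P
```

Rows with uniformly small coefficients mark inputs the trained network effectively ignores; this is the screening that reduced the 14 AE parameters below.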
Simulation of AE-data
(Figures: three model pulses WAVE 1, WAVE 2 and WAVE 3, and their combination 0.3·WAVE1 + 0.2·WAVE2 + 0.5·WAVE3)
Simulation of AE-data
(Figures: the input signal (a = 0.3, b = 0.2, c = 0.5) and its convolution with the Green function; Green function - 140 mm)
Original Features for AE-signals
– amplitude: $z_{\max} = \max_{t \in T} \{ z(t) \}$
– rise time
– effective value (RMS): $RMS = \sqrt{ \frac{1}{T} \int_T z^2(t) \, dt }$
– energy moment: $t_E = \int_T t \cdot z^2(t) \, dt$
– mean value: $t_s = \frac{1}{T} \int_T t \cdot z(t) \, dt$
– deviation: $\sigma^2 = \frac{1}{T} \int_T (t - t_s)^2 \, z(t) \, dt$
– asymmetry: $\eta = \frac{1}{\sigma^3} \int_T (t - t_s)^3 \, z^2(t) \, dt$
– excess: $\xi = \frac{1}{\sigma^4} \int_T (t - t_s)^4 \, z^2(t) \, dt$
– 6 spectral parameters: $X = \frac{\int_{G_X} f(k) \, dk}{\int_{G} f(k) \, dk}$, $X \in \{A, B, C, D, E, F\}$,
with the frequency bands $G_A = [0, 0.12) \cdot f_N/2$, $G_B = [0.12, 0.24) \cdot f_N/2$, $G_C = [0.24, 0.36) \cdot f_N/2$, $G_D = [0.36, 0.48) \cdot f_N/2$, $G_E = [0.48, 0.6) \cdot f_N/2$, $G_F = [0.6, 1) \cdot f_N/2$, $G = [0, f_N/2)$, where $f_N$ is the Nyquist frequency.
Factor analysis for input parameters
– 9 factors selected, "explaining" 98.4% of all variables; e.g. a higher energy of the signals leads to higher amplitudes and RMS (parameters 1, 3, 4)
– allows reducing the linearly dependent input parameters: in our case to 2, 3, 5, 6, 7, 8, 11, 12 and 2 new spectral parameters
(Figure: input parameters vs. selected factors)
Sensitivity analysis of trained BP-networks
– 2000 samples (500 training samples)
– architecture 14-27-19-3, 180 iterations
– selected inputs: sensitivity analysis: 1, 3, 4, 5, 6, 13, 14; combined with factor analysis: 1, 3, 5, 6, 13, 14
– new architecture: 6-13-7-3 (even with slightly better MSE-results)

SENSITIVITY COEFFICIENTS (inputs 1-14, outputs 1-3):

Input   Output 1  Output 2  Output 3
 1       0.173     0.266     0.149
 2       0.093     0.068     0.047
 3       0.320     0.193     0.184
 4       0.301     0.178     0.196
 5       0.564     0.250     0.206
 6       0.196     0.322     0.158
 7       0.099     0.063     0.043
 8       0.065     0.015     0.030
 9       0.022     0.014     0.016
10       0.053     0.020     0.012
11       0.035     0.012     0.032
12       0.039     0.050     0.022
13       0.081     0.134     0.082
14       0.260     0.172     0.109
Model dependence
SENSITIVITY COEFFICIENTS for X4 = (X1)^4 (rows: parameters 1-4, columns: target parameter 1-4):

0.32  0.04  0.02  0.16
0.16  0.92  0.05  0.14
0.08  0.03  0.98  0.05
0.04  0.04  0.03  0.35

(Figure: the plot of X4 = (X1)^4 over [-1, 1])
Model dependence
SENSITIVITY COEFFICIENTS for X4 = sin(9·X1) (rows: parameters 1-4, columns: target parameter 1-4):

0.77  0.03  0.04  0.12
0.13  0.89  0.02  0.11
0.06  0.05  0.97  0.05
0.11  0.03  0.03  0.59

(Figure: the plot of X4 = sin(9·X1) over [-1, 1])
Knowledge extraction in neural networks (students' questionnaire)
with Eva Poučková, Department of Software Engineering, Charles University Prague
Knowledge representation in NN
Distributed!
– Which inputs are the most important ones?
– What does the network do, and how does it do it?
(Figure: a layered network between INPUT and OUTPUT)
Knowledge extraction in NN
– Dimension reduction and sensitivity analysis for the inputs
– Rule extraction from trained networks:
  – Structural learning with forgetting (BP-networks, GREN-networks)
  – Babsi-trees (B. Hammer et al.), GRLVQ
Dimension reduction
– PCA: a linear transformation of the input data
– Sensitivity analysis
– Feature Subset Selection (FSS)
– Correlation-based Feature Selection (CFS): select a group of features with a high average input_feature-output correlation but with a low mutual correlation (see the sketch below)
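The CFS idea - high feature-output correlation, low feature-feature correlation - is often scored with the standard CFS merit heuristic; a greedy sketch under that assumption (the implementation details are ours):

```python
import numpy as np

def cfs_merit(r_cf, R_ff, subset):
    """Merit = k*mean|corr(feature, output)| / sqrt(k + k(k-1)*mean|corr(f_i, f_j)|)."""
    k = len(subset)
    mean_cf = np.mean([r_cf[i] for i in subset])
    mean_ff = (np.mean([R_ff[i, j] for i in subset for j in subset if i != j])
               if k > 1 else 0.0)
    return k * mean_cf / np.sqrt(k + k * (k - 1) * mean_ff)

def cfs_select(r_cf, R_ff, k_max):
    """Greedy forward selection of up to k_max features; r_cf holds the
    |feature-output| correlations, R_ff the |feature-feature| correlations."""
    chosen = []
    for _ in range(k_max):
        rest = [i for i in range(len(r_cf)) if i not in chosen]
        chosen.append(max(rest, key=lambda i: cfs_merit(r_cf, R_ff, chosen + [i])))
    return chosen
```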
Dimension reduction: results
– PCA: the method is not suitable for further processing (knowledge and rule extraction)
– FSS: 7 features; CFS: 7 features (out of 25 original features)
– 8 features selected as the union of the results for FSS and CFS
Features selected for the overall evaluation
Feature subset selection (FSS): (1) understandable subject, (2) structured and prepared presentations, (3) interesting classes, (4) quality of education, (5) understandable classes, (6) start/end of class on time, (7) relationship to students
Correlation-based feature selection (CFS): (1) understandable subject, (2) structured and prepared presentations, (3) interesting classes, (4) quality of education, (5) understandable classes, (6) start/end of class on time, (8) students prepare for classes
Methods for knowledge extraction
SLF - Structural learning with forgetting:
• learning with forgetting
• learning with forcing internal representations on hidden neurons
• learning with selective forgetting
Babsi-trees:
• form a tree from a neural network trained by means of the GRLVQ-method
Generalized relevance learning vector quantization (GRLVQ)
– a robust combination of GLVQ and RLVQ
– provides weighting factors (λ) for the input features: a larger λ corresponds to a "more important" feature
– applicable to the pruning of input features
GLVQ: considers class representatives
– the separating surfaces approach the optimal Bayesian ones
RLVQ: input features can have different importance
– relatively unstable, sensitive to noise
Generalized LVQ - GLVQ
Select a fixed number of representatives $w_1, \ldots, w_L$ for all classes $C_i$, $i = 1, \ldots, q$.
Receptive field of the class representative $w_i$:
$R_i = \{ x \in T : \forall k \ne i, \; \|x - w_i\| < \|x - w_k\| \}$
The receptive fields of the class representatives should be as small as possible: minimize
$E = \sum_{t=1}^{P} \sigma\!\left( \eta(x^{(t)}) \right)$, where $\sigma$ denotes the sigmoid and
$\eta(x^{(t)}) = \dfrac{\|x^{(t)} - w^{(t+)}\|^2 - \|x^{(t)} - w^{(t-)}\|^2}{\|x^{(t)} - w^{(t+)}\|^2 + \|x^{(t)} - w^{(t-)}\|^2}$
with $w^{(t+)}$ the closest representative yielding a correct classification of $x^{(t)}$ and $w^{(t-)}$ the closest one yielding a wrong classification.
Generalized LVQ - GLVQ
Weight adjustment (with $d^{\pm} = \|x^{(t)} - w^{(t\pm)}\|^2$ and learning rates $\alpha$; a sketch follows):
$\Delta w^{(t+)} = \alpha \, \sigma'\!\left( \eta(x^{(t)}) \right) \dfrac{d^-}{\left( d^+ + d^- \right)^2} \left( x^{(t)} - w^{(t+)} \right)$
$\Delta w^{(t-)} = -\alpha \, \sigma'\!\left( \eta(x^{(t)}) \right) \dfrac{d^+}{\left( d^+ + d^- \right)^2} \left( x^{(t)} - w^{(t-)} \right)$
with $\sigma'(\eta(x^{(t)})) = \sigma(\eta(x^{(t)})) \left( 1 - \sigma(\eta(x^{(t)})) \right)$.
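One GLVQ update step under this reading of the formulas, as a sketch (our notation; squared Euclidean distances, constants absorbed into the learning rate):

```python
import numpy as np

def glvq_step(x, label, W, W_labels, alpha=0.05):
    """Move the closest correct prototype towards x and the closest
    wrong prototype away from x, weighted by sigma'(eta)."""
    d = ((W - x) ** 2).sum(axis=1)              # squared distances to prototypes
    correct = np.where(W_labels == label)[0]
    wrong = np.where(W_labels != label)[0]
    p = correct[np.argmin(d[correct])]          # closest correct representative
    q = wrong[np.argmin(d[wrong])]              # closest wrong representative
    dp, dq = d[p], d[q]
    eta = (dp - dq) / (dp + dq)                 # negative => correctly classified
    sig = 1.0 / (1.0 + np.exp(-eta))
    g = sig * (1.0 - sig)                       # sigma'(eta)
    W[p] += alpha * g * dq / (dp + dq) ** 2 * (x - W[p])
    W[q] -= alpha * g * dp / (dp + dq) ** 2 * (x - W[q])
    return eta
```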
Relevance LVQ - RLVQ
Input features can have different importance factors $\lambda_i$:
$dist_{\lambda}(x, w) = \sum_{i=1}^{n} \lambda_i (x_i - w_i)^2 \, ; \quad \sum_{i=1}^{n} \lambda_i = 1$
Receptive field of the class representative $w_i$:
$R_i^{\lambda} = \{ x \in T : \forall j \ne i, \; \|x - w_i\|_{\lambda} < \|x - w_j\|_{\lambda} \}$
Weight adjustment according to GLVQ, with adaptive importance factors $\lambda_i$ for the input features ($0 < \varepsilon < 1$):
$\Delta \lambda_i^{(t)} = -\varepsilon \left( x_i - w_{j,i}^{(t-1)} \right)^2$ if the pattern is classified correctly ($d = c_j$ for the nearest representative $w_j$), and $+\varepsilon \left( x_i - w_{j,i}^{(t-1)} \right)^2$ otherwise; afterwards $\lambda^{(t)} = \max\left( \lambda^{(t-1)} + \Delta \lambda^{(t)}, 0 \right)$, followed by renormalization.
Generalized Relevance Learning Vector Quantization (GRLVQ)
Weight adjustment according to GLVQ, with adaptive importance factors $\lambda_i$ for the input features:
$\Delta \lambda_i = -\varepsilon \, \sigma'\!\left( \eta(x^{(t)}) \right) \left[ \dfrac{d^-}{\left( d^+ + d^- \right)^2} \left( x_i^{(t)} - w_i^{(t+)} \right)^2 - \dfrac{d^+}{\left( d^+ + d^- \right)^2} \left( x_i^{(t)} - w_i^{(t-)} \right)^2 \right]$
where $d^{\pm} = \|x^{(t)} - w^{(t\pm)}\|_{\lambda}^2$.
Babsi-trees
Root-trees G = (V, E) satisfying the following conditions:
– every vertex $v_i \in V$ can have an arbitrary number $n_i$ of sons
– all leaves $v_J$ are labeled with the corresponding classification class $C_J$
– all vertices $v_i$ which are not leaves are labeled with $I_{v_i}$, which stands for the currently processed input dimension; the dimensions are "ordered" according to their importance (λ)
– all edges going from a vertex $v_i$ to its sons are labeled with intervals $[st_k^{v_i}, st_l^{v_i})$; the interval boundaries are placed in the middle between two neighboring cluster representatives
SLF for layered networks
(Figures: feed-forward neural networks and GREN-networks)
Results for the SLF-method
Both BP-networks and GREN-trained networks lead to similar sets of rules.
(Figure: the extracted rule sets)
The resulting Babsi-tree
Relevant dimensions (features): 4, 8 and 7
Values for the intervals: 1: (-∞, 1.5); 2: [1.5, 2.5); 3: [2.5, 3.5); 4: [3.5, 4.5); 5: [4.5, ∞)
(Figure: the tree branches on features 4, 8 and 7 towards the overall evaluation; some leaves cover many patterns, some just one pattern, and some give an ambiguous classification)
Comparison of the results
SLF:
– few simple, hierarchically ordered rules
– the possibility to add rules after achieving the desired accuracy
– rules correctly applicable: 71% and 73%, resp.
Babsi-trees:
– many simple rules
– quick training of the network
– few training parameters
– rules correctly applicable: 67%
Knowledge extraction: Conclusions
Main results achieved:
• dimension reduction for the input space
• analysis of various models for knowledge extraction
• rule extraction from GREN-networks
• comparison with other neural network models
Further research:
• adjusting the rules extracted from a neural network trained with the GRLVQ-algorithm
• (automatic) selection of the training parameters for the SLF-algorithm
Genetic Algorithms (GA)
Apply genetics and natural selection to find optimal parameters of a predictive function!
GAs use "genetic" operators to evolve successive generations of solutions:
– selection
– crossover
– mutation
The best candidates "survive" to further generations until convergence is achieved.
– Directed data mining
The basic Genetic Algorithm
Step 1: Create an initial population of individuals
Step 2: Evaluate the fitness of all individuals in the population
Step 3: Select candidates for the next generation
Step 4: Create new individuals (use the genetic operators - crossover and mutation)
Step 5: Form a new population by replacing (some) old individuals by new ones; GOTO Step 2
(a code sketch follows)
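The five steps translate directly into code; a generic sketch with bit-string individuals (the fitness function, representation and rates are placeholders, not the ANTARES implementation described next):

```python
import random

def genetic_algorithm(fitness, length, pop_size=50, generations=100,
                      cx_rate=0.7, mut_rate=0.01, seed=0):
    rng = random.Random(seed)
    # Step 1: an initial population of random bit-strings
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)    # Step 2: fitness
        parents = scored[: pop_size // 2]                  # Step 3: selection
        children = []
        while len(children) < pop_size - len(parents):     # Step 4: new individuals
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, length) if rng.random() < cx_rate else 0
            child = [bit ^ (rng.random() < mut_rate)       # mutation
                     for bit in a[:cut] + b[cut:]]         # one-point crossover
            children.append(child)
        pop = parents + children                           # Step 5: replacement
    return max(pop, key=fitness)

best = genetic_algorithm(fitness=sum, length=20)           # toy: maximize ones
print(best, sum(best))
```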
ANTARES - STUDENT SOFTWARE PROJECT
supervised by I. Mrázová, F. Mráz
participating students: D. Bělonožník, D. J. Květoň, M. Šubert, J. Tomaštík, J. Tupý
http://www.ms.mff.cuni.cz/~mraz/antares
Project ANTARES
Generate melodies with genetic algorithms:
– the fitness of candidate solutions is evaluated by a cooperating feed-forward neural network
Parallel implementation of genetic algorithms:
– an open system for the design and testing of genetic algorithms and neural networks
– supports mutual cooperation between neural networks and genetic algorithms
Fitness evaluation with neural networks
For some problems, it might be difficult to define the fitness function explicitly:
– e.g. "evaluate" the beauty of generated melodies
The fitness of candidate melodies (generated by the GA) is evaluated by a pre-trained NN:
– provide a set of positive and negative examples (supervised learning)
– train a feed-forward network to approximate the "unknown" fitness function (on the training set)
Generating melodies: positive training samples
(Figure: musical notation of the positive training samples)
Generating melodies
(Figures: positive training samples, negative training samples, test samples with a high fitness value, and generated melodies - musical notation)
Thank you for your attention!