Experiments with MRDTL – A Multi-relational Decision Tree Learning Algorithm
Hector Leiva, Anna Atramentov and Vasant Honavar*
Artificial Intelligence Laboratory
Department of Computer Science and
Graduate Program in Bioinformatics and Computational Biology
Iowa State University
Ames, IA 50011, USA
www.cs.iastate.edu/~honavar/aigroup/html
* Support provided in part by National Science Foundation, Carver Foundation, and Pioneer Hi-Bred, Inc.
Motivation

Importance of multi-relational learning:
- Growth of data stored in multi-relational databases (MRDBs)
- Techniques for learning from unstructured data often extract the data into an MRDB

Expansion of techniques for multi-relational learning:
- Blockeel's framework (ILP) (1998)
- Getoor's framework (first-order extensions of probabilistic models) (2001)
- Knobbe's framework (MRDM) (1999)

Problem: no experimental results available

Goals:
- Perform experiments and evaluate the performance of Knobbe's framework
- Understand the strengths and limits of the approach
Multi-Relational Learning Literature

- Inductive Logic Programming
- First-order extensions of probabilistic models
- Multi-Relational Data Mining
- Propositionalization methods
- PRMs extension for cumulative learning and reasoning as agents interact with the world
- Approaches for mining data in the form of graphs

Blockeel, 1998; De Raedt, 1998; Knobbe et al., 1999; Friedman et al., 1999; Koller, 1999; Krogel and Wrobel, 2001; Getoor, 2001; Kersting et al., 2000; Pfeffer, 2000; Dzeroski and Lavrac, 2001; Dehaspe and De Raedt, 1997; Dzeroski et al., 2001; Jaeger, 1997; Karalic and Bratko, 1997; Holder and Cook, 2000; Gonzalez et al., 2000
Problem Formulation

Example of a multi-relational database.
Given: data stored in a relational database.
Goal: build a decision tree for predicting a target attribute in the target table.
Instances:

Department
ID  Specialization    #Students
d1  Math              1000
d2  Physics           300
d3  Computer Science  400

Staff
ID  Name    Department  Position           Salary
p1  Dale    d1          Professor          70-80k
p2  Martin  d3          Postdoc            30-40k
p3  Victor  d2          Visitor Scientist  40-50k
p4  David   d3          Professor          80-100k

Graduate Student
ID  Name    GPA  #Publications  Advisor  Department
s1  John    2.0  4              p1       d3
s2  Lisa    3.5  10             p4       d3
s3  Michel  3.9  3              p4       d4

Schema:

Department(ID, Specialization, #Students)
Staff(ID, Name, Department, Position, Salary)
Grad.Student(ID, Name, GPA, #Publications, Advisor, Department)
[Figure: decision tree splitting the instance set {d1, d2, d3, d4} into {d1, d2} and {d3, d4}]

Tree_induction(D: data)
    A = optimal_attribute(D)
    if stopping_criterion(D)
        return leaf(D)
    else
        D_left  := split(D, A)
        D_right := split_complement(D, A)
        child_left  := Tree_induction(D_left)
        child_right := Tree_induction(D_right)
        return node(A, child_left, child_right)
Propositional decision tree algorithm. Construction phase
Day  Outlook   Temp  Humidity  Wind    PlayTennis
d1   Sunny     Hot   High      Weak    No
d2   Sunny     Hot   High      Strong  No
d3   Overcast  Hot   High      Weak    Yes
d4   Overcast  Cold  Normal    Weak    No
[Figure: the tree first splits on Outlook (sunny: {d1, d2} → No), then on Temperature (hot vs. not hot), separating {d3} and {d4}; the sub-tables at each branch show the corresponding partitions of the data]
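The Tree_induction procedure above can be sketched as executable Python. This is a minimal illustrative version: the attribute selection and stopping criterion are simplified placeholders (binary equality tests, stop on pure labels or zero gain), not the original implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def optimal_split(data, labels):
    """Pick the binary test (attribute == value) with the highest information gain."""
    best, best_gain = None, 0.0
    base, n = entropy(labels), len(labels)
    for attr in data[0]:
        for val in {row[attr] for row in data}:
            left = [y for row, y in zip(data, labels) if row[attr] == val]
            right = [y for row, y in zip(data, labels) if row[attr] != val]
            if not left or not right:
                continue
            gain = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
            if gain > best_gain:
                best, best_gain = (attr, val), gain
    return best

def tree_induction(data, labels):
    """Recursive construction: return a leaf (majority class) or a node with two subtrees."""
    split = optimal_split(data, labels)
    if len(set(labels)) <= 1 or split is None:
        return Counter(labels).most_common(1)[0][0]
    attr, val = split
    left = [(r, y) for r, y in zip(data, labels) if r[attr] == val]
    right = [(r, y) for r, y in zip(data, labels) if r[attr] != val]
    return {"test": (attr, val),
            "left": tree_induction([r for r, _ in left], [y for _, y in left]),
            "right": tree_induction([r for r, _ in right], [y for _, y in right])}
```

On the PlayTennis-style rows above this recursively partitions the day set exactly as the figure suggests.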
MR setting. Splitting data with Selection Graphs

[Figure: the Department, Staff, and Graduate Student instance tables from the previous slide, split by a selection graph over Staff and Grad.Student with the condition GPA > 2.0 and its complement selection graphs]

Staff with no Grad.Student:
ID  Name    Department  Position           Salary
p2  Martin  d3          Postdoc            30-40k
p3  Victor  d2          Visitor Scientist  40-50k

Staff with at least one Grad.Student with GPA > 2.0:
ID  Name   Department  Position   Salary
p4  David  d3          Professor  80-100k

Staff whose Grad.Students all have GPA <= 2.0:
ID  Name  Department  Position   Salary
p1  Dale  d1          Professor  70-80k
What is a selection graph?

[Figure: example selection graphs, e.g. Staff with an open edge to Grad.Student (GPA > 3.9) and a closed edge to Grad.Student; Staff with an edge to a Department node with Specialization = math]

- It corresponds to a subset of the instances of the target table
- Nodes correspond to tables in the database
- Edges correspond to associations between tables
- Open edge = "has at least one"
- Closed edge = "has none"
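A selection graph can be captured in a small data structure. The following Python sketch is illustrative only; the class and field names are assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class SGNode:
    table: str                                       # table in the database schema
    conditions: list = field(default_factory=list)   # e.g. ["GPA > 3.9"]

@dataclass
class SGEdge:
    parent: "SGNode"
    child: "SGNode"
    present: bool = True   # open edge: "has at least one"; closed edge: "has none"

@dataclass
class SelectionGraph:
    target: SGNode         # root node, always the target table
    edges: list = field(default_factory=list)

# Example: Staff having at least one Grad.Student with GPA > 3.9
staff = SGNode("Staff")
student = SGNode("Graduate_Student", conditions=["GPA > 3.9"])
graph = SelectionGraph(target=staff, edges=[SGEdge(staff, student, present=True)])
```

The `present` flag on each edge distinguishes the open ("has at least one") from the closed ("has none") case.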
Automatically transforming selection graphs into SQL queries

Staff with Position = Professor:
select distinct T0.id
from Staff T0
where T0.Position = 'Professor'

Staff with at least one Grad.Student:
select distinct T0.id
from Staff T0, Graduate_Student T1
where T0.id = T1.Advisor

Staff with no Grad.Student:
select distinct T0.id
from Staff T0
where T0.id not in (select T1.Advisor from Graduate_Student T1)

Staff with at least one Grad.Student, but none with GPA > 3.9:
select distinct T0.id
from Staff T0, Graduate_Student T1
where T0.id = T1.Advisor
and T0.id not in (select T1.Advisor from Graduate_Student T1 where T1.GPA > 3.9)

Generic query:
select distinct T0.primary_key
from table_list
where join_list and condition_list
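The generic query above can be assembled mechanically from the graph's parts. The helper below is a hypothetical sketch (not the paper's code); closed edges, which become `not in` subqueries as shown above, are omitted for brevity.

```python
def selection_graph_to_sql(tables, joins, conditions, primary_key="id"):
    """Assemble: select distinct T0.<pk> from <table_list> where <join_list> and <condition_list>.
    Each table gets an alias T0, T1, ... in order; T0 is the target table."""
    table_list = ", ".join(f"{t} T{i}" for i, t in enumerate(tables))
    where_clauses = list(joins) + list(conditions)
    sql = f"select distinct T0.{primary_key} from {table_list}"
    if where_clauses:
        sql += " where " + " and ".join(where_clauses)
    return sql

# Staff having at least one Grad.Student with GPA > 3.9
query = selection_graph_to_sql(
    ["Staff", "Graduate_Student"],
    joins=["T0.id = T1.Advisor"],
    conditions=["T1.GPA > 3.9"])
```

With no joins and a single table this degenerates to the first query on the slide.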
MR decision tree

[Figure: a multi-relational decision tree whose internal nodes contain selection graphs, e.g. Staff with a Grad.Student having GPA > 3.9]

- Each node contains a selection graph
- Each child's selection graph is a supergraph of its parent's selection graph
How to choose selection graphs in nodes?

Problem: there are too many supergraph selection graphs to choose from at each node.
Solution:
- start with the initial selection graph
- use a greedy heuristic to choose supergraph selection graphs: refinements
- use binary splits for simplicity
- for each refinement, get its complement refinement
- choose the best refinement based on the information gain criterion

Problem: some potentially good refinements may give no immediate benefit.
Solution: look-ahead capability.
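Scoring a refinement/complement pair by information gain only needs the per-class instance counts on each side of the split. A minimal sketch (the count-vector interface is an assumption for illustration):

```python
import math

def entropy(counts):
    """Entropy of a class-count vector, e.g. [positives, negatives]."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def information_gain(parent, refined, complement):
    """Gain of the binary split given by a refinement and its complement refinement.
    All three arguments are class-count vectors over the same classes."""
    n = sum(parent)
    return (entropy(parent)
            - sum(refined) / n * entropy(refined)
            - sum(complement) / n * entropy(complement))
```

A refinement that perfectly separates two balanced classes has gain 1.0 bit; one that leaves the class mix unchanged on both sides (such as adding an edge that does not change which instances are covered) has gain 0.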
Refinements of selection graph

- add a condition to a node – explores attribute information in the tables
- add a present edge and open node – explores relational properties between the tables

[Figure: the running example selection graph to be refined: Staff with a Grad.Student having GPA > 3.9, and Staff in a Department with Specialization = math]
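The two refinement operators can be sketched over a simple dict-based graph encoding (an illustrative representation, not the paper's). Each operator returns a new, strictly more specific selection graph, leaving the original intact:

```python
import copy

def add_condition(graph, node_index, condition):
    """Refinement 1: add a condition to an existing node (explores attributes)."""
    g = copy.deepcopy(graph)
    g["nodes"][node_index]["conditions"].append(condition)
    return g

def add_present_edge(graph, parent_index, table, association):
    """Refinement 2: add a present edge and open a new node (explores associations)."""
    g = copy.deepcopy(graph)
    g["nodes"].append({"table": table, "conditions": []})
    g["edges"].append({"from": parent_index, "to": len(g["nodes"]) - 1,
                       "association": association, "present": True})
    return g

# Start from the target table alone and apply each operator once:
g0 = {"nodes": [{"table": "Staff", "conditions": []}], "edges": []}
g1 = add_condition(g0, 0, "Position = Professor")
g2 = add_present_edge(g0, 0, "Graduate_Student", "Staff.id = Graduate_Student.Advisor")
```

Every refined graph is a supergraph of its parent, which is exactly the property the MR decision tree relies on.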
Refinements of selection graph (examples)

[Figures: each of the following refinements, with its complement refinement, applied to the running example graph (Staff with a Grad.Student having GPA > 3.9, in a Department with Specialization = math):
- add the condition Position = Professor to the Staff node; complement: Position != Professor
- add the condition GPA > 2.0 to the open Grad.Student node; complement: a closed Grad.Student node with GPA > 2.0
- add the condition #Students > 200 to the Department node; complement likewise
- add a present edge and open Department node; complement (note: information gain = 0)
- add a present edge and open Staff node; complement
- add a present edge and open Grad.Student node; complement]
Look ahead capability

[Figures: adding a present edge to a Department node alone yields no immediate gain; a two-step look-ahead refinement adds the edge together with the condition #Students > 200, shown with its complement refinement]
MR decision tree algorithm. Construction phase

[Figure: the tree grows by refining the selection graph at each node, e.g. Staff with a Grad.Student having GPA > 3.9]

For each non-leaf node:
- consider all possible refinements of the node's selection graph and their complements
- choose the best one based on the information gain criterion
- create the children nodes
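The construction loop above can be sketched generically. The three callbacks (candidate generator, gain scorer over a refinement/complement pair, and stopping criterion) are placeholders for the actual MRDTL components:

```python
def mrdtl_induce(graph, candidates, gain, stop):
    """Grow a binary MR decision tree: each non-leaf node stores a selection
    graph; its children use the best-scoring refinement and its complement.
    candidates(graph) yields (refinement, complement) pairs; gain scores a pair."""
    if stop(graph):
        return {"leaf": graph}
    refined, complement = max(candidates(graph), key=gain)
    return {"graph": graph,
            "left": mrdtl_induce(refined, candidates, gain, stop),
            "right": mrdtl_induce(complement, candidates, gain, stop)}

# Toy run: "graphs" are strings; the refinement appends "R", the complement "C".
tree = mrdtl_induce("", lambda g: [(g + "R", g + "C")],
                    gain=lambda pair: 0, stop=lambda g: len(g) >= 2)
```

The toy run grows a depth-2 tree whose leaves are the four refinement histories, mirroring how each child's selection graph extends its parent's.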
MR decision tree algorithm. Classification phase

[Figure: the leaves of the tree carry selection graphs and class labels, e.g. Staff with a Grad.Student having GPA > 3.9 in a Department with Spec = math → 70-80k; with Spec = physics and Position = Professor → 80-100k]

For each leaf:
- apply the selection graph of the leaf to the test data
- classify the resulting instances with the classification of the leaf
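The classification phase can be sketched as running each leaf's query against the test database. Here `run_query` is a placeholder for executing the leaf's SQL; the toy stand-in below maps query names directly to matching ids.

```python
def classify(leaves, run_query):
    """leaves: list of (selection_graph_query, class_label) pairs.
    Each test instance matched by a leaf's query receives that leaf's label."""
    predictions = {}
    for query, label in leaves:
        for instance_id in run_query(query):
            predictions.setdefault(instance_id, label)
    return predictions

# Hypothetical stand-in for a test database:
fake_db = {"gpa_high_math": ["p1", "p4"], "gpa_high_physics": ["p2", "p3"]}
preds = classify([("gpa_high_math", "70-80k"), ("gpa_high_physics", "30-40k")],
                 run_query=fake_db.get)
```

Because the leaf selection graphs partition the target table, each test instance matches exactly one leaf query in the intended setting; `setdefault` simply keeps the first match.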
Experimental results. Mutagenesis

The most widely used database in ILP. It describes molecules of certain nitroaromatic compounds.
Goal: predict their mutagenic activity (label attribute) – the ability to cause DNA to mutate. High mutagenic activity can cause cancer.

Class distribution:
Compounds              Active  Inactive  Total
Regression friendly    125     63        188
Regression unfriendly  13      29        42
Total                  138     92        230

5 levels of background knowledge: B0, B1, B2, B3, B4. They provide increasingly richer descriptions of the examples. Only the first three levels (B0, B1, B2) are used here.
Experimental results. Mutagenesis

Results of 10-fold cross-validation for the regression friendly set:

Systems  Accuracy (%)     Time (secs.)
         B0    B1    B2   B0     B1     B2
Progol   79    86    86   8595   4627   6530
Progol   76    81    83   117k   64k    42k
FOIL     61    61    83   4950   9138   0.5
TILDE    75    79    85   41     170    142
MRDTL    67    87    88   0.85   332    221

Size of decision trees:

Systems  Number of nodes
         B0  B1  B2
MRDTL    1   53  51
Experimental results. Mutagenesis

Results of leave-one-out cross-validation for the regression unfriendly set:

Background  Accuracy  Time       #Nodes
B0          70%       0.6 secs.  1
B1          81%       86 secs.   24
B2          81%       60 secs.   22

Two recent approaches, (Sebag and Rouveirol, 1997) and (Kramer and De Raedt, 2001), using B3 have achieved 93.6% and 94.7%, respectively, on the mutagenesis database.
Experimental results. KDD Cup 2001

The data consists of a variety of details about the various genes of one particular type of organism. Genes code for proteins, and these proteins tend to localize in various parts of cells and interact with one another in order to perform crucial functions.

Task: prediction of gene/protein localization (15 possible values).
Target table: Gene. Target attribute: Localization. 862 training genes, 381 test genes.
Challenge: many attribute values are missing.

Approach: use a special value to encode a missing value. Result: accuracy of 50%.
Have to find good techniques for filling in missing values.
Experimental results. KDD Cup 2001

Approach: replace missing values by the most common value of the attribute for the class.
Results:
- accuracy of around 85% with a decision tree of 367 nodes, with no limit on the number of times an association can be instantiated
- accuracy of 80% when limiting the number of times an association can be instantiated
- accuracy of around 75% when following associations only in the forward direction

This shows that providing reasonable guesses for missing values can significantly enhance the performance of MRDTL on real-world data sets. In practice, however, since the class labels for the test data are unknown, it is not possible to apply this method directly.

Approach: extension of the Naive Bayes algorithm for relational data.
Result: no improvement compared to the first approach.

Have to incorporate the handling of missing values into the decision tree algorithm.
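The most-common-value-per-class imputation described above can be sketched as follows. This is a simplified stand-in for the actual preprocessing; as noted, it requires known class labels, so it applies only to training data.

```python
from collections import Counter

def impute_most_common_per_class(rows, labels, missing=None):
    """Fill each missing attribute value with the most common observed value
    of that attribute among rows of the same class. Returns new rows."""
    observed = {}  # (class, attribute) -> Counter of non-missing values
    for row, y in zip(rows, labels):
        for attr, val in row.items():
            if val is not missing:
                observed.setdefault((y, attr), Counter())[val] += 1
    filled = []
    for row, y in zip(rows, labels):
        new = dict(row)
        for attr, val in row.items():
            if val is missing and (y, attr) in observed:
                new[attr] = observed[(y, attr)].most_common(1)[0][0]
        filled.append(new)
    return filled
```

Attributes that are missing for an entire class are left as-is, since no per-class mode exists for them.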
Experimental results. Adult database

Information from the 1994 census. Task: determine whether a person makes over 50k a year.
Suitable for propositional learning: one table, 6 numerical attributes, 8 nominal attributes.

Class distribution for the adult database:
                     Training        Test            Total
                     >50k   <=50k    >50k   <=50k
With missing values  7841   24720    3846   12435    48842
W/o missing values   7508   22654    3700   11360    45222

Result after removal of missing values, using the original train/test split: 82.2%.
Filling missing values with the Naive Bayes approach yields 83%. C4.5 result: 84.46%.
Summary
- The algorithm is a promising alternative to existing algorithms such as Progol, FOIL, and TILDE
- The running time is comparable with the best existing approaches
- If equipped with principled approaches to handling missing values, it is an effective algorithm for learning from real-world relational data
- The approach is an extension of propositional learning, and can also be applied successfully to propositional learning

Questions:
- Why can't we split the data based on the value of an attribute in an arbitrary table right away?
- Is there a less restrictive and simpler way of representing the splits of data than selection graphs?
- The running time for computing the first nodes of the decision tree is much less than for the rest of the nodes. Is this unavoidable? Can we implement the same idea more efficiently?
Future work
- Incorporation of more sophisticated techniques for handling missing values
- Incorporation of more sophisticated pruning techniques or complexity regularization
- More extensive evaluation of MRDTL on real-world data sets
- Development of ontology-guided multi-relational decision tree learning algorithms to generate classifiers at multiple levels of abstraction [Zhang et al., 2002]
- Development of variants of MRDTL for classification tasks where the classes are not disjoint, based on the recently developed propositional decision tree counterparts of such algorithms [Caragea et al., 2002]
- Development of variants of MRDTL that can learn from heterogeneous, distributed, autonomous data sources, based on recently developed techniques for distributed learning and ontology-based data integration