Relational Classification Using Automatically Extracted Relations by Record Linkage. Christine Preisach, Steffen Rendle and Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

  • Slide 1
  • Relational Classification Using Automatically Extracted Relations by Record Linkage. Christine Preisach, Steffen Rendle and Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
  • Slide 2
  • Outline
      • Motivation
      • Relation Extraction and Multi-Relational Classification Framework
      • Relation Extraction
      • Multi-Relational Classification
      • Evaluation
      • Conclusion
  • Slide 3
  • Motivation: Example (publications P1, P2 and P3)

      Publication | Title                                      | Author     | Conference | Category
      1           | Classification of scientific publications | John Smith | ICDM       | Data Mining
      2           | Classification of Hypertext                | John Smith | KDD        | Data Mining
      3           | Hierarchical Clustering                    | Dan Miller | ICDM       | Data Mining
  • Slide 4
  • Motivation
      • Traditional classifiers take only local attributes such as keywords, title and abstract into account.
      • Assumption: instances are independent.
      • But this assumption does not hold: instances can be related to other documents through authorship, citations, the same conference, etc.
      • These relations should be exploited and combined in order to improve classification accuracy.
      • But manual extraction of relations by experts is expensive.
      • Therefore: automatic extraction of relations from noisy attributes.
  • Slide 5
  • Relation Extraction and Relational Classification Framework

      Publication | Title                                      | Author     | Conference
      1           | Classification of scientific publications | J. Smith   | ICDM 2005
      2           | Classification of Hypertext                | John Smith | KDD
      3           | Hierarchical Clustering                    | Dan Miller | 5th International Conference on Data Mining

      Category (as shown on the slide): Data Mining

      • Relation Extraction Component: extraction of relations from objects with noisy attributes.
      • Multi-Relational Classification Component: use the extracted relations instead of, or in addition to, local attributes for classification.
  • Slide 6
  • Relation Extraction
      • Pairwise feature extraction from noisy attributes with several similarity measures (e.g. TFIDF, cosine similarity, Levenshtein).
      • Probabilistic pairwise decision model: use the extracted similarities as features for a probabilistic classifier, build a model on the training data, and apply it to unknown pairs.
      • Collective decision model: if the target relation is an equivalence relation, use constrained clustering (e.g. HAC), with the pairwise decision model as a learned similarity measure, to transform its output into a binary relation.
      • Pipeline: Attributes → Pairwise feature extraction → Probabilistic pairwise decision model → Collective decision model → Relations
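The pairwise feature extraction step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`levenshtein_similarity`, `jaccard`, `pair_features`) and the record layout (a dict of attribute strings) are assumptions, and only two of the similarity measures named in the slides are shown.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the standard two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard(a: str, b: str) -> float:
    """Jaccard overlap of lower-cased whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def pair_features(rec_a: dict, rec_b: dict, attributes: list) -> list:
    """Hypothetical helper: one feature per (attribute, similarity measure)."""
    feats = []
    for attr in attributes:
        x, y = rec_a.get(attr, ""), rec_b.get(attr, "")
        feats.append(levenshtein_similarity(x, y))
        feats.append(jaccard(x, y))
    return feats
```

A feature vector built this way would be the input to whatever probabilistic classifier serves as the pairwise decision model; the slides do not say which classifier is used for that step.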
  • Slide 7
  • Relation Extraction: Collective Decision Model [Figure: clustering initialisation with must-links and cannot-links]
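One way to picture a collective decision model with must-link and cannot-link constraints is the sketch below. It is a hedged illustration under stated assumptions: average linkage, a stopping threshold of 0.5, and the names `constrained_hac` and `sim` are all hypothetical choices, not details taken from the slides.

```python
from itertools import combinations


def constrained_hac(n, sim, must_links=(), cannot_links=(), threshold=0.5):
    """Agglomerative clustering of items 0..n-1 under pairwise constraints.

    sim(i, j) is assumed to return the pairwise model's probability that
    items i and j are equivalent. Average linkage is an assumption here.
    """
    clusters = [{i} for i in range(n)]

    def find(i):
        return next(c for c in clusters if i in c)

    def merge(a, b):
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)

    def violates(a, b):
        # A merge is forbidden if it would join a cannot-link pair.
        return any((i in a and j in b) or (i in b and j in a)
                   for i, j in cannot_links)

    def avg_sim(a, b):
        return sum(sim(i, j) for i in a for j in b) / (len(a) * len(b))

    # Initialisation: must-linked items start in the same cluster.
    for i, j in must_links:
        a, b = find(i), find(j)
        if a is not b:
            merge(a, b)

    # Repeatedly merge the most similar admissible pair of clusters.
    while True:
        best, best_pair = threshold, None
        for a, b in combinations(clusters, 2):
            if violates(a, b):
                continue
            s = avg_sim(a, b)
            if s >= best:
                best, best_pair = s, (a, b)
        if best_pair is None:
            return clusters
        merge(*best_pair)
```

The returned clusters define an equivalence relation: two records are related exactly when they end up in the same cluster.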
  • Slide 8
  • Multi-Relational Classification
      • Relational classification problem: make use of additional information about related objects (i.e. their classes or attributes).
      • Propositionalize the relational data, e.g. with a feature defined over the neighborhood of an instance [formula not preserved].
  • Slide 9
  • Multi-Relational Classification: Algorithm
      1. For each relation R_1, ..., R_m:
         (a) Build an undirected weighted graph [definition not preserved].
         (b) Perform relational classification simultaneously for all instances in the test set.
         (c) Output a probability distribution.
      2. Apply ensemble classification to the resulting probability distributions of these relations.
      3. Output the final classification.
      [Figure: one relational classification step per relation, feeding an ensemble classification step]
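Step 1(a) can be illustrated with a small graph builder. The slide text does not preserve how edges or weights are defined, so the weighting below (number of shared relation values, such as shared authors) is purely an assumption, as is the name `build_relation_graph`.

```python
from collections import defaultdict
from itertools import combinations


def build_relation_graph(memberships):
    """Build an undirected weighted graph for one relation.

    memberships maps each instance to the set of relation values it has
    (e.g. publication id -> set of author names). Two instances are
    connected if they share a value; the weight counts shared values.
    This weighting is an assumption, not taken from the slides.
    """
    by_value = defaultdict(set)
    for inst, values in memberships.items():
        for v in values:
            by_value[v].add(inst)

    graph = {inst: defaultdict(float) for inst in memberships}
    for insts in by_value.values():
        for a, b in combinations(sorted(insts), 2):
            graph[a][b] += 1.0
            graph[b][a] += 1.0
    return {inst: dict(nbrs) for inst, nbrs in graph.items()}
```

With the author relation of the motivating example (publications 1 and 2 by John Smith, publication 3 by Dan Miller), this yields a single edge between publications 1 and 2 and leaves publication 3 isolated.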
  • Slide 10
  • Multi-Relational Classification: Simple Relational Methods
      • Probabilistic Relational Neighbor Classifier (EPRN) [Macskassy and Provost 2003]: [formula not preserved], defined in terms of a normalization factor, an edge weight and an iteration index.
      • EPRN2HOP: additionally takes the neighbors of the direct neighbors into account if the direct neighborhood is small.
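Since the EPRN formula did not survive in the slide text, the following is only a sketch of a generic iterative weighted-vote relational neighbor scheme that uses the three ingredients the slide names (normalization factor, edge weight, iteration). The function name `prn_classify`, the uniform initialisation and the fixed iteration count are assumptions and may differ from the authors' definition.

```python
def prn_classify(graph, known, classes, iterations=10):
    """Iterative weighted-vote relational neighbor sketch.

    graph:   {node: {neighbor: weight}}, undirected, every neighbor is a key
    known:   {node: class} for labelled (training) nodes
    classes: list of all class labels
    Returns {node: {class: probability}}.
    """
    uniform = {c: 1.0 / len(classes) for c in classes}
    probs = {}
    for node in graph:
        if node in known:
            probs[node] = {c: 1.0 if c == known[node] else 0.0 for c in classes}
        else:
            probs[node] = dict(uniform)

    for _ in range(iterations):
        updated = {}
        for node, nbrs in graph.items():
            if node in known:
                continue  # labelled nodes keep their class
            # Weighted sum of the neighbors' previous-iteration estimates.
            scores = {c: sum(w * probs[nb][c] for nb, w in nbrs.items())
                      for c in classes}
            z = sum(scores.values())  # normalization factor
            updated[node] = ({c: s / z for c, s in scores.items()}
                             if z > 0 else dict(uniform))
        probs.update(updated)  # simultaneous update of all unlabelled nodes
    return probs
```

Updating all unlabelled nodes from the previous iteration's estimates mirrors the "simultaneously for all instances in the test set" wording of the algorithm slide.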
  • Slide 11
  • Multi-Relational Classification: Aggregation-based Relational Learning Methods
      • Use aggregation functions to propositionalize the set-valued attribute.
      • Use the aggregated values as attributes for traditional machine learning methods.
      • We used Logistic Regression as the classifier.
      [Figure labels: Category 1, Category 2, Category 3, Category 1]
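Propositionalization by aggregation can be sketched as below. The slides do not name the aggregation functions used, so count and proportion are assumed here for illustration, and `aggregate_neighbor_classes` is a hypothetical helper.

```python
from collections import Counter


def aggregate_neighbor_classes(neighbors, labels, classes, mode="count"):
    """Turn the set of neighbor classes into a fixed-length feature vector.

    neighbors: iterable of neighbor ids
    labels:    {node id: class} for nodes whose class is known
    classes:   ordered list of class labels (fixes the feature order)
    mode:      "count" or "proportion" (assumed aggregation functions)
    """
    counts = Counter(labels[n] for n in neighbors if n in labels)
    total = sum(counts.values())
    if mode == "count":
        return [float(counts[c]) for c in classes]
    if mode == "proportion":
        return [counts[c] / total if total else 0.0 for c in classes]
    raise ValueError(f"unknown aggregation mode: {mode}")
```

For an instance whose four neighbors carry Category 1, Category 2, Category 3 and Category 1, the count aggregation gives the vector [2, 1, 1], which could then be passed to a standard classifier such as logistic regression.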
  • Slide 12
  • Ensemble Classification
      • Methods that combine different models; this increases classification accuracy.
      • Usage:
          • Combine the results achieved by relational classification for different relations.
          • Combine the results of relational and local models.
      • Voting
      • Stacking: use a meta-classifier to learn a model on the results of the different models.
          • Build new instances.
          • Apply cross validation.
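The two ensemble strategies can be sketched as follows. The slides do not specify the voting rule, so unweighted averaging of the per-relation probability distributions is an assumption; `average_vote` and `stacking_instance` are hypothetical names.

```python
def average_vote(distributions):
    """Combine per-relation class distributions by unweighted averaging.

    distributions: list of {class: probability} dicts over the same classes.
    Returns (predicted class, averaged distribution).
    """
    classes = list(distributions[0])
    n = len(distributions)
    avg = {c: sum(d[c] for d in distributions) / n for c in classes}
    return max(avg, key=avg.get), avg


def stacking_instance(distributions, classes):
    """Build one meta-level instance for stacking.

    The outputs of the base models are concatenated into a single feature
    vector; a meta-classifier would be trained on such vectors, with the
    base-model outputs obtained via cross validation.
    """
    return [d[c] for d in distributions for c in classes]
```

Usage: with one distribution per relation (e.g. author, reviewer, journal), `average_vote` gives the fused prediction directly, while `stacking_instance` only prepares the input for a separately trained meta-classifier.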
  • Slide 13
  • Evaluation: Data
      • CompuScience data set: 147 571 scientific papers; 77 topics (categories); relations: authors, reviewer, journals.
      • Cora deduplication data set: 1 295 citations; 112 unique publications; relation: samePaper.
      • Cora data set: 3298 papers; 12 categories; relations: conferences, authors, citations.
  • Slide 14
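  • Evaluation: Relation Extraction
      • F1 measure for finding the SamePaper relation on Cora:

          Evaluation set | single linkage | complete linkage | average linkage
          X_tst          | 0.90           | 0.74             | 0.92

        A second row (evaluation set X, subscript not preserved) retains only the values 0.71 and 0.93; their column assignment is not recoverable.
      • Pairwise feature extraction with TFIDF, Levenshtein, Jaccard and Cosine on all attributes.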
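An F1 measure for an extracted equivalence relation can be computed over record pairs, as in the sketch below. Whether the reported figures were computed exactly this way is not stated in the slides; the pairwise definition and the function names are assumptions.

```python
from itertools import combinations


def same_cluster_pairs(clusters):
    """All unordered pairs of items that share a cluster."""
    return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}


def pairwise_f1(predicted, truth):
    """F1 of the predicted same-cluster pairs against the true ones."""
    pred, true = same_cluster_pairs(predicted), same_cluster_pairs(truth)
    tp = len(pred & true)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(true)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting clusters {1, 2, 3} and {4} against true clusters {1, 2} and {3, 4} gives one correct pair out of three predicted and two true pairs, so precision is 1/3, recall is 1/2 and F1 is 0.4.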
  • Slide 15
  • Evaluation: Multi-Relational Classification
      • 3-fold cross validation on CompuScience for the Author, Reviewer and Journal relations.
      • The ensemble of relational and content-based text classification achieved a significantly higher F-measure than the pure text classifier.
  • Slide 16
  • Evaluation: Multi-Relational Classification using automatically extracted relations (50%/50% splits, 10 runs)
  • Slide 17
  • Conclusion and Future Work
      • Summary:
          • Presented a framework for relation extraction and multi-relational classification.
          • Automatic relation extraction with record linkage.
          • Relational classification using each extracted relation for classification and fusing the results with ensemble methods.
      • Future work:
          • Evaluate our framework on different data sets and relations.
          • Evaluate the relational classifiers' quality depending on the quality of the extracted relations.
  • Slide 18
  • Questions? Thank you.
      • www.ismll.uni-hildesheim.de
      • Christine Preisach: preisach@ismll.uni-hildesheim.de
      • Steffen Rendle: srendle@ismll.uni-hildesheim.de
      • Lars Schmidt-Thieme: schmidt-thieme@ismll.uni-hildesheim.de