Classification as a Tool for Research: Proceedings of the 11th IFCS Biennial Conference and 33rd Annual Conference of the Gesellschaft für Klassifikation e.V., Dresden, March 13–18, 2009

Studies in Classification, Data Analysis, and Knowledge Organization

Managing Editors

H.-H. Bock, Aachen
W. Gaul, Karlsruhe
M. Vichi, Rome

Editorial Board

Ph. Arabie, Newark
D. Baier, Cottbus
F. Critchley, Milton Keynes
R. Decker, Bielefeld
E. Diday, Paris
M. Greenacre, Barcelona
C.N. Lauro, Naples
J. Meulman, Leiden
P. Monari, Bologna
S. Nishisato, Toronto
N. Ohsumi, Tokyo
O. Opitz, Augsburg
G. Ritter, Passau
M. Schader, Mannheim
C. Weihs, Dortmund

For further volumes: http://www.springer.com/series/1564


Hermann Locarek-Junge, Claus Weihs

Editors

Classification as a Tool for Research

Proceedings of the 11th IFCS Biennial Conference and 33rd Annual Conference of the Gesellschaft für Klassifikation e.V., Dresden, March 13–18, 2009

Editors

Professor Dr. Hermann Locarek-Junge
Chair Finance and Financial Services
Dresden University of Technology
Helmholtzstr. 10
01062 Dresden
Germany
[email protected]

Professor Dr. Claus Weihs
Chair Computational Statistics
Dortmund University of Technology
Vogelpothsweg 87
44221 Dortmund
Germany
[email protected]

ISSN 1431-8814

ISBN 978-3-642-10744-3
e-ISBN 978-3-642-10745-0
DOI: 10.1007/978-3-642-10745-0

Springer Heidelberg Dordrecht London New York

Library of Congress Control Number: 2010923661

© Springer-Verlag Berlin Heidelberg 2010

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permissions for use must always be obtained from Springer. Violations are liable for prosecution under the German Copyright Law.

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Cover design: WMX Design, Heidelberg

Printed on acid-free paper

Springer is part of Springer Science + Business Media (www.springer.com)

Preface

This volume contains revised selected papers from plenary and invited as well as contributed sessions at the 11th Biennial Conference of the International Federation of Classification Societies (IFCS) in combination with the 33rd Annual Conference of the German Classification Society Gesellschaft für Klassifikation (GfKl), organized by the Faculty of Business Management and Economics at the Technische Universität Dresden in March 2009. The theme of the conference was "Classification as a Tool for Research". The conference encompassed 290 presentations in 100 sessions, including 11 plenary talks and 2 workshops. Moreover, five tutorials took place before the conference. With 357 attendees from 58 countries, the conference provided a very attractive interdisciplinary international forum for discussion and mutual exchange of knowledge.

The chapters in this volume were selected in a second reviewing process after the conference. From the remaining 120 submitted papers, 90 papers were accepted for this volume. In addition to the fundamental methodological areas of Classification and Data Analysis, the volume contains many chapters from a wide range of topics representing typical applications of classification and data analysis methods in Archaeology and Spatial Science, Bio-Sciences, Electronic Data and Web, Finance and Banking, Linguistics, Marketing, Music Science, and Quality Assurance and Engineering.

The editors would like to thank the session organizers for supporting the spread of information about the conference, and for inviting speakers, all reviewers for their timely reports, and Irene Barrios-Kezic and Martina Bihn of Springer-Verlag, Heidelberg, for their support and dedication to the production of this volume.

Moreover, IFCS and GfKl want to thank the Local Organizing Committee, Werner Esswein, Andreas Hilbert, and Hermann Locarek-Junge, for this very well-organized conference. We also thank all our supporters, with special thanks to Thorsten Klug, Sven Loagk, Karoline Schönbrunn, Jens Weller, and the student staff at the conference!

Dresden and Dortmund, November 2009
Hermann Locarek-Junge
Claus Weihs


Scientific Program Committee

Chair

Claus Weihs, University of Dortmund, Germany

Members

David Banks, Duke University, USA
Vladimir Batagelj, University of Ljubljana, Slovenia
Patrice Bertrand, Université Paris-Dauphine, France
Hans-Hermann Bock, RWTH Aachen University, Germany
Paula Brito, University of Porto, Portugal
Joachim Buhmann, ETH Zürich, Switzerland
Andrea Cerioli, University of Parma, Italy
Eva Ceulemans, Katholieke Universiteit Leuven, Belgium
Reinhold Decker, Universität Bielefeld, Germany
Werner Esswein, TU Dresden, Germany
Bernard Fichet, LIF Marseille, France
Eugeniusz Gatnar, AE Katowice, Poland
David Hand, Imperial College London, UK
Christian Hennig, UCL, UK
Andreas Hilbert, TU Dresden, Germany
Tadashi Imaizumi, Tama University Tokyo, Japan
Krzysztof Jajuga, Wroclaw University of Economics, Poland
Taerim Lee, Korea National Open University, Korea
Hermann Locarek-Junge, TU Dresden, Germany
Buck McMorris, Illinois Institute of Technology, USA
Fionn Murtagh, University of London, UK
Akinori Okada, Tama University Tokyo, Japan
Marieke Timmerman, Rijks Universiteit Groningen, Netherlands
Maurizio Vichi, Università di Roma La Sapienza, Italy

Reviewers (in alphabetical order)

Daniel Baier, Martin Behnisch, Axel Benner, Lynne Billard, Anne-Laure Boulesteix, Alexander Brenning, Hans Burkhardt, Wolfgang Gaul, Patrick J.F. Groenen, Georges Hebrail, Irmela Herzog, Tadashi Imaizumi, Krzysztof Jajuga, Sabine Krolak-Schwerdt, Berthold Lausen, Taerim Lee, Uwe Ligges, Hermann Locarek-Junge, Jorn Mehnen, Andreas Nuernberger, Jozef Pociecha, Axel G. Posluschny, Gunter Ritter, Alex Rogers, Jürgen Rolshoven, Lars Schmidt-Thieme, Wilfried Seidel, Susanne Strahringer, Heike Trautmann, Alfred Ultsch, Rosanna Verde, Claus Weihs


1 Program Sessions and Chairs

1.1 Plenary Sessions

Taylan Cemgil: Hierarchical Bayesian Models for Audio and Music Processing (Chair: Prof. Claus Weihs)

Sanjoy Dasgupta: Performance Guarantees for Hierarchical Clustering (Chair: Prof. Joachim M. Buhmann)

Josef Kittler: Data Quality Dependent Decision Making in Pattern Classification (Chair: Prof. Fionn Murtagh)

Lars Schmidt-Thieme: Object Identification (Chair: Prof. Andreas Geyer-Schulz)

Alfred Ultsch: Benchmarking Methods for the Identification of Differentially Expressed Genes (Chair: Dr. Berthold Lausen)

Vincenzo Vinzi: PLS Path Modelling and PLS Regression (Chair: Prof. Yoshio Takane)

1.2 President's Invited Session (Chair: Prof. F.R. McMorris)

Alain Guenoche: String Distances for Complete Genome Phylogeny

Boris Mirkin: Clustering Proteins and Tree-Mapping Evolutionary Events

William Shannon, Elena Deych, Robert Culverhouse: Microarray Dimension Reduction Based on Maximizing Mantel Correlation Coefficients Using a Genetic Algorithm Search Strategy

1.3 Workshops

Sensor Networks (Chairs: Dr. Dimitris Tasoulis, Dr. Niall Adams) (Organizers: Dr. Dimitris K. Tasoulis, Dr. Niall M. Adams, Dr. Alex Rogers)

Bibliothekarischer Workshop (Chairs: Dr. Hans-Joachim Hermes, Dr. Bernd Lorenz) (Organizers: Dr. Hans-Joachim Hermes, Dr. Bernd Lorenz)

1.4 Invited Sessions

Business Informatics (Chairs: Prof. Werner Esswein, Prof. Susanne Strahringer)
Classification Approaches for Symbolic Data (Chair: Prof. Rosanna Verde)
Clustering and Classification (Chair: Prof. Bernard J.E. Fichet)
Clustering in Networks (Chair: Prof. Vladimir Batagelj)


Clustering in Reduced Space (Chairs: Prof. Eva Ceulemans, Dr. Marieke E. Timmerman)
Correspondence Analysis and Related Methods (Chair: Prof. Patrick J.F. Groenen)
Data Stream Mining (Chair: Prof. Georges Hebrail)
Graph-Theoretical Methods for Clustering (Chair: Prof. Hans-Hermann Bock)
Information Extraction and Retrieval (Chair: Prof. Lars Schmidt-Thieme)
Modelling Genome Wide Data in Clinical Research I (Chair: Dr. Berthold Lausen)
Modelling Genome Wide Data in Clinical Research II (Chair: Prof. Katja Ickstadt)
Model-Based Clustering Methods I (Chair: Dr. Christian Hennig)
Multicriteria Optimization I (Chair: Dr. Heike Trautmann)
Non-Standard Data (Chair: Lynne Billard)
Spatial Classification I (Chair: Prof. Alexander Brenning)
Two-Way Clustering and Applications (Chair: Prof. Hans-Hermann Bock)

1.5 Contributed Sessions

Clustering and Classification

Applied Clustering Methods (Chair: Dr. Andrzej Dudek)
Clustering for Similarity Data (Chair: Prof. Erhard Godehardt)
Clustering: Bias and Stability (Chair: Dr. Christian Hennig)
Comparison/Dynamics in Clustering (Chair: Simona Balbi)
Hierarchical Clustering (Chair: Prof. Maria Paula Brito)
Multiway/Reduced Space Clustering (Chair: Dr. Patrice Bertrand)
Discrimination I (Chair: Prof. Guy Cucumel)
Discrimination II (Chair: Prof. Ulrich Müller-Funk)
Model-Based Clustering Methods II (Chair: Prof. Andrzej Sokolowski)
New Clustering Strategies I (Chair: Dr. Jan W. Owsinski)
New Clustering Strategies II (Chair: Prof. Immanuel M. Bomze)
Selection and Clustering of Variables (Chair: Prof. Jozef Dziechciarz)

Data Analysis Methods

Correspondence Analysis and Related Methods I (Chair: Prof. Jörg Blasius)
Correspondence Analysis and Related Methods II (Chair: Prof. Michael J. Greenacre)
Correspondence Analysis and Related Methods III (Chair: Prof. John Gower)
Data Analysis Software (Chair: Prof. Uwe Ligges)


Data Cleaning and Pre-Processing/Ensemble Methods (Chair: Prof. Eugeniusz Gatnar)
Exploratory Data Analysis I (Chair: Prof. Andrea Cerioli)
Exploratory Data Analysis II (Visualization) (Chair: Prof. Anthony C. Atkinson)
Exploratory Data Analysis III (Multivariate) (Chair: Prof. Vincenzo Esposito Vinzi)
Exploratory Data Analysis IV (Chair: Prof. Alfred Ultsch)
Large and Complex Data I (Chair: Prof. Maurizio Vichi)
Large and Complex Data II (Chair: Prof. Tadashi Imaizumi)
Mixture Analysis - Mixture Models in Genetics (Chair: Prof. Wilfried Seidel)
Mixture Analysis - Mixture Estimation and Model Selection (Chair: Prof. Angela Montanari)
Non-Gaussian Mixtures (Chair: Prof. Wilfried Seidel)
Non-Standard Data I (Chair: Lynne Billard)
Non-Standard Data II (Chair: Lynne Billard)
Non-Standard Data III (Chair: Lynne Billard)
Pattern Recognition and Machine Learning/Data Analysis (Chair: Prof. Joachim M. Buhmann)
Regression Mixture Models (Chair: Prof. Angela Montanari)
Visualization of Asymmetry (Chair: Dr. Akinori Okada)
Visualization of Symbolic Data (Chair: Prof. Tadashi Imaizumi)
Visualization I (Chair: Prof. Patrick J.F. Groenen)
Visualization II (Chair: Prof. Patrick J.F. Groenen)

Archaeology and Spatial Science

Archaeology and Historical Geography I (Chairs: Irmela Herzog, Dr. Tim Kerig)
Archaeology and Historical Geography II (Chairs: Irmela Herzog, Dr. Tim Kerig)
Spatial Classification II (Chair: Prof. Alexander Brenning)
Spatial Planning (Chair: Dr. Martin Behnisch)

Bio-Sciences

Biostatistics and Bioinformatics - Mult. Tests/Pred. with Genomics Data (Chair: Prof. Geoffrey J. McLachlan)
Highdimensional Genomics I (Chair: Axel Benner)
Highdimensional Genomics II (Chair: Prof. Iven Van Mechelen)
Medical Health I (Chair: Dr. Berthold Lausen)
Medical Health II (Chair: Prof. Taerim Lee)


Pre-Clinical Development and Biostatistics (Chair: Axel Benner)
SNPs and Genome Analysis (Chair: Prof. Gunter Ritter)

Finance and Banking

Banking and Finance I (Chair: Prof. Ursula Walther)
Banking and Finance II (Chair: Prof. Hermann Locarek-Junge)
Banking and Finance III (Chair: Prof. Matija Mayer-Fiedrich)
Banking and Finance IV (Chair: Prof. Alfred Ultsch)

Linguistics and Text Mining

Linguistics I (Chair: Prof. Jürgen Rolshoven)
Text Mining Classification (Chair: Prof. Andreas Nuernberger)
Text Mining II (Chair: Prof. Andreas Nuernberger)

Marketing

Marketing and Management Science II (Chair: Prof. Daniel Baier)
Marketing and Management Science III (Chair: Prof. Winfried Steiner)
Marketing and Management Science IV (Chair: Prof. Daniel Baier)
Marketing and Management Science V (Chair: Prof. Winfried Steiner)
Marketing and Management Science VI (Chair: Prof. Reinhold Decker)
Retailing/Direktmarketing (Chair: Prof. Reinhold Decker)

Music Science

Statistical Musicology I (Chair: Prof. Claus Weihs)
Statistical Musicology II (Chair: Prof. Claus Weihs)

Quality Assurance and Engineering

Multicriteria Optimization II (Chair: Dr. Heike Trautmann)
Production Engineering (Chair: Dr. Jorn Mehnen)


Social Sciences

Psychology and Education (Chair: Prof. Sabine Krolak-Schwerdt)
Social Sciences I (Chair: Dr. Akinori Okada)
Social Sciences II (Chair: Prof. Eugeniusz Gatnar)

Web Mining

Web Mining I (Chair: Prof. W. Gaul)
Web Mining II (Chair: Prof. W. Gaul)

Contents

Part I (Semi-) Plenary Presentations

Hierarchical Clustering with Performance Guarantees
Sanjoy Dasgupta

Alignment Free String Distances for Phylogeny
Frédéric Guyon and Alain Guénoche

Data Quality Dependent Decision Making in Pattern Classification
Josef Kittler and Norman Poh

Clustering Proteins and Reconstructing Evolutionary Events
Boris Mirkin

Microarray Dimension Reduction Based on Maximizing Mantel Correlation Coefficients Using a Genetic Algorithm Search Strategy
Elena Deych, Robert Culverhouse, and William D. Shannon

Part II Classification and Data Analysis

Classification

Multiparameter Hierarchical Clustering Methods
Gunnar Carlsson and Facundo Mémoli

Unsupervised Sparsification of Similarity Graphs
Tim Gollub and Benno Stein

Simultaneous Clustering and Dimensionality Reduction Using Variational Bayesian Mixture Model
Kazuho Watanabe, Shotaro Akaho, Shinichiro Omachi, and Masato Okada


A Partitioning Method for the Clustering of Categorical Variables
Marie Chavent, Vanessa Kuentz, and Jérôme Saracco

Treed Gaussian Process Models for Classification
Tamara Broderick and Robert B. Gramacy

Ridgeline Plot and Clusterwise Stability as Tools for Merging Gaussian Mixture Components
Christian Hennig

Clustering with Confidence: A Low-Dimensional Binning Approach
Rebecca Nugent and Werner Stuetzle

Local Classification of Discrete Variables by Latent Class Models
Michael Bücker, Gero Szepannek, and Claus Weihs

A Comparative Study on Discrete Discriminant Analysis through a Hierarchical Coupling Approach
Ana Sousa Ferreira

A Comparative Study of Several Parametric and Semiparametric Approaches for Time Series Classification
Sonia Pértega Díaz and José A. Vilar

Finite Dimensional Representation of Functional Data with Applications
Alberto Muñoz and Javier González

Clustering Spatio-Functional Data: A Model Based Approach
Elvira Romano, Antonio Balzanella, and Rosanna Verde

Use of Mixture Models in Multiple Hypothesis Testing with Applications in Bioinformatics
Geoffrey J. McLachlan and Leesa Wockner

Finding Groups in Ordinal Data: An Examination of Some Clustering Procedures
Marek Walesiak and Andrzej Dudek

An Application of One-mode Three-way Overlapping Cluster Analysis
Satoru Yokoyama, Atsuho Nakayama, and Akinori Okada


Evaluation of Clustering Results: The Trade-off Bias-Variability
Margarida G.M.S. Cardoso, Katti Faceli, and André C.P.L.F. de Carvalho

Cluster Structured Multivariate Probability Distribution with Uniform Marginals
Andrzej Sokolowski and Sabina Denkowska

Analysis of Diversity-Accuracy Relations in Cluster Ensemble
Dorota Rozmus

Linear Discriminant Analysis with more Variables than Observations: A not so Naive Approach
A. Pedro Duarte Silva

Fast Hierarchical Clustering from the Baire Distance
Pedro Contreras and Fionn Murtagh

Data Analysis

The Trend Vector Model: Identification and Estimation in SAS
Mark de Rooij and Hsiu-Ting Yu

Discrete Beta-Type Models
Antonio Punzo

The R Package DAKS: Basic Functions and Complex Algorithms in Knowledge Space Theory
Anatol Sargin and Ali Ünlü

Methods for the Analysis of Skew-Symmetry in Asymmetric Multidimensional Scaling
Giuseppe Bove

Canonical Correspondence Analysis in Social Science Research
Michael Greenacre

Exploring Data Through Archetypes
Maria Rosaria D'Esposito, Giancarlo Ragozini, and Domenico Vistocco

Exploring Sensitive Topics: Sensitivity, Jeopardy, and Cheating
Claudia Becker


Sampling the Join of Streams
Raphaël Féraud, Fabrice Clérot, and Pascal Gouzien

The R Package fechner for Fechnerian Scaling
Thomas Kiefer, Ali Ünlü, and Ehtibar N. Dzhafarov

Asymptotic Behaviour in Symbolic Markov Chains
Monique Noirhomme-Fraiture

An Interactive Graphical System for Visualizing Data Quality - Tableplot Graphics
Waqas Ahmed Malik, Antony Unwin, and Alexander Gribov

Symbolic Multidimensional Scaling Versus Noisy Variables and Outliers
Marcin Pełka

Principal Components Analysis for Trapezoidal Fuzzy Numbers
Alexia Pacheco and Oldemar Rodríguez

Factor Selection in Observational Studies - An Application of Nonlinear Factor Selection to Propensity Scores
Stephan Dlugosz

Nonlinear Mapping Using a Hybrid of PARAMAP and Isomap Approaches
Ulas Akkucuk and J. Douglas Carroll

Dimensionality Reduction Techniques for Streaming Time Series: A New Symbolic Approach
Antonio Balzanella, Antonio Irpino, and Rosanna Verde

A Bayesian Semiparametric Generalized Linear Model with Random Effects Using Dirichlet Process Priors
Kei Miyazaki and Kazuo Shigemasu

Exact Confidence Intervals for Odds Ratios with Algebraic Statistics
Anne Krampe and Sonja Kuhnt

The CHIC Analysis Software v1.0
Angelos Markos, George Menexes, and Iannis Papadimitriou


Part III Applications

Archaeology and Spatial Planning

Clustering the Roman Heaven: Uncovering the Religious Structures in the Roman Province Germania Superior
Tudor Ionescu and Leif Scheuermann

Geochemical and Statistical Investigation of Roman Stamped Tiles of the Legio XXI Rapax
Hans-Georg Bartel, Hans-Joachim Mucha, and Jens Dolata

Land Cover Classification by Multisource Remote Sensing: Comparing Classifiers for Spatial Data
Alexander Brenning

Are there Cluster of Communities with the Same Dynamic Behaviour?
Martin Behnisch and Alfred Ultsch

Land Cover Detection with Unsupervised Clustering and Hierarchical Partitioning
Laura Poggio and Pierre Soille

Using Advanced Regression Models for Determining Optimal Soil Heterogeneity Indicators
Georg Ruß, Rudolf Kruse, Martin Schneider, and Peter Wagner

Bio-Sciences

Local Analysis of SNP Data
Tina Müller, Julia Schiffner, Holger Schwender, Gero Szepannek, Claus Weihs, and Katja Ickstadt

Airborne Particulate Matter and Adverse Health Events: Robust Estimation of Timescale Effects
Massimo Bilancia and Francesco Campobasso

Identification of Specific Genomic Regions Responsible for the Invasivity of Neisseria Meningitidis
Dunarel Badescu, Abdoulaye Baniré Diallo, and Vladimir Makarenkov


Classification of ABC Transporters Using Community Detection
Claire Gaugain, Roland Barriot, Gwennaele Fichant, and Yves Quentin

Estimation of the Number of Sustained Viral Responders by Interferon Therapy Using Random Numbers with a Logistic Model
Shinobu Tatsunami, Takahiko Ueno, Rie Kuwabara, Junichi Mimaya, Akira Shirahata, and Masashi Taki

Virtual High Throughput Screening Using Machine Learning Methods
Cherif Mballo and Vladimir Makarenkov

Electronic Data and Web

Network Analysis of Works on Clustering and Classification from Web of Science
Nataša Kejžar, Simona Korenjak Cerne, and Vladimir Batagelj

Recommending in Social Tagging Systems Based on Kernelized Multiway Analysis
Alexandros Nanopoulos and Artus Krohn-Grimberghe

Dynamic Population Segmentation in Online Market Monitoring
Norbert Walchhofer, Karl A. Froeschl, Milan Hronsky, and Kurt Hornik

Gaining Consumer Insights from Influential Actors in Weblog Networks
Martin Klaus and Ralf Wagner

Visualising a Text with a Tree Cloud
Philippe Gambette and Jean Véronis

A Tree Kernel Based on Classification and Citation Data to Analyse Patent Documents
Markus Arndt and Ulrich Arndt

A New SNA Centrality Measure Quantifying the Distance to the Nearest Center
Angela Bohn, Stefan Theußl, Ingo Feinerer, Kurt Hornik, Patrick Mair, and Norbert Walchhofer


Mining Innovative Ideas to Support New Product Research and Development
Dirk Thorleuchter, Dirk Van den Poel, and Anita Prinzie

Finance and Banking

The Basis of Credit Scoring: On the Definition of Credit Default Events
Alexandra Schwarz and Gerhard Arminger

Forecasting Candlesticks Time Series with Locally Weighted Learning Methods
Javier Arroyo

An Analysis of Alternative Methods for Measuring Long-Run Performance: An Application to Share Repurchase Announcements
Wolfgang Bessler, Julian Holler, and Martin Seim

Knowledge Discovery in Stock Market Data
Alfred Ultsch and Hermann Locarek-Junge

The Asia Financial Crises and Exchange Rates: Had there been Volatility Shifts for Asian Currencies?
Takashi Oga and Wolfgang Polasek

The Pricing of Risky Securities in a Fuzzy Least Square Regression Model
Francesco Campobasso, Annarita Fanizzi, and Massimo Bilancia

Linguistics

Classification of the Indo-European Languages Using a Phylogenetic Network Approach
Alix Boc, Anna Maria Di Sciullo, and Vladimir Makarenkov

Parsing as Classification
Lidia Khmylko and Wolfgang Menzel

Comparing the Stability of Clustering Results of Dialect Data Based on Several Distance Matrices
Edgar Haimerl and Hans-Joachim Mucha


Marketing

Marketing and Regional Sales: Evaluation of Expenditure Strategies by Spatial Sales Response Functions
Daniel Baier and Wolfgang Polasek

A Demand Learning Data Based Approach to Optimize Revenues of a Retail Chain
Wolfgang Gaul and Abdolhadi Darzian Azizi

Missing Values and the Consistency Problem Concerning AHP Data
Wolfgang Gaul and Dominic Gastes

Monte Carlo Methods in the Assessment of New Products: A Comparison of Different Approaches
Said Esber and Daniel Baier

Preference Analysis and Product Design in Markets for Elderly People: A Comparison of Methods and Approaches
Samah Abu-Assab, Daniel Baier, and Mirko Kühne

Usefulness of A Priori Information about Customers for Market Research: An Analysis for Personalisation Aspects in Retailing
Michael Brusch and Eva Stüber

Importance of Consumer Preferences on the Diffusion of Complex Products and Systems
Sabine Schmidt and Magdalena Missler-Behr

Household Possession of Consumer Durables on Background of some Poverty Lines
Józef Dziechciarz, Marta Dziechciarz, and Klaudia Przybysz

Effect of Consumer Perceptions of Web Site Brand Personality and Web Site Brand Association on Web Site Brand Image
Sandra Loureiro and Silvina Santana

Music Science

Perceptually Based Phoneme Recognition in Popular Music
Gero Szepannek, Matthias Gruhne, Bernd Bischl, Sebastian Krey, Tamas Harczos, Frank Klefenz, Christian Dittmar, and Claus Weihs


SVM Based Instrument and Timbre Classification
Sebastian Krey and Uwe Ligges

Three-way Scaling and Clustering Approach to Musical Structural Analysis
Mitsuhiro Tsuji, Toshio Shimokawa, and Akinori Okada

Improving GMM Classifiers by Preliminary One-class SVM Outlier Detection: Application to Automatic Music Mood Estimation
Hanna Lukashevich and Christian Dittmar

Quality Assurance and Engineering

Multiobjective Optimization for Decision Support in Automated 2.5D System-in-Package Electronics Design
Martin Berger, Michael Schröder, and Karl-Heinz Küfer

Multi-Objective Quality Assessment for EA Parameter Tuning
Heike Trautmann, Boris Naujoks, and Mike Preuss

A Novel Multi-Objective Target Value Optimization Approach
S. Wenzel, S. Straatmann, L. Kwiatkowski, P. Schmelzer, and J. Kunert

Desirability-Based Multi-Criteria Optimisation of HVOF Spray Experiments
Gerd Kopp, Ingor Baumann, Evelina Vogli, Wolfgang Tillmann, and Claus Weihs

Index

Contributors

Samah Abu-Assab Chair of Marketing and Innovation Management, Brandenburg University of Technology Cottbus, Postbox 101344, 03013 Cottbus, Germany, [email protected]

Shotaro Akaho The National Institute of Advanced Industrial Science and Technology, 1-1-1, Umezono, Tsukuba, 305-8568, Japan, [email protected]

Ulas Akkucuk Department of Management, Bogazici University, Istanbul, Turkey, [email protected]

Gerhard Arminger Schumpeter School of Business and Economics, University of Wuppertal, Gaußstr. 20, 42097 Wuppertal, Germany, [email protected]

Markus Arndt European Patent Office, Erhardt Street 27, 80649 Munich, Germany, [email protected]

Ulrich Arndt data2knowledge GmbH, Wilhelm-Umbach-Street 12, 63225 Langen, Germany, [email protected]

Javier Arroyo Dpto. de Ingeniería del Software e Inteligencia Artificial, Universidad Complutense de Madrid, Prof. José García Santesmases s/n, 28040 Madrid, Spain, [email protected]

Abdolhadi Darzian Azizi Institut für Entscheidungstheorie und Unternehmensforschung, Karlsruhe University, Karlsruhe, Germany, [email protected]

Dunarel Badescu Département d'informatique, Université du Québec à Montréal, C.P. 8888, Succursale Centre-Ville, Montréal QC, Canada H3C 3P8, [email protected]

Daniel Baier Chair of Marketing and Innovation Management, Brandenburg University of Technology Cottbus, Postbox 101344, 03013 Cottbus, Germany, [email protected]

Antonio Balzanella Facoltà di Economia, Dipartimento di Matematica e Statistica, Università degli Studi di Napoli Federico II, Via Cinthia, 80138 Napoli, [email protected]


Roland Barriot Université de Toulouse, UPS, Laboratoire de Microbiologie et Génétique Moléculaires, 31000 Toulouse, France, [email protected]

and

Université Paul Sabatier, CNRS, LMGM, Bat. IBCG, 118, route de Narbonne, 31062 Toulouse cedex 9, France, [email protected]

Hans-Georg Bartel Institute for Chemistry, Humboldt University Berlin, Brook-Taylor-Straße 2, 12489 Berlin, Germany, [email protected]

Vladimir Batagelj Faculty of Mathematics and Physics, University of Ljubljana, Jadranska 19, 1000 Ljubljana, Slovenia, [email protected]

Ingor Baumann Lehrstuhl für Werkstofftechnologie, Fakultät Maschinenbau, TU Dortmund, Leonhard-Euler Str. 2, 44227 Dortmund, Germany, [email protected]

Claudia Becker School of Law, Economics, and Business, Martin-Luther-University Halle-Wittenberg, 06099 Halle, Germany, [email protected]

Martin Behnisch Institute of Historic Building Research and Conservation, ETH Hoenggerberg, HIT H 21.3, 8093 Zurich, Switzerland, [email protected]

Martin Berger Fraunhofer Institute for Industrial Mathematics (ITWM), Fraunhofer-Platz 1, 67663 Kaiserslautern, Germany, [email protected]

Wolfgang Bessler Center for Finance and Banking, Justus-Liebig University, Licher Strasse 74, 35394 Giessen, Germany, [email protected]

Massimo Bilancia Dipartimento di Scienze Statistiche Carlo Cecchi, Università degli Studi di Bari, Bari, Italy, [email protected]

Bernd Bischl Faculty of Statistics, Dortmund University of Technology, 44221 Dortmund, Germany, [email protected]

Alix Boc Université du Québec à Montréal, Case postale 8888, succursale Centre-ville, Montréal, QC, Canada H3C 3P8, [email protected]

Angela Bohn Wirtschaftsuniversität Wien, 1090 Wien, Austria, [email protected]

Giuseppe Bove Dipartimento di Scienze dell'Educazione, Università degli Studi Roma Tre, Rome, Italy, [email protected]

Alexander Brenning Department of Geography and Environmental Management, University of Waterloo, 200 University Ave. W., Waterloo, ON, Canada N2L 3G1, [email protected]

Tamara Broderick Statistical Laboratory, University of Cambridge, Cambridge, UK, [email protected]


Michael Brusch Institute of Business Administration and Economics, Brandenburg University of Technology Cottbus, Postbox 101344, 03013 Cottbus, Germany, [email protected]

Michael Bücker Fakultät Statistik, Technische Universität Dortmund, 44221 Dortmund, Germany, [email protected]

Francesco Campobasso Dipartimento di Scienze Statistiche Carlo Cecchi, Università degli Studi di Bari, Bari, Italy, [email protected]

Margarida G.M.S. Cardoso Department of Quantitative Methods, ISCTE Business School, Av. das Forças Armadas 1649-026, Lisboa, Portugal, [email protected]

Gunnar Carlsson Mathematics Department, Stanford University, Stanford, CA, USA, [email protected]

J. Douglas Carroll Rutgers Business School, Newark and New Brunswick, NJ, USA, [email protected]

Simona Korenjak Cerne Faculty of Economics, University of Ljubljana, Kardeljeva pl. 17, 1000 Ljubljana, Slovenia, [email protected]

Marie Chavent Université de Bordeaux, IMB, CNRS, UMR 5251, France

and

INRIA Bordeaux Sud-Ouest, CQFD team, France, [email protected]

Fabrice Clérot Orange Labs, [email protected]

Pedro Contreras Department of Computer Science, Royal Holloway, University of London, 57 Egham Hill, Egham TW20 OEX, England, [email protected]

Robert Culverhouse Washington University School of Medicine, St. Louis, MO, USA, [email protected]

Sanjoy Dasgupta University of California, San Diego, CA, USA, [email protected]

André C.P.L.F. de Carvalho Department of Computer Science, ICMC, University of São Paulo, Av. Trabalhador Sãocarlense, 400, CEP 13560-970, São Carlos, SP, Brazil, [email protected]

Sabina Denkowska Department of Statistics, Cracow University of Economics, Cracow, Poland, [email protected]

Mark de Rooij Leiden University Institute for Psychological Research, Leiden, The Netherlands, [email protected]

Maria Rosaria D'Esposito Department of Economics and Statistics, University of Salerno, Salerno, Italy, [email protected]

Elena Deych Washington University School of Medicine, St. Louis, MO, USA, [email protected]


Abdoulaye Baniré Diallo Département d'informatique, Université du Québec à Montréal, C.P. 8888, Succursale Centre-Ville, Montréal QC, Canada H3C 3P8, [email protected]

and

McGill Centre for Bioinformatics and School of Computer Science, McGill University, 3775 University Street, Montréal QC, Canada H3A 2B4

Sonia Pértega Díaz Unidad de Epidemiología Clínica y Bioestadística, Hospital de A Coruña, As Xubias, 84, 15006 A Coruña, Spain, [email protected]

Anna Maria Di Sciullo Université du Québec à Montréal, Case postale 8888, succursale Centre-ville, Montréal, QC, Canada H3C 3P8, [email protected]

Christian Dittmar Fraunhofer Institute of Digital Media Technology (IDMT), Ehrenbergstr. 31, 98693 Ilmenau, Germany, [email protected]

Stephan Dlugosz ZEW Centre for European Economic Research, Mannheim, Germany, [email protected]

Jens Dolata Head Office for Cultural Heritage Rhineland-Palatinate (GDKE), Große Langgasse 29, 55116 Mainz, Germany, [email protected]

Andrzej Dudek Wrocław University of Economics, Nowowiejska 3, 58-500 Jelenia Góra, Poland, [email protected]

Ehtibar N. Dzhafarov Department of Psychological Sciences, Purdue University, West Lafayette, IN, USA, [email protected]

Józef Dziechciarz University of Economics, Wrocław, Poland, [email protected]

Marta Dziechciarz University of Economics, Wrocław, Poland, [email protected]

Said Esber Chair of Marketing and Innovation Management, Brandenburg University of Technology Cottbus, Postbox 101344, 03013 Cottbus, Germany, [email protected]

Katti Faceli Federal University of São Carlos, Campus Sorocaba, Rodovia João Leme dos Santos, Km 110, 18052-780, Sorocaba, SP, Brazil, [email protected]

Annarita Fanizzi Dipartimento di Scienze Statistiche Carlo Cecchi, Università degli Studi di Bari, Bari, Italy, [email protected]

Ingo Feinerer Technische Universität Wien, 1040 Wien, Austria, [email protected]

Raphaël Féraud Orange Labs, 2 avenue Pierre Marzin, 22300 Lannion, France, [email protected]


Ana Sousa Ferreira LEAD, FPCE, University of Lisbon, Alameda da Universidade, 1649-013 Lisboa, Portugal

and

CEAUL, Multivariate Data Analysis and Modelling Project, Lisboa, Portugal, [email protected]

Gwennaele Fichant Université de Toulouse, UPS, Laboratoire de Microbiologie et Génétique Moléculaires, 31000 Toulouse, France, [email protected]

and


Karl A. Froeschl University of Vienna, Dr.-Karl-Lueger-Ring 1, 1010 Vienna, Austria, [email protected]

Philippe Gambette L.I.R.M.M., UMR CNRS 5506, Université Montpellier 2, Montpellier, France, [email protected]

Dominic Gastes Institute of Decision Theory and Operations Research, University of Karlsruhe, Kaiserstrasse 12, 76131 Karlsruhe, Germany, [email protected]

Claire Gaugain Université de Toulouse, UPS, Laboratoire de Microbiologie et Génétique Moléculaires, 31000 Toulouse, France, [email protected]

and


Wolfgang Gaul Institute of Decision Theory and Operations Research, University of Karlsruhe, Kaiserstrasse 12, 76131 Karlsruhe, Germany, [email protected]

Tim Gollub Faculty of Media, Media Systems, Bauhaus-Universität Weimar, Weimar, Germany, [email protected]

Javier González Universidad Carlos III de Madrid, c/Madrid 126, 28903 Getafe, Spain, [email protected]

Pascal Gouzien Orange Labs, [email protected]

Robert B. Gramacy Statistical Laboratory, University of Cambridge, Cambridge, UK, [email protected]

Michael Greenacre Universitat Pompeu Fabra, Ramon Trias Fargas 25-27, 08005 Barcelona, Spain, [email protected]

Alexander Gribov Department of Computer Oriented Statistics and Data Analysis, Institute of Mathematics, University of Augsburg, Augsburg, Germany, [email protected]


Matthias Gruhne Fraunhofer Institute of Digital Media Technology (IDMT), Ilmenau, Germany, [email protected]

Alain Guénoche IML, CNRS, 163 Avenue de Luminy, Marseille, France, [email protected]

Frédéric Guyon MTI, INSERM-Université Denis Diderot, 36 rue Hélène Brion, Paris, France, [email protected]

Edgar Haimerl Institut für Romanistik, Universität Salzburg, Akademiestraße 24, 5020 Salzburg, Austria, [email protected]

Tamas Harczos Fraunhofer Institute of Digital Media Technology (IDMT), Ilmenau, Germany, [email protected]

Christian Hennig Department of Statistical Science, UCL, Gower Street, London WC1E 6BT, UK, [email protected]

Julian Holler Center for Finance and Banking, Justus-Liebig University, Licher Strasse 74, 35394 Giessen, Germany, [email protected]

Kurt Hornik Vienna University of Economics and Business, Augasse 2-6, 1090 Vienna, Austria, [email protected]

Milan Hronsky EC3 E-Commerce Competence Center, Vorlaufstrasse 5/6, 1010 Vienna, Austria, [email protected]

Katja Ickstadt Faculty of Statistics, TU Dortmund and SFB 475, Dortmund, Germany, [email protected]

Tudor Ionescu IKE, Universität Stuttgart, Stuttgart, Germany, [email protected]

Antonio Irpino Department of European and Mediterranean Studies, Second University of Naples, Via del Setificio 15, 81100, Caserta, Italy, [email protected]

Nataša Kejžar Faculty of Social Sciences, University of Ljubljana, Kardeljeva pl. 5, 1000 Ljubljana, Slovenia, [email protected]

Lidia Khmylko Natural Language Systems Group, University of Hamburg, Vogt-Kölln-Straße 30, 22527 Hamburg, Germany, [email protected]

Thomas Kiefer Technische Universität Dortmund, Fakultät Statistik, D-44221 Dortmund, Germany, [email protected]

Josef Kittler Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford GU2 7XH, UK, J/[email protected]

Martin Klaus SVI Endowed Chair for International Direct Marketing, DMCC Dialog Marketing Competence Center, University of Kassel, Kassel, Germany, [email protected]


Frank Klefenz Fraunhofer Institute of Digital Media Technology (IDMT), Ilmenau, Germany, [email protected]

Gerd Kopp Lehrstuhl Computergestützte Statistik, Fakultät Statistik, TU Dortmund, Vogelpothsweg 87, 44227 Dortmund, Germany, [email protected]

Anne Krampe Faculty of Statistics, TU Dortmund University, Dortmund, Germany, [email protected]

Sebastian Krey Faculty of Statistics, Dortmund University of Technology, 44221 Dortmund, Germany, [email protected]

Artus Krohn-Grimberghe Institute of Computer Science, Information Systems and Machine Learning Lab, University of Hildesheim, Germany, [email protected]

Rudolf Kruse Otto-von-Guericke-Universität Magdeburg, Magdeburg, Germany, [email protected]

Vanessa Kuentz Université de Bordeaux, IMB, CNRS, UMR 5251, France

and

INRIA Bordeaux Sud-Ouest, CQFD team, France, [email protected]

Karl-Heinz Küfer Fraunhofer Institute for Industrial Mathematics (ITWM), Fraunhofer-Platz 1, 67663 Kaiserslautern, Germany, [email protected]

Mirko Kühne Chair of Marketing and Innovation Management, Brandenburg University of Technology Cottbus, Postbox 101344, 03013 Cottbus, Germany, [email protected]

Sonja Kuhnt Faculty of Statistics, TU Dortmund University, Dortmund, Germany, [email protected]

J. Kunert Department of Statistics, Technische Universität Dortmund, Dortmund, Germany, [email protected]

Rie Kuwabara Department of Pediatrics of Yokohama Seibu Hospital, Collaboration of Unit of Medical Statistics, Institute of Radioisotope Research, St. Marianna University School of Medicine, Kawasaki 216-8511, Japan

L. Kwiatkowski Department of Mechanical Engineering, Technische Universität Dortmund, Dortmund, Germany, [email protected]

Uwe Ligges Fakultät Statistik, Technische Universität Dortmund, 44221 Dortmund, Germany, [email protected]

Hermann Locarek-Junge Lehrstuhl für Finanzwirtschaft und Finanzdienstleistungen, Technische Universität Dresden, Dresden, Germany, [email protected]

Sandra Loureiro University of Aveiro, Campus de Santiago, 3810-193 Aveiro, Portugal, [email protected]


Hanna Lukashevich Fraunhofer IDMT, Ehrenbergstr. 31, 98693 Ilmenau, Germany, [email protected]

Patrick Mair Wirtschaftsuniversität Wien, 1090 Wien, Austria, [email protected]

Vladimir Makarenkov Laboratoire de bioinformatique, Département d'informatique, Université du Québec à Montréal, C.P. 8888, Succursale Centre-Ville, Montréal QC, Canada H3C 3P8, [email protected]

Waqas Ahmed Malik Department of Computer Oriented Statistics and Data Analysis, Institute of Mathematics, University of Augsburg, Augsburg, Germany, [email protected]

Angelos Markos Department of Applied Informatics, University of Macedonia, Macedonia, Greece, [email protected]

Cherif Mballo Laboratoire de bioinformatique, Departement d'informatique, UQAM, C.P. 8888 Succursale Centre-Ville, Montreal, QC, Canada H3C 3P8, [email protected]

Geoffrey J. McLachlan Department of Mathematics, University of Queensland, Australia

and

Institute for Molecular Bioscience, University of Queensland, Australia, [email protected]

Facundo Mémoli Mathematics Department, Stanford University, Stanford, CA, USA, [email protected]

George Menexes Lab of Agronomy, School of Agriculture, Aristotle University of Thessaloniki, Thessaloniki, Greece, [email protected]

Wolfgang Menzel Natural Language Systems Group, University of Hamburg, Vogt-Kölln-Straße 30, 22527 Hamburg, Germany, [email protected]

Junichi Mimaya The Research Committee for the National Surveillance on Coagulation Disorders in Japan, Japan, [email protected]

Boris Mirkin School of Computer Science, Birkbeck University of London, Malet Street, London, WC1 7HX, UK, [email protected]

and

Department of Applied Mathematics, Higher School of Economics, Kirpichnaya 33/5, Moscow, Russian Federation, [email protected]

Magdalena Missler-Behr Chair of Planning and Innovation Management, Brandenburg University of Technology Cottbus, Konrad-Wachsmann-Allee 1, 03046 Cottbus, Germany, [email protected]


Kei Miyazaki Department of Cognitive and Behavioral Science, The University of Tokyo, Komaba 3-8-1, Meguro-ku, Tokyo 153-8902, Japan, [email protected]

Hans-Joachim Mucha Weierstrass Institute for Applied Analysis and Stochastics (WIAS), 10117 Berlin, Germany, [email protected]

Tina Müller Faculty of Statistics, TU Dortmund and SFB 475, Dortmund, Germany, [email protected]

Alberto Muñoz Universidad Carlos III de Madrid, c/Madrid 126, 28903 Getafe, Spain, [email protected], [email protected]

Fionn Murtagh Department of Computer Science, Royal Holloway, University of London, 57 Egham Hill, Egham TW20 OEX, England

and

Science Foundation Ireland, Wilton Place, Dublin 2, Ireland, [email protected]

Atsuho Nakayama Faculty of Economics, Nagasaki University, 4-2-1 Katafuchi, Nagasaki 850-8506, Japan, [email protected]

Alexandros Nanopoulos Institute of Computer Science, Information Systems and Machine Learning Lab, University of Hildesheim, Germany, [email protected]

Boris Naujoks Log!n GmbH, Schwelm, Germany, [email protected]

Monique Noirhomme-Fraiture University of Namur, Namur, Belgium, [email protected]

Rebecca Nugent Department of Statistics, Carnegie Mellon University, Pittsburgh, PA, USA, [email protected]

Takashi Oga Chiba University, 1-33 Yayoi-Cho, Inage-Ku, Chiba, 263-8522, Japan, [email protected]

Akinori Okada Graduate School of Management and Information Sciences, Tama University, Tokyo, Japan, [email protected]

Masato Okada Nara Institute of Science and Technology, 8916-5, Takayama-cho, Ikoma, Nara, 630-0192, Japan

and

The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, 277-8561, Japan, [email protected]

Shinichiro Omachi Tohoku University, 6-6-05 Aoba, Aramaki, Aoba-ku, Sendai, 980-8579, Japan, [email protected]

Alexia Pacheco Costarican Institute of Electricity, San José, Costa Rica, [email protected]


Iannis Papadimitriou Department of Applied Informatics, University of Macedonia, Greece, [email protected]

Marcin Pełka Department of Econometrics and Computer Science, Wroclaw University of Economics, Wroclaw, Poland, [email protected]

Laura Poggio The Macaulay Land use Research Institute, Aberdeen, UK, [email protected]

Norman Poh Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford GU2 7XH, UK, [email protected]

Wolfgang Polasek Institute for Advanced Studies, Stumpergasse 56, 1060, Vienna, Austria, [email protected]

Mike Preuss Chair of Algorithm Engineering, TU Dortmund University, Dortmund, Germany, [email protected]

Anita Prinzie Manchester Business School, Marketing Group, Booth Street West, Manchester M15 6PB, UK, [email protected]

Klaudia Przybysz University of Economics, Wrocław, Poland, [email protected]

Antonio Punzo Dipartimento di Economia e Metodi Quantitativi, Università di Catania, Catania, Italy, [email protected]

Yves Quentin Université de Toulouse, UPS, Laboratoire de Microbiologie et Génétique Moléculaires, 31000 Toulouse, France, [email protected]

and


Giancarlo Ragozini Department of Sociology, Federico II University of Naples, Naples, Italy, [email protected]

Oldemar Rodríguez School of Mathematics, University of Costa Rica, San José, Costa Rica, [email protected]

Elvira Romano Facoltà di Studi Politici, Dipartimento di Studi Europei e Mediterranei, Seconda Università degli Studi di Napoli, Via del Setificio 15, 81100 Caserta, Italy, [email protected]

Dorota Rozmus Department of Statistics, Katowice University of Economics, Bogucicka 14, 40-226 Katowice, Poland, [email protected]

Georg Ruß Otto-von-Guericke-Universität Magdeburg, Magdeburg, Germany, [email protected]

Silvina Santana University of Aveiro, Campus de Santiago, 3810-193 Aveiro, Portugal, [email protected]


Jérôme Saracco Université de Bordeaux, IMB, CNRS, UMR 5251, France

and

INRIA Bordeaux Sud-Ouest, CQFD team, France

and

Université Montesquieu Bordeaux IV, GREThA, CNRS, UMR 5113, France, [email protected]

Anatol Sargin Fakultät Statistik, Technische Universität Dortmund, D-44221 Dortmund, Germany, [email protected]

Leif Scheuermann Max-Weber-Kolleg, Universität Erfurt, Erfurt, Germany, [email protected]

Julia Schiffner Faculty of Statistics, TU Dortmund and SFB 475, Dortmund, Germany, [email protected]

P. Schmelzer Department of Mechanical Engineering, Technische Universität Dortmund, Dortmund, Germany, [email protected]

Sabine Schmidt Chair of Planning and Innovation Management, Brandenburg University of Technology Cottbus, Konrad-Wachsmann-Allee 1, 03046 Cottbus, Germany, [email protected]

Martin Schneider Martin-Luther-Universität Halle-Wittenberg, Halle, Germany, [email protected]

Michael Schröder Fraunhofer Institute for Industrial Mathematics (ITWM), Fraunhofer-Platz 1, 67663 Kaiserslautern, Germany, [email protected]

Alexandra Schwarz Schumpeter School of Business and Economics, University of Wuppertal, Gaußstr. 20, 42097 Wuppertal, Germany, [email protected]

Holger Schwender Faculty of Statistics, TU Dortmund and SFB 475, Dortmund, Germany, [email protected]

Martin Seim Center for Finance and Banking, Justus-Liebig University, Licher Strasse 74, 35394 Giessen, Germany, [email protected]

William D. Shannon Washington University School of Medicine, St. Louis, MO, USA, [email protected]

Kazuo Shigemasu Department of Psychology, Teikyo University, Otsuka 359, Hachioji-shi, Tokyo 192, Japan, [email protected]

Toshio Shimokawa University of Yamanashi, Yamanashi, Japan, [email protected]

Akira Shirahata The Research Committee for the National Surveillance on Coagulation Disorders in Japan, Japan, [email protected]


A. Pedro Duarte Silva Faculdade de Economia e Gestão & CEGE, Universidade Católica Portuguesa at Porto, Rua Diogo Botelho, 1327, 4169-005 Porto, Portugal, [email protected]

Pierre Soille Joint Research Centre, European Commission, Ispra, Italy, [email protected]

Andrzej Sokolowski Department of Statistics, Cracow University of Economics, Cracow, Poland, [email protected]

Benno Stein Faculty of Media, Media Systems, Bauhaus-Universität Weimar, Germany, [email protected]

S. Straatmann Department of Statistics, Technische Universität Dortmund, Dortmund, Germany, [email protected]

Eva Stüber Institute of Business Administration and Economics, Brandenburg University of Technology Cottbus, Postbox 101344, 03013 Cottbus, Germany, [email protected]

Werner Stuetzle Department of Statistics, University of Washington, Seattle, WA, USA, [email protected]

Gero Szepannek Faculty of Statistics, Dortmund University of Technology, 44221 Dortmund, Germany, [email protected]

Masashi Taki Department of Pediatrics of Yokohama Seibu Hospital, Collaboration of Unit of Medical Statistics, Institute of Radioisotope Research, St. Marianna University School of Medicine, Kawasaki 216-8511, Japan, [email protected]

Shinobu Tatsunami Department of Pediatrics of Yokohama Seibu Hospital, Collaboration of Unit of Medical Statistics, Institute of Radioisotope Research, St. Marianna University School of Medicine, Kawasaki 216-8511, Japan, [email protected]

Stefan Theußl Wirtschaftsuniversität Wien, 1090 Wien, Austria, [email protected]

Dirk Thorleuchter Fraunhofer INT, 53879 Euskirchen, Appelsgarten 2, Germany, [email protected]

Wolfgang Tillmann Lehrstuhl für Werkstofftechnologie, Fakultät Maschinenbau, TU Dortmund, Leonhard-Euler Str. 2, 44227 Dortmund, Germany, [email protected]

Heike Trautmann Statistics Faculty, TU Dortmund University, Dortmund, Germany, [email protected]

Mitsuhiro Tsuji Kansai University, Osaka, Japan, [email protected]

Takahiko Ueno Department of Pediatrics of Yokohama Seibu Hospital, Collaboration of Unit of Medical Statistics, Institute of Radioisotope Research, St. Marianna University School of Medicine, Kawasaki 216-8511, Japan


Alfred Ultsch Datenbionic Research Group, Hans-Meerwein-Strasse, Philipps-University Marburg, 35032 Marburg, Germany, [email protected], [email protected]

Ali Ünlü Institute of Mathematics, University of Augsburg, 86135 Augsburg, Germany, [email protected]

Antony Unwin Department of Computer Oriented Statistics and Data Analysis, Institute of Mathematics, University of Augsburg, Germany, [email protected]

Dirk Van den Poel Faculty of Economics and Business Administration, Ghent University, 9000 Gent, Tweekerkenstraat 2, Belgium, [email protected]

Rosanna Verde Department of European and Mediterranean Studies, Second University of Naples, Via del Setificio 15, 81100, Caserta, Italy, [email protected]

Jean Véronis L.I.F., UMR CNRS 6166, Université de Provence, France, [email protected]

José A. Vilar Departamento de Matematicas, Universidade de A Coruña, Campus de Elviña s/n, 15071 A Coruña, Spain, [email protected]

Domenico Vistocco Department of Economics, University of Cassino, Cassino, Italy, [email protected]

Evelina Vogli Lehrstuhl für Werkstofftechnologie, Fakultät Maschinenbau, TU Dortmund, Leonhard-Euler Str. 2, 44227 Dortmund, Germany, [email protected]

Peter Wagner Martin-Luther-Universität Halle-Wittenberg, Halle, Germany, [email protected]

Ralf Wagner SVI Endowed Chair for International Direct Marketing, DMCC Dialog Marketing Competence Center, University of Kassel, Kassel, Germany, [email protected]

Norbert Walchhofer EC3 E-Commerce Competence Center, 1010 Wien, Austria, [email protected]

Marek Walesiak Wrocław University of Economics, Nowowiejska 3, 58-500 Jelenia Góra, Poland, [email protected]

Kazuho Watanabe Nara Institute of Science and Technology, 8916-5, Takayama-cho, Ikoma, Nara, 630-0192, Japan, [email protected]

Claus Weihs Lehrstuhl Computergestützte Statistik, Fakultät Statistik, TU Dortmund, Vogelpothsweg 87, 44227 Dortmund, Germany, [email protected]

S. Wenzel Department of Statistics, Technische Universität Dortmund, Dortmund, Germany, [email protected]


Leesa Wockner Department of Mathematics, University of Queensland, Brisbane, Australia, [email protected]

Satoru Yokoyama Department of Business Administration, Faculty of Economics, Teikyo University, 359 Otsuka Hachiouji City, Tokyo, 192-0395, Japan, [email protected]

Hsiu-Ting Yu Leiden University Institute for Psychological Research, Leiden, The Netherlands, [email protected]


Part I (Semi-) Plenary Presentations

Hierarchical Clustering with Performance Guarantees

Sanjoy Dasgupta

Abstract We describe two new algorithms for hierarchical clustering, one that is an alternative to complete linkage, and the other an alternative to the k-d tree. In each case, the new algorithm is shown to admit stronger performance guarantees than the classical scheme it replaces.

1 Introduction

A hierarchical clustering is a recursive partitioning of a data set into successively more fine-grained clusterings. At the top of the hierarchy, all points are grouped into a single cluster; and each intermediate level is obtained by splitting the clusters in the level above it.
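As an illustration of this definition, the following is a minimal sketch of a hierarchy stored as a tree, with a helper that reads off the clustering at a given level. The `Cluster` class, the `level` function, and the toy data are hypothetical names introduced here for illustration; they do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """One node of the hierarchy: a set of points and the clusters it splits into."""
    points: list
    children: list = field(default_factory=list)


def level(root, depth):
    """Return the clustering reached by descending `depth` levels from the root."""
    current = [root]
    for _ in range(depth):
        nxt = []
        for c in current:
            # A cluster that is not split any further is carried over unchanged.
            nxt.extend(c.children if c.children else [c])
        current = nxt
    return [c.points for c in current]


# Toy hierarchy (illustrative data only): one cluster at the top, refined twice.
root = Cluster([1, 2, 8, 9], [
    Cluster([1, 2], [Cluster([1]), Cluster([2])]),
    Cluster([8, 9]),
])

for d in range(3):
    print(d, level(root, d))
```

Each level is a partition of the same data set, and each is a refinement of the one above it, which is the nesting property the definition describes.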

Hierarchical clustering is a basic primitive of statistics and data analysis. It is used for a variety of purposes, prominent among which are:

1. Exploratory analysis of data. Here a typical goal is to discover whether a data set contains meaningful groupings, that is, groupings in which the clusters are clearly defined (usually in the sense of being well separated). Popular algorithms for this kind of analysis are agglomerative bottom-up schemes such as average linkage and complete linkage (Sokal and Sneath 1963).

2. Tree-based vector quantization (Gray and Neuhoff 1998). Here the idea is to quantize a large data set, that is, to approximate it with a few representatives such that the quantization error (the typical distance between a data point and its representative) is small. It is irrelevant whether or not the clusters are well-defined. This type of hierarchical clustering arises in audio and video coding, and is often constructed top-down, by repeated application of the k-means algorithm (MacQueen 1967).

S. Dasgupta
University of California, San Diego, CA
e-mail: [email protected]

H. Locarek-Junge and C. Weihs (eds.), Classification as a Tool for Research, Studies in Classification, Data Analysis, and Knowledge Organization, DOI 10.1007/978-3-642-10745-0_1, © Springer-Verlag Berlin Heidelberg 2010


3. Organization of data into a spatial structure. Here the aim is to facilitate future statistical queries such as nearest-neighbor, or classification, or regression. Such queries generically take time O(n) on a database of n points; but if the points are arranged into a tree, it might be possible to process queries much more efficiently, perhaps even in O(log n) time. In these applications, the most popular form of hierarchical clustering is probably the k-d tree (Bentley 1975).
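Two quantities from the list above can be made concrete with a short sketch: the quantization error of item 2 and the O(n) linear-scan query of item 3. The function names and the toy data are hypothetical, and the "typical distance" is taken here to be the average Euclidean distance, which is an assumption; other choices, such as the average squared distance, are equally plausible readings.

```python
import math


def nearest_index(query, points):
    """Linear scan over all points: one distance computation each, so O(n) per query."""
    best, best_d = -1, math.inf
    for i, p in enumerate(points):
        d = math.dist(query, p)
        if d < best_d:
            best, best_d = i, d
    return best


def quantization_error(data, representatives):
    """Average distance from each data point to its nearest representative
    (assumed definition of the 'typical distance')."""
    total = 0.0
    for x in data:
        r = representatives[nearest_index(x, representatives)]
        total += math.dist(x, r)
    return total / len(data)


# Illustrative data: two well-separated pairs, one representative per pair.
data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
reps = [(0.0, 1.0), (10.0, 1.0)]

print(quantization_error(data, reps))
```

A tree over the points aims to answer the same nearest-neighbor query without touching every point, which is where the hoped-for O(log n) behavior would come from.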

These are all important applications, and yet the hierarchical clusterings typically used for them are woefully short on meaningful guarantees. If a data set has well-defined clusters, is complete linkage guaranteed to find them? If a set of points can be quantized with very low distortion, will the k-means algorithm necessarily find such a quantization? And are k-d trees really the best trees for speeding up statistical queries? In each case, the answer is no.

This state of affairs is understandable when it is considered that these popularalgorithms were developed at a time when data was typically one dimensional. Inlow dimension, the output of a clustering algorithm can be visually checked to seeif it is reasonable, and if it isnt, a different clustering procedure can be used; so itis not urgently necessary to have a mathematical assurance of optimality (or near-optimality) for procedures like complete linkage or k-means. Likewise, a wide rangeof tree structures are effective for answering statistical queries when data is lowdimensional; k-d trees work just fine, and are convenient to implement.

In the present time, data analysis lies at the heart of some of the biggest scientificchallenges facing us such as genomics and climate modeling but these dataare extremely high dimensional. It is no longer possible to visualize them to checkwhether a clustering is sensible. And many of the procedures that work well in lowdimension suffer when applied to high dimensional data, either because the problemof local optima is hugely exacerbated (as in the case of the k-means algorithm) orbecause they fail to adapt effectively to the geometry of high dimensional space(as in the case of k-d trees). In this new regime, it is crucial to have performanceguarantees for clustering.

In this paper, we describe two algorithms for hierarchical clustering that wererecently proposed specifically to address the challenges of high-dimensional dataanalysis. In each case, we start with a performance criterion and find that classicalschemes fare badly when subjected to this rigorous test. We then design an alter-native with strong performance guarantees. Our first algorithm is a replacement fork-d trees; the second, for complete linkage.

2 A Replacement for k-d Trees

2.1 The Curse of Dimension for Spatial Data Structures

A k-d tree (Bentley 1975) is a spatial data structure that partitions $\mathbb{R}^D$ into hyper-rectangular cells. It is built in a recursive manner, splitting along one coordinate direction at a time (Fig. 1, left). The succession of splits corresponds to a binary tree whose leaves contain the individual cells in $\mathbb{R}^D$.


Fig. 1 Left: A spatial partitioning of $\mathbb{R}^2$ induced by a k-d tree with three levels. The dots are data points; the cross marks a query point q. Right: Partitioning induced by an RP tree

These trees are among the most widely used spatial partitionings in machine learning and statistics. To understand their application, consider Fig. 1 (left), and suppose that the dots are points in a database, while the cross is a query point q. The cell containing q, henceforth denoted cell(q), can quickly be identified by moving q down the tree. If the diameter of cell(q) is small (where the diameter is taken to mean the distance between the furthest pair of data points in the cell), then the points in it can be expected to have similar properties, for instance similar labels. In classification, q is assigned the majority label in its cell, or the label of its nearest neighbor in the cell. In regression, q is assigned the average response value in its cell. In vector quantization, q is replaced by the mean of the data points in the cell. Naturally, the statistical theory around k-d trees is centered on the rate at which the diameter of individual cells drops as you move down the tree; for details, see page 320 of Devroye et al. (1996).
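The use of a k-d tree for queries can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the construction of Bentley (1975): the function names `build_kd`, `cell` and `classify` are hypothetical, splits are placed at the median, and coordinates are simply cycled through in order.

```python
import numpy as np

def build_kd(X, idx=None, depth=0, levels=3):
    """Recursively split the points X[idx] on one coordinate at a time."""
    if idx is None:
        idx = np.arange(len(X))
    if depth == levels or len(idx) <= 1:
        return {"leaf": True, "idx": idx}
    axis = depth % X.shape[1]              # assumption: cycle through coordinates
    thr = float(np.median(X[idx, axis]))   # assumption: split at the median
    left = idx[X[idx, axis] <= thr]
    right = idx[X[idx, axis] > thr]
    if len(right) == 0:                    # all values tie along this axis
        return {"leaf": True, "idx": idx}
    return {"leaf": False, "axis": axis, "thr": thr,
            "left": build_kd(X, left, depth + 1, levels),
            "right": build_kd(X, right, depth + 1, levels)}

def cell(tree, q):
    """Move q down the tree; return the indices of the data points in cell(q)."""
    while not tree["leaf"]:
        tree = tree["left"] if q[tree["axis"]] <= tree["thr"] else tree["right"]
    return tree["idx"]

def classify(tree, labels, q):
    """Assign q the majority label among the data points in its cell."""
    vals, counts = np.unique(labels[cell(tree, q)], return_counts=True)
    return vals[np.argmax(counts)]

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(64, 2))       # toy database in the plane
labels = (X[:, 0] > 0).astype(int)         # toy labels
tree = build_kd(X)
q = np.array([0.3, -0.2])
print(len(cell(tree, q)), classify(tree, labels, q))
```

Regression and vector quantization follow the same pattern, replacing the majority vote by the mean response or the mean of the points in the cell.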

It is an empirical observation that the usefulness of k-d trees diminishes as the dimension D increases. This can be explained in terms of cell diameter; it is possible to construct a data set in $\mathbb{R}^D$ for which a k-d tree requires D levels in order to halve the cell diameter. In other words, if the data lie in $\mathbb{R}^{1000}$, it could take 1000 levels of the tree to bring the diameter of cells down to half that of the entire data set. This would require $2^{1000}$ data points!

Here's the construction. Consider $S \subset \mathbb{R}^D$ made up of the coordinate axes between $-1$ and $1$: $S = \bigcup_{i=1}^{D} \{t e_i : -1 \le t \le 1\}$, where $e_1, \ldots, e_D$ is the canonical basis of $\mathbb{R}^D$. There are many application domains, such as text, in which data is sparse; this example is an extreme case. Now, the diameter of S is 2, and it remains 2 even after S is split along one coordinate direction. In fact, it decreases to 1 only after D splits.

Thus k-d trees are susceptible to the same curse of dimensionality that has been the bane of other nonparametric statistical methods.
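The first part of this claim can be checked numerically. The sketch below is an illustration under assumptions of our own choosing: the set S is sampled at a few values of t per axis, each split is placed at 0 on a coordinate, and we always follow the nonnegative side. It only verifies that the diameter stays at 2 as long as at least one coordinate direction has not yet been split.

```python
import numpy as np

def axis_points(D, m=5):
    """Sample S = union over i of {t e_i : -1 <= t <= 1} at m values of t per axis."""
    E = np.eye(D)
    ts = np.linspace(-1.0, 1.0, m)
    return np.array([t * E[i] for i in range(D) for t in ts])

def diameter(P):
    """Largest Euclidean distance between two points of P."""
    diffs = P[:, None, :] - P[None, :, :]
    return float(np.sqrt((diffs ** 2).sum(axis=-1)).max())

D = 6
cell_pts = axis_points(D)
diams = []
for j in range(D - 1):                         # split on coordinates 0, ..., D-2
    cell_pts = cell_pts[cell_pts[:, j] >= 0.0]  # keep one side of the split at 0
    diams.append(diameter(cell_pts))

print(diams)
```

The points $-e_D$ and $e_D$ survive every one of these splits, which is why the diameter cannot drop before the last coordinate is used.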


2.2 Low Dimensional Manifolds and Intrinsic Dimension

A recent positive development in machine learning has been the realization that a lot of data which superficially lie in a very high-dimensional space $\mathbb{R}^D$ actually have low intrinsic dimension, in the sense of lying close to a manifold of dimension $d \ll D$. There has been significant interest in algorithms which learn this manifold from data, with the intention that future data can then be transformed into this low-dimensional space, in which standard methods will work well. This field is quite recent and yet the literature on it is already voluminous; early foundational work includes Tenenbaum et al. (2000), Roweis and Saul (2000), and Belkin and Niyogi (2003).

Why is the manifold hypothesis at all reasonable? Suppose, for instance, that you wish to create realistic animations by collecting human motion data and then fitting models to it. A common method for collecting motion data is to have a person wear a skin-tight suit with high contrast reference points printed on it. Video cameras are used to track the 3D trajectories of the reference points as the person is walking or running. In order to ensure good coverage, a typical suit has about $N = 100$ reference points. The position and posture of the body at a particular point of time is represented by a $(3N)$-dimensional vector. However, despite this seeming high dimensionality, the number of degrees of freedom is small, corresponding to the dozen-or-so joint angles in the body. The positions of the reference points are more or less deterministic functions of these joint angles.

To take another example, a speech signal is commonly represented by a high-dimensional time series: the signal is broken into overlapping windows, and a variety of filters are applied within each window. Even richer representations can be obtained by using more filters, or by concatenating vectors corresponding to consecutive windows. Through all this, the intrinsic dimensionality remains small, because the system can be described by a few physical parameters describing the configuration of the speaker's vocal apparatus.

We will adopt a broad notion of intrinsic dimension called the Assouad (or doubling) dimension (Assouad 1983). For any point $x \in \mathbb{R}^D$ and any $r > 0$, let $B(x, r) = \{z : \|x - z\| \le r\}$ denote the closed ball of radius r centered at x. The Assouad dimension of $S \subset \mathbb{R}^D$ is the smallest integer d such that for any ball $B(x, r) \subset \mathbb{R}^D$, the set $B(x, r) \cap S$ can be covered by $2^d$ balls of radius $r/2$.

For instance, suppose set S is a line in some high-dimensional space $\mathbb{R}^D$. For any ball B, the intersection $S \cap B$, if nonempty, is a line segment, and it can be covered by exactly two balls of half the radius. Thus the Assouad dimension of S is 1.

A generalization of this argument shows that a d-dimensional affine subspace of $\mathbb{R}^D$ has Assouad dimension $O(d)$. So does a d-dimensional Riemannian submanifold of $\mathbb{R}^D$, subject to a bound on the second fundamental form of the manifold (Dasgupta and Freund 2008). Thus Assouad dimension is more general than the manifold notion we began with.

In fact, it is considerably more general, and also captures sparsity, which has recently been a subject of great interest in statistics. For instance, a text document is typically represented as a vector in which each coordinate corresponds to a word and denotes how often that word occurs within the document. This is an extremely high-dimensional representation if a lot of words are chosen, but it is also sparse (mostly zero) because any given document only contains a tiny subset of the universe of words. It is not hard to show that if S lies in $\mathbb{R}^D$ but has elements with at most d nonzero coordinates, then the Assouad dimension of S is at most $O(d \log D)$.

We are interested in techniques that automatically adapt to intrinsic low-dimensional structure without having to explicitly learn this structure. The most obvious first question is, do k-d trees adapt to intrinsic low dimension? The answer is no: the bad example constructed above has an Assouad dimension of just $\log 2D$ (the corresponding set S lies within $B(0, 1)$ and can be covered by $2D$ balls of radius $1/2$). So we must turn elsewhere.
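The covering claim in parentheses is easy to verify numerically. The sketch below assumes one particular choice of the $2D$ ball centers, namely $\pm e_i / 2$; this choice is ours and is made only for illustration. It checks that every sampled point of the coordinate-axes set S lies within distance $1/2$ of some center.

```python
import numpy as np

D = 8
E = np.eye(D)
centers = np.vstack([0.5 * E, -0.5 * E])   # assumed centers: +e_i/2 and -e_i/2
ts = np.linspace(-1.0, 1.0, 41)
S = np.array([t * E[i] for i in range(D) for t in ts])   # samples of the axes set

# Distance from every sample to its nearest center.
dists = np.linalg.norm(S[:, None, :] - centers[None, :, :], axis=2)
covered = bool((dists.min(axis=1) <= 0.5 + 1e-12).all())
print(centers.shape[0], covered)
```

A point $t e_i$ with $0 \le t \le 1$ is at distance $|t - 1/2| \le 1/2$ from $e_i / 2$, and symmetrically for negative t, which is what the check confirms.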

2.3 Random Projection Trees

Remarkably, a simple variant of k-d trees does adapt to intrinsic dimension. Instead of splitting along coordinate directions at the median, we split along a random direction in $S^{D-1}$ (the unit sphere in $\mathbb{R}^D$), and instead of splitting exactly at the median, we add a small amount of jitter. We call these random projection trees (Fig. 1, right), or RP trees for short. Specifically, for any cell within the tree containing data points (say) S, the splitting rule is determined as follows:

Choose a random unit direction $v \in \mathbb{R}^D$.

Pick any $x \in S$; let $y \in S$ be the farthest point from it.

Choose $\delta$ uniformly at random in $[-1, 1] \cdot 6\|x - y\| / \sqrt{D}$.

All points $\{x \in S : x \cdot v \le \mathrm{median}(\{z \cdot v : z \in S\}) + \delta\}$ go to the left subtree; the remainder go to the right.
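A single application of this splitting rule might be sketched as follows. The function name `rp_split` is hypothetical, the random direction is obtained by normalizing a Gaussian vector (one standard way to sample uniformly from the unit sphere), and "pick any x" is implemented as a uniformly random choice; these are assumptions of the sketch rather than details fixed by the rule itself.

```python
import numpy as np

def rp_split(S, rng):
    """Apply one RP-tree split to the rows of S (an n x D array).

    Returns (left, right), the points sent to each subtree.
    """
    n, D = S.shape
    v = rng.normal(size=D)
    v /= np.linalg.norm(v)                           # random unit direction
    x = S[rng.integers(n)]                           # "pick any x in S"
    y = S[np.argmax(np.linalg.norm(S - x, axis=1))]  # farthest point from x
    # Jitter: uniform in [-1, 1] * 6 ||x - y|| / sqrt(D).
    delta = rng.uniform(-1.0, 1.0) * 6.0 * np.linalg.norm(x - y) / np.sqrt(D)
    proj = S @ v                                     # projections z . v
    go_left = proj <= np.median(proj) + delta
    return S[go_left], S[~go_left]

rng = np.random.default_rng(0)
S = rng.normal(size=(200, 50))
left, right = rp_split(S, rng)
print(len(left), len(right))
```

Because of the jitter, the two sides are generally not balanced; a full tree would apply the same rule recursively to each side.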

Suppose an RP tree is built from a data set $S \subset \mathbb{R}^D$, not necessarily finite. If the tree has k levels, then it partitions the space into $2^k$ cells. We define the radius of a cell $C \subset \mathbb{R}^D$ to be the smallest $r > 0$ such that $S \cap C \subset B(x, r)$ for some $x \in C$. Our theorem gives an upper bound on the rate at which the radius of cells in an RP tree decreases as one moves down the tree.

Theorem 1 (Dasgupta and Freund 2008). There is a constant $c_1$ with the following property. Suppose an RP tree is built using data set $S \subset \mathbb{R}^D$. Pick any cell C in the RP tree; suppose that $S \cap C$ has Assouad dimension $\le d$. Then with probability at least $1/2$ (over the randomization in constructing the subtree rooted at C), for every descendant $C'$ which is more than $c_1 d \log d$ levels below C, we have $\mathrm{radius}(C') \le \mathrm{radius}(C)/2$.

There is no dependence at all on the extrinsic dimension D.

Since they were introduced, RP trees have been shown to yield algorithms for tree-based vector quantization (Dasgupta and Freund 2009) and regression (Kpotufe 2009) that are adaptive to intrinsic low dimensionality. Also, an efficient scheme for nearest neighbor turns out in retrospect to be using a similar idea (Liu et al. 2004). For experimental work, see Freund et al. (2007).


Open problems

1. An RP tree halves the diameter of cells every $O(d \log d)$ levels; is there an alternative splitting rule that requires just d levels?

2. RP trees and k-d trees are designed for data in Euclidean space. Are there similar constructions (with simple splitting rules) that work in arbitrary metric spaces?

3. What guarantees can be given for query times in nearest neighbor search using RP trees?

3 A Replacement for Complete Linkage

3.1 An Existence Problem for Hierarchical Clustering

We now turn to hierarchical clusterings for exploratory data analysis. Such representations of data have long been a staple of biologists and social scientists, and since the sixties or seventies they have been a standard part of the statistician's toolbox. Their popularity is easy to understand. They require no prior specification of the number of clusters, they permit the data to be understood simultaneously at many levels of granularity, and there are some simple, greedy heuristics that can be used to construct them.

It is very useful to be able to view data at different levels of detail, but the requirement that these clusterings be nested within each other presents some fundamental difficulties. Consider the data set of Fig. 2, consisting of six evenly spaced collinear points in the Euclidean plane. The most commonly used clustering cost functions, such as that of k-means, strive to produce clusters of small radius or diameter. Under such criteria, the best 2-clustering (grouping into two clusters) of this data is unambiguous, as is the best 3-clustering. However, they are hierarchically incompatible. This raises a troubling question: by requiring a hierarchical structure, do we doom ourselves to intermediate clusterings of poor quality?

To rephrase this more constructively, must there always exist a hierarchical clustering in which, for every k, the induced k-clustering (grouping into k clusters) is close to the optimal k-clustering under some reasonable cost function? As we have already seen, it is quite possible that the optimal cost-based k-clustering cannot be obtained by merging clusters of the optimal $(k + 1)$-clustering. Can they be so far removed that they cannot be reconciled even approximately into a hierarchical structure? We resolve this fundamental existence question via the following result.

Theorem 2 (Dasgupta and Long 2005). Take the cost of a clustering to be the largest radius of its clusters. Then, any data set in any metric space has a hierarchical clustering in which, for each k, the induced k-clustering has cost at most eight times that of the optimal k-clustering.
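The incompatibility can be confirmed by brute force. The sketch below assumes the six points sit at positions 0 through 5 on a line and takes the cost of a clustering to be its largest cluster diameter (one of the two criteria mentioned above); the helper names are hypothetical. It enumerates all 2-clusterings and 3-clusterings, finds the optimal ones, and tests whether any optimal 3-clustering is nested inside an optimal 2-clustering.

```python
from itertools import product

points = [0, 1, 2, 3, 4, 5]   # assumed positions of the six collinear points

def partitions(n, k):
    """All partitions of range(n) into exactly k nonempty clusters."""
    found = set()
    for labels in product(range(k), repeat=n):
        if len(set(labels)) == k:
            found.add(frozenset(
                frozenset(i for i in range(n) if labels[i] == c)
                for c in range(k)))
    return found

def cost(clustering):
    """Largest cluster diameter."""
    return max(max(points[i] for i in c) - min(points[i] for i in c)
               for c in clustering)

def best(k):
    """All k-clusterings of minimum cost."""
    candidates = partitions(len(points), k)
    opt = min(cost(p) for p in candidates)
    return [p for p in candidates if cost(p) == opt]

def nested(fine, coarse):
    """True if every cluster of `fine` lies inside some cluster of `coarse`."""
    return all(any(c <= big for big in coarse) for c in fine)

best2, best3 = best(2), best(3)
compatible = any(nested(f, c) for f in best3 for c in best2)
print(len(best2), len(best3), compatible)
```

Under this cost the optimal 2-clustering splits the points three and three, while the optimal 3-clustering pairs them up, so the middle pair straddles the boundary of the 2-clustering.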

Fig. 2 What is the best hierarchical clustering for this data set?


Moreover, we have an algorithm for constructing such a hierarchy which is similar in simplicity and efficiency to the popular complete linkage agglomerative clustering algorithm. Complete linkage has the same underlying cost function, but does not admit a similar guarantee.

Theorem 3 (Dasgupta 2009). For any k, there is a data set for which complete linkage induces k-clusterings whose cost is k times that of the optimal k-clustering.

3.2 Approximation Algorithms for Clustering

There has been a lot of recent work on the k-center and k-median problems. In each of these, the input consists of points in a metric space as well as a preordained number of clusters k, and the goal is to find a partition of the points into clusters $C_1, \ldots, C_k$, and also cluster centers $\mu_1, \ldots, \mu_k$ drawn from the metric space, so as to minimize some cost function which is related to the radius of the clusters.

1. k-center: Maximum distance from a point to its closest center

2. k-median: Average distance from a point to its closest center

Both problems are NP-hard but have simple constant-factor approximation algorithms. For k-center, a two-approximation was found by González (1985), and this is the best approximation factor possible (Feder and Greene 1988). For k-median there have been a series of results; for instance, Arya et al. (2001) achieve an approximation ratio of $6 + \epsilon$, in time $n^{O(1/\epsilon)}$.

What does a constant-factor approximation mean for a clustering problem? Consider the scenario of Fig. 3, set in the Euclidean plane. The solid lines show the real clusters, and the three dots represent the centers of a bad 3-clustering whose cost (in either measure) exceeds that of the true solution by a factor of at least 10. This clustering would therefore not be returned by the approximation algorithms we mentioned. However, EM and k-means regularly fall into local optima of this kind, and practitioners have to take great pains to try to avoid them. In this sense, constant-factor approximations avoid the worst: they are guaranteed to never do too badly. At the same time, the solutions they return can often use some fine-tuning, and local improvement procedures like EM might work well for this.

Although most work on approximation algorithms has focused on flat k-clustering, there is some other work on hierarchies. A different algorithm for the same cost function as ours is given in Charikar et al. (2004), while Plaxton (2003) works with the k-median cost function. More recently, Lin et al. (2006) give a unifying framework that is able to adapt algorithms for flat clustering to make them hierarchical.
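The two cost functions differ only in how the point-to-center distances are aggregated, as the following sketch shows. It assumes points in Euclidean space (the problems themselves are stated for general metric spaces), and the function names are hypothetical.

```python
import numpy as np

def center_distances(X, centers):
    """Distance from each row of X to its closest center."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return d.min(axis=1)

def k_center_cost(X, centers):
    """Maximum distance from a point to its closest center."""
    return float(center_distances(X, centers).max())

def k_median_cost(X, centers):
    """Average distance from a point to its closest center."""
    return float(center_distances(X, centers).mean())

X = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0], [5.0, 0.0]])
centers = np.array([[0.0, 0.0], [5.0, 0.0]])
print(k_center_cost(X, centers), k_median_cost(X, centers))
```

In this toy example the distances to the closest centers are 0, 1, 1 and 0, so the k-center cost is their maximum and the k-median cost is their average.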

Fig. 3 The circles represent an optimal 3-clustering; all the data points lie within them. The dots are centers of a really bad clustering

Input: n data points with a distance metric $d(\cdot, \cdot)$.
Pick a point and label it 1.
For $i = 2, 3, \ldots, n$:
    Find the point furthest from $\{1, 2, \ldots, i - 1\}$ and label it $i$.
    Let $\pi(i) = \arg\min_{j < i} d(i, j)$.


Fig. 5 A farthest-first traversal of ten data points in the plane, under Euclidean distance. The numbering is completely determined by the choice of point number one (and by the method of breaking any ties that arise)

$R_i = d(i, \pi(i)) = d(i, \{1, 2, \ldots, i - 1\}).$

Then $R_1 \ge R_2 \ge R_3 \ge \cdots \ge R_n$. Figure 5 shows an example with a toy data set of ten points.

The algorithm of González uses points $1, 2, \ldots, k$ as centers for a k-clustering. Let $C_k$ be this clustering; notice that its cost is exactly $R_{k+1}$.

Theorem 4 (González 1985). For any k, any k-clustering must have at least one cluster of diameter $\ge R_{k+1}$. Thus, $\mathrm{cost}(C_k) = R_{k+1} \le 2\,\mathrm{cost}(\text{optimal } k\text{-clustering})$.
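A farthest-first traversal, together with a numerical check that the first k points give a k-clustering of cost $R_{k+1}$, might look as follows. This is a sketch under assumptions: the data are random points in the Euclidean plane, indices are 0-based (so the paper's point $i$ is `order[i - 1]` and $R_i$ is `R[i - 1]`), ties are broken by taking the first maximizer, and `R[0]` is set to infinity purely as a placeholder.

```python
import numpy as np

def farthest_first(X, first=0):
    """Return (order, R, dist) for a farthest-first traversal of the rows of X."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    order = [first]
    R = [np.inf]                      # placeholder for the first point
    to_set = dist[first].copy()       # distance of each point to the labelled set
    for _ in range(1, len(X)):
        nxt = int(np.argmax(to_set))  # point furthest from the labelled set
        order.append(nxt)
        R.append(float(to_set[nxt]))
        to_set = np.minimum(to_set, dist[nxt])
    return order, R, dist

def cost_of_first_k(dist, order, k):
    """k-center cost when the first k points of the traversal are the centers."""
    return float(dist[:, order[:k]].min(axis=1).max())

rng = np.random.default_rng(1)
X = rng.uniform(size=(10, 2))
order, R, dist = farthest_first(X)
for k in range(1, len(X)):
    print(k, cost_of_first_k(dist, order, k), R[k])
```

Each printed row compares the cost of the clustering centered at the first k points with the distance at which the next point was added; the two columns agree.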

3.4 A Hierarchical Clustering Algorithm

A farthest-first traversal orders the points so that for any k, the first k points constitute the centers of a near-optimal k-clustering $C_k$. Unfortunately, the n clusterings defined in this manner are not hierarchical. In Fig. 5 for instance, the 2-clusteringcl