
The Fourteenth International Symposium on

Intelligent Data Analysis

Saint Étienne, France

22-24 Oct, 2015

www.ida2015.org

@ida_news IDA_symposia

Dear colleagues,

Welcome to IDA 2015 (the 14th International Symposium on Intelligent Data Analysis) in Saint-Etienne! The symposium series started 20 years ago in Baden-Baden (Germany). Since then, IDA has established a long tradition and we hope that this edition in Saint-Etienne will build on that tradition by being both inspiring and enjoyable.

IDA 2015 features an exciting program with three invited talks, six regular research sessions with a Horizon talk closing each day, two poster sessions, and five social activities.

Don't forget to also explore Saint-Etienne. Did you know that Saint-Etienne was nominated as "City of Design" as part of UNESCO's Creative Cities Network?

Enjoy your stay,
The IDA 2015 team

Wifi: UJM-Invite

Wifi username: ida2015_ytr

Wifi password: yTFRRD

Website: ida2015.univ-st-etienne.fr

Invited Talk

Tony Veale

Thu 22nd Oct

The Shape of Tweets to Come − University College Dublin, Ireland

Tony Veale (Afflatus.UCD.ie) is a computer scientist whose principal research topic is Computational Linguistic Creativity. Veale teaches Computer Science at University College Dublin (UCD) and at Fudan University Shanghai as part of UCD’s international BSc. in Software Engineering, which Veale helped establish in 2002. Veale’s work on Computational Creativity (CC) focuses on creative linguistic phenomena such as metaphor, simile, blending, analogy, similarity and irony. He leads the European Commission’s coordination action on Computational Creativity called PROSECCO – Promoting the Scientific Exploration of Computational Creativity – which aims to develop the CC field into a mature discipline. He is author of the 2012 monograph Exploding the Creativity Myth: The Computational Foundations of Linguistic Creativity and is principal co-editor of the multidisciplinary volume from de Gruyter titled Creativity and the Agile Mind. As a visiting professor in Web Science at the Korean Advanced Institute of Science and Technology (2011-2013) Veale was funded by the Korean World Class University (WCU) programme to study the convergence of Computational Creativity and the Web in the form of Creative Web Services. He has recently launched a new Web initiative called RobotComix.com to engage with the wider public on the theory, philosophy and practice of building creative services, and one such system, Metaphor Magnet, can be sampled via the twitterbot @MetaphorMagnet.

Abstract

Twitter has proven itself a rich and varied source of language data for linguistic analysis. For Twitter is more than a popular new platform for social interaction via language; in many ways Twitter constitutes a whole new genre of text, as users adapt to its limitations (140-character “tweets”) and its novel conventions (e.g. re-tweeting, hashtags). Language researchers can thus harvest Twitter data to study how users convey meaning with affect, and how they achieve stickiness and virality with the texts they compose. But Twitter presents an opportunity of another kind to the computationally-minded language researcher, a generative opportunity to study how algorithmic models might impose linguistic hypotheses onto large data sources to compose novel and meaningful micro-texts of their own. This computational turn allows researchers to go beyond merely descriptive models of playful uses of language such as metaphor, sarcasm and irony. It allows researchers to test whether their models embody a sufficiently algorithmic understanding of a phenomenon to facilitate the construction of a fully-automated computational system, one that can generate wholly novel examples that are deemed acceptable to humans. This talk presents and evaluates one such system, a Twitterbot named @MetaphorMagnet that generates, expresses and shares its own playful insights on Twitter. I shall show how @MetaphorMagnet’s tweets are inspired by data but shaped by knowledge, and consider how the outputs of this hybrid data/knowledge-driven bot may be usefully anchored in another source of data: the news stream.

Invited Talk

Nick Heard

Fri 23rd Oct

Combining Weak Statistical Evidence in Cyber Security − Imperial College London, UK

Nick Heard is a Senior Lecturer in the Statistics section of the Department of Mathematics, Imperial College London. He obtained his PhD under the supervision of Adrian Smith in 2001. Nick’s best known work is on cluster analysis, where he has developed Bayesian methodology and software for use in bioinformatics. He has also written papers on Monte Carlo convergence, sequential Monte Carlo, and social networks. Nick’s current main research interest is in dynamic networks and cyber security. He is currently on a long term secondment to the Heilbronn Institute of Mathematical Research, University of Bristol, which has a research focus on Internet data analysis and cyber security applications in the presence of so-called ‘Big Data’.

Abstract

Cyber attacks on government and industry computer networks are now commonplace and no system can be made invulnerable to intrusion. Instead, much importance is placed on reducing the impact of cyber attacks when they occur, which first means quickly detecting their presence amongst the flow of cyber traffic. However, sophisticated hackers and cyber criminals will act carefully to hide their presence, and so any hard detection rules (“signatures”) can be circumnavigated. Nonetheless, if an intrusion has a malign purpose, then at least some unusual behaviour will be hidden within the network traffic data. Statistical modelling of nodes and edges in a computer network can build up a picture of normal behaviour in the system. Typical institutional computer networks produce high volume data streams and so, from time to time, surprising but benign behaviour will be observed. The task is to detect the significance of genuine intrusion events against this background. In statistical modelling, p-values are the fundamental quantities for measuring the significance of observed data against a null hypothesis. This talk will review methods of combining p-values to accumulate evidence, investigating their properties in depth. Some new approaches will then be proposed which are better suited for detecting subsets of significant p-values. Finally, the advantages of the proposed approach will be illustrated on a cyber authentication problem, stemming from collaborative work with Los Alamos National Laboratory.
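
As a small illustration of what combining p-values can look like, the sketch below implements Fisher's method, a classical textbook rule. It is an assumption that this rule is among those reviewed in the talk, and it is not one of the new approaches the abstract mentions. The function name fisher_combine is hypothetical, and the p-values are assumed to be independent.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method (illustrative sketch)."""
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    # Under the null hypothesis, -2 * sum(log p_i) follows a chi-square
    # distribution with 2k degrees of freedom. We keep half of that statistic.
    half_stat = -sum(math.log(p) for p in p_values)
    # Closed-form upper tail of a chi-square with an even number (2k) of
    # degrees of freedom: exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    term = 1.0
    total = 1.0
    for j in range(1, k):
        term *= half_stat / j
        total += term
    return math.exp(-half_stat) * total


if __name__ == "__main__":
    # Several individually weak p-values can add up to stronger evidence.
    print(fisher_combine([0.08, 0.11, 0.06, 0.09]))
```

With a single p-value the function returns that value unchanged, which is a quick sanity check on the formula.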

Invited Talk

Pascal Van Hentenryck

Sat 24th Oct

Evidence-Based Optimization − National ICT, Australia (NICTA)

Pascal Van Hentenryck leads the Optimisation Research Group (about 75 people) at National ICT Australia (NICTA). He also holds a Vice-Chancellor Strategic Chair in data-intensive computing at the Australian National University. Van Hentenryck is the recipient of two honorary degrees and is a fellow of the Association for the Advancement of Artificial Intelligence. He was awarded the 2002 INFORMS ICS Award for research excellence in operations research and computer science, the 2006 ACP Award for research excellence in constraint programming, the 2010-2011 Philip J. Bray Award for Teaching Excellence at Brown University, and was the 2013 IFORS Distinguished speaker. Van Hentenryck is the author of five MIT Press books and has developed a number of innovative optimisation systems that are widely used in academia and industry. Van Hentenryck’s research is currently at the intersection of data science and optimization with a focus on disaster management, energy systems, recommender systems, and transportation.

Abstract

For the first time in the history of mankind, we are accumulating data sets of unprecedented scale and accuracy about physical infrastructures, natural phenomena, man-made processes, and human behavior. These developments, together with progress in high-performance computing, machine learning, and operations research, offer exciting opportunities for the evidence-based optimization of global systems. This talk reviews some case-studies in disaster management, energy systems, high-performance computing, and market optimization to showcase these unique opportunities and their associated challenges, and presents some emerging architectures for evidence-based optimization.

Schedule

Wed 21st Oct

19:30 Welcome reception at Town hall Saint-Etienne

Thu 22nd Oct

08:30 Registration desk opens

09:20 Welcome from General Chair (room J020)

09:30 Invited talk: The Shape of Tweets to Come by Tony Veale
Chair: Elisa Fromont

10:30 Coffee break (room D03)

11:00 Contributed session 1 (room J020)
Chair: Bart Goethals

11:00-11:30 Batch Steepest-Descent-Mildest-Ascent for Interactive Maximum Margin Clustering. Fabian Gieseke, Tapio Pahikkala and Tom Heskes
11:30-12:00 Diversity-driven Widening of Hierarchical Agglomerative Clustering. Alexander Fillbrunn and Michael R. Berthold
12:00-12:30 Diagonal Co-clustering Algorithm for Document-Word Partitioning. Charlotte Laclau and Mohamed Nadif

12:30 Lunch at “La Platine”

14:30 Contributed session 2 (room J020)
Chair: Bruno Crémilleux

14:30-15:00 I-Louvain: an attributed graph clustering method. Christine Largeron, Mathias Géry, David Combe and Elöd Egyed-Zsigmond
15:00-15:45 Horizon Talk: Towards a Data Science Collaboratory. Joaquin Vanschoren, Bernd Bischl, Frank Hutter, Michele Sebag, Balazs Kegl, Matthias Schmid, Giulio Napolitano, Katy Wolstencroft, Alan R. Williams and Neil Lawrence


16:00 PhD poster session

18:00 Social event: city walk (departure: square Boivin, arrival: square Jean Jaurès)

19:30 Cocktail at Nouai Borfa

Fri 23rd Oct


09:30 Invited talk: Combining Weak Statistical Evidence in Cyber Security by Nick Heard (room J020)
Chair: Tijl De Bie


11:00 Contributed session 3 (room J020)
Chair: Stephen Swift

11:00-11:30 Continuous and Discrete Deep Classifiers for Data Integration. Nataliya Sokolovska, Salwa Rizkalla, Karine Clément and Jean-Daniel Zucker
11:30-12:00 Efficient Model Selection for Regularized Classification by Exploiting Unlabeled Data. Georgios Balikas, Ioannis Partalas, Eric Gaussier, Rohit Babbar and Massih-Reza Amini
12:00-12:30 Spotlight talks for the regular poster session

12:30 Lunch (room D03)

14:00 Contributed session 4 (room J020)
Chair: Hendrik Blockeel

14:00-14:30 A Bayesian Approach for Identifying Multivariate Differences Between Groups. Yuriy Sverchkov and Gregory Cooper
14:30-15:00 Segregation Discovery in a Social Network of Companies. Alessandro Baroni and Salvatore Ruggieri
15:00-15:45 Horizon Talk: When Learning Indeed Changes the World: Diagnosing Prediction-Induced Drift. Georg Krempl, David Bodnar and Anita Hrubos


16:00 Regular poster session (room D03)

18:00 End of sessions

19:00 Banquet at “Le Grand Cercle”

Sat 24th Oct


09:30 Invited talk: Evidence-Based Optimization by Pascal Van Hentenryck (room J020)
Chair: Matthijs van Leeuwen


11:00 Contributed session 5
Chair: Michael Berthold

11:00-11:30 Probabilistic Active Learning in Data Streams. Daniel Kottke, Georg Krempl and Myra Spiliopoulou
11:30-12:00 Fast Algorithm Selection using Learning Curves. Jan N. van Rijn, Salisu Mamman Abdulrahman, Pavel Brazdil and Joaquin Vanschoren
12:00-12:30 VoQs: A Web Application for Visualization of Questionnaire Surveys. Xiaowei Zhang, Frank Klawonn, Lorenz Grigull and Werner Lechner


14:00 Contributed session 6
Chair: Joost Kok

14:00-14:30 Automatically Discovering Offensive Patterns in Soccer Match Data. Jan Van Haaren, Vladimir Dzyuba, Siebe Hannosset and Jesse Davis
14:30-15:15 Horizon Talk: The Data Problem in Data Mining. Albrecht Zimmermann

15:15 Prize award ceremony

15:45 IDA council meeting (council only)

18:00 Farewell social event

Regular talks

A Bayesian Approach for Identifying Multivariate Differences Between Groups
Yuriy Sverchkov (University of Wisconsin–Madison) and Gregory Cooper (University of Pittsburgh)
We present a novel approach to the problem of detecting multivariate statistical differences across groups of data. The need to compare data in a multivariate manner arises naturally in observational studies, randomized trials, comparative effectiveness research, abnormality and anomaly detection scenarios, and other application areas. In such comparisons, it is of interest to identify statistical differences across the groups being compared. The approach we present in this paper addresses this issue by constructing statistical models that describe the groups being compared and using a decomposable Bayesian Dirichlet score of the models to identify variables that behave statistically differently between the groups. In our evaluation, the new method performed significantly better than logistic lasso regression in identifying differences in a variety of datasets under a variety of conditions.

I-Louvain: an attributed graph clustering method
Christine Largeron (Université de Lyon), Mathias Géry (LaHC, Université Jean Monnet, Saint-Etienne (France)), David Combe (Laboratoire Hubert Curien) and Elöd Egyed-Zsigmond (LIRIS)
Modularity makes it possible to estimate the quality of a partition into communities of a graph composed of highly inter-connected vertices. In this article, we introduce a complementary measure, based on inertia, and specially conceived to evaluate the quality of a partition based on real attributes describing the vertices. We also propose I-Louvain, a graph nodes clustering method which uses our criterion, combined with Newman’s modularity, in order to detect communities in attributed graphs where real attributes are associated with the vertices. Our experiments show that combining the relational information with the attributes makes it possible to detect the communities more efficiently than using only one type of information. In addition, our method is more robust to data degradation.

Continuous and Discrete Deep Classifiers for Data Integration
Nataliya Sokolovska (University Paris 6, INSERM U872)
Data representation in a lower dimension is needed in applications where information comes from multiple high-dimensional sources. A final compact model has to be interpreted by human experts, and interpretation of a classifier whose weights are discrete is much more straightforward. In this contribution, we propose a novel approach, called Deep Kernel Dimensionality Reduction, which is designed for learning layers of new compact data representations simultaneously. We show by experiments on standard and on real large-scale biomedical data sets that the proposed method embeds data in a new compact meaningful representation, and leads to a lower classification error compared to the state-of-the-art methods. We also consider some state-of-the-art deep learners and their corresponding discrete classifiers. We illustrate by our experiments that although purely discrete models do not always perform better than real-valued classifiers, the trade-off between the model accuracy and the interpretability is quite reasonable.

Diversity-driven Widening of Hierarchical Agglomerative Clustering
Alexander Fillbrunn (University of Konstanz) and Michael R. Berthold (University of Konstanz)
In this paper we show that diversity-driven widening, the parallel exploration of the model space with focus on developing diverse models, can improve hierarchical agglomerative clustering. Depending on the selected linkage method, the model that is found through the widened search achieves a better silhouette coefficient than its sequentially built counterpart.

Batch Steepest-Descent-Mildest-Ascent for Interactive Maximum Margin Clustering
Fabian Gieseke (Radboud University Nijmegen), Tapio Pahikkala (University of Turku) and Tom Heskes (Radboud University Nijmegen)
The maximum margin clustering principle extends support vector machines to unsupervised scenarios. We present a variant of this clustering scheme that can be used in the context of interactive clustering scenarios. In particular, our approach permits the class ratios to be manually defined by the user during the fitting process. Our framework can be used at early stages of the data mining process when no or very little information is given about the true clusters and class ratios. One of the key contributions is an adapted steepest-descent-mildest-ascent optimization scheme that can be used to fine-tune maximum margin clustering solutions in an interactive manner. We demonstrate the applicability of our approach in the context of remote sensing and astronomy with training sets consisting of hundreds of thousands of patterns.

Segregation Discovery in a Social Network of Companies
Alessandro Baroni (Dipartimento di Informatica, Università di Pisa) and Salvatore Ruggieri (Dipartimento di Informatica, Università di Pisa)
We introduce a framework for a data-driven analysis of segregation of minority groups in social networks, and challenge it on a complex scenario. The framework builds on quantitative measures of segregation, called segregation indexes, proposed in the social science literature. The segregation discovery problem is introduced, which consists of searching sub-graphs and sub-groups for which a segregation index is above a minimum threshold. A search algorithm is devised that solves the segregation problem. The framework is challenged on the analysis of segregation in the real and large network of the Italian companies connected through shared directors on their boards.

Probabilistic Active Learning in Data Streams
Daniel Kottke (Knowledge Management and Discovery Lab, University Magdeburg), Georg Krempl (Knowledge Management and Discovery Lab, University Magdeburg) and Myra Spiliopoulou (Knowledge Management and Discovery Lab, University Magdeburg)
In recent years, stream-based active learning has become an intensively investigated research topic. In this work, we propose a new algorithm for stream-based active learning that decides immediately whether to acquire a label (selective sampling). It uses Probabilistic Active Learning (PAL) to measure the spatial usefulness of each instance in the stream. To determine if a currently arrived instance belongs to the most useful instances (temporal usefulness) given a predefined budget, we propose BIQF – a Balanced Incremental Quantile Filter. It uses a sliding window to represent the distribution of the most recent usefulness values and finds a labeling threshold using quantiles. The balancing mechanism ensures that the predefined budget will be met within a given tolerance window. We evaluate our approach against other stream active learning approaches on multiple datasets. The results confirm the effectiveness of our method.
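
The sliding-window quantile idea described in this abstract can be sketched roughly as follows. This is not the authors' BIQF: the balancing mechanism is left out, the usefulness scores are assumed to come from elsewhere (PAL is not implemented here), and the class name, parameter names and default window size are all hypothetical.

```python
from collections import deque


class QuantileFilter:
    """Label an instance if its usefulness is in the top `budget` share
    of the most recent usefulness values (illustrative sketch only)."""

    def __init__(self, budget, window_size=100):
        if not 0.0 < budget <= 1.0:
            raise ValueError("budget must lie in (0, 1]")
        if window_size < 1:
            raise ValueError("window_size must be positive")
        self.budget = budget
        # The deque drops the oldest value automatically once it is full.
        self.window = deque(maxlen=window_size)

    def should_label(self, usefulness):
        """Record the new usefulness value and decide immediately."""
        self.window.append(usefulness)
        ranked = sorted(self.window)
        # Position of the (1 - budget) quantile within the current window.
        idx = min(len(ranked) - 1, int((1.0 - self.budget) * len(ranked)))
        return usefulness >= ranked[idx]


if __name__ == "__main__":
    flt = QuantileFilter(budget=0.5, window_size=4)
    for score in [1, 2, 3, 4, 0, 5, 3.5]:
        print(score, flt.should_label(score))
```

Sorting the window on every arrival keeps the sketch short; an incremental order-statistics structure would be the natural replacement for long windows.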

Fast Algorithm Selection using Learning Curves
Jan N. van Rijn (Leiden University), Salisu Mamman Abdulrahman (University of Porto), Pavel Brazdil (University of Porto) and Joaquin Vanschoren (K.U.Leuven)
One of the challenges in Machine Learning is, given a dataset, to find a classifier that works well on it. Performing a cross-validation evaluation procedure on all possible combinations typically takes too much time; many solutions have been proposed that attempt to predict which classifiers are most promising to try. As the first recommended classifier is not always the correct choice, multiple recommendations should be made, making this a ranking problem rather than a classification problem. Even though this is a well studied problem, there is currently no good way of evaluating such rankings. We advocate the use of Loss Time Curves, as used in optimization literature. These visualize the amount of budget (time) needed to converge to an acceptable solution. We investigate a method that utilizes the measured performances of classifiers on small samples of data to make such recommendation, and adapt it so that it works well in Loss Time space. Experimental results show that this method converges extremely fast to an acceptable solution.

Diagonal Co-clustering Algorithm for Document-Word Partitioning
Charlotte Laclau (LIPADE, Université Paris Descartes) and Mohamed Nadif (LIPADE, Université Paris Descartes)
We propose a novel diagonal co-clustering algorithm built upon the double Kmeans to address the problem of document-word co-clustering. At each iteration, the proposed algorithm seeks a diagonal block structure of the data by minimizing a criterion based on the variance within and the centroid effect. In addition to being easy to interpret and efficient on sparse binary and continuous data, Diagonal Double Kmeans (DDKM) is also faster than other state-of-the-art clustering algorithms. We illustrate our contribution using real datasets commonly used in document clustering.

Efficient Model Selection for Regularized Classification by Exploiting Unlabeled Data
Georgios Balikas (University Grenoble Alpes), Ioannis Partalas (University Grenoble Alpes – LIG), Eric Gaussier (University Grenoble Alpes – LIG), Rohit Babbar (Max Planck Institute for Intelligent Systems) and Massih-Reza Amini (University Grenoble Alpes – LIG)
Hyper-parameter tuning is a resource-intensive task when optimizing classification models. The commonly used k-fold cross validation can become intractable in large scale settings when a classifier has to learn billions of parameters. At the same time, in real-world settings, one often encounters multi-class classification scenarios with only a few labeled examples; model selection approaches often offer little improvement in such cases and the default values of learners are used. We propose bounds for classification on accuracy and macro measures (precision, recall, F1) that motivate efficient schemes for model selection and can benefit from the existence of unlabeled data. We demonstrate the advantages of those schemes by comparing them with k-fold cross validation and hold-out estimation in the setting of large scale classification.

Automatically Discovering Offensive Patterns in Soccer Match Data
Jan Van Haaren (KU Leuven), Vladimir Dzyuba (KU Leuven), Siebe Hannosset (Club Brugge KV) and Jesse Davis (KU Leuven)
In recent years, many professional sports clubs have adopted camera-based tracking technology that captures the location of both the players and the ball at a high frequency. Nevertheless, the valuable information that is hidden in these performance data is rarely used in their decision-making process. What is missing are the computational methods to analyze these data in great depth. This paper addresses the task of automatically discovering patterns in offensive strategies in professional soccer matches. To address this task, we propose an inductive logic programming approach that can easily deal with the relational structure of the data. An experimental study shows the utility of our approach.

VoQs: A Web Application for Visualization of Questionnaire Surveys
Xiaowei Zhang (Helmholtz Centre for Infection Research), Frank Klawonn (Helmholtz Centre for Infection Research), Lorenz Grigull (Medical University Hanover) and Werner Lechner (Improved Medical Diagnostics, Pte. Ltd.)
This paper is motivated by analyzing questionnaire data collected from patients who suffer from an orphan disease. In order to decrease misdiagnoses and shorten the diagnosis time for those patients who have not yet been diagnosed but have a long history of health problems, a research project about using questionnaire mode and data analysis methods to predetermine orphan disease has been set up and questionnaires were designed based on experiences with patients who already have a diagnosis. The main focus of this work is to visualize answering patterns that characterize patients with a specific orphan disease, which questions are most useful to distinguish between certain orphan diseases and how well an answering pattern of a specific patient fits the general pattern of those patients who share the same disease. We borrow from the concept of sequence logos, commonly used in genetics to visualize the conservation of nucleotides in a strand of DNA or RNA. Instead of nucleotides, we have possible answers from a question. Our proposed visualization techniques are not limited to questionnaires on orphan diseases but also can be applied to any questionnaire survey with closed-ended questions for which we are interested in answering characteristics of different groups.

Regular posters

Implicitly Constrained Semi-Supervised Least Squares Classification
Jesse H. Krijthe (Delft University of Technology/Leiden University) and Marco Loog (Delft University of Technology/University of Copenhagen)
We introduce a novel semi-supervised version of the least squares classifier. This implicitly constrained least squares (ICLS) classifier minimizes the squared loss on the labeled data among the set of parameters implied by all possible labelings of the unlabeled data. Unlike other discriminative semi-supervised methods, our approach does not introduce explicit additional assumptions into the objective function, but leverages implicit assumptions already present in the choice of the supervised least squares classifier. We show this approach can be formulated as a quadratic programming problem and its solution can be found using a simple gradient descent procedure. We prove that, in a certain way, our method never leads to performance worse than the supervised classifier. Experimental results corroborate this theoretical result in the multidimensional case on benchmark datasets, also in terms of the error rate.

A Comparative Study of Clustering Methods with Multinomial Distribution
Md Abul Hasnat (ERIC Lab, University Lyon 2), Julien Velcin (ERIC Lab, University Lyon 2), Stéphane Bonnevay (University Lyon 1) and Julien Jacques (ERIC Lab, University Lyon 2)
In this paper, we study different discrete data clustering methods which employ the Model-Based Clustering (MBC) with the Multinomial distribution. We propose a novel MBC method by efficiently combining the partitional and hierarchical clustering techniques. We conduct experiments on both synthetic and real data and evaluate the methods using accuracy, stability and computation time. Our study identifies appropriate strategies that should be used for different sub-tasks for unsupervised discrete data analysis with the MBC method.

A parallel distributed processing algorithm for image feature extraction
Alexander Belousov (Ariel University) and Joel Ratsaby (Ariel University)
We present a new parallel algorithm for image feature extraction that uses a recently introduced distance function (UID) between images. It is based on the LZ-complexity of the string representation of the two images. An input image is represented by a feature vector whose components are the distance values between its parts (sub-images) and a set of prototypes. The algorithm is highly scalable and computes these values in parallel. We implement it on a massively parallel graphics processing unit (GPU) with several thousands of cores. The algorithm distributes the computation of distances of thousands of sub-images and prototypes in parallel. This yields a very high increase in computational speeds for processing images, typically reducing the processing time from hours to seconds. Given a corpus of input images it produces a table of labeled cases that can be used by any supervised or unsupervised learning algorithm to learn image classification or image clustering. A main advantage is the lack of need for any image processing or image analysis; the user only once defines image-features through a simple basic process of choosing a few small images to serve as prototypes. Results for several image classification problems are presented.

Exploratory topic modeling with distributional semantics
Samuel Rönnqvist (Turku Centre for Computer Science)
As we continue to collect and store textual data in a multitude of domains, we are regularly confronted with material whose largely unknown thematic structure we want to uncover. With unsupervised, exploratory analysis, no prior knowledge about the content is required and highly open-ended tasks can be supported. In the past few years, probabilistic topic modeling has emerged as a popular approach to this problem. Nevertheless, the representation of the latent topics as aggregations of semi-coherent terms limits their interpretability and level of detail. This paper presents an alternative approach to topic modeling that maps topics as a network for exploration, based on distributional semantics using learned word vectors. From the granular level of terms and their semantic similarity relations global topic structures emerge as clustered regions and gradients of concepts. Moreover, the paper discusses the visual interactive representation of the topic map, which plays an important role in supporting its exploration.

Constraint-Based Querying for Bayesian Network Exploration
Behrouz Babaki (KU Leuven), Tias Guns (KU Leuven), Siegfried Nijssen (KU Leuven) and Luc De Raedt (KU Leuven)
Understanding the knowledge that resides in a Bayesian network can be hard, certainly when a large network is to be used for the first time, or when the network is complex or has just been updated. Tools to assist users in the analysis of Bayesian networks can help. In this paper, we introduce a novel general framework and tool for answering exploratory queries over Bayesian networks. The framework is inspired by queries from the constraint-based mining literature designed for the exploratory analysis of data. Adapted to Bayesian networks, these queries specify a set of constraints on explanations of interest, where an explanation is an assignment to a subset of variables in a network. Characteristic for the methodology is that it searches over different subsets of the explanations, corresponding to different marginalizations. A general purpose framework, based on principles of constraint programming, data mining and knowledge compilation, is used to answer all possible queries. This CP4BN framework employs a rich set of constraints and is able to emulate a range of existing queries from both the Bayesian network and the constraint-based data mining literature.

Using entropy as a measure of acceptance for multi-label classification
Laurence Park (University of Western Sydney) and Simeon Simoff (University of Western Sydney)
Multi-label classifiers allow us to predict the state of a set of responses using a single model. A multi-label model is able to make use of the correlation between the labels to potentially increase the accuracy of its prediction. Critical applications of multi-label classifiers (such as medical diagnoses) require that the system’s confidence in prediction also be provided with the multi-label prediction. The specialist then uses the measure of confidence to assess whether to accept the system’s prediction. Probabilistic multi-label classification provides a categorical distribution over the set of responses, allowing us to observe the distribution, select the most probable response, and obtain an indication of confidence by the shape of the distribution. In this article, we examine if normalised entropy, a parameter of the probabilistic multi-label response distribution, correlates with the accuracy of the prediction and therefore can be used to gauge confidence in the system’s prediction. We found that for all three methods examined on each data set, the accuracy increases for the majority of the observations where the normalised entropy threshold decreases, showing that we can use normalised entropy to gauge a system’s confidence, and hence use it as a measure of acceptance.
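
A minimal sketch of how normalised entropy could serve as an acceptance test is given below. It assumes the common definition (Shannon entropy divided by the logarithm of the number of possible responses); the abstract does not state the authors' exact normalisation. The function names and the threshold value are hypothetical.

```python
import math


def normalised_entropy(probs):
    """Shannon entropy of a categorical distribution, scaled to [0, 1]."""
    k = len(probs)
    if k < 2:
        raise ValueError("need at least two possible responses")
    if any(p < 0.0 for p in probs) or abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probs must be non-negative and sum to 1")
    # Terms with p == 0 contribute nothing to the entropy.
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    # Dividing by log(k) maps the uniform distribution to 1.
    return entropy / math.log(k)


def accept_prediction(probs, threshold=0.5):
    """Accept the most probable response only if the distribution is peaked.

    The threshold is a hypothetical value chosen for illustration.
    """
    return normalised_entropy(probs) <= threshold


if __name__ == "__main__":
    print(accept_prediction([0.94, 0.02, 0.02, 0.02]))  # peaked distribution
    print(accept_prediction([0.30, 0.25, 0.25, 0.20]))  # nearly uniform
```

Lowering the threshold accepts fewer predictions, which corresponds to the trade-off the abstract reports between the entropy threshold and accuracy.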

Investigation of Node Deletion Techniques for Clustering Applications of Growing Self Organizing Maps
Thilina Rathnayake (University of Moratuwa), Maheshakya Wijewardena (University of Moratuwa), Thimal Kempitiya (University of Moratuwa), Kevin Rathnasekara (University of Moratuwa), Thushan Ganegedara (University of Moratuwa), Amal Perera (University of Moratuwa) and Damminda Alahakoon (La Trobe University)
Self Organizing Maps (SOM) are widely used in data mining and high-dimensional data visualization due to their unsupervised nature and robustness. Growing Self Organizing Maps (GSOM) is a variant of the SOM algorithm which allows the nodes to grow so that it can represent the input space better. Without using a fixed 2D grid like SOM, GSOM starts with four nodes and keeps track of the quantization error in each node. New nodes are grown from an existing node if its error value exceeds a pre-defined threshold. The ability of the GSOM algorithm to represent input space and accurately cluster the input instances (vectors) is vital to extend its applicability to a wider spectrum of problems. This ability can be improved by identifying nodes that represent low probability regions in the input space and removing them periodically. This will improve the homogeneity and completeness of the final clustering result. A new extension to the GSOM algorithm based on node deletion is proposed in this paper as a solution to this problem. Furthermore, two new algorithms inspired by cache replacement policies are presented. The first algorithm is based on Adaptive Replacement Cache (ARC) and maintains two separate Least Recently Used (LRU) lists of the nodes. The second algorithm is built on Frequency Based Replacement policy (FBR) and maintains a single LRU list. These algorithms consider both recent and frequent trends in the GSOM grid before deciding on the nodes to be deleted. The experiments conducted suggest that the FBR based method for node deletion outperforms the naive algorithm and other existing node deletion methods.

Using Metalearning for Prediction of Taxi Trip Duration Using Different Granularity Levels
Mohammad Nozari Zarmehri (INESC Porto) and Carlos Soares (University of Porto)
Trip duration is an important metric for the management of taxi companies, as it affects operational efficiency, driver satisfaction and, above all, customer satisfaction. In particular, the ability to predict trip duration in advance can be very useful for allocating taxis to stands and finding the best route for trips. A data mining approach can be used to generate models for trip time prediction. In fact, given the amount of data available, different models can be generated for different taxis. Given the difference between the data collected by different taxis, the best model for each one can be obtained with different algorithms and/or parameter settings. However, finding the configuration that generates the best model for each taxi is computationally very expensive. In this paper, we propose the use of metalearning to address the problem of selecting the algorithm that generates the model with the most accurate predictions for each taxi. The approach is tested on data collected in the Drive-In project. Our results show that metalearning can help to select the algorithm with the best accuracy.

A first-order-logic based model for grounded language learning
Leonor Becerra-Bonache (Université Jean Monnet), Hendrik Blockeel (K.U. Leuven), Maria Galvan (Université Jean Monnet) and François Jacquenet (Université Jean Monnet)
Much is still unknown about how children learn language, but it is clear that they perform “grounded” language learning: they learn the grammar and vocabulary not just from examples of sentences, but from examples of sentences in a particular context. Grounded language learning has been the subject of much research. Most of this work focuses on particular aspects, such as constructing semantic parsers, or on particular types of applications. In this paper, we take a broader view that includes an aspect that has received little attention until now: learning the meaning of phrases from phrase/context pairs in which the phrase’s meaning is not explicitly represented. We propose a simple model for this task that uses first-order logic representations for contexts and meanings, including a simple incremental learning algorithm. We experimentally demonstrate that the proposed model can explain the gradual learning of simple concepts and language structure, and that it can easily be used for interpretation, generation, and translation of phrases.

Slower can be faster: The iRetis incremental model tree learner
Denny Verbeeck (KULeuven) and Hendrik Blockeel (K.U. Leuven)
Incremental learning is useful for processing streaming data, where data elements are produced at a high rate and cannot be stored. An incremental learner typically updates its model with each new instance that arrives. To avoid skipped instances, the model update must finish before the next element arrives, so it should be fast. However, there can be a trade-off between the efficiency of the update and how many updates are needed to get a good model. We investigate this trade-off in the context of model trees. We compare FIMT, a state-of-the-art incremental model tree learner developed for streaming data, with two alternative methods that use a more expensive update method. We find that for data with relatively low (but still realistic) dimensionality, the most expensive method often yields the best learning curve: the system converges faster to a smaller and more accurate model tree.

Modeling concept drift: A probabilistic graphical model based approach
Hanen Borchani (Aalborg University), Ana M. Martinez (Aalborg University), Andres Masegosa (The Norwegian University of Science and Technology), Helge Langseth (The Norwegian University of Science and Technology), Thomas D. Nielsen (Aalborg University), Antonio Salmeron (University of Almeria), Antonio Fernandez (Banco de Credito Cooperativo), Anders L. Madsen (HUGIN EXPERT A/S) and Ramon Saez (Banco de Credito Cooperativo)
An often used approach for detecting and adapting to concept drift when doing classification is to treat the data as i.i.d. and use changes in classification accuracy as an indication of concept drift. In this paper, we take a different perspective, and propose a framework, based on probabilistic graphical models, that explicitly represents concept drift using latent variables. To ensure efficient inference and learning, we resort to a variational Bayes inference scheme. As a proof of concept, we demonstrate and analyze the proposed framework using synthetic data sets as well as a real financial data set from a Spanish bank.

Optimally Weighted Cluster Kriging for Big Data Regression
Bas van Stein (Leiden University), Hao Wang (Leiden University), Wojtek Kowalczyk (Leiden University), Michael Emmerich (Leiden University) and Thomas Bäck (Leiden University)
In business and academia we are continuously trying to model and analyze complex processes in order to gain insight and optimize. One of the most popular modeling algorithms is Kriging, or Gaussian Processes. A major bottleneck with Kriging is the processing time of at least O(n^3) and the memory requirement of O(n^2) when applying this algorithm to medium to big data sets. With big data sets, which are more and more available these days, Kriging is not computationally feasible. As a solution to this problem we introduce a hybrid approach in which a number of Kriging models built on disjoint subsets of the data are properly weighted for the predictions. The proposed model is much more efficient than standard Global Kriging in both processing time and memory, and performs equally well in terms of accuracy. The proposed algorithm scales better and is well suited for parallelization.
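To see why building models on disjoint subsets helps, a back-of-the-envelope cost estimate can be sketched from the O(n^3) time and O(n^2) memory figures quoted in the abstract. It assumes, hypothetically (the abstract does not say so), that the n points are split into k clusters of roughly equal size n/k:

```latex
\begin{align*}
\text{time:}\quad   & k \cdot O\!\left(\left(\tfrac{n}{k}\right)^{3}\right) = O\!\left(\tfrac{n^{3}}{k^{2}}\right), \\
\text{memory:}\quad & k \cdot O\!\left(\left(\tfrac{n}{k}\right)^{2}\right) = O\!\left(\tfrac{n^{2}}{k}\right).
\end{align*}
```

For example, with n = 10,000 and k = 10 the cubic term drops by a factor of 100. The cost of clustering the data and of weighting the models is not counted in this estimate.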

Assigning Geo-Relevance of Sentiments Mined from Location-Based Social Media Posts
Randall Sanborn (University of Michigan – Flint), Michael Farmer (University of Michigan – Flint) and Syagnik Banerjee (University of Michigan – Flint)
Broad adoption of smartphones has increased the number of posts generated while people are going about their daily lives. Many of these posts are related to the location where that post is generated. Being able to infer a person’s sentiment toward a given location would be a boon to market researchers. The large percentage of system-generated content in these posts posed difficulties for calculating sentiment and assigning that sentiment to the location associated with the post. Consequently our proposed system implements a sequence of text cleaning functions which was completed with a naive Bayes classifier to determine if a post was more or less likely to be associated with an individual’s present location. The system was tested on a set of nearly 30,000 posts from Foursquare that had been cross-posted to Twitter, which resulted in reasonable precision but with a large number of posts discarded.

On Binary Reduction of Large-scale Multiclass Classification Problems
Bikash Joshi (University Grenoble Alpes – LIG), Massih-Reza Amini (University Grenoble Alpes – LIG), Ioannis Partalas (University Grenoble Alpes – LIG), Liva Ralaivola (University Aix-Marseille – LIF), Nicolas Usunier (Facebook Research) and Eric Gaussier (University Grenoble Alpes – LIG)
In the context of large-scale problems, traditional multiclass classification approaches have to deal with class imbalance and complexity issues which make them inoperative in some extreme cases. In this paper we study a transformation that reduces the initial multiclass classification of examples into a binary classification of pairs of examples and classes. We present generalization error bounds that exhibit the interdependency between the pairs of examples and which recover known results on binary classification with i.i.d. data. We show the efficiency of the deduced algorithm compared to state-of-the-art multiclass classification strategies on two large-scale document collections, especially in the interesting case where the number of classes becomes very large.

Class-based outlier detection: staying zombies or awaiting for resurrection?
Leona Nezvalova (KD Lab, Faculty of Informatics, Masaryk University), Lubos Popelinsky (KD Lab, Faculty of Informatics, Masaryk University), Luis Torgo (F.Sci. U.Porto) and Karel Vaculik (KD Lab, Faculty of Informatics, Masaryk University)
This paper addresses the task of finding outliers within each class in the context of supervised classification problems. Class-based outliers are cases that deviate too much with respect to the cases of the same class. We introduce a novel method for outlier detection in labelled data based on Random Forests and compare it with the existing methods both on artificial and real-world data. We show that it is competitive with the existing methods and sometimes gives more intuitive results. We also provide an overview of outlier detection in labelled data. The main contributions are two methods for class-based outlier description and interpretation.

Data analytics and optimisation for assessing a ride sharing system
Vincent Armant (The Insight Center for Data Analytics), John Horan (Insight Centre), Nahid Mahbub (The Insight Center for Data Analytics) and Ken Brown (University College Cork)
Ride-sharing schemes attempt to reduce road traffic by matching prospective passengers to drivers with spare seats in their cars. To be successful, such schemes require a critical mass of drivers and passengers. In currently deployed implementations, the possible matches are based on heuristics, rather than real route times or distances. In some cases, the heuristics propose infeasible matches; in others, feasible matches are omitted. Poor ride matching is likely to deter participants from using the system. We develop a constraint-based model for acceptable ride matches which incorporates route plans and time windows. Through data analytics on a history of advertised schedules and agreed shared trips, we infer parameters for this model that account for 90% of agreed trips. By applying the inferred model to the advertised schedules, we demonstrate that there is an imbalance between riders and passengers. We assess the potential benefits of persuading existing drivers to switch to becoming passengers if appropriate matches can be found, by solving the inferred model with and without switching. We demonstrate that flexible participation has the potential to reduce the number of unmatched participants by up to 80%.

Time Series Classification with Representation Ensembles
Rafael Giusti (Universidade de São Paulo), Diego Silva (USP) and Gustavo Batista (USP)
Time series have attracted an enormous amount of attention in recent years, with thousands of methods in diverse tasks such as classification, clustering, prediction, and anomaly and motif detection. Among all these tasks, classification is likely to be the most prominent task, accounting for most of the applications and attention from the research community. However, in spite of the huge number of methods available, there is a significant body of empirical evidence indicating that the 1-nearest neighbor algorithm in the time domain is “extremely difficult to beat”. In this paper, we evaluate the use of different data representations in time series classification. Our work is motivated by methods used in related areas such as signal processing and music retrieval. In these areas, a change of representation frequently reveals features that are not apparent in the original data representation. The approach consists of using different representations such as frequency, wavelets, and autocorrelation to transform the time series into alternative decision spaces. A classifier is then used to provide a classification for each test time series in the alternative domain. We investigate how features provided in different domains can help in time series classification. We also experiment with different ensembles to investigate if the data representations are a good source of diversity for time series classification. Our extensive experimental evaluation approaches the issue of combining sets of representations and ensemble strategies, resulting in over 300 ensemble configurations.

Horizon talks

Towards a Data Science Collaboratory
Joaquin Vanschoren (K.U.Leuven), Bernd Bischl (Ludwig-Maximilians-University Munich), Frank Hutter (Albert-Ludwigs-Universitat Freiburg), Michele Sebag (Universite Paris Sud), Balazs Kegl (Universite Paris Sud), Matthias Schmid (Universitat Bonn), Giulio Napolitano (Universitat Bonn), Katy Wolstencroft (Leiden University), Alan R. Williams (University of Manchester) and Neil Lawrence (University of Sheffield)
Data-driven research requires many people from different domains to collaborate efficiently. The domain scientist collects and analyzes scientific data, the data scientist develops new techniques, and the tool developer implements, optimizes and maintains existing techniques to be used throughout science and industry. Today, however, this data science expertise lies fragmented in loosely connected communities and scattered over many people, making it very hard to find the right expertise, data and tools at the right time. Collaborations are typically small and cross-domain knowledge transfer through the literature is slow. Although progress has been made, it is far from easy for one to build on the latest results of the other and collaborate effortlessly across domains. This slows down data-driven research and innovation, drives up costs and exacerbates the risks associated with the inappropriate use of data science techniques. We propose to create an open, online collaboration platform, a ‘collaboratory’ for data-driven research, that brings together data scientists, domain scientists and tool developers on the same platform. It will enable data scientists to evaluate their latest techniques on many current scientific datasets, allow domain scientists to discover which techniques work best on their data, and engage tool developers to share in the latest developments. It will change the scale of collaborations from small to potentially massive, and from periodic to real-time. This will be an inclusive movement operating across academia, healthcare, and industry, and empower more students to engage in data science.

When Learning Indeed Changes the World: Diagnosing Prediction-Induced Drift
Georg Krempl (Knowledge Management and Discovery Lab, University Magdeburg), David Bodnar (University Magdeburg) and Anita Hrubos (University Magdeburg)
A fundamental assumption underlying many prediction systems is that they act as an invisible observer without interfering with the evolution of the population under study. More formally, distributions are assumed to be independent of the system’s previous predictions. Nevertheless, this is violated when the predictor faces intelligent and malevolent adversaries who counteract the classification rules, or when the classification as high-risk might become a self-fulfilling prophecy. We present a unified framework to study self-defeating and self-fulfilling prophecies and to assess such prediction-induced drift. Our results on synthetic datasets are promising but highlight two major challenges: First, while the majority of prediction-induced drifts is correctly detected, detection of self-fulfilling prophecies seems more difficult, requiring further research. Second, this analysis requires knowing the labels assigned by prediction systems, which are currently not published in real-world datasets. Thus, this presentation aims to motivate the community to collect, share and analyse such data on prediction-induced drift.

The Data Problem in Data Mining
Albrecht Zimmermann
Computer science is essentially an applied or engineering science, creating tools. In Data Mining, those tools are supposed to help humans understand large amounts of data, and produce actionable insight. In this talk, I argue that for all the progress that has been made in Data Mining, in particular Pattern Mining, we are lacking understanding of key aspects of the performance and results of pattern mining algorithms. I will focus particularly on the difficulty of deriving actionable knowledge from patterns. I trace the lack of progress regarding those questions to a lack of data with varying, controlled properties, and argue that we will need to make a science of digital data generation, and use it to develop guidance to data practitioners.

PhD Posters

Semantic modeling and analysis of heterogeneous trajectory data sources
Marwa Manaa, Jalel Akaichi

Random neurons for online data streams
Diego Marrón, Jesse Read, Albert Bifet, Nacho Navarro

Complimentarity of Sparse Codes for Image Categorization
Céline Rabouy, Sébastien Paris, Hervé Glotin

Language-Independent Embeddings for Text Summarization
Georgios Balikas and Massih-Reza Amini

Data mining in intensive care: a focus on early detection of acute kidney injury
M Flechet Meyfroidt, F Güiza, M Schetz, P Wouters, I Vanhorebeek, I Derese, J Gunst, G Van den Berghe

A graph based approach for formalizing subjective interestingness of data projections
Bo Kang, Jefrey Lijffijt, Raúl Santos-Rodríguez, Tijl De Bie

Exploiting Correlations across Datasets to Improve Scene Labeling
Damien Fourure, Rémi Emonet, Elisa Fromont, Damien Muselet, Alain Trémeau, Christian Wolf

Complex Event Processing and Pattern Matching Over Streams of Knowledge Graphs
Syed Gillani, Gauthier Picard, Frédérique Laforest

From Numeric Time Series Data to Linguistic Description of Patterns: Data-driven Modelling and Representation of Physiological Sensor Data
Hadi Banaee and Amy Loutfi

Incremental Reasoning on RDFS
Jules Chevalier, Julien Subercaze, Christophe Gravier, Frédérique Laforest

Large-scale Non-smooth Optimization Algorithms for Inverse Problems in Imaging
Rahul Mourya, Loïc Denis, Eric Thiébaut, Jean-Marie Becker

Question answering: Techniques for question analysis and phrase mapping
Dennis Diefenbach

Saint-Étienne

Saint-Étienne is a city in eastern central France. It is located in the Massif Central, 60 km southwest of Lyon in the Rhône-Alpes region. Saint-Étienne is the capital of the Loire département and has a population of approximately 178,500 in the city itself, expanding to over 508,000 in the metropolitan area.

Historic city

The city first appears in the historical record in the Middle Ages. In the 13th century it was a small borough around the church dedicated to Saint Etienne. In the late 15th century it was a fortified village defended by walls built around the original nucleus. From the 16th century, Saint-Étienne developed an arms manufacturing industry and became a market town. It was this which accounted for the town’s importance, although it also became a centre for the manufacture of ribbons and passementerie starting in the 17th century. During the French Revolution, Saint-Étienne was briefly renamed Armeville – ‘arms town’ – because of this activity. Later, it became a coal mining centre, and more recently has become known for its bicycle industry.

Design

In 1990 Saint-Étienne set up a design biennale – the largest of its kind in France. It lasts around a month. The next convention will be in 2017. A landmark in the history of the importance ascribed to design in Saint-Étienne was the inauguration of La Cité du design on the site of the former arms factory in 2009. On 22 November 2010, the city was nominated as “City of Design” as part of Unesco’s Creative Cities Network.

Football in Saint-Étienne

When you are from Saint-Étienne, one thing you are supposed to love is football! Indeed, ASSE (Association Sportive de Saint-Etienne Loire), also called “les verts” (the “greens”), is one of the most successful clubs in French football history. The club won most of its honours in the 60s and 70s. It discovered players such as Michel Platini and Aimé Jacquet (who coached the French national team to victory at the 1998 FIFA World Cup). ASSE plays its home matches near the conference venue at the Geoffroy-Guichard stadium, which is locally called “Le Chaudron” (the “cauldron”). This name comes from the fact that the atmosphere of the stadium is always “boiling”. Indeed, the supporters are known as some of the best in France. If you are curious, you can listen to one of the most popular songs about ASSE: "Allez les verts".

Organizers

General Chair
Matthijs van Leeuwen (KU Leuven, Belgium)

Program Chairs
Tijl De Bie (University of Bristol, United Kingdom)
Elisa Fromont (LaHC, University of Saint-Etienne, France)

Poster & Video Chair
Jesse Read (Aalto University, Finland)

Local Chair
Baptiste Jeudy (LaHC, University of Saint-Etienne, France)

Publicity Chair
Ed Cohen (Imperial College London, United Kingdom)

Sponsorship Chair
François Jacquenet (LaHC, University of Saint-Etienne, France)

Frontier Prize Chairs
Elizabeth Bradley (University of Colorado, USA)
Michael Berthold (University of Konstanz, Germany)

Advisory Chairs
Joost Kok (Universiteit Leiden, Netherlands)
Paul Cohen (University of Arizona, USA)
Nada Lavrač (Jožef Stefan Institute, Slovenia)

Webmaster
Leonor Becerra-Bonache (LaHC, University of Saint-Etienne, France)

Local team (from Saint-Etienne)
Leonor Becerra-Bonache
Romain Deville
Rémi Emonet
Damien Fourure
Elisa Fromont
François Jacquenet
Christophe Gravier
Matthias Gery
Amaury Habrard
Christine Largeron
Emilie Morvant

Sponsors

Frontier Prize sponsor

Video Prize sponsor

Venue

The conference plenary sessions will take place at Télécom Saint-Etienne (Engineering School), 42000 Saint-Etienne, France. Room J020.

The coffee breaks and poster sessions will take place in the building in front of Télécom Saint-Etienne. Room D03.

Venue map (North is at the bottom) :

Getting to the conference venue

To get to Télécom Saint-Etienne by tram:
Tram T1 (Solaure – Hopital Nord), direction Hopital Nord, Cité du design tram stop
or
Tram T2 (Châteaucreux – Terrasse/Hopital Nord), direction Terrasse or Hopital Nord, Cité du design tram stop.
Maps of the tram lines.

STAS (the Saint-Etienne public transport service) offers group rates and young person rates, on sale from the ticket vending machines (at every tram stop) and at STAS sales outlets. You may consider buying a 10-fare ticket for 10 euros if you are travelling from your hotel to the conference venue every day.

Thu 22nd Oct


09:20 Welcome from General Chair

09:30 Invited talk: The Shape of Tweets to Come by Tony Veale
Chair: Elisa Fromont


11:00 Contributed session 1
Chair: Bart Goethals
11:00-11:30 Batch Steepest-Descent-Mildest-Ascent for Interactive Maximum Margin Clustering. Fabian Gieseke, Tapio Pahikkala and Tom Heskes
11:30-12:00 Diversity-driven Widening of Hierarchical Agglomerative Clustering. Alexander Fillbrunn and Michael R. Berthold
12:00-12:30 Diagonal Co-clustering Algorithm for Document-Word Partitioning. Charlotte Laclau and Mohamed Nadif

12:30 Lunch at “La Platine”

14:30 Contributed session 2
Chair: Bruno Crémilleux
14:30-15:00 I-Louvain: an attributed graph clustering method. Christine Largeron, Mathias Géry, David Combe and Elöd Egyed-Zsigmond
15:00-15:45 Horizon Talk: Towards a Data Science Collaboratory. Joaquin Vanschoren, Bernd Bischl, Frank Hutter, Michele Sebag, Balazs Kegl, Matthias Schmid, Giulio Napolitano, Katy Wolstencroft, Alan R. Williams and Neil Lawrence


16:00 PhD poster session

18:00 Social event: city walk

19:30 Cocktail at Nouai Borfa

Fri 23rd Oct


09:30 Invited talk: Combining Weak Statistical Evidence in Cyber Security by Nick Heard
Chair: Tijl De Bie


11:00 Contributed session 3
Chair: Stephen Swift
11:00-11:30 Continuous and Discrete Deep Classifiers for Data Integration. Nataliya Sokolovska
11:30-12:00 Efficient Model Selection for Regularized Classification by Exploiting Unlabeled Data. Georgios Balikas, Ioannis Partalas, Eric Gaussier, Rohit Babbar and Massih-Reza Amini
12:00-12:30 Spotlight talks for the regular poster session


14:00 Contributed session 4
Chair: Hendrik Blockeel
14:00-14:30 A Bayesian Approach for Identifying Multivariate Differences Between Groups. Yuriy Sverchkov and Gregory Cooper
14:30-15:00 Segregation Discovery in a Social Network of Companies. Alessandro Baroni and Salvatore Ruggieri
15:00-15:45 Horizon Talk: When Learning Indeed Changes the World: Diagnosing Prediction-Induced Drift. Georg Krempl, David Bodnar and Anita Hrubos


16:00 Regular poster session

18:00 End of sessions

19:00 Banquet at “Le Grand Cercle”

Sat 24th Oct


09:30 Invited talk: Evidence-Based Optimization by Pascal Van Hentenryck
Chair: Matthijs van Leeuwen


11:00 Contributed session 5
Chair: Michael Berthold
11:00-11:30 Probabilistic Active Learning in Data Streams. Daniel Kottke, Georg Krempl and Myra Spiliopoulou
11:30-12:00 Fast Algorithm Selection using Learning Curves. Jan N. van Rijn, Salisu Mamman Abdulrahman, Pavel Brazdil and Joaquin Vanschoren
12:00-12:30 VoQs: A Web Application for Visualization of Questionnaire Surveys. Xiaowei Zhang, Frank Klawonn, Lorenz Grigull and Werner Lechner


14:00 Contributed session 6
Chair: Joost Kok
14:00-14:30 Automatically Discovering Offensive Patterns in Soccer Match Data. Jan Van Haaren, Vladimir Dzyuba, Siebe Hannosset and Jesse Davis
14:30-15:15 Horizon Talk: The Data Problem in Data Mining. Albrecht Zimmermann

15:15 Prize award ceremony

15:45 IDA council meeting (council only)

18:00 Farewell social event
