  • Discovering Spatial Patterns using Statistically Significant Dependencies

    by

    Mohomed Shazan Mohomed Jabbar

    A thesis submitted in partial fulfillment of the requirements for the degree of

    Master of Science

    Department of Computing Science

    University of Alberta

    © Mohomed Shazan Mohomed Jabbar, 2016

  • Abstract

    Co-location pattern mining is a class of techniques for finding associations among spatial features. It has a wide range of applications, from business to science. Our work is motivated by an application in environmental health where the goal is to investigate whether maternal exposure to air pollutants during pregnancy could be a potential cause of adverse birth outcomes. Discovering such relationships can be defined as finding spatial associations (i.e., co-location patterns) between adverse birth outcomes and air pollutant emissions. However, the increasing complexity of the application problems poses new challenges that traditional approaches are unable to address well. For instance, comparing and contrasting spatial groups is one such complex task posed as a research question in our application problem. Furthermore, traditional co-location pattern mining techniques rely heavily on frequency-based thresholds, which discard underrepresented rare patterns and find exaggerated noisy patterns that may not be equally prevalent in unseen data. To address the limitations of frequency-based methods, some association studies propose using statistical significance tests. A spatial data transactionization mechanism makes it possible to exploit such statistically significant association mining methods to find strong co-location patterns more efficiently. To this end, we propose a novel approach, AGT-Fisher, which transactionizes the spatial data and uses statistically significant dependency rules to find strong co-location patterns more efficiently. Our experiments reveal that the proposed AGT-Fisher could indeed help in finding co-location patterns with better statistical significance. Furthermore, to compare spatial groups we introduce two new spatial patterns, spatial contrast sets and spatial common sets, and techniques based on AGT-Fisher to mine them efficiently. Our evaluation reveals that the contrast sets we found can successfully distinguish one group from the others. We also propose a new visualization framework, VizAR, to interactively visualize complex spatial patterns such as the ones we intend to discover. With the proposed methods and the VizAR tool, we discovered that air pollutants such as heavy metals, NO2, PM2.5, PM10 and TPM are frequently associated with adverse birth outcomes.
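    The two ingredients named in the abstract, turning spatial data into transactions and testing rules for statistical significance, can be illustrated with a minimal sketch. It is not the AGT-Fisher algorithm itself, whose details are given in Chapter 3. It assumes a plain uniform grid in which every non-empty cell becomes one transaction, and it scores a feature pair with a one-sided Fisher's exact test. The function names `grid_transactions` and `fisher_p_value`, the `cell_size` parameter, and the sample points are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from math import comb, floor


def grid_transactions(points, cell_size):
    """Group (x, y, feature) points into one feature set per non-empty grid cell.

    Assumption: a plain uniform grid; the thesis's aggregated grid
    transactionization may differ from this simplification.
    """
    cells = defaultdict(set)
    for x, y, feature in points:
        cells[(floor(x / cell_size), floor(y / cell_size))].add(feature)
    return list(cells.values())


def fisher_p_value(transactions, x, y):
    """One-sided Fisher's exact test p-value for positive dependence of x and y."""
    n = len(transactions)
    n_x = sum(1 for t in transactions if x in t)
    n_y = sum(1 for t in transactions if y in t)
    n_xy = sum(1 for t in transactions if x in t and y in t)
    # Hypergeometric upper tail: probability of observing at least n_xy
    # co-occurrences when the margins n_x and n_y are held fixed.
    tail = sum(
        comb(n_x, k) * comb(n - n_x, n_y - k)
        for k in range(n_xy, min(n_x, n_y) + 1)
    )
    return tail / comb(n, n_y)


if __name__ == "__main__":
    # Hypothetical data: features A and B share two cells, C occupies two others.
    sample = [
        (0.5, 0.5, "A"), (0.7, 0.2, "B"),
        (1.5, 0.5, "A"), (1.2, 0.9, "B"),
        (2.5, 0.5, "C"), (3.5, 0.5, "C"),
    ]
    tx = grid_transactions(sample, cell_size=1.0)
    print(fisher_p_value(tx, "A", "B"))
```

    A smaller p-value indicates that the two features co-occur in the same cells more often than independence would predict. Unlike a frequency threshold, the test does not automatically discard a pair just because it appears in few cells overall.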

  • Preface

    Some of the preliminary methods used in Chapter 3 of this thesis have been published as Jundong Li, Aibek Adilmagambetov, Mohomed Shazan Mohomed Jabbar, Osmar R. Zaïane, Alvaro Osornio-Vargas and Osnat Wine, “On Discovering Co-Location Patterns in Datasets: A Case Study of Pollutants and Child Cancers,” Geoinformatica, vol. 20, issue 4, 651-692. I contributed by performing experiments as well as by writing and editing parts of the manuscript. Osmar R. Zaïane was the supervisory author and was involved with concept formation and manuscript composition. A. Adilmagambetov and J. Li contributed to composing a preliminary version of the manuscript through their MSc research work that preceded this paper. A. O. Vargas and O. Wine contributed insights from the application domain (i.e., pediatrics) in collecting data and designing experiments.

    Part of Chapter 4 is published as Mohomed Shazan Mohomed Jabbar and Osmar R. Zaïane, “Learning Statistically Significant Contrast Sets,” in Proceedings of the 29th Canadian Conference on Artificial Intelligence, 237-242. I was responsible for the data collection and analysis as well as the manuscript composition. Osmar R. Zaïane was the supervisory author and was involved with concept formation and manuscript composition.

  • Acknowledgements

    I would like to express my deepest gratitude to my supervisor and mentor, Prof. Osmar R. Zaïane, for his advice, support, encouragement and guidance throughout my journey. I am extremely fortunate to have a supervisor who genuinely cared about my well-being and success in my work, and who always kept his doors open whenever I ran into any trouble. Thank you, Osmar, for all your invaluable insights and knowledge, constructive feedback, motivation, support and encouragement on this thesis.

    I am also grateful to Prof. Alvaro O. Vargas for his invaluable feedback, continuous support and encouragement throughout this research. My special thanks also go to Jesus and Charlene for helping me prepare the datasets, Saeed for his programming contributions to the VizAR program and Jundong for his support in the work related to the Geoinformatica research paper. I also acknowledge the support of Osnat and Leslie on many occasions. I would also like to thank the whole DoMiNO team for their feedback and support.

    I would also like to thank Dr. Prakeshkumar Shah and the Maternal infant Care (MiCare) research team at the University of Toronto for their support in analyzing adverse birth data from the Canadian Neonatal Network and for hosting me at their lab in Toronto during my short stay there. I am also especially indebted to my committee members, Prof. Mario Nascimento and Prof. Yutaka Yasui, for taking time from their busy schedules to read my thesis, and for providing me with great comments and insightful advice.

    Last but not least, I must express my heartfelt gratitude to my dear friends and family for their continuous support and unconditional love, without which I would not be where I am today.

  • Table of Contents

    1 Introduction
        1.1 Motivation
        1.2 Problem Definition
        1.3 Thesis Statements
        1.4 Thesis Contributions
        1.5 Research Methodology
            1.5.1 Problem Understanding
            1.5.2 Data Collection and Preprocessing
            1.5.3 Designing Analytical Methods
            1.5.4 Evaluation
            1.5.5 Dissemination
        1.6 Outline of the Thesis

    2 Related Work
        2.1 Association Rule Mining
        2.2 Co-location Pattern Mining
        2.3 Contrast Set Mining
            2.3.1 Traditional Approaches
            2.3.2 Association Rule based Methods
            2.3.3 Other Related Methods
        2.4 Discussion

    3 Statistically Significant Spatial Co-location Patterns
        3.1 Background
            3.1.1 Problem Definition
            3.1.2 Related Work
        3.2 AGT-Fisher to Mine Co-location Patterns
            3.2.1 Aggregated Grid Transactionization
            3.2.2 Fisher’s Test to Find Significant Rules
        3.3 Results and Evaluation
            3.3.1 Datasets
            3.3.2 Preprocessing
            3.3.3 Experimental Results
            3.3.4 Evaluation
        3.4 Discussion

    4 Spatial Contrast and Common Sets
        4.1 Background
            4.1.1 Problem Definition
            4.1.2 Related Work
        4.2 Spatial Contrast/Common Set Mining Algorithms
            4.2.1 DiSConS: Discovering Spatial Contrast Sets
            4.2.2 DiSComS: Discovering Spatial Common Sets
        4.3 Results and Evaluation
            4.3.1 Datasets
            4.3.2 Preprocessing
            4.3.3 Experimental Results
            4.3.4 Evaluation
        4.4 Discussion

    5 VizAR: A Visualization Fram