Algorithms for analyzing spatio-temporal data PhD defense Abhinandan Nath Department of Computer Science Duke University Committee : Pankaj K. Agarwal (supervisor) Kamesh Munagala Rong Ge Yusu Wang

Algorithms for analyzing spatio-temporal data

PhD defense
Abhinandan Nath

Department of Computer Science, Duke University

Committee: Pankaj K. Agarwal (supervisor), Kamesh Munagala, Rong Ge, Yusu Wang

Introduction

The Data Deluge

“Mankind created 150 exabytes (billion gigabytes) of data in 2005. This year, it will create 1,200 exabytes.”

- The Economist, 2010

https://www.economist.com/node/15579717

Some (more) numbers ...

USGS National Elevation data (10 metre resolution) [Dewberry, 2012]

NYC taxi pickup and dropoff data, 2009-2016: 1.3 billion points [towardsdatascience.com]

Geometric flavor of data

● Many data sets geometric in nature

● Problems in other domains can be mapped to geometric domain

– e.g., SELECT query in relational databases

NAME     AGE  SALARY
Alice    26   30,000
Bob      30   35,000
Charlie  28   25,000
...      ...  ...
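The mapping to the geometric domain can be sketched as follows: each row becomes a point (AGE, SALARY) in the plane, and a SELECT with range conditions on both columns becomes an axis-parallel rectangle query. This is only a minimal illustration; the function name `range_query` and the table name `employees` in the comment are hypothetical and do not come from the slides.

```python
# Rows of the example table, viewed as points (age, salary) in the plane.
rows = [("Alice", 26, 30000), ("Bob", 30, 35000), ("Charlie", 28, 25000)]

def range_query(rows, age_range, salary_range):
    """Names of rows whose (age, salary) point lies inside the rectangle.

    Mirrors a query of the form (hypothetical table name):
      SELECT name FROM employees
      WHERE age BETWEEN lo_a AND hi_a AND salary BETWEEN lo_s AND hi_s
    """
    (lo_a, hi_a), (lo_s, hi_s) = age_range, salary_range
    return [name for name, age, salary in rows
            if lo_a <= age <= hi_a and lo_s <= salary <= hi_s]

# Orthogonal range query: the rectangle [25, 29] x [20000, 32000].
result = range_query(rows, (25, 29), (20000, 32000))
```

A linear scan like this is what an index such as a kd-tree or a range tree is meant to avoid.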

Challenges

Massive data sets that
– Are noisy [towardsdatascience.com]
– Have outliers
– Are incomplete
– Are time-varying, e.g., trajectories

My Research

● Use techniques from computational geometry and topology to tackle some of these challenges in geometric data sets

● Design algorithms that
– Are practical
– Have provable performance guarantees

Broad themes

● Distributed algorithms
– Inspired by frameworks like MapReduce [Dean & Ghemawat, 2008] and Spark [Zaharia et al., 2010]

● Succinct descriptors
– Concisely encode desired properties of big data sets
– Noise-robust proxies for data sets
– Clustering

At a glance

Distributed algorithms
– Indices to answer range and nearest-neighbor queries [AFMN, 2016]
– Triangulation & contour tree of massive terrains [AFMN, 2016]

Succinct descriptors
– Comparing merge trees of real-valued functions [AFNSW, 2015]
– Common movement patterns from trajectory data [AFMNPT, 2018]

Distributed model of computation

● Massively Parallel Communication (MPC) model [Beame et al., 2013]

● Captures salient features of modern frameworks like MapReduce [Dean & Ghemawat, 2008]

MPC model of computation

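● No. of machines
● Input of size n, distributed across the machines
● Each machine has O(s) storage
● Assumption relating these parameters [formula lost in extraction]

[Figure: machines with O(s) storage each, connected through a communication medium; input size n]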

MPC model of computation

● Computation proceeds in rounds
– In each round, each machine computes on local data

● Communication between machines occurs between rounds

● No. of messages sent/received by any machine in a round bounded by
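The round structure can be illustrated with a toy simulation that adds up n numbers in two rounds. This is only a sketch under stated assumptions: the function name `mpc_sum` is hypothetical, and the cap of s messages per machine per round is an assumption, since the bound is not preserved in the text above.

```python
def mpc_sum(values, num_machines, s):
    """Toy simulation: sum `values` on `num_machines` machines with storage s."""
    # Distribute the input: machine i holds a block of at most s values.
    blocks = [values[i * s:(i + 1) * s] for i in range(num_machines)]
    assert sum(len(b) for b in blocks) == len(values), "input does not fit"

    # Round 1: every machine computes on its local data only.
    partial = [sum(block) for block in blocks]

    # Communication between rounds: each machine sends one message to machine 0.
    inbox_machine_0 = list(partial)
    # Assumed cap (not from the slides): at most s messages per machine per round.
    assert len(inbox_machine_0) <= s, "machine 0 would receive too many messages"

    # Round 2: machine 0 combines the partial sums.
    total = sum(inbox_machine_0)
    return total, 2  # result and number of rounds used

total, rounds = mpc_sum(list(range(100)), num_machines=10, s=10)
```

If there were more than s machines, the partial sums would have to be combined through a tree of machines over several rounds, which is why the relation between the parameters matters for the round count.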

Performance measures

● No. of rounds of computation
● Running time, defined in terms of the running time of each machine in each round
● Total work
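The formulas for these measures are not preserved in the text above. A common convention, assumed here rather than taken from the slides, is that the running time adds up the slowest machine of each round, and the total work adds up the time of every machine in every round. The name `mpc_measures` and the layout of the table `t` are hypothetical.

```python
def mpc_measures(t):
    """t[j][i] = running time of machine i in round j (hypothetical layout)."""
    rounds = len(t)
    # Assumed convention: a round lasts as long as its slowest machine.
    running_time = sum(max(round_times) for round_times in t)
    # Assumed convention: work counts the time spent by all machines.
    total_work = sum(sum(round_times) for round_times in t)
    return rounds, running_time, total_work

# Two rounds on three machines.
measures = mpc_measures([[3, 1, 2], [1, 4, 1]])
```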

Indices to answer range and nearest-neighbor queries [AFMN, 2016]: joint work with Pankaj K. Agarwal, Kyle Fox & Kamesh Munagala

Indexing big data

● Query big data sets faster, but how?

– Build an index!

● Consider geometric queries
– Orthogonal range queries
– Nearest-neighbor queries

Previous work

● Work on conjunctive and join queries, graph processing in MapReduce and its variants [Lee et al., 2012; Qin et al., 2014; Malewicz et al., 2010; Beame et al., 2013; Koutris et al., 2018; ...]

● Geometric queries: MapReduce implementations for analyzing and querying spatial and geometric data [Eldawy et al., 2013, 2015; Arabi et al., 2014; …], with no provable performance guarantees!!

Our work

Build and query distributed variants of the following classical data structures, with provable performance guarantees

– Orthogonal range searching
  ● Kd-tree [Bentley, 1975]
  ● Range tree [Bentley, 1980]

– Nearest-neighbor searching
  ● Balanced Box Decomposition (BBD)-tree [Arya et al., 1998]

Our results

Notation used in the bounds:
– Total no. of input points
– Total no. of points reported for a range query
– Max no. of points reported by a machine for a range query

Our results

● Kd-tree:
– Construction: rounds, time, work
– Query: rounds, time, work; optimal if each point can be stored exactly once

Also extends to partition trees [Chan, 2012] for simplex range searching

● Range tree:
– Construction: rounds, time, work
– Query: rounds, time, and work

● BBD-tree:
– Construction: rounds, time, work
– Query: rounds, time, and work

Key idea : random sampling

● Data structures based on balanced hierarchical partitioning of input points represented as a tree

● Approximate this partitioning using a small random sample of input!

Balanced partitioning on random sample leads to balanced partitioning on entire set!!
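The sampling idea can be tried out on a small experiment: compute kd-tree-style median splits from a random sample only, then count how many of the full input's points land in each resulting cell. This is a minimal sketch and not the distributed algorithm from the talk; the function names, the sample size, the depth, and the balance thresholds in the checks are all illustrative assumptions.

```python
import random

def build_splits(sample, depth, axis=0):
    """Median splits (alternating x and y) computed from the sample alone."""
    if depth == 0 or len(sample) < 2:
        return None
    sample = sorted(sample, key=lambda p: p[axis])
    mid = len(sample) // 2
    split = sample[mid][axis]
    nxt = (axis + 1) % 2
    return (axis, split,
            build_splits(sample[:mid], depth - 1, nxt),
            build_splits(sample[mid:], depth - 1, nxt))

def leaf_sizes(node, points):
    """Route all points through the splits and report the size of each cell."""
    if node is None:
        return [len(points)]
    axis, split, left, right = node
    below = [p for p in points if p[axis] < split]
    above = [p for p in points if p[axis] >= split]
    return leaf_sizes(left, below) + leaf_sizes(right, above)

rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(20000)]
sample = rng.sample(points, 2000)            # 10% sample, arbitrary choice
sizes = leaf_sizes(build_splits(sample, depth=3), points)
ideal = len(points) / len(sizes)             # perfectly balanced cell size
```

In a run like this, every cell size is expected to stay close to `ideal`, even though the splits never looked at 90% of the points.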

Triangulation & contour tree of massive terrains [AFMN, 2016]: joint work with Pankaj K. Agarwal, Kyle Fox & Kamesh Munagala

Terrain modeling

Airborne LiDAR scanning [http://www.lgs.ie/airborne-lidar.shtml]

Raw elevation data (3D point cloud) [kellylab.berkeley.edu]

Digital Elevation Model (DEM) [gisgeography.com/free-global-dem-data-sources/]

From 3D point cloud to DEM

● Terrain – xy-monotone surface in ℝ³

● Graph of a height function

● Often stored as a triangulated irregular network (TIN)

● How to build TINs and perform terrain analysis in the MPC model?

Our Work

● Build TIN model, using Delaunay triangulation

● Compute the contour tree to succinctly encode all contours of terrain

Input points in ℝ³ → Build terrain model → Build contour tree → Use contour tree

Many applications, e.g., waterflow prediction, climate model viz.

Prior Work

● Delaunay triangulation
– RAM and I/O model [Crauser et al., 2001]
– PRAM algorithms [Blelloch et al., 1999]
– Goodrich's algorithm [Goodrich, 1997] can be adapted to MPC model – too complicated
– SpatialHadoop [Eldawy et al., 2015] – no theoretical bounds

● Contour tree
– RAM and I/O model [Carr et al., 2003; Pascucci and Cole-McLaughlin, 2002; Agarwal et al., 2010; …]
– Distributed and parallel algorithms [Morozov and Weber, 2013, 2014; Pascucci and Cole-McLaughlin, 2003; Acharya and Natarajan, 2015; ...]

Our results

● Given points, compute its Delaunay triangulation in rounds, time, and work, with high probability

● Given a terrain of size , compute its contour tree in rounds, time, and work

Build terrain model

Delaunay Triangulation

● Given a set of points in the plane, a triangulation of the set is Delaunay if
– No triangle contains any point of the set in the interior of its circumcircle

● Many useful properties, e.g., avoids skinny triangles

[gamedev.stackexchange.com]
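The empty-circumcircle condition can be checked directly with the standard in-circle determinant. The sketch below is a brute-force check for small inputs, not part of the talk's algorithm; the function names are hypothetical, and integer coordinates are used so that the determinant signs are exact.

```python
def orient(a, b, c):
    """Positive if a, b, c are in counterclockwise order."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def in_circumcircle(a, b, c, d):
    """True if d lies strictly inside the circumcircle of triangle abc."""
    adx, ady = a[0] - d[0], a[1] - d[1]
    bdx, bdy = b[0] - d[0], b[1] - d[1]
    cdx, cdy = c[0] - d[0], c[1] - d[1]
    ad2 = adx * adx + ady * ady
    bd2 = bdx * bdx + bdy * bdy
    cd2 = cdx * cdx + cdy * cdy
    det = (adx * (bdy * cd2 - bd2 * cdy)
           - ady * (bdx * cd2 - bd2 * cdx)
           + ad2 * (bdx * cdy - bdy * cdx))
    # The sign of det depends on the orientation of abc, so normalize by it.
    return det * orient(a, b, c) > 0

def is_delaunay(points, triangles):
    """triangles: index triples into points. Brute force over all points."""
    for i, j, k in triangles:
        for m, p in enumerate(points):
            if m not in (i, j, k) and in_circumcircle(points[i], points[j], points[k], p):
                return False
    return True

# Four points forming a quadrilateral, triangulated by either diagonal.
pts = [(0, 0), (2, 0), (1, 2), (1, -1)]
good = is_delaunay(pts, [(0, 1, 2), (0, 3, 1)])   # diagonal (0,0)-(2,0)
bad = is_delaunay(pts, [(0, 3, 2), (3, 1, 2)])    # diagonal (1,-1)-(1,2)
```

Here the first triangulation is Delaunay, while the second one has the point (2, 0) inside the circumcircle of the triangle (0, 0), (1, -1), (1, 2), which is also the triangulation with the skinnier triangles.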

Basic idea

● Randomly sample a small set of points and compute the triangulation of the sample

● Use the triangulation of the sample to split the input into smaller chunks

● Recurse on each chunk in parallel

Algorithm

1. Given points stored across many machines, draw a small random sample and send it to one machine

2. Compute the triangulation of the sample, and use it to distribute the input points to disjoint machines

With slight changes, it can be shown that each chunk has size with high probability

3. Recursively compute the Delaunay triangulation of each chunk in parallel. Unnecessary triangles can be filtered out by simple geometric tests to get the Delaunay triangulation of the whole input

Analysis

● No. of levels of recursion is

● Each level takes rounds, time, and work

Build contour tree

Level sets and contours

● Terrain given as a triangulation

● Height function
– Defined on each vertex
– Linearly interpolated within each face (triangle)

● Level set: all points of the terrain at a given height

● Contour: connected component of a level set
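Because the height is linearly interpolated inside each triangle, the level set at a given height meets a triangle in a straight segment whose endpoints lie on the triangle's edges. A minimal sketch of that computation follows; the function name is hypothetical, and it assumes the generic case where no vertex lies exactly at the query height.

```python
def level_set_segment(tri, heights, level):
    """Points where the level set at `level` crosses the edges of one triangle.

    tri: three (x, y) vertices; heights: the height at each vertex.
    Assumes no vertex has height exactly equal to `level`.
    """
    crossings = []
    for i, j in ((0, 1), (1, 2), (2, 0)):
        h_i, h_j = heights[i], heights[j]
        # The edge is crossed only if its endpoints lie on opposite sides.
        if (h_i - level) * (h_j - level) < 0:
            t = (level - h_i) / (h_j - h_i)   # linear interpolation parameter
            (x_i, y_i), (x_j, y_j) = tri[i], tri[j]
            crossings.append((x_i + t * (x_j - x_i), y_i + t * (y_j - y_i)))
    return crossings

# Triangle with heights 0, 2, 2 at its vertices, cut at height 1.
segment = level_set_segment([(0, 0), (2, 0), (0, 2)], [0, 2, 2], 1)
```

Joining such segments across neighboring triangles traces out the contours, one closed or boundary-to-boundary curve per connected component.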

Topology changes at saddle points

Image from [Agarwal et al., 2015]

Contour tree

● Obtained by contracting each contour of the terrain to a point

[Image from Agarwal et al., 2015]
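To see where the tree's branching comes from, it helps to sweep the height downward and watch connected components appear at maxima and merge at saddles. The sketch below is a simplification and an assumption of this note, not the talk's algorithm: it tracks only components of the superlevel set (the information a merge tree records, not the full contour tree), on a graph of vertices and edges, using union-find. All names are hypothetical.

```python
def superlevel_component_counts(heights, edges):
    """Sweep from high to low; after adding each vertex, record
    (height, number of connected components of the superlevel set)."""
    n = len(heights)
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    active = [False] * n
    components = 0
    history = []
    for v in sorted(range(n), key=lambda i: -heights[i]):
        active[v] = True
        components += 1                     # v starts as its own component
        for u in adj[v]:
            if active[u]:
                ru, rv = find(u), find(v)
                if ru != rv:                # v joins or merges components
                    parent[ru] = rv
                    components -= 1
        history.append((heights[v], components))
    return history

# A path with two peaks (heights 5 and 4) separated by a dip at height 2.
history = superlevel_component_counts([1, 5, 2, 4, 0],
                                      [(0, 1), (1, 2), (2, 3), (3, 4)])
```

In this example a second component appears at height 4 and the two merge at height 2, which plays the role of the saddle.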

Our contribution

A simple and efficient divide-and-conquer algorithm to build and store the contour tree of a massive triangulated terrain in MPC model

Storage

● Contour tree stored in a distributed fashion

– Top subtree : a sized subtree stored on one machine

– Remaining subtrees stored on other machines, reached through stored pointers

[Figure: contour tree split into a top subtree and remaining subtrees; nodes labeled α1–α5, x1–x4, y1, y2]

Algorithm (divide step)

1. Split the terrain into smaller chunks
  ● Each chunk has the same no. of points and goes to a disjoint set of machines

Algorithm (conquer step)

2. Compute distributed contour trees of each chunk recursively in parallel

Algorithm (merge step)

3. Combine the chunks' contour trees to get the contour tree of the whole terrain
– Minimize interaction b/w neighboring chunks
– Take advantage of data distribution and triangulation

Our main result

Given a terrain of size , designed algorithm to compute its contour tree in rounds, time, and work

● These bounds are worst-case optimal !

Comparing merge trees of real-valued functions [AFNSW, 2015]: joint work with Pankaj K. Agarwal, Kyle Fox, Tasos Sidiropoulos & Yusu Wang

Skipped in this talk.

Common movement patterns from trajectory data [AFMNPT, 2018]: joint work with Pankaj K. Agarwal, Kyle Fox, Kamesh Munagala, Jiangwei Pan & Erin Taylor

Trajectory data

● Huge data available

– Improve decision making

– Gain insights

● Noisy and incomplete

● Several computational challenges

[https://www.sundried.com]

[developer.huawei.com]

Motivation

● Subtrajectory clusters capture common portions

● Different from clustering trajectories as a whole

● Extract high-level shared structure from large trajectory data sets

Pathlet

Representative pathlet for each cluster
– Cluster “center”
– Pathlet is a curve, not necessarily part of the input

Application of pathlets

● Compression of large trajectory data [Chen et al. 2013]

– Hope that each trajectory can be reconstructed with small no. of pathlets

– Small pathlet dictionary - non-linear dimension reduction

● Reconstructing road network from trajectory data [Li et al. 2013; Buchin et al. 2017]

Our contribution

● Model for subtrajectory clustering
– Robust to noise and missing data
– Data-driven clusters and pathlets

● NP-hardness of subtrajectory clustering problem

● Provably-efficient approximation algorithms
– Faster algorithms for realistic inputs

● Experimental results

Previous work

● Graph setting – no noise or gaps [Chen et al. 2013]

● Based only on point density [Panagiotakis et al. 2012]

● Restricted to line segments [Lee et al. 2007]

● Search for pre-defined patterns [Fan et al. 2016; Tang et al. 2013; Wang et al. 2015; Zheng et al. 2013]

None of these have provable performance guarantees!!

Model and problem formulation

Model inputs:
– A set of trajectories
– Each trajectory is a sequence of points in the plane
  ● A subtrajectory is a subsequence of a trajectory
– All trajectory points taken together form the input point set

Objective function

● Need a small # of pathlets

● Measure of cluster quality

● Fraction of points unassigned for each trajectory : “gaps”
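A minimal sketch of how such an objective could be evaluated. The weighted-sum form, the weights `c1`, `c2`, `c3`, and the function names are assumptions made for illustration; the slides only name the three ingredients (number of pathlets, cluster quality, gaps).

```python
from typing import List, Tuple


def gap_fraction(traj_len: int, assigned: List[Tuple[int, int]]) -> float:
    """Fraction of a trajectory's points not covered by any assigned subtrajectory.

    Each (start, end) pair is an inclusive index range into the trajectory.
    """
    covered = set()
    for start, end in assigned:
        covered.update(range(start, end + 1))
    return (traj_len - len(covered)) / traj_len


def objective(num_pathlets: int,
              gap_fractions: List[float],
              cluster_costs: List[float],
              c1: float = 1.0, c2: float = 1.0, c3: float = 1.0) -> float:
    """Hypothetical weighted sum of the three ingredients named on the slide."""
    return c1 * num_pathlets + c2 * sum(gap_fractions) + c3 * sum(cluster_costs)


# A trajectory with 4 points where only points 0..1 are assigned: half is a gap.
half = gap_fraction(4, [(0, 1)])
total = objective(2, [half, 0.0], [1.0, 3.0])
```

The cluster costs are passed in as plain numbers here, so the sketch does not commit to a particular quality measure.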



A note on the distance

We use discrete Fréchet distance

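Given two point sequences :

● A correspondence is a set of pairs, one pt. from each sequence, s.t. every pt. is in at least one pair

● A correspondence is monotone if the pairs preserve the order along both sequences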


Discrete Fréchet distance

Over the set of all monotone correspondences b/w the two sequences, minimize the maximum distance b/w paired pts.
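The discrete Fréchet distance can be computed with the standard quadratic-time dynamic program. The sketch below is the textbook recurrence, not code from the thesis; the function name `discrete_frechet` is chosen here for illustration, and Euclidean distance between points is assumed.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def discrete_frechet(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Discrete Fréchet distance between two non-empty point sequences.

    dp[i][j] is the smallest possible maximum pair distance over monotone
    correspondences between the prefixes a[:i + 1] and b[:j + 1].
    """
    if not a or not b:
        raise ValueError("both sequences must be non-empty")
    p, q = len(a), len(b)
    dp: List[List[float]] = [[0.0] * q for _ in range(p)]
    for i in range(p):
        for j in range(q):
            d = math.dist(a[i], b[j])
            if i == 0 and j == 0:
                dp[i][j] = d
            elif i == 0:
                dp[i][j] = max(dp[0][j - 1], d)
            elif j == 0:
                dp[i][j] = max(dp[i - 1][0], d)
            else:
                # Advance along a, along b, or along both.
                best_prev = min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
                dp[i][j] = max(best_prev, d)
    return dp[p - 1][q - 1]


parallel = discrete_frechet([(0, 0), (1, 0), (2, 0)], [(0, 1), (1, 1), (2, 1)])
coarse = discrete_frechet([(0, 0), (2, 0)], [(0, 0), (1, 0), (2, 0)])
```

In the second example the middle point of the longer sequence must be paired with one of the two endpoints, so the distance is 1 even though both sequences trace the same segment.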


Choosing pathlets

Given the trajectories, the goal is to choose pathlets from a set of candidate pathlets to minimize the objective function

● If the candidate set is given as input : pathlet-cover problem

● If it is not given but assumed to be the (uncountably) infinite set of all trajectories in the plane : subtrajectory-clustering problem


Basic idea

● Reduce to set-cover

● Solve using greedy algorithm : gives a logarithmic-factor approximation

● Challenge : implementing greedy step efficiently


Set-cover

Input :

● Set system

● Cost of each set

Goal is to find a subcollection of sets of minimum total cost such that every element is covered


From pathlet-cover to set-cover

● The set system has two kinds of sets :

– For each trajectory pt., a set for that pt. with its own cost

Corresponds to treating the pt. as a gap in pathlet cover

– For each candidate pathlet and for any set of subtraj., a set with its own cost

Corresponds to assigning the subtraj. in that set to the pathlet

Exponential # sets : cannot construct explicitly!!


Theorem : There exists a bijection between feasible solutions of the pathlet-cover instance and of the set-cover instance that preserves cost


Greedy algorithm for set-cover

Initialize the solution to be empty

● At each step add to the solution the set in the set system that maximizes the coverage-to-cost ratio

● Stop when all points are covered
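The greedy rule above, for an explicitly given set system, can be sketched as follows. This is the generic weighted set-cover greedy, assuming positive costs; the names `greedy_set_cover`, `sets` and `cost` are illustrative. It does not address the exponentially many sets of the pathlet-cover reduction, which are never constructed explicitly.

```python
from typing import Dict, Hashable, Iterable, List, Set


def greedy_set_cover(universe: Iterable[Hashable],
                     sets: Dict[str, Set[Hashable]],
                     cost: Dict[str, float]) -> List[str]:
    """Pick sets greedily by coverage-to-cost ratio until everything is covered."""
    uncovered = set(universe)
    chosen: List[str] = []
    while uncovered:
        best_name, best_ratio = None, 0.0
        for name, members in sets.items():
            gain = len(uncovered & members)   # newly covered elements
            if gain == 0:
                continue
            ratio = gain / cost[name]         # assumes cost[name] > 0
            if ratio > best_ratio:
                best_name, best_ratio = name, ratio
        if best_name is None:
            raise ValueError("the given sets cannot cover the universe")
        chosen.append(best_name)
        uncovered -= sets[best_name]
    return chosen


sets = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5}, "D": {1, 2, 3, 4, 5}}
cost = {"A": 1.0, "B": 1.0, "C": 1.0, "D": 4.0}
solution = greedy_set_cover({1, 2, 3, 4, 5}, sets, cost)
```

In the example, "A" wins the first step with ratio 3, then "C" covers the two remaining elements with ratio 2, for a total cost of 2 instead of 4 for "D" alone.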


Coverage-to-cost ratio

● For each set in the set system, the coverage-to-cost ratio is the # of its uncovered pts. divided by its cost

– For a set of subtraj. assigned to a pathlet : count the uncovered pts. of those subtraj.

– For a single pt. treated as a gap : the coverage is 1 if the pt. is not yet covered, 0 otherwise


Implementing greedy step

For each candidate pathlet, need to compute the set of subtraj. that maximizes the coverage-to-cost ratio

– Tricky, since we do not construct these sets at all !

● Best set for a pathlet can be found in poly-time without explicitly constructing all the sets !!

– Can decompose the ratio into a contribution corresponding to each traj.

– Independently choose “best” subtraj. from each traj.


Our result

● Theorem : The greedy algorithm computes an approximate solution to the pathlet-cover problem in polynomial time


Subtrajectory clustering

Set of candidate pathlets not given, assumed to be all possible trajectories


Reducing # candidate pathlets

● The distance satisfies triangle inequality :

– Let candidate pathlets be subtraj. of input traj.

– # candidate pathlets is finite

– Optimal solution cost increases by factor of 2


● Further reduction :

– Can reduce # candidate pathlets further

– Cost increases by an additional factor
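The factor of 2 obtained from the triangle inequality can be seen from a short derivation. The notation is assumed here, not taken from the slides: d is the distance, P* is a pathlet of an optimal solution, S_1, ..., S_k are the subtrajectories assigned to it, and r is the largest distance from any S_i to P*.

```latex
% Assumed notation: r = \max_i d(S_i, P^*); replace P^* by the input subtrajectory S_1.
\begin{align*}
d(S_i, S_1) &\le d(S_i, P^*) + d(P^*, S_1) \\
            &\le r + r = 2r \qquad \text{for every } i = 1, \dots, k .
\end{align*}
```

Replacing P* by S_1 keeps the # of pathlets and the gaps unchanged, and at most doubles the distance from each assigned subtraj. to its pathlet, so restricting candidates to subtraj. of the input costs at most a factor of 2.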


Improved running time

● For realistic inputs can achieve more speed-up

– For each pathlet only a small # of subtraj. is assigned from each traj.

● Theorem : For realistic curves using Fréchet distance, can compute an approximate solution to the subtrajectory clustering problem with an improved running time


Experiments : data sets

Real data sets :

● Beijing taxi data [Tsinghua University]

– 28,000 cabs over 4 days

– 9 mil. points

– Incomplete and sparse

● GeoLife [Microsoft Research Asia]

– Pedestrian data of 182 users over 4 years

– ~2,600 trajs.

– ~1.5 mill. pts.

● Cycling

– 37 traj.

– 106,000 pts.

– Has self-intersections and loops

Synthetic data sets :

● RTP

– Traffic data generated by web-based tool [http://mntg.cs.umn.edu/tg/index.php]

– Research Triangle in NC

– ~20,000 traj.

– ~1 mill. pts.


Dense & popular regions


Common trajectory portions


Handling noise


Gaps


Data-driven pathlets


Summary

● Indexing big data

● Massive terrain analysis

● Comparing merge trees - briefly

● Extracting common movement patterns from trajectories


Future directions

● MPC model

– Point location queries, multiway separators for planar graphs ...

– Big open problem – general graph connectivity in a sublogarithmic # of rounds

– Other open problems in parallel query processing in databases [Koutris et al. 2018]

● Gromov-Hausdorff distance

– Big gap b/w upper and lower bound :(

– More research into additive distortion of metric embeddings


● Trajectory clustering

– Efficient approx. algorithms for k-center, k-median, k-means for, say, Fréchet distance

– Stumbling block – infinite doubling dimension

– Work by [Driemel et al. 2016] on clustering time-series data

● Running time is exponential in complexity of cluster centers – assumed to be constant

● Is it a good assumption??

– What are good assumptions? Perturbation resilience? Stability?

● Can anything interesting be proved ?


Acknowledgements


Committee

Pankaj

Kamesh Rong Yusu


Collaborators

Pankaj Kamesh Yusu Kyle Tasos

Jiangwei Erin


Theory group


CS@Duke

● Ergys and Cassie; other students ...

● Marilyn, Pam, Celeste, Alison, Kathleen …

● CS Lab staff


Outside Duke
