Similarity of mobile users based on sparse location historycs.joensuu.fi/sipu/pub/LocationSimilarity.pdf · based on sparse location history. Pasi Fränti. Radu Mariescu-Istodor

Similarity of mobile usersbased on sparse location history

Pasi FräntiRadu Mariescu-Istodor

Karol Waga

21.2.2019

P. Fränti, R. Mariescu-Istodor and K. Waga

”Similarity of mobile users based on sparse location history”

Int. Conf. Artificial Intelligence and Soft Computing (ICAISC)

Zakopane, Poland, 593-603, June 2018

Introduction

Motivation

•

Cold start users in Recommender systems

•

Activity in the same area •

Opinions of local experts more valuable [Bao’2012]

•

Identify the same user

Existing methods

•

Complete trajectories [Ying’10]•

Longest common subsequence with speed and turn points [Liu’12]

•

Stop points (>30 min) [Zheng’10]

•

Most common stops: home and work [Biagioni’13]

•

User given priority for parts more [Wang’12]

Problems?

•

Full GPS trajectories not always stored

•

Areas of activity are often more important than the exact tracks

•

Explicit activities such as visits, check-ins and photo taking are widely recorded.

Test case: APR trio

Andrei 201214

412

8

4

5

3

4

20

7

9 20

5

3

Andrei 2013

46

11

6

8

3

Andrei 2014

Pasi 2012

10

12

Pasi 2013

8

5

8

5

Pasi 2014

9

9

Radu 2012

18

28

Radu 2013

13

11

16

Radu 2014

6

21

8

11

A11

A13

A12

A14

46

11

6

8

3

20

7

9 20

5

3

14

412

8

4

5

3

410

6

6

11

Example of User A

Pasi 2011

Pasi 2013

Pasi 2012

Pasi 2014

5

610

12

8

4

5

8

59

9

Example of User P

Radu 2011

Radu 2013

Radu 2012

Radu 2014

19

32

18

28

13

11 6

16 21

8

11

Example of User R

Two approaches

1. Histogram comparison-

Construct histogram from some common data

-

Map points to the bins-

Histogram matching of the two counts

2. Clustering comparison-

Cluster each data separately

-

Map centroids between the data-

Calculate similarity based on the mapping

Selected approaches

1. Histogram comparison-

Construct histogram from some common data

-

Map points to the bins-

Histogram matching of the two counts

2. Clustering comparison-

Cluster each data separately

-

Map centroids between the data-

Calculate similarity based on the mapping

Approach 1:Histograms

Location historyActivity points of users

Histogram matchingCreate bins from popular locations

Location similarityCount visit statistics

1 0 0

1 0 0

0 0 0

0 0 3

0 0 2

2 2 0

1 1 1

4 3 19

76

Histogrambins

Visitcounts

Histogram of the toy example

4 3

1

2 2

0

0 0

3

1 1

1

0 0

2

1 0

0

1 0

0

0 0

0

9 6 7

8

43

3

2

1

1

0

0.44 0.50

0.14

0.22 0.33

0.00

0.00 0.00

0.43

0.11 0.17

0.14

0.00 0.00

0.28

0.11 0.00

0.00

0.11 0.00

0.00

0.00 0.00

0.00

Pekka Bartek Julia Pekka Bartek JuliaFrequencies Relative frequencies

Distance measures

Bhattacharyya distance

4 3

1

2 2

0

0 0

3

1 1

1

0 0

2

1 0

0

1 0

0

0 0

0

9 6 7

8

43

3

2

1

1

0

0.44 0.50

0.14

0.22 0.33

0.00

0.00 0.00

0.43

0.11 0.17

0.14

0.00 0.00

0.28

0.11 0.00

0.00

0.11 0.00

0.00

0.00 0.00

0.00

ii qpD ln

0.47

0.27

0.00

0.14

0.00

0.00

0.00

0.00

= 0.88

-ln = 0.13

0.26

0.00

0.00

0.15

0.00

0.00

0.00

0.000.410.89

Pekka Bartek Julia Pekka Bartek JuliaFrequencies Relative frequencies

Pekkavs

Bartek

Bartekvs

Julia

L1 distance

4 3

1

2 2

0

0 0

3

1 1

1

0 0

2

1 0

0

1 0

0

0 0

0

9 6 7

8

43

3

2

1

1

0

0.44 0.50

0.14

0.22 0.33

0.00

0.00 0.00

0.43

0.11 0.17

0.14

0.00 0.00

0.28

0.11 0.00

0.00

0.11 0.00

0.00

0.00 0.00

0.00

0.06

0.11

0.00

0.06

0.00

0.11

0.11

0.00

= 0.42

0.36

0.33

0.43

0.03

0.28

0.00

0.00

0.001.43

i

ii qpL1

4 3

1

2 2

0

0 0

3

1 1

1

0 0

2

1 0

0

1 0

0

0 0

0

9 6 7

8

43

3

2

1

1

0

0.44 0.50

0.14

0.22 0.33

0.00

0.00 0.00

0.43

0.11 0.17

0.14

0.00 0.00

0.28

0.11 0.00

0.00

0.11 0.00

0.00

0.00 0.00

0.00

0.36%

1.21%

0.00%

0.36%

0.00%

1.21%

1.21%

0.00%

= 0.04

13%

11%

18%

0%

8%

0%

0%

0%0.50

i

ii qpL 22

L2 distance

•

Uniform global grid (e.g. 2525 m)•

Some known places (Mopsi services)

•

All user data from Mopsi•

The two data as such:-

Cluster the joint activity data

-

Each cluster represents one bin-

Count blues and reds in the cluster using any histogram matching technique

How to create histogram bins

Approach 2:Clustering

Clustering of two distributions

6

10

4

1 7/1 = 7

9/6 = 1.5

4/0.5 = 8

6/0.5 = 12

Experimental results

Collected data

Municipalities:Municipalities:JoensuuJoensuuLiperiLiperiOutokumpuOutokumpuPolvijPolvijäärvirviKontiolahtiKontiolahtiIlomantsiIlomantsiJuukaJuuka

63.44N28.65E

62.25N31.58E

Joensuu sub-region bounding box

•

Histogram from 293 places (Mopsi services)•

User activities until 31.12.2014‒

Photos taken

‒

Tracking started or ended

Summary of APR trio data 2011-2014

Distinct users:

3Total users:

12

Sample sizes:

37 -

1263

14

412

8

4

5

3

4

Andrei2012

Andrei2011

10

6

6

11

Andrei2011

10

6

6

11

20

7

9 20

5

3

Andrei2013

20

7

9 20

5

3

Andrei2013

46

11

6

8

3

Andrei201446

11

6

8

3

Andrei2014

19

32Radu201119

32Radu2011

18

28

10

Radu201218

28

10

Radu2012

13

11

16 Radu2013

13

11

16 Radu2013

6

21

8

11

9

Radu20146

21

8

11

9

Radu2014

5

64

Pasi2011 5

64

Pasi2011

10

12Pasi201210

12Pasi2012

9

9

Pasi20149

9

Pasi2014

8

5

8

5

Pasi20138

5

8

5

Pasi2013

Ten most popular histogram bins Frequencies shown

Work

Home

Home

Test protocolExpected result

True positive:

13

11

16 Radu2013

13

11

16 Radu2013 5

64

Pasi2011 5

64

Pasi2011

True negative:

vs.9

9

Pasi20149

9

Pasi20145

64

Pasi2011 5

64

Pasi2011 vs.

Distance Distance

SAME DIFFERENT

• 12*12 = 144 comparisons in total• 3*4*4 = 48 true positives (33%)• 3*8*8 = 96 true negatives (67%)

??

Comparison of all pairs

vs.

Distance

0.01 0.01 0.02 0.03 0.05 0.06 …

0.74 0.76 0.81 0.82 0.88

Same user

threshold

Different user

Sort

Algorithm result:

Classification error

0.01 0.01 0.02 0.03 0.05 0.06 …

0.74 0.76 0.81 0.82 0.88

Same user Different user

same same not same same same …

not not not not not

Algorithm result:

Ground truth:

Errors made:

3Total comparisons:

144

Classification error:

2%

Classification results

Averagedistance A priori

Top-33% Best possible

Fuzzy worse than crisp

Error Error

What if top-10 places removed?

Conclusions

Main results:•

People’s identity can be recognized with high accuracy

•

All except L2

and L•

Fuzzy works worse than crisp

Future research:•

Clustering-based approach

•

K-NN classifier

Thank you

Time for Time for questions!questions!

Documents

Similarity of mobile users based on sparse location historycs.joensuu.fi/sipu/pub/LocationSimilarity.pdf · based on sparse location history. Pasi Fränti. Radu Mariescu-Istodor