A Privacy Preserving Efficient Protocol for Semantic Similarity Join Using Long String Attributes

Bilal Hawashin, Farshad Fotouhi
Department of Computer Science
Wayne State University

Traian Marius Truta
Department of Computer Science
Northern Kentucky University

Outline

What is Similarity Join
Long String Values
Our Contribution
Privacy Preserving Protocol For Long String Values
Experiments and Results
Conclusions/Future Work
Contact Information

Motivation

Name          Address            Major            …
John Smith    4115 Main St.      Biology
Mary Jones    2619 Ford Rd.      Chemical Eng.

Name          Address            Monthly Sal.     …
Smith, John   4115 Main Street   1645
Mary Jons     2619 Ford Rd.      2100

Is Natural Join always suitable?

Similarity Join

Joining a pair of records if they have SIMILAR values in the join attribute.

Formally, similarity join consists of grouping pairs of records whose similarity is greater than a threshold, T.

Studied widely in the literature, and referred to as record linkage, entity matching, duplicate detection, citation resolution, …
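The threshold definition above can be sketched on the two name columns from the Motivation slide. This is only an illustration: the `normalize` helper, the use of `difflib.SequenceMatcher` as the similarity measure, and the threshold value 0.8 are assumptions made for the example, not part of the presented protocol.

```python
from difflib import SequenceMatcher
import re

def normalize(value):
    """Lowercase, drop punctuation, and sort tokens so word order does not matter."""
    tokens = re.findall(r"[a-z0-9]+", value.lower())
    return " ".join(sorted(tokens))

def similarity(a, b):
    """Character-level similarity in [0, 1]; a stand-in for any similarity measure."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def similarity_join(left, right, threshold):
    """Return (i, j) index pairs whose similarity is greater than the threshold."""
    return [(i, j)
            for i, a in enumerate(left)
            for j, b in enumerate(right)
            if similarity(a, b) > threshold]

students = ["John Smith", "Mary Jones"]
employees = ["Smith, John", "Mary Jons"]
pairs = similarity_join(students, employees, threshold=0.8)
print(pairs)
```

A natural join on these columns would return nothing, because no two name strings are exactly equal, while the threshold join pairs each student with the corresponding employee.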

Our Previous Contribution: Long String Values (ICDM MMIS10)

The term long string refers to the data type representing any string value with unlimited length.
The term long attribute refers to any attribute of long string data type.
Most tables contain at least one attribute with long string values.
Examples are Paper Abstract, Product Description, Movie Summary, User Comment, …
Most of the previous work studied similarity join on short fields.
In our previous work, we showed that using long attributes as join attributes under supervised learning can enhance the similarity join performance.

Example

P1 Title    P1 Kwds    P1 Authrs    P1 Abstract    …
P2 Title    P2 Kwds    P2 Authrs    P2 Abstract    …
P3 Title    P3 Kwds    P3 Authrs    P3 Abstract    …
…
P10 Title   P10 Kwds   P10 Authrs   P10 Abstract   …
P11 Title   P11 Kwds   P11 Authrs   P11 Abstract   …
…

Our Paper (Motivation)

Some sources may not allow sharing their whole data in the similarity join process.
Solution: Privacy Preserved Similarity Join.

Using long attributes as join attributes can increase the similarity join accuracy.

To the best of our knowledge, all the current Privacy Preserved SJ algorithms use short attributes.

Most of the current privacy preserved SJ algorithms ignore the semantic similarities among the values.

Problem Formulation

Our goal is to find a Privacy Preserved Similarity Join Algorithm for the case where the join attribute is a long attribute, one that considers the semantic similarities among such long values.

Our Work Plan

Phase1: Compare multiple similarity methods for long attributes when similarity thresholds are used.

Phase2: Use the best method as part of the privacy preserved SJ protocol.

Phase1: Finding Best SJ Method for Long Strings with Threshold

Candidate Methods:
Diffusion Maps.
Latent Semantic Indexing.
Locality Preserving Projection.

Performance Measurements

F1 Measurement: the harmonic mean of recall R and precision P, that is, F1 = 2PR / (P + R).

Recall is the ratio of the relevant retrieved data to all the relevant data, and precision is the ratio of the relevant retrieved data to all the retrieved data.
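As a small worked example of the F1 measurement, the sketch below uses the standard definitions (precision = relevant retrieved / retrieved, recall = relevant retrieved / relevant). The pair sets are made up for illustration and are not results from the experiments.

```python
def precision_recall_f1(retrieved, relevant):
    """Compute precision, recall, and their harmonic mean (F1) for two sets."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical join output and ground truth, as (record in A, record in B) pairs.
retrieved_pairs = {(1, 1), (2, 1), (2, 7), (3, 4)}
relevant_pairs = {(1, 1), (2, 1), (5, 5)}
p, r, f1 = precision_recall_f1(retrieved_pairs, relevant_pairs)
print(p, r, f1)
```

Here two of the four retrieved pairs are correct (P = 1/2) and two of the three true pairs are found (R = 2/3), giving F1 = 4/7.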

Performance Measurements (Cont.)

Preprocessing time is the time needed to read the dataset and generate matrices that could be used later as an input to the semantic operation.

Operation time is the time needed to apply the semantic method.

Matching time is the time required by the third party, C, to find the cosine similarity among the records provided by both A and B in the reduced space and compare the similarities with the predefined similarity threshold.

Datasets

IMDB Internet Movies Dataset:
    Movie Summary Field

Amazon Dataset:
    Product Title
    Product Description

Phase1 Results

Finding best dimensionality reduction method using Movie Summary from IMDB Dataset (Left) and Product Descriptions from Amazon (Right).

Phase2 Results

Preprocessing Time:

Read Dataset (1000 Movie Summaries)       12 Sec.
TF.IDF Weighting                          1 Sec.
Reduce Dimensionality using Mean TF.IDF   0.5 Sec.
Find Shared Features                      Negligible

Phase2 Results

Operation Time for the best performing methods from phase 1.

Matching Time is negligible.

Our Protocol

Both sources A and B share the Threshold value T to decide similar pairs later.

Our Protocol

Source A
P1 Title   P1 Authors   P1 Abstract   …
P2 Title   P2 Authors   P2 Abstract   …
P3 Title   P3 Authors   P3 Abstract

Source B
Px Title   Px Authors   Px Abstract   …

Find Term_LSV Frequency Matrix for Each Source

             LSV1   LSV2   LSV3
Image        4      0      0
Classify     5      0      0
Similarity   0      6      5
Join         0      6      4

M_a
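A minimal sketch of building such a term-by-LSV frequency matrix, with terms as rows and long string values as columns as on the slide. The tokenizer (lowercase alphabetic runs, no stemming or stop-word removal) and the three toy strings are assumptions; the slides do not say which preprocessing is used.

```python
from collections import Counter
import re

def term_lsv_frequency(long_strings):
    """Return (terms, matrix) where matrix[i][j] is the count of terms[i] in long string j."""
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in long_strings]
    terms = sorted({t for tokens in tokenized for t in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix

# Toy long string values standing in for LSV1..LSV3 of source A.
abstracts = [
    "image classify image classify image",
    "similarity join similarity join",
    "join similarity join",
]
terms, m_a = term_lsv_frequency(abstracts)
for term, row in zip(terms, m_a):
    print(term, row)
```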

Find TD_Weighted Matrix Using TF.IDF Weighting

             LSV1   LSV2   LSV3
Image        0.9    0      0
Classify     0.7    0      0
Similarity   0      0.85   0.9
Join         0      0.7    0.85

WeightedM_a

TF.IDF Weighting

TF.IDF weighting of a term w in a long string value x is given as:

[equation image not preserved]

where tf_{w,x} is the frequency of the term w in the long string value x, and idf_w is [expression not preserved], where N is the number of long string values in the relation, and n_w is the number of long string values in the relation that contain the term w.
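The exact weighting formula is not recoverable from this slide, so the sketch below assumes one common variant, tf_{w,x} * ln(N / n_w), purely for illustration; the authors' formula may differ (for example in log damping or normalization). It is applied to the frequency matrix M_a shown earlier, so the outputs will not match the illustrative values in WeightedM_a.

```python
import math

def tfidf_weight(freq_matrix):
    """freq_matrix[i][j] is the frequency of term i in long string value j.

    Assumed variant: weight = tf * ln(N / n_w), with N the number of long
    string values and n_w the number of them that contain the term.
    """
    n_values = len(freq_matrix[0])
    weighted = []
    for row in freq_matrix:
        n_w = sum(1 for tf in row if tf > 0)
        idf = math.log(n_values / n_w) if n_w else 0.0
        weighted.append([tf * idf for tf in row])
    return weighted

m_a = [
    [4, 0, 0],  # Image
    [5, 0, 0],  # Classify
    [0, 6, 5],  # Similarity
    [0, 6, 4],  # Join
]
weighted_m_a = tfidf_weight(m_a)
for row in weighted_m_a:
    print([round(v, 3) for v in row])
```

Terms that occur in fewer long string values get a larger idf factor, so "Image" (one value out of three) is weighted more heavily per occurrence than "Similarity" (two out of three).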

MeanTF.IDF Feature Selection

MeanTF.IDF is an unsupervised feature selection method.

Every feature (term) is assigned a value according to its importance.

The value of a term feature w is given as

$$\mathrm{Val}(w) = \frac{1}{N} \sum_{x=1}^{N} \mathrm{TF.IDF}(w, x)$$

where TF.IDF(w, x) is the weighting of feature w in long string value x, and N is the total number of long string values.
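A minimal sketch of MeanTF.IDF selection, reading the value of a term as the mean of its TF.IDF weights over the N long string values. The cut-off `k` (how many top features are kept) is a hypothetical parameter; the slides do not state how the number of important features is chosen. The input is the WeightedM_a example matrix.

```python
def mean_tfidf_select(terms, weighted, k):
    """Score each term by its mean TF.IDF weight and keep the k highest-scoring terms."""
    n_values = len(weighted[0])
    scores = {t: sum(row) / n_values for t, row in zip(terms, weighted)}
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(terms, key=lambda t: (-scores[t], t))
    return ranked[:k], scores

terms = ["Image", "Classify", "Similarity", "Join"]
weighted_m_a = [
    [0.9, 0, 0],      # Image
    [0.7, 0, 0],      # Classify
    [0, 0.85, 0.9],   # Similarity
    [0, 0.7, 0.85],   # Join
]
imp_fe_a, scores = mean_tfidf_select(terms, weighted_m_a, k=2)
print(imp_fe_a, scores)
```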

Apply MeanTF.IDF on WeightedM_a and get the important features Imp_Fe_a.
Add random features to Imp_Fe_a to get Rand_Imp_Fe_a.
Rand_Imp_Fe_a and Rand_Imp_Fe_b are returned to C.
C finds the intersection and returns the shared important features SF to both A and B.
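The exchange above can be sketched with plain sets. How the random features are generated is not described on the slides; drawing them from a per-source decoy pool, as done here, is an assumption, and the helper names are hypothetical.

```python
import random

def add_random_features(important, decoy_pool, n_random, rng):
    """Mix n_random decoy features into a source's important features (run by A or B)."""
    candidates = sorted(set(decoy_pool) - set(important))
    decoys = rng.sample(candidates, n_random)
    return set(important) | set(decoys)

def shared_features(rand_imp_fe_a, rand_imp_fe_b):
    """Run by the third party C: it only sees the padded sets and intersects them."""
    return rand_imp_fe_a & rand_imp_fe_b

rng = random.Random(0)
imp_fe_a = {"image", "similarity", "join"}
imp_fe_b = {"similarity", "join", "privacy"}
rand_imp_fe_a = add_random_features(imp_fe_a, ["zq1", "zq2", "zq3"], 2, rng)
rand_imp_fe_b = add_random_features(imp_fe_b, ["yk1", "yk2", "yk3"], 2, rng)
sf = shared_features(rand_imp_fe_a, rand_imp_fe_b)
print(sorted(sf))
```

Because the two decoy pools in this toy setup are disjoint, the decoys drop out of the intersection and SF contains only the genuinely shared features.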

Reduce WeightedM Dimensions in Both Sources using SF.

             LSV1   LSV2   LSV3
Image        0.9    0      0
Similarity   0      0.85   0.9
…

SF_a

Add Random Vectors to SF

             LSV1   LSV2   LSV3   Random Cols
Image        0.9    0      0      0.6
Similarity   0      0.85   0.9    0.2
…

Rand_Weighted_a
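A minimal sketch of this step, assuming the random vectors are appended as extra columns with uniform values in [0, 1]; the distribution, the number of random columns, and their placement are not given on the slide. The helper also returns the indices of the added columns so the source can remove them later.

```python
import random

def add_random_columns(matrix, n_random, rng):
    """Append n_random random columns to every row.

    Returns the augmented matrix and the indices of the added columns.
    """
    n_original = len(matrix[0])
    augmented = [row + [round(rng.random(), 2) for _ in range(n_random)]
                 for row in matrix]
    random_cols = list(range(n_original, n_original + n_random))
    return augmented, random_cols

sf_a = [
    [0.9, 0.0, 0.0],    # Image
    [0.0, 0.85, 0.9],   # Similarity
]
rand_weighted_a, random_cols = add_random_columns(sf_a, 1, random.Random(7))
print(rand_weighted_a, random_cols)
```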

Find W_a (The Kernel)

1-Cos_Sim(LSV1,LSV1)=0     1-Cos_Sim(LSV1,LSV2)=0.2    1-Cos_Sim(LSV1,LSV3)=0.3    …
1-Cos_Sim(LSV2,LSV1)=0.2   1-Cos_Sim(LSV2,LSV2)=0      1-Cos_Sim(LSV2,LSV3)=0.87   …
1-Cos_Sim(LSV3,LSV1)=0.3   1-Cos_Sim(LSV3,LSV2)=0.87   1-Cos_Sim(LSV3,LSV3)=0      …
…                          …                           …                           …

|W_a| = D x D, where D is the total number of columns in Rand_Weighted_a
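The D x D matrix of 1 - Cos_Sim values can be computed as sketched below, treating each column of Rand_Weighted_a (a long string value or a random vector) as one point. The input is the two-row example from the previous slide; returning a similarity of 0 for an all-zero column is an assumption made to avoid division by zero.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def distance_kernel(columns):
    """Pairwise 1 - cosine similarity between all columns (a D x D matrix)."""
    return [[1.0 - cosine_similarity(u, v) for v in columns] for u in columns]

rand_weighted_a = [
    [0.9, 0.0, 0.0, 0.6],    # Image
    [0.0, 0.85, 0.9, 0.2],   # Similarity
]
# Transpose so that each entry is one column (LSV1, LSV2, LSV3, random column).
columns = [list(col) for col in zip(*rand_weighted_a)]
w_a = distance_kernel(columns)
for row in w_a:
    print([round(v, 2) for v in row])
```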

Use Diffusion Maps to Find Red_Rand_Weighted_a

[Red_Rand_Weighted_a, S_a, V_a, A_a] = Diffusion_Map(W_a, 10, 1, red_dim), red_dim < D

Red_Rand_Weighted_a =

                                                    Col1   Col2   …   Col_red_dim
Diffusion Map Representation of first row of W_a    0.4    0.1
Diffusion Map Representation of second row of W_a   0.8    0.6
Diffusion Map Representation of third row of W_a    0.75   0.5
…

C Finds Pairwise Similarity Between Red_Rand_Weighted_a and Red_Rand_Weighted_b

Red_Rand_Weighted_a   Red_Rand_Weighted_b   Cos_Sim
1                     1                     0.77
1                     2                     0.3
…                     …                     …
2                     1                     0.9

If Cos_Sim > T, Insert the Tuple in Matched

Red_Rand_Weighted_a   Red_Rand_Weighted_b   Cos_Sim
1                     1                     0.77
2                     1                     0.9
2                     7                     0.85
…                     …                     …

Matched
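The matching step run by C can be sketched as a nested loop over the reduced rows of both sources. The rows for source A reuse the example values from the diffusion map slide; the rows for source B and the threshold 0.95 are made up for illustration, so the resulting pairs do not correspond to the Matched table above.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def match(red_a, red_b, threshold):
    """Return (row in A, row in B, Cos_Sim) for every pair with Cos_Sim > threshold.

    Row numbers are 1-based, as in the slide tables.
    """
    matched = []
    for i, u in enumerate(red_a, start=1):
        for j, v in enumerate(red_b, start=1):
            sim = cosine_similarity(u, v)
            if sim > threshold:
                matched.append((i, j, sim))
    return matched

red_rand_weighted_a = [[0.4, 0.1], [0.8, 0.6], [0.75, 0.5]]
red_rand_weighted_b = [[0.1, 0.9], [0.7, 0.5]]  # hypothetical values for source B
matched = match(red_rand_weighted_a, red_rand_weighted_b, threshold=0.95)
for i, j, sim in matched:
    print(i, j, round(sim, 3))
```

C works only on the reduced, randomized vectors, so it can produce the pair list without seeing the original long string values.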


Matched is returned to both A and B.

A and B remove random vectors from Matched and share their matrices.

Our Protocol (Part 1)

Our Protocol (Part 2)

Phase2 Results

Effect of adding random columns on the accuracy.

Phase2 Results

Effect of adding random columns on the number of suggested matches.

Conclusions

An efficient, secure, semantic SJ protocol for long string attributes is proposed.

Diffusion maps is the best method (among those compared) to semantically join long string attributes when threshold values are used.

Mapping into diffusion maps space and adding random records can hide the original data without affecting the accuracy.

Future Work

Potential further work:
Compare diffusion maps with more candidate semantic methods for joining long string attributes.
Study the performance of the protocol on huge databases.

Thank You …

Dr. Farshad Fotouhi.
Dr. Traian Marius Truta.