
Planted-model evaluation of algorithms for identifying differences between spreadsheets


Anna Harutyunyan, Glencora Borradaile, Christopher Chambers, Christopher Scaffidi

School of Electrical Engineering and Computer Science

Oregon State University

Spreadsheets as a hub for work

• Collecting, organizing, analyzing, and visualizing data

• Frequently shared among people in the organization
– Who then edit the spreadsheets
• And then share the new versions
– With other people who then reuse and edit them…
• Proliferation of spreadsheets
– People must choose which spreadsheets to reuse
– Auditors may need to determine who made changes to which cells (that contain errors)

Background Algorithm Evaluation Conclusions

Should I reuse Spreadsheet A or B?

[Figure: Spreadsheet X is edited separately by Alice and by Bob, yielding Spreadsheet A and Spreadsheet B.]

Existing features for understanding spreadsheet differences

• TellTable, as well as Excel change tracking
– Show differences between X and direct descendant A
– We need to compare A vs B
• DiffEngineX, Synkronizer, Suntrap, SheetDiff
– Direct comparison of any A vs any B
– Somewhat inaccurate at recovering intervening edits (error rates of 2-12% at the cell level, even higher at the row/column level, for 8 real spreadsheet pairs from the EUSES corpus)

Example of an error (Synkronizer)

Actual edits: insert B’s second column (“c”, “g”, …), insert B’s second row (“d”, “d”, “d”), change B’s A3 from “d” to “e”

Note and apologies: This figure is referenced but missing in the printed proceedings. (It’s my fault: I accidentally deleted it during the final round of edits.)

Outline of this talk

• Background

• Algorithm

• Evaluation

• Conclusions

New algorithm concept

• Find a “target alignment” of cells that are nearly identical
– i.e., find what A and B have in common
• All remaining differences are attributable to edits
– Specifically, row/column insertions in A or B, or cell-level edits within the target alignment cells

Target alignment concept

An alignment with only 1 cell-level edit out of 14 cells

Starting point for a specific algorithm: LCS in 1D

[Figure: the longest common subsequence (LCS) of two 1D letter sequences, "f c a d b a e" and a second sequence containing the same letters plus one extra "d".]
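As a minimal sketch of the 1D building block, the function below computes the length of the longest common subsequence with the standard dynamic program. The name `lcs_length` is illustrative and not from the talk, and the second sequence is written with the extra "d" at the end, which is only how it appears in the extracted slide text.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    # prev[j] holds the LCS length of the part of `a` seen so far and b[:j].
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the best subsequence that ends before both letters.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise drop one letter from either sequence.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


# The two sequences from the slide share all seven letters of the first one.
print(lcs_length("fcadbae", "fcadbaed"))  # 7
```

The table is kept one row at a time, so memory grows with the shorter dimension only, while time is proportional to the product of the two lengths.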

Let’s think in terms of aligning rows (put off thinking about columns for a moment)

Insight: Match up rows based on the length of their LCS (1D)

Rows of A: df, dc, ba, fd, ab, aa, ee

Rows of B: dcf, ddd, egc, baa, fad, afb, aga, ege

A good alignment: per-row LCS lengths 1, 1, 2, 2, 2, 2, 2; ∑ equals 12

Insight: Match up rows based on the length of their LCS (1D)

Rows of A: df, dc, ba, fd, ab, aa, ee

Rows of B: dcf, ddd, egc, baa, fad, afb, aga, ege

A better alignment (maximal, actually): per-row LCS lengths 2, 1, 2, 2, 2, 2, 2; ∑ equals 13
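The row-matching idea can be sketched as a second dynamic program layered on top of the 1D LCS: choose an order-preserving pairing of rows that maximizes the sum of per-row LCS lengths. This is an illustration under assumptions, not the authors' implementation. The names `lcs_length` and `align_rows` and the tie-breaking rule in the traceback are hypothetical, and the row contents are read from the slide figure with "ddd" taken as B's second row.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def align_rows(rows_a, rows_b):
    """Return (best total, list of (index in A, index in B)) for an
    order-preserving row pairing that maximizes the summed LCS lengths."""
    m, n = len(rows_a), len(rows_b)
    # score[i][j] = best total when only rows_a[:i] and rows_b[:j] are used.
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            paired = score[i - 1][j - 1] + lcs_length(rows_a[i - 1], rows_b[j - 1])
            score[i][j] = max(paired, score[i - 1][j], score[i][j - 1])

    # Trace back, preferring to pair rows whenever that is optimal
    # (an arbitrary tie-breaking choice made for this sketch).
    pairs = []
    i, j = m, n
    while i > 0 and j > 0:
        w = lcs_length(rows_a[i - 1], rows_b[j - 1])
        if w > 0 and score[i][j] == score[i - 1][j - 1] + w:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j]:
            i -= 1  # row i-1 of A is left out of the alignment
        else:
            j -= 1  # row j-1 of B is left out of the alignment
    pairs.reverse()
    return score[m][n], pairs


rows_a = ["df", "dc", "ba", "fd", "ab", "aa", "ee"]
rows_b = ["dcf", "ddd", "egc", "baa", "fad", "afb", "aga", "ege"]
total, pairs = align_rows(rows_a, rows_b)
print(total)  # 13
```

Running the same function on the transposed grids would give the column alignment; the sketch recomputes each row-pair LCS during traceback for brevity rather than caching it.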

Summary of algorithm

Given spreadsheets A and B, compute the target alignment, then generate a list of edits A→B

1. Use dynamic programming to choose which rows to include in the target alignment
– Argmax ∑ LCS1D(rows retained in A, rows retained in B), where the ∑ is over rows

2. Do the same with A and B to choose columns
– Argmax ∑ LCS1D(cols retained in A, cols retained in B), where the ∑ is over columns

3. For each row or column not chosen for target alignment
– If it’s in B (i.e., not A), then represent as an insert
– Else (it’s in A, not B), represent as a delete

4. For each aligned row or column
– If it has virtually no differences between A and B, then represent any remaining differences as cell-level edits
– Else, represent the entire row/column as a delete+insert
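Steps 3 and 4 can be sketched as a pass over an already computed alignment. Everything named here is an assumption made for illustration: the function `edits_from_alignment`, the tuple format of the edits, and especially `max_diff_fraction`, which stands in for "virtually no differences" because the slides give no numeric threshold. For brevity the sketch applies step 4 to rows only and assumes both grids are non-empty.

```python
def edits_from_alignment(a, b, row_pairs, col_pairs, max_diff_fraction=0.5):
    """Turn a target alignment into a list of edits from A to B.

    a, b       -- grids given as lists of equal-length rows (strings or lists)
    row_pairs  -- aligned (row index in A, row index in B) pairs
    col_pairs  -- aligned (column index in A, column index in B) pairs
    """
    edits = []
    a_rows = {i for i, _ in row_pairs}
    b_rows = {j for _, j in row_pairs}
    a_cols = {i for i, _ in col_pairs}
    b_cols = {j for _, j in col_pairs}

    # Step 3: anything outside the alignment is a delete (A only) or insert (B only).
    edits += [("delete_row", i) for i in range(len(a)) if i not in a_rows]
    edits += [("insert_row", j) for j in range(len(b)) if j not in b_rows]
    edits += [("delete_col", i) for i in range(len(a[0])) if i not in a_cols]
    edits += [("insert_col", j) for j in range(len(b[0])) if j not in b_cols]

    # Step 4 (rows only in this sketch): compare aligned rows cell by cell.
    for i, j in row_pairs:
        diffs = [(ca, cb) for ca, cb in col_pairs if a[i][ca] != b[j][cb]]
        if len(diffs) <= max_diff_fraction * len(col_pairs):
            # Few differences: report them as cell-level edits at B's coordinates.
            edits += [("edit_cell", j, cb, a[i][ca], b[j][cb]) for ca, cb in diffs]
        else:
            # Too different: treat the whole row as replaced.
            edits += [("delete_row", i), ("insert_row", j)]
    return edits


a = ["df", "dc", "ba", "fd", "ab", "aa", "ee"]
b = ["dcf", "ddd", "egc", "baa", "fad", "afb", "aga", "ege"]
row_pairs = [(0, 0), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7)]
col_pairs = [(0, 0), (1, 2)]
print(edits_from_alignment(a, b, row_pairs, col_pairs))
```

With zero-based indices, the output for this pair is one inserted row (index 1), one inserted column (index 1), and one cell edit at row 2, column 0 from "d" to "e", which corresponds to the three actual edits listed for the Synkronizer example.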

Three investigations we conducted to evaluate RowColAlign

• Tested on 10 manually created spreadsheet pairs previously used to test an older algorithm (SheetDiff)
– Won’t discuss today (due to time); see paper
– Bottom line: RowColAlign made no errors
• Tested on >500 automatically generated cases
– Discussed below
– Bottom line: RowColAlign made no errors
• Formally analyzed expected behavior of RowColAlign
– Summarized below
– Bottom line: RowColAlign will rarely if ever make errors in practice; runtime is O((spreadsheet area)²)

Evaluation based on planted model

• Planted model = generative model

• Automatically generates test cases
– For which we know the correct answer
• An interesting technique to try, because this way of thinking about evaluation might also be useful for evaluating other algorithms that this community creates

Planted model / generating test cases

1. Create a blank spreadsheet O of size n x n

2. Randomly fill O with letters from alphabet of size s

3. Copy O twice to create A and B

4. For each row and each column in A and in B: with probability p, delete that row or column

5. For each cell in B: with probability q, replace with a new random letter
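The five generation steps above can be sketched as follows. This is a hedged illustration rather than the authors' generator: the name `planted_pair`, the use of integers 0..s-1 as "letters", and the shape of the returned ground truth are all assumptions. The sketch also lets a replaced cell receive the same letter again by chance, since the slide does not say whether the new letter must differ.

```python
import random


def planted_pair(n, s, p, q, rng=None):
    """Generate spreadsheets A and B from a hidden n x n original O.

    n -- width and height of O      s -- alphabet size
    p -- row/column deletion rate   q -- cell-level edit rate (applied to B)
    """
    rng = rng or random.Random()

    # Steps 1-2: an n x n grid filled with random letters (integers here).
    original = [[rng.randrange(s) for _ in range(n)] for _ in range(n)]

    def copy_with_deletions(grid):
        # Steps 3-4: copy O, dropping each row and each column with probability p.
        keep_rows = [i for i in range(n) if rng.random() >= p]
        keep_cols = [j for j in range(n) if rng.random() >= p]
        copy = [[grid[i][j] for j in keep_cols] for i in keep_rows]
        return copy, keep_rows, keep_cols

    a, a_rows, a_cols = copy_with_deletions(original)
    b, b_rows, b_cols = copy_with_deletions(original)

    # Step 5: overwrite each cell of B with probability q.
    edited_cells = []
    for i, row in enumerate(b):
        for j in range(len(row)):
            if rng.random() < q:
                row[j] = rng.randrange(s)
                edited_cells.append((i, j))

    # The surviving row/column indices of O are the known correct answer.
    truth = {
        "a_rows": a_rows, "a_cols": a_cols,
        "b_rows": b_rows, "b_cols": b_cols,
        "edited_cells": edited_cells,
    }
    return a, b, truth


a, b, truth = planted_pair(n=10, s=50, p=0.08, q=0.05, rng=random.Random(0))
print(len(a), len(b), len(truth["edited_cells"]))
```

Because the rows and columns of O that survive in A and in B are recorded, the correct target alignment is known for every generated pair, which is what allows an algorithm's output to be scored automatically.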

Parameter values based on 8 real spreadsheet pairs from prior work

Parameter                      Real range observed                     Range tested
Spreadsheet area               90 to 3212 cells (equiv. n=9.5-56.7)    n=10 to 50
Alphabet size (s)              50 to 671                               50 to 450
Row & col insertion rate (p)   0.0167 to 0.08                          0.01 to 0.41
Cell-level edit rate (q)       0.0016 to 0.05                          0.001 to 0.401

For each parameter setting, we generated 25 test cases.

Result: RowColAlign made no errors

Over the full ranges tested (n=10 to 50; s=50 to 450; p=0.01 to 0.41; q=0.001 to 0.401).

For comparison: The existing SheetDiff algorithm made errors at a rate of up to 28% as p and q increased.

Pushing the algorithm further: Huge spreadsheets with many edits

Parameter                      For comparison                                                       Range tested
Width and height (n)           961 cells (n=31): top quartile of all EUSES corpus spreadsheets      10000 cells (n=100)
Alphabet size (s)              50 to 671 (8 pairs from prior work)                                  10 to 1000
Row & col insertion rate (p)   0.0167 to 0.08 (8 pairs from prior work)                             0.08
Cell-level edit rate (q)       0.0016 to 0.05 (8 pairs from prior work)                             0.05

Results: Still no errors

At n=100 (10000 cells), with s=10 to 1000, p=0.08, and q=0.05.

In brief: Why?

An incorrect alignment would be caused by rows that happen to be similar by chance.

Which is less and less likely when…

• The alphabet is large
– Because the probability that two cells have the same value by chance is ~ 1/s
• The spreadsheet is large
– Because the probability that n cells have matching values by chance is ~ (1/s)^n
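A quick back-of-the-envelope calculation shows how fast that probability shrinks. The function below simply evaluates (1/s)^n; it assumes every cell is drawn independently and uniformly from the alphabet, which holds in the planted model but is only an approximation for real spreadsheets. The name `chance_match_probability` is illustrative.

```python
def chance_match_probability(s, n):
    """Approximate probability that n independent cell pairs all match by
    chance when each cell is uniform over an alphabet of size s."""
    return (1.0 / s) ** n


# Smallest setting from the main experiment: s=50 letters, rows of n=10 cells.
print(chance_match_probability(50, 10))

# A small alphabet and short rows make accidental matches far more plausible.
print(chance_match_probability(2, 3))  # 0.125
```

The second case illustrates why errors become plausible on spreadsheets that are small or use small alphabets.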

Conclusions

• The subsequence of rows and columns that two spreadsheets have in common can be computed using a dynamic programming algorithm

• The error rate of such an algorithm can be evaluated using a planted model

• Our specific dynamic programming algorithm
– Is unlikely to make errors when recovering edits, except on spreadsheets that are small or have small alphabets

Future research opportunities

• Develop tools based on this algorithm
– To help people understand and manage versions
– To choose among multiple versions
• Develop enhanced algorithms
– For simultaneous diff of more than 2 spreadsheets
– For clustering collections of spreadsheets based on similarity

Thank you

• For this opportunity to present

• For funding from Google and NSF

• For your questions and ideas
