Planted-model evaluation of algorithms for identifying differences between spreadsheets

1

Planted-model evaluation of algorithms for identifying differences between spreadsheets

Anna Harutyunyan, Glencora Borradaile, Christopher Chambers, Christopher Scaffidi

School of Electrical Engineering and Computer Science

Oregon State University

2

Spreadsheets as a hub for work

• Collecting, organizing, analyzing, and visualizing data

• Frequently shared among people in the organization

– Who then edit the spreadsheets

• And then share the new versions

– To other people who then reuse and edit them…

• Proliferation of spreadsheets

– People must choose which spreadsheets to reuse

– Auditors may need to determine who made changes to which cells (that contain errors)


3

Should I reuse Spreadsheet A or B?

[Figure: Spreadsheet X is edited separately by Alice and by Bob, producing Spreadsheet A and Spreadsheet B.]


4

Existing features for understanding spreadsheet differences

• TellTable, as well as Excel change tracking

– Show differences between X and direct descendant A

– We need to compare A vs B

• DiffEngineX, Synkronizer, Suntrap, SheetDiff

– Direct comparison of any A vs any B

– Somewhat inaccurate at recovering intervening edits (error rates of 2-12% at the cell level, and even higher at the row/column level, for 8 real spreadsheet pairs from the EUSES corpus)


5

Example of an error (Synkronizer)

Actual edits: insert B’s second column (“c”, “g”, …), insert B’s second row (“d”, “d”, “d”), change B’s A3 from “d” to “e”

Note and apologies: This figure is referenced but missing in the printed proceedings. (It’s my fault: accidentally deleted it during final round of edits.)


6

Outline of this talk

• Background

• Algorithm

• Evaluation

• Conclusions


7

New algorithm concept

• Find a “target alignment” of cells that are nearly identical

– i.e., find what A and B have in common

• All remaining differences are attributable to edits

– Specifically, row/column insertions in A or B, or cell-level edits within the target alignment cells


8

Target alignment concept

An alignment with only 1 cell-level edit out of 14 cells


9

Starting point for a specific algorithm: LCS in 1D

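[Figure: two 1D sequences aligned by their longest common subsequence (LCS): “f c a d b a e” and a second sequence with the same letters plus one extra “d”.]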
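A minimal sketch of the 1D building block, assuming the textbook dynamic-programming formulation of LCS length. The function name `lcs_length` is illustrative rather than taken from the talk, and the position of the extra “d” in the second sequence is an assumption, since the figure did not survive extraction.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    # table[i][j] holds the LCS length of the prefixes a[:i] and b[:j].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                # Matching symbols extend the best alignment of the shorter prefixes.
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # Otherwise skip one symbol from either sequence.
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


# Hypothetical input: the second sequence is the first with one "d" inserted.
common = lcs_length("fcadbae", "fdcadbae")
```

All 7 letters of the shorter sequence survive in the longer one, so the single inserted letter is the only difference left to explain as an edit.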


10

Let’s think in terms of aligning rows (put off thinking about columns for a moment)


11

Insight: Match up rows based on the length of their LCS (1D)

[Figure: rows of A (df, dc, ba, fd, ab, aa, ee) matched against rows of B (dcf, egc, baa, fad, afb, aga, ege, ddd).]

A good alignment: per-row LCS lengths 1 1 2 2 2 2 2, ∑ equals 12


12

Insight: Match up rows based on the length of their LCS (1D)

[Figure: rows of A (df, dc, ba, fd, ab, aa, ee) matched against rows of B (dcf, egc, baa, fad, afb, aga, ege, ddd).]

A better alignment (maximal, actually): per-row LCS lengths 2 1 2 2 2 2 2, ∑ equals 13


13

Summary of algorithm

Given spreadsheets A and B, compute the target alignment, then generate a list of edits that transform A into B


14



1. Use dynamic programming to choose which rows to include in the target alignment

– Argmax ∑ LCS1D(rows retained in A, rows retained in B), where the ∑ is over rows
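One way to sketch this row-selection step is a weighted LCS over rows, where pairing two rows earns their 1D LCS length. This is an illustration under that assumption, not necessarily RowColAlign's exact formulation; the names `lcs_length` and `align_rows` are hypothetical, and the order of B's rows is reconstructed from the Synkronizer example (second row “d”, “d”, “d”).

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (rolling-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def align_rows(rows_a, rows_b):
    """Return (best total score, list of (index_a, index_b) aligned row pairs)."""
    n, m = len(rows_a), len(rows_b)
    # score[i][j] = best sum of row-LCS lengths using rows_a[:i] and rows_b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            paired = score[i - 1][j - 1] + lcs_length(rows_a[i - 1], rows_b[j - 1])
            score[i][j] = max(score[i - 1][j], score[i][j - 1], paired)
    # Trace back to recover one optimal set of aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if score[i][j] == score[i - 1][j]:
            i -= 1  # row i-1 of A is left out of the alignment
        elif score[i][j] == score[i][j - 1]:
            j -= 1  # row j-1 of B is left out of the alignment
        else:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
    pairs.reverse()
    return score[n][m], pairs


rows_a = ["df", "dc", "ba", "fd", "ab", "aa", "ee"]
rows_b = ["dcf", "ddd", "egc", "baa", "fad", "afb", "aga", "ege"]  # assumed order
best, pairs = align_rows(rows_a, rows_b)
```

On this example the best total is 13, matching the maximal alignment shown earlier. More than one alignment can reach that total, so the trace-back returns just one of them.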


15




2. Do the same with A and B to choose columns

– Argmax ∑ LCS1D(cols retained in A, cols retained in B), where the ∑ is over columns


16




2. Do the same with A and B to choose columns

3. For each row or column not chosen for target alignment

– If it’s in B (i.e., not A), then represent as an insert

– Else (it’s in A, not B), represent as a delete


17




2. Do the same with A and B to choose columns

3. For each row or column not chosen for target alignment

4. For each aligned row or column

– If it has virtually no differences between A and B, then represent any remaining differences as cell-level edits

– Else, represent the entire row/column as a delete+insert
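Steps 3 and 4 can be sketched for rows as follows. The function `row_edits`, the edit tuples, and the `max_diff_fraction` cutoff are all hypothetical: the slides do not say how “virtually no differences” is quantified. The sketch also assumes columns have already been aligned, so paired rows have equal length.

```python
def row_edits(rows_a, rows_b, pairs, max_diff_fraction=0.5):
    """List edits turning A into B, given aligned (index_a, index_b) row pairs.

    max_diff_fraction is an assumed cutoff for "virtually no differences".
    """
    edits = []
    aligned_a = {i for i, _ in pairs}
    aligned_b = {j for _, j in pairs}
    # Step 3: rows outside the target alignment become deletes or inserts.
    edits += [("delete_row", i) for i in range(len(rows_a)) if i not in aligned_a]
    edits += [("insert_row", j) for j in range(len(rows_b)) if j not in aligned_b]
    # Step 4: aligned rows yield cell-level edits, or a delete+insert if too different.
    for i, j in pairs:
        diffs = [k for k, (x, y) in enumerate(zip(rows_a[i], rows_b[j])) if x != y]
        if len(diffs) <= max_diff_fraction * len(rows_a[i]):
            edits += [("edit_cell", j, k, rows_b[j][k]) for k in diffs]
        else:
            edits += [("delete_row", i), ("insert_row", j)]
    return edits


# Small illustrative input: B has one inserted row and one changed cell.
edits = row_edits(["df", "dc", "ba"], ["df", "dd", "ec", "ba"],
                  [(0, 0), (1, 2), (2, 3)])
```

With the default cutoff, the example reports one inserted row and one cell-level edit. A stricter cutoff would instead report the changed row as a delete+insert.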


18

Three investigations we conducted to evaluate RowColAlign

• Tested on 10 manually-created spreadsheet pairs previously used to test an older algorithm (SheetDiff)

– Won’t discuss today (due to time); see paper

– Bottom line: RowColAlign made no errors

• Tested on >500 automatically-generated cases

– Discussed below

– Bottom line: RowColAlign made no errors

• Formally analyzed expected behavior of RowColAlign

– Summarized below

– Bottom line: RowColAlign will rarely if ever make errors in practice; runtime is O((spreadsheet area)^2)


19

Evaluation based on planted model

• Planted model = generative model

• Automatically generates test cases

– For which we know the correct answer

• Worth trying because this approach to evaluation might also be useful for other algorithms that this community creates


20

Planted model / generating test cases

1. Create a blank spreadsheet O of size n x n

2. Randomly fill O with letters from alphabet of size s

3. Copy O twice to create A and B

4. For each row and each column in A and in B: with probability p, delete that row or column

5. For each cell in B: with probability q, replace with new random letter
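The five steps above can be sketched as a small generator. This is a minimal illustration, not the authors' test harness: the name `planted_pair`, the use of integers as letters, the returned ground-truth dictionary, and the choice to draw replacement letters uniformly (so a replacement can occasionally equal the old value) are all assumptions.

```python
import random


def planted_pair(n, s, p, q, seed=None):
    """Generate spreadsheets A and B from a common random n x n original."""
    rng = random.Random(seed)
    # Steps 1-2: an n x n sheet filled with letters 0..s-1 (integers stand in for letters).
    original = [[rng.randrange(s) for _ in range(n)] for _ in range(n)]

    def delete_some(grid):
        # Step 4: each row and each column is deleted independently with probability p.
        keep_rows = [r for r in range(n) if rng.random() >= p]
        keep_cols = [c for c in range(n) if rng.random() >= p]
        kept = [[grid[r][c] for c in keep_cols] for r in keep_rows]
        return kept, keep_rows, keep_cols

    # Step 3: A and B both start as copies of the original.
    a, a_rows, a_cols = delete_some(original)
    b, b_rows, b_cols = delete_some(original)

    # Step 5: each cell of B is replaced with probability q (assumed uniform redraw).
    edited = []
    for r, row in enumerate(b):
        for c in range(len(row)):
            if rng.random() < q:
                row[c] = rng.randrange(s)
                edited.append((r, c))

    # The planted "correct answer": which original rows/columns survive, which cells changed.
    truth = {"a_rows": a_rows, "a_cols": a_cols,
             "b_rows": b_rows, "b_cols": b_cols, "edited": edited}
    return a, b, truth


a, b, truth = planted_pair(n=10, s=50, p=0.08, q=0.05, seed=1)
```

Because the generator records which rows, columns, and cells it touched, a diff algorithm's output can be scored against `truth` without manual labeling.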


21

Parameter values based on 8 real spreadsheet pairs from prior work

Parameter                      Real range observed                      Range tested
Spreadsheet area               90 to 3212 cells (equiv. n = 9.5-56.7)   n = 10 to 50
Alphabet size (s)              50 to 671                                50 to 450
Row & col insertion rate (p)   0.0167 to 0.08                           0.01 to 0.41
Cell-level edit rate (q)       0.0016 to 0.05                           0.001 to 0.401

For each parameter setting, we generated 25 test cases.


22

Result: RowColAlign made no errors

Parameter                      Real range observed                      Range tested
Spreadsheet area               90 to 3212 cells (equiv. n = 9.5-56.7)   n = 10 to 50
Row & col insertion rate (p)   0.0167 to 0.08                           0.01 to 0.41
Cell-level edit rate (q)       0.0016 to 0.05                           0.001 to 0.401

For comparison: The existing SheetDiff algorithm made errors at a rate of up to 28% as p and q increased.


23

Pushing the algorithm further: Huge spreadsheets with many edits

Parameter                      For comparison                                                      Range tested
Width and height (n)           961 cells (n=31), top quartile of all EUSES corpus spreadsheets     10000 cells (n=100)
Row & col insertion rate (p)   0.0167 to 0.08, 8 pairs from prior work                             0.08
Cell-level edit rate (q)       0.0016 to 0.05, 8 pairs from prior work                             0.05


24

Results: Still no errors

Parameter                      For comparison                                                      Range tested
Width and height (n)           961 cells (n=31), top quartile of all EUSES corpus spreadsheets     n=100
Row & col insertion rate (p)   0.0167 to 0.08, 8 pairs from prior work                             0.08
Cell-level edit rate (q)       0.0016 to 0.05, 8 pairs from prior work                             0.05


25

In brief: Why?

An incorrect alignment would arise only when unrelated rows happen to be similar by chance.

This is less and less likely when…

- The alphabet is large, because the probability that two cells have the same value by chance is ~ 1/s

- The spreadsheet is large, because the probability that n cells have matching values by chance is ~ (1/s)^n


26

Conclusions

• The subsequence of rows and columns that two spreadsheets have in common can be computed using a dynamic programming algorithm

• The error rate of such an algorithm can be evaluated using a planted model

• Our specific dynamic programming algorithm

– Is unlikely to make errors when recovering edits, except on spreadsheets that are small or have small alphabets


27

Future research opportunities

• Develop tools based on this algorithm

– To help people understand and manage versions

– To choose among multiple versions

• Develop enhanced algorithms

– For simultaneous diff of more than 2 spreadsheets

– For clustering collections of spreadsheets based on similarity


28

Thank you

• For this opportunity to present

• For funding from Google and NSF

• For your questions and ideas

