DAURUM: Deduplication & Fusion
Index
• Introduction
• Process: Deduplication, Fusion
• Successful stories
• Architecture
• Demo
Introduction: Benefits
• Identification of suspected duplicate records within a database
• Merging of data from several databases with different formats, detecting duplicate records
• Validation tools for the detected similarities
Introduction: Deduplication
1. Configuration
2. Automatic execution
3. Validation of results
4. Personalized export
Input data:
1 John Smith
2 Mike Delfino
3 Mary James
4 Jhon Smith
5 Marian James
Output data (grouped record numbers):
1 4 John Smith
3 5 Mary James
Introduction: Fusion
1. Configuration
2. Automatic execution
3. Validation of results
4. Personalized export
Input data 1:
John Smith 05-07-1963 11111111-A
Mike Delfino 03-08-1978 22222222-B
Mary James 01-05-1982 33333333-C
Input data 2:
Jhon Smith Portland Oregon
Mary James Hartford Connecticut
Output data:
John Smith 05-07-1963 11111111-A Portland Oregon
Mary James 01-05-1982 33333333-C Hartford Connecticut
Introduction: Features
• Deduplication
• Merger
• Configuration manager
• Several export formats
• Validation tools
• Extensible normalization filters
• Dictionary support
• High-score hits ratings
• Web application
• Multiuser
• Auditing
Deduplication: Configurations
Process diagram: CSV → Configurations → Execution → Validation → Exportation → CSV
• Input data file format: CSV
• Select the relevant columns to link records
• Assign types to columns so that the most suitable automatic filters are used
Input data:
1004 Joan García Peres
1017 Jordi Garcia Pera
1031 Maria Bou Arnús
1058 Juan Garcia Pérez
1089 Mari Bou Arús
Deduplication: Configurations
• Comparison type: exact value, text-based estimation, or numerical estimation
• Percentage weight of each column in the similarity computation
Column weights: Name 30% + Surname 1 35% + Surname 2 35% = 100%
Name Surname 1 Surname 2
1004 Joan García Peres
1017 Jordi Garcia Pera
1031 Maria Bou Arnús
1058 Juan Garcia Pérez
1089 Mari Bou Arús
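The per-column percentages on this slide can be read as the weights of a weighted average over per-column similarities. The sketch below illustrates that reading only. The slides do not describe DAURUM's text-similarity measure, so Python's `difflib.SequenceMatcher` ratio stands in for it, and the names `WEIGHTS`, `column_similarity`, and `record_similarity` are hypothetical. Its scores are therefore not expected to reproduce the percentages shown on the execution slides.

```python
from difflib import SequenceMatcher

# Column weights taken from the slide: Name 30%, Surname 1 35%, Surname 2 35%.
WEIGHTS = {"name": 0.30, "surname1": 0.35, "surname2": 0.35}

def column_similarity(a: str, b: str) -> float:
    """Stand-in text similarity in [0, 1] (an assumption, not DAURUM's measure)."""
    return SequenceMatcher(None, a, b).ratio()

def record_similarity(rec_a: dict, rec_b: dict, weights: dict = WEIGHTS) -> float:
    """Weighted sum of per-column similarities; the weights must add up to 1."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(w * column_similarity(rec_a[col], rec_b[col])
               for col, w in weights.items())

joan = {"name": "JOAN", "surname1": "GARCIA", "surname2": "PERES"}
juan = {"name": "JUAN", "surname1": "GARCIA", "surname2": "PEREZ"}
maria = {"name": "MARIA", "surname1": "BOU", "surname2": "ARNUS"}

print(f"JOAN vs JUAN:  {record_similarity(joan, juan):.2%}")
print(f"JOAN vs MARIA: {record_similarity(joan, maria):.2%}")
```

Because the surnames carry 70% of the weight, two records with the same surnames and slightly different first names still score high.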
Deduplication: Configurations
• Use filters to normalize values
• Automatic and specific filters are available for values such as names, dates, addresses, etc.
Filters applied:
Name Surname 1 Surname 2
1004 JOAN GARCIA PERES
1017 JORDI GARCIA PERA
1031 MARIA BOU ARNUS
1058 JUAN GARCIA PEREZ
1089 MARI BOU ARUS
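The filtered values on this slide are uppercased and have their accents removed. A minimal sketch of such a normalization filter follows. It assumes the filter only uppercases, strips diacritics, and collapses whitespace, and the function name `normalize` is hypothetical. DAURUM's own filters may do more.

```python
import unicodedata

def normalize(value: str) -> str:
    """Uppercase a value, strip accents and collapse repeated whitespace."""
    decomposed = unicodedata.normalize("NFD", value)
    # Combining marks (category "Mn") are the accents split off by NFD.
    without_accents = "".join(ch for ch in decomposed
                              if unicodedata.category(ch) != "Mn")
    return " ".join(without_accents.upper().split())

rows = [("Joan", "García", "Peres"), ("Mari", "Bou", "Arús")]
normalized = [tuple(normalize(v) for v in row) for row in rows]
print(normalized)
```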
Deduplication: Configurations
• Filter editing (create new filters; delete or update existing ones)
• Use of dictionaries: name-converter dictionary (e.g. Pepe → Jose)
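A name-converter dictionary can be pictured as a lookup applied before comparison. The sketch below is an assumption about how such a filter might behave. `NAME_DICTIONARY` and `apply_dictionary` are hypothetical names, and only the Pepe → Jose entry comes from the slide.

```python
# Hypothetical name-converter dictionary; only this entry appears on the slide.
NAME_DICTIONARY = {"PEPE": "JOSE"}

def apply_dictionary(value: str, dictionary: dict = NAME_DICTIONARY) -> str:
    """Replace a value by its canonical form when the dictionary knows it."""
    key = value.strip().upper()
    return dictionary.get(key, key)

print(apply_dictionary("Pepe"))
print(apply_dictionary("Joan"))
```

Values missing from the dictionary pass through unchanged (uppercased), so the filter can be chained with other normalization filters.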
Deduplication: Configurations
• The similarity computation algorithm is called Record Linkage. Parameters:
  • Sliding window size: the number of records each record is compared to
  • Sorting columns: the columns by which the records are ordered
  • Similarity acceptance threshold
Deduplication: Execution
• Order by Surname 1
• Sliding window = 2
Name Surname 1 Surname 2
1031 MARIA BOU ARNUS
1089 MARI BOU ARUS
1004 JOAN GARCIA PERES
1017 JORDI GARCIA PERA
1058 JUAN GARCIA PEREZ
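The sort-then-slide step can be sketched as follows. One assumption is made: a window of 2 is taken to mean that each record is compared with the 2 records that follow it in sorted order. The slides only say the window is the number of records each one is compared to. The function name `candidate_pairs` is hypothetical.

```python
from itertools import islice

def candidate_pairs(records, sort_key, window):
    """Sort the records, then pair each one with the `window` records after it."""
    ordered = sorted(records, key=sort_key)
    for i, rec in enumerate(ordered):
        for other in islice(ordered, i + 1, i + 1 + window):
            yield rec, other

# Normalized records from the slide: (id, name, surname 1, surname 2).
records = [
    (1004, "JOAN", "GARCIA", "PERES"),
    (1017, "JORDI", "GARCIA", "PERA"),
    (1031, "MARIA", "BOU", "ARNUS"),
    (1058, "JUAN", "GARCIA", "PEREZ"),
    (1089, "MARI", "BOU", "ARUS"),
]

# Order by Surname 1, sliding window = 2.
pairs = list(candidate_pairs(records, sort_key=lambda r: r[2], window=2))
for a, b in pairs:
    print(a[0], b[0])
```

With 5 records and a window of 2, only 7 of the 10 possible pairs are compared. This saving grows with the size of the input, at the cost of missing duplicates that sort far apart.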
Deduplication: Execution
• List of detected similarities
Similarity degree, followed by the two compared records:
88.65% JOAN GARCIA PERES / JUAN GARCIA PEREZ
7.21% JOAN GARCIA PERES / MARI BOU ARUS
8.24% JORDI GARCIA PERA / MARI BOU ARUS
52.57% JORDI GARCIA PERA / JUAN GARCIA PEREZ
80.41% MARIA BOU ARNUS / MARI BOU ARUS
60.82% JOAN GARCIA PERES / JORDI GARCIA PERA
0.0% JORDI GARCIA PERA / MARIA BOU ARNUS
Deduplication: Execution
• List of detected similarities with a percentage above the 50% threshold
88.65% JOAN GARCIA PERES / JUAN GARCIA PEREZ (> 50%)
7.21% JOAN GARCIA PERES / MARI BOU ARUS
8.24% JORDI GARCIA PERA / MARI BOU ARUS
52.57% JORDI GARCIA PERA / JUAN GARCIA PEREZ (> 50%)
80.41% MARIA BOU ARNUS / MARI BOU ARUS (> 50%)
60.82% JOAN GARCIA PERES / JORDI GARCIA PERA (> 50%)
0.0% JORDI GARCIA PERA / MARIA BOU ARNUS
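The thresholding step amounts to a filter over the scored pairs. The sketch below applies the 50% threshold to the similarity degrees listed on the slide. The variable names are hypothetical, and the scores are copied from the slide, not recomputed.

```python
THRESHOLD = 50.0  # acceptance threshold from the slide, in percent

# (similarity degree in percent, record A, record B), as listed on the slide.
similarities = [
    (88.65, "JOAN GARCIA PERES", "JUAN GARCIA PEREZ"),
    (7.21, "JOAN GARCIA PERES", "MARI BOU ARUS"),
    (8.24, "JORDI GARCIA PERA", "MARI BOU ARUS"),
    (52.57, "JORDI GARCIA PERA", "JUAN GARCIA PEREZ"),
    (80.41, "MARIA BOU ARNUS", "MARI BOU ARUS"),
    (60.82, "JOAN GARCIA PERES", "JORDI GARCIA PERA"),
    (0.0, "JORDI GARCIA PERA", "MARIA BOU ARNUS"),
]

# Keep only pairs above the threshold, highest similarity first.
accepted = sorted((s for s in similarities if s[0] > THRESHOLD), reverse=True)
for score, a, b in accepted:
    print(f"{score:.2f}% {a} / {b}")
```

Four pairs survive, and these are the ones passed on to the validation step.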
Deduplication: Validation
• Validation of results (only those above the threshold)
• View by similarity or by group
• Bulk validation
• Validation can be shared among several supervisors
88.65% JOAN GARCIA PERES / JUAN GARCIA PEREZ
52.57% JORDI GARCIA PERA / JUAN GARCIA PEREZ
80.41% MARIA BOU ARNUS / MARI BOU ARUS
60.82% JOAN GARCIA PERES / JORDI GARCIA PERA
Deduplication: Exportation
• Select output format
Validated similarities:
88.65% 1 Joan García Peres / 4 Juan Garcia Pérez
80.41% 3 Maria Bou Arnús / 5 Mari Bou Arús
Export:
1 4 Joan García Pérez
3 5 Maria Bou Arús
Fusion: Configurations
• Input data file format: CSV
• Select the relevant columns to link records
• Relation between columns from different data sources (only when merging)
• Assign types to columns so that the most suitable automatic filters are used
Input data 1:
Joan Garcia Peres bcn 08034
Jordi Prat Junyent taragona 43001
Maria Bou Arnús Mataró 08301
Input data 2:
BARCELONA 08035
TARRAGONA 43002
LLEIDA 25003
MATARO 08301
BADALONA 08917
Fusion: Configurations
• Comparison type: exact value, text-based estimation, or numerical estimation
• Percentage weight of each column in the similarity computation
Column weights: City 80% + CP (postal code) 20% = 100%
Input data 1 (City, CP):
Joan Garcia Peres bcn 08034
Jordi Prat Junyent taragona 43001
Maria Bou Arnús Mataró 08301
Input data 2 (City, CP):
BARCELONA 08035
TARRAGONA 43002
LLEIDA 25003
MATARO 08301
BADALONA 08917
Fusion: Configurations
• Specific percentage for records with null-valued columns
• Use filters to standardize values
• Automatic and specific filters are available for values such as names, dates, addresses, etc.
Input data 1 (City, CP):
Joan Garcia Peres BARCELONA 08034
Jordi Prat Junyent TARAGONA 43001
Maria Bou Arnús MATARO 08301
Input data 2 (City, CP):
BARCELONA 08035
TARRAGONA 43002
LLEIDA 25003
MATARO 08301
BADALONA 08917
Fusion: Configurations
• Filter editing (create new filters; delete or update existing ones)
• Use of dictionaries: name-converter dictionary (e.g. BCN → BARCELONA)
Fusion: Configurations
• The similarity computation algorithm is called Record Linkage. Parameters:
  • Sliding window size: the number of records each record is compared to
  • Sorting columns: the columns by which the records are ordered
  • Similarity acceptance threshold
Fusion: Execution
• Order by City
• Sliding window = 2
City CP
BADALONA 08917
Joan Garcia Peres BARCELONA 08034
BARCELONA 08035
LLEIDA 25003
Maria Bou Arnús MATARO 08301
MATARO 08301
Jordi Prat Junyent TARAGONA 43001
TARRAGONA 43002
Fusion: Execution
• List of detected similarities
95.08% BARCELONA 08034 / BARCELONA 08035
100% MATARO 08301 / MATARO 08301
88.34% TARAGONA 43001 / TARRAGONA 43002
59.41% BADALONA 08917 / BARCELONA 08034
21.35% TARAGONA 43001 / MATARO 08301
12.84% BARCELONA 08034 / MATARO 08301
0% BARCELONA 08034 / LLEIDA 25003
Fusion: Execution
• List of detected similarities with a percentage above the 50% threshold
95.08% BARCELONA 08034 / BARCELONA 08035 (> 50%)
100% MATARO 08301 / MATARO 08301 (> 50%)
88.34% TARAGONA 43001 / TARRAGONA 43002 (> 50%)
59.41% BADALONA 08917 / BARCELONA 08034 (> 50%)
21.35% TARAGONA 43001 / MATARO 08301
12.84% BARCELONA 08034 / MATARO 08301
0% BARCELONA 08034 / LLEIDA 25003
Fusion: Validation
• Validation of results (only those above the threshold)
• View by similarity or by group
• Bulk validation
• Validation can be shared among several supervisors
95.08% BARCELONA 08034 / BARCELONA 08035
100% MATARO 08301 / MATARO 08301
88.34% TARAGONA 43001 / TARRAGONA 43002
59.41% BADALONA 08917 / BARCELONA 08034
Fusion: Exportation
• Output format
• Select values for every similarity
Validated similarities:
100% Maria Bou Arnús Mataró 08301 / MATARO 08301
95.08% Joan Garcia Peres bcn 08034 / BARCELONA 08035
88.34% Jordi Prat Junyent taragona 43001 / TARRAGONA 43002
Export:
Maria Bou Arnús MATARO 08301
Joan Garcia Peres BARCELONA 08035
Jordi Prat Junyent TARRAGONA 43002
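Selecting values for every similarity can be sketched as a merge of each validated pair. In the tool the user chooses which value to keep. The sketch below hard-codes one choice as an assumption: the name comes from input data 1, and the city and CP come from input data 2, which reproduces the exported rows on this slide. The function name `merge` and the field names are hypothetical.

```python
def merge(person: dict, reference: dict) -> dict:
    """Keep the name from source 1; take city and CP from source 2 (assumed choice)."""
    return {"name": person["name"], "city": reference["city"], "cp": reference["cp"]}

# Validated matches from the slide: (record from input data 1, record from input data 2).
matches = [
    ({"name": "Maria Bou Arnús", "city": "Mataró", "cp": "08301"},
     {"city": "MATARO", "cp": "08301"}),
    ({"name": "Joan Garcia Peres", "city": "bcn", "cp": "08034"},
     {"city": "BARCELONA", "cp": "08035"}),
    ({"name": "Jordi Prat Junyent", "city": "taragona", "cp": "43001"},
     {"city": "TARRAGONA", "cp": "43002"}),
]

exported = [merge(person, reference) for person, reference in matches]
for row in exported:
    print(row["name"], row["city"], row["cp"])
```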
Successful stories: Health Service
Who? Health Service
Objective: Detect repeated health ID cards
Solution: Detect repeated records in the database and delete them (deduplication with DAURUM)
Result: Health ID card database cleaned of repetitions
Successful stories: Beer manufacturer
Who? Beer manufacturer
Objective: Detect dealers that deliver to centers not previously assigned to them
Solution:
• Identify duplicates in each dealer's delivery database and delete them (deduplication with DAURUM)
• Detect deliveries to centers shared between different dealers (fusion with DAURUM)
Result: Master database cleaned of repetitions, and dealers with wrong deliveries detected
Architecture
• Struts 2: Model-View-Controller
• Hibernate: database manipulation
Demo
Thanks for your attention
Any questions?
DAMA-UPC. DATA MANAGEMENT (UPC) Departament d'Arquitectura de Computadors
Edifici C6-S103. Campus Nord. Jordi Girona, 1-3. 08034 - Barcelona
www.dama.upc.edu
SPARSITY-TECHNOLOGIES
Jordi Girona, 1-3, Edifici K2M
08034 [email protected]
http://www.sparsity-technologies.com