lizlavaveshkul
ETL QualityStage: Matching Stage
A simplified explanation of the Matching Stage
Data matching in ETL QualityStage
Data matching is used to find records in a single data source or independent data sources that refer to the same entity (such as a person, organization,
location, product, or material) regardless of the availability of a predetermined key.
Let’s take a look at a simplified example and examine the process.
Let’s say our neighborhood club decided to display our pictures on the clubhouse
bulletin board.
… But we only want to post one picture per club member.
Neighbors submit their pictures, but some
neighbors submit more than one picture.
Since we agreed to post only ONE picture, we’ll
have to weed out “duplicates”
(pictures of the same person).
How do we find pictures of the same person?
Well, traditionally, we’d compare them one by one to determine if they
match certain criteria (same eyes, nose, etc.)
Same person?
No.
Same person?
No.
Same person?
No.
Same person?
No.
Same person?
No.
You get the idea.
We have 12 pictures, so we’ll have to compare 12 pictures 12 times.
That’s 144 comparisons!
The Matching Stage in QualityStage simplifies the work.
Matching is a two-step process: first you block records
and then you match them.
Blocking identifies subsets of data so that
matches can be more efficiently performed.
These subsets are called blocks.
Blocking
Let’s say we decide to block the data.
We decide to form four subsets:
• Females < 18 years old
• Females > 18 years old
• Males < 18 years old
• Males > 18 years old
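To make the blocking step concrete, here is a minimal Python sketch that sorts pictures into the four subsets. It is only an illustration, not how QualityStage is configured: the picture ids, the sexes and ages, and the helper names `block_key` and `build_blocks` are all hypothetical, and placing someone who is exactly 18 in the "> 18" block is an assumption, since the slides do not cover that case.

```python
from collections import defaultdict

# Hypothetical records: (picture id, sex, age). All values are invented.
pictures = [
    ("pic01", "F", 12),
    ("pic02", "F", 34),
    ("pic03", "M", 9),
    ("pic04", "F", 15),
    ("pic05", "M", 41),
    ("pic06", "F", 29),
]


def block_key(sex, age):
    """Return the label of the block a picture belongs to.

    Assumption: someone who is exactly 18 goes into the "> 18" block,
    because the slides do not say where that boundary case belongs.
    """
    group = "Females" if sex == "F" else "Males"
    bracket = "< 18" if age < 18 else "> 18"
    return f"{group} {bracket} years old"


def build_blocks(records):
    """Group picture ids by their block label."""
    blocks = defaultdict(list)
    for pic_id, sex, age in records:
        blocks[block_key(sex, age)].append(pic_id)
    return dict(blocks)


blocks = build_blocks(pictures)
for label, members in sorted(blocks.items()):
    print(label, members)
```

Only pictures that share a block label are compared with each other afterwards, which is where the savings come from.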
Making comparisons is easier now.
Compare 5 pictures 5 times = 25 comparisons
Compare 3 pictures 3 times = 9 comparisons
Compare 2 pictures 2 times = 4 comparisons
Compare 2 pictures 2 times = 4 comparisons
25 + 9 + 4 + 4 = 42 comparisons.
That’s much more efficient than the 144 comparisons we had earlier, when we were doing one-on-one matching.
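The arithmetic can be checked with a short Python sketch. It uses the block sizes from the example (5, 3, 2, and 2 pictures) and counts the way the slides do, with every picture compared against every picture in its group. With those sizes the blocked total works out to 25 + 9 + 4 + 4 = 42. The function names are hypothetical, and the second count (distinct pairs only) is an extra shown for comparison, not something taken from the slides.

```python
def all_against_all(n):
    """Comparisons when each of n pictures is checked against all n (the slides' way of counting)."""
    return n * n


def distinct_pairs(n):
    """Comparisons when each unordered pair of different pictures is checked once."""
    return n * (n - 1) // 2


block_sizes = [5, 3, 2, 2]          # sizes of the four blocks in the example
total_pictures = sum(block_sizes)   # 12 pictures overall

unblocked = all_against_all(total_pictures)
blocked = sum(all_against_all(n) for n in block_sizes)
print(unblocked, blocked)  # 144 42

unblocked_pairs = distinct_pairs(total_pictures)
blocked_pairs = sum(distinct_pairs(n) for n in block_sizes)
print(unblocked_pairs, blocked_pairs)  # 66 15
```

Either way of counting shows the same effect: blocking cuts the work to a fraction of the unblocked total.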
To review:
Blocking identifies subsets of data within which matches can be more efficiently performed.
Matching identifies relationships among records.
Matching is a 2-step process:
- First you block the records.
- Then you match them.
Matching
Females >18 years old
Let’s pause for a minute to examine the matching process more closely.
First, we have to make certain decisions to set up rules.
Will all of the criteria have to match exactly?
(If NO) Will some criteria be more important than other criteria?
(If YES) Can we use some of QualityStage’s “fuzzy logic”?
Which criteria will be more important? We will have to assign weights.
Let’s see what could happen if we were to apply the strict rule:
All the criteria have to match exactly.
In our example, the people in the pictures will need to have the same shape and color of eyes, same length and
color of hair, same hairstyle, etc.
If someone had different hairstyles in two pictures, for example, this strict rule would force us to say that
the pictures show different people.
If the rule were “All the criteria have to match exactly”:
We would have to conclude that these are not pictures of the same person.
Criterion | First picture | Second picture | Result
Eyes | Large, oval, brown, long eyelashes | Large, oval, brown, long eyelashes | Match
Nose | Not visible | Not visible | Match
Mouth | Small, petite | Large, open, tongue visible | No Match
Hair | Dark brown, long, straight | Light brown, long, straight | No Match
If the rule were “All the criteria have to match exactly”:
We would have to conclude that these are not pictures of the same person.
Criterion | First picture | Second picture | Result
Eyes | Large, oval, brown, long eyelashes | Large, oval, brown, long eyelashes | Match
Nose | Not visible | Not visible | Match
Mouth | Small, petite | Large, closed, tongue visible | No Match
Hair | Dark brown, long, straight | Dark brown, long, straight | Match
If the rule were “All the criteria have to match exactly”:
We would have to conclude that these are not pictures of the same person.
Criterion | First picture | Second picture | Result
Eyes | Large, oval, brown, long eyelashes | Large, oval, brown, long eyelashes | Match
Nose | Not visible | Not visible | Match
Mouth | Large, open, tongue visible | Large, closed, tongue visible | No Match
Hair | Dark brown, long, straight | Light brown, long, straight | No Match
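The strict rule is easy to express in code. The Python sketch below is only an illustration of the "all criteria must match exactly" idea, not QualityStage syntax. The three pictures use the descriptions from the slides, but the names `picture_a`, `picture_b`, `picture_c` and the helpers `compare` and `exact_match` are hypothetical.

```python
CRITERIA = ("eyes", "nose", "mouth", "hair")

# Three pictures described with the values from the slides (names are hypothetical).
picture_a = {
    "eyes": "Large, oval, brown, long eyelashes",
    "nose": "Not visible",
    "mouth": "Small, petite",
    "hair": "Dark brown, long, straight",
}
picture_b = {
    "eyes": "Large, oval, brown, long eyelashes",
    "nose": "Not visible",
    "mouth": "Large, closed, tongue visible",
    "hair": "Dark brown, long, straight",
}
picture_c = {
    "eyes": "Large, oval, brown, long eyelashes",
    "nose": "Not visible",
    "mouth": "Large, open, tongue visible",
    "hair": "Light brown, long, straight",
}


def compare(first, second):
    """Return, per criterion, whether the two descriptions are identical."""
    return {c: first[c] == second[c] for c in CRITERIA}


def exact_match(first, second):
    """Strict rule: the pictures match only if every criterion is identical."""
    return all(compare(first, second).values())


print(compare(picture_a, picture_b))
print(exact_match(picture_a, picture_b))  # False: the mouth differs
```

A single differing criterion is enough for the strict rule to reject the pair, which is why no two of these pictures count as the same person.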
As an alternative, we can use some of QualityStage’s “fuzzy logic” and assign “weights” to the criteria.
We will have to decide: Which criteria are more important?
We could assign weights to the criteria.
Criterion | First picture | Second picture | Third picture
Eyes | Large, oval, brown, long eyelashes | Large, oval, brown, long eyelashes | Large, oval, brown, long eyelashes
Nose | Not visible | Not visible | Not visible
Mouth | Small, petite | Large, closed, tongue visible | Large, open, tongue visible
Hair | Dark brown, long, straight | Dark brown, long, straight | Light brown, long, straight
For example, we could assign higher weights to “nose” and “eyes,” a lower weight to “mouth,” and the lowest weight to “hair.”
Using these assigned weights, QualityStage can help us conclude that these are pictures of the same person.
Criterion | First picture | Second picture | Third picture | Result
Eyes | Large, oval, brown, long eyelashes | Large, oval, brown, long eyelashes | Large, oval, brown, long eyelashes | Match
Nose | Not visible | Not visible | Not visible | Match
Mouth | Small, petite | Large, closed, tongue visible | Large, open, tongue visible | Match
Hair | Dark brown, long, straight | Dark brown, long, straight | Light brown, long, straight | Match
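The idea of weighted matching can be sketched in a few lines of Python. This is a deliberately simplified illustration and not QualityStage's actual scoring. The numeric weights and the cutoff are invented for this example and follow only the ordering given in the slides (eyes and nose highest, mouth lower, hair lowest). The names `WEIGHTS`, `CUTOFF`, `weighted_score`, and `is_match` are hypothetical.

```python
# Hypothetical weights: only their ordering comes from the slides.
WEIGHTS = {"eyes": 4, "nose": 4, "mouth": 2, "hair": 1}
CUTOFF = 8  # hypothetical: pairs scoring at least this much count as a match

picture_a = {
    "eyes": "Large, oval, brown, long eyelashes",
    "nose": "Not visible",
    "mouth": "Small, petite",
    "hair": "Dark brown, long, straight",
}
picture_b = {
    "eyes": "Large, oval, brown, long eyelashes",
    "nose": "Not visible",
    "mouth": "Large, closed, tongue visible",
    "hair": "Dark brown, long, straight",
}
picture_c = {
    "eyes": "Large, oval, brown, long eyelashes",
    "nose": "Not visible",
    "mouth": "Large, open, tongue visible",
    "hair": "Light brown, long, straight",
}


def weighted_score(first, second):
    """Add up the weights of the criteria on which the two pictures agree."""
    return sum(w for c, w in WEIGHTS.items() if first[c] == second[c])


def is_match(first, second):
    """Declare a match when the agreement score reaches the cutoff."""
    return weighted_score(first, second) >= CUTOFF


print(weighted_score(picture_a, picture_b), is_match(picture_a, picture_b))  # 9 True
print(weighted_score(picture_a, picture_c), is_match(picture_a, picture_c))  # 8 True
print(weighted_score(picture_b, picture_c), is_match(picture_b, picture_c))  # 8 True
```

With these made-up numbers, agreement on the heavily weighted criteria is enough to reach the cutoff, so a different mouth or hair color no longer rules out a match. Choosing real weights and cutoffs is a tuning decision that depends on the data.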