
Data Quality

Class 8

Agenda

• Tests will be graded by next week

• Project

• Discovery
  – Domain discovery
  – Domain identification
  – Nulls
  – Clustering for rule discovery

Discovery

• Domain Discovery

• Mapping Discovery

• Key Discovery

• Rule Discovery

First...

• Consult the manual

• Consult the experts

Automated Discovery

• Searching for:
  – obvious
  – insight
  – patterns
  – knowledge

What are the characteristics of domains?

– They are typically finite data sets
– They are relatively static
– They may be used in many contexts
– The number of values is relatively small
– “intuitive distribution”
– Attributes that use domains are typically not null

Intuitive Distribution

• The distribution may be an even distribution
• The distribution of values takes on the characteristics of the context
• Example: Security transactions
  – Most will be denominated in US dollars
  – Many will be denominated in £ or ¥
  – Fewer in other currencies
  – The currency will draw its values from a domain, but the distribution will be largely skewed to USD, then GBP, JPY, then others...

Domain Discovery

• The Brute Force method
• Good for “seeding” a set of domains when no information is already available
• For each column, take all distinct values
• Apply heuristics (see the sketch below):
  – How does the number of distinct values compare to the number of records from which the values were extracted?
  – Is the distribution relatively equal?
  – How many nulls are there?
  – Are the data values similar to other attributes’ values?
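A minimal sketch of these heuristics, assuming each record is a Python dict keyed by column name; the thresholds in looks_like_domain are illustrative choices, not values from the lecture.

```python
from collections import Counter

def profile_column(rows, column):
    """Collect the brute-force statistics used to judge whether a column draws on a domain."""
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    n_records = len(values)
    return {
        "distinct_to_record_ratio": len(counts) / n_records if n_records else 0.0,
        "null_count": n_records - len(non_null),
        # "relatively equal" distribution: compare the most and least frequent values
        "max_to_min_frequency": max(counts.values()) / min(counts.values()) if counts else None,
        "distinct_values": set(counts),
    }

def looks_like_domain(profile, max_ratio=0.01, max_nulls=0):
    """Illustrative test: few distinct values relative to the record count, and few nulls."""
    return profile["distinct_to_record_ratio"] <= max_ratio and profile["null_count"] <= max_nulls
```

The distinct value sets collected here can then be compared across columns to spot attributes that appear to share a domain.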

Pattern-based Domain Discovery

• Analyze value patterns
• Start from the bottom up:
  – Tag each character as:
    • digit
    • letter
    • punctuation
    • white space
• Sort the found patterns and see if there is any overriding similarity (a sketch follows)
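A sketch of the bottom-up character tagging, assuming each value is a string; the single-letter tags (D, A, S, P) are just one possible encoding.

```python
from collections import Counter

def tag_pattern(value: str) -> str:
    """Map each character to a class tag: D=digit, A=letter, S=white space, P=punctuation/other."""
    tags = []
    for ch in value:
        if ch.isdigit():
            tags.append("D")
        elif ch.isalpha():
            tags.append("A")
        elif ch.isspace():
            tags.append("S")
        else:
            tags.append("P")
    return "".join(tags)

def pattern_frequencies(values):
    """Sort the found patterns by frequency to see whether one overriding pattern dominates."""
    return Counter(tag_pattern(v) for v in values).most_common()

# Example: US-style phone numbers share the pattern 'DDDPDDDPDDDD'
print(pattern_frequencies(["555-123-4567", "555-987-6543", "unknown"]))
```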

Pattern-based Domains – 2

• If nothing stands out, go up to the next level
• Characterize individual strings by classes:
  – alphabetic
  – numeric
  – alphanumeric
  – verbs
  – nouns
  – name words
  – business words
  – address words
  – email addresses

Pattern-based Domains – 3

• Where do we get these word classifications?
  – Experience
  – USPS
  – Census Bureau
  – SIC classifications
  – Domain registry
  – Standards bodies

• It may be worthwhile to collect words by classification as reference data

• These tables can also be scanned using the same methods for fast matching

Domain Identification

• Given an already existing set of domains, how can we determine whether a column makes use of one of those domains?

• Algorithm:
  – All distinct values are extracted
  – Perform a match against known domains
  – Compute 3 ratios: agreement, overlap, disagreement

Fast Matching: Method 1

• We must be able to quickly match a pair of data value sets

• Method 1: Hash Tables (see the sketch below)
  – Load the first domain into a hash table (O(n))
  – Search for each value of the second domain (O(n))
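A minimal sketch, assuming both value sets fit in memory; in Python a built-in set plays the role of the hash table.

```python
def hash_match(domain_a, domain_b):
    """Load one value set into a hash table (O(n)), then probe it with the other (O(n))."""
    table = set(domain_a)                          # build step
    return [v for v in domain_b if v in table]     # probe step: values common to both sets
```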

Fast Matching: Method 2

• Bloom Filters

• An approximate method for testing set membership

• If the Bloom filter test says no, the answer is no

• If the Bloom filter says yes, the answer might still be no, with low (bounded) probability

Bloom Filters

• Use k hash functions (h1, h2, ..., hk)
• For each member of the set, invoke each of the hash functions, and set a bit in a bit vector for each result
• For testing a value for membership (see the sketch below):
  – invoke all hash functions
  – if not all result bits are set in the bit vector, the value is not in the set
  – if all the bits are set, there is still a small chance the value is not in the set
  – that probability is bounded based on the number of bits in the bit vector and the number of hash functions
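A minimal Bloom filter sketch; deriving the k hash functions from two halves of a SHA-256 digest (double hashing) is an implementation choice for illustration, not something prescribed in the lecture.

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits, k_hashes):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)   # bit vector, initially all zeros

    def _positions(self, value):
        # Derive k hash values h1..hk from two halves of a SHA-256 digest (double hashing).
        digest = hashlib.sha256(str(value).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, value):
        # Set one bit per hash function for each member of the set.
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # "No" is definite; "yes" is correct with high, bounded probability.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value))

# Usage: bf = BloomFilter(100_000, 7); bf.add("USD"); bf.might_contain("USD")  -> True
```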

Bloom Filters – 2

• The table is used optimally when it is half-filled with 1’s
• The best choice for sizing the table is based on the expected probability of finding a false positive
• This is calculated as a function of n, the number of words, m, the size of the table, and k, the number of hash functions
• k is approximately (ln 2) · m/n
• The error probability is approximately 2^-k (see the quick check below)
• (email me for more details on this analysis)
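A quick check of those two formulas; the example numbers are illustrative.

```python
import math

def bloom_parameters(n_words, m_bits):
    """Optimal hash-function count and approximate false-positive probability."""
    k = max(1, round(math.log(2) * m_bits / n_words))   # k ~ (ln 2) * m/n
    return k, 2 ** -k                                    # error probability ~ 2^-k

# e.g. 10,000 domain values in a 100,000-bit table -> k = 7, error ~ 0.8%
print(bloom_parameters(10_000, 100_000))
```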

Three Ratios

• agreement is calculated as the ratio of distinct attribute values that are present in a domain to the total number of distinct values in the attribute

• overlap is calculated as the number of domain member values that do not appear in the attribute divided by the number of domain values

• disagreement is calculated as the number of values that appear in the attribute but are not members of the domain, divided by the total number of distinct values in the attribute (see the sketch below)
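A sketch of the three ratios over distinct value sets; it assumes the attribute has at least one distinct value.

```python
def domain_ratios(attr_values, domain_values):
    """Agreement, overlap, and disagreement between an attribute and a known domain."""
    attr, dom = set(attr_values), set(domain_values)
    agreement = len(attr & dom) / len(attr)       # attribute values found in the domain
    overlap = len(dom - attr) / len(dom)          # domain values never used by the attribute
    disagreement = len(attr - dom) / len(attr)    # attribute values outside the domain
    return agreement, overlap, disagreement
```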

Domain Matching

1. All of the values used in the attribute are members of the known domain, and all of the values in the domain are used in the attribute. (agreement = 100%, overlap = 0%, disagreement = 0%)

• GOOD MATCH

2. All of the values in the attribute are members of the known domain, but there are domain members that are not used in the attribute. (agreement = 100%, overlap > 0%, disagreement = 0%)

• Also GOOD MATCH (explore as subdomain?)

Domain Matching – 2

3. Some of the attribute values are members of the known domain, but some of the values used in the attribute are not members of the known domain. (agreement < 100%, disagreement > 0%)

• either not a match, or perhaps the known domain is a subdomain

4. None of the values used in the attribute are taken from the known domain. (agreement = 0%, overlap= 100%, disagreement = 100%)

• Not a match
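A small sketch that maps the three ratios onto these four cases; the returned labels are purely illustrative.

```python
def classify_match(agreement, overlap, disagreement):
    """Interpret the ratios according to the four domain-matching cases."""
    if agreement == 1.0 and overlap == 0.0:
        return "good match"                                        # case 1: attribute and domain coincide
    if agreement == 1.0:
        return "good match (explore as subdomain?)"                # case 2: domain has unused values
    if 0.0 < agreement < 1.0 and disagreement > 0.0:
        return "not a match, or the known domain is a subdomain"   # case 3: partial agreement
    return "not a match"                                           # case 4: no shared values
```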

Sub and Super Domains

• subdomain: a set of values that is a subset of another domain

• superdomain: a set of values that is a superset of a known domain, containing slightly more values

• Overloaded attribute: some of the values belong to one domain, other values belong to another domain

Mapping Discovery

• Much harder than domain discovery

• Must look for connectivity patterns

• one-to-one mappings are easiest

One-to-One Mappings

• Both attributes must belong to a domain
• O(n²) algorithm
• Extract value pairs from both attributes
• When all distinct pairs have been extracted, there should not be any duplicate entries of the source attribute value
• Because different source values can map to the same target value, this constraint does not hold for the target attribute values (see the sketch below)
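A sketch of that check, assuming records are dicts and the attribute names are passed in; it verifies that no source value appears with two different target values, while allowing target values to repeat.

```python
def is_one_to_one_mapping(rows, source, target):
    """True if each distinct source value maps to exactly one target value."""
    seen = {}
    for row in rows:
        s, t = row[source], row[target]
        if s in seen and seen[s] != t:
            return False          # the same source value maps to two different targets
        seen[s] = t
    return True
```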

Mapping Heuristics

• We can also apply some heuristics:
  – See if the number of distinct values in the source attribute is greater than or equal to the number of distinct values in the target attribute
  – See if the attribute names appear together in more than one table or relation
• We can also apply rule discovery, since 1-1 mappings are similar to functional dependencies

Rule Discovery

• Null rules

• Value ranges

• Key discovery

• Association rules

Null Value Rules

• For each attribute:
  – Are there missing values?
    • Opens other questions, such as whether nulls are really allowed, what kinds of nulls there are, etc.
  – If so, is there any relationship to other attributes?
    • This starts to look like a mapping from another attribute to the null value
  – Are there existing null value representations?
    • These would potentially be reflected as anomalous patterns that appear with some frequency (see the sketch below)
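A sketch of that first profiling pass, assuming rows are dicts; the NULL_TOKENS list of candidate null representations is illustrative and would normally come out of the pattern analysis described earlier.

```python
# Illustrative stand-ins that often encode "no value" in real data sets.
NULL_TOKENS = {"", "N/A", "NA", "NULL", "-", "UNKNOWN", "999"}

def null_profile(rows, column):
    """Count true nulls and suspicious null-like representations for one attribute."""
    values = [row.get(column) for row in rows]
    missing = sum(1 for v in values if v is None)
    null_like = sum(1 for v in values
                    if isinstance(v, str) and v.strip().upper() in NULL_TOKENS)
    return {"records": len(values), "missing": missing, "null_like": null_like}
```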

Rule Discovery Techniques

• Clustering

• Decision Trees

• Association Rules

Clustering

• A set of algorithms that are used to segment items in a set into separate partitions, where each value is assigned to a specific partition

• We group items by some set of properties, and assign a meaning to each cluster

• Clustering hinges on the notion of “distance” between items in a set, making use of a distance function to determine how close any two items are to each other

• We can cluster value sets composed of values pulled from multiple attributes

Clustering – 2

• We can discover relationships between values based on assignment to a cluster
• Clusters are then sources for rules
• Two kinds of clustering:
  – Hierarchical (such as a classification taxonomy)
  – Non-hierarchical (such as a distribution into equivalence classes)
• Clustering can be performed in one of 2 ways:
  – Agglomerative – collect clusters together
  – Divisive – break clusters apart

Similarity and Distance

• To assign data items to clusters, we need a notion of distance and similarity

• Obviously, similarity is measured as a function of distance

• All items to be explored are projected into an n-dimensional space

• Distance can be measured based on traditional distance functions, or specialized distance functions

Distance Measures

• Euclidean distance: the distance between any two points is the square root of the sum of the squared differences between the corresponding coordinates

• City Block distance: this measures the distance in terms of walking along the line segments in a grid

• Exact-match distance: For variables whose values are not allocated along continuous dimensions, the values assigned are either 1 for an exact match, or 0 for not a match
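Sketches of the three measures over equal-length value vectors; the function names are mine.

```python
import math

def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def city_block(a, b):
    """Sum of absolute coordinate differences (walking along grid line segments)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def exact_match(a, b):
    """For categorical variables: each position scores 1 for an exact match, 0 otherwise."""
    return [1 if x == y else 0 for x, y in zip(a, b)]
```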

Divisive Clustering

1. Assign all values to a single cluster.
2. Determine a division of each cluster that maximizes the similarity between all the elements within one side of the division and maximizes the difference between any pair of items across the division.
3. Repeat step 2 until either some predefined limit on the number of clusters has been reached, or some predefined limit on cluster size (the minimum, of course, is 1) has been reached.

Divisive Clustering – 2

• Requires the storage of a lot of data

• Each pair of clusters needs to be examined to measure the difference between them

• Computationally intensive

• Not very popular

Agglomerative Clustering

1. Assign each data element to its own cluster

2. Find the two closest clusters and combine them into a single cluster

3. Repeat step 2 until there is only one cluster left

• The tricky part is determining which clusters are closest to each other

Agglomerative Clustering – 2

• Methods for determining cluster closeness:
  – Single link method: In this method, the distance between any two clusters is determined by the distance between the nearest neighbors in the two clusters
  – Complete link: In this method, the distance between any two clusters is determined by the greatest distance between any two data items in the two different clusters
  – Unweighted pair-group average: In this method, the distance between any two clusters is calculated as the average distance between all pairs of data items in the two different clusters

Agglomerative Clustering – 3

– Weighted pair-group average: This method is similar to the unweighted pair-group average, except that the size of the cluster is used as a weight

– Unweighted pair-group centroid: A centroid is a point calculated as the average point in a space formed by a cluster. For any multidimensional space, the centroid effectively represents the center of gravity for all the points in a cluster. In this method, the distance between any two clusters is measured as the distance between centroids of the two clusters

– Weighted pair-group centroid: This is similar to the unweighted pair-group centroid method, except that the size of the cluster is used as a weight
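A minimal sketch of the agglomerative loop using the single link method; dist is any pairwise distance function (for example the euclidean sketch shown earlier), and rescanning every pair of clusters at each step is kept only for clarity.

```python
def single_link(c1, c2, dist):
    """Cluster distance = distance between the nearest neighbors in the two clusters."""
    return min(dist(a, b) for a in c1 for b in c2)

def agglomerative(points, dist, n_clusters=1):
    """Step 1: one cluster per item; step 2: merge the two closest clusters; repeat (step 3)."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: single_link(clusters[pair[0]], clusters[pair[1]], dist),
        )
        clusters[i].extend(clusters.pop(j))   # combine the two closest clusters
    return clusters

# Example: agglomerative([(1.0,), (1.2,), (8.0,)], euclidean, n_clusters=2)
```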

Value Ranges Using Clustering

• This uses an agglomerative method
• Look at aggregating values based on closeness in Euclidean space
• Assume each value is in its own cluster
• Aggregate based on distance to other clusters
• For a single dimension (such as integer values), each cluster will represent a value range
• We can set limits on distance as a way to stop the iteration (see the sketch below)
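A one-dimensional special case of that idea, assuming integer values: sort the distinct values and merge neighbors whose gap is within a distance limit, so each cluster becomes a value range. The max_gap threshold is illustrative.

```python
def value_ranges(values, max_gap=1):
    """Group sorted numeric values into (low, high) ranges; a larger gap starts a new range."""
    ranges = []
    for v in sorted(set(values)):
        if ranges and v - ranges[-1][1] <= max_gap:
            ranges[-1] = (ranges[-1][0], v)   # extend the current range
        else:
            ranges.append((v, v))             # start a new range
    return ranges

# Example: value_ranges([1, 2, 3, 10, 11, 50]) -> [(1, 3), (10, 11), (50, 50)]
```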

Clustering for Classification

• We can use clustering as a way to classify records based on attributes of our choice

• If we have some ideas about rule relations, we can test them out using clustering:
  – Extract the attribute values expected in the rule
  – Normalize any values
  – Apply any weights
  – Perform clustering

• Once the clustering has been done, those characteristics that are the drivers in determining similarity can be used as the basis for defining new rules

K-Means Clustering

• K stands for the number of clusters we would like to find

1. Arbitrarily select K seeds that are to be the first guess at determining the centroids of the clusters.

2. For each record, assign the record to a cluster based on the cluster centroid to which it is nearest.

3. After all records have been assigned to a cluster, recompute the centroids of the clusters.

4. If the centroids changed from the previous iteration, then repeat the process starting at step 2.

5. If the centroids did not change from the previous step, then the process is finished.
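A minimal k-means sketch following steps 1-5, assuming records are numeric tuples and dist is a distance function such as the euclidean sketch above; seeding with random.sample is just one way to make the arbitrary first guess.

```python
import random

def k_means(records, k, dist, max_iters=100):
    """Seed K centroids, assign records to the nearest centroid, recompute until stable."""
    centroids = random.sample(records, k)                       # step 1: arbitrary seeds
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for rec in records:                                      # step 2: assign to nearest centroid
            nearest = min(range(k), key=lambda i: dist(rec, centroids[i]))
            clusters[nearest].append(rec)
        new_centroids = [                                        # step 3: recompute centroids
            tuple(sum(dim) / len(dim) for dim in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:                           # step 5: unchanged -> finished
            break
        centroids = new_centroids                                # step 4: changed -> iterate again
    return clusters
```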