
Page 1: CS910: Foundations of Data Analytics

CS910: Foundations of Data Analytics
Graham Cormode

[email protected]

Data Basics

Page 2: CS910: Foundations of Data Analytics

Objectives

• Introduce formal concepts and measures for describing data
• Refresh on relevant concepts from statistics
  – Distributions, mean and standard deviation, quantiles
  – Covariance, correlation, and correlation tests
• Introduce measures of similarity/distance between records
• Understand issues of data quality
  – Techniques for data cleaning, integration, transformation, reduction

Recommended Reading:
• Chapter 2 (Getting to know your data) and Chapter 3 (Data preprocessing) in "Data Mining: Concepts and Techniques" (Han, Kamber, Pei).


Page 3: CS910: Foundations of Data Analytics

Example Data Set

• Show examples using the “adult census data”
  – http://archive.ics.uci.edu/ml/machine-learning-databases/adult/
  – File: adult.data

• Tens of thousands of individuals, one per line
  – Age, Gender, Employment Type, Years of Education…

• Widely studied in the Machine Learning community
  – Prediction task: is income > 50K?
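As an illustrative sketch (not part of the original slides), the file can be loaded for exploration with pandas; the column names below are assumptions based on the accompanying adult.names documentation, since adult.data itself has no header row.

    import pandas as pd

    # Assumed column names, following the UCI "adult.names" description
    columns = ["age", "workclass", "fnlwgt", "education", "education-num",
               "marital-status", "occupation", "relationship", "race", "sex",
               "capital-gain", "capital-loss", "hours-per-week",
               "native-country", "income"]

    # adult.data is comma-separated with no header row; "?" marks missing values
    adult = pd.read_csv("adult.data", names=columns, na_values="?",
                        skipinitialspace=True)
    print(adult.shape)     # roughly 32K rows, 15 columns
    print(adult.head())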


Page 4: CS910: Foundations of Data Analytics

Digging into Data

• Examine the adult.data set:
  39, State-gov, 77516, Bachelors, 13, Never-married, …

• The data is formed of many records
  – Each record corresponds to an entity in the data: e.g. a person
  – May be called tuples, rows, examples, data points, samples

• Each record has a number of attributes
  – May be called features, columns, dimensions, variables

• Each attribute is one of a number of types
  – Categoric, binary, numeric, ordered


Page 5: CS910: Foundations of Data Analytics

Types of Data

• Categoric or nominal attributes take on one of a set of values
  – Country: England, Mexico, France…
  – Marital status: Single, Divorced, Never-married
  – May not be able to “compare” values: only the same or different

• Binary attributes take one of two values (often true or false)
  – An (important) special case of categoric attributes
  – Income >50K: Yes/No
  – Sex: Male/Female


Page 6: CS910: Foundations of Data Analytics

Types of Data

• Ordered attributes take values from an ordered set
  – Education level: high-school, bachelors, masters, phd
      This example is not necessarily fully ordered: is an MD > a PhD?
  – Coffee size: tall, grande, venti

• Numeric attributes measure quantities
  – Years of education: integer in range 1-16 [in this data set]
  – Age: integer in range 17-90
  – Could also be real-valued, e.g. temperature 20.5C


Page 7: CS910: Foundations of Data Analytics

Weka

• A free software tool available to use for data analysis
  – WEKA: “Waikato Environment for Knowledge Analysis”
  – From Waikato University (NZ): www.cs.waikato.ac.nz/ml/weka/
  – Open source software in Java for data analysis

• Implements many of the ‘core’ algorithms we will be studying
  – Regression, classification, clustering

• Graphical user interface, but also Java implementations
  – Has some limited visualization options
  – Few options for data manipulation inside Weka (can do this outside)

• Prefers to read custom “arff” files: attribute-relation file format
  – adult.data converted to arff:
      www.inf.ed.ac.uk/teaching/courses/dme/data/Adult/adult.arff


Page 8: CS910: Foundations of Data Analytics

Metadata

• Metadata = data about the data
  – Describes the data (e.g. by listing the attributes and types)

• Weka uses .arff : Attribute-Relation File Format
  – Begins by providing metadata before listing the data
  – Example (the “Iris” data set):

    @RELATION iris
    @ATTRIBUTE sepallength NUMERIC
    @ATTRIBUTE sepalwidth NUMERIC
    @ATTRIBUTE petallength NUMERIC
    @ATTRIBUTE petalwidth NUMERIC
    @ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
    @DATA
    5.1,3.5,1.4,0.2,Iris-setosa
    4.9,3.0,1.4,0.2,Iris-setosa
    4.7,3.2,1.3,0.2,Iris-setosa
    4.6,3.1,1.5,0.2,Iris-setosa


Page 9: CS910: Foundations of Data Analytics

The Statistical Lens

• It is helpful to use the tools of statistics to view data
  – Each attribute can be viewed as describing a random variable
  – Look at statistical properties of this random variable
  – Can also look at combinations of attributes
      Correlation, joint and conditional distributions

• Distributions from data are typically discrete, not continuous
  – We study the empirical distribution given by the data
  – The event probability is the frequency of occurrences in the data


[Figures: a discrete distribution and a continuous distribution]

Page 10: CS910: Foundations of Data Analytics

Distributions of Data

• Basic properties of a numeric variable X with observations X1…Xn:
  – Mean of data, μ or E[X] = (Σi Xi)/n: average age in adult.data = 38.6
      Linearity of expectation: E[X + Y] = E[X] + E[Y]; E[cX] = cE[X]
  – Standard deviation σ, or √(Var[X]): std. dev. of age = 13.64
      Var[X] = E[X²] – (E[X])² = (Σi Xi²)/n – ((Σi Xi)/n)²
      Properties: Var[aX + b] = a² Var[X]
  – Mode: most commonly occurring value
      Most common age in adult.data: 36 (898 examples)
  – Median: the midpoint of the distribution
      Half the observations are above, half below
      For n even, take the mean of the two middle values
      Median of ages in adult.data = 37
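A minimal sketch of computing these summary statistics for the age attribute, assuming the data has been loaded into a pandas DataFrame called adult as in the earlier sketch:

    age = adult["age"]

    print(age.mean())                # mean: about 38.6
    print(age.std(ddof=0))           # population standard deviation: about 13.6
    print(age.median())              # median: 37
    print(age.mode()[0])             # mode (most common value): 36
    print(age.value_counts().max())  # number of examples taking the modal value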


Page 11: CS910: Foundations of Data Analytics

Probability Distributions

• Given random variable X:
  – Probability distribution function (PDF), Pr[X = x]
  – Cumulative distribution function (CDF), Pr[X ≤ x]
  – Complementary cumulative distribution function (CCDF), Pr[X > x]
      Pr[X > x] = 1 – Pr[X ≤ x]

• An attribute defines an empirical probability distribution
  – Pr[X = x] = (Σi 1[Xi = x])/n [the fraction of examples equal to x]
  – E[X] = Σx Pr[X = x] · x
  – Median(X): x such that Pr[X ≤ x] = 0.5

• Conditional probability, Pr[X | Y]: probability of X given Y
  – In data, compute empirically, e.g. Pr[Age = 30 | Sex = Female]
      Compute Pr[Age = 30] only from examples where Sex = Female
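A sketch of computing these empirical quantities directly from the DataFrame assumed earlier:

    n = len(adult)

    # Empirical PDF at a point: Pr[Age = 30] = fraction of examples with age 30
    p_age_30 = (adult["age"] == 30).sum() / n

    # Empirical CDF at a point: Pr[Age <= 30]
    cdf_30 = (adult["age"] <= 30).sum() / n

    # Conditional probability Pr[Age = 30 | Sex = Female]:
    # restrict to the Female examples, then take the fraction with age 30
    female = adult[adult["sex"] == "Female"]
    p_age_30_given_female = (female["age"] == 30).mean()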

Page 12: CS910: Foundations of Data Analytics

Quantiles

• The quantiles generalize the median
• The f-quantile is the point x such that Pr[X ≤ x] = f
  – The median is the 0.5-quantile
  – The 0-quantile is the minimum, the 1-quantile is the maximum
  – The quartiles are the 0.25, 0.5 and 0.75 quantiles

• Taking all quantiles at regular intervals (e.g. every 0.01) approximately describes the distribution
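A sketch of computing quantiles of the age attribute with numpy (pandas' quantile method is equivalent):

    import numpy as np

    ages = adult["age"].to_numpy()

    print(np.quantile(ages, 0.5))                # median (0.5-quantile)
    print(np.quantile(ages, [0.25, 0.5, 0.75]))  # quartiles
    # taking a quantile every 0.01 gives a compact summary of the distribution
    print(np.quantile(ages, np.linspace(0, 1, 101)))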


Page 13: CS910: Foundations of Data Analytics

PDF and CDF of age attribute in adult.data

[Figures: (Empirical) PDF of age; (Empirical) CDF of age]

Page 14: CS910: Foundations of Data Analytics

Skewness in distributions

• Symmetric unimodal distribution:
  – Mean = median = mode


[Figures: positively skewed and negatively skewed distributions]

Age: mean 38.6, median 37, mode 36

Page 15: CS910: Foundations of Data Analytics

Statistical Distributions in Data

• Many familiar distributions model observed data
• Normal distribution: characterized by mean μ and variance σ²
  – From μ–σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
  – From μ–2σ to μ+2σ: contains about 95% of data
      95% of data lies within 1.96σ of the mean
  – From μ–3σ to μ+3σ: contains about 99.7% of data


Page 16: CS910: Foundations of Data Analytics

Power Law Distribution

• Power law distribution: aka long tail, Pareto, Zipfian
  – PDF: Pr[X = x] = c x^(–α)
  – CCDF: Pr[X > x] = c′ x^(1–α)
  – E[X], Var[X] = ∞ if α < 2

• Arise in many cases:
  – Number of people living in cities
  – Popularity of products from retailers
  – Frequency of word use in written text
  – Wealth distribution (99% vs 1%)
  – Video popularity

• Data may also fit a log-normal distribution or a truncated power-law


Page 17: CS910: Foundations of Data Analytics

Exponential / Geometric Distribution


• Suppose that an event happens with probability p (for small p)
  – Independently at each time step

• How long before an event is observed?
  – Geometric distribution: Pr[X = x] = (1–p)^(x–1) p, x > 0
      CCDF: Pr[X > x] = (1–p)^x
      E[X] = 1/p, Var[X] = (1–p)/p²

• Continuous case: exponential distribution
  – Pr[X = x] = λ exp(–λx) for parameter λ, x ≥ 0
  – Pr[X > x] = exp(–λx)
  – E[X] = 1/λ, Var[X] = 1/λ²

• Both capture “waiting time” between events in Poisson processes
  – Memoryless distributions: Pr[X > x + y | X > x] = Pr[X > y]
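A small simulation sketch (p = 0.1 chosen arbitrarily for the example) illustrating the geometric waiting time and its memorylessness:

    import numpy as np

    rng = np.random.default_rng(0)
    p = 0.1
    waits = rng.geometric(p, size=100_000)   # time steps until the first event

    print(waits.mean())   # close to E[X] = 1/p = 10
    print(waits.var())    # close to (1-p)/p^2 = 90

    # Memorylessness: Pr[X > x+y | X > x] should be close to Pr[X > y]
    x, y = 5, 3
    print((waits > x + y).sum() / (waits > x).sum())  # estimate of Pr[X > x+y | X > x]
    print((waits > y).mean())                         # estimate of Pr[X > y] = (1-p)^y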

Page 18: CS910: Foundations of Data Analytics

Looking at the data

• Simple statistics (mean and variance) roughly describe the data
  – A Normal distribution is completely characterized by μ and σ
  – But not every data distribution is (approximately) Normal!
  – A few large “outlier” values can change the mean by a lot

• Looking at a few quantiles helps, but doesn’t tell the whole story
  – Two distributions can have the same quartiles but look quite different!


Page 19: CS910: Foundations of Data Analytics

Standard plots of data

• There are several basic types of plots of data:


[Figures: scatter plot of Age versus Years of Education (adult data); histogram (bar chart) of Female/Male counts; PDF plot; CDF plot]

Page 20: CS910: Foundations of Data Analytics

Correlation between attributes

• We often want to determine how two attributes behave together
• Scatter plots indicate whether they are correlated:


[Figures: example scatter plots showing negative correlation and positive correlation]

Page 21: CS910: Foundations of Data Analytics

Uncorrelated Data


[Figure: scatter plot of Age versus Years of Education (adult data), showing uncorrelated attributes]

Page 22: CS910: Foundations of Data Analytics

Quantile-quantile plot

• Compare the quantile distribution of one attribute to that of another
  – Allows comparison of attributes of different cardinality
  – Plot points corresponding to the f-quantile of each attribute
  – If the points lie close to the line y=x, this is evidence for the same distribution

• Example: a q-q plot for sales in two branches


Page 23: CS910: Foundations of Data Analytics

Quantile-quantile plot

• Compare years of education in adult.data to adult.test
  – adult.data: 32K examples, adult.test: 16K examples
  – Computed the percentiles of each data set
  – Plot corresponding pairs of percentiles
      e.g. using the =percentile() function in a spreadsheet


[Figure: q-q plot of years of education, adult.data (x-axis) versus adult.test (y-axis)]

Page 24: CS910: Foundations of Data Analytics

Measures of Correlation

• Want to measure how correlated two numeric variables are
  – Covariance of X and Y, Cov(X,Y) = E[(X – E[X])(Y – E[Y])]
      = E[XY – YE[X] – XE[Y] + E[X]E[Y]]
      = E[XY] – E[Y]E[X] – E[X]E[Y] + E[X]E[Y]
      = E[XY] – E[X]E[Y]
  – Notice: Cov(X, X) = E[X²] – E[X]² = Var(X)
  – If X and Y are independent, then E[XY] = E[X]E[Y]: covariance is 0
  – But if covariance is 0, X and Y can still be related (dependent)

• Consider X distributed as in the table below, and Y = X²
  – Then E[X] = 0 and E[XY] = E[X³] = 0.25·(–8 – 1 + 1 + 8) = 0: covariance is 0


x         -2     -1      1      2
Pr[X=x]   0.25   0.25   0.25   0.25
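A quick check of this example in Python (a sketch, using the four-point distribution in the table):

    xs = [-2, -1, 1, 2]
    probs = [0.25, 0.25, 0.25, 0.25]

    e_x = sum(p * x for p, x in zip(probs, xs))        # E[X] = 0
    e_y = sum(p * x**2 for p, x in zip(probs, xs))     # E[Y] = E[X^2] = 2.5
    e_xy = sum(p * x**3 for p, x in zip(probs, xs))    # E[XY] = E[X^3] = 0

    print(e_xy - e_x * e_y)   # covariance is 0, yet Y = X^2 depends entirely on X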

Page 25: CS910: Foundations of Data Analytics

Example (from past exam)

• E[X] = (8 + 13 + 8 + 8 + 9 + 8 + 9 + 4 + 8 + 5 + 6 + 10)/12 = 8
• Var[X] = (0+25+0+0+1+0+1+16+0+9+4+4)/12 = 5
• E[YZ] = (56+180+49+42+96+24+72+9+42+18+20+160)/12 = 64
• E[Z] = (8 + 10 + 7 + 7 + 8 + 6 + 8 + 3 + 6 + 6 + 5 + 10)/12 = 7
• E[Y] = (7 + 18 + 7 + 6 + 12 + 4 + 9 + 3 + 7 + 3 + 4 + 16)/12 = 8
• Cov(Y,Z) = E[YZ] – E[Y]E[Z] = 64 – (7 * 8) = 8


Page 26: CS910: Foundations of Data Analytics

Measuring Correlation

• Covariance depends on the magnitude of values of variables
  – Cov(aX, bY) = ab Cov(X,Y), proved by linearity of expectation
  – Normalize to get a measure of the amount of correlation

• Pearson product-moment correlation coefficient (PMCC)
  – PMCC(X,Y) = Cov(X,Y)/(σ(X) σ(Y))
      = (E[XY] – E[X]E[Y]) / (√(E[X²] – E[X]²) √(E[Y²] – E[Y]²))
      = (n Σi xi yi – (Σi xi)(Σi yi)) / (√(n Σi xi² – (Σi xi)²) √(n Σi yi² – (Σi yi)²))
  – Measures linear dependence in terms of simple quantities:
      n, the number of examples (assumed to be reasonably large)
      Σi xi yi, the sum of products
      Σi xi and Σi yi, the sums of values
      Σi xi² and Σi yi², the sums of squares
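A sketch of computing the PMCC from exactly these running sums, given paired lists of values (the test values are made up for illustration):

    from math import sqrt

    def pmcc(xs, ys):
        """Pearson correlation from n, sum of products, sums of values and squares."""
        n = len(xs)
        sx, sy = sum(xs), sum(ys)
        sxy = sum(x * y for x, y in zip(xs, ys))
        sxx = sum(x * x for x in xs)
        syy = sum(y * y for y in ys)
        return (n * sxy - sx * sy) / (sqrt(n * sxx - sx * sx) * sqrt(n * syy - sy * sy))

    # A perfect linear relationship gives +1
    print(pmcc([1, 2, 3, 4], [3, 5, 7, 9]))   # 1.0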


Page 27: CS910: Foundations of Data Analytics

Interpreting PMCC

• In what range does PMCC fall?
  – Assume E[X] = 0, E[Y] = 0
      (as we can “shift” a distribution without changing covariance)
  – Set X′ = X/σ(X), and Y′ = Y/σ(Y)
  – Var(X′) = Var(X)/σ²(X) = 1, E[X′] = 0, Var(Y′) = 1, E[Y′] = 0
  – E[(X′ – Y′)²] ≥ 0, since (X′ – Y′)² ≥ 0
      E[X′² + Y′² – 2X′Y′] ≥ 0
      2E[X′Y′] ≤ E[X′²] + E[Y′²] = 2, so E[X′Y′] ≤ 1
  – Rescaling, E[XY] = σ(X) σ(Y) E[X′Y′] ≤ σ(X) σ(Y)
  – Similarly, –σ(X) σ(Y) ≤ E[XY]
  – Hence, –1 ≤ PMCC(X,Y) ≤ +1


Page 28: CS910: Foundations of Data Analytics

Interpreting PMCC

• Suppose X = Y [perfectly linearly correlated]
  – PMCC(X,Y) = (E[XY] – E[X]E[Y])/(σ(X) σ(Y))
      = (E[X²] – E[X]²)/σ²(X)
      = Var(X)/Var(X) = 1

• Suppose X = –aY + b, for a > 0 [perfectly negatively linearly correlated]
  – PMCC(X,Y) = (E[–aY·Y + bY] – E[–aY + b]E[Y])/(σ(–aY + b) σ(Y))
      = (–aE[Y²] + bE[Y] – bE[Y] + aE[Y]²)/(a σ²(Y))
      = –aVar(Y)/(aVar(Y)) = –1


Page 29: CS910: Foundations of Data Analytics

Example of PMCC

What kind of correlation is there between Y and Z?
• Cov(Y,Z) = 8 [from previous slide]
• Var[Y] = 22.5, Var[Z] = 3.666 [check at your leisure]
• PMCC(Y,Z) = 8/√(22.5 * 3.666) = 0.88…
• Correlation between Y and Z is strong, positive, linear
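These numbers can be checked with a few lines of Python, using the Y and Z values listed on the previous slide:

    from math import sqrt

    Y = [7, 18, 7, 6, 12, 4, 9, 3, 7, 3, 4, 16]
    Z = [8, 10, 7, 7, 8, 6, 8, 3, 6, 6, 5, 10]
    n = len(Y)

    e_y, e_z = sum(Y) / n, sum(Z) / n               # 8, 7
    e_yz = sum(y * z for y, z in zip(Y, Z)) / n     # 64
    cov_yz = e_yz - e_y * e_z                       # 8

    var_y = sum(y * y for y in Y) / n - e_y ** 2    # 22.5
    var_z = sum(z * z for z in Z) / n - e_z ** 2    # about 3.667
    print(cov_yz / sqrt(var_y * var_z))             # about 0.88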


Page 30: CS910: Foundations of Data Analytics

Correlation for Ordered Data

• What about when data is ordered, but not numeric?
  – Can look at how the ranks of the variables are correlated

• Example: (Grade in Mathematics, Grade in Statistics):
  – Data: (A, A), (B, C), (C, E), (F, F)
  – Convert to ranks: (1, 1), (2, 2), (3, 3), (4, 4)
  – Ranks are perfectly correlated

• Use PMCC on the ranks
  – Obtains Spearman’s Rank Correlation Coefficient (RCC)
  – For ties, define rank as mean position in sorted order
  – Also useful for identifying non-linear (but monotone) correlation
      Y = X² has RCC(X,Y) = 1 for non-negative X
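A minimal sketch of Spearman's rank correlation: replace values by their ranks (giving tied values their mean position) and apply the PMCC to the ranks. scipy.stats.spearmanr computes the same quantity directly.

    from math import sqrt

    def ranks(values):
        """1-based rank of each value, with tied values given their mean position."""
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0.0] * len(values)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
                j += 1
            for k in range(i, j + 1):
                r[order[k]] = (i + j) / 2 + 1    # mean of positions i..j, 1-based
            i = j + 1
        return r

    def pearson(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
        sx = sqrt(sum((x - mx) ** 2 for x in xs) / n)
        sy = sqrt(sum((y - my) ** 2 for y in ys) / n)
        return cov / (sx * sy)

    X = [1, 2, 3, 4, 5]
    Y = [x * x for x in X]                 # non-linear but monotone in X
    print(pearson(X, Y))                   # less than 1
    print(pearson(ranks(X), ranks(Y)))     # 1.0: rank correlation detects it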


Page 31: CS910: Foundations of Data Analytics

Testing Correlation in Categoric Data

• Test for statistical significance of correlation in categoric data
• Look at how many times pairs co-occur, compared to expectation
  – Consider attributes X with values x1 … xc, Y with values y1 … yr
  – Let oij be the observed number of pairs (xi, yj)
  – Let eij be the expected number if independent, = n Pr[X=xi] Pr[Y=yj]

• Pearson χ² (chi-squared) statistic:
  – χ²(X,Y) = Σi Σj (oij – eij)²/eij

• Compare the χ² test statistic with (r-1)(c-1) degrees of freedom
  – Suppose r = c = 10: the χ² value for 81 d.o.f. at 0.01 confidence is 113.51
  – If χ²(X,Y) > 113.51 we can conclude X and Y are (likely) correlated
      Otherwise, there is no evidence to support this conclusion


Page 32: CS910: Foundations of Data Analytics

Chi-squared example (from textbook)

              Male        Female        Total
Fiction       250 (90)     200 (360)      450
Non-fiction    50 (210)   1000 (840)     1050
Total         300         1200           1500
(expected counts in parentheses)


Are gender and preferred reading material correlated?
• Expected frequency (male, fiction) = (300 * 450)/1500 = 90
• χ² = (250-90)²/90 + (50-210)²/210 + (200-360)²/360 + (1000-840)²/840
     = 507.93
• Degrees of freedom = (2-1)(2-1) = 1
• Critical value for 1 DoF (at the 0.001 significance level) = 10.828 [lookup in table]
  – Since 507.93 > 10.828, reject that they are independent, and conclude that there is correlation
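A sketch reproducing this calculation from the observed contingency table (scipy.stats.chi2_contingency gives the same statistic plus a p-value):

    observed = [[250, 200],    # Fiction:     Male, Female
                [50, 1000]]    # Non-fiction: Male, Female

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected count if independent
            chi2 += (o - e) ** 2 / e

    print(chi2)   # about 507.93, far above the 1-d.o.f. critical value of 10.828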

Page 33: CS910: Foundations of Data Analytics

Data Similarity Measures

• Need ways to measure similarity or dissimilarity of data points
  – Within various analytic methods: clustering, classification
  – Many different ways are used in practice

• Typically, measure dissimilarity of corresponding attribute values
  – Combine all these to get overall dissimilarity / distance
  – Distance = 0: identical
  – Increasing distance: less alike

• Typically look for distances that obey the metric rules:
  – d(x, x) = 0: Identity rule
  – d(x, y) = d(y, x): Symmetric rule
  – d(x, y) + d(y, z) ≥ d(x, z): Triangle inequality


Page 34: CS910: Foundations of Data Analytics

Categoric data

• Suppose data has all categoric attributes
• Measure dissimilarity by number of differences
  – Sometimes called “Hamming distance”
  – Example:
      (Private, Bachelors, England, Male)
      (Private, Masters, England, Female)
      2 differences, so distance is 2
  – Can encode into binary vectors (useful for later):
      High-school = 100, Bachelors = 010, Masters = 001

• May build your own custom score functions
  – Example:
      d(Bachelors, Masters) = 0.5, d(Masters, High-school) = 1, d(Bachelors, High-school) = 0.5
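A minimal sketch of Hamming distance over categoric records:

    def hamming(record_a, record_b):
        """Number of attributes on which two records disagree."""
        return sum(1 for a, b in zip(record_a, record_b) if a != b)

    r1 = ("Private", "Bachelors", "England", "Male")
    r2 = ("Private", "Masters", "England", "Female")
    print(hamming(r1, r2))   # 2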


Page 35: CS910: Foundations of Data Analytics

Binary data

• Suppose data has all binary attributes
  – Could count number of differences again (Hamming distance)

• Sometimes, “True” is more significant than “False”
  – E.g. presence of a medical symptom
  – Then only consider cases where at least one of the two values is True
      Called “Jaccard distance”
  – Measure fraction of disagreeing cases (0…1)
  – Example:
      (Cough=T, Fever=T, Tired=F, Nausea=F)
      (Cough=T, Fever=F, Tired=T, Nausea=F)
      Fraction of disagreeing cases = 2/3
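A sketch of the Jaccard distance matching the example above:

    def jaccard_distance(a, b):
        """Fraction of disagreements among positions where at least one value is True."""
        either_true = [(x, y) for x, y in zip(a, b) if x or y]
        disagree = sum(1 for x, y in either_true if x != y)
        return disagree / len(either_true)

    p1 = (True, True, False, False)    # Cough, Fever, Tired, Nausea
    p2 = (True, False, True, False)
    print(jaccard_distance(p1, p2))    # 2/3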


Page 36: CS910: Foundations of Data Analytics

Numeric Data

• Suppose all data has numeric attributes
  – Can interpret data points as coordinates

• Measure distance between points with appropriate distances
  – Euclidean distance (L2): d(x,y) = ‖x–y‖2 = √(Σi (xi – yi)²)
  – Manhattan distance (L1): d(x,y) = ‖x–y‖1 = Σi |xi – yi| [absolute values]
  – Maximum distance (L∞): d(x,y) = ‖x–y‖∞ = maxi |xi – yi|
  – [Examples of Minkowski distances (Lp): ‖x–y‖p = (Σi |xi – yi|^p)^(1/p)]

• If ranges of values are vastly different, may normalize first
  – Range scaling: rescale so all values lie in the range [0…1]
  – Statistical scaling: subtract mean, divide by standard deviation
      Obtain the z-score, (x – μ)/σ
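A sketch of these distances and of z-score scaling using numpy (the example vectors are made up):

    import numpy as np

    x = np.array([1.0, 5.0, 3.0])
    y = np.array([4.0, 1.0, 3.0])

    print(np.linalg.norm(x - y, ord=2))       # Euclidean (L2): sqrt(9 + 16) = 5
    print(np.linalg.norm(x - y, ord=1))       # Manhattan (L1): 3 + 4 = 7
    print(np.linalg.norm(x - y, ord=np.inf))  # Maximum (L-infinity): 4

    # Statistical scaling (z-score) of an attribute before computing distances
    ages = np.array([25.0, 38.0, 52.0, 61.0])
    z = (ages - ages.mean()) / ages.std()     # subtract mean, divide by std dev
    print(z)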


Page 37: CS910: Foundations of Data Analytics

Ordered Data

• Suppose all data is ordered data
  – Can replace each point with its position in the ordering
  – Example: tall = 1, grande = 2, venti = 3

• Measure L1 distance of this encoding of the data
  – d(tall, venti) = 2

• May also normalize so distances are in range [0…1]
  – tall = 0, grande = 0.5, venti = 1


Page 38: CS910: Foundations of Data Analytics

Mixed data

• But most data is a mixture of different types!
• Encode each dimension into the range [0…1], use Lp distance
  – Following previous techniques
  – (Age: 36, coffee: tall, education: bachelors, sex: male)
      (0.36, 0, 010, 0)
    (Age: 46, coffee: grande, education: masters, sex: female)
      (0.46, 0.5, 001, 1)
  – L1 distance: 0.1 + 0.5 + 2 + 1 = 3.6
  – L2 distance: √(0.01 + 0.25 + 4 + 1) = √(5.26) = 2.29
  – May reweight some coordinates to make more uniform
      E.g. weight education by 0.5


Page 39: CS910: Foundations of Data Analytics

Cosine Similarity

• For large vector objects, cosine similarity is often used
  – E.g. in measuring similarity of documents
  – Each coordinate indicates how often a word occurs in the document
      “to be or not to be” : [to: 2, be: 2, or: 1, not: 1, artichoke: 0…]

• Similarity between two vectors is given by (x · y)/(‖x‖2 ‖y‖2)
  – (x · y) = Σi (xi * yi) : dot product of x and y
  – ‖x‖2 = ‖x – 0‖2 = √(Σi xi²) : Euclidean norm of x

• Example: “to be or not to be”, “do be do be do”
      Cosine similarity: 4/(√10 √13) = 0.35
• Example: “to be or not to be”, “to not be or to be”
      Cosine similarity: 10/(√10 √10) = 1.0
• Example: “to be or not to be”, “that is the question”
      Cosine similarity: 0/(√10 √4) = 0.0
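A sketch of cosine similarity on word-count vectors, reproducing the examples above (tokenisation is just a whitespace split):

    from collections import Counter
    from math import sqrt

    def cosine(doc_a, doc_b):
        a, b = Counter(doc_a.split()), Counter(doc_b.split())
        dot = sum(a[w] * b[w] for w in a)            # words absent from b count as 0
        norm_a = sqrt(sum(c * c for c in a.values()))
        norm_b = sqrt(sum(c * c for c in b.values()))
        return dot / (norm_a * norm_b)

    print(cosine("to be or not to be", "do be do be do"))        # about 0.35
    print(cosine("to be or not to be", "to not be or to be"))    # 1.0
    print(cosine("to be or not to be", "that is the question"))  # 0.0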

Page 40: CS910: Foundations of Data Analytics

(Text) Edit Distance

• Edit Distance: count the number of inserts, deletes and changes of characters to turn string A into string B
  – Seems like there could be many ways to do this

• Computer science: use dynamic programming
  – Let Ai be the first i characters of string A, A[i] the i’th character of A
  – Use a recurrence formula:
      d(Ai, “”) = i and d(“”, Bj) = j
      d(Ai, Bj) = min{ d(Ai-1, Bj-1) + 1[A[i] ≠ B[j]],
                       d(Ai-1, Bj) + 1,
                       d(Ai, Bj-1) + 1 }
      where 1[A[i] ≠ B[j]] is 0 if the i’th character of A matches the j’th character of B, else 1
  – Can compute d(A,B) by filling a grid
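A sketch of filling that grid, following the recurrence above:

    def edit_distance(a, b):
        # d[i][j] = edit distance between the first i chars of a and first j chars of b
        d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            d[i][0] = i                    # delete all i characters
        for j in range(len(b) + 1):
            d[0][j] = j                    # insert all j characters
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                change = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j - 1] + change,   # change (or match)
                              d[i - 1][j] + 1,            # delete from a
                              d[i][j - 1] + 1)            # insert into a
        return d[len(a)][len(b)]

    print(edit_distance("kitten", "sitting"))   # 3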

Page 41: CS910: Foundations of Data Analytics

Data Preprocessing

• Often we need to preprocess the data before analysis
  – Data cleaning: remove noise, correct inconsistencies
  – Data integration: merge data from multiple sources
  – Data reduction: reduce data size for ease of processing
  – Data transformation: convert to different scales (e.g. normalization)


Page 42: CS910: Foundations of Data Analytics

Data Cleaning – Missing Values

• Missing values are common in data
  – Many instances of ? in adult.data

• Values can be missing for many reasons
  – Data was lost / transcribed incorrectly
  – Data could not be measured
  – Did not apply in context: “not applicable” (e.g. national ID number)
  – User chose not to reveal some private information


Page 43: CS910: Foundations of Data Analytics

Handling Missing Values

• Drop the whole record
  – OK if a small fraction of records have missing values
      2400 rows in adult.data have a ?, out of 32K: 7.5%
  – Not ideal if missing values correlate with other features

• Fill in missing values manually
  – Based on human expertise
  – Would you want to look through 2400 examples?

• Accept missing values as “unknown”
  – Need to ensure that future processing can handle “unknown”
  – May lead to false patterns being found: a cluster of “unknown”s


Page 44: CS910: Foundations of Data Analytics

Handling Missing Values

• Fill in some plausible value based on the data
  – E.g. for a missing temperature, fill in the mean temperature
  – E.g. for a missing education level, fill in the most common one
  – May be the wrong thing to do if the missing value means something

• Use the rest of the data to infer the missing value
  – Find a value that looks like the best fit given other values in the record
      E.g., mean value of those that match on another attribute
  – Build a classifier to predict the missing value
  – Use regression to extrapolate the missing value

• No ideal solution – all methods introduce some bias…
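A sketch of the simple "plausible value" fills in pandas (column names as assumed in the earlier loading sketch), with all the caveats about bias noted above:

    # Numeric attribute: fill missing values with the mean
    adult["age"] = adult["age"].fillna(adult["age"].mean())

    # Categoric attribute: fill missing values with the most common value (mode)
    most_common = adult["workclass"].mode()[0]
    adult["workclass"] = adult["workclass"].fillna(most_common)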


Page 45: CS910: Foundations of Data Analytics

Noise in Data

• “Noise” is values due to error or variance in measurement
  – Random noise in measurements e.g. in temperature, time
  – Interference or misconfiguration of devices
  – Coding / translation errors in software
  – Misleading values from data subjects
      E.g. Date of birth = 1/1/1970

• Noise can be difficult to detect at the record level
  – If salary is 64,000 instead of 46,000, how can you tell?
  – Statistical tests may help identify if there is much noise in data
      Do many people in the data make more salary than the national average?
      Benford’s law: the distribution of first digits is skewed, Pr[d] ≈ log10(1 + 1/d)


Page 46: CS910: Foundations of Data Analytics

Outliers

• Outliers are extreme values in data that often represent noise
  – E.g. salary = -10,000 [hard constraint: no salary is negative]
  – E.g. room temperature = 100C [constraint: should be below 40C?]
  – Is salary = $1M an outlier? What about salary = $1?

• Finding outliers in numeric data:
  – Sanity check: is the mean, std dev, or max much higher than expected?
  – Visual: plot the data, are there spikes or values far from the rest?
  – Rule-based: set limits, declare an outlier if outside bounds
  – Data-based: declare an outlier if > 6 standard deviations from mean


Page 47: CS910: Foundations of Data Analytics

Outliers

• Finding outliers in categoric data:
  – Visual: look at frequency statistics / histogram
  – Are there values with low frequency representing typos/errors?
      E.g. Mela instead of Male
      Values other than True or False in binary data

• Dealing with outliers
  – Delete outliers: remove records with outlier values from data set
  – Clip outliers: change value to maximum/minimum permitted
  – Treat as missing value: replace with more typical / plausible value

CS910 Foundations of Data Analytics47

Page 48: CS910: Foundations of Data Analytics

Outliers

GENDER   Frequency
------------------
2                1
F               12
M               13
X                1
f                2


Page 49: CS910: Foundations of Data Analytics

Consistency Rules

• Data may be inconsistent in a number of ways
  – US vs European date styles: 10/17 vs 17/10
  – Temperature in Fahrenheit instead of Centigrade
  – Same concept represented in multiple ways: tall, TALL, small?
  – Typos: age = 337
  – Functional dependencies violated: Annual salary = 52K, Weekly salary = 1500
  – Address does not exist

• Apply rules and use tools
  – Spell correction / address standardization tools
  – (Manually) find “consistent inconsistencies” and fix with a script
  – Look for a “minimal repair”: the smallest change that makes the data consistent

Principle of Parsimony / Occam’s Razor


Page 50: CS910: Foundations of Data Analytics

Data Integration

• Want to combine data from two sources
  – May be inconsistent: different units/formats
  – May structurally differ: address vs (street number, road name)
  – May be different names for the same entity: B. Obama vs Pres. Obama

• Challenging problem faced by many organizations
  – E.g. two companies merge and need to combine databases
  – Identify corresponding attributes by correlation analysis
  – Define rules to translate between formats
  – Try to identify matching entities within data via similarity/distance


Page 51: CS910: Foundations of Data Analytics

Data Transformation

• Sometimes we need to transform the data before it can be used
  – Some methods want a small number of categoric values
  – Sometimes methods expect all values to be numeric
  – Sometimes we need to reweight/rescale data so all features count equally

• Have seen several data transformations already:
  – Represent ordered data by numbers/ranks
  – Normalize numeric values by range scaling, or by subtracting the mean and dividing by the standard deviation

• Another important transformation: discretization
  – Turn a fine-grained attribute into a coarser one


Page 52: CS910: Foundations of Data Analytics

Discretization

• Features may have too many values to be informative
  – E.g. everyone has a different salary
  – E.g. every location is at a different GPS location

• Can coarsen data to create fewer, well-supported groups
  – Binning: place numeric/ordered data into bands, e.g. salaries, ages
      Based on domain knowledge: Ages 0-18, 18-24, 25-34, 35-50…
      Based on data distribution: partition by quantiles
  – Use existing hierarchies in data
      Time in second/minute/hour/day/week/month/year…
      Geography in postcode (CV4 7AL), town/region (CV4), county…
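A sketch of both styles of binning with pandas; the fixed bin edges below follow the age bands quoted above, and qcut does the quantile-based version:

    import pandas as pd

    # Binning by domain knowledge: fixed age bands
    bands = pd.cut(adult["age"], bins=[0, 18, 24, 34, 50, 120],
                   labels=["0-18", "19-24", "25-34", "35-50", "50+"])

    # Binning by data distribution: four equal-frequency bins (quartiles)
    quartile_bins = pd.qcut(adult["age"], q=4)

    print(bands.value_counts())
    print(quartile_bins.value_counts())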


Page 53: CS910: Foundations of Data Analytics

Data Reduction

• Sometimes data is too large to conveniently work with
  – Too high dimensional: too many attributes to make sense of
  – Too numerous: too many examples to process

• Complex analytics may take a long time for large data
  – Painful when trying many different approaches/parameters
  – Can we draw almost the same conclusions with smaller data?


Page 54: CS910: Foundations of Data Analytics

Random Sampling for number reduction

• A solid principle from statistics: a random sample is often enough
  – Pick a subset of rows from data (sampling without replacement)
  – Must be chosen randomly
      Suppose the data is sorted by some attribute (salary): conclusions drawn from the first rows would be biased

• A standard trick: add an extra random field to the data, and sort on it

• Sampling can miss out small subpopulations
  – Stratified sampling: take samples from subsets
  – E.g. sample from each occupation in adult.data
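A sketch of simple, "random field", and stratified sampling in pandas (column names as assumed earlier):

    import numpy as np

    # Simple random sample of 1000 rows (without replacement by default)
    sample = adult.sample(n=1000, random_state=0)

    # The "extra random field" trick: sort on a random column and take the top rows
    shuffled = adult.assign(_rand=np.random.default_rng(0).random(len(adult)))
    sample2 = shuffled.sort_values("_rand").head(1000)

    # Stratified sampling: take 5% of the rows from each occupation group
    stratified = adult.groupby("occupation", group_keys=False).apply(
        lambda g: g.sample(frac=0.05, random_state=0))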


Page 55: CS910: Foundations of Data Analytics

Feature selection (dimensionality reduction)

Feature selection picks a subset of attributes

• Transformations of numeric data
  – Numeric data: all n features are numeric, so data is a set of vectors
  – Fourier (or other) transform and pick important coefficients
  – May make it hard to interpret the new attributes


[Figure: (Discrete) Fourier transform]

Page 56: CS910: Foundations of Data Analytics

Feature selection

• Principal Components Analysis (also on numeric data)
  – Pick k < n orthogonal vectors to project the data along
  – The projection of data x on direction y is the dot product (x · y)
  – Gives a new set of coordinate axes for the data
  – The PCA procedure finds the “best” set of directions
  – Still challenging to interpret the new axes


Page 57: CS910: Foundations of Data Analytics

Feature selection

• Greedy attribute selection
  – First pick the attribute that contains the most “information”
  – Add more attributes that maximize information (forward search)
  – Based on statistical tests or information theory
  – Or, start with all attributes and start removing (backward search)
  – May not reach the same set of attributes!


Input set of attributes: {A1, A2, A3, A4, A5, A6}
Forward search:  {} → {A1} → {A1, A6} → {A1, A4, A6}
Backward search: {A1, A2, A3, A4, A5, A6} → {A1, A3, A4, A5, A6} → {A1, A3, A4, A6} → {A1, A3, A6}

Page 58: CS910: Foundations of Data Analytics

Implementing feature selection

• The previous methods are implemented in some tools for analysis
  – Packages in R
  – Methods in Weka

• Simple techniques for feature selection:
  – High correlation: is a variable very highly correlated with another?
      If so, it can likely be discarded
  – Try your analysis with all combinations of features for small inputs
      Exponentially expensive, so may defeat the point!


Page 59: CS910: Foundations of Data Analytics

Data Reduction by Aggregation

• Aggregate individual records together to get weighted data
  – E.g. look at total trading volume per minute from raw trades

• Aggregate and coarsen within feature hierarchies
  – E.g. look at (average) share price per hour, not per minute

• Look at a meaningful subset of the data
  – E.g. process only 1 month’s data instead of 1 year


Example concept hierarchies (coarse to fine):
  Industry → Category → Product
  Region → Country → City → Office
  Year → Quarter → Month → Week → Day

Page 60: CS910: Foundations of Data Analytics

Summary of Data Basics

• Data can be broken down into records with attributes
• Statistical tools describe the distribution of attributes
• Powerful tools for finding correlation, similarity/distance
• Many problems with data, no perfect solutions

Recommended Reading:
• Chapter 2 (Getting to know your data) and Chapter 3 (Data preprocessing) in "Data Mining: Concepts and Techniques" (Han, Kamber, Pei).
