50
1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary software 4) Lecture over Chapter 2

1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

  • View
    216

  • Download
    0

Embed Size (px)

Citation preview

Page 2: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

2

BUS 297D: Data Mining

Professor David Mease

Course web page and greensheet:

http://www.cob.sjsu.edu/mease_d/bus297D/

This page is linked from my SJSU web page

It is also linked from my personal page

www.davemease.com

which is easily found by querying “David Mease” or simply “Mease” on any search engine

Page 3: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

3

Introduction to Data Mining

byTan, Steinbach, Kumar

Chapter 1: Introduction

Page 4: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

4

What is Data Mining?

Data mining is the process of automatically discovering useful information in large data repositories. (page 2)

There are many other definitions

Page 5: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

5

In class exercise #1:Find a different definition of data mining online. How does it compare to the one in the text on the previous slide?

Page 6: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

6

Data Mining Examples and Non-Examples

Data Mining:-Certain names are more prevalent in certain US locations (O’Brien, O’Rurke, O’Reilly… in Boston area)

-Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com, etc.)

NOT Data Mining:

-Look up phone number in phone directory

-Query a Web search engine for information about “Amazon”

Page 7: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

7

Why Mine Data? Scientific Viewpoint

Data collected and stored at enormous speeds (GB/hour)

–remote sensors on a satellite

–telescopes scanning the skies

–microarrays generating gene expression data

–scientific simulations generating terabytes of data

Traditional techniques infeasible for raw dataData mining may help scientists

–in classifying and segmenting data–in hypothesis formation

Page 8: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

8

Why Mine Data? Commercial Viewpoint

Lots of data is being collected and warehoused

–Web data, e-commerce–Purchases at department/grocery stores–Bank/credit card transactions

Computers have become cheaper and more powerful

Competitive pressure is strong –Provide better, customized services for an edge

Page 9: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

9

In class exercise #2:Give an example of something you did yesterday or today which resulted in data which could potentially be mined to discover useful information.

Page 10: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

10

Origins of Data Mining (page 6)Draws ideas from machine learning, AI, pattern recognition and statisticsTraditional techniquesmay be unsuitable due to

–Enormity of data

–High dimensionality of data

–Heterogeneous, distributed nature of data

Statistics

Data Mining

AI/Machine Learning/Pattern

Recognition

Page 11: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

11

2 Types of Data Mining Tasks (page 7)

Prediction Methods:

Use some variables to predict unknown or future values of other variables.

Description Methods:

Find human-interpretable patterns that describe the data.

Page 12: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

12

Examples of Data Mining Tasks

Classification [Predictive] (Chapters 4,5)Regression [Predictive] (covered in stats classes)

Visualization [Descriptive] (in Chapter 3) Association Analysis [Descriptive] (Chapter 6)Clustering [Descriptive] (Chapter 8)Anomaly Detection [Descriptive] (Chapter 10)

Page 13: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

13

Software We Will Use:

You should make sure you have access to the following two software packages for this course

Microsoft Excel

R

–Can be downloaded from

http://cran.r-project.org/ for Windows, Mac or Linux

Page 14: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

14

Downloading R for Windows:

Page 15: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

15

Downloading R for Windows:

Page 16: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

16

Downloading R for Windows:

Page 17: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

17

Introduction to Data Mining

byTan, Steinbach, Kumar

Chapter 2: Data

Page 18: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

18

What is Data?

An attribute is a property or

characteristic of an object

Examples: eye color of a

person, temperature, etc.

Attribute is also known as variable,

field, characteristic, or feature

A collection of attributes describe an object

Object is also known as record, point, case, sample, entity, instance, or observation

Tid Refund Marital Status

Taxable Income Cheat

1 Yes Single 125K No

2 No Married 100K No

3 No Single 70K No

4 Yes Married 120K No

5 No Divorced 95K Yes

6 No Married 60K No

7 Yes Divorced 220K No

8 No Single 85K Yes

9 No Married 75K No

10 No Single 90K Yes 10

Attributes

Objects

Page 19: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

19

Reading Data into Excel

Download it from the web at

http://www.cob.sjsu.edu/mease_d/stats202log.txt

Then right click on it and select “Open With” then “Choose Program”

Page 20: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

20

Reading Data into Excel

Choose Excel

Oops! Everything is in the first column!

Solution: click onthe first columnthen select“Data” then“Text to Columns”

Page 21: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

21

Reading Data into Excel

Choose “Delimited” and hit “Next”

Page 22: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

22

Reading Data into Excel

Check only “Space” then “Next” again

Page 23: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

23

Reading Data into Excel

Q: Why did this work? Why don’t all spaces cause column splits?

Page 24: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

24

Reading Data into Excel

Q: Why did this work? Why don’t all spaces cause column splits?

A: The file is escaped using quotes.

(read http://en.wikipedia.org/wiki/Delimiter for more information)

Page 25: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

25

Reading Data into R

Download it from the web at

http://www.cob.sjsu.edu/mease_d/stats202log.txt

Set your working directory:

setwd("C:/Documents and Settings/David/Desktop")

Read it in:

data<-read.csv("stats202log.txt",sep=" ",header=F)

Page 26: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

26

Reading Data into R

Look at the first 5 rows:

data[1:5,] V1 V2 V3 V4 V5 V6 V7 V8 V9

1 69.224.117.122 - - [19/Jun/2007:00:31:46 -0400] GET / HTTP/1.1 200 2867 www.davemease.com2 69.224.117.122 - - [19/Jun/2007:00:31:46 -0400] GET /mease.jpg HTTP/1.1 200 4583 www.davemease.com3 69.224.117.122 - - [19/Jun/2007:00:31:46 -0400] GET /favicon.ico HTTP/1.1 404 2295 www.davemease.com4 128.12.159.164 - - [19/Jun/2007:02:50:41 -0400] GET / HTTP/1.1 200 2867 www.davemease.com5 128.12.159.164 - - [19/Jun/2007:02:50:42 -0400] GET /mease.jpg HTTP/1.1 200 4583 www.davemease.com

V10 V11 V121 http://search.msn.com/results.aspx?q=mease&first=21&FORM=PERE2 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322) -2 http://www.davemease.com/ Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322) -3 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322) -4 - Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4 -5 http://www.davemease.com/ Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4 -

Page 27: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

27

Reading Data into R

Look at the first column:

data[,1] [1] 69.224.117.122 69.224.117.122 69.224.117.122 128.12.159.164 128.12.159.164 128.12.159.164 128.12.159.164 128.12.159.164 128.12.159.164 128.12.159.164

………

[1901] 65.57.245.11 65.57.245.11 65.57.245.11 65.57.245.11 65.57.245.11 65.57.245.11 65.57.245.11 65.57.245.11 65.57.245.11 65.57.245.11 [1911] 65.57.245.11 67.164.82.184 67.164.82.184 67.164.82.184 171.66.214.36 171.66.214.36 171.66.214.36 65.57.245.11 65.57.245.11 65.57.245.11 [1921] 65.57.245.11 65.57.245.11 73 Levels: 128.12.159.131 128.12.159.164 132.79.14.16 171.64.102.169 171.64.102.98 171.66.214.36 196.209.251.3 202.160.180.150 202.160.180.57 ... 89.100.163.185

Page 28: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

28

Experimental vs. Observational Data (Important but not in book)

Experimental data describes data which was collected by someone who exercised strict control over all attributes.

Observational data describes data which was collected with no such controls. Most all data used in data mining is observational data so be careful.

Examples:

-Diet Coke vs. Weight

-Carbon Dioxide in Atmosphere vs. Earth’s Temperature

Page 29: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

29

Types of Attributes:

Qualitative vs. Quantitative (P. 26)

Qualitative (or Categorical) attributes represent distinct categories rather than numbers. Mathematical operations such as addition and subtraction do not make sense. Examples:

eye color, letter grade, IP address, zip code

Quantitative (or Numeric) attributes are numbers and can be treated as such. Examples:

weight, failures per hour, number of TVs, temperature

Page 30: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

30

Types of Attributes (P. 25):

All Qualitative (or Categorical) attributes are either Nominal or Ordinal.

Nominal = categories with no orderOrdinal = categories with a meaningful order

All Quantitative (or Numeric) attributes are either Interval or Ratio.

Interval = no “true” zero, division makes no senseRatio = true zero exists, division makes sense

Page 31: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

31

Types of Attributes:

Some examples:–Nominal

Examples: ID numbers, eye color, zip codes–Ordinal

Examples: letter grades, height in {tall, medium, short}

–IntervalExamples: calendar dates, temperatures in Celsius or Fahrenheit, GMAT score

–RatioExamples: temperature in Kelvin, length, time, counts

Page 32: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

32

Properties of Attribute Values

The type of an attribute depends on which of the following properties it possesses:

–Distinctness: = –Order: < > –Addition and subtraction: + - –Multiplication and division: * /

–Nominal attribute: distinctness–Ordinal attribute: distinctness & order–Interval attribute: distinctness, order & addition–Ratio attribute: all 4 properties

Page 33: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

33

Discrete vs. Continuous (P. 28) Discrete Attribute

–Has only a finite or countably infinite set of values–Examples: zip codes, counts, or the set of words in a collection of documents –Often represented as integer variables –Note: binary attributes are a special case of discrete attributes which have only 2 values

Continuous Attribute–Has real numbers as attribute values–Can compute as accurately as instruments allow–Examples: temperature, height, or weight –Practically, real values can only be measured and represented using a finite number of digits–Continuous attributes are typically represented

as floating-point variables

Page 34: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

34

Discrete vs. Continuous (P. 28)

Qualitative (categorical) attributes are always discrete

Quantitative (numeric) attributes can be either discrete or continuous

Page 35: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

35

In class exercise #3:Classify the following attributes as binary, discrete, or continuous. Also classify them as qualitative (nominal or ordinal) or quantitative (interval or ratio). Some cases may have more than one interpretation, so briefly indicate your reasoning if you think there may be some ambiguity.

a) Number of telephones in your houseb) Size of French Fries (Medium or Large or X-Large)c) Ownership of a cell phoned) Number of local phone calls you made in a monthe) Length of longest phone callf) Length of your footg) Price of your textbookh) Zip codei) Temperature in degrees Fahrenheitj) Temperature in degrees Celsiusk) Temperature in Kelvins

Page 36: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

36

Types of Data in R

R often distinguishes between qualitative (categorical) attributes and quantitative (numeric)

In R,

qualitative (categorical) = “factor”

quantitative (numeric) = “numeric”

Page 37: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

37

Types of Data in R For example, the IP address in the first column of http://www.cob.sjsu.edu/mease_d/stats202log.txt is a factor

> data<-read.csv("stats202log.txt",sep=" ",header=F)

> data[,1] [1] 69.224.117.122 69.224.117.122 69.224.117.122 128.12.159.164 128.12.159.164 128.12.159.164 128.12.159.164 128.12.159.164 128.12.159.164 128.12.159.164

……

[1901] 65.57.245.11 65.57.245.11 65.57.245.11 65.57.245.11 65.57.245.11 65.57.245.11 65.57.245.11 65.57.245.11 65.57.245.11 65.57.245.11 [1911] 65.57.245.11 67.164.82.184 67.164.82.184 67.164.82.184 171.66.214.36 171.66.214.36 171.66.214.36 65.57.245.11 65.57.245.11 65.57.245.11 [1921] 65.57.245.11 65.57.245.11 73 Levels: 128.12.159.131 128.12.159.164 132.79.14.16 171.64.102.169 171.64.102.98 171.66.214.36 196.209.251.3 202.160.180.150 202.160.180.57 ... 89.100.163.185

> is.factor(data[,1])[1] TRUE> data[,1]+10[1] NA NA NA NA NA NA NA NA …Warning message:+ not meaningful for factors …

Page 38: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

38

Types of Data in R

However, the 8th column looks like it should be numeric. Why is it not? How do we fix this?

> data[,8] [1] 2867 4583 2295 2867 4583 2295 1379 2294 4432 7134 2296 2297 3219968 1379 2294 4432 7134 2293 2297 2294

…[1901] 2294 4432 7134 2294 4432 7134 2294 2867 4583 2295 2294 4432 7134 2294 4432 7134 2294 2294 2294 2294 [1921] 2294 2294 Levels: - 1135151 122880 1379 1510 2290 2293 2294 2295 2296 2297 2309 238 241 246 248 250 2725487 280535 2867 3072 3219968 4432 4583 626 7134 7482

> is.factor(data[,8])[1] TRUE> is.numeric(data[,8])[1] FALSE

Page 39: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

39

Types of Data in R

A: We should have told R that “-” means missing when we read it in.

> data<-read.csv("stats202log.txt",sep=" ",header=F, na.strings = "-")

> is.factor(data[,8])[1] FALSE> is.numeric(data[,8])[1] TRUE

Page 40: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

40

Types of Data in R

Q: How would we create an attribute giving the following zip codes 94550, 00123, 43614 for three observations in R?

Page 41: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

41

Types of Data in R

Q: How would we create an attribute giving the following zip codes 94550, 00123, 43614 for three observations in R?

A: Use quotes:

> zip_codes<-as.factor(c("94550","00123","43614"))

Page 42: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

42

Types of Data in Excel

Excel is not quite as picky and allows you to mix types more

Also, you can change between a lot of different predefined formats in Excel by right clicking a column and then selecting “Format Cells” and looking under the “Number” tab

Page 43: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

43

Types of Data in Excel

Q: How would we create an attribute giving the following zip codes 94550, 00123, 43614 for three observations in Excel?

Page 44: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

44

Types of Data in Excel

Q: How would we create an attribute giving the following zip codes 94550, 00123, 43614 for three observations in Excel?

A: Right click on the column then choose “Format Cells” then under the “Number” tab select “Text”

Page 45: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

45

Working with Data in R

Creating Data:

> aa<-c(1,10,12)

> aa[1] 1 10 12

Some simple operations:

> aa+10[1] 11 20 22

> length(aa)[1] 3

Page 46: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

46

Working with Data in R

Creating More Data:

> bb<-c(2,6,79)

> my_data_set<-data.frame(attributeA=aa,attributeB=bb)

> my_data_set attributeA attributeB1 1 22 10 63 12 79

Page 47: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

47

Working with Data in R

Indexing Data:

> my_data_set[,1][1] 1 10 12

> my_data_set[1,] attributeA attributeB1 1 2

> my_data_set[3,2][1] 79

> my_data_set[1:2,] attributeA attributeB1 1 22 10 6

Page 48: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

48

Working with Data in R

Indexing Data:

> my_data_set[c(1,3),] attributeA attributeB1 1 23 12 79

Arithmetic:

> aa/bb[1] 0.5000000 1.6666667 0.1518987

Page 49: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

49

Working with Data in R

Summary Statistics:

> mean(my_data_set[,1])[1] 7.666667 > median(my_data_set[,1])[1] 10

> sqrt(var(my_data_set[,1]))[1] 5.859465

Page 50: 1 BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda: 1) Go over information on course web page 2) Lecture over Chapter 1 3) Discuss necessary

50

Working with Data in R

Writing Data:

> setwd("C:/Documents and Settings/David/Desktop")

> write.csv(my_data_set,"my_data_set_file.csv")

Help!:

> ?write.csv