Lecture 16: Data Mining

Page 1: Lecture 16

Data Mining

Page 2: Lecture 16

http://msdn.microsoft.com/en-us/library/bb725998.aspx

Microsoft Synchronization Services for ADO.NET

Microsoft Synchronization Services for ADO.NET* provides the ability to synchronize data from disparate sources over two-tier, N-tier, and service-based architectures. It is a set of DLLs that provides a composable API. The Synchronization Services API provides a set of components to synchronize data between data services and a local store.

Synchronization Services uses a hub-and-spoke model. All changes from each client are synchronized with the server before the changes are sent from the server to other clients (clients do not exchange changes directly with each other). Synchronization Services provides snapshot, download-only, upload-only, and bidirectional synchronization.

Snapshot and download-only synchronization are typically used to store and update reference data, such as a product list, on a client. Data changes that are made at the server are downloaded to the client database during synchronization. Snapshot synchronization refreshes data every time that the client is synchronized.

Download-only synchronization downloads only the incremental changes that have occurred since the previous synchronization.

Upload-only synchronization is typically used to insert data, such as a sales order, on a client.

Bidirectional synchronization is typically used for data that can be updated at the client and server. Any conflicting changes must be handled during synchronization.

*ADO: ActiveX Data Objects
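The four synchronization directions described above can be sketched with a small simulation. This is a hypothetical toy model, not the Microsoft Synchronization Services API; rows are represented as key -> (value, version) pairs, and the "server wins" conflict rule is one assumption among several the real API lets you configure.

```python
# Hypothetical sketch of the hub-and-spoke model: every client syncs
# with the server, never with another client. This is NOT the
# Microsoft Synchronization Services API, just a toy model of the
# four synchronization directions. Rows are key -> (value, version).

def snapshot(server, client):
    """Snapshot: the client is refreshed with a full copy of the server."""
    client.clear()
    client.update(server)

def download_only(server, client, since):
    """Download-only: copy only rows changed on the server after `since`."""
    for key, (val, ver) in server.items():
        if ver > since:
            client[key] = (val, ver)

def upload_only(server, inserts, ver):
    """Upload-only: push client inserts (e.g. new sales orders) up."""
    for key, val in inserts.items():
        server[key] = (val, ver)

def bidirectional(server, client, changes, ver):
    """Bidirectional: exchange both ways; here the server wins on conflict."""
    for key, val in changes.items():
        if key not in server:          # no conflict: accept the upload
            server[key] = (val, ver)
    snapshot(server, client)           # then refresh the client

server = {"product1": ("widget", 1)}
client = {}
snapshot(server, client)               # client now mirrors the server
upload_only(server, {"order1": "42 widgets"}, ver=2)
```

Note how bidirectional synchronization is just an upload followed by a refresh, with a conflict policy applied in between; the real API likewise forces conflicting changes to be resolved during synchronization.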

Page 3: Lecture 16

The client database for Synchronization Services applications is SQL Server Compact 3.5.

Synchronization Services provides an infrastructure to track incremental changes in the client database.

This infrastructure is enabled the first time any table is synchronized by using a method other than snapshot synchronization.

The server database can be any database for which an ADO.NET provider is available.

Client & Server Database

Page 4: Lecture 16

ADO.NET

ADO.NET is a set of computer software components that can be used by programmers to access data and data services.

It is a part of the base class library that is included with the Microsoft .NET Framework.

It is commonly used by programmers to access and modify data stored in relational database systems, though it can also be used to access data in non-relational sources.

Functionality exists in the Visual Studio IDE to create specialized subclasses of the DataSet classes for a particular database schema, allowing convenient access to each field through strongly-typed properties.

http://en.wikipedia.org/wiki/ADO.NET

Page 5: Lecture 16

Defining Data Mining

http://www.thearling.com/

Page 6: Lecture 16

http://www.thearling.com/

A Sample Problem

Page 7: Lecture 16

A Solution

http://www.thearling.com/

Page 8: Lecture 16

The Big Picture

http://www.thearling.com/

Page 9: Lecture 16

Tools of Modern Data Mining

The "What animal am I thinking of?" game (Decision Trees)

If it walks like a duck and talks like a duck... (Nearest Neighbor Classification)

If you've got tons of data and no clue what to do, who you gonna call? Neural Nets!

Genetic Algorithms would fit here as well.

Clustering without Numbers: Let the Data Group Itself (K-Means)

http://www.thearling.com/

Page 10: Lecture 16

Basically, data mining is the application of standard pattern-classification techniques to the detection, extraction, and interpretation of information from very large and diverse data sources.

The level to which Data Mining is described as something extraordinary and revolutionary is the level to which the person describing it doesn't understand it.

Data Analysis and Pattern Classification can be divided into three major levels of operations:

Data Mining Exposed

Signal Level - At this level we are separating the signals (data elements) of interest from the background clutter (noise). What is signal and what is noise depends on the application.

Syntactic Level - The initial task at this level (or the final task of the previous level) is the reduction in the amount of data needed to represent the information. The end result is a relatively small list of features that describe the characteristics of the objects of interest.

Semantic Level - The goal of semantic level processing is to extract knowledge (understanding) from the data collected. The relationships between the syntactical elements and the context in which they appear (situational awareness) permit us to generate an explanation of the observations.
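As a toy illustration of the three levels, a minimal pipeline might look like the sketch below. The thresholds, feature choices, and labels are all made up for the example; only the three-stage structure comes from the slide.

```python
# Toy illustration of the three processing levels described above.
# The threshold, features, and labels are invented for this example.

def signal_level(samples, threshold=0.5):
    """Signal level: separate signal from background clutter (noise)."""
    return [s for s in samples if abs(s) > threshold]

def syntactic_level(signal):
    """Syntactic level: reduce the data to a short list of features."""
    n = len(signal)
    mean = sum(signal) / n
    spread = max(signal) - min(signal)
    return {"count": n, "mean": mean, "spread": spread}

def semantic_level(features):
    """Semantic level: interpret the features to produce an explanation."""
    if features["mean"] > 1.0:
        return "strong positive trend"
    return "weak or mixed activity"

readings = [0.1, 2.3, -0.2, 1.8, 0.05, 2.9]
features = syntactic_level(signal_level(readings))
print(semantic_level(features))
```

The point is only the division of labor: what counts as "signal" is decided first, features are computed from the surviving data, and meaning is assigned only at the last stage.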

Page 11: Lecture 16

Rule Induction

Rule induction is an area of machine learning in which formal rules are extracted from a set of observations. The rules extracted may represent a full scientific model of the data, or merely represent local patterns in the data.

Some popular methods and tools related to Rule Induction:

Association rule algorithms - {onions, potatoes} -> {beef} Where's the beef?

Decision rule algorithms - Such as those based on Bayes' Rule

Hypothesis testing algorithms - The difference between coincidence and causality.

Inductive Logic Programming - Who's your Uncle?

Version spaces - Separating the Wheat from the Chaff.

http://en.wikipedia.org/wiki/Rule_induction
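As a sketch of the first item in the list, the support and confidence of an association rule such as {onions, potatoes} -> {beef} can be computed directly from a set of transactions. The market baskets below are made up for the example.

```python
# Toy support/confidence computation for the association rule
# {onions, potatoes} -> {beef}, using made-up market baskets.

transactions = [
    {"onions", "potatoes", "beef"},
    {"onions", "potatoes", "beef", "beer"},
    {"onions", "potatoes"},
    {"milk", "bread"},
]

def support(itemset, txns):
    """Fraction of transactions containing every item in the set."""
    return sum(itemset <= t for t in txns) / len(txns)

def confidence(lhs, rhs, txns):
    """Estimated P(rhs in basket | lhs in basket)."""
    return support(lhs | rhs, txns) / support(lhs, txns)

lhs, rhs = {"onions", "potatoes"}, {"beef"}
print(support(lhs | rhs, transactions))    # 2 of 4 baskets
print(confidence(lhs, rhs, transactions))  # 2 of the 3 lhs baskets
```

Association rule miners such as Apriori search for all rules whose support and confidence exceed user-chosen minimums; the two measures above are the whole scoring machinery.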

Page 12: Lecture 16

Bayes' Formula

Suppose that E is an event from a sample space S and that F1, F2, . . . , Fn are mutually exclusive events such that the union of all Fi = S, and that p(E)>0 and p(Fi)>0 for all i.

Then

\[
p(F_j \mid E) = \frac{p(E \mid F_j)\,p(F_j)}{\sum_{i=1}^{n} p(E \mid F_i)\,p(F_i)}
\]

[Figure: the sample space S partitioned into F_1, F_2, ..., F_n, with the event E cutting across the partition.]

Bayes' formula states: the probability that F_j is true, given that E has occurred, is the probability of E given F_j, times the probability of F_j, divided by the total probability of E.

Page 13: Lecture 16

The Willies

A medical journal announces the availability of a new diagnostic test. The announcement states the following:

"An incredibly accurate indicator for the presence of the Willies has recently been developed by Hokes Laboratories that will give a positive reading on an infected patient with probability 0.998 and has a false positive reading in only 2 out of 1000 patients. This modern miracle will revolutionize..."

Even though you know that only 1 in 10,000 people in the world have the disease, you have long suspected that you have the Willies. You rush to your doctor and demand to be tested using the new Hokes testing method. Confirming your suspicions, the test comes back positive. Based on this one test, what is the probability that you have the Willies?

In this example, the probability space is partitioned into two regions: A1 = you have the Willies, and A2 = you do not have the Willies. You also know that 1 in 10,000 persons in the population actually has the Willies, which is equivalent to an a priori (prior) probability of infection of 0.0001. This means that the a priori probability that you do not have the Willies is 0.9999.

Page 14: Lecture 16

Since the test is positive, B is true in our example. In this case we can apply values to the following probabilities.

P(A1) = 0.0001 P(A2) = 0.9999

P(B|A1) = 0.998 P(B|A2) = 0.002

Before you continue, take a moment to be sure you understand the meaning of each of these probabilities. Now we can determine the probability that you actually have the Willies by direct application of Bayes' Rule.

\[
p(A_1 \mid B) = \frac{p(B \mid A_1)\,p(A_1)}{\sum_{j=1}^{2} p(B \mid A_j)\,p(A_j)}
             = \frac{(0.0001)(0.998)}{(0.0001)(0.998) + (0.9999)(0.002)} \approx 0.0475
\]

In other words, based on the results of a single test, there is less than a 5% chance that you actually have the Willies.

P(B|A1) - the probability that the Hokes test will return a positive result, given that a person has the Willies

P(B|A2) - the probability that the Hokes test will return a positive result, given that a person does not have the Willies

P(A1) - the probability that a person, chosen at random from the general population, has the Willies

P(A2) - the probability that a person, chosen at random from the general population, does not have the Willies
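The arithmetic can be checked in a few lines; this is a direct application of Bayes' rule using only the numbers given on the slide.

```python
# Posterior probability of having the Willies given one positive test,
# using the prior and test characteristics from the slide.

p_a1 = 0.0001        # P(A1): prior probability of having the disease
p_a2 = 0.9999        # P(A2) = 1 - P(A1)
p_b_a1 = 0.998       # P(B|A1): true-positive rate of the Hokes test
p_b_a2 = 0.002       # P(B|A2): false-positive rate of the Hokes test

p_b = p_b_a1 * p_a1 + p_b_a2 * p_a2   # total probability of a positive test
posterior = p_b_a1 * p_a1 / p_b       # Bayes' rule: P(A1|B)
print(round(posterior, 4))            # ≈ 0.0475
```

The result looks surprising only because the disease is so rare: the 0.9999 prior against infection swamps the test's accuracy, so most positive results come from the large healthy population.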

Page 15: Lecture 16

First-Order Inductive Logic Programming (ILP)

The purpose of ILP is to infer rules such as,

uncle(X,Y) :- brother(X,Z), parent(Z,Y).
uncle(X,Y) :- husband(X,Z), sister(Z,W), parent(W,Y).

given lots of instance data such as,

uncle(tom,frank)       uncle(bob,john)
not uncle(tom,cindy)   not uncle(bob,tom)
parent(bob,frank)      parent(cindy,frank)
parent(alice,john)     parent(tom,john)
brother(tom,cindy)     sister(cindy,tom)
husband(tom,alice)     husband(bob,cindy)

Relational Data Mining with Inductive Logic Programming for Link Discovery, R. J. Mooney, Submitted to Data Mining: Next Generation Challenges and Future Directions, H. Kargupta and A. Joshi (eds.), by AAAI/MIT Press
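The induced rules can be sanity-checked against the ground facts. The sketch below only *evaluates* the two clauses; an actual ILP system would induce them from the positive and negative examples.

```python
# Evaluate the two 'uncle' clauses against the ground facts from the
# slide. This checks the rules; an ILP system would *learn* them.

brother = {("tom", "cindy")}
sister  = {("cindy", "tom")}
husband = {("tom", "alice"), ("bob", "cindy")}
parent  = {("bob", "frank"), ("cindy", "frank"),
           ("alice", "john"), ("tom", "john")}

people = {p for pair in brother | sister | husband | parent for p in pair}

def uncle(x, y):
    # Clause 1: uncle(X,Y) :- brother(X,Z), parent(Z,Y).
    c1 = any((x, z) in brother and (z, y) in parent for z in people)
    # Clause 2: uncle(X,Y) :- husband(X,Z), sister(Z,W), parent(W,Y).
    c2 = any((x, z) in husband and (z, w) in sister and (w, y) in parent
             for z in people for w in people)
    return c1 or c2

print(uncle("tom", "frank"))   # tom is cindy's brother; cindy is frank's parent
print(uncle("tom", "cindy"))   # neither clause fires: correctly rejected
```

Both clauses agree with all four labeled examples: the two positives are derivable and the two negatives are not, which is exactly the consistency criterion ILP optimizes.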

Page 16: Lecture 16

Version Spaces

A version space in concept learning or induction is the subset of all hypotheses that are consistent with the observed training examples.

This set contains all hypotheses that have not been eliminated as a result of being in conflict with observed data.

GB (the general boundary) represents the most general hypotheses that have not been contradicted by any observation.

SB (the specific boundary) represents the most specific hypotheses that are consistent with all positive observations.

The version space is the region between SB and GB.

http://en.wikipedia.org/wiki/Version_spaces
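A minimal sketch of the idea, assuming a toy hypothesis language of intervals [lo, hi] over a single numeric feature (the data and the hypothesis language are invented for the example):

```python
# Toy version space over interval hypotheses [lo, hi] on one feature.
# SB is the tightest interval covering all positives; any interval
# containing SB that excludes all negatives is also in the space.

def specific_boundary(positives):
    """Most specific consistent hypothesis: the tightest covering interval."""
    return (min(positives), max(positives))

def consistent(lo, hi, positives, negatives):
    """A hypothesis is in the version space iff it accepts every
    positive example and rejects every negative example."""
    return (all(lo <= p <= hi for p in positives) and
            not any(lo <= n <= hi for n in negatives))

pos = [3.0, 4.5, 5.0]
neg = [1.0, 8.0]
s = specific_boundary(pos)
print(s)                               # the SB interval
print(consistent(2.0, 6.0, pos, neg))  # lies between SB and GB
print(consistent(0.0, 9.0, pos, neg))  # too general: contains negatives
```

Each new positive example can only widen SB, and each new negative can only tighten GB; learning stops when the two boundaries meet or the space becomes empty.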

Page 17: Lecture 16

Data Mining in the Weeds

Page 18: Lecture 16

[Figure: overlapping probability density functions (PDFs) of the target ("tgt") and non-target groups, with minval, maxval, and the means marked.]

In addition to load reduction, a non-parametric classifier is useful when the feature means of the non-target objects are co-located with the feature means of target objects.

When the Target & non-Target Groups Have ~= Means
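To illustrate why a non-parametric classifier helps when the class means coincide, the sketch below compares histogram-based likelihoods for two made-up classes with equal means but different spreads. The histogram estimator stands in for whatever non-parametric density estimate the lecture has in mind.

```python
# Two classes with (nearly) equal means but different spreads: a
# nearest-mean classifier cannot separate them, but histogram
# (non-parametric) likelihoods can. The data here is made up.
import random

random.seed(0)
target     = [random.gauss(0.0, 0.5) for _ in range(1000)]  # tight
non_target = [random.gauss(0.0, 3.0) for _ in range(1000)]  # broad

def hist_pdf(samples, lo=-10.0, hi=10.0, bins=40):
    """Estimate a PDF as normalized histogram bin counts."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for s in samples:
        if lo <= s < hi:
            counts[int((s - lo) / width)] += 1
    n = len(samples)
    def pdf(x):
        if not (lo <= x < hi):
            return 0.0
        return counts[int((x - lo) / width)] / (n * width)
    return pdf

p_t, p_nt = hist_pdf(target), hist_pdf(non_target)
near, far = 0.2, 5.0
print(p_t(near) > p_nt(near))  # near the shared mean, the tight class dominates
print(p_t(far) > p_nt(far))    # in the tails, only the broad class has mass
```

A decision rule based on the means alone would call every sample a tie; comparing the estimated densities separates the classes wherever their shapes differ, which is the load-reduction-plus-overlap case the slide describes.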

Page 19: Lecture 16

Distribution in Feature Space

[Figure: histograms of the candidate features - Perimeter/Bounding Box (fraction), Orientation (pixels), Perimeter (pixels), and Area (square pixels).]

Page 20: Lecture 16

[Figure/table: feature-set results for the BG and nonBG groups - minval/maxval values per feature (per/box, orient, perimeter, area, width, length, hue), counts of 23/23 BG and 0/23 nonBG (1/492 vs. 491/492 overall), and histograms of the Width and Length features.]

Results

Feature Set (concluded)

Page 21: Lecture 16

Data Mining

Microsoft Synchronization Services for ADO.NET

Tools of Data Mining

Decision Trees
Nearest Neighbor Classification
Neural Nets & Genetic Algorithms
Rule Induction
K-Means Clustering

Rule Induction

Association rules
Decision rules
Hypothesis testing
Inductive Logic Programming
Version spaces

Summary