Data Warehousing and Data Mining
What is a Data Warehouse?
A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process. --- W. H. Inmon
Collection of data that is used primarily in organizational decision making
A decision support database that is maintained separately from the organization's operational database
Data Warehouse - Subject Oriented
Subject oriented: oriented to the major subject areas of the corporation that have been defined in the data model. E.g. for an insurance company: customer, product, transaction or activity, policy, claim, account, etc.
Operational DBs and applications may be organized differently, e.g. based on type of insurance: auto, life, medical, fire, ...
Data Warehouse - Integrated
There is no consistency in encoding, naming conventions, etc., among different data sources
Heterogeneous data sources
When data is moved to the warehouse, it is converted.
Data Warehouse - Non-Volatile
Operational data is regularly accessed and manipulated a record at a time, and updates are applied to data in the operational environment.
Warehouse data is loaded and accessed. Updates of data do not occur in the data warehouse environment.
Data Warehouse - Time Variance
The time horizon for the data warehouse is significantly longer than that of operational systems.
Operational database: current-value data.
Data warehouse data: nothing more than a sophisticated series of snapshots, each taken at some moment in time.
The key structure of operational data may or may not contain some element of time. The key structure of the data warehouse always contains some element of time.
Why Separate Data Warehouse?
Performance
Special data organization, access methods, and implementation methods are needed to support the multidimensional views and operations typical of OLAP
Complex OLAP queries would degrade performance for operational transactions
Concurrency control and recovery modes of OLTP are not compatible with OLAP analysis
Why Separate Data Warehouse?
Function
Missing data: decision support requires historical data which operational DBs do not typically maintain
Data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources: operational DBs, external sources
Data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled.
Advantages of Warehousing
High query performance
Queries not visible outside warehouse
Local processing at sources unaffected
Can operate when sources unavailable
Can query data not stored in a DBMS
Extra information at warehouse: modify, summarize (store aggregates), add historical information
Advantages of Mediator Systems
No need to copy data: less storage, no need to purchase data
More up-to-date data
Query needs can be unknown
Only query interface needed at sources
May be less draining on sources
The Architecture of Data Warehousing
[Figure: operational databases and external data sources feed an extract/transform/load/refresh stage into the data warehouse with its metadata repository; the warehouse serves data marts and an OLAP server, which in turn supports OLAP, data mining and reports.]
Data Sources
Data sources are often the operational systems, providing the lowest level of data.
Data sources are designed for operational use, not for decision support, and the data reflect this fact.
Multiple data sources are often from different systems, run on a wide range of hardware, and much of the software is built in-house or highly customized.
Multiple data sources introduce a large number of issues -- semantic conflicts.
Creating and Maintaining a Warehouse
A data warehouse needs several tools that automate or support tasks such as:
Data extraction from different external data sources, operational databases, files of standard applications (e.g. Excel, COBOL applications), and other documents (Word, WWW)
Data cleaning (finding and resolving inconsistencies in the source data)
Integration and transformation of data (between different data formats, languages, etc.)
Creating and Maintaining a Warehouse
Data loading (loading the data into the data warehouse)
Data replication (replicating the source database into the data warehouse)
Data refreshment
Data archiving
Checking for data quality
Analyzing metadata
Physical Structure of Data Warehouse
There are three basic architectures for constructing a data warehouse:
Centralized
Federated
Tiered
The data warehouse is distributed for load balancing, scalability and higher availability.
Physical Structure of Data Warehouse
[Figure: centralized architecture - the sources feed a central data warehouse, which serves the clients directly.]
Physical Structure of Data Warehouse
LogicalData
Warehouse
Source Source
LocalData Marts
EndUsers
MarketingFinancialDistribution
Federated architecture
Physical Structure of Data Warehouse
PhysicalData
Warehouse
LocalData Marts
Workstations(higly summarizeddata)
Source Source
Tiered architecture
Physical Structure of Data Warehouse
Federated architecture
The logical data warehouse is only virtual
Tiered architecture
The central data warehouse is physical
There exist local data marts on different tiers which store copies or summarizations of the previous tier.
Conceptual Modeling of Data Warehouses
Three basic conceptual schemas:
Star schema
Snowflake schema
Fact constellations
Star schema
Star schema: A single object (fact table) in the middle connected to a number of dimension tables
Star schema
[Figure: fact table sale(orderId, date, custId, prodId, storeId, qty, amt) in the middle, linked to dimension tables customer(custId, name, address, city), product(prodId, name, price) and store(storeId, city).]
Star schema

customer: custId | name | address | city
53 | joe | 10 main | sfo
81 | fred | 12 main | sfo
111 | sally | 80 willow | la

product: prodId | name | price
p1 | bolt | 10
p2 | nut | 5

store: storeId | city
c1 | nyc
c2 | sfo
c3 | la

sale: orderId | date | custId | prodId | storeId | qty | amt
o100 | 1/7/97 | 53 | p1 | c1 | 1 | 12
o102 | 2/7/97 | 53 | p2 | c1 | 2 | 11
o105 | 3/8/97 | 111 | p1 | c3 | 5 | 50
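The sample star schema above can be exercised directly. A minimal sketch using an in-memory SQLite database (table and column names follow the slides; the query itself is a typical fact-to-dimension join and is not from the slides):

```python
import sqlite3

# Build the sample star schema from the slides in memory.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE sale     (orderId TEXT, date TEXT, custId INTEGER,
                       prodId TEXT, storeId TEXT, qty INTEGER, amt INTEGER);
INSERT INTO customer VALUES (53,'joe','10 main','sfo'),(81,'fred','12 main','sfo'),
                            (111,'sally','80 willow','la');
INSERT INTO product  VALUES ('p1','bolt',10),('p2','nut',5);
INSERT INTO store    VALUES ('c1','nyc'),('c2','sfo'),('c3','la');
INSERT INTO sale     VALUES ('o100','1/7/97',53,'p1','c1',1,12),
                            ('o102','2/7/97',53,'p2','c1',2,11),
                            ('o105','3/8/97',111,'p1','c3',5,50);
""")

# A typical star-schema query: join the fact table to a dimension table
# and aggregate the measure (amt) by a dimension attribute (store city).
rows = con.execute("""
    SELECT s.city, SUM(f.amt)
    FROM sale f JOIN store s ON f.storeId = s.storeId
    GROUP BY s.city ORDER BY s.city
""").fetchall()
print(rows)  # [('la', 50), ('nyc', 23)]
```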
Terms
Basic notion: a measure (e.g. sales, qty, etc.)
Given a collection of numeric measures, each measure depends on a set of dimensions (e.g. sales volume as a function of product, time, and location)
Terms
The relation which relates the dimensions to the measure of interest is called the fact table (e.g. sale)
Information about dimensions can be represented as a collection of relations called the dimension tables (product, customer, store)
Each dimension can have a set of associated attributes
Example of Star Schema
[Figure: Sales fact table with measurements unit_sales, dollar_sales, schilling_sales and dimensions Date(Date, Month, Year), Customer(CustId, CustName, CustCity, CustCountry), Product(ProductNo, ProdName, ProdDesc, Category, QOH) and Store(StoreID, City, State, Country, Region).]
Dimension Hierarchies
For each dimension, the set of associated attributes can be structured as a hierarchy, e.g.:
store -> sType; store -> city -> region
customer -> city -> state -> country
Dimension Hierarchies

store: storeId | cityId | tId | mgr
s5 | sfo | t1 | joe
s7 | sfo | t2 | fred
s9 | la | t1 | nancy

city: cityId | pop | regId
sfo | 1M | north
la | 5M | south

region: regId | name
north | cold region
south | warm region

sType: tId | size | location
t1 | small | downtown
t2 | large | suburbs
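A dimension hierarchy like store -> city -> region lets a measure recorded per store be rolled up to coarser levels. A minimal sketch using the mappings from the tables above; the per-store sales totals are hypothetical, for illustration only:

```python
from collections import defaultdict

# Hierarchy mappings taken from the store and city tables above.
store_city = {"s5": "sfo", "s7": "sfo", "s9": "la"}
city_region = {"sfo": "north", "la": "south"}

# Hypothetical per-store sales totals (not from the slides).
sales_by_store = {"s5": 100, "s7": 40, "s9": 60}

# Roll the measure up the hierarchy: store -> city -> region.
sales_by_region = defaultdict(int)
for store, amt in sales_by_store.items():
    sales_by_region[city_region[store_city[store]]] += amt

print(dict(sales_by_region))  # {'north': 140, 'south': 60}
```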
Snowflake Schema
Snowflake schema: A refinement of star schema where the dimensional hierarchy is represented explicitly by normalizing the dimension tables
Example of Snowflake Schema
[Figure: Sales fact table with measurements unit_sales, dollar_sales, schilling_sales; each dimension is normalized into a chain of tables, e.g. Product(ProductNo, ProdName, ProdDesc, Category, QOH); Cust(CustId, CustName, CustCity, CustCountry); Date(Date, Month) -> Month(Month, Year) -> Year(Year); Store(StoreID, City) -> City(City, State) -> State(State, Country) -> Country(Country, Region).]
Fact constellations
Fact constellations: Multiple fact tables share dimension tables
Database design methodology for data warehouses (1)
Nine-step methodology proposed by Kimball
Step | Activity
1 | Choosing the process
2 | Choosing the grain
3 | Identifying and conforming the dimensions
4 | Choosing the facts
5 | Storing the precalculations in the fact table
6 | Rounding out the dimension tables
7 | Choosing the duration of the database
8 | Tracking slowly changing dimensions
9 | Deciding the query priorities and the query modes
Database design methodology for data warehouses (2)
There are many approaches that offer alternative routes to the creation of a data warehouse.
Typical approach: decompose the design of the data warehouse into manageable parts, data marts. At a later stage, the integration of the smaller data marts leads to the creation of the enterprise-wide data warehouse.
The methodology specifies the steps required for the design of a data mart; however, it also ties together separate data marts so that over time they merge into a coherent overall data warehouse.
Step 1: Choosing the process
The process (function) refers to the subject matter of a particular data mart. The first data mart to be built should be the one that is most likely to be delivered on time, within budget, and to answer the most commercially important business questions.
The best choice for the first data mart tends to be the one that is related to sales
Step 2: Choosing the grain
Choosing the grain means deciding exactly what a fact table record represents. For example, the entity Sales may represent the facts about each property sale. Therefore, the grain of the Property_Sales fact table is an individual property sale.
Only when the grain for the fact table is chosen can we identify the dimensions of the fact table.
The grain decision for the fact table also determines the grain of each of the dimension tables. For example, if the grain for Property_Sales is an individual property sale, then the grain of the Client dimension is the detail of the client who bought a particular property.
Step 3: Identifying and conforming the dimensions
Dimensions set the context for formulating queries about the facts in the fact table.
We identify dimensions in sufficient detail to describe things such as clients and properties at the correct grain.
If any dimension occurs in two data marts, they must be exactly the same dimension, or one must be a subset of the other (this is the only way that two DMs can share one or more dimensions in the same application).
When a dimension is used in more than one DM, the dimension is referred to as being conformed.
Step 4: Choosing the facts
The grain of the fact table determines which facts can be used in the data mart: all facts must be expressed at the level implied by the grain.
In other words, if the grain of the fact table is an individual property sale, then all the numerical facts must refer to this particular sale (the facts should be numeric and additive).
Step 5: Storing pre-calculations in the fact table
Once the facts have been selected, each should be re-examined to determine whether there are opportunities to use pre-calculations.
Common example: a profit or loss statement. These types of facts are useful since they are additive quantities from which we can derive valuable information.
This is particularly true for a value that is fundamental to an enterprise, or if there is any chance of a user calculating the value incorrectly.
Step 6: Rounding out the dimension tables
In this step we return to the dimension tables and add as many text descriptions to the dimensions as possible.
The text descriptions should be as intuitive and understandable to the users as possible
Step 7: Choosing the duration of the data warehouse
The duration measures how far back in time the fact table goes.
For some companies (e.g. insurance companies) there may be a legal requirement to retain data extending back five or more years.
Very large fact tables raise at least two very significant data warehouse design issues:
The older the data, the more likely there will be problems in reading and interpreting the old files
It is mandatory that the old versions of the important dimensions be used, not the most current versions (we will discuss this issue later on)
Step 8: Tracking slowly changing dimensions
The changing dimension problem means that the proper description of the old client and the old branch must be used with the old data warehouse schema
Usually, the data warehouse must assign a generalized key to these important dimensions in order to distinguish multiple snapshots of clients and branches over a period of time
There are different types of changes in dimensions:
A dimension attribute is overwritten
A dimension attribute causes a new dimension record to be created
etc.
Step 9: Deciding the query priorities and the query modes
In this step we consider physical design issues:
The presence of pre-stored summaries and aggregates
Indices
Materialized views
Security issues
Backup issues
Archive issues
Database design methodology for data warehouses - summary
At the end of this methodology, we have a design for a data mart that supports the requirements of a particular business process and allows easy integration with other related data marts to ultimately form the enterprise-wide data warehouse.
A dimensional model, which contains more than one fact table sharing one or more conformed dimension tables, is referred to as a fact constellation.
Multidimensional Data Model
Sales of products may be represented in one dimension (as a fact relation) or in two dimensions, e.g.: clients and products
Multidimensional Data Model
Multidimensional Data Model
Fact relation:
sale: Product | Client | Amt
p1 | c1 | 12
p2 | c1 | 11
p1 | c3 | 50
p2 | c2 | 8

Two-dimensional cube:
   | c1 | c2 | c3
p1 | 12 |    | 50
p2 | 11 |  8 |
Multidimensional Data Model
Fact relation:
sale: Product | Client | Date | Amt
p1 | c1 | 1 | 12
p2 | c1 | 1 | 11
p1 | c3 | 1 | 50
p2 | c2 | 1 | 8
p1 | c1 | 2 | 44
p1 | c2 | 2 | 4

3-dimensional cube:
day 1:    | c1 | c2 | c3
       p1 | 12 |    | 50
       p2 | 11 |  8 |
day 2:    | c1 | c2 | c3
       p1 | 44 |  4 |
       p2 |    |    |
Multidimensional Data Model and Aggregates
Add up amounts for day 1. In SQL:
SELECT sum(Amt) FROM SALE WHERE Date = 1
Result: 81
Multidimensional Data Model and Aggregates
Add up amounts by day. In SQL:
SELECT Date, sum(Amt)
FROM SALE
GROUP BY Date
Result:
Date | sum
1 | 81
2 | 48
Multidimensional Data Model and Aggregates
Add up amounts by client and product. In SQL:
SELECT Client, Product, sum(Amt)
FROM SALE
GROUP BY Client, Product
Multidimensional Data Model and Aggregates
Result:
Product | Client | Sum
p1 | c1 | 56
p1 | c2 | 4
p1 | c3 | 50
p2 | c1 | 11
p2 | c2 | 8
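The SQL aggregations above can be reproduced in plain Python over the fact relation; a minimal sketch:

```python
from collections import defaultdict

# The fact relation as (Product, Client, Date, Amt) tuples.
sale = [("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
        ("p2", "c2", 1, 8),  ("p1", "c1", 2, 44), ("p1", "c2", 2, 4)]

# SELECT sum(Amt) FROM SALE WHERE Date = 1
total_day1 = sum(amt for _, _, d, amt in sale if d == 1)

# SELECT Client, Product, sum(Amt) FROM SALE GROUP BY Client, Product
by_cp = defaultdict(int)
for p, c, _, amt in sale:
    by_cp[(c, p)] += amt

print(total_day1)           # 81
print(by_cp[("c1", "p1")])  # 56
```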
Multidimensional Data Model and Aggregates
In the multidimensional data model, together with the measure values, we usually store summarizing information (aggregates):

    | c1 | c2 | c3 | Sum
p1  | 56 |  4 | 50 | 110
p2  | 11 |  8 |    |  19
Sum | 67 | 12 | 50 | 129
Aggregates
Operators: sum, count, max, min, median, avg
Having clause
Using the dimension hierarchy:
average by region (within store)
maximum by month (within date)
Cube Aggregation
Example: computing sums over the date dimension.

day 1:    | c1 | c2 | c3
       p1 | 12 |    | 50
       p2 | 11 |  8 |
day 2:    | c1 | c2 | c3
       p1 | 44 |  4 |
       p2 |    |    |

summed:   | c1 | c2 | c3
       p1 | 56 |  4 | 50
       p2 | 11 |  8 |
Grand total: 129
Cube Operators
Starting from the day 1 / day 2 slices above, '*' aggregates away a dimension:
sale(c1,*,*) - total sales to client c1
sale(*,p2,*) - total sales of product p2
sale(c2,p2,*) - total sales of product p2 to client c2
sale(*,*,*) - the grand total
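The '*' cube operators above can be sketched as a function that sums the measure over every coordinate left unspecified (a sketch, not an implementation from the slides):

```python
# The fact relation as (Product, Client, Date, Amt) tuples.
sale = [("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
        ("p2", "c2", 1, 8),  ("p1", "c1", 2, 44), ("p1", "c2", 2, 4)]

def cube(product=None, client=None, date=None):
    """Sum Amt over all tuples matching the non-None coordinates;
    None plays the role of '*' (aggregate away that dimension)."""
    return sum(amt for p, c, d, amt in sale
               if product in (None, p)
               and client in (None, c)
               and date in (None, d))

print(cube(client="c1"))   # sale(c1,*,*) = 12 + 11 + 44 = 67
print(cube(product="p2"))  # sale(*,p2,*) = 11 + 8 = 19
print(cube())              # sale(*,*,*)  = 129
```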
Aggregation Using Hierarchies
Hierarchy: customer -> region -> country
(customer c1 in Region A; customers c2, c3 in Region B)
Day 1 data rolled up by region:

    | region A | region B
p1  |    12    |    50
p2  |    11    |     8
Aggregation Using Hierarchies
Hierarchy: client -> city -> region
[Figure: a cube of sales by client (c1-c4), product (video, camera, CD) and date of sale, aggregated with respect to city:

            | video | camera | CD
New Orleans |  22   |   8    | 30
Poznań      |  23   |  18    | 22]
A Sample Data Cube
[Figure: a 3-D data cube of sales with dimensions Date (quarters 1Q-4Q), Product (video, camera, CD) and Country (USA, Canada, Mexico), with sum cells along each face of the cube.]
Exercise (1)
Suppose the AAA Automobile Co. builds a data warehouse to analyze sales of its cars. The measure is the price of a car.
We would like to answer the following typical queries:
find total sales by day, week, month and year
find total sales by week, month, ... for each dealer
find total sales by week, month, ... for each car model
find total sales by month for all dealers in a given city, region and state
Exercise (2)
Dimensions:
time (day, week, month, quarter, year)
dealer (name, city, state, region, phone)
cars (serialno, model, color, category, ...)
Design the conceptual data warehouse schema.
Data Warehouse Database
Different technological approaches to the data warehouse database are:
1. Parallel relational database designs that require a parallel computing platform
2. An innovative approach to speed up a traditional RDBMS by using new index structures to bypass relational table scans
3. Multidimensional databases designed to overcome any limitations placed on the warehouse by the nature of the relational data model
Sourcing, Acquisition, Cleanup and Transformation Tools
The functionality includes the following:
a. Removing unwanted data from operational databases
b. Converting to common data names and definitions
c. Calculating summaries and derived data
d. Establishing defaults for missing data
e. Accommodating source data definition changes
Issues in data sourcing, cleanup, extraction and transformation:
Database heterogeneity: DBMSs are very different in data models, data access languages, data navigation, operations, concurrency, integrity, recovery and so on
Data heterogeneity: the way data is defined and used in different models
Metadata
Metadata is data about data that describes the data warehouse.
Metadata can be classified into the following:
Technical metadata
Business metadata
Data warehouse operational information, such as data history, ownership, extract audit trail, usage data
Technical Metadata
Information about data sources
Transformation descriptions: the mapping method from the operational database into the warehouse, and the algorithms used to convert/enhance/transform data
Rules to perform data cleanup and data enhancement
Data structure definitions for data targets
Data-mapping operations when capturing data from source systems and applying it to the target warehouse database
Access authorisation, backup history, archive history, information delivery history, data acquisition history, data access and so on
Business Metadata
Subject areas and information object types, including queries, reports, images, video and/or audio clips
Internet home pages
Other information to support all data warehousing components. For example, the information related to the information delivery system should include subscription information, scheduling information, details of delivery destinations, and the business query objects such as predefined queries, reports and analyses.
The information directory and the entire metadata repository should have the following attributes:
Should be the gateway to the data warehouse environment, and thus should be accessible from any platform via transparent and seamless connections
The information directory components should be accessible by any browser and run on all major platforms
The data structures of the metadata repository should be supported on all major relational and object-oriented databases
Should support easy distribution and replication of its content for high performance and availability
Should be searchable by business-oriented key words
Should be able to define the content of structured and unstructured data
Should act as a launch platform for end-user data access and analysis tools
Should support the sharing of information objects
Should support a variety of scheduling options for requests against the data warehouse, including on-demand, one-time, repetitive, event-driven and conditional delivery
Should support and provide interfaces to other applications such as e-mail, spreadsheets and so on
Examples of metadata repositories include Microsoft Repository, R&O Rochade, Prism Solutions Directory Manager and CA/Platinum Technologies
Accessing and Visualizing Information
Effective data visualization provides the user with the following:
Capability to compare data
Capability to control scale
Capability to map the visualization back to the detail data that created it
Capability to filter data to look only at subsets of it
Tool Taxonomy
Data query and reporting tools Application Development tools Executive Information System tools Online analytical processing tools Data mining tools
Query and Reporting tools
Production reporting tools let companies generate regular operational reports
Report writers are inexpensive desktop tools designed for users
Managed query tools are designed for ease of use, with point-and-click visual navigation that either accepts SQL or generates SQL statements to query relational data stored in the warehouse.
Application Development tools
Organizations will often rely on the tried and proven approach of in-house application development, using graphical data access environments designed primarily for client/server environments.
OLAP tools
The OLAP tools can be classified as multidimensional (MOLAP), relational (ROLAP) and hybrid (HOLAP) tools. Some of the more popular OLAP tools are Microsoft Decision Support Services, MicroStrategy DSS Server, Oracle Express, MetaCube from Informix and so on.
Data mining tools
Discovering knowledge Segmentation Classification Association Preferencing Visualization
Data Marts
The data mart is directed at a partition of data that is created for the use of a dedicated group of users. A data mart is a set of denormalized, summarized or aggregated data.
Data Warehouse Administration and Management
Security and priority management
Monitoring updates from multiple sources
Data quality checks
Managing and updating metadata
Auditing and reporting data warehouse usage and status
Purging data
Replicating, subsetting and distributing data
Backup and recovery
Data Mining
Data Mining
The process of employing one or more computer learning techniques to automatically analyze and extract knowledge from data.
A Simple Data Mining Process Model
[Figure: SQL queries extract data from the operational database into the data warehouse; data mining is applied to the warehouse, and the resulting application output is followed by interpretation and evaluation.]
General Phases of the Data Mining Process
Problem definition
Creating a database for data mining
Exploring the database
Preparation for creating a data mining model
Building a data mining model
Evaluating the data mining model
Deploying the data mining model
Data Mining Tasks
The models used to solve a problem are classified as:
Predictive model: classification, regression, time series analysis, prediction
Descriptive model: clustering, summarization, association rules, sequence discovery
Data Mining Techniques
Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID) .
Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.
Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k >= 1). Sometimes called the k-nearest neighbor technique.
Rule induction: The extraction of useful if-then rules from data based on statistical significance.
Data Mining Issues
Human interaction
Overfitting
Outliers
Interpretation of results
Visualization of results
Large datasets
High dimensionality
Multimedia data
Missing data
Irrelevant data
Noisy data
Changing data
Integration
Application
Data Mining Metrics
Measuring the effectiveness or usefulness of data mining is called a data mining metric
It could be measured as an increase in sales or a reduction in advertising cost, i.e. as a return on investment (ROI)
The metrics used also include the traditional metrics of space and time, and, for example, similarity measures
Social Implications of Data Mining
Targeted advertising
Data mining applications can derive much demographic data concerning customers that was previously unknown or hidden in the data
Fraud detection, criminal suspects, prediction of terrorists
Data Mining from a Database Perspective
Scalability
Real-world data
Updates
Ease of use
Decision Tree
A tree structure where non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes.
Table 1.1 Hypothetical Training Data for Disease Diagnosis

Patient ID# | Sore Throat | Fever | Swollen Glands | Congestion | Headache | Diagnosis
1 | Yes | Yes | Yes | Yes | Yes | Strep throat
2 | No | No | No | Yes | Yes | Allergy
3 | Yes | Yes | No | Yes | No | Cold
4 | Yes | No | Yes | No | No | Strep throat
5 | No | Yes | No | Yes | No | Cold
6 | No | No | No | Yes | No | Allergy
7 | No | No | Yes | No | No | Strep throat
8 | Yes | No | No | Yes | Yes | Allergy
9 | No | Yes | No | Yes | Yes | Cold
10 | Yes | Yes | No | Yes | Yes | Cold
[Figure: decision tree for Table 1.1 - the root tests Swollen Glands: Yes -> Diagnosis = Strep Throat; No -> test Fever: Yes -> Diagnosis = Cold, No -> Diagnosis = Allergy.]
Table 1.2 Data Instances with an Unknown Classification

Patient ID# | Sore Throat | Fever | Swollen Glands | Congestion | Headache | Diagnosis
11 | No | No | Yes | Yes | Yes | ?
12 | Yes | Yes | No | No | Yes | ?
13 | No | No | No | No | Yes | ?
Production Rules
IF Swollen Glands = Yes
THEN Diagnosis = Strep Throat

IF Swollen Glands = No & Fever = Yes
THEN Diagnosis = Cold

IF Swollen Glands = No & Fever = No
THEN Diagnosis = Allergy
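The three production rules above can be written as a classification function; a minimal sketch applied to the unknown instances of Table 1.2:

```python
def diagnose(swollen_glands, fever):
    """Apply the production rules from the decision tree."""
    if swollen_glands:
        return "Strep Throat"
    return "Cold" if fever else "Allergy"

# The unknown instances of Table 1.2 (only the two tested attributes matter).
print(diagnose(swollen_glands=True,  fever=False))  # patient 11 -> Strep Throat
print(diagnose(swollen_glands=False, fever=True))   # patient 12 -> Cold
print(diagnose(swollen_glands=False, fever=False))  # patient 13 -> Allergy
```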
An Algorithm for Building Decision Trees
1. Let T be the set of training instances.
2. Choose an attribute that best differentiates the instances in T.
3. Create a tree node whose value is the chosen attribute.
   - Create child links from this node where each link represents a unique value for the chosen attribute.
   - Use the child link values to further subdivide the instances into subclasses.
4. For each subclass created in step 3:
   - If the instances in the subclass satisfy predefined criteria, or if the set of remaining attribute choices for this path is null, specify the classification for new instances following this decision path.
   - If the subclass does not satisfy the criteria and there is at least one attribute to further subdivide the path of the tree, let T be the current set of subclass instances and return to step 2.
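The steps above can be sketched as a short recursive procedure. This is a simplified illustration: the "best differentiating attribute" of step 2 is approximated here by counting how many pure subsets a split produces (not an information-theoretic measure), and the training rows are a reduced version of Table 1.1:

```python
from collections import Counter

def build(rows, attrs):
    """Recursively build a decision tree (steps 1-4 above, simplified)."""
    labels = [r["class"] for r in rows]
    if len(set(labels)) == 1 or not attrs:       # step 4 stopping rule
        return Counter(labels).most_common(1)[0][0]

    def purity(a):                               # crude stand-in for step 2
        groups = {}
        for r in rows:
            groups.setdefault(r[a], []).append(r["class"])
        return sum(len(set(g)) == 1 for g in groups.values())

    best = max(attrs, key=purity)                # step 2: best attribute
    node = {"attr": best, "children": {}}        # step 3: tree node + links
    for v in set(r[best] for r in rows):
        subset = [r for r in rows if r[best] == v]
        node["children"][v] = build(subset, [a for a in attrs if a != best])
    return node

rows = [{"glands": "Yes", "fever": "Yes", "class": "Strep"},
        {"glands": "No",  "fever": "No",  "class": "Allergy"},
        {"glands": "No",  "fever": "Yes", "class": "Cold"},
        {"glands": "Yes", "fever": "No",  "class": "Strep"}]
tree = build(rows, ["glands", "fever"])
print(tree["attr"])  # glands
```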
Generating Association Rules

Rule Confidence
Given a rule of the form "If A then B", rule confidence is the conditional probability that B is true when A is known to be true.
Rule Support
The minimum percentage of instances in the database that contain all items listed in a given association rule.
Mining Association Rules: An Example
Table 3.3 A Subset of the Credit Card Promotion Database

Magazine Promotion | Watch Promotion | Life Insurance Promotion | Credit Card Insurance | Sex
Yes | No | No | No | Male
Yes | Yes | Yes | No | Female
No | No | No | No | Male
Yes | Yes | Yes | Yes | Male
Yes | No | Yes | No | Female
No | No | No | No | Female
Yes | No | Yes | Yes | Male
No | Yes | No | No | Male
Yes | No | No | No | Male
Yes | Yes | Yes | No | Female
Table 3.4 Single-Item Sets

Single-Item Set | Number of Items
Magazine Promotion = Yes | 7
Watch Promotion = Yes | 4
Watch Promotion = No | 6
Life Insurance Promotion = Yes | 5
Life Insurance Promotion = No | 5
Credit Card Insurance = No | 8
Sex = Male | 6
Sex = Female | 4
Table 3.5 Two-Item Sets

Two-Item Set | Number of Items
Magazine Promotion = Yes & Watch Promotion = No | 4
Magazine Promotion = Yes & Life Insurance Promotion = Yes | 5
Magazine Promotion = Yes & Credit Card Insurance = No | 5
Magazine Promotion = Yes & Sex = Male | 4
Watch Promotion = No & Life Insurance Promotion = No | 4
Watch Promotion = No & Credit Card Insurance = No | 5
Watch Promotion = No & Sex = Male | 4
Life Insurance Promotion = No & Credit Card Insurance = No | 5
Life Insurance Promotion = No & Sex = Male | 4
Credit Card Insurance = No & Sex = Male | 4
Credit Card Insurance = No & Sex = Female | 4
Two Possible Two-Item Set Rules

IF Magazine Promotion = Yes
THEN Life Insurance Promotion = Yes (5/7)

IF Life Insurance Promotion = Yes
THEN Magazine Promotion = Yes (5/5)
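Support and confidence for the first rule above can be computed directly from Table 3.3; a minimal sketch using only the magazine and life-insurance columns:

```python
# (Magazine Promotion, Life Insurance Promotion) per row of Table 3.3.
rows = [("Yes", "No"), ("Yes", "Yes"), ("No", "No"), ("Yes", "Yes"),
        ("Yes", "Yes"), ("No", "No"), ("Yes", "Yes"), ("No", "No"),
        ("Yes", "No"), ("Yes", "Yes")]

# Rule: IF Magazine Promotion = Yes THEN Life Insurance Promotion = Yes
both = sum(1 for m, l in rows if m == "Yes" and l == "Yes")
antecedent = sum(1 for m, l in rows if m == "Yes")

support = both / len(rows)      # fraction of rows containing all rule items
confidence = both / antecedent  # P(consequent | antecedent) = 5/7
print(support, confidence)      # 0.5 0.7142857142857143
```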
Three-Item Set Rules

IF Watch Promotion = No & Life Insurance Promotion = No
THEN Credit Card Insurance = No (4/4)

IF Watch Promotion = No
THEN Life Insurance Promotion = No & Credit Card Insurance = No (4/6)
General Considerations
We are interested in association rules that show a lift in product sales, where the lift is the result of the product's association with one or more other products.
We are also interested in association rules that show a lower than expected confidence for a particular association.
Nearest Neighbour
Objects that are near each other will also have similar prediction values. Thus, if you know the prediction value of one of the objects, you can predict it for its nearest neighbours.
The K-Means Algorithm
1. Choose a value for K, the total number of clusters.
2. Randomly choose K points as cluster centers.
3. Assign the remaining instances to their closest cluster center.
4. Calculate a new cluster center for each cluster.
5. Repeat steps 3-4 until the cluster centers do not change.
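The five steps above can be sketched as follows, run on the six points of Table 3.6 (K = 2; the random seed and the handling of empty clusters are implementation choices not specified on the slides):

```python
import math
import random

# The six (X, Y) instances of Table 3.6.
points = [(1.0, 1.5), (1.0, 4.5), (2.0, 1.5), (2.0, 3.5), (3.0, 2.5), (5.0, 6.0)]

def kmeans(points, k, seed=1):
    random.seed(seed)
    centers = random.sample(points, k)          # step 2: random initial centers
    while True:
        clusters = [[] for _ in range(k)]
        for p in points:                        # step 3: assign to nearest center
            i = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[i].append(p)
        new_centers = []                        # step 4: recompute each center
        for i, cl in enumerate(clusters):
            if cl:
                new_centers.append((sum(x for x, _ in cl) / len(cl),
                                    sum(y for _, y in cl) / len(cl)))
            else:
                new_centers.append(centers[i])  # keep an empty cluster's center
        if new_centers == centers:              # step 5: repeat until stable
            return centers, clusters
        centers = new_centers

centers, clusters = kmeans(points, 2)
print([len(c) for c in clusters])
```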
Table 3.6 K-Means Input Values

Instance | X | Y
1 | 1.0 | 1.5
2 | 1.0 | 4.5
3 | 2.0 | 1.5
4 | 2.0 | 3.5
5 | 3.0 | 2.5
6 | 5.0 | 6.0
[Figure: plot of the six input instances.]
Table 3.7 Several Applications of the K-Means Algorithm (K = 2)

Outcome | Cluster Centers | Cluster Points | Squared Error
1 | (2.67, 4.67) | 2, 4, 6 | 14.50
  | (2.00, 1.83) | 1, 3, 5 |
2 | (1.5, 1.5) | 1, 3 | 15.94
  | (2.75, 4.125) | 2, 4, 5, 6 |
3 | (1.8, 2.7) | 1, 2, 3, 4, 5 | 9.60
  | (5, 6) | 6 |
General Considerations
Requires real-valued data.
We must select the number of clusters present in the data.
Works best when the clusters in the data are of approximately equal size.
Attribute significance cannot be determined.
Lacks explanation capabilities.
Bayesian Classification

ID | Income | Credit | Class | x(i)
1 | 4 | e | h1 | x4
2 | 3 | g | h1 | x7
3 | 2 | e | h1 | x2
4 | 3 | g | h1 | x7
5 | 4 | g | h1 | x8
6 | 2 | e | h1 | x2
7 | 3 | b | h2 | x11
8 | 2 | b | h2 | x10
9 | 3 | b | h3 | x11
10 | 1 | b | h4 | x9
11 | 2 | g | h2 | x6

P(h1|xi) = P(xi|h1) * P(h1) / sum over j of (P(xi|hj) * P(hj))

Let h1 = authorize purchase, h2 = authorize after identification, h3 = do not authorize, h4 = do not authorize and report to police.

Income groups: 1: 0-10000, 2: 10000-50000, 3: 50000-100000, 4: over 100000. Construct a table:

  | 1  | 2   | 3   | 4
e | x1 | x2  | x3  | x4
g | x5 | x6  | x7  | x8
b | x9 | x10 | x11 | x12

P(x7|h1) = 2/6, P(x4|h1) = 1/6, P(x2|h1) = 2/6, P(x8|h1) = 1/6

P(h1|x4) = P(x4|h1) * P(h1) / sum over j of (P(x4|hj) * P(hj)) = 1
Attribute | Value | Count (Short, Medium, Tall) | Prob (Short, Medium, Tall)
Gender | M | 1, 2, 3 | 1/4, 2/8, 3/3
Gender | F | 3, 6, 0 | 3/4, 6/8, 0/3
Height | 0-1.6 | 2, 0, 0 | 2/4, 0, 0
Height | 1.6-1.7 | 2, 0, 0 | 2/4, 0, 0
Height | 1.7-1.8 | 0, 4, 0 | 0, 4/8, 0
Height | 1.8-1.9 | 0, 3, 0 | 0, 3/8, 0
Height | 1.9-2.0 | 0, 1, 1 | 0, 1/8, 1/3
Height | over 2.0 | 0, 0, 2 | 0, 0, 2/3
P(t|short) = 1/4 * 0 = 0
P(t|medium) = 2/8 * 1/8 = 0.031
P(t|tall) = 3/3 * 1/3 = 0.333

Likelihood of being short = 0 * 0.267 = 0
Likelihood of being medium = 0.031 * 0.533 = 0.0166
Likelihood of being tall = 0.333 * 0.2 = 0.066
P(t) = 0 + 0.0166 + 0.066 = 0.0826
P(short|t) = 0 * 0.267 / 0.0826 = 0
P(medium|t) = 0.031 * 0.533 / 0.0826 = 0.2
P(tall|t) = 0.333 * 0.2 / 0.0826 = 0.799
The tuple t is classified as tall, since that posterior probability is the highest.
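The posterior computation above can be reproduced in a few lines; the likelihoods and priors are the values read from the table (the observed tuple t is a male with height in (1.9, 2.0]):

```python
# Class-conditional probabilities for t = (gender M, height in (1.9, 2.0]),
# read from the table above, and the class priors 4/15, 8/15, 3/15.
likelihood = {"short": 1/4 * 0,       # P(M|short)  * P(1.9-2.0|short)
              "medium": 2/8 * 1/8,    # P(M|medium) * P(1.9-2.0|medium)
              "tall": 3/3 * 1/3}      # P(M|tall)   * P(1.9-2.0|tall)
prior = {"short": 4/15, "medium": 8/15, "tall": 3/15}

# Bayes' rule: posterior = likelihood * prior / evidence.
joint = {c: likelihood[c] * prior[c] for c in prior}
evidence = sum(joint.values())                 # P(t)
posterior = {c: joint[c] / evidence for c in joint}

print(max(posterior, key=posterior.get))       # tall
print(round(posterior["tall"], 3))             # 0.8
```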
ID3 Algorithm
The concept used to quantify information is called entropy. Entropy is used to measure the amount of uncertainty or surprise or randomness in a set of data.
The basic strategy used by ID3 is to choose splitting attributes with the highest information gain first.
Given probabilities p1, p2, ..., ps, where sum(pi) = 1, entropy is defined as
H(p1, p2, ..., ps) = sum(pi * log(1/pi))
Gain(D, S) = H(D) - sum(P(Di) * H(Di))
Short: 4/15, medium: 8/15, tall: 3/15. The entropy of the starting set is
4/15 log(15/4) + 8/15 log(15/8) + 3/15 log(15/3) = 0.4384
Choosing gender as the splitting attribute, 9 instances are F and 6 are M.
The entropy of the subset that are F is
3/9 log(9/3) + 6/9 log(9/6) = 0.2764
The entropy of the subset that are M is
1/6 log(6/1) + 2/6 log(6/2) + 3/6 log(6/3) = 0.4392
The ID3 algorithm must determine the information gain obtained by using this split. Calculate the weighted sum of these last two entropies to get
9/15 * 0.2764 + 6/15 * 0.4392 = 0.34152
The gain in entropy by using the gender attribute is thus
0.4384 - 0.34152 = 0.09688
Looking at the height attribute, we divide it into the ranges
(0,1.6], (1.6,1.7], (1.7,1.8], (1.8,1.9], (1.9,2.0], (2.0,inf)
(0,1.6] -> 2/2 log(2/2) + 0 + 0 = 0, (1.6,1.7] -> 0, ..., (1.9,2.0] -> 0 + 1/2 log(2) + 1/2 log(2) = 0.301
The gain in entropy by using the height attribute is
0.4384 - 2/15 * 0.301 = 0.3983
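The entropy and gain figures above can be checked in code. Note the slides use logarithms base 10; a minimal sketch:

```python
from math import log10

def entropy(probs):
    """H(p1..ps) = sum(pi * log(1/pi)), base-10 logs as on the slides."""
    return sum(p * log10(1 / p) for p in probs if p > 0)

h_all = entropy([4/15, 8/15, 3/15])   # whole set: 4 short, 8 medium, 3 tall
h_f   = entropy([3/9, 6/9])           # 9 females: 3 short, 6 medium
h_m   = entropy([1/6, 2/6, 3/6])      # 6 males: 1 short, 2 medium, 3 tall

# Gain = entropy of the whole set minus the weighted subset entropies.
gain_gender = h_all - (9/15 * h_f + 6/15 * h_m)
print(round(h_all, 4), round(gain_gender, 4))  # 0.4385 0.0969
```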
C4.5 or C5.0

GainRatio(D, S) = Gain(D, S) / H(|D1|/|D|, ..., |Ds|/|D|)

To calculate the GainRatio for the gender split, we first find the entropy associated with the split, ignoring classes:

H(9/15, 6/15) = 9/15 log(15/9) + 6/15 log(15/6) = 0.292

This gives a GainRatio value for the gender attribute of

0.09688 / 0.292 = 0.332

The entropy of the split on height is

H(2/15, 2/15, 3/15, 4/15, 2/15) = 2/15 log(15/2) + 2/15 log(15/2) + 3/15 log(15/3) + 4/15 log(15/4) + 2/15 log(15/2) = 0.1166*3 + 0.1397 + 0.15307 = 0.64257

Using the height gain computed earlier (0.3983), the GainRatio value for the height attribute is

0.3983 / 0.64257 = 0.6198

so height again has the larger value.
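The GainRatio for the gender split can be checked the same way (base-10 logarithms, gain value taken from the ID3 example above):

```python
# Verify the C4.5 GainRatio figure for the gender split.
import math

def entropy(probs):
    return sum(p * math.log10(1/p) for p in probs if p > 0)

gain_gender = 0.09688               # gain from the ID3 example above
split_info = entropy([9/15, 6/15])  # entropy of the split, ignoring classes
gain_ratio = gain_gender / split_info
print(round(split_info, 3), round(gain_ratio, 3))   # ~0.292, ~0.33
```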
Neural Network

To solve a classification problem using a neural network:

Determine the number of output nodes and the attributes to be used as input
Determine the labels and functions to be used for the graph
Evaluate each tuple by filtering it through the structure of the network
For each tuple ti in D, propagate ti through the network and classify the tuple
Various issues in neural network classification are:

Deciding the attributes to be used as splitting attributes
Determining the number of hidden nodes
Determining the number of hidden layers and the best number of hidden nodes per hidden layer
Determining the number of sinks
Deciding the interconnectivity of the nodes
Choosing the activation functions
Propagation in Neural Network

The output of each node i in the network is based on an activation function fi. Applied to inputs {x1i, x2i, ..., xki} with weights {w1i, w2i, ..., wki}, the weighted sum of the inputs is

S = sum(whi * xhi), h = 1 to k

Propagation algorithm:

for each node i in the input layer do
    output xi on each output arc from i;
for each hidden layer do
    for each node i do
        Si = sum(wji * xji);
        for each output arc from i do
            output (1 - e^-Si) / (1 + e^-Si);
for each node i in the output layer do
    Si = sum(wji * xji);
    output 1 / (1 + e^-cSi);
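A minimal Python sketch of this propagation step, with a bipolar sigmoid in the hidden layer and a logistic sigmoid at the output as in the pseudocode. The weights and inputs are made-up illustration values, not taken from the text:

```python
# Feedforward propagation sketch matching the pseudocode above.
import math

def bipolar_sigmoid(s):
    """(1 - e^-s) / (1 + e^-s): hidden-layer activation."""
    return (1 - math.exp(-s)) / (1 + math.exp(-s))

def logistic(s, c=1.0):
    """1 / (1 + e^-cs): output-layer activation."""
    return 1 / (1 + math.exp(-c * s))

def propagate(x, hidden_w, output_w):
    hidden = [bipolar_sigmoid(sum(w * xi for w, xi in zip(ws, x)))
              for ws in hidden_w]
    return [logistic(sum(w * h for w, h in zip(ws, hidden)))
            for ws in output_w]

# Two inputs, two hidden nodes, one output node (illustration weights)
out = propagate([1.0, 0.5],
                hidden_w=[[0.2, -0.4], [0.7, 0.1]],
                output_w=[[0.5, -0.3]])
print(out)
```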
Radial basis function network
A function whose value depends on the distance from a central point is known as a radial function. A typical choice is

fi(S) = e^(-S^2 / v)
Perceptron
The simplest type of neural network is the perceptron: a single node with weighted inputs and one output.

The perceptron typically uses a threshold or sigmoidal activation function.
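A minimal perceptron sketch with a threshold activation. The weights and bias are illustration values chosen to mimic an AND gate; they are not from the text:

```python
# Single perceptron node: weighted sum followed by a threshold activation.
def perceptron(x, w, bias):
    s = sum(wi * xi for wi, xi in zip(w, x)) + bias
    return 1 if s > 0 else 0

# With these illustration weights the node behaves like logical AND:
print(perceptron([1, 1], w=[0.5, 0.5], bias=-0.7))  # 1
print(perceptron([1, 0], w=[0.5, 0.5], bias=-0.7))  # 0
```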
Association Rules
Let I = {I1, I2, ..., Im} be a set of items and let {t1, t2, ..., tn} be a database of transactions, where ti = {Ii1, Ii2, ..., Iik} and each Iij belongs to I. An association rule is an implication of the form X => Y, where X and Y are itemsets contained in I and X intersection Y is empty.
Basic concepts of Association Rule
Support: The support of an association rule X => Y is the percentage of transactions in the database that contain X U Y.

Confidence: The confidence of an association rule X => Y is the ratio of the number of transactions that contain X U Y to the number of transactions that contain X.

Large itemset: A large itemset is an itemset whose number of occurrences is above a threshold (the minimum support). L represents the complete set of large itemsets and l an individual large itemset. Itemsets that are counted against the data set are called candidates, and the collection of all of them is known as the candidate itemset.
Apriori Algorithm

The Apriori algorithm is an association rule algorithm that finds the large itemsets in a given data set.
Transaction  Items
T1           Bread, Jam, Butter
T2           Bread, Butter
T3           Bread, Cold-drink, Butter
T4           Milk, Bread
T5           Milk, Cold-drink
Candidates and Large Itemset using Apriori
Scan 1
Candidates: {Milk}, {Bread}, {Jam}, {Cold-drink}, {Butter}
Large itemsets: {Milk}, {Bread}, {Cold-drink}, {Butter}

Scan 2
Candidates: {Milk, Bread}, {Milk, Cold-drink}, {Milk, Butter}, {Bread, Cold-drink}, {Bread, Butter}, {Cold-drink, Butter}
Large itemsets: {Bread, Butter}
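The two scans above can be reproduced with a small Apriori sketch over the five transactions, assuming a minimum support of 2 transactions (40%):

```python
# Apriori sketch: two scans over the example transactions.
from itertools import combinations

transactions = [
    {"Bread", "Jam", "Butter"},
    {"Bread", "Butter"},
    {"Bread", "Cold-drink", "Butter"},
    {"Milk", "Bread"},
    {"Milk", "Cold-drink"},
]
min_support = 2  # absolute count; 2/5 = 40%

def count(itemset):
    """Number of transactions containing the itemset."""
    return sum(itemset <= t for t in transactions)

# Scan 1: count single items, keep the large (frequent) ones
items = sorted({i for t in transactions for i in t})
L1 = [frozenset([i]) for i in items if count(frozenset([i])) >= min_support]

# Scan 2: candidate pairs built only from large single items
large_items = sorted({i for s in L1 for i in s})
C2 = [frozenset(p) for p in combinations(large_items, 2)]
L2 = [c for c in C2 if count(c) >= min_support]
print([set(s) for s in L2])   # only {Bread, Butter} survives
```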
Sampling Algorithm
To avoid counting itemsets over a large data set in every scan, the sampling algorithm can be used. It reduces the number of data set scans to one in the best case and two in the worst case. The algorithm finds the large itemsets for a sample of the data set, as Apriori does; these are treated as potentially large itemsets and used as candidates to be counted against the entire database.
Clustering
Clustering approaches:

Hierarchical: Agglomerative, Divisive
Partitional
Categorical
Large DB: Sampling, Compression
Hierarchical
A nested set of clusters is created. Each level in the hierarchy has a separate set of clusters.

Agglomerative: clusters are created in a bottom-up fashion.

Divisive: clusters are created in a top-down fashion.

A tree data structure called a dendrogram can be used to illustrate the hierarchical clustering and the sets of different clusters.
Similarity and Distance Measures
Centroid: Cm = sum(tmi) / N
Radius:   Rm = sqrt(sum((tmi - Cm)^2) / N)
Diameter: Dm = sqrt(sum((tmi - tmj)^2) / (N * (N - 1)))
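These three measures can be sketched for a small, made-up one-dimensional cluster:

```python
# Centroid, radius, and diameter of a cluster, per the formulas above.
import math

cluster = [1.0, 2.0, 3.0]   # illustration values
N = len(cluster)

centroid = sum(cluster) / N
radius = math.sqrt(sum((t - centroid) ** 2 for t in cluster) / N)
# Diameter averages squared pairwise differences over N*(N-1) ordered pairs
diameter = math.sqrt(sum((ti - tj) ** 2 for ti in cluster for tj in cluster)
                     / (N * (N - 1)))
print(centroid, round(radius, 4), round(diameter, 4))
```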
Methods to calculate the distance between clusters
Single link: smallest distance between an element in one cluster and an element in the other: dis(Ki, Kj) = min(dis(til, tjm)) over every til in Ki and every tjm in Kj.

Complete link: largest distance between an element in one cluster and an element in the other: dis(Ki, Kj) = max(dis(til, tjm)) over every til in Ki and every tjm in Kj.

Average: average distance between an element in one cluster and an element in the other: dis(Ki, Kj) = mean(dis(til, tjm)) over every til in Ki and every tjm in Kj.

Centroid: if clusters have representative centroids, the centroid distance is defined as the distance between the centroids: dis(Ki, Kj) = dis(Ci, Cj), where Ci is the centroid of Ki and Cj the centroid of Kj.

Medoid: using a medoid to represent each cluster, the distance between clusters is defined by the distance between the medoids: dis(Ki, Kj) = dis(mi, mj).
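The single, complete, and average link measures can be sketched as follows, using absolute difference as the element-level distance and made-up one-dimensional clusters:

```python
# Inter-cluster distance measures, per the definitions above.

def single_link(K1, K2):
    """Smallest distance between an element of K1 and an element of K2."""
    return min(abs(a - b) for a in K1 for b in K2)

def complete_link(K1, K2):
    """Largest distance between an element of K1 and an element of K2."""
    return max(abs(a - b) for a in K1 for b in K2)

def average_link(K1, K2):
    """Mean distance over all cross-cluster element pairs."""
    return sum(abs(a - b) for a in K1 for b in K2) / (len(K1) * len(K2))

K1, K2 = [1.0, 2.0], [4.0, 6.0]   # illustration clusters
print(single_link(K1, K2), complete_link(K1, K2), average_link(K1, K2))
```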
Hypothesis testing
Null hypothesis
Alternative hypothesis
Chi-square testing
Regression and correlation