Data Warehousing and Data Mining
What is a Data Warehouse?
A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process. --- W. H. Inmon
Collection of data that is used primarily in organizational decision making
A decision support database that is maintained separately from the organization's operational database
Data Warehouse - Subject Oriented
Subject oriented: oriented to the major subject areas of the corporation that have been defined in the data model. E.g. for an insurance company: customer, product, transaction or activity, policy, claim, account, etc.
Operational DBs and applications may be organized differently, e.g. based on type of insurance: auto, life, medical, fire, ...
Data Warehouse - Integrated
There is no consistency in encoding, naming conventions, etc., among different data sources
Heterogeneous data sources
When data is moved to the warehouse, it is converted.
Data Warehouse - Non-Volatile
Operational data is regularly accessed and manipulated a record at a time, and updates are applied to data in the operational environment.
Warehouse data is loaded and accessed. Updates of data do not occur in the data warehouse environment.
Data Warehouse - Time Variance
The time horizon for the data warehouse is significantly longer than that of operational systems.
Operational database: current-value data.
Data warehouse data: nothing more than a sophisticated series of snapshots, each taken at some moment in time.
The key structure of operational data may or may not contain some element of time. The key structure of the data warehouse always contains some element of time.
Why Separate Data Warehouse?
Performance
Special data organization, access methods, and implementation methods are needed to support the multidimensional views and operations typical of OLAP
Complex OLAP queries would degrade performance for operational transactions
Concurrency control and recovery modes of OLTP are not compatible with OLAP analysis
Why Separate Data Warehouse?
Function
Missing data: decision support requires historical data which operational DBs do not typically maintain
Data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources: operational DBs, external sources
Data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled.
Advantages of Warehousing
High query performance
Queries not visible outside warehouse
Local processing at sources unaffected
Can operate when sources unavailable
Can query data not stored in a DBMS
Extra information at warehouse: modify, summarize (store aggregates), add historical information
Advantages of Mediator Systems
No need to copy data: less storage, no need to purchase data
More up-to-date data
Query needs can be unknown
Only query interface needed at sources
May be less draining on sources
The Architecture of Data Warehousing
[Figure: operational databases and external data sources feed an extract/transform/load/refresh stage into the data warehouse with its metadata repository; the warehouse serves data marts and an OLAP server, which in turn supports OLAP, data mining and reports.]
Data Sources
Data sources are often the operational systems, providing the lowest level of data.
Data sources are designed for operational use, not for decision support, and the data reflect this fact.
Multiple data sources are often from different systems, run on a wide range of hardware, and much of the software is built in-house or highly customized.
Multiple data sources introduce a large number of issues -- semantic conflicts.
Creating and Maintaining a Warehouse
A data warehouse needs several tools that automate or support tasks such as:
Data extraction from different external data sources, operational databases, files of standard applications (e.g. Excel, COBOL applications), and other documents (Word, WWW)
Data cleaning (finding and resolving inconsistencies in the source data)
Integration and transformation of data (between different data formats, languages, etc.)
Creating and Maintaining a Warehouse
Data loading (loading the data into the data warehouse)
Data replication (replicating the source database into the data warehouse)
Data refreshment
Data archiving
Checking for data quality
Analyzing metadata
Physical Structure of Data Warehouse
There are three basic architectures for constructing a data warehouse:
Centralized
Federated
Tiered
The data warehouse is distributed for load balancing, scalability and higher availability.
Physical Structure of Data Warehouse
[Figure: centralized architecture - the sources feed a central data warehouse, which serves the clients directly.]
Physical Structure of Data Warehouse
LogicalData
Warehouse
Source Source
LocalData Marts
EndUsers
MarketingFinancialDistribution
Federated architecture
Physical Structure of Data Warehouse
PhysicalData
Warehouse
LocalData Marts
Workstations(higly summarizeddata)
Source Source
Tiered architecture
Physical Structure of Data Warehouse
Federated architecture
The logical data warehouse is only virtual
Tiered architecture
The central data warehouse is physical
There exist local data marts on different tiers which store copies or summarizations of the previous tier.
Conceptual Modeling of Data Warehouses
Three basic conceptual schemas:
Star schema
Snowflake schema
Fact constellations
Star schema
Star schema: A single object (fact table) in the middle connected to a number of dimension tables
Star schema
[Figure: fact table sale(orderId, date, custId, prodId, storeId, qty, amt) in the middle, linked to dimension tables customer(custId, name, address, city), product(prodId, name, price) and store(storeId, city).]
Star schema

customer: custId | name | address | city
53 | joe | 10 main | sfo
81 | fred | 12 main | sfo
111 | sally | 80 willow | la

product: prodId | name | price
p1 | bolt | 10
p2 | nut | 5

store: storeId | city
c1 | nyc
c2 | sfo
c3 | la

sale: orderId | date | custId | prodId | storeId | qty | amt
o100 | 1/7/97 | 53 | p1 | c1 | 1 | 12
o102 | 2/7/97 | 53 | p2 | c1 | 2 | 11
o105 | 3/8/97 | 111 | p1 | c3 | 5 | 50
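The sample star schema above can be exercised directly. A minimal sketch using an in-memory SQLite database (table and column names follow the slides; the query itself is a typical fact-to-dimension join and is not from the slides):

```python
import sqlite3

# Build the sample star schema from the slides in memory.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE sale     (orderId TEXT, date TEXT, custId INTEGER,
                       prodId TEXT, storeId TEXT, qty INTEGER, amt INTEGER);
INSERT INTO customer VALUES (53,'joe','10 main','sfo'),(81,'fred','12 main','sfo'),
                            (111,'sally','80 willow','la');
INSERT INTO product  VALUES ('p1','bolt',10),('p2','nut',5);
INSERT INTO store    VALUES ('c1','nyc'),('c2','sfo'),('c3','la');
INSERT INTO sale     VALUES ('o100','1/7/97',53,'p1','c1',1,12),
                            ('o102','2/7/97',53,'p2','c1',2,11),
                            ('o105','3/8/97',111,'p1','c3',5,50);
""")

# A typical star-schema query: join the fact table to a dimension table
# and aggregate the measure (amt) by a dimension attribute (store city).
rows = con.execute("""
    SELECT s.city, SUM(f.amt)
    FROM sale f JOIN store s ON f.storeId = s.storeId
    GROUP BY s.city ORDER BY s.city
""").fetchall()
print(rows)  # [('la', 50), ('nyc', 23)]
```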
Terms
Basic notion: a measure (e.g. sales, qty, etc.)
Given a collection of numeric measures, each measure depends on a set of dimensions (e.g. sales volume as a function of product, time, and location)
Terms
The relation which relates the dimensions to the measure of interest is called the fact table (e.g. sale)
Information about dimensions can be represented as a collection of relations called the dimension tables (product, customer, store)
Each dimension can have a set of associated attributes
Example of Star Schema
[Figure: Sales fact table with measurements unit_sales, dollar_sales, schilling_sales and dimensions Date(Date, Month, Year), Customer(CustId, CustName, CustCity, CustCountry), Product(ProductNo, ProdName, ProdDesc, Category, QOH) and Store(StoreID, City, State, Country, Region).]
Dimension Hierarchies
For each dimension, the set of associated attributes can be structured as a hierarchy, e.g.:
store -> sType; store -> city -> region
customer -> city -> state -> country
Dimension Hierarchies

store: storeId | cityId | tId | mgr
s5 | sfo | t1 | joe
s7 | sfo | t2 | fred
s9 | la | t1 | nancy

city: cityId | pop | regId
sfo | 1M | north
la | 5M | south

region: regId | name
north | cold region
south | warm region

sType: tId | size | location
t1 | small | downtown
t2 | large | suburbs
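A dimension hierarchy like store -> city -> region lets a measure recorded per store be rolled up to coarser levels. A minimal sketch using the mappings from the tables above; the per-store sales totals are hypothetical, for illustration only:

```python
from collections import defaultdict

# Hierarchy mappings taken from the store and city tables above.
store_city = {"s5": "sfo", "s7": "sfo", "s9": "la"}
city_region = {"sfo": "north", "la": "south"}

# Hypothetical per-store sales totals (not from the slides).
sales_by_store = {"s5": 100, "s7": 40, "s9": 60}

# Roll the measure up the hierarchy: store -> city -> region.
sales_by_region = defaultdict(int)
for store, amt in sales_by_store.items():
    sales_by_region[city_region[store_city[store]]] += amt

print(dict(sales_by_region))  # {'north': 140, 'south': 60}
```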
Snowflake Schema
Snowflake schema: A refinement of star schema where the dimensional hierarchy is represented explicitly by normalizing the dimension tables
Example of Snowflake Schema
[Figure: Sales fact table with measurements unit_sales, dollar_sales, schilling_sales; each dimension is normalized into a chain of tables, e.g. Product(ProductNo, ProdName, ProdDesc, Category, QOH); Cust(CustId, CustName, CustCity, CustCountry); Date(Date, Month) -> Month(Month, Year) -> Year(Year); Store(StoreID, City) -> City(City, State) -> State(State, Country) -> Country(Country, Region).]
Fact constellations
Fact constellations: Multiple fact tables share dimension tables
Database design methodology for data warehouses (1)
Nine-step methodology proposed by Kimball
Step | Activity
1 | Choosing the process
2 | Choosing the grain
3 | Identifying and conforming the dimensions
4 | Choosing the facts
5 | Storing the precalculations in the fact table
6 | Rounding out the dimension tables
7 | Choosing the duration of the database
8 | Tracking slowly changing dimensions
9 | Deciding the query priorities and the query modes
Database design methodology for data warehouses (2)
There are many approaches that offer alternative routes to the creation of a data warehouse.
Typical approach: decompose the design of the data warehouse into manageable parts, data marts. At a later stage, the integration of the smaller data marts leads to the creation of the enterprise-wide data warehouse.
The methodology specifies the steps required for the design of a data mart; however, it also ties together separate data marts so that over time they merge into a coherent overall data warehouse.
Step 1: Choosing the process
The process (function) refers to the subject matter of a particular data mart. The first data mart to be built should be the one that is most likely to be delivered on time, within budget, and to answer the most commercially important business questions.
The best choice for the first data mart tends to be the one that is related to sales
Step 2: Choosing the grain
Choosing the grain means deciding exactly what a fact table record represents. For example, the entity Sales may represent the facts about each property sale. Therefore, the grain of the Property_Sales fact table is an individual property sale.
Only when the grain for the fact table is chosen can we identify the dimensions of the fact table.
The grain decision for the fact table also determines the grain of each of the dimension tables. For example, if the grain for Property_Sales is an individual property sale, then the grain of the Client dimension is the detail of the client who bought a particular property.
Step 3: Identifying and conforming the dimensions
Dimensions set the context for formulating queries about the facts in the fact table.
We identify dimensions in sufficient detail to describe things such as clients and properties at the correct grain.
If any dimension occurs in two data marts, they must be exactly the same dimension, or one must be a subset of the other (this is the only way that two DMs can share one or more dimensions in the same application).
When a dimension is used in more than one DM, the dimension is referred to as being conformed.
Step 4: Choosing the facts
The grain of the fact table determines which facts can be used in the data mart: all facts must be expressed at the level implied by the grain.
In other words, if the grain of the fact table is an individual property sale, then all the numerical facts must refer to this particular sale (the facts should be numeric and additive).
Step 5: Storing pre-calculations in the fact table
Once the facts have been selected, each should be re-examined to determine whether there are opportunities to use pre-calculations.
Common example: a profit or loss statement. These types of facts are useful since they are additive quantities from which we can derive valuable information.
This is particularly true for a value that is fundamental to an enterprise, or if there is any chance of a user calculating the value incorrectly.
Step 6: Rounding out the dimension tables
In this step we return to the dimension tables and add as many text descriptions to the dimensions as possible.
The text descriptions should be as intuitive and understandable to the users as possible
Step 7: Choosing the duration of the data warehouse
The duration measures how far back in time the fact table goes.
For some companies (e.g. insurance companies) there may be a legal requirement to retain data extending back five or more years.
Very large fact tables raise at least two very significant data warehouse design issues:
The older the data, the more likely there will be problems in reading and interpreting the old files
It is mandatory that the old versions of the important dimensions be used, not the most current versions (we will discuss this issue later on)
Step 8: Tracking slowly changing dimensions
The changing dimension problem means that the proper description of the old client and the old branch must be used with the old data warehouse schema
Usually, the data warehouse must assign a generalized key to these important dimensions in order to distinguish multiple snapshots of clients and branches over a period of time
There are different types of changes in dimensions:
A dimension attribute is overwritten
A dimension attribute causes a new dimension record to be created
etc.
Step 9: Deciding the query priorities and the query modes
In this step we consider physical design issues:
The presence of pre-stored summaries and aggregates
Indices
Materialized views
Security issues
Backup issues
Archive issues
Database design methodology for data warehouses - summary
At the end of this methodology, we have a design for a data mart that supports the requirements of a particular business process and allows easy integration with other related data marts to ultimately form the enterprise-wide data warehouse.
A dimensional model, which contains more than one fact table sharing one or more conformed dimension tables, is referred to as a fact constellation.
Multidimensional Data Model
Sales of products may be represented in one dimension (as a fact relation) or in two dimensions, e.g.: clients and products
Multidimensional Data Model
Multidimensional Data Model
Fact relation:
sale: Product | Client | Amt
p1 | c1 | 12
p2 | c1 | 11
p1 | c3 | 50
p2 | c2 | 8

Two-dimensional cube:
   | c1 | c2 | c3
p1 | 12 |    | 50
p2 | 11 |  8 |
Multidimensional Data Model
Fact relation:
sale: Product | Client | Date | Amt
p1 | c1 | 1 | 12
p2 | c1 | 1 | 11
p1 | c3 | 1 | 50
p2 | c2 | 1 | 8
p1 | c1 | 2 | 44
p1 | c2 | 2 | 4

3-dimensional cube:
day 1:    | c1 | c2 | c3
       p1 | 12 |    | 50
       p2 | 11 |  8 |
day 2:    | c1 | c2 | c3
       p1 | 44 |  4 |
       p2 |    |    |
Multidimensional Data Model and Aggregates
Add up amounts for day 1. In SQL:
SELECT sum(Amt) FROM SALE WHERE Date = 1
Result: 81
Multidimensional Data Model and Aggregates
Add up amounts by day. In SQL:
SELECT Date, sum(Amt)
FROM SALE
GROUP BY Date
Result:
Date | sum
1 | 81
2 | 48
Multidimensional Data Model and Aggregates
Add up amounts by client and product. In SQL:
SELECT Client, Product, sum(Amt)
FROM SALE
GROUP BY Client, Product
Multidimensional Data Model and Aggregates
Result:
Product | Client | Sum
p1 | c1 | 56
p1 | c2 | 4
p1 | c3 | 50
p2 | c1 | 11
p2 | c2 | 8
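The SQL aggregations above can be reproduced in plain Python over the fact relation; a minimal sketch:

```python
from collections import defaultdict

# The fact relation as (Product, Client, Date, Amt) tuples.
sale = [("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
        ("p2", "c2", 1, 8),  ("p1", "c1", 2, 44), ("p1", "c2", 2, 4)]

# SELECT sum(Amt) FROM SALE WHERE Date = 1
total_day1 = sum(amt for _, _, d, amt in sale if d == 1)

# SELECT Client, Product, sum(Amt) FROM SALE GROUP BY Client, Product
by_cp = defaultdict(int)
for p, c, _, amt in sale:
    by_cp[(c, p)] += amt

print(total_day1)           # 81
print(by_cp[("c1", "p1")])  # 56
```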
Multidimensional Data Model and Aggregates
In the multidimensional data model, together with the measure values, we usually store summarizing information (aggregates):

    | c1 | c2 | c3 | Sum
p1  | 56 |  4 | 50 | 110
p2  | 11 |  8 |    |  19
Sum | 67 | 12 | 50 | 129
Aggregates
Operators: sum, count, max, min, median, avg
Having clause
Using the dimension hierarchy:
average by region (within store)
maximum by month (within date)
Cube Aggregation
Example: computing sums over the date dimension.

day 1:    | c1 | c2 | c3
       p1 | 12 |    | 50
       p2 | 11 |  8 |
day 2:    | c1 | c2 | c3
       p1 | 44 |  4 |
       p2 |    |    |

summed:   | c1 | c2 | c3
       p1 | 56 |  4 | 50
       p2 | 11 |  8 |
Grand total: 129
Cube Operators
Starting from the day 1 / day 2 slices above, '*' aggregates away a dimension:
sale(c1,*,*) - total sales to client c1
sale(*,p2,*) - total sales of product p2
sale(c2,p2,*) - total sales of product p2 to client c2
sale(*,*,*) - the grand total
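The '*' cube operators above can be sketched as a function that sums the measure over every coordinate left unspecified (a sketch, not an implementation from the slides):

```python
# The fact relation as (Product, Client, Date, Amt) tuples.
sale = [("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
        ("p2", "c2", 1, 8),  ("p1", "c1", 2, 44), ("p1", "c2", 2, 4)]

def cube(product=None, client=None, date=None):
    """Sum Amt over all tuples matching the non-None coordinates;
    None plays the role of '*' (aggregate away that dimension)."""
    return sum(amt for p, c, d, amt in sale
               if product in (None, p)
               and client in (None, c)
               and date in (None, d))

print(cube(client="c1"))   # sale(c1,*,*) = 12 + 11 + 44 = 67
print(cube(product="p2"))  # sale(*,p2,*) = 11 + 8 = 19
print(cube())              # sale(*,*,*)  = 129
```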
Aggregation Using Hierarchies
Hierarchy: customer -> region -> country
(customer c1 in Region A; customers c2, c3 in Region B)
Day 1 data rolled up by region:

    | region A | region B
p1  |    12    |    50
p2  |    11    |     8
Aggregation Using Hierarchies
Hierarchy: client -> city -> region
[Figure: a cube of sales by client (c1-c4), product (video, camera, CD) and date of sale, aggregated with respect to city:

            | video | camera | CD
New Orleans |  22   |   8    | 30
Poznań      |  23   |  18    | 22]
A Sample Data Cube
[Figure: a 3-D data cube of sales with dimensions Date (quarters 1Q-4Q), Product (video, camera, CD) and Country (USA, Canada, Mexico), with sum cells along each face of the cube.]
Exercise (1)
Suppose the AAA Automobile Co. builds a data warehouse to analyze sales of its cars. The measure is the price of a car.
We would like to answer the following typical queries:
find total sales by day, week, month and year
find total sales by week, month, ... for each dealer
find total sales by week, month, ... for each car model
find total sales by month for all dealers in a given city, region and state
Exercise (2)
Dimensions:
time (day, week, month, quarter, year)
dealer (name, city, state, region, phone)
cars (serialno, model, color, category, ...)
Design the conceptual data warehouse schema.
Data Warehouse Database
Different technological approaches to the data warehouse database are:
1. Parallel relational database designs that require a parallel computing platform
2. An innovative approach to speed up a traditional RDBMS by using new index structures to bypass relational table scans
3. Multidimensional databases designed to overcome any limitations placed on the warehouse by the nature of the relational data model
Sourcing, Acquisition, Cleanup and Transformation Tools
The functionality includes the following:
a. Removing unwanted data from operational databases
b. Converting to common data names and definitions
c. Calculating summaries and derived data
d. Establishing defaults for missing data
e. Accommodating source data definition changes
Issues in data sourcing, cleanup, extraction and transformation:
Database heterogeneity: DBMSs are very different in data models, data access languages, data navigation, operations, concurrency, integrity, recovery and so on
Data heterogeneity: the way data is defined and used in different models
Metadata
Metadata is data about data that describes the data warehouse.
Metadata can be classified into the following:
Technical metadata
Business metadata
Data warehouse operational information, such as data history, ownership, extract audit trail, usage data
Technical Metadata
Information about data sources
Transformation descriptions: the mapping method from the operational database into the warehouse, and the algorithms used to convert/enhance/transform data
Rules to perform data cleanup and data enhancement
Data structure definitions for data targets
Data-mapping operations when capturing data from source systems and applying it to the target warehouse database
Access authorisation, backup history, archive history, information delivery history, data acquisition history, data access and so on
Business Metadata
Subject areas and information object types, including queries, reports, images, video and/or audio clips
Internet home pages
Other information to support all data warehousing components. For example, the information related to the information delivery system should include subscription information, scheduling information, details of delivery destinations, and the business query objects such as predefined queries, reports and analyses.
The information directory and the entire metadata repository should have the following attributes:
Should be the gateway to the data warehouse environment, and thus should be accessible from any platform via transparent and seamless connections
The information directory components should be accessible by any browser and run on all major platforms
The data structures of the metadata repository should be supported on all major relational and object-oriented databases
Should support easy distribution and replication of its content for high performance and availability
Should be searchable by business-oriented key words
Should be able to define the content of structured and unstructured data
Should act as a launch platform for end-user data access and analysis tools
Should support the sharing of information objects
Should support a variety of scheduling options for requests against the data warehouse, including on-demand, one-time, repetitive, event-driven and conditional delivery
Should support and provide interfaces to other applications such as e-mail, spreadsheets and so on
Examples of metadata repositories include Microsoft Repository, R&O Rochade, Prism Solutions Directory Manager and CA/Platinum Technologies
Accessing and Visualizing Information
Effective data visualization provides the user with the following:
Capability to compare data
Capability to control scale
Capability to map the visualization back to the detail data that created it
Capability to filter data to look only at subsets of it
Tool Taxonomy
Data query and reporting tools Application Development tools Executive Information System tools Online analytical processing tools Data mining tools
Query and Reporting tools
Production reporting tools let companies generate regular operational reports
Report writers are inexpensive desktop tools designed for users
Managed query tools are designed for ease of use, with point-and-click visual navigation that either accepts SQL or generates SQL statements to query relational data stored in the warehouse.
Application Development tools
Organizations will often rely on the tried and proven approach of in-house application development, using graphical data access environments designed primarily for client/server environments.
OLAP tools
The OLAP tools can be classified as multidimensional (MOLAP), relational (ROLAP) and hybrid (HOLAP) tools. Some of the more popular OLAP tools are Microsoft Decision Support Services, MicroStrategy DSS Server, Oracle Express, MetaCube from Informix and so on.
Data mining tools
Discovering knowledge Segmentation Classification Association Preferencing Visualization
Data Marts
The data mart is directed at a partition of data that is created for the use of a dedicated group of users. A data mart is a set of denormalized, summarized or aggregated data.
Data Warehouse Administration and Management
Security and priority management
Monitoring updates from multiple sources
Data quality checks
Managing and updating metadata
Auditing and reporting data warehouse usage and status
Purging data
Replicating, subsetting and distributing data
Backup and recovery
Data Mining
Data Mining
The process of employing one or more computer learning techniques to automatically analyze and extract knowledge from data.
A Simple Data Mining Process Model
[Figure: SQL queries extract data from the operational database into the data warehouse; data mining is applied to the warehouse, and the resulting application output is followed by interpretation and evaluation.]
General Phases of the Data Mining Process
Problem definition
Creating a database for data mining
Exploring the database
Preparation for creating a data mining model
Building a data mining model
Evaluating the data mining model
Deploying the data mining model
Data Mining Tasks
The models used to solve a problem are classified as:
Predictive model: classification, regression, time series analysis, prediction
Descriptive model: clustering, summarization, association rules, sequence discovery
Data Mining Techniques
Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID) .
Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.
Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k >= 1). Sometimes called the k-nearest neighbor technique.
Rule induction: The extraction of useful if-then rules from data based on statistical significance.
Data Mining Issues
Human interaction
Overfitting
Outliers
Interpretation of results
Visualization of results
Large datasets
High dimensionality
Multimedia data
Missing data
Irrelevant data
Noisy data
Changing data
Integration
Application
Data Mining Metrics
Measuring the effectiveness or usefulness of data mining is called a data mining metric
It could be measured as an increase in sales or a reduction in advertising cost, i.e. as a return on investment (ROI)
The metrics used also include the traditional metrics of space and time, and, for example, similarity measures
Social Implications of Data Mining
Targeted advertising
Data mining applications can derive much demographic data concerning customers that was previously unknown or hidden in the data
Fraud detection, criminal suspects, prediction of terrorists
Data Mining from a Database Perspective
Scalability
Real-world data
Updates
Ease of use
Decision Tree
A tree structure where non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes.
Table 1.1 Hypothetical Training Data for Disease Diagnosis

Patient ID# | Sore Throat | Fever | Swollen Glands | Congestion | Headache | Diagnosis
1 | Yes | Yes | Yes | Yes | Yes | Strep throat
2 | No | No | No | Yes | Yes | Allergy
3 | Yes | Yes | No | Yes | No | Cold
4 | Yes | No | Yes | No | No | Strep throat
5 | No | Yes | No | Yes | No | Cold
6 | No | No | No | Yes | No | Allergy
7 | No | No | Yes | No | No | Strep throat
8 | Yes | No | No | Yes | Yes | Allergy
9 | No | Yes | No | Yes | Yes | Cold
10 | Yes | Yes | No | Yes | Yes | Cold
[Figure: decision tree for Table 1.1 - the root tests Swollen Glands: Yes -> Diagnosis = Strep Throat; No -> test Fever: Yes -> Diagnosis = Cold, No -> Diagnosis = Allergy.]
Table 1.2 Data Instances with an Unknown Classification

Patient ID# | Sore Throat | Fever | Swollen Glands | Congestion | Headache | Diagnosis
11 | No | No | Yes | Yes | Yes | ?
12 | Yes | Yes | No | No | Yes | ?
13 | No | No | No | No | Yes | ?
Production Rules
IF Swollen Glands = Yes
THEN Diagnosis = Strep Throat

IF Swollen Glands = No & Fever = Yes
THEN Diagnosis = Cold

IF Swollen Glands = No & Fever = No
THEN Diagnosis = Allergy
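The three production rules above can be written as a classification function; a minimal sketch applied to the unknown instances of Table 1.2:

```python
def diagnose(swollen_glands, fever):
    """Apply the production rules from the decision tree."""
    if swollen_glands:
        return "Strep Throat"
    return "Cold" if fever else "Allergy"

# The unknown instances of Table 1.2 (only the two tested attributes matter).
print(diagnose(swollen_glands=True,  fever=False))  # patient 11 -> Strep Throat
print(diagnose(swollen_glands=False, fever=True))   # patient 12 -> Cold
print(diagnose(swollen_glands=False, fever=False))  # patient 13 -> Allergy
```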
An Algorithm for Building Decision Trees
1. Let T be the set of training instances.
2. Choose an attribute that best differentiates the instances in T.
3. Create a tree node whose value is the chosen attribute.
   - Create child links from this node where each link represents a unique value for the chosen attribute.
   - Use the child link values to further subdivide the instances into subclasses.
4. For each subclass created in step 3:
   - If the instances in the subclass satisfy predefined criteria, or if the set of remaining attribute choices for this path is null, specify the classification for new instances following this decision path.
   - If the subclass does not satisfy the criteria and there is at least one attribute to further subdivide the path of the tree, let T be the current set of subclass instances and return to step 2.
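The steps above can be sketched as a short recursive procedure. This is a simplified illustration: the "best differentiating attribute" of step 2 is approximated here by counting how many pure subsets a split produces (not an information-theoretic measure), and the training rows are a reduced version of Table 1.1:

```python
from collections import Counter

def build(rows, attrs):
    """Recursively build a decision tree (steps 1-4 above, simplified)."""
    labels = [r["class"] for r in rows]
    if len(set(labels)) == 1 or not attrs:       # step 4 stopping rule
        return Counter(labels).most_common(1)[0][0]

    def purity(a):                               # crude stand-in for step 2
        groups = {}
        for r in rows:
            groups.setdefault(r[a], []).append(r["class"])
        return sum(len(set(g)) == 1 for g in groups.values())

    best = max(attrs, key=purity)                # step 2: best attribute
    node = {"attr": best, "children": {}}        # step 3: tree node + links
    for v in set(r[best] for r in rows):
        subset = [r for r in rows if r[best] == v]
        node["children"][v] = build(subset, [a for a in attrs if a != best])
    return node

rows = [{"glands": "Yes", "fever": "Yes", "class": "Strep"},
        {"glands": "No",  "fever": "No",  "class": "Allergy"},
        {"glands": "No",  "fever": "Yes", "class": "Cold"},
        {"glands": "Yes", "fever": "No",  "class": "Strep"}]
tree = build(rows, ["glands", "fever"])
print(tree["attr"])  # glands
```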
Generating Association Rules

Rule Confidence
Given a rule of the form "If A then B", rule confidence is the conditional probability that B is true when A is known to be true.
Rule Support
The minimum percentage of instances in the database that contain all items listed in a given association rule.
Mining Association Rules: An Example
Table 3.3 A Subset of the Credit Card Promotion Database

Magazine Promotion | Watch Promotion | Life Insurance Promotion | Credit Card Insurance | Sex
Yes | No | No | No | Male
Yes | Yes | Yes | No | Female
No | No | No | No | Male
Yes | Yes | Yes | Yes | Male
Yes | No | Yes | No | Female
No | No | No | No | Female
Yes | No | Yes | Yes | Male
No | Yes | No | No | Male
Yes | No | No | No | Male
Yes | Yes | Yes | No | Female
Table 3.4 Single-Item Sets

Single-Item Set | Number of Items
Magazine Promotion = Yes | 7
Watch Promotion = Yes | 4
Watch Promotion = No | 6
Life Insurance Promotion = Yes | 5
Life Insurance Promotion = No | 5
Credit Card Insurance = No | 8
Sex = Male | 6
Sex = Female | 4
Table 3.5 Two-Item Sets

Two-Item Set | Number of Items
Magazine Promotion = Yes & Watch Promotion = No | 4
Magazine Promotion = Yes & Life Insurance Promotion = Yes | 5
Magazine Promotion = Yes & Credit Card Insurance = No | 5
Magazine Promotion = Yes & Sex = Male | 4
Watch Promotion = No & Life Insurance Promotion = No | 4
Watch Promotion = No & Credit Card Insurance = No | 5
Watch Promotion = No & Sex = Male | 4
Life Insurance Promotion = No & Credit Card Insurance = No | 5
Life Insurance Promotion = No & Sex = Male | 4
Credit Card Insurance = No & Sex = Male | 4
Credit Card Insurance = No & Sex = Female | 4
Two Possible Two-Item Set Rules

IF Magazine Promotion = Yes
THEN Life Insurance Promotion = Yes (5/7)

IF Life Insurance Promotion = Yes
THEN Magazine Promotion = Yes (5/5)
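Support and confidence for the first rule above can be computed directly from Table 3.3; a minimal sketch using only the magazine and life-insurance columns:

```python
# (Magazine Promotion, Life Insurance Promotion) per row of Table 3.3.
rows = [("Yes", "No"), ("Yes", "Yes"), ("No", "No"), ("Yes", "Yes"),
        ("Yes", "Yes"), ("No", "No"), ("Yes", "Yes"), ("No", "No"),
        ("Yes", "No"), ("Yes", "Yes")]

# Rule: IF Magazine Promotion = Yes THEN Life Insurance Promotion = Yes
both = sum(1 for m, l in rows if m == "Yes" and l == "Yes")
antecedent = sum(1 for m, l in rows if m == "Yes")

support = both / len(rows)      # fraction of rows containing all rule items
confidence = both / antecedent  # P(consequent | antecedent) = 5/7
print(support, confidence)      # 0.5 0.7142857142857143
```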
Three-Item Set Rules

IF Watch Promotion = No & Life Insurance Promotion = No
THEN Credit Card Insurance = No (4/4)

IF Watch Promotion = No
THEN Life Insurance Promotion = No & Credit Card Insurance = No (4/6)
General Considerations
We are interested in association rules that show a lift in product sales, where the lift is the result of the product's association with one or more other products.
We are also interested in association rules that show a lower than expected confidence for a particular association.
Nearest Neighbour
Objects that are near each other will also have similar prediction values. Thus, if you know the prediction value of one of the objects, you can predict it for its nearest neighbours.
The K-Means Algorithm
1. Choose a value for K, the total number of clusters.
2. Randomly choose K points as cluster centers.
3. Assign the remaining instances to their closest cluster center.
4. Calculate a new cluster center for each cluster.
5. Repeat steps 3-4 until the cluster centers do not change.
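The five steps above can be sketched as follows, run on the six points of Table 3.6 (K = 2; the random seed and the handling of empty clusters are implementation choices not specified on the slides):

```python
import math
import random

# The six (X, Y) instances of Table 3.6.
points = [(1.0, 1.5), (1.0, 4.5), (2.0, 1.5), (2.0, 3.5), (3.0, 2.5), (5.0, 6.0)]

def kmeans(points, k, seed=1):
    random.seed(seed)
    centers = random.sample(points, k)          # step 2: random initial centers
    while True:
        clusters = [[] for _ in range(k)]
        for p in points:                        # step 3: assign to nearest center
            i = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[i].append(p)
        new_centers = []                        # step 4: recompute each center
        for i, cl in enumerate(clusters):
            if cl:
                new_centers.append((sum(x for x, _ in cl) / len(cl),
                                    sum(y for _, y in cl) / len(cl)))
            else:
                new_centers.append(centers[i])  # keep an empty cluster's center
        if new_centers == centers:              # step 5: repeat until stable
            return centers, clusters
        centers = new_centers

centers, clusters = kmeans(points, 2)
print([len(c) for c in clusters])
```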
Table 3.6 K-Means Input Values

Instance | X | Y
1 | 1.0 | 1.5
2 | 1.0 | 4.5
3 | 2.0 | 1.5
4 | 2.0 | 3.5
5 | 3.0 | 2.5
6 | 5.0 | 6.0
[Figure: plot of the six input instances.]
Table 3.7 Several Applications of the K-Means Algorithm (K = 2)

Outcome | Cluster Centers | Cluster Points | Squared Error
1 | (2.67, 4.67) | 2, 4, 6 | 14.50
  | (2.00, 1.83) | 1, 3, 5 |
2 | (1.5, 1.5) | 1, 3 | 15.94
  | (2.75, 4.125) | 2, 4, 5, 6 |
3 | (1.8, 2.7) | 1, 2, 3, 4, 5 | 9.60
  | (5, 6) | 6 |
General Considerations
Requires real-valued data.
We must select the number of clusters present in the data.
Works best when the clusters in the data are of approximately equal size.
Attribute significance cannot be determined.
Lacks explanation capabilities.
Bayesian Classification

ID | Income | Credit | Class | x(i)
1 | 4 | e | h1 | x4
2 | 3 | g | h1 | x7
3 | 2 | e | h1 | x2
4 | 3 | g | h1 | x7
5 | 4 | g | h1 | x8
6 | 2 | e | h1 | x2
7 | 3 | b | h2 | x11
8 | 2 | b | h2 | x10
9 | 3 | b | h3 | x11
10 | 1 | b | h4 | x9
11 | 2 | g | h2 | x6

P(h1|xi) = P(xi|h1) * P(h1) / sum over j of (P(xi|hj) * P(hj))

Let h1 = authorize purchase, h2 = authorize after identification, h3 = do not authorize, h4 = do not authorize and report to police.

Income groups: 1: 0-10000, 2: 10000-50000, 3: 50000-100000, 4: over 100000. Construct a table:

  | 1  | 2   | 3   | 4
e | x1 | x2  | x3  | x4
g | x5 | x6  | x7  | x8
b | x9 | x10 | x11 | x12

P(x7|h1) = 2/6, P(x4|h1) = 1/6, P(x2|h1) = 2/6, P(x8|h1) = 1/6

P(h1|x4) = P(x4|h1) * P(h1) / sum over j of (P(x4|hj) * P(hj)) = 1
Attribute | Value | Count (Short, Medium, Tall) | Prob (Short, Medium, Tall)
Gender | M | 1, 2, 3 | 1/4, 2/8, 3/3
Gender | F | 3, 6, 0 | 3/4, 6/8, 0/3
Height | 0-1.6 | 2, 0, 0 | 2/4, 0, 0
Height | 1.6-1.7 | 2, 0, 0 | 2/4, 0, 0
Height | 1.7-1.8 | 0, 4, 0 | 0, 4/8, 0
Height | 1.8-1.9 | 0, 3, 0 | 0, 3/8, 0
Height | 1.9-2.0 | 0, 1, 1 | 0, 1/8, 1/3
Height | over 2.0 | 0, 0, 2 | 0, 0, 2/3
P(t|short) = 1/4 * 0 = 0
P(t|medium) = 2/8 * 1/8 = 0.031
P(t|tall) = 3/3 * 1/3 = 0.333

Likelihood of being short = 0 * 0.267 = 0
Likelihood of being medium = 0.031 * 0.533 = 0.0166
Likelihood of being tall = 0.333 * 0.2 = 0.066
P(t) = 0 + 0.0166 + 0.066 = 0.0826
P(short|t) = 0 * 0.267 / 0.0826 = 0
P(medium|t) = 0.031 * 0.533 / 0.0826 = 0.2
P(tall|t) = 0.333 * 0.2 / 0.0826 = 0.799
The tuple t is classified as tall, since that posterior probability is the highest.
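The posterior computation above can be reproduced in a few lines; the likelihoods and priors are the values read from the table (the observed tuple t is a male with height in (1.9, 2.0]):

```python
# Class-conditional probabilities for t = (gender M, height in (1.9, 2.0]),
# read from the table above, and the class priors 4/15, 8/15, 3/15.
likelihood = {"short": 1/4 * 0,       # P(M|short)  * P(1.9-2.0|short)
              "medium": 2/8 * 1/8,    # P(M|medium) * P(1.9-2.0|medium)
              "tall": 3/3 * 1/3}      # P(M|tall)   * P(1.9-2.0|tall)
prior = {"short": 4/15, "medium": 8/15, "tall": 3/15}

# Bayes' rule: posterior = likelihood * prior / evidence.
joint = {c: likelihood[c] * prior[c] for c in prior}
evidence = sum(joint.values())                 # P(t)
posterior = {c: joint[c] / evidence for c in joint}

print(max(posterior, key=posterior.get))       # tall
print(round(posterior["tall"], 3))             # 0.8
```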
ID3 Algorithm
The concept used to quantify information is called entropy. Entropy is used to measure the amount of uncertainty or surprise or randomness in a set of data.
The basic strategy used by ID3 is to choose splitting attributes with the highest information gain first.
Given probabilities p1, p2, ..., ps, where sum(pi) = 1, entropy is defined as
H(p1, p2, ..., ps) = sum(pi * log(1/pi))
Gain(D, S) = H(D) - sum(P(Di) * H(Di))
Short: 4/15, medium: 8/15, tall: 3/15. The entropy of the starting set is
4/15 log(15/4) + 8/15 log(15/8) + 3/15 log(15/3) = 0.4384
Choosing gender as the splitting attribute, 9 instances are F and 6 are M.
The entropy of the subset that are F is
3/9 log(9/3) + 6/9 log(9/6) = 0.2764
The entropy of the subset that are M is
1/6 log(6/1) + 2/6 log(6/2) + 3/6 log(6/3) = 0.4392
The ID3 algorithm must determine the information gain obtained by using this split. Calculate the weighted sum of these last two entropies to get
9/15 * 0.2764 + 6/15 * 0.4392 = 0.34152
The gain in entropy by using the gender attribute is thus
0.4384 - 0.34152 = 0.09688
Looking at the height attribute, we divide it into the ranges
(0,1.6], (1.6,1.7], (1.7,1.8], (1.8,1.9], (1.9,2.0], (2.0,inf)
(0,1.6] -> 2/2 log(2/2) + 0 + 0 = 0, (1.6,1.7] -> 0, ..., (1.9,2.0] -> 0 + 1/2 log(2) + 1/2 log(2) = 0.301
The gain in entropy by using the height attribute is
0.4384 - 2/15 * 0.301 = 0.3983
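The entropy and gain figures above can be checked in code. Note the slides use logarithms base 10; a minimal sketch:

```python
from math import log10

def entropy(probs):
    """H(p1..ps) = sum(pi * log(1/pi)), base-10 logs as on the slides."""
    return sum(p * log10(1 / p) for p in probs if p > 0)

h_all = entropy([4/15, 8/15, 3/15])   # whole set: 4 short, 8 medium, 3 tall
h_f   = entropy([3/9, 6/9])           # 9 females: 3 short, 6 medium
h_m   = entropy([1/6, 2/6, 3/6])      # 6 males: 1 short, 2 medium, 3 tall

# Gain = entropy of the whole set minus the weighted subset entropies.
gain_gender = h_all - (9/15 * h_f + 6/15 * h_m)
print(round(h_all, 4), round(gain_gender, 4))  # 0.4385 0.0969
```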
C4.5 or C5.0

GainRatio(D, S) = Gain(D, S) / H(|D1|/|D|, ..., |Ds|/|D|)

To calculate the GainRatio for the gender split, we first find the entropy associated with the split, ignoring classes:

H(9/15, 6/15) = 9/15 log(15/9) + 6/15 log(15/6) = 0.292

This gives a GainRatio value for the gender attribute of

0.09688 / 0.292 = 0.332

The entropy of the split on height is

H(2/15, 2/15, 3/15, 4/15, 2/15) = 2/15 log(15/2) + 2/15 log(15/2) + 3/15 log(15/3) + 4/15 log(15/4) + 2/15 log(15/2) = 0.1166*3 + 0.1397 + 0.15307 = 0.64257

Using the height gain computed earlier (0.3983), the GainRatio value for the height attribute is

0.3983 / 0.64257 = 0.6198

so height again has the larger value.
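The GainRatio for the gender split can be checked the same way (base-10 logarithms, gain value taken from the ID3 example above):

```python
# Verify the C4.5 GainRatio figure for the gender split.
import math

def entropy(probs):
    return sum(p * math.log10(1/p) for p in probs if p > 0)

gain_gender = 0.09688               # gain from the ID3 example above
split_info = entropy([9/15, 6/15])  # entropy of the split, ignoring classes
gain_ratio = gain_gender / split_info
print(round(split_info, 3), round(gain_ratio, 3))   # ~0.292, ~0.33
```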
Neural Network

To solve a classification problem using a neural network:

Determine the number of output nodes and the attributes to be used as input
Determine the labels and functions to be used for the graph
Evaluate each tuple by filtering it through the structure of the network
For each tuple ti in D, propagate ti through the network and classify the tuple
Various issues in neural network classification are:

Deciding the attributes to be used as splitting attributes
Determining the number of hidden nodes
Determining the number of hidden layers and the best number of hidden nodes per hidden layer
Determining the number of sinks
Deciding the interconnectivity of the nodes
Choosing the activation functions
Propagation in Neural Network

The output of each node i in the network is based on an activation function fi. Applied to inputs {x1i, x2i, ..., xki} with weights {w1i, w2i, ..., wki}, the weighted sum of the inputs is

S = sum(whi * xhi), h = 1 to k

Propagation algorithm:

for each node i in the input layer do
    output xi on each output arc from i;
for each hidden layer do
    for each node i do
        Si = sum(wji * xji);
        for each output arc from i do
            output (1 - e^-Si) / (1 + e^-Si);
for each node i in the output layer do
    Si = sum(wji * xji);
    output 1 / (1 + e^-cSi);
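A minimal Python sketch of this propagation step, with a bipolar sigmoid in the hidden layer and a logistic sigmoid at the output as in the pseudocode. The weights and inputs are made-up illustration values, not taken from the text:

```python
# Feedforward propagation sketch matching the pseudocode above.
import math

def bipolar_sigmoid(s):
    """(1 - e^-s) / (1 + e^-s): hidden-layer activation."""
    return (1 - math.exp(-s)) / (1 + math.exp(-s))

def logistic(s, c=1.0):
    """1 / (1 + e^-cs): output-layer activation."""
    return 1 / (1 + math.exp(-c * s))

def propagate(x, hidden_w, output_w):
    hidden = [bipolar_sigmoid(sum(w * xi for w, xi in zip(ws, x)))
              for ws in hidden_w]
    return [logistic(sum(w * h for w, h in zip(ws, hidden)))
            for ws in output_w]

# Two inputs, two hidden nodes, one output node (illustration weights)
out = propagate([1.0, 0.5],
                hidden_w=[[0.2, -0.4], [0.7, 0.1]],
                output_w=[[0.5, -0.3]])
print(out)
```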
Radial basis function network
A function whose value depends on the distance from a central point is known as a radial function. A typical choice is

fi(S) = e^(-S^2 / v)
Perceptron
The simplest type of neural network is the perceptron: a single node with weighted inputs and one output.

The perceptron typically uses a threshold or sigmoidal activation function.
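A minimal perceptron sketch with a threshold activation. The weights and bias are illustration values chosen to mimic an AND gate; they are not from the text:

```python
# Single perceptron node: weighted sum followed by a threshold activation.
def perceptron(x, w, bias):
    s = sum(wi * xi for wi, xi in zip(w, x)) + bias
    return 1 if s > 0 else 0

# With these illustration weights the node behaves like logical AND:
print(perceptron([1, 1], w=[0.5, 0.5], bias=-0.7))  # 1
print(perceptron([1, 0], w=[0.5, 0.5], bias=-0.7))  # 0
```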
Association Rules
Let I = {I1, I2, ..., Im} be a set of items and let {t1, t2, ..., tn} be a database of transactions, where ti = {Ii1, Ii2, ..., Iik} and each Iij belongs to I. An association rule is an implication of the form X => Y, where X and Y are itemsets contained in I and X intersection Y is empty.
Basic concepts of Association Rule
Support: The support of an association rule X => Y is the percentage of transactions in the database that contain X U Y.

Confidence: The confidence of an association rule X => Y is the ratio of the number of transactions that contain X U Y to the number of transactions that contain X.

Large itemset: A large itemset is an itemset whose number of occurrences is above a threshold (the minimum support). L represents the complete set of large itemsets and l an individual large itemset. Itemsets that are counted against the data set are called candidates, and the collection of all of them is known as the candidate itemset.
Apriori Algorithm

The Apriori algorithm is an association rule algorithm that finds the large itemsets in a given data set.
Transaction  Items
T1           Bread, Jam, Butter
T2           Bread, Butter
T3           Bread, Cold-drink, Butter
T4           Milk, Bread
T5           Milk, Cold-drink
Candidates and Large Itemset using Apriori
Scan 1
Candidates: {Milk}, {Bread}, {Jam}, {Cold-drink}, {Butter}
Large itemsets: {Milk}, {Bread}, {Cold-drink}, {Butter}

Scan 2
Candidates: {Milk, Bread}, {Milk, Cold-drink}, {Milk, Butter}, {Bread, Cold-drink}, {Bread, Butter}, {Cold-drink, Butter}
Large itemsets: {Bread, Butter}
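The two scans above can be reproduced with a small Apriori sketch over the five transactions, assuming a minimum support of 2 transactions (40%):

```python
# Apriori sketch: two scans over the example transactions.
from itertools import combinations

transactions = [
    {"Bread", "Jam", "Butter"},
    {"Bread", "Butter"},
    {"Bread", "Cold-drink", "Butter"},
    {"Milk", "Bread"},
    {"Milk", "Cold-drink"},
]
min_support = 2  # absolute count; 2/5 = 40%

def count(itemset):
    """Number of transactions containing the itemset."""
    return sum(itemset <= t for t in transactions)

# Scan 1: count single items, keep the large (frequent) ones
items = sorted({i for t in transactions for i in t})
L1 = [frozenset([i]) for i in items if count(frozenset([i])) >= min_support]

# Scan 2: candidate pairs built only from large single items
large_items = sorted({i for s in L1 for i in s})
C2 = [frozenset(p) for p in combinations(large_items, 2)]
L2 = [c for c in C2 if count(c) >= min_support]
print([set(s) for s in L2])   # only {Bread, Butter} survives
```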
Sampling Algorithm
To avoid counting itemsets over a large data set in every scan, the sampling algorithm can be used. It reduces the number of data set scans to one in the best case and two in the worst case. The algorithm finds the large itemsets for a sample of the data set, as Apriori does; these are treated as potentially large itemsets and used as candidates to be counted against the entire database.
Clustering
Clustering approaches:

Hierarchical: Agglomerative, Divisive
Partitional
Categorical
Large DB: Sampling, Compression
Hierarchical
A nested set of clusters is created. Each level in the hierarchy has a separate set of clusters.

Agglomerative: clusters are created in a bottom-up fashion.

Divisive: clusters are created in a top-down fashion.

A tree data structure called a dendrogram can be used to illustrate the hierarchical clustering and the sets of different clusters.
Similarity and Distance Measures
Centroid: Cm = sum(tmi) / N
Radius:   Rm = sqrt(sum((tmi - Cm)^2) / N)
Diameter: Dm = sqrt(sum((tmi - tmj)^2) / (N * (N - 1)))
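These three measures can be sketched for a small, made-up one-dimensional cluster:

```python
# Centroid, radius, and diameter of a cluster, per the formulas above.
import math

cluster = [1.0, 2.0, 3.0]   # illustration values
N = len(cluster)

centroid = sum(cluster) / N
radius = math.sqrt(sum((t - centroid) ** 2 for t in cluster) / N)
# Diameter averages squared pairwise differences over N*(N-1) ordered pairs
diameter = math.sqrt(sum((ti - tj) ** 2 for ti in cluster for tj in cluster)
                     / (N * (N - 1)))
print(centroid, round(radius, 4), round(diameter, 4))
```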
Methods to calculate the distance between clusters
Single link: smallest distance between an element in one cluster and an element in the other: dis(Ki, Kj) = min(dis(til, tjm)) over every til in Ki and every tjm in Kj.

Complete link: largest distance between an element in one cluster and an element in the other: dis(Ki, Kj) = max(dis(til, tjm)) over every til in Ki and every tjm in Kj.

Average: average distance between an element in one cluster and an element in the other: dis(Ki, Kj) = mean(dis(til, tjm)) over every til in Ki and every tjm in Kj.

Centroid: if clusters have representative centroids, the centroid distance is defined as the distance between the centroids: dis(Ki, Kj) = dis(Ci, Cj), where Ci is the centroid of Ki and Cj the centroid of Kj.

Medoid: using a medoid to represent each cluster, the distance between clusters is defined by the distance between the medoids: dis(Ki, Kj) = dis(mi, mj).
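The single, complete, and average link measures can be sketched as follows, using absolute difference as the element-level distance and made-up one-dimensional clusters:

```python
# Inter-cluster distance measures, per the definitions above.

def single_link(K1, K2):
    """Smallest distance between an element of K1 and an element of K2."""
    return min(abs(a - b) for a in K1 for b in K2)

def complete_link(K1, K2):
    """Largest distance between an element of K1 and an element of K2."""
    return max(abs(a - b) for a in K1 for b in K2)

def average_link(K1, K2):
    """Mean distance over all cross-cluster element pairs."""
    return sum(abs(a - b) for a in K1 for b in K2) / (len(K1) * len(K2))

K1, K2 = [1.0, 2.0], [4.0, 6.0]   # illustration clusters
print(single_link(K1, K2), complete_link(K1, K2), average_link(K1, K2))
```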
Hypothesis testing
Null hypothesis
Alternative hypothesis
Chi-square testing
Regression and correlation