Data Warehousing and Data Mining


  • Data Warehousing and Data Mining

  • What is a Data Warehouse?

    "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." --- W. H. Inmon

    A collection of data that is used primarily in organizational decision making

    A decision support database that is maintained separately from the organization's operational database

  • Data Warehouse - Subject Oriented

    Subject oriented: oriented to the major subject areas of the corporation that have been defined in the data model. E.g. for an insurance company: customer, product, transaction or activity, policy, claim, account, etc.

    Operational DBs and applications may be organized differently, e.g. based on type of insurance: auto, life, medical, fire, ...

  • Data Warehouse - Integrated

    There is no consistency in encoding, naming conventions, etc. among different data sources

    Heterogeneous data sources: when data is moved to the warehouse, it is converted.

  • Data Warehouse - Non-Volatile

    Operational data is regularly accessed and manipulated a record at a time, and updates are applied to data in the operational environment.

    Warehouse data is loaded and accessed. Updates of data do not occur in the data warehouse environment.

  • Data Warehouse - Time Variance

    The time horizon for the data warehouse is significantly longer than that of operational systems.

    Operational database: current value data. Data warehouse data: nothing more than a sophisticated series of snapshots, each taken as of some moment in time.

    The key structure of operational data may or may not contain some element of time. The key structure of the data warehouse always contains some element of time.

  • Why Separate Data Warehouse?

    Performance: special data organization, access methods, and implementation methods are needed to support the multidimensional views and operations typical of OLAP

    Complex OLAP queries would degrade performance for operational transactions

    Concurrency control and recovery modes of OLTP are not compatible with OLAP analysis

  • Why Separate Data Warehouse?

    Function:

    missing data: decision support requires historical data, which operational DBs do not typically maintain

    data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources: operational DBs, external sources

    data quality: different sources typically use inconsistent data representations, codes and formats, which have to be reconciled.

  • Advantages of Warehousing

    High query performance
    Queries not visible outside warehouse
    Local processing at sources unaffected
    Can operate when sources unavailable
    Can query data not stored in a DBMS
    Extra information at warehouse: modify, summarize (store aggregates), add historical information

  • Advantages of Mediator Systems

    No need to copy data: less storage, no need to purchase data
    More up-to-date data
    Query needs can be unknown
    Only query interface needed at sources
    May be less draining on sources

  • The Architecture of Data Warehousing

    (Diagram: operational databases and external data sources feed an Extract / Transform / Load / Refresh layer into the data warehouse with its metadata repository; the warehouse serves data marts and an OLAP server, which in turn supports OLAP, data mining and reporting tools.)

  • Data Sources

    Data sources are often the operational systems, providing the lowest level of data.

    Data sources are designed for operational use, not for decision support, and the data reflect this fact.

    Multiple data sources are often from different systems, run on a wide range of hardware, and much of the software is built in-house or highly customized.

    Multiple data sources introduce a large number of issues -- semantic conflicts.

  • Creating and Maintaining a Warehouse

    A data warehouse needs several tools that automate or support tasks such as:

    Data extraction from different external data sources, operational databases, files of standard applications (e.g. Excel, COBOL applications), and other documents (Word, WWW)

    Data cleaning (finding and resolving inconsistency in the source data)

    Integration and transformation of data (between different data formats, languages, etc.)

  • Creating and Maintaining a Warehouse

    Data loading (loading the data into the data warehouse)
    Data replication (replicating source databases into the data warehouse)
    Data refreshment
    Data archiving
    Checking for data quality
    Analyzing metadata

  • Physical Structure of Data Warehouse

    There are three basic architectures for constructing a data warehouse:

    Centralized
    Federated
    Tiered

    The data warehouse is distributed for: load balancing, scalability and higher availability

  • Physical Structure of Data Warehouse

    (Diagram: Centralized architecture -- the sources feed a single central data warehouse, which serves all clients.)

  • Physical Structure of Data Warehouse

    (Diagram: Federated architecture -- the sources feed a logical data warehouse, which is materialized as local data marts (marketing, financial, distribution) serving the end users.)

  • Physical Structure of Data Warehouse

    (Diagram: Tiered architecture -- the sources feed a physical data warehouse; local data marts form a middle tier, and workstations hold highly summarized data.)

  • Physical Structure of Data Warehouse

    Federated architecture: the logical data warehouse is only virtual

    Tiered architecture: the central data warehouse is physical; there exist local data marts on different tiers which store copies or summarizations of the previous tier.

  • Conceptual Modeling of Data Warehouses

    Three basic conceptual schemas:

    Star schema
    Snowflake schema
    Fact constellations

  • Star schema

    Star schema: A single object (fact table) in the middle connected to a number of dimension tables

  • Star schema

    Fact table: sale(orderId, date, custId, prodId, storeId, qty, amt)

    Dimension tables:
    customer(custId, name, address, city)
    product(prodId, name, price)
    store(storeId, city)

  • Star schema

    customer: (custId, name, address, city)
    53, joe, 10 main, sfo
    81, fred, 12 main, sfo
    111, sally, 80 willow, la

    product: (prodId, name, price)
    p1, bolt, 10
    p2, nut, 5

    store: (storeId, city)
    c1, nyc
    c2, sfo
    c3, la

    sale: (orderId, date, custId, prodId, storeId, qty, amt)
    o100, 1/7/97, 53, p1, c1, 1, 12
    o102, 2/7/97, 53, p2, c1, 2, 11
    o105, 3/8/97, 111, p1, c3, 5, 50

  • Terms

    Basic notion: a measure (e.g. sales, qty, etc.)

    Given a collection of numeric measures, each measure depends on a set of dimensions (e.g. sales volume as a function of product, time, and location)

  • Terms

    The relation which relates the dimensions to the measure of interest is called the fact table (e.g. sale)

    Information about dimensions can be represented as a collection of relations called the dimension tables (product, customer, store)

    Each dimension can have a set of associated attributes

  • Example of Star Schema

    (Diagram: the Sales fact table (keys Date, Product, Store, Customer; measurements unit_sales, dollar_sales, schilling_sales) connected to four dimension tables: Date(Date, Month, Year), Customer(CustId, CustName, CustCity, CustCountry), Product(ProductNo, ProdName, ProdDesc, Category, QOH), Store(StoreID, City, State, Country, Region).)

  • Dimension Hierarchies

    For each dimension, the set of associated attributes can be structured as a hierarchy:

    store -> sType
    store -> city -> region
    customer -> city -> state -> country

  • Dimension Hierarchies

    store: (storeId, cityId, tId, mgr)
    s5, sfo, t1, joe
    s7, sfo, t2, fred
    s9, la, t1, nancy

    city: (cityId, pop, regId)
    sfo, 1M, north
    la, 5M, south

    region: (regId, name)
    north, cold region
    south, warm region

    sType: (tId, size, location)
    t1, small, downtown
    t2, large, suburbs

  • Snowflake Schema

    Snowflake schema: A refinement of star schema where the dimensional hierarchy is represented explicitly by normalizing the dimension tables

  • Example of Snowflake Schema

    (Diagram: the Sales fact table (keys Date, Product, Store, Customer; measurements unit_sales, dollar_sales, schilling_sales) with normalized dimension chains: Store(StoreID, City) -> City(City, State) -> State(State, Country) -> Country(Country, Region); Date(Date, Month) -> Month(Month, Year) -> Year(Year); Product(ProductNo, ProdName, ProdDesc, Category, QOH); Cust(CustId, CustName, CustCity, CustCountry).)

  • Fact constellations

    Fact constellations: Multiple fact tables share dimension tables

  • Database design methodology for data warehouses (1)

    Nine-step methodology proposed by Kimball:

    1. Choosing the process
    2. Choosing the grain
    3. Identifying and conforming the dimensions
    4. Choosing the facts
    5. Storing the precalculations in the fact table
    6. Rounding out the dimension tables
    7. Choosing the duration of the database
    8. Tracking slowly changing dimensions
    9. Deciding the query priorities and the query modes

  • Database design methodology for data warehouses (2)

    There are many approaches that offer alternative routes to the creation of a data warehouse

    Typical approach: decompose the design of the data warehouse into manageable parts, i.e. data marts. At a later stage, the integration of the smaller data marts leads to the creation of the enterprise-wide data warehouse.

    The methodology specifies the steps required for the design of a data mart; however, the methodology also ties together separate data marts so that over time they merge into a coherent overall data warehouse.

  • Step 1: Choosing the process

    The process (function) refers to the subject matter of a particular data mart. The first data mart to be built should be the one that is most likely to be delivered on time, within budget, and to answer the most commercially important business questions.

    The best choice for the first data mart tends to be the one that is related to sales

  • Step 2: Choosing the grain

    Choosing the grain means deciding exactly what a fact table record represents. For example, the entity Sales may represent the facts about each property sale. Therefore, the grain of the Property_Sales fact table is an individual property sale.

    Only when the grain for the fact table is chosen can we identify the dimensions of the fact table.

    The grain decision for the fact table also determines the grain of each of the dimension tables. For example, if the grain for Property_Sales is an individual property sale, then the grain of the Client dimension is the detail of the client who bought a particular property.

  • Step 3: Identifying and conforming the dimensions

    Dimensions set the context for formulating queries about the facts in the fact table.

    We identify dimensions in sufficient detail to describe things such as clients and properties at the correct grain.

    If any dimension occurs in two data marts, they must be exactly the same dimension, or one must be a subset of the other (this is the only way that two data marts can share one or more dimensions in the same application).

    When a dimension is used in more than one data mart, the dimension is referred to as being conformed.

  • Step 4: Choosing the facts

    The grain of the fact table determines which facts can be used in the data mart: all facts must be expressed at the level implied by the grain.

    In other words, if the grain of the fact table is an individual property sale, then all the numerical facts must refer to this particular sale (the facts should be numeric and additive).

  • Step 5: Storing pre-calculations in the fact table

    Once the facts have been selected, each should be re-examined to determine whether there are opportunities to use pre-calculations.

    Common example: a profit or loss statement. These types of facts are useful since they are additive quantities, from which we can derive valuable information.

    This is particularly true for a value that is fundamental to an enterprise, or if there is any chance of a user calculating the value incorrectly.

  • Step 6: Rounding out the dimension tables

    In this step we return to the dimension tables and add as many text descriptions to the dimensions as possible.

    The text descriptions should be as intuitive and understandable to the users as possible

  • Step 7: Choosing the duration of the data warehouse

    The duration measures how far back in time the fact table goes.

    For some companies (e.g. insurance companies) there may be a legal requirement to retain data extending back five or more years.

    Very large fact tables raise at least two very significant data warehouse design issues:

    The older the data, the more likely there will be problems in reading and interpreting the old files

    It is mandatory that the old versions of the important dimensions be used, not the most current versions (we will discuss this issue later on)

  • Step 8: Tracking slowly changing dimensions

    The changing dimension problem means that the proper description of the old client and the old branch must be used with the old data warehouse schema

    Usually, the data warehouse must assign a generalized key to these important dimensions in order to distinguish multiple snapshots of clients and branches over a period of time

    There are different types of changes in dimensions:

    A dimension attribute is overwritten
    A dimension attribute causes a new dimension record to be created
    etc.

  • Step 9: Deciding the query priorities and the query modes

    In this step we consider physical design issues:

    The presence of pre-stored summaries and aggregates
    Indices
    Materialized views
    Security issues
    Backup issues
    Archive issues

  • Database design methodology for data warehouses - summary

    At the end of this methodology, we have a design for a data mart that supports the requirements of a particular business process and allows easy integration with other related data marts to ultimately form the enterprise-wide data warehouse.

    A dimensional model which contains more than one fact table sharing one or more conformed dimension tables is referred to as a fact constellation.

  • Multidimensional Data Model

    Sales of products may be represented in one dimension (as a fact relation) or in two dimensions, e.g.: clients and products

  • Multidimensional Data Model

    Fact relation sale(Product, Client, Amt):

    p1, c1, 12
    p2, c1, 11
    p1, c3, 50
    p2, c2, 8

    The same data as a two-dimensional cube:

         c1  c2  c3
    p1   12      50
    p2   11   8

  • Multidimensional Data Model

    Fact relation sale(Product, Client, Date, Amt):

    p1, c1, 1, 12
    p2, c1, 1, 11
    p1, c3, 1, 50
    p2, c2, 1, 8
    p1, c1, 2, 44
    p1, c2, 2, 4

    The same data as a 3-dimensional cube:

    day 1:     c1  c2  c3
        p1     12      50
        p2     11   8

    day 2:     c1  c2  c3
        p1     44   4
        p2

  • Multidimensional Data Model and Aggregates

    Add up amounts for day 1. In SQL:

    SELECT sum(Amt) FROM SALE WHERE Date = 1

    Result: 81

  • Multidimensional Data Model and Aggregates

    Add up amounts by day. In SQL:

    SELECT Date, sum(Amt) FROM SALE GROUP BY Date

    Result:

    Date  sum
    1     81
    2     48

  • Multidimensional Data Model and Aggregates

    Add up amounts by client and product. In SQL:

    SELECT Client, Product, sum(Amt) FROM SALE GROUP BY Client, Product

  • Multidimensional Data Model and Aggregates

    Result:

    Product  Client  Sum
    p1       c1      56
    p1       c2       4
    p1       c3      50
    p2       c1      11
    p2       c2       8

  • Multidimensional Data Model and Aggregates

    In the multidimensional data model, together with the measure values we usually store summarizing information (aggregates):

          c1  c2  c3  Sum
    p1    56   4  50  110
    p2    11   8       19
    Sum   67  12  50  129
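    A minimal sketch of how these aggregates can be computed from the fact relation above in plain Python (the cross-tab cells and the Sum row/column match the table; the variable names are my own):

    from collections import defaultdict

    # fact relation: (product, client, date, amt)
    sales = [
        ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
        ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
    ]

    cell = defaultdict(int)        # (product, client) -> sum of amt
    by_product = defaultdict(int)  # row sums
    by_client = defaultdict(int)   # column sums
    total = 0
    for product, client, _day, amt in sales:
        cell[(product, client)] += amt
        by_product[product] += amt
        by_client[client] += amt
        total += amt

    print(cell[("p1", "c1")], by_product["p1"], by_client["c1"], total)
    # -> 56 110 67 129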

  • Aggregates

    Operators: sum, count, max, min, median, avg
    Having clause
    Using the dimension hierarchy:
    average by region (within store)
    maximum by month (within date)

  • Cube Aggregation

    Example: computing sums over the day dimension collapses the two day slices into one table:

    day 1:     c1  c2  c3        day 2:     c1  c2  c3
        p1     12      50            p1     44   4
        p2     11   8                p2

    sums:      c1  c2  c3
        p1     56   4  50
        p2     11   8

  • Cube Operators

    The same aggregate cells can be addressed with a cube operator notation, where "*" ranges over a whole dimension:

    sale(c1,*,*)  = 67   (total for client c1)
    sale(c2,p2,*) = 8    (total for client c2 and product p2)
    sale(*,*,*)   = 129  (grand total)

  • Cube

    (Diagram: the day 1 / day 2 slices extended with a "*" plane that holds the aggregated values, e.g. sale(*,p2,*).)

  • Aggregation Using Hierarchies

    Using the dimension hierarchy customer -> region -> country, the cube can be aggregated by region (customer c1 is in region A; customers c2 and c3 are in region B):

          region A  region B
    p1    12        50
    p2    11         8

  • Aggregation Using Hierarchies

    (Diagram: a cube of sales by client (c1..c4), product (video, camera, CD) and date of sale, for clients in New Orleans and Poznań.) Aggregation with respect to city, using the hierarchy client -> city -> region, gives:

          video  camera  CD
    NO    22     8       30
    PN    23     18      22

  • A Sample Data Cube

    (Diagram: a data cube with dimensions Date (1Q, 2Q, 3Q, 4Q), Product (CD, video, camera) and Country (USA, Canada, Mexico), extended with "sum" planes along each dimension.)

  • Exercise (1)

    Suppose the AAA Automobile Co. builds a data warehouse to analyze sales of its cars.

    The measure: price of a car.

    We would like to answer the following typical queries:

    find total sales by day, week, month and year
    find total sales by week, month, ... for each dealer
    find total sales by week, month, ... for each car model
    find total sales by month for all dealers in a given city, region and state.

  • Exercise (2)

    Dimensions:

    time (day, week, month, quarter, year)
    dealer (name, city, state, region, phone)
    cars (serialno, model, color, category, ...)

    Design the conceptual data warehouse schema

  • Data Warehouse Database

    Different technological approaches to the data warehouse database are:

    1. Parallel relational database designs that require a parallel computing platform
    2. An innovative approach to speed up a traditional RDBMS by using new index structures to bypass relational table scans
    3. Multidimensional databases, designed to overcome any limitations placed on the warehouse by the nature of the relational data model

  • Sourcing, Acquisition, Cleanup and Transformation Tools

    The functionality includes the following:

    a. Removing unwanted data from operational databases
    b. Converting to common data names and definitions
    c. Calculating summaries and derived data
    d. Establishing defaults for missing data
    e. Accommodating source data definition changes

  • Issues on Data Sourcing, Cleanup, Extract, Transformation

    Database heterogeneity: DBMSs are very different in data models, data access language, data navigation, operations, concurrency, integrity, recovery and so on

    Data heterogeneity: the way data is defined and used in different models.

  • Metadata

    Metadata is data about data that describes the data warehouse.

    Metadata can be classified into the following:

    Technical metadata
    Business metadata
    Data warehouse operational information, such as data history, ownership, extract audit trail, usage data

  • Technical Metadata

    Information about data sources
    Transformation descriptions: the mapping method from the operational database into the warehouse, and the algorithms used to convert/enhance/transform data
    Rules to perform data cleanup and data enhancement
    Data structure definitions for data targets
    Data-mapping operations when capturing data from source systems and applying it to the target warehouse database
    Access authorisation, backup history, archive history, information delivery history, data acquisition history, data access and so on

  • Business Metadata

    Subject areas and information object types, including queries, reports, images, video and/or audio clips

    Internet home pages

    Other information to support all data warehousing components. For example, the information related to the information delivery system should include subscription information; scheduling information; details of delivery destinations; and the business query objects such as predefined queries, reports and analyses.

  • The information directory and the entire metadata repository should have the following attributes:

    Should be the gateway to the data warehouse environment, and thus should be accessible from any platform via transparent and seamless connections
    The information directory components should be accessible by any browser and run on all major platforms
    The data structures of the metadata repository should be supported on all major relational and object-oriented databases
    Should support easy distribution and replication of its content for high performance and availability
    Should be searchable by business-oriented key words
    Should be able to define the content of structured and unstructured data
    Should act as a launch platform for end user data access and analysis tools
    Should support the sharing of information objects
    Should support a variety of scheduling options for requests against the data warehouse, including on-demand, one-time, repetitive, event-driven and conditional delivery
    Should support and provide interfaces to other applications such as e-mail, spreadsheets and so on

    Examples of metadata repositories include Microsoft Repository, R&O Rochade, Prism Solutions Directory Manager and CA/Platinum Technologies

  • Accessing and Visualizing Information

    Effective data visualization provides the user with the following:

    Capability to compare data
    Capability to control scale
    Capability to map the visualization back to the detail data that created it
    Capability to filter data to look only at subsets of it

  • Tool Taxonomy

    Data query and reporting tools
    Application development tools
    Executive information system tools
    Online analytical processing tools
    Data mining tools

  • Query and Reporting Tools

    Production reporting tools let companies generate regular operational reports

    Report writers are inexpensive desktop tools designed for end users

    Managed query tools are designed for ease of use, with point-and-click, visual navigation; they either accept SQL or generate SQL statements to query relational data stored in the warehouse.

  • Application Development Tools

    Organizations will often rely on the tried-and-proven approach of in-house application development, using graphical data access environments designed primarily for client/server environments.

  • OLAP Tools

    OLAP tools can be classified as multidimensional (MOLAP), relational (ROLAP) and hybrid (HOLAP) tools. Some of the more popular OLAP tools are Microsoft Decision Support Services, MicroStrategy DSS Server, Oracle Express, MetaCube from Informix and so on.

  • Data Mining Tools

    Discovering knowledge
    Segmentation
    Classification
    Association
    Preferencing
    Visualization

  • Data Marts

    The data mart is directed at a partition of data that is created for the use of a dedicated group of users. A data mart is a set of denormalized, summarized or aggregated data.

  • Data Warehouse Administration and Management

    Security and priority management
    Monitoring updates from multiple sources
    Data quality checks
    Managing and updating metadata
    Auditing and reporting data warehouse usage and status
    Purging data
    Replicating, subsetting and distributing data
    Backup and recovery

  • Data Mining

  • Data Mining

    The process of employing one or more computer learning techniques to automatically analyze and extract knowledge from data.

  • A Simple Data Mining Process Model

    (Diagram: SQL queries extract data from the operational database into the data warehouse; data mining is applied to the warehouse; interpretation and evaluation of the mined patterns leads to result application.)

  • General Phases of Data Mining Process

    Problem definition
    Creating a database for data mining
    Exploring the database
    Preparation for creating a data mining model
    Building a data mining model
    Evaluating the data mining model
    Deploying the data mining model

  • Data Mining Tasks

    The models you use to solve a problem are classified as:

    Predictive models:
    Classification
    Regression
    Time series analysis
    Prediction

    Descriptive models:
    Clustering
    Summarization
    Association rules
    Sequence discovery

  • Data Mining Techniques

    Artificial neural networks: non-linear predictive models that learn through training and resemble biological neural networks in structure.

    Decision trees: tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID).

    Genetic algorithms: optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.

    Nearest neighbor method: a technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k >= 1). Sometimes called the k-nearest neighbor technique.

    Rule induction: the extraction of useful if-then rules from data based on statistical significance.

  • Data Mining Issues

    Human interaction
    Overfitting
    Outliers
    Interpretation of results
    Visualization of results
    Large datasets
    High dimensionality
    Multimedia data
    Missing data
    Irrelevant data
    Noisy data
    Changing data
    Integration
    Application

  • Data Mining Metrics

    Measuring the effectiveness or usefulness of data mining is called a data mining metric

    It could be measured as an increase in sales or a reduction in advertising cost, viewed as a return on investment (ROI)

    The metrics used also include the traditional metrics of space and time, and, for example, similarity measures

  • Social Implications of Data Mining

    Targeted advertising

    Data mining applications can derive much demographic data concerning customers that was previously unknown or hidden in the data

    Fraud detection, criminal suspects, prediction of terrorists.

  • Data Mining from a Database Perspective

    Scalability
    Real-world data
    Update
    Ease of use

  • Decision Tree

    A tree structure where non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes.

  • Table 1.1 Hypothetical Training Data for Disease Diagnosis

    ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
    1    Yes          Yes    Yes             Yes         Yes       Strep throat
    2    No           No     No              Yes         Yes       Allergy
    3    Yes          Yes    No              Yes         No        Cold
    4    Yes          No     Yes             No          No        Strep throat
    5    No           Yes    No              Yes         No        Cold
    6    No           No     No              Yes         No        Allergy
    7    No           No     Yes             No          No        Strep throat
    8    Yes          No     No              Yes         Yes       Allergy
    9    No           Yes    No              Yes         Yes       Cold
    10   Yes          Yes    No              Yes         Yes       Cold

  • (Decision tree for the training data in Table 1.1:)

    Swollen Glands?
    Yes -> Diagnosis = Strep Throat
    No  -> Fever?
           Yes -> Diagnosis = Cold
           No  -> Diagnosis = Allergy

  • Table 1.2 Data Instances with an Unknown Classification

    ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
    11   No           No     Yes             Yes         Yes       ?
    12   Yes          Yes    No              No          Yes       ?
    13   No           No     No              No          Yes       ?

  • Production Rules

    IF Swollen Glands = Yes
    THEN Diagnosis = Strep Throat

    IF Swollen Glands = No & Fever = Yes
    THEN Diagnosis = Cold

    IF Swollen Glands = No & Fever = No
    THEN Diagnosis = Allergy
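    These production rules translate directly into code. A minimal sketch (the function name and dict-based patient records are my own illustration, not from the slides):

    def diagnose(patient):
        # patient: dict with "Swollen Glands" and "Fever" set to "Yes"/"No"
        if patient["Swollen Glands"] == "Yes":
            return "Strep Throat"
        if patient["Fever"] == "Yes":
            return "Cold"
        return "Allergy"

    # Instance 11 from Table 1.2 has Swollen Glands = Yes:
    print(diagnose({"Swollen Glands": "Yes", "Fever": "No"}))  # Strep Throat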

  • An Algorithm for Building Decision Trees

    1. Let T be the set of training instances.
    2. Choose an attribute that best differentiates the instances in T.
    3. Create a tree node whose value is the chosen attribute.
       - Create child links from this node where each link represents a unique value for the chosen attribute.
       - Use the child link values to further subdivide the instances into subclasses.
    4. For each subclass created in step 3:
       - If the instances in the subclass satisfy predefined criteria, or if the set of remaining attribute choices for this path is null, specify the classification for new instances following this decision path.
       - If the subclass does not satisfy the criteria and there is at least one attribute to further subdivide the path of the tree, let T be the current set of subclass instances and return to step 2.
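    A compact recursive sketch of these four steps, assuming categorical attributes and using the majority class as the leaf classification. The attribute-scoring helper is a simple stand-in for step 2 (ID3 would use information gain here); all names are illustrative:

    def choose_attribute(instances, attributes, target):
        # Score stub: prefer the attribute whose split yields the fewest
        # mixed-class subsets (a crude proxy for information gain).
        def mixed(attr):
            values = {inst[attr] for inst in instances}
            return sum(
                len({i[target] for i in instances if i[attr] == v}) > 1
                for v in values
            )
        return min(attributes, key=mixed)

    def build_tree(instances, attributes, target):
        labels = [inst[target] for inst in instances]
        # Step 4 stopping criteria: pure subclass or no attributes left.
        if len(set(labels)) == 1 or not attributes:
            return max(set(labels), key=labels.count)  # leaf: majority class
        best = choose_attribute(instances, attributes, target)  # step 2
        node = {best: {}}                                       # step 3
        for value in {inst[best] for inst in instances}:        # child links
            subset = [inst for inst in instances if inst[best] == value]
            remaining = [a for a in attributes if a != best]
            node[best][value] = build_tree(subset, remaining, target)
        return node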

  • Generating Association Rules

    Rule confidence: given a rule of the form "If A then B", rule confidence is the conditional probability that B is true when A is known to be true.

    Rule support: the minimum percentage of instances in the database that contain all items listed in a given association rule.

  • Mining Association Rules: An Example

  • Table 3.3 A Subset of the Credit Card Promotion Database

    Magazine Promotion  Watch Promotion  Life Insurance Promotion  Credit Card Insurance  Sex
    Yes                 No               No                        No                     Male
    Yes                 Yes              Yes                       No                     Female
    No                  No               No                        No                     Male
    Yes                 Yes              Yes                       Yes                    Male
    Yes                 No               Yes                       No                     Female
    No                  No               No                        No                     Female
    Yes                 No               Yes                       Yes                    Male
    No                  Yes              No                        No                     Male
    Yes                 No               No                        No                     Male
    Yes                 Yes              Yes                       No                     Female

  • Table 3.4 Single-Item Sets

    Single-Item Sets                   Number of Items
    Magazine Promotion = Yes           7
    Watch Promotion = Yes              4
    Watch Promotion = No               6
    Life Insurance Promotion = Yes     5
    Life Insurance Promotion = No      5
    Credit Card Insurance = No         8
    Sex = Male                         6
    Sex = Female                       4

  • Table 3.5 Two-Item Sets

    Two-Item Sets                                                Number of Items
    Magazine Promotion = Yes & Watch Promotion = No              4
    Magazine Promotion = Yes & Life Insurance Promotion = Yes    5
    Magazine Promotion = Yes & Credit Card Insurance = No        5
    Magazine Promotion = Yes & Sex = Male                        4
    Watch Promotion = No & Life Insurance Promotion = No         4
    Watch Promotion = No & Credit Card Insurance = No            5
    Watch Promotion = No & Sex = Male                            4
    Life Insurance Promotion = No & Credit Card Insurance = No   5
    Life Insurance Promotion = No & Sex = Male                   4
    Credit Card Insurance = No & Sex = Male                      4
    Credit Card Insurance = No & Sex = Female                    4

  • Two Possible Two-Item Set Rules

    IF Magazine Promotion = Yes
    THEN Life Insurance Promotion = Yes (5/7)

    IF Life Insurance Promotion = Yes
    THEN Magazine Promotion = Yes (5/5)
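    A minimal sketch of how these confidences can be computed from Table 3.3 (the tuple encoding and function are my own illustration):

    # Table 3.3 rows: (magazine, watch, life_insurance, cc_insurance, sex)
    rows = [
        ("Yes","No","No","No","Male"), ("Yes","Yes","Yes","No","Female"),
        ("No","No","No","No","Male"), ("Yes","Yes","Yes","Yes","Male"),
        ("Yes","No","Yes","No","Female"), ("No","No","No","No","Female"),
        ("Yes","No","Yes","Yes","Male"), ("No","Yes","No","No","Male"),
        ("Yes","No","No","No","Male"), ("Yes","Yes","Yes","No","Female"),
    ]
    MAG, LIFE = 0, 2  # column indices

    def confidence(antecedent, consequent):
        # conditional probability of the consequent given the antecedent
        col_a, val_a = antecedent
        col_c, val_c = consequent
        covered = [r for r in rows if r[col_a] == val_a]
        hits = [r for r in covered if r[col_c] == val_c]
        return len(hits), len(covered)

    print(confidence((MAG, "Yes"), (LIFE, "Yes")))  # (5, 7)
    print(confidence((LIFE, "Yes"), (MAG, "Yes")))  # (5, 5)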

  • Three-Item Set Rules

    IF Watch Promotion = No & Life Insurance Promotion = No
    THEN Credit Card Insurance = No (4/4)

    IF Watch Promotion = No
    THEN Life Insurance Promotion = No & Credit Card Insurance = No (4/6)

  • General Considerations

    We are interested in association rules that show a lift in product sales, where the lift is the result of the product's association with one or more other products.

    We are also interested in association rules that show a lower than expected confidence for a particular association.

  • Nearest Neighbour

    Objects that are near each other will also have similar prediction values. Thus, if you know the prediction value of one of the objects, you can predict it for its nearest neighbours.
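    A minimal nearest-neighbour sketch of this idea, using Euclidean distance over numeric feature vectors (the data and names are illustrative):

    import math

    def nearest_neighbour_predict(query, labelled_points):
        # labelled_points: list of (vector, prediction_value) pairs
        def dist(a, b):
            return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
        _, prediction = min(labelled_points, key=lambda p: dist(p[0], query))
        return prediction

    history = [((1.0, 1.5), "A"), ((2.0, 3.5), "B"), ((5.0, 6.0), "C")]
    print(nearest_neighbour_predict((1.2, 1.4), history))  # -> "A"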

  • The K-Means Algorithm

    1. Choose a value for K, the total number of clusters.
    2. Randomly choose K points as cluster centers.
    3. Assign the remaining instances to their closest cluster center.
    4. Calculate a new cluster center for each cluster.
    5. Repeat steps 3-5 until the cluster centers do not change.

  • Table 3.6 K-Means Input Values

    Instance  X    Y
    1         1.0  1.5
    2         1.0  4.5
    3         2.0  1.5
    4         2.0  3.5
    5         3.0  2.5
    6         5.0  6.0
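    A minimal K-means sketch over the Table 3.6 data (K = 2; pure Python, my own illustration rather than the authors' code). Different initial centers lead to the different outcomes reported in Table 3.7 below:

    import math, random

    points = [(1.0, 1.5), (1.0, 4.5), (2.0, 1.5),
              (2.0, 3.5), (3.0, 2.5), (5.0, 6.0)]

    def kmeans(points, k=2, seed=0):
        random.seed(seed)
        centers = random.sample(points, k)                  # step 2
        while True:
            clusters = [[] for _ in range(k)]               # step 3: assign
            for p in points:
                i = min(range(k), key=lambda c: math.dist(p, centers[c]))
                clusters[i].append(p)
            new_centers = [                                 # step 4: re-center
                tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centers[i]
                for i, cl in enumerate(clusters)
            ]
            if new_centers == centers:                      # step 5: converged
                return centers, clusters
            centers = new_centers

    centers, clusters = kmeans(points)
    print(centers, clusters)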

  • (Scatter plot of the six input instances from Table 3.6, x vs. f(x).)

  • Table 3.7 Several Applications of the K-Means Algorithm (K = 2)

    Outcome  Cluster Centers  Cluster Points   Squared Error
    1        (2.67, 4.67)     2, 4, 6          14.50
             (2.00, 1.83)     1, 3, 5
    2        (1.5, 1.5)       1, 3             15.94
             (2.75, 4.125)    2, 4, 5, 6
    3        (1.8, 2.7)       1, 2, 3, 4, 5    9.60
             (5, 6)           6

  • (Scatter plot of the instances after clustering, x vs. f(x).)

  • General Considerations

    Requires real-valued data.
    We must select the number of clusters present in the data.
    Works best when the clusters in the data are of approximately equal size.
    Attribute significance cannot be determined.
    Lacks explanation capabilities.

  • Bayesian Classification

    ID  Income  Credit  Class  x(i)
    1   4       e       h1     x4
    2   3       g       h1     x7
    3   2       e       h1     x2
    4   3       g       h1     x7
    5   4       g       h1     x8
    6   2       e       h1     x2
    7   3       b       h2     x11
    8   2       b       h2     x10
    9   3       b       h3     x11
    10  1       b       h4     x9
    11  2       g       h2     x6

    P(h1|xi) = P(xi|h1) P(h1) / sum over j of (P(xi|hj) P(hj))

    Let h1 = authorize purchase, h2 = authorize after further identification, h3 = do not authorize, h4 = do not authorize and report to police

  • Income groups:

    1: 0-10,000
    2: 10,000-50,000
    3: 50,000-100,000
    4: 100,000 and above

    Construct a table of the xi for each (credit, income) combination:

         1    2    3    4
    e    x1   x2   x3   x4
    g    x5   x6   x7   x8
    b    x9   x10  x11  x12

    P(x7|h1) = 2/6, P(x4|h1) = 1/6, P(x2|h1) = 2/6, P(x8|h1) = 1/6

    P(h1|x4) = P(x4|h1) P(h1) / (sum over all hypotheses) = 1

  • (Naive Bayes example: counts and probabilities from a gender/height training set:)

    Attribute          Count (Short/Medium/Tall)    Prob (Short/Medium/Tall)
    Gender   M         1 / 2 / 3                    1/4 / 2/8 / 3/3
             F         3 / 6 / 0                    3/4 / 6/8 / 0/3
    Height   0-1.6     2 / 0 / 0                    2/4 / 0   / 0
             1.6-1.7   2 / 0 / 0                    2/4 / 0   / 0
             1.7-1.8   0 / 4 / 0                    0   / 4/8 / 0
             1.9-2     0 / 1 / 1                    0   / 1/8 / 1/3
             2-        0 / 0 / 2                    0   / 0   / 2/3

  • Classifying a new tuple t (gender M, height in the 1.9-2 range):

    P(t|short) = 1/4 * 0 = 0
    P(t|medium) = 2/8 * 1/8 = 0.031
    P(t|tall) = 3/3 * 1/3 = 0.333

    Likelihood of being short = 0 * 0.267 = 0
    Likelihood of being medium = 0.031 * 0.533 = 0.0166
    Likelihood of being tall = 0.333 * 0.2 = 0.066

    P(t) = 0 + 0.0166 + 0.066 = 0.0826

    P(short|t) = 0 * 0.267 / 0.0826 = 0
    P(medium|t) = 0.031 * 0.533 / 0.0826 = 0.2
    P(tall|t) = 0.333 * 0.2 / 0.0826 = 0.799

    The tuple t is classified as tall, since that posterior probability is the highest.
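    A minimal sketch of this naive Bayes computation, with the probabilities taken straight from the table above (variable names are my own):

    # class-conditional factors for t = (gender M, height in 1.9-2)
    cond = {"short": (1/4) * 0, "medium": (2/8) * (1/8), "tall": (3/3) * (1/3)}
    prior = {"short": 4/15, "medium": 8/15, "tall": 3/15}

    likelihood = {c: cond[c] * prior[c] for c in cond}   # P(t|c) * P(c)
    p_t = sum(likelihood.values())                       # ~0.083 (slides: 0.0826, from rounding)
    posterior = {c: likelihood[c] / p_t for c in cond}   # Bayes rule

    print(max(posterior, key=posterior.get))             # tall (posterior ~0.8)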

  • ID3 Algorithm

    The concept used to quantify information is called entropy. Entropy is used to measure the amount of uncertainty or surprise or randomness in a set of data.

    The basic strategy used by ID3 is to choose splitting attributes with the highest information gain first.

  • Given probabilities p1, p2, ..., pS, where sum(pi) = 1, entropy is defined as

    H(p1, p2, ..., pS) = sum over i of (pi * log(1/pi))

    Gain(D, S) = H(D) - sum over i of (P(Di) * H(Di))

  • The class distribution is short 4/15, medium 8/15 and tall 3/15, so the entropy of the starting set is

    4/15 log(15/4) + 8/15 log(15/8) + 3/15 log(15/3) = 0.4384

    (all logs here are base 10)

    Choosing gender as the splitting attribute, 9 instances are F and 6 are M.

    The entropy of the subset that is F is 3/9 log(9/3) + 6/9 log(9/6) = 0.2764

    The entropy of the subset that is M is 1/6 log(6/1) + 2/6 log(6/2) + 3/6 log(6/3) = 0.4392

    The ID3 algorithm must determine what the information gain is from this split. Calculate the weighted sum of these last two entropies to get

    9/15 * 0.2764 + 6/15 * 0.4392 = 0.34152

    The gain in entropy from using the gender attribute is thus

    0.4384 - 0.34152 = 0.09688

    Looking at the height attribute, we divide it into ranges:

    (0,1.6], (1.6,1.7], (1.7,1.8], (1.8,1.9], (1.9,2.0], (2.0,inf)

    (0,1.6] -> 2/2(0) + 0 + 0 = 0, (1.6,1.7] -> 0, ..., (1.9,2.0] -> 0 + 1/2 log(2) + 1/2 log(2) = 0.301

    The gain in entropy from using the height attribute is

    0.4384 - 2/15 (0.301) = 0.3983
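    A minimal sketch that reproduces the gender gain with base-10 logs (the flat encoding of the 15 training instances as (gender, class) pairs is my own):

    import math

    def entropy(labels):
        # H = sum over classes of p * log10(1/p)
        n = len(labels)
        return sum(
            (labels.count(c) / n) * math.log10(n / labels.count(c))
            for c in set(labels)
        )

    data = ([("F", "short")] * 3 + [("F", "medium")] * 6 +
            [("M", "short")] * 1 + [("M", "medium")] * 2 + [("M", "tall")] * 3)

    h_start = entropy([c for _, c in data])              # ~0.4384
    weighted = sum(
        len(sub) / len(data) * entropy([c for _, c in sub])
        for g in ("F", "M")
        for sub in [[d for d in data if d[0] == g]]
    )
    print(h_start - weighted)  # ~0.0969 (slides get 0.09688 via rounded intermediates)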

  • C4.5 or C5.0

    GainRatio(D, S) = Gain(D, S) / H(|D1|/|D|, ..., |Ds|/|D|)

    To calculate the GainRatio for the gender split, we first find the entropy associated with the split, ignoring classes:

    H(9/15, 6/15) = 9/15 log(15/9) + 6/15 log(15/6) = 0.292

    This gives the GainRatio value for the gender attribute as

    0.09688 / 0.292 = 0.332

    The entropy for the split on height is

    H(2/15, 2/15, 3/15, 4/15, 2/15) = 2/15 log(15/2) + 2/15 log(15/2) + 3/15 log(15/3) + 4/15 log(15/4) + 2/15 log(15/2) = 0.1166*3 + 0.1397 + 0.15307 = 0.64257

    This gives the GainRatio value for the height attribute as

    0.09688 / 0.64257 = 0.1507

  • Neural Network

    How to solve a classification problem using a neural network:

    Determine the number of output nodes and the attributes to be used as input
    Determine the labels and functions to be used for the graph
    Determine the functions for the graph
    Each tuple needs to be evaluated by filtering it through the structure of the network
    For each tuple ti in D, propagate ti through the network and classify the tuple

  • Various issues in neural network classification are:

    Deciding the attributes to be used as splitting attributes
    Determination of the number of hidden nodes
    Determination of the number of hidden layers, to choose the best number of hidden nodes per hidden layer
    Determination of the number of sinks
    Interconnectivity of all the nodes
    Using different activation functions

  • Propagation in Neural Network

    The output of each node i in the neural network is based on the definition of a function fi, called the activation function. When fi is applied to inputs {x1i, x2i, ..., xki} with weights {w1i, w2i, ..., wki}, the weighted sum of these inputs is

    Si = sum(whi * xhi), for h = 1 to k

  • for each node i in the input layer do
        output xi on each output arc from i
    for each hidden layer do
        for each node i do
            Si = sum(wji * xji)
            for each output arc from i do
                output (1 - e^(-Si)) / (1 + e^(-Si))
    for each node i in the output layer do
        Si = sum(wji * xji)
        output = 1 / (1 + e^(-c*Si))

  • Radial Basis Function Network

    A function whose value changes as it moves away from a central point is known as a radial function.

    fi(S) = e^(-S^2 / v)

  • Perceptron

    The simplest type of neural network is called a perceptron.

    The perceptron uses a sigmoidal activation function.

  • Association Rules

    Let I = {I1, I2, ..., In} be a set of items and let {t1, t2, ..., tn} be a database of transactions, where ti = {Ii1, Ii2, ..., Iim} and each Iij belongs to I. An association rule is an implication of the form X => Y, where X and Y (contained in I) are itemsets and the intersection of X and Y is empty.

  • Basic Concepts of Association Rules

    Support: the support for an association rule X => Y is the percentage of transactions in the database that contain X U Y

    Confidence: the confidence for an association rule X => Y is the ratio of the number of transactions that contain X U Y to the number of transactions that contain X

    Large itemset: a large itemset is an itemset whose number of occurrences is above a threshold, or support. L represents the complete set of large itemsets and I represents an individual itemset. The itemsets that are counted from the data set are called candidates, and the collection of these counted itemsets is known as the candidate itemset.

  • Apriori Algorithm

    This algorithm is an association rule algorithm that finds the large itemsets from a given dataset.

    Transaction  Items
    T1           Bread, Jam, Butter
    T2           Bread, Butter
    T3           Bread, Cold-drink, Butter
    T4           Milk, Bread
    T5           Milk, Cold-drink

  • Candidates and Large Itemsets using Apriori

    Scan  Candidates                                           Large Itemsets
    1     {Milk}, {Bread}, {Jam}, {Cold-drink}, {Butter}       {Milk}, {Bread}, {Cold-drink}, {Butter}
    2     {Milk, Bread}, {Milk, Cold-drink}, {Milk, Butter},   {Bread, Butter}
          {Bread, Cold-drink}, {Bread, Butter},
          {Cold-drink, Butter}
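    A minimal two-scan Apriori sketch over these five transactions (the minimum support count of 2 is my assumption, chosen to match the table above):

    from itertools import combinations

    transactions = [
        {"Bread", "Jam", "Butter"}, {"Bread", "Butter"},
        {"Bread", "Cold-drink", "Butter"}, {"Milk", "Bread"},
        {"Milk", "Cold-drink"},
    ]
    MIN_SUPPORT = 2  # minimum number of occurrences

    def support(itemset):
        return sum(itemset <= t for t in transactions)

    # Scan 1: count single items, keep the large 1-itemsets
    items = {i for t in transactions for i in t}
    large1 = [frozenset([i]) for i in sorted(items)
              if support(frozenset([i])) >= MIN_SUPPORT]

    # Scan 2: candidate pairs built from large 1-itemsets
    candidates2 = [a | b for a, b in combinations(large1, 2)]
    large2 = [c for c in candidates2 if support(c) >= MIN_SUPPORT]

    print([set(s) for s in large1])  # {Bread}, {Butter}, {Cold-drink}, {Milk}
    print([set(s) for s in large2])  # [{'Bread', 'Butter'}]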

  • Sampling Algorithm

    To avoid counting itemsets over the entire large dataset in each scan, you can use the sampling algorithm. The sampling algorithm reduces the number of dataset scans to 1 in the best case and 2 in the worst case. Like the Apriori algorithm, it finds the large itemsets, but for a sample of the data set. These are considered potentially large itemsets and are used as candidates to be counted against the entire database.

  • Clustering

    Hierarchical: Agglomerative, Divisive
    Partitional
    Categorical
    Large DB: Sampling, Compression

  • Hierarchical

    A nested set of clusters is created. Each level in the hierarchy has a separate set of clusters.

    Agglomerative: clusters are created in a bottom-up fashion.

    Divisive: top-down fashion.

    A tree data structure called a dendrogram can be used to illustrate the hierarchical clustering and the sets of different clusters.

  • Similarity and Distance Measures

    Centroid: Cm = sum(tmi) / N

    Radius: Rm = sqrt( sum((tmi - Cm)^2) / N )

    Diameter: Dm = sqrt( sum over i,j of ((tmi - tmj)^2) / (N (N - 1)) )
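    A minimal sketch of these three measures for a one-dimensional cluster (the sample values are my own illustration):

    import math

    def centroid(ts):
        return sum(ts) / len(ts)

    def radius(ts):
        c = centroid(ts)
        return math.sqrt(sum((t - c) ** 2 for t in ts) / len(ts))

    def diameter(ts):
        n = len(ts)
        return math.sqrt(sum((a - b) ** 2 for a in ts for b in ts)
                         / (n * (n - 1)))

    cluster = [1.0, 2.0, 3.0]
    print(centroid(cluster), radius(cluster), diameter(cluster))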

  • Methods to Calculate the Distance Between Clusters

    Single link: smallest distance between an element in one cluster and an element in the other. We thus have dis(Ki, Kj) = min(dis(til, tjm)) for every til in Ki and every tjm in Kj

    Complete link: largest distance between an element in one cluster and an element in the other. We thus have dis(Ki, Kj) = max(dis(til, tjm)) for every til in Ki and every tjm in Kj

    Average: average distance between an element in one cluster and an element in the other. We thus have dis(Ki, Kj) = mean(dis(til, tjm)) for every til in Ki and every tjm in Kj

    Centroid: if clusters have a representative centroid, then the centroid distance is defined as the distance between the centroids. We thus have dis(Ki, Kj) = dis(Ci, Cj), where Ci is the centroid for Ki and similarly for Cj

    Medoid: using a medoid to represent each cluster, the distance between the clusters can be defined by the distance between the medoids: dis(Ki, Kj) = dis(mi, mj)
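    A minimal sketch of the first three inter-cluster distances, using Euclidean distance between 2-D points (the sample clusters are illustrative):

    import math
    from statistics import mean

    def single_link(ki, kj):
        return min(math.dist(a, b) for a in ki for b in kj)

    def complete_link(ki, kj):
        return max(math.dist(a, b) for a in ki for b in kj)

    def average_link(ki, kj):
        return mean(math.dist(a, b) for a in ki for b in kj)

    k1 = [(1.0, 1.5), (2.0, 1.5)]
    k2 = [(1.0, 4.5), (2.0, 3.5)]
    print(single_link(k1, k2), complete_link(k1, k2), average_link(k1, k2))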

  • Hypothesis Testing

    Null hypothesis
    Alternative hypothesis
    Chi-square testing
    Regression and correlation
