Data Warehousing and Data Mining_handbook



    DATA WAREHOUSING AND DATA MINING

    COURSE OBJECTIVE:

Data Warehousing and Data Mining are advanced, recent developments in database technology which aim to address the problem of extracting information from the overwhelmingly large amounts of data which modern societies are capable of amassing. Data warehousing focuses on supporting the analysis of data in a multidimensional way. Data mining focuses on inducing compressed representations of data in the form of descriptive and predictive models.

This course gives students a good overview of the ideas and techniques behind recent developments in data warehousing. It also helps students understand On-Line Analytical Processing (OLAP), create data models, and work with data mining query languages, conceptual design methodologies, and storage techniques. It further helps them identify and develop useful algorithms to discover useful knowledge from tremendous data volumes, and to determine in which application areas data mining can be applied. Research-oriented topics such as web mining and text mining are introduced to discuss the scope of research in this subject. At the end of the course the student will understand the difference between a data warehouse and a database, and will be able to implement the mining algorithms practically.

    SYLLABUS:

    III Year IT II Semester

    DATA WAREHOUSING AND MINING

    UNIT-I

Introduction: Fundamentals of Data Mining, Data Mining Functionalities, Classification of Data Mining Systems, Major Issues in Data Mining, Data Warehouse and OLAP Technology for Data Mining: Data Warehouse, Multidimensional Data Model, Data Warehouse Architecture, Data Warehouse Implementation, Further Development of Data Cube Technology, From Data Warehousing to Data Mining.

Data Preprocessing: Need for Preprocessing the Data, Data Cleaning, Data Integration and Transformation, Data Reduction, Discretization and Concept Hierarchy Generation, Online Data Storage.

    UNIT-II

Data Warehouse and OLAP Technology for Data Mining: Data Warehouse, Multidimensional Data Model, Data Warehouse Architecture, Data Warehouse Implementation, Further Development of Data Cube Technology, From Data Warehousing to Data Mining.

Data Cube Computation and Data Generalization: Efficient Methods for Data Cube Computation, Further Development of Data Cube and OLAP Technology, Attribute-Oriented Induction.

    UNIT-III

Mining Frequent Patterns, Associations and Correlations: Basic Concepts, Efficient and Scalable Frequent Itemset Mining Methods, Mining Various Kinds of Association Rules, From Association Mining to Correlation Analysis, Constraint-Based Association Mining.

UNIT-IV

Classification and Prediction: Issues Regarding Classification and Prediction, Classification by Decision Tree Induction, Bayesian Classification, Rule-Based Classification, Classification by Backpropagation, Support Vector Machines, Associative Classification, Lazy Learners, Other Classification Methods, Prediction, Accuracy and Error Measures, Evaluating the Accuracy of a Classifier or Predictor, Ensemble Methods for Increasing the Accuracy.

UNIT-V

  • 7/22/2019 Data Warehousing and Data Mining_handbook

    2/27

Cluster Analysis Introduction: Types of Data in Cluster Analysis, A Categorization of Major Clustering Methods, Partitioning Methods, Hierarchical Methods, Density-Based Methods, Grid-Based Methods and Model-Based Clustering Methods, Clustering High-Dimensional Data, Constraint-Based Cluster Analysis, Outlier Analysis.

UNIT-VI

Mining Streams, Time Series and Sequence Data: Mining Data Streams, Mining Time-Series and Sequence Data, Mining Sequence Patterns in Transactional Databases, Mining Sequence Patterns in Biological Data, Graph Mining, Social Network Analysis, Multirelational Data Mining.

UNIT-VII

Mining Objects, Spatial, Multimedia, Text, and Web Data: Multidimensional Analysis and Descriptive Mining of Complex Data Objects, Spatial Data Mining, Multimedia Data Mining, Text Data Mining, Mining the World Wide Web.

UNIT-VIII

Applications and Trends in Data Mining: Data Mining Applications, Data Mining System Products and Research Prototypes, Additional Themes on Data Mining, Social Impacts of Data Mining.

    SUGGESTED BOOKS:

    (i) TEXT BOOKS:

T1. Jiawei Han & Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers.
T2. Pang-Ning Tan et al., Introduction to Data Mining, Pearson Education, 2006.

    (ii) REFERENCES:

1. Arun K. Pujari, Data Mining Techniques, University Press.
2. Sam Anahory & Dennis Murray, Data Warehousing in the Real World, Pearson Education Asia.
3. K. P. Soman, Insight into Data Mining, PHI, 2008.
4. Paulraj Ponniah, Data Warehousing Fundamentals, Wiley Student Edition.
5. Ralph Kimball, The Data Warehouse Lifecycle Toolkit, Wiley Student Edition.

  • 7/22/2019 Data Warehousing and Data Mining_handbook

    3/27

SESSION PLAN
Subject: Data Warehousing and Mining    Year/Sem: IT III/II

Topics in each unit as per JNTU Syllabus | Lecture No. | Modules / sub-modules for each topic | Text Books / Reference Books

UNIT-I
Introduction | 1 | Overview of all units and course delivery plan | Syllabus book
Fundamentals of data mining | 2 | Data mining fundamentals | T1: 1.1-1.3; T2: 1; R4: pg 2-7
Data mining functionalities | 3 | What kinds of patterns can be mined | T1: 1.4.1-1.4.6; T2: 2.1, 2.2
Classification of data mining systems, Data mining task primitives | 4 | Data mining as a confluence of multiple disciplines; Primitives for specifying a data mining task | T1: 1.6, 1.7
Integration of a data mining system with a database or a data warehouse system, Major issues in data mining | 5 | Integration of a data mining system with a database or a data warehouse system | T1: 1.8, 1.9
Data preprocessing: Need for preprocessing the data | 6 | Need for preprocessing the data | T1: 2.1; T2: 2.3; R3: 5.1, 5.2
Data cleaning | 7 | Missing values; Noisy data; Inconsistent data | T1: 2.3.1-3
Data integration and transformation | 8 | Data integration; Data transformation | T1: 2.4.1-2.4.2
Data reduction | 9 | Data cube aggregation; Dimensionality reduction; Data compression; Numerosity reduction | T1: 2.5.1-2.6.2; T2: 2.3.3, 2.3.4
Discretization and concept hierarchy generation | 10 | For numeric data; For categorical data | T1: 2.6.1-2; R3: 5.3
— | 11 | Revision of Unit-I | —

UNIT-II
Data warehouse and OLAP technology for data mining: Data warehouse | 12 | What is a data warehouse and its importance | T1: 3.1; R2: 1.2; R3: 2.5.5; R4: pg 12; R5: pg 6
Multidimensional data model | 13 | From tables and spreadsheets to data cubes; Star, snowflake and fact constellation schemas for multidimensional databases | T1: 3.2.1-3.2.2; T2: 3.4; R2: 5.2-6
Data warehouse architecture | 14 | Steps for design and construction of data warehouses; A three-tier data warehouse architecture; Types of OLAP servers | T1: 3.3.1-5; R5: pg 13; R2: 3.1-3
Data warehouse implementation | 15 | Efficient computation of data cubes; Indexing OLAP data; Metadata repository; Data warehouse back-end tools | T1: 3.4.1-3.4.3; R2: 4.1-4; R3: 2.5.3; R4: pg 57
Further development of data cube technology | 16 | Further development of data cube technology | T1: 4.2
From data warehousing to data mining | 17 | Data warehouse usage; From OLAP to OLAM | T1: 3.5
Efficient methods for data cube computation | 18 | Different kinds of cubes; For the full cube: BUC; Star-cubing for fast high-dimensional OLAP; Cubes with iceberg conditions | T1: 4.1

Further development of data cube technology | 19 | Discovery-driven exploration of data cubes; Complex aggregation at multiple granularities; Other developments | T1: 4.2.1-3
Attribute-oriented induction | 20 | Attribute-oriented induction; Efficient implementation of attribute-oriented induction; Presentation of the derived generalization; Mining class comparisons; Class description | T1: 4.3.1-5
— | 21 | Revision of Unit-II | —

UNIT-III
Mining frequent patterns, associations and correlations: Basic concepts | 22 | Market basket analysis; Frequent itemsets, closed itemsets and association rules; Frequent pattern mining | T1: 5.1.1-3; T2: 6.1; R3: 7.1
Efficient and scalable frequent itemset mining methods | 23 | The Apriori algorithm; Generating association rules from frequent itemsets; Improving the efficiency of Apriori; Mining frequent itemsets without candidate generation | T1: 5.2.1-5; T2: 6.2-3; R1: 4.4, 4.8; R3: 7.3
— | 24 | Mining frequent itemsets using the vertical data format; Mining closed frequent itemsets | T1: 5.2.6; T2: 6.5
Mining various kinds of association rules | 25 | Multilevel association rules; Mining multidimensional association rules from relational databases and data warehouses | T1: 5.3.1, 2
From association mining to correlation analysis | 26 | Strong rules are not necessarily interesting; From association analysis to correlation analysis | T1: 5.4.1, 2
Constraint-based association mining | 27 | Metarule-guided mining of association rules; Mining guided by additional rule constraints | T1: 5.5.1, 2; R1: 4.13

UNIT-IV
Classification and prediction: Issues regarding classification and prediction | 28 | Preparing the data for classification and prediction; Comparing classification methods | T1: 6.2.1, 2; T2: 4.1, 2; R1: 6.1, 2
Classification by decision tree induction | 29 | Decision tree induction; Attribute selection measures; Tree pruning; Scalability and decision tree induction | T1: 6.3.1-4; T2: 4.3; R1: 6.3-6.7; R3: 4.7
Bayesian classification | 30 | Bayes' theorem; Naive Bayesian classification; Bayesian belief networks; Training Bayesian belief networks | T1: 6.4.1-4; T2: 5.3; R3: 9.2
Rule-based classification | 31 | Using if-then rules for classification; Rule extraction from a decision tree; Rule induction using a sequential covering algorithm | T1: 6.5.1-3; T2: 5.1
Classification by backpropagation | 32 | A multilayer feed-forward neural network; Defining a network topology; Backpropagation; Backpropagation and interpretability | T1: 6.6.1-4
Support vector machines; Associative classification | 33 | The linearly separable case; The linearly inseparable case; Classification by association rule analysis | T1: 6.7.1, 2, 6.8; T2: 5.5; R3: 10.1
Lazy learners; Other classification methods | 34 | k-nearest-neighbor classifiers; Case-based reasoning; Genetic algorithms; Rough set approach; Fuzzy set approaches | T1: 6.9.1, 2, 6.10.1-3; T2: 5.2
Prediction | 35 | Linear and multiple regression; Nonlinear regression; Other regression models | T1: 6.11.1-3
Accuracy and error measures | 36 | Classifier accuracy measures; Predictor error measures | T1: 6.12.1, 2
Evaluating the accuracy of a classifier or predictor | 37 | Holdout method and random subsampling; Cross-validation; Bootstrap | T1: 6.13.1-3; T2: 4.5; R3: 4.8.1
Ensemble methods for increasing the accuracy | 38 | Bagging; Boosting | T1: 6.14.1, 2; T2: 5.6.4, 5
— | 39 | Revision of Units III & IV | —

UNIT-V
Cluster analysis introduction: Types of data in cluster analysis; A categorization of major clustering methods | 40 | Interval-scaled variables; Binary variables; Nominal, ordinal and ratio-scaled variables; Variables of mixed types; Vector objects | T1: 7.2, 3; T2: 8.1.1, 2; R1: 5.1, 2
Partitioning methods; Hierarchical methods | 41 | Classical partitioning methods; Partitioning methods in large databases; Agglomerative and divisive methods; BIRCH, ROCK, Chameleon | T1: 7.4.1, 2, 7.5.1-; T2: 8.2; R1: 5.3, 5.4, 5.7, 5.9, 5.13; R3: 11.2, 11.5
Density-based methods | 42 | DBSCAN: a density-based clustering method; OPTICS; DENCLUE | T1: 7.6.1-3; T2: 8.4, 9.3.3; R1: 5.8; R3: 11.6, 7
Grid-based methods and model-based clustering methods | 43 | STING; WaveCluster; Expectation maximization; Conceptual clustering; Neural network approach | T1: 7.7.1-7.7.2, 7.8.1-3
Clustering high-dimensional data | 44 | CLIQUE; PROCLUS; Frequent pattern-based clustering methods | T1: 7.9.1-3
Constraint-based cluster analysis | 45 | Clustering with obstacle objects; User-constrained cluster analysis; Semi-supervised cluster analysis | T1: 7.10.1-3
Outlier analysis | 46 | Statistical distribution-based; Distance-based; Density-based; Deviation-based | T1: 7.11.1-4
— | 47 | Revision of Unit-V | —

UNIT-VI
Mining streams, time series and sequence data: Mining data streams | 48 | Methodologies for stream data processing and stream data systems; Stream OLAP and stream data cubes; Frequent pattern mining in data streams; Classification of dynamic data streams; Clustering evolving data streams | T1: 8.1.1-8.1.5
Mining time-series and sequence data | 49 | Trend analysis; Similarity search in time-series analysis | T1: 8.2.1, 2; T2: 9.11
Mining sequence patterns in transactional databases | 50 | Sequential pattern mining; Scalable methods for mining; Constraint-based periodicity analysis for time-related data | T1: 8.3.1-4
Mining sequence patterns in biological data | 51 | Alignment of biological sequences; Hidden Markov model | T1: 8.4.1, 2
Graph mining | 52 | Methods for mining frequent subgraphs; Mining variant and constrained substructure patterns; Applications | T1: 9.1.1-3
Social network analysis | 53 | What is a social network: characteristics; Link mining; Mining on social networks | T1: 9.2.1-4
Multirelational data mining | 54 | What is multirelational data mining? ILP approach; Tuple ID propagation; Multirelational classification; Multirelational clustering | T1: 9.3.1-5
— | 55 | Revision of Unit-VI | —

UNIT-VII
Mining objects, spatial, multimedia, text, and web data: Multidimensional analysis and descriptive mining of complex data objects | 56 | Generalization of structured data; Aggregation and approximation in spatial and multimedia data generalization; Generalization of object identifiers and class/subclass hierarchies | T1: 10.1.1-3
— | 57 | Generalization of class composition hierarchies; Construction and mining of object cubes; Generalization-based mining of plan databases by divide-and-conquer | T1: 10.1.3-6
Spatial data mining | 58 | Spatial data cube construction and spatial OLAP; Spatial association analysis; Clustering methods; Classification and trend analysis; Mining raster databases | T1: 10.2.1-5; R1: 9.12-14; R3: 12.1
Multimedia data mining | 59 | Similarity search in multimedia data; Multidimensional analysis; Classification and prediction analysis; Mining associations in multimedia data | T1: 10.3.1-5, pg 607-613
Text data mining | 60 | Text data analysis and information retrieval; Dimensionality reduction; Text mining | T1: 10.4.1-3; R1: 8.6
Mining the World Wide Web | 61 | Mining the web's link structures to identify authoritative web pages; Automatic classification of web documents; Construction of a multilayered web information base; Web usage mining | T1: 10.5.1-5; R1: 8.2-8.5

UNIT-VIII
Applications and trends in data mining: Data mining applications | 62 | For financial data analysis; For the retail industry | T1: 11.1.1, 2; R1: 3.11
— | 63 | For the telecommunication industry; For biological data analysis; Other scientific applications; For intrusion detection | T1: 11.1.3-6
Data mining system products and research prototypes | 64 | How to choose a data mining system | T1: 11.2.1
— | 65 | Examples of commercial data mining systems | T1: 11.2.2
— | 66 | Revision of Units VII & VIII | —
— | 67 | Subject summary | —

    Books Referred by faculty:

    (i) TEXT BOOKS:

T1. Jiawei Han & Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers.
T2. Pang-Ning Tan et al., Introduction to Data Mining, Pearson Education, 2006.

    (ii) REFERENCES:


1. Arun K. Pujari, Data Mining Techniques, University Press.
2. Sam Anahory & Dennis Murray, Data Warehousing in the Real World, Pearson Education Asia.
3. K. P. Soman, Insight into Data Mining, PHI, 2008.
4. Paulraj Ponniah, Data Warehousing Fundamentals, Wiley Student Edition.
5. Ralph Kimball, The Data Warehouse Lifecycle Toolkit, Wiley Student Edition.

    WEBSITES:

1. www.the-data-mine.com/bin/view/Misc/IntroductionToDataMining : Links to different journals of Data Warehousing and Mining, and links to books for Data Mining
2. http://dataminingwarehousing.blogspot.com
3. www.pcc.qub.ac.uk/tec/courses/datamining/ohp/dm-OHP-final_1.html : Online notes for DM
4. http://pagesperso-orange.fr/bernard.lupin/english/ : Provides information about OLAP
5. http://www.cs.sfu.ca/~han/dmbook : Slides of the text book

    JOURNALS:

1. IEEE Transactions on Knowledge and Data Engineering: This journal publishes papers on data mining techniques such as classification, association analysis, prediction, and cluster analysis, which give students exposure to how these techniques can be used in real-world applications.

2. IEEE Transactions on Pattern Analysis and Machine Intelligence: This includes all traditional areas of computer vision and image understanding, all traditional areas of pattern analysis and recognition, and selected areas of machine intelligence. Areas such as machine learning, search techniques, document and handwriting analysis, medical image analysis, video and image sequence analysis, content-based retrieval of image and video, face and gesture recognition, and relevant specialized hardware and/or software architectures are also covered.

3. Data Mining and Knowledge Discovery: The journal publishes original technical papers in both the research and practice of data mining and knowledge discovery, surveys and tutorials of important areas and techniques, and detailed descriptions of significant applications. Its coverage includes: theory and foundational issues, data mining methods, algorithms for data mining, the knowledge discovery process, and application issues.

4. Journal of Intelligent Systems: This journal provides readers with a compilation of stimulating and up-to-date articles within the field of intelligent systems. The focus of the journal is on high-quality research that addresses paradigms, development, applications and implications in the field of intelligent systems. The list of topics spans all the areas of modern intelligent systems, such as: Artificial Intelligence, Pattern Recognition, Data Mining, Evolutionary Algorithms, Swarm Intelligence, Intelligent Agents and Multi-Agent Systems, Decision Support Systems, Supervised, Semi-Supervised and Unsupervised Learning, Knowledge Management and Representation, Intelligent System Design, Bayesian Learning, Evolving Clustering Methods, Natural Language Processing and Fusion.


    STUDENT SEMINAR TOPICS:

1. Integrating Data Warehouses with Web Data, J. M. Perez, R. Berlanga, M. J. Aramburu & T. B. Pedersen, July 2008, Vol. 20, p. 940.
2. IDD: A Supervised Interval Distance-Based Method for Discretization, F. J. Ruiz, C. Angulo & N. Agell, September 2008, Vol. 20, p. 1230.
3. Automatic Website Summarization by Image Content, J. Li, R. C.-W. Wong, A. W. Fu & J. Pei, September 2008, Vol. 20, p. 1195.
4. Label Propagation through Linear Neighborhoods, Fei Wang & Changshui Zhang, January 2008, Vol. 20, p. 55.
5. Time-Aware Web Users Clustering, S. G. Petridou, V. A. Koutsonikola, A. I. Vakali & G. I. Papadimitriou, May 2008, Vol. 20, p. 653.
6. Efficient Similarity Search over Future Stream Time Series, Xiang Lian & Lei Chen, January 2008, Vol. 20, p. 40.
7. Distributed Decision Tree Induction within the Grid Data Mining Framework GridMiner-Core, Jurgen Hofer and Peter Brezany.
8. Distributed Data Mining in Credit Card Fraud Detection, Philip K. Chan.


UNIT-I

20. What are the various issues in data mining? Explain each one in detail. (R07 APRIL/MAY 2011)
21. Why preprocess the data? Explain in brief. (R07 APRIL/MAY 2011)
22. Write short notes on GUI and DMQL. How can a GUI be designed based on DMQL? (R07 APRIL/MAY 2011)
23. (a) What are the various issues relating to the diversity of database types?
    (b) Explain how data mining is used in health care analysis. (R07 APRIL/MAY 2011)
24. (a) When is a summary table too big to be useful?
    (b) Relate and discuss the various degrees of aggregation within summary tables. (NR APRIL 2011)
25. (a) Explain the design and construction process of data warehouses.
    (b) Explain the architecture of a typical data mining system. (JNTU Dec-2010)
26. (a) How can we smooth out noise in the data cleaning process? Explain.
    (b) Why is preprocessing of data needed? (JNTU Dec-2010)
27. (a) Briefly discuss the data smoothing techniques.
    (b) Explain about concept hierarchy generation for categorical data. (JNTU Dec-2010)
28. (a) Explain data mining as a step in the process of knowledge discovery.
    (b) Differentiate operational database systems and data warehousing. (JNTU Dec-2010)
29. (a) How can you go about filling in the missing values in the data cleaning process?
    (b) Discuss the data smoothing techniques. (JNTU Dec-2010)
30. Write short notes on the following:
    (a) Association analysis
    (b) Classification and prediction
    (c) Cluster analysis
    (d) Outlier analysis. (JNTU Dec-2010)
31. (a) What are the desired architectures for data mining systems?
    (b) Briefly explain about concept hierarchies. (JNTU Dec-2010)
32. (a) List and describe any four primitives for specifying a data mining task.
    (b) Describe why concept hierarchies are useful in data mining. (JNTU Dec-2010)
33. Briefly discuss the following data mining primitives:
    (a) Task-relevant data
    (b) The kind of knowledge to be mined
    (c) Interestingness measures
    (d) Presentation and visualization of discovered patterns. (JNTU Dec-2010)
34. (a) Briefly discuss about data integration.
    (b) Briefly discuss about data transformation. (JNTU Dec-2010)
35. (a) Discuss construction and mining of object cubes.
    (b) Give a detailed note on trend analysis. (JNTU Dec-2010)
36. (a) Explain the design and construction process of data warehouses.
    (b) Explain the architecture of a typical data mining system. (JNTU Dec-2010)
37. (a) Draw and explain the architecture of a typical data mining system.
    (b) Differentiate OLTP and OLAP. (JNTU Aug/Sep 2008)
38. (a) Explain data mining as a step in the process of knowledge discovery.
    (b) Differentiate operational database systems and data warehousing. (JNTU Aug/Sep 2008)
39. (a) Explain the architecture of a typical data mining system.
    (b) Discuss the issues regarding data warehouse architecture. (JNTU Aug/Sep 2008)
40. (a) Briefly discuss the data smoothing techniques.
    (b) Explain about concept hierarchy generation for categorical data. (JNTU Aug/Sep 2008)
41. Explain various data reduction techniques. (JNTU Aug/Sep 2008)
42. (a) List and describe any four primitives for specifying a data mining task.
    (b) Describe why concept hierarchies are useful in data mining. (JNTU Aug/Sep 2008)


43. (a) Briefly discuss various forms of presenting and visualizing the discovered patterns.
    (b) Discuss about the objective measures of pattern interestingness. (JNTU Aug/Sep 2008)
44. Write the syntax for the following data mining primitives:
    (a) The kind of knowledge to be mined.
    (b) Measures of pattern interestingness. (JNTU Aug/Sep 2008, May 2008)
45. (a) How can we specify a data mining query for characterization with DMQL?
    (b) Describe the transformation of a data mining query into a relational query. (JNTU Aug/Sep 2008)
46. Briefly compare the following concepts. Use an example to explain your points.
    (a) Snowflake schema, fact constellation, starnet query model.
    (b) Data cleaning, data transformation, refresh.
    (c) Discovery-driven cube, multifeature cube, and virtual warehouse. (JNTU Aug/Sep 2008)
47. Write the syntax for the following data mining primitives:
    (a) Task-relevant data.
    (b) Concept hierarchies. (JNTU May 2008)
48. (a) Describe why it is important to have a data mining query language.
    (b) The four major types of concept hierarchies are: schema hierarchies, set-grouping hierarchies, operation-derived hierarchies, and rule-based hierarchies. Briefly define each type of hierarchy. (JNTU May 2008)
49. (a) Explain the syntax for task-relevant data specification.
    (b) Explain the syntax for specifying the kind of knowledge to be mined. (JNTU May 2008)
50. (a) Draw and explain the architecture for on-line analytical mining.
    (b) Briefly discuss the data warehouse applications. (JNTU May 2008)
51. (a) Explain data mining as a step in the process of knowledge discovery.
    (b) Differentiate operational database systems and data warehousing. (JNTU May 2008)
52. (a) Explain the major issues in data mining.
    (b) Explain the three-tier data warehousing architecture. (JNTU May 2008)
53. (a) Explain data mining as a step in the process of knowledge discovery.
    (b) Differentiate operational database systems and data warehousing. (JNTU May 2008)
54. Briefly discuss the role of data cube aggregation and dimensionality reduction in the data reduction process. (JNTU May 2008)
55. (a) Briefly discuss about data integration.
    (b) Briefly discuss about data transformation. (JNTU May 2008)
56. Discuss the role of data compression and numerosity reduction in the data reduction process. (JNTU May 2008)

    UNIT-II

1. (a) What is concept description? Explain.
   (b) What are the differences between concept description in large databases and OLAP? (R07 APRIL 2011)
2. (a) What does the data warehouse provide for the business analyst? Explain.
   (b) How are data warehousing and OLAP related to data mining? (R07 APRIL 2011)
3. (a) Draw and explain the architecture of a typical data mining system.
   (b) Differentiate OLTP and OLAP. (R07 APRIL 2011)
4. Write short notes on the following in detail:
   (a) Attribute-oriented induction.
   (b) Efficient implementation of attribute-oriented induction. (R07 APRIL 2011)
5. Briefly compare and explain, taking an example for your point(s):
   (a) Snowflake schema, fact constellation
   (b) Data cleaning, data transformation. (R07 APRIL/MAY 2011)
6. (a) Differentiate between OLAP and OLTP.
   (b) Draw and explain the star schema for the data warehouse. (R07 APRIL/MAY 2011)
7. What is data compression? How would you compress data using principal component analysis (PCA)? (R07 APRIL/MAY 2011)
8. How is class comparison performed? Can class comparison mining be implemented efficiently using data cube techniques? If yes, explain. (R07 APRIL/MAY 2011)
9. (a) Explain the ad hoc query and automation in the data warehouse delivery process.
   (b) Explain the idea "Can we do without an enterprise data warehouse?" (NR APRIL 2011)
10. (a) Explain the basic levels of testing a data warehouse.
    (b) Explain a plan for testing the data warehouse. (NR APRIL 2011)
11. (a) Describe the server management features of a data warehouse system.
    (b) "Management tools are required to manage a large, dynamic and complex system such as a data warehouse system." Support your explanation with justification. (NR APRIL 2011)
12. Explain the significance of tuning the data warehouse. (NR APRIL 2011)
13. (a) Discuss the design strategies to implement backup strategies.
    (b) Describe the recovery strategies of a data warehouse system. (NR APRIL 2011)
14. (a) How can we perform discrimination between different classes? Explain.
    (b) Explain analytical characterization with an example. (JNTU Dec-2010)
15. Write short notes on the following in detail:
    (a) Measuring the central tendency
    (b) Measuring the dispersion of data. (JNTU Dec-2010)

16. Suppose that the following table is derived by attribute-oriented induction.

    CLASS      | BIRTH_PLACE | COUNT
    Programmer | Canada      | 180
    Programmer | others      | 120
    DBA        | Canada      | 20
    DBA        | others      | 80

    (a) Transform the table into a crosstab showing the associated t-weights and d-weights.
    (b) Map the class Programmer into a (bi-directional) quantitative descriptive rule, for example,
        ∀X, Programmer(X) ⇔ (birth_place(X) = "Canada" ∧ ...) [t: x%, d: y%] ∨ ... ∨ (...) [t: w%, d: z%]
    (JNTU Dec-2010)
17. (a) Differentiate attribute generalization threshold control and generalized relation threshold control.
    (b) Differentiate between predictive and descriptive data mining. (JNTU Dec-2010)
18. (a) What are the differences between concept description in large databases and OLAP?
    (b) Explain about the graph displays of basic statistical class descriptions. (JNTU Aug/Sep 2008)
19. (a) How can we perform attribute relevance analysis for concept description? Explain.
    (b) Explain the measures of central tendency in detail. (JNTU Aug/Sep 2008, May 2008)
20. Write short notes on the following in detail:
    (a) Measuring the central tendency
    (b) Measuring the dispersion of data. (JNTU May 2008)
21. (a) Write the algorithm for attribute-oriented induction. Explain the steps involved in it.
    (b) How can concept description mining be performed incrementally and in a distributed manner? (JNTU May 2008)
22. (a) What are the differences between concept description in large databases and OLAP?
    (b) Explain about the graph displays of basic statistical class descriptions. (JNTU May 2008)

    UNIT-III

1. Give a detailed note on classification based on concepts from association rule mining. (R07 APRIL 2011)
2. (a) Explain how concept hierarchies are used in mining multilevel association rules.
   (b) Give the classification of association rules in detail. (R07 APRIL 2011)
3. (a) How scalable is decision tree induction? Explain.
   (b) Discuss classification based on concepts from association rule mining. (R07 APRIL 2011)
4. (a) How are association rules mined from large databases?
   (b) Describe the different classifications of association rule mining. (R07 APRIL/MAY 2011)
5. List and explain the five techniques to improve the efficiency of the Apriori algorithm. (R07 APRIL/MAY 2011)
6. What is divide and conquer? How does it help the FP-growth method generate frequent itemsets without candidate generation? (R07 APRIL/MAY 2011)
7. Describe an example of a data set for which the Apriori check would actually increase the cost. (R07 APRIL/MAY 2011)
8. Compare and contrast the Apriori algorithm with the frequent pattern growth algorithm. Consider a data set, apply both algorithms, and explain the results. (JNTU Dec-2010)
9. (a) Describe mining multidimensional association rules using static discretization of quantitative attributes.
   (b) Explain association rule generation from frequent itemsets. (JNTU Dec-2010)
10. Explain mining multilevel association rules from transaction databases. (JNTU Dec-2010)
11. (a) Explain how concept hierarchies are used in mining multilevel association rules.
    (b) Give the classification of association rules in detail. (JNTU Dec-2010)
12. Sequential patterns can be mined in methods similar to the mining of association rules. Design an efficient algorithm to mine multilevel sequential patterns from a transaction database. An example of such a pattern is the following: "A customer who buys a PC will buy Microsoft software within three months", on which one may drill down to find a more refined version of the pattern, such as "A customer who buys a Pentium PC will buy Microsoft Office within three months". (JNTU Aug/Sep 2008)
13. (a) Which algorithm is an influential algorithm for mining frequent itemsets for Boolean association rules? Explain.
    (b) What are additional rule constraints to guide mining? Explain. (JNTU Aug/Sep 2008)
14. (a) Discuss about association rule mining.
    (b) What are the approaches for mining multilevel association rules? Explain. (JNTU Aug/Sep 2008)
15. Explain the Apriori algorithm with an example. (JNTU Aug/Sep 2008, May 2008)
16. (a) Write the FP-growth algorithm. Explain.
    (b) What is an iceberg query? Explain with an example. (JNTU May 2008)
17. (a) How can we mine multilevel association rules efficiently using concept hierarchies? Explain.
    (b) Can we design a method that mines the complete set of frequent itemsets without candidate generation? If yes, explain with an example. (JNTU May 2008)

    UNIT-IV

1. (a) Discuss about concept hierarchy.
   (b) Briefly explain about classification of database systems. (R07 APRIL 2011)
2. (a) Explain distance-based discretization.
   (b) Give a detailed note on iceberg queries. (R07 APRIL 2011)
3. (a) Discuss the various measures available to judge a classifier.
   (b) Give a note on the naive Bayesian classifier. (R07 APRIL 2011)
4. (a) Discuss automatic classification of web documents.
   (b) Write about basic measures of text retrieval.
   (c) Explain mining raster databases. [5+6+5] (R07 APRIL 2011)
5. Explain in detail the major steps of decision tree classification. (R07 APRIL 2011)
6. Why perform attribute relevance analysis? Explain its various methods. (R07 APRIL/MAY 2011)
7. How will you solve a classification problem using decision trees? (R07 APRIL/MAY 2011)
8. Outline a data cube-based incremental algorithm for mining analytical class comparisons. (R07 APRIL/MAY 2011)
9. What is backpropagation? Explain classification by backpropagation. (R07 APRIL/MAY 2011)
10. Can we get classification rules from decision trees? If so, how? What are the enhancements to the basic decision tree? (R07 APRIL/MAY 2011)
11. Explain the various preprocessing steps to improve the accuracy, efficiency, and scalability of the classification or prediction process. (R07 APRIL/MAY 2011)
12. Explain in detail the major steps of decision tree classification. (JNTU Dec-2010)
13. (a) Discuss the five criteria for the evaluation of classification and prediction methods.
    (b) Explain how rules can be extracted from trained neural networks. (JNTU Dec-2010)
14. (a) Explain with an example a measure of the goodness of a split.
    (b) Write a detailed note on genetic algorithms for classification. (JNTU Dec-2010)
15. (a) Give a note on log-linear models.
    (b) Explain the holdout method for estimating classifier accuracy.
    (c) Discuss the fuzzy set approach for classification. [5+5+6] (JNTU Dec-2010)
16. Discuss about backpropagation classification. (JNTU Aug/Sep 2008)
17. The following table consists of training data from an employee database. The data have been generalized. For a given row entry, count represents the number of data tuples having the values for department, status, age and salary given in that row.
    Let salary be the class label attribute. Design a multilayer feed-forward neural network for the given data. Label the nodes in the input and output layers. (JNTU Aug/Sep 2008)
18. (a) Explain decision tree induction classification.
    (b) Describe backpropagation classification. (JNTU Aug/Sep 2008, May 2008)
19. (a) Can any ideas from association rule mining be applied to classification? Explain.
    (b) Explain training Bayesian belief networks.
    (c) How does tree pruning work? What are some enhancements to basic decision tree induction? (JNTU Aug/Sep 2008)
20. (a) What is classification? What is prediction?
    (b) What is Bayes' theorem? Explain about naive Bayesian classification.
    (c) Discuss about k-nearest-neighbor classifiers and case-based reasoning. (JNTU May 2008)
21. (a) Describe the data classification process with a neat diagram.
    (b) How does naive Bayesian classification work? Explain.
    (c) Explain classifier accuracy. (JNTU May 2008)
22. (a) Explain about the basic decision tree induction algorithm.
    (b) Discuss about Bayesian classification. (JNTU May 2008)
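As a pocket illustration for questions 20-22: naive Bayesian classification chooses the class C that maximizes P(C) x P(x1|C) x ... x P(xn|C), assuming the attributes are independent within each class. A minimal sketch in Python with made-up categorical training data (the attribute names and tuples are illustrative only; Laplace smoothing is applied):

    from collections import Counter, defaultdict

    # Toy training tuples: (weather, traffic) -> class label "arrive on time?".
    rows = [("sunny", "light", "yes"), ("rainy", "heavy", "no"),
            ("sunny", "heavy", "yes"), ("rainy", "light", "yes"),
            ("rainy", "heavy", "no")]

    prior = Counter(c for *_, c in rows)  # class counts
    cond = defaultdict(Counter)           # (attribute index, class) -> value counts
    for *xs, c in rows:
        for i, v in enumerate(xs):
            cond[(i, c)][v] += 1

    def posterior(xs, c):
        """Unnormalized P(c) * product of smoothed P(x_i | c)."""
        p = prior[c] / len(rows)
        for i, v in enumerate(xs):
            # Laplace smoothing; each toy attribute has 2 possible values.
            p *= (cond[(i, c)][v] + 1) / (prior[c] + 2)
        return p

    x = ("rainy", "heavy")
    print(max(prior, key=lambda c: posterior(x, c)))  # -> "no"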

    UNIT-V

1. (a) Explain competitive learning and self-organizing feature map methods for clustering.
   (b) Discuss in detail the BIRCH algorithm. (R07 APRIL 2011)
(b) What is an outlier? Explain outlier analysis in brief. (R07 APRIL 2011)
4. What is association analysis? Discuss cluster analysis. Explain the correlation between these two types of analysis. (R07 APRIL 2011)
5. (a) How are association rules mined from large databases? Explain.
   (b) Explain in detail constraint-based association mining. (R07 APRIL 2011)
6. (a) Give a detailed note on the CLIQUE algorithm.
   (b) Discuss the expectation-maximization algorithm for clustering. (R07 APRIL 2011)
7. (a) What are the fields in which clustering techniques are used?
   (b) What are the major requirements of clustering analysis? (R07 APRIL/MAY 2011)
8. Why is outlier mining important? Discuss different outlier detection approaches. Briefly discuss any two hierarchical clustering methods with suitable examples. (R07 APRIL/MAY 2011)
9. What are the different types of data used in cluster analysis? Explain each one in brief with an example. (R07 APRIL/MAY 2011)
10. (a) What are the differences between clustering and nearest-neighbor prediction?
    (b) Define nominal, ordinal, and ratio-scaled variables. (R07 APRIL/MAY 2011)
11. (a) Explain how the COBWEB method is used for clustering.
    (b) Discuss in detail DENCLUE clustering methods. (JNTU Dec-2010)
12. (a) Explain the k-means algorithm for clustering.
    (b) Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
        i. Compute the Manhattan distance between the two objects.
        ii. Compute the Euclidean distance between the two objects. (JNTU Dec-2010)

13. (a) Discuss interval-scaled variables and binary variables.
    (b) Explain in detail the k-medoids algorithm for clustering. (JNTU Dec-2010)
14. (a) Discuss distance-based outlier detection.
    (b) Explain the OPTICS algorithm for clustering. (JNTU Dec-2010)
15. (a) What major advantages does DENCLUE have in comparison with other clustering algorithms?
    (b) What advantages does STING offer over other clustering methods?
    (c) Why is wavelet transformation useful for clustering?
    (d) Explain about outlier analysis. (JNTU Aug/Sep 2008)
16. (a) Define mean absolute deviation, z-score, city block distance, and Minkowski distance.
    (b) What are the different types of hierarchical methods? Explain. (JNTU Aug/Sep 2008)
17. (a) Discuss about binary, nominal, ordinal and ratio-scaled variables.
    (b) Explain about grid-based methods. (JNTU Aug/Sep 2008)
18. (a) What are the categories of major clustering methods? Explain.
    (b) Explain about outlier analysis. (JNTU Aug/Sep 2008)
19. (a) Given the following measurements for the variable age:
        18, 22, 25, 42, 28, 43, 33, 35, 56, 28
    standardize the variable by the following:
        i. Compute the mean absolute deviation of age.
        ii. Compute the z-score for the first four measurements.
    (b) What is a distance-based outlier? What are efficient algorithms for mining distance-based outliers? How are outliers determined in this method? (JNTU May 2008)
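A worked sketch for question 19(a) in Python, using the standardization from Han & Kamber in which the z-score divides by the mean absolute deviation rather than the standard deviation:

    age = [18, 22, 25, 42, 28, 43, 33, 35, 56, 28]

    mean = sum(age) / len(age)                        # 330 / 10 = 33.0
    mad = sum(abs(x - mean) for x in age) / len(age)  # 88 / 10 = 8.8

    # Standardized measurement (z-score) for the first four values.
    for x in age[:4]:
        print(x, round((x - mean) / mad, 2))
    # 18 -> -1.70, 22 -> -1.25, 25 -> -0.91, 42 -> 1.02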

20. (a) Write algorithms for k-means and k-medoids. Explain.
    (b) Discuss about density-based methods. (JNTU May 2008)
21. (a) Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
        i. Compute the Euclidean distance between the two objects.
        ii. Compute the Manhattan distance between the two objects.
        iii. Compute the Minkowski distance between the two objects, using q = 3.
    (b) Explain about statistical-based outlier detection and deviation-based outlier detection. (JNTU May 2008)
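A short sketch covering the distance computations in questions 12(b) and 21(a), for the objects x = (22, 1, 42, 10) and y = (20, 0, 36, 8):

    x, y = (22, 1, 42, 10), (20, 0, 36, 8)

    def minkowski(a, b, q):
        # Minkowski distance; q = 1 gives Manhattan, q = 2 gives Euclidean.
        return sum(abs(i - j) ** q for i, j in zip(a, b)) ** (1 / q)

    print(minkowski(x, y, 1))  # Manhattan: 2 + 1 + 6 + 2 = 11
    print(minkowski(x, y, 2))  # Euclidean: sqrt(4 + 1 + 36 + 4) = sqrt(45) ~ 6.71
    print(minkowski(x, y, 3))  # Minkowski, q = 3: (8 + 1 + 216 + 8)**(1/3) ~ 6.15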

    UNIT-VI

    1. (a) Describe the essential features of temporal data and temporal inferences.

    (b) Discuss the major algorithms of the sequence mining problem. (NR APRIL 2011)

    UNIT-VII

1. (a) Discuss various ways to estimate the trend.
   (b) Explain the construction of a multilayered web information base. (R07 APRIL 2011)
(b) Explain the latent semantic indexing technique. (R07 APRIL 2011)
3. (a) Discuss data transformation from the time domain to the frequency domain.
   (b) Explain the HITS algorithm for web structure mining. (R07 APRIL 2011)
4. Write short notes on:
   (i) Mining spatial databases
   (ii) Mining the World Wide Web. (R07 APRIL/MAY 2011)
5. Write short notes on:
   (i) Data objects
   (ii) Sequence data mining
   (iii) Mining text databases. (R07 APRIL/MAY 2011)
6. (a) Which frequent itemset mining is suitable for text mining, and why? Explain.
   (b) Discuss the relationship between text mining and information retrieval and information extraction. (NR APRIL 2011)
7. (a) Describe the cosine measure for similarity in documents.
   (b) Explain in detail similarity search in time-series analysis. (JNTU Dec-2010)
8. (a) Explain spatial data cube construction and spatial OLAP.
   (b) Give a note on the item frequency matrix. (JNTU Dec-2010)
9. (a) Discuss web content mining and web usage mining.
   (b) Compare information retrieval with text mining. (JNTU Dec-2010)
10. (a) Explain spatial data cube construction and spatial OLAP.
    (b) Discuss about mining text databases. (JNTU Aug/Sep 2008)
11. (a) Define spatial database, multimedia database, time-series database, sequence database, and text database.
    (b) What is web usage mining? Explain with a suitable example. (JNTU Aug/Sep 2008)
12. A heterogeneous database system consists of multiple database systems that are defined independently, but that need to exchange and transform information among themselves and answer global queries. Discuss how to process a descriptive mining query in such a system using a generalization-based approach. (JNTU Aug/Sep 2008)
13. (a) How to mine multimedia databases? Explain.
    (b) Define web mining. What are the observations made in mining the web for effective resource and knowledge discovery?
    (c) What is web usage mining? (JNTU Aug/Sep 2008)
14. An e-mail database is a database that stores a large number of electronic mail messages. It can be viewed as a semistructured database consisting mainly of text data. Discuss the following:
    (a) How can such an e-mail database be structured so as to facilitate multidimensional search, such as by sender, by receiver, by subject, by time, and so on?
    (b) What can be mined from such an e-mail database?
    (c) Suppose you have roughly classified a set of your previous e-mail messages as junk, unimportant, normal, or important. Describe how a data mining system may take this as the training set to automatically classify new e-mail messages. (JNTU May 2008)
15. Suppose that a city transportation department would like to perform data analysis on highway traffic for the planning of highway construction based on city traffic data collected at different hours every day.
    (a) Design a spatial data warehouse that stores the highway traffic information so that people can easily see the average and peak-time traffic flow by highway, by time of day, and by weekday, and the traffic situation when a major accident occurs.
    (b) What information can we mine from such a spatial data warehouse to help city planners?
    (c) This data warehouse contains both spatial and temporal data. Propose one mining technique that can efficiently mine interesting patterns from such a spatio-temporal data warehouse. (JNTU May 2008)
16. Explain the following:
    (a) Construction and mining of object cubes
    (b) Mining associations in multimedia data
    (c) Periodicity analysis
    (d) Latent semantic indexing (JNTU May 2008)
17. (a) Give an example of generalization-based mining of plan databases by divide-and-conquer.
    (b) What is sequential pattern mining? Explain.
    (c) Explain the construction of a multilayered web information base. (JNTU May 2008)

    UNIT-VIII


    ASSIGNMENT QUESTIONS:

    UNIT-I

1. Describe three challenges to data mining regarding data mining methodology and user interaction issues.
2. What are the major challenges of mining a huge amount of data (such as billions of tuples) in comparison with mining a small amount of data (such as a data set of a few hundred tuples)?
3. Outline the major research challenges of data mining in one specific application domain, such as stream/sensor data analysis, spatiotemporal data analysis, or bioinformatics.
4. What is the difference between discrimination and classification? Between characterization and clustering? Between classification and prediction? For each of these pairs of tasks, how are they similar?
5. Outliers are often discarded as noise. However, one person's garbage could be another's treasure. For example, exceptions in credit card transactions can help us detect the fraudulent use of credit cards. Taking fraud detection as an example, propose two methods that can be used to detect outliers and discuss which one is more reliable.
6. Recent applications pay special attention to spatiotemporal data streams. A spatiotemporal data stream contains spatial information that changes over time, and is in the form of stream data (i.e., the data flow in and out like possibly infinite streams).
   (a) Present three application examples of spatiotemporal data streams.
   (b) Discuss what kind of interesting knowledge can be mined from such data streams, with limited time and resources.
   (c) Identify and discuss the major challenges in spatiotemporal data mining.
   (d) Using one application example, sketch a method to mine one kind of knowledge from such stream data efficiently.
7. Describe the differences between the following approaches for the integration of a data mining system with a database or data warehouse system: no coupling, loose coupling, semi-tight coupling, and tight coupling.
8. Present an example where data mining is crucial to the success of a business. What data mining functions does this business need? Can they be performed alternatively by data query processing or simple statistical analysis?
9. Give three additional commonly used statistical measures (i.e., not illustrated in this chapter) for the characterization of data dispersion, and discuss how they can be computed efficiently in large databases.
10. In many applications, new data sets are incrementally added to the existing large data sets. Thus an important consideration for computing descriptive data summaries is whether a measure can be computed efficiently in an incremental manner. Use count, standard deviation, and median as examples to show that a distributive or algebraic measure facilitates efficient incremental computation, whereas a holistic measure does not.
11. In real-world data, tuples with missing values for some attributes are a common occurrence. Describe various methods for handling this problem.
12. Use a flowchart to summarize the following procedures for attribute subset selection:
    (a) stepwise forward selection
    (b) stepwise backward elimination
    (c) a combination of forward selection and backward elimination
13. The median is one of the most important holistic measures in data analysis. Propose several methods for median approximation. Analyze their respective complexity under different parameter settings and decide to what extent the real value can be approximated. Moreover, suggest a heuristic strategy to balance between accuracy and complexity and then apply it to all methods you have given.
14. Explain why it is not possible to analyze some large data sets using classical modeling techniques. Explain the differences between statistical and machine-learning approaches to the analysis of large data sets.
15. Why are preprocessing and dimensionality reduction important phases in successful data-mining applications?
16. Give examples of data where the time component may be recognized explicitly and other data where the time component is given implicitly in the data organization.
17. Why is it important that the data miner understand the data well? Give examples of structured, semi-structured, and unstructured data from everyday situations.
18. Can a set with 50,000 samples be called a large data set? Explain your answer. Enumerate the tasks that a data warehouse may solve as part of the data mining process.
19. Which of the following quantities is likely to show more temporal autocorrelation: daily rainfall or daily temperature? Why?
20. Many sciences rely on observation instead of (or in addition to) designed experiments. Compare the data quality issues involved in observational science with those of experimental science and data mining.

UNIT-II

1. Suppose that a data warehouse consists of the four dimensions date, spectator, location, and game, and the two measures count and charge, where charge is the fare that a spectator pays when watching a game on a given date. Spectators may be students, adults, or seniors, with each category having its own charge rate.
   (a) Draw a star schema diagram for the data warehouse.
   (b) Starting with the base cuboid [date, spectator, location, game], what specific OLAP operations should one perform in order to list the total charge paid by student spectators at GM Place in 2004?
   (c) Bitmap indexing is useful in data warehousing. Taking this cube as an example, briefly discuss advantages and problems of using a bitmap index structure.
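For part (b), the usual answer chains OLAP operations: roll up on date from day to year, then dice on spectator = "student", location = "GM Place", and date = 2004. As an illustration of the same aggregation over a flattened fact table, here is a hedged sketch using an assumed pandas DataFrame named fact (the rows are toy data, not part of the question):

    import pandas as pd

    # Assumed toy fact table; columns mirror the cube's dimensions and measure.
    fact = pd.DataFrame({
        "year":      [2004, 2004, 2004, 2005],
        "spectator": ["student", "adult", "student", "student"],
        "location":  ["GM Place", "GM Place", "Rogers Arena", "GM Place"],
        "charge":    [120.0, 250.0, 90.0, 130.0],
    })

    # Dice to student spectators at GM Place in 2004, then aggregate the measure.
    sel = fact[(fact.spectator == "student")
               & (fact.location == "GM Place")
               & (fact.year == 2004)]
    print(sel["charge"].sum())  # total charge paid -> 120.0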

2. A data warehouse can be modeled by either a star schema or a snowflake schema. Briefly describe the similarities and differences of the two models, and then analyze their advantages and disadvantages with regard to one another. Give your opinion of which might be more empirically useful and state the reasons behind your answer.
3. Design a data warehouse for a regional weather bureau. The weather bureau has about 1,000 probes, which are scattered throughout various land and ocean locations in the region to collect basic weather data, including air pressure, temperature, and precipitation at each hour. All data are sent to the central station, which has collected such data for over 10 years. Your design should facilitate efficient querying and on-line analytical processing, and derive general weather patterns in multidimensional space.
4. A popular data warehouse implementation is to construct a multidimensional database, known as a data cube. Unfortunately, this may often generate a huge, yet very sparse, multidimensional matrix. Present an example illustrating such a huge and sparse data cube.
5. State why, for the integration of multiple heterogeneous information sources, many companies in industry prefer the update-driven approach (which constructs and uses data warehouses), rather than the query-driven approach (which applies wrappers and integrators).
6. Describe situations where the query-driven approach is preferable over the update-driven approach.
7. What is a factless fact table? Design a simple STAR schema with a factless fact table to track patients in a hospital by diagnostic procedures and time.
8. You are the data design specialist on the data warehouse project team for a manufacturing company. Design a STAR schema to track production quantities. Production quantities are normally analyzed along the business dimensions of product, time, parts used, production facility, and production run. State your assumptions.
9. In a STAR schema to track the shipments for a distribution company, the following dimension tables are found: (1) time, (2) customer ship-to, (3) ship-from, (4) product, (5) type of deal, and (6) mode of shipment. Review these dimensions and list the possible attributes for each of the dimension tables. Also, designate a primary key for each table.
10. A dimension table is wide; the fact table is deep. Explain.
11. A data warehouse is subject-oriented. What would be the major critical business subjects for the following companies?
    (a) an international manufacturing company
    (b) a local community bank
    (c) a domestic hotel chain
12. You are the data analyst on the project team building a data warehouse for an insurance company. List the possible data sources from which you will bring the data into your data warehouse. State your assumptions.
13. For an airline company, identify three operational applications that would feed into the data warehouse. What would be the data load and refresh cycles?
14. Prepare a table showing all the potential users and information delivery methods for a data warehouse supporting a large national grocery chain.
15. Briefly compare the following concepts. You may use an example to explain your point(s).
    (a) Snowflake schema, fact constellation, starnet query model
    (b) Data cleaning, data transformation, refresh
    (c) Enterprise warehouse, data mart, virtual warehouse
16. Suppose that a data warehouse consists of the three dimensions time, doctor, and patient, and the two measures count and charge, where charge is the fee that a doctor charges a patient for a visit.
    (a) Enumerate three classes of schemas that are popularly used for modeling data warehouses.
    (b) Draw a schema diagram for the above data warehouse using one of the schema classes listed in (a).
    (c) Starting with the base cuboid [day, doctor, patient], what specific OLAP operations should be performed in order to list the total fee collected by each doctor in 2004?
    (d) To obtain the same list, write an SQL query assuming the data are stored in a relational database with the schema fee(day, month, year, doctor, hospital, patient, count, charge).
17. Suppose that a data warehouse for Big University consists of the following four dimensions: student, course, semester, and instructor, and two measures count and avg_grade. When at the lowest conceptual level (e.g., for a given student, course, semester, and instructor combination), the avg_grade measure stores the actual course grade of the student. At higher conceptual levels, avg_grade stores the average grade for the given combination.
    (a) Draw a snowflake schema diagram for the data warehouse.
    (b) Starting with the base cuboid [student, course, semester, instructor], what specific OLAP operations (e.g., roll-up from semester to year) should one perform in order to list the average grade of CS courses for each Big University student?

    UNIT-III

1. The database below has five transactions. Let min_sup = 60% and min_conf = 80%.
   (a) Find all frequent itemsets using Apriori and FP-growth, respectively. Compare the efficiency of the two mining processes.
   (b) List all of the strong association rules (with support s and confidence c) matching the following metarule, where X is a variable representing customers, and item_i denotes variables representing items (e.g., A, B, etc.):
       ∀X ∈ transaction, buys(X, item1) ∧ buys(X, item2) ⇒ buys(X, item3) [s, c]

2. Consider the following data set.

Cust_ID   Transaction ID   Items Bought
1         0001             {a, d, e}
1         0024             {a, b, c, e}
2         0012             {a, b, d, e}
2         0031             {a, c, d, e}
3         0015             {b, c, e}
3         0022             {b, d, e}
4         0029             {c, d}
4         0040             {a, b, c}
5         0033             {a, d, e}
5         0038             {a, b, e}


(a) Compute the support for the itemsets {e}, {b, d}, and {b, d, e} by treating each transaction ID as a market basket.
(b) Use the results in part (a) to compute the confidence for the association rules {b, d} → {e} and {e} → {b, d}. Is confidence a symmetric measure?
(c) Repeat part (a) by treating each customer ID as a market basket. Each item should be treated as a binary variable (1 if an item appears in at least one transaction bought by the customer, and 0 otherwise).
(d) Use the results in part (c) to compute the confidence for the association rules {b, d} → {e} and {e} → {b, d}.
(e) Suppose s1 and c1 are the support and confidence values of an association rule r when treating each transaction ID as a market basket. Also, let s2 and c2 be the support and confidence values of r when treating each customer ID as a market basket. Discuss whether there are any relationships between s1 and s2, or between c1 and c2.
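A minimal Java sketch of the support and confidence arithmetic in parts (a) and (b), treating each transaction ID as a basket (the class name and layout here are ours, purely for illustration):

    import java.util.*;

    public class SupportCalc {
        // Count how many baskets contain every item of the candidate itemset.
        static long supportCount(List<Set<String>> baskets, Set<String> itemset) {
            return baskets.stream().filter(b -> b.containsAll(itemset)).count();
        }

        public static void main(String[] args) {
            // Baskets from question 2, one per transaction ID.
            List<Set<String>> baskets = List.of(
                Set.of("a","d","e"), Set.of("a","b","c","e"), Set.of("a","b","d","e"),
                Set.of("a","c","d","e"), Set.of("b","c","e"), Set.of("b","d","e"),
                Set.of("c","d"), Set.of("a","b","c"), Set.of("a","d","e"), Set.of("a","b","e"));

            double n = baskets.size();
            double sE   = supportCount(baskets, Set.of("e")) / n;          // s({e})
            double sBD  = supportCount(baskets, Set.of("b","d")) / n;      // s({b,d})
            double sBDE = supportCount(baskets, Set.of("b","d","e")) / n;  // s({b,d,e})

            // Confidence is s(X ∪ Y) / s(X); note it is not symmetric in general.
            System.out.printf("c({b,d}->{e}) = %.2f%n", sBDE / sBD);
            System.out.printf("c({e}->{b,d}) = %.2f%n", sBDE / sE);
        }
    }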

3. Implement the frequent itemset mining algorithm Apriori, using a programming language that you are familiar with, such as C++ or Java. Compare the performance of the algorithm on various kinds of large data sets. Write a report to analyze the situations (such as data size, data distribution, minimal support threshold setting, and pattern density) where one algorithm may perform better than the others, and state why.
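A compact, illustrative Apriori sketch in Java follows (class and method names are ours; the standard subset-based candidate pruning is folded into the counting pass for brevity, so treat this as a starting point rather than a tuned implementation):

    import java.util.*;

    public class Apriori {
        // Returns all itemsets whose support count meets minSup, level by level.
        static Map<Set<String>, Integer> run(List<Set<String>> db, int minSup) {
            Map<Set<String>, Integer> frequent = new LinkedHashMap<>();
            // L1: count and keep the frequent single items.
            Map<Set<String>, Integer> current = new HashMap<>();
            for (Set<String> t : db)
                for (String item : t)
                    current.merge(Set.of(item), 1, Integer::sum);
            current.values().removeIf(c -> c < minSup);

            while (!current.isEmpty()) {
                frequent.putAll(current);
                // Candidate generation: join frequent k-itemsets whose union has size k+1.
                Set<Set<String>> candidates = new HashSet<>();
                List<Set<String>> keys = new ArrayList<>(current.keySet());
                for (int i = 0; i < keys.size(); i++)
                    for (int j = i + 1; j < keys.size(); j++) {
                        Set<String> union = new TreeSet<>(keys.get(i));
                        union.addAll(keys.get(j));
                        if (union.size() == keys.get(i).size() + 1)
                            candidates.add(union);
                    }
                // Count candidate supports in one scan; infrequent ones are dropped.
                Map<Set<String>, Integer> next = new HashMap<>();
                for (Set<String> t : db)
                    for (Set<String> c : candidates)
                        if (t.containsAll(c)) next.merge(c, 1, Integer::sum);
                next.values().removeIf(c -> c < minSup);
                current = next;
            }
            return frequent;
        }
    }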

4. Suppose that a large store has a transaction database that is distributed among four locations. Transactions in each component database have the same format, namely Tj: {i1, ..., im}, where Tj is a transaction identifier and ik (1 ≤ k ≤ m) is the identifier of an item purchased in the transaction. Propose an efficient algorithm to mine global association rules. Your algorithm should not require shipping all of the data to one site and should not cause excessive network communication overhead.


Using the threshold values support = 25% and confidence = 60%:
(a) find all large itemsets in database X;
(b) find strong association rules for database X;
(c) analyze misleading associations for the rule set obtained in (b).

9. From the transactional database of question 7, find the FP-tree for this database if the
(a) support threshold is 5;
(b) support threshold is 3;
(c) support threshold is 4.
Draw the tree step by step.

10. Why is the process of discovering association rules relatively simple compared with generating large itemsets in transactional databases?

11. Solve question 18 with a support of 50% and confidence of 60%.
12. What are the common values for the support and confidence parameters in the Apriori algorithm? Explain using the retail industry as an example.

13. Propose and outline a level-shared mining approach to mining multilevel association rules in which each item is encoded by its level position, and an initial scan of the database collects the count for each item at each concept level, identifying frequent and subfrequent items. Comment on the processing cost of mining multilevel associations with this method in comparison to mining single-level associations.

14. Implement the frequent itemset mining algorithm FP-growth, using a programming language that you are familiar with, such as C++ or Java. Compare the performance of the algorithm on various kinds of large data sets. Write a report to analyze the situations (such as data size, data distribution, minimal support threshold setting, and pattern density) where one algorithm may perform better than the others, and state why.
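As a starting point for the FP-growth implementation, a minimal FP-tree node with prefix-sharing insertion might look like the sketch below (the header table and node-links that the mining phase needs are omitted, and all names are illustrative):

    import java.util.*;

    // Illustrative FP-tree node; the class and field names are ours, not from a library.
    class FPNode {
        String item;
        int count;
        Map<String, FPNode> children = new HashMap<>();

        FPNode(String item) { this.item = item; }

        // Insert one transaction whose items are already sorted by descending
        // global frequency (the order FP-growth requires). Shared prefixes are
        // not duplicated; their counts are simply incremented.
        void insert(List<String> sortedItems) {
            if (sortedItems.isEmpty()) return;
            String first = sortedItems.get(0);
            FPNode child = children.computeIfAbsent(first, FPNode::new);
            child.count++;
            child.insert(sortedItems.subList(1, sortedItems.size()));
        }
    }

A tree is built by creating a root with a null item and calling insert once per (frequency-sorted, pruned) transaction.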

15. Solve question 1 with a support of 50% and confidence of 60%.
16. Using the data in question 2, find all association rules using FP-growth with a support of 50% and confidence of 60%.

17. Suppose we have market basket data consisting of 100 transactions and 20 items. The support for item a is 25%, the support for item b is 90%, and the support for itemset {a, b} is 20%. Let the support and confidence thresholds be 10% and 60%, respectively.
(a) Compute the confidence of the association rule {a} → {b}. Is the rule interesting according to the confidence measure?
(b) Compute the interest measure for the association pattern {a, b}. Describe the nature of the relationship between item a and item b in terms of the interest measure.
(c) What conclusions can you draw from the results of parts (a) and (b)?


(d) Prove that if the confidence of the rule {a} → {b} is less than the support of {b}, then:
i. c({ā} → {b}) > c({a} → {b}),
ii. c({ā} → {b}) > s({b}),
where ā denotes the absence of item a, c(·) denotes rule confidence, and s(·) denotes the support of an itemset.
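As a quick check on the arithmetic in parts (a) and (b), with the interest measure taken in its usual lift form:

\[
c(\{a\} \rightarrow \{b\}) = \frac{s(\{a,b\})}{s(\{a\})} = \frac{0.20}{0.25} = 0.80 \ge 0.60,
\qquad
I(a,b) = \frac{s(\{a,b\})}{s(\{a\})\, s(\{b\})} = \frac{0.20}{0.25 \times 0.90} \approx 0.89 < 1.
\]

So the rule clears the confidence threshold, yet the interest value below 1 indicates that a and b are negatively correlated; this tension is what part (c) asks you to discuss.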

18. Using the data in question 2, find all association rules using Apriori with a support of 50% and confidence of 60%.

19. From the transactional database of question 7, find the FP-tree for this database if the
(a) support threshold is 2;
(b) support threshold is 3;
(c) support threshold is 5.
Draw the tree step by step.

20. Generate association rules from the following data set. The minimum support is taken as 22% and the minimum confidence as 70%.

TID   List of Items
T1    I1, I2, I5
T2    I2, I4
T3    I2, I3
T4    I1, I2, I4
T5    I1, I3
T6    I2, I3
T7    I1, I3
T8    I1, I2, I3, I5
T9    I1, I2, I3

UNIT-IV

1) It is difficult to assess classification accuracy when individual data objects may belong to more than one class at a time. In such cases, comment on what criteria you would use to compare different classifiers modeled after the same data.

2) Show that accuracy is a function of sensitivity and specificity; that is, prove the equation
accuracy = sensitivity · pos / (pos + neg) + specificity · neg / (pos + neg),
where pos and neg denote the numbers of positive and negative tuples, respectively.

3) Suppose that we would like to select between two prediction models, M1 and M2. We have performed 10 rounds of 10-fold cross-validation on each model, where the same data partitioning in round i is used for both M1 and M2. The error rates obtained for M1 are 30.5, 32.2, 20.7, 20.6, 31.0, 41.0, 27.7, 26.0, 21.5, 26.0. The error rates for M2 are 22.4, 14.5, 22.4, 19.6, 20.7, 20.4, 22.1, 19.4, 16.2, 35.0. Comment on whether one model is significantly better than the other, considering a significance level of 1%.
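A small Java sketch of the paired t-test arithmetic this question calls for (names are ours; the critical value quoted in the comment is the standard two-sided 1% value for 9 degrees of freedom):

    import java.util.Arrays;
    import java.util.stream.IntStream;

    public class PairedTTest {
        // Paired t statistic over per-round error differences, as used to
        // compare two models under the same cross-validation partitions.
        static double tStatistic(double[] e1, double[] e2) {
            int k = e1.length;
            double[] d = IntStream.range(0, k).mapToDouble(i -> e1[i] - e2[i]).toArray();
            double mean = Arrays.stream(d).average().orElse(0);
            double var = Arrays.stream(d).map(x -> (x - mean) * (x - mean)).sum() / (k - 1);
            return mean / Math.sqrt(var / k);
        }

        public static void main(String[] args) {
            double[] m1 = {30.5, 32.2, 20.7, 20.6, 31.0, 41.0, 27.7, 26.0, 21.5, 26.0};
            double[] m2 = {22.4, 14.5, 22.4, 19.6, 20.7, 20.4, 22.1, 19.4, 16.2, 35.0};
            // Compare |t| with the critical value for 9 degrees of freedom
            // at the two-sided 1% level (about 3.25).
            System.out.printf("t = %.3f%n", tStatistic(m1, m2));
        }
    }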

4) Given a decision tree, you have the option of (i) converting the decision tree to rules and then pruning the resulting rules, or (ii) pruning the decision tree and then converting the pruned tree to rules. What advantage does (i) have over (ii)?

5) Take a data set with at least 5 attributes and 15 records and apply decision tree (information gain) classification.

6) Why is naïve Bayesian classification called "naïve"? Briefly outline the major ideas of naïve Bayesian classification.

7) Take a data set with at least 5 attributes and 15 records and apply decision tree (gain ratio) classification.


8) The following table shows the midterm and final exam grades obtained for students in a database course.

Midterm exam (x)   Final exam (y)
72                 84
50                 63
81                 77
74                 78
94                 90
86                 75
59                 49
83                 79
65                 77
33                 52
88                 74
81                 90

(i) Plot the data. Do x and y seem to have a linear relationship?
(ii) Use the method of least squares to find an equation for the prediction of a student's final exam grade based on the student's midterm grade in the course.
(iii) Predict the final exam grade of a student who received an 86 on the midterm exam.
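A minimal Java sketch of the least-squares computation for part (ii), using the closed-form slope and intercept formulas (the class name and output format are illustrative):

    public class LeastSquares {
        public static void main(String[] args) {
            // Midterm (x) and final (y) grades from question 8.
            double[] x = {72, 50, 81, 74, 94, 86, 59, 83, 65, 33, 88, 81};
            double[] y = {84, 63, 77, 78, 90, 75, 49, 79, 77, 52, 74, 90};
            int n = x.length;
            double sx = 0, sy = 0, sxy = 0, sxx = 0;
            for (int i = 0; i < n; i++) {
                sx += x[i]; sy += y[i]; sxy += x[i] * y[i]; sxx += x[i] * x[i];
            }
            // Slope and intercept of the least-squares line y = w1*x + w0.
            double w1 = (n * sxy - sx * sy) / (n * sxx - sx * sx);
            double w0 = (sy - w1 * sx) / n;
            System.out.printf("y = %.3f x + %.3f; prediction for x = 86: %.1f%n",
                              w1, w0, w1 * 86 + w0);
        }
    }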

9) Consider the following data set for a binary class problem.

A   B   Class
T   F   +
T   T   +
T   T   +
T   F   -
T   T   +
F   F   -
F   F   -
F   F   -
T   T   -
T   F   -

(a) Calculate the information gain when splitting on A and B. Which attribute would the decision tree induction algorithm choose?
(b) Calculate the gain in the Gini index when splitting on A and B. Which attribute would the decision tree induction algorithm choose?
(c) In real-world data, tuples with missing values for some attributes are a common occurrence. Describe various methods for handling this problem.
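For parts (a) and (b), the node-impurity arithmetic can be checked with a small helper like the Java sketch below (names are ours; the example call uses the 4 '+' / 6 '-' class split of the full data set above):

    public class SplitMeasures {
        // Entropy of a two-class node given the counts of each class.
        static double entropy(int pos, int neg) {
            double n = pos + neg, e = 0;
            for (double c : new double[]{pos, neg})
                if (c > 0) e -= (c / n) * (Math.log(c / n) / Math.log(2));
            return e;
        }

        // Gini index of a two-class node.
        static double gini(int pos, int neg) {
            double n = pos + neg;
            double p = pos / n, q = neg / n;
            return 1 - p * p - q * q;
        }

        public static void main(String[] args) {
            // Parent node of question 9: 4 '+' and 6 '-' tuples.
            System.out.printf("entropy = %.3f, gini = %.3f%n", entropy(4, 6), gini(4, 6));
        }
    }

The gain for a split is then the parent's measure minus the weighted average of the children's measures.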

10) The following table summarizes a data set with three attributes A, B, C and two class labels + and -. Build a two-level decision tree.


(a) According to the classification error rate, which attribute would be chosen as the first splitting attribute? For each attribute, show the contingency table and the gains in classification error rate.
(b) Repeat for the two children of the root node.
(c) How many instances are misclassified by the resulting decision tree?
(d) Repeat parts (a), (b), and (c) using C as the splitting attribute.

11) Using the Bayesian method and the data set given in the following table, with the credit rating as the class label attribute:

Customer ID   Height    Hair    Eyes    Credit Rating
E1            Short     Dark    Blue    A
E2            Tall      Dark    Black   B
E3            Tall      Dark    Blue    B
E4            Tall      Red     Black   C
E5            Short     Blond   Brown   B
E6            Tall      Blond   Black   B
E7            Average   Blond   Blue    C
E8            Average   Blond   Blue    B
E9            Tall      Grey    Blue    A
E10           Average   Grey    Black   B
E11           Tall      Blond   Brown   A
E12           Short     Blond   Blue    B
E13           Average   Grey    Brown   B
E14           Tall      Red     Brown   C

predict the credit rating for a new customer who is
a) short, has red hair and blue eyes;


b) tall, has blond hair and brown eyes.
Also, predict the eye colour when the credit rating is B, the height is short, and the hair is dark.
12) Apply the Gini index to the data in the table of question 11 and create a decision tree.
13) RainForest is an interesting scalable algorithm for decision tree induction. Develop a scalable naïve Bayesian classification algorithm that requires just a single scan of the entire data set for most databases. Discuss whether such an algorithm can be refined to incorporate boosting to further enhance its classification accuracy.
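Relating to questions 11 and 13, a minimal count-based naïve Bayesian classifier can be sketched as follows; a single pass over the tuples (observe) fills all the tables, which is the single-scan property question 13 asks for. All names are illustrative, and Laplace smoothing is used to avoid zero probabilities:

    import java.util.*;

    public class NaiveBayesSketch {
        Map<String, Integer> classCount = new HashMap<>();
        Map<String, Integer> attrCount = new HashMap<>();     // key: class|index|value
        Map<Integer, Set<String>> domain = new HashMap<>();   // distinct values per attribute
        int total = 0;

        // One call per training tuple: a single scan fills all the count tables.
        void observe(String[] attrs, String label) {
            total++;
            classCount.merge(label, 1, Integer::sum);
            for (int i = 0; i < attrs.length; i++) {
                attrCount.merge(label + "|" + i + "|" + attrs[i], 1, Integer::sum);
                domain.computeIfAbsent(i, k -> new HashSet<>()).add(attrs[i]);
            }
        }

        // Pick the class maximizing log P(C) + sum_i log P(attr_i | C), Laplace-smoothed.
        String predict(String[] attrs) {
            String best = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (Map.Entry<String, Integer> e : classCount.entrySet()) {
                double score = Math.log(e.getValue() / (double) total);
                for (int i = 0; i < attrs.length; i++) {
                    int c = attrCount.getOrDefault(e.getKey() + "|" + i + "|" + attrs[i], 0);
                    score += Math.log((c + 1.0) / (e.getValue() + domain.get(i).size()));
                }
                if (score > bestScore) { bestScore = score; best = e.getKey(); }
            }
            return best;
        }
    }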

14) Compare the advantages and disadvantages of eager classification (e.g., decision tree, Bayesian, neural network) versus lazy classification (e.g., k-nearest neighbor, case-based reasoning).

15) Design an efficient method that performs effective naïve Bayesian classification over an infinite data stream (i.e., you can scan the data stream only once). If we wanted to discover the evolution of such classification schemes (e.g., comparing the classification scheme at this moment with earlier schemes, such as one from a week ago), what modified design would you suggest?

16) What is boosting? State why it may improve the accuracy of decision tree induction.
17) The support vector machine (SVM) is a highly accurate classification method. However, SVM classifiers suffer from slow processing when training with a large set of data tuples. Discuss how to overcome this difficulty and develop a scalable SVM algorithm for efficient SVM classification in large data sets.

18) Write an algorithm for k-nearest-neighbor classification given k and n, the number of attributes describing each tuple.
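One possible shape for such an algorithm, sketched in Java (names are ours; the training list is sorted in place, so it must be mutable, and attributes are assumed to be numeric with Euclidean distance):

    import java.util.*;

    public class KNN {
        // One labeled tuple with n numeric attributes.
        record Tuple(double[] attrs, String label) {}

        static double dist(double[] a, double[] b) {
            double s = 0;
            for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
            return Math.sqrt(s);
        }

        // Classify q by majority vote among its k nearest training tuples.
        static String classify(List<Tuple> train, double[] q, int k) {
            train.sort(Comparator.comparingDouble(t -> dist(t.attrs(), q)));
            Map<String, Integer> votes = new HashMap<>();
            for (Tuple t : train.subList(0, Math.min(k, train.size())))
                votes.merge(t.label(), 1, Integer::sum);
            return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
        }
    }

A full sort is the simplest correct choice; a bounded priority queue of size k would avoid the O(n log n) cost for large training sets.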

19) What is associative classification? Why is associative classification able to achieve higher classification accuracy than a classical decision tree method? Explain how associative classification can be used for text document classification.

20) Why is tree pruning useful in decision tree induction? What is a drawback of using a separate set of tuples to evaluate pruning?