
23336437-warehouse


7/28/2019 23336437-warehouse

http://slidepdf.com/reader/full/23336437-warehouse 1/11

 

A data warehouse is a repository of integrated information, available for queries and analysis. Data and information are extracted from heterogeneous sources as they are generated and processed using process managers (load/warehouse/query). This makes it much easier and more efficient to run queries over data that originally came from different sources. It also enables people to make informed decisions.

Data mining draws on the data warehouse, revealing patterns in historical data, whether customer data or any other data, in ways that were not previously possible. It combines techniques such as statistical analysis, data visualization, induction and neural networks. Data mining systems improve an organization’s effectiveness, efficiency and value by increasing the usefulness of the knowledge the organization possesses.

Data warehousing & Data Mining- A View:


Data Warehousing – Introduction:

Today’s business world is highly competitive. Established companies are trying to increase their profits while also improving the quality of their products and activities. Every organization therefore holds a large and steadily growing body of information, and its systems need to adapt accordingly. Traditional operational systems, however, were never designed to support this kind of business activity. Even the latest operational technology cannot be optimized to support both operational and multidimensional requirements cost-effectively. To meet these needs there is a new breed of system, known as a DATA WAREHOUSE.

Data warehousing- Definition:

Data warehouses are built to support large data volumes (above 100 GB of database) cost-effectively. The underlying store can be a relational database, multidimensional database, flat file, hierarchical database, object database, etc.

The more difficult challenge is to architect systems that keep meeting requirements automatically on a continuing basis. As a business area grows and changes, its needs change too. Data warehouses are designed to ride with these changes by always building a degree of flexibility into the system. The real problem is that the business itself may not be aware of its future requirements, so in designing a data warehouse we use today’s requirements and make informed guesses about the future.

Operational systems Vs Data warehouse systems:

Operational systems, which enable day-to-day functions, feed the data warehouse, but the two differ in many respects. Operational systems are process-oriented and support transaction processing. They depend on current data and update it regularly, performing fast inserts and updates of small amounts of data. Data warehouse systems, on the other hand, are subject-oriented and support analytical processing. They depend on historical data, almost all of which is read-only, and perform fast retrievals of large amounts of data.
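The contrast between the two workloads can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table name, columns and figures are invented, and a single in-memory table stands in for both the operational system and the warehouse.

```python
import sqlite3

# Hypothetical schema and figures, chosen only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)

# Operational (transaction-processing) style: small, frequent writes on current data.
conn.execute("INSERT INTO sales (region, year, amount) VALUES (?, ?, ?)", ("North", 2019, 120.0))
conn.execute("UPDATE sales SET amount = ? WHERE id = ?", (125.0, 1))

# Historical rows of the kind a warehouse accumulates over time.
history = [("North", 2017, 900.0), ("North", 2018, 1100.0),
           ("South", 2017, 700.0), ("South", 2018, 650.0)]
conn.executemany("INSERT INTO sales (region, year, amount) VALUES (?, ?, ?)", history)
conn.commit()

# Analytical (warehouse) style: a read-only query that scans many rows
# and summarizes them by subject (here, by region).
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```

The writes touch one row at a time, while the final query reads every row and changes none, which is the read-mostly pattern described above.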


Data-warehouse –Decision making:

A decision support system or tool is one specifically designed to allow business end users to perform computer-generated analyses of their own. Many data warehouses are not used as decision support systems, and decision support tools do not necessarily require a data warehouse as a source of data: the most widely used decision support tools are spreadsheets that are not connected in any automated way to a data warehouse.

Data warehouse- goals:

The fundamental goal is to give users appropriate access to a homogenized and comprehensive view of the organization. The data warehouse also supports forecasting, planning and decision-making processes. An additional goal is to achieve information consistency and to provide security and adaptability.

Data warehouse-process flow:

The process flow is represented as follows:

• Extract and load the data: Data extraction takes the data from the source systems and makes it available to the data warehouse, whereas data load takes the extracted data and loads it into the data warehouse.

• Clean and transform data: This step performs consistency checks on the loaded data and then structures it for query performance and for minimizing operational costs.

Data warehousing-“Errors”: The possible errors encountered in a data warehouse are:

• Incomplete errors: missing records, etc.
• Incorrect errors: wrong (but sometimes right) codes, wrong calculations, etc.
• Incomprehensibility errors: unknown codes, spreadsheets and word processing files, etc.
• Inconsistency errors: inconsistent use of different codes, overlapping codes, etc.

• Back up and archive data: The data is backed up regularly, and older data is removed from the system in a format that allows it to be quickly restored if required.

• Query management: This step manages the queries and speeds them up by directing each query to the most effective data source, and it also monitors the actual query profiles.
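The extract, load and clean steps above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the field names, the table of valid status codes and the in-memory lists that stand in for the source system, the staging area and the warehouse.

```python
# Hypothetical valid codes and source records, invented for this sketch.
VALID_STATUS = {"A": "active", "C": "closed"}

source_rows = [
    {"id": 1, "status": "A", "amount": "100.50"},
    {"id": 2, "status": "X", "amount": "20.00"},   # unknown code
    {"id": 3, "status": "C", "amount": None},      # missing value
    {"id": 4, "status": "a", "amount": "75.25"},   # inconsistent use of a code
]

def extract(rows):
    """Extract: copy the rows out of the source system."""
    return [dict(row) for row in rows]

def clean_and_transform(rows):
    """Run consistency checks, then structure the good rows for querying."""
    clean, rejected = [], []
    for row in rows:
        if row["amount"] is None:
            rejected.append((row["id"], "incomplete"))
            continue
        code = row["status"].upper()               # normalize inconsistent codes
        if code not in VALID_STATUS:
            rejected.append((row["id"], "incomprehensible"))
            continue
        clean.append({"id": row["id"],
                      "status": VALID_STATUS[code],
                      "amount": float(row["amount"])})
    return clean, rejected

staging = extract(source_rows)                     # extract and load into staging
clean_rows, rejected_rows = clean_and_transform(staging)
warehouse = list(clean_rows)                       # stand-in for the warehouse table

print(warehouse)
print(rejected_rows)
```

Rejected rows are kept with a reason rather than dropped silently, so that the error categories listed above can be reported back to the owners of the source system.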


Data warehouse architecture:

Data warehouse architecture (DWA) is a way of representing the overall structure of data, communication, processing and presentation that exists for end-user computing within the enterprise. The architecture is made up of a number of interconnected parts:

• Operational database/External database layer: Operational systems process data to support critical operational needs. To do that, operational databases have historically been created to provide an efficient processing structure for a relatively small number of well-defined business transactions.

• Information Access Layer: This is the layer that the end user deals with directly. In particular, it represents the tools that the end user normally uses day to day, e.g. Excel, Lotus 1-2-3, etc.

• Data Access Layer: The Data Access Layer of the data warehouse architecture allows the Information Access Layer to talk to the Operational Layer.

• Data Directory (Meta-data) Layer: Meta-data is the data about data within the enterprise. A record description in a COBOL program is meta-data.

Data Warehouse Architecture

• Process Management Layer: The Process Management Layer is involved in scheduling the various tasks that must be accomplished to build and maintain the data warehouse and data directory.

• Application Messaging Layer: The Application Messaging Layer has to do with transporting information around the enterprise computing network. Application messaging is also referred to as “Middleware”, but it can involve more than just networking protocols.

• Data Warehouse (physical) layer: The (core) data warehouse is where the actual data used primarily for informational purposes resides. In a physical data warehouse, copies, in some cases many copies, of operational and/or external data are actually stored in a form that is easy to access.

• Data Staging Layer: Data staging is also called copy management or replication management, but in fact it includes all of the processes necessary to select, edit, summarize, combine and load data warehouse and information access data from operational and/or external databases.

Data Warehouse-Data Redundancy:

A data warehousing strategy may ultimately include all three of the following:

• “Virtual” or “point-to-point” data warehouses
• Central data warehouses
• Distributed data warehouses

Data Warehouse-applications: Role of the data warehouse in various application areas.

1. Marketing solutions: marketing database, customer loyalty schemes and profiling, etc.
2. Retail: sales analysis, shrinkage analysis, promotion analysis, space planning.
3. Insurance: product profitability analysis, orphan analysis.
4. Telephone companies: individual tariffing through call analysis, network analysis.
5. Retail banking: customer profitability analysis, customer scoring/loan decisions.

Data Warehouse-Future developments:

Data warehousing is such a new field that it is difficult to estimate which new developments are likely to affect it most. Clearly, the development of parallel DB servers with improved query engines is likely to be one of the most important. Another new technology is data warehouses that allow traditional numbers, text and multimedia to be mixed. The availability of improved tools for data visualization (business intelligence) will allow users to see things that could never be seen before.

Data Mining- introduction:

Data mining techniques are the result of a long process of research and product development. This evolution began when business data was first stored on computers, continued with improvements in data access and, more recently, generated technologies that allow users to navigate through their data in real time. Data mining takes this evolutionary process beyond retrospective data access and navigation to prospective and proactive information delivery.

Data mining – definition:

Data mining, “the extraction of hidden predictive information from large databases”, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by the retrospective tools typical of decision support systems.

Data mining – Supporting Technologies:

Data mining can be applied in the business field because it is supported by three technologies:

• Massive data collection: Databases are growing at unprecedented rates and can be larger than expected.

• Powerful multiprocessor computers: The need for improved computational engines can now be met.

• Data mining algorithms: They have been implemented as mature, reliable, understandable tools.

Data Mining-Scope:

Given databases of sufficient size and quality, data mining technology can generate new business opportunities by providing these capabilities:

• Automated prediction of trends and behaviors: A typical example of a predictive problem is targeted marketing.

• Automated discovery of previously unknown patterns: Data mining tools sweep through databases and identify previously hidden patterns in one step. An example is the analysis of retail sales.

Data Mining-Algorithms: Some of the most common data mining algorithms in use today fall into two sections, based on when the technique was developed and when it became ready to be used.

1. Classical Techniques: Statistics, neighborhoods and clustering, which have been used for decades.

Statistics: These techniques are data driven and are used to discover patterns and build predictive models.

(a) Histograms: One of the best ways to summarize data is to provide a histogram of the data.

Ex 1: Counting the number of occurrences of each eye color in our database.
Ex 2: Counting customers by age, which shows that the majority of customers are over the age of 50.


Figure – depicts a simple predictor (eye color).

Figure – depicts customers of different ages.
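Both histogram examples can be reproduced with a short Python sketch. The eye colors and ages below are invented sample values, not data from the figures.

```python
from collections import Counter

# Hypothetical customer records, invented for illustration.
eye_colors = ["brown", "blue", "brown", "green", "brown", "blue"]
ages = [23, 35, 51, 58, 62, 67, 70, 44, 55, 81]

# Ex 1: count the occurrences of each eye color.
color_hist = Counter(eye_colors)
print(color_hist)

# Ex 2: bucket ages into 10-year bins and count the customers in each bin.
age_hist = Counter((age // 10) * 10 for age in ages)
for start in sorted(age_hist):
    print(f"{start}-{start + 9}: {'#' * age_hist[start]}")

# Share of customers over the age of 50.
over_50 = sum(1 for age in ages if age > 50)
print(over_50 / len(ages))
```

Printing one row of `#` characters per bin gives a rough text version of the bar chart, which is enough to see where most customers fall.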

(b) Linear regression: In statistics, prediction is usually synonymous with regression of some form. The simplest form is simple linear regression, which contains just one predictor and one prediction. The relationship between the two can be mapped in a two-dimensional space.

The predictive model is the line that best fits the data points.
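As a minimal sketch of simple linear regression, the following Python code fits a line by ordinary least squares. The data points are invented, and least squares is assumed as the fitting criterion, since the text does not name one.

```python
# Hypothetical data: one predictor (x) and one prediction (y).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(xs, ys)

def predict(x):
    """Apply the fitted line to a new predictor value."""
    return slope * x + intercept

print(slope, intercept, predict(6.0))
```

Once the slope and intercept are known, prediction for a new record is a single multiplication and addition.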

Nearest Neighbor: Objects that are “near” to each other will have similar prediction values as well. Thus, if you know the prediction value of one of the objects, you can predict it for its nearest neighbors. One of the improvements usually made to the basic nearest neighbor algorithm is to take a vote from the “k” nearest neighbors.

Ex: The nearest neighbors are shown graphically for three unclassified records: A, B, and C.
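The k-nearest-neighbor vote can be sketched as follows. The records, the two features (age and income in thousands) and the labels are hypothetical. Euclidean distance on unscaled features is an assumption made for brevity; in practice features are usually rescaled first.

```python
from collections import Counter
import math

# Hypothetical labeled records: (age, income in thousands) -> label.
known = [
    ((25, 30), "no"), ((30, 35), "no"), ((28, 40), "no"),
    ((50, 80), "yes"), ((55, 90), "yes"), ((60, 85), "yes"),
]

def knn_predict(point, records, k=3):
    """Predict a label by majority vote among the k nearest known records."""
    by_distance = sorted(records, key=lambda rec: math.dist(point, rec[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

print(knn_predict((27, 33), known))
print(knn_predict((58, 88), known))
```

With k = 1 this reduces to the basic nearest neighbor algorithm. A larger k makes the prediction less sensitive to a single unusual record.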


Clustering: This is the method by which like records are grouped together. Usually this is done to give the end user a high-level view of what is going on in the database. There are two main types: hierarchical and non-hierarchical clustering. In hierarchical clustering, the hierarchy of clusters is usually viewed as a tree in which the smallest clusters merge together to create the next highest level of clusters, and so on.

Figure – hierarchy of clusters, elongated clusters, nested clusters.
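The bottom-up merging that produces a hierarchy of clusters can be sketched in Python. The one-dimensional values are invented. Single-link distance (the closest pair of members) is one common choice assumed here, not something the text prescribes.

```python
# Hypothetical one-dimensional records (for example, customer ages).
values = [1.0, 2.0, 9.0, 10.0, 25.0]

def single_link(a, b):
    """Distance between two clusters: the closest pair of members."""
    return min(abs(x - y) for x in a for y in b)

def agglomerate(points):
    """Merge the two closest clusters repeatedly; return every level of the tree."""
    clusters = [[p] for p in points]
    levels = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest single-link distance.
        pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda p: single_link(clusters[p[0]], clusters[p[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        levels.append([list(c) for c in clusters])
    return levels

levels = agglomerate(values)
for level in levels:
    print(level)
```

Each printed line is one level of the tree, from single records at the bottom to one all-inclusive cluster at the top. Cutting the tree at a chosen level gives the end user a view with that number of clusters.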

2. Next Generation Techniques: These are techniques such as trees, networks and rules, which have only been widely used since the early 1980s.

• Neural Networks: Neural networks are very powerful predictive modeling techniques, but some of the power comes at the expense of ease of use and ease of deployment. Because of the complexity of these techniques, much effort has been expended in trying to increase their clarity.

Figure – a neural network with 100 input nodes, 5 hidden nodes and 100 output nodes, used for data compression and feature extraction.

Figure – a neural network for the prediction of loan default: the inputs Age (47) and Income ($65,000) are connected by links to the output “default?” (NO).

• Decision Trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID).

A decision tree

• Rule Induction: The extraction of useful if-then rules from data based on statistical significance.

These capabilities are now evolving to integrate directly with industry-standard data warehouse and OLAP platforms.
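To make the idea of rule induction concrete, here is a minimal Python sketch that extracts single-condition if-then rules. It uses simple support and confidence thresholds as a stand-in for a proper statistical significance test. The records, attribute names and thresholds are all hypothetical.

```python
from collections import Counter

# Hypothetical records: attribute dictionaries with a known outcome.
records = [
    ({"owns_home": "yes", "income": "high"}, "good"),
    ({"owns_home": "yes", "income": "low"}, "good"),
    ({"owns_home": "yes", "income": "high"}, "good"),
    ({"owns_home": "no", "income": "low"}, "bad"),
    ({"owns_home": "no", "income": "low"}, "bad"),
    ({"owns_home": "no", "income": "high"}, "good"),
]

def induce_rules(records, min_support=2, min_confidence=0.9):
    """Return single-condition if-then rules that meet both thresholds."""
    condition_counts = Counter()
    rule_counts = Counter()
    for attrs, outcome in records:
        for condition in attrs.items():
            condition_counts[condition] += 1
            rule_counts[(condition, outcome)] += 1
    rules = []
    for (condition, outcome), hits in rule_counts.items():
        confidence = hits / condition_counts[condition]
        if hits >= min_support and confidence >= min_confidence:
            rules.append((condition, outcome, hits, confidence))
    return sorted(rules)

rules = induce_rules(records)
for (attr, value), outcome, hits, confidence in rules:
    print(f"IF {attr} = {value} THEN {outcome} (support {hits}, confidence {confidence:.2f})")
```

Support counts how many records a rule covers, and confidence is the fraction of those records where the conclusion holds. A real tool would also test whether that fraction could have arisen by chance.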

Data Mining - Working Procedure:

The technique used in data mining to make predictions and deliver the important findings to people is called modeling. Modeling is simply the act of building a model in one situation where you know the answer and then applying it to another situation where you don’t. Model building is thus something that people have been doing for a long time, certainly since before the advent of computers or data mining technology. Once the model is built, it can be used in similar situations where you don’t know the answer.


The following table illustrates a model for new customer prospecting in a data warehouse.

Information type                                               Yesterday  Today  Tomorrow
Static information and current plans (e.g. demographic data)   Known      Known  Known
Dynamic information (e.g. customer transactions)               Known      Known  Target

Table – Data mining for predictions

Once the mining is complete, the results can be tested against the data held in the vault, a portion of the known data deliberately kept out of the mining process, to confirm the model’s validity. If the model works, its observations should hold for the vaulted data.
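The build-then-validate cycle can be sketched in Python. The customer records are invented, and the model is deliberately trivial (a single threshold on months as a customer), so that the role of the vault stays visible.

```python
# Hypothetical customers: (months_as_customer, responded_to_offer).
customers = [
    (2, False), (5, False), (8, False), (11, False), (14, True),
    (18, True), (24, True), (30, True), (3, False), (27, True),
]

# Set part of the known data aside in a "vault" that the mining step never sees.
mining_set, vault = customers[:7], customers[7:]

def build_model(rows):
    """Pick the threshold on months that classifies the most rows correctly."""
    best_threshold, best_correct = None, -1
    for threshold, _ in rows:
        correct = sum((months >= threshold) == responded for months, responded in rows)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold

def accuracy(threshold, rows):
    """Fraction of rows where the threshold rule matches the known answer."""
    return sum((months >= threshold) == responded for months, responded in rows) / len(rows)

threshold = build_model(mining_set)
print(threshold, accuracy(threshold, mining_set), accuracy(threshold, vault))
```

Vault accuracy that falls well below the accuracy on the mining set suggests that the model has fitted quirks of its training data rather than a pattern that generalizes.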

Data Mining – Architecture:

To best apply these advanced techniques, they must be fully integrated with a data warehouse as well as with flexible, interactive business analysis tools. Many data mining tools currently operate outside of the warehouse, requiring extra steps for extracting, importing and analyzing the data. Furthermore, when new insights require operational implementation, integration with the warehouse simplifies the application of results from data mining. The resulting analytical data warehouse can be applied to improve business processes throughout the organization. The following figure illustrates an architecture for advanced analysis in a large data warehouse.

The ideal starting point is a data warehouse containing a combination of internal data tracking all customer contact coupled with external market data about competitor activity. Background information on potential customers also provides an excellent basis for prospecting. This warehouse can be implemented in a variety of relational database systems: Sybase, Oracle, Redbrick and so on.

An OLAP (On-Line Analytical Processing) server enables a more sophisticated end-user business model to be applied when navigating the data warehouse. The multidimensional structures allow users to analyze the data as they want to view their business. The Data Mining Server must be integrated with the data warehouse and the OLAP server to embed ROI-focused business analysis directly into this infrastructure. As the warehouse grows with new decisions and results, the organization can continually mine the best practices and apply them to future decisions.

This design represents a fundamental shift from conventional decision support systems. Rather than simply delivering data to the end user through query and reporting software, the Advanced Analysis Server applies users’ business models directly to the warehouse and returns a proactive analysis of the most relevant information. These results enhance the metadata in the OLAP Server by providing a dynamic metadata layer that represents a distilled view of the data. Reporting, visualization, and other analysis tools can then be applied to plan future actions and confirm the impact of those plans.

Data mining – Applications:

Some successful application areas:

1. A pharmaceutical company can analyze its recent sales and determine which marketing activities will have the greatest impact in the future.

2. A credit card company can use a small test mailing to identify the relevant customer attributes.

3. A diversified transportation company can apply data mining to identify the best prospects.

4. A large consumer packaged goods company can apply data mining to improve its sales process to retailers.

Conclusions- Data Warehousing & Data Mining:

All large organizations already have data warehouses, but they are just not managing them. To get the most out of this period, data warehouse planners and developers must have a clear idea of what they are looking for and then choose strategies and methods that will improve performance and flexibility.

There is a growing gap between more powerful storage and retrieval systems and users’ ability to analyze the data effectively. As seen, both relational and OLAP technologies are used for navigating massive data warehouses. Quantifiable business benefits have been proven through the integration of data mining with current information systems, and new products are on the horizon that will bring this integration to an even wider audience of users.