sammy
Introduction to BI. Automated Decision-Making Framework. BI (according to Wikipedia). http://he.wikipedia.org/wiki/%D7%91%D7%99%D7%A0%D7%94_%D7%A2%D7%A1%D7%A7%D7%99%D7%AA Table of contents: 1 History, 2 Work process, 3 Data warehouse and BI, 4 Online analytical processing (OLAP), 5 Data mining (all the learning methods we studied) - PowerPoint PPT Presentation
Introduction to BI
Automated Decision-Making Framework
BI (according to Wikipedia)
• http://he.wikipedia.org/wiki/%D7%91%D7%99%D7%A0%D7%94_%D7%A2%D7%A1%D7%A7%D7%99%D7%AA
Table of contents
• 1 History
• 2 Work process
• 3 Data warehouse and BI
• 4 Online analytical processing (OLAP)
• 5 Data mining (all the learning methods we studied)
• 6 Operational business intelligence
• 7 Main uses
• 8 BI products
History of DSS
Classical Definitions of DSS
• "Interactive computer-based systems, which help decision makers utilize data and models to solve unstructured problems" - Gorry and Scott-Morton, 1971
• Decision support systems couple the intellectual resources of individuals with the capabilities of the computer to improve the quality of decisions. It is a computer-based support system for management decision makers who deal with semistructured problems - Keen and Scott-Morton, 1978
Types of DSS
• Two major types:
– Model-oriented DSS
– Data-oriented DSS
• Evolution of DSS into Business Intelligence
– Use of DSS moved from specialist to managers, and then whomever, whenever, wherever
– Enabling tools like OLAP, data warehousing, data mining, intelligent systems, delivered via Web technology have collectively led to the terms “business intelligence” (BI) and “business analytics”
From Wikipedia...
Since the mid-2000s, new business intelligence tools have emerged under an approach called Business Intelligence 2.0 (BI 2.0), which lets employees query the organization's data in real time. The term BI 2.0 was coined by analogy with Web 2.0, because this kind of processing follows the Web approach and runs in a browser environment. BI 2.0 tools allow more dynamic reporting than the static reports that characterized earlier-generation tools. An important foundation for this kind of processing is the use of SOA, which comes together with more flexible middleware products and the use of standards for transferring information.
Service Oriented Architecture = SOA
DSS Description
• DSS application: A DSS program built for a specific purpose (e.g., a scheduling system for a specific company)
• Business intelligence (BI): A conceptual framework for decision support. It combines architecture, databases (or data warehouses), analytical tools, and applications
Business Intelligence (BI)
• BI is an evolution of decision support concepts over time.
– Meaning of EIS/DSS…
• Then: Executive Information System
• Now: Everybody’s Information System (BI)
• BI systems are enhanced with additional visualizations, alerts, and performance measurement capabilities.
• The term BI emerged from industry apps.
The Evolution of BI Capabilities
The Architecture of BI
• A BI system has four major components
– a data warehouse, with its source data
– business analytics, a collection of tools for manipulating, mining, and analyzing the data in the data warehouse
– business performance management (BPM) for monitoring and analyzing performance
– a user interface (e.g., dashboard)
In recent years, business intelligence has taken a central place in information systems. The large growth in the information accumulated in computerized systems requires presenting and consolidating the relevant data so that the information is meaningful. One indication of the field's importance is the acquisition of prominent companies specializing in it by large software companies.
A High-Level Architecture of BI
Learning Objectives
• Explain data integration and the extraction, transformation, and load (ETL) processes
• Describe real-time (a.k.a. right-time and/or active) data warehousing
• Understand data warehouse administration and security issues
Stage 1: Data Warehouse
• A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format
• “The data warehouse is a collection of integrated, subject-oriented databases designed to support DSS functions, where each unit of data is non-volatile and relevant to some moment in time”
DW Framework
[Figure: Data sources (ERP, Legacy, POS, Other OLTP/Web, External data) feed the ETL process (Select, Extract, Transform, Integrate, Load) into the Enterprise Data Warehouse (with Metadata and Replication). The warehouse feeds data marts (Engineering, Marketing, Finance, ...) or, in the no-data-marts option, is accessed directly. Access goes through API / Middleware to the applications (visualization): Routine Business Reporting; Data/text mining; OLAP, Dashboard, Web; Custom-built applications.]
Extraction, transformation, and load (ETL)
Data Integration and the Extraction, Transformation, and Load (ETL) Process
[Figure: Sources (packaged application, legacy system, other internal applications, transient data source) pass through Extract, Transform, Cleanse, and Load into the data warehouse and data marts.]
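The Extract, Transform, Cleanse, Load flow above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the sample rows, the `sales` table, and the cleansing rules are assumptions made up for the example, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3

# Hypothetical raw rows, as they might arrive from two source systems.
raw_rows = [
    {"source": "POS", "customer": " Alice ", "amount": "120.50"},
    {"source": "ERP", "customer": "BOB", "amount": "80"},
    {"source": "POS", "customer": "", "amount": "15.00"},      # missing customer
    {"source": "ERP", "customer": "carol", "amount": "n/a"},   # unparsable amount
]

def extract(rows):
    """Extract: read records from the sources (here, an in-memory list)."""
    return list(rows)

def transform(rows):
    """Transform: standardize names and convert amounts to numbers."""
    out = []
    for r in rows:
        try:
            amount = float(r["amount"])
        except ValueError:
            amount = None  # mark as invalid; the cleanse step drops it
        out.append({
            "source": r["source"],
            "customer": r["customer"].strip().title(),
            "amount": amount,
        })
    return out

def cleanse(rows):
    """Cleanse: drop records that are still incomplete after transformation."""
    return [r for r in rows if r["customer"] and r["amount"] is not None]

def load(rows, conn):
    """Load: write the cleansed records into the (stand-in) warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (source TEXT, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (:source, :customer, :amount)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(cleanse(transform(extract(raw_rows))), conn)
loaded = conn.execute("SELECT customer, amount FROM sales ORDER BY customer").fetchall()
print(loaded)
```

Only the two complete records reach the table; the rows with a missing customer or an unparsable amount are filtered out in the cleanse step.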
Data Mart: A departmental data warehouse that stores only relevant data
– Dependent data mart: A subset that is created directly from a data warehouse
– Independent data mart: A small data warehouse designed for a strategic business unit or a department
OLAP vs. OLTP
Online Analytical vs. Online Transaction (Processing)
OLAP
Slicing Operations on a Simple Three-Dimensional Data Cube
[Figure: A 3-dimensional OLAP cube with slicing operations. The axes are Product, Time, and Geography; cells are filled with numbers representing sales volumes.]
• Sales volumes of a specific Product on variable Time and Region
• Sales volumes of a specific Region on variable Time and Products
• Sales volumes of a specific Time on variable Region and Products
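The slicing operations above amount to fixing one dimension of the cube and tabulating sales over the other two. The sketch below assumes a tiny set of made-up fact records and a hypothetical helper `slice_cube`; real OLAP servers precompute and index such aggregates rather than scanning records.

```python
from collections import defaultdict

# Hypothetical fact records: (product, time, region, sales_volume).
facts = [
    ("Laptop", "Q1", "East", 10),
    ("Laptop", "Q1", "West", 7),
    ("Laptop", "Q2", "East", 12),
    ("Phone", "Q1", "East", 20),
    ("Phone", "Q2", "West", 25),
]
DIMS = ("product", "time", "region")

def slice_cube(facts, dim, value):
    """Fix one dimension to a value; return sales keyed by the other two."""
    fixed = DIMS.index(dim)
    table = defaultdict(int)
    for *coords, volume in facts:
        if coords[fixed] == value:
            key = tuple(c for i, c in enumerate(coords) if i != fixed)
            table[key] += volume
    return dict(table)

# Sales of a specific product over variable time and region.
laptop_slice = slice_cube(facts, "product", "Laptop")
# Sales of a specific region over variable product and time.
east_slice = slice_cube(facts, "region", "East")
print(laptop_slice)
print(east_slice)
```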
Star vs Snowflake Schema
[Figure: Star Schema. Fact table SALES (UnitsSold, ...) linked directly to the dimensions TIME (Quarter, ...), PEOPLE (Division, ...), PRODUCT (Brand, ...), and GEOGRAPHY (Country, ...).]
[Figure: Snowflake Schema. Fact table SALES (UnitsSold, ...) linked to the dimensions DATE (Date, ...), PEOPLE (Division, ...), PRODUCT (LineItem, ...), and STORE (LocID, ...), which are further normalized into the dimensions BRAND (Brand, ...), CATEGORY (Category, ...), LOCATION (State, ...), MONTH (M_Name, ...), and QUARTER (Q_Name, ...).]
Another SNOWFLAKE example
Data mining
• Classification (worthwhile or not worthwhile to ...)
– Lend money, invest in a field, open a new branch
• Cluster analysis (Clustering)
– How many types of customers are there? What do they have in common?
• Regression analysis
– How much will we earn, optimization
Types of information
• Mining information from data
– The more "simple" kind
• Mining information from texts
– INFORMATION RETRIEVAL
– TREND ANALYSIS, SENTIMENT ANALYSIS
Categories of Models
• Optimization of problems with few alternatives
– Objective: Find the best solution from a small number of alternatives
– Techniques: Decision tables, decision trees
• Optimization via algorithm
– Objective: Find the best solution from a large number of alternatives using a step-by-step process
– Techniques: Linear and other mathematical programming models
• Optimization via an analytic formula
– Objective: Find the best solution in one step using a formula
– Techniques: Some inventory models
• Simulation
– Objective: Find a good enough solution by experimenting with a dynamic model of the system
– Techniques: Several types of simulation
• Heuristics
– Objective: Find a good enough solution using “common-sense” rules
– Techniques: Heuristic programming and expert systems
• Predictive and other models
– Objective: Predict future occurrences, what-if analysis, …
– Techniques: Forecasting, Markov chains, financial, …
Static and Dynamic Models
• Static Analysis
– Single snapshot of the situation
– Single interval
– Steady state
• Dynamic Analysis
– Dynamic models
– Evaluate scenarios that change over time
– Time dependent
– Represents trends and patterns over time
– More realistic: Extends static models
Decision Analysis: A Few Alternatives
Single Goal Situations
• Decision trees
– Graphical representation of relationships
– Multiple criteria approach
– Demonstrates complex relationships
– Cumbersome, if many alternatives exist
Decision Tables
• Investment example
• One goal: maximize the yield after one year
• Yield depends on the status of the economy (the state of nature)
– Solid growth
– Stagnation
– Inflation
Investment Example: Possible Situations
1. If solid growth in the economy, bonds yield 12%; stocks 15%; time deposits 6.5%
2. If stagnation, bonds yield 6%; stocks 3%; time deposits 6.5%
3. If inflation, bonds yield 3%; stocks lose 2%; time deposits yield 6.5%
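The three situations above form a decision table: one row per alternative, one column per state of nature. The sketch below applies three common selection rules to that table. The yields come from the slide; the state probabilities used for the expected-value rule are an assumption made for illustration only.

```python
# Percent yield after one year, taken from the three situations above.
yields = {
    "bonds": {"solid growth": 12.0, "stagnation": 6.0, "inflation": 3.0},
    "stocks": {"solid growth": 15.0, "stagnation": 3.0, "inflation": -2.0},
    "time deposits": {"solid growth": 6.5, "stagnation": 6.5, "inflation": 6.5},
}

def maximax(table):
    """Optimistic rule: pick the alternative with the best best-case yield."""
    return max(table, key=lambda alt: max(table[alt].values()))

def maximin(table):
    """Pessimistic rule: pick the alternative with the best worst-case yield."""
    return max(table, key=lambda alt: min(table[alt].values()))

def expected_yields(table, probs):
    """Expected yield of each alternative under given state probabilities."""
    return {
        alt: sum(probs[state] * y for state, y in row.items())
        for alt, row in table.items()
    }

# ASSUMPTION: illustrative probabilities, not given on the slide.
probs = {"solid growth": 0.5, "stagnation": 0.3, "inflation": 0.2}

optimistic_choice = maximax(yields)
pessimistic_choice = maximin(yields)
expected = expected_yields(yields, probs)
best_expected = max(expected, key=expected.get)
print(optimistic_choice, pessimistic_choice, best_expected)
```

The rules can disagree: the optimistic rule favors stocks, the pessimistic rule favors time deposits, and under the assumed probabilities the expected-value rule favors bonds.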
Optimization via Mathematical Programming
• Mathematical Programming: A family of tools designed to help solve managerial problems in which the decision maker must allocate scarce resources among competing activities to optimize a measurable goal
• Optimal solution: The best possible solution to a modeled problem
– Linear programming (LP): A mathematical model for the optimal solution of resource allocation problems. All the relationships are linear
LP Problem Characteristics
1. Limited quantity of economic resources
2. Resources are used in the production of products or services
3. Two or more ways (solutions, programs) to use the resources
4. Each activity (product or service) yields a return in terms of the goal
5. Allocation is usually restricted by constraints
Linear Programming Steps
• 1. Identify the …
– Decision variables
– Objective function
– Objective function coefficients
– Constraints
• Capacities / Demands
• 2. Represent the model
– LINDO: Write mathematical formulation
– EXCEL: Input data into specific cells in Excel
• 3. Run the model and observe the results
LP Example
The Product-Mix Linear Programming Model
• MBI Corporation
• Decision: How many computers to build next month?
• Two types of mainframe computers: CC7 and CC8
• Constraints: Labor limits, Materials limit, Marketing lower limits

                CC7       CC8       Rel   Limit
Labor (days)    300       500       <=    200,000 /mo
Materials ($)   10,000    15,000    <=    8,000,000 /mo
Units           1                   >=    100
Units                     1         >=    200
Profit ($)      8,000     12,000    Max

Objective: Maximize Total Profit / Month
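The slides solve this model with LINDO or Excel. As a rough stand-in, the sketch below solves the same two-variable model in plain Python by enumerating the corner points of the feasible region, which works because an LP with a bounded feasible region attains its optimum at a vertex. The function name `solve_2var_lp` is hypothetical, and the method is practical only for tiny models like this one.

```python
from itertools import combinations

# Each constraint is (a1, a2, b), meaning a1*x1 + a2*x2 <= b,
# with x1 = CC7 units and x2 = CC8 units.
# The ">=" rows of the table are multiplied by -1 to fit this form.
constraints = [
    (300, 500, 200_000),          # labor days per month
    (10_000, 15_000, 8_000_000),  # materials budget per month
    (-1, 0, -100),                # CC7 units >= 100
    (0, -1, -200),                # CC8 units >= 200
]
profit = (8_000, 12_000)          # profit per CC7, per CC8

def solve_2var_lp(constraints, objective, tol=1e-6):
    """Maximize the objective by checking every feasible corner point."""
    best = None
    for (a1, a2, b), (c1, c2, d) in combinations(constraints, 2):
        det = a1 * c2 - a2 * c1
        if abs(det) < 1e-12:
            continue  # parallel boundary lines never intersect
        # Cramer's rule for the intersection of the two boundary lines.
        x1 = (b * c2 - a2 * d) / det
        x2 = (a1 * d - b * c1) / det
        if all(p * x1 + q * x2 <= r + tol for p, q, r in constraints):
            value = objective[0] * x1 + objective[1] * x2
            if best is None or value > best[0]:
                best = (value, x1, x2)
    return best

total_profit, cc7, cc8 = solve_2var_lp(constraints, profit)
print(f"CC7 = {cc7:.2f}, CC8 = {cc8:.2f}, profit = {total_profit:,.2f}")
```

The best corner lies where the labor limit meets the CC8 lower bound: about 333.33 CC7 units and 200 CC8 units. The fractional CC7 quantity is a property of the continuous LP; whole-unit production would need integer programming.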
Sensitivity, What-if, and Goal Seeking Analysis
• Sensitivity
– Assesses impact of change in inputs on outputs
– Eliminates or reduces variables
– Can be automatic or trial and error
• What-if
– Assesses solutions based on changes in variables or assumptions (scenario analysis)
• Goal seeking
– Backwards approach, starts with goal
– Determines values of inputs needed to achieve goal
– Example is break-even point determination
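The difference between what-if analysis and goal seeking can be shown on a toy profit model. Everything numeric below (price, unit cost, fixed cost) is an assumption for illustration; `goal_seek` is a hypothetical helper that works backwards from a target output by bisection, which is one simple way to do what spreadsheet goal-seek tools do.

```python
def profit(units, price=50.0, unit_cost=30.0, fixed_cost=10_000.0):
    """Toy profit model with assumed parameters."""
    return units * (price - unit_cost) - fixed_cost

def goal_seek(f, target, lo, hi, tol=1e-6):
    """Find x in [lo, hi] with f(x) close to target.

    Assumes f is increasing on [lo, hi] and the target lies between
    f(lo) and f(hi).
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# What-if: change an input and observe the output.
baseline = profit(600)
what_if = profit(600, price=55.0)

# Sensitivity: impact of the price change on profit.
price_impact = what_if - baseline

# Goal seeking: start from the goal (zero profit) and find the input.
break_even_units = goal_seek(profit, 0.0, 0.0, 10_000.0)
print(baseline, what_if, price_impact, round(break_even_units, 3))
```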
Heuristic Programming
• Cuts the search space
• Gets satisfactory solutions more quickly and less expensively
• Finds good enough feasible solutions to very complex problems
• Heuristics can be
– Quantitative
– Qualitative (in ES)
• Traveling Salesman Problem >>>
Heuristic Programming - SEARCH
Traveling Salesman Problem
• What is it?
– A traveling salesman must visit customers in several cities, visiting each city only once, across the country. Goal: Find the shortest possible route
– Total number of unique routes (TNUR): TNUR = (1/2) (Number of Cities – 1)!

Number of Cities   TNUR
5                  12
6                  60
9                  20,160
20                 6.08 × 10^16
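The route-count formula and a simple heuristic can both be sketched briefly. `tnur` implements the formula above. `nearest_neighbor_route` is one common "common-sense" rule (always visit the closest unvisited city next); it is offered as an example of a heuristic, not as the method the slides have in mind, and it does not guarantee the shortest route. The city coordinates are made up.

```python
from math import dist, factorial

def tnur(n_cities):
    """Total number of unique routes: (1/2) * (n_cities - 1)!"""
    return factorial(n_cities - 1) // 2

def nearest_neighbor_route(cities, start=0):
    """Greedy heuristic: repeatedly move to the closest unvisited city."""
    unvisited = set(range(len(cities))) - {start}
    route = [start]
    while unvisited:
        here = cities[route[-1]]
        nxt = min(unvisited, key=lambda i: dist(here, cities[i]))
        route.append(nxt)
        unvisited.remove(nxt)
    return route

for n in (5, 6, 9):
    print(n, tnur(n))

# Hypothetical city coordinates.
cities = [(0, 0), (5, 0), (1, 0), (3, 0)]
route = nearest_neighbor_route(cities)
print(route)
```

The factorial growth of `tnur` is why exhaustive search stops being feasible after a few dozen cities, while the heuristic runs in time roughly quadratic in the number of cities.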
When to Use Heuristics
– Inexact or limited input data
– Complex reality
– Reliable, exact algorithm not available
– Computation time excessive
– For making quick decisions
Limitations of Heuristics
– Cannot guarantee an optimal solution
Modern Heuristic Methods
• Tabu search
– Intelligent search algorithm
• Genetic algorithms
– Survival of the fittest
• Simulated annealing
– Analogy to Thermodynamics
Simulation
• Technique for conducting experiments with a computer on a comprehensive model of the behavior of a system
• Frequently used in DSS tools
Major Characteristics of Simulation
• Imitates reality and captures its richness
• Technique for conducting experiments
• Descriptive, not normative tool
• Often used to “solve” very complex problems
Simulation is normally used only when a problem is too complex to be treated using numerical optimization techniques
Advantages of Simulation
• The theory is fairly straightforward
• Great deal of time compression
• Experiment with different alternatives
• The model reflects manager’s perspective
• Can handle wide variety of problem types
• Can include the real complexities of problems
• Produces important performance measures
• Often it is the only DSS modeling tool for non-structured problems
Limitations of Simulation
• Cannot guarantee an optimal solution
• Slow and costly construction process
• Cannot transfer solutions and inferences to solve other problems (problem specific)
• So easy to explain/sell to managers that it may lead to overlooking analytical solutions
• Software may require special skills
Simulation Types
• Stochastic vs. Deterministic Simulation
– In stochastic simulations: We use distributions (Discrete or Continuous probability distributions)
• Time-dependent vs. Time-independent Simulation
– Time independent stochastic simulation via Monte Carlo technique (X = A + B)
• Discrete event vs. Continuous simulation
• Steady State vs. Transient Simulation
• Simulation Implementation
– Visual simulation
– Object-oriented simulation
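A time-independent Monte Carlo simulation of X = A + B can be sketched as repeated random draws of A and B. The distributions below are assumptions chosen only to show one discrete and one continuous input: A is uniform over {1, 2, 3} and B is uniform on [0, 10]. The seed is fixed so that runs are repeatable.

```python
import random
from statistics import mean

def simulate_x(n_trials, seed=42):
    """Draw n_trials samples of X = A + B from the assumed distributions."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_trials):
        a = rng.choice([1, 2, 3])    # discrete distribution (assumed)
        b = rng.uniform(0.0, 10.0)   # continuous distribution (assumed)
        samples.append(a + b)
    return samples

samples = simulate_x(10_000)
estimate = mean(samples)
share_above_10 = sum(x > 10 for x in samples) / len(samples)
print(f"estimated mean of X: {estimate:.2f}")
print(f"estimated P(X > 10): {share_above_10:.3f}")
```

For these assumed inputs the theoretical mean is 2 + 5 = 7, so the estimate should land close to that. The simulation describes the behavior of X; it does not by itself optimize anything, which matches the "descriptive, not normative" characterization of simulation.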
Data Mining Methods: Classification
• Most frequently used DM method
• Part of the machine-learning family
• Employ supervised learning
• Learn from past data, classify new data
• The output variable is categorical (nominal or ordinal) in nature
• Classification versus regression?
• Classification versus clustering?
Assessment Methods for Classification
• Predictive accuracy
– Hit rate
• Speed
– Model building; predicting
• Robustness
• Scalability
• Interpretability
– Transparency, explainability
Accuracy of Classification Models
• In classification problems, the primary source for accuracy estimation is the confusion matrix

                      True Class: Positive         True Class: Negative
Predicted Positive    True Positive Count (TP)     False Positive Count (FP)
Predicted Negative    False Negative Count (FN)    True Negative Count (TN)

$$\text{True Positive Rate} = \frac{TP}{TP + FN}$$
$$\text{True Negative Rate} = \frac{TN}{TN + FP}$$
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
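The confusion-matrix measures can be computed directly from paired actual and predicted labels. The sketch below uses made-up labels (1 = positive, 0 = negative); the function names are hypothetical, and the code assumes every denominator is nonzero.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, TN, FN for a binary classifier."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    return tp, fp, tn, fn

def classification_metrics(tp, fp, tn, fn):
    """Standard rates; assumes no denominator is zero."""
    return {
        "true_positive_rate": tp / (tp + fn),  # same quantity as recall
        "true_negative_rate": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Hypothetical labels for ten cases.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(actual, predicted)
metrics = classification_metrics(*counts)
print(counts)
print(metrics)
```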
Estimation Methodologies for Classification
• Simple split (or holdout or test sample estimation)
– Split the data into 2 mutually exclusive sets: training (~70%) and testing (~30%)
[Figure: Preprocessed Data is split into Training Data (2/3), used in Model Development to build the Classifier, and Testing Data (1/3), used in Model Assessment (scoring) to obtain the Prediction Accuracy.]
Estimation Methodologies for Classification
• k-Fold Cross Validation (rotation estimation)
– Split the data into k mutually exclusive subsets
– Use each subset as testing while using the rest of the subsets as training
– Repeat the experimentation k times
– Aggregate the test results for a true estimation of prediction accuracy
• Other estimation methodologies
– Leave-one-out, bootstrapping, jackknifing
– Area under the ROC curve
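The k-fold procedure can be sketched as an index generator plus an averaging loop. This is a minimal illustration: `k_fold_splits` and `cross_validate` are hypothetical names, and `evaluate` stands for any user-supplied function that trains on the training indices and returns an accuracy on the test indices.

```python
import random

def k_fold_splits(n_items, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # k mutually exclusive subsets that together cover all items.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def cross_validate(n_items, k, evaluate):
    """Average the score returned by evaluate(train, test) over the k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_splits(n_items, k)]
    return sum(scores) / len(scores)

splits = list(k_fold_splits(10, 5))
for train, test in splits:
    print(sorted(test), "held out;", len(train), "used for training")

# Placeholder evaluator (assumption): reports the share of data used for training.
avg_score = cross_validate(10, 5, lambda train, test: len(train) / 10)
```

Each item is used for testing exactly once and for training k - 1 times, which is what makes the aggregated estimate less dependent on one particular split than the simple holdout.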
Classification Techniques
• Decision tree analysis
• Statistical analysis
• Neural networks
• Support vector machines
• Case-based reasoning
• Bayesian classifiers
• Genetic algorithms
• Rough sets
Decision Trees
• Employs the divide and conquer method
• Recursively divides a training set until each division consists of examples from one class
1. Create a root node and assign all of the training data to it
2. Select the best splitting attribute
3. Add a branch to the root node for each value of the split. Split the data into mutually exclusive subsets along the lines of the specific split
4. Repeat steps 2 and 3 for each and every leaf node until the stopping criteria are reached
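Step 2, selecting the best splitting attribute, depends on the splitting criterion, which the steps above leave open. The sketch below assumes information gain (entropy reduction), the criterion associated with ID3; other algorithms use different measures such as the Gini index. The tiny training set is made up.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attr, target):
    """Entropy reduction obtained by splitting rows on attr."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

def best_split(rows, attrs, target):
    """Pick the attribute with the highest information gain."""
    return max(attrs, key=lambda a: information_gain(rows, a, target))

# Hypothetical training data.
rows = [
    {"income": "high", "student": "no", "buys": "no"},
    {"income": "high", "student": "yes", "buys": "yes"},
    {"income": "low", "student": "no", "buys": "no"},
    {"income": "low", "student": "yes", "buys": "yes"},
]

chosen = best_split(rows, ["income", "student"], "buys")
gain_student = information_gain(rows, "student", "buys")
gain_income = information_gain(rows, "income", "buys")
print(chosen, gain_student, gain_income)
```

Here splitting on `student` separates the classes perfectly, while splitting on `income` tells us nothing, so `student` becomes the root split and both branches are already pure leaves.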
Decision Trees
• DT algorithms mainly differ on
– Splitting criteria
• Which variable to split first?
• What values to use to split?
• How many splits to form for each node?
– Stopping criteria
• When to stop building the tree
– Pruning (generalization method)
• Pre-pruning versus post-pruning
• Most popular DT algorithms include
– ID3, C4.5, C5; CART; CHAID; M5
Cluster Analysis for Data Mining
• k-Means Clustering Algorithm
– k : pre-determined number of clusters
– Algorithm (Step 0: determine value of k)
Step 1: Randomly generate k points as initial cluster centers
Step 2: Assign each point to the nearest cluster center
Step 3: Re-compute the new cluster centers
Repetition step: Repeat steps 2 and 3 until some convergence criterion is met (usually that the assignment of points to clusters becomes stable)
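The k-means steps can be sketched in plain Python. Two choices here are assumptions rather than part of the algorithm as stated above: the initial centers are sampled from the data points themselves (a common variant of "randomly generate k points"), and an optional `initial_centers` argument is added so a run can be made fully deterministic. The sample points are made up.

```python
import random
from math import dist

def k_means(points, k, seed=0, max_iter=100, initial_centers=None):
    """Return (centers, assignment) after the assignments become stable."""
    rng = random.Random(seed)
    # Step 1: choose k initial cluster centers.
    centers = list(initial_centers) if initial_centers else rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each point to the nearest cluster center.
        new_assignment = [
            min(range(k), key=lambda c: dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # convergence: assignments did not change
        assignment = new_assignment
        # Step 3: re-compute each center as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centers, assignment

# Hypothetical 2-D points forming two visible groups.
points = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]
centers, assignment = k_means(points, k=2)
print(centers)
print(assignment)
```

Because the result can depend on the starting centers, k-means is often run several times with different seeds and the tightest clustering is kept.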
Cluster Analysis for Data Mining - k-Means Clustering Algorithm
Step 1 Step 2 Step 3
Data Mining Myths
• Data mining …
– provides instant solutions/predictions
– is not yet viable for business applications
– requires a separate, dedicated database
– can only be done by those with advanced degrees
– is only for large firms that have lots of customer data
– is another name for the good-old statistics
Common Data Mining Mistakes
1. Selecting the wrong problem for data mining
2. Ignoring what your sponsor thinks data mining is and what it really can/cannot do
3. Leaving insufficient time for data acquisition, selection and preparation
4. Looking only at aggregated results and not at individual records/predictions
5. Being sloppy about keeping track of the data mining procedure and results
Common Data Mining Mistakes
6. Ignoring suspicious (good or bad) findings and quickly moving on
7. Running mining algorithms repeatedly and blindly, without thinking about the next stage
8. Naively believing everything you are told about the data
9. Naively believing everything you are told about your own data mining analysis
10. Measuring your results differently from the way your sponsor measures them
Text Mining Application Area
• Information extraction
• Topic tracking
• Summarization
• Categorization
• Clustering
• Concept linking
• Question answering
Text Mining Terminology
• Unstructured or semistructured data
• Corpus (and corpora)
• Terms
• Concepts
• Stemming
• Stop words (and include words)
• Synonyms (and polysemes)
• Tokenizing
Text Mining Terminology
• Term dictionary
• Word frequency
• Part-of-speech tagging
• Morphology
• Term-by-document matrix
– Occurrence matrix
• Singular value decomposition
– Latent semantic indexing
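Several of the terms above (tokenizing, stop words, stemming, word frequency, term-by-document matrix) fit together in one small pipeline. The sketch below is deliberately crude: the stop-word list is a tiny illustrative set, and `stem` is a naive suffix stripper written for this example, not a real stemmer such as Porter's. The two-document corpus is made up.

```python
import re
from collections import Counter

# ASSUMPTION: a tiny illustrative stop-word list.
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}

def tokenize(text):
    """Tokenizing: split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def stem(term):
    """Naive stemming: strip a few common suffixes (illustrative only)."""
    for suffix in ("ing", "es", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

def term_document_matrix(corpus):
    """Rows are terms, columns are documents, cells are word frequencies."""
    counts = [
        Counter(stem(t) for t in tokenize(doc) if t not in STOP_WORDS)
        for doc in corpus
    ]
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix

corpus = ["Mining the text of reports", "Text mining and data mining"]
terms, matrix = term_document_matrix(corpus)
for term, row in zip(terms, matrix):
    print(f"{term:8s} {row}")
```

A matrix like this is the usual input to singular value decomposition in latent semantic indexing; for real corpora it is very large and sparse.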
Natural Language Processing (NLP)
• Structuring a collection of text
– Old approach: bag-of-words
– New approach: natural language processing
• NLP is …
– a very important concept in text mining
– a subfield of artificial intelligence and computational linguistics
– the study of "understanding" the natural human language
• Syntax versus semantics based text mining
Natural Language Processing (NLP)
• Challenges in NLP
– Part-of-speech tagging
– Text segmentation
– Word sense disambiguation
– Syntax ambiguity
– Imperfect or irregular input
– Speech acts
• Dream of AI community
– to have algorithms that are capable of automatically reading and obtaining knowledge from text
NLP Task Categories
• Information retrieval
• Information extraction
• Named-entity recognition
• Question answering
• Automatic summarization
• Natural language generation and understanding
• Machine translation
• Foreign language reading and writing
• Speech recognition
• Text proofing
• Optical character recognition
Text Mining Applications
• Marketing applications
– Enables better CRM
• Security applications
– ECHELON, OASIS
– Deception detection (…)
• Medicine and biology
– Literature-based gene identification (…)
• Academic applications
– Research stream analysis
Web Mining Success Stories
• Amazon.com, Ask.com, Scholastic.com, …
• Website Optimization Ecosystem
[Figure: Customer Interaction on the Web feeds the Analysis of Interactions (Web Analytics, Voice of Customer, Customer Experience Management), which yields Knowledge about the Holistic View of the Customer.]
Web Mining Tools
Product Name URL
Angoss Knowledge WebMiner angoss.com
ClickTracks clicktracks.com
LiveStats from DeepMetrix deepmetrix.com
Megaputer WebAnalyst megaputer.com
MicroStrategy Web Traffic Analysis microstrategy.com
SAS Web Analytics sas.com
SPSS Web Mining for Clementine spss.com
WebTrends webtrends.com
XML Miner scientio.com
Machine Learning Methods
• Supervised Learning
– Classification: Decision Tree, Neural Networks, Support Vector Machines, Case-based Reasoning, Rough Sets, Discriminant Analysis, Logistic Regression, Rule Induction
– Regression: Regression Trees, Neural Networks, Support Vector Machines, Linear Regression, Non-linear Regression, Bayesian Linear Regression
• Unsupervised Learning
– Clustering / Segmentation: SOM (Neural Networks), Adaptive Resonance Theory, Expectation Maximization, K-Means, Genetic Algorithms
– Association: Apriori, ECLAT Algorithm, FP-Growth, One-attribute Rule, Zero-attribute Rule
• Reinforcement Learning
– Q-Learning, Adaptive Heuristic Critic (AHC), State-Action-Reward-State-Action (SARSA), Genetic Algorithms, Gradient Descent
BPM versus BI
• BPM is an outgrowth of BI and incorporates many of its technologies, applications, and techniques.
– The same companies market and sell them.
– BI has evolved so that many of the original differences between the two no longer exist (e.g., BI used to be focused on departmental rather than enterprise-wide projects).
– BI is a crucial element of BPM.
• BPM = BI + Planning (a unified solution)
Performance Measurement: KPIs and Operational Metrics
• Key performance indicator (KPI): A KPI represents a strategic objective and metric that measures performance against a goal
• Distinguishing features of KPIs
– Strategy
– Targets
– Ranges
– Encodings
– Time frames
– Benchmarks
Performance Measurement
• Key performance indicator (KPI)
– Outcome KPIs (lagging indicators, e.g., revenues) vs. driver KPIs (leading indicators, e.g., sales leads)
• Operational areas covered by driver KPIs
– Customer performance
– Service performance
– Sales operations
– Sales plan/forecast
• The meaning of “balance”
– BSC (balanced scorecard) is designed to overcome the limitations of systems that are financially focused
– Nonfinancial objectives fall into one of three perspectives:
1. Customer
2. Internal business process
3. Learning and growth
BPM Methodologies
• In BSC, the term “balance” arises because the combined set of measures is supposed to encompass indicators that are:
– Financial and nonfinancial
– Leading and lagging
– Internal and external
– Quantitative and qualitative
– Short term and long term
BPM Methodologies
BPM Methodologies
Strategy map: A visual display that delineates the relationships among the key organizational objectives for all four BSC perspectives
Performance Dashboards
• Dashboards and scorecards both provide visual displays of important information that is consolidated and arranged on a single screen so that information can be digested at a single glance and easily explored
Performance Dashboards
Performance Dashboards
• Dashboards versus scorecards
– Performance dashboards: Visual display used to monitor operational performance (free form)
– Performance scorecards: Visual display used to chart progress against strategic and tactical goals and targets (predetermined measures)