8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt
D Venkata Subramanin
August 2nd 2011
For TJ Institute of Technology
DWH/Data Mining for CSE IV A & B sections
Data Warehousing & Data Mining
Topics To Be Covered
+ Recap and Overview of DWH - UNIT I
+ Business Analysis & tools - UNIT II
+ Details about OLAP - UNIT II
+ Data Mining & Algorithms - UNIT III TO V
+ Quick introduction of the topic and algorithms
Decision Support Systems
+ Created to facilitate the decision-making process
+ So much information that it is difficult to extract it all from a traditional database
+ Need for a more comprehensive data storage facility
Data Warehouse
Decision Support Systems
+ Extract information from data to use as the basis for decision making
+ Used at all levels of the organization
+ Tailored to specific business areas
+ Interactive
+ Ad hoc queries to retrieve and display information
+ Combines historical operation data with business activities
4 Components of DSS
• Data Store - The DSS Database
– Business Data
– Business Model Data
– Internal and External Data
• Data Extraction and Filtering
– Extract and validate data from the operational database and the external data sources
4 Components of DSS
• End-User Query Tool
– Create queries that access either the operational or the DSS database
• End-User Presentation Tools
– Organize and present the data
What is a Data Warehouse?
A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions.
- WH Inmon
WH Inmon - Regarded As Father Of Data Warehousing
• Data is stored for a historical period. Data is populated in the data warehouse on a daily/weekly basis, depending upon the requirement.
• Can I see the credit report from Accounts, sales from Marketing, and the open order report from Order Entry for this customer?
• Data from multiple sources is integrated for a subject.
• Identical queries will give the same results at different times. Supports analysis requiring historical data.
Data Growth
In 2 years (2003 to 2005), the size of the largest database TRIPLED
Data Growth Rate
• Twice as much information was created in 2002 as in 1999 (~30% growth rate)
• Other growth rate estimates even higher
• Very little data will ever be looked at by a human
Knowledge Discovery is NEEDED to make sense and use of data.
April 2 201 Data Mining: Concepts and Techniques
Data Warehouse - Subject-Oriented
• Organized around major subjects, such as customer, product, sales.
• Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
• Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.
Data Warehouse - Integrated
• Constructed by integrating multiple, heterogeneous data sources
– relational databases, flat files, on-line transaction records
• Data cleaning and data integration techniques are applied.
– Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
• E.g., hotel price: currency, tax, breakfast covered, etc.
– When data is moved to the warehouse, it is converted.
Data Warehouse - Time Variant
• The time horizon for the data warehouse is significantly longer than that of operational systems.
– Operational database: current value data.
– Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
• Every key structure in the data warehouse
– Contains an element of time, explicitly or implicitly
– But the key of operational data may or may not contain a "time element".
Data Warehouse - Non-Volatile
• A physically separate store of data transformed from the operational environment.
• Operational update of data does not occur in the data warehouse environment.
– Does not require transaction processing, recovery, and concurrency control mechanisms
– Requires only two operations in data accessing:
• initial loading of data and access of data.
Data Warehouse vs. Heterogeneous DBMS
• Traditional heterogeneous DB integration:
– Build wrappers/mediators on top of heterogeneous databases
– Query-driven approach
• When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set
• Complex information filtering, compete for resources
• Data warehouse: update-driven, high performance
– Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis
Data Warehouse vs. Operational DBMS
• OLTP (on-line transaction processing)
– Major task of traditional relational DBMS
– Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
• OLAP (on-line analytical processing)
– Major task of data warehouse system
– Data analysis and decision making
• Distinct features (OLTP vs. OLAP):
– User and system orientation: customer vs. market
– Data contents: current, detailed vs. historical, consolidated
– Database design: ER + application vs. star + subject
– View: current, local vs. evolutionary, integrated
– Access patterns: update vs. read-only but complex queries
OLTP vs. OLAP
Feature | OLTP | OLAP
users | clerk, IT professional | knowledge worker
function | day to day operations | decision support
DB design | application-oriented | subject-oriented
data | current, up-to-date, detailed, flat relational, isolated | historical, summarized, multidimensional, integrated, consolidated
usage | repetitive | ad-hoc
access | read/write, index/hash on prim. key | lots of scans
unit of work | short, simple transaction | complex query
# records accessed | tens | millions
# users | thousands | hundreds
DB size | 100MB-GB | 100GB-TB
metric | transaction throughput | query throughput, response
Why Separate Data Warehouse?
• High performance for both systems
– DBMS - tuned for OLTP: access methods, indexing, concurrency control, recovery
– Warehouse - tuned for OLAP: complex OLAP queries, multidimensional view, consolidation.
• Different functions and different data:
– missing data: Decision support requires historical data which operational DBs do not typically maintain
– data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
– data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled
Subject-Oriented
+ Data is arranged and optimized to provide answers to questions from diverse functional areas
Data is organized and summarized by topic
Sales / Marketing / Finance / Distribution / Etc.
Integrated
+ The data warehouse is a centralized, consolidated database that integrates data derived from the entire organization
Multiple Sources
Diverse Sources
Diverse Formats
Time-Variant
+ The Data Warehouse represents the flow of data through time
+ Can contain projected data from statistical models
+ Data is periodically uploaded, then time-dependent data is recomputed
Nonvolatile
+ Once data is entered it is NEVER removed
+ Represents the company's entire history
Near-term history is continually added to it
Always growing
Must support terabyte databases and multiprocessors
+ Read-only database for data analysis and query processing
Subject-Oriented - Characteristics of a Data Warehouse
Operational: Quotes, Leads, Orders, Prospects
Data Warehouse: Customers, Products, Regions, Time
Focus is on Subject Areas rather than Applications
Integrated - Characteristics of a Data Warehouse
Source encodings | Data warehouse
Appl A - m,f; Appl B - 1,0; Appl C - male,female | m,f
Appl A - balance dec fixed (13,2); Appl B - balance pic 9(9)V99; Appl C - balance pic S9(7)V99 comp-3 | balance dec fixed (13,2)
Appl A - bal-on-hand; Appl B - current-balance; Appl C - cash-on-hand | Current balance
Appl A - date (julian); Appl B - date (yymmdd); Appl C - date (absolute) | date (julian)
Integrated View Is The Essence Of A Data Warehouse
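The conversion shown on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source names (appl_a, appl_b, appl_c), the integrate helper, and the reading of 1 as male and 0 as female for Appl B are all hypothetical, since the slide does not say which digit means which.

```python
# Hypothetical per-application gender encodings, mirroring Appl A/B/C on the slide.
GENDER_MAPS = {
    "appl_a": {"m": "m", "f": "f"},
    "appl_b": {"1": "m", "0": "f"},  # assumption: the slide does not define the digits
    "appl_c": {"male": "m", "female": "f"},
}

# Each application names the balance field differently.
BALANCE_FIELDS = {
    "appl_a": "bal-on-hand",
    "appl_b": "current-balance",
    "appl_c": "cash-on-hand",
}


def integrate(source: str, record: dict) -> dict:
    """Convert one source record to the warehouse's single representation."""
    gender = GENDER_MAPS[source][str(record["gender"]).lower()]
    # Two decimal places, in the spirit of the target "dec fixed (13,2)" format.
    balance = round(float(record[BALANCE_FIELDS[source]]), 2)
    return {"gender": gender, "current_balance": balance}


rows = [
    integrate("appl_a", {"gender": "m", "bal-on-hand": "120.50"}),
    integrate("appl_b", {"gender": 0, "current-balance": 75}),
    integrate("appl_c", {"gender": "Female", "cash-on-hand": 10.129}),
]
```

Whatever the source encoding, every row ends up with the same field names and the same code set, which is the point the slide makes about integration.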
Non-volatile - Characteristics of a Data Warehouse
Operational: insert, change, replace, delete
Data Warehouse: load, read-only access
Integrated View Is The Essence Of A Data Warehouse
Time Variant - Characteristics of a Data Warehouse
Operational: Current Value data
+ time horizon: 60-90 days
+ key may not have an element of time
Data Warehouse: Snapshot data
+ time horizon: 5-10 years
+ key has an element of time
+ data warehouse stores historical data
Data Warehouse Typically Spans Across Time
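A small Python sketch may help show why the warehouse key carries an element of time. The account name, dates, and the record_balance helper below are made up for illustration; they are not part of the original slides.

```python
from datetime import date

# Operational store: the key is the account only, so a new value overwrites the old one.
operational = {}
# Warehouse: the key is (account, snapshot date), so every snapshot is kept.
warehouse = {}


def record_balance(account: str, as_of: date, balance: float) -> None:
    """Store a balance in both stores to contrast their key structures."""
    operational[account] = balance
    warehouse[(account, as_of)] = balance


record_balance("ACC-1", date(2011, 6, 30), 500.0)
record_balance("ACC-1", date(2011, 7, 31), 650.0)

# The warehouse can answer "what was the balance over time?"; the operational store cannot.
history = sorted((d, b) for (acct, d), b in warehouse.items() if acct == "ACC-1")
```

After the two calls the operational store holds only the current value, while the warehouse holds both snapshots.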
Alternate Definitions
A collection of integrated, subject-oriented databases designed to support the DSS function, where each unit of data is relevant to some moment of time
- Imhoff
Alternate Definitions
Data Warehouse is a repository of data summarized or aggregated in simplified form from operational systems. End user orientated data access and reporting tools let user get at the data for decision support
- Babcock
12 Rules of a Data Warehouse
+ Data Warehouse and Operational Environments are Separated
+ Data is integrated
+ Contains historical data over a long period of time
+ Data is snapshot data, captured at a given point in time
+ Data is subject-oriented
12 Rules of Data Warehouse
+ Environment is characterized by read-only transactions to very large data sets
+ System that traces data sources, transformations, and storage
+ Metadata is a critical component
Source, transformation, integration, storage, relationships, history, etc.
+ Contains a chargeback mechanism for resource usage that enforces optimal use of data by end users
Need for Data Warehousing
+ Better business intelligence for end-users
+ Reduction in time to locate, access, and analyze information
+ Consolidation of disparate information sources
+ Strategic advantage over competitors
+ Faster time-to-market for products and services
+ Replacement of older, less-responsive decision support systems
+ Reduction in demand on IS to generate reports
ICS 541 - 02 Data Mining Concepts
Data Warehouse Architecture
Multi-Tiered Architecture
[Figure: Data Sources (operational DBs, other sources); Extract, Transform, Load, Refresh, with Monitor & Integrator and Metadata; Data Storage (Data Warehouse, Data Marts); OLAP Server (OLAP Engine); Front-End Tools (Analysis, Query/Reports, Data Mining)]
Typical Data Warehouse Architecture
[Figure: Operational Systems/Data; Data Preparation (Select, Extract, Transform, Integrate, Maintain); Middleware/API; Data Warehouse, Metadata, Data Marts; front ends (EIS/DSS, Query Tools, OLAP/ROLAP, Web Browsers, Data Mining)]
Multi-tiered Data Warehouse without ODS
Typical Data Warehouse Architecture
[Figure: Operational Systems/Data; Data Preparation (Select, Extract, Transform, Integrate, Maintain); ODS with Metadata; Data Preparation (Select, Extract, Transform, Load); Data Warehouse with Metadata; Data Marts]
Multi-tiered Data Warehouse with ODS
ETL - Extract, Transform and Load
As the name suggests, the ETL process covers the following phases:
• Extraction of data from data sources.
• Transforming the extracted data to meet business requirements.
• Loading the data into the target warehouse/database.
ETL - Extract, Transform and Load
• Data Extract
– Get data from source
• Data Transformation
– Data Cleansing - Data Quality Assurance
– Data Scrubbing - Removing errors and inconsistencies
– Processing Calculations
– Applying Business Rules
– Changing Data Types
– Making the Data More Readable
– Replacing Codes with Actual Values
– Summarizing the Data
• Data Load
– Load data into Warehouse
Extraction
+ The first part of the ETL process.
+ Data under consideration is extracted from the different data sources.
+ The source data may use a different data organization/format.
+ Some of the common data sources are:
Databases
Flat files
Transform
+ It involves applying a series of rules to the data extracted from the source to derive the data to load into the target.
+ Depending on the requirement of the target, the transformation rules may be simple or complex.
+ Transformation may involve:
Selecting only certain columns to load
Filtering
Sorting
Combining data from multiple sources
Generating surrogate keys, etc.
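The transformation steps listed above can be sketched in plain Python. This is a minimal illustration, not any ETL tool's API: the sample rows, the column names, and the transform function are all hypothetical.

```python
from itertools import count

# Hypothetical extracted rows from two sources (names and values are illustrative).
orders = [
    {"order_id": 7, "cust": "C2", "amount": 40.0, "status": "open", "note": "x"},
    {"order_id": 3, "cust": "C1", "amount": 15.0, "status": "cancelled", "note": "y"},
    {"order_id": 5, "cust": "C1", "amount": 25.0, "status": "open", "note": "z"},
]
customers = {"C1": "Asha", "C2": "Ravi"}


def transform(order_rows, customer_names):
    """Apply filtering, sorting, column selection, combining and surrogate keys."""
    next_key = count(1)  # surrogate key generator: 1, 2, 3, ...
    # Filtering: drop cancelled orders.
    kept = [r for r in order_rows if r["status"] != "cancelled"]
    # Sorting: order by the source order id.
    kept.sort(key=lambda r: r["order_id"])
    out = []
    for r in kept:
        out.append({
            "order_key": next(next_key),                 # generated surrogate key
            "order_id": r["order_id"],                   # selected columns only
            "amount": r["amount"],
            "customer_name": customer_names[r["cust"]],  # combined from second source
        })
    return out


fact_rows = transform(orders, customers)
```

The surrogate key is independent of the source order id, so the warehouse row keeps a stable identifier even if source systems reuse or renumber their own keys.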
Load
+ The last step of the ETL process
+ The load phase loads the transformed data into the end target.
+ Depending on the requirement, the load phase may be:
Full Load
Incremental Load
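The difference between the two load types can be sketched as follows. This is a simplified illustration that assumes each source row carries a hypothetical updated_at counter used as a watermark; real tools track changes in various other ways.

```python
def full_load(target: dict, source_rows: list) -> None:
    """Replace the target contents with everything in the source."""
    target.clear()
    for row in source_rows:
        target[row["id"]] = row


def incremental_load(target: dict, source_rows: list, last_loaded: int) -> int:
    """Load only rows changed after the previous run; return the new watermark."""
    watermark = last_loaded
    for row in source_rows:
        if row["updated_at"] > last_loaded:
            target[row["id"]] = row
            watermark = max(watermark, row["updated_at"])
    return watermark


source = [
    {"id": 1, "updated_at": 10, "value": "a"},
    {"id": 2, "updated_at": 20, "value": "b"},
]
target = {}
full_load(target, source)  # first run: everything is loaded

source.append({"id": 3, "updated_at": 30, "value": "c"})
mark = incremental_load(target, source, last_loaded=20)  # later run: only row 3 moves
```

A full load is simple but rereads everything; an incremental load touches only the changed rows, at the cost of having to remember the watermark between runs.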
Importance of ETL
+ Data of an organization is spread across multiple geographies and domains.
+ Data is organized in different formats in different sources.
+ Consolidation of the data makes it more meaningful.
+ Applying business rules enriches the value the data provides.
+ Identifying the inconsistencies and providing a unified view.
+ Improving the data quality.
Sample ETL Tools
+ Teradata Warehouse Builder from Teradata
+ DataStage from IBM
+ SAS System from SAS Institute
+ Power Mart/Power Center from Informatica
+ Sagent Solution from Sagent Software
+ Hummingbird Genio Suite from Hummingbird Communications
+ Ab Initio
+ Oracle Data Warehouse Builder and ODI from Oracle Corporation
Data Access and Analysis - Terminologies
• Reporting
– A category of data access solution in which the information is presented in the form of reports
– Reporting tools are also referred to as query and reporting tools
• OLAP (On-Line Analytic Processing)
– Defined as "Fast Analysis of Multidimensional Information" by the OLAP council
– Used interchangeably with 'BI'
– OLAP tools are synonymous with multidimensional tools or applications
• DSS tools that use multidimensional data analysis techniques
– Support for a DSS data store
– Data extraction and integration filter
– Specialized presentation interface
Data Access and Analysis - Terminologies
• Data Mining
– A process that uses a variety of statistical and artificial intelligence frameworks to discover patterns and relationships in data
– Used to make valid predictions in data analysis problems where the exact sequence and nature of the queries/questions to be written/asked against the data to make the prediction is not known, and the number of variables involved in the analysis is too large to be intuitively handled by structured querying or OLAP tools
• Web Access
– A category of data access solutions in which information is viewed through a web browser
Importance of Data Access
• Businesses today face challenges like
– Large volume of data
– User demands for flexible and timely access to information
– Extracting value from key business data
• Data Access is the 'last mile' that enables decision makers to
– Reach the database infrastructure
• Prompt, reliable data access
– Lowers operating costs
– Reduces error
– Increases productivity.
Traditional Decision Making techniques
• Spreadsheets and SQL are traditionally used as tools for analysis and decision making
• Limitations of traditional techniques
– It is very difficult to define the aggregation levels and views in spreadsheets
– SQL does not have a natural way of providing flexible view reorganizations that will transpose the data
– Common analytic functions such as cumulative average and total were not supported in early SQL (window functions in later SQL standards address this)
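For reference, the cumulative total and cumulative average mentioned above are easy to state outside SQL. The sketch below uses Python's standard itertools.accumulate; the monthly sales figures are made-up sample values.

```python
from itertools import accumulate

monthly_sales = [100, 150, 50, 200]  # illustrative values

# Cumulative (running) total: each entry is the sum of all months so far.
running_total = list(accumulate(monthly_sales))

# Cumulative average: running total divided by the number of months so far.
running_average = [total / n for n, total in enumerate(running_total, start=1)]
```

Each output row depends on all the rows before it, which is the kind of ordered, cross-row calculation that plain row-at-a-time SQL queries express awkwardly.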
Design Considerations
– Platform - SMP, MPP, NT, Unix
– Target Database - RDBMS, MDDB
– Partitioning
– Data Preparation - Data Quality Audit, Cleansing, Extraction, Transformation
– Modeling - Facts & Dimensions
– Information Directory - Metadata Management
– Warehouse Administration
– End User Tools
– Granularity - Detail and Summarization
Data Warehouse Hardware
Hardware Considerations
• Parallelism: SMP or MPP
• Disk Storage
Hardware Considerations
• Parallelism
– Most deployments of VLDB Data Warehouses are on SMP or MPP
Hardware Considerations
• Three options for hardware
– Symmetric Multiprocessing (SMP)
• Shared Memory Architecture
– Massively Parallel Processing (MPP)
• Shared Nothing Architecture
• Each node has its own memory and I/O
– Non Uniform Memory Access (NUMA)
• Cluster of SMP machines
• Classified as large SMP machines
SMP vs MPP machines
SMP | MPP
Non Mission Critical OLTP or medium DSS | Complex Analytical, large Scale DSS
Scale to 10-12 CPUs (now 30) | Scale to more than 100 CPUs
Growth is Slow and Steady | Growth is rapid and unpredictable
Database Size 200GB | Database Size 00 GB
Aim is Automation or Basic Decision Support | Primary aim is strategic advantage
Server Scalability
OLTP Vs Warehouse
Operational System | Data Warehouse
Transaction Processing | Query Processing
Time Sensitive | History Oriented
Operator View | Managerial View
Organized by transactions (Order, Input, Inventory) | Organized by subject (Customer, Product)
Relatively smaller database | Large database size
Many concurrent users | Relatively few concurrent users
Volatile Data | Non Volatile Data
Stores all data | Stores relevant data
Not Flexible | Flexible
Capacity Planning
[Figure: Processing Power plotted against Time of day]
Processing Load Peaks During the Beginning and End of Day
Examples Of Some Applications
• Financial Reporting and Consolidation
• Target Marketing
• Market Segmentation
• Budgeting
• Credit Rating Agencies
• Churn Analysis
• Profitability Management
• Event tracking
Manufacturers / Customers / Retailers
Data Marts
+ Small Data Stores
+ More manageable data sets
+ Targeted to meet the needs of small groups within the organization
+ Small, single-subject data warehouse subset that provides decision support to a small group of people
Data Marts
+ Enterprise-wide data warehousing projects have a very large cycle time
+ Getting consensus between multiple parties may also be difficult
+ Departments may not be satisfied with the priority accorded to them
+ Sometimes individual departmental needs may be strong enough to warrant a local implementation
+ Application/database distribution is also an important factor
Data Marts
• Subject or Application Oriented Business View of Warehouse
– Quick Solution to a specific Business Problem
– Finance, Manufacturing, Sales etc.
– Smaller amount of data used for Analytic Processing
A Logical Subset of The Complete Data Warehouse
Data Warehouses or Data Marts
For companies interested in changing their corporate cultures or integrating separate departments, an enterprise-wide approach makes sense.
Companies that want a quick solution to a specific business problem are better served by a standalone data mart.
Some companies opt to build a warehouse incrementally, data mart by data mart.
A Logical Subset of The Complete Data Warehouse
Data Warehouse and Data Mart
Aspect | Data Warehouse | Data Marts
Scope | Application Neutral; Centralized, Shared; Cross LOB/enterprise | Specific Application Requirement; LOB, department; Business Process Oriented
Data Perspective | Historical Detailed data; Some summary | Detailed (some history); Summarized
Subjects | Multiple subject areas | Single Partial subject; Multiple partial subjects
Data Warehouse and Data Mart
Aspect | Data Warehouse | Data Marts
Data Sources | Many; Operational/External Data | Few; Operational, external data
Implement Time Frame | 1-23 months for first stage; Multiple stage implementation | 5-26 months
Characteristics | Flexible, extensible; Durable/Strategic; Data orientation | Restrictive, non extensible; Short life/tactical; Project Orientation
Warehouse or Mart First?
Data Warehouse First | Data Mart First
Expensive | Relatively cheap
Large development cycle | Delivered in 7 8 months
Change management is difficult | Easy to manage change
Difficult to obtain continuous corporate support | Can lead to independent and incompatible marts
Technical challenges in building large databases | Cleansing, transformation, modeling techniques may be incompatible
OLTP Systems Vs Data Warehouse
Remember: between OLTP and Data Warehouse systems
users are different
data content is different
data structures are different
hardware is different
Understanding The Differences Is The Key
Operational Data Store - Definition
[Figure labels: Operational, DSS, Data Warehouse, ODS]
Operational Data Store
+ The ODS applies only to the world of operational systems.
+ The ODS contains current valued and near current valued data.
+ The ODS contains almost exclusively all detail data
+ The ODS requires a full function, update, record oriented environment.
Operational Data Store
• Functions of an ODS
+ Converts Data
+ Decides Which Data of Multiple Sources Is the Best
+ Summarizes Data
+ Decodes/encodes Data
+ Alters the Key Structures
+ Alters the Physical Structures
+ Reformats Data
+ Internally Represents Data
+ Recalculates Data.
Different kinds of Information Needs
• Current: Is this medicine available in stock?
• Recent: What are the tests this patient has completed so far?
• Historical: Has the incidence of Tuberculosis increased in the last 5 years in the Southern region?
OLTP Vs ODS Vs DWH
Characteristic | OLTP | ODS | Data Warehouse
Audience | Operating Personnel | Analysts | Managers and analysts
Data access | Individual records, transaction driven | Individual records, transaction or analysis driven | Set of records, analysis driven
Data content | Current, real-time | Current and near-current | Historical
Data Structure | Detailed | Detailed and lightly summarized | Detailed and Summarized
Data organization | Functional | Subject-oriented | Subject-oriented
Type of Data | Homogeneous | Homogeneous | Vast supply of very heterogeneous data
OLTP Vs ODS Vs DWH
Characteristic | OLTP | ODS | Data Warehouse
Data redundancy | Non-redundant within system; unmanaged redundancy among systems | Somewhat redundant with operational databases | Managed redundancy
Data update | Field by field | Field by field | Controlled batch
Database size | Moderate | Moderate | Large to very large
Development Methodology | Requirements driven, structured | Data driven, somewhat evolutionary | Data driven, evolutionary
Philosophy | Support day-to-day operation | Support day-to-day decisions & operational activities | Support managing the enterprise
END OF UNIT I RECAP
BEGINNING OF UNIT II
TOPIC
BUSINESS ANALYSIS
Principles of Data Warehouse/Business Analysis
• The principal purpose of data warehousing is to provide information to business users for strategic decision making.
• The decision making process is the business analysis of the information stored in a data warehouse
• The business analysis is enabled by
– Number of applications
– Number of tools
– Number of techniques
To provide various business focused views to business domain experts.
TOOL CATEGORIES
• 5 Main Categories of decision support tools
– Reporting
– Managed Query
– Executive Information Systems
– Online Analytical Processing (OLAP)
– Data Mining
• Category 1 - Reporting Tools
– Production reporting tools
• Used by companies to generate regular reports or support high volume batch jobs, such as calculating and printing pay checks or a summary of revenues by month
• Written using Cobol, high level languages such as .NET or Java, or custom tools; these are expensive and are developed or customized based on the needs of an organization
– Desktop report writers
• Designed for end users and used by end users or business users on their desktops for designing, developing and generating reports daily or on demand
– Example: Crystal Reports
Category 2 - Managed Query Tools
– These tools shield end users from the complexities of SQL and database structures by inserting a meta-layer between users and the database.
– The meta-layer is the software that provides subject-oriented views of the database and supports point-and-click creation of SQL: users drag and drop to form the complex SQL that searches for or produces information. These tools follow a three-tiered architecture to improve scalability.
• COGNOS & BUSINESS OBJECTS
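The idea of a meta-layer can be sketched in a few lines of Python. This is a toy illustration only, not how Cognos or Business Objects work internally: the CATALOG of business terms, the table and column names, and the build_query function are all hypothetical.

```python
# Hypothetical business-term catalogue (table and column names are illustrative).
CATALOG = {
    "Customer Name": "customers.name",
    "Region": "customers.region",
    "Order Amount": "orders.amount",
}
JOIN = "customers JOIN orders ON orders.customer_id = customers.id"


def build_query(selected_terms, filters=None):
    """Translate the business terms a user picked into a parameterised SQL string."""
    columns = ", ".join(CATALOG[t] for t in selected_terms)
    sql = f"SELECT {columns} FROM {JOIN}"
    params = []
    if filters:
        clauses = []
        for term, value in filters.items():
            clauses.append(f"{CATALOG[term]} = ?")
            params.append(value)  # values are passed separately, never pasted into the SQL
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params


sql, params = build_query(["Customer Name", "Order Amount"], {"Region": "South"})
```

The user only sees terms such as "Customer Name" and "Region"; the join and the physical column names stay hidden behind the catalogue.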
Category 3 - Executive Information Systems
• These tools predate report writers and managed query tools
• They were first deployed on mainframes
• They provide customized graphical decision support applications that give managers and executives a high level view of the business and access to external sources such as custom and online feeds
• Examples: Pilot Software, Platinum Technology Forest and Trees, SAS
Need for the tools and applications for business analysis
• Simple tabular form reporting
• Ad-hoc user-specified queries
• Pre-defined repeatable queries & complex queries with multi-table joins, multi-level sub-queries & sophisticated search criteria
• Ranking, Multivariable Analysis
• Time Series Analysis
• Data Visualization - graphing, charting & pivoting
• Complex Textual Search
• Statistical Analysis
• Artificial Intelligence techniques for testing hypotheses
• Information Mapping
• Interactive Drill-Down Reporting and Analysis (Mining)
QUERY AND REPORTING TOOLS
Must support the following three distinct types of reporting:
1. Creation and viewing of STANDARD REPORTS
2. Definition and creation of AD-HOC REPORTS
3. Data Exploration
Check Google, Wikipedia or any other web site to know more about some of the tools
1). Cognos Impromptu
2). PowerBuilder
3). Forte
4). Information Builders - Cactus & Focus
5). Microsoft SQL Server - I'll provide the notes
UNIT II
TOPIC: OLAP & MULTI-DIMENSIONAL MODELS
OLAP
Need or Drivers for OLAP
• Need for more intensive decision support
• Multi-dimensional nature of the problems
• Retrieval of very large data sets (hundreds of GBs or TBs) and summarizing them on the fly
• The result set may look like a multi-dimensional spreadsheet, hence the term multi-dimensional (a traditional RDBMS supports a two-dimensional relational model through SQL)
• Solving modern business problems such as market analysis and financial forecasting requires query-centric, array-oriented, multi-dimensional database schemas
Nature of OLAP Analysis
• Aggregation -- (total sales, percent-to-total)
• Comparison -- Budget vs. Expenses
• Ranking -- Top 10, quartile analysis
• Access to detailed and aggregate data
• Complex criteria specification
• Visualization
• Need interactive response to aggregate queries
Multidimensional Data
[Figure: a data cube whose dimensions are Product (Toothpaste, Juice, Cola, Milk, Cream, Soap), Region (including W and S) and Month.]
Conceptual Model for OLAP
• Numeric measures to be analyzed
  – e.g. Sales (Rs), sales (volume), budget, revenue, inventory
• Dimensions
  – other attributes of data, define the space
  – e.g. store, product, date-of-sale
  – hierarchies on dimensions
    • e.g. branch -> city -> state
Operations
• Rollup: summarize data
  – e.g. given sales data, summarize sales for last year by product category and region
• Drill down: get more details
  – e.g. given summarized sales as above, find breakup of sales by city within each region, or within the Andhra region
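The two operations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows in SALES and the helper name aggregate are hypothetical and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical sales rows: (product category, region, city, amount).
SALES = [
    ("Soft drinks", "Andhra", "Hyderabad", 120),
    ("Soft drinks", "Andhra", "Vijayawada", 80),
    ("Soft drinks", "Tamil Nadu", "Chennai", 150),
    ("Dairy", "Andhra", "Hyderabad", 60),
    ("Dairy", "Tamil Nadu", "Chennai", 90),
]


def aggregate(rows, key_indexes):
    """Sum the amount (last column) grouped by the chosen columns."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in key_indexes)
        totals[key] += row[-1]
    return dict(totals)


# Roll-up: summarize sales by product category and region.
rollup = aggregate(SALES, (0, 1))

# Drill-down: break the same figures up by city within the Andhra region.
andhra_rows = [row for row in SALES if row[1] == "Andhra"]
drilldown = aggregate(andhra_rows, (0, 1, 2))
```

Rolling up drops the city column from the grouping key; drilling down adds it back, here restricted to one region.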
More Cube Operations
• Slice and dice: select and project
  – e.g. Sales of soft-drinks in Andhra over the last quarter
• Pivot: change the view of data

        Q1   Q2   Total
L       22   33   55
S       15   44   59
Total   37   77   114

        L    S    Total
Red     14   07   21
Blue    41   52   93
Total   55   59   114
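A small sketch of slice/dice and pivot, assuming a cell-level breakdown that the slide does not give: the rows in FACTS are hypothetical and were chosen only so that their totals reproduce the two pivot views above (114 overall). The function names select and pivot are likewise illustrative.

```python
from collections import defaultdict

# Hypothetical cell-level facts: (quarter, size, colour, amount).
FACTS = [
    ("Q1", "L", "Red", 14), ("Q1", "L", "Blue", 8),
    ("Q2", "L", "Blue", 33),
    ("Q1", "S", "Red", 7), ("Q1", "S", "Blue", 8),
    ("Q2", "S", "Blue", 44),
]
DIMS = {"quarter": 0, "size": 1, "colour": 2}


def select(facts, **where):
    """Slice and dice: keep only the facts matching every given dimension value."""
    return [f for f in facts
            if all(f[DIMS[dim]] == value for dim, value in where.items())]


def pivot(facts, row_dim, col_dim):
    """Pivot: cross-tabulate the amount by two dimensions, with totals."""
    table = defaultdict(lambda: defaultdict(int))
    r, c = DIMS[row_dim], DIMS[col_dim]
    for fact in facts:
        amount = fact[3]
        table[fact[r]][fact[c]] += amount
        table[fact[r]]["Total"] += amount
        table["Total"][fact[c]] += amount
        table["Total"]["Total"] += amount
    return {row: dict(cols) for row, cols in table.items()}


by_size_and_quarter = pivot(FACTS, "size", "quarter")
by_colour_and_size = pivot(FACTS, "colour", "size")
```

The same facts yield both views; only the pair of dimensions passed to pivot changes.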
More OLAP Operations
• Hypothesis-driven search: e.g. factors affecting defaulters
  – view defaulting rate on age, aggregated over other dimensions
  – for a particular age segment, detail along profession
• Need interactive response to aggregate queries
  – => precompute various aggregates
MOLAP vs ROLAP
• MOLAP: Multidimensional array OLAP
• ROLAP: Relational OLAP

Type   Size  Colour  Amount
Shirt  S     Blue    10
Shirt  L     Blue    25
Shirt  ALL   Blue    35
Shirt  S     Red     3
Shirt  L     Red     7
Shirt  ALL   Red     10
Shirt  ALL   ALL     45
...    ...   ...     ...
ALL    ALL   ALL     1290
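To make the contrast concrete, here is a hedged sketch that stores the Shirt cells listed above in both styles. The variable and function names (molap, rolap, molap_lookup, rolap_total) are illustrative assumptions, not part of any product.

```python
SIZES = ["S", "L"]
COLOURS = ["Blue", "Red"]

# MOLAP-style storage: a dense array addressed by dimension positions.
molap = [
    [10, 25],  # Blue: S, L
    [3, 7],    # Red:  S, L
]

# ROLAP-style storage: one relational row per non-empty cell.
rolap = [
    ("Shirt", "S", "Blue", 10),
    ("Shirt", "L", "Blue", 25),
    ("Shirt", "S", "Red", 3),
    ("Shirt", "L", "Red", 7),
]


def molap_lookup(colour, size):
    """Direct array access: the cell position is computed from the dimension values."""
    return molap[COLOURS.index(colour)][SIZES.index(size)]


def rolap_total(colour=None, size=None):
    """Scan the rows; passing None for a dimension plays the role of ALL."""
    return sum(amount for _, s, c, amount in rolap
               if (colour is None or c == colour) and (size is None or s == size))
```

The array form answers single-cell lookups by position, while the relational form computes the ALL rows by summing matching rows.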
SQL Extensions
• Cube operator
  – group by on all subsets of a set of attributes (month, city)
  – redundant scan and sorting of data can be avoided
• Various other non-standard SQL extensions by vendors
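What the cube operator computes can be sketched in plain Python: one pass over the data fills the totals for every subset of the grouping attributes, with "ALL" standing in for an attribute that has been aggregated away. The rows and the function name cube are hypothetical; this is not how any particular database implements it.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical input rows.
ROWS = [
    {"month": "Jan", "city": "Chennai", "sales": 10},
    {"month": "Jan", "city": "Hyderabad", "sales": 25},
    {"month": "Feb", "city": "Chennai", "sales": 3},
    {"month": "Feb", "city": "Hyderabad", "sales": 7},
]


def cube(rows, dims, measure):
    """Group by every subset of dims in a single scan of the rows."""
    subsets = [set(combo)
               for size in range(len(dims) + 1)
               for combo in combinations(dims, size)]
    totals = defaultdict(int)
    for row in rows:
        for kept in subsets:
            key = tuple(row[d] if d in kept else "ALL" for d in dims)
            totals[key] += row[measure]
    return dict(totals)


result = cube(ROWS, ("month", "city"), "sales")
```

For two attributes this yields four groupings: (month, city), (month), (city) and the grand total.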
Strengths of OLAP
• It is a powerful visualization tool
• It provides fast, interactive response times
• It is good for analyzing time series
• It can be useful to find some clusters and outliers
• Many vendors offer OLAP tools
Brief History
• Express and System W DSS
• Online Analytical Processing - coined by E.F. Codd in 1994 - white paper by Arbor Software
• Generally synonymous with earlier terms such as Decision Support, Business Intelligence, Executive Information System
• MOLAP: Multidimensional OLAP (Hyperion (Arbor Essbase), Oracle Express)
• ROLAP: Relational OLAP (Informix MetaCube, Microstrategy DSS Agent)
OLAP and Executive Information Systems
• Andyne Computing -- Pablo
• Arbor Software -- Essbase
• Cognos -- PowerPlay
• Comshare -- Commander OLAP
• Holistic Systems -- Holos
• Information Advantage -- AXSYS, WebOLAP
• Informix -- Metacube
• Microstrategies -- DSS/Agent
• Oracle -- Express
• Pilot -- LightShip
• Planning Sciences -- Gentium
• Platinum Technology -- ProdeaBeacon, Forest & Trees
• SAS Institute -- SAS/EIS, OLAP++
• Speedware -- Media
Microsoft OLAP strategy
• Plato: OLAP server: powerful, integrating various operational sources
• OLE-DB for OLAP: emerging industry standard based on MDX --> extension of SQL for OLAP
• Pivot-table services: integrate with Office 2000
  – Every desktop will have OLAP capability.
• Client side caching and calculations
• Partitioned and virtual cube
• Hybrid relational and multidimensional storage
Multidimensional Data Analysis Techniques
• Advanced Data Presentation Functions
  – 3-D graphics, Pivot Tables, Crosstabs, etc.
  – Compatible with Spreadsheets & Statistical packages
  – Advanced data aggregations, consolidation and classification across time dimensions
  – Advanced computational functions
  – Advanced data modeling functions
Advanced Database Support
• Advanced Data Access Features
  – Access to many kinds of DBMSs, flat files, and internal and external data sources
  – Access to aggregated data warehouse data
  – Advanced data navigation (drill-downs and roll-ups)
  – Ability to map end-user requests to the appropriate data source
  – Support for Very Large Databases
Easy-to-Use End-User Interface
+ Graphical User Interfaces
+ Much more useful if access is kept simple
Client/Server Architecture
+ Framework for the new systems to be designed, developed and implemented
+ Divide the OLAP system into several components that define its architecture
  – Same computer
  – Distributed among several computers
OLAP Architecture
• 3 Main Modules
  – GUI
  – Analytical Processing Logic
  – Data-processing Logic
OLAP 3 Tier DSS
• Data Warehouse (Database Layer): store atomic data in an industry-standard data warehouse.
• OLAP Engine (Application Logic Layer): generate SQL execution plans in the OLAP engine to obtain OLAP functionality.
• Decision Support Client (Presentation Layer): obtain multi-dimensional reports from the DSS client.
OLAP Client/Server Architecture
Relational OLAP
+ Relational Online Analytical Processing: OLAP functionality using relational databases and familiar query tools to store and analyze multidimensional data
+ Multidimensional data schema support
+ Data access language & query performance for multidimensional data
+ Support for Very Large Databases
Data Modeling for Data Warehouse
+ How to structure the data in your data warehouse?
+ Process that produces abstract data models for one or more database components of the data warehouse
+ Modeling for the warehouse is different from that for an operational database
  – Dimensional Modeling, Star Schema Modeling or Fact/Dimension Modeling
Modeling Techniques
• Entity-Relationship Modeling
  – Traditional modeling technique
  – Technique of choice for OLTP
  – Suited for corporate data warehouse
• Dimensional Modeling
  – Analyzing business measures in the specific business context
  – Helps visualize very abstract business questions
  – End users can easily understand and navigate the data structure
Entity-Relationship Modeling - Basic Concepts
+ The ER modeling technique is a discipline used to illuminate the microscopic relationships among data elements.
+ The highest art form of ER modeling is to remove all redundancy in the data.
An Order Processing Model
[Figure: ER diagram linking Order Header, Order Details, Customer Table, Item Table, Salesrep Table, City, Sales District, Sales Region, Sales Country, Product Brand and Product Category through foreign keys (FK).]
Entity-Relationship Modeling - Basic Concepts
• Entity
  – Object that can be observed and classified by its properties and characteristics
  – Business definition with a clear boundary
  – Characterized by a noun
  – Example
    • Product
    • Employee
Entity-Relationship Modeling - Basic Concepts
• Relationship
  – Relationship between entities: structural interaction and association
  – Described by a verb
  – Cardinality
    • 1-1
    • 1-M
    • M-M
  – Example: Books belong to Printed Media
Entity-Relationship Modeling - Basic Concepts
• Attributes
  – Characteristics and properties of entities
  – Example
    • Book Id, Description and book category are attributes of the entity "Book"
  – Attribute names should be unique and self-explanatory
  – Primary Key, Foreign Key and Constraints are defined on Attributes
Entity-Relationship Modeling - Why Not?
+ Use of the ER modeling technique defeats the basic allure of data warehousing, namely intuitive and high-performance retrieval of data.
Dimensional Modeling - Basic Concepts
+ Represents the data in a standard, intuitive framework that allows for high-performance access
+ Schema designed to process large, complex, ad hoc and data-intensive queries
+ No concern for concurrency, locking and insert/update/delete performance
+ Every dimensional model is composed of one table with a multipart key, called the fact table, and a set of smaller tables called dimension tables
+ This characteristic "star-like" structure is often called a star join
Star Schema Representation
+ Facts and dimensions are represented by physical tables in the data warehouse database
+ Fact tables are related to each dimension table in a many-to-one relationship (primary/foreign key relationships)
+ A fact table is related to many dimension tables
  – The primary key of the fact table is a composite primary key from the dimension tables
+ Each fact table is designed to answer a specific DSS question
Star Schema
+ The fact table is always the largest table in the star schema
+ Each dimension record is related to thousands of fact records
+ The star schema facilitates data retrieval functions
+ The DBMS first searches the dimension tables before the larger fact table
Star Schema
[Figure: star schema. The central fact table is keyed by CITY, PRODUCT, PERIOD and CUSTOMER and holds the measures SALE AMOUNT and UNIT. The dimensions are:
  – location: REGION, STATE, DISTRICT, CITY
  – product: PRODUCT, BRAND, COLOR, CATEGORY, SIZE
  – period: DAY, MONTH, YEAR, QUARTER
  – customer: CUSTOMER, CATEGORY, CONTACT, ADDRESS]
Star Schema for Sales
[Figure: a central Fact Table surrounded by Dimension Tables.]
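A minimal sketch of a star join, using Python's built-in sqlite3 module with an in-memory database. The table names (fact_sales, dim_city, dim_product), the column names and all rows are hypothetical; they only loosely follow the CITY and PRODUCT dimensions of the figure, and two dimensions are enough to show the pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_city (
    city_key INTEGER PRIMARY KEY, city TEXT, state TEXT, region TEXT);
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY, product TEXT, brand TEXT, category TEXT);
-- The fact table holds one foreign key per dimension plus the numeric measures.
CREATE TABLE fact_sales (
    city_key    INTEGER REFERENCES dim_city(city_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    sale_amount REAL,
    units       INTEGER);
INSERT INTO dim_city VALUES
    (1, 'Hyderabad', 'Andhra', 'South'),
    (2, 'Chennai', 'Tamil Nadu', 'South');
INSERT INTO dim_product VALUES
    (1, 'Cola', 'BrandA', 'Soft drinks'),
    (2, 'Milk', 'BrandB', 'Dairy');
INSERT INTO fact_sales VALUES
    (1, 1, 100.0, 10), (1, 2, 40.0, 4), (2, 1, 60.0, 6);
""")


def sales_by_state_and_category(connection):
    """Star join: the fact table joined to each dimension, then aggregated."""
    return connection.execute("""
        SELECT c.state, p.category, SUM(f.sale_amount)
        FROM fact_sales AS f
        JOIN dim_city AS c ON c.city_key = f.city_key
        JOIN dim_product AS p ON p.product_key = f.product_key
        GROUP BY c.state, p.category
        ORDER BY c.state, p.category
    """).fetchall()


report = sales_by_state_and_category(conn)
```

Descriptive attributes (state, category) come from the dimension tables, while the number being summed comes from the fact table.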
Dimensional Modeling - Basic Concepts
• Fact Tables
  – The most useful facts in a fact table are numeric and additive
  – Typically represents a business transaction or event that can be used in analyzing a business process
  – By nature, fact tables are sparse
  – Usually very large - billions of records
Dimensional Modeling - Basic Concepts
• Dimension Tables
  – Each dimension table has a single-part primary key that corresponds exactly to one of the components of the multipart key in the fact table.
  – Dimension tables most often contain descriptive textual information
  – Determine contextual background for facts
  – Examples
    • Time
    • Location/Region
    • Customers
Dimensional Modeling - Basic Concepts
• Measures
  – A numeric attribute of a fact
  – Represents performance or behavior of the business relative to the dimensions
  – The actual numbers are called variables
  – Occupy very little space compared to Fact Tables
  – Examples
    • Quantity supplied
    • Transaction amount
    • Sales volume
Fact Table & Dimension Tables
+ Fact Tables
  – Numerical measurements of business are stored in Fact Tables.
+ Dimensional Tables
  – Dimensions are attributes about facts.
Conformed Dimensions
+ A dimension that means the same thing with every possible fact table that it can be joined with
+ Conformed dimensions are most essential
  – For the Bus Architecture
  – For the integrated function of the Data Warehouse
+ Some common dimensions are Customer, Product, Location, Time
Surrogate Keys
+ All tables (facts and dimensions) should use Data Warehouse generated surrogate keys rather than production keys
  – Production keys sometimes get reused
  – In case of mergers/acquisitions, surrogate keys protect you from different key formats
  – Production systems may change to generalize key definitions
  – Using surrogate keys will be faster
  – Can handle Slowly Changing Dimensions well
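One way to picture surrogate key assignment is the sketch below. The class name SurrogateKeyGenerator, the method key_for and the source-system labels are hypothetical; a real warehouse would normally do this inside its load process, typically with a lookup table and a database sequence.

```python
class SurrogateKeyGenerator:
    """Maps (source system, production key) pairs to warehouse-generated integers."""

    def __init__(self):
        self._next_key = 1
        self._by_natural_key = {}

    def key_for(self, source_system, production_key):
        natural_key = (source_system, production_key)
        if natural_key not in self._by_natural_key:
            # First time this production key is seen: hand out a new surrogate.
            self._by_natural_key[natural_key] = self._next_key
            self._next_key += 1
        return self._by_natural_key[natural_key]


generator = SurrogateKeyGenerator()
# The same production key coming from two merged companies stays distinct.
billing_key = generator.key_for("billing", "C100")
crm_key = generator.key_for("crm", "C100")
```

Because the warehouse key is a plain integer that never depends on the format of the production key, a change of key format in a source system only affects the lookup, not the fact tables.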
Factless Fact Tables
• For Event Tracking, e.g. attendance
[Figure: a factless fact table holding only the keys Date_Key, Student_Key, Course_Key, Teacher_Key and Facility_Key, each linked to its dimension table: Date, Student, Course, Teacher, Facility.]
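Since a factless fact table carries no measure, analysis consists of counting rows. The sketch below assumes hypothetical key values in the column order of the figure (date, student, course, teacher, facility); the function name attendance_per_course is illustrative.

```python
from collections import Counter

# Hypothetical attendance events:
# (date_key, student_key, course_key, teacher_key, facility_key)
ATTENDANCE = [
    (20110801, 1, 101, 7, 3),
    (20110801, 2, 101, 7, 3),
    (20110801, 1, 102, 8, 4),
    (20110802, 2, 101, 7, 3),
]


def attendance_per_course(rows):
    """Each row is one attendance event, so the 'measure' is simply the row count."""
    return Counter(course_key for _, _, course_key, _, _ in rows)


counts = attendance_per_course(ATTENDANCE)
```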
Coverage Tables
• Problem: how to find out which products on promotion did not sell?
[Figure: sales fact table with Date_Key, Product_Key, Store_Key, Promotion_Key and the measure Dollars Sold, linked to the Date, Store, Product and Promotion dimensions.]
Coverage Tables
• Solution - Coverage Tables
[Figure: Sales Promotion Coverage Table with Date_Key, Product_Key, Store_Key and Promotion_Key, linked to the Date, Store, Product and Promotion dimensions.]
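The sales fact table only has rows for products that actually sold, so it cannot answer the question by itself. A coverage table lists every product that was on promotion, and the answer is the set of coverage rows with no matching sales row. The sketch below uses hypothetical key values and an illustrative function name.

```python
# Hypothetical keys: (date_key, product_key, store_key, promotion_key)
COVERAGE = {
    (20110801, 10, 1, 500),
    (20110801, 11, 1, 500),
    (20110801, 12, 1, 500),
}

# Sales fact rows: the same keys mapped to the measure Dollars Sold.
SALES = {
    (20110801, 10, 1, 500): 120.0,
    (20110801, 12, 1, 500): 45.5,
}


def promoted_but_unsold(coverage, sales):
    """Coverage rows that have no matching row in the sales fact table."""
    return sorted(coverage - set(sales))


unsold = promoted_but_unsold(COVERAGE, SALES)
```

In SQL the same result would usually come from an outer join or a NOT EXISTS query between the two tables.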
Snowflake Schema
+ Dimension tables are normalized by decomposing at the attribute level
+ Each dimension has one key for each level of the dimension's hierarchy
+ Good performance when queries involve aggregation
+ Complicated maintenance and metadata; explosion in the number of tables
+ Makes user representation more complex and intricate
Snowflake schema - Example
[Figure: a Fact Table connected to several Dim Tables.]
UNIT III to V
+ Data mining is the automated detection of new, valuable and non-trivial information in large volumes of data.
+ It predicts future trends and finds behavior that the experts may miss because it lies outside their expectations
  – Data mining lets you be proactive
  – Prospective rather than retrospective
+ Data mining leads to simplification and automation of the overall statistical process of deriving information from huge volumes of data.
Examples of Data Modeling Tools
• ERWin
  – Supports Data Warehouse design as a modeling technique
• Powersoft WarehouseArchitect
  – Module of Power Designer specifically for DW Modeling
• Oracle Designer
  – Can be extended for Warehouse modeling
• Others like Infomodeler, Silverrun are also used
Data Mining Introduction
• DM - what it can do
  – Exploit patterns & relationships in data to produce models
  – Two uses for models
    • Predictive
    • Descriptive
• DM - what it can't do
  – Automatically find relationships
    • without user intervention
    • when no relationships exist
Data Mining Introduction
• Data Mining and Data Warehousing
  – Data preparation for DM may be part of the Data Warehousing
  – A Data Warehouse is not a requirement for Data Mining
• DM and OLAP
  – OLAP = classic descriptive model
  – Requires significant user input
  – Example: beer and diaper sales
    • An OLAP tool shows reports giving sales of different items
    • A data mining tool analyses the data and predicts how many times beer and diapers are sold together
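The beer-and-diapers example comes down to counting how often two items appear in the same transaction. Here is a minimal sketch with hypothetical baskets; the function name co_occurrence and the use of support and confidence as the reported figures are assumptions made for illustration, not something the slide specifies.

```python
# Hypothetical market baskets, one set of items per transaction.
BASKETS = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "bread"},
    {"milk", "bread"},
]


def co_occurrence(baskets, item_a, item_b):
    """Return (count, support, confidence) for item_a and item_b bought together.

    support    = share of all baskets containing both items
    confidence = share of baskets with item_a that also contain item_b
    """
    both = sum(1 for basket in baskets if item_a in basket and item_b in basket)
    with_a = sum(1 for basket in baskets if item_a in basket)
    support = both / len(baskets) if baskets else 0.0
    confidence = both / with_a if with_a else 0.0
    return both, support, confidence


count, support, confidence = co_occurrence(BASKETS, "beer", "diapers")
```

An OLAP report would list beer sales and diaper sales separately; the pairing only shows up when the transactions themselves are analysed.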
Data Mining
+ Proactive
+ Automatically searches
  – Anomalies
  – Possible Relationships
  – Identify Problems before the end-user
+ Data Mining tools analyze the data, uncover problems or opportunities hidden in data relationships, form computer models based on their findings, and then use the models to predict business behavior with minimal end-user intervention
Data Mining
+ A methodology designed to perform knowledge-discovery expeditions over the database data with minimal end-user intervention
+ 3 Stages of Data
  – Data
  – Information
  – Knowledge
8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt
135/158
&6traction of Hno.ledge from
Data
4 Phases of Data Mining
• Data Preparation
  – Identify the main data sets to be used by the data mining operation (usually the data warehouse)
• Data Analysis and Classification
  – Study the data to identify common data characteristics or patterns
    • Data groupings, classifications, clusters, sequences
    • Data dependencies, links or relationships
    • Data patterns, trends, deviation
4 Phases of Data Mining
• Knowledge Acquisition
  – Uses the results of the Data Analysis and Classification phase
  – Data mining tool selects the appropriate modeling or knowledge-acquisition algorithms
    • Neural Networks
    • Decision Trees
    • Rules Induction
    • Genetic algorithms
    • Memory-Based Reasoning
• Prognosis
  – Predict Future Behavior
  – Forecast Business Outcomes
    • 65% of customers who did not use a particular credit card in the last 6 months are 88% likely to cancel the account.
Data Mining
+ Still a New Technique
+ May find many Un-meaningful Relationships
+ Good at finding Practical Relationships
  – Define Customer Buying Patterns
  – Improve Product Development and Acceptance
  – Etc.
+ Potential of becoming the next frontier in database development
Why Data Mining
• Credit ratings/targeted marketing:
  – Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
  – Identify likely responders to sales promotions
• Fraud detection
  – Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?
• Process of semi-automatically analyzing large databases to find patterns that are:
  – valid: hold on new data with some certainty
  – novel: non-obvious to the system
  – useful: should be possible to act on the item
  – understandable: humans should be able to interpret the pattern
• Also known as Knowledge Discovery in Databases (KDD)
Applications (continued)
• Medicine: disease outcome, effectiveness of treatments
  – analyze patient disease history: find relationship between diseases
• Molecular/Pharmaceutical: identify new drugs
• Scientific data analysis:
  – identify new galaxies by searching for sub clusters
• Web site/store design and promotion:
  – find affinity of visitor to pages and modify layout
Knowledge Discovery Definition
Knowledge Discovery in Data is the non-trivial process of identifying
  – valid
  – novel
  – potentially useful
  – and ultimately understandable patterns in data.
from Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (Chapter 1), AAAI/MIT Press 1996
Related Fields
[Figure: Data Mining and Knowledge Discovery shown at the intersection of Statistics, Machine Learning, Databases and Visualization.]
Statistics, Machine Learning and Data Mining
• Statistics:
  – more theory-based
  – more focused on testing hypotheses
• Machine learning
  – more heuristic
  – focused on improving performance of a learning agent
  – also looks at real-time learning and robotics - areas not part of data mining
• Data Mining and Knowledge Discovery
  – integrates theory and heuristics
  – focus on the entire process of knowledge discovery, including data cleaning, learning, and integration and visualization of results
• Distinctions are fuzzy
Knowledge Discovery Process
Flow according to CRISP-DM
[Figure: the CRISP-DM process flow, extended with a Monitoring step.]
Continuous monitoring and improvement is an addition to CRISP
see www.crisp-dm.org for more information
Historical Note: Many Names of Data Mining
• Data Fishing, Data Dredging: 1960-
  – used by statisticians (as bad name)
• Data Mining: 1990 --
  – used in DB community, business
• Knowledge Discovery in Databases (1989-)
  – used by AI, Machine Learning Community
• also Data Archaeology, Information Harvesting, Information Discovery, Knowledge Extraction, ...
Currently: Data Mining and Knowledge Discovery are used interchangeably
Some Definitions
• Instance (also Item or Record):
  – an example, described by a number of attributes
  – e.g. a day can be described by temperature, humidity and cloud status
• Attribute or Field
  – measuring aspects of the Instance, e.g. temperature
• Class (Label)
  – grouping of instances, e.g. days good for playing
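The three terms map naturally onto a small data structure. The sketch below is illustrative only: the class name Day, the field names and the sample values are assumptions built from the "day" example in the definitions.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Day:
    """One instance (record); each field except the label is an attribute."""
    temperature: float
    humidity: float
    cloud_status: str
    good_for_playing: bool  # the class label


# Hypothetical instances.
DAYS = [
    Day(29.0, 85.0, "sunny", False),
    Day(21.0, 70.0, "overcast", True),
    Day(18.0, 65.0, "rainy", True),
]


def attribute_names(instance, label="good_for_playing"):
    """Names of the attributes that describe an instance, excluding the class label."""
    return [f.name for f in fields(instance) if f.name != label]


def instances_of_class(instances, label_value):
    """All instances that belong to the given class."""
    return [day for day in instances if day.good_for_playing == label_value]
```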
Major Data Mining Tasks
• Classification: predicting an item class, e.g. with a decision tree
• Clustering: finding clusters in data
• Associations: e.g. A & B & C occur frequently
• Visualization: to facilitate human discovery
• Summarization: describing a group
• Deviation Detection: finding changes
• Estimation: predicting a continuous value
• Link Analysis: finding relationships
• ...
8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt
150/158
1M0
In 2 years (2003 to 2005),the size of the largest database TRIPLED
Data Growth Rate
• Twice as much information was created in 2002 as in 1999 (~30% growth rate)
• Other growth rate estimates even higher
• Very little data will ever be looked at by a human
Knowledge Discovery is NEEDED to make sense and use of data.
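The growth figures quoted on the two slides above can be turned into compound annual rates with one line of arithmetic. This is a sketch under the assumption of steady compound growth over the stated period; the function name is illustrative.

```python
def annual_growth_rate(factor, years):
    """Compound annual rate r such that (1 + r) ** years == factor."""
    return factor ** (1.0 / years) - 1.0


# Information created doubled between 1999 and 2002 (three years).
doubling_1999_2002 = annual_growth_rate(2, 3)

# The largest database tripled between 2003 and 2005 (two years).
tripling_2003_2005 = annual_growth_rate(3, 2)
```

Under that assumption, doubling in three years works out to about 26% per year, in line with the rounded ~30% on the slide, while tripling in two years corresponds to about 73% per year.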