Business Analytics and Big Data: the process and the tools
Mehmet Gençer, Assoc. Prof., Organization Studies &
Computer [email protected]@ieu.edu.tr
https://mgencer.com
How big?
1st Character of Big Data: Volume
• A typical PC might have had 10 gigabytes of storage in 2000.
• Today, Facebook ingests 500 terabytes of new data every day.
• A Boeing 737 will generate 240 terabytes of flight data during a single flight across the US.
• Smartphones, the data they create and consume, and sensors embedded into everyday objects will soon result in billions of new, constantly updated data feeds containing environmental, location, and other information, including video.
2nd Character of Big Data: Velocity
• Clickstreams and ad impressions capture user behavior at millions of events per second
• High-frequency stock trading algorithms reflect market changes within microseconds
• Machine-to-machine processes exchange data between billions of devices
• Infrastructure and sensors generate massive log data in real time
• Online gaming systems support millions of concurrent users, each producing multiple inputs per second
3rd Character of Big Data: Variety
• Big Data isn't just numbers, dates, and strings. Big Data is also geospatial data, 3D data, audio and video, and unstructured text, including log files and social media.
• Traditional database systems were designed to address smaller volumes of structured data, fewer updates or a predictable, consistent data structure.
• Big Data analysis includes different types of data
The Structure of Big Data
Structured
• Most traditional data sources
Semi-structured
• Many sources of big data
Unstructured
• Video data, audio data
Applications of Big Data Analytics
Homeland Security
Smarter Healthcare
Multi-channel Sales
Telecom
Manufacturing
Traffic Control
Trading Analytics
Search Quality
Types of tools used in Big Data
• Where is processing hosted?
– Distributed servers / cloud (e.g. Amazon EC2)
• Where is data stored?
– Distributed storage (e.g. Amazon S3)
• What is the programming model?
– Distributed processing (e.g. MapReduce)
• How is data stored and indexed?
– High-performance schema-free databases (e.g. MongoDB)
• What operations are performed on data?
– Analytic / semantic processing
• Where is the processing performed?
– Web service (e.g. sense.io), desktop (R, SPSS, WEKA)
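The MapReduce programming model named above can be illustrated on a single machine. The following is only a minimal sketch of the idea (map, group by key, reduce), not the API of Hadoop or any other framework; the function names and the two sample documents are made up for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values of one key into a single result."""
    return key, sum(values)


# Hypothetical input; in a real cluster each document could live on a different node
documents = ["big data is big", "data about data"]

mapped = chain.from_iterable(map_phase(doc) for doc in documents)
word_counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(word_counts)
```

In a distributed setting the map and reduce calls run in parallel on different servers, and the framework performs the shuffle over the network.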
Leading Technology Vendors
Example Vendors
• IBM – Netezza
• EMC – Greenplum
• Oracle – Exadata
Commonality
• MPP architectures
• Commodity hardware
• RDBMS based
• Full SQL compliance
Analysis tools (open source unless noted):
● R
● SPSS (proprietary)
● WEKA
● Sense.io (web service)
● Sage
Statistics 101
Random Sample and Statistics
• Population: the set or universe of all entities under study.
• However, looking at the entire population may not be feasible, or may be too expensive.
• Instead, we draw a random sample from the population and compute appropriate statistics from the sample, which give estimates of the corresponding population parameters of interest.
Statistic
• Let $S_i$ denote the random variable (e.g. age) corresponding to data point $x_i$ (e.g. a person in the sample); then a statistic $\theta$ is a function $\theta : (S_1, S_2, \dots, S_n) \to \mathbb{R}$.
• If we use the value of a statistic to estimate a population parameter, this value is called a point estimate of the parameter, and the statistic is called an estimator of the parameter.
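To make the terms concrete, here is a small sketch that draws a random sample and uses the sample mean as a point estimate of the population mean. The population is synthetic (randomly generated ages), and the sample size of 500 is an arbitrary choice for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: ages of 100,000 people (synthetic data)
population = [random.randint(18, 90) for _ in range(100_000)]

# Draw a simple random sample without replacement
sample = random.sample(population, k=500)

# The sample mean is a statistic; its value is a point estimate
# of the population parameter (the population mean)
population_mean = statistics.fmean(population)
point_estimate = statistics.fmean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"point estimate : {point_estimate:.2f}")
```

In practice the population mean is unknown; it is computed here only to show how close the estimate from 0.5% of the data comes.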
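Empirical Cumulative Distribution Function
$\hat{F}(x) = \frac{1}{n} \sum_{i=1}^{n} I(x_i \le x)$
Where $I(x_i \le x) = 1$ if $x_i \le x$, and $0$ otherwise.
Inverse Cumulative Distribution Function
$F^{-1}(q) = \min\{x \mid F(x) \ge q\}$ for $q \in [0, 1]$; the empirical version $\hat{F}^{-1}$ uses $\hat{F}$ in place of $F$.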
Example
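As an illustration, the empirical CDF and its inverse can be computed directly from their standard definitions: the fraction of sample points not exceeding x, and the smallest value whose empirical CDF reaches q. This is a sketch on a made-up sample of ten ages, not the example from the original slide.

```python
def ecdf(sample, x):
    """Empirical CDF: fraction of sample points that are <= x."""
    return sum(1 for xi in sample if xi <= x) / len(sample)


def inverse_ecdf(sample, q):
    """Empirical inverse CDF: smallest sample value whose ECDF is at least q."""
    ordered = sorted(sample)
    n = len(ordered)
    for k, x in enumerate(ordered, start=1):
        # At least k of the n points are <= ordered[k - 1]
        if k / n >= q:
            return x
    return ordered[-1]


ages = [23, 35, 35, 41, 52, 60, 60, 60, 71, 88]  # made-up sample

print(ecdf(ages, 35))            # share of people aged 35 or younger
print(inverse_ecdf(ages, 0.5))   # value at which half of the sample is reached
print(inverse_ecdf(ages, 0.75))  # third quartile
```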
Measures of Central Tendency (Mean)
Population Mean: $\mu = E[X]$
Sample Mean (Unbiased, not robust): $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i$
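Measures of Central Tendency (Median)
Population Median: a value $m$ such that $P(X \le m) \ge \frac{1}{2}$ and $P(X \ge m) \ge \frac{1}{2}$
or, in terms of the CDF $F$, $F(m) = 0.5$, i.e. $m = F^{-1}(0.5)$
Sample Median: $\hat{F}(m) = 0.5$, i.e. $m = \hat{F}^{-1}(0.5)$, where $\hat{F}$ is the empirical CDF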
Example
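The difference in robustness between the mean and the median shows up as soon as one extreme value enters the sample. The numbers below are made up for illustration (incomes in thousands).

```python
import statistics

incomes = [30, 32, 35, 38, 40]    # made-up sample, in thousands
with_outlier = incomes + [1000]   # the same sample plus one extreme value

mean_before = statistics.fmean(incomes)
median_before = statistics.median(incomes)
mean_after = statistics.fmean(with_outlier)
median_after = statistics.median(with_outlier)

print(f"mean:   {mean_before:.1f} -> {mean_after:.1f}")
print(f"median: {median_before:.1f} -> {median_after:.1f}")
```

A single outlier moves the mean far away from the bulk of the data, while the median barely changes.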
Measures of Dispersion (Range)
Range: $r = \max\{X\} - \min\{X\}$
Not robust, sensitive to extreme values
Sample Range: $\hat{r} = \max_i\{x_i\} - \min_i\{x_i\}$
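Measures of Dispersion (Inter-Quartile Range)
Inter-Quartile Range (IQR): $IQR = F^{-1}(0.75) - F^{-1}(0.25)$, where $F^{-1}$ is the inverse CDF
More robust
Sample IQR: $\widehat{IQR} = \hat{F}^{-1}(0.75) - \hat{F}^{-1}(0.25)$, where $\hat{F}^{-1}$ is the empirical inverse CDF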
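Measures of Dispersion (Variance and Standard Deviation)
Variance: $\sigma^2 = \mathrm{var}(X) = E[(X - \mu)^2]$
Standard Deviation: $\sigma = \sqrt{\sigma^2}$
Sample Variance: $\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2$, where $\hat{\mu}$ is the sample mean (dividing by $n - 1$ instead of $n$ gives the unbiased estimator)
Sample Standard Deviation: $\hat{\sigma} = \sqrt{\hat{\sigma}^2}$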
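The dispersion measures can be compared on a small made-up sample. This sketch computes the sample range, an IQR based on the empirical inverse CDF (other quartile conventions exist and give slightly different values), and the variance with both the n and the n - 1 denominators.

```python
import statistics


def inverse_ecdf(sample, q):
    """Smallest sample value whose empirical CDF is at least q."""
    ordered = sorted(sample)
    n = len(ordered)
    for k, x in enumerate(ordered, start=1):
        if k / n >= q:
            return x
    return ordered[-1]


def sample_range(sample):
    return max(sample) - min(sample)


def sample_iqr(sample):
    return inverse_ecdf(sample, 0.75) - inverse_ecdf(sample, 0.25)


data = [2, 4, 4, 5, 7, 9, 10, 12]   # made-up sample
with_outlier = data + [100]         # same sample plus one extreme value

print("range:", sample_range(data), "->", sample_range(with_outlier))
print("IQR:  ", sample_iqr(data), "->", sample_iqr(with_outlier))
print("variance (divide by n):    ", statistics.pvariance(data))
print("variance (divide by n - 1):", statistics.variance(data))
```

The range jumps when the outlier is added, whereas the IQR changes only slightly, which is what "more robust" refers to.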
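Univariate Normal Distribution
$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(x - \mu)^2}{2\sigma^2}\right\}$
Multivariate Normal Distribution
$f(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left\{-\frac{(\mathbf{x} - \boldsymbol{\mu})^{T} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})}{2}\right\}$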
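The standard normal densities can be evaluated with a few lines of code. This is a sketch only: the multivariate case is restricted to two dimensions so that the covariance matrix can be inverted in closed form, and the function names are chosen for illustration.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a univariate normal with mean mu and standard deviation sigma."""
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / math.sqrt(2 * math.pi * sigma ** 2)


def bivariate_normal_pdf(x, mu, cov):
    """Density of a 2-D normal; cov is a 2x2 covariance matrix as nested lists."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (x - mu)^T cov^{-1} (x - mu), using the closed-form 2x2 inverse
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-quad / 2) / (2 * math.pi * math.sqrt(det))


print(normal_pdf(0.0, 0.0, 1.0))
print(bivariate_normal_pdf((1.0, 2.0), (0.0, 0.0), [[1.0, 0.0], [0.0, 4.0]]))
```

With a diagonal covariance matrix the two coordinates are independent, so the bivariate density equals the product of the two univariate densities.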
OLAP (online analytical processing) and Data Mining
Warehouse Architecture
[Figure: warehouse architecture, showing Source systems, an Integration layer, the Warehouse with its Metadata, a Query & Analysis layer, and Clients]
Star Schemas
• A star schema is a common organization for data at a warehouse. It consists of:
1. Fact table: a very large accumulation of facts such as sales. Often “insert-only.”
2. Dimension tables: smaller, generally static information about the entities involved in the facts.
Terms
• Fact table
• Dimension tables
• Measures
[Figure: star schema]
sale(orderId, date, custId, prodId, storeId, qty, amt)
customer(custId, name, address, city)
product(prodId, name, price)
store(storeId, city)
Star

customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la

product
prodId  name  price
p1      bolt  10
p2      nut   5

store
storeId  city
c1       nyc
c2       sfo
c3       la

sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50
Cube
Fact table view:
sale
prodId  storeId  amt
p1      c1       12
p2      c1       11
p1      c3       50
p2      c2       8

Multi-dimensional cube (dimensions = 2):
      c1   c2   c3
p1    12        50
p2    11   8
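Going from the fact table view to the two-dimensional cube amounts to pivoting the rows on prodId and storeId. A minimal sketch with nested dictionaries, using the four rows of the fact table above (the variable names are illustrative):

```python
from collections import defaultdict

# Rows of the fact table: (prodId, storeId, amt)
fact_table = [
    ("p1", "c1", 12),
    ("p2", "c1", 11),
    ("p1", "c3", 50),
    ("p2", "c2", 8),
]

# cube[prodId][storeId] holds the total amount for that cell
cube = defaultdict(dict)
for prod_id, store_id, amt in fact_table:
    cube[prod_id][store_id] = cube[prod_id].get(store_id, 0) + amt

stores = sorted({store_id for _, store_id, _ in fact_table})
print("", *stores, sep="\t")
for prod_id in sorted(cube):
    # Cells without any sale stay empty, as in the cube view
    row = [str(cube[prod_id].get(store_id, "")) for store_id in stores]
    print(prod_id, *row, sep="\t")
```

Adding date as a third key level would give the three-dimensional cube in the same way.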
3-D Cube
Fact table view:
sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

Multi-dimensional cube (dimensions = 3):
day 1
      c1   c2   c3
p1    12        50
p2    11   8
day 2
      c1   c2   c3
p1    44   4
p2
ROLAP vs. MOLAP
• ROLAP: Relational On-Line Analytical Processing
• MOLAP: Multi-Dimensional On-Line Analytical Processing
Aggregates
• Add up amounts for day 1
• In SQL: SELECT sum(amt) FROM SALE WHERE date = 1
Result: 81

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4
Another Example
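• Add up amounts by day, product
• In SQL: SELECT prodId, date, sum(amt) FROM SALE GROUP BY date, prodId

sale (fact table)
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

Result (a rollup of the fact table over storeId; drill-down goes the other way, from this result back to the detailed rows):
prodId  date  amt
p1      1     62
p2      1     19
p1      2     48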
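The aggregate queries can be tried out against an in-memory SQLite database loaded with the six fact-table rows. This is a sketch using Python's built-in sqlite3 module; the GROUP BY query also selects prodId so that the result rows can be told apart, and the ORDER BY clauses are added only to make the output order predictable.

```python
import sqlite3

rows = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", rows)

# Add up amounts for day 1
day1_total = conn.execute("SELECT sum(amt) FROM sale WHERE date = 1").fetchone()[0]

# Add up amounts by day and product (rollup over storeId)
by_day_product = conn.execute(
    "SELECT prodId, date, sum(amt) FROM sale GROUP BY date, prodId ORDER BY date, prodId"
).fetchall()

# Roll up one level further: amounts by day only
by_day = conn.execute(
    "SELECT date, sum(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()

print(day1_total)
print(by_day_product)
print(by_day)
conn.close()
```

Each additional column dropped from GROUP BY is one more rollup step; adding a column back is a drill-down.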
What is Data Mining?
• Discovery of useful, possibly unexpected, patterns in data
• Non-trivial extraction of implicit, previously unknown and potentially useful information from data
• Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
Data Mining Tasks
• Classification [Predictive]
• Clustering [Descriptive]
• Association Rule Discovery [Descriptive]
• Sequential Pattern Discovery [Descriptive]
• Regression [Predictive]
• Deviation Detection [Predictive]
• Collaborative Filtering [Predictive]
Regression
Estimating the relationship between a dependent variable (Y) and one or more independent variables (predictors, X), with the relationship represented by parameters (B)
Linear: Y = BX + e, where e is the error term
Non-linear: no general form
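For the linear case with a single predictor, the parameters can be estimated by ordinary least squares in closed form. The sketch below assumes a model with an intercept, y = intercept + slope * x + e, and uses made-up data points.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up observations that roughly follow y = 2x
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(xs, ys)
print(f"y = {intercept:.2f} + {slope:.2f} * x")
```

In R the same fit would typically be obtained with the built-in `lm` function.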
Classification: Definition
• Given a collection of records (training set)
– Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
Classification: Decision Trees
Example:
• Conducted a survey to see which customers were interested in a new car model
• Want to select customers for an advertising campaign
[Figure: training set]
Classification: KNN
K-nearest neighbours: the key idea is that similar observations belong to similar classes. Thus, one simply has to look up the class labels of a certain number of the nearest neighbours and weigh them to assign a class to the unknown observation.
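A minimal k-nearest-neighbours sketch, assuming Euclidean distance and an unweighted majority vote among the neighbours (distance-weighted voting is a common alternative). The records are made up: (age, income in thousands) with a yes/no class, split into a training set and a small test set for measuring accuracy.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict the class of query by majority vote among its k nearest training records."""
    neighbours = sorted(train, key=lambda record: math.dist(record[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical training set: ((age, income), interested in the product?)
train = [
    ((25, 30), "no"), ((30, 40), "no"), ((35, 45), "no"),
    ((45, 90), "yes"), ((50, 100), "yes"), ((55, 110), "yes"),
]

# Held-out test set used only to measure accuracy
test = [((28, 35), "no"), ((52, 95), "yes")]

correct = sum(knn_predict(train, features) == label for features, label in test)
accuracy = correct / len(test)
print(f"accuracy on the test set: {accuracy:.2f}")
```

With attributes on very different scales, the features would normally be standardized first so that no single attribute dominates the distance.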
Clustering
[Figure: clustering of records along the age, income and education axes]
K-Means Clustering
http://kodlab.seas.upenn.edu/Omur/WAFR2014
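The k-means procedure alternates between assigning every point to its nearest centroid and moving each centroid to the mean of its assigned points. Below is a minimal sketch with made-up two-dimensional points; the starting centroids are fixed by hand to keep the run deterministic, whereas practical implementations usually pick them at random.

```python
import math


def k_means(points, centroids, iterations=10):
    """Plain k-means: alternate the assignment and centroid update steps."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster
        centroids = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


# Made-up data with two visible groups
points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 9), (8.5, 9.5)]
centroids, clusters = k_means(points, centroids=[(1, 1), (8, 8)])

print(centroids)
print(clusters)
```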
Association Rule Mining
Sales records (market-basket data): transaction id, customer id, products bought
• Trend: Products p5, p8 often bought together
• Trend: Customer 12 likes product p9
Association Rule Discovery
• Marketing and Sales Promotion:
– Let the rule discovered be {Bagels, … } --> {Potato Chips}
– Potato Chips as consequent => can be used to determine what should be done to boost its sales.
– Bagels in the antecedent => can be used to see which products would be affected if the store discontinues selling bagels.
– Bagels in antecedent and Potato Chips in consequent => can be used to see what products should be sold with Bagels to promote sale of Potato Chips!
• Supermarket shelf management.
• Inventory Management
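Rules such as {Bagels} --> {Potato Chips} are usually scored with two standard measures that the slides do not spell out: support (the share of transactions containing all items of the rule) and confidence (the share of transactions with the antecedent that also contain the consequent). A minimal sketch on made-up market baskets:

```python
# Hypothetical market-basket data: one set of products per transaction
transactions = [
    {"bagels", "potato chips", "milk"},
    {"bagels", "potato chips"},
    {"bagels", "butter"},
    {"milk", "potato chips"},
    {"bagels", "potato chips", "butter"},
]


def count(itemset):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for basket in transactions if itemset <= basket)


def support(itemset):
    """Share of all transactions that contain the itemset."""
    return count(itemset) / len(transactions)


def confidence(antecedent, consequent):
    """Share of transactions with the antecedent that also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)


rule_support = support({"bagels", "potato chips"})
rule_confidence = confidence({"bagels"}, {"potato chips"})
print(f"support = {rule_support:.2f}, confidence = {rule_confidence:.2f}")
```

Rule discovery algorithms search for all rules whose support and confidence exceed chosen thresholds.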
Collaborative Filtering
• Goal: predict what movies/books/… a person may be interested in, on the basis of
– Past preferences of the person
– Other people with similar past preferences
– The preferences of such people for a new movie/book/…
• One approach based on repeated clustering
– Cluster people on the basis of preferences for movies
– Then cluster movies on the basis of being liked by the same clusters of people
– Again cluster people based on their preferences for (the newly created clusters of) movies
– Repeat above till equilibrium
• Above problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find information of interest
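The repeated-clustering approach is one option. A simpler variant that is easier to show in a few lines is user-based neighbourhood filtering: predict a person's rating as the similarity-weighted average of other people's ratings. The sketch below uses that swapped-in technique with cosine similarity over commonly rated movies; the users, titles and ratings are all made up.

```python
import math

# Hypothetical ratings on a 1-5 scale: ratings[user][movie]
ratings = {
    "ann":  {"Alien": 5, "Brazil": 4, "Casablanca": 1},
    "bob":  {"Alien": 4, "Brazil": 5, "Casablanca": 1, "Dune": 5},
    "cara": {"Alien": 1, "Brazil": 2, "Casablanca": 5, "Dune": 1},
}


def similarity(a, b):
    """Cosine similarity of two users over the movies both have rated."""
    shared = ratings[a].keys() & ratings[b].keys()
    if not shared:
        return 0.0
    dot = sum(ratings[a][m] * ratings[b][m] for m in shared)
    norm_a = math.sqrt(sum(ratings[a][m] ** 2 for m in shared))
    norm_b = math.sqrt(sum(ratings[b][m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)


def predict(user, movie):
    """Similarity-weighted average of other users' ratings for movie."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other, their_ratings in ratings.items():
        if other != user and movie in their_ratings:
            weight = similarity(user, other)
            weighted_sum += weight * their_ratings[movie]
            weight_total += weight
    return weighted_sum / weight_total if weight_total else None


predicted = predict("ann", "Dune")
print(f"predicted rating of Dune for ann: {predicted:.2f}")
```

Because ann's past preferences resemble bob's far more than cara's, the prediction is pulled toward bob's high rating.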
Other Types of Mining
• Text mining: application of data mining to textual documents
– Cluster Web pages to find related pages
– Cluster pages a user has visited to organize their visit history
– Classify Web pages automatically into a Web directory
– Mine consumer or public opinion in Twitter messages
• Graph mining:
– Deal with graph data
– Social network analysis
Data Streams
• What are Data Streams?
– Continuous streams
– Huge, fast, and changing
• Why Data Streams?
– The arrival speed of streams and the huge amount of data are beyond our capability to store them.
– “Real-time” processing
• Window Models
– Landmark window (entire data stream)
– Sliding window
– Damped window
• Mining Data Streams
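The three window models differ in which part of the stream contributes to a result. The sketch below keeps a running mean under each model while values arrive one at a time, without storing the whole stream. The class name, the window size of 2 and the decay factor of 0.5 are arbitrary choices for illustration.

```python
from collections import deque


class StreamMeans:
    """Running means over a stream under three window models."""

    def __init__(self, window_size, decay):
        self.window = deque(maxlen=window_size)  # sliding window: last N values only
        self.decay = decay                       # damped window: weight kept by older data
        self.count = 0                           # entire stream: number of values seen
        self.total = 0.0                         # entire stream: sum of values seen
        self.damped_sum = 0.0
        self.damped_weight = 0.0

    def update(self, value):
        self.count += 1
        self.total += value
        self.window.append(value)                # the oldest value drops out automatically
        # Every update shrinks the influence of all earlier values by the decay factor
        self.damped_sum = self.decay * self.damped_sum + value
        self.damped_weight = self.decay * self.damped_weight + 1.0

    def whole_stream_mean(self):
        return self.total / self.count

    def sliding_mean(self):
        return sum(self.window) / len(self.window)

    def damped_mean(self):
        return self.damped_sum / self.damped_weight


stats = StreamMeans(window_size=2, decay=0.5)
for reading in [10, 10, 10, 10, 50, 50]:   # made-up sensor readings
    stats.update(reading)

print(stats.whole_stream_mean(), stats.sliding_mean(), stats.damped_mean())
```

After the jump from 10 to 50, the sliding window reflects only the new level, the whole-stream mean is still dominated by old readings, and the damped mean lies in between.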
Model quality and comparison
● A statistical model has limited explanatory and/or predictive power
● One needs to use measures to compare alternative models and choose the best one
Example measures:
● Regression analysis: R-squared measure
● Classification: k-value of parameters
● Classification: Precision-recall
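Two of these measures are easy to compute by hand. The sketch below uses the standard definitions: R-squared as one minus the ratio of residual to total sum of squares, precision as the share of predicted positives that are correct, and recall as the share of actual positives that are found. All values are made up.

```python
def r_squared(actual, predicted):
    """Share of the variance in actual that is explained by predicted."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def precision_recall(actual, predicted, positive="yes"):
    """Precision and recall for one class; assumes that class occurs in both lists."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == positive and p == positive)
    false_pos = sum(1 for a, p in pairs if a != positive and p == positive)
    false_neg = sum(1 for a, p in pairs if a == positive and p != positive)
    return true_pos / (true_pos + false_pos), true_pos / (true_pos + false_neg)


# Regression: a model that always predicts the mean explains nothing
print(r_squared([3, 5, 7, 9], [3, 5, 7, 9]))
print(r_squared([3, 5, 7, 9], [6, 6, 6, 6]))

# Classification: compare true labels with a classifier's output
actual = ["yes", "yes", "yes", "no", "no", "no", "no", "no"]
predicted = ["yes", "yes", "no", "yes", "yes", "no", "no", "no"]
precision, recall = precision_recall(actual, predicted)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```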
Know the alternative models
"How [the] translation from subject-matter problem to statistical model is done is often the most critical part of an analysis" (Sir David Cox, British statistician)
[Figure: methods arranged along two axes, supervised vs. unsupervised and parametric (results are easy to interpret) vs. nonparametric]
Methods shown:
• Regression
• Decision tree
• K-nearest neighbours
• Neural networks
• Hierarchical clustering
• Association rules
• Text mining / Topic analysis
• Recommender systems (collaborative filtering)
Hands on statistics with R
Resources:
● https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf (see Appendix A, "A sample session")
● http://www.rdatamining.com/