
Business Analytics and Big Data: the process and the tools

Mehmet Gençer

Assoc. Prof., Organization Studies &

Computer [email protected]@ieu.edu.tr

https://mgencer.com

How big?

1st Characteristic of Big Data: Volume

• A typical PC might have had 10 gigabytes of storage in 2000.

• Today, Facebook ingests 500 terabytes of new data every day.

• A Boeing 737 will generate 240 terabytes of flight data during a single flight across the US.

• Smartphones, the data they create and consume, and sensors embedded into everyday objects will soon result in billions of new, constantly updated data feeds containing environmental, location, and other information, including video.

2nd Characteristic of Big Data: Velocity

• Clickstreams and ad impressions capture user behavior at millions of events per second

• High-frequency stock trading algorithms reflect market changes within microseconds

• Machine-to-machine processes exchange data between billions of devices

• Infrastructure and sensors generate massive log data in real time

• Online gaming systems support millions of concurrent users, each producing multiple inputs per second

3rd Characteristic of Big Data: Variety

• Big Data isn't just numbers, dates, and strings. Big Data is also geospatial data, 3D data, audio and video, and unstructured text, including log files and social media.

• Traditional database systems were designed to address smaller volumes of structured data, fewer updates, and a predictable, consistent data structure.

• Big Data analysis includes different types of data

The Structure of Big Data

Structured
• Most traditional data sources

Semi-structured
• Many sources of big data

Unstructured
• Video data, audio data

Applications of Big Data Analytics

Homeland Security

Smarter Healthcare

Multi-channel sales

Telecom

Manufacturing

Traffic Control

Trading Analytics

Search Quality

Types of tools used in Big Data

• Where is processing hosted?
– Distributed servers / cloud (e.g. Amazon EC2)

• Where is data stored?
– Distributed storage (e.g. Amazon S3)

• What is the programming model?
– Distributed processing (e.g. MapReduce)

• How is data stored and indexed?
– High-performance schema-free databases (e.g. MongoDB)

• What operations are performed on data?
– Analytic / semantic processing

• Where is the processing performed?
– Web service (e.g. sense.io), desktop (R, SPSS, WEKA)

Leading Technology Vendors

Example Vendors

• IBM – Netezza
• EMC – Greenplum
• Oracle – Exadata

Commonality

• MPP architectures
• Commodity hardware
• RDBMS based
• Full SQL compliance

Analysis tools (open source unless noted):
● R
● SPSS (proprietary)
● WEKA
● Sense.io (hosted commercial service)
● sage

Statistics 101

Random Sample and Statistics

• Population: the set or universe of all entities under study.
• However, looking at the entire population may not be feasible, or may be too expensive.
• Instead, we draw a random sample from the population and compute appropriate statistics from the sample, which give estimates of the corresponding population parameters of interest.
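The idea of estimating a population parameter from a random sample can be sketched in a few lines of Python. The population here is synthetic (randomly generated ages) and the sample size of 500 is an arbitrary choice for illustration; neither comes from the slides.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: ages of 100,000 people (synthetic data)
population = [random.randint(18, 90) for _ in range(100_000)]

# Draw a simple random sample without replacement
sample = random.sample(population, k=500)

population_mean = statistics.mean(population)  # the parameter (usually unknown)
sample_mean = statistics.mean(sample)          # the estimate computed from the sample

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

In practice only the sample would be available; the population mean is computed here just to show how close the estimate lands.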

Statistic

• Let $S_i$ denote the random variable (e.g. age) corresponding to data point $x_i$ (e.g. a person in the sample); then a statistic $\hat{\theta}$ is a function $\hat{\theta} : (S_1, S_2, \ldots, S_n) \to \mathbb{R}$.

• If we use the value of a statistic to estimate a population parameter, this value is called a point estimate of the parameter, and the statistic is called an estimator of the parameter.

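Empirical Cumulative Distribution Function

$$\hat{F}(x) = \frac{1}{n} \sum_{i=1}^{n} I(x_i \le x)$$

Where

$$I(x_i \le x) = \begin{cases} 1 & \text{if } x_i \le x \\ 0 & \text{if } x_i > x \end{cases}$$

Inverse Cumulative Distribution Function

$$F^{-1}(q) = \min \{ x \mid F(x) \ge q \} \quad \text{for } q \in [0, 1]$$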

Example
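As a small illustration, the empirical CDF and its inverse can be computed directly from their usual definitions (the fraction of sample points at or below x, and the smallest sample value whose ECDF reaches q). The data values and the function names `ecdf` and `inverse_ecdf` are made up for this sketch.

```python
import math

def ecdf(sample, x):
    """Fraction of sample points that are <= x."""
    return sum(1 for xi in sample if xi <= x) / len(sample)

def inverse_ecdf(sample, q):
    """Smallest sample value whose ECDF value is at least q (0 < q <= 1)."""
    ordered = sorted(sample)
    # The k-th smallest value (1-based) has ECDF >= k/n, so take k = ceil(q * n).
    k = max(1, math.ceil(q * len(ordered)))
    return ordered[k - 1]

ages = [23, 35, 35, 41, 52, 60, 60, 60, 71, 88]  # made-up data

print(ecdf(ages, 35))           # 3 of the 10 values are <= 35
print(inverse_ecdf(ages, 0.5))  # smallest value with ECDF >= 0.5
```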

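Measures of Central Tendency (Mean)

Population Mean:

$$\mu = E[X]$$

Sample Mean (Unbiased, not robust):

$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Measures of Central Tendency (Median)

Population Median: a value $m$ such that

$$P(X \le m) \ge \tfrac{1}{2} \quad \text{and} \quad P(X \ge m) \ge \tfrac{1}{2}$$

or, in terms of the cumulative distribution function $F$,

$$F(m) = 0.5, \quad \text{i.e. } m = F^{-1}(0.5)$$

Sample Median (using the empirical CDF $\hat{F}$):

$$\hat{F}(m) = 0.5, \quad \text{i.e. } m = \hat{F}^{-1}(0.5)$$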

Example

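Measures of Dispersion (Range)

Range:

$$r = \max\{X\} - \min\{X\}$$

Not robust, sensitive to extreme values

Sample Range:

$$\hat{r} = \max_{i} \{x_i\} - \min_{i} \{x_i\}$$

Measures of Dispersion (Inter-Quartile Range)

Inter-Quartile Range (IQR):

$$IQR = F^{-1}(0.75) - F^{-1}(0.25)$$

More robust

Sample IQR:

$$\widehat{IQR} = \hat{F}^{-1}(0.75) - \hat{F}^{-1}(0.25)$$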

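Measures of Dispersion (Variance and Standard Deviation)

Variance:

$$\sigma^2 = E[(X - \mu)^2]$$

Standard Deviation:

$$\sigma = \sqrt{\sigma^2}$$

Measures of Dispersion (Sample Variance and Standard Deviation)

Sample Variance (with $\hat{\mu}$ the sample mean):

$$\hat{\sigma}^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \hat{\mu})^2$$

Sample Standard Deviation:

$$\hat{\sigma} = \sqrt{\hat{\sigma}^2}$$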
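The measures of central tendency and dispersion can be tried out with Python's standard `statistics` module. The data set is made up and deliberately contains one extreme value, to show why the mean and the range are called "not robust" while the median and the IQR are. Note that `statistics.quantiles` interpolates between data points by default, so its quartiles can differ slightly from those obtained from the inverse empirical CDF.

```python
import statistics

incomes = [21, 23, 25, 27, 30, 32, 35, 38, 41, 400]  # made-up data with one outlier

mean = statistics.mean(incomes)      # pulled upwards by the outlier
median = statistics.median(incomes)  # barely affected by the outlier

value_range = max(incomes) - min(incomes)  # dominated by the outlier

# With n=4, quantiles() returns the three quartile cut points
q1, q2, q3 = statistics.quantiles(incomes, n=4)
iqr = q3 - q1                              # ignores the extremes

variance = statistics.variance(incomes)  # sample variance, divides by n - 1
std_dev = statistics.stdev(incomes)      # square root of the sample variance

print(f"mean={mean}, median={median}")
print(f"range={value_range}, IQR={iqr}")
print(f"variance={variance:.1f}, std dev={std_dev:.1f}")
```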

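Univariate Normal Distribution

$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left\{ -\frac{(x - \mu)^2}{2 \sigma^2} \right\}$$

Multivariate Normal Distribution

$$f(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2 \pi)^{d/2} \, |\boldsymbol{\Sigma}|^{1/2}} \exp\left\{ -\frac{(\mathbf{x} - \boldsymbol{\mu})^{T} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})}{2} \right\}$$

where $d$ is the number of dimensions, $\boldsymbol{\mu}$ the mean vector, and $\boldsymbol{\Sigma}$ the covariance matrix.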

OLAP (online analytical processing) and Data Mining

Warehouse Architecture

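[Figure: warehouse architecture. Clients sit on top of a Query & Analysis layer; below it is the Warehouse with its Metadata; an Integration layer feeds the warehouse from several Sources.]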

Star Schemas

• A star schema is a common organization for data at a warehouse. It consists of:

1. Fact table: a very large accumulation of facts such as sales. Often "insert-only."

2. Dimension tables: smaller, generally static information about the entities involved in the facts.

Terms

• Fact table
• Dimension tables
• Measures

Star

Fact table:
sale (orderId, date, custId, prodId, storeId, qty, amt)

Dimension tables:
customer (custId, name, address, city)
product (prodId, name, price)
store (storeId, city)

customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la

product
prodId  name  price
p1      bolt  10
p2      nut   5

store
storeId  city
c1       nyc
c2       sfo
c3       la

sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50
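A query against a star schema typically joins the fact table to the dimension tables it references. The sketch below loads the example rows into an in-memory SQLite database and runs such a join; the choice of SQLite and the column types are assumptions made for illustration, not something the slides prescribe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE sale     (orderId TEXT, date TEXT, custId INTEGER, prodId TEXT,
                       storeId TEXT, qty INTEGER, amt INTEGER);
""")

# Dimension tables: small, mostly static
conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)", [
    (53, "joe", "10 main", "sfo"),
    (81, "fred", "12 main", "sfo"),
    (111, "sally", "80 willow", "la"),
])
conn.executemany("INSERT INTO product VALUES (?, ?, ?)",
                 [("p1", "bolt", 10), ("p2", "nut", 5)])
conn.executemany("INSERT INTO store VALUES (?, ?)",
                 [("c1", "nyc"), ("c2", "sfo"), ("c3", "la")])

# Fact table: one row per sale, referencing the dimensions by key
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?, ?, ?, ?)", [
    ("o100", "1/7/97", 53, "p1", "c1", 1, 12),
    ("o102", "2/7/97", 53, "p2", "c1", 2, 11),
    ("105", "3/8/97", 111, "p1", "c3", 5, 50),
])

# Join each fact to its dimension rows to get readable names
rows = conn.execute("""
    SELECT c.name, p.name, s.city, f.qty, f.amt
    FROM sale AS f
    JOIN customer AS c ON c.custId = f.custId
    JOIN product  AS p ON p.prodId = f.prodId
    JOIN store    AS s ON s.storeId = f.storeId
    ORDER BY f.amt
""").fetchall()

for row in rows:
    print(row)
```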

Cube

Fact table view:

sale
prodId  storeId  amt
p1      c1       12
p2      c1       11
p1      c3       50
p2      c2       8

Multi-dimensional cube (dimensions = 2):

      c1   c2   c3
p1    12        50
p2    11   8
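The step from the fact table view to the two-dimensional cube is a pivot: each (prodId, storeId) pair becomes a cell, and pairs without a sale stay empty. A minimal sketch using plain Python dictionaries (the variable names are chosen for this example):

```python
# Fact table rows: (prodId, storeId, amt)
fact_table = [
    ("p1", "c1", 12),
    ("p2", "c1", 11),
    ("p1", "c3", 50),
    ("p2", "c2", 8),
]

products = sorted({prod for prod, _, _ in fact_table})
stores = sorted({store for _, store, _ in fact_table})

# Start with an empty cube: one cell per (product, store) combination
cube = {prod: {store: None for store in stores} for prod in products}
for prod, store, amt in fact_table:
    cube[prod][store] = amt

# Print the cube as a grid; empty cells are shown as "-"
print("    " + "".join(f"{store:>5}" for store in stores))
for prod in products:
    cells = "".join(f"{'-' if cube[prod][s] is None else cube[prod][s]:>5}" for s in stores)
    print(f"{prod:<4}{cells}")
```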

3-D Cube

Fact table view:

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

Multi-dimensional cube (dimensions = 3):

day 1
      c1   c2   c3
p1    12        50
p2    11   8

day 2
      c1   c2   c3
p1    44   4
p2

ROLAP vs. MOLAP

• ROLAP: Relational On-Line Analytical Processing

• MOLAP: Multi-Dimensional On-Line Analytical Processing

Aggregates

• Add up amounts for day 1
• In SQL: SELECT sum(amt) FROM SALE WHERE date = 1
• Result: 81


Another Example

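• Add up amounts by day, product
• In SQL: SELECT prodId, date, sum(amt) FROM SALE GROUP BY date, prodId

sale
prodId  date  amt
p1      1     62
p2      1     19
p1      2     48

Going from the detailed fact table to this summary is a rollup; going from the summary back to the detailed rows is a drill-down.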
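Both aggregate queries can be checked against the six-row 3-D fact table. The sketch below uses an in-memory SQLite database (an assumption made for illustration) and adds one further rollup, to totals per day, which is not on the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
])

# Add up amounts for day 1
day1_total = conn.execute("SELECT sum(amt) FROM sale WHERE date = 1").fetchone()[0]

# Add up amounts by day and product (stores are rolled up)
by_day_product = conn.execute(
    "SELECT prodId, date, sum(amt) FROM sale "
    "GROUP BY date, prodId ORDER BY date, prodId"
).fetchall()

# Roll up one level further: totals per day only
by_day = conn.execute(
    "SELECT date, sum(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()

print(day1_total)
print(by_day_product)
print(by_day)
```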

What is Data Mining?

• Discovery of useful, possibly unexpected, patterns in data

• Non-trivial extraction of implicit, previously unknown and potentially useful information from data

• Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

Data Mining Tasks

• Classification [Predictive]

• Clustering [Descriptive]

• Association Rule Discovery [Descriptive]

• Sequential Pattern Discovery [Descriptive]

• Regression [Predictive]

• Deviation Detection [Predictive]

• Collaborative Filtering [Predictive]

Regression

Estimating the relationship between a dependent variable (Y) and one or more independent variables (predictors, X), where the relationship is represented by parameters (B)

Linear: Y = BX + e, where e is the error term

Non-linear: no general form
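For the linear case with a single predictor, the parameters can be estimated by ordinary least squares. The sketch below fits an intercept and a slope, which corresponds to Y = BX + e when X carries a constant column for the intercept. The data points and the function name `fit_line` are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # roughly y = 2x, with noise

intercept, slope = fit_line(xs, ys)
print(f"y = {intercept:.2f} + {slope:.2f} * x")
```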

Classification: Definition

• Given a collection of records (training set)
– Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

Classification: Decision Trees

Example:
• Conducted a survey to see which customers were interested in a new model car
• Want to select customers for an advertising campaign

[Figure: training set]

Classification: KNN

K-nearest neighbours: the key idea is that similar observations belong to similar classes. Thus, one simply has to look for the class designators of a certain number of the nearest neighbors and weigh their class numbers to assign a class number to the unknown.
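A minimal sketch of the nearest-neighbour idea: sort the training records by distance to the unknown record and let the k closest ones vote. This version uses Euclidean distance and an unweighted majority vote, which is only one of several ways to weigh the neighbours. The (age, income) records and the "yes"/"no" labels are made up.

```python
import math
from collections import Counter

def knn_predict(training, query, k=3):
    """training: list of (features, label) pairs; query: a feature tuple."""
    # Order the training records by Euclidean distance to the query
    by_distance = sorted(training, key=lambda rec: math.dist(rec[0], query))
    # Majority vote among the k nearest records
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Made-up records: (age, income in thousands) -> interested in the new car?
training = [
    ((25, 30), "no"), ((30, 35), "no"), ((35, 40), "no"),
    ((45, 80), "yes"), ((50, 90), "yes"), ((55, 85), "yes"),
]

print(knn_predict(training, (48, 82)))
print(knn_predict(training, (28, 33)))
```

In real use the attributes would normally be rescaled first, since the attribute with the largest numeric range otherwise dominates the distance.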

Clustering

[Figure: clusters of records plotted along the dimensions age, income, and education]

K-Means Clustering

http://kodlab.seas.upenn.edu/Omur/WAFR2014
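K-means alternates two steps until nothing changes: assign every point to its nearest centroid, then move each centroid to the mean of the points assigned to it. The sketch below assumes the standard (Lloyd's) iteration, starts from hand-picked centroids, and uses made-up 2-D points; real implementations usually choose the starting centroids randomly and run several restarts.

```python
import math

def k_means(points, centroids, max_iter=100):
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]  # keep the centroid of an empty cluster
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]  # made-up data
centroids, clusters = k_means(points, centroids=[(0, 0), (10, 10)])

print(centroids)
print(clusters)
```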

Association Rule Mining

Sales records (market-basket data), with columns: transaction id, customer id, products bought

• Trend: Products p5, p8 often bought together
• Trend: Customer 12 likes product p9

Association Rule Discovery

• Marketing and Sales Promotion:
– Let the rule discovered be {Bagels, … } --> {Potato Chips}
– Potato Chips as consequent => can be used to determine what should be done to boost its sales.
– Bagels in the antecedent => can be used to see which products would be affected if the store discontinues selling bagels.
– Bagels in antecedent and Potato Chips in consequent => can be used to see what products should be sold with Bagels to promote sale of Potato Chips!

• Supermarket shelf management.
• Inventory Management
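How strongly a rule such as {Bagels} --> {Potato Chips} holds is commonly measured by its support (the share of transactions containing all items of the rule) and its confidence (the share of transactions with the antecedent that also contain the consequent). These two measures are not named on the slide; they are the usual ones in association rule mining. The transactions below are made up.

```python
transactions = [  # made-up market-basket data
    {"bagels", "potato chips", "milk"},
    {"bagels", "potato chips"},
    {"bagels", "milk"},
    {"potato chips", "soda"},
    {"bagels", "potato chips", "soda"},
]

def count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset):
    return count(itemset) / len(transactions)

def confidence(antecedent, consequent):
    return count(antecedent | consequent) / count(antecedent)

rule_support = support({"bagels", "potato chips"})
rule_confidence = confidence({"bagels"}, {"potato chips"})

print(f"support = {rule_support}, confidence = {rule_confidence}")
```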

Collaborative Filtering

• Goal: predict what movies/books/… a person may be interested in, on the basis of
– Past preferences of the person
– Other people with similar past preferences
– The preferences of such people for a new movie/book/…

• One approach based on repeated clustering
– Cluster people on the basis of preferences for movies
– Then cluster movies on the basis of being liked by the same clusters of people
– Again cluster people based on their preferences for (the newly created clusters of) movies
– Repeat above till equilibrium

• Above problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find information of interest

Other Types of Mining

• Text mining: application of data mining to textual documents
– Cluster Web pages to find related pages
– Cluster pages a user has visited to organize their visit history
– Classify Web pages automatically into a Web directory
– Mine consumer or public opinion in Twitter messages

• Graph mining:
– Deal with graph data
– Social Network Analysis

Data Streams

• What are Data Streams?
– Continuous streams
– Huge, Fast, and Changing

• Why Data Streams?
– The arrival speed of streams and the huge amount of data are beyond our capability to store them.
– "Real-time" processing

• Window Models
– Landmark window (Entire Data Stream)
– Sliding Window
– Damped Window

• Mining Data Streams
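The window models differ in how much of the past they keep. As a rough sketch, a sliding window remembers only the last few items, while a damped window keeps a running summary in which older items fade away. Both classes below maintain a mean over a made-up stream; the class names, the window size of 3, and the decay factor `alpha` are assumptions for this example.

```python
from collections import deque

class SlidingWindowMean:
    """Mean over the most recent `size` items only."""
    def __init__(self, size):
        self.window = deque(maxlen=size)  # old items drop out automatically

    def update(self, value):
        self.window.append(value)
        return sum(self.window) / len(self.window)

class DampedMean:
    """Exponentially weighted mean: each older item's weight shrinks by (1 - alpha) per step."""
    def __init__(self, alpha):
        self.alpha = alpha
        self.value = None

    def update(self, value):
        if self.value is None:
            self.value = value
        else:
            self.value = self.alpha * value + (1 - self.alpha) * self.value
        return self.value

stream = [10, 10, 10, 50, 50, 50]  # made-up stream with a sudden level shift

sliding = SlidingWindowMean(size=3)
damped = DampedMean(alpha=0.5)
sliding_means = [sliding.update(v) for v in stream]
damped_means = [damped.update(v) for v in stream]

print(sliding_means)
print(damped_means)
```

Neither summary needs to store the whole stream, which is the point when the data arrives faster than it can be kept.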

Model quality and comparison

● A statistical model has limited explanatory and/or predictive power
● One needs to use measures to compare alternative models and choose the best one

Example measures:
● Regression analysis: R-squared measure
● Classification: k-value of parameters
● Classification: Precision-recall
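Two of these measures are easy to compute by hand. The sketch below uses the common definitions: R-squared as one minus the ratio of residual to total sum of squares, precision as the share of predicted positives that are correct, and recall as the share of actual positives that were found. All data values are made up.

```python
def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def precision_recall(actual, predicted, positive):
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    return tp / (tp + fp), tp / (tp + fn)

# Regression: made-up observed values and model predictions
r2 = r_squared([2.0, 4.0, 6.0, 8.0], [2.5, 3.5, 6.5, 7.5])

# Classification: made-up true classes and predicted classes
actual = ["yes", "yes", "no", "yes", "no", "no"]
predicted = ["yes", "no", "no", "yes", "yes", "no"]
precision, recall = precision_recall(actual, predicted, positive="yes")

print(f"R-squared = {r2:.2f}")
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```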

Know the alternative models

"How [the] translation from subject-matter problem to statistical model is done is often the most critical part of an analysis" (Sir David Cox, British statistician)

[Figure: models arranged along two axes, supervised vs. unsupervised and parametric (results are easy to interpret) vs. nonparametric]

● Regression
● Decision tree
● K-nearest neighbours
● Neural networks
● Hierarchical clustering
● Association rules
● Text mining / Topic analysis
● Recommender systems (collaborative filtering)

Hands on statistics with R

Resources:
● https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf (see Appendix A, "A sample session")
● http://www.rdatamining.com/
