Business Analytics and Big Data: the process and the tools Mehmet Gençer Assoc.Prof., Organization Studies & Computer Engineering [email protected] [email protected] https://mgencer.com

How big?

1st Characteristic of Big Data: Volume

• A typical PC might have had 10 gigabytes of storage in 2000.

• Today, Facebook ingests 500 terabytes of new data every day.

• A Boeing 737 will generate 240 terabytes of flight data during a single flight across the US.

• Smartphones and the data they create and consume, together with sensors embedded into everyday objects, will soon result in billions of new, constantly updated data feeds containing environmental, location, and other information, including video.

2nd Characteristic of Big Data: Velocity

• Clickstreams and ad impressions capture user behavior at millions of events per second.

• High-frequency stock trading algorithms reflect market changes within microseconds.

• Machine-to-machine processes exchange data between billions of devices.

• Infrastructure and sensors generate massive log data in real time.

• Online gaming systems support millions of concurrent users, each producing multiple inputs per second.

3rd Characteristic of Big Data: Variety

• Big Data isn't just numbers, dates, and strings. Big Data is also geospatial data, 3D data, audio and video, and unstructured text, including log files and social media.

• Traditional database systems were designed to address smaller volumes of structured data, fewer updates or a predictable, consistent data structure.

• Big Data analysis includes different types of data

The Structure of Big Data

Structured
• Most traditional data sources

Semi-structured
• Many sources of big data

Unstructured
• Video data, audio data

Applications of Big Data Analytics

• Homeland Security
• Smarter Healthcare
• Multi-channel sales
• Telecom
• Manufacturing
• Traffic Control
• Trading Analytics
• Search Quality

Types of tools used in Big Data

• Where is processing hosted?
  – Distributed servers / cloud (e.g. Amazon EC2)

• Where is data stored?
  – Distributed storage (e.g. Amazon S3)

• What is the programming model?
  – Distributed processing (e.g. MapReduce)

• How is data stored and indexed?
  – High-performance schema-free databases (e.g. MongoDB)

• What operations are performed on data?
  – Analytic / semantic processing

• Where is the processing performed?
  – Web service (e.g. sense.io), desktop (R, SPSS, WEKA)

Leading Technology Vendors

Example Vendors

• IBM – Netezza
• EMC – Greenplum
• Oracle – Exadata

Commonality

• MPP architectures
• Commodity hardware
• RDBMS based
• Full SQL compliance

Analysis tools (open source unless noted):
● R
● SPSS (commercial)
● WEKA
● Sense.io (hosted web service)
● Sage

Statistics 101

Random Sample and Statistics

• Population: refers to the set or universe of all entities under study.

• However, looking at the entire population may not be feasible, or may be too expensive.

• Instead, we draw a random sample from the population and compute appropriate statistics from the sample that give estimates of the corresponding population parameters of interest.
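
The idea of estimating a population parameter from a random sample can be sketched in a few lines of Python. The population here is synthetic (uniformly random ages), and the sample size of 500 is an arbitrary choice made only for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: ages of 100,000 people (synthetic data).
population = [random.randint(18, 90) for _ in range(100_000)]

# Draw a simple random sample without replacement.
sample = random.sample(population, k=500)

# The population mean is the parameter; in practice it is usually unknown.
population_mean = statistics.fmean(population)
# The sample mean is the statistic used as a point estimate of it.
sample_mean = statistics.fmean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

The two printed values should be close but not identical; the gap is the sampling error of the estimate.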

Statistic

• Let $S_i$ denote the random variable (e.g. age) corresponding to data point $x_i$ (e.g. a person in the sample). A statistic $\theta$ is then a function $\theta : (S_1, S_2, \dots, S_n) \to \mathbb{R}$.

• If we use the value of a statistic to estimate a population parameter, this value is called a point estimate of the parameter, and the statistic is called an estimator of the parameter.

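Empirical Cumulative Distribution Function

$\hat{F}(x) = \frac{1}{n} \sum_{i=1}^{n} I(x_i \le x)$

where $I(x_i \le x)$ is 1 if $x_i \le x$ and 0 otherwise.

Inverse Cumulative Distribution Function

$F^{-1}(q) = \min \{ x \mid F(x) \ge q \}$ for $q \in [0, 1]$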
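
A minimal sketch of both functions, assuming the standard definitions: the empirical CDF at x is the fraction of sample points not exceeding x, and the inverse returns the smallest sample value whose empirical CDF reaches q. The function names and the `ages` sample are hypothetical.

```python
def ecdf(sample, x):
    """Fraction of sample points that are <= x."""
    return sum(1 for xi in sample if xi <= x) / len(sample)


def inverse_ecdf(sample, q):
    """Smallest sample value whose empirical CDF is at least q."""
    for x in sorted(sample):
        if ecdf(sample, x) >= q:
            return x
    raise ValueError("q must lie in [0, 1]")


ages = [23, 31, 35, 35, 42, 58]  # hypothetical sample

print(ecdf(ages, 35))           # 4 of the 6 points are <= 35
print(inverse_ecdf(ages, 0.5))  # 35, the sample median under this definition
```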

Example

Measures of Central Tendency (Mean)

Population Mean: $\mu = E[X]$

Sample Mean (unbiased, not robust): $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i$

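Measures of Central Tendency (Median)

Population Median: the value $m$ such that $P(X \le m) \ge \frac{1}{2}$ and $P(X \ge m) \ge \frac{1}{2}$

or

$m = F^{-1}(0.5)$

Sample Median: $\hat{m} = \hat{F}^{-1}(0.5)$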
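
The difference in robustness between the two measures can be seen on a small, made-up income sample: one extreme value moves the mean a long way but barely shifts the median. The numbers are purely illustrative.

```python
import statistics

incomes = [30, 32, 35, 38, 40]   # hypothetical incomes (thousands)
with_outlier = incomes + [1000]  # the same sample plus one extreme value

mean_before = statistics.mean(incomes)
median_before = statistics.median(incomes)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(mean_before, median_before)  # both are 35
print(mean_after, median_after)    # the mean jumps, the median moves to 36.5
```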

Example

Measures of Dispersion (Range)

Range: $r = \max\{X\} - \min\{X\}$

Not robust, sensitive to extreme values

Sample Range: $\hat{r} = \max_i x_i - \min_i x_i$

Measures of Dispersion (Inter-Quartile Range)

Inter-Quartile Range (IQR): $IQR = F^{-1}(0.75) - F^{-1}(0.25)$

More robust

Sample IQR: $\widehat{IQR} = \hat{F}^{-1}(0.75) - \hat{F}^{-1}(0.25)$

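Measures of Dispersion (Variance and Standard Deviation)

Variance: $\sigma^2 = E[(X - \mu)^2]$

Standard Deviation: $\sigma = \sqrt{\sigma^2}$

Sample Variance (unbiased form): $\hat{\sigma}^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \hat{\mu})^2$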
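
The dispersion measures from the last few slides can be computed with Python's standard `statistics` module, as sketched below on a made-up sample. Two assumptions to note: `statistics.quantiles` uses an interpolating quartile definition by default, which may differ slightly from an inverse-CDF definition, and `statistics.variance` divides by n − 1.

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 30]  # hypothetical sample with one extreme value

# Range: driven entirely by the two most extreme points.
sample_range = max(data) - min(data)

# Inter-quartile range: spread of the middle half of the data.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Sample variance (n - 1 in the denominator) and standard deviation.
variance = statistics.variance(data)
std_dev = statistics.stdev(data)

print(sample_range, iqr)
print(round(variance, 2), round(std_dev, 2))
```

The extreme value 30 inflates the range and the variance, while the IQR is unaffected by it.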

Univariate Normal Distribution
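
The slide's formula did not survive extraction. For reference, the standard density of a normal random variable with mean $\mu$ and variance $\sigma^2$ is given below; the notation is assumed rather than taken from the original slide.

```latex
f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
```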

Multivariate Normal Distribution
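
The slide's formula did not survive extraction. For reference, the standard density of a $d$-dimensional normal random vector with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$ is given below; the notation is assumed rather than taken from the original slide.

```latex
f(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) =
  \frac{1}{(2\pi)^{d/2} \, |\boldsymbol{\Sigma}|^{1/2}}
  \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{T}
  \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)
```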

OLAP (online analytical processing) and Data Mining

Warehouse Architecture

[Figure: layered diagram with the components Client, Query & Analysis, Warehouse, Metadata, Integration, and Source (three sources)]

Star Schemas

• A star schema is a common organization for data at a warehouse. It consists of:

1. Fact table: a very large accumulation of facts such as sales. Often "insert-only."

2. Dimension tables: smaller, generally static information about the entities involved in the facts.

Terms

• Fact table
• Dimension tables
• Measures

sale(orderId, date, custId, prodId, storeId, qty, amt)
customer(custId, name, address, city)
product(prodId, name, price)
store(storeId, city)

Star

customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la

product
prodId  name  price
p1      bolt  10
p2      nut   5

store
storeId  city
c1       nyc
c2       sfo
c3       la

sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50
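
As a rough sketch, the star schema above can be loaded into an in-memory SQLite database and queried by joining the fact table to its dimension tables. The column types and the choice of SQLite are assumptions made for illustration; the rows are the ones shown on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
-- Fact table: one row per sale, with foreign keys into each dimension.
CREATE TABLE sale (
    orderId TEXT, date TEXT,
    custId  INTEGER REFERENCES customer(custId),
    prodId  TEXT REFERENCES product(prodId),
    storeId TEXT REFERENCES store(storeId),
    qty INTEGER, amt INTEGER
);
""")
conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)", [
    (53, "joe", "10 main", "sfo"),
    (81, "fred", "12 main", "sfo"),
    (111, "sally", "80 willow", "la"),
])
conn.executemany("INSERT INTO product VALUES (?, ?, ?)",
                 [("p1", "bolt", 10), ("p2", "nut", 5)])
conn.executemany("INSERT INTO store VALUES (?, ?)",
                 [("c1", "nyc"), ("c2", "sfo"), ("c3", "la")])
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?, ?, ?, ?)", [
    ("o100", "1/7/97", 53, "p1", "c1", 1, 12),
    ("o102", "2/7/97", 53, "p2", "c1", 2, 11),
    ("105", "3/8/97", 111, "p1", "c3", 5, 50),
])

# Resolve each fact row's keys to readable dimension attributes.
rows = conn.execute("""
    SELECT c.name, p.name, s.city, f.amt
    FROM sale AS f
    JOIN customer AS c ON c.custId = f.custId
    JOIN product  AS p ON p.prodId = f.prodId
    JOIN store    AS s ON s.storeId = f.storeId
    ORDER BY f.amt
""").fetchall()

for row in rows:
    print(row)
```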

Cube

Fact table view:

sale
prodId  storeId  amt
p1      c1       12
p2      c1       11
p1      c3       50
p2      c2       8

Multi-dimensional cube (dimensions = 2):

      c1   c2   c3
p1    12   -    50
p2    11   8    -
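
One way to picture the step from fact table to cube is to index each amount by its dimension values. The sketch below uses a plain Python dictionary keyed by (prodId, storeId); this is only an illustration of the idea, not how an OLAP engine stores a cube.

```python
# Fact table rows: (prodId, storeId, amt), as on the slide.
fact_table = [
    ("p1", "c1", 12),
    ("p2", "c1", 11),
    ("p1", "c3", 50),
    ("p2", "c2", 8),
]

# The dimension values become the axes of the cube.
products = sorted({prod for prod, _, _ in fact_table})
stores = sorted({store for _, store, _ in fact_table})

# Each cell of the 2-D cube is addressed by one value per dimension.
cube = {(prod, store): amt for prod, store, amt in fact_table}

# Print the cube as a grid; None marks a cell with no sales.
print("   ", stores)
for prod in products:
    print(prod, [cube.get((prod, store)) for store in stores])
```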

3-D Cube

Fact table view:

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

Multi-dimensional cube (dimensions = 3):

day 1
      c1   c2   c3
p1    12   -    50
p2    11   8    -

day 2
      c1   c2   c3
p1    44   4    -
p2    -    -    -

ROLAP vs. MOLAP

• ROLAP: Relational On-Line Analytical Processing

• MOLAP: Multi-Dimensional On-Line Analytical Processing

Aggregates

• Add up amounts for day 1
• In SQL: SELECT sum(amt) FROM SALE WHERE date = 1

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

Result: 81

Another Example

• Add up amounts by day, product
• In SQL: SELECT prodId, date, sum(amt) FROM SALE GROUP BY date, prodId

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

Result (a rollup of the fact table; going from this result back to the detailed rows is a drill-down):

sale
prodId  date  amt
p1      1     62
p2      1     19
p1      2     48
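
The aggregate queries on these slides can be checked against the six-row fact table with an in-memory SQLite database. This is a sketch: the table layout follows the slide, while the column types, the explicit ORDER BY clauses, and the final per-day query are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)"
)
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
])

# Total amount for day 1.
day1_total = conn.execute(
    "SELECT sum(amt) FROM sale WHERE date = 1"
).fetchone()[0]

# Amounts by day and product (the store dimension is rolled up).
by_day_product = conn.execute(
    "SELECT prodId, date, sum(amt) FROM sale "
    "GROUP BY date, prodId ORDER BY date, prodId"
).fetchall()

# Rolling up one more level: amounts by day only.
by_day = conn.execute(
    "SELECT date, sum(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()

print(day1_total)      # 81
print(by_day_product)  # [('p1', 1, 62), ('p2', 1, 19), ('p1', 2, 48)]
print(by_day)          # [(1, 81), (2, 48)]
```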

What is Data Mining?

• Discovery of useful, possibly unexpected, patterns in data

• Non-trivial extraction of implicit, previously unknown and potentially useful information from data

• Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

Data Mining Tasks

• Classification [Predictive]

• Clustering [Descriptive]

• Association Rule Discovery [Descriptive]

• Sequential Pattern Discovery [Descriptive]

• Regression [Predictive]

• Deviation Detection [Predictive]

• Collaborative Filtering [Predictive]

Regression

Estimating the relationship between a dependent variable (Y) and one or more independent variables (predictors, X), represented as parameters (B)

Linear: Y=BX+e

Non-linear: no general form
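
For the linear case with a single predictor, the parameters can be estimated by ordinary least squares. The sketch below writes the intercept out explicitly (Y = b0 + b1·X + e), which corresponds to the slide's Y = BX + e when X includes a constant column. The function name and data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


xs = [1, 2, 3, 4, 5]                # hypothetical predictor values
ys = [2.1, 3.9, 6.2, 7.8, 10.1]     # hypothetical responses, roughly y = 2x

intercept, slope = fit_line(xs, ys)
print(f"Y = {intercept:.2f} + {slope:.2f} * X")
```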

Classification: Definition

• Given a collection of records (training set)
  – Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
  – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
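
The train/test workflow described above can be sketched with a deliberately simple model: a single age threshold learned from the training set and then scored on the held-out test set. The records, the 75/25 split, and the threshold model are all assumptions chosen to keep the example short.

```python
import random

random.seed(0)  # fixed seed so the split is reproducible

# Hypothetical labelled records: (age, class), where class is 1 for age >= 40.
records = [(age, int(age >= 40)) for age in range(20, 60)]
random.shuffle(records)

# Divide the data: 75% to build the model, 25% to validate it.
split = int(0.75 * len(records))
train, test = records[:split], records[split:]


def predict(age, threshold):
    """Model: assign class 1 when age reaches the threshold."""
    return int(age >= threshold)


def accuracy(threshold, data):
    """Fraction of records whose predicted class matches the true class."""
    hits = sum(predict(age, threshold) == label for age, label in data)
    return hits / len(data)


# "Training": choose the threshold that classifies the training set best.
candidates = sorted({age for age, _ in train})
threshold = max(candidates, key=lambda t: accuracy(t, train))

print("learned threshold:", threshold)
print("training accuracy:", accuracy(threshold, train))
print("test accuracy:    ", accuracy(threshold, test))
```

The test accuracy can be lower than the training accuracy, because the learned threshold depends on which ages happened to land in the training set.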

Classification: Decision Trees

Example:
• Conducted a survey to see which customers were interested in a new model car
• Want to select customers for an advertising campaign

[Table: training set]

Classification: KNN

K-nearest neighbours: the key idea is that similar observations belong to similar classes. Thus, one simply has to look up the class labels of a certain number of the nearest neighbours and weigh those labels to assign a class to the unknown observation.
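
A minimal sketch of the idea, assuming Euclidean distance and an unweighted majority vote among the k nearest neighbours (distance-weighted voting is a common variant). The training points and labels are made up.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3):
    """Return the majority class among the k training points nearest to query."""
    neighbours = sorted(training, key=lambda rec: math.dist(rec[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical training set: ((feature1, feature2), class label).
training = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 6.5), "B"), ((6.5, 7.0), "B"),
]

print(knn_predict(training, (1.2, 1.4)))  # falls among the "A" points
print(knn_predict(training, (6.4, 6.2)))  # falls among the "B" points
```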

Clustering

[Figure: data points plotted along three axes: age, income, education]

K-Means Clustering

http://kodlab.seas.upenn.edu/Omur/WAFR2014
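
The slide itself only links to an illustration. As a rough sketch, the usual k-means iteration (often called Lloyd's algorithm) alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. The starting centroids, the fixed iteration count, and the data are assumptions made for the example.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means: alternate assignment and centroid update steps."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]  # hypothetical data
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])

print(centroids)
print(clusters)
```

With real data the result depends on the starting centroids, so implementations typically run several random initialisations and keep the best one.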

Association Rule Mining

Sales records (market-basket data), with columns: transaction id, customer id, products bought

• Trend: Products p5, p8 often bought together
• Trend: Customer 12 likes product p9

Association Rule Discovery

• Marketing and Sales Promotion:
  – Let the rule discovered be {Bagels, … } --> {Potato Chips}
  – Potato Chips as consequent => can be used to determine what should be done to boost its sales.
  – Bagels in the antecedent => can be used to see which products would be affected if the store discontinues selling bagels.
  – Bagels in antecedent and Potato Chips in consequent => can be used to see what products should be sold with Bagels to promote sale of Potato Chips!
• Supermarket shelf management.
• Inventory management.
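
Rules such as {Bagels} --> {Potato Chips} are usually scored by support (how often the items occur together) and confidence (how often the consequent appears when the antecedent does). These two measures are not named on the slide; the sketch below assumes their standard definitions and uses made-up transactions.

```python
# Hypothetical market-basket data: one set of items per transaction.
transactions = [
    {"bagels", "potato chips", "milk"},
    {"bagels", "potato chips"},
    {"bagels", "milk"},
    {"potato chips", "soda"},
    {"bagels", "potato chips", "soda"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    return support(antecedent | consequent) / support(antecedent)


rule_support = support({"bagels", "potato chips"})
rule_confidence = confidence({"bagels"}, {"potato chips"})

print(f"support:    {rule_support:.2f}")     # 3 of 5 transactions
print(f"confidence: {rule_confidence:.2f}")  # 3 of the 4 bagel transactions
```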

Collaborative Filtering

• Goal: predict what movies/books/… a person may be interested in, on the basis of
  – Past preferences of the person
  – Other people with similar past preferences
  – The preferences of such people for a new movie/book/…
• One approach based on repeated clustering
  – Cluster people on the basis of preferences for movies
  – Then cluster movies on the basis of being liked by the same clusters of people
  – Again cluster people based on their preferences for (the newly created clusters of) movies
  – Repeat above till equilibrium
• Above problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find information of interest
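
The sketch below illustrates the goal stated above, not the repeated-clustering approach from the slide: it swaps in a simpler user-based neighbour method, predicting a missing rating as a similarity-weighted average of other users' ratings. The users, movies, ratings, and the choice of cosine similarity are all assumptions.

```python
import math

# Hypothetical ratings: user -> {movie: rating on a 1-5 scale}.
ratings = {
    "ann":  {"m1": 5, "m2": 4, "m3": 1},
    "bob":  {"m1": 4, "m2": 5, "m3": 2, "m4": 5},
    "cara": {"m1": 1, "m2": 2, "m3": 5, "m4": 1},
}


def cosine(u, v):
    """Cosine similarity over the movies both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[m] * v[m] for m in shared)
    norm_u = math.sqrt(sum(u[m] ** 2 for m in shared))
    norm_v = math.sqrt(sum(v[m] ** 2 for m in shared))
    return dot / (norm_u * norm_v)


def predict(user, movie):
    """Similarity-weighted average of other users' ratings for movie."""
    weighted_sum = total_weight = 0.0
    for other, their_ratings in ratings.items():
        if other == user or movie not in their_ratings:
            continue
        weight = cosine(ratings[user], their_ratings)
        weighted_sum += weight * their_ratings[movie]
        total_weight += weight
    return weighted_sum / total_weight if total_weight else None


# Ann has not rated m4; her tastes are closer to Bob's than to Cara's.
print(round(predict("ann", "m4"), 2))
```

The prediction lands nearer Bob's rating of 5 than Cara's rating of 1, because Ann's past ratings resemble Bob's more closely.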

Other Types of Mining

• Text mining: application of data mining to textual documents
  – Cluster Web pages to find related pages
  – Cluster pages a user has visited to organize their visit history
  – Classify Web pages automatically into a Web directory
  – Mine consumer or public opinion in Twitter messages
• Graph mining:
  – Deal with graph data
  – Social Network Analysis

Data Streams

• What are Data Streams?
  – Continuous streams
  – Huge, fast, and changing
• Why Data Streams?
  – The arrival speed of streams and the huge amount of data are beyond our capability to store them.
  – "Real-time" processing
• Window Models
  – Landmark window (entire data stream)
  – Sliding window
  – Damped window
• Mining Data Streams
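
The three window models listed above (entire stream, sliding, damped) can be compared by computing a running mean under each. This is a sketch: the function names, the window size, the decay factor, and the toy stream are assumptions, and each function keeps only a small amount of state rather than storing the stream.

```python
from collections import deque


def entire_stream_means(stream):
    """Window over the entire stream: every item seen so far counts equally."""
    total = 0
    for n, value in enumerate(stream, start=1):
        total += value
        yield total / n


def sliding_window_means(stream, size):
    """Sliding window: only the most recent `size` items count."""
    window = deque(maxlen=size)  # old items fall out automatically
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)


def damped_means(stream, decay=0.5):
    """Damped window: older items count less and less (exponential decay)."""
    estimate = None
    for value in stream:
        if estimate is None:
            estimate = value
        else:
            estimate = decay * estimate + (1 - decay) * value
        yield estimate


stream = [10, 10, 10, 40, 40, 40]  # hypothetical stream with a level shift

print(list(entire_stream_means(stream)))      # adapts slowly to the shift
print(list(sliding_window_means(stream, 3)))  # fully adapted after 3 items
print(list(damped_means(stream)))             # adapts gradually
```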

Model quality and comparison

● A statistical model has limited explanatory and/or predictive power
● One needs to use measures to compare alternative models and choose the best one

Example measures:
● Regression analysis: R-squared measure
● Classification: k-value of parameters
● Classification: Precision-recall
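
Two of the measures above can be computed directly from actual and predicted values, as sketched below using their standard definitions. The function names and the data are made up for illustration.

```python
def precision_recall(actual, predicted, positive=1):
    """Precision and recall for the class labelled `positive`."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by the predictions."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical classifier output: 3 true positives, 1 false positive, 1 miss.
precision, recall = precision_recall([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 1])
print(precision, recall)

# Hypothetical regression output: predictions close to the actual values.
r2 = r_squared([3, 5, 7, 9], [2.8, 5.1, 7.2, 8.9])
print(round(r2, 3))
```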

Know the alternative models

"How [the] translation from subject-matter problem to statistical model is done is often the most critical part of an analysis" (Sir David Cox, British statistician)

Models are arranged along two axes: supervised vs. unsupervised, and parametric (results are easy to interpret) vs. nonparametric.

• Regression, Decision tree
• K-nearest neighbours, Neural networks
• Hierarchical clustering, Association rules, Text mining/Topic analysis
• Recommender systems (Collaborative filter)

Hands on statistics with R

Resources:
● https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf (see Appendix A, "A sample session")
● http://www.rdatamining.com/