
Big Data Analysis Concepts and References



Big Data Analysis Concepts and References, presented at the ISRM & IT GRC Conference by Vikram Andem, Senior Manager, Information Technology, United Airlines.


Page 1: Big Data Analysis Concepts and References


Big Data Analysis Concepts and References Use Cases in Airline Industry

Page 2: Big Data Analysis Concepts and References


The objective of this presentation is to provide awareness and to familiarize a general business or management user with the terms and terminology of Big Data analysis, along with references to use cases that can be (or currently are) applied in the airline industry.

The presentation is intended to help a business or management user with the thinking process of formulating an analytical question for Big Data analysis, given a business situation or problem.

The presentation may also provide insight into the basic terms and concepts you need to know, what to ask, how to evaluate and/or help solve a business problem for a potential Big Data analysis use case, and what to expect from the work of a competent data scientist on such a use case.

NOTE: Just reviewing this presentation will most likely NOT make you competent enough to instantly perform Big Data analysis. Big Data analysis is a new (very recent) aspect of data science and requires some college or university level course work in fields such as (but not limited to) mathematics, statistics, computer science, management science, econometrics, and engineering.

The presentation is divided into three parts, following a separate presentation on Big Data Security & Governance, Risk Management & Compliance.

Part 1. Big Data : Introduction ( page # 3)

Part 2. Very quick introduction to understanding Data and analysis of Data ( page # 8)

(Beginner: if you are new to understanding data and the use of data, start here.)

Part 3. Big Data Analysis : Concepts and References to Use Cases in Airline Industry ( page # 17)

(Advanced: if you understand data and how to use data, you may jump to this part).

Page 3: Big Data Analysis Concepts and References


Big Data: Introduction

You may skip this section if you are familiar with Big Data and jump directly to Part 2 (page # 8).

Part 1

Page 4: Big Data Analysis Concepts and References


Introduction

Projected growth and use of unstructured vs. structured data, 2012-2020¹ (chart not reproduced here).

¹ 2013 IEEE Big Data conference (projected growth of data combined for all Fortune 500 companies only)

Limitations of existing Data Analytics Architecture

In the typical architecture, data flows from instrumentation and collection into a storage-only grid holding the original raw data (mostly append), through an ETL compute grid into an RDBMS of aggregated data, which feeds BI reports and interactive apps. Its limits:
Limit #1: Moving data to compute doesn't scale.
Limit #2: Can't explore high-fidelity raw data.
Limit #3: Archiving = premature data death.

* Zettabytes of data. 1 Zettabyte = 1,000 Exabytes = 1 million Petabytes = 1 billion Terabytes.

Big Data, a general term, refers to the large, voluminous amounts (at least terabytes) of poly-structured data that is gleaned from traditional and non-traditional sources and continuously flows through and around organizations, including but not limited to e-mail, text, event logs, audio, video, blogs, social media and transactional records.

What does this information hold? What is the challenge of extracting it?

It holds the promise of giving enterprises like United a deeper insight into their customers, partners, and business. This data can provide answers to questions they may not have even thought to ask. Companies like United can benefit from a multidimensional view of their business when they add insight from big data to the traditional types of information they collect and analyze.

(Chart: demand / number of results vs. popularity rank, from more generic to more specific. Traditional EDW and classical statistics cover the short head of specific spikes in transactional data (e.g., reservations); Big Data covers the long tail of non-transactional and raw data (e.g., searches, event logs).)

The challenge of extracting value from big data is similar in many ways to the age-old problem of distilling business intelligence from transactional data. At the heart of this challenge is the process used to extract data from multiple sources, transform it to fit your analytical needs, and load it into a data warehouse for subsequent analysis, a process known as “Extract, Transform & Load” (ETL). The nature of big data requires that the infrastructure for this process can scale cost-effectively.

While the storage capacities of hard drives have increased massively over the years, access speeds — the rate at which data can be read from drives — have not kept up. One typical drive from 1990 could store 1,370 MB of data and had a transfer speed of 4.4 MB/s, so you could read all the data from a full drive in around five minutes. Over 20 years later, one-terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk. This is a long time to read all data on a single drive — and writing is even slower.
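As a rough sketch of the arithmetic above (using only the capacity and transfer-rate figures quoted in the text):

# Rough sketch: time to read a full drive at its sustained transfer rate.
def read_time_seconds(capacity_mb, transfer_mb_per_s):
    return capacity_mb / transfer_mb_per_s

drive_1990 = read_time_seconds(1370, 4.4)            # ~311 s, about 5 minutes
drive_today = read_time_seconds(1_000_000, 100.0)    # 1 TB: ~10,000 s, about 2.8 hours

print(f"1990 drive: {drive_1990 / 60:.1f} minutes")
print(f"1 TB drive: {drive_today / 3600:.1f} hours")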

Data Storage and Analysis

Page 5: Big Data Analysis Concepts and References


Hadoop

Apache Hadoop is a scalable, fault-tolerant distributed system for storage and large scale processing of data-sets on clusters of commodity hardware. Hadoop can be used to store exabytes of unstructured and semi-structured data reliably on tens of thousands of general purpose servers, while scaling performance cost-effectively by merely adding inexpensive nodes to the cluster. Using Hadoop in this way, organizations like United gain an additional ability to store and access data that they "might" need, data that may never be loaded into the data warehouse.

The Key Benefit: Agility/Flexibility

Schema-on-Write (RDBMS): Schema must be created before any data can be loaded. An explicit load operation has to take place which transforms data to the DB-internal structure. New columns must be added explicitly before new data for such columns can be loaded into the database.

Schema-on-Read (Hadoop): Data is simply copied to the file store; no transformation is needed. A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns (late binding). New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it.
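A minimal sketch of the schema-on-read idea, with a made-up pipe-delimited log format and a toy parser standing in for a real SerDe (this is not Hive's actual SerDe API): raw lines are stored untouched, and the "schema" is applied only when the data is read, so updating the parser retroactively exposes a new column.

# Schema-on-read sketch: store raw lines as-is, apply a parser (a stand-in
# for a SerDe) only at read time.
raw_lines = [
    "2014-03-01|ORD|IAH|42",
    "2014-03-02|ORD|SFO|17",
]

def serde_v1(line):
    date, origin, dest, delay = line.split("|")
    return {"date": date, "origin": origin, "dest": dest}

def serde_v2(line):   # updated parser: the delay column appears retroactively
    date, origin, dest, delay = line.split("|")
    return {"date": date, "origin": origin, "dest": dest, "delay": int(delay)}

print([serde_v1(l) for l in raw_lines])   # old view of the same raw data
print([serde_v2(l) for l in raw_lines])   # new column, no reload required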

Pros: Schema-on-Write (RDBMS) - read is fast; standards / governance. Schema-on-Read (Hadoop) - load is fast; flexibility / agility.
Use when: RDBMS - interactive OLAP analytics (<1 sec), multistep ACID transactions, 100% SQL compliance. Hadoop - structured or not (flexibility), scalability of storage/compute, complex data processing.

Hadoop Architecture

Central to the scalability of Hadoop is the distributed processing framework known as MapReduce, which splits the input data-set into multiple chunks, each of which is assigned a map task that can process the data in parallel. Each map task reads the input as a set of (key, value) pairs and produces a transformed set of (key, value) pairs as the output. The framework shuffles and sorts outputs of the map tasks, sending the intermediate (key, value) pairs to the reduce tasks, which group them into final results. MapReduce uses the JobTracker and TaskTracker mechanisms to schedule tasks, monitor them, and restart any that fail.
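To make the map, shuffle-and-sort, and reduce phases concrete, here is a single-process word-count sketch that mimics the data flow described above; it illustrates the programming model only and is not Hadoop's actual Java API:

from collections import defaultdict

def map_phase(record):
    # Emit (key, value) pairs: one (word, 1) per word in the input record.
    for word in record.split():
        yield (word, 1)

def shuffle(mapped_pairs):
    # Group intermediate values by key, as the framework's shuffle/sort does.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Combine all values for a key into the final result.
    return (key, sum(values))

records = ["ORD IAH ORD", "SFO ORD IAH"]
mapped = [pair for record in records for pair in map_phase(record)]
result = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(result)   # {'ORD': 3, 'IAH': 2, 'SFO': 1}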

Hadoop Distributed File System (HDFS) is designed for scalability and fault tolerance. HDFS stores large files by dividing them into blocks (usually 64 or 128 MB) and replicating the blocks on three or more servers. HDFS provides APIs for MapReduce applications to read and write data in parallel. Capacity and performance can be scaled by adding Data Nodes, and a single NameNode mechanism manages data placement and monitors server availability. HDFS clusters in production use today reliably hold petabytes of data on thousands of nodes.

² Pictures source: Intel White Paper on Big Data Analytics

Page 6: Big Data Analysis Concepts and References


Data Storage: Collect and store unstructured data in a fault-resilient, scalable data store that can be organized and sorted for indexing and analysis.

Analytics: Ability to query in real time, at the speed of thought, on petabyte-scale unstructured and semi-structured data using HBase and Hive.

Batch Processing of Unstructured Data: Ability to batch-process (index, analyze, etc.) tens to hundreds of petabytes of unstructured and semi-structured data.

Data Archive: Medium-term (12-36 months) archival of data from EDW/DBMS to meet data retention policies.

Integration with EDW: Extract, transfer and load data in and out of Hadoop into a separate DBMS for advanced analytics.

Search and Predictive Analytics: Crawl, extract, index and transform structured and unstructured data for search and predictive analytics.

Use Cases

Southwest Airlines uses a Hadoop-based solution for its "Rapid Rewards" loyalty program for customer service.

Common Patterns of Hadoop Use

Pattern # 1: Hadoop as a Data Refinery

Collect data from traditional sources (RDBMS, OLTP, OLAP) and new sources (web logs, email, sensor data, social media) and apply a known algorithm within a trusted operational process:
1. Capture: capture all data.
2. Process: parse, cleanse, apply structure and transform.
3. Exchange: push to the existing data warehouse (RDBMS, EDW, traditional repositories) for use with existing analytic tools (business analytics, custom applications, enterprise applications).

Pattern # 2: Data Exploration with Hadoop

Collect data from traditional sources (RDBMS, OLTP, OLAP) and new sources (web logs, email, sensor data, social media) and perform iterative investigation for value:
1. Capture: capture all data.
2. Process: parse, cleanse, apply structure and transform.
3. Exchange: explore and visualize with analytics tools supporting Hadoop.

Pattern # 3: Application Enrichment with Hadoop

Collect data from traditional sources (RDBMS, OLTP, OLAP) and new sources (web logs, email, sensor data, social media), analyze it, and present salient results to online applications:
1. Capture: capture all data.
2. Process: parse, cleanse, apply structure and transform.
3. Exchange: incorporate data directly into applications (custom and enterprise applications).

Top 5 General Usages: obtaining a 360-degree view of customers; Big Data exploration; operations analytics; data warehouse augmentation; social media.

How is the airline industry using Hadoop?³

Top 5 Airline Usages: capturing sensor data to optimize maintenance; forecasting the weather to optimize fuel loads; identifying and capturing the demand signal (competitive offerings, travel partner feeds); loyalty and promotions; webpage visits and log storage.

When is the best time of day/day of week/time of year to fly to minimize delays?

Do older planes suffer more delays?

How does the number of people flying between different locations change over time?

How well does weather predict plane delays?

Can you detect cascading failures as delays in one airport create delays in others? Are there critical links in the system?

American Airlines is utilizing a Hadoop-based solution for clickstream, customer, kiosk and data analytics.

British Airways uses a Hadoop-based solution for forecasting and revenue management.

Orbitz Worldwide utilizes a Hadoop-based solution for statistical analysis to identify the best possible promotions that combine air travel with a hotel stay.

Prospects: CTO Q’s?

³ Source: Internet (Google search)

Page 7: Big Data Analysis Concepts and References


Cost Benefit Analysis

Economics of Data (chart: cost per terabyte vs. size of data, from gigabyte scale to petabyte scale). The cost of current, structured data management technologies rises steeply as data grows, while the cost of Hadoop stays low; the gap between the two curves is the value captured by customers (like United) using Hadoop.

Why is Hadoop a value proposition?

TCOD (Total Cost of Data), the cost of owning (and using!) data over time for analytic purposes, is a better metric than TCO (Total Cost of Ownership) for the cost-benefit analysis of this case. TCOD estimates what a company like United will really spend to get to its business goal, and the focus is on total cost, not just the platform cost. In this case the TCOD comparison is made between an EDW platform/appliance and Hadoop for the same amount of (raw or unstructured) data.

Cost Comparison & Benefits are based on underlying “Data Management” Requirements⁴

Project A: Emphasis on "Data Refining" Requirements
1. Hundreds of TB of data per week - 500 TB of data.
2. Raw data life: a few hours to a few days.
3. Challenge: find the important events or trends.
4. Analyze the raw data once or a few times.
5. When analyzing, read entire files.
6. Keep only the significant data.

                          EDW Appliance   Hadoop
Total System Cost         $23M            $1.3M
System and Data Admin     $2M             $2M
Application Development   $5M             $6M
Total Cost of Data        $30M            $9.3M

The cost equation is favorable to Hadoop for data refining, data landing and archival requirements.

Project B: Emphasis on "EDW" Requirements
1. Data volume 500 TB to start - all must be retained for at least five years.
2. Continual growth of data and workload.
3. Data sources: thousands.
4. Data sources change their feeds frequently.
5. Challenges: data must be correct and data must be integrated.
6. Typical enterprise data lifetime: decades.
7. Analytic application lifetime: years.
8. Millions of data users.
9. Hundreds of analytic applications.
10. Thousands of one-time analyses.
11. Tens of thousands of complex queries.

                          EDW Appliance   Hadoop
Total System Cost         $45M            $5M
System and Data Admin     $50M            $100M
Application Development   $40M            $300M
ETL                       $60M            $100M
Complex Queries           $40M            $80M
Analysis                  $30M            $70M
Total Cost of Data        $265M           $655M

The cost equation is favorable to EDW for data warehouse appliance requirements.

Cost Comparison Conclusions: Each technology has large advantages in its sweet spot(s). Neither platform is cost effective in the other's sweet spot. The biggest differences for the data warehouse are the development costs for "Complex Queries" and "Analytics". Total cost is extremely sensitive to technology choice. Analytic architectures will require both Hadoop and data warehouse platforms. Focus on total cost, not platform cost, in making your choice for a particular application or use. Many analytic processes will use both Hadoop and EDW technology, so integration cost also counts!

⁴ source for TCOD comparison is “The Real Cost of Big Data Spreadsheet” provided by Winter Corp (www.wintercorp.com/tcod)

Hadoop is ideal for storage of: data which is rarely needed; data which can grow rapidly; data which can grow very large; data for which it is uncertain how it will be needed in the future; data which may or may not have structure; data which may require ETL and analysis sometime in the future but just needs to be stored now, for some as-yet-unknown use.

TCOD is the cost of owning (and using!) data over time for analytic purposes. * ETL is extract, transform and load (preparing data for analytic use).

(Chart legend: system cost, admin cost, and software development/maintenance cost - ETL*, apps, queries, analytics.)

Page 8: Big Data Analysis Concepts and References


Very quick introduction to understanding Data and analysis of Data

Start here if you are new to understanding the data or do not know how to analyze data.

Part 2

Page 9: Big Data Analysis Concepts and References


Introduction to Data

Data Analysis Benefits

Descriptive Analytics ("What happened?"): pattern recognition from samples for reporting of trends; formulates and analyzes historical data. Benefit: Medium.

Predictive Analytics ("What could happen?"): finds associations in data not readily apparent with customary analysis; forecasts future probabilities and trends. Benefit: High.

Prescriptive Analytics ("What is the best action/outcome?"): targets business constraints; assesses and determines new ways to operate. Benefit: Very High.

Myth: I have large sets of data on the Hadoop File System. Running powerful analytical tools (e.g., R, SAS, Tableau, etc.) on Hadoop infrastructure will perform all the data analysis work for me and provide/deliver useful information.

Fact: The data by itself may not contain the answer, big or small - you need the right data. The combination of data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data. Performing data analysis with an understanding (and application of) data science principles, by correctly framing the analytical problem (with correct data sampling/collection methods) and with the use of appropriate analytical tools, will most likely provide useful information for statistical/analytical inference.


Data Facts

Data is just data. Data does not live nor die. Data does not offer truth, nor does it lie. Data is not large nor is it small. Data has always existed as both big data and small data.

Data is the second most important thing

The most important thing in data science is the question. The second most important is the data. Often the data will limit or enable the questions. But having data is useless if you don't have an appropriate question.

Data are values of qualitative or quantitative variables, belonging to a set of items. Source: Wikipedia

Page 10: Big Data Analysis Concepts and References


Data Types: Basics

Types of Data Variables

Data variables are either numerical (continuous or discrete) or categorical (regular categorical or ordinal).

Examples (from a survey table on the original slide): gender: categorical; sleep: numerical, continuous; bedtime: categorical, ordinal; countries: numerical, discrete; dread: categorical, ordinal (could also be used as numerical).

Numerical (quantitative) variables take on numerical values. It is sensible to add, subtract, take averages, etc. with these values.

Categorical (qualitative) variables take on a limited number of distinct categories. These categories can be identified with numbers, but it wouldn't be sensible to do arithmetic operations with these values.

Continuous numerical variables are measured, and can take on any numerical value.

Discrete numerical variables are counted, and can take on only whole non-negative numbers.

Categorical variables that have ordered levels are called ordinal. Think about a flight survey question where you are asked how satisfied you are with the customer service you received, and the options are very unsatisfied, unsatisfied, neutral, satisfied or very satisfied. These levels have an inherent ordering, and hence the variable is called ordinal.

If the levels of the categorical variable do not have an inherent ordering to them, then the variable is simply called (regular) categorical (e.g., do you prefer a morning flight or an evening flight?).

Observations, Variables and Data Matrices

Data are organized in what we call a data matrix, where each row represents an observation (or a case), and each column represents a variable.

City            no_flights   %_ontime   ...   region
Chicago         350          90         ...   Midwest
Houston         330          96         ...   South
...             ...          ...        ...   ...
Newark          306          92         ...   Northeast
San Francisco   310          93         ...   West

(Each row of the data matrix is an observation, or case; each column is a variable.)

The first variable is City, an identifier variable holding the name of the city United serves for which the data are gathered.

Next is no_flights (number of flights) served by United daily, which is a discrete numerical variable.

Next is %_ontime (percentage on time), representing the United flights that operated on time (arrival or departure); this is a continuous numerical variable (it can take on any value between zero and 100, even though the values shown here are rounded to whole numbers).

The last column is region, representing where the city is located in the USA as designated by the US census (Northeast, Midwest, South, and West); this is a categorical variable.
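The same data matrix can be sketched in pandas, assigning each column a type that matches the variable types described above (values taken from the illustrative table):

import pandas as pd

# Each row is an observation (a city), each column a variable.
flights = pd.DataFrame({
    "city":       ["Chicago", "Houston", "Newark", "San Francisco"],       # identifier
    "no_flights": [350, 330, 306, 310],                                     # discrete numerical
    "pct_ontime": [90.0, 96.0, 92.0, 93.0],                                 # continuous numerical
    "region": pd.Categorical(["Midwest", "South", "Northeast", "West"]),    # categorical
})
print(flights.dtypes)
print(flights.head())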

When two variables show some connection with one another, they are called associated, or dependent variables.

The association can be further described as positive or negative. If two variables are not associated, they are said to be independent.

Relationships between Variables


Page 11: Big Data Analysis Concepts and References


Data Observational Studies and Experiments

Observational Study

In an observational study you collect data in a way that does not directly interfere with how the data arise, i.e. merely “observe”.

We can only establish an association (or correlation) between the explanatory and response variables.

If an observation study uses data from the past, it’s called a retrospective study, whereas if data are collected throughout the study, it’s called prospective.

Experiment

In an experiment, you randomly assign subjects to various treatments and can therefore establish a causal connection between the explanatory and response variables.

(Diagram: in the observational study, people who work out and people who don't are sampled and their average energy levels compared; in the experiment, people are randomly assigned to work-out and no-work-out groups before comparing average energy levels.)

In an observational study we sample two types of people from the population, those who choose to work out regularly and those who don't, then we find the average energy level for the two groups of people and compare. On the other hand, in an experiment, we sample a group of people from the population and then we randomly assign these people into two groups: those who will regularly work out throughout the course of the study and those who will not. The difference is that the decision of whether to work out or not is not left to the subjects, as in the observational study, but is instead imposed by the researcher. At the end we compare the average energy levels of the two groups. Based on the observational study, even if we find a difference between the energy levels of these two groups of people, we really can't attribute this difference solely to working out, because there may be other variables that we didn't control for in this study that contribute to the observed difference. For example, people who are in better shape might be more likely to work out and also have high energy levels. However, in the experiment such variables that are likely to contribute to the outcome are equally represented in the two groups, due to random assignment. Therefore if we find a difference between the two averages, we can indeed make a causal statement attributing this difference to working out.

Example: Suppose you want to evaluate the relationship between regularly working out and energy level. We can design the study as observational study or an experiment.

What type of study is this, observational study or an experiment?

“Girls who regularly ate breakfast, particularly one that includes cereal, were slimmer than those who skipped the morning meal, according to a study that tracked nearly 2,400 girls for 10 years. [...] As part of the survey, the girls were asked once a year what they had eaten during the previous three days.”

This is an observational study since the researchers merely observed the behavior of the girls (subjects) as opposed to imposing treatments on them. The study concludes there is an association between girls eating breakfast and being slimmer.

Three possible explanations: #1: Eating breakfast causes girls to be thinner. #2: Being thin causes girls to eat breakfast. #3: A third variable is responsible for both. What could it be? An extraneous variable that affects both the explanatory and the response variables, and that makes it seem like there is a relationship between the two, is called a confounding variable.


• Whether we can infer correlation or causation depends on the type of study that we are basing our decision on.
• Observational studies for the most part only allow us to make correlation statements, while experiments allow us to infer causation.

Correlation does not imply causation.


Page 12: Big Data Analysis Concepts and References


Data Sampling and Sources of Bias

Data Sampling

Think about sampling something that you are cooking - you taste (examine) a small part of what you’re cooking to get an idea about the dish as a whole.

When you taste a spoonful of soup and decide the spoonful you tasted isn’t salty enough, that’s called exploratory analysis for the sample at hand.

If you can generalize and conclude that your entire soup needs salt, that’s making an inference.

For your inference to be valid, the spoonful you tasted (the sample) needs to be representative of the entire pot (the population).

If your spoonful comes only from the surface and the salt is collected at the bottom of the pot, what you tasted is probably not representative of the whole pot.

On the other hand, if you first stir the soup thoroughly before you taste, your spoonful will more likely be representative of the whole pot.

Exploratory Analysis

Representative Sample

Inference

Sources of Sampling Bias

Convenience sample bias: Individuals who are easily accessible are more likely to be included in the sample. Example: say you want to find out how people in your city feel about a recent increase in public transportation costs. If you only poll people in your neighborhood, as opposed to a representative sample from the entire city, your study will suffer from convenience sample bias.

Voluntary response bias: Occurs when the sample consists of people who volunteer to respond because they have strong opinions on the issue. Such a sample will also not be representative of the entire population. Example: say you place polling machines at all bus stops and metro stations in your city, but only those who suffered from the price increase choose to actually take the time to vote and express their opinion on the recent increase in public transportation fares. The people who respond to such a poll are not representative of the entire population.

Non-response sampling bias: If only a (non-random) fraction of the randomly sampled people choose to respond to a survey, the sample is no longer representative of the entire population. Example: say you take a random sample of individuals from your city and attempt to survey them, but certain segments of the population, say those from a lower socio-economic status, are less likely to respond to the survey; then the sample is not representative of the entire population.

Sampling Bias a historical example: Landon vs. FDR In 1936, Landon sought the Republican presidential nomination opposing the re-election of FDR.

A popular magazine of the time (1936), "Literary Digest", polled about 10 million Americans and got responses from about 2.4 million. To put things in perspective, nowadays reliable polls in the USA poll about 1,500 to 3,000 people, so the "10 million" poll was a very large sample.

The poll showed that Landon would likely be the overwhelming winner and FDR would get only 43% of the votes.

Election result: FDR won, with 62% of the votes.

What went wrong with the Literary Digest Poll?

The magazine had surveyed its own readers: registered automobile owners, and registered telephone users.

These groups had incomes well above the national average of the day (remember, this is Great Depression era) which resulted in lists of voters far more likely to support Republicans than a truly typical voter of the time, i.e. the sample was not representative of the American population at the time.

The Literary Digest election poll was based on a sample size of 2.4 million, which is huge, but since the sample was biased, the sample did not yield an accurate prediction. Back to the soup analogy: If the soup is not well stirred, it doesn’t matter how large a spoon you have, it will still not taste right. If the soup is well stirred, a small spoon will suffice to test the soup.

Page 13: Big Data Analysis Concepts and References


Data Sampling Methods & Experimental Design

Obtaining Good Samples

Almost all statistical methods are based on the notion of implied randomness. If observational data are not collected in a random framework from a population, then the estimates produced by these statistical methods, and the errors associated with those estimates, are not reliable.

Most commonly used random sampling techniques are simple, stratified, and cluster sampling.

Simple random sample: randomly select cases from the population, where there is no implied connection between the points that are selected.

Stratified sample: strata are made up of similar observations; we take a simple random sample from each stratum.

Cluster sample: clusters are usually not made up of homogeneous observations, and we take a simple random sample from a random sample of clusters. Usually preferred for economical reasons.
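A small sketch contrasting the three random sampling techniques on a toy population (the passenger/hub columns are hypothetical; pandas' sample() does the random draws):

import pandas as pd

# Toy population: passengers labeled with a hub (used as stratum / cluster).
population = pd.DataFrame({
    "passenger_id": range(1000),
    "hub": [["ORD", "IAH", "EWR", "SFO", "DEN"][i % 5] for i in range(1000)],
})

# Simple random sample: pick cases at random from the whole population.
simple = population.sample(n=50, random_state=1)

# Stratified sample: a simple random sample from within each stratum (hub).
stratified = population.groupby("hub", group_keys=False).sample(n=10, random_state=1)

# Cluster sample: randomly pick whole clusters (hubs), then sample within them.
chosen_hubs = pd.Series(population["hub"].unique()).sample(n=2, random_state=1)
cluster = population[population["hub"].isin(chosen_hubs)].sample(n=50, random_state=1)

print(len(simple), len(stratified), len(cluster))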

(Data) Experimental Design Concepts
• Control: compare the treatment of interest to a control group.
• Randomize: randomly assign subjects to treatments, and randomly sample from the population whenever possible.
• Replicate: within a study, replicate by collecting a sufficiently large sample, or replicate the entire study.
• Block: if there are variables that are known or suspected to affect the response variable, first group subjects into blocks based on these variables, and then randomize cases within each block to treatment groups.
• Blocking example: we would like to design an experiment to investigate whether energy gels make you run faster (treatment: energy gel; control: no energy gel). It is suspected that energy gels might affect pro and amateur athletes differently, therefore we block for pro status: divide the sample into pro and amateur, randomly assign pro athletes to treatment and control groups, and randomly assign amateur athletes to treatment and control groups. Pro/amateur status is then equally represented in the resulting treatment and control groups.

Random Assignment vs. Random Sampling

Page 14: Big Data Analysis Concepts and References


Hypothesis Testing

Two competing claims

Claim 1. "There is nothing going on." Promotion and gender are independent, there is no gender discrimination, and the observed difference in proportions is simply due to chance. => Null hypothesis
Claim 2. "There is something going on." Promotion and gender are dependent, there is gender discrimination, and the observed difference in proportions is not due to chance. => Alternative hypothesis

A court trial as a hypothesis test

Hypothesis testing is very much like a court trial. H0: the defendant is innocent. HA: the defendant is guilty. We then present the evidence: collect data. Then we judge the evidence: "Could these data plausibly have happened by chance if the null hypothesis were true?" If they were very unlikely to have occurred, then the evidence raises more than a reasonable doubt in our minds about the null hypothesis.

Ultimately we must make a decision: how unlikely is unlikely? If the evidence is not strong enough to reject the assumption of innocence, the jury returns a verdict of "not guilty". The jury does not say that the defendant is innocent, just that there is not enough evidence to convict. The defendant may, in fact, be innocent, but the jury has no way of being sure. Said statistically, we fail to reject the null hypothesis.

We never declare the null hypothesis to be true, because we simply do not know whether it's true or not. Therefore we never "accept the null hypothesis". In a trial, the burden of proof is on the prosecution. In a hypothesis test, the burden of proof is on the unusual claim. The null hypothesis is the ordinary state of affairs (the status quo), so it's the alternative hypothesis that we consider unusual and for which we must gather evidence.


Page 15: Big Data Analysis Concepts and References


Statistical Inference and Prediction

Statistical Inference

Statistical inference is the process of drawing conclusions from data that are subject to random variation, for example, observational errors or sampling variation. Initial requirements of such a system of procedures for inference and induction are that the system should produce reasonable answers when applied to well-defined situations and that it should be general enough to be applied across a range of situations. Inferential statistics are used to test hypotheses and make estimations using sample data.

Confidence Interval

Outcome of statistical inference may be an answer to the question "what should be done next?", where this might be a decision about making further experiments or surveys, or about drawing a conclusion before implementing some organizational or governmental policy.

A confidence interval gives an estimated range of values which is likely to include an unknown population parameter, the estimated range being calculated from a given set of sample data. If independent samples are taken repeatedly from the same population, and a confidence interval calculated for each sample, then a certain percentage (confidence level) of the intervals will include the unknown population parameter. Confidence intervals are usually calculated so that this percentage is 95%, but we can produce 90%, 99%, 99.9% (or whatever) confidence intervals for the unknown parameter.

Confidence Level Confidence level is the probability value (1-alpha) associated with a confidence interval. It is often expressed as a percentage. For example, say alpha = 0.05 = 5%, then the confidence level is equal to (1-0.05) = 0.95, i.e. a 95% confidence level.
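A short sketch of a 95% confidence interval for a sample mean using the normal approximation (simulated data and illustrative numbers only; scipy is assumed to be available):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=14.5, scale=30.0, size=400)   # e.g. simulated arrival delays, in minutes

alpha = 0.05                          # confidence level = 1 - alpha = 95%
mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(len(sample))
z = stats.norm.ppf(1 - alpha / 2)     # ~1.96 for a 95% interval

print(f"95% CI for the mean: ({mean - z * se:.2f}, {mean + z * se:.2f})")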

Prediction

In statistics, prediction is the process of determining the magnitude of statistical variates at some future point of time. In a data analysis context the word may also occur in slightly different meanings; e.g., in a regression equation expressing a dependent variate y in terms of independent x's, the value given for y by specified values of the x's is called the "predicted" value even when no temporal element is involved.

Prediction vs. Inference: Using data to predict an event that has yet to occur is statistical prediction. Inferring the value of a population quantity, such as the average income of a country or the proportion of eligible voters who say they will vote 'yes', is statistical inference.

Prediction and inference answer different types of data analysis questions.

Examples of predictions (because the events have not occurred at the time of writing this content): The probability that the Chicago Bulls will win the 2018 NBA playoffs is __. The probability that the Republican Party will win the 2020 Presidential election is __.

Examples of inferences (because the questions involve estimating a population value): The proportion of NBA fans that currently believe the Chicago Bulls will win the 2018 playoffs is __. The proportion of eligible voters that currently state they will vote for the Republican Party in the 2020 Presidential election is __.

Page 16: Big Data Analysis Concepts and References


Before you proceed to Part 3, please quickly review the Appendix section to familiarize yourself with the terms and terminology that will be used in the rest of the presentation.

Page 17: Big Data Analysis Concepts and References


Big Data Analysis Concepts and References Use Cases in Airline Industry

Jump here directly if you are an advanced user who understands data and knows how to analyze data.

Part 3

Page 18: Big Data Analysis Concepts and References


Big Data Analysis: Concepts and Airline Industry Use Cases

Data Analysis & Machine Learning

Data Analysis: models derive useful analytical information so humans can better understand it. Examples: Does spending more money on marketing & sales in area "X" vs. area "Y" make the company more profitable? What does the customer want? (e.g., customer survey).

Machine Learning: models allow machines (software programs & applications) to make "real-time" (auto) decisions. Examples: Google search / Amazon product recommendations, Facebook news feed, etc.; geographic (GPS-based) advertisements or event-based (holiday, weather, traffic) promotions.

Big Data application areas

Page 19: Big Data Analysis Concepts and References


Bayesian Approach and Bayes Rule

Bayesian Approach

Differences between Bayesians and Non-Bayesians

Bayes Theorem

The probability that an email message is spam, given the words in the message, is given by Bayes' theorem: P(spam | words) = P(words | spam) × P(spam) / P(words).

Example: Email Spam Filtering

With Bayes: • A key benefit: The ability to incorporate prior knowledge • A key weakness: The need to incorporate prior knowledge
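A toy naive Bayes spam filter showing how Bayes' theorem combines a prior with word likelihoods; the word probabilities here are made up, and a real filter would estimate them from training data and smooth them:

# P(spam | words) is proportional to P(spam) * product of P(word | spam),
# under the naive independence assumption over words.
p_spam = 0.2                                   # prior probability a message is spam
p_word_given_spam = {"free": 0.30, "flight": 0.05, "winner": 0.25}
p_word_given_ham  = {"free": 0.02, "flight": 0.10, "winner": 0.001}

def spam_score(words):
    spam, ham = p_spam, 1.0 - p_spam
    for w in words:
        spam *= p_word_given_spam.get(w, 0.01)   # small default for unseen words
        ham  *= p_word_given_ham.get(w, 0.01)
    return spam / (spam + ham)                   # normalize: posterior P(spam | words)

print(spam_score(["free", "winner"]))   # high posterior -> likely spam
print(spam_score(["flight"]))           # low posterior  -> likely not spam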

Page 20: Big Data Analysis Concepts and References


Bayesian Belief Network Representation of Airline Passenger Behavior

Source: Booz Allen Hamilton


The basis of this slide is the paper titled "Airline Analytics: Decision Analytics Center of Excellence" by Cenk Tunasar, Ph.D., and Alex Cosmas of Booz Allen Hamilton.

In the above listed paper the authors claim that Booz Allen used the Big Data infrastructure of an airline client and was able to analyze large datasets containing more than 3 years' worth of passenger data, approximately 100 GB+. Booz Allen generated hypotheses to test from the Big Data set, including but not limited to: Airline Market Performance

• What are the client’s natural market types and their distinct attributes? • What is the client’s competitive market health? • Where does the client capture fare premiums or fare discounts relative to other carriers?

Passenger Behavior

• What is the variability of booking curves by market type? • What are the intrinsic attributes of markets with the highest earn and highest burn rates? • Can predictive modeling be developed for reservation changes and no-show rates for individual passengers on individual itineraries?

Consumer Choice

• What is the demand impact of increasing connection time? • What is the effect of direct versus connecting itineraries on passenger preference?

A use case in the airline industry

(URL: http://www.boozallen.com/media/file/airline-analytics-brochure.pdf)

Page 21: Big Data Analysis Concepts and References


Bayesian Ideas are very important for Big Data Analysis

Bayesian Themes (Source: Steve Scott, Google Inc.)

Prediction: average over unknowns, don't maximize.
Uncertainty: probability coherently represents uncertainty.
Combine information: hierarchical models combine information from multiple sources.
Sparsity: sparsity plays an important role in modeling Big Data. Models are "big" because of a small number of factors with many levels. Big data problems are often big collections of small data problems.

Multi-armed Bandits Problem

Multi-armed bandit problem is the problem a gambler faces at a row of slot machines, sometimes known as "one-armed bandits", when deciding which slot machines to play, how many times to play each machine and in which order to play them. When played, each machine provides a random reward from a distribution specific to that machine. The objective of the gambler is to maximize the sum of rewards earned through a sequence of lever pulls. Source: Wikipedia
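A minimal Thompson-sampling sketch of the multi-armed bandit problem described above: each arm keeps a Beta posterior over its unknown success probability, one value is drawn from each posterior, and the arm with the largest draw is played, so exploration is driven entirely by parameter uncertainty (the reward probabilities below are illustrative):

import random

true_reward_prob = [0.04, 0.06, 0.09]        # unknown to the algorithm
wins   = [1, 1, 1]                           # Beta(1, 1) prior for each arm
losses = [1, 1, 1]

for _ in range(10_000):
    # Sample one plausible success rate per arm from its posterior.
    draws = [random.betavariate(wins[a], losses[a]) for a in range(3)]
    arm = draws.index(max(draws))            # play the arm that looks best this round
    reward = random.random() < true_reward_prob[arm]
    wins[arm]   += reward
    losses[arm] += 1 - reward

print("plays per arm:", [wins[a] + losses[a] - 2 for a in range(3)])   # most plays go to the best arm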

Bayes Rule applied to Machine Learning

A use case in the airline industry

Big Data Project at Southwest Airlines: The URL below provides a visual (interactive graphics) presentation of the Big Data project at Southwest Airlines and how they used a Bayesian approach and Naive Bayes classification with the WEKA ("Waikato Environment for Knowledge Analysis") tool to analyze the following questions:

1) What are the important factors that cause delays, and what is their weightage?
2) What kind of weather (e.g. sunny, cloudy, snow, rain, etc.) causes weather delays?
3) Are some time periods during the day (e.g. early morning, morning, noon, etc.) more prone to delays than others?

(URL: http://prezi.com/f3bsv9m6yl2g/big-data-project_southwest-airlines/)

Entirely driven by parameter uncertainty

Page 22: Big Data Analysis Concepts and References


Example: Bayesian based “Search Optimization” on Google File System (Source: Google Analytics)

Source: Steve Scott (Google Inc.)

Personalization as a “Big Logistic Regression"

Search words: “Chicago to Houston today”

Search words: “Chicago to Houston flight tomorrow”

Search words: “Chicago to Houston cheapest”

Page 23: Big Data Analysis Concepts and References


Meta Analysis


Meta-analysis refers to methods that focus on contrasting and combining results from different studies, in the hope of identifying patterns among study results, sources of disagreement among those results, or other interesting relationships that may come to light in the context of multiple studies. In its simplest form, meta-analysis is normally done by identification of a common measure of effect size. A weighted average of that common measure is the output of a meta-analysis. The weighting is related to sample sizes within the individual studies. More generally there are other differences between the studies that need to be allowed for, but the general aim of a meta-analysis is to more powerfully estimate the true effect size as opposed to a less precise effect size derived in a single study under a given single set of assumptions and conditions. Source: Wikipedia

Advantages:
• Results can be generalized to a larger population.
• The precision and accuracy of estimates can be improved as more data is used; this, in turn, may increase the statistical power to detect an effect.
• Inconsistency of results across studies can be quantified and analyzed: for instance, does inconsistency arise from sampling error, or are study results (partially) influenced by between-study heterogeneity?
• Hypothesis testing can be applied to summary estimates.

A use case in the airline industry

Price Elasticities of Demand for Passenger Air Travel

A good discussion of the topic is detailed in the paper listed below:

Price Elasticities of Demand for Passenger Air Travel: A Meta-Analysis

by Martijn Brons, Eric Pels, Peter Nijkamp, Piet Rietveld of Tinbergen Institute

(URL: http://papers.tinbergen.nl/01047.pdf)

Meta Analysis and Big Data

A good discussion of the topic is detailed in the article listed below: Meta-Analysis: The Original 'Big Data‘

by Blair T. Johnson , Professor at University of Connecticut (URL: http://meta-analysis.ning.com/profiles/blogs/meta-analysis-the-original-big-data)

Page 24: Big Data Analysis Concepts and References


Effect Size


Effect size is a measure of the strength of a phenomenon (for example, the change in an outcome after experimental intervention). An effect size calculated from data is a descriptive statistic that conveys the estimated magnitude of a relationship without making any statement about whether the apparent relationship in the data reflects a true relationship in the population. In that way, effect sizes complement inferential statistics such as p-values. Among other uses, effect size measures play an important role in meta-analysis studies that summarize findings from a specific area of research, and in statistical power analyses. Source: Wikipedia

Example: A weight loss program may boast that it leads to an average weight loss of 30 pounds. In this case, 30 pounds is the claimed effect size. If the weight loss program results in an average loss of 30 pounds, it is possible that every participant loses exactly 30 pounds, or that half the participants lose 60 pounds and half lose no weight at all.

"Small", “Medium", “Large" Effect Sizes

Terms such as "small", "medium" and "large" applied to the size of an effect are relative. Whether an effect size should be interpreted as small, medium, or large depends on its substantive context and its operational definition. Cohen's conventional criteria of small, medium, or large are near ubiquitous across many fields. Power analysis or sample size planning requires an assumed population effect size. For Cohen's d, an effect size of 0.2 to 0.3 might be a "small" effect, around 0.5 a "medium" effect, and 0.8 to infinity a "large" effect.
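A quick sketch of Cohen's d, the standardized mean difference that these "small" / "medium" / "large" labels refer to (toy numbers):

import numpy as np

def cohens_d(group_a, group_b):
    # Standardized mean difference: (difference of means) / pooled standard deviation.
    a, b = np.asarray(group_a, float), np.asarray(group_b, float)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

treatment = [31, 28, 33, 30, 29, 32]   # e.g. pounds lost with the program
control   = [27, 26, 30, 28, 25, 29]
print(round(cohens_d(treatment, control), 2))   # ~1.6: a "large" effect by Cohen's labels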

Page 25: Big Data Analysis Concepts and References


Monte Carlo Method


Monte Carlo methods (or experiments) are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results; typically one runs simulations many times over in order to obtain the distribution of an unknown probabilistic entity. The name comes from the resemblance of the technique to the act of playing and recording results in a real gambling casino. They are often used in physical and mathematical problems and are most useful when it is difficult or impossible to obtain a closed-form expression, or infeasible to apply a deterministic algorithm. Monte Carlo methods are mainly used in three distinct problem classes: optimization, numerical integration and generation of draws from a probability distribution.

Monte Carlo methods vary, but tend to follow a particular pattern:
1. Define a domain of possible inputs.
2. Generate inputs randomly from a probability distribution over the domain.
3. Perform a deterministic computation on the inputs.
4. Aggregate the results.

For example: consider a circle inscribed in a unit square. Given that the circle and the square have a ratio of areas that is π/4, the value of π can be approximated using a Monte Carlo method:
1. Draw a square on the ground, then inscribe a circle within it.
2. Uniformly scatter some objects of uniform size (grains of rice or sand) over the square.
3. Count the number of objects inside the circle and the total number of objects.
4. The ratio of the two counts is an estimate of the ratio of the two areas, which is π/4. Multiply the result by 4 to estimate π.
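The π example above translates almost directly into code; this is a minimal sketch of that procedure:

import random

def estimate_pi(n_points=1_000_000, seed=42):
    random.seed(seed)
    inside = 0
    for _ in range(n_points):
        x, y = random.random(), random.random()        # uniform scatter over the unit square
        inside += (x * x + y * y) <= 1.0                # falls inside the quarter circle?
    return 4 * inside / n_points                        # ratio of areas is pi/4

print(estimate_pi())   # ~3.14, improving slowly as n_points grows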

Monte Carlo Methods for Bayesian Analysis and Big Data

A good discussion of the topic is detailed in the paper listed below: A Sequential Monte Carlo Method for Bayesian Analysis of Massive Datasets

by David Madigan, Professor and Dean at Columbia University and Greg Ridgeway, Deputy Director at National Institute of Justice. (URL: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2753529/ )

Source: Wikipedia

A use case in the airline industry

Flight Delay-Cost (Initial delay – “type I” and Propagated delay “type II”) and Dynamic Simulation Analysis for Airline Schedule Optimization

Flight Delay-Cost Simulation Analysis and Airline Schedule Optimization

by Duojia Yuan of RMIT University, Victoria, Australia (URL: http://researchbank.rmit.edu.au/eserv/rmit:9807/Yuan.pdf)

General use case for customer satisfaction and customer loyalty

Concurrent Reinforcement Learning from Customer Interactions

Concurrent Reinforcement Learning from Customer Interactions

by David Silver of University College London (published 2013) and Leonard Newnham, Dave Barker, Suzanne Weller, Jason McFall of Causata Ltd . (URL: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Publications_files/concurrent-rl.pdf )

A good discussion of the topic is detailed in the Ph.D. thesis listed below. The reliability modeling approach developed in this project (to enhance the dispatch reliability of Australian X airline fleet) is based on the probability distributions and Monte Carlo Simulation (MCS) techniques. Initial (type I) delay and propagated (type II) delay are adopted as the criterion for data classification and analysis.

In the below paper, the authors present a framework for concurrent reinforcement learning, a new method for a company interacting concurrently with many customers, with an objective function to maximize revenue, customer satisfaction, or customer loyalty, which depends primarily on the sequence of interactions between company and customer (such as promotions, advertisements, or emails) and actions by the customer (such as point-of-sale purchases, or clicks on a website). The proposed concurrent reinforcement learning framework uses a variant of temporal-difference learning to learn efficiently from partial interaction sequences. The goal is to maximize the future rewards for each customer, given their history of interactions with the company. The proposed framework differs from traditional reinforcement learning paradigms due to the concurrent nature of the customer interactions; this distinction leads to new considerations for reinforcement learning algorithms.

Page 26: Big Data Analysis Concepts and References


Bayes and Big Data: Consensus Monte Carlo and Nonparametric Bayesian Data Analysis

A good discussion of the topic is detailed in the article listed below:

“Bayes and Big Data: The Consensus Monte Carlo Algorithm” by

Robert E. McCulloch (University of Chicago, Booth School of Business); Edward I. George (University of Pennsylvania, The Wharton School); Steven L. Scott (Google, Inc.); Alexander W. Blocker (Google, Inc.); Fernando V. Bonassi (Google, Inc.)

(URL: http://www.rob-mcculloch.org/some_papers_and_talks/papers/working/consensus-mc.pdf)

Consensus Monte Carlo

For Bayesian methods to work in a MapReduce / Hadoop environment, we need algorithms that require very little communication. Need: A useful definition of “big data” is data that is too big to fit on a single machine, either because of processor, memory, or disk bottlenecks. Graphics Processing Units (GPU) can alleviate the processor bottleneck, but memory or disk bottlenecks can only be alleviated by splitting “big data” across multiple machines. Communication between large numbers of machines is expensive (regardless of the amount of data being communicated), so there is a need for algorithms that perform distributed approximate Bayesian analyses with minimal communication.

Consensus Monte Carlo operates by running a separate Monte Carlo algorithm on each machine, and then averaging the individual Monte Carlo draws. Depending on the model, the resulting draws can be nearly indistinguishable from the draws that would have been obtained by running a single machine algorithm for a very long time.

Source: Steve Scott (Google Inc.)
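A stylized sketch of the consensus idea for the simplest possible case: a normal mean with known variance and a flat prior, where the data are split evenly across machines and a plain (unweighted) average of the per-machine draws already matches the full-data posterior. For general models the paper describes weighted averaging; this toy version only illustrates the communication pattern (a single round of averaging at the end):

import numpy as np

rng = np.random.default_rng(7)
sigma = 1.0
data = rng.normal(0.3, sigma, size=10_000)          # full data set
shards = np.array_split(data, 10)                   # "big data" split across 10 machines

def shard_draws(shard, n_draws=5_000):
    # Each machine samples from the posterior of the mean given only its shard
    # (normal likelihood, known sigma, flat prior): N(shard mean, sigma^2 / n_shard).
    return rng.normal(shard.mean(), sigma / np.sqrt(len(shard)), size=n_draws)

draws_per_machine = np.array([shard_draws(s) for s in shards])
consensus = draws_per_machine.mean(axis=0)          # average the draws across machines

# With equal shard sizes and a flat prior this matches the full-data posterior N(ybar, sigma^2/N).
print(consensus.mean(), consensus.std())
print(data.mean(), sigma / np.sqrt(len(data)))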

Non-Parametric Bayesian Data Analysis

A use case in the airline industry

Airline Delays in International Air Cargo Logistics A good discussion of the topic is detailed in the paper below:

“Nonparametric Bayesian Analysis in International Air Cargo Logistics”

by Yan Shang of Fuqua School of Business, Duke University (URL: https://bayesian.org/abstracts/5687 )

Non-parametric analysis refers to comparative properties (statistics) of the data, or population, which do not rely on the typical parameters of mean, variance, standard deviation, etc.

Need / Motivation: Models are never correct for real world data.

Non-Parametric Modelling of Large Data Sets

What is a nonparametric model? A parametric model where the number of parameters increases with data; a really large parametric model; a model over infinite dimensional function or measure spaces; a family of distributions that is dense in some large space.

Why nonparametric models in a Bayesian theory of learning? A broad class of priors that allows the data to "speak for itself"; side-steps model selection and averaging.


Page 27: Big Data Analysis Concepts and References


Homoscedasticity vs. Heteroskedasticity

Homoscedasticity

In regression analysis, homoscedasticity means a situation in which the variance of the dependent variable is the same for all the data. Homoscedasticity facilitates analysis because most methods are based on the assumption of equal variance. A sequence or a vector of random variables is homoscedastic if all random variables in the sequence or vector have the same finite variance. This is also known as homogeneity of variance.

Heteroskedasticity

In regression analysis, heteroskedasticity means a situation in which the variance of the dependent variable varies across the data. Heteroskedasticity complicates analysis because many methods in regression analysis are based on an assumption of equal variance. A collection of random variables is heteroscedastic if there are sub-populations that have different variabilities from others. Here "variability" could be quantified by the variance or any other measure of statistical dispersion. Thus heteroscedasticity is the absence of homoscedasticity.
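A tiny simulation contrasting the two situations on synthetic data: in the first series the residual spread is constant across x, in the second it grows with x.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(1, 100, 500)

homo   = 2.0 * x + rng.normal(0, 10.0, size=x.size)       # constant error variance
hetero = 2.0 * x + rng.normal(0, 0.3 * x, size=x.size)     # error variance grows with x

for name, y in [("homoscedastic", homo), ("heteroskedastic", hetero)]:
    resid = y - 2.0 * x                                     # residuals around the true line
    low  = resid[x <= 50].std()
    high = resid[x > 50].std()
    print(f"{name:>16}: std(residuals | small x) = {low:5.1f}, std(residuals | large x) = {high:5.1f}")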

Page 28: Big Data Analysis Concepts and References


Benford’s Law


Benford's Law, also called the First-Digit Law, refers to the frequency distribution of digits in many (but not all) real-life sources of data. In this distribution, the number 1 occurs as the leading digit about 30% of the time, while larger numbers occur in that position less frequently: 9 as the first digit less than 5% of the time. Benford's Law also concerns the expected distribution for digits beyond the first, which approach a uniform distribution.

This result has been found to apply to a wide variety of data sets, including electricity bills, street addresses, stock prices, population numbers, death rates, lengths of rivers, physical and mathematical constants, and processes described by power laws (which are very common in nature). It tends to be most accurate when values are distributed across multiple orders of magnitude. Source: Wikipedia

Numerically, the leading digits have the following distribution in Benford's Law, where d is the leading digit and P(d) the probability: P(1) = 30.1%, P(2) = 17.6%, P(3) = 12.5%, P(4) = 9.7%, P(5) = 7.9%, P(6) = 6.7%, P(7) = 5.8%, P(8) = 5.1%, P(9) = 4.6%.

Benford's Law Big Data Application: Fraud Detection

Benford's Law for base 10 is usually shown as a graph of these nine probabilities (not reproduced here). There is a generalization of the law to numbers expressed in other bases (for example, base 16), and also a generalization from the leading digit to the leading n digits. A set of numbers is said to satisfy Benford's Law if the leading digit d (d ∈ {1, ..., 9}) occurs with probability P(d) = log10(1 + 1/d).

Benford’s Law holds true for a data set that grows exponentially (e.g., doubles, then doubles again in the same time span). It is best applied to data sets that go across multiple orders of magnitude . The theory does not hold true for data sets in which digits are predisposed to begin with a limited set of digits. The theory also does not hold true when a data set covers only one or two orders of magnitude.

Facts:
• Helps identify duplicates and other data pattern anomalies in large data sets.
• Enables auditors and data analysts to focus on possible anomalies in very large data sets. It does not "directly" prove that error or fraud exists, but identifies items that deserve further study on statistical grounds.
• Mainly used for setting future auditing plans, and is a low-cost entry into continuous analysis of very large data sets.
• Not good for sampling - results in very large selection sizes.
• As technology matures, the amount of fraud found will increase (not decrease).
• Not all data sets are suitable for analysis.
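A short sketch of the kind of first-digit screen an auditor might run: compare observed leading-digit frequencies against the Benford probabilities P(d) = log10(1 + 1/d). The "transaction amounts" here are synthetic, and a real test would also apply a goodness-of-fit statistic such as chi-square.

import math
import random
from collections import Counter

benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Synthetic "transaction amounts" spread across several orders of magnitude.
random.seed(1)
amounts = [10 ** random.uniform(0, 5) for _ in range(50_000)]

first_digits = [int(str(a).lstrip("0.")[0]) for a in amounts]
observed = Counter(first_digits)

for d in range(1, 10):
    obs = observed[d] / len(amounts)
    print(f"digit {d}: observed {obs:.3f}  expected {benford[d]:.3f}")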

A use case in Airline industry

A financial/accounting auditor can evaluate very large data sets (in a continuous monitoring or continuous audit environment) that represent a continuous stream of transactions, such as the sales made by a (third-party) online retailer or the internal airline reservation system.

Fraud Detection in Airline Ticket Purchases

Christopher J. Rosetti, CPA, CFE, DABFA of KPMG, in his presentation titled "SAS 99: Detecting Fraud Using Benford's Law" (presented to the FAE/NYSSCPA Technology Assurance Committee on March 13, 2003), claims that United Airlines uses Benford's Law for fraud detection!

(URL: http://www.nysscpa.org/committees/emergingtech/law.ppt )

Page 29: Big Data Analysis Concepts and References

Big Data Analysis 30 Author: Vikram Andem ISRM & IT GRC Conference

Multiple Hypothesis Testing

Multiple Testing Problem

Multiple testing problem occurs when one considers a set of statistical inferences simultaneously, or infers a subset of parameters selected based on the observed values. Errors in inference, including confidence intervals that fail to include their corresponding population parameters or hypothesis tests that incorrectly reject the null hypothesis, are more likely to occur when one considers the set as a whole. Source: Wikipedia
For example, one might declare that a coin was biased if in 10 flips it landed heads at least 9 times. Indeed, if one assumes as a null hypothesis that the coin is fair, then the probability that a fair coin would come up heads at least 9 out of 10 times is (10 + 1) × (1/2)^10 ≈ 0.0107. This is relatively unlikely, and under statistical criteria such as p-value < 0.05, one would declare that the null hypothesis should be rejected; i.e., the coin is unfair.

A multiple-comparisons problem arises if one wanted to use this test (which is appropriate for testing the fairness of a single coin) to test the fairness of many coins. Imagine testing 100 fair coins by this method. Given that the probability of a fair coin coming up 9 or 10 heads in 10 flips is 0.0107, one would expect that in flipping 100 fair coins ten times each, seeing a particular (i.e., pre-selected) coin come up heads 9 or 10 times would still be very unlikely, but seeing any coin behave that way, without concern for which one, would be more likely than not. Precisely, the likelihood that all 100 fair coins are identified as fair by this criterion is (1 − 0.0107)^100 ≈ 0.34. Therefore the application of our single-test coin-fairness criterion to multiple comparisons would be more likely than not to falsely identify at least one fair coin as unfair.
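A minimal illustrative sketch of the coin example above, together with a simple Bonferroni correction (one standard, widely used remedy for the multiple-testing problem, mentioned here as an assumption rather than something from the slides):

```python
from math import comb

n_flips, n_coins, alpha = 10, 100, 0.05

# P(a single fair coin shows >= 9 heads in 10 flips)
p_single = sum(comb(n_flips, k) for k in (9, 10)) / 2 ** n_flips
print(f"single-coin p-value:            {p_single:.4f}")        # ~0.0107

# P(at least one of 100 fair coins looks 'unfair' by this criterion)
p_any = 1 - (1 - p_single) ** n_coins
print(f"P(at least one false positive): {p_any:.2f}")           # ~0.66

# Bonferroni: require p < alpha / n_coins per test to control the family-wise error rate
print(f"Bonferroni per-test threshold:  {alpha / n_coins:.4g}")
print("reject a single coin at that threshold?", p_single < alpha / n_coins)
```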

Multiple Hypothesis Testing

A use case in Airline industry

Predicting Flight Delays using Multiple Hypothesis Testing
A good discussion of the topic is detailed in the paper listed below:

Predicting Flight Delays by Dieterich Lawson and William Castillo of Stanford University

(URL: http://cs229.stanford.edu/proj2012/CastilloLawson-PredictingFlightDelays.pdf )

Also detailed in the book “Big Data for Chimps: A Seriously Fun Guide to Terabyte-Scale Data Processing“ by the same author (Dieterich Lawson) and Philip Kromer. Sample source code for modelling in Matlab is also provided by Dieterich Lawson and can be found at

URL: https://github.com/infochimps-labs/big_data_for_chimps

Page 30: Big Data Analysis Concepts and References

Big Data Analysis 31 Author: Vikram Andem ISRM & IT GRC Conference

The German Tank Problem

The German Tank Problem

The problem of estimating the maximum of a discrete uniform distribution from sampling without replacement is known in English as the German tank problem, due to its application in World War II to the estimation of the number of German tanks. The analyses illustrate the difference between frequentist inference and Bayesian inference. Estimating the population maximum based on a single sample yields divergent results, while the estimation based on multiple samples is an instructive practical estimation question whose answer is simple but not obvious. Source: Wikipedia

During World War II, production of German tanks such as the Panther (photo below) was accurately estimated by Allied intelligence using statistical methods.

Example: Suppose an intelligence officer has spotted k = 4 tanks with serial numbers 2, 6, 7, and 14, with maximum observed serial number m = 14. The unknown total number of tanks is called N. The frequentist approach outlined above estimates the total number of tanks as N ≈ m + m/k − 1 = 14 + 14/4 − 1 = 16.5. The Bayesian analysis instead yields (primarily) a probability mass function for the number of tanks, Pr(N = n | m, k) = (k − 1)·C(m − 1, k − 1) / (k·C(n, k)) for n ≥ m, from which the number of tanks can be estimated, for example via the posterior mean (m − 1)(k − 1)/(k − 2) = 19.5. This distribution has positive skewness, related to the fact that there are at least 14 tanks.
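A minimal sketch of the example above, under the stated assumptions (k serials drawn without replacement from 1..N; a flat prior, truncated at a large N_max purely for normalisation, for the Bayesian version):

```python
from math import comb

k, m = 4, 14

# Frequentist estimate: N_hat = m + m/k - 1
print(f"frequentist estimate: {m + m / k - 1}")     # 16.5

# Bayesian posterior: likelihood of max m from k draws is C(m-1, k-1) / C(n, k)
n_max = 2000
ns = range(m, n_max + 1)
weights = [comb(m - 1, k - 1) / comb(n, k) for n in ns]
total = sum(weights)
posterior = {n: w / total for n, w in zip(ns, weights)}

post_mean = sum(n * p for n, p in posterior.items())
print(f"posterior mean: {post_mean:.1f}")           # ~19.5 (positively skewed)
print(f"P(N = 14):      {posterior[14]:.3f}")
```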

During the course of the war the Western Allies made sustained efforts to determine the extent of German production, and approached this in two major ways: conventional intelligence gathering and statistical estimation. To do this they used the serial numbers on captured or destroyed tanks. The principal numbers used were gearbox numbers, as these fell in two unbroken sequences. Chassis and engine numbers were also used, though their use was more complicated. Various other components were used to cross-check the analysis. Similar analyses were done on tires, which were observed to be sequentially numbered (i.e., 1, 2, 3, ..., N). The analysis of tank wheels yielded an estimate for the number of wheel molds that were in use.

Analysis of wheels from two tanks (48 wheels each, 96 wheels total) yielded an estimate of 270 produced in February 1944, substantially more than had previously been suspected. German records after the war showed production for the month of February 1944 was 276. The statistical approach proved to be far more accurate than conventional intelligence methods, and the phrase German tank problem became accepted as a descriptor for this type of statistical analysis.

Application in Big Data Analysis
Similar to the German Tank Problem, we can estimate/analyze (large or small) data sets that we don't have (or assumed we don't have). There is "leaky" data all around us; all we have to do is think outside the box. Companies very often don't think about the data they publish publicly, and we can either extrapolate from that data (as in the German Tank Problem) or simply extract useful information from it.

A company’s competitors' websites (publicly available data) can be a valuable hunting ground. Think about whether you can use it to estimate some missing data (as with the serial numbers) and/or combine that data with other, seemingly innocuous, sets to produce some vital information. If that information gives your company a commercial advantage and is legal, then you should use it as part of your analysis.

Source: Wikipedia

Page 31: Big Data Analysis Concepts and References

Big Data Analysis 32 Author: Vikram Andem ISRM & IT GRC Conference

Nyquist–Shannon Sampling Theorem

Nyquist–Shannon Sampling Theorem

The Nyquist Theorem, also known as the sampling theorem, is a principle that engineers follow in the digitization of analog signals. For analog-to-digital conversion (ADC) to result in a faithful reproduction of the signal, slices, called samples, of the analog waveform must be taken frequently. The number of samples per second is called the sampling rate or sampling frequency.

Any analog signal consists of components at various frequencies. The simplest case is the sine wave, in which all the signal energy is concentrated at one frequency. In practice, analog signals usually have complex waveforms, with components at many frequencies. The highest frequency component in an analog signal determines the bandwidth of that signal. The higher the frequency, the greater the bandwidth, if all other factors are held constant. Suppose the highest frequency component, in hertz, for a given analog signal is fmax. According to the Nyquist Theorem, the sampling rate must be at least 2fmax, or twice the highest analog frequency component. The sampling in an analog-to-digital converter is actuated by a pulse generator (clock). If the sampling rate is less than 2fmax, some of the highest frequency components in the analog input signal will not be correctly represented in the digitized output. When such a digital signal is converted back to analog form by a digital-to-analog converter, false frequency components appear that were not in the original analog signal. This undesirable condition is a form of distortion called aliasing.

Application in Big Data Analysis

Even though the “Nyquist–Shannon Sampling Theorem” is about the minimum sampling rate of a continuous wave, in Big Data Analysis practice it tells you how frequently you need to collect Big Data from sensors such as smart meters.

The frequency of data collection for Big Data is the “Velocity”, one of the three “V”s that define Big Data: Volume, Velocity and Variety.
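For illustration, a minimal sketch (purely hypothetical numbers) of why sampling below the Nyquist rate loses information: a 7 Hz cosine sampled at 10 Hz (which is below 2 × 7 Hz) produces exactly the same samples as a 3 Hz cosine, so the two signals cannot be told apart from the collected data (aliasing).

```python
import numpy as np

fs = 10.0                      # sampling rate in Hz
n = np.arange(50)              # sample indices
t = n / fs                     # sample times

x_high = np.cos(2 * np.pi * 7 * t)   # 7 Hz component, above fs/2
x_alias = np.cos(2 * np.pi * 3 * t)  # 3 Hz alias (fs - 7 = 3)

print("identical samples:", np.allclose(x_high, x_alias))  # True
# Sampling at fs >= 2 * fmax (here >= 14 Hz) would keep the two signals apart.
```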

Left figure: X(f) (top blue) and XA(f) (bottom blue) are continuous Fourier transforms of two different functions, x(t) and xA(t) (not shown). When the functions are sampled at rate fs, the images (green) are added to the original transforms (blue) when one examines the discrete-time Fourier transforms (DTFT) of the sequences. In this hypothetical example, the DTFTs are identical, which means the sampled sequences are identical, even though the original continuous pre-sampled functions are not. If these were audio signals, x(t) and xA(t) might not sound the same. But their samples (taken at rate fs) are identical and would lead to identical reproduced sounds; thus xA(t) is an alias of x(t) at this sample rate. In this example (of a bandlimited function), such aliasing can be prevented by increasing fs such that the green images in the top figure do not overlap the blue portion. Right figure: Spectrum, Xs(f), of a properly sampled bandlimited signal (blue) and the adjacent DTFT images (green) that do not overlap. A brick-wall low-pass filter, H(f), removes the images, leaves the original spectrum, X(f), and recovers the original signal from its samples. Source: Wikipedia


Page 32: Big Data Analysis Concepts and References

Big Data Analysis 33 Author: Vikram Andem ISRM & IT GRC Conference

Simpson’s Paradox

Simpson’s Paradox

Simpson's paradox is a paradox in which a trend that appears in different groups of data disappears when these groups are combined, and the reverse trend appears for the aggregate data. This result is particularly confounding when frequency data are unduly given causal interpretations. Simpson's Paradox disappears when causal relations are brought into consideration.

Example:

It's a well-accepted rule of thumb that the larger the data set, the more reliable the conclusions drawn. Simpson's paradox, however, slams a hammer down on the rule, and the result is a good deal worse than a sore thumb. Simpson's paradox demonstrates that a great deal of care has to be taken when combining small data sets into a large one. Sometimes conclusions from the large data set may be the exact opposite of conclusions from the smaller sets. Unfortunately, the conclusions from the large set can (also) be wrong.

The lurking variables (or confounding variable) in Simpson’s paradox are categorical. That is, they break the observation into groups, such as the city of origin for the airline flights. Simpson’s paradox is an extreme form of the fact that the observed associations can be misleading when there are lurking variables.

Status     Airline A   Airline B
On Time        718        5534
Delayed         74         532
Total          792        6066

From the first table: Airline A is delayed 9.3% (74/792) of the time; Airline B is delayed only 8.8% (532/6066) of the time. So Airline A would NOT be preferable.

                    Chicago                       Houston
Airline    On Time  Delayed  Total     On Time  Delayed  Total
A              497       62    559         221       12    233
B              694      117    811        4840      415   5255

From the above table:

From Chicago, Airline A is delayed 11.1% (62/559) of the time, but Airline B is delayed 14.4% (117/811) of the time. From Houston, Airline A is delayed 5.2% (12/233) of the time, but Airline B is delayed 7.9% (415/5255). Consequently, Airline A would be preferable. This conclusion contradicts the previous conclusion.
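A minimal sketch recomputing the delay rates from the two tables above, showing how the aggregate comparison reverses the per-city comparison:

```python
counts = {  # (delayed, total) flights, taken from the tables above
    "Chicago": {"A": (62, 559),  "B": (117, 811)},
    "Houston": {"A": (12, 233),  "B": (415, 5255)},
}

for city, airlines in counts.items():
    for airline, (d, t) in airlines.items():
        print(f"{city:8s} airline {airline}: {d / t:5.1%} delayed")

# Aggregate over both cities (the combined table)
for airline in ("A", "B"):
    d = sum(counts[c][airline][0] for c in counts)
    t = sum(counts[c][airline][1] for c in counts)
    print(f"Combined airline {airline}: {d / t:5.1%} delayed")

# Per city, A is always better; combined, A looks worse, because most of
# B's flights come from Houston, the less-delayed airport (a lurking variable).
```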

Simpson's Paradox is where Big Data sets CAN go wrong

A use case in Airline industry

Airline On-Time Performance at Hub-and-Spoke Flight Networks

A good discussion of the topic is detailed in the paper listed below:

Simpson’s Paradox, Aggregation, and Airline On-time Performance

by Bruce Brown of Cal State Polytechnic University (URL: http://www.csupomona.edu/~bbrown/Brown_SimpPar_WEAI06.pdf)

Big Data doesn’t happen overnight and there’s no magic to it.

Just deploying Big Data tools and analytical solutions (R, SAS, and Tableau etc.) doesn’t guarantee anything, as Simpson’s Paradox proves.

Page 33: Big Data Analysis Concepts and References

Big Data Analysis 34 Author: Vikram Andem ISRM & IT GRC Conference

Machine Learning

Machine Learning and Data Mining

Machine learning concerns the construction and study of systems that can learn from data. For example, a machine learning system could be trained on email messages to learn to distinguish between spam and non-spam messages. After learning, it can then be used to classify new email messages into spam and non-spam folders. The core of machine learning deals with representation and generalization. Representation of data instances and functions evaluated on these instances are part of all machine learning systems. Generalization is the property that the system will perform well on unseen data instances. Source: Wikipedia

These two terms are commonly confused, as they often employ the same methods and overlap significantly. They can be roughly defined as follows:

Machine learning focuses on prediction, based on known properties learned from the training data.

Data mining focuses on the discovery of (previously) unknown properties in the data. This is the analysis step of Knowledge Discovery in Databases.

Terminology

• Classification: the learned attribute is categorical (“nominal”).
• Regression: the learned attribute is numeric.
• Supervised Learning (“Training”): we are given examples of inputs and associated outputs, and we learn the relationship between them.
• Unsupervised Learning (sometimes “Mining”): we are given inputs but no outputs (unlabeled data), and we learn the “latent” labels (example: clustering).

Example: Document Classification

• Highly accurate predictions on real-time and continuous data (based on rule sets from earlier training / learning and historical data).
• Goal is not to uncover underlying “truth”.
• Emphasis on methods that can handle very large datasets for better predictions.

A use case in Airline

industry

Southwest Airlines' use of Machine Learning for Airline Safety

The below URL details an article (published September 2013) on how Southwest Airlines uses Machine Learning algorithms for Big Data purposes to analyze vast amounts of very large data sets (which are publicly accessible from NASA’s DASHlink site) to find anomalies and potential safety issues and to identify patterns to improve airline safety.

URL: http://www.bigdata-startups.com/BigData-startup/southwest-airlines-uses-big-data-deliver-excellent-customer-service/

Primary Goal of Machine Learning

Why Machine Learning?

• Increase barrier to entry when product / service quality is dependent on data.
• Customize product / service to increase engagement and profits. Example: customize a sales page to increase conversion rates for online products.

Use Case 1 vs. Use Case 2

Page 34: Big Data Analysis Concepts and References

Big Data Analysis 35 Author: Vikram Andem ISRM & IT GRC Conference

Classification Rules and Rule Sets

Rule Set to Classify Data

Golf Example: To Play or Not to Play

A use case in Airline industry

Optimal Airline Ticket Purchasing (automated feature selection)
A good discussion of the topic is detailed in the paper listed below:

Optimal Airline Ticket Purchasing Using Automated User-Guided Feature Selection

by William Groves and Maria Gini of University of Minnesota (URL: http://ijcai.org/papers13/Papers/IJCAI13-032.pdf )

Classification Problems

Examples of Classification Problems:
• Text categorization (e.g., spam filtering)
• Market segmentation (e.g., predict if a customer will respond to a promotion)
• Natural-language processing (e.g., spoken language understanding)

Page 35: Big Data Analysis Concepts and References

Big Data Analysis 36 Author: Vikram Andem ISRM & IT GRC Conference

Decision Tree Learning

Example: Good vs. Evil

Decision tree learning uses a decision tree as a predictive model which maps observations about an item to conclusions about the item's target value. More descriptive names for such tree models are classification trees or regression trees. In these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. In data mining, a decision tree describes data but not decisions; rather the resulting classification tree can be an input for decision making. Source: Wikipedia

Page 36: Big Data Analysis Concepts and References

Big Data Analysis 37 Author: Vikram Andem ISRM & IT GRC Conference

Tree Size vs. Accuracy

Accuracy, Confusion Matrix, Overfitting, Good/Bad Classifiers, and Controlling Tree Size

Building an Accurate Classifier

Good and Bad Classifiers

A use case in Airline industry

Predicting Airline Customers Future Values

A good discussion of the topic is detailed in the paper listed below:

Applying decision trees for value-based customer relations management: Predicting airline customers future values

by Giuliano Tirenni, Christian Kaiser and Andreas Herrmann of the Center for Business Metrics at University of St. Gallen, Switzerland.

(URL: http://ipgo.webs.upv.es/azahar/Pr%C3%A1cticas/articulo2.pdf )

Theory

Overfitting example

Accuracy and Confusion Matrix

Page 37: Big Data Analysis Concepts and References

Big Data Analysis 38 Author: Vikram Andem ISRM & IT GRC Conference

Entropy and Information Gain

Entropy

Question: How do you determine which attribute best classifies data or a data set? Answer: Entropy

Entropy is a measure of unpredictability of information content.

Example : A poll on some political issue. Usually, such polls happen because the outcome of the poll isn't already known. In other words, the outcome of the poll is relatively unpredictable, and actually performing the poll and learning the results gives some new information; these are just different ways of saying that the entropy of the poll results is large. Now, consider the case that the same poll is performed a second time shortly after the first poll. Since the result of the first poll is already known, the outcome of the second poll can be predicted well and the results should not contain much new information; in this case the entropy of the second poll results is small. Source: Wikipedia

Statistical quantity measuring how well an attribute classifies the data. Calculate the information gain for each attribute. Choose attribute with greatest information gain.

If there are n equally probable possible messages, then the probability p of each is 1/n, and the information conveyed by a message is -log2(p) = log2(n). For example, if there are 16 messages, then log2(16) = 4 and we need 4 bits to identify/send each message. In general, if we are given a probability distribution P = (p1, p2, .., pn), the information conveyed by the distribution (a.k.a. the Entropy of P) is: I(P) = -(p1*log2(p1) + p2*log2(p2) + .. + pn*log2(pn))
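A minimal sketch (the tiny flight-delay data set below is hypothetical) computing entropy and the information gain of an attribute, the quantity a decision-tree learner maximises when choosing the next attribute to split on:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, labels):
    base, n, remainder = entropy(labels), len(labels), 0.0
    for value in set(r[attribute] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attribute] == value]
        remainder += len(subset) / n * entropy(subset)
    return base - remainder

rows = [
    {"weather": "clear", "congested": "no"},
    {"weather": "clear", "congested": "yes"},
    {"weather": "storm", "congested": "no"},
    {"weather": "storm", "congested": "yes"},
]
labels = ["on-time", "on-time", "delayed", "delayed"]

print("entropy of labels:", entropy(labels))                              # 1.0 bit
print("gain(weather):   ", information_gain(rows, "weather", labels))     # 1.0
print("gain(congested): ", information_gain(rows, "congested", labels))   # 0.0
```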

Information Theory : Background

Information Gain

Largest Entropy: Boolean functions with the same number of ones and zeros have the largest entropy.

In machine learning, this concept can be used to define a preferred sequence of attributes to investigate to most rapidly narrow down the state of X. Such a sequence (which depends on the outcome of the investigation of previous attributes at each stage) is called a decision tree. Usually an attribute with high mutual information should be preferred to other attributes.

A use case in Airline industry

An Airline matching Airplanes to Routes (using Machine Learning)

(URL: http://machinelearning.wustl.edu/mlpapers/paper_files/jmlr10_helmbold09a.pdf )

A good discussion of the topic is detailed in the paper listed below:

Learning Permutations with Exponential Weights

by David P. Helmbold and Manfred K.Warmuth of University of California, Santa Cruz

Page 38: Big Data Analysis Concepts and References

Big Data Analysis 39 Author: Vikram Andem ISRM & IT GRC Conference

The Bootstrap

The Bootstrap

A good discussion of the topic is detailed in the article listed below:

“The Big Data Bootstrap” by Ariel Kleiner, Ameet Talwalkar, Purnamrita Sarkar

and Michael I. Jordan of University of California, Berkeley

(URL: http://www.cs.berkeley.edu/~jordan/papers/blb_icml2012.pdf )

Bootstrapping is a method for assigning measures of accuracy (defined in terms of bias, variance, confidence intervals, prediction error or some other such measure) to sample estimates. This technique allows estimation of the sampling distribution of almost any statistic using only very simple methods. Generally, it falls in the broader class of resampling methods. The basic idea of bootstrapping is that inference about a population from sample data (sample → population) can be modeled by resampling the sample data and performing inference on (resample → sample). As the population is unknown, the true error in a sample statistic against its population value is unknowable. In bootstrap-resamples, the 'population' is in fact the sample, and this is known; hence the quality of inference from resample data → 'true' sample is measurable. Source: Wikipedia

Concept
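As a rough illustration of the basic bootstrap described above, a minimal sketch (the "fare" sample is synthetic and hypothetical) that resamples the data with replacement to estimate a standard error and a percentile confidence interval for the mean:

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.gamma(shape=2.0, scale=150.0, size=500)   # hypothetical fare amounts

n_boot = 10_000
boot_means = np.array([
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(n_boot)
])

print(f"sample mean        : {sample.mean():.2f}")
print(f"bootstrap std error: {boot_means.std(ddof=1):.2f}")
print(f"95% CI (percentile): {np.percentile(boot_means, [2.5, 97.5])}")
```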

Big Data and the Bootstrap
Abstract from the paper listed above: The bootstrap provides a simple and powerful means of assessing the quality of estimators. However, in settings involving large datasets, the computation of bootstrap-based quantities can be prohibitively demanding. As an alternative, the authors present the Bag of Little Bootstraps (BLB), a new procedure which incorporates features of both the bootstrap and subsampling to obtain a robust, computationally efficient means of assessing estimator quality. BLB is well suited to modern parallel and distributed computing architectures and retains the generic applicability, statistical efficiency, and favorable theoretical properties of the bootstrap. The authors provide the results of an extensive empirical and theoretical investigation of BLB's behavior, including a study of its statistical correctness, its large-scale implementation and performance, selection of hyperparameters, and performance on real data.

The authors claim their procedure for quantifying estimator quality is “accurate”, “automatic” and “scalable”, and they have tested it on data sets of size exceeding 1 Terabyte.

A use case in Airline industry

Modeling Demand and Supply for Domestic and International Air Travel Economics for Cost Minimization and Profit Maximization

An in-depth and excellent scholarly treatment of the application of bootstrapping to modelling domestic and international air travel economics (demand / supply) for an airline company is detailed in the Ph.D. thesis listed below (slightly old – published April 1999, but still very relevant today):

Essays on Domestic and International Airline Economics with Some Bootstrap Applications

by Anthony Kenneth Postert of Rice University

(URL: http://scholarship.rice.edu/bitstream/handle/1911/19428/9928581.PDF?sequence=1 )

Bootstrap and Big Data

Page 39: Big Data Analysis Concepts and References

Big Data Analysis 40 Author: Vikram Andem ISRM & IT GRC Conference

Ensemble Learning, Bagging and Boosting

Ensemble Learning

The basis of this slide is from the original presentation titled

Bayesian Ensemble Learning for Big Data by Rob McCulloch

of University of Chicago, Booth School of Business Published, November 17, 2013

(URL: http://www.rob-mcculloch.org/some_papers_and_talks/talks/dsi-bart.pdf )

Ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms. Unlike a statistical ensemble in statistical mechanics, which is usually infinite, a machine learning ensemble refers only to a concrete finite set of alternative models, but typically allows for much more flexible structure to exist between those alternatives. Source : Wikipedia

Bagging

A use case in Airline industry

Air Traffic Capacity impact during Adverse Weather conditions

A good discussion of the topic is detailed in the paper listed below:
A Translation of Ensemble Weather Forecasts into Probabilistic Air Traffic Capacity Impact
by Matthias Steiner, Richard Bateman, Daniel Megenhardt, Yubao Liu, Mei Xu, and Matthew Pocernich of the National Center for Atmospheric Research, and by Jimmy Krozel of Metron Aviation
(URL: http://nldr.library.ucar.edu/repository/assets/osgc/OSGC-000-000-000-687.pdf )

Bootstrap aggregating, also called Bagging, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach. Source : Wikipedia

Boosting
Boosting is a machine learning meta-algorithm for reducing bias in supervised learning. Boosting is based on the question: Can a set of weak learners create a single strong learner? A weak learner is defined to be a classifier which is only slightly correlated with the true classification (it can label examples better than random guessing). In contrast, a strong learner is a classifier that is arbitrarily well-correlated with the true classification. Source : Wikipedia
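A minimal sketch (synthetic data; scikit-learn assumed available) comparing a single decision tree with bagged trees and with boosting, as defined above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "bagging":     BaggingClassifier(n_estimators=100, random_state=0),
    "boosting":    AdaBoostClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:12s} accuracy: {scores.mean():.3f}")
```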

Ensemble Learning and Big Data

Page 40: Big Data Analysis Concepts and References

Big Data Analysis 41 Author: Vikram Andem ISRM & IT GRC Conference

Random Forests

Random Forests

Random forests are an ensemble learning method for classification (and regression) that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by individual trees. Source: Wikipedia
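For illustration, a minimal sketch (synthetic data; scikit-learn assumed available) of a random forest: many decision trees trained on bootstrap samples and random feature subsets, with predictions combined across the trees.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=30, n_informative=10,
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

forest = RandomForestClassifier(n_estimators=200, random_state=1)
forest.fit(X_train, y_train)

print("test accuracy:", forest.score(X_test, y_test))
print("top feature importances:",
      sorted(forest.feature_importances_, reverse=True)[:3])
```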

A use case in Airline industry

Network based model for Predicting Air Traffic Delays

The authors of the below paper propose a model using Random Forest (RF) algorithms, considering both temporal and spatial (that is, network) delay states as explanatory variables. In addition to local delay variables that describe the arrival or departure delay states of the most influential airports and origin-destination (OD) pairs in the network, the authors propose new network delay variables that depict the global delay state of the entire NAS at the time of prediction.

A Network-Based Model for Predicting Air Traffic Delays by Juan Jose Rebollo and Hamsa Balakrishnan

of Massachusetts Institute of Technology (URL: http://www.mit.edu/~hamsa/pubs/RebolloBalakrishnanICRAT2012.pdf)

Random Forests in Big Data

Cloudera: In the below URL link, Cloudera (a major Big Data vendor) shows how to implement a Poisson approximation to train a random forest on an enormous data set (with R, an open-source statistical package, on the Hadoop File System). The link also provides Map and Reduce source code.

URL: https://blog.cloudera.com/blog/2013/02/how-to-resample-from-a-large-data-set-in-parallel-with-r-on-hadoop/

Page 41: Big Data Analysis Concepts and References

Big Data Analysis 42 Author: Vikram Andem ISRM & IT GRC Conference

k-nearest Neighbours

k-nearest Neighbours

k-nearest Neighbors algorithm (or k-NN for short) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression:

In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.

In k-NN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbors.

k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-NN algorithm is among the simplest of all machine learning algorithms. Both for classification and regression, it can be useful to weight contributions of neighbors, so that the nearer neighbors contribute more to the average than the more distant ones.

For example, a common weighting scheme consists in giving each neighbor a weight of 1/d, where d is the distance to the neighbor. The neighbors are taken from a set of objects for which the class (for k-NN classification) or the object property value (for k-NN regression) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required. A shortcoming of the k-NN algorithm is that it is sensitive to the local structure of the data.
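A minimal sketch (illustrative, two synthetic clusters) of k-NN classification with the 1/d weighting scheme described above, so that nearer neighbours get a larger vote:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=5):
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + 1e-12)        # avoid division by zero
    votes = {}
    for label, w in zip(y_train[nearest], weights):
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)

rng = np.random.default_rng(3)
X_train = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y_train = np.array([0] * 50 + [1] * 50)

print(knn_predict(X_train, y_train, np.array([0.2, -0.1])))  # expected 0
print(knn_predict(X_train, y_train, np.array([2.8, 3.1])))   # expected 1
```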

Example of k-NN classification. The test sample (green circle) should be classified either to the first class of blue squares or to the second class of red triangles. If k = 3 (solid line circle) it is assigned to the second class because there are 2 triangles and only 1 square inside the inner circle. If k = 5 (dashed line circle) it is assigned to the first class (3 squares vs. 2 triangles inside the outer circle).

Data Reduction
Data reduction is one of the most important problems for work with huge data sets. Usually, only some of the data points are needed for accurate classification. Those data are called the prototypes and can be found as follows:
1. Select the class-outliers, that is, training data that are classified incorrectly by k-NN (for a given k).
2. Separate the rest of the data into two sets: (i) the prototypes that are used for the classification decisions and (ii) the absorbed points, which can be correctly classified by k-NN using the prototypes and can therefore be removed from the training set.

K-Nearest Neighbours and Big Data

A good discussion of how to execute kNN joins in a MapReduce cluster, with MapReduce algorithms for performing efficient parallel kNN joins on large data, is presented in the paper below. The authors demonstrate their ideas using Hadoop, with extensive experiments on large real and synthetic datasets containing tens or hundreds of millions of records in up to 30 dimensions, showing efficiency, effectiveness, and scalability.

Efficient Parallel kNN Joins for Large Data in MapReduce by Chi Zhang of Florida State University and Jeffrey Jestes of University of Utah

(URL: http://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf )

Source: Wikipedia

Background: Nearest Neighbor Graph Nearest neighbor graph (NNG) for a set of n objects P in a metric space (e.g., for a set of points in the plane with Euclidean distance) is a directed graph with P being its vertex set and with a directed edge from p to q whenever q is a nearest neighbor of p.

Example: The right side image shows a nearest neighbor graph of 100 points in the Euclidean plane.

k-nearest neighbor graph (k-NNG) is a graph in which two vertices p and q are connected by an edge, if the distance between p and q is among the k-th smallest distances from p to other objects from P

Source: Wikipedia


Page 42: Big Data Analysis Concepts and References

Big Data Analysis 43 Author: Vikram Andem ISRM & IT GRC Conference

k-nearest Neighbours (continued)

A use case in Airline industry

Seating Arrangement and Inflight Purchase / Buying Behavior of Airline Customers

The below paper investigates and characterizes how social influence affects the buying behavior of airline passengers who can purchase items through an individual entertainment system located in front of them. The author used the seating configuration in the airplane as a basis for the analysis. The author used large data sets, covering the purchase behavior of about 257,000 passengers in nearly 2,000 flights, in which the passengers performed 65,525 transactions, an average of 33.3 transactions per flight. The author claims to find strong evidence of social effects and states that the average number of transactions per passenger increases 30% upon observation of a neighbor's purchase. Analyzing within- and cross-category effects, the author found that passengers are likely to buy from the same category purchased by their neighbors. For example, a purchase of an alcoholic beverage increases the probability of a same-category purchase by a neighbor by 78%. The author claims peer effects also take place at a deeper level than product category: passengers 'imitate' their peers' decisions on the type of food, alcohol and even movie genre. The paper also investigates the determinants of social influence: the author claims no support is found for informational learning as a significant mechanism in driving social influence. The main determinant of social influence is found to be the number of neighbors observed purchasing an item. The results are consistent with informational learning where consumers learn only from others' actions, but not from their inaction.

Peer Effects in Buying Behavior: Evidence from In-Flight Purchases

By Pedro M. Gardete, Assistant Professor of Marketing at Stanford University (published September 2013)

(URL: http://faculty-gsb.stanford.edu/gardete/documents/SocialEffects_8_2013.pdf )

Page 43: Big Data Analysis Concepts and References

Big Data Analysis 44 Author: Vikram Andem ISRM & IT GRC Conference

Stochastic Gradient Descent

Gradient Descent

Gradient descent is a first-order optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point. If instead one takes steps proportional to the positive of the gradient, one approaches a local maximum of that function; the procedure is then known as gradient ascent. Gradient descent is also known as steepest descent, or the method of steepest descent. When known as the latter, gradient descent should not be confused with the method of steepest descent for approximating integrals.

Source: Wikipedia

Stochastic Gradient Descent

Stochastic gradient descent is a gradient descent optimization method for minimizing an objective function that is written as a sum of differentiable functions.

Source: Wikipedia
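A minimal sketch (synthetic data, hypothetical weights) of stochastic gradient descent for least squares: the objective sum_i (y_i − w·x_i)^2 is a sum of differentiable functions, so each update uses the gradient of a single randomly chosen term.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 10_000, 3
X = rng.normal(size=(n, d))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=n)

w = np.zeros(d)
learning_rate = 0.01
for epoch in range(5):
    for i in rng.permutation(n):                 # one randomly ordered example at a time
        grad_i = -2 * (y[i] - X[i] @ w) * X[i]   # gradient of a single summand
        w -= learning_rate * grad_i

print("estimated weights:", np.round(w, 3))      # close to [2.0, -1.0, 0.5]
```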


Page 44: Big Data Analysis Concepts and References

Big Data Analysis 45 Author: Vikram Andem ISRM & IT GRC Conference

Stochastic Gradient Descent (continued)

Gradient Descent Example

Page 45: Big Data Analysis Concepts and References

Big Data Analysis 46 Author: Vikram Andem ISRM & IT GRC Conference

Stochastic Gradient Descent (continued)

Stochastic Gradient Descent: Example

Source: Wikipedia

Stochastic Gradient Descent vs. Minibatch Gradient Descent

Stochastic Gradient Descent and Big Data

The below URL link provides a presentation on “Stochastic Optimization for Big Data Analytics”.

Stochastic Optimization for Big Data Analytics by Tianbao Yang and Shenghuo Zhu of NEC Laboratories America

and Rong Jin of Michigan State University (URL: http://www.cse.msu.edu/~yangtia1/sdm14-tutorial.pdf)

The authors of the below paper present stochastic gradient descent techniques for online learning and ensemble methods to scale out to large amounts of data at Twitter, with details on how to integrate machine learning tools into the Hadoop platform (using Pig, a programming tool).

Large-Scale Machine Learning at Twitter by Jimmy Lin and Alek Kolcz of Twitter, Inc.

(URL: http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf )

Page 46: Big Data Analysis Concepts and References

Big Data Analysis 47 Author: Vikram Andem ISRM & IT GRC Conference

Stochastic Games and Markov Perfect Equilibrium

Importance of Markov Perfect Equilibrium in Airline industry Pricing
As an example of the use of this equilibrium concept we consider the competition between firms which have invested heavily into fixed costs and are dominant producers in an industry, forming an oligopoly. The players are taken to be committed to levels of production capacity in the short run, and the strategies describe their decisions in setting prices. Firms' objectives are modeled as maximizing the present discounted value of profits.

Airfare Game / Airline Pricing Game: Often an airplane ticket for a certain route has the same price on either airline A or airline B. Presumably, the two airlines do not have exactly the same costs, nor do they face the same demand function given their varying frequent-flyer programs, the different connections their passengers will make, and so forth. Thus, a realistic general equilibrium model would be unlikely to result in nearly identical prices. Both airlines have made sunk investments into the equipment, personnel, and legal framework. In the near term we may think of them as committed to offering service. We therefore see that they are engaged, or trapped, in a strategic game with one another when setting prices.

Equilibrium: Consider the following strategy of an airline for setting the ticket price for a certain route. At every price-setting opportunity:

• If the other airline is charging $300 or more, or is not selling tickets on that flight, charge $300.
• If the other airline is charging between $200 and $300, charge the same price.
• If the other airline is charging $200 or less, choose randomly between the following three options with equal probability: matching that price, charging $300, or exiting the game by ceasing indefinitely to offer service on this route.

This is a Markov strategy because it does not depend on a history of past observations. It satisfies also the Markov reaction function definition because it does not depend on other information which is irrelevant to revenues and profits. Assume now that both airlines follow this strategy exactly. Assume further that passengers always choose the cheapest flight and so if the airlines charge different prices, the one charging the higher price gets zero passengers. Then if each airline assumes that the other airline will follow this strategy, there is no higher-payoff alternative strategy for itself, i.e. it is playing a best response to the other airline strategy. If both airlines followed this strategy, it would form a Nash equilibrium in every proper subgame, thus a subgame-perfect Nash equilibrium. Source: Wikipedia
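A minimal, purely illustrative sketch encoding the price-setting rules above as a Markov reaction function: the next price depends only on the currently observed state (the rival's price), not on the history of play.

```python
import random

EXIT = None  # sentinel for ceasing service on the route

def reaction(rival_price):
    """Return this airline's next price given only the rival's current price."""
    if rival_price is EXIT or rival_price >= 300:
        return 300
    if rival_price > 200:                      # strictly between $200 and $300
        return rival_price                     # match the rival
    # $200 or less: match, jump to $300, or exit, with equal probability
    return random.choice([rival_price, 300, EXIT])

# Example: both airlines play the same reaction function for a few rounds.
price_a, price_b = 300, 300
for _ in range(5):
    price_a, price_b = reaction(price_b), reaction(price_a)
    print(price_a, price_b)
```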

Stochastic (or Markov) Games Stochastic (or Markov) game, is a dynamic game with probabilistic transitions played by one or more players. The game is played in a sequence of stages. At the beginning of each stage the game is in some state. The players select actions and each player receives a payoff that depends on the current state and the chosen actions. The game then moves to a new random state whose distribution depends on previous state and actions chosen by players. The procedure is repeated at the new state and play continues for a finite or infinite number of stages. Total payoff to a player is often taken to be discounted sum of stage payoffs or the limit inferior of averages of stage payoffs. Source: Wikipedia

Markov Perfect Equilibrium
A Markov perfect equilibrium is an equilibrium concept in game theory. It is the refinement of the concept of subgame perfect equilibrium to extensive form games for which a payoff-relevant state space can be readily identified. In extensive form games, and specifically in stochastic games, a Markov perfect equilibrium is a set of mixed strategies for each of the players which satisfy the following criteria:
• The strategies have the Markov property of memorylessness, meaning that each player's mixed strategy can be conditioned only on the state of the game. These strategies are called Markov reaction functions.
• The state can only encode payoff-relevant information. This rules out strategies that depend on non-substantive moves by the opponent. It excludes strategies that depend on signals, negotiation, or cooperation between players (e.g. cheap talk or contracts).
• The strategies form a subgame perfect equilibrium of the game. Source: Wikipedia

Subgame Perfect Equilibrium

Subgame Perfect Equilibrium is a refinement of a Nash equilibrium used in dynamic games. A strategy profile is a subgame perfect equilibrium if it represents a Nash equilibrium of every subgame of the original game. Informally, this means that if (1) the players played any smaller game that consisted of only one part of the larger game and (2) their behavior represents a Nash equilibrium of that smaller game, then their behavior is a subgame perfect equilibrium of the larger game. Source: Wikipedia

Dynamic Airfare Pricing and Competition

The below paper details Airline industry price competition for an oligopoly in a dynamic setting, where each of the sellers has a fixed number of units available for sale over a fixed number of periods. Demand is stochastic, and depending on how it evolves, sellers may change their prices at any time.

Dynamic Price Competition with Fixed Capacities

by Kalyan Talluri & Victor Martinez de Albeniz

A use case in Airline industry

Most of the work in this paper was done prior to the writing of the paper, as part of both authors' Ph.D. dissertations at Massachusetts Institute of Technology.
(published February 2010) (URL: www.econ.upf.edu/docs/papers/downloads/1205.pdf)

Page 47: Big Data Analysis Concepts and References

Big Data Analysis 48 Author: Vikram Andem ISRM & IT GRC Conference

Stochastic Games and Markov Perfect Equilibrium (continued)

A use case in Airline industry

Dynamic Revenue Management in Airline Alliances / Code Sharing

The below paper presents an excellent formalization of a Markov-game model of a two-partner airline alliance that can be used to analyze the effects of transfer-pricing mechanisms on each partner's behavior. The authors show that no Markovian transfer pricing mechanism can coordinate an arbitrary alliance. Next, the authors derive the equilibrium acceptance policies under each scheme and use analytical techniques, as well as numerical analyses of sample alliances, to generate fundamental insights about partner behavior under each scheme. The analysis and numerical examples also illustrate how certain transfer price schemes are likely to perform in networks with particular characteristics.

Dynamic Revenue Management in Airline Alliances

by Robert Shumsky of Dartmouth College and Christopher Wright, Harry Groenevelt of University of Rochester (published February 2009)

(URL: http://www.researchgate.net/publication/220413135_Dynamic_Revenue_Management_in_Airline_Alliances/file/72e7e5215a1f91ed5b.pdf )

Page 48: Big Data Analysis Concepts and References

Big Data Analysis 49 Author: Vikram Andem ISRM & IT GRC Conference

Logistic Regression

Logistic Regression

Uses and examples of Logistic Regression
Logistic regression might be used to predict:
• Whether a patient has a given disease (e.g. diabetes), based on observed characteristics of the patient (age, gender, BMI, results of various blood tests, etc.).
• Whether an American voter will vote Democratic or Republican, based on age, income, gender, race, state of residence, previous elections, etc.
• In engineering, the probability of failure of a given process, system or product.
• In marketing applications, a customer's propensity to purchase a product or cease a subscription, etc.
• In economics, the likelihood of a person's choosing to be in the labor force; a business application would be to predict the likelihood of a homeowner defaulting on a mortgage. Source: Wikipedia

Logistic Regression is a type of probabilistic statistical classification model used to predict a binary response and to predict the outcome of a categorical dependent variable (i.e., a class label) based on one or more predictor variables (features). It is used in estimating the parameters of a qualitative response model. Logistic regression measures the relationship between a categorical dependent variable and one or more independent variables, which are usually (but not necessarily) continuous, by using probability scores as the predicted values of the dependent variable. Source: Wikipedia
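For illustration, a minimal sketch (entirely synthetic "no-show"-style data with hypothetical feature names; scikit-learn assumed available) of a logistic regression producing probabilities for a binary outcome:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(11)
n = 5000
days_before_departure = rng.integers(0, 60, n)
is_business_fare = rng.integers(0, 2, n)
X = np.column_stack([days_before_departure, is_business_fare])

# Hypothetical data-generating process for "passenger is a no-show"
logit = -2.0 + 0.03 * days_before_departure + 0.8 * is_business_fare
y = rng.random(n) < 1 / (1 + np.exp(-logit))

model = LogisticRegression().fit(X, y)
print("coefficients:", model.coef_, "intercept:", model.intercept_)
print("P(no-show) for [5 days out, leisure fare]:",
      model.predict_proba([[5, 0]])[0, 1])
```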

The three images below show a humorous explanation and presentation of “Logistic Regression in Machine Learning” by David Hu, from his internship at the Khan Academy (a non-profit education site).

URL: http://david-hu.com/2012/01/05/khan-academy-internship-post-mortem.html

Source: Wikipedia

Logistic Regression in Big Data
An internet search provides lots of use cases; a couple are listed below:
A Big Data Logistic Regression with R and ODBC by Larry D'Agostino of "R news and tutorials at R-bloggers"
URL: http://www.r-bloggers.com/big-data-logistic-regression-with-r-and-odbc/
Large Data Logistic Regression (with example Hadoop code) by John Mount of "Win-Vector Blog: Applied Theorist's Point of View"
URL: http://www.win-vector.com/blog/2010/12/large-data-logistic-regression-with-example-hadoop-code/

Page 49: Big Data Analysis Concepts and References

Big Data Analysis 50 Author: Vikram Andem ISRM & IT GRC Conference

Logistic Regression (continued)

Predicting Airline Passenger No-show Rates

A use case in Airline industry

Accurate forecasts of the expected number of no-shows for each flight can increase airline revenue by reducing the number of spoiled seats (empty seats that might otherwise have been sold) and the number of involuntary denied boardings at the departure gate. Conventional no-show forecasting methods typically average the no-show rates of historically similar flights, without the use of passenger-specific information. The authors of the below paper develop two classes of models to predict cabin-level no-show rates using specific information on the individual passengers booked on each flight.

• The first of the proposed models computes the no-show probability for each passenger, using both the cabin-level historical forecast and the extracted passenger features as explanatory variables. This passenger-level model is implemented using three different predictive methods: a C4.5 decision tree, a segmented Naive Bayes algorithm, and a new aggregation method for an ensemble of probabilistic models.
• The second, cabin-level model is formulated using the desired cabin-level no-show rate as the response variable. Inputs to this model include the predicted cabin-level no-show rates derived from the various passenger-level models, as well as simple statistics of the features of the cabin passenger population.

The cabin-level model is implemented using either linear regression, or as a direct probability model with explicit incorporation of the cabin-level no-show rates derived from the passenger-level model outputs. The new passenger-based models are compared to a conventional historical model, using train and evaluation data sets taken from over 1 million passenger name records. Standard metrics such as lift curves and mean-square cabin-level errors establish the improved accuracy of the passenger based models over the historical model. The authors evaluated all models using a simple revenue model, and show that the cabin-level passenger-based predictive model can produce between 0.4% and 3.2% revenue gain over the conventional model, depending on the revenue-model parameters. Passenger-Based Predictive Modeling of Airline No-show Rates by Richard Lawrence, Se June Hong of IBM T. J. Watson Research Center and Jacques Cherrier of Air Canada (URL: http://www.msci.memphis.edu/~linki/7118papers/Lawrence03Passenger.pdf )

Airline Customer Satisfaction / Loyalty

A use case in Airline industry

There are a lot of research papers and "actual implementation" articles on leveraging Logistic Regression for Airline Customer Satisfaction / Loyalty; a few are listed below:

A Logistic Regression Model of Customer Satisfaction of Airline by Peter Josephat and Abbas Ismail of University of Dodoma (URL: http://www.macrothink.org/journal/index.php/ijhrs/article/view/2868/2669)

Analytical CRM at the airlines: Managing loyal passengers using knowledge discovery in database by Jehn-Yih Wong and Lin-Hao Chiu of Ming Chuan University, and Pi-Heng Chung, of De Lin Institute of Technology and Te-Yi Chang of National Kaohsiung University (URL: http://ir.nkuht.edu.tw/bitstream/987654321/1528/1/8%E5%8D%B74-001.pdf )

Modelling Customer Response Rate for Exchange / Purchase of Airline Frequent Flier Miles for 3rd Party (non-airline) Products and Services

A use case in Airline industry

The below reference presents the Logistic Regression model for the "Predictive Goal" used by Sprint/Nextel to identify Delta Airlines Sky Miles members who will most likely respond to an offer to exchange frequent flier miles for the purchase of Sprint-Nextel wireless phones and service.

Delta Airlines Response Model Overview by Geoff Gray, Armando Litonjua, Matt McNamara, Tim Roach and Jason Thayer (URL: http://galitshmueli.com/system/files/FrequentFliers.pdf )

Page 50: Big Data Analysis Concepts and References

Big Data Analysis 51 Author: Vikram Andem ISRM & IT GRC Conference

Support Vector Machine

Support Vector Machine

Support vector machines (SVM) are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples into one category or the other, making it a non-probabilistic binary linear classifier. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on. Source: Wikipedia
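A minimal sketch (two separable synthetic classes; scikit-learn assumed available) of a linear SVM: the fitted model exposes the support vectors, the samples on the margin that define the maximum-margin hyperplane.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(-2, 0.7, (100, 2)), rng.normal(2, 0.7, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("number of support vectors per class:", clf.n_support_)
print("separating hyperplane: w =", clf.coef_[0], ", b =", clf.intercept_[0])
print("prediction for [0.5, 0.5]:", clf.predict([[0.5, 0.5]]))
```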

Airline Ticket Cancellation Forecasting & Revenue Management

Passenger Name Record Data Mining Based Cancellation Forecasting for Revenue Management

by Dolores Romero Morales and Jingbo Wang of Said Business School, University of Oxford

A use case in Airline industry

Using real-world datasets, the authors of the below paper examine the performance of existing models and propose promising new ones, based on Logistic Regression and Support Vector Machines, for ticket cancellation forecasting and improving airline revenue.
(URL: http://www.optimization-online.org/DB_FILE/2008/04/1953.pdf )

Example: Maximum-margin hyperplane and margins for an SVM trained with samples from two classes. Samples on the margin are called the support vectors. H1 does not separate the classes. H2 does, but only with a small margin. H3 separates them with the maximum margin.

Support Vector Machine and Big Data

The below paper provides good information on Support Vector Machine classifiers for solving the pattern recognition problem in machine learning on large data sets

Support Vector Machine Classifiers for Large Data Sets by E. Michael Gertz and Joshua D. Griffin of Argonne National Laboratory

(URL: http://ftp.mcs.anl.gov/pub/tech_reports/reports/TM-289A.pdf )

Page 51: Big Data Analysis Concepts and References

Big Data Analysis 52 Author: Vikram Andem ISRM & IT GRC Conference

k-means Clustering

k-means Clustering
k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells (see Voronoi diagram below).

Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets (k ≤ n) S = {S1, S2, …, Sk} so as to minimize the within-cluster sum of squares (WCSS): argmin over S of Σ(i=1..k) Σ(x ∈ Si) ||x − μi||²

where μi is the mean of points in Si.

Demonstration of k-means standard algorithm

Airline Data Model Mining for: • Customer Segmentation Analysis • Customer Loyalty Analysis • Customer Life Time Value Analysis • Frequent Flyer Passenger Prediction

Oracle Airlines Data Model Data Mining Models

by Oracle Corporation

A use case in Airline industry

Oracle (a major database software and applications vendor), for its software product offering titled "Airlines Data Model and Data Mining Models", provides reference information about the data mining models:

(URL: http://docs.oracle.com/cd/E11882_01/doc.112/e26208/data_mining_adm.htm#DMARF1188 )

k-means clustering and Big Data
The below paper provides excellent theory and application pseudo-code for NoSQL-based applications / programs using k-means clustering in Big Data Analysis.
Turning Big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering
by Dan Feldman, Melanie Schmidt, Christian Sohler of Massachusetts Institute of Technology
(URL: http://people.csail.mit.edu/dannyf/subspace.pdf )

Background: Voronoi diagram In mathematics, a Voronoi diagram is a way of dividing space into a number of regions. A set of points (called seeds, sites, or generators) is specified beforehand and for each seed there will be a corresponding region consisting of all points closer to that seed than to any other. Source: Wikipedia

Step 1: k initial "means" (in this case k=3) are randomly generated within the data domain (shown in color).

Step 2: k clusters are created by associating every observation with the nearest mean. The partitions here represent the Voronoi diagram generated by the means.

Step 3: The centroid of each of the k clusters becomes the new mean.

Step 4: Steps 2 and 3 are repeated until convergence has been reached.
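A minimal sketch (synthetic 2-D data) of the standard k-means algorithm following the four steps above: random initial means, assign points to the nearest mean, recompute centroids, repeat until convergence.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]     # Step 1
    for _ in range(n_iter):
        # Step 2: assign each observation to the nearest mean
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: the centroid of each cluster becomes the new mean
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):                # Step 4: converged
            break
        centroids = new_centroids
    return centroids, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.5, (100, 2)) for loc in (0, 4, 8)])
centroids, labels = kmeans(X, k=3)
print("cluster means:\n", np.round(centroids, 2))
```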

Source: Wikipedia

Modelling / Predicting Airline Passenger’s Ticket Choice (Price Sensitivity) with Brand Loyalty

Modeling Passenger's Airline Ticket Choice Using Segment Specific Cross Nested Logit Model with Brand Loyalty

A use case in Airline industry

The below paper provides methods for modelling / predicting airline passengers' ticket choices (which are price sensitive) with brand loyalty; the authors used large data sets from a stated preference choice experiment conducted among Australian citizens traveling to the USA.

(URL: http://www.agifors.org/award/submissions2012/TomDrabas.pdf )

( Published May 2012)

United Airlines is part of the analysis in this paper, along with other Australian airlines.
by Tomasz Drabas and Cheng-Lung Wu of University of New South Wales

Page 52: Big Data Analysis Concepts and References

Big Data Analysis 53 Author: Vikram Andem ISRM & IT GRC Conference

Appendix
Sources / Credits / References used to make this presentation:
1. Google: www.google.com
2. Wikipedia: http://www.wikipedia.org/
3. Big Data Analytics: Harvard Extension School http://www.extension.harvard.edu/courses/big-data-analytics
4. Tackling the Challenges of Big Data: Massachusetts Institute of Technology http://web.mit.edu/professional/pdf/oxp-docs/BigDataCourseFlyer.pdf
5. Coursera:

a) “Data Scientist’s Tool Box” by Brian Caffo, Roger D. Peng, Jeff Leek of Johns Hopkins University: https://www.coursera.org/specialization/jhudatascience/1
b) “Data Analysis and Statistical Inference” by Mine Çetinkaya-Rundel of Duke University: https://www.coursera.org/course/statistics
c) “Introduction to Data Science” by Bill Howe of University of Washington: https://www.coursera.org/specialization/jhudatascience/1

6. OpenIntro Statistics : http://www.openintro.org/stat/

Excellent Reading Material (publicly available for free )

“OpenIntro Statistics” 2nd Edition, authored by David M Diez (Quantitative Analyst at Google/YouTube), Christopher D Barr (Assistant Professor at Harvard University) and Mine Cetinkaya-Rundel (Assistant Professor at Duke University). URL: https://dl.dropboxusercontent.com/u/8602054/os2.pdf

“Practical Regression and Anova using R” authored by Julian J. Faraway. The electronic version at the below URL is free; the physical / print version is $79. URL: http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf
“ThinkStats: Probability and Statistics for Programmers” authored by Allen B. Downey. URL: http://greenteapress.com/thinkstats/thinkstats.pdf

Page 53: Big Data Analysis Concepts and References

Source: Openintro Statistics

Mean

Median

Variance

Standard deviation

Shape of the distribution

Mean vs. Median

If the distribution is symmetric, center is often defined as the mean: mean = median

If the distribution is skewed or has extreme outliers, center is often defined as the median.
• Right-skewed: mean > median
• Left-skewed: mean < median

Probability

Independence & Conditional Probability

Page 54: Big Data Analysis Concepts and References

Source: Openintro Statistics

Bayes’ Theorem

Random Variables

Normal Distribution

Normal Distribution with different parameters

Z Scores

Finding the exact probability - using the Z table

Page 55: Big Data Analysis Concepts and References

Source: Openintro Statistics

Six Sigma

Binomial Distribution

Normal Probability Plot and Skewness

Central Limit Theorem

Confidence Intervals

Changing the Confidence Levels

Page 56: Big Data Analysis Concepts and References

Source: Openintro Statistics

p - values

Decision Errors

Hypothesis Test as a trial

Hypothesis Testing for Population Mean

The t - distribution

Type 1 and Type 2 Errors

Page 57: Big Data Analysis Concepts and References

Source: Openintro Statistics

ANOVA (Analysis of Variance)

Conditions

z/t test vs. ANOVA

Purpose

Method

Page 58: Big Data Analysis Concepts and References

Source: Openintro Statistics

Parameter and Point Estimate

Comparing two Proportions

Standard Error Calculations

Anatomy of a Test Statistic

Chi-square statistic

Why Square?

The chi-square distribution

Conditions for chi-square test

Page 59: Big Data Analysis Concepts and References

Source: Openintro Statistics

Quantifying a Relationship

The least squares line

Slope of the Regression

Intercept

Prediction and Extrapolation

Conditions for the least

squares line

Terminology

R² vs. Adjusted R²

Adjusted R²

Page 60: Big Data Analysis Concepts and References

Source: Openintro Statistics

Sensitivity and Specificity Generalized Linear Models

Logistic Regression

Logit function

Properties of the Logit

The Logistic Regression Model