Fundamentals of
Descriptive Analytics A Business Analytics Course
University of the Philippines Open University
Dr. Melinda Lumanta Ms. Louise Villanueva Dr. Eugene Rex Jalao Ms. Marie Karen Enrile
Asst. Prof. Joyce Manalo
Course Writers
Fundamentals of Descriptive Analytics 1
University of the Philippines
OPEN UNIVERSITY
COMMISSION ON HIGHER EDUCATION
Fundamentals of Descriptive Analytics
Course Package
This learning package consists of:
1. Course Guide
2. Study Guides
3. Video lectures (available at UPOU Networks and on the attached USB)
4. Assessments
UNIVERSITY OF THE PHILIPPINES OPEN UNIVERSITY
Fundamentals of Descriptive Analytics
A Business Analytics Course
This course aims to introduce students to the fundamentals of descriptive analytics.
Descriptive analytics makes use of current transactions to enable managers to visualize
how the company is performing. This course will teach students how to prepare reports
using descriptive analytics tools.
Prerequisite: Fundamentals of Data Warehousing
COURSE OBJECTIVES
At the end of the course, the students should be able to:
1. Explain the concepts in descriptive statistics
2. Contextualize the descriptive statistics concepts and analytical techniques in
business decision-making
3. Explain the importance of data pre-processing
4. Apply data pre-processing techniques in business
5. Explain the importance of data visualization and communication
6. Apply data visualization techniques to communicate the results of descriptive
analytics to stakeholders
7. Develop an awareness of ethical norms as required under policies and applicable
laws governing confidentiality and non-disclosure of data/information/documents
and proper conduct in the learning process and application of business analytics.
COURSE OUTLINE
UNIT I. Introduction to Descriptive Analytics
MODULE 1. Statistics in Business
A. Data and data sets
B. What is statistics?
C. Bases in choosing what statistics to use
D. Application of statistics in business
MODULE 2. Basic Descriptive Statistics
A. Frequency distributions
B. Measures of location
C. Measures of dispersion
D. Measures of association
E. Measures of shape and other statistics
MODULE 3. Sampling and Data Collection
A. Types of sampling
B. Central limit theorem
Unit II. Data Preprocessing
MODULE 1: Basic Concepts in Data Preprocessing
A. What is data pre-processing?
B. Tasks for data processing
Module 2: Methods for Data Preprocessing
A. Data Integration
B. Data Transformation
C. Data Cleaning
D. Data Reduction
MODULE 3: Post-Processing and Visualization Of Data Inside The Data Warehouse
Unit III. Data Visualization and Communication
Unit IV. Ethics
COURSE MATERIALS
1. Course guide
2. Study guides per module
3. Video lectures
4. Additional reading materials in digital forms
STUDY SCHEDULE

Week 1: Course Overview
1. Read the course guide
2. Participate in Discussion Forum 1: introduce yourself and write a brief reflection paper about the importance of big data in businesses today

Weeks 2-4: Unit I - Introduction to Descriptive Analytics
1. Go through Modules 1 to 3
2. Participate in Discussion Forums 2 to 4
3. Watch the videos on Basic Descriptive Statistics and on Sampling and Data Collection by Dr. Lisa S. Bersales
4. Submit the required assignment as specified in the study guide.

Weeks 5-9: Unit II - Data Pre-processing
1. Go through Modules 5 to 6
2. Watch the videos on Data Processing by Dr. Eugene Rex L. Jalao
3. Participate in Discussion Forum 5
4. Submit the required assignment as specified in the study guide.

Weeks 10-14: Unit III - Data Visualization and Communication
1. Go through Module 7.
2. Watch the video on Data Visualization and Communication
3. Participate in Discussion Forum 6

Week 15: Unit IV - Ethics
1. Go through Unit IV.
2. Watch the video on Ethics by Atty. Emerson Banes and Mr. Dominic Ligot
3. Write a reflection paper on ethics in descriptive analytics in business.

Week 16: Course Evaluation
1. Write a self-reflection on how the course contributed to your understanding of descriptive analytics in business.
COURSE REQUIREMENTS
For you to pass the course, you will be evaluated on the following required activities:
Unit 1: 20%
Unit 2: 35%
Unit 3: 35%
Unit 4: 10%
Online Discussions
There will be a series of online discussions and activities for this course. In addition to
gauging your understanding of the course topics, the online discussions provide
everybody an opportunity to apply the concepts discussed in the modules in specific
situations.
As we progress through the course, we will be posting discussion topics and specific
questions/ instructions, so make it a point to visit the course regularly.
Remember the following when participating in online discussions:
• All discussions will take place in the course site. A separate discussion forum will
be created for each topic.
• If you wish to acquire the Certificate of Completion, you are encouraged to contribute
to the discussions by answering the discussion question and/or reacting to each
discussion topic. Passing remarks like "I agree" are not considered substantial.
• Do not post lengthy contributions. Be clear on what your main point is and express
it as concisely as possible.
• The forums will remain open throughout the course's duration.
• Please be guided by netiquette rules (see
http://www.albion.com/netiquette/corerules.html) when participating in online
discussions. Respond to other postings courteously. Personal messages should
be emailed directly to the person concerned.
• If you would like to use some printed or online reference materials in your posting,
don't forget to cite them accordingly (e.g., According to Hernandez (2010), this
concept is...).
Assignments
The assignment is intended to help you integrate and apply what you have learned. Specific
instructions will be posted in the course site.
If you wish to get the Certificate of Completion for this course, you must submit and get a
passing mark in the assignment. Online submission of assignments will be in the
Assignment Bin.
GENERAL GUIDELINES

Please comply with the following house rules:
• You are always expected to uphold academic integrity and intellectual honesty as
a learner. Cheating or plagiarism is not allowed.
• Submit your assignment on time.
• Observe deadlines. Follow the schedule of course activities, submit your
assignments on time, and never ask for an exemption from a required task. Read
in advance. Try to anticipate possible conflicts between your personal schedule
and the course schedule, and make the necessary adjustments to your study
schedule.
• Limit the comments and materials you post to those that are relevant to the course
topics. For your profile photo, do not post an informal photo or a photo that would
be more appropriate for a personal website. Maintain a professional demeanour
in all courses.
UNIT I: INTRODUCTION TO DESCRIPTIVE ANALYTICS
This unit intends to:
1. Introduce basic statistical concepts;
2. Introduce basic descriptive statistics; and
3. Introduce sampling and data collection.
MODULE 1: STATISTICS IN BUSINESS
1.1. Data and Data Sets
This part of the module is intended to familiarize the students with data and data sets
which serve as foundations of statistics and analytics.
Learning Objectives
At the end of this part of the module the students must be able to do the following:
1. Differentiate the types of data and levels of measurement
2. Differentiate the types of data sets and identify where each is best used
3. Differentiate the two branches of statistics:
a. Descriptive statistics
b. Inferential statistics
4. Determine appropriate use of statistics in business analytics.
Key Concepts
Attributes and Variables
Anyone who wants to embark on analytics must first start with data. These data are
composed of objects and their attributes. For example, the new human resources
manager in a pharmaceutical company wants to know the profile of their sales
representatives. The new human resources manager is presented with the table below.
Sales Representative’s Performance for 1st Quarter 2018
Sales Representative | Age | Sex | City | Standardized Product Expertise Test Scores | Total Amount of Quarterly Sales | Rank Based on Quarterly Sales
Abad, Maria | 23 | Female | Caloocan | 96 | 3,500,430.40 | 13
Basilio, Anna | 27 | Female | Las Piñas | 91 | 3,850,875.30 | 12
Cruz, Juan | 28 | Male | Makati | 92 | 5,290,320.50 | 2
Delos Santos, Jose | 23 | Male | Malabon | 94 | 3,216,739.95 | 14
Encarnacion, Leonora | 24 | Female | Mandaluyong | 94 | 4,589,850.00 | 6
Fajardo, Mario | 30 | Male | Manila | 95 | 4,670,902.25 | 5
Guzman, Emilio | 22 | Male | Marikina | 97 | 3,993,741.50 | 9
Herminio, Adela | 35 | Female | Muntinlupa | 96 | 3,890,004.70 | 10
Ilagan, Bienvenido | 28 | Male | Navotas | 93 | 2,863,045.25 | 15
Jacob, Rosa | 29 | Female | Parañaque | 92 | 4,097,589.75 | 8
Kalaw, Clarissa | 22 | Female | Pasay | 98 | 3,856,910.15 | 11
Lagman, Francisco | 28 | Male | Pasig | 94 | 4,970,438.25 | 4
Montero, Antonio | 31 | Male | Pateros | 92 | 1,368,495.45 | 16
Nuñez, Isabel | 25 | Female | Quezon City | 97 | 5,400,369.90 | 1
Ortiz, Katrina | 27 | Female | San Juan | 90 | 4,283,907.72 | 7
Pantaleon, Roel | 34 | Male | Taguig | 96 | 5,278,900.80 | 3
In this table, the objects are the sales representatives and the attributes are the
characteristics that these representatives have—age, sex, area or city of assignment, test
scores, and their quarterly performance in the form of sales.
Variables and Levels of Measurement
In statistics, attributes which are organized for further data processing are called variables.
There are different types of variables, and each type has properties that determine how it
can be subjected to analysis.
Nominal variables are characterized by distinctness. They are labels and are non-
numerical. In the table, we can say that the sales representative’s sex and city of
assignment are nominal variables. Males are distinct from females. No intrinsic ordering
can be observed in the cities of assignment.
Another type of variable is the ordinal variable. These are variables that pertain to order.
The ranks given to the sales representatives based on their quarterly sales are considered
as ordinal. From the data on rank, we can say that Juan Cruz who ranked 2nd sold more
than Roel Pantaleon who ranked 3rd but less than Isabel Nuñez who ranked 1st. Looking at
the ranks of these sales representatives does not give an idea about the difference
between the quarterly sales of these sales representatives.
The last three variables in the table are product expertise test scores, age, and quarterly
sales. These variables can be classified into what are referred to collectively as interval-ratio
variables. The two are often taken together since these variables are quantitative
in nature. The standardized product expertise test scores are considered as interval
variables. This is because standardized test scores like the IQ are often of arbitrary origin
(one does not necessarily start with zero) and there is a fixed distance between the scores.
For example, we can say that the distance between 91 and 92 is also equal to the
distance between 95 and 96.
Meanwhile, variables such as age and quarterly sales are considered as ratio variables.
These variables have the characteristics of nominal, ordinal, and interval variables.
However, unlike interval variables, ratio variables have absolute zero origins. At some
point, a sales representative does start with zero sales for the quarter. Age as a variable
is also characterized by a meaningful zero point which is upon someone’s birth.
Quantitative variables such as interval and ratio may be discrete or continuous. Discrete
variables are those that take the form of an integer. Meanwhile continuous variables take
the form of real numbers.
Nominal, ordinal, interval, and ratio variables are also referred to as levels of
measurement. The numerical properties of interval and ratio variables permit their use in
higher statistical tests. Meanwhile, nominal and ordinal variables are often used for
descriptive purposes.
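As a rough guide, each level of measurement admits a different kind of "typical value" summary. The sketch below (Python standard library; the sample values are drawn from the sales table above, and any statistical tool would do equally well) shows a conventional choice for each level:

```python
import statistics
from collections import Counter

cities = ["Makati", "Manila", "Makati"]  # nominal: only counting makes sense
ranks = [13, 12, 2, 14]                  # ordinal: order matters, so use the median
ages = [23, 27, 28, 23]                  # ratio: arithmetic is meaningful, so use the mean

# Mode for nominal data: the most frequent label
print(Counter(cities).most_common(1)[0][0])  # Makati

# Median for ordinal data: the middle position
print(statistics.median(ranks))              # 12.5

# Mean for interval/ratio data: arithmetic average
print(statistics.mean(ages))                 # 25.25
```

The point of the sketch is the restriction, not the functions themselves: averaging city names is meaningless, while averaging ages is not.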
Study Question

Think of an interesting phenomenon that you want to study in your organization. List all of the possible variables and categorize them according to type.
Data Sets
While aggregated data are important, sets of data have proved to be more useful to
organizations. These permit organizations to analyze and interpret scenarios effectively
and efficiently. There are three main types of data sets: record, graph, and ordered data
sets.
Record data set
Record data sets are those that are structured and presented in rows. Record data sets
may come in texts, numbers, or sequences.
The table about quarterly sales is considered as a collection of record data.
Another kind of record data is an m x n data matrix where m represents the rows or the
numerical objects and n the columns for the numerical attributes. This is a matrix
composed of real numbers.
Projection of x Load | Projection of y Load | Distance | Load | Thickness
10.23 | 5.27 | 15.22 | 2.7 | 1.2
12.65 | 6.25 | 16.22 | 2.2 | 1.1
Aside from the m x n data matrix, record data can also come in the form of a term-by-document
data set. This serves as a means to count how many times a term appears in a document.

 | Area | Sales | Quota
Document 1 | 1 | 4 | 3
Document 2 | 2 | 5 | 3
Document 3 | 1 | 9 | 4
A special kind of record data is composed of combinations of items or services that are often bought or lumped together. This is called the market basket or transaction data set.
Transaction ID | Items
1 | Coffee, Pancakes
2 | Coffee, Pancakes, Hash Brown
3 | Pineapple Juice, Pancakes, Hash Brown
4 | Pineapple Juice, Rice, Egg
5 | Pineapple Juice, Rice, Egg, Beef Steak
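One common use of a transaction data set is to count how often pairs of items appear in the same basket. A minimal sketch (Python standard library; the baskets are the five transactions from the table above):

```python
from collections import Counter
from itertools import combinations

transactions = [
    {"Coffee", "Pancakes"},
    {"Coffee", "Pancakes", "Hash Brown"},
    {"Pineapple Juice", "Pancakes", "Hash Brown"},
    {"Pineapple Juice", "Rice", "Egg"},
    {"Pineapple Juice", "Rice", "Egg", "Beef Steak"},
]

# Count every pair of items bought together in the same transaction
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(3))
```

Pairs such as (Coffee, Pancakes) or (Egg, Rice) each appear in two transactions; such co-occurrence counts are the starting point of market basket analysis.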
Graph data set
Graph data sets are those that represent relationships through the interconnections of
points. This can be commonly observed in sociograms and matrices that show the
interaction between and among individuals in networks.
Ordered data set

Ordered data sets are those that show data over certain sequences, periods, or progressions. One of the most common ordered data sets is the time series. This includes data of a certain variable over a period of time.
Sales per quarter for the last 5 years (in millions)
Year | Quarter 1 | Quarter 2 | Quarter 3 | Quarter 4
2011 | 25 | 21 | 24.3 | 29.3
2012 | 25.4 | 19.7 | 25.6 | 27.9
2013 | 23 | 20.1 | 26.2 | 28.9
2014 | 26.7 | 18.3 | 24.5 | 29
2015 | 25.2 | 20.9 | 26 | 28.2
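Because an ordered data set carries a sequence, it can be summarized along that sequence. For instance, averaging each quarter across the five years in the table above reveals a seasonal pattern (a sketch in Python; the figures are copied from the table):

```python
# Quarterly sales (in millions) from the table above
sales = {
    2011: [25, 21, 24.3, 29.3],
    2012: [25.4, 19.7, 25.6, 27.9],
    2013: [23, 20.1, 26.2, 28.9],
    2014: [26.7, 18.3, 24.5, 29],
    2015: [25.2, 20.9, 26, 28.2],
}

# Average each quarter across the five years
quarter_avgs = [sum(year[q] for year in sales.values()) / len(sales) for q in range(4)]
for q, avg in enumerate(quarter_avgs, start=1):
    print(f"Quarter {q} average: {avg:.2f}")
```

The output shows Quarter 2 is consistently the weakest quarter and Quarter 4 the strongest, something that is hard to see from the raw rows alone.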
1.2. Introduction to Statistics
This part of the module is intended to familiarize the students with descriptive and
inferential statistics. This will serve as the backbone for the forthcoming modules that
tackle statistical measures and tests that could be applied to describe, predict, and infer
information based on the available data.
Key Concepts
Statistics as a field of study and in business
[Figure: Word cloud of basic statistical concepts]
Statistics as a field can be traced to Europe in the 1500s. Statesmen and scholars
from Great Britain, France, and Sweden were urged to make sense of data gathered from
the census (Stephenson, 2000). In 1662, the first demographic report on mortality was
produced by John Graunt based on the weekly mortality reports in London (Encyclopaedia
Britannica, 2012).
Many statistical reports on demographics emerged as the field also progressed from
description to inference. This could largely be attributed to the advancement in
mathematics, particularly in probability theory. Given the said history, statistics to this day
is inextricably linked to the field of mathematics while others consider statistics as a branch
of science. In this regard, statistics can be deemed as a meta-science or meta-language
that aims to collect, analyze, summarize, and interpret data (Stephenson, 2000).
While the use of statistics can be traced to the affairs of nation states, it has been proven
to be an integral part of knowledge creation in both the natural and the social sciences. In
fact, statistics was viewed as so complex that it was deemed accessible only to
those who chose it as their field of expertise. This led many to overlook its usefulness
until Sir Geoffrey Heyworth of the Royal Statistical Society (1950) made a case for the use
of statistics in business. Accordingly, business statistics must be simple for a businessman
to comprehend. It must also guide action, but it should not serve as a substitute to a
businessman’s judgment. Heyworth further recognized the application of statistics in
various facets of a business. However, he also warned that business statistics should be
guided by a businessman’s knowledge and experience because figures would only make
sense if coupled with an understanding of contexts. While there was no hard and fast rule
as to where statistics should be used, Heyworth considered it as a never-ending process
to make an idea more accurate and acceptable as it often served as a bridge between
initial and informed business judgments.
Heyworth’s assertions from 1950 remain relevant to this day. Businesses employ
statisticians who can guide them in deciding how to optimize their processes, target their
consumers, and create more buzz around their products and services, among others. These
actions, guided by business statistics and the businessman’s knowledge and experience,
are taken to minimize costs, maximize profits, and give the business a competitive advantage.
You have probably heard about television show ratings as publicized by rival networks.
These ratings are examples of business statistics. Since television remains one of
the most used media, advertisers of products and services rely on television show
ratings to determine where they could place their advertisements to reach consumers.
Ideally, the higher the exposure, the better.
The advent of new information and communication technologies has also ushered in new
ways of processing and using business statistics. You’ve probably grown accustomed to the
number of reactions, views, and shares of social media posts and uploads. These are
aggregated to determine the reach of social media pages, and thus, these also serve as
measures and drivers of business. Individuals and groups have created a new industry
out of their presence and activities in social media, and this industry is anchored in
business statistics.
Population and Parameters
Statistics as a field is associated with the research process. This entails gathering
data from concerned parties. One important concept in the field is the population. This pertains
to all of the items or individuals that a researcher, or in this case, a businessman would
want to study. Once these data have been gathered completely, the characteristics of the
items or individuals are called parameters. Let’s say that there is a businessman who sells
his product via an online platform. He is interested in determining customer satisfaction.
Based on the analytics provided by the online platform, a total of 10,000 customers bought
the product for the year. The businessman decides to conduct a survey of all 10,000
customers to get their demographics and an accurate product satisfaction rating. The
10,000 customers are considered as the population and their demographic details and
product satisfaction ratings are the parameters.
Sample and Statistic
Let’s say that the businessman consults the other members of his team regarding his plan
to conduct a survey with the population of 10,000 customers. The team expresses concern
about the resources that will be needed to reach all of the 10,000 customers. With their
knowledge and experience in business statistics, the businessman and his team decide
to select only 2,000 of the 10,000 customers for the year. These 2,000 customers are
considered as the sample and their demographic details and product satisfaction ratings
are the statistic.
While gathering data from the entire population is more accurate, the use of samples is
undeniably more effective and efficient.
[Figure: An illustration of the concepts of population and parameter, and of sample and statistic]
The two branches of statistics
Once the data are gathered from either the population or the sample, the analytic part of
business statistics comes in. Businessmen may opt to subject the data to descriptive
statistics or the branch of statistics that deals with procedures used to describe and
summarize the important characteristics of a sample or population (Mendenhall, Beaver
& Beaver, 2006). Let’s say that the businessman and his team managed to survey all the
10,000 consumers of his product for the year. Simple counting would inform the
businessman that 70%, or 7,000, of his consumers are men aged 30 to 40.
Since you’ve also been introduced to the concept of sample and statistic, you should also
know that these are crucial to the conduct of another branch of statistics. This branch is
called inferential statistics because this allows one to draw conclusions, make predictions,
and decide about the population based on the data gathered from the sample
(Mendenhall, Beaver & Beaver, 2006). Let’s say that the businessman and his team
forewent the survey of the population and pushed through with a random sample of 2,000
customers. The sample showed that 40%, or 800, of the consumers are happy with the product
and are considering buying an improved version in the following year. Guided by this value,
the businessman and his team may estimate that around 4,000 existing consumers may
buy the improved version of the product.
These simple examples show the functions of the two branches of statistics and their
application in the context of business.
Descriptive Statistics
Purpose
Your knowledge of populations and samples is useful for understanding descriptive statistics. As the name implies, descriptive statistics helps you describe and summarize the parameter of the population or the statistic of the sample. This may come in different measures:

A. Measures of location - these measures tell you the position of values in the frequency distribution. The most common measures of location are the measures of central tendency: the mean, median, and mode. Let's say that the businessman and his team found out that the average age of their consumers is 27. This means that the ages of the consumers center around 27, with some consumers younger and some older.

B. Measures of spread - measures of location are not enough to capture the variability among the data in the frequency distribution. This is the function of the measures of spread: they tell you how close together or far apart the values in the frequency distribution are. Some examples of measures of spread are the range, percentiles, variance, and standard deviation. Suppose the businessman and his team found out through the survey that product satisfaction was affected by customer service. This information prompted them to conduct another study based on the call logs of customer service representatives. They found that, on average, customer service representatives addressed customer concerns three days after an inquiry. However, the standard deviation was 2, indicating that some representatives managed to address inquiries in as little as one day while others took as long as five days. How could this happen? The businessman and his team should tackle this inconsistency to pave the way for more cost-efficient customer service.
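The call-log example can be reproduced with a short sketch (the three response times are hypothetical values chosen to yield the mean of 3 days and standard deviation of 2 described above):

```python
import statistics

# Hypothetical response times in days for three handled inquiries
response_days = [1, 3, 5]

print(statistics.mean(response_days))   # 3: inquiries addressed in three days on average
print(statistics.stdev(response_days))  # 2.0: sample standard deviation around that mean
```

Identical means can hide very different spreads, which is exactly why the measures of spread complement the measures of location.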
Assumptions
In contrast with inferential statistics, descriptive statistics only requires data that can be subjected to acceptable mathematical operations (Garbin, n.d.).
Inferential Statistics
Purpose
Descriptive statistics is useful when it comes to providing a summary of the data gathered
from either the population or the sample. However, statisticians have recognized that due
to limited resources, gathering data from the population is less popular than gathering data
from samples. In this regard, inferential statistics can be used. In contrast to descriptive
statistics, inferential statistics is used to estimate parameters and test hypotheses with the
data from the samples. This can lead to the generalization of data to the population.
Assumptions
Unlike descriptive statistics, inferential statistics has stricter prerequisites before this can
be applied to data. Aside from having data that can be subjected to acceptable
mathematical operations, inferential statistics also requires unbiased estimation since only
the samples are used to infer the parameters of the population (Garbin, n.d.). This
assumption entails the use of sampling or a process of selection where every case from
the population has an equal chance of being selected for the sample (Healey, 2009).
References
Text Book
OpenStax. (2016, September 28). Introduction to statistics. OpenStax CNX. Retrieved from https://cnx.org/contents/30189442-6998-4686-ac05-ed152b91b9de
Healey, J.F. (2009). Statistics: A tool for social research (8th ed.). USA: Wadsworth, Cengage Learning
Videos
Friedman, L.W. (2016). Introduction to business statistics. Retrieved from https://www.youtube.com/watch?v=poA0KntMgSM
Rigollet, Philippe, 18.650 Fundamentals of Statistics, Fall 2017. (Massachusetts Institute of Technology: MIT OpenCouseWare), https://www.youtube.com/watch?v=VPZD_aij8H0&list=PLUl4u3cNGP60uVBMaoNERc6knT_MgPKS0 (Accessed December 21, 2016). License: Creative Commons BY-NC-SA
Statistics Canada. (2013). Statistics: The invisible made visible. Retrieved from https://www.youtube.com/watch?v=_4GT5v0YaOE
SAS Software. (2013). How do you use statistics and how does it benefit your organization?. Retrieved from https://www.youtube.com/watch?v=LJV-Mlv-7dM
Websites
Garbin, C. (n.d.). Statistics and statistical tests: Assumptions and conclusions. Retrieved from http://psych.unl.edu/psycrs/941/q4/assumptions_141.pdf
John Graunt. (n.d.). In Encyclopaedia Britannica. Retrieved from https://www.britannica.com/biography/John-Graunt
Stephenson, D. (2000). Brief history of statistics. Retrieved from http://folk.uib.no/ngbnk/kurs/notes/node4.html
Web Center for Social Research. (2006). Descriptive statistics. Retrieved from https://www.socialresearchmethods.net/kb/statdesc.php
Royal Statistical Society. (1950). The use of statistics in business. The Journal of Royal Statistical Society, 113(1), 1-8. DOI: 10.2307/2980797
MODULE 2: BASIC DESCRIPTIVE STATISTICS
Introduction
We will focus on Summary Statistics. These are the different measures that are used to
describe any set of data. If we want to know the typical value of a certain variable, how
different the values are from one another, or how a certain data point compares to the rest,
we can use these measures.
2.1. Frequency Distribution
Frequency is simply the number of occurrences of an event. A frequency distribution is a
list, table or graph that displays the frequency of various outcomes in a sample. It tells us
how many there are of each item in the data set.
Frequency distribution can show us both the raw count of each item and its percentage
of the total.
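For instance, a frequency distribution of the sex of the 16 sales representatives from Module 1 can be produced with a few lines of Python (a sketch using only the standard library; any statistical tool would give the same result):

```python
from collections import Counter

# Sex of the 16 sales representatives from the table in Module 1 (8 female, 8 male)
sexes = ["Female"] * 8 + ["Male"] * 8

freq = Counter(sexes)
for value, count in freq.items():
    share = count / len(sexes)
    print(f"{value}: {count} ({share:.0%})")  # raw count and percentage of the total
```

This prints both the raw counts and their percentages, which is exactly the information a frequency distribution table conveys.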
Learning Resources

Read: This online resource explains frequency distributions in a simple way and shows examples: https://www.spss-tutorials.com/frequency-distribution-what-is-it/

Watch: This video illustrates the concept in a novel way: https://www.youtube.com/watch?time_continue=145&v=dr1DynUzjq0
Understanding Frequency Distribution gives us a way of understanding and organizing
our data in a logical way. Once we have done this, we will be able to apply different
summary statistics measures to our data. These Measures are explained in the following
sections.
2.2. Measures of Central Tendency
Learning Resources

Watch: Measures of Central Tendency, Measures of Location, Measures of Dispersion, a video by Dr. Lisa Bersales [from 02:05]: https://www.youtube.com/watch?v=9TF8btw1aNU&index=31&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I
Measures of Central Tendency give us the typical value of data. There are three measures
of central tendency, the Mean, Median, and Mode.
Mean
The mean is the sum of all values of observations divided by the number of observations
in the data set.
Mean = (Σ Xi) / N

where the mean is the summation of all values of X (from X1 to XN) divided by the total
number of values (N). You can see an example of this in Dr. Bersales's video.
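As a quick illustration of the formula (a Python sketch; the five values are the ages of the first five sales representatives in the Module 1 table):

```python
ages = [23, 27, 28, 23, 24]   # X1 ... X5
mean = sum(ages) / len(ages)  # (Σ Xi) / N = 125 / 5
print(mean)                   # 25.0
```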
Median
The median is simply the middle value in the data set.

Where N is the total number of values, this is the formula for the median when N is odd:

Median = ((N + 1) / 2)th term

This is the formula when N is even:

Median = [(N/2)th term + (N/2 + 1)th term] / 2
Note that these formulas do not return an actual value; they return the position of a term.
This means that you need to order the data (as we learned in frequency distribution) and
count from the beginning until you reach the indicated term.
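The positional formulas can be turned into a short function (a sketch; the helper name is my own):

```python
def median_by_position(values):
    """Sort the data, then pick the term(s) given by the positional formulas."""
    data = sorted(values)
    n = len(data)
    if n % 2 == 1:
        # Odd N: the ((N + 1) / 2)th term (positions are 1-indexed)
        return data[(n + 1) // 2 - 1]
    # Even N: average of the (N/2)th and (N/2 + 1)th terms
    return (data[n // 2 - 1] + data[n // 2]) / 2

print(median_by_position([7, 1, 5]))     # sorted: [1, 5, 7] -> 2nd term -> 5
print(median_by_position([7, 1, 5, 3]))  # sorted: [1, 3, 5, 7] -> (3 + 5) / 2 -> 4.0
```

The sorting step mirrors the instruction above: the formulas only make sense once the data are ordered.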
Make sure to watch Dr. Bersales's video to learn more about the median.
Mode
The mode is the value that occurs most often in the data set. There is no formula for the
mode. Instead we can identify the mode by looking at the frequency distribution. There
can be multiple modes. Dr. Bersales’s video discusses this further.
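Identifying the mode from a frequency distribution takes only a few lines with Python's `collections.Counter`; applying it to the 16 test scores from the Module 1 table also shows that a data set can have more than one mode (a sketch):

```python
from collections import Counter

# Standardized test scores of the 16 sales representatives from Module 1
scores = [96, 91, 92, 94, 94, 95, 97, 96, 93, 92, 98, 94, 92, 97, 90, 96]

counts = Counter(scores)
highest = max(counts.values())
# Every value that reaches the highest frequency is a mode
modes = sorted(value for value, count in counts.items() if count == highest)
print(modes)  # [92, 94, 96] -- three values each occur three times
```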
Study Questions

When is it best to use the mean? What about the median or the mode? Name some specific examples of situations in which one would choose a certain measure over the other two.
2.3. Measures of Location
Sometimes, we want to know how a certain data point compares with the rest, as in the case of rankings and quotas. In other situations, we may divide the data into a number of equal sections to answer our questions, as with problems involving brackets, classes, and other groupings.

Measures of Location specify points in the data set below which a given fraction of the data lies. This allows us to find the position of a data point in relation to the entire data set.
Some examples of these are percentiles, deciles and quartiles. Percentiles divide the data into 100 equal parts, deciles divide the data into 10 equal parts, and quartiles divide the data into 4 equal parts.
Median, a measure of central tendency discussed earlier, is also a special measure of location. If you can recall, the median is the middle value in the data set so it divides the data into two equal parts.
Dr. Bersales explains this further in her video.
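To make this concrete: percentiles, deciles, and quartiles are all cut points of the ordered data. A short Python sketch with hypothetical data (the course's exercises use R, where `quantile()` plays the same role):

```python
# Cut points that divide ordered data into equal parts,
# using the standard-library statistics module.
import statistics

data = [12, 15, 17, 19, 22, 25, 28, 31, 35, 40]

quartiles = statistics.quantiles(data, n=4, method="inclusive")      # 3 cut points: Q1, Q2, Q3
deciles = statistics.quantiles(data, n=10, method="inclusive")       # 9 cut points: D1..D9
percentiles = statistics.quantiles(data, n=100, method="inclusive")  # 99 cut points: P1..P99

# The second quartile is the median: it splits the data into two equal parts.
assert quartiles[1] == statistics.median(data)
```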
2.4. Measures of Dispersion
Learning Material Watch: Measures of Central Tendency, Measures of Location, Measures of Dispersion Video by Dr. Lisa Bersales [From 21:19] https://www.youtube.com/watch?v=9TF8btw1aNU&index=31&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I
There are two types of measures of dispersion: absolute dispersion, which measures the
variability within a data set, and relative dispersion, which compares the variability of one
data set with that of others.

Variance and standard deviation are measures of dispersion with reference to the mean.
The higher these values are, the farther the data values are from the mean. Standard
deviation is the square root of the variance, resulting in a number that is always
non-negative and is in the same units as the mean.
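The relationship between variance and standard deviation can be verified numerically; a minimal Python sketch with hypothetical values:

```python
# Population variance and standard deviation with reference to the mean.
import statistics

data = [4, 8, 6, 5, 3, 7]

variance = statistics.pvariance(data)  # mean of squared deviations from the mean
stdev = statistics.pstdev(data)        # square ROOT of the variance

# Squaring the standard deviation recovers the variance,
# and the standard deviation is in the same units as the data.
assert abs(stdev ** 2 - variance) < 1e-9
```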
MODULE 3: SAMPLING AND DATA COLLECTION
Introduction
Guided by the knowledge in statistics, students must also become accustomed to the
process of sampling. This module is intended to familiarize students with the different
types of sampling and the theory that guides the process.
Learning Objectives
At the end of the module, the students should be able to:

a. Differentiate the types of sampling;
b. Understand the theory behind sampling; and
c. Use sampling in the business context.
3.1. History of Sampling
Sampling is defined as “a process or method of drawing a representative group of
individuals or cases from a particular population” (Encyclopaedia Britannica, 2017).
Sampling is used because it is more effective and efficient to study samples drawn from
a population than to study the entire population.
Much like the history of statistics, the history of sampling has various roots. Bethlehem
(2009) noted that sampling theory became a legitimate area of study in statistics
through the works of Anders Kiaer of the Norwegian Statistical Bureau. In his study
published in 1895, Kiaer presented his “Representative Method” of selecting samples
based on the population. The Representative Method received both praise and criticism
from scholars, which prompted Kiaer and other statisticians to address and improve the
method. The Representative Method’s lack of random selection was remedied by Bowley
in 1906. The works of both Kiaer and Bowley led to the rise of probability and non-
probability sampling.
3.2. Probability Sampling
The use of probability sampling is guided by probability theory, particularly the law of
large numbers and the central limit theorem. The assumption is that as the number of
samples selected from the population increases, the statistics obtained from these
samples become closer to the expected or actual population values and tend to follow a
normal distribution.
There are different techniques of probability sampling:
• Simple Random Sampling - In simple random sampling, the researcher
implements a selection procedure that ensures that every member of the
population has an equal chance of being selected.
• Stratified Sampling - This is a probability sampling technique where a
heterogeneous group is first divided into homogeneous groups, or strata, from
which the samples are selected. The number of samples selected per stratum
corresponds to the stratum’s percentage of the entire population.
• Cluster Sampling - This type of probability sampling technique is similar to stratified
sampling. The only difference is that not all of the strata are selected. Instead, the
researcher first selects a number of strata from which the samples will be
randomly taken.
• Systematic Sampling - In systematic sampling, the selection of samples starts from
a random point and then proceeds at a fixed interval. Some researchers conduct
systematic sampling by generating random numbers to serve as the starting point
and the interval.
• Multistage Sampling - This is the combination of the probability sampling
techniques mentioned above.
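To illustrate how these selection rules differ, here is a small Python sketch over a hypothetical population of 100 customer IDs; the strata names and sizes are invented for the example:

```python
# Sketches of simple random, systematic, and stratified selection.
import random

random.seed(42)  # fixed seed so the example is reproducible
population = list(range(1, 101))  # hypothetical customer IDs 1..100

# Simple random sampling: every member has an equal chance of selection.
srs = random.sample(population, k=10)

# Systematic sampling: a random starting point, then every k-th member.
k = len(population) // 10      # fixed interval of 10
start = random.randrange(k)    # random start within the first interval
systematic = population[start::k]

# Stratified sampling: draw within homogeneous strata, proportional to size.
strata = {"branch_A": population[:60], "branch_B": population[60:]}
stratified = []
for members in strata.values():
    n = round(10 * len(members) / len(population))  # proportional allocation
    stratified += random.sample(members, n)
```

A multistage design would simply chain these steps, e.g. randomly selecting a few strata first and then sampling systematically within each.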
3.3. Non-Probability Sampling
This type of sampling is used when researchers are not concerned with generalizing
the results of the study to the population. Instead, the researcher aims to get data only
for specific cases.
Some examples of non-probability sampling are as follows:
• Quota Sampling - In quota sampling, the researcher only ensures that a set number
of samples will be selected from each stratum. For example, a businesswoman
finds out that the customer base of her cosmetics company is composed of
Caucasian, Asian, Black, and Latina women aged 20-30. She sets out to survey 10
women from each group.
• Purposive Sampling - Purposive sampling is the selection of participants on the
premise that they meet the criteria set by the researcher. Snowball or chain sampling
is an example of purposive sampling. For example, suppose you want to compare and
contrast the manufacturing practices of Japanese companies in the Philippines.
Given the specificity of your purpose, your study does not entail random selection.
Instead, you will be driven by your criteria in the selection of your sample.
• Convenience Sampling - Convenience sampling relies solely on availability. For
example, a chef hands a survey form to every customer who eats at his
restaurant to determine their level of satisfaction with the products and services.
Study Question When is it appropriate to use probability sampling or non-probability sampling?
References
“Measures of Central Tendency, Measures of Location, Measures of Dispersion” (Video) by Dr. Lisa Bersales. https://www.youtube.com/watch?v=9TF8btw1aNU&index=31&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I
Bethlehem, J. (2009). The rise of survey sampling. Retrieved from https://www.cbs.nl/-/media/imported/documents/2009/07/2009-15-x10-pub.pdf
Parker, M. (2017). Types of sampling. Retrieved from https://www.ma.utexas.edu/users/parker/sampling/srs.htm
Sampling. (2017). In Enclyclopaedia Britannica. Retrieved from https://www.britannica.com/science/sampling-statistics
Assignment for Unit 1 - Introduction to Descriptive Analytics
Go to the Monthly National Government Cash Operations Report
(https://data.gov.ph/?q=dataset/national-government-cash-operations-report) and
download Data Sheets 2011 to 2014. Using the lessons learned in Unit 1, complete the
following:
1. What are the common variables in the data sheets? Identify the level of
measurement of each variable (5 points).
2. Randomly select two data sheets from Data Sheets 2011 to 2014. Indicate the
years of the two data sheets selected and the process of selection employed (5
points).
3. Randomly select six out of the twelve months that will be part of the record data
set. Indicate the months selected and the process of selection employed (5
points).
4. Create a table that shows the sum of values of common variables for each of the
selected years. Explain the type of data set generated (15 points).
5. Compute for the mean and standard deviation of the data from the selected years.
Write a description of the results (20 points).
UNIT II: DATA PREPROCESSING
This unit intends to:
1. Introduce basic concepts in data preprocessing; and 2. Introduce methods of data preprocessing.
MODULE 1: BASIC CONCEPTS IN DATA PREPROCESSING
Introduction

Data preprocessing is an important step in data analytics. It aims to assess and
improve the quality of data for secondary statistical analysis. With this, the data are better
understood and the data analysis is performed more accurately and efficiently.
Learning Objectives
After studying this module, you should be able to:
1. Explain what data preprocessing is and why it is important in data analytics; and
2. Describe different forms of data preprocessing.
1.1. What is Data Pre-processing?
Data in the real world tend to be incomplete, noisy, and inconsistent. “Dirty” data can lead
to errors in parameter estimation and incorrect analyses, leading users to draw false
conclusions. Quality decisions must be based on quality data; hence, unclean data may
cause incorrect or even misleading statistical results and predictive analyses. Data
preprocessing is a data mining technique that involves transforming raw or source data
into an understandable format for further processing.
1.2. Tasks for Data Pre-processing
Several distinct steps are involved in preprocessing data. Here are the general steps taken to pre-process data:
1. Data cleaning
• This step deals with missing data, noise, outliers, and duplicate or incorrect
records while minimizing introduction of bias into the database.
• Data is cleansed through processes such as filling in missing values,
smoothing the noisy data, or resolving the inconsistencies in the data.
2. Data integration
• Extracted raw data can come from heterogeneous sources or be in
separate datasets. This step reorganizes the various raw datasets into a
single dataset that contains all the information required for the desired
statistical analyses.
• Involves integration of multiple databases, data cubes, or files.
• Data with different representations are put together and conflicts within the
data are resolved.
3. Data transformation
• This step translates and/or scales variables stored in a variety of formats
or units in the raw data into formats or units that are more useful for the
statistical methods that the researcher wants to use.
• Data is normalized, aggregated and generalized.
4. Data reduction
• After the dataset has been integrated and transformed, this step removes
redundant records and variables, as well as reorganizes the data in an
efficient and “tidy” manner for analysis.
• Pertains to obtaining reduced representation in volume but produces the
same or similar analytical results.
• This step aims to present a reduced representation of the data in a data
warehouse.
Pre-processing is sometimes iterative and may involve repeating this series of steps until
the data are satisfactorily organized for the purpose of statistical analysis. During
preprocessing, one needs to take care not to accidentally introduce bias by modifying the
dataset in ways that will impact the outcome of statistical analyses. Similarly, we must
avoid reaching statistically significant results through “trial and error” analyses on
differently pre-processed versions of a dataset.
Learning Resource Watch Dr. Eugene Rex Jalao’s video on Data Preprocessing. https://www.youtube.com/watch?v=qk3gedLrpIU&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I&index=20
MODULE 2: METHODS OF DATA PREPROCESSING
Introduction
Data preprocessing consists of a series of steps to transform data extracted from different
data sources into clean data prior to statistical analysis. Data pre-processing includes
data cleaning, data integration, data transformation, and data reduction.
Learning Objectives After studying this module, you should be able to:
1. Understand the different methods of data preprocessing; and
2. Differentiate the different techniques of data preprocessing.
2.1. Data Integration
Data integration is the process of combining data derived from various data sources (such
as databases, flat files, etc.) into a consistent dataset. In data integration, data from the
different sources, as well as the metadata - the data about the data - from those sources,
are integrated to come up with a single data store. There are a number of issues to
consider during data integration, related mostly to possibly different standards among data
sources. These issues include the entity identification problem, data value conflicts, and
redundant data. Careful integration of the data from multiple sources may help reduce or
avoid redundancies and inconsistencies and improve the speed and quality of
subsequent data mining.
Four Types of Data Integration Methodologies
1. Inner Join - creates a new result table by combining column values of two
tables (A and B) based upon the join-predicate.
2. Left Join - returns all the values from an inner join plus all values in the left
table that do not match to the right table, including rows with NULL (empty)
values in the link column.
3. Right Join - returns all the values from the right table and matched values
from the left table (NULL in the case of no matching join predicate).
4. Outer Join - the union of all the left join and right join values.
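The four join types can be sketched in a few lines; the tables, keys, and values below are hypothetical, and in practice a tool such as SQL or pandas would perform the join:

```python
# Joining two tables keyed by "id": A holds names, B holds cities.
A = {1: "Alice", 2: "Bob", 3: "Carol"}    # left table
B = {2: "Manila", 3: "Cebu", 4: "Davao"}  # right table

inner = {k: (A[k], B[k]) for k in A.keys() & B.keys()}          # matching keys only
left = {k: (A[k], B.get(k)) for k in A}                         # all left keys; None = NULL
right = {k: (A.get(k), B[k]) for k in B}                        # all right keys; None = NULL
outer = {k: (A.get(k), B.get(k)) for k in A.keys() | B.keys()}  # union of left and right joins
```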
Learning Resource Watch: Dr. Eugene Rex Jalao’s video on Data Integration https://www.youtube.com/watch?v=EpdIz2uH1aM&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I&index=21
2.2. Data Transformation
Data transformation is a process of transforming data from one format to another. It aims
to transform the data values into a format, scale or unit that is more suitable for analysis.
Data transformation is an important step in data preprocessing and a prerequisite for doing
predictive analytic solutions.
Here are a few common possible options for data transformation:
1. Normalization - a way to scale a specific variable to fall within a small specified range
   a. Min-max normalization - rescaling values to a new scale such that all
      values fall within a standardized range, commonly [0, 1].
   b. Z-score standardization - transforming a numerical variable to a standard
      normal scale (mean 0, standard deviation 1).
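Both options can be written out directly; a minimal Python sketch, with a hypothetical income column:

```python
# Min-max normalization to [0, 1] and z-score standardization.
import statistics

income = [20, 35, 50, 80, 120]  # hypothetical values, in thousands

lo, hi = min(income), max(income)
minmax = [(x - lo) / (hi - lo) for x in income]  # rescaled so values fall in [0, 1]

mu = statistics.mean(income)
sigma = statistics.pstdev(income)
zscores = [(x - mu) / sigma for x in income]     # mean 0, standard deviation 1
```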
2. Encoding and Binning
   a. Binning - the process of transforming numerical variables into categorical
      counterparts.
      i. Equal-width (distance) partitioning
Divides the range into N intervals of equal size, thus forming a
uniform grid.
ii. Equal-depth (frequency) partitioning
Divides the range into N intervals, each containing
approximately the same number of samples.
   b. Encoding - the process of transforming categorical values into binary or
      numerical counterparts, e.g., mapping male or female in a gender variable
      to 1 or 0. Data encoding is needed because some data mining
      methodologies, such as linear regression, require all data to be numerical.
i. Binary Encoding (Unsupervised)
Transformation of categorical variables by taking the values 0
or 1 to indicate the absence or presence of each category.
If the categorical variable has k categories, we would need to create k binary variables.
ii. Class-based Encoding (Supervised)
• Discrete Class
Replace the categorical variable with just one new
numerical variable and replace each category of the
categorical variable with its corresponding probability of
the class variable.
• Continuous Class Replace the categorical variable with just one new numerical variable and replace each category of the categorical variable with its corresponding average of the class variable.
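Equal-width binning and binary (one-hot) encoding can be sketched as follows; the variable names and cut-offs are illustrative, not from the course data:

```python
# Equal-width binning of a numeric variable and binary encoding of a categorical one.
ages = [18, 22, 25, 31, 38, 44, 52, 60]

# Equal-width partitioning: split the range [18, 60] into 3 intervals of width 14.
n_bins, lo, hi = 3, min(ages), max(ages)
width = (hi - lo) / n_bins
age_bins = [min(int((a - lo) // width), n_bins - 1) for a in ages]  # bin index 0..2

# Binary encoding: k categories become k 0/1 indicator variables.
genders = ["male", "female", "female", "male"]
categories = sorted(set(genders))  # ["female", "male"]
onehot = [[1 if g == c else 0 for c in categories] for g in genders]
```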
Learning Resources Watch: 1. Dr. Eugene Rex Jalao’s video on Data Transformation
https://www.youtube.com/watch?v=ihHGKlAKL_s&index=18&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I
2. Dr. Eugene Rex Jalao’s video on Data Encoding https://www.youtube.com/watch?v=wLqJ3HRtC_w&index=22&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I
2.3. Data Cleaning
All data sources potentially include errors and missing values; data cleaning addresses
these anomalies. Data cleaning is the process of altering data in a given storage resource
to make sure that it is accurate and correct. Data cleaning routines attempt to fill in
missing values, smooth out noise while identifying outliers, and correct inconsistencies in
the data, as well as resolve redundancy caused by data integration.
Data Cleaning Tasks:
1. Fill in missing values
Solutions for handling missing data:
a. Ignore the tuple
b. Fill in the missing value manually
c. Data Imputation
- Use a global constant to fill in the missing value
- Use the attribute mean to fill in the missing value
- Use the attribute mean for all samples belonging to the same class
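The imputation options above can be sketched briefly (the ages and the constant are hypothetical):

```python
# Filling in missing values (represented here as None).
import statistics

ages = [25, None, 40, None, 35]

# (a) Use a global constant to fill in the missing value.
filled_const = [a if a is not None else -1 for a in ages]

# (b) Use the attribute mean of the observed values.
mean_age = statistics.mean([a for a in ages if a is not None])
filled_mean = [a if a is not None else mean_age for a in ages]
```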
2. Cleaning noisy data
Solutions for cleaning noisy data:
a. Binning - transforming numerical values into categorical components
b. Clustering - grouping data into corresponding clusters and using the cluster
   average to represent the values

c. Regression - fitting a regression line to smooth a very erratic data set

d. Combined computer and human inspection - detecting suspicious values
   and checking them through human intervention
3. Identifying outliers
Solutions for identifying outliers:
a. Box plot
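The box-plot rule is easy to state in code: values beyond 1.5 times the interquartile range from the quartiles are flagged. A sketch with made-up data:

```python
# Flagging outliers with the 1.5 * IQR box-plot rule.
import statistics

data = [10, 12, 13, 14, 15, 15, 16, 18, 45]  # 45 looks suspicious

q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low or x > high]  # -> [45]
```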
Learning Resource Watch: Dr. Jalao’s video on Data Cleaning https://www.youtube.com/watch?v=qKC4oPpcbEg&index=23&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I
2.4. Data Reduction and Manipulation
Data reduction is the process of obtaining a reduced representation of the data set that is
much smaller in volume yet produces the same (or almost the same) analytical results.
The need for data reduction arises because a database or data warehouse may store
terabytes of data, and complex data analysis/mining may take a very long time to run on
the complete data set.
Data Reduction Strategies:
1. Sampling - utilizing a smaller representative subset, or sample, of the big data set
   or population that generalizes to the entire population.
A. Types of Sampling
i. Simple Random Sampling - there is an equal probability of selecting
any particular item.
ii. Sampling without replacement - as each item is selected, it is
removed from the population
iii. Sampling with replacement - objects are not removed from the
population as they are selected for the sample
iv. Stratified sampling - split the data into several partitions, then draw
random samples from each partition.
2. Feature Subset Selection - reduces the dimensionality of data by eliminating
redundant and irrelevant features.
A. Feature Subset Selection Techniques
i. Brute-force approach - try all possible feature subsets as input to
data mining algorithm
ii. Embedded approaches - feature selection occurs naturally as part
of the data mining algorithm
iii. Filter approaches - features are selected before data mining
algorithm is run
iv. Wrapper approaches - use the data mining algorithm as a black
box to find the best subset or attributes
3. Feature Creation - creating new attributes that can capture the important
information in a data set much more efficiently than the original attributes.
A. Feature Creation Methodologies
i. Feature Extraction
ii. Mapping Data to New Space
iii. Feature Construction
Learning Resource Watch:
Dr. Jalao’s video on Data Reduction and Manipulation
https://www.youtube.com/watch?v=-JPopvvngsQ&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I&index=19
MODULE 3: POST-PROCESSING AND VISUALIZATION OF DATA INSIDE THE DATA WAREHOUSE
Introduction
Let us now learn how we can post-process and visualize the data inside the data
warehouse.
Learning Objectives
After working on this module, you should be able to:
1. Understand various techniques used for post-processing of discovered structures
and visualization.
3.1. Exercises using R
First, what is R? R is an integrated suite of software facilities for data manipulation,
calculation and graphical display.
It has an effective data handling and storage facility. It also has a large, coherent,
integrated collection of intermediate tools for data analysis. In addition, it has graphical
facilities for data analysis and display either directly at the computer or on hard copy.
Take note that R is not a database but connects to a DBMS. It is not a spreadsheet view
of data, but it connects to Excel/MS Office.
R is free and open source, though it has a steep learning curve. The RStudio IDE is a
powerful and productive third-party user interface for R. It is free, open source, and works
great on Windows, Mac, and Linux.
Exercises for this session will include the following:
1. Working with dataset Wage
2. Studying, reducing and structuring the dataset
3. Plotting the dataset
4. Introducing a business analytics task for the dataset
5. Working with another dataset
In post-processing, remember that data extracted from a data warehouse, or pieces of
knowledge extracted from an initial data mining task, can be further processed. We can
simplify the data, apply descriptive statistics, do visualization or graphing tasks, or
apply further business analytics tools.
Watch the "Data Post-processing" video by Raymond Lagria to understand preliminaries,
data frames, reading data, subsetting, graphing and plotting, and regression analysis in
R.
Always take note to transform your dataset into your desired format before applying further
data mining techniques.
Study Question If you were a business manager, what types of visualizations for the data warehouse’s
data would you like to see?
3.2. Case Study
Let us continue to see how post-processing and plotting is done with R in the “Data Post-
processing” Video by Raymond Lagria.
https://www.youtube.com/watch?v=0fgDbPhegg4&index=86&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I&t=0s
References
https://www.analyticsvidhya.com/blog/2015/07/guide-data-visualization-r/
Data Post-Processing (Slides) by Raymond Lagria
Data Post-Processing (Video) by Raymond Lagria
https://www.youtube.com/watch?v=0fgDbPhegg4&index=86&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I&t=0s
Assignment for Unit 2 - Data Preprocessing
Open the bankdata.csv file. The Bank Dataset contains the independent variables
age, region, income, sex, married, children, car, save_act, current_act, and
mortgage, and one response variable, pep, which answers the question “Did the
customer buy a PEP (Personal Equity Plan) after the last mailing?” with a yes/no
response.
Using the lessons learned in Unit 2, conduct the following:
1. Normalize the income variable into a [0,1] scale. (10 points)
2. Create an equal-depth (frequency) variable for Income where the new variable
could take in “Low”, “Medium”, and “High” data. (15 points)
3. With reference to the region and pep variables, create a new numerical variable
(region_encoded) containing the numerical equivalent of each category of the
region variable. Replace each category with its corresponding probability of the
pep variable. (25 points)
Other References Used for Unit II:
Malik, J. S., Goyal, P., & Sharma, A. K. A comprehensive approach towards data preprocessing techniques & association rules. IES-IPS Academy, Rajendra Nagar, Indore 452012, India. Available at URL https://bvicam.ac.in/news/INDIACom%202010%20Proceedings/papers/Group3/INDIACom10_279_Paper%20(2).pdf
Son NH (2006) Data mining course—data cleaning and data preprocessing. Warsaw University. Available at URL http://www.mimuw.edu.pl/~son/datamining/DM/4-preprocess.pdf
Malley B., Ramazzotti D., Wu J.T. (2016) Data Pre-processing. In: Secondary Analysis of Electronic Health Records. Springer, Cham. Available at URL https://link.springer.com/chapter/10.1007%2F978-3-319-43742-2_12#Sec2
UNIT III: DATA VISUALIZATION AND COMMUNICATION
Introduction

Objectives:

1. Define visualization;
2. Give examples of charts; and
3. Describe what makes an effective visualization.
Learning Resources 1. Watch: The Beauty of Data Visualization
https://www.oercommons.org/courses/the-beauty-of-data-visualization/view
2. Watch: Raymond Freth Lagria “Visualization” Video
https://www.youtube.com/watch?v=Yu1qoZ6Y9EU&index=88&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I
3. Read: MIT Statistics and Visualization for Data Analysis Lecture Notes from OER Commons
https://www.oercommons.org/courses/statistics-and-visualization-for-data-analysis-and-inference-january-iap-2009/view
1.1. What is Visualization?
Watch this OER video from TED-Ed to understand the importance of data visualization:
https://www.oercommons.org/courses/the-beauty-of-data-visualization/view
Visualization is the presentation of information using spatial or graphical representations
for the purposes of facilitating comparison, recognizing patterns, and supporting general
decision-making.
It makes use of the human senses to understand data sets. Seeing things visually allows
humans to easily notice patterns, trends, and comparisons in ways that looking at raw
numbers does not.
Some examples are provided in Lagria’s video from [02:47]
1.2. Types of Visualizations
Data can be visualized in a number of ways. Lagria’s video presents two types of
visualization: those meant for exploring and calculating, and those meant for
communicating information.
A Graph is a medium of visualization designed to communicate information. Depending
on the type of data, there is almost always a suitable graph to use.
Categorical Data can be visualized through:
1. Bar Graph
2. Pie Chart
3. Pareto Chart
4. Side-by-side chart
Numerical Data can be displayed using:
1. Stem-and-Leaf Display
2. Histogram
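A histogram is just an equal-width frequency distribution drawn as bars; the counting step can be sketched without any plotting library (the values are hypothetical):

```python
# Counting values into 4 equal-width bins: the data behind a histogram.
values = [3, 5, 7, 8, 11, 12, 14, 18, 19, 20]

n_bins, lo, hi = 4, min(values), max(values)
width = (hi - lo) / n_bins                       # (20 - 3) / 4 = 4.25
counts = [0] * n_bins
for v in values:
    i = min(int((v - lo) // width), n_bins - 1)  # last bin includes the maximum
    counts[i] += 1
```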
Charts, on the other hand, typically refer to a visualization medium that shows structure
and relationship. Some examples are flowcharts and network diagrams. Note that these
definitions are quite fluid. For example, even though it’s technically a graph, we refer to a
circle divided into sections showing proportion as a “Pie Chart”.
Finally, we call schematic pictures or illustrations of objects and entities Diagrams.
There are many different types of graphs and charts. You can learn more about these in
Lagria’s video from 06:52, and on the MIT Lecture Notes OER from p.26 to 42.
1.3. Visual Design Principles
Lagria’s video (starting from 15:35) describes a 1994 study by Lohse, which involved 60
participants and their responses to different types of visualizations.

Some of the findings were:

1. Simple images are better: icons were preferred over photographs.
2. Graphs and tables were the most self-similar categories.
3. Animation is recommended for temporal data.
One of the important characteristics of data visualization is that it needs to have
preattentive properties. This means that it can communicate without requiring the viewer
to pay close attention, which can be likened to “glance value.” It is determined by factors
described in the study, such as eye movement measured in milliseconds. Color, shape,
and order can also determine how preattentive a visualization is.
Lagria’s video also presents Tufte’s principles of graphical design excellence starting
from 22:22, introducing concepts such as graphical integrity, data ink, density, and the
lie factor. These also affect how visualizations are perceived.
Study Questions
What is the purpose of visualizing data?
What are some types of graphs? Give examples that use each.
UNIT IV: ETHICS IN DESCRIPTIVE ANALYTICS
This unit intends to:
1. Familiarize students with possible ethical and legal dilemmas in research;
2. Familiarize students with ethical and legal guidelines that can be applied to
descriptive analytics; and
3. Make students realize the implications of ethical and legal descriptive analytics.
1.1. Ethics in Descriptive Analytics: Dilemmas and Guidelines
Many of us are familiar with the scientific method and scientific research. While these
contribute to the body of knowledge and worldwide advancement, they could also
compromise people and data when conducted without ethics. Research ethics started
primarily in the area of health research, where human participants served as guinea pigs
in clinical trials. Over the decades, scholars from different disciplines realized that ethics
is not only applicable to health research; it is applicable to everything that requires the
use of data.
Why is research ethics important? As the group that trains researchers from all disciplines
in the Philippines, the Philippine Health Research Ethics Board of the Department of
Science and Technology argued that research ethics should be embedded in all research
processes primarily because of the following reasons:
• It is the right thing to do;
• It protects research participants;
• It provides advocates for research participants;
• It preserves credibility, trust, and accountability;
• It reduces liabilities, wasted time, and resources;
• It turns useless, harmful, and worthless research into useful, helpful, and worthy research.
How do these apply to descriptive analytics? As lawyer Emerson Banez pointed out,
descriptive analytics is an activity embedded within society; that society has norms and
is governed by laws, and these already give people an idea of the right thing to do. Before
embarking on a descriptive analytics activity, check the policies of the company to be
studied. These policies reflect the company's norms and incorporate the laws that
regulate the industry to which the company belongs. Ensure that you will not break any of
these policies in your conduct of descriptive analytics.
You may also argue, based on what you have learned, that descriptive analytics and
business analytics in general do not deal much with human participants who must be
protected and advocated for. However, it is important to note that data are products of
human activities: even when not collected directly from individuals, they still require
proper handling and judgment. Lawyer Emerson Banez discussed the importance of
avoiding bias and discrimination when dealing with data. In the previous units, you
learned about sampling and various descriptive statistical techniques. To be ethical, one
must not be selective about which data are analyzed. Selection must not be biased
towards data that reflect only the outcomes the researchers desire; the data must be
chosen objectively in order to guide the company towards better decision-making.
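As a simple illustration of this point (a hypothetical sketch, not part of the course materials; the sales figures are invented), contrast cherry-picking records that support a desired conclusion with drawing a uniform random sample, which gives every record an equal chance of being analyzed:

```python
import random

# Hypothetical daily sales figures; names and values are illustrative only.
sales = [120, 95, 300, 40, 210, 75, 180, 60, 250, 130]

# Biased selection: keeping only the figures that flatter performance.
cherry_picked = [s for s in sales if s >= 200]  # [300, 210, 250]

# Unbiased selection: a uniform random sample of the records.
random.seed(42)  # fixed seed so the example is reproducible
unbiased_sample = random.sample(sales, k=5)

print(cherry_picked)
print(unbiased_sample)
```

The cherry-picked list would suggest the business is thriving, while the random sample reflects the actual spread of outcomes.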
In addition to the lack of discrimination and bias, lawyer Emerson Banez also pointed out
the importance of integrity, transparency, and accountability. On the premise that the data
you used have been compromised, full disclosure must be done to ensure that the
company will be protected against decision based on bad descriptive analytics.
Perhaps the most important aspect of ethics in descriptive analytics is privacy. Data must
not expose the individuals involved in the activities. Data privacy is not only an ethical
obligation; it is also a law in the Philippines. Republic Act 10173, also known as the Data
Privacy Act, recognizes the need for citizens' data to be protected and secured. This
means that consent must be sought from the individuals whose activities generate the
data analyzed in descriptive analytics. One must exercise caution and ensure that no
privacy is violated in the acquisition, processing, and dissemination of data for descriptive
analytics; otherwise, the company that used the data may be fined or shut down over
legal issues. By being cautious, descriptive analytics can guide the company towards
competitive advantage without liabilities and wasted resources.
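One common technical safeguard, shown here as a minimal hypothetical sketch (not legal advice on Data Privacy Act compliance; the names, salt, and records are invented), is to pseudonymize direct identifiers before analysis, for example by replacing customer names with one-way hashes so that analysts never see the individuals behind the records:

```python
import hashlib

def pseudonymize(identifier: str, salt: str = "course-demo-salt") -> str:
    """Replace a direct identifier with a truncated one-way SHA-256 hash.

    The salt shown here is illustrative; in practice it must be kept
    secret and managed separately from the analytics dataset.
    """
    return hashlib.sha256((salt + identifier).encode("utf-8")).hexdigest()[:12]

# Hypothetical transaction records; the analyst works with pseudonyms only.
records = [("Juan Dela Cruz", 1200.50), ("Maria Santos", 830.00)]
pseudonymized = [(pseudonymize(name), amount) for name, amount in records]

for pid, amount in pseudonymized:
    print(pid, amount)
```

Because the same identifier always maps to the same pseudonym, totals and counts can still be computed per customer without exposing who the customer is.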
Learning Resources
1. Watch: Atty. Emerson Banez's video on ethical issues
https://www.youtube.com/watch?v=LRn6Nvd6Qqc&index=46&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I
2. Watch: Dominic Ligot's Ethical Implications of Business Analytics
https://www.youtube.com/watch?v=PhDHtc_8nm8&index=64&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I
References
Government of the Philippines. (2012). Republic Act 10173 – Data Privacy Act of 2012.
Retrieved from https://privacy.gov.ph/data-privacy-act/
Philippine Health Research Ethics Board. (n.d.). An introduction to ethics in research.
Department of Science and Technology, Taguig City.
Assignment
Write a two-page self-reflection on how the course contributed to your understanding of
descriptive analytics in business.