Fundamentals of
Descriptive Analytics A Business Analytics Course
University of the Philippines Open University
Dr. Melinda Lumanta Ms. Louise Villanueva Dr. Eugene Rex Jalao Ms. Marie Karen Enrile
Asst. Prof. Joyce Manalo
Course Writers
Fundamentals of Descriptive Analytics 1
University of the Philippines
OPEN UNIVERSITY
COMMISSION ON HIGHER EDUCATION
Fundamentals of Descriptive Analytics
Course Package
This learning package consists of:
1. Course Guide
2. Study Guides
3. Video lectures (available at UPOU Networks and on the attached USB)
4. Assessments
UNIVERSITY OF THE PHILIPPINES OPEN UNIVERSITY
Fundamentals of Descriptive Analytics
A Business Analytics Course
This course aims to introduce students to the fundamentals of descriptive analytics.
Descriptive analytics makes use of current transactions to enable managers to visualize
how the company is performing. This course will teach students how to prepare reports
using descriptive analytics tools.
Prerequisite: Fundamentals of Data Warehousing
COURSE OBJECTIVES
At the end of the course, the students should be able to:
1. Explain the concepts in descriptive statistics
2. Contextualize the descriptive statistics concepts and analytical techniques in
business decision-making
3. Explain the importance of data pre-processing
4. Apply data pre-processing techniques in business
5. Explain the importance of data visualization and communication
6. Apply data visualization techniques to communicate the results of descriptive
analytics to stakeholders
7. Develop an awareness of ethical norms as required under policies and applicable
laws governing confidentiality and non-disclosure of data/information/documents
and proper conduct in the learning process and application of business analytics.
COURSE OUTLINE
UNIT I. Introduction to Descriptive Analytics
MODULE 1. Statistics in Business
A. Data and data sets
B. What is statistics?
C. Bases in choosing what statistics to use
D. Application of statistics in business
MODULE 2. Basic Descriptive Statistics
A. Frequency distributions
B. Measures of location
C. Measures of dispersion
D. Measures of association
E. Measures of shape and other statistics
MODULE 3. Sampling and Data Collection
A. Types of sampling
B. Central limit theorem
Unit II. Data Preprocessing
MODULE 1: Basic Concepts in Data Preprocessing
A. What is data pre-processing?
B. Tasks for data processing
Module 2: Methods for Data Preprocessing
A. Data Integration
B. Data Transformation
C. Data Cleaning
D. Data Reduction
MODULE 3: Post-Processing and Visualization Of Data Inside The Data Warehouse
Unit III. Data Visualization and Communication
Unit IV. Ethics
COURSE MATERIALS
1. Course guide
2. Study guides per module
3. Video lectures
4. Additional reading materials in digital forms
STUDY SCHEDULE

Week 1: Course Overview
1. Read the course guide
2. Participate in Discussion Forum 1: introduce yourself and write a brief reflection paper about the importance of big data in businesses today

Weeks 2-4: Unit I - Introduction to Descriptive Analytics
1. Go through Modules 1 to 3
2. Participate in Discussion Forums 2 to 4
3. Watch the videos on Basic Descriptive Statistics and on Sampling and Data Collection by Dr. Lisa S. Bersales
4. Submit the required assignment as specified in the study guide.

Weeks 5-9: Unit II - Data Pre-processing
1. Go through Modules 5 to 6
2. Watch the videos on Data Processing by Dr. Eugene Rex L. Jalao
3. Participate in Discussion Forum 5
4. Submit the required assignment as specified in the study guide.

Weeks 10-14: Unit III - Data Visualization and Communication
1. Go through Module 7.
2. Watch the video on Data Visualization and Communication
3. Participate in Discussion Forum 6

Week 15: Unit IV - Ethics
1. Go through Unit IV.
2. Watch the video on Ethics by Atty. Emerson Banes and Mr. Dominic Ligot
3. Write a reflection paper on ethics in descriptive analytics in business.

Week 16: Course Evaluation
1. Write a self-reflection on how the course contributed to your understanding of descriptive analytics in business.
COURSE REQUIREMENTS
For you to pass the course, you will be evaluated on the following required activities:
Unit 1: 20%
Unit 2: 35%
Unit 3: 35%
Unit 4: 10%
Online Discussions
There will be a series of online discussions and activities for this course. In addition to
gauging your understanding of the course topics, the online discussions provide
everybody an opportunity to apply the concepts discussed in the modules in specific
situations.
As we progress through the course, we will be posting discussion topics and specific
questions/ instructions, so make it a point to visit the course regularly.
Remember the following when participating in online discussions:
• All discussions will take place in the course site. A separate discussion forum will
be created for each topic.
• If you wish to acquire the Certificate of Completion, you are encouraged to contribute
to the discussions by answering the discussion question and/or reacting to each
discussion topic. Passing remarks like "I agree" are not considered substantial.
• Do not post lengthy contributions. Be clear on what your main point is and express
it as concisely as possible.
• The forums will remain open throughout the course's duration.
• Please be guided by netiquette rules (see
http://www.albion.com/netiquette/corerules.html) when participating in online
discussions. Respond to other postings courteously. Personal messages should
be emailed directly to the person concerned.
• If you would like to use some printed or online reference materials in your posting,
don't forget to cite them accordingly (e.g., According to Hernandez (2010), this
concept is...).
Assignments
The assignment is intended to help you integrate and apply what you have learned. Specific
instructions will be posted in the course site.
If you wish to get the Certificate of Completion for this course, you must submit and get a
passing mark in the assignment. Online submission of assignments will be in the
Assignment Bin.
GENERAL GUIDELINES

Please comply with the following house rules:
• You are always expected to uphold academic integrity and intellectual honesty as
a learner. Cheating or plagiarism is not allowed.
• Submit your assignment on time.
• Observe deadlines. Follow the schedule of course activities, submit your
assignments on time, and never ask for an exemption from a required task. Read
in advance. Try to anticipate possible conflicts between your personal schedule
and the course schedule, and make the necessary adjustments to your study
schedule.
• Limit the comments and materials you post to those that are relevant to the course
topics. For your profile photo, do not post an informal photo or a photo that would
be more appropriate for a personal website. Maintain a professional demeanour
in all courses.
UNIT I: INTRODUCTION TO DESCRIPTIVE ANALYTICS
This unit intends to:
1. Introduce basic statistical concepts;
2. Introduce basic descriptive statistics; and
3. Introduce sampling and data collection.
MODULE 1: STATISTICS IN BUSINESS
1.1. Data and Data Sets
This part of the module is intended to familiarize the students with data and data sets
which serve as foundations of statistics and analytics.
Learning Objectives
At the end of this part of the module the students must be able to do the following:
1. Differentiate the types of data and levels of measurement
2. Differentiate the types of data sets and identify where each is best used
3. Differentiate the two branches of statistics:
a. Descriptive statistics
b. Inferential statistics
4. Determine appropriate use of statistics in business analytics.
Key Concepts
Attributes and Variables
Anyone who wants to embark on analytics must first start with data. These data are
composed of objects and their attributes. For example, the new human resources
manager in a pharmaceutical company wants to know the profile of their sales
representatives. The new human resources manager is presented with the table below.
Sales Representative’s Performance for 1st Quarter 2018
Sales Representative | Age | Sex | City | Standardized Product Expertise Test Scores | Total Amount of Quarterly Sales | Rank Based on Quarterly Sales
Abad, Maria | 23 | Female | Caloocan | 96 | 3,500,430.40 | 13
Basilio, Anna | 27 | Female | Las Piñas | 91 | 3,850,875.30 | 12
Cruz, Juan | 28 | Male | Makati | 92 | 5,290,320.50 | 2
Delos Santos, Jose | 23 | Male | Malabon | 94 | 3,216,739.95 | 14
Encarnacion, Leonora | 24 | Female | Mandaluyong | 94 | 4,589,850.00 | 6
Fajardo, Mario | 30 | Male | Manila | 95 | 4,670,902.25 | 5
Guzman, Emilio | 22 | Male | Marikina | 97 | 3,993,741.50 | 9
Herminio, Adela | 35 | Female | Muntinlupa | 96 | 3,890,004.70 | 10
Ilagan, Bienvenido | 28 | Male | Navotas | 93 | 2,863,045.25 | 15
Jacob, Rosa | 29 | Female | Parañaque | 92 | 4,097,589.75 | 8
Kalaw, Clarissa | 22 | Female | Pasay | 98 | 3,856,910.15 | 11
Lagman, Francisco | 28 | Male | Pasig | 94 | 4,970,438.25 | 4
Montero, Antonio | 31 | Male | Pateros | 92 | 1,368,495.45 | 16
Nuñez, Isabel | 25 | Female | Quezon City | 97 | 5,400,369.90 | 1
Ortiz, Katrina | 27 | Female | San Juan | 90 | 4,283,907.72 | 7
Pantaleon, Roel | 34 | Male | Taguig | 96 | 5,278,900.80 | 3
In this table, the objects are the sales representatives and the attributes are the
characteristics that these representatives have—age, sex, area or city of assignment, test
scores, and their quarterly performance in the form of sales.
Variables and Levels of Measurement
In statistics, attributes which are organized for further data processing are called variables.
There are different types of variables, and each type has properties that determine how it
can be subjected to analysis.
Nominal variables are characterized by distinctness. They are labels and are non-
numerical. In the table, we can say that the sales representative’s sex and city of
assignment are nominal variables. Males are distinct from females. No intrinsic ordering
can be observed in the cities of assignment.
Another type of variable is the ordinal variable. These are variables that pertain to order.
The ranks given to the sales representatives based on their quarterly sales are considered
as ordinal. From the data on rank, we can say that Juan Cruz who ranked 2nd sold more
than Roel Pantaleon who ranked 3rd but less than Isabel Nuñez who ranked 1st. Looking at
the ranks of these sales representatives does not give an idea about the difference
between the quarterly sales of these sales representatives.
The last three variables in the table are product expertise test scores, age, and quarterly
sales. These variables can be classified into what are referred to collectively as interval-ratio
variables. The two are often taken together since these variables are quantitative
in nature. The standardized product expertise test scores are considered as interval
variables. This is because standardized test scores like the IQ are often of arbitrary origin
(one does not necessarily start with zero) and there is a fixed distance between the scores.
For example, we can say that the distance between 91 and 92 is also equal to the
distance between 95 and 96.
Meanwhile, variables such as age and quarterly sales are considered as ratio variables.
These variables have the characteristics of nominal, ordinal, and interval variables.
However, unlike interval variables, ratio variables have absolute zero origins. At some
point, a sales representative does start with zero sales for the quarter. Age as a variable
is also characterized by a meaningful zero point which is upon someone’s birth.
Quantitative variables such as interval and ratio may be discrete or continuous. Discrete
variables are those that take the form of an integer. Meanwhile continuous variables take
the form of real numbers.
Nominal, ordinal, interval, and ratio variables are also referred to as levels of
measurement. The numerical properties of interval and ratio variables permit their use in
higher statistical tests. Meanwhile, nominal and ordinal variables are often used for
descriptive purposes.
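As a rough guide, each level of measurement admits a different kind of "typical value" summary. The sketch below (Python standard library; the sample values are drawn from the sales table above, and any statistical tool would do equally well) shows a conventional choice for each level:

```python
import statistics
from collections import Counter

cities = ["Makati", "Manila", "Makati"]  # nominal: only counting makes sense
ranks = [13, 12, 2, 14]                  # ordinal: order matters, so use the median
ages = [23, 27, 28, 23]                  # ratio: arithmetic is meaningful, so use the mean

# Mode for nominal data: the most frequent label
print(Counter(cities).most_common(1)[0][0])  # Makati

# Median for ordinal data: the middle position
print(statistics.median(ranks))              # 12.5

# Mean for interval/ratio data: arithmetic average
print(statistics.mean(ages))                 # 25.25
```

The point of the sketch is the restriction, not the functions themselves: averaging city names is meaningless, while averaging ages is not.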
Study Question

Think of an interesting phenomenon that you want to study in your organization. List all of the possible variables and categorize them according to type.
Data Sets
While aggregated data are important, sets of data have proved to be more useful to
organizations. These permit organizations to analyze and interpret scenarios effectively
and efficiently. There are three main types of data sets: record, graph, and ordered data
sets.
Record data set
Record data sets are those that are structured and presented in rows. Record data sets
may come in texts, numbers, or sequences.
The table about quarterly sales is considered as a collection of record data.
Another kind of record data is an m x n data matrix where m represents the rows or the
numerical objects and n the columns for the numerical attributes. This is a matrix
composed of real numbers.
Projection of x Load | Projection of y Load | Distance | Load | Thickness
10.23 | 5.27 | 15.22 | 2.7 | 1.2
12.65 | 6.25 | 16.22 | 2.2 | 1.1
Aside from the m x n data matrix, record data can also come in the form of a term-by-document
data set. This serves as a means to count how many times a term appears in a document.

 | Area | Sales | Quota
Document 1 | 1 | 4 | 3
Document 2 | 2 | 5 | 3
Document 3 | 1 | 9 | 4
A special kind of record data is composed of combinations of items or services that are often bought or lumped together. This is called the market basket or transaction data set.
Transaction ID | Items
1 | Coffee, Pancakes
2 | Coffee, Pancakes, Hash Brown
3 | Pineapple Juice, Pancakes, Hash Brown
4 | Pineapple Juice, Rice, Egg
5 | Pineapple Juice, Rice, Egg, Beef Steak
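One common use of a transaction data set is to count how often pairs of items appear in the same basket. A minimal sketch (Python standard library; the baskets are the five transactions from the table above):

```python
from collections import Counter
from itertools import combinations

transactions = [
    {"Coffee", "Pancakes"},
    {"Coffee", "Pancakes", "Hash Brown"},
    {"Pineapple Juice", "Pancakes", "Hash Brown"},
    {"Pineapple Juice", "Rice", "Egg"},
    {"Pineapple Juice", "Rice", "Egg", "Beef Steak"},
]

# Count every pair of items bought together in the same transaction
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(3))
```

Pairs such as (Coffee, Pancakes) or (Egg, Rice) each appear in two transactions; such co-occurrence counts are the starting point of market basket analysis.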
Graph data set
Graph data sets are those that represent relationships through the interconnections of
points. This can be commonly observed in sociograms and matrices that show the
interaction between and among individuals in networks.
Ordered data set

Ordered data sets are those that show data over certain sequences, periods, or progressions. One of the most common ordered data sets is the time series. This includes data of a certain variable over a period of time.
Sales per quarter for the last 5 years (in millions)
Year | Quarter 1 | Quarter 2 | Quarter 3 | Quarter 4
2011 | 25 | 21 | 24.3 | 29.3
2012 | 25.4 | 19.7 | 25.6 | 27.9
2013 | 23 | 20.1 | 26.2 | 28.9
2014 | 26.7 | 18.3 | 24.5 | 29
2015 | 25.2 | 20.9 | 26 | 28.2
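Because an ordered data set carries a sequence, it can be summarized along that sequence. For instance, averaging each quarter across the five years in the table above reveals a seasonal pattern (a sketch in Python; the figures are copied from the table):

```python
# Quarterly sales (in millions) from the table above
sales = {
    2011: [25, 21, 24.3, 29.3],
    2012: [25.4, 19.7, 25.6, 27.9],
    2013: [23, 20.1, 26.2, 28.9],
    2014: [26.7, 18.3, 24.5, 29],
    2015: [25.2, 20.9, 26, 28.2],
}

# Average each quarter across the five years
quarter_avgs = [sum(year[q] for year in sales.values()) / len(sales) for q in range(4)]
for q, avg in enumerate(quarter_avgs, start=1):
    print(f"Quarter {q} average: {avg:.2f}")
```

The output shows Quarter 2 is consistently the weakest quarter and Quarter 4 the strongest, something that is hard to see from the raw rows alone.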
1.2. Introduction to Statistics
This part of the module is intended to familiarize the students with descriptive and
inferential statistics. This will serve as the backbone for the forthcoming modules that
tackle statistical measures and tests that could be applied to describe, predict, and infer
information based on the available data.
Key Concepts
Statistics as a field of study and in business
[Figure: Word cloud of basic statistical concepts]
Statistics as a field can be traced to Europe in the 1500s. Statesmen and scholars
from Great Britain, France, and Sweden were urged to make sense of data gathered from
the census (Stephenson, 2000). In 1662, the first demographic report on mortality was
produced by John Graunt based on the weekly mortality reports in London (Encyclopaedia
Britannica, 2012).
Many statistical reports on demographics emerged as the field also progressed from
description to inference. This could largely be attributed to the advancement in
mathematics, particularly in probability theory. Given the said history, statistics to this day
is inextricably linked to the field of mathematics while others consider statistics as a branch
of science. In this regard, statistics can be deemed as a meta-science or meta-language
that aims to collect, analyze, summarize, and interpret data (Stephenson, 2000).
While the use of statistics can be traced to the affairs of nation states, it has been proven
to be an integral part of knowledge creation in both the natural and the social sciences. In
fact, statistics was viewed as so complex that it was deemed accessible only to
those who chose it as their field of expertise. This led many to overlook its usefulness
until Sir Geoffrey Heyworth of the Royal Statistical Society (1950) made a case for the use
of statistics in business. Accordingly, business statistics must be simple for a businessman
to comprehend. It must also guide action, but it should not serve as a substitute to a
businessman’s judgment. Heyworth further recognized the application of statistics in
various facets of a business. However, he also warned that business statistics should be
guided by a businessman’s knowledge and experience because figures would only make
sense if coupled with an understanding of contexts. While there was no hard and fast rule
as to where statistics should be used, Heyworth considered it as a never-ending process
to make an idea more accurate and acceptable as it often served as a bridge between
initial and informed business judgments.
Heyworth’s assertions from 1950 remain relevant to this day. Businesses employ
statisticians who can guide them in deciding how to optimize their processes, target their
consumers, and create more buzz around their products and services, among others. These
actions, guided by business statistics and the businessman’s knowledge and experience,
are taken to minimize costs, maximize profits, and give the business a competitive advantage.
You have probably heard about television show ratings as publicized by rival networks.
These ratings are examples of business statistics. Since television remains one of
the most used media, advertisers of products and services rely on television show
ratings to determine where they could place their advertisements to reach consumers.
Ideally, the higher the exposure, the better.
The advent of new information and communication technologies has also ushered in new
ways of processing and using business statistics. You’ve probably grown accustomed to the
number of reactions, views, and shares of social media posts and uploads. These are
aggregated to determine the reach of social media pages, and thus, these also serve as
measures and drivers of business. Individuals and groups have created a new industry
out of their presence and activities in social media, and this industry is anchored in
business statistics.
Population and Parameters
Statistics as a field is associated with the research process. This entails gathering
data from concerned parties. One important concept in the field is the population. This pertains
to all of the items or individuals that a researcher, or in this case, a businessman would
want to study. Once these data have been gathered completely, the characteristics of the
items or individuals are called parameters. Let’s say that there is a businessman who sells
his product via an online platform. He is interested in determining customer satisfaction.
Based on the analytics provided by the online platform, a total of 10,000 customers bought
the product for the year. The businessman decides to conduct a survey of all 10,000
customers to get their demographics and an accurate product satisfaction rating. The
10,000 customers are considered as the population and their demographic details and
product satisfaction ratings are the parameters.
Sample and Statistic
Let’s say that the businessman consults the other members of his team regarding his plan
to conduct a survey with the population of 10,000 customers. The team expresses concern
about the resources that will be needed to reach all of the 10,000 customers. With their
knowledge and experience in business statistics, the businessman and his team decide
to select only 2,000 of the 10,000 customers for the year. These 2,000 customers are
considered as the sample and their demographic details and product satisfaction ratings
are the statistic.
While gathering data from the entire population is more accurate, the use of samples is
undeniably more effective and efficient.
[Figure: An illustration of the concepts of population and parameter, and of sample and statistic]
The two branches of statistics
Once the data are gathered from either the population or the sample, the analytic part of
business statistics comes in. Businessmen may opt to subject the data to descriptive
statistics or the branch of statistics that deals with procedures used to describe and
summarize the important characteristics of a sample or population (Mendenhall, Beaver
& Beaver, 2006). Let’s say that the businessman and his team managed to survey all the
10,000 consumers of his product for the year. Simple counting would inform the
businessman that 70%, or 7,000, of his consumers are men aged 30 to 40.
Since you’ve also been introduced to the concept of sample and statistic, you should also
know that these are crucial to the conduct of another branch of statistics. This branch is
called inferential statistics because this allows one to draw conclusions, make predictions,
and decide about the population based on the data gathered from the sample
(Mendenhall, Beaver & Beaver, 2006). Let’s say that the businessman and his team
forewent the survey of the population and pushed through with a random sample of 2,000
customers. The sample showed that 40%, or 800, of the consumers are happy with the product
and are considering buying an improved version in the following year. Guided by this value,
the businessman and his team may estimate that around 4,000 existing consumers may
buy the improved version of the product.
These simple examples show the functions of the two branches of statistics and their
application in the context of business.
Descriptive Statistics
Purpose
Your knowledge of populations and samples is useful for understanding descriptive statistics. As the name implies, descriptive statistics helps you describe and summarize the parameter of the population or the statistic of the sample. This may come in different measures:

A. Measures of location - these measures tell you the position of values in the frequency distribution. The most common measures of location are the measures of central tendency: the mean, median, and mode. Let's say that the businessman and his team found out that the average age of their consumers is 27. This means that the ages of the consumers center around 27, with some consumers younger and some older.

B. Measures of spread - measures of location are not enough to capture the variability among the data in the frequency distribution. This is the function of the measures of spread: they tell you how close together or far apart the values in the frequency distribution are. Some examples of measures of spread are the range, percentiles, variance, and standard deviation. Suppose the businessman and his team found out through the survey that product satisfaction was affected by customer service. This information prompted them to conduct another study based on the call logs of customer service representatives. They found that, on average, customer service representatives addressed customer concerns three days after an inquiry. However, the standard deviation was 2, indicating that some representatives managed to address inquiries in as little as one day while others took as long as five days. How could this happen? The businessman and his team should tackle this inconsistency to pave the way for more cost-efficient customer service.
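The call-log example can be reproduced with a short sketch (the three response times are hypothetical values chosen to yield the mean of 3 days and standard deviation of 2 described above):

```python
import statistics

# Hypothetical response times in days for three handled inquiries
response_days = [1, 3, 5]

print(statistics.mean(response_days))   # 3: inquiries addressed in three days on average
print(statistics.stdev(response_days))  # 2.0: sample standard deviation around that mean
```

Identical means can hide very different spreads, which is exactly why the measures of spread complement the measures of location.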
Assumptions
In contrast with inferential statistics, descriptive statistics only requires data that can be subjected to acceptable mathematical operations (Garbin, n.d.).
Inferential Statistics
Purpose
Descriptive statistics is useful when it comes to providing a summary of the data gathered
from either the population or the sample. However, statisticians have recognized that due
to limited resources, gathering data from the population is less popular than gathering data
from samples. In this regard, inferential statistics can be used. In contrast to descriptive
statistics, inferential statistics is used to estimate parameters and test hypotheses with the
data from the samples. This can lead to the generalization of data to the population.
Assumptions
Unlike descriptive statistics, inferential statistics has stricter prerequisites before this can
be applied to data. Aside from having data that can be subjected to acceptable
mathematical operations, inferential statistics also requires unbiased estimation since only
the samples are used to infer the parameters of the population (Garbin, n.d.). This
assumption entails the use of sampling or a process of selection where every case from
the population has an equal chance of being selected for the sample (Healey, 2009).
References
Text Book
OpenStax. (2016, September 28). Introduction to statistics. OpenStax CNX. Retrieved from https://cnx.org/contents/30189442-6998-4686-ac05-ed152b91b9de
Healey, J.F. (2009). Statistics: A tool for social research (8th ed.). USA: Wadsworth, Cengage Learning
Videos
Friedman, L.W. (2016). Introduction to business statistics. Retrieved from https://www.youtube.com/watch?v=poA0KntMgSM
Rigollet, Philippe, 18.650 Fundamentals of Statistics, Fall 2017. (Massachusetts Institute of Technology: MIT OpenCouseWare), https://www.youtube.com/watch?v=VPZD_aij8H0&list=PLUl4u3cNGP60uVBMaoNERc6knT_MgPKS0 (Accessed December 21, 2016). License: Creative Commons BY-NC-SA
Statistics Canada. (2013). Statistics: The invisible made visible. Retrieved from https://www.youtube.com/watch?v=_4GT5v0YaOE
SAS Software. (2013). How do you use statistics and how does it benefit your organization?. Retrieved from https://www.youtube.com/watch?v=LJV-Mlv-7dM
Websites
Garbin, C. (n.d.). Statistics and statistical tests: Assumptions and conclusions. Retrieved from http://psych.unl.edu/psycrs/941/q4/assumptions_141.pdf
John Graunt. (n.d.). In Encyclopaedia Britannica. Retrieved from https://www.britannica.com/biography/John-Graunt
Stephenson, D. (2000). Brief history of statistics. Retrieved from http://folk.uib.no/ngbnk/kurs/notes/node4.html
Web Center for Social Research. (2006). Descriptive statistics. Retrieved from https://www.socialresearchmethods.net/kb/statdesc.php
Royal Statistical Society. (1950). The use of statistics in business. The Journal of Royal Statistical Society, 113(1), 1-8. DOI: 10.2307/2980797
MODULE 2: BASIC DESCRIPTIVE STATISTICS
Introduction
We will focus on Summary Statistics. These are the different measures that are used to
describe any set of data. If we want to know the typical value of a certain variable, how
different the values are from one another, or how a certain data point compares to the rest,
we can use these measures.
2.1. Frequency Distribution
Frequency is simply the number of occurrences of an event. A frequency distribution is a
list, table or graph that displays the frequency of various outcomes in a sample. It tells us
how many there are of each item in the data set.
Frequency distribution can show us both the raw count of each item and its percentage
of the total.
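For instance, a frequency distribution of the sex of the 16 sales representatives from Module 1 can be produced with a few lines of Python (a sketch using only the standard library; any statistical tool would give the same result):

```python
from collections import Counter

# Sex of the 16 sales representatives from the table in Module 1 (8 female, 8 male)
sexes = ["Female"] * 8 + ["Male"] * 8

freq = Counter(sexes)
for value, count in freq.items():
    share = count / len(sexes)
    print(f"{value}: {count} ({share:.0%})")  # raw count and percentage of the total
```

This prints both the raw counts and their percentages, which is exactly the information a frequency distribution table conveys.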
Learning Resources

Read: This online resource explains frequency distributions in a simple way and shows examples: https://www.spss-tutorials.com/frequency-distribution-what-is-it/

Watch: This video illustrates the concept in a novel way: https://www.youtube.com/watch?time_continue=145&v=dr1DynUzjq0
Understanding Frequency Distribution gives us a way of understanding and organizing
our data in a logical way. Once we have done this, we will be able to apply different
summary statistics measures to our data. These Measures are explained in the following
sections.
2.2. Measures of Central Tendency
Learning Resources

Watch: Measures of Central Tendency, Measures of Location, Measures of Dispersion, a video by Dr. Lisa Bersales [from 02:05]: https://www.youtube.com/watch?v=9TF8btw1aNU&index=31&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I
Measures of Central Tendency give us the typical value of data. There are three measures
of central tendency, the Mean, Median, and Mode.
Mean
The mean is the sum of all values of observations divided by the number of observations
in the data set.
Mean = (Σ Xi) / N

where the mean is the summation of all values of X (from X1 to XN) divided by the total
number of values (N). You can see an example of this in Dr. Bersales's video.
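As a quick illustration of the formula (a Python sketch; the five values are the ages of the first five sales representatives in the Module 1 table):

```python
ages = [23, 27, 28, 23, 24]   # X1 ... X5
mean = sum(ages) / len(ages)  # (Σ Xi) / N = 125 / 5
print(mean)                   # 25.0
```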
Median
The median is simply the middle value in the data set.

Where N is the total number of values, this is the formula for the median when N is odd:

Median = ((N + 1) / 2)th term

This is the formula when N is even:

Median = [(N/2)th term + (N/2 + 1)th term] / 2
Note that these formulas do not return an actual value; they return the position of a term.
This means that you need to order the data (as we learned in frequency distribution) and
count from the beginning until you reach the indicated term.
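The positional formulas can be turned into a short function (a sketch; the helper name is my own):

```python
def median_by_position(values):
    """Sort the data, then pick the term(s) given by the positional formulas."""
    data = sorted(values)
    n = len(data)
    if n % 2 == 1:
        # Odd N: the ((N + 1) / 2)th term (positions are 1-indexed)
        return data[(n + 1) // 2 - 1]
    # Even N: average of the (N/2)th and (N/2 + 1)th terms
    return (data[n // 2 - 1] + data[n // 2]) / 2

print(median_by_position([7, 1, 5]))     # sorted: [1, 5, 7] -> 2nd term -> 5
print(median_by_position([7, 1, 5, 3]))  # sorted: [1, 3, 5, 7] -> (3 + 5) / 2 -> 4.0
```

The sorting step mirrors the instruction above: the formulas only make sense once the data are ordered.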
Make sure to watch Dr. Bersales's video to learn more about the median.
Mode
The mode is the value that occurs most often in the data set. There is no formula for the
mode. Instead we can identify the mode by looking at the frequency distribution. There
can be multiple modes. Dr. Bersales’s video discusses this further.
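Identifying the mode from a frequency distribution takes only a few lines with Python's `collections.Counter`; applying it to the 16 test scores from the Module 1 table also shows that a data set can have more than one mode (a sketch):

```python
from collections import Counter

# Standardized test scores of the 16 sales representatives from Module 1
scores = [96, 91, 92, 94, 94, 95, 97, 96, 93, 92, 98, 94, 92, 97, 90, 96]

counts = Counter(scores)
highest = max(counts.values())
# Every value that reaches the highest frequency is a mode
modes = sorted(value for value, count in counts.items() if count == highest)
print(modes)  # [92, 94, 96] -- three values each occur three times
```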
Study Questions

When is it best to use the mean? What about the median or the mode? Name some specific examples of situations in which one would choose a certain measure over the other two.
2.3. Measures of Location
Sometimes, we want to know how a certain data point compares with the rest, as in the case of rankings and quotas. In other situations, we may divide the data into a number of equal sections to answer our questions, as with problems involving brackets, classes, and other groupings.

Measures of Location specify points in the data set below which a given fraction of the data lies. This allows us to find the position of a data point in relation to the entire data set.
Some examples of these are percentiles, deciles and quartiles. Percentiles divide the data into 100 equal parts, deciles divide the data into 10 equal parts, and quartiles divide the data into 4 equal parts.
Median, a measure of central tendency discussed earlier, is also a special measure of location. If you can recall, the median is the middle value in the data set so it divides the data into two equal parts.
Dr. Bersales explains this further in her video.
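To make this concrete: percentiles, deciles, and quartiles are all cut points of the ordered data. A short Python sketch with hypothetical data (the course's exercises use R, where `quantile()` plays the same role):

```python
# Cut points that divide ordered data into equal parts,
# using the standard-library statistics module.
import statistics

data = [12, 15, 17, 19, 22, 25, 28, 31, 35, 40]

quartiles = statistics.quantiles(data, n=4, method="inclusive")      # 3 cut points: Q1, Q2, Q3
deciles = statistics.quantiles(data, n=10, method="inclusive")       # 9 cut points: D1..D9
percentiles = statistics.quantiles(data, n=100, method="inclusive")  # 99 cut points: P1..P99

# The second quartile is the median: it splits the data into two equal parts.
assert quartiles[1] == statistics.median(data)
```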
2.4. Measures of Dispersion
Learning Material Watch: Measures of Central Tendency, Measures of Location, Measures of Dispersion Video by Dr. Lisa Bersales [From 21:19] https://www.youtube.com/watch?v=9TF8btw1aNU&index=31&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I
There are two types of measures of dispersion: absolute dispersion, which measures the
variability within a data set, and relative dispersion, which compares the variability of one
data set with that of others.

Variance and standard deviation are measures of dispersion with reference to the mean.
The higher these values are, the farther the data values are from the mean. Standard
deviation is the square root of the variance, resulting in a number that is always
non-negative and is in the same units as the mean.
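The relationship between variance and standard deviation can be verified numerically; a minimal Python sketch with hypothetical values:

```python
# Population variance and standard deviation with reference to the mean.
import statistics

data = [4, 8, 6, 5, 3, 7]

variance = statistics.pvariance(data)  # mean of squared deviations from the mean
stdev = statistics.pstdev(data)        # square ROOT of the variance

# Squaring the standard deviation recovers the variance,
# and the standard deviation is in the same units as the data.
assert abs(stdev ** 2 - variance) < 1e-9
```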
MODULE 3: SAMPLING AND DATA COLLECTION
Introduction
Guided by the knowledge in statistics, students must also become accustomed to the
process of sampling. This module is intended to familiarize students with the different
types of sampling and the theory that guides the process.
Learning Objectives
At the end of the module, the students should be able to:

a. Differentiate the types of sampling;
b. Understand the theory behind sampling; and
c. Use sampling in the business context.
3.1. History of Sampling
Sampling is defined as “a process or method of drawing a representative group of
individuals or cases from a particular population” (Encyclopaedia Britannica, 2017).
Sampling is used because it is more effective and efficient to study samples drawn from
a population than to study the entire population.
Much like the history of statistics, the history of sampling has various roots. Bethlehem
(2009) noted that sampling theory became a legitimate area of study in statistics
through the works of Anders Kiaer of the Norwegian Statistical Bureau. In his study
published in 1895, Kiaer presented his “Representative Method” of selecting samples
based on the population. The Representative Method received both praise and criticism
from scholars, which prompted Kiaer and other statisticians to address and improve the
method. The Representative Method’s lack of random selection was remedied by Bowley
in 1906. The works of both Kiaer and Bowley led to the rise of probability and non-
probability sampling.
3.2. Probability Sampling
The use of probability sampling is guided by probability theory, particularly the law of
large numbers and the central limit theorem. The assumption is that as the number of
samples selected from the population increases, the statistics obtained from these
samples become closer to the expected or actual population values and tend to follow a
normal distribution.
There are different techniques of probability sampling:
• Simple Random Sampling - In simple random sampling, the researcher
implements a selection procedure that ensures that every member of the
population has an equal chance of being selected.
• Stratified Sampling - This is a probability sampling technique where a
heterogeneous group is first divided into homogeneous groups, or strata, from
which the samples are selected. The number of samples selected per stratum
corresponds to the stratum’s percentage of the entire population.
• Cluster Sampling - This type of probability sampling technique is similar to stratified
sampling. The only difference is that not all of the strata are selected. Instead, the
researcher first selects a number of strata from which the samples will be
randomly taken.
• Systematic Sampling - In systematic sampling, the selection of samples starts from
a random point and then proceeds at a fixed interval. Some researchers conduct
systematic sampling by generating random numbers to serve as the starting point
and the interval.
• Multistage Sampling - This is the combination of the probability sampling
techniques mentioned above.
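To illustrate how these selection rules differ, here is a small Python sketch over a hypothetical population of 100 customer IDs; the strata names and sizes are invented for the example:

```python
# Sketches of simple random, systematic, and stratified selection.
import random

random.seed(42)  # fixed seed so the example is reproducible
population = list(range(1, 101))  # hypothetical customer IDs 1..100

# Simple random sampling: every member has an equal chance of selection.
srs = random.sample(population, k=10)

# Systematic sampling: a random starting point, then every k-th member.
k = len(population) // 10      # fixed interval of 10
start = random.randrange(k)    # random start within the first interval
systematic = population[start::k]

# Stratified sampling: draw within homogeneous strata, proportional to size.
strata = {"branch_A": population[:60], "branch_B": population[60:]}
stratified = []
for members in strata.values():
    n = round(10 * len(members) / len(population))  # proportional allocation
    stratified += random.sample(members, n)
```

A multistage design would simply chain these steps, e.g. randomly selecting a few strata first and then sampling systematically within each.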
3.3. Non-Probability Sampling
This type of sampling is used when researchers are not concerned with generalizing
the results of the study to the population. Instead, the researcher aims to get data only
for specific cases.
Some examples of non-probability sampling are as follows:
• Quota Sampling - In quota sampling, the researcher only ensures that a set number
of samples will be selected from each stratum. For example, a businesswoman
finds out that the customer base of her cosmetics company is composed of
Caucasian, Asian, Black, and Latina women aged 20-30. She sets out to survey 10
women from each group.
• Purposive Sampling - Purposive sampling is the selection of participants on the
premise that they meet the criteria set by the researcher. Snowball or chain sampling
is an example of purposive sampling. For example, suppose you want to compare and
contrast the manufacturing practices of Japanese companies in the Philippines.
Given the specificity of your purpose, your study does not entail random selection.
Instead, you will be driven by your criteria in the selection of your sample.
• Convenience Sampling - Convenience sampling relies solely on availability. For
example, a chef hands a survey form to every customer who eats at his
restaurant to determine their level of satisfaction with the products and services.
Study Question When is it appropriate to use probability sampling or non-probability sampling?
References
“Measures of Central Tendency, Measures of Location, Measures of Dispersion” (Video) by Dr. Lisa Bersales. https://www.youtube.com/watch?v=9TF8btw1aNU&index=31&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I
Bethlehem, J. (2009). The rise of survey sampling. Retrieved from https://www.cbs.nl/-/media/imported/documents/2009/07/2009-15-x10-pub.pdf
Parker, M. (2017). Types of sampling. Retrieved from https://www.ma.utexas.edu/users/parker/sampling/srs.htm
Sampling. (2017). In Enclyclopaedia Britannica. Retrieved from https://www.britannica.com/science/sampling-statistics
Assignment for Unit 1 - Introduction to Descriptive Analytics
Go to the Monthly National Government Cash Operations Report
(https://data.gov.ph/?q=dataset/national-government-cash-operations-report) and
download Data Sheets 2011 to 2014. Using the lessons learned in Unit 1, complete the
following:
1. What are the common variables in the data sheets? Identify the level of
measurement of each variable (5 points).
2. Randomly select two data sheets from Data Sheets 2011 to 2014. Indicate the
years of the two data sheets selected and the process of selection employed (5
points).
3. Randomly select six out of the twelve months that will be part of the record data
set. Indicate the months selected and the process of selection employed (5
points).
4. Create a table that shows the sum of values of common variables for each of the
selected years. Explain the type of data set generated (15 points).
5. Compute for the mean and standard deviation of the data from the selected years.
Write a description of the results (20 points).
UNIT II: DATA PREPROCESSING
This unit intends to:
1. Introduce basic concepts in data preprocessing; and 2. Introduce methods of data preprocessing.
MODULE 1: BASIC CONCEPTS IN DATA PREPROCESSING
Introduction

Data preprocessing is an important step in data analytics. It aims to assess and
improve the quality of data for secondary statistical analysis. With this, the data are better
understood and the data analysis is performed more accurately and efficiently.
Learning Objectives
After studying this module, you should be able to:
1. Explain what data preprocessing is and why it is important in data analytics; and
2. Describe different forms of data preprocessing.
1.1. What is Data Pre-processing?
Data in the real world tend to be incomplete, noisy, and inconsistent. “Dirty” data can lead
to errors in parameter estimation and incorrect analyses, leading users to draw false
conclusions. Quality decisions must be based on quality data; hence, unclean data may
cause incorrect or even misleading statistical results and predictive analyses. Data
preprocessing is a data mining technique that involves transforming raw or source data
into an understandable format for further processing.
1.2. Tasks for Data Pre-processing
Several distinct steps are involved in preprocessing data. Here are the general steps taken to pre-process data:
1. Data cleaning
• This step deals with missing data, noise, outliers, and duplicate or incorrect
records while minimizing introduction of bias into the database.
• Data is cleansed through processes such as filling in missing values,
smoothing the noisy data, or resolving the inconsistencies in the data.
2. Data integration
• Extracted raw data can come from heterogeneous sources or be in
separate datasets. This step reorganizes the various raw datasets into a
single dataset that contains all the information required for the desired
statistical analyses.
• Involves integration of multiple databases, data cubes, or files.
• Data with different representations are put together and conflicts within the
data are resolved.
3. Data transformation
• This step translates and/or scales variables stored in a variety of formats
or units in the raw data into formats or units that are more useful for the
statistical methods that the researcher wants to use.
• Data is normalized, aggregated and generalized.
4. Data reduction
• After the dataset has been integrated and transformed, this step removes
redundant records and variables, as well as reorganizes the data in an
efficient and “tidy” manner for analysis.
• Pertains to obtaining reduced representation in volume but produces the
same or similar analytical results.
• This step aims to present a reduced representation of the data in a data
warehouse.
Pre-processing is sometimes iterative and may involve repeating this series of steps until
the data are satisfactorily organized for the purpose of statistical analysis. During
preprocessing, one needs to take care not to accidentally introduce bias by modifying the
dataset in ways that will impact the outcome of statistical analyses. Similarly, we must
avoid reaching statistically significant results through “trial and error” analyses on
differently pre-processed versions of a dataset.
Learning Resource Watch Dr. Eugene Rex Jalao’s video on Data Preprocessing. https://www.youtube.com/watch?v=qk3gedLrpIU&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I&index=20
MODULE 2: METHODS OF DATA PREPROCESSING
Introduction
Data preprocessing consists of a series of steps to transform data extracted from different
data sources into clean data prior to statistical analysis. Data pre-processing includes
data cleaning, data integration, data transformation, and data reduction.
Learning Objectives After studying this module, you should be able to:
1. Understand the different methods of data preprocessing; and
2. Differentiate the different techniques of data preprocessing.
2.1. Data Integration
Data integration is the process of combining data derived from various data sources (such
as databases, flat files, etc.) into a consistent dataset. In data integration, data from the
different sources, as well as the metadata - the data about the data - from those sources,
are integrated to come up with a single data store. There are a number of issues to
consider during data integration, related mostly to possibly different standards among data
sources. These issues include the entity identification problem, data value conflicts, and
redundant data. Careful integration of the data from multiple sources may help reduce or
avoid redundancies and inconsistencies and improve the speed and quality of
subsequent data mining.
Four Types of Data Integration Methodologies
1. Inner Join - creates a new result table by combining column values of two
tables (A and B) based upon the join-predicate.
2. Left Join - returns all the values from an inner join plus all values in the left
table that do not match to the right table, including rows with NULL (empty)
values in the link column.
3. Right Join - returns all the values from the right table and matched values
from the left table (NULL in the case of no matching join predicate).
4. Outer Join - the union of all the left join and right join values.
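The four join types can be sketched in a few lines; the tables, keys, and values below are hypothetical, and in practice a tool such as SQL or pandas would perform the join:

```python
# Joining two tables keyed by "id": A holds names, B holds cities.
A = {1: "Alice", 2: "Bob", 3: "Carol"}    # left table
B = {2: "Manila", 3: "Cebu", 4: "Davao"}  # right table

inner = {k: (A[k], B[k]) for k in A.keys() & B.keys()}          # matching keys only
left = {k: (A[k], B.get(k)) for k in A}                         # all left keys; None = NULL
right = {k: (A.get(k), B[k]) for k in B}                        # all right keys; None = NULL
outer = {k: (A.get(k), B.get(k)) for k in A.keys() | B.keys()}  # union of left and right joins
```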
Learning Resource Watch: Dr. Eugene Rex Jalao’s video on Data Integration https://www.youtube.com/watch?v=EpdIz2uH1aM&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I&index=21
2.2. Data Transformation
Data transformation is a process of transforming data from one format to another. It aims
to transform the data values into a format, scale or unit that is more suitable for analysis.
Data transformation is an important step in data preprocessing and a prerequisite for doing
predictive analytic solutions.
Here are a few common possible options for data transformation:
1. Normalization - a way to scale a specific variable to fall within a small specified range
   a. Min-max normalization - rescaling values to a new scale such that all
      values fall within a standardized range, commonly [0, 1].
   b. Z-score standardization - transforming a numerical variable to a standard
      normal scale (mean 0, standard deviation 1).
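Both options can be written out directly; a minimal Python sketch, with a hypothetical income column:

```python
# Min-max normalization to [0, 1] and z-score standardization.
import statistics

income = [20, 35, 50, 80, 120]  # hypothetical values, in thousands

lo, hi = min(income), max(income)
minmax = [(x - lo) / (hi - lo) for x in income]  # rescaled so values fall in [0, 1]

mu = statistics.mean(income)
sigma = statistics.pstdev(income)
zscores = [(x - mu) / sigma for x in income]     # mean 0, standard deviation 1
```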
2. Encoding and Binning
   a. Binning - the process of transforming numerical variables into categorical
      counterparts.
      i. Equal-width (distance) partitioning
Divides the range into N intervals of equal size, thus forming a
uniform grid.
ii. Equal-depth (frequency) partitioning
Divides the range into N intervals, each containing
approximately the same number of samples.
   b. Encoding - the process of transforming categorical values into binary or
      numerical counterparts, e.g., mapping male or female in a gender variable
      to 1 or 0. Data encoding is needed because some data mining
      methodologies, such as linear regression, require all data to be numerical.
i. Binary Encoding (Unsupervised)
Transformation of categorical variables by taking the values 0
or 1 to indicate the absence or presence of each category.
If the categorical variable has k categories, we would need to create k binary variables.
ii. Class-based Encoding (Supervised)
• Discrete Class
Replace the categorical variable with just one new
numerical variable and replace each category of the
categorical variable with its corresponding probability of
the class variable.
• Continuous Class Replace the categorical variable with just one new numerical variable and replace each category of the categorical variable with its corresponding average of the class variable.
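Equal-width binning and binary (one-hot) encoding can be sketched as follows; the variable names and cut-offs are illustrative, not from the course data:

```python
# Equal-width binning of a numeric variable and binary encoding of a categorical one.
ages = [18, 22, 25, 31, 38, 44, 52, 60]

# Equal-width partitioning: split the range [18, 60] into 3 intervals of width 14.
n_bins, lo, hi = 3, min(ages), max(ages)
width = (hi - lo) / n_bins
age_bins = [min(int((a - lo) // width), n_bins - 1) for a in ages]  # bin index 0..2

# Binary encoding: k categories become k 0/1 indicator variables.
genders = ["male", "female", "female", "male"]
categories = sorted(set(genders))  # ["female", "male"]
onehot = [[1 if g == c else 0 for c in categories] for g in genders]
```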
Learning Resources Watch: 1. Dr. Eugene Rex Jalao’s video on Data Transformation
https://www.youtube.com/watch?v=ihHGKlAKL_s&index=18&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I
2. Dr. Eugene Rex Jalao’s video on Data Encoding https://www.youtube.com/watch?v=wLqJ3HRtC_w&index=22&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I
2.3. Data Cleaning
All data sources potentially include errors and missing values; data cleaning addresses
these anomalies. Data cleaning is the process of altering data in a given storage resource
to make sure that it is accurate and correct. Data cleaning routines attempt to fill in
missing values, smooth out noise while identifying outliers, and correct inconsistencies in
the data, as well as resolve redundancy caused by data integration.
Data Cleaning Tasks:
1. Fill in missing values
Solutions for handling missing data:
a. Ignore the tuple
b. Fill in the missing value manually
c. Data Imputation
- Use a global constant to fill in the missing value
- Use the attribute mean to fill in the missing value
- Use the attribute mean for all samples belonging to the same class
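The imputation options above can be sketched briefly (the ages and the constant are hypothetical):

```python
# Filling in missing values (represented here as None).
import statistics

ages = [25, None, 40, None, 35]

# (a) Use a global constant to fill in the missing value.
filled_const = [a if a is not None else -1 for a in ages]

# (b) Use the attribute mean of the observed values.
mean_age = statistics.mean([a for a in ages if a is not None])
filled_mean = [a if a is not None else mean_age for a in ages]
```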
2. Cleaning noisy data
Solutions for cleaning noisy data:
a. Binning - transforming numerical values into categorical components
b. Clustering - grouping data into corresponding clusters and using the cluster
   average to represent the values

c. Regression - fitting a regression line to smooth a very erratic data set

d. Combined computer and human inspection - detecting suspicious values
   and checking them through human intervention
3. Identifying outliers
Solutions for identifying outliers:
a. Box plot
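The box-plot rule is easy to state in code: values beyond 1.5 times the interquartile range from the quartiles are flagged. A sketch with made-up data:

```python
# Flagging outliers with the 1.5 * IQR box-plot rule.
import statistics

data = [10, 12, 13, 14, 15, 15, 16, 18, 45]  # 45 looks suspicious

q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low or x > high]  # -> [45]
```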
Learning Resource Watch: Dr. Jalao’s video on Data Cleaning https://www.youtube.com/watch?v=qKC4oPpcbEg&index=23&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I
2.4. Data Reduction and Manipulation
Data reduction is the process of obtaining a reduced representation of the data set that is
much smaller in volume yet produces the same (or almost the same) analytical results.
The need for data reduction arises because a database or data warehouse may store
terabytes of data, and complex data analysis/mining may take a very long time to run on
the complete data set.
Data Reduction Strategies:
1. Sampling - utilizing a smaller representative subset, or sample, of the big data set
   or population that generalizes to the entire population.
A. Types of Sampling
i. Simple Random Sampling - there is an equal probability of selecting
any particular item.
ii. Sampling without replacement - as each item is selected, it is
removed from the population
iii. Sampling with replacement - objects are not removed from the
population as they are selected for the sample
iv. Stratified sampling - split the data into several partitions, then draw
random samples from each partition.
2. Feature Subset Selection - reduces the dimensionality of data by eliminating
redundant and irrelevant features.
A. Feature Subset Selection Techniques
i. Brute-force approach - try all possible feature subsets as input to
data mining algorithm
ii. Embedded approaches - feature selection occurs naturally as part
of the data mining algorithm
iii. Filter approaches - features are selected before data mining
algorithm is run
iv. Wrapper approaches - use the data mining algorithm as a black
box to find the best subset or attributes
3. Feature Creation - creating new attributes that can capture the important
information in a data set much more efficiently than the original attributes.
A. Feature Creation Methodologies
i. Feature Extraction
ii. Mapping Data to New Space
iii. Feature Construction
Learning Resource Watch:
Dr. Jalao’s video on Data Reduction and Manipulation
https://www.youtube.com/watch?v=-JPopvvngsQ&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I&index=19
MODULE 3: POST-PROCESSING AND VISUALIZATION OF DATA INSIDE THE DATA WAREHOUSE
Introduction
Let us now learn how we can post-process and visualize the data inside the data
warehouse.
Learning Objectives
After working on this module, you should be able to:
1. Understand various techniques used for post-processing of discovered structures
and visualization.
3.1. Exercises using R
First, what is R? R is an integrated suite of software facilities for data manipulation,
calculation and graphical display.
It has an effective data handling and storage facility. It also has a large, coherent,
integrated collection of intermediate tools for data analysis. In addition, it has graphical
facilities for data analysis and display either directly at the computer or on hard copy.
Take note that R is not a database but connects to a DBMS. It is not a spreadsheet view
of data, but it connects to Excel/MS Office.
R is free and open source, though it has a steep learning curve. The RStudio IDE is a
powerful and productive third-party user interface for R. It is free, open source, and works
great on Windows, Mac, and Linux.
Exercises for this session will include the following:
1. Working with dataset Wage
2. Studying, reducing and structuring the dataset
3. Plotting the dataset
4. Introducing a business analytics task for the dataset
5. Working with another dataset
In post-processing, remember that data extracted from a data warehouse, or pieces of
knowledge extracted from an initial data mining task, can be further processed. We can
simplify the data, apply descriptive statistics, do visualization or graphing tasks, or
apply further business analytics tools.
Watch the "Data Post-processing" video by Raymond Lagria to understand preliminaries,
data frames, reading data, subsetting, graphing and plotting, and regression analysis in
R.
Always take note to transform your dataset into your desired format before applying further
data mining techniques.
Study Question If you were a business manager, what types of visualizations for the data warehouse’s
data would you like to see?
3.2. Case Study
Let us continue to see how post-processing and plotting is done with R in the “Data Post-
processing” Video by Raymond Lagria.
https://www.youtube.com/watch?v=0fgDbPhegg4&index=86&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I&t=0s
References
https://www.analyticsvidhya.com/blog/2015/07/guide-data-visualization-r/
Data Post-Processing (Slides) by Raymond Lagria
Data Post-Processing (Video) by Raymond Lagria
https://www.youtube.com/watch?v=0fgDbPhegg4&index=86&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I&t=0s
Assignment for Unit 2 - Data Preprocessing
Open the bankdata.csv file. The Bank Dataset contains the independent variables
age, region, income, sex, married, children, car, save_act, current_act, and
mortgage, and one response variable, pep, which answers the question “Did the
customer buy a PEP (Personal Equity Plan) after the last mailing?” with a yes/no
response.
Using the lessons learned in Unit 2, conduct the following:
1. Normalize the income variable into a [0,1] scale. (10 points)
2. Create an equal-depth (frequency) variable for Income where the new variable
could take in “Low”, “Medium”, and “High” data. (15 points)
3. With reference to the region and pep variables, create a new numerical variable
(region_encoded) containing the numerical equivalent of each category of the
region variable. Replace each category with its corresponding probability of the
pep variable. (25 points)
Other References Used for Unit II:
Malik, J. S., Goyal, P., & Sharma, A. K. A comprehensive approach towards data preprocessing techniques & association rules. IES-IPS Academy, Rajendra Nagar, Indore 452012, India. Available at URL https://bvicam.ac.in/news/INDIACom%202010%20Proceedings/papers/Group3/INDIACom10_279_Paper%20(2).pdf
Son NH (2006) Data mining course—data cleaning and data preprocessing. Warsaw University. Available at URL http://www.mimuw.edu.pl/~son/datamining/DM/4-preprocess.pdf
Malley B., Ramazzotti D., Wu J.T. (2016) Data Pre-processing. In: Secondary Analysis of Electronic Health Records. Springer, Cham. Available at URL https://link.springer.com/chapter/10.1007%2F978-3-319-43742-2_12#Sec2
UNIT III: DATA VISUALIZATION AND COMMUNICATION
Introduction

Objectives:

1. Define visualization;
2. Give examples of charts; and
3. Describe what makes an effective visualization.
Learning Resources 1. Watch: The Beauty of Data Visualization
https://www.oercommons.org/courses/the-beauty-of-data-visualization/view
2. Watch: Raymond Freth Lagria “Visualization” Video
https://www.youtube.com/watch?v=Yu1qoZ6Y9EU&index=88&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I
3. Read: MIT Statistics and Visualization for Data Analysis Lecture Notes from OER Commons
https://www.oercommons.org/courses/statistics-and-visualization-for-data-analysis-and-inference-january-iap-2009/view
1.1. What is Visualization?
Watch this OER video from TED-Ed to understand the importance of data visualization:
https://www.oercommons.org/courses/the-beauty-of-data-visualization/view
Visualization is the presentation of information using spatial or graphical representations
for the purposes of facilitating comparison, recognizing patterns, and supporting general
decision-making.
It makes use of the human senses to understand data sets. Seeing things visually allows
humans to easily notice patterns, trends, and comparisons in ways that looking at raw
numbers does not.
Some examples are provided in Lagria’s video from [02:47]
1.2. Types of Visualizations
Data can be visualized in a number of ways. Lagria’s video presents two types of
visualization: those meant for exploring and calculating, and those meant for
communicating information.
A Graph is a medium of visualization designed to communicate information. Depending
on the type of data, there is almost always a suitable graph to use.
Categorical Data can be visualized through:
1. Bar Graph
2. Pie Chart
3. Pareto Chart
4. Side-by-side chart
Numerical Data can be displayed using:
1. Stem-and-Leaf Display
2. Histogram
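A histogram is just an equal-width frequency distribution drawn as bars; the counting step can be sketched without any plotting library (the values are hypothetical):

```python
# Counting values into 4 equal-width bins: the data behind a histogram.
values = [3, 5, 7, 8, 11, 12, 14, 18, 19, 20]

n_bins, lo, hi = 4, min(values), max(values)
width = (hi - lo) / n_bins                       # (20 - 3) / 4 = 4.25
counts = [0] * n_bins
for v in values:
    i = min(int((v - lo) // width), n_bins - 1)  # last bin includes the maximum
    counts[i] += 1
```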
Charts, on the other hand, typically refer to a visualization medium that shows structure
and relationship. Some examples are flowcharts and network diagrams. Note that these
definitions are quite fluid. For example, even though it’s technically a graph, we refer to a
circle divided into sections showing proportion as a “Pie Chart”.
Finally, we call schematic pictures or illustrations of objects and entities Diagrams.
There are many different types of graphs and charts. You can learn more about these in
Lagria’s video from 06:52, and on the MIT Lecture Notes OER from p.26 to 42.
1.3. Visual Design Principles
Lagria’s video (starting from 15:35) describes a 1994 study by Lohse, which involved 60
participants and their responses to different types of visualizations.

Some of the findings were:

1. Simple images are better: icons were preferred over photographs.
2. Graphs and tables were the most self-similar categories.
3. Animation is recommended for temporal data.
One of the important characteristics of data visualization is that it needs to have
preattentive properties. This means that it can communicate without requiring the viewer
to pay close attention, which can be likened to “glance value.” It is determined by factors
described in the study, such as eye movement measured in milliseconds. Color, shape,
and order can also determine how preattentive a visualization is.
Lagria’s video also presents Tufte’s principles of graphical design excellence starting
from 22:22, introducing concepts such as graphical integrity, data ink, density, and the
lie factor. These also affect how visualizations are perceived.
Study Questions
What is the purpose of visualizing data?
What are some types of graphs? Give examples that use each.
UNIT IV: ETHICS IN DESCRIPTIVE ANALYTICS
This unit intends to:
1. Familiarize students with possible ethical and legal dilemmas in research;
2. Familiarize students with ethical and legal guidelines that can be applied to
descriptive analytics; and
3. Make students realize the implications of ethical and legal descriptive analytics.
1.1. Ethics in Descriptive Analytics: Dilemmas and Guidelines
Many of us are familiar with the scientific method and scientific research. While these
contribute to the body of knowledge and worldwide advancement, they could also
compromise people and data when conducted without ethics. Research ethics started
primarily in the area of health research, where human participants served as guinea pigs
in clinical trials. Over the decades, scholars from different disciplines realized that ethics
is not only applicable to health research; it is applicable to everything that requires the
use of data.
Why is research ethics important? As the group that trains researchers from all disciplines
in the Philippines, the Philippine Health Research Ethics Board of the Department of
Science and Technology argued that research ethics should be embedded in all research
processes primarily because of the following reasons:
• It is the right thing to do;
• It protects research participants;
• It provides advocates for research participants;
• It preserves credibility, trust, and accountability;
• It reduces liabilities, wasted time, and resources;
• It turns useless, harmful, and worthless research into useful, helpful, and worthy research.
How do these apply to descriptive analytics? As lawyer Emerson Banez pointed out,
descriptive analytics is an activity embedded within society; that society has norms and
is governed by laws, and these already give people an idea of the right thing to do. Before
embarking on a descriptive analytics activity, check the policies of the company to be
studied. These policies reflect the company's norms and incorporate the laws that
regulate the industry to which the company belongs. Ensure that you will not break any of
these policies in your conduct of descriptive analytics.
You may also argue, based on what you have learned, that descriptive analytics and
business analytics in general do not deal much with human participants who must be
protected and advocated for. However, it is important to note that data are products of
human activities: even when not collected directly from individuals, they still require
proper handling and judgment. Lawyer Emerson Banez discussed the importance of
avoiding bias and discrimination when dealing with data. In the previous units, you
learned about sampling and various descriptive statistical techniques. To be ethical, one
must not be selective about which data are analyzed. Selection must not be biased
towards data that reflect only the outcomes the researchers desire; the data must be
chosen objectively in order to guide the company towards better decision-making.
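As a simple illustration of this point (a hypothetical sketch, not part of the course materials; the sales figures are invented), contrast cherry-picking records that support a desired conclusion with drawing a uniform random sample, which gives every record an equal chance of being analyzed:

```python
import random

# Hypothetical daily sales figures; names and values are illustrative only.
sales = [120, 95, 300, 40, 210, 75, 180, 60, 250, 130]

# Biased selection: keeping only the figures that flatter performance.
cherry_picked = [s for s in sales if s >= 200]  # [300, 210, 250]

# Unbiased selection: a uniform random sample of the records.
random.seed(42)  # fixed seed so the example is reproducible
unbiased_sample = random.sample(sales, k=5)

print(cherry_picked)
print(unbiased_sample)
```

The cherry-picked list would suggest the business is thriving, while the random sample reflects the actual spread of outcomes.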
In addition to the lack of discrimination and bias, lawyer Emerson Banez also pointed out
the importance of integrity, transparency, and accountability. On the premise that the data
you used have been compromised, full disclosure must be done to ensure that the
company will be protected against decision based on bad descriptive analytics.
Perhaps the most important aspect of ethics in descriptive analytics is privacy. Data must
not expose the individuals involved in the activities. Data privacy is not only an ethical
obligation; it is also a law in the Philippines. Republic Act 10173, also known as the Data
Privacy Act, recognizes the need for citizens' data to be protected and secured. This
means that consent must be sought from the individuals whose activities generate the
data analyzed in descriptive analytics. One must exercise caution and ensure that no
privacy is violated in the acquisition, processing, and dissemination of data for descriptive
analytics; otherwise, the company that used the data may be fined or shut down over
legal issues. By being cautious, descriptive analytics can guide the company towards
competitive advantage without liabilities and wasted resources.
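One common technical safeguard, shown here as a minimal hypothetical sketch (not legal advice on Data Privacy Act compliance; the names, salt, and records are invented), is to pseudonymize direct identifiers before analysis, for example by replacing customer names with one-way hashes so that analysts never see the individuals behind the records:

```python
import hashlib

def pseudonymize(identifier: str, salt: str = "course-demo-salt") -> str:
    """Replace a direct identifier with a truncated one-way SHA-256 hash.

    The salt shown here is illustrative; in practice it must be kept
    secret and managed separately from the analytics dataset.
    """
    return hashlib.sha256((salt + identifier).encode("utf-8")).hexdigest()[:12]

# Hypothetical transaction records; the analyst works with pseudonyms only.
records = [("Juan Dela Cruz", 1200.50), ("Maria Santos", 830.00)]
pseudonymized = [(pseudonymize(name), amount) for name, amount in records]

for pid, amount in pseudonymized:
    print(pid, amount)
```

Because the same identifier always maps to the same pseudonym, totals and counts can still be computed per customer without exposing who the customer is.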
Learning Resources
1. Watch: Atty. Emerson Banez's video on ethical issues
https://www.youtube.com/watch?v=LRn6Nvd6Qqc&index=46&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I
2. Watch: Dominic Ligot's Ethical Implications of Business Analytics
https://www.youtube.com/watch?v=PhDHtc_8nm8&index=64&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I
References
Government of the Philippines. (2012). Republic Act 10173 – Data Privacy Act of 2012.
Retrieved from https://privacy.gov.ph/data-privacy-act/
Philippine Health Research Ethics Board. (n.d.). An introduction to ethics in research.
Department of Science and Technology, Taguig City.
Assignment
Write a two-page self-reflection on how the course contributed to your understanding of
descriptive analytics in business.