
The Elements of Multi-Variate Analysis for Data Science

Mohammad Samy BALADRAM* and Nobuaki OBATA

Graduate School of Information Sciences, Tohoku University, Sendai 980-8579, Japan

These lecture notes provide a quick review of basic concepts in statistical analysis and probability theory for data science. We survey the general description of single- and multi-variate data, and derive regression models by means of the method of least squares. As theoretical background we provide basic knowledge of probability theory, which is indispensable for further study of mathematical statistics and probability models. We show that the regression line for a multi-variate normal distribution coincides with the regression curve defined through the conditional density function. In the Appendix, matrix operations are quickly reviewed. These notes are based on the lectures delivered in the Graduate Program in Data Science (GP-DS) and the Data Sciences Program (DSP) at Tohoku University in 2018–2020.

KEYWORDS: data matrix, method of least squares, multi-variate analysis, regression analysis, probability distribution

1. Data and Statistical Analysis

1.1 Data matrices

A set of characteristics collected from objects is called data in general. The totality of objects to be measured or surveyed is called a population and each member therein an individual. Thus, the data are collected from each individual in a target population. Data consisting of values obtained by measuring an amount are called quantitative data. An example is shown in Table 1.1, which is the list of height, weight and age of the players of Team A. In this survey the set of all players of Team A is a population and each player is an individual. The full list is deferred to the Appendix for the readers' exercise.

With each survey item we associate a variable or a variate, which means a measurable quantity varying over a certain range of real numbers. The term "variate" is often used in the context of physical, economic or statistical surveys, while the term "variable" is common in any mathematical context. Thus, data are a collection of values of the variable corresponding to a measurement. For example, for the data in Table 1.1 one may associate a variable x with height, y with weight and z with age. Then the values of the data of the ith individual are denoted by

$$x_i, \quad y_i, \quad z_i.$$

In this fashion we have $x_1 = 178$, $y_2 = 90$ and $z_{82} = 24$. When the number of variables is large, instead of assigning different symbols such as $x, y, z, \dots$ to the variables, we use a single symbol with indices. For example, we may assign $x_1$ to height, $x_2$ to weight and $x_3$ to age. Note that $x_1, x_2, x_3$ are then not the data of the first three individuals, but the three variables. In this case the data of the ith individual are denoted by

Table 1.1. Height, weight and age of players of Team A.

  No.   height (cm)   weight (kg)   age (year)
   1        178           100           33
   2        185            90           22
   3        190            90           29
   :         :             :            :
  82        180            93           24
  83        184            85           17

2010 Mathematics Subject Classification: Primary 62-01; Secondary 60-01, 62A01, 62H10, 62J05
*Corresponding author. E-mail: [email protected]

Received August 31, 2020; Accepted October 25, 2020; J-STAGE Advance published December 8, 2020

Interdisciplinary Information Sciences Vol. 26, No. 1 (2020) 41–86
©Graduate School of Information Sciences, Tohoku University
ISSN 1340-9050 print / 1347-6157 online
DOI 10.4036/iis.2020.A.02


$$x_{i1}, \quad x_{i2}, \quad x_{i3}.$$

At first glance these notations may appear confusing, but after some practice their usefulness will become clear. Let $x_1, \dots, x_j, \dots, x_p$ denote the variables corresponding to p measurements. After surveying n individuals we obtain p-variate data or p-dimensional data, where the values of the data of the ith individual are denoted by

$$x_{i1}, \dots, x_{ij}, \dots, x_{ip}.$$

In other words, $x_{ij}$ denotes the value of the variable $x_j$ of the ith individual. The data matrix is the $n \times p$ matrix with $x_{ij}$ as its $(i, j)$ entry:

$$X = \begin{bmatrix}
x_{11} & \cdots & x_{1j} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{ij} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nj} & \cdots & x_{np}
\end{bmatrix} \tag{1.1}$$

The number of rows coincides with the number of individuals and the number of columns with the number of variables. For example, the data matrix obtained from Table 1.1 is an $83 \times 3$ matrix. In these lecture notes we will be concerned only with quantitative data given in the form of a data matrix, for which the powerful mathematical tools of linear algebra are effectively available, see Sect. 3.

Remark 1.1. There are other types of data, called qualitative data, which are recorded in terms of letters, symbols, diagrams and so on. If the individuals are classified into categories by some nominal attribute, we obtain nominal data or category data. For example, the nominal attribute "sex" gives rise to the two categories "male" and "female." Likewise, "nationality" gives rise to quite a few categories such as American, English, French, Japanese, and so on. These data are often quantified by using dummy variables for further analysis; for example, the two categories "male" and "female" may be represented by 0 and 1, respectively. Another type of qualitative data is ordinal data. The results of a questionnaire survey of customer satisfaction, for instance, are recorded in terms of a few grades, such as A, B and C. Even if these grades are recorded as numbers such as 1, 2 and 3, those numbers indicate only the grades or the order; the difference or the ratio of such numbers does not make sense in general.

1.2 Statistical analysis

Such data as shown in Table 1.1, or in the form of the data matrix (1.1), are called raw data in the sense that the data are in original form: collected directly from observation, unorganized and uncooked. The raw data are usually entered into a computer system in a form suitable for the software used. A common spreadsheet application requires an input form just as in Table 1.1.

A data matrix being just a large array of values, it is difficult to extract useful information from it at a glance. What we need is a reduction of the data. Given a data matrix X, we apply functions f to obtain new values called statistics, which clarify characteristics of the data. If X consists of np values as in (1.1), a function of X is in fact a function of np variables. The main purpose of these lecture notes is to study basic statistics and their applications.

Closing this introductory section, we mention a few remarks on statistical inference. In statistical analysis it is essential to distinguish between a target population and the surveyed individuals. A survey that measures the entire population is called a complete survey or a census; a national population census is a typical example. It would be, however, impractical to perform a complete survey for reasons of size, time, cost and so forth. In most cases we select some individuals from a target population, each of which is called a sample. A survey that measures only selected samples is called a sample survey. Examples include audience rating surveys, public-opinion polls, sampling inspections of products and so forth. Moreover, usual experiments or observations are in principle regarded as sample surveys. A sample survey saves cost and time, and makes it easier to maintain high-quality information, whereas it cannot avoid sampling errors because it measures only a part of the target population. In this context statistical inference becomes essential for estimating population characteristics from sample data; it is the main theme of mathematical statistics.

Data collected over time are called time series data, where the data are listed in time order. In the form of a data matrix $X = [x_{ij}]$ we understand i as a time parameter. Examples include counts of sunspots, weather data, stock data, traffic accident occurrences and so forth. In this context prediction becomes a main theme, where probability models (stochastic processes) play an essential role. Interested readers should refer to suitable books for further study.


2. Summarizing Single-Variate Data

2.1 Frequency table and histogram

Consider a single variable x and suppose we are given single-variate data of size n:

$$x_1, x_2, \dots, x_n. \tag{2.1}$$

According to the standard notation for data matrices introduced in Sect. 1, the data (2.1) should be written as a column vector. However, saving space takes priority here, and there is no danger of confusion.

In practice, single-variate data form a long sequence of numbers, and we cannot find useful information at a glance. The first task is to classify the data and extract information on how they are distributed on the real line $\mathbb{R}$, i.e., on the x-axis. Take an interval $I \subset \mathbb{R}$ containing all the data and divide it into a few small intervals of equal width by division points

$$c_0 < c_1 < \cdots < c_k.$$

Each small interval $I_i = [c_{i-1}, c_i)$ is called a class. The midpoint of $I_i = [c_{i-1}, c_i)$, defined by

$$a_i = \frac{c_{i-1} + c_i}{2},$$

is called the class mark. The class mark is used to represent the values in the interval $I_i$.

Each value of the data (2.1) falls into a unique class $I_i = [c_{i-1}, c_i)$. For each class $I_i$ we count the number of values falling into it, which is referred to as the (absolute) frequency. If $f_i$ is the frequency of $I_i$, the ratio

$$p_i = \frac{f_i}{n}$$

is called the relative frequency, where n is the total number of data. Finally, the results are summarized in the form of a frequency table as shown in Table 2.1.

There is no strict rule for deciding the number of classes or their width. As the width of a class becomes wider, we lose more information on the distribution of the data. Conversely, as the width becomes narrower, the outline of the distribution becomes difficult to grasp. It is recommended to make a few trials.

A graphical representation of a frequency table is useful. On each small interval $I_i$ of the x-axis we draw a rectangle with height proportional to the frequency $f_i$, or equivalently to the relative frequency $p_i$. These rectangles are not separated, since the x-axis stands for a continuous scale. The diagram obtained in this way is called a histogram. The graph obtained by connecting the midpoints of the tops of the histogram bars by straight lines is called a frequency polygon.

Another useful statistic is the cumulative frequency. For each class $I_i$ the cumulative frequency is defined by

$$f_1 + f_2 + \cdots + f_i.$$

Likewise we define the cumulative relative frequency, which is also called the cumulative percentage. We will see in Sect. 4 that the cumulative relative frequency is a bridge connecting probability theory and practical statistical analysis.

Table 2.1. Frequency table.

  Classes   Class marks   Frequency   Relative frequency
  I_1       a_1           f_1         p_1
  I_2       a_2           f_2         p_2
  :         :             :           :
  I_k       a_k           f_k         p_k
  Total                   n           1
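For concreteness, a frequency table like Table 2.1 can be computed mechanically from raw data. Below is a minimal sketch assuming Python with NumPy; the data values are hypothetical.

```python
import numpy as np

# Hypothetical heights (cm); not the actual Team A data.
data = np.array([169, 172, 176, 177, 178, 181, 183, 185, 190, 195])

edges = np.arange(165, 205, 5)            # division points c_0 < c_1 < ... < c_k
freq, _ = np.histogram(data, bins=edges)  # absolute frequencies f_i for [c_{i-1}, c_i)
marks = (edges[:-1] + edges[1:]) / 2      # class marks a_i
rel = freq / data.size                    # relative frequencies p_i = f_i / n
cum_rel = np.cumsum(rel)                  # cumulative relative frequencies

for a, f, p, c in zip(marks, freq, rel, cum_rel):
    print(f"{a:6.1f}  {f:3d}  {p:6.3f}  {c:6.3f}")
```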

Fig. 2.1. Classification of data.


Example 2.1. The frequency table of the height of players of Team A is shown in Table 2.2, where the cumulative frequencies and the cumulative relative frequencies are added. The left diagram in Fig. 2.2 shows the histogram together with a frequency polygon. The right diagram in Fig. 2.2 shows the cumulative relative frequencies.

2.2 Measures of centrality

Suppose we are given single-variate data of size n for a variable x as in (2.1). We now look for a suitable value that represents the center of the data. Most commonly used is the mean or average, defined by

$$\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i, \tag{2.2}$$

that is, by adding up all the values and dividing by the number of data. As there are many variants of "mean" in mathematics, to avoid confusion the mean defined by (2.2) is called the arithmetic mean. Instead of raw data, we may start from a frequency table as in Table 2.1, where $f_i$ is the frequency of the class $I_i$ with class mark $a_i$. Then the mean is defined by

$$\bar{x} = \frac{1}{n}\sum_{i=1}^k a_i f_i. \tag{2.3}$$

Using the relative frequency $p_i = f_i/n$, we obtain a useful formula:

$$\bar{x} = \sum_{i=1}^k a_i\,\frac{f_i}{n} = \sum_{i=1}^k a_i p_i. \tag{2.4}$$

We will see in Sect. 5 that (2.4) is consistent with the definition of the mean (or expected value) of a random variable.

It is noted that a frequency table sacrifices some information contained in the original raw data. Once the raw data are transferred into a frequency table, the exact values of the data cannot be recovered. From a frequency $f_i$ of a class $I_i = [c_{i-1}, c_i)$ we only know that there are $f_i$ values of the raw data lying in the interval $I_i$; we then regard those values as being equally distributed across the interval. In fact, the formulas (2.3) and (2.4) are based on this interpretation. As a result, the mean computed directly from the raw data and the mean computed from a frequency table do not coincide in general.
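The following sketch (again assuming Python with NumPy, with hypothetical data) compares the mean (2.2) computed from raw data with the mean (2.3) computed from a frequency table.

```python
import numpy as np

# Hypothetical raw data.
data = np.array([167.0, 171.5, 173.0, 176.5, 178.0, 182.5, 188.0])
mean_raw = data.mean()                           # mean (2.2) from raw data

edges = np.arange(165, 195, 5)                   # classes of width d = 5
freq, _ = np.histogram(data, bins=edges)         # frequencies f_i
marks = (edges[:-1] + edges[1:]) / 2             # class marks a_i
mean_table = (marks * freq).sum() / freq.sum()   # mean (2.3) from the frequency table

print(mean_raw, mean_table)  # the two means differ slightly in general
```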

Exercise 2.2. Given single-variate data, let $\bar{x}$ be the mean of the original raw data and a the mean calculated from a frequency table summarizing the same data with classes of width d. Show that

$$|\bar{x} - a| \le \frac{d}{2}.$$

Table 2.2. Frequency table: Height of players in Team A.

  Classes   Class marks   Frequency   Cumulative   Relative    Cumulative
                                      frequency    frequency   relative frequency
  165–170      167.5          1            1         0.012         0.012
  170–175      172.5         13           14         0.157         0.169
  175–180      177.5         27           41         0.325         0.494
  180–185      182.5         23           64         0.277         0.771
  185–190      187.5         15           79         0.181         0.952
  190–195      192.5          3           82         0.036         0.988
  195–200      197.5          1           83         0.012         1.000
  Total                      83            —         1.000          —

Fig. 2.2. Histogram and frequency polygon (left). Cumulative relative frequencies (right).



Rearranging the data (2.1) from smallest to largest as

$$x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}, \tag{2.5}$$

we call $x_{(i)}$ the ith order statistic. In particular, the minimum of the data is defined by

$$\min x = \min\{x_1, x_2, \dots, x_n\} = x_{(1)}, \tag{2.6}$$

and the maximum by

$$\max x = \max\{x_1, x_2, \dots, x_n\} = x_{(n)}. \tag{2.7}$$

The value at the middle position in (2.5) is called the median. If n is odd, the middle rank among the n data is uniquely determined; if n is even, the median is defined to be the average of the two values at the middle ranks. To be precise, the median is defined by

$$\operatorname{med} x = \operatorname{med}\{x_1, x_2, \dots, x_n\} =
\begin{cases}
x_{\left(\frac{n+1}{2}\right)}, & \text{if } n \text{ is odd,}\\[6pt]
\dfrac{1}{2}\left(x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right), & \text{if } n \text{ is even.}
\end{cases}$$

Another candidate for representing the data is the mode, defined to be the most frequently occurring value among the data. Usually the mode is applied to the frequency table, where it appears as a peak of the histogram.

The three statistics mean, median and mode are the most commonly used central values of data. It is noted that there is no universal order relation among the three; in fact, any ordering of the three can occur, as is easily seen from simple extremal examples.

Example 2.3. Figure 2.3 is the histogram of annual family income in Japan in 2016,* where the horizontal axis shows annual income in ten thousand yen and the vertical one the relative frequencies. Note that the values of 2000 or above are bundled into a single class. A significant feature of the histogram, commonly observed in similar surveys, is that the distribution spreads along a one-sided long tail (that is why the values of 2000 or above are bundled into one class). The mean, median and mode are given by

$$\text{mean} = 560, \qquad \text{median} = 442, \qquad \text{mode} = 350.$$

Which value to use for representing the center of the data depends on the purpose.

Example 2.4. Demography is an interesting research topic. Figure 2.4 is the histogram of the Japanese population by age in 2018,† where the horizontal axis shows the age in years and the vertical axis the population in ten thousands. Note that the ages of 100 or above are bundled into a single class. We find that

$$\text{mean} = 47.2, \qquad \text{median} = 47, \qquad \text{mode} = 69.5.$$

Fig. 2.3. Annual family income in Japan in 2016.

*Source: Comprehensive Survey of Living Conditions, Ministry of Health, Labour and Welfare, Japan, 2017.
†Source: Population Estimates, Portal Site of Official Statistics of Japan (e-Stat).


The mean and median nearly coincide, while the mode is considerably larger. Moreover, we find the significant feature that the histogram shows a second peak at the age of 45.5.

Remark 2.5. The definition of the mode adopted in these lecture notes is based on traditional descriptive statistics. If the highest frequency is attained by two or more classes, the mode is not uniquely defined. From the histogram in Fig. 2.4 one may expect some significant meaning in the peaks of a histogram. In some literature the term "mode" is used for any class mark that attains a peak; the latter definition is more common in the theoretical study of the shape of distributions.

2.3 Measures of variability

In the previous subsection we introduced statistics that represent the centrality of data. However, many different data sets can have the same centrality. The next key to characterizing distributions of data is to observe their variability.

For given data $x_1, x_2, \dots, x_n$ of size n let

$$x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)} \tag{2.8}$$

be the rearrangement from smallest to largest. The minimum, maximum and median have already been introduced as order statistics. Along a similar line we define the first quartile to be the value at the first quarter position in (2.8); likewise, the value at the third quarter position is called the third quartile. The former is denoted by $Q_1$ and the latter by $Q_3$. (The precise definition of these quartiles will be given at the end of this subsection.) The set of the five statistics

$$\min, \quad Q_1, \quad \operatorname{med}, \quad Q_3, \quad \max$$

is called the five-number summary of Tukey. A box plot is often used for its graphical representation, see Fig. 2.5. Occasionally in the literature, the median in the box plot is replaced with the mean.

A simple index for the variability of data is the range, which is by definition the difference between the maximum and the minimum:

$$R = \max - \min.$$

It is noted, however, that the range is heavily affected by extremal values in the data. In that sense the difference between the first and third quartiles,

$$\mathrm{IQR} = Q_3 - Q_1,$$

called the interquartile range, is a more useful measure of variability. The interquartile range corresponds to the length of the box part of a box plot.

Fig. 2.4. Japanese population by age in 2018.

Fig. 2.5. Box plot: minimum, first quartile, median, third quartile, maximum.



From both practical and theoretical viewpoints, a better statistic for the variability of data is the variance. Let $x_1, x_2, \dots, x_n$ be n data with mean $\bar{x}$. Then the variance of the data is defined by

$$s^2 = s_x^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2, \tag{2.9}$$

which is the average of the squared deviations of the $x_i$ from the mean $\bar{x}$. When we need to specify the variable x we write $s_x^2$, but when there is no danger of confusion we write simply $s^2$. Clearly $s^2 \ge 0$ by definition. The positive square root of the variance,

$$s = s_x = \sqrt{s_x^2} = \sqrt{\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2},$$

is called the standard deviation. Expanding $(x_i - \bar{x})^2$ on the right-hand side of (2.9), we obtain

$$s_x^2 = \frac{1}{n}\sum_{i=1}^n (x_i^2 - 2\bar{x}x_i + \bar{x}^2)
= \frac{1}{n}\sum_{i=1}^n x_i^2 - 2\bar{x}\cdot\frac{1}{n}\sum_{i=1}^n x_i + \frac{1}{n}\sum_{i=1}^n \bar{x}^2. \tag{2.10}$$

The second term equals $-2\bar{x}^2$ by the definition of the mean. The third term is the average of the constant value $\bar{x}^2$, independent of i, and is equal to $\bar{x}^2$. Thus (2.10) becomes

$$s_x^2 = \frac{1}{n}\sum_{i=1}^n x_i^2 - 2\bar{x}^2 + \bar{x}^2 = \frac{1}{n}\sum_{i=1}^n x_i^2 - \bar{x}^2. \tag{2.11}$$

Recall that the bar notation $\bar{x}$ stands for the mean of the variable x. Accordingly, the mean of the variable $x^2$ is denoted by $\overline{x^2}$. Thus (2.11) is written in the concise form

$$s_x^2 = \overline{x^2} - \bar{x}^2, \tag{2.12}$$

where the right-hand side is the mean of the square of x minus the square of the mean of x.

Example 2.6. Table 2.3 lists basic statistics of the height of players of Team A, and the corresponding box plot is shown in Fig. 2.6.

Table 2.3. Basic statistics: Height of players of Team A.

  Size of data (n)              83
  Mean (x̄)                    179.8
  Minimum (min)               168.0
  First quartile (Q1)         175.5
  Median (med)                180.0
  Third quartile (Q3)         183.5
  Maximum (max)               196.0
  Range (R)                    28.0
  Interquartile range (IQR)     8.0
  Variance (s²)                28.82
  Standard deviation (s)        5.37

Fig. 2.6. Box plot: Height of players of Team A.


Remark 2.7. There is an important variant of the variance. Recall the definition (2.9), where the sum of squared deviations is divided by n. Dividing instead by $n - 1$, we define a new statistic by

$$u^2 = u_x^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2, \tag{2.13}$$

which is called the unbiased variance. The difference is small but crucial. To draw a clear line, the variance defined by (2.9) is called the sample variance. However, the use of these terminologies is mixed in the literature, and we need to take care. Among the Excel commands, "VAR.P" computes the sample variance and "VAR.S" the unbiased variance.

Exercise 2.8. For data $x_1, x_2, \dots, x_n$ prove that $s_x^2 = 0$ holds if and only if the data are constant.

Exercise 2.9. Let $x_1, x_2, \dots, x_n$ be data of size n. Find the value a that minimizes the sum of squared deviations from a:

$$\sum_{i=1}^n (x_i - a)^2.$$

Exercise 2.10. Let $x_1, x_2, \dots, x_n$ be data of size n. Find the value a that minimizes the sum of absolute deviations from a:

$$\sum_{i=1}^n |x_i - a|.$$

Generalizing the quartiles, we use percentiles to report the relative standing of an individual within a given data set. Roughly speaking, the 85th percentile is a value that is greater than or equal to 85% of all the values and less than or equal to the remaining 15%. Of course, this description is not strict, because the 85th percentile is not determined by the above condition alone. In practice, for the kth percentile of given data of size n we apply the following steps:

Step 1) Order all the values in the data set from smallest to largest, say,

$$x_{(1)} \le x_{(2)} \le \cdots \le x_{(i-1)} \le x_{(i)} \le x_{(i+1)} \le \cdots \le x_{(n)}. \tag{2.14}$$

Step 2) Calculate $r = nk/100$.

Step 3) If r is an integer, count through the ordered data (2.14) from left to right until we reach r. Then the kth percentile is defined to be the average of $x_{(r)}$ and $x_{(r+1)}$.

Step 4) If r is not an integer, round it up to the nearest integer $s = \lceil r \rceil$. Then count through the ordered data from left to right until we reach s; the kth percentile is defined to be the value $x_{(s)}$.

As is easily seen, the median coincides with the 50th percentile. The first and third quartiles are defined to be the 25th and 75th percentiles, respectively.

Example 2.11. Below is a list of 25 test scores ordered from lowest to highest:

43 54 56 61 62 66 68 69 69 70 71 72 77
78 79 85 87 88 89 93 95 96 98 99 99

Let us find the 90th percentile. Multiplying 90% by the total number of scores, we obtain $0.9 \times 25 = 22.5$. This is not an integer; rounding it up to the nearest integer, we obtain 23. Counting through the ordered data from left to right, we find the 23rd value, which is 98; this is the 90th percentile of the given data. For the 20th percentile, first take $0.20 \times 25 = 5$. This is an integer, so the 20th percentile is the average of the 5th and 6th values in the ordered data. Thus the 20th percentile is $(62 + 66)/2 = 64$.
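The four steps above are straightforward to implement; the following sketch, assuming Python with NumPy, defines a hypothetical helper percentile() and reproduces the results of Example 2.11.

```python
import numpy as np

def percentile(data, k):
    """kth percentile following Steps 1)-4); assumes 0 < k < 100."""
    x = np.sort(np.asarray(data, dtype=float))  # Step 1: order the values
    n = x.size
    r = n * k / 100                             # Step 2
    if r == int(r):                             # Step 3: r is an integer
        r = int(r)
        return (x[r - 1] + x[r]) / 2            # average of x_(r) and x_(r+1)
    s = int(np.ceil(r))                         # Step 4: round up
    return x[s - 1]                             # the value x_(s)

scores = [43, 54, 56, 61, 62, 66, 68, 69, 69, 70, 71, 72, 77,
          78, 79, 85, 87, 88, 89, 93, 95, 96, 98, 99, 99]
print(percentile(scores, 90), percentile(scores, 20))  # 98.0 64.0
```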

Remark 2.12. Our definition of a percentile is meant for practical use and is less theoretical. The idea of a percentile is more suited to a continuous distribution function (or a density function), and plays an essential role in statistical estimation and hypothesis testing.

2.4 Normalization

Let $x_1, x_2, \dots, x_n$ be data of the variable x. Recall that the mean and variance are defined by

$$\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i, \qquad s_x^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2,$$

respectively. The standard deviation $s_x$ is by definition the positive square root of the variance. The new variable

$$\tilde{x} = \frac{x - \bar{x}}{s_x} \tag{2.15}$$

is called the normalization of x.


Theorem 2.13. Given data $x_1, x_2, \dots, x_n$, let $\tilde{x}_1, \tilde{x}_2, \dots, \tilde{x}_n$ be their normalization. Then the normalized data have mean 0, variance 1, and hence standard deviation 1. That is,

$$\bar{\tilde{x}} = 0, \qquad s_{\tilde{x}}^2 = 1, \qquad s_{\tilde{x}} = 1.$$

Proof. By the definition of normalization (2.15) we have

$$\bar{\tilde{x}} = \frac{1}{n}\sum_{i=1}^n \tilde{x}_i = \frac{1}{n}\sum_{i=1}^n \frac{x_i - \bar{x}}{s_x}, \tag{2.16}$$

and after simple algebra we come to

$$\bar{\tilde{x}} = \frac{1}{s_x}\cdot\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})
= \frac{1}{s_x}\left(\frac{1}{n}\sum_{i=1}^n x_i - \frac{1}{n}\sum_{i=1}^n \bar{x}\right)
= \frac{1}{s_x}(\bar{x} - \bar{x}) = 0.$$

Then the variance of the normalized data is given by

$$s_{\tilde{x}}^2 = \frac{1}{n}\sum_{i=1}^n (\tilde{x}_i - \bar{\tilde{x}})^2 = \frac{1}{n}\sum_{i=1}^n \tilde{x}_i^2.$$

Again by the definition of normalization (2.15) we come to

$$s_{\tilde{x}}^2 = \frac{1}{n}\sum_{i=1}^n \left(\frac{x_i - \bar{x}}{s_x}\right)^2
= \frac{1}{s_x^2}\cdot\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2
= \frac{1}{s_x^2}\cdot s_x^2 = 1,$$

as desired. □

There are several merits to the normalization of data. As a rule, original data are real numbers with a certain unit associated to the measurement, and hence the values depend on the choice of the unit as well as on the origin of the scale. After normalization the effect of such freedom is cancelled. In fact, the normalization (2.15) depends only on the difference between $x_i$ and the mean $\bar{x}$; moreover, it is free from the unit after taking the ratio to the standard deviation. For example, a direct comparison of the two measured values 172 cm (height) and 65 kg (weight) does not make sense, whereas their normalizations may be compared reasonably.

Example 2.14 (Students’ deviation values). Suppose that a candidate got 75 points out of 100 in a screening test A.Obviously, the value 75 contains no information about his rank among the candidates. Suppose he got 62 points out of100 in another screening test B. We know that comparison of two values 75 and 62 does not imply that he gets a higherrank in test A than in test B. In that case the normalized points are more informative. According to our practicalexperience the normalized point varies mostly between �3 and 3. Since negative numbers are not convenient inbureaucracy and two-digit numbers are preferable, the deviation value is defined by

y ¼ 50þ 10~x ¼ 50þ 10x� �x

sx:

As is easily seen, the mean and the standard deviation of the deviation values over all the candidates are 50 and 10,respectively. Thus, the deviation value varies mostly between 20 and 80. Moreover, being approximated by a normaldistribution, the deviation value is useful for estimating the rank of a candidate and comparing the results of differenttests. Historically, the students’ deviation value was introduced by a Japanese high school teacher as a reasonable scaleof scholastic attainments.
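A minimal computational sketch of the deviation value, assuming Python with NumPy and hypothetical scores:

```python
import numpy as np

def deviation_value(x):
    """Students' deviation value y = 50 + 10 (x - mean) / s."""
    x = np.asarray(x, dtype=float)
    return 50 + 10 * (x - x.mean()) / x.std()  # x.std() divides by n

scores = np.array([75.0, 62.0, 88.0, 70.0, 55.0])  # hypothetical scores
y = deviation_value(scores)
print(y.mean(), y.std())  # 50.0 and 10.0, as stated above
```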

Exercise 2.15. There were two screening tests A and B. The mean and the standard deviation of the points of all candidates in test A are 70 and 12, respectively; those of test B are 50 and 8. A candidate got 75 points in test A and 62 in test B. Discuss the results by means of students' deviation values.

2.5 Use of second or higher moments

Theorem 2.16 (Chebyshev inequality). Let $x_1, x_2, \dots, x_n$ be data of size n, and let $\bar{x}$ be the mean and $s > 0$ the standard deviation. For $k > 0$ let $N(k)$ be the number of data $x_i$ satisfying $|x_i - \bar{x}| \ge ks$. Then we have

$$\frac{N(k)}{n} \le \frac{1}{k^2}. \tag{2.17}$$

Proof. Coming back to the definition (2.9), we divide the sum on the right-hand side into two parts as follows:

$$s^2 = \frac{1}{n}\sum_{i:\,|x_i-\bar{x}|\ge ks} (x_i - \bar{x})^2 + \frac{1}{n}\sum_{i:\,|x_i-\bar{x}|< ks} (x_i - \bar{x})^2.$$


In the first sum we have $(x_i - \bar{x})^2 \ge (ks)^2$ since $x_i$ satisfies $|x_i - \bar{x}| \ge ks$, and the second sum is always non-negative. Therefore $s^2$ is estimated as

$$s^2 \ge \frac{1}{n}\sum_{i:\,|x_i-\bar{x}|\ge ks} (ks)^2 = \frac{N(k)}{n}(ks)^2.$$

Dividing both sides by $(ks)^2$, which is non-zero by assumption, we obtain (2.17). □

Example 2.17. It follows from the Chebyshev inequality that the number of data departing from the mean by more than 2s is less than $1/2^2 = 1/4$ of the total number of data. Let us examine this using the data of Team A. Recall that the mean and the standard deviation are given by

$$\bar{x} = 179.8, \qquad s = 5.37,$$

respectively. That a value $x_i$ deviates from the mean by more than 2s means that $x_i \le 169.04$ or $x_i \ge 190.54$. There are 3 data satisfying this condition, so that $N(2) = 3$. Since the total number of data is $n = 83$, the relative frequency of the data deviating from the mean by more than 2s is

$$\frac{N(2)}{n} = \frac{3}{83} = 0.036.$$

Indeed, this relative frequency is less than $1/4 = 0.25$, as inferred from the Chebyshev inequality.

The Chebyshev inequality holds independently of the size of the data and the shape of their distribution, but it often gives a rather rough estimate. In fact, in the above example the actual relative frequency 0.036 is much smaller than the bound 1/4 obtained from the Chebyshev inequality. Equality in (2.17) holds only in very special cases.
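One can check the Chebyshev bound empirically; the sketch below assumes Python with NumPy and draws hypothetical data (here normally distributed, mimicking the Team A statistics).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=179.8, scale=5.37, size=83)  # hypothetical "heights"

mean, s, k = x.mean(), x.std(), 2
N_k = np.sum(np.abs(x - mean) >= k * s)  # number of data with |x_i - mean| >= ks

print(N_k / x.size, "<=", 1 / k**2)  # the Chebyshev bound (2.17) holds
```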

Having introduced basic statistics such as the mean, variance, standard deviation, minimum, maximum, median, and so forth, we add a few more. Given data $x_1, x_2, \dots, x_n$, let $\bar{x}$ denote the mean as usual. For a natural number k the central moment of degree k is defined by

$$m_k = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^k.$$

The variance $s^2$ is nothing else but the second central moment $m_2$, and its positive square root is the standard deviation $s = \sqrt{m_2}$. Using the third central moment $m_3$ we define the skewness by

$$\sqrt{\beta_1} = \frac{m_3}{s^3}.$$

The somewhat confusing symbol $\sqrt{\beta_1}$ is common in statistics; note that $\sqrt{\beta_1}$ may take a negative value. Skewness measures the asymmetry of the distribution of the data with respect to the mean. If the distribution has a heavier tail on the right, the skewness takes a larger positive value; if the distribution has a heavier tail on the left, the skewness $\sqrt{\beta_1}$ takes a larger negative value; and if the distribution is symmetric, we have $\sqrt{\beta_1} = 0$.

Using the central moment of fourth order $m_4$ we define the kurtosis by

$$\beta_2 = \frac{m_4}{s^4}.$$

This measures the degree of concentration of the data around the mean.
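The central moments, skewness and kurtosis translate directly into code; a sketch assuming Python with NumPy (the sample below is synthetic normal data, for which the values should be close to 0 and 3):

```python
import numpy as np

def central_moment(x, k):
    """kth central moment m_k."""
    x = np.asarray(x, dtype=float)
    return ((x - x.mean()) ** k).mean()

def skewness(x):
    return central_moment(x, 3) / central_moment(x, 2) ** 1.5  # m_3 / s^3

def kurtosis(x):
    return central_moment(x, 4) / central_moment(x, 2) ** 2    # m_4 / s^4

x = np.random.default_rng(1).normal(size=100_000)  # synthetic normal data
print(skewness(x), kurtosis(x))  # approximately 0 and 3
```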

Example 2.18. The skewness and kurtosis of the height of players in Team A are given by

$$\sqrt{\beta_1} = 0.459, \qquad \beta_2 = 3.038,$$

respectively. The positive skewness suggests that the distribution of heights has a heavier tail on the right side of the mean. The kurtosis near 3 suggests that the distribution is similar to a normal distribution, see Example 2.19 below.

Example 2.19. In statistics the normal distribution is of fundamental importance. It is a continuous distribution given by the density function

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),$$

where $\mu$ is the mean and $\sigma^2$ the variance, and is denoted by $N(\mu, \sigma^2)$. The outline of $f(x)$ is illustrated in Fig. 2.7. The skewness and kurtosis of the normal distribution $N(\mu, \sigma^2)$ are independent of $\mu$ and $\sigma^2$, and are given by

$$\sqrt{\beta_1} = 0, \qquad \beta_2 = 3,$$

respectively. These values are to be compared with the ones in Example 2.18, see also Fig. 2.2.


Remark 2.20. In some literature $m_4/s^4 - 3$ is taken as the definition of kurtosis. This alternative definition is useful for checking similarity to the normal distribution, whose kurtosis is 3 as shown in Example 2.19.

Remark 2.21. The normal distribution is often observed in the real world. Suppose that a population obeys a normal distribution with mean $\mu$ and standard deviation $\sigma$, see Fig. 2.2. Then:

(i) About 68% of the values lie within 1 standard deviation of the mean; in statistical notation, this is represented as $\mu \pm 1\sigma$.
(ii) About 95% of the values lie within 2 standard deviations of the mean, that is, $\mu \pm 2\sigma$.
(iii) About 99.7% of the values lie within 3 standard deviations of the mean, that is, $\mu \pm 3\sigma$.

The above three facts are often called the empirical rule or the 68-95-99.7 rule.
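The empirical rule can be verified by simulation; a sketch assuming Python with NumPy:

```python
import numpy as np

x = np.random.default_rng(2).normal(size=1_000_000)  # standard normal samples

for k in (1, 2, 3):
    frac = np.mean(np.abs(x) <= k)  # fraction within k standard deviations
    print(k, round(frac, 4))        # approximately 0.68, 0.95, 0.997
```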

3. Description of Multi-Variate Data

3.1 Two-variate data and scatter plot

Let us start with two-variate data given by an $n \times 2$ data matrix:

$$\begin{bmatrix}
x_1 & y_1 \\
x_2 & y_2 \\
\vdots & \vdots \\
x_n & y_n
\end{bmatrix}, \tag{3.1}$$

where x and y stand for the variables corresponding to the measurements, and n is the size of the data (the number of surveyed individuals). A pair of values $(x_i, y_i)$ is identified with a point in the xy-coordinate plane. Then the data matrix (3.1) is transformed into a set of n points plotted in the coordinate plane, which is called the scatter plot or scatter diagram of the data.

A scatter plot is useful for checking the relationship between two variables. In this context the relationship is generally called correlation. If the scatter plot lies approximately along a straight line, the relationship is called linear correlation. We will discuss only linear correlation.

(i) If the scatter plot shows an uphill pattern from left to right, we say that the two variables are positively correlated: as the x-values increase (move right), the y-values increase (move up).

(ii) If the scatter plot shows a downhill pattern from left to right, the two variables are negatively correlated: as the x-values increase (move right), the y-values decrease (move down).

Fig. 2.7. Normal distribution $N(\mu, \sigma^2)$.

Fig. 3.1. Positive correlation (left) and negative correlation (right).


Here is an example. The left diagram of Fig. 3.2 shows the scatter plot of height (horizontal axis) and weight (vertical axis) of players of Team A. There we may observe a growing trend: taller players are generally heavier. Of course, this trend is understandable from a common-sense perspective. Likewise, the right diagram of Fig. 3.2 shows the scatter plot of age (horizontal axis) and height (vertical axis), where it seems difficult to find a growing or declining trend.

A scatter plot is useful for roughly grasping correlation, but judging by eye easily leads to mistakes. A better treatment is to use the normalized data. Figure 3.3 shows the scatter plots of the normalized versions of the data used in Fig. 3.2. Since the mean of normalized data is 0, the scatter plot becomes a set of points distributed around the origin $(0, 0)$. Moreover, since the variance of normalized data is 1, the variability of the points along the horizontal and vertical axes is unified.

3.2 Correlation coefficient

Two variables are more strongly correlated if the points of the scatter plot are more tightly concentrated along a straight line. For a proper judgement of the strength of correlation we need a statistic called the correlation coefficient.

Let $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$ be two-variate data. The mean and variance of the variable x are given by

$$\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i, \qquad s_x^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2. \tag{3.2}$$

Similarly, for the variable y we have

$$\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i, \qquad s_y^2 = \frac{1}{n}\sum_{i=1}^n (y_i - \bar{y})^2. \tag{3.3}$$

We need a new statistic depending on both variables. The covariance of x and y is defined by

$$s_{xy} = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}). \tag{3.4}$$

Fig. 3.3. Normalized scatter plots: height and weight (left); age and height (right).

Fig. 3.2. Scatter plots: height and weight (left); age and height (right).


By definition we have

$$s_{xy} = s_{yx}, \qquad s_{xx} = s_x^2, \qquad s_{yy} = s_y^2.$$

Expanding the right-hand side of (3.4), we obtain

$$\begin{aligned}
s_{xy} &= \frac{1}{n}\sum_{i=1}^n (x_i y_i - \bar{x}y_i - x_i\bar{y} + \bar{x}\,\bar{y}) \\
&= \frac{1}{n}\sum_{i=1}^n x_i y_i - \bar{x}\cdot\frac{1}{n}\sum_{i=1}^n y_i - \bar{y}\cdot\frac{1}{n}\sum_{i=1}^n x_i + \frac{1}{n}\sum_{i=1}^n \bar{x}\,\bar{y} \\
&= \frac{1}{n}\sum_{i=1}^n x_i y_i - \bar{x}\,\bar{y} - \bar{y}\,\bar{x} + \bar{x}\,\bar{y}
= \frac{1}{n}\sum_{i=1}^n x_i y_i - \bar{x}\,\bar{y}.
\end{aligned}$$

The sum in the last expression is the mean of the variable xy, naturally denoted by $\overline{xy}$. We thus come to the useful formula:

$$s_{xy} = \overline{xy} - \bar{x}\cdot\bar{y}. \tag{3.5}$$
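Formula (3.5) is easy to verify numerically; a sketch assuming Python with NumPy and hypothetical two-variate data:

```python
import numpy as np

x = np.array([178.0, 185.0, 190.0, 172.0, 181.0])  # hypothetical heights
y = np.array([100.0,  90.0,  90.0,  75.0,  88.0])  # hypothetical weights

cov_def      = ((x - x.mean()) * (y - y.mean())).mean()  # definition (3.4)
cov_shortcut = (x * y).mean() - x.mean() * y.mean()      # formula (3.5)

print(np.isclose(cov_def, cov_shortcut))  # True
```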

We say that x and y are positively correlated if the covariance is positive, $s_{xy} > 0$. Similarly, x and y are negatively correlated if the covariance is negative, $s_{xy} < 0$. Finally, x and y are uncorrelated if the covariance vanishes, $s_{xy} = 0$.

We see from (3.4) that the covariance $s_{xy}$ is more likely positive if there are more data $(x_i, y_i)$ with $(x_i - \bar{x})(y_i - \bar{y}) > 0$ than with $(x_i - \bar{x})(y_i - \bar{y}) < 0$. In other words, $s_{xy}$ is more likely positive if more points are scattered in the upper-right or lower-left regions with respect to the mean point $(\bar{x}, \bar{y})$, see Fig. 3.4. In that case a growing trend of the scatter plot is more likely observed. A declining trend is understood similarly.

In order to judge the strength of the correlation we take normalized data. Let $\tilde{x}$ and $\tilde{y}$ be the normalizations of the variables x and y, respectively. The normalized data are given by

$$\tilde{x}_i = \frac{x_i - \bar{x}}{s_x}, \qquad \tilde{y}_i = \frac{y_i - \bar{y}}{s_y}. \tag{3.6}$$

Recall that the means of the normalized data are zero: $\bar{\tilde{x}} = \bar{\tilde{y}} = 0$. Applying the definition of covariance (3.4) to the pair of normalized variables $(\tilde{x}, \tilde{y})$, we obtain

$$s_{\tilde{x}\tilde{y}} = \frac{1}{n}\sum_{i=1}^n (\tilde{x}_i - \bar{\tilde{x}})(\tilde{y}_i - \bar{\tilde{y}})
= \frac{1}{n}\sum_{i=1}^n \tilde{x}_i\tilde{y}_i
= \frac{1}{n}\sum_{i=1}^n \frac{x_i - \bar{x}}{s_x}\cdot\frac{y_i - \bar{y}}{s_y}
= \frac{1}{s_x s_y}\cdot\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}).$$

Writing the last expression in terms of the covariance of x and y, we come to the important formula

$$s_{\tilde{x}\tilde{y}} = \frac{s_{xy}}{s_x s_y}. \tag{3.7}$$

The above statistic is called the correlation coefficient of $(x, y)$ and is denoted by $r = r_{xy}$. In other words, the correlation coefficient is defined by

$$r = r_{xy} = \frac{s_{xy}}{s_x s_y} = s_{\tilde{x}\tilde{y}}. \tag{3.8}$$

Fig. 3.4. Graphical understanding of the covariance.



In short, the correlation coefficient is the normalized covariance. Of course, the sign of the correlation coefficient coincides with that of the covariance. We say that a pair of variables $(x, y)$ is positively correlated if they have a positive correlation coefficient; similarly, $(x, y)$ is negatively correlated if the correlation coefficient is negative. If the correlation coefficient is zero, there is no linear correlation between the two variables.
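The definition (3.8) can be checked against NumPy's built-in np.corrcoef; a sketch with hypothetical data (the correlation coefficient is unaffected by whether the standard deviations are computed with n or n−1, since the factors cancel):

```python
import numpy as np

x = np.array([178.0, 185.0, 190.0, 172.0, 181.0])  # hypothetical data
y = np.array([100.0,  90.0,  90.0,  75.0,  88.0])

s_xy = ((x - x.mean()) * (y - y.mean())).mean()  # covariance (3.4)
r = s_xy / (x.std() * y.std())                   # definition (3.8)

print(r, np.corrcoef(x, y)[0, 1])  # the two values agree
```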

Theorem 3.1. For the correlation coefficient of two variables x, y we have

$$-1 \le r_{xy} = r_{yx} \le 1. \tag{3.9}$$

Proof. From the definition (3.8) we see immediately that $r_{xy} = r_{yx}$. For the inequalities in (3.9) it suffices to show that $r_{xy}^2 \le 1$. As usual, the normalizations of x and y are denoted by $\tilde{x}$ and $\tilde{y}$, respectively. We start with the obvious inequality

$$\sum_{i=1}^n (t\tilde{x}_i - \tilde{y}_i)^2 \ge 0, \qquad t \in \mathbb{R}.$$

Expanding the left-hand side, we have

$$\left(\sum_{i=1}^n \tilde{x}_i^2\right)t^2 - 2\left(\sum_{i=1}^n \tilde{x}_i\tilde{y}_i\right)t + \sum_{i=1}^n \tilde{y}_i^2 \ge 0.$$

Dividing both sides by n and using $\bar{\tilde{x}} = \bar{\tilde{y}} = 0$, we obtain

$$s_{\tilde{x}}^2\, t^2 - 2s_{\tilde{x}\tilde{y}}\, t + s_{\tilde{y}}^2 \ge 0. \tag{3.10}$$

Using $s_{\tilde{x}} = s_{\tilde{y}} = 1$ and $s_{\tilde{x}\tilde{y}} = r_{xy}$, we come to

$$t^2 - 2r_{xy}t + 1 \ge 0,$$

which holds for all real numbers $t \in \mathbb{R}$. Hence the discriminant $D = (-2r_{xy})^2 - 4 \le 0$, from which $r_{xy}^2 \le 1$ follows. □

As stated in Theorem 3.1, the correlation coefficient r always lies between −1 and +1. We can interpret various values of r as follows:

(i) A correlation r exactly equal to −1 indicates a perfect negative (linear) correlation (Exercise 3.7).
(ii) A correlation r close to −1 indicates a strong negative correlation.
(iii) A correlation r close to 0 means no linear correlation.
(iv) A correlation r close to +1 indicates a strong positive correlation.
(v) A correlation r exactly equal to +1 indicates a perfect positive (linear) correlation (Exercise 3.7).

Most statisticians consider the correlation strong if the correlation coefficient is above +0.60 or below −0.60. Note, however, that the correlation coefficient applies only to linear correlation, see Remark 3.3 below.

Example 3.2. Table 3.1 shows the correlation coefficients of height, weight and age of players in Team A. The correlation coefficient 0.628 is not very strong, but is enough to indicate a growing trend along a straight line, see the left diagram in Fig. 3.3. The correlation coefficient of age and height is almost zero, as is suggested by the scatter plot, see the right diagram in Fig. 3.3.

Remark 3.3. Even when the correlation coefficient is almost zero, we cannot infer that there is no relation between the two variables. Figure 3.5 shows two scatter plots whose correlation coefficients are 0.043 (left) and 0.082 (right), whereas both scatter plots suggest relations. In the left case, the data are scattered along an ellipse, suggesting a "quadratic" relation between the two variables; the correlation coefficient reflects only "linear" correlation, so it is useless for non-linear relations. In the right case, most data are scattered clearly along a straight line, but there are a few extremal data. In fact, the correlation coefficient of the data excluding the extremal ones is 0.915, indicating a very strong linear correlation. It is noted that the correlation coefficient is sensitive to extremal data.

Table 3.1. Correlation coefficients: Height, weight and age of players in Team A.

                       Covariance   Correlation coefficient
  height and weight       28.27            0.628
  age and height          −3.46           −0.130
  age and weight          −1.33           −0.032


Remark 3.4. The correlation coefficient is a unitless measure: if we change the units of x or y, the correlation does not change. For example, converting the height (y) from centimeters to inches will not affect the correlation between age and height. Also, as shown in Theorem 3.1, the correlation does not change after switching the variables x and y in the data set.

Remark 3.5. Clearly, the condition $s_x > 0$ and $s_y > 0$ is necessary to define the correlation coefficient. If $s_x = 0$ or $s_y = 0$, then the data corresponding to the variable x or y are constant. In that case our original question of finding a growing or declining trend in a scatter plot does not make sense.

Exercise 3.6. For two variables x, y show that $|s_{xy}| \le s_x s_y$.

Exercise 3.7. Let $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$ be two-variate data of size n and assume that $s_x > 0$ and $s_y > 0$. Show that all points in the scatter plot lie on a straight line with positive slope if and only if the correlation coefficient $r_{xy} = 1$. Similarly, show that all points lie on a straight line with negative slope if and only if $r_{xy} = -1$.

Exercise 3.8. Consider two variables $(x, y)$ and let $r_{xy}$ be the correlation coefficient. For constant numbers a and b with $a \ne 0$ set $x' = ax + b$. Show that

$$r_{x'y} = \begin{cases} r_{xy}, & \text{if } a > 0, \\ -r_{xy}, & \text{if } a < 0. \end{cases}$$

3.3 Regression analysis

There are many problems that can be transformed into an input-output model. Let us consider a system which receives an input and yields an output, where the system is often a black box with no detailed information about its operation. Here we consider a p-dimensional vector $(x_1, \dots, x_p)$ as an input and a single variable y as an output:

$$\text{Input } (x_1, \dots, x_p) \;\longrightarrow\; \text{System (Black Box)} \;\longrightarrow\; \text{Output } y \tag{3.11}$$

Mathematically, a system is an unknown function:

$$y = f(x_1, x_2, \dots, x_p). \tag{3.12}$$

Given a set of input-output data, we look for a function $y = f(x_1, x_2, \dots, x_p)$ which explains the data. For example, a manager of a beer company could make a good plan if the sales of beer y could be predicted in terms of the advertising cost $x_1$ and the temperature $x_2$. Naturally, there is no fundamental principle available for this problem. Our approach consists of collecting data of the three variables $(x_1, x_2, y)$ and looking for a function $y = f(x_1, x_2)$ which recovers the data with reasonable accuracy.

To formulate our problem we need some notions and notation. A variable that we wish to predict is called a target variable or dependent variable, while a variable used for calculating the target variable is called an explanatory variable, controlled variable or independent variable. Let y be a target variable and $x_1, x_2, \dots, x_p$ a set of explanatory variables. The data are then $(p+1)$-variate, usually given in the form of an $n \times (p+1)$ data matrix:

$$\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1p} & y_1 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
x_{i1} & x_{i2} & \cdots & x_{ip} & y_i \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np} & y_n
\end{bmatrix}.$$

Fig. 3.5. Examples of almost zero correlation coefficients: 0.043 (left) and 0.082 (right).



Our problem is to find a function $y = f(x_1, x_2, \dots, x_p)$ which recovers the given data. In general, however, we cannot hope for a function $y = f(x_1, x_2, \dots, x_p)$ that reproduces all the data exactly. First of all, this is impossible if two data show that the system (3.11) yields different outputs from the same inputs. In that case we might hope to add more explanatory variables to determine the function, but this strategy is not realistic, because additional variables are often uncontrollable and unmeasurable. In fact, in a practical experiment or observation we cannot specify all the variables that might affect the output. It is therefore essential to find a function $y = f(x_1, x_2, \dots, x_p)$ which recovers the data with reasonable accuracy. In other words, we allow an error term $\varepsilon$ to account for the data, in such a way that

$$y = f(x_1, x_2, \dots, x_p) + \varepsilon.$$

In this context the function $y = f(x_1, x_2, \dots, x_p)$ is called a regression model. In particular, $y = f(x_1, x_2, \dots, x_p)$ is called a linear regression model if it is a linear function:

$$y = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \beta_0, \tag{3.13}$$

where $\beta_0, \beta_1, \dots, \beta_p$ are real coefficients. On the other hand, a regression model is called a single regression model if there is just one explanatory variable, and a multiple regression model otherwise. In general, the methodology of constructing regression models is called regression analysis.

3.4 Regression lines and method of least squares

We consider a single linear regression model. Given two-variate data

$$(x_1, y_1), \dots, (x_i, y_i), \dots, (x_n, y_n), \tag{3.14}$$

our problem is to determine a linear function

$$y = ax + b \tag{3.15}$$

which recovers the data with reasonable accuracy. Such a linear function is also called a regression line. Here we take x as the explanatory variable and y as the target variable. Accordingly, a pair $(x_i, y_i)$ in the data is understood in such a way that an input $x = x_i$ yields the output $y = ax_i + b$ by (3.15), but the observed value $y_i$ appears with a deviation or fluctuation caused by some uncontrolled effects. Define the deviation $\varepsilon_i$ by

$$y_i = ax_i + b + \varepsilon_i,$$

see Fig. 3.6. We consider the most reasonable model to be the one that minimizes the total deviation. In fact, there are several ways of defining the total deviation. The sum of squared deviations,

$$Q = \sum_{i=1}^n \varepsilon_i^2 = \sum_{i=1}^n (y_i - ax_i - b)^2, \tag{3.16}$$

is the most fundamental, for both theoretical and practical reasons. Thus our task is to find the constants a and b that minimize $Q = Q(a, b)$. This principle is called the method of least squares, going back to Gauss and Legendre. A linear regression model, or a regression line, is usually obtained by means of the method of least squares.

Fig. 3.6. Derivation of a regression line.


We now outline the derivation of the linear regression model. The sum of squared deviations $Q = Q(a, b)$ is a quadratic function of a and b, though with a lengthy expression, so the minimum can be found by simple algebra (completing the square) or by simple differential calculus. The essence is stated in the following lemma.

Lemma 3.9. Given n data $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$ with $s_x > 0$, the quadratic function

$$Q(a, b) = \sum_{i=1}^n (y_i - ax_i - b)^2 \tag{3.17}$$

attains its minimum at $a = a_0$ and $b = b_0$ given by

$$a_0 = \frac{s_{xy}}{s_x^2}, \qquad b_0 = \bar{y} - a_0\bar{x}. \tag{3.18}$$

Proof. Expanding the right-hand side of (3.17), we obtain

$$Q = \sum (y_i^2 + a^2x_i^2 + b^2 - 2ax_iy_i - 2by_i + 2abx_i)
= \sum y_i^2 + a^2\sum x_i^2 + b^2 n - 2a\sum x_iy_i - 2b\sum y_i + 2ab\sum x_i, \tag{3.19}$$

where the sums are taken over $1 \le i \le n$. Use of the mean, variance and covariance,

$$\bar{x} = \frac{1}{n}\sum x_i, \quad s_x^2 = \frac{1}{n}\sum x_i^2 - \bar{x}^2, \quad
\bar{y} = \frac{1}{n}\sum y_i, \quad s_y^2 = \frac{1}{n}\sum y_i^2 - \bar{y}^2, \quad
s_{xy} = \frac{1}{n}\sum x_iy_i - \bar{x}\,\bar{y},$$

is slightly helpful for a concise expression. In fact, after simple algebra we obtain

$$Q = n(s_y^2 + \bar{y}^2) + a^2 n(s_x^2 + \bar{x}^2) + b^2 n - 2an(s_{xy} + \bar{x}\,\bar{y}) - 2bn\bar{y} + 2abn\bar{x}.$$

We see from the form (3.16) that $Q = Q(a, b)$ attains a minimum, so we need only find the stationary points of $Q(a, b)$. The partial derivatives are easily obtained as

$$\frac{\partial Q}{\partial a} = 2an(s_x^2 + \bar{x}^2) - 2n(s_{xy} + \bar{x}\,\bar{y}) + 2bn\bar{x}, \qquad
\frac{\partial Q}{\partial b} = 2bn - 2n\bar{y} + 2an\bar{x}.$$

Thus, our task is to solve the linear system

$$\frac{\partial Q}{\partial a} = \frac{\partial Q}{\partial b} = 0.$$

Indeed, $(a_0, b_0)$ given in (3.18) is the unique solution, which means that $Q = Q(a, b)$ attains its minimum only there. □
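The closed-form solution (3.18) is easily cross-checked against a numerical least-squares fit; the sketch below assumes Python with NumPy and hypothetical data.

```python
import numpy as np

x = np.array([178.0, 185.0, 190.0, 172.0, 181.0])  # hypothetical data
y = np.array([100.0,  90.0,  90.0,  75.0,  88.0])

s_xy = ((x - x.mean()) * (y - y.mean())).mean()
a0 = s_xy / x.var()            # a_0 = s_xy / s_x^2 by (3.18)
b0 = y.mean() - a0 * x.mean()  # b_0 = ybar - a_0 xbar

a_np, b_np = np.polyfit(x, y, deg=1)  # NumPy's own least-squares line
print(np.isclose(a0, a_np), np.isclose(b0, b_np))  # True True
```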

Remark 3.10. In Lemma 3.9 we assume that $s_x > 0$. The condition $s_x = 0$ is equivalent to $x_1, x_2, \dots, x_n$ being constant. If the two-variate data $(x_i, y_i)$ have this property, the question of finding a regression model makes no sense.

Remark 3.11. An alternative proof of Lemma 3.9 is by elementary algebra. Using the identity

$$\frac{1}{n}Q(a_0, b_0) = s_y^2 - \frac{s_{xy}^2}{s_x^2},$$

we easily obtain

$$\frac{1}{n}\bigl(Q(a, b) - Q(a_0, b_0)\bigr) = (a\bar{x} + b - \bar{y})^2 + \left(as_x - \frac{s_{xy}}{s_x}\right)^2 \ge 0.$$

Thus we see that $Q(a, b) \ge Q(a_0, b_0)$ holds for all a, b.

Theorem 3.12. For two-variate data $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$ the regression line is given by

$$\frac{y - \bar{y}}{s_y} = r_{xy}\,\frac{x - \bar{x}}{s_x}, \tag{3.20}$$


where x is the explanatory variable and y the target variable. Similarly, the regression line with explanatory variable y and target variable x is given by

$$\frac{x - \bar{x}}{s_x} = r_{xy}\,\frac{y - \bar{y}}{s_y}. \tag{3.21}$$

Proof. By definition, the regression line with explanatory variable x and target variable y is given by $y = a_0x + b_0$, where $a_0$ and $b_0$ are given in (3.18). Using the explicit expression in Lemma 3.9, we come to

$$y - \bar{y} = \frac{s_{xy}}{s_x^2}(x - \bar{x}).$$

Furthermore, using the correlation coefficient $r_{xy} = s_{xy}/(s_x s_y)$, we have

$$y - \bar{y} = \frac{s_y}{s_x}\,r_{xy}(x - \bar{x}).$$

Dividing both sides by $s_y$, we obtain (3.20). The second half of the statement follows by exchanging the roles of x and y, together with the obvious relation $r_{xy} = r_{yx}$. □

It is noted that the two regression lines (3.20) and (3.21) pass through the common point $(\bar{x}, \bar{y})$ but their slopes are different. In fact, for the ratio of their slopes we have

$$\frac{s_y}{s_x}\,r_{xy} \,\Big/\, \frac{s_y}{r_{xy}\,s_x} = r_{xy}^2 \le 1.$$

In other words, the roles of explanatory and target variables are not symmetric.

Example 3.13. We examined the height and weight of players of Team A in the previous subsections. Recall the statistics:

$$\bar{x} = 179.8, \quad \bar{y} = 82.9, \quad s_x = 5.37, \quad s_y = 8.39, \quad r_{xy} = 0.628.$$

Then the regression line with height as explanatory variable x and weight as target variable y is given by

$$\frac{y - 82.9}{8.39} = 0.628 \times \frac{x - 179.8}{5.37},$$

that is,

$$y = 0.98x - 93.3. \tag{3.22}$$

Similarly, the regression line with weight as explanatory variable y and height as target variable x is given by

$$\frac{x - 179.8}{5.37} = 0.628 \times \frac{y - 82.9}{8.39},$$

that is,

$$x = 0.40y + 146.6. \tag{3.23}$$

The latter equation is equivalent to $y = 2.5x - 366.5$, whose slope in the xy-coordinate plane is 2.5, certainly larger than the slope of (3.22), see Fig. 3.7.

Fig. 3.7. Regression lines with explanatory variables x (solid line) and y (dotted line).
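Since the regression line is determined by five summary statistics alone, the lines of Example 3.13 can be reproduced without the raw data; a sketch in Python using the values quoted in the text (small discrepancies with (3.22) and (3.23) are due to rounding):

```python
# Summary statistics quoted in Example 3.13.
x_bar, y_bar = 179.8, 82.9
s_x, s_y, r = 5.37, 8.39, 0.628

a = r * s_y / s_x        # slope of the regression line (3.20)
b = y_bar - a * x_bar    # intercept
print(a, b)              # roughly 0.98 and -93, cf. (3.22)

a2 = r * s_x / s_y       # slope of the line (3.21), x regressed on y
b2 = x_bar - a2 * y_bar  # intercept
print(a2, b2)            # roughly 0.40 and 146.5, cf. (3.23)
```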


In Example 3.13, having found a fairly strong linear correlation, we obtained the regression lines (3.22) and (3.23). Both are best-fitting lines to the data. The regression line (3.22) is used to predict a y-value from a given x-value; in other words, using an x-variable whose data are easily observed or collected, we can predict the value of a y-variable that is difficult or impossible to measure. The regression line (3.23) is used when the roles of the x- and y-variables are exchanged. This idea works well as long as x and y are correlated.

Finally, we mention an important remark on the application of regression lines. The regression line is determined by the five statistics $\bar{x}, \bar{y}, s_x, s_y, s_{xy}$; hence, without looking at a scatter plot we can obtain the regression line by simple calculation. Figure 3.8 shows two examples of scatter plots of normalized data and their regression lines with explanatory variable x (horizontal axis) and target variable y (vertical axis). The correlation coefficients are 0.756 (left) and 0.415 (right), both showing positive correlation. Looking at the left scatter plot, we are easily convinced that applying a regression line is improper, because the scattered points more likely follow a quadratic curve. The right scatter plot shows that most data are scattered along a straight line which is different from the regression line; this is caused by a few extremal data, which must be examined carefully. To avoid the risk of misusing a regression line, it is recommended to look at the scatter plot.

Exercise 3.14. For two-variate data (x, y), let L_1 be the regression line with explanatory variable x and target variable y, and L_2 the one with explanatory variable y and target variable x. Show that the modulus of the slope of L_2 is greater than or equal to that of L_1.

Exercise 3.15. Let θ (0 ≤ θ ≤ π/2) be the intersection angle of the regression line with explanatory variable x and target variable y, and the one with explanatory variable y and target variable x. Show that

\[ \tan\theta = \left| r_{xy} - \frac{1}{r_{xy}} \right| \frac{s_x s_y}{s_x^2 + s_y^2}. \]

Then prove that the intersection angle becomes closer to π/2 as the correlation between x and y becomes weaker, and that it becomes closer to 0 as the correlation between x and y becomes stronger.

3.5 Description of general multi-variate data

Now we discuss general p-variate data of variables x_1, …, x_j, …, x_p. We start with a data matrix of the form:

\[ X = \begin{bmatrix} x_{11} & \cdots & x_{1j} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ x_{i1} & \cdots & x_{ij} & \cdots & x_{ip} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{nj} & \cdots & x_{np} \end{bmatrix}. \tag{3.24} \]

The ith row of X gives rise to a p-dimensional vector denoted by

\[ x_i = [\, x_{i1} \;\cdots\; x_{ij} \;\cdots\; x_{ip} \,]. \tag{3.25} \]

As usual we identify x_i with a point in p-dimensional coordinate space. Our main interest lies in how the points corresponding to the data (3.24) are distributed in the p-dimensional coordinate space. In the previous subsections we studied the case of two-variate data (p = 2), where visualization by a scatter plot is useful. In the case of general p-variate data, direct observation of a scatter plot is not easy and reduction of dimension becomes important.

Fig. 3.8. Misuse of regression lines.


Suppose we are given p-variate data in the form of a data matrix as in (3.24). Focusing on a variable x_j, chosen from the p variables, we obtain single-variate data:

\[ x_{1j}, \ldots, x_{ij}, \ldots, x_{nj}, \]

which appear as the jth column of the data matrix X. Then the mean and variance of the variable x_j are defined by

\[ \bar{x}_j = \frac{1}{n} \sum_{i=1}^n x_{ij}, \tag{3.26} \]

\[ s_{x_j}^2 = \frac{1}{n} \sum_{i=1}^n (x_{ij} - \bar{x}_j)^2, \tag{3.27} \]

respectively. Similarly, for two variables x_j and x_k we define their covariance and correlation coefficient by

\[ s_{x_j, x_k} = \frac{1}{n} \sum_{i=1}^n (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k), \tag{3.28} \]

\[ r_{x_j, x_k} = \frac{s_{x_j, x_k}}{s_{x_j} s_{x_k}}, \tag{3.29} \]

respectively. From now on, to avoid cumbersome symbols we write

\[ s_j^2 = s_{x_j}^2, \quad s_{jk} = s_{x_j, x_k}, \quad r_{jk} = r_{x_j, x_k}. \]

We note by definition that

\[ s_{jj} = s_j^2, \quad r_{jj} = 1. \]

With these statistics we define two p × p matrices by

\[ \Sigma = \begin{bmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ s_{21} & s_{22} & \cdots & s_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ s_{p1} & s_{p2} & \cdots & s_{pp} \end{bmatrix}, \qquad R = \begin{bmatrix} r_{11} & r_{12} & \cdots & r_{1p} \\ r_{21} & r_{22} & \cdots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p1} & r_{p2} & \cdots & r_{pp} \end{bmatrix}. \tag{3.30} \]

The former is called the variance-covariance matrix and the latter the correlation matrix. Both matrices are symmetric in the sense that they are invariant under transposition, namely, s_{jk} = s_{kj} and r_{jk} = r_{kj}. Note also that the diagonal entries of the correlation matrix are all r_{jj} = 1. The variance-covariance matrix and the correlation matrix are fundamental in multi-variate analysis.

It is noticeable that the variance-covariance matrix and the correlation matrix are derived directly from the data matrix X by means of matrix operations. Let J be the n × n matrix whose entries are all one, i.e.,

\[ J = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ 1 & 1 & \cdots & 1 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & \cdots & 1 \end{bmatrix}. \]

Calculating JX and comparing with the mean (3.26), we obtain

\[ \frac{1}{n}(JX)_{ij} = \frac{1}{n} \sum_{k=1}^n (J)_{ik}(X)_{kj} = \frac{1}{n} \sum_{k=1}^n x_{kj} = \bar{x}_j. \]

We set

\[ Y = X - \frac{1}{n} JX. \tag{3.31} \]

Then we have

\[ (Y)_{ij} = \left( X - \frac{1}{n} JX \right)_{ij} = x_{ij} - \bar{x}_j. \tag{3.32} \]

Since Y is an n × p matrix, the product Y^T Y is defined and becomes a p × p matrix. The (j, k) entry of Y^T Y is given by

\[ (Y^T Y)_{jk} = \sum_{i=1}^n (Y^T)_{ji} Y_{ik} = \sum_{i=1}^n (Y)_{ij} Y_{ik}, \tag{3.33} \]

and, with the help of (3.32), we obtain


\[ (Y^T Y)_{jk} = \sum_{i=1}^n (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k). \]

In view of the covariance of two variables x_j and x_k given in (3.28), we see that

\[ s_{jk} = \frac{1}{n} (Y^T Y)_{jk}. \]

Consequently, the variance-covariance matrix Σ in (3.30) becomes

\[ \Sigma = \frac{1}{n} Y^T Y = \frac{1}{n} \left( X - \frac{1}{n} JX \right)^T \left( X - \frac{1}{n} JX \right). \tag{3.34} \]

For the correlation matrix we prepare a p × p matrix defined by

\[ D = \begin{bmatrix} \sqrt{s_{11}} & 0 & \cdots & 0 \\ 0 & \sqrt{s_{22}} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sqrt{s_{pp}} \end{bmatrix}, \tag{3.35} \]

where the diagonal entries are the standard deviations and the off-diagonal entries are all zero. Note that D^{-1} is the diagonal matrix whose diagonal entries are the inverses of those of D. Then

\[ Z = Y D^{-1} \]

becomes an n × p matrix whose (i, j) entry is given by

\[ z_{ij} = \sum_{k=1}^p (Y)_{ik} (D^{-1})_{kj} = (Y)_{ij} (D^{-1})_{jj} = \frac{x_{ij} - \bar{x}_j}{\sqrt{s_{jj}}}. \]

In other words, z_{ij} is the normalization of x_{ij} and the matrix Z itself is the normalization of the data matrix. Moreover, Z^T Z becomes a p × p matrix whose (j, k) entry is given by

\[ (Z^T Z)_{jk} = \sum_{i=1}^n (Z^T)_{ji} (Z)_{ik} = \sum_{i=1}^n z_{ij} z_{ik} = \sum_{i=1}^n \frac{x_{ij} - \bar{x}_j}{\sqrt{s_{jj}}} \cdot \frac{x_{ik} - \bar{x}_k}{\sqrt{s_{kk}}}. \]

We then see from (3.28) and (3.29) that

\[ \frac{1}{n} (Z^T Z)_{jk} = \frac{1}{\sqrt{s_{jj}} \sqrt{s_{kk}}} \cdot \frac{1}{n} \sum_{i=1}^n (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k) = \frac{s_{jk}}{\sqrt{s_{jj}} \sqrt{s_{kk}}} = \frac{s_{jk}}{s_j s_k} = r_{jk}, \]

which is the correlation coefficient. Thus, from the definition (3.30) we obtain

\[ R = \frac{1}{n} Z^T Z = \frac{1}{n} (Y D^{-1})^T (Y D^{-1}) = \frac{1}{n} D^{-1} Y^T Y D^{-1}. \]

Finally, in view of (3.31) we obtain the formula for the correlation matrix R:

\[ R = \frac{1}{n} D^{-1} \left( X - \frac{1}{n} JX \right)^T \left( X - \frac{1}{n} JX \right) D^{-1}. \tag{3.36} \]

Moreover, combined with (3.34), we come to the basic identity linking the variance-covariance and correlation matrices:

\[ R = D^{-1} \Sigma D^{-1}. \]

Summing up, we claim the following result.

Theorem 3.16. Let X be an n × p data matrix as in (3.24). Then the variance-covariance matrix and the correlation matrix are respectively given by

\[ \Sigma = \frac{1}{n} \left( X - \frac{1}{n} JX \right)^T \left( X - \frac{1}{n} JX \right), \qquad R = D^{-1} \Sigma D^{-1}, \]

where J is the all-one matrix and D is the diagonal matrix consisting of the standard deviations of the variables as in (3.35).
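As an illustration of Theorem 3.16, the following minimal sketch (assuming Python with NumPy; the toy data matrix is synthetic) computes Σ and R purely by the matrix operations above and cross-checks them against library routines.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))      # synthetic n x p data matrix

J = np.ones((n, n))              # all-one matrix
Y = X - J @ X / n                # centered data matrix, (3.31)
Sigma = Y.T @ Y / n              # variance-covariance matrix, (3.34)

D_inv = np.diag(1.0 / np.sqrt(np.diag(Sigma)))
R = D_inv @ Sigma @ D_inv        # correlation matrix, R = D^{-1} Sigma D^{-1}

# Cross-checks; ddof=0 matches the 1/n convention used in these notes.
assert np.allclose(Sigma, np.cov(X, rowvar=False, ddof=0))
assert np.allclose(R, np.corrcoef(X, rowvar=False))
```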

3.6 Multi-variate regression analysis

Introducing the method of least squares, we derived in Sect. 3.4 the regression line from two-variate data (x, y), where x and y are explanatory and target variables, respectively. In this subsection we deal with the general case of (p+1)-variate data (x_1, x_2, …, x_p, y), where x_1, x_2, …, x_p are explanatory variables and y is a target variable. We start with an n × (p+1) data matrix given by

\[ \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} & y_1 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ x_{i1} & x_{i2} & \cdots & x_{ip} & y_i \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} & y_n \end{bmatrix}. \tag{3.37} \]

Our goal is to derive a multilinear regression model:

\[ y = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \beta_0 \tag{3.38} \]

by means of the method of least squares. As before, the ith value (x_{i1}, x_{i2}, …, x_{ip}, y_i) is understood in such a way that an input (x_{i1}, x_{i2}, …, x_{ip}) yields the output y = β_1 x_{i1} + β_2 x_{i2} + ⋯ + β_p x_{ip} + β_0 according to (3.38), but the observed value y_i deviates from it by uncontrolled effects. The deviation ε_i is defined by

\[ y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \beta_0 + \varepsilon_i. \tag{3.39} \]

Then we will minimize the sum of squared deviations:

\[ Q = \sum_{i=1}^n \varepsilon_i^2 = \sum_{i=1}^n \bigl( y_i - (\beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \beta_0) \bigr)^2. \tag{3.40} \]

This is the principle of the method of least squares. In fact, since Q = Q(β_1, …, β_p, β_0) is a quadratic function, we may apply a similar argument as in the case of p = 1. In order to overcome the difficulty caused by the number of variables, we employ matrix notation.

We first note that on the right-hand side of (3.38) the roles of β_1, …, β_p and β_0 are not equal. It is then convenient to introduce a dummy variable x_0, whose data are set to be all one. Let X be the data matrix associated with the variables x_0, x_1, …, x_p and y the one associated with y. In fact, X becomes an n × (p+1) matrix and y an n × 1 matrix, or an n-dimensional column vector:

\[ X = \begin{bmatrix} x_{10} & x_{11} & \cdots & x_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{i0} & x_{i1} & \cdots & x_{ip} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n0} & x_{n1} & \cdots & x_{np} \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ \vdots \\ y_i \\ \vdots \\ y_n \end{bmatrix}, \tag{3.41} \]

where x_{i0} = 1 for all i. Next we define a (p+1)-dimensional column vector by

\[ \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}. \]

Our problem is to determine β from X and y. With the above matrix notation the deviation (3.39) becomes

\[ \varepsilon_i = (y - X\beta)_i, \tag{3.42} \]

where the right-hand side is the ith entry of the n-dimensional vector y − Xβ. Then the sum of squared deviations is given by the norm and inner product:

\[ Q = \sum_{i=1}^n \varepsilon_i^2 = \| y - X\beta \|^2 = \langle y - X\beta, \, y - X\beta \rangle. \tag{3.43} \]

This simple expression helps our argument very much. Expanding the right-hand side, we obtain

\[ Q = \langle y, y \rangle - 2\langle y, X\beta \rangle + \langle X\beta, X\beta \rangle = \langle y, y \rangle - 2\langle X^T y, \beta \rangle + \langle X^T X \beta, \beta \rangle. \]

It follows from the general theory that Q = Q(β) attains its minimum at a stationary point. Stationary points are characterized by the linear system:

\[ \frac{\partial Q}{\partial \beta_0} = \frac{\partial Q}{\partial \beta_1} = \cdots = \frac{\partial Q}{\partial \beta_p} = 0. \tag{3.44} \]

On the other hand, the partial derivatives of Q are easily computed. Let e_j be the (p+1)-dimensional vector whose jth entry is one and whose other entries are all zero. Then we have

\[ \frac{\partial Q}{\partial \beta_j} = -2\langle X^T y, e_j \rangle + 2\langle X^T X \beta, e_j \rangle = \langle -2X^T y + 2X^T X \beta, \, e_j \rangle, \]

from which we see that the linear system (3.44) is equivalent to −2X^T y + 2X^T X β = 0, or equivalently,

\[ X^T X \beta = X^T y. \tag{3.45} \]

The above equation is often called the normal equation. Assuming that the matrix X^T X has an inverse, we come to the unique solution of (3.44), that is,

\[ \beta_0 = (X^T X)^{-1} X^T y. \tag{3.46} \]

Consequently, β = β_0 is the unique point at which Q = Q(β) attains its minimum. Summing up the above argument, we come to the following statement.

Theorem 3.17. Assume that (p+1)-variate data are given by a data matrix as in (3.37). Introduce a dummy variable x_0 and set the corresponding data to be all one. Let X and y be the data matrices defined as in (3.41), and assume that X^T X has an inverse. Then the multilinear regression model (3.38) that minimizes the sum of squared deviations is given by (3.46).
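A minimal computational sketch of Theorem 3.17 (assuming Python with NumPy; the data are synthetic) solves the normal equation (3.45) directly rather than forming the inverse explicitly, which is also the numerically preferable route.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 2
X0 = rng.normal(size=(n, p))      # explanatory variables x1, x2
y = 1.0 + 2.0 * X0[:, 0] - 3.0 * X0[:, 1] + 0.1 * rng.normal(size=n)

X = np.column_stack([np.ones(n), X0])     # dummy variable x0 = 1 in front
beta = np.linalg.solve(X.T @ X, X.T @ y)  # solves X^T X beta = X^T y, (3.45)
print(beta)                               # approx [1, 2, -3]
```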

Remark 3.18. In the above argument we cannot drop the condition that the matrix X^T X has an inverse. If the size of the data is large, X^T X is most probably invertible in practice. On the other hand, it is proved that X^T X has no inverse if p > n. Thus, the case where the number of variables exceeds the size of the data requires a more advanced methodology.

It is instructive to check directly that Q = Q(β) attains its minimum at β = β_0 given by (3.46). We start with (3.43):

\[ Q(\beta) = \| y - X\beta \|^2 = \| y - X\beta_0 + X\beta_0 - X\beta \|^2 \]
\[ = \| y - X\beta_0 \|^2 + \| X\beta_0 - X\beta \|^2 + 2\langle y - X\beta_0, \, X\beta_0 - X\beta \rangle \tag{3.47} \]
\[ = \| y - X\beta_0 \|^2 + \| X\beta_0 - X\beta \|^2 + 2\langle X^T(y - X\beta_0), \, \beta_0 - \beta \rangle. \tag{3.48} \]

Since β_0 fulfills (3.45), we have X^T(y − Xβ_0) = 0 and hence the inner product in the last expression of (3.48) vanishes. We then have

\[ Q(\beta) = \| y - X\beta_0 \|^2 + \| X\beta_0 - X\beta \|^2 = Q(\beta_0) + \| X\beta_0 - X\beta \|^2 \ge Q(\beta_0). \]

Apparently, equality happens only when Xβ_0 = Xβ. But Xβ_0 = Xβ does not imply β = β_0 in general. If X^T X is invertible, we may conclude β = β_0. In that case Q = Q(β) attains its minimum at β = β_0 and the minimum is attained only at β = β_0.

Example 3.19. In Sect. 3.3 we derived the linear regression model for two-variate data. Of course, the method described in this subsection covers the case p = 1, and it is instructive to apply the matrix method to this case. We start with two-variate data

\[ (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n), \]

where y is the target variable and x is the explanatory variable. Introduce a dummy variable x_0 and rewrite x as x_1. Then the data matrices X and y in (3.41) take the forms:

\[ X = \begin{bmatrix} x_{10} & x_{11} \\ x_{20} & x_{21} \\ \vdots & \vdots \\ x_{n0} & x_{n1} \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \]

respectively. Then by direct calculation we have

\[ X^T X = \begin{bmatrix} n & \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i & \sum_{i=1}^n x_i^2 \end{bmatrix} = \begin{bmatrix} n & n\bar{x} \\ n\bar{x} & n s_x^2 + n\bar{x}^2 \end{bmatrix}. \]

It is known that X^T X has an inverse if and only if s_x^2 ≠ 0. This is equivalent to the statement that the data of the variable x are not constant. Under this condition we have

\[ (X^T X)^{-1} = \frac{1}{n^2 s_x^2} \begin{bmatrix} n s_x^2 + n\bar{x}^2 & -n\bar{x} \\ -n\bar{x} & n \end{bmatrix} = \frac{1}{n s_x^2} \begin{bmatrix} s_x^2 + \bar{x}^2 & -\bar{x} \\ -\bar{x} & 1 \end{bmatrix}. \]

On the other hand, since

\[ X^T y = \begin{bmatrix} 1 & \cdots & 1 \\ x_1 & \cdots & x_n \end{bmatrix} \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^n y_i \\ \sum_{i=1}^n x_i y_i \end{bmatrix} = \begin{bmatrix} n\bar{y} \\ n s_{xy} + n\bar{x}\bar{y} \end{bmatrix}, \]

we see from the formula (3.46) that

\[ \beta_0 = (X^T X)^{-1} X^T y = \frac{1}{n s_x^2} \begin{bmatrix} s_x^2 + \bar{x}^2 & -\bar{x} \\ -\bar{x} & 1 \end{bmatrix} \begin{bmatrix} n\bar{y} \\ n s_{xy} + n\bar{x}\bar{y} \end{bmatrix} = \frac{1}{s_x^2} \begin{bmatrix} s_x^2 \bar{y} - s_{xy}\bar{x} \\ s_{xy} \end{bmatrix}. \tag{3.49} \]

Consequently, the desired linear regression model is given by

\[ y = \beta_1 x + \beta_0, \]

where

\[ \beta_0 = \frac{1}{s_x^2}(s_x^2 \bar{y} - s_{xy}\bar{x}) = \bar{y} - \frac{s_{xy}}{s_x^2}\bar{x}, \qquad \beta_1 = \frac{s_{xy}}{s_x^2}. \]

Of course, the result coincides with the one stated in Theorem 3.12.
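The coincidence can also be confirmed numerically; the following sketch (assuming NumPy, with synthetic data) checks that the matrix formula (3.46) reproduces β_1 = s_{xy}/s_x^2 and β_0 = ȳ − β_1 x̄.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=30)
y = 0.5 * x + rng.normal(size=30)

s_xx = x.var()                                   # variance, 1/n convention
s_xy = ((x - x.mean()) * (y - y.mean())).mean()  # covariance, 1/n convention
X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)         # (3.46)
assert np.allclose(beta, [y.mean() - s_xy / s_xx * x.mean(), s_xy / s_xx])
```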

4. Foundations of Probability

4.1 Events and probability

An event is the result of an observation or experiment, for which we can clearly decide whether or not it occurs. When the occurrence cannot be predicted with total certainty, we are interested in how likely it is to occur. A probability is a scale measuring this likelihood by means of a real number between 0 and 1.

To be slightly more precise, we need the notion of a sample point, that is, an outcome of an observation or experiment which is indivisible and primary. Collecting all sample points, we form a sample space, often denoted by Ω. An event A is then understood as a subset of Ω, namely, a set of sample points. The probability that an event A occurs is denoted by P(A).

Example 4.1 (Coin tossing). In coin tossing we observe two possibilities, heads or tails. By convention we use the numbers 1 and 0 for heads and tails, respectively. Then the sample space of a coin toss becomes

\[ \Omega = \{0, 1\}. \]

Since there are four subsets of Ω, we have four events for coin tossing:

\[ \emptyset, \quad \{0\}, \quad \{1\}, \quad \Omega = \{0, 1\}. \]

In general, ∅ stands for an event containing no sample point and is called the empty event or null event, while the sample space Ω itself is an event called the whole event. Since ∅ never occurs and Ω occurs with total certainty, we have

\[ P(\emptyset) = 0, \qquad P(\Omega) = 1. \]

Assuming that the coin is fair, we understand by symmetry that the probabilities of heads and tails are equal. Hence we have

\[ P(\{0\}) = P(\{1\}) = \frac{1}{2}. \tag{4.1} \]

Remark 4.2. An event consisting of a single sample point is called an elementary event. Strictly speaking, a sample point ω ∈ Ω and an elementary event {ω} ⊂ Ω are conceptually different, since a probability is assigned to an elementary event but not to a sample point. Nevertheless, we occasionally write P({ω}) = P(ω) for simplicity of notation.

Example 4.3 (Dice rolling). For rolling a die (with six sides) we may set

\[ \Omega = \{1, 2, 3, 4, 5, 6\}. \]

For example, rolling a 1 is an elementary event denoted by {1}, and rolling an even value is an event denoted by {2, 4, 6}. By symmetry we have

\[ P(\{1\}) = P(\{2\}) = P(\{3\}) = P(\{4\}) = P(\{5\}) = P(\{6\}) = \frac{1}{6} \tag{4.2} \]

and

\[ P(\{2, 4, 6\}) = \frac{1}{2}. \]

Apparently it is not convenient to list the probabilities of all 2^6 events for dice rolling. The simple formula

\[ P(A) = \frac{|A|}{6}, \]

where |A| stands for the number of sample points of A, is much more essential.

We find a common idea in (4.1) and (4.2), where every elementary event is given an equal probability. In general, consider a sample space Ω with finitely many sample points. If every elementary event occurs equally likely, the probability of an event A ⊂ Ω is given by the proportion:

\[ P(A) = \frac{|A|}{|\Omega|}, \tag{4.3} \]

where |A| denotes the number of sample points in A. The formula (4.3), tracing back to the very early stage of probability theory and formulated explicitly by Laplace, is our starting point of combinatorial probability.

The essence of (4.3) is that an event is represented as a set and the probability is given by the ratio of ''sizes'' of sets. In combinatorial probability the size of a set is given by the number of its elements, but we may employ another measure of ''size.'' Let Ω be a Euclidean domain and consider a trial that chooses a point of Ω randomly in such a way that every point is chosen equally likely. Then the probability that the point is chosen from a subset A ⊂ Ω is naturally defined by the same formula (4.3), where the ''size'' of a set is measured in terms of length, area or volume according to the dimension of the Euclidean space. This idea is sometimes referred to as geometric probability.

Example 4.4. Consider an interval Ω = [a, b] in the real line R with a < b. Choose a point of Ω randomly in such a way that every point of Ω is chosen equally likely. Let A be the event that the point is chosen from the interval [α, β], where a ≤ α ≤ β ≤ b, see Fig. 4.1. Then we have

\[ P(A) = \frac{|A|}{|\Omega|} = \frac{\beta - \alpha}{b - a}, \]

where |A| denotes the length of A. Note that the probability that a particular point (e.g., the mid-point of the interval) is chosen is zero.

Example 4.5. Let Ω be a disc of radius R > 0 and choose a point of Ω randomly in such a way that every point of Ω is chosen equally likely. Let A be a subset of Ω, see Fig. 4.2. Then the probability of the event that the point is chosen from A, denoted by the same symbol, is given by

\[ P(A) = \frac{|A|}{|\Omega|} = \frac{|A|}{\pi R^2}, \]

where |A| denotes the area of A. Strictly speaking, the subset A cannot be arbitrary; we need to restrict ourselves to measurable sets, see Remark 4.16.
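Geometric probability lends itself to a simple Monte Carlo illustration. The sketch below (assuming NumPy; the choice of A as the concentric disc of radius R/2 is ours) estimates P(A) = |A|/(πR²) = 1/4 by sampling points uniformly from the disc.

```python
import numpy as np

rng = np.random.default_rng(3)
R, n = 1.0, 100_000
# Rejection sampling: uniform points in the bounding square, kept if
# they fall inside the disc of radius R.
pts = rng.uniform(-R, R, size=(2 * n, 2))
pts = pts[np.hypot(pts[:, 0], pts[:, 1]) <= R][:n]

p_hat = np.mean(np.hypot(pts[:, 0], pts[:, 1]) <= R / 2)
print(p_hat)   # close to the exact value |A| / (pi R^2) = 1/4
```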

Fig. 4.1. Choosing a point from an interval [a, b].

Fig. 4.2. Choosing a point from a disc with radius R > 0.


4.2 Probability spaces

Now we mention the mathematical formulation of probability due to Kolmogorov. We start with a non-empty set Ω as a sample space. An event is a subset of Ω. However, we do not require that every subset of Ω be an event; instead, we impose some conditions on the collection of events.

We need some operations on events. The complement of an event A, denoted by A^c, is the event that A does not occur. For two events A and B, the event that at least one of A or B occurs is called the union and is denoted by A ∪ B. The event that both A and B occur is called the intersection and is denoted by A ∩ B. Two events A and B are called disjoint if A ∩ B = ∅.

Remark 4.6. There are some variants of notation in the literature. The complement A^c is also denoted by Ā. The union A ∪ B may be called the sum. The intersection A ∩ B is also called the product and is denoted by AB.

Definition 4.7 (σ-field). A collection F of subsets of Ω is called a σ-field if the following conditions are satisfied:
(i) ∅ ∈ F and Ω ∈ F;
(ii) if A ∈ F, then the complement A^c ∈ F;
(iii) if A_1, A_2, ⋯ ∈ F, then the union ∪_{i=1}^∞ A_i ∈ F.

A σ-field is closed under countable intersection too, namely, for A_1, A_2, ⋯ ∈ F the intersection ∩_{i=1}^∞ A_i belongs to F. It is a consequence of the definition that a σ-field is also closed under finite union and intersection (Exercise 4.10). It is essential that the events form a σ-field.

Definition 4.8. Let Ω be a non-empty set and F a σ-field over Ω. A probability is a map (or function) P : F → R satisfying the following properties:
(i) 0 ≤ P(A) ≤ 1 for all A ∈ F;
(ii) P(Ω) = 1;
(iii) if A_1, A_2, ⋯ ∈ F are mutually disjoint, i.e., A_i ∩ A_j = ∅ for i ≠ j, then

\[ P\left( \bigcup_{i=1}^\infty A_i \right) = \sum_{i=1}^\infty P(A_i). \]

Definition 4.9. A probability space is a triple (Ω, F, P), where Ω is a non-empty set, F is a σ-field over Ω, and P is a probability defined on F.

Once we are given a probability space (Ω, F, P), the set Ω is called a sample space and an element ω ∈ Ω a sample point. A subset A of Ω which belongs to F is called an event; thus F is the space of events. Finally, P(A) means the probability that an event A occurs.

An event A is called almost sure if P(A) = 1. By definition the whole event Ω is almost sure, but there may be many other almost sure events. An event A is called almost impossible if P(A) = 0. We know that the empty event ∅ is almost impossible, but there may be many other almost impossible events, see Examples 4.4 and 4.5.

Exercise 4.10. Let (Ω, F, P) be a probability space. Show that if A and B are events, so are A ∩ B and A ∪ B.

Exercise 4.11. Consider the experiment in Example 4.3 of rolling a die. Set A = {2, 4, 6}, B = {1, 3, 5} and C = {1, 2, 3, 4}. Show that {Ω, ∅, A, B} is a σ-field but {Ω, ∅, A} and {Ω, ∅, A, B, C} are not.

Exercise 4.12. Let (Ω, F, P) be a probability space. Prove the following assertions, where A and B are events.
(1) P(∅) = 0.
(2) If A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B).
(3) P(A^c) = 1 − P(A), where A^c is the complement.
(4) If A ⊂ B, then P(A) ≤ P(B).
(5) P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Fig. 4.3. Complement A^c, union A ∪ B and intersection A ∩ B (from left to right).

Exercise 4.13. Let (Ω, F, P) be a probability space and let A be an almost impossible event. Prove that P(A ∪ B) = P(B) for any event B.

Exercise 4.14. An experiment consists of tossing two dice.
(1) Determine the probability space (Ω, F, P).
(2) Find the event A that the sum of the dots on the dice equals 8 and the probability P(A).
(3) Find the event B that the sum of the dots on the dice is greater than 10 and the probability P(B).
(4) Find the event C that the sum of the dots on the dice is greater than 12 and the probability P(C).

Exercise 4.15. Let Ω = {ω_1, ω_2, …} be a countable set and F the set of all subsets of Ω. Let p_1, p_2, … be a sequence such that p_i ≥ 0 and Σ_{i=1}^∞ p_i = 1. Define P : F → R by

\[ P(A) = \sum_{i : \omega_i \in A} p_i = \sum_{i=1}^\infty p_i 1_A(\omega_i), \]

where 1_A(ω) is the indicator function defined by

\[ 1_A(\omega) = \begin{cases} 1, & \omega \in A, \\ 0, & \text{otherwise.} \end{cases} \]

Show that (Ω, F, P) is a probability space and determine the almost sure events.

Remark 4.16. In combinatorial probability, Ω is a finite set and every subset of Ω is an event. In Exercise 4.15, Ω being an infinite set, every subset of Ω is still an event. The idea of a σ-field becomes essential in Examples 4.4 and 4.5. In fact, it is known that there is a subset of the real line whose length is not determined, and likewise a subset of the plane whose area is not determined. Hence, in order to define the probability by length or area, we need to avoid those pathological sets and restrict ourselves to measurable sets that admit a length or area. It is a basic consequence of measure theory that the length or the area is well defined for all Borel sets, which form the minimal σ-field containing all open subsets; for more details see the standard textbooks.

The above concept of probability is quite abstract and needs to be related to experiments in the real world. Traditionally, probability was discussed in the context of strictly controlled experiments that can be repeated under identical conditions as many times as we like. Suppose that such an experiment is repeated n times. If an event A occurs n(A) times, then the relative frequency of A is defined by

\[ p_n(A) = \frac{n(A)}{n}. \]

Obviously p_n(A) is not uniquely determined by n. If the limit

\[ P(A) = \lim_{n \to \infty} p_n(A) = \lim_{n \to \infty} \frac{n(A)}{n} \tag{4.4} \]

exists, we accept P(A) as the probability of A. Note that the above limit may not exist and, in addition, there are many situations in which the concept of repeatability is not valid.

In line with the definition of a probability space, the relative frequency of A possesses the following properties:
(i) 0 ≤ p_n(A) ≤ 1, where p_n(A) = 0 if A occurs in none of the n repeated trials and p_n(A) = 1 if A occurs in all of the n repeated trials.
(ii) If A and B are mutually exclusive (or disjoint) events, then

\[ p_n(A \cup B) = p_n(A) + p_n(B) \]

and

\[ P(A \cup B) = \lim_{n \to \infty} \frac{n(A \cup B)}{n} = \lim_{n \to \infty} \frac{n(A)}{n} + \lim_{n \to \infty} \frac{n(B)}{n} = P(A) + P(B). \]

Starting from a probability space, the basic limit formula (4.4) is proved by a somewhat sophisticated argument of probability theory. The result is known as the law of large numbers (LLN); for details see the standard textbooks.
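The convergence p_n(A) → P(A) is easy to observe empirically; a minimal simulation sketch (assuming NumPy) for the event A = ''heads'' in fair coin tossing:

```python
import numpy as np

rng = np.random.default_rng(4)
tosses = rng.integers(0, 2, size=100_000)   # 1 = heads, 0 = tails
for n in (10, 100, 1_000, 100_000):
    print(n, tosses[:n].mean())             # relative frequency p_n(A) -> 1/2
```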

Before closing this subsection, we mention the basic concept of independent events.

Definition 4.17. A family of events {A_λ ; λ ∈ Λ} is called independent if any finitely many events A_{λ_1}, …, A_{λ_n} chosen from the family satisfy

\[ P\left( \bigcap_{i=1}^n A_{\lambda_i} \right) = \prod_{i=1}^n P(A_{\lambda_i}). \]

In particular, two events A and B are independent if

\[ P(A \cap B) = P(A)P(B). \]

Thus three events A, B and C are independent if and only if

\[ P(A \cap B \cap C) = P(A)P(B)P(C), \quad P(A \cap B) = P(A)P(B), \quad P(B \cap C) = P(B)P(C), \quad P(C \cap A) = P(C)P(A). \]

Exercise 4.18. Show that if three events A, B and C are independent, then A and B ∪ C are independent.

Exercise 4.19. Let A and B be two events that are disjoint and independent. Show that P(A) = 0 or P(B) = 0. Thus, two disjoint events are never independent except in this trivial case.

Exercise 4.20. In the experiment of rolling two fair dice, let A be the event that the first die is odd, B the event that the second die is odd, and C the event that the sum is odd. Show that the three events A, B and C are pairwise independent, but A, B and C are not independent.

4.3 Conditional probability

Let (Ω, F, P) be a probability space. For two events A, B ∈ F the conditional probability of A relative to B, or under condition B, is defined by

\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \tag{4.5} \]

whenever P(B) > 0. The conditional probability P(A | B) is interpreted as the probability that an event A occurs provided that we already know the occurrence of another event B. Similarly, if P(A) > 0,

\[ P(B \mid A) = \frac{P(A \cap B)}{P(A)}. \tag{4.6} \]

From (4.5) and (4.6) we immediately obtain the following.

Theorem 4.21 (Multiplicative rule). Let A and B be two events. If P(B) > 0, we have

\[ P(A \cap B) = P(B)P(A \mid B). \tag{4.7} \]

Similarly, if P(A) > 0, we have

\[ P(A \cap B) = P(A)P(B \mid A). \tag{4.8} \]

In computing the joint probability of events, the useful relation

\[ P(A \cap B) = P(A \mid B)P(B) = P(B \mid A)P(A) \]

will be applied without explanation.

Theorem 4.22. Let A and B be two events with P(B) > 0. Then A and B are independent if and only if P(A | B) = P(A).

Proof. By definition A and B are independent if and only if P(A ∩ B) = P(A)P(B). Then the assertion is straightforward from the definition (4.5). □

Thus, if A is independent of B, then the probability that A occurs is unchanged by information on whether or not B occurs.

Exercise 4.23. Show that if two events A and B are independent, so are A and B^c.

Exercise 4.24. Show the following relations, where A, B and C are events.
(1) P(A | A) = 1.
(2) P(A ∩ B | C) = P(B | C)P(A | B ∩ C).
(3) P(A ∩ B ∩ C) = P(A | B ∩ C)P(B | C)P(C).

Theorem 4.25 (Bayes' formula). Let (Ω, F, P) be a probability space. Let B_1, …, B_n be mutually disjoint events with P(B_i) > 0 such that

\[ \Omega = \bigcup_{i=1}^n B_i. \tag{4.9} \]

Then for any event A with P(A) > 0 we have

\[ P(B_j \mid A) = \frac{P(B_j)P(A \mid B_j)}{\displaystyle \sum_{i=1}^n P(B_i)P(A \mid B_i)}. \tag{4.10} \]

Proof. We first note that

\[ P(B_j \mid A) = \frac{P(A \cap B_j)}{P(A)}, \tag{4.11} \]

which holds by definition. It follows from (4.9) that

\[ A = \bigcup_{i=1}^n (A \cap B_i). \]

Since A ∩ B_1, …, A ∩ B_n are mutually disjoint, we have

\[ P(A) = \sum_{i=1}^n P(A \cap B_i). \tag{4.12} \]

On the other hand, for the two events A and B_i we have

\[ P(A \cap B_i) = P(B_i)P(A \mid B_i) \]

by the multiplicative rule. Then the numerator of the right-hand side of (4.11) becomes

\[ P(A \cap B_j) = P(B_j)P(A \mid B_j) \tag{4.13} \]

and the denominator becomes

\[ P(A) = \sum_{i=1}^n P(B_i)P(A \mid B_i). \tag{4.14} \]

Then the formula (4.10) is obtained by inserting (4.13) and (4.14) into (4.11). □

Example 4.26 (Diagnostic test). Consider an infectious disease in a certain population Ω. The population Ω is divided into two parts, infected D or healthy D^c. There is a test T for the disease and the result is positive or negative. Again the population Ω is divided into two parts, positive test result T_+ or negative test result T_-. We now consider Ω as a sample space and D, D^c, T_+ and T_- as events. The conditional probabilities P(T_+ | D) and P(T_- | D^c) are called the sensitivity and specificity of the test, respectively. Those values may be obtained in a laboratory. The conditional probability P(T_- | D) stands for the probability that a false negative occurs, while P(T_+ | D^c) stands for the probability that a false positive occurs. Obviously,

\[ P(T_+ \mid D) + P(T_- \mid D) = 1, \qquad P(T_+ \mid D^c) + P(T_- \mid D^c) = 1. \]

It follows from Bayes' formula that

\[ P(D \mid T_+) = \frac{P(D)P(T_+ \mid D)}{P(D)P(T_+ \mid D) + P(D^c)P(T_+ \mid D^c)} \tag{4.15} \]

and

\[ P(D^c \mid T_-) = \frac{P(D^c)P(T_- \mid D^c)}{P(D^c)P(T_- \mid D^c) + P(D)P(T_- \mid D)}. \]

The former is called the predictive value of a positive test and the latter that of a negative test.

For illustration, we set P(T_+ | D) = 0.7, P(T_- | D^c) = 0.99 and P(D) = d. Note that P(D) is difficult to know in practical applications. Substituting these probabilities into (4.15), we obtain the predictive value of a positive test:

\[ P(D \mid T_+) = \frac{0.7d}{0.7d + 0.01(1 - d)} = \frac{70d}{1 + 69d}, \quad 0 \le d \le 1. \tag{4.16} \]

Note that P(D | T_+) varies from 0 to 1. If P(D | T_+) is close to 1, the test is effective from the point of view of medical treatment. Thus, the formula (4.16) is useful for judging the performance of the test, particularly when d is small, see Fig. 4.4.
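A minimal sketch of (4.15)–(4.16) in Python (the function name is ours, chosen for illustration); it makes visible how small the predictive value is when the prevalence d is small, even for a fairly specific test.

```python
def positive_predictive_value(sens, spec, d):
    """P(D | T+) by Bayes' formula, cf. (4.15)."""
    return sens * d / (sens * d + (1.0 - spec) * (1.0 - d))

# Sensitivity 0.7 and specificity 0.99, as in the text; cf. (4.16).
for d in (0.001, 0.01, 0.1, 0.5):
    print(d, positive_predictive_value(0.7, 0.99, d))  # equals 70d/(1+69d)
```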

Exercise 4.27. There are 10 lottery tickets with serial numbers from 1 to 10. The two tickets with numbers 1 and 2 are winning ones. A boy got 4 tickets.
(1) The boy says that he has the ticket of number 1. Find the probability that there is a winning ticket among the remaining 6 tickets.
(2) The boy says that he has at least one winning ticket. Find the probability that there is a winning ticket among the remaining 6 tickets.
The results may be counterintuitive.

Exercise 4.28. There are 100 patients in a hospital with a certain disease. Of these, 10 are selected to undergo a drug treatment that increases the cure rate from 50% to 80%. Find the probability that a patient received the drug treatment, given that the patient is known to be cured.

5. Random Variables

5.1 Random variables and their distributions

A random variable is, intuitively, a variable whose values appear according to a certain probability law. A typical example appears in random sampling. Consider a variable whose values are obtained by measuring samples chosen randomly from a population. Then the variable obeys a certain probability law arising from random sampling, so it is a random variable.

To be slightly more precise, a random variable is a variable X for which we may ask the probability P(X ≤ x) that X takes values less than or equal to x ∈ R. However, for the logical validity of P(X ≤ x) we need to prepare a probability space (Ω, F, P) before introducing a random variable. In the above-mentioned example of random sampling, we set Ω to be the population and define the probability by P(A) = |A|/|Ω| as in combinatorial probability. Our variable X gives a definite value for each individual ω. In other words, X : Ω → R is a function. Then P(X ≤ x) is defined by

\[ P(X \le x) = \frac{|\{X \le x\}|}{|\Omega|}, \quad x \in \mathbb{R}, \tag{5.1} \]

where {X ≤ x} is short-hand notation for {ω ∈ Ω ; X(ω) ≤ x}. Abstracting the above argument, we give the following formal definition.

Definition 5.1. Let (Ω, F, P) be a probability space. A function X : Ω → R is called a random variable if {X ≤ x} = {ω ∈ Ω ; X(ω) ≤ x} is an event in F for all x ∈ R. Moreover, the function

\[ F(x) = F_X(x) = P(X \le x), \quad x \in \mathbb{R}, \tag{5.2} \]

is called the distribution function of X.

Example 5.2. Tossing a coin, we set X = 1 if heads occurs and X = 0 if tails occurs. Then X becomes a random variable such that

\[ P(X = 0) = P(X = 1) = \frac{1}{2}. \]

The distribution function is given by

\[ F_X(x) = \begin{cases} 0, & x < 0, \\ 1/2, & 0 \le x < 1, \\ 1, & x \ge 1. \end{cases} \]

Rolling dice is similar; the investigation is left to the reader in the following exercise.

Fig. 4.4. Graph of P(D | T_+) for 0 ≤ d ≤ 1.

Exercise 5.3. Consider the experiment of rolling a fair die. Let X be the random variable which takes the value 1 if the number that appears is even and 0 if it is odd. Find P(X = 1) and P(X = 0).

Exercise 5.4. Consider the experiment of tossing a coin three times. Let X be the number of heads obtained. We assume that the tosses are independent and that the probability of a head is p. Find the probabilities P(X = 0), P(X = 1), P(X = 2) and P(X = 3).

Exercise 5.5. Suppose that a fair die is rolled seven times. Find the probability that 1 and 2 dots appear twice each; 3, 4 and 5 dots once each; and 6 dots not at all.

Example 5.6. Let Ω be the set of players of Team A. Let X be the height of a player randomly chosen from Ω. Then the distribution function of X is given by

\[ F_X(x) = P(X \le x) = \frac{|\{\omega \in \Omega \,;\, X(\omega) \le x\}|}{|\Omega|}. \tag{5.3} \]

This is essentially the cumulative relative frequency of the heights of Team A, see Sect. 2.1.

Theorem 5.7. Let X be a random variable and F(x) = F_X(x) its distribution function. Then we have
(i) lim_{x → −∞} F(x) = 0 and lim_{x → +∞} F(x) = 1;
(ii) if x_1 ≤ x_2, then F(x_1) ≤ F(x_2);
(iii) lim_{ε → +0} F(x + ε) = F(x), namely, F(x) is right-continuous.

Theorem 5.8. Let X be a random variable and F(x) = F_X(x) its distribution function. Then we have

\[ P(X = x) = F_X(x) - \lim_{\varepsilon \to +0} F_X(x - \varepsilon), \quad x \in \mathbb{R}. \]

For the proofs see the standard textbooks. We only mention here that countable operations on sets are required in the proofs.

Exercise 5.9. Verify the properties (i)–(iii) in Theorem 5.7 for the distribution function in Example 5.2.

Definition 5.10. A random variable X is called discrete if the distribution function F_X(x) increases only by jumps. A random variable X is called continuous if the distribution function F_X(x) is continuous. (Note that there are random variables that are neither discrete nor continuous.)

For a discrete random variable X the jump points of F_X(x) are at most countable, say a_1, a_2, …. The jump at x = a_i is denoted by p_i > 0. Then we have

\[ p_i = P(X = a_i) = F_X(a_i) - \lim_{\varepsilon \to +0} F_X(a_i - \varepsilon), \qquad \sum_i p_i = 1. \]

Thus, with a discrete random variable X we may associate the possible values a_i and their probabilities p_i. It is convenient to allow p_i = 0 (in that case a_i is not a possible value, though). The random variables in Examples 5.2 and 5.6 are discrete.

For a continuous random variable X we have P(X = x) = 0 for all x. This is an immediate consequence of Theorem 5.8. We now understand why we needed to consider the probability of the events {X ≤ x} instead of {X = x} when introducing a random variable.

Example 5.11. Let X be the coordinate of a point chosen from an interval Ω = [0, L], L > 0, in such a way that every point of Ω is chosen equally likely, see Example 4.4. We are interested in the distribution function F_X(x). Since the event {X ≤ x} never occurs if x < 0, we have F_X(x) = 0 for x < 0. Since the event {X ≤ x} certainly occurs if x > L, we have F_X(x) = 1 for x > L. For 0 ≤ x ≤ L we have

\[ P(X \le x) = \frac{|[0, x]|}{|[0, L]|} = \frac{x}{L}. \]

Consequently, we have

\[ F_X(x) = \begin{cases} 0, & x < 0, \\ x/L, & 0 \le x \le L, \\ 1, & x > L. \end{cases} \tag{5.4} \]

Since F_X(x) is obviously a continuous function, the random variable X is continuous.

Example 5.12. Cutting a stick of length L at a randomly chosen point, we obtain two fragments. We are interested in the length of the shorter fragment, which is denoted by S. The stick is modeled by an interval Ω = [0, L] in the real line, and let X denote the coordinate of the randomly chosen point. Then X becomes a random variable as discussed in Example 5.11. Since {S ≤ x} never occurs for x < 0 and {S ≤ x} certainly occurs if x > L/2, we have F_S(x) = 0 for x < 0 and F_S(x) = 1 for x > L/2. Suppose that 0 ≤ x ≤ L/2. Then we have

\[ P(S \le x) = P(0 \le X \le x) + P(L - x \le X \le L) = \frac{x}{L} + \frac{x}{L} = \frac{2x}{L}. \]

Summing up, we have

\[ F_S(x) = \begin{cases} 0, & x < 0, \\ 2x/L, & 0 \le x \le L/2, \\ 1, & x > L/2. \end{cases} \tag{5.5} \]

Thus, S is a continuous random variable. We may derive the distribution function F_S(x) alternatively by using S = min{X, L − X}.
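The alternative description S = min{X, L − X} suggests a quick simulation check; the sketch below (assuming NumPy) compares the empirical value of F_S(x) with 2x/L.

```python
import numpy as np

rng = np.random.default_rng(5)
L, n = 1.0, 100_000
X = rng.uniform(0, L, size=n)       # random cut point, uniform on [0, L]
S = np.minimum(X, L - X)            # length of the shorter fragment

x = 0.2
print(np.mean(S <= x), 2 * x / L)   # empirical F_S(x) vs the exact 2x/L
```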

Example 5.13. Let Ω be a disc of radius R > 0. Choose a point randomly from Ω and let X be the distance between the chosen point and the center of the disc. Then X becomes a random variable. Obviously, F_X(x) = 0 for x < 0 and F_X(x) = 1 for x > R. Suppose that 0 ≤ x ≤ R. Since the event {X ≤ x} corresponds to the concentric disc with radius x, we have

\[ P(X \le x) = \frac{\pi x^2}{\pi R^2} = \frac{x^2}{R^2}, \]

where the probability is calculated as in Example 4.5. Consequently, we have

\[ F_X(x) = \begin{cases} 0, & x < 0, \\ x^2/R^2, & 0 \le x \le R, \\ 1, & x > R. \end{cases} \tag{5.6} \]

Thus, X is a continuous random variable.

In general, the distribution function F_X(x) of a continuous random variable X is continuous by definition but not necessarily differentiable. If F_X(x) is piecewise differentiable, the derivative

\[ f_X(x) = F_X'(x) = \frac{d}{dx} F_X(x) \]

is called the (probability) density function of X. It then follows from the fundamental theorem of calculus that

\[ P(X \le x) = F_X(x) = \int_{-\infty}^x f_X(t)\,dt \]

and

\[ P(a \le X \le b) = \int_a^b f_X(x)\,dx, \quad a < b. \]

Since the density function gives a probability only through integration, if F_X(x) is not differentiable at x = a, the value of f_X(x) at x = a may be assigned arbitrarily. Continuous random variables with density functions, as well as discrete random variables, cover quite a wide range of applications.

Exercise 5.14. Define a function F(x) by

\[ F(x) = \begin{cases} 0, & x < 0, \\ x + \dfrac{1}{2}, & 0 \le x \le \dfrac{1}{2}, \\ 1, & x \ge \dfrac{1}{2}. \end{cases} \]

Verify the properties (i)–(iii) in Theorem 5.7.

Exercise 5.15. Let X be a random variable whose distribution function is given by F(x) as described in Exercise 5.14. Find the following probabilities:

\[ P\left( X \le \frac{1}{4} \right), \quad P\left( 0 < X \le \frac{1}{4} \right), \quad P(X = 0). \]

(Note that F(x) is not continuous at x = 0.)

Exercise 5.16. Determine the constants a and b such that

\[ F(x) = \begin{cases} 1 - a e^{-x/b}, & x \ge 0, \\ 0, & x < 0, \end{cases} \]

is the distribution function of a random variable.

Definition 5.17. The mean or expectation of a random variable X is defined by

\[ E[X] = \mu_X = \int_\Omega X(\omega)\,P(d\omega), \]

where the right-hand side is the so-called Lebesgue integral.

For practical problems we consider two cases. For a discrete random variable X with possible values a_1, a_2, … the mean becomes

\[ E[X] = \mu_X = \sum_i a_i P(X = a_i) = \sum_x x P(X = x). \]

In the rightmost expression, which is just a convention, the sum is taken over all real numbers x; but in fact, since P(X = x) = 0 except for at most countably many x = a_i, the expression reduces to a usual sum. For a continuous random variable with density function f_X(x) the mean becomes

\[ E[X] = \mu_X = \int_{-\infty}^{+\infty} x f_X(x)\,dx. \]

Many important statistics of a random variable X are defined in terms of the mean. For example, the variance of X is defined by

\[ V[X] = \sigma_X^2 = E[(X - E[X])^2] = E[X^2] - E[X]^2. \]

Moreover, the central moment of degree k is defined by

\[ m_k[X] = E[(X - E[X])^k]. \]

Example 5.18. Let X be the random variable introduced in Example 5.11. It is a continuous random variable since the distribution function F_X(x) in (5.4) is continuous. The density function f_X(x) is obtained by differentiating F_X(x) as follows:

\[ f_X(x) = \begin{cases} 1/L, & 0 \le x \le L, \\ 0, & \text{otherwise.} \end{cases} \tag{5.7} \]

Then the mean of X is given by

\[ E[X] = \int_{-\infty}^{+\infty} x f_X(x)\,dx = \int_0^L x \frac{1}{L}\,dx = \frac{L}{2}. \]

Similarly, we have

\[ E[X^2] = \int_{-\infty}^{+\infty} x^2 f_X(x)\,dx = \int_0^L x^2 \frac{1}{L}\,dx = \frac{L^2}{3}. \]

Hence the variance is given by

\[ V[X] = E[X^2] - E[X]^2 = \frac{L^2}{3} - \left( \frac{L}{2} \right)^2 = \frac{L^2}{12}. \]

The probability distribution defined by the density function (5.7) is called the uniform distribution on [0, L]. Accordingly, the random variable S introduced in Example 5.12 obeys the uniform distribution on [0, L/2].
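A quick simulation check of these values (assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(6)
L = 2.0
X = rng.uniform(0, L, size=1_000_000)   # uniform distribution on [0, L]
print(X.mean(), L / 2)                  # approx E[X] = L/2
print(X.var(), L**2 / 12)               # approx V[X] = L^2/12
```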

Example 5.19. For μ ∈ R and σ > 0, the normal distribution N(μ, σ²) is defined by the density function:

\[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right), \]

see also Example 2.19. In particular, N(0, 1) is called the standard normal distribution. If a random variable X obeys N(0, 1), the distribution function is given by

\[ F_X(x) = P(X \le x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^x e^{-t^2/2}\,dt. \tag{5.8} \]

It is noted that the right-hand side is not expressible in terms of elementary functions. Instead, we define the (Gauss) error function by

\[ \mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\,dt. \]

Then (5.8) becomes

\[ F_X(x) = \frac{1}{2} \left( 1 + \mathrm{erf}\left( \frac{x}{\sqrt{2}} \right) \right). \]
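Since the error function is available in the Python standard library, the closed form above is easy to evaluate; a minimal sketch:

```python
import math

def std_normal_cdf(x):
    """F_X(x) = (1 + erf(x / sqrt(2))) / 2 for the standard normal N(0, 1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

print(std_normal_cdf(0.0))    # 0.5 by symmetry
print(std_normal_cdf(1.96))   # about 0.975
```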

Exercise 5.20. Let X be a discrete random variable such that

\[ P(X = -1) = P(X = 0) = P(X = 1) = \frac{1}{3}. \]

Find the mean and variance of X.

Exercise 5.21. Let X be a continuous random variable whose density function is given by

\[ f_X(x) = \begin{cases} 2x, & 0 < x < 1, \\ 0, & \text{otherwise.} \end{cases} \]

Find the mean and variance of X.

Exercise 5.22. Let X be the random variable introduced in Example 5.13. Find the density function of X and show that E[X] = 2R/3 and V[X] = R²/18.

Exercise 5.23. Prove that the moment of degree 2m of the standard normal distribution N(0, 1) is given by

\[ \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} x^{2m} e^{-x^2/2}\,dx = \frac{(2m)!}{2^m m!}, \quad m = 1, 2, \ldots. \]

5.2 Joint distributions

Let X_1, X_2, …, X_n be random variables defined on a probability space (Ω, F, P). There are two points of view. One is to regard them as a sequence of random variables; this is suitable for the study of asymptotic properties and limit behavior. The other is to regard them as a random vector (X_1, X_2, …, X_n) in n-dimensional space. Since the essence is the same, we switch between the notations as convenient. The statistics of finitely many random variables X_1, X_2, …, X_n are described by the joint distribution function defined by

\[ F_{X_1 X_2 \ldots X_n}(x_1, x_2, \ldots, x_n) = P(X_1 \le x_1, X_2 \le x_2, \ldots, X_n \le x_n), \quad x_1, x_2, \ldots, x_n \in \mathbb{R}, \]

where the right-hand side is the probability of the product event ∩_{i=1}^n {X_i ≤ x_i}.

If X_1, …, X_n are discrete random variables, it is sufficient and more convenient to deal with the joint probability of the form

\[ P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n), \]

where x_1, x_2, …, x_n run over all possible values of X_1, X_2, …, X_n, respectively. In that case the random points (X_1, X_2, …, X_n) are scattered in n-dimensional space in a discrete manner. We are also interested in a particular type of continuous random vector, where the joint distribution function is given by the integral:

\[ F_{X_1 X_2 \ldots X_n}(x_1, x_2, \ldots, x_n) = \int_{-\infty}^{x_1} dt_1 \int_{-\infty}^{x_2} dt_2 \cdots \int_{-\infty}^{x_n} dt_n \, f(t_1, t_2, \ldots, t_n) \]

for x_1, x_2, …, x_n ∈ R. In that case, the integrand f(x_1, x_2, …, x_n) is called the joint density function of X_1, X_2, …, X_n and is denoted by f_{X_1 X_2 … X_n}(x_1, x_2, …, x_n).

Exercise 5.24. Consider the experiment of tossing a fair coin twice. Let (X, Y) be a 2-dimensional random vector, where X is the number of heads that occur in the two tosses and Y is the number of tails that occur in the two tosses. Find P(X = 2, Y = 0), P(X = 0, Y = 1) and P(X = 1, Y = 1).

Let (X_1, …, X_n) be an n-dimensional random vector such that each X_j is a discrete random variable. Then we have

\[ P(X_1 = x) = \sum_{x_2, \ldots, x_n} P(X_1 = x, X_2 = x_2, \ldots, X_n = x_n) \]

and the mean of X_1 is given by

\[ \mu_{X_1} = E[X_1] = \sum_x x P(X_1 = x) = \sum_{x_1, x_2, \ldots, x_n} x_1 P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n). \]

Similarly,

\[ \mu_{X_j} = E[X_j] = \sum_{x_1, x_2, \ldots, x_n} x_j P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n). \]

If (X_1, …, X_n) admits a joint density function f_{X_1 X_2 … X_n}(x_1, x_2, …, x_n), we have

\[ f_{X_1}(x) = \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} f_{X_1 X_2 \ldots X_n}(x, x_2, \ldots, x_n)\,dx_2 \cdots dx_n, \]

and the mean of X_1 is given by

\[ \mu_{X_1} = E[X_1] = \int_{-\infty}^{+\infty} x f_{X_1}(x)\,dx = \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} x_1 f_{X_1 X_2 \ldots X_n}(x_1, x_2, \ldots, x_n)\,dx_1 dx_2 \cdots dx_n. \]

Similarly,

\[ \mu_{X_j} = E[X_j] = \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} x_j f_{X_1 X_2 \ldots X_n}(x_1, x_2, \ldots, x_n)\,dx_1 dx_2 \cdots dx_n. \]

Moreover, higher-order statistics are defined by means of E[X_1^{p_1} ⋯ X_n^{p_n}]. For example, E[X_i^2] is a moment of 2nd order and E[X_i X_j] is a mixed moment of 2nd order. For discrete random variables we have

\[ E[X_j X_k] = \sum_{x_1, x_2, \ldots, x_n} x_j x_k P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n), \]

and for continuous random variables with a joint density function we have

\[ E[X_j X_k] = \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} x_j x_k f_{X_1 X_2 \ldots X_n}(x_1, x_2, \ldots, x_n)\,dx_1 dx_2 \cdots dx_n. \]

Definition 5.25. The covariance of two random variables X and Y is defined by

\[ \sigma_{XY} = \mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]. \tag{5.9} \]

The correlation coefficient of X and Y is defined by

\[ \rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{V[X]} \sqrt{V[Y]}} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}, \tag{5.10} \]

where σ_X = √V[X] and σ_Y = √V[Y] are the standard deviations of X and Y, respectively.

For random variables X_1, …, X_n, the matrix Σ with

\[ \Sigma = [\sigma_{jk}], \qquad \sigma_{jj} = \sigma_{X_j}^2 = V[X_j], \qquad \sigma_{jk} = \sigma_{X_j X_k} = \mathrm{Cov}(X_j, X_k) \]

is called the variance-covariance matrix.

Definition 5.26. We say that random variables X_1, X_2, …, X_n are independent if the joint distribution function factorizes as

\[ P(X_1 \le x_1, X_2 \le x_2, \ldots, X_n \le x_n) = \prod_{i=1}^n P(X_i \le x_i), \]

or equivalently,

\[ F_{X_1 X_2 \ldots X_n}(x_1, x_2, \ldots, x_n) = \prod_{i=1}^n F_{X_i}(x_i). \]

It is proved from the definition that discrete random variables X_1, …, X_n are independent if and only if

\[ P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = \prod_{i=1}^n P(X_i = x_i) \]

for all x_1, x_2, …, x_n ∈ R. Random variables X_1, …, X_n with joint density function f_{X_1 X_2 … X_n}(x_1, x_2, …, x_n) are independent if and only if the joint density function factorizes as

\[ f_{X_1 X_2 \ldots X_n}(x_1, x_2, \ldots, x_n) = \prod_{i=1}^n f_{X_i}(x_i), \]

where f_{X_i}(x_i) is the density function of X_i.

Remark 5.27. Definition 5.26 applies to an arbitrary family of random variables. A family of random variables {X_λ ; λ ∈ Λ} is called independent if any finitely many random variables X_{λ_1}, …, X_{λ_n} chosen from the family are independent in the sense of Definition 5.26.

Remark 5.28. By definition two random variables X and Y are independent if P(X ≤ x, Y ≤ y) = P(X ≤ x)P(Y ≤ y) for all x, y ∈ R. A family of random variables {X_λ ; λ ∈ Λ} is called pairwise independent if any two random variables X_{λ_1} and X_{λ_2}, λ_1 ≠ λ_2, chosen from the family are independent. Note that a pairwise independent family of random variables is not necessarily independent.

Exercise 5.29. For λ > 0 and μ > 0 let F(x, y) be a function defined by

\[ F(x, y) = \begin{cases} (1 - e^{-\lambda x})(1 - e^{-\mu y}), & x \ge 0,\ y \ge 0, \\ 0, & \text{otherwise.} \end{cases} \]

Prove that F(x, y) is the joint distribution function of a 2-dimensional random vector (X, Y). Then show that X and Y are independent.

Theorem 5.30. If two random variables X and Y are independent, we have E[XY] = E[X]E[Y] and Cov(X, Y) = 0.

Proof. Suppose that X and Y are discrete random variables. Since they are independent by assumption, we have the factorization P(X = x, Y = y) = P(X = x)P(Y = y). Then we have

\[ E[XY] = \sum_{x, y} xy P(X = x, Y = y) = \sum_{x, y} xy P(X = x) P(Y = y) = \sum_x x P(X = x) \sum_y y P(Y = y), \]

and hence

\[ E[XY] = E[X]E[Y]. \tag{5.11} \]

Suppose next that X and Y admit a joint density function f_{XY}(x, y). Since they are independent by assumption, we have f_{XY}(x, y) = f_X(x) f_Y(y). Then we have

\[ E[XY] = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} xy f_{XY}(x, y)\,dx\,dy = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} xy f_X(x) f_Y(y)\,dx\,dy = \int_{-\infty}^{+\infty} x f_X(x)\,dx \int_{-\infty}^{+\infty} y f_Y(y)\,dy, \]

and we come to (5.11). For a general pair of independent random variables X and Y we need the Lebesgue integral on a probability space and omit the proof; see the standard textbooks. Finally, it follows immediately from (5.11) that

\[ \mathrm{Cov}(X, Y) = E[XY] - E[X]E[Y] = E[X]E[Y] - E[X]E[Y] = 0, \]

as desired. □

Remark 5.31. Two random variables X and Y are called uncorrelated if Cov(X, Y) = 0. Theorem 5.30 says that independent random variables are uncorrelated. However, the converse is not true in general, see Exercise 5.32.

Exercise 5.32. Let Z_1 and Z_2 be independent random variables such that

\[ P(Z_1 = 1) = P(Z_2 = 1) = \frac{1}{2}; \]

in other words, Z_1 and Z_2 stand for tossing two coins. Set

\[ X = Z_1 + Z_2, \qquad Y = Z_1 - Z_2. \]

Show that X and Y are uncorrelated but are not independent.

Exercise 5.33. Let (X, Y) be a 2-dimensional random vector whose density function is given by

\[ f_{XY}(x, y) = \frac{x^2 + y^2}{4\pi} e^{-(x^2 + y^2)/2}. \]

Show that X and Y are uncorrelated but are not independent.

5.3 Regression curves

Let X and Y be two random variables. We identify (X, Y) with a random point in the xy-coordinate plane. First we consider the case where both X and Y are discrete. For x, y ∈ R the conditional probability

\[ P(Y = y \mid X = x) = \frac{P(X = x, Y = y)}{P(X = x)} \]

is defined whenever P(X = x) > 0. Note that

\[ \sum_y P(Y = y \mid X = x) = \frac{1}{P(X = x)} \sum_y P(X = x, Y = y) = \frac{1}{P(X = x)} \cdot P(X = x) = 1. \]

We may therefore regard P(Y = y | X = x) as a probability distribution concentrated on the vertical line with x-coordinate x in the xy-coordinate plane. Then the conditional expectation of Y under the condition X = x is defined by

\[ E[Y \mid X = x] = \sum_y y P(Y = y \mid X = x). \tag{5.12} \]

We thus obtain a function x ↦ E[Y | X = x], where x runs over those values in R for which P(X = x) > 0. This function gives rise to a discrete curve in the xy-coordinate plane, which is called the regression curve for Y subject to X.

We next consider the case where X and Y admit a joint density function f_{XY}(x, y). The conditional density function of Y under the condition X = x is defined by

\[ f_{Y|X}(y \mid x) = \frac{f_{XY}(x, y)}{f_X(x)} = \frac{f_{XY}(x, y)}{\displaystyle \int_{-\infty}^{+\infty} f_{XY}(x, y)\,dy}, \tag{5.13} \]

whenever the denominator is positive. Since

\[ \int_{-\infty}^{+\infty} f_{Y|X}(y \mid x)\,dy = \frac{1}{\displaystyle \int_{-\infty}^{+\infty} f_{XY}(x, y)\,dy} \int_{-\infty}^{+\infty} f_{XY}(x, y)\,dy = 1, \]

as in the discrete case we understand that f_{Y|X}(y | x) is a density function concentrated on the vertical line with x-coordinate x in the xy-coordinate plane. Then the conditional expectation of Y under the condition X = x is defined by

\[ E[Y \mid X = x] = \int_{-\infty}^{+\infty} y f_{Y|X}(y \mid x)\,dy. \tag{5.14} \]

We thus obtain a function x ↦ E[Y | X = x], where x runs over those values in R for which ∫_{-∞}^{+∞} f_{XY}(x, y) dy > 0. This function gives rise to a curve in the xy-coordinate plane, which is called the regression curve for Y subject to X.

Exercise 5.34. Let (X, Y) be a 2-dimensional random vector whose density function is given by

\[ f_{XY}(x, y) = \begin{cases} e^{-y}, & 0 < x \le y, \\ 0, & \text{otherwise.} \end{cases} \]

Find the conditional density function of Y under the condition X = x. Then calculate E[Y | X = x].

Exercise 5.35 (Bayes' formula for continuous random variables). Let X, Y be random variables with a joint density function f_{XY}(x, y). Prove that

\[ f_{Y|X}(y \mid x) = \frac{f_{X|Y}(x \mid y) f_Y(y)}{\displaystyle \int_{-\infty}^{+\infty} f_{X|Y}(x \mid y) f_Y(y)\,dy}. \]

5.4 Two-dimensional normal distributions

Let μ = [μ_j] be an n-dimensional column vector and Σ = [σ_{jk}] a strictly positive definite n × n matrix. By definition ⟨x, Σx⟩ > 0 for all x ∈ R^n with x ≠ 0, and necessarily Σ is invertible and symmetric. Define a function f(x) by

\[ f(x) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \exp\left( -\frac{1}{2} \langle (x - \mu), \Sigma^{-1}(x - \mu) \rangle \right), \quad x \in \mathbb{R}^n, \tag{5.15} \]

where |Σ| is the determinant. It is proved that

\[ \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} f(x_1, \ldots, x_n)\,dx_1 \cdots dx_n = 1, \]

with the help of diagonalization of Σ and a change of coordinates. In other words, f(x) is a probability density function in n variables. The corresponding probability distribution is called an n-dimensional normal distribution and is denoted by N(μ, Σ). Moreover, we can check by elementary calculus that

\[ \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} x_j f(x_1, \ldots, x_n)\,dx_1 \cdots dx_n = \mu_j \tag{5.16} \]

and

\[ \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} (x_j - \mu_j)(x_k - \mu_k) f(x_1, \ldots, x_n)\,dx_1 \cdots dx_n = \sigma_{jk}. \tag{5.17} \]

As a result, μ is the mean vector and Σ the variance-covariance matrix of the normal distribution N(μ, Σ).

Here we study the two-dimensional case. Take a vector μ ∈ R² and a strictly positive definite 2 × 2 matrix Σ, say,

\[ \mu = \begin{bmatrix} a \\ b \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{bmatrix}. \]

Note that Σ is a symmetric matrix, i.e., σ_{12} = σ_{21}. The density function of N(μ, Σ) is defined by

\[ f(x, y) = \frac{1}{\sqrt{(2\pi)^2 |\Sigma|}} \exp\left( -\frac{1}{2} \langle (\mathbf{x} - \mu), \Sigma^{-1}(\mathbf{x} - \mu) \rangle \right), \quad \mathbf{x} = \begin{bmatrix} x \\ y \end{bmatrix} \in \mathbb{R}^2. \tag{5.18} \]

Theorem 5.36. Let X and Y be random variables with means μ_X, μ_Y, variances σ_X², σ_Y² and covariance σ_{XY}. If (X, Y) obeys a 2-dimensional normal distribution, the joint density function is given by

\[ f_{XY}(x, y) = \frac{1}{2\pi \sigma_X \sigma_Y \sqrt{1 - \rho^2}} \exp\left[ -\frac{1}{2(1 - \rho^2)} \left\{ \left( \frac{x - \mu_X}{\sigma_X} \right)^2 - 2\rho \frac{x - \mu_X}{\sigma_X} \cdot \frac{y - \mu_Y}{\sigma_Y} + \left( \frac{y - \mu_Y}{\sigma_Y} \right)^2 \right\} \right], \tag{5.19} \]

where ρ = σ_{XY}/(σ_X σ_Y) is the correlation coefficient.

Proof. Let N(μ, Σ) be the normal distribution that (X, Y) obeys and f(x, y) its density function as in (5.18). As particular cases of (5.16) and (5.17), we have

\[ \mu_X = E[X] = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} x f(x, y)\,dx\,dy = a, \qquad \mu_Y = E[Y] = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} y f(x, y)\,dx\,dy = b, \]

and

\[ \sigma_X^2 = E[(X - \mu_X)^2] = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} (x - \mu_X)^2 f(x, y)\,dx\,dy = \sigma_{11}, \]
\[ \sigma_Y^2 = E[(Y - \mu_Y)^2] = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} (y - \mu_Y)^2 f(x, y)\,dx\,dy = \sigma_{22}, \]
\[ \sigma_{XY} = E[(X - \mu_X)(Y - \mu_Y)] = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} (x - \mu_X)(y - \mu_Y) f(x, y)\,dx\,dy = \sigma_{12} = \sigma_{21}. \]

Hence the joint density function f_{XY}(x, y) is given by the density function of N(μ, Σ) with μ and Σ given as above. We now look at the quadratic function ⟨(x − μ), Σ^{-1}(x − μ)⟩ on the right-hand side of (5.18). First, setting

\[ \Sigma^{-1} = \begin{bmatrix} \sigma^{11} & \sigma^{12} \\ \sigma^{21} & \sigma^{22} \end{bmatrix}, \]

we obtain

\[ \langle (\mathbf{x} - \mu), \Sigma^{-1}(\mathbf{x} - \mu) \rangle = \sigma^{11}(x - a)^2 + 2\sigma^{12}(x - a)(y - b) + \sigma^{22}(y - b)^2. \tag{5.20} \]

Fig. 5.1. Density function of a 2-dimensional normal distribution N(μ, Σ) and its contour curves.

Then, inserting

\[ \sigma^{11} = \frac{\sigma_{22}}{|\Sigma|}, \qquad \sigma^{22} = \frac{\sigma_{11}}{|\Sigma|}, \qquad \sigma^{12} = \sigma^{21} = -\frac{\sigma_{12}}{|\Sigma|} \]

into (5.20), we have

\[ \langle (\mathbf{x} - \mu), \Sigma^{-1}(\mathbf{x} - \mu) \rangle = \frac{1}{|\Sigma|} \{ \sigma_{22}(x - a)^2 - 2\sigma_{12}(x - a)(y - b) + \sigma_{11}(y - b)^2 \} \]
\[ = \frac{\sigma_{11}\sigma_{22}}{|\Sigma|} \left\{ \frac{(x - a)^2}{\sigma_{11}} - \frac{2\sigma_{12}}{\sigma_{11}\sigma_{22}}(x - a)(y - b) + \frac{(y - b)^2}{\sigma_{22}} \right\}. \tag{5.21} \]

Finally, using the correlation coefficient

\[ \rho = \frac{\sigma_{XY}}{\sigma_X \sigma_Y} = \frac{\sigma_{12}}{\sqrt{\sigma_{11}} \sqrt{\sigma_{22}}} \]

together with a = μ_X, b = μ_Y, we come to

\[ \langle (\mathbf{x} - \mu), \Sigma^{-1}(\mathbf{x} - \mu) \rangle = \frac{\sigma_X^2 \sigma_Y^2}{|\Sigma|} \left\{ \left( \frac{x - \mu_X}{\sigma_X} \right)^2 - 2\rho \frac{x - \mu_X}{\sigma_X} \cdot \frac{y - \mu_Y}{\sigma_Y} + \left( \frac{y - \mu_Y}{\sigma_Y} \right)^2 \right\}. \tag{5.22} \]

On the other hand, we have

\[ |\Sigma| = \sigma_{11}\sigma_{22} - \sigma_{12}\sigma_{21} = \sigma_X^2 \sigma_Y^2 - \sigma_{XY}^2 = \sigma_X^2 \sigma_Y^2 \left( 1 - \frac{\sigma_{XY}^2}{\sigma_X^2 \sigma_Y^2} \right) = \sigma_X^2 \sigma_Y^2 (1 - \rho^2). \tag{5.23} \]

Then (5.19) follows immediately from (5.22) and (5.23). □

Theorem 5.37. Let X and Y be random variables with means μ_X, μ_Y, variances σ_X², σ_Y² and covariance σ_{XY}. If (X, Y) obeys a 2-dimensional normal distribution, the density functions of X and Y (called the marginal density functions in this context) are given by

\[ f_X(x) = \int_{-\infty}^{+\infty} f_{XY}(x, y)\,dy = \frac{1}{\sqrt{2\pi\sigma_X^2}} \exp\left( -\frac{(x - \mu_X)^2}{2\sigma_X^2} \right), \tag{5.24} \]

\[ f_Y(y) = \int_{-\infty}^{+\infty} f_{XY}(x, y)\,dx = \frac{1}{\sqrt{2\pi\sigma_Y^2}} \exp\left( -\frac{(y - \mu_Y)^2}{2\sigma_Y^2} \right), \tag{5.25} \]

respectively. In other words, X and Y obey the normal distributions N(μ_X, σ_X²) and N(μ_Y, σ_Y²), respectively.

Proof. We see from (5.21) that

\[ \langle (\mathbf{x} - \mu), \Sigma^{-1}(\mathbf{x} - \mu) \rangle = \frac{\sigma_{11}}{|\Sigma|} \left( y - b - \frac{\sigma_{12}}{\sigma_{11}}(x - a) \right)^2 + \frac{1}{\sigma_{11}}(x - a)^2, \]

and hence

\[ f_{XY}(x, y) = \frac{1}{\sqrt{(2\pi)^2 |\Sigma|}} \exp\left[ -\frac{1}{2} \left\{ \frac{\sigma_{11}}{|\Sigma|} \left( y - b - \frac{\sigma_{12}}{\sigma_{11}}(x - a) \right)^2 + \frac{1}{\sigma_{11}}(x - a)^2 \right\} \right]. \tag{5.26} \]

Then we have

\[ f_X(x) = \int_{-\infty}^{+\infty} f_{XY}(x, y)\,dy = \frac{1}{\sqrt{(2\pi)^2 |\Sigma|}} \sqrt{\frac{2\pi|\Sigma|}{\sigma_{11}}} \exp\left( -\frac{1}{2\sigma_{11}}(x - a)^2 \right) = \frac{1}{\sqrt{2\pi\sigma_{11}}} \exp\left( -\frac{1}{2\sigma_{11}}(x - a)^2 \right), \tag{5.27} \]

which proves (5.24). Similarly, (5.25) is derived. □

Theorem 5.38. Let X and Y be random variables with means μ_X, μ_Y, variances σ_X², σ_Y² and covariance σ_{XY}. If (X, Y) obeys a 2-dimensional normal distribution, the conditional density function f_{Y|X}(y | x) is given by

\[ f_{Y|X}(y \mid x) = \frac{f_{XY}(x, y)}{f_X(x)} = \frac{1}{\sqrt{2\pi\sigma_Y^2(1 - \rho^2)}} \exp\left[ -\frac{1}{2\sigma_Y^2(1 - \rho^2)} \left\{ y - \mu_Y - \frac{\sigma_{XY}}{\sigma_X^2}(x - \mu_X) \right\}^2 \right], \tag{5.28} \]

where ρ = ρ_{XY} is the correlation coefficient of X and Y. In particular, the conditional density function f_{Y|X}(y | x) is the density function of a normal distribution.

Proof. Taking the ratio of (5.26) to (5.27), we obtain

\[ f_{Y|X}(y \mid x) = \frac{f_{XY}(x, y)}{f_X(x)} = \frac{1}{\sqrt{2\pi\sigma_{11}^{-1}|\Sigma|}} \exp\left[ -\frac{1}{2\sigma_{11}^{-1}|\Sigma|} \left\{ y - b - \frac{\sigma_{12}}{\sigma_{11}}(x - a) \right\}^2 \right]. \]

Then, using σ_{11}^{-1}|Σ| = σ_Y²(1 − ρ²), we obtain (5.28). □

From (5.28) we immediately obtain

\[ E[Y \mid X = x] = \int_{-\infty}^{+\infty} y f_{Y|X}(y \mid x)\,dy = \mu_Y + \frac{\sigma_{XY}}{\sigma_X^2}(x - \mu_X). \]

Thus the regression curve for a 2-dimensional normal distribution is a line, determined by

\[ y = \mu_Y + \frac{\sigma_{XY}}{\sigma_X^2}(x - \mu_X), \]

or equivalently by

\[ \frac{y - \mu_Y}{\sigma_Y} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y} \cdot \frac{x - \mu_X}{\sigma_X} = \rho_{XY} \frac{x - \mu_X}{\sigma_X}, \tag{5.29} \]

where $\rho_{XY}$ is the correlation coefficient of $X$ and $Y$.

In Sect. 3.4 we discussed the method of least squares for a regression line. We apply a similar idea to the joint density function. Namely, given a joint density function $f_{XY}(x,y)$, we ask for a line $y = \alpha x + \beta$ which gives the best approximation. For each fixed $x$ the fluctuation of $Y$ under the condition $X = x$ is measured by the so-called square error $E[(Y - (\alpha x + \beta))^2 \mid X = x]$. The method of least squares is to find the coefficients $\alpha$ and $\beta$ in such a way that the total of square errors
\[
Q = \int_{-\infty}^{+\infty} E[(Y - (\alpha x + \beta))^2 \mid X = x]\, f_X(x)\,dx \tag{5.30}
\]

is minimized. In fact, we have
\[
Q = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} (y - (\alpha x + \beta))^2\, f_{Y|X}(y|x)\, f_X(x)\,dy\,dx
= \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} (y - (\alpha x + \beta))^2\, f_{XY}(x,y)\,dx\,dy
= E[(Y - (\alpha X + \beta))^2].
\]

Note that $Q = Q(\alpha,\beta)$ is a quadratic function. Then, after simple calculus we see that the minimum of $Q$ is attained at
\[
\alpha = \frac{\sigma_{XY}}{\sigma_X^2}, \qquad \beta = \mu_Y - \frac{\sigma_{XY}}{\sigma_X^2}\,\mu_X.
\]
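For the reader who wants the skipped step: differentiating $Q(\alpha,\beta) = E[(Y - (\alpha X + \beta))^2]$ and setting the partial derivatives to zero gives
\[
\frac{\partial Q}{\partial \beta} = -2\,E[Y - \alpha X - \beta] = 0, \qquad
\frac{\partial Q}{\partial \alpha} = -2\,E[X(Y - \alpha X - \beta)] = 0.
\]
The first equation yields $\beta = \mu_Y - \alpha\mu_X$; substituting this into the second yields $E[XY] - \mu_X\mu_Y = \alpha(E[X^2] - \mu_X^2)$, that is, $\sigma_{XY} = \alpha\,\sigma_X^2$.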

Hence the regression line is given by
\[
y = \frac{\sigma_{XY}}{\sigma_X^2}\,x + \mu_Y - \frac{\sigma_{XY}}{\sigma_X^2}\,\mu_X,
\]
or equivalently by
\[
\frac{y - \mu_Y}{\sigma_Y} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}\cdot\frac{x - \mu_X}{\sigma_X} = \rho_{XY}\,\frac{x - \mu_X}{\sigma_X}. \tag{5.31}
\]

Note that (5.29) and (5.31) coincide. Thus, we come to the following assertion.

Theorem 5.39. Let $X$ and $Y$ be two random variables and assume that $(X,Y)$ obeys a 2-dimensional normal distribution. Then the regression curve defined by $x \mapsto E[Y \mid X = x]$ is given by
\[
\frac{y - \mu_Y}{\sigma_Y} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}\cdot\frac{x - \mu_X}{\sigma_X} = \rho_{XY}\,\frac{x - \mu_X}{\sigma_X},
\]
which coincides with the regression line determined by the method of least squares. In other words, the above line minimizes the total of square errors $Q$ defined in (5.30).
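Theorem 5.39 is easy to confirm numerically. The following Python sketch, with arbitrary illustrative parameters, draws a large sample from a 2-dimensional normal distribution, fits a least-squares line, and compares the fitted coefficients with $\alpha = \sigma_{XY}/\sigma_X^2$ and $\beta = \mu_Y - \alpha\mu_X$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumed for this demonstration)
mu = np.array([1.0, -0.5])
Sigma = np.array([[4.0, 1.2],
                  [1.2, 2.25]])  # sigma_X^2 = 4, sigma_Y^2 = 2.25, sigma_XY = 1.2

sample = rng.multivariate_normal(mu, Sigma, size=200_000)
x, y = sample[:, 0], sample[:, 1]

# Least-squares fit of y = alpha * x + beta from the sample
alpha_hat, beta_hat = np.polyfit(x, y, deg=1)

# Theoretical coefficients from Theorem 5.39
alpha = Sigma[0, 1] / Sigma[0, 0]
beta = mu[1] - alpha * mu[0]

print(alpha_hat, alpha)  # both close to 0.3
print(beta_hat, beta)    # both close to -0.8
```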

Remark 5.40. For a general distribution, the regression curve defined by $x \mapsto E[Y \mid X = x]$ is not necessarily a line. Hence the regression line obtained by the method of least squares does not necessarily coincide with the regression curve determined by the conditional expectation.

Exercise 5.41. Let $(X,Y)$ be a 2-dimensional random vector of which the density function is given by
\[
f_{XY}(x,y) = \frac{1}{2\sqrt{3}\,\pi}\exp\left(-\frac{1}{3}\left(x^2 - xy + y^2 + x - 2y + 1\right)\right).
\]

(1) Find the means of $X$ and $Y$.
(2) Find the variances of $X$ and $Y$.
(3) Find the correlation coefficient of $X$ and $Y$.

Exercise 5.42. Assume that a random vector $(X,Y)$ obeys a 2-dimensional normal distribution $N(\boldsymbol{\mu},\Sigma)$. Show that $\mathrm{Cov}(X,Y) = 0$ implies that $X$ and $Y$ are independent. (See Remark 5.31.)

Bibliographical Notes

There are many excellent textbooks on multi-variate analysis. Here we only mention some of them. For a general introduction to multi-variate analysis, we refer to Anderson [1]; see also Dobson–Barnett [7] for a new approach by generalized linear models. Wooldridge [21] contains basics of multi-variate analysis and further topics in econometrics. Hair–Black–Babin–Anderson [9] provides a comprehensive guideline for multi-variate analysis from a practical point of view. For an introduction to statistics see Brink [4] and Rumsey [18, 19]; see also Hoel [12]. For sampling theory see, e.g., Hansen–Hurwitz–Madow [11] and Kish [14]. For Bayesian analysis see Robert [17] and Smith [20]. For a general introduction to probability theory we refer to Chung [6] and Durrett [8], while Kolmogorov [15] is the origin of modern probability theory. For measure-theoretic probability theory, where measure spaces and Lebesgue integrals are crucial, see Athreya–Lahiri [2], Billingsley [3] and Capinski–Kopp [5]. The old book Halmos [10] is also widely known. For more exercises on probability and random variables see Hsu [13]. For more on the basics of data science, see, e.g., Kotu–Deshpande [16].

Acknowledgments

These lecture notes are based on the lectures delivered as a part of Data Science Basics in the Graduate Program in Data Science (GP-DS) and the Data Sciences Program (DSP) at Tohoku University in 2018–2020. The authors are grateful to the teaching staff for their invaluable assistance and constant encouragement. They thank the anonymous referees for their comments and suggestions that improved the presentation of these lecture notes.

REFERENCES

[1] Anderson, T. W., An Introduction to Multivariate Statistical Analysis, 3rd Ed., Wiley-Interscience, Hoboken, NJ (2003).
[2] Athreya, K. B., and Lahiri, S. N., Measure Theory and Probability Theory, Springer, New York (2006).
[3] Billingsley, P., Probability and Measure, Wiley Series in Probability and Statistics, John Wiley & Sons, Inc., Hoboken, NJ (2012).
[4] Brink, D., Statistics, Ventus Publishing ApS (2010).
[5] Capinski, M., and Kopp, E., Measure, Integral and Probability, 2nd Ed., Springer-Verlag, London (2004).
[6] Chung, K. L., A Course in Probability Theory, 3rd Ed., Academic Press, Inc., San Diego, CA (2001).
[7] Dobson, A. J., and Barnett, A. G., An Introduction to Generalized Linear Models, 4th Ed., CRC Press, Boca Raton, FL (2018).
[8] Durrett, R., Probability: Theory and Examples, 5th Ed., Cambridge University Press, Cambridge (2019).
[9] Hair, J., Black, W., Babin, B., and Anderson, R., Multivariate Data Analysis, Cengage Learning EMEA (2018).
[10] Halmos, P. R., Measure Theory, D. Van Nostrand Company, Inc., New York, NY (1950).
[11] Hansen, M. H., Hurwitz, W. N., and Madow, W. G., Sample Survey Methods and Theory, Wiley Classics Library, John Wiley & Sons, Inc., New York (1993).
[12] Hoel, P. G., Introduction to Mathematical Statistics, 5th Ed., Wiley (1984).
[13] Hsu, H. P., Schaum's Outline of Probability, Random Variables, and Random Processes, 4th Ed., McGraw-Hill Education (2020).
[14] Kish, L., Survey Sampling, Wiley-Interscience (1995).
[15] Kolmogorov, A., Grundbegriffe der Wahrscheinlichkeitsrechnung, Julius Springer, Berlin (1933).
[16] Kotu, V., and Deshpande, B., Data Science: Concepts and Practice, 2nd Ed., Morgan Kaufmann (2019).
[17] Robert, C., The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation, 2nd Ed., Springer (2007).
[18] Rumsey, D. J., Statistics for Dummies, 2nd Ed., Wiley Publishing (2011).
[19] Rumsey, D. J., Statistics Essentials for Dummies, John Wiley & Sons, Inc. (2019).
[20] Smith, J. Q., Bayesian Decision Analysis: Principles and Practice, Cambridge University Press (2010).
[21] Wooldridge, J. M., Econometric Analysis of Cross Section and Panel Data, 2nd Ed., MIT Press (2010).

Appendix A: Some Tips for Matrices and Vectors

A.1 Definition

A rectangular array of numbers is called a matrix, where each horizontal array is called a row and each vertical one a column. A matrix having $n$ rows and $p$ columns is called an $n \times p$ matrix and is of the form:

\[
X =
\begin{bmatrix}
x_{11} & \cdots & x_{1j} & \cdots & x_{1p} \\
\vdots & & \vdots & & \vdots \\
x_{i1} & \cdots & x_{ij} & \cdots & x_{ip} \\
\vdots & & \vdots & & \vdots \\
x_{n1} & \cdots & x_{nj} & \cdots & x_{np}
\end{bmatrix}
\qquad \text{($i$th row and $j$th column displayed).} \tag{A.1}
\]

As a rule, the rows are numbered from top to bottom and the columns are numbered from left to right. An entry of a matrix $X$ at the intersection of the $i$th row and the $j$th column is called the $(i,j)$-entry and is denoted by $(X)_{ij}$. For $X$ in (A.1) we have $(X)_{ij} = x_{ij}$.
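For readers working numerically, the same conventions carry over to array libraries, except that indices start at 0 rather than 1. A minimal NumPy sketch (the values are arbitrary):

```python
import numpy as np

# A 2 x 3 matrix: 2 rows, 3 columns
X = np.array([[1, 2, 3],
              [4, 5, 6]])

print(X.shape)  # (2, 3): n rows, p columns
print(X[0, 2])  # 3 -- the (1, 3)-entry in the text's 1-based notation
```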

As a special case, a matrix with a single column, that is, an $n \times 1$ matrix
\[
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \tag{A.2}
\]
is called an $n$-dimensional column vector. A matrix with a single row, that is, a $1 \times n$ matrix
\[
[x_1 \;\; x_2 \;\; \cdots \;\; x_n] \tag{A.3}
\]
is called an $n$-dimensional row vector. These vectors represent a point in $n$-dimensional coordinate space, where the choice of a column or row vector is up to the context.

We say that two matrices are of the same type or of the same size if the numbers of their columns coincide as well as those of their rows. For two matrices $X$ and $Y$ we say that $X = Y$ if they are of the same type and $(X)_{ij} = (Y)_{ij}$ for all $i$ and $j$. Note that two matrices are never equal if they are not of the same type.

A.2 Transposition

For a matrix $X$ the transposed matrix, denoted by $X^T$, is defined by exchanging the rows and columns of $X$. If $X$ is an $n \times p$ matrix, the transposed matrix $X^T$ becomes a $p \times n$ matrix. Writing down their entries explicitly, we have

\[
X =
\begin{bmatrix}
x_{11} & \cdots & x_{1j} & \cdots & x_{1p} \\
\vdots & \ddots & \vdots & & \vdots \\
x_{i1} & \cdots & x_{ij} & \cdots & x_{ip} \\
\vdots & & \vdots & \ddots & \vdots \\
x_{n1} & \cdots & x_{nj} & \cdots & x_{np}
\end{bmatrix}, \qquad
X^T =
\begin{bmatrix}
x_{11} & \cdots & x_{i1} & \cdots & x_{n1} \\
\vdots & \ddots & \vdots & & \vdots \\
x_{1j} & \cdots & x_{ij} & \cdots & x_{nj} \\
\vdots & & \vdots & \ddots & \vdots \\
x_{1p} & \cdots & x_{ip} & \cdots & x_{np}
\end{bmatrix},
\]

where $x_{ij}$ is the $(i,j)$-entry of $X$ and it is the $(j,i)$-entry of $X^T$. In short,
\[
(X^T)_{ij} = (X)_{ji}.
\]

By repeating the operation of transposition twice, a matrix turns back to the original one. Namely, we have $(X^T)^T = X$. Applying transposition, a column vector becomes a row vector and vice versa as follows:

\[
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}^{T} = [x_1 \;\; x_2 \;\; \cdots \;\; x_n],
\qquad
[x_1 \;\; x_2 \;\; \cdots \;\; x_n]^{T} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}.
\]
It is often convenient to write a column vector as $[x_1 \;\; x_2 \;\; \cdots \;\; x_n]^T$ to save space.
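In NumPy the transposition rules above read as follows (the values are arbitrary):

```python
import numpy as np

X = np.array([[1, 2, 3],
              [4, 5, 6]])          # a 2 x 3 matrix

print(X.T.shape)                   # (3, 2): transposition swaps rows and columns
print(np.array_equal(X.T.T, X))    # True: (X^T)^T = X
print(X.T[2, 0] == X[0, 2])        # True: (X^T)_{ij} = (X)_{ji}
```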

A.3 Addition and scalar multiplication

Let X and Y be matrices of the same type, say,

\[
X =
\begin{bmatrix}
x_{11} & \cdots & x_{1j} & \cdots & x_{1p} \\
\vdots & \ddots & \vdots & & \vdots \\
x_{i1} & \cdots & x_{ij} & \cdots & x_{ip} \\
\vdots & & \vdots & \ddots & \vdots \\
x_{n1} & \cdots & x_{nj} & \cdots & x_{np}
\end{bmatrix}, \qquad
Y =
\begin{bmatrix}
y_{11} & \cdots & y_{1j} & \cdots & y_{1p} \\
\vdots & \ddots & \vdots & & \vdots \\
y_{i1} & \cdots & y_{ij} & \cdots & y_{ip} \\
\vdots & & \vdots & \ddots & \vdots \\
y_{n1} & \cdots & y_{nj} & \cdots & y_{np}
\end{bmatrix}.
\]

Then their sum $X + Y$ is defined entrywise, namely,
\[
X + Y =
\begin{bmatrix}
x_{11}+y_{11} & \cdots & x_{1j}+y_{1j} & \cdots & x_{1p}+y_{1p} \\
\vdots & \ddots & \vdots & & \vdots \\
x_{i1}+y_{i1} & \cdots & x_{ij}+y_{ij} & \cdots & x_{ip}+y_{ip} \\
\vdots & & \vdots & \ddots & \vdots \\
x_{n1}+y_{n1} & \cdots & x_{nj}+y_{nj} & \cdots & x_{np}+y_{np}
\end{bmatrix}.
\]

The difference $X - Y$ is similarly defined. These rules are written in a simpler form:
\[
(X \pm Y)_{ij} = (X)_{ij} \pm (Y)_{ij} = x_{ij} \pm y_{ij}.
\]

Next, for a real number $a$ the scalar multiplication $aX$ is defined by
\[
aX =
\begin{bmatrix}
ax_{11} & \cdots & ax_{1j} & \cdots & ax_{1p} \\
\vdots & \ddots & \vdots & & \vdots \\
ax_{i1} & \cdots & ax_{ij} & \cdots & ax_{ip} \\
\vdots & & \vdots & \ddots & \vdots \\
ax_{n1} & \cdots & ax_{nj} & \cdots & ax_{np}
\end{bmatrix},
\]
or equivalently,
\[
(aX)_{ij} = a(X)_{ij} = ax_{ij}.
\]

Since the addition and scalar multiplication are defined entrywise, the usual calculation rules are valid also for matrices:
(i) $X + Y = Y + X$;
(ii) $X + (Y + Z) = (X + Y) + Z$;
(iii) $a(X + Y) = aX + aY$;
(iv) $(a + b)X = aX + bX$;
(v) $(ab)X = a(bX)$.

Let $O$ be the matrix with entries being all zero. If $X$ and $O$ are of the same type, we have
\[
X + O = O + X = X.
\]
Moreover, letting $-X = (-1)X$ (scalar multiplication by $-1$) we have
\[
X + (-X) = (-X) + X = O.
\]
In fact, for the subtraction we have
\[
X - Y = X + (-Y) = X + (-1)Y.
\]

Next we review the multiplication of matrices. The seemingly complicated definition is likely to keep beginners away, but the real power of matrix multiplication will be understood after some patient practice. Let $X$ and $Y$ be two matrices, and assume that the number of columns of $X$ and that of rows of $Y$ coincide, say, $X$ is an $n \times p$ matrix and $Y$ is a $p \times m$ matrix. Then the product $XY$ is defined to be an $n \times m$ matrix whose $(i,j)$-entry is given by
\[
(XY)_{ij} = x_{i1}y_{1j} + x_{i2}y_{2j} + \cdots + x_{ip}y_{pj},
\]
that is,
\[
(XY)_{ij} = \sum_{k=1}^{p} x_{ik}y_{kj} = \sum_{k=1}^{p} (X)_{ik}(Y)_{kj}. \tag{A.4}
\]

In fact, $(XY)_{ij}$ is obtained from the $i$th row of $X$ and the $j$th column of $Y$:

\[
X =
\begin{bmatrix}
x_{11} & \cdots & x_{1k} & \cdots & x_{1p} \\
\vdots & & \vdots & & \vdots \\
x_{i1} & \cdots & x_{ik} & \cdots & x_{ip} \\
\vdots & & \vdots & & \vdots \\
x_{n1} & \cdots & x_{nk} & \cdots & x_{np}
\end{bmatrix}, \qquad
Y =
\begin{bmatrix}
y_{11} & \cdots & y_{1j} & \cdots & y_{1m} \\
\vdots & & \vdots & & \vdots \\
y_{k1} & \cdots & y_{kj} & \cdots & y_{km} \\
\vdots & & \vdots & & \vdots \\
y_{p1} & \cdots & y_{pj} & \cdots & y_{pm}
\end{bmatrix}.
\]

It is instructive to note in the formula (A.4) that the sum is taken over $k$, which appears as if connecting the two matrices $X$ and $Y$. The other indices $i$ and $j$ are outside the summation and stand for the row and column numbers of the new matrix $XY$.
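As a concrete rendering of (A.4), the sketch below computes the product with the triple loop the formula suggests and checks it against NumPy's built-in product; the matrices are arbitrary examples:

```python
import numpy as np

def matmul(X, Y):
    """Matrix product via (A.4): (XY)_ij = sum_k X_ik * Y_kj."""
    n, p = X.shape
    p2, m = Y.shape
    assert p == p2, "columns of X must match rows of Y"
    XY = np.zeros((n, m))
    for i in range(n):          # row index of the result
        for j in range(m):      # column index of the result
            for k in range(p):  # summation index connecting X and Y
                XY[i, j] += X[i, k] * Y[k, j]
    return XY

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])   # 2 x 3
Y = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, -1.0]])       # 3 x 2

print(np.allclose(matmul(X, Y), X @ Y))  # True
```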

By calculation we can check the associativity
\[
(XY)Z = X(YZ),
\]
just as in the case of numbers. Thanks to the associativity we may write $XYZ$ for the multiplication of three matrices without brackets. On the other hand, the commutativity does not hold, that is, we have $XY \neq YX$ in general. First of all, if $X$ and $Y$ do not satisfy the condition on sizes, it can happen that $XY$ is defined but $YX$ is not. Even if both $XY$ and $YX$ are defined, their sizes do not necessarily coincide. Even if both $XY$ and $YX$ are defined and their sizes coincide, the entries do not necessarily coincide. As a matter of fact, two matrices $X$ and $Y$ are in a very special relation when $XY = YX$ holds.

The $n \times n$ matrix whose diagonal entries are all one and whose off-diagonal entries are all zero is called the identity matrix and is denoted by $I$:

\[
I =
\begin{bmatrix}
1 & \cdots & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & & \vdots \\
0 & \cdots & 1 & \cdots & 0 \\
\vdots & & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & \cdots & 1
\end{bmatrix}.
\]

Then for any $n \times n$ matrix $X$ we have $XI = IX = X$, namely $I$ is the multiplication unit. If two $n \times n$ matrices $X$ and $Y$ satisfy $XY = I$, we have $YX = I$. In that case we say that $Y$ is the inverse matrix of $X$ and denote it by $X^{-1}$. It is noted that the inverse matrix $X^{-1}$ does not necessarily exist, but if it exists it is uniquely determined. A simple application of the inverse matrix makes it possible to write down the unique solution to an equation $AX = B$ as $X = A^{-1}B$ whenever the inverse matrix of $A$ exists.
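The following sketch illustrates solving $AX = B$ via the inverse matrix with NumPy; note that in numerical practice `numpy.linalg.solve` is generally preferred to forming $A^{-1}$ explicitly (the matrices are arbitrary examples):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])      # an invertible 2 x 2 matrix
B = np.array([[5.0],
              [10.0]])          # right-hand side

X1 = np.linalg.inv(A) @ B       # X = A^{-1} B, as in the text
X2 = np.linalg.solve(A, B)      # numerically preferable equivalent

print(np.allclose(X1, X2))      # True
print(np.allclose(A @ X1, B))   # True: X solves AX = B
```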

A.4 Inner product and norm

The inner product of n-dimensional column vectors

\[
\mathbf{x} =
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \qquad
\mathbf{y} =
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \tag{A.5}
\]

is defined by
\[
\langle \mathbf{x}, \mathbf{y} \rangle = \sum_{i=1}^{n} x_i y_i. \tag{A.6}
\]

For the inner product the symbol $\mathbf{x} \cdot \mathbf{y}$ is also used in the literature. According to the definition of matrix multiplication, $\mathbf{x}^T\mathbf{y}$ becomes a $1 \times 1$ matrix whose single entry is given by the right-hand side of (A.6). Then, it is convenient to identify the $1 \times 1$ matrix $\mathbf{x}^T\mathbf{y}$ with the single number $\langle \mathbf{x}, \mathbf{y} \rangle$ and write
\[
\langle \mathbf{x}, \mathbf{y} \rangle = \mathbf{x}^T\mathbf{y} = \sum_{i=1}^{n} x_i y_i.
\]

For any column vector $\mathbf{x}$ we have
\[
\langle \mathbf{x}, \mathbf{x} \rangle = \mathbf{x}^T \mathbf{x} = \sum_{i=1}^{n} x_i^2,
\]
and hence $\langle \mathbf{x}, \mathbf{x} \rangle \ge 0$. The positive square root is denoted by

\[
\|\mathbf{x}\| = \sqrt{\langle \mathbf{x}, \mathbf{x} \rangle} = \left(\sum_{i=1}^{n} x_i^2\right)^{1/2}
\]
and is called the (Euclidean) norm of $\mathbf{x}$.

As a particular case of matrix multiplication, for an $m \times n$ matrix $X$ and an $n$-dimensional column vector $\mathbf{x}$ we define an $m$-dimensional column vector $X\mathbf{x}$. Moreover, it is shown by calculation that for an $m$-dimensional column vector $\mathbf{y}$ we have
\[
\langle \mathbf{y}, X\mathbf{x} \rangle = \langle X^T \mathbf{y}, \mathbf{x} \rangle,
\]
where the left-hand side is the inner product of $m$-dimensional column vectors and the right-hand side is that of $n$-dimensional column vectors.
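Both the norm and the identity $\langle \mathbf{y}, X\mathbf{x} \rangle = \langle X^T\mathbf{y}, \mathbf{x} \rangle$ are easy to verify numerically; a small sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 4))   # an m x n matrix, m = 3, n = 4
x = rng.standard_normal(4)        # n-dimensional vector
y = rng.standard_normal(3)        # m-dimensional vector

lhs = np.dot(y, X @ x)            # <y, Xx>
rhs = np.dot(X.T @ y, x)          # <X^T y, x>
print(np.isclose(lhs, rhs))       # True

# The Euclidean norm is the square root of <x, x>
print(np.isclose(np.linalg.norm(x), np.sqrt(np.dot(x, x))))  # True
```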

A.5 Metric

Measuring the difference between two objects is of fundamental importance, and for that purpose a metric, or distance function, is the most basic tool in various respects. Let $X$ be a set of objects or, more generally, of elements. A metric on $X$ is a function $d : X \times X \to \mathbf{R}$ satisfying the following conditions:
(M1) $d(x,x) = 0$, and $d(x,y) = 0$ if and only if $x = y$;
(M2) $d(x,y) = d(y,x)$;
(M3) (triangle inequality) $d(x,y) \le d(x,z) + d(z,y)$.
Condition (M1) means that the metric separates two distinct elements of $X$. A function $d(x,y)$ satisfying (M1)–(M3) except the second half of condition (M1) is called a pseudo-metric; with a pseudo-metric two distinct elements are not necessarily separated by means of $d(x,y)$.

Let $\mathbf{R}^n$ be the $n$-dimensional coordinate space, where every point is associated with an $n$-tuple of real numbers, i.e., the coordinates $(x_1, \ldots, x_n)$. For two points in the $n$-dimensional coordinate space, which are identified with column vectors $\mathbf{x}$ and $\mathbf{y}$ as in (A.5), we set
\[
d(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|, \qquad \mathbf{x}, \mathbf{y} \in \mathbf{R}^n. \tag{A.7}
\]
As is easily verified, the above function $d$ satisfies conditions (M1)–(M3). Thus, $d(\mathbf{x},\mathbf{y})$ defined by (A.7) is a metric on $\mathbf{R}^n$ and is called the Euclidean metric or Euclidean distance. In coordinates the Euclidean metric is given by
\[
d(\mathbf{x}, \mathbf{y}) = \left(\sum_{i=1}^{n} (x_i - y_i)^2\right)^{1/2}, \qquad \mathbf{x} = [x_1, x_2, \ldots, x_n]^T, \quad \mathbf{y} = [y_1, y_2, \ldots, y_n]^T.
\]

For $n = 2$ or $n = 3$ the above relation reflects the Pythagorean theorem.

The Euclidean metric admits a one-parameter deformation. For $1 \le p < \infty$ we define
\[
d_p(\mathbf{x}, \mathbf{y}) = \left(\sum_{i=1}^{n} |x_i - y_i|^p\right)^{1/p},
\]
and for $p = \infty$ we set
\[
d_\infty(\mathbf{x}, \mathbf{y}) = \max\{|x_i - y_i| \,;\, 1 \le i \le n\},
\]
where $\mathbf{x} = [x_1, x_2, \ldots, x_n]^T$ and $\mathbf{y} = [y_1, y_2, \ldots, y_n]^T$ are points in $\mathbf{R}^n$. It is proved that $d_p(\mathbf{x},\mathbf{y})$ is a metric on $\mathbf{R}^n$ for any $1 \le p \le \infty$. The Euclidean metric is just the case $p = 2$. The metric $d_1$ is also referred to as the block distance.
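A compact Python sketch of the metrics $d_p$ (the vectors are arbitrary examples):

```python
import numpy as np

def d_p(x, y, p):
    """The distance d_p(x, y); p = np.inf gives the maximum metric."""
    if np.isinf(p):
        return np.max(np.abs(x - y))
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x = np.array([1.0, -2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

print(d_p(x, y, 1))       # block distance: 5.0
print(d_p(x, y, 2))       # Euclidean distance: sqrt(13)
print(d_p(x, y, np.inf))  # maximum metric: 3.0
```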

Finally, we mention another distance often appearing in data analysis. Let $W$ be a set of $p$ letters, say $a, b, \ldots, c$. A concatenation of letters in $W$ is called a word. Let $W^n$ be the set of words of length $n$. A word in $W^n$ is of the form
\[
x = x_1 x_2 \cdots x_n, \qquad x_1, \ldots, x_n \in W.
\]
Given two words, the number of steps needed to transform one into the other by changing letters is of interest. The number of such steps is given by
\[
d(x, y) = |\{1 \le i \le n \,;\, x_i \neq y_i\}|, \qquad x = x_1 x_2 \cdots x_n, \quad y = y_1 y_2 \cdots y_n.
\]
It is easy to see that the above function $d$ is a metric on $W^n$, and it is called the Hamming distance.
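A short Python sketch of the Hamming distance on words of equal length:

```python
def hamming(x, y):
    """Number of positions at which the words x and y differ."""
    assert len(x) == len(y), "the Hamming distance needs words of equal length"
    return sum(xi != yi for xi, yi in zip(x, y))

print(hamming("abcab", "abbab"))  # 1: the words differ only in position 3
```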

Appendix B: Raw Data for Exercise

The following table contains the raw data used for illustration of statistical analysis.

Team A

No.  height (cm)  weight (kg)  age (year)
 1       178          100          33
 2       185           90          22
 3       190           90          29
 4       175           79          27
 5       185           81          29
 6       196          106          30
 7       188          100          31
 8       186           87          23
 9       188           84          19
10       182           77          31
11       180           82          19
12       176           82          30
13       186           85          25
14       178           88          31
15       177           77          26
16       180           77          35
17       187          108          25
18       175           68          26
19       177           78          39
20       193          105          25
21       186           96          24
22       174           78          32
23       178           85          26
24       177           81          18
25       175           81          27
26       182           86          18
27       183           80          19
28       179           93          27
29       173           74          24
30       182           84          24
31       180           89          27
32       178           80          27
33       178           85          23
34       168           69          24
35       189           90          18
36       174           74          24
37       172           72          20
38       184           74          25
39       185           88          28
40       173           72          29
41       172           75          18
42       183           78          25
43       190           84          19
44       183           83          23
45       180           80          36
46       175           77          28
47       172           96          20
48       182           90          29
49       185           92          27
50       174           79          25
51       178           82          29
52       178           83          30
53       180           73          22
54       178           76          23
55       180           93          25
56       180           74          23
57       181           85          25
58       173           75          24
59       177           86          22
60       181           75          25
61       173           73          39
62       180           79          23
63       183           85          24
64       180           75          30
65       175           75          37
66       185           85          21
67       185           86          24
68       175           73          25
69       178           73          18
70       178           85          32
71       172           70          30
72       178           87          22
73       177           85          35
74       174           82          22
75       171           75          26
76       176           78          24
77       185           85          33
78       179           80          30
79       180           96          23
80       175           79          30
81       185           82          23
82       180           93          24
83       184           85          17
