Multivariate Data Analysis in Practice 6th Edition Supplementary Tutorial Book for 2019 Multivariate Data Analysis Kim H. Esbensen & Brad Swarbrick


Multivariate Data Analysis in Practice

6th Edition

Supplementary Tutorial Book for

2019

Multivariate Data Analysis

Kim H. Esbensen & Brad Swarbrick


Published by CAMO Software AS:

CAMO Software AS

Oslo Science Park

Gaustadalléen 21

0349 Oslo

Norway

Tel: (+47) 223 963 00

The Unscrambler® is a trademark of CAMO Software AS.

Design-Expert® is a trademark of Stat-Ease, Inc.

ISBN 978-82-691104-1-8

© 2019 CAMO Software AS

All rights reserved. No part of this publication may be reproduced, stored or transmitted, in any form or by any means, except with the prior permission in writing of the publishers.

Cover art by Gry Andrea Esbensen Norang.


Contents

1. Introduction to this tutorial short book

2. Data sets used in this tutorial short book

2.1. The Jam Data Set (Chapter 2)

2.2. Product Mass Testing and Method Comparison Testing (Chapter 2)

2.3. Beverage Consumption in Europe (Chapter 4)

2.4. Ripeness of Green Peas (Chapter 4)

2.5. Classification of Vegetable Oils Using Spectroscopic Methods (Chapter 4)

2.6. City Temperatures in Europe (Chapter 4)

2.7. Scaling Process Data (Chapter 5)

2.8. Preprocessing Mid Infrared Spectra of Vegetable Oils (Chapter 5)

2.9. Preprocessing of Process Near Infrared Spectra (Chapter 5)

2.10. The Gluten-Starch Data Set: Preprocessing a Difficult Problem (Chapter 5)

2.11. Octane Number in Gasoline (Chapter 6)

2.12. Alcohols in Water (Chapter 6)

2.13. Detecting Outliers: Troodos (Chapter 6)

2.14. Prediction of Alcohol Concentration in Mixtures (Chapter 7)

2.15. Development of a Predictive Model of Octane Number in Gasoline (Chapter 7)

2.16. Prediction of Paper Quality (Chapter 7)

2.17. Prediction of Octane Number in Gasoline (Chapter 7)

2.18. Prediction of Gluten-Starch Mixtures (Chapter 7)

2.19. Raw Material Identification Using Cluster Analysis (Chapter 10)

2.20. Fisher's Iris Classification Data (Chapter 10)

2.21. Classification of Vegetable Oils Using Supervised Classification (Chapter 10)

2.22. Sports Drink Formulation Using Factorial Designs (Chapter 11)

2.23. Understanding a Chemical Manufacturing Process Using Full and Fractional Factorial Designs (Chapter 11)

2.24. Optimisation of Bread Baking Using a Central Composite Design (CCD) (Chapter 11)

2.25. Blending Wines Using a Mixture Design (Chapter 11)

2.26. Blending Fruit Juices Using a Constrained Mixture Design (Chapter 11)

2.27. Fat Content in Fish Using Factor Rotation (Chapter 12)

2.28. Chemical Reaction Monitoring Using Multivariate Curve Resolution (MCR) (Chapter 12)

2.29. Combining MCR and PLS to Solve Difficult Problems (Fat in Fish Analysis) (Chapter 12)

3. The Unscrambler Environment


3.1. Data Import

3.2. Data Visualization

3.3. Transform

3.4. Analyze

3.5. Predict

4. Overview of the Modelling Process

5. Tutorials

5.1. The Jam Data Set (Chapter 2)

5.1.1. Description of the Data Set

5.1.2. Overview of the Data

5.1.3. Data Visualisation

5.1.4. Descriptive Statistics

5.1.5. Summary

5.2. Product Mass Testing and Method Comparison Testing (Chapter 2)

5.2.1. Description of the Data Set

5.2.2. Setup of the Data Table

5.2.3. Evaluation of the Data

5.2.4. Summary

5.3. Beverage Consumption in Europe (Chapter 4)

5.3.1. Description of the Data Set

5.3.2. Evaluation of the Data

5.3.3. Running a PCA on the Beverage Data

5.3.4. The PCA Overview

5.3.5. Summary

5.4. Ripeness of Green Peas (Chapter 4)

5.4.1. Description of the Data Set

5.4.2. Evaluation of the Data

5.4.3. Descriptive Statistics

5.4.4. Principal Component Analysis of Peas Data

5.4.5. The PCA Overview

5.4.6. Influence Plot for Peas Analysis

5.4.7. Summary

5.5. Classification of Vegetable Oils Using Spectroscopic Methods (Chapter 4)

5.5.1. Description of the Data Set

5.5.2. Evaluation of the Data

5.5.3. Principal Component Analysis of Raw Vegetable Oil Data


5.5.4. The PCA Overview

5.5.5. Influence Plot for Vegetable Oil Analysis

5.5.6. PCA Projection of Unknown Samples onto Vegetable Oil PCA Model

5.5.7. Summary

5.6. City Temperatures in Europe (Chapter 4)

5.6.1. Description of the Data Set

5.6.2. Evaluation of the Data

5.6.3. Principal Component Analysis of European City Temperature Data

5.6.4. The PCA Overview

5.6.5. Assessment of 1D Loadings of City Temperature Data

5.6.6. The Influence Plot of City Temperature Data for 3 PCs

5.6.7. Recalculate the Model without Belgrade

5.6.8. Summary

5.7. Scaling Process Data (Chapter 5)

5.7.1. Description of the Data Set

5.7.2. Evaluation of the Data

5.7.3. Autoscaling the Data

5.7.4. Summary

5.8. Preprocessing of Mid-Infrared Spectroscopic Data of Vegetable Oils (Chapter 5)

5.8.1. Description of the Data Set

5.8.2. Evaluation of the Data

5.8.3. Data Visualization and Descriptive Statistics

5.8.4. Summary

5.9. Preprocessing of Process Near Infrared Spectra (Chapter 5)

5.9.1. Description of the Data Set

5.9.2. Evaluation of the Data

5.9.3. Data Visualization and Descriptive Statistics

5.9.4. Application of SNV to the Data

5.9.5. Application of Multiplicative Scatter Correction (MSC) to the Data

5.9.6. Application of Derivatives to the Data

5.9.7. Application of First Derivative and SNV

5.9.8. Summary

5.10. The Gluten-Starch Data Set: A Difficult Preprocessing Problem (Chapter 5)

5.10.1. Description of the Data Set

5.10.2. Data Visualization and Descriptive Statistics.

5.10.3. Application of Multiplicative Scatter Correction (MSC)


5.10.4. Application of Extended Multiplicative Scatter Correction (EMSC)

5.10.5. Application of Modified Extended Multiplicative Scatter Correction (mEMSC)

5.10.6. Summary

5.11. Octane Number in Gasoline: Part 1 - PCA of Spectra (Chapter 6)

5.11.1. Description of the Data Set

5.11.2. Data Visualization and Grouping

5.11.3. Principal Component Analysis of Gasoline Spectra

5.11.4. Summary

5.12. Alcohols in Water (Chapter 6)

5.12.1. Description of the Data Set

5.12.2. Data Visualization and Grouping

5.12.3. Principal Component Analysis of Alcohol Spectra

5.12.4. Summary

5.13. Detecting Outliers (Troodos) (Chapter 6)

5.13.1. Description of the Data Set

5.13.2. Data Visualization and Grouping

5.13.3. Principal Component Analysis of Troodos Data

5.13.4. Imputation of Missing Values

5.13.5. Full Interpretation Troodos PCA Model

5.13.6. Summary

5.14. Prediction of Alcohol Concentration in Mixtures (Chapter 7)

5.14.1. Description of the Data

5.14.2. Application of Principal Component Regression (PCR) to the Alcohols data set.

5.14.3. Application of Partial Least Squares (PLS) Regression to the Alcohols data set.

5.14.4. Summary

5.15. Development of a Predictive Model Part 2: Octane Number in Gasoline (Chapter 7)

5.15.1. Description of the Data

5.15.2. Application of Partial Least Squares (PLS) Regression to the Octane data set.

5.15.3. Recalculation of Model Without Suspect Samples

5.15.4. Recalculation of the Octane Model Without Selected Variables

5.15.5. Summary

5.16. Prediction of Paper Quality (Chapter 7)

5.16.1. Description of the Data Set

5.16.2. Data Visualization and Grouping

5.16.3. Perform PLS on the Paper Data Set

5.16.4. Recalculate the Paper Model With Important Variables Only


5.16.5. Prediction of New Samples

5.16.6. Summary

5.17. Octane in Gasoline (part 3): Prediction of New Samples Using Various Models (Chapter 7)

5.17.1. Application of the Full Model to the Test Set

5.17.2. Application of the Model Without Outliers to the Test Set

5.17.3. Application of the Model Without Outliers to the Test Set

5.17.4. Summary

5.18. Prediction of Gluten Starch Mixtures (Chapter 7)

5.18.1. Development and Application of PLS to the Raw Data Set

5.18.2. Development and Application of PLS to the MSC Preprocessed Data Set

5.18.3. Development and Application of PLS to the EMSC Preprocessed Data Set

5.18.4. Development and Application of PLS to the mEMSC Preprocessed Data Set

5.18.5. Summary

5.19. Raw Material Classification Using Cluster Analysis (Chapter 10)

5.19.1. Description of the Data Set.

5.19.2. Overview of the Data.

5.19.3. Application of k-Means Clustering to the Data.

5.19.4. Application of Hierarchical Cluster Analysis (HCA) to the Data.

5.19.5. Application of Principal Component Analysis (PCA) to the Data.

5.19.6. Grouping PCA Scores by the Results of Cluster Analysis Methods.

5.19.7. Summary

5.20. Fisher's Iris Data (Chapter 10)

5.20.1. Description of the Data

5.20.2. Data Visualisation

5.20.3. Classification Using k-Means and Hierarchical Cluster Analysis (HCA)

5.20.4. Application of PCA to the Iris Data Set.

5.20.5. Developing a SIMCA Library for the Iris Data

5.20.6. Summary

5.21. Classification of Vegetable Oils Using Supervised Methods (Chapter 10)

5.21.1. Development of PCA Class Models for Vegetable Oils

5.21.2. Classification of Oil Samples Using Partial Least Squares Discriminant Analysis (PLS-DA)

5.21.3. Classification of Vegetable Oils Using Linear Discriminant Analysis (LDA)

5.21.4. Classification of Vegetable Oils Using Support Vector Machine Classification

5.21.5. Summary


5.22. Sports Drink Formulation Using Factorial Designs (Chapter 11)

5.22.1. Description of the Data Set

5.22.2. Building a Design

5.22.3. Summary

5.23. Understanding a Chemical Manufacturing Process Using Designed Experiments (Chapter 11)

5.23.1. Experimental Approach – Define Stage

5.23.2. Analysis of the Fractional Factorial Design.

5.23.3. Extension of the Fractional Factorial Design into a Full Factorial Design.

5.23.4. Summary

5.24. Optimisation of Bread Baking Using a Central Composite Design (Chapter 11)

5.24.1. Optimisation – Define Stage

5.24.2. Optimisation – Design Stage

5.24.3. Joint Optimisation of Two Responses Using Graphical Optimisation.

5.24.4. Summary

5.25. Blending Wines Using a Mixture Design (Chapter 11)

5.25.1. Mixture Design – Design Stage

5.25.2. Mixture Design - Design Analysis

5.25.3. Graphical Optimisation of Wine Preference Criteria.

5.25.4. Summary

5.26. Blending Fruit Juices Using A Constrained Mixture Design (Chapter 11)

5.26.1. Define Stage

5.26.2. Design Stage

5.26.3. Design Table

5.26.4. Design Analysis

5.26.5. Summary

5.27. Fat Content in Fish Using Factor Rotation (Chapter 12)

5.27.1. Visualisation of the Data

5.27.2. PCA of Second Derivative Spectra.

5.27.3. Parsimax Rotation of PC Axes.

5.27.4. Summary

5.28. Chemical Reaction Monitoring Using Multivariate Curve Resolution (MCR) (Chapter 12)

5.28.1. Data Visualisation

5.28.2. Principal Component Analysis (PCA) of the UV-Vis Spectra.

5.28.3. Multivariate Curve Resolution (MCR) of the UV-Vis Data.

5.28.4. Summary


5.29. Combining MCR and PLS to Solve Difficult Problems (Fat in Fish Analysis) (Chapter 12)

5.29.1. Application of MCR to the NIR Spectra of Fish

5.29.2. Application of PLS Regression to Preprocessed NIR Spectra of Fish

5.29.2.1. No Preprocessing

5.29.2.2. Savitzky-Golay Second Derivative

5.29.2.3. Multiplicative Scatter Correction (MSC)

5.29.2.4. Extended Multiplicative Scatter Correction (EMSC).

5.29.2.5. Standard Normal Variate (SNV).

5.29.2.6. Modified Extended Multiplicative Scatter Correction (mEMSC)

5.29.3. Model Comparisons

5.29.4. Summary

6. Resources

7. Final Words of Wisdom


1. Introduction to this tutorial short book

This tutorial study guide provides a step-by-step procedure for performing the software steps used to generate the analyses provided in ‘Multivariate Data Analysis’ 6th edition, published by Camo. Once the data sets used in this tutorial have been downloaded, the procedures described can be followed to see how the final results were arrived at. For the Design of Experiments (DoE) exercises, a valid copy of the Design Expert package is required. If this package is not part of your Unscrambler installation, or if you do not have a standalone version of Design Expert, please contact Camo Analytics for more details.

The tutorials in this short book are best performed using The Unscrambler version 10.5; however, many of the tutorials can be performed using the 10.3 or 10.4 platforms.

Throughout this short book, a number of the data sets are used in multiple chapters to tell a ‘story’ of the data, from preprocessing to data mining and regression analysis. The next section describes the motivation behind each of the data sets used in the tutorials and their relevance in a multivariate data analysis setting.

As always, tutorials are used to gain a better understanding of the functions and special features of The Unscrambler and Design Expert. When analysing the data in the tutorials, it is important that you, as a data analyst, translate what you learn to your own applications and build your own toolkit for data analysis. Applying a tutorial prescriptively to your own data sets is not recommended. Where possible, the steps in the tutorials follow a ‘Define, Design, Analyse, Implement’ logic; that is where the prescriptiveness should stop and your own expertise should come through.

If you perform the tutorials with an open mind for learning, this tutorial book will open up many new insights into The Unscrambler and Design Expert that will allow you to progress in your multivariate analysis journey.

2. Data sets used in this tutorial short book

The datasets used in the tutorials are those that many of the trainers of The Unscrambler around

the world have used in their short courses to describe the power of the Multivariate Analysis and

Design of Experiments methodologies. Up until the 6th edition of the Multivariate Data Analysis

book, the tutorials formed a major part of the book. The authors felt that in the 6th Edition, their

approach to solving data analysis problems should be highlighted to a user so as not to distract from

the learning objectives. This tutorial short book supplements that initial learning with a follow-up

‘hands-on’ experience with the software. With this in mind, we suggest that you read our approach

in the book first, then use the tutorials to reinforce keystroking in the software and then further

investigate the data using all of the analysis and diagnostic tools in the software.

2.1. The Jam Data Set (Chapter 2)

Highlights the functionality of the ‘Descriptive Statistics’ section of the software. Introduces the use

of ‘Box Plots’ for the analysis of sensory data when applied to the taste and appearance of

raspberries used to make fruit jams.

2.2. Product Mass Testing and Method Comparison Testing (Chapter 2)

Demonstrates the use of the ‘Statistical Tests’ functionality of the software. In this tutorial, the use

of normality testing and of tests for equivalence of variances and means is investigated. The

appropriate use of one-sample, two-sample and paired t-tests is described.
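
As a rough illustration of the paired t-test mentioned above, the following Python sketch computes the test statistic by hand using only the standard library. The measurement values and the names `method_a`, `method_b` and `paired_t_statistic` are hypothetical and are not taken from the tutorial data; in the tutorial itself the test is run from the ‘Statistical Tests’ functionality of The Unscrambler.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for a paired t-test on samples a and b."""
    if len(a) != len(b):
        raise ValueError("paired samples must have the same length")
    # A paired test works on the within-pair differences, not the raw values.
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)          # sample standard deviation (n - 1)
    t = mean_diff / (sd_diff / math.sqrt(n))   # H0: the mean difference is zero
    return t, n - 1


# Hypothetical results for the same five samples measured by two methods.
method_a = [10.1, 9.8, 10.3, 10.0, 9.9]
method_b = [10.0, 9.9, 10.1, 9.8, 9.8]

t_stat, dof = paired_t_statistic(method_a, method_b)
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

The statistic is then compared with the critical value of the t-distribution for the chosen significance level (about 2.776 for 4 degrees of freedom at the two-sided 5% level).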

2.3. Beverage Consumption in Europe (Chapter 4)

Investigates a data set collected on the beverage consumption of 17 cities located around Europe

and Scandinavia. Demonstrates how Principal Component Analysis (PCA) can be used

to assess the drinking patterns of various demographics, which is particularly useful in

marketing/product placement studies.
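
For readers who want to see the arithmetic behind a PCA scores and loadings plot, the sketch below computes PCA from the singular value decomposition of the mean-centred data using NumPy. It is only an illustration under stated assumptions: the data matrix is random and hypothetical (17 rows by 5 variables), the function name `pca` is our own, and The Unscrambler may use a different algorithm and different default scaling.

```python
import numpy as np


def pca(X, n_components=2):
    """PCA of a samples-by-variables matrix via SVD of the mean-centred data."""
    Xc = X - X.mean(axis=0)                            # mean-centre each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]    # sample coordinates on the PCs
    loadings = Vt[:n_components].T                     # variable contributions to the PCs
    explained = s**2 / np.sum(s**2)                    # fraction of variance per PC
    return scores, loadings, explained[:n_components]


# Hypothetical table: 17 samples (rows) by 5 beverage variables (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(17, 5))

scores, loadings, explained = pca(X)
print(scores.shape, loadings.shape, explained)
```

Plotting the first column of `scores` against the second gives the kind of scores plot used in the tutorial to look for groupings among the samples.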

2.4. Ripeness of Green Peas (Chapter 4)

An oldie, but a goodie. This classical data set has survived many editions of the book and training

courses due to its educational appeal. Uses a set of sensory data attributes to classify green peas.

This data reveals a hidden variable when external information is used and highlights the graphical

ability of The Unscrambler to reveal the hidden structures.

2.5. Classification of Vegetable Oils Using Spectroscopic Methods (Chapter 4)

This is the first example in the book on the application of multivariate methods to spectroscopic

data. PCA is an excellent cluster analysis method and this tutorial shows how the simple collection of

a spectrum from known oil samples can be used to separate the oils into their respective types. This

example forms the basis of the classification methods discussed in chapter 10 of the book.

2.6. City Temperatures in Europe (Chapter 4)

Monthly average temperature data for 26 European cities were tabulated and the regions of their

origin were provided for possible grouping (clustering). Temperature profiles are similar to spectral

profiles in many ways and what may not be obvious to the naked eye is perfectly clear to PCA. This

example also introduces the analyst to the concept of outliers and how to deal with them.

2.7. Scaling Process Data (Chapter 5)

This tutorial describes how to compare three (or more) variables when their natural scales differ by

orders of magnitude. It also demonstrates some of the simple univariate plotting routines available

in The Unscrambler.
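
One common way to put variables with very different natural scales on a comparable footing is autoscaling: each variable is mean-centred and divided by its standard deviation. The sketch below shows the idea with the Python standard library only. The variable names and values are hypothetical, and the tutorial itself may apply a different scaling option in The Unscrambler.

```python
import statistics


def autoscale(columns):
    """Mean-centre each variable and divide by its sample standard deviation."""
    scaled = {}
    for name, values in columns.items():
        mean = statistics.mean(values)
        sd = statistics.stdev(values)
        scaled[name] = [(v - mean) / sd for v in values]
    return scaled


# Hypothetical process variables on very different natural scales.
process = {
    "temperature_C": [150.0, 152.0, 149.0, 151.0, 153.0],
    "pressure_Pa": [101000.0, 101500.0, 100800.0, 101200.0, 101900.0],
    "additive_fraction": [0.0010, 0.0012, 0.0009, 0.0011, 0.0013],
}

scaled = autoscale(process)
for name, values in scaled.items():
    print(name, [round(v, 2) for v in values])
```

After scaling, every variable has zero mean and unit standard deviation, so no single variable dominates a plot or a model simply because of its units.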

2.8. Preprocessing Mid Infrared Spectra of Vegetable Oils (Chapter 5)

Extends the vegetable oil example introduced in chapter 4. Demonstrates how the use of

application-specific preprocessing techniques can reduce physical effects in spectral data to better

reveal the chemical information in the data.

2.9. Preprocessing of Process Near Infrared Spectra (Chapter 5)

This tutorial investigates the application of preprocessing methods to data collected on a

pharmaceutical formulation during the process of Fluid Bed Drying (FBD). Spectra collected in an FBD

operation are highly affected by physical density and light scattering effects and the use of

application-specific preprocessing methods can minimise this variability to reveal the chemical

information in the data.
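
To give a feel for how a scatter-correction preprocessing step works, the sketch below implements Standard Normal Variate (SNV), one of the methods listed in the contents of this book, in plain Python. The absorbance values, the offset and the scatter factor are hypothetical, and this is not necessarily the preprocessing chosen in the FBD tutorial.

```python
import statistics


def snv(spectrum):
    """Standard Normal Variate: centre and scale one spectrum by its own mean and SD."""
    mean = statistics.mean(spectrum)
    sd = statistics.stdev(spectrum)
    return [(x - mean) / sd for x in spectrum]


# Hypothetical absorbance values at six wavelengths.
reference = [0.20, 0.25, 0.40, 0.55, 0.35, 0.22]
# The same spectrum with an additive offset and a multiplicative scatter factor.
scattered = [0.10 + 1.5 * x for x in reference]

snv_reference = snv(reference)
snv_scattered = snv(scattered)
print([round(v, 3) for v in snv_reference])
print([round(v, 3) for v in snv_scattered])
```

Because SNV works on each spectrum individually, the offset and the scaling factor cancel and the two transformed spectra coincide, leaving only the shape of the spectrum.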

2.10. The Gluten-Starch Data Set: Preprocessing a Difficult Problem (Chapter 5)

This data set is another example of Near Infrared spectroscopy applied to a binary mixture of gluten

and starch in known proportions. Sounds easy… right? This tutorial will highlight the intricacies and

pitfalls to be aware of before the application of any preprocessing method to a data set.

2.11. Octane Number in Gasoline (Chapter 6)

This is another example using Near Infrared spectroscopy to determine whether the technique is

capable of detecting the differences in various grades of gasoline. The use of sample grouping helps

to highlight the hidden classes within the data. This data set is used extensively throughout the book

due to its ability to show in particular that not all things that look like outliers are outliers.

2.12. Alcohols in Water (Chapter 6)

This data set, again based on Near Infrared Spectroscopy, shows how Principal Component Analysis

can be used to solve the Mixture problem. This data set is based on a type of experimental design

known as a Mixture design and the Scores plot can be used to reveal the structure of the design,

provided the preprocessing method used is correct.

2.13. Detecting Outliers: Troodos (Chapter 6)

Another classical data set that has survived a number of editions of the book. The data come from the

field of Geochemistry, specifically from the Troodos region of Cyprus. This tutorial shows how outliers

can be detected and justifiably removed from the data set. Reanalysis of the data without the

outliers reveals the true structure in the samples.

2.14. Prediction of Alcohol Concentration in Mixtures (Chapter 7)

This tutorial extends the Alcohols in Water example introduced in chapter 6. It shows the

analyst how to develop a Principal Component Regression (PCR) model in The Unscrambler, how

to interpret it and, most importantly, how to validate the model.

2.15. Development of a Predictive Model of Octane Number in Gasoline (Chapter 7)

This tutorial extends the Octane Number in Gasoline example introduced in chapter 6. It shows

the analyst how to develop a Partial Least Squares (PLS) regression model in The Unscrambler and

explains why some visual outliers are not actually outliers. This tutorial also shows the analyst

how to apply PLS regression models to new data in order to predict new values for unknown

samples.

2.16. Prediction of Paper Quality (Chapter 7)

This tutorial presents a set of process variables used in the paper manufacturing industry and

determines whether such variables can be used to predict the quality indicator Print Through, the

amount of visibility of ink from one side of the paper when viewed through the other side. This

tutorial uses PLS regression to model the data and introduces Martens’ Uncertainty Test as a method

for Variable Selection. Application of the model generated to a separate prediction set is performed

and the use of model diagnostics is introduced to show a user how such diagnostics can be used to

determine the quality of a prediction.

2.17. Prediction of Octane Number in Gasoline (Chapter 7)

This tutorial provides an analyst with the steps for evaluating PLS regression models when applied to

new data in order to predict new values for unknown samples for the various models developed in

the tutorial described in 2.15.

2.18. Prediction of Gluten-Starch Mixtures (Chapter 7)

This tutorial is an extension of the tutorial described in section 2.10, where PLS models are developed

for the data using the various preprocessing methods and these models are used to predict the

gluten content in a test set of samples. When all models have been developed, a comparison of the

predictive ability of each model using the optimal number of factors and a one-factor model is made.

2.19. Raw Material Identification Using Cluster Analysis (Chapter 10)

One of the most powerful applications of non-destructive spectroscopic methods is their use in the

identification of incoming materials, particularly in highly regulated industries such as the

pharmaceutical and related industries. This tutorial introduces the unsupervised cluster analysis

methods applied to spectra of three different raw materials so that an analyst can investigate the

outputs and graphical capabilities of The Unscrambler.

2.20. Fisher’s Iris Classification Data (Chapter 10)

This is the classical data set used to verify nearly every cluster analysis method developed. The data,

published in the 1930s by Sir Ronald Fisher, were used in an attempt to classify Iris species by four

characteristics: Sepal Length, Sepal Width, Petal Length and Petal Width. This tutorial investigates

the use of unsupervised methods for the classification of the Iris data and introduces one of the

most powerful supervised classification methods, known as Soft Independent Modelling of Class

Analogy (SIMCA).

2.21. Classification of Vegetable Oils Using Supervised Classification (Chapter 10)

This tutorial is an extension of the data set introduced in section 2.8, where the method of infrared

spectroscopy was used to analyse samples of various vegetable oil types. The methods investigated

in this tutorial are SIMCA, Linear Discriminant Analysis (LDA), Partial Least Squares Discriminant

Analysis (PLS-DA) and Support Vector Machines (SVM) Classification.

2.22. Sports Drink Formulation Using Factorial Designs (Chapter 11)

This tutorial provides a first introduction to the development and analysis of a simple Factorial

Design using the Design Expert package. This is an excellent first step to exploring the Design Expert

software and it is suggested this tutorial is performed first, even if you have experience with the

software as it provides much detail on keystroking, which is reduced in later tutorials.

2.23. Understanding a Chemical Manufacturing Process Using Full and Fractional Factorial Designs (Chapter 11)

This tutorial shows how Fractional Factorial Designs can be extended into Full Factorial Designs

without having to repeat any of the initial experiments. The concept of Blocking is demonstrated

and a full analysis of the model with its interpretation is provided.
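
The idea of extending a fractional design into a full one without repeating runs can be sketched for a small case. The Python example below assumes three two-level factors and a half fraction generated by C = AB; these choices are hypothetical and the tutorial's actual design may use more factors and a different generator. The complementary fraction (C = -AB) supplies exactly the runs that were missing.

```python
import itertools


def half_fraction(sign):
    """Runs of a 2^(3-1) design in coded units, with generator C = sign * A * B."""
    return [(a, b, sign * a * b) for a, b in itertools.product([-1, 1], repeat=2)]


block_1 = half_fraction(+1)   # initial fractional design (4 runs)
block_2 = half_fraction(-1)   # complementary fraction run later (4 new runs)

combined = block_1 + block_2
full = list(itertools.product([-1, 1], repeat=3))

# The two blocks share no runs and together form the full 2^3 factorial.
print(sorted(combined) == sorted(full))
```

Treating the two fractions as separate blocks allows any shift between the two experimental campaigns to be accounted for; in this small case the block effect is confounded with the three-factor interaction ABC.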

2.24. Optimisation of Bread Baking Using a Central Composite Design (CCD) (Chapter 11)

This tutorial introduces one of the most commonly used Optimisation designs known as the Central

Composite Design (CCD). A CCD is a composite of an initial Factorial Design with points that extend

the design outside of the original such that all design points lie on the surface of a sphere. It also

uses centre points in the model where polynomials up to the quartic can be used to analyse the

data. In this case, we investigate how to optimise two attributes of bread individually and then use

the method of Graphical Optimisation to find a space in the design where both parameters are

jointly optimised.
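
The geometry described above can be checked with a short Python sketch that builds the coded design points of a CCD. It assumes the axial distance is set to the square root of the number of factors, which places the axial points on the same sphere as the factorial corners; the function name `central_composite` and the number of centre points are our own choices, and Design Expert offers other options for the axial distance.

```python
import itertools
import math


def central_composite(k, n_center=3):
    """Coded design points of a spherical CCD for k factors."""
    alpha = math.sqrt(k)  # axial points at the same radius as the factorial corners
    factorial = [list(p) for p in itertools.product([-1.0, 1.0], repeat=k)]
    axial = []
    for i in range(k):
        for value in (-alpha, alpha):
            point = [0.0] * k
            point[i] = value
            axial.append(point)
    center = [[0.0] * k for _ in range(n_center)]
    return factorial + axial + center


design = central_composite(2)
for point in design:
    radius = math.sqrt(sum(x * x for x in point))
    print([round(x, 3) for x in point], round(radius, 3))
```

For two factors, the four factorial and four axial points all lie at a distance of the square root of 2 from the centre, while the replicated centre points sit at the origin.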

2.25. Blending Wines Using a Mixture Design (Chapter 11)

This tutorial introduces the concept of the Mixture Design. This type of design is based on a

Factorial Design; however, the design is naturally Constrained, i.e. the mixture components are all

dependent on each other. We evaluate two responses in this design and use Graphical Optimisation

to determine if a suitable blend of wines can be made (as determined by a trained sensory panel) at

a cost that a consumer will consider acceptable.
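
The constraint that mixture components depend on each other simply means that the proportions must sum to one. As a small illustration, the Python sketch below lists the candidate blends of a simplex-lattice design, in which every proportion is a multiple of 1/m. The choice of three components and m = 2 is hypothetical, and the wine tutorial may use a different type of mixture design.

```python
import itertools
from fractions import Fraction


def simplex_lattice(q, m):
    """All blends of q components whose proportions are multiples of 1/m and sum to 1."""
    points = []
    for counts in itertools.product(range(m + 1), repeat=q):
        if sum(counts) == m:
            points.append(tuple(Fraction(c, m) for c in counts))
    return points


# Hypothetical three-component blend with proportions in steps of 1/2.
lattice = simplex_lattice(3, 2)
for blend in lattice:
    print([float(x) for x in blend])
```

Each blend lies on the simplex (a triangle for three components): the three pure components at the vertices and the three 50:50 binary blends at the edge midpoints.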

2.26. Blending Fruit Juices Using a Constrained Mixture Design (Chapter 11)

This tutorial extends the concepts of the previous tutorial and shows how a Lower Bound on one

component imposes Upper Bounds on the rest of the components in the mixture. When only Lower

Bounds are placed on components, the design is still Simplex Shaped and can be analysed using the

standard methods used for Designed Experiments. In this tutorial, we only model one response and

to find the most acceptable blends, we use Numerical Optimisation to maximise the inclusion of

some components and minimise the addition of others, while at the same time, ensuring the final

blend is still acceptable to the consumer. This tutorial is an adaptation of a classical problem

described in the book by John Cornell, Mixture Designs.
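
The way Lower Bounds imply Upper Bounds follows directly from the sum-to-one constraint: a component can take at most what is left after every other component has received its minimum. The Python sketch below shows this arithmetic for a hypothetical three-juice blend; the bound values and the function name `implied_upper_bounds` are illustrative only and are not taken from the tutorial.

```python
def implied_upper_bounds(lower):
    """Upper bound on each component implied by the lower bounds on the others."""
    total = sum(lower)
    if total >= 1.0:
        raise ValueError("lower bounds must sum to less than 1")
    # Component i can use at most whatever the other lower bounds leave free.
    return [1.0 - (total - lo) for lo in lower]


# Hypothetical lower bounds for a three-juice blend (proportions of the total).
lower = [0.2, 0.1, 0.3]
upper = implied_upper_bounds(lower)
print([round(u, 3) for u in upper])
```

With these bounds the first juice can make up at most 1 - 0.1 - 0.3 = 0.6 of the blend. The feasible region is a smaller triangle inside the original one, which is why the design remains Simplex Shaped.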

2.27. Fat Content in Fish Using Factor Rotation (Chapter 12)

This tutorial is a first introduction to PCA Rotation (also called Factor Rotation) in order to find the

Simple Structure in the data set. The data set consists of NIR spectra collected in transmission mode

on fish samples with associated Fat reference measurements. We show how PCA can provide

abstract components that describe the variance in the data set; however, PCA Rotation can be used

to find a set of factors that better describe the chemistry of the samples.

Even though the new factors are orthogonal to each other, they are no longer orthogonal to the

original PC axes used to describe them; therefore, the new factors are not independent of each

other.

2.28. Chemical Reaction Monitoring Using Multivariate Curve Resolution (MCR) (Chapter 12)

This tutorial introduces the concepts of Multivariate Curve Resolution (MCR) applied to a set of

Ultraviolet-Visible (UV-VIS) spectra collected from a chemical reaction. We show how PCA can

reveal the majority of the information in the data set; however, MCR is better able to provide

chemically meaningful information from the data.

In particular, MCR is able to find a reaction intermediate profile in the data, consistent with the

kinetics of the reaction, and also provide an estimated pure spectrum of the intermediate, which

otherwise could not have been physically or chemically separated from the reaction

mixture.

2.29. Combining MCR and PLS to Solve Difficult Problems (Fat in Fish Analysis) (Chapter 11)

This tutorial uses the NIR spectra for the Fat in Fish data described in 2.27, where MCR is used to find a

pure spectrum of Fat in the spectra. This estimated spectrum is used as a Good Spectrum in

Modified Extended Multiplicative Scatter Correction (mEMSC). A number of typical preprocessing

methods used for such data are applied and a comparison of the results is made to show which

methods provide the best results.