The Partial Least Squares Action Set


This document is an individual chapter from SAS® Visual Data Mining and Machine Learning 8.1: Statistics Programming Guide.

The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2016. SAS® Visual Data Mining and Machine Learning 8.1: Statistics Programming Guide. Cary, NC: SAS Institute Inc.

SAS® Visual Data Mining and Machine Learning 8.1: Statistics Programming Guide

Copyright © 2016, SAS Institute Inc., Cary, NC, USA

All Rights Reserved. Produced in the United States of America.

For a hard-copy book: No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.

For a web download or e-book: Your use of this publication shall be governed by the terms established by the vendor at the time you acquire this publication.

The scanning, uploading, and distribution of this book via the Internet or any other means without the permission of the publisher is illegal and punishable by law. Please purchase only authorized electronic editions and do not participate in or encourage electronic piracy of copyrighted materials. Your support of others’ rights is appreciated.

U.S. Government License Rights; Restricted Rights: The Software and its documentation is commercial computer software developed at private expense and is provided with RESTRICTED RIGHTS to the United States Government. Use, duplication, or disclosure of the Software by the United States Government is subject to the license terms of this Agreement pursuant to, as applicable, FAR 12.212, DFAR 227.7202-1(a), DFAR 227.7202-3(a), and DFAR 227.7202-4, and, to the extent required under U.S. federal law, the minimum restricted rights as set out in FAR 52.227-19 (DEC 2007). If FAR 52.227-19 is applicable, this provision serves as notice under clause (c) thereof and no other notice is required to be affixed to the Software or documentation. The Government’s rights in Software and documentation shall be only those set forth in this Agreement.

SAS Institute Inc., SAS Campus Drive, Cary, NC 27513-2414

September 2016

SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are trademarks of their respective companies.

SAS software may be provided with certain third-party software, including but not limited to open-source software, which is licensed under its applicable third-party software license agreement. For license information about third-party software distributed with SAS software, refer to http://support.sas.com/thirdpartylicenses.


Chapter 4

The Partial Least Squares Action Set

Contents

Details
    The pls Action
Syntax
Examples
    Example 4.1: Spectrometric Calibration
References

Details

The pls Action

The pls action fits models in SAS Viya by using any one of a number of linear predictive methods, including partial least squares (PLS). The pls action performs the underlying computations for the PLSMOD procedure in SAS Viya.

Ordinary least squares regression has the single goal of minimizing sample response prediction error, and it seeks linear functions of the predictors that explain as much variation in each response as possible. The pls action implements techniques that have the additional goal of accounting for variation in the predictors, under the assumption that directions in the predictor space that are well sampled should provide better prediction for new observations when the predictors are highly correlated. All the techniques that the pls action implements work by extracting successive linear combinations of the predictors, called factors (also called components, latent vectors, or latent variables), which optimally address one or both of these two goals: explaining response variation and explaining predictor variation. In particular, the method of partial least squares balances the two objectives by seeking factors that explain both response and predictor variation.

The name “partial least squares” also applies to a more general statistical method that is not implemented in this action. The partial least squares method was originally developed in the 1960s by the econometrician Herman Wold (1966) for modeling “paths” of causal relation between any number of “blocks” of variables. However, the pls action fits only predictive partial least squares models that have one “block” of predictors and one “block” of responses.

The pls action implements the following methods:

• principal component regression, which extracts factors to explain as much predictor sample variation as possible

• reduced rank regression, which extracts factors to explain as much response variation as possible. This technique, also known as (maximum) redundancy analysis, differs from multivariate linear regression only when there are multiple responses.

• partial least squares regression, which balances the two objectives of explaining response variation and explaining predictor variation. Two different formulations for partial least squares are available: the original predictive method of Wold (1966) and the straightforward implementation of a statistically inspired modification of the partial least squares (SIMPLS) method of De Jong (1993).
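The difference between the criteria can be illustrated for the first factor and a single response. The following Python sketch is not the action's implementation; it only applies two standard results for centered data: the first principal component weight is the dominant eigenvector of X'X, and the first PLS weight for one response is proportional to X'y. The toy data and the function names (first_pcr_weight, first_pls_weight) are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def first_pcr_weight(X, iters=50):
    """Principal component direction: dominant eigenvector of X'X, found by
    power iteration (assumes the start vector is not orthogonal to it)."""
    n, p = len(X), len(X[0])
    w = normalize([1.0] * p)
    for _ in range(iters):
        t = [sum(X[i][j] * w[j] for j in range(p)) for i in range(n)]             # t = X w
        w = normalize([sum(X[i][j] * t[i] for i in range(n)) for j in range(p)])  # w = X't
    return w

def first_pls_weight(X, y):
    """PLS direction for one response: proportional to X'y."""
    n, p = len(X), len(X[0])
    return normalize([sum(X[i][j] * y[i] for i in range(n)) for j in range(p)])

# Hypothetical centered toy data: column 1 varies a lot but is unrelated to y;
# column 2 varies little but determines y.
X = [[4.0, 1.0], [-4.0, 1.0], [4.0, -1.0], [-4.0, -1.0]]
y = [1.0, 1.0, -1.0, -1.0]

w_pcr = first_pcr_weight(X)     # follows the predictor variation (column 1)
w_pls = first_pls_weight(X, y)  # follows the covariance with the response (column 2)
print(w_pcr, w_pls)
```

In this constructed case the principal component direction ignores the response entirely, whereas the PLS direction is pulled toward the predictor that actually covaries with it.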

The pls action has the following features:

• provides model-building syntax with classification variables, continuous variables, interactions, and nestings

• provides effect-construction syntax for polynomial and spline effects

• partitions your data into training and testing subsets

• provides test set validation to choose the number of extracted factors, where the model is fit to only part of the available data (the training set) and the fit is evaluated over the other part of the data (the test set)

• produces an output data table that contains predicted values and other observationwise statistics

Because the pls action runs on CAS, it also does the following:

• enables you to run on a cluster of machines that distribute the data and the computations

• exploits all the available cores and concurrent threads

For more information about predictive partial least squares models and the capabilities of the pls action, see the section “Details: PLSMOD Procedure” (Chapter 8, SAS Visual Data Mining and Machine Learning: Statistical Procedures).

Results

The results that the pls action produces are stored in tables. You can access these tables by using their names, which are shown in Table 4.1. For more information about the contents of these tables, see the section “Displayed Output” (Chapter 8, SAS Visual Data Mining and Machine Learning: Statistical Procedures).

The table names are the last level of the path of the output objects; paths depend on the grouping structure of the output tables. For example, when you use PROC CAS on the SAS client to submit the action, the ResidualSummary table can have paths like the following: CAS.pls.CrossValidation.ResidualSummary. However, if you run the action in Lua or Python, the first two levels (CAS.pls.) are omitted. You can obtain the full paths by submitting code as shown in Example 4.1.


Table 4.1 Results Tables Produced by the pls Action

Table Name          Description                                                             Parameter: Subparameter

CVResults           Results of test set validation                                          partitionByVar or partitionByFrac
CenScaleParms       Parameter estimates for centered and scaled data                        model: solution
ClassInfo           Level information for classification variables                          class or classVars
CodedCoef           Coded regression coefficients                                           details
Dimensions          Model dimensions                                                        Default
ModelInfo           Information about the modeling environment                              Default
NObs                Number of observations read and used                                    Default
OutputCasTables     Library and name of the output data table, and number of rows and       output or outputTables
                    columns in the table
ParameterEstimates  Parameter estimates for raw data                                        model: solution
PercentVariation    Predictor and response variation that are accounted for by each factor  Default
ResidualSummary     Residual summary from test set validation                               partitionByVar or partitionByFrac
Timing              Absolute and relative times for tasks performed by the action           Default
XEffectCenScale     Centering and scaling information for predictor effects                 cenScale
XLoadings           Loadings for predictor effects                                          details
XPercentVariation   Variation that is accounted for by each factor for predictor effects    varss
XWeights            Weights for predictor effects                                           details
YPercentVariation   Variation that is accounted for by each factor for responses            varss
YVariableCenScale   Centering and scaling information for responses                         cenScale
YWeights            Weights for responses                                                   details

NOTE: You use these table names when you specify the display and outputTables parameters.

Partial Least Squares Action Set: Syntax

pls Action

CASL Syntax

pls.pls <result=results> <status=rc> /

attributes={{

format="string",

formattedLength=integer,

label="string",

* name="variable-name",

nfd=integer,

nfl=integer}, {...}}

cenScale=TRUE | FALSE

class={{

countMissing=TRUE | FALSE,

descending=TRUE | FALSE,

ignoreMissing=TRUE | FALSE,

levelizeRaw=TRUE | FALSE,

maxLev=integer,

order="FORMATTED" | "FREQ" | "FREQFORMATTED" | "FREQINTERNAL" | "INTERNAL",

param="BTH" | "EFFECT" | "GLM" | "ORDINAL" | "ORTHBTH" | "ORTHEFFECT" | "ORTHORDINAL" | "ORTHPOLY" | "ORTHREF" | "POLYNOMIAL" | "REFERENCE",

ref="FIRST" | "LAST" | "string",

split=TRUE | FALSE,

* vars={"variable-name-1" <, "variable-name-2", ...>}}, {...}}

classGlobalOptions={

countMissing=TRUE | FALSE,

descending=TRUE | FALSE,

ignoreMissing=TRUE | FALSE,

levelizeRaw=TRUE | FALSE,

maxLev=integer,

order="FORMATTED" | "FREQ" | "FREQFORMATTED" | "FREQINTERNAL" | "INTERNAL",

param="BTH" | "EFFECT" | "GLM" | "ORDINAL" | "ORTHBTH" | "ORTHEFFECT" | "ORTHORDINAL" | "ORTHPOLY" | "ORTHREF" | "POLYNOMIAL" | "REFERENCE",

ref="FIRST" | "LAST" | "string",

split=TRUE | FALSE}

classLevelsPrint=TRUE | FALSE

collection={{

details=TRUE | FALSE,

* name="string",


cvTest={

nSamp=integer,

pValue=double,

seed=integer,

stat="PRESS" | "T2"}

details=TRUE | FALSE

display={

caseSensitive=TRUE | FALSE,

exclude=TRUE | FALSE,

excludeAll=TRUE | FALSE,

keyIsPath=TRUE | FALSE,

names={"string-1" <, "string-2", ...>},

pathType="LABEL" | "NAME",

traceNames=TRUE | FALSE}

inputs={{

format="string",


label="string",


nfd=integer,


* method={

algorithm="EIG" | "NIPALS" | "SVD",

epsilon=double,

maxIter=integer,

* name="PCR" | "PLS" | "RRR" | "SIMPLS"

}

model={

depVars={{

name="variable-name"}, {...}},

effects={{

interaction="BAR" | "CROSS" | "NONE",

maxInteract=integer,

nest={"string-1" <, "string-2", ...>},

* vars={"string-1" <, "string-2", ...>}}, {...}},

intercept=TRUE | FALSE,

solution=TRUE | FALSE,}

multimember={{


* name="string",

noEffect=TRUE | FALSE,

stdize=TRUE | FALSE,

* vars={"variable-name-1" <, "variable-name-2", ...>},

weight={"variable-name-1" <, "variable-name-2", ...>}}, {...}}

nClassLevelsPrint=integer

nFactors=integer

noCVStdize=TRUE | FALSE

noCenter=TRUE | FALSE

noScale=TRUE | FALSE

nominals={{

format="string",


label="string",


nfd=integer,


output={

* casOut={

caslib="string"

compress=TRUE | FALSE

label="string"

maxMemSize=64-bit-integer

name="table-name"

onDemand=TRUE | FALSE

promote=TRUE | FALSE

replace=TRUE | FALSE

replication=integer

timeStamp="string"},

copyVars="ALL" | "ALL_MODEL" | "ALL_NUMERIC" | {"variable-name-1" <, "variable-name-2", ...>},

h="string",

predicted="string",

press="string",

role="string",

t2="string",

xResidual="string",

xScore="string",

xStd="string",

xStdsse="string",

yResidual="string",

yScore="string",

yStd="string",

yStdsse="string"}

outputTables={

groupByVarsRaw=TRUE | FALSE,

names={"string-1" <, "string-2", ...>} | {key-1={casouttable-1} <, key-2={casouttable-2}, ...>},

repeated=TRUE | FALSE,

replace=TRUE | FALSE}

partitionByFrac={

seed=integer,

test=double,}

partitionByVar={


test="string",

train="string",}

polynomial={{

degree=integer,


labelStyle={

expand=TRUE | FALSE

exponent="string"

includeName=TRUE | FALSE

productSymbol="NONE" | "string"},

mDegree=integer,

* name="string",

noSeparate=TRUE | FALSE,

standardize={

method="MOMENTS" | "MRANGE" | "WMOMENTS"

options="CENTER" | "CENTERSCALE" | "NONE" | "SCALE"

prefix="NONE" | "string"},


spline={{

basis="BSPLINE" | "TPF_DEFAULT" | "TPF_NOINT" | "TPF_NOINTANDNOPOWERS" | "TPF_NOPOWERS",

dataBoundary=TRUE | FALSE,

degree=integer,


knotMax=double,

knotMethod={

equal=integer

list={double-1 <, double-2, ...>}

listWithBoundary={double-1 <, double-2, ...>}

multiscale={

endScale=integer

startScale=integer}

rangeFractions={double-1 <, double-2, ...>}},

knotMin=double,

* name="string",

naturalCubic=TRUE | FALSE,

separate=TRUE | FALSE,

split=TRUE | FALSE,


* table={

caslib="string",

computedOnDemand=TRUE | FALSE,

computedVars={{

format="string",


label="string",


nfd=integer,

nfl=integer}, {...}},

computedVarsProgram="string",

importOptions={fileType="AUTO" | "BASESAS" | "CSV" | "DTA" | "ESP" | "EXCEL" | "FMT" | "HDAT" | "JMP" | "LASR" | "MVA" | "SPSS" | "XLS", fileType-specific-parameters},

* name="table-name",

onDemand=TRUE | FALSE,

orderBy={{

format="string",


label="string",


nfd=integer,


singlePass=TRUE | FALSE,

vars={{

format="string",


label="string",


nfd=integer,


where="where-expression"}

target="string"

varss=TRUE | FALSE;

Parameter Descriptions

attributes={{casinvardesc-1} <, {casinvardesc-2}, ...>}
alters attributes on variables used in this action.

To simplify, you can specify only a value for the name parameter. In this case, the other parameters in the list use default values.

For more information about this common parameter, see casinvardesc.

Alias attribute

cenScale=TRUE | FALSE
when set to True, displays the centering and scaling information.

Default FALSE

class={{classStatement-1} <, {classStatement-2}, ...>}
specifies the classification variables to be used as explanatory variables in the analysis.

To simplify, you can specify only a value for the vars parameter. In this case, the other parameters in the list use default values.

For more information about this common parameter, see classStatement.

Alias classVars

classGlobalOptions={classopts}
specifies options that apply to all classification variables.

To simplify, you can specify only a value for the param parameter. In this case, the other parameters in the list use default values.

For more information about this common parameter, see classopts.

Alias classGlobalOpts

classLevelsPrint=TRUE | FALSE
when set to False, suppresses the display of class levels.

Default TRUE

collection={{collection-1} <, {collection-2}, ...>}
defines a set of variables that are treated as a single effect that has multiple degrees of freedom.

cvTest={cvTestOptions}
performs van der Voet's randomization-based model comparison test.

To simplify, you can specify only a value for the stat parameter. In this case, the other parameters in the list use default values.

The cvTestOptions value can be one or more of the following:

nSamp=integer
specifies the number of randomizations to perform.

Default 1000

Range 0–MACINT

pValue=double
specifies the cutoff probability for declaring an insignificant difference.

Alias pVal

Default 0.1

Range 0–1

seed=integer
specifies the seed value for the random number stream.

Default 0

Minimum value 0

stat="PRESS" | "T2"
specifies the test statistic for the model comparison. You can specify either T2, for Hotelling's T^2 statistic, or PRESS, for the predicted residual sum of squares.

Default T2
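The idea of a randomization-based comparison with the PRESS statistic can be sketched as follows. This is a simplified illustration, not the action's algorithm: it assumes that each model supplies one squared predicted residual per observation, that the statistic is the difference in PRESS, and that each randomization swaps the two models' residuals for a random subset of observations. The function name randomization_p_value and the residual values are hypothetical; n_samp and seed only play the roles that nSamp and seed play above.

```python
import random

def randomization_p_value(sq_resid_a, sq_resid_b, n_samp=1000, seed=1):
    """One-sided randomization p-value for 'model A predicts worse than model B'.

    sq_resid_a, sq_resid_b: squared predicted residuals, one per observation.
    """
    diffs = [a - b for a, b in zip(sq_resid_a, sq_resid_b)]
    observed = sum(diffs)                     # PRESS(A) - PRESS(B)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samp):
        # Swapping the two models for an observation flips the sign of its difference.
        stat = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if stat >= observed:
            hits += 1
    return hits / n_samp

# Hypothetical squared residuals for 12 observations.
same = [0.5] * 12
worse = [1.5] * 12

p_same = randomization_p_value(same, same)    # identical models: no evidence of a difference
p_worse = randomization_p_value(worse, same)  # A is consistently worse than B
print(p_same, p_worse)
```

Under this reading, a p-value above the pValue cutoff means that the model being tested is not declared significantly worse than the reference model.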

details=TRUE | FALSE
when set to True, displays the details of the fitted model.

Alias detail

Default FALSE

display={displayTables}
specifies a list of result tables to send to the client for display.

To simplify, you can specify only a value for the names parameter. In this case, the other parameters in the list use default values.

For more information about this common parameter, see displayTables.

inputs={{casinvardesc-1} <, {casinvardesc-2}, ...>}
specifies variables to use for analysis.

Alias input

* method={methodOptions}
specifies the settings for the general factor extraction method.


The methodOptions value can be one or more of the following:

algorithm="EIG" | "NIPALS" | "SVD"
specifies the algorithm used to compute extracted PLS factors.

Alias alg

Default NIPALS

epsilon=double
specifies the convergence criterion for the NIPALS algorithm.

Alias eps

Default 1e-12

Range 0–1

maxIter=integer
specifies the maximum number of iterations for the NIPALS algorithm.

Default 200

Range 0–MACINT

* name="PCR" | "PLS" | "RRR" | "SIMPLS"
specifies the name of the general factor extraction method to use.
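To show where a convergence criterion and an iteration limit enter an iterative extraction, the following Python sketch runs a textbook NIPALS iteration for a single PLS factor on centered data. It is a generic version of the technique, not the action's code. The stopping rule (relative squared change in the X-scores at most epsilon) is an assumption, because this chapter does not state the exact test that the action applies, and the function name nipals_factor and the toy data are hypothetical.

```python
import math

def nipals_factor(X, Y, epsilon=1e-12, max_iter=200):
    """Extract one PLS factor from centered X (n x p) and Y (n x m).

    Returns (w, t, iterations): unit-length X-weights, X-scores, iterations used.
    Assumes X'u is never the zero vector.
    """
    n, p, m = len(X), len(X[0]), len(Y[0])
    u = [row[0] for row in Y]                  # start from the first response column
    t_old = None
    for iteration in range(1, max_iter + 1):
        w = [sum(X[i][j] * u[i] for i in range(n)) for j in range(p)]      # w = X'u
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
        t = [sum(X[i][j] * w[j] for j in range(p)) for i in range(n)]      # t = X w
        tt = sum(v * v for v in t)
        q = [sum(Y[i][k] * t[i] for i in range(n)) / tt for k in range(m)]  # q = Y't / t't
        qq = sum(v * v for v in q)
        u = [sum(Y[i][k] * q[k] for k in range(m)) / qq for i in range(n)]  # u = Y q / q'q
        if t_old is not None:
            change = sum((a - b) ** 2 for a, b in zip(t, t_old))
            if change <= epsilon * tt:         # assumed stopping rule
                return w, t, iteration
        t_old = t
    return w, t, max_iter

# Hypothetical centered toy data: both responses depend only on the second predictor.
X = [[4.0, 1.0], [-4.0, 1.0], [4.0, -1.0], [-4.0, -1.0]]
Y = [[1.0, 2.0], [1.0, 2.0], [-1.0, -2.0], [-1.0, -2.0]]

w, t, iterations = nipals_factor(X, Y)
print(w, t, iterations)
```

At its fixed point the weight vector is the dominant eigenvector of X'YY'X, so an eigenvalue or singular value decomposition can in principle reach the same factor without iterating.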

model={modelOptions}
specifies the responses and the predictors, which determine the Y and X matrices of the model, respectively.

To simplify, you can specify only a value for the depVars parameter. In this case, the other parameters in the list use default values.

The modelOptions value can be one or more of the following:

depVars={{responsevar-1} <, {responsevar-2}, ...>}
specifies a list of variables and variable-specific parameters to use as response variables in the modeling. When variable-specific parameters are not needed, you can simplify the value of the parameter by specifying a single response variable name as a quoted string.

Alias depVar, target

The responsevar value can be one or more of the following:

name="variable-name"
defines a response variable.

effects={{effect-1} <, {effect-2}, ...>}
specifies a list of effects that define the model. Each term in this list is made up of variables specified in the vars parameter and their interaction (which can be NONE, CROSS, or BAR). When the interaction is BAR, it can be limited by the maxInteract parameter. Any term that consists of only the vars parameter can be simplified by specifying the list of variable names directly in the effect parameter.

To simplify, you can specify only a value for the vars parameter. In this case, the other parameters in the list use default values.

The effect value can be one or more of the following:

interaction="BAR" | "CROSS" | "NONE"
specifies the type of interaction for the variables.

Alias interact

Default NONE

maxInteract=integer
eliminates higher-order interaction effects when used in conjunction with the BAR interaction.

nest={"string-1" <, "string-2", ...>}
specifies the variables that are nested within the term defined by the vars parameter. For terms with a BAR or CROSS interaction, the nest corresponds to the last variable in the vars parameter. For terms with no interaction, the nest is distributed across all variables listed in the vars parameter.

* vars={"string-1" <, "string-2", ...>}
specifies the variables to be used in defining a term of the effect. You must specify at least one such variable.

intercept=TRUE | FALSE
when set to True, includes the intercept term in the model.

Default FALSE

solution=TRUE | FALSE
when set to True, displays the coefficients of the final predictive model for the responses.

Default FALSE

multimember={{multimember-1} <, {multimember-2}, ...>}
uses one or more classification variables specified in the vars parameter in such a way that each observation can be associated with one or more levels of the union of the levels of the classification variables.

nClassLevelsPrint=integer
limits the display of class levels. The value 0 suppresses all levels.

Minimum value 0

nFactors=integer
specifies the number of factors to extract.

Alias nFac, lv

Default 0

Minimum value 0

noCVStdize=TRUE | FALSE
when set to True, suppresses re-centering and rescaling of the responses and predictors when cross validating.

Default FALSE

noCenter=TRUE | FALSE
when set to True, suppresses centering of the responses and predictors before fitting.

Default FALSE

noScale=TRUE | FALSE
when set to True, suppresses scaling of the responses and predictors before fitting.

Default FALSE
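The effect of the noCenter and noScale parameters can be pictured with a small Python sketch of column standardization. It assumes that centering subtracts the column mean and that scaling divides by the sample standard deviation (divisor n - 1); the chapter does not spell out the divisor or how scaling behaves when centering is suppressed, so both are assumptions, and the function name standardize is hypothetical.

```python
import math

def standardize(column, center=True, scale=True):
    """Center and/or scale one column of values.

    center=False mimics noCenter=TRUE; scale=False mimics noScale=TRUE.
    Assumes at least two observations and a nonconstant column.
    """
    n = len(column)
    mean = sum(column) / n
    out = [x - mean for x in column] if center else list(column)
    if scale:
        # Sample standard deviation about the mean (n - 1 divisor): an assumption.
        sd = math.sqrt(sum((x - mean) ** 2 for x in column) / (n - 1))
        out = [x / sd for x in out]
    return out

both = standardize([2.0, 4.0, 6.0])                   # centered and scaled
no_scale = standardize([2.0, 4.0, 6.0], scale=False)  # centered only
no_center = standardize([2.0, 4.0, 6.0], center=False)  # scaled only
print(both, no_scale, no_center)
```

Because the predictors in a problem such as spectrometric calibration can differ widely in magnitude, the default of centering and scaling puts all columns on a comparable footing before factors are extracted.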

nominals={{casinvardesc-1} <, {casinvardesc-2}, ...>}
specifies nominal variables to use for analysis.

Alias nominal

output={outputOptions}
creates a data table on the server that contains observationwise statistics, which are computed after fitting the model.

To simplify, you can specify only a value for the casOut parameter. In this case, the other parameters in the list use default values.

The outputOptions value can be one or more of the following:

* casOut={casouttable}
specifies the settings for an output table.

For more information about this common parameter, see casouttable.

copyVars="ALL" | "ALL_MODEL" | "ALL_NUMERIC" | {"variable-name-1" <, "variable-name-2", ...>}
specifies a list of one or more variables to be copied from the input table to the output table. You can alternatively specify the value ALL, ALL_MODEL, or ALL_NUMERIC, which respectively request that all variables, all variables used in the modeling, or all numeric variables be copied from the input table to the output table.

h="string"
requests the approximate leverage. If set to an empty string, the prefix H is used for naming the output variable.

predicted="string"
requests predicted values for each response. If set to an empty string, the prefix Pred is used for naming the output variables.

Alias p, pred

press="string"
requests approximate predicted residuals for each response. If set to an empty string, the prefix PRESS is used for naming the output variables.

role="string"
requests numeric values that indicate the role played by each observation in fitting the model. If set to an empty string, the prefix _ROLE_ is used for naming the output variable.

t2="string"
requests scaled sum of squares of score values. If set to an empty string, the prefix TSquare is used for naming the output variable.

Alias tSquare

xResidual="string"
requests residuals for each predictor. If set to an empty string, the prefix XResid is used for naming the output variables.

Alias xr, xResid

xScore="string"
requests extracted factors (X-scores, latent vectors, latent variables, T) for each selected model factor. If set to an empty string, the prefix XScore is used for naming the output variables.

xStd="string"
requests standardized (centered and scaled) predictor values for each predictor. If set to an empty string, the prefix StdX is used for naming the output variables.

Alias stdX

xStdsse="string"
requests the sum of squares of residuals for standardized predictors. If set to an empty string, the prefix StdXSSE is used for naming the output variable.

Alias xQres, stdXsse

yResidual="string"
requests residuals for each response. If set to an empty string, the prefix YResid is used for naming the output variables.

Alias yr, yResid

yScore="string"
requests extracted responses (Y-scores, U) for each selected model factor. If set to an empty string, the prefix YScore is used for naming the output variables.

yStd="string"
requests standardized (centered and scaled) response values for each response. If set to an empty string, the prefix StdY is used for naming the output variables.

Alias stdY

yStdsse="string"
requests the sum of squares of residuals for standardized responses. If set to an empty string, the prefix StdYSSE is used for naming the output variable.

Alias yQres, stdYsse

outputTables={outputTables}
lists the result table names that are saved as CAS tables on the server.

To simplify, you can specify only a value for the names parameter. In this case, the other parameters in the list use default values.

For more information about this common parameter, see outputTables.

partitionByFrac={partByFracStatement}
specifies the fractions of the data to be used for training and testing.

To simplify, you can specify only a value for the test parameter. In this case, the other parameters in the list use default values.

Alias partByFrac

The partByFracStatement value can be one or more of the following:

seed=integer
specifies the seed to use in the random number generator that is used for partitioning the data.

Default 0

test=double
randomly assigns the specified proportion of observations in the input table to the testing role. The sum of the fractions that are specified in the test and validate parameters must be less than 1.

Range 0–1
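A seeded random partition of this kind can be sketched in Python as follows. The sketch assumes an independent random draw per observation, so the realized test fraction is only approximately the requested one; how the action actually draws the partition is not described in this chapter. The function name partition_by_frac and the role labels are hypothetical.

```python
import random

def partition_by_frac(n_obs, test, seed):
    """Assign each of n_obs observations to "TEST" with probability `test`,
    and to "TRAIN" otherwise. The same seed reproduces the same partition."""
    if not 0.0 <= test <= 1.0:
        raise ValueError("test must be between 0 and 1")
    rng = random.Random(seed)
    return ["TEST" if rng.random() < test else "TRAIN" for _ in range(n_obs)]

roles = partition_by_frac(1000, test=0.3, seed=42)
n_test = roles.count("TEST")
print(n_test, len(roles) - n_test)
```

Fixing the seed is what makes a test set validation run repeatable: the same observations land in the testing role each time.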

partitionByVar={partByVarStatement}
specifies the variable and its values used to partition the data into training and testing roles.

Alias partByVar

The partByVarStatement value can be one or more of the following:

* name="variable-name"
names the variable in the input table whose values are used to assign roles to each observation.

test="string"
specifies the formatted value of the variable that is used to assign observations to the testing role.

train="string"
specifies the formatted value of the variable that is used to assign observations to the training role. If you do not specify the train parameter, then all observations whose roles are not determined by the test and validate parameters are assigned to training.
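The role-assignment rule, including the default for an omitted train value, can be written out as a short Python sketch. The Role values below are the ones that appear, in order, in the Sample data of Example 4.1. Leaving an observation without a role when train is specified and the value matches neither train nor test is an assumption of this sketch, and the function name assign_roles is hypothetical.

```python
def assign_roles(values, test, train=None):
    """Map formatted values of the partition variable to roles.

    If train is None, every observation not assigned to testing is training.
    """
    roles = []
    for value in values:
        if value == test:
            roles.append("TEST")
        elif train is None or value == train:
            roles.append("TRAIN")
        else:
            roles.append(None)   # matches neither value: treated as unused (assumption)
    return roles

# Role values of observations EM1 through EM16 in Example 4.1.
role_column = ["TRAIN", "TEST", "TRAIN", "TEST", "TEST", "TRAIN", "TRAIN", "TRAIN",
               "TEST", "TRAIN", "TRAIN", "TEST", "TRAIN", "TRAIN", "TEST", "TEST"]

explicit = assign_roles(role_column, test="TEST", train="TRAIN")
defaulted = assign_roles(role_column, test="TEST")   # train omitted
print(explicit.count("TRAIN"), explicit.count("TEST"))
```

For these data, specifying train explicitly and omitting it lead to the same split, because every observation carries one of the two values.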

polynomial={{polynomial-1} <, {polynomial-2}, ...>}
specifies a polynomial effect. All specified variables must be numeric. A design matrix column is generated for each term of the specified polynomial. By default, each of these terms is treated as a separate effect for the purpose of model building.

spline={{spline-1} <, {spline-2}, ...>}
expands variables into spline bases whose form depends on the specified parameters.

* table={castable}
specifies the settings for an input table.

The castable value can be one or more of the following:

caslib="string"
specifies the caslib containing the table that you want to use with the action. By default, the active caslib is used. Specify a value only if you need to access a table from a different caslib.

computedOnDemand=TRUE | FALSE
when set to True, the computed variables specified in the compVars parameter are created when the table is loaded instead of when the action begins.

Alias compOnDemand

Default FALSE

computedVars={{casinvardesc-1} <, {casinvardesc-2}, ...>}
specifies the names of the computed variables to create. Specify an expression for each parameter in the compPgm parameter.

Alias compVars

computedVarsProgram="string"
specifies an expression for each variable that you included in the compVars parameter.

Alias compPgm

importOptions={fileType="AUTO" | "BASESAS" | "CSV" | "DTA" | "ESP" | "EXCEL" | "FMT" | "HDAT" | "JMP" | "LASR" | "MVA" | "SPSS" | "XLS", fileType-specific-parameters}
specifies the settings for reading a table from a data source.

The value that you specify for fileType determines the other parameters that apply. For more information about this common parameter, see importOptions.

* name="table-name"
specifies the name of the table to use.

onDemand=TRUE | FALSE
when set to True, table access is less aggressive with virtual memory use.

Default TRUE

orderBy={{casinvardesc-1} <, {casinvardesc-2}, ...>}
specifies the variables to use for ordering observations within partitions. This parameter applies to partitioned tables or it can be combined with groupBy variables when groupByMode is set to REDISTRIBUTE.

singlePass=TRUE | FALSE
when set to True, the data does not create a transient table in the server. Setting this parameter to True can be efficient, but the data might not have stable ordering upon repeated runs.

Default FALSE

vars={{casinvardesc-1} <, {casinvardesc-2}, ...>}
specifies the variables to use in the action.

where="where-expression"
specifies an expression for subsetting the input data.

target="string"
specifies the target variable to use for analysis.

varss=TRUE | FALSE
when set to True, displays the amount of variation accounted for in each response and predictor.

Default FALSE


Examples

Example 4.1: Spectrometric Calibration


This section contains PROC CAS code.

NOTE: Input data must be in a CAS table that is accessible in your CAS session. This table has a two-level name; the first level is your CAS engine libref and the second level is the table name. You refer to this table in the CAS procedure by specifying only the second level. For more information about PROC CAS, see SAS Cloud Analytic Services: CAS Procedure Programming Guide and Reference. For more information about two-level names, see Chapter 2, “Shared Concepts” (SAS Visual Data Mining and Machine Learning: Statistical Procedures).

The example in this section illustrates basic features of the pls action in the pls action set. The data are reported in Umetrics (1995); the original source is Lindberg, Persson, and Wold (1983). Suppose you are researching pollution in the Baltic Sea and you want to use the fluorescence spectra of seawater samples to determine the amounts of three compounds present in those samples: lignin sulfonate (ls: pulp industry pollution), humic acids (ha: natural forest products), and optical whitener from laundry detergent (dt). Spectrometric calibration is a type of problem in which partial least squares can be very effective. The predictors are the spectra emission intensities at different frequencies in a sample spectrum, and the responses are the amounts of various chemicals in the sample.

For the purpose of calibrating the model, samples that have known compositions are used. The calibration data consist of 16 samples of known concentrations of ls, ha, and dt, with spectra based on 27 frequencies (or, equivalently, wavelengths). In order to demonstrate the use of test set validation, the data contain the variable Role, which is used to assign observations to the training and testing roles. In this case, the training role has nine samples and the testing role has seven samples.

The following DATA step creates the mycas.Sample data table, which provides the calibration data, in your CAS session. This DATA step assumes that your CAS engine libref is named mycas, but you can substitute any appropriately defined CAS engine libref.

data mycas.Sample;
   input obsnam $ v1-v27 ls ha dt Role $5. @@;
   datalines;
EM1 2766 2610 3306 3630 3600 3438 3213 3051 2907 2844 2796
    2787 2760 2754 2670 2520 2310 2100 1917 1755 1602 1467
    1353 1260 1167 1101 1017 3.0110 0.0000 0.00 TRAIN
EM2 1492 1419 1369 1158 958 887 905 929 920 887 800
    710 617 535 451 368 296 241 190 157 128 106
    89 70 65 56 50 0.0000 0.4005 0.00 TEST
EM3 2450 2379 2400 2055 1689 1355 1109 908 750 673 644
    640 630 618 571 512 440 368 305 247 196 156
    120 98 80 61 50 0.0000 0.0000 90.63 TRAIN
EM4 2751 2883 3492 3570 3282 2937 2634 2370 2187 2070 2007
    1974 1950 1890 1824 1680 1527 1350 1206 1080 984 888
    810 732 669 630 582 1.4820 0.1580 40.00 TEST
EM5 2652 2691 3225 3285 3033 2784 2520 2340 2235 2148 2094
    2049 2007 1917 1800 1650 1464 1299 1140 1020 909 810
    726 657 594 549 507 1.1160 0.4104 30.45 TEST
EM6 3993 4722 6147 6720 6531 5970 5382 4842 4470 4200 4077
    4008 3948 3864 3663 3390 3090 2787 2481 2241 2028 1830
    1680 1533 1440 1314 1227 3.3970 0.3032 50.82 TRAIN
EM7 4032 4350 5430 5763 5490 4974 4452 3990 3690 3474 3357
    3300 3213 3147 3000 2772 2490 2220 1980 1779 1599 1440
    1320 1200 1119 1032 957 2.4280 0.2981 70.59 TRAIN
EM8 4530 5190 6910 7580 7510 6930 6150 5490 4990 4670 4490
    4370 4300 4210 4000 3770 3420 3060 2760 2490 2230 2060
    1860 1700 1590 1490 1380 4.0240 0.1153 89.39 TRAIN
EM9 4077 4410 5460 5857 5607 5097 4605 4170 3864 3708 3588
    3537 3480 3330 3192 2910 2610 2325 2064 1830 1638 1476
    1350 1236 1122 1044 963 2.2750 0.5040 81.75 TEST
EM10 3450 3432 3969 4020 3678 3237 2814 2487 2205 2061 2001
    1965 1947 1890 1776 1635 1452 1278 1128 981 867 753
    663 600 552 507 468 0.9588 0.1450 101.10 TRAIN
EM11 4989 5301 6807 7425 7155 6525 5784 5166 4695 4380 4197
    4131 4077 3972 3777 3531 3168 2835 2517 2244 2004 1809
    1620 1470 1359 1266 1167 3.1900 0.2530 120.00 TRAIN
EM12 5340 5790 7590 8390 8310 7670 6890 6190 5700 5380 5200
    5110 5040 4900 4700 4390 3970 3540 3170 2810 2490 2240
    2060 1870 1700 1590 1470 4.1320 0.5691 117.70 TEST
EM13 3162 3477 4365 4650 4470 4107 3717 3432 3228 3093 3009
    2964 2916 2838 2694 2490 2253 2013 1788 1599 1431 1305
    1194 1077 990 927 855 2.1600 0.4360 27.59 TRAIN
EM14 4380 4695 6018 6510 6342 5760 5151 4596 4200 3948 3807
    3720 3672 3567 3438 3171 2880 2571 2280 2046 1857 1680
    1548 1413 1314 1200 1119 3.0940 0.2471 61.71 TRAIN
EM15 4587 4200 5040 5289 4965 4449 3939 3507 3174 2970 2850
    2814 2748 2670 2529 2328 2088 1851 1641 1431 1284 1134
    1020 918 840 756 714 1.6040 0.2856 108.80 TEST
EM16 4017 4725 6090 6570 6354 5895 5346 4911 4611 4422 4314
    4287 4224 4110 3915 3600 3240 2913 2598 2325 2088 1917
    1734 1587 1452 1356 1257 3.1620 0.7012 60.00 TEST
;
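
As a quick check on the role assignment described above, you can tabulate the Role variable after the DATA step runs. The following is a minimal sketch, not part of the original example; it assumes that the mycas libref is assigned as in the DATA step, and it uses PROC FREQ from Base SAS rather than anything in the pls action set. The frequency table should show nine TRAIN observations and seven TEST observations.

```sas
/* Sanity check (illustrative): count observations in each role. */
/* Assumes the mycas CAS engine libref used in the DATA step.    */
proc freq data=mycas.Sample;
   tables Role / nocum nopercent;
run;
```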

To isolate a few underlying spectral factors that provide a good predictive model, you can fit a PLS model to the 16 samples by using the pls action in the pls action set in the following PROC CAS statements:

proc cas;
   action pls.pls /
      table='SAMPLE',
      method='PLS',
      model={depVars={'ls', 'ha', 'dt'},
             effects={'v1', 'v2', 'v3', 'v4', 'v5', 'v6', 'v7', 'v8',
                      'v9', 'v10', 'v11', 'v12', 'v13', 'v14', 'v15',
                      'v16', 'v17', 'v18', 'v19', 'v20', 'v21', 'v22',
                      'v23', 'v24', 'v25', 'v26', 'v27'}};
run;

By default, the pls action extracts at most 15 factors. The default output from this analysis is presented in Output 4.1.1 and Output 4.1.2.

Output 4.1.1 displays the “Model Information,” “Dimensions,” and “Number of Observations” tables.

The “Model Information” table identifies the data source and shows that the factor extraction method is partial least squares regression and that the nonlinear iterative partial least squares (NIPALS) algorithm (which is the default) is used to compute extracted PLS factors.

The “Dimensions” table shows the number of response variables, the number of effects, the number of predictor parameters, and the number of factors to extract.

The “Number of Observations” table shows that all 16 of the sample observations in the input data are used in the analysis because all the samples contain complete data.

Output 4.1.1 Model Information, Dimensions, and Number of Observations

Results from pls.pls

Model Information
Data Source                      SAMPLE
Factor Extraction Method         Partial Least Squares
PLS Algorithm                    NIPALS
Validation Method                None

Dimensions
Number of Response Variables      3
Number of Effects                27
Number of Predictor Parameters   27
Number of Factors                15

Number of Observations Read      16
Number of Observations Used      16

Output 4.1.2 lists the amount of variation, both individual and cumulative, that is accounted for by each of the 15 factors. All the variation in both the predictors and the responses is accounted for by only 15 factors because there are only 16 sample observations. More important, almost all the variation is accounted for by even fewer factors: one or two for the predictors and three to eight for the responses.


Output 4.1.2 PLS Variation Summary

Percentage Variation Accounted for by Partial Least Squares Factors

Number of Extracted             Model Effects        Response Variables
            Factors      Current        Total      Current        Total
                  1     97.46068     97.46068     41.91546     41.91546
                  2      2.18296     99.64365     24.24355     66.15900
                  3      0.17806     99.82170     24.53393     90.69293
                  4      0.11973     99.94143      3.78978     94.48271
                  5      0.04146     99.98289      1.00454     95.48725
                  6      0.01058     99.99347      2.28084     97.76809
                  7      0.00168     99.99515      1.16935     98.93744
                  8   0.00097586     99.99613      0.50410     99.44153
                  9      0.00142     99.99755      0.12292     99.56446
                 10   0.00097037     99.99852      0.11027     99.67472
                 11   0.00032725     99.99884      0.15227     99.82699
                 12   0.00029338     99.99914      0.12907     99.95606
                 13   0.00024792     99.99939      0.03121     99.98727
                 14   0.00042742     99.99981      0.00651     99.99378
                 15   0.00018639    100.00000      0.00622    100.00000
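
The Total columns in Output 4.1.2 are running sums of the Current columns. The following DATA step is a small sketch of that relationship. It retypes the first four rows of the output into a table named work.variation; the table name and the variable names are illustrative assumptions and are not produced by the pls action. Because the displayed values are rounded, the accumulated totals can differ from the printed totals in the last digit.

```sas
/* Illustrative only: rebuild the cumulative (Total) columns from the */
/* per-factor (Current) columns for the first four factors.           */
data work.variation;
   input nFactors xCurrent yCurrent;
   xTotal + xCurrent;   /* sum statement: retained running total for model effects */
   yTotal + yCurrent;   /* sum statement: retained running total for responses     */
   datalines;
1 97.46068 41.91546
2  2.18296 24.24355
3  0.17806 24.53393
4  0.11973  3.78978
;

proc print data=work.variation noobs;
run;
```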

A PLS model is not complete until you choose the number of factors. You can choose the number of factors by using test set validation, in which the data table is divided into two groups called the training data and the test data. You fit the model to the training data, and then you check the capability of the model to predict responses for the test data. The predicted residual sum of squares (PRESS) statistic is based on the residuals that are generated by this process.
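
In symbols, a generic form of the statistic can be written as follows. This is a sketch under the usual textbook definition, not a statement of the action's internal computation: the notation is introduced here for illustration, and the exact scaling that the pls action applies when it reports the root mean PRESS is not spelled out in this section.

```latex
% a          : number of extracted factors
% q          : number of response variables (3 in this example)
% y_{ik}     : observed value of response k for test observation i
% \hat{y}_{ik}^{(a)} : prediction from the a-factor model fit to the training data only
\mathrm{PRESS}(a) \;=\; \sum_{i \in \text{test}} \; \sum_{k=1}^{q}
   \left( y_{ik} - \hat{y}_{ik}^{(a)} \right)^{2}
```

A smaller value of PRESS(a) indicates that the a-factor model predicts the held-out observations more closely.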

To select the number of extracted factors by test set validation, you use the partitionByVar or partitionByFrac parameter to specify how to logically divide observations in the input data table into two subsets for model training and testing. For example, you can designate a variable in the input data table and a set of formatted values of that variable to determine the role of each observation, as in the following PROC CAS statements. The ODS TRACE command displays the full paths of the output tables in the SAS log.

ods trace on;
proc cas;
   action pls.pls /
      table='SAMPLE',
      method='PLS',
      model={depVars={'ls', 'ha', 'dt'},
             effects={'v1', 'v2', 'v3', 'v4', 'v5', 'v6', 'v7', 'v8',
                      'v9', 'v10', 'v11', 'v12', 'v13', 'v14', 'v15',
                      'v16', 'v17', 'v18', 'v19', 'v20', 'v21', 'v22',
                      'v23', 'v24', 'v25', 'v26', 'v27'}},
      partitionByVar={name='Role', train='TRAIN', test='TEST'};
run;

The resulting output is shown in Output 4.1.3 through Output 4.1.5.


Output 4.1.3 Model Information, Dimensions, and Number of Observations with Test Set Validation

Model Information
Validation Method                Test Set Validation

Dimensions
Number of Response Variables      3
Number of Effects                27
Number of Predictor Parameters   27
Maximum Number of Factors         9

Number of Observations Read                16
Number of Observations Used                16
Number of Observations Used for Training    9
Number of Observations Used for Testing     7

Output 4.1.4 Test-Set-Validated PRESS Statistics for Number of Factors

Test Set Validation for the Number of Extracted Factors

Number of Extracted    Root Mean
            Factors        PRESS
                  0     1.426362
                  1     1.276694
                  2     1.181752
                  3     0.656999
                  4      0.43457
                  5     0.420916
                  6     0.585031
                  7     0.576586
                  8     0.563935
                  9     0.563935

Minimum Root Mean PRESS        0.420916
Minimizing Number of Factors          5
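
The last two rows of Output 4.1.4 report the smallest root mean PRESS and the number of factors at which it occurs. The following steps are a minimal sketch of that selection rule applied to the values in the table. The table names work.press, work.pressSorted, and work.minPress and the variable names are illustrative assumptions; the pls action does not create them. Breaking ties in favor of fewer factors is also an assumption of this sketch.

```sas
/* Root mean PRESS values retyped from Output 4.1.4 (illustrative table). */
data work.press;
   input nFactors rootMeanPress;
   datalines;
0 1.426362
1 1.276694
2 1.181752
3 0.656999
4 0.43457
5 0.420916
6 0.585031
7 0.576586
8 0.563935
9 0.563935
;

/* Put the smallest PRESS first; ties are ordered by fewer factors. */
proc sort data=work.press out=work.pressSorted;
   by rootMeanPress nFactors;
run;

/* Keep only the first row, which is the minimizing number of factors. */
data work.minPress;
   set work.pressSorted(obs=1);
run;

proc print data=work.minPress noobs;
run;
```

For these values, the retained row is the one for five factors, which matches the “Minimizing Number of Factors” row of the output.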


Output 4.1.5 PLS Variation Summary for Test-Set-Validated Model

Percentage Variation Accounted for by Partial Least Squares Factors

Number of Extracted             Model Effects        Response Variables
            Factors      Current        Total      Current        Total
                  1     95.92495     95.92495     37.27071     37.27071
                  2      3.86407     99.78903     32.38167     69.65238
                  3      0.10170     99.89073     20.76882     90.42120
                  4      0.08979     99.98052      4.66666     95.08787
                  5      0.01142     99.99194      3.88184     98.96971

In Output 4.1.3, the “Model Information” table indicates that test set validation is used. The “Dimensions” table shows that the maximum number of factors to extract is nine. The “Number of Observations” table shows that nine sample observations are assigned to training roles and seven are assigned to testing roles.

Output 4.1.4 provides details about the results from test set validation. These results show that the absolute minimum PRESS is achieved with five extracted factors. Notice, however, that this is not much smaller than the PRESS for three factors. By using the cvTest parameter, you can perform the statistical model comparison that is suggested by Van der Voet (1994) to test whether this difference is significant, as shown in the following PROC CAS statements:

proc cas;
   action pls.pls /
      table='SAMPLE',
      method='PLS',
      model={depVars={'ls', 'ha', 'dt'},
             effects={'v1', 'v2', 'v3', 'v4', 'v5', 'v6', 'v7', 'v8',
                      'v9', 'v10', 'v11', 'v12', 'v13', 'v14', 'v15',
                      'v16', 'v17', 'v18', 'v19', 'v20', 'v21', 'v22',
                      'v23', 'v24', 'v25', 'v26', 'v27'}},
      partitionByVar={name='Role', train='TRAIN', test='TEST'},
      cvTest={pValue=0.15, seed=12345};
run;

The model comparison test is based on a rerandomization of the data. By default, the seed for this randomization is based on the system clock, but it is specified here. The resulting output is presented in Output 4.1.6 through Output 4.1.8.


Output 4.1.6 Model Information with Model Comparison Test

Model Information
Validation Method                    Test Set Validation
Validation Testing Criterion         Prob T**2 > 0.15
Number of Random Permutations        1000
Random Number Seed for Permutation   12345

Output 4.1.7 Testing Test Set Validation for Number of Factors

Test Set Validation for the Number of Extracted Factors

Number of Extracted    Root Mean                    Prob >
            Factors        PRESS         T**2         T**2
                  0     1.426362     5.191629       0.0480
                  1     1.276694     6.174825       <.0001
                  2     1.181752      4.60203       0.0620
                  3     0.656999      3.09999       0.5400
                  4      0.43457     4.980227       0.0960
                  5     0.420916            0       1.0000
                  6     0.585031      2.05496       0.7470
                  7     0.576586     3.009172       0.4980
                  8     0.563935     2.416635       0.7500
                  9     0.563935     2.416635       0.7500

Minimum Root Mean PRESS                     0.420916
Minimizing Number of Factors                       5
Smallest Number of Factors with p > 0.15           3
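
The last row of Output 4.1.7 reports the smallest number of factors whose p-value exceeds the 0.15 cutoff. The following steps are a minimal sketch of that rule applied to the values in the table. The table names work.cvTest and work.chosen and the variable names are illustrative assumptions, and the p-value that is displayed as <.0001 is entered as 0.0001 (an upper bound) so that it can be read as a number.

```sas
/* Values retyped from Output 4.1.7 (illustrative table).             */
/* The p-value shown as <.0001 is entered as its upper bound, 0.0001. */
data work.cvTest;
   input nFactors rootMeanPress tSquare probT;
   datalines;
0 1.426362 5.191629 0.0480
1 1.276694 6.174825 0.0001
2 1.181752 4.60203  0.0620
3 0.656999 3.09999  0.5400
4 0.43457  4.980227 0.0960
5 0.420916 0        1.0000
6 0.585031 2.05496  0.7470
7 0.576586 3.009172 0.4980
8 0.563935 2.416635 0.7500
9 0.563935 2.416635 0.7500
;

/* Rows are in increasing order of nFactors, so the first row that */
/* passes the WHERE= filter is the smallest qualifying model.      */
data work.chosen;
   set work.cvTest(where=(probT > 0.15) obs=1);
run;

proc print data=work.chosen noobs;
run;
```

For these values, the retained row is the one for three factors, which agrees with the output.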

Output 4.1.8 PLS Variation Summary for Tested Test-Set-Validated Model

Percentage Variation Accounted for by Partial Least Squares Factors

Number of Extracted             Model Effects        Response Variables
            Factors      Current        Total      Current        Total
                  1     95.92495     95.92495     37.27071     37.27071
                  2      3.86407     99.78903     32.38167     69.65238
                  3      0.10170     99.89073     20.76882     90.42120

The “Model Information” table in Output 4.1.6 displays information about the options that are used in the model comparison test. In Output 4.1.7, the p-value for comparing the test-set-validated residuals from the models that have five and three factors (0.5400, which exceeds the 0.15 cutoff) indicates that the difference between the two models is not significant; therefore, the model with fewer factors is preferred. The variation summary in Output 4.1.8 shows that more than 99% of the predictor variation and more than 90% of the response variation are accounted for by the three factors.

References

De Jong, S. (1993). “SIMPLS: An Alternative Approach to Partial Least Squares Regression.” Chemometrics and Intelligent Laboratory Systems 18:251–263.

Lindberg, W., Persson, J.-A., and Wold, S. (1983). “Partial Least-Squares Method for Spectrofluorimetric Analysis of Mixtures of Humic Acid and Ligninsulfonate.” Analytical Chemistry 55:643–648.

Umetrics (1995). Multivariate Analysis. Three-day course. Winchester, MA: Umetrics.

Van der Voet, H. (1994). “Comparing the Predictive Accuracy of Models Using a Simple Randomization Test.” Chemometrics and Intelligent Laboratory Systems 25:313–323.

Wold, H. (1966). “Estimation of Principal Components and Related Models by Iterative Least Squares.” In Multivariate Analysis, edited by P. R. Krishnaiah, 391–420. New York: Academic Press.
