2017 Predictive Analytics Symposium Session 10, Clustering Techniques Moderator: Geoffrey R. Hileman, FSA, MAAA Presenters: Matthias Kullowatz Marjorie A. Rosenberg, FSA SOA Antitrust Compliance Guidelines SOA Presentation Disclaimer

2017 Predictive Analytics Symposium...Sneath, P. H. and R. R. Sokal (1973). Numerical taxonomy. The principles and practice of numerical classification. Rosenberg et al. (UW-Madison)

2017 Predictive Analytics Symposium

Session 10, Clustering Techniques

Moderator: Geoffrey R. Hileman, FSA, MAAA

Presenters:

Matthias Kullowatz Marjorie A. Rosenberg, FSA

SOA Antitrust Compliance Guidelines SOA Presentation Disclaimer

An application of K-means cluster analysis with missing values

Matthias Kullowatz, MS
Session 10: Clustering Techniques
September 14, 2017

Motivation

Clustering examples

• Taxonomy (e.g. organisms)
• Optimal geographic positioning (e.g. ambulances)
• Market segmentation

Background

Use enriched dataset to predict variable annuity policyholder behavior (with GLWB rider)

Data
• Credit info
• Lifestyle info
• Mortgage info
• Census Bureau info

Behaviors
• Lapse
• GLWB election timing
• GLWB utilization efficiency

Benefits of clustering

• Produce a more implementable model
• More stable representation of complex data for long-term projections
• Create recognizable clusters to tell intuitive stories

What is (and isn’t) k-means clustering?

• Exclusive
• Unsupervised
• Not hierarchical

https://support.sas.com/documentation/cdl/en/statug/63962/HTML/default/images/tree1a.png

Basic k-means algorithm

• Select k cluster “centroids” in the data space
• Assign each of n observations to nearest cluster by shortest Euclidean distance to centroid
    • Standardize variables first
    • No categorical variables allowed
• Re-calculate new centroids as means
• Repeat until convergence
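
The steps above can be sketched in Python with NumPy. This is a minimal illustration for complete numeric data, not the presenters' implementation: the function names `kmeans` and `nearest`, the random choice of initial centroids among the observations, and the iteration cap are all assumptions made for the example.

```python
import numpy as np

def nearest(Z, centroids):
    """Index of the closest centroid (Euclidean distance) for each row of Z."""
    dists = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(X, k, n_iter=100, seed=0):
    """Basic k-means on a complete numeric matrix X (n observations x p variables)."""
    rng = np.random.default_rng(seed)
    # Standardize variables first so that no field dominates the distance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Assumption: start from k distinct observations chosen at random.
    centroids = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(n_iter):
        labels = nearest(Z, centroids)
        # Re-calculate each centroid as the mean of its assigned observations;
        # an empty cluster keeps its previous centroid.
        new_centroids = np.array([
            Z[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return nearest(Z, centroids), centroids

# Two well-separated groups of three observations each.
X = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 1.0],
              [20.0, 21.0], [21.0, 23.0], [23.0, 20.0]])
labels, centroids = kmeans(X, k=2)
print(labels)
```

The returned centroids are on the standardized scale, since the distances are computed there.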

Visualizing K-means

Documentation can be found at: https://github.com/milliman/SOA_PAS_KmeansClustering

K-means with missing values

What to do with missing values

• Calculate centroid means:
    • Ignore missing values for each dimension’s mean
    • Weight observations using proportion of known values
• Observation assignment:
    • Calculate distance in a reduced dimension (ignore missing values)

Example: Calculating cluster centroid

| Cluster A | Credit score | Delinquencies | Population density | Home value | Appreciation |
| --- | --- | --- | --- | --- | --- |
| Obs A1 | 1.0 | 1 | -0.5 | NA | NA |
| Obs A2 | 2.0 | 3 | -0.4 | -0.5 | -0.7 |
| Obs A3 | NA | NA | -0.6 | -0.5 | -0.3 |
| Unweighted | 1.5 | 2 | -0.5 | -0.5 | -0.5 |
| Weighted* | 1.625 | 2.25 | -0.48 | -0.5 | -0.55 |

* Our algorithm uses the weighted centroid means
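
The weighted row can be reproduced with a short NumPy sketch. It assumes that each observation's weight is its proportion of known values (3/5 for Obs A1 and Obs A3, 5/5 for Obs A2), which is consistent with the figures in the table. The function name `weighted_centroid` is hypothetical.

```python
import numpy as np

def weighted_centroid(obs):
    """Centroid of one cluster when some values are missing (np.nan)."""
    known = ~np.isnan(obs)
    # Weight of each observation: its proportion of known values.
    weights = known.mean(axis=1)
    # Per dimension, only observations with a known value contribute.
    w = weights[:, None] * known
    # Note: a dimension with no known values in the cluster yields nan.
    return (w * np.nan_to_num(obs)).sum(axis=0) / w.sum(axis=0)

nan = np.nan
cluster_a = np.array([
    [1.0, 1.0, -0.5, nan, nan],     # Obs A1
    [2.0, 3.0, -0.4, -0.5, -0.7],   # Obs A2
    [nan, nan, -0.6, -0.5, -0.3],   # Obs A3
])

print(np.nanmean(cluster_a, axis=0))   # unweighted: 1.5, 2.0, -0.5, -0.5, -0.5
print(weighted_centroid(cluster_a))    # weighted: 1.625, 2.25, about -0.482, -0.5, -0.55
```

For credit score, for example, the weighted mean is (0.6 × 1.0 + 1.0 × 2.0) / (0.6 + 1.0) = 1.625.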

Example: Calculating distance

| Metric | Credit score | Delinquencies | Population density | Home value | Appreciation |
| --- | --- | --- | --- | --- | --- |
| Centroid A | 1.625 | 2.25 | -0.48 | -0.5 | -0.55 |
| Centroid B | 0.5 | 0.5 | 0.0 | 0.25 | 0.25 |
| Obs A1 | 1.0 | 1 | -0.5 | NA | NA |

Distance to A: $\sqrt{0.625^2 + 1.250^2 + 0.020^2} = 1.40$

Distance to B: $\sqrt{0.500^2 + 0.500^2 + 0.500^2} = 0.87$
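
A minimal NumPy sketch of the reduced-dimension distance, using the centroid values from this slide. The function name `reduced_distance` is hypothetical; the only assumption is that dimensions missing for the observation are dropped before the Euclidean distance is taken.

```python
import numpy as np

def reduced_distance(obs, centroid):
    """Euclidean distance over the dimensions where obs is known."""
    known = ~np.isnan(obs)
    return float(np.sqrt(np.sum((obs[known] - centroid[known]) ** 2)))

obs_a1 = np.array([1.0, 1.0, -0.5, np.nan, np.nan])
centroid_a = np.array([1.625, 2.25, -0.48, -0.5, -0.55])
centroid_b = np.array([0.5, 0.5, 0.0, 0.25, 0.25])

print(round(reduced_distance(obs_a1, centroid_a), 2))  # 1.4
print(round(reduced_distance(obs_a1, centroid_b), 2))  # 0.87
```

For a single observation, every centroid is compared over the same set of known dimensions, so the distances are directly comparable when choosing the nearest cluster.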

Considerations

• How do you choose initial cluster centroids?
• Why Euclidean distance?
• Binary variables: could we handle them?
• Could weight some fields differently

Variable Annuity Example

K-means data summary

• 220,000 policyholders
• 20 third-party data fields
• + Attained age, account value

| Observations Missing | Proportion |
| --- | --- |
| 0 | 11.2% |
| 1 | 2.8% |
| 2 | 13.3% |
| 3 | 7.1% |
| 4 | 20.4% |
| 5 | 19.2% |
| 6+ | 26.0% |

How to define clusters

Unsupervised
• No preconceived notion of group labels

How different is each cluster across each dimension?
• Calculate average Z-scores of each dimension
• Identify dimensions that make each cluster unique
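
One way to compute such a profile is sketched below in NumPy. The function name `cluster_profiles` and the choice to skip missing values via `nanmean`/`nanstd` are assumptions for illustration; the toy data are made up.

```python
import numpy as np

def cluster_profiles(X, labels):
    """Average z-score of each dimension within each cluster (missing values ignored)."""
    z = (X - np.nanmean(X, axis=0)) / np.nanstd(X, axis=0)
    return {int(c): np.nanmean(z[labels == c], axis=0) for c in np.unique(labels)}

# Toy data: the clusters differ on the first dimension only.
X = np.array([[1.0, 0.0], [1.0, 2.0], [3.0, 0.0], [3.0, 2.0]])
labels = np.array([0, 0, 1, 1])

for cluster, profile in cluster_profiles(X, labels).items():
    print(cluster, profile)
```

Dimensions whose average z-score is far from zero are the ones that set a cluster apart; here cluster 0 averages -1 and cluster 1 averages +1 on the first dimension, and both average 0 on the second.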

The Debt cluster

The Urban Renters cluster

Customer Segments

In Debt: Low credit scores, high counts of credit delinquencies in the last five years

Lower Income: Lower than average education levels, home values, and income levels

Middle Income: Slightly higher than average education levels, home values, and income levels

High Income: Highest education levels, home values, and income levels

Urban Renters: Live in high population density areas, with low proportion of homeowners

Families: More likely to have children living at home, younger on average

Retired: Likely to be older, and live in areas with high proportions of individuals over the age of 65

Modeling process

[Diagram with stages: Clustering; Predictive modeling; Cash flow projection; Customer level profitability (CLP)]

Lapse sensitivity to in-the-moneyness (ITM)

Benefit utilization deferral curves

CLP by cluster

The boxplot shows that policies sold to the Urban Renters, In Debt, and Retired clusters are the most profitable on average. The Retired cluster also shows the widest range of CLP.

Thank you

A Numerical Taxonomy Application to Cluster Individuals for Predicting Healthcare Expenditures

Work in Progress

Josh Agterberg, Fanghao Zhong, Richard Crabb, Margie Rosenberg

University of Wisconsin – Madison

We acknowledge the Society of Actuaries CAE Research Grant for their partial support in this work.

Rosenberg et al. (UW-Madison) SOA Predictive Analytics Meeting September 2017 1 / 21

Purpose

To present a novel way of clustering individuals to group similar individuals for prediction of health care expenditures where covariates are categorical

Secondarily: To define what is a numerical taxonomy system and its relationship to unsupervised clustering

Outline

1 Background

2 Numerical Taxonomy

3 Methods

4 Our Approach

5 Data and Design

6 Results

7 Conclusions

Background

Area of Application

• Focus on health care data
• Only categorical predictors
• That are not necessarily numbered or ordered
• With no prior clusters defined
• And with no labels as to assignment to cluster

Want to group individuals who are similar

Background

Approach: Want to group individuals who are similar

• Data need to be definable
• Data need to be measurable
• Need to determine how two individuals are related

Sneath and Sokal (1973)

Background

Examples of Health Care

• ICD classification
• DRG
• HCC

Numerical Taxonomy

Some Definitions

Taxonomy (Simpson (1961)): Theoretical study of classification including its bases, principles, and rules

Taxon (Sneath and Sokal (1973)): Abbreviation for taxonomic group. Plural is taxa

Classification (Simpson (1961)): Ordering of entities into groups or sets on the basis of relationships

Classification could be seen as method, but is sometimes used as outcome of process (i.e. result of classification is classification). Taxonomy could be viewed as theory (but sometimes used interchangeably)

Numerical Taxonomy (Sneath and Sokal (1973)): Grouping by numerical methods of taxonomic units based on their character states, using methods that are objective, explicit, and repeatable

Numerical Taxonomy

Numerical Taxonomic Principles

• More information available for classification, the better the classification
• Every variable has equal weight (or not?)
• Overall similarity between 2 individuals is function of individual-level similarity in each of the variables
• Distinct groups recognized due to correlation of variables within each group
• Classification based on similarity
• Inferences valid from developed clusters (originally based on biology)

Note: of course, data need to be definable and measurable

Sneath and Sokal (1973) (Note: Traced back to 1700s)

Numerical Taxonomy

Taxonomic Principles Applied (ICD, DRG, HCC, and biology)

• Approach Defensible
• Reproducible process
• Individuals classified into reliable groups by different coders
• Set of operational protocol with standardized names
• Interpretable clusters with characteristics generally constant
• Manageable number of clusters
• Predictive power of cluster

Evans, Pope, Kautter, Ingber, Freeman, Sekar, and Newhart (2011); Feinstein (1967, 1988);

Fetter, Shin, Freeman, Averill, and Thompson (1980); Sneath and Sokal (1973)

Methods Principles

Questions when Clustering with Only Categorical Variables

• How many variables to include?
• Which variables to include and how include?
• If ordered categorical variable, how measure differences?
• If multi-level,
    • Convert into multiple indicator variables?
    • Weight rarer levels more?
• How incorporate (or not) missing values
• How define the center?

Goodall (1964, 1966)

Methods Principles

Clustering Operational Protocol with Categorical Variables

To create clusters (labeled as cluster method), we need:
1 Definition of cluster center
2 General algorithm of how to create clusters: like k-means, k-medoids
3 Particular function in software to calculate
4 Need a definition of similarity
5 We need a matrix of dis-similarity (per numerical taxonomy literature, like distance, correlation or probability). If distance, then which kind L1, L2 or other?

Note: Above items are inter-related
End result sensitive to choice of algorithm
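
To show how a center definition, a clustering algorithm, and a dissimilarity matrix fit together, here is a hedged Python sketch of k-medoids run on a precomputed dissimilarity matrix. It uses a simple alternating scheme (assign to nearest medoid, then re-pick each medoid), which is an assumption chosen for brevity and is not the swap-based PAM algorithm. The function name `k_medoids` and the toy data are hypothetical.

```python
import numpy as np

def k_medoids(D, k, n_iter=100, seed=0):
    """Alternating k-medoids on a precomputed n x n dissimilarity matrix D."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    # Assumption: initial medoids are k distinct observations chosen at random.
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        # Assign each observation to its least dissimilar medoid.
        labels = D[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size:
                # New medoid: the member with the smallest total
                # dissimilarity to the other members of its cluster.
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return D[:, medoids].argmin(axis=1), medoids

# Toy example: L1 distances between six points on a line.
x = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
D = np.abs(x[:, None] - x[None, :])
labels, medoids = k_medoids(D, k=2)
print(labels, sorted(medoids.tolist()))
```

Because the algorithm only reads `D`, the same function works with any dissimilarity, including ones defined for categorical variables. The result still depends on the starting medoids in general.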

Methods Principles

Issues with Clustering Methods

1 Appropriateness of cluster method when all variables are categorical
2 Random choice of starting point produces clusters that could differ (i.e. greedy algorithm may converge to local minimum but may not be global minimum)
3 Choice of similarity measures could change results
4 How justify clusters using taxonomic principles?

Our Approach

Our Cluster Protocols

1 Define center as medoid (center represents actual data point)
2 General clustering algorithm: PAM (Partitioning Around Medoids)
3 Cluster function: Use pam() function in cluster package in R
4 Compare two similarity measures: Gower’s distance and Goodall’s similarity
    • Gower: Define similar if attributes are equal; compute simple average
    • Goodall: Define two entities to be more similar if attributes are rarer; compute similarity index
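
The contrast between the two measures can be illustrated with a small Python sketch. `gower_categorical` computes the share of attributes on which two individuals differ, which is what Gower's distance reduces to when every variable is categorical and nothing is missing. `rarity_weighted_similarity` is only a simplified stand-in written in the spirit of Goodall's idea (a match on a rare level counts for more); it is not Goodall's probabilistic index, and the 1 - p² credit is an assumption made for this example. Both function names and the toy data are hypothetical.

```python
import numpy as np

def gower_categorical(X):
    """Share of attributes on which each pair of individuals differs."""
    X = np.asarray(X).astype(str)
    return (X[:, None, :] != X[None, :, :]).mean(axis=2)

def rarity_weighted_similarity(X):
    """Simplified rarity-weighted similarity (illustrative, not Goodall's index)."""
    X = np.asarray(X).astype(str)
    n, p = X.shape
    S = np.zeros((n, n))
    for k in range(p):
        col = X[:, k]
        _, inverse, counts = np.unique(col, return_inverse=True, return_counts=True)
        freq = counts[inverse] / n              # relative frequency of each row's level
        match = col[:, None] == col[None, :]
        # A match on a level with frequency f earns credit 1 - f**2,
        # so matches on rarer levels count for more.
        S += match * (1.0 - freq[:, None] ** 2)
    return S / p

X = [["a", "x"],
     ["a", "y"],
     ["b", "x"],
     ["a", "x"]]

D = gower_categorical(X)
S = rarity_weighted_similarity(X)
print(D[0, 1], D[1, 2], D[0, 3])   # 0.5 1.0 0.0
print(S[0, 3], S[1, 1])            # 0.4375 0.6875
```

Under simple matching, every agreement counts the same. Under the rarity-weighted version, agreeing on the rare level "y" contributes more than agreeing on the common level "x", which is why the second individual's self-similarity (0.6875) exceeds the similarity between the first and fourth individuals (0.4375) even though both pairs match on every attribute.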

Data and Design

NHIS/MEPS Data

• National Health Interview Survey (NHIS) linked to Medical Expenditure Panel Survey (MEPS)
• NHIS Sample Adult Questionnaire for adult health behavior data
• Complex survey design allowing estimates of US civilian non-institutionalized population

Note: Complex design provides estimates of population, although sample (possibly allow for more reproducible results)

Data and Design

Study Design

• NHIS baseline year 2010: All individual-level characteristics to define initial clusters
• MEPS panel 16 (representing calendar years 2011 to 2012) for validation of clusters using expenditures

Data and Design

MEPS Longitudinal Design

Data and Design

NHIS Variables for Clusters

• Personal demographic information
• Health status information
• Living style or habits
• Working environment and status

Results

Results

• To be filled in

Conclusions

Conclusions

• Numerical taxonomy literature provides history
• Taxonomy principles provide guidance to justify approach
• Operational protocol provides guidance to communicate methods
• Presence of categorical variables can complicate analysis

Conclusions

Bibliography I

Evans, M. A., G. C. Pope, J. Kautter, M. J. Ingber, S. Freeman, R. Sekar, and C. Newhart (2011). Evaluation of the CMS-HCC risk adjustment model. CfMM Services, Editor.

Feinstein, A. R. (1967). Clinical Judgment. Williams and Wilkins Co.

Feinstein, A. R. (1988). ICD, POR, and DRG: Unsolved scientific problems in the nosology of clinical medicine. Archives of internal medicine 148(10), 2269–2274.

Fetter, R. B., Y. Shin, J. L. Freeman, R. F. Averill, and J. D. Thompson (1980). Case mix definition by diagnosis-related groups. Medical care 18(2), i–53.

Goodall, D. W. (1964). A probabilistic similarity index. Nature 203(4949), 1098–1098.

Conclusions

Bibliography II

Goodall, D. W. (1966). A new similarity index based on probability. Biometrics, 882–907.

Simpson, G. G. (1961). Principles of animal taxonomy.

Sneath, P. H. and R. R. Sokal (1973). Numerical taxonomy. The principles and practice of numerical classification.
