WHICH TEST IS BEST?

© Saville Consulting Group Ltd 2011

Introduction

The problems with evaluating personality questionnaires

There are two major difficulties in evaluating different behavioural style questionnaires. Firstly, each questionnaire has a different design and theoretical orientation; secondly, it is hard to compare results because the methodologies and criteria differ from study to study.

Some publishers offer published results from peer reviewed scientific journals, while others make bald statements on their websites like: “This test is .75 to .95 valid”. We have discovered that such claims relate to the percentage of people who agree with their profile, or the extent to which one questionnaire correlates with another questionnaire. This is not validity. And yet, validity is the single most important characteristic of a test.

What is validity?

The speed and pulling power of a car is measured in horsepower. The ‘speed and pulling power’ of a test is measured in criterion validity. Wave Focus has over 40,000 people in its norm base, but this does not constitute validity.

Validity is the extent to which results from a test can predict a job relevant criterion such as observed competencies, job proficiency, potential, and the like. Without evidence of validity there is little point in using a test, or any other method of assessment for that matter.

Criterion validity is a correlation index between test scores and job performance. It ranges from 0 to 1, where 0 is no relationship and 1 is perfect prediction. In the complex world of people and jobs, no single assessment method can account for all the factors that constantly impact on people’s performance at work, so there is no likelihood of perfect prediction, a correlation of 1. Validities in the range of 0.8 to 0.9 are also unheard of, using any single method. A correlation around 0.3 is considered moderate. The higher the correlation, the better the prediction of job performance and the greater the likely return on investment. So the overriding objective of good test publishers should be to find ways to improve validity. By the way, it isn’t sufficient simply to say a test is ‘validated’ – you wouldn’t get away with saying a car has horsepower or a company has profit. The question is how much?
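As a rough illustration (the numbers below are invented for the example, not data from this study), criterion validity is simply the Pearson correlation between people's test scores and a later, independent measure of their job performance:

```python
import numpy as np

# Hypothetical example: test scores and later job-performance ratings
# for the same 10 people (illustrative numbers only).
test_scores = np.array([42, 55, 61, 48, 70, 66, 52, 58, 73, 45], dtype=float)
performance = np.array([3.4, 3.1, 4.5, 3.9, 4.2, 4.8, 3.0, 3.6, 4.4, 3.3])

# Criterion validity = Pearson correlation between predictor and criterion.
r = np.corrcoef(test_scores, performance)[0, 1]
print(f"criterion validity r = {r:.2f}")
```

A value near 0 would mean the test tells you nothing about later performance; the closer r gets to 1, the better the prediction.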

Details of the study

In order to address this problem of comparability, Saville Consulting conducted research to compare the validity of 29 different instruments. In this paper we have summarised the findings for 6 major questionnaires on the market as well as 3 well known questionnaires with a smaller number of scales.

This is far from easy. Few studies have attempted to directly compare a large number of different questionnaires and to assess them against an independent measure of performance at work. These exercises are expensive and resource intensive. However, in order to advance the field of personality measurement and to improve selection and development practices in industry, we felt this was justified.

Methodology

Criteria

Until a few decades ago, the big challenge for personality questionnaires (and interviews for that matter) was ‘breaking the 0.3 barrier’. Since that barrier was broken, there have been various advances, but the best correlations have been obtained with ratings of competencies, rather than global measures of performance such as proficiency or promotability.

This may be compared to research with aptitude tests where correlations of 0.5 are achieved for overall job performance.

As a result, the criteria for our research were twofold:

• A global measure of Total Work Performance, and

• Competencies from a generic competency model

For the global measure we chose content which originates from the work of Nyfield et al (1995)[1] and covers three broad areas:

• Applying Specialist Expertise (eg utilising expert knowledge, applying specialist skills, sharing expertise)

• Accomplishing Objectives (eg achieving personal targets, contributing to team objectives, furthering organisational goals)

• Demonstrating Potential (eg seeking career progression, demonstrating capabilities required for high level roles, showing potential for promotion)

[1] Nyfield G, Gibbons PJ, Baron H & Robertson I (1995). The Cross-Cultural Validity of Management Assessment Methods. Paper presented at the 10th Annual SIOP Conference, Orlando, USA, May 1995.


The overall measure of Total Work Performance was obtained by adding these three areas together. This measure is particularly useful as it is a general criterion which does not favour any particular personality questionnaire over the others.

For the generic competency model we chose SHL’s Great 8 (Kurz and Bartram 2002)[2]. The Great 8 framework is an independent model of work performance, covering skill, personality, motivation and intelligence, and was not developed by Saville Consulting.

The Great 8 competencies are as follows:

• Analysing and Interpreting

• Creating and Conceptualising

• Leading and Deciding

• Interacting and Presenting

• Supporting and Co-operating

• Adapting and Coping

• Organising and Executing

• Enterprising and Performing

[2] Kurz R & Bartram D (2002). Competency and Individual Performance: Modelling the World of Work. In IT Robertson, M Callinan and D Bartram (eds), Organisational Effectiveness: The Role of Psychology. Chichester: Wiley.

Each participant in the study was asked to nominate two other people to act as independent raters who could evaluate their performance at work. Ratings were made on a 7-point scale ranging from Extremely Ineffective to Extremely Effective.
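A sketch of how two raters' scores are commonly combined into a single criterion (the ratings below are hypothetical, and the Spearman-Brown step-up is a standard psychometric formula, not a detail given in the paper):

```python
import numpy as np

# Hypothetical ratings (7-point scale: 1 = Extremely Ineffective,
# 7 = Extremely Effective) from two independent raters for 8 participants.
rater_a = np.array([5, 6, 4, 7, 3, 5, 6, 4], dtype=float)
rater_b = np.array([4, 6, 5, 6, 3, 4, 7, 4], dtype=float)

# Inter-rater agreement: correlation between the two raters' scores.
inter_rater_r = np.corrcoef(rater_a, rater_b)[0, 1]

# Spearman-Brown step-up: estimated reliability of the 2-rater average.
composite_reliability = 2 * inter_rater_r / (1 + inter_rater_r)

# The criterion score used in analysis: the mean of the two ratings.
criterion = (rater_a + rater_b) / 2

print(f"inter-rater r = {inter_rater_r:.2f}")
print(f"reliability of averaged criterion = {composite_reliability:.2f}")
```

Averaging two raters gives a more reliable criterion than either rater alone, which is one common reason validation studies collect multiple ratings per participant.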

Questionnaires

A matched sample of 308 participants completed a range of different tests, including the Professional Styles and Focus Styles versions of Saville Consulting Wave®, the Hogan Personality Inventory, 16PF5, NEO-PI-R and OPQ32i.

The majority also completed a larger range of questionnaires including the Hogan Development Survey, Thomas International DISC, and MBTI. The order in which these questionnaires were completed was counterbalanced across participants in order to control for fatigue effects.

Features of the major questionnaires which have a larger number of scales are shown below. They are ordered according to test taking time – longer to shorter.

Instrument                  No. of Questions  Typical Completion Time  Format
OPQ32i                      416               60 minutes               Ipsative – Most/Least blocks of 4
NEO-PI-R                    240               40 minutes               Normative – 5 point
Wave Professional Styles    216               40 minutes               Dynamic ipsative/normative
Hogan PI                    206               35 minutes               Normative – True/False
16PF5                       185               30 minutes               Normative – 3 point
Wave Focus Styles           72                13 minutes               Dynamic normative/ipsative


Analysis

We compared all these questionnaires in an identical manner against the global performance measure and the Great 8 model. For the Great 8 criteria we compared questionnaire results using exactly the wording of Bartram (2005)[3], and for the OPQ32i we used the exact Great 8 equations published by SHL. Prior to analysis we predicted which aspects of work performance in the Great 8 model each questionnaire should measure.[4] This was done based on statistical modelling and content review, and we avoided statistical approaches which can lead to overestimated validities.

Results

Total work performance

[Graph: average validity of each of the 6 questionnaires (OPQ32i, Hogan PI, 16PF5, Wave Focus, NEO, Wave Professional) against Total Work Performance; vertical axis 0.0 to 0.6.]

The graph above shows the average validities of the 6 key questionnaires in measuring Total Work Performance, the gold-standard criterion. All 6 questionnaires showed at least a moderate level of validity in predicting work performance; however, the Wave Professional Styles and Wave Focus Styles questionnaires eclipsed all the others.

Power

[Graph: validity achieved per 15 minutes of test-taking time for the same 6 questionnaires; vertical axis 0.0 to 0.6.]

Test-taking time is an increasingly important consideration in a competitive labour market, eg graduate selection. The graph above compares the power of the questionnaires in terms of how much validity can be achieved per 15 minutes. Wave Focus is a clear winner.

[3] Bartram D (2005). The Great Eight Competencies: A Criterion-centric Approach to Validation. Journal of Applied Psychology, 90, 1185-1203.

[4] Available in Saville et al (2009): Handbook to Saville Consulting Wave® Professional Styles, p403.

[Bar labels from the two graphs on this page: .57, .21, .44, .51, .31, .08, .36, .13, .32, .16, .32, .16; each pair gives a questionnaire's validity followed by its validity per 15 minutes.]


Competencies

[Graph: average validity of each of the 6 questionnaires against the Great 8 competencies; vertical axis 0.0 to 0.6.]

Finally, the 6 questionnaires were compared against external ratings of the Great 8 competencies in turn. The average validities in measuring these job competencies are shown in the graph above. Wave Professional and Wave Focus outperformed the other questionnaires on the Great 8 competencies as well.

Shorter popular questionnaires

Because the following questionnaires have fewer scales or are shorter, a different methodology was adopted. Rather than a global Total Work Performance rating (this would not work with short ipsative questionnaires like DISC, for example), each scale was correlated with our four Performance Clusters: Problem Solving, Influencing People, Adapting Approaches and Delivering Results. In each case, a priori[5] hypotheses were made. The results are summarised as follows:

DISC – Overall, there was little validity for DISC. While Dominance predicted Influencing People as hypothesised, it was in fact a better predictor of Problem Solving. Significantly, there were near zero correlations for the important area of Delivering Results.

MBTI – There was some support for MBTI Extraversion scores measuring a person’s ability to influence people at work. Similarly, there was some evidence that people high on Judging were seen as better at Delivering Results. Overall, however, the MBTI showed little validity in measuring broad aspects of work performance.

[5] ie made in advance of the study to avoid claiming chance correlations.

Hogan Development Survey – Out of 11 scales, we found little evidence that the HDS relates positively or negatively to job performance. An exception was a positive correlation for Diligent with Delivering Results. The majority of hypotheses did not validate well. They may validate better against clinical concepts rather than behavioural work criteria.

Further work is being done to see if differences in performance between dark side and light side predictions are substantiated by external raters.

HDS makes a prediction of potential Dark Side behaviour if an individual scores above the 84th percentile. However when 180 of our participants re-completed the HDS one week later, only 8% kept their same light and dark side profiles.
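The profile-stability check above can be sketched as follows (the scores below are simulated for illustration, not the study's data; the 84th-percentile flagging rule is the one the paper describes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated percentile scores on 11 scales for 180 people, measured twice
# (test and one-week retest). Illustrative data only.
n_people, n_scales = 180, 11
test1 = rng.uniform(0, 100, size=(n_people, n_scales))
# Retest scores drift around the original scores.
test2 = np.clip(test1 + rng.normal(0, 15, size=(n_people, n_scales)), 0, 100)

# A scale is flagged as potential "dark side" above the 84th percentile.
flags1 = test1 > 84
flags2 = test2 > 84

# A profile counts as stable only if every flag is identical at retest.
stable = (flags1 == flags2).all(axis=1)
print(f"{100 * stable.mean():.0f}% kept the same profile")
```

Because a profile only counts as stable when all 11 flags agree, even modest score drift around the cut-off produces many changed profiles, which is consistent with the low stability figure reported.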

Feedback gullibility

We are also conscious that many users of questionnaires become attached to instruments because they ‘look right’. To test this theory, we replicated a classic experiment by Forer (1949)[6]. We took a series of statements from the actual feedback reports of some personality questionnaires used in this study and asked 144 participants whether they thought the statements were an accurate description of them. The results are shown below:

[6] Forer BR (1949). The Fallacy of Personal Validation. Journal of Abnormal & Social Psychology, 44, 118-123.

[Bar labels from the Great 8 competencies graph: .50, .47, .36, .37, .32, .28.]


% Agreeing

99% You are fair-minded.

99% You are kind.

97% You are thoughtful.

97% You are reasonable.

99% You are co-operative.

99% You value honest communication.

98% You value sincere appreciation and recognition of a job well done.

98% You like to be on good terms with other people and will generally react to them in a friendly and open way.

We found that almost everyone said that the statements provided were accurate for them.

If the end result of these personality questionnaires is such generalised and socially desirable reports, it is perhaps no wonder that they have a level of appeal for the recipient. However, this is perplexingly at odds with how poorly they differentiate people and how little success they demonstrate in measuring performance at work.

Similarly, people may like the theoretical or practical framework provided by the questionnaire for coaching and development purposes, but that doesn’t mean it will accurately predict work performance.

Why Does Wave Perform Better?

The key to this is that assessment technology, like virtually all technology, is capable of continuous improvement. Summarised below are the key contributors to the improved validity offered by the Wave questionnaires.

Contemporary business considerations

Wave measures personal characteristics and competencies which are relevant for business today, such as IT skills, learning styles, networking, and identifying business opportunities.

Questionnaires developed in previous decades, naturally, have not taken into account the changing nature of job roles and work culture, and so may be measuring out-of-date aspects of work performance.

Development expertise

Wave was continuously validated during its construction, combining a number of important development strategies. Obtaining such a large and comprehensive validation sample and being able to pre-validate 4,000 items was made possible by internet technology and, as far as we know, is unprecedented.

Peter Saville, author of the OPQ®, built on his extensive knowledge of the domain and combined this with online trialling to select the most valid questions. His development team produced an overarching Wave framework which takes into account personality, competencies, aptitudes and the impact of culture in the workplace.

Extensive research on personality questionnaires has indicated that certain strategies are key to successful questionnaire design and these were capitalised upon in the development of the Wave framework. For example, many questionnaires are full of badly worded, ambiguous and negatively phrased questions which can reduce a questionnaire’s validity. Similarly, a single well written item can achieve the same or better validity than a large number of poor items.

Facets of behaviour

Wave questionnaires uniquely show facet splits where the respondent has answered closely related questions somewhat differently. Facet splits highlight a unique aspect of an individual’s preferred style and make it possible to make finer judgements about placement and development. Further, we have used facets to saturate validity.

Talent and Motive interplay

Wave questionnaires separate Talent and Motive items to reveal the unique interplay between effectiveness on the one hand, and motivation on the other. Motive ‘turbocharges’ talent, and when mathematically combined, increases validity.

Ipsative and Normative questions

Both rating (normative) and ranking (ipsative) scales have advantages and disadvantages. A rating scale enables the respondent to answer freely, but makes it easier for them to impression-manage their profile. Forced-choice rankings control for social desirability responding, but they distort in a different way by exaggerating both strengths and weaknesses. Moreover, there is recent evidence that social desirability (a contaminant of rating scales) correlates quite strongly with success, so a person’s belief about their effectiveness often translates to good performance. Wave questionnaires combine the best elements of both approaches, thereby giving a more valid profile. Wave uses both scales in a dynamic way to increase validity and significantly reduce test-taking time. It enables areas of potential distortion to be pinpointed.

Available validity

Being able to examine the validity of a questionnaire with statistical techniques does not necessarily make the validity meaningful to the end user.

Valid scores must be presented simply in a report which can be easily interpreted and used as a direct measure of work performance. While Wave technology is sophisticated, the reporting of Competency Potential is user-friendly, and practitioners are provided with an even better understanding of people’s behaviours and performance than has previously been available to HR specialists and organisational psychologists alike.

Implications

Validity is the key indicator of return on investment. The greater the validity, the greater the ROI.

Small increases in validity have a large effect. For example, going from using a test with a validity of 0.3 to one with a validity of 0.4 represents an increase of 33%, a potentially huge impact on an organisation’s productivity. Such a return is usually dramatically outweighed by the relatively small investment in the test.

Similarly, a key organisational requirement may be to identify the top 20% of performers. Particularly serious errors occur when someone from the bottom 20% is identified as demonstrating top 20% potential. An assessment procedure with a validity of 0.3 will incorrectly identify 1 in 10 people from the bottom 20% of performers to be in the top 20%. An assessment method with a validity of 0.6 on the other hand will reduce this to just 1 in 50.
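The 1-in-10 and 1-in-50 figures can be checked with a quick simulation. This is a sketch under the standard assumption that test score and performance are bivariate normal with correlation equal to the validity (an assumption we make here, not a detail stated in the paper):

```python
import numpy as np

rng = np.random.default_rng(42)

def bottom_as_top_rate(validity, n=2_000_000):
    """Monte Carlo estimate of P(scores in the top 20% on the test |
    person is actually in the bottom 20% of performers), assuming test
    score and performance are bivariate normal with the given correlation."""
    score = rng.standard_normal(n)
    noise = rng.standard_normal(n)
    # Performance correlates with the test score at the stated validity.
    performance = validity * score + np.sqrt(1 - validity ** 2) * noise
    top_cut = np.quantile(score, 0.80)           # top 20% on the test
    bottom_cut = np.quantile(performance, 0.20)  # bottom 20% of performers
    bottom = performance < bottom_cut
    return (score[bottom] > top_cut).mean()

for r in (0.3, 0.6):
    print(f"validity {r}: {bottom_as_top_rate(r):.3f}")
```

Under these assumptions the simulated rates come out close to the figures in the text: roughly 0.10 (1 in 10) at a validity of 0.3 and roughly 0.02 (1 in 50) at a validity of 0.6.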

This simple example shows the importance of using tests which are valid and suitable for predicting actual performance at work, such as the Saville Consulting Wave® styles questionnaires.

Conclusion

This research is one of the first studies to clearly compare the validities of popular personality questionnaires in one paradigm. It also confirms existing research carried out on personality questionnaires. We did not use esoteric statistical techniques which maximise validity at the expense of meaning. The Great 8 is an independent framework against which to measure the validities of all questionnaires on a level playing field. However, it is not necessarily the ultimate method. Against the more discerning ‘Terrific 12’ of the Wave framework, the Wave styles questionnaires actually outperformed all others by about 50%. However, to ensure fairness, we did not use this model here.

It is important to consider the specific functions for which you are using any test. Some, like the Saville Consulting Wave® questionnaires, have been designed to maximise the prediction of performance at work, while others were not originally designed to be used in industrial settings at all. Some have clinical origins and bring obscure, irrelevant and badly written items to the measurement of performance in the workplace. By developing and sharing our understanding of performance and potential at work, we can help more people to self-actualise at work and it is our sincere wish that personality assessment will continue to become more efficient and valid in the future.

Further Information

For further information please contact us:

+61 (0)2 9416 0840
[email protected]

Adapted by S. Ruhfus from Saville P, McIver R, Kurz R and Hopton T: Project Epsom: How Valid is Your Questionnaire? 2008.