Online Early — Preprint of Accepted Manuscript

This is a PDF file of a manuscript that has been accepted for publication in an American Accounting Association journal. It is the final version that was uploaded and approved by the author(s). While the paper has been through the usual rigorous peer review process for AAA journals, it has not been copyedited, nor have the graphics and tables been modified for final publication. Also note that the paper may refer to online Appendices and/or Supplements that are not yet available. The manuscript will undergo copyediting, typesetting, and review of page proofs before it is published in its final form; therefore, the published version will look different from this version and may also have some differences in content.
We have posted this preliminary version of the manuscript as a service to our members and subscribers in the interest of making the information available for distribution and citation as quickly as possible following acceptance.
The DOI for this manuscript and the correct format for citing the paper are given at the top of the online (html) abstract.
Once the final published version of this paper is posted online, it will replace this preliminary version at the specified DOI.
The Accounting Review • Issues in Accounting Education • Accounting Horizons
Accounting and the Public Interest • Auditing: A Journal of Practice & Theory
Behavioral Research in Accounting • Current Issues in Auditing
Journal of Emerging Technologies in Accounting • Journal of Information Systems
Journal of International Accounting Research • Journal of Management Accounting Research
The ATA Journal of Legal Tax Research • The Journal of the American Taxation Association
preprint
accepted manuscript
Finding Needles in a Haystack: Using Data Analytics to Improve Fraud Prediction
Johan L. Perols
Associate Professor of Accounting, University of San Diego, [email protected]

Robert M. Bowen
Distinguished Professor of Accounting, University of San Diego, [email protected]

Carsten Zimmermann
Associate Professor of Management, University of San Diego

Basamba Samba
RWTH Aachen University
Editor’s Note: Accepted by Elaine Mauldin. Submitted February 2015; accepted July 2016.
We acknowledge financial assistance from the School of Business Administration at the University of San Diego and helpful comments from Elaine Mauldin (editor), Darren Bernard, Barbara Bliss, Nicole Cade, Ed deHaan, Weili Ge, Jane Jollineau, Yen-Ting (Daniel) Lin, Sarah Lyon, Dawn Matsumoto, Barry Mishra, Ted Mock, Ryan Ratcliff, Terry Shevlin, Brady Williams, two anonymous reviewers, and workshop participants at the University of California, Riverside and the University of San Diego. All remaining errors are our own.
ABSTRACT
Developing models to detect financial statement fraud involves challenges related to (i) the rarity of fraud observations, (ii) the relative abundance of explanatory variables identified in the prior literature, and (iii) the broad underlying definition of fraud. Following the emerging data analytics literature, we introduce and systematically evaluate three data analytics preprocessing methods to address these challenges. Results from evaluating actual cases of financial statement fraud suggest that two of these methods improve fraud prediction performance by approximately ten percent relative to the best current techniques. Improved fraud prediction can result in meaningful benefits, such as improving the ability of the SEC to detect fraudulent filings and improving audit firms' client portfolio decisions.
Keywords: fraud, financial statement fraud, data analytics, predictive analytics, data rarity, data imbalance
I. INTRODUCTION
Organizations lose an estimated 5 percent of annual revenues to fraud in general and 1.6
percent of annual revenues specifically to financial statement fraud (ACFE 2014). Further, when
resources are misallocated because of misleading financial data, fraud can harm the efficiency of
capital, labor, and product markets. Financial statement fraud (henceforth fraud) also increases
business risk. For example, audit firms can face lawsuits, reputational costs, and loss of clients,
and investors and banks are more likely to make suboptimal investment and loan decisions.
Data analytics is an important emerging field in both academic research (e.g., Agarwal and
Dhar 2014; Chen, Chiang, and Storey 2012) and in practice (e.g., Brown, Chui, and Manyika
2011; LaValle, Lesser, Shockley, Hopkins, and Kruschwitz 2011).1 In the fraud context, data
analytics can, for example, be used to create fraud prediction models that help (i) auditors
improve client portfolio management and audit planning decisions and (ii) regulators and other
oversight agencies identify firms for potential fraud investigation (SEC 2015; Walter 2013).
However, the usefulness of data analytics in fraud prediction is hindered by three challenges.
First, fraud prediction is a “needle in a haystack problem.” That is, the relative rarity of fraud
firms compared to non-fraud2 control firms (Bell and Carcello 2000) makes fraud prediction
difficult (Perols 2011). Second, fraud prediction is complicated by the “curse of data
dimensionality” (Bellman 1961). The rarity of fraud observations relative to the large number of
explanatory variables identified in the fraud literature (Whiting, Hansen, McDonald, Albrecht,
1 Data analytics refers to techniques that are grounded in data mining (e.g., decision trees, artificial neural networks, and support vector machines) and statistics (e.g., ANOVA, regression analysis, and logistic regressions) (Chen et al. 2012). Data analytics draws from statistics, artificial intelligence, computer science, and database research. It is related to big data in that it provides tools that enable the analysis of large datasets. Data analytics is typically focused on prediction as opposed to explanation.
2 We use the term non-fraud firms to describe all firms for which fraud has not been detected. This primarily includes firms that have not committed fraud, but also includes undetected cases of fraud. To the extent that undetected fraud exists in our data, noise is introduced. This noise reduces the effectiveness of all prediction models, and methods that address this noise might further improve fraud prediction. However, this noise is not likely to bias performance comparisons among prediction models that use the same data.
and Albrecht 2012) can result in over-fitted prediction models that perform poorly when
predicting new observations. Third, prior research generally treats all frauds as homogeneous
events. This can make fraud prediction more difficult because prediction models have to detect
patterns that are common across different fraud types (e.g., revenue vs. expense fraud).
While prior fraud detection research enhances our general understanding of fraud indicators
and prediction methods, this research rarely addresses these problems explicitly. With a primary
objective of improving fraud prediction, we address these challenges by introducing three
methods grounded in data analytics research.3 The methods we examine have performed well in
other settings characterized by data rarity, such as predicting credit card fraud (e.g., Chan and
Stolfo 1998). The first method, Multi-subset Observation Undersampling (OU), addresses the
imbalance between the low number of fraud observations relative to the number of non-fraud
observations by creating multiple subsets of the original dataset that each contain all fraud
observations and different random subsamples of non-fraud observations. The second method,
Multi-subset Variable Undersampling (VU), addresses the imbalance between the low number of
fraud observations relative to the number of explanatory variables identified in the fraud
prediction literature by creating multiple subsets of randomly selected explanatory variables.
The third method, VU partitioned by type of fraud (PVU), is a variation of the second method
that addresses issues associated with treating all fraud cases as homogenous events. Rather than
randomly selecting variables, we instead use our a priori knowledge to partition the variables
into subsets based on their relation to specific types of fraud (e.g., revenue vs. expense fraud).
We use a dataset with 51 fraud firms, 15,934 non-fraud firm years, and 109 explanatory
variables from prior research. We then analyze over 10,000 prediction models to systematically
3 We evaluate our results on out-of-sample data and thus perform predictive modeling. To clearly delineate our work from explanatory models, we refer to our models as prediction models throughout the paper.
evaluate how to best implement these methods, e.g., how many data subsets to use in OU. In
addition, we examine the prediction performance of these implementations relative to various
benchmarks that represent the current standard in the literature, e.g., model 2 in Dechow, Ge,
Larson, and Sloan (2011) and simple undersampling as used in Perols (2011). To avoid biasing
the results, we evaluate prediction performance using the prediction models’ probability
predictions on hold-out data that are not processed by the proposed methods.
Results indicate that including additional data subsets (up to approximately 12 subsets)
increases OU fraud prediction performance, i.e., additional subsets after 12 do not appear to
enhance performance. This 12-subset configuration improves prediction performance by 10.8
percent relative to the best performing benchmark.
While results indicate that VU also has the potential to improve fraud prediction, the
performance of this method is highly dependent on the specific variables selected in the various
subsets. However, performance improves when we use a priori knowledge to separate
independent variables into different subsets based on the type of fraud they are likely to predict,
e.g., revenue or expense fraud. This method, i.e., PVU, improves fraud prediction performance
by 9.6 percent relative to the best performing VU benchmark. Additional analyses also show
that performance can be further improved by combining OU and PVU, but only under certain
conditions as described in section IV.
Our paper makes at least five important contributions. First, by introducing and
systematically evaluating three new methods and showing that two of these methods improve
prediction performance relative to the best performing benchmarks, we directly contribute to
research that focuses on improving the performance of fraud prediction models. The
performance improvements from OU and PVU are large relative to other approaches for
improving prediction performance, e.g., (i) a 0.9 percent performance advantage in Dechow et al.
(2011) when two additional significant independent variables are added to their initial model and
(ii) a 2.2 percent improvement in Price, Sharp, and Wood (2011), when comparing Audit
Integrity’s Accounting and Governance Risk measure to Dechow et al. (2011) model 2.4
Second, the finding that OU significantly improves prediction performance has important
methodological implications for research that evaluates the value of new explanatory variables.
This research can potentially benefit from applying OU to ensure that (i) results are robust across
different subsamples and (ii) new variables provide incremental predictive value to models
implemented using our recommended methods.
Third, we show that the ability of VU to predict fraud improves consistently only when we
recognize that not all frauds are alike and therefore subdivide the general fraud problem into
types of fraud. The importance of this approach likely extends beyond variable undersampling.
For example, future research could reorganize or design new fraud variables to predict a specific
fraud type (e.g., revenue fraud or expense fraud).
Fourth, OU and PVU can be extended to address rarity and data dimensionality problems that
are prevalent in other accounting classification settings, including prediction of financial
statement restatements, material weaknesses in internal controls, auditor resignations, audit
qualifications, and bankruptcy.
Finally, the introduction and evaluation of these methods makes an important contribution to
practice. Better prediction models can, for example, help the SEC and external auditors improve
4 Dechow et al. (2011) do not report predictive performance; the 0.9 percent difference is based on a separate analysis that we performed using the two models in their paper (Model 1 and Model 2). This analysis uses the same procedures used in our material misstatement analysis described in Section IV. Price et al. (2011) compare Audit Integrity’s Accounting and Governance Risk measure, which is considered the gold standard in commercial risk measures, to Dechow et al. (2011) Model 1 using material misstatement data. Based on their results, we calculate a 3.16 percent fraud prediction performance improvement of the commercial measure over Model 1. This implies a 2.24 percent improvement over Dechow et al. (2011) Model 2, which we include as one of our benchmark models.
their identification of potentially fraudulent accounting practices (Walter 2013; SEC 2015).
The remainder of the paper is organized as follows. Section II summarizes the fraud
literature, discusses data rarity, and describes how methods drawn from the data analytics
literature can be applied to fraud prediction. Section III describes the data, performance
measure, and experimental design. Section IV provides results, and Section V concludes.
II. PRIOR LITERATURE, BACKGROUND, AND PROPOSED METHODS
Prior Fraud Prediction Research
Research on financial statement fraud prediction contributes to understanding factors that can
be used to predict fraud. Prior research includes testing fraud hypotheses grounded in the
earnings management and corporate governance literatures (e.g., Beasley 1996; Dechow, Sloan,
and Sweeney 1996; Summers and Sweeney 1998; Beneish 1999; Sharma 2004; Erickson,
Hanlon, and Maydew 2006; Lennox and Pittman 2010; Feng, Ge, Luo, and Shevlin 2011; Perols
and Lougee 2011; Caskey and Hanlon 2013; Armstrong, Larcker, Ormazabal, and Taylor 2013;
Markelevich and Rosner 2013). This research also evaluates the significance of a variety of
other potential explanatory variables, such as red flags emphasized in auditing standards,
discretionary accruals measures, and non-financial indicators (e.g., Loebbecke, Eining, and
Willingham 1989; Beneish 1997; Lee, Ingram, and Howard 1999; Apostolou, Hassell, and
Webber 2000; Kaminski, Wetzel, and Guan 2004; Ettredge, Sun, Lee, and Anandarajan 2008;
Jones, Krishnan, and Melendrez 2008; Brazel, Jones, and Zimbelman 2009; Dechow et al. 2011).
We use independent variables from this research as input into our models.
Varian (2014) highlights the importance of the emerging field of data analytics. He suggests
that researchers using traditional econometric methods should consider adapting recent advances
from this field. A second stream of financial statement fraud prediction research follows this
suggestion and applies developments in data analytics research to improve fraud prediction.
Early research within this stream concludes that artificial neural networks perform well relative
to discriminant analysis and logistic regressions (e.g., Green and Choi 1997; Fanning and Cogger
1998; Lin, Hwang, and Becker 2003). More recent research in this stream examines additional
classification algorithms, such as support vector machines, decision trees, and adaptive learning
methods (e.g., Cecchini, Koehler, Aytug, and Pathak 2010; Perols 2011; Abbasi, Albrecht,
Vance, and Hansen 2012; Gupta and Gill 2012; Whiting et al. 2012) and text mining methods
(e.g., Glancy and Yadav 2011; Humpherys, Moffitt, Burns, Burgoon, and Felix 2011; Goel and
Gangolly 2012; Larcker and Zakolyukina 2012). We follow recent fraud data analytics research
(e.g., Cecchini et al. 2010) and findings in Perols (2011) and implement all prediction models
using support vector machines. Support vector machines determine how to separate fraud firms
from non-fraud firms by finding the hyperplane that provides the maximum separation in the
training data between fraud and non-fraud firms. In additional analyses reported in Online
Appendix B, we also use logistic regression and bootstrap aggregation to examine the robustness
of our results.
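To make the modeling step concrete, training a support vector machine and obtaining out-of-sample probability predictions might look as follows. This is a toy sketch on synthetic data using scikit-learn; the paper does not prescribe a particular software implementation, and all names and numbers below are ours.

```python
# Toy sketch: train an SVM fraud classifier and obtain probability
# predictions for new observations. The SVM separates the two classes by
# finding the maximum-margin hyperplane in the training data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic training data: 10 "fraud" and 40 "non-fraud" firm-years,
# each described by 5 explanatory variables.
X = np.vstack([rng.normal(loc=1.0, size=(10, 5)),
               rng.normal(loc=-1.0, size=(40, 5))])
y = np.array([1] * 10 + [0] * 40)       # 1 = fraud, 0 = non-fraud

model = SVC(kernel="linear", probability=True).fit(X, y)

# Fraud probability predictions on out-of-sample observations.
probs = model.predict_proba(rng.normal(size=(3, 5)))[:, 1]
```

In the paper's design, classifiers like this are built on preprocessed data subsets and evaluated on hold-out data; the sketch only illustrates the classifier mechanics.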
Data Rarity, Related Prior Research, and Proposed Methods
Data rarity is observed in diverse prediction settings, such as credit card fraud (Chan and
Stolfo 1998), auto insurance fraud (Phua, Alahakoon, and Lee 2004), bankruptcy (Shin, Lee, and
Kim 2005), and financial statement fraud (Whiting et al. 2012). Classification algorithms (e.g.,
logistic regression) have inherent difficulties in processing rarity (Weiss 2004), and data rarity is
regarded as one of the primary challenges in data analytics research (Yang and Wu 2006). Data
rarity is particularly severe in financial statement fraud detection because financial statement
fraud is characterized by both (i) relative rarity (a.k.a., the needle in the haystack problem) and
(ii) absolute rarity combined with an abundance of explanatory variables proposed in the
literature (a.k.a., the curse of data dimensionality problem).
The needle in a haystack problem. Relative rarity occurs when detected fraud observations
are a relatively small percentage of the majority non-fraud observations, e.g., only
approximately 0.6 percent of all audited U.S. financial reports have been identified as fraudulent (Bell and
Carcello 2000). Relative rarity is a challenge since it forces classification algorithms to consider
a large number of potential patterns without having enough fraud observations to determine
which patterns are driven by noisy data. This increases the risk that identified patterns are based
on spurious relations in a particular sample, resulting in increased false positive rates for a given
false negative rate when the developed model is applied to a new sample (Weiss 2004). Further,
to minimize total classification errors, algorithms tend to be biased towards classifying
observations from the majority class correctly (e.g., Maloof 2003). To illustrate, if 99 percent of
all observations are non-fraudulent, a prediction model identifying all observations as non-
fraudulent achieves an overall accuracy of 99 percent – correctly classifying 100 percent of the
non-fraudulent observations, but 0 percent of the fraudulent observations.
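The arithmetic behind this illustration is easy to verify (the numbers below are the hypothetical 99 percent example, not our sample):

```python
# A trivial model that labels every observation "non-fraud" on a
# 99%-non-fraud sample: overall accuracy looks excellent even though
# no fraud case is caught.
n_fraud, n_nonfraud = 10, 990            # hypothetical 1%-fraud sample

correct = n_nonfraud                     # only non-fraud cases are correct
accuracy = correct / (n_fraud + n_nonfraud)   # 0.99
fraud_detection_rate = 0 / n_fraud            # 0% of fraud cases detected
```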
Perols (2011) takes an initial step towards addressing the relative rarity problem in a fraud
context by examining the performance of classification algorithms after undersampling the non-
fraud observations. However, while the simple undersampling method used in Perols (2011),
i.e., a method that simply removes non-fraud observations from the sample, generates more
balanced datasets, it also discards potentially useful non-fraud observations. We, therefore,
introduce a more sophisticated undersampling method that does not discard non-fraud
observations (and include simple undersampling as a benchmark).
More specifically, we use Multi-subset Observation Undersampling (OU), developed by
Chan and Stolfo (1998), to address relative rarity. OU uses multiple data subsets, where each
subset contains all fraud observations but different subsamples of non-fraud observations. We
specifically select OU because prior research shows that it performs well in other settings
constrained by relative rarity, such as predicting credit card fraud (e.g., Chan and Stolfo 1998).
OU is also effective compared to (i) other undersampling and oversampling methods (Nguyen,
Cooper, and Kamei 2012) and (ii) various types of bootstrap aggregation, boosting, and hybrid
ensemble data rarity methods used in the data analytics literature (Galar, Fernández,
Barrenechea, Bustince, and Herrera 2012). Nguyen et al. (2012) suggest that OU improves
performance not only because it makes the ratio of minority (fraud) to majority (non-fraud)
observations more balanced, but also because it more efficiently incorporates potentially useful
majority cases. By increasing the balance between fraud and non-fraud cases, OU adjusts the
focus of the classification algorithms towards the fraud cases. This focus is desirable given the
importance of minority cases in fraud detection, i.e., it is more costly to incorrectly classify fraud
cases than non-fraud cases. By creating multiple prediction models that are based on different
non-overlapping subsets of majority observations, each prediction model is likely to differ
somewhat from the other prediction models. Importantly, patterns that are predictive of fraud are
likely to be present in multiple subsets. However, spurious patterns that exist by random chance
in individual subsets are unlikely to also exist in other subsets. By using a combination of these
models rather than a model built using a single dataset, potentially important patterns are more
likely to be identified and estimated accurately (by combining models with slightly different
pattern estimates, the estimates should be more robust). Additionally, when individual models
are combined, spurious patterns are likely to be discarded (or given less weight). Thus, by
combining predictions from multiple models, OU reduces the risk of overfitting, i.e., that the
prediction model has good in-sample performance but does not generalize to new observations.
When applied in the fraud setting, OU first preprocesses the model building data by dividing
the data into multiple subsets, where each subset includes all fraud observations and a random
sample of non-fraud observations selected without replacement (Figure 1). Thus, all fraud
observations are included in all subsets while each non-fraud observation is part of at most one
subset. Each subset is then used in combination with a classification algorithm to build a fraud
prediction model. To perform fraud prediction, each prediction model is then applied to out-of-
sample data. For each observation in the out-of-sample data, the resulting model predictions are
combined into an overall fraud probability prediction for the observation (see further details in
the next section).
<Insert Figure 1 Here>
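The OU preprocessing and prediction-combination steps described above can be sketched as follows. This is a simplified illustration: the function and variable names are ours, and simple averaging is used as one concrete way to combine the per-model probability predictions.

```python
import numpy as np

def ou_subsets(fraud_idx, nonfraud_idx, n_subsets, seed=0):
    """Multi-subset Observation Undersampling: every subset contains all
    fraud observations; non-fraud observations are sampled without
    replacement, so each appears in at most one subset."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(nonfraud_idx), n_subsets)
    return [np.concatenate([fraud_idx, part]) for part in parts]

# Toy example: 5 fraud and 20 non-fraud observations split into 4 subsets.
subsets = ou_subsets(np.arange(5), np.arange(5, 25), n_subsets=4)

# One prediction model is built per subset; their out-of-sample probability
# predictions (rows = models, columns = observations) are combined -- here
# by averaging -- into one fraud probability per observation.
per_model_probs = np.array([[0.2, 0.9],
                            [0.4, 0.7],
                            [0.3, 0.8],
                            [0.1, 0.6]])
combined = per_model_probs.mean(axis=0)
```

Because spurious patterns tend to differ across the non-overlapping non-fraud samples while genuine fraud patterns recur, the combined prediction is more robust than any single model's.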
The curse of data dimensionality problem. According to the “curse of data dimensionality,”
data requirements increase exponentially with the number of explanatory variables in the dataset
(Bellman 1961). The curse of data dimensionality is a potential problem in fraud prediction
because the number of known fraud cases is small relative to the extensive number of
independent variables identified in prior fraud research. Hence, only a small number of fraud
observations are available to identify patterns among a large number of independent variables
and fraud. This may result in over-fitted prediction models that perform poorly when predicting
new observations.
By using stepwise backward variable selection to build a parsimonious fraud prediction
model, Dechow et al. (2011) partially address the problem of data dimensionality in the fraud
context. However, while stepwise backward variable selection is designed to retain explanatory
variables with the highest significance levels, it may discard potentially useful variables. We
build on Dechow et al. (2011) and introduce a new method that attempts to address the curse of
data dimensionality, while simultaneously retaining potentially useful explanatory variables. We
include the Dechow et al. (2011) model as a benchmark in our analyses.
To reduce the imbalance between minority fraud observations and the number of variables
identified in the literature to predict fraud, we design a new data rarity method, Multi-subset
Variable Undersampling (VU).5 VU randomly splits the set of explanatory variables without
replacement into different subsets (Figure 2). Each subset contains the same observations, but
different non-overlapping sets of explanatory variables. As with OU, each subset is then used in
combination with a classification algorithm to build a fraud prediction model that is applied to
out-of-sample data. For each observation in the out-of-sample data, the resulting model
predictions are then combined into an overall fraud probability prediction for the observation.
<Insert Figure 2 Here>
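VU's variable-level split mirrors OU's observation-level split; a minimal sketch with hypothetical variable names (ours, not the paper's):

```python
import numpy as np

def vu_subsets(variables, n_subsets, seed=0):
    """Multi-subset Variable Undersampling: randomly split the explanatory
    variables without replacement into non-overlapping subsets; each subset
    keeps all observations but only its own variables."""
    rng = np.random.default_rng(seed)
    return [list(part) for part in
            np.array_split(rng.permutation(variables), n_subsets)]

# Toy example: 12 explanatory variables split into 3 subsets of 4 each;
# a model is then built per subset and the predictions combined.
variables = np.array([f"x{i}" for i in range(12)])
subsets = vu_subsets(variables, n_subsets=3)
```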
Partitioning Fraud into Types
Managers commit financial statement fraud by manipulating specific accounts, e.g., they may
improve reported earnings by artificially increasing revenue or reducing expenses. Many
financial statement fraud variables used in the literature are inherently related to a specific type
of fraud. For example, abnormal revenue growth is a potential measure of revenue fraud while
an abnormally low amount of allowance for doubtful accounts is a potential measure of expense
fraud. Although these variables may provide useful information about a specific type of fraud,
they are less likely to detect multiple types of fraud. When different fraud types are combined
into a binary classification problem, variables that are helpful when detecting a specific type of
fraud may be discarded if they do not do well in predicting fraud in general. For example, a
variable that provides a good signal about expense fraud but provides no useful information
5 In an attempt to further mitigate problems associated with having a small number of fraud observations to learn from, we examine the usefulness of an observation oversampling method named SMOTE (Chawla, Bowyer, Hall, and Kegelmeyer 2002) in fraud prediction. We, however, do not find a significant performance advantage for SMOTE relative to simple oversampling (results available from the corresponding author) and as such do not recommend SMOTE to address data rarity in the fraud context.
about other types of fraud will only provide value when classifying expense fraud cases, which
in our sample is only about ten percent of the fraud cases. Additionally, by combining different
fraud types into a binary classification problem, the classification algorithms focus on finding
patterns common to all fraud types. Given heterogeneity among different fraud types, such
patterns may be difficult to detect.
To reduce the potential negative effects associated with combining different fraud types into
binary classification models, we implement VU by partitioning the independent variables based
on different fraud types (PVU).6 When implementing PVU, we place all variables that appear to
predict a specific fraud type into a separate variable subset. Variables that can be used to predict
multiple fraud types are placed in multiple subsets. This creates four subsets of variables relating
to revenue, expenses, assets, and liabilities (for each subset, the model building data are
restricted to fraud observations that represent the associated fraud type). We also include three
additional variable subsets, because some fraud variables measure general attributes of fraud,
such as incentives, opportunities, or the aggregate effect of fraud. The first of these subsets
includes all variables not categorized as a specific fraud type variable. The second subset
includes the variables used in Dechow et al. (2011). These variables are included for their utility
in binary fraud prediction. The third subset includes all variables and is created to allow the
classifiers to find patterns among both fraud type specific and non-fraud type specific variables.
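The PVU partitioning can be sketched as follows. The variable names and type assignments below are hypothetical placeholders, not the paper's actual mapping (the variables themselves are summarized in Appendix A):

```python
# Hypothetical a-priori assignment of variables to fraud-type subsets.
# A variable predictive of several fraud types appears in several subsets.
fraud_type_map = {
    "revenue":     ["abnormal_revenue_growth", "receivables_growth"],
    "expenses":    ["allowance_doubtful_accts", "receivables_growth"],
    "assets":      ["soft_asset_pct"],
    "liabilities": ["unexpected_liab_change"],
}
dechow_vars = ["dechow_var_1", "dechow_var_2"]   # placeholder names
general_vars = ["ceo_chair_duality"]             # placeholder names

categorized = {v for vs in fraud_type_map.values() for v in vs}
all_vars = sorted(categorized | set(dechow_vars) | set(general_vars))

# Four fraud-type subsets plus three general subsets, as in the text:
subsets = dict(fraud_type_map)
subsets["general"] = [v for v in all_vars if v not in categorized]
subsets["dechow"] = list(dechow_vars)   # Dechow et al. (2011) variables
subsets["all"] = list(all_vars)         # full variable set
```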
III. DATA AND EXPERIMENTAL DESIGN
Sample Data
We obtain a sample containing 51 fraud firms7 and 15,934 non-fraud firm years from Perols
6 Additionally, the use of multiple VU variable subsets that focus on different fraud types increases the likelihood that different prediction models capture different fraud patterns, which improves diversity among the prediction models. Prediction model diversity is important for performance when combining multiple models (Kittler, Hatef, Duin, and Matas 1998). We do not modify OU based on fraud types because OU only undersamples the non-fraud data and does not preprocess the fraud data.
7 This sample size of 51 fraud firms is comparable to other fraud studies (e.g., Beasley 1996; Erickson et al. 2006; Brazel et al.
(2011). We only include one firm year for each fraud observation that corresponds to the first
year that the Accounting and Auditing Enforcement Release (AAER) alleges that fraud was
committed. We do not include previous years as the fraud may have predated the reported first
fraud year. We do not include multiple fraud years for each fraud firm to prevent a single fraud
firm from being included in both the model building dataset and the out-of-sample model
evaluation dataset.
Perols (2011) identifies fraud firms in SEC investigations reported in AAERs between 1998
and 2005 that explicitly reference Section 10(b) Rule 10b-5 (Beasley 1996) or contain
descriptions of fraud. This fraud firm dataset excludes: financial firms; firms without the first
fraud year specified in the SEC release; non-annual financial statement fraud; foreign firms;
releases related to auditors; not-for-profit organizations; fraud related to registration statements,
10-KSB or IPO; and firms with missing Compustat (financial statement data), Compact D/SEC
(executive and director names, titles and company holdings), or I/B/E/S data (one-year-ahead
analyst earnings per share forecasts and actual earnings per share) in relevant years.8 Randomly
selected Compustat non-fraud firms (excluding observations following the applicable criteria
specified for fraud firms above) are added to the fraud firm dataset to create a sample with 0.3
percent fraud firms, which allows us to examine the robustness of the results around best
estimates of prior fraud probability, i.e., 0.6 percent (Bell and Carcello 2000), in the population
of interest. We include explanatory variables (summarized in Appendix A) that have been used
in recent literature to predict fraud or material misstatements (Cecchini et al. 2010; Dechow et al.
2011; Perols 2011). More specifically, we include all variables from Perols (2011) and all
2009). Other research (e.g., Dechow et al. 2011) uses AAERs to create samples focused on material misstatements. Material misstatement data include firms with AAERs that explicitly allege fraud as well as other firms that describe a material misstatement without explicitly alleging fraud. While such samples are larger, they do not necessarily focus on fraud.
8 Since we add additional variables to the Perols (2011) dataset, some of the variables have missing values. Missing values are replaced by global means/modes. The effect of this is a reduction in the utility of variables that have many missing values.
variables from the final Dechow et al. (2011) model that can be calculated using Compustat data.
Following and extending Cecchini et al. (2010), we also include 48 variables measuring levels
and changes in levels, percentage changes in levels, and abnormal percentage changes of
commonly manipulated financial statement items and ratios.
Experimental Design
Overview of the experiments. As summarized in Table 1, we perform multiple experiments
to (i) determine how to best implement OU and VU (e.g., how many subsets to use) and (ii)
evaluate their relative performance compared to various benchmarks. The primary objective in
these experiments is to detect trends that indicate how to implement the methods in future
research. By detecting clear trends between the number of subsets and predictive ability rather
than selecting implementations that happen to be the most predictive, we reduce the risk that we
recommend implementations that perform well on our test data, but do not generalize well.
In experiment 1, we use OU to create observation subsets that contain all fraud observations
and random samples of non-fraud observations that yield 20 percent fraud observations per
subset. In an evaluation of simple undersampling ratios, Perols (2011) finds that this ratio
provides relatively good performance. We then evaluate how many observation subsets to
include when implementing OU. In experiment 2a, we use VU to randomly divide the variables
used in prior fraud prediction research into 20 subsets. We then assess how many variable
subsets to include when implementing VU. In experiment 2b, we examine whether the number
of variables included in each subset affects performance by dividing the total number of
variables into subsets as follows: one subset with all variables, two subsets each with one-half of
the variables, four subsets each with one-quarter, six subsets each with one-sixth, eight subsets
each with one-eighth, etc. We then evaluate how many variables per subset to include when
implementing VU. Finally, in experiment 3, we evaluate the performance of VU when
independent variables are partitioned based on their relation to specific types of fraud.
<Insert Table 1 Here>
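To make the OU and VU constructions concrete, the following is a minimal Python sketch under our own assumptions (list-based observations and variable names; the paper's implementation uses Weka). The function names `ou_subsets` and `vu_subsets` are ours, not the authors'.

```python
import random

def ou_subsets(fraud, non_fraud, n_subsets, fraud_share=0.20, seed=0):
    """Multi-subset Observation Undersampling (OU): each subset keeps all
    fraud observations plus a fresh random draw of non-fraud observations,
    sized so fraud observations make up `fraud_share` of the subset."""
    rng = random.Random(seed)
    n_non_fraud = round(len(fraud) * (1 - fraud_share) / fraud_share)
    return [fraud + rng.sample(non_fraud, n_non_fraud)
            for _ in range(n_subsets)]

def vu_subsets(variables, n_subsets, seed=0):
    """Multi-subset Variable Undersampling (VU): randomly divide the
    explanatory variables into `n_subsets` roughly equal subsets."""
    rng = random.Random(seed)
    shuffled = variables[:]
    rng.shuffle(shuffled)
    return [shuffled[i::n_subsets] for i in range(n_subsets)]
```

With 51 fraud observations and a 20 percent target fraud share, each OU subset contains the 51 fraud observations plus 204 freshly sampled non-fraud observations.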
After selecting what appear to be robust implementations, we determine whether these
implementations outperform assorted benchmarks in predicting fraud. Because we introduce OU
to the fraud detection literature to reduce the imbalance between the number of fraud versus non-
fraud observations, we use simple undersampling as a benchmark (Perols 2011) for OU.9 This
benchmark randomly removes non-fraud observations from the model-building sample to
generate a more balanced sample. OU and the OU benchmarks use all variables (as independent
variable reduction is examined in the VU analysis) and are implemented using support vector
machines.
We introduce VU (and PVU) as an independent variable (data dimensionality) reduction
method that has the potential to improve the performance over currently used variable selection
methods. As a baseline we include a benchmark (the Dechow benchmark) that uses the
independent variables from model 2 in Dechow et al. (2011). We also use (i) a benchmark that
randomly selects variables and (ii) a benchmark that includes all variables (the all variables
benchmark) where data dimensionality is not reduced. The benchmark that randomly selects
variables performs better than both the Dechow benchmark and the all variables benchmark.10
Thus, we report our VU (and PVU) results using the benchmark that randomly selects variables.
VU, PVU, and their benchmarks use all observations (observation undersampling is examined in
the OU analysis) and are implemented using support vector machines.
9 We also used ‘no undersampling’ as an additional benchmark. However, because simple undersampling performs better than no undersampling by 7.3 percent, we adopted simple undersampling as the benchmark.
10 The Dechow benchmark performed 0.02 percent better than the all variables benchmark and the random variable selection benchmark performed 3.87 percent better than the Dechow benchmark.
10-fold cross-validation. Out-of-sample performance measures are generally preferred over in-sample performance measures because they provide a “more realistic measure of prediction performance than measures commonly used in economics” (Varian 2014: 7), and cross-validation is particularly useful. We use stratified 10-fold cross-validation, where 10 folds (i.e.,
subsamples of observations) are generated using random sampling without replacement. The 10
folds rotate between being used for training and testing the prediction models. In each rotation,
nine folds are used for training (i.e., model building) and one fold is used for testing (i.e., model
evaluation). For example, in the first round, subsets one through nine are used for training and
subset 10 is used for testing; in round two, subsets one through eight and subset 10 are used for
training, and subset nine is used for testing. By using stratified cross-validation, we ensure that
the ratio of fraud to non-fraud observations is kept consistent across all folds. With a total of 51
fraud firms in the sample, 45 or 46 fraud firms are used for model building and five or six fraud
firms are used for model evaluation in each cross-validation round. In our experiments, the OU
and VU methods are only applied to training data.
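The stratified fold construction can be sketched as follows (our own minimal Python; the paper relies on Weka's cross-validation support). Dealing observations round-robin within each class keeps the fraud ratio approximately equal across folds.

```python
import random

def stratified_folds(fraud_ids, non_fraud_ids, k=10, seed=0):
    """Split observations into k folds, sampling without replacement and
    keeping the fraud/non-fraud ratio (approximately) equal across folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for ids in (list(fraud_ids), list(non_fraud_ids)):
        rng.shuffle(ids)
        for i, obs in enumerate(ids):
            folds[i % k].append(obs)  # deal observations round-robin per class
    return folds
```

In round r, the folds other than r are concatenated into the training set and fold r is held out for testing; with 51 fraud firms, each fold receives five or six fraud observations, as noted above.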
Prediction performance metric. Following prior financial statement fraud research (e.g.,
Beneish 1997; Feroz, Kwon, Pastena, and Park 2000; Lin et al. 2003; Perols 2011), we use
expected cost of misclassification (ECM) as the preferred performance metric. ECM allows
researchers to vary two important parameters when evaluating a prediction model’s performance
on out-of-sample data: (i) estimated percentage of fraud firms in the population of interest and
(ii) estimated ratio of the cost of a false negative to the cost of a false positive in the population
of interest. Including both parameters is important in settings such as fraud prediction that are
characterized by relative rarity and uneven misclassification costs. Given specific classification
results, ECM is calculated as follows:
ECM = CFN × P(Fraud) × (nFN / nP) + CFP × P(Non-Fraud) × (nFP / nN) (1)
where CFP and CFN are estimates of the cost of false positive and false negative classifications,
respectively, deflated by the lower of CFP or CFN; P(Fraud) and P(Non-Fraud) are estimates of
prior probability of fraud and non-fraud, respectively; nFP and nFN are the number of false
positive and false negative classifications, respectively, on the cross-validation test data;11 and nP
and nN are the number of fraud and non-fraud observations, respectively, in the cross-validation
test data. Bayley and Taylor (2007) estimate that actual cost ratios (FN to FP cost) average
between 20:1 and 40:1, while Bell and Carcello (2000) estimate that approximately 0.6 percent
of all firm years represent detected fraud. Thus, in experiments that compare model
prediction performance at best estimates of prior fraud probability and cost ratios, we calculate
ECM at a cost ratio of 30:1 and a prior fraud probability of 0.6 percent (together with the
prediction models’ actual false positive and false negative rates). The goal of the prediction
models is to minimize ECM.
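Equation (1) maps directly into code. The sketch below (our own Python; the names are illustrative) also shows the optimal-threshold search described in footnote 11, which evaluates ECM at every unique predicted probability.

```python
def ecm(c_fn, c_fp, p_fraud, n_fn, n_fp, n_pos, n_neg):
    """Expected cost of misclassification, equation (1). Costs are deflated
    by the lower of the two costs, so the cheaper error type has cost 1."""
    deflator = min(c_fn, c_fp)
    c_fn, c_fp = c_fn / deflator, c_fp / deflator
    return (c_fn * p_fraud * n_fn / n_pos
            + c_fp * (1 - p_fraud) * n_fp / n_neg)

def best_threshold_ecm(probs, labels, c_fn, c_fp, p_fraud):
    """Evaluate ECM at every unique predicted probability as a candidate
    classification threshold and return the minimum (cf. footnote 11)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = float("inf")
    for t in sorted(set(probs)):
        n_fn = sum(1 for p, y in zip(probs, labels) if y == 1 and p < t)
        n_fp = sum(1 for p, y in zip(probs, labels) if y == 0 and p >= t)
        best = min(best, ecm(c_fn, c_fp, p_fraud, n_fn, n_fp, n_pos, n_neg))
    return best
```

For example, at the best-estimate parameters (cost ratio 30:1, prior fraud probability 0.6 percent), a hypothetical model that misses 1 of 5 frauds and flags 50 of 995 non-fraud firms yields ECM = 30 × 0.006 × (1/5) + 0.994 × (50/995) ≈ 0.086.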
Experimental procedure – OU example. The following example provides a summary step-by-step description of the experimental procedures using OU (see Figure 3). Apart from the ECM score calculations, which were performed using Visual Basic for Applications in Excel, the experimental procedures were implemented using the Knowledge Flow interface in Weka
(Witten and Frank 2005). Weka is a collection of machine learning algorithms that provide
support for the data mining process, including loading and transforming data, building prediction
models, and evaluating the performance of prediction models (Witten and Frank 2005). Weka is
primarily used to evaluate the out-of-sample performance of models rather than to evaluate the
statistical significance of independent variables. (Weka provides a number of algorithms that help select independent variables to include in models, but even then, the focus is not on evaluating the statistical significance of the independent variables.)
11 Following prior research (e.g., Beneish 1997; Feroz et al. 2000; Lin et al. 2003; Perols 2011), nFP and nFN are obtained using optimal fraud classification thresholds (i.e., probability cutoffs for classifying an observation as fraud or non-fraud) for each combination of prior fraud probability and cost ratio. These optima are established by examining ECM scores using all unique fraud probability predictions as potential thresholds.
1) The full sample is first separated into model-building data (a.k.a., training data) and model-
evaluation data (a.k.a., test data) using 10-fold cross-validation.
2) For each cross-validation round and OU implementation, the OU method is applied to the
training data (but not the test data, which is left intact) to partition the training data into OU
subsets. For example, in the first cross-validation round when evaluating the OU
implementation with 12 subsets, the OU method creates 12 subsets of the first training set.
3) A classification algorithm, which in our experiments is a support vector machine algorithm,
is used with each OU training subset generated in step 2 to build one prediction model for
each OU subset. For example, in OU with 12 subsets, a total of 12 prediction models are
generated.
4) The test set, which was not modified using the OU method, is applied to each of the
prediction models generated in step 3.
5) For each observation in the test set, the probability predictions from each prediction model
are combined by averaging the probability predictions. This method (averaging) has been
found to perform well compared to more complex combiner methods (Duin and Tax 2000).
After combining the probability predictions, each observation in the test set has a single
probability prediction representing the average prediction of all the prediction models
developed in step 3.
6) The probability predictions along with the class labels (i.e., fraud or non-fraud) are used to
calculate ECM scores. When calculating ECM scores, optimal fraud classification thresholds
(“cutoffs”) are first determined for each combination of prior fraud probability and cost ratio
by examining ECM scores at different classification threshold levels (Beneish 1997).
Optimal thresholds are then used to calculate ECM scores for each combination of prior
fraud probability and cost ratio for that specific test dataset.
7) The experimental procedure repeats steps two through six for each cross-validation round
and each OU implementation, e.g., OU with two subsets, OU with three subsets, etc. within
each cross-validation round.
8) After completing all ten rounds, each OU implementation has ten ECM scores (one for each
test set) for each prior fraud probability and cost ratio combination. Averages of the ten ECM
scores are then used to examine prediction performance of different OU implementations and
against the benchmarks at different prior fraud probability and cost ratio levels.
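Steps 3 through 5 can be sketched as follows. Because the paper's support vector machines are implemented in Weka, we substitute a deliberately simple centroid-distance scorer on a single numeric feature as a stand-in classifier; the subset-train-and-average structure is the point of the sketch, not the model.

```python
import statistics

def train_centroid_model(rows):
    """Stand-in for the paper's support vector machine: scores an observation
    by its distance to the fraud vs. non-fraud centroids of one OU subset.
    Each row is a (feature value, label) pair with label 1 = fraud."""
    fraud = [x for x, y in rows if y == 1]
    clean = [x for x, y in rows if y == 0]
    mu_f, mu_c = statistics.mean(fraud), statistics.mean(clean)
    def predict_proba(x):
        df, dc = abs(x - mu_f), abs(x - mu_c)
        return dc / (df + dc) if df + dc else 0.5  # nearer fraud centroid -> higher
    return predict_proba

def ou_predict(train_subsets, test_xs):
    """Steps 3-5: fit one model per OU subset, score the untouched test set
    with every model, and average the probability predictions."""
    models = [train_centroid_model(sub) for sub in train_subsets]
    return [statistics.mean(m(x) for m in models) for x in test_xs]
```

The averaged probabilities from `ou_predict` then feed the ECM calculation in step 6.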
<Insert Figure 3 Here>
IV. RESULTS
Main Results
Figures 4-6 summarize the performance results of different OU and VU implementations.
For each implementation, the results represent the average expected cost of misclassification
(ECM) from ten test folds. ECM is reported at the best estimates of (i) prior fraud probability,
i.e., 0.6 percent, and (ii) false negative to false positive cost ratios, i.e., 30:1. The results are
presented as the percentage difference in ECM between each OU and VU implementation and
their respective benchmarks.12 Given that each figure is plotted using a single benchmark that is
held constant across different implementations, we first use the figures to look for clear trends
that indicate how to implement OU and VU, respectively. We then compare the performance of
selected implementations to their respective benchmarks.
Multi-subset Observation Undersampling (OU) – Experiment 1. Figure 4 (with supporting
details in Table 2) presents the performance results of OU relative to the best performing OU
benchmark (i.e., simple undersampling) as the number of subsets in OU increases. Our results
indicate that the benefit provided by OU initially increases as additional subsets are used but
remains relatively constant after 10 subsets.
<Insert Figure 4 and Table 2 Here>
12 Reported p-values are based on pairwise t-tests using the average and standard deviation of ECM scores across the ten test folds and are one-tailed unless otherwise noted. Assumptions related to normality and independent observations are unlikely to be satisfied, and p-values are only included as an indication of the relation between the magnitude and the variance of the difference between each implementation and the respective benchmarks.
Figure 4 also includes the corresponding results from two sensitivity analyses, i.e., the
experiments in which the subsets are selected in a different order and the random selection of
non-fraud cases is repeated. The results across all three versions of experiment 1 are similar in
that each shows a performance benefit from using OU that initially increases in the number of
OU subsets, but starts to plateau after about 10 subsets. These results indicate that the marginal
performance benefit from adding subsets declines as new subsets become less and less likely to
contain information not already included in the prior subsets.
Taken together, these experiments indicate that OU provides performance benefits and that
the number of subsets to include in OU is relatively consistent in the fraud setting. In an attempt
to balance performance benefits (we want to include enough subsets to make sure that we have
reached the performance plateau) with analysis costs (given that we have reached the plateau, we
want to keep the number of subsets low since adding additional subsets increases processing
costs), we include 12 subsets in OU in subsequent experiments and label this configuration
OU(12). This configuration lowers the expected cost of misclassification in the primary analysis
by 10.8 percent (p = 0.006) relative to simple undersampling (the best performing OU
benchmark).
To better understand how OU(12) improves performance, we first note that untabulated
results indicate that simple undersampling improves performance over no undersampling by 7.3
percent. Thus, it appears that OU provides performance benefits by undersampling the data and
thereby changing the bias of the classifiers to focus more on the fraud cases. Second, as reported
in the main experiment, OU(12) improves the performance over simple undersampling by an
additional 10.8 percent. Thus, it appears that OU also provides performance benefits by
combining the predictions of multiple models and thereby reducing the risk of overfitting.
Multi-Subset Variable Undersampling (VU) – Experiment 2. Figure 5 presents the
performance of VU relative to the best performing VU benchmark (i.e., random selection of
explanatory variables) as the number of subsets in VU increases. As summarized in Table 1, we
examine two versions. The dashed line shows the results when the number of variables in each
subset remains constant per experimental round (Experiment 2a). The round dotted line shows
the results when all variables are included and divided evenly across the subsets in each
experimental round (Experiment 2b).
<Insert Figure 5 Here>
When the number of variables is kept constant in each subset (the dashed line), the performance of VU increases as additional variable subsets are included, plateaus at about 11 subsets, and then decreases at 19 subsets. However, even at the plateau (VU with 11 to 18
subsets), untabulated results show that the performance difference between VU and the
benchmark only approaches statistical significance (p = 0.125 on average). In addition, the
jagged line indicates that VU is sensitive to the usefulness of the individual explanatory variables
in each additional subset.
When all available variables are divided into the selected subsets (the round dotted line), VU
does not provide a performance benefit relative to the random variable selection benchmark.
This second VU experiment emphasizes the importance of how variables are grouped together.
Multi-Subset Variable Undersampling Partitioned on Fraud Types (PVU) – Experiment 3.
The VU results discussed above suggest that a more deliberate partitioning of variables may be
important. We earlier argued that fraud consists of multiple types (e.g., revenue vs. expense
fraud) and that it might be beneficial to partition the explanatory variables with this in mind. Our
results for PVU support this conjecture. More specifically, as shown in Figure 5, PVU lowers
the expected cost of misclassification by 9.6 percent (p = 0.019) relative to the best performing
VU benchmark.
To better understand why PVU improves performance over the benchmarks, we first note
that untabulated results indicate that the all variables benchmark (that uses all observations and
all variables) and the Dechow benchmark (that uses all observations and a subset of variables as
selected in Dechow et al. 2011) perform similarly (0.02 percent difference). Thus, it does not
appear that performance is improved by simply selecting a subset of the variables (and thereby
reducing data dimensionality). Given that untabulated results show that VU improves
performance relative to the all variables benchmark by 7.2 percent, it appears that segmentation
of the variables contributes to the performance improvement. Additionally, because PVU
performs 6.3 percent better than VU, it appears that how the variables are segmented matters.
Additional Analyses
Further validation using misstatement data. We use the observations in a material
misstatement dataset that is an expanded version (additional years) of the data used in Dechow et
al. (2011) to perform two additional analyses (we also use this dataset to examine the robustness
of OU and PVU to the use of other classification algorithms, including logistic regression and
bootstrapping – see Online Appendix B). This dataset is available from the Center for Financial
Reporting and Management at the University of California, Berkeley and includes the fraud
firms used in our primary dataset as well as additional material misstatement firms reported in
AAERs by the SEC.13 To evaluate predictive performance, we again use 10-fold cross-validation. Further, due to a lack of good estimates of prior probabilities and cost ratios for material misstatements, we use the area under the Receiver Operating Characteristic (ROC) curve, or simply AUC, as the performance metric.14
13 We exclude firms from the finance industry and, following Dechow et al. (2011), add all Compustat non-fraud firms in the same year and industry as the fraud firms. We include only the first fraud year, i.e., we do not include multiple years for each fraud firm, due to the potential bias introduced when including multiple fraud firm years. We also follow the procedure used in Dechow et al. (2011) to eliminate observations with missing values in one or more of the variables included in the Dechow benchmark. We use mean replacement to handle missing values in the remaining variables. We also perform the analyses reported in this section after eliminating all observations with one or more missing values. Before performing this elimination, we remove six variables with over 25 percent missing values: abnormal change in order backlog, allowance for
The first analysis provides further validation of out-of-sample prediction performance of the
proposed methods and compares OU(12) and PVU (implemented using the same variables as in
the main experiments) to the Dechow benchmark when using the observations in the material
misstatement data. We use the Dechow benchmark given that these data are based on Dechow et
al. (2011). This analysis also provides insight into the usefulness of OU and PVU in a slightly
different setting (material misstatements vs. fraud). Results in Table 3 Panel A suggest that
OU(12) and PVU continue to improve performance over the Dechow benchmark when using
material misstatement data – now by 26.0 (p < 0.001) and 16.9 (p = 0.004) percent, respectively.
The second analysis provides insight into (i) the usefulness of OU when used in combination
with a different set of independent variables (based on the “financial kernel” of Cecchini et al.
2010) and (ii) whether OU provides incremental predictive power when used in combination
with this kernel. Cecchini et al. (2010) based their financial kernel on 23 financial statement
variables commonly used to construct independent variables for fraud prediction models. The
financial kernel divides each of the 23 original variables by each of the others, both in the current year and in the prior year, and calculates changes in the ratios. The current and lagged ratios, as well as their changes, are then used to construct a dataset with 1,518 independent variables.
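A sketch of the kernel's feature construction, under our own naming assumptions (it glosses over the original's handling of zero or missing denominators): with 23 items, the 22 ratios per item in the current year, the prior year, and their changes give 23 × 22 × 3 = 1,518 features, matching the count above.

```python
def financial_kernel(current, prior):
    """Illustrative reconstruction of the Cecchini et al. (2010) kernel:
    for each ordered pair of distinct financial statement items, form the
    current-year ratio, the prior-year ratio, and the change between them.
    Assumes nonzero denominators."""
    feats = {}
    names = sorted(current)
    for a in names:
        for b in names:
            if a == b:
                continue
            cur = current[a] / current[b]
            pre = prior[a] / prior[b]
            feats[f"{a}/{b}"] = cur
            feats[f"{a}/{b}_lag"] = pre
            feats[f"{a}/{b}_chg"] = cur - pre
    return feats
```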
We use the same initial set of observations used in the previous analysis and recreate the
financial kernel following Cecchini et al. (2010). We also follow their procedures and exclude all observations with missing values. We then compare OU(12) implemented with the variables in the financial kernel to the Cecchini benchmark, which uses the financial kernel but does not undersample observations (both implementations use support vector machines). We do not attempt to implement PVU, as it is not clear how we would separate the 1,518 variables into different fraud types. Results in Table 3 (Panel B) indicate that OU(12) (AUC = 0.67) outperforms the Cecchini benchmark (AUC = 0.59) by 14.2 percent (p = 0.004).15
doubtful accounts, allowance for doubtful accounts to accounts receivable, allowance for doubtful accounts to net sales, expected return on pension plan assets, and change in expected return on pension plan assets.
14 While ECM is the preferred performance metric when prior probabilities and cost ratios are known, AUC is preferred over other performance measures in settings with unknown error costs and prior probabilities (Provost, Fawcett, and Kohavi 1998). AUC has become the de facto standard performance measure in machine learning research and has also been used in accounting research (e.g., Larcker and Zakolyukina 2012). AUC indicates how well the prediction model ranks observations in the evaluation dataset. An AUC of 0.5 is equivalent to a random rank order, while an AUC of 1 indicates a perfect ranking of the evaluation observations. See Larcker and Zakolyukina (2012) for a more complete description of AUC.
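The AUC used in these comparisons can be computed directly from predicted probabilities via the rank (Mann–Whitney) formulation; this is a generic sketch in Python, not code from the paper.

```python
def auc(probs, labels):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation:
    the probability that a randomly chosen fraud observation is ranked
    above a randomly chosen non-fraud observation (ties count one-half)."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect ranking returns 1.0 and a fully reversed ranking returns 0.0, consistent with the interpretation in footnote 14.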
Combining the Methods. We analyze whether combining OU(12) and PVU provides
additional performance benefits compared to OU(12), the best performing individual method.
Figure 6 plots the performance difference between OU(12) and PVU and the combination of
OU(12) and PVU at different cost ratios. The combination is generated by creating prediction
models using OU(12) and PVU separately and then averaging the predictions from the OU(12)
and PVU prediction models.16
In untabulated results, the combination of OU(12) and PVU does not perform significantly differently from OU(12) (p = 0.421) at best estimates of prior fraud probability and cost ratios. Thus, in typical fraud prediction research settings, we recommend using OU(12). However, the combination of OU(12) and PVU provides performance benefits over OU(12) at higher cost ratios and higher prior fraud probability levels (see Figure 6). Given that the combination of OU(12) and PVU either performs significantly better than or no differently from OU(12), we
recommend using the combination of the two methods if maximizing predictive ability is more
important than minimizing implementation costs. For example, when the SEC uses a prediction
15 When including all fraud firm years, OU performs 5.7 percent (p < 0.001) better than the Cecchini benchmark and both approaches have high AUC values (AUC = 0.863 and AUC = 0.816, respectively).
16 We also first create the OU(12) subsets and then apply PVU to these subsets, but this more integrated and complex combination does not improve performance further.
model to help decide which firms to investigate for potential fraud, the additional
implementation costs associated with using the combination are likely to be small relative to the
costs of misclassifying a non-fraud firm and using resources to investigate the firm (and even
more so relative to misclassifying a fraud firm and not detecting the fraud).
<Insert Figure 6 Here>
Impact of mislabeled firms on OU. Some firms that are labeled non-fraud may actually have
committed fraud. To assess the sensitivity of OU to the inclusion of mislabeled fraud firms, we
(1) manipulate the training data in each cross-validation round by using OU(12) to generate fraud
probability predictions for all observations in the training data and then remove all non-fraud
firms with high fraud probability predictions (we tried five different thresholds: 0.9, 0.8, 0.7, 0.6,
and 0.5) from the training data; (2) use the modified training data from step 1 as input into
OU(12); and (3) compare the results from step 2 to the original OU(12) implementation.
Untabulated results do not show any significant performance improvements over the original
OU(12) implementation. Compared to the original implementation, the change in AUC
(averaged across the ten test folds) when removing non-fraud firms with high fraud probabilities
is -0.08% (p = 0.809; two-tailed), 0.08% (p = 0.360; one-tailed), 0.12% (p = 0.337; one-tailed),
0.31% (p = 0.182; one-tailed), and 0.24% (p = 0.228; one-tailed) when using thresholds of 0.5,
0.6, 0.7, 0.8, and 0.9, respectively. Thus, it does not appear that the performance of OU is
sensitive to the inclusion of fraud firms mislabeled as non-fraud firms in the training data.
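The filtering step in (1) reduces to a one-line predicate. In this sketch, `predict_proba` stands in for the fraud probability predictions produced by OU(12); the identity scorer in the test below is purely hypothetical.

```python
def drop_suspect_non_fraud(train_rows, predict_proba, threshold):
    """Sensitivity check: score every training observation and drop non-fraud
    observations whose predicted fraud probability exceeds the threshold,
    on the suspicion that they may be mislabeled fraud firms.
    Each row is a (feature value, label) pair with label 1 = fraud."""
    return [(x, y) for x, y in train_rows
            if y == 1 or predict_proba(x) <= threshold]
```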
Using OU to explore robustness and incremental predictive performance of independent
variables. Fraud research often seeks to identify new explanatory variables. Traditionally, this
research uses the entire sample (i.e., all observations) or a single matched sample to evaluate the
significance of one or more independent variables that are hypothesized to be associated with the
dependent variable. However, the predictive performance benefits of OU reported earlier
suggest that classification algorithms (e.g., logistic regression) recognize different fraud patterns
when trained on different subsets of non-fraud firms. Thus, when evaluating explanatory
variables in hypothesis testing research, it may be important to consider the robustness of results
across different subsamples of the original data. Further, given that OU improves performance
over benchmarks, to conclude that a new independent variable provides utility in fraud
prediction, research should show that this variable provides incremental predictive performance
to prediction models implemented using OU.
As an example of how to use OU in hypothesis testing, we perform an analysis that examines
the significance of sales to employees given a set of control variables selected based on prior
research (the control variables in this example were selected using stepwise backward variable
selection). Traditionally, the hypothesis would be tested using all observations in the sample,
i.e., the full sample. We compare the traditional hypothesis-testing results (full sample) to results
from implementing OU. More specifically, OU is first used to partition the full sample into 12
subsamples. Each subsample is then used to fit a logistic regression model, resulting in 12
different models. Independent variable coefficient estimates and p-values from the 12 models
are then summarized. The example uses data from the additional analyses that examine
misstatement data.
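The partition-fit-summarize pattern can be sketched as follows. We use a minimal one-variable logistic regression fit by gradient ascent as a stand-in for a statistics package (which would also supply the standard errors and p-values summarized in Table 4); the function names and data representation are our own assumptions.

```python
import math
import statistics

def fit_logistic(xs, ys, steps=2000, lr=0.5):
    """Minimal one-variable logistic regression fit by gradient ascent on the
    log-likelihood; returns the slope coefficient. A stand-in for a full
    statistics package, which would also supply p-values."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += y - p
            g1 += (y - p) * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b1

def coefficient_across_subsamples(subsamples):
    """Fit one model per OU subsample and summarize the coefficient of
    interest across the resulting models (here: median and range)."""
    coefs = [fit_logistic([x for x, _ in s], [y for _, y in s])
             for s in subsamples]
    return statistics.median(coefs), min(coefs), max(coefs)
```

A coefficient whose sign or significance varies materially across the subsamples would flag the kind of fragility the sales-to-employees example exhibits.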
The results for the full sample in Table 4 indicate that the hypothesis is supported (p =
0.012). However, the OU subsample analysis indicates that this result might not be robust. For
example, the average p-value of all sales to employees estimates across the 12 models obtained
using different sub-samples is p = 0.180 and the p-value is above 0.05 in four of the 12 models.
The results for the control variables further suggest that, while OU yields similar results to the
traditional hypothesis testing analysis in that most significant variables in the traditional
approach tend to be the most significant in the OU analysis, the OU results are generally more
conservative. For example, in only two cases are the median p-values from the OU results
numerically smaller (more significant) than the corresponding parametric result. For 12 of 17
variables, the median p-values are numerically larger (less significant) than their parametric
counterparts. Thus, we encourage future research to consider applying OU as a robustness check
for hypothesis testing.17
<Insert Table 4 Here>
Online Appendix C provides an example of how to use OU to evaluate the incremental
predictive performance of new independent variables. The example explains how to use OU in
combination with out-of-sample data and includes example SAS code.
V. DISCUSSION AND FUTURE RESEARCH
Financial statement fraud is a costly problem that has far reaching negative consequences.
Hence, the accounting literature investigates a wide range of explanatory variables and various
classification algorithms that contribute to more accurate prediction of fraud and material
misstatements. However, the rarity of fraud data, the relative abundance of variables identified in
prior literature, and the broad definition of fraud create challenges in specifying effective
prediction models.
Research in the emerging field of data analytics has been applied successfully in other
settings constrained by data rarity, such as predicting credit card fraud (Chan and Stolfo 1998).
We, therefore, follow the call of Varian (2014) to apply recent advances in data analytics in other
17 In untabulated results, we repeat the analysis using bootstrapping. More specifically, the full sample is used to generate 1,000 bootstrap subsamples (each sample contained observations selected randomly with replacement). Each bootstrap subsample is then used to fit a logistic regression model from which 2.5 and 97.5 percentiles of independent variable coefficient estimates are obtained. The bootstrapping results are similar to the OU results in that they are also generally more conservative.
settings and investigate the ability of data preprocessing methods drawn from data analytics to
improve fraud prediction. We first use Multi-subset Observation Undersampling (OU) to
investigate undersampling of non-fraud observations to establish a more effective balance with
scarce fraud observations. When used with 12 subsamples, this method improves fraud
prediction by lowering the expected cost of misclassification by more than ten percent relative to
the best performing benchmark. This method is also both efficient and relatively easy to
implement. Second, we use Multi-subset Variable Undersampling (VU) to investigate
undersampling of explanatory variables to put them more in balance with scarce fraud
observations. Fraud prediction improves in select situations when we randomly undersample
explanatory variables into different subsets. However, it does not do so reliably. When we
instead implement Multi-subset Variable Undersampling by partitioning variables into subsets
based on the type of fraud they are likely to predict (PVU), the expected cost of misclassification
is reduced by 9.6 percent relative to the best performing VU benchmark.
Our research makes multiple contributions to the prior literature. First, we identify and
directly address financial statement fraud data rarity problems by systematically evaluating
multiple data preprocessing methods that we believe are new to the accounting literature. Based
on our experiments, we conclude that OU and PVU each produce economically and statistically
significant reductions in the expected cost of misclassification of about ten percent. This
compares to, for example, a 0.9 percent prediction performance advantage when, following
Dechow et al. (2011), two additional significant independent variables are added to their initial
model. The introduction and evaluation of these methods directly contributes to research that
focuses on improving fraud prediction. Beneish (1997) and Dechow et al. (2011), among others,
create fraud prediction models that can be used to indicate the likelihood that a company has
committed financial statement fraud. Our methods can be used to improve the quality of such
fraud predictions. We also directly extend research that examines the usefulness of data
analytics methods in fraud prediction (e.g., Cecchini et al. 2010; Perols 2011; Larcker and
Zakolyukina 2012; and Whiting et al. 2012). Future research attempting to improve fraud
prediction using data analytics methods can also examine other problems related to rarity, such
as noisy data that potentially have more significant negative effects on rare cases (Weiss 2004).
Second, by showing that performance benefits can be gained by (i) addressing data rarity
problems in fraud detection and (ii) partitioning financial statement fraud into different fraud
types, our results provide an indication of the potential benefits that may result from addressing
similar problems in other settings. For example, bankruptcy, financial statement restatements,
material weaknesses in internal control over financial reporting, and audit qualifications are also
rare events in both absolute and relative terms.
Third, our research has implications for research that focuses on designing new explanatory
variables and developing parsimonious prediction models (e.g., Dechow et al. 2011; and
Markelevich and Rosner 2013). Our findings suggest that classification algorithms recognize
different fraud patterns when trained on different subsets of non-fraud firms. Thus, even if an
explanatory variable is deemed significant in one subsample, it is valuable to show that it is also
significant in other subsamples. Example techniques include OU, bootstrapping, and a robustness
measure proposed by Athey and Imbens (2015) that creates subsamples based on values of the
independent variables in the model. While we perform additional analyses that suggest that OU
(i) performs better than bootstrapping in predictive modeling and (ii) can be used to evaluate the
robustness of explanatory models, future research is needed to provide more definitive
recommendations about which method(s) to use for hypothesis testing. Future research can also
examine the use of OU in conjunction with propensity score matching. For example, future
research can (1) examine whether OU can be used to generate more robust propensity scores or
(2) use OU to evaluate the robustness of propensity score matching results by first using OU to
generate different subsets and then applying the propensity score matching procedure within each
subset. Further, research that concludes that a new explanatory variable provides incremental
predictive power should consider showing that the variable provides incremental predictive value
to models implemented using our methods.
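The subsample-stability check described above can be sketched in code. The following is a minimal illustration, not the paper's implementation: it estimates a simple linear-probability slope (pure NumPy) as a stand-in for a logit coefficient on each OU-style subsample, so that a stable sign and magnitude across subsamples can be inspected. The function name and interface are our own.

```python
import numpy as np

def subsample_coefficients(x, y, n_subsets=12, seed=0):
    """Estimate a linear-probability slope for x on each OU-style subsample.

    All fraud observations (y == 1) are paired with disjoint random subsets of
    the non-fraud observations (y == 0). A coefficient that keeps its sign and
    rough magnitude across subsamples is less likely to be an artifact of one
    particular non-fraud comparison group.
    """
    rng = np.random.default_rng(seed)
    fraud = np.flatnonzero(y == 1)
    nonfraud = rng.permutation(np.flatnonzero(y == 0))
    coefs = []
    for part in np.array_split(nonfraud, n_subsets):
        idx = np.concatenate([fraud, part])
        xs, ys = x[idx], y[idx]
        # OLS slope of y on x: cov(x, y) / var(x)
        slope = np.cov(xs, ys, bias=True)[0, 1] / np.var(xs)
        coefs.append(slope)
    return np.array(coefs)
```

A variable whose slope flips sign across the returned array would warrant caution before being added to a prediction model.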
Fourth, we contribute by following the call to consider different types of fraud
(Brazel et al. 2009). We partition financial statement fraud into types and show that this
reframing improves the performance of VU in fraud prediction. The importance of this finding
may extend beyond VU. Research that examines predictors of fraud could, similarly to Brazel et
al. (2009), design new explanatory variables to detect a specific type of fraud instead of fraud in
general. For example, fraud research could potentially develop variables that predict different
fraud types using different types of analyst forecasts (e.g., revenue vs. earnings) or different
types of debt covenants (e.g., leverage vs. interest coverage). Specifically, an independent
variable that indicates whether a firm uses a leverage (interest expense) debt covenant can in turn
be used in a prediction model that predicts liabilities (expense) fraud. Such reframing could
contribute to a better theoretical understanding of fraud and a more precise evaluation of
explanatory variables.
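As a sketch of this reframing, one could maintain a mapping from fraud types to the explanatory variables expected to predict each type and then build one model per type. The group and variable names below are purely illustrative assumptions, not the paper's actual PVU partitioning.

```python
# Hypothetical partition of explanatory variables by the fraud type they
# are expected to predict (illustrative names only).
PVU_GROUPS = {
    "revenue_fraud":   ["change_in_receivables", "pct_change_cash_sales"],
    "inventory_fraud": ["change_in_inventory", "inventory_to_sales"],
    "liability_fraud": ["leverage", "liabilities_to_interest_expense"],
}

def pvu_column_subsets(columns, groups=PVU_GROUPS):
    """Map each fraud-type group to the column positions present in `columns`."""
    pos = {name: i for i, name in enumerate(columns)}
    return {ftype: [pos[v] for v in vars_ if v in pos]
            for ftype, vars_ in groups.items()}
```

Each resulting column subset would then feed one type-specific prediction model, whose probabilities are combined as in VU.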
Finally, regulators and practitioners can potentially benefit from our findings. Regulators,
such as the SEC, are investing resources to develop better fraud risk models (Walter 2013; SEC
2015). Our findings may enhance their ability to identify firms that have committed fraud. This
is important because, due to resource constraints, the SEC has to focus investigations on a small
sample of firms, and improvements in financial statement fraud prediction models can be cost
effective in identifying potential fraud firms. The negative effects of financial statement fraud
on other stakeholders, such as employees, auditors, suppliers, customers, and lenders, can also
potentially be reduced. For example, auditors can use our methods to potentially improve fraud risk
assessment models that, in turn, can improve audit client portfolio management and audit
planning decisions. Given the significant costs and widespread effects of financial statement
fraud, improvements in fraud prediction models can have a substantial positive impact on
society.
REFERENCES
Abbasi, A., C. Albrecht, A. Vance, and J. Hansen. 2012. MetaFraud: A meta-learning framework for detecting financial fraud. MIS Quarterly. 36(4): 1293-1327.
Agarwal, R., and V. Dhar. 2014. Editorial—Big data, data science, and analytics: The opportunity and challenge for IS research. Information Systems Research. 25(3): 443-448.
Apostolou, B., J. Hassell, and S. Webber. 2000. Forensic expert classification of management fraud risk factors. Journal of Forensic Accounting. 1(2): 181-192.
Armstrong, C. S., D. F. Larcker, G. Ormazabal, and D. J. Taylor. 2013. The relation between equity incentives and misreporting: the role of risk-taking incentives. Journal of Financial Economics. 109(2): 327-350.
Association of Certified Fraud Examiners. 2014. Report to the nations on occupational fraud and abuse. Austin, TX.
Athey, S., and G. Imbens. 2015. A measure of robustness to misspecification. American Economic Review. 105(5): 476-80.
Bayley, L., and S. Taylor. 2007. Identifying earnings management: A financial statement analysis (red flag) approach. Proceedings of the American Accounting Association Annual Meeting, Chicago, IL.
Beasley, M. 1996. An empirical analysis of the relation between the board of director composition and financial statement fraud. The Accounting Review. 71(4): 443-465.
Bell, T., and J. Carcello. 2000. A decision aid for assessing the likelihood of fraudulent financial reporting. Auditing: A Journal of Practice & Theory. 19(1): 169-184.
Bellman, R. 1961. Adaptive control processes: A guided tour. Princeton, NJ: Princeton University Press.
Beneish, M. 1997. Detecting GAAP violation: Implications for assessing earnings management among firms with extreme financial performance. Journal of Accounting and Public Policy. 16(3): 271-309.
Beneish, M. 1999. Incentives and penalties related to earnings overstatements that violate GAAP. The Accounting Review. 74(4): 425-457.
Brazel, J. F., K. L. Jones, and M. F. Zimbelman. 2009. Using nonfinancial measures to assess fraud risk. Journal of Accounting Research. 47(5): 1135-1166.
Brown, B., M. Chui, and J. Manyika. 2011. Are you ready for the era of ‘big data’? McKinsey Quarterly. 4: 24-35.
Caskey, J., and M. Hanlon. 2013. Dividend policy at firms accused of accounting fraud. Contemporary Accounting Research. 30(2): 818-850.
Cecchini, M., G. Koehler, H. Aytug, and P. Pathak. 2010. Detecting management fraud in public companies. Management Science. 56(7): 1146-1160.
Chan, P., and S. Stolfo. 1998. Toward scalable learning with non-uniform class and cost distributions: A case study in credit card fraud detection. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, New York, NY.
Chawla, N. V., K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. 2002. SMOTE: Synthetic Minority Oversampling Technique. Journal of Artificial Intelligence Research. 16: 321-357.
Chen, H., R. H. Chiang, and V. C. Storey. 2012. Business intelligence and analytics: From big data to big impact. MIS Quarterly. 36(4): 1165-1188.
Dechow, P. M., R. G. Sloan, and A. P. Sweeney. 1996. Causes and consequences of earnings manipulation: An analysis of firms subject to enforcement actions by the SEC. Contemporary Accounting Research. 13(1): 1-36.
Dechow, P. M., W. Ge, C. R. Larson, and R. G. Sloan. 2011. Predicting material accounting misstatements. Contemporary Accounting Research. 28(1): 17-82.
Duin, P. W. R., and M. J. D. Tax. 2000. Experiments with classifier combining rules. Proceedings of the International Workshop on Multiple Classifier Systems.
Erickson, M., M. Hanlon, and E. L. Maydew. 2006. Is there a link between executive equity incentives and accounting fraud? Journal of Accounting Research. 44(1): 113-143.
Ettredge, M. L., L. Sun, P. Lee, and A. A. Anandarajan. 2008. Is earnings fraud associated with high deferred tax and/or book minus tax levels? Auditing: A Journal of Practice & Theory. 27(1): 1-33.
Fanning, K., and K. Cogger. 1998. Neural network detection of management fraud using published financial data. International Journal of Intelligent Systems in Accounting, Finance and Management. 7(1): 21-41.
Feng, M., W. Ge, S. Luo, and T. Shevlin. 2011. Why do CFOs become involved in material accounting manipulations? Journal of Accounting and Economics. 51(1): 21-36.
Feroz, E., T. Kwon, V. Pastena, and K. Park. 2000. The efficacy of red-flags in predicting the sec's targets: An artificial neural networks approach. International Journal of Intelligent Systems in Accounting, Finance & Management. 9(3): 145-157.
Galar, M., A. Fernández, E. Barrenechea, H. Bustince, and F. Herrera. 2012. A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews. 42(4): 463-484.
Glancy, F. H., and S. B. Yadav. 2011. A computational model for financial reporting fraud detection. Decision Support Systems. 50(3): 595-601.
Goel, S., and J. Gangolly. 2012. Beyond the numbers: Mining the annual reports for hidden cues indicative of financial statement fraud. Intelligent Systems in Accounting, Finance and Management. 19(2): 75-89.
Green, B. P., and J. H. Choi. 1997. Assessing the risk of management fraud through neural network technology. Auditing: A Journal of Practice & Theory. 16(1): 14-28.
Gupta, R., and N. S. Gill. 2012. A solution for preventing fraudulent financial reporting using descriptive data mining techniques. International Journal of Computer Applications. 58(1): 22-28.
Humpherys, S. L., K. C. Moffitt, M. B. Burns, J. K. Burgoon, and W. F. Felix. 2011. Identification of fraudulent financial statements using linguistic credibility analysis. Decision Support Systems. 50(3): 585-594.
Jones, K. L., G. V. Krishnan, and K. D. Melendrez. 2008. Do models of discretionary accruals detect actual cases of fraudulent and restated earnings? An empirical analysis. Contemporary Accounting Research. 25(2): 499-531.
Kaminski, K., S. Wetzel, and L. Guan. 2004. Can financial ratios detect fraudulent financial reporting? Managerial Auditing Journal. 19(1): 15-28.
Kittler, J., M. Hatef, R.P.W. Duin, and J. Matas. 1998. On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence. 20(3): 226-239.
Larcker, D. F., and A. A. Zakolyukina. 2012. Detecting deceptive discussions in conference calls. Journal of Accounting Research. 50(2): 495-540.
LaValle, S., E. Lesser, R. Shockley, M. S. Hopkins, and N. Kruschwitz. 2011. Big data, analytics and the path from insights to value. MIT Sloan Management Review. 52(2): 21-32.
Lee, T. A., R. W. Ingram, and T. P. Howard. 1999. The difference between earnings and operating cash flow as an indicator of financial reporting fraud. Contemporary Accounting Research. 16(4): 749-786.
Lennox, C., and J. A. Pittman. 2010. Big five audits and accounting fraud. Contemporary Accounting Research, 27(1): 209-247.
Lin, J., M. Hwang, and J. Becker. 2003. A fuzzy neural network for assessing the risk of fraudulent financial reporting. Managerial Auditing Journal. 18(8): 657-665.
Loebbecke, J. K., M. M. Eining, and J. J. Willingham. 1989. Auditors’ experience with material irregularities: Frequency, nature, and detectability. Auditing: A Journal of Practice and Theory. 9(1): 1-28.
Maloof, M. 2003. Learning when data sets are imbalanced and when costs are unequal and unknown. Proceedings of the Twentieth International Conference on Machine Learning, Washington, DC.
Markelevich, A., and R. L. Rosner. 2013. Auditor fees and fraud firms. Contemporary Accounting Research. 30(4): 1590-1625.
Nguyen, H. M., E. W. Cooper, and K. Kamei. 2012. A comparative study on sampling techniques for handling class imbalance in streaming data. Proceedings of the Soft Computing and Intelligent Systems (SCIS) and 13th International Symposium on Advanced Intelligent Systems (ISIS). 1762-1767.
Perols, J. 2011. Financial statement fraud detection: An analysis of statistical and machine learning algorithms. Auditing: A Journal of Practice & Theory. 30(2): 19-50.
Perols, J. L., and B. A. Lougee. 2011. The relation between earnings management and financial statement fraud. Advances in Accounting. 27(1): 39-53.
Phua, C., D. Alahakoon, and V. Lee. 2004. Minority report in fraud detection: Classification of skewed data. SIGKDD Explorations. 6(1): 50-59.
Price III, R. A., N. Y. Sharp, and D. A. Wood. 2011. Detecting and predicting accounting irregularities: A comparison of commercial and academic risk measures. Accounting Horizons. 25(4): 755-780.
Provost, F. J., T. Fawcett, and R. Kohavi. 1998. The case against accuracy estimation for comparing induction algorithms. Proceedings of the Fifteenth International Conference on Machine Learning, Madison, WI. 98: 445-453.
SEC. 2015. Examination Priorities for 2015. Retrieved from http://www.sec.gov/about/offices/ocie/national-examination-program-priorities-2015.pdf.
Sharma, V. 2004. Board of director characteristics, institutional ownership, and fraud: Evidence from Australia. Auditing: A Journal of Practice & Theory. 23(2): 105-117.
Shin, K. S., T. Lee, and H. J. Kim. 2005. An application of support vector machines in bankruptcy prediction models. Expert Systems with Applications. 28: 127-135.
Summers, S. L., and J. T. Sweeney. 1998. Fraudulently misstated financial statements and insider trading: An empirical analysis. The Accounting Review. 73(1): 131-146.
Varian, H. R. 2014. Big data: New tricks for econometrics. The Journal of Economic Perspectives. 28(2): 3-27.
Walter, E. 2013. Harnessing Tomorrow's Technology for Today's Investors and Markets. Speech presented at American University School of Law, Washington, D.C. (February).
Weiss, G. 2004. Mining with rarity: A unifying framework. ACM SIGKDD Explorations Newsletter. 6(1): 7-19.
Whiting, D. G., J. V. Hansen, J. B. McDonald, C. Albrecht, and W. S. Albrecht. 2012. Machine learning methods for detecting patterns of management fraud. Computational Intelligence. 28(4): 505-527.
Witten, I.H., and E. Frank. 2005. Data Mining: Practical machine learning tools and techniques. 2nd edition. San Francisco, CA: Morgan Kaufmann Publishers.
Yang, Q., and X. Wu. 2006. 10 challenging problems in data mining research. International Journal of Information Technology & Decision Making. 5(4): 597-604.
APPENDIX A: Definitions of explanatory variables(a)

Panel A: Variables from Dechow et al. (2011)

Variable: Definition(b)

Abnormal change in order backlog: (OB - OBt-1) / OBt-1 - (SALE - SALEt-1) / SALEt-1
Actual issuance: IF SSTK > 0 OR DLTIS > 0 THEN 1 ELSE 0
Book to market: CEQ / (CSHO * PRCC_F)
Change in expected return on pension plan assets: PPROR - PPRORt-1
Change in free cash flows: (IB - RSST Accruals) / Average total assets - (IBt-1 - RSST Accrualst-1) / Average total assetst-1
Change in inventory: (INVT - INVTt-1) / Average total assets
Change in operating lease activity: ((MRC1/1.1 + MRC2/1.1^2 + MRC3/1.1^3 + MRC4/1.1^4 + MRC5/1.1^5) - (MRC1t-1/1.1 + MRC2t-1/1.1^2 + MRC3t-1/1.1^3 + MRC4t-1/1.1^4 + MRC5t-1/1.1^5)) / Average total assets
Change in receivables: (RECT - RECTt-1) / Average total assets
Change in return on assets: IB / Average total assets - IBt-1 / Average total assetst-1
Deferred tax expense: TXDI / ATt-1
Demand for financing (ex ante): IF (OANCF - (CAPXt-3 + CAPXt-2 + CAPXt-1) / 3) / ACT < -0.5 THEN 1 ELSE 0
Earnings to price: IB / (CSHO * PRCC_F)
Existence of operating leases: IF MRC1 > 0 OR MRC2 > 0 OR MRC3 > 0 OR MRC4 > 0 OR MRC5 > 0 THEN 1 ELSE 0
Expected return on pension plan assets: PPROR
Level of finance raised: FINCF / Average total assets
Leverage: DLTT / AT
Percentage change in cash margin: ((1 - (COGS + (INVT - INVTt-1)) / (SALE - (RECT - RECTt-1))) - (1 - (COGSt-1 + (INVTt-1 - INVTt-2)) / (SALEt-1 - (RECTt-1 - RECTt-2)))) / (1 - (COGSt-1 + (INVTt-1 - INVTt-2)) / (SALEt-1 - (RECTt-1 - RECTt-2)))
Percentage change in cash sales: ((SALE - (RECT - RECTt-1)) - (SALEt-1 - (RECTt-1 - RECTt-2))) / (SALEt-1 - (RECTt-1 - RECTt-2))
RSST accruals: (ΔWC + ΔNCO + ΔFIN) / Average total assets, where WC = (ACT - CHE) - (LCT - DLC); NCO = (AT - ACT - IVAO) - (LT - LCT - DLTT); FIN = (IVST + IVAO) - (DLTT + DLC + PSTK)
Soft assets: (AT - PPENT - CHE) / Average total assets
Unexpected employee productivity(c): (SALE/EMP - SALEt-1/EMPt-1) / (SALEt-1/EMPt-1) - INDUSTRY((SALE/EMP - SALEt-1/EMPt-1) / (SALEt-1/EMPt-1))
WC accruals: (((ACT - ACTt-1) - (CHE - CHEt-1)) - ((LCT - LCTt-1) - (DLC - DLCt-1) - (TXP - TXPt-1)) - DP) / Average total assets
Panel B: Variables from Perols (2011)

Variable: Definition(b)

Accounts receivable to sales: RECT / SALE
Accounts receivable to total assets: RECT / AT
Allowance for doubtful accounts: RECD
Allowance for doubtful accounts to accounts receivable: RECD / RECT
Allowance for doubtful accounts to net sales: RECD / SALE
Altman Z score: 3.3 * (IB + XINT + TXT) / AT + 0.999 * SALE / AT + 0.6 * CSHO * PRCC_F / LT + 1.2 * WCAP / AT + 1.4 * RE / AT
Big four auditor: IF 0 < AU < 9 THEN 1 ELSE 0
Current minus prior year inventory to sales: INVT / SALE - INVTt-1 / SALEt-1
Days in receivables index: (RECT / SALE) / (RECTt-1 / SALEt-1)
Debt to equity: LT / CEQ
Declining cash sales dummy: IF SALE - (RECT - RECTt-1) < SALEt-1 - (RECTt-1 - RECTt-2) THEN 1 ELSE 0
Fixed assets to total assets: PPEGT / AT
Four year geometric sales growth rate: (SALE / SALEt-3)^(1/4) - 1
Gross margin: (SALE - COGS) / SALE
Holding period return in the violation period: (PRCC_F - PRCC_Ft-1) / PRCC_Ft-1
Industry ROE minus firm ROE: NIindustry / CEQindustry - NI / CEQ
Inventory to sales: INVT / SALE
Net sales: SALE
Positive accruals dummy: IF (IB - OANCF) > 0 AND (IBt-1 - OANCFt-1) > 0 THEN 1 ELSE 0
Prior year ROA to total assets current year: (NIt-1 / ATt-1) / AT
Property plant and equipment to total assets: PPENT / AT
Sales to total assets: SALE / AT
The number of auditor turnovers: (IF AU <> AUt-1 THEN 1 ELSE 0) + (IF AUt-1 <> AUt-2 THEN 1 ELSE 0) + (IF AUt-2 <> AUt-3 THEN 1 ELSE 0)
Times interest earned: (IB + XINT + TXT) / XINT
Total accruals to total assets: (IB - OANCF) / AT
Total debt to total assets: LT / AT
Total discretionary accrual: RSST Accrualst-1 + RSST Accrualst-2 + RSST Accrualst-3
Value of issued securities to market value: IF CSHI > 0 THEN CSHI * PRCC_F / (CSHO * PRCC_F) ELSE IF (CSHO - CSHOt-1) > 0 THEN ((CSHO - CSHOt-1) * PRCC_F) / (CSHO * PRCC_F) ELSE 0
Whether accounts receivable > 1.1 of last year's: IF (RECT / RECTt-1) > 1.1 THEN 1 ELSE 0
Whether firm was listed on AMEX: IF EXCHG = 5, 15, 16, 17, or 18 THEN 1 ELSE 0
Whether gross margin percent > 1.1 of last year's: IF ((SALE - COGS) / SALE) / ((SALEt-1 - COGSt-1) / SALEt-1) > 1.1 THEN 1 ELSE 0
Whether LIFO: IF INVVAL = 2 THEN 1 ELSE 0
Whether new securities were issued: IF (CSHO - CSHOt-1) > 0 OR CSHI > 0 THEN 1 ELSE 0
Whether SIC code between 2999 and 4000: IF 2999 < SIC < 4000 THEN 1 ELSE 0
Panel C: Variables based on Cecchini et al. (2010)(d)

Variable: Definition(b)

Sales: SALE
Change in sales: SALE - SALEt-1
% Change in sales: (SALE - SALEt-1) / SALEt-1
Abnormal % change in sales: (SALE - SALEt-1) / SALEt-1 - INDUSTRY((SALE - SALEt-1) / SALEt-1)
Sales to assets: SALE / AT
Change in sales to assets: SALE / AT - SALEt-1 / ATt-1
% Change in sales to assets: (SALE / AT - SALEt-1 / ATt-1) / (SALEt-1 / ATt-1)
Abnormal % change in sales to assets: (SALE / AT - SALEt-1 / ATt-1) / (SALEt-1 / ATt-1) - INDUSTRY((SALE / AT - SALEt-1 / ATt-1) / (SALEt-1 / ATt-1))
Sales to employees: SALE / EMP
Change in sales to employees: SALE / EMP - SALEt-1 / EMPt-1
% Change in sales to employees: (SALE / EMP - SALEt-1 / EMPt-1) / (SALEt-1 / EMPt-1)
Sales to operating expenses: SALE / XOPR
Change in sales to operating expenses: SALE / XOPR - SALEt-1 / XOPRt-1
% Change in sales to operating expenses: (SALE / XOPR - SALEt-1 / XOPRt-1) / (SALEt-1 / XOPRt-1)
Abnormal % change in sales to operating expenses: (SALE / XOPR - SALEt-1 / XOPRt-1) / (SALEt-1 / XOPRt-1) - INDUSTRY((SALE / XOPR - SALEt-1 / XOPRt-1) / (SALEt-1 / XOPRt-1))
Return on assets: NI / AT
Change in return on assets: NI / AT - NIt-1 / ATt-1
% Change in return on assets: (NI / AT - NIt-1 / ATt-1) / (NIt-1 / ATt-1)
Abnormal % change in return on assets: (NI / AT - NIt-1 / ATt-1) / (NIt-1 / ATt-1) - INDUSTRY((NI / AT - NIt-1 / ATt-1) / (NIt-1 / ATt-1))
Return on equity: NI / CEQ
Change in return on equity: NI / CEQ - NIt-1 / CEQt-1
% Change in return on equity: (NI / CEQ - NIt-1 / CEQt-1) / (NIt-1 / CEQt-1)
Abnormal % change in return on equity: (NI / CEQ - NIt-1 / CEQt-1) / (NIt-1 / CEQt-1) - INDUSTRY((NI / CEQ - NIt-1 / CEQt-1) / (NIt-1 / CEQt-1))
Return on sales: NI / SALE
Change in return on sales: NI / SALE - NIt-1 / SALEt-1
% Change in return on sales: (NI / SALE - NIt-1 / SALEt-1) / (NIt-1 / SALEt-1)
Abnormal % change in return on sales: (NI / SALE - NIt-1 / SALEt-1) / (NIt-1 / SALEt-1) - INDUSTRY((NI / SALE - NIt-1 / SALEt-1) / (NIt-1 / SALEt-1))
Accounts payable to inventory: AP / INVT
Change in accounts payable to inventory: AP / INVT - APt-1 / INVTt-1
% Change in accounts payable to inventory: (AP / INVT - APt-1 / INVTt-1) / (APt-1 / INVTt-1)
Abnormal % change in accounts payable to inventory: (AP / INVT - APt-1 / INVTt-1) / (APt-1 / INVTt-1) - INDUSTRY((AP / INVT - APt-1 / INVTt-1) / (APt-1 / INVTt-1))
Liabilities: LT
Change in liabilities: LT - LTt-1
% Change in liabilities: (LT - LTt-1) / LTt-1
Abnormal % change in liabilities: (LT - LTt-1) / LTt-1 - INDUSTRY((LT - LTt-1) / LTt-1)
Liabilities to interest expenses: LT / XINT
Change in liabilities to interest expenses: LT / XINT - LTt-1 / XINTt-1
% Change in liabilities to interest expenses: (LT / XINT - LTt-1 / XINTt-1) / (LTt-1 / XINTt-1)
Abnormal % change in liabilities to interest expenses: (LT / XINT - LTt-1 / XINTt-1) / (LTt-1 / XINTt-1) - INDUSTRY((LT / XINT - LTt-1 / XINTt-1) / (LTt-1 / XINTt-1))
Assets: AT
Change in assets: AT - ATt-1
% Change in assets: (AT - ATt-1) / ATt-1
Abnormal % change in assets: (AT - ATt-1) / ATt-1 - INDUSTRY((AT - ATt-1) / ATt-1)
Assets to liabilities: AT / LT
Change in assets to liabilities: AT / LT - ATt-1 / LTt-1
% Change in assets to liabilities: (AT / LT - ATt-1 / LTt-1) / (ATt-1 / LTt-1)
Abnormal % change in assets to liabilities: (AT / LT - ATt-1 / LTt-1) / (ATt-1 / LTt-1) - INDUSTRY((AT / LT - ATt-1 / LTt-1) / (ATt-1 / LTt-1))
Expenses: XOPR
Change in expenses: XOPR - XOPRt-1
% Change in expenses: (XOPR - XOPRt-1) / XOPRt-1
Abnormal % change in expenses: (XOPR - XOPRt-1) / XOPRt-1 - INDUSTRY((XOPR - XOPRt-1) / XOPRt-1)
Notes: a The explanatory variables included represent a relatively comprehensive set of variables based on recent fraud and material misstatement literature (Cecchini et al. 2010; Dechow et al. 2011; Perols 2011). We include all variables from Perols (2011) and all variables from the final Dechow et al. (2011) model that can be calculated using Compustat data. Dechow et al. (2011) perform step-wise backward feature selection to derive more parsimonious material misstatement models. We use their second model, which is the most complete model in their study that only relies on Compustat data (they also include a model that requires market related data). This study predicts material misstatements using the following variables: RSST accruals, change in receivables, change in inventory, soft assets, percentage change in cash sales, change in return on assets, actual issuance of securities, abnormal change in employees, and existence of operating leases. The model in Cecchini et al. (2010) includes a total of 1,518 explanatory variables derived using 23 financial statement items. These items are divided by each other both in the current year and in the prior year and used to calculate changes in the ratios. Both current and lagged ratios as well as their changes are then used to construct a dataset with 1,518 independent variables. Rather than including all 1,518 variables in our study, we follow and extend the approach used in Cecchini et al. (2010) by including 48 variables measuring levels and changes in levels, percentage change in levels, and abnormal percentage change of commonly manipulated financial statement items and ratios. We examine a model with all 1,518 variables from Cecchini et al. (2010) in an additional analysis. 
b ACT is Current Assets - Total; AT is Assets - Total; AU is Auditor; CAPX is Capital Expenditures; CEQ is Common/Ordinary Equity - Total; CHE is Cash and Short-Term Investments; COGS is Cost of Goods Sold; CSHI is Common Shares Issued; CSHO is Common Shares Outstanding; DLC is Debt in Current Liabilities - Total; DLTIS is Long-Term Debt - Issuance; DLTT is Long-Term Debt - Total; DP is Depreciation and Amortization; EMP is Employees; EXCHG is Stock Exchange; FINCF is Financing Activities - Net Cash Flow; IB is Income Before Extraordinary Items; INVT is Inventories - Total; INVVAL is Inventory Valuation Method; IVAO is Investment and Advances - Other; IVST is Short-Term Investments - Total; LCT is Current Liabilities - Total; LT is Liabilities - Total; MRC1 is Rental Commitments - Minimum - 1st Year; MRC2 is Rental Commitments - Minimum - 2nd Year; MRC3 is Rental Commitments - Minimum - 3rd Year; MRC4 is Rental Commitments - Minimum - 4th Year; MRC5 is Rental Commitments - Minimum - 5th Year; NI is Net Income (Loss); OANCF is Operating Activities - Net Cash Flow; OB is Order Backlog; PPEGT is Property Plant and Equipment - Total (Gross); PPENT is Property Plant and Equipment - Total (Net); PPROR is Pension Plans - Anticipated Long-Term Rate of Return on Plan Assets; PRCC_F is Price Close - Annual - Fiscal Year; PSTK is Preferred/Preference Stock (Capital) - Total; RE is Retained Earnings; RECD is Receivables - Estimated Doubtful; RECT is Receivables - Total; SALE is Sales/Turnover (Net); SIC is SIC Code; SSTK is Sale of Common and Preferred Stock; TXDI is Income Taxes - Deferred; TXP is Income Taxes Payable; TXT is Income Taxes - Total; WCAP is Working Capital (Balance Sheet); XINT is Interest and Related Expense - Total; and XOPR is Operating Expense. We also include controls for year and industry (two-digit SIC code).
c Similar variable used in both Dechow et al. (2011) (abnormal change in employees) and Perols (2011) (unexpected employee productivity).
d Variable construction based on the Financial Kernel in Cecchini et al. (2010).
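As a worked example of applying these definitions, the sketch below computes "Change in receivables" from Panel A for a single firm's sorted annual series, assuming that average total assets means the mean of current-year and prior-year AT. The function name and array-based interface are our own.

```python
import numpy as np

def change_in_receivables(rect, at):
    """(RECT - RECT_{t-1}) / average total assets for one firm's sorted annual series."""
    rect = np.asarray(rect, dtype=float)
    at = np.asarray(at, dtype=float)
    avg_at = (at + np.roll(at, 1)) / 2.0      # mean of AT_t and AT_{t-1}
    out = (rect - np.roll(rect, 1)) / avg_at  # scaled year-over-year change
    out[0] = np.nan                           # no lagged value for the first year
    return out
```

The other scaled-change variables in Panels A through C follow the same pattern with different Compustat items.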
Figure 1 Multi-subset Observation Undersampling (OU)
Notes: Column 1 represents the raw data with the fraud observations stacked on top and non-fraud cases below. Column 1 also shows that model building and out-of-sample data are kept separated. Column 2 shows the data subsets that are created based on the OU method. All fraud data are used in each subset while the non-fraud data are under-sampled to address data rarity within each subset. Cumulatively across all subsets, all of the non-fraud data can be used, but a single non-fraud observation is only used in one subset. In column 3, a classification algorithm is used to build one prediction model per subset with the goal of accurately classifying firms into fraud or non-fraud cases. Each model is then applied out-of-sample and generates a fraud probability prediction for each observation in the out-of-sample data. In column 4, for each out-of-sample observation, the individual fraud prediction probabilities are then combined to arrive at an overall combined fraud probability prediction for each observation.
More formally, let M={f1, f2, f3,…, fk} be a set of k fraud observations f and let C={c1, c2, c3,…, cn} be a set of n non-fraud observations c, where M is the minority class, i.e., k < n. Note that the union of M and C, i.e., M U C, forms a set that contains k fraud and n non-fraud observations. To achieve a more balanced dataset, d non-fraud observations c are removed from the non-fraud set C, where 0 < d ≤ n - k. However, instead of deleting these removed non-fraud observations, OU segments the non-fraud observations into n / (n - d) or fewer subsets Ui that each contains n - d different non-fraud observations c, i.e., C={U1, U2, U3,…, Un/n-d}. Note that all subsets Ui contain mutually exclusive (disjoint) sets of non-fraud observations, Ui ∩ Uj = Ø for i ≠ j. OU then combines all fraud observations, i.e., the entire set M, with each Ui to create subsets Wi. OU thus creates up to n / (n - d) subsets Wi that contain all fraud observations f and n - d unique non-fraud observations c. Each subset Wi is then used to build a prediction model that is used to predict out-of-sample observations. In our experiments, OU is only used on the model building data and the model evaluation data is left intact. Finally, for each out-of-sample observation, the different prediction models’ probability predictions are averaged into an overall probability prediction for each observation.
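The formal description above can be sketched in code. The following is a minimal illustration in which the classification algorithm is supplied by the caller through `fit` and `predict_proba` hooks (the paper implements OU with support vector machines); the function names are our own.

```python
import numpy as np

def ou_subsets(fraud_idx, nonfraud_idx, n_subsets, rng):
    """Split non-fraud indices into disjoint subsets; pair each with all fraud cases."""
    nonfraud_idx = rng.permutation(nonfraud_idx)
    parts = np.array_split(nonfraud_idx, n_subsets)
    return [np.concatenate([fraud_idx, p]) for p in parts]

def ou_predict(fit, predict_proba, X_train, y_train, X_test, n_subsets=12, seed=0):
    """Build one model per OU subset and average the fraud probabilities out-of-sample."""
    rng = np.random.default_rng(seed)
    fraud_idx = np.flatnonzero(y_train == 1)
    nonfraud_idx = np.flatnonzero(y_train == 0)
    probs = []
    for idx in ou_subsets(fraud_idx, nonfraud_idx, n_subsets, rng):
        model = fit(X_train[idx], y_train[idx])      # one model per balanced subset
        probs.append(predict_proba(model, X_test))   # out-of-sample probabilities
    return np.mean(probs, axis=0)                    # combined prediction per firm
```

Each subset reuses the entire fraud set M while the non-fraud partitions remain mutually exclusive, matching the description above.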
Figure 2 Multi-subset Variable Undersampling (VU)
Notes: Column 1 represents the raw data that include all explanatory variables used to predict fraud. These explanatory variables are partitioned into different subsets represented by the vertical lines. Each subset contains a subset of the explanatory variables and all of the observations. Column 1 also shows that model building and out-of-sample data are kept separated. In column 2, a classification algorithm is used to build one prediction model per variable subset with the goal of classifying firms into fraud vs. non-fraud cases. Each prediction model is then applied out-of-sample to generate a fraud probability prediction for each observation in the out-of-sample data. In column 3, for each out-of-sample observation, the fraud prediction probabilities from the different prediction models are combined to arrive at an overall combined fraud prediction probability for each observation.
More formally, let W denote a dataset with m variables x, i.e., W={x1, x2, x3,…, xm}. VU reduces data dimensionality by randomly dividing the variables in W into q subsets X, where each X contains m/q variables, i.e., the following variable subsets are created by VU, X1={x1, x2, x3,…, xm/q}, X2={xm/q+1, xm/q+2, xm/q+3,…, x2m/q}, X3={x2m/q+1, x2m/q+2, x2m/q+3,…, x3m/q},…, Xq={xm-m/q+1, xm-m/q+2, xm-m/q+3,…, xm}. The subsets X are then used to build q prediction models. The prediction models are then (i) used to predict out-of-sample observations and (ii) for each out-of-sample observation, the prediction models’ probability predictions are combined into an overall prediction for each observation by taking an average of the individual probability predictions.
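A minimal sketch of the VU procedure above, again with caller-supplied `fit` and `predict_proba` hooks and our own function names:

```python
import numpy as np

def vu_subsets(n_vars, q, rng=None):
    """Partition variable indices 0..n_vars-1 into q subsets (shuffled if rng given)."""
    idx = np.arange(n_vars)
    if rng is not None:
        idx = rng.permutation(idx)
    return np.array_split(idx, q)

def vu_predict(fit, predict_proba, X_train, y_train, X_test, q=4, seed=0):
    """Build one model per variable subset and average the fraud probabilities."""
    subsets = vu_subsets(X_train.shape[1], q, np.random.default_rng(seed))
    probs = [predict_proba(fit(X_train[:, cols], y_train), X_test[:, cols])
             for cols in subsets]
    return np.mean(probs, axis=0)  # combined prediction per observation
```

PVU would replace the random partition in `vu_subsets` with groups of variables chosen by the fraud type they are expected to predict.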
Figure 3 Experimental Procedures: Multi-subset Observation Undersampling (OU) Example
Notes: The flowchart depicts 10-fold cross-validation starting from the raw data. For each cross-validation round n = {1, 2, 3,…, 10} and each OU implementation l = {1, 2, 3,…, 20}, OU creates l subsets from the round n training data, builds l prediction models, and uses the l models to predict the round n test data. For each observation in the round n test data, the l probability predictions are averaged into a combined prediction. Once all 20 OU implementations and all 10 cross-validation rounds are complete, ECM scores are calculated from the test data with combined predictions.
Figure 4 Multi-subset Observation Undersampling (OU) with Different Numbers of Subsets - Percentage Performance Improvement Relative to Benchmark
Notes: ECM is calculated using a 0.6 percent fraud probability and a 30:1 false negative to false positive cost ratio.
As discussed in the text, three versions of the experiment were conducted. Original order refers to the main OU experiment; new order refers to the analysis in which the OU subsets are selected in a different order; and new subsets refers to the analysis in which the random sampling of non-fraud cases is repeated using a different random draw.
The benchmark is simple undersampling (Perols 2011), which randomly removes non-fraud observations from the sample to generate a more balanced training sample. This benchmark performs better than a benchmark that includes all fraud and non-fraud observations. OU and the OU benchmarks use all variables (independent variable reduction is examined in the VU analysis) and are implemented using support vector machines.
[Line chart: ECM percentage improvement over the benchmark (0 to 15 percent) on the vertical axis against the number of OU subsets (1 to 20) on the horizontal axis, with separate lines for the original order, new order, and new subsets versions of the experiment.]
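The Expected Cost of Misclassification used to score these figures is not defined in this excerpt; a common formulation, which may differ in detail from the ECM of Perols (2011), weights the false negative and false positive rates by the prior fraud probability and the relative misclassification costs:

```python
def ecm(fn_rate, fp_rate, p_fraud=0.006, cost_ratio=30.0):
    """Expected Cost of Misclassification (one common formulation):
    ECM = P(fraud) * FN_rate * C_FN + P(non-fraud) * FP_rate * C_FP,
    with the false positive cost normalized to C_FP = 1 and
    C_FN / C_FP given by cost_ratio (30:1 in the figures above)."""
    return p_fraud * fn_rate * cost_ratio + (1.0 - p_fraud) * fp_rate
```

With a 0.6 percent fraud probability and a 30:1 cost ratio, missing a fraud is weighted far more heavily than a false alarm, which is why undersampling toward a more balanced training set can lower ECM.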
Figure 5 Multi-subset Variable Undersampling (VU) with Different Numbers of Subsets of Explanatory Variables and Partitioned VU (PVU) - Percentage Performance Improvement Relative to Benchmark
Notes: ECM is calculated using a 0.6 percent fraud probability and a 30:1 false negative to false positive cost ratio.
As discussed in the text, two versions of the VU experiment were conducted. The constant number of variables in each subset experiment (the dashed line) uses subsets that contain five or six variables in each subset; the all variables in each round experiment (the round dotted line) uses all variables in each experimental round by randomly dividing all 109 variables into different subsets (consequently, as the number of subsets increases, the number of variables in each subset decreases).
The all variables in each round experiment only manipulates the number of VU subsets in even increments.
PVU partitions the variables into subsets based on their relation to specific types of fraud (e.g., revenue vs. expense fraud) and this partition is done independently of how VU is implemented. Consequently, the performance of PVU does not change as the number of VU subsets changes.
The benchmark contains six randomly selected variables (from the 109 variables described in Appendix A) and is equivalent to the VU implementation with only one subset. This benchmark performed better than benchmarks implemented using (i) all the variables in the dataset and (ii) the variables selected in Dechow et al. (2011), i.e., RSST accruals, change in receivables, change in inventory, soft assets, percentage change in cash sales, change in return on assets, actual issuance of securities, abnormal change in employees, and existence of operating leases. VU, PVU, and the benchmarks use all fraud and non-fraud observations (observation undersampling is examined in the OU analysis) and are implemented using support vector machines.
[Line chart: ECM percentage improvement over the benchmark (-6 to 10 percent) on the vertical axis against the number of VU subsets (2 to 20) on the horizontal axis, with separate lines for VU with a constant number of variables in each subset, VU with all variables in each round, and PVU.]
Figure 6 Performance Comparison of OU(12), PVU, and the Combination of OU and PVU
Notes: OU is Multi-subset Observation Undersampling. OU(12) represents the best performing individual OU implementation.
PVU is Multi-subset Variable Undersampling partitioned on fraud type.
ECM is calculated assuming an evaluation fraud probability of 0.6 percent.
[Line chart: ECM percentage difference relative to OU(12) (-6 to 6 percent) on the vertical axis against the false negative to false positive cost ratio (1:1 to 100:1) on the horizontal axis, with separate lines for PVU and PVU + OU(12).]
TABLE 1 Summary of Experiments
Experiment 1. Evaluating the number of Multi-subset Observation Undersampling (OU) subsets
Description: We create OU subsets with 20 percent fraud cases in each subset.(b) Each subset includes all original fraud cases and a random sample of the original non-fraud cases selected without replacement. We then empirically examine the optimal number of subsets for implementing OU.(c) As sensitivity analyses, we repeat the experiment using the same subsets but selecting the subsets in a different order, and we re-perform the random selection procedure of non-fraud cases. OU and the benchmarks use all variables (data dimensionality reduction is examined in the VU analyses below).
Benchmark:(a) We use two benchmarks: (i) simple observation undersampling, i.e., OU with only one subset, as used in Perols (2011), and (ii) no undersampling.

Experiment 2a. Evaluating the number of Multi-subset Variable Undersampling (VU) subsets
Description: To evaluate how many VU subsets to use, we first randomly select variables from the 109 variables used in the prior literature and place these variables into 20 different subsets, so that each variable subset contains five or six variables. We then perform an experiment in which the subsets are randomly added one by one to VU.
Benchmark:(a) We use three benchmarks: (i) simple variable undersampling (i.e., VU with only one subset); (ii) a model that includes all variables; and (iii) model 2 in Dechow et al. (2011).

Experiment 2b. Evaluating the number of variables in each VU subset
Description: This experiment evaluates the performance of VU as we change the number of variables in each subset. We use all variables in each experimental round and randomly divide the variables into the different subsets. Thus, as the number of subsets increases, the number of variables in each subset decreases. For example, all variables are included in one set in the first experimental round, half the variables are included in each of two subsets in the second experimental round, etc. To reduce processing time, this experiment skips all odd-numbered rounds except the first.
Benchmark:(a) We use the same three benchmarks as the first VU experiment.

Experiment 3. Evaluating VU partitioned on fraud types (PVU)
Description: This experiment evaluates the performance of VU when partitioned on fraud types. Note that we do not examine the performance of different PVU implementations in this experiment, as the specific subsets included in PVU are driven by the partitioning rather than by an empirical evaluation.
Benchmark:(a) We compare the performance of PVU to the best performing benchmark in the VU experiments (i.e., simple variable undersampling).
TABLE 1 (continued) Summary of Experiments
Notes:
(a) Since we introduce OU to the fraud detection literature to reduce the imbalance between the number of fraud and the number of non-fraud observations, we use simple undersampling (Perols 2011) as a benchmark when evaluating the performance of OU. This benchmark randomly removes non-fraud observations from the sample to generate a more balanced training sample. We also use no undersampling as an additional benchmark; however, simple undersampling performs on average 7.3 percent better than no undersampling, and we consequently report only simple undersampling. OU and the OU benchmarks use all variables (data dimensionality reduction is examined in the VU analysis). VU is introduced as a data dimensionality reduction method that is argued to improve performance over currently used variable selection methods. As a baseline, we use a benchmark created using the variables included in Dechow et al. (2011) model 2 (the Dechow benchmark): RSST accruals, change in receivables, change in inventory, soft assets, percentage change in cash sales, change in return on assets, actual issuance of securities, abnormal change in employees, and existence of operating leases. We also use (i) a benchmark that randomly selects variables and (ii) a benchmark that includes all variables (the all variables benchmark), i.e., where data dimensionality is not reduced. The benchmark that randomly selects variables performs better than both the Dechow benchmark (by 3.9 percent) and the all variables benchmark (by 3.9 percent); thus, we report our results using the benchmark that randomly selects variables. VU and the VU benchmarks use all fraud and non-fraud observations (observation undersampling is examined in the OU analysis). Following recent fraud prediction research (e.g., Cecchini et al. 2010) and findings in Perols (2011), all prediction models are implemented using support vector machines. Sensitivity analyses reported in Online Appendix B examine other classification algorithms, including logistic regression and bootstrapping.
(b) Perols (2011) finds that a simple undersampling ratio of 20 percent provides relatively good performance in fraud prediction compared to other undersampling ratios.
(c) More specifically, we first create one subset and examine the performance of OU with this single subset. We then create a second subset and use it along with the previously created subset to evaluate the performance of OU with two subsets, and so on. Note that while it is possible to derive a total of 41 subsets following Chan and Stolfo’s (1998) approach, the addition of another OU subset is only valuable if the additional subset contains new information. We expect the marginal benefit of adding a subset to decrease as the total number of subsets in OU increases. Additionally, for each subset that is added, another prediction model has to be built, used for prediction, and combined with the other prediction models’ predictions, so there is a computational cost associated with increasing the number of subsets. Based on this, and on results indicating that the performance benefit tapers off around 12 subsets, we do not extend the experiment beyond 20 subsets.
TABLE 2 Multi-subset Observation Undersampling (OU) Performance(a) - Increasing the Number of Subsets
[Table body with subset ranges labeled "ECM Improving" and "Performance Plateau."]
Notes:
(a) Prediction performance is evaluated using 10-fold cross-validation in which separate datasets are used for model building vs. model evaluation. Performance is the average Expected Cost of Misclassification (ECM) across the ten test folds. ECM is measured at best estimates of prior fraud probability, i.e., 0.6 percent, and cost ratios, i.e., 30:1.
(b) Reported p-values are based on pairwise t-tests using the average and standard deviation in ECM scores across the ten test folds and are one-tailed unless otherwise noted. Assumptions related to normality and independent observations are unlikely to be satisfied and p-values are only included as an indication of the relation between the magnitude and the variance of the difference between each implementation and the respective benchmarks.
(c) The benchmark is simple undersampling (Perols 2011), which randomly removes non-fraud observations from the sample to generate a more balanced training sample. This benchmark performed better than a benchmark that included all fraud and non-fraud observations. OU and the OU benchmarks use all variables (independent variable reduction is examined in the VU analysis) and are implemented using support vector machines. (Other classification algorithms, including logistic regression and bootstrapping, are used in additional analyses reported in Online Appendix B.)
TABLE 3 Prediction Performance(a,b) of OU and PVU on a Material Misstatements Hold-Out Sample
Notes:
(a) Prediction performance is evaluated using 10-fold cross-validation in which separate datasets are used for model building vs. model evaluation. Performance is area under the ROC curve (AUC). AUC provides a numeric value of how well the prediction model ranks the observations in the test sets and represents the probability that a randomly selected positive (misstatement) instance is ranked higher than a randomly selected negative (non-misstatement) instance. An AUC of 0.5 is equivalent to a random rank order while an AUC of 1 is a perfect ranking of the evaluation observations.
(b) The results in Panel A compare the performance of OU and VU to the Dechow benchmark using material misstatement data (all methods and benchmarks are implemented using support vector machines; other classification algorithms, including logistic regression and bootstrapping, are used in additional analyses reported in Online Appendix B). This comparison provides further validation of the results reported earlier on fraud data and provides insight into the usefulness of the proposed methods in a slightly different setting. The results in Panel B compare the performance of the financial kernel from Cecchini et al. (2010) with and without OU (both implementations use support vector machines). This analysis provides insight into (i) the usefulness of OU when used in combination with a different set of independent variables (created using the financial kernel of Cecchini et al. (2010)) and (ii) whether OU provides incremental predictive power when used in combination with the financial kernel.
(c) In Panel A, given the source of the data, i.e., Dechow et al. (2011), and the nature of the material misstatement data, we use the Dechow et al. (2011) benchmark. This benchmark is based on model 2 from Dechow et al. (2011): material misstatement = RSST accruals + change in receivables + change in inventory + soft assets + percentage change in cash sales + change in return on assets + actual issuance of securities + abnormal change in employees + existence of operating leases. These independent variables were selected in Dechow et al. (2011) using a material misstatement sample that is similar to the sample used in this experiment. Because the entire sample was used when selecting these variables it is possible that the benchmark performance represents an overfitted model. In this experiment, OU uses all 107 variables, but under-samples the non-fraud observations using the OU method. PVU uses all observations, but partitions the original 107 variables based on fraud types.
(d) The financial kernel consists of 1,518 independent variables representing current and lagged ratios and changes in the ratios of 23 financial statement variables commonly used to construct independent variables in fraud research. In this experiment, OU is implemented using the same 1,518 independent variables and support vector machines. PVU is not implemented in this experiment, as it is not clear how to partition the 1,518 independent variables into different fraud categories.
(e) p-values are one-tailed based on pairwise t-tests using the average and standard deviation of ECM scores across the ten test folds. Assumptions related to normality and independent observations are unlikely to be satisfied and p-values are only included as an indication of the relation between the magnitude and the variance of the difference between each implementation and the benchmark.
TABLE 4 Hypothesis Testing: Results on Full Sample Logistic Regressions versus 12 OU Subsamples Logistic Regressions
Note: Average estimates, standard deviation estimates, and average p-values are based on estimates and p-values from the 12 OU subsample logistic regressions. P-values less than 0.0001 were converted to 0.0001 before taking the average.
c regression