Online Early — Preprint of Accepted Manuscript

This is a PDF file of a manuscript that has been accepted for publication in an American Accounting Association journal. It is the final version that was uploaded and approved by the author(s). While the paper has been through the usual rigorous peer review process for AAA journals, it has not been copyedited, nor have the graphics and tables been modified for final publication. Also note that the paper may refer to online Appendices and/or Supplements that are not yet available. The manuscript will undergo copyediting, typesetting, and review of page proofs before it is published in its final form; therefore, the published version will look different from this version and may also have some differences in content.
We have posted this preliminary version of the manuscript as a service to our members and subscribers in the interest of making the information available for distribution and citation as quickly as possible following acceptance.
The DOI for this manuscript and the correct format for citing the paper are given at the top of the online (html) abstract.
Once the final published version of this paper is posted online, it will replace this preliminary version at the specified DOI.
The Accounting Review • Issues in Accounting Education • Accounting Horizons
Accounting and the Public Interest • Auditing: A Journal of Practice & Theory
Behavioral Research in Accounting • Current Issues in Auditing
Journal of Emerging Technologies in Accounting • Journal of Information Systems
Journal of International Accounting Research • Journal of Management Accounting Research
The ATA Journal of Legal Tax Research • The Journal of the American Taxation Association
preprint
accepted manuscript
Finding Needles in a Haystack: Using Data Analytics to Improve Fraud Prediction
Johan L. Perols
Associate Professor of Accounting, University of San Diego, [email protected]

Robert M. Bowen
Distinguished Professor of Accounting, University of San Diego, [email protected]

Carsten Zimmermann
Associate Professor of Management, University of San Diego

Basamba Samba
RWTH Aachen University
Editor’s Note: Accepted by Elaine Mauldin. Submitted February 2015; accepted July 2016.
We acknowledge financial assistance from the School of Business Administration at the University of San Diego and helpful comments from Elaine Mauldin (editor), Darren Bernard, Barbara Bliss, Nicole Cade, Ed deHaan, Weili Ge, Jane Jollineau, Yen-Ting (Daniel) Lin, Sarah Lyon, Dawn Matsumoto, Barry Mishra, Ted Mock, Ryan Ratcliff, Terry Shevlin, Brady Williams, two anonymous reviewers, and workshop participants at the University of California, Riverside and the University of San Diego. All remaining errors are our own.
ABSTRACT
Developing models to detect financial statement fraud involves challenges related to (i) the rarity of fraud observations, (ii) the relative abundance of explanatory variables identified in the prior literature, and (iii) the broad underlying definition of fraud. Following the emerging data analytics literature, we introduce and systematically evaluate three data analytics preprocessing methods to address these challenges. Results from evaluating actual cases of financial statement fraud suggest that two of these methods improve fraud prediction performance by approximately ten percent relative to the best current techniques. Improved fraud prediction can result in meaningful benefits, such as improving the ability of the SEC to detect fraudulent filings and improving audit firms' client portfolio decisions.
Keywords: fraud, financial statement fraud, data analytics, predictive analytics, data rarity, data imbalance
I. INTRODUCTION
Organizations lose an estimated 5 percent of annual revenues to fraud in general and 1.6
percent of annual revenues specifically to financial statement fraud (ACFE 2014). Further, when
resources are misallocated because of misleading financial data, fraud can harm the efficiency of
capital, labor, and product markets. Financial statement fraud (henceforth fraud) also increases
business risk. For example, audit firms can face lawsuits, reputational costs, and loss of clients,
and investors and banks are more likely to make suboptimal investment and loan decisions.
Data analytics is an important emerging field in both academic research (e.g., Agarwal and
Dhar 2014; Chen, Chiang, and Storey 2012) and in practice (e.g., Brown, Chui, and Manyika
2011; LaValle, Lesser, Shockley, Hopkins, and Kruschwitz 2011).1 In the fraud context, data
analytics can, for example, be used to create fraud prediction models that help (i) auditors
improve client portfolio management and audit planning decisions and (ii) regulators and other
oversight agencies identify firms for potential fraud investigation (SEC 2015; Walter 2013).
However, the usefulness of data analytics in fraud prediction is hindered by three challenges.
First, fraud prediction is a “needle in a haystack problem.” That is, the relative rarity of fraud
firms compared to non-fraud2 control firms (Bell and Carcello 2000) makes fraud prediction
difficult (Perols 2011). Second, fraud prediction is complicated by the “curse of data
dimensionality” (Bellman 1961). The rarity of fraud observations relative to the large number of
explanatory variables identified in the fraud literature (Whiting, Hansen, McDonald, Albrecht,
1 Data analytics refers to techniques that are grounded in data mining (e.g., decision trees, artificial neural networks, and support vector machines) and statistics (e.g., ANOVA, regression analysis, and logistic regressions) (Chen et al. 2012). Data analytics draws from statistics, artificial intelligence, computer science, and database research. It is related to big data in that it provides tools that enable the analysis of large datasets. Data analytics is typically focused on prediction as opposed to explanation.
2 We use the term non-fraud firms to describe all firms for which fraud has not been detected. This primarily includes firms that have not committed fraud, but also includes undetected cases of fraud. To the extent that undetected fraud exists in our data, noise is introduced. This noise reduces the effectiveness of all prediction models, and methods that address this noise might further improve fraud prediction. However, this noise is not likely to bias performance comparisons among prediction models that use the same data.
and Albrecht 2012) can result in over-fitted prediction models that perform poorly when
predicting new observations. Third, prior research generally treats all frauds as homogeneous
events. This can make fraud prediction more difficult because prediction models have to detect
patterns that are common across different fraud types (e.g., revenue vs. expense fraud).
While prior fraud detection research enhances our general understanding of fraud indicators
and prediction methods, this research rarely addresses these problems explicitly. With a primary
objective of improving fraud prediction, we address these challenges by introducing three
methods grounded in data analytics research.3 The methods we examine have performed well in
other settings characterized by data rarity, such as predicting credit card fraud (e.g., Chan and
Stolfo 1998). The first method, Multi-subset Observation Undersampling (OU), addresses the
imbalance between the low number of fraud observations relative to the number of non-fraud
observations by creating multiple subsets of the original dataset that each contain all fraud
observations and different random subsamples of non-fraud observations. The second method,
Multi-subset Variable Undersampling (VU), addresses the imbalance between the low number of
fraud observations relative to the number of explanatory variables identified in the fraud
prediction literature by creating multiple subsets of randomly selected explanatory variables.
The third method, VU partitioned by type of fraud (PVU), is a variation of the second method
that addresses issues associated with treating all fraud cases as homogenous events. Rather than
randomly selecting variables, we instead use our a priori knowledge to partition the variables
into subsets based on their relation to specific types of fraud (e.g., revenue vs. expense fraud).
We use a dataset with 51 fraud firms, 15,934 non-fraud firm years, and 109 explanatory
variables from prior research. We then analyze over 10,000 prediction models to systematically
3 We evaluate our results on out-of-sample data and thus perform predictive modeling. To clearly delineate our work from explanatory models, we refer to our models as prediction models throughout the paper.
evaluate how to best implement these methods, e.g., how many data subsets to use in OU. In
addition, we examine the prediction performance of these implementations relative to various
benchmarks that represent the current standard in the literature, e.g., model 2 in Dechow, Ge,
Larson, and Sloan (2011) and simple undersampling as used in Perols (2011). To avoid biasing
the results, we evaluate prediction performance using the prediction models’ probability
predictions on hold-out data that are not processed by the proposed methods.
Results indicate that including additional data subsets (up to approximately 12 subsets)
increases OU fraud prediction performance, i.e., additional subsets after 12 do not appear to
enhance performance. This 12-subset configuration improves prediction performance by 10.8
percent relative to the best performing benchmark.
While results indicate that VU also has the potential to improve fraud prediction, the
performance of this method is highly dependent on the specific variables selected in the various
subsets. However, performance improves when we use a priori knowledge to separate
independent variables into different subsets based on the type of fraud they are likely to predict,
e.g., revenue or expense fraud. This method, i.e., PVU, improves fraud prediction performance
by 9.6 percent relative to the best performing VU benchmark. Additional analyses also show
that performance can be further improved by combining OU and PVU, but only under certain
conditions as described in section IV.
Our paper makes at least five important contributions. First, by introducing and
systematically evaluating three new methods and showing that two of these methods improve
prediction performance relative to the best performing benchmarks, we directly contribute to
research that focuses on improving the performance of fraud prediction models. The
performance improvements from OU and PVU are large relative to other approaches for
improving prediction performance, e.g., (i) a 0.9 percent performance advantage in Dechow et al.
(2011) when two additional significant independent variables are added to their initial model and
(ii) a 2.2 percent improvement in Price, Sharp, and Wood (2011), when comparing Audit
Integrity’s Accounting and Governance Risk measure to Dechow et al. (2011) model 2.4
Second, the finding that OU significantly improves prediction performance has important
methodological implications for research that evaluates the value of new explanatory variables.
This research can potentially benefit from applying OU to ensure that (i) results are robust across
different subsamples and (ii) new variables provide incremental predictive value to models
implemented using our recommended methods.
Third, we show that the ability of VU to predict fraud improves consistently only when we
recognize that not all frauds are alike and therefore subdivide the general fraud problem into
types of fraud. The importance of this approach likely extends beyond variable undersampling.
For example, future research could reorganize or design new fraud variables to predict a specific
fraud type (e.g., revenue fraud or expense fraud).
Fourth, OU and PVU can be extended to address rarity and data dimensionality problems that
are prevalent in other accounting classification settings, including prediction of financial
statement restatements, material weaknesses in internal controls, auditor resignations, audit
qualifications, and bankruptcy.
Finally, the introduction and evaluation of these methods makes an important contribution to
practice. Better prediction models can, for example, help the SEC and external auditors improve
4 Dechow et al. (2011) do not report predictive performance; the 0.9 percent difference is based on a separate analysis that we performed using the two models in their paper (Model 1 and Model 2). This analysis uses the same procedures used in our material misstatement analysis described in Section IV. Price et al. (2011) compare Audit Integrity’s Accounting and Governance Risk measure, which is considered the gold standard in commercial risk measures, to Dechow et al. (2011) Model 1 using material misstatement data. Based on their results, we calculate a 3.16 percent fraud prediction performance improvement of the commercial measure over Model 1. This implies a 2.24 percent improvement over Dechow et al. (2011) Model 2, which we include as one of our benchmark models.
their identification of potentially fraudulent accounting practices (Walter 2013; SEC 2015).
The remainder of the paper is organized as follows. Section II summarizes the fraud
literature, discusses data rarity, and describes how methods drawn from the data analytics
literature can be applied to fraud prediction. Section III describes the data, performance
measure, and experimental design. Section IV provides results, and Section V concludes.
II. PRIOR LITERATURE, BACKGROUND, AND PROPOSED METHODS
Prior Fraud Prediction Research
Research on financial statement fraud prediction contributes to understanding factors that can
be used to predict fraud. Prior research includes testing fraud hypotheses grounded in the
earnings management and corporate governance literatures (e.g., Beasley 1996; Dechow, Sloan,
and Sweeney 1996; Summers and Sweeney 1998; Beneish 1999; Sharma 2004; Erickson,
Hanlon, and Maydew 2006; Lennox and Pittman 2010; Feng, Ge, Luo, and Shevlin 2011; Perols
and Lougee 2011; Caskey and Hanlon 2013; Armstrong, Larcker, Ormazabal, and Taylor 2013;
Markelevich and Rosner 2013). This research also evaluates the significance of a variety of
other potential explanatory variables, such as red flags emphasized in auditing standards,
discretionary accruals measures, and non-financial indicators (e.g., Loebbecke, Eining, and
Willingham 1989; Beneish 1997; Lee, Ingram, and Howard 1999; Apostolou, Hassell, and
Webber 2000; Kaminski, Wetzel, and Guan 2004; Ettredge, Sun, Lee, and Anandarajan 2008;
Jones, Krishnan, and Melendrez 2008; Brazel, Jones, and Zimbelman 2009; Dechow et al. 2011).
We use independent variables from this research as input into our models.
Varian (2014) highlights the importance of the emerging field of data analytics. He suggests
that researchers using traditional econometric methods should consider adapting recent advances
from this field. A second stream of financial statement fraud prediction research follows this
suggestion and applies developments in data analytics research to improve fraud prediction.
Early research within this stream concludes that artificial neural networks perform well relative
to discriminant analysis and logistic regressions (e.g., Green and Choi 1997; Fanning and Cogger
1998; Lin, Hwang, and Becker 2003). More recent research in this stream examines additional
classification algorithms, such as support vector machines, decision trees, and adaptive learning
methods (e.g., Cecchini, Koehler, Aytug, and Pathak 2010; Perols 2011; Abbasi, Albrecht,
Vance, and Hansen 2012; Gupta and Gill 2012; Whiting et al. 2012) and text mining methods
(e.g., Glancy and Yadav 2011; Humpherys, Moffitt, Burns, Burgoon, and Felix 2011; Goel and
Gangolly 2012; Larcker and Zakolyukina 2012). We follow recent fraud data analytics research
(e.g., Cecchini et al. 2010) and findings in Perols (2011) and implement all prediction models
using support vector machines. Support vector machines determine how to separate fraud firms
from non-fraud firms by finding the hyperplane that provides the maximum separation in the
training data between fraud and non-fraud firms. In additional analyses reported in Online
Appendix B, we also use logistic regression and bootstrap aggregation to examine the robustness
of our results.
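To make the modeling step concrete, training a support vector machine and obtaining out-of-sample probability predictions might look as follows. This is a toy sketch on synthetic data using scikit-learn; the paper does not prescribe a particular software implementation, and all names and numbers below are ours.

```python
# Toy sketch: train an SVM fraud classifier and obtain probability
# predictions for new observations. The SVM separates the two classes by
# finding the maximum-margin hyperplane in the training data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic training data: 10 "fraud" and 40 "non-fraud" firm-years,
# each described by 5 explanatory variables.
X = np.vstack([rng.normal(loc=1.0, size=(10, 5)),
               rng.normal(loc=-1.0, size=(40, 5))])
y = np.array([1] * 10 + [0] * 40)       # 1 = fraud, 0 = non-fraud

model = SVC(kernel="linear", probability=True).fit(X, y)

# Fraud probability predictions on out-of-sample observations.
probs = model.predict_proba(rng.normal(size=(3, 5)))[:, 1]
```

In the paper's design, classifiers like this are built on preprocessed data subsets and evaluated on hold-out data; the sketch only illustrates the classifier mechanics.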
Data Rarity, Related Prior Research, and Proposed Methods
Data rarity is observed in diverse prediction settings, such as credit card fraud (Chan and
Stolfo 1998), auto insurance fraud (Phua, Alahakoon, and Lee 2004), bankruptcy (Shin, Lee, and
Kim 2005), and financial statement fraud (Whiting et al. 2012). Classification algorithms (e.g.,
logistic regression) have inherent difficulties in processing rarity (Weiss 2004), and data rarity is
regarded as one of the primary challenges in data analytics research (Yang and Wu 2006). Data
rarity is particularly severe in financial statement fraud detection because financial statement
fraud is characterized by both (i) relative rarity (a.k.a., the needle in the haystack problem) and
(ii) absolute rarity combined with an abundance of explanatory variables proposed in the
literature (a.k.a., the curse of data dimensionality problem).
The needle in a haystack problem. Relative rarity occurs when detected fraud observations
are a relatively small percentage of the majority non-fraud observations, e.g., only
approximately 0.6 percent of all audited U.S. financial reports have been identified as fraudulent (Bell and
Carcello 2000). Relative rarity is a challenge since it forces classification algorithms to consider
a large number of potential patterns without having enough fraud observations to determine
which patterns are driven by noisy data. This increases the risk that identified patterns are based
on spurious relations in a particular sample, resulting in increased false positive rates for a given
false negative rate when the developed model is applied to a new sample (Weiss 2004). Further,
to minimize total classification errors, algorithms tend to be biased towards classifying
observations from the majority class correctly (e.g., Maloof 2003). To illustrate, if 99 percent of
all observations are non-fraudulent, a prediction model identifying all observations as non-
fraudulent achieves an overall accuracy of 99 percent – correctly classifying 100 percent of the
non-fraudulent observations, but 0 percent of the fraudulent observations.
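The arithmetic behind this illustration is easy to verify (the numbers below are the hypothetical 99 percent example, not our sample):

```python
# A trivial model that labels every observation "non-fraud" on a
# 99%-non-fraud sample: overall accuracy looks excellent even though
# no fraud case is caught.
n_fraud, n_nonfraud = 10, 990            # hypothetical 1%-fraud sample

correct = n_nonfraud                     # only non-fraud cases are correct
accuracy = correct / (n_fraud + n_nonfraud)   # 0.99
fraud_detection_rate = 0 / n_fraud            # 0% of fraud cases detected
```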
Perols (2011) takes an initial step towards addressing the relative rarity problem in a fraud
context by examining the performance of classification algorithms after undersampling the non-
fraud observations. However, while the simple undersampling method used in Perols (2011),
i.e., a method that simply removes non-fraud observations from the sample, generates more
balanced datasets, it also discards potentially useful non-fraud observations. We, therefore,
introduce a more sophisticated undersampling method that does not discard non-fraud
observations (and include simple undersampling as a benchmark).
More specifically, we use Multi-subset Observation Undersampling (OU), developed by
Chan and Stolfo (1998), to address relative rarity. OU uses multiple data subsets, where each
subset contains all fraud observations but different subsamples of non-fraud observations. We
specifically select OU because prior research shows that it performs well in other settings
constrained by relative rarity, such as predicting credit card fraud (e.g., Chan and Stolfo 1998).
OU is also effective compared to (i) other undersampling and oversampling methods (Nguyen,
Cooper, and Kamei 2012) and (ii) various types of bootstrap aggregation, boosting, and hybrid
ensemble data rarity methods used in the data analytics literature (Galar, Fernández,
Barrenechea, Bustince, and Herrera 2012). Nguyen et al. (2012) suggest that OU improves
performance not only because it makes the ratio of minority (fraud) to majority (non-fraud)
observations more balanced, but also because it more efficiently incorporates potentially useful
majority cases. By increasing the balance between fraud and non-fraud cases, OU adjusts the
focus of the classification algorithms towards the fraud cases. This focus is desirable given the
importance of minority cases in fraud detection, i.e., it is more costly to incorrectly classify fraud
cases than non-fraud cases. By creating multiple prediction models that are based on different
non-overlapping subsets of majority observations, each prediction model is likely to differ
somewhat from the other prediction models. Importantly, patterns that are predictive of fraud are
likely to be present in multiple subsets. However, spurious patterns that exist by random chance
in individual subsets are unlikely to also exist in other subsets. By using a combination of these
models rather than a model built using a single dataset, potentially important patterns are more
likely to be identified and estimated accurately (by combining models with slightly different
pattern estimates, the estimates should be more robust). Additionally, when individual models
are combined, spurious patterns are likely to be discarded (or given less weight). Thus, by
combining predictions from multiple models, OU reduces the risk of overfitting, i.e., that the
prediction model has good in-sample performance but does not generalize to new observations.
When applied in the fraud setting, OU first preprocesses the model building data by dividing
the data into multiple subsets, where each subset includes all fraud observations and a random
sample of non-fraud observations selected without replacement (Figure 1). Thus, all fraud
observations are included in all subsets while each non-fraud observation is part of at most one
subset. Each subset is then used in combination with a classification algorithm to build a fraud
prediction model. To perform fraud prediction, each prediction model is then applied to out-of-
sample data. For each observation in the out-of-sample data, the resulting model predictions are
combined into an overall fraud probability prediction for the observation (see further details in
the next section).
<Insert Figure 1 Here>
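The OU preprocessing and prediction-combination steps described above can be sketched as follows. This is a simplified illustration: the function and variable names are ours, and simple averaging is used as one concrete way to combine the per-model probability predictions.

```python
import numpy as np

def ou_subsets(fraud_idx, nonfraud_idx, n_subsets, seed=0):
    """Multi-subset Observation Undersampling: every subset contains all
    fraud observations; non-fraud observations are sampled without
    replacement, so each appears in at most one subset."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(nonfraud_idx), n_subsets)
    return [np.concatenate([fraud_idx, part]) for part in parts]

# Toy example: 5 fraud and 20 non-fraud observations split into 4 subsets.
subsets = ou_subsets(np.arange(5), np.arange(5, 25), n_subsets=4)

# One prediction model is built per subset; their out-of-sample probability
# predictions (rows = models, columns = observations) are combined -- here
# by averaging -- into one fraud probability per observation.
per_model_probs = np.array([[0.2, 0.9],
                            [0.4, 0.7],
                            [0.3, 0.8],
                            [0.1, 0.6]])
combined = per_model_probs.mean(axis=0)
```

Because spurious patterns tend to differ across the non-overlapping non-fraud samples while genuine fraud patterns recur, the combined prediction is more robust than any single model's.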
The curse of data dimensionality problem. According to the “curse of data dimensionality,”
data requirements increase exponentially with the number of explanatory variables in the dataset
(Bellman 1961). The curse of data dimensionality is a potential problem in fraud prediction
because the number of known fraud cases is small relative to the extensive number of
independent variables identified in prior fraud research. Hence, only a small number of fraud
observations are available to identify patterns among a large number of independent variables
and fraud. This may result in over-fitted prediction models that perform poorly when predicting
new observations.
By using stepwise backward variable selection to build a parsimonious fraud prediction
model, Dechow et al. (2011) partially address the problem of data dimensionality in the fraud
context. However, while stepwise backward variable selection is designed to retain explanatory
variables with the highest significance levels, it may discard potentially useful variables. We
build on Dechow et al. (2011) and introduce a new method that attempts to address the curse of
data dimensionality, while simultaneously retaining potentially useful explanatory variables. We
include the Dechow et al. (2011) model as a benchmark in our analyses.
To reduce the imbalance between minority fraud observations and the number of variables
identified in the literature to predict fraud, we design a new data rarity method, Multi-subset
Variable Undersampling (VU).5 VU randomly splits the set of explanatory variables without
replacement into different subsets (Figure 2). Each subset contains the same observations, but
different non-overlapping sets of explanatory variables. As with OU, each subset is then used in
combination with a classification algorithm to build a fraud prediction model that is applied to
out-of-sample data. For each observation in the out-of-sample data, the resulting model
predictions are then combined into an overall fraud probability prediction for the observation.
<Insert Figure 2 Here>
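VU's variable-level split mirrors OU's observation-level split; a minimal sketch with hypothetical variable names (ours, not the paper's):

```python
import numpy as np

def vu_subsets(variables, n_subsets, seed=0):
    """Multi-subset Variable Undersampling: randomly split the explanatory
    variables without replacement into non-overlapping subsets; each subset
    keeps all observations but only its own variables."""
    rng = np.random.default_rng(seed)
    return [list(part) for part in
            np.array_split(rng.permutation(variables), n_subsets)]

# Toy example: 12 explanatory variables split into 3 subsets of 4 each;
# a model is then built per subset and the predictions combined.
variables = np.array([f"x{i}" for i in range(12)])
subsets = vu_subsets(variables, n_subsets=3)
```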
Partitioning Fraud into Types
Managers commit financial statement fraud by manipulating specific accounts, e.g., they may
improve reported earnings by artificially increasing revenue or reducing expenses. Many
financial statement fraud variables used in the literature are inherently related to a specific type
of fraud. For example, abnormal revenue growth is a potential measure of revenue fraud while
an abnormally low amount of allowance for doubtful accounts is a potential measure of expense
fraud. Although these variables may provide useful information about a specific type of fraud,
they are less likely to detect multiple types of fraud. When different fraud types are combined
into a binary classification problem, variables that are helpful when detecting a specific type of
fraud may be discarded if they do not do well in predicting fraud in general. For example, a
variable that provides a good signal about expense fraud but provides no useful information
5 In an attempt to further mitigate problems associated with having a small number of fraud observations to learn from, we examine the usefulness of an observation oversampling method named SMOTE (Chawla, Bowyer, Hall, and Kegelmeyer 2002) in fraud prediction. We, however, do not find a significant performance advantage for SMOTE relative to simple oversampling (results available from the corresponding author) and as such do not recommend SMOTE to address data rarity in the fraud context.
about other types of fraud will only provide value when classifying expense fraud cases, which
in our sample is only about ten percent of the fraud cases. Additionally, by combining different
fraud types into a binary classification problem, the classification algorithms focus on finding
patterns common to all fraud types. Given heterogeneity among different fraud types, such
patterns may be difficult to detect.
To reduce the potential negative effects associated with combining different fraud types into
binary classification models, we implement VU by partitioning the independent variables based
on different fraud types (PVU).6 When implementing PVU, we place all variables that appear to
predict a specific fraud type into a separate variable subset. Variables that can be used to predict
multiple fraud types are placed in multiple subsets. This creates four subsets of variables relating
to revenue, expenses, assets, and liabilities (for each subset, the model building data are
restricted to fraud observations that represent the associated fraud type). We also include three
additional variable subsets, because some fraud variables measure general attributes of fraud,
such as incentives, opportunities, or the aggregate effect of fraud. The first of these subsets
includes all variables not categorized as a specific fraud type variable. The second subset
includes the variables used in Dechow et al. (2011). These variables are included for their utility
in binary fraud prediction. The third subset includes all variables and is created to allow the
classifiers to find patterns among both fraud type specific and non-fraud type specific variables.
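The PVU partitioning can be sketched as follows. The variable names and type assignments below are hypothetical placeholders, not the paper's actual mapping (the variables themselves are summarized in Appendix A):

```python
# Hypothetical a-priori assignment of variables to fraud-type subsets.
# A variable predictive of several fraud types appears in several subsets.
fraud_type_map = {
    "revenue":     ["abnormal_revenue_growth", "receivables_growth"],
    "expenses":    ["allowance_doubtful_accts", "receivables_growth"],
    "assets":      ["soft_asset_pct"],
    "liabilities": ["unexpected_liab_change"],
}
dechow_vars = ["dechow_var_1", "dechow_var_2"]   # placeholder names
general_vars = ["ceo_chair_duality"]             # placeholder names

categorized = {v for vs in fraud_type_map.values() for v in vs}
all_vars = sorted(categorized | set(dechow_vars) | set(general_vars))

# Four fraud-type subsets plus three general subsets, as in the text:
subsets = dict(fraud_type_map)
subsets["general"] = [v for v in all_vars if v not in categorized]
subsets["dechow"] = list(dechow_vars)   # Dechow et al. (2011) variables
subsets["all"] = list(all_vars)         # full variable set
```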
III. DATA AND EXPERIMENTAL DESIGN
Sample Data
We obtain a sample containing 51 fraud firms7 and 15,934 non-fraud firm years from Perols
6 Additionally, the use of multiple VU variable subsets that focus on different fraud types increases the likelihood that different prediction models capture different fraud patterns, which improves diversity among the prediction models. Prediction model diversity is important for performance when combining multiple models (Kittler, Hatef, Duin, and Matas 1998). We do not modify OU based on fraud types because OU only undersamples the non-fraud data and does not preprocess the fraud data.
7 This sample size of 51 fraud firms is comparable to other fraud studies (e.g., Beasley 1996; Erickson et al. 2006; Brazel et al.
(2011). We only include one firm year for each fraud observation that corresponds to the first
year that the Accounting and Auditing Enforcement Release (AAER) alleges that fraud was
committed. We do not include previous years as the fraud may have predated the reported first
fraud year. We do not include multiple fraud years for each fraud firm to prevent a single fraud
firm from being included in both the model building dataset and the out-of-sample model
evaluation dataset.
Perols (2011) identifies fraud firms in SEC investigations reported in AAERs between 1998
and 2005 that explicitly reference Section 10(b) Rule 10b-5 (Beasley 1996) or contain
descriptions of fraud. This fraud firm dataset excludes: financial firms; firms without the first
fraud year specified in the SEC release; non-annual financial statement fraud; foreign firms;
releases related to auditors; not-for-profit organizations; fraud related to registration statements,
10-KSB or IPO; and firms with missing Compustat (financial statement data), Compact D/SEC
(executive and director names, titles and company holdings), or I/B/E/S data (one-year-ahead
analyst earnings per share forecasts and actual earnings per share) in relevant years.8 Randomly
selected Compustat non-fraud firms (excluding observations following the applicable criteria
specified for fraud firms above) are added to the fraud firm dataset to create a sample with 0.3
percent fraud firms, which allows us to examine the robustness of the results around best
estimates of prior fraud probability, i.e., 0.6 percent (Bell and Carcello 2000), in the population
of interest. We include explanatory variables (summarized in Appendix A) that have been used
in recent literature to predict fraud or material misstatements (Cecchini et al. 2010; Dechow et al.
2011; Perols 2011). More specifically, we include all variables from Perols (2011) and all
2009). Other research (e.g., Dechow et al. 2011) uses AAERs to create samples focused on material misstatements. Material misstatement data include firms with AAERs that explicitly allege fraud as well as other firms that describe a material misstatement without explicitly alleging fraud. While such samples are larger, they do not necessarily focus on fraud.
8 Since we add additional variables to the Perols (2011) dataset, some of the variables have missing values. Missing values are replaced by global means/modes. The effect of this is a reduction in the utility of variables that have many missing values.
variables from the final Dechow et al. (2011) model that can be calculated using Compustat data.
Following and extending Cecchini et al. (2010), we also include 48 variables measuring levels
and changes in levels, percentage changes in levels, and abnormal percentage changes of
commonly manipulated financial statement items and ratios.
Experimental Design
Overview of the experiments. As summarized in Table 1, we perform multiple experiments
to (i) determine how to best implement OU and VU (e.g., how many subsets to use) and (ii)
evaluate their relative performance compared to various benchmarks. The primary objective in
these experiments is to detect trends that indicate how to implement the methods in future
research. By detecting clear trends between the number of subsets and predictive ability rather
than selecting implementations that happen to be the most predictive, we reduce the risk that we
recommend implementations that perform well on our test data, but do not generalize well.
In experiment 1, we use OU to create observation subsets that contain all fraud observations
and random samples of non-fraud observations that yield 20 percent fraud observations per
subset. In an evaluation of simple undersampling ratios, Perols (2011) finds that this ratio
provides relatively good performance. We then evaluate how many observation subsets to
include when implementing OU. In experiment 2a, we use VU to randomly divide the variables
used in prior fraud prediction research into 20 subsets. We then assess how many variable
subsets to include when implementing VU. In experiment 2b, we examine whether the number
of variables included in each subset affects performance by dividing the total number of
variables into subsets as follows: one subset with all variables, two subsets each with one-half of
the variables, four subsets each with one-quarter, six subsets each with one-sixth, eight subsets
each with one-eighth, etc. We then evaluate how many variables per subset to include when
implementing VU. Finally, in experiment 3, we evaluate the performance of VU when
independent variables are partitioned based on their relation to specific types of fraud.
<Insert Table 1 Here>
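To make the OU and VU constructions concrete, the following is a minimal Python sketch under our own assumptions (list-based observations and variable names; the paper's implementation uses Weka). The function names `ou_subsets` and `vu_subsets` are ours, not the authors'.

```python
import random

def ou_subsets(fraud, non_fraud, n_subsets, fraud_share=0.20, seed=0):
    """Multi-subset Observation Undersampling (OU): each subset keeps all
    fraud observations plus a fresh random draw of non-fraud observations,
    sized so fraud observations make up `fraud_share` of the subset."""
    rng = random.Random(seed)
    n_non_fraud = round(len(fraud) * (1 - fraud_share) / fraud_share)
    return [fraud + rng.sample(non_fraud, n_non_fraud)
            for _ in range(n_subsets)]

def vu_subsets(variables, n_subsets, seed=0):
    """Multi-subset Variable Undersampling (VU): randomly divide the
    explanatory variables into `n_subsets` roughly equal subsets."""
    rng = random.Random(seed)
    shuffled = variables[:]
    rng.shuffle(shuffled)
    return [shuffled[i::n_subsets] for i in range(n_subsets)]
```

With 51 fraud observations and a 20 percent target fraud share, each OU subset contains the 51 fraud observations plus 204 freshly sampled non-fraud observations.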
After selecting what appear to be robust implementations, we determine whether these
implementations outperform assorted benchmarks in predicting fraud. Because we introduce OU
to the fraud detection literature to reduce the imbalance between the number of fraud versus non-
fraud observations, we use simple undersampling as a benchmark (Perols 2011) for OU.9 This
benchmark randomly removes non-fraud observations from the model-building sample to
generate a more balanced sample. OU and the OU benchmarks use all variables (as independent
variable reduction is examined in the VU analysis) and are implemented using support vector
machines.
We introduce VU (and PVU) as an independent variable (data dimensionality) reduction
method that has the potential to improve the performance over currently used variable selection
methods. As a baseline we include a benchmark (the Dechow benchmark) that uses the
independent variables from model 2 in Dechow et al. (2011). We also use (i) a benchmark that
randomly selects variables and (ii) a benchmark that includes all variables (the all variables
benchmark) where data dimensionality is not reduced. The benchmark that randomly selects
variables performs better than both the Dechow benchmark and the all variables benchmark.10
Thus, we report our VU (and PVU) results using the benchmark that randomly selects variables.
VU, PVU, and their benchmarks use all observations (observation undersampling is examined in
the OU analysis) and are implemented using support vector machines.
9 We also used ‘no undersampling’ as an additional benchmark. However, because simple undersampling performs better than no undersampling by 7.3 percent, we adopted simple undersampling as the benchmark.
10 The Dechow benchmark performed 0.02 percent better than the all variables benchmark and the random variable selection benchmark performed 3.87 percent better than the Dechow benchmark.
10-fold cross-validation. Out-of-sample performance measures are generally preferred over in-sample performance measures because they provide a “more realistic measure of prediction performance than measures commonly used in economics” (Varian 2014: 7), and cross-validation is particularly useful. We use stratified 10-fold cross-validation, where 10 folds (i.e.,
subsamples of observations) are generated using random sampling without replacement. The 10
folds rotate between being used for training and testing the prediction models. In each rotation,
nine folds are used for training (i.e., model building) and one fold is used for testing (i.e., model
evaluation). For example, in the first round, subsets one through nine are used for training and
subset 10 is used for testing; in round two, subsets one through eight and subset 10 are used for
training, and subset nine is used for testing. By using stratified cross-validation, we ensure that
the ratio of fraud to non-fraud observations is kept consistent across all folds. With a total of 51
fraud firms in the sample, 45 or 46 fraud firms are used for model building and five or six fraud
firms are used for model evaluation in each cross-validation round. In our experiments, the OU
and VU methods are only applied to training data.
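The stratified fold construction can be sketched as follows (our own minimal Python; the paper relies on Weka's cross-validation support). Dealing observations round-robin within each class keeps the fraud ratio approximately equal across folds.

```python
import random

def stratified_folds(fraud_ids, non_fraud_ids, k=10, seed=0):
    """Split observations into k folds, sampling without replacement and
    keeping the fraud/non-fraud ratio (approximately) equal across folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for ids in (list(fraud_ids), list(non_fraud_ids)):
        rng.shuffle(ids)
        for i, obs in enumerate(ids):
            folds[i % k].append(obs)  # deal observations round-robin per class
    return folds
```

In round r, the folds other than r are concatenated into the training set and fold r is held out for testing; with 51 fraud firms, each fold receives five or six fraud observations, as noted above.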
Prediction performance metric. Following prior financial statement fraud research (e.g.,
Beneish 1997; Feroz, Kwon, Pastena, and Park 2000; Lin et al. 2003; Perols 2011), we use
expected cost of misclassification (ECM) as the preferred performance metric. ECM allows
researchers to vary two important parameters when evaluating a prediction model’s performance
on out-of-sample data: (i) estimated percentage of fraud firms in the population of interest and
(ii) estimated ratio of the cost of a false negative to the cost of a false positive in the population
of interest. Including both parameters is important in settings such as fraud prediction that are
characterized by relative rarity and uneven misclassification costs. Given specific classification
results, ECM is calculated as follows:
ECM = CFN × P(Fraud) × (nFN / nP) + CFP × P(Non-Fraud) × (nFP / nN) (1)
where CFP and CFN are estimates of the cost of false positive and false negative classifications,
respectively, deflated by the lower of CFP or CFN; P(Fraud) and P(Non-Fraud) are estimates of
prior probability of fraud and non-fraud, respectively; nFP and nFN are the number of false
positive and false negative classifications, respectively, on the cross-validation test data;11 and nP
and nN are the number of fraud and non-fraud observations, respectively, in the cross-validation
test data. Bayley and Taylor (2007) estimate that actual cost ratios (FN to FP cost) average
between 20:1 and 40:1, while Bell and Carcello (2000) estimate that approximately 0.6 percent
of all firm years represent detected fraud. Thus, in experiments that compare model
prediction performance at best estimates of prior fraud probability and cost ratios, we calculate
ECM at a cost ratio of 30:1 and a prior fraud probability of 0.6 percent (together with the
prediction models’ actual false positive and false negative rates). The goal of the prediction
models is to minimize ECM.
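Equation (1) maps directly into code. The sketch below (our own Python; the names are illustrative) also shows the optimal-threshold search described in footnote 11, which evaluates ECM at every unique predicted probability.

```python
def ecm(c_fn, c_fp, p_fraud, n_fn, n_fp, n_pos, n_neg):
    """Expected cost of misclassification, equation (1). Costs are deflated
    by the lower of the two costs, so the cheaper error type has cost 1."""
    deflator = min(c_fn, c_fp)
    c_fn, c_fp = c_fn / deflator, c_fp / deflator
    return (c_fn * p_fraud * n_fn / n_pos
            + c_fp * (1 - p_fraud) * n_fp / n_neg)

def best_threshold_ecm(probs, labels, c_fn, c_fp, p_fraud):
    """Evaluate ECM at every unique predicted probability as a candidate
    classification threshold and return the minimum (cf. footnote 11)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = float("inf")
    for t in sorted(set(probs)):
        n_fn = sum(1 for p, y in zip(probs, labels) if y == 1 and p < t)
        n_fp = sum(1 for p, y in zip(probs, labels) if y == 0 and p >= t)
        best = min(best, ecm(c_fn, c_fp, p_fraud, n_fn, n_fp, n_pos, n_neg))
    return best
```

For example, at the best-estimate parameters (cost ratio 30:1, prior fraud probability 0.6 percent), a hypothetical model that misses 1 of 5 frauds and flags 50 of 995 non-fraud firms yields ECM = 30 × 0.006 × (1/5) + 0.994 × (50/995) ≈ 0.086.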
Experimental procedure – OU example. The following example provides a summary step-by-step description of the experimental procedures using OU (see Figure 3). Apart from the ECM score calculations, which were performed using Visual Basic for Applications in Excel, the experimental procedures were implemented using the Knowledge Flow interface in Weka
(Witten and Frank 2005). Weka is a collection of machine learning algorithms that provide
support for the data mining process, including loading and transforming data, building prediction
models, and evaluating the performance of prediction models (Witten and Frank 2005). Weka is
primarily used to evaluate the out-of-sample performance of models rather than to evaluate the
statistical significance of independent variables. (Weka provides a number of algorithms that help select independent variables to include in models, but even then, the focus is not on evaluating the statistical significance of the independent variables.)
11 Following prior research (e.g., Beneish 1997; Feroz et al. 2000; Lin et al. 2003; Perols 2011), nFP and nFN are obtained using optimal fraud classification thresholds (i.e., probability cutoffs for classifying an observation as fraud or non-fraud) for each combination of prior fraud probability and cost ratio. These optima are established by examining ECM scores using all unique fraud probability predictions as potential thresholds.
1) The full sample is first separated into model-building data (a.k.a., training data) and model-
evaluation data (a.k.a., test data) using 10-fold cross-validation.
2) For each cross-validation round and OU implementation, the OU method is applied to the
training data (but not the test data, which is left intact) to partition the training data into OU
subsets. For example, in the first cross-validation round when evaluating the OU
implementation with 12 subsets, the OU method creates 12 subsets of the first training set.
3) A classification algorithm, which in our experiments is a support vector machine algorithm,
is used with each OU training subset generated in step 2 to build one prediction model for
each OU subset. For example, in OU with 12 subsets, a total of 12 prediction models are
generated.
4) The test set, which was not modified using the OU method, is applied to each of the
prediction models generated in step 3.
5) For each observation in the test set, the probability predictions from each prediction model
are combined by averaging the probability predictions. This method (averaging) has been
found to perform well compared to more complex combiner methods (Duin and Tax 2000).
After combining the probability predictions, each observation in the test set has a single
probability prediction representing the average prediction of all the prediction models
developed in step 3.
6) The probability predictions along with the class labels (i.e., fraud or non-fraud) are used to
calculate ECM scores. When calculating ECM scores, optimal fraud classification thresholds
(“cutoffs”) are first determined for each combination of prior fraud probability and cost ratio
by examining ECM scores at different classification threshold levels (Beneish 1997).
Optimal thresholds are then used to calculate ECM scores for each combination of prior
fraud probability and cost ratio for that specific test dataset.
7) The experimental procedure repeats steps two through six for each cross-validation round
and each OU implementation, e.g., OU with two subsets, OU with three subsets, etc. within
each cross-validation round.
8) After completing all ten rounds, each OU implementation has ten ECM scores (one for each
test set) for each prior fraud probability and cost ratio combination. Averages of the ten ECM
scores are then used to examine prediction performance of different OU implementations and
against the benchmarks at different prior fraud probability and cost ratio levels.
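Steps 3 through 5 can be sketched as follows. Because the paper's support vector machines are implemented in Weka, we substitute a deliberately simple centroid-distance scorer on a single numeric feature as a stand-in classifier; the subset-train-and-average structure is the point of the sketch, not the model.

```python
import statistics

def train_centroid_model(rows):
    """Stand-in for the paper's support vector machine: scores an observation
    by its distance to the fraud vs. non-fraud centroids of one OU subset.
    Each row is a (feature value, label) pair with label 1 = fraud."""
    fraud = [x for x, y in rows if y == 1]
    clean = [x for x, y in rows if y == 0]
    mu_f, mu_c = statistics.mean(fraud), statistics.mean(clean)
    def predict_proba(x):
        df, dc = abs(x - mu_f), abs(x - mu_c)
        return dc / (df + dc) if df + dc else 0.5  # nearer fraud centroid -> higher
    return predict_proba

def ou_predict(train_subsets, test_xs):
    """Steps 3-5: fit one model per OU subset, score the untouched test set
    with every model, and average the probability predictions."""
    models = [train_centroid_model(sub) for sub in train_subsets]
    return [statistics.mean(m(x) for m in models) for x in test_xs]
```

The averaged probabilities from `ou_predict` then feed the ECM calculation in step 6.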
<Insert Figure 3 Here>
IV. RESULTS
Main Results
Figures 4-6 summarize the performance results of different OU and VU implementations.
For each implementation, the results represent the average expected cost of misclassification
(ECM) from ten test folds. ECM is reported at the best estimates of (i) prior fraud probability,
i.e., 0.6 percent, and (ii) false negative to false positive cost ratios, i.e., 30:1. The results are
presented as the percentage difference in ECM between each OU and VU implementation and
their respective benchmarks.12 Given that each figure is plotted using a single benchmark that is
held constant across different implementations, we first use the figures to look for clear trends
that indicate how to implement OU and VU, respectively. We then compare the performance of
selected implementations to their respective benchmarks.
Multi-subset Observation Undersampling (OU) – Experiment 1. Figure 4 (with supporting
details in Table 2) presents the performance results of OU relative to the best performing OU
benchmark (i.e., simple undersampling) as the number of subsets in OU increases. Our results
indicate that the benefit provided by OU initially increases as additional subsets are used but
remains relatively constant after 10 subsets.
<Insert Figure 4 and Table 2 Here>
12 Reported p-values are based on pairwise t-tests using the average and standard deviation of ECM scores across the ten test folds and are one-tailed unless otherwise noted. Assumptions related to normality and independent observations are unlikely to be satisfied, and p-values are only included as an indication of the relation between the magnitude and the variance of the difference between each implementation and the respective benchmarks.
Figure 4 also includes the corresponding results from two sensitivity analyses, i.e., the
experiments in which the subsets are selected in a different order and the random selection of
non-fraud cases is repeated. The results across all three versions of experiment 1 are similar in
that each shows a performance benefit from using OU that initially increases in the number of
OU subsets, but starts to plateau after about 10 subsets. These results indicate that the marginal
performance benefit from adding subsets declines as new subsets become less and less likely to
contain information not already included in the prior subsets.
Taken together, these experiments indicate that OU provides performance benefits and that
the number of subsets to include in OU is relatively consistent in the fraud setting. In an attempt
to balance performance benefits (we want to include enough subsets to make sure that we have
reached the performance plateau) with analysis costs (given that we have reached the plateau, we
want to keep the number of subsets low since adding additional subsets increases processing
costs), we include 12 subsets in OU in subsequent experiments and label this configuration
OU(12). This configuration lowers the expected cost of misclassification in the primary analysis
by 10.8 percent (p = 0.006) relative to simple undersampling (the best performing OU
benchmark).
To better understand how OU(12) improves performance, we first note that untabulated
results indicate that simple undersampling improves performance over no undersampling by 7.3
percent. Thus, it appears that OU provides performance benefits by undersampling the data and
thereby changing the bias of the classifiers to focus more on the fraud cases. Second, as reported
in the main experiment, OU(12) improves the performance over simple undersampling by an
additional 10.8 percent. Thus, it appears that OU also provides performance benefits by
combining the predictions of multiple models and thereby reducing the risk of overfitting.
Multi-Subset Variable Undersampling (VU) – Experiment 2. Figure 5 presents the
performance of VU relative to the best performing VU benchmark (i.e., random selection of
explanatory variables) as the number of subsets in VU increases. As summarized in Table 1, we
examine two versions. The dashed line shows the results when the number of variables in each
subset remains constant per experimental round (Experiment 2a). The round dotted line shows
the results when all variables are included and divided evenly across the subsets in each
experimental round (Experiment 2b).
<Insert Figure 5 Here>
When the number of variables is kept constant in each subset (the dashed line), the performance of VU increases as additional variable subsets are included, plateaus at about 11 subsets, and then decreases at 19 subsets. However, even at the plateau (VU with 11 to 18
subsets), untabulated results show that the performance difference between VU and the
benchmark only approaches statistical significance (p = 0.125 on average). In addition, the
jagged line indicates that VU is sensitive to the usefulness of the individual explanatory variables
in each additional subset.
When all available variables are divided into the selected subsets (the round dotted line), VU
does not provide a performance benefit relative to the random variable selection benchmark.
This second VU experiment emphasizes the importance of how variables are grouped together.
Multi-Subset Variable Undersampling Partitioned on Fraud Types (PVU) – Experiment 3.
The VU results discussed above suggest that a more deliberate partitioning of variables may be
important. We earlier argued that fraud consists of multiple types (e.g., revenue vs. expense
fraud) and that it might be beneficial to partition the explanatory variables with this in mind. Our
results for PVU support this conjecture. More specifically, as shown in Figure 5, PVU lowers
the expected cost of misclassification by 9.6 percent (p = 0.019) relative to the best performing
VU benchmark.
To better understand why PVU improves performance over the benchmarks, we first note
that untabulated results indicate that the all variables benchmark (that uses all observations and
all variables) and the Dechow benchmark (that uses all observations and a subset of variables as
selected in Dechow et al. 2011) perform similarly (0.02 percent difference). Thus, it does not
appear that performance is improved by simply selecting a subset of the variables (and thereby
reducing data dimensionality). Given that untabulated results show that VU improves
performance relative to the all variables benchmark by 7.2 percent, it appears that segmentation
of the variables contributes to the performance improvement. Additionally, because PVU
performs 6.3 percent better than VU, it appears that how the variables are segmented matters.
Additional Analyses
Further validation using misstatement data. We use the observations in a material
misstatement dataset that is an expanded version (additional years) of the data used in Dechow et
al. (2011) to perform two additional analyses (we also use this dataset to examine the robustness
of OU and PVU to the use of other classification algorithms, including logistic regression and
bootstrapping – see Online Appendix B). This dataset is available from the Center for Financial
Reporting and Management at the University of California, Berkeley and includes the fraud
firms used in our primary dataset as well as additional material misstatement firms reported in
AAERs by the SEC.13 To evaluate predictive performance, we again use 10-fold cross-validation. Further, due to a lack of good estimates of prior probabilities and cost ratios for material misstatements, we use the area under the Receiver Operating Characteristic (ROC) curve, or simply AUC, as the performance metric.14
13 We exclude firms from the finance industry and, following Dechow et al. (2011), add all Compustat non-fraud firms in the same year and industry as the fraud firms. We include only the first fraud year, i.e., we do not include multiple years for each fraud firm, due to the potential bias introduced when including multiple fraud firm years. We also follow the procedure used in Dechow et al. (2011) to eliminate observations with missing values in one or more of the variables included in the Dechow benchmark. We use mean replacement to handle missing values in the remaining variables. We also perform the analyses reported in this section after eliminating all observations with one or more missing values. Before performing this elimination, we remove six variables with over 25 percent missing values: abnormal change in order backlog, allowance for
The first analysis provides further validation of out-of-sample prediction performance of the
proposed methods and compares OU(12) and PVU (implemented using the same variables as in
the main experiments) to the Dechow benchmark when using the observations in the material
misstatement data. We use the Dechow benchmark given that these data are based on Dechow et
al. (2011). This analysis also provides insight into the usefulness of OU and PVU in a slightly
different setting (material misstatements vs. fraud). Results in Table 3 Panel A suggest that
OU(12) and PVU continue to improve performance over the Dechow benchmark when using
material misstatement data – now by 26.0 (p < 0.001) and 16.9 (p = 0.004) percent, respectively.
The second analysis provides insight into (i) the usefulness of OU when used in combination
with a different set of independent variables (based on the “financial kernel” of Cecchini et al.
2010) and (ii) whether OU provides incremental predictive power when used in combination
with this kernel. Cecchini et al. (2010) based their financial kernel on 23 financial statement
variables commonly used to construct independent variables for fraud prediction models. The
financial kernel divides each of the 23 original variables by each of the others, both in the current year and in the prior year, and calculates changes in the ratios. The current and lagged ratios, as well as their changes, are then used to construct a dataset with 1,518 independent variables.
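A sketch of the kernel's feature construction, under our own naming assumptions (it glosses over the original's handling of zero or missing denominators): with 23 items, the 22 ratios per item in the current year, the prior year, and their changes give 23 × 22 × 3 = 1,518 features, matching the count above.

```python
def financial_kernel(current, prior):
    """Illustrative reconstruction of the Cecchini et al. (2010) kernel:
    for each ordered pair of distinct financial statement items, form the
    current-year ratio, the prior-year ratio, and the change between them.
    Assumes nonzero denominators."""
    feats = {}
    names = sorted(current)
    for a in names:
        for b in names:
            if a == b:
                continue
            cur = current[a] / current[b]
            pre = prior[a] / prior[b]
            feats[f"{a}/{b}"] = cur
            feats[f"{a}/{b}_lag"] = pre
            feats[f"{a}/{b}_chg"] = cur - pre
    return feats
```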
We use the same initial set of observations used in the previous analysis and recreate the
financial kernel following Cecchini et al. (2010). We also follow their procedures and exclude all observations with missing values. We then compare OU(12) implemented with the variables in the financial kernel to the Cecchini benchmark, which uses the financial kernel but does not undersample observations (both implementations use support vector machines). We do not attempt to implement PVU, as it is not clear how we would separate the 1,518 variables into different fraud types. Results in Table 3 (Panel B) indicate that OU(12) (AUC = 0.67) outperforms the Cecchini benchmark (AUC = 0.59) by 14.2 percent (p = 0.004).15
doubtful accounts, allowance for doubtful accounts to accounts receivable, allowance for doubtful accounts to net sales, expected return on pension plan assets, and change in expected return on pension plan assets.
14 While ECM is the preferred performance metric when prior probabilities and cost ratios are known, AUC is preferred over other performance measures in settings with unknown error costs and prior probabilities (Provost, Fawcett, and Kohavi 1998). AUC has become the de facto standard performance measure in machine learning research and has also been used in accounting research (e.g., Larcker and Zakolyukina 2012). AUC indicates how well the prediction model ranks observations in the evaluation dataset. An AUC of 0.5 is equivalent to a random rank order, while an AUC of 1 indicates a perfect ranking of the evaluation observations. See Larcker and Zakolyukina (2012) for a more complete description of AUC.
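The AUC used in these comparisons can be computed directly from predicted probabilities via the rank (Mann–Whitney) formulation; this is a generic sketch in Python, not code from the paper.

```python
def auc(probs, labels):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation:
    the probability that a randomly chosen fraud observation is ranked
    above a randomly chosen non-fraud observation (ties count one-half)."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect ranking returns 1.0 and a fully reversed ranking returns 0.0, consistent with the interpretation in footnote 14.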
Combining the Methods. We analyze whether combining OU(12) and PVU provides
additional performance benefits compared to OU(12), the best performing individual method.
Figure 6 plots the performance difference between OU(12) and PVU and the combination of
OU(12) and PVU at different cost ratios. The combination is generated by creating prediction
models using OU(12) and PVU separately and then averaging the predictions from the OU(12)
and PVU prediction models.16
In untabulated results, the combination of OU(12) and PVU does not perform significantly differently from OU(12) (p = 0.421) at best estimates of prior fraud probability and cost ratios. Thus, in typical fraud prediction research settings, we recommend using OU(12). However, the combination of OU(12) and PVU provides performance benefits over OU(12) at higher cost ratios and higher prior fraud probability levels (see Figure 6). Given that the combination of OU(12) and PVU either performs significantly better than or no differently from OU(12), we
recommend using the combination of the two methods if maximizing predictive ability is more
important than minimizing implementation costs. For example, when the SEC uses a prediction
15 When including all fraud firm years, OU performs 5.7 percent (p < 0.001) better than the Cecchini benchmark and both approaches have high AUC values (AUC = 0.863 and AUC = 0.816, respectively).
16 We also first create the OU(12) subsets and then apply PVU to these subsets, but this more integrated and complex combination does not improve performance further.
model to help decide which firms to investigate for potential fraud, the additional
implementation costs associated with using the combination are likely to be small relative to the
costs of misclassifying a non-fraud firm and using resources to investigate the firm (and even
more so relative to misclassifying a fraud firm and not detecting the fraud).
<Insert Figure 6 Here>
Impact of mislabeled firms on OU. Some firms that are labeled non-fraud may actually have
committed fraud. To assess the sensitivity of OU to the inclusion of mislabeled fraud firms, we
(1) manipulate the training data in each cross-validation round by using OU(12) to generate fraud
probability predictions for all observations in the training data and then remove all non-fraud
firms with high fraud probability predictions (we tried five different thresholds: 0.9, 0.8, 0.7, 0.6,
and 0.5) from the training data; (2) use the modified training data from step 1 as input into
OU(12); and (3) compare the results from step 2 to the original OU(12) implementation.
Untabulated results do not show any significant performance improvements over the original
OU(12) implementation. Compared to the original implementation, the change in AUC
(averaged across the ten test folds) when removing non-fraud firms with high fraud probabilities
is -0.08% (p = 0.809; two-tailed), 0.08% (p = 0.360; one-tailed), 0.12% (p = 0.337; one-tailed),
0.31% (p = 0.182; one-tailed), and 0.24% (p = 0.228; one-tailed) when using thresholds of 0.5,
0.6, 0.7, 0.8, and 0.9, respectively. Thus, it does not appear that the performance of OU is
sensitive to the inclusion of fraud firms mislabeled as non-fraud firms in the training data.
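The filtering step in (1) reduces to a one-line predicate. In this sketch, `predict_proba` stands in for the fraud probability predictions produced by OU(12); the identity scorer in the test below is purely hypothetical.

```python
def drop_suspect_non_fraud(train_rows, predict_proba, threshold):
    """Sensitivity check: score every training observation and drop non-fraud
    observations whose predicted fraud probability exceeds the threshold,
    on the suspicion that they may be mislabeled fraud firms.
    Each row is a (feature value, label) pair with label 1 = fraud."""
    return [(x, y) for x, y in train_rows
            if y == 1 or predict_proba(x) <= threshold]
```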
Using OU to explore robustness and incremental predictive performance of independent
variables. Fraud research often seeks to identify new explanatory variables. Traditionally, this
research uses the entire sample (i.e., all observations) or a single matched sample to evaluate the
significance of one or more independent variables that are hypothesized to be associated with the
dependent variable. However, the predictive performance benefits of OU reported earlier
suggest that classification algorithms (e.g., logistic regression) recognize different fraud patterns
when trained on different subsets of non-fraud firms. Thus, when evaluating explanatory
variables in hypothesis testing research, it may be important to consider the robustness of results
across different subsamples of the original data. Further, given that OU improves performance
over benchmarks, to conclude that a new independent variable provides utility in fraud
prediction, research should show that this variable provides incremental predictive performance
to prediction models implemented using OU.
As an example of how to use OU in hypothesis testing, we perform an analysis that examines
the significance of sales to employees given a set of control variables selected based on prior
research (the control variables in this example were selected using stepwise backward variable
selection). Traditionally, the hypothesis would be tested using all observations in the sample,
i.e., the full sample. We compare the traditional hypothesis-testing results (full sample) to results
from implementing OU. More specifically, OU is first used to partition the full sample into 12
subsamples. Each subsample is then used to fit a logistic regression model, resulting in 12
different models. Independent variable coefficient estimates and p-values from the 12 models
are then summarized. The example uses data from the additional analyses that examine
misstatement data.
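The partition-fit-summarize pattern can be sketched as follows. We use a minimal one-variable logistic regression fit by gradient ascent as a stand-in for a statistics package (which would also supply the standard errors and p-values summarized in Table 4); the function names and data representation are our own assumptions.

```python
import math
import statistics

def fit_logistic(xs, ys, steps=2000, lr=0.5):
    """Minimal one-variable logistic regression fit by gradient ascent on the
    log-likelihood; returns the slope coefficient. A stand-in for a full
    statistics package, which would also supply p-values."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += y - p
            g1 += (y - p) * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b1

def coefficient_across_subsamples(subsamples):
    """Fit one model per OU subsample and summarize the coefficient of
    interest across the resulting models (here: median and range)."""
    coefs = [fit_logistic([x for x, _ in s], [y for _, y in s])
             for s in subsamples]
    return statistics.median(coefs), min(coefs), max(coefs)
```

A coefficient whose sign or significance varies materially across the subsamples would flag the kind of fragility the sales-to-employees example exhibits.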
The results for the full sample in Table 4 indicate that the hypothesis is supported (p =
0.012). However, the OU subsample analysis indicates that this result might not be robust. For
example, the average p-value of all sales to employees estimates across the 12 models obtained
using different sub-samples is p = 0.180 and the p-value is above 0.05 in four of the 12 models.
The results for the control variables further suggest that, while OU yields similar results to the
traditional hypothesis testing analysis in that most significant variables in the traditional
approach tend to be the most significant in the OU analysis, the OU results are generally more
conservative. For example, in only two cases are the median p-values from the OU results
numerically smaller (more significant) than the corresponding parametric result. For 12 of 17
variables, the median p-values are numerically larger (less significant) than their parametric
counterparts. Thus, we encourage future research to consider applying OU as a robustness check
for hypothesis testing.17
<Insert Table 4 Here>
Online Appendix C provides an example of how to use OU to evaluate the incremental
predictive performance of new independent variables. The example explains how to use OU in
combination with out-of-sample data and includes example SAS code.
V. DISCUSSION AND FUTURE RESEARCH
Financial statement fraud is a costly problem that has far reaching negative consequences.
Hence, the accounting literature investigates a wide range of explanatory variables and various
classification algorithms that contribute to more accurate prediction of fraud and material
misstatements. However, the rarity of fraud data, the relative abundance of variables identified in
prior literature, and the broad definition of fraud create challenges in specifying effective
prediction models.
Research in the emerging field of data analytics has been applied successfully in other
settings constrained by data rarity, such as predicting credit card fraud (Chan and Stolfo 1998).
We, therefore, follow the call of Varian (2014) to apply recent advances in data analytics in other
17 In untabulated results, we repeat the analysis using bootstrapping. More specifically, the full sample is used to generate 1,000 bootstrap subsamples (each sample contained observations selected randomly with replacement). Each bootstrap subsample is then used to fit a logistic regression model from which 2.5 and 97.5 percentiles of independent variable coefficient estimates are obtained. The bootstrapping results are similar to the OU results in that they are also generally more conservative.
settings and investigate the ability of data preprocessing methods drawn from data analytics to
improve fraud prediction. We first use Multi-subset Observation Undersampling (OU) to
investigate undersampling of non-fraud observations to establish a more effective balance with
scarce fraud observations. When used with 12 subsamples, this method improves fraud
prediction by lowering the expected cost of misclassification by more than ten percent relative to
the best performing benchmark. This method is also both efficient and relatively easy to
implement. Second, we use Multi-subset Variable Undersampling (VU) to investigate
undersampling of explanatory variables to put them more in balance with scarce fraud
observations. Fraud prediction improves in select situations when we randomly undersample
explanatory variables into different subsets. However, it does not do so reliably. When we
instead implement Multi-subset Variable Undersampling by partitioning variables into subsets
based on the type of fraud they are likely to predict (PVU), the expected cost of misclassification
is reduced by 9.6 percent relative to the best performing VU benchmark.
Our research makes multiple contributions to the prior literature. First, we identify and
directly address financial statement fraud data rarity problems by systematically evaluating
multiple data preprocessing methods that we believe are new to the accounting literature. Based
on our experiments, we conclude that OU and PVU each produce economically and statistically
significant reductions in the expected cost of misclassification of about ten percent. This
compares to, for example, a 0.9 percent prediction performance advantage when, following
Dechow et al. (2011), two additional significant independent variables are added to their initial
model. The introduction and evaluation of these methods directly contributes to research that
focuses on improving fraud prediction. Beneish (1997) and Dechow et al. (2011), among others,
create fraud prediction models that can be used to indicate the likelihood that a company has
committed financial statement fraud. Our methods can be used to improve the quality of such
fraud predictions. We also directly extend research that examines the usefulness of data
analytics methods in fraud prediction (e.g., Cecchini et al. 2010; Perols 2011; Larcker and
Zakolyukina 2012; and Whiting et al. 2012). Future research attempting to improve fraud
prediction using data analytics methods can also examine other problems related to rarity, such
as noisy data that potentially have more significant negative effects on rare cases (Weiss 2004).
Second, by showing that performance benefits can be gained by (i) addressing data rarity
problems in fraud detection and (ii) partitioning financial statement fraud into different fraud
types, our results provide an indication of the potential benefits that may result from addressing
similar problems in other settings. For example, bankruptcy, financial statement restatements,
material weaknesses in internal control over financial reporting, and audit qualifications are also
rare events in both absolute and relative terms.
Third, our research has implications for research that focuses on designing new explanatory
variables and developing parsimonious prediction models (e.g., Dechow et al. 2011; and
Markelevich and Rosner 2013). Our findings suggest that classification algorithms recognize
different fraud patterns when trained on different subsets of non-fraud firms. Thus, even if an
explanatory variable is deemed significant in one subsample, it is valuable to show that it is also
significant in other subsamples. Example techniques include OU, bootstrapping, and a robustness
measure proposed by Athey and Imbens (2015) that creates subsamples based on values of the
independent variables in the model. While we perform additional analyses that suggest that OU
(i) performs better than bootstrapping in predictive modeling and (ii) can be used to evaluate the
robustness of explanatory models, future research is needed to provide more definitive
recommendations about which method(s) to use for hypothesis testing. Future research can also
examine the use of OU in conjunction with propensity score matching. For example, future
research can (1) examine whether OU can be used to generate more robust propensity scores or
(2) use OU to evaluate the robustness of propensity score matching results by first using OU to
generate different subsets and then applying the propensity score matching procedure within each
subset. Further, research that concludes that a new explanatory variable provides incremental
predictive power should consider showing that the variable provides incremental predictive value
to models implemented using our methods.
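The subsample-stability check described above can be sketched in code. The following is a minimal illustration, not the paper's implementation: it estimates a simple linear-probability slope (pure NumPy) as a stand-in for a logit coefficient on each OU-style subsample, so that a stable sign and magnitude across subsamples can be inspected. The function name and interface are our own.

```python
import numpy as np

def subsample_coefficients(x, y, n_subsets=12, seed=0):
    """Estimate a linear-probability slope for x on each OU-style subsample.

    All fraud observations (y == 1) are paired with disjoint random subsets of
    the non-fraud observations (y == 0). A coefficient that keeps its sign and
    rough magnitude across subsamples is less likely to be an artifact of one
    particular non-fraud comparison group.
    """
    rng = np.random.default_rng(seed)
    fraud = np.flatnonzero(y == 1)
    nonfraud = rng.permutation(np.flatnonzero(y == 0))
    coefs = []
    for part in np.array_split(nonfraud, n_subsets):
        idx = np.concatenate([fraud, part])
        xs, ys = x[idx], y[idx]
        # OLS slope of y on x: cov(x, y) / var(x)
        slope = np.cov(xs, ys, bias=True)[0, 1] / np.var(xs)
        coefs.append(slope)
    return np.array(coefs)
```

A variable whose slope flips sign across the returned array would warrant caution before being added to a prediction model.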
Fourth, we contribute by following the call to consider different types of fraud
(Brazel et al. 2009). We partition financial statement fraud into types and show that this
reframing improves the performance of VU in fraud prediction. The importance of this finding
may extend beyond VU. Research that examines predictors of fraud could, similarly to Brazel et
al. (2009), design new explanatory variables to detect a specific type of fraud instead of fraud in
general. For example, fraud research could potentially develop variables that predict different
fraud types using different types of analyst forecasts (e.g., revenue vs. earnings) or different
types of debt covenants (e.g., leverage vs. interest coverage). Specifically, an independent
variable that indicates whether a firm uses a leverage (interest expense) debt covenant can in turn
be used in a prediction model that predicts liabilities (expense) fraud. Such reframing could
contribute to a better theoretical understanding of fraud and a more precise evaluation of
explanatory variables.
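As a sketch of this reframing, one could maintain a mapping from fraud types to the explanatory variables expected to predict each type and then build one model per type. The group and variable names below are purely illustrative assumptions, not the paper's actual PVU partitioning.

```python
# Hypothetical partition of explanatory variables by the fraud type they
# are expected to predict (illustrative names only).
PVU_GROUPS = {
    "revenue_fraud":   ["change_in_receivables", "pct_change_cash_sales"],
    "inventory_fraud": ["change_in_inventory", "inventory_to_sales"],
    "liability_fraud": ["leverage", "liabilities_to_interest_expense"],
}

def pvu_column_subsets(columns, groups=PVU_GROUPS):
    """Map each fraud-type group to the column positions present in `columns`."""
    pos = {name: i for i, name in enumerate(columns)}
    return {ftype: [pos[v] for v in vars_ if v in pos]
            for ftype, vars_ in groups.items()}
```

Each resulting column subset would then feed one type-specific prediction model, whose probabilities are combined as in VU.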
Finally, regulators and practitioners can potentially benefit from our findings. Regulators,
such as the SEC, are investing resources to develop better fraud risk models (Walter 2013; SEC
2015). Our findings may enhance their ability to identify firms that have committed fraud. This
is important because, due to resource constraints, the SEC has to focus investigations on a small
sample of firms, and improvements in financial statement fraud prediction models can be cost
effective in identifying potential fraud firms. The negative effects of financial statement fraud
on other stakeholders, such as employees, auditors, suppliers, customers, and lenders, can also
potentially be reduced. For example, auditors can use our methods to potentially improve fraud risk
assessment models that, in turn, can improve audit client portfolio management and audit
planning decisions. Given the significant costs and widespread effects of financial statement
fraud, improvements in fraud prediction models can have a substantial positive impact on
society.
REFERENCES
Abbasi, A., C. Albrecht, A. Vance, and J. Hansen. 2012. MetaFraud: A meta-learning framework for detecting financial fraud. MIS Quarterly. 36(4): 1293-1327.
Agarwal, R., and V. Dhar. 2014. Editorial—Big data, data science, and analytics: The opportunity and challenge for IS research. Information Systems Research. 25(3): 443-448.
Apostolou, B., J. Hassell, and S. Webber. 2000. Forensic expert classification of management fraud risk factors. Journal of Forensic Accounting. 1(2): 181-192.
Armstrong, C. S., D. F. Larcker, G. Ormazabal, and D. J. Taylor. 2013. The relation between equity incentives and misreporting: the role of risk-taking incentives. Journal of Financial Economics. 109(2): 327-350.
Association of Certified Fraud Examiners. 2014. Report to the nations on occupational fraud and abuse. Austin, TX.
Athey, S., and G. Imbens. 2015. A measure of robustness to misspecification. American Economic Review. 105(5): 476-80.
Bayley, L., and S. Taylor. 2007. Identifying earnings management: A financial statement analysis (red flag) approach. Proceedings of the American Accounting Association Annual Meeting, Chicago, IL.
Beasley, M. 1996. An empirical analysis of the relation between the board of director composition and financial statement fraud. The Accounting Review. 71(4): 443-465.
Bell, T., and J. Carcello. 2000. A decision aid for assessing the likelihood of fraudulent financial reporting. Auditing: A Journal of Practice & Theory. 19(1): 169-184.
Bellman, R. 1961. Adaptive control processes: A guided tour. Princeton, NJ: Princeton University Press.
Beneish, M. 1997. Detecting GAAP violation: Implications for assessing earnings management among firms with extreme financial performance. Journal of Accounting and Public Policy. 16(3): 271-309.
Beneish, M. 1999. Incentives and penalties related to earnings overstatements that violate GAAP. The Accounting Review. 74(4): 425-457.
Brazel, J. F., K. L. Jones, and M. F. Zimbelman. 2009. Using nonfinancial measures to assess fraud risk. Journal of Accounting Research. 47(5): 1135-1166.
Brown, B., M. Chui, and J. Manyika. 2011. Are you ready for the era of ‘big data’? McKinsey Quarterly. 4: 24-35.
Caskey, J., and M. Hanlon. 2013. Dividend policy at firms accused of accounting fraud. Contemporary Accounting Research. 30(2): 818-850.
Cecchini, M., G. Koehler, H. Aytug, and P. Pathak. 2010. Detecting management fraud in public companies. Management Science. 56(7): 1146-1160.
Chan, P., and S. Stolfo. 1998. Toward scalable learning with non-uniform class and cost distributions: A case study in credit card fraud detection. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, New York, NY.
Chawla, N. V., K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. 2002. SMOTE: Synthetic Minority Oversampling Technique. Journal of Artificial Intelligence Research. 16: 321-357.
Chen, H., R. H. Chiang, and V. C. Storey. 2012. Business intelligence and analytics: From big data to big impact. MIS Quarterly. 36(4): 1165-1188.
Dechow, P. M., R. G. Sloan, and A. P. Sweeney. 1996. Causes and consequences of earnings manipulation: An analysis of firms subject to enforcement actions by the SEC. Contemporary Accounting Research. 13(1): 1-36.
Dechow, P. M., W. Ge, C. R. Larson, and R. G. Sloan. 2011. Predicting material accounting misstatements. Contemporary Accounting Research. 28(1): 17-82.
Duin, P. W. R., and M. J. D. Tax. 2000. Experiments with classifier combining rules. Proceedings of the International Workshop on Multiple Classifier Systems.
Erickson, M., M. Hanlon, and E. L. Maydew. 2006. Is there a link between executive equity incentives and accounting fraud? Journal of Accounting Research. 44(1): 113-143.
Ettredge, M. L., L. Sun, P. Lee, and A. A. Anandarajan. 2008. Is earnings fraud associated with high deferred tax and/or book minus tax levels? Auditing: A Journal of Practice & Theory. 27(1): 1-33.
Fanning, K., and K. Cogger. 1998. Neural network detection of management fraud using published financial data. International Journal of Intelligent Systems in Accounting, Finance and Management. 7(1): 21-41.
Feng, M., W. Ge, S. Luo, and T. Shevlin. 2011. Why do CFOs become involved in material accounting manipulations? Journal of Accounting and Economics. 51(1): 21-36.
Feroz, E., T. Kwon, V. Pastena, and K. Park. 2000. The efficacy of red-flags in predicting the sec's targets: An artificial neural networks approach. International Journal of Intelligent Systems in Accounting, Finance & Management. 9(3): 145-157.
Galar, M., A. Fernández, E. Barrenechea, H. Bustince, and F. Herrera. 2012. A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews. 42(4): 463-484.
Glancy, F. H., and S. B. Yadav. 2011. A computational model for financial reporting fraud detection. Decision Support Systems. 50(3): 595-601.
Goel, S., and J. Gangolly. 2012. Beyond the numbers: Mining the annual reports for hidden cues indicative of financial statement fraud. Intelligent Systems in Accounting, Finance and Management. 19(2): 75-89.
Green, B. P., and J. H. Choi. 1997. Assessing the risk of management fraud through neural network technology. Auditing: A Journal of Practice & Theory. 16(1): 14-28.
Gupta, R., and N. S. Gill. 2012. A solution for preventing fraudulent financial reporting using descriptive data mining techniques. International Journal of Computer Applications. 58(1): 22-28.
Humpherys, S. L., K. C. Moffitt, M. B. Burns, J. K. Burgoon, and W. F. Felix. 2011. Identification of fraudulent financial statements using linguistic credibility analysis. Decision Support Systems. 50(3): 585-594.
Jones, K. L., G. V. Krishnan, and K. D. Melendrez. 2008. Do models of discretionary accruals detect actual cases of fraudulent and restated earnings? An empirical analysis. Contemporary Accounting Research. 25(2): 499-531.
Kaminski, K., S. Wetzel, and L. Guan. 2004. Can financial ratios detect fraudulent financial reporting? Managerial Auditing Journal. 19(1): 15-28.
Kittler, J., M. Hatef, R.P.W. Duin, and J. Matas. 1998. On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence. 20(3): 226-239.
Larcker, D. F., and A. A. Zakolyukina. 2012. Detecting deceptive discussions in conference calls. Journal of Accounting Research. 50(2): 495-540.
LaValle, S., E. Lesser, R. Shockley, M. S. Hopkins, and N. Kruschwitz. 2011. Big data, analytics and the path from insights to value. MIT Sloan Management Review. 52(2): 21-32.
Lee, T. A., R. W. Ingram, and T. P. Howard. 1999. The difference between earnings and operating cash flow as an indicator of financial reporting fraud. Contemporary Accounting Research. 16(4): 749-786.
Lennox, C., and J. A. Pittman. 2010. Big five audits and accounting fraud. Contemporary Accounting Research, 27(1): 209-247.
Lin, J., M. Hwang, and J. Becker. 2003. A fuzzy neural network for assessing the risk of fraudulent financial reporting. Managerial Auditing Journal. 18(8): 657-665.
Loebbecke, J. K., M. M. Eining, and J. J. Willingham. 1989. Auditors’ experience with material irregularities: Frequency, nature, and detectability. Auditing: A Journal of Practice and Theory. 9(1): 1-28.
Maloof, M. 2003. Learning when data sets are imbalanced and when costs are unequal and unknown. Proceedings of the Twentieth International Conference on Machine Learning, Washington, DC.
Markelevich, A., and R. L. Rosner. 2013. Auditor fees and fraud firms. Contemporary Accounting Research. 30(4): 1590-1625.
Nguyen, H. M., E. W. Cooper, and K. Kamei. 2012. A comparative study on sampling techniques for handling class imbalance in streaming data. Proceedings of the Soft Computing and Intelligent Systems (SCIS) and 13th International Symposium on Advanced Intelligent Systems (ISIS). 1762-1767.
Perols, J. 2011. Financial statement fraud detection: An analysis of statistical and machine learning algorithms. Auditing: A Journal of Practice & Theory. 30(2): 19-50.
Perols, J. L., and B. A. Lougee. 2011. The relation between earnings management and financial statement fraud. Advances in Accounting. 27(1): 39-53.
Phua, C., D. Alahakoon, and V. Lee. 2004. Minority report in fraud detection: Classification of skewed data. SIGKDD Explorations. 6(1): 50-59.
Price III, R. A., N. Y. Sharp, and D. A. Wood. 2011. Detecting and predicting accounting irregularities: A comparison of commercial and academic risk measures. Accounting Horizons. 25(4): 755-780.
Provost, F. J., T. Fawcett, and R. Kohavi. 1998. The case against accuracy estimation for comparing induction algorithms. Proceedings of the Fifteenth International Conference on Machine Learning, Madison, WI. 98: 445-453.
SEC. 2015. Examination Priorities for 2015. Retrieved from http://www.sec.gov/about/offices/ocie/national-examination-program-priorities-2015.pdf.
Sharma, V. 2004. Board of director characteristics, institutional ownership, and fraud: Evidence from Australia. Auditing: A Journal of Practice & Theory. 23(2): 105-117.
Shin, K. S., T. Lee, and H. J. Kim. 2005. An application of support vector machines in bankruptcy prediction models. Expert Systems with Applications. 28: 127-135.
Summers, S. L., and J. T. Sweeney. 1998. Fraudulently misstated financial statements and insider trading: An empirical analysis. The Accounting Review. 73(1): 131-146.
Varian, H. R. 2014. Big data: New tricks for econometrics. The Journal of Economic Perspectives. 28(2): 3-27.
Walter, E. 2013. Harnessing Tomorrow's Technology for Today's Investors and Markets. Speech presented at American University School of Law, Washington, D.C. (February).
Weiss, G. 2004. Mining with rarity: A unifying framework. ACM SIGKDD Explorations Newsletter. 6(1): 7-19.
Whiting, D. G., J. V. Hansen, J. B. McDonald, C. Albrecht, and W. S. Albrecht. 2012. Machine learning methods for detecting patterns of management fraud. Computational Intelligence. 28(4): 505-527.
Witten, I.H., and E. Frank. 2005. Data Mining: Practical machine learning tools and techniques. 2nd edition. San Francisco, CA: Morgan Kaufmann Publishers.
Yang, Q., and X. Wu. 2006. 10 challenging problems in data mining research. International Journal of Information Technology & Decision Making. 5(4): 597-604.
APPENDIX A: Definitions of explanatory variables(a)

Panel A: Variables from Dechow et al. (2011)

Variable: Definition(b)

Abnormal change in order backlog: (OB - OBt-1) / OBt-1 - (SALE - SALEt-1) / SALEt-1
Actual issuance: IF SSTK > 0 OR DLTIS > 0 THEN 1 ELSE 0
Book to market: CEQ / (CSHO * PRCC_F)
Change in expected return on pension plan assets: PPROR - PPRORt-1
Change in free cash flows: (IB - RSST Accruals) / Average total assets - (IBt-1 - RSST Accrualst-1) / Average total assetst-1
Change in inventory: (INVT - INVTt-1) / Average total assets
Change in operating lease activity: ((MRC1/1.1 + MRC2/1.1^2 + MRC3/1.1^3 + MRC4/1.1^4 + MRC5/1.1^5) - (MRC1t-1/1.1 + MRC2t-1/1.1^2 + MRC3t-1/1.1^3 + MRC4t-1/1.1^4 + MRC5t-1/1.1^5)) / Average total assets
Change in receivables: (RECT - RECTt-1) / Average total assets
Change in return on assets: IB / Average total assets - IBt-1 / Average total assetst-1
Deferred tax expense: TXDI / ATt-1
Demand for financing (ex ante): IF (OANCF - (CAPXt-3 + CAPXt-2 + CAPXt-1) / 3) / ACT < -0.5 THEN 1 ELSE 0
Earnings to price: IB / (CSHO * PRCC_F)
Existence of operating leases: IF MRC1 > 0 OR MRC2 > 0 OR MRC3 > 0 OR MRC4 > 0 OR MRC5 > 0 THEN 1 ELSE 0
Expected return on pension plan assets: PPROR
Level of finance raised: FINCF / Average total assets
Leverage: DLTT / AT
Percentage change in cash margin: ((1 - (COGS + (INVT - INVTt-1)) / (SALE - (RECT - RECTt-1))) - (1 - (COGSt-1 + (INVTt-1 - INVTt-2)) / (SALEt-1 - (RECTt-1 - RECTt-2)))) / (1 - (COGSt-1 + (INVTt-1 - INVTt-2)) / (SALEt-1 - (RECTt-1 - RECTt-2)))
Percentage change in cash sales: ((SALE - (RECT - RECTt-1)) - (SALEt-1 - (RECTt-1 - RECTt-2))) / (SALEt-1 - (RECTt-1 - RECTt-2))
RSST accruals: (ΔWC + ΔNCO + ΔFIN) / Average total assets, where WC = (ACT - CHE) - (LCT - DLC); NCO = (AT - ACT - IVAO) - (LT - LCT - DLTT); FIN = (IVST + IVAO) - (DLTT + DLC + PSTK)
Soft assets: (AT - PPENT - CHE) / Average total assets
Unexpected employee productivity(c): (SALE/EMP - SALEt-1/EMPt-1) / (SALEt-1/EMPt-1) - INDUSTRY((SALE/EMP - SALEt-1/EMPt-1) / (SALEt-1/EMPt-1))
WC accruals: (((ACT - ACTt-1) - (CHE - CHEt-1)) - ((LCT - LCTt-1) - (DLC - DLCt-1) - (TXP - TXPt-1)) - DP) / Average total assets
Panel B: Variables from Perols (2011)

Variable: Definition(b)

Accounts receivable to sales: RECT / SALE
Accounts receivable to total assets: RECT / AT
Allowance for doubtful accounts: RECD
Allowance for doubtful accounts to accounts receivable: RECD / RECT
Allowance for doubtful accounts to net sales: RECD / SALE
Altman Z score: 3.3 * (IB + XINT + TXT) / AT + 0.999 * SALE / AT + 0.6 * CSHO * PRCC_F / LT + 1.2 * WCAP / AT + 1.4 * RE / AT
Big four auditor: IF 0 < AU < 9 THEN 1 ELSE 0
Current minus prior year inventory to sales: INVT / SALE - INVTt-1 / SALEt-1
Days in receivables index: (RECT / SALE) / (RECTt-1 / SALEt-1)
Debt to equity: LT / CEQ
Declining cash sales dummy: IF SALE - (RECT - RECTt-1) < SALEt-1 - (RECTt-1 - RECTt-2) THEN 1 ELSE 0
Fixed assets to total assets: PPEGT / AT
Four year geometric sales growth rate: (SALE / SALEt-3)^(1/4) - 1
Gross margin: (SALE - COGS) / SALE
Holding period return in the violation period: (PRCC_F - PRCC_Ft-1) / PRCC_Ft-1
Industry ROE minus firm ROE: NIindustry / CEQindustry - NI / CEQ
Inventory to sales: INVT / SALE
Net sales: SALE
Positive accruals dummy: IF (IB - OANCF) > 0 AND (IBt-1 - OANCFt-1) > 0 THEN 1 ELSE 0
Prior year ROA to total assets current year: (NIt-1 / ATt-1) / AT
Property plant and equipment to total assets: PPENT / AT
Sales to total assets: SALE / AT
The number of auditor turnovers: (IF AU <> AUt-1 THEN 1 ELSE 0) + (IF AUt-1 <> AUt-2 THEN 1 ELSE 0) + (IF AUt-2 <> AUt-3 THEN 1 ELSE 0)
Times interest earned: (IB + XINT + TXT) / XINT
Total accruals to total assets: (IB - OANCF) / AT
Total debt to total assets: LT / AT
Total discretionary accrual: RSST Accrualst-1 + RSST Accrualst-2 + RSST Accrualst-3
Value of issued securities to market value: IF CSHI > 0 THEN CSHI * PRCC_F / (CSHO * PRCC_F) ELSE IF (CSHO - CSHOt-1) > 0 THEN ((CSHO - CSHOt-1) * PRCC_F) / (CSHO * PRCC_F) ELSE 0
Whether accounts receivable > 1.1 of last year's: IF (RECT / RECTt-1) > 1.1 THEN 1 ELSE 0
Whether firm was listed on AMEX: IF EXCHG = 5, 15, 16, 17, or 18 THEN 1 ELSE 0
Whether gross margin percent > 1.1 of last year's: IF ((SALE - COGS) / SALE) / ((SALEt-1 - COGSt-1) / SALEt-1) > 1.1 THEN 1 ELSE 0
Whether LIFO: IF INVVAL = 2 THEN 1 ELSE 0
Whether new securities were issued: IF (CSHO - CSHOt-1) > 0 OR CSHI > 0 THEN 1 ELSE 0
Whether SIC code between 2999 and 4000: IF 2999 < SIC < 4000 THEN 1 ELSE 0
Panel C: Variables based on Cecchini et al. (2010)(d)

Variable: Definition(b)

Sales: SALE
Change in sales: SALE - SALEt-1
% Change in sales: (SALE - SALEt-1) / SALEt-1
Abnormal % change in sales: (SALE - SALEt-1) / SALEt-1 - INDUSTRY((SALE - SALEt-1) / SALEt-1)
Sales to assets: SALE / AT
Change in sales to assets: SALE / AT - SALEt-1 / ATt-1
% Change in sales to assets: (SALE / AT - SALEt-1 / ATt-1) / (SALEt-1 / ATt-1)
Abnormal % change in sales to assets: (SALE / AT - SALEt-1 / ATt-1) / (SALEt-1 / ATt-1) - INDUSTRY((SALE / AT - SALEt-1 / ATt-1) / (SALEt-1 / ATt-1))
Sales to employees: SALE / EMP
Change in sales to employees: SALE / EMP - SALEt-1 / EMPt-1
% Change in sales to employees: (SALE / EMP - SALEt-1 / EMPt-1) / (SALEt-1 / EMPt-1)
Sales to operating expenses: SALE / XOPR
Change in sales to operating expenses: SALE / XOPR - SALEt-1 / XOPRt-1
% Change in sales to operating expenses: (SALE / XOPR - SALEt-1 / XOPRt-1) / (SALEt-1 / XOPRt-1)
Abnormal % change in sales to operating expenses: (SALE / XOPR - SALEt-1 / XOPRt-1) / (SALEt-1 / XOPRt-1) - INDUSTRY((SALE / XOPR - SALEt-1 / XOPRt-1) / (SALEt-1 / XOPRt-1))
Return on assets: NI / AT
Change in return on assets: NI / AT - NIt-1 / ATt-1
% Change in return on assets: (NI / AT - NIt-1 / ATt-1) / (NIt-1 / ATt-1)
Abnormal % change in return on assets: (NI / AT - NIt-1 / ATt-1) / (NIt-1 / ATt-1) - INDUSTRY((NI / AT - NIt-1 / ATt-1) / (NIt-1 / ATt-1))
Return on equity: NI / CEQ
Change in return on equity: NI / CEQ - NIt-1 / CEQt-1
% Change in return on equity: (NI / CEQ - NIt-1 / CEQt-1) / (NIt-1 / CEQt-1)
Abnormal % change in return on equity: (NI / CEQ - NIt-1 / CEQt-1) / (NIt-1 / CEQt-1) - INDUSTRY((NI / CEQ - NIt-1 / CEQt-1) / (NIt-1 / CEQt-1))
Return on sales: NI / SALE
Change in return on sales: NI / SALE - NIt-1 / SALEt-1
% Change in return on sales: (NI / SALE - NIt-1 / SALEt-1) / (NIt-1 / SALEt-1)
Abnormal % change in return on sales: (NI / SALE - NIt-1 / SALEt-1) / (NIt-1 / SALEt-1) - INDUSTRY((NI / SALE - NIt-1 / SALEt-1) / (NIt-1 / SALEt-1))
Accounts payable to inventory: AP / INVT
Change in accounts payable to inventory: AP / INVT - APt-1 / INVTt-1
% Change in accounts payable to inventory: (AP / INVT - APt-1 / INVTt-1) / (APt-1 / INVTt-1)
Abnormal % change in accounts payable to inventory: (AP / INVT - APt-1 / INVTt-1) / (APt-1 / INVTt-1) - INDUSTRY((AP / INVT - APt-1 / INVTt-1) / (APt-1 / INVTt-1))
Liabilities: LT
Change in liabilities: LT - LTt-1
% Change in liabilities: (LT - LTt-1) / LTt-1
Abnormal % change in liabilities: (LT - LTt-1) / LTt-1 - INDUSTRY((LT - LTt-1) / LTt-1)
Liabilities to interest expenses: LT / XINT
Change in liabilities to interest expenses: LT / XINT - LTt-1 / XINTt-1
% Change in liabilities to interest expenses: (LT / XINT - LTt-1 / XINTt-1) / (LTt-1 / XINTt-1)
Abnormal % change in liabilities to interest expenses: (LT / XINT - LTt-1 / XINTt-1) / (LTt-1 / XINTt-1) - INDUSTRY((LT / XINT - LTt-1 / XINTt-1) / (LTt-1 / XINTt-1))
Assets: AT
Change in assets: AT - ATt-1
% Change in assets: (AT - ATt-1) / ATt-1
Abnormal % change in assets: (AT - ATt-1) / ATt-1 - INDUSTRY((AT - ATt-1) / ATt-1)
Assets to liabilities: AT / LT
Change in assets to liabilities: AT / LT - ATt-1 / LTt-1
% Change in assets to liabilities: (AT / LT - ATt-1 / LTt-1) / (ATt-1 / LTt-1)
Abnormal % change in assets to liabilities: (AT / LT - ATt-1 / LTt-1) / (ATt-1 / LTt-1) - INDUSTRY((AT / LT - ATt-1 / LTt-1) / (ATt-1 / LTt-1))
Expenses: XOPR
Change in expenses: XOPR - XOPRt-1
% Change in expenses: (XOPR - XOPRt-1) / XOPRt-1
Abnormal % change in expenses: (XOPR - XOPRt-1) / XOPRt-1 - INDUSTRY((XOPR - XOPRt-1) / XOPRt-1)
Notes: a The explanatory variables included represent a relatively comprehensive set of variables based on recent fraud and material misstatement literature (Cecchini et al. 2010; Dechow et al. 2011; Perols 2011). We include all variables from Perols (2011) and all variables from the final Dechow et al. (2011) model that can be calculated using Compustat data. Dechow et al. (2011) perform step-wise backward feature selection to derive more parsimonious material misstatement models. We use their second model, which is the most complete model in their study that only relies on Compustat data (they also include a model that requires market related data). This study predicts material misstatements using the following variables: RSST accruals, change in receivables, change in inventory, soft assets, percentage change in cash sales, change in return on assets, actual issuance of securities, abnormal change in employees, and existence of operating leases. The model in Cecchini et al. (2010) includes a total of 1,518 explanatory variables derived using 23 financial statement items. These items are divided by each other both in the current year and in the prior year and used to calculate changes in the ratios. Both current and lagged ratios as well as their changes are then used to construct a dataset with 1,518 independent variables. Rather than including all 1,518 variables in our study, we follow and extend the approach used in Cecchini et al. (2010) by including 48 variables measuring levels and changes in levels, percentage change in levels, and abnormal percentage change of commonly manipulated financial statement items and ratios. We examine a model with all 1,518 variables from Cecchini et al. (2010) in an additional analysis. 
b ACT is Current Assets - Total; AT is Assets - Total; AU is Auditor; CAPX is Capital Expenditures; CEQ is Common/Ordinary Equity - Total; CHE is Cash and Short-Term Investments; COGS is Cost of Goods Sold; CSHI is Common Shares Issued; CSHO is Common Shares Outstanding; DLC is Debt in Current Liabilities - Total; DLTIS is Long-Term Debt - Issuance; DLTT is Long-Term Debt - Total; DP is Depreciation and Amortization; EMP is Employees; EXCHG is Stock Exchange; FINCF is Financing Activities - Net Cash Flow; IB is Income Before Extraordinary Items; INVT is Inventories - Total; INVVAL is Inventory Valuation Method; IVAO is Investment and Advances - Other; IVST is Short-Term Investments - Total; LCT is Current Liabilities - Total; LT is Liabilities - Total; MRC1 is Rental Commitments - Minimum - 1st Year; MRC2 is Rental Commitments - Minimum - 2nd Year; MRC3 is Rental Commitments - Minimum - 3rd Year; MRC4 is Rental Commitments - Minimum - 4th Year; MRC5 is Rental Commitments - Minimum - 5th Year; NI is Net Income (Loss); OANCF is Operating Activities - Net Cash Flow; OB is Order Backlog; PPEGT is Property Plant and Equipment - Total (Gross); PPENT is Property Plant and Equipment - Total (Net); PPROR is Pension Plans - Anticipated Long-Term Rate of Return on Plan Assets; PRCC_F is Price Close - Annual - Fiscal Year; PSTK is Preferred/Preference Stock (Capital) - Total; RE is Retained Earnings; RECD is Receivables - Estimated Doubtful; RECT is Receivables - Total; SALE is Sales/Turnover (Net); SIC is SIC Code; SSTK is Sale of Common and Preferred Stock; TXDI is Income Taxes - Deferred; TXP is Income Taxes Payable; TXT is Income Taxes - Total; WCAP is Working Capital (Balance Sheet); XINT is Interest and Related Expense - Total; and XOPR is Operating Expense. We also include controls for year and industry (two-digit SIC code).
c Similar variable used in both Dechow et al. (2011) (abnormal change in employees) and Perols (2011) (unexpected employee productivity).
d Variable construction based on the Financial Kernel in Cecchini et al. (2010).
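As a worked example of applying these definitions, the sketch below computes "Change in receivables" from Panel A for a single firm's sorted annual series, assuming that average total assets means the mean of current-year and prior-year AT. The function name and array-based interface are our own.

```python
import numpy as np

def change_in_receivables(rect, at):
    """(RECT - RECT_{t-1}) / average total assets for one firm's sorted annual series."""
    rect = np.asarray(rect, dtype=float)
    at = np.asarray(at, dtype=float)
    avg_at = (at + np.roll(at, 1)) / 2.0      # mean of AT_t and AT_{t-1}
    out = (rect - np.roll(rect, 1)) / avg_at  # scaled year-over-year change
    out[0] = np.nan                           # no lagged value for the first year
    return out
```

The other scaled-change variables in Panels A through C follow the same pattern with different Compustat items.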
Figure 1 Multi-subset Observation Undersampling (OU)
Notes: Column 1 represents the raw data with the fraud observations stacked on top and non-fraud cases below. Column 1 also shows that model building and out-of-sample data are kept separated. Column 2 shows the data subsets that are created based on the OU method. All fraud data are used in each subset while the non-fraud data are under-sampled to address data rarity within each subset. Cumulatively across all subsets, all of the non-fraud data can be used, but a single non-fraud observation is only used in one subset. In column 3, a classification algorithm is used to build one prediction model per subset with the goal of accurately classifying firms into fraud or non-fraud cases. Each model is then applied out-of-sample and generates a fraud probability prediction for each observation in the out-of-sample data. In column 4, for each out-of-sample observation, the individual fraud prediction probabilities are then combined to arrive at an overall combined fraud probability prediction for each observation.
More formally, let M={f1, f2, f3,…, fk} be a set of k fraud observations f and let C={c1, c2, c3,…, cn} be a set of n non-fraud observations c, where M is the minority class, i.e., k < n. Note that the union of M and C, i.e., M U C, forms a set that contains k fraud and n non-fraud observations. To achieve a more balanced dataset, d non-fraud observations c are removed from the non-fraud set C, where 0 < d ≤ n - k. However, instead of deleting these removed non-fraud observations, OU segments the non-fraud observations into n / (n - d) or fewer subsets Ui that each contains n - d different non-fraud observations c, i.e., C={U1, U2, U3,…, Un/n-d}. Note that all subsets Ui contain mutually exclusive (disjoint) sets of non-fraud observations, Ui ∩ Uj = Ø for i ≠ j. OU then combines all fraud observations, i.e., the entire set M, with each Ui to create subsets Wi. OU thus creates up to n / (n - d) subsets Wi that contain all fraud observations f and n - d unique non-fraud observations c. Each subset Wi is then used to build a prediction model that is used to predict out-of-sample observations. In our experiments, OU is only used on the model building data and the model evaluation data is left intact. Finally, for each out-of-sample observation, the different prediction models’ probability predictions are averaged into an overall probability prediction for each observation.
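The formal description above can be sketched in code. The following is a minimal illustration in which the classification algorithm is supplied by the caller through `fit` and `predict_proba` hooks (the paper implements OU with support vector machines); the function names are our own.

```python
import numpy as np

def ou_subsets(fraud_idx, nonfraud_idx, n_subsets, rng):
    """Split non-fraud indices into disjoint subsets; pair each with all fraud cases."""
    nonfraud_idx = rng.permutation(nonfraud_idx)
    parts = np.array_split(nonfraud_idx, n_subsets)
    return [np.concatenate([fraud_idx, p]) for p in parts]

def ou_predict(fit, predict_proba, X_train, y_train, X_test, n_subsets=12, seed=0):
    """Build one model per OU subset and average the fraud probabilities out-of-sample."""
    rng = np.random.default_rng(seed)
    fraud_idx = np.flatnonzero(y_train == 1)
    nonfraud_idx = np.flatnonzero(y_train == 0)
    probs = []
    for idx in ou_subsets(fraud_idx, nonfraud_idx, n_subsets, rng):
        model = fit(X_train[idx], y_train[idx])      # one model per balanced subset
        probs.append(predict_proba(model, X_test))   # out-of-sample probabilities
    return np.mean(probs, axis=0)                    # combined prediction per firm
```

Each subset reuses the entire fraud set M while the non-fraud partitions remain mutually exclusive, matching the description above.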
Figure 2 Multi-subset Variable Undersampling (VU)
Notes: Column 1 represents the raw data that include all explanatory variables used to predict fraud. These explanatory variables are partitioned into different subsets represented by the vertical lines. Each subset contains a subset of the explanatory variables and all of the observations. Column 1 also shows that model building and out-of-sample data are kept separated. In column 2, a classification algorithm is used to build one prediction model per variable subset with the goal of classifying firms into fraud vs. non-fraud cases. Each prediction model is then applied out-of-sample to generate a fraud probability prediction for each observation in the out-of-sample data. In column 3, for each out-of-sample observation, the fraud prediction probabilities from the different prediction models are combined to arrive at an overall combined fraud prediction probability for each observation.
More formally, let W denote a dataset with m variables x, i.e., W={x1, x2, x3,…, xm}. VU reduces data dimensionality by randomly dividing the variables in W into q subsets X, where each X contains m/q variables, i.e., the following variable subsets are created by VU, X1={x1, x2, x3,…, xm/q}, X2={xm/q+1, xm/q+2, xm/q+3,…, x2m/q}, X3={x2m/q+1, x2m/q+2, x2m/q+3,…, x3m/q},…, Xq={xm-m/q+1, xm-m/q+2, xm-m/q+3,…, xm}. The subsets X are then used to build q prediction models. The prediction models are then (i) used to predict out-of-sample observations and (ii) for each out-of-sample observation, the prediction models’ probability predictions are combined into an overall prediction for each observation by taking an average of the individual probability predictions.
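A minimal sketch of the VU procedure above, again with caller-supplied `fit` and `predict_proba` hooks and our own function names:

```python
import numpy as np

def vu_subsets(n_vars, q, rng=None):
    """Partition variable indices 0..n_vars-1 into q subsets (shuffled if rng given)."""
    idx = np.arange(n_vars)
    if rng is not None:
        idx = rng.permutation(idx)
    return np.array_split(idx, q)

def vu_predict(fit, predict_proba, X_train, y_train, X_test, q=4, seed=0):
    """Build one model per variable subset and average the fraud probabilities."""
    subsets = vu_subsets(X_train.shape[1], q, np.random.default_rng(seed))
    probs = [predict_proba(fit(X_train[:, cols], y_train), X_test[:, cols])
             for cols in subsets]
    return np.mean(probs, axis=0)  # combined prediction per observation
```

PVU would replace the random partition in `vu_subsets` with groups of variables chosen by the fraud type they are expected to predict.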
Figure 3 Experimental Procedures: Multi-subset Observation Undersampling (OU) Example
Notes: The flowchart depicts 10-fold cross-validation starting from the raw data. For each cross-validation round n = {1, 2, 3,…, 10} and each OU implementation l = {1, 2, 3,…, 20}, OU creates l subsets from the round n training data, builds l prediction models, and uses the l models to predict the round n test data. For each observation in the round n test data, the l probability predictions are averaged into a combined prediction. Once all 20 OU implementations and all 10 cross-validation rounds are complete, ECM scores are calculated from the test data with combined predictions.
Figure 4 Multi-subset Observation Undersampling (OU) with Different Numbers of Subsets - Percentage Performance Improvement Relative to Benchmark
Notes: ECM is calculated using a 0.6 percent fraud probability and a 30:1 false negative to false positive cost ratio.
As discussed in the text, three versions of the experiment were conducted. Original order refers to the main OU experiment; new order refers to the analysis in which the OU subsets are selected in a different order; and new subsets refers to the analysis in which the random sampling of non-fraud cases is repeated using a different random draw.
The benchmark is simple undersampling (Perols 2011), which randomly removes non-fraud observations from the sample to generate a more balanced training sample. This benchmark performs better than a benchmark that includes all fraud and non-fraud observations. OU and the OU benchmarks use all variables (independent variable reduction is examined in the VU analysis) and are implemented using support vector machines.
[Line chart: ECM percentage improvement over the benchmark (0 to 15 percent) on the vertical axis against the number of OU subsets (1 to 20) on the horizontal axis, with separate lines for the original order, new order, and new subsets versions of the experiment.]
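The Expected Cost of Misclassification used to score these figures is not defined in this excerpt; a common formulation, which may differ in detail from the ECM of Perols (2011), weights the false negative and false positive rates by the prior fraud probability and the relative misclassification costs:

```python
def ecm(fn_rate, fp_rate, p_fraud=0.006, cost_ratio=30.0):
    """Expected Cost of Misclassification (one common formulation):
    ECM = P(fraud) * FN_rate * C_FN + P(non-fraud) * FP_rate * C_FP,
    with the false positive cost normalized to C_FP = 1 and
    C_FN / C_FP given by cost_ratio (30:1 in the figures above)."""
    return p_fraud * fn_rate * cost_ratio + (1.0 - p_fraud) * fp_rate
```

With a 0.6 percent fraud probability and a 30:1 cost ratio, missing a fraud is weighted far more heavily than a false alarm, which is why undersampling toward a more balanced training set can lower ECM.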
Figure 5 Multi-subset Variable Undersampling (VU) with Different Numbers of Subsets of Explanatory Variables and Partitioned VU (PVU) - Percentage Performance Improvement Relative to Benchmark
Notes: ECM is calculated using a 0.6 percent fraud probability and a 30:1 false negative to false positive cost ratio.
As discussed in the text, two versions of the VU experiment were conducted. The constant number of variables in each subset experiment (the dashed line) uses subsets that contain five or six variables in each subset; the all variables in each round experiment (the round dotted line) uses all variables in each experimental round by randomly dividing all 109 variables into different subsets (consequently, as the number of subsets increases, the number of variables in each subset decreases).
The all variables in each round experiment only manipulates the number of VU subsets in even increments.
PVU partitions the variables into subsets based on their relation to specific types of fraud (e.g., revenue vs. expense fraud) and this partition is done independently of how VU is implemented. Consequently, the performance of PVU does not change as the number of VU subsets changes.
The benchmark contains six randomly selected variables (from the 109 variables described in Appendix A) and is equivalent to the VU implementation with only one subset. This benchmark performed better than benchmarks implemented using (i) all the variables in the dataset and (ii) the variables selected in Dechow et al. (2011), i.e., RSST accruals, change in receivables, change in inventory, soft assets, percentage change in cash sales, change in return on assets, actual issuance of securities, abnormal change in employees, and existence of operating leases. VU, PVU, and the benchmarks use all fraud and non-fraud observations (observation undersampling is examined in the OU analysis) and are implemented using support vector machines.
[Line chart: ECM percentage improvement over the benchmark (-6 to 10 percent) on the vertical axis against the number of VU subsets (2 to 20) on the horizontal axis, with separate lines for VU with a constant number of variables in each subset, VU with all variables in each round, and PVU.]
Figure 6 Performance Comparison of OU(12), PVU, and the Combination of OU and PVU
Notes: OU is Multi-subset Observation Undersampling. OU(12) represents the best performing individual OU implementation.
PVU is Multi-subset Variable Undersampling partitioned on fraud type.
ECM is calculated assuming an evaluation fraud probability of 0.6 percent.
[Line chart: ECM percentage difference relative to OU(12) (-6 to 6 percent) on the vertical axis against the false negative to false positive cost ratio (1:1 to 100:1) on the horizontal axis, with separate lines for PVU and PVU + OU(12).]
TABLE 1 Summary of Experiments
Experiment 1. Evaluating the number of Multi-subset Observation Undersampling (OU) subsets
Description: We create OU subsets with 20 percent fraud cases in each subset.(b) Each subset includes all original fraud cases and a random sample of the original non-fraud cases selected without replacement. We then empirically examine the optimal number of subsets for implementing OU.(c) As sensitivity analyses, we repeat the experiment using the same subsets but selecting the subsets in a different order, and we re-perform the random selection procedure of non-fraud cases. OU and the benchmarks use all variables (data dimensionality reduction is examined in the VU analyses below).
Benchmark:(a) We use two benchmarks: (i) simple observation undersampling, i.e., OU with only one subset, as used in Perols (2011), and (ii) no undersampling.

Experiment 2a. Evaluating the number of Multi-subset Variable Undersampling (VU) subsets
Description: To evaluate how many VU subsets to use, we first randomly select variables from the 109 variables used in the prior literature and place these variables into 20 different subsets, so that each variable subset contains five or six variables. We then perform an experiment in which the subsets are randomly added one by one to VU.
Benchmark:(a) We use three benchmarks: (i) simple variable undersampling (i.e., VU with only one subset); (ii) a model that includes all variables; and (iii) model 2 in Dechow et al. (2011).

Experiment 2b. Evaluating the number of variables in each VU subset
Description: This experiment evaluates the performance of VU as we change the number of variables in each subset. We use all variables in each experimental round and randomly divide the variables into the different subsets. Thus, as the number of subsets increases, the number of variables in each subset decreases. For example, all variables are included in one set in the first experimental round, half the variables are included in each of two subsets in the second experimental round, etc. To reduce processing time, this experiment skips all odd-numbered rounds except the first.
Benchmark:(a) We use the same three benchmarks as the first VU experiment.

Experiment 3. Evaluating VU partitioned on fraud types (PVU)
Description: This experiment evaluates the performance of VU when partitioned on fraud types. Note that we do not examine the performance of different PVU implementations in this experiment, as the specific subsets included in PVU are driven by the partitioning rather than by an empirical evaluation.
Benchmark:(a) We compare the performance of PVU to the best performing benchmark in the VU experiments (i.e., simple variable undersampling).
TABLE 1 (continued) Summary of Experiments
Notes:
(a) Since we introduce OU to the fraud detection literature to reduce the imbalance between the number of fraud and the number of non-fraud observations, we use simple undersampling (Perols 2011) as a benchmark when evaluating the performance of OU. This benchmark randomly removes non-fraud observations from the sample to generate a more balanced training sample. We also use no undersampling as an additional benchmark; however, simple undersampling performs on average 7.3 percent better than no undersampling, and we consequently report only simple undersampling. OU and the OU benchmarks use all variables (data dimensionality reduction is examined in the VU analysis). VU is introduced as a data dimensionality reduction method that is argued to improve performance over currently used variable selection methods. As a baseline, we use a benchmark created using the variables included in Dechow et al. (2011) model 2 (the Dechow benchmark): RSST accruals, change in receivables, change in inventory, soft assets, percentage change in cash sales, change in return on assets, actual issuance of securities, abnormal change in employees, and existence of operating leases. We also use (i) a benchmark that randomly selects variables and (ii) a benchmark that includes all variables (the all variables benchmark), i.e., where data dimensionality is not reduced. The benchmark that randomly selects variables performs better than both the Dechow benchmark (by 3.9 percent) and the all variables benchmark (by 3.9 percent); thus, we report our results using the benchmark that randomly selects variables. VU and the VU benchmarks use all fraud and non-fraud observations (observation undersampling is examined in the OU analysis). Following recent fraud prediction research (e.g., Cecchini et al. 2010) and findings in Perols (2011), all prediction models are implemented using support vector machines. Sensitivity analyses reported in Online Appendix B examine other classification algorithms, including logistic regression and bootstrapping.
(b) Perols (2011) finds that a simple undersampling ratio of 20 percent provides relatively good performance in fraud prediction compared to other undersampling ratios.
(c) More specifically, we first create one subset and examine the performance of OU with this single subset. We then create a second subset and use it along with the previously created subset to evaluate the performance of OU with two subsets, and so on. Note that while it is possible to derive a total of 41 subsets following Chan and Stolfo’s (1998) approach, the addition of another OU subset is only valuable if the additional subset contains new information. We expect the marginal benefit of adding a subset to decrease as the total number of subsets in OU increases. Additionally, for each subset that is added, another prediction model has to be built, used for prediction, and combined with the other prediction models’ predictions, so there is a computational cost associated with increasing the number of subsets. Based on this, and on results indicating that the performance benefit tapers off around 12 subsets, we do not extend the experiment beyond 20 subsets.
TABLE 2 Multi-subset Observation Undersampling (OU) Performance(a) - Increasing the Number of Subsets
[Table body with subset ranges labeled "ECM Improving" and "Performance Plateau."]
Notes:
(a) Prediction performance is evaluated using 10-fold cross-validation in which separate datasets are used for model building vs. model evaluation. Performance is the average Expected Cost of Misclassification (ECM) across the ten test folds. ECM is measured at best estimates of prior fraud probability, i.e., 0.6 percent, and cost ratios, i.e., 30:1.
(b) Reported p-values are based on pairwise t-tests using the average and standard deviation in ECM scores across the ten test folds and are one-tailed unless otherwise noted. Assumptions related to normality and independent observations are unlikely to be satisfied and p-values are only included as an indication of the relation between the magnitude and the variance of the difference between each implementation and the respective benchmarks.
(c) The benchmark is simple undersampling (Perols 2011), which randomly removes non-fraud observations from the sample to generate a more balanced training sample. This benchmark performed better than a benchmark that included all fraud and non-fraud observations. OU and the OU benchmarks use all variables (independent variable reduction is examined in the VU analysis) and are implemented using support vector machines. (Other classification algorithms, including logistic regression and bootstrapping, are used in additional analyses reported in Online Appendix B.)
TABLE 3 Prediction Performance(a,b) of OU and PVU on a Material Misstatements Hold-Out Sample
Notes:
(a) Prediction performance is evaluated using 10-fold cross-validation in which separate datasets are used for model building vs. model evaluation. Performance is area under the ROC curve (AUC). AUC provides a numeric value of how well the prediction model ranks the observations in the test sets and represents the probability that a randomly selected positive (misstatement) instance is ranked higher than a randomly selected negative (non-misstatement) instance. An AUC of 0.5 is equivalent to a random rank order while an AUC of 1 is a perfect ranking of the evaluation observations.
(b) The results in Panel A compare the performance of OU and VU to the Dechow benchmark using material misstatement data (all methods and benchmarks are implemented using support vector machines; other classification algorithms, including logistic regression and bootstrapping, are used in additional analyses reported in Online Appendix B). This comparison provides further validation of the results reported earlier on fraud data and provides insight into the usefulness of the proposed methods in a slightly different setting. The results in Panel B compare the performance of the financial kernel from Cecchini et al. (2010) with and without OU (both implementations use support vector machines). This analysis provides insight into (i) the usefulness of OU when used in combination with a different set of independent variables (created using the financial kernel of Cecchini et al. (2010)) and (ii) whether OU provides incremental predictive power when used in combination with the financial kernel.
(c) In Panel A, given the source of the data, i.e., Dechow et al. (2011), and the nature of the material misstatement data, we use the Dechow et al. (2011) benchmark. This benchmark is based on model 2 from Dechow et al. (2011): material misstatement = RSST accruals + change in receivables + change in inventory + soft assets + percentage change in cash sales + change in return on assets + actual issuance of securities + abnormal change in employees + existence of operating leases. These independent variables were selected in Dechow et al. (2011) using a material misstatement sample that is similar to the sample used in this experiment. Because the entire sample was used when selecting these variables it is possible that the benchmark performance represents an overfitted model. In this experiment, OU uses all 107 variables, but under-samples the non-fraud observations using the OU method. PVU uses all observations, but partitions the original 107 variables based on fraud types.
(d) The financial kernel consists of 1,518 independent variables representing current and lagged ratios and changes in the ratios of 23 financial statement variables commonly used to construct independent variables in fraud research. In this experiment, OU is implemented using the same 1,518 independent variables and support vector machines. PVU is not implemented in this experiment, as it is not clear how to partition the 1,518 independent variables into different fraud categories.
(e) p-values are one-tailed based on pairwise t-tests using the average and standard deviation of ECM scores across the ten test folds. Assumptions related to normality and independent observations are unlikely to be satisfied and p-values are only included as an indication of the relation between the magnitude and the variance of the difference between each implementation and the benchmark.
TABLE 4 Hypothesis Testing: Results on Full Sample Logistic Regressions versus 12 OU Subsamples Logistic Regressions
Note: Average estimates, standard deviation estimates, and average p-values are based on estimates and p-values from the 12 OU subsample logistic regressions. P-values less than 0.0001 were converted to 0.0001 before taking the average.
c regression