What are you saying? Using topic to detect financial misreporting*
Nerissa C. Brown, Associate Professor, University of Illinois, [email protected]
Richard M. Crowley, Ph.D. Student, University of Illinois, [email protected]
W. Brooke Elliott, Associate Professor, University of Illinois, [email protected]
September 2015

Preliminary version. Please do not cite without the permission of the authors.
*We thank Andrew Bauer, Paul Demeré, Shawn Gordon, Kristina Rennekamp, and workshop participants at the University of Illinois, the 2015 AAA FARS Mid-Year Meeting, the 2015 AAA Annual Meeting, and the 2015 Conference on Convergence of Financial and Managerial Accounting Research for their helpful comments. We also thank Xiao Yu for insightful comments on methodology and coding, and Stephanie Grant, Jill Santore, and Jingpeng Zhu for excellent research assistance.
What are you saying? Using topic to detect financial misreporting
Abstract: Detection models of financial misreporting have evolved beyond basic quantitative or financial measures to include textual or linguistic characteristics of firms’ disclosures. While these textual analysis methods provide incremental power in identifying misreporting, they examine how content is being disclosed as opposed to what is being disclosed. This study introduces a novel fraud-detection measure, labeled “topic,” that quantifies the thematic content of financial statements. We derive our measure from a Bayesian topic modeling methodology called Latent Dirichlet Allocation (LDA). We then demonstrate the incremental predictive power of our topic measure in detecting intentional financial misreporting. We identify occurrences of financial misreporting using SEC enforcement actions (AAERs) and restatements arising from intentional misapplications of GAAP (i.e., irregularities). We find strong evidence that topic predicts intentional misreporting beyond financial and textual style characteristics. Furthermore, our results indicate that the detection power of financial metrics is subsumed by our topic measure in prediction models for both AAERs and restatements arising from irregularities.
Keywords: Topic, Disclosure, LDA, Financial Misreporting, Intentional Restatements
1 Introduction
Detection models of financial misreporting have long focused on firms’ financial characteristics and are often based on quantitative measures of extreme or abnormal performance (e.g., Beneish [1997]; Dechow, Ge, Larson & Sloan [2011]).1 One drawback of this approach is that financial misreporting can go undetected for long periods since many firms engage in earnings manipulation to blend in with either their peers or the firm’s own past performance [Lewis, 2013]. To address this weakness, recent studies have begun to examine linguistic and text-based measures as additional signals of financial misreporting. These signals include disclosure tone, sentiment, vocal dissonance, and the use of suspicious or deceptive words [Loughran & McDonald, 2011; Hobson, Mayew & Venkatachalam, 2011; Purda & Skillicorn, 2015; Larcker & Zakolyukina, 2012]. In practice, regulators, auditors, investors, and information intermediaries are employing textual analytic tools to reveal early warning signs of accounting fraud and misstatements. For example, the Securities and Exchange Commission (SEC) is currently incorporating basic lists of deceptive words and phrases from annual reports into its computer-powered fraud detection model, termed the Accounting Quality Model (AQM; Eaglesham [2013]). Auditors and accountants are also using textual analytic tools in assurance tasks such as fraud detection and compliance [Schneider, Dai, Janvrin, Ajayi & Raschke, 2015].2
While text analysis methods are quite promising in the detection of financial misreporting, many of the techniques used in prior research capture basic textual or linguistic characteristics of firms’ disclosures rather than their content. Examining the content of disclosures is important since firms engaging in accounting irregularities tend to make unusual content choices that are more difficult to classify as deceptive. As SEC officials note, firms that
1 Consistent with prior literature (e.g., Beasley [1996], Farber [2005], Dechow et al. [2011]), we use the terms misreporting, misstatement, manipulation, irregularity, and fraud interchangeably throughout the paper. While firms often do not admit to outright fraud, our data sources for identifying instances of intentional misreporting capture the more egregious cases on the accounting error-to-fraud continuum.
2 Furthermore, our discussions with data providers indicate the increased use of textual analytics to identify instances of accounting manipulations and other disclosure quality risks.
manipulate financial reporting rules tend to deflect attention from core problems by underreporting relevant risk factors compared to other firms [Lewis, 2013]. Further, prior research based on theories of deception provides evidence that the thematic content (or topic) of communication is typically chosen intentionally (e.g., Buller & Burgoon [1996]), suggesting that the topics contained in financial disclosures likely reflect the intentions of managers (as opposed to a manager’s own subconscious slant or optimistic or pessimistic beliefs). In line with these observations, our study introduces a textual analytic methodology that directly detects disclosure topics, i.e., what is being disclosed in annual 10-K filings (as opposed to how content is being disclosed). Using this unique measure (labeled “topic”), we evaluate the common types of topics discussed in the annual filings of misreporting firms and how these disclosure topics change over time. Lastly, we examine the incremental predictive power of our topic measure in detecting accounting misstatements relative to a comprehensive set of financial measures and textual style characteristics used in prior research.
To generate our topic measure, we employ a topic modeling technique developed by Blei, Ng & Jordan [2003], termed Latent Dirichlet Allocation (LDA). The LDA approach allows us to determine the proportion of each 10-K filing devoted to each topic detected by the algorithm. Topic modeling does not require preset definitions (referred to as dictionaries or word lists) or predetermined topic categories; instead, it relies on the basic observation that words frequently appearing together in text documents tend to be semantically related. Using unstructured cluster analysis, the LDA algorithm simply uses a set of 10-K filings to “learn” or generate the various topics discussed in firms’ annual reports in a given year. This method offers a unique advantage in that researchers are not required to know the topics commonly discussed in annual reports at a given point in time, and thus our own (preconceived) knowledge of the documents’ content does not bias our construction of the topic measure. Furthermore, the LDA method allows us to analyze the actual content of a large collection of financial statements, a task that would be infeasible for researchers to perform manually. This represents a significant step forward in the textual analysis literature
and goes beyond the basic “bag of words” or style analytics used in prior research.3
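To make the mechanics concrete, the sketch below implements a minimal collapsed Gibbs sampler for LDA in pure Python and recovers per-document topic proportions. This is our illustration only, not the authors' implementation (they likely used a standard LDA library); the priors, topic count, iteration budget, and toy corpus are all arbitrary choices for the example.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Minimal collapsed Gibbs sampler for LDA.
    docs: list of token lists. Returns per-document topic proportions."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    ndk = [[0] * n_topics for _ in docs]               # doc-topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                                # topic totals
    z = []                                             # topic assignment per token
    for d, doc in enumerate(docs):                     # random initialization
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zs)
    for _ in range(iters):                             # Gibbs sweeps
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                            # remove current assignment
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # full conditional P(z = k | everything else)
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                           for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    # per-document topic proportions (the basis of the paper's "topic" measure)
    return [[(ndk[d][t] + alpha) / (len(doc) + n_topics * alpha)
             for t in range(n_topics)] for d, doc in enumerate(docs)]

# toy corpus with two loose themes (revenue vs. litigation)
docs = [["revenue", "sales", "growth", "revenue", "sales"],
        ["lawsuit", "litigation", "settlement", "lawsuit"],
        ["revenue", "growth", "sales", "litigation"]]
theta = lda_gibbs(docs, n_topics=2)
```

Row `d` of `theta` gives the share of filing `d` attributed to each discovered topic; each row sums to one by construction.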
We conduct our fraud detection analyses using two sources for identifying the occurrence of accounting misstatements involving intentional GAAP violations. Our first source is a comprehensive sample of SEC Accounting and Auditing Enforcement Releases (AAERs) compiled by Dechow et al. [2011].4 These releases identify instances of formal SEC investigations of firms that manipulate their financial statements. Our second source of misstatements is an automated search for financial restatements arising from intentional misreporting (hereafter referred to as irregularity restatements), i.e., those restatements that involve intentional irregularities rather than unintentional misapplications of GAAP (errors). We identify restatements involving intentional irregularities based on the criteria discussed in Hennes, Leone & Miller [2008]. Specifically, we use an automated method to parse the text of amended 10-K filings for variants of the words “fraud” or “irregularity” in reference to accounting misstatements. We also search for references to restatements resulting from SEC or Department of Justice (DOJ) investigations, as well as references to other independent investigations.5
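The keyword screen described above can be sketched as a small regular-expression filter. This is our illustration, not the authors' code; the exact word variants and criteria (which follow Hennes, Leone & Miller [2008]) may differ in detail.

```python
import re

# Hypothetical sketch of the screen: variants of "fraud" or "irregularity",
# plus references to SEC/DOJ investigations within a short window of text.
MISREPORT_PAT = re.compile(
    r"\b(fraud(ulent(ly)?|s)?|irregularit(y|ies))\b"
    r"|\b(SEC|Securities and Exchange Commission|Department of Justice|DOJ)\b"
    r".{0,80}?\binvestigation\b",
    re.IGNORECASE | re.DOTALL)

def flags_intentional_misreporting(filing_text: str) -> bool:
    """True if the amended filing text matches the irregularity screen."""
    return MISREPORT_PAT.search(filing_text) is not None

print(flags_intentional_misreporting(
    "The restatement resulted from accounting irregularities."))  # True
print(flags_intentional_misreporting(
    "The restatement corrected a clerical error."))               # False
```

A production version would also need to confirm that matches occur in reference to the restatement itself, as the paper notes, rather than anywhere in the filing.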
While there is some overlap between our sample of AAER and financial restatement
firms, each data source has its unique advantages and disadvantages. As Dechow et al.
[2011] note, the AAER sample provides researchers with a high confidence level of fraud
detection since the SEC typically targets firms where there is strong evidence of accounting
manipulation. However, many misstatement events are likely to go undetected due to limited
SEC resources, and the investigated cases are likely to reflect the SEC’s selection criteria.
3 Although LDA is a word choice algorithm, it goes significantly beyond the naïve Bayes or simple “bag of words” approach used in prior research (e.g., word counts and rank-ordered word lists) by using the distribution of words across documents to discover actual content without the need for predefined or researcher-determined word lists. Our analyses control for the most commonly used word list measures, such as those based on the Loughran and McDonald and Harvard IV dictionaries.
4 We use the updated version of this dataset available from the Center for Financial Reporting and Management at UC Berkeley’s Haas School of Business.
5 We use an automated search to identify restatements since other data sources, such as Audit Analytics and the Government Accountability Office (GAO) database, provide less extensive coverage. For instance, restatement data is not available in Audit Analytics for periods prior to 2001, while the GAO database is limited to restatements announced from July 2002 to October 2006.
The irregularity restatement dataset spans a broader sample of accounting misstatements
and alleviates the concern of SEC-related selection biases. Nonetheless, the restatement
sample could be affected by changes in how firms disclose or discuss restatements within
their financial statements.
We build our fraud detection model by combining our unique topic measure with a comprehensive set of financial statement measures and textual style characteristics used in prior fraud research. Our financial statement variables follow closely from the Dechow et al. [2011] F-score model and include measures of accrual quality, earnings and cash flow performance, off-balance-sheet activities, and market-related incentives. Our text-based characteristics include measures of financial report readability, disclosure tone and emphasis, as well as basic lists of deceptive, litigious, and uncertain words constructed by Loughran & McDonald [2011].
Using out-of-sample tests of our AAER sample, we find that our topic measure provides significant incremental predictive power over commonly used financial metrics and textual style variables for detecting instances of accounting misstatement. In fact, a stand-alone model of disclosure topic is a better predictor of accounting fraud than models using financial or style characteristics, or models using both financial and style measures. We find similar results when we analyze our sample of irregularity restatements, i.e., topic adds significant incremental predictive power over financial and style variables. Interestingly, we find that a model including only topic and style (and excluding financial metrics) is most predictive of our irregularity restatements in out-of-sample tests. These results are robust to several sensitivity checks, including the use of alternative misreporting measures, regression specifications, and topic measure definitions. Our inferences also hold when we base our analyses solely on the topic content of the Management Discussion and Analysis (MD&A) section of firms’ 10-K filings.
Our study makes several important contributions to the literature. First, we extend prior research on financial misreporting by providing evidence that the topics discussed in firms’ 10-Ks are useful in identifying intentional misstatements above and beyond traditionally examined financial measures and style characteristics. Second, we expand the burgeoning research in accounting that examines the textual portion of corporate disclosures. Specifically, we exploit a robust textual analysis methodology, LDA, which directly quantifies what is being disclosed in 10-K filings (as opposed to how it is being disclosed). This thematic content analysis is adaptable to any type of financial disclosure of sufficient length. Further, since the topics that are indicative of intentional misreporting are likely to change over time given the fluidity of disclosures, our approach provides a significant improvement over a simple “bag of words” approach, where the list of deceptive words is fairly static and easily identifiable by firms. Lastly, our study has significant practical implications for regulators, investors, and practitioners, who continue to implement sophisticated initiatives aimed at detecting accounting violations. In particular, our results suggest that extracting information about what is being disclosed is a fruitful avenue for capturing high-risk accounting activity.
2 Background and Research Questions
2.1 Predicting Financial Misreporting
Over the past two decades, researchers have examined several different predictors of financial misreporting. Early work by Feroz, Park & Pastena [1991] and Beneish [1997, 1999] investigates the link between accounting misstatements and several measures of extreme or abnormal financial performance. In particular, Feroz et al. [1991] find that fraud firms identified from SEC enforcement actions (AAERs) have misstated receivables and inventory. Using a larger sample of fraud events gathered from both AAERs and the business press, Beneish [1997] finds that abnormal accruals, disproportionate increases in receivables, and poor abnormal market performance are significant predictors of financial misreporting. Beneish [1999] also finds that extreme firm performance based on indices of financial ratios is useful for detecting fraud.
Prior studies also find that stock and debt market pressures and firms’ internal monitoring mechanisms are strong predictors of accounting misstatements. For instance, Dechow, Sloan & Sweeney [1996] find that firms subject to SEC enforcement actions have lower free cash flow and higher leverage. In addition, these firms are more likely to violate debt covenants and tend to issue securities during the earnings manipulation period. Dechow et al. [1996] further find that fraud firms have weak internal governance mechanisms, proxied by insider dominance of the board of directors and the lack of an audit committee or outside blockholders. Beasley [1996], Beasley, Carcello, Hermanson & Lapides [2000], and Farber [2005] provide similar evidence of the association between governance mechanisms and the likelihood of fraud. Specifically, they find that fraud firms have lower percentages of outside board membership, less independent audit committees, fewer financial experts on the audit committee, fewer audit committee meetings, and lower-quality external audit firms. In sum, these results suggest that weak monitoring mechanisms provide managers with less internal oversight and, in turn, greater opportunities to engage in suspect accounting.
In a comprehensive study of AAERs, Dechow et al. [2011] investigate the fraud detection power of a battery of financial and nonfinancial measures.6 They find that poor accrual quality, increases in accrual components, declines in returns on assets, high stock returns, and abnormal reductions in the number of employees are strong predictors of accounting misstatements. They also find that fraud firms conduct aggressive off-balance-sheet and external financing transactions during misstatement periods. Using these variables, Dechow et al. [2011] develop a composite prediction score termed the F-score. They show that the F-score is a better predictor of both within-GAAP and aggressive accounting misstatements compared to traditional models of accrual management.
Recent research has started to explore the predictive value of language-based tools in
6 Dechow et al. [2011] do not examine corporate governance and incentive compensation variables because these variables are available for only limited samples. Our study follows the same approach to ensure that our results are generalizable to a wide set of firms.
detecting intentional financial misreporting. The basic premise of these studies is that the textual style or vocal qualities of management’s disclosures can be used as additional tools to identify accounting manipulations. At a basic word-choice level, Loughran & McDonald [2011] find that instances of fraud are associated with the use of negative, uncertain, and litigious words in annual reports. Larcker & Zakolyukina [2012] analyze the transcripts of conference calls and find that words related to deception are indicative of accounting misstatements and serve as better predictors than standard measures of discretionary accruals. Goel & Gangolly [2012] go a step further from word lists and examine linguistic qualities such as tone, tense, uncertainty, adverbs, and emphasis. They find significant differences in linguistic qualities across the full text of 10-Ks issued by firms that engage in financial irregularities versus those that do not. Cecchini, Aytug, Koehler & Pathak [2010] employ a dictionary approach with synonyms to detect financial misstatements. This approach allows the authors to take a broader look at management’s disclosures in annual reports as opposed to focusing solely on individual style characteristics.
Goel, Gangolly, Faerman & Uzuner [2010] present an expanded fraud detection model using a machine learning algorithm termed the Support Vector Machine (SVM) to classify annual reports containing irregularities. The SVM approach provides a significant improvement over prior text mining models, as the SVM model learns by example and does not require predefined fraud indicators. The SVM tool is trained to classify annual reports using both word dictionaries and writing style characteristics such as word usage, word and sentence length, readability, tone, the use of passive versus active voice, the frequency of uncertainty or hedge words, and other deeper linguistic style markers and keyword usage. Goel et al. [2010] find that their SVM approach improves the prediction accuracy of their fraud detection model by about 58% compared to a baseline model using a Naïve Bayes classification approach. This approach improves upon the word-list tools used in prior research by employing Naïve Bayes and SVM algorithms to classify annual reports based on vectors of common words detected in fraudulent reports. However, this algorithm is based on simple word counts and ignores the disclosure content of the annual report as well as relationships between words in a document.
Lastly, Purda & Skillicorn [2015] use SVM tools to distinguish fraudulent from truthful annual reports. In contrast to Goel et al. [2010], Purda & Skillicorn [2015] examine both annual and quarterly financial reports and compare the accuracy of their fraud detection model to both the predictive power of linguistic-based models and traditional models such as the Dechow et al. [2011] F-score model. The SVM approach in Purda & Skillicorn [2015] uses a learning algorithm to generate a rank-ordered list of words that are best able to capture fraudulent reporting. The authors find that their data-generated classification scheme outperforms textual-based prediction models built using pre-defined dictionaries as well as traditional models based on financial and non-financial measures.7
Our study extends this body of literature by constructing a direct text-based measure aimed at capturing the content of disclosures within firms’ financial statements. Our approach goes beyond traditional fraud detection models that are based primarily on quantitative information. Further, we draw on theories of deception to predict that the thematic content, or topics, a manager chooses is more likely to reflect the manager’s intentions, and thus intentional misreporting, than either the simple “bag of words” or textual style approaches used in prior textual analysis studies.
2.2 LDA Topic Modeling
We employ a topic modeling approach developed by Blei et al. [2003], termed Latent Dirichlet Allocation (LDA), to capture the disclosure content of annual reports. The LDA technique is widely used in the linguistics and information retrieval literature to quantify the thematic content (i.e., topics) of text corpora and other collections of discrete disclosure data (see Blei
7 In international settings, Kirkos, Spathis & Manolopoulos [2007] and Pai, Hsu & Wang [2011] use textual data mining techniques to detect intentional financial misreporting in Greek and Taiwanese firms, respectively.
[2012] for a review of topic modeling and its application to various text collections).8 We use
this approach to construct a firm-specific measure of the topics discussed in each 10-K filing
in a given year. This unique measure (defined as the normalized percent of the annual report
attributed to each topic identified by the algorithm) captures how much of each report is
devoted to discussing a particular disclosure topic.
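The measure just defined is a normalized share of each filing attributed to each topic. The excerpt does not pin down the normalization, so the sketch below shows one plausible reading: standardizing each topic's raw share across filings in a year so that shares are comparable across topics. The function name, z-score form, and toy numbers are ours, not the paper's.

```python
from statistics import mean, pstdev

def normalize_topic_shares(shares_by_firm):
    """shares_by_firm: {firm: [share of topic 0, share of topic 1, ...]}.
    Returns each topic share standardized across firms (z-score per topic)."""
    firms = sorted(shares_by_firm)
    n_topics = len(next(iter(shares_by_firm.values())))
    out = {f: [0.0] * n_topics for f in firms}
    for t in range(n_topics):
        col = [shares_by_firm[f][t] for f in firms]       # topic t across firms
        mu, sd = mean(col), pstdev(col)
        for f in firms:
            out[f][t] = (shares_by_firm[f][t] - mu) / sd if sd > 0 else 0.0
    return out

# toy LDA output: fraction of each 10-K devoted to each of two topics
shares = {"A": [0.70, 0.30], "B": [0.50, 0.50], "C": [0.30, 0.70]}
z = normalize_topic_shares(shares)
print(z["A"][0])  # positive: firm A devotes an above-average share to topic 0
```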
Topic modeling is relatively new to accounting and finance, and our measurement approach is consistent with recent studies that apply LDA to extract the disclosure content of large volumes of financial-related textual data. For instance, Curme, Preis, Stanley & Moat [2014] use LDA to identify the semantic topics within the large online text corpus of Wikipedia. The identified topics are then used to determine the link between stock market movements and how frequently Internet users search for the most representative words of each identified topic. Huang, Lehavy, Zang & Zheng [2014] employ LDA topic modeling to compare the thematic content of analyst reports and the text narrative of conference calls. Consistent with the information discovery role of analysts, Huang et al. find that analyst reports issued immediately after conference calls contain exclusive topics that were not discussed during the conference calls. Bao & Datta [2014] discover and quantify the various topics discussed in textual risk disclosures from annual 10-K filings (Item 1A). Their results indicate that about two-thirds of the identified risk topics are uninformative to investors, consistent with the notion that risk disclosures are largely boilerplate. Of the remaining topics, disclosures of systematic macroeconomic and liquidity risks increase investors’ risk perceptions, whereas topics related to diversifiable risks (i.e., human resources, regulatory changes, information security, and operation disruption) decrease investors’ risk perceptions.
Concurrent with our study, Hoberg & Lewis [2015] use topic modeling and cosine similarity to provide evidence on the content disclosed by firms involved in SEC enforcement actions (AAERs). Focusing on the MD&A section of 10-K filings, Hoberg & Lewis [2015] find that, relative to industry peers, AAER firms have abnormal verbal disclosure that is common among fraud firms. The topic analysis results indicate that AAER firms disclose more information about complex business issues such as acquisitions and foreign operations, and are more likely to grandstand their good financial performance. AAER firms also under-disclose certain topics, such as liquidity challenges, and provide fewer quantitative details explaining their performance.

8 In practice, topic modeling is used by search engines such as Google and Bing to improve correlations between search terms and web content. Search engine marketers are also applying topic modeling to guide keyword selection and optimize website content [Fishkin, 2014].
Our study extends Hoberg & Lewis [2015] in several respects. First, Hoberg & Lewis [2015] fit their LDA model using the text contained in the MD&A section (Item 7) of annual reports filed in only the first year of their sample period (1997-2008). This approach does not account for changes in disclosure topics over time and could induce ‘staleness’ in the topics used in their empirical analyses. Our study accounts for the dynamic nature of management disclosure by simultaneously discovering the topics and quantifying the attention each annual report dedicates to the estimated topics in a given year. We also employ a rolling-window procedure that predicts financial misreporting using the topics identified over the five years prior to the manipulation period. Second, unlike Hoberg & Lewis [2015], we analyze the thematic content of the entire 10-K filing as opposed to only the MD&A section. While the MD&A section provides a useful setting for examining disclosure content, it does not capture relevant content that is discussed in other sections of the annual report, e.g., risk factors (Item 1A), legal proceedings (Item 3), and executive compensation (Item 11). As we will show, topics identified in the MD&A section have less variation and significantly lower fraud detection power compared to topics identified from the full annual report. Last, and most important, our study goes a step further by demonstrating the incremental predictive power of thematic content for detecting fraud over and above traditional measures of financial performance and textual style characteristics.
2.3 Research Questions
While prior research suggests a link between accounting misstatement and various word choices and writing styles, the literature is unclear as to whether disclosure content is related to intentional misreporting. Our primary research question tackles this issue by investigating the association between disclosure topics and instances of financial misreporting.
We seek to understand the role of disclosure content in detecting intentional financial misreporting beyond the traditional financial performance and style characteristics examined in prior work. Disclosure topics may provide incremental detection ability since the thematic content of financial statements captures an aspect of managerial deception that is distinct from that of financial metrics. Specifically, regulatory oversight is more difficult for textual narratives, especially at the topic level, thus leaving more room for managers to use disclosure content as a means of diverting attention away from misstated financials. While prior research has identified a set of “lying words” (see, e.g., Newman, Pennebaker, Berry & Richards [2003]), it is more difficult to naïvely identify a set of “lying topics,” as these same topics may be benign or informative about actual performance in other settings or at other points in time. Furthermore, financial metrics in annual reports are primarily backward-looking, whereas textual disclosures contain a significant amount of forward-looking information and cover a wide range of topics [Bozanic, Roulstone & Van Buskirk, 2014]. As prior research suggests, forward-looking information is inherently more uncertain and less verifiable at the time of disclosure [Bonsall IV, Bozanic & Merkley, 2014]. We therefore explore whether the topics discussed in annual reports provide incremental predictive value beyond financial metrics in detecting financial misreporting. Our first research question is stated as follows:

Research Question 1: Topic provides predictive information beyond that of financial measures when detecting intentional financial misreporting.
We also investigate whether disclosure topics provide informational value beyond textual style characteristics. Style characteristics are broad textual metrics: they do not reflect the underlying content, but are simply summary statistics computed across the broad base of the document. For example, style characteristics may refer to textual complexity, readability, or formatting choices such as the use of bullets or the amount of whitespace, as well as grammar and word choice. Topic modeling, in contrast, captures the underlying distribution of content in the document, as opposed to just summary style statistics. Furthermore, the discussion of various topics within the annual report is more intentional than style characteristics, i.e., managers must select the thematic content of each annual filing and the attention dedicated to each topic within the document. Prior research on deception provides evidence that written topics are typically chosen intentionally. Specifically, theories of manipulation and deception suggest that individuals actively monitor the amount, veracity, relevance, and clarity of topics communicated (see, e.g., McCornack [1992]). Also, experimental evidence indicates that individuals adapt deception strategies of giving false answers, withholding relevant information, or giving evasive answers on demand, suggesting that choosing a topic is an intentional process [Buller & Burgoon, 1996].
It is unclear from prior work whether style characteristics such as length and readability reflect intentional manager choices to obfuscate or deceive, or simply reflect a manager’s own optimism or subconscious slant (see, e.g., Li [2008] and the related discussion in Bloomfield [2008]). Further, prior research suggests that word choice, in and of itself, is often subconscious (or without intent). Specifically, prior work in linguistics provides evidence that the use of function words (i.e., pronouns, prepositions, articles, conjunctions, and auxiliary verbs) is often without awareness and is difficult to control [Chung & Pennebaker, 2007]. While there is some evidence that individuals consciously choose abstract words to describe recurring events and concrete words to describe non-recurring events, goals to influence message recipients’ beliefs make it difficult to disentangle whether the communicator is being truthful or not. Thus, even for those words that a manager may intentionally choose, it has been shown to be difficult to discern intentional deception (see, e.g., Douglas & Sutton [2003]). In sum, while style features of disclosure, including word choice captured by commonly used dictionaries or word lists, may reflect a manager’s own optimism or subconscious slant, topic modeling is more likely to reveal managers’ intentional content choices. This discussion leads to our second research question:

Research Question 2: Topic provides predictive information beyond that of textual style characteristics when detecting intentional financial misreporting.
3 Data and Empirical Measures
3.1 Data and Sample Selection
We use annual 10-K filings retrieved from the SEC’s EDGAR system as the textual disclosure source for examining our research questions. We focus on annual reports because they 1) allow us to maximize the number of firm-year observations in our sample, 2) are comprehensive in their coverage of the firm and its activities throughout the fiscal year, and 3) avoid self-selection biases given their mandatory disclosure status.9 We download the full text of all 10-Ks available through the SEC EDGAR FTP site from January 1, 1994 (the first year such data is available) until December 31, 2012 (the final year of the AAER dataset, discussed below). This download yields 131,528 filings. We follow Li [2008] in parsing the 10-K filings, but expand this methodology to remove all items included in the documents other than raw text.10 We also restrict the words in the files to those contained in the standard Unix words dictionary, to remove typos and uncommon terminology.11 We describe our full parsing methodology in detail in Appendix A.1 of the online appendix. As
9We acknowledge that 10-Ks are not always timely sources for detecting financial misreporting. For instance, the filing of the 10-K can lag any occurrence of fraud by up to a year. Purda & Skillicorn [2015] highlight the added value of including quarterly report narratives in language-based fraud analyses. However, we choose not to include quarterly reports to ensure consistency in the disclosure content of the reports across firms' reporting periods.
10We construct measures for all textual items removed from the documents, some of which are included in our analyses.
11The standard dictionary, provided by the wamerican package in the official Debian repositories, contains 99,171 words. We also conduct robustness checks using no dictionary, the wamerican-huge dictionary, and the wamerican-insane dictionary. These checks confirm that the standard dictionary provides the best model performance in-sample, along with the most coherent topics.
discussed below, we gather data on accounting misstatements from the SEC AAER dataset
compiled by Dechow et al. [2011] and from disclosures of restatements due to intentional
misreporting in amended 10-K filings. We also gather financial statement and stock market
data from Compustat and CRSP, respectively, as opposed to the actual 10-K filings to ensure
consistency and accuracy across our sample.
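The dictionary-restriction step of the parsing can be sketched as follows. This is a minimal illustration, not the authors' parser; the in-memory word list is a stand-in for the wamerican dictionary file (in practice one would read a file such as /usr/share/dict/words):

```python
import re

# Hypothetical stand-in for the Debian wamerican word list used in the paper.
def load_dictionary(words):
    return {w.strip().lower() for w in words}

def filter_tokens(text, dictionary):
    # Keep only alphabetic tokens that appear in the reference dictionary,
    # dropping typos and uncommon terminology.
    tokens = re.findall(r"[a-zA-Z]+", text.lower())
    return [t for t in tokens if t in dictionary]

vocab = load_dictionary(["revenue", "decreased", "the", "company"])
filtered = filter_tokens("Revenue decresed; the Company's revenue fell.", vocab)
```

Here the misspelling "decresed" is dropped because it is not in the word list, mirroring the paper's rationale for the dictionary restriction.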
3.1.1 Identifying Intentional Financial Misreporting
We use two data sources to identify instances of intentional financial misreporting. Following
Dechow et al. [2011], our first data source uses SEC AAERs to classify firms engaging in
material accounting misstatements. We focus on misstatements occurring during the annual
reporting period. We exclude quarter-period misstatements to ensure that the measurement
period for our prediction variables is consistent across firms. We create an indicator variable
(𝑚𝑖𝑠𝑟𝑒𝑝𝑜𝑟𝑡) that equals 1 for each fiscal year identified as misstated by the SEC, and zero
otherwise. We use this indicator variable to classify those 10-K filings containing potential
GAAP violations.
Our second data source is a customized automated search for occurrences of financial
restatements that are seemingly due to intentional misapplications of GAAP (irregularity
restatements). We use the classification methodology discussed in Hennes et al. [2008] to
develop a customized identification tool.
Our customized tool performs well in capturing financial misreporting in our sample, as a
manual inspection of irregularities identified by the search tool indicates that the misstated
financial reports contained material and intentional misapplications of GAAP. To identify
irregularity restatements, we download all amended 10-K filings (10-K/As) from the SEC
EDGAR FTP site. We gather firm-identifying information for matching purposes from the
header (or alternately from the body of the text when the header is missing or incomplete),
and then parse the 10-K/A in a manner similar to our parsing of unamended 10-Ks (see
Appendix A.1 of the online appendix). After parsing the filings, we follow Hennes et al.
[2008] and search the text for direct statements of the occurrence of financial reporting
irregularities or narratives referring to the investigation of misstatements by either regulatory
or independent parties. Appendix A describes our full search terms.
We search for phrases such as “fraud,” “materially false and misleading,” and “violation
of federal securities laws” to identify restated filings with direct discussion of irregularities.
We identify restatements with related regulatory investigations based on narratives refer-
ring to investigation by the SEC, the DOJ, or by an Attorney General. Restatements with
independent investigations are classified based on discussion of investigations by forensic ac-
countants, the audit committee or an independent committee, as well as statements referring
to the retention of legal counsel over the misstatement. Based on this identification strategy,
we classify each 10-K filing as misstated if our search of the corresponding 10-K/A detects
narratives reflecting an irregularity as detailed above. We then code 𝑚𝑖𝑠𝑟𝑒𝑝𝑜𝑟𝑡 as 1 for those
firm-years with misstated annual reports; 𝑚𝑖𝑠𝑟𝑒𝑝𝑜𝑟𝑡 equals 0 if there is no amended 10-K
for the respective fiscal year or if the amended 10-K filing does not involve an irregularity.
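A minimal sketch of this phrase-based classification follows. The term lists here are a small hypothetical subset; the full search terms appear in Appendix A of the paper:

```python
import re

# Hypothetical subset of the Hennes et al. [2008]-style search terms.
DIRECT_TERMS = [
    "fraud",
    "materially false and misleading",
    "violation of federal securities laws",
]
INVESTIGATION_TERMS = [
    "investigation by the sec",
    "forensic accountants",
    "independent committee",
]

def is_irregularity(filing_text):
    # Collapse whitespace so phrases match across line breaks in the filing.
    text = re.sub(r"\s+", " ", filing_text.lower())
    return any(t in text for t in DIRECT_TERMS + INVESTIGATION_TERMS)
```

A filing flagged by `is_irregularity` would have its fiscal year coded as misstated; untouched amendments are left as zero.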
3.2 Empirical Measures
3.2.1 Financial Measures
We build our fraud prediction model by including the financial statement and market-related
variables selected for the Dechow et al. [2011] F-Score model.12 We focus on the set of vari-
ables providing the highest predictive power as reported in Dechow et al. [2011] (see Model
3 in Table 9). The financial statement variables include measures of accrual quality, firm
performance, off-balance-sheet activities, and market-based incentives. Panel A of Appendix
B defines each of the variables outlined below. The accrual quality measures include an ex-
tended definition of working capital accruals as developed in Richardson, Sloan, Soliman &
Tuna [2005] (termed RSST accruals). The RSST accruals measure captures the change in
12In robustness tests, we find that our results hold when we include standard financial ratios and bankruptcy prediction measures, consistent with prior fraud studies (e.g., Beneish [1997], Cecchini et al. [2010]).
noncash net operating assets.13 We also measure the change in receivables and the change
in inventory since misstatement of these two accrual components affect widely-used perfor-
mance metrics, namely, sales growth and gross margin. Lastly, the percent of soft assets
on the balance sheet captures accounting flexibility, and in turn, the room for managerial
discretion in changing the measurement assumptions of net operating assets in order to meet
short-term performance goals.
Our performance measures capture managerial incentives to manipulate their financial
statements to mask poor firm performance. These measures include the change in cash sales
and the change in return on assets. To gauge the extent to which firms engage in off-balance-
sheet financing, we include an indicator variable to identify firm-years with nonzero future
operating lease obligations. The GAAP rules for operating leases lead to lower expenses
being booked to the income statement in the early years of the asset's life. Thus, the existence
of operating leases proxies for managers’ propensity to window-dress financial performance.
Lastly, the issuance of securities in a given firm-year, the book-to-market ratio, and the
market-adjusted stock return of the prior fiscal year all capture market-related pressures to
engage in fraudulent reporting.
3.2.2 Style Measures
Style characteristics are in essence simple summary statistics of textual information. Since
our main construct of interest, disclosure topic, is derived from text, the ability to detect
financial misreporting beyond simple style characteristics is quite important. As such, we
benchmark our topic measure against a comprehensive set of style characteristics from prior
literature, as well as four new measures developed from our analysis. Panel B of Appendix
B presents a full list of the style variables and their measurement.14
13We do not include other measures of discretionary accruals (e.g., modified Jones and performance-matched discretionary accruals) as Dechow et al. [2011] find that these measures perform poorly in detecting accounting manipulation compared to unadjusted accrual measures.
14Our results are robust to a large vector of alternative style characteristics. This vector includes a full battery of processing measures (a variable for each part removed from the filing), median word, sentence, and paragraph lengths (in addition to the already included mean lengths), Harvard IV dictionary measures,
Our new measures are the log of the number of bullets, the length of the SEC mandated
header, number of excess newlines (vertical whitespace) in the filings, and the character
length of HTML tags. The log of the number of bullets captures an aspect of readability,
as bulleted information is typically concise. We find considerable variation in this mea-
sure, as 13.5% of our sample filings do not contain bullets, while 10% of the filings include
over 1,400 bullets. The SEC header contains basic corporate and filing form information
such as company name and address, SIC industry, form type, and filing date. Filings with
long headers generally identify firms that operated under former company names in prior
years.15 We therefore expect the SEC header length to be correlated with disclosure com-
plexity attributable to complex firm transactions that have been shown to be correlated with
fraudulent activity (e.g., mergers, acquisitions, and corporate restructurings).
Excess newlines (vertical whitespace) increase the length of the 10-K filing without adding
any substantive content. Managers engaging in financial misreporting could insert additional
whitespace to keep the length of the filing consistent with filings by peer firms or the firm’s
own prior filings while omitting some pertinent information. We include the character length
of HTML tags in the unparsed documents as a broad measure of technological expertise or
savvy. This proxy attempts to distinguish between documents created using a basic word
processing program, e.g., Microsoft Word (which should embed numerous HTML tags),
versus documents created by more specialized financial software.
The second group of textual style variables are filing length and style structure measures,
in the vein of Li [2008] and Goel et al. [2010]. We use these length and structure measures
as additional proxies for disclosure readability and textual complexity.
The selected variables include the mean and standard deviation of the length of words,
sentences, and paragraphs in the 10-K filing, as well as measures of sentence repetition and
six alternative readability measures, a variable capturing every part of speech coded in the Brown corpus, total and tagged word counts, two other measures of sentence repetition, and deviation from the Benford distribution. We find that the majority of these variables are highly correlated with the style characteristics selected for our primary analyses.
15Former company names and the date of each name change are disclosed in a separate block of headerfields. Firms can enter up to three former names in the EDGAR system.
type-token ratio (see Goel et al. [2010]; Li [2014]). The type-token ratio (number of unique
words scaled by the number of total words) measures vocabulary variation and, consistent
with Rennekamp [2012], captures the idea of superfluous words, as a higher ratio indicates the
use of a broader vocabulary. We also compute the percent of short and long sentences (≤ 30
or ≥ 60 words, respectively) contained in the filing. We include two complementary measures
of readability: the Gunning Fog Index and the Coleman-Liau Index. These measures are
widely used in the accounting literature to indicate disclosure inefficiencies or misinformation
[Goel et al., 2010; Lehavy, Li & Merkley, 2011; Li, 2008; Rennekamp, 2012].
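A few of these structure measures can be computed with a short sketch. This is illustrative only; the production definitions of each variable are given in Panel B of Appendix B:

```python
import re

def style_stats(text, short=30, long=60):
    # Split the filing into sentences and lower-cased word tokens.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    lens = [len(re.findall(r"[a-z']+", s.lower())) for s in sentences]
    return {
        # Type-token ratio: unique words over total words.
        "type_token_ratio": len(set(words)) / len(words),
        # Percent of short (<= 30 words) and long (>= 60 words) sentences.
        "pct_short": sum(n <= short for n in lens) / len(lens),
        "pct_long": sum(n >= long for n in lens) / len(lens),
    }
```

A higher type-token ratio indicates a broader vocabulary, consistent with the Rennekamp [2012] interpretation discussed above.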
Our final set of textual measures comprises a battery of language and word choice vari-
ables. First, we measure language voice (active and passive), which has been shown to
correlate with the incidence of financial misreporting [Goel et al., 2010; Goel & Gangolly,
2012]. Consistent with Purda & Skillicorn [2015], we measure word choices using the six word
dictionaries defined in Loughran & McDonald [2011]. These dictionaries contain lists of
financial-related words that capture disclosure tone and the use of uncertainty and litigious
vocabulary. We also include three measures of disclosure emphasis: the use of capitalized
words, exclamation points, and question marks (see Goel & Gangolly [2012]).
3.2.3 LDA Topic Measure
Our topic measure is based on the unstructured and unsupervised LDA topic modeling
methodology developed by Blei et al. [2003].16 We choose this approach due to its intu-
itive characteristics and strong performance. In particular, LDA is a Bayesian probabilistic
model and offers significant theoretical improvement over older data-driven and principle-
component-based tools such as Latent Semantic Analysis (LSA). Furthermore, the topic
16For predictive purposes, Mcauliffe & Blei [2008] develop a supervised LDA model (sLDA) which allows each text document to be paired with a response variable that classifies each document. The goal of the sLDA model is to infer disclosure topics predictive of the response. This response variable would be misreport in our setting. We refrain from using the sLDA model for two reasons. First, the unsupervised LDA model allows us to provide a baseline for the common disclosure topics contained in annual reports, irrespective of misreporting. Second, Mcauliffe & Blei [2008] find that the prediction performance of sLDA is equivalent to LDA in text corpora with difficult-to-predict responses.
modeling accuracy of LDA is quite strong when compared to human classification of topics
or other unsupervised machine-learning algorithms such as LSA-IDF or LSA-TF.17 In an experiment,
Anaya [2011] finds that humans classify main topics with 94% accuracy, while LDA achieves
84% accuracy. Comparable accuracy statistics were 84% for LSA-IDF and 59% for LSA-TF.
While the accuracy of human classification is greater than that of LDA, the human approach
is infeasible when classifying large volumes of textual data. In fact, the LDA tool allows us
to categorize the disclosure content of annual reports containing text narratives of over 3 bil-
lion words, allowing for rigorous testing that otherwise would be impossible based on human
topic classifications.
The LDA model is based on a few simple assumptions. The model assumes a collection
of 𝐾 topics in a given text document and that the vocabulary of each topic is distributed
following a Dirichlet distribution, 𝛽𝐾 ∼ Dirichlet(𝜂).18 The model further assumes that
the topic proportions in each document 𝑑 are drawn from a Dirichlet distribution 𝜃𝑑 ∼
Dirichlet(𝛼). Given these assumptions, a specific number of topics to identify, and a few
learning parameters, the LDA model categorizes the words in a given set of documents
into well-defined topics. Because the model uses Bayesian analysis, a word is allowed to be
associated with multiple topics. This is a convenient feature of LDA, as words can have
multiple meanings, especially in different contexts. In sum, the generative process of LDA
can be viewed as a probabilistic factorization of the vocabulary in a collection of documents
into a set of topic weights and a dictionary of topics.
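The generative process just described can be simulated in a few lines. This is a toy illustration with made-up sizes, not the estimation procedure; the paper's model uses K = 31 topics over 10-K text:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, doc_len = 3, 8, 50  # topics, vocabulary size, words in one document

# Each topic is a distribution over the vocabulary: beta_k ~ Dirichlet(eta).
beta = rng.dirichlet(np.full(V, 0.05), size=K)
# Each document mixes topics: theta_d ~ Dirichlet(alpha).
theta = rng.dirichlet(np.full(K, 0.05))

# Generate the document word by word: draw a topic, then a word from it.
topic_of_word = rng.choice(K, size=doc_len, p=theta)
words = np.array([rng.choice(V, p=beta[z]) for z in topic_of_word])
```

Because each word is drawn through its own topic assignment, the same vocabulary term can be generated by several topics, reflecting the multiple-meaning feature noted above.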
We implement the LDA algorithm using a dynamic time-series process since we ex-
pect disclosure content to change across time due to factors such as macroeconomic condi-
tions, technological changes in business operations, regulatory interventions (e.g., the 2002
Sarbanes-Oxley Act), and changes in firm management. Consequently, this approach allows
us to assess the changing nature of disclosure content and its ability to predict account-
17LSA-IDF and LSA-TF are LSA-based measures using a term-document matrix that has undergone a transform: inverse document frequency or term frequency, respectively.
18A Dirichlet distribution is essentially a multivariate generalization of a beta distribution.
ing misstatements. Our time-series procedure identifies the topics discussed in each rolling
five-year window over our sample period (1994 – 2012). That is, we run the algorithm for
the periods 1994 – 1998, 1995 – 1999, 1996 – 2000, and so on. The topics discovered in each
window are then used to determine the disclosure content of annual reports issued in the
year immediately following each five-year window. This results in a test period of 1999 – 2012
for our prediction analyses. Note that while new topics may arise in the year after each
window, the topics discussed in the prior five years provide the most practical estimates of
current-year disclosure content while avoiding potential look-ahead biases in our prediction
tests.
For our implementation of the algorithm, we follow Hoffman, Bach & Blei [2010] and
use an “online” variant of LDA. This approach allows us to run the algorithm in small
batches and to classify large quantities of text without encountering computational barriers.
We run the online LDA algorithm in batches of 100 filings since small batches are more
computationally efficient given the large sizes of 10-K filings. We draw the filings in each
batch in random order to mitigate overweighting of early years in the online LDA tool.
Consistent with Hoffman et al. [2010], we use symmetric Dirichlet distributional parameters
of 𝛼 = 𝜂 = 1/20 and the learning parameters 𝜅 = 7/10 and 𝜏0 = 1024. The learning parameter
𝜅 controls how quickly old information is forgotten, while parameter 𝜏0 downweights early
iterations of the model. Hoffman et al. [2010] document that these distributional and learning
parameter settings are optimal when categorizing articles from the science journal Nature, as
well as categorizing text from Wikipedia. We then set the algorithm to identify 31 topics in
each five-year window. We select 31 topics since simulated results indicate that this number
of topics is optimal in capturing the occurrence of irregularity restatements (see Appendix
A.2 of the online appendix for a description of this simulation).19
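The role of the two learning parameters is easiest to see in the online step-size schedule of Hoffman et al. [2010], sketched below:

```python
# Step-size schedule of online LDA (Hoffman et al. [2010]):
# rho_t = (tau0 + t) ** (-kappa), with the paper's kappa = 7/10, tau0 = 1024.
kappa, tau0 = 0.7, 1024.0

def learning_rate(t):
    # tau0 downweights early mini-batches; a larger kappa forgets old
    # information more quickly.
    return (tau0 + t) ** (-kappa)
```

Each mini-batch of 100 filings updates the variational parameters with weight `learning_rate(t)`, which decays as more batches are processed.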
Next, we pre-process the parsed 10-K filings by first removing stop words. Stop words
are those deemed irrelevant for our text-based measures because they occur either frequently
19We run the simulation on irregularity restatements given the lower occurrence of SEC AAERs.
(e.g., ‘the’, ‘an’, ‘is’) or are too infrequent to be of use in fraud prediction (such cases were
often garbled text or misspellings in the 10-K filings). Because our analysis uses rolling five-
year windows, we generate our stopwords on matching five-year windows to avoid potential
look-ahead biases. We remove three types of stopwords: 1) the most frequent words appear-
ing in each rolling five-year window of our sample period until we have removed 60% of all
word occurrences in the window, 2) words that occur fewer than 1,100 times in the window,
and 3) words that occur in fewer than 100 filings. These parameters are also derived in our
simulation (see Appendix A.2 of the online appendix).
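The three-part stopword rule can be sketched as follows. This is illustrative; the thresholds are exposed as arguments so that toy inputs can use smaller cut-offs than the paper's 60%, 1,100, and 100:

```python
from collections import Counter

def build_stopwords(doc_tokens, freq_cut=0.60, min_count=1100, min_docs=100):
    # doc_tokens: one token list per filing in the five-year window.
    counts = Counter(t for doc in doc_tokens for t in doc)
    doc_freq = Counter(t for doc in doc_tokens for t in set(doc))
    total = sum(counts.values())

    stop, removed = set(), 0
    for word, n in counts.most_common():
        if removed / total >= freq_cut:  # 1) most frequent words, until
            break                        #    60% of occurrences are removed
        stop.add(word)
        removed += n
    stop |= {w for w, n in counts.items() if n < min_count}    # 2) rare words
    stop |= {w for w, n in doc_freq.items() if n < min_docs}   # 3) few filings
    return stop
```

Running this once per rolling five-year window, as the paper does, keeps the stopword list free of look-ahead information.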
We run the LDA algorithm on the pre-processed filings, generating 31 topics in each
rolling window and the weighting for each word associated with the topic. Using these word
weights, we compute the weight of each topic in filings issued in the year following the five-
year window. For example, the weighted word vectors for the topics identified in the 1994 –
1998 window are used to determine the topic weights in filings issued in 1999. To compute
the topic weights in a given filing, we multiply the vector of word weights within the topic by
a vector of word counts for the filing. We then normalize the weight of the topic by the sum
of the weights of all topics identified in the filing. This procedure generates the proportion
of the content of each document that is associated with each topic. We denote these topic
proportions as topic.
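The topic-weighting computation reduces to a dot product followed by normalization, as in this sketch (illustrative only, with tiny made-up weight dictionaries):

```python
def topic_proportions(word_counts, topic_word_weights):
    # word_counts: {word: count} for one filing.
    # topic_word_weights: one {word: weight} dict per topic, estimated on
    # the preceding five-year window.
    raw = [sum(w.get(word, 0.0) * n for word, n in word_counts.items())
           for w in topic_word_weights]
    total = sum(raw)
    # Normalize so that the proportions across topics sum to one.
    return [r / total for r in raw]
```

For example, a filing mentioning "oil" twice and "lease" once is scored against each window topic's weight vector, and the resulting vector of proportions is the filing's topic measure.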
3.3 Validation of LDA Topic Measure
Before investigating our research questions, we validate our topic measure using several
methods. Following prior research (e.g., Bao & Datta [2014], Huang et al. [2014]), our
first method evaluates the semantic validity of the LDA output by labeling the topics and
assessing the extent to which the topics provide meaningful content. As discussed above,
we derive our topic measure using a rolling-window approach with 31 topics identified in
each of the 14 rolling five-year windows over our sample period. For ease of interpretation,
we aggregate the topics discovered in each window up to the full sample. We refer to these
aggregate topics as “combined topics.” Since the optimal number of topics in each window
can vary, we allow multiple topics within a window to be associated with the same combined
topic. We also allow the number of combined topics to be greater than 31 since some topics
may not be present in all of the five-year windows. We derive the combined topics by
matching topics across years based on the Pearson correlation of the word weights within
the topics. We group all topics with a Pearson correlation above a specific threshold. We
test thresholds for the Pearson correlations from 1% to 90% in 1% intervals to determine the
most coherent grouping. The most coherent topics emerge when the Pearson correlation
threshold is set at 11%.20 This matching procedure results in 64 combined topics across
our sample period.
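The grouping step can be sketched as a union-find over above-threshold correlations. This is illustrative; the actual procedure also tracks the window structure of each topic:

```python
import math

def pearson(x, y):
    # Plain Pearson correlation of two equal-length weight vectors.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def combine_topics(weight_vectors, threshold=0.11):
    # Merge any two topics whose word-weight vectors correlate above the
    # threshold; return a group label for each topic.
    parent = list(range(len(weight_vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(weight_vectors)):
        for j in range(i + 1, len(weight_vectors)):
            if pearson(weight_vectors[i], weight_vectors[j]) > threshold:
                parent[find(j)] = find(i)
    return [find(i) for i in range(len(weight_vectors))]
```

Topics sharing a group label form one "combined topic," allowing multiple within-window topics to map to the same combined topic as described above.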
To determine the underlying content of each combined topic, we generate a list of the
highest weighted phrases and sentences associated with each topic. We construct the list
by first extracting the top 1,000 sentences per topic based on the weighted words associated
with each combined topic. Next, we sort the sentences based on their length and extract
the middle tercile (334 sentences) as representative sentences with a typical length. We
then extract the top 20 most frequent bigrams (i.e., two-word phrases excluding stopwords,
numbers, and symbols) within the 334 mid-length sentences. We also sort the 334 sentences
based on the cosine similarity between a given sentence and the remaining 333 sentences.
We manually review the top 20 bigrams and top 100 mid-length sentences based on cosine
similarity and assign descriptive labels to each of the 64 topics.
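The bigram-extraction step can be sketched as follows (illustrative only; the paper's version also excludes numbers and symbols and works over the 334 mid-length sentences per topic):

```python
import re
from collections import Counter

def top_bigrams(sentences, stopwords, k=20):
    # Count adjacent word pairs after dropping stopwords, then return the
    # k most frequent pairs.
    counts = Counter()
    for s in sentences:
        words = [w for w in re.findall(r"[a-z]+", s.lower())
                 if w not in stopwords]
        counts.update(zip(words, words[1:]))
    return [bg for bg, _ in counts.most_common(k)]
```

The resulting phrases, together with the cosine-ranked sentences, are what the manual labeling step reviews.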
Appendix C presents a list of the 64 combined topics with 10 selected bigrams per topic
and our inferred topic labels.21 We report 10 representative bigrams after excluding re-
20The first pass of this test determined that the optimal correlation threshold ranged between 8% and 18%. We then conduct tests of this threshold range in 0.05% increments to locate the 11% cut-off point. We also compare the combined topics generated by groupings based on Spearman correlation and Euclidean distance. Both of these alternative methods performed poorly due to overweighting on words with low topic weightings, leading to incoherent topic groupings.
21The inferred labels for a few topics are overlapping due to only minor differences in the content inferred
dundant bigrams (e.g., “millions in,” “company also,” “in year”) and those with similar
inferences (e.g., “compared to” and “compared with” in topic 2, or “derivative financial”
and “financial derivative” in topic 9). We note that the LDA algorithm performs well in
identifying narrative content that is distinctively related to changes in firms’ financial per-
formance. For instance, topics 1 and 2 both refer to the firm's income performance compared
to prior periods. Examples of top mid-length sentences from topic 1 include the following:
“Other income decreased to $11,745,000 in 1999 as compared to $11,882,000 in 1998 and
$10,521,000 in 1997” and “Management fee income decreased to $0 as compared to $1.4
million in 1997.” Other topics related to financial performance include segment performance
(topics 16 and 54), franchise revenues (topic 26), and general references to quantitative fi-
nancial statement information (topics 7, 34, 62, and 63). LDA also identifies topics related
to complex business transactions and arrangements such as corporate spin-offs (13 and 64),
derivatives and hedging activities (9), fair value/cash flow hedging (41), merger activities
(31), R&D partnerships (32), joint venture agreements (39), strategic alliances (46), and
investment in securitized/guaranteed securities (55).
Several topics also refer to specific financial statement items and/or their underlying
measurement assumptions such as post-retirement health care cost assumptions (4), account
receivables and doubtful accounts (12), long term assets (25), advertising expenses (36), and
the measurement of natural gas properties (38). Consistent with Huang et al. [2014], we are
able to identify industry-specific topics such as aircraft leasing arrangements in the airline
industry, franchise revenue recognition and restaurant growth in the restaurant industry,
as well as general discussion of business risks and operational factors in the agricultural,
gaming, mining, marine transportation, and hotel industries. Lastly, as demonstrated in
Bao & Datta [2014], LDA effectively discovers content related to common risk factors and
contingencies such as foreign currency risks (57), country risks (18 and 37), environmental
liabilities and risks (6 and 56), patent infringement and rights (48), and legal proceedings
from the bigrams and mid-length sentences. We treat these topics separately in our empirical analyses to mitigate any noise introduced by our topic aggregation process.
(45). In summary, the evidence in Appendix C suggests that the LDA algorithm provides a
valid set of economically meaningful topics.
Our next validation method evaluates whether the disclosure topics perform reasonably
well in detecting misstatements using in-sample tests. Figures 1 and 2 depict the distribution
of each combined topic over the 1999 to 2012 period (our misreporting prediction years) and
whether the topic is significantly associated with the occurrence of financial misreporting.
Figure 1 (2) illustrates the distribution for the sample of irregularity restatements (AAERs).
We determine the significance of the combined topics by estimating yearly in-sample regres-
sions of the disaggregated subtopics (i.e., the topics associated with a given combined topic
in each year) on our misreport indicator variable. We orthogonalize the subtopic proportions
to 2-digit SIC industries to control for unobserved industry effects.
We observe in both figures that the discussion of several topics is relatively consistent
across the sample years. These topics include changes in income performance (topics 1 and
2), measurement of post-retirement benefits (3), and industry-specific topics such as aircraft
leasing arrangements (4) and real estate loan operations (10). Other topics appear later in
the sample period, indicating the evolving nature of firms' disclosure content. For instance,
discussions of collaborative business arrangements such as joint ventures (39), strategic al-
liances (46), and partnerships (51) are more prominent in the second half of our prediction
period.
With respect to the ability to detect misreporting, Figure 1 illustrates that discussion of
increases in income performance compared to prior periods (combined topic 2) is significantly
associated with irregularity restatements in most of our prediction years. However, the
direction of the significance is not consistent throughout the sample period. We also observe
that discussion of declines in income performance (topic 1) is significant in relatively few
years in our sample. These results suggest that the association between misreporting and
managerial discussion of financial performance is not as clear cut as suggested by prior work
on the relation between fraud and poor financial performance.
The results in Figure 1 also suggest that misreporting firms are more likely to discuss
issues related to share capital transactions, investments in securitized/guaranteed securities,
environmental risks, foreign operations, and growth in franchised operations. Combined
topics that load consistently negative include discussions of merger activities, joint venture
arrangements, fair value/cash flow hedging, legal proceedings, and stock option plans, sug-
gesting that restatement firms are less likely to discuss these issues in misstatement years.
The results for AAER firms (Figure 2) are similar, but with some variation in the timing of
the topic loadings. AAER firms are less likely to discuss segment performance and declines
in income performance, primarily in the earlier years of the sample. Taken together, our
evidence in Figures 1 and 2 suggests that disclosure content provides significant informational
value for detecting misstatement events. These results provide us with greater confidence
for investigating the fraud prediction performance of topic relative to financial statement
variables (RQ1) and textual style characteristics (RQ2).
4 Empirical Methodology and Results
4.1 Empirical Methodology
To investigate our research questions, we conduct both in-sample and out-of-sample tests
using our time-series approach. We first estimate in-sample prediction models using rolling
five-year windows. We then conduct out-of-sample tests using the estimates from each five-
year window to predict the likelihood of intentional misreporting in the year following each
window.22,23
We begin our analyses by estimating logistic regressions of our misreporting indicator
22For filings coded as an irregularity restatement, we ensure that the restatement is revealed by the end of the in-sample window. We are unable to apply this restriction in the AAER sample as the UC Berkeley dataset does not include the release dates of the AAERs.
23For example, the estimated results for 1994 – 1998 (1995 – 1999) are used to predict misreporting for a holdout sample of firms in 1999 (2000) and so on.
variable (𝑚𝑖𝑠𝑟𝑒𝑝𝑜𝑟𝑡) on vectors of the disaggregated topic proportions (𝑡𝑜𝑝𝑖𝑐) as follows:
$$\log\left(\frac{misreport_{i,t}}{1 - misreport_{i,t}}\right) = \alpha + \sum_{j=1}^{31} \beta_j \, topic_{j,i,t} + \varepsilon_{i,t}, \qquad (1)$$

$$t \in [T-5,\, T-1], \quad i \in \text{Companies}$$
We estimate equation (1) for the five-year window preceding each of the prediction years,
1999 to 2012. For our AAER specification, we lack sufficient instances of financial misre-
porting for our out-of-sample test for the year 2012, and thus we remove this year from the
specification. We remove out-of-sample test years with insufficient misreporting events when
conducting analyses on various subsamples. We then use the estimated regression coefficients
to predict the likelihood of intentional financial misreporting for 10-K filings in the subse-
quent year. Similar to Dechow et al. [2011], we construct a prediction score (𝑝 𝑚𝑖𝑠𝑟𝑒𝑝𝑜𝑟𝑡)
using the estimated coefficients and apply this scoring in our out-of-sample tests as follows:
$$\log\left(\frac{misreport_{i,T}}{1 - misreport_{i,T}}\right) = \alpha + \beta_1 \, p\_misreport_{i,T} + \varepsilon_{i,T}, \quad i \in \text{Companies} \qquad (2)$$
To investigate Research Question 1 (RQ1), we estimate two additional regression specifi-
cations. The first specification replaces the topic vector with the vector of financial variables
discussed previously, whereas the second specification extends equation (1) by including both
vectors of 𝑡𝑜𝑝𝑖𝑐 and the financial variables. In both cases, we generate a 𝑝 𝑚𝑖𝑠𝑟𝑒𝑝𝑜𝑟𝑡 mea-
sure and run the out-of-sample tests as well. These two specifications allow us to gauge the
incremental fraud-prediction ability of 𝑡𝑜𝑝𝑖𝑐 beyond traditional financial statement variables.
For our second research question (RQ2), we introduce four specifications that include
style characteristics. The first specification includes our style characteristics with the second
including both financial variables and style characteristics. The third and fourth specifi-
cations are expanded versions of the first two models with the 𝑡𝑜𝑝𝑖𝑐 vector inserted. Our
general regression form for RQ2 is specified below in equation (3):
$$\log\left(\frac{misreport_{i,t}}{1 - misreport_{i,t}}\right) = \alpha + \sum_{j=1}^{10} \beta_j \, financial_{j,i,t} + \sum_{j=1}^{30} \beta_{j+10} \, style_{j,i,t} + \sum_{j=1}^{20} \beta_{j+40} \, topic_{j,i,t} + \varepsilon_{i,t}, \qquad (3)$$

$$t \in [T-5,\, T-1], \quad i \in \text{Companies}$$
Due to the large number of variables included in the regressions and the naturally small
number of AAERs and intentional restatements in our windows, we tightly control the
convergence of our logistic regressions. Specifically, we check each regression specification
for both complete and quasi-complete separation. Appendix A.3 of the online appendix
details the steps for conducting these checks.
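As a rough illustration of these convergence checks, the snippet below flags predictors exhibiting complete or quasi-complete separation of a binary outcome, which is the standard cause of non-convergence in logistic regression. The paper's actual procedure is detailed in its online appendix and may differ:

```python
# Hypothetical pre-fit diagnostic: a column that perfectly splits the
# two outcome classes ('complete' separation) or splits them with a
# boundary tie ('quasi' separation) will break maximum-likelihood
# estimation of a logit.
import numpy as np

def separation_flags(X, y):
    """Return a per-column flag: 'complete', 'quasi', or None."""
    y = np.asarray(y).astype(bool)
    flags = []
    for j in range(X.shape[1]):
        x0, x1 = X[~y, j], X[y, j]
        complete = x0.max() < x1.min() or x1.max() < x0.min()
        quasi = x0.max() <= x1.min() or x1.max() <= x0.min()
        flags.append('complete' if complete else 'quasi' if quasi else None)
    return flags
```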
4.1.1 Statistical Testing
Given the structure of our rolling time-series analysis, we are unable to use a standard Fama-
MacBeth methodology to pool our results for the predicted window. This restriction results
from the topic variables naturally changing across windows as previously reported. Thus,
we cannot aggregate results at the variable level. To address this research design issue, we
use Fisher’s (1932) method to provide aggregated test statistics.24 The Fisher test statistic
is appropriate for our analyses since the out-of-sample regressions are estimated using non-
overlapping years.
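Fisher's method is straightforward to implement; a minimal sketch is:

```python
# Fisher's (1932) method for pooling N independent p-values into one
# test statistic: -2 * sum(log p_i) is chi-squared with 2N degrees of
# freedom under the joint null.
import numpy as np
from scipy import stats

def fisher_combine(pvalues):
    """Return (chi2 statistic, pooled p-value) for independent p-values."""
    p = np.asarray(pvalues, dtype=float)
    chi2 = -2.0 * np.log(p).sum()
    pooled_p = stats.chi2.sf(chi2, df=2 * len(p))
    return chi2, pooled_p
```

SciPy's `scipy.stats.combine_pvalues(p, method='fisher')` computes the same statistic.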
We refine our test statistic further by deriving a statistic referred to as a Var-Gamma
test (see Appendix D). This test statistic allows us to compare the results of Fisher's method,
statistically testing whether one fraud detection model performs better than another when
pooled across years.
24 The test statistic is computed as −2 Σ_{i=1}^{N} log(p_i) ∼ χ²_{2N}, where p_i is the ith p-value of N total p-values.
4.2 Empirical Results
4.2.1 The Predictive Value of Topic versus Financial Variables (RQ1)
We address our first research question by investigating the informational role of 𝑡𝑜𝑝𝑖𝑐 versus
financial statement variables in detecting intentional misreporting. Table 1 presents separate
summary statistics of our financial statement variables for fraud and non-fraud firm-years
in the AAER and irregularity restatement samples. Consistent with Dechow et al. [2011],
we find that the percent of soft assets is significantly higher in both samples, suggesting
more reporting flexibility in misstatement years (𝑝-value < 0.01). We also find a greater
tendency to issue securities and engage in off-balance-sheet activities in misstatement years
in both samples. Furthermore, firms involved in AAERs have significantly larger increases
in receivables and inventory, and larger declines in ROA, consistent with greater earnings
management and poorer financial performance in manipulation years. Lastly, AAER firms
experience higher market-adjusted stock returns in the year prior to the misstated year. This
result, combined with the higher rate of security issuance during the misstated year, indicates
that market-related incentives play a strong role in intentional misreporting. In sum, the
univariate results provide initial evidence that financial statement variables provide useful
information for predicting intentional financial misreporting.
Table 2 presents the results of our out-of-sample tests of the predictive role of 𝑡𝑜𝑝𝑖𝑐
and financial variables (denoted as 𝐹 − 𝑆𝑐𝑜𝑟𝑒). Panels A and B present the Fisher and
Var-Gamma test statistics for AAERs, while Panels C and D present similar statistics for
irregularity restatements. The Fisher test statistics (Panels A and C) indicate that the
financial variables provide a significant amount of information for predicting AAERs (𝑝 <
0.001); however, they fail to provide significant informational value for predicting irregularity
restatements (𝑝 = 0.223). Furthermore, the results in Panels B and D suggest that the stand-
alone 𝑡𝑜𝑝𝑖𝑐 vector performs significantly better at predicting both AAERs and irregularity
restatements than either the stand-alone vector of financial metrics or the paired vectors of
𝑡𝑜𝑝𝑖𝑐 and financial variables. In both samples, the Var-Gamma test statistics are significantly
positive at the 1% level when the stand-alone 𝑡𝑜𝑝𝑖𝑐 vector is benchmarked against 𝐹 −𝑆𝑐𝑜𝑟𝑒
and the 𝑡𝑜𝑝𝑖𝑐 vector paired with 𝐹 − 𝑆𝑐𝑜𝑟𝑒. The pairing of 𝑡𝑜𝑝𝑖𝑐 with financial measures
performs significantly better at predicting both AAERs and irregularity restatements than
financial measures alone (𝑝 < 0.001). Overall, we find that our measure of the thematic
content performs significantly better at detecting instances of accounting misstatements
relative to traditional financial characteristics.
4.2.2 The Predictive Value of Topic versus Textual Style Characteristics (RQ2)
Since style measures are also text-based, one could argue that 𝑡𝑜𝑝𝑖𝑐 simply proxies for the
style characteristics of firms’ financial statements. Table 3 presents separate univariate
statistics for our vector of style characteristics for misstated and non-misstated firm-years.
Regarding our processing variables, we find that misstated filings in both the AAER and
irregularity restatement samples have more (concise) bulleted information. This finding is
inconsistent with conventional notions, but could reflect managers’ use of conciseness to omit
relevant information. Misstated filings in both samples have longer headers relative to non-
misstated filings, consistent with complex firm transactions like restructuring and mergers
being associated with misstatements. Also, misstated filings have fewer newlines and HTML
tags in the AAER sample, whereas misstated filings have more newlines and more HTML tags
in the irregularity restatement sample. We further note that the misstated filings in both
samples are longer overall with longer MD&A sections, suggesting less readability during
manipulation periods.
In terms of complexity, misstated filings in the AAER sample contain longer words,
shorter sentences, and shorter paragraphs, along with fewer long (> 60 word) sentences
and a greater number of short (< 30 word) sentences. In contrast, misstated filings in
the irregularity restatement sample tend to use longer sentences, longer paragraphs, fewer
long sentences, and fewer short sentences. Regarding variation, both AAER and irregular-
ity restatement filings have significantly lower variation in sentence length and use fewer
unique words (type-token ratio). AAER filings also have less variation in paragraph length,
while irregularity restatement filings have greater variation in paragraph length and have
more repeated sentences. We note that both sets of misstated filings are less readable per
the Gunning Fog and Coleman-Liau indices, and are more likely to contain passive voice
grammar, consistent with managers using passive voice to disassociate themselves from the
disclosure content [Goel & Gangolly, 2012].
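To make two of these measures concrete, the sketch below computes a Gunning Fog score and a type-token ratio for a block of text. The vowel-group syllable heuristic and the tokenization rules are simplifying assumptions, not the authors' exact procedure:

```python
# Gunning Fog = 0.4 * (words per sentence + 100 * complex-word share),
# where "complex" means three or more syllables; the type-token ratio
# is the share of unique words. Syllables are approximated by counting
# vowel groups, a common rough heuristic.
import re

def style_measures(text):
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = lambda w: max(1, len(re.findall(r'[aeiouy]+', w.lower())))
    complex_words = [w for w in words if syllables(w) >= 3]
    fog = 0.4 * (len(words) / len(sentences)
                 + 100 * len(complex_words) / len(words))
    type_token = len({w.lower() for w in words}) / len(words)
    return {'gunning_fog': fog, 'type_token_ratio': type_token}
```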
Regarding word choice, both AAER and irregularity restatement filings have significantly
higher percentages of positive, negative, and uncertain words, consistent with Loughran
& McDonald [2011]. AAER filings also have a lower percentage of strong words, while
irregularity restatement filings have a greater percentage of litigious, strong, and weak words.
Lastly, misstated filings in the AAER and irregularity restatement samples contain more
textual emphasis as indicated by more words in all caps; however, misstated AAER filings
have fewer exclamation points and question marks, on average.
We approach RQ2 by combining the 𝑡𝑜𝑝𝑖𝑐 and textual style vectors in the same regression
model. Table 4 presents the Fisher and Var-Gamma test statistics of our out-of-sample tests
of the predictive performance of topic relative to textual style characteristics. Panels A and
B present the test statistics for AAERs; Panels C and D present the results for irregularity
restatements. The evidence in panels A and C suggests that 𝑡𝑜𝑝𝑖𝑐 combined with style is a
good predictor of misstatements involving AAERs and irregularity restatements (𝑝 < 0.001).
However, for AAER misstatements, the Var-Gamma results in Panel B show that 𝑡𝑜𝑝𝑖𝑐 by
itself is a better predictor than either textual style or 𝑡𝑜𝑝𝑖𝑐 combined with style at 𝑝 < 0.001.
The Var-Gamma tests in Panel D show that while 𝑡𝑜𝑝𝑖𝑐 is a better predictor than style
(𝑝 = 0.019), the joint vector of 𝑡𝑜𝑝𝑖𝑐 and style characteristics is a better predictor than the
stand-alone 𝑡𝑜𝑝𝑖𝑐 vector at 𝑝 = 0.003. Thus, we find that the best specification for predicting
AAERs is 𝑡𝑜𝑝𝑖𝑐 by itself, while the best specification for predicting irregularity restatements
is 𝑡𝑜𝑝𝑖𝑐 with style. This evidence could suggest that fraud detection models based on 𝑡𝑜𝑝𝑖𝑐
and style characteristics are more able to detect accounting manipulations that are likely to
go unidentified by the SEC.
4.2.3 The Joint Predictive Value of Topic, Financial, and Textual Style Variables
In this section, we conduct extended analyses of the interplay between all three vectors of
our fraud prediction variables: 𝑡𝑜𝑝𝑖𝑐, financial statement variables, and textual style char-
acteristics. This comprehensive analysis attempts to verify that the fraud detection ability
of 𝑡𝑜𝑝𝑖𝑐 is robust to the inclusion of financial and textual style characteristics. We therefore
estimate a comprehensive regression of the vectors of 𝑡𝑜𝑝𝑖𝑐, financial, and tex-
tual style measures. Table 5 presents the out-of-sample results. In Panels A and C we find
that the combined vector of 𝑡𝑜𝑝𝑖𝑐, financial, and style measures performs reasonably well in
detecting accounting misstatements (𝑝 < 0.001). Nonetheless, the results in Panels B and
D indicate that the stand-alone 𝑡𝑜𝑝𝑖𝑐 vector performs markedly better in predicting both
types of accounting misstatements (𝑝 < 0.001). For AAERs (Panel B), 𝑡𝑜𝑝𝑖𝑐 outperforms
the joint vector of 𝑡𝑜𝑝𝑖𝑐, financials, and style (𝑝 < 0.001), whereas 𝑡𝑜𝑝𝑖𝑐 paired with style is
the dominant predictor of irregularity restatements at 𝑝 < 0.001 (Panel D). This evidence
agrees with our previous results that 𝑡𝑜𝑝𝑖𝑐 is the best predictor of misstatements involving
AAERs, while 𝑡𝑜𝑝𝑖𝑐 and textual style characteristics provide the strongest prediction power
for misstatements involving irregularity restatements.
In order to examine the economic significance of our out-of-sample results, we examine
the percent of accounting misstatements in the top 5% of the prediction scores from the
out-of-sample regressions. Consistent with our out-of-sample results, Table 6 shows that
𝑡𝑜𝑝𝑖𝑐 combined with textual style characteristics captures the most misstatements involving
irregularity restatements, detecting 16.41% of them. For AAERs, we find that combining
𝑡𝑜𝑝𝑖𝑐, 𝐹 − 𝑠𝑐𝑜𝑟𝑒, and textual style characteristics captures the most AAERs, detecting
19.93% of all AAERs. While this contrasts with the results above, it is not inconsistent:
the out-of-sample tests capture which measure is best on average, rather than within any
specific cutoff of the fraud scores. More importantly, Table 6 shows that topic is useful for
prediction in an economic sense, increasing the number of AAERs (irregularity restatements)
captured within the top 5% of firms by 67% (3.3%).
5 Additional Analysis and Robustness
This section provides a series of extended analyses as well as sensitivity checks for our
primary results. We first examine the robustness of our results to alternative sources of
financial restatements due to irregularities, as well as restatements attributable to uninten-
tional misapplications of GAAP (errors). We also replicate our primary results using MD&A
statements instead of the full text of the filings. Next, we change the regression form to an
L1 regularized logit model, to alleviate concerns of potential overfitting. Lastly, we adjust
our samples of misstated filings to exclude repeat GAAP violators as well as replicate our
analyses using the raw 𝑡𝑜𝑝𝑖𝑐 measures (as opposed to the normalized 𝑡𝑜𝑝𝑖𝑐 proportions).
5.1 Alternative Restatement Measures
Our strategy for identifying irregularity restatements is based on three classification crite-
ria: 1) management’s use of variants of the word “fraud” or “irregularity” in reference to
the misstatement (direct restatements), 2) misstatements identified by regulatory or DOJ
investigation (government-identified restatements), and 3) misstatements uncovered by in-
dependent investigations (other irregularity restatements). We examine whether our results
differ for misstatements identified under each criterion. We conduct this analysis since man-
agerial discussion during misstatement events is likely to differ across the three settings. For
instance, irregularities involving SEC or DOJ investigations could be more egregious com-
pared to those involving within-firm or independent investigations. We also investigate our
models’ ability to distinguish unintentional misstatements or errors (i.e., those misstatements
that are not classified as intentional).
The out-of-sample results (not tabulated) provide an interesting story when we distin-
guish the three settings. For direct restatements, the vectors of 𝑡𝑜𝑝𝑖𝑐, financial, and style
measures are all statistically significant predictors; however, 𝑡𝑜𝑝𝑖𝑐 is the most powerful pre-
dictor, and all other combinations of the 𝑡𝑜𝑝𝑖𝑐, financials, and style vectors lead to specifications
that are significantly weaker. Government-identified restatements are also pre-
dicted most strongly by 𝑡𝑜𝑝𝑖𝑐. Interestingly, financial statement variables are not predictive
of government-identified misstatements (𝑝 = 0.987). Other irregularity restatements are
likewise best captured by the 𝑡𝑜𝑝𝑖𝑐 measure, while financial measures are once again poor
predictors.
Lastly, all specifications perform well at predicting unintentional accounting errors. The
𝑡𝑜𝑝𝑖𝑐 vector is tied with 𝑡𝑜𝑝𝑖𝑐 paired with the financial and style vectors when detecting
accounting errors. We also find that 𝑡𝑜𝑝𝑖𝑐 and 𝑡𝑜𝑝𝑖𝑐 paired with style are the best specifica-
tions, irrespective of the type of restatement (intentional irregularity or unintentional error).
Overall, our results suggest that quantifying the thematic content of annual reports results
in a detection tool that performs best when predicting accounting misstatements in general.
5.2 Management Discussion and Analysis
Several style-focused studies such as Li [2008], Li [2010], Cecchini et al. [2010], and Goel
& Gangolly [2012] examine the MD&A section of the 10-K. We therefore investigate the
fraud prediction performance of our topic measure based on this subset of the 10-K. We
reconstruct our topic and style variables using the text extracted from the MD&A section
(see Appendix A.1 of the online appendix for further details). Our out-of-sample evidence
indicates that style performs worse than financial measures at predicting misstatements
involving AAERs, and that the predictive ability of 𝑡𝑜𝑝𝑖𝑐 is not significantly different from
that of financial statement variables in the case of AAERs. Out-of-sample tests for the
irregularity restatement sample show weaker Fisher statistics compared to our reported re-
sults; however, our main results continue to hold. Thus, we conclude that the incremental
detection value of 𝑡𝑜𝑝𝑖𝑐 is robust to restricting our analysis to MD&A statements, though it
is generally better to use the full text of the 10-K filings when examining disclosure content
and textual style characteristics.
5.3 Regularized Logit
Here, we change the form of our regressions from a standard logistic regression to an L1
regularized logit. The L1 regularization approach applies a penalty for increasing the number
of independent variables, which biases against including too many independent variables in
the regression. We find only one difference in our set of out-of-sample tests: for the sample
of irregularity restatements, there is no significant difference in the prediction ability of 𝑡𝑜𝑝𝑖𝑐
relative to 𝑡𝑜𝑝𝑖𝑐 paired with style. As such, the additional predictive ability from adding
textual style to 𝑡𝑜𝑝𝑖𝑐 does not overcome the penalty from the L1 regularization applied for
increasing the number of independent variables.
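A minimal sketch of an L1-regularized (lasso) logit follows, assuming scikit-learn; the authors' solver and penalty strength are not specified in this excerpt:

```python
# The L1 penalty shrinks the coefficients of weak predictors to exactly
# zero, so the fitted model implicitly performs variable selection and
# guards against overfitting with many regressors.
import numpy as np
from sklearn.linear_model import LogisticRegression

def l1_logit(X, y, C=1.0):
    """Fit an L1-penalized logit; return the model and surviving columns."""
    model = LogisticRegression(penalty='l1', solver='liblinear', C=C)
    model.fit(X, y)
    kept = np.flatnonzero(model.coef_[0])  # indices with nonzero weight
    return model, kept
```

Smaller values of `C` impose a heavier penalty, dropping more predictors.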
Overall, the L1 regularization results do not change our primary inferences, with the
exception that the stand-alone 𝑡𝑜𝑝𝑖𝑐 vector may be a stronger predictor of irregularity re-
statements under certain circumstances.
5.4 Removing second time offenders
Our next robustness test controls for possible biases introduced by allowing a firm to be
flagged as an AAER or irregularity restatement firm in both the learning window and the
following testing year. One concern arising from this approach is that the topic measure is
biased toward firms that are repeat offenders, rather than the first instance of an AAER
or irregularity restatement. To alleviate this concern, we adjust our misstatement samples
by removing any firm-years in which the preceding firm-year was involved in a misstate-
ment. Thus, our out-of-sample dependent variable only picks up the first year affected by a
misstatement.
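The filtering step described above might look like the following pandas sketch, where the `firm`, `year`, and `misstated` columns are illustrative names and firm-years are assumed consecutive:

```python
# Drop any firm-year whose immediately preceding firm-year was also
# flagged as misstated, so only the first year of each misstatement
# spell remains. Assumes consecutive firm-years within each firm.
import pandas as pd

def drop_repeat_offenders(df):
    df = df.sort_values(['firm', 'year'])
    prev = df.groupby('firm')['misstated'].shift(1).fillna(0)
    return df[prev != 1]
```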
Our prediction results are virtually identical using these adjusted misstatement
samples. Specifically, the results for both misstatement samples indicate that the predictive
ability of 𝑡𝑜𝑝𝑖𝑐, financial, and style measures is lower when removing repeat misstatements,
but all inferences are identical to our primary inferences. As such, it appears that our 𝑡𝑜𝑝𝑖𝑐
measure is not biased towards firms with repeated misstatements.
5.5 Raw Topic Measures
Our final sensitivity check uses the raw 𝑡𝑜𝑝𝑖𝑐 measures instead of the normalized measures.
This approach increases the variance in the topic measures, as they are now influenced by
the amount of text in the document.
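Dividing each document's raw topic weights by their sum is an assumed form of the normalization being undone here, since the exact procedure is described elsewhere in the paper:

```python
# Raw per-document topic weights scale with document length; dividing
# each row by its sum yields length-invariant topic proportions.
import numpy as np

def normalize_topics(raw):
    """Convert raw topic weights (docs x topics) to row proportions."""
    raw = np.asarray(raw, dtype=float)
    return raw / raw.sum(axis=1, keepdims=True)
```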
The prediction results for the AAER sample are quite similar, except that 𝑡𝑜𝑝𝑖𝑐 by itself
is no longer a significantly better predictor than 𝑡𝑜𝑝𝑖𝑐 paired with style. Results fo