What are you saying? Using topic to detect financial misreporting*
Nerissa C. Brown, Associate Professor, University of Illinois, [email protected]
Richard M. Crowley, Ph.D. Student, University of Illinois, [email protected]
W. Brooke Elliott, Associate Professor, University of Illinois, [email protected]
September 2015

Preliminary version. Please do not cite without the permission of the authors.
*We thank Andrew Bauer, Paul Demeré, Shawn Gordon, Kristina Rennekamp, and workshop participants at the University of Illinois, the 2015 AAA FARS Mid-Year Meeting, the 2015 AAA Annual Meeting, and the 2015 Conference on Convergence of Financial and Managerial Accounting Research for their helpful comments. We also thank Xiao Yu for insightful comments on methodology and coding, and Stephanie Grant, Jill Santore, and Jingpeng Zhu for excellent research assistance.
What are you saying? Using topic to detect financial misreporting
Abstract: Detection models of financial misreporting have evolved beyond basic quantitative or financial measures to include textual or linguistic characteristics of firms’ disclosures. While these textual analysis methods provide incremental power in identifying misreporting, they examine how content is being disclosed as opposed to what is being disclosed. This study introduces a novel fraud-detection measure, labeled “topic,” that quantifies the thematic content of financial statements. We derive our measure from a Bayesian topic modeling methodology called Latent Dirichlet Allocation (LDA). We then demonstrate the incremental predictive power of our topic measure in detecting intentional financial misreporting. We identify occurrences of financial misreporting using SEC enforcement actions (AAERs) and restatements arising from intentional misapplications of GAAP (i.e., irregularities). We find strong evidence that topic predicts intentional misreporting beyond financial and textual style characteristics. Furthermore, our results indicate that the detection power of financial metrics is subsumed by our topic measure in prediction models for both AAERs and restatements arising from irregularities.
Keywords: Topic, Disclosure, LDA, Financial Misreporting, Intentional Restatements
1 Introduction
Detection models of financial misreporting have long focused on firms’ financial characteristics and are often based on quantitative measures of extreme or abnormal performance (e.g., Beneish [1997]; Dechow, Ge, Larson & Sloan [2011]).1 One drawback of this approach is that financial misreporting can go undetected for long periods since many firms engage in earnings manipulation to blend in with either their peers or the firm’s own past performance [Lewis, 2013]. To address this weakness, recent studies have begun to examine linguistic and text-based measures as additional signals of financial misreporting. These signals include disclosure tone, sentiment, vocal dissonance, and the use of suspicious or deceptive words [Loughran & McDonald, 2011; Hobson, Mayew & Venkatachalam, 2011; Purda & Skillicorn, 2015; Larcker & Zakolyukina, 2012]. In practice, regulators, auditors, investors, and information intermediaries are employing textual analytic tools to reveal early warning signs of accounting fraud and misstatements. For example, the Securities and Exchange Commission (SEC) is currently incorporating basic lists of deceptive words and phrases from annual reports into its computer-powered fraud detection model, termed the Accounting Quality Model (AQM; Eaglesham [2013]). Auditors and accountants are also using textual analytic tools in assurance tasks such as fraud detection and compliance [Schneider, Dai, Janvrin, Ajayi & Raschke, 2015].2
While text analysis methods are quite promising in the detection of financial misreporting, many of the techniques used in prior research capture basic textual or linguistic characteristics of firms’ disclosures rather than their content. Examining the content of disclosures is important since firms engaging in accounting irregularities tend to make unusual content choices that are more difficult to classify as deceptive. As SEC officials note, firms that
1 Consistent with prior literature (e.g., Beasley [1996], Farber [2005], Dechow et al. [2011]), we use the terms misreporting, misstatement, manipulation, irregularity, and fraud interchangeably throughout the paper. While firms often do not admit to outright fraud, our data sources for identifying instances of intentional misreporting capture the more egregious cases on the accounting error-to-fraud continuum.
2 Furthermore, our discussions with data providers indicate the increased use of textual analytics to identify instances of accounting manipulations and other disclosure quality risks.
manipulate financial reporting rules tend to deflect attention from core problems by underreporting relevant risk factors compared to other firms [Lewis, 2013]. Further, prior research based on theories of deception provides evidence that the thematic content (or topic) of communication is typically chosen intentionally (e.g., Buller & Burgoon [1996]), suggesting that the topics contained in financial disclosures likely reflect the intentions of managers (as opposed to a manager’s own subconscious slant or optimistic or pessimistic beliefs). In line with these observations, our study introduces a textual analytic methodology that directly detects disclosure topics, i.e., what is being disclosed in annual 10-K filings (as opposed to how content is being disclosed). Using this unique measure (labeled “topic”), we evaluate the common types of topics discussed in the annual filings of misreporting firms and how these disclosure topics change over time. Lastly, we examine the incremental predictive power of our topic measure in detecting accounting misstatements relative to a comprehensive set of financial measures and textual style characteristics used in prior research.
To generate our topic measure, we employ a topic modeling technique developed by Blei, Ng & Jordan [2003], termed Latent Dirichlet Allocation (LDA). The LDA approach allows us to determine the proportion of each 10-K filing devoted to each topic detected by the algorithm. Topic modeling does not require preset definitions (referred to as dictionaries or word lists) or predetermined topic categories; instead, it relies on the basic observation that words frequently appearing together in text documents tend to be semantically related. Using unstructured cluster analysis, the LDA algorithm simply uses a set of 10-K filings to “learn” or generate the various topics discussed in firms’ annual reports in a given year. This method offers a unique advantage in that researchers are not required to know the topics commonly discussed in annual reports at a given point in time, and thus our own (preconceived) knowledge of the documents’ content does not bias our construction of the topic measure. Furthermore, the LDA method allows us to analyze the actual content of a large collection of financial statements, a task that would be infeasible for researchers to perform manually. This represents a significant step forward in the textual analysis literature
and goes beyond the basic “bag of words” or style analytics used in prior research.3
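To make the mechanics concrete, the sketch below implements a minimal collapsed Gibbs sampler for LDA in pure Python and recovers per-document topic proportions. This is our illustration only, not the authors' implementation (they likely used a standard LDA library); the priors, topic count, iteration budget, and toy corpus are all arbitrary choices for the example.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Minimal collapsed Gibbs sampler for LDA.
    docs: list of token lists. Returns per-document topic proportions."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    ndk = [[0] * n_topics for _ in docs]               # doc-topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                                # topic totals
    z = []                                             # topic assignment per token
    for d, doc in enumerate(docs):                     # random initialization
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zs)
    for _ in range(iters):                             # Gibbs sweeps
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                            # remove current assignment
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # full conditional P(z = k | everything else)
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                           for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    # per-document topic proportions (the basis of the paper's "topic" measure)
    return [[(ndk[d][t] + alpha) / (len(doc) + n_topics * alpha)
             for t in range(n_topics)] for d, doc in enumerate(docs)]

# toy corpus with two loose themes (revenue vs. litigation)
docs = [["revenue", "sales", "growth", "revenue", "sales"],
        ["lawsuit", "litigation", "settlement", "lawsuit"],
        ["revenue", "growth", "sales", "litigation"]]
theta = lda_gibbs(docs, n_topics=2)
```

Row `d` of `theta` gives the share of filing `d` attributed to each discovered topic; each row sums to one by construction.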
We conduct our fraud detection analyses using two sources for identifying the occurrence of accounting misstatements involving intentional GAAP violations. Our first source is a comprehensive sample of SEC Accounting and Auditing Enforcement Releases (AAERs) compiled by Dechow et al. [2011].4 These releases identify instances of formal SEC investigations of firms that manipulate their financial statements. Our second source of misstatements is an automated search for financial restatements arising from intentional misreporting (hereafter referred to as irregularity restatements), i.e., those restatements that involve intentional irregularities rather than unintentional misapplications of GAAP (errors). We identify restatements involving intentional irregularities based on the criteria discussed in Hennes, Leone & Miller [2008]. Specifically, we use an automated method to parse the text of amended 10-K filings for variants of the words “fraud” or “irregularity” in reference to accounting misstatements. We also search for references to restatements resulting from SEC or Department of Justice (DOJ) investigations, as well as references to other independent investigations.5
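The keyword screen described above can be sketched as a small regular-expression filter. This is our illustration, not the authors' code; the exact word variants and criteria (which follow Hennes, Leone & Miller [2008]) may differ in detail.

```python
import re

# Hypothetical sketch of the screen: variants of "fraud" or "irregularity",
# plus references to SEC/DOJ investigations within a short window of text.
MISREPORT_PAT = re.compile(
    r"\b(fraud(ulent(ly)?|s)?|irregularit(y|ies))\b"
    r"|\b(SEC|Securities and Exchange Commission|Department of Justice|DOJ)\b"
    r".{0,80}?\binvestigation\b",
    re.IGNORECASE | re.DOTALL)

def flags_intentional_misreporting(filing_text: str) -> bool:
    """True if the amended filing text matches the irregularity screen."""
    return MISREPORT_PAT.search(filing_text) is not None

print(flags_intentional_misreporting(
    "The restatement resulted from accounting irregularities."))  # True
print(flags_intentional_misreporting(
    "The restatement corrected a clerical error."))               # False
```

A production version would also need to confirm that matches occur in reference to the restatement itself, as the paper notes, rather than anywhere in the filing.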
While there is some overlap between our sample of AAER and financial restatement
firms, each data source has its unique advantages and disadvantages. As Dechow et al.
[2011] note, the AAER sample provides researchers with a high confidence level of fraud
detection since the SEC typically targets firms where there is strong evidence of accounting
manipulation. However, many misstatement events are likely to go undetected due to limited
SEC resources, and the investigated cases are likely to reflect the SEC’s selection criteria.
3 Although LDA is a word choice algorithm, it goes significantly beyond the naïve Bayes or simple “bag of words” approach used in prior research (e.g., word counts and rank-ordered word lists) by using the distribution of words across documents to discover actual content without the need for predefined or researcher-determined word lists. Our analyses control for the most commonly used word list measures, such as those based on the Loughran and McDonald and Harvard IV dictionaries.
4 We use the updated version of this dataset available from the Center for Financial Reporting and Management at UC Berkeley’s Haas School of Business.
5 We use an automated search to identify restatements since other data sources, such as Audit Analytics and the Government Accountability Office (GAO) database, provide less extensive coverage. For instance, restatement data is not available in Audit Analytics for periods prior to 2001, while the GAO database is limited to restatements announced from July 2002 to October 2006.
The irregularity restatement dataset spans a broader sample of accounting misstatements
and alleviates the concern of SEC-related selection biases. Nonetheless, the restatement
sample could be affected by changes in how firms disclose or discuss restatements within
their financial statements.
We build our fraud detection model by combining our unique topic measure with a comprehensive set of financial statement measures and textual style characteristics used in prior fraud research. Our financial statement variables follow closely from the Dechow et al. [2011] F-score model and include measures of accrual quality, earnings and cash flow performance, off-balance-sheet activities, and market-related incentives. Our text-based characteristics include measures of financial report readability, disclosure tone and emphasis, as well as basic lists of deceptive, litigious, and uncertain words constructed by Loughran & McDonald [2011].
Using out-of-sample tests of our AAER sample, we find that our topic measure provides significant incremental predictive power over commonly used financial metrics and textual style variables for detecting instances of accounting misstatement. In fact, a stand-alone model of disclosure topic is a better predictor of accounting fraud than models using financial or style characteristics, or models using both financial and style measures. We find similar results when we analyze our sample of irregularity restatements, i.e., topic adds significant incremental predictive power over financial and style variables. Interestingly, we find that a model including only topic and style (and excluding financial metrics) is most predictive of our irregularity restatements in out-of-sample tests. These results are robust to several sensitivity checks, including the use of alternative misreporting measures, regression specifications, and topic measure definitions. Our inferences also hold when we base our analyses solely on the topic content of the Management Discussion and Analysis (MD&A) section of firms’ 10-K filings.
Our study makes several important contributions to the literature. First, we extend prior research on financial misreporting by providing evidence that the topics discussed in firms’ 10-Ks are useful in identifying intentional misstatements above and beyond traditionally examined financial measures and style characteristics. Second, we expand the burgeoning research in accounting that examines the textual portion of corporate disclosures. Specifically, we exploit a robust textual analysis methodology, LDA, which directly quantifies what is being disclosed in 10-K filings (as opposed to how it is being disclosed). This thematic content analysis is adaptable to any type of financial disclosure of sufficient length. Further, since the topics that are indicative of intentional misreporting are likely to change over time given the fluidity of disclosures, our approach provides a significant improvement over a simple “bag of words” approach, where the list of deceptive words is fairly static and easily identifiable by firms. Lastly, our study has significant practical implications for regulators, investors, and practitioners, who continue to implement sophisticated initiatives aimed at detecting accounting violations. In particular, our results suggest that extracting information about what is being disclosed is a fruitful avenue for capturing high-risk accounting activity.
2 Background and Research Questions
2.1 Predicting Financial Misreporting
Over the past two decades, researchers have examined several different predictors of financial misreporting. Early work by Feroz, Park & Pastena [1991] and Beneish [1997, 1999] investigates the link between accounting misstatements and several measures of extreme or abnormal financial performance. In particular, Feroz et al. [1991] find that fraud firms identified from SEC enforcement actions (AAERs) have misstated receivables and inventory. Using a larger sample of fraud events gathered from both AAERs and the business press, Beneish [1997] finds that abnormal accruals, disproportionate increases in receivables, and poor abnormal market performance are significant predictors of financial misreporting. Beneish [1999] also finds that extreme firm performance based on indices of financial ratios is useful for detecting fraud.
Prior studies also find that stock and debt market pressures and firms’ internal monitoring mechanisms are strong predictors of accounting misstatements. For instance, Dechow, Sloan & Sweeney [1996] find that firms subject to SEC enforcement actions have lower free cash flow and higher leverage. In addition, these firms are more likely to violate debt covenants and tend to issue securities during the earnings manipulation period. Dechow et al. [1996] further find that fraud firms have weak internal governance mechanisms, proxied by insider dominance of the board of directors and the lack of an audit committee or outside blockholders. Beasley [1996], Beasley, Carcello, Hermanson & Lapides [2000], and Farber [2005] provide similar evidence of the association between governance mechanisms and the likelihood of fraud. Specifically, they find that fraud firms have lower percentages of outside board membership, less independent audit committees, fewer financial experts on the audit committee, fewer audit committee meetings, and lower-quality external audit firms. In sum, these results suggest that weak monitoring mechanisms provide managers with less internal oversight and, in turn, greater opportunities to engage in suspect accounting.
In a comprehensive study of AAERs, Dechow et al. [2011] investigate the fraud detection power of a battery of financial and nonfinancial measures.6 They find that poor accrual quality, increases in accrual components, declines in returns on assets, high stock returns, and abnormal reductions in the number of employees are strong predictors of accounting misstatements. They also find that fraud firms conduct aggressive off-balance-sheet and external financing transactions during misstatement periods. Using these variables, Dechow et al. [2011] develop a composite prediction score termed the F-score. They show that the F-score is a better predictor of both within-GAAP and aggressive accounting misstatements compared to traditional models of accrual management.
Recent research has started to explore the predictive value of language-based tools in
6 Dechow et al. [2011] do not examine corporate governance and incentive compensation variables because these variables are available for only limited samples. Our study follows the same approach to ensure that our results are generalizable to a wide set of firms.
detecting intentional financial misreporting. The basic premise of these studies is that the textual style or vocal qualities of management’s disclosures can be used as additional tools to identify accounting manipulations. At a basic word-choice level, Loughran & McDonald [2011] find that instances of fraud are associated with the use of negative, uncertain, and litigious words in annual reports. Larcker & Zakolyukina [2012] analyze the transcripts of conference calls and find that words related to deception are indicative of accounting misstatements and serve as better predictors than standard measures of discretionary accruals. Goel & Gangolly [2012] go a step further from word lists and examine linguistic qualities such as tone, tense, uncertainty, adverbs, and emphasis. They find significant differences in linguistic qualities across the full text of 10-Ks issued by firms that engage in financial irregularities versus those that do not. Cecchini, Aytug, Koehler & Pathak [2010] employ a dictionary approach with synonyms to detect financial misstatements. This approach allows the authors to take a broader look at management’s disclosures in annual reports as opposed to focusing solely on individual style characteristics.
Goel, Gangolly, Faerman & Uzuner [2010] present an expanded fraud detection model using a machine learning algorithm termed the Support Vector Machine (SVM) to classify annual reports containing irregularities. The SVM approach provides a significant improvement over prior text mining models, as the SVM model learns by example and does not require predefined fraud indicators. The SVM tool is trained to classify annual reports using both word dictionaries and writing style characteristics such as word usage, word and sentence length, readability, tone, the use of passive versus active voice, the frequency of uncertainty or hedge words, and other deeper linguistic style markers and keyword usage. Goel et al. [2010] find that their SVM approach improves the prediction accuracy of their fraud detection model by about 58% compared to a baseline model using a Naïve Bayes classification approach. This approach improves upon the word-list tools used in prior research by employing Naïve Bayes and SVM algorithms to classify annual reports based on vectors of common words detected in fraudulent reports. However, this algorithm is based on simple word counts and ignores the disclosure content of the annual report as well as relationships between words in a document.
Lastly, Purda & Skillicorn [2015] use SVM tools to distinguish fraudulent from truthful annual reports. In contrast to Goel et al. [2010], Purda & Skillicorn [2015] examine both annual and quarterly financial reports and compare the accuracy of their fraud detection model to both the predictive power of linguistic-based models and traditional models such as the Dechow et al. [2011] F-score model. The SVM approach in Purda & Skillicorn [2015] uses a learning algorithm to generate a rank-ordered list of words that are best able to capture fraudulent reporting. The authors find that their data-generated classification scheme outperforms textual-based prediction models built using pre-defined dictionaries as well as traditional models based on financial and non-financial measures.7
Our study extends this body of literature by constructing a direct text-based measure aimed at capturing the content of disclosures within firms’ financial statements. Our approach goes beyond traditional fraud detection models that are based primarily on quantitative information. Further, we draw on theories of deception to predict that the thematic content, or topics, a manager chooses is more likely to reflect the manager’s intentions, and thus intentional misreporting, than either the simple “bag of words” or textual style approaches used in prior textual analysis studies.
2.2 LDA Topic Modeling
We employ a topic modeling approach developed by Blei et al. [2003], termed Latent Dirichlet Allocation (LDA), to capture the disclosure content of annual reports. The LDA technique is widely used in the linguistics and information retrieval literature to quantify the thematic content (i.e., topics) of text corpora and other collections of discrete disclosure data (see Blei
7 In international settings, Kirkos, Spathis & Manolopoulos [2007] and Pai, Hsu & Wang [2011] use textual data mining techniques to detect intentional financial misreporting in Greek and Taiwanese firms, respectively.
[2012] for a review of topic modeling and its application to various text collections).8 We use
this approach to construct a firm-specific measure of the topics discussed in each 10-K filing
in a given year. This unique measure (defined as the normalized percent of the annual report
attributed to each topic identified by the algorithm) captures how much of each report is
devoted to discussing a particular disclosure topic.
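The measure just defined is a normalized share of each filing attributed to each topic. The excerpt does not pin down the normalization, so the sketch below shows one plausible reading: standardizing each topic's raw share across filings in a year so that shares are comparable across topics. The function name, z-score form, and toy numbers are ours, not the paper's.

```python
from statistics import mean, pstdev

def normalize_topic_shares(shares_by_firm):
    """shares_by_firm: {firm: [share of topic 0, share of topic 1, ...]}.
    Returns each topic share standardized across firms (z-score per topic)."""
    firms = sorted(shares_by_firm)
    n_topics = len(next(iter(shares_by_firm.values())))
    out = {f: [0.0] * n_topics for f in firms}
    for t in range(n_topics):
        col = [shares_by_firm[f][t] for f in firms]       # topic t across firms
        mu, sd = mean(col), pstdev(col)
        for f in firms:
            out[f][t] = (shares_by_firm[f][t] - mu) / sd if sd > 0 else 0.0
    return out

# toy LDA output: fraction of each 10-K devoted to each of two topics
shares = {"A": [0.70, 0.30], "B": [0.50, 0.50], "C": [0.30, 0.70]}
z = normalize_topic_shares(shares)
print(z["A"][0])  # positive: firm A devotes an above-average share to topic 0
```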
Topic modeling is relatively new to accounting and finance, and our measurement approach is consistent with recent studies that apply LDA to extract the disclosure content of large volumes of financial-related textual data. For instance, Curme, Preis, Stanley & Moat [2014] use LDA to identify the semantic topics within the large online text corpus of Wikipedia. The identified topics are then used to determine the link between stock market movements and how frequently Internet users search for the most representative words of each identified topic. Huang, Lehavy, Zang & Zheng [2014] employ LDA topic modeling to compare the thematic content of analyst reports and the text narrative of conference calls. Consistent with the information discovery role of analysts, Huang et al. find that analyst reports issued immediately after conference calls contain exclusive topics that were not discussed during the conference calls. Bao & Datta [2014] discover and quantify the various topics discussed in textual risk disclosures from annual 10-K filings (Item 1A). Their results indicate that about two-thirds of the identified risk topics are uninformative to investors, consistent with the notion that risk disclosures are largely boilerplate. Of the remaining topics, disclosures of systematic macroeconomic and liquidity risks increase investors’ risk perceptions, whereas topics related to diversifiable risks (i.e., human resources, regulatory changes, information security, and operation disruption) decrease investors’ risk perceptions.
Concurrent with our study, Hoberg & Lewis [2015] use topic modeling and cosine similarity to provide evidence on the content disclosed by firms involved in SEC enforcement actions (AAERs). Focusing on the MD&A section of 10-K filings, Hoberg & Lewis [2015] find that, relative to industry peers, AAER firms have abnormal verbal disclosure that is common among fraud firms. The topic analysis results indicate that AAER firms disclose more information about complex business issues such as acquisitions and foreign operations, and are more likely to grandstand their good financial performance. AAER firms also under-disclose certain topics, such as liquidity challenges, and provide fewer quantitative details explaining their performance.

8 In practice, topic modeling is used by search engines such as Google and Bing to improve correlations between search terms and web content. Search engine marketers are also applying topic modeling to guide keyword selection and optimize website content [Fishkin, 2014].
Our study extends Hoberg & Lewis [2015] in several respects. First, Hoberg & Lewis [2015] fit their LDA model using the text contained in the MD&A section (Item 7) of annual reports filed in only the first year of their sample period (1997-2008). This approach does not account for changes in disclosure topics over time and could induce ‘staleness’ in the topics used in their empirical analyses. Our study accounts for the dynamic nature of management disclosure by simultaneously discovering the topics and quantifying the attention each annual report dedicates to the estimated topics in a given year. We also employ a rolling-window procedure that predicts financial misreporting using the topics identified over the five years prior to the manipulation period. Second, unlike Hoberg & Lewis [2015], we analyze the thematic content of the entire 10-K filing as opposed to only the MD&A section. While the MD&A section provides a useful setting for examining disclosure content, it does not capture relevant content that is discussed in other sections of the annual report, e.g., risk factors (Item 1A), legal proceedings (Item 3), and executive compensation (Item 11). As we will show, topics identified in the MD&A section have less variation and significantly lower fraud detection power compared to topics identified from the full annual report. Last, and most important, our study goes a step further by demonstrating the incremental predictive power of thematic content for detecting fraud over and above traditional measures of financial performance and textual style characteristics.
2.3 Research Questions
While prior research suggests a link between accounting misstatement and various word choices and writing styles, the literature is unclear as to whether disclosure content is related to intentional misreporting. Our primary research question tackles this issue by investigating the association between disclosure topics and instances of financial misreporting.
We seek to understand the role of disclosure content in detecting intentional financial misreporting beyond the traditional financial performance and style characteristics examined in prior work. Disclosure topics may provide incremental detection ability since the thematic content of financial statements captures an aspect of managerial deception that is distinct from that of financial metrics. Specifically, regulatory oversight is more difficult for textual narratives, especially at the topic level, thus leaving more room for managers to use disclosure content as a means of diverting attention away from misstated financials. While prior research has identified a set of “lying words” (see, e.g., Newman, Pennebaker, Berry & Richards [2003]), it is more difficult to naïvely identify a set of “lying topics,” as these same topics may be benign or informative about actual performance in other settings or at other points in time. Furthermore, financial metrics in annual reports are primarily backward-looking, whereas textual disclosures contain a significant amount of forward-looking information and cover a wide range of topics [Bozanic, Roulstone & Van Buskirk, 2014]. As prior research suggests, forward-looking information is inherently more uncertain and less verifiable at the time of disclosure [Bonsall IV, Bozanic & Merkley, 2014]. We therefore explore whether the topics discussed in annual reports provide incremental predictive value beyond financial metrics in detecting financial misreporting. Our first research question is stated as follows:

Research Question 1: Topic provides predictive information beyond that of financial measures when detecting intentional financial misreporting.
We also investigate whether disclosure topics provide informational value beyond textual style characteristics. Style characteristics are broad textual metrics: they do not reflect the underlying content, but are simply summary statistics computed across the broad base of the document. For example, style characteristics may refer to textual complexity, readability, or formatting choices such as the use of bullets or the amount of whitespace, as well as grammar and word choice. Topic modeling, in contrast, captures the underlying distribution of content in the document, as opposed to just summary style statistics. Furthermore, the discussion of various topics within the annual report is more intentional than style characteristics, i.e., managers must select the thematic content of each annual filing and the attention dedicated to each topic within the document. Prior research on deception provides evidence that written topics are typically chosen intentionally. Specifically, theories of manipulation and deception suggest that individuals actively monitor the amount, veracity, relevance, and clarity of topics communicated (see, e.g., McCornack [1992]). Also, experimental evidence indicates that individuals adapt deception strategies of giving false answers, withholding relevant information, or giving evasive answers on demand, suggesting that choosing a topic is an intentional process [Buller & Burgoon, 1996].
It is unclear from prior work whether style characteristics such as length and readability reflect intentional manager choices to obfuscate or deceive, or simply reflect a manager’s own optimism or subconscious slant (see, e.g., Li [2008] and the related discussion in Bloomfield [2008]). Further, prior research suggests that word choice, in and of itself, is often subconscious (or without intent). Specifically, prior work in linguistics provides evidence that the use of function words (i.e., pronouns, prepositions, articles, conjunctions, and auxiliary verbs) is often without awareness and is difficult to control [Chung & Pennebaker, 2007]. While there is some evidence that individuals consciously choose abstract words to describe recurring events and concrete words to describe non-recurring events, goals to influence message recipients’ beliefs make it difficult to disentangle whether the communicator is being truthful or not. Thus, even for those words that a manager may intentionally choose, it has been shown to be difficult to discern intentional deception (see, e.g., Douglas & Sutton [2003]). In sum, while style features of disclosure, including word choice captured by commonly used dictionaries or word lists, may reflect a manager’s own optimism or subconscious slant, topic modeling is more likely to reveal managers’ intentional content choices. This discussion leads to our second research question:

Research Question 2: Topic provides predictive information beyond that of textual style characteristics when detecting intentional financial misreporting.
3 Data and Empirical Measures
3.1 Data and Sample Selection
We use annual 10-K filings retrieved from the SEC’s EDGAR system as the textual disclosure source for examining our research questions. We focus on annual reports because they 1) allow us to maximize the number of firm-year observations in our sample, 2) are comprehensive in their coverage of the firm and its activities throughout the fiscal year, and 3) avoid self-selection biases given their mandatory disclosure status.9 We download the full text of all 10-Ks available through the SEC EDGAR FTP site from January 1, 1994 (the first year such data is available) until December 31, 2012 (the final year of the AAER dataset, discussed below). This download yields 131,528 filings. We follow Li [2008] in parsing the 10-K filings, but expand this methodology to remove all items included in the documents other than raw text.10 We also restrict the words in the files to those contained in the standard Unix words dictionary, to remove typos and uncommon terminology.11 We describe our full parsing methodology in detail in Appendix A.1 of the online appendix. As
9We acknowledge that 10-Ks are not always timely sources for detecting financial misreporting. For instance, the filing of the 10-K can lag any occurrence of fraud by up to a year. Purda & Skillicorn [2015] highlight the added value of including quarterly report narratives in language-based fraud analyses. However, we choose not to include quarterly reports to ensure consistency in the disclosure content of the reports across firms' reporting periods.
10We construct measures for all textual items removed from the documents, some of which are included in our analyses.
11The standard dictionary, provided by the wamerican package in the official Debian repositories, contains 99,171 words. We also conduct robustness checks using no dictionary, the wamerican-huge dictionary, and the wamerican-insane dictionary. These checks confirm that the standard dictionary provides the best model performance in-sample, along with the most coherent topics.
discussed below, we gather data on accounting misstatements from the SEC AAER dataset
compiled by Dechow et al. [2011] and from disclosures of restatements due to intentional
misreporting in amended 10-K filings. We also gather financial statement and stock market
data from Compustat and CRSP, respectively, as opposed to the actual 10-K filings to ensure
consistency and accuracy across our sample.
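The dictionary-restriction step of the parsing can be sketched as follows. This is a minimal illustration, not the authors' parser; the in-memory word list is a stand-in for the wamerican dictionary file (in practice one would read a file such as /usr/share/dict/words):

```python
import re

# Hypothetical stand-in for the Debian wamerican word list used in the paper.
def load_dictionary(words):
    return {w.strip().lower() for w in words}

def filter_tokens(text, dictionary):
    # Keep only alphabetic tokens that appear in the reference dictionary,
    # dropping typos and uncommon terminology.
    tokens = re.findall(r"[a-zA-Z]+", text.lower())
    return [t for t in tokens if t in dictionary]

vocab = load_dictionary(["revenue", "decreased", "the", "company"])
filtered = filter_tokens("Revenue decresed; the Company's revenue fell.", vocab)
```

Here the misspelling "decresed" is dropped because it is not in the word list, mirroring the paper's rationale for the dictionary restriction.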
3.1.1 Identifying Intentional Financial Misreporting
We use two data sources to identify instances of intentional financial misreporting. Following
Dechow et al. [2011], our first data source uses SEC AAERs to classify firms engaging in
material accounting misstatements. We focus on misstatements occurring during the annual
reporting period. We exclude quarter-period misstatements to ensure that the measurement
period for our prediction variables is consistent across firms. We create an indicator variable
(𝑚𝑖𝑠𝑟𝑒𝑝𝑜𝑟𝑡) that equals 1 for each fiscal year identified as misstated by the SEC, and zero
otherwise. We use this indicator variable to classify those 10-K filings containing potential
GAAP violations.
Our second data source is a customized automated search for occurrences of financial
restatements that are seemingly due to intentional misapplications of GAAP (irregularity
restatements). We use the classification methodology discussed in Hennes et al. [2008] to
develop a customized identification tool.
Our customized tool performs well in capturing financial misreporting in our sample, as a
manual inspection of irregularities identified by the search tool indicates that the misstated
financial reports contained material and intentional misapplications of GAAP. To identify
irregularity restatements, we download all amended 10-K filings (10-K/As) from the SEC
EDGAR FTP site. We gather firm-identifying information for matching purposes from the
header (or alternately from the body of the text when the header is missing or incomplete),
and then parse the 10-K/A in a manner similar to our parsing of unamended 10-Ks (see
Appendix A.1 of the online appendix). After parsing the filings, we follow Hennes et al.
[2008] and search the text for direct statements of the occurrence of financial reporting
irregularities or narratives referring to the investigation of misstatements by either regulatory
or independent parties. Appendix A describes our full search terms.
We search for phrases such as “fraud,” “materially false and misleading,” and “violation
of federal securities laws” to identify restated filings with direct discussion of irregularities.
We identify restatements with related regulatory investigations based on narratives refer-
ring to investigation by the SEC, the DOJ, or by an Attorney General. Restatements with
independent investigations are classified based on discussion of investigations by forensic ac-
countants, the audit committee or an independent committee, as well as statements referring
to the retention of legal counsel over the misstatement. Based on this identification strategy,
we classify each 10-K filing as misstated if our search of the corresponding 10-K/A detects
narratives reflecting an irregularity as detailed above. We then code 𝑚𝑖𝑠𝑟𝑒𝑝𝑜𝑟𝑡 as 1 for those
firm-years with misstated annual reports; 𝑚𝑖𝑠𝑟𝑒𝑝𝑜𝑟𝑡 equals 0 if there is no amended 10-K
for the respective fiscal year or if the amended 10-K filing does not involve an irregularity.
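A minimal sketch of this phrase-based classification follows. The term lists here are a small hypothetical subset; the full search terms appear in Appendix A of the paper:

```python
import re

# Hypothetical subset of the Hennes et al. [2008]-style search terms.
DIRECT_TERMS = [
    "fraud",
    "materially false and misleading",
    "violation of federal securities laws",
]
INVESTIGATION_TERMS = [
    "investigation by the sec",
    "forensic accountants",
    "independent committee",
]

def is_irregularity(filing_text):
    # Collapse whitespace so phrases match across line breaks in the filing.
    text = re.sub(r"\s+", " ", filing_text.lower())
    return any(t in text for t in DIRECT_TERMS + INVESTIGATION_TERMS)
```

A filing flagged by `is_irregularity` would have its fiscal year coded as misstated; untouched amendments are left as zero.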
3.2 Empirical Measures
3.2.1 Financial Measures
We build our fraud prediction model by including the financial statement and market-related
variables selected for the Dechow et al. [2011] F-Score model.12 We focus on the set of vari-
ables providing the highest predictive power as reported in Dechow et al. [2011] (see Model
3 in Table 9). The financial statement variables include measures of accrual quality, firm
performance, off-balance-sheet activities, and market-based incentives. Panel A of Appendix
B defines each of the variables outlined below. The accrual quality measures include an ex-
tended definition of working capital accruals as developed in Richardson, Sloan, Soliman &
Tuna [2005] (termed RSST accruals). The RSST accruals measure captures the change in
12In robustness tests, we find that our results hold when we include standard financial ratios and bankruptcy prediction measures, consistent with prior fraud studies (e.g., Beneish [1997], Cecchini et al. [2010]).
noncash net operating assets.13 We also measure the change in receivables and the change
in inventory since misstatement of these two accrual components affect widely-used perfor-
mance metrics, namely, sales growth and gross margin. Lastly, the percent of soft assets
on the balance sheet captures accounting flexibility, and in turn, the room for managerial
discretion in changing the measurement assumptions of net operating assets in order to meet
short-term performance goals.
Our performance measures capture managerial incentives to manipulate their financial
statements to mask poor firm performance. These measures include the change in cash sales
and the change in return on assets. To gauge the extent to which firms engage in off-balance-
sheet financing, we include an indicator variable to identify firm-years with nonzero future
operating lease obligations. The GAAP rules for operating leases lead to lower expenses
being booked to the income statement in the early years of the asset's life. Thus, the existence
of operating leases proxies for managers’ propensity to window-dress financial performance.
Lastly, the issuance of securities in a given firm-year, the book-to-market ratio, and the
market-adjusted stock return of the prior fiscal year all capture market-related pressures to
engage in fraudulent reporting.
3.2.2 Style Measures
Style characteristics are in essence simple summary statistics of textual information. Since
our main construct of interest, disclosure topic, is derived from text, the ability to detect
financial misreporting beyond simple style characteristics is quite important. As such, we
benchmark our topic measure against a comprehensive set of style characteristics from prior
literature, as well as four new measures developed from our analysis. Panel B of Appendix
B presents a full list of the style variables and their measurement.14
13We do not include other measures of discretionary accruals (e.g., modified Jones and performance-matched discretionary accruals) as Dechow et al. [2011] find that these measures perform poorly in detecting accounting manipulation compared to unadjusted accrual measures.
14Our results are robust to a large vector of alternative style characteristics. This vector includes a full battery of processing measures (a variable for each part removed from the filing), median word, sentence, and paragraph lengths (in addition to the already included mean lengths), Harvard IV dictionary measures,
Our new measures are the log of the number of bullets, the length of the SEC mandated
header, number of excess newlines (vertical whitespace) in the filings, and the character
length of HTML tags. The log of the number of bullets captures an aspect of readability,
as bulleted information is typically concise. We find considerable variation in this mea-
sure, as 13.5% of our sample filings do not contain bullets, while 10% of the filings include
over 1,400 bullets. The SEC header contains basic corporate and filing form information
such as company name and address, SIC industry, form type, and filing date. Filings with
long headers generally identify firms that operated under former company names in prior
years.15 We therefore expect the SEC header length to be correlated with disclosure com-
plexity attributable to complex firm transactions that have been shown to be correlated with
fraudulent activity (e.g., mergers, acquisitions, and corporate restructurings).
Excess newlines (vertical whitespace) increase the length of the 10-K filing without adding
any substantive content. Managers engaging in financial misreporting could insert additional
whitespace to keep the length of the filing consistent with filings by peer firms or the firm’s
own prior filings while omitting some pertinent information. We include the character length
of HTML tags in the unparsed documents as a broad measure of technological expertise or
savvy. This proxy attempts to distinguish between documents created using a basic word
processing program, e.g., Microsoft Word (which should embed numerous HTML tags),
versus documents created by more specialized financial software.
The second group of textual style variables are filing length and style structure measures,
in the vein of Li [2008] and Goel et al. [2010]. We use these length and structure measures
as additional proxies for disclosure readability and textual complexity.
The selected variables include the mean and standard deviation of the length of words,
sentences, and paragraphs in the 10-K filing, as well as measures of sentence repetition and
six alternative readability measures, a variable capturing every part of speech coded in the Brown corpus, total and tagged word counts, two other measures of sentence repetition, and deviation from the Benford distribution. We find that the majority of these variables are highly correlated with the style characteristics selected for our primary analyses.
15Former company names and the date of each name change are disclosed in a separate block of headerfields. Firms can enter up to three former names in the EDGAR system.
type-token ratio (see Goel et al. [2010]; Li [2014]). The type-token ratio (number of unique
words scaled by the number of total words) measures vocabulary variation and, consistent
with Rennekamp [2012], captures the idea of superfluous words, as a higher ratio indicates the
use of a broader vocabulary. We also compute the percent of short and long sentences (≤ 30
or ≥ 60 words, respectively) contained in the filing. We include two complementary measures
of readability: the Gunning Fog Index and the Coleman-Liau Index. These measures are
widely used in the accounting literature to indicate disclosure inefficiencies or misinformation
[Goel et al., 2010; Lehavy, Li & Merkley, 2011; Li, 2008; Rennekamp, 2012].
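A few of these structure measures can be computed with a short sketch. This is illustrative only; the production definitions of each variable are given in Panel B of Appendix B:

```python
import re

def style_stats(text, short=30, long=60):
    # Split the filing into sentences and lower-cased word tokens.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    lens = [len(re.findall(r"[a-z']+", s.lower())) for s in sentences]
    return {
        # Type-token ratio: unique words over total words.
        "type_token_ratio": len(set(words)) / len(words),
        # Percent of short (<= 30 words) and long (>= 60 words) sentences.
        "pct_short": sum(n <= short for n in lens) / len(lens),
        "pct_long": sum(n >= long for n in lens) / len(lens),
    }
```

A higher type-token ratio indicates a broader vocabulary, consistent with the Rennekamp [2012] interpretation discussed above.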
Our final set of textual measures comprises a battery of language and word choice vari-
ables. First, we measure language voice (active and passive), which has been shown to
correlate with the incidence of financial misreporting [Goel et al., 2010; Goel & Gangolly,
2012]. Consistent with Purda & Skillicorn [2015], we measure word choices using the six word
dictionaries defined in Loughran & McDonald [2011]. These dictionaries contain lists of
financial-related words that capture disclosure tone and the use of uncertainty and litigious
vocabulary. We also include three measures of disclosure emphasis: the use of capitalized
words, exclamation points, and question marks (see Goel & Gangolly [2012]).
3.2.3 LDA Topic Measure
Our topic measure is based on the unstructured and unsupervised LDA topic modeling
methodology developed by Blei et al. [2003].16 We choose this approach due to its intu-
itive characteristics and strong performance. In particular, LDA is a Bayesian probabilistic
model and offers significant theoretical improvement over older data-driven and principle-
component-based tools such as Latent Semantic Analysis (LSA). Furthermore, the topic
16For predictive purposes, Mcauliffe & Blei [2008] develop a supervised LDA model (sLDA) which allows each text document to be paired with a response variable that classifies each document. The goal of the sLDA model is to infer disclosure topics predictive of the response. This response variable would be misreport in our setting. We refrain from using the sLDA model for two reasons. First, the unsupervised LDA model allows us to provide a baseline for the common disclosure topics contained in annual reports, irrespective of misreporting. Second, Mcauliffe & Blei [2008] find that the prediction performance of sLDA is equivalent to LDA in text corpora with difficult-to-predict responses.
modeling accuracy of LDA is quite strong when compared to human classification of topics
or other unsupervised machine-learning algorithms such as LSA-IDF or LSA-TF.17 In an experiment,
Anaya [2011] finds that humans classify main topics with 94% accuracy, while LDA achieves
84% accuracy. Comparable accuracy statistics were 84% for LSA-IDF and 59% for LSA-TF.
While the accuracy of human classification is greater than that of LDA, the human approach
is infeasible when classifying large volumes of textual data. In fact, the LDA tool allows us
to categorize the disclosure content of annual reports containing text narratives of over 3 bil-
lion words, allowing for rigorous testing that otherwise would be impossible based on human
topic classifications.
The LDA model is based on a few simple assumptions. The model assumes a collection
of 𝐾 topics in a given text document and that the vocabulary of each topic is distributed
following a Dirichlet distribution, 𝛽𝐾 ∼ Dirichlet(𝜂).18 The model further assumes that
the topic proportions in each document 𝑑 are drawn from a Dirichlet distribution 𝜃𝑑 ∼
Dirichlet(𝛼). Given these assumptions, a specific number of topics to identify, and a few
learning parameters, the LDA model categorizes the words in a given set of documents
into well-defined topics. Because the model uses Bayesian analysis, a word is allowed to be
associated with multiple topics. This is a convenient feature of LDA, as words can have
multiple meanings, especially in different contexts. In sum, the generative process of LDA
can be viewed as a probabilistic factorization of the vocabulary in a collection of documents
into a set of topic weights and a dictionary of topics.
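The generative process just described can be simulated in a few lines. This is a toy illustration with made-up sizes, not the estimation procedure; the paper's model uses K = 31 topics over 10-K text:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, doc_len = 3, 8, 50  # topics, vocabulary size, words in one document

# Each topic is a distribution over the vocabulary: beta_k ~ Dirichlet(eta).
beta = rng.dirichlet(np.full(V, 0.05), size=K)
# Each document mixes topics: theta_d ~ Dirichlet(alpha).
theta = rng.dirichlet(np.full(K, 0.05))

# Generate the document word by word: draw a topic, then a word from it.
topic_of_word = rng.choice(K, size=doc_len, p=theta)
words = np.array([rng.choice(V, p=beta[z]) for z in topic_of_word])
```

Because each word is drawn through its own topic assignment, the same vocabulary term can be generated by several topics, reflecting the multiple-meaning feature noted above.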
We implement the LDA algorithm using a dynamic time-series process since we ex-
pect disclosure content to change across time due to factors such as macroeconomic condi-
tions, technological changes in business operations, regulatory interventions (e.g., the 2002
Sarbanes-Oxley Act), and changes in firm management. Consequently, this approach allows
us to assess the changing nature of disclosure content and its ability to predict account-
17LSA-IDF and LSA-TF are LSA-based measures using a term-document matrix that has undergone a transform: inverse document frequency or term frequency, respectively.
18A Dirichlet distribution is essentially a multivariate generalization of a beta distribution.
ing misstatements. Our time-series procedure identifies the topics discussed in each rolling
five-year window over our sample period (1994 – 2012). That is, we run the algorithm for
the periods 1994 – 1998, 1995 – 1999, 1996 – 2000, and so on. The topics discovered in each
window are then used to determine the disclosure content of annual reports issued in the
year immediately following each five-year window. This results in a test period of 1999 – 2012
for our prediction analyses. Note that while new topics may arise in the year after each
window, the topics discussed in the prior five years provide the most practical estimates of
current-year disclosure content while avoiding potential look-ahead biases in our prediction
tests.
For our implementation of the algorithm, we follow Hoffman, Bach & Blei [2010] and
use an “online” variant of LDA. This approach allows us to run the algorithm in small
batches and to classify large quantities of text without encountering computational barriers.
We run the online LDA algorithm in batches of 100 filings since small batches are more
computationally efficient given the large sizes of 10-K filings. We draw the filings in each
batch in random order to mitigate overweighting of early years in the online LDA tool.
Consistent with Hoffman et al. [2010], we use symmetric Dirichlet distributional parameters
of 𝛼 = 𝜂 = 1/20 and the learning parameters 𝜅 = 7/10 and 𝜏0 = 1024. The learning parameter
𝜅 controls how quickly old information is forgotten, while parameter 𝜏0 downweights early
iterations of the model. Hoffman et al. [2010] document that these distributional and learning
parameter settings are optimal when categorizing articles from the science journal Nature, as
well as categorizing text from Wikipedia. We then set the algorithm to identify 31 topics in
each five-year window. We select 31 topics since simulated results indicate that this number
of topics is optimal in capturing the occurrence of irregularity restatements (see Appendix
A.2 of the online appendix for a description of this simulation).19
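The role of the two learning parameters is easiest to see in the online step-size schedule of Hoffman et al. [2010], sketched below:

```python
# Step-size schedule of online LDA (Hoffman et al. [2010]):
# rho_t = (tau0 + t) ** (-kappa), with the paper's kappa = 7/10, tau0 = 1024.
kappa, tau0 = 0.7, 1024.0

def learning_rate(t):
    # tau0 downweights early mini-batches; a larger kappa forgets old
    # information more quickly.
    return (tau0 + t) ** (-kappa)
```

Each mini-batch of 100 filings updates the variational parameters with weight `learning_rate(t)`, which decays as more batches are processed.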
Next, we pre-process the parsed 10-K filings by first removing stop words. Stop words
are those deemed irrelevant for our text-based measures because they occur either frequently
19We run the simulation on irregularity restatements given the lower occurrence of SEC AAERs.
(e.g., ‘the’, ‘an’, ‘is’) or are too infrequent to be of use in fraud prediction (such cases were
often garbled text or misspellings in the 10-K filings). Because our analysis uses rolling five-
year windows, we generate our stopwords on matching five-year windows to avoid potential
look-ahead biases. We remove three types of stopwords: 1) the most frequent words appear-
ing in each rolling five-year window of our sample period until we have removed 60% of all
word occurrences in the window, 2) words that occur fewer than 1,100 times in the window,
and 3) words that occur in fewer than 100 filings. These parameters are also derived in our
simulation (see Appendix A.2 of the online appendix).
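The three-part stopword rule can be sketched as follows. This is illustrative; the thresholds are exposed as arguments so that toy inputs can use smaller cut-offs than the paper's 60%, 1,100, and 100:

```python
from collections import Counter

def build_stopwords(doc_tokens, freq_cut=0.60, min_count=1100, min_docs=100):
    # doc_tokens: one token list per filing in the five-year window.
    counts = Counter(t for doc in doc_tokens for t in doc)
    doc_freq = Counter(t for doc in doc_tokens for t in set(doc))
    total = sum(counts.values())

    stop, removed = set(), 0
    for word, n in counts.most_common():
        if removed / total >= freq_cut:  # 1) most frequent words, until
            break                        #    60% of occurrences are removed
        stop.add(word)
        removed += n
    stop |= {w for w, n in counts.items() if n < min_count}    # 2) rare words
    stop |= {w for w, n in doc_freq.items() if n < min_docs}   # 3) few filings
    return stop
```

Running this once per rolling five-year window, as the paper does, keeps the stopword list free of look-ahead information.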
We run the LDA algorithm on the pre-processed filings, generating 31 topics in each
rolling window and the weighting for each word associated with the topic. Using these word
weights, we compute the weight of each topic in filings issued in the year following the five-
year window. For example, the weighted word vectors for the topics identified in the 1994 –
1998 window are used to determine the topic weights in filings issued in 1999. To compute
the topic weights in a given filing, we multiply the vector of word weights within the topic by
a vector of word counts for the filing. We then normalize the weight of the topic by the sum
of the weights of all topics identified in the filing. This procedure generates the proportion
of the content of each document that is associated with each topic. We denote these topic
proportions as topic.
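The topic-weighting computation reduces to a dot product followed by normalization, as in this sketch (illustrative only, with tiny made-up weight dictionaries):

```python
def topic_proportions(word_counts, topic_word_weights):
    # word_counts: {word: count} for one filing.
    # topic_word_weights: one {word: weight} dict per topic, estimated on
    # the preceding five-year window.
    raw = [sum(w.get(word, 0.0) * n for word, n in word_counts.items())
           for w in topic_word_weights]
    total = sum(raw)
    # Normalize so that the proportions across topics sum to one.
    return [r / total for r in raw]
```

For example, a filing mentioning "oil" twice and "lease" once is scored against each window topic's weight vector, and the resulting vector of proportions is the filing's topic measure.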
3.3 Validation of LDA Topic Measure
Before investigating our research questions, we validate our topic measure using several
methods. Following prior research (e.g., Bao & Datta [2014], Huang et al. [2014]), our
first method evaluates the semantic validity of the LDA output by labeling the topics and
assessing the extent to which the topics provide meaningful content. As discussed above,
we derive our topic measure using a rolling-window approach with 31 topics identified in
each of the 14 rolling five-year windows over our sample period. For ease of interpretation,
we aggregate the topics discovered in each window up to the full sample. We refer to these
aggregate topics as “combined topics.” Since the optimal number of topics in each window
can vary, we allow multiple topics within a window to be associated with the same combined
topic. We also allow the number of combined topics to be greater than 31 since some topics
may not be present in all of the five-year windows. We derive the combined topics by
matching topics across years based on the Pearson correlation of the word weights within
the topics. We group all topics with a Pearson correlation above a specific threshold. We
test thresholds for the Pearson correlations from 1% to 90% in 1% intervals to determine the
most coherent grouping. The most coherent topics emerge when the Pearson correlation
threshold is set at 11%.20 This matching procedure results in 64 combined topics across
our sample period.
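The grouping step can be sketched as a union-find over above-threshold correlations. This is illustrative; the actual procedure also tracks the window structure of each topic:

```python
import math

def pearson(x, y):
    # Plain Pearson correlation of two equal-length weight vectors.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def combine_topics(weight_vectors, threshold=0.11):
    # Merge any two topics whose word-weight vectors correlate above the
    # threshold; return a group label for each topic.
    parent = list(range(len(weight_vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(weight_vectors)):
        for j in range(i + 1, len(weight_vectors)):
            if pearson(weight_vectors[i], weight_vectors[j]) > threshold:
                parent[find(j)] = find(i)
    return [find(i) for i in range(len(weight_vectors))]
```

Topics sharing a group label form one "combined topic," allowing multiple within-window topics to map to the same combined topic as described above.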
To determine the underlying content of each combined topic, we generate a list of the
highest weighted phrases and sentences associated with each topic. We construct the list
by first extracting the top 1,000 sentences per topic based on the weighted words associated
with each combined topic. Next, we sort the sentences based on their length and extract
the middle tercile (334 sentences) as representative sentences with a typical length. We
then extract the top 20 most frequent bigrams (i.e., two-word phrases excluding stopwords,
numbers, and symbols) within the 334 mid-length sentences. We also sort the 334 sentences
based on the cosine similarity between a given sentence and the remaining 333 sentences.
We manually review the top 20 bigrams and top 100 mid-length sentences based on cosine
similarity and assign descriptive labels to each of the 64 topics.
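The bigram-extraction step can be sketched as follows (illustrative only; the paper's version also excludes numbers and symbols and works over the 334 mid-length sentences per topic):

```python
import re
from collections import Counter

def top_bigrams(sentences, stopwords, k=20):
    # Count adjacent word pairs after dropping stopwords, then return the
    # k most frequent pairs.
    counts = Counter()
    for s in sentences:
        words = [w for w in re.findall(r"[a-z]+", s.lower())
                 if w not in stopwords]
        counts.update(zip(words, words[1:]))
    return [bg for bg, _ in counts.most_common(k)]
```

The resulting phrases, together with the cosine-ranked sentences, are what the manual labeling step reviews.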
Appendix C presents a list of the 64 combined topics with 10 selected bigrams per topic
and our inferred topic labels.21 We report 10 representative bigrams after excluding re-
20The first pass of this test determined that the optimal correlation threshold ranged between 8% and 18%. We then conduct tests of this threshold range in 0.05% increments to locate the 11% cut-off point. We also compare the combined topics generated by groupings based on Spearman correlation and Euclidean distance. Both of these alternative methods performed poorly due to overweighting on words with low topic weightings, leading to incoherent topic groupings.
21The inferred labels for a few topics are overlapping due to only minor differences in the content inferred
dundant bigrams (e.g., “millions in,” “company also,” “in year”) and those with similar
inferences (e.g., “compared to” and “compared with” in topic 2, or “derivative financial”
and “financial derivative” in topic 9). We note that the LDA algorithm performs well in
identifying narrative content that is distinctively related to changes in firms’ financial per-
formance. For instance, topics 1 and 2 both refer to the firm's income performance compared
to prior periods. Examples of top mid-length sentences from topic 1 include the following:
“Other income decreased to $11,745,000 in 1999 as compared to $11,882,000 in 1998 and
$10,521,000 in 1997” and “Management fee income decreased to $0 as compared to $1.4
million in 1997.” Other topics related to financial performance include segment performance
(topics 16 and 54), franchise revenues (topic 26), and general references to quantitative fi-
nancial statement information (topics 7, 34, 62, and 63). LDA also identifies topics related
to complex business transactions and arrangements such as corporate spin-offs (13 and 64),
derivatives and hedging activities (9), fair value/cash flow hedging (41), merger activities
(31), R&D partnerships (32), joint venture agreements (39), strategic alliances (46), and
investment in securitized/guaranteed securities (55).
Several topics also refer to specific financial statement items and/or their underlying
measurement assumptions such as post-retirement health care cost assumptions (4), account
receivables and doubtful accounts (12), long term assets (25), advertising expenses (36), and
the measurement of natural gas properties (38). Consistent with Huang et al. [2014], we are
able to identify industry-specific topics such as aircraft leasing arrangements in the airline
industry, franchise revenue recognition and restaurant growth in the restaurant industry,
as well as general discussion of business risks and operational factors in the agricultural,
gaming, mining, marine transportation, and hotel industries. Lastly, as demonstrated in
Bao & Datta [2014], LDA effectively discovers content related to common risk factors and
contingencies such as foreign currency risks (57), country risks (18 and 37), environmental
liabilities and risks (6 and 56), patent infringement and rights (48), and legal proceedings
from the bigrams and mid-length sentences. We treat these topics separately in our empirical analyses to mitigate any noise introduced by our topic aggregation process.
(45). In summary, the evidence in Appendix C suggests that the LDA algorithm provides a
valid set of economically meaningful topics.
Our next validation method evaluates whether the disclosure topics perform reasonably
well in detecting misstatements using in-sample tests. Figures 1 and 2 depict the distribution
of each combined topic over the 1999 to 2012 period (our misreporting prediction years) and
whether the topic is significantly associated with the occurrence of financial misreporting.
Figure 1 (2) illustrates the distribution for the sample of irregularity restatements (AAERs).
We determine the significance of the combined topics by estimating yearly in-sample regres-
sions of the disaggregated subtopics (i.e., the topics associated with a given combined topic
in each year) on our misreport indicator variable. We orthogonalize the subtopic proportions
to 2-digit SIC industries to control for unobserved industry effects.
We observe in both figures that the discussion of several topics is relatively consistent
across the sample years. These topics include changes in income performance (topics 1 and
2), measurement of post-retirement benefits (3), and industry-specific topics such as aircraft
leasing arrangements (4) and real estate loan operations (10). Other topics appear later in
the sample period, indicating the evolving nature of firms' disclosure content. For instance,
discussions of collaborative business arrangements such as joint ventures (39), strategic al-
liances (46), and partnerships (51) are more prominent in the second half of our prediction
period.
With respect to the ability to detect misreporting, Figure 1 illustrates that discussion of
increases in income performance compared to prior periods (combined topic 2) is significantly
associated with irregularity restatements in most of our prediction years. However, the
direction of the significance is not consistent throughout the sample period. We also observe
that discussion of declines in income performance (topic 1) is significant in relatively few
years in our sample. These results suggest that the association between misreporting and
managerial discussion of financial performance is not as clear cut as suggested by prior work
on the relation between fraud and poor financial performance.
The results in Figure 1 also suggest that misreporting firms are more likely to discuss
issues related to share capital transactions, investments in securitized/guaranteed securities,
environmental risks, foreign operations, and growth in franchised operations. Combined
topics that load consistently negative include discussions of merger activities, joint venture
arrangements, fair value/cash flow hedging, legal proceedings, and stock option plans, sug-
gesting that restatement firms are less likely to discuss these issues in misstatement years.
The results for AAER firms (Figure 2) are similar, but with some variation in the timing of
the topic loadings. AAER firms are less likely to discuss segment performance and declines
in income performance, primarily in the earlier years of the sample. Taken together, our
evidence in Figures 1 and 2 suggests that disclosure content provides significant informational
value for detecting misstatement events. These results provide us with greater confidence
for investigating the fraud prediction performance of topic relative to financial statement
variables (RQ1) and textual style characteristics (RQ2).
4 Empirical Methodology and Results
4.1 Empirical Methodology
To investigate our research questions, we conduct both in-sample and out-of-sample tests
using our time-series approach. We first estimate in-sample prediction models using rolling
five-year windows. We then conduct out-of-sample tests using the estimates from each five-
year window to predict the likelihood of intentional misreporting in the year following each
window.22,23
We begin our analyses by estimating logistic regressions of our misreporting indicator
22For filings coded as an irregularity restatement, we ensure that the restatement is revealed by the end of the in-sample window. We are unable to apply this restriction in the AAER sample as the UC Berkeley dataset does not include the release dates of the AAERs.
23For example, the estimated results for 1994 – 1998 (1995 – 1999) are used to predict misreporting for a holdout sample of firms in 1999 (2000) and so on.
variable (𝑚𝑖𝑠𝑟𝑒𝑝𝑜𝑟𝑡) on vectors of the disaggregated topic proportions (𝑡𝑜𝑝𝑖𝑐) as follows:
$$\log\left(\frac{misreport_{i,t}}{1 - misreport_{i,t}}\right) = \alpha + \sum_{j=1}^{31} \beta_j \, topic_{j,i,t} + \varepsilon_{i,t}, \qquad (1)$$

$$t \in [T-5,\, T-1], \quad i \in \text{Companies}$$
We estimate equation (1) for the five-year window preceding each of the prediction years,
1999 to 2012. For our AAER specification, we lack sufficient instances of financial misre-
porting for our out-of-sample test for the year 2012, and thus we remove this year from the
specification. We remove out-of-sample test years with insufficient misreporting events when
conducting analyses on various subsamples. We then use the estimated regression coefficients
to predict the likelihood of intentional financial misreporting for 10-K filings in the subse-
quent year. Similar to Dechow et al. [2011], we construct a prediction score (𝑝 𝑚𝑖𝑠𝑟𝑒𝑝𝑜𝑟𝑡)
using the estimated coefficients and apply this scoring in our out-of-sample tests as follows:
$$\log\left(\frac{misreport_{i,T}}{1 - misreport_{i,T}}\right) = \alpha + \beta_1 \, p\_misreport_{i,T} + \varepsilon_{i,T}, \quad i \in \text{Companies} \qquad (2)$$
To investigate Research Question 1 (RQ1), we estimate two additional regression specifi-
cations. The first specification replaces the topic vector with the vector of financial variables
discussed previously, whereas the second specification extends equation (1) by including both
vectors of 𝑡𝑜𝑝𝑖𝑐 and the financial variables. In both cases, we generate a 𝑝 𝑚𝑖𝑠𝑟𝑒𝑝𝑜𝑟𝑡 mea-
sure and run the out-of-sample tests as well. These two specifications allow us to gauge the
incremental fraud-prediction ability of 𝑡𝑜𝑝𝑖𝑐 beyond traditional financial statement variables.
For our second research question (RQ2), we introduce four specifications that include
style characteristics. The first specification includes our style characteristics with the second
including both financial variables and style characteristics. The third and fourth specifi-
cations are expanded versions of the first two models with the 𝑡𝑜𝑝𝑖𝑐 vector inserted. Our
general regression form for RQ2 is specified below in equation (3):
$$\log\left(\frac{misreport_{i,t}}{1 - misreport_{i,t}}\right) = \alpha + \sum_{j=1}^{10} \beta_j \, financial_{j,i,t} + \sum_{j=1}^{30} \beta_{j+10} \, style_{j,i,t} + \sum_{j=1}^{20} \beta_{j+40} \, topic_{j,i,t} + \varepsilon_{i,t}, \qquad (3)$$

$$t \in [T-5,\, T-1], \quad i \in \text{Companies}$$
Due to the large number of variables included in the regressions and the naturally small
number of AAERs and intentional restatements in our windows, we tightly control the
convergence of our logistic regressions. Specifically, we check each regression specification
for both complete and quasi-complete separation. Appendix A.3 of the online appendix
details the steps for conducting these checks.
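As a rough illustration of these convergence checks, the snippet below flags predictors exhibiting complete or quasi-complete separation of a binary outcome, which is the standard cause of non-convergence in logistic regression. The paper's actual procedure is detailed in its online appendix and may differ:

```python
# Hypothetical pre-fit diagnostic: a column that perfectly splits the
# two outcome classes ('complete' separation) or splits them with a
# boundary tie ('quasi' separation) will break maximum-likelihood
# estimation of a logit.
import numpy as np

def separation_flags(X, y):
    """Return a per-column flag: 'complete', 'quasi', or None."""
    y = np.asarray(y).astype(bool)
    flags = []
    for j in range(X.shape[1]):
        x0, x1 = X[~y, j], X[y, j]
        complete = x0.max() < x1.min() or x1.max() < x0.min()
        quasi = x0.max() <= x1.min() or x1.max() <= x0.min()
        flags.append('complete' if complete else 'quasi' if quasi else None)
    return flags
```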
4.1.1 Statistical Testing
Given the structure of our rolling time-series analysis, we are unable to use a standard Fama-
MacBeth methodology to pool our results for the predicted window. This restriction results
from the topic variables naturally changing across windows as previously reported. Thus,
we cannot aggregate results at the variable level. To address this research design issue, we
use Fisher’s (1932) method to provide aggregated test statistics.24 The Fisher test statistic
is appropriate for our analyses since the out-of-sample regressions are estimated using non-
overlapping years.
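Fisher's method is straightforward to implement; a minimal sketch is:

```python
# Fisher's (1932) method for pooling N independent p-values into one
# test statistic: -2 * sum(log p_i) is chi-squared with 2N degrees of
# freedom under the joint null.
import numpy as np
from scipy import stats

def fisher_combine(pvalues):
    """Return (chi2 statistic, pooled p-value) for independent p-values."""
    p = np.asarray(pvalues, dtype=float)
    chi2 = -2.0 * np.log(p).sum()
    pooled_p = stats.chi2.sf(chi2, df=2 * len(p))
    return chi2, pooled_p
```

SciPy's `scipy.stats.combine_pvalues(p, method='fisher')` computes the same statistic.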
We refine our test statistic further by deriving a statistic referred to as a Var-Gamma
test (see Appendix D). This test statistic allows us to compare the results of Fisher's method,
statistically testing whether one fraud detection model performs better than another when
pooled across years.
24 The test statistic is computed as −2 Σ_{i=1}^{N} log(p_i) ∼ χ²_{2N}, where p_i is the ith p-value of N total p-values.
4.2 Empirical Results
4.2.1 The Predictive Value of Topic versus Financial Variables (RQ1)
We address our first research question by investigating the informational role of 𝑡𝑜𝑝𝑖𝑐 versus
financial statement variables in detecting intentional misreporting. Table 1 presents separate
summary statistics of our financial statement variables for fraud and non-fraud firm-years
in the AAER and irregularity restatement samples. Consistent with Dechow et al. [2011],
we find that the percent of soft assets is significantly higher in both samples, suggesting
more reporting flexibility in misstatement years (𝑝-value < 0.01). We also find a greater
tendency to issue securities and engage in off-balance-sheet activities in misstatement years
in both samples. Furthermore, firms involved in AAERs have significantly larger increases
in receivables and inventory, and larger declines in ROA, consistent with greater earnings
management and poorer financial performance in manipulation years. Lastly, AAER firms
experience higher market-adjusted stock returns in the year prior to the misstated year. This
result, combined with the higher rate of security issuance during the misstated year, indicates
that market-related incentives play a strong role in intentional misreporting. In sum, the
univariate results provide initial evidence that financial statement variables provide useful
information for predicting intentional financial misreporting.
Table 2 presents the results of our out-of-sample tests of the predictive role of 𝑡𝑜𝑝𝑖𝑐
and financial variables (denoted as 𝐹 − 𝑆𝑐𝑜𝑟𝑒). Panels A and B present the Fisher and
Var-Gamma test statistics for AAERs, while Panels C and D present similar statistics for
irregularity restatements. The Fisher test statistics (Panels A and C) indicate that the
financial variables provide a significant amount of information for predicting AAERs (𝑝 <
0.001); however, they fail to provide significant informational value for predicting irregularity
restatements (𝑝 = 0.223). Furthermore, the results in Panels B and D suggest that the stand-
alone 𝑡𝑜𝑝𝑖𝑐 vector performs significantly better at predicting both AAERs and irregularity
restatements than either the stand-alone vector of financial metrics or the paired vectors of
𝑡𝑜𝑝𝑖𝑐 and financial variables. In both samples, the Var-Gamma test statistics are significantly
positive at the 1% level when the stand-alone 𝑡𝑜𝑝𝑖𝑐 vector is benchmarked against 𝐹 −𝑆𝑐𝑜𝑟𝑒
and the 𝑡𝑜𝑝𝑖𝑐 vector paired with 𝐹 − 𝑆𝑐𝑜𝑟𝑒. The pairing of 𝑡𝑜𝑝𝑖𝑐 with financial measures
performs significantly better at predicting both AAERs and irregularity restatements than
financial measures alone (𝑝 < 0.001). Overall, we find that our measure of the thematic
content performs significantly better at detecting instances of accounting misstatements
relative to traditional financial characteristics.
4.2.2 The Predictive Value of Topic versus Textual Style Characteristics (RQ2)
Since style measures are also text-based, one could argue that 𝑡𝑜𝑝𝑖𝑐 simply proxies for the
style characteristics of firms’ financial statements. Table 3 presents separate univariate
statistics for our vector of style characteristics for misstated and non-misstated firm-years.
Regarding our processing variables, we find that misstated filings in both the AAER and
irregularity restatement samples have more (concise) bulleted information. This finding is
inconsistent with conventional notions, but could reflect managers’ use of conciseness to omit
relevant information. Misstated filings in both samples have longer headers relative to non-
misstated filings, consistent with complex firm transactions like restructuring and mergers
being associated with misstatements. Also, misstated filings have fewer newlines and HTML
tags in the AAER sample, whereas misstated filings have more newlines and more HTML tags
in the irregularity restatement sample. We further note that the misstated filings in both
samples are longer overall with longer MD&A sections, suggesting less readability during
manipulation periods.
In terms of complexity, misstated filings in the AAER sample contain longer words,
shorter sentences, and shorter paragraphs, along with fewer long (> 60 word) sentences
and a greater number of short (< 30 word) sentences. In contrast, misstated filings in
the irregularity restatement sample tend to use longer sentences, longer paragraphs, fewer
long sentences, and fewer short sentences. Regarding variation, both AAER and irregular-
ity restatement filings have significantly lower variation in sentence length and use fewer
unique words (type-token ratio). AAER filings also have less variation in paragraph length,
while irregularity restatement filings have greater variation in paragraph length and have
more repeated sentences. We note that both sets of misstated filings are less readable per
the Gunning Fog and Coleman-Liau indices, and are more likely to contain passive voice
grammar, consistent with managers using passive voice to disassociate themselves from the
disclosure content [Goel & Gangolly, 2012].
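To make two of these measures concrete, the sketch below computes a Gunning Fog score and a type-token ratio for a block of text. The vowel-group syllable heuristic and the tokenization rules are simplifying assumptions, not the authors' exact procedure:

```python
# Gunning Fog = 0.4 * (words per sentence + 100 * complex-word share),
# where "complex" means three or more syllables; the type-token ratio
# is the share of unique words. Syllables are approximated by counting
# vowel groups, a common rough heuristic.
import re

def style_measures(text):
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = lambda w: max(1, len(re.findall(r'[aeiouy]+', w.lower())))
    complex_words = [w for w in words if syllables(w) >= 3]
    fog = 0.4 * (len(words) / len(sentences)
                 + 100 * len(complex_words) / len(words))
    type_token = len({w.lower() for w in words}) / len(words)
    return {'gunning_fog': fog, 'type_token_ratio': type_token}
```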
Regarding word choice, both AAER and irregularity restatement filings have significantly
higher percentages of positive, negative, and uncertain words, consistent with Loughran
& McDonald [2011]. AAER filings also have a lower percentage of strong words, while
irregularity restatement filings have a greater percentage of litigious, strong, and weak words.
Lastly, misstated filings in the AAER and irregularity restatement samples contain more
textual emphasis as indicated by more words in all caps; however, misstated AAER filings
have fewer exclamation points and question marks, on average.
We approach RQ2 by combining the 𝑡𝑜𝑝𝑖𝑐 and textual style vectors in the same regression
model. Table 4 presents the Fisher and Var-Gamma test statistics of our out-of-sample tests
of the predictive performance of topic relative to textual style characteristics. Panels A and
B present the test statistics for AAERs; Panels C and D present the results for irregularity
restatements. The evidence in panels A and C suggests that 𝑡𝑜𝑝𝑖𝑐 combined with style is a
good predictor of misstatements involving AAERs and irregularity restatements (𝑝 < 0.001).
However, for AAER misstatements, the Var-Gamma results in Panel B show that 𝑡𝑜𝑝𝑖𝑐 by
itself is a better predictor than either textual style or 𝑡𝑜𝑝𝑖𝑐 combined with style at 𝑝 < 0.001.
The Var-Gamma tests in Panel D show that while 𝑡𝑜𝑝𝑖𝑐 is a better predictor than style
(𝑝 = 0.019), the joint vector of 𝑡𝑜𝑝𝑖𝑐 and style characteristics is a better predictor than the
stand-alone 𝑡𝑜𝑝𝑖𝑐 vector at 𝑝 = 0.003. Thus, we find that the best specification for predicting
AAERs is 𝑡𝑜𝑝𝑖𝑐 by itself, while the best specification for predicting irregularity restatements
is 𝑡𝑜𝑝𝑖𝑐 with style. This evidence could suggest that fraud detection models based on 𝑡𝑜𝑝𝑖𝑐
and style characteristics are more able to detect accounting manipulations that are likely to
go unidentified by the SEC.
4.2.3 The Joint Predictive Value of Topic, Financial, and Textual Style Variables
In this section, we conduct extended analyses of the interplay between all three vectors of
our fraud prediction variables: 𝑡𝑜𝑝𝑖𝑐, financial statement variables, and textual style char-
acteristics. This comprehensive analysis attempts to verify that the fraud detection ability
of 𝑡𝑜𝑝𝑖𝑐 is robust to the inclusion of financial and textual style characteristics. We therefore
estimate a comprehensive regression of the vectors of 𝑡𝑜𝑝𝑖𝑐, financial, and tex-
tual style measures. Table 5 presents the out-of-sample results. In Panels A and C we find
that the combined vector of 𝑡𝑜𝑝𝑖𝑐, financial, and style measures performs reasonably well in
detecting accounting misstatements (𝑝 < 0.001). Nonetheless, the results in Panels B and
D indicate that the stand-alone 𝑡𝑜𝑝𝑖𝑐 vector performs markedly better in predicting both
types of accounting misstatements (𝑝 < 0.001). For AAERs (Panel B), 𝑡𝑜𝑝𝑖𝑐 outperforms
the joint vector of 𝑡𝑜𝑝𝑖𝑐, financials, and style (𝑝 < 0.001), whereas 𝑡𝑜𝑝𝑖𝑐 paired with style is
the dominant predictor of irregularity restatements at 𝑝 < 0.001 (Panel D). This evidence
agrees with our previous results that 𝑡𝑜𝑝𝑖𝑐 is the best predictor of misstatements involving
AAERs, while 𝑡𝑜𝑝𝑖𝑐 and textual style characteristics provide the strongest prediction power
for misstatements involving irregularity restatements.
In order to examine the economic significance of our out-of-sample results, we examine
the percent of accounting misstatements in the top 5% of the prediction scores from the
out-of-sample regressions. Consistent with our out-of-sample results, Table 6 shows that
𝑡𝑜𝑝𝑖𝑐 combined with textual style characteristics captures the most misstatements involving
irregularity restatements, detecting 16.41% of them. For AAERs, we find that combining
𝑡𝑜𝑝𝑖𝑐, 𝐹 − 𝑠𝑐𝑜𝑟𝑒, and textual style characteristics captures the most AAERs, detecting
19.93% of all AAERs. While this contrasts with the results above, it is not inconsistent:
the out-of-sample tests capture which measure is best on average, rather than within any
specific cutoff of the fraud scores. More importantly, Table 6 shows that topic is useful for
prediction in an economic sense, increasing the number of AAERs (irregularity restatements)
captured within the top 5% of firms by 67% (3.3%).
5 Additional Analysis and Robustness
This section provides a series of extended analyses as well as sensitivity checks for our
primary results. We first examine the robustness of our results to alternative sources of
financial restatements due to irregularities, as well as restatements attributable to uninten-
tional misapplications of GAAP (errors). We also replicate our primary results using MD&A
statements instead of the full text of the filings. Next, we change the regression form to an
L1 regularized logit model, to alleviate concerns of potential overfitting. Lastly, we adjust
our samples of misstated filings to exclude repeat GAAP violators as well as replicate our
analyses using the raw 𝑡𝑜𝑝𝑖𝑐 measures (as opposed to the normalized 𝑡𝑜𝑝𝑖𝑐 proportions).
5.1 Alternative Restatement Measures
Our strategy for identifying irregularity restatements is based on three classification crite-
ria: 1) management’s use of variants of the word “fraud” or “irregularity” in reference to
the misstatement (direct restatements), 2) misstatements identified by regulatory or DOJ
investigation (government-identified restatements), and 3) misstatements uncovered by in-
dependent investigations (other irregularity restatements). We examine whether our results
differ for misstatements identified under each criterion. We conduct this analysis since man-
agerial discussion during misstatement events is likely to differ across the three settings. For
instance, irregularities involving SEC or DOJ investigations could be more egregious com-
pared to those involving within-firm or independent investigations. We also investigate our
models’ ability to distinguish unintentional misstatements or errors (i.e., those misstatements
that are not classified as intentional).
The out-of-sample results (not tabulated) provide an interesting story when we distin-
guish the three settings. For direct restatements, the vectors of 𝑡𝑜𝑝𝑖𝑐, financial, and style
measures are all statistically significant predictors; however, 𝑡𝑜𝑝𝑖𝑐 is the most powerful pre-
dictor, and all other combinations of the 𝑡𝑜𝑝𝑖𝑐, financials, and style vectors lead to specifications
that are significantly weaker. Government-identified restatements are also pre-
dicted most strongly by 𝑡𝑜𝑝𝑖𝑐. Interestingly, financial statement variables are not predictive
of government-identified misstatements (𝑝 = 0.987). Other irregularity restatements are
likewise best captured by the 𝑡𝑜𝑝𝑖𝑐 measure, while financial measures are once again poor
predictors.
Lastly, all specifications perform well at predicting unintentional accounting errors. The
𝑡𝑜𝑝𝑖𝑐 vector is tied with 𝑡𝑜𝑝𝑖𝑐 paired with the financial and style vectors when detecting
accounting errors. We also find that 𝑡𝑜𝑝𝑖𝑐 and 𝑡𝑜𝑝𝑖𝑐 paired with style are the best specifica-
tions, irrespective of the type of restatement (intentional irregularity or unintentional error).
Overall, our results suggest that quantifying the thematic content of annual reports results
in a detection tool that performs best when predicting accounting misstatements in general.
5.2 Management Discussion and Analysis
Several style-focused studies such as Li [2008], Li [2010], Cecchini et al. [2010], and Goel
& Gangolly [2012] examine the MD&A section of the 10-K. We therefore investigate the
fraud prediction performance of our topic measure based on this subset of the 10-K. We
reconstruct our topic and style variables using the text extracted from the MD&A section
(see Appendix A.1 of the online appendix for further details). Our out-of-sample evidence
indicates that style performs worse than financial measures at predicting misstatements
involving AAERs, and that the predictive ability of 𝑡𝑜𝑝𝑖𝑐 is not significantly different from
that of financial statement variables in the case of AAERs. Out-of-sample tests for the
irregularity restatement sample show weaker Fisher statistics compared to our reported re-
sults; however, our main results continue to hold. Thus, we conclude that the incremental
detection value of 𝑡𝑜𝑝𝑖𝑐 is robust to restricting our analysis to MD&A statements, though it
is generally better to use the full text of the 10-K filings when examining disclosure content
and textual style characteristics.
5.3 Regularized Logit
Here, we change the form of our regressions from a standard logistic regression to an L1
regularized logit. The L1 regularization approach applies a penalty for increasing the number
of independent variables, which biases against including too many independent variables in
the regression. We find only one difference in our set of out-of-sample tests: for the sample
of irregularity restatements, there is no significant difference in the prediction ability of 𝑡𝑜𝑝𝑖𝑐
relative to 𝑡𝑜𝑝𝑖𝑐 paired with style. As such, the additional predictive ability from adding
textual style to 𝑡𝑜𝑝𝑖𝑐 does not overcome the penalty from the L1 regularization applied for
increasing the number of independent variables.
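A minimal sketch of an L1-regularized (lasso) logit follows, assuming scikit-learn; the authors' solver and penalty strength are not specified in this excerpt:

```python
# The L1 penalty shrinks the coefficients of weak predictors to exactly
# zero, so the fitted model implicitly performs variable selection and
# guards against overfitting with many regressors.
import numpy as np
from sklearn.linear_model import LogisticRegression

def l1_logit(X, y, C=1.0):
    """Fit an L1-penalized logit; return the model and surviving columns."""
    model = LogisticRegression(penalty='l1', solver='liblinear', C=C)
    model.fit(X, y)
    kept = np.flatnonzero(model.coef_[0])  # indices with nonzero weight
    return model, kept
```

Smaller values of `C` impose a heavier penalty, dropping more predictors.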
Overall, the L1 regularization results do not change our primary inferences, with the
exception that the stand-alone 𝑡𝑜𝑝𝑖𝑐 vector may be a stronger predictor of irregularity re-
statements under certain circumstances.
5.4 Removing second time offenders
Our next robustness test controls for possible biases introduced by allowing a firm to be
flagged as an AAER or irregularity restatement firm in both the learning window and the
following testing year. One concern arising from this approach is that the topic measure is
biased toward firms that are repeat offenders, rather than the first instance of an AAER
or irregularity restatement. To alleviate this concern, we adjust our misstatement samples
by removing any firm-years in which the preceding firm-year was involved in a misstate-
ment. Thus, our out-of-sample dependent variable only picks up the first year affected by a
misstatement.
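The filtering step described above might look like the following pandas sketch, where the `firm`, `year`, and `misstated` columns are illustrative names and firm-years are assumed consecutive:

```python
# Drop any firm-year whose immediately preceding firm-year was also
# flagged as misstated, so only the first year of each misstatement
# spell remains. Assumes consecutive firm-years within each firm.
import pandas as pd

def drop_repeat_offenders(df):
    df = df.sort_values(['firm', 'year'])
    prev = df.groupby('firm')['misstated'].shift(1).fillna(0)
    return df[prev != 1]
```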
Our prediction results are virtually identical using these adjusted misstatement
samples. Specifically, the results for both misstatement samples indicate that the predictive
ability of 𝑡𝑜𝑝𝑖𝑐, financial, and style measures is lower when removing repeat misstatements,
but all inferences are identical to our primary inferences. As such, it appears that our 𝑡𝑜𝑝𝑖𝑐
measure is not biased towards firms with repeated misstatements.
5.5 Raw Topic Measures
Our final sensitivity check uses the raw 𝑡𝑜𝑝𝑖𝑐 measures instead of the normalized measures.
This approach increases the variance in the topic measures, as they are now influenced by
the amount of text in the document.
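Dividing each document's raw topic weights by their sum is an assumed form of the normalization being undone here, since the exact procedure is described elsewhere in the paper:

```python
# Raw per-document topic weights scale with document length; dividing
# each row by its sum yields length-invariant topic proportions.
import numpy as np

def normalize_topics(raw):
    """Convert raw topic weights (docs x topics) to row proportions."""
    raw = np.asarray(raw, dtype=float)
    return raw / raw.sum(axis=1, keepdims=True)
```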
The prediction results for the AAER sample are quite similar, except that 𝑡𝑜𝑝𝑖𝑐 by itself
is no longer a significantly better predictor than 𝑡𝑜𝑝𝑖𝑐 paired with style. Results fo