
What are you saying? Using topic to detect financial misreporting*

Nerissa C. Brown
Associate Professor
University of Delaware
[email protected]

Richard M. Crowley
Ph.D. Student
University of Illinois
[email protected]

W. Brooke Elliott
Associate Professor
University of Illinois
[email protected]

September 2015
Preliminary version. Please do not cite without the permission of the authors.

*We thank Andrew Bauer, Paul Demeré, Shawn Gordon, Kristina Rennekamp, and workshop participants at the University of Illinois, the 2015 AAA FARS Mid-year Meeting, the 2015 AAA Annual Meeting, and the 2015 Conference on Convergence of Financial and Managerial Accounting Research for their helpful comments. We also thank Xiao Yu for insightful comments on methodology and coding, and Stephanie Grant, Jill Santore, and Jingpeng Zhu for excellent research assistance.

What are you saying? Using topic to detect financial misreporting

Abstract: Detection models of financial misreporting have evolved beyond basic quantitative or financial measures to include textual or linguistic characteristics of firms' disclosures. While these textual analysis methods provide incremental power in identifying misreporting, they examine how content is being disclosed as opposed to what is being disclosed. This study introduces a novel fraud-detection measure, labeled as "topic," that quantifies the thematic content of financial statements. We derive our measure from a Bayesian topic modeling methodology called Latent Dirichlet Allocation (LDA). We then demonstrate the incremental predictive power of our topic measure in detecting intentional financial misreporting. We identify occurrences of financial misreporting using SEC enforcement actions (AAERs) and restatements arising from intentional misapplications of GAAP (i.e., irregularities). We find strong evidence that topic predicts intentional misreporting beyond financial and textual style characteristics. Furthermore, our results indicate that the detection power of financial metrics is subsumed by our topic measure in prediction models for both AAERs and restatements arising from irregularities.

Keywords: Topic, Disclosure, LDA, Financial Misreporting, Intentional Restatements

1 Introduction

Detection models of financial misreporting have long focused on firms' financial characteristics and are often based on quantitative measures of extreme or abnormal performance (e.g., Beneish [1997]; Dechow, Ge, Larson & Sloan [2011]).1 One drawback of this approach is that financial misreporting can go undetected for long periods since many firms engage in earnings manipulation to blend in better with either their peers or the firm's own past performance [Lewis, 2013]. To address this weakness, recent studies have begun to examine linguistic and text-based measures as additional signals of financial misreporting. These signals include disclosure tone, sentiment, vocal dissonance, and the use of suspicious or deceptive words [Loughran & McDonald, 2011; Hobson, Mayew & Venkatachalam, 2011; Purda & Skillicorn, 2015; Larcker & Zakolyukina, 2012]. In practice, regulators, auditors, investors, and information intermediaries are employing textual analytic tools to reveal early warning signs of accounting fraud and misstatements. For example, the Securities and Exchange Commission (SEC) is currently incorporating basic lists of deceptive words and phrases from annual reports into its computer-powered fraud detection model, termed the Accounting Quality Model (AQM, Eaglesham [2013]). Auditors and accountants are also using textual analytic tools in assurance tasks such as fraud detection and compliance [Schneider, Dai, Janvrin, Ajayi & Raschke, 2015].2

1 Consistent with prior literature (e.g., Beasley [1996], Farber [2005], Dechow et al. [2011]), we use the terms misreporting, misstatement, manipulation, irregularity, and fraud interchangeably throughout the paper. While firms often do not admit to outright fraud, our data sources for identifying instances of intentional misreporting capture more egregious cases in the accounting error-to-fraud continuum.

2 Furthermore, our discussions with data providers indicate the increased use of textual analytics to identify instances of accounting manipulations and other disclosure quality risks.

While text analysis methods are quite promising in the detection of financial misreporting, many of the techniques used in prior research capture basic textual or linguistic characteristics of firms' disclosures rather than their content. Examining the content of disclosures is important since firms engaging in accounting irregularities tend to make unusual content choices that are more difficult to classify as deceptive. As SEC officials note, firms that manipulate financial reporting rules tend to deflect attention from core problems by underreporting relevant risk factors compared to other firms [Lewis, 2013]. Further, prior research based on theories of deception provides evidence that the thematic content (or topic) of communication is typically chosen intentionally (e.g., Buller & Burgoon [1996]), suggesting that topics contained in financial disclosures likely reflect the intentions of managers (as opposed to the manager's own subconscious slant or their optimistic or pessimistic beliefs). In line with these observations, our study introduces a textual analytic methodology that directly detects disclosure topics, i.e., what is being disclosed in annual 10-K filings (as opposed to how content is being disclosed). Using this unique measure (labeled as "topic"), we evaluate the common types of topics discussed in the annual filings of misreporting firms and how these disclosure topics change over time. Lastly, we examine the incremental predictive power of our topic measure in detecting accounting misstatements relative to a comprehensive set of financial measures and textual style characteristics used in prior research.

To generate our topic measure, we employ a topic modeling technique developed by Blei, Ng & Jordan [2003], termed Latent Dirichlet Allocation (LDA). The LDA approach allows us to determine the proportion of each 10-K filing devoted to each topic detected by the algorithm. Topic modeling does not require preset definitions (referred to as dictionaries or word lists) or predetermined topic categories, and instead relies on the basic observation that words frequently appearing together in text documents tend to be semantically related. Using unstructured cluster analysis, the LDA algorithm simply uses a set of 10-K filings to "learn" or generate the various topics discussed in firms' annual reports in a given year. This method offers a unique advantage in that researchers are not required to know the topics commonly discussed in annual reports at a given point in time, and thus our own (preconceived) knowledge of the documents' content does not bias our construction of the topic measure. Furthermore, the LDA method allows us to analyze the actual content of a large collection of financial statements — a task that would be infeasible for researchers to perform manually. This represents a significant step forward in the textual analysis literature and goes beyond the basic "bag of words" or style analytics used in prior research.3

We conduct our fraud detection analyses using two sources for identifying the occurrence of accounting misstatements involving intentional GAAP violations. Our first source is a comprehensive sample of SEC Accounting and Auditing Enforcement Releases (AAERs) compiled by Dechow et al. [2011].4 These releases identify instances of formal SEC investigations of firms that manipulate their financial statements. Our second source of misstatements is an automated search for financial restatements arising from intentional misreporting (hereafter referred to as irregularity restatements), i.e., those restatements that involve intentional irregularities rather than unintentional misapplications of GAAP (errors). We identify restatements involving intentional irregularities based on the criteria discussed in Hennes, Leone & Miller [2008]. Specifically, we use an automated method to parse through the text of amended 10-K filings for variants of the words "fraud" or "irregularity" in reference to accounting misstatements. We also search for references to restatements resulting from SEC or Department of Justice (DOJ) investigations as well as references to other independent investigations.5

While there is some overlap between our sample of AAER and financial restatement firms, each data source has its unique advantages and disadvantages. As Dechow et al. [2011] note, the AAER sample provides researchers with a high confidence level of fraud detection since the SEC typically targets firms where there is strong evidence of accounting manipulation. However, many misstatement events are likely to go undetected due to limited SEC resources, and the investigated cases are likely to reflect the SEC's selection criteria. The irregularity restatement dataset spans a broader sample of accounting misstatements and alleviates the concern of SEC-related selection biases. Nonetheless, the restatement sample could be affected by changes in how firms disclose or discuss restatements within their financial statements.

3 Although LDA is a word choice algorithm, it goes significantly beyond the naïve Bayes or simple "bag of words" approach used in prior research (e.g., word counts and rank-ordered word lists) by using the distribution of words across documents to discover actual content without the need for predefined or researcher-determined word lists. Our analyses control for the most commonly used word list measures, such as those based on the Loughran and McDonald and Harvard IV dictionaries.

4 We use the updated version of this dataset available from the Center for Financial Reporting and Management at UC Berkeley's Haas School of Business.

5 We use an automated search to identify restatements since other data sources such as Audit Analytics and the Government Accountability Office (GAO) database provide less extensive coverage. For instance, restatement data is not available in Audit Analytics for periods prior to 2001, while the GAO database is limited to restatements announced from July 2002 to October 2006.

We build our fraud detection model by combining our unique topic measure with a comprehensive set of financial statement measures and textual style characteristics used in prior fraud research. Our financial statement variables follow closely from the Dechow et al. [2011] F-score model and include measures of accrual quality, earnings and cash flow performance, off-balance-sheet activities, and market-related incentives. Our text-based characteristics include measures of financial report readability, disclosure tone and emphasis, as well as basic lists of deceptive, litigious, and uncertain words constructed by Loughran & McDonald [2011].

Using out-of-sample tests of our AAER sample, we find that our topic measure provides significant incremental predictive power over commonly used financial metrics and textual style variables for detecting instances of accounting misstatement. In fact, a stand-alone model of disclosure topic is a better predictor of accounting fraud than models using financial or style characteristics, or models using both financial and style measures. We find similar results when we analyze our sample of irregularity restatements, i.e., topic adds significant incremental predictive power over financial and style variables. Interestingly, we find that a model including only topic and style (and excluding financial metrics) is most predictive of our irregularity restatements in out-of-sample tests. These results are robust to several sensitivity checks, including the use of alternative misreporting measures, regression specifications, and topic measure definitions. Our inferences also hold when we base our analyses solely on the topic content of the Management Discussion and Analysis (MD&A) section of firms' 10-K filings.

Our study makes several important contributions to the literature. First, we extend prior research on financial misreporting by providing evidence that the topics discussed in firms' 10-Ks are useful in identifying intentional misstatements above and beyond traditionally examined financial measures and style characteristics. Second, we expand the burgeoning research in accounting that examines the textual portion of corporate disclosures. Specifically, we exploit a robust textual analysis methodology, LDA, which directly quantifies what is being disclosed in 10-K filings (as opposed to how it is being disclosed). This thematic content analysis is adaptable to any type of financial disclosure of sufficient length. Further, since the topics that are indicative of intentional misreporting are likely to change over time given the fluidity of disclosures, our approach provides a significant improvement over a simple "bag of words" approach in which the list of deceptive words is fairly static and easily identifiable by firms. Lastly, our study has significant practical implications for regulators, investors, and practitioners, who continue to implement sophisticated initiatives aimed at detecting accounting violations. In particular, our results suggest that extracting information about what is being disclosed is a fruitful avenue for capturing high-risk accounting activity.

    2 Background and Research Questions

    2.1 Predicting Financial Misreporting

Over the past two decades, researchers have examined several different predictors of financial misreporting. Early work by Feroz, Park & Pastena [1991] and Beneish [1997, 1999] investigates the link between accounting misstatements and several measures of extreme or abnormal financial performance. In particular, Feroz et al. [1991] find that fraud firms identified from SEC enforcement actions (AAERs) have misstated receivables and inventory. Using a larger sample of fraud events gathered from both AAERs and the business press, Beneish [1997] finds that abnormal accruals, disproportionate increases in receivables, and poor abnormal market performance are significant predictors of financial misreporting. Beneish [1999] also finds that extreme firm performance based on indices of financial ratios is useful for detecting fraud.

Prior studies also find that stock and debt market pressures and firms' internal monitoring mechanisms are strong predictors of accounting misstatements. For instance, Dechow, Sloan & Sweeney [1996] find that firms subject to SEC enforcement actions have lower free cash flow and higher leverage. In addition, these firms are more likely to violate debt covenants and tend to issue securities during the earnings manipulation period. Dechow et al. [1996] further find that fraud firms have weak internal governance mechanisms, proxied by insider dominance of the board of directors and the lack of an audit committee or outside blockholders. Beasley [1996], Beasley, Carcello, Hermanson & Lapides [2000], and Farber [2005] provide similar evidence of the association between governance mechanisms and the likelihood of fraud. Specifically, they find that fraud firms have lower percentages of outside board membership, less independent audit committees, fewer financial experts on the audit committee, fewer audit committee meetings, and lower quality external audit firms. In sum, these results suggest that weak monitoring mechanisms provide managers with less internal oversight and, in turn, greater opportunities to engage in suspect accounting.

In a comprehensive study of AAERs, Dechow et al. [2011] investigate the fraud detection power of a battery of financial and nonfinancial measures.6 They find that poor accrual quality, increases in accrual components, declines in returns on assets, high stock returns, and abnormal reductions in the number of employees are strong predictors of accounting misstatements. They also find that fraud firms conduct aggressive off-balance-sheet and external financing transactions during misstatement periods. Using these variables, Dechow et al. [2011] develop a composite prediction score termed the F-score. They show that the F-score is a better predictor of both within-GAAP and aggressive accounting misstatements compared to traditional models of accrual management.

6 Dechow et al. [2011] do not examine corporate governance and incentive compensation variables because these variables are available for only limited samples. Our study follows the same approach to ensure that our results are generalizable to a wide set of firms.

Recent research has started to explore the predictive value of language-based tools in detecting intentional financial misreporting. The basic premise of these studies is that the textual style or vocal qualities of managements' disclosures can be used as additional tools to identify accounting manipulations. At a basic word-choice level, Loughran & McDonald [2011] find that instances of fraud are associated with the use of negative, uncertain, and litigious words in annual reports. Larcker & Zakolyukina [2012] analyze the transcripts of conference calls and find that words related to deception are indicative of accounting misstatements and serve as better predictors than standard measures of discretionary accruals. Goel & Gangolly [2012] go a step further from word lists and examine linguistic qualities such as tone, tense, uncertainty, adverbs, and emphasis. They find significant differences in linguistic qualities across the full text of 10-Ks issued by firms that engage in financial irregularities versus those that do not. Cecchini, Aytug, Koehler & Pathak [2010] employ a dictionary approach with synonyms to detect financial misstatements. This approach allows the authors to take a broader look at managements' disclosures in annual reports as opposed to focusing solely on individual style characteristics.

Goel, Gangolly, Faerman & Uzuner [2010] present an expanded fraud detection model using a machine learning algorithm termed the Support Vector Machine (SVM) to classify annual reports containing irregularities. The SVM approach provides a significant improvement over prior text mining models as the SVM model learns by example and does not require pre-defined fraud indicators. The SVM tool is trained to classify annual reports using both word dictionaries and writing style characteristics such as word usage, word and sentence length, readability, tone, the use of passive versus active voice, the frequency of uncertainty or hedge words, and other deeper linguistic style markers and keyword usage. Goel et al. [2010] find that their SVM approach improves the prediction accuracy of their fraud detection model by about 58% compared to a baseline model using a Naïve Bayes classification approach. This approach improves upon the word-list tools used in prior research by employing Naïve Bayes and SVM algorithms to classify annual reports based on vectors of common words detected in fraudulent reports. However, this algorithm is based on simple word counts and ignores the disclosure content of the annual report as well as relationships between words in a document.

Lastly, Purda & Skillicorn [2015] use SVM tools to distinguish fraudulent and truthful annual reports. In contrast to Goel et al. [2010], Purda & Skillicorn [2015] examine both annual and quarterly financial reports and compare the accuracy of their fraud detection model to both the predictive power of linguistic-based models and traditional models such as the Dechow et al. [2011] F-score model. The SVM approach in Purda & Skillicorn [2015] uses a learning algorithm to generate a rank-ordered list of words that are best able to capture fraudulent reporting. The authors find that their data-generated classification scheme outperforms textual-based prediction models built using pre-defined dictionaries as well as traditional models based on financial and non-financial measures.7

Our study extends this body of literature by constructing a direct text-based measure aimed at capturing the content of disclosures within firms' financial statements. Our approach goes beyond traditional fraud detection models that are based primarily on quantitative information. Further, we draw on theories of deception to predict that the thematic content, or topics, a manager chooses is more likely to reflect the manager's intentions, and thus intentional misreporting, than either the simple "bag of words" or textual style approach used in prior textual analysis studies.

    2.2 LDA Topic Modeling

We employ a topic modeling approach developed by Blei et al. [2003], termed Latent Dirichlet Allocation (LDA), to capture the disclosure content of annual reports. The LDA technique is widely used in the linguistic and information retrieval literature to quantify the thematic content (i.e., topics) of text corpora and other collections of discrete disclosure data (see Blei [2012] for a review of topic modeling and its application to various text collections).8 We use this approach to construct a firm-specific measure of the topics discussed in each 10-K filing in a given year. This unique measure (defined as the normalized percent of the annual report attributed to each topic identified by the algorithm) captures how much of each report is devoted to discussing a particular disclosure topic.

7 In international settings, Kirkos, Spathis & Manolopoulos [2007] and Pai, Hsu & Wang [2011] use textual data mining techniques to detect intentional financial misreporting in Greek and Taiwanese firms, respectively.

Topic modeling is relatively new to accounting and finance, and our measurement approach is consistent with recent studies that apply LDA to extract the disclosure content of large volumes of financial-related textual data. For instance, Curme, Preis, Stanley & Moat [2014] use LDA to identify the semantic topics within the large online text corpus of Wikipedia. The identified topics are then used to determine the link between stock market movements and how frequently Internet users search for the most representative words of each identified topic. Huang, Lehavy, Zang & Zheng [2014] employ LDA topic modeling to compare the thematic content of analyst reports and the text narrative of conference calls. Consistent with the information discovery role of analysts, Huang et al. find that analyst reports issued immediately after conference calls contain exclusive topics that were not discussed during the conference calls. Bao & Datta [2014] discover and quantify the various topics discussed in textual risk disclosures from annual 10-K filings (Item 1A). Their results indicate that about two-thirds of the identified risk topics are uninformative to investors, consistent with the notion that risk disclosures are largely boilerplate. Of the remaining topics, disclosures of systematic macroeconomic and liquidity risks have an increasing effect on investors' risk perceptions, whereas topics related to diversifiable risks (i.e., human resources, regulatory changes, information security, and operation disruption) lead to a decrease in investors' risk perceptions.

Concurrent with our study, Hoberg & Lewis [2015] use topic modeling and cosine similarity to provide evidence on the content disclosed by firms involved in SEC enforcement actions (AAERs). Focusing on the MD&A section of 10-K filings, Hoberg & Lewis [2015] find that relative to industry peers, AAER firms have abnormal verbal disclosure that is common among fraud firms. The topic analysis results indicate that AAER firms disclose more information about complex business issues such as acquisitions and foreign operations, and are more likely to grandstand their good financial performance. AAER firms also under-disclose certain topics such as liquidity challenges and provide fewer quantitative details explaining their performance.

8 In practice, topic modeling is being used by search engines such as Google and Bing to improve correlations between search terms and web content. Search engine marketers are also applying topic modeling to guide keyword selection and optimize website content [Fishkin, 2014].

Our study extends Hoberg & Lewis [2015] in several respects. First, Hoberg & Lewis [2015] fit their LDA model using the text contained in the MD&A section (Item 7) of annual reports filed in only the first year of their sample period (1997-2008). This approach does not account for changes in disclosure topics over time and could induce 'staleness' in the topics used in their empirical analyses. Our study accounts for the dynamic nature of management disclosure by simultaneously discovering the topics and quantifying the attention each annual report dedicates to the estimated topics in a given year. We also employ a rolling-window procedure that predicts financial misreporting using the topics identified over the five years prior to the manipulation period. Second, unlike Hoberg & Lewis [2015], we analyze the thematic content of the entire 10-K filing as opposed to the MD&A section. While the MD&A section provides a useful setting for examining disclosure content, it does not capture relevant content that is discussed in other distinct sections of the annual report, e.g., risk factors (Item 1A), legal proceedings (Item 3), and executive compensation (Item 11). As we will show, topics identified in the MD&A section have less variation and significantly lower fraud detection power compared to topics identified from the full annual report. Last, and most important, our study goes a step further by demonstrating the incremental predictive power of thematic content for detecting fraud over and above traditional measures of financial performance and textual style characteristics.

  • 2.3 Research Questions

While prior research suggests a link between accounting misstatement and various word choices and writing styles, the literature is unclear as to whether disclosure content is related to intentional misreporting. Our primary research questions tackle this issue by investigating the association between disclosure topics and instances of financial misreporting.

We seek to understand the role of disclosure content in detecting intentional financial misreporting beyond the traditional financial performance and style characteristics examined in prior work. Disclosure topics may provide incremental detection ability since the thematic content of financial statements captures an aspect of managerial deception that is distinct from that of financial metrics. Specifically, regulatory oversight is more difficult for textual narratives, especially at the topic level, thus leaving more room for managers to use disclosure content as a means of diverting attention away from misstated financials. While prior research has identified a set of "lying words" (see e.g., Newman, Pennebaker, Berry & Richards [2003]), it is more difficult to naïvely identify a set of "lying topics," as these same topics may be benign or informative about actual performance in other settings or at other points in time. Furthermore, financial metrics in annual reports are primarily backward-looking, whereas textual disclosures contain a significant amount of forward-looking information and cover a wide range of topics [Bozanic, Roulstone & Van Buskirk, 2014]. As prior research suggests, forward-looking information is inherently more uncertain and less verifiable at the time of disclosure [Bonsall IV, Bozanic & Merkley, 2014]. We therefore explore whether the topics discussed in annual reports provide incremental predictive value beyond financial metrics in detecting financial misreporting. Our first research question is stated as follows:

Research Question 1: Topic provides predictive information beyond that of financial measures when detecting intentional financial misreporting.

We also investigate whether disclosure topics provide informational value beyond textual style characteristics. Style characteristics are broad textual metrics: they do not reflect the underlying content, but are simply summary statistics across the broad base of the document. For example, style characteristics may refer to textual complexity, readability, or formatting choices such as the use of bullets or the amount of whitespace, as well as grammar and word choice. Topic modeling, in contrast, captures the underlying distribution of content in the document, as opposed to just summary style statistics. Furthermore, the discussion of various topics within the annual report is more intentional than style characteristics, i.e., managers must select the thematic content of each annual filing and the attention dedicated to each topic within the document. Prior research on deception provides evidence that written topics are typically chosen intentionally. Specifically, theories of manipulation and deception suggest that individuals actively monitor the amount, veracity, relevance, and clarity of topics communicated (see e.g., McCornack [1992]). Also, experimental evidence indicates that individuals adapt deception strategies of giving false answers, withholding relevant information, or giving evasive answers on demand, suggesting that choosing a topic is an intentional process [Buller & Burgoon, 1996].

It is unclear from prior work whether style characteristics such as length and readability reflect intentional manager choices to obfuscate or deceive, or simply reflect a manager's own optimism or subconscious slant (see e.g., Li [2008] and the related discussion in Bloomfield [2008]). Further, prior research suggests that word choice, in and of itself, is often subconscious (or without intent). Specifically, prior work in linguistics provides evidence that the use of function words (i.e., pronouns, prepositions, articles, conjunctions, and auxiliary verbs) is often without awareness and is difficult to control [Chung & Pennebaker, 2007]. While there is some evidence that individuals consciously choose abstract words to describe recurring events and concrete words to describe non-recurring events, goals to influence the message recipients' beliefs make it difficult to disentangle whether the communicator is being truthful or not. Thus, even for those words that a manager may intentionally choose, it has been shown to be difficult to discern intentional deception (see e.g., Douglas & Sutton [2003]). In sum, while style features of disclosure, including word choice captured by commonly used dictionaries or word lists, may reflect a manager's own optimism or subconscious slant, topic modeling is more likely to reveal managers' intentional content choices. This discussion leads to our second research question:

Research Question 2: Topic provides predictive information beyond that of textual style characteristics when detecting intentional financial misreporting.

    3 Data and Empirical Measures

    3.1 Data and Sample Selection

We use annual 10-K filings retrieved from the SEC's EDGAR system as the textual disclosure source to examine our research questions. We focus on annual reports because they 1) allow us to maximize the number of firm-year observations in our sample, 2) are comprehensive in their coverage of the firm and its activities throughout the fiscal year, and 3) avoid self-selection biases given their mandatory disclosure status.9 We download the full text of all 10-Ks available through the SEC EDGAR FTP site from January 1, 1994 (the first year such data is available) until December 31, 2012 (the final year of the AAER dataset, discussed below). This download yields 131,528 filings. We follow Li [2008] in parsing the 10-K filings, but expand this methodology to remove all items included in the documents other than raw text.10 We also restrict the words in the files to match those contained in the standard Unix words dictionary, to remove typos and uncommon terminology.11 We describe our full parsing methodology in detail in Appendix A.1 of the online appendix. As discussed below, we gather data on accounting misstatements from the SEC AAER dataset compiled by Dechow et al. [2011] and from disclosures of restatements due to intentional misreporting in amended 10-K filings. We also gather financial statement and stock market data from Compustat and CRSP, respectively, as opposed to the actual 10-K filings, to ensure consistency and accuracy across our sample.

9 We acknowledge that 10-Ks are not always timely sources for detecting financial misreporting. For instance, the filing of the 10-K can lag any occurrence of fraud by up to a year. Purda & Skillicorn [2015] highlight the added value of including quarterly report narratives in language-based fraud analyses. However, we choose not to include quarterly reports to ensure consistency in the disclosure content of the reports across firms' reporting periods.

10 We construct measures for all textual items removed from the documents, some of which are included in our analyses.

11 The standard dictionary, provided by the wamerican package in the official Debian repositories, contains 99,171 words. We also conduct robustness checks using no dictionary, the wamerican-huge dictionary, and the wamerican-insane dictionary. These checks confirm that the standard dictionary provides the best model performance in-sample, along with the most coherent topics.
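To make the dictionary-restriction step concrete, the following is a minimal Python sketch of filtering tokens against the wamerican word list, assuming the list is installed at its usual Debian path; the authors' actual parsing pipeline (Appendix A.1 of their online appendix) is more extensive.

    import re

    DICT_PATH = "/usr/share/dict/american-english"  # provided by the wamerican package

    with open(DICT_PATH) as f:
        valid_words = {w.strip().lower() for w in f if w.strip().isalpha()}

    def clean_tokens(raw_text):
        """Lowercase, tokenize, and keep only words found in the Unix dictionary."""
        tokens = re.findall(r"[a-z]+", raw_text.lower())
        return [t for t in tokens if t in valid_words]

    # Example: garbled or uncommon terms are dropped.
    print(clean_tokens("Revenue increased by xqzw 12% versus fiscal 1998"))
    # -> ['revenue', 'increased', 'by', 'versus', 'fiscal']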

    3.1.1 Identifying Intentional Financial Misreporting

We use two data sources to identify instances of intentional financial misreporting. Following Dechow et al. [2011], our first data source uses SEC AAERs to classify firms engaging in material accounting misstatements. We focus on misstatements occurring during the annual reporting period. We exclude quarter-period misstatements to ensure that the measurement period for our prediction variables is consistent across firms. We create an indicator variable (misreport) that equals 1 for each fiscal year identified as misstated by the SEC, and zero otherwise. We use this indicator variable to classify those 10-K filings containing potential GAAP violations.

Our second data source is a customized automated search for occurrences of financial restatements that are seemingly due to intentional misapplications of GAAP (irregularity restatements). We use the classification methodology discussed in Hennes et al. [2008] to develop a customized identification tool.

Our customized tool performs well in capturing financial misreporting in our sample, as a manual inspection of irregularities identified by the search tool indicates that the misstated financial reports contained material and intentional misapplications of GAAP. To identify irregularity restatements, we download all amended 10-K filings (10-K/As) from the SEC EDGAR FTP site. We gather firm-identifying information for matching purposes from the header (or, alternatively, from the body of the text when the header is missing or incomplete), and then parse the 10-K/A in a manner similar to our parsing of unamended 10-Ks (see Appendix A.1 of the online appendix). After parsing the filings, we follow Hennes et al. [2008] and search the text for direct statements of the occurrence of financial reporting irregularities or narratives referring to the investigation of misstatements by either regulatory or independent parties. Appendix A describes our full search terms.

We search for phrases such as "fraud," "materially false and misleading," and "violation of federal securities laws" to identify restated filings with direct discussion of irregularities. We identify restatements with related regulatory investigations based on narratives referring to investigation by the SEC, the DOJ, or an Attorney General. Restatements with independent investigations are classified based on discussion of investigations by forensic accountants, the audit committee, or an independent committee, as well as statements referring to the retention of legal counsel over the misstatement. Based on this identification strategy, we classify each 10-K filing as misstated if our search of the corresponding 10-K/A detects narratives reflecting an irregularity as detailed above. We then code misreport as 1 for those firm-years with misstated annual reports; misreport equals 0 if there is no amended 10-K for the respective fiscal year or if the amended 10-K filing does not involve an irregularity.
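A minimal Python sketch of this phrase-based classification follows, assuming the parsed 10-K/A text is already in hand. The phrase list below contains only the terms quoted in the text above and is illustrative; the complete set of search terms is given in Appendix A of the paper.

    import re

    # Illustrative subset of irregularity phrases drawn from the text above;
    # the authors' full list appears in their Appendix A.
    DIRECT_PHRASES = [
        r"fraud\w*",                          # fraud, fraudulent, ...
        r"irregularit\w*",                    # irregularity, irregularities
        r"materially false and misleading",
        r"violation of federal securities laws",
    ]
    INVESTIGATION_PHRASES = [
        r"investigation by the (sec|securities and exchange commission)",
        r"department of justice",
        r"attorney general",
        r"forensic accountants?",
        r"independent committee",
    ]

    def is_irregularity(filing_text):
        """Return True if a 10-K/A narrative suggests an intentional misstatement."""
        text = filing_text.lower()
        return any(re.search(p, text) for p in DIRECT_PHRASES + INVESTIGATION_PHRASES)

    # misreport = 1 for firm-years whose 10-K/A matches; 0 otherwise.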

    3.2 Empirical Measures

    3.2.1 Financial Measures

We build our fraud prediction model by including the financial statement and market-related variables selected for the Dechow et al. [2011] F-score model.12 We focus on the set of variables providing the highest predictive power as reported in Dechow et al. [2011] (see Model 3 in Table 9). The financial statement variables include measures of accrual quality, firm performance, off-balance-sheet activities, and market-based incentives. Panel A of Appendix B defines each of the variables outlined below. The accrual quality measures include an extended definition of working capital accruals as developed in Richardson, Sloan, Soliman & Tuna [2005] (termed RSST accruals). The RSST accruals measure captures the change in noncash net operating assets.13 We also measure the change in receivables and the change in inventory since misstatement of these two accrual components affects widely used performance metrics, namely, sales growth and gross margin. Lastly, the percent of soft assets on the balance sheet captures accounting flexibility and, in turn, the room for managerial discretion in changing the measurement assumptions of net operating assets in order to meet short-term performance goals.

12 In robustness tests, we find that our results hold when we include standard financial ratios and bankruptcy prediction measures, consistent with prior fraud studies (e.g., Beneish [1997], Cecchini et al. [2010]).
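For reference, the RSST accrual measure of Richardson et al. [2005] is commonly computed as below; the exact variable construction used in this paper is defined in Panel A of Appendix B, so this standard formulation is given only as a guide.

\[
\text{RSST accruals}_t \;=\; \frac{\Delta WC_t + \Delta NCO_t + \Delta FIN_t}{\text{Average total assets}_t},
\]

where $WC$ is working capital (current operating assets less current operating liabilities), $NCO$ is net noncurrent operating assets, and $FIN$ is net financial assets.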

Our performance measures capture managerial incentives to manipulate financial statements to mask poor firm performance. These measures include the change in cash sales and the change in return on assets. To gauge the extent to which firms engage in off-balance-sheet financing, we include an indicator variable to identify firm-years with nonzero future operating lease obligations. The GAAP rules for operating leases lead to lower expenses being booked to the income statement in the early years of the asset's life. Thus, the existence of operating leases proxies for managers' propensity to window-dress financial performance. Lastly, the issuance of securities in a given firm-year, the book-to-market ratio, and the market-adjusted stock return of the prior fiscal year all capture market-related pressures to engage in fraudulent reporting.

    3.2.2 Style Measures

Style characteristics are, in essence, simple summary statistics of textual information. Since our main construct of interest, disclosure topic, is derived from text, the ability to detect financial misreporting beyond simple style characteristics is quite important. As such, we benchmark our topic measure against a comprehensive set of style characteristics from prior literature, as well as four new measures developed from our analysis. Panel B of Appendix B presents a full list of the style variables and their measurement.14

13 We do not include other measures of discretionary accruals (e.g., modified Jones and performance-matched discretionary accruals) as Dechow et al. [2011] find that these measures perform poorly in detecting accounting manipulation compared to unadjusted accrual measures.

14 Our results are robust to a large vector of alternative style characteristics. This vector includes a full battery of processing measures (a variable for each part removed from the filing), median word, sentence, and paragraph lengths (in addition to the already included mean lengths), Harvard IV dictionary measures, six alternative readability measures, a variable capturing every part of speech coded in the Brown corpus, total and tagged word counts, two other measures of sentence repetition, and deviation from the Benford distribution. We find that the majority of these variables are highly correlated with the style characteristics selected for our primary analyses.

Our new measures are the log of the number of bullets, the length of the SEC-mandated header, the number of excess newlines (vertical whitespace) in the filings, and the character length of HTML tags. The log of the number of bullets captures an aspect of readability, as bulleted information is typically concise. We find considerable variation in this measure, as 13.5% of our sample filings do not contain bullets, while 10% of the filings include over 1,400 bullets. The SEC header contains basic corporate and filing form information such as company name and address, SIC industry, form type, and filing date. Filings with long headers generally identify firms that operated under former company names in prior years.15 We therefore expect the SEC header length to be correlated with disclosure complexity attributable to complex firm transactions that have been shown to be correlated with fraudulent activity (e.g., mergers, acquisitions, and corporate restructurings).

15 Former company names and the date of each name change are disclosed in a separate block of header fields. Firms can enter up to three former names in the EDGAR system.

Excess newlines (vertical whitespace) increase the length of the 10-K filing without adding any substantive content. Managers engaging in financial misreporting could insert additional whitespace to keep the length of the filing consistent with filings by peer firms or the firm's own prior filings while omitting some pertinent information. We include the character length of HTML tags in the unparsed documents as a broad measure of technological expertise or savvy. This proxy attempts to distinguish between documents created using a basic word processing program, e.g., Microsoft Word (which should embed numerous HTML tags), and documents created by more specialized financial software.

The second group of textual style variables comprises filing length and style structure measures, in the vein of Li [2008] and Goel et al. [2010]. We use these length and structure measures as additional proxies for disclosure readability and textual complexity. The selected variables include the mean and standard deviation of the length of words, sentences, and paragraphs in the 10-K filing, as well as measures of sentence repetition and the type-token ratio (see Goel et al. [2010]; Li [2014]). The type-token ratio (the number of unique words scaled by the number of total words) measures vocabulary variation and, consistent with Rennekamp [2012], captures the idea of superfluous words, as a higher ratio indicates the use of a broader vocabulary. We also compute the percent of short and long sentences (≤ 30 or ≥ 60 words, respectively) contained in the filing. We include two complementary measures of readability: the Gunning Fog Index and the Coleman-Liau Index. These measures are widely used in the accounting literature to indicate disclosure inefficiencies or misinformation [Goel et al., 2010; Lehavy, Li & Merkley, 2011; Li, 2008; Rennekamp, 2012].
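To illustrate two of these measures, the short Python sketch below computes the type-token ratio and the Gunning Fog Index for a block of filing text. The syllable counter is a crude heuristic for illustration only; the paper's exact variable definitions are given in Panel B of Appendix B.

    import re

    def syllables(word):
        # Rough heuristic: count vowel groups (illustrative only).
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def style_measures(text):
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"[A-Za-z']+", text)
        ttr = len(set(w.lower() for w in words)) / len(words)    # type-token ratio
        complex_words = [w for w in words if syllables(w) >= 3]
        fog = 0.4 * (len(words) / len(sentences)
                     + 100 * len(complex_words) / len(words))    # Gunning Fog Index
        return ttr, fog

    ttr, fog = style_measures("Revenues increased substantially. Management expects continued growth.")
    print(round(ttr, 2), round(fog, 1))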

Our final set of textual measures comprises a battery of language and word choice variables. First, we measure language voice (active and passive), which has been shown to correlate with the incidence of financial misreporting [Goel et al., 2010; Goel & Gangolly, 2012]. Consistent with Purda & Skillicorn [2015], we measure word choices using the six word dictionaries defined in Loughran & McDonald [2011]. These dictionaries contain lists of financial-related words that capture disclosure tone and the use of uncertainty and litigious vocabulary. We also include three measures of disclosure emphasis: the use of capitalized words, exclamation points, and question marks (see Goel & Gangolly [2012]).

    3.2.3 LDA Topic Measure

Our topic measure is based on the unstructured and unsupervised LDA topic modeling methodology developed by Blei et al. [2003].16 We choose this approach due to its intuitive characteristics and strong performance. In particular, LDA is a Bayesian probabilistic model and offers significant theoretical improvement over older data-driven and principal-component-based tools such as Latent Semantic Analysis (LSA). Furthermore, the topic modeling accuracy of LDA is quite strong when compared to human classification of topics or other unsupervised machine algorithms such as LSA-IDF or LSA-TF.17 In an experiment, Anaya [2011] finds that humans classify main topics with 94% accuracy, while LDA achieves 84% accuracy. Comparable accuracy statistics were 84% for LSA-IDF and 59% for LSA-TF. While the accuracy of human classification is greater than that of LDA, the human approach is infeasible when classifying large volumes of textual data. In fact, the LDA tool allows us to categorize the disclosure content of annual reports containing text narratives of over 3 billion words, allowing for rigorous testing that otherwise would be impossible based on human topic classifications.

16 For predictive purposes, Mcauliffe & Blei [2008] develop a supervised LDA model (sLDA) which allows each text document to be paired with a response variable that classifies each document. The goal of the sLDA model is to infer disclosure topics predictive of the response. This response variable would be misreport in our setting. We refrain from using the sLDA model for two reasons. First, the unsupervised LDA model allows us to provide a baseline for the common disclosure topics contained in annual reports, irrespective of misreporting. Second, Mcauliffe & Blei [2008] find that the prediction performance of sLDA is equivalent to LDA in text corpora with difficult-to-predict responses.

The LDA model is based on a few simple assumptions. The model assumes a collection of K topics in a given set of documents and that the vocabulary of each topic k is distributed following a Dirichlet distribution, β_k ~ Dirichlet(η).18 The model further assumes that the topic proportions in each document d are drawn from a Dirichlet distribution, θ_d ~ Dirichlet(α). Given these assumptions, a specific number of topics to identify, and a few learning parameters, the LDA model categorizes the words in a given set of documents into well-defined topics. Because the model uses Bayesian analysis, a word is allowed to be associated with multiple topics. This is a convenient feature of LDA, as words can have multiple meanings, especially in different contexts. In sum, the generative process of LDA can be viewed as a probabilistic factorization of the vocabulary in a collection of documents into a set of topic weights and a dictionary of topics.
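For concreteness, the generative process can be written as follows; this is the standard statement of LDA from Blei et al. [2003], restated in our notation rather than drawn from the paper itself:

\begin{align*}
&\beta_k \sim \text{Dirichlet}(\eta), \quad k = 1, \dots, K \\
&\theta_d \sim \text{Dirichlet}(\alpha) \quad \text{for each document } d \\
&z_{d,n} \sim \text{Multinomial}(\theta_d) \quad \text{for each word position } n \\
&w_{d,n} \sim \text{Multinomial}(\beta_{z_{d,n}})
\end{align*}

That is, each observed word w_{d,n} is generated by first drawing a topic assignment z_{d,n} from the document's topic proportions and then drawing the word from that topic's vocabulary distribution.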

We implement the LDA algorithm using a dynamic time-series process since we expect disclosure content to change across time due to factors such as macroeconomic conditions, technological changes in business operations, regulatory interventions (e.g., the 2002 Sarbanes-Oxley Act), and changes in firm management. Consequently, this approach allows us to assess the changing nature of disclosure content and its ability to predict accounting misstatements. Our time-series procedure identifies the topics discussed in each rolling five-year window over our sample period (1994–2012). That is, we run the algorithm for the periods 1994–1998, 1995–1999, 1996–2000, and so on. The topics discovered in each window are then used to determine the disclosure content of annual reports issued in the year immediately following each five-year window. This results in a test period of 1999–2012 for our prediction analyses. Note that while new topics may arise in the year after each window, the topics discussed in the prior five years provide the most practical estimates of current-year disclosure content while avoiding potential look-ahead biases in our prediction tests.

17 LSA-IDF and LSA-TF are LSA-based measures using a term-document matrix that has undergone a transform: inverse document frequency or term frequency, respectively.

18 A Dirichlet distribution is essentially a multivariate generalization of the beta distribution.
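The window scheme itself is simple; a minimal Python sketch of how the estimation windows and test years line up (the pairing is stated in the text above, the loop is ours):

    # Rolling five-year estimation windows and the out-of-sample test year
    # that follows each one (sample period 1994-2012, 14 windows in total).
    for start in range(1994, 2008):             # last window: 2007-2011 -> test 2012
        window = list(range(start, start + 5))  # years used to fit LDA
        test_year = start + 5                   # filings whose topics are scored
        print(window, "->", test_year)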

For our implementation of the algorithm, we follow Hoffman, Bach & Blei [2010] and use an "online" variant of LDA. This approach allows us to run the algorithm in small batches and to classify large quantities of text without encountering computational barriers. We run the online LDA algorithm in batches of 100 filings since small batches are more computationally efficient given the large sizes of 10-K filings. We draw the filings in each batch in random order to mitigate overweighting of early years in the online LDA tool. Consistent with Hoffman et al. [2010], we use symmetric Dirichlet distributional parameters of α = η = 1/20 and the learning parameters κ = 7/10 and τ₀ = 1024. The learning parameter κ controls how quickly old information is forgotten, while the parameter τ₀ downweights early iterations of the model. Hoffman et al. [2010] document that these distributional and learning parameter settings are optimal when categorizing articles from the science journal Nature, as well as when categorizing text from Wikipedia. We then set the algorithm to identify 31 topics in each five-year window. We select 31 topics since simulated results indicate that this number of topics is optimal in capturing the occurrence of irregularity restatements (see Appendix A.2 of the online appendix for a description of this simulation).19
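A minimal sketch of this estimation step using the gensim library's online LDA implementation, which follows Hoffman et al. [2010]. The paper does not state which software the authors used, so the mapping of κ and τ₀ to gensim's decay and offset arguments, and the variable name window_tokens (one token list per parsed, stopword-filtered 10-K in the window), are our assumptions.

    from gensim import corpora, models

    # window_tokens is assumed: a list of token lists, one per 10-K filing
    # in the five-year estimation window.
    dictionary = corpora.Dictionary(window_tokens)
    corpus = [dictionary.doc2bow(doc) for doc in window_tokens]

    lda = models.LdaModel(
        corpus=corpus,
        id2word=dictionary,
        num_topics=31,      # optimal topic count from the authors' simulation
        alpha=1/20,         # symmetric document-topic Dirichlet prior
        eta=1/20,           # symmetric topic-word Dirichlet prior
        decay=0.7,          # kappa: how quickly old information is forgotten
        offset=1024,        # tau_0: downweights early iterations
        chunksize=100,      # online batches of 100 filings
    )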

Next, we pre-process the parsed 10-K filings by first removing stop words. Stop words are those deemed irrelevant for our text-based measures because they occur either too frequently (e.g., 'the', 'an', 'is') or too infrequently to be of use in fraud prediction (such cases were often garbled text or misspellings in the 10-K filings). Because our analysis uses rolling five-year windows, we generate our stopwords on matching five-year windows to avoid potential look-ahead biases. We remove three types of stopwords: 1) the most frequent words appearing in each rolling five-year window of our sample period, removed until we have eliminated 60% of all word occurrences in the window; 2) words that occur fewer than 1,100 times in the window; and 3) words that occur in fewer than 100 filings. These parameters are also derived in our simulation (see Appendix A.2 of the web appendix).

19 We run the simulation on irregularity restatements given the lower occurrence of SEC AAERs.
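A minimal Python sketch of these three filters, assuming window_tokens holds the token lists for all filings in one five-year window (the function and variable names are ours):

    from collections import Counter

    def build_stopwords(window_tokens):
        """Apply the three stopword filters for one five-year window."""
        term_freq = Counter(t for doc in window_tokens for t in doc)
        doc_freq = Counter(t for doc in window_tokens for t in set(doc))
        total = sum(term_freq.values())

        # 1) Remove the most frequent words until 60% of occurrences are gone.
        stop, removed = set(), 0
        for word, count in term_freq.most_common():
            if removed >= 0.60 * total:
                break
            stop.add(word)
            removed += count

        # 2) Words occurring fewer than 1,100 times in the window.
        stop |= {w for w, c in term_freq.items() if c < 1100}
        # 3) Words appearing in fewer than 100 filings.
        stop |= {w for w, c in doc_freq.items() if c < 100}
        return stop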

We run the LDA algorithm on the pre-processed filings, generating 31 topics in each rolling window along with the weighting of each word associated with each topic. Using these word weights, we compute the weight of each topic in filings issued in the year following the five-year window. For example, the weighted word vectors for the topics identified in the 1994–1998 window are used to determine the topic weights in filings issued in 1999. To compute the topic weights in a given filing, we multiply the vector of word weights within the topic by a vector of word counts for the filing. We then normalize the weight of each topic by the sum of the weights of all topics identified in the filing. This procedure generates the proportion of the content of each document that is associated with each topic. We denote these topic proportions as topic. Note that while new topics may arise in the year after each window, the topics discussed in the prior five years provide the most effective estimates of current topic proportions while avoiding potential look-ahead biases in our prediction tests.
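In matrix form, this scoring step is a single multiplication followed by normalization; a Python sketch (the array names are ours):

    import numpy as np

    def topic_proportions(word_weights, word_counts):
        """word_weights: (31, V) topic-word weight matrix from the prior window.
        word_counts: (V,) vector of word counts for one out-of-sample filing.
        Returns the normalized share of the filing attributed to each topic."""
        raw = word_weights @ word_counts   # unnormalized topic weights
        return raw / raw.sum()             # proportions summing to one

    # Example with 31 topics over a toy vocabulary of 5 words.
    rng = np.random.default_rng(0)
    weights = rng.random((31, 5))
    counts = np.array([12, 0, 3, 7, 1])
    print(topic_proportions(weights, counts).sum())  # 1.0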

    3.3 Validation of LDA Topic Measure

    Before investigating our research questions, we validate our topic measure using several

    methods. Following prior research (e.g., Bao & Datta [2014], Huang et al. [2014]), our

    first method evaluates the semantic validity of the LDA output by labeling the topics and

    assessing the extent to which the topics provide meaningful content. As discussed above,

    we derive our topic measure using a rolling-window approach with 31 topics identified in

    22

  • each of the 14 rolling five-year windows over our sample period. For ease of interpretation,

    we aggregate the topics discovered in each window up to the full sample. We refer to these

    aggregate topics as “combined topics.” Since the optimal number of topics in each window

    can vary, we allow multiple topics within a window to be associated with the same combined

    topic. We also allow the number of combined topics to be greater than 31 since some topics

    may not be present in all of the five-year windows. We derive the combined topics by

    matching topics across years based on the Pearson correlation of the word weights within

    the topics. We group all topics with a Pearson correlation above a specific threshold. We

    test thresholds for the Pearson correlations from 1% to 90% in 1% intervals to determine the

    most coherent grouping. The most coherent topics were found when the Pearson correlation

    threshold was set at 11%.20 This matching procedure results in 64 combined topics across

    our sample period.
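The sketch below illustrates one way to implement this matching; since the grouping rule is not fully pinned down in the text, the sketch assumes single-linkage grouping (connected components over all topic pairs correlating above the threshold):

```python
from scipy.stats import pearsonr

def combine_topics(topic_vectors, threshold=0.11):
    """topic_vectors: word-weight vector of every topic from every window,
    aligned to a common vocabulary. Returns lists of topic indices, one
    list per combined topic."""
    n = len(topic_vectors)
    parent = list(range(n))

    def find(i):                       # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            r, _ = pearsonr(topic_vectors[i], topic_vectors[j])
            if r > threshold:
                parent[find(i)] = find(j)   # merge the two groups

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())            # e.g., 64 combined topics
```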

    To determine the underlying content of each combined topic, we generate a list of the

    highest weighted phrases and sentences associated with each topic. We construct the list

    by first extracting the top 1,000 sentences per topic based on the weighted words associated

    with each combined topic. Next, we sort the sentences based on their length and extract

    the middle tercile (334 sentences) as representative sentences with a typical length. We

    then extract the top 20 most frequent bigrams (i.e., two-word phrases excluding stopwords,

    numbers, and symbols) within the 334 mid-length sentences. We also sort the 334 sentences

    based on the cosine similarity between a given sentence and the remaining 333 sentences.

    We manually review the top 20 bigrams and top 100 mid-length sentences based on cosine

    similarity and assign descriptive labels to each of the 64 topics.
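A simplified sketch of these two steps (bigram extraction and cosine-similarity ranking) follows; the tokenization here is deliberately crude and all names are illustrative:

```python
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def labeling_inputs(sentences, stopwords, n_bigrams=20, n_sentences=100):
    """sentences: the middle tercile (by length) of a topic's top sentences.
    Returns the most frequent bigrams and the sentences ranked by average
    cosine similarity to the rest of the set."""
    bigram_counts = Counter()
    for s in sentences:
        # Crude filter: keep alphabetic tokens not in the stop word list.
        words = [w for w in s.lower().split()
                 if w.isalpha() and w not in stopwords]
        bigram_counts.update(zip(words, words[1:]))
    top_bigrams = [b for b, _ in bigram_counts.most_common(n_bigrams)]

    # Average cosine similarity of each sentence to all other sentences.
    dtm = CountVectorizer().fit_transform(sentences)
    sim = cosine_similarity(dtm)
    avg_sim = (sim.sum(axis=1) - 1.0) / (len(sentences) - 1)  # drop self
    ranked = [s for _, s in
              sorted(zip(avg_sim, sentences), key=lambda t: -t[0])]
    return top_bigrams, ranked[:n_sentences]
```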

    Appendix C presents a list of the 64 combined topics with 10 selected bigrams per topic

and our inferred topic labels.21 We report 10 representative bigrams after excluding

20The first pass of this test determined that the optimal correlation threshold ranged between 8% and 18%. We then conduct tests of this threshold range in 0.05% increments to locate the 11% cut-off point. We also compare the combined topics generated by groupings based on Spearman correlation and Euclidean distance. Both of these alternative methods performed poorly due to overweighting on words with low topic weightings, leading to incoherent topic groupings.

21The inferred labels for a few topics are overlapping due to only minor differences in the content inferred from the bigrams and mid-length sentences. We treat these topics separately in our empirical analyses to mitigate any noise introduced by our topic aggregation process.


redundant bigrams (e.g., “millions in,” “company also,” “in year”) and those with similar

    inferences (e.g., “compared to” and “compared with” in topic 2, or “derivative financial”

    and “financial derivative” in topic 9). We note that the LDA algorithm performs well in

    identifying narrative content that is distinctively related to changes in firms’ financial per-

formance. For instance, topics 1 and 2 both refer to firms’ income performance compared

    to prior periods. Examples of top mid-length sentences from topic 1 include the following:

    “Other income decreased to $11,745,000 in 1999 as compared to $11,882,000 in 1998 and

    $10,521,000 in 1997” and “Management fee income decreased to $0 as compared to $1.4

    million in 1997.” Other topics related to financial performance include segment performance

    (topics 16 and 54), franchise revenues (topic 26), and general references to quantitative fi-

    nancial statement information (topics 7, 34, 62, and 63). LDA also identifies topics related

    to complex business transactions and arrangements such as corporate spin-offs (13 and 64),

    derivatives and hedging activities (9), fair value/cash flow hedging (41), merger activities

    (31), R&D partnerships (32), joint venture agreements (39), strategic alliances (46), and

    investment in securitized/guaranteed securities (55).

    Several topics also refer to specific financial statement items and/or their underlying

measurement assumptions such as post-retirement health care cost assumptions (4), accounts

receivable and doubtful accounts (12), long-term assets (25), advertising expenses (36), and

    the measurement of natural gas properties (38). Consistent with Huang et al. [2014], we are

    able to identify industry-specific topics such as aircraft leasing arrangements in the airline

    industry, franchise revenue recognition and restaurant growth in the restaurant industry,

    as well as general discussion of business risks and operational factors in the agricultural,

    gaming, mining, marine transportation, and hotel industries. Lastly, as demonstrated in

    Bao & Datta [2014], LDA effectively discovers content related to common risk factors and

    contingencies such as foreign currency risks (57), country risks (18 and 37), environmental

liabilities and risks (6 and 56), patent infringement and rights (48), and legal proceedings



(45). In summary, the evidence in Appendix C suggests that the LDA algorithm provides a

    valid set of economically meaningful topics.

    Our next validation method evaluates whether the disclosure topics perform reasonably

    well in detecting misstatements using in-sample tests. Figures 1 and 2 depict the distribution

of each combined topic over the 1999 to 2012 period (our misreporting prediction years) and

    whether the topic is significantly associated with the occurrence of financial misreporting.

    Figure 1 (2) illustrates the distribution for the sample of irregularity restatements (AAERs).

    We determine the significance of the combined topics by estimating yearly in-sample regres-

    sions of the disaggregated subtopics (i.e., the topics associated with a given combined topic

    in each year) on our misreport indicator variable. We orthogonalize the subtopic proportions

    to 2-digit SIC industries to control for unobserved industry effects.
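This orthogonalization amounts to retaining the residuals from regressing each subtopic proportion on 2-digit SIC indicator variables, as in the following sketch (column names are illustrative):

```python
import pandas as pd
import statsmodels.api as sm

def orthogonalize_to_industry(df, topic_cols, sic_col="sic2"):
    """Replace each subtopic proportion with its residual from an OLS
    regression on 2-digit SIC dummies, removing industry-level effects."""
    dummies = pd.get_dummies(df[sic_col], prefix="sic",
                             drop_first=True, dtype=float)
    X = sm.add_constant(dummies)
    out = df.copy()
    for col in topic_cols:
        out[col] = sm.OLS(df[col].astype(float), X).fit().resid
    return out
```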

    We observe in both figures that the discussion of several topics is relatively consistent

    across the sample years. These topics include changes in income performance (topics 1 and

    2), measurement of post-retirement benefits (3), and industry-specific topics such as aircraft

    leasing arrangements (4) and real estate loan operations (10). Other topics appear later in

the sample period, indicating the evolving nature of firms’ disclosure content. For instance,

    discussions of collaborative business arrangements such as joint ventures (39), strategic al-

    liances (46), and partnerships (51) are more prominent in the second half of our prediction

    period.

    With respect to the ability to detect misreporting, Figure 1 illustrates that discussion of

    increases in income performance compared to prior periods (combined topic 2) is significantly

    associated with irregularity restatements in most of our prediction years. However, the

    direction of the significance is not consistent throughout the sample period. We also observe

that discussion of declines in income performance (topic 1) is significant in relatively few

    years in our sample. These results suggest that the association between misreporting and

    managerial discussion of financial performance is not as clear cut as suggested by prior work

    on the relation between fraud and poor financial performance.


The results in Figure 1 also suggest that misreporting firms are more likely to discuss

    issues related to share capital transactions, investments in securitized/guaranteed securities,

    environmental risks, foreign operations, and growth in franchised operations. Combined

    topics that load consistently negative include discussions of merger activities, joint venture

    arrangements, fair value/cash flow hedging, legal proceedings, and stock option plans, sug-

    gesting that restatement firms are less likely to discuss these issues in misstatement years.

    The results for AAER firms (Figure 2) are similar, but with some variation in the timing of

    the topic loadings. AAER firms are less likely to discuss segment performance and declines

    in income performance, primarily in the earlier years of the sample. Taken together, our ev-

idence in Figures 1 and 2 suggests that disclosure content provides significant informational

    value for detecting misstatement events. These results provide us with greater confidence

    for investigating the fraud prediction performance of topic relative to financial statement

    variables (RQ1) and textual style characteristics (RQ2).

    4 Empirical Methodology and Results

    4.1 Empirical Methodology

    To investigate our research questions, we conduct both in-sample and out-of-sample tests

    using our time-series approach. We first estimate in-sample prediction models using rolling

    five-year windows. We then conduct out-of-sample tests using the estimates from each five-

    year window to predict the likelihood of intentional misreporting in the year following each

window.22,23

    We begin our analyses by estimating logistic regressions of our misreporting indicator

22For filings coded as an irregularity restatement, we ensure that the restatement is revealed by the end of the in-sample window. We are unable to apply this restriction in the AAER sample as the UC Berkeley dataset does not include the release dates of the AAERs.

23For example, the estimated results for 1994 – 1998 (1995 – 1999) are used to predict misreporting for a holdout sample of firms in 1999 (2000) and so on.


variable (𝑚𝑖𝑠𝑟𝑒𝑝𝑜𝑟𝑡) on vectors of the disaggregated topic proportions (𝑡𝑜𝑝𝑖𝑐) as follows:

$$\log\left(\frac{misreport_{i,t}}{1 - misreport_{i,t}}\right) = \alpha + \sum_{j=1}^{31} \beta_j \, topic_{j,i,t} + \varepsilon_{i,t}, \qquad (1)$$

$$t \in [T-5,\, T-1], \quad i \in \text{Companies}$$

    We estimate equation (1) for the five-year window preceding each of the prediction years,

    1999 to 2012. For our AAER specification, we lack sufficient instances of financial misre-

    porting for our out-of-sample test for the year 2012, and thus we remove this year from the

    specification. We remove out-of-sample test years with insufficient misreporting events when

    conducting analyses on various subsamples. We then use the estimated regression coefficients

    to predict the likelihood of intentional financial misreporting for 10-K filings in the subse-

quent year. Similar to Dechow et al. [2011], we construct a prediction score (𝑝_𝑚𝑖𝑠𝑟𝑒𝑝𝑜𝑟𝑡)

    using the estimated coefficients and apply this scoring in our out-of-sample tests as follows:

$$\log\left(\frac{misreport_{i,T}}{1 - misreport_{i,T}}\right) = \alpha + \beta_1 \, p\_misreport_{i,T} + \varepsilon_{i,T}, \qquad i \in \text{Companies} \qquad (2)$$
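As a sketch, the rolling estimation in equation (1) and the scoring in equation (2) can be implemented as follows; we use scikit-learn with a large C to approximate an unpenalized logit, and all variable names are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def rolling_prediction(X, y, years, T):
    """X: (n, k) topic proportions; y: misreport indicator (0/1);
    years: filing year of each row. Fits equation (1) on [T-5, T-1]
    and returns the p_misreport score for year-T filings."""
    in_window = (years >= T - 5) & (years <= T - 1)
    logit = LogisticRegression(C=1e6, max_iter=1000)  # large C: ~no penalty
    logit.fit(X[in_window], y[in_window])

    holdout = years == T
    # Fitted log-odds applied to the holdout filings (equation (2) regressor).
    return X[holdout] @ logit.coef_.ravel() + logit.intercept_[0]
```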

    To investigate Research Question 1 (RQ1), we estimate two additional regression specifi-

    cations. The first specification replaces the topic vector with the vector of financial variables

    discussed previously, whereas the second specification extends equation (1) by including both

vectors of 𝑡𝑜𝑝𝑖𝑐 and the financial variables. In both cases, we generate a 𝑝_𝑚𝑖𝑠𝑟𝑒𝑝𝑜𝑟𝑡 measure

and run the out-of-sample tests as well. These two specifications allow us to gauge the

    incremental fraud-prediction ability of 𝑡𝑜𝑝𝑖𝑐 beyond traditional financial statement variables.

    For our second research question (RQ2), we introduce four specifications that include

style characteristics. The first specification includes only our style characteristics, while the second

includes both financial variables and style characteristics. The third and fourth specifi-

    cations are expanded versions of the first two models with the 𝑡𝑜𝑝𝑖𝑐 vector inserted. Our


general regression form for RQ2 is specified below in equation (3):

$$\log\left(\frac{misreport_{i,t}}{1 - misreport_{i,t}}\right) = \alpha + \sum_{j=1}^{10} \beta_j \, financial_{j,i,t} + \sum_{j=1}^{30} \beta_{j+10} \, style_{j,i,t} + \sum_{j=1}^{20} \beta_{j+40} \, topic_{j,i,t} + \varepsilon_{i,t}, \qquad (3)$$

$$t \in [T-5,\, T-1], \quad i \in \text{Companies}$$

    Due to the large number of variables included in the regressions and the naturally small

    number of AAERs and intentional restatements in our windows, we tightly control the

convergence of our logistic regressions. We control convergence by checking each regression

specification for both complete and quasi-complete separation. Appendix

    A.3 of the online appendix details the necessary steps for conducting these checks.
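As a simple illustration, the check below flags single-variable complete separation; it is only a first screen, and quasi-complete separation requires the fuller procedure described in Appendix A.3:

```python
import numpy as np

def single_variable_separation(X, y):
    """Return indices of predictors that completely separate the classes
    (a logit will not converge under complete separation)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y).astype(bool)
    flagged = []
    for j in range(X.shape[1]):
        pos, neg = X[y, j], X[~y, j]
        if pos.min() > neg.max() or neg.min() > pos.max():
            flagged.append(j)
    return flagged
```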

    4.1.1 Statistical Testing

    Given the structure of our rolling time-series analysis, we are unable to use a standard Fama-

    MacBeth methodology to pool our results for the predicted window. This restriction results

    from the topic variables naturally changing across windows as previously reported. Thus,

we cannot aggregate results at the variable level across windows. To address this research design issue, we

    use Fisher’s (1932) method to provide aggregated test statistics.24 The Fisher test statistic

    is appropriate for our analyses since the out-of-sample regressions are estimated using non-

    overlapping years.
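A minimal implementation of this aggregation, following the formula in footnote 24:

```python
import numpy as np
from scipy import stats

def fisher_combined(p_values):
    """Combine independent p-values from the non-overlapping out-of-sample
    years: -2 * sum(log p_i) is chi-squared with 2N degrees of freedom."""
    p = np.asarray(p_values, dtype=float)
    statistic = -2.0 * np.log(p).sum()
    combined_p = stats.chi2.sf(statistic, df=2 * len(p))
    return statistic, combined_p

# Example: yearly out-of-sample p-values for one model specification.
stat, p = fisher_combined([0.04, 0.20, 0.01, 0.08])
```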

    We refine our test statistic further by deriving a statistic referred to as a Var-Gamma

test (see Appendix D). This test statistic allows us to compare the results of Fisher’s method,

    statistically testing whether one fraud detection model performs better than another when

    pooled across years.

24The test statistic is computed as $-2\sum_{i=1}^{N} \log(p_i) \sim \chi^2_{2N}$, where $p_i$ is the $i$th p-value of $N$ total p-values.


4.2 Empirical Results

    4.2.1 The Predictive Value of Topic versus Financial Variables (RQ1)

    We address our first research question by investigating the informational role of 𝑡𝑜𝑝𝑖𝑐 versus

    financial statement variables in detecting intentional misreporting. Table 1 presents separate

    summary statistics of our financial statement variables for fraud and non-fraud firm-years

    in the AAER and irregularity restatement samples. Consistent with Dechow et al. [2011],

    we find that the percent of soft assets is significantly higher in both samples, suggesting

    more reporting flexibility in misstatement years (𝑝-value < 0.01). We also find a greater

    tendency to issue securities and engage in off-balance-sheet activities in misstatement years

    in both samples. Furthermore, firms involved in AAERs have significantly larger increases

    in receivables and inventory, and larger declines in ROA, consistent with greater earnings

    management and poorer financial performance in manipulation years. Lastly, AAER firms

experience higher market-adjusted stock returns in the year prior to the misstated year. This

result, combined with the higher rate of security issuance during the misstated year, indicates

    that market-related incentives play a strong role in intentional misreporting. In sum, the

    univariate results provide initial evidence that financial statement variables provide useful

    information for predicting intentional financial misreporting.

    Table 2 presents the results of our out-of-sample tests of the predictive role of 𝑡𝑜𝑝𝑖𝑐

    and financial variables (denoted as 𝐹 − 𝑆𝑐𝑜𝑟𝑒). Panels A and B present the Fisher and

    Var-Gamma test statistics for AAERs, while Panels C and D present similar statistics for

    irregularity restatements. The Fisher test statistics (Panels A and C) indicate that the

    financial variables provide a significant amount of information for predicting AAERs (𝑝 <

    0.001); however, they fail to provide significant informational value for predicting irregularity

    restatements (𝑝 = 0.223). Furthermore, the results in Panels B and D suggest that the stand-

    alone 𝑡𝑜𝑝𝑖𝑐 vector performs significantly better at predicting both AAERs and irregularity

    restatements than either the stand-alone vector of financial metrics or the paired vectors of


𝑡𝑜𝑝𝑖𝑐 and financial variables. In both samples, the Var-Gamma test statistics are significantly

    positive at the 1% level when the stand-alone 𝑡𝑜𝑝𝑖𝑐 vector is benchmarked against 𝐹 −𝑆𝑐𝑜𝑟𝑒

    and the 𝑡𝑜𝑝𝑖𝑐 vector paired with 𝐹 − 𝑆𝑐𝑜𝑟𝑒. The pairing of 𝑡𝑜𝑝𝑖𝑐 with financial measures

    performs significantly better at predicting both AAERs and irregularity restatements than

    financial measures alone (𝑝 < 0.001). Overall, we find that our measure of the thematic

    content performs significantly better at detecting instances of accounting misstatements

    relative to traditional financial characteristics.

    4.2.2 The Predictive Value of Topic versus Textual Style Characteristics (RQ2)

    Since style measures are also text-based, one could argue that 𝑡𝑜𝑝𝑖𝑐 simply proxies for the

    style characteristics of firms’ financial statements. Table 3 presents separate univariate

    statistics for our vector of style characteristics for misstated and non-misstated firm-years.

    Regarding our processing variables, we find that misstated filings in both the AAER and

    irregularity restatement samples have more (concise) bulleted information. This finding is

    inconsistent with conventional notions, but could reflect managers’ use of conciseness to omit

    relevant information. Misstated filings in both samples have longer headers relative to non-

    misstated filings, consistent with complex firm transactions like restructuring and mergers

being associated with misstatements. Also, misstated filings have fewer newlines and HTML

tags in the AAER sample, whereas misstated filings in the irregularity restatement sample have

more newlines and more HTML tags. We further note that the misstated filings in both

    samples are longer overall with longer MD&A sections, suggesting less readability during

    manipulation periods.

    In terms of complexity, misstated filings in the AAER sample contain longer words,

    shorter sentences, and shorter paragraphs, along with fewer long (> 60 word) sentences

and a greater number of short (< 30 word) sentences. In contrast, misstated filings in

the irregularity restatement sample tend to use longer sentences, longer paragraphs, fewer

long sentences, and fewer short sentences. Regarding variation, both AAER and irregularity


restatement filings have significantly lower variation in sentence length and use fewer

    unique words (type-token ratio). AAER filings also have less variation in paragraph length,

    while irregularity restatement filings have greater variation in paragraph length and have

    more repeated sentences. We note that both sets of misstated filings are less readable per

    the Gunning Fog and Coleman-Liau indices, and are more likely to contain passive voice

    grammar, consistent with managers using passive voice to disassociate themselves from the

    disclosure content [Goel & Gangolly, 2012].

    Regarding word choice, both AAER and irregularity restatement filings have significantly

    higher percentages of positive, negative, and uncertain words, consistent with Loughran

    & McDonald [2011]. AAER filings also have a lower percentage of strong words, while

    irregularity restatement filings have a greater percentage of litigious, strong, and weak words.

    Lastly, misstated filings in the AAER and irregularity restatement samples contain more

    textual emphasis as indicated by more words in all caps; however, misstated AAER filings

    have fewer exclamation points and question marks, on average.

    We approach RQ2 by combining the 𝑡𝑜𝑝𝑖𝑐 and textual style vectors in the same regression

    model. Table 4 presents the Fisher and Var-Gamma test statistics of our out-of-sample tests

    of the predictive performance of topic relative to textual style characteristics. Panels A and

B present the test statistics for AAERs; Panels C and D present the results for irregularity

restatements. The evidence in Panels A and C suggests that 𝑡𝑜𝑝𝑖𝑐 combined with style is a

    good predictor of misstatements involving AAERs and irregularity restatements (𝑝 < 0.001).

    However, for AAER misstatements, the Var-Gamma results in Panel B show that 𝑡𝑜𝑝𝑖𝑐 by

    itself is a better predictor than either textual style or 𝑡𝑜𝑝𝑖𝑐 combined with style at 𝑝 < 0.001.

    The Var-Gamma tests in Panel D show that while 𝑡𝑜𝑝𝑖𝑐 is a better predictor than style

    (𝑝 = 0.019), the joint vector of 𝑡𝑜𝑝𝑖𝑐 and style characteristics is a better predictor than the

    stand-alone 𝑡𝑜𝑝𝑖𝑐 vector at 𝑝 = 0.003. Thus, we find that the best specification for predicting

    AAERs is 𝑡𝑜𝑝𝑖𝑐 by itself, while the best specification for predicting irregularity restatements

    is 𝑡𝑜𝑝𝑖𝑐 with style. This evidence could suggest that fraud detection models based on 𝑡𝑜𝑝𝑖𝑐


and style characteristics are more able to detect accounting manipulations that are likely to

    go unidentified by the SEC.

4.2.3 The Joint Predictive Value of Topic, Financial, and Textual Style Variables

    In this section, we conduct extended analyses of the interplay between all three vectors of

    our fraud prediction variables: 𝑡𝑜𝑝𝑖𝑐, financial statement variables, and textual style char-

    acteristics. This comprehensive analysis attempts to verify that the fraud detection ability

of 𝑡𝑜𝑝𝑖𝑐 in both misstatement samples is robust to the inclusion of financial and textual style characteristics.

    We therefore estimate a comprehensive regression of the vectors of 𝑡𝑜𝑝𝑖𝑐, financial, and tex-

    tual style measures. Table 5 presents the out-of-sample results. In Panels A and C we find

    that the combined vector of 𝑡𝑜𝑝𝑖𝑐, financial, and style measures performs reasonably well in

    detecting accounting misstatements (𝑝 < 0.001). Nonetheless, the results in Panels B and

    D indicate that the stand-alone 𝑡𝑜𝑝𝑖𝑐 vector performs markedly better in predicting both

    types of accounting misstatements (𝑝 < 0.001). For AAERs (Panel B), 𝑡𝑜𝑝𝑖𝑐 outperforms

the joint vector of 𝑡𝑜𝑝𝑖𝑐, financials, and style at 𝑝 < 0.001, whereas 𝑡𝑜𝑝𝑖𝑐 paired with style is

    the dominant predictor of irregularity restatements at 𝑝 < 0.001 (Panel D). This evidence

    agrees with our previous results that 𝑡𝑜𝑝𝑖𝑐 is the best predictor of misstatements involving

AAERs, while 𝑡𝑜𝑝𝑖𝑐 and textual style characteristics provide the strongest prediction power

    for misstatements involving irregularity restatements.

To gauge the economic significance of our out-of-sample results, we examine

the percent of accounting misstatements in the top 5% of the prediction scores from the

out-of-sample regressions. Consistent with our out-of-sample results, Table 6 shows that

𝑡𝑜𝑝𝑖𝑐 and textual style characteristics capture the most misstatements involving irregularity

restatements, capturing 16.41% of them. For AAERs, we find that combining

𝑡𝑜𝑝𝑖𝑐, 𝐹 − 𝑆𝑐𝑜𝑟𝑒, and textual style characteristics capture the most AAERs, detecting

19.93% of all AAERs. While this contrasts with the results above, it is not inconsistent

    — the out-of-sample tests capture which measure is best on average, rather than over any


specific percent of the fraud scores. More importantly, Table 6 shows that topic is useful for

prediction in an economic sense, increasing the number of AAERs (irregularity restatements)

captured in the top 5% of firms by 67% (3.3%).
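The capture rate underlying Table 6 reduces to the following sketch (names are illustrative):

```python
import numpy as np

def capture_rate(scores, misreport, top_pct=0.05):
    """Percent of all misstatements whose prediction score falls in the
    top 5% of scores for the year."""
    scores, misreport = np.asarray(scores), np.asarray(misreport)
    cutoff = np.quantile(scores, 1.0 - top_pct)
    flagged = scores >= cutoff
    return 100.0 * misreport[flagged].sum() / misreport.sum()
```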

    5 Additional Analysis and Robustness

    This section provides a series of extended analyses as well as sensitivity checks for our

    primary results. We first examine the robustness of our results to alternative sources of

    financial restatements due to irregularities, as well as restatements attributable to uninten-

    tional misapplications of GAAP (errors). We also replicate our primary results using MD&A

    statements instead of the full text of the filings. Next, we change the regression form to an

    L1 regularized logit model, to alleviate concerns of potential overfitting. Lastly, we adjust

    our samples of misstated filings to exclude repeat GAAP violators as well as replicate our

    analyses using the raw 𝑡𝑜𝑝𝑖𝑐 measures (as opposed to the normalized 𝑡𝑜𝑝𝑖𝑐 proportions).

    5.1 Alternative Restatement Measures

    Our strategy for identifying irregularity restatements is based on three classification crite-

    ria: 1) management’s use of variants of the word “fraud” or “irregularity” in reference to

    the misstatement (direct restatements), 2) misstatements identified by regulatory or DOJ

    investigation (government-identified restatements), and 3) misstatements uncovered by in-

    dependent investigations (other irregularity restatements). We examine whether our results

    differ for misstatements identified under each criterion. We conduct this analysis since man-

    agerial discussion during misstatement events is likely to differ across the three settings. For

    instance, irregularities involving SEC or DOJ investigations could be more egregious com-

    pared to those involving within-firm or independent investigations. We also investigate our

    models’ ability to distinguish unintentional misstatements or errors (i.e., those misstatements

    that are not classified as intentional).


The out-of-sample results (not tabulated) provide an interesting story when we distin-

    guish the three settings. For direct restatements, the vectors of 𝑡𝑜𝑝𝑖𝑐, financial, and style

    measures are all statistically significant predictors; however, 𝑡𝑜𝑝𝑖𝑐 is the most powerful pre-

dictor, and all other combinations of the 𝑡𝑜𝑝𝑖𝑐, financials, and style vectors lead to spec-

    ifications that are significantly weaker. Government-identified restatements are also pre-

    dicted most strongly by 𝑡𝑜𝑝𝑖𝑐. Interestingly, financial statement variables are not predictive

    of government-identified misstatements (𝑝 = 0.987). Other irregularity restatements are

    likewise best captured by the 𝑡𝑜𝑝𝑖𝑐 measure, while financial measures are once again poor

    predictors.

    Lastly, all specifications perform well at predicting unintentional accounting errors. The

    𝑡𝑜𝑝𝑖𝑐 vector is tied with 𝑡𝑜𝑝𝑖𝑐 paired with the financial and style vectors when detecting

    accounting errors. We also find that 𝑡𝑜𝑝𝑖𝑐 and 𝑡𝑜𝑝𝑖𝑐 paired with style are the best specifica-

    tions, irrespective of the type of restatement (intentional irregularity or unintentional error).

    Overall, our results suggest that quantifying the thematic content of annual reports results

    in a detection tool that performs best when predicting accounting misstatements in general.

    5.2 Management Discussion and Analysis

    Several style-focused studies such as Li [2008], Li [2010], Cecchini et al. [2010], and Goel

    & Gangolly [2012] examine the MD&A section of the 10-K. We therefore investigate the

    fraud prediction performance of our topic measure based on this subset of the 10-K. We

    reconstruct our topic and style variables using the text extracted from the MD&A section

    (see Appendix A.1 of the online appendix for further details). Our out-of-sample evidence

indicates that style performs worse than financial measures at predicting misstatements

    involving AAERs, and that the predictive ability of 𝑡𝑜𝑝𝑖𝑐 is not significantly different from

    that of financial statement variables in the case of AAERs. Out-of-sample tests for the

    irregularity restatement sample show weaker Fisher statistics compared to our reported re-

    sults; however, our main results continue to hold. Thus, we conclude that the incremental


detection value of 𝑡𝑜𝑝𝑖𝑐 is robust to restricting our analysis to MD&A statements, though it

    is generally better to use the full text of the 10-K filings when examining disclosure content

    and textual style characteristics.

    5.3 Regularized Logit

    Here, we change the form of our regressions from a standard logistic regression to an L1

regularized logit. The L1 penalty is applied to the absolute magnitude of the coefficients,

shrinking many of them exactly to zero and thereby biasing against retaining too many

independent variables in the regression. We find only one difference in our set of out-of-sample tests: for the sample

    of irregularity restatements, there is no significant difference in the prediction ability of 𝑡𝑜𝑝𝑖𝑐

    relative to 𝑡𝑜𝑝𝑖𝑐 paired with style. As such, the additional predictive ability from adding

textual style to 𝑡𝑜𝑝𝑖𝑐 does not overcome the penalty the L1 regularization applies to the

additional coefficients.
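A minimal sketch of this specification change appears below; the solver and penalty strength are illustrative choices rather than our exact settings:

```python
from sklearn.linear_model import LogisticRegression

# An L1 penalty shrinks many coefficients exactly to zero, so an added
# vector (e.g., style) must improve fit enough to justify nonzero weights.
l1_logit = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
# l1_logit.fit(X_window, y_window)  # same rolling windows as the main tests
# (l1_logit.coef_ != 0).sum()       # variables retained after the penalty
```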

    Overall, the L1 regularization results do not change our primary inferences, with the

    exception that the stand-alone 𝑡𝑜𝑝𝑖𝑐 vector may be a stronger predictor of irregularity re-

    statements under certain circumstances.

5.4 Removing Second-Time Offenders

    Our next robustness test controls for possible biases introduced by allowing a firm to be

    flagged as an AAER or irregularity restatement firm in both the learning window and the

    following testing year. One concern arising from this approach is that the topic measure is

    biased toward firms that are repeat offenders, rather than the first instance of an AAER

    or irregularity restatement. To alleviate this concern, we adjust our misstatement samples

    by removing any firm-years in which the preceding firm-year was involved in a misstate-

    ment. Thus, our out-of-sample dependent variable only picks up the first year affected by a

    misstatement.

Our prediction results are virtually identical using these adjusted misstatement


samples. Specifically, the results for both misstatement samples indicate that the predictive

    ability for 𝑡𝑜𝑝𝑖𝑐, financial, and style measures is lower when removing repeat misstatements,

    but all inferences are identical to our primary inferences. As such, it appears that our 𝑡𝑜𝑝𝑖𝑐

    measure is not biased towards firms with repeated misstatements.

    5.5 Raw Topic Measures

    Our final sensitivity check uses the raw 𝑡𝑜𝑝𝑖𝑐 measures instead of the normalized measures.

    This approach increases the variance in the topic measures, as they are now influenced by

    the amount of text in the document.

    The prediction results for the AAER sample are quite similar, except that 𝑡𝑜𝑝𝑖𝑐 by itself

    is no longer a significantly better predictor than 𝑡𝑜𝑝𝑖𝑐 paired with style. Results fo