Corporate Fraud, LDA, and Econometrics

DSSG ⋅ 2019 March 27

Dr. Richard M. Crowley

SMU ⋅ @prof_rmc

Slides: https://rmc.link/DSSG

The problem

How can we detect if a firm is currently involved in a major instance of misreporting?

▪ Detect: Classification problem
▪ Currently: Prediction problem
▪ Misreporting: The accounting side
▪ The approach combines…
  ▪ Business insight
  ▪ Economic theory
  ▪ Psychology theory
  ▪ Statistics
  ▪ Machine learning
  ▪ Careful econometrics

Why do we care?

The 10 most expensive US corporate frauds cost shareholders 12.85B USD

▪ The above, based on Audit Analytics, ignores:
  ▪ GDP impacts: Enron's collapse cost ~35B USD
  ▪ Societal costs: Lost jobs, economic confidence
  ▪ Any negative externalities, e.g. compliance costs
  ▪ Inflation: In current dollars it is even higher

Catching even 1 more of these as they happen could save billions of dollars

https://www.brookings.edu/research/cooking-the-books-the-cost-to-the-economy/

What is Misreporting?

Misreporting: A simple definition

Errors that affect firms' accounting statements or disclosures which were done seemingly intentionally by management or other employees at the firm.

Traditional accounting fraud

1. A company is underperforming
2. Management cooks up some scheme to increase earnings
  ▪ Wells Fargo (2011-2018?)
  ▪ Fake/duplicate customers and transactions
3. Create accounting statements using the fake information

Other accounting fraud types

▪ Dell (2002-2007) (https://www.economist.com/newsbook/2010/07/23/taking-away-dells-cookie-jar)
  ▪ Cookie jar reserve (secret payments by Intel of up to 76% of quarterly income)
    1. The company is overperforming
    2. "Save up" excess performance for a rainy day
    3. Recognize revenue/earnings when needed to hit future targets
▪ Apple (2001) (https://www.sec.gov/news/press/2007/2007-70.htm)
  ▪ Options backdating
▪ China North East Petroleum Holdings Limited (https://www.sec.gov/litigation/litreleases/2012/lr22552.htm)
  ▪ Related party transactions (transferring 59M USD from the firm to family members over 176 transactions)
▪ CVS (2000) (https://www.sec.gov/litigation/admin/2007/33-8815.pdf)
  ▪ Improper accounting treatments (not using mark-to-market accounting to fair value stuffed animal inventories)
▪ Countryland Wellness Resorts, Inc. (1997-2000) (https://www.sec.gov/litigation/litreleases/lr16732.htm)
  ▪ Gold reserves were actually… dirt

Where are these disclosed? (US)

1. US SEC AAERs (https://www.sec.gov/divisions/enforce/friactions.shtml): Accounting and Auditing Enforcement Releases
  ▪ Highlight larger/more important cases, written by the SEC
  ▪ Example: The Summary section of this AAER against Sanofi (https://www.sec.gov/litigation/admin/2018/34-84017.pdf)
2. 10-K/A filings ("10-K" ⇒ annual report, "/A" ⇒ amendment)
  ▪ Note: not all 10-K/A filings are caused by fraud!
  ▪ Benign corrections or adjustments can also be filed as a 10-K/A
  ▪ Note: Audit Analytics' write-up on this for 2017 (https://www.auditanalytics.com/blog/reasons-for-an-amended-10-k-2017/)
3. By the US government through a 13(b) action
4. In a note inside a 10-K filing
  ▪ These are sometimes referred to as "little r" restatements
5. In a press release, which is later filed with the US SEC as an 8-K
  ▪ 8-Ks are filed for many other reasons too though

Original disclosure motivated by management admission, government investigation, or shareholder lawsuit

Where are we at?

Fraud happens in many ways, for many reasons

▪ All of them are important to capture
▪ All of them affect accounting numbers differently
▪ None of the individual methods are frequent…

It is disclosed in many places. All have subtly different meanings and implications

▪ We need to be careful here (or check multiple sources)

This is a hard problem!

Predicting Fraud

Main question and approaches

How can we detect if a firm is currently involved in a major instance of misreporting?

▪ 1990s: Financials and financial ratios
  ▪ Misreporting firms' financials should be different than expected
▪ Late 2000s/early 2010s: Characteristics of firm disclosures
  ▪ Annual report length, sentiment, word choice, …
▪ Late 2010s: More holistic text-based ML measures of disclosures
  ▪ Modeling what the company discusses in their annual report

All of these are discussed in Brown, Crowley and Elliott (2018) (https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2803733). I will refer to the paper as BCE for short.

What we need to address:

1. Detecting varied events
  ▪ "Careful" feature selection (offload to econometrics)
  ▪ Intelligent feature design (partially offload to ML)
2. For business users… Interpretability matters
  ▪ Psychology-style experiment
  ▪ And a quasi-experiment
3. Predictive model
  ▪ Need clean, out of sample designs + backtesting
  ▪ Windowed design: data from 1998 won't help today, but it would in 1999
4. Infrequent events
  ▪ Good for society, bad for modeling
  ▪ Careful econometrics

Main results

Issue 1: Varied events

Past models

Financial model based on Dechow, et al. (2011) (https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1911-3846.2010.01041.x)

▪ 17 measures including:
  ▪ Log of assets
  ▪ % change in cash sales
  ▪ Indicator for mergers
▪ Theory: Purely economic
  ▪ Misreporting firms' financials should be different than expected
    ▪ Perhaps more income
    ▪ Odd capital structure

Textual style model based on various papers

▪ 20 measures including:
  ▪ Length and repetition
  ▪ Sentiment
  ▪ Grammar and structure
▪ Theory: Communications
  ▪ Style reflects complexity and unintentional biases
  ▪ Some measures ad hoc
  ▪ Misreporting ⇒ annual report written differently

We tested an additional 26 financial & 60 style variables

The BCE model

1. Retain the variables from the previous models' regressions
  ▪ Forms a useful baseline
2. Add in an ML measure quantifying how much each annual report (~20-300 pages) talks about different topics
  ▪ Train on windows of the prior 5 years
    ▪ Balance data staleness, data availability, and quantity of text
  ▪ Optimal to have 31 topics per 5 years
    ▪ Based on in-sample logistic regression optimization

Why do we do this? Think like a fraudster!

▪ From communications and psychology:
  ▪ When people are trying to deceive others, what they say is carefully picked: topics chosen are intentional
▪ Putting this in a business context:
  ▪ If you are manipulating inventory, you don't talk about inventory

What the topics look like

How to do this: LDA

▪ LDA: Latent Dirichlet Allocation
  ▪ Widely-used in linguistics and information retrieval
  ▪ Available in C, C++, Python, Mathematica, Java, R, Hadoop, Spark, …
    ▪ We used onlineldavb (https://github.com/blei-lab/onlineldavb)
    ▪ Gensim (https://radimrehurek.com/gensim/) is great for python; STM (https://www.structuraltopicmodel.com/) is great for R
  ▪ Used by Google and Bing to optimize internet searches
  ▪ Used by Twitter and NYT for recommendations
▪ LDA reads documents all on its own! You just have to tell it how many topics to find
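To make "you just have to tell it how many topics to find" concrete, here is a minimal sketch of LDA on a toy corpus. It is not the onlineldavb implementation used in the talk (which relies on online variational Bayes); it swaps in a small collapsed Gibbs sampler, because that fits in a few lines of standard-library Python. The function name `lda_gibbs`, the priors `alpha` and `beta`, and the toy documents are all illustrative assumptions.

```python
import random

def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not onlineldavb)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]        # tokens per (doc, topic)
    topic_word = [[0] * V for _ in range(num_topics)]   # tokens per (topic, word)
    topic_total = [0] * num_topics
    assignments = []
    for d, doc in enumerate(docs):                      # random initial topics
        z_d = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_d.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_d)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                v, z = word_id[w], assignments[d][n]
                doc_topic[d][z] -= 1                    # remove the token...
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                weights = [(doc_topic[d][k] + alpha) * (topic_word[k][v] + beta)
                           / (topic_total[k] + V * beta) for k in range(num_topics)]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][n] = z                   # ...and resample its topic
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1
    # Smoothed share of each document devoted to each topic (rows sum to 1)
    theta = [[(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
             for counts, doc in zip(doc_topic, docs)]
    return theta, topic_word, vocab

toy_docs = [
    "inventory cost inventory supplier".split(),
    "loan interest bank loan".split(),
    "inventory supplier cost cost".split(),
    "bank interest deposit loan".split(),
]
theta, topic_word, vocab = lda_gibbs(toy_docs, num_topics=2)
```

The only modeling choice passed in is `num_topics`; everything else is learned from word co-occurrence. For real annual reports, a maintained library such as Gensim or onlineldavb is the practical choice.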

Implementation details

1. Annual reports are a mess
  ▪ Fixed width text files; proper html; html exported from MS Word…
  ▪ Embedded hex images
  ▪ Solution: Regexes, regexes, regexes
    ▪ Detailed in the paper's web appendix
2. Stemming, tokenizing, stopwords
3. Feed to LDA
4. Tune hyperparameters (# of topics is most crucial)
5. Finally implement the model

The usual adage that data cleaning takes the longest still holds true
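Steps 1 and 2 above can be sketched as follows. This is a simplified illustration, not the regexes from the paper's web appendix: the stopword list is a tiny made-up sample, and `crude_stem` is a hypothetical stand-in for a real stemmer such as Porter's.

```python
import re

# Tiny illustrative stopword list (assumption; real lists are much longer)
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "for", "our", "we"}

def crude_stem(token):
    """Very rough suffix stripping; a placeholder for a proper stemmer."""
    for suffix in ("ies", "ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def clean_filing(raw):
    """Strip markup, tokenize, drop stopwords, and stem one filing."""
    text = re.sub(r"(?is)<(script|style).*?</\1>", " ", raw)  # embedded blocks
    text = re.sub(r"(?s)<[^>]+>", " ", text)                  # remaining html tags
    text = re.sub(r"&[a-zA-Z#0-9]+;", " ", text)              # html entities
    tokens = re.findall(r"[a-z]+", text.lower())              # letters only
    return [crude_stem(t) for t in tokens if t not in STOPWORDS and len(t) > 2]

tokens = clean_filing("<p>The inventories increased &amp; <b>revenues</b> declined.</p>")
```

The resulting token lists, one per annual report, are what gets fed to LDA in step 3.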

Other considerations

1. LDA provides the weight on each topic, but documents vary a lot by length
  ▪ Solution: Normalize to a percentage between 0 and 1
2. There is a mechanical component to topics due to firms' industries
  ▪ Solution: Orthogonalize topics to industry
  ▪ Run a linear regression and retain $\varepsilon_{i,\text{firm}}$:

$$\text{topic}_{i,\text{firm}} = \alpha + \sum_{j} \beta_{i,j}\,\text{Industry}_{j,\text{firm}} + \varepsilon_{i,\text{firm}}$$
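Both adjustments can be sketched in a few lines. The sketch assumes the industry regressors are simple indicator variables; in that case the OLS fitted value for each firm is just its industry's mean topic share, so the retained residual is the deviation from that mean. The function names and numbers are illustrative.

```python
from collections import defaultdict

def normalize(weights):
    """Convert raw topic weights for one document into shares that sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def orthogonalize_to_industry(topic_shares, industries):
    """Residuals from regressing one topic's shares on industry indicators.

    With an intercept plus industry dummies, OLS fitted values equal the
    industry means, so each residual is share minus industry mean.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for share, industry in zip(topic_shares, industries):
        sums[industry] += share
        counts[industry] += 1
    return [share - sums[industry] / counts[industry]
            for share, industry in zip(topic_shares, industries)]

shares = normalize([2, 1, 1])   # one document, three topics
residuals = orthogonalize_to_industry(
    [0.2, 0.4, 0.1, 0.3], ["bank", "bank", "tech", "tech"])
```

A firm that talks about a topic more than its industry peers gets a positive residual, regardless of how much the industry as a whole discusses that topic.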

Issue 2: Interpretability

LDA Verification

▪ LDA is well validated on general text, no question
▪ One key is to present some details of the topics to ensure comfort
▪ Another key is having prior evidence to fall back on
▪ Whether LDA works on business-specific documents is not so well studied
▪ Most studies just ask people whether they agree with the hand-coded topic categorizations

We decided to fill this gap

Experimental design

Instrument: A word intrusion task

▪ Which word doesn't belong?
  1. Commodity, Bank, Gold, Mining
  2. Aircraft, Pharmaceutical, Drug, Manufacturing
  3. Collateral, Iowa, Residential, Adjustable

Participants

▪ 100 individuals on Amazon Mechanical Turk (20 questions each)
  ▪ Human but not specialized

Quasi-experimental design

Run the exact same experiment as on humans

▪ 3 Computer algorithms (>10M questions each)
  ▪ Not human but specialized

These learn the "meaning" of words in a given context

1. GloVe on general website content
  ▪ Less specific but more broad
2. Word2vec trained on Wall Street Journal articles
  ▪ More specific, business oriented
3. Word2vec directly on annual reports
  ▪ Most specific
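One plausible way for an embedding model to answer a word intrusion question is to pick the word that is least similar, on average, to the others. The sketch below assumes that rule; the exact decision rule used in BCE is not stated on these slides. The two-dimensional vectors are made up for illustration, whereas real GloVe or Word2vec vectors have hundreds of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def pick_intruder(words, vectors):
    """Return the word with the lowest mean similarity to the other words."""
    def mean_similarity(word):
        others = [o for o in words if o != word]
        return sum(cosine(vectors[word], vectors[o]) for o in others) / len(others)
    return min(words, key=mean_similarity)

# Hypothetical toy embeddings (not from any trained model)
toy_vectors = {
    "commodity": [0.9, 0.1],
    "gold": [0.8, 0.2],
    "mining": [0.85, 0.15],
    "bank": [0.1, 0.9],
}
answer = pick_intruder(["commodity", "bank", "gold", "mining"], toy_vectors)
```

Scoring each algorithm is then the same as scoring a human participant: the share of questions where the picked word matches the planted intruder.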

Experimental results

[Figure: Validation of LDA measure (Intrusion task). x-axis: Data source (Experiment, Internet, WSJ, Filings); y-axis: % of questions correct (10 to 70); series: Maximum accuracy, Average accuracy, Minimum accuracy, Random chance]

Issue 3: Predictive modeling

Backtesting

We don't know who is misreporting today

▪ So, we will backtest
  ▪ Use historical data to validate our model
▪ Problems:
  1. Misreporting changes over time
  2. Misreporting is unobservable (until it's observable)

Moving target

▪ Implement a moving window approach
  ▪ 5 years for training + 1 year for testing
  ▪ The study uses data from 1994 through 2012: 14 possible windows
  ▪ Ex.: to predict misreporting in 2010, train on data from 2005 to 2009

Problem: Now we have 14 models…
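The window arithmetic can be checked with a short sketch; the function name `moving_windows` is illustrative.

```python
def moving_windows(first_year, last_year, train_years=5):
    """List (training years, test year) pairs for a moving-window backtest."""
    windows = []
    for test_year in range(first_year + train_years, last_year + 1):
        train = list(range(test_year - train_years, test_year))
        windows.append((train, test_year))
    return windows

windows = moving_windows(1994, 2012)
train_for_2010 = next(train for train, test in windows if test == 2010)
```

With data from 1994 through 2012, the first test year is 1999 and the last is 2012, which gives the 14 windows (and 14 separately trained models) mentioned above.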

Comparing multiple models

▪ Performance measures:
  1. ROC AUC
  2. Fisher statistics
  3. Performance at a reasonable cutoff (5%)
  4. NDCG@k (usually used in ranking problems)

ROC AUC and Fisher statistics will also allow us to statistically compare across models
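Of these measures, NDCG@k is probably the least familiar outside of ranking problems. A minimal sketch with binary labels (1 = misreporting) follows; it uses the common log2 position discount, which is an assumption about the exact variant used in BCE.

```python
import math

def ndcg_at_k(scores, labels, k):
    """NDCG@k: discounted gain of the top-k ranked firms vs. the ideal ranking."""
    ranked = [label for _, label in
              sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)]
    ideal = sorted(labels, reverse=True)

    def dcg(relevances):
        # Position 1 has discount log2(2) = 1, position 2 has log2(3), ...
        return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))

    best = dcg(ideal)
    return dcg(ranked) / best if best > 0 else 0.0

value = ndcg_at_k(scores=[0.9, 0.8, 0.1, 0.4], labels=[1, 0, 0, 1], k=2)
perfect = ndcg_at_k(scores=[0.9, 0.1, 0.2, 0.8], labels=[1, 0, 0, 1], k=2)
```

The measure rewards a model for placing true misreporting firms near the very top of its ranked list, which matches how a regulator or auditor with limited capacity would use the predictions.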

ROC AUC for windowed approaches

▪ ROC AUC
  ▪ What is the probability that a randomly selected 1 is ranked higher than a randomly selected 0
  ▪ A good score is above 0.70
▪ Aggregating:
  ▪ Simple: average AUC
  ▪ More useful: Pool predictions together (with clustering by year)
▪ Comparing ROC AUCs
  ▪ Not simple…
  ▪ Wald statistic with bootstrapped variance estimates clustered by year
  ▪ Implemented in Stata as rocreg (https://www.stata.com/manuals13/rrocreg.pdf)
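The probabilistic definition of ROC AUC translates directly into code. The sketch below counts, over all (1, 0) pairs, how often the 1 receives the higher score (ties count as half), and contrasts averaging per-window AUCs with pooling predictions. It does not attempt the clustered, bootstrapped inference that rocreg provides; the toy scores are made up.

```python
def roc_auc(scores, labels):
    """P(random 1 is scored above random 0), with ties counted as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_auc(window_predictions):
    """Simple aggregation: mean of the per-window AUCs."""
    aucs = [roc_auc(scores, labels) for scores, labels in window_predictions]
    return sum(aucs) / len(aucs)

def pooled_auc(window_predictions):
    """Pool every window's out-of-sample predictions, then compute one AUC."""
    scores = [s for window_scores, _ in window_predictions for s in window_scores]
    labels = [y for _, window_labels in window_predictions for y in window_labels]
    return roc_auc(scores, labels)

toy_windows = [([0.9, 0.8], [1, 0]), ([0.4, 0.1], [1, 0])]
avg, pooled = average_auc(toy_windows), pooled_auc(toy_windows)
```

In the toy data each window ranks perfectly on its own, yet pooling reveals that scores are not comparable across windows: a misreporting firm in the second window scores below a clean firm in the first.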

Purely statistical method

▪ Fisher statistic (Fisher 1932)
  ▪ Combining p-values (Note: $p \sim U[0, 1]$)
  ▪ p-values come from our out-of-sample prediction model
  ▪ Calculated as: $X = -2 \sum_{i=1}^{k} \ln(p_i)$
▪ Comparing models: Variance-Gamma test (see BCE)
  ▪ Key insight: the difference of $\chi^2$ variables has the same MGF as the Variance-Gamma distribution
  ▪ Calculation below
  ▪ $K$ is the modified Bessel function of the second kind

$$P(X_1 > X_2) = \int_{-\infty}^{X_1 - X_2} \frac{1}{2^{k}\,\Gamma(k)\,\sqrt{\pi}}\, |z|^{k - \frac{1}{2}}\, K_{k - \frac{1}{2}}(|z|)\, dz$$
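The Fisher statistic itself is simple to compute. Under the null that the $k$ p-values are independent and uniform, $X$ follows a $\chi^2$ distribution with $2k$ degrees of freedom, and for even degrees of freedom the upper-tail probability has a closed form. The sketch below covers only this single-model statistic, not the Variance-Gamma comparison of two models; the function names are illustrative.

```python
import math

def fisher_statistic(p_values):
    """X = -2 * sum(ln p_i)."""
    return -2.0 * sum(math.log(p) for p in p_values)

def chi2_upper_tail_even_df(x, k):
    """P(chi2 with 2k df > x) = exp(-x/2) * sum_{j<k} (x/2)^j / j!."""
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(k))

def fisher_combined(p_values):
    """Return (X, combined p-value) for independent p-values."""
    x = fisher_statistic(p_values)
    return x, chi2_upper_tail_even_df(x, len(p_values))

x_single, p_single = fisher_combined([0.05])
x_pair, p_pair = fisher_combined([0.5, 0.5])
```

A useful sanity check is the single p-value case: combining one p-value of 0.05 returns 0.05 again.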

Observability

▪ The other issue is that, as of a given year, say 2009, we do not know every firm that was misreporting
  ▪ We could build an algorithm with perfect information, but it may fall flat on current, noisy data!
  ▪ It could also give us a false impression of an algorithm's effectiveness when backtesting
▪ Misreporting can take a long time to discover: Zale's started in 2004, finished in 2009, and was disclosed in 2011!

Solution: Censor our data to what was known at the point in time

▪ Use data on when a misreporting case was first disclosed
▪ If the fraud wasn't known by the end of the window, train as if that was 0 (as it was unobservable back then)
▪ Mimics our current situation
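The censoring rule can be sketched as a labeling function. The record layout (`firm`, `start`, `end`, `disclosed`) is a hypothetical structure for illustration; the Zale's dates are the ones quoted above.

```python
def censored_labels(firm_years, cases, window_end):
    """Label a firm-year 1 only if its case was disclosed by window_end.

    firm_years: list of (firm, year) pairs in the training window.
    cases: list of dicts with keys firm, start, end, disclosed (all years).
    """
    known = set()
    for case in cases:
        if case["disclosed"] <= window_end:   # only cases already public
            for year in range(case["start"], case["end"] + 1):
                known.add((case["firm"], year))
    return [1 if firm_year in known else 0 for firm_year in firm_years]

cases = [{"firm": "Zale", "start": 2004, "end": 2009, "disclosed": 2011}]
firm_years = [("Zale", 2008), ("Zale", 2009), ("Other", 2009)]
labels_as_of_2009 = censored_labels(firm_years, cases, window_end=2009)
labels_as_of_2011 = censored_labels(firm_years, cases, window_end=2011)
```

A model trained on a window ending in 2009 sees Zale's as a clean firm, exactly as an analyst would have at the time.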

Issue 4: Infrequent events

Dealing with infrequent events

▪ Fraud is infrequent
  ▪ E.g.: Out of 38,311 firm-years of data, there are 505 firm-years subject to AAERs
▪ Key issue: We may have more variables than events in a window…
  ▪ Even if we don't, convergence is iffy using a logistic model
▪ A few ways to handle this:
  1. Very careful model selection (keep it sufficiently simple)
  2. Sophisticated degenerate variable identification criterion + simulation to implement complex models that are just barely simple enough
    ▪ The main method in BCE
  3. Automated methodologies for paring down models (LASSO, XGBoost)

Degenerate variable identification

1. Toss every input into a model
2. Check independence using a QR decomposition
  ▪ This will let us determine an order for dropping inputs
  ▪ A = Q × R, where A is our feature matrix, Q is an orthogonal matrix, and R is the upper triangular transformation matrix
  ▪ More weight on the diagonal element in R means more independent (effectively)
  ▪ Same underlying method as a Gram-Schmidt process
3. Remove excess inputs if too few 1s
  ▪ Why? Because logit can't converge if there are more inputs than events (or non-events) in the data

Independence is a useful criterion for removing features with lower likelihood of being useful
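Since the QR decomposition shares its underlying method with Gram-Schmidt, the ordering idea can be sketched with a pivoted modified Gram-Schmidt in plain Python. At each step the column with the largest remaining norm is chosen (that norm is the magnitude of the corresponding diagonal element of R), and its direction is projected out of the rest. This is an illustration of the idea, not BCE's exact criterion, and it assumes the features have already been put on comparable scales.

```python
import math

def independence_order(columns):
    """Order features from most to least independent.

    columns: dict mapping feature name -> list of values.
    Returns [(name, |R_jj|)], where |R_jj| is the norm left in the column
    after projecting out all previously chosen columns.
    """
    residual = {name: list(values) for name, values in columns.items()}
    order = []
    while residual:
        norms = {name: math.sqrt(sum(x * x for x in col))
                 for name, col in residual.items()}
        pivot = max(norms, key=norms.get)      # most independent remaining column
        norm = norms[pivot]
        order.append((pivot, norm))
        q = residual.pop(pivot)
        if norm > 1e-12:
            q = [x / norm for x in q]
            for name in list(residual):        # project q out of the other columns
                col = residual[name]
                proj = sum(a * b for a, b in zip(q, col))
                residual[name] = [b - proj * a for a, b in zip(q, col)]
    return order

toy_features = {
    "size": [1.0, 2.0, 3.0, 4.0],
    "size_copy": [2.0, 4.0, 6.0, 8.0],   # exact multiple of "size"
    "growth": [1.0, -1.0, 1.0, -1.0],
}
order = independence_order(toy_features)
drop_first = order[-1][0]
```

Features at the end of the list carry almost no information beyond what earlier features provide, so they are the first candidates to drop when there are too few events.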

Logistic iteration

1. Run a logit using a Newton-Raphson solver for 50 iterations
2. Check convergence for signs of quasi-completeness
  ▪ Standard errors will be in the millions if quasi-complete
  ▪ If quasi-complete, drop the next least independent variable and restart
3. Run a 500 iteration logit using a Newton-Raphson solver
4. Recheck convergence
  ▪ If failed, drop the next least independent variable and restart

We will essentially get the most complex feasible model with the most independent set of features
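The loop above can be sketched in plain Python. Everything here is an illustrative assumption rather than BCE's code: the standard error threshold `se_limit`, the clipping of the linear predictor at ±30 to avoid overflow, treating a numerically singular Hessian as a failed fit, and the toy data, in which `leak` perfectly predicts the outcome.

```python
import math

def invert(matrix):
    """Gauss-Jordan inverse; raises ValueError if numerically singular."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular Hessian")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def newton_logit(rows, y, iterations):
    """Newton-Raphson logit; returns (coefficients, standard errors)."""
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(iterations):
        xb = [sum(b * x for b, x in zip(beta, row)) for row in rows]
        p = [1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, v)))) for v in xb]
        grad = [sum((yi - pi) * row[j] for row, yi, pi in zip(rows, y, p)) for j in range(k)]
        hess = [[sum(pi * (1 - pi) * row[a] * row[b] for row, pi in zip(rows, p))
                 for b in range(k)] for a in range(k)]
        cov = invert(hess)                      # inverse information matrix
        beta = [b + sum(cov[j][m] * grad[m] for m in range(k)) for j, b in enumerate(beta)]
    if any(cov[j][j] <= 0 for j in range(k)):
        raise ValueError("invalid variance estimate")
    return beta, [math.sqrt(cov[j][j]) for j in range(k)]

def fit_most_complex(columns, y, drop_order, se_limit=1e3):
    """Drop features (least independent first) until the logit converges."""
    names, queue = list(columns), list(drop_order)
    while names:
        rows = [[1.0] + [columns[name][i] for name in names] for i in range(len(y))]
        try:
            for iterations in (50, 500):        # quick screen, then the long run
                beta, se = newton_logit(rows, y, iterations)
                if max(se) > se_limit:
                    raise ValueError("standard errors exploded")
            return names, beta
        except ValueError:
            if not queue:
                break
            names.remove(queue.pop(0))
    raise ValueError("no feasible model found")

toy_y = [0, 1, 0, 1, 0, 0, 1, 1]
toy_columns = {
    "x": [-2.0, -1.5, -1.0, 0.0, 0.5, 1.0, 1.5, 2.0],
    "leak": [float(v) for v in toy_y],          # perfectly predicts y
}
kept, coefficients = fit_most_complex(toy_columns, toy_y, drop_order=["leak", "x"])
```

In practice `drop_order` would come from the independence ranking of the previous step, so the variables removed first are the ones least likely to add information.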

Final comments

Some other interesting results

Ways to build on this model

1. Use a better tokenizer such as spaCy
  ▪ Our tokenizer didn't detect noun phrases
2. Use econometric methods that are better suited for sparsity
  ▪ E.g.: XGBoost
3. Consider using a more powerful LDA variant such as supervised LDA (sLDA)
4. No need to stop at LDA: there have been a lot of advancements in NLP since 2003

Final note: The motivation behind our work was not to build a better mousetrap, but to illustrate the usefulness of documents' content to better understand company/manager behavior

End matter

Thanks!

Dr. Richard M. Crowley

SMU ⋅ @prof_rmc

Web: https://rmc.link/

To learn more:

▪ These slides are publicly available at https://rmc.link/DSSG
  ▪ Plenty of links to click through and explore
▪ Technical details are publicly available at SSRN (https://ssrn.com/abstract=2803733)

Case studies

▪ Prediction scores for 1999 ranked in the 98th percentile
▪ First publicized in 2001
▪ Increases in Income topic and firm size are the biggest red flags

▪ Prediction scores for 2004 through 2009 rank 97th percentile or higher each year
▪ AAER (https://www.sec.gov/litigation/litreleases/2011/lr21930.htm) published in 2011
▪ Media and Digital Services topics are the red flags

Financial model

Based on Dechow, Ge, Larson and Sloan (2011) (https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1911-3846.2010.01041.x)

▪ Log of assets
▪ Total accruals
▪ % change in A/R
▪ % change in inventory
▪ % soft assets
▪ % change in sales from cash
▪ % change in ROA
▪ Indicator for stock/bond issuance
▪ Indicator for operating leases
▪ BV equity / MV equity
▪ Lag of stock return minus value weighted market return
▪ Below are BCE's additions
▪ Indicator for mergers
▪ Indicator for Big N auditor
▪ Indicator for medium size auditor
▪ Total financing raised
▪ Net amount of new capital raised
▪ Indicator for restructuring

Style model (late 2000s/early 2010s)

From a variety of research papers

▪ Log of # of bullet points + 1
▪ # of characters in file header
▪ # of excess newlines
▪ Amount of html tags
▪ Length of cleaned file, characters
▪ Mean sentence length, words
▪ S.D. of word length
▪ S.D. of paragraph length (sentences)
▪ Word choice variation
▪ Readability
▪ Coleman Liau Index
▪ Fog Index
▪ % active voice sentences
▪ % passive voice sentences
▪ # of all cap words
▪ # of "!"
▪ # of "?"
