
Evolution of Financial Studies Over Forty Years: What Can ...sfi.cuhk.edu.cn/uploads/file/20181115/20181115161202.pdf · over Web of Science, ScienceDirect, JSTOR, every journal’s


Evolution of Financial Studies Over Forty Years:

What Can We Learn from Machine Learning?

Abstract

How did finance research topics evolve over the past forty years? We apply machine learning models of textual analysis to 20,185 abstracts of finance articles published between 1976 and 2015, and identify 38 research topics. We present the fastest growing topics among published and working papers. Our algorithm can be used to categorize articles that lack JEL codes. We use a citation network to show how topics are related, and cluster the topics into five “territories”. Moreover, we find a strong bibliometric regularity: the number of researchers covering n topics is approximately $1/2^n$ of those covering just one topic.

JEL Classification: G00, G10, G20, G30, B26

Keywords: Textual Analysis, Machine Learning, Network Analysis, Evolution of

Financial Studies


1. Introduction

Finance researchers are interested in knowing what other finance researchers work on,

but not every researcher has a full picture of all topics in finance research and their connections.

Therefore, an analysis of all academic publications in finance may be beneficial to those who

desire to have an overview of this academic profession and inspire more cross-topic research.

How did finance research topics evolve over the past forty years? Which topics were popular decades ago but attract little attention today? Which topics attracted the most attention in the most recent decade? To answer these questions, we need to 1) construct a comprehensive sample that contains most of the finance articles published in the last 40 years, and 2) determine which topic each article belongs to.

To address the first requirement, we collect information on 20,185 academic articles published in 17 finance journals between 1976 and 2015. To determine each article’s topic by hand, we would need to read each article, compile a list of the topics that the articles mainly cover, and finally classify each article. Human reading is not only time-consuming but also constrained by the reader’s comprehension. Instead, we employ textual analysis techniques to analyze the literature.

Textual analysis has been used in the finance literature to process media news (e.g. Tetlock, 2007; Tetlock, Saar‐Tsechansky, and Macskassy, 2008), financial disclosures (e.g. Loughran and McDonald, 2011; Loughran and McDonald, 2014), Form S-1 IPO filings with the SEC (e.g. Loughran and McDonald, 2013), and product descriptions (e.g. Hoberg and Phillips, 2010), among others. As far as we know, textual analysis has not been applied to academic finance research itself to categorize topics and analyze the connections between papers.

We apply two popular unsupervised machine learning algorithms of textual analysis,


latent Dirichlet allocation¹ (LDA) and the dynamic topic model (DTM), to determine 1) the number

of topics there are in the finance literature; 2) the topics that each article focuses on; 3) which

topics grew most and declined most in recent years; and 4) the evolution of specific interest

within each topic - for example, in the banking area, have researchers been more interested in

banking networks recently?

It is natural to use JEL classification codes to tell which topic an article belongs to. However, there are three reasons why JEL codes are not sufficient for analyzing historical trends. First, some journals do not provide JEL codes, such as The Journal of Finance (JF) and Journal of Financial and Quantitative Analysis (JFQA). Second, although some journals provide JEL codes today, they did not provide them in their early years. For example, Review of Financial Studies (RFS) published its first volume in 1988 but started providing JEL codes in 2007². It is therefore difficult to analyze early articles and the historical trend. After searching Web of Science, ScienceDirect, JSTOR, every journal’s official website, and each published article’s working paper record, 65.2% of all articles still lack JEL codes. Missing JEL codes were more common in early years: among articles in Journal of Financial Economics (JFE), JF and RFS, 79.6% of those published before 2005, 87.1% of those before 2000, and 98.2% of those before 1995 do not have JEL codes³. Of these three journals, JFE was the first to provide JEL codes, starting in 1994.

Third, JEL codes are self-reported and often change. Comparing the JEL codes of published articles with those of the last working paper version before publication, we find

1 As of January 1, 2018, the LDA article by Blei, Ng, and Jordan (2003) has been cited 21,464 times on

Google Scholar.

2 The pattern is similar in other journals. For example, Journal of Banking and Finance published its first volume in 1977 but started providing JEL codes in 1993; Journal of Futures Markets published its first volume in 1981, but only some of its articles started providing JEL codes in 2013.

3 Among all published articles, 73.9% of those before 2005, 82.0% of those before 2000, and 96.5% of those before 1995 do not have JEL codes.


that 31.14% of articles changed JEL codes at least once. When we consider the different versions of working papers, the percentage is even higher. JEL codes are subjective, and there is little research on whether the authors’ classifications are accurate. In this research, we provide another way to obtain an objective classification, using unsupervised machine learning that minimizes human input of prior knowledge. We find that the topics our algorithm assigns to articles are comparable to their self-reported JEL codes. Therefore, we are able to apply the machine learning algorithm to the articles without JEL codes to determine their topics.

Our unsupervised machine learning algorithm reads all abstracts⁴ and shows that published research can be categorized into 38 topics⁵. The largest topics include “Option Pricing”, “Commercial Banking”, “CEO, Board, Director”, “Market Microstructure”, “Central Bank, Monetary Policy”, and “Mergers and Acquisitions”. Besides traditional asset pricing and corporate finance topics, we also identify topics such as “Social Network and Cultural Effect” and “Venture Capital, Entrepreneurship”. We plot each topic’s number of publications over time and show its rise and fall. Publications on “Financial System, Banking Crisis” and “Hedge Fund, Mutual Fund” increased the fastest in the past decade.

We also apply the LDA model trained on published papers to 130,547 working paper abstracts that we obtained from the SSRN Financial Economics Network. We find that working papers on “Social Network and Cultural Effect”, “News, Analyst Report, Earnings Announcement”, and “International Capital Markets” grew fastest from 2006 to 2015. “Market

4 We also used the full text of all articles as input to the model. Because the full text contains noisier information than the abstract, such as discussion of prior literature, the topics generated using only abstracts are better categorized.

5 More rigorously, the published research is categorized into 50 topics, including 12 general-sentence topics that do not indicate a specific research interest. For example, a topic with keywords “relat”, “posit”, “neg”, “associ” and “evid” may represent a commonly used general sentence such as “we provide evidence on a positive/negative relation/association”. See Sections 3.1 and 5.1 for a detailed explanation.


Microstructure”, “Macro Finance”, and “Statistical Estimation Methodology” experienced the

greatest contraction during the same period.

The advantage of unsupervised machine learning over supervised machine learning is its minimal need for human input. For example, the optimal number of topics is determined by the algorithm⁶, not chosen by us. One of the few points of human involvement in the analysis is naming the topics based on the keywords that the algorithm chooses to represent each topic. Supervised machine learning needs humans to label the training sample so that the machine can be “taught”. The labelling process may introduce bias or even errors⁷. In contrast, an unsupervised machine learning algorithm does not need labelled data⁸.

Using dynamic topic models, we show how specific research interests evolve. For example, within the topic of “Determinants of Stock Return” there were many publications on the January effect of stock prices⁹ before 1990. Since 2000, the January effect has dropped off researchers’ list of top interests, and research on momentum strategies and cross-sectional analysis has become more popular.

The next question that we examine is how the topics are related to each other. To answer

6 As shown in Section 5.1 and Fig. 1, the optimal number of topics should be accompanied by the

highest computed log-likelihood of the data from the trained model.

7 There is another disadvantage of supervised machine learning. When more data arrive, the researcher must provide more pre-labelled samples to train the model. A supervised machine learning algorithm therefore requires human labelling every time the dataset changes. Since unsupervised machine learning does not use pre-labelled samples, it does not have this disadvantage and can adapt to other datasets easily.

8 It is also difficult to train a supervised machine learning model on the articles with JEL codes and then apply it to the articles without JEL codes. The major problem is that the articles with JEL codes are usually more recent, and their research topics and specific words may differ from those of early articles without JEL codes.

9 The January effect is a hypothesis that there is a seasonal increase in stock prices during the month of

January.


this question, we plot a citation network between topics and show that the research topics can

be largely grouped into five “territories”: asset pricing, corporate finance, market

microstructure, banking and macro finance, and “mixed areas”. The network figure makes it easy to see that research on “Mergers and Acquisitions” is closer to “CEO, Board, Director” than to “Commercial Banking”.

Moreover, we find a strong bibliometric regularity: the number of researchers covering n topics is approximately $1/2^n$ of those covering one topic. We also find that, on average, a published finance article covers fewer research topics as the years pass, which indicates that published articles tend to become more focused rather than broad.

Compared with prior related research, which used no more than several thousand articles, our sample is larger and more representative of the literature as a whole. To the best of our knowledge, this is among the first machine learning studies of academic finance publications.

The rest of this article is organized as follows. Section 2 reviews prior research on the academic profession. Section 3 discusses the data sources and how we clean the textual data. Section 4 explains the methodologies, mainly the two machine learning models: latent Dirichlet allocation (LDA) and the dynamic topic model (DTM). Section 5 presents our results and Section 6 concludes.

2. Literature Review

We believe that our research is among the first to study the evolution of research topics in finance, but there is prior research on the academic profession in general. The earliest works

include Froman (1952), Cleary and Edwards (1960), Henry and Burch (1974) and Klemkosky

and Tuttle (1977). Though methodologically simple, they provided important insights. For

example, Froman (1952) generated summary statistics of graduate students in economics

before the 1950s, presenting the institutions that granted the most degrees. Klemkosky and

Tuttle (1977) found that the University of Chicago, the University of Pennsylvania, Stanford


and UCLA contributed most to financial research and journal publication from 1966 to 1975.

Descriptive findings also appear in Heck, Cooley, and Hubbard (1986), Niemi (1987), and Schwert (1993), among others.

This field of research continued to develop in the 1990s. Chung and Cox (1990) found that in an academic journal, the number of researchers who published n articles is equal to $1/n^c$ of the number of researchers who published just one article in that journal. They estimated that c is approximately 2 for JF and JFE. Zivney and Bertin (1992) found that many researchers who

became productive later in their careers were incorrectly screened from tenure, while many

researchers who passed the mechanical screens ceased to publish following tenure. They argued

that simply knowing the number of publications and where the articles appeared is insufficient

for reliably predicting future research productivity. Alexander and Mabry (1994) ranked

journals according to the number of citations.

Borokhovich, Bricker and Simkins (1994) found that JF and JFE were the core influences in finance research, and that most journals published in a variety of research areas but were influential in a smaller number of them during the sample period. Borokhovich, Bricker, Brunarski and Simkins (1995) found a skewed distribution of academic institutions’ influence: a relatively small number of institutions contributed a majority of top journal publications and citations.

Corrado and Ferris (1997) investigated what kind of articles were used in doctoral

education. Swidler and Goldreyer (1998) concluded that top journal publication helps

researchers with promotion and salary increase. They estimated that the first top finance journal

publication provided the author with a then present value of between $19,493 and $33,754.

In more recent publications, Azoulay, Wang, and Zivin (2010) found a decline of

collaborators’ productivity following the premature death of an academic “superstar”. Brogaard,

Engelberg, and Parsons (2014) showed that editors’ personal connections help them screen

articles in the reviewing process. Welch (2014) found that referees 1) differed in their scales, as some referees were intrinsically more generous than others, and 2) differed in their opinions of what a good paper was, as they often disagreed about the relative ordering of papers.

3. Data

As shown in Table 1, our sample consists of 20,185 articles published in 17 academic finance journals from 1976 to 2015. We obtain each article’s title, authors, affiliations, abstract, full text, references, citations and publishing date from Web of Science, supplemented with ScienceDirect¹⁰, JSTOR and manual search. Table 1 lists the journals and their summary statistics, including the first year in which abstracts are available in our sample. In this research, we use only the articles’ abstracts in our models¹¹. We have data for RFS from 1988, the year of its first volume. JF was founded in 1946 and JFE had its first publication in 1974, but Web of Science stores these two journals’ data only from 1976. Moreover, Web of Science stores the article abstracts of JF from 1991 and those of JFQA from 1992. We supplement the missing abstracts of JF between 1976 and 1990 and those of JFQA between 1984 and 1991 from JSTOR and manual search. Journal of Banking and Finance, Journal of International Money and Finance, Journal of Money Credit and Banking, and JFQA are the largest contributors of articles in our sample.

Our sample does not contain finance articles published in economics or accounting journals, because many articles in these journals are not finance research. We do not hand-pick finance articles from economics or accounting journals to supplement our sample, in order to avoid subjective intervention in the algorithm’s analysis. But when we input all

10 The ScienceDirect database is mainly used to obtain more detailed author names and abstracts of the articles published by Elsevier. JF and JFQA are not published by Elsevier.

11 We also conducted the analysis using full text data. Abstracts categorize the topics better than full texts, which contain noisier information such as the discussion of prior literature.


articles published in economics or accounting journals, the algorithm generates many non-finance topics because these journals contain many non-finance articles. Therefore, we use only the articles published in finance journals.

3.1. Textual Data Cleaning

This section describes the process of cleaning textual data and determining the

parameters of the models in a general way.

The textual data often contain commonly used but uninformative words such as “of”,

“you” or “that”. We generally follow the approach of Hansen, McMahon, and Prat (2014) to

clean the data. For each abstract, we

1. Tokenize the text into words, or tokens, with the word tokenizer in the Natural Language Toolkit (NLTK)¹².

2. Remove tokens that are numbers or punctuation.

3. Remove tokens of length 1, such as “I”, “a” and “&”.

4. Convert all tokens to lower case.

5. Remove stop words¹³, which are mainly English pronouns and auxiliary verbs such as “you”, “your”, “yours”, “am”, “is”, “are” and “isn’t”.

6. Stem the tokens with the Porter Stemmer¹⁴, a popular stemming algorithm available in the Python library NLTK. Stemmers reduce words with similar meanings to a common linguistic root. For example, “manage”, “manager”, and “management” all become “manag” after stemming. Grouping words with similar meanings by stemming makes the final results more interpretable to humans.

7. Remove tokens appearing fewer than 5 times.

8. Combine words that frequently appear together as a phrase into one unit for processing. Appendix Table A.1 lists the 53 phrases that we use. The most frequent phrases in our textual data are “interest rate”, “united states” and “exchange rate”.

12 NLTK is an open-source Python library for English natural language processing. See http://www.nltk.org/ for more information.

13 In computing, stop words are words filtered out before natural language text is processed; they are usually the most common words. The list of stop words is at http://snowball.tartarus.org/algorithms/english/stop.txt

14 For details of the Porter Stemmer, see https://tartarus.org/martin/PorterStemmer/.
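The cleaning steps listed above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the authors' code: a regular expression stands in for the NLTK word tokenizer, `STOP_WORDS` is a tiny hypothetical subset of the stop-word list, `toy_stem` is a crude placeholder for the Porter Stemmer, and the function names and the `min_count` parameter are invented for this example. Step 8 (phrase merging) is omitted.

```python
import re
from collections import Counter

# Hypothetical, tiny subset of a stop-word list (the paper uses the Snowball list).
STOP_WORDS = {"you", "your", "yours", "am", "is", "are", "of", "that", "the", "to"}


def toy_stem(token):
    """Crude placeholder for the Porter Stemmer: strips a few common suffixes."""
    for suffix in ("ement", "ing", "er", "e", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def clean_abstract(text):
    """Apply steps 1-6 to one abstract and return a list of stemmed tokens."""
    # Steps 1-2: tokenize and keep alphabetic tokens only (drops numbers, punctuation).
    tokens = re.findall(r"[A-Za-z]+", text)
    # Step 3: drop tokens of length 1.
    tokens = [t for t in tokens if len(t) > 1]
    # Step 4: convert to lower case.
    tokens = [t.lower() for t in tokens]
    # Step 5: remove stop words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Step 6: stem.
    return [toy_stem(t) for t in tokens]


def clean_corpus(abstracts, min_count=5):
    """Clean every abstract, then drop tokens seen fewer than min_count times (step 7)."""
    cleaned = [clean_abstract(a) for a in abstracts]
    counts = Counter(tok for doc in cleaned for tok in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in cleaned]
```

In practice one would presumably swap in NLTK's tokenizer and Porter Stemmer for the two placeholders; the order of the steps is what the sketch is meant to show.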

4. Methodologies

We apply unsupervised machine learning models to the textual data to uncover unobserved topics. We first obtain each abstract’s probability distribution over topics and each topic’s probability distribution over words using LDA (Blei, Ng, and Jordan, 2003). Compared to LDA, DTM (Blei and Lafferty, 2006) considers an additional dimension: time. From DTM we then observe how each topic’s probability distribution over words evolves over time, and hence the evolution of word usage in each topic.

Intuitively speaking, LDA categorizes all abstracts into a number of topics. Moreover, it quantifies how each abstract is distributed across topics. For example, LDA may find that the abstract of Laeven and Levine (2009) is 12.7% on “Systematic Risk and Risk Premium”, 11% on “Shareholder Right, Ownership Structure”, 10.3% on “Commercial Banking” and 10.1% on “Financial Regulation”, with the remaining percentage distributed over other topics. Within a topic, DTM can analyze the evolution of specific interests over time.

The following subsections address the basic concepts of LDA and DTM and how we

apply them to the textual data.

4.1. Latent Dirichlet Allocation

A collection of M abstracts is denoted by $D = \{w_1, w_2, \dots, w_M\}$, and each abstract d with $N_d$ words is denoted by $w_d = \{w_{d,1}, w_{d,2}, \dots, w_{d,N_d}\}$. The model assumes that text is generated by unobserved variables $\beta$ and $\theta$ that are to be estimated. Let V denote the number of unique words across all abstracts, and K denote the number of topics. $\beta_k$ is a V-dimensional vector of probabilities over the V words for topic k. $\beta_{k,v}$, the vth element of $\beta_k$, is the probability that word v appears given topic k. $\theta_d$ is a K-dimensional vector of probabilities over the K topics for abstract d. $\theta_{d,k}$, the kth element of $\theta_d$, is the share of topic k in abstract d.

LDA assumes that the abstracts are generated by the following process. To generate the nth word in abstract d, a topic $z_{d,n}$ is sampled from the probability vector $\theta_d$. Given the topic $z_{d,n}$, a word $w_{d,n}$ is sampled from that topic’s word distribution $\beta_{z_{d,n}}$. The model assumes that each word in each abstract in the corpus is generated through this process. Therefore, the probability of a given corpus D generated through this process is

$$\Pr(D \mid \theta, \beta) = \prod_{d=1}^{M} \prod_{n=1}^{N_d} \sum_{z_{d,n}} \Pr(z_{d,n} \mid \theta_d)\, \Pr(w_{d,n} \mid \beta_{z_{d,n}}) \tag{1}$$

where $\Pr(z_{d,n} \mid \theta_d)$ is the probability of topic $z_{d,n}$ given abstract d’s topic composition $\theta_d$, and $\Pr(w_{d,n} \mid \beta_{z_{d,n}})$ is the probability of word $w_{d,n}$ given topic $z_{d,n}$’s word composition $\beta_{z_{d,n}}$. Summing the product of the two probabilities over topics gives the probability of each word, $\sum_{z_{d,n}} \Pr(z_{d,n} \mid \theta_d)\, \Pr(w_{d,n} \mid \beta_{z_{d,n}})$, which is a sum of conditional probabilities over the topics. The total probability $\Pr(D \mid \theta, \beta)$ is the product of the word probabilities.
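The generative process described above can be illustrated with a short Python sketch. It is a hedged illustration only: the function name `generate_abstract` and its arguments are hypothetical, and `theta_d`, `beta` and `vocab` are assumed to be plain Python lists laid out as in the notation above (one row of `beta` per topic, one column per word).

```python
import random


def generate_abstract(theta_d, beta, vocab, n_words, rng):
    """Draw n_words words for one abstract under LDA's generative process."""
    topics = list(range(len(theta_d)))
    words = []
    for _ in range(n_words):
        # Sample the topic z_{d,n} from the abstract's topic shares theta_d.
        z = rng.choices(topics, weights=theta_d)[0]
        # Sample the word w_{d,n} from that topic's word distribution beta_z.
        w = rng.choices(vocab, weights=beta[z])[0]
        words.append(w)
    return words
```

Calling it with, say, `rng = random.Random(0)` and a topic share of zero for some topic would never produce a word that only that topic can emit, which is the sense in which $\theta_d$ controls an abstract's content.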

We use the following example to illustrate how the above formula works on hypothetical abstracts and parameters. The hypothetical abstracts are for explanatory purposes only and are not from real articles. Suppose our collection of abstracts $D$ contains 2 abstracts $w_1$ and $w_2$, so that $M = 2$, and

$w_1$: “Banking is crucial to entrepreneurship.”

$w_2$: “Banking is crucial to investment.”

After our text cleaning process, the above abstracts become

$w_1$: “bank crucial entrepreneurship”

$w_2$: “bank crucial invest”

For $w_1$, $N_1 = 3$ and $\{w_{1,1}, w_{1,2}, w_{1,3}\}$ = {“bank”, “crucial”, “entrepreneurship”}. For $w_2$, $N_2 = 3$ and $\{w_{2,1}, w_{2,2}, w_{2,3}\}$ = {“bank”, “crucial”, “invest”}. We now assign numerical values to the parameters. We assign $\beta$ the following matrix of values, with topics (1…K) in rows and words (1…V) in columns:

β
Topic                    bank    crucial    invest    entrepreneurship
1 (banking)              0.7     0.3        0         0
2 (investment)           0       0          1         0
3 (entrepreneurship)     0       0          0         1

The value $\beta_{k,v}$, or $\beta_{topic,word}$, is the probability of the word conditional on the topic. Here we set the number of topics to 3, so $K$ is 3; we have a dictionary of 4 unique words, so $V$ is 4. The topic names are not a direct output of LDA. When we implement the machine learning strategy, the algorithm returns only the keyword list for each topic. We assign topic names to improve readability. For example, conditional on topic 1 (banking), the word “bank” appears with probability 0.7 and “crucial” appears with probability 0.3. We could also arrange the matrix in a different representation,

Topic 1 (banking)      Topic 2 (investment)      Topic 3 (entrepreneurship)
bank      0.7          invest    1               entrepreneurship    1
crucial   0.3

where each topic is associated with a list of words and their probabilities. We then assume $\theta$ to be the following matrix, with abstracts (1…M) in rows and topics (1…K) in columns:

θ
Abstract    1 (banking)    2 (investment)    3 (entrepreneurship)
1           0.6            0                 0.4
2           0.7            0.3               0

The value $\theta_{d,k}$, or $\theta_{abstract,topic}$, is the share of the topic in the abstract. We can see that abstract 1, “bank crucial entrepreneurship”, consists of 0.6 of topic 1 (banking) and 0.4 of topic 3 (entrepreneurship), and abstract 2, “bank crucial invest”, consists of 0.7 of topic 1 (banking) and 0.3 of topic 2 (investment).

With these assumed parameters, we can calculate the probability of this collection of documents. The probability of the first word “bank” appearing in abstract 1 is calculated as

$$\begin{aligned}
&\sum_{z_{1,1}} \Pr(z_{1,1} \mid \theta_1)\, \Pr(w_{1,1} \mid \beta_{z_{1,1}}) \\
&= \Pr(\text{topic}_1 \mid \theta_{\text{abstract}_1}) \Pr(\text{“bank”} \mid \beta_{\text{topic}_1}) + \Pr(\text{topic}_2 \mid \theta_{\text{abstract}_1}) \Pr(\text{“bank”} \mid \beta_{\text{topic}_2}) + \Pr(\text{topic}_3 \mid \theta_{\text{abstract}_1}) \Pr(\text{“bank”} \mid \beta_{\text{topic}_3}) \\
&= 0.6 \times 0.7 + 0 \times 0 + 0.4 \times 0 \\
&= 0.42
\end{aligned}$$

and we can multiply the probabilities of all 3 words in abstract 1 together to obtain the probability of abstract 1, which is

$$\prod_{n=1}^{N_1} \sum_{z_{1,n}} \Pr(z_{1,n} \mid \theta_1)\, \Pr(w_{1,n} \mid \beta_{z_{1,n}}) = (0.6 \times 0.7) \times (0.6 \times 0.3) \times (0.4 \times 1) = 0.03024$$

Likewise, we can calculate the probability of abstract 2, which is

$$\prod_{n=1}^{N_2} \sum_{z_{2,n}} \Pr(z_{2,n} \mid \theta_2)\, \Pr(w_{2,n} \mid \beta_{z_{2,n}}) = (0.7 \times 0.7) \times (0.7 \times 0.3) \times (0.3 \times 1) = 0.03087$$

We can then multiply the probabilities of the abstracts together to obtain the probability of this collection of abstracts, which is calculated as

$$\prod_{d=1}^{2} \prod_{n=1}^{N_d} \sum_{z_{d,n}} \Pr(z_{d,n} \mid \theta_d)\, \Pr(w_{d,n} \mid \beta_{z_{d,n}}) = 0.03024 \times 0.03087 = 9.335088 \times 10^{-4}$$

By adjusting the values of $\theta$ and $\beta$, we would obtain different probability values. The goal of topic modeling and LDA is to find an optimized set of $\theta$ and $\beta$ so that the computed probability is maximized.
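The arithmetic of the worked example can be checked with a few lines of Python. The sketch below simply evaluates Eq. (1) for the hypothetical $\beta$ and $\theta$ values assumed in the example; the constant and function names are invented for illustration.

```python
import math

# Hypothetical parameters from the worked example (not estimated from data).
VOCAB = ["bank", "crucial", "invest", "entrepreneurship"]
BETA = [
    [0.7, 0.3, 0.0, 0.0],  # topic 1 (banking)
    [0.0, 0.0, 1.0, 0.0],  # topic 2 (investment)
    [0.0, 0.0, 0.0, 1.0],  # topic 3 (entrepreneurship)
]
THETA = [
    [0.6, 0.0, 0.4],  # abstract 1
    [0.7, 0.3, 0.0],  # abstract 2
]
ABSTRACTS = [
    ["bank", "crucial", "entrepreneurship"],
    ["bank", "crucial", "invest"],
]


def word_probability(word, theta_d, beta, vocab):
    """Sum over topics of Pr(topic | theta_d) * Pr(word | beta_topic)."""
    v = vocab.index(word)
    return sum(theta_d[k] * beta[k][v] for k in range(len(theta_d)))


def abstract_probability(words, theta_d, beta, vocab):
    """Product of the word probabilities within one abstract."""
    return math.prod(word_probability(w, theta_d, beta, vocab) for w in words)


def corpus_probability(abstracts, theta, beta, vocab):
    """Product of the abstract probabilities, i.e. Eq. (1)."""
    return math.prod(
        abstract_probability(words, theta_d, beta, vocab)
        for words, theta_d in zip(abstracts, theta)
    )
```

Evaluating `corpus_probability(ABSTRACTS, THETA, BETA, VOCAB)` reproduces, up to floating-point rounding, the value $9.335088 \times 10^{-4}$ obtained by hand.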

However, optimizing the computation above is generally intractable, as noted by Hansen, McMahon, and Prat (2014). Therefore, direct maximum likelihood estimation based on this computation is not feasible. To facilitate the computation, LDA assumes that each $\theta_d$ is a K-dimensional Dirichlet random variable, $\mathrm{Dirichlet}(\alpha)$, and each $\beta_k$ is a V-dimensional Dirichlet random variable, $\mathrm{Dirichlet}(\eta)$. The resulting probability of a corpus D generated through the process is

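$$\Pr(D \mid \alpha, \eta) = \prod_{d=1}^{M} \int \cdots \int \prod_{k=1}^{K} \Pr(\beta_k \mid \eta)\, \Pr(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{d,n}} \Pr(z_{d,n} \mid \theta_d)\, \Pr(w_{d,n} \mid \beta_{z_{d,n}}) \right) d\theta_d\, d\beta_1 \cdots d\beta_K \tag{2}$$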

The Dirichlet distribution is a multivariate generalization of the beta distribution, with probability density function

$$f(x_1, \dots, x_K; \alpha_1, \dots, \alpha_K) = \frac{1}{\mathrm{B}(\alpha)} \prod_{i=1}^{K} x_i^{\alpha_i - 1} \tag{3}$$

where $x_1, \dots, x_K$ sum to 1, and $\alpha_1, \dots, \alpha_K$ are the parameters of the distribution. The Dirichlet distribution is the conjugate prior of the categorical distribution and is often used as its prior. When the prior is a Dirichlet distribution and the data points are draws from categorical distributions, as in the case of LDA, the posterior distribution is also a Dirichlet distribution.

With the conjugacy between the Dirichlet and categorical distributions, optimizing the probability of a corpus becomes tractable and we are able to estimate the latent variables by maximum likelihood methods. $\alpha$ and $\eta$ are hyper-parameters of the model, and they can be tuned for different model behaviors. For example, abstracts contain fewer topics with lower $\alpha$ and more topics with higher $\alpha$. Following Griffiths and Steyvers (2004) and Steyvers and Griffiths (2007), we choose $\alpha = 50/K$ and $\eta = 0.025$ in our analysis.
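To make the role of the Dirichlet prior more concrete, the sketch below evaluates the log of the density in Eq. (3) and draws samples from it using only the Python standard library. It assumes the standard definition of the normalizing constant, $\mathrm{B}(\alpha) = \prod_i \Gamma(\alpha_i) / \Gamma(\sum_i \alpha_i)$, which the text above does not spell out, and it samples by normalizing independent Gamma variates. The function names are hypothetical.

```python
import math
import random


def dirichlet_log_pdf(x, alpha):
    """Log of the Dirichlet density in Eq. (3); x must be positive and sum to 1."""
    # log B(alpha) = sum_i log Gamma(alpha_i) - log Gamma(sum_i alpha_i)  (assumed standard form)
    log_beta = sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))
    return sum((a - 1.0) * math.log(xi) for a, xi in zip(alpha, x)) - log_beta


def sample_dirichlet(alpha, rng):
    """Draw one sample by normalizing independent Gamma(alpha_i, 1) variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]
```

Drawing a handful of samples with a small symmetric $\alpha$ and again with a large one is a quick way to see the behavior described above: smaller values tend to concentrate the mass on a few topics, while larger values spread it more evenly.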

Several properties of LDA are worth noting. LDA is a bag-of-words language model, in which each abstract is represented by the occurrence frequency of each word it contains. This approach ignores word order and reduces computational complexity. Hansen, McMahon, and Prat (2014) argue that the resulting information loss has little impact on the goal of determining topic coverage. In addition, LDA is an “unsupervised” machine learning algorithm. This means that the algorithm requires no pre-assigned labels: it is enough to feed the textual data into the algorithm. This unsupervised property significantly reduces the workload when processing big data.
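The bag-of-words representation mentioned above can be shown in a few lines. This is a minimal sketch with hypothetical function names: an abstract, already cleaned into tokens, is reduced to word counts, so two abstracts with the same words in a different order become identical.

```python
from collections import Counter


def bag_of_words(tokens):
    """Map a cleaned abstract to word counts, discarding word order."""
    return Counter(tokens)


def to_count_vector(tokens, vocab):
    """Lay the counts out as a V-dimensional vector in the order given by vocab."""
    counts = bag_of_words(tokens)
    return [counts[word] for word in vocab]
```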

4.2. Dynamic Topic Model


For each abstract published at discrete time t, the parameters $\alpha$ and $\beta_k$ are now replaced by $\alpha_t$ and $\beta_{t,k}$, which evolve with Gaussian noise from $\alpha_{t-1}$ and $\beta_{t-1,k}$, respectively. A simple version of such a model is

$$\beta_{t,k} \mid \beta_{t-1,k} \sim \mathcal{N}(\beta_{t-1,k}, \sigma^2 I) \tag{4}$$

and

$$\alpha_t \mid \alpha_{t-1} \sim \mathcal{N}(\alpha_{t-1}, \delta^2 I) \tag{5}$$

In our experiment, we set t to the publishing year of an abstract. Therefore, in each year we obtain a different $\beta_{t,k}$, which gives the probability of each word in each topic. We can then observe the evolution of word usage in every topic.
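The random-walk evolution in Eq. (4) can be simulated with a short sketch. One assumption is added here that the text above does not state: following the usual formulation in Blei and Lafferty (2006), $\beta_{t,k}$ is treated as an unconstrained vector that is mapped to word probabilities through a softmax. The function names, `sigma`, and `n_years` are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(values):
    """Map an unconstrained vector to probabilities that sum to 1."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def evolve_topic(beta_prev, sigma, rng):
    """One step of Eq. (4): add independent Gaussian noise with variance sigma^2."""
    return [b + rng.gauss(0.0, sigma) for b in beta_prev]


def simulate_topic_path(beta_0, sigma, n_years, rng):
    """Simulate one topic over n_years and return its word probabilities per year."""
    path = [list(beta_0)]
    for _ in range(n_years - 1):
        path.append(evolve_topic(path[-1], sigma, rng))
    # Assumed softmax mapping from the chained parameters to word probabilities.
    return [softmax(beta_t) for beta_t in path]
```

A small `sigma` keeps a topic's word distribution close to the previous year's, which is what allows word usage within a topic to drift gradually rather than jump.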

Apart from the discrete-time DTM described above, a continuous-time DTM was proposed by Wang, Blei, and Heckerman (2012). Rather than being discrete, t can be any point on a continuous timeline. While the continuous-time DTM is useful for high-frequency textual data, such as tweets from Twitter, it is hardly applicable to our project, which mainly uses yearly data.

5. Results

We apply LDA and DTM to the abstracts in 17 finance academic journals. The dataset

contains 20,185 abstracts and 12,046 unique words and phrases. After the cleaning process as

in Section 3.1, we are left with 5,332 unique words and phrases. Summary statistics of the 17

journals are listed in Table 1, including each journal’s time horizon and number of articles with

abstracts. For example, JF in our sample starts from 1976, the year when Web of Science started storing its data¹⁵. Fig. A.1 plots the number of active journals and articles every year.

15 JF was founded in 1946 and JFE had its first publication in 1974, but Web of Science stores these two journals’ data only from 1976. Moreover, Web of Science stores the article abstracts of JF from 1991 and those of JFQA from 1992. Table 1 lists the journals and their summary statistics, including the first years for which Web of Science stores the abstracts. We supplement the missing abstracts of JF between 1976 and 1990 and those of JFQA between 1984 and 1991 from JSTOR and manual search.

5.1. Appropriate Number of Topics

To determine the appropriate number of topics, we run LDA for different numbers of topics and select the number that maximizes the log-likelihood of the data. To guard against overfitting, we compute, at the end of the machine learning process, the probability of a set of abstracts unseen by the estimated model. The optimal number of topics should be the one with the highest computed probability. Fig. 1 reports the log-likelihood of the data under the models trained with different numbers of topics. The number of topics with the highest likelihood is approximately 40. In implementing this approach, we find that some topics represent general sentences and do not indicate a specific research interest. For example, a topic with keywords “relat”, “posit”, “neg”, “associ” and “evid” may simply represent an often used general sentence such as “we provide evidence on a positive/negative relation/association”. Therefore, we finally choose 50 topics when implementing LDA and exclude 12 general sentence topics from them. A full list of general sentence topics is presented in Appendix Table A.2.

5.2. Naming the Topics

Table 2 presents each topic’s top ten keywords generated by LDA, i.e. the ten words with the highest probability of appearing in each topic. We name each topic by reading the keywords and the articles that belong to it. For example, if we observe that “bank”, “loan”, “borrow”, “lend”, “commerce” and “deposit” appear in one topic, after reading the articles belonging to this topic, we name it “Commercial Banking”; if we observe that “ceo”, “manag”, “board”, “compens”, “incent”, “director” appear in one topic, we name it “CEO,

Board, Director”. The abstracts are categorized into 38 research topics and 12 general sentence

topics.

As we explained in the previous example of Laeven and Levine (2009), each abstract has a quantitative distribution over the topics. We define an abstract as focusing on a topic if more than 10% of its distribution falls on that topic. An abstract with a higher share on a certain topic tends to contain more of that topic’s keywords. An abstract may focus on two or more topics. For

example, Laeven and Levine (2009) is 12.7% on “Systematic Risk and Risk Premium”, 11%

on “Shareholder Right, Ownership Structure”, 10.3% on “Commercial Banking” and 10.1%

on “Financial Regulation”. Therefore, Laeven and Levine (2009) focuses on the four topics

“Systematic Risk and Risk Premium”, “Shareholder Right, Ownership Structure”,

“Commercial Banking” and “Financial Regulation” by our definition.
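The focus rule can be written as a short function. The four shares below are those reported in the text for Laeven and Levine (2009); the function name and the extra low-share entry are hypothetical and are included only to show a topic that fails the threshold.

```python
FOCUS_THRESHOLD = 0.10  # "over 10%" in the definition above

def focus_topics(topic_shares, threshold=FOCUS_THRESHOLD):
    """Return the topics whose share strictly exceeds the threshold, largest first."""
    focused = [(t, s) for t, s in topic_shares.items() if s > threshold]
    focused.sort(key=lambda pair: pair[1], reverse=True)
    return [topic for topic, _ in focused]

laeven_levine_2009 = {
    "Systematic Risk and Risk Premium": 0.127,
    "Shareholder Right, Ownership Structure": 0.110,
    "Commercial Banking": 0.103,
    "Financial Regulation": 0.101,
    "Hypothetical other topic": 0.040,   # invented entry, not from the paper
}
print(focus_topics(laeven_levine_2009))
```

The comparison is strict, so a topic with exactly 10% would not count as a focus under this reading of “over 10%”.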

“Option Pricing” is the topic with the most publications that focus on it, followed by

“Commercial Banking”, “CEO, Board, Director”, “Market Microstructure”, “Central Bank,

Monetary Policy”, and “Mergers and Acquisitions”.

Table 3 lists the most cited articles in each topic. The citation counts were collected on February 25, 2016. The year of publication is given in parentheses, and the number of citations follows the comma. We report Web of Science citation counts after the author-years. In Appendix Table A.3, we present the most cited articles in each topic by Google Scholar citations.

5.3. Historical Trend of Topics

Fig. 2 presents the historical evolution of topics. The topics are identified by LDA. The

horizontal axis represents the year of publication. The vertical axis represents the average

percentage for a given topic across abstracts in a given year, and its value can be interpreted as

the topic’s popularity of research interest. It is computed as

$$p_{it} = \frac{1}{N_t} \sum_{k=1}^{N_t} p_{itk} \qquad (6)$$

where $p_{itk}$ is the percentage distribution on topic $i$ of abstract $k$ published in year $t$, and $N_t$ is the total number of articles published in year $t$¹⁶. For example, the average percentage distribution on “CEO, Board, Director” across all abstracts rose from about 1.5% in 1980 to about 2.5% in 2015. We choose 50 topics, including 12 general sentence topics, and the total percentage of 100% is distributed over the 50 topics. A topic therefore attracts above-average attention and can be seen as “popular” if its percentage is higher than 100%/50 = 2%.
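The yearly average in Eq. (6) and the 2% benchmark can be sketched as follows. The data layout (a list of year and topic-share pairs) and all numbers are invented for illustration; only the averaging rule and the 1/50 benchmark come from the text.

```python
from collections import defaultdict

def topic_popularity(abstracts, topic):
    """Average share of `topic` across all abstracts, by publication year."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for year, shares in abstracts:
        sums[year] += shares.get(topic, 0.0)
        counts[year] += 1      # N_t counts every abstract published in year t
    return {year: sums[year] / counts[year] for year in sorted(sums)}

# Toy data: (publication year, topic shares). All values are made up.
abstracts = [
    (1980, {"CEO, Board, Director": 0.01}),
    (1980, {"CEO, Board, Director": 0.02}),
    (2015, {"CEO, Board, Director": 0.06, "Volatility": 0.10}),
    (2015, {"Volatility": 0.20}),
]
popularity = topic_popularity(abstracts, "CEO, Board, Director")
n_topics = 50
popular_years = [y for y, p in popularity.items() if p > 1 / n_topics]
print(popularity, popular_years)
```

An abstract that places no weight on the topic still counts in the denominator, which is what makes the measure an average over all abstracts of the year.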

We observe that the research interest in “Financial System, Banking Crisis” often spiked

around or after the financial crises, such as the savings and loan crisis in the late 1980s and

early 1990s. It grew even faster after the 2008 financial crisis. Research interest in “CEO,

Board, Director” has grown steadily over the past 40 years. Other topics that attracted more

attention include “Behavioral Finance”, “Central Bank, Monetary Policy”, “Commercial

Banking”, “Corporate Cash Holding”, “Hedge Fund, Mutual Fund”, “International Capital

Markets”, “Social Network and Cultural Effect”, “Venture Capital, Entrepreneurship”, and

“Volatility”. The research interest in topics like “Bond Term Structure” and “Optimal Choice

Model” has been shrinking.

It is worth noting that the values fluctuate strongly in the 1970s and 1980s for most of the topics. Fig. A.1 plots the number of active journals and articles every year. In the 1970s and 1980s, there were fewer journals and articles, resulting in more volatile values. Fig. A.2 plots the yearly publication numbers in JF, JFE, and RFS. Zivney and Bertin (1992) explain that output has been constant since the 1980s, following rapid growth in the number of journals and articles published in the 1960s and 1970s. Our results show continuous growth in the number of journals and articles after the 1990s.

16 We also computed $p'_{it} = \sum_{k=1}^{N'_{it}} p_{itk} / N_t$, where $N'_{it}$ is the total number of articles that focus on topic $i$ in year $t$, and obtain robust results.

5.3.1. Topics with Fastest Growth and Contraction

Fig. 3.1 plots the three fastest growing and three fastest shrinking topics in the 17 journals. The topics are identified by LDA. The horizontal axis represents the year of publication. The vertical axis represents the popularity of the given topic, calculated as the average across articles of each article’s percentage distribution on that topic. “Financial System, Banking Crisis”,

“Hedge Fund, Mutual Fund” and “Social Network and Cultural Effect” grew fastest from 2006

to 2015. “Market Microstructure”, “IPO” and “Option Pricing” experienced the greatest

contraction during the same period.

Fig. 3.2 plots the topics with fastest popularity increase and decrease in JF, JFE, and

RFS. “Social Network and Cultural Effect”, “Default and CDS”, and “CEO, Board, Director”

grew fastest from 2006 to 2015. “IPO”, “News, Analyst Report, Earnings Announcement”, and

“Determinants of Stock Return” experienced the greatest contraction during the same period.

5.3.2. Working Papers

For many articles, there is a time lag between first circulation and final publication. Sometimes the lag can be several years. Therefore, the trend of the published articles that we show in Fig. 3 may not reflect the most recent dynamics of finance research. To address this concern, we apply the LDA model trained on published articles to 130,547 working

paper abstracts that we obtained from SSRN Financial Economics Network. We do not use

working papers uploaded to IDEAS because IDEAS does not distinguish working papers in

finance from those in economics.

We present the three fastest growing and three fastest shrinking topics among working

papers in Fig. 4, which is analogous to Fig. 3. Fig. 4 reports the rise and fall of each topic’s

popularity from 2006 to 2015. The horizontal axis represents the year of publication. The

vertical axis represents the popularity of the given topic, calculated as the average percentage

of an article’s distribution on that topic.

Working papers on “Social Network and Cultural Effect”, “News, Analyst Report,

Earnings Announcement”, and “International Capital Markets” grew fastest from 2006 to 2015.

“Market Microstructure”, “Macro Finance”, and “Statistical Estimation Methodology”

experienced the greatest contraction during the same period.

5.3.3. JEL Classification Codes

In some journals such as JFE and RFS, JEL codes are reported when articles are

published. In other journals such as JF and JFQA, JEL codes are not reported in published

articles.

We compare the algorithm-computed topics of the articles with their self-reported JEL codes in this section by listing the most reported JEL codes of each topic, shown in Table 4. The explanation of each JEL code is in Table A.4. We list the 5 most reported JEL codes among the articles belonging to each topic. Of the resulting 190 (= 5 × 38) JEL codes, 161 are in the G category (Financial Economics).
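The tabulation described above can be sketched with a counter per topic. The input layout, the function names, and the toy articles are assumptions made for the example; the paper does not describe its implementation.

```python
from collections import Counter

def top_jel_codes(articles, n=5):
    """For each topic, return the n most frequently reported JEL codes."""
    counts = {}
    for topics, jel_codes in articles:
        for topic in topics:
            counts.setdefault(topic, Counter()).update(jel_codes)
    return {t: [code for code, _ in c.most_common(n)] for t, c in counts.items()}

def share_in_category(top_codes, letter="G"):
    """Fraction of the listed codes that belong to one JEL category."""
    listed = [code for codes in top_codes.values() for code in codes]
    return sum(code.startswith(letter) for code in listed) / len(listed)

# Toy input: (topics an article focuses on, its self-reported JEL codes).
articles = [
    (["CEO, Board, Director"], ["G34", "J33"]),
    (["CEO, Board, Director"], ["G34", "G30"]),
    (["CEO, Board, Director"], ["G34", "J33"]),
    (["Commercial Banking"], ["G21"]),
]
top = top_jel_codes(articles, n=2)
print(top, share_in_category(top))
```

With the figures reported in the text, the same share would be 161/190, or about 85%.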

In some algorithm-computed topics, JEL codes that are not in G category (Financial

Economics) are also among the most reported. For example, in the topic of “CEO, Board,

Director”, J33 (Compensation Packages, Payment Methods) in J category (Labor and

Demographic Economics) is also one of the most reported JEL codes, reflecting the body of

research on CEO compensation. In the topic of “International Asset Pricing and Foreign

Exchange”, F31 (Foreign Exchange) and F36 (Financial Aspects of Economic Integration) in

F3 category (International Finance) are two of the five most reported JEL codes. In the topic

of “Statistical Estimation Methodology”, two JEL codes in C category (Mathematical and

Quantitative Methods) are among the five most reported JEL codes.

“Central Bank, Monetary Policy” is the only topic that does not have any of its five most

reported JEL codes in G category. Instead, two are in E category (Macroeconomics and

Monetary Economics) and three are in F category (International Economics). The two JEL codes

in E category are E52 (Monetary Policy) and E58 (Central Banks and Their Policies). The three

JEL codes in F category are F31 (Foreign Exchange), F41 (Open Economy Macroeconomics),

and F32 (Current Account Adjustment, Short-Term Capital Movements).

We find our algorithm-computed topics of the articles and their self-reported JEL codes

comparable. Therefore, we are able to apply the unsupervised machine learning algorithm on

the articles without JEL codes to determine their topics.

5.3.4. Evolution of Research Interests within Topics

Table 5 reports results of the Dynamic Topic Model: the evolution of interest within

topics. We report the results every 5 years. When implementing DTM, we use 50 topics and

the same hyper-parameters as we used with LDA to produce comparable results with our LDA

results. In Table 5, the words under each period are ranked by frequency; words in higher

positions appear more frequently.

In Panel A, the topic of “CEO, Board, Director”, the use of “manager/management” and

“control” declined after 2000, while the research of “CEO” and “board” rose.

In Panel B, the topic of “Determinants of Stock Return”, the January effect was once a

top theme before 1995. Since 2000, the January effect has not been on the list of the most

frequent words. Instead, “momentum” and “cross-section” rank higher over years.

In Panel C, the topic of “Commercial Banking”, we observe the rise of research interest

in lending and network, accompanied by a decline in deposit.

5.3.5. Trend of Cross-topic Research

In this section, we study whether cross-topic research has become more common over the years. To put it another way, we examine whether research articles have become broader or narrower in terms of research topic coverage. To measure how broad an article is, we calculate the Herfindahl Index of each abstract:

$$H = \sum_{i=1}^{38} s_i^2 \qquad (7)$$

where $s_i$ represents the percentage distribution of the abstract on topic $i$. A higher $H$ indicates that the abstract is concentrated on fewer topics.

Fig. 5 presents the trend of published articles’ research interest concentration. The solid line represents the average Herfindahl Index of abstracts in the 17 journals, and the dashed line represents the average Herfindahl Index of abstracts in JF, JFE, and RFS. The average Herfindahl Index dropped sharply from 1976 to 1982, perhaps because many topics’ pioneering works started to emerge during this early period and cross-topic research was therefore more common. The two lines went up between 1982 and 2000, indicating that on average research became narrower. One possible explanation is that many topics matured and their literature became established after two decades of development, so that researchers made more incremental contributions. The average Herfindahl Index of abstracts in the 17 journals continued to increase after 2000, while that in the three top journals tended to remain at a constant level and even declined after 2010, indicating that the three top journals still publish broader, more cross-topic articles.
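The Herfindahl Index in Eq. (7) can be computed directly from an abstract's topic shares. The sketch below expresses shares as fractions over the 38 research topics; whether the shares are rescaled after the 12 general sentence topics are excluded is not stated in the text, so the inputs here are assumed to sum to one.

```python
def herfindahl(shares):
    """Sum of squared topic shares: 1.0 for a single topic, 1/n for an even spread."""
    return sum(s ** 2 for s in shares)

n_research_topics = 38
single_topic = [1.0] + [0.0] * (n_research_topics - 1)      # narrowest abstract
even_spread = [1.0 / n_research_topics] * n_research_topics  # broadest abstract
print(herfindahl(single_topic), herfindahl(even_spread))
```

The two cases bound the index, so a rising average over time corresponds to abstracts that concentrate on fewer topics.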

5.4. Citation Network Between Topics

To understand how topics relate to each other and the “distance” between the topics, we

use the cross-reference data of each article to construct a citation network between topics. In

Fig. 6, there are 38 nodes and each of them represents a topic. A node’s size is proportional to

the number of articles that focus on the topic that the node represents. As defined in Section

5.2, an abstract focuses on a topic if it has over 10% distribution on it. Topics with more articles

have larger nodes.

The nodes are connected through edges. An edge represents the cross-references between two topics, and it is thicker if there are more cross-references. For example, if topic A has $N$ articles, these $N$ articles cite articles in topic B a total of $\sum_{i=1}^{N} R_{iB}$ times, where $R_{iB}$ is the number of times that article $i$ cites articles in topic B. Similarly, if topic B has $M$ articles, these $M$ articles cite articles in topic A a total of $\sum_{j=1}^{M} R_{jA}$ times, where $R_{jA}$ is the number of times that article $j$ cites articles in topic A. The total number of cross-references is then $\sum_{i=1}^{N} R_{iB} + \sum_{j=1}^{M} R_{jA}$, and the thickness of the edge between A and B is proportional to it.
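The edge weight between two topics, the sum of citations in both directions, can be sketched as follows. The data layout and identifiers are hypothetical. The sketch also assumes that an article with several focus topics is counted under each of them, which the text does not specify.

```python
from collections import defaultdict

def edge_weights(citations, topics_of):
    """Count cross-references between every unordered pair of distinct topics."""
    weights = defaultdict(int)
    for citing, cited in citations:
        for topic_a in topics_of.get(citing, ()):
            for topic_b in topics_of.get(cited, ()):
                if topic_a != topic_b:
                    # An unordered key adds A->B and B->A citations together.
                    weights[frozenset((topic_a, topic_b))] += 1
    return dict(weights)

# Toy data: article -> focus topics, and (citing article, cited article) pairs.
topics_of = {"p1": ["A"], "p2": ["B"], "p3": ["A"], "p4": ["B", "C"]}
citations = [("p1", "p2"), ("p3", "p2"), ("p2", "p1"), ("p1", "p4")]
weights = edge_weights(citations, topics_of)
print(weights)
```

Using an unordered pair as the key is what makes the edge undirected, in line with a single edge thickness per pair of topics.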

Each node is positioned by a force-directed gravity algorithm called “Force Atlas 2”, and a node settles where the forces from each edge’s direction are balanced (Jacomy, Venturini, Heymann and Bastian, 2014). Intuitively speaking, the algorithm assumes a force that pushes every node outward from the center; it also allows every node to exert gravity on its connected nodes and draw them inward. Each node is connected with other nodes via edges, and a thicker edge represents greater gravity. If a topic (node A) has few cross-references (a thin edge) with another topic (node B) and many cross-references (a thick edge) with a third topic (node C), then node A will exert larger gravity on node C than on node B. Therefore, the topics with more cross-references will be “attracted” closer by their connecting edges. The network structure evolves dynamically and eventually reaches an equilibrium in which the topics with more cross-references cluster together. The relative position of the nodes is therefore determined by the algorithm, not chosen by ourselves. The distance between two nodes approximately represents how closely the two topics are related in terms of cross-references.
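The intuition that heavier edges pull nodes closer can be reproduced with a toy force-directed layout. This is not ForceAtlas2 itself: the force laws (repulsion proportional to 1/distance between all pairs, attraction proportional to edge weight times distance), the step size, and the three-node graph are simplifying assumptions chosen for the illustration.

```python
import math

def layout(nodes, edges, steps=3000, step_size=0.01):
    """Toy force-directed layout: all pairs repel, connected pairs attract."""
    pos = {name: list(xy) for name, xy in nodes.items()}
    names = list(pos)
    for _ in range(steps):
        force = {name: [0.0, 0.0] for name in names}
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                dx = pos[b][0] - pos[a][0]
                dy = pos[b][1] - pos[a][1]
                dist = math.hypot(dx, dy) or 1e-9
                weight = edges.get(frozenset((a, b)), 0.0)
                # Positive pull draws the pair together, negative pushes apart.
                pull = weight * dist - 1.0 / dist
                fx = pull * dx / dist
                fy = pull * dy / dist
                force[a][0] += fx
                force[a][1] += fy
                force[b][0] -= fx
                force[b][1] -= fy
        for name in names:
            pos[name][0] += step_size * force[name][0]
            pos[name][1] += step_size * force[name][1]
    return pos

def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Node A has a thin edge to B (weight 1) and a thick edge to C (weight 10).
start = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (0.0, 1.0)}
edges = {frozenset(("A", "B")): 1.0, frozenset(("A", "C")): 10.0}
final = layout(start, edges)
print(distance(final["A"], final["B"]), distance(final["A"], final["C"]))
```

In this toy setting node C ends up closer to A than node B does, mirroring the description of thick edges attracting topics toward each other.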

We conduct modularity analysis to categorize the topics into clusters based on the

computation of the distance and attraction between the nodes. The number of clusters is

determined by the modularity analysis algorithm, and 38 topics are compartmentalized into 5

clusters, or “territories”: asset pricing, corporate finance, market microstructure, banking and

macro finance, and “mixed areas”. Each node’s color reflects the territory it belongs to.
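For reference, the modularity score that such an analysis seeks to maximize can be computed for any given partition. The sketch below evaluates the standard weighted modularity Q on a toy graph. It does not search for the best partition, and the text does not state which modularity algorithm was used, so this is an illustration of the quantity rather than of the procedure.

```python
def modularity(edges, community_of):
    """Weighted modularity Q of a partition of an undirected graph."""
    total_weight = sum(edges.values())   # m: total edge weight
    internal = {}                        # edge weight inside each community
    degree = {}                          # summed node strength per community
    for (u, v), w in edges.items():
        cu, cv = community_of[u], community_of[v]
        degree[cu] = degree.get(cu, 0.0) + w
        degree[cv] = degree.get(cv, 0.0) + w
        if cu == cv:
            internal[cu] = internal.get(cu, 0.0) + w
    return sum(
        internal.get(c, 0.0) / total_weight - (degree[c] / (2 * total_weight)) ** 2
        for c in degree
    )

# Toy graph: two triangles joined by a single bridge, all weights equal to 1.
edges = {(0, 1): 1, (0, 2): 1, (1, 2): 1, (3, 4): 1, (3, 5): 1, (4, 5): 1, (2, 3): 1}
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
merged = {node: "a" for node in range(6)}
q_split = modularity(edges, split)
q_merged = modularity(edges, merged)
print(q_split, q_merged)
```

Separating the two triangles scores higher than merging them, which is the sense in which modularity rewards clusters with dense internal cross-references.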

The left side¹⁷ of Fig. 6 is clustered with corporate finance topics, including large topics

such as “CEO, Board, Director”, “Mergers and Acquisitions”, “Shareholder Right, Ownership

Structure”, and “IPO”. The bottom side is clustered with banking and macro finance topics,

including large topics such as “Commercial Banking”, “Central Bank, Monetary Policy”,

“Financial System, Banking Crisis”, and “Financial Regulation”. The right side is clustered

with asset pricing topics, including large topics such as “Option Pricing”, “Volatility”, “Return

Distribution and Value-at-Risk (VaR)”, and “Bond Term Structure”. The center is

clustered with market microstructure topics, including large topics such as “Market

Microstructure”, “Trader Behavior”, and “Information Asymmetry, Disclosure, Insider

Trading”. The upper side is clustered with “mixed areas”, including large topics such as “Hedge

Fund, Mutual Fund”, “News, Analyst Report, Earnings Announcement”, “Behavioral Finance”,

and “Statistical Estimation Methodology”.

5.5. Bibliometric Regularity

Fig. 7 presents a bibliometric regularity: the number of researchers covering n topics is

approximately $1/2^n$ of those covering just one topic. A researcher covers a topic if she

publishes at least one article with over 10% distribution on that topic. The horizontal axis of

Fig. 7 represents the number of topics, and the vertical axis represents the number of

researchers.

17 The “Force Atlas 2” force-directed gravity algorithm only determines the relative position of nodes.

The network can be rotated clockwise or counter-clockwise. Here the left, right, upper and bottom

sides are only for explanatory purpose on Fig. 6.

The solid line is generated from our data; it is downward sloping because fewer researchers are able to cover more topics. Each point on the line indicates how many researchers cover exactly a given number of topics. For example, the first point on the solid line is (1, 6830), meaning that 6830 researchers publish articles that focus on just one topic. The second point is (2, 3507), meaning that 3507 researchers publish articles that focus on exactly two topics. We use the dashed line $y = 13251/2^n$ to fit the solid line, where $y$ is the number of researchers covering $n$ topics. When $n = 1$, $y = 6625.5$; when $n = 2$, $y = 3312.75$. The R-squared value of the fit is 0.998.
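The fitted curve can be evaluated against the two data points quoted above. The scale constant used here is the one implied by the fitted values in the text, since 6625.5 × 2 = 3312.75 × 4 = 13251; the function name is hypothetical, and the remaining points of Fig. 7 are not reproduced.

```python
def predicted_researchers(n_topics, scale=13251):
    """Fitted number of researchers covering exactly n_topics topics."""
    return scale / 2 ** n_topics

observed = {1: 6830, 2: 3507}   # the two points quoted in the text
for n, count in observed.items():
    fitted = predicted_researchers(n)
    relative_gap = (count - fitted) / count
    print(n, count, fitted, round(relative_gap, 3))
```

Each additional topic halves the fitted count, which is the regularity the dashed line expresses.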

6. Conclusion

How did finance research topics evolve over the past forty years? In this article, we apply the latent Dirichlet allocation (LDA) model to 20,185 abstracts of finance articles published between 1976 and 2015, and identify 38 research topics. We present the fastest growing topics of published articles and working papers in the past decade. For example, publications on “Financial System, Banking Crisis” and “Hedge Fund, Mutual Fund” grew the fastest from 2006 to 2015, while working papers on “Social Network and Cultural Effect” and “News, Analyst Report, Earnings Announcement” grew the fastest during the same period. We use a citation network to present how topics are related, and cluster the topics into five “territories”: asset pricing, corporate finance, market microstructure, banking and macro finance, and “mixed areas”, which include “Social Network and Cultural Effect” and “Venture Capital, Entrepreneurship”, among others. We find our algorithm-computed topics of the articles and their self-reported JEL codes comparable, which implies that our algorithm can be used to categorize articles without JEL codes. Moreover, we find a strong bibliometric regularity: the number of researchers covering $n$ topics is approximately $1/2^n$ of those covering just one topic. We also find that on average a finance publication has been covering fewer topics and has therefore become narrower

over the years. To the best of our knowledge, this is among the first applications of machine learning to the study of academic finance publications. Overall, we hope that our study may be beneficial to those who desire an overview of this academic profession, and that it may inspire more cross-topic research.

References

Alexander, J. C., & Mabry, R. H. (1994). Relative Significance of Journals, Authors, and

Articles Cited in Financial Research. The Journal of Finance, 49(2), 697-712.

Azoulay, P., Wang, J., & Zivin, J. G. (2010). Superstar Extinction. Quarterly Journal of

Economics, 125(2).

Blei, D. M., & Lafferty, J. D. (2006). Dynamic Topic Models. In Proceedings of the 23rd

International Conference on Machine Learning (pp. 113-120). ACM.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine

Learning Research, 3(Jan), 993-1022.

Borokhovich, K. A., Bricker, R. J., Brunarski, K. R., & Simkins, B. J. (1995). Finance Research

Productivity and Influence. The Journal of Finance, 50(5), 1691-1717.

Borokhovich, K. A., Bricker, R. J., & Simkins, B. J. (1994). Journal Communication and

Influence in Financial Research. The Journal of Finance, 49(2), 713-725.

Brogaard, J., Engelberg, J., & Parsons, C. A. (2014). Networks and Productivity: Causal

Evidence from Editor Rotations. Journal of Financial Economics, 111(1), 251-270.

Chung, K. H. & Cox, R. A. (1990). Patterns of Productivity in the Finance Literature: A Study

of the Bibliometric Distributions. The Journal of Finance, 301-309.

Cleary, F. R., & Edwards, D. J. (1960). The Origins of the Contributors to the AER During the

‘Fifties. The American Economic Review, 50(5), 1011-1014.

Corrado, C. J., & Ferris, S. P. (1997). Journal Influence on the Design of Finance Doctoral

Education. The Journal of Finance, 52(5), 2091-2102.

Froman, L. A. (1952). Graduate Students in Economics. The American Economic

Review, 42(4), 602-608.

Griffiths, T. L., & Steyvers, M. (2004). Finding Scientific Topics. Proceedings of the National

Academy of Sciences, 101(suppl 1), 5228-5235.

Hansen, S., McMahon, M., & Prat, A. (2014). Transparency and Deliberation within the FOMC:

A Computational Linguistics Approach. Working Paper.

Heck, J. L., Cooley, P. L., & Hubbard, C. M. (1986). Contributing Authors and Institutions to

the Journal of Finance: 1946-1985. The Journal of Finance, 41(5), 1129-1140.

Henry, W. R., & Burch, E. E. (1974). Institutional Contributions to Scholarly Journals of

Business. The Journal of Business, 47(1), 56-66.

Hoberg, G., & Phillips, G. (2010). Product Market Synergies and Competition in Mergers and

Acquisitions: A Text-based Analysis. The Review of Financial Studies, 23(10), 3773-3811.

Jacomy, M., Venturini, T., Heymann, S., & Bastian, M. (2014). ForceAtlas2, a continuous

graph layout algorithm for handy network visualization designed for the Gephi software.

PLoS ONE, 9(6), e98679.

Klemkosky, R. C., & Tuttle, D. L. (1977). The Institutional Source and Concentration of

Financial Research. The Journal of Finance, 32(3), 901-907.

Laband, D. N., & Piette, M. J. (1994). Favoritism versus Search for Good Papers: Empirical

Evidence Regarding the Behavior of Journal Editors. Journal of Political

Economy, 102(1), 194-203.

Laeven, L., & Levine, R. (2009). Bank Governance, Regulation and Risk Taking. Journal of

Financial Economics, 93(2), 259-275.

Loughran, T., & McDonald, B. (2011). When is a Liability not a Liability? Textual Analysis,

Dictionaries, and 10‐Ks. The Journal of Finance, 66(1), 35-65.

Loughran, T., & McDonald, B. (2013). IPO First-day Returns, Offer Price Revisions, Volatility,

and Form S-1 Language. Journal of Financial Economics, 109(2), 307-326.

Loughran, T., & McDonald, B. (2014). Measuring Readability in Financial Disclosures. The

Journal of Finance, 69(4), 1643-1671.

Niemi, A. W. (1987). Institutional Contributions to the Leading Finance Journals, 1975

Through 1986: A Note. The Journal of Finance, 42(5), 1389-1397.

Schwert, G. W. (1993). The Journal of Financial Economics: A Retrospective Evaluation

(1974–1991). Journal of Financial Economics, 33(3), 369-424.

Steyvers, M., & Griffiths, T. (2007). Probabilistic Topic Models. In T. Landauer, D. McNamara,

S. Dennis, & W. Kintsch (Eds.), Latent Semantic Analysis: A Road to Meaning.

Swidler, S., & Goldreyer, E. (1998). The Value of a Finance Journal Publication. The Journal

of Finance, 53(1), 351-363.

Tetlock, P. C. (2007). Giving Content to Investor Sentiment: The Role of Media in the Stock

Market. The Journal of Finance, 62(3), 1139-1168.

Tetlock, P. C., Saar‐Tsechansky, M., & Macskassy, S. (2008). More than Words: Quantifying

Language to Measure Firms’ Fundamentals. The Journal of Finance, 63(3), 1437-1467.

Wang, C., Blei, D., & Heckerman, D. (2012). Continuous Time Dynamic Topic Models. arXiv

preprint arXiv:1206.3298.

Welch, I. (2014). Referee Recommendations. Review of Financial Studies, 27(9), 2773-2804.

Zivney, T. L., & Bertin, W. J. (1992). Publish or Perish: What the Competition is Really

Doing. The Journal of Finance, 47(1), 295-329.

Table 1: Summary Statistics of Sample Journals

This table reports summary statistics of 20,185 articles published in 17 finance journals between 1976 and 2015. We obtain each article’s title, authors, affiliations, abstract, full text, references, citations and publishing date from Web of Science, supplemented with ScienceDirect, JSTOR and manual search. We exclude articles without abstracts from our sample. We report the first and last year in which each journal has abstracts in our sample. For example, The Journal of Finance (JF) and Journal of Financial Economics (JFE) in our sample start from 1976, the first year in which they have abstracts in our sample. We have Review of Financial Studies’ data from 1988, the year of its first volume. JF was founded in 1946 and JFE had its first publication in 1974, but Web of Science started storing these two journals’ data only from 1976. Moreover, Web of Science stores the article abstracts of JF from 1991 and the article abstracts of Journal of Financial and Quantitative Analysis (JFQA) from 1992. We supplement the missing abstracts of JF between 1976 and 1990 and those of JFQA between 1984 and 1991 from JSTOR and manual search. We also report the total number and the annual median number of articles published in each journal in our sample.

Journal                                         First Year of Abstract  Last Year of Abstract  Total Number  Annual Median  %
Journal of Banking and Finance                  1977                    2015                   4104          75             20.3%
The Journal of Finance                          1976                    2015                   2465          69             12.2%
Journal of Financial Economics                  1976                    2015                   2304          47             11.4%
Journal of International Money and Finance      1982                    2015                   1627          49             8.1%
Review of Financial Studies                     1988                    2015                   1505          37             7.5%
Journal of Money Credit and Banking             1997                    2015                   1246          77             6.2%
Journal of Financial and Quantitative Analysis  1984                    2015                   1168          35             5.8%
Quantitative Finance                            2001                    2015                   998           62             4.9%
Journal of Portfolio Management                 1992                    2015                   908           39             4.5%
Journal of Futures Markets                      1981                    2015                   870           50             4.3%
Journal of Corporate Finance                    1994                    2015                   833           46             4.1%
Journal of Business Finance and Accounting      1976                    2015                   558           47             2.8%
Journal of Empirical Finance                    1993                    2015                   476           60             2.4%
Journal of Financial Intermediation             1990                    2015                   416           18             2.1%
Journal of Financial Markets                    1998                    2015                   308           19             1.5%
Review of Finance                               1997                    2015                   289           27             1.4%
Journal of Financial Research                   1978                    2015                   110           30             0.5%
Total                                           1976                    2015                   20185         358            100.0%

Table 2: Keywords for Each Topic

This table reports the ten keywords with the highest probability of appearing in each topic. The 38 topics are identified by the latent Dirichlet allocation (LDA) model; the methodology of LDA is detailed in Section 4.1. We name each topic after reading its keywords and the articles that belong to it. For example, if “bank”, “loan”, “borrow”, “lend”, “commerci” and “deposit” appear in one topic, then after reading the articles belonging to this topic we name it “Commercial Banking”; if “ceo”, “manag”, “board”, “compens”, “incent” and “director” appear in one topic, we name it “CEO, Board, Director”. The abstracts are categorized into 38 research topics and 12 general sentence topics. Each abstract has a percentage distribution over the topics. We say that an abstract focuses on a topic if it has over 10% of its distribution on that topic. An abstract with a higher share on a given topic tends to contain more of that topic’s keywords. An abstract may focus on two or more topics. For example, Laeven and Levine (2009) is 12.7% on “Systematic Risk and Risk Premium”, 11% on “Shareholder Right, Ownership Structure”, 10.3% on “Commercial Banking” and 10.1% on “Financial Regulation”. Therefore, by our definition, Laeven and Levine (2009) focuses on these four topics. We order the topics by the number of articles that focus on them. “Option Pricing” has the most articles, followed by “Commercial Banking”, “CEO, Board, Director”, “Market Microstructure”, “Central Bank, Monetary Policy”, and “Mergers and Acquisitions”.


No. | Topic | No. of Papers | Keywords (ranked 1 to 10)
1 | Option Pricing | 890 | option, process, jump, stochast, underli, exercis, diffus, american, european, black schole
2 | Commercial Banking | 812 | bank, loan, borrow, lend, commerci, deposit, credit, busi, securit, branch
3 | CEO, Board, Director | 717 | ceo, manag, board, compens, incent, director, perform, independ, monitor, execut
4 | Market Microstructure | 677 | trade, order, spread, exchang, stock, quot, bid ask, dealer, nyse, limit order
5 | Central Bank, Monetary Policy | 650 | exchang rate, shock, respons, monetari polici, economi, central bank, interest rate, intervent, reserv, stabil
6 | Mergers and Acquisitions | 623 | target, acquisit, merger, acquir, takeov, bid, deal, auction, announc, sharehold
7 | Return Distribution and Value-at-Risk (VaR) | 599 | distribut, method, estim, var, normal, extrem, skew, tail, paramet, simul
8 | News, Analyst Report, Earnings Announcement | 572 | earn, announc, news, analyst, event, report, reaction, stock, abnorm return, surpris
9 | Hedge Fund, Mutual Fund | 560 | fund, manag, perform, activ, mutual fund, hedg fund, strategi, invest, fee, alpha
10 | Shareholder Right, Ownership Structure | 556 | control, ownership, govern, sharehold, compani, right, protect, structur, vote, corpor
11 | International Capital Markets | 521 | countri, intern, foreign, develop, domest, unit state, global, integr, region, emerg market
12 | IPO | 511 | issu, ipo, offer, equiti, underwrit, initi, public, share, underpr, season
13 | Capital Structure, Bankruptcy, Leverage | 487 | debt, equiti, leverag, bankruptci, capit structur, corpor, convert, claim, distress, creditor
14 | Macro Finance | 484 | inflat, real, output, suppli, incom, labor, busi cycl, consum, growth, macroeconom
15 | Volatility | 460 | volatil, condit, correl, dynam, varianc, regim, process, depend, garch, switch
16 | Default and CDS | 448 | rate, credit, default, spread, probabl, swap, mortgag, agenc, structur, collater
17 | Commodities, Futures | 436 | futur, index, hedg, contract, forward, commod, spot, deriv, oil, underli
18 | Trader Behavior | 402 | trade, liquid, volum, day, trader, open, pattern, close, intraday, specul
19 | Bond Term Structure | 401 | bond, term, interest rate, yield, matur, short term, term structur, call, rate, treasuri
20 | Determinants of Stock Return | 384 | return, stock, excess, predict, momentum, januari, revers, anomali, cross section, season
21 | Asset and Portfolio Allocation | 380 | asset, portfolio, return, diversif, varianc, alloc, mean, correl, riski, covari
22 | Asset Pricing Model | 380 | expect, equilibrium, gener, uncertainti, agent, prefer, consumpt, ration, risk avers, belief
23 | Financial Regulation | 380 | capit, requir, regul, insur, liabil, limit, act, deposit insur, failur, polici
24 | Statistical Estimation Methodology | 375 | estim, forecast, error, predict, regress, statist, bias, paramet, variabl, coeffici
25 | International Asset Pricing and Foreign Exchange | 350 | unit state, currenc, dollar, european, euro, uk, area, spillov, exchang rate, japanes
26 | Venture Capital, Entrepreneurship | 336 | invest, financ, capit, decis, extern, constraint, project, opportun, ventur, entrepreneur
27 | Industry Competition and Market Efficiency | 328 | effici, industri, product, profit, competit, innov, technolog, improv, cost, structur
28 | Tax | 316 | tax, short, sell, loss, interest, sale, arbitrag, margin, restrict, incom
29 | Financial System, Banking Crisis | 314 | financi, crisi, system, import, contagion, stabil, intermediari, global, stress, failur
30 | Multifactor Model | 291 | factor, variabl, explain, variat, compon, cross section, common, specif, power, signific
31 | Dividend Policy | 269 | growth, dividend, ratio, share, repurchas, polici, payout, determin, pay, cash flow
32 | Information Asymmetry, Disclosure, Insider Trading | 265 | privat, public, insid, signal, disclosur, inform asymmetri, improv, transpar, reveal, avail
33 | Optimal Choice Model | 252 | optim, strategi, maxim, choic, dynam, program, design, minim, condit, transact cost
34 | Corporate Operational Structure and Value Creation | 248 | firm, corpor, cash flow, affect, oper, busi, examin, characterist, level, control
35 | Systematic Risk and Risk Premium | 240 | risk, premium, exposur, beta, systemat, expect, idiosyncrat, equiti, sensit, adjust
36 | Behavioral Finance | 208 | investor, behavior, individu, ex, ant, sentiment, dispers, tend, retail, herd
37 | Corporate Cash Holding | 134 | cost, higher, lower, hold, cash, greater, level, increas, reduc, payment
38 | Social Network and Cultural Effect | 112 | institut, particip, group, analysi, social, network, influenc, individu, central, affect
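The 10% focus rule described in the Table 2 caption can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `focus_topics` is hypothetical, the four shares are the ones the caption reports for Laeven and Levine (2009), and the fifth entry is an invented below-threshold share added only to show that it is filtered out.

```python
def focus_topics(distribution, threshold=0.10):
    """Return the topics whose share exceeds the threshold, largest share first."""
    ranked = sorted(distribution.items(), key=lambda item: item[1], reverse=True)
    return [topic for topic, share in ranked if share > threshold]


# Shares reported in the Table 2 caption for Laeven and Levine (2009).
laeven_levine_2009 = {
    "Systematic Risk and Risk Premium": 0.127,
    "Shareholder Right, Ownership Structure": 0.110,
    "Commercial Banking": 0.103,
    "Financial Regulation": 0.101,
    # Hypothetical share, not from the paper: a topic below the 10% cutoff.
    "Macro Finance": 0.050,
}

for topic in focus_topics(laeven_levine_2009):
    print(topic)
```

Because the comparison is strict, an abstract with exactly 10% on a topic would not count as focusing on it under this sketch, which follows the caption's wording "over 10%".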


Table 3: Most Cited Articles in Each Topic (Web of Science)

This table lists the most cited articles in each topic. The 38 topics are identified by the latent Dirichlet allocation (LDA) model; the methodology of LDA is detailed in Section 4.1. The citation numbers were collected on Feb 25th, 2016. Each entry gives the author name or the names of the coauthors, followed by the year of publication in parentheses and the Web of Science citation count.

No. Topic 1 2 3 4 5

1 Option Pricing Heston (1993), 1684 Cox, Ross, Rubinstein (1979), 1398 Vasicek (1977), 1387 Merton (1976), 1259 Cox, Ross (1976), 842

2 Commercial Banking Sharpe (1990), 434 Barth, Caprio, Levine (2004), 340 Boot (2000), 334 Petersen, Rajan (2002), 317 Berger, Miller, Petersen, Rajan, Stein (2005), 268

3 CEO, Board, Director Yermack (1996), 975 Weisbach (1988), 850 Core, Holthausen, Larcker (1999), 649 Amit, Villalonga (2006), 601 Agrawal, Knoeber (1996), 404

4 Market Microstructure Lee, Ready (1991), 704 Copeland, Galai (1983), 392 Hamao, Masulis, Ng (1990), 366 Glosten, Harris (1988), 355 Huang, Stoll (1996), 275

5 Central Bank, Monetary Policy Eun, Shim (1989), 236 Meese, Rogoff (1988), 217 Sercu, Uppal, van Hulle (1995), 153 Blanchard, Galí (2007), 149 Thorbecke (1997), 136

6 Mergers and Acquisitions Jensen, Ruback (1983), 1035 Morck, Shleifer, Vishny (1990), 497 Bradley, Desai, Kim (1988), 385 Moeller, Schlingemann, Stulz (2004), 335 Shleifer, Vishny (2003), 329

7 Return Distribution and Value-at-Risk (VaR) Rockafellar, Uryasev (2002), 729 Cont (2001), 506 Longin, Solnik (2001), 481 Rubinstein (1994), 407 Jackwerth, Rubinstein (1996), 245

8 News, Analyst Report, Earnings Announcement Barberis, Shleifer, Vishny (1998), 727 Fama, French (1995), 568 Ikenberry, Lakonishok, Vermaelen (1995), 362 Teoh, Welch, Wong (1998), 361 Womack (1996), 332

9 Hedge Fund, Mutual Fund Carhart (1997), 1910 Sirri, Tufano (1998), 471 Daniel, Grinblatt, Titman, Wermers (1997), 440 Wermers (1999), 285 Wermers (2000), 278

10 Shareholder Right, Ownership Structure Shleifer, Vishny (1997), 2156 La Porta, Lopez-de-Silanes, Shleifer (1999), 2027 La Porta, Lopez-de-Silanes, Shleifer, Vishny (1997), 1927 Claessens, Djankov, Lang (2000), 1004 La Porta, Lopez-de-Silanes, Shleifer, Vishny (2000), 900

11 International Capital Markets Bekaert, Harvey (1995), 465 Coval, Moskowitz (1999), 384 Bekaert, Harvey (2000), 354 Claessens, Demirgüç-Kunt, Huizinga (2001), 267 Harvey (1995), 252

12 IPO Loughran, Ritter (1995), 671 Ritter (1991), 614 Carter, Manaster (1990), 590 Rock (1986), 544 Megginson, Weiss (1991), 521

13 Capital Structure, Bankruptcy, Leverage Smith, Warner (1979), 747 Rajan (1992), 735 Titman, Wessels (1988), 702 Leland (1994), 467 Deangelo, Masulis (1980), 411

14 Macro Finance Schwert (1989), 680 Estrella, Hardouvelis (1991), 339 Constantinides, Ferson (1991), 182 Blanchard, Galí (2007), 149 McCallum, Nelson (1999), 143

15 Volatility Glosten, Jagannathan, Runkle (1993), 1407 Engle, Ng (1993), 798 Andersen (2001), 484 Pan (2002), 394 Campbell, Hentschel (1992), 392

16 Default and CDS Jarrow, Lando, Turnbull (1997), 292 Longstaff, Mithal, Neis (2005), 281 Blanco, Brennan, Marsh (2005), 185 Bharath, Shumway (2008), 181 Crouhy, Galai, Mark (2000), 174

17 Commodities, Futures Black (1976), 712 Schwartz (1997), 464 Gibson, Schwartz (1990), 259 Fama (1984), 221 Stoll, Whaley (1990), 207

18 Trader Behavior Admati, Pfleiderer (1988), 704 Brunnermeier, Pedersen (2009), 495 French, Roll (1986), 481 Easley, O'Hara (1987), 475 de Long, Shleifer, Summers, Waldmann (1990), 405

19 Bond Term Structure Vasicek (1977), 1387 Fama, French (1989), 779 Chan, Karolyi, Longstaff, Sanders (1992), 489 Longstaff, Schwartz (1995), 470 Leland (1994), 467

20 Determinants of Stock Return Jegadeesh, Titman (1993), 1400 Fama, French (1996), 1065 Debondt, Thaler (1985), 1032 Amihud (2002), 875 French, Schwert, Stambaugh (1987), 823

21 Asset and Portfolio Allocation Demiguel, Garlappi, Uppal (2009), 252 Jagannathan, Ma (2003), 205 Best, Grauer (1991), 170 Chopra, Ziemba (1993), 167 Kim, Omberg (1996), 160

22 Asset Pricing Model Breeden (1979), 713 Stulz (1981), 209 Diamond, Verrecchia (1981), 196 Breeden, Gibbons, Litzenberger (1989), 171 Sundaresan (1989), 147

23 Financial Regulation Barth, Caprio, Levine (2004), 340 Karpoff, Lee, Martin (2008), 137 Marcus (1984), 127 Buser, Chen, Kane (1981), 116 Dahl, Shrieves (1992), 106

24 Statistical Estimation Methodology Petersen (2009), 1413 Barber, Lyon (1997), 526 Dimson (1979), 492 Stambaugh (1999), 350 Hodrick (1992), 306

25 International Asset Pricing and Foreign Exchange Hamao, Masulis, Ng (1990), 366 Dittmar, Neely, Weller (1997), 189 Peel, Taylor (2000), 185 Lins, Servaes (1999), 137 Cheung, Chinn (2001), 131

26 Venture Capital, Entrepreneurship Sahlman (1990), 586 Hellmann, Puri (2002), 354 Hellmann, Puri (2000), 249 Hsu (2004), 230 Gompers (1995), 198

27 Industry Competition and Market Efficiency Claessens, Laeven (2004), 218 Klapper, Laeven, Rajan (2006), 211 Berger, Deyoung (1997), 204 Bonin, Hasan, Wachtel (2005), 188 Gold, Sherman (1985), 183

28 Tax Shefrin, Statman (1985), 478 Lakonishok, Shleifer, Vishny (1992), 317 Claessens, Demirgüç-Kunt, Huizinga (2001), 267 Grinblatt, Keloharju (2001), 209 Miller, Scholes (1978), 205

29 Financial System, Banking Crisis Rajan, Zingales (2003), 601 Beck, Levine, Loayza (2000), 530 Allen, Qian, Qian (2005), 442 Hoshi, Kashyap, Scharfstein (1990), 297 Faccio, Masulis, McConnell (2006), 243

30 Multifactor Model Fama, French (1993), 3481 Fama, French (1992), 2381 Jagannathan, Wang (1996), 444 Harvey, Siddique (2000), 373 Daniel, Titman (1997), 353

31 Dividend Policy La Porta, Lopez-de-Silanes, Shleifer, Vishny (2000), 431 Fama, French (2001), 425 Fama, French (2002), 393 Brav, Graham, Harvey, Michaely (2005), 272 Grullon, Michaely (2002), 232

32 Information Asymmetry, Disclosure, Insider Trading Diamond, Verrecchia (1991), 425 Easley, O'Hara (2004), 395 Seyhun (1986), 247 Froot, Scharfstein, Stein (1992), 220 Blume, Easley, O'Hara (1994), 211

33 Optimal Choice Model Grossman, Hart (1988), 304 Admati, Pfleiderer (1994), 179 Harris, Raviv (1988), 170 Jorion (1986), 168 Kroll, Levy, Markowitz (1984), 143

34 Corporate Operational Structure and Value Creation Morck, Shleifer, Vishny (1988), 1452 Claessens, Djankov, Lang (2000), 1004 Almeida, Campello, Weisbach (2004), 319 Campa, Kedia (2002), 305 Coles, Daniel, Naveen (2008), 302

35 Systematic Risk and Risk Premium Harvey, Siddique (2000), 373 Acerbi, Tasche (2002), 293 Harvey (1991), 265 Laeven, Levine (2009), 263 Ferson, Harvey (1993), 240

36 Behavioral Finance Odean (1998), 586 Shefrin, Statman (1985), 478 Barber, Odean (2008), 303 Grinblatt (2000), 294 Lee, Shleifer, Thaler (1991), 182

37 Corporate Cash Holding Opler (1999), 394 Harford (1999), 232 Bates, Kahle, Stulz (2009), 190 Harford, Mansi, Maxwell (2008), 163 Dittmar, Mahrt-Smith, Servaes (2003), 145

38 Social Network and Cultural Effect Hong, Kubik, Stein (2004), 216 Boss, Elsinger, Summer, Thurner (2004), 116 Hong, Kacperczyk (2009), 102 Blinder, Morgan (2005), 81 Brown, Ivković, Smith, Weisbenner (2008), 74


Table 4: Most reported JEL Codes in Each Topic

This table presents the five most reported JEL codes in the articles belonging to each topic.

The 38 topics are identified by latent Dirichlet allocation (LDA) model. The methodology of

LDA is detailed in Section 4.1. For each topic, we first find all articles that have at least 10%

distribution on it, put together the JEL codes reported on those articles, and count the number

of each JEL code. Then we present the five most reported JEL codes for each topic. A detailed

explanation of each JEL code is listed in Table A.4.

Topic Most Reported JEL Codes: 1 2 3 4 5

Option Pricing G13 G12 G11 G14 C63

Commercial Banking G21 G28 G32 G34 G24

CEO, Board, Director G34 G32 G30 J33 G38

Market Microstructure G14 G15 G12 G10 G18

Central Bank, Monetary Policy F31 F41 E52 E58 F32

Mergers and Acquisitions G34 G32 G14 G21 G30

Return Distribution and Value-at-Risk (VaR) G12 G11 G13 C14 G21

News, Analyst Report, Earnings Announcement G14 G24 G12 M41 G11

Hedge Fund, Mutual Fund G11 G23 G12 G14 G20

Shareholder Right, Ownership Structure G32 G34 G38 G30 G21

International Capital Markets G15 F36 G11 F21 G21

IPO G24 G32 G14 G30 G34

Capital Structure, Bankruptcy, Leverage G32 G33 G34 G13 G31

Macro Finance F41 E31 G12 E52 G11

Volatility G12 C32 C22 G13 G10

Default and CDS G21 G12 G13 G33 G28

Commodities, Futures G13 G15 G11 G14 G12

Trader Behavior G14 G12 G15 G10 D82

Bond Term Structure G12 E43 G13 G32 G11

Determinants of Stock Return G12 G14 G11 G10 G15

Asset and Portfolio Allocation G11 G12 G23 G15 D81

Asset Pricing Model G12 G11 G14 G13 G10

Financial Regulation G21 G28 G22 G32 G11

Statistical Estimation Methodology G12 C22 G14 C53 G11

International Asset Pricing and Foreign Exchange F31 G15 F36 G12 G14

Venture Capital, Entrepreneurship G32 G31 G24 G34 G30

Industry Competition and Market Efficiency G21 G28 G32 G34 D24

Tax G14 G12 G32 G11 G34

Financial System, Banking Crisis G21 G01 G28 G15 F3

Multifactor Model G12 G11 G14 G15 G10

Dividend Policy G35 G32 G34 G12 G14

Information Asymmetry, Disclosure, Insider Trading G14 G32 D82 G21 G24

Optimal Choice Model G11 C61 D81 G32 G12

Corporate Operational Structure and Value Creation G32 G34 G30 G31 G38

Systematic Risk and Risk Premium G12 G11 G21 G32 G13

Behavioral Finance G11 G14 G12 G15 G10

Corporate Cash Holding G32 G31 G34 G21 D12

Social Network and Cultural Effect G11 G32 G14 G12 G34
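The counting procedure in the Table 4 caption (collect the JEL codes of all articles with at least 10% distribution on a topic, then keep the five most frequent) can be sketched as follows. The function name `top_jel_codes`, the record layout and the toy articles are assumptions made for illustration; they are not taken from the paper's data.

```python
from collections import Counter


def top_jel_codes(articles, topic, threshold=0.10, k=5):
    """Return the k most reported JEL codes among articles with at least
    `threshold` of their topic distribution on `topic`."""
    counts = Counter()
    for article in articles:
        if article["topics"].get(topic, 0.0) >= threshold:
            counts.update(article["jel"])
    return [code for code, _ in counts.most_common(k)]


# Hypothetical records: each article has a topic distribution and its JEL codes.
toy_articles = [
    {"topics": {"Option Pricing": 0.35}, "jel": ["G13", "G12"]},
    {"topics": {"Option Pricing": 0.12, "Volatility": 0.20}, "jel": ["G13", "C63"]},
    {"topics": {"Option Pricing": 0.04}, "jel": ["G21"]},  # below the cutoff
    {"topics": {"Option Pricing": 0.10}, "jel": ["G13", "G12"]},
]

print(top_jel_codes(toy_articles, "Option Pricing"))
```

The sketch uses `>=` because this caption says "at least 10%"; the other captions say "over 10%", so the exact boundary treatment is an assumption.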


Table 5: The Evolution of Interest within Topics

This table reports the evolution of interest within topics by listing the high-frequency words in different years from the Dynamic Topic Model (DTM) analysis. The methodology of DTM is detailed in Section 4.2. We report the results every 5 years. When implementing DTM, we use 50 topics and the same hyper-parameters as in LDA so that the results are comparable with our LDA results. In each column, words are ranked by frequency, with more frequent words in higher positions. We highlight the words that we explain in the text. In Panel A, the topic of “CEO, Board, Director”, the use of “manager/management” and “control” declined after 2000, while research on “CEO” and “board” rose. In Panel B, the topic of “Determinants of Stock Return”, the January effect was a top theme before 1995. Since 2000, the January effect has not been on the list of the most frequent words. Instead, “momentum” and “cross-section” have ranked higher over the years. In Panel C, the topic of “Commercial Banking”, we observe rising research interest in lending and networks, accompanied by a decline in deposits.

Panel A: “CEO, Board, Director”

1976 1980 1985 1990 1995 2000 2005 2010 2015

corpor corpor manag manag manag control compani sharehold ceo

manag manag corpor sharehold sharehold ownership control ceo sharehold

control control sharehold corpor control manag sharehold corpor board

sharehold sharehold control control ownership compani corpor compani compens

compani compani compani compani corpor sharehold board board incent

ownership ownership ownership ownership compani corpor incent incent corpor

compens compens compens compens compens incent compens compens compani

incent incent incent incent incent compens ownership control director

plan vote vote manageri manageri board ceo govern execut

vote plan manageri plan outsid manageri manag director famili


Panel B: “Determinants of Stock Return”

1976 1980 1985 1990 1995 2000 2005 2010 2015

return return return return return return return return return

stock stock stock stock stock stock stock stock stock

month month month revers revers revers revers momentum momentum

season januari januari month month past momentum cross-sect cross-sect

januari season season season past low past revers revers

revers revers revers januari low momentum low low low

inconsist inconsist past past cross-sect month cross-sect past past

past past averag cross-sect season cross-sect month month predict

averag averag inconsist low januari explain explain explain month

anomali anomali cross-sect averag explain averag book-to-market averag explain


Panel C: “Commercial Banking”

1976 1980 1985 1990 1995 2000 2005 2010 2015

bank bank bank bank bank bank bank bank bank

deposit deposit deposit deposit system system system system system

system system system system regul competit competit competit regul

requir requir requir requir deposit regul regul regul competit

competit competit competit regul requir requir requir requir lend

regul regul regul competit competit deposit deposit lend requir

oper oper oper oper insolv oper lend deposit network

balanc balanc failur failur failur failur oper oper deposit

failur failur branch insolv oper lend failur network oper

branch branch balanc branch entri entri network failur interbank


Fig. 1: Log Likelihood versus Number of Topics

This figure reports the log-likelihood of the latent Dirichlet allocation (LDA) model for different numbers of topics. A higher likelihood indicates that the LDA model fits the corpus better. The maximum likelihood occurs at around 40 topics, including the topics of general sentences. In implementing this approach, we find that some topics represent general sentences and do not indicate a specific research interest. For example, a topic with keywords “relat”, “posit”, “neg”, “associ” and “evid” may simply represent an often-used general sentence such as “we provide evidence on a positive/negative relation/association”. Therefore, we choose 50 topics when implementing LDA and exclude 12 general sentence topics from them.


Fig. 2: Rise and Fall of Each Topic Over Years

The figures present the historical evolution of topics’ popularity. The 38 topics are identified

by latent Dirichlet allocation (LDA) model. The methodology of LDA is detailed in Section

4.1. The horizontal axis represents the year of publication. The vertical axis represents the

average percentage distribution on a given topic across abstracts in a given year, and its value

can be interpreted as the topic’s popularity of research interest. It is computed as $p_{it} = \sum_{k=1}^{N_t} p_{itk} / N_t$, where $p_{itk}$ is the percentage distribution on topic $i$ of abstract $k$ published in year $t$, and $N_t$ is the total number of articles published in year $t$. For example, the average percentage

distribution on “CEO, Board, Director” across all abstracts rose from about 1.5% in 1980 to

about 2.5% in 2015. We observe that the research interest in “Financial System, Banking Crisis”

often spiked around or after the financial crises, such as the savings and loan crisis in the late

1980s and early 1990s. It grew even faster after the 2008 financial crisis. Research interest in “CEO, Board, Director” has grown steadily over the past 40 years. Other topics that

attracted more attention include “Behavioral Finance”, “Central Bank, Monetary Policy”,

“Commercial Banking”, “Corporate Cash Holding”, “Hedge Fund, Mutual Fund”,

“International Capital Markets”, “Social Network and Cultural Effect”, “Venture Capital,

Entrepreneurship”, and “Volatility”. The research interest in topics like “Bond Term Structure”

and “Optimal Choice Model” has been shrinking. Note that the values for most topics fluctuate widely in the 1970s and 1980s. Fig. A.1 plots the number of active

journals and articles every year. In the 1970s and 1980s, there were fewer journals and articles,

resulting in more volatile values. Fig. A.2 plots the yearly publication numbers in The Journal

of Finance, Journal of Financial Economics, and Review of Financial Studies.
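The yearly popularity measure defined in the Fig. 2 caption (the mean share of topic i across the N_t abstracts published in year t) can be sketched as below. The function name `yearly_topic_popularity` and the toy abstracts are hypothetical; an abstract that does not list a topic is assumed to have a zero share on it.

```python
from collections import defaultdict


def yearly_topic_popularity(abstracts, topic):
    """Average share of `topic` across all abstracts published in each year.

    `abstracts` is an iterable of (year, distribution) pairs, where
    `distribution` maps topic names to shares between 0 and 1.
    """
    totals = defaultdict(float)  # sum of shares on the topic, per year
    counts = defaultdict(int)    # N_t, number of abstracts per year
    for year, distribution in abstracts:
        totals[year] += distribution.get(topic, 0.0)
        counts[year] += 1
    return {year: totals[year] / counts[year] for year in sorted(counts)}


# Hypothetical abstracts, for illustration only.
toy_abstracts = [
    (1980, {"Commercial Banking": 0.20, "IPO": 0.10}),
    (1980, {"IPO": 0.50}),
    (2015, {"Commercial Banking": 0.30}),
]

print(yearly_topic_popularity(toy_abstracts, "Commercial Banking"))
```

Dividing by all abstracts in the year, not only those that mention the topic, matches the caption's definition of N_t as the total number of articles published in year t.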


Fig. 3: Fastest Growing and Shrinking Topics

Fig. 3.1 reports the fastest growing and shrinking topics from 2006 to 2015 in 17 journals.

Three topics on the left side grow fastest and three topics on the right side shrink fastest. The

topics are identified by latent Dirichlet allocation (LDA) model. The methodology of LDA is

detailed in Section 4.1. The horizontal axis represents the year of publication. The vertical axis

represents the average percentage distribution for a given topic across abstracts in a given year,

and its value can be interpreted as the topic’s popularity of research interest. It is computed as

$p_{it} = \sum_{k=1}^{N_t} p_{itk} / N_t$, where $p_{itk}$ is the percentage distribution on topic $i$ of abstract $k$ published in year $t$, and $N_t$ is the total number of articles published in year $t$.


Fig. 3.2 reports the fastest growing and shrinking topics from 2006 to 2015 in The Journal of

Finance, Journal of Financial Economics, and Review of Financial Studies. Three topics on

the left side grow fastest and three topics on the right side shrink fastest. The topics are

identified by latent Dirichlet allocation (LDA) model. The methodology of LDA is detailed in

Section 4.1. The horizontal axis represents the year of publication. The vertical axis represents

the average percentage distribution for a given topic across abstracts in a given year, and its

value can be interpreted as the topic’s popularity of research interest. It is computed as $p_{it} = \sum_{k=1}^{N_t} p_{itk} / N_t$, where $p_{itk}$ is the percentage distribution on topic $i$ of abstract $k$ published in year $t$, and $N_t$ is the total number of articles published in year $t$.


Fig. 4: Fastest Growing and Shrinking Topics on Working Papers Uploaded to SSRN’s

Financial Economics Network

This figure reports the fastest growing and shrinking topics from 2006 to 2015 in 130,547

working paper abstracts we collected from SSRN’s Financial Economics Network. Three

topics on the left side grow fastest and three topics on the right side shrink fastest. We apply the latent Dirichlet allocation (LDA) model trained on published papers to the working papers, so the topics are those generated by that LDA model. The methodology of LDA is detailed in

Section 4.1. The horizontal axis represents the year of publication. The vertical axis represents

the average percentage distribution for a given topic across abstracts in a given year, and its

value can be interpreted as the topic’s popularity of research interest. It is computed as $p_{it} = \sum_{k=1}^{N_t} p_{itk} / N_t$, where $p_{itk}$ is the percentage distribution on topic $i$ of abstract $k$ posted in year $t$, and $N_t$ is the total number of articles published in year $t$.


Fig. 5: Trend of Yearly Research Interest Concentration

This figure reports the trend of published articles’ research interest concentration. To measure

how broad an article is, we calculate the Herfindahl Index of each abstract: $H = \sum_{i=1}^{38} s_i^2$, where $s_i$ represents the percentage distribution of the abstract on topic $i$. The horizontal axis

represents the year of publication. The vertical axis represents the average Herfindahl Index of

each article on the distribution over the 38 topics in a certain year. The 38 topics are identified

by latent Dirichlet allocation (LDA) model. The methodology of LDA is detailed in Section

4.1. Higher Herfindahl Index value means higher research interest concentration. The solid line

represents the average Herfindahl Index of abstracts in 17 journals, the dashed line represents

the average Herfindahl Index of abstracts in The Journal of Finance, Journal of Financial

Economics, and Review of Financial Studies. The average Herfindahl Index dropped sharply from 1976 to 1982, perhaps because pioneering works in many topics started to emerge during this early period and cross-topic research was therefore more common. Both lines rose between 1982 and 2000, indicating that research on average became narrower. The average Herfindahl Index of abstracts in the 17 journals continued to increase after 2000, while that in the three top journals remained roughly constant and even declined after 2010, indicating that the three top journals still publish broader, cross-topic articles.

[Figure: Research Interest Concentration. Vertical axis: Herfindahl Index (ticks from 0.05 to 0.07); horizontal axis: Year. Lines: All Journals; The Journal of Finance, Journal of Financial Economics, and Review of Financial Studies.]
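The Herfindahl Index in the Fig. 5 caption is the sum of squared topic shares of an abstract, and the plotted value is its average over the abstracts of a year. A minimal sketch follows; the function names and the two toy abstracts are hypothetical and chosen only to show the two extremes of the index.

```python
def herfindahl(shares):
    """Sum of squared topic shares for one abstract."""
    return sum(s * s for s in shares)


def mean_herfindahl(abstracts):
    """Average Herfindahl Index over a non-empty list of abstracts."""
    return sum(herfindahl(shares) for shares in abstracts) / len(abstracts)


n_topics = 38
spread_evenly = [1.0 / n_topics] * n_topics      # broadest possible abstract
single_topic = [1.0] + [0.0] * (n_topics - 1)    # narrowest possible abstract

print(herfindahl(spread_evenly))   # equals 1/38, the lower bound when shares sum to 1
print(herfindahl(single_topic))    # equals 1, the upper bound
print(mean_herfindahl([spread_evenly, single_topic]))
```

A higher value therefore means that an abstract concentrates on fewer topics, which is how the caption reads the upward trend.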


Fig. 6: Citation Network Between Topics

This figure demonstrates the citation network between finance topics, constructed from cross-

reference data of each article. There are 38 nodes and each of them represents a topic. The 38

topics are identified by latent Dirichlet allocation (LDA) model. The methodology of LDA is

detailed in Section 4.1. A node’s size is proportional to the number of articles that focus on the

topic that the node represents. As defined in Section 5.2, an abstract focuses on a topic if it has

over 10% distribution on it. Topics with more articles have larger nodes. The nodes are

connected through edges. An edge represents the cross-reference between the two topics. An

edge is thicker if there is more cross-reference. For example, if topic A has $N$ articles, in total the $N$ articles cite articles in topic B $\sum_{i=1}^{N} R_{iB}$ times, where $R_{iB}$ is the number of times that article $i$ cites articles in topic B. Similarly, if topic B has $M$ articles, in total the $M$ articles cite articles in topic A $\sum_{j=1}^{M} R_{jA}$ times, where $R_{jA}$ is the number of times that article $j$ cites articles in topic A. Then the total number of cross-references is $\sum_{i=1}^{N} R_{iB} + \sum_{j=1}^{M} R_{jA}$ and is

proportional to the thickness of the edge between A and B. The distance between two nodes

approximately represents how closely the two topics are related in terms of cross-reference. We

conduct modularity analysis to categorize the topics into clusters, and 38 topics are

compartmentalized into 5 clusters, or “territories”: asset pricing, corporate finance, market

microstructure, banking and macro finance, and “mixed areas”. Each node’s color reflects the

territory it belongs to. The left side of this figure is clustered with corporate finance topics,

including large topics such as “CEO, Board, Director”, “Mergers and Acquisitions”,

“Shareholder Right, Ownership Structure”, and “IPO”. The bottom side is clustered with

banking and macro finance topics, including large topics such as “Commercial Banking”,

“Central Bank, Monetary Policy”, “Financial System, Banking Crisis”, and “Financial

Regulation”. The right side is clustered with asset pricing topics, including large topics such as

“Option Pricing”, “Volatility”, “Return Distribution and Value-at-Risk (VaR)”, and “Bond Term Structure”. The center is clustered with market microstructure topics, including large topics

such as “Market Microstructure”, “Trader Behavior”, and “Information Asymmetry,

Disclosure, Insider Trading”. The upper side is clustered with “mixed areas”, including large

topics such as “Hedge Fund, Mutual Fund”, “News, Analyst Report, Earnings Announcement”,

“Behavioral Finance”, and “Statistical Estimation Methodology”.
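The edge weight described in the Fig. 6 caption (citations from articles in topic A to articles in topic B, plus citations in the opposite direction) can be sketched as below. The function name, the input layout and the toy data are assumptions for illustration. In particular, counting an article under every topic it focuses on is one reading of the caption, not a detail the caption confirms.

```python
from collections import Counter
from itertools import product


def cross_reference_weights(citations, focus):
    """Count cross-references between every pair of distinct topics.

    `citations` is an iterable of (citing_id, cited_id) pairs.
    `focus` maps an article id to the set of topics the article focuses on.
    Keys of the result are unordered topic pairs, so both citation
    directions add to the same edge.
    """
    weights = Counter()
    for citing, cited in citations:
        for a, b in product(focus.get(citing, ()), focus.get(cited, ())):
            if a != b:  # citations within one topic do not form an edge
                weights[frozenset((a, b))] += 1
    return weights


# Hypothetical articles and citations.
toy_focus = {
    "p1": {"Option Pricing"},
    "p2": {"Volatility"},
    "p3": {"Volatility"},
}
toy_citations = [("p1", "p2"), ("p3", "p1"), ("p2", "p3")]

weights = cross_reference_weights(toy_citations, toy_focus)
print(weights[frozenset(("Option Pricing", "Volatility"))])
```

In the toy data, one citation runs from "Option Pricing" to "Volatility" and one runs back, so the edge weight is their sum; the citation between the two "Volatility" articles is ignored.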


Fig. 7: Bibliometric Regularity: Number of Researchers Covering n Topics

This figure presents a bibliometric regularity: the number of researchers covering n topics is

approximately $1/2^{n}$ of those covering only one topic. A researcher covers a topic if she

publishes at least one article with over 10% distribution on that topic. The 38 topics are

identified by latent Dirichlet allocation (LDA) model. The methodology of LDA is detailed in

Section 4.1. The horizontal axis represents the number of topics, and the vertical axis represents

the number of researchers. The solid line is generated from our data, which is downward

sloping because fewer researchers are able to cover more topics. The value of each point on

the line indicates how many researchers cover exactly how many topics. For example, the first

point on the solid line is (1, 6830), meaning that 6830 researchers publish articles that focus on

just one topic. The second point is (2, 3507), meaning that 3507 researchers publish articles

that focus on just two topics. We use the dashed line 𝑦 = 13215/2𝑛 to fit the solid line, where

y is the number of researchers covering n topics. When 𝑛 = 1, 𝑦 = 6625.5; when 𝑛 = 2, 𝑦 =

3312.75. The R-squared value of the fitting is 0.998.
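The coverage rule in this caption can be illustrated with a minimal sketch. It assumes each article is available as a pair of an author list and a mapping from topic id to topic weight; the function names and the toy data are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

THRESHOLD = 0.10  # the caption's rule: "over 10%" weight on a topic


def topics_covered(articles, threshold=THRESHOLD):
    """Map each author to the set of topics she covers.

    `articles` is a list of (authors, topic_weights) pairs, where
    `topic_weights` maps a topic id to the article's weight on that topic.
    An author covers a topic if at least one of her articles has a
    weight strictly above `threshold` on it.
    """
    covered = defaultdict(set)
    for authors, topic_weights in articles:
        hits = {t for t, w in topic_weights.items() if w > threshold}
        for author in authors:
            covered[author] |= hits
    return dict(covered)


def researchers_by_topic_count(covered):
    """Count how many researchers cover exactly n topics, for n >= 1."""
    counts = defaultdict(int)
    for topics in covered.values():
        if topics:  # skip authors with no topic above the threshold
            counts[len(topics)] += 1
    return dict(counts)


# Toy data: hypothetical authors and weights, for illustration only.
articles = [
    (["A", "B"], {1: 0.70, 2: 0.25, 3: 0.05}),
    (["A"], {3: 0.60, 1: 0.40}),
    (["C"], {2: 0.95, 4: 0.05}),
]
covered = topics_covered(articles)
histogram = researchers_by_topic_count(covered)
```

In the toy data, author A covers three topics, B covers two, and C covers one, so `histogram` holds one researcher at each of n = 1, 2, 3. The solid line in the figure is the same kind of histogram computed on the full sample.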

[Figure: Number of Researchers Covering n Topics. Horizontal axis: Number of Topics (n), 1 to 18. Vertical axis: Number of Researchers (y), 0 to 8000. Legend: Empirical, 13215/2ⁿ.]
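The arithmetic behind the dashed line can be checked with a short sketch. Only the two empirical points quoted in the caption are used, so the reported R-squared is not reproduced here. The worked values in the caption (6625.5 and 3312.75) equal 13251/2 and 13251/4, whereas the stated constant 13215 gives 6607.5 and 3303.75, so the sketch takes the constant as a parameter instead of assuming either figure. The function names are hypothetical.

```python
def halving_curve(c, n):
    """Value of the fitted curve y = c / 2**n at topic count n."""
    return c / 2 ** n


def r_squared(observed, fitted):
    """Coefficient of determination of `fitted` against `observed`."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# The two empirical points quoted in the caption: (n, researchers).
empirical = {1: 6830, 2: 3507}

# Compare the constant stated in the caption with the one implied
# by its worked values.
for c in (13215, 13251):
    fitted = [halving_curve(c, n) for n in empirical]
    print(c, fitted)
```

With the full empirical series for n = 1 to 18, `r_squared(observed, fitted)` would give the goodness of fit for whichever constant is chosen.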


Table A.1: Words that Appear in a Phrase at High Frequency

This table reports words that frequently appear together in a phrase. The words shown are stemmed. We combine the last two three-word sets, “initial public offering” and “chief executive officer”, into “ipo” and “ceo”, respectively, in our textual data cleaning.

Words Commonly Appearing Together
interest rate | limit order
unit state | fama french
exchang rate | foreign exchang
cross section | impli volatil
real estat | advers select
cash flow | cross border
monetari polici | capit structur
mutual fund | price discoveri
bid ask | brownian motion
mont carlo | random walk
short term | deposit insur
black schole | emerg market
abnorm return | institut investor
time seri | short run
transact cost | risk neutral
risk avers | financi distress
time vari | yield curv
term structur | agenc cost
inform asymmetri | feder reserv
corpor govern | asymmetr inform
financi crisi | cross list
hedg fund | ventur capitalist
busi cycl | standard deviat
moral hazard | tender offer
central bank | initi public offer
hong kong | chief execut offic
balanc sheet |
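One way to apply such a phrase list during text cleaning is sketched below. The mapping of the two three-word sets to “ipo” and “ceo” follows the table description; joining two-word phrases with an underscore is an assumption made for illustration, as are the function name and the small phrase dictionary.

```python
# Stemmed phrases mapped to a single token. The "ipo" and "ceo" targets
# follow the table description; the underscore-joined forms are an
# assumption, not the paper's stated encoding.
PHRASES = {
    ("interest", "rate"): "interest_rate",
    ("cash", "flow"): "cash_flow",
    ("initi", "public", "offer"): "ipo",
    ("chief", "execut", "offic"): "ceo",
}


def merge_phrases(tokens, phrases=PHRASES):
    """Replace each known phrase in a stemmed token list with one token.

    Longer phrases are tried first, so a three-word set wins over any
    two-word phrase that starts at the same position.
    """
    max_len = max(len(p) for p in phrases)
    out = []
    i = 0
    while i < len(tokens):
        for size in range(max_len, 1, -1):
            key = tuple(tokens[i:i + size])
            if len(key) == size and key in phrases:
                out.append(phrases[key])
                i += size
                break
        else:
            # No phrase starts here: keep the single token.
            out.append(tokens[i])
            i += 1
    return out


stemmed = ["the", "interest", "rate", "after", "initi", "public", "offer"]
merged = merge_phrases(stemmed)
```

Treating each phrase as a single token keeps a topic model from splitting, for example, “interest” and “rate” across unrelated topics.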


Table A.2: High-frequency Keywords in General Sentence Topics

This table presents 12 topics that represent general sentences. For example, a topic with the keywords “relat”, “posit”, “neg”, “associ” and “evid” may simply represent a commonly used general sentence such as “we provide evidence of a positive/negative relation/association”. The topics are identified by a latent Dirichlet allocation (LDA) model. The methodology of LDA is detailed in Section 4.1.

No. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
1 | valu | discount | econom | show | present | base | fundament | journal | number | multipl
2 | test | adjust | run | hypothesi | data | power | statist | reject | mean | deviat
3 | relat | posit | neg | level | associ | signific | evid | examin | consist | document
4 | approach | framework | propos | appli | properti | discuss | present | methodolog | analysi | practic
5 | differ | import | role | determin | type | across | play | structur | rel | characterist
6 | larg | small | averag | size | year | point | sampl | number | rel | period
7 | empir | evid | support | theori | predict | provid | consist | hypothesi | theoret | explan
8 | increas | chang | decreas | declin | follow | shift | becom | reduc | experi | rise
9 | research | studi | may | literatur | previou | recent | due | exist | suggest | argu
10 | time | period | data | set | analysi | studi | observ | show | continu | provid
11 | perform | measur | base | sampl | indic | compar | improv | better | differ | studi
12 | effect | impact | studi | signific | affect | show | examin | investig | lead | direct
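Keyword lists like those in this table are typically read off a fitted topic model by ranking words within each topic. The sketch below assumes the model exposes a topic-by-word weight matrix as nested lists; the function name, the vocabulary, and the weights are hypothetical and are not the paper's data.

```python
def top_keywords(topic_word_weights, vocabulary, k=10):
    """Return the k highest-weight words for each topic.

    `topic_word_weights[t][j]` is the weight of word `vocabulary[j]`
    in topic t. Ties keep vocabulary order because sorting is stable.
    """
    result = []
    for weights in topic_word_weights:
        ranked = sorted(
            range(len(vocabulary)),
            key=lambda j: weights[j],
            reverse=True,
        )
        result.append([vocabulary[j] for j in ranked[:k]])
    return result


# Toy example: two topics over a five-word stemmed vocabulary.
vocabulary = ["relat", "posit", "neg", "option", "price"]
weights = [
    [0.40, 0.30, 0.20, 0.05, 0.05],  # reads like a general-sentence topic
    [0.02, 0.03, 0.05, 0.50, 0.40],  # reads like a substantive topic
]
keywords = top_keywords(weights, vocabulary, k=3)
```

Whether a topic counts as a general-sentence topic is still a judgment made by reading its keywords; the sketch only produces the lists to inspect.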


Table A.3: Most Cited Articles in Each Topic (Google Scholar)

This table lists the most cited articles in each topic. The 38 topics are identified by a latent Dirichlet allocation (LDA) model. The methodology of LDA is detailed in Section 4.1. Each entry gives the author name or the names of the coauthors, followed by the year of publication in parentheses and then the number of Google Scholar citations.

No. | Topic | 1 | 2 | 3 | 4 | 5
1 | Option Pricing | Heston (1993), 6953 | Cox, Ross, Rubinstein (1979), 7371 | Vasicek (1977), 6669 | Merton (1976), 5694 | Cox, Ross (1976), 3547
2 | Commercial Banking | Sharpe (1990), 2369 | Barth, Caprio, Levine (2004), 2394 | Boot (2000), 2205 | Petersen, Rajan (2002), 1539 | Berger, Miller, Petersen, Rajan, Stein (2005), 1503
3 | CEO, Board, Director | Yermack (1996), 6323 | Weisbach (1988), 4846 | Core, Holthausen, Larcker (1999), 4007 | Amit, Villalonga (2006), 2814 | Agrawal, Knoeber (1996), 3345
4 | Market Microstructure | Lee, Ready (1991), 2801 | Copeland, Galai (1983), 2184 | Hamao, Masulis, Ng (1990), 2210 | Glosten, Harris (1988), 1730 | Huang, Stoll (1996), 1125
5 | Central Bank, Monetary Policy | Eun, Shim (1989), 1778 | Meese, Rogoff (1988), 939 | Sercu, Uppal, van Hulle (1995), 458 | Blanchard, Galí (2007), 703 | Thorbecke (1997), 831
6 | Mergers and Acquisitions | Jensen, Ruback (1983), 6021 | Morck, Shleifer, Vishny (1990), 2375 | Bradley, Desai, Kim (1988), 2038 | Moeller, Schlingemann, Stulz (2004), 1756 | Shleifer, Vishny (2003), 2024
7 | Return Distribution and Value-at-Risk (VaR) | Rockafellar, Uryasev (2002), 2589 | Cont (2001), 2004 | Longin, Solnik (2001), 2316 | Rubinstein (1994), 2038 | Jackwerth, Rubinstein (1996), 1167
8 | News, Analyst Report, Earnings Announcement | Barberis, Shleifer, Vishny (1998), 4650 | Fama, French (1995), 3669 | Ikenberry, Lakonishok, Vermaelen (1995), 2047 | Teoh, Welch, Wong (1998), 2412 | Womack (1996), 1768
9 | Hedge Fund, Mutual Fund | Carhart (1997), 11220 | Sirri, Tufano (1998), 2783 | Daniel, Grinblatt, Titman, Wermers (1997), 2262 | Wermers (1999), 1718 | Wermers (2000), 1610
10 | Shareholder Right, Ownership Structure | Shleifer, Vishny (1997), 16697 | La Porta, Lopez-de-Silanes, Shleifer (1999), 11945 | La Porta, Lopez-de-Silanes, Shleifer, Vishny (1997), 9199 | Claessens, Djankov, Lang (2000), 6107 | La Porta, Lopez-de-Silanes, Shleifer, Vishny (2000), 6535
11 | International Capital Markets | Bekaert, Harvey (1995), 2409 | Coval, Moskowitz (1999), 2054 | Bekaert, Harvey (2000), 1895 | Claessens, Demirgüç-Kunt, Huizinga (2001), 2061 | Harvey (1995), 1663
12 | IPO | Loughran, Ritter (1995), 3998 | Ritter (1991), 4494 | Carter, Manaster (1990), 2761 | Rock (1986), 3296 | Megginson, Weiss (1991), 2451
13 | Capital Structure, Bankruptcy, Leverage | Smith, Warner (1979), 3445 | Rajan (1992), 3737 | Titman, Wessels (1988), 6004 | Leland (1994), 2612 | DeAngelo, Masulis (1980), 3150
14 | Macro Finance | Schwert (1989), 3597 | Estrella, Hardouvelis (1991), 1457 | Constantinides, Ferson (1991), 644 | Blanchard, Galí (2007), 703 | McCallum, Nelson (1999), 733
15 | Volatility | Glosten, Jagannathan, Runkle (1993), 7183 | Engle, Ng (1993), 4030 | Andersen (2001), 1954 | Pan (2002), 1424 | Campbell, Hentschel (1992), 1772
16 | Default and CDS | Jarrow, Lando, Turnbull (1997), 1903 | Longstaff, Mithal, Neis (2005), 1667 | Blanco, Brennan, Marsh (2005), 1089 | Bharath, Shumway (2008), 908 | Crouhy, Galai, Mark (2000), 1327
17 | Commodities, Futures | Black (1976), 3201 | Schwartz (1997), 2001 | Gibson, Schwartz (1990), 1004 | Fama (1984), 838 | Stoll, Whaley (1990), 1030
18 | Trader Behavior | Admati, Pfleiderer (1988), 3528 | Brunnermeier, Pedersen (2009), 3155 | French, Roll (1986), 2173 | Easley, O'Hara (1987), 2328 | De Long, Shleifer, Summers, Waldmann (1990), 2675
19 | Bond Term Structure | Vasicek (1977), 6669 | Fama, French (1989), 3611 | Chan, Karolyi, Longstaff, Sanders (1992), 2111 | Longstaff, Schwartz (1995), 2514 | Leland (1994), 2612
20 | Determinants of Stock Return | Jegadeesh, Titman (1993), 8652 | Fama, French (1996), 6375 | De Bondt, Thaler (1985), 7304 | Amihud (2002), 5462 | French, Schwert, Stambaugh (1987), 3995
21 | Asset and Portfolio Allocation | DeMiguel, Garlappi, Uppal (2009), 1392 | Jagannathan, Ma (2003), 875 | Best, Grauer (1991), 712 | Chopra, Ziemba (1993), 1031 | Kim, Omberg (1996), 715
22 | Asset Pricing Model | Breeden (1979), 2882 | Stulz (1981), 863 | Diamond, Verrecchia (1981), 720 | Breeden, Gibbons, Litzenberger (1989), 780 | Sundaresan (1989), 560
23 | Financial Regulation | Barth, Caprio, Levine (2004), 2394 | Karpoff, Lee, Martin (2008), 610 | Marcus (1984), 607 | Buser, Chen, Kane (1981), 641 | Dahl, Shrieves (1992), 720
24 | Statistical Estimation Methodology | Petersen (2009), 6349 | Barber, Lyon (1997), 2908 | Dimson (1979), 2234 | Stambaugh (1999), 1283 | Hodrick (1992), 1180
25 | International Asset Pricing and Foreign Exchange | Hamao, Masulis, Ng (1990), 2210 | Dittmar, Neely, Weller (1997), 674 | Peel, Taylor (2000), 560 | Lins, Servaes (1999), 672 | Cheung, Chinn (2001), 514
26 | Venture Capital, Entrepreneurship | Sahlman (1990), 3117 | Hellmann, Puri (2002), 1892 | Hellmann, Puri (2000), 1321 | Hsu (2004), 988 | Gompers (1995), 2147
27 | Industry Competition and Market Efficiency | Claessens, Laeven (2004), 1193 | Klapper, Laeven, Rajan (2006), 930 | Berger, DeYoung (1997), 1353 | Bonin, Hasan, Wachtel (2005), 1140 | Gold, Sherman (1985), 1102
28 | Tax | Shefrin, Statman (1985), 2981 | Lakonishok, Shleifer, Vishny (1992), 2019 | Claessens, Demirgüç-Kunt, Huizinga (2001), 2061 | Grinblatt, Keloharju (2001), 1140 | Miller, Scholes (1978), 990
29 | Financial System, Banking Crisis | Rajan, Zingales (2003), 2809 | Beck, Levine, Loayza (2000), 3469 | Allen, Qian, Qian (2005), 2620 | Hoshi, Kashyap, Scharfstein (1990), 1410 | Faccio, Masulis, McConnell (2006), 1317
30 | Multifactor Model | Fama, French (1993), 19549 | Fama, French (1992), 16423 | Jagannathan, Wang (1996), 2369 | Harvey, Siddique (2000), 1902 | Daniel, Titman (1997), 2019
31 | Dividend Policy | La Porta, Lopez-de-Silanes, Shleifer, Vishny (2000), 2622 | Fama, French (2001), 2779 | Fama, French (2002), 3084 | Brav, Graham, Harvey, Michaely (2005), 1817 | Grullon, Michaely (2002), 1310
32 | Information Asymmetry, Disclosure, Insider Trading | Diamond, Verrecchia (1991), 2950 | Easley, O'Hara (2004), 2381 | Seyhun (1986), 1383 | Froot, Scharfstein, Stein (1992), 1122 | Blume, Easley, O'Hara (1994), 1218
33 | Optimal Choice Model | Grossman, Hart (1988), 1740 | Admati, Pfleiderer (1994), 953 | Harris, Raviv (1988), 815 | Jorion (1986), 775 | Kroll, Levy, Markowitz (1984), 554
34 | Corporate Operational Structure and Value Creation | Morck, Shleifer, Vishny (1988), 9026 | Claessens, Djankov, Lang (2000), 6107 | Almeida, Campello, Weisbach (2004), 2169 | Campa, Kedia (2002), 1484 | Coles, Daniel, Naveen (2008), 1937
35 | Systematic Risk and Risk Premium | Harvey, Siddique (2000), 1902 | Acerbi, Tasche (2002), 1406 | Harvey (1991), 1262 | Laeven, Levine (2009), 1487 | Ferson, Harvey (1993), 1135
36 | Behavioral Finance | Odean (1998), 3372 | Shefrin, Statman (1985), 2981 | Barber, Odean (2008), 2226 | Grinblatt (2000), 1469 | Lee, Shleifer, Thaler (1991), 1961
37 | Corporate Cash Holding | Opler (1999), 2694 | Harford (1999), 1635 | Bates, Kahle, Stulz (2009), 1503 | Harford, Mansi, Maxwell (2008), 1151 | Dittmar, Mahrt-Smith, Servaes (2003), 1130
38 | Social Network and Cultural Effect | Hong, Kubik, Stein (2004), 1165 | Boss, Elsinger, Summer, Thurner (2004), 555 | Hong, Kacperczyk (2009), 692 | Blinder, Morgan (2005), 259 | Brown, Ivković, Smith, Weisbenner (2008), 341


Table A.4: JEL Codes

This table explains the JEL codes used in Table 4.

C. Mathematical and Quantitative Methods
  C1 Econometric and Statistical Methods and Methodology: General
    C14 Semiparametric and Nonparametric Methods: General
  C2 Single Equation Models • Single Variables
    C22 Time-Series Models • Dynamic Quantile Regressions • Dynamic Treatment Effect Models • Diffusion Processes
  C3 Multiple or Simultaneous Equation Models • Multiple Variables
    C32 Time-Series Models • Dynamic Quantile Regressions • Dynamic Treatment Effect Models • Diffusion Processes • State Space Models
  C5 Econometric Modeling
    C53 Forecasting and Prediction Methods • Simulation Methods
  C6 Mathematical Methods • Programming Models • Mathematical and Simulation Modeling
    C61 Optimization Techniques • Programming Models • Dynamic Analysis
    C63 Computational Techniques • Simulation Modeling
D. Microeconomics
  D1 Household Behavior and Family Economics
    D12 Consumer Economics: Empirical Analysis
  D2 Production and Organizations
    D24 Production • Cost • Capital • Capital, Total Factor, and Multifactor Productivity • Capacity
  D8 Information, Knowledge, and Uncertainty
    D81 Criteria for Decision-Making under Risk and Uncertainty
    D82 Asymmetric and Private Information • Mechanism Design
E. Macroeconomics and Monetary Economics
  E3 Prices, Business Fluctuations, and Cycles
    E31 Price Level • Inflation • Deflation
  E4 Money and Interest Rates
    E43 Interest Rates: Determination, Term Structure, and Effects
  E5 Monetary Policy, Central Banking, and the Supply of Money and Credit
    E52 Monetary Policy
    E58 Central Banks and Their Policies
F. International Economics
  F2 International Factor Movements and International Business
    F21 International Investment • Long-Term Capital Movements
  F3 International Finance
    F31 Foreign Exchange
    F32 Current Account Adjustment • Short-Term Capital Movements
    F36 Financial Aspects of Economic Integration
  F4 Macroeconomic Aspects of International Trade and Finance
    F41 Open Economy Macroeconomics


Table A.4: JEL Codes (Continued)

G. Financial Economics
    G01 Financial Crises
  G1 General Financial Markets
    G10 General
    G11 Portfolio Choice • Investment Decisions
    G12 Asset Pricing • Trading Volume • Bond Interest Rates
    G13 Contingent Pricing • Futures Pricing
    G14 Information and Market Efficiency • Event Studies • Insider Trading
    G15 International Financial Markets
    G18 Government Policy and Regulation
  G2 Financial Institutions and Services
    G20 General
    G21 Banks • Depository Institutions • Micro Finance Institutions • Mortgages
    G22 Insurance • Insurance Companies • Actuarial Studies
    G23 Non-bank Financial Institutions • Financial Instruments • Institutional Investors
    G24 Investment Banking • Venture Capital • Brokerage • Ratings and Ratings Agencies
    G28 Government Policy and Regulation
  G3 Corporate Finance and Governance
    G30 General
    G31 Capital Budgeting • Fixed Investment and Inventory Studies • Capacity
    G32 Financing Policy • Financial Risk and Risk Management • Capital and Ownership Structure • Value of Firms • Goodwill
    G33 Bankruptcy • Liquidation
    G34 Mergers • Acquisitions • Restructuring • Corporate Governance
    G35 Payout Policy
    G38 Government Policy and Regulation
J. Labor and Demographic Economics
  J3 Wages, Compensation, and Labor Costs
    J33 Compensation Packages • Payment Methods
M. Business Administration and Business Economics • Marketing • Accounting • Personnel Economics
  M4 Accounting and Auditing
    M41 Accounting


Fig. A.1: Yearly Journal and Article Numbers

This figure reports the number of journals and the total number of all articles in our sample

every year. The blue bars represent the number of journals. The orange line represents the total

number of all articles. We exclude articles without abstracts.

[Figure: Number of Journals and Articles. Horizontal axis: Year. Vertical axes: No. Journals (0 to 18) and No. Articles (0 to 1600). Legend: No. Journals, No. Articles.]


Fig. A.2: Yearly Publication Numbers

This figure reports the yearly publication numbers of The Journal of Finance, Journal of

Financial Economics, and Review of Financial Studies in our sample from 1976 to 2014. We

exclude articles without abstracts.

[Figure: Yearly Publication Numbers. Horizontal axis: Year. Vertical axis: Number (0 to 180). Legend: The Journal of Finance, Journal of Financial Economics, Review of Financial Studies.]