Extracting social networks from fiction Imaginary and invisible friends: Investigating the social world of imaginary friends. Adam Ek Institutionen för lingvistik Examensarbete 15 hp Uppsatskurs (15HP) Vårterminen 2017 Handledare: Mats Wirén, Robert Östling English title: Extracting social networks from fiction


Extracting social networks from fiction
Imaginary and invisible friends: Investigating the social world of imaginary friends.

Adam Ek

Department of Linguistics (Institutionen för lingvistik)

Degree project, 15 credits (Examensarbete 15 hp)

Thesis course, 15 credits (Uppsatskurs, 15 hp)

Spring term 2017

Supervisors: Mats Wirén, Robert Östling

English title: Extracting social networks from fiction


Abstract
This thesis develops an approach to extract the social relations between characters in literary text to create a social network. The approach uses co-occurrences of named entities, keywords associated with the named entities, and the dependency relations that exist between the named entities to construct the network.

Literary texts contain a large number of pronouns that represent named entities. To resolve the antecedents of these pronouns, a pronoun resolution system is implemented based on a standard pronoun resolution algorithm. The results indicate that the pronoun resolution system finds the correct named entity in 60.4% of all cases.

The social network is evaluated by comparing character importance rankings based on graph properties with independent, human-generated importance rankings. The generated social networks correlate moderately to strongly with the independent character rankings.

Sammanfattning
This thesis develops a method for extracting social relations between characters in a literary work. To do this, co-occurrences of named entities, keywords associated with the named entities, and dependency relations between the named entities are used.

Literature usually contains a large number of personal pronouns, which in many cases represent a named entity. To find these hidden named entities, a pronoun resolution program has been developed. The pronoun system finds the correct named entity for a personal pronoun in 60.4% of all cases.

The generated social networks are evaluated by comparing importance rankings of the characters in the work with an independent human estimate of character importance. The generated networks correlate moderately to strongly with the independent human ranking.

Nyckelord/Keywords
Centrality, Graphs, Named entities, Pronoun, Pronoun resolution, Social network.
Centralitet, Grafer, Namngivna entiteter, Pronomen, Pronomen resolution, Sociala nätverk.


Contents
1 Introduction
2 Background
  2.1 Computational linguistics and social networks
  2.2 Graph Theory
  2.3 Social Networks
  2.4 Entities, pronouns and reference
  2.5 Corpus linguistics and literature
  2.6 Definitions and concepts
3 Aims and research questions
4 Method
  4.1 Data
    4.1.1 Dataset
    4.1.2 Preprocessing
  4.2 Network Extraction: Variables and Modules
    4.2.1 Weight strength
    4.2.2 Search range
    4.2.3 Mention co-occurrence
    4.2.4 Named entity normalization
    4.2.5 Dependency relations
    4.2.6 Named Entity Keywords
  4.3 Network Extraction: Procedure
    4.3.1 Social network system
    4.3.2 Pronoun Resolution system
    4.3.3 Baseline pronoun resolution system
  4.4 Network interpretation
    4.4.1 Acquaintance through other entities
    4.4.2 Communities
  4.5 Network analysis
    4.5.1 Degree centrality
    4.5.2 Betweenness centrality
    4.5.3 Eigenvector centrality
  4.6 Evaluation
    4.6.1 Pronoun resolution
    4.6.2 Candidate selection
    4.6.3 Antecedent selection
    4.6.4 Data
    4.6.5 Social network
5 Results
  5.1 Pronoun resolution
  5.2 Social network extraction
6 Discussion
  6.1 Pronoun resolution
  6.2 Social networks
    6.2.1 Importance rankings
    6.2.2 Social network graphs
  6.3 Research questions
7 Conclusions
References


1 Introduction
A literary work is a body of text that tells a story. The story is interpreted differently by different individuals: for A, x has the meaning y, whereas for B, x means z. Such differences in interpretation are not hard to find. In many cases the differences are small, but they may still have a large impact on the interpretation of the work.

The differences in interpretation appear on many linguistic levels. Word and sentence semantics are usually not too problematic, mainly because there exists a vast amount of resources in the form of word lexicons, synonym lexicons, treebanks, tools for semantic analysis, tools for grammatical analysis and so on. Another domain, which poses some problems, is the discourse interpretation of the literary work. Discourse interpretation deals both with what is explicitly available in the text and with that which is implicit. Just as with word and sentence semantics, several available tools can help with the interpretation of the discourse.

One particular aspect of discourse interpretation is dealt with in this thesis: the social networks between characters. The characters in a literary work interact with each other; they speak and share events and adventures. The friends of one character may be the enemies of another, and so on. These social dynamics add a lot of depth to the interpretation and the plot of the literary work.

Two previous studies that deal with social networks in literary texts are known to the current author. Elson et al. (2010) extract social networks based on the dialogue between the named entities in the text. The purpose of that study is to investigate whether there is an inverse relationship between the number of characters and the amount of dialogue, and whether character dialogue interactions differ between two types of settings, urban and rural. The other study is Beveridge and Shan (2016), who focus on the co-occurrences of named entity mentions in the running text. The purpose of that study is to identify who the "true" main character is in A Game of Thrones.

Using computational methods, this thesis develops an approach based on the methods of Elson et al. (2010) and Beveridge and Shan (2016) to extract social relations between characters in literary works. The goal of the current thesis is simply to construct a program that can identify the characters in a literary text and link them together based on their social role in the text. In and of themselves, the social networks of literary works are interesting, as they allow for analyses of the structure of the social network, which may reveal otherwise unseen information present in the work.


2 Background
2.1 Computational linguistics and social networks
This section highlights two previous studies that deal with social networks based on fictional data. Both studies use techniques from computational linguistics, which is the study of language through computational analysis.

The two previous studies that construct social networks based on fiction are Elson et al. (2010) and Beveridge and Shan (2016). The purpose of Elson et al. (2010) is (1) to investigate whether there is an inverse correlation between the amount of dialogue and the number of characters in the work, and (2) to investigate whether the setting of the novel, rural or urban, has any effect on the social structure of the novel (i.e. small/large community etc.).

Elson et al. (2010) focus on dialogue extraction from several English novels. The approach uses the Named Entity Recognition tool Stanford NER tagger (Finkel et al., 2005), based on Conditional Random Fields (Sutton and McCallum, 2012), to identify the named entities. For each named entity a cluster of co-referring names is generated through two methods: (1) variations of the named entity, e.g. Mr. Euler = Leonhard Euler, and (2) a list of other named entities which may be co-referring with the entity, e.g. The Mathematician = Leonhard Euler.

To construct the social network, Dialogue Identification was used to represent the connections between the entities. A valid connection between two entities holds when (1) they are engaged in a conversation, (2) they are at the same place at the same time, (3) they take turns speaking and (4) the dialogue is directed, i.e. what person A utters is intended for person B to hear.

Apart from dialogue identification, Quoted Speech was identified in the text and attributed to speakers. To find the speakers of quoted speech in the text, a training set and a testing set were set up from a corpus of European literary works. The testing set was annotated through an online survey with Amazon's Mechanical Turk, where three annotators attributed quoted speech to one of the candidates presented in the survey, or to none. Elson et al. (2010) capture direct connections, i.e. connections where there is no ambiguity in the connection between the two entities. A speaker and a recipient have a definite relationship through a dialogue between them.

In contrast to direct connections, there are also indirect connections [1], which are investigated in Beveridge and Shan (2016). The article focuses on co-occurrences between proper names and definite description mentions of named entities in the text [2]. Beveridge and Shan (2016) look at co-occurrences in the text regardless of the actual connection, for example:

(1) Puck saw Pascal. Euler was a clever mathematician.

In this example, neither Puck nor Pascal has any direct causal or interactive relationship to Euler; they merely co-occur in the text. They do, however, bear a relationship to each other, namely that they occur in the context of one another. In Beveridge and Shan (2016) the network is evaluated using six different centrality measurements (Degree, Weighted Degree, Eigenvector, PageRank, Closeness, and Betweenness). The centrality measurements assign a score to each vertex in the network based on the importance of the vertex [3].
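The window-based co-occurrence counting quoted from Beveridge and Shan (2016) can be illustrated with a minimal sketch. It is not their implementation: the function name `cooccurrence_edges`, the regular-expression tokenizer and the restriction to single-token names are assumptions made here for brevity; only the 15-word window comes from the quoted description.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence_edges(text, names, window=15):
    """Count how often two different names occur within `window` tokens.

    Hypothetical helper: names are assumed to be single tokens.
    """
    tokens = re.findall(r"\w+", text)
    # Positions of every mention of a known character name.
    mentions = [(i, tok) for i, tok in enumerate(tokens) if tok in names]
    weights = Counter()
    for (i, a), (j, b) in combinations(mentions, 2):
        if a != b and j - i <= window:
            # frozenset makes the edge undirected: {a, b} == {b, a}.
            weights[frozenset((a, b))] += 1
    return weights


text = "Puck saw Pascal. Euler was a clever mathematician."
names = {"Puck", "Pascal", "Euler"}
edges = cooccurrence_edges(text, names)
# All three names fall inside one window, so every pair gets weight 1.
```

With the default window all three characters in example (1) become connected, even though Euler never interacts with the other two, which is exactly the indirect (contextual) kind of connection described above.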

[1] Also known as contextual connections.
[2] "We parsed the ebook, incrementing the edge weight between two characters when their names (or nicknames) appeared within 15 words of one another." (Beveridge and Shan, 2016)
[3] See section 4.5 and Freeman (1978).

2.2 Graph Theory
To represent data as a graph, which is the basis for a network, the mathematical theory of graphs is used. The problem that established graph theory as an independent discipline is The Seven Bridges of Königsberg. It was solved by Leonhard Euler in 1736, who used a graph to represent the islands as vertices and the bridges as edges. From this, Euler showed that there is no way of walking between the islands (vertices) that crosses each bridge (edge) exactly once. This proof has led to the development of Graph Theory, which may be described as the geometry of relations rather than the geometry of quantities. The notions of length, area, etc. from classical geometry do not exist in Graph Theory. The length of the link between two vertices is irrelevant; what is relevant is that the link between them exists.

NOTATION    DESCRIPTION
{a,b,c}     A set containing the elements a, b, c.
(a,b)       A tuple with a as the first element and b as the second element.
e ∈ E       The element e is a member of the set E.
v ⊆ V       v is a subset of V.
A ∩ B       The shared members of the sets A and B.
v1...vn     Each index of v between 1 and n.

Table 1: Mathematical notation: Mathematical notations used in the thesis.

Formally, a graph G is a tuple (V, E), where V is a set of vertices {v1...vn} and E is a set of edges {e1...em}. Each edge e is a tuple (vi, vj) that names the two vertices the edge connects.

Figure 1: Graph G = { V = {v1, v2, v3, v4, v5, v6}, E = {e1, e2, e3, e4, e5, e6, e7, e8, e9} }.

There are several different types, or categories, of graphs. In a complete graph each vertex v has an edge e connecting v to every other vertex in the graph. An incomplete graph is thus a graph with at least one pair of vertices that is not joined by an edge. The graph in Figure 1 is not complete, as not every vertex is connected to all other vertices. If the graph were complete, the vertex v1 would be connected to v2...v6, the vertex v2 would be connected to v1, v3...v6, and so on.

Another distinction can be made between a connected graph and a disconnected graph. In a connected graph, it is possible to reach every other vertex in the graph from an arbitrary vertex. In a disconnected graph there exists at least one pair of vertices with no path between them. In Figure 1, it is possible to reach each vertex from an arbitrary vertex, thus the graph is connected.
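The two distinctions can be checked mechanically. The sketch below is only an illustration: the functions `is_complete` and `is_connected` and the small example graphs are hypothetical, since the edges of Figure 1 are not listed in the text.

```python
from itertools import combinations


def is_complete(vertices, edges):
    """True if every pair of distinct vertices is joined by an edge."""
    edge_set = {frozenset(e) for e in edges}
    return all(frozenset(p) in edge_set for p in combinations(vertices, 2))


def is_connected(vertices, edges):
    """True if every vertex can be reached from an arbitrary start vertex."""
    neighbours = {v: set() for v in vertices}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    start = next(iter(vertices))  # assumes at least one vertex
    seen, stack = {start}, [start]
    while stack:
        for n in neighbours[stack.pop()] - seen:
            seen.add(n)
            stack.append(n)
    return seen == set(vertices)


# Hypothetical example graphs (not the graph of Figure 1).
path_v = ["v1", "v2", "v3", "v4"]
path_e = [("v1", "v2"), ("v2", "v3"), ("v3", "v4")]
triangle_v = ["v1", "v2", "v3"]
triangle_e = [("v1", "v2"), ("v2", "v3"), ("v1", "v3")]
```

The path graph is connected but not complete, while the triangle is both, which shows that completeness is the stronger of the two properties.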

The proportion of the number of edges in the graph to the maximum possible number of edges is called the connectivity or density of the graph (Newman, 2010).

The maximum number of edges is given by the number of edges in the complete graph of which the measured graph is a subgraph; the number of edges in the subgraph is divided by this maximum. E.g. for a graph G with 10 vertices, the maximum number of edges is reached when each vertex is connected to every other vertex, which gives 10 · 9 / 2 = 45 edges. Comparing this complete graph G to a graph G′ on the same 10 vertices lets us calculate the density of G′ using $|E_{G'}| / |E_{G}|$.

For edges a distinction must be made between directed and undirected edges (Diestel, 2005). In a directed graph, each edge consists of an ordered pair, or an ordered tuple. This means that each edge is encoded with a starting vertex, or out-vertex, and an end vertex, or in-vertex. Thus e1 = (v1, v2) and e2 = (v2, v1) mean two different things: e1 describes an edge from v1 to v2 while e2 describes an edge from v2 to v1. The set {v1, v2} describes an edge between v1 and v2 without a definite start and end vertex, thus {v1, v2} = {v2, v1}. Graphs with only directed edges are called directed graphs, graphs with only undirected edges are called undirected graphs, and graphs with both types of edges are called mixed graphs. In this thesis the focus lies on undirected graphs, so henceforth graph will be used interchangeably with undirected graph. In Figure 1, the edges of the graph do not have any arrows indicating in- and out-vertices, which means that the graph is undirected.
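The difference between ordered and unordered edges maps directly onto common data types. The following is a small sketch assuming tuples for directed edges and frozensets for undirected ones; this representation is a choice made here, not something prescribed by the thesis.

```python
# Directed edges: ordered tuples, so the direction matters.
e1 = ("v1", "v2")
e2 = ("v2", "v1")
directed_differ = e1 != e2

# Undirected edges: unordered sets, so {v1, v2} == {v2, v1}.
u1 = frozenset(("v1", "v2"))
u2 = frozenset(("v2", "v1"))
undirected_same = u1 == u2

# Dropping the direction collapses both directed edges into one undirected edge.
undirected_edges = {frozenset(e) for e in (e1, e2)}
```

Because the thesis works with undirected graphs, a set-based edge representation avoids counting the same connection twice.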

We can further expand the notion of graphs to include subgraphs. A subgraph G′ of G is a graph where V′ ⊆ V and where every edge e ∈ E′ is also an edge in G. Let us consider a graph G where V = {v1, v2, v3, v4} and E = {(v1, v2), (v3, v4)}. A subgraph G′ could then be G′ = { V′ = {v3, v4}, E′ = {(v3, v4)} }, where {v3, v4} ⊆ V_G and (v3, v4) ∈ E_G.
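The subgraph example above, together with the density measure defined earlier in this section, can be sketched as follows. The helper names `is_subgraph` and `density` are hypothetical, and edges are treated as undirected.

```python
def undirected(edges):
    """Represent each edge as an unordered pair."""
    return {frozenset(e) for e in edges}


def is_subgraph(sub_v, sub_e, v, e):
    """True if (sub_v, sub_e) is a subgraph of (v, e)."""
    sub_v, v = set(sub_v), set(v)
    sub_e, e = undirected(sub_e), undirected(e)
    # Every edge of the subgraph must also connect vertices of the subgraph.
    endpoints_ok = all(edge <= sub_v for edge in sub_e)
    return sub_v <= v and sub_e <= e and endpoints_ok


def density(vertices, edges):
    """Number of edges divided by the edge count of the complete graph."""
    n = len(set(vertices))
    max_edges = n * (n - 1) // 2
    return len(undirected(edges)) / max_edges if max_edges else 0.0


V = ["v1", "v2", "v3", "v4"]
E = [("v1", "v2"), ("v3", "v4")]
V_sub = ["v3", "v4"]
E_sub = [("v3", "v4")]

sub_ok = is_subgraph(V_sub, E_sub, V, E)
d_full = density(V, E)          # 2 of 6 possible edges
d_sub = density(V_sub, E_sub)   # 1 of 1 possible edge
```

Here G has 2 of the 6 edges possible on four vertices, while the subgraph G′ is itself complete, so a sparse graph can contain dense subgraphs.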

2.3 Social Networks
A network is a representation of the connections that exist between entities that are distinguishable from each other. Graph Theory, which was introduced in the previous section, provides an excellent tool for analyzing these connections.

Networks and graphs have proven to be very useful methods of investigation in many disciplines, such as:

1. Engineering: Electrical circuits represented as graphs.

2. Linguistics: Syntactical structures as trees.

3. Computer Networks: Graph to represent connections.

4. Social Networks: Graph to represent the connections between individuals.

Graph representation of social networks was first introduced by Jacob Moreno in 1933 (Newman, 2010) to map friendship between kindergarten students. Moreno used vertices to represent individual students and edges to indicate that student A considered student B a friend. The study Moreno performed resulted in a directed graph, where edges originate from one vertex and end at another vertex.

In contrast to directed relations there are also undirected relations, where edges do not originate from one vertex and end in another. The edge in an undirected relation represents the connection between the two entities, regardless of their personal feelings towards one another. This would represent relationships such as frequently visited locations or common interests, and not relationships such as being in love. For example, let the set I_j denote the interests of entity j and I_i denote the interests of entity i. If the intersection I_j ∩ I_i is not empty, this represents an undirected relationship between j and i, i.e. the shared interests of the entities i and j.
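The shared-interest relation can be written down with set intersection. The entities and their interests below are invented purely for illustration, and `shared_interest_edges` is a hypothetical helper.

```python
from itertools import combinations

# Hypothetical interest sets I_i, I_j and I_k.
interests = {
    "i": {"chess", "music"},
    "j": {"music", "hiking"},
    "k": {"painting"},
}


def shared_interest_edges(interests):
    """Create an undirected edge wherever two interest sets intersect."""
    edges = {}
    for a, b in combinations(sorted(interests), 2):
        shared = interests[a] & interests[b]
        if shared:  # a non-empty intersection means the entities are related
            edges[frozenset((a, b))] = shared
    return edges


edges = shared_interest_edges(interests)
```

Only i and j end up connected, through the shared interest "music"; k has no edges because its interests intersect with nobody else's.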

An essential part of social network construction is the data gathering process. The main methods for data gathering in the real world are (1) Interviews and Questionnaires, (2) Direct Observation, (3) Archived Data (Newman, 2010) and (4) Data Mining / Web-scraping. Method (4) is a rather new development in data gathering; with the growth of online social networks it has become an essential method for extracting relationships from communities. The method can be viewed as a type of direct observation, as the data scraped is what is available to the public eye [4]. For the creation of social networks from fiction, all the methods except (1) are appropriate.

There are several types of networks that can be created from the collected data. Knoke and Yang (2008) define two basic types of networks relevant to the current inquiry.

[4] See Witten et al. (2016) for a detailed description of different data mining methods.


Complete Network: The network is built without bias towards any subjects. The result is a graph representing all relations present in the dataset.

Ego-centric Network: Ego-centric networks are built by focusing on one subject. The network is defined by several levels: the first level is the entities in direct contact with the subject, the second level is those who are connected to the first level entities, and so forth up to the n:th level.

In addition to complete and ego-centric networks it will be useful to generalize the notion of ego-centric networks.

Sub-network: A sub-network is a generalization of ego-centric networks on a complete network. Sub-networks are not necessarily based on one entity; rather, they can be based on several. Examples of non-ego-centric sub-networks are disconnected communities. A complete network may have several "islands" of entities, i.e. entities only connected among themselves and not to the rest of the network. Such "islands" are called disconnected communities.
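Disconnected communities correspond to what graph theory calls connected components. A minimal sketch for finding them follows; the function `communities` and the five-entity example network are hypothetical.

```python
def communities(vertices, edges):
    """Group vertices into "islands" that are only connected internally."""
    neighbours = {v: set() for v in vertices}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, result = set(), []
    for v in vertices:
        if v in seen:
            continue
        # Collect everything reachable from v into one community.
        component, stack = {v}, [v]
        while stack:
            for n in neighbours[stack.pop()] - component:
                component.add(n)
                stack.append(n)
        seen |= component
        result.append(component)
    return result


# Hypothetical network with two islands.
islands = communities(
    ["A", "B", "C", "D", "E"],
    [("A", "B"), ("B", "C"), ("D", "E")],
)
```

The example yields the two communities {A, B, C} and {D, E}, each of which is a sub-network that is not based on any single entity.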

Consider Figure 2, which represents a complete network. A is connected to B, B to C, and C to D. The neighborhood membership of each entity N relative to A, and conversely of A relative to N, is given by the length of the shortest walk from A to N (and conversely from N to A). In Figure 2, A needs to walk past two entities (B, C) to reach D, which puts D in A's third neighborhood. To reach B, on the other hand, A does not have to cross any other vertices, and thus B is in the first neighborhood of A.

Figure 2: Graph G = { V = {A,B,C,D}, E = {(A,B),(B,C),(C,D)} }.

Looking back at ego-centric networks, we can use the above notions to classify them. The longest shortest path for A in G is A → D, which has length 3; for an ego-centric network based on A, we would say that the graph in Figure 2 is of size 3. The size of an ego-centric network is thus defined as the longest shortest path for the subject, and the size of a complete network as the longest shortest walk for any vertex in the network. For a function f(A,B) which returns the length of the shortest walk between A and B, the size of the network is given by the maximum value of f(x,y), i.e. $\max_{x,y \in V} f(x,y)$.
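The neighborhood levels and network sizes for Figure 2 can be computed with a breadth-first search. The sketch below assumes an unweighted, undirected graph; `shortest_walks` is a hypothetical helper playing the role of f for one fixed start vertex.

```python
from collections import deque


def shortest_walks(start, edges):
    """Length of the shortest walk from `start` to every reachable vertex."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for n in neighbours.get(v, ()):
            if n not in dist:  # first visit gives the shortest distance
                dist[n] = dist[v] + 1
                queue.append(n)
    return dist


# The graph of Figure 2.
edges = [("A", "B"), ("B", "C"), ("C", "D")]
from_a = shortest_walks("A", edges)   # neighborhood level of each entity for A
ego_size = max(from_a.values())       # size of the ego-centric network of A
network_size = max(
    max(shortest_walks(v, edges).values()) for v in "ABCD"
)                                     # max over all pairs of f(x, y)
```

For Figure 2, B is in the first neighborhood of A and D in the third, and both the ego-centric network of A and the complete network have size 3.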

These tools create a framework for social network analysis and data gathering; however, several linguistic considerations need to be taken into account. In both Elson et al. (2010) and Beveridge and Shan (2016), distinguishable entities are considered as the main entities, but their studies do not consider indistinguishable entity representations, such as the antecedents of personal pronouns, as named entities [5] in the network.

Identifying antecedents of pronouns, called pronoun resolution, is a sub-task of co-reference resolution. The task of pronoun resolution and a brief introduction to co-reference resolution are presented in the next section.

2.4 Entities, pronouns and reference
To construct a social network, all the named entities which are "persons" must be identified. This process is straightforward when the reference is immediate, such as in (1). The proper name Johan is used, and by the nature of proper names it refers explicitly to the real-world person that has been given the (proper) name Johan.

[5] Indistinguishable in the sense that personal pronouns may refer to several entities depending on the context.


(1) Johan is sad

Sentences such as (1) are common, but there are also other ways of representing named entities. One may also use a personal pronoun [6], as in (2).

(2) He is sad

Comparing (1) and (2), we quickly realize that they have the same semantic content iff namedEntity(he) = namedEntity(Johan). If this is the case, Johan is said to be the antecedent of the anaphoric pronoun he. This shows that pronouns may act as placeholders for named entities in a text. The implication for social networks is that a large source of information is hidden behind the anaphoric pronouns. Resolving these pronouns increases the amount of data available for building the network.

The Swedish pronoun system has three features: gender, perspective and number. For gender, Swedish has four variations: the two common ones, neutrum and utrum, as well as the less common maskulinum and femininum, which apply to the third-person pronouns. As in English, there are word forms for the first-, second- and third-person perspective, and for singular and plural number. The Swedish pronoun system is presented in Table 2.

                 SUBJECT                             OBJECT
                 SINGULAR                  PLURAL    SINGULAR                      PLURAL
FIRST-PERSON     jag                       vi        mig                           oss
SECOND-PERSON    du                        ni        dig                           er
THIRD-PERSON     han, hon, den, det, man   de        honom, henne, den, det, man   dem

Table 2: Swedish personal pronouns: Summary.

When a personal pronoun refers to a named entity that has been mentioned in the text, the personal pronoun is said to be anaphoric. One may think that he in (2) is anaphoric; after all, it refers to some named entity. This is not the case, however: observe that (2) contains no named entities, so there is no evidence in (2) that he refers to Johan. If we combine (1) and (2) and consider them as one text, he in (2) is anaphoric if it refers to the same thing as Johan in (1)+(2).

A complete example of an anaphoric pronoun is seen in (3). The third-person singular he refers back to the named entity Euler in the preceding sentence.

(3) Euler was clever. He was a mathematician.

As we have seen, not all cases of personal pronouns are anaphoric. Three common cases of non-anaphoric personal pronoun usage are illustrated below:

Expletive subject: Consider the examples It is cold and Det regnar. It and det are not anaphoric in these examples. Rather, the pronoun represents the obligatory subject required by the strict SVO word order in Swedish and English.

Deictic pronoun: A deictic pronoun (first- or second-person) refers to the speaker and/or the listeners. It is often a given who speaks and who listens, and these entities are not necessarily mentioned in the text (Nilsson, 2010).

Cataphora: A pronoun is cataphoric when its referent occurs after the usage of the pronoun. Cataphoric words function as a "forward" anaphora in contrast to "backwards" anaphora, if we consider "forward" and "backward" to indicate the position of the antecedent.

[6] Henceforth denoted by "pronoun".


The task of identifying the connections between named entities such as Johan, definite descriptions and pronouns is generally called co-reference resolution. Co-reference resolution also deals with co-reference chains, which are sets of words all used to mention the same named entity in a text (Nilsson, 2010).

For the task of resolving the antecedents of pronouns, systems based on linguistic knowledge are most commonly used. A rather general algorithmic procedure for resolving pronouns can be described in the following steps:

1. Find an anaphoric pronoun.

2. Find a set of candidates within k sentences of the pronoun.

3. From linguistic knowledge assign scores to each candidate.

4. Select the candidate with the highest score or probability.

Linguistic knowledge in this case may include n-gram statistics, frequency statistics, grammatical form, and so on. It is essentially any knowledge that deals with language and that may provide clues for finding pronoun-antecedent pairs. One system that uses this approach is Mitkov (1999).

The approach follows the steps outlined above with a linear distance of k = 3. For each named entity within k, the program checks whether the entity agrees with the pronoun in grammatical gender and number. If so, the named entity is scored on several other linguistically motivated features; if not, the named entity is not eligible as a candidate for the pronoun. When this process has been done for each named entity, the named entity with the highest score is selected as the most likely antecedent of the pronoun.
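The general procedure (candidate search within k sentences, agreement filtering, scoring, selection) can be sketched as below. This is not Mitkov's system: the `Mention` class, the `resolve` function and the two scoring features (recency and subjecthood) are assumptions chosen only to make the steps concrete, and k is read here as "at most k sentences before the pronoun".

```python
from dataclasses import dataclass


@dataclass
class Mention:
    text: str
    gender: str               # e.g. "masc" or "fem"
    number: str               # "sg" or "pl"
    sentence: int             # index of the sentence containing the mention
    is_subject: bool = False


def resolve(pronoun, entities, k=3):
    """Return the best-scoring agreeing entity within k sentences, or None."""
    # Steps 2 and 3a: candidates in range that agree in gender and number.
    candidates = [
        e for e in entities
        if 0 <= pronoun.sentence - e.sentence <= k
        and e.gender == pronoun.gender
        and e.number == pronoun.number
    ]
    if not candidates:
        return None

    # Step 3b: hypothetical features, closer mentions and subjects score higher.
    def score(e):
        recency = k - (pronoun.sentence - e.sentence)
        return recency + (1 if e.is_subject else 0)

    # Step 4: select the candidate with the highest score.
    return max(candidates, key=score)


# Example (3): "Euler was clever. He was a mathematician."
# "Emmy" is a hypothetical distractor that fails gender agreement.
entities = [
    Mention("Euler", "masc", "sg", sentence=0, is_subject=True),
    Mention("Emmy", "fem", "sg", sentence=1),
]
he = Mention("He", "masc", "sg", sentence=1, is_subject=True)
antecedent = resolve(he, entities)
```

In the example the distractor is removed by the agreement check before any scoring happens, so Euler is selected as the antecedent of He.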

2.5 Corpus linguistics and literature
Corpus linguistics is the study of language where the data consists of a corpus. A corpus is a collection of real-world language use collected into a database. A corpus may contain written and/or spoken language and may come from different genres or from the same genre, e.g. news articles, blog texts, literature, conversations, political speeches and so on. Several statistical analyses can be performed to obtain data related to word frequencies and co-occurrences, such as collocations and n-grams. Biber (2011) describes three methods of corpus linguistic analysis in the context of literary analysis:

1. Keyword Analysis

2. Phrase Analysis

3. Collocation analysis

Keyword analysis focuses on the usage of individual words by different authors and investigates whether an author uses a word w more often than is typical in a reference corpus. This type of analysis can be generalized as the investigation of whether the number of occurrences of word w in a corpus C is statistically significant or not.
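One way such a significance check could look is sketched below. The text does not say which test is meant; the log-likelihood (G2) statistic used here is a common choice in corpus linguistics and is an assumption of this sketch, as are the function name `log_likelihood` and the example frequencies.

```python
import math


def log_likelihood(freq_author, size_author, freq_ref, size_ref):
    """G2 statistic comparing the frequency of a word in two corpora."""
    total = size_author + size_ref
    combined = freq_author + freq_ref
    # Expected frequencies if the word were equally common in both corpora.
    expected_author = size_author * combined / total
    expected_ref = size_ref * combined / total
    g2 = 0.0
    for observed, expected in (
        (freq_author, expected_author),
        (freq_ref, expected_ref),
    ):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


# Hypothetical counts: word w occurs 50 times in 1,000 author tokens
# and 100 times in 10,000 reference tokens.
keyness = log_likelihood(50, 1_000, 100, 10_000)
same_rate = log_likelihood(10, 1_000, 100, 10_000)
```

A value above 3.84 corresponds to p < 0.05 for a chi-squared distribution with one degree of freedom; when the relative frequencies are identical, as in `same_rate`, the statistic is zero.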

Phrase analysis is similar, but instead of searching for the usage of individual words, it searches for the usage of phrases. Phrases may reveal more contextual properties of an author's writings, for example how character interactions occur, descriptions of the environment and other aesthetic presentations and features (Biber, 2011). Phrase analysis bears a close resemblance to the analysis of parse trees. E.g. we may investigate whether a construction T in corpus C is unique or statistically significant in C. Combining phrase and keyword analysis yields a very strong framework of analysis, where both a syntactic and a semantic analysis are possible.

Collocation analysis investigates significant n-grams used by different authors. There are many possible features to examine using collocations, for example how authors combine positive and negative words (Biber, 2011) or how prepositions are used, e.g. are you on the train, or are you in the train? In general, n-grams and collocations are widely used tools in language analysis as they capture one of the most important features of language: context. The importance of context for language understanding is expressed by a famous quote from Firth (1961): you shall know a word by the company it keeps.

Statements about the language use within a corpus containing literary texts are only valid for that corpus. They say nothing about the relation between the literary text and general language usage (Fischer-Starcke, 2010). To extend the analysis to the general language, the results must be compared to a corpus representing general language use, such as BNC (Burnard, 2000) for English and SUC (Källgren, 2006) for Swedish.

One article of particular interest that uses word frequencies and related measures to analyze literature is (Stubbs, 2003). The article analyzes Heart of Darkness by Joseph Conrad and identifies several patterns in the book. One such pattern unveiled using keyword analysis is that the book begins and ends with essentially the same content, traveling the Thames and doing Buddha things. Looking at individual words again, Stubbs (2003) identifies that Conrad uses words with negative affixes (e.g. -less, -ness) more often than in the reference corpus used. The use of phrase analysis reveals that phrases frequently used in general language (e.g. in the middle of and on the edge of) are not frequent in the book. Stubbs (2003) argues that such phrases are primary ways of expressing one's point of view. Their infrequent usage gives the work an atmosphere of "uncertainty" and "confusion", which is also created by other factors (see (Stubbs, 2003)). Finally, using collocations Stubbs (2003) finds significant co-occurrence between the words glitter, gleam, glisten, glint and ominous words, such as dark, gloom, blood and fire.

Keyword analysis is also used in (Fischer-Starcke, 2009), where the author analyses the novels of Jane Austen with the use of concordances⁷. From these analyses, the author finds several patterns within the writings that were not obvious to human readers, for example the usage of negatively connotated words and their ties to grammatical negation (Fischer-Starcke, 2009)⁸.

2.6 Definitions and concepts

Linguistic object: A collection of characters creating a "word".

word = B, [C1...Cn], S

C = {a-ö 1-9 - ’}, B = {blankspace}, S = {. , blankspace ! ? }⁹

Entity: The object that is a referent of a named entity.

Referring expression: An expression that refers to some entity X.

Anaphora: A linguistic object that points back to an already mentioned entity in the current text.

Antecedent: An antecedent is a linguistic object in the text that is pointed to by an anaphora.

Co-reference: The situation where two named entities refer to the same entity X.

Co-reference resolution: The process of identifying referents to linguistic objects.

Pronoun resolution: The process of identifying the antecedents of personal pronouns.

Named Entity Normalization: The process of attaching co-referring names to the same entity, e.g. "Sherlock" = "Sherlock Holmes" = "Mr. Sherlock".

Edge: The representation of the connection between entity A and B in a graph.

⁷ Finding the context surrounding a specific word type.

⁸ Given that in (Stubbs, 2003) negatively connotated words appear very often, it would be interesting to find out if the same pattern holds.

⁹ A hyphen may also be interpreted as not indicating that it is one word, but that interpretation has been discarded on the basis of compound words. Those are regarded as one word, and in this thesis a hyphen will be seen as a marker of compounding.

Vertex: The representation of entity A in a graph.

Network: A collection of entities and the connections between them.

Graph: A collection of vertices and edges, where the edges may or may not be directed. A representation of a network.


3 Aims and research questions

The general research question posed in this thesis is:

1. From a text containing characters who stand in relation to each other, is it possible to represent this information as a social network?

Intuitively the answer seems obvious. However, since this has not been explored in much detail, much of the process remains unclear, which leads to two additional questions regarding the process.

2. In constructing social networks from a text, which problems are hard?

3. How well do rule-based pronoun resolution systems perform on literary texts?

Several restrictions have been applied, as constructing social networks is a massive task. The types of connections extracted are restricted to character co-occurrences in the text, i.e. indirect connections from section 2.3 of the background. It should be noted that this includes co-occurrences that appear within the dialogue. This means that both direct and indirect connections are captured, just not distinguished from each other. The task of resolving pronouns will have a big impact on the structure of the network due to the proportion of named entity mentions and pronoun occurrences.

The task is defined and restricted in the following way: The system will take as input one of the following pronouns: han, hon, honom, henne and return the most likely antecedent for that pronoun. The accepted input has been reduced to the third-person singular forms due to the additional complexity of resolving plural personal pronouns. Consider the sentence:

(4) We went home and ate.

within some context C with the named entities A and B to which we refers. The task of finding the antecedent now involves finding the n most likely antecedents. Several obvious problems arise: How large is n? How should n be determined? How should the n candidates be selected?

First- and second-person pronouns have also been discarded. These personal pronouns are often cataphoric or deictic, and in such cases no antecedent exists. The reason for both restrictions is time. Both plural and non-anaphoric pronouns can be handled, but developing such a system would take additional time that is not available within the scope of this thesis.


4 Method

4.1 Data

4.1.1 Dataset

The data used is the novels Röda Rummet and Hemsöborna from August Strindberg's Collected Works. The raw data was provided by Litteraturbanken¹⁰ and processed by the Department of Linguistics at Stockholm University. Table 3 shows the number of types and tokens that the two books contain. It also shows the token count for proper names and for pronouns.

| BOOK | TYPES | TOKENS | PROPER NAMES | PRONOUNS |
|---|---|---|---|---|
| Röda Rummet | 14170 | 91191 | 2344 | 4941 |
| Hemsöborna | 14922 | 79356 | 883 | 1983 |

Table 3: Data summary: Tokens and types for each book and token count for proper names and pronouns.

Summary: Röda Rummet¹¹

Röda Rummet centers around Arvid Falk, an alter ego of August Strindberg. Arvid strives towards freedom and truth; money and honor come second. Arvid's older and domineering brother, Carl Nicolaus Falk, strives towards the opposite, money and honor. Carl becomes very angry at Arvid when it seems that Arvid has published a very revealing article. According to Carl, the article dishonors their family name.

Following this, Arvid takes on a number of jobs, some from the publisher Smith, who wants Arvid to write articles promoting an insurance company. Arvid declines these jobs because of the hypocrisy of using writing as a commercial device. The main theme of the book is that society is not ideal, and sometimes trying to be honest is a bad choice. In the end, Arvid becomes a famous writer but realizes that the ideals he holds may not always be ideal.

Summary: Hemsöborna¹²

Hemsöborna centers around the city dweller Carlsson as he arrives at the island Hemsö to help the widow Flod with her farming. Conflicts quickly arise between Carlsson and Flod's son, Gusten. Carlsson gets the farming up and running, and they rent out a part of the house to a professor and his family during the summer.

Carlsson falls in love with one of the maids, Ida, but his courting attempts are ignored. The next summer Carlsson and Flod get married, but the marriage does not satisfy Carlsson and he starts courting the other maids. Madame Flod knows Carlsson is courting the maids, and one winter evening Flod follows Carlsson and catches pneumonia. During Christmas, Flod dies, and when her casket is transported to the church by Carlsson, the ice breaks and Carlsson is presumed dead.

4.1.2 Preprocessing

The data preprocessing is done by removing all chapter headings and comments in the text, then tokenizing the text so that each token occupies one line, with blank lines indicating sentence endings.

The file is then processed with Stagger3¹³ trained on the SUC3 corpus (Källgren, 2006). Stagger is a POS tagger based on the averaged perceptron and was developed at Stockholm University. Stagger's accuracy on SUC3 is 96.6% (Östling, 2013). In addition to POS tagging, Stagger3 uses MaltParser (Nivre et al., 2007) for syntactic dependency parsing.

Table 4 shows the different categories of information given for each word and their usage in the thesis.

¹⁰ http://litteraturbanken.se

¹¹ Interpretation of https://sv.wikipedia.org/wiki/R%C3%B6da_rummet by the author.

¹² Interpretation of https://sv.wikipedia.org/wiki/Hems%C3%B6borna by the author.

¹³ https://github.com/robertostling/efselab

| TYPE | VALUE | USAGE |
|---|---|---|
| Index | int | Find section headings and sentence boundaries. |
| Word form | string | Identify words. |
| Lemma | string | Identify words stripped of grammatical markers. |
| Word class | string | Find words of a certain class, e.g. proper names. |
| SUC-tagset | list | Grammatical information related to the word. |
| Dependency head word | int | Find the dependency head of a word. |
| Dependency role | string | Identify shared dependency roles between named entities. |

Table 4: Stagger3 output: The information output from Stagger3 and how it is utilized.

In addition to the information in Table 4, Stagger outputs a Named Entity Recognition analysis based on the ontology in SUC3, which assigns each proper name a category: person, institution, place, work, myth and so on. The categorization of named entities in SUC is somewhat idiosyncratic.

This information will be used to distinguish characters in the novel from other entities such as cities, streets, etc. and to associate person named entity mentions with other non-person named entities. Also, when investigating proper names, the lemma form of the name is used as an identifier rather than the word form. This is to avoid adding redundant entries in the network: proper names may be possessive, so both the forms Flod and Flods exist. They refer to the same named entity and should thus only have one entry, whereas using the word form would create two entries.

4.2 Network Extraction: Variables and Modules

In the process of extracting relations between entities, several variables and modules play an essential role. The variables and modules used are presented below.

4.2.1 Weight strength

Three types of relations are captured by the system: (1) mention co-occurrence: when a named entity appears in the window of another named entity, (2) keyword intersection: the number of keywords shared by two named entities and (3) dependency relations. The weight of each type of relation is set at 1.

4.2.2 Search range

Linear-k is the variable that controls how far back the system should search for relations. Using k = 3 would mean the current sentence up until the current named entity, and the two preceding sentences. This is shown in Figure 3.

[Figure 3: the sentence sequence Sn−3, Sn−2, Sn−1, Sn, Sn+1, Sn+2, Sn+3.]

Figure 3: Linear-k example. k = 3 would select the current sentence (blue box) and the two preceding sentences (green boxes) as the context of the named entity in Sn.

The reason for only searching the preceding sentences is to avoid adding the same relation twice. If the search also included Sn+1, the named entity in Sn would create a connection with a named entity in Sn+1. Later, when we search the context of Sn+1, it would add the connection to the named entity in Sn again.
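The search range can be illustrated with a minimal sketch. It assumes the text is held as a list of sentences, each a list of tokens; the function name `linear_k_context` and this data layout are hypothetical and not taken from the thesis implementation.

```python
def linear_k_context(sentences, n, position, k=3):
    """Collect the tokens that form the linear-k context of one token.

    sentences: list of sentences, each a list of tokens (assumed layout).
    n: index of the current sentence; position: index of the token in it.
    """
    start = max(0, n - (k - 1))               # k - 1 preceding sentences
    context = []
    for sentence in sentences[start:n]:
        context.extend(sentence)
    context.extend(sentences[n][:position])   # current sentence, up to the token
    return context


sentences = [["A"], ["B", "x"], ["C"], ["D", "y", "E"]]
context = linear_k_context(sentences, n=3, position=2)
print(context)
```

Because only preceding material is collected, each pair of mentions is visited once, from the later mention looking back.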


4.2.3 Mention co-occurrence

Mention co-occurrences between entities are captured when one named entity occurs in the range k of another named entity. A restriction implemented is that the named entities cannot be identical to one another; the purpose of this is to prevent named entities from having relations to themselves. Each co-occurrence between named entities is interpreted as a social connection. The edge between the named entities is then either created or has its weight increased by one.

4.2.4 Named entity normalization

Names such as Roland Rocker may be referred to by (1) Roland, (2) Rocker or (3) Roland Rocker, and grouping these together in the composition of the social network is essential. The named entity recognition output from Stagger comes in the form [w0B, w1I, ..., wnI], which represents the entity that has w0 as the beginning (B) string and w1 as its second and/or end string (I), e.g. Roland = w0 and Rocker = w1.

There are four possible structures that a proper name mention can take; these are described in Table 5.

| TYPE | EXAMPLE | STRUCTURE | REPRESENTATION |
|---|---|---|---|
| Multi word name | Roland Ringo Rocker | RolandB RingoI RockerI | [w0B, w1I, ..., wnI] |
| First and surname | Roland Rocker | RolandB RockerI | [w0B, w1I] |
| First name only | Roland | RolandB | [w0B] |
| Surname only | Rocker | RockerI | [w0I] |

Table 5: Named Entity mentions: Different structures named entity mentions may take when expressed with proper names.

Whenever a named entity appears, the program checks if the representation and form of the named entity exists in the social network. If it does not exist, it is added to the social network. After the text has been completely read, the named entities in the social network are normalized. To normalize the names, the entries containing only surnames or only first names are added to entries which contain both, e.g. RolandB RockerI ← RolandB. Next, all identical and near-identical entries are added together. Near-identical cases are cases such as e0 = [CarlB FalkI] and e1 = [NicolausB FalkI], where the entry e2 = [CarlB NicolausI FalkI] exists. The two entries e0 and e1 are considered near-identical to e2 and are then added to e2.

A few manual modifications have been made. In Röda Rummet: (1) The mentions of Arvid Falk and Carl Nicolaus Falk are grouped manually, as there is ambiguity due to the shared surname, which caused Arvid and Nicolaus to share one vertex. (2) Struve, Lundell and Borg are manually tagged as "person". Originally Stagger recognized versions of the names with a title, e.g. Herr Lundell, as persons, but not the name without the title. Thus, when only Lundell appeared it was not added to the network.

For Hemsöborna: (1) Stagger has found the named entity Norström Carlsson. This, however, is not one entity but two; Norström and Carlsson are distinct. (2) In Hemsöborna there is a character named Flod, which also means river in Swedish. Stagger has incorrectly tagged Flod as a noun while it is a proper name, thus Flod has been tagged as "person" manually.

4.2.5 Dependency relations

When a mention co-occurrence is captured, the system also checks whether the two entities are dependent on the same headword, e.g.

(5) Johan_j loves_k Johanna_i


In this case, Johan has the relation subject and Johanna the relation direct object to the verb loves. This means that both are dependents of the verb loves. This is interpreted as a direct connection between the entities, and the weight of their relationship is increased by 1.
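The shared-head check can be sketched as follows. The `Token` layout, the role labels "subject" and "object", and the `weights` dictionary are illustrative assumptions, not the actual Stagger3 or MaltParser output format.

```python
from collections import defaultdict, namedtuple

# Hypothetical token record: position, word form, index of head, role label.
Token = namedtuple("Token", "index form head role")


def shares_head(a, b):
    """True if two distinct tokens depend on the same head word."""
    return a.index != b.index and a.head == b.head


# Example (5): both names are dependents of the verb at index 2.
sentence = [
    Token(1, "Johan", 2, "subject"),
    Token(2, "loves", 0, "root"),
    Token(3, "Johanna", 2, "object"),
]

weights = defaultdict(int)
johan, johanna = sentence[0], sentence[2]
weights[(johan.form, johanna.form)] += 1        # mention co-occurrence
if shares_head(johan, johanna):
    weights[(johan.form, johanna.form)] += 1    # shared dependency head
print(dict(weights))
```

Under these assumptions the pair receives one point for the co-occurrence and one for the shared head, matching the weight of 1 per relation type given in section 4.2.1.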

4.2.6 Named Entity Keywords

Entity keywords are determined by other non-person named entities that co-occur with the named entity i. Thus, if Euler has St. Petersburg in its context, St. Petersburg will be added to the keywords of Euler.

The setup of the system only allows named entities of the type "person" to have keywords assigned to them. Compare example (5) and the following case:

(6) St. Petersburg loves Johanna

In (5) we added two connections between Johan and Johanna, one for the co-occurrence and one for the shared dependency relation. In example (6), without any modification to the system, we would add St. Petersburg as a keyword to Johanna twice, once for the co-occurrence and once for the dependency relation. However, at this stage keywords are restricted to co-occurrences. Adding additional occurrences will have no effect on the score assigned by the shared keywords, and it is unclear exactly what a dependency relation between a non-person and a person would entail in a social network.

Retrieving the shared keywords K for entity_i and entity_j is done by taking the intersection of the keyword sets, K_i ∩ K_j. The score between entity_i and entity_j is determined by the cardinality of the intersection, i.e. the total number of unique keywords shared by entity_i and entity_j.
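A minimal sketch of the keyword score, assuming the keywords are stored as one set per person entity. The dictionary `keywords`, the function name `keyword_score` and the example entries are made up for illustration.

```python
def keyword_score(keywords, entity_i, entity_j):
    """Cardinality of the intersection of two entities' keyword sets."""
    return len(keywords.get(entity_i, set()) & keywords.get(entity_j, set()))


# Illustrative keyword sets; storing them as sets means that adding the
# same keyword twice has no effect on the score.
keywords = {
    "Euler": {"St. Petersburg", "Berlin"},
    "Johanna": {"St. Petersburg"},
}
keywords["Johanna"].add("St. Petersburg")

print(keyword_score(keywords, "Euler", "Johanna"))
```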

4.3 Network Extraction: Procedure

4.3.1 Social network system

The structure of the program is described in Algorithm 1: Social Network Extraction, where w is a named entity and cw is a named entity within the context of w. The context search is described by T[i−k : i], where i is the index of w in T and k is the size of the search range (see section 4.2.2). The procedural description of the system is given below:

Algorithm 1: Social Network Extraction

namedEntities = List of named entities, socialNetwork = Graph connecting the named entities

Input: Text T
Output: Graph G of social relations in text.

function generateSocialNetwork(text)
    for w ∈ T do
        if w(properName) = TRUE ∧ w(person) = TRUE then
            if w ∉ socialNetwork then
                namedEntities ← w
            for cw ∈ T[i−k : i] do
                if cw(properName) = TRUE ∧ cw(person) = TRUE then
                    tw ← depRelation(cw, w)
                    socialNetwork[w] ← cw
                if cw(properName) = TRUE ∧ cw(namedEntity) = TRUE then
                    keywords[w] ← getKeywords(w)
            end
        else
            pass
        end
    end
    return socialNetwork
end
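The procedure in Algorithm 1 can be approximated by the following sketch. It is not the thesis implementation: the token dictionaries with the keys `lemma`, `ne` and `head` are an assumed data layout, the shared-head check is restricted to mentions in the same sentence, and the keyword intersection score is left out for brevity.

```python
from collections import defaultdict


def generate_social_network(sentences, k=3):
    """Sketch of Algorithm 1 under an assumed token layout.

    sentences: list of sentences; each token is a dict with the keys
    'lemma', 'ne' (None, 'person' or another category) and 'head'.
    """
    network = defaultdict(int)    # (name_a, name_b) -> edge weight
    keywords = defaultdict(set)   # person -> co-occurring non-person entities
    for n, sentence in enumerate(sentences):
        for pos, w in enumerate(sentence):
            if w["ne"] != "person":
                continue
            # Context: the k - 1 preceding sentences plus the current
            # sentence up to (not including) w.
            context = [(m, t)
                       for m in range(max(0, n - k + 1), n)
                       for t in sentences[m]]
            context += [(n, t) for t in sentence[:pos]]
            for m, cw in context:
                if cw["ne"] is None or cw["lemma"] == w["lemma"]:
                    continue
                if cw["ne"] == "person":
                    edge = tuple(sorted((w["lemma"], cw["lemma"])))
                    network[edge] += 1              # mention co-occurrence
                    if m == n and cw["head"] == w["head"]:
                        network[edge] += 1          # shared dependency head
                else:
                    keywords[w["lemma"]].add(cw["lemma"])
    return dict(network), dict(keywords)


sentences = [
    [{"lemma": "Johan", "ne": "person", "head": 2},
     {"lemma": "älska", "ne": None, "head": 0},
     {"lemma": "Johanna", "ne": "person", "head": 2}],
    [{"lemma": "Stockholm", "ne": "place", "head": 2},
     {"lemma": "resa", "ne": None, "head": 0},
     {"lemma": "Johanna", "ne": "person", "head": 2}],
]
network, keywords = generate_social_network(sentences)
print(network, keywords)
```

In this toy input the edge between Johan and Johanna receives two points in the first sentence (co-occurrence and shared head) and one more when Johanna is mentioned again in the second sentence.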


The anaphora resolution system is applied as a data preprocessing step. Before the text is used as input for the social network extraction, it is processed by the system presented in the next section. For each (pronoun, antecedent) pair found, the pronoun's word form, lemma, and word class are changed to those of the antecedent.

4.3.2 Pronoun Resolution system

To resolve the antecedents of pronouns, the system reads a text tagged with Stagger3 and identifies occurrences of pronouns. For each pronoun found, the system searches the sentences preceding the pronoun for antecedent candidates. For each word in the preceding sentences, the grammatical gender and number are extracted from the SUC-tagset. If the extracted grammatical gender and number are identical¹⁴ to the grammatical gender and number of the pronoun, the phrase is selected as an eligible candidate.

In addition to the candidates obtained from the preceding sentences, the last resolved antecedent for the pronoun¹⁵ is also added to the set of candidates if one exists.

| Feature | Description | Score |
|---|---|---|
| Theme* | Candidates with position 1 in the sentence containing the anaphora are considered more likely as antecedents. | 1, 0 |
| Definiteness | If the candidate is definite and/or possessive it is considered more likely as the antecedent. | 1, 0 |
| Verb* | Candidates which are followed by an active verb are considered to be more likely antecedents. | 1, 0 |
| Prepositional Phrase* | If the candidate is in a prepositional phrase it is considered less likely to be the antecedent. | 0, -1 |
| Repetitions | Candidates which are repeated in the set of candidates are considered more likely as antecedents. If the candidate is repeated more than two times a score of 2 is assigned, if it is repeated once a score of 1 is assigned, otherwise 0. | 2, 1, 0 |
| Distance | Distance between the candidate and the anaphora measured in sentences. The candidates in the same sentence as the pronoun are assigned 2, the ones in the preceding sentence 1, and all others 0. | 2, 1, 0 |
| Collocation pattern | A candidate that shares a collocation pattern ([W,VB] or [VB,W]) with the anaphora is considered more likely as antecedent. The collocation patterns are calculated using t-tests. | 2, 0 |
| Proper Name | Proper names are considered more likely as antecedents. | 1, 0 |
| Dependency Role | Candidates which share a dependency role with the pronoun are more likely as antecedents. | 1, 0 |

Table 6: Pronoun resolution: Scoring features used in pronoun resolution. Items marked with a star (*) and italics are identified as problematic for Norwegian in (Holen, 2007).

After finding all likely candidates, they are each scored on the features presented in Table 6. In addition to the candidates selected from the search range, the last resolved antecedent of the pronoun is added to the candidates. After all candidates have been scored, the candidate with the highest score is selected as the most likely antecedent.

If one named entity appears twice in the set of candidates, only the occurrence with the highest score is considered when determining the antecedent. Considering every occurrence of an entity would not change the result, but keeping only the highest-scoring occurrence reduces the number of calculations that have to be done. Algorithm 2 shows the procedure when resolving pronouns.

¹⁴ If UTR/NEU occurs, it is matched against [NEU,UTR,UTR/NEU].

¹⁵ If han is the pronoun, and Johan has been resolved for han earlier, Johan is added to the set of candidates.

Algorithm 2: Pronoun Resolution Algorithm: pn2pmne

candidates = Set of antecedent candidates, scores = Scores assigned to each antecedent candidate, result = Set of pronoun-antecedent pairs

Input: Text T
Output: Pronoun-antecedent pair for each pronoun.

function resolvePronouns(text)
    for w ∈ T do
        if w = pronoun then
            for w ∈ T[i−j : i] do
                if conditions(w) = True ∧ w = person then
                    candidates ← w
            end
            if type(w) hasBeenResolved then
                candidates ← lastAntecedent
            for c ∈ candidates do
                scores(c) ← features(c)
            end
            result ← (max(scores), w)
    end
    return result
end
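The scoring step of Algorithm 2 can be sketched with a reduced feature set. Only three of the features in Table 6 are included (distance, proper name and repetitions), the candidate dictionaries are an assumed layout, and the repetition thresholds are one possible reading of the table (three or more occurrences score 2, two occurrences score 1), so this is an illustration rather than the system itself.

```python
from collections import Counter


def score(candidate, occurrences):
    """Score one candidate on a subset of the Table 6 features."""
    distance = {0: 2, 1: 1}.get(candidate["distance"], 0)
    proper = 1 if candidate["proper_name"] else 0
    repeated = 2 if occurrences >= 3 else 1 if occurrences == 2 else 0
    return distance + proper + repeated


def resolve_pronoun(pronoun, candidates, last_antecedent=None):
    """Return the name of the highest-scoring agreeing candidate, or None."""
    eligible = [c for c in candidates
                if c["gender"] == pronoun["gender"]
                and c["number"] == pronoun["number"]]
    if last_antecedent is not None:
        eligible.append(last_antecedent)
    occurrences = Counter(c["name"] for c in eligible)
    best = {}   # entity name -> highest score among its occurrences
    for c in eligible:
        s = score(c, occurrences[c["name"]])
        best[c["name"]] = max(s, best.get(c["name"], s))
    return max(best, key=best.get) if best else None


pronoun = {"form": "han", "gender": "UTR", "number": "SIN"}
candidates = [
    {"name": "Falk", "gender": "UTR", "number": "SIN",
     "distance": 0, "proper_name": True},
    {"name": "Struve", "gender": "UTR", "number": "SIN",
     "distance": 1, "proper_name": True},
    {"name": "Falk", "gender": "UTR", "number": "SIN",
     "distance": 2, "proper_name": True},
    {"name": "bordet", "gender": "NEU", "number": "SIN",
     "distance": 0, "proper_name": False},
]
antecedent = resolve_pronoun(pronoun, candidates)
print(antecedent)
```

Keeping only the best score per entity name mirrors the rule that an entity appearing twice among the candidates is represented by its highest-scoring occurrence.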

The features in Table 6 are directly ported from Mitkov (1998). In the description of the algorithm, there are several genre-specific features and features that are loosely defined and described. These features have been re-interpreted to fit the current system. The features changed are:

(1) Term preference: A term is defined as an established word within the current context or area of study¹⁶. In the context of a novel, what would such words be? There is no clear answer on how to determine the terms, but the most reasonable class of words that comes close to the above definition is the characters in the book. Thus the term preference has been substituted for proper name preference. This, however, is only a relevant feature if non-proper names are allowed as candidates.

(2) Indicative verb preference*: Mitkov (1998) defines a set of verbs that are considered to be indicative based on empirical evidence and their salience. Directly translating these to Swedish is an option, but there is no direct evidence that the Swedish translations have the same properties as the words defined by Mitkov. Another concern is that the verbs listed may be indicative for certain genres rather than generally indicative. A further concern is that the manner in which these words were chosen is not made explicit.

(3) Immediate reference: In (Mitkov, 1998) the immediate reference is described as highly specific to technical manuals. This feature has thus been removed.

In (Holen, 2007) several problematic features of rule-based anaphora resolution are identified. In Table 6 these are marked with italics and a star (*). The problematic features are identified for Norwegian, which is structurally rather similar to Swedish. The main problem with the features is that Norwegian and English have different information structures (Holen, 2007). The problematic difference in information structure between Norwegian and English is that the following antecedent preference hierarchy is used for English:

subjects > direct objects > indirect objects > other

Holen (2007) notices that in Norwegian new information is often conveyed by the subject, which leads to a more frequent usage of expletive subjects that are less likely to be antecedents. As Norwegian and Swedish are very similar in comparison to English and Norwegian, it can be assumed that Swedish shares these properties with Norwegian and that the features will have a negative effect on the pronoun resolution.

¹⁶ It is surprisingly hard to find a linguistic definition for "term"; the current definition is based on (5) in https://en.wiktionary.org/wiki/term.

4.3.3 Baseline pronoun resolution system

The pronoun resolution system is compared against a baseline system with the following heuristic: Select the closest eligible candidate as the most likely antecedent.

The purpose of this system is to compare the performances and see how much improvement the added features provide.
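The baseline heuristic is simple enough to sketch in a few lines. The candidate dictionaries and the `tokens_back` field (number of tokens between the candidate and the pronoun) are assumptions made for the example.

```python
def baseline_antecedent(pronoun, candidates):
    """Pick the closest candidate that agrees in gender and number."""
    eligible = [c for c in candidates
                if c["gender"] == pronoun["gender"]
                and c["number"] == pronoun["number"]]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c["tokens_back"])["name"]


pronoun = {"form": "han", "gender": "UTR", "number": "SIN"}
candidates = [
    {"name": "Falk", "gender": "UTR", "number": "SIN", "tokens_back": 12},
    {"name": "Struve", "gender": "UTR", "number": "SIN", "tokens_back": 3},
    {"name": "bordet", "gender": "NEU", "number": "SIN", "tokens_back": 1},
]
closest = baseline_antecedent(pronoun, candidates)
print(closest)
```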

4.4 Network interpretation

The connection between vi and vj may be interpreted as just that, a connection. A collection of several such connections is called a network. The following section develops some ideas of interest in the context of the interpretation of social networks.

4.4.1 Acquaintance through other entities

A walk in the graph is defined as a sequence of alternating vertices and edges, starting at vertex v1 and ending at vertex vn, e.g. w1 = {v1, e1, v2, e2, . . . , vn}. The set containing all possible walks in a graph can be denoted by W and individual walks by wx. The length l of a walk w1 is the number of edges in w1:

l(w1) = |{e | e ∈ w1}| (1)

The shortest path from v1 to vn is, as the name implies, the path with the smallest number of edges that must be traversed before reaching vn. Determining the shortest path as graphs get larger is no easy task, but there are several algorithms which deal with the problem, such as Dijkstra's algorithm, the Viterbi algorithm, and the A* search algorithm. Furthermore, let us denote the set of all the shortest paths from vertex i to vertex j as W*_i.

From this it is possible to identify how many nodes away a named entity i is from j, and thus who knows whom through other named entities. This is an interesting feature, as it allows us to form transitive relations between non-connected named entities in the network.
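For an unweighted graph, the number of edges separating two entities can be found with a breadth-first search. The sketch below is a generic illustration rather than the method used in the thesis; the example graph assumes two fully connected triangles joined by the edge between v2 and v4.

```python
from collections import deque


def shortest_path_length(graph, source, target):
    """Number of edges on a shortest path, or None if no path exists.

    graph: dict mapping each vertex to a set of neighbouring vertices.
    """
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        vertex, dist = queue.popleft()
        for neighbour in graph.get(vertex, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


# Assumed example graph: two triangles connected by the edge (v2, v4).
graph = {
    "v1": {"v2", "v3"}, "v2": {"v1", "v3", "v4"}, "v3": {"v1", "v2"},
    "v4": {"v2", "v5", "v6"}, "v5": {"v4", "v6"}, "v6": {"v4", "v5"},
    "v7": set(),
}
print(shortest_path_length(graph, "v1", "v6"))
```

A result of None corresponds to two entities that are not connected through any chain of acquaintances.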

4.4.2 Communities

A measurement of interest is that of modularity, which measures the quality of partitioning a network into so-called communities (Newman and Girvan, 2004).

To find structural properties of a graph G, the graph may be partitioned into communities. A community is a set C of vertices and edges that are densely connected, and any v ∉ C is sparsely connected to the vertices in C, or not connected at all (Newman and Girvan, 2004). Communities are an interesting way of analyzing graphs as they create different sub-networks which share some property P. In the case of social networks, these will often be different "social" communities, characterized by a property P such as friends, enemies, lovers, et cetera.

[Figure 4: graph with the vertices v1, v2, v3, v4, v5 and v6.]

Figure 4: Communities: The densely connected community A = {v1,v2,v3} and the densely connected community B = {v4,v5,v6} for a graph G.


In Figure 4 there are two communities: A = {v1,v2,v3} and B = {v4,v5,v6}. The vertices in each community are densely connected to the other vertices in that community (see section 2.2), and the only connection between them is the edge e(v2,v4). It can be seen that if the edge e(v2,v4) is removed, two disconnected sub-networks are formed. Disconnected sub-networks are considered as separate communities in the graph. Disconnected sub-networks¹⁷ within a literary text have some interesting implications, or rather properties. The entities in the disconnected community are a part of the story that is told but are not connected to it explicitly. Rather, there must be some non-textual connection between the entities in the disconnected community and the entities in the other communities.

The modularity (i.e. quality of community partitioning) of a graph partitioned into communities is given by the following equation (Blondel et al., 2008):

$$Q = \frac{1}{2m} \sum_{i,j \in C} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(C_i, C_j) \tag{2}$$

Where $A_{ij}$ is the weight between vertices i and j, and $k_i = \sum_j A_{ij}$ is the sum of all the weights for node i. The function $\delta(C_i, C_j) = 1$ iff $C_i = C_j$, else $\delta = 0$, and $m = \frac{1}{2} \sum_{ij} A_{ij}$ (Blondel et al., 2008).

In short, modularity compares the weight of the edges present within the communities with the weight expected within the same communities if edges were placed at random, taking the difference between the two.
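As a worked illustration of equation (2), the sketch below computes Q for a small graph. The edge list assumes that the two communities of Figure 4 are fully connected triangles joined by the edge between v2 and v4, which is an assumption about the figure, and the function name `modularity` is hypothetical.

```python
from collections import defaultdict


def modularity(weights, community):
    """Modularity Q of a partition of an undirected weighted graph.

    weights: dict {(i, j): weight} with each undirected edge listed once.
    community: dict mapping each vertex to its community label.
    """
    m = sum(weights.values())            # m = 1/2 * sum_ij A_ij
    degree = defaultdict(float)          # k_i = sum_j A_ij
    for (i, j), w in weights.items():
        degree[i] += w
        degree[j] += w
    total = 0.0
    # A_ij term: an undirected edge appears as both A_ij and A_ji.
    for (i, j), w in weights.items():
        if community[i] == community[j]:
            total += 2 * w
    # Expected term k_i * k_j / 2m over all ordered pairs in one community.
    for i in degree:
        for j in degree:
            if community[i] == community[j]:
                total -= degree[i] * degree[j] / (2 * m)
    return total / (2 * m)


edges = {("v1", "v2"): 1, ("v1", "v3"): 1, ("v2", "v3"): 1,
         ("v2", "v4"): 1,
         ("v4", "v5"): 1, ("v4", "v6"): 1, ("v5", "v6"): 1}
partition = {"v1": "A", "v2": "A", "v3": "A",
             "v4": "B", "v5": "B", "v6": "B"}
q = modularity(edges, partition)
print(round(q, 4))
```

With seven edges, six of them inside a community, and a degree sum of 7 in each community, the value works out to (12 − 7) / 14 = 5/14.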

4.5 Network analysis

To measure the relations extracted from the text, three centrality measurements will be used. The main intuition behind centrality measurements is to capture and identify the most important entities (vertices) in a network, but the "most important" entity is a notion that is ambiguous¹⁸.

In this thesis degree, betweenness and eigenvector centrality will be used. The degree centrality is based on the number of other entities directly tied to entity i. The betweenness centrality is based on the number of shortest paths entity i is a member of. Finally, the eigenvector centrality is based on the weights of the entities tied directly to entity i.

4.5.1 Degree centrality

The first centrality measurement to be used is the degree centrality, denoted as $C_{D_i}$ for vertex i (Sutton and McCallum, 2012).

$$C_{D_i} = \sum_{j=1}^{n} a_{ij} \tag{3}$$

Importance based on degree centrality places the vertex with the highest degree as the most important vertex in the network, i.e. the entity connected to the highest number of other entities. This measurement is simple, but it gives a clear picture of the connectedness in the first neighborhood of each vertex.

Degree centrality gives a measurement of the number of other vertices a vertex is connected to. This measurement will thus be interpreted as a sort of "acquaintance size". If the degree centrality of vertices is to be compared across different networks, or with sub-networks within the total network, the degree centrality is normalized by dividing the centrality by n−1 (Freeman, 1978), which is the total number of vertices in the current network minus the current vertex (Sutton and McCallum, 2012).

$$C^{*}_{D_i} = \frac{\sum_{j=1}^{n} a_{ij}}{n-1} \tag{4}$$
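Both the raw and the normalized degree centrality can be computed directly from an adjacency structure. The sketch assumes an unweighted graph stored as a dictionary of neighbour sets; the example graph (two triangles joined by one edge) is made up for illustration.

```python
def degree_centrality(graph, normalized=False):
    """Degree of every vertex, optionally divided by n - 1.

    graph: dict mapping each vertex to the set of its neighbours.
    """
    n = len(graph)
    result = {}
    for vertex, neighbours in graph.items():
        degree = len(neighbours)
        if normalized:
            degree = degree / (n - 1) if n > 1 else 0.0
        result[vertex] = degree
    return result


graph = {
    "v1": {"v2", "v3"}, "v2": {"v1", "v3", "v4"}, "v3": {"v1", "v2"},
    "v4": {"v2", "v5", "v6"}, "v5": {"v4", "v6"}, "v6": {"v4", "v5"},
}
raw = degree_centrality(graph)
norm = degree_centrality(graph, normalized=True)
print(raw["v2"], norm["v2"])
```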

¹⁷ More commonly known as connected components within Graph Theory.

¹⁸ See (Freeman, 1978).

4.5.2 Betweenness centrality

Betweenness centrality is defined as the number of shortest paths passing through vertex v, which is shown in equation (5) (Brandes, 2008).

$$C_B(v) = \sum_{s,t \in V} \frac{\sigma(s,t \mid v)}{\sigma(s,t)} \tag{5}$$

Here σ(s, t) represents the number of shortest paths between s and t, and σ(s, t|v) the number of those shortest paths that traverse vertex v. The betweenness centrality aims to measure the importance of a vertex for the 'flow' in the network. For example, if A is a train hub connecting two parts of a country and A were to be destroyed by an asteroid controlled by an evil scientist from another planet, this would greatly disturb the flow of trains in the railway network. We would thus identify A as an important and central point in the network.

4.5.3 Eigenvector centrality

A more complex centrality measurement is the eigenvector centrality. The intuition behind this measurement is to also include the importance of the vertices the original vertex is connected to. Thus, vertices with important connections are more important than vertices with less important connections, which makes sense intuitively. This measurement depends on a matrix representing the graph in question. The eigenvector centrality, denoted $C_{E_i}(v)$, for vertex i is a measurement proportional to the sum of the centralities of all nodes connected to it (Kolaczyk, 2009).

$$C_{E_i}(v) = \alpha \sum_{\{u,v\} \in E} C_{E_i}(u) \tag{6}$$

The Google PageRank (Page et al., 1999) and Katz centrality (Katz, 1953), which are widely used, are both based on the eigenvector centrality.
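One common way to obtain values satisfying equation (6) is power iteration. The sketch below is an assumption about how this could be done, not a description of the software used in the thesis. It assumes a connected, undirected graph given as an adjacency dictionary, adds each vertex's own score in every step (which leaves the dominant eigenvector unchanged but avoids oscillation), and scales the result so that the highest score is 1. The example graph is hypothetical.

```python
def eigenvector_centrality(graph, iterations=200, tol=1e-12):
    """Approximate eigenvector centrality by power iteration on A + I."""
    scores = {v: 1.0 for v in graph}
    for _ in range(iterations):
        # New score: own score plus the scores of all neighbours.
        updated = {v: scores[v] + sum(scores[u] for u in graph[v]) for v in graph}
        largest = max(updated.values())
        updated = {v: value / largest for v, value in updated.items()}
        change = max(abs(updated[v] - scores[v]) for v in graph)
        scores = updated
        if change < tol:
            break
    return scores


# Hypothetical network: a, b and c know each other, d only knows c.
friends = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}
centrality = eigenvector_centrality(friends)
print(centrality)
```

Vertex c scores highest because it has the most neighbours, and d scores lowest even though its single neighbour is the most central vertex.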

4.6 Evaluation

Usually co-reference systems are evaluated on metrics such as MUC, B3 or CEAF (Cai and Strube, 2010). However, these metrics assume that the system tries to solve more than pronoun-antecedent pairs, which the current system does not. The method used for evaluating the current system is an interpretation of precision, recall, and F-score constructed to capture the performance of the task performed.

4.6.1 Pronoun resolution

As input, the pronoun resolution system takes a pronoun and the context preceding the pronoun. For each input, the system returns a set of candidates that have scores assigned to them. The most likely antecedent is the candidate with the highest score.

Precision and recall are calculated from the classification of the system output. There are four classification categories: true positive (TP, correct positive prediction), false positive (FP, incorrect positive prediction), false negative (FN, incorrect negative prediction) and true negative (TN, correct negative prediction).

RECALL: Recall describes the relationship between the correct positive predictions and the negative predictions of positive instances.

$$\frac{TP}{TP + FN} \tag{7}$$

PRECISION: Precision describes the relationship between the amount of correct positive predictions and the amount of incorrect positive predictions.

$$\frac{TP}{TP + FP} \tag{8}$$


F-SCORE: F-score is the harmonic mean of precision and recall.

$$\frac{2 \times Precision \times Recall}{Precision + Recall} \tag{9}$$

When evaluating against several documents, the performance of the system for all documents is presented as the micro-averaged precision and recall. The micro-averaged precision and recall are obtained by summing up the confusion matrices for each document and calculating the precision, recall, and F-score on the new confusion matrix.
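The micro-averaging procedure can be sketched as follows. The per-document counts are hypothetical and only serve to show the arithmetic; the function names are likewise illustrative and are not taken from the thesis code.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-score from confusion counts (equations 7 to 9)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


def micro_average(documents):
    """Sum the per-document counts first, then compute the metrics once."""
    tp = sum(doc["tp"] for doc in documents)
    fp = sum(doc["fp"] for doc in documents)
    fn = sum(doc["fn"] for doc in documents)
    return precision_recall_f(tp, fp, fn)


# Hypothetical counts for two documents of very different size.
documents = [
    {"tp": 8, "fp": 2, "fn": 0},
    {"tp": 1, "fp": 1, "fn": 1},
]
micro_p, micro_r, micro_f = micro_average(documents)
print(micro_p, micro_r, micro_f)
```

Because the counts are summed before the metrics are computed, large documents influence the result more than small ones, which is the intended effect of micro-averaging.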

Two types of evaluation will be performed. The first type, candidate selection (CS), is the evaluation of the process of selecting eligible candidates from the set of preceding sentences. The second type, antecedent selection (AS), is the evaluation of the process of assigning scores to the selected candidates.

The process for classifying the output and calculating the scores for each evaluation type is presented below.

4.6.2 Candidate selection

The classification of the output for candidate selection goes as follows. TP is the case when the correct antecedent is a member of the selected candidates. FP is the case when the correct antecedent is not in the set of selected candidates, and the correct antecedent is not in the preceding k sentences of the pronoun. This means that if it is impossible to select the correct antecedent from the set of candidates, the output is tagged as FP. FN is the case when the correct antecedent is a member of the set of preceding sentences but was not selected as a candidate. Examples of this can be seen in Table 7.

| CASE | CORRECT ANT. | CANDIDATES IN CONTEXT | SEL. CANDIDATES |
|---|---|---|---|
| TP | a | a, b, c, d, e, f | a, b, c, d |
| FP | a | b, c, d, e, f | b, c, d |
| FN | a | a, b, c, d, e, f | b, c, d |

Table 7: Candidate selection classification examples. These examples illustrate how different outputs result in different classifications. Correct ant. = Correct antecedent, Sel. candidates = Selected candidates.

The classifications in Table 7 follow the specifications described in Table 8.

| | OBSERVED a ∈ K | OBSERVED a ∉ K |
|---|---|---|
| PREDICTED a ∈ K | TP | FN |
| PREDICTED a ∉ K | FP | TN |

Table 8: Candidate selection (CS) classification matrix. a = named entity, K = Set of selected candidates.

In the candidate selection process there are two parameters that are evaluated, agreement (grammatical gender and number) and context size (the number of sentences preceding the pronoun, decided by the value of k).

In Table 7 it can be seen that for both FN and TP the correct antecedent is within the context of the pronoun. Since the metric that uses these values is recall, recall is a measurement of the agreement parameter.

Precision, the relationship between TP and FP, is thus a measurement of how appropriate the window size is. This follows from the fact that the classification FP is only used when the correct antecedent is not in the context of the pronoun, and the window size decides the context size.
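The candidate selection classification described in this section can be summarized as a small decision function. This is a sketch of the rules as stated in the text and in Table 7; the function name and the argument layout are assumptions made for illustration.

```python
def classify_candidate_selection(correct, context_candidates, selected):
    """Classify one pronoun for the candidate selection (CS) evaluation.

    correct            -- the gold-standard antecedent
    context_candidates -- all named entities in the k preceding sentences
    selected           -- the candidates that passed the constraints
    """
    if correct in selected:
        return "TP"  # the correct antecedent is among the selected candidates
    if correct in context_candidates:
        return "FN"  # it was available in the context but was filtered out
    return "FP"  # it was never in the context, so it could not be selected


# The three example rows of Table 7.
rows = [
    ("a", ["a", "b", "c", "d", "e", "f"], ["a", "b", "c", "d"]),
    ("a", ["b", "c", "d", "e", "f"], ["b", "c", "d"]),
    ("a", ["a", "b", "c", "d", "e", "f"], ["b", "c", "d"]),
]
labels = [classify_candidate_selection(*row) for row in rows]
print(labels)
```

Written this way, it is visible that FN depends only on the constraints (the antecedent was in the window but rejected), while FP depends only on the window size.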


4.6.3 Antecedent selection

The classification of the output for antecedent selection is done in the following manner. TP is the case when the correct antecedent is selected as the most likely antecedent. FP is the case when the most likely antecedent is not the correct antecedent. FN is the case when the correct antecedent is among the selected candidates but was not selected as the most likely antecedent. Table 9 shows examples of these three classifications.

| CASE | CORRECT ANT. | CANDIDATE NES | SEL. ANTECEDENT |
|---|---|---|---|
| TP | a | a, b, c, d | a |
| FP | a | b, c, d | c |
| FN | a | a, b, c, d | b |

Table 9: Antecedent selection classification examples. These examples illustrate how different outputs result in different classifications. Correct ant. = Correct antecedent, Candidate NEs = Named entities selected as candidates, Sel. antecedent = Selected antecedent (the candidate with the highest score, i.e. the most likely antecedent).

The classifications in Table 9 follow the specifications described in Table 10.

| | OBSERVED a = c | OBSERVED a ≠ c |
|---|---|---|
| PREDICTED a = c | TP | FN |
| PREDICTED a ≠ c | FP | TN |

Table 10: Antecedent selection (AS) classification matrix. c = most likely antecedent, a = selected antecedent.

Observing Table 7 and Table 9 it can be seen that antecedent selection true positives and false negatives (Table 9, Rows 1 and 3) are dependent on candidate selection true positives (Table 7, Row 1). In Table 9 a basic condition for a true positive is that it is possible to be correct. And it is only possible in the cases where the candidate selection classification of the pronoun is a true positive, i.e. the correct antecedent has been selected as a candidate. One may imagine that the true positives from candidate selection distribute themselves over antecedent selection true positives and false negatives. The candidate selection false positives and false negatives are summed up in antecedent selection false positives.

From this, it follows that the equation for antecedent selection precision is calculated using¹⁹:

$$precision_{AS} = \frac{TP_{CS} - FN_{AS}}{(TP_{CS} - FN_{AS}) + (FP_{CS} + FN_{CS})} \tag{10}$$

Since the candidate selection true positives distribute themselves over antecedent selection TPs and FNs, the antecedent selection precision numerator is $TP_{CS} - FN_{AS} = TP_{AS}$, and the denominator is thus $(TP_{CS} - FN_{AS}) + (FP_{CS} + FN_{CS})$, since the set of antecedent selection false positives is identical to the candidate selection set $FP \cup FN$. As is, the meaning and interpretation of antecedent selection precision is unclear. Modifying the equation by changing the denominator to $(TP_{AS} + FN_{AS} + FP_{CS} + FN_{CS})$ yields something a little more interesting. With this change, the denominator includes all the errors committed, both errors relating to the constraints and errors relating to the features. With the modified denominator, this amounts to a measurement which says something about the overall relationship between the correct and the incorrect selections for the system.
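The relationship between the two evaluation types can be made concrete with a short sketch. The classification function follows Table 9. The counts are not reported in the thesis; they are assumptions back-computed from the pn2pmne row of Table 11 and the total of 250 pronouns given in section 4.6.4. The modified denominator is read here as $TP_{AS} + FN_{AS} + FP_{CS} + FN_{CS}$, which is the reading under which these assumed counts reproduce the reported antecedent selection precision of 0,608.

```python
def classify_antecedent_selection(correct, candidates, selected):
    """Classify one pronoun for the antecedent selection (AS) evaluation."""
    if selected == correct:
        return "TP"  # the highest-scoring candidate is the correct antecedent
    if correct in candidates:
        return "FN"  # the correct antecedent was a candidate but lost the ranking
    return "FP"  # the correct antecedent was not among the candidates at all


def as_precision_eq10(tp_cs, fn_as, fp_cs, fn_cs):
    """Equation (10): TP_AS / (TP_AS + FP_CS + FN_CS)."""
    tp_as = tp_cs - fn_as
    return tp_as / (tp_as + fp_cs + fn_cs)


def as_precision_modified(tp_cs, fn_as, fp_cs, fn_cs):
    """Modified denominator (assumed reading): correct selections plus all errors."""
    tp_as = tp_cs - fn_as
    return tp_as / (tp_as + fn_as + fp_cs + fn_cs)


def as_recall(tp_cs, fn_as):
    """TP_AS / (TP_AS + FN_AS): success rate when the antecedent was available."""
    return (tp_cs - fn_as) / tp_cs


# The three example rows of Table 9.
table9 = [
    classify_antecedent_selection("a", ["a", "b", "c", "d"], "a"),
    classify_antecedent_selection("a", ["b", "c", "d"], "c"),
    classify_antecedent_selection("a", ["a", "b", "c", "d"], "b"),
]

# Assumed counts (not from the thesis): 250 pronouns, 200 CS true positives,
# 50 CS false positives, 0 CS false negatives, 48 AS false negatives.
counts = dict(tp_cs=200, fn_as=48, fp_cs=50, fn_cs=0)
print(table9)
print(as_precision_eq10(**counts))
print(as_precision_modified(**counts))
print(as_recall(counts["tp_cs"], counts["fn_as"]))
```

Under these assumptions the modified precision is simply the share of all pronouns that were resolved correctly, whereas recall only considers the pronouns for which the correct antecedent was a candidate.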

¹⁹ AS = Antecedent selection and CS = Candidate selection.


Recall, on the other hand, displays the system's ability to select the most likely antecedent when it is available as an antecedent choice (i.e. in the set of selected candidates). This evaluates the feature set implemented, as the features selected are the deciding factor when selecting the most likely candidate.

It should be noted that the main difference between these two classification systems is that antecedent selection checks for identity (Table 10) between the antecedent and the most likely antecedent. The candidate selection checks for membership (Table 8) in the set of candidates and ignores the actual scoring of the different candidates.

The reason for classifying the whole set of candidates rather than each individual candidate is that ranking candidates individually would create an equivalence between precision and recall. For a pronoun with a named entity referent, if each candidate selected was evaluated, there would be a false negative for each false positive, and conversely a false positive for each false negative. This is because if the correct antecedent is not selected, an incorrect antecedent has been selected, which in turn means that the correct antecedent was missed, which is a false negative. If for each false positive there logically follows a false negative, the distinction between recall and precision vanishes.

4.6.4 Data

The pronoun resolution system is evaluated against SUC-CORE (Nilsson Björkenstam, 2013). SUC-CORE is a subset of SUC2, containing 20658 tokens, annotated with co-reference between definite descriptions, named entities, and pronouns. The data is divided into two genres, informative prose: press reportage and scientific writings, and imaginative prose: different kinds of novels such as romance, crime, and general fiction. The pronoun-antecedent pairs from SUC-CORE will be considered as a gold standard for pronoun resolution.

For each pronoun-antecedent pair in the gold standard, the corresponding pronoun-antecedent pair generated by the system is compared to it. For each comparison, the procedure described in the previous sections is used to classify the output. The results are presented for both candidate and antecedent selection, individually on each document in SUC-CORE and on the complete dataset.

When testing the system, only pronoun-antecedent pairs where the antecedent is a proper name are considered. In the social network, the relations between named entities are only captured if the named entity is mentioned by a proper name. As such, generating non-proper name antecedents has no purpose in the social network.

SUC-CORE contains in total 250 pronouns with proper name mentions as antecedents; 31 of those are in the informative prose genre²⁰ and the remaining 219 are in the imaginative prose genre²¹. The difference in sizes makes a comparison between the genres rather difficult (see section 6.1, Table 11). Two of the documents (ea12.conll and ja06.conll) contain no pronouns, and two other documents (ea10.conll and ba07.conll) contain very few. This makes calculating averages unreliable, as there are many extreme values. With this in mind, a comparison between the two genres will not be done.

4.6.5 Social network

The evaluation of the social network extraction is a difficult task, as there does not exist any quantitative, established metric for the evaluation. This would indicate that the evaluation must be of a qualitative nature.

The optimal setup for a qualitative evaluation is for a person who is familiar with the work to investigate the networks produced and judge the results. A difficulty with this analysis, and with evaluation in general, is that the notion of importance is ambiguous. This is illustrated by the various ways of calculating importance (the different centrality measurements used in this thesis and other variations such as closeness, Katz centrality, and Google PageRank all capture different "things" that are considered important). The ambiguity of importance is also dealt with extensively in (Freeman, 1978).

²⁰ Distribution of pronouns among the documents: aa05(15), aa09(11), ba07(1), ea10(1), ea12(0), ja06(0)
²¹ Distribution of pronouns among the documents: kk14(22), kk44(78), kn08(17), kl07(77)


The optimal qualitative setup, someone familiar with the work, is not possible within the scope of the current thesis. However, an independent interpretation of the importance of the characters exists for both Röda Rummet and Hemsöborna. The novels have been adapted as TV series, Röda Rummet (1970)²² and Hemsöborna (1966)²³, and for each series a (character-actor) listing exists. Precisely how this list is constructed is unclear. The characters are listed in a way which seems to put the more central figures higher up in the list. Most importantly, the list is not arranged alphabetically by actor or character, or by some other arbitrary ranking, which means that there is some logic to the ordering of the list.

The evaluation performed will extract the named entities in the text corresponding to the characters listed for each TV series. All characters with a proper name in Hemsöborna have been selected (e.g. "The maid" has not been selected), and the top 10 characters with proper names have been selected from Röda Rummet. For each centrality measurement, the selected characters' positions are compiled into a list, which is compared to the list generated from Wikipedia using Spearman's rho. Spearman's rho is a variation of Pearson's r: while the Pearson correlation is based on a linear function to describe the correlation, Spearman's rho is based on arbitrary monotonic rank-preserving functions to describe the correlation (Hauke and Kossowski, 2011).
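The comparison can be sketched with the rank-difference form of Spearman's rho, which is valid when there are no ties. The ranks below are the d3 degree and betweenness columns for Hemsöborna without pronoun resolution, as listed in Table 14. The re-ranking step, which maps positions among all entities to positions among the selected characters only, is an assumption about how the lists were prepared, and the function names are illustrative.

```python
def rerank(values):
    """Replace each value by its position (1 = smallest) among the given values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def spearman_rho(reference, observed):
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)); assumes no ties."""
    n = len(reference)
    ref_ranks, obs_ranks = rerank(reference), rerank(observed)
    d_squared = sum((r - o) ** 2 for r, o in zip(ref_ranks, obs_ranks))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Order of the character-actor listing (Carlsson first, Ida last).
listing = [1, 2, 3, 4, 5, 6, 7, 8, 9]
# Ranks from Table 14, distance 3, without pronoun resolution.
degree_d3 = [1, 6, 2, 5, 3, 4, 7, 9, 8]
betweenness_d3 = [1, 10, 2, 9, 8, 11, 14, 12, 13]

rho_degree = spearman_rho(listing, degree_d3)
rho_betweenness = spearman_rho(listing, betweenness_d3)
print(round(rho_degree, 4), round(rho_betweenness, 4))
```

With these inputs the sketch gives values of roughly 0.767 and 0.833, in line with the first row of Table 16.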

It should be noted that the rankings obtained from the independent interpretation are not a gold standard for the character importances. It is but one interpretation of the importance of the different characters. The result of the evaluation will only be able to say something about the relationship between the human and the computer generated social rankings. As with text interpretations, the subjective judgment of characters is just that, subjective. The program simply outputs another (computational) interpretation of the characters, where the experimenter decides which features to explore and consider important.

²² https://sv.wikipedia.org/wiki/R%C3%B6da_rummet_(TV-serie)
²³ https://sv.wikipedia.org/wiki/Hems%C3%B6borna_(1966)


5 Results

5.1 Pronoun resolution

Due to the setup of the input and outputs of the program, the candidate selection recall has nothing to capture. Candidate selection recall is intended to capture the performance of the constraints (number and gender agreement). However, proper names are considered singular, and all pronouns that the system accepts are singular. Furthermore, proper names have no gender assigned to them, which means that either (1) no candidates are generated (∅ ≠ {UTURUM}) or (2) all proper names within the search range are considered candidates.

In section 2.5.3. the linear-k was set at 3. In Figure 5 the performance of the system is tracked as k increases from 1 to 9. Figures 5.a and 5.b show the performance of the system when measuring the micro-averaged precision, recall, and F-score.

[Figure 5 plots: a) Antecedent Selection and b) Candidate Selection. Each panel shows Precision, Recall and F-Score (y-axis: Score) against Range (x-axis: 1 to 9).]

Figure 5: Pronoun Resolution. Micro-averaged precision, recall and F-score as the linear-k increases from 1 to 9.

The micro-averaged precision, recall and F-score for the informative and imaginative prose individually (Table 12) and combined (Table 11) are presented below. The pn2pmne²⁴ system is compared to the baseline system and a version of pn2pmne that includes the problematic features identified in (Holen, 2007).

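| GENRE | MEASUREMENT | ANTECEDENT SELECTION PR | ANTECEDENT SELECTION RE | ANTECEDENT SELECTION F-SCORE | CANDIDATE SELECTION PR | CANDIDATE SELECTION RE | CANDIDATE SELECTION F-SCORE |
|---|---|---|---|---|---|---|---|
| IMAG | Total | 0,732 | 0,788 | 0,759 | 0,776 | 1,000 | 0,874 |
| INFO | Total | 0,947 | 0,600 | 0,734 | 0,967 | 1,000 | 0,983 |
| SYSTEM | MEASUREMENT | PR | RE | F-SCORE | PR | RE | F-SCORE |
| pn2pmne | Total | 0,608 | 0,760 | 0,675 | 0,800 | 1,000 | 0,888 |
| Baseline | Total | 0,264 | 0,349 | 0,300 | 0,756 | 1,000 | 0,861 |
| pn2pmne* | Total | 0,512 | 0,653 | 0,573 | 0,784 | 1,000 | 0,878 |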

Table 11: Pronoun Resolution. k = 4. Performance of the systems described in 4.3.2. and 4.3.3. pn2pmne* is the system that includes the features identified as problematic in (Holen, 2007).

²⁴ The designated name for the system used.


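| GENRE | DOCUMENT | ANTECEDENT SELECTION PR | ANTECEDENT SELECTION RE | ANTECEDENT SELECTION F-SCORE | CANDIDATE SELECTION PR | CANDIDATE SELECTION RE | CANDIDATE SELECTION F-SCORE |
|---|---|---|---|---|---|---|---|
| INFORMATIVE PROSE | | | | | | | |
| Press: Reportage (15) | aa05 | 0,400 | 0,428 | 0,413 | 0,933 | 1,000 | 0,965 |
| Press: Reportage (11) | aa09 | 0,727 | 0,727 | 0,727 | 1,000 | 1,000 | 1,000 |
| Press: Editorials (1) | ba07 | - | - | - | 1,000 | 1,000 | 1,000 |
| Skills/hobbies: Design (4) | ea10 | 1 | 1 | 1 | 1 | 1 | 1 |
| Skills/hobbies: Biology (0) | ea12 | - | - | - | - | - | - |
| Scientific writings: Hum (0) | ja06 | - | - | - | - | - | - |
| IMAGINATIVE PROSE | | | | | | | |
| Fiction (32) | kk14 | 0,437 | 0,608 | 0,509 | 0,718 | 1,000 | 0,836 |
| Fiction (78) | kk44 | 0,871 | 0,971 | 0,918 | 0,897 | 1,000 | 0,945 |
| Romance (92) | kn08 | 0,705 | 0,750 | 0,727 | 0,941 | 1,000 | 0,969 |
| Crime (77) | kl07 | 0,434 | 0,655 | 0,522 | 0,663 | 1,000 | 0,797 |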

Table 12: Pronoun Resolution: Evaluation using k = 4. Performance of the pronoun resolution system against SUC-CORE. Pr = Precision. Re = Recall. Fiction(22), kk14, ... indicates that the document kk14.conll contains 22 pronouns with named entity antecedents.

5.2 Social network extraction

The number of named entity co-occurrences, shared keywords and dependency relations are shown in Table 13. The "total" column is the total weight of all the edges in the social network generated.

| BOOK | DISTANCE | CO-OCCURRENCE | KEYWORD | DEPENDENCY | TOTAL |
|---|---|---|---|---|---|
| Röda Rummet | 3 | 1582 | 136 | 48 | 1766 |
| Röda Rummet | 6 | 3151 | 424 | 48 | 3623 |
| Röda Rummet | 9 | 4557 | 777 | 48 | 5382 |
| Hemsöborna | 3 | 1077 | 202 | 15 | 1294 |
| Hemsöborna | 6 | 2239 | 460 | 15 | 2714 |
| Hemsöborna | 9 | 3302 | 659 | 15 | 3976 |

Table 13: Named Entity Frequency. The distribution of named entity co-occurrences, shared keywords and dependency relations (connections) in the networks without pronoun resolution.

For both Hemsöborna and Röda Rummet, two different types of social networks were extracted. The first type of network only accepts named entities that are mentioned with proper names. The second type of network also considers named entities hidden behind pronouns.

The centrality rankings for the social network without pronoun resolution are presented in Table 14, and those for the social network with pronoun resolution can be seen in Table 15 for Hemsöborna.


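| RANK | CHARACTER | DEGREE d3 | DEGREE d6 | DEGREE d9 | BETWEENNESS d3 | BETWEENNESS d6 | BETWEENNESS d9 | EIGENVECTOR d3 | EIGENVECTOR d6 | EIGENVECTOR d9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Carlsson | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | Flod | 6 | 8 | 8 | 10 | 8 | 10 | 7 | 7 | 7 |
| 3 | Gusten | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| 4 | Rundqvist | 5 | 3 | 3 | 9 | 3 | 3 | 4 | 4 | 4 |
| 5 | Norman | 3 | 4 | 4 | 8 | 4 | 4 | 3 | 3 | 3 |
| 6 | Clara | 4 | 5 | 5 | 11 | 6 | 5 | 6 | 6 | 6 |
| 7 | Lotten | 7 | 6 | 6 | 14 | 5 | 7 | 8 | 8 | 8 |
| 8 | Norström | 9 | 10 | 9 | 12 | 9 | 6 | 9 | 9 | 9 |
| 9 | Ida | 8 | 7 | 7 | 13 | 7 | 9 | 5 | 5 | 5 |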

Table 14: Comparison of character rankings from Wikipedia and the result of degree, betweenness and eigenvector centrality on Hemsöborna with distances 3, 6 and 9. The number of entities in Hemsöborna was 50.

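| RANK | CHARACTER | DEGREE d3 | DEGREE d6 | DEGREE d9 | BETWEENNESS d3 | BETWEENNESS d6 | BETWEENNESS d9 | EIGENVECTOR d3 | EIGENVECTOR d6 | EIGENVECTOR d9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Carlsson | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | Flod | 6 | 8 | 9 | 9 | 9 | 10 | 7 | 7 | 7 |
| 3 | Gusten | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| 4 | Rundqvist | 4 | 3 | 3 | 5 | 3 | 3 | 4 | 4 | 4 |
| 5 | Norman | 3 | 4 | 4 | 7 | 4 | 4 | 3 | 3 | 3 |
| 6 | Clara | 5 | 5 | 5 | 10 | 6 | 5 | 6 | 6 | 6 |
| 7 | Lotten | 8 | 6 | 6 | 13 | 5 | 6 | 9 | 8 | 8 |
| 8 | Norström | 9 | 9 | 8 | 12 | 8 | 7 | 8 | 9 | 9 |
| 9 | Ida | 7 | 7 | 7 | 11 | 7 | 9 | 5 | 5 | 5 |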

Table 15: Comparison of character rankings from Wikipedia and the result of degree, betweenness and eigenvector centrality on Hemsöborna with pronoun resolution and distances 3, 6 and 9. The number of entities in Hemsöborna was 50.

The correlations between the generated rankings in Tables 14 and 15 and the rankings obtained from the character-actor listings are presented in Table 16 for Hemsöborna.


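| SETUP | DISTANCE | DEGREE rs | DEGREE p | BETWEENNESS rs | BETWEENNESS p | EIGENVECTOR rs | EIGENVECTOR p |
|---|---|---|---|---|---|---|---|
| -A | 3 | 0,766 | 0,015 | 0,833 | 0,005 | 0,599 | 0,087 |
| +A | 3 | 0,766 | 0,015 | 0,833 | 0,005 | 0,583 | 0,099 |
| -A | 6 | 0,616 | 0,076 | 0,599 | 0,087 | 0,599 | 0,087 |
| +A | 6 | 0,616 | 0,076 | 0,500 | 0,170 | 0,599 | 0,087 |
| -A | 9 | 0,616 | 0,076 | 0,516 | 0,154 | 0,599 | 0,087 |
| +A | 9 | 0,516 | 0,154 | 0,533 | 0,139 | 0,599 | 0,087 |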

Table 16: Spearman rho (rs) correlation between Wikipedia character ranks and degree, betweenness and eigenvector centrality rank for the characters in Hemsöborna. Dist = Distance, -A = Without pronoun resolution and +A = With pronoun resolution. p = p-value. Bold font indicates the maximum score for each centrality measurement.

The social network graph²⁵ for Hemsöborna using k = 3 is presented in Figure 6. The graph was produced by excluding pronouns and is sorted by communities.

Figure 6: Social network without pronouns for Hemsöborna. Sorted by community. k = 3. Edges with low weights are not shown for readability.

The centrality rankings for the social network without pronoun resolution are presented in Table 17, and those for the social network with pronoun resolution can be seen in Table 18 for Röda Rummet.

²⁵ The graphs are produced using the software Gephi (https://gephi.org/).


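| RANK | CHARACTER | DEGREE d3 | DEGREE d6 | DEGREE d9 | BETWEENNESS d3 | BETWEENNESS d6 | BETWEENNESS d9 | EIGENVECTOR d3 | EIGENVECTOR d6 | EIGENVECTOR d9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Arvid Falk | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | Olle Montanus | 4 | 4 | 3 | 6 | 9 | 7 | 5 | 5 | 5 |
| 3 | Sellén | 8 | 8 | 7 | 8 | 8 | 13 | 6 | 6 | 6 |
| 4 | Lundell | 3 | 3 | 5 | 4 | 5 | 9 | 4 | 2 | 2 |
| 5 | Ygberg | 13 | 9 | 10 | 27 | 17 | 19 | 3 | 3 | 3 |
| 6 | Rehnhjelm | 10 | 11 | 13 | 17 | 18 | 21 | 12 | 12 | 13 |
| 7 | Dr. Borg | 12 | 19 | 18 | 20 | 23 | 22 | 10 | 9 | 9 |
| 8 | Levi | 18 | 20 | 20 | 32 | 36 | 50 | 7 | 8 | 8 |
| 9 | Struve | 2 | 2 | 2 | 5 | 7 | 4 | 2 | 4 | 4 |
| 10 | Nicolaus | 50 | 24 | 35 | 56 | 42 | 60 | 22 | 26 | 29 |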

Table 17: Comparison of character rankings from Wikipedia and the result of degree, betweenness and eigenvector centrality on Röda Rummet with distances 3, 6 and 9. The number of entities in Röda Rummet was 142.

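| RANK | CHARACTER | DEGREE d3 | DEGREE d6 | DEGREE d9 | BETWEENNESS d3 | BETWEENNESS d6 | BETWEENNESS d9 | EIGENVECTOR d3 | EIGENVECTOR d6 | EIGENVECTOR d9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Arvid Falk | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | Olle Montanus | 4 | 3 | 4 | 2 | 5 | 10 | 5 | 5 | 5 |
| 3 | Sellén | 10 | 10 | 10 | 18 | 12 | 13 | 6 | 6 | 6 |
| 4 | Lundell | 6 | 4 | 3 | 7 | 7 | 5 | 2 | 2 | 2 |
| 5 | Ygberg | 13 | 9 | 12 | 20 | 15 | 18 | 4 | 3 | 3 |
| 6 | Rehnhjelm | 7 | 11 | 9 | 9 | 10 | 10 | 16 | 18 | 16 |
| 7 | Dr. Borg | 9 | 15 | 15 | 22 | 21 | 21 | 18 | 17 | 19 |
| 8 | Levi | 16 | 20 | 18 | 37 | 41 | 35 | 10 | 10 | 13 |
| 9 | Struve | 2 | 2 | 2 | 5 | 8 | 3 | 3 | 4 | 4 |
| 10 | Nicolaus | 54 | 27 | 27 | 63 | 47 | 57 | 33 | 26 | 27 |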

Table 18: Comparison of character rankings from Wikipedia and the result of degree, betweenness and eigenvector centrality on Röda Rummet with pronoun resolution and distances 3, 6 and 9. The number of entities in Röda Rummet was 142.

The correlations between the generated rankings in Tables 17 and 18 and the rankings obtained from the character-actor listings are presented in Table 19 for Röda Rummet.


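| SETUP | DISTANCE | DEGREE rs | DEGREE p | BETWEENNESS rs | BETWEENNESS p | EIGENVECTOR rs | EIGENVECTOR p |
|---|---|---|---|---|---|---|---|
| -A | 3 | 0,587 | 0,073 | 0,648 | 0,045 | 0,503 | 0,138 |
| +A | 3 | 0,612 | 0,059 | 0,684 | 0,028 | 0,587 | 0,073 |
| -A | 6 | 0,624 | 0,053 | 0,672 | 0,033 | 0,624 | 0,053 |
| +A | 6 | 0,624 | 0,053 | 0,793 | 0,006 | 0,624 | 0,053 |
| -A | 9 | 0,648 | 0,042 | 0,648 | 0,042 | 0,624 | 0,053 |
| +A | 9 | 0,551 | 0,098 | 0,527 | 0,117 | 0,636 | 0,047 |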

Table 19: Spearman rho (rs) correlation between Wikipedia character ranks and degree, betweenness and eigenvector centrality ranking for the characters in Röda Rummet. Dist = Distance, -A = Without pronoun resolution and +A = With pronoun resolution. p = p-value. Bold font indicates the maximum score for each centrality measurement.

The social network graph for Röda Rummet using k = 9 is presented in Figure 7. The graph was produced by excluding pronouns and sorting based on communities, with disconnected sub-networks appearing above the main area.

The results section concludes with Table 20, which describes the following properties: k, average degree, average weighted degree, density, modularity, communities, and sub-networks for each graph associated with a social network.

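| BOOK | k | A | DEGREE | WDEGREE | DENSITY | MODULARITY | C | DSN |
|---|---|---|---|---|---|---|---|---|
| Hemsöborna | 3 | F | 4,773 | 52,455 | 0,111 | 0,019 | 2 | 1 |
| Hemsöborna | 3 | T | 5,136 | 147,773 | 0,119 | 0,005 | 1 | 1 |
| Hemsöborna | 6 | F | 6,625 | 99,667 | 0,141 | 0,000 | 1 | 1 |
| Hemsöborna | 6 | T | 6,883 | 296,250 | 0,145 | 0,000 | 1 | 1 |
| Hemsöborna | 9 | F | 7,720 | 140,960 | 0,158 | 0,001 | 2 | 1 |
| Hemsöborna | 9 | T | 7,920 | 444,960 | 0,162 | 0,000 | 1 | 1 |
| Röda Rummet | 3 | F | 4,419 | 12,435 | 0,036 | 0,429 | 17 | 8 |
| Röda Rummet | 3 | T | 5,145 | 31,855 | 0,042 | 0,440 | 15 | 7 |
| Röda Rummet | 6 | F | 5,901 | 20,845 | 0,042 | 0,414 | 12 | 5 |
| Röda Rummet | 6 | T | 6,521 | 57,423 | 0,046 | 0,405 | 12 | 5 |
| Röda Rummet | 9 | F | 6,934 | 27,855 | 0,046 | 0,386 | 11 | 5 |
| Röda Rummet | 9 | T | 7,618 | 83,961 | 0,050 | 0,380 | 10 | 4 |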

Table 20: Graph Properties: k = search range, A = Pronoun resolution (true/false), Degree = Average degree, wDegree = Average weighted degree, C = Number of communities, DSN = Disconnected sub-networks.
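For readers unfamiliar with these graph properties, the following sketch shows how average degree, average weighted degree and density are commonly defined for an undirected weighted graph. It is assumed, not confirmed by the text, that the values in Table 20 follow these standard definitions. The edge list is hypothetical.

```python
def graph_properties(edges):
    """Average degree, average weighted degree and density of an undirected graph.

    `edges` is a list of (vertex, vertex, weight) triples with no duplicates
    and no self-loops. Vertices without edges are not represented.
    """
    vertices = {v for a, b, _ in edges for v in (a, b)}
    n, m = len(vertices), len(edges)
    total_weight = sum(weight for _, _, weight in edges)
    average_degree = 2 * m / n  # every edge contributes to two vertices
    average_weighted_degree = 2 * total_weight / n
    density = 2 * m / (n * (n - 1)) if n > 1 else 0.0  # share of possible edges
    return average_degree, average_weighted_degree, density


# Hypothetical weighted network with four characters.
edges = [("a", "b", 3), ("a", "c", 1), ("b", "c", 2), ("c", "d", 1)]
avg_degree, avg_wdegree, density = graph_properties(edges)
print(avg_degree, avg_wdegree, density)
```

Under these definitions density equals the average degree divided by n−1, so a dense graph is one in which most characters are directly connected to most others.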


Figure 7: Social network without pronouns for Röda Rummet. k = 6. Sorted circularly by communities (color of node), with disconnected sub-networks (which are also communities) appearing at the top (light blue, orange, dark red and light yellow). Edges with low weights are not shown for readability.


6 Discussion

6.1 Pronoun resolution

In section 4.6.2. the candidate selection interpretation of precision is: How often is the correct antecedent within the context of its pronoun? From Table 11 it can be seen that in 80% of the cases (Table 11, Row 3, Column 6), the correct antecedent is in the search range.

A perhaps problematic property of SUC-CORE is that the texts are excerpts or selections of short texts (Källgren, 2006). As such, there are instances when two neighboring sentences are not connected, which in turn means that the neighboring sentence most likely does not contain the correct antecedent. This means that at some point, adding extra sentences to the context will only add redundant candidates.

The antecedent selection classifications are dependent on the candidate selection classification and must be interpreted in relation to it. As seen in section 4.6.3., the antecedent selection precision is a proportion between the correct selections and all the errors. It thus seems as if in ≈ 60% of the cases the program finds the correct antecedent (Table 11). However, as also seen in 4.6.3., the candidate selection true positives set the upper bound for the number of possible correct selections. To get an idea of the actual performance of the features used to score the candidates, the antecedent selection recall is used. The antecedent selection recall obtained does show that the features are rather successful at identifying the correct antecedent when it is possible. The results presented in Table 11 (Row 3, Column 5) show that in 76% of the cases, the correct antecedent is selected.

To determine the performance of the system in general, one would have to compare it to other similar systems. In general, this is not an easy task, as many systems presented in the literature do not use publicly available data. Furthermore, those systems would have to be implemented and tested against SUC-CORE. In addition to these difficulties, the current system also has to deal with the following differences:

(1) Other co-reference systems do more than just resolve 3rd person anaphoric pronouns. Tasks relating to definite descriptions and clusters have not been explored in this thesis.

(2) Generally, partially solved co-reference chains are counted as true positives with a certain weight assigned to reflect that they are only partially resolved. The current system, however, does not attempt to solve co-reference chains and is thus not evaluated on them.

In spite of these issues, a superficial comparison between the current system and others can be made. Table 21 shows the performance of related systems previously discussed and the performance of the hybrid method developed in (Nilsson, 2010).

| SYSTEM | TYPE | LANG. | DATA | SCORE | METRIC |
|---|---|---|---|---|---|
| Mitkov (1998) | Rule-based | Eng | Technical Manuals | 89,7% | corr/N |
| ARN (Holen, 2007) | Rule-based | Nor | The Oslo Corpus | 70,5% | corr/N |
| Nilsson (2010)† | Hybrid | Swe | Independent | 71,60% | F-score (MUC) |

Table 21: Other co-reference resolution systems: The performance of other similar systems. † The result is taken from (Nilsson, 2010) and is for the anaphoric-animate-pronoun resolution system.

One experiment of particular interest in (Nilsson, 2010) is the classification of anaphoric-animate pronouns. Nilsson (2010) defines the anaphoric-animate pronoun resolution task as finding the named entity antecedents of 3rd person singular animate pronouns (han, hon, man, den, det). Two types of systems were tested: one basic system acting as a baseline and three extended systems containing additional features. The best result, an F-score of 71,60% (Nilsson, 2010, p. 156, Table 6.21), was obtained by removing the sentence range restriction. Instead, the system evaluates all preceding NPs against a set of constraints and the accepted NPs are considered as candidates (Nilsson, 2010). The evaluation metric used, MUC, takes into account partially solved co-reference chains, which again makes the comparison difficult. On a surface level at this stage, it looks like the current system performs worse. One big difference is that in (Nilsson, 2010) 97% of all pronouns have a co-referring antecedent in the set of candidates, while the context of pn2pmne only contains co-referring antecedents in 70-85%²⁶ (Table 11) of the cases.

Two systems which are easier to compare the current system against are (Mitkov, 1998) and (Holen, 2007). The metric used in their evaluation compares the number of correctly solved anaphora-antecedent pairs to the number of incorrectly solved ones, i.e. $\frac{correct}{N}$. The performance of the current system in this regard is given by antecedent selection precision, which is 0,608. This is a lower value than both (Mitkov, 1998) and (Holen, 2007).

One probable reason for this is the time spent evaluating and selecting features. The current thesis deals primarily with social networks. Launching a full-scale investigation into pronoun resolution features and their importance is not feasible, as it would require much more time. However, in Table 11 we can see that the system with the problematic features removed (heading, prepositional phrase, and verb) performs better than the system in which they are not removed. This implies that Swedish shares the rejection of the preference hierarchy (subjects > direct objects > indirect objects > other) with Norwegian, for the current dataset at least.

There are many possible ways of improving the performance in the future. The most pressing issues for the current system are the set of features used (in total, only five features and two constraints were used) and the way in which the program determines the linear-k value. In Table 11 we see that when it is possible to choose the correct antecedent, the features choose it in 76% of the cases. However, simply increasing the search range seems to cause the antecedent selection recall to decrease. This seems to imply that as the set of candidates increases, the features have a harder time determining the correct antecedent. This is perhaps not so strange when there are only five scoring features.

As a last note, in Table 11 it can be seen that the system with features (pn2pmne) performed much better than the baseline system with the "closest candidate" heuristic. The baseline system still resolves ≈ 35% of the pronouns when they are in the set of candidates (Table 11, Row 5, Column 4), which is not extremely bad, but certainly worse than pn2pmne. The addition of features resulted in a ≈ 50% performance increase.

6.2 Social networks

The results from the social network extraction come in two components, (1) correlations between interpretations and (2) graphs representing the network.

6.2.1 Importance rankings

To interpret the correlations the following null hypothesis is used: There exists no relation between the extracted and the independent character rankings. Generally, the correlations fluctuate around 0,5-0,83, which is considered moderate (0,40-0,59) to very strong (0,80-1). The correlations for Hemsöborna are shown in Table 16; observing the table, we notice that any correlation rs ≥ 0,636 is significant (p-value ≤ 0,05).

Only four rankings are significant for Hemsöborna: the degree and betweenness centrality with k = 3, for both the network with and without pronoun resolution. This does not mean that the non-significant rankings are meaningless or random, just that we cannot rule out that possibility. The same applies for the non-significant rankings in Röda Rummet.

For Röda Rummet, we also notice that correlations rs ≥ 0,636 are significant. In contrast to Hemsöborna, however, Röda Rummet has 7 significant rankings. Five of those are for the betweenness centrality measurement, and the only non-significant betweenness centrality ranking is for k = 9 with pronoun resolution. The other two significant rankings are the degree centrality without pronouns for k = 9 and the eigenvector centrality with pronouns for k = 9. The best correlation, 0,793 (with p = 0,006), was for the betweenness centrality using k = 6.

²⁶ Adding the following rule (with k = 4): in 5% of the cases, return None, results in AS precision = 0,605 (-0,003), AS recall = 0,796 (+0,039) and AS F-score = 0,686 (+0,011). Future work may include investigating this further.

For both works, the betweenness centrality seems to correlate most strongly with the independent rankings; it is also the centrality measurement for which we can dismiss the null hypothesis most often. As such, betweenness centrality seems to fit the current approach best. Future work may include an analysis of different centrality measurements and how effective they are for different connection types (i.e. indirect and direct connections).
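To make the difference between the two measurements concrete, the sketch below computes degree centrality and betweenness centrality for a small graph. It is an illustration under simplifying assumptions: the graph is treated as unweighted and undirected (the extracted networks in this thesis are weighted), betweenness is computed with the shortest-path accumulation scheme of Brandes, and the function names and the toy graph are hypothetical rather than taken from the system described here.

```python
from collections import deque


def degree_centrality(adj):
    """Fraction of the other vertices each vertex is directly connected to."""
    n = len(adj)
    if n < 2:
        return {v: 0.0 for v in adj}
    return {v: len(neighbours) / (n - 1) for v, neighbours in adj.items()}


def betweenness_centrality(adj):
    """Unnormalised betweenness for an unweighted, undirected graph.

    adj maps each vertex to the set of its neighbours.
    """
    score = {v: 0.0 for v in adj}
    for source in adj:
        # Breadth-first search counting shortest paths from the source.
        order = []
        preds = {v: [] for v in adj}
        paths = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest vertices inwards.
        dependency = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in preds[w]:
                dependency[v] += paths[v] / paths[w] * (1 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    # Every unordered pair was counted once from each endpoint.
    return {v: s / 2 for v, s in score.items()}


# Hypothetical toy network: "hub" links two otherwise separate pairs.
toy = {
    "hub": {"p", "q", "r"},
    "p": {"hub", "q"},
    "q": {"hub", "p"},
    "r": {"hub", "s"},
    "s": {"r"},
}
print(degree_centrality(toy))
print(betweenness_centrality(toy))
```

Degree centrality only counts direct neighbours, whereas betweenness rewards vertices that lie on shortest paths between other vertices, which is one possible reason why the two measurements can rank the same characters differently.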

As a concluding note on the importance rankings, the addition of pronoun resolution does not seem to decrease or increase the correlations in general. Observing Tables 16 and 19, there is no obvious pattern. In 5 cases the addition of pronouns caused the p-value of the correlation to increase (negative), in 6 cases it caused the p-value to decrease (positive) and in 7 cases no change is seen. As such, it is difficult to assess what difference it has had on the social network generation.

6.2.2 Social network graphs

The first thing one may notice when looking at Figure 6 and Figure 7 is that many of the nodes are not people. For example, we have hahaha (laughter), Sjötulln (port customs) and Polstjärnan (the North Star) in Röda Rummet. For Hemsöborna, we have Ängarne (fields), Kvarnöarne (the mill islands), herrarne (the gentlemen) and Hemsöborna (the group of people living on the island of Hemsö). These are all instances of incorrectly classified named entities.

Superficially, comparing the two books using the properties of the graphs in Table 20 we can see some general trends:

1. The average degree for the same k tends to stay the same regardless of which book is processed.

2. The average weighted degree for Hemsöborna is (much) larger than for Röda Rummet.

3. Hemsöborna is denser than Röda Rummet.

4. Röda Rummet has many communities and some disconnected sub-networks, while Hemsöborna has one or two communities and no disconnected sub-networks.

Looking at Table 3 we see that $\frac{ProperNames}{ProperNames + Pronouns}$ ≈ 30%. This is reflected in the differences seen in the average weighted degree between the networks containing pronouns and those that do not. In all cases, the average weighted degree of the network with pronoun resolution is roughly three times that of the network without pronoun resolution.

That the average degree is roughly the same is perhaps a little strange but not so surprising. Both Hemsöborna and Röda Rummet have a definite main character²⁷ who is the center of attention, which would imply that this character has many connections. We see in the rankings that Carlsson and Arvid Falk consistently rank first. This should mean that the vertices Carlsson and Arvid Falk have extreme degree values compared to the other vertices, and that these values influence the average degree much more than the other vertices do.

The number three pops up again when comparing the densities of the networks. The density of Hemsöborna is roughly three times that of Röda Rummet: $density_{rodarummet}$ × 3 ≈ $density_{hemsoborna}$. The average degree is very similar between the books for each distance, but Röda Rummet contains roughly three times as many vertices as Hemsöborna. With a similar degree and three times as many vertices, this density difference is to be expected²⁸.
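The expectation can be checked with the standard definitions for a simple undirected graph with n vertices and m edges: the average degree is 2m/n and the density is 2m/(n(n-1)), so the density equals the average degree divided by n-1. The sketch below is only an illustration; the function names and the vertex and edge counts are hypothetical and are not the values from Table 20.

```python
def average_degree(num_vertices, num_edges):
    """Average number of edges per vertex in a simple undirected graph."""
    if num_vertices < 1:
        raise ValueError("need at least one vertex")
    return 2 * num_edges / num_vertices


def density(num_vertices, num_edges):
    """Share of all possible vertex pairs that are actually connected."""
    if num_vertices < 2:
        raise ValueError("need at least two vertices")
    return 2 * num_edges / (num_vertices * (num_vertices - 1))


# Hypothetical sizes: the larger network has three times as many vertices
# but the same average degree as the smaller one.
small_n, large_n = 20, 60
shared_average_degree = 6
small_m = shared_average_degree * small_n // 2
large_m = shared_average_degree * large_n // 2

ratio = density(small_n, small_m) / density(large_n, large_m)
print(average_degree(small_n, small_m), average_degree(large_n, large_m))
print(ratio)
```

With equal average degrees the density ratio reduces to (n_large - 1)/(n_small - 1), which is close to the ratio of the vertex counts once the networks are not very small.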

27 Hemsöborna: Carlsson; Röda Rummet: Arvid Falk.

28 There is a small variation in network size as the range increases. The cause of this is unclear at the moment.

One of the more interesting differences is that of communities and disconnected sub-networks. We see that in Hemsöborna there is one large community, which is split into two communities for k = 3 and k = 9 without pronoun resolution. In Röda Rummet, however, we have over 10 communities per graph, with a clear pattern of decline as k increases. The quality of the communities, however, decreases; this indicates that as the network grows, the communities are harder and harder to distinguish from each other. This may bear a relationship to the approach used. The current system takes all named entity mentions and relates them based on the factors described in section 4.2. The majority of these connections are indirect (see Table 13), and as k increases the system captures more and more co-occurrences, which means that the connections get stronger and denser. This should decrease the number of sparse connections that communities rely on to determine membership (see section 4.3.2).
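The effect of denser cross-community connections on community quality can be illustrated with a small calculation. The sketch assumes that quality is measured as modularity in the sense of Newman and Girvan (2004), $Q = \sum_c \left[ \frac{L_c}{m} - \left( \frac{d_c}{2m} \right)^2 \right]$, where $m$ is the total edge weight, $L_c$ the edge weight inside community $c$ and $d_c$ the summed weighted degree of its vertices. That reading of "quality", the function name `modularity` and the toy graphs are assumptions made for illustration only.

```python
def modularity(edges, community):
    """Modularity of a partition of a weighted, undirected graph.

    edges: iterable of (u, v, weight) tuples, each edge listed once.
    community: dict mapping every vertex to a community label.
    """
    edges = list(edges)
    total = sum(weight for _, _, weight in edges)
    if total == 0:
        raise ValueError("graph has no edge weight")
    internal = {}   # edge weight inside each community (L_c)
    strength = {}   # summed weighted degree of each community (d_c)
    for u, v, weight in edges:
        for node in (u, v):
            label = community[node]
            strength[label] = strength.get(label, 0) + weight
        if community[u] == community[v]:
            label = community[u]
            internal[label] = internal.get(label, 0) + weight
    return sum(internal.get(label, 0) / total - (d / (2 * total)) ** 2
               for label, d in strength.items())


# Hypothetical example: two triangles, first separate, then bridged.
triangles = [("a", "b", 1), ("b", "c", 1), ("a", "c", 1),
             ("x", "y", 1), ("y", "z", 1), ("x", "z", 1)]
groups = {"a": 1, "b": 1, "c": 1, "x": 2, "y": 2, "z": 2}
print(modularity(triangles, groups))
print(modularity(triangles + [("c", "x", 1)], groups))
```

In this toy case the same partition scores lower once an edge is added between the two groups, which is the kind of decline one would expect when a larger k adds connections across communities.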

For disconnected sub-networks we also see a major difference: Röda Rummet contains disconnected sub-networks while none exist in Hemsöborna. The first factor to consider here is the number of characters in the work. There are considerably fewer characters in Hemsöborna than in Röda Rummet, which means that the number of possible connections for the characters in Hemsöborna is much smaller; as such, they tend to be more connected. The second factor to consider is the setting of the work.

One of the main superficial differences between the books is the setting. Röda Rummet takes place in Stockholm, a city, while Hemsöborna takes place on an island in the Stockholm archipelago. Elson et al. (2010) investigated whether this made a difference when considering the social relations relating to dialogue, but found no significant difference between rural and urban settings for the dataset used. Currently only two works have been investigated, but there exists a large difference between the two in relation to communities and sub-networks, and it is not an unreasonable prediction that the difference is based on the setting of the work.

What is noteworthy is that works set in cities may have more characters in general than works set in rural communities. This can be seen as analogous to the fact that, for a single city compared to a single rural community, the population $N_{city} \gg N_{rural}$. It is also very unrealistic to actually have relations to every other character if the number of characters is sufficiently large. This means that some relations will be stronger, others weaker and others non-existent, which produces a network with more communities and with disconnected sub-networks.

At this stage this is a prediction. To investigate it further, more books need to be analyzed and compared, but the current results seem to support such an investigation.

6.3 Research questions

The research questions posed in section 3 were the following:

Q: From a text containing characters who stand in relation to each other, is it possible to represent this information as a social network?

The results of the social network extraction indicate that this was successful to a certain extent. The system used is very basic, containing only three types of relations: (1) co-occurrence, (2) keywords and (3) dependency relations. The correlation between the human interpretation and the automatically generated character rankings seems to indicate that the system roughly agrees with the human interpreter responsible for the independent rankings.

Q: In constructing social networks from a text, which problems are hard?

Hard problems for the social network extraction include (1) evaluation: how should one evaluate the social networks created? and (2) determining the approach to extraction: what are the advantages and disadvantages of mixing direct and indirect connections, or of capturing only one of them? Related to this is how to determine the weights of the different types of connections.

Other hard problems related to social network extraction are named entity recognition (how to reduce the number of non-people classified as people) and pronoun resolution.

Q: How well do rule-based anaphora resolution systems perform on imaginative prose?

All in all, the performance of the system is reasonable for the task at hand: it finds the correct antecedent in more than half of the cases. However, the pn2pmne system is not a complete co-reference resolution system, which makes the comparison to other systems, as we have seen, quite hard. At the moment it does introduce incorrect data; looking at the correlations, however, it does not look like it created too much chaos and randomness.

7 Conclusions

The goal of this thesis was not primarily to find a specific result; rather, the focus has been exploratory. As an exploratory project, it has been a successful sneak peek into what one can do with social networks.

As with the real world, the world of social networks within works of fiction has a definite structure, however it is interpreted. In the end, this is also one of the very interesting aspects of literature: that different people may take different things from the work.

For future work within social networks, it would be very interesting to investigate how centrality measurements interact with different types of social relations, i.e. direct and indirect. Another very interesting aspect is the difference between rural and urban settings. In the real world, the differences between the two are huge, and it would not be surprising to find indicators of this in literary texts.

Finally, in this thesis pronoun resolution has been used as a tool to help construct social networks. However, social networks may have applications for pronoun resolution. As the social networks capture some sort of "meaning" and relation between named entities, this should be able to help identify when two instances refer to the same named entity. As a tool, social networks can also be of use in the classification of literature, where the classification is based on the structure of the network.


References

Beveridge, A. and J. Shan. 2016. Network of thrones. Math Horizons, 23(4):18–22.

Biber, D. 2011. Corpus linguistics and the study of literature: Back to the future? Scientific Study of Literature, 1(1):15–23.

Blondel, V. D., J.-L. Guillaume, R. Lambiotte, and E. Lefebvre. 2008. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008.

Brandes, U. 2008. On variants of shortest-path betweenness centrality and their generic computation. Social Networks, 30(2):136–145.

Burnard, L. 2000. Reference guide for the British National Corpus (world edition).

Cai, J. and M. Strube. 2010. Evaluation metrics for end-to-end coreference resolution systems. In Proceedings of the 11th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Pp. 28–36. Association for Computational Linguistics.

Diestel, R. 2005. Graph theory. Grad. Texts in Math, 101.

Elson, D. K., N. Dames, and K. R. McKeown. 2010. Extracting social networks from literary fiction. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL '10, Pp. 138–147, Stroudsburg, PA, USA. Association for Computational Linguistics.

Finkel, J. R., T. Grenager, and C. Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, Pp. 363–370. Association for Computational Linguistics.

Firth, J. R. 1961. Papers in Linguistics 1934-1951: Repr. Oxford University Press.

Fischer-Starcke, B. 2009. Keywords and frequent phrases of Jane Austen's Pride and Prejudice: A corpus-stylistic analysis. International Journal of Corpus Linguistics, 14(4):492–523.

Fischer-Starcke, B. 2010. Corpus Linguistics in Literary Analysis: Jane Austen and Her Contemporaries, Corpus and Discourse. Bloomsbury Academic.

Freeman, L. C. 1978. Centrality in social networks conceptual clarification. Social Networks, 1(3):215–239.

Hauke, J. and T. Kossowski. 2011. Comparison of values of Pearson's and Spearman's correlation coefficients on the same sets of data. Quaestiones Geographicae, 30(2):87.


Holen, G. I. 2007. Automatic anaphora resolution for Norwegian (ARN). In Discourse Anaphora and Anaphor Resolution Colloquium, Pp. 151–166. Springer.

Katz, L. 1953. A new status index derived from sociometric analysis. Psychometrika, 18(1):39–43.

Knoke, D. and S. Yang. 2008. Social Network Analysis. SAGE Publications.

Kolaczyk, E. D. 2009. Statistical Analysis of Network Data: Methods and Models, 1st edition. Springer Publishing Company, Incorporated.

Källgren, G. 2006. Documentation of the Stockholm Umeå Corpus. In S. Gustafson-Capková and B. Hartmann, eds., Manual of the Stockholm Umeå Corpus version 2.0, Pp. 5–85, Department of Linguistics, Stockholm University.

Mitkov, R. 1998. Robust pronoun resolution with limited knowledge. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 2, Pp. 869–875, Montreal, Quebec, Canada. Association for Computational Linguistics.

Mitkov, R. 1999. Anaphora Resolution: The State of the Art, Research report (University of Wolverhampton. Research Group in Computational Linguistics and Language Engineering). School of Languages and European Studies, University of Wolverhampton.

Newman, M. E. and M. Girvan. 2004. Finding and evaluating community structure in networks. Physical Review E, 69(2):026113.

Newman, M. E. J. 2010. Networks: An Introduction. Oxford University Press.

Nilsson, K. 2010. Hybrid methods for coreference resolution in Swedish. PhD thesis, Department of Linguistics, Stockholm University.

Nilsson Björkenstam, K. 2013. SUC-CORE: A balanced corpus annotated with noun phrase coreference. Northern European Journal of Language Technology (NEJLT), 3:19–39.

Nivre, J., J. Hall, J. Nilsson, A. Chanev, G. Eryigit, S. Kübler, S. Marinov, and E. Marsi. 2007. MaltParser: A language-independent system for data-driven dependency parsing. Natural Language Engineering, 13(02):95–135.

Page, L., S. Brin, R. Motwani, and T. Winograd. 1999. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab.

Stubbs, M. 2003. Conrad, concordance, collocation. Heart of Darkness or light at the end of the tunnel? Third Sinclair Open Lecture.


Sutton, C. and A. McCallum. 2012. An introduction to conditional random fields. Foundations and Trends® in Machine Learning, 4(4):267–373.

Witten, I. H., E. Frank, M. A. Hall, and C. J. Pal. 2016. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann.

Östling, R. 2013. Stagger: an open-source part of speech tagger for Swedish. Northern European Journal of Language Technology (NEJLT), P. 1.


Stockholms universitet/Stockholm University

SE-106 91 Stockholm

Telefon 08 - 16 20 00

www.su.se