Semantics from Narrative: State of the Art and
Future Prospects
Fionn Murtagh Science Foundation Ireland, and
Royal Holloway, University of London
SLDS 2009
Challenges Addressed
• Great masses of data, textual and otherwise, need to be exploited - decisions need to be made. Correspondence Analysis handles multivariate numerical and symbolic data with ease.
• Structures and interrelationships evolve in time.
• We must consider a complex web of relationships.
• We need to address all these issues from data and data flows.
• We will look at how this works, using the Casablanca film script
• Then return to the data mining approach used
Interaction and decision making - Casablanca (1942)
• Script half completed when production began
• Dialog for some scenes written while shooting in progress
• My work on Casablanca is joint with Adam Ganz and Stewart McKie, Dept. of Media Arts, RHUL
Casablanca
• Based on unpublished 1940 screenplay by Murray Burnett and Joan Alison, “Everybody comes to Rick’s”
• Script by JJ Epstein, PG Epstein and H Koch
• Film directed by Michael Curtiz and produced by Hal B Wallis and Jack L Warner
• Shot by Warner Bros. between May and August 1942
• Casablanca script has 77 successive scenes
• 6710 words in these scenes
• We use (later) all words, ignoring punctuation and taking all in lower case
• We analyze frequencies of occurrence of words in scenes, so the input is a matrix crossing scenes by words
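The scenes-by-words input described above can be sketched as follows. This is a minimal illustration, not the tooling used for the talk: the function name `scene_word_matrix`, the tokenizing regular expression and the two toy "scenes" are all assumptions made for the example.

```python
import re
from collections import Counter

def scene_word_matrix(scenes):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in scene i."""
    # Lower-case each scene and keep runs of letters/apostrophes, dropping punctuation.
    tokenized = [re.findall(r"[a-z']+", text.lower()) for text in scenes]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    index = {word: j for j, word in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in scenes]
    for i, tokens in enumerate(tokenized):
        for word, count in Counter(tokens).items():
            matrix[i][index[word]] = count
    return vocabulary, matrix

# Hypothetical toy input (not taken from the script), for illustration only.
vocab, counts = scene_word_matrix(["Here is a toy scene.", "A toy scene, again a scene!"])
print(vocab)
print(counts)
```

Each row of `counts` is one scene and each column one word, which is the cross-tabulation that Correspondence Analysis takes as input.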
Illustrative example: Casablanca (1942)
• A first data set had 77 successive scenes crossed by attributes - Int, Ext, Day, Night, Rick, Ilsa, Renault, Strasser, Laszlo, Other (i.e. minor character), and 29 locations.
• Many locations were met with just once; Rick’s Café was the location of 36 scenes. (We did not distinguish between “Main room”, “Office”, “Balcony”, etc.)
12 attributes displayed; 77 scenes displayed as dots
[Figure: principal factor plane. Horizontal axis: Factor 1, 34% of inertia. Vertical axis: Factor 2, 15% of inertia. Attribute labels: Int, Ext, Day, Night, Rick, Ilsa, Renault, Strasser, Laszlo, Other, RicksCafe, NotRicks.]
Approx. 34 + 15 = 49% of all information displayed
Can study interrelationships between characters, other attributes, scenes, etc.
Some underlying principles 1/2
• Cross-tabulation data, scenes x attributes
• Embedding scenes, attributes in a metric space
• We are probing the “geometry of information”
Triangular inequality holds for metrics
d(x, z) ≤ d(x, y) + d(y, z)
[Figure: three points x, y, z plotted against Horizontal and Vertical axes.]
Example: Euclidean or “as the crow flies” distance
Some underlying principles 2/2
• Axes are the principal axes of inertia
• Identical principles used as in classical mechanics
• Scenes are located as weighted averages of all associated attributes; and vice versa
Christiaan Huyghens (1629-1695)
Huyghens’ theorem relates to decomposition of inertia of a cloud of points
This is the basis of Correspondence Analysis
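One common way to compute the principal axes and the share of inertia on each factor is through a singular value decomposition of the standardized residuals. The sketch below follows that textbook route with NumPy; it is not taken from the slides, it assumes every row and column total is nonzero, and the 3 x 3 table is hypothetical.

```python
import numpy as np

def correspondence_analysis(counts):
    """Return (row_coords, col_coords, inertia_share) for a table of counts."""
    N = np.asarray(counts, dtype=float)
    P = N / N.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    # Standardized residuals: departure from independence, chi-squared scaled.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sing, Vt = np.linalg.svd(S, full_matrices=False)
    inertia = sing ** 2                  # principal inertias, one per factor
    row_coords = (U * sing) / np.sqrt(r)[:, None]     # principal coordinates of rows
    col_coords = (Vt.T * sing) / np.sqrt(c)[:, None]  # principal coordinates of columns
    return row_coords, col_coords, inertia / inertia.sum()

# Hypothetical 3 x 3 table (rows = scenes, columns = attributes).
table = [[10, 2, 1], [3, 8, 2], [1, 2, 9]]
rows, cols, share = correspondence_analysis(table)
print(np.round(share, 3))  # share of total inertia carried by each factor
```

The first two entries of `share` play the role of the "% of inertia" figures quoted on the factor plots, and the mass-weighted mean of `rows` is the origin, which is the centre of gravity of the cloud.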
• Euclidean embedding provides a very good starting point to look at hierarchical relationships
• An innovation in this work: the hierarchy takes sequence, e.g. timeline, into account
• This captures novelty, anomaly, change
And now: the “topology of information”
[Figure sequence: points plotted against Property 1 (horizontal) and Property 2 (vertical), built up over successive panels. The final panels have axis ticks running from 10 to 40 and from 5 to 20, and are annotated with the values 40.85, 38.91 and 37.58.]
Isosceles triangle: approx equal long sides
Strong triangular inequality, or ultrametric inequality, holds for tree distances
d(x, z) ≤ max{d(x, y), d(y, z)}
[Figure: dendrogram on x, y, z; Height axis from 1.0 to 3.5.]
d(x, z) = 3.5
d(x, y) = 3.5
d(y, z) = 1.0
Closest common ancestor distance is an ultrametric
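The ultrametric inequality can be checked directly on the three-point tree example, where d(x, y) = d(x, z) = 3.5 and d(y, z) = 1.0. The sketch below does so, and contrasts it with a hypothetical set of three collinear points under ordinary distance; the helper name `is_ultrametric` is an assumption for illustration.

```python
from itertools import permutations

def is_ultrametric(points, d):
    """True if d(a, c) <= max(d(a, b), d(b, c)) for every ordered triple."""
    return all(d(a, c) <= max(d(a, b), d(b, c))
               for a, b, c in permutations(points, 3))

# Tree distances from the example: y and z merge at height 1.0, x joins at 3.5.
tree = {frozenset("xy"): 3.5, frozenset("xz"): 3.5, frozenset("yz"): 1.0}

def tree_distance(a, b):
    return tree[frozenset((a, b))]

# Hypothetical counter-example: three collinear points at 0, 1 and 2.
position = {"x": 0.0, "y": 1.0, "z": 2.0}

def line_distance(a, b):
    return abs(position[a] - position[b])

print(is_ultrametric("xyz", tree_distance))  # True
print(is_ultrametric("xyz", line_distance))  # False: d(x, z) = 2 > max(1, 1)
```

In the tree case every triangle is isosceles with the two long sides equal, which is the geometric reading of the inequality.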
[Figure: dendrogram of the 77 scenes, with leaves in sequence order 1 to 77; height axis from 0 to 30.]
77 scenes clustered
Shows up the transitions from scene 9 to 10, and from 39 to 40 and 41, as major changes
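A hierarchy that respects the timeline can be built by allowing only neighbouring segments of the sequence to merge. The sketch below is one way to do this; the complete-link criterion, the Euclidean distance and the five toy points are assumptions for illustration, since the slides do not spell out the agglomerative criterion.

```python
import math

def contiguous_complete_link(points):
    """Agglomerate a sequence, merging only clusters adjacent in the sequence.

    Returns a list of (left_segment, right_segment, height) merges, where each
    segment is a (first_index, last_index) pair.
    """
    segments = [(i, i) for i in range(len(points))]
    merges = []

    def complete_link(a, b):
        # Largest pairwise Euclidean distance between members of the two segments.
        return max(math.dist(points[i], points[j])
                   for i in range(a[0], a[1] + 1)
                   for j in range(b[0], b[1] + 1))

    while len(segments) > 1:
        # Only neighbouring segments are candidates, which preserves the timeline.
        heights = [complete_link(segments[k], segments[k + 1])
                   for k in range(len(segments) - 1)]
        k = min(range(len(heights)), key=heights.__getitem__)
        left, right = segments[k], segments[k + 1]
        merges.append((left, right, heights[k]))
        segments[k:k + 2] = [(left[0], right[1])]
    return merges

# Hypothetical sequence of five scenes in a 2-D factor space, with a jump after the third.
scenes = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (3.0, 3.0), (3.1, 3.0)]
for left, right, height in contiguous_complete_link(scenes):
    print(left, right, round(height, 3))
```

In this toy run the last and highest merge joins segment (0, 2) with segment (3, 4), so the break between the third and fourth items stands out in the same way that the major scene changes stand out in the dendrogram.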
A look under the hood
• Correspondence analysis supports the following:
• analysis of multivariate, mixed numerical/symbolic data
• web of interrelationships
• evolution of relationships over time
Correspondence Analysis is A Tale of Three Metrics
- Chi squared metric - appropriate for profiles of frequencies of occurrence
- Euclidean metric, for visualization, and for static context
- Ultrametric, for hierarchic relations and for dynamic context
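As a small illustration of the first of the three metrics, the chi-squared distance compares row profiles (rows rescaled to sum to 1) and weights each column by the inverse of its mass. The function name and the count table below are hypothetical, and the sketch assumes no row or column of the table is entirely zero.

```python
import math

def chi2_distance(table, i, k):
    """Chi-squared distance between the profiles of rows i and k of a count table."""
    total = sum(sum(row) for row in table)
    col_mass = [sum(row[j] for row in table) / total for j in range(len(table[0]))]
    row_i, row_k = table[i], table[k]
    sum_i, sum_k = sum(row_i), sum(row_k)
    # Each squared profile difference is divided by the column's mass,
    # so rarely used columns weigh more than common ones.
    return math.sqrt(sum((row_i[j] / sum_i - row_k[j] / sum_k) ** 2 / col_mass[j]
                         for j in range(len(col_mass))))

# Hypothetical counts: rows 0 and 1 have the same profile at different scales.
table = [[2, 4, 6], [1, 2, 3], [6, 3, 1]]
print(chi2_distance(table, 0, 1))
print(chi2_distance(table, 0, 2))
```

Rows 0 and 1 are at distance zero because only the profile matters, not the overall count; this is why the metric suits frequencies of occurrence in scenes of very different lengths.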
Analysis of semantics: 1. Context - the collection of all interrelationships
• Euclidean distance makes a lot of sense when the population is homogeneous
• All interrelationships together provide context, relativities - and meaning
Analysis of semantics: 2. Hierarchy tracks anomaly and change
• Euclidean distance makes a lot of sense when the population is homogeneous
• Ultrametric distance makes a lot of sense when the observables are heterogeneous, discontinuous
• Latter is especially useful for determining: anomalous, atypical, innovative cases
• Back to a deeper look at Casablanca
• We have taken the comprehensive but qualitative discussion by McKee and sought a quantitative and algorithmic implementation
McKee, Methuen, 1999
Casablanca is based on a range of miniplots.
McKee: its composition is “virtually perfect”
Text is the “sensory surface” of the underlying semantics
Analysis of Casablanca’s “Mid-Act Climax”, Scene 43 subdivided into 11 “beats” (subscenes)
• McKee divides this scene, relating to Ilsa and Rick seeking black market exit visas, into 11 “beats”
• Beat 1 is Rick finding Ilsa in the market
• Beats 2, 3, 4 are rejections of him by Ilsa
• Beats 5, 6 express rapprochement by both
• Beat 7 is guilt-tripping by each in turn
• Beat 8 is a jump in content: Ilsa says she will leave Casablanca soon
• In beat 9, Rick calls her a coward, and Ilsa calls him a fool
• In beat 10, Rick propositions her
• In beat 11, the climax, all goes to rack and ruin: Ilsa says she was married to Laszlo all along. Rick is stunned
[Figure: Principal plane of 11 beats in scene 43. Horizontal axis: Factor 1, 12.6% of inertia. Vertical axis: Factor 2, 12.2% of inertia. Beats are labelled 1 to 11; the remaining points are shown as dots.]
210 words used in these 11 “beats” or subscenes
[Figure: Principal plane of 11 beats in scene 43 (Factor 1, 12.6% of inertia; Factor 2, 12.2% of inertia), annotated as follows.]
Repulsion
Attraction
Beat 8: Ilsa to leave Casablanca!
Beat 11: Ilsa married to Laszlo all along!
McKee’s guidelines applied to Scene 43
• Lengths of beats get shorter leading up to the climax: word counts of the final five beats in scene 43 are 50, 44, 38, 30 and then, for the climax, 46
• The planar representation seen accounts for approx. 12.6 + 12.2 = 24.8% of the inertia, and hence the information
• We will look at the evolution of this scene using hierarchical clustering - but based on the relative orientations, or correlations with factors
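One way to work with relative orientations rather than positions is to rescale each beat's factor coordinates to unit length, so that comparisons depend only on the angle between beats. The sketch below illustrates this with hypothetical coordinates; the function names are assumptions, and a beat located exactly at the origin would need separate handling.

```python
import math

def orientation(vector):
    """Rescale a factor-coordinate vector to unit length, keeping only its direction."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return sum(a * b for a, b in zip(orientation(u), orientation(v)))

# Hypothetical factor coordinates for three beats.
beat_a, beat_b, beat_c = [1.0, 0.5], [2.0, 1.0], [-0.5, 1.0]
print(round(cosine(beat_a, beat_b), 6))  # same direction, different length
print(round(cosine(beat_a, beat_c), 6))  # perpendicular directions
```

Beats `beat_a` and `beat_b` differ in length but share an orientation, so they are treated as identical once only direction is retained.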
[Figure: Hierarchical clustering of 11 beats, using their orientations. Dendrogram with leaves 1 to 11; height axis from 0.0 to 1.0.]
Full dimensionality analysis. Note caesura in moving from beat 7 to 8, and back to 9. Less so in moving from 4 to 5 but still quite pronounced.
Style analysis of scene 43 based on McKee
Monte Carlo tested against 999 uniformly randomized sets of the beats
• In the great majority of cases (against 83% and more of the randomized alternatives) we find the style in scene 43 to be characterized by:
• small variability of movement from one beat to the next
• greater tempo of beats
• high mean rhythm
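The Monte Carlo comparison can be sketched as follows, assuming that "randomized sets of the beats" means random reorderings of the beat sequence. The statistic shown (variance of the step lengths between successive beats), the one-dimensional beat positions and the function names are all hypothetical stand-ins, not the measures used in the study.

```python
import random

def movement_variability(beats):
    """Variance of the step lengths between successive beats (1-D positions here)."""
    steps = [abs(b - a) for a, b in zip(beats, beats[1:])]
    mean = sum(steps) / len(steps)
    return sum((s - mean) ** 2 for s in steps) / len(steps)

def share_of_randomizations_beaten(beats, statistic, n_random=999, seed=0):
    """Fraction of random reorderings whose statistic exceeds the observed one."""
    rng = random.Random(seed)
    observed = statistic(beats)
    larger = 0
    for _ in range(n_random):
        shuffled = beats[:]
        rng.shuffle(shuffled)
        if statistic(shuffled) > observed:
            larger += 1
    return larger / n_random

# Hypothetical beat positions on one factor, for illustration only.
beats = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
share = share_of_randomizations_beaten(beats, movement_variability)
print(share)
```

A share close to 1 says the observed ordering has smaller variability of movement than nearly all of the randomized alternatives, which is the form of statement made in the bullets above.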
Our way of analyzing semantics
• We discern story semantics arising out of the orientation of narrative
• This is based on the web of interrelationships
• We examined caesuras and breakpoints in the flow of narrative
• Work of J. Eliashberg, Wharton, U. Penn.
• Use features characterizing scripts to predict box-office success
• Having tracked various aspects of semantics in a film script
• Can we apply similar principles to the research literature?
• Objective 1: to evaluate funding proposals, and allocation of funding
• Objective 2: to evaluate trends and evolution in fields and subfields of research
• For planning and resource allocation
• Personal, institutional, national, discipline-based
Take 5 articles on neuro-imaging studies of visual awareness and cognitive alternatives in early blind humans
Methodology
• Consider sections: resp. in the five articles there are 7, 6, 6, 6, 7 sections.
• Consider paragraphs within sections: resp. in the five articles there are: 51, 38, 60, 23, 24.
• We analyze sections x words in each article.
• Words are 2 or more characters in length.
• Numbers of words (and unique words) in the five articles: 8067 (1534), 6776 (1408), 8247 (1534), 3891 (999) and 5167 (1255).
• We also used for each article: abstract, bibliography
Issues assessed at individual article level
• Which sections contribute most strongly to the factors
• Which terms, including cited works, contribute or are correlated most with factors - hence which are most important or most salient
• Which technical terms are most
We find abstracts to be good proxies for the articles
[Figure: Abstracts projected into the plane. Horizontal axis: Factor 1, 36% of inertia. Vertical axis: Factor 2, 22% of inertia. Points are labelled 1 to 5; bold italics: 5 articles.]
And we find bibliographies to be good proxies also - Possible implications for bibliometrics
[Figure: Reference sections projected into the plane. Horizontal axis: Factor 1, 36% of inertia. Vertical axis: Factor 2, 22% of inertia. Points are labelled 1 to 5; bold italic: 5 articles.]
Conclusions
• Here bibliography (in each of the five articles) was the set of all cited references, including author names, titles, journal titles and other details
• Caveat: citing cultures differ across disciplines
• Nonetheless:
• Perhaps complementing networks of citing articles, as commonly used in bibliometrics ...
• can semantic analysis based on Correspondence Analysis - as pursued here - ...
• better capture the narrative and hence trends?