giovanni-colavizza
Subject: the European news flow
Hypothesis: a single system of news exchange across Europe.
Rise in demand during the Thirty Years' War; regular postal service.
Key traits of this information system:
• multi-media (handwritten: long and short range, more flexible, on demand; printed: short range and broader public)
• adaptive “hub and spoke” network
• multi-language
Our questions and general aims
How to:
1. prove the existence and extent of the flow
2. reconstruct its fine-grained dynamic cartography
3. study the problem of information supply and exchange: media interactions
Basic approach: detect text reuse. We start by developing robust methods to this end.
Sources (year 1648)
[Map labels: Asti, Cartagena, Francia, Catalogna, Provenza, Livorno, Alicante, Casale, Parma, Bruxelles, Avignone, Colonia, Palermo, Riviera di Ponente, Madrid, Marsiglia, Inghilterra, Lione, Torino, Napoli, Lisbona, Roma, Londra, Germania, Milano, Genova, Barcellona, Parigi, Venezia, Bologna; Francia, Svezia, Augusta, Palatinato, Costantinopoli, Monaco, Erfurt, Norimberga, Londra, Franconia, Cassel, Venezia, Vienna, Svevia, Munster, Ratisbona, Amburgo, Francoforte, Praga, Colonia]
Printed gazettes: Turin and Genoa
Handwritten: from Vatican Archives, Segreteria di Stato, Avvisi.
Results: editorial policies (printed gazettes)
Most frequent sequence order of printed news in each issue:
• Genoa: Genoa, Rome/Naples/Marseille, Milan, Lisbon, Barcelona, Paris, London, Germany and Venice.
• Turin: (i1) Turin, Barcelona, Paris, London, Germany; (i2) Milan, Genoa, Naples, Rome and Venice.
Statistic                          Genoa    Turin
Total character count              281206   579381
Total number of paragraphs         263      1221
Average characters per paragraph   1069     474
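As a quick consistency check on the table above, the averages can be recomputed from the two totals. This is only a sketch: the dictionary layout and function name are illustrative, and the use of integer (floor) division is an assumption, chosen because it reproduces both reported values.

```python
# Totals as reported on the slide (printed gazettes).
TOTALS = {
    "Genoa": {"characters": 281206, "paragraphs": 263},
    "Turin": {"characters": 579381, "paragraphs": 1221},
}


def average_chars_per_paragraph(characters: int, paragraphs: int) -> int:
    """Average paragraph length in characters, truncated to an integer.

    Floor division is an assumption: it matches the figures on the slide.
    """
    return characters // paragraphs


averages = {
    city: average_chars_per_paragraph(t["characters"], t["paragraphs"])
    for city, t in TOTALS.items()
}
print(averages)
```

Under these totals, Genoa prints fewer but much longer paragraphs than Turin, which is what the table summarises.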
[Figures: "Average text per item per month" and "Average text per issue", Turin vs. Genoa. x-axis: month (1 to 6); y-axis: character count.]
Methods: matching algorithms - printed
Strategy: compare paragraphs (units of formatting/reading but also meaning)
Global match: SubString Kernels (similarity of sequences of non-contiguous characters)
Local alignment: Smith-Waterman (finds local matching passages)
Threshold filtering and manual evaluation of 2 highest scoring matches
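To make the local alignment step concrete, here is a minimal character-level Smith-Waterman sketch. It is not the implementation used in this work: the scoring values (match, mismatch, gap), the linear gap penalty and the function name are all assumptions for illustration, and the substring-kernel step is not shown.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -1):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment.

    Scoring parameters are illustrative assumptions, not values from the slides.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, left, right = smith_waterman("xxpace di Munsteryy", "la pace di Munster")
print(score, repr(left), repr(right))
```

In a pipeline like the one described, the score of the best local alignment between two paragraphs would then be compared against a threshold before manual evaluation; how that threshold is set is not specified in the slides.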
Results: the flow (printed gazettes)
[Figure: news flow among Turin, Paris, Barcelona, Lisbon, Milan, Venice, London, Naples, Rome, Genoa and Germany.]
Results: comparisons (printed gazettes)
Categories:
1. verbatim copy of a whole paragraph or parts of it
2. paraphrasing or translations of the same source
3. same news from different sources
4. same topic but different news
Results: categories 1 and 3: <1%; category 2: circa 30%; category 4: circa 43%
Evaluation: precision by hand; recall “intractable”
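Evaluating precision by hand amounts to labelling each retrieved match and counting the share judged relevant. The sketch below assumes each evaluated match carries a manual category label (1 to 4, or None when unrelated); which categories count as relevant is an assumption made here for illustration, not a choice stated in the slides.

```python
from typing import Iterable, Optional


def precision(labels: Iterable[Optional[int]], relevant=frozenset({1, 2, 3})) -> float:
    """Share of manually evaluated matches whose category is in `relevant`.

    `relevant` is a hypothetical default: categories 1-3 treated as reuse.
    """
    labels = list(labels)
    if not labels:
        return 0.0
    hits = sum(1 for label in labels if label in relevant)
    return hits / len(labels)


# Hypothetical manual judgements for four retrieved matches.
print(precision([1, 2, 4, None]))
```

Recall would additionally require knowing every true match in the corpus, which is why it is described as intractable above.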
Methods: data preparation - handwritten
Plenipotentiario di Spagna (keyword)
Re di Spagna (name_of_person)
Conte d'Avò (name_of_person)
spagnoli (quantity)
Ambasciatore di Portogallo (keyword)
Perera (name_of_person)
Hassi (keyword)
Cassel (name_of_place)
Plenipotentiario di Franza (keyword)
Sua Maestà Cesarea (name_of_person)
Landgraviessa d'Assia (name_of_person)
Osnapruch (name_of_place)
trattato dell'Imperio (keyword)
Lantgravio di Darmstat (name_of_person)
Amnistia nello stati hereditarij (keyword)
anni (quantity)
Pinorada (name_of_person)
Svedesi (keyword)
Provincia d'Utrecht (name_of_place)
pace (keyword)
Spagna (name_of_place)
Olanda (name_of_place)
Zelanda (name_of_place)
Provinzie Basse (name_of_place)
Francia (name_of_place)
Methods: matching algorithms - handwritten
Strategy: compare paragraphs
Typed canonicalisation: similar words are grouped into typed categories (Jaro-Winkler distance)
Paragraph comparison: Tf-idf vectors, cosine distance
Manual evaluation of 2 highest scoring matches
Corpus too limited and skewed for now.
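The two steps of the handwritten pipeline can be sketched in plain Python as follows. This is an illustration under assumptions, not the code behind the results: the greedy grouping strategy, the 0.9 threshold, the tf-idf weighting variant and all function names are hypothetical choices.

```python
import math
from collections import Counter


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    m1, m2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not m2[j] and s2[j] == ch:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    seq1 = [c for c, m in zip(s1, m1) if m]
    seq2 = [c for c, m in zip(s2, m2) if m]
    transpositions = sum(x != y for x, y in zip(seq1, seq2)) // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro-Winkler similarity; the corresponding distance is 1 minus this value."""
    j = jaro(s1, s2)
    prefix = 0
    for x, y in zip(s1[:4], s2[:4]):
        if x != y:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def canonicalise(terms, threshold: float = 0.9):
    """Greedily map each (surface, type) pair to the first similar term of the same type."""
    representatives, canon = [], {}
    for surface, kind in terms:
        for rep, rep_kind in representatives:
            if kind == rep_kind and jaro_winkler(surface.lower(), rep.lower()) >= threshold:
                canon[(surface, kind)] = f"{kind}:{rep}"
                break
        else:
            representatives.append((surface, kind))
            canon[(surface, kind)] = f"{kind}:{surface}"
    return canon


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {token: weight} dict per document."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return [{t: (c / len(doc)) * math.log(n / df[t]) for t, c in Counter(doc).items()}
            for doc in docs]


def cosine_similarity(u, v) -> float:
    """Cosine similarity of two sparse vectors; cosine distance is 1 minus this."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

With this sketch, spelling variants such as "Franza" and "Francia" (both appearing in the annotations above) fall into the same group when they share a type, and paragraphs are then compared on their canonical tokens rather than on raw spellings.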
Results: matchings (handwritten)
Munster 24 April 1648:
Cologne 19 April 1648:
High score, same topic, different news. Different news-sheets
Open questions
1. How to effectively evaluate results? The open question of scalable recall and precision.
2. How to get a larger corpus (e.g. at least 2 years to study seasonality)? Obstacles: 1) lack of data; 2) cost of data preparation.
3. How to compare printed and handwritten news? Ongoing work.
4. What to focus on? Variations are as interesting as verbatim copies for studying the interaction of different media and types of gazettes.