Contributions for building a Corpora-Flow system
Andre Santos, [email protected]
Informatics Engineering MSc, University of Minho
December 2011
Concepts
Aligned parallel corpus: Set of parallel texts in which correspondences have been marked between blocks (paragraphs, sentences, words, ...) from each text.
Corpora-flow: Adaptation of the concept of workflow to the several tasks, decisions and sequences of steps involved in the process of building a corpus.
This presentation and the underlying master thesis describe the implementation of several tools to be used in typical corpus building activities.
1 Andre Santos, [email protected] Contributions for building a Corpora-Flow system
Context
The work developed in the context of this master thesis was motivated and supported by Project Per-fide, an ongoing project at the University of Minho which aims to build large parallel corpora between Portuguese and six other languages.
Corpora building challenges
file format and format conversion
finding duplicated files
text encoding format
structural residues
section delimiters
unpaired sections (parallel corpora)
. . .
Corpora building challenges
Severe problems which often lead to bad results
Many (most?) of them are hard/impossible to solve completely
Find the problem and report it when it is not solvable automatically
Provide intelligent ways of describing what was found and done
5 key issues
Book cleaning
Duplicates and candidate pairs detection
Book synchronization
Alignment evaluation
Corpora-flow system
Book processing problems – Motivation
(...) d <92>’ entree, donnant acces dans la salle commune.
Une legere veranda, qui en prote-
M
<96>- 86 <96>-
^L geait la partie anterieure contre l <92>’ action
des rayons solaires, reposait sur de sveltes bambous. (...)
La Jangada, Jules Verne
<92>’ : right single quotation mark (CP1252)
<96>- : en dash (CP1252)
^L : page break (form feed, 0x0C)
prote-(...)geait : transpagination (word split across a page break)
(...) d ’ entree, donnant acces dans la salle commune.
Une legere veranda, qui en protegeait _pb1_
la partie anterieure contre l ’ action
des rayons solaires, reposait sur de sveltes bambous. (...)
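The cleaning step shown above can be sketched in a few lines. This is a minimal Python illustration, not the Text::Perfide::BookCleaner implementation (which is written in Perl). The function name `clean_page_breaks`, the limit of three footer lines and the `_pbN_` numbering scheme are assumptions made for the example.

```python
import re

# Hypothetical helper: rejoin a word split across a page break and
# leave a page-break mark behind. Assumes CP1252 input and at most
# three footer lines (signature mark, page number) before the form feed.
PAGE_BREAK = re.compile(
    r"(?P<head>\w+)-\n"        # first half of the word, hyphen at end of line
    r"(?:[^\n\f]*\n){0,3}?"    # footer lines to drop (matched lazily)
    r"\f[ \t]*"                # the page break itself (form feed, 0x0C)
    r"(?P<tail>\w+)[ \t]*"     # second half of the word on the next page
)

def clean_page_breaks(raw: bytes) -> str:
    text = raw.decode("cp1252")  # 0x92 and 0x96 become proper characters
    counter = 0

    def rejoin(match):
        nonlocal counter
        counter += 1
        return f"{match.group('head')}{match.group('tail')} _pb{counter}_\n"

    return PAGE_BREAK.sub(rejoin, text)

# Sample modelled on the La Jangada excerpt (accents omitted as on the slide).
sample = (
    "Une legere veranda, qui en prote-\nM\n\u2013 86 \u2013\n"
    "\f geait la partie anterieure contre l\u2019action\n"
).encode("cp1252")
cleaned = clean_page_breaks(sample)
print(cleaned)
```

The page-break mark is kept rather than discarded so that later steps can still refer to the original pagination.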
Book cleaning
Subdivided into several steps:
Sections ontology
contains common section types used to automatically generate the code to recognize section delimiters
allows discussion/cooperation with people with no programming knowledge
code becomes simpler and cleaner
chap
PT capítulo, cap, capitulo
FR chapitre, chap
EN chapter, chap
NT sec
end
PT fim
FR fin
EN the_end
BT _alone
scene
PT cena
FR scene
EN scene
RU глава
BT act
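Generating recognizers from an ontology like this one can be illustrated as follows. This is a hedged Python sketch, not the code generated by the actual tools. The dictionary layout, the function names, the reading of `_` as whitespace and the accepted label format (digits or Roman numerals) are all assumptions.

```python
import re

# Hypothetical in-memory form of part of the ontology (terms per language).
ONTOLOGY = {
    "chap": {"PT": ["capítulo", "cap", "capitulo"],
             "FR": ["chapitre", "chap"],
             "EN": ["chapter", "chap"]},
    "end": {"PT": ["fim"], "FR": ["fin"], "EN": ["the_end"]},
    "scene": {"PT": ["cena"], "FR": ["scene"], "EN": ["scene"]},
}

def delimiter_regex(lang):
    """Build one regex per language, with a named group per section type."""
    branches = []
    for sec_type, terms in ONTOLOGY.items():
        # Longest term first, so "chapter" wins over "chap".
        words = sorted(terms.get(lang, []), key=len, reverse=True)
        if words:
            # Assumption: "_" in a term stands for whitespace.
            alts = "|".join(re.escape(w).replace("_", r"\s+") for w in words)
            branches.append(f"(?P<{sec_type}>{alts})")
    pattern = (r"^[ \t]*(?:" + "|".join(branches) + r")\b\.?[ \t]*"
               r"(?P<label>[\dIVXLC]+)?[ \t]*$")
    return re.compile(pattern, re.IGNORECASE | re.MULTILINE)

def find_delimiters(text, lang):
    """Return (section type, label) for each line that looks like a delimiter."""
    found = []
    for match in delimiter_regex(lang).finditer(text):
        groups = match.groupdict()
        sec_type = next(t for t in ONTOLOGY if groups.get(t))
        found.append((sec_type, match.group("label")))
    return found

book = "Chapter 1\nIt was late.\nCHAP. II\nMore text.\nThe End\n"
delimiters = find_delimiters(book, "EN")
print(delimiters)
```

Adding a language or a section type only changes the data, which is what lets people without programming knowledge contribute.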
Duplicates and pairs detection
Motivation
Duplicates can result in a biased corpus
Finding candidate pairs for alignment
Language independent elements (LIEs)
terms which are usually kept untranslated
year references – “1973”
proper names – “Hamlet”
Measuring similarity
\[ \mathrm{similarity}(A,B) = \frac{|A_{LIEs} \cap B_{LIEs}|}{|A_{LIEs} \cup B_{LIEs}|} \]
Thresholds
< 0.2: unrelated
> 0.4: pair
> 0.9: duplicates
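The measure above is the Jaccard index over the two sets of LIEs. A minimal Python sketch follows. The extraction heuristics (four-digit numbers, capitalized words not at the start of a sentence) are crude assumptions for the example and are not the rules used by Text::Perfide::BookPairs. The "undecided" label for scores between the stated thresholds is also an assumption.

```python
import re

def extract_lies(text):
    """Very rough LIE extraction: years and mid-sentence capitalized words."""
    years = set(re.findall(r"\b\d{4}\b", text))
    # Capitalized word preceded by "<word char><space>", i.e. not sentence-initial.
    names = set(re.findall(r"(?<=\w )[A-Z]\w+", text))
    return years | names

def similarity(text_a, text_b):
    """Jaccard index of the two LIE sets."""
    a, b = extract_lies(text_a), extract_lies(text_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def classify(score):
    """Apply the thresholds from the slide."""
    if score > 0.9:
        return "duplicates"
    if score > 0.4:
        return "pair"
    if score < 0.2:
        return "unrelated"
    return "undecided"  # assumption: the slide does not name this range

en = "In 1599 the play Hamlet was staged in London by Shakespeare."
pt = "Em 1599 a peça Hamlet foi encenada em Londres por Shakespeare."
score = similarity(en, pt)
print(score, classify(score))
```

Here "London" and "Londres" do not match, so only three of the five distinct LIEs are shared.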
Book synchronization
Definition
Structural alignment at section level, based on previously added section delimiting marks.
Motivation
Some aligners cannot handle large documents
Section delimiters can act as anchor points
Unpaired sections can be discarded
Implementation
match similar section delimiters → synchronization points
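One way to match section delimiters is to align the two sequences of marks and keep the pairs that agree. The Python sketch below uses `difflib.SequenceMatcher` from the standard library and requires marks to be exactly equal. This is a simplification: the slide speaks of similar delimiters, and the algorithm in Text::Perfide::BookSync may differ. The `(type, label)` tuples and the function name are assumptions.

```python
from difflib import SequenceMatcher

def sync_points(marks_a, marks_b):
    """Return (index in A, index in B) for every pair of matching marks."""
    matcher = SequenceMatcher(None, marks_a, marks_b, autojunk=False)
    points = []
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            points.append((block.a + offset, block.b + offset))
    return points

# Book B lacks chapter 2, so that section is left unpaired.
book_a = [("chap", "1"), ("chap", "2"), ("chap", "3"), ("end", None)]
book_b = [("chap", "1"), ("chap", "3"), ("end", None)]
points = sync_points(book_a, book_b)
unpaired_a = [i for i in range(len(book_a)) if i not in {a for a, _ in points}]
print(points, unpaired_a)
```

Each synchronization point splits the pair of books into smaller chunks that an aligner can process separately.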
Output
pair of files with synchronization marks
pair of files divided into smaller pairs of chunks
text report
synchronization matrix
Alignment evaluation
Motivation
compare alignments of the same documents (performed by different tools, with different options, ...)
determine if an alignment was successful
Comparing alignments
parse TMX files and output the total number of correspondences of each type: 0:1/1:0, 1:1, 2:1/1:2 and 2:2
evaluate the other tools developed
compare the performance of the available alignment tools
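Counting correspondence types could look like the Python sketch below. It assumes each `<tu>` holds exactly two `<tuv>` elements and estimates the number of sentences in each `<seg>` with a naive split on terminal punctuation. How Text::Perfide::TMX::Utils actually determines the n:m type is not stated here, so both the function names and the sentence counting are illustrative assumptions.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

def count_sentences(text):
    """Naive sentence count: runs of text ending in '.', '!' or '?'."""
    return len(re.findall(r"[^.!?]+[.!?]+", text))

def correspondence_counts(tmx_text):
    """Tally 'n:m' types over all translation units of a TMX document."""
    root = ET.fromstring(tmx_text)
    counts = Counter()
    for tu in root.iter("tu"):
        sizes = []
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            text = "".join(seg.itertext()) if seg is not None else ""
            sizes.append(count_sentences(text))
        if len(sizes) == 2:  # assumption: bilingual TMX only
            counts[f"{sizes[0]}:{sizes[1]}"] += 1
    return counts

tmx = """<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="en"><seg>Hello there.</seg></tuv><tuv xml:lang="pt"><seg>Olá.</seg></tuv></tu>
<tu><tuv xml:lang="en"><seg>One. Two.</seg></tuv><tuv xml:lang="pt"><seg>Um e dois.</seg></tuv></tu>
<tu><tuv xml:lang="en"><seg></seg></tuv><tuv xml:lang="pt"><seg>Extra.</seg></tuv></tu>
</body></tmx>"""
counts = correspondence_counts(tmx)
print(dict(counts))
```

Comparing these tallies for two alignments of the same documents gives a quick first impression of how the aligners differ.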
Alignment evaluation
Determine if an alignment was successful
Summarize a TMX by sampling. Sampling can be performed based on:
number of samples desired
explicit sampling points
translation units which match a given regular expression
Output is a (much?) smaller TMX file
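The three sampling modes can be sketched as follows. This Python example is an illustration under assumptions, not the interface of Text::Perfide::TMX::Utils. The function name `sample_tmx`, its parameters and the choice of evenly spaced units for the "number of samples" mode are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

def sample_tmx(tmx_text, n=None, points=None, pattern=None):
    """Return a smaller TMX keeping only the selected translation units."""
    root = ET.fromstring(tmx_text)
    body = root.find("body")
    units = list(body.findall("tu"))
    if points is not None:        # explicit sampling points (0-based indices)
        keep = {i for i in points if 0 <= i < len(units)}
    elif pattern is not None:     # units whose text matches a regular expression
        rx = re.compile(pattern)
        keep = {i for i, tu in enumerate(units)
                if rx.search("".join(tu.itertext()))}
    elif n:                       # n evenly spaced units
        step = max(len(units) // n, 1)
        keep = set(range(0, len(units), step)[:n])
    else:
        keep = set(range(len(units)))
    for i, tu in enumerate(units):
        if i not in keep:
            body.remove(tu)
    return ET.tostring(root, encoding="unicode")

# Ten dummy translation units for demonstration.
demo_units = "".join(
    f'<tu><tuv xml:lang="en"><seg>en {i}</seg></tuv>'
    f'<tuv xml:lang="pt"><seg>pt {i}</seg></tuv></tu>'
    for i in range(10)
)
demo = f'<tmx version="1.4"><header/><body>{demo_units}</body></tmx>'
print(sample_tmx(demo, n=3))
```

A person can then read the sampled file in a few minutes and judge whether the alignment went wrong somewhere.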
Alignment evaluation
Adson (DE) = Адсо (RU)
The Name of the Rose, Umberto Eco
Distribution
All the tools implemented as Perl modules:
Text::Perfide::BookCleaner
Text::Perfide::BookPairs
Text::Perfide::BookSync
Text::Perfide::TMX::Utils
publicly available on CPAN
including tests and documentation
additional effort required to make code installable and usable by other people
Corpora-flow
Motivation
building a corpus is a complex task
linear pipeline is not powerful enough
Workflow
states
actions
conditions
context
Makefiles
file-oriented
timestamps and dependencies
fail-fast and resumable execution
parallelization
Corpora-flow
workflow + Makefiles = corpora-flow
DSL (→ Slay::Makefile)
workflow: rule*
rule: pre-condition* action post-condition*
action: targets dependencies function
condition: filename function
targets: pattern*
dependencies: pattern*
function: Perl code
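How a rule of this grammar might be executed is sketched below in Python. This is not the corpora-flow implementation, which builds on Slay::Makefile and Perl code. The `Rule` class, the `run` function and the simplification to a single target per rule are assumptions for illustration. Conditions are modelled as (filename, function) pairs, following the grammar.

```python
import os
import tempfile
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Condition = Tuple[str, Callable[[str], bool]]  # condition: filename function

@dataclass
class Rule:
    target: str                                   # simplification: one target
    dependencies: List[str]
    function: Callable[[str, List[str]], None]    # the action body
    pre: List[Condition] = field(default_factory=list)
    post: List[Condition] = field(default_factory=list)

def out_of_date(rule):
    """Makefile-style check: rebuild if target is missing or older than a dependency."""
    if not os.path.exists(rule.target):
        return True
    built = os.path.getmtime(rule.target)
    return any(os.path.getmtime(dep) > built for dep in rule.dependencies)

def run(workflow):
    """Run rules in order; fail fast on a broken condition, skip fresh targets."""
    log = []
    for rule in workflow:
        if not out_of_date(rule):
            log.append(("skipped", rule.target))
            continue
        for name, check in rule.pre:
            if not check(name):
                raise RuntimeError(f"pre-condition failed on {name}")
        rule.function(rule.target, rule.dependencies)
        for name, check in rule.post:
            if not check(name):
                raise RuntimeError(f"post-condition failed on {name}")
        log.append(("built", rule.target))
    return log

def strip_form_feeds(target, deps):
    with open(deps[0], encoding="utf-8") as src, \
         open(target, "w", encoding="utf-8") as dst:
        dst.write(src.read().replace("\f", ""))

def no_form_feed(path):
    with open(path, encoding="utf-8") as handle:
        return "\f" not in handle.read()

with tempfile.TemporaryDirectory() as tmp:
    raw = os.path.join(tmp, "raw.txt")
    clean = os.path.join(tmp, "clean.txt")
    with open(raw, "w", encoding="utf-8") as handle:
        handle.write("Chapter 1\n\fSome text\n")
    workflow = [Rule(clean, [raw], strip_form_feeds,
                     pre=[(raw, os.path.exists)],
                     post=[(clean, no_form_feed)])]
    first = run(workflow)   # target missing, so the rule fires
    second = run(workflow)  # target is up to date, so the rule is skipped
```

The second run shows the resumable behaviour inherited from Makefiles: work that is already done is not repeated.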
Conclusions
Evaluation of the tools has shown that they do help to solve problems
Most of the methods devised can be applied in other contexts
Working within a larger project:
provides requirements and resources
specific needs and priorities
Making code available to other people:
requires additional effort
gives meaning to the work
external contributions
Higher level objects help to organize and discuss
Future work
Document cleaners
other types of documents (e.g. scientific articles)
algorithm for finding section delimiters with notion of hierarchy
create ebooks/bilingual books
Duplicates and pair detection
list of correspondences (e.g. Adson → Адсо, London → Londres)
calculate best threshold values in real time
Future work
Document synchronization
interactive mode
improvements on synchronization matrix and metrics
hierarchical sections
other section alignment algorithms
Corpora-flow
finish specification and implementation
implement a corpora-flow for Project Per-fide