Digital Editions &Language Resources Portal
Workshop - Save the data, 2. 12. 2014, WienMatej ĎurčoICLTT/ACDH, Ö[email protected]
Dictionaries• Persian – English Dictionary• German – Russian Dictionary• Dictionary of Bavarian dialects in Austria
Cooperation with Austrian dictionary and the dictionary of German variantsFull word-form corpus-based lexical database of German
Databases e.g. prosopographic, bibliographic data, …
Audio – speech recordings (project Tunico)
3
What kind of data?
Sources: plain text, images (need OCR), Word documents (need conversion), audio (need transcribing), digitally born - web! (needs cleaning)Multi-level enrichment:Structural markup, linguistic / semantic annotation (stand-off)
Linking:• Combining lexicographic material with information from
corpora (encoding in TEI)• semantic representation of lexicographic resources in
RDF
Audio with aligned transcription
Complexity, Formats
4
XML TEI
qualitative vs. quantitative K. Kraus „Die Fackel“ (1899 – 1936)~ 22.500 pages, ~ 6 mio. tokensAAC ~ 500 mio. tokens + facsimiles 40TB!AMC ~ 8 billion tokens in over 35 mio. articles of recent journalistic texts (complete newspapers & magazines in Austria over last 20 years!) – 100 GB23 000 entries prosopographic databaseA number of smaller editions/corpora5 – 50 works/resources, rich annotation, < 100.000 tMultiple dictionaries with a few thousand entries
Size?
5
Bibliographic informationencoded as teiHeaderCMDI – Metadata Infrastructure used within CLARINAllows for flexible „profiles“ specific to the type of resource and project/context- Lexical Resource- TextCorpus- Collection- teiHeader (emulated in CMDI)
Metadata
6
Varying combinations of:full-text searchsemantic search (search for persons, places, search by categories and classifications) full-view (e.g. text and facsimile of individual pages)specialized visualizations (temporal, spatial, graph)raw data available for downloadstable references to resources and resource fragments
BUT before publication: collaborative editing VRE !
Requirements on online availability
7
Publishing framework: corpus_shell Repository for digital objects (Fedora-based)Viennese Lexicographic Editor Collaborative environment for lexicographic work
oXygen, XML-database eXist Apache Solr, Sketch Engine (/NoSke), DDC for fast advanced (linguistic) search capabilitiesMost recently: Language Resources Portal
Solutions
8
„under construction“but many real services already available
Stable organisational structure has been set upgeneral assembly, board of directors, national coordinators, thematic committees, …
Network of Centres(real ones with computing and storage – not virtual)• Certification process (centre assessment)• Typ: A (infrastructure), B (LRT data/services), C
(metadata)• currently 14 centres certified (+ 4 pending)• Coordinated through SCCTC
Standing Committee on CLARIN Technical Centres9
CLARINEuropean Research Infrastructures
Federated IdentityAAI, Single-Sign-on
Persistent IdentifierCMDI – Component Metadata Infrastructureflexible framework for creation and publication of metadata
FCS – Federated Content Searchdistributed system for searching in the content of the resources (corpora, …)
Fostering the use of standardsCLARIN Standards Committee (SCS)
10
InfrastructureCLARIN
modular framework for publishing a wide range of language resources designed to operate in a distributed and heterogeneous environment
distributed setup FCS-basedintegration with CLARIN metadata infrastructure(reusing) specialized resource viewers for specific types of resourcesmultiple implementations (php, perl, XQuery)cooperation/integration with SADEScalable Architecture for Digital Editions, BBAW, Berlin
open source (code on github)11
corpus_shellPublishing Framework
Dictionary Server• Open and freely available software that can be readily
distributed (MySQL, PHP)• Integrated with corpus_shell (FCS as common protocol)• Connected to the clients through a REST-style web
service
Vienna Lexicographic EditorThe corresponding client
DictGate• a service for hosting lexicographic data • for smaller lexicographic projects
14
Lexicography suite
Viennese Lexicographic Editor (VLE)• XML editor specialized for editing lexicographic data• Generic – support for any (XML) format (LMF, TEI, TBX, RDF)• Making use of cognate technologies (XSLT, XPath, XSD)• Various editing modes, configurable keyboard layouts• Optimised corpus-dictionary interface• On-the-fly data visualisations
15
Lexicography suite
First Austrian node in the network of CLARIN CentresDSA (Data Seal of Approval)and CLARIN Centre B status April 2014Language Resources PortalMission: National depositing and publishing servicefor digital language resources
Toolscorpus_shell, lexicographic suite, …
Infrastructure Services - „Knowledge Hub“mostly about metadata (under development)
16
CLARIN Centre Viennaclarin.oeaw.ac.at
Matej ĎurčoAustrian Centre for Digital Humanities
Österreichische Akademie der [email protected]
Thank you!