
ERCIM News, Number 111, October 2017

www.ercim.eu

Special theme: Digital Humanities

Also in this issue:

Research and Innovation: Trend Analysis of Underground Marketplaces


Editorial Information

ERCIM News is the magazine of ERCIM. Published quarterly, it reports on joint actions of the ERCIM partners, and aims to reflect the contribution made by ERCIM to the European Community in Information Technology and Applied Mathematics. Through short articles and news items, it provides a forum for the exchange of information between the institutes and also with the wider scientific community. This issue has a circulation of about 6,000 printed copies and is also available online.

ERCIM News is published by ERCIM EEIG

BP 93, F-06902 Sophia Antipolis Cedex, France

Tel: +33 4 9238 5010, E-mail: [email protected]

Director: Philipp Hoschka, ISSN 0926-4981

Contributions

Contributions should be submitted to the local editor of your country.

Copyright notice

All authors, as identified in each article, retain copyright of their work. ERCIM News is licensed under a Creative Commons Attribution 4.0 International License (CC-BY).

Advertising

For current advertising rates and conditions, see http://ercim-news.ercim.eu/ or contact [email protected]

ERCIM News online edition

http://ercim-news.ercim.eu/

Next issue

January 2018

Subscription

Subscribe to ERCIM News by sending an email to [email protected] or by filling out the form at the ERCIM News website: http://ercim-news.ercim.eu/

Editorial Board:

Central editor:

Peter Kunz, ERCIM office ([email protected])

Local Editors:

Austria: Erwin Schoitsch ([email protected])

Belgium: Benoît Michel ([email protected])

Cyprus: Ioannis Krikidis ([email protected])

France: Steve Kremer ([email protected])

Germany: Michael Krapp ([email protected])

Greece: Lida Harami ([email protected]),

Artemios Voyiatzis ([email protected])

Hungary: Andras Benczur ([email protected])

Italy: Maurice ter Beek ([email protected])

Luxembourg: Thomas Tamisier ([email protected])

Norway: Poul Heegaard ([email protected])

Poland: Hung Son Nguyen ([email protected])

Portugal: José Borbinha, Technical University of Lisbon ([email protected])

Spain: Silvia Abrahão ([email protected])

Sweden: Kersti Hedman-Hammarström ([email protected])

Switzerland: Harry Rudin ([email protected])

The Netherlands: Annette Kik ([email protected])

W3C: Marie-Claire Forgue ([email protected])

Contents


JOINT ERCIM ACTIONS

4 ERCIM Membership

5 ERCIM “Alain Bensoussan” Fellowship Programme

5 HORIZON 2020 Project Management

5 HRADIO – A New Project to Tackle the Future of Internet Radio

17 Argumentation Analysis of Engineering Research
by Mihály Héder (MTA-SZTAKI)

Information Management Strategy

18 A Back Bone Thesaurus for Digital Humanities
by Maria Daskalaki and Lida Charami (ICS-FORTH)

19 Meeting the Challenges to Reap the Benefits of Semantic Data in Digital Humanities
by George Bruseker, Martin Doerr and Chryssoula Bekiari (ICS-FORTH)

21 HumaReC: Continuous Data Publishing in the Humanities
by Claire Clivaz, Sara Schulthess and Anastasia Chasapi (SIB)

22 A Data Management Plan for Digital Humanities: the PARTHENOS Model
by Sheena Bassett (PIN Scrl), Sara Di Giorgio (MIBACT-ICCU) and Paola Ronzino (PIN Scrl)

23 Trusting Computation in Digital Humanities Research
by Jacco van Ossenbruggen (CWI)

Research Infrastructure Development

25 The DARIAH ERIC: Redefining Research Infrastructure for the Arts and Humanities in the Digital Age
by Jennifer Edmond (Trinity College Dublin), Frank Fischer (National Research University Higher School of Economics, Moscow), Michael Mertens (DARIAH EU) and Laurent Romary (Inria)

26 DIGILAB: A New Infrastructure for Heritage Science
by Luca Pezzati and Achille Felicetti (INO-CNR)

28 Building a Federation of Digital Humanities Infrastructures
by Alessia Bardi and Luca Frosini (ISTI-CNR)

29 Knowledge Complexity and the Digital Humanities: Introducing the KPLEX Project
by Jennifer Edmond and Georgina Nugent Folan (Trinity College Dublin)

SPECIAL THEME

The special theme section “Digital Humanities” has been coordinated by George Bruseker (ICS-FORTH), László Kovács (MTA-SZTAKI), and Franco Niccolucci (University of Florence)

Digital Source Indexing and Analysis

8 In Codice Ratio: Scalable Transcription of Vatican Registers
by Donatella Firmani, Paolo Merialdo (Roma Tre University), Marco Maiorino (Vatican Secret Archives)

9 Teaching Archaeology to Machines: Extracting Semantic Knowledge from Free Text Excavation Reports
by Achille Felicetti (Università degli Studi di Firenze)

11 MTAS – Extending Solr into a Scalable Search Solution and Analysis Tool on Multi-Tier Annotated Text
by Matthijs Brouwer (Meertens Institute)

12 Phonetic Search in Audio and Video Recordings
by Ioannis Dologlou and Stelios Bakamidis (RC ATHENA)

13 KA3: Speech Analytics for Oral History and the Language Sciences
by Joachim Köhler (Fraunhofer IAIS), Nikolaus P. Himmelmann (Universität zu Köln) and Almut Leh (FernUniversität in Hagen)

14 Graph-Based Entity-Oriented Search: Imitating the Human Process of Seeking and Cross Referencing Information
by José Devezas and Sérgio Nunes (INESC TEC and FEUP)

16 ContentCheck: Content Management Techniques and Tools for Fact-checking
by Ioana Manolescu (Inria Saclay and Ecole Polytechnique, France)

Page 3: ˜5*(.&1 9-*2* .,.9&1 :2&3.9.*8 - ERCIM News 112 · 4 ERCIM NEWS 111 October 2017 ˚ ˆ ˆ*2’*78-.5 After having successfully grown to become one of the most recognized ICT Societies


RESEARCH AND INNOVATION

This section features news about research activities and innovative developments from European research institutes

46 Collaboration Spotting: A Visual Analytics Platform to Assist Knowledge Discovery
by Adam Agocs, Dimitris Dardanis, Richard Forster, Jean-Marie Le Goff, Xavier Ouvrard and André Rattinger (CERN)

48 Distributional Correspondence Indexing for Cross-Lingual and Cross-Domain Sentiment Classification
by Alejandro Moreo Fernández, Andrea Esuli and Fabrizio Sebastiani

49 Measuring Bias in Online Information
by Evaggelia Pitoura (University of Ioannina), Irini Fundulaki (FORTH) and Serge Abiteboul (Inria & ENS Cachan)

50 The Approximate Average Common Submatrix for Computing the Image Similarity
by Alessia Amelio (Univ. of Calabria)

52 My-AHA: Stay Active, Detect Risks Early, Prevent Frailty!
by Nadine Sturm, Doris M. Bleier and Gerhard Chroust (Johanniter Österreich)

EVENTS, IN BRIEF

Call for Proposals

60 Dagstuhl Seminars and Perspectives Workshops

Report

60 A Conference for Advanced Students: IFIP TC6’s AIMS

62 W3C Workshop on WebVR Authoring: Opportunities and Challenges

In Brief

63 SICS becomes RISE SICS

63 W3C Brings Web Developer Skills to the European Workforce

63 Awards for Breaking SHA-1 Security Standard in Practice

3D Analysis Techniques

30 Restoration of Ancient Documents Using Sparse Image Representation
by Muhammad Hanif and Anna Tonazzini (ISTI-CNR)

32 St Paul’s Cathedral Rises From the Dust – News From the Virtual St Paul’s Cathedral Project
by John N. Wall (NC State University), John Schofield (St Paul’s Cathedral, London), David Hill (NC State University) and Yun Jing (NC State University)

33 Immersive Point Cloud Manipulation for Cultural Heritage Documentation
by Jean-Baptiste Barreau (CNRS/CReAAH UMR 6566), Ronan Gaugne (Université de Rennes 1/IRISA-Inria) and Valérie Gouranton (INSA Rennes/IRISA-Inria)

35 Culture 3D Cloud: A Cloud Computing Platform for 3D Scanning, Documentation, Preservation and Dissemination of Cultural Heritage
by Pierre Alliez (Inria), François Forge (Reciproque), Livio de Luca (CNRS MAP), Marc Pierrot-Deseilligny (IGN) and Marius Preda (Telecom SudParis)

36 Physical Digital Access Inside Archaeological Material
by Théophane Nicolas (Inrap/UMR 8215 Trajectoires), Ronan Gaugne (Université de Rennes 1/IRISA-Inria), Valérie Gouranton (INSA de Rennes/IRISA-Inria) and Jean-Baptiste Barreau (CNRS/CReAAH UMR 6566)

Information Visualization and Communication

37 Reinterpreting European History Through Technology: The CrossCult Project
by Susana Reboreda Morillo (Universidad de Vigo), Maddalena Bassani (Università degli Studi di Padova) and Ioanna Lykourentzou (LIST)

39 Cultural Opposition in Former European Socialist Countries: Building the COURAGE Registry
by András Micsik, Tamás Felker and Balázs Nász (MTA SZTAKI)

40 Locale, an Environment-Aware Storytelling Framework Relying on Augmented Reality
by Thomas Tamisier, Irene Gironacci and Roderick McCall (Luxembourg Institute of Science and Technology)

42 The Biennale 4D Project
by Kathrin Koebel, Doris Agotai, Stefan Arisona (FHNW) and Matthias Oberli (SIK-ISEA)

43 The Clavius Correspondence: From Digitization to Visual Exploration of Knowledge
by Matteo Abrate, Angelica Lo Duca, Andrea Marchetti (IIT-CNR)

45 Service-oriented Mobile Social Networking
by Stefano Chessa and Michele Girolami (ISTI-CNR)

53 A New Vision of the Cruise Ship Cabin
by Erina Ferro (ISTI-CNR)

54 Trend Analysis of Underground Marketplaces
by Klaus Kieseberg, Peter Kieseberg and Edgar Weippl (SBA Research)

55 RDF Visualiser: A Tool for Displaying and Browsing High Density of RDF Data
by Nikos Minadakis, Kostas Petrakis, Korina Doerr and Martin Doerr (FORTH)

57 Services for Large Scale Semantic Integration of Data
by Michalis Mountantonakis and Yannis Tzitzikas (ICS-FORTH)

58 On the Interplay between Physical and Digital World Accessibility
by Christophe Ponsard, Jean Vanderdonckt and Lyse Saintjean (CETIC)

60 Highly Sensitive Biomarker Analysis Using On-Chip Electrokinetics
by Xander F. van Kooten, Federico Paratore (Israel Institute of Technology and IBM Research – Zurich), Moran Bercovici (Israel Institute of Technology) and Govind V. Kaigala (IBM Research – Zurich)



ERCIM Membership

After having successfully grown to become one of the most recognized ICT Societies in Europe, ERCIM has opened membership to multiple member institutes per country. By joining ERCIM, your research institution or university can directly participate in ERCIM’s activities and contribute to the ERCIM members’ common objectives, playing a leading role in Information and Communication Technology in Europe:
• Building a Europe-wide, open network of centres of excellence in ICT and Applied Mathematics;
• Excelling in research and acting as a bridge for ICT applications;
• Being internationally recognised both as a major representative organisation in its field and as a portal giving access to all relevant ICT research groups in Europe;
• Liaising with other international organisations in its field;
• Promoting cooperation in research, technology transfer, innovation and training.

About ERCIM

ERCIM – the European Research Consortium for Informatics and Mathematics – aims to foster collaborative work within the European research community and to increase cooperation with European industry. Founded in 1989, ERCIM currently includes 15 leading research establishments from 14 European countries. ERCIM is able to undertake consultancy, development and educational projects on any subject related to its field of activity.

ERCIM members are centres of excellence across Europe. ERCIM is internationally recognized as a major representative organization in its field. ERCIM provides access to all major Information and Communication Technology research groups in Europe and has established an extensive program in the fields of science, strategy, human capital and outreach. ERCIM publishes ERCIM News, a quarterly high quality magazine, and delivers annually the Cor Baayen Award to outstanding young researchers in computer science or applied mathematics. ERCIM also hosts the European branch of the World Wide Web Consortium (W3C).

Benefits of Membership

As members of ERCIM AISBL, institutions benefit from:
• International recognition as a leading centre for ICT R&D, as member of the ERCIM European-wide network of centres of excellence;
• More influence on European and national government R&D strategy in ICT. ERCIM members team up to speak with a common voice and produce strategic reports to shape the European research agenda;
• Privileged access to standardisation bodies, such as the W3C which is hosted by ERCIM, and to other bodies with which ERCIM has also established strategic cooperation. These include ETSI, the European Mathematical Society and Informatics Europe;
• Invitations to join projects of strategic importance;
• Establishing personal contacts with executives of leading European research institutes during the bi-annual ERCIM meetings;
• Invitations to join committees and boards developing ICT strategy nationally and internationally;
• Excellent networking possibilities with more than 10,000 research colleagues across Europe. ERCIM’s mobility activities, such as the fellowship programme, leverage scientific cooperation and excellence;
• Professional development of staff including international recognition;
• Publicity through the ERCIM website and ERCIM News, the widely read quarterly magazine.

How to become a Member

• Prospective members must be outstanding research institutions (including universities) within their country;
• Applicants should address a request to the ERCIM Office. The application should include:
  • Name and address of the institution;
  • Short description of the institution’s activities;
  • Staff (full time equivalent) relevant to ERCIM’s fields of activity;
  • Number of European projects in which the institution is currently involved;
  • Name of the representative and a deputy.
• Membership applications will be reviewed by an internal board and may include an on-site visit;
• The decision on admission of new members is made by the General Assembly of the Association, in accordance with the procedure defined in the Bylaws (http://kwz.me/U7), and notified in writing by the Secretary to the applicant;
• Admission becomes effective upon payment of the appropriate membership fee in each year of membership;
• Membership is renewable as long as the criteria for excellence in research and an active participation in the ERCIM community, cooperating for excellence, are met.

Please contact the ERCIM Office: [email protected]

“Through a long history of successful research collaborations in projects and working groups and a highly-selective mobility programme, ERCIM has managed to become the premier network of ICT research institutions in Europe. ERCIM has a consistent presence in EU funded research programmes conducting and promoting high-end research with European and global impact. It has a strong position in advising at the research policy level and contributes significantly to the shaping of EC framework programmes. ERCIM provides a unique pool of research resources within Europe fostering both the career development of young researchers and the synergies among established groups. Membership is a privilege.”

Dimitris Plexousakis, ICS-FORTH, ERCIM AISBL Board


Joint ERCIM Actions

ERCIM “Alain Bensoussan” Fellowship Programme

ERCIM offers fellowships for PhD holders from all over the world. Topics cover most disciplines in Computer Science, Information Technology, and Applied Mathematics. Fellowships are of 12 months duration, spent in one ERCIM member institute. Fellowships are proposed according to the needs of the member institutes and the available funding.

Application deadlines for the next round: 30 April and 30 September 2017

More information: http://fellowship.ercim.eu/

HORIZON 2020 Project Management

A European project can be a richly rewarding tool for pushing your research or innovation activities to the state-of-the-art and beyond. Through ERCIM, our member institutes have participated in more than 80 projects funded by the European Commission in the ICT domain, by carrying out joint research activities while the ERCIM Office successfully manages the complexity of the project administration, finances and outreach.

The ERCIM Office has recognized expertise in a full range of services, including identification of funding opportunities, recruitment of project partners, proposal writing and project negotiation, contractual and consortium management, communications and systems support, organization of attractive events, from team meetings to large-scale workshops and conferences, and support for the dissemination of results.

How does it work in practice?

Contact the ERCIM Office to present your project idea and a panel of experts will review your idea and provide recommendations. If the ERCIM Office expresses its interest to participate, it will assist the project consortium as described above, either as project coordinator or project partner.

Please contact:

Philippe Rohou, ERCIM Project Group
[email protected]

HRADIO – A New Project to Tackle the Future of Internet Radio

ERCIM/W3C is participating in HRADIO (Hybrid Radio everywhere for everyone), a new EU-funded project (Innovation Action) that started in September 2017. It focuses on radio service innovations enabled by convergence. While radio, with its rich editorial content, remains a highly popular medium, listening figures are slowly declining, particularly among youngsters. With the rapid rise of smartphones, radio faces competition from many new services, including music streaming platforms. Regular radio today often does not include attractive features as known from online platforms, and if present, they are mostly not well integrated with the actual radio programme. This is where HRADIO will deliver.

Driven by the industry need to create attractive new radio experiences, the project will leverage the potential of hybrid technology for radio, enabling the integration of cost-effective broadcast distribution with the web. Broadcasters will be enabled to personalise radio services (while respecting privacy), to provide intuitive functionalities like time-shifting and, eventually, to foster and to exploit user engagement. HRADIO will pave the way to bring these features not only to broadcasters’ native mobile applications, but also to portals, to connected radios and into the car. The core approach is to integrate validated solutions and to harmonise APIs which together will provide broadcasters with an abstracted service layer accessible across any device and distribution platform – ensuring sustainability and return on investment. Consumers will therefore be able to access their personal radio services on different devices and platforms, enabled by a seamless broadcast-internet integration for radio content distribution. Eventually, HRADIO will publish its developments as ready-to-use Android and HTML client implementations.

W3C will lead the standardisation activities of the project, working with partners to align modelling and main technical tasks with relevant standards, and to identify and specify possible extensions to standards from which the project would benefit. W3C will thus contribute to the definition of scenarios and requirements, the overall system architecture and specifications. W3C will also lead the innovation management, dissemination, communication and exploitation plan work package of the project.

The project has a duration of 30 months, from 1 September 2017 to 29 February 2020, with an overall budget of 3.25 million euros. Project partners include VRT, Belgium; Institut für Rundfunktechnik GmbH, Germany; Ludwig-Maximilians-Universität München, Germany; GEIE ERCIM, France; RBB, Germany; Konsole Labs GmbH, Germany; and UK Radioplayer Ltd.

Link:

http://cordis.europa.eu/project/rcn/211079_en.html

Please contact:

Francois Daoust, W3C, [email protected]


Special Theme: Digital Humanities

Introduction to the Special Theme

by George Bruseker (ICS-FORTH), László Kovács (MTA-SZTAKI), Franco Niccolucci (University of Florence)

This special theme “Digital Humanities” of ERCIM News is dedicated to new projects and trends in the interdisciplinary domain between computer science and the humanities.

Just as experimental research is based on reasoning over experiments, research in the humanities is based on reasoning over sources, which may be textual, material or intangible. Digital humanities (DH) is the interdisciplinary research field that studies how computer science may support such investigations by creating tools and tailoring computer technology to the specific needs of humanities research, while also addressing the methodological issues that arise in relation to the adoption of an intensive use of such digital means. This issue is dedicated to showcasing cutting edge work being undertaken in the domain of DH in the areas of: digital source indexing and analysis, information management strategy, research infrastructure development, 3D analysis techniques, and information visualisation and communication.

Digital Source Indexing and Analysis

Following on from an earlier period of research and investment in digitisation, when huge amounts of analogue content were turned into digital products, the research frontier in DH has now shifted to an interest in automatic or semi-automatic ways of dealing with digital sources. Such sources include both the aforementioned digitised materials and also, increasingly, born-digital texts and other digital media such as audio, images and video. A key research challenge is to create methodologies and tools for finding the needle in the proverbial haystack of millions of poorly indexed files. At present, hot topics include the Optical Character Recognition (OCR) of digitised manuscripts and the parallel techniques of Speech Recognition, as well as the use of Named Entity Recognition (NER) on the resulting digital text files. This is the subject of the article “In Codice Ratio: Scalable Transcription of Vatican Registers” by Firmani et al., which proposes a supervised NER using mixed algorithmic and crowdsourcing approaches, as well as of Felicetti’s paper “Teaching Archaeology to Machines: Extracting Semantic Knowledge from Free Text Excavation Reports” and Brouwer’s “MTAS – Extending Solr into a Scalable Search Solution and Analysis Tool on Multi-Tier Annotated Text”. Two articles address multimedia sources, presenting innovative solutions to retrieval and discovery. Dologlou et al. deal with audio and visual sources in their “Phonetic Search in Audio and Video Recordings”, while Köhler et al. address speech recognition and analysis in the contribution “KA3: Speech Analytics for Oral History and the Language Sciences”. Addressing the general question of querying and discovery in large datasets in the humanities, Devezas et al. introduce a perspective and strategy on the integration of information using graph technology in “Graph-Based Entity-Oriented Search: Imitating the Human Process of Seeking and Cross Referencing Information”. Another increasingly important topic of research arising in this area relates to fact checking and determining the veracity of claims made in sources, as well as the impact of arguments in social contexts. The contributions of Manolescu in “ContentCheck: Content Management Techniques and Tools for Fact-checking” and Héder et al. in “Argumentation Analysis of Engineering Research” offer perspectives on how to critically assess structured resources through pattern recognition.

Information Management Strategy

The questions of the long-term control of this ‘needle in the haystack’ problem and the efficient use of research data sources raise general methodological issues of how to create robust information management strategies in the humanities. This area of research addresses the rapid expansion of digital sources in multiple formats that cover both broader and more precise topics, and the question of how we can create long-term sustainable access for the reuse of resources in a way that promotes accessibility and the quality necessary to support academic research. Part of the question here is how to create and successfully use common or transparent expressions/translations of data across domains, as well as how to create, share and properly use common vocabularies for describing data. Advances in the areas of vocabularies and semantics are reported in Daskalaki et al.’s “A Back Bone Thesaurus for Digital Humanities” and in Bruseker et al.’s “Meeting the Challenges to Reap the Benefits of Semantic Data in Digital Humanities”. Beyond the data question, however, lie practical and social dimensions to the problem of long-term data management, a question of understanding and formalising procedures and protocols in a new digital research environment and putting them efficiently in place in information systems in the present research economy. Clivaz et al. in “HumaReC: Continuous Data Publishing in the Humanities” and Bassett’s “A Data Management Plan for Digital Humanities: the PARTHENOS Model” provide views and answers on how to meet the challenges, obligations and opportunities that arise for digital humanists creating digital archives that are re-usable by others in an open access framework. Another area of important research is in supporting trust in digital resources. This area is taken up by van Ossenbruggen in “Trusting Computation in Digital Humanities Research”. Finally, a perennial area demanding new development and imagination lies in creating tools that allow researchers to generate data meeting the above criteria in a non-onerous manner.

Research Infrastructure Development

In order to support forward looking, comprehensive and critical approaches to these issues, the European Commission supports a great part of the funding for digital humanities research, especially through the Infrastructures Programme. In particular, this programme supports the creation of research infrastructures (RI), which are networks of facilities, resources and services offered to a research community to support and catalyse their work. Following the roadmap created by ESFRI (the European Strategy Forum on Research Infrastructures), some of these RIs are upgraded and designated as an ERIC (European Research Infrastructure Consortium), according to the recommendation of a panel of experts. An ERIC is a transnational institution tasked with managing an RI and fostering innovation in the related research field. This issue brings news on the progress of a number of ERICs within the humanities domain. Edmond et al. report on the activity of DARIAH (Digital Research Infrastructure for the Arts and the Humanities) in “The DARIAH ERIC: Redefining Research Infrastructure for the Arts and Humanities in the Digital Age”, while the digital infrastructure of E-RIHS (European Research Infrastructure for Heritage Sciences) is described in Pezzati’s “DIGILAB, a New Infrastructure for Heritage Science”. Meanwhile, Bardi et al. present research into the design and implementation of a generalised information architecture for digital humanities in “Building a Federation of Digital Humanities Infrastructures”. Finally, in “Knowledge Complexity and the Digital Humanities: Introducing the KPLEX Project”, Edmond et al. announce interesting new research in the context of such RIs, taking an explicitly humanities-grounded critical look at the concept of ‘data’ in the first place, how this affects what may or may not be digitized, and the manner in which it is perceived or used.

3D Analysis Techniques

Turning to the application of digital methods to the study of material culture (e.g., man-made objects or architecture), we find continuous innovation in the application of digital techniques in order to try to better understand extant objects or to produce academically sourced and grounded representations of now lost heritage. In this domain, the use of 3D imaging techniques and inventing means of analytically applying these to heritage research is a key area of innovation. In this context, Hanif et al.’s paper “Ancient Document Restoration Using Sparse Image Representation” presents a means to link research on documents with research on the matter on which they are recorded, allowing the virtual restoration of ancient documents. Mature research on the application of 3D technology to monuments is represented in Wall et al.’s “The Virtual St Paul’s Cathedral Project”, which reports on the virtual reconstruction of St Paul’s based on historical sources. Meanwhile, Barreau et al.’s “Immersive Point Cloud Manipulation for Cultural Heritage Documentation” presents an innovative application that uses 3D models in a VR platform in order to allow archaeologists to work directly with such products in their research and reporting activities. Finally, in “Physical Digital Access inside Archaeological Material”, Nicolas et al. demonstrate the uses of 3D imaging in conducting non-destructive research on heritage materials. Alliez et al. in “Culture 3D Cloud” meanwhile present a platform for cloud computing of 3D objects, both for enabling their online analysis and for communicating them to other researchers and the public.

Information Visualization and Communication

Indeed, digital humanities research does not take place in a bubble but, thanks in no small part to its form, is inherently suited to communication to a broader public, both of researchers in other disciplines and generally interested parties. Consequently, considerable research effort is being put into the question of how to present and make digital heritage and humanities content understandable and accessible to this wider audience. Papers addressing the communication of history, heritage or museum exhibits with digital tools in this issue include Morillo et al.’s “Reinterpreting European History through Technology: The CrossCult Project” and Micsik et al.’s “Cultural Opposition in Former European Socialist Countries: Building the COURAGE Registry”, which look at means of gathering and presenting heretofore under-represented historical information for researchers and the public. A series of contributions also look at innovative ways to explore digital humanities datasets. This research takes many directions, from exploring the uses of augmented reality as seen in Tamisier et al.’s “Locale, an Environment-Aware Storytelling Framework Relying on Augmented Reality”, to the application of virtual reality to connect times and space as presented in Koebel et al.’s “The ‘Biennale 4D’ Project”, to explorations of new techniques for exploring graph-based data on the web as reported by Abrate et al. in “The Clavius Correspondence: From Digitization to Visual Exploration of Knowledge”. In a related vein, Chessa et al. explore how to connect personal mobile devices to relevant services for data exploration inter alia in “Service-oriented Mobile Social Networking”. Each of these contributions explores innovative approaches to better communicate history and heritage to visitors to museums, monuments or heritage sites around the globe.

The contributions to this special theme issue on digital humanities give an insight into some of the main currents of research currently being undertaken in DH. Such research is carried out as much within the context of smaller research projects as in European funded transnational structures such as the ERICs. The diversity of research demonstrated is strong evidence of the vitality of this interdisciplinary domain, where state-of-the-art digital technology goes hand in hand with the study of human culture of the present and of the past.

Please contact:

George Bruseker, ICS-FORTH, [email protected]

László Kovács, MTA-SZTAKI, [email protected]

Franco Niccolucci, University of Florence, [email protected]


In Codice Ratio: Scalable Transcription of Vatican Registers

by Donatella Firmani, Paolo Merialdo (Roma Tre University) and Marco Maiorino (Vatican Secret Archives)

In Codice Ratio is an end-to-end workflow for the automatic transcription of the Vatican Registers, a corpus of more than 18,000 pages contained as part of the Vatican Secret Archives. The workflow has a character recognition phase, featuring a deep convolutional neural network, and a proper transcription phase using language statistics. Results produced so far have high quality and require limited human effort.

Historical handwritten documents are an essential source of knowledge concerning past cultures and societies [3]. Many libraries and archives have recently begun digitizing their assets, including the Bibliothèque Nationale de France, the Virtual Manuscript Library of Switzerland, and the Vatican Apostolic Library. Due to the sheer size of the collections and the many challenges involved in fully automatic handwriting transcription (such as irregularities in writing, ligatures and abbreviations, and so forth), many researchers in recent years have focused on solving easier problems, most notably keyword spotting. However, as more and more libraries worldwide digitize their collections, greater effort is being put into the creation of full-fledged transcription systems.

Our contribution is a scalable end-to-end transcription workflow based on fine-grained segmentation of text elements into characters and symbols. We first partition sentences and words into text segments. Most segments contain actual characters, but there are also segments with spurious ink strokes. (Perfect segmentation cannot be achieved without transcription; this result is known as Sayre’s paradox.) Then, we submit all the segments to a deep convolutional neural network (CNN), designed following recent progress in deep learning [2] and the de facto standards for complex optical character recognition (OCR) problems. The labels returned for each segment by the deep CNN are very accurate when the segments contain actual characters, but can be wrong otherwise. Finally, we reassemble such noisy labels into words and sentences using language statistics, similarly to [1].
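The character-classification step can be pictured with the short sketch below. This is a minimal illustration only: the input resolution, layer sizes and the size of the symbol inventory are assumptions made for the example, not the configuration used by In Codice Ratio.

# Minimal sketch of a character-level CNN classifier in the spirit of the
# workflow described above. Input size, layer widths and the number of
# character classes are illustrative assumptions, not the project's settings.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 26            # hypothetical symbol inventory (letters + abbreviations)
INPUT_SHAPE = (32, 32, 1)   # hypothetical size of a binarised character segment

def build_model():
    return keras.Sequential([
        keras.Input(shape=INPUT_SHAPE),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])

model = build_model()
model.compile(optimizer="adam",  # Adam optimiser, as in reference [2]
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Train on (segment image, character label) pairs, e.g. from a crowdsourced
# dataset; here random data stands in for real segments.
x_train = np.random.rand(100, *INPUT_SHAPE).astype("float32")
y_train = np.random.randint(0, NUM_CLASSES, size=100)
model.fit(x_train, y_train, epochs=1, batch_size=32, verbose=0)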

In Codice Ratio is an interdisciplinary project involving the Humanities and Engineering departments of Roma Tre University, and the Vatican Secret Archives, one of the largest historical libraries in the world. The project started in 2016 and aims at the complete transcription of the “Vatican Registers” corpus. The corpus, which is part of the Vatican Secret Archives, consists of more than 18,000 pages of official correspondence of the Roman Curia in the 13th century, including letters and opinions on legal questions, addressed from and to kings and sovereigns, as well as to many political and religious institutions throughout Europe. Never having been transcribed in the past, these documents are of unprecedented historical relevance. A small illustration of the Vatican Registers is shown in Figure 1.

Figure 1: Sample text from the manuscript “Liber septimus regestorum domini Honorii pope III”, in the Vatican Registers.

State-of-the-art transcription algorithms generally work by a segmentation-free approach, where it is not necessary to individually segment each character. While this removes one of the hardest steps in the process, it is necessary to have full-text transcriptions for the training corpus, in turn requiring expensive labelling procedures undertaken by paleographers with expertise on the period under consideration. Our character-level classification instead has a much smaller training cost, and allows the collection of a large corpus of annotated data using a cheap crowdsourcing procedure. Specifically, we implemented a custom crowdsourcing platform, and employed more than a hundred high-school students to manually label the dataset. To overcome the complexity of reading ancient fonts, we provided the students with positive and negative examples of each symbol. After a data augmentation process, the result is an inexpensive, high-quality dataset of 23,000 characters, which we plan to make publicly available online. Our deep CNN trained on this dataset achieves an overall accuracy of 96%, which is one of the highest results reported in the literature so far.
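The final reassembly step, in which the noisy per-segment labels are combined with language statistics, can likewise be sketched as follows. The tiny lexicon, the word frequencies and the simple scoring rule are illustrative assumptions, not the project’s actual decoder.

# Minimal sketch of turning noisy per-segment character probabilities into a
# word, combining classifier scores with simple language statistics.
# The lexicon and frequencies are made-up examples, not project data.
import math

LEXICON = {"dominus": 0.4, "domino": 0.35, "dominis": 0.25}  # hypothetical word frequencies

def word_score(word, char_probs):
    """char_probs[i] maps a candidate character to the CNN probability for segment i."""
    if len(word) != len(char_probs):
        return float("-inf")
    score = math.log(LEXICON.get(word, 1e-9))        # language-statistics prior
    for ch, probs in zip(word, char_probs):
        score += math.log(probs.get(ch, 1e-9))       # classifier evidence per segment
    return score

def decode(char_probs):
    """Pick the lexicon word that best explains the sequence of segment labels."""
    return max(LEXICON, key=lambda w: word_score(w, char_probs))

# Example: seven segments, each with (possibly wrong) top character guesses.
segments = [{"d": 0.9}, {"o": 0.8}, {"m": 0.9}, {"i": 0.7, "l": 0.2},
            {"n": 0.8}, {"u": 0.5, "o": 0.4}, {"s": 0.6}]
print(decode(segments))  # -> "dominus"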

The project takes place in Rome (Italy) and features interdisciplinary collaborators throughout Europe and the world, including Trinity College Dublin (Ireland), the Max Planck Institute for European Legal History in Frankfurt (Germany), and the University of Notre Dame in South Bend (Indiana).




Domestic collaborators include Sapienza University in Rome (Italy) and two Roman high schools, namely Liceo Keplero and Liceo Montale.

The transcriptions produced so far account for lower-case letters and a subset of the abbreviation symbols. Future activities include transcription of abbreviations and upper-case letters, as well as an extensive experimental evaluation of the whole pipeline.

References:

[1] D. Keysers et al.: “Multi-language online handwriting recognition”, IEEE TPAMI, 2017.

[2] D. Kingma, J. Ba: “Adam: A method for stochastic optimization”, arXiv, 2014.

[3] J. Michel et al.: “Quantitative analysis of culture using millions of digitized books”, Science, 2011.

Please contact:

Donatella Firmani, Paolo Merialdo
Roma Tre University, Italy
+39 06 5733 [email protected], +39 06 [email protected]

Marco Maiorino
Vatican Secret Archives, Vatican City State
[email protected]


Teaching Archaeology to Machines: Extracting Semantic Knowledge from Free Text Excavation Reports

by Achille Felicetti (Università degli Studi di Firenze)

Natural language processing and machine learning technologies have acquired considerable importance, especially in disciplines where the main information is contained in free text documents rather than in relational databases. TEXTCROWD is an advanced cloud-based tool developed within the framework of the EOSCpilot project for processing textual archaeological reports. The tool has been boosted and made capable of browsing big online knowledge repositories, educating itself on demand, and producing semantic metadata ready to be integrated with information coming from different domains, to establish an advanced machine learning scenario.

In recent years, great interest has arisen around the new technologies related to natural language processing and machine learning, which have suddenly acquired considerable importance, especially in disciplines such as archaeology, where the main information is contained in free text documents rather than in relational databases or other structured datasets [1]. Recently, the Venice Time Machine project [L1], aimed at (re)writing the history of the city by means of “stories automatically extracted from ancient manuscripts”, has greatly contributed to feeding this interest and rendering it topical.

Reading the documents: machines and the “gift of tongues”

The dream of having machines capable of reading a text and understanding its innermost meaning is as old as computer science itself. Its realisation would mean acquiring information formerly inaccessible or hard to access, and this would tremendously benefit the knowledge bases of many disciplines. Today there are powerful tools capable of reading a text and deciphering its linguistic structure. They can work with most major world languages to perform complex operations, including language detection, parts of speech (POS) recognition and tagging, and grasping grammatical, syntactic and logical relationships. To a certain extent, they can also speculate, in very general terms, on the meaning of each single word (e.g., whether a noun refers to a person, a place or an event). Named entity resolution (NER) and disambiguation techniques are the ultimate borderline of NLP, beyond which we can start to glimpse the actual possibility of making machines aware of the meaning of a text.

Searching for meanings: asking the rest of the world

However, at present no tool is capable of carrying out NER operations on a text without adequate (and intense!) training. This is because tools have no previous knowledge of the context in which they operate unless a human instructs them. Usually, each NLP tool can be trained for a specific domain by feeding it hundreds of specific terms and concepts from annotated documents, thesauri, vocabularies and other linguistic resources available for that domain. Once this training is over, the tool is fully operational, but it remains practically unusable in other contexts unless a new training pipeline is applied. Fortunately, this situation is about to change thanks to the web, big data and the cloud, which offer a rich base of resources to overcome this limitation: services like BabelNet [L2] and OpenNER [L3] are large online aggregators of entities that provide immense quantities of named entities ready to be processed by linguistic tools in a cloud context. NLP tools, once deployed as cloud services, will be able to tune into a widely distributed and easily accessible resource network, thus reducing the training needs and making the mechanism flexible and performant.
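To make the training problem concrete, the short sketch below (not TEXTCROWD code) runs an off-the-shelf NER model over an excavation-report style sentence with spaCy. A general-purpose model typically recognises places and dates but misses domain entities such as finds and stratigraphic units, which is exactly the gap that domain training or online entity resources must fill.

# Minimal illustration of off-the-shelf named entity recognition on an
# excavation-report style sentence (not TEXTCROWD code). Requires the
# spaCy model "en_core_web_sm" to be installed beforehand.
import spacy

nlp = spacy.load("en_core_web_sm")
text = ("Trench 3 was opened in Prato in 2016; a bronze fibula was "
        "recovered from the fill of pit F12.")
doc = nlp(text)

for ent in doc.ents:
    print(ent.text, ent.label_)
# A generic model finds entities such as "Prato" (GPE) and "2016" (DATE),
# but domain terms like "fibula" or "pit F12" are missed unless the
# pipeline is trained on, or linked to, domain knowledge bases.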

TEXTCROWD and the European Open Science Cloud

The above is what TEXTCROWD [L4] seeks to achieve. TEXTCROWD is one of the pilot tools developed under EOSCpilot [L5], an initiative aimed “to develop a number of demonstrators working as high-profile pilots that integrate services and infrastructures to show interoperability and its benefits in a number of scientific domains, including archaeology”. The pilot, developed by VAST-LAB/PIN (University of Florence, Prato) under the coordination of Franco Niccolucci and in collaboration with the University of South Wales, is intended to build a cloud service capable of reading excavation reports, recognising relevant archaeological entities and linking them to each other on linguistic bases [2]. TEXTCROWD was initially trained on a set of vocabularies and a corpus of archaeological excavation reports (Figure 1). Subsequently, thanks to the cloud technology on which it is built, it has been made capable of browsing all the major online NER archives, which means it can also discover entities “out of the corpus” (i.e., non-archaeological ones) and educate itself on demand, so that it can be employed in different domains in an advanced machine learning scenario (Figure 2).

Figure 1: Part-of-speech recognition in TEXTCROWD using the OpenNER framework.

Figure 2: Named entity resolution in TEXTCROWD using DBpedia and BabelNet resources.

Another important feature concerns the output produced by the tool: TEXTCROWD is actually able to generate metadata encoding the knowledge extracted from the documents into a language understandable by a machine, an actual “translation” from a (natural) language to another (artificial) one. The syntax and semantics of the latter are provided by one of the main ontologies developed for the cultural heritage domain: CIDOC CRM [L6], an international standard that is very popular in digital humanities.

“Tell me what you have understood”

CIDOC CRM provides classes and relationships to build discourses in a formal language by means of an elegant syntax and to tell stories about the real world and its elements. In TEXTCROWD, CIDOC CRM has been used to “transcribe”, in a format readable by a machine, narratives derived from texts, narrating, for instance: the finding of an object at a given place or during a given excavation; the description of archaeological artefacts and monuments; and the reconstructive hypotheses elaborated by archaeologists after data analysis [3]. CIDOC CRM supports the encoding of such narratives in standard RDF format, allowing, at the end of the process, the production of machine-consumable stories ready for integration with other semantic knowledge bases, such as the archaeological data cloud built by ARIADNE [L7]. The dream is about to become reality, with machines reading textual documents, extracting content and making it understandable to other machines, in preparation for the digital libraries of the future.
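A flavour of such machine-readable output is sketched below with rdflib: a finding event expressed with core CIDOC CRM classes and properties. The URIs, labels and modelling choices are illustrative assumptions, not the actual TEXTCROWD output schema.

# Minimal sketch of CIDOC CRM-style RDF output for a sentence such as
# "a bronze fibula was found in trench 3". URIs and modelling choices are
# illustrative assumptions, not the actual TEXTCROWD output format.
from rdflib import Graph, Namespace, Literal, RDF, RDFS

CRM = Namespace("http://www.cidoc-crm.org/cidoc-crm/")
EX = Namespace("http://example.org/excavation/")   # hypothetical base URI

g = Graph()
g.bind("crm", CRM)

finding = EX["event/finding-001"]
fibula = EX["object/fibula-001"]
trench = EX["place/trench-3"]

g.add((finding, RDF.type, CRM["E7_Activity"]))                 # the finding event
g.add((finding, RDFS.label, Literal("Finding of a bronze fibula")))
g.add((finding, CRM["P7_took_place_at"], trench))              # where it happened
g.add((finding, CRM["P12_occurred_in_the_presence_of"], fibula))

g.add((fibula, RDF.type, CRM["E19_Physical_Object"]))
g.add((fibula, RDFS.label, Literal("bronze fibula")))
g.add((trench, RDF.type, CRM["E53_Place"]))
g.add((trench, RDFS.label, Literal("Trench 3")))

print(g.serialize(format="turtle"))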

Links:

[L1] timemachineproject.eu
[L2] babelnet.org/
[L3] www.opener-project.eu/
[L4] eoscpilot.eu/science-demos/textcrowd
[L5] eoscpilot.eu
[L6] http://cidoc-crm.org
[L7] http://ariadne-infrastructure.eu

References:

[1] F. Niccolucci et al.: “Managing Full-Text Excavation Data with Semantic Tools”, VAST 2009, the 10th International Symposium on Virtual Reality, Archaeology and Cultural Heritage, 2009.

[2] A. Vlachidis et al.: “Excavating Grey Literature: a case study on the rich indexing of archaeological documents via Natural Language Processing techniques and Knowledge Based resources”, ASLIB Proceedings Journal, 2010.

[3] A. Felicetti, F. Murano: “Scripta manent: a CIDOC CRM semiotic reading of ancient texts”, International Journal on Digital Libraries, Springer, 2016.

Please contact:

Achille Felicetti, VAST-LAB, PIN, Università degli Studi di Firenze, Italy
+39 0574 602578, [email protected]





MTAS – Extending Solr into a Scalable Search Solution and Analysis Tool on Multi-Tier Annotated Text

by Matthijs Brouwer (Meertens Institute, KNAW)

To deliver searchability on huge amounts of textual resources and associated metadata, the Lucene-based Apache Solr index provides a well proven and scalable solution. Within the field of Humanities, textual data is often enriched with structures like part-of-speech, named entities, sentences or paragraphs. To include and use these annotations in search conditions and results, we developed a plugin that extends existing Solr functionalities with search and analysis options on these layers.

The Lucene approach to processing textual data is based upon tokenisation of the provided information: for each document and field, the occurring tokens or words are stored within the index together with positional information and an optional payload. This offers quick retrieval of matching documents when searching for specific words, and specific sequences can be found by efficiently exploiting positional information on only those documents that contain all words involved. Mtas extends this technique by encoding a prefix and postfix within the value of each token. Since Lucene allows the use of multiple tokens on the same position, this provides the capability to store multiple layers of single positioned tokens, where layers can be distinguished by the applied prefix. Furthermore, to process non-single positioned and hierarchically related elements, additional information is stored within the payload, allowing these items to be stored as single positioned tokens on their first occurring position in the Lucene index structure. Finally, to enable fast retrieval of information based on position or hierarchical relation within the document, forward indices are created.

Although the default Lucene query mechanism can still be applied, specific methods are needed and made available within Mtas to efficiently use the additional encoded information and indices. Based on these methods, within Mtas a parser is provided for the Corpus Query Language (CQL), making it possible for users to define advanced conditions on the annotated text directly in the Solr search request. Since existing Solr functionality is maintained, this can be combined with regularly defined conditions on metadata fields within the same document. For selected documents, Mtas can efficiently generate frequency lists for occurring layers, provide statistics over matches for possibly multiple user-defined CQL queries, categorize these by one or multiple metadata fields, produce keyword-in-context representations, and group results over one or multiple layers. Mtas fully supports the distributed search capabilities from Solr, providing not only scalability but also making the process of updating and extending the data with new collections easier.

Figure 1: Typical structure of textual data with relations and annotations on multiple tiers as can be processed by Mtas.

Figure 2: Distribution of the fraction of nouns and adjectives over all documents in the Nederlab collection with at least 500 part-of-speech annotated words.

Figure 3: Location entities in paragraphs containing ‘Simon Stevin’, constructed with Mtas by grouping all matches on a CQL expression.
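As an illustration of how such a combined request might be issued from a client, the sketch below sends a Solr select query in which a CQL expression constrains the annotated text layer while an ordinary filter query constrains the metadata. The endpoint, field names and the CQL-related parameter names are assumptions made for this example, not the plugin’s documented API; the actual request syntax is described with the source code on GitHub [L1].

# Hedged sketch of combining a CQL condition on annotated text with a regular
# metadata filter in a Solr request. Endpoint, field and CQL parameter names
# are assumptions for illustration; consult the Mtas documentation for the
# real request syntax.
import requests

SOLR_SELECT = "http://localhost:8983/solr/nederlab/select"   # hypothetical core

params = {
    "q": "*:*",
    "fq": "genre:newspaper AND year:[1800 TO 1850]",          # ordinary Solr metadata filter
    # CQL condition on the annotated text layer: an adjective followed by a noun.
    "cql": '[pos="ADJ"][pos="NOU"]',                          # assumed parameter name
    "cql.field": "text",                                      # assumed annotated text field
    "rows": 10,
    "wt": "json",
}

response = requests.get(SOLR_SELECT, params=params, timeout=30)
response.raise_for_status()
print(response.json()["response"]["numFound"], "matching documents")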

Parsers are available for multiple, often XML-based, annotated document types, e.g. FoLiA, ISO-TEI and CHAT. The mapping by the parser of these usually somewhat loosely defined formats onto the index can be configured in detail to include in a coherent way (only) those layers that are of interest. Multiple documents and multiple mapping configurations can be combined within the same index and field. Source code, documentation, example configurations and a Docker demonstration version are available from GitHub [L1].

The Nederlab project [L2] is one of the primary use cases for the system. It aims to bring together all digitized texts relevant to the national heritage of the Netherlands, the history of Dutch language and culture (c. 800 – present) in one user-friendly and tool-enriched open access web interface, allowing scholars to simultaneously search and analyse data from texts spanning the full recorded history of the Netherlands, its language and culture. It currently provides access to over 15 million items or documents, containing almost 10 billion positions or words and over 35 billion annotations. This data is distributed over 23 different cores with a total index size of over one terabyte, and hosted on a single Xeon E5 server with 128 GB memory. Query time, depending of course on the type of request, is usually in the order of seconds and limited by server configuration to three minutes.

Some experiments on the use of topic modelling techniques with results from Mtas have been performed. This was heavily based on the expensive generation of frequency lists, and subsequent outcomes are not used to refine or extend analysis and search within Mtas. We would like to improve the efficiency of applying these techniques and offer methods to use results again for further research within conditions on the Solr index. Another ambition involves adjustments of the Mtas index structure to incorporate frequency lists on multiple layers (e.g., to create a list of all used adjectives), which could also improve performance for certain queries. Furthermore, we are interested in implementing an additional query language to explore the stored hierarchical structure, since this is currently not covered very well by CQL. Finally, to better support and integrate sequence-based techniques, it will probably be necessary to extend the index structure with n-grams, whilst also improving certain types of already supported queries.

Links:

[L1] https://kwz.me/hmd
[L2] https://kwz.me/hmc

Reference:

[1] M. Brouwer, H. Brugman, M. Kemps-Snijders: “MTAS: A Solr/Lucene based Multi-Tier Annotation Search solution”, Selected Papers from the CLARIN Annual Conference 2016, Aix-en-Provence.

Please contact:

Matthijs Brouwer
Meertens Institute, KNAW, Amsterdam, The Netherlands
[email protected]


Phonetic Search in Audio and Video Recordings

by Ioannis Dologlou and Stelios Bakamidis (RC ATHENA)

A new system uses advanced speech recognition technology to easily and efficiently retrieve information from audio/video recordings just by using keywords.

The massive amount of information produced by today’s media (radio, television, etc.) and telecommunications (fixed and mobile telephony, satellite communications, etc.) necessitates the use of automatic management strategies. Useful information can be retrieved from audio/video files by using keywords, in the same way as for text files, with a system that automatically searches for appropriate information in audio/video files using a state-of-the-art voice recognition engine. This enables valuable information in broadcast news or telephone conversations to be retrieved easily, quickly and accurately.

This system was developed by Voice-In SA, a spin-off company of the Greek research centre RC ATHENA [L1]. The research started in 2008 and the first system was delivered two years later. Research is ongoing to improve the performance and speed of the algorithms involved.

The proposed system implements the most advanced speech recognition technology (large vocabulary, continuous speech, speaker independent). It converts the statistical models of the speech recognition system and adapts them to increase both flexibility and efficiency in the handling of information provided by the keywords. In addition, the new approach comprises a scoring algorithm for automatic detection of words or phrases that are closest to the user’s query.

The system consists of two subsystems. The first subsystem performs pre-processing on each new piece of archiving material (recordings or video files), so that a file with specific information is created. The second subsystem is the actual core of the system that implements the new algorithms for search and retrieval, simultaneously exploiting the previously stored information.

The input to the system is audio or video files along with some keywords that the user wants to locate in these files. Following a very fast processing of the input data, the system provides information on whether the keywords are present in those files or not. If the outcome of the search is positive, the specific audio or video spots that have been found are mined and supplied to the user accompanied by the exact timing information and their confidence level.
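The retrieval step can be pictured with the small sketch below: given time-aligned word hypotheses with confidence scores from a recogniser, keyword matches are returned together with their timing and confidence, the kind of result the system supplies to the user. The data layout and the fuzzy-matching rule are simplifying assumptions made for illustration, not the actual implementation.

# Minimal sketch of keyword retrieval over time-aligned recognition output.
# The data layout and fuzzy-match rule are illustrative assumptions only.
from difflib import SequenceMatcher

# Hypothetical recogniser output: (word, start_sec, end_sec, confidence)
transcript = [
    ("the", 0.00, 0.12, 0.97),
    ("minister", 0.12, 0.68, 0.91),
    ("announced", 0.68, 1.20, 0.88),
    ("elections", 1.20, 1.85, 0.82),
]

def find_keyword(keyword, words, threshold=0.8):
    """Return matches close enough to the keyword, with timing and confidence."""
    hits = []
    for word, start, end, conf in words:
        similarity = SequenceMatcher(None, keyword.lower(), word.lower()).ratio()
        if similarity >= threshold:
            hits.append({"word": word, "start": start, "end": end,
                         "confidence": conf * similarity})
    return hits

print(find_keyword("election", transcript))
# -> the segment at 1.20-1.85 s, with a combined confidence score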

The major advantage of the new approach that makes it unbeatable compared to existing solutions is its ability to fully operate on any subject without any prior learning phase. Consequently it requires no overhead for customisation and/or installation and maintenance. More precisely, the system does not use any lexicon or database that will become outdated and need updating. Furthermore, it can cope with all kinds of words and terminology regardless of their frequency of use. The system has a very user-friendly interface with a comprehensive menu even for non-professionals. It can process all of the following formats: wav, wmv, wma, mp3, mpg, asf and avi.

The new system is useful for several applications in various domains and activities, including:
• Automated registration, classification, indexing and efficient, fast and inexpensive recovery of information from audiovisual media.
• Easy access to information produced by state institutions (parliament, government departments, municipalities, communities, etc.).
• Information retrieval from audiovisual material from meetings and general board meetings, corporate bodies, etc.
• Forensic applications, i.e., helping to locate people suspected of being involved in illegal activities through automatic monitoring of telephone calls, video recordings, etc.
• Automatic monitoring of air and maritime frequencies in real time to detect incidents, such as mayday calls, for a prompt response.

Future activities
Future plans focus on the performance of the algorithms, both in terms of accuracy and speed. Improving the accuracy involves the creation of a better speech recognition system with a large variety of acoustic models for many different environments (noisy, cocktail-party effect, etc.). Faster search algorithms are also needed for handling the stored information which is created by the first subsystem.

Link:

[L1] https://kwz.me/hmy

References:

[1] S. Vijayarani, A. Sakila: “Multimedia Mining Research – An Overview”, IJCGA, Vol. 5, No. 1, Jan 2015, 69-77.

[2] R. Pieraccini: “The Voice in the Machine. Building Computers That Understand Speech”, The MIT Press, 2012, ISBN 978-0262016858.

[3] T. Sainath et al.: “Convolutional neural networks for LVCSR”, ICASSP, 2013.

Please contact:

Ioannis Dologlou
RC ATHENA, Greece
[email protected]

KA3: Speech Analytics for Oral History and the Language Sciences

by Joachim Köhler (Fraunhofer IAIS), Nikolaus P. Himmelmann (Universität zu Köln) and Almut Leh (FernUniversität in Hagen)

In the project KA3 (Cologne Centre for Analysis and Archiving of Audio-Visual Data), advanced speech technologies are developed and provided to enhance the process of indexing and analysis of speech recordings from the oral history domain and the language sciences. These technologies will be provided as a central service to support researchers in the digital humanities to exploit spoken content.

Audio-visual and multimodal data are of increasing interest in digital humanities research. Currently most of the data analytics tools in digital humanities are purely text-based. In the context of a German research programme to establish centres for digital humanities research, the focus of the three-year BMBF project KA3 [L1] is to investigate, develop and provide tools to analyse huge amounts of audio-visual data resources using advanced speech technologies. These tools should enhance the process of transcribing audio-visual recordings semi-automatically and provide additional speech-related analysis features. In current humanities research, most of the work is performed completely manually by labelling and annotating speech recordings. The huge effort in terms of time and human resources required to do this severely limits the possibilities of properly including multimedia data in humanities research.

In KA3 two challenging application scenarios are defined. First, in the interaction scenario, linguists investigate the structure of conversational interactions. This includes the analysis of turn taking, back channelling and other aspects of the coordination involved in smooth conversation. A particular focus is on cross-linguistic comparison in order to delimit universal infrastructure for communication from language-specific aspects which are defined by cultural norms. The second scenario targets the oral history domain. This research direction uses extended interviews to investigate historical, social and cultural issues. After the recording process, the interviews are transcribed manually, turning the oral source into a text document for further analysis [1] [L2].

To apply new approaches to analyse audio-visual recordings in these digital humanities research scenarios, Fraunhofer IAIS provides tools and expertise for automatic speech recognition and speech analytics [2] [L3]. For the automatic segmentation and transcription of speech recordings, the Fraunhofer IAIS Audio Mining System is applied and adapted. The following figure shows the user interface of a processed oral history recording.



Figure: Search and Retrieval Interface of the Fraunhofer IAIS Audio Mining System for Oral History.

The recording is transcribed with a large vocabulary speech recognition system based on Kaldi technology. All recognised words and corresponding time codes are indexed in the Solr search engine. Hence, the user can search for relevant query items and directly acquire the snippets of the recognised phrases and the entry points (black triangles) to jump to the position where this item was spoken. Additionally, the application provides a list of relevant keywords, calculated by a modified tf-idf algorithm, to roughly describe the interview. All metadata is stored in the MPEG-7 format, which increases the interoperability with other metadata applications. Mappings to the ELAN format are realised to perform additional annotation with the ELAN tool. The system is able to process long recordings of oral history interviews (up to three hours). Depending on the recording quality, the initial recognition rate is up to 75 percent. For recordings with low audio quality the recognition process is still quite error-prone. Besides the speech recognition aspect, other speech analytics algorithms are applied. For the interaction scenario, the detection of back-channel effects and overlapping speech segments is carried out.
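The keyword list is computed with a modified tf-idf algorithm; the modification is not described in the article, so the sketch below shows only plain tf-idf keyword ranking over a small set of transcripts (illustrative code, not the Audio Mining System itself):

import math
from collections import Counter

def tfidf_keywords(transcripts, top_n=10):
    """Rank the words of each transcript by tf-idf against the whole collection."""
    docs = [t.lower().split() for t in transcripts]
    df = Counter()                      # in how many transcripts a word occurs
    for words in docs:
        df.update(set(words))
    n_docs = len(docs)
    keywords = []
    for words in docs:
        tf = Counter(words)
        scores = {w: (tf[w] / len(words)) * math.log(n_docs / df[w]) for w in tf}
        keywords.append(sorted(scores, key=scores.get, reverse=True)[:top_n])
    return keywords

# Example with two toy "interviews".
print(tfidf_keywords(["the war ended in may", "the harvest was poor that may"], top_n=3))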

The selected research scenarios raise challenging new research questions for speech analytics technology. Short back-channel effects are still hard to recognise. On the other hand, speech analytics tools already provide added value in processing huge amounts of new oral history data, thus improving retrieval and interpretation. The collaboration between speech technology scientists and digital humanities researchers is an important aspect of the KA3 project.

Links:

[L1] http://dch.phil-fak.uni-koeln.de/ka3.html

[L2] http://ohda.matrix.msu.edu/
[L3] https://kwz.me/hmH

References:

[1] D. Oard: “Can automatic speech recognition replace manual transcription?”, in D. Boyd, S. Cohen, B. Rakerd, D. Rehberger (eds.), Oral History in the Digital Age, Institute of Library and Museum Services, 2012.

[2] C. Schmidt, M. Stadtschnitzer, J. Köhler: “The Fraunhofer IAIS Audio Mining System: Current State and Future Directions”, ITG Fachtagung Speech Communication, Paderborn, Germany, September 2016.

Please contact:

Joachim Köhler
Fraunhofer IAIS, Germany
+49 (0) 2241 [email protected]

Graph-based Entity-Oriented Search: Imitating the Human Process of Seeking and Cross Referencing Information

by José Devezas and Sérgio Nunes (INESC TEC and FEUP)

In an information society, people expect to find answers to their questions quickly and with little effort. Sometimes, these answers are locked within textual documents, which often require a manual analysis, after being retrieved from the web using search engines. At FEUP InfoLab, we are researching graph-based models to index combined data (text and knowledge), with the goal of improving entity-oriented search effectiveness.

We live in a world where an ever-growing web is able to deliver a large body of knowledge to virtually anyone with an internet connection. At the same time, the high availability of content has morphed human information seeking behaviour [1]. People expect to find answers to their questions quickly and with little effort. The quality of the answers is frequently tied to the search engine's ability to understand query intent, using information from curated knowledge bases to provide direct answers, based on identified entities and relations, alongside the traditional textual document results.




Figure 1: The left side shows the graph-of-word representation for an example document (the first sentence of the Wikipedia page for “Semantic Search”). Each term links to the following two terms, as a way to use indegree to establish the context of a word. The right side shows the proposed graph-of-entity, linking consecutive terms, terms occurring within entity names, and related entities, as a way to unify text and knowledge.

Search engines are greatly dependent on the inverted index, inspired by the back-of-the-book index of printed manuscripts, to rank documents with matching keywords, but they are also increasingly dependent on knowledge bases. While there are automatic methods for knowledge base construction, most search engines still depend on manual curation for this task. On one side, there is the error associated with automatic knowledge base construction and, on the other side, there is the time constraint and domain expertise of manually curating a knowledge base. We propose an intermediate solution based on a novel graph-based indexing structure, with the goal of combining the power of the inverted index with any available and trustworthy information, through established knowledge bases.

Our current research is focused on finding novel ways of integrating text and information through a graph, without losing the properties of terms in an inverted index, and entities and relations in a triple store. The goal is to be able to retain the characteristics of text and knowledge, while combining them, through a unified representation, to improve retrieval. The hypothesis is that by establishing potentially weak links between text and entities, we might be able to support and imitate the human process of seeking and cross referencing information: knowledge-supported keyword-based search. As humans, each of us compiles knowledge from multiple sources (e.g., the world, books, other people), establishing relations between entities and continuously correcting for consistency, based on concurrent information and its trustworthiness. When we have an information need, we either ask someone or consume some sort of media (e.g., a book, a video, an audio lesson) to obtain answers. Let us take, for instance, the task of searching within a book to solve an information need. Specifically, let us assume a back-of-the-book index search for a given set of terms (analogous to the traditional keyword query). Let us then assume that we skip to a page indicated by one of those terms and read a textual passage. How do we determine whether or not it is relevant for our information need? While we already know it contains one of the terms we seek, we must use existing knowledge (ours or otherwise) to assess the relevance of the text.

There is an obvious connection between text and knowledge that isn't being captured by existing search technologies. While there is a clear and growing integration of text-based search and entity-based decorations (e.g., an infobox about the most relevant entity, or a list of entities for the given entity type expressed by the query), the inverted index still exists separately from the knowledge base and vice-versa. Our goal is to explore the opportunity of improving retrieval effectiveness based on a seamless integration of text and knowledge through a common data model, while proposing one or several unifying ranking functions that decide based only on the maximum available information.

We have based our work on the graph-of-word [2], a document representation and retrieval model that defies the term independence assumption of the traditional bag-of-words approach used in inverted indexing. Figure 1 (left) shows the graph-of-word representation for an example document (the first sentence of the Wikipedia page for “Semantic Search”). Each term links to the following two terms, as a way to use indegree (the number of incoming links) to establish the context of a word. We propose the graph-of-entity (Figure 1, right), where we link each term (in pink) only to the following term (dashed line), but also include entity nodes (in green), basic “contained_in” edges between term and entity nodes (dotted line; a weak relation based on substring matching), and edges between entity nodes (solid line), representing relations between entities in a knowledge base or, in this case, indirectly based on the hyperlinks of the Wikipedia article. The objective is to unify text and knowledge retrieval as a combined task, in order to use structured and unstructured data to provide better answers for the information needs of the users.
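As an illustration of the representation just described (this is not the FEUP InfoLab indexing code), a graph-of-word with a window of two following terms can be built in a few lines; the sample sentence, function name and window size are ours:

import networkx as nx

def graph_of_word(text, window=2):
    """Build a graph-of-word: each term points to the next `window` terms,
    so a term's indegree reflects how much context precedes it."""
    terms = text.lower().split()
    g = nx.DiGraph()
    g.add_nodes_from(terms)
    for i in range(len(terms)):
        for j in range(i + 1, min(i + 1 + window, len(terms))):
            g.add_edge(terms[i], terms[j])
    return g

g = graph_of_word("semantic search seeks to improve search accuracy")
# Indegree as a simple signal of how strongly a term is contextualised.
print(sorted(g.in_degree(), key=lambda kv: kv[1], reverse=True))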

Links:

[L1] http://infolab.fe.up.pt
[L2] http://ant.fe.up.pt

References:

[1] S. A. Knight, A. Spink: “Toward a web search information behavior model”, in Web Search, pages 209-234, Springer, 2008.

[2] F. Rousseau, M. Vazirgiannis: “Graph-of-word and TW-IDF: new approach to ad hoc IR”, in Proc. of the 22nd ACM International Conference on Information & Knowledge Management, pp. 59-68, ACM, 2013.

Please contact:

José Devezas, Sérgio Nunes
INESC TEC and Universidade do Porto
[email protected], [email protected]
https://josedevezas.com

José Devezas is supported by research grant PD/BD/128160/2016, provided by the Portuguese funding agency, Fundação para a Ciência e a Tecnologia (FCT).


ContentCheck: Content Management Techniques and Tools for Fact-Checking

by Ioana Manolescu (Inria Saclay and Ecole Polytechnique, France)

Data journalism and journalistic fact-checking make up a vibrant area of applied research. Content management models and algorithms have the potential to tremendously multiply their power, and have already started to do so.

The immense value of big data has recently been acknowledged by the media industry, with the coining of the term “data journalism” to refer to journalistic work inspired by data sources. While data is a natural ingredient of all reporting, the increasing volumes of available digital data as well as its increasing complexity lead to a qualitative jump, where technical skills for working with data are stringently needed in journalism teams.

An ongoing collaborative research programme focused on novel content management techniques applied to data journalism and fact-checking has recently been initiated by Inria, the LIMSI lab (Université Paris Saclay, CNRS and Université Paris Sud), Université Rennes 1, Université Lyon 1 and the “Les Décodeurs” fact-checking team of Le Monde, France's leading newspaper [L1]. Here, content is broadly interpreted to denote structured data, text, and knowledge bases, as well as information describing the social context of data being produced and exchanged between various actors. The project, called ContentCheck [L2], is sponsored by ANR, the French National Research Agency, and is set to run until 2019.

The project goals are twofold:
• First, in an area rich with sensational fake news, debunked myths, social media rumours, and ambitious start-ups, we aim at a comprehensive analysis of the areas of computer science from which data journalism and fact-checking can draw ideas and techniques. Questions we seek to answer include: what kinds of content are involved? What is their life cycle? What types of processing are frequently applied (or needed!) in journalistic endeavours? Can we identify a blueprint architecture for the ideal journalistic content management system (JCMS)?
• Second, we seek to advance and improve the technical tools available for such journalistic tasks, by proposing specific high-level models, languages, and algorithms applied to data of interest to journalists.

Prior to the start of the project, interviews with mainstream media journalists from Le Monde, The Washington Post and the Financial Times highlighted severe limitations of their JCMSs. These are typically restricted to archiving published articles, and providing full-text or category-based searches on them. No support is available for storing or processing the external data that the journalists work with on a daily basis; newsrooms rely on ad-hoc tools such as shared documents and repositories on the web, with files copied to and fro as soon as processing is required. The overhead, lost productivity, privacy and reliability weaknesses of this approach are readily evident.





The main findings made in our project to date include:
• Data journalism and fact-checking involve a wide range of content management and processing tasks, as well as human-intensive tasks performed individually (e.g., by a journalist or an external expert of a given field whose input is solicited; for instance, ClimateFeedback is an effort by climate scientists to analyse media articles about climate change, see [L3]) or collectively (e.g., readers can help flag fake news in a crowd-sourcing scenario, while a large consortium of journalists may work on a large news story with international implications; the International Consortium of Investigative Journalists is at the origin of the Panama Papers [L4] disclosure concerning tax avoidance through tax havens).
• Content management tasks include the usual CRUD (create, read, update, delete) data cycle, applied both to static content (e.g., web articles or government PDF reports) and dynamic content such as that provided by social media streams.
• Stream monitoring and stream filtering tools are highly desirable, as journalists need help to identify, in the daily avalanche of online information, the subset worth focusing on for further analysis.
• Time information attached to all forms of content is highly valuable. It is important to keep track of the time-changing roles (e.g., elected positions) held by public figures, and also to record the time when statements were made, and (when applicable) the time such statements were referring to (e.g., when was a certain politician's spouse employed, and when did the politician share this information).

Work to design a single unified architecture for a content management tool dedicated to fact-checking is still ongoing as we write. The overall vision we currently base our analysis on is outlined in Figure 1. Claims are made through various media, and (importantly) in a context, in which one can find the claim's authors, their institutions, friend and organisational affiliations, etc. Claims are fact-checked against some reference information, typically supplied by trustworthy institutions, such as national statistics institutes (INSEE in France, the Office for National Statistics in the UK) or trusted experts, such as well-established scientists working on a specific topic. Claims are checked by human users (journalists, scientists, or concerned citizens), possibly with the help of some automated tools. The output of a fact-checking task is a claim analysis, which states which parts of the claims are true, mostly true, mostly false, etc., together with references to the trustworthy sources used for the check. Fact-checking outputs, then, can be archived and used as further reference sources.

Figure 1: Flow of information in fact-checking tasks, and relevant research problems from the content/data/information management area.

Scientific outputs of the project so far include a light-weight data integration platform for data journalism [1], an analysis of EU Parliament votes highlighting unusual (unexpected) correlations between the voting patterns of different political groups [2], and a linked open data extractor for INSEE spreadsheets [3]. Our project website is available at [L2].
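Purely as an illustration of this vision (not the ContentCheck data model), a claim analysis can be sketched as a record that ties a claim and its context to a verdict and the reference sources used for the check; all field names and values below are invented:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Claim:
    text: str
    author: str    # who made the claim
    medium: str    # where it was made (TV, tweet, speech, ...)
    date: str      # when it was made

@dataclass
class ClaimAnalysis:
    claim: Claim
    verdict: str   # "true" | "mostly true" | "mostly false" | "false"
    reference_sources: List[str] = field(default_factory=list)

analysis = ClaimAnalysis(
    claim=Claim("Unemployment halved last year", "A. Politician", "TV debate", "2017-05-03"),
    verdict="mostly false",
    reference_sources=["INSEE unemployment statistics, 2016-2017"],
)
print(analysis.verdict, analysis.reference_sources)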

Links:

[L1] https://kwz.me/hmE
[L2] https://kwz.me/hmK
[L3] https://kwz.me/hmS
[L4] https://panamapapers.icij.org/

References:

[1] R. Bonaque et al.: “Mixed-instance querying: a lightweight integration architecture for data journalism” (demonstration), PVLDB Conference, 2016.

[2] A. Belfodil et al.: “Flash points: Discovering exceptional pairwise behaviors in vote or rating data”, ECML-PKDD Conference, 2017.

[3] T. D. Cao, I. Manolescu, X. Tannier: “Extracting Linked Data from statistic spreadsheets”, Semantic Big Data Workshop, 2017.

Please contact:

Ioana Manolescu
Inria Saclay and Ecole Polytechnique, France
[email protected]

Argumentation Analysis of Engineering Research

by Mihály Héder (MTA-SZTAKI)

Engineering research literature tends to have fewer citations per document than other areas of science. Natural language processing, argumentation structure analysis, and pattern recognition in graphs can help to explain this by providing an understanding of the scientific impact mechanisms between scientists, engineers and society as a whole.

Institutionalised engineering research often needs to follow research policy that was designed with natural science in mind. In this setting, individual academic advancement as well as research funding largely depend on scientific indicators. A dominant way of measuring impact in science is counting citations at different levels and in different dimensions: for individual articles and journals as well as for individuals and groups. If an engineering research group does not deliver on these metrics, it might compensate for this with the revenue it generates. But this strategy increasingly pushes a group from basic engineering research activity to applied and short-term profitable research, since basic research does not result in income.

Our main research question is: why are there fewer citations per document in this field compared to almost any other branch of science? This has direct consequences on the aforementioned scientific metrics of the field. We are also addressing the questions: are conventional citations good indicators for engineering research at all? And what would be the ideal impact measurement mechanism for engineering research?

Our work approaches the problem by investigating both the argumentative structure [1] of engineering research articles and the directed graphs that represent citations between papers. For investigating individual papers we use natural language processing, including named entity recognition, clustering, classification and keyword analysis. During our work we rely on publicly available Open Access registries and citation databases represented in linked open datasets.

We are testing several initial hypotheses which might explain the low number of citations:
• there are virtually no long debates that manifest themselves in debate-starter papers and follow-ups;
• the audience on which engineering research has an impact does not itself publish in large numbers;
• there are some additional effects: references to standards, design documents and patents are often just implied and not made explicit – and even when they are, they do not count.


The first hypothesis is tested with graph pattern matching. In this case we are defining an abstract pattern in the citation graphs of noted debates in the fields of philosophy of science and philosophy of technology. Then, we attempt to recognise similar patterns in the engineering research literature.
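The abstract debate pattern itself is not spelled out in the article; the sketch below is therefore only one plausible illustration, flagging papers whose citing papers are themselves cited in turn (reply-to-reply chains). The graph, the names and the threshold are invented:

import networkx as nx

def debate_starters(citations, min_followups=2):
    """citations: directed graph with an edge (a, b) meaning paper a cites paper b.
    A paper is flagged as a debate starter if at least `min_followups` of the
    papers citing it are themselves cited by later papers."""
    starters = []
    for paper in citations.nodes:
        replies = list(citations.predecessors(paper))          # papers citing `paper`
        followed_up = [r for r in replies if any(True for _ in citations.predecessors(r))]
        if len(followed_up) >= min_followups:
            starters.append(paper)
    return starters

g = nx.DiGraph([("B", "A"), ("C", "A"), ("D", "B"), ("E", "C")])
print(debate_starters(g))   # ['A'] - B and C reply to A and are replied to in turn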

To test the second hypothesis we look into additional sources, like standards and software code that are known to be applications of certain research papers. Then we investigate the publication history of the creators of those applications, to see if they report those applications in publications. Here we rely on publicly available citation databases and searchable databases provided by big publishers and internet search firms.

For the third hypothesis we have invented the definitions of several types of “implicit citations”. Implicit citations are cases where the impact of a research article is clear in some work or artefact – standards, software code, patents, etc. – but, because of the nature of the work, the impact never appears as a citation in any database. A typical example of this is the usage of a particular algorithm in software. While it is not appropriate to consider these kinds of “implicit citations” as equal to citations from within prestigious papers, they still point to the research's effects on industry and society. Public funding is often justified by the advantages a research direction eventually brings to the taxpayer and society, so it is good to have an objective metric.

The preliminary results indicate that the low number of citations in the field of engineering research can, to a great extent, be explained by the hypotheses above. We have also identified other unanticipated factors, namely a proxy effect that lowers the overall number of citations, as well as a tendency in the field to cite a well-known named entity (like the name of an algorithm) without referencing it in a bibliographically correct way.

Science metrics on which researchers need to deliver are ways of limiting the freedom of inquiry, since they prescribe the publication types researchers need to use, as well as the places – prestigious journals, conferences – where they must publish. This is usually done in good will and with the intent of improving the quality of science, but the usual metrics can be detrimental to the cause of an envisioned engineering science [2]. Since the need for some kind of metrics is likely to remain, the project will propose alternatives that measure engineering research impact more inclusively by incorporating design documents, patents, standards, and source code usage.

References:

[1] S. Teufel: “The Structure of Scientific Articles: Applications to Citation Indexing and Summarization”, Center for the Study of Language and Information – Lecture Notes, 2010.

[2] B. Martin: “The use of multiple indicators in the assessment of basic research”, Scientometrics, 1996, 36.3: 343-362.

Please contact:

Mihály Héder
SZTAKI, Hungarian Academy of Sciences, Hungary
[email protected]
+36 1 279 6027

A Back Bone Thesaurus for Digital Humanities

by Maria Daskalaki and Lida Charami (ICS-FORTH)

In order to integrate new digital technologies and traditional research in the humanities and enable collaboration across the various scientific fields, we have developed a coherent overarching thesaurus with a small number of highly expressive, consistent upper level concepts which can be used as the starting point for harmonising the numerous discipline and even project specific terminologies into a coherent and effective thesaurus federation.

Digital humanities is a relatively new interdisciplinary field which involves the integration of emerging new digital technologies with traditional research in the humanities in order to ensure the long term preservation of knowledge and enable collaboration across the various humanities fields. This integration, however, is not as easy and straightforward as it sounds, as it entails the creation and use of a “common language” in the form of a classification scheme that would enable communication between different disciplines. The actual state of play, however, is somewhat different, with different groups of scholars usually developing their own jargon in order to build thematic vocabularies that are discipline or even application specific. As Barry Smith [1] observes, “different databases may use identical labels but with different meanings; alternately the same meaning may be expressed via different names”. This inevitably introduces an unnecessary fragmentation of knowledge that inhibits research and collaboration. Given this situation, there is clearly an urgent need to create a common scheme that would enable interoperability between the different scholarly fields and thus support researchers by giving them access to uniformly marked up datasets for query and by providing a guide for the production of systematic terminologies which would avoid the methodological errors that typically lead to inconsistencies and incompatibilities between classification systems.

Despite the clear challenges to the construction of such a unifying framework, we argue that “a global knowledge network” [2] is feasible. Building on a concentrated research programme into classification methodology, we have developed a system, the Back Bone Thesaurus (BBT) [L1], that aims to allow access, compatibility and comparison across heterogeneous [3] classification systems.



This system, elaborated after the research of a multi-disciplinary team of experts, is based on a consistent methodology designed to enable intersubjective and interdisciplinary classification development and integration without forcing specialists and experts to abandon their own terminology. The methodology relies on the principle of faceted classification and the idea that a limited number of top-level concepts can become a substantial tool to harmonise the numerous discipline and even project specific terminologies into a coherent and effective federation in which consistency can progressively be carried from the upper layers to the lower ones.

In order to define the BBT facets, we started by examining existing vocabularies from the fields of history, archaeology, ethnology, philosophy of sciences, anthropology, linguistics, theatre studies, musicology and history of art; we analysed these data using a bottom-up strategy in order to discover appropriate upper level concepts. The research consciously avoided the projection of any preconceived formulations of knowledge onto the material, precisely in order to identify the broader, fundamental categories that would be applicable across the humanities. The top level concepts thus derived, despite their generality, can be easily specialised in order to express the particular meaning of the different domains without leading to inconsistencies. This is achieved through the detection of the intensional properties of these concepts and the rigorous and proper application of the IsA relationship.

In order to express the exact meaning of the top level terms/concepts defined in the BBT, we provide explicit definitions on the basis of their intensional properties, which cannot be replaced without loss of meaning since they are the sum of the properties, states of affairs and qualities that constitute the necessary and sufficient conditions for identifying a term/concept.

The BBT facets are further subdivided into a number of hierarchies using the IsA relation, which dictates that the scope of each narrower term subsumed under a broader term must fall completely within the scope of the broader term. In other words, every subsumed term must belong to the same inherent category as its broader concept. Using the IsA relation as the criterion for building the BBT hierarchies ensures that consistency is maintained, since all narrower terms must possess all the fundamental properties attributed to the broader concepts of the hierarchy into which they are subsumed. In other words, by using the IsA relation we avoid categorical errors that may result from the subsumption of terms under facets or hierarchies which have properties different from those of the higher level terms. The strict, proper application of the IsA relation thus serves as a logical control to avoid contradictions and achieve objectivity and interdisciplinarity.
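As a toy illustration of this logical control (not BBT tooling), one can check mechanically that every narrower term carries all the fundamental properties declared for its broader term; the terms and properties below are invented:

# Hypothetical fragment of a faceted thesaurus: each term lists its broader
# term and the fundamental properties that define it.
terms = {
    "Activity": {"broader": None,       "properties": {"has temporal extent"}},
    "Craft":    {"broader": "Activity", "properties": {"has temporal extent", "produces artefact"}},
    "Weaving":  {"broader": "Craft",    "properties": {"has temporal extent", "produces artefact"}},
    "Loom":     {"broader": "Craft",    "properties": {"is physical object"}},  # categorical error
}

def isa_violations(terms):
    """Report narrower terms that do not inherit every property of their broader term."""
    problems = []
    for name, t in terms.items():
        broader = t["broader"]
        if broader and not terms[broader]["properties"] <= t["properties"]:
            problems.append((name, broader))
    return problems

print(isa_violations(terms))   # [('Loom', 'Craft')] - a loom is not a kind of craft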

The BBT is an ongoing work and we are currently in the process of reviewing the material we have at our disposal in order to identify additional facets and hierarchies.

Link:

[L1] http://www.backbonethesaurus.eu/

References:

[1] B. Smith: “Ontology”, in The Blackwell Guide to the Philosophy of Computing and Information, L. Floridi (ed.), Oxford: Blackwell, 2004, p. 158.

[2] M. Doerr, D. Iorizzo: “The dream of a global knowledge network – A new approach”, Journal on Computing and Cultural Heritage, 2008, 1(1), p. 1.

[3] M. Daskalaki, M. Doerr: “Philosophical background assumptions in digitized knowledge representation systems”, Dia-noesis: A Journal of Philosophy, 2017, (3), 17-28.

Please contact:

Maria Daskalaki
ICS-FORTH, Greece
[email protected]


Meeting the Challenges to Reap the Benefits of Semantic Data in Digital Humanities

by George Bruseker, Martin Doerr and Chryssoula Bekiari (ICS-FORTH)

In the era of big data, digital humanities faces the ongoing challenge of formulating a long-term and complete strategy for creating and managing interoperable and accessible datasets to support its research aims. Semantics and formal ontologies, properly understood and implemented, provide a powerful potential solution to this problem. A long-term research programme is contributing to a complete methodology and toolset for managing the semantic data lifecycle.

The field of semantics and the use of formal ontologies for representing research data are key areas of research in the digital humanities, aiming to support the sustainable development of interoperable and accessible datasets for use by research communities in the era of big data. Coined by Berners-Lee, semantic data refers to data which is machine processable and human readable. Formal ontologies provide explicit and disciplined means of producing such data, to ensure its wide compatibility and clear interpretability.

At the Centre for Cultural Informatics (CCI) of ICS-FORTH, based in Heraklion, Greece, we have been researching a comprehensive solution to the complete lifecycle of semantic data use supported by formal ontologies. Our research is nearly at a point where it can be applied within digital humanities by bringing together the basic methods and tools for the distinct steps of this cycle: data modelling, mapping and transformation, querying and management. In particular, research at the CCI focuses on development to fill gaps or improve methodologies in these key steps.

Before semantic data can be integrated, an adequate model for the domain must be elaborated. Semantic data modelling typically follows one of two basic strategies: the elaboration of complex, all-inclusive models or restriction to modelling of a highly focussed domain. Both produce highly useful models that nevertheless display certain limitations for broad interoperability. The limitations of these strategies tend to lie, on the one hand, in a powerful, general integration with less detailed integration at the leaf level in complex models (e.g., INSPIRE) and, on the other hand, in extremely tight integration of data with a lack of relation to the general in more compact models (e.g., FOAF). As a coordinating member of the international collaborative CIDOC CRM Special Interest Group (SIG) [L1], working under the aegis of the International Council of Museums (ICOM), CCI follows a different approach.

Adopting a bottom-up development process that works from actual data structures, the CRM SIG has produced an ontology for cultural heritage and e-sciences which provides the general integrative functions of a base ontology. The base model of CIDOC CRM is currently in the sixth revision of its community version, with the fifth revision standing as the base of the last ISO release in 2014 [1]. The long term success in uptake of this model has laid the foundation for community collaboration with experts from various disciplines to create harmonised extensions including: FRBRoo and PRESSoo for library data; CRMdig, CRMinf and CRMsci, which collectively provide provenance data in the respective areas of digitisation, argumentation and observation sciences; and CRMarchaeo and CRMba, which support reasoning over archaeological practice. The innovation of this extension development process is the collaborative work with specialist communities to elaborate harmonised extensions to the base model which enable the representation of special domains of research while maintaining compatibility with the top level model.

Having a general ontological framework available to express their data, researchers require a means to translate existing data into the common expression. Development work at CCI has created the X3ML Suite, which provides an innovative language, database and data mapping tool for generating completely declarative mappings from any XML data source to any RDFS-encoded ontology. This suite of functionalities allows domain specialists to carry out and track mapping processes to the CRM or other suitable ontologies on their own, without having to rely on the mediation of computer science specialists [2]. Together with a tool for easily viewing and reviewing RDF data (see the article by Minadakis et al. in the “Research and Innovation” section of this issue), this suite provides a platform for managing large scale semantic mapping processes without restrictions to a specific schema.

Once expressed in a common, but still complex, semantic format, there is still the challenge of how to provide researchers with a tool to query this semantic network without necessarily having to learn complex query languages such as SPARQL or the nuances of use of a large ontology. Methodological work at the CCI has produced a theory of Fundamental Categories and Relations which describes how to specify generalist queries over a complex model that will bring back relevant results to researchers by providing an intuitive and semantically consistent abstraction over the complexities of the ontology [3]. This methodology has been taken up and developed as a key tool by the ResearchSpace project in its development of an open source platform for semantic data in the cultural heritage and digital humanities domains [L2].
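To give a flavour of what such an abstraction hides from the user, the toy example below (not ResearchSpace or CCI code) builds a tiny CIDOC CRM-style graph with rdflib and asks directly in SPARQL what an actor produced; the entity URIs are invented and only three CRM properties are used:

from rdflib import Graph, Namespace, Literal

CRM = Namespace("http://www.cidoc-crm.org/cidoc-crm/")
EX = Namespace("http://example.org/")

g = Graph()
# A production event that produced an object and was carried out by an actor.
g.add((EX.production1, CRM.P108_has_produced, EX.amphora1))
g.add((EX.production1, CRM.P14_carried_out_by, EX.potter1))
g.add((EX.amphora1, CRM.P102_has_title, Literal("Black-figure amphora")))

# "What was made by potter1?" - the kind of question a fundamental-relation
# abstraction would let a researcher ask without writing SPARQL themselves.
q = """
SELECT ?thing ?title WHERE {
  ?event <http://www.cidoc-crm.org/cidoc-crm/P14_carried_out_by> <http://example.org/potter1> .
  ?event <http://www.cidoc-crm.org/cidoc-crm/P108_has_produced> ?thing .
  OPTIONAL { ?thing <http://www.cidoc-crm.org/cidoc-crm/P102_has_title> ?title . }
}
"""
for row in g.query(q):
    print(row.thing, row.title)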

Finally, to ensure the sustainable development and use of semantically encoded resources, a complete strategy for the semantic data life cycle must be elaborated. CCI presently participates in the Parthenos project [L3], developing a conceptual data model and architecture for long term semantic data integration and curation that aims to model the integration process itself and thereby support long term, on-demand integration tasks and the monitoring thereof.

In order to meet the challenges and take advantage of the benefits of semantic data in digital humanities in the era of big data, a complete strategy and set of tools to cover the basic elements of the semantic data lifecycle is essential.

Figure 2: UI of ResearchSpace fundamental categories and relations query tool, © ResearchSpace.

User interface of the X3ML declarative mapping tool.

With the maturation of a base model, the creation of a declarative mapping tool and language, a generalising query function, and a model and method for managing integration processes, we believe that the key elements for meeting these challenges now lie in place.

Links:
[L1] http://www.cidoc-crm.org/
[L2] http://www.researchspace.org/
[L3] http://www.parthenos-project.eu/

References:
[1] ISO: ISO 21127:2014, Information and documentation – a reference ontology for the interchange of cultural heritage information, 2nd edn., 2014.

[2] N. Minadakis et al.: “X3ML Framework: An effective suite for supporting data mappings”, in Proc. of the Workshop on Extending, Mapping and Focusing the CRM, co-located with the 19th International Conference on Theory and Practice of Digital Libraries (2015), Poznań, Poland, September 17, 2015, CEUR Workshop Proceedings 1656, 1-12, 2016.

[3] K. Tzompanaki, M. Doerr: “Fundamental Categories and Relationships for Intuitive querying CIDOC-CRM based repositories”, Technical Report, 2012.

Please contact:
George Bruseker
Centre for Cultural Informatics, ICS-FORTH, Greece
[email protected]


HumaReC: Continuous Data Publishing in the Humanities

by Claire Clivaz, Sara Schulthess and Anastasia Chasapi (SIB)

HumaReC, a Swiss National Foundation project, aims to test a new model of digital research partnership: from original document source to publisher. Its object of study is a trilingual, 12th century, New Testament manuscript, which is used by HumaReC to test continuous data publishing.

HumaReC [L1] is a Vital-DH@Vital-IT project funded by the Swiss National Foundation. Under the leadership of Claire Clivaz, it started in October 2016 and is running for two years. The project is based at Vital-IT (Lausanne, CH), under the guidance of group leader Ioannis Xenarios from the Swiss Institute of Bioinformatics. The team is composed of a mixture of humanities and computing scholars; Sara Schulthess is the main researcher.

The aim of HumaReC is to investigate how humanities research is reshaped and transformed by the digital rhythm of data production and publication; it also aims to establish best practices for digital humanities research. As a test case, the study is focussed on a unique, trilingual New Testament manuscript: Marciana Gr. Z. 11 (379), written in Greek, Latin and Arabic. In the spirit of the OA2020 initiative [L4], continuous data publishing is being tested in partnership with all the research network stakeholders: the Marciana library (Venice, Italy), the Edition Visualization Technology (EVT) [L2] and Transkribus [L3] research teams, and the publisher Brill.

Rhythm is a central notion in the structure of HumaReC and we have chosen this key concept, based notably on Meschonnic's analysis [2], to observe the changes happening in digital humanities research, which has been premised on printed culture for a long time. A two to three year research project in humanities has traditionally been characterised by the writing, editing and publication of a final, printed book, often delayed to a certain date after the end of the project. This delay was even considered proof of authentic, high-level research in humanities, certified by an established book series. The digital transition is creating a completely new research paradigm, in part due to the publishing of formats such as videos, short messages or draft papers, social media and blogs, all before the research is even completed and peer-reviewed. How is it possible to develop certified and continuous data publishing in digital research in the humanities?

As the project's first step, a virtual research environment was created for HumaReC, allowing the research process and results to be made continuously available. It provides a manuscript viewer in fully open access that includes quality images of the manuscript and three columns of transcriptions (see Figure 1). The manuscript viewer is based on EVT open source technology and is a development of a previous project's viewer [1]. The improved viewer offers additional features such as linking between text and image. In addition to this, an annotation system will allow users to directly comment on the manuscript viewer.


Figure 1: Manuscript viewer f. 156r © Marciana Library, all rights reserved.


Secondly, Transkribus, a Handwritten Text Recognition tool, will also be tested by HumaReC: a certain number of words transcribed by hand allows a learning machine to be trained to recognise specific writing in a manuscript. The results are forthcoming: we will be able to compare the time taken by purely hand transcription with results produced by a human/machine team.

Finally, three publication formats were chosen for the continuous dissemination of the data:
• The virtual research environment itself; it received an ISSN (2504-5075) from the Swiss National Library: all the published material associated with the project can be referred to with this number. An international editorial board is providing project feedback and input on its research results.
• The research blog is an important interactive continuous publishing process. We regularly update the blog about the development of the project and the research results. We also encourage discussions by being present on social media (Facebook and Twitter).
• The web-book, continuously written in open access, summarises the research in a long, structured text, similar to a conventional monograph but related to the data. It is produced in partnership with the publisher Brill. At the end of the project, it will be peer-reviewed and, hopefully, published with its own ISSN by Brill.

We are confident that HumaReC will establish a new research and publishing model, including potential commercial developments for all interested publishers.

Acknowledgement

Many thanks to Harley Edwards for his English proof-reading, to the Marciana library for the image copyright, to the reviewers for their useful remarks, and to Martial Sankar and Ioannis Xenarios for their collaboration and support in our Vital-DH projects@Vital-IT (SIB).

Links:

[L1] https://humarec.org; http://p3.snf.ch/project-169869
[L2] http://evt.labcd.unipi.it/
[L3] http://transkribus.eu
[L4] http://oa2020.org

References:

[1] C. Clivaz et al.: “Editing New Testament Arabic Manuscripts in a TEI-base: fostering close reading in Digital Humanities”, JDMDH 3700, 2017.

[2] H. Meschonnic: “Critique du rythme. Anthropologie historique du langage”, 1982.

Please contact:

Claire Clivaz
Swiss Institute of Bioinformatics, Switzerland
+41216924060
[email protected]

A Data Management Plan for Digital Humanities: the PARTHENOS Model

by Sheena Bassett (PIN Scrl), Sara Di Giorgio (MIBACT-ICCU) and Paola Ronzino (PIN Scrl)

Understanding how data has been created and under which conditions it can be reused is a significant step towards the realisation of open science.

PARTHENOS [L1] is a Horizon 2020 project funded by the European Commission that aims at strengthening the cohesion of research in the broad sector of linguistic studies, cultural heritage, history, archaeology and related fields through a thematic cluster of European research infrastructures. PARTHENOS is building a cross-disciplinary virtual environment to enable researchers in the humanities to have access to data, tools and services based on common policies, guidelines and standards. The project is built around two European Research Infrastructure Consortia (ERICs) from the humanities and arts sector: DARIAH [L2] (research based on digital humanities) and CLARIN [L3] (research based on language data), along with ARIADNE [L4] (digital archaeological research infrastructure), EHRI [L5] (European Holocaust research infrastructure), CENDARI [L6] (digital research infrastructure for historical research), and CHARISMA [L7] and IPERION-CH [L8] (EU projects on heritage science), and involves all the relevant integrating activities projects.

Since 2016, the Horizon 2020 Programme has produced guidelines on FAIR data management [L9] to help Horizon 2020 beneficiaries make their research data findable, accessible, interoperable and reusable (FAIR) [L10]. Funded projects are requested to deliver an implementation of a DMP, which aims to improve and maximise access to and reuse of the research data generated by the projects. This is in line with the Commission's policy actions on open science to reinforce the EU's political priority of fostering knowledge circulation. Open science is in practice about “sharing knowledge as early as practically possible in the discovery process” and, because DMPs gather information about what data will be created and how, and outline the plans for sharing and preservation, specifying the nature of the data and any restrictions that may need to be applied, these plans ensure that research data is secure and well-maintained during a project and beyond, when it might be shared with others. DMPs are key elements to knowledge discovery and innovation and to subsequent data and knowledge integration and reuse.

Special attention has been paid to the development of a PARTHENOS data management plan which builds on the Horizon 2020 DMP template. This has resulted in a template (draft) which aims to address the domain-specific procedures and practices within the humanities, taking into consideration standards and guidelines used in data management that are relevant for the PARTHENOS specific research communities, which include archaeologists, historians, linguists, librarians, archivists, and social scientists.



The PARTHENOS DMP template has been enriched and tailored with specifications from the humanities which were derived from a survey carried out among the consortium's experts. To gather these specifications, representatives of the PARTHENOS communities were asked to describe their daily data management procedures in detail. The questionnaire was structured according to the various phases of the data life cycle and was then mapped to the FAIR principles. Each respondent had the opportunity to choose his/her role (e.g., researcher creating data / repository provider) and to provide insight into their best practices through the form.

The PARTHENOS DMP template will satisfy different stakeholders, such as institutional repositories, funded projects and researchers, each of which has an individual perspective on data quality and FAIRness issues. The PARTHENOS DMP template will be divided into three different levels: a first level which includes a set of core general requirements irrespective of discipline, a second level including domain-specific requirements, and a third level which is project-based. To help users in completing a data management plan, a set of guiding statements for the specific disciplines will be provided.

At present, the PARTHENOS DMP template collects the high-level requirements that satisfy each community of researchers involved in the project, with a list of recommended answers that will support the compilation of the DMP. A second stage of this work will concern the production of PARTHENOS community-specific DMP templates, which will be included in the final version of the “Guidelines for common policies implementation”.

Further work will concern the creation of a DMP template addressing institutions that manage repositories. Since enabling interoperability is a great benefit for researchers, repository providers should be able to explain in depth how to provide data to them in the best way. Through the envisaged template, PARTHENOS will provide the right tools for repository providers to be able to offer standardised answers and guidance, and to liaise with researchers that are looking to deposit their data with them.

Thanks to the PARTHENOS DMP template, researchers will be able to freely access, mine, exploit, reproduce and disseminate their data and identify the tools needed to use the raw data for validating research results, or provide the tools themselves – a significant step towards the realisation of open science.

Visit the PARTHENOS website [L1] for more information and subscribe to the newsletter to keep up to date with developments.

Links:

[L1] www.parthenos-project.eu
[L2] www.dariah.eu/
[L3] www.clarin.eu/
[L4] www.ariadne-infrastructure.eu
[L5] www.ehri-project.eu/
[L6] www.cendari.eu/
[L7] kwz.me/hTk
[L8] www.iperionch.eu/
[L9] kwz.me/hTO
[L10] kwz.me/hTo

References:

[1] The Three Os – Open innovation, open science, open to the world – a vision for Europe, European Commission, Directorate-General for Research & Innovation, 17 May 2016, https://kwz.me/hm4

[2] M. D. Wilkinson et al.: “The FAIR Guiding Principles for scientific data management and stewardship”, Sci. Data 3:160018, doi: 10.1038/sdata.2016.18, 2016.

Please contact:

Sheena Bassett
PIN Scrl, Italy
[email protected]


Trusting Computation in Digital Humanities Research

by Jacco van Ossenbruggen (CWI)

Research in the humanities typically involves studying specific and potentially subjective interpretations of (historical) sources, whilst the computational tools and techniques used to support such questions aim at providing generic and objective methods to process large volumes of data. We claim that the success of a digital humanities project depends on the extent to which it succeeds in making an appropriate match of the specific with the generic, and the subjective with the objective. Trust in the outcome of a digital humanities study should be informed by a proper understanding of this match, but involves a non-trivial fit-for-use assessment.

Modern computational tools are often more advanced and exhibit better performance than their predecessors. These performance improvements come, however, with a price in terms of increased complexity and diminished understanding of what actually happens inside the black box that many tools have become.

Our argument is that scholars in the humanities need to understand the tools they use to the extent that they can assess the limitations of each tool, and how these limitations might impact the outcomes of the study in which the tool is deployed. This level of understanding is necessary, both to be able to assess to what extent the generic task for which the tool has been designed fits the specific problem being studied, and to assess to what extent the limitations of a tool in the context of this specific problem may result in unanticipated side effects that lead to unintended bias or errors in the analysis that threaten the assumed objectivity of each computational step.

Our approach to address this problem is to work closely with research partners in the humanities, and to build together the next generation of e-infrastructures for the digital humanities that allow scholars to make these assessments in their daily research.

This requires more than technical work. For example, to raise awareness of this need, we recently teamed up with partners from the Utrecht Digital Humanities Lab and the Huygens ING to organise a second workshop, in July 2017, around the theme of tool criticism. A recurring issue during these workshops was the need for humanities scholars to be better trained in computational thinking, while at the same time computer scientists (including big data analysts and e-science engineers) need to better understand what information they have to provide to scholars in the humanities to make these fit-for-use assessments possible.

In computer science, the goal is often to come up with generic solutions by abstracting away from overly specific problem details. Assessing the behaviour of an algorithm in such a specific context is not seen as the responsibility of the computer scientist, while black box algorithms are insufficiently transparent to transfer this responsibility to the humanities scholar. Making black box algorithms more transparent is thus a key challenge and a sufficiently generic problem to which computer scientists like us can make important contributions. In this context, we collaborate in a number of projects with the Dutch National Library (KB) to measure algorithmic bias and improve algorithmic transparency. For example, in the COMMIT/ project, CWI researcher Myriam Traub is investigating the influence of black box optical character recognition (OCR) and search engine technology on the accessibility of the library's extensive historical newspaper archive [1]. Here, the central question is: can we trust that the (small) part of the archive that users find and view through the online interface is actually a representative view of the entire content of the archive? And, if the answer is no, is this bias caused by a bias in the interests of the users or by a bias implicitly induced by the technology? Traub et al. studied the library's web server logs with six months of usage data, a period in which around a million unique queries had been issued. Traub concluded that indeed only a small fraction (less than 4%) of the overall content of the archive was actually viewed, and that this was not a representative sample at all. The bias, however, was largely attributed to a bias in the interests of the users, with a much smaller bias caused by the “preference” of the search engine for medium-length articles, which leads to an underrepresentation of overly short and long articles in the search results. Follow-up research is currently focussing on the impact of the OCR on retrievability bias.
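As a minimal sketch of this kind of measurement (illustrative only, not the code used in the study), one can compare the article-length distribution of what was viewed with that of the whole archive; the length buckets and numbers below are invented:

from collections import Counter

def length_bucket(n_words):
    """Very coarse buckets: short, medium, long articles."""
    return "short" if n_words < 100 else "medium" if n_words < 1000 else "long"

def share_by_bucket(article_lengths):
    counts = Counter(length_bucket(n) for n in article_lengths)
    total = sum(counts.values())
    return {bucket: counts[bucket] / total for bucket in ("short", "medium", "long")}

# Hypothetical data: lengths of all archive articles vs. lengths of viewed articles.
archive_lengths = [40, 60, 300, 500, 800, 2500, 4000, 90, 350, 3000]
viewed_lengths = [300, 500, 800, 350, 300]

print("archive:", share_by_bucket(archive_lengths))
print("viewed: ", share_by_bucket(viewed_lengths))   # medium articles over-represented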

In the ERCIM-coordinated H2020 project VRE4EIC [L1] we collaborate with FORTH, CNR and other partners on interoperability across virtual research environments. At CWI, Tessel Bogaard, Jan Wielemaker and Laura Hollink are developing software infrastructure [2] to support improved transparency in the full data science pipeline, involving data selection, cleaning, enrichment, analysis and visualisation. Part of the reproducibility, and other aspects related to transparency, may be lost if these steps are spread over multiple tools and scripts in a badly documented research environment. Again using the KB search logs as a case study, another part of the challenge lies in creating a transparent workflow on top of a dataset that is inherently privacy sensitive. For the coming years, the goal is to create trustworthy computational workflows, even if the data necessary to reproduce a study is not available.
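
As an illustration of the idea only (not of the SWISH/ClioPatria-based infrastructure mentioned above), the sketch below records each pipeline step together with its parameters and content fingerprints of its input and output, so that the workflow itself can be published and inspected even when the privacy-sensitive log data cannot.

import hashlib, json, datetime

def fingerprint(rows):
    """Hash the data passing between steps, without storing the data itself."""
    return hashlib.sha256(json.dumps(rows, sort_keys=True).encode()).hexdigest()[:12]

class Pipeline:
    def __init__(self):
        self.provenance = []  # one record per executed step

    def run_step(self, name, func, data, **params):
        result = func(data, **params)
        self.provenance.append({
            "step": name,
            "params": params,
            "input": fingerprint(data),
            "output": fingerprint(result),
            "at": datetime.datetime.utcnow().isoformat(),
        })
        return result

# Example with toy log records (query text, number of results viewed).
logs = [{"query": "molen", "viewed": 3}, {"query": "", "viewed": 0}]
pipe = Pipeline()
cleaned = pipe.run_step("clean", lambda d: [r for r in d if r["query"]], logs)
enriched = pipe.run_step("enrich", lambda d, lang: [{**r, "lang": lang} for r in d],
                         cleaned, lang="nl")
print(json.dumps(pipe.provenance, indent=2))  # publishable description of the workflow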

Link:
[L1] https://www.vre4eic.eu

References:

[1] M. C. Traub, et al.: "Querylog-based Assessment of Retrievability Bias in a Large Newspaper Corpus", JCDL 2016, pp. 7-16.

[2] J. Wielemaker, et al.: "ClioPatria: A SWI-Prolog infrastructure for the Semantic Web", Semantic Web, 7(5), pp. 529-541, 2016.

Please contact:

Jacco van Ossenbruggen, CWI, The Netherlands, +31 20 592, [email protected]


Screenshot of a SWISH executable notebook disclosing the full computational workflow in a search log analysis use case.


The DARIAH ERIC: Redefining Research Infrastructure for the Arts and Humanities in the Digital Age

by Jennifer Edmond (Trinity College Dublin), Frank Fischer (National Research University Higher School of Economics, Moscow), Michael Mertens (DARIAH EU) and Laurent Romary (Inria)

As it begins its second decade of development, the Digital Research Infrastructure for the Arts and Humanities (DARIAH) continues to forge an innovative approach to improving support for and the vibrancy of humanities research in Europe.

When we think of infrastructure, we often fall back on canonical images of things like roads, bridges and buildings. In many disciplines, these still resonate with the needs of the researcher community, where a supercollider or a research vessel may indeed be at the core of what is required to advance our state of knowledge.

The arts and humanities are different. As a collection of approaches to knowledge, the methods deployed stem from a shared respect for complex source material emerging not from the experimental design of the scientist, but from the experiences, cultures and creative impulses of human beings. Providing an enhanced, shared, baseline access to key methods, sources, tools and services is therefore a great challenge indeed.

The Digital Research Infrastructure for the Arts and Humanities (DARIAH) was first conceptualised in late 2005 as a response to how this very different set of requirements was being addressed in the fast-moving environment of digitally-enhanced research. The infrastructure was later officially founded as a European Research Infrastructure Consortium (or ERIC) based in France, but with 17 national members contributing funds and in-kind contributions from their local digital humanities research communities. The knowledge base of the resulting network is further enhanced by contributions from funded research projects in which DARIAH is a partner, as well as the contributions of working groups, assembled by network members on a voluntary basis to address key gaps in infrastructural provision or key emerging challenges for the research communities.

This rich tapestry of contributions creates a form of infrastructure based on knowledge production and exchange, rather than on concrete shared facilities or datasets. As such, the challenge for DARIAH as it enters its second decade of development is to capitalise on the human infrastructure it has built to create a fully aligned system of coordinated contribution and access provision to the good practices emerging from the network [1]. In order to do this, we are focussing for the next three years on four key areas of development that will enhance our ability to deliver low-friction, high-value interactions for our partners.

The first area of focus is to improve our external communications. Ensuring that our basic information is available in an easily legible form for all current and potential new members of our network is a sine qua non for ensuring that the infrastructure can function. This programme of activities will cover the gamut of communications instruments, from a newly redesigned website to the appointment of specific individual ambassadors to reach key target regions and communities; from a more strategic approach to attending events to a mapping of key organisation-to-organisation relationships within our community and beyond. We will also clarify what we are able to offer as services to our community, from support for grant capture to hosting of orphan research projects. By focussing on our core messages in this way, we hope also to be able to communicate and build consensus around an even clearer message of what DARIAH is and does, and how it operates as an infrastructure in conditions that require a very different approach.

The second plank in DARIAH's development plan is to push forward its vision for a virtual marketplace, making visible and accessible the many tools, services, datasets and expertise bases that our network has opened up for use by others. This may sound like an easy task to achieve, but the prerequisite understanding of what these assets would be valued for, and by whom, is actually quite challenging to develop. Ensuring that we provide an optimised platform for targeted and serendipitous discovery of resources, as well as their easy reuse, will be a major achievement of DARIAH by 2020.

Our third area of focus is on teaching and learning. Too much focus in the digital humanities is on either training via formal degree programmes or individual learning via generic platforms like Code Academy or Software Carpentry. The research infrastructure provides a unique environment and set of opportunities for different kinds of learning, aligned to support the individual and institutional modes, but also to provide unique opportunities for experiences of professional acculturation in applied contexts [2]. Already in this area we have active services under the banners of dariahTeach, a Moodle-based, ECTS-linked set of modules, and through our infrastructure cluster project PARTHENOS, which also involves the CLARIN ERIC and projects such as IPERION, CHARISMA, ARIADNE, EHRI and CENDARI, and where the modules are more targeted at self-learners and as "train-the-trainers" resources. We will now build on these platforms and this momentum.

Finally, we view the development of our foresight and policy leadership capacity as a key asset, not only for our current cohort of active digital humanists, but also for the "long tail" of the research communities.



These researchers, who may not realise how important the digital is becoming for how they conduct and communicate their research, are a key community for our future growth, and ensuring that we represent their needs at a high level is of central importance to us as we consolidate our position. Our work to date in this area has been linked to initiatives such as DARIAH's contributions to the European Commission's Open Science Policy Platform, and also to our championing of a "Data Reuse Charter" between researchers and cultural heritage institutions, able to promote data sharing and fluidity [3].

On the basis of these interventions, DARIAH is poised to move into its second decade with a reputation as a leader for the arts and humanities, as well as an innovator in research infrastructure.

Link:

[L1] www.dariah.eu

References:

[1] T. Blanke, C. Kristel, L. Romary: "Crowds for Clouds: Recent Trends in Humanities Research Infrastructures", in Cultural Heritage Digital Tools and Infrastructures, A. Benardou, E. Champion, C. Dallas, and L. Hughes (eds.), Taylor & Francis Group, 2018. https://hal.inria.fr/DARIAH/hal-01248562

[2] G. Rockwell and S. Sinclair: "Acculturation and the Digital Humanities Community", in Digital Humanities Pedagogy: Practices, Principles and Politics, D. Hirsch (ed.), Cambridge: Open Book, 2012, pp. 177-211.

[3] L. Romary, M. Mertens and A. Baillot: "Data fluidity in DARIAH – pushing the agenda forward", BIBLIOTHEK Forschung und Praxis, De Gruyter, 2016, 39 (3), pp. 350-357.

Please contact:

Jennifer Edmond, Trinity College Dublin, [email protected]

DIGILAB: A New Infrastructure for Heritage Science

by Luca Pezzati and Achille Felicetti (INO-CNR)

The European Research Infrastructure for Heritage Science, E-RIHS [ˈīris], is working to launch DIGILAB: the new data and service infrastructure for the heritage science research community. The first services are expected to be online in 2018.

The European Research Infrastructure for Heritage Science (E-RIHS) [L1], initiated in 2016, is one of the six new projects of the European Strategic Roadmap for Research Infrastructures (ESFRI Roadmap [L2]). E-RIHS supports research on heritage interpretation, preservation, documentation and management. Both cultural and natural heritage are addressed: collections, buildings, archaeological sites, digital and intangible heritage. E-RIHS is a distributed research infrastructure: it includes facilities from many countries, organised in national networks and coordinated by national hubs. The E-RIHS Headquarters – to be seated in Florence, Italy – will provide the unique access point to all E-RIHS services.

E-RIHS will provide state-of-the-art tools and services to support cross-disciplinary research communities of users through its four access platforms (Figure 1):
• MOLAB: offers access to advanced mobile analytical instrumentation for diagnostics of heritage objects, archaeological sites and historical monuments. The MObile LABoratories will allow users to carry out complex multi-technique diagnostic projects, allowing effective in situ investigation.
• FIXLAB: provides access to large-scale and specific facilities with unique expertise in heritage science, for cutting-edge scientific investigation on samples or whole objects, revealing micro-structures and chemical composition, giving essential and invaluable insights into historical technologies, materials and species, their context, chronologies, alteration and degradation phenomena.
• ARCHLAB: enables physical access to archives and collections of prestigious European museums, galleries, research institutions and universities containing non-digital samples and specimens and organised scientific information.
• DIGILAB: facilitates virtual access to tools and data hubs for heritage research – including measurement results, analytical data and documentation – from large academic as well as research and heritage institutions.

E-RIHS will help the preservation of the world's heritage by enabling cutting-edge research in heritage science, liaising with governments and heritage institutions to promote its continual development and, finally, raising public awareness about cultural and natural heritage and its historic, social and economic significance.

In February 2017, E-RIHS started its preparatory phase supported by the EU project E-RIHS PP (H2020-INFRADEV-02-2016). Representatives of 16 countries (15 from the EU plus Israel) are working together to prepare E-RIHS to be launched as a standalone research infrastructure consortium in 2021.

The DIGILAB platform will provide remote services to the heritage science research community but will also be relevant to, and accessible by, professionals, practitioners and heritage managers. DIGILAB will enable access to research information as well as to general documentation of analyses, conservation, restoration and any other kind of relevant information about heritage research and background references, such as controlled vocabularies, authority lists and virtual reference collections.

The DIGILAB design takes into account and complies with the EU policies and strategies concerning scientific data, including the FAIR [L3] data principles [1], the Open Research Data [L4] policy, and the EOSC [L5] strategy.



DIGILAB will rely on a network of federated repositories where researchers, professionals, managers and other heritage-related professionals deposit the digital results of their work. DIGILAB will not keep those data internally: instead, it will provide access to the original repositories where the data are stored.

DIGILAB is inspired by the FAIR data principles: it will enable Finding data through an advanced search system operating on a registry containing metadata describing each individual dataset; it will support Access to the data through a federated identity system, while data access grants will remain local to each repository; it will guarantee data Interoperability by requiring the use of a standard data model; finally, it will foster Re-use by making services available to users to process the data according to their own research questions and use requirements.
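
The sketch below illustrates the registry idea in a few lines of Python; the field names and example records are hypothetical and not the DIGILAB data model. The registry answers the "Findable" question from metadata alone and hands the user back to the original repository, where access control stays local.

# Minimal sketch of a metadata registry (hypothetical schema, not DIGILAB's):
# the registry stores only dataset descriptions and points back to the source
# repository, where authentication and access grants are handled locally.
registry = [
    {"id": "ds-001", "title": "XRF scans, panel painting",
     "technique": "XRF", "keywords": ["pigment", "panel painting"],
     "repository": "https://repo.example.org/datasets/ds-001"},
    {"id": "ds-002", "title": "FTIR spectra, parchment samples",
     "technique": "FTIR", "keywords": ["parchment", "collagen"],
     "repository": "https://other-repo.example.net/rec/ds-002"},
]

def find(keyword=None, technique=None):
    """Find datasets by metadata; data access happens at the source repository."""
    hits = []
    for record in registry:
        if technique and record["technique"] != technique:
            continue
        if keyword and keyword not in record["keywords"]:
            continue
        hits.append(record)
    return hits

for hit in find(keyword="pigment"):
    print(hit["id"], "->", hit["repository"])  # follow the link, authenticate locally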

Data policies are relatively new to the heritage science sector. In many cases digital data are still considered disposable: they are used as long as the research continues, then disposed of. They are often stored in inaccessible systems, like personal hard disks or other storage devices, and are thus quickly rendered unusable through the lack of proper documentation, the obsolescence of the software used to create and manage them or, simply, the degradation of storage hardware, which after some time becomes unreadable, either because it ceases to function or because it can no longer be connected to more advanced devices.

On the other hand, most of this valuable information can be easily recovered and organised into usable archives. This often requires a handicraft approach, tailored to each dataset and dependent on the humans who created it and who are still available to provide the necessary information. DIGILAB will set up guidelines for dataset recovery and assist researchers and research institutions willing to undertake such tasks.

The cloud-based DIGILAB infrastructure, enhanced by the adoption of international standards and modern IT solutions, will provide the necessary flexibility for the efficient aggregation, interoperability implementation and integration management of huge amounts of scientific data, in order to foster their publication and redistribution in various formats, in accordance with the related policies. The complex semantic graph built in the DIGILAB registry will be able to trace new research paths through scientific concepts by means of the efficient use of semantic relationships linking the entities involved in scientific research; this will provide DIGILAB users with new tools for the discovery, access, and re-use of relevant information (Figure 2).

DIGILAB is designed to be the privileged gateway to European scientific knowledge in heritage, in preparation for becoming the main international portal for heritage science research.

Links:

[L1] http://www.e-rihs.eu/
[L2] http://www.esfri.eu/
[L3] https://kwz.me/hTO
[L4] https://kwz.me/hm0
[L5] https://kwz.me/hm1

Reference:

[1] M. D. Wilkinson, et al.: "The FAIR Guiding Principles for scientific data management and stewardship", Sci. Data 3:160018, 2016, doi: 10.1038/sdata.2016.18

Please contact:

Luca Pezzati, Achille Felicetti, CNR INO – Istituto Nazionale di Ottica, Italy, +39 055 2308279, [email protected]

Figure 1: Analytical instrumentation for scientific investigation of heritage objects in E-RIHS.

Figure 2: A DIGILAB service for mapping and comparing analysis results to be used for painting restoration and preservation.


Building a Federation of Digital Humanities Infrastructures

by Alessia Bardi and Luca Frosini (ISTI-CNR)

Research infrastructures (RIs) are "facilities, resources and services used by the science community to conduct research and foster innovation" [1]. Researchers' needs for digital services led to the realisation of e-infrastructures, i.e., RIs offering digital technologies for data management, computing and networking. Relevant examples are high-speed connectivity infrastructures (e.g., GÉANT), grid computing infrastructures (e.g., the European Grid Infrastructure, EGI), scholarly communication infrastructures (e.g., OpenAIRE) and data e-infrastructures (e.g., D4Science).

Digital humanities infrastructures (DHIs) are e-infrastructures supporting researchers in the field of humanities with a digital environment where they can find and use ICT tools and research data for conducting their research activities. A growing number of DHIs have been realised, most of them targeting a specific sector of the humanities, such as ARIADNE [L2] for archaeology, EHRI [L3] for studies on the Holocaust, CENDARI [L4] for history, CLARIN [L5] for linguistics, and DARIAH [L6] for the arts and humanities. Thanks to their discipline-specific nature, these DHIs offer specialised services and tools to researchers, who are now demanding support for interdisciplinary research, common solutions for data management, and access to resources that are traditionally relevant to different sectors (e.g., text-mining algorithms traditionally used by linguists can also be useful to historians and social scientists).

One of the main goals of the PARTHENOS project (Pooling Activities, Resources and Tools for Heritage E-research Networking, Optimization and Synergies – EC H2020 RIA grant agreement 654119) is to bridge existing DHIs by forming a federation where researchers of different sectors of the humanities can collaborate and share data, services and tools in an integrated environment.

PARTHENOS will produce a complete technical framework for the federation, enabling transparent access to resources managed by different DHIs and enabling the creation and operation of virtual research environments (VREs) [1] where researchers with different backgrounds can collaborate on specific research topics.

The technical framework supports the realisation of the federation by offering tools for:
• The creation of a homogeneous information space where all resources (data, services and tools) of the different DHIs are described according to a common data model.
• The discovery of available resources.
• The use of available resources (for download or processing).
• The creation of VREs where users can find resources relevant for a research topic, run services, and share the computational results.

The technical framework (see Figure 1) includes two main components: the PARTHENOS Content Cloud Framework (CCF) and the Joint Resource Registry (JRR).

The CCF supports the aggregation of metadata about resources from the DHIs of the federation. The PARTHENOS aggregator is realised with the D-NET software toolkit [2], an enabling framework for the realisation of Aggregative Data Infrastructures (ADIs) developed and maintained by CNR-ISTI. D-NET provides functionality for the automatic collection, harmonisation, curation and delivery of metadata coming from a dynamic set of heterogeneous data providers. In the context of the PARTHENOS project, D-NET has been configured to collect metadata made available by existing DHIs operated by PARTHENOS partners (namely: ARIADNE, CENDARI, CLARIN, CulturaItalia, DARIAH-DE, DARIAH-GR/DYAS, DARIAH-IT, EHRI, Huma-Num, ILC) and harmonise them according to an extension of the CIDOC-CRM model [L7][L8] by applying X3ML [L9] mappings.

Figure 1: Technical framework for DHIs federation.

The mapping language, editor and execution engine are realised and maintained by the Greek partner FORTH. Aggregated content is then published via different endpoints, supporting a set of (de facto) standard protocols for metadata search (Solr API, SPARQL) and exchange (OAI-PMH).
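
As a simplified illustration of the exchange side of this pipeline, the following Python sketch harvests metadata records over OAI-PMH, one of the standard protocols mentioned above. It is a generic client for illustration only, not the D-NET aggregator, and the endpoint URL is a placeholder.

import requests
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

def harvest(endpoint, metadata_prefix="oai_dc"):
    """Yield one parsed <record> element at a time, following resumption tokens."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        response = requests.get(endpoint, params=params, timeout=30)
        root = ET.fromstring(response.content)
        for record in root.iter(OAI + "record"):
            yield record
        token = root.find(f"{OAI}ListRecords/{OAI}resumptionToken")
        if token is None or not (token.text or "").strip():
            break
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

# Usage (placeholder endpoint; a real harvester would map each record to the
# common data model before ingestion):
# for rec in harvest("https://example.org/oai"):
#     identifier = rec.find(f"{OAI}header/{OAI}identifier")
#     print(identifier.text)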

The aggregated content is also ingested into the Joint Resource Registry, which exposes an end-user GUI (Resource Catalogue) and a machine-oriented API (Resource Registry) for resource discovery. Data and services registered in the JRR become discoverable by and accessible to users and other services of the federation. Moreover, the JRR provides functionality for infrastructure management. For example, a user can run a CLARIN service for full-text mining on a dataset of medieval full-texts available in the CENDARI DHI in a transparent way. Computational results can be easily stored and shared with a selection of colleagues, or publicly, by publishing them into the JRR.

The JRR is based on the gCube enabling technology [3], an open-source software toolkit used for building and operating hybrid data infrastructures [4], enabling the dynamic deployment of virtual research environments and favouring the realisation of reuse-oriented policies. gCube is developed and maintained by CNR-ISTI.

The PARTHENOS technical framework is currently at the beta stage and is operated on the D4Science infrastructure [L10] at the Institute of Information Science and Technologies of the Italian National Research Council (ISTI-CNR). Representatives of the consortium are actively preparing mappings for metadata, selecting data and services to share and setting up VREs. As of August 2017, two VREs have been created: one includes services for natural language processing and semantic enrichment of textual data; the other is meant for the integration of reference resources. In the coming months, the framework will be deployed in a production environment and assessed by a selection of humanities researchers in the consortium. We plan to open the framework to all researchers of DHIs in the consortium by the end of the project (April 2019).

References:

[1] L. Candela et al.: "Virtual research environments: an overview and a research agenda", Data Science Journal, 12, GRDI75-GRDI81, 2013. http://doi.org/10.2481/dsj.GRDI-013

[2] P. Manghi et al.: "The D-NET software toolkit: A framework for the realization, maintenance, and operation of aggregative infrastructures", Program, Vol. 48, Issue 4, pp. 322-354, 2014. doi: https://doi.org/10.1108/PROG-08-2013-0045

[3] L. Candela, P. Pagano: "Cross-disciplinary data sharing and reuse via gCube", ERCIM News, Issue 100, January 2015. https://kwz.me/hO7

[4] L. Candela et al.: "Managing Big Data through Hybrid Data Infrastructures", ERCIM News, Issue 89, April 2012. https://kwz.me/hO8

Links:

[L1] https://kwz.me/hO9
[L2] http://www.ariadne-infrastructure.eu/
[L3] https://www.ehri-project.eu/
[L4] http://www.cendari.eu
[L5] https://www.clarin.eu
[L6] http://www.dariah.eu/
[L7] http://www.cidoc-crm.org/
[L8] https://kwz.me/hOf
[L9] https://kwz.me/hOj
[L10] https://parthenos.d4science.org/

Please contact:

Alessia Bardi, Luca Frosini, ISTI-CNR, Italy, [email protected], [email protected]

Knowledge Complexity and the Digital Humanities: Introducing the KPLEX Project

by Jennifer Edmond and Georgina Nugent Folan (Trinity College Dublin)

The KPLEX project is looking at big data from a rich data perspective. It uses humanities knowledge to explore bias in big data approaches to knowledge creation.

The KPLEX project is an H2020-funded project tasked with investigating the complexities of humanities and cultural data, and the implications of digitisation for the unique and complex messy data that humanities and cultural researchers are accustomed to dealing with. The drive for ever greater integration of digital humanities (DH) data is complicated by the uncomfortable truth that a lot of the information that should be the cornerstone of our decision making – rich data about the history of our economies, societies and cultures – isn't digitally available. This, along with the "epistemics of the algorithm" [1], is a key concern of the KPLEX project, and we are working to expand awareness of the risks inherent in big data for DH and cultural research, and to suggest ways in which phenomena that resist datafication can still be represented (if only by their absence) in knowledge creation approaches reliant upon the interrogation of large data corpora.

KPLEX addresses the repercussions of the dissociation of data sources from the people, institutions and conditions that created them. In a rapidly evolving DH environment where large-scale data aggregation is becoming ever more accepted as the gold standard, the KPLEX project is defining and describing some of the key aspects of data that are at risk of being left out of our knowledge creation processes, and the strategies researchers have developed to deal with these complexities [2].

The KPLEX team is diverse and has adopted a comparative, multidisciplinary, and multi-sectoral approach to this problem, focussing on four key challenges to the knowledge creation capacity of big data approaches:
1) redefining what data is and the terms we use to speak about it [3];



2) the manner in which data that are not digitised or shared become "hidden" from aggregation systems;
3) the fact that data is human created, and lacks the objectivity often ascribed to the term;
4) the subtle ways in which data that are complex almost always become simplified before they can be aggregated.

We approach these questions with a humanities research perspective and remain committed to humanities methodologies and forms of knowledge, but we make use of social science research tools to look at both the humanistic and computer science approaches to the term "data" and its many possible meanings and implications. Our core shared discourse of the digital humanities allows us to use these methods and knowledge in a contextualised, socially relevant manner, a strength of our consortium that is further enhanced by our inclusion of both ethnographic/anthropological and industrial perspectives.

Led by Trinity College Dublin, the KPLEX team spans four countries, taking in Freie Universität Berlin (Germany), DANS-KNAW (The Hague) and TILDE (Latvia). Each of the KPLEX project partners addresses an integrated set of research questions and challenges. The research teams have been assembled to pursue a set of questions that are humanist-led, but broadly interdisciplinary, including humanities and digital humanities, data management, anthropology and computer science, but also including stakeholders from outside of academic research able to inform the project's evidence gathering and analysis of the challenges, including participation from both a technology SME (TILDE) and a major national ICT research centre (ADAPT, Ireland). In addition, KPLEX takes in the experiences of a large number of major European digital research infrastructure projects federating cultural heritage data for use by researchers, through the contributions by TCD (Dublin) and KNAW-DANS (The Hague). These projects (including CENDARI, EHRI, DARIAH-EU, DASISH, PARTHENOS, ARIADNE and HaS) have all faced and progressed the issues surrounding the federation and sharing of cultural heritage data. In addition, two further projects that deal with non-scientific aspects of researcher epistemics are also engaged, namely the "Scholarly Primitives and Renewed Knowledge Led Exchanges" project (SPARKLE, based at TCD) and the "Affekte der Forscher" project (based at FUB). These give the KPLEX team and project a firm baseline of knowledge for dealing with the question of how epistemics creates and marks data.

The KPLEX project kicked off in January 2017 and will conclude in March 2018, presenting its results via a composite white paper that unites the findings of each research team, with each research team also producing a peer-reviewed academic paper on its findings. Over the coming months the project will be represented at DH conferences in Liverpool ("Ways of Being in the Digital Age"), Austria ("Data First!? Austrian DH Conference"), Manchester ("Researching Digital Cultural Heritage International Conference") and Tallinn ("Metadata and Semantics Research Conference").

Links:

[L1] https://kplex-project.com/, Twitter: @KPLEXProject, Facebook: KPLEXProject

References:

[1] T. Presner: "The Ethics of the Algorithm", in Probing the Ethics of Holocaust Culture.

[2] J. Edmond: "Will Historians Ever Have Big Data?", in Computational History and Data-Driven Humanities, doi: 10.1007/978-3-319-46224-0_9.

[3] L. Gitelman (ed.): "'Raw Data' is an Oxymoron".

Please contact:

Jennifer Edmond, Georgina Nugent Folan, Trinity College Dublin, Ireland, [email protected], [email protected]

Restoration of Ancient Documents Using Sparse Image Representation

by Muhammad Hanif and Anna Tonazzini (ISTI-CNR)

Archival, ancient manuscripts constitute a primary carrier of information about our history and civilisation process. In the recent past they have been the object of intensive digitisation campaigns, aimed at their preservation, accessibility and analysis. At ISTI-CNR, the availability of the diverse information contained in the multispectral, multisensory and multiview digital acquisitions of these documents has been exploited to develop several dedicated image processing algorithms. The aim of these algorithms is to enhance the quality and reveal the obscured contents of the manuscripts, while preserving their best original appearance according to the concept of "virtual restoration". Following this research line, within an ERCIM "Alain Bensoussan" Fellowship, we are now studying sparse image representation and dictionary learning methods to restore the natural appearance of ancient manuscripts affected by spurious patterns due to various ageing degradations.

The collection of ancient manuscripts serves as history's own closet, carrying stories of enigmatic, unknown places or incredible events that took place in the distant past, many of which are yet to be revealed. These manuscripts are of great interest and importance for historians to study people of the past, their culture, civilisation and way of life. Most of the ancient classic documents have had a very narrow escape from total annihilation.


Thus, digital preservation of our documental heritage has been one of the first focusses of the massive archive and library digitisation campaigns performed in recent years. This, in turn, has contributed to the birth of digital humanities as a science. In addition to preservation, computing technologies applied to the digital images of these documents have quickly become a powerful and versatile tool to simplify their study and retrieval, and to facilitate new insights into the documents' contents.

The quality of the digital records, however, depends on the current status of the original manuscripts, which in most cases are affected by several types of degradation, such as spots, ink fading, or ink seeping from the reverse side, due to bad storage conditions and the fragile nature of the materials (e.g., humidity, mould, ink chemical properties, paper porosity). In particular, the phenomenon of ink seeping is perhaps the most frequent one in ancient manuscripts written on both sides of the sheet. This effect, termed bleed-through, becomes visible as an unpleasant and disturbing degradation pattern, severely impairing legibility, aesthetics, and interpretation of the source document.

In general, bleed-through removal is addressed as a classification problem, where image pixels are labelled as either background (paper texture), bleed-through (seeped ink), or foreground (original text). This classification problem is very difficult, since the intensity of both the foreground text and the bleed-through pattern can be so highly variable as to make it extremely hard or even impossible to distinguish them. In addition, when the aim is to obtain a very accurate and plausible restoration, another very critical issue is to replace the identified bleed-through pixels with appropriate replacement colour values, which do not alter the original look of the manuscript.

Recently, we proposed a two-step method to address bleed-through document restoration from a pre-registered pair of recto and verso images of the manuscript. First, the bleed-through pixels are identified on both sides [1]; then, a sparse representation based image inpainting technique is applied to fill in the bleed-through pixels, by taking into account their propensity to aggregation, and in accordance with the natural texture of the surrounding background.

Sparse representation methods have achieved state-of-the-art results in different image processing applications. These methods process the whole image by operating at a patch-by-patch level. For this specific application, our aim is to reproduce the background texture to maintain the original look of the document. In the sparse representation setup, an over-complete dictionary is learned using a set of training patches from the recto and verso image pair. In the training set we only select patches with no bleed-through pixels. This choice speeds up the training process since it excludes non-informative image regions. For each patch to be inpainted, we first search for its mutually similar patches in a small bounded neighbourhood window. We use a block matching technique with the Euclidean distance metric as the similarity criterion. The similar patches are grouped together in a matrix. A group-based sparse representation method [2] is exploited to find a befitting fill-in for the bleed-through strokes. The use of similar patch groups incorporates local information that helps to preserve the natural colour/texture continuation of the physical manuscript.
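
The following toy sketch (Python/NumPy, not the authors' implementation) illustrates the block-matching step described above: for a reference patch it scans a bounded neighbourhood window, ranks candidate patches by Euclidean distance and stacks the best matches as columns of the group matrix that group-based sparse coding then operates on.

import numpy as np

def match_similar_patches(image, y, x, patch=8, window=24, k=10):
    """Return the k patches in the search window closest to the reference patch."""
    ref = image[y:y + patch, x:x + patch].astype(float)
    h, w = image.shape[:2]
    candidates = []
    for yy in range(max(0, y - window), min(h - patch, y + window) + 1):
        for xx in range(max(0, x - window), min(w - patch, x + window) + 1):
            if (yy, xx) == (y, x):
                continue
            cand = image[yy:yy + patch, xx:xx + patch].astype(float)
            dist = np.linalg.norm(ref - cand)   # Euclidean distance between patches
            candidates.append((dist, yy, xx))
    candidates.sort(key=lambda t: t[0])
    # Stack the k best patches as columns of a matrix, as in group-based methods.
    group = np.column_stack(
        [image[yy:yy + patch, xx:xx + patch].reshape(-1) for _, yy, xx in candidates[:k]]
    )
    return group

# Example on a random grayscale image.
img = np.random.rand(64, 64)
print(match_similar_patches(img, 20, 20).shape)   # (64, 10): 8x8 patches as columns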

An original degraded manuscript and its restored version are presented in Figure 1. It is worth noting that our algorithm can be directly applied to inpaint any other possible interference pattern detected in the paper support (e.g., stains). We are also studying the extension of the method to the restoration of broken or faded foreground characters.

References:

[1] A. Tonazzini, P. Savino, and E. Salerno: "A non-stationary density model to separate overlapped texts in degraded documents", Signal, Image and Video Processing, vol. 9, pp. 155-164, 2015.

[2] J. Zhang, D. Zhao and W. Gao: "Group-based sparse representation for image restoration", IEEE Trans. Image Process., vol. 32, pp. 1307-1314, 2016.

Please contact:

Anna Tonazzini, ISTI-CNR, Pisa, Italy, +39 [email protected]

Figure 1: A visual comparison of an original ancient manuscript affected by bleed-through degradation and its restored version using our method. The first row shows the degraded recto and verso pair, and the restored images are presented in the second row.


St Paul's Cathedral Rises from the Dust – News from the Virtual St Paul's Cathedral Project

by John N. Wall (NC State University), John Schofield (St Paul's Cathedral, London), David Hill (NC State University) and Yun Jing (NC State University)

The Virtual St Paul's Cathedral Project [L1], now at the half-way point in its development, is beginning to show results. Attached are images of our draft model of St Paul's Cathedral and buildings in the cathedral's churchyard from the early 1620s, before everything seen here was destroyed by the Great Fire of London in 1666. These images are based on a combination of contemporary images of the cathedral and its surrounding buildings, surveys of these buildings made after the Great Fire, and images of appropriate buildings from this period that survive in modern-day England.

When we are done, the visual model will also incorporate details of weather and climate, as well as recent scholarship on the material and social history of London as it grew in the 16th and 17th centuries into a city of over 200,000 people.

Supported by a Digital Humanities Implementation Grant from the National Endowment for the Humanities, the goal of the Virtual St Paul's Cathedral Project is to recreate the experience of worship and preaching in St Paul's Cathedral in London [1] in the early seventeenth century. This project demonstrates the value of visual and acoustic modelling in helping us understand the look and feel of historic sites as well as their acoustic properties, by recreating events that took place in those spaces, so that we can experience these events as they unfold, minute by minute, in the spaces in which they originally took place.

Along the way we are developing an open source acoustic modelling software package to enable others to explore, at minimal cost, the acoustic properties of other historic sites. Our use of digital technology for the St Paul's Cathedral Project aspires to be scrupulously accurate, integrating into a single experiential model the rich record of information available about St Paul's and its worship.

Figure 1: St Paul's Cathedral as seen from the south.

Figure 2: St Paul's Cathedral as seen from the southwest. Figure 3: St Paul's Cathedral as seen from the southeast.

Figure 4: St Paul's Cathedral, the West Front. Figure 5: St Paul's Cathedral, the West Front as seen from the northwest.


We will restage worship services conducted inside and outside this virtual cathedral, drawing on the rites of the Book of Common Prayer (1604) and incorporating music for choir and organ composed for performance in these spaces by musicians at St Paul's in the late 16th and early 17th centuries [2]. The organ pieces will be played on a digital reconstruction of a 17th century instrument that closely approximates the specifications of the organ at St Paul's; the singers and the actors will perform choral pieces, prayers, Bible readings, and the texts of sermons in a linguistic reconstruction of the sound of early modern London speech.

In this virtual environment, we will be able to experience the worship of the post-Reformation Church of England as an unfolding experience in the specific architectural setting of St Paul's Cathedral, and in the context of a responsive and collaborative congregation as well as a noisy crowd, just over the choir screen, of shoppers, merchants, and fashionable people out for gossip and public recognition in Paul's Walk.

This project builds on the successful completion, in 2013, of our proof-of-concept project, the Virtual Paul's Cross Project, funded by a Digital Humanities Level II Start-Up Grant in 2011. The Virtual Paul's Cross Project [3], accessible through its website (http://vpcp.chass.ncsu.edu), makes it possible for us to experience John Donne's sermon for November 5, 1622, performed in its original pronunciation, from eight different listening positions within a virtual model of the historic space of Paul's Churchyard in London and in the presence of four different sizes of crowd.

The Virtual Paul's Cross Project website has been widely recognised for its contribution to research into early modern religious life and has earned two international awards: the John Donne Society's Award for Distinguished Digital Publication (2013) and the Award for Best DH Data Visualization from DH Awards (2014).

Link:

[L1] http://vpcp.chass.ncsu.edu

References:

[1] D. Keene, et al. (eds.): "St Paul's: The Cathedral Church of London, 604-2004", 2004; and J. Schofield: "St Paul's Cathedral Before Wren", English Heritage, 2011.

[2] Sources for choral music include scores now in the archives of the Royal Academy of Music (partially published in 1641 as The First Book of Selected Church Musick, ed. John Barnard, a Minor Canon and member of the choral foundation at St Paul's) and organ music (the so-called "Batten Organ Book") in the Bodleian Library at Oxford.

[3] The Virtual Paul's Cross Project has been widely reviewed, including Matthew J. Smith, "Meeting John Donne: The Virtual Paul's Cross Project", Spenser Review 44 (Fall 2014), at https://kwz.me/hkR

Please contact:

John N. Wall, NC State University, USA, +1 919 515 4162, [email protected]

Immersive Point Cloud Manipulation for Cultural Heritage Documentation

by Jean-Baptiste Barreau (CNRS/CReAAH UMR 6566), Ronan Gaugne (Université de Rennes 1/IRISA-Inria) and Valérie Gouranton (INSA Rennes/IRISA-Inria)

A point cloud is the basic raw data obtained when digitizing cultural heritage sites or monuments with laser scans or photogrammetry. These data represent a rich and faithful record, provided that adequate tools exist to exploit them. Their current analysis and visualization on a PC require software skills and can create ambiguities regarding the architectural dimensions. We propose a toolbox to explore and manipulate such data in an immersive environment, and to dynamically generate 2D cutting planes usable for CH documentation and reporting.

Due to their state, complexity and archaeological interest, the two subjects studied in this work are Breton architectural sites for which the development of new analytical techniques appears quite appropriate. The first is the chapel of Languidou, built in the middle of the 13th century in the municipality of Plovan, which seems to be the "founding element" of a religious architectural style. The second building addressed in this work is the "jeu de paume" court of Rennes, built at the beginning of the 17th century and registered as a Historical Monument. For several decades, the study of these kinds of architectural styles and reorganizations has been carried out by archaeologists producing different types of 2D documentation: plans, drawings, sections, profiles and orthophotos. 3D documentation comes from classical topographic survey, laser scanning or photogrammetry. In the case of the French Grand Ouest, some of this documentation has been carried out within the scope of the West Digital Conservatory of Archaeological Heritage [1]. Until now, an engineer has performed the segmentation of the 3D documentation on a PC and tried to best meet the expectations of the archaeologist, who is sometimes unwilling to use new technologies and frustrated by his lack of autonomy. The objective of this work is to involve the archaeologist more deeply in this process by immersing him and allowing him to segment the 3D survey in real time and at 1:1 scale.

The first step consists in loading the point clouds of the architectural sites into the virtual reality platform Immersia [2]. The scans were done in June 2013 and June 2014, with a Leica ScanStation C10 and a Focus3D X330, and were integrated a few months later (see Figure 1).


The data structure of the points displayed in our virtual reality device is a billboard with three coordinates, a colour and a scalar. The files are in a binary format that we designed for Unity. For a correct exploration within Immersia, a subsampling was systematically applied and two cloud loading modes were implemented. In the first mode, the cloud is distributed over several hundred octrees; they are loaded dynamically and are visible from the user's point of view, thanks to a culling technique (cf. Figure 2). The second mode consists in the use of a level-of-detail technique: the number of points loaded at the start, in a single file, is smaller and their size is fixed. The selection of cloud segmentation tools is done through a MiddleVR menu that allows the user to interact with three parameters (cf. Figure 3). When the first mode is active, it is possible to modify the size of the points. On a more global scale, the user can switch between the two loading modes and change the display distance to the cloud. Concerning the "cutting plane", it is possible to display or hide it, change its thickness and assign a unique colour to the points contained in it. It is also possible to change the opacity of points outside the cutting plane. The plane is manipulated within Immersia thanks to a Flystick, which modifies its translations and rotations. The display in the Immersia platform (MiddleVR) reaches a frame rate of up to 30 fps, which is two to three times less fluid than on a PC (Unity/MiddleVR). The rendering of the resulting 2D plane is done with a camera orthogonal to the 3D cutting plane and can be displayed on a tablet (cf. Figure 3). The user can also adjust the distance, field of view and roll of the camera. A scale is also generated automatically on the resulting 2D plane, in order to provide a correct understanding of the dimensions.
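
The geometric core of the cutting plane can be sketched in a few lines; the code below is an illustration in Python/NumPy rather than the Unity/MiddleVR implementation. It selects the points lying within the chosen thickness around a plane and projects them onto the plane to obtain the 2D section.

import numpy as np

def cutting_plane_section(points, plane_point, plane_normal, thickness=0.05):
    """points: (N, 3) array; returns (mask of selected points, their 2D coordinates)."""
    n = np.asarray(plane_normal, float)
    n /= np.linalg.norm(n)
    d = (points - plane_point) @ n            # signed distance to the plane
    mask = np.abs(d) <= thickness / 2.0       # points inside the slab
    # Build an orthonormal basis (u, v) of the plane for the 2D rendering.
    u = np.cross(n, [0.0, 0.0, 1.0])
    if np.linalg.norm(u) < 1e-8:              # normal parallel to z: pick another axis
        u = np.cross(n, [0.0, 1.0, 0.0])
    u /= np.linalg.norm(u)
    v = np.cross(n, u)
    rel = points[mask] - plane_point
    coords_2d = np.stack([rel @ u, rel @ v], axis=1)
    return mask, coords_2d

# Example: a vertical slice, 10 cm thick, through a random cloud.
cloud = np.random.rand(100_000, 3) * 10.0
inside, section = cutting_plane_section(cloud, plane_point=[5, 5, 5],
                                        plane_normal=[1, 0, 0], thickness=0.1)
print(inside.sum(), "points in the slice;", section.shape)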

We now need to have the tool tested by a large community of archaeologists, in particular to check whether the use of the Flystick to move the cutting plane is sufficiently precise. At the same time, we still have to improve the optimization of the point cloud management. Concerning the reduction of I/O time and disk space, it will be necessary to store the point coordinates on 2 bytes (instead of 4), to downgrade the precision to the millimetre, to decrease the entropy, and to use zstd compression. The use of occlusion culling, at native Unity or hardware level, is also being studied to improve optimization. Finally, the management of additional data associated with the points is a major issue; their storage, visualization and linking seems indeed to be a very interesting long-term prospect.
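
A back-of-the-envelope sketch of this planned storage optimization is given below, assuming the third-party zstandard Python binding. Quantising to a 2-byte millimetre grid limits each chunk to roughly 65 m of extent, which fits the buildings considered here; the layout and numbers are illustrative only, not the project's file format.

import numpy as np
import zstandard as zstd

def pack_points(points_m):
    """points_m: (N, 3) float32 metres -> compressed uint16 millimetre offsets."""
    origin = points_m.min(axis=0)
    mm = np.round((points_m - origin) * 1000.0).astype(np.uint16)  # 0..65535 mm range
    raw = mm.tobytes()
    packed = zstd.ZstdCompressor(level=10).compress(raw)
    return origin, packed, len(raw), len(packed)

cloud = np.random.rand(1_000_000, 3).astype(np.float32) * 20.0     # 20 m scene
origin, blob, raw_size, packed_size = pack_points(cloud)
print(f"float32: {cloud.nbytes/1e6:.1f} MB, uint16: {raw_size/1e6:.1f} MB, "
      f"zstd: {packed_size/1e6:.1f} MB")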

References:

[1] J.-B. Barreau, R. Gaugne, Y. Bernard, G. Le Cloirec, and V. Gouranton: "The West Digital Conservatory of Archaeological Heritage Project", in DH, France, pp. 547-554, 2013.

[2] R. Gaugne, V. Gouranton, G. Dumont, A. Chauffaut, and B. Arnaldi: "Immersia, an open immersive infrastructure: doing archaeology in virtual reality", Archeologia e Calcolatori, supplemento 5, pp. 180-189, 2014.

Please contact:

Jean-Baptiste Barreau, CNRS/CReAAH UMR 6566, Rennes, [email protected]

Ronan Gaugne, Université de Rennes 1/IRISA UMR 6074, Rennes, [email protected]

Valérie Gouranton, INSA Rennes/IRISA UMR 6074, Rennes, [email protected]

Figure 1: Point clouds of the chapel of Languidou and the "jeu de paume" court.

Figure 2: Octree partitioning and culling used on the "jeu de paume" court point cloud.

Figure 3: "Cutting plane" creation process by an operator in the Immersia platform.



Culture 3D Cloud: A Cloud Computing Platform for 3D Scanning, Documentation, Preservation and Dissemination of Cultural Heritage

by Pierre Alliez (Inria), François Forge (Reciproque), Livio de Luca (CNRS MAP), Marc Pierrot-Deseilligny (IGN) and Marius Preda (Telecom SudParis)

One of the limitations of the 3D digitisation process is that it typically requires highly specialised skills and yields heterogeneous results depending on proprietary software solutions and trial-and-error practices. The main objective of Culture 3D Cloud [L1], a collaborative project funded within the framework of the French "Investissements d'Avenir" programme, is to overcome this limitation, providing the cultural community with a novel image-based modelling service for 3D digitisation of cultural artefacts. This will be achieved by leveraging the widespread expert knowledge of digital photography in the cultural arena to enable cultural heritage practitioners to perform routine 3D digitisation via photo-modelling. Cloud computing was chosen for its capability to offer high computing resources at reasonable cost, scalable storage via continuously growing virtual containers, multi-support diffusion via remote rendering and efficient deployment of releases.

Platform
The platform is designed to be versatile in terms of scale and typologies of artefacts to be digitised, scalable in terms of storage, sharing and visualisation, and able to generate high-definition and accurate 3D models as output. The platform has been implemented from modular open-source software solutions [L2, L3], [1, 2, 3]. The current cloud-based platform, hosted by the TGIR Huma-Num [L4], enables four simultaneous users in the form of affordable high-performance computing services (8 cores, 64 GB of RAM and 80 GB of disk).

Pipeline
The photo-modelling web service offered by the platform implements a modular pipeline with various options depending on the user mode, which ranges from novice to expert (Figure 1). The minimum requirement is to conform to several constraints during acquisition: a protocol specific to each type of artefact (statue, façade, building, interior of building) and sufficient overlap between the photos. Image settings such as EXIF, exposure and white balance can be adjusted or read from the image metadata. The image sequence can be organised into a linear, circular or random sequence. Camera calibration is performed automatically, and a dense photogrammetry approach based on image matching is performed to generate a dense 3D point set with colour attributes. For the artefact shown in Figure 1, the MicMac software solution [2] generates a 14M point set from 26 photos. The output point set can be cleaned of outliers and simplified, and a Delaunay-based surface reconstruction method turns it into a dense surface triangle mesh with colour attributes.
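
As an illustration of the outlier clean-up step mentioned above (the platform relies on CGAL components for this; the sketch below is a generic statistical outlier removal in Python, not the CGAL code), points whose average distance to their nearest neighbours is abnormally large are discarded before surface reconstruction.

import numpy as np
from scipy.spatial import cKDTree

def remove_outliers(points, k=16, std_ratio=2.0):
    """Drop points whose mean distance to their k nearest neighbours is abnormal."""
    tree = cKDTree(points)
    dists, _ = tree.query(points, k=k + 1)      # first neighbour is the point itself
    mean_dist = dists[:, 1:].mean(axis=1)
    threshold = mean_dist.mean() + std_ratio * mean_dist.std()
    return points[mean_dist <= threshold]

dense = np.random.rand(50_000, 3)                       # inlier cloud
noise = np.random.rand(500, 3) * 10.0 - 5.0             # scattered outliers
cleaned = remove_outliers(np.vstack([dense, noise]))
print(len(cleaned), "points kept out of", 50500)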

We plan to improve the platform in order to deal with series of multifocal photos, fisheye devices, and photos acquired by unmanned aerial vehicles (drones).

Links:

[L1] http://c3dc.fr/
[L2] http://logiciels.ign.fr/?Micmac
[L3] https://www.cgal.org/ (see components "Point set processing" and "Surface reconstruction")
[L4] http://www.huma-num.fr/

References:

[1] E. Rupnik, M. Daakir and M. Pierrot-Deseilligny: "MicMac – a free, open-source solution for photogrammetry", Open Geospatial Data, Software and Standards, 2017.

[2] T. van Lankveld: "Scale-Space Surface Reconstruction", in CGAL User and Reference Manual, CGAL Editorial Board, 4.10.1 edition, 2017.

[3] P. Alliez, et al.: "Point Set Processing", in CGAL User and Reference Manual, CGAL Editorial Board, 4.10.1 edition, 2017.

Please contact:

Pierre Alliez, Inria, [email protected]

Livio de Luca, CNRS, [email protected]


Figure 1: The photo-modelling process requires implementing a well-documented acquisition protocol specific to each type of data, taking a series of high-definition photos, specifying the type of acquisition (linear, circular, random), performing camera calibration (locations and orientations), generating a dense 3D point set with colour attributes via dense image matching and reconstructing a surface triangle mesh.


Physical Digital Access Inside Archaeological Material

by Théophane Nicolas (Inrap/UMR 8215 Trajectoires), Ronan Gaugne (Université de Rennes 1/IRISA-Inria), Valérie Gouranton (INSA de Rennes/IRISA-Inria) and Jean-Baptiste Barreau (CNRS/CReAAH UMR 6566)

Traditionally, accessing the interior of an artefact or an archaeological material is a destructive activity. We propose an alternative non-destructive technique, based on a combination of medical imaging and advanced transparent 3D printing.

Our project proposes combining a computed tomography (CT) scan and advanced 3D printing to generate a physical representation of an archaeological artefact or material. This project is conducted in Rennes, France, with archaeologists from Inrap [L1] and computer scientists from Inria [L2]. The goal of the project is to propose innovative practices, methods and tools for archaeology based on 3D digital techniques.

Archaeologists and curators regularly experience the problem of needing to work on objects that are themselves, or have features which are, inaccessible. For example, artefacts may be encased in corroded materials or in a cremation burial, or integrated in, and inseparable from, larger assemblies (e.g., manufactured objects with several components). Current archaeological processes to analyse concealed or nested archaeological material often use destructive techniques. On the other hand, the absence of a real understanding of the internal structure or state of decay of some objects increases the risk that investigation could destroy source material.

CT scanning is an imaging technology based on X-rays, mostly used for medical purposes. It produces images of the internal structure of the scanned objects, with density information about the internal composition. This technology is increasingly used in cultural heritage (CH) to obtain images of the internal structure of archaeological material. However, it remains mainly limited to providing 2D images.

We propose a workflow where the CT scan images are used to produce volume and surface 3D data which serve as a basis for new evidence usable by archaeologists. This new evidence can be observed in interactive 3D digital environments or through physical copies of internal elements of the original material, as in [1] and [2].

This workflow has been applied to a block of corroded Iron Age tools discovered in Plumaugat [L3], Brittany, France (Figure 1) during excavations conducted by E. Ah Thon, Inrap. The CT scan of the block revealed an assembly of several blacksmith tools (Figure 2).


Figure 1: The original block excavated from the site of Plumaugat.

Figure 2: View of the internal structure of the block with CT scan.

Figure 3: 3D models of the external shape and the internal tools. Figure 4: 3D printing of the block.


The resulting DICOM data was processed with the OsiriX software in order to generate 3D models of the metal tools and of the external shape of the block (Figure 3). These 3D models were then processed and 3D printed with an emerging 3D printing technique mixing coloured and transparent parts (Figure 4).
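
The authors used OsiriX for this step; purely as an illustration of the underlying idea, the sketch below stacks a DICOM series into a volume and extracts an iso-surface at a density threshold separating the dense metal from the corrosion crust, using pydicom and scikit-image. The paths and the threshold value are placeholders.

import glob
import numpy as np
import pydicom
from skimage import measure

def ct_to_mesh(dicom_dir, density_threshold=2500.0):
    """Load a DICOM series and return (vertices, faces) of the iso-surface."""
    slices = [pydicom.dcmread(f) for f in sorted(glob.glob(dicom_dir + "/*.dcm"))]
    slices.sort(key=lambda s: float(s.ImagePositionPatient[2]))   # order along z
    volume = np.stack([s.pixel_array for s in slices]).astype(np.float32)
    verts, faces, _normals, _values = measure.marching_cubes(volume,
                                                             level=density_threshold)
    return verts, faces

# verts, faces = ct_to_mesh("/path/to/series")   # then export to STL/OBJ for printing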

The resulting object is a 1:1 physical representation of the initial object that gives access to the internal spatial organisation of the components. The tangible medium allows for physical manipulation and simple visualisation to support researchers' analysis, as well as aiding the excavation process as such. Having such representations available offers the possibility of taking immediate precautionary measures before any manual intervention; such representations are also the only tangible medium of context preserved after the excavation of the original material. Furthermore, it is important to note that the workflow presented here is far quicker than current excavation and restoration processes, as presented in [3].

The workflow presented in this paper is currently being developed and extended through collaboration with the Canadian INRS and Université Laval, within the ANR-FRQSC project INTROSPECT [L4]. This project, which started in January 2017, aims to develop digital interactive introspection methods for archaeological material. These methods will combine CT scanning with 3D technologies such as virtual and augmented reality, tangible interactions, and 3D printing.

Thanks to this interdisciplinary collaboration, the INTROSPECT project aims to provide innovative solutions, with new usages and tools that give archaeologists and CH practitioners access to new knowledge. The project is based on real use cases corresponding to actual archaeological problems. The scientific heart of the project is the systematisation of the relation between the object/artefact, the archaeological context, the digital object, the virtual reconstruction of the archaeological context, and the physical copy obtained with 3D printing.

Links:

[L1] www.inrap.fr
[L2] www.inria.fr
[L3] https://kwz.me/hnL
[L4] introspect.info/

References:

[1] T. Nicolas, et al.: “Preservative Approach to Study Encased Archaeological Artefacts”, Int. Conf. on Cultural Heritage, LNCS, Vol. 8740, pp. 332–341, 2014.

[2] T. Nicolas, et al.: “Touching and interacting with inaccessible cultural heritage”, in Presence: Teleoperators and Virtual Environments, MIT Press, 2015, 24 (3).

[3] T. Nicolas, et al.: “Internal 3D Printing of Intricate Structures”, Int. Conf. on Cultural Heritage, LNCS, Vol. 10058 (Part I), pp. 432–441, 2016.

Please contact:

Valérie Gouranton, INSA de Rennes/IRISA-Inria, France, [email protected]

Théophane Nicolas, Inrap/UMR 8215 Trajectoires, France, [email protected]

Jean-Baptiste Barreau, CNRS/CReAAH UMR 6566, France, [email protected]

European history is an exciting mesh of interrelated facts and events, crossing countries and cultures. However, historic knowledge is usually presented to the non-specialist public (museum or city visitors) in a siloed, simplistic and localised manner. In CrossCult [L1], an H2020-funded EU project that started in 2016, we aim to change this [1, 2]. With an interdisciplinary consortium of 11 partners from seven European countries [L2], we are developing technologies to help answer two intrinsically united humanities challenges:
• How can we present historic knowledge to non-specialist audiences in an engaging way?
• How can we further trigger these audiences to reflect, individually and collectively, on European history and its connection to the present?

In an era where unity is more important than ever in Europe, the CrossCult project contributes to the understanding of otherness, and shows the importance of the past to explain the present.

CrossCult is implemented through four pilots, which act as real-world demonstrators of what we aim to achieve. From large museums to small ones, and from indoors to outdoors, each pilot represents a strategically important type of European historical venue.

Pilot 1 – Large multi-thematic venue: Building narratives through personalisation
Pilot 1 takes place in the National Gallery in London. It triggers reflection through personalisation, as it uses the gallery's large collection to offer the visitors personalised stories that highlight the connections among people, places and events across European history, through art. Semantic reasoning, recommender systems and path routing optimisation are employed to ensure that each visitor will be navigated through the conceptually linked exhibits that interest them the most, while avoiding congested spaces as much as possible.
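As an illustration of how interest matching and congestion avoidance can be combined, the sketch below greedily builds a short route. The exhibit names, interest vectors and weights are invented for the example and do not come from the CrossCult system.

# Illustrative sketch (not CrossCult's actual engine): greedily build a visitor
# route that favours exhibits matching the visitor's interests while penalising
# currently congested rooms. All names and weights are hypothetical.
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = sqrt(sum(a * a for a in u)), sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def plan_route(visitor_interests, exhibits, congestion, steps=5, alpha=0.5):
    """exhibits: {name: topic_vector}; congestion: {name: occupancy in [0, 1]}."""
    route, remaining = [], dict(exhibits)
    for _ in range(min(steps, len(remaining))):
        # score = interest match minus a congestion penalty
        best = max(remaining,
                   key=lambda e: cosine(visitor_interests, remaining[e])
                                 - alpha * congestion.get(e, 0.0))
        route.append(best)
        del remaining[best]
    return route

visitor = [1.0, 0.2, 0.0]   # e.g., interest in (mythology, portraiture, landscape)
exhibits = {"Bacchus and Ariadne": [0.9, 0.1, 0.0],
            "The Ambassadors": [0.1, 0.9, 0.1],
            "The Fighting Temeraire": [0.0, 0.2, 0.9]}
congestion = {"Bacchus and Ariadne": 0.8, "The Ambassadors": 0.1,
              "The Fighting Temeraire": 0.2}
print(plan_route(visitor, exhibits, congestion, steps=2))
# -> ['Bacchus and Ariadne', 'The Ambassadors']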

Reinterpreting European History Through Technology: The CrossCult Project

by Susana Reboreda Morillo (Universidad de Vigo), Maddalena Bassani (Università degli Studi di Padova) and Ioanna Lykourentzou (Luxembourg Institute of Science and Technology)

The H2020 CrossCult project aims to spur a change in the way Europeans appraise history.




The improved quality of experience that this combination of technologies offers, balancing in a unique way individual visitor needs with museum-wide objectives, can be extended and customised to serve the needs of various other large venues across Europe.

Pilot 2 – Many venues of similar thematic: Building narratives through social networks and gaming
The second pilot triggers reflection through socialisation. It takes place in four different small venues (Montegrotto Terme/Aquae Patavinae, Italy; Lugo/Lucus Augusti, Spain; Chaves/Aquae Flaviae, Portugal; Epidaurus, Greece), where thermo-mineral water and its use is presented as one of the most important natural resources of the past and the present. Through social networked gaming and storytelling, the pilot connects remote groups of visitors, helping them understand the historic connections shared by the venues, in this case on thermalism and water (composition, cults, health treatments, pilgrimages, leisure and daily life).

The focus on multiple venues is strategically important, due to the huge number of small and medium-sized museums around Europe connected through similar themes.

Pilot 3 – Small venue: Building narratives through content enrichment
Pilot 3 takes place in a peripheral, low-profile museum, the Archaeological Museum of Tripoli in Greece. Like many other small museums all over Europe, this venue houses interesting objects, but offers limited informative material and is often overshadowed by larger venues. Content enrichment, in the form of digital narratives, storytelling and social media, backed by psychology techniques, such as empathy increase, is used to create a non-typical visit that goes beyond a traditional object showcase and immerses visitors into life, power structures, and the place of women in antiquity. History, in this respect, becomes relevant to people's lives, enabling them to see the connections between their present and the past, by reflecting on ever-important human issues like religion, mortality and social equality.

The example of pilot 3 can be applied to multiple other small cultural spaces to help raise their profile and take advantage of the latest social, educational and technological developments.

Pilot 4 – Multiple cities: Building narratives through urban informatics and crowdsourcing
European history is intrinsically connected with location; our cities, buildings and streets bear the marks of the people and populations that have inhabited them. Pilot 4 takes place outdoors in two cities, Luxembourg City in Luxembourg and Valletta in Malta, and triggers reflection through urban discovery. Focusing on the topic of migration, past for Malta and present for Luxembourg, and using the technologies of location-based services, urban informatics and crowdsourcing, it invites people to walk the two cities, discover and share stories. Visitors and residents engage in comparative reflection that challenges their perception on topics touched by migration such as identity, quality of life, traditions, integration and sense of belonging.

This pilot has significant potential for the promotion of cultural tourism, as its technologies support the easy integration of new cities into its existing seed city network.

A Living Lab that you are welcome to join
CrossCult implements its vision through a living lab approach: we invite researchers, cultural heritage representatives and stakeholders to co-design with us, take part in our experiments and share their viewpoints in a network of venues and experts across Europe. Get in touch with us at [email protected].

Acknowledgements:

We are grateful to Dr. Angeliki Antoniou, Dr. Evgenia Vasilakaki and the whole CrossCult consortium.

Links:

[L1] http://crosscult.eu/
[L2] https://kwz.me/hnu

References:

[1] I. Lykourentzou et al.: “Reflecting on European History with the Help of Technology: The CrossCult Project”, GCH 2016.

[2] C. Vassilakis et al.: “Interconnecting Objects, Visitors, Sites and (Hi)Stories Across Cultural and Historical Concepts: The CrossCult Project”, EuroMed 2016.

Please contact:

Susana Reboreda Morillo, Universidad de Vigo, Spain, +34 988 387 269, [email protected]

Maddalena Bassani, Università degli Studi di Padova, Italy, +39 329 987 7881, [email protected]

Ioanna Lykourentzou, Luxembourg Institute of Science and Technology, +352 275 888 2703, [email protected]

Figure 1: CrossCult H2020 project – Overview of the four pilots and their supporting technologies.



COURAGE addresses the role of the collections in defining what “cultural opposition” means in order to facilitate a better understanding of how dissent and criticism were possible in the former socialist regimes of Eastern Europe.

The type of opposition we want to discover was largely evident in the culture and lifestyle under the socialist era, and includes activities such as alternative music, alternative fine arts, folk dance clubs and religious movements. The COURAGE project [1] [L1] (within the EU H2020 framework) is creating a comprehensive online database (digital registry) of existing but scattered collections on the histories and forms of cultural opposition in the former socialist countries, thereby making them more accessible. It will analyse these collections in their broader social, political and cultural contexts. The general aim of this analysis is to allow for the expanded outreach and increased impact of the collections by assessing the historical origins and legacies of various forms of cultural opposition. Our research team aims to explore the genesis and trajectories of private and public collections on cultural opposition movements, the political and social roles and uses of the collections before 1989 and since, the role of exiles in supporting these collections, etc.

The project team contains institutes or faculties of history or sociology from Bulgaria, Croatia, Czech Republic, Germany, Hungary, Ireland, Lithuania, Poland, Romania, Slovakia and the United Kingdom. The role of IT support is assigned to MTA SZTAKI, Hungary.

Research results will be made available in multiple forms. The online education material will bring to light the hidden and lesser-known cultural life of the former socialist countries and will facilitate teaching and learning about the period. The exhibition will give access to the hitherto less known masterpieces of cultural opposition as well as the interesting lifestyles of its members through fascinating visual footage and freshly disclosed archival documents.

Cultural Opposition in Former European Socialist Countries: Building the COURAGE Registry

by András Micsik, Tamás Felker and Balázs Nász (MTA SZTAKI)

The COURAGE project is exploring the methods of cultural opposition in the socialist era (ca. 1950-1990). We are building a database of historic collections, persons, groups, events and sample collection items using a fully linked data solution with data stored in an RDF triple store. The registry will be used to create virtual and real exhibitions and learning material, and will also serve as a basis for further narratives and digital humanities (DH) research.

Figure 1: The COURAGE project portal.

Figure 2: List of sample items from the registry.


Policy documents will help decision makers and funders to recognise the social impact and use of the collections of cultural opposition, and will provide a forum to discuss projects of common interest. COURAGE will also organise a documentary film festival. We want to encourage younger generations to care about the past, and to appreciate the importance of cultural opposition.

Based on the requirements set by historians and sociologists, MTA SZTAKI installed and customised a linked data platform, where data for all former socialist countries can be entered by researchers involved in the project. The linked data approach was a natural choice given the importance of capturing the connections between all the researched players, events and collections. The linked data approach is not yet widely known and appreciated in the fields of arts and social science, but our researchers quickly realised its advantages and learned about its different way of data representation. The editor team benefits from the easy traversal and grouping of linked entities and the multilingual text handling, while the administrators can very easily extend or change the data model on-the-fly. Editors use the Vitro platform with a Jena triple store for editing, which is connected to a set of WordPress portals for public presentation.

Among the many IT challenges, first we had to solve the representation of timed properties, for example when a collection was owned by a person and then donated to a museum. Secondly, we had to implement a kind of simplified authorisation for triple editing (the option “everything is open” was not acceptable for the community). Thirdly, a quality assurance workflow was also developed in-house. Entities in the registry go through a seven-step workflow containing local and central quality checks and English proofreading. The result is what we can call a knowledge graph connecting collections, events and participants of cultural opposition. Part of this graph is already published and browseable on the project portal under the registry link. The schema of the registry is called the COURAGE Ontology, and it is made available as an OWL file. Owing to our special requirements, we could hardly rely on existing ontologies, but in the future, we aim to map our ontology to various widely used metadata formats.
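The sketch below illustrates one way a timed property can be reified in RDF so that an ownership period carries start and end dates. It uses rdflib, and the namespace and term names are hypothetical stand-ins rather than the actual COURAGE Ontology vocabulary.

# Illustrative sketch of one way to model a timed ownership property in RDF,
# using rdflib. The namespace and term names below are hypothetical stand-ins,
# not the actual COURAGE Ontology vocabulary.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/courage/")   # hypothetical namespace

g = Graph()
g.bind("ex", EX)

collection = EX["collection/42"]
person = EX["person/ownerA"]
museum = EX["org/museumB"]

# Ownership is reified as an event node so it can carry start and end dates.
ownership = EX["event/ownership-42-1"]
g.add((ownership, RDF.type, EX.OwnershipPeriod))
g.add((ownership, EX.ofCollection, collection))
g.add((ownership, EX.owner, person))
g.add((ownership, EX.startDate, Literal("1975-01-01", datatype=XSD.date)))
g.add((ownership, EX.endDate, Literal("1992-06-30", datatype=XSD.date)))

# A second period records the donation to the museum.
donation = EX["event/ownership-42-2"]
g.add((donation, RDF.type, EX.OwnershipPeriod))
g.add((donation, EX.ofCollection, collection))
g.add((donation, EX.owner, museum))
g.add((donation, EX.startDate, Literal("1992-06-30", datatype=XSD.date)))

print(g.serialize(format="turtle"))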

As our current status in rough numbers, we have 300 collections with 500 featured items, 700 persons, 300 groups or organisations and 400 events in the registry. With 80 researchers editing the registry we are not yet halfway through our planned work. We hope that by the end of the project we can build an almost complete encyclopaedia of cultural opposition in former socialist countries in the form of a knowledge graph.

Link:

Project home page: http://cultural-opposition.eu/

Reference:

[1] COURAGE: Understanding the Cultural Heritage of Dissent in Former Socialist Countries. Understanding Europe – Promoting the European Public and Cultural Space, 17 October 2016, Brussels.

Please contact:

András Micsik, MTA SZTAKI, Hungary, +36 1 279 6248, [email protected]



Worldwide, and especially in the European Union, information technology is playing an increasingly important role in enabling the efficient use and preservation of cultural heritage. The Locale project looks at the preservation of immaterial cultural heritage from the perspective of tools and techniques to support the connection of different stories. In fact, navigating complex information spaces remains a challenge despite continuous improvements in search and retrieval processes, especially when the information refers to overviews or miscellaneous references, such as when connecting particular stories and places. The Locale project is based on a collaborative mobile and web-based platform with a focus on location-based storytelling for sharing testimonies and multi-media historical heritage content relating to the period of 1945-1960 in Luxembourg and the Greater Region: from the end of WWII to the dawn of Europe, in the context of their respective 70th (2015) and 60th (2017: European Economic Community) anniversaries.

In addition, Locale intends to foster the sharing of personal historical accounts that might not be included in standard historical literature. The platform offers dedicated functionalities for exploring multidimensional data using various human analysis and data mining strategies, based on metadata, tags, attributes entered by the user, and browsing history (e.g., connections between a place and queries about a given historical fact). In particular, Locale draws from collaborative visualisation, allowing users to share views of the same content with different focuses, and providing an intuitive way of sharing content [1].

Locale, an Environment-Aware Storytelling Framework Relying on Augmented Reality

by Thomas Tamisier, Irene Gironacci and Roderick McCall (Luxembourg Institute of Science and Technology)

The Locale project proposes a vision of location-aware digital storytelling empowered by a combination of technologies including data mining, information visualisation and augmented reality. The approach is tested through pilot contributors who share their experiences, stories and testimonies of Luxembourg since the end of World War II.



The main operational features include, on the one hand, text and data mining functionalities for updating knowledge and triggering selective actions supporting the interaction between platform users and, on the other hand, different modalities of exploring and editing stories with a view to enhancing the spatial dimension and feeling of flow and immersion by using a desktop, a mobile device, or interactive augmented reality equipment.

Location-based storytelling requires that people feel immersed in the experience, and perhaps even feel part of it (i.e. present). For this to be achieved it is essential that any system makes the user feel a sense of space and, more importantly, place. Space relates to the physical properties of the environment, for example street layouts, buildings and perhaps smaller aspects such as benches. In contrast, a sense of place is when a space becomes infused by different meanings. A sense of place arises out of the blending of aspects such as sense of self, other people and activities in the space [2]. Earlier storytelling applications using mobile phones explored this concept [3] and in Locale this is further enhanced through the co-construction between different physical and narrative elements, combined with augmented reality technology.

By filtering physical environments through ad-hoc additional features and augmented reality, Locale has been tested with different use-cases and proven to bring powerful support in revisiting a scene of a story across time and users' perspectives. For example, a route can be created which contains multiple stopping points; as the user walks along they can listen to stories about places, people and events at different locations. Furthermore, if there are many layers or stories at a specific location about a particular person or event this may give the user a stronger sense of history and importance and ultimately shape their understanding of that place. In this regard, Locale provides specifically new ways of interaction through the use of an AR headset, a new type of non-command user interface able to track user movements and use them to create UI elements the user can interact with. Additional visualisations provide indications of the degree of agreement/disagreement between sources of information available and links between related information. Figure 1 shows the test of Locale in the Virtual Reality laboratory of LIST: a real grayscale image (centre) is seen through a Microsoft HoloLens headset, which triggers augmented content consisting of a current image of the building (right) and related information (left). The overall picture is an interactive and immersive storytelling experience where the user can interact with the contents of the story (notably images and 3D models) in a simple and natural way, for example using gestures or voice.
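The following minimal sketch illustrates the location-triggered part of such a route: a story plays once the walker comes within a trigger radius of a stopping point. The coordinates, radii and story titles are invented, and the code is not taken from the Locale platform.

# Minimal sketch of location-triggered storytelling: play the story attached to a
# stopping point once the walker is within its trigger radius. Coordinates and
# story titles are invented for illustration; Locale's actual platform is not shown.
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))

STOPPING_POINTS = [  # hypothetical route through Luxembourg City
    {"title": "Pont Adolphe, 1945", "lat": 49.6095, "lon": 6.1270, "radius_m": 40},
    {"title": "Place Guillaume II, 1957", "lat": 49.6106, "lon": 6.1312, "radius_m": 40},
]

def triggered_stories(lat, lon):
    """Return the stories whose trigger radius contains the current position."""
    return [p["title"] for p in STOPPING_POINTS
            if haversine_m(lat, lon, p["lat"], p["lon"]) <= p["radius_m"]]

print(triggered_stories(49.6094, 6.1268))   # walker standing near the bridge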

As a whole, Locale provides an operational framework for location-aware and collaborative storytelling that focuses on three main challenges identified in the literature. First, it encourages people to structure stories in ways that support their perception of place and sense of presence. Second, it enables the linking of content and mining of related data to improve how people can navigate within stories and spaces as well as provide people with easier ways to see and interact with the rich content. Finally, it explores novel interface techniques that are designed to present complex information but avoid information overload. As a primary impact, the operational framework achieved through the project will help to explore how technologies coupled with an environment-aware setting can help to bridge the digital divide between users of different ages and backgrounds.

Acknowledgments:

The Locale project is funded by Luxembourg's Fonds National de la Recherche under the Core grant C12/IS/3986199.

Links:

http://list.lu
https://kwz.me/hNN

References:

[1] P. Carvalho: “Using Visualization Techniques to acquire a better understanding of Storytelling”, Proc. of DATA conference, 2017.

[2] P. Gustafson: “Meanings of place: everyday experience and theoretical conceptualizations”, Environmental Psychology, Elsevier, 21, 5-16, 2001.

[3] R. McCall: “Mobile phones, Sub-Culture and Presence”, in Proc. of the workshop on Mobile Spatial Interaction at ACM Conference on Human Factors in Computing Systems (CHI), 2011.

Please contact:

Thomas Tamisier, Luxembourg Institute of Science and Technology, [email protected]

Figure 1: Augmented reality in Locale: detection and tracking of a target image of a building.




The Swiss pavilion at the “Biennale di Venezia” offers a platform for national artists to exhibit their work. This well-known white cube showcases the changes in contemporary Swiss art from the early 50s to the present day. The aim of “Biennale 4D” is to make the archives of the past bi-annual art exhibitions more comprehensible by creating an interactive explorative environment using innovative virtual reality (VR) technology. “Biennale 4D” poses multiple challenges including visualisation of historic content and its documentation, dealing with the heterogeneity and incompleteness of archives, interaction design and interaction mapping in VR space, integration of metadata, as well as realising a virtual reality experience for the public space with current VR technology.

Design and development of the reconstructed exhibition environment
A pilot application was created by the Institute of 4D Technologies of the University of Applied Sciences and Arts Northwestern Switzerland FHNW and the Swiss Institute for Art Research SIK-ISEA [1] to test the concept of the proposed VR experience. This functional prototype allows users to explore the pavilion and displayed art works of different epochs with the HTC Vive VR headset and hand controllers. It contains a 3D model of the pavilion based on the original design by Bruno Giacometti and a concept for visualisation of the exhibition content and its documentation. Exhibition samples (see Figure 1) of various documentation levels are included – for example, “thoroughly documented”, “fragmentarily documented” – as well as experimental art works, such as video installations. The current release of the prototype showcases exemplary portions of the selected exhibitions in the years 1952, 1984, 2007 and 2013. The development of a consistent visual language for the heterogeneous work posed a challenge. Numerous experiments were made to discover a suitable visualisation style (see Figure 2) that guides the user's focus away from the building to the artworks and their documentation. These experiments included variables such as the degree of abstraction, deliberate lack of definition and level of chromaticity, and lessons from the field of archaeology have been applied for the handling of incomplete data.

Interaction with time, space and metadata
The prototype allows interaction with three dimensions which had to be reduced onto the two hand controllers available to the user. In this initial prototype one hand is assigned to time travel and the other hand is designated for spatial movement and interaction with the objects. The user is able to travel intuitively through time by means of interaction with the time machine object (see Figure 3).

The Biennale 4D Project

by Kathrin Koebel, Doris Agotai, Stefan Arisona (FHNW) and Matthias Oberli (SIK-ISEA)

Virtual reality (VR) reconstruction offers a new interactive way to explore the archives of the Swiss Pavilion at the “Biennale di Venezia” Art Exhibition.

Figure 1: Screenshot of the Biennale 4D prototype, showing the virtual reconstruction of the 2007 art exhibition in the “Malerei” hall of the Swiss pavilion.

Figure 2: Screenshot of the Biennale 4D prototype, displaying the chosen aesthetics for the visualisation of the historical content. The original design of the pavilion was reduced and blur has been added to the wall textures in order to intuitively guide the user's focus to the documented art works.

Figure 3: Screenshot showing the time machine object which was created to allow the user to interact with the time dimension.

Figure 4: Screenshot of the information guide that allows interaction with the metadata of the art work.



This three-dimensional item offers affordances to the user about the exhibition content of the years he or she is passing by while travelling through time, navigating to the scene of the desired year. For spatial movement the application contains a navigation concept that allows the user to move within the virtual room either by position tracking of the user's movement in the physical space or via teleportation. It provides basic haptic feedback in case of collisions. Other forms of spatial navigation were considered (e.g., a guided tour); however, testing revealed a correlation between the user's degree of freedom regarding movement and the perceived user experience. In addition, an information guide in the form of a virtual booklet offers metadata corresponding to the objects on display (see Figure 4). This supplementary information can be accessed by pointing with a laser ray towards the desired item. Interactive hotspots have also been added to the application to show additional material, such as archive photos.
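As a rough illustration of the time-travel mapping only (the actual prototype is built with a VR engine and the HTC Vive controllers), the sketch below snaps a continuous controller dial value to the nearest documented exhibition year.

# Illustrative sketch (not the actual VR implementation): snap a continuous
# "time dial" value from the hand controller to the nearest documented exhibition
# year, so the scene for that year can be loaded.
DOCUMENTED_YEARS = [1952, 1984, 2007, 2013]   # years available in the prototype

def snap_to_year(dial_value, years=DOCUMENTED_YEARS):
    """dial_value in [0, 1] spans the full time range; return the closest year."""
    lo, hi = min(years), max(years)
    target = lo + dial_value * (hi - lo)
    return min(years, key=lambda y: abs(y - target))

assert snap_to_year(0.0) == 1952
assert snap_to_year(0.98) == 2013
print(snap_to_year(0.6))   # -> 1984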

Conclusion and future work
Biennale 4D poses a unique challenge as it combines a virtual art exhibition experience with archive functionalities through the use of virtual reality technology. This synthesis allows new approaches to exhibition reconstruction and the conclusive incorporation of historical material in this experimental virtual space. The curative work intertwines three layers of materials: historical content (original artwork), its documentation (archive photos and other artefacts) and the virtual room (mapping space). In particular the handling of the documentation layer leaves much room for interpretation and exploration. Furthermore, the nature of this application field requires thoughtful examination of aspects like substantiality and aesthetics, as well as the concept of time.

Some other areas of focus for further work include further elaboration on the aesthetics of the visualisation including spatial design of the surrounds, improving the storytelling and developing a concept to present the application and allow its usage in the public space. Ongoing work will focus on offering access to the full content of the Biennale archives in an even more interactive and immersive way and letting a wider audience experience this valuable portion of Swiss art and cultural history.

Links:

www.biennale-venezia.ch
www.fhnw.ch/technik/i4ds

Reference:

[1] K. Koebel, O. Kaufmann: “Biennale 4D – Erschaffung einer Virtual Reality Experience zur Exploration der Archivbestände des Schweizer Pavillons an der Biennale Venezia”, https://kwz.me/hLd.

Please contact:

Kathrin Koebel, Fachhochschule Nordwestschweiz, [email protected]

The Clavius Correspondence: From Digitization to Visual Exploration of Knowledge

by Matteo Abrate, Angelica Lo Duca and Andrea Marchetti (IIT-CNR)

The “Clavius on the Web Project” [L1] is an initiative involving the National Research Council in Pisa and the Historical Archives of the Pontifical University in Rome. The project aims to create a web platform for the input, analysis and visualisation of digital objects, i.e., letters sent to Christopher Clavius, a famous scientist of the 17th century.

In the field of digital humanities, cultural assets can be valued and preserved at different levels, and whether or not an object is considered a knowledge resource depends on its peculiarity and richness. Within the Clavius on the Web Project we mainly consider two kinds of knowledge resources: contextual resources associated with digitised documents and manual annotations of cultural assets. We implemented a separate software tool for each of these resource types: the Web Metadata Editor for contextual resources and the Knowledge Atlas to support manual annotation.

Semi-automated enrichment of catalogues: the Web Metadata Editor
In recent years, a great effort has been made within the field of digital humanities to digitise documents and catalogues in different formats, such as PDF, XML, plain texts and images. These documents are often stored either in digital libraries or big digital repositories in the form of books and catalogues. The process of cataloguing also requires the creation of a knowledge base, which contains contextual resources associated with documents of the catalogue, such as the authors of the documents and places where documents were written. Information contained in the knowledge base can be used to enrich document details, such as metadata associated with documents. Most of the existing tools for catalogue creation allow the knowledge base to be built manually. This process is often tedious, because it requires known information about a document, such as the author's name and date of birth, to be edited. It is also a repetitive process because many documents are written by the same author and in the same place, thus requiring the same information to be written twice or more. In general there are three main disadvantages of this manual effort, compared to an automated system: (i) there is a higher probability of introducing errors; (ii) the process is slower; (iii) the entered information is isolated, i.e., not connected to the rest of the web.

In order to mitigate these disadvantages in the context of the project, we developed the Web Metadata Editor (WeME) [1], a user-friendly web application for building a knowledge base associated with a catalogue of digital documents. Figure 1 shows a snapshot of WeME. While the application is envisaged for archivists/librarians, it may also be useful for others.




WeME helps archivists to enrich their catalogues with resources extracted from two kinds of web sources: Linked Data (DBpedia, GeoNames) and traditional web sources (VIAF).
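As an illustration of the kind of semi-automated enrichment WeME performs, the sketch below queries the public DBpedia SPARQL endpoint for an author's birth date; it assumes the SPARQLWrapper package and is not WeME's actual code.

# Illustrative sketch (not WeME's actual implementation): look up an author's
# birth date on DBpedia so a catalogue record can be enriched automatically
# instead of being typed in by hand. Assumes the SPARQLWrapper package.
from SPARQLWrapper import SPARQLWrapper, JSON

def dbpedia_birth_date(resource_uri):
    """Return the dbo:birthDate of a DBpedia resource, or None if absent."""
    endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
    endpoint.setQuery(f"""
        SELECT ?date WHERE {{
            <{resource_uri}> <http://dbpedia.org/ontology/birthDate> ?date .
        }} LIMIT 1
    """)
    endpoint.setReturnFormat(JSON)
    results = endpoint.query().convert()
    bindings = results["results"]["bindings"]
    return bindings[0]["date"]["value"] if bindings else None

print(dbpedia_birth_date("http://dbpedia.org/resource/Christopher_Clavius"))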

WeME, which is published as open source software on GitHub [L2], was tested by 31 users, 45.2% of whom were female and 54.8% male. Almost half (48.8%) of the testers were older than 35 years. On a scale ranging from poor (1) to excellent (5), the scores given to WeME were: 4 (by 50% of testers), 5 (by 20%), 3 (by 23.3%) and 1 or 2 by the remaining testers. Tests indicate that WeME is a promising solution for reducing the repetitive, often tedious work of archivists.

Visual exploration of knowledge: Knowledge Atlas
Creating a catalogue and digitising assets from the archive, while fundamental, soon proved insufficient to convey the richness of the knowledge within the archive. We thus started the development of Knowledge Atlas [L3], a user interface designed with the following principles:
1) Visualisation and interaction – Content presentation should take advantage of the visual expressiveness and interactivity of modern web technologies. An example is the interactive recreation of volvellae, instruments made of rotating wheels of paper, which enables scholars to investigate their purpose without risking damage to the original artefacts.
2) Depth and detail – Content presentation should explicitly show many different layers of analysis and highlight interesting points. For example, a representation of the Kunyu Wanguo Quantu Chinese map is displayed with highlighted toponyms and cartouches, with which the user can interact to read Chinese transcriptions and Italian translations. Each text can then be further explored to access information about specific Chinese graphemes (Figure 2).
3) Context and links – The main content should be complemented with contextual information. Such presentations should enable navigation from content to context and vice versa. For example, senders, recipients and cited historical characters related to the figure of Christophorus Clavius [2] constitute a graph of correspondents, which can be navigated by itself or starting from a specific passage of a letter. Other examples include the set of evolving mathematical or astronomical conceptualisations and the related lexical terms in Latin, Italian or Greek [3].
4) Divulgation and correctness – The presentation of content should feature specific design choices aimed at capturing the attention of students or casual readers, by leveraging powerful visual languages and a systematic organisation. The power of providing a structured and intuitive overview should be used as a gateway to lead to the most complete and correct information available. As an example, the representation of the context of the aforementioned letters by Galilei about the model of our solar system is a visual diagram making use of a distorted scale. The actual distances and sizes of the objects can be appreciated by accessing detailed data.

Our current platform, in active and open development on GitHub, implements this design without being too tied to the specificity of the content. By following this methodology, we believe that our software will prove to be useful in contexts beyond the projects discussed here. We are already experimenting with other content, such as more maps, books, and paintings, but also with models of buildings and complex structured data about the internet.

Links:

[L1] http://claviusontheweb.it
[L2] https://kwz.me/hkj
[L3] http://atlasofknowledge.it

References:

[1] A. Lo Duca et al.: “Web Metadata Editor: a Web Application to Build a Knowledge Base Based on Linked Data”, Third Int. Workshop on Semantic Web for Scientific Heritage, Portoroz, Slovenia, 2017.

[2] M. Abrate et al.: “The Clavius on the Web Project: Digitization, Annotation and Visualization of Early Modern Manuscripts”, AIUCD Annual Conference, 2014.

[3] S. Piccini et al.: “When Traditional Ontologies are not Enough: Modelling and Visualizing Dynamic Ontologies in Semantic-Based Access to Texts”, DH 2016.

Please contact:

Andrea Marchetti, IIT-CNR, Italy, [email protected]

Figure 1: A snapshot of WeME.
Figure 2: A screenshot of Knowledge Atlas showing information about a Chinese radical.



Mobile social networking is made up of two key ingredients: people and the smart devices carried by them [1]. First, people follow a social rather than a random mobility pattern. People tend to spend time together with people who have similar interests [2]. Second, devices are designed with advanced hardware and software capabilities, and they can communicate among themselves based on short-range wireless interfaces, thus forming a mobile network. Indeed, they are equipped with different types of resources such as sensing units and several computational capabilities. Such resources can be seen as exploitable services that can be shared with other devices. When the service-oriented approach is applied in a mobile social context, common application problems have to be rethought for this new challenging perspective.

The goal of our research is to study how to exploit mobility and sociality of people in a mobile social network in order to advertise and discover the services provided by devices. This problem is commonly referred to as service discovery.

We investigate the service discovery problem, and in particular the possibility of advertising and discovering services offered by devices in the mobile social network. We propose an efficient strategy for advertising new or existing services to other devices and, at the same time, a strategy for discovering a specific service. The algorithm we have designed is referred to as CORDIAL (Collaborative Service Discovery Algorithm), alluding to the opportunities arising from people close to us.

CORDIAL adopts both a reactive and a proactive mechanism. The reactive mechanism is used to diffuse a service query and implements the query forwarding strategy. This strategy is executed by a device as soon as the user carrying it needs to access a specific service. Consider the following scenario where a user needs to tag a picture with GPS coordinates. The user's device does not provide this functionality, but she can discover devices in proximity that do offer such a service. Once the user finds the required service, in this case the GPS, she can invoke it (see Figure 1).

The basic idea of the reactive phase is to select those devices in range with interests similar to the query sent by the user. Moreover, our strategy gives priority to those devices that are able to interact with other devices (e.g., strangers) that can potentially answer the service query.
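A minimal sketch of this interest-based selection is given below: the query carries a topic vector and is forwarded to the neighbours whose profiles match it best. Profiles and identifiers are invented; this is not the CORDIAL implementation.

# Illustrative sketch of the interest-based selection idea (not CORDIAL itself):
# forward a service query to the neighbours whose interest profiles are most
# similar to the query's topic vector. Profiles and weights are invented.
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = sqrt(sum(a * a for a in u)), sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def select_forwarders(query_topics, neighbours, k=2):
    """neighbours: {device_id: interest_vector}; return the k best-matching ids."""
    ranked = sorted(neighbours, key=lambda d: cosine(query_topics, neighbours[d]),
                    reverse=True)
    return ranked[:k]

query = [0.0, 1.0, 0.5]          # e.g., (music, photography, location) interests
neighbours = {"phone-A": [0.1, 0.9, 0.4],
              "phone-B": [0.8, 0.0, 0.1],
              "tablet-C": [0.0, 0.6, 0.9]}
print(select_forwarders(query, neighbours))   # -> ['phone-A', 'tablet-C']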

The proactive mechanism of CORDIAL is required to proactively diffuse queries not yet answered (namely the pending queries) and to advertise the existence of services. In this case, the goal is to maximise the diffusion of the messages (queries and service advertisements) and, at the same time, reduce message duplication.

The outcome of this research was the ability to reproduce mobile social networks by using real-world mobility traces. Such traces do a good job of reproducing human mobility in open outdoor environments such as a university campus or a city district. Among all the metrics analysed, we gave particular emphasis to the capacity of the discovery algorithm to store service advertisements of interest for the device, by avoiding the submission of queries (the proactivity metric). Moreover, we analysed the capacity of the algorithm to store only those service advertisements of interest for a device. Such a policy avoids storing off-topic advertisements (we measured such behaviour with the accuracy metric). CORDIAL also obtains optimal values of both metrics (proactivity and accuracy) when compared with other social-aware discovery strategies.

We thus investigated the possibility of exploiting, in an opportunistic fashion, the increasing number of resources offered by smart or low-power devices in mobile social networks. Future work includes the possibility of using CORDIAL in both participatory and opportunistic crowd sensing scenarios, as well as using CORDIAL combined with off-loading computing techniques.

References:

[1] M. Girolami, S. Chessa, A. Caruso: “On Service Discovery in Mobile Social Networks: Survey and Perspectives”, Computer Networks, 88, (2015): 51-71, DOI 10.1016/j.comnet.2015.06.006.

[2] M. McPherson, L. S. Lovin, J. M. Cook: “Birds of a Feather: Homophily in Social Networks”, Annual Review of Sociology, 27: 415–444, 2001.

Please contact:

Michele Girolami, ISTI-CNR, Italy, [email protected]

Service-oriented Mobile Social Networking

by Stefano Chessa and Michele Girolami (ISTI-CNR)

The Wireless Network Laboratory at ISTI-CNR studies how to exploit mobility and sociality of people in a mobile social network in order to advertise and discover the services provided by devices. This problem is commonly referred to as service discovery. An efficient strategy for advertising the existence of new or existing services to other devices is proposed alongside a strategy for discovering a specific service.

Figure 1: The Reactive and Proactive strategies and the Accuracy and Proactivity metrics.



Research and Innovation

Collaboration Spotting: A Visual Analytics Platform to Assist Knowledge Discovery

by Adam Agocs, Dimitris Dardanis, Richard Forster, Jean-Marie Le Goff, Xavier Ouvrard and André Rattinger (CERN)

Collaboration Spotting (CS) is a visualisation and navigation platform for large and complex datasets. It uses graphs and semantic and structural data abstraction techniques to assist domain experts in creating knowledge out of big data.

Valuable knowledge, which can help to solve a range of complex problems, may be created from the extremely large amount of data generated by computers and the internet of things. To achieve this, we rely on sophisticated cognitive tools whose efficacy strongly depends on the delicate interplay between domain experts and data scientists. Domain experts trust data scientists to deliver the tools they need, while the effectiveness of these tools in supporting knowledge creation essentially depends on the information in the hands of domain experts. In other words, creating knowledge essentially relies on the capability of combining domain specific semantic information with concepts extracted out of the data and visualising the resulting networks.

Storing large networks in a flexible and scalable manner calls for graphs: mathematical objects that hold information in nodes representing data instances of particular categories – called facets – and in relationships characterising the network interconnectivity. Facets and relationships embody the network schema, a semantic abstraction of the network content and structure.

Enhancing the cognitive insight of humans into the understanding of the data calls for visual analytics: the science of analytical reasoning supported by interactive visual interfaces that combines the power of visual perception with high performance computing.

Collaboration Spotting (CS) is a data-driven platform based on open source software packages that uses visual analytics concepts and advanced graph processing techniques to provide a flexible environment for domain experts to run their analysis. CS is domain independent and fully customisable. It gives data scientists the capability of building multi-faceted networks out of multiple and heterogeneous data sources and domain experts the ability to specify different perspectives for conducting their analysis by means of network schemas. Individual analysis outputs are visualised in graphs using node-link representations where node size, colour and shape, and relationships highlight the network contents and structures. Output graphs are perspective specific. They represent faceted views of the network under study articulated around part or all of its content. A sophisticated navigation system enables users to graphically interact with individual output graphs and reach other facets of the network.




Advanced structural abstraction techniques tailored to graph complexity and size give domain experts visual access to fine structures and particularities, such as communities and outliers, in a clear manner even for reasonably large graphs. Applying these techniques in hierarchies of graphs provides visual access to larger output graphs at the cost of a loss of semantic and structural information.
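To make the notion of facets and faceted views concrete, the toy sketch below builds a small publication–keyword network with networkx, projects it onto the keyword facet and clusters co-occurring keywords, in the spirit of the view shown in Figure 2. The data and library choice are illustrative only, not part of the CS platform.

# Toy sketch of the faceted-network idea (not the CS platform itself): build a
# publication-keyword graph, project it onto the keyword facet, and cluster
# keywords that co-occur in publications, as in the keyword view of Figure 2.
import networkx as nx
from networkx.algorithms import bipartite
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical publications annotated with keywords (two facets of the network).
pubs = {"P1": ["graph", "visual analytics"],
        "P2": ["graph", "clustering"],
        "P3": ["detector", "readout"],
        "P4": ["detector", "clustering"]}

B = nx.Graph()
for pub, keywords in pubs.items():
    B.add_node(pub, facet="publication")
    for kw in keywords:
        B.add_node(kw, facet="keyword")
        B.add_edge(pub, kw)

# Faceted view: keyword co-occurrence graph, weighted by shared publications.
keywords = [n for n, d in B.nodes(data=True) if d["facet"] == "keyword"]
K = bipartite.weighted_projected_graph(B, keywords)

# Structural abstraction: group keywords into communities (coloured clusters).
for community in greedy_modularity_communities(K, weight="weight"):
    print(sorted(community))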

Responsiveness when servicing users is an essential aspect of visual analytics. To this end, the CS team has paid particular attention to optimising all the time-critical aspects of the platform. Networks are stored in graph databases that support efficient and flexible query mechanisms. The computing of output graph structures with optimised visual perception is done using scalable clustering and rendering algorithms that run on large computing infrastructures and open source cloud computing software.

CERN has developed Collaboration Spotting with the initial aim of providing the particle physics community with intelligence on academia and industry players active around key technologies with a view to fostering more interdisciplinary and inter-sectorial R&D collaborations, and giving procurement at CERN the opportunity of reaching a wider selection of high-tech companies [L1]. CS concepts and techniques have been used for building the Technology Innovation Monitor (TIM) of the EC Joint Research Centre (JRC) [L2] and visualising compatibility and dependency relationships in software and metadata of the LHCb experiment at CERN [L1]. A collaboration with Wigner MTA and Budapest University of Technology and Economics (BME), Hungary is now in place to explore the use of CS in the areas of: (i) pharmacoinformatics, as a supporting tool for knowledge-based drug discovery using analytics over linked open data for the life sciences [L3]; (ii) IT-analytics, as a tool to enhance large-scale visual analysis of performance metrics of algorithmic and infrastructure components [L4]; (iii) neuroscience, to assist in understanding the structural and functional organisation of the brain [L5]; and (iv) social sciences, as a research tool to study a database of European and international higher education institutes with the goal of developing new metrics to describe their performance (BRRG) [L6].

From the experience acquired with the prototypes, we plan to further develop the CS platform to provide optimised visual perception in order to enhance cognitive insights regardless of the size of the networks. These new developments will be tested at CERN and at the Wigner GPU laboratory.

Links:

[L1] collspotting.web.cern.ch/
[L2] timanalytics.eu/
[L3] bioinformatics.mit.bme.hu/UKBNetworks/
[L4] inf.mit.bme.hu/en/research/directions
[L5] kwz.me/hT6
[L6] kwz.me/hT7

Please contact:

Jean-Marie Le Goff, CERN, Switzerland, +41 22 767 6559, [email protected]



Figure 1: The Collaboration Spotting conceptual framework. After populating the database, domain experts analyse the network content with the help of the schema.

Figure 2: View of an analysis output depicting a network of publications from the keyword perspective. Vertices in italics represent keywords merged together and the others single keywords. Coloured clusters highlight groups of keywords that are found more often together in publications.


The Distributional Hypothesis states that the meaning of a term is determined by its distribution in text, and by the terms it tends to co-occur with (a famous motto describing this assumption is “You shall know a word by the company it keeps”). DCI mines, from unlabelled datasets and in an unsupervised fashion, the distributional correspondences between each term and a small set of “pivot” terms, and is based on the further hypothesis that these correspondences are approximately invariant across the source context and target context for terms that play equivalent roles in the two contexts. DCI derives term representations in a vector space common to both contexts, where each dimension (or “feature”) reflects its distributional correspondence to a pivot term. Term correspondence is quantified by means of a distributional correspondence function (DCF); in this work the authors propose and experiment with a number of efficient DCFs that are motivated by the distributional hypothesis. An advantage of DCI is that it projects each term into a low-dimensional space (about 100 dimensions) of highly predictive concepts (the pivot terms), while other competing methods (such as Explicit Semantic Analysis) need to work with much higher-dimensional vector spaces.
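As an illustration of the idea (with an invented toy corpus, and using a simple cosine correspondence as a stand-in for the DCFs defined in the paper), the sketch below embeds each term as its correspondence to two pivot terms.

# Toy sketch of building DCI-style term representations: each term is embedded as
# its distributional correspondence to a small set of pivot terms, here quantified
# with a simple cosine function over term-document occurrence profiles. The corpus
# and pivot choice are invented; the authors' actual DCFs are defined in the paper.
import numpy as np

docs = ["great plot and great acting",
        "boring plot terrible acting",
        "great sound excellent battery",
        "terrible battery poor sound"]
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

# Binary term-document matrix: X[t, d] = 1 if term t occurs in document d.
X = np.zeros((len(vocab), len(docs)))
for d, text in enumerate(docs):
    for w in text.split():
        X[index[w], d] = 1.0

def cosine_dcf(t_profile, p_profile):
    """Cosine correspondence between a term profile and a pivot profile."""
    denom = np.linalg.norm(t_profile) * np.linalg.norm(p_profile)
    return float(t_profile @ p_profile / denom) if denom else 0.0

pivots = ["great", "terrible"]          # highly predictive, frequent terms
embedding = {w: [cosine_dcf(X[index[w]], X[index[p]]) for p in pivots]
             for w in vocab}

print(embedding["excellent"])   # ~[0.71, 0.0]: behaves like "great"
print(embedding["poor"])        # ~[0.0, 0.71]: behaves like "terrible"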

The authors have applied this technique to sentiment classification across domains (e.g., book reviews vs DVD reviews) and across languages (e.g., reviews in English vs reviews in German), either in isolation (cross-domain SC, or cross-lingual SC) or – for the first time in the literature – in combination (cross-domain cross-lingual SC, see Figure 1). Experiments run by the authors (on “Books”, “DVDs”, “Electronics”, “Kitchen Appliances” as the domains, and on English, German, French, and Japanese as the languages) have shown that DCI obtains better sentiment classification accuracy than current state-of-the-art techniques (such as Structural Correspondence Learning, Spectral Feature Alignment, or Stacked Denoising Autoencoder) for cross-lingual and cross-domain sentiment classification. DCI also brings about a significantly reduced computational cost (it requires modest computational resources, which is an indication that it can scale well to huge collections), and requires a smaller amount of human intervention than competing approaches in order to create the pivot set.

Link: http://jair.org/papers/paper4762.html

Please contact:

Fabrizio Sebastiani, ISTI-CNR, Italy, +39 050 6212892, [email protected]


Distributional Correspondence Indexing for Cross-Lingual and Cross-Domain Sentiment Classification

by Alejandro Moreo Fernández, Andrea Esuli and Fabrizio Sebastiani

Researchers from ISTI-CNR, Pisa (in a joint effort with the Qatar Computing Research Institute), have developed a transfer learning method that allows cross-domain and cross-lingual sentiment classification to be performed accurately and efficiently. This means sentiment classification efforts can leverage training data originally developed for performing sentiment classification on other domains and/or in other languages.

In this work the authors propose the DistributionalCorrespondence Indexing (DCI) method for transfer learningin sentiment classification. DCI is inspired by the“Distributional Hypothesis”, a well-know principle in linguis-tics that states that the meaning of a term is somehow deter-


Figure 1: Cross-lingual (English-German) and cross-domain (book-music) alignment of word embeddings through DCI. The left part is a cluster of negative-polarity words, the right part of positive-polarity ones.



Measuring Bias in Online Information

by Evaggelia Pitoura (University of Ioannina), Irini Fundulaki (FORTH) and Serge Abiteboul (Inria & ENS Cachan)

Bias in online information has recently become a pressing issue, with search engines, social networks and recommendation services being accused of exhibiting some form of bias. Here, we make the case for a systematic approach towards measuring bias. To this end, we outline the necessary components for realising a system for measuring bias in online information, and we highlight the related research challenges.

We live in an information age where the majority of our diverse information needs are satisfied online by search engines, social networks, e-shops, and other online information providers (OIPs). For every request we submit, a combination of sophisticated algorithms returns the most relevant results tailored to our profile. These results play an important role in guiding our decisions (e.g., what should I buy), in shaping our opinions (e.g., who should I vote for), and in general in our view of the world.

Undoubtedly, although the various OIPs have helped us in managing and exploiting the available information, they have limited our information seeking abilities, rendering us overly dependent on them. We have come to accept such results as the “de facto” truth, immersed in the echo chambers and filter bubbles created by personalisation, rarely wondering whether the returned results represent a full range of viewpoints. Further, there are increasingly frequent reports of OIPs exhibiting some form of bias. In the recent US presidential elections, Google was accused of being biased against Donald Trump [L1] and Facebook of contributing to post-truth politics [L2]. Google search has been accused of being sexist or racist when returning images for queries such as “nurse” [L3]. Similar accusations have been made for Flickr, Airbnb and LinkedIn.

The problem has attracted some attention in the data management community [1, 2], and is also considered a high-priority problem for machine learning algorithms [3] and AI [L4]. Here we focus on the very first step, that of defining and measuring bias, by proposing a systematic approach for addressing the problem of bias in online information. According to the Oxford English Dictionary [L5], bias is “an inclination or prejudice for or against one person or group, especially in a way considered to be unfair”, and “a concentration on or interest in one particular area or subject”.

When it comes to bias in OIPs, we make the distinction between subject bias and object bias. Subject bias refers to bias towards the users that receive a result, and it appears when different users receive different content based on user attributes that should be protected, such as gender, race, ethnicity, or religion. On the other hand, object bias refers to biases in the content of the results for a topic (e.g., USA elections), that appears when a differentiating aspect of the topic (e.g., political party) is disproportionately represented in the results.

Let us now describe the architecture and the basic components of the proposed system, BiasMeter. We treat the OIP as a black box, accessed only through its provided interface, e.g., search queries. BiasMeter takes as input: (i) the user population U for which we want to measure the (subject) bias; (ii) the set P of the protected attributes of U; (iii) the topic T for which we want to test the (object) bias; and (iv) the set A of the differentiating aspects of the topic T. We assume that the protected attributes P of the user population and the differentiating aspects A of the topic are given as input (a more daunting task would be to infer these attributes).

Figure 1: System components of BiasMeter.



Given the topic T and the differentiating aspects A, the goal of the Query Generator (supported by a knowledge base) is to produce an appropriate set of queries to be fed to the OIP that best represent the topic and its aspects. The Profile Generator takes as input the user population U and the set of protected attributes P, and produces a set of user profiles appropriate for testing whether the OIP discriminates over users in U based on the protected attributes in P (e.g., gender). The Result Processing component takes as input the results from the OIP and applies machine learning and data mining algorithms such as topic modelling and opinion mining to determine the value of the differentiating aspects (e.g., if a result takes a positive stand). Central to the system is the Ground Truth module, but obtaining the ground truth is hard in practice. Finally, the Compute Bias component calculates the bias of the OIP based on the subject and object bias metrics and the ground truth.
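As a simple illustration of what the Compute Bias component could output for object bias (the metrics actually proposed are defined in [L6]), the sketch below scores how far the aspect distribution of the returned results deviates from a reference distribution.

# Illustrative sketch of one possible object-bias score (the metrics actually
# proposed by the authors are defined in [L6]): compare how often each
# differentiating aspect appears in the returned results against a reference
# (ground-truth) distribution, using total variation distance.
from collections import Counter

def aspect_distribution(result_aspects):
    """result_aspects: list with one detected aspect label per returned result."""
    counts = Counter(result_aspects)
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}

def object_bias(result_aspects, reference):
    """Total variation distance in [0, 1]; 0 = matches the reference exactly."""
    observed = aspect_distribution(result_aspects)
    aspects = set(observed) | set(reference)
    return 0.5 * sum(abs(observed.get(a, 0.0) - reference.get(a, 0.0))
                     for a in aspects)

# Hypothetical example: aspects detected in 10 results for a political topic,
# against a ground truth that considers both parties equally relevant.
results = ["party_A"] * 8 + ["party_B"] * 2
print(object_bias(results, {"party_A": 0.5, "party_B": 0.5}))   # -> 0.3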

Since bias is multifaceted, it might be difficult to quantify it. In [L6] we propose some subject and object bias metrics. Further, obtaining the ground truth and the user population are some of the most formidable tasks in measuring bias, since it is difficult to find objective evaluators and generate large samples of user accounts for the different protected attributes. Also, there are many engineering and technical challenges for the query generation and result processing components that involve knowledge representation, data integration, entity detection and resolution, sentiment detection, etc. Finally, it might be in the interest of governments to create legislation that provides access to sufficient data for measuring bias, since access to the internals of OIPs is not provided.

Links:

[L1] https://kwz.me/hLc
[L2] https://kwz.me/hLy
[L3] https://kwz.me/hLH
[L4] https://futureoflife.org/ai-principles/
[L5] https://en.oxforddictionaries.com/definition/bias
[L6] http://arxiv.org/abs/1704.05730

References:

[1] J. Stoyanovich, S. Abiteboul, and G. Miklau: “Data, responsibly: Fairness, neutrality and transparency in data analysis”, in EDBT, 2016.

[2] J. Kulshrestha, M. Eslami, J. Messias, M. B. Zafar, S. Ghosh, K. P. Gummadi, and K. Karahalios: “Quantifying search bias: Investigating sources of bias for political searches in social media”, in CSCW, 2017.

[3] M. B. Zafar, I. Valera, M. G. Rodriguez, and K. P. Gummadi: “Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment”, in WWW, 2017.

Please contact:

Irini Fundulaki, Foundation for Research and Technology – Hellas, Institute of Computer Science, Greece, [email protected]

The Approximate Average Common Submatrix for Computing the Image Similarity

by Alessia Amelio (Univ. of Calabria)

The similarity between images can be computed using a new method that compares image patches where a portion of pixels is omitted at regular intervals. The method is accurate and reduces execution time relative to conventional methods.

To date, researchers have not quite solved the problem of automatically computing the similarity between two images. This is mainly due to the difficulty of bridging the gap between human visual similarity and the similarity captured by the machine. In fact, two important objectives need to be fulfilled: (i) to find a reliable and accurate image descriptor that can capture the most important image characteristics, and (ii) to use a robust measure to evaluate similarity between the two images according to their descriptors. Usually a trade-off is needed between these two objectives and the execution time on the machine. An important challenge, therefore, is to achieve computation of an accurate descriptor and a robust measure whilst reducing execution time.

In this article, I present a new measure for computing the similarity between two images which is based on the comparison of image patches where a portion of pixels is omitted at regular intervals. This measure is called Approximate Average Common Submatrix (A-ACSM) [1] and it is an extension of the Average Common Submatrix (ACSM) measure [2]. The advantage of this similarity measure is that it does not need to extract complex descriptors from the images to be used for the comparison. On the contrary, an image is considered as a matrix, and the similarity is evaluated by measuring the average area of the largest sub-matrices which the two images have in common. The principle underlying this evaluation is that two images can be considered as similar if they share large patches representing image patterns. Two patches are considered as identical if they match in a portion of the pixels which are extracted at regular intervals along the rows and columns of the patches. Hence, the measure is an easy match between a portion of the pixels. This concept introduces an approximation, which is based on the “naive” consideration that two images do not need to exactly match in the intensity of every pixel to be considered as similar. This approximation makes the similarity measure more robust to noise, i.e., small variations in the pixels’ intensity due to errors in image generation, and considerably reduces its execution time when it is applied on large images, because a portion of the pixels does not need to be checked. Figure 1 shows a sample of a match between two image patches with an interval of two along the columns and one along the rows of the patches.
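The following Python fragment is a brute-force illustration of the idea, assuming images given as NumPy integer matrices; it is not the optimised algorithm of [1], and the anchor-sampling step is only there to keep the example short. It shows the two ingredients described above: patches are compared only on pixels taken every Δy rows and Δx columns, and the similarity is the average area of the largest common square sub-matrices.

import numpy as np

def patches_match(p1, p2, dx=2, dy=1):
    # Compare only the pixels sampled every dy rows and dx columns.
    return np.array_equal(p1[::dy, ::dx], p2[::dy, ::dx])

def largest_common_square(img1, img2, i, j, dx=2, dy=1):
    # Size of the largest square sub-matrix of img1 anchored at (i, j)
    # that has an (approximate) match somewhere inside img2.
    max_size = min(img1.shape[0] - i, img1.shape[1] - j)
    for size in range(max_size, 0, -1):          # try the largest square first
        patch = img1[i:i + size, j:j + size]
        for r in range(img2.shape[0] - size + 1):
            for c in range(img2.shape[1] - size + 1):
                if patches_match(patch, img2[r:r + size, c:c + size], dx, dy):
                    return size
    return 0

def a_acsm_similarity(img1, img2, step=4, dx=2, dy=1):
    # Average area of the largest common squares over a grid of anchor
    # positions of img1 (the full measure visits every position).
    areas = [largest_common_square(img1, img2, i, j, dx, dy) ** 2
             for i in range(0, img1.shape[0], step)
             for j in range(0, img1.shape[1], step)]
    return sum(areas) / len(areas)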

Figure 2 depicts the algorithm for computing the A-ACSM similarity measure.




Figure 3 shows how to find the largest square sub-matrix at a sample position (5,3) of the first image.

The A-ACSM similarity measure, as well as its corresponding dissimilarity measure, has been extensively tested on benchmark image databases and compared with the ACSM measure and other well-known measures in terms of accuracy and execution time. Results demonstrated that A-ACSM outperformed its competitors, obtaining higher accuracy in a lower execution time.

The project on the average common submatrix measures began in its early stages at the Georgia Institute of Technology, USA, and is currently in progress at DIMES, University of Calabria, Italy. Future work will extend the submatrix similarity measures with new features and will evaluate the similarity on different types of data, including documents, sensor data, and satellite images [3]. The recent developments in this direction are in collaboration with the Technical Faculty in Bor, University of Belgrade, Serbia.

References:

[1] A. Amelio: “Approximate Matching in ACSM dissimilarity measure”, Procedia Computer Science 96:1479-1488, 2016.

[2] A. Amelio, C. Pizzuti: “A patch-based measure for image dissimilarity”, Neurocomputing 171:362-378, 2016.

[3] A. Amelio, D. Brodić: “The ε-Average Common Submatrix: Approximate Searching in a Restricted Neighborhood”, IWCIA: 7-11 (short comm.), arXiv:1706.06026, 2017.

Please contact:

Alessia Amelio, DIMES, University of Calabria, Italy, [email protected]


Figure 3: The largest square sub-matrix at position (5,3) of the first image. All the sub-matrices of different size along the main diagonal at (5,3) are considered to match inside the second image. In this case, the sub-matrix of size 3 has no match inside the second image. Hence, the sub-matrix of size 2 is considered. Because it has a correspondence at position (3,2) in the second image, it is the largest square sub-matrix at position (5,3).

Figure 1: A demonstration sample of a match between two patches belonging to images 1 and 2. Each image has four colours which are also numbered from 1 to 4. An interval of Δx=2 and Δy=1 is set respectively along the columns and rows of the patches. Accordingly, the match is only verified between the elements in the red circles. The elements are selected as in a chessboard. In this case, the two image patches perfectly match because all the elements in the red circles correspond to one another.

Figure 2: Flowchart of the A-ACSM algorithm.



My-AHA: Stay Active, Detect Risks Early, Prevent Frailty!

by Nadine Sturm, Doris M. Bleier and Gerhard Chroust (Johanniter Österreich Ausbildung und Forschung gemeinnützige GmbH)

Aging of the population is a major concern in the European Union. “AHA – Active and Healthy Aging” is a multi-faceted and systemic approach to compensate for the diminishing capabilities of seniors. The project my-AHA aims to reduce frailty through targeted, personalised ICT-based interventions.

With the demographic changes that are occurring, particularly within the European Union, we need to consider how best to support the ageing population. The broad concept of active and healthy ageing (AHA) was proposed by the World Health Organisation (WHO) as an answer. Most seniors will by necessity become actively and/or passively involved in activities related to AHA. The goal of AHA can be roughly described as trying to compensate for the diminished capabilities of seniors related to age. It is a highly interdisciplinary challenge which involves the following areas:
• Physiological/medical (maintaining physical health);
• Psychological (preserving mental health and a positive attitude to the world and to ageing, avoiding the trauma of aging);
• Sociological (maintaining a good relationship between the individual and his/her immediate support environment, e.g., formal and informal caregivers, healthcare services and infrastructure);
• Technological (providing technological support on the physical and mental level, especially by ICT, sensors, actuators, etc.);
• Organisational (providing logistic support and infrastructure on various levels, including architectural considerations);
• Economic (defraying cost of treatment, supplying equipment, etc.).

The main goal of the European project “my-AHA” (My Active and Healthy Aging, European Union Horizon 2020 Project No 689592, 2016-2019) [L1] is to reduce the risk of frailty by improving physical activity and cognitive function, psychological state, social resources, nutrition and sleep, thereby contributing to overall well-being. An ICT-based platform, including a smartphone system, will detect defined personalised risks for frailty, be they physiological, psychological or social, early and accurately via non-stigmatising embedded sensors and data readily available in the daily living environment of older adults.

When a risk is detected, my-AHA will provide personalised targeted ICT-based interventions with scientific, evidence-based efficacy, including vetted offerings from established providers of medical and AHA support. These interventions will follow an integrated approach to motivate users to participate in exercises, cognitively stimulating games and social networking to achieve long-term behavioural change, sustained by continued end user engagement with my-AHA. This approach will also increase a senior’s competence with respect to his/her own frailties. It encourages long-term changes to a senior’s behaviour and as a result leads to a longer, more active, more healthy and thus more enjoyable life.

The project is coordinated by the Università degli Studi di Torino, Italy, and has 14 other partners from: Austria (one), Australia (one), Germany (four), Italy (one), Japan (two), Netherlands (one), Portugal (one), South Korea (one), Spain (one), and United Kingdom (one).

The Johanniter Austria Research and Education (Johanniter Österreich Ausbildung und Forschung gemeinnützige GmbH) [L2] provides considerable expertise in methodologies of empirical social science, especially with respect to user involvement, honouring ethical considerations and applying technological foresight. This input will be invaluable when it comes to developing targeted interventions and tests and when introducing new technologies and identifying the most effective methods to prevent frailty. Additionally, the Johanniter are setting up a Europe-wide testing environment with standardised parameters to support the project.

Links:

[L1] www.activeageing.unito.it/home
[L2] https://www.johanniter.at/dienste/forschung/

Please contact:

Georg Aumayr, Johanniter Österreich – Ausbildung und Forschung gemeinnützige GmbH, Austria, +43 1 470 7030, [email protected]


Smartphone for interventions.



A New Vision of the Cruise Ship Cabin

by Erina Ferro (ISTI-CNR)

The National Research Council of Italy (CNR) is involved in the E-Cabin project, which aims to create a futuristic and intelligent cabin in order both to improve the well-being and satisfaction of passengers and to minimize the on-board waste by using advanced consumption control.

The E-Cabin project focusses on the cabins of cruise ships, with the aim of creating a set of advanced technology solutions to enhance the travel experience both inside the cabin and throughout the entire ship, providing the ship owner with an additional monitoring system for each cabin. A cruise ship is now a concentration of technologies, equipment, communication and security systems that is hard to find in terrestrial spaces. However, the existing set of technologies is not enough to improve the experience of individual passengers, who require both a continuous connection with the outside world and personalised tools to best enjoy the opportunities provided by the ship system. At the same time, new technological solutions can also be proposed to the shipbuilding company, thus increasing the control and safety systems.

In order to create a futuristic and intelligent cabin, E-Cabin has two primary objectives, one for the passenger and one for the shipbuilding company. In the first case, the aim is to improve a passenger’s well-being and satisfaction by making his presence on board more enjoyable thanks to a set of innovative services. In the second case, the goal is to minimise the on-board waste thanks to a precise consumption control (for example light, heating/cooling, etc.) by means of a set of solutions that include:

1. To carry out a cabin monitoring system in order to understand consumption, to plan the maintenance work and to dynamically manage the ship’s resources. This will be obtained via a set of sensors installed in the cabin, whose battery life will possibly be extended through an energy harvesting system.

2. To understand and take advantage of the correlations and the interactions between the indexes that govern feelings of comfort so as to maximise the overall comfort of passengers. This information will be collected from heterogeneous devices: sensors and actuators installed in the cabin and personal user devices (smartphones, smartwatches and/or wearable sensors), in order to detect the correct correlation between the personal feeling of well-being and the environmental data (intensity of lights, noise, temperature, humidity, etc.).

3. To realise a set of applications that learn the habits of passengers, thus predicting their needs, increasing their opportunities to socialise, sharing contents through mobile social networking applications, and enriching their participation in the “ship world” by taking advantage of augmented reality information relevant to the cruise.

The global E-Cabin system is depicted in Figure 1. The real novelty of this approach is that all solutions will be integrated into a single development platform for the various applications, which will be primarily related to the quality of life in the cabin, but which will also interact with applications related to other activities carried out on the ship. Vertical solutions, which do not communicate with each other (typical of proprietary business solutions), will not be adopted. This way, all collected data will be related, thus generating additional knowledge. Just as well-being is the result of a combination of different sensations, physical factors, and environmental factors, applications developed in E-Cabin and applied technologies must also be able to combine and share data and information in order to customise solutions.

On-board energy consumption will be monitored and automatically reduced by implementing waste reduction policies based on strict waste control; this aspect is of particular interest to the ship owner. All developed components will be miniaturised and integrated into the entire E-Cabin system. The E-Cabin prototype can be fully engineered and integrated in time with new features and applications that can be matched with evolving needs and technology.

The experimentation will take place at Fincantieri in Trieste (Italy). This project involves the ISTI-CNR, IIT-CNR, ISTEC-CNR, ITIA-CNR and IEIIT-CNR institutes and the University of Trieste.

Please contact:

Erina Ferro, ISTI-CNR, [email protected]


Figure 1: The E-Cabin.



Trend Analysis of Underground Marketplaces

by Klaus Kieseberg, Peter Kieseberg and Edgar Weippl (SBA Research)

Underground marketplaces, which represent one of the most prominent examples of criminal activities in the Darknet, form their own economic ecosystems, often connected to cyber-attacks against critical infrastructures. Retrieving first-hand information on emerging trends in these ecosystems is thus of vital importance for law enforcement, as well as critical infrastructure protection.

The term “Darknet” generally describes “overlay networks” that are only accessible to a few exclusive users, but it is often used to describe parts of the internet that are sealed off, like underground marketplaces or closed peer-to-peer networks. These networks, and their potential links to criminal and terrorist activities, have recently gained public attention, which has highlighted the need for an efficient analysis of Darknets and similar networks.

We intend to study how these underground forums operate as a means for unobserved communication between like-minded individuals as well as a tool for the propagation of political propaganda and recruitment. We also focus heavily on Darknets used for trading illegal goods and especially services that could be used to attack government institutions and undermine national security. For example, in some of these networks it is possible to buy the services of bot networks for launching “Distributed Denial of Service” (DDoS) attacks against sensitive infrastructures like power distribution networks, as well as physical goods like drugs and arms. An efficient analysis of these underground marketplaces is therefore essential for the prevention of terrorist attacks and to stem the proliferation of digital weapons.

In the course of our research, which notably focused on trend analysis in underground marketplaces, the following three key issues emerged that require special attention:

1. Detection and analysis of data sources: In order to get a good database for subsequent analysis, a detailed source analysis and source detection regarding propaganda and illegal services, as well as an assessment of sources regarding their relevance concerning national security, is required. This also includes means for the undetected automation of data collection, in order to obtain undistorted information and not to compromise the information gathering process.

2. Privacy preserving analysis of data: Due to the strict privacy requirements for data processing put forth by the “General Data Protection Regulation” (GDPR) [1], new techniques in privacy preserving machine learning have to be invented. In addition, techniques for monitoring access to the crawled data as well as methods for manipulation detection need to be developed.

3. Studying the mode of operation of underground marketplaces: This includes the mechanisms of establishing first contact, pricing, payment and the transfer of goods, particularly goods that would require some sort of contact in the offline world [2]. A special interest for securing critical infrastructures lies in the analysis of trends, rather than individual behaviour, as this is more interesting from a strategic point of view and far less problematic with respect to sensitive information.

In addition to the technical problems, the analysis of this type of information also opens up many important legal questions. The new rules and regulations introduced by the GDPR (General Data Protection Regulation) and their national counterparts play an especially significant part, since they aim to ensure transparency of data processing procedures concerning personal information, as well as the possibility to delete data from data processing applications.

When it comes to analysing underground marketplaces regarding the trade of illegal goods or services, it is important to develop techniques that can be automated to a certain degree but still integrate a human component to detect and evaluate criminal activities. Furthermore, many underground forums have detection mechanisms in place to identify automated information gathering, as well as users with strange access behaviour that hints at them belonging to law enforcement [2]. Often, manual intervention in the form of messages with questions, as well as the analysis of access patterns, is utilised by the forum owners. Thus, methods must be developed that mask the information gathering process and emulate user behaviour. To maximise the level of automation achievable, it is convenient to focus on the detection and identification of trends instead of investigating every individual case in isolation.

A vital aspect of this research work is to ensure anonymity and to secure the privacy of innocent people. We therefore strive to develop techniques and methods that allow for an efficient analysis of underground marketplaces while providing ample protection to people not involved in illegal activities. The collection of data will be a selective process adhering to the “data minimisation principle” introduced by the GDPR. Instead of collecting and analysing all available data, an intelligent collection process is required, where data is first evaluated regarding its importance and selected accordingly. This also includes metadata, which is of low importance for trend analysis itself, but can contain important information, as well as useful links between individual information particles. This not only increases the performance of the data collection process, but also minimises the amount of data collected while providing additional insights into the overall ecosystem of underground markets.

Another important question is how privacy protection mechanisms can affect conventional machine learning techniques. Methods used to protect sensitive information, like anonymization of data via k-anonymity or deleting individual records from data sets, can alter the information derived from the data analysis. While these are key factors in choosing the right protection method, as well as a suitable security factor, these effects have not yet been studied thoroughly enough [3]. In the course of our work, we examine the effects of different protective measures on machine learning techniques and develop methods to mitigate them.
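As a small illustration of the kind of protection mechanism discussed here, the Python sketch below checks whether a set of crawled records is k-anonymous with respect to a chosen set of quasi-identifiers. The record structure and attribute values are invented for the example, and real k-anonymisation additionally requires generalising or suppressing values until such a check passes.

from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    # A dataset is k-anonymous w.r.t. the quasi-identifiers if every
    # combination of their values is shared by at least k records.
    groups = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

# Invented example: crawled posts reduced to coarse, non-identifying attributes.
posts = [
    {"region": "EU", "year": 2017, "category": "botnet"},
    {"region": "EU", "year": 2017, "category": "botnet"},
    {"region": "US", "year": 2017, "category": "drugs"},
]
print(is_k_anonymous(posts, ["region", "year"], k=2))  # False: the (US, 2017) group has one record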




Based on these results, we will develop means for controlling the introduced error by: (i) providing upper bounds for the effects of anonymization and deletion, as well as by (ii) developing new methods for analysing specific forms of anonymized sensitive information that introduce less distortion into the final result.

In conclusion, the analysis of underground marketplaces and other areas typically considered to constitute the Darknet requires additional research effort in order to deliver the information required for analysing trends and the workings of their market mechanisms. This led to the development of the “Darknet Analysis” project [L1], which is currently tackling the research questions outlined above. Furthermore, the results of our research will make an important contribution to the topic of privacy protection in law enforcement as a whole and thus have high re-use value for the involved governmental stakeholders.

Link:

[L1] http://www.kiras.at/gefoerderte-projekte/

References:

[1] Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation).

[2] H. Fallmann, G. Wondracek, C. Platzer: “Covertly Probing Underground Economy Marketplaces”, DIMVA, Vol. 10, 2010.

[3] B. Malle, P. Kieseberg, E. R. Weippl, A. Holzinger: “The right to be forgotten: towards machine learning on perturbed knowledge bases”, in International Conference on Availability, Reliability, and Security (pp. 251-266), Springer, 2016.

Please contact:

Peter Kieseberg, SBA Research, Vienna, [email protected]

RDF Visualiser: A Tool for Displaying and Browsing a High Density of RDF Data

by Nikos Minadakis, Kostas Petrakis, Korina Doerr and Martin Doerr (FORTH)

RDF Visualiser is a generic browsing mechanism that gives the user a flexible, highly configurable, detailed overview of an RDF dataset / database, designed and developed to overcome the drawbacks of the existing RDF data visualisation methods/tools. RDFV is currently used in EU research projects by the 3M editor of the X3ML suite of tools and has been tested with large datasets from the British Museum and the American Art Collaborative project.

RDF is widely used for data integration, transformation and aggregation of heterogeneous sources and applications of mappings between the source schemata. Although a great deal of emphasis has been placed on the validation of the produced RDF structure and format, the efficient visualisation of the constructed database contents that enables semantic validation by domain experts has largely been ignored and is achieved either by manual inspection of multiple files, by formulation and execution of complex SPARQL queries, or by custom user interfaces that work only with a particular RDF schema without intervention by a programmer.

The basic principles that an efficient and user-friendly RDF data visualisation tool should be able to live up to are:
1. The ability to display data of any schema and RDF format.
2. The ability to display all nodes of any class/instance.
3. The application of configuration rules to improve the layout or presentation for known classes and properties (e.g., hide URIs that are meaningless for the user).
4. The display of a high density of information in one screen (which is not possible in solutions based on “object templates”).

Although some visualisation solutions already exist and have been applied to successful projects [L1], [1], the existing approaches fail to fulfil the complete set of principles enumerated above. Specifically, approaches that display all the nodes of any class/instance and support any schema and format (principles 1 and 2) fail to display a high density of information in one screen, and those that succeed with regards to the latter fail in relation to the rest of the principles.

In order to meet the requirements laid out by the complete set of principles, our team designed and implemented the RDF Visualiser (RDFV), a generic browsing mechanism that gives the user a flexible, highly configurable, detailed overview of an RDF dataset / database.

RDFV presents RDF data as an indented list to handle the density and depth of information (principle 4), starting from a specified RDF resource (URI). In order to achieve this, all incoming links are inverted to display all the nodes of every class/instance (principle 2) in a schema-agnostic way (principle 1). Users are able to configure the display of schema-dependent information according to their preferences (principle 4). By editing an XML file, users are also able to define priority-based rule chains that define schema-dependent style and order of properties, which are inherited by subclasses and subproperties (principle 4). For anything not covered by a rule, default options are applied based on our experience and best practices. The user interface of the tool has been designed in cooperation with potential users, with a focus on usability and readability (Figure 1).
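As an illustration of the indented-list idea (not the actual RDFV implementation, which is a configurable web application driven by XML rule chains), the following Python sketch uses the rdflib library to print an RDF resource with its outgoing links and its inverted incoming links; the file name and starting URI are placeholders.

from rdflib import Graph, URIRef

def print_indented(graph, node, depth=0, max_depth=2, seen=None):
    # Print a resource as an indented list: outgoing links first, then
    # incoming links shown inverted ("is <property> of"), up to max_depth.
    seen = seen if seen is not None else set()
    if depth > max_depth or node in seen:
        return
    seen.add(node)
    indent = "  " * depth
    for p, o in graph.predicate_objects(node):            # outgoing links
        print(f"{indent}{p.n3(graph.namespace_manager)}: {o}")
        if isinstance(o, URIRef):
            print_indented(graph, o, depth + 1, max_depth, seen)
    for s, p in graph.subject_predicates(node):            # inverted incoming links
        print(f"{indent}is {p.n3(graph.namespace_manager)} of: {s}")

g = Graph()
g.parse("dataset.ttl", format="turtle")                    # placeholder file name
print_indented(g, URIRef("http://example.org/resource/start"))  # placeholder URI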

Moreover, rich functions are provided to the user to control the display of data items, such as identifying same instances, expanding/collapsing big texts or big sets of results, selecting the maximal depth of information, displaying images and image galleries, removing prefixes, selecting non-displayed URIs, retrieving the path of the sub-graph for a specific node, etc.

RDFV supports browsing of content in triple stores (tested successfully on Virtuoso and BlazeGraph) and in local files of every format. It has been integrated in the 3M interface of the X3ML suite of tools for data mapping and transformation [2]. Within the 3M interface, it provides an important validation tool for data mapped and transformed by domain experts who wish to check and correct the resulting outputs, enabling an iterative and collaborative evaluation of the resultant RDF. RDFV is currently exploited by a number of EU research projects, and has been tested with large RDF datasets from the British Museum and from the American Art Collaborative project.

In the future, RDFV will support geospatial display functions on maps, along with a new refined set of the existing functions with a focus on configurability and personalisation.

Our team will continue to work on the development, maintenance and user support of RDFV. The RDFV components, which consist of API microservices and a web application, are also available as independent blocks through GitHub (by September 2017) to assist developers with building their own applications.

Link:

[L1] http://www.researchspace.org/

References:

[1] N. Minadakis, et al.: “LifeWatch Greece Data-Services: On Supporting Metadata and Semantics Integration for the Biodiversity Domain”, in 13th International Congress on the Zoogeography and Ecology of Greece and Adjacent Regions (ICZEGAR’2015), Heraklion, Crete, October 2015.

[2] Y. Marketakis, et al.: “X3ML Mapping Framework for Information Integration in Cultural Heritage and beyond”, International Journal on Digital Libraries, pp. 1-19, Springer, DOI 10.1007/s00799-016-0179-1.

Please contact:

Minadakis Nikos, FORTH, [email protected]


Figure 1: RDFV user interface.



Services for Large Scale Semantic Integration of Data

by Michalis Mountantonakis and Yannis Tzitzikas (ICS-FORTH)

LODsyndesis is a new suite of services that helps the user to exploit the linked data cloud.

In recent years, there has been an international trend towards publishing open data and an attempt to comply with standards and good practices that make it easier to find, reuse and exploit open data. Linked Data is one such way of publishing structured data, and thousands of such datasets have already been published from various domains.

However, the semantic integration of data from these datasets at a large (global) scale has not yet been achieved, and this is perhaps one of the biggest challenges of computing today. As an example, suppose we would like to find and examine all digitally available data about Aristotle in the world of Linked Data. Even if one starts from DBpedia (the database derived by analysing Wikipedia), specifically from the URI “http://dbpedia.org/resource/Aristotle”, it is not possible to retrieve all the available data because we should first find ALL equivalent URIs that are used to refer to Aristotle. In the world of Linked Data, equivalence is expressed with “owl:sameAs” relationships. However, since this relation is transitive, one should be aware of the contents of all LOD datasets (of which there are currently thousands) in order to compute the transitive closure of “owl:sameAs”, otherwise we would fail to find all equivalent URIs. Consequently, in order to find all URIs about Aristotle, which in turn would be the lever for retrieving all data about Aristotle, we have to index and enrich numerous datasets through cross-dataset inference.


Figure 1: The whole process of LODsyndesis.
Figure 2: Services offered by LODsyndesis.


The Information Systems Laboratory of the Institute of Computer Science of FORTH designs and develops innovative indexes, algorithms and tools to assist the process of semantic integration of data at a large scale. This endeavour started two years ago, and the current suite of services and tools that have been developed is known as “LODsyndesis” [L1]. As shown in Figure 1, the distinctive characteristic of LODsyndesis is that it indexes the contents of all datasets in the Linked Open Data cloud [1]. It then exploits the contents of datasets to measure the commonalities among the datasets. The results of measurements are published on the web and can be used to perform several tasks. Indeed, “global scale” indexing can be used to achieve various outcomes, and an overview of the available LODsyndesis-based services for these tasks can be seen in Figure 2.

First, it is useful for dataset discovery, since it enables content-based dataset discovery services, e.g., it allows answering queries of the form: “find the K datasets that are more connected to a particular dataset”. Another task is object co-reference, i.e., how to obtain complete information about one particular entity (identified by a URI) or set of entities, including provenance information. LODsyndesis can also help with assessing the quality and veracity of data, since the collection of all information about an entity, and the cross-dataset inference that can be achieved, allows contradictions to be identified and provides information for data cleaning or for estimating and suggesting which data are probably correct or most accurate. Last but not least, these services can help to enrich datasets with more features that can obtain better predictions in machine learning tasks [2], and the related measurements can be exploited to provide more informative visualisation and monitoring services.

The realisation of the services of LODsyndesis is challenging because they presuppose knowledge of all datasets. Moreover, computing the transitive closure of all “owl:sameAs” relationships is challenging since most of the algorithms for doing this require a lot of memory. To tackle these problems LODsyndesis is based on innovative indexes and algorithms [1] appropriate for the needs of the desired services. The current version of LODsyndesis indexes 1.8 billion triples from 302 datasets, and it contains measurements of the number of common entities among any combination of datasets (e.g., how many common entities exist in a triad of datasets), while the performed measurements are also available in DataHub [L2] for direct use. It is worth noting that currently only 38.2% of pairs of datasets are connected (i.e., they share at least one common entity) and only 2% of entities occur in three or more datasets. The aforementioned measurements reveal the sparsity of the current datasets of the LOD Cloud and justify the need for services for assisting integration at large scale. The method is efficient: the time for constructing the indexes and performing all these measurements is only 22 minutes using a cluster of 64 computers.
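The owl:sameAs closure itself amounts to computing equivalence classes of URIs, which the following in-memory Python sketch illustrates with a union-find structure. The example URIs are illustrative (only the DBpedia one is taken from the text), and LODsyndesis uses specialised, memory-efficient indexes and algorithms [1] rather than this naive approach.

# Union-find over owl:sameAs statements; the URIs below are illustrative.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]      # path halving
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

same_as = [
    ("http://dbpedia.org/resource/Aristotle", "http://www.wikidata.org/entity/Q868"),
    ("http://www.wikidata.org/entity/Q868", "http://yago-knowledge.org/resource/Aristotle"),
]
for a, b in same_as:
    union(a, b)

# Group all URIs by their representative: each class is one real-world entity,
# so a query for any one URI can be expanded to the whole class.
classes = {}
for uri in list(parent):
    classes.setdefault(find(uri), set()).add(uri)
print(classes)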

Over the next two years we plan to improve and extend this suite of services. Specifically, we plan to advance the data discovery services and to design services for estimating the veracity and trust of data.

This project is funded by FORTH and it has been supported by the European Union Horizon 2020 research and innovation programme under the BlueBRIDGE project (Grant agreement No 675680).

Links:

[L1] http://www.ics.forth.gr/isl/LODsyndesis/
[L2] https://datahub.io/dataset/connectivity-of-lod-datasets

References:

[1] M. Mountantonakis and Y. Tzitzikas: “On measuring the lattice of commonalities among several linked datasets”, Proceedings of the VLDB Endowment, vol. 9, no. 12, pp. 1101-1112, 2016.

[2] M. Mountantonakis and Y. Tzitzikas: “How Linked Data can aid Machine Learning-based Tasks”, in International Conference on Theory and Practice of Digital Libraries, 2017.

Please contact:

Yannis Tzitzikas, FORTH-ICS and University of Crete, +30 2810 [email protected]



On the Interplay between Physical and Digital World Accessibility

by Christophe Ponsard, Jean Vanderdonckt and Lyse Saintjean (CETIC)

Many means of navigating our physical world are now available electronically: maps, streets, schedules, points of interest, etc. While this “digital clone” is continually expanding both in completeness and accessibility, an interesting interplay scenario arises between physical and digital worlds that can benefit all of us, especially the mobility-impaired.

Physical world accessibility is about ensuring that features of physical places (e.g., shops, tourist attractions, offices) are designed to accommodate the (dis)abilities of people visiting them in order to ensure optimal access for all, including the estimated 15% of the population living with some kind of impairment. Our world is undergoing digitisation at an ever-increasing rate. Online maps are becoming so precise that the inner structure of buildings is often captured, tours are available, while locations are ranked by users, etc. As a consequence, many new opportunities have arisen for breaking accessibility barriers and better combining the physical and digital aspects of accessibility. For example, the use of the available online information can be complemented with physical measures to help assess the accessibility of buildings and modes of transport. Combining digital maps, crowd-sourced information, open data and expert assessments can help produce up-to-date accessibility data at a larger scale and reduced cost.

Physical accessibility requirements are now well defined and documented, such as in the ISO 21542 standard. Assessing physical accessibility is not straightforward because it requires the identification of physical obstacles to accessibility relevant to specific kinds of impairment, e.g., a doorstep for a wheelchair, the presence or not of braille on lift buttons for blind people, the printing of light labels for the vision-impaired. Experts can reliably evaluate this accessibility by applying a well-defined measurement procedure and web-oriented reporting like Access-I [L1]. However, the limited number of trained experts greatly restricts the ability to carry out such assessments at a large scale and keep the information up-to-date. On the other hand, dedicated social media apps like jaccede.com or wheelchair.org enable a crowdsourcing approach through the collection of partial information reported by different users, each with her own point-of-view, a method which may, however, lead to inconsistency. Both approaches are naturally complementary and can efficiently be combined through the use of open linked data techniques [2].

Although information and communication technologies are a clear enabler for physical accessibility, they also raise new obstacles in that the end-users then need to consult a computer or mobile interface to make sure that a place is physically accessible. The Web Accessibility Initiative (WAI) [L2] provides specific guidelines, such as the Web Content Accessibility Guidelines (WCAG), to help web developers make websites accessible, while dynamic content is addressed through Accessible Rich Internet Applications (ARIA). Many useful evaluation and repair tools have also been proposed, including for tailored and optimised usability and accessibility evaluation [3]. Many organisations propose practical labels that are based on WCAG, like AnySurfer in Belgium [L3].

Figure 1 graphically depicts the interplay between physical and digital accessibility, thus raising interesting questions about the current level of awareness about each kind of accessibility, the possible correlation between them, and what kind of synergies can further enhance user experience. In order to answer these questions, a preliminary study was carried out in Belgium by combining data from Access-I (for physical accessibility) and AnySurfer (for digital accessibility). Some methodological alignment was performed to define comparable notions of “unacceptable”, “satisfactory” and “good” levels of accessibility for physical, sensory and mental impairments. The survey was conducted on places already assessed (so with some bias). Figure 1 shows a satisfactory level of physical accessibility in 60% of cases, of digital accessibility in 70% of cases and of both in 33% of cases. The generally better level of digital accessibility, compared with physical accessibility, might be explained by the increased importance of digital presence and the lower effort required to achieve digital accessibility. Indeed, a website is easier and less expensive to repurpose than a building. The fact that only one third of the places studied currently achieve both dimensions means that these aspects are still largely being considered separately.

The current scope of both physical and digital accessibility assessments should also be reviewed: physical accessibility is not limited to building boundaries but starts when the intention to travel to a place arises. This scenario, eventually leading to the use of some service or equipment inside a building, relies on using a computer and/or a mobile interface like a smartphone or tablet to query the web about transportation, opening hours, and user-specific accessibility. Physical and digital accessibilities then closely intertwine, with retroaction loops illustrated in Figure 1.

In addition to deepening our survey on the relations between physical and digital accessibility, we are also currently investigating how to effectively support accessibility using a digital mobile companion exploiting aggregated open accessibility data [2].

Links:

[L1] http://www.access-i.be
[L2] https://www.w3.org/WAI
[L3] http://www.anysurfer.be/en

References:

[1] C. Ponsard, V. Snoeck: “Unlocking Physical World Accessibility through ICT: A SWOT Analysis”, Proc. of ICCHP’2014.

[2] C. Ponsard et al.: “A Mobile Travel Companion Based on Open Accessibility Data”, Proc. of ICCHP’2016.

[3] J. Vanderdonckt, A. Beirekdar: “Automated Web Evaluation by Guideline Review”, Journal of Web Engineering, Vol. 4, No. 2, 2005, pp. 102-117.

Please contact:

Christophe Ponsard, CETIC, Belgium, +32 472 56 90 99, [email protected]

Figure 1: The interplay between physical and digital accessibilities.



Highly Sensitive Biomarker Analysis Using On-Chip Electrokinetics

by Xander F. van Kooten, Federico Paratore (Israel Institute of Technology and IBM Research – Zurich), Moran Bercovici (Israel Institute of Technology) and Govind V. Kaigala (IBM Research – Zurich)

Highly sensitive and specific biochemical assays can provide accurate information about the physiological state of an individual. Leveraging microtechnology, we are developing miniaturised analytical tools for precise and fast biomolecular analysis. This work is being performed within the scope of the EU-funded project ‘Virtual Vials’ between IBM Research – Zurich and Technion – Israel Institute of Technology in Haifa.

In vitro diagnostics is a rapidly growing field and market that encompasses clinical laboratory tests and point-of-care devices, as well as basic research aimed at understanding the origin of diseases and developing improved testing methods and treatments. The global market for in vitro diagnostics is projected to grow to $75 billion by 2020. The development of novel tools that enable fast and sensitive analysis of biological samples with high throughput is essential for enabling discoveries and providing better care for patients.

In the past two decades, in vitro studies have shifted rapidly from classical tools to microfluidic devices (often also referred to as “lab-on-a-chip”). Microfluidic devices enable new capabilities and higher efficiencies in the analysis and control of biological liquid samples, as many physical phenomena scale favourably as size is reduced. Effects such as diffusion and surface tension dominate the behaviour of fluids and solutes at micrometre length scales, while inertia and body forces such as gravity are negligible.

One of the key applications of in vitro diagnostics is the detection of biochemical species, such as proteins or nucleic acids, that indicate a physiological condition or the presence of a disease. In many cases, the relevant analytes are present at such low concentrations that they cannot be detected using conventional methods. To overcome this, we make use of a specific electrokinetic separation and focusing technique called isotachophoresis (ITP). Isotachophoresis is a form of electrophoresis that is almost a century old, but has attracted renewed interest in the past decade, largely thanks to the strong growth of microfluidics research.

ITP uses a discontinuous buffer system to separate and concentrate target analytes based on differences in their electrophoretic mobility. As illustrated in Figure 1, the discontinuous buffer system consists of a leading (LE) and a terminating (TE) electrolyte, which respectively have a higher and a lower electrophoretic mobility than the analyte(s) of interest. When an electric field is applied, the analytes are focused at the continuously moving interface between the LE and TE. In this way, a “virtual vial” of several hundred picolitres is created, in which the concentration of analytes is enhanced by many orders of magnitude.

We focus on sensitive and rapid detection of proteins, which is key for early diagnosis of a large number of medical conditions. Protein analysis is commonly performed using surface immunoassays, which use antibodies immobilised on a surface to capture target proteins from a liquid sample. The number of captured proteins is converted into a readable signal by a variety of methods, such as electrochemical reactions, fluorescence or surface plasmon resonance. However, the performance of all these detection methods is limited by the kinetics of the protein-antibody reaction, because the reaction time is inversely proportional to the concentration of proteins.

We use ITP to accelerate the surface reaction more than 1000-fold, so that a reaction that would require more than 20 hours under simple mixing can now be completed in under one minute. Figure 2a shows fluorescence images of the ITP-based immunoassay: at (t1) a fluorescent protein accumulates at the moving ITP interface; at (t2) the focused proteins reach a 100 µm long reaction site, where the accelerated reaction takes place; and at (t3) the proteins are carried away by ITP, leaving the reacted surface ready for detection. Combined with standard detection molecules available on the market, the technique can increase the sensitivity of existing immunoassays 1,300 times.
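A quick back-of-the-envelope calculation shows how the figures quoted above fit together: a 20-hour reaction accelerated by a factor of at least 1,000 lands at roughly a minute, and any speed-up somewhat above that bound brings it under a minute.

# Order-of-magnitude check of the numbers quoted in the text.
reaction_time_mixing_h = 20        # hours needed under simple mixing
acceleration_factor = 1000         # lower bound of the reported ITP speed-up
accelerated_s = reaction_time_mixing_h * 3600 / acceleration_factor
print(f"{accelerated_s:.0f} s, i.e. about {accelerated_s / 60:.1f} min")  # 72 s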

While ITP has proved to be a useful technique for accelerating reaction kinetics, the gains of this approach are ultimately limited by the volume of sample from which analytes are collected. In conventional microfluidic channels, ITP concentrates analytes from a sample volume of just a few hundred nanolitres.


Figure 1: Schematic of ITP as a tool for in vitro diagnostics. Biochemical analytes from patient samples can be focused using ITP to increase their concentration locally. ITP uses a discontinuous buffer system to focus analytes at the interface between two electrolytes under an applied voltage V. This facilitates their detection and can lead to a faster and more sensitive diagnosis.



At low concentrations, the presence of target molecules in the volume sampled by ITP becomes probabilistic, and multiple parallel tests are required to avoid false negatives.

As a step towards large-volume sample processing, we developed a chip that enables ITP focusing of biological analytes from 50 μL sample volumes in less than 10 minutes. As all analytes from this volume are focused into an ITP interface comprising just half a nanolitre, the final concentration factor is more than 100,000, which is one hundred times more than ITP focusing in conventional microfluidic chips.

The key to processing larger sample volumes lies in combining a wide-channel region (with a large internal volume) and a tapering channel. However, this gradual geometry change and the concomitant Joule heating resulting from local differences in the current density lead to undesired dispersion of the sample plug, which is detrimental to the concentration of the focused analyte and ultimately compromises detection downstream. Fortunately, this undesired dispersion can be mitigated by using a “recovery chamber” at the end of the chip. This makes the design highly scalable, and the capability to process even larger sample volumes on passively cooled chips is limited only by the Joule heating in the channels. Therefore, if the chips are actively cooled, sample volumes of hundreds of microlitres may be processed in this way.

We believe these results are of fundamental significance for the in vitro diagnostics and research community, as they suggest the use of assays for improving the sensitivity and speed of molecular analysis, and for processing large volumes of samples on devices with microfluidic elements.


Figure 2: Two applications of ITP for in-vitro diagnostics we pursue. (a) In an ITP-based immunoassay, proteins are focused and delivered to a reaction site, where the enhanced concentration drives an accelerated reaction. (b) A device for ITP focusing from large sample volumes. The inset shows bacteria focused from an initial concentration of 100 bacteria/mL.

References:

[1] F. Paratore, et al.: “Isotachophoresis-based surface immunoassay”, Analytical Chemistry, 2017, 89 (14), pp. 7373–7381.

[2] X. F. van Kooten, M. Truman-Rosentsvit, G. V. Kaigala, and M. Bercovici: “Focusing analytes from 50 μL into 500 pL: On-chip focusing from large sample volumes using isotachophoresis”, Scientific Reports 7: 10467, 2017.

Please contact:
Moran Bercovici, Israel Institute of Technology, [email protected]

Govind V. Kaigala, IBM Research – Zurich, [email protected]

[In-figure annotations for Figure 2(a): t1: Protein delivery (ITP brings the focused proteins to the reaction site); t2: Accelerated reaction (proteins quickly react with the antibodies on the surface); t3: Detection step (the surface is ready for detection). Other labels: TE, LE, applied voltage V, reaction site, electromigration, 100 µm.]



Events

Event Report

A Conference for Advanced Students: IFIP TC6’s AIMS

by Harry Rudin

IFIP’s AIMS 2017, the 11th IFIP International Conference on Autonomous Infrastructure, Management and Security, took place at the University of Zurich from July 10-13, 2017.

The idea driving the conference is special: AIMS is a single-track event targeted at junior researchers and PhD students in network and service management and security. It features a range of sessions including conference paper presentations, hands-on lab courses, and keynotes. The objective of AIMS is to offer junior researchers and PhD students a dedicated place where they can discuss their research work and experience, receive constructive feedback from senior scientists, and benefit from a number of practical hands-on sessions and labs on emerging technologies. By putting the focus on junior researchers and PhD students, AIMS acts as a complementary event in the set of international conferences in the network and service management community, providing an optimal environment for in-depth discussions and networking.

As with most IFIP TC6 conferences, the proceedings are available, free of cost, in the IFIP TC6 Open Digital Library: http://dl.ifip.org/db/conf/aims/aims2017/index.html

The proceedings, as a hard copy version, are also available in the Springer series Lecture Notes in Computer Science, Vol. 10356.

AIMS 2018 will take place in Munich in October 2018.

Link:

http://www.aims-conference.org/2017/

Call for Proposals

Dagstuhl Seminars and Perspectives Workshops

Schloss Dagstuhl – Leibniz-Zentrum für Informatik (LZI) is accepting proposals for scientific seminars/workshops in all areas of computer science, in particular also in connection with other fields.

If accepted, the event will be hosted in the seclusion of Dagstuhl’s well-known, own, dedicated facilities in Wadern on the western fringe of Germany. Moreover, the Dagstuhl office will assume most of the organisational/administrative work, and the Dagstuhl scientific staff will support the organizers in preparing, running, and documenting the event. Due to subsidies, the costs are very low for participants.

Dagstuhl events are typically proposed by a group of three to four outstanding researchers of different affiliations. This organizer team should represent a range of research communities and reflect Dagstuhl’s international orientation. More information, in particular details about event form and setup as well as the proposal form and the proposing process, can be found on

http://www.dagstuhl.de/dsproposal.

Schloss Dagstuhl – Leibniz-Zentrum für Informatik is funded by the German federal and state government. It pursues a mission of furthering world-class research in computer science by facilitating communication and interaction between researchers.

Important Dates
• Proposal submission: October 15 to November 1, 2017
• Notification: End of January 2018
• Seminar dates: Between mid 2018 and mid 2019

Call for Participation

W3C Workshop on WebVR Authoring: Opportunities and Challenges

Brussels, 5-7 December 2017

The primary goal of the workshop is to bring together WebVR stakeholders to identify unexploited opportunities as well as technical gaps in WebVR authoring. The workshop is targeted at people with experience of authoring WebVR content and people involved in the evolution of WebVR as a technology.

Topics
Participants will share good practices and novel techniques in creating WebVR-based content, and participate in breakout sessions and working discussions covering topics such as:
• Landscape of WebVR authoring tools;
• Creating and packaging 3D assets for WebVR;
• Managing assets for practical progressive enhancement;
• Progressive enhancement applied to the variety of user input in WebVR;
• Understanding and documenting WebVR constraints for 3D artists;
• Optimizing delivery of 360° videos to VR headsets on the Web;
• Practical approaches to building accessible WebVR experiences;
• Mapping the impact of ongoing evolutions of the Web Platform (WebAssembly, WebGPU, streams) on WebVR authoring;
• Impact of performance factors on authoring WebVR content;
• Creating convergence on WebVR advocacy platforms.

People interested in participation can express interest in attending the workshop or suggest a presentation through the web site. Attendance is free for all invited participants and open to the public, whether or not they are W3C members.

More information:

https://kwz.me/hLg


In brief


SICS becomes RISE SICS

The three Swedish institutes Innventia, SP and Swedish ICT, of which SICS was a part, have merged in order to become a stronger research and innovation partner for businesses and society: RISE Research Institutes of Sweden.

The vision of RISE is to be an internationally leading partner for innovation, offering applied research and development as well as industrialization and verification.

The business and innovation areas of RISE are:
• Mobility
• Energy and Bio-based Economy
• Life Science
• Digitalization
• Sustainable Cities and Communities

The ERCIM member institute SICS is changing its name to RISE SICS.

Awards for breaking SHA-1 Security Standard in Practice

Researchers at CWI and Google won the Pwnie Award for Best Cryptographic Attack 2017 for being the first to break the SHA-1 internet security standard in practice in February. They received the prize on 26 July, during the Black Hat USA security conference in Las Vegas. The team consisted of Marc Stevens (CWI), Elie Bursztein (Google), Pierre Karpman (CWI), Ange Albertini and Yarik Markov (Google). The prize is a recognition for the most impactful cryptographic attack against real-world systems, protocols or algorithms, and the winners are selected by security industry professionals. The nomination read: “The SHAttered attack team generated the first known collision for full SHA-1. (…) A practical collision like this, moves folks still relying on a deprecated protocol to action.”

The research team also won the CRYPTO 2017 Best Paper Award on 22 August, during the CRYPTO 2017 conference in Santa Barbara. Although SHA-1 is deprecated, it is still used for digital signatures and file integrity verification, securing credit card transactions, electronic documents, Git open-source software repositories and software distribution. On 17 August, Stevens and Daniel Shumow (Microsoft Research) presented an improved real-time SHA-1 collision detection technique at the USENIX Security conference in Vancouver, which is now used by default in Git, GitHub, Gmail, Google Drive, and Microsoft OneDrive.

More information: https://shattered.io.

The CRYPTO 2017 Best Paper Award winning team (photos: Marc Stevens).

Marc Stevens at the CRYPTO 2017 Best Paper Award lecture.

W3C brings Web Developer Skills to the European Workforce

Since its creation in March 2015, W3C’s e-learning platform W3Cx has trained more than half a million people from all over the world in Web design and development. By following W3Cx online courses, they have increased their digital skills, which can lead to an exciting career in an in-demand and fast-growing field.

In response to this demand for Web development training, W3C has recently launched a “Front-End Web Developer” (FEWD) Professional Certificate on edX.org, which consists of a suite of five W3Cx courses on the three foundational languages that power the Web: HTML5, CSS and JavaScript. Some of these courses have been developed in partnership with Microsoft, Intel and University Côte d’Azur.

The success of the W3Cx training programs underscores W3C’s commitment to creating high-quality courses that help learners develop the critical skills and actionable knowledge needed for today’s top jobs, and the proficiency and expertise that employers are looking for.

W3C FEWD: https://www.edx.org/professional-certificate/front-end-web-developer-9


ERCIM is the European Host of the World Wide Web Consortium.

Institut National de Recherche en Informatique et en Automatique
B.P. 105, F-78153 Le Chesnay, France
http://www.inria.fr/

VTT Technical Research Centre of Finland Ltd
PO Box 1000
FIN-02044 VTT, Finland
http://www.vttresearch.com

SBA Research gGmbH
Favoritenstraße 16, 1040 Wien
http://www.sba-research.org/

Norwegian University of Science and Technology
Faculty of Information Technology, Mathematics and Electrical Engineering
N-7491 Trondheim, Norway
http://www.ntnu.no/

University of Warsaw
Faculty of Mathematics, Informatics and Mechanics
Banacha 2, 02-097 Warsaw, Poland
http://www.mimuw.edu.pl/

Consiglio Nazionale delle Ricerche
Area della Ricerca CNR di Pisa
Via G. Moruzzi 1, 56124 Pisa, Italy
http://www.iit.cnr.it/

Centrum Wiskunde & Informatica
Science Park 123, NL-1098 XG Amsterdam, The Netherlands
http://www.cwi.nl/

Foundation for Research and Technology – Hellas (FORTH)
Institute of Computer Science
P.O. Box 1385, GR-71110 Heraklion, Crete, Greece
http://www.ics.forth.gr/

Fonds National de la Recherche
6, rue Antoine de Saint-Exupéry, B.P. 1777
L-1017 Luxembourg-Kirchberg
http://www.fnr.lu/

Fraunhofer ICT Group
Anna-Louisa-Karsch-Str. 2
10178 Berlin, Germany
http://www.iuk.fraunhofer.de/

RISE SICS
Box 1263, SE-164 29 Kista, Sweden
http://www.sics.se/

Magyar Tudományos Akadémia
Számítástechnikai és Automatizálási Kutató Intézet
P.O. Box 63, H-1518 Budapest, Hungary
http://www.sztaki.hu/

University of Cyprus
P.O. Box 20537
1678 Nicosia, Cyprus
http://www.cs.ucy.ac.cy/

Subscribe to ERCIM News and order back copies at http://ercim-news.ercim.eu/

ERCIM – the European Research Consortium for Informatics and Mathematics – is an organisation dedicated to the advancement of European research and development in information technology and applied mathematics. Its member institutions aim to foster collaborative work within the European research community and to increase co-operation with European industry.

INESC
c/o INESC Porto, Campus da FEUP, Rua Dr. Roberto Frias, nº 378, 4200-465 Porto, Portugal

I.S.I. – Industrial Systems Institute
Patras Science Park building
Platani, Patras, Greece, GR-26504
http://www.isi.gr/

Special Theme: Digital Humanities