


From containers to content to context

The changing role of libraries in eScience and eScholarship

Stefan Gradmann
University Library, KU Leuven, Leuven, Belgium

Abstract

Purpose – The aim of this paper is to reposition the research library in the context of the changing information and knowledge architecture at the end of the “Gutenberg Parenthesis” and as part of the rapidly emerging “semantic” environment of the Linked Open Data paradigm. Understanding this process requires a good understanding of the evolution of the “document” notion in the passage from print based culture to the distributed hypertextual and RDF based information architecture of the WWW.

Design/methodology/approach – These objectives are reached using literature study and a descriptive historical approach as well as text mining techniques using Google nGrams as a data source.

Findings – The paper presents a proposal for effectively repositioning research libraries in the context of eScience and eScholarship as well as clear indications of the proposed repositioning already taking place. Furthermore, a new perspective of the “document” notion is provided.

Practical implications – The evolution described in the contribution creates opportunities for libraries to reposition themselves as aggregators and selectors of content and as contextualising agents as part of future Linked Data based scholarly research environments, provided they are able and ready to operate the related cultural changes.

Originality/value – The paper will be useful for practitioners in search of strategic guidance for repositioning their librarian institutions in a context of ever increasing competition for scarce funding resources.

Keywords Libraries, Linked data, Semantic Web, Metadata, Document notion, Gutenberg Parenthesis, Semantic publishing, eResearch

Paper type Conceptual paper

1. The poet, the library and the scriptorium
The objective of the present contribution is to explore the way the role of libraries might currently be changing profoundly in the context of a technical paradigm shift that involves all players in the process of scholarly knowledge generation, knowledge management and knowledge use: it affects researchers, authors and consumers of documents alike, as well as libraries as the managers of such documents. It is thus crucial to understand well how “documents” as well as scholarly and scientific communication have evolved over time.

The current issue and full text archive of this journal is available at www.emeraldinsight.com/0022-0418.htm

This paper is in large parts based on the author’s keynote at the 10th International Bielefeld Conference, delivered on April 24, 2012.


Received 2 January 2013
Revised 14 March 2013
Accepted 15 March 2013

Journal of Documentation
Vol. 70 No. 2, 2014
pp. 241-260
© Emerald Group Publishing Limited
0022-0418
DOI 10.1108/JD-05-2013-0058


Both (The Gutenberg Parenthesis Research Forum, 2010) and (Sauerberg, 2009) have it that before the advent of print our cognitive modes of information processing (and as a consequence of scholarly activity, too) were dominated by primary orality. Their claim is that we are currently evolving into a phase of secondary orality in the turn from what (McLuhan, 1962) had called a Gutenberg-Galaxy (in which print was the dominant medium) to what (Coy, 1994) has called the Turing-Galaxy (and which is dominated by information technology). According to Sauerberg this “secondary orality” manifests itself mainly in the interactive settings of the “social Web” (Web 2.0), with services such as Twitter or Facebook that are closer to oral interaction than written communication based on printed documents.

I do not completely agree with this claim regarding secondary orality – and be it only because (Derrida, 1967) has shown to what extent the “pharmakon” of text and writing has been much more than mere transcription of spoken words (as the logocentric and phonocentric position would have it), and that in all its ambivalence this “remedy” or “poison” (both are possible translations of the Greek word pharmakon) of writing has determined to a large degree our intellectual evolution. In fact, print was introduced roughly 500 years ago in a culture which by that moment had already been fundamentally shaped by scripture and textuality for close to 2,000 years (and at the latest since Plato’s critique of writing in the Phaidros dialogue to which Derrida is referring).

Still, something fundamental did change with the invention of print, and this change fundamentally affected libraries. Looking at the library of Alexandria (and thus long before the parenthesis opened) might help us to understand the nature of this change: we know quite a few things about its holdings, we know little about what the building of the library actually looked like – but we know quite a lot about the head librarians of Alexandria and the way they interacted with their collections. Librarians such as Zenodotus, Callimachus and Eratosthenes were poets and/or scholars as much as librarians. Scholarly activity and content generation were part of the library’s identity. And this continued to be the case until shortly before the invention of print: in the ninth century, the plan for an “ideal” monastery in St Gall (www.stgallplan.org/recto.html) still is only at first sight obsessed with beer (with its three brewing houses). The centre of this plan again is the unity of the library and the scriptorium.

2. The Gutenberg Parenthesis opens . . .
This changed with the advent of print: the disruption happened not so much in terms of the medium of scholarship as in a disconnect between the roles of content creation on the one hand and of management and publishing of the containers of scholarly content on the other hand. As I have tried to show elsewhere in the past[1], the scholarly sequential production and consumption workflow including this division of roles (authoring vs publishing vs management) has remained remarkably stable in the print paradigm over the centuries, and has even remained in place in the starting phase of passing over to digital media, where functionalities of the traditional scholarly lifecycle were mostly just emulated in the digital environment.

For the libraries, this disconnect of roles caused by the advent of printed books had important consequences: from institutions integrating the generation and reproduction of knowledge with their management, they transformed into mere managers of content containers, which in turn were referenced by metadata records containing pointers to these containers, as shown in Figure 1.

Metadata pertaining to the library collection and its objects are hitherto organised as catalogues, and the user has to go through this mediating catalogue layer with its pointers to the containers – be it in a traditional library with books or in a “digital library” containing book-analogue or digital born objects, or even in web based portal sites such as the main Europeana portal[2]. The abstract model of mediating access to information objects via catalogues and of having mediating links as pointers from metadata to objects remains the same in all of these cases, with the catalogues being lists of items in the library’s collection.
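The catalogue model just described, metadata records mediating access via pointers to the containers they reference, can be sketched in a few lines. The record identifiers and fields below are invented for illustration; the point is that the record describes and points at a container, not its content.

```python
# A minimal sketch of the catalogue-as-mediating-layer model: metadata
# records hold container attributes plus a pointer (shelfmark or URL),
# and the user reaches the container only through this layer.
# All identifiers and field values here are illustrative.

catalogue = {
    "rec-001": {
        "title": "Man Without Qualities",
        "creator": "Robert Musil",
        "pointer": "shelfmark PT2625.U8 M3",  # a location, not the content
    },
    "rec-002": {
        "title": "Information Management: A Proposal",
        "creator": "Tim Berners-Lee",
        "pointer": "https://www.w3.org/History/1989/proposal.html",
    },
}

def resolve(record_id: str) -> str:
    """Follow the catalogue record's pointer to the container."""
    return catalogue[record_id]["pointer"]

print(resolve("rec-002"))
```

Note that nothing in such a record says anything machine-processable about what the container is about; that limitation is exactly what the following sections address.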

In this world of catalogues and collections the internal processing logic is organised accordingly: the focus is on objects as information containers, not so much on the content of these containers, and accordingly cataloguing is focussed on container attributes: librarian descriptive cataloguing became excessively and almost exclusively concerned with the data on the book’s title page, its number of pages and its binding – the actual content would be dealt with using one or two subject headings, and in most librarian communities rules for descriptive cataloguing started to evolve independently of the rules for subject indexing – and in case of capacity conflicts the subject indexing would invariably be sacrificed in favour of descriptive cataloguing. Hence the fundamental cultural divide between the librarian paradigm and the documentation paradigm in the early twentieth century.

As a consequence, the functional macro-primitives of librarian work are increasingly the ingestion, storage, description and retrieval of information containers and their formal attributes.

In a way the apotheosis of this librarian role model built on a refusal to deal with content is reached with the caricature of a librarian in Musil’s Man Without Qualities who – when showing General Stumm to the catalogue room, the “holy of holies” –

Figure 1. Catalogue-based libraries



upon request from the General declares that he can only make his way through this “madhouse of books” because he has never read any of them.

3. . . . and closes again
As stated initially and studied in numerous recent contributions to the field, we now seem to approach the end of this print paradigm. This doesn’t mean that there will be no more printed books around soon – but it does mean that the scholarly knowledge workflow hitherto built around the book monolith in a linear, cyclic succession of steps starts to be altered profoundly. Figure 2 shows the way this cycle was organised around the printed book and until recently around all emulations of print in the digital, such as PDFs.

As opposed to this, Figure 3 attempts to illustrate two fundamental changes that occur in genuine digital publishing settings and which fundamentally affect the scholarly workflow as well as the role of libraries within it in a first step.

On the one hand, the linear, cyclically organised succession of steps around the information object dissolves and enables direct connections from any of these stages to almost any other one, with the roles of the players changing accordingly. We experience, for instance, a growing (at least technical) confusion of “annotation” and “review” in rapidly developing models of open, social software based rating settings. Apprehension more often just bypasses libraries, with publishers offering direct access routes based on portal services. And libraries increasingly venture into the traditional realms of publishers, making document repositories available as primary publishing instances – at least in a number of cases. This leads to a process of renegotiation of the individual players’ roles in the scholarly workflow and to a constant need to reshape their respective business models accordingly.

On the other hand, the monolithic object in the centre of the scholarly workflow, formerly a book monolith, becomes transparent, with its individual components becoming digitally referable in XML models such as TEI P5 or DocBook. This doesn’t yet lead to a complete erosion of the container as such, which still keeps stable boundaries – but it enables the creation of fine grained linking structures not only between containers as a whole but also between the components of their microstructures – very much the way Ted Nelson had envisioned things in (Nelson, 1981), and as illustrated in Figure 4, which shows a detail of the only implementation of Nelson’s Xanadu Space we currently can get hold of.

Figure 2. The traditional scholarly knowledge workflow

Many aspects of this passage from the printing paradigm to the digital publishing world dominated by XML-like document trees and the technology evolving around them have been related in depth, starting with (Buckland, 1998) and then in (Pedauque, 2003) as well as in (Pedauque, 2006). Although passing from print analogue formats like PDF to XML-structured container formats enabling various output methods, the very notion of a monolithic document has remained unchallenged in this stage: although decomposing the scholarly workflow and enabling granular addressing of document microstructures, this phase still left the document container as such untouched. “Documents” still had indisputable boundaries, without their actual confines being an issue. A “document” was still one “document”; trivially speaking, a discrete and clearly confined entity.

Figure 3. Decomposition of the scholarly workflow in genuine digital publishing

Figure 4. Xanadu Space

4. “Documents” and the World Wide Web
This is currently changing profoundly with the evolution of the WWW[3] from a web of documents to an all-encompassing medium, a change that might well be of the order of the passage from oral culture to script some 2,500 years ago[4]. In order to illustrate this somewhat bold assertion, a quick look at the evolution of the WWW during its first decades is useful – or rather a look at how this information and knowledge space has evolved in two specific directions.

4.1 Extending the document web
To understand the point I am trying to make here, a look at one of the foundational documents of the web is instructive: the picture included in Berners-Lee’s Magna Charta, what he then called “Information Management: a Proposal” (Berners-Lee, 1989), in its very centre refers to itself as “this document”. The original idea of the WWW thus was that of a hypertext application explicitly linking back to Ted Nelson (who is given due credit there, accordingly): a web of documents linked to each other in a non-hierarchical graph avoiding – above all! – the issues of tree-like taxonomic information organisation.

Berners-Lee’s original proposal even included the idea of indicating the nature of the links between documents in this hypertext environment – but somehow this part of the original idea got lost on the path towards implementation, since the HTTP based graphs of the first generation web typically look like the one in Figure 5.

Figure 5. A basic HTTP graph

The example shows two WWW resources with URIs from an imaginary namespace ex.org, connected by a “href” link pointing from the left to the right resource. It is a good example of what Berners-Lee’s first attempt was supposed to deliver: “Human-readable information linked together in an unconstrained way” (Berners-Lee, 1989). The information is a human readable document, indeed: we humans with our cultural background are of course able to “read” this graph and to infer what kinds of entities might be referenced here: we know that “Louvre” is an instance of the class “Museum”, most of us happen to know that “La Joconde” is the name the French use for the painting short-referenced as “Mona Lisa” in Italy or Germany, and that the second entity thus is an instance of the class “painting” – and in case we do not already know all this, the referenced web documents will inform us in a human readable manner. A machine, however, is a priori incapable of making such inferences: the implicit class model of the graph cannot be processed by a machine.

And likewise for the links: we of course know that paintings happen to be kept in museums, and from our general knowledge about the way instances of the classes “paintings” and “musea” can relate to each other we are able to infer the probable relation between the two instances referred to here: the painting “La Joconde” probably is kept in the “Louvre”. Here again, a machine a priori is incapable of producing such inferences – as long as it is not explicitly given all the information required for doing so.

However, the first generation web simply lacks the expressive power for doing so: neither can the relation between a class and its instances be expressed there, nor is it possible to “type” the relations between classes and instances respectively.

4.1.1 . . . in Syntax: RDF. This extension in expressive power is only enabled by a first extension of the information architecture of the web (which at the same time is the foundational layer of the “Semantic Web”): the Resource Description Framework (RDF) and its “grammar”, RDF Schema (RDFS).

RDF as a syntactic extension of the original web architecture enables the explicit statement (now taken from the real WWW) that the painting “La Joconde” is kept at the “Louvre” (Figure 6).

The relation between classes and instances can be made explicit in this approach, too, as shown in Figure 7.

And finally RDF can also be used to model the generalised relation between instances of two classes, as in Figure 8.

As is evident from Figure 8, the first of our RDF statements (<La_Joconde> – <musee> – <Musee_du_Louvre>) could as well have been inferred by a machine from the aggregation of other statements and thus would not have to be made explicit anymore in this context.

Figure 6. A simple RDF triple

Figure 7. Classes and instances in RDF



Such kinds of inferences are enabled by the “grammar” language RDF Schema (RDFS), which permits the organisation of sets of RDF statements into aggregated systems of triples. The RDF statements invariably are simple sentences with a fixed subject – predicate – object structure (the equivalent of a Resource – Property – Value triple), in which the first two elements must be entities on the WWW (and thus identified with a URI), whereas the third position can be occupied by a genuine web resource or a string (typed or unstructured, also referred to as a “literal”). The components of such triple statements can be organised in a class concept (such as in <La_Joconde> – <is of rdf:type> – <painting>), and the same applies to relations within triples (as in <teaching> – <subPropertyOf> – <communicating>), and in both cases a simple concept of inheritance of classes and relations enables machines to establish simple, deterministic inferences from such hierarchies, as in the example above.
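The deterministic inference enabled by such class and property hierarchies can be sketched in plain Python, with triples as tuples. This is only a toy fixed-point evaluator for two RDFS entailment patterns (subproperty and subclass inheritance); the `ex:` names such as `ex:keptIn` are invented for illustration.

```python
# A minimal sketch of RDFS-style inference over (subject, predicate,
# object) triples. Rule 1: if p is a subproperty of q, (s p o) entails
# (s q o). Rule 2: if C is a subclass of D, (s rdf:type C) entails
# (s rdf:type D). Predicate/class names prefixed "ex:" are illustrative.

triples = {
    ("ex:La_Joconde", "rdf:type", "ex:Painting"),
    ("ex:Musee_du_Louvre", "rdf:type", "ex:Museum"),
    ("ex:La_Joconde", "ex:keptIn", "ex:Musee_du_Louvre"),
    ("ex:keptIn", "rdfs:subPropertyOf", "ex:locatedIn"),
    ("ex:Painting", "rdfs:subClassOf", "ex:Artwork"),
}

def infer(graph):
    """Apply the two entailment rules until no new triples appear."""
    graph = set(graph)
    changed = True
    while changed:
        changed = False
        new = set()
        for s, p, o in graph:
            # subproperty rule: (s p o), (p subPropertyOf q) => (s q o)
            for p2, rel, q in graph:
                if rel == "rdfs:subPropertyOf" and p2 == p:
                    new.add((s, q, o))
            # subclass rule: (s type C), (C subClassOf D) => (s type D)
            if p == "rdf:type":
                for c, rel, d in graph:
                    if rel == "rdfs:subClassOf" and c == o:
                        new.add((s, "rdf:type", d))
        if not new <= graph:
            graph |= new
            changed = True
    return graph

inferred = infer(triples)
print(("ex:La_Joconde", "ex:locatedIn", "ex:Musee_du_Louvre") in inferred)  # True
print(("ex:La_Joconde", "rdf:type", "ex:Artwork") in inferred)              # True
```

The machine thus derives that La Joconde is located in the Louvre and is an artwork without either statement ever being asserted, which is exactly the point made about Figure 8 above.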

4.1.2 . . . in Scope: the web of things. However, most of the statements in the examples above would not be possible without a second extension of the WWW information architecture, this time an extension in scope. The constituent entities of the giant hypertext application of the first generation Web were “documents”, and thus “information entities” in terms of the main standardisation body of the Web, the World Wide Web Consortium (W3C, www.w3.org/). The so called “Web of Things” extends the scope of the Web in such a way as to include everything conceivable that is part of our “reality” – or at least to enable its representation on the web so as to potentially make it part of RDF statements, as was the case with the resource http://fr.dbpedia.org/page/Musee_du_Louvre in our example above, which for instance has a <prop-fr:latitude> with the value “48.861073 (xsd:double)” (an example of a typed literal) or which has a relation <dbpedia-owl:country> with the value <dbpedia-fr:France>.

Figure 8. Relation between two classes in an aggregation of RDF statements

4.2 Semantic Web and linked data
The original vision of the “Semantic Web” as proclaimed in (Berners-Lee et al., 2001) was built on this double extension of the web in syntax and scope, and so is the later, rebranded version of this vision, this time under the title Linked Open Data. The somewhat misleading attribute “semantic” in this context simply refers to the ability to “type” the semantics of relations between web resources in such a way as to enable processing by a machine, and “linked data” basically refers to the ability to not only deal with “document” type entities in the information and knowledge architecture of the Web but with whatever particles of “reality” can be represented on the WWW.

Without going into the details and subtleties of the ongoing discussions about how Linked Data actually relates to the Semantic Web and how both paradigms might further (co-)evolve, it is important to note that in the course of the evolution of this second generation extension of the WWW, the Web of documents has been turned into a “Giant Global Graph” (Berners-Lee, 2006), a “Global Data Space” (Heath and Bizer, 2011) – but above all into an impressive contextualisation machine enabling the generation of knowledge from an enormous and still rapidly growing aggregation of linked, contextualised data resources (cf. the well-known representation of the LoD cloud in Figure 9, taken from http://richard.cyganiak.de/2007/10/lod/lod-datasets_2011-09-19_coloured.html, or again the much more usable – but less complete – partitioned visualisation approach at http://lov.okfn.org/dataset/lov/).

This rapidly emerging Linked Open Data paradigm has numerous implications for science and scholarship. The one that interests us in this context is the fact that the Linked Open Data approach enables a substantial further step in the deconstruction process of the traditional “document” referred to above. The practical consequences of this process will be made evident in the next section of this contribution.

5. Documents and data in eScience and eScholarship
The point I will try to make in this section is that Linked Open Data should be seen as a chance to fundamentally rethink the relation of data vs publication vs metadata, and to build innovative functional approaches on such a renewed foundation. But before coming to this point I should illustrate the way document creation and publication is increasingly affected by the Linked Open Data paradigm[5].

5.1 Semantic publishing and contextualisation
A first step beyond the monolithic “document” container is currently taken in what (Shotton, 2009) has termed “Semantic Publishing”. The basic idea of this approach is to use all “anchor points” within a scientific publication that can easily (or even automatically) be identified to create contextual links for these elements, such as personal or place names, scientific terms and other (mostly named) entities. The result is a contextually enriched article format including lots of links to external resources on the web, as in the example presented in (Shotton et al., 2009), where the authors took an existing article from PLoS and systematically enriched it with links to dates, diseases, habitats, institutions, organisms, persons, places, proteins and taxons, resulting in a picture as the one in Figure 10.
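The basic enrichment move can be sketched as a naive gazetteer-based linker: scan the article text for known entity names and wrap each mention in a contextual link. This toy sketch is not how the tools mentioned in this section work internally (real named entity recognition is statistical); the gazetteer entries and target URIs are merely illustrative.

```python
# A toy sketch of the "semantic publishing" enrichment step: wrap known
# entity mentions in an article with links to external context resources.
# The gazetteer and its DBpedia-style target URIs are illustrative only,
# and this naive substitution ignores ambiguity and overlapping names.

import re

gazetteer = {
    "Louvre": "http://dbpedia.org/resource/Louvre",
    "Mona Lisa": "http://dbpedia.org/resource/Mona_Lisa",
}

def enrich(text: str) -> str:
    """Replace each known entity mention with an HTML link to its resource."""
    for name, uri in gazetteer.items():
        text = re.sub(re.escape(name), f'<a href="{uri}">{name}</a>', text)
    return text

article = "The Mona Lisa is kept in the Louvre."
print(enrich(article))
```

Note that the links produced this way are exactly the plain, un-typed hyperlinks criticised further below: a human can follow them, but a machine still cannot tell what kind of relation each link expresses.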

The result is impressive and definitely attractive for a human reader. It can, furthermore, mostly be obtained using robust standard technology for named entity recognition from commercial providers such as Thomson Reuters (www.opencalais.com/) or Temis (www.temis.com/), as well as from open source environments such as the Stanford Named Entity Recogniser (NER, http://nlp.stanford.edu/software/CRF-NER.shtml). A somewhat more fine grained and seamless approach in this direction is the UTOPIA solution presented in (Pettifer et al., 2011) – which, however, suffers from requiring additional software to be installed for authoring documents and also for making use of the full width of context links.

Figure 9. The Linked Open Data cloud

And anyway, such approaches do not advance scholarship much, as pointed out by (Page, 2009). Page justly notes that the enhanced article format still remains too much focussed on the traditional article format – but his main criticism is that the links in Shotton’s example indeed are simple, first generation web style un-typed URLs, with no chance of a machine actually identifying the semantics of these relations and as a consequence processing them:

So, essentially we’ve gone from pre-web documents with no links, to documents where the bibliography is hyperlinked (most online journals), to documents where both the bibliography and some terms in the text are hyperlinked (a few journals, plus the Shotton et al. (2009) example). I’m a tad underwhelmed (Page, 2009).

This shortcoming comes with a second drawback: URLs are unidirectional (that already was a criticism Ted Nelson had made from the beginning of Berners-Lee’s WWW architecture and its implementation), and for that reason the resources linked to from such an article will never know they are being referred to. “Real linking”, as Page calls it, would mend both shortcomings: the solution is to use RDF and “semantic” linking instead of first generation web technology. RDF links can be processed by machines, as explained above – and they can be reversed, too!
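The reversibility point can be made concrete with a few lines of Python: because an RDF statement is a symmetric data structure rather than an embedded anchor, a triple store can be queried in either direction, so a resource can discover everything that refers to it. The articles and the `ex:cites` predicate below are invented for illustration.

```python
# A minimal sketch of why RDF links, unlike plain hyperlinks, can be
# traversed backwards: triples live outside the documents they connect,
# so inbound references are just another query. Names are illustrative.

triples = [
    ("ex:article_A", "ex:cites", "ex:article_B"),
    ("ex:article_C", "ex:cites", "ex:article_B"),
    ("ex:article_B", "ex:cites", "ex:article_D"),
]

def outbound(resource):
    """What does this resource link to, and how? (what plain hrefs give us)"""
    return [(p, o) for s, p, o in triples if s == resource]

def inbound(resource):
    """Who links to this resource, and how? (what plain hrefs cannot give us)"""
    return [(s, p) for s, p, o in triples if o == resource]

print(inbound("ex:article_B"))  # article_B learns it is cited by A and C
```

With embedded first generation hyperlinks, `inbound` would be unanswerable without crawling every document on the web; with triples it is a one-line filter, and the predicate tells the machine what each incoming link means.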

Figure 10. An enhanced article

An attempt to actually leverage the potential of truly “semantic” publishing in a straightforwardly RDF based approach is the “NanoPublications” approach proposed by (Mons and Velterop, 2009). They suggest five steps for going substantially beyond the current practice of web publishing, which is still largely inspired by emulations of the Gutenberg Galaxy:

(1) From terms to concepts – conceive scientific publications not so much as a narration based on a succession of natural language terms but as an aggregation of ontologically modelled concepts and the relations between them.

(2) From concepts to statements – connect concept resources using a formalised structure: use RDF as the core of the publishing data model.

(3) Annotation of statements with context and provenance – extend RDF with additional elements for expressing contextual information such as provenance, versioning or authorisation information: this comes close to what is currently being discussed and taken towards standardisation as the “named graph” extension of the RDF model[6].

(4) Treating richly annotated statements as NanoPublications – dramatically increase the granularity level of what is to be considered as a publication (also in terms of credit building!).

(5) Removing redundancy, meta-analysing web-statements (raw triples to refined triples) – the combination of NanoPublications and systematic work on data quality (one of the weak spots of the Linked Open Data community!) would in the end result in transforming scientific communication into a rich concept web.
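Steps (2) to (4) can be sketched as a data structure: a nanopublication pairs one tiny assertion graph with graphs carrying its context, provenance and attribution (in RDF proper this layering is what named graphs provide). The field names and triples below are illustrative, not the normative nanopublication vocabulary.

```python
# A minimal sketch of the nanopublication idea: the unit of publication
# is a single richly annotated statement. The assertion is one triple;
# separate graphs record where it came from and who published it.
# All names and values here are illustrative.

nanopub = {
    "assertion": [
        ("ex:protein_P1", "ex:interactsWith", "ex:protein_P2"),
    ],
    "provenance": [
        ("ex:assertion", "prov:wasDerivedFrom", "ex:literature_corpus"),
        ("ex:assertion", "ex:method", "ex:context_similarity"),
    ],
    "publication_info": [
        ("ex:nanopub_1", "dc:creator", "ex:some_author"),
        ("ex:nanopub_1", "dc:created", "2009-01-01"),
    ],
}

def statement_count(pub):
    """A nanopublication stays tiny: one asserted statement per publication."""
    return len(pub["assertion"])

print(statement_count(nanopub))  # 1
```

The radical move is the granularity: credit, citation and quality control attach to the single statement and its annotations rather than to a narrative article containing hundreds of them.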

Figure 11 is taken from (Groth et al., 2010) and illustrates the proposed NanoPublication architecture:

Figure 11. The NanoPublication architecture

One of the examples for the scientific use of NanoPublications reported on at the nanopub.org website is the one published as (Van Haagen et al., 2009), where the authors claim to “have developed a method that predicts Protein-Protein Interactions (PPIs) based on the similarity of the context in which proteins appear in literature. This method outperforms previously developed PPI prediction algorithms that rely on the conjunction of two protein names in MEDLINE abstracts. We show significant increases in coverage (76 per cent versus 32 per cent) and sensitivity (66 per cent versus 41 per cent at a specificity of 95 per cent) for the prediction of PPIs currently archived in 6 PPI databases.”

The article is a good example of the way RDF based approaches can be used to create new knowledge by inferring over massive amounts of data (18 million abstracts dating from 1980 to 2008 in this case) that humans definitely would not be able to process, resulting in 44,000 significant gene-disease associations.

The example also illustrates the claim made in (Mons et al., 2011) stating that NanoPublications could be part of a broader shift in publishing paradigms that would move away from the current picture as shown in Figure 12, which has the “Narrative Articles” in its centre. This model has a number of shortcomings, indicated by the red properties:

The model proposed for future scholarly communication looks quite different and is shown in Figure 13. It shows a much higher degree of integration; the “glue” holding together its various components is semantic technology based on the RDF standard, and at its centre are NanoPublications:

Figure 12. Current publishing model

5.2 Documents, data and metadata
The concept of NanoPublications currently works only in very specific scientific areas: most examples actually come from the biomedical sector, and this is not too difficult to explain. The sector has a very specific and stable terminology as well as the (financial) means to organise its terminological knowledge in formalised ontologies. It may be possible to extend the radical model of NanoPublications to many other disciplines, mostly in the STM (Science, Technology and Medicine) area – but the move into semantic publication of this kind will be much slower in the Social Sciences and Humanities (if it happens at all) because of their fuzzy and unstable terminology, their fuzzy linking semantics that are hard to formalise consistently, and because of the close relation between complex document formats and scholarly discourse, which I have published on elsewhere (Gradmann, 2004).

But even so, two observations are striking in the context of such radicalised semantic publishing models:

(1) The actual distinction between documents, their content data and related metadata becomes increasingly meaningless and obsolete – the current discussion of data curation in scientific working environments probably reflects this change!

(2) The static, traditional model of content contained in article silos is weakening and may gradually be replaced by a dynamic model of richly contextualised aggregations of Web resources, with a whole new set of opportunities (the inferencing capabilities referred to above) – but also new questions and challenges.

Figure 13. A proposed model of scholarly communication

Referencing such dynamic aggregations over time will require new standardisation elements, as will the actual identification of the confines of such aggregations: where are the borders of “documents” in such environments? And what is the context relevant for their interpretation? Not to mention the quality problems that are evident in the current Linked Data community due to the partial and uncontrolled semantic redundancy of ontological resources. Extending the RDF model (as currently discussed in terms of “Named Graphs”) and ontology matching and mapping are thus two examples of future working areas – and these could be working areas for libraries!
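How Named Graphs support this kind of publishing can be sketched without any RDF library: a nanopublication partitions its triples into an assertion graph, a provenance graph and a publication-info graph, and the latter two can make statements about the assertion graph as a whole (cf. Groth et al., 2010). All URIs below are hypothetical:

```python
# A minimal, library-free sketch of a nanopublication as RDF quads
# (subject, predicate, object, graph). Naming the assertion graph lets
# provenance statements take the entire claim as their subject, which is
# the point of the "Named Graphs" extension discussed above.
EX = "http://example.org/"

ASSERTION = EX + "np1/assertion"
PROVENANCE = EX + "np1/provenance"
PUBINFO = EX + "np1/pubinfo"

quads = [
    # The scientific claim itself, as a single small assertion graph.
    (EX + "proteinA", EX + "interactsWith", EX + "proteinB", ASSERTION),
    # Provenance: how the assertion graph was derived.
    (ASSERTION, EX + "derivedFrom", EX + "medline-abstracts", PROVENANCE),
    # Publication info: who published the nanopublication.
    (ASSERTION, EX + "publishedBy", EX + "someLab", PUBINFO),
]

def graph(quads, name):
    """Return the triples belonging to one named graph."""
    return [(s, p, o) for s, p, o, g in quads if g == name]
```

Because the assertion graph has a name of its own, statements such as “derivedFrom” or “publishedBy” can refer to the whole claim rather than to any single resource in it, which the plain triple model cannot express.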

6. An opportunity for libraries . . .

For it is evident that digitisation of scientifically relevant sources as well as (increasingly) semantic publishing results in a growing quantity and increased complexity of scholarly environments, well beyond scholarly processing capacity in terms of reading faculty. This issue was first presented by (Crane, 2006), and the answer to his question “What do you do with a million books?” was given in (Renear and Palmer, 2009), which coined the term “strategic reading” for a future practice of scientific apprehension that would make heavy use of semantic technologies and ontologies.

Scientists and scholars thus will badly need help in three areas:

(1) semantic abstracting and named entity recognition to enable “strategic reading”;

(2) contextualisation of information objects as a basis for generating new potential knowledge; and

(3) robust reasoning and inferencing tools yielding digital heuristics and hypotheses.

And this evidently creates potential opportunities for libraries as the academic institutions enabling scholars to interact with content and its context in innovative ways. Libraries are particularly apt for this role because of their traditional co-operational discipline in metadata generation and organisation, and also because of their excellent contextualisation data (authority files of all kinds), which they are used to applying for enriching object descriptions with contextual links.

If libraries are able to seize this opportunity, they might find themselves positioned not so much as a service provider (which always carries the risk of being replaced by another player that does a better job at providing such services) but rather as an integrated part of future research environments.

7. . . . and what they need to do to be up to it

However, it takes a lot for libraries to actually be able to seize this opportunity. The cultural transformation they need to go through to this end is considerable and sometimes disruptive (which is something librarians abhor!). The core requirement is to get rid of ideas and corresponding terminology that are deeply rooted in the focus on information containers we are currently losing – or which at least cease to constitute the leading document paradigm in the post-Gutenberg age.

If it is true that the constitutive elements of scholarly processes in the future will not be printed pages anymore but something built on current ideas related to semantic publishing, and thus on Linked Open Data and RDF, then the latter must increasingly become libraries’ intellectual home. The mental shift required here can be illustrated as in Figure 14.

The passages from, e.g. information to knowledge, from search to navigation, or again from catalogue to graph are complex and far from binary. Rather than bilateral, they happen between systems of concepts (this is why they are visualised as part of two clouds). And quite a few of the new terms have not yet been identified – hence the absence of an evident counterpart for library in the right-hand cloud: the term we will need for the institutions dealing with knowledge in the future still remains to be coined.

Moreover, this shift in thinking and in terminology seems to have started already; at least, indications of it can be observed in a resource such as Google’s Ngram Viewer at http://books.google.com/ngrams. Figure 15, for instance, shows the shifts in term frequency of “bibliographic record” vs “linked data” in Google Books (English corpus) and indicates a tendency with two clearly inversely proportional graphs. Unfortunately, the Google Ngram database stops at 2008. However, a search for two similar terms – “cataloguing” vs “linked data” this time – reveals a similar correlation of graphs in Google Trends, as can be seen from Figure 16.

Figure 14. From catalogues to graphs

Figure 15. Shifts in term frequency in Google Books: bibliographic record vs linked data

These examples bear a number of methodological shortcomings: it is unclear to what extent Google’s Ngram and Trends services yield comparable results at all; the search term “bibliographic record” had to be replaced because of its insignificance in Google Trends; and the absolute figures are probably close to meaningless as a consequence.

The significant fact, however, is the inversely proportional tendency in the evolution of the two graphs, which seems to be a constant also when comparing similar pairs taken from the two term clouds.

This trend is confirmed by other observations, such as the surprisingly successful (German-language!) conference “Semantic Web in Bibliotheken”.

There is, however, a second cultural shift required for libraries to become part of the Linked Open Data paradigm, and this one might be much more difficult to operate than the terminology shift, as it is concerned with control and autonomy. Librarians have been used to creating and evolving their particular rules and standards for centuries, and have created data exchange formats such as MARC or descriptive cataloguing rules such as AACR2 in splendid autonomy, in perfect control of their working environments and the regulations applied there (and mostly ignoring whether any non-librarian would actually be able to understand these librarian rules and standards).

One could argue that this has already changed to some extent with the new cataloguing rules going by the name of “Resource Description and Access”. The designers have deliberately given up the continuity of the AACR naming line – but on the other hand they state in (“Resource Description and Access,” 2012) that “RDA is built on foundations established by the Anglo-American Cataloguing Rules (AACR) and the cataloguing traditions on which it was based.” The intended audience does not seem to be exclusively librarian anymore (the rules mostly mention “users”, “institutions” and “agencies”) – but almost all the references back (and there are many of them!) point to resources created by librarians. RDA thus is a strange, ambivalent creature, somewhere between two riverbanks. And although it may well be a step in the right direction, it still contains far too many compromises with past librarian culture to be a valuable guiding line for the future.

Probably the best idea of what lies ahead was given by the W3C Library Linked Data Incubator Group, active in 2010 and 2011, which in its final report (Baker et al., 2011) made the following recommendations:

Figure 16. Shifts in term frequency in Google Trends: cataloguing vs linked data

. That library leaders identify sets of data as possible candidates for early exposure as Linked Data and foster a discussion about Open Data and rights.

. That library standards bodies increase library participation in Semantic Web standardization, develop library data standards that are compatible with Linked Data, and disseminate best-practice design patterns tailored to library Linked Data.

. That data and systems designers design enhanced user services based on Linked Data capabilities, create URIs for the items in library datasets, develop policies for managing RDF vocabularies and their URIs, and express library data by re-using or mapping to existing Linked Data vocabularies.

. That librarians and archivists preserve Linked Data element sets and value vocabularies and apply library experience in curation and long-term preservation to Linked Data datasets (Baker et al., 2011).
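Two of these recommendations, minting URIs for the items in library datasets and re-using an existing Linked Data vocabulary, are simple enough to sketch directly. The record and the example.org namespace below are hypothetical; the Dublin Core terms are real:

```python
# A minimal sketch: give each catalogue item a stable URI and express the
# record as N-Triples statements re-using the Dublin Core vocabulary.
# The record data and the example.org namespace are invented for
# illustration; http://purl.org/dc/terms/ is the real DCMI namespace.
DC = "http://purl.org/dc/terms/"
BASE = "http://example.org/catalogue/"

record = {"id": "b1234", "title": "The Gutenberg Galaxy",
          "creator": "McLuhan, Marshall", "issued": "1962"}

def to_ntriples(rec):
    """Serialise one catalogue record as N-Triples, one URI per item."""
    subject = f"<{BASE}{rec['id']}>"
    lines = []
    for key in ("title", "creator", "issued"):
        lines.append(f'{subject} <{DC}{key}> "{rec[key]}" .')
    return "\n".join(lines)

print(to_ntriples(record))
```

Each record becomes a set of statements about one stable item URI, which is the minimal precondition for other Linked Data sources, inside or outside the library world, to point at it.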

Building on these recommendations, libraries could successfully create a mindset that would enable them to substantially contribute to a second strand of thought: how do Linked Open Data and scholarly research systematically relate to each other? We have seen a few examples of what the scholarly use of semantic annotation and inferencing might look like in the future – but this relation remains to be studied systematically in order to better understand how infrastructure based on semantic technologies and linked data might relate to digital science and scholarship. A separate forthcoming publication from the context of the DM2E project (http://dm2e.eu) will be devoted to the issue of how scholarly research and innovation can be modelled (and to which extent) using formal, ontology-based approaches anchored in semantic technology.

A thorough understanding of the evolutionary lines connecting the three poles of Linked Data and related methodology; research and teaching in Science and Scholarship; and the future role of academic knowledge generation and organisation institutions may lead to a redefined identity of academic libraries: no more as beholders of content containers, but as embedded research libraries on a par with scholars and scientists. To create such an understanding, together with the related conceptual and modelling framework, should be seen as one of the big challenges for library and information science as part of the emerging Web Science, as sketched in (Hall et al., 2009).

Notes

1. For the first time in (Gradmann, 2005).

2. For reasons that will be made evident further down in this article, this does not pertain to the Linked Data based representation at data.europeana.eu or the SPARQL endpoint at http://europeana.ontotext.com/

3. A broader context for this section is given by (Salaün, 2012), who draws a fascinating picture of the evolution of the WWW as part of the history of the “document” notion, and who builds on (Pédauque, 2003) as well as on (Pédauque, 2006), extending the exclusively XML based document model of Pédauque’s publications into RDF.

4. A good introduction to this issue can be found in (Hall et al., 2009), together with the argument that this process calls for “Web Science” as an entirely new academic discipline.


5. Some of the ideas presented here have their counterpart in the (FORCE11, 2011) manifesto, which further extends some of the issues presented here into the broader context of scholarly research and communication as well as into related economic considerations.

6. The mapping to the Named Graph extension is actually suggested explicitly in (Groth et al., 2010).

References

Baker, T. et al. (2011), Library Linked Data Incubator Group Final Report, available at: www.w3.org/2005/Incubator/lld/XGR-lld-20111025/ (accessed January 2, 2013).

Berners-Lee, T. (1989), Information Management: A Proposal, available at: www.w3.org/History/1989/proposal.html (accessed December 16, 2012).

Berners-Lee, T. (2006), Linked Data – Design Issues, available at: www.w3.org/DesignIssues/LinkedData.html (accessed December 16, 2012).

Berners-Lee, T., Hendler, J. and Lassila, O. (2001), “The Semantic Web”, Scientific American, Vol. 284 No. 5, pp. 34-43.

Buckland, M. (1998), “What is a ‘digital document’?”, Document Numérique, Vol. 2, pp. 221-230.

Coy, W. (1994), Computer als Medien. Drei Aufsätze, Forschungsbericht des Studiengangs Informatik, Universität Bremen, Bremen.

Crane, G. (2006), “What do you do with a million books?”, D-Lib Magazine, Vol. 12 No. 3.

Derrida, J. (1967), De la Grammatologie, Minuit, Paris.

FORCE11 (2011), Improving Future Research Communication and e-Scholarship, available at: www.force11.org/white_paper

Gradmann, S. (2004), “Vom Verfertigen der Gedanken im digitalen Diskurs: Versuch einer wechselseitigen Bestimmung hermeneutischer und empirizistischer Positionen”, Historical Social Research, Vol. 20, pp. 56-63.

Gradmann, S. (2005), “Beyond electrification: innovative models of scientific and scholarly publication”, European Science Editing, Vol. 31 No. 1, pp. 5-7.

Groth, P., Gibson, A. and Velterop, J. (2010), “The anatomy of a nano-publication”, Information Services & Use, Vol. 30, pp. 51-56.

Hall, W., De Roure, D. and Shadbolt, N. (2009), “The evolution of the web and implications for eResearch”, Philosophical Transactions of the Royal Society A, Vol. 367, pp. 991-1001.

Heath, T. and Bizer, C. (2011), Linked Data: Evolving the Web into a Global Data Space, Morgan & Claypool, available at: http://linkeddatabook.com/editions/1.0/

McLuhan, M. (1962), The Gutenberg Galaxy, Routledge & Kegan Paul, London.

Mons, B. and Velterop, J. (2009), “Nano-Publication in the e-science era”, Bioinformatics, pp. 14-15.

Mons, B. et al. (2011), “The value of data”, Nature Genetics, Vol. 43 No. 4, pp. 281-283.

Nelson, T. (1981), Literary Machines: The Report On, and Of, Project Xanadu Concerning Word Processing, Electronic Publishing, Hypertext, Thinkertoys, Tomorrow’s Intellectual Revolution, and Certain Other Topics Including Knowledge, Education and Freedom, Mindful Press, Sausalito, CA.

Page, R. (2009), “Semantic publishing: towards real integration by linking”, iPhylo blog, available at: http://iphylo.blogspot.de/2009/04/semantic-publishing-towards-real.html

Pédauque, R.T. (2003), Document: Form, Sign and Medium, As Reformulated for Electronic Documents, STIC-CNRS, Paris.

Pédauque, R.T. (2006), Le document à la lumière du numérique, C&F éditions, Caen.

Pettifer, S., McDermott, P., Marsh, J., Thorne, D., Villéger, A. and Attwood, T.K. (2011), “Ceci n’est pas un hamburger: modelling and representing the scholarly article”, Learned Publishing, Vol. 24 No. 3, pp. 207-220.

Renear, A. and Palmer, C.L. (2009), “Strategic reading, ontologies, and the future of scientific publishing”, Science, Vol. 325 No. 5942, pp. 838-842.

Resource Description and Access (2012), available at: http://access.rdatoolkit.org/

Salaün, J.-M. (2012), Vu, lu, su, Éditions La Découverte, Paris, available at: www.editionsladecouverte.fr/catalogue/index-Vu__lu__su-9782707171351.html

Sauerberg, L.O. (2009), “The Gutenberg Parenthesis – print, book and cognition”, Orbis Litterarum, Vol. 64 No. 2, pp. 79-80.

Shotton, D. (2009), “Semantic publishing: the coming revolution in scientific journal publishing”, Learned Publishing, Vol. 22 No. 2, pp. 85-94.

Shotton, D., Portwin, K., Klyne, G. and Miles, A. (2009), “Adventures in semantic publishing: exemplar semantic enhancements of a research article”, PLoS Computational Biology, Vol. 4 No. 5.

The Gutenberg Parenthesis Research Forum (2010), “The Gutenberg Parenthesis – print, book and cognition”, position paper, available at: www.sdu.dk/en/Om_SDU/Institutter_centre/Ilkm/Forskning/Forskningsprojekter/Gutenberg_projekt/PositionPaper (accessed December 23, 2012).

Van Haagen, H.H.H.B.M. et al. (2009), “Novel protein-protein interactions inferred from literature context”, PLoS ONE, Vol. 4 No. 11, p. e7894.

Corresponding author
Stefan Gradmann can be contacted at: [email protected]
