8/2/2019 Molecules of Knowledge - A New Approach to Knowledge Production Management and Consumption
Alma Mater Studiorum
Università di Bologna
Second Faculty of Engineering
Degree Programme in Computer Engineering
Master's Degree in Distributed Systems
Molecules of knowledge: a new
approach to knowledge
production, management and
consumption
Candidate: Stefano Mariani
Supervisor: Prof. Andrea Omicini
Academic Year 2010/2011 - Session II
To Alice, because without her I would be alone,
to my parents, who gave me this opportunity,
to my brother, who would have been better off playing WoW,
to my grandparents, whom I wish were here,
to all my friends, whose providential irony
always reminds me not to take myself too seriously.
Contents
Introduction
1 Background
    1 My vision
    2 The biochemical metaphor
    3 IPTC's news standards
        3.1 NewsML
        3.2 NITF
2 Molecules of knowledge model
    1 Informal introduction to the model
        1.1 About topology
    2 Model abstractions
        2.1 Seeds
        2.2 Atoms
        2.3 Molecules
        2.4 Chemical reactions
        2.5 Catalysts/Inhibitors
    3 The spatial-temporal fabric toward self-adaptation
        3.1 Time
        3.2 Space
        3.3 Self-adaptation
    4 The formal model
3 Model behaviour examples
    1 Seeds generating atoms
    2 Diffusion, decay and positive feedback
    3 Molecules from atoms
Conclusion and further developments
Appendix - Summary in Italian
Bibliography
Acknowledgments
Introduction
Information specialists, namely journalists, are facing new and critical
challenges in their knowledge production process: the increasing amount of
information to mine, the pace at which it is made available, and the many
different formats and paradigms used to represent and think about it, to
mention just a few.
A new field is emerging to promote the process: computational journalism.
By developing techniques, methods, and user interfaces for exploring the new
landscape of information, computer scientists can help discover, verify, and
even publish new public-interest stories at lower cost. For computational-
ists and journalists to work together to create a new generation of reporting
methods, each needs an understanding of how the other views data. Jour-
nalists are in fact a special kind of information-seekers, because they look
for the unusual handful of individual items that might point toward a news
story or an emerging narrative thread.
Over the past two years, Sarah Cohen, James T. Hamilton, and Fred Turner
have conducted scores of interviews with reporters, editors, computer
scientists, information experts, and other domain researchers to identify
collaborations and projects that could help reduce the cost and difficulty of
news production and knowledge management [1]. Their conversations identified
five areas of opportunity:
Combining information from varied digital sources. The capability to put
into one repository material not easily recovered or searched through
existing search engines is currently almost entirely missing: all that
journalists can actually do is manually mine interesting sites and take
notes. This is due to the heterogeneity of the form and format in which each
source of information publishes and organizes its content.
Information extraction. Beat reporters might cover one or more counties, a
subject, an industry, or a group of agencies, hence most of the docu-
ments they obtain would benefit from entity extraction. But effective
use of these tools requires computational knowledge beyond that of
most reporters, documents already organized, recognized, and format-
ted, or an investment in commercial tools typically beyond the reach
of news outlets in non-mission-critical functions.
Document exploration and redundancy. Reporters need to notice information
that is not commonly known but that could lead to news in interviews,
documents, and other published sources. However, the recent explosion in
blogs, aggregated news sites, and special-interest group compilations of
information makes distinguishing new stories time-consuming and difficult;
hence the ability to group documents in interesting ways would immediately
reduce the time and effort of reporting.
Audio and video indexing. Unless a third party has already transcribed,
closed-captioned, or applied speech-recognition techniques to the recording,
most reporters have no way to move to the portion of it that contains what
may be of interest. Existing technology is probably adequate for reporters'
immediate needs, but, as these interviews suggest, there are no simple user
interfaces to the technology that would allow unsophisticated users to test
it on their own recordings.
Extracting data from forms and reports. Much of the information collected
by reporters arrives in two genres: original forms submitted to or created
by news agencies, often handwritten, and reports generated from larger
systems, sometimes electronically and sometimes on paper. Journalists have
few choices today: retype key documents into a database, attempt to search
recognized images, or simply read them and take notes. Extracting meaningful
information from forms is among the most expensive and time-consuming jobs
in large news investigations: its cost sometimes results in promising
stories being abandoned.
This thesis will mainly focus on the third issue, namely document exploration
and redundancy. The main objective is to provide knowledge prosumers (both
producers and consumers, as journalists typically are) with a brand new
model, both to think about the knowledge lifecycle from a new perspective and
to shape knowledge and the knowledge production process itself accordingly.
Although the work done in this thesis is tailored to the application domain
of journalism (so most of the time knowledge here actually means journalistic
news), most of its ingredients and ideas can easily be reproduced in other
areas, namely wherever a self-organising knowledge management system is
needed [2]. Moreover, the model conceived here can easily be extended to deal
with each of the issues highlighted above: some of them are taken as
assumptions, and others can be covered, as will be mentioned throughout the
thesis.
The remainder of the thesis is organized as follows:
Chapter 1 introduces some background information necessary to better
understand the model, namely the biochemical metaphor for distributed
coordination systems and the IPTC NewsML and NITF journalistic standards for
representing news content, structure and semantics in a machine-readable
format;
Chapter 2 defines the molecules of knowledge model and how it could be used
to design a self-organising news management system;
Chapter 3 shows some brief experiments I have carried out to observe how the
model behaves;
finally, I draw conclusions about the work done and give guidelines for
further investigation.
Chapter 1
Background
The inhumanity of the computer lies in the fact that,
once programmed and set running,
it behaves in a perfectly honest manner.
- Isaac Asimov -
First of all, I would like to describe to the reader my view of the news
lifecycle and how it can be rethought from the brand new perspective of the
biochemical metaphor recently exploited in distributed coordination systems.
It then becomes necessary to describe this metaphor, which is what the second
section does. Finally, the IPTC NewsML and NITF standards are briefly
introduced, since they are the foundations of my molecules of knowledge model.
1 My vision
I wish to depict the overall scenario that the reader is encouraged to
imagine in order to understand what the following sections and chapters are
about and what their purpose is. To this end, I think it is best to
distinguish three phases in the news lifecycle: production, management and
consumption.
Production. Journalists gain the knowledge they need to create news from
different sources of information. Such sources can be either i) external to
the self-organising system, such as RSS feed aggregators, news agency
broadcasts (e.g. from the Italian ANSA), digital articles from online papers,
or even the set of posts belonging to the same thread in a blog; or ii)
internal, such as the system prosumers' own articles, news and
comments/annotations on existing knowledge.
Whatever the nature of a source of knowledge, I will assume that either i) it
is already structured or ii) there exists a proper entity, inside or outside
the system, able to structure it (for instance an interface agent at the
border between the self-organising system and the external sources).
By structured information I mean information that has been built, organized
and distributed according to some standard, either a general-purpose
knowledge representation language such as OWL2 [3] from the W3C or a more
domain-specific one such as the IPTC's NewsML and NITF standards. I will
follow the second approach (both standards are described later in this
chapter).
These structured information sources will be either i) reified within the
self-organising system as seeds or ii) managed, again, by a proper entity
(namely another interface agent). In both cases I assume that these sources
continuously and autonomously inject into the system some atoms of knowledge,
which for the moment can be interpreted as autonomous and independent living
pieces of knowledge (they are actually single NewsML/NITF tags, as will be
described).
The fundamental point is that this injection is not a one-shot operation: it
is continuous in time, and its rate can be changed according to the system's
state and its desired behaviour. For instance, recently published news could
be injected at a higher rate (hence more often in a given interval of time)
than older news; or, if the system is overloaded, every source could be
slowed down to give the system time to work off the backlog, while if it is
experiencing a scarcity of new atoms, existing sources could be stimulated to
increase their injection rate.
Moreover, every single injection does not add to the self-organising
system a single atom, but a variable number of identical copies of an
atom, namely its concentration.
This quantitative information models the atom's relevance and usefulness:
the higher it is, the higher the importance implicitly attributed to that
atom within the system by the system itself, hence the more it is capable of
influencing the system's behaviour. Concentration may be given either by i)
the prosumers, if they extract the atoms themselves (this manual mode is
allowed too, for instance when prosumers inject their own articles into the
system); or ii) the injector component, according to some well-defined
criteria. For instance, higher concentration could be given i) to atoms
extracted from the title or the summary of a news item rather than to those
taken from its body, ii) to atoms appearing several times inside the same
news source, or even iii) to atoms from the most recent news, as done for the
injection rate.
Note that when concentration is given manually, the injection rate should be
given too (and possibly self-adjusted later by the system, autonomously).
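The injection mechanism described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the names
Atom, Seed, SECTION_WEIGHT and initial_concentration, the numeric weights,
the sample atom and the exponentially distributed delay between injections
are all hypothetical choices, not part of the model defined in Chapter 2.

```python
import random
from collections import Counter
from dataclasses import dataclass

# Hypothetical weights: atoms found in the title or summary count more
# than atoms found in the body of a news item.
SECTION_WEIGHT = {"title": 3, "summary": 2, "body": 1}


@dataclass(frozen=True)
class Atom:
    tag: str    # name of the NewsML/NITF-like tag the atom comes from
    value: str  # the tagged content


def initial_concentration(occurrences):
    """Concentration of an atom given the sections where it occurs.

    Repeated occurrences add up, so an atom appearing several times in
    the same source gets a higher concentration.
    """
    return sum(SECTION_WEIGHT[section] for section in occurrences)


class Seed:
    """A source that keeps injecting copies of its atoms at a given rate."""

    def __init__(self, atoms, rate):
        self.atoms = atoms  # maps Atom -> copies added per injection
        self.rate = rate    # expected injections per time unit (tunable)

    def next_injection_delay(self, rng):
        # Assumption: waiting times are exponentially distributed.
        return rng.expovariate(self.rate)

    def inject(self, space):
        # Each injection adds `concentration` identical copies of each atom.
        for atom, concentration in self.atoms.items():
            space[atom] += concentration


space = Counter()                        # atom -> current concentration
sample_atom = Atom("person", "Jane Doe")  # made-up example value
seed = Seed({sample_atom: initial_concentration(["title", "body", "body"])},
            rate=2.0)
rng = random.Random(42)
for _ in range(2):
    delay = seed.next_injection_delay(rng)  # time to wait before injecting
    seed.inject(space)
```

Slowing down or stimulating a source, as discussed above, would amount to
lowering or raising the rate attribute of the corresponding seed.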
Management. The model this thesis aims to build has to provide the
abstractions and metaphors useful to any system designed upon it with the aim
of helping information specialists manage their knowledge. In particular,
such a system could be a self-organising system able to autonomously evolve
knowledge according to users' needs, desires and behaviour. For instance, it
could relate atoms to one another to shape molecules of knowledge, i.e.
higher-level knowledge items, evolving according to both space and time
patterns such as decay and diffusion.
The main tool thanks to which the system evolves (or rather its atoms and
molecules, although seeds evolve too) is the chemical-like law, namely a
one-shot stochastic transition rule consuming a set of reagents to generate a
set of products. These rules necessarily have to be stochastic to give the
system as a whole the self-* properties that are highly desirable in open and
distributed knowledge-intensive environments [4]. Stochasticity here means
that each law has an associated probability, computed in some way, according
to which it is scheduled for execution; in other words, even a less probable
law may be executed in place of a more probable one.
Such laws could be designed to combine atoms that are somehow related,
thereby increasing the knowledge stored within the system by emergence [5].
Their reagents could be the atoms of knowledge and their chemical products the
aforementioned molecules: this way the system could self-produce molecules i)
about the same people, ii) covering the same topic, iii) relating
chronologically coherent atoms, iv) following some kind of spatial criterion,
and so on.
The concentration of the atoms taken as reagents influences execution
probability: the higher the concentration of the atoms involved in a certain
law compared with that of the atoms satisfying another law's pre-conditions,
the higher the probability that the former law will be chosen for execution
over the latter (although still stochastically, hence the latter could be
executed despite its lower probability).
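A minimal sketch of this concentration-driven, stochastic scheduling follows.
It assumes that a law's weight is the product of its reagents' concentrations
(a mass-action-like choice made here only for illustration); the function
names, the law encoding as (reagents, product) pairs and the sample atoms are
hypothetical.

```python
import random


def law_weight(law, space):
    """Weight of a law: product of the concentrations of its reagents."""
    reagents, _product = law
    weight = 1
    for atom in reagents:
        weight *= space.get(atom, 0)
    return weight


def pick_law(laws, space, rng):
    """Choose one law at random, with probability proportional to its weight."""
    weights = [law_weight(law, space) for law in laws]
    if sum(weights) == 0:
        return None  # no law has all of its reagents available
    return rng.choices(laws, weights=weights, k=1)[0]


def apply_law(law, space):
    """One-shot execution: consume one copy of each reagent, add one product."""
    reagents, product = law
    for atom in reagents:
        space[atom] -= 1
    space[product] = space.get(product, 0) + 1


# Two laws, each joining a person atom and a topic atom into a molecule.
space = {"person:A": 9, "topic:T": 1, "person:B": 1, "topic:U": 1}
laws = [
    (("person:A", "topic:T"), "molecule:A-T"),  # weight 9 * 1 = 9
    (("person:B", "topic:U"), "molecule:B-U"),  # weight 1 * 1 = 1
]
rng = random.Random(0)

# Sampling many times (without applying) shows the bias: the first law is
# picked far more often, yet the second one is still picked sometimes.
picks = [pick_law(laws, space, rng) for _ in range(1000)]

apply_law(laws[0], space)  # consumes one person:A and the only topic:T
```

After the single execution the first law can no longer fire, because one of
its reagents is exhausted, while the second law remains available.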
Consumption. The creation of new knowledge from existing knowledge by
emergence is useless if such knowledge is not made available to potential
consumers. To this end, the system should provide users with some mechanism
to perceive such knowledge, hence both the single atoms and their
aggregations, namely the molecules. This way system prosumers may not only
acquire the single pieces of information they were looking for, but also
navigate the associations between them, reified as molecules.
A crucial principle to understand when talking about self-organising
systems is that perception actions carried out by users have practical
and observable consequences on the system state and behaviour: as
soon as the system is observed it suddenly changes its shape according
to such observation.
In the case of creation, modification or aggregation of existing information
by the prosumer, it is easy to detect system changes; but if such information
is only retrieved, browsed and/or navigated without any modification, what
are these observable consequences and how can they be recognized? What all
the aforementioned operations have in common, whether or not they modify
knowledge, is that through them users become aware that the information
considered exists and implicitly evaluate such knowledge as useful/relevant
to them. The system is then allowed to interpret all these different kinds of
access to atoms and molecules as positive feedback that increases their
concentration: pieces of news handled more times and more often than others
are implicitly considered more relevant/useful by the system itself, hence
they will gain an increased capability to influence its behaviour.
According to this view, prosumers can be seen as catalysts for the chemical
reactions installed in the self-organising system, able to influence its
autonomous and stochastic behaviour not only through the nature of their
actions but also through the rate at which they are executed.
Note another fundamental principle, dual to the previous one: even the
absence of any observation can be interpreted as an action over the system,
which as such has to change its state. This is usually called negative
feedback: an atom or molecule of knowledge that is not accessed for a long
time does not receive any reinforcement, hence it should slowly fade away
following some kind of implicit negative feedback enacted by the system
itself to avoid divergence (all the atoms and molecules endlessly increasing
towards system saturation).
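The interplay of positive and negative feedback can be illustrated with a
deliberately simplified sketch. It assumes a deterministic decay of one unit
per time step and a fixed boost per access; in the model both would rather be
stochastic laws, and the function and item names below are hypothetical.

```python
def access(space, item, boost=1):
    """Positive feedback: every access raises the item's concentration."""
    space[item] = space.get(item, 0) + boost


def decay(space, amount=1):
    """Negative feedback: every item loses concentration at each step.

    Items whose concentration drops to zero disappear from the space.
    """
    for item in list(space):
        space[item] -= amount
        if space[item] <= 0:
            del space[item]


space = {"atom_a": 3, "atom_b": 3}
for _ in range(3):
    access(space, "atom_a")  # a prosumer keeps reading atom_a only
    decay(space)             # meanwhile everything fades a little
```

After three steps the item that kept being accessed still has its initial
concentration, while the ignored one has faded away completely.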
Now that the reader knows what I had in mind while writing this thesis, it is
time to introduce the biochemical metaphor I will rely on.
2 The biochemical metaphor
It does not matter from which specific viewpoint one looks at natural
systems, e.g., in terms of physical, chemical, biological, or social systems.
From every perspective one can always recognise the following
characteristics: above i) a common environmental substrate (defining ii) the
basic laws of nature and the ground on which individuals can live), iii)
individuals of different kinds (or species) interact, compete, and combine
with each other (in respect of the basic laws of nature), so as to serve
their own individual needs as well as the sustainability and the evolvability
of the overall system.
This is the sort of endeavour that one should assume towards the realisa-
tion of long-lasting (ideally eternal) adaptive service ecosystems: conceiving
services and data components as individuals in an open ecosystem, in which
they interact according to a limited set of eco-laws to serve their own in-
dividual purposes in respect of such laws [6] [7].
Within the ecosystem, the level of species is the one in which all system
entities - persistent and temporary knowledge/data, contextual information,
events and information requests, and of course software service components -
are interpreted with the uniform abstract view of being the living things
that populate the system. After a bootstrap phase in which the ecosystem is
expected to be filled with a non-empty set of individuals, the ecosystem
starts living on its own, with the population of individuals evolving in
different ways: i) the initial set of individuals is subject to changes (as a
reaction to users' actions upon it); ii) service developers and producers
inject new individuals into the system (developers insert new services and
virtual devices, producers insert data and knowledge); and iii) consumers
keep observing the environment for certain individuals (they inject
information requests and look for certain data, knowledge, and events).
The environmental level determines the set of fundamental eco-laws re-
sponsible for the way in which individuals interact, compose with others,
aggregate so as to form or spawn new individuals, and decay (ultimately win-
ning or losing the natural selection process intrinsic in the ecosystem). Start-
ing from the unified description of living entities - the information/service
they provide - and from proper matching criteria, such laws basically spec-
ify the likelihood of certain spontaneous evolutions of individuals or groups
of individuals.
Typical patterns that can be driven by such laws include: temporary data and
services decay as long as they are not exploited, until they disappear, and
dually, they get reinforced when exploited; data, data requests, and data
retrieving services may altogether match, hence spawning "data-found" events;
new services can be created by aggregating existing services whose
descriptions strongly match.
The dynamics of the resulting ecosystem is overall determined by having in-
dividuals in the ecosystem act based on their own internal goals, yet being
subject to the eco-laws for their actions, interactions, and survival. The way
eco-laws apply may be affected by the presence and state of other individuals,
hence providing for closing the feedback loop that is a necessary charac-
teristic to enable self-organisation, self-adaptation, and self-management
features.
For instance, a service component that starts consuming too many resources
can affect the behaviour of resource provider components, diminishing their
availability and thus preventing the overall system from crashing. Or, in a
different case, a service component subject to a very high number of requests
can either aggregate new service components of the same class at a different
site or simply spawn itself to increase service availability without
affecting the quality of service provided.
In any case, the openness of the architecture does not exclude the possibility
of enforcing forms of decentralised human management (the existence of a
self-managing system must not preclude the possibility for humans to pre-
serve the capability of controlling the system). In particular, the injection of
new individuals can be used to modify the way eco-laws affect other individ-
uals and, thus, to somehow control the evolution of the ecosystem dynamics.
Chemical metaphors consider the species of the ecosystem to be a sort of
computational atoms/molecules, living in localised solutions, with properties
described by some sort of semantic descriptions, intended as the
computational counterpart of the description of the bonding properties of
physical atoms and molecules. The laws that drive the overall behaviour of
the ecosystem are a sort of chemical laws that dictate how chemical reactions
and bonding between components take place to realise self-organising patterns
and aggregations of components. Moreover, chemical metaphors support forms of
external control through a sort of catalyst or reagent components affecting
the behaviour of a chemical ecosystem.
But the chemical metaphor alone is not enough, because it does not consider
any spatiality-related aspect; hence a metaphor inspired by biochemistry
(combining basic aspects of chemistry with some features of biology) can
suitably enhance it to address the development of distributed service
ecosystems.
On the one hand, chemistry appears a simple yet powerful framework for
self-organisation since it is based on a very foundational setting of chemical
substances and reactions, and it allows for a well-known fully-computational
description as a continuous-time stochastic system [8]. On the other hand,
when moving from chemistry to biology (hence considering biochemistry)
the notion of space structure enters the picture, and allows us to tackle in a
self-organised way key aspects related to how individuals can spread in the
network topology - a crucial issue for service ecosystems.
Now that I have framed the metaphor to use within the biochemical world, let
us describe in detail the mapping from the three general concepts of species,
environmental substrate and laws of nature to their biochemical counterparts,
namely reactants, compartments and biochemical laws.
Species as reactants. A chemical system is composed of chemical substances
(or reactants): a chemical substance s can be considered as made of a certain
molecule m with concentration c floating in a given portion of space,
possibly in solution with many other substances s1, ..., sn. Concentration is
directly responsible for the rate at which s reacts with other substances,
and ultimately for whether/how it affects the chemical dynamics at all.
Substances may be produced, decay, combine with others, act as catalysts,
inhibitors, signals, data storage, and so on.
The concept of chemical substance can hence be associated with that of an
individual: the molecule kind m is the individual kind, its structure
provides all the interface information used to characterise the individual's
observable behaviour, while the concentration c is a numerical value
representing the activity level of the individual: the higher it is, the
more likely this substance will interact with others, and dually, it will
become inert as its activity level fades. Accordingly, individuals can be
injected into the system and start interacting with others, by changing
shape, diffusing, being continuously generated/sustained or decaying.
The environmental substrate as a set of compartments. A chemical system is
typically made of a single solution where different substances float around
and interact. To make this scenario better fit the shape of distributed
computing, the biological concept of compartment is needed. A compartment is
a portion of space delimited by a membrane that filters and regulates whether
and how chemical substances can cross it. Many compartments can exist in a
system, in principle hosting totally different substances and chemical
reactions, thus possibly playing different roles in achieving the overall
system objective. Compartments can even touch each other, so that substances
can move from one compartment directly to the other, as in the cells of a
tissue.
The concept of compartment can be associated with that of world
location, i.e., an execution context for ecosystem services. A main
example of location is a network host, with touching compartments
modelling direct connection between nodes.
Laws of nature as biochemical laws. In biochemistry there are two basic
kinds of events that affect a system's evolution: purely chemical reactions,
responsible for changing the concentration of chemical substances, and
biomechanical actions, responsible for configuration changes, namely
topological changes or chemical substances moving across membranes.
The first kind of events is well understood and studied, also in the context
of Computational Systems Biology (CMSB) [9], starting from the work of
Gillespie [10] and followed in languages like the stochastic π-calculus [11].
They are ruled by reactions of the kind $X + Y \xrightarrow{r} Z$, meaning
that when one molecule X collides with one molecule Y they can interact by
creating a new molecule Z (replacing the two original ones), with a
likelihood expressed by the reaction rate r: the actual rate at which the
reaction occurs is proportional to r and to the concentrations of X and Y.
The second kind of events, biomechanical ones, are inspired by the work in
[12], which extends the mechanism of chemical reactions. The idea is to allow
standard chemical reactions to produce, in addition to chemical substances,
biomechanical actions, which are triggers that can make some substance cross
a membrane (hence diffuse to another network node).
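The purely chemical part of this picture can be sketched with one step of
Gillespie's direct method, in which a reaction's propensity is its rate r
multiplied by the amounts of its reactants. This is a single-compartment
illustration only: membranes and biomechanical actions are not modelled, the
reactants of a reaction are assumed to be distinct species, and the function
names, the reaction encoding and the numbers are hypothetical.

```python
import random


def propensity(rate, reactants, state):
    """Propensity of a reaction: its rate times the amount of each reactant."""
    a = rate
    for species in reactants:
        a *= state.get(species, 0)
    return a


def gillespie_step(reactions, state, rng):
    """Fire one reaction, chosen as in Gillespie's direct method.

    Each reaction is a (rate, reactants, products) triple. Returns the
    simulated time elapsed before the firing, or None if nothing can fire.
    """
    props = [propensity(rate, reactants, state)
             for rate, reactants, _products in reactions]
    total = sum(props)
    if total == 0:
        return None
    dt = rng.expovariate(total)  # waiting time until the next reaction
    _rate, reactants, products = rng.choices(reactions, weights=props, k=1)[0]
    for species in reactants:    # consume the colliding molecules
        state[species] -= 1
    for species in products:     # create the new ones
        state[species] = state.get(species, 0) + 1
    return dt


# The reaction X + Y -> Z with rate r = 0.1, starting from 5 X and 3 Y.
state = {"X": 5, "Y": 3, "Z": 0}
reactions = [(0.1, ("X", "Y"), ("Z",))]
rng = random.Random(1)
elapsed = 0.0
while True:
    dt = gillespie_step(reactions, state, rng)
    if dt is None:
        break
    elapsed += dt
```

The loop stops once Y is exhausted: each firing replaces one X and one Y with
one Z, so three firings are possible in total.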
The reader may have recognized some features of the biochemical metaphor
already mentioned in the previous section, where I described my vision of the
model/system. Such correspondences are a first hint at the complete mapping
from the general biochemical framework above to my molecules of knowledge
model, which will be formalised in Chapter 2.
The next section presents a possible approach to grounding the biochemical
metaphor in the journalism application domain, and in particular in its
standards and methodologies.
3 IPTC's news standards
The IPTC (International Press Telecommunications Council) [13] is a
consortium of the world's major news agencies, news publishers and news
industry vendors. It develops and maintains technical standards for improved
news exchange that are used by virtually every major news organization in the
world (among which the Italian ANSA, the American Thomson Reuters and the
British BBC - see [14] for the full list).
One of the objectives for which the IPTC was established is to study
techniques, research and developments in telecommunications and to consider
how they can best be used to improve the flow of news. The following sections
describe two of its main standards, designed to represent, organize and
exchange news in pursuit of that objective.
3.1 NewsML
NewsML [15] [16] is a media-type agnostic news exchange format standard
to convey not only the core news content, but also data that describes the
content in an abstract way (i.e. metadata), information about how to handle
news in an appropriate way (i.e. news management metadata), information
about the packaging of news themselves, and finally information about the
technical transfer itself.
It provides a set of useful abstractions:
the News Item - it carries the news content, that is, information reporting
what has just happened, a preview of what can be expected to happen
next, and the corresponding background information.
Although this information can be presented in different journalistic
styles - article, blog post, report, comment, ... - and through different
media types - text (articles), photos, graphics, audio or video - this single
abstraction is conceived to cover all these cases;
the Concept Item - news is about events, persons, locations, themes
and the like, and such information is worth remembering - and
referring to - along with the news content in order to better identify,
recognize and categorize - namely, manage - it; hence a data structure
that collects all this worth-remembering information is needed;
the Package Item - it is made to convey a structured set of items. It is
not merely a simple wrapper for news or concepts: it can structure
information much like a table of contents. A package can have
groups of items and the groups themselves can have sub-groups; each group
can hold references to multiple items, and references can be given names such as
Top 10 news of the week;
the Knowledge Item - it is a container for many concepts, acting like
an encyclopaedia. This way a small, medium size or even large set of
concepts can be distributed to receivers of news items to provide basic
knowledge about all the terms the news item refers to.
In short, the News Item is meant to be as comprehensive a container as
possible for a single news article, conveying both
metadata tags and inline tags along with the news content. Metadata tags
carry all the information regarding the news item as a whole, such as its
online version URI, the author(s), the publication date and the covered topic(s);
inline tags are instead spread throughout the content of the news, both to
give it a well-defined structure and to carry all the additional information
that may be useful to better understand it and to characterise even a single
term inside it.
Having the capability to express and pack together all this information is
pretty much useless if there is no agreement upon its meaning. Moreover, the
information should have a machine-readable representation to be successfully processed
and exchanged by automatic tools. This second issue is readily
addressed thanks to the eXtensible Markup Language [17], chosen by the
IPTC as the first implementation language for its standards (although they
could be implemented in any other language). The issue of shared
semantics is addressed by the IPTC with a couple of devices: the aforementioned
Concept Item abstraction and the NewsCodes. Here is how.
Values for metadata can be controlled or uncontrolled, and it is often desirable
for metadata values to be controlled, that is, restricted to a value or range
of values. One obvious reason for doing so is to convey clear and unambiguous
information about content. If a provider needs to inform a customer that
the content is a photograph, what term should be used: photograph, photo,
picture, pic? They might be understood by a human reader, but ad hoc terms
may not be processed reliably by software.
To this end the IPTC maintains sets of Controlled Vocabularies (CVs) that
are collectively branded NewsCodes [18]. These represent concepts that de-
scribe and categorise news objects in a consistent manner. By standardising
on NewsCodes, providers can ensure a common understanding of news con-
tent and a greater degree of inter-operability between content from different
providers.
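The difference between ad hoc terms and controlled values can be illustrated with a minimal Python sketch. The vocabulary entries, the synonym table and the function name below are hypothetical and only mirror the photograph example above; they are not taken from the actual NewsCodes.

```python
# Hypothetical controlled vocabulary for the kind of content (NOT the real NewsCodes).
CONTROLLED_VALUES = {"text", "picture", "video", "audio"}

# Hypothetical mapping from ad hoc terms to the single controlled value.
SYNONYMS = {"photograph": "picture", "photo": "picture", "pic": "picture"}


def to_controlled(term: str) -> str:
    """Return the controlled value for a term, or raise if the term is unknown."""
    key = term.strip().lower()
    value = SYNONYMS.get(key, key)
    if value not in CONTROLLED_VALUES:
        raise ValueError(f"'{term}' is not in the controlled vocabulary")
    return value


if __name__ == "__main__":
    for term in ("Photo", "pic", "picture"):
        print(term, "->", to_controlled(term))
```

A provider and a customer that both rely on the same table would exchange only the controlled value, so software on either side never has to guess what an ad hoc term means.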
Concept is the generic term used by the IPTC to denote real-world entities,
such as people, organisations and places, and also abstract notions such
as subject categories. Concept Items are then a model for managing this
information and making it available via CVs, enabling a single piece of news
content to be linked to a network of information resources. Using Concept
Items, both the news and the entities found in it can be easily identified
to make the content more accessible and relevant to people's particular
information needs. NewsML Concepts are powerful because they bring meaning
to news content in a way that can be understood by humans and processed
by machines. This model aligns with work being done at the W3C and elsewhere
to realize the Semantic Web [19] vision.
Concept Items, being usable as metadata values, may be either uncontrolled
or controlled. Controlled concepts are managed by an authority (an organ-
isation or company) and are maintained in Controlled Vocabularies. They
are identified by a Concept URI, and their scope is global. Uncontrolled con-
cepts are identified by a literal string; their scope is local to the containing
document. Every concept, whether controlled or uncontrolled, must be identified,
and the identifier used must be unique in its scope. NewsML specifies
that the Concept URI must be a URL and that it should resolve to human-
readable and machine-readable information about the concept.
Just as somehow related News Items can be packed together in a single Package
Item in order to organize them, all the Concept Items useful
for a certain common scope or describing the same entity can be collected
in a single Knowledge Item, acting as an ontology that is both human- and
machine-readable.
Describing in detail how each of the four Items above works, together with its full
list of tags, is beyond the scope of this brief introduction and is not needed
for the remainder of the thesis. Hence I will go
further in the explanation only for the News Item, which in the end is
the actual news, and for the Concept Item, because it is responsible for giving
machine-processable semantics to a news item, a feature upon which I will rely in
my molecules of knowledge model.
The macro structure of a NewsItem is composed of four tags:
is the root element. It wraps everything else, including the other
three tags listed here, and carries some crucial information such as a
unique ID for the document, the XML namespace(s) and the NewsCodes
catalog reference(s), used by NewsML interpreters to resolve
Concept Item URIs;
carries the so-called management metadata, that is, additional
information about news management such as its area of interest (a kind
of broad topic), the provider of the news and its publication status
(whether it is usable, suspended or cancelled);
wraps both administrative and descriptive metadata. Both
regard the news content, but while the former is about the source
of the news, its urgency, and the like, descriptive metadata is strictly
connected to the content, storing for instance its covered topic(s);
is meant to wrap any media type, although it is better to
physically store only text, leaving other media types, such as audio and
video streams, as external references (NewsML has dedicated wrappers
for photos, audio and video, similar to the NewsItem).
One interesting thing about the content of a NewsItem is that text can
be further tagged using other standards, for instance the NITF described in
the next section.
The ConceptItem is quite similar to the NewsItem because it has the same
and sub-sections. What's new is the
element, which is a wrapper for the properties that describe the concept in detail. The
following further tags are used to define a concept:
is the unique identifier of the concept, stored in the form of a
QCode. QCodes consist of two parts, separated by a colon: the first is an
alias (scheme) that can be used to identify the IPTC NewsCode vocabulary
involved (for instance ninat stands for newsItem nature, hence
concepts about the nature of a news item); the second part of the QCode is
a reference into the vocabulary, hence one of its entries. Scheme aliases
are resolved by looking in an online Catalog. The reference(s) to catalog(s)
are carried at the root level of a NewsML document in the
corresponding tag ;
is the name of the concept in natural language;
and describe the nature of a concept. Both properties
demonstrate the use of the subject, predicate, object triple derived from
RDF [20] to express a named relationship with another concept. The
difference between the two properties in application is that can
only express one kind of relationship: is a. The current types agreed
by the IPTC and contained in the concept nature CV are:
cpnat:abstract for an abstract concept;
cpnat:person for a person;
cpnat:organisation for any kind of company;
cpnat:geoArea for a geopolitical area of any size;
cpnat:poi for a somehow defined point of interest;
cpnat:object for any object (similar to the NITF purpose, see later on);
cpnat:event for a newsworthy event.
A uses either a @qcode or @literal to additionally describe
other inherent characteristics of a concept in terms of a named relationship
with another concept. Such a relationship may be identified in
the @rel attribute by a QCode; in this case a controlled vocabulary of
relationships, either maintained by an organisation such as the IPTC
or custom-defined, would also be required.
allows more extensive natural language information to be entered,
even with some mark-up if required.
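The QCode mechanism described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the catalog content and its URIs are placeholders (not the real IPTC scheme URIs), the function names are hypothetical, and resolution is assumed to amount to appending the reference to the scheme URI found in the catalog.

```python
from typing import Dict, Tuple

# Hypothetical catalog: scheme alias -> scheme URI (placeholder URIs, not the IPTC ones).
CATALOG: Dict[str, str] = {
    "ninat": "http://example.org/cv/ninat/",
    "cpnat": "http://example.org/cv/cpnat/",
}


def split_qcode(qcode: str) -> Tuple[str, str]:
    """Split a QCode into (scheme alias, reference) at the first colon."""
    alias, sep, reference = qcode.partition(":")
    if not sep or not alias or not reference:
        raise ValueError(f"malformed QCode: {qcode!r}")
    return alias, reference


def resolve_qcode(qcode: str, catalog: Dict[str, str] = CATALOG) -> str:
    """Resolve a QCode to a concept URI (assumed: scheme URI + reference)."""
    alias, reference = split_qcode(qcode)
    if alias not in catalog:
        raise KeyError(f"unknown scheme alias: {alias!r}")
    return catalog[alias] + reference


if __name__ == "__main__":
    print(resolve_qcode("cpnat:person"))
```

Keeping the alias-to-URI mapping in a separate catalog is what allows short QCodes inside a document while the concept identifiers remain globally unique.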
The opportunity given by NewsML to the user to shape their needed con-
cepts, collect them in a KnowledgeItem and use them in their markup, both
for news metadata and for news content, is a great step toward interoperabil-
ity and automatic semantic processing of knowledge. Particularly important
are the and tags along with the @rel attribute: their combination
actually makes it possible to shape a whole ontology as a set of related ConceptItems!
Before moving on to the NITF standard, I wish to highlight one thing. In the
Introduction I described five areas of opportunity in which computer science
could help journalism and I stated that my work in this thesis would focus
on Document exploration and redundancy by helping journalists to manage
news and find stories. Please notice that other issues such as Combining
information from varied digital sources and Audio and video indexing can
be addressed simply by a widespread adoption of the NewsML standard: it
makes it possible to structure any kind of news source according to the same
set of tags, hence promoting interoperability between different news sources, and has
dedicated newsItem-like objects to convey any kind of media, be it pictures,
video streams or audio files, thus making indexing less necessary
because relevant information is carried as metadata.
3.2 NITF
The NITF (News Industry Text Format) [21] uses the eXtensible Markup
Language (XML) to define the content and structure of news articles. It supports
the identification and description of a number of news characteristics,
among which the most notable are:
Who owns the copyright to the item, who may republish it, and who it is
about;
What subjects, organisations, and events it covers;
When it was reported, issued, and revised;
Where it was written, where the action took place, and where it may be
released;
Why it is newsworthy, based on the editor's analysis of the metadata.
From the few examples given for each of the news facets listed above, it is
clear that the NITF is able to express both additional information about the
content of the news and also metadata regarding the news lifecycle. More-
over, it supports most of the usual plain HTML tags for text structuring.
A NITF document is organized according to its main tags:
is the root element of the document, hence carries attributes to
identify the document, its time and date metadata and its category. It
must contain a head and a body;
holds the metadata about the document as a whole, such as its
, the subject covered thanks to tag,
and , its potential area of interest through the tag and a list of items;
is the content of the document and is divided into the three follow-
ing sub-sections;
could contain either metadata useful to be displayed, such as
the author and contributors to the news article, or an abstract/summary
of the paper;
is the actual content of the news, hence it typically contains
text, references to pictures/videos, quotes and every inline tag and
HTML tag supported by the NITF.
is similar to in that they both could contain ad-
ditional information to be displayed. This usually carries a tagline or a
bibliography.
Since NewsML too has the capability to properly manage news-related metadata,
the NITF somewhat overlaps with it. The best thing to do is to exploit the
NewsML standard to wrap a single news article's content and its metadata
into a properly structured container, that is the along with its
aforementioned metadata sub-tags (hence and ).
Then the NITF should be used to enrich the content of the news through its
inline tags, which is something NewsML cannot do.
NewsML in fact provides neither support for HTML tags to structure a document
nor any form of inline tagging to add information to the plain
text, for instance with the purpose of easing the work of any text mining algorithm
used to automatically process the document. In this sense the NITF
and NewsML are complementary standards, hence they perfectly combine to
shape a very comprehensive and coherent framework to manage the whole
news lifecycle: comprehensive because while one takes care of the overall
structure of news, including metadata, the other focusses on its internal meaning,
making it unambiguous; coherent because they both exploit the same IPTC
abstractions, for instance the NITF too makes use of the NewsCodes
taxonomies.
Here is a list of some of the most used NITF inline tags, called
semantic units by the IPTC:
wraps personal names, of both real and fictitious people. It could
contain the tag if the tagged person is mentioned along with his or her
public role throughout the text. Pay attention when a person's name
is used as a company name or to denote an object, as in
Thomson Reuters or a Picasso painting: in such cases use the proper
tags and ;
typically marks full official titles, such as the correct denotation
of political, commercial, clerical, military, civil appointments but is also
usable for their synonyms and journalistic variants. Such a tag may
even be used to identify members of a profession (job titles) and with
family relations like father, wife as well as for other kinds of roles
such as consultant, employer and the like. The tag may
further be used to identify important (named) or indicative (unnamed)
players in recurring news-relevant scenarios, such as elections (the first
candidate), trials (the special prosecutor), accidents (the driver) and
natural catastrophes, business, cultural or sport events;
serves to identify organisational names. An inner tag ()
makes it possible to add special widely agreed-upon codes, such as codes from the
Standard Industrial Classification (SIC) [22] list or even NewsCodes. It
also covers personification of organisations, as in phrases such as the
Government said. Pay attention when a person's name or even a
location is used as an organisation, for instance in phrases like The Nobel
committee decided... or The White House stated that.... Watch out
also for product names such as The new BMW Z4 sport car... which
call for the proper tag;
identifies geographic locations and significant places. It contains either
mere text or structured information thanks to its possible
inclusions , , , and .
It may also comprise significant man-made structures, such as famous
buildings and constructions, bridges, walls, highways and the
like. As already said, watch out for possible confusion with the
tag and keep in mind to use the proper tag for special cases
such as the Chernobyl catastrophe;
should be limited to newsworthy events or events that carry news
value in the sense of journalism. Factors of news value are for instance
significance, proximity, prominence of the involved persons, consequence,
unusualness, human interest and timeliness. The possible ambiguity
with the tag has already been described above;
should include named news-relevant world objects such as publications
and media types (books, newspapers, CDs, TV series), mass media
channels (TV channels, radio stations), titles of awards and prizes,
names of products and product lines, art objects, animals, ships, buildings
and so on. It could virtually tag anything that is newsworthy and
that no other tag could wrap. It may seem a bit under-constrained,
but it gives the journalist the opportunity to tag specific-interest terms
even according to a controlled vocabulary. For instance, if the news
talks about cancer, then the journalist (or even a software agent) could
exploit either an ad hoc or a widely agreed-upon medical ontology and
tag every interesting term recognized from it, so as to allow semantic
reasoning over the news content!
tags concrete dates and days of the week, religious and bank holidays,
and relative time expressions that may be attributed to a
concrete date, such as Christmas Eve and the like.
Thanks to these pre-defined tags and to the opportunity to constrain their values
to some kind of controlled vocabulary, be it from the NewsCodes
or an ad hoc ontology, the user of the NITF standard has great expressive
power for enriching news content.
Like the NewsML standard, the NITF too can address at least one of
the open issues listed in the Introduction: Information extraction. If a document
is properly NITF-tagged, then its worth-remembering entities are all
machine-processable items, since every NITF tag has a well-defined meaning
and their values too can be formally defined through taxonomies such as
the NewsCodes. A widespread adoption of NewsML and NITF could alone address
many problems regarding news management and sharing.
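To make the information extraction claim concrete, here is a minimal Python sketch that collects the entities wrapped by inline tags in a tagged fragment. The fragment itself is invented, and the element names person, org, location and chron are assumptions made for this illustration; a real NITF document should be checked against the NITF DTD or schema.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Invented, NITF-like fragment; the element names are assumptions for illustration.
SAMPLE = """
<p><person>Mario Rossi</person> met the board of <org>ACME</org> in
<location>Bologna</location> on <chron>Monday</chron>.</p>
"""

# Inline tags considered as entity carriers in this sketch (hypothetical selection).
ENTITY_TAGS = ("person", "org", "location", "chron")


def extract_entities(xml_text: str) -> dict:
    """Collect the text wrapped by each inline tag, grouped by tag name."""
    root = ET.fromstring(xml_text.strip())
    found = defaultdict(list)
    for element in root.iter():
        if element.tag in ENTITY_TAGS:
            found[element.tag].append("".join(element.itertext()).strip())
    return dict(found)


if __name__ == "__main__":
    print(extract_entities(SAMPLE))
```

No text mining is needed here: the entities are recovered simply by walking the markup, which is the point of tagging the content in the first place.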
Chapter 2
Molecules of knowledge model
I have not failed, I have found a thousand ways
not to build a light bulb
- Thomas Edison -
Now that all the necessary knowledge to deal with the molecules model has
been acquired, I would like to give the reader a brief and informal description
of such model, highlighting the main entities and their counterparts drawn
from the biochemical metaphor and from the NewsML and NITF standards.
Then, for each of these entities, possible requirements are devised and a
first specification that fulfils them is given. Finally, the formal molecules of
knowledge model is detailed.
1 Informal introduction to the model
At the beginning of the previous Chapter I gave the reader my vision both of
the model to conceive and of a possible self-* system designed upon it. Such
vision was outlined according to three different phases of the news lifecycle,
namely production, management and consumption. Here I would like to
recall these phases to introduce the main entities of the model, which are
inspired by the biochemical metaphor and grounded in the NewsML and
NITF IPTC standards.
Production. Assuming that every news source exploited by the system's
prosumers is properly structured according to the NewsML and NITF standards,
I will also assume that such sources are reified within the system,
hence in the model too, as seeds, whether they are external or internal.
According to the biochemical metaphor, such seeds can be seen both as
catalysts and as atoms: catalysts because their presence affects the system
behaviour through their continuous injection of knowledge atoms;
atoms because nothing forbids the system to manipulate them as if they
were pieces of knowledge themselves, rather than news sources. The existence
of seeds is extremely important because atoms may fade, hence
information would be lost forever in their absence. Moreover, reifying news
sources as seeds makes it possible to keep all the relevant knowledge inside the
model/system, while any kind of interface agent doing the seeds' job would
make such knowledge external, hence dependent on the agents' availability
(upon which the system could have no control).
A first fundamental entity of the model is hence the seed. Its counter-
part in the IPTC standards could be the News Item as a whole, since
it represents a single source of knowledge. Moreover some of its potentially
worth-remembering properties could be described by NewsML
tags such as to identify the provider (for instance ANSA), for the date, to describe where it is located,
for its author and .
Created and injected by the seed, another of the main model entities
is the atom (of knowledge). Its biochemical counterpart is clear: it is
one of the reagents living in the solution represented by the set of all
the atoms that co-exist in a given chemical compartment. As such, it
will have an associated concentration value, as the chemical metaphor
requires.
Atoms do actually have a clear counterpart in the NewsML and NITF standards:
the tag. Tags can in fact be seen as the atoms that together
compose the news-substance. Hence it is possible to see living within
the system atoms, atoms, atoms,
atoms and almost every other NewsML/NITF tag.
Management. Now that the system is full of wandering atoms, each generated
by its parent seed at a certain rate, they will end up colliding, either
randomly or driven by some well-defined mechanism. The outcome
of these inter-atom interactions is the third fundamental entity of
the model: the molecule of knowledge. According to the chemical
metaphor, molecules can be seen as composite substances in which
there are not many instances of the same atom, that is, a single
species of atom with as many individuals as its concentration value,
but many instances of different atoms.
Molecules are spontaneous, stochastic, environment-driven aggregations
of atoms, possibly reifying some meaningful similarity between
them, hence adding new knowledge to the system. They are spontaneous
in that they simply happen as a natural evolution both of the
internal system behaviour and of the prosumers' interactions; stochastic
as required by the chemical metaphor grounded in the work of
Gillespie [10], which allows for the emergence of a plethora of self-*
properties, above all self-adaptation; driven by the environment
because, although stochastic, their likelihood of actually taking
place is modulated both by other molecules/atoms living in the compartment
and by catalysts that could intervene.
The role of driving such aggregations is taken by another fundamental
abstraction of the model: the chemical reaction. The name is
quite self-explanatory about its biochemical inspiration: chemical reactions are the
transition rules, namely the chemical-like laws, that the chemical engine
reified by the system enacts to evolve itself, that is, the atoms and
molecules (and even seeds) it stores. Since they are meant to create
molecules, they must necessarily be spontaneous, stochastic and
environment-driven, exactly as described above (and in the chemical
metaphor section of the previous Chapter).
Both entities can be grounded in the NewsML and NITF standards:
since molecules are bags of atoms they are actually bags of tags,
hopefully somehow related tags; since molecules should hopefully be
meaningful, the chemical reactions that generate them should not be completely
blind to the nature of their reagents. In other words they
should not be purely random transitions. The application of such chemical laws
may be influenced by structural relationships among their reagent-tags,
relationships that actually exist in NewsML and NITF: for instance
a tag is always inside a tag and
describes metadata regarding a tag.
Moreover, semantic relationships between tag values may be taken
into account too, since both NewsML and NITF give the user the
ability to draw such values from either controlled vocabularies or even
full ontologies.
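The stochastic, environment-driven character of chemical reactions can be illustrated with a single step in the style of Gillespie's algorithm [10]. The sketch below is only an assumption-laden illustration, not the formal model of this thesis: the reaction tuples and names are hypothetical, and the propensity is simplified to the rate constant times the product of the reagent concentrations, ignoring the combinatorial correction for identical reagents.

```python
import math
import random
from typing import Dict, List, Optional, Tuple

# A reaction is (name, rate constant, reagent species); all names are hypothetical.
Reaction = Tuple[str, float, List[str]]


def propensity(reaction: Reaction, concentrations: Dict[str, int]) -> float:
    """Rate constant times the product of reagent concentrations (simplified)."""
    _, rate, reagents = reaction
    value = rate
    for species in reagents:
        value *= concentrations.get(species, 0)
    return value


def gillespie_step(
    reactions: List[Reaction],
    concentrations: Dict[str, int],
    rng: random.Random,
) -> Optional[Tuple[str, float]]:
    """Pick the next reaction to fire and the waiting time before it fires.

    Returns None when no reaction can fire (total propensity is zero).
    """
    weights = [propensity(r, concentrations) for r in reactions]
    total = sum(weights)
    if total <= 0:
        return None
    # The waiting time is exponentially distributed with mean 1 / total.
    delay = -math.log(1.0 - rng.random()) / total
    # The reaction is chosen with probability proportional to its propensity.
    chosen = rng.choices(reactions, weights=weights, k=1)[0]
    return chosen[0], delay
```

Since propensities depend on concentrations, the atoms and molecules currently living in a compartment directly modulate which aggregation is likely to happen next, which is what "environment-driven" means here.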
Consumption. As already said, users of the model/system are prosumers,
hence they also want to consume knowledge rather than solely produce
it. Prosumers should be able to retrieve all the pieces of knowledge
stored within the system, access them to inspect their content and
navigate their relationships in case they are molecules, combine
them to create their own new knowledge and so on.
Notice that every time a prosumer uses an atom/molecule, such usage
action has other effects beyond the actual consequences of the
computation. As already said, such actions can be interpreted by the system's
chemical engine as positive feedback on the relevance/usefulness of an
atom/molecule, hence they should influence the corresponding concentration.
Lack of actions is feedback too, this time negative feedback
that should make atoms and molecules decay as time passes.
Due to all these possible side effects on both the system's state and behaviour
(recall that seeds too can be accessed and manipulated, for
instance their injection rate & concentration), prosumers interacting
with the knowledge can be seen as catalysts/inhibitors, the last main
entity of the model directly drawn from the chemical metaphor. They
will not have any NewsML/NITF counterpart, since they are the journalists
using such standards, or even automatic processors (agents) able
to interact with the knowledge stored in the system.
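The feedback idea described for the consumption phase (usage reinforces, neglect makes knowledge decay) can be sketched as a simple update rule. This is a deliberately naive illustration: the exponential decay, the linear boost per use, the parameter values and the function name are all assumptions, whereas in the model such effects are meant to emerge from chemical reactions.

```python
import math


def update_concentration(
    concentration: float,
    uses: int,
    elapsed: float,
    boost: float = 1.0,
    decay_rate: float = 0.1,
) -> float:
    """Apply exponential decay over `elapsed` time, then add `boost` per use.

    Both the decay law and the linear boost are assumptions of this sketch.
    """
    decayed = concentration * math.exp(-decay_rate * elapsed)
    return decayed + boost * uses


if __name__ == "__main__":
    # An atom that is never used slowly fades; one that is used gets reinforced.
    print(update_concentration(10.0, uses=0, elapsed=5.0))
    print(update_concentration(10.0, uses=3, elapsed=5.0))
```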
Summing up, the molecules of knowledge model is designed around the fol-
lowing abstractions:
seeds - the news sources;
atoms - the NewsML/NITF tags;
molecules - possibly meaningful bags of tags;
chemical reactions - the reifications of the (possibly useful) relationships
among the tags in a bag of tags;
catalysts/inhibitors - the journalists, prosumers of knowledge.
1.1 About topology
Before the next section, in which each of these abstractions is detailed, I wish to
further describe one aspect of the molecules of knowledge model/system that
has only been mentioned until now: distribution.
If the reader remembers, in the first Chapter I stated that the chemical
metaphor alone would not be enough for my model, because it does not account
for any kind of spatial aspect to be considered and thus managed. Such
metaphor was then completed with the concept of chemical compartment
drawn from biology, leading to the biochemical metaphor, able to model and
properly deal with issues related to network topology.
I would like to remark here that such enhancement has not been made merely
to give more expressive power to the model, but that it is strongly encouraged
by the nature of the problem it tries to face, that is knowledge management
in general. In fact, nowadays it is rather unrealistic to design a knowledge management
system that is not distributed among different computational nodes,
possibly crossing administrative domains and located in different places.
Moreover my elected application domain is journalism, where distribution
plays an essential role too. A possible use case for the molecules of knowledge
model could be to help journalists working in a newspaper's newsroom:
they will probably have their own personal devices (be they laptops,
tablets or whatever) in which they store their news sources, annotations,
self-produced articles and the like. The model with all its abstractions could then
be installed on every one of these devices, transforming each of them into a
single chemical compartment, hence with its own seeds, atoms, molecules and
chemical reactions, situated somewhere within the whole network of all the
other chemical compartments, that is all the other journalists (notice that this
will actually be a mobile network).
For these reasons, from now on I will always assume a distributed network
topology to which the molecules of knowledge model is applied, in which every
node is the chemical compartment belonging to a precise prosumer (hence
influenced by a well-defined catalyst), who stores in it his/her own
seeds, atoms, molecules and chemical reactions.
In Section 3 I will talk about spatial interactions and I will describe how to
exploit distribution thanks to neighborhood relationships between compartments
and the diffusion mechanism of atoms/molecules (in truth I will only
mention such relationships, because I will rely on a cited paper).
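As a rough picture of the topology assumed here, the following Python sketch represents each compartment as a bag of items and moves one unit of an item to a randomly chosen neighbour. Everything in it is hypothetical: the compartment names, the neighbourhood relation and the uniform choice of the target are assumptions of this illustration, not the diffusion mechanism detailed later.

```python
import random
from typing import Dict, List

# Hypothetical neighbourhood relation between compartments (one per journalist's device).
NEIGHBOURS: Dict[str, List[str]] = {
    "alice": ["bob", "carol"],
    "bob": ["alice"],
    "carol": ["alice"],
}


def diffuse(
    compartments: Dict[str, Dict[str, int]],
    source: str,
    item: str,
    rng: random.Random,
) -> str:
    """Move one unit of `item` from `source` to a randomly chosen neighbour."""
    if compartments[source].get(item, 0) <= 0:
        raise ValueError(f"{item!r} is not present in compartment {source!r}")
    target = rng.choice(NEIGHBOURS[source])
    compartments[source][item] -= 1
    compartments[target][item] = compartments[target].get(item, 0) + 1
    return target


if __name__ == "__main__":
    state = {"alice": {"molecule-1": 2}, "bob": {}, "carol": {}}
    print(diffuse(state, "alice", "molecule-1", random.Random(42)), state)
```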
2 Model abstractions
In the following sections, each of the model abstractions just highlighted will
be given a set of requirements to satisfy according to the main goal of this
thesis. Along with these needs, possible solutions are described and a first
pseudo-formal specification is given.
2.1 Seeds
Seed requirements can be derived directly from the brief introduction given
at the beginning of the Chapter. Since seeds are the reification of any news
source that a journalist would like to consider in his/her knowledge portfolio,
they should carry some information about it. Moreover, they are responsible
for the injection of atoms of knowledge, hence they should store
meta-information about this process too.
Focussing on news source identification and description, NewsML and the
NITF standards provide a number of tags that are potentially useful: ,
, etc. are just a few of the many previously mentioned. Some kind
of unique identifier for the news source is undoubtedly necessary too: since
I wish to reuse as many features of the NewsML standard as possible, I will
rely on URIs, which have the advantage of being highly encouraged by the W3C
for the Semantic Web vision, for instance in its OWL language. This
collection of tags, along with their content, could then be the first information to
store in a seed, fulfilling the first requirement.
Regarding the injection mechanism, three essential pieces of information should be
remembered: i) first of all, the atoms to be spawned (whose internal structure is
detailed in the next section); ii) then, the concentration of every atom to create,
so as to generate the exact quantity of each at every injection step; iii) finally,
the injection rate, to generate each atom at the right frequency/probability.
Putting these observations together, the following could be a first pseudo-formal
specification of a seed element (I will use a Prolog[23]-like syntax for
its readability):
seed(srcID, srcMeta, [atoms], [concentrations], [rates])
where:
srcID is the URI (or equivalent identifier) of the news source;
srcMeta is the collection of the aforementioned NewsML tags;
[atoms] is the list of every single atom to spawn;
[concentrations] is the list of each atom's initial concentration (possibly
different for each of them);
[rates] is the list of the atoms' injection rates (again, possibly different for each
of them).
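The seed specification above can be mirrored by a small Python data structure, shown here only as a sketch. The class, field and method names are hypothetical, atoms are reduced to plain strings, and the injection rate is interpreted as a per-step probability between 0 and 1, which is an assumption of this example rather than part of the specification.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Seed:
    """Counterpart of seed(srcID, srcMeta, [atoms], [concentrations], [rates])."""

    src_id: str                 # URI (or equivalent identifier) of the news source
    src_meta: Dict[str, str]    # NewsML tags describing the source
    atoms: List[str]            # atoms to spawn (plain strings in this sketch)
    concentrations: List[int]   # quantity of each atom created per injection
    rates: List[float]          # injection rate of each atom (assumed probability)

    def __post_init__(self) -> None:
        if not (len(self.atoms) == len(self.concentrations) == len(self.rates)):
            raise ValueError("atoms, concentrations and rates must have the same length")

    def inject(self, compartment: Dict[str, int], rng: random.Random) -> None:
        """One injection step: each atom is spawned with probability equal to its rate."""
        for atom, amount, rate in zip(self.atoms, self.concentrations, self.rates):
            if rng.random() < rate:
                compartment[atom] = compartment.get(atom, 0) + amount
```

Keeping the three lists aligned by position follows the pseudo-formal specification; a list of (atom, concentration, rate) triples would be an equivalent and arguably safer layout.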
2.2 Atoms
To shape a single atom of knowledge as well as possible, the main
goal is to balance two competing needs: on one hand it should embed
enough knowledge to be useful from both the system's and the prosumers'
point of view; on the other hand the atom is the most primitive piece of
knowledge within the model, hence it should be kept as simple as possible.
I will try to reach the needed equilibrium taking into account the following
complementary facets:
Granularity of knowledge. While grounding the chemical metaphor in the
NewsML and NITF standards, I stated that any of their tags could be
mapped onto a single atom. Hence, following their structure and semantics,
a six-level scale for the granularity of a piece of knowledge can be
identified:
1. the single NITF tag (finest granularity);
2. a descriptive or administrative wrapper;
3. the , or wrappers;
4. the whole ;
5. a single tag within the of a ;
6. the whole container (coarsest granularity).
Note that having a single abstraction able to cover all these
different quantities of information may seem to overlap with the
molecule abstraction, making the latter useless. This is not the case,
because molecules are a completely different concept: an atom may be as
comprehensive as needed but will always be a single indivisible unit
of information; a molecule instead is the reification of a number of
relationships between different atoms, possibly coming from different seeds.
Context of knowledge. Any piece of knowledge could be misleading if taken
out of its context, because the context is the set of environmental
conditions needed to interpret it correctly. In other words, context
gives, or at least enriches, the semantics of a piece of knowledge,
ultimately allowing for a better (or simply correct) understanding of it.
Thus it will undoubtedly be useful to embed a certain degree of semantic
description in an atom, rather than its content alone. Here the
NewsML and NITF standards come in handy with a few features:
i) being standards their tags have a well-defined meaning, ii) since they
are implemented in XML they are highly interoperable and easily
exchangeable, iii) tag values too may have formal semantics thanks
to NewsCodes or external ontologies (coded as Knowledge Items).
For these reasons a first enrichment to an atom's content could be to
also store the related NewsML/NITF tag that wraps it, but this alone
isn't enough.
It has already been explained how NITF tags can suffer from some
ambiguity about their usage, but even more problems could be
faced. Consider the following phrases: "Mr. Marchionne is CEO
of FIAT" and "FIAT has provided a thousand new job opportunities". In
both cases FIAT should be tagged with the tag, but while
in the first case it plays the role of the object, namely answering the
question "Mr. Marchionne is CEO of what?", in the second it is the
subject, hence the "who".
Hence it could be useful to state explicitly which one of the famous 5
Ws of journalism the current tag is describing, that is, whether it is about
the Who, What, Where, When or Why. That is another useful piece of
information to store in an atom.
That is not all. Since NewsML and NITF tag values could be
drawn from controlled vocabularies or even ontologies, their meaning is
asserted unambiguously once and for all by these taxonomies. Hence,
I could inject into an atom some information to identify them, namely
the QCode and catalogue: both are logical names that together address
a web page (or even a local file, if their scope is local to the user's
company) in which the schema is formally defined in both machine- and
human-readable form.
Relevance/Usefulness of knowledge. A defining property of a news item is its
relevance, that is, how interesting it is perceived to be both by the
professionals who manage it and by the target audience to whom it is directed.
Moreover, every news item has some kind of usefulness, measured according
to some criteria: for instance, the level of new knowledge acquired by a
reader or even the economic revenue it could generate. These are somehow
two faces of the same coin: just as more relevant news is expected to
be more useful to readers/journalists, so useful news may spread
through readers and publishers, gaining relevance.
Since atoms carry some piece of information extracted from a news item, it
is quite natural to distribute the relevance/usefulness of the original
source of knowledge as a whole among the (possibly) many atoms
extracted from it.
Another defining property of a news item is, as the word itself suggests,
its novelty, that is, both how new the knowledge it provides is
with respect to the current environment and how new it is
with respect to the passing of time: it is obvious that as news items become
older and older they lose relevance and public interest, following a graceful
degradation process. As done before for relevance/usefulness, this
time-dependency property could be easily transferred to the atoms
of knowledge: the less they are shared and used by cooperating
journalists, the more they are going to lose their cultural/economic value.
Since these three facets of a news item, namely relevance, usefulness and
novelty, so deeply influence one another, they could all be
modeled with a single abstraction: the concentration.
From the biochemical metaphor, in fact, it is known that an atom's/molecule's
concentration is a measure of its activity level, namely how much it
could and should influence the overall chemical behaviour of the
solution (system). Since such concentration is subject to a time-dependent
fading mechanism, namely atoms'/molecules' decay, the mapping between
relevance/usefulness and concentration is perfect!
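To make the fading idea concrete, the following sketch models decay as an exponential law with a half-life, and reinforcement as a simple increment per use. Both choices are assumptions made only for illustration: the text states that concentration fades over time and that unused knowledge loses value, but it does not prescribe any specific law. The names `decay`, `reinforce` and `half_life` are hypothetical.

```python
def decay(concentration: float, elapsed: float, half_life: float) -> float:
    """Concentration left after `elapsed` time units (assumed exponential law)."""
    if half_life <= 0:
        raise ValueError("half_life must be positive")
    return concentration * 0.5 ** (elapsed / half_life)


def reinforce(concentration: float, uses: int) -> float:
    """Assumed reinforcement: each use by a prosumer adds one unit."""
    return concentration + uses


fresh = 8.0
stale = decay(fresh, elapsed=2.0, half_life=1.0)  # two half-lives have passed
shared = reinforce(stale, uses=3)                 # three prosumers used the atom
```

Any monotonically decreasing law would serve the same purpose; the half-life form is chosen here only because its single parameter is easy to read.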
Summing up, an atom of knowledge should not carry only the content of a
(piece of) news, that is, the tag along with the tagged term/phrase, because
this way its semantics might not be clear. I have identified two other pieces
of knowledge that are worth remembering and useful to better convey
semantics: i) one of the 5 Ws and ii) the QCode and catalogue information.
Moreover, concentration too should be made explicit, so as to model the atom's
relevance/usefulness (and novelty too). As a last bit of information, since
atoms are automatically injected by their own parent seed, it could be useful
to bring
some data from such seed to the atom.
Here is a possible atom syntax:
atom(srcID, info(tag, content), meta(w, qcode, catalogue), concentration)
where:
srcID is taken from the source seed;
info(tag, content) is the actual piece of news the atom conveys, hence some
content (from the whole paper down to a single term in it) along
with its tag;
meta(w, qcode, catalogue) is the additional information that helps clarify the
atom's semantics, thus one of the 5 Ws and the QCode and catalogue
information grounded in the NewsML/NITF standards;
concentration is the actual activity level of the atom. Notice that this value
will necessarily coincide with the one specified in the source seed only at
injection time: later on it will evolve according to the system behaviour.
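The atom term can be mirrored by nested records, one per sub-term. The sketch below is illustrative only: the class names, the `make_atom` helper, the `org` tag name, the QCode `ex:fiat` and the catalogue URI are hypothetical placeholders, not values taken from NewsML or NITF. It reuses the FIAT example to show how the w field separates two atoms that share tag and content.

```python
from typing import NamedTuple

FIVE_WS = {"who", "what", "where", "when", "why"}


class Info(NamedTuple):
    tag: str        # NewsML/NITF tag wrapping the content
    content: str    # tagged term, phrase or larger piece of news


class Meta(NamedTuple):
    w: str          # which of the 5 Ws the tag describes
    qcode: str      # logical name of the value in its vocabulary
    catalogue: str  # logical name resolving where the vocabulary is defined


class Atom(NamedTuple):
    src_id: str
    info: Info
    meta: Meta
    concentration: int


def make_atom(src_id, tag, content, w, qcode, catalogue, concentration):
    """Build an atom, rejecting a w value outside the 5 Ws."""
    if w not in FIVE_WS:
        raise ValueError(f"unknown W: {w!r}")
    return Atom(src_id, Info(tag, content), Meta(w, qcode, catalogue), concentration)


# Hypothetical values: same tag and content, different journalistic role.
catalogue = "http://example.org/catalogue"
as_object = make_atom("news-1", "org", "FIAT", "what", "ex:fiat", catalogue, 5)
as_subject = make_atom("news-2", "org", "FIAT", "who", "ex:fiat", catalogue, 5)
```

The two atoms compare equal on `info` but not on `meta`, which is exactly the distinction the w field is meant to capture.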
2.3 Molecules
Molecules of knowledge may seem the most complex abstraction to deal
with, because in the end all the others are built around them. In fact,
chemical reactions consume seed-generated atoms to forge molecules, creating
new knowledge within the system, while catalysts inspect them to acquire
knowledge.
In truth, a very simple interpretation of what a molecule is can be given,
assuming that chemical reactions, to which they are deeply related and on
which they depend, are properly shaped. How? Here follows my explanation.
Since molecules of knowledge are reifications of interactions among different
pieces of news, they are full of implicit semantics about such interactions.
Moreover, molecules are hopefully composed pursuing some goal and according
to some criteria: for instance, the chemical engine could try to aggregate
atoms that are similar on a topic basis, for geographical reasons or because
they are
chronologically ordered. Then the implicit meaning that a certain molecule
carries is actually given by the particular chain of chemical reactions that
shaped it over time.
Thanks to negative feedback, there is no need to teach the system how to
build only useful aggregations and how to detect and discard meaningless
ones: the latter will simply fade away through an emergent natural selection
process, driven both by the system's internal behaviour and by external
prosumers' interactions. Then there is no reason to explicitly state either why
a certain molecule has been generated or how its atoms are related to
one another. In other words, the aforementioned aggregations' semantics
could remain implicit: if relationships are relevant/useful, they will survive
because a number of prosumers see some meaning in them; otherwise, if
nobody finds them interesting, such molecules will simply decay until death.
For these reasons, the simple interpretation I am talking about is that a
molecule of knowledge could be viewed as a bag of atoms, hence a single
unordered set of somehow related atoms. According to this interpretation,
a molecule could be simply shaped as follows:
molecule([atoms], concentration)
where:
[atoms] is the list of all the atoms currently bonded together by the
molecule, hence the pool of related pieces of knowledge that a certain
chain of reactions has aggregated during natural system evolution;
concentration is the actual concentration of the molecule.
Please notice that a single atom inside the [atoms] list does not have exactly
the same internal structure as a standalone atom. Since it is now part of
a greater aggregation, its concentration is no longer meaningful because the
molecule has its own, hence it is removed from the atom's syntax.
Thus, the complete structure of a molecule (omitting a whole list of atoms
for brevity) should be as follows:
molecule([atom(srcID, info(tag, content), meta(w, qcode, catalogue)), ...],
concentration)
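The "bag of atoms" reading, together with the removal of the per-atom concentration, can be sketched as follows. This is a minimal illustration under stated assumptions: atom fields are flattened for brevity, the bag is modeled as a `frozenset` (so order is irrelevant and duplicates collapse), and the `bond` helper, the tag names and all the example values are hypothetical.

```python
from typing import FrozenSet, Iterable, NamedTuple


class Atom(NamedTuple):
    """Standalone atom, with flattened fields."""
    src_id: str
    tag: str
    content: str
    w: str
    qcode: str
    catalogue: str
    concentration: int


class BondedAtom(NamedTuple):
    """Atom inside a molecule: same fields, no concentration of its own."""
    src_id: str
    tag: str
    content: str
    w: str
    qcode: str
    catalogue: str


class Molecule(NamedTuple):
    atoms: FrozenSet[BondedAtom]  # unordered set of somehow related atoms
    concentration: int            # the molecule's own concentration


def bond(atoms: Iterable[Atom], concentration: int = 1) -> Molecule:
    """Drop each atom's concentration and collect the atoms in one molecule."""
    stripped = frozenset(BondedAtom(*atom[:-1]) for atom in atoms)
    return Molecule(stripped, concentration)


# Hypothetical atoms coming from two different seeds.
a = Atom("src-A", "org", "FIAT", "who", "ex:fiat", "ex-cat", 7)
b = Atom("src-B", "location", "Turin", "where", "ex:turin", "ex-cat", 2)
m = bond([a, b])
```

Since the atoms are held in a set, `bond([a, b])` and `bond([b, a])` denote the same molecule.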
2.4 Chemical reactions
In the previous section, in which an informal introduction to the model's
abstractions was given, I stated a couple of interesting things regarding
chemical reactions. First of all, they are responsible for the consumption of
atoms and the production of molecules, but this is quite obvious. What is not
so obvious is how molecules are produced and atoms are consumed, in the sense
of which criteria are used to bind atoms together in a molecule and which
mechanisms actually do so. Now I am going to recall these interesting things.
First of all, since most of the NewsML and NITF tags have well-defined
dependency relationships, a chemical law could exploit them to pack some
kind of NewsML/NITF-compliant molecule. For instance, the self-* system
built upon this ongoing model could decide to pack together all the tags
(along with their content) nested in a tag. This could happen
because they are frequently accessed together, thus the system tries to
reduce search latency: before the molecule exists, all the single atoms have to
be retrieved; with the molecule this is done in one shot by looking directly
for it.
Moreover, virtually every NewsML/NITF tag could have its admissible
values collected, stored and formally defined by a controlled vocabulary or an
ontology, hence semantical relationships too could be exploited by chemical
reactions! When semantics enters the field of computation and interaction, a
plethora of interesting and meaningful behaviours arise to be explored. For
instance, the chemical engine may browse the source taxonomies of tag values
to: i) discover whether two different terms are synonyms, hypernyms, and the
like, then decide to aggregate the corresponding atoms in a thesaurus molecule;
or ii) navigate relationships among different concepts from the same ontology
and reify such links, such as understanding that the Minister of Defense is a
member of the Government, thus is in the staff of the Prime Minister, and
reify such reasoning by putting them both in a taxonomy molecule.
Finally, the most obvious relationship between atoms must not be omitted:
if they carry the same content they are undoubtedly related (maybe such a
relationship is trivial, hence useless, but it exists anyway)! By content here
I mean the true content, hence only the tagged term or phrase without
considering the tag. This makes it possible to relate different atoms (thus
possibly different news sources) in which the same thing is tagged differently,
for instance when news A says "Termini Imerese is in trouble" and news B says
"employees are occupying Termini Imerese factory": the first Termini Imerese
tag could probably be a because the term is used in place of "FIAT's Termini
Imerese factory", while the second tag could be a tag because Termini
Imerese is really a city.
Summing up, a first collection of patterns to join atoms into molecules could
be based upon:
the tag field inside the info(tag, content) term of an atom, in the case
of a structural relationship between different NewsML/NITF tags;
the whole meta(w, qcode, catalogue) term if the relationship is semantical;
the content alone inside the info(tag, content) term of an atom whenever
a subject-based link has to be reified into a molecule.
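Of the three patterns just listed, the content-based one is the easiest to sketch, because it needs no external knowledge: atoms are grouped by their content alone, ignoring the tag. The example below is a hedged illustration; the `candidates_by_content` helper and the tag names `org` and `location` are hypothetical. The structural and semantical patterns would additionally need a table of tag dependencies or a taxonomy, which is not modeled here.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Atom(NamedTuple):
    src_id: str
    tag: str
    content: str


def candidates_by_content(atoms: List[Atom]) -> Dict[str, List[Atom]]:
    """Group atoms sharing the same content, whatever their tag or source."""
    groups = defaultdict(list)
    for atom in atoms:
        groups[atom.content].append(atom)
    # Only groups with at least two atoms can be joined into a molecule.
    return {content: group for content, group in groups.items() if len(group) > 1}


# Hypothetical tagging of the two news items discussed in the text.
atoms = [
    Atom("news-A", "org", "Termini Imerese"),
    Atom("news-B", "location", "Termini Imerese"),
    Atom("news-B", "org", "FIAT"),
]
candidates = candidates_by_content(atoms)
```

Here the two Termini Imerese atoms end up in the same candidate group even though they come from different sources and carry different tags.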
Now I have answered the first question from the beginning, which was about
possible criteria upon which molecules are composed. What is left is question
two: which mechanisms should be used to aggregate atoms, producing molecules?
The answer is directly provided by the biochemical metaphor: chemical
reactions are the tool. I am not going to list all the possible concrete
chemical reactions to inject in the system to obtain every possible
instantiation of the above described patterns; I am just going to define the
structure and semantics of a general-purpose chemical law for each of the
patterns, in the sense of how many reagents it may have, of which kind, how
they should be similar to one another, what the produced substance is, and the
like. First of all let us see the common look that every chemical law will have.
Following literally the interpretation of molecules as bags of atoms, a
chemical reaction simply takes a list of atoms as input reagents and produces a
single molecule as output product. Both involved concentrations, hence
reagents' and products', are a single unit, thus a single instance of each
input atom is consumed and a single instance of the output molecule is
generated.
But this way molecules cannot be part of a chemical reaction as reagents,
hence they cannot be consumed except by prosumers. This is undesirable,
because molecules are living and evolving entities pretty much like atoms,
thus nothing should forbid them from joining one another or absorbing
additional atoms.
Adding such a feature, a generic chemical reaction could look like this
(omitting internal fields for the sake of clarity):
( atom | molecule ) --r_join--> molecule([atoms], concentration++)
where reagents could be any combination of any number of atoms and molecules
while the product is exactly one molecule aggregating all the atoms on the
left-hand side. This suggests that reagent molecules are somehow unpacked to
extract their atoms and inject them in the new molecule. Please remember what
was said about the [atoms] list in the previous section to avoid confusion
regarding notation.
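The general join law can be sketched over a toy solution. The sketch rests on several assumptions that are not in the text: the solution is a `collections.Counter` used as a multiset, so the concentration of a substance is simply its multiplicity; atom fields are reduced to three; and the `join` function and all example values are hypothetical. It consumes one unit of every reagent, unpacks reagent molecules, and adds one unit of the product.

```python
from collections import Counter
from typing import FrozenSet, Iterable, NamedTuple, Union


class Atom(NamedTuple):
    src_id: str
    tag: str
    content: str


class Molecule(NamedTuple):
    atoms: FrozenSet[Atom]


Reagent = Union[Atom, Molecule]


def join(solution: Counter, reagents: Iterable[Reagent]) -> Molecule:
    """Consume one unit of each reagent and produce one unit of a molecule."""
    reagents = list(reagents)
    needed = Counter(reagents)
    if any(solution[r] < n for r, n in needed.items()):
        raise ValueError("not enough reagents in the solution")
    bag = set()
    for r in reagents:
        # Reagent molecules are unpacked; atoms are added as they are.
        bag |= r.atoms if isinstance(r, Molecule) else {r}
    product = Molecule(frozenset(bag))
    solution.subtract(needed)   # one unit of each reagent is consumed
    solution[product] += 1      # concentration++ on the product
    return product


# Hypothetical solution: two units of a1, one each of a2 and a3.
a1 = Atom("news-A", "org", "FIAT")
a2 = Atom("news-A", "location", "Termini Imerese")
a3 = Atom("news-B", "org", "FIAT")
solution = Counter({a1: 2, a2: 1, a3: 1})
m1 = join(solution, [a1, a2])   # two atoms form a molecule
m2 = join(solution, [m1, a3])   # the molecule absorbs one more atom
```

After the two reactions one unit of `a1` is still free in the solution, `m1` has been consumed, and `m2` holds all three atoms.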
Now that the most general-purpose chemical-like law has been presented, it
is time to describe its concrete applications to obtain the aforementioned
patterns. As already said, the following are still general-purpose laws,
because they only state which reagents should be similar to which for the
reaction to be applied, and similar information.
The first chemical reaction is meant to produce molecules that aggregate
structurally related atoms, based upon the well-defined relationships among
NewsML and NITF tags. Assuming the use of apices () to denote some structural
dependency among tags, such a chemical reaction could be as follows (omitting
unnecessary fields to enhance readability):
( atom(srcID, info(tag, _), _, 1) | molecule([atom(srcID, info(tag, _), _), ...], 1) )
--r_structural_join-->
molecule([atoms], concentration++)
This law states that: i) only atoms/molecules all coming from the same news
source can be bound together, ii) such reagents' tag fields should have
some dependency according to the structural constraints of the NewsML and
NITF standards. Other aspects of the law are inherited from the
general-purpose one already described: for instance, one unit of concentration
is involved, reagents can be in any number, and input molecules should be
unpacked.
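The two preconditions of the structural law can be sketched as a guard function. This is only one possible reading, under explicit assumptions: "structural dependency" is interpreted as "nested in the same parent tag", the nesting is supplied as a hypothetical child-to-parent table, and the tag names used below are placeholders rather than real NewsML/NITF names.

```python
from typing import Dict, Iterable, NamedTuple


class Atom(NamedTuple):
    src_id: str
    tag: str
    content: str


def can_structural_join(atoms: Iterable[Atom], parent_of: Dict[str, str]) -> bool:
    """True if all atoms share the source and their tags share a parent tag."""
    atoms = list(atoms)
    if len(atoms) < 2:
        return False
    same_source = len({a.src_id for a in atoms}) == 1
    parents = {parent_of.get(a.tag) for a in atoms}
    # Assumed notion of dependency: exactly one, known, common parent.
    same_parent = len(parents) == 1 and None not in parents
    return same_source and same_parent


# Hypothetical nesting table: both child tags live inside "wrapper".
parent_of = {"child-1": "wrapper", "child-2": "wrapper", "other": "elsewhere"}
x = Atom("news-A", "child-1", "first value")
y = Atom("news-A", "child-2", "second value")
z = Atom("news-B", "child-2", "second value")
w = Atom("news-A", "other", "third value")
```

With these values, `x` and `y` satisfy the guard, while `x` with `z` fails on the source and `x` with `w` fails on the tag dependency.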
Going on to the second aggregation pattern, I assume that symbols () and ()
denote some kind of semantical relationship between terms, for instance
according to a thesaurus or ontology involving such terms. This kind
of NewsCodes-based chemical reaction could be shaped as follows:
( atom(_, info(_, content), meta(_, qcode, catalogue), 1) |
molecule([atom(_, info(_, content), meta(_, qcode, catalogue)), ...], 1) )
--r_semantical_join-->
molecule([atoms], concentration++)
Such transition rule states that: i) no m