Topological Field Theory of Data - Sysma Unit|sysma.imtlucca.it/cina/lib/exe/fetch.php?media=merelli.pdf · 1. topological data analysis: homology methods to extract piecewise-linear

Topological Field Theory of Data

A program towards a novel strategy for mining data through data language

Emanuela Merelli

School of Science and Technology, University of Camerino

Big Data = Big Challenge

While ICT is becoming an integral part of the fabric of nature and society, the

boundaries between digital and physical worlds progressively fade away and DATA management

becomes a challenge

Today 4.7 billion people have a mobile phone

each day 35 billion SMSs are exchanged

700 million pictures are uploaded on Facebook every year

1 billion cars circulate and over 2 billion people fly on airplanes

How big are Big Data? In 2014

DATA created and exchanged amounted to 6 zettabytes (1 ZB = 10^21 bytes); in 3 years we will reach

1 yottabyte (1 YB = 10^24 bytes), more than Avogadro’s number!

The growth of population, urbanization, commercial exchanges and global migrations are more and more entangled with sophisticated ICT technologies,

generating a unique, interconnected ‘socio-technical’ system

[Video: WAT 24h.mpg, the air transport network]

Big Data = Big IT Challenge

of dealing with the huge amount of information flowing in and around complex systems, endowing ICT with new, more efficient tools to play a role in turning

data —> information —> knowledge —> wisdom

classical data mining

Digital traces of human behavior

Bevies of Starlings

Multi-agent – Multi-scale – Emergent effects

From individual to collective behavior

Big Data = Big TCS Challenge

ICT plays a fundamental role in providing efficient tools for turning

data -> behaviors -> languages -> model of computation

a new strategy for mining data through data language

The story began within TOPDRIM - a FP7 FET Project

Our aim was to tame Big Data with TOPOLOGY (the geometry of ‘shapes’)

A fundamental notion in computer science, when dealing with data, is the concept of

‘SPACE of DATA’

• the structure (geometrically represented) in which information is encoded;

• the frame for algorithmic (digital) thinking;

• the lode in which to perform DATA MINING, i.e., to extract patterns of information

why topology?

Topology because:

• it is the branch of mathematics that deals with both local and global qualitative geometric information of the data space.

• connectivity, the classification of loops at any dimension of the manifold, and all other topological invariants are preserved under homeomorphisms of the background data space.

• it does not depend on coordinates but only on intrinsic geometric features: it is “coordinate-free”. It ignores the notion of distance and replaces it with the concept of proximity, i.e. connective “nearness”.

topological data analysis

Simplicial complex: a collection of simplices closed under taking faces (every face of a simplex in the collection also belongs to the collection).

Simplex: a generalization of the notion of a triangle or tetrahedron to arbitrary dimensions.

Persistent homology: a way to analyse the data space by comparing topological invariants (e.g. Betti numbers) across a sequence of simplicial complexes.

From Data to Simplicial Complexes

Increasing ε, the simplicial complexes form an increasing (nested) sequence.

Which is the right ε such that the simplicial complex approximates the shape of the point cloud well?
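The construction on this slide can be illustrated with a minimal sketch. It assumes the Vietoris-Rips rule (a simplex is included when all its vertices are pairwise within distance ε); the slides do not say which construction is used, and the function name `rips_complex` and the square example are hypothetical.

```python
import math
from itertools import combinations

def rips_complex(points, eps, max_dim=2):
    """Vietoris-Rips complex (assumed construction) of a point cloud at scale eps.

    Simplices are tuples of point indices; a simplex is included when every
    pair of its vertices is at distance <= eps.
    """
    n = len(points)
    simplices = {(i,) for i in range(n)}          # 0-simplices: the points
    for k in range(2, max_dim + 2):               # k vertices -> (k-1)-simplex
        for combo in combinations(range(n), k):
            if all(math.dist(points[i], points[j]) <= eps
                   for i, j in combinations(combo, 2)):
                simplices.add(combo)
    return simplices

# Hypothetical example: the four corners of a unit square.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
small = rips_complex(square, 0.5)    # isolated points only
medium = rips_complex(square, 1.0)   # the four sides appear: a loop
large = rips_complex(square, 1.5)    # diagonals and triangles fill the loop

print(len(small), len(medium), len(large))   # 4 8 14
print(small <= medium <= large)              # True: the sequence is nested
```

In this toy case the loop of the square is visible only for intermediate values of ε, which is why a whole range of scales is examined rather than a single "right" one.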

Homology driven

Betti numbers as topological invariants that measure the numbers of holes at any dimension
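As a hedged illustration of how Betti numbers count holes, the sketch below computes them from the ranks of boundary matrices, using b_d = (number of d-simplices) - rank ∂_d - rank ∂_{d+1}. It works over the real numbers with NumPy; the function names and the triangle example are assumptions made for illustration, not part of the slides.

```python
import numpy as np

def boundary_matrix(k_simplices, faces):
    """Matrix of the boundary operator from k-simplices to their (k-1)-faces."""
    index = {f: row for row, f in enumerate(faces)}
    D = np.zeros((len(faces), len(k_simplices)))
    for col, s in enumerate(k_simplices):
        for i in range(len(s)):
            # Dropping the i-th vertex gives a face with sign (-1)^i.
            D[index[s[:i] + s[i + 1:]], col] = (-1) ** i
    return D

def betti_numbers(simplices, max_dim):
    """Betti numbers b_0..b_max_dim of a complex given as sorted index tuples."""
    by_dim = [sorted(s for s in simplices if len(s) == d + 1)
              for d in range(max_dim + 2)]
    rank = [0] * (max_dim + 2)                    # rank[0] = 0: no (-1)-simplices
    for d in range(1, max_dim + 2):
        if by_dim[d]:
            D = boundary_matrix(by_dim[d], by_dim[d - 1])
            rank[d] = int(np.linalg.matrix_rank(D))
    return [len(by_dim[d]) - rank[d] - rank[d + 1] for d in range(max_dim + 1)]

# Hypothetical example: a hollow triangle has one component and one loop.
hollow = {(0,), (1,), (2,), (0, 1), (0, 2), (1, 2)}
filled = hollow | {(0, 1, 2)}        # adding the 2-simplex fills the loop

print(betti_numbers(hollow, 1))      # [1, 1]
print(betti_numbers(filled, 1))      # [1, 0]
```

The input must be closed under taking faces, otherwise the face lookup fails; dedicated persistent homology libraries use far more efficient reductions than a dense rank computation.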

Betti barcode, a way to visualize the situation
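A barcode records, for each topological feature, the scale at which it appears and the scale at which it disappears. The sketch below computes only the dimension-0 bars (connected components) of a distance-based filtration with a union-find structure; the function name `h0_barcode` and the three-point example are hypothetical, and higher-dimensional bars need a full persistence algorithm.

```python
import math
from itertools import combinations

def h0_barcode(points):
    """(birth, death) intervals of connected components as the scale grows."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Find the representative of i's component, with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Process edges by increasing length, as in a growing-eps filtration.
    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))
    bars = []
    for eps, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                 # two components merge: one bar ends at eps
            parent[ri] = rj
            bars.append((0.0, eps))
    bars.append((0.0, math.inf))     # the last component never dies
    return bars

# Hypothetical example: two nearby points and one far away.
print(h0_barcode([(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]))
# [(0.0, 1.0), (0.0, 3.0), (0.0, inf)]
```

Long bars correspond to features that persist over many scales; short bars are usually read as noise.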

A successful application within TOPDRIM project

rsif.royalsocietypublishing.org

Research

Cite this article: Petri G, Expert P, Turkheimer F, Carhart-Harris R, Nutt D, Hellyer PJ, Vaccarino F. 2014 Homological scaffolds of brain functional networks. J. R. Soc. Interface 11: 20140873. http://dx.doi.org/10.1098/rsif.2014.0873

Received: 5 August 2014

Accepted: 3 October 2014

Subject Areas: computational biology

Keywords: brain functional networks, fMRI, persistent homology, psilocybin

Author for correspondence: P. Expert, e-mail: [email protected]

Electronic supplementary material is available at http://dx.doi.org/10.1098/rsif.2014.0873 or via http://rsif.royalsocietypublishing.org.

Homological scaffolds of brain functional networks

G. Petri¹, P. Expert², F. Turkheimer², R. Carhart-Harris³, D. Nutt³, P. J. Hellyer⁴ and F. Vaccarino¹,⁵

¹ ISI Foundation, Via Alassio 11/c, 10126 Torino, Italy

² Centre for Neuroimaging Sciences, Institute of Psychiatry, Kings College London, De Crespigny Park, London SE5 8AF, UK

³ Centre for Neuropsychopharmacology, Imperial College London, London W12 0NN, UK

⁴ Computational, Cognitive and Clinical Neuroimaging Laboratory, Division of Brain Sciences, Imperial College London, London W12 0NN, UK

⁵ Dipartimento di Scienze Matematiche, Politecnico di Torino, C.so Duca degli Abruzzi no 24, Torino 10129, Italy

Networks, as efficient representations of complex systems, have appealed to scientists for a long time and now permeate many areas of science, including neuroimaging (Bullmore and Sporns 2009 Nat. Rev. Neurosci. 10, 186–198. (doi:10.1038/nrn2618)). Traditionally, the structure of complex networks has been studied through their statistical properties and metrics concerned with node and link properties, e.g. degree-distribution, node centrality and modularity. Here, we study the characteristics of functional brain networks at the mesoscopic level from a novel perspective that highlights the role of inhomogeneities in the fabric of functional connections. This can be done by focusing on the features of a set of topological objects, homological cycles, associated with the weighted functional network. We leverage the detected topological information to define the homological scaffolds, a new set of objects designed to represent compactly the homological features of the correlation network and simultaneously make their homological properties amenable to networks theoretical methods. As a proof of principle, we apply these tools to compare resting-state functional brain activity in 15 healthy volunteers after intravenous infusion of placebo and psilocybin, the main psychoactive component of magic mushrooms. The results show that the homological structure of the brain’s functional patterns undergoes a dramatic change post-psilocybin, characterized by the appearance of many transient structures of low stability and of a small number of persistent ones that are not observed in the case of placebo.

1. Motivation

The understanding of global brain organization and its large-scale integration remains a challenge for modern neurosciences. Network theory is an elegant framework to approach these questions, thanks to its simplicity and versatility [1]. Indeed, in recent years, networks have become a prominent tool to analyse and understand neuroimaging data coming from very diverse sources, such as functional magnetic resonance imaging (fMRI), electroencephalography and magnetoencephalography [2,3], also showing potential for clinical applications [4,5].

A natural way of approaching these datasets is to devise a measure of dynamical similarity between the microscopic constituents and interpret it as the strength of the link between those elements. In the case of brain functional activity, this often implies the use of similarity measures such as (partial) correlations or coherence [6–8], which generally yield fully connected, weighted and possibly signed adjacency matrices. Despite the fact that most network metrics can be extended to the weighted case [9–13], the combined effect of complete connectedness and edge weights makes the interpretation of functional networks significantly harder and motivates the widespread use of ad hoc thresholding methods [7,14–18]. However, neglecting weak links incurs the dangers of a trade-off

© 2014 The Authors. Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, provided the original author and source are credited.


[Figure: comparison of “normal” subjects (placebo) and subjects treated with psychoactive drugs (psilocybin)]

TOPDRIM: some successful applications so far

• The Brain: functions from fMRI data

• The Immune System complexity: models in the topological frame

• The RNA combinatorics

• Communication among biomolecules

• Complex networks (hypernetworks; networks of networks) dynamics

• Our goal is the construction of a Topological Field Theory of Data (TFTD), giving us a new ‘camera’ for reading complex systems (in nature and society)

• a camera whose ‘photographs’ do not consign reality to the past moment in which they are shot, but enable us to predict a piece of the future;

• a TFTD whose gauge group looks into the transformation properties of the space of data, revealing hidden complex patterns, somewhat like the vortices in turbulent water that only Leonardo’s eye was ever able to catch.

• A gauge group G is suitable to determine the elements, and the interactions among elements, of a given gauge theory. … a group of symmetries over a fiber bundle, the G-bundle.

X = space of data

S = simplicial complex

S∞ = S after persistent homology

SM = S after Morse analysis

F = fiber bundle; fj = fiber

Gauge group -> G = GMC ⋀ GP

Topological Field Theory of Data

• Reconstruct all the characteristics of non-linear interaction by a group of transformations G of the space of data that preserve the topology.

• A gauge group G: a group of symmetries over a fiber bundle. It determines the elements and the interactions among elements.

• The cosets of G order data in equivalence classes with respect to isotopy, leading to a canonical system in the related process algebras P;

Gauge group -> G=GMC ⋀GP

The three pillars the schema rests on are:

1. topological data analysis: homology methods to extract a piecewise-linear object representing the space of data

2. topological field theory: a group of transformations of the space of data that preserve the topology

3. formal language: a language to translate the semantics of the transformations induced by the field dynamics

Flow chart of the proposed approach

FP7-ICT-2011-C STREP short proposal 04-09-2012 BINÀH

Proposal Part B: page [3] of [6]

Project BINÀH’s three pillars: i) Topological data analysis, based on data space geometric/combinatorial architecture; ii) Topological field theory for data space, as generated by the PL data space structure; iii) Formal language semantic representation of the field theory dynamical transformations, are interlaced in such a way - as represented in the flow-chart - as to reach the specific objective of devising new methods to recognize structural patterns in large databases, thus allowing us to perform data mining in a more efficient way and to extract valuable, effective information more easily.

1.3 S/T methodology

Topological data analysis. The problems one faces in the process of big data analysis are characterized by two fundamental questions: how high-dimensional, global structures can be inferred from low-dimensional, local representations; and how the necessary reduction process can be implemented in such a way as to preserve the maximal information about the global structure. The basic principles of this first aspect of the proposal emerge from three ideas stemming from the seminal work of several authors: i) it is convenient to replace the huge set of points constituting the space of data with a smaller family of simplicial complexes, parametrized by some 'proximity parameter'. It is this operation that converts the data set into a global topological object; ii) it is useful to deal with such complexes by the tools of algebraic topology, via the theory of persistent homology, resulting in parameterized families of spaces; iii) it is possible and efficient to encode the persistent homology of a data set by a parameterized version of Betti numbers.

Recall that a basic set of invariants of a topological space, X, is its collection of homology groups, H_i(X). Computing such groups is of the highest importance, and crucial ingredients are Betti numbers; the i-th Betti number, b_i, being the rank of H_i(X). Betti numbers often have intuitive meanings, and knowing Betti numbers is in most cases the same as knowing homology. When one wants to know the homology groups, sometimes it suffices to know the Betti numbers, which are often much simpler to compute. If one is trying to distinguish manifolds via their homology, their Betti numbers may already distinguish them, and often to prove the vanishing of a homology group it is sufficient to have the corresponding Betti number vanish.

Data, on the other hand, are customarily represented as unordered sequences of points in an n-dimensional ‘space of data’ E^n, typically - but not necessarily - Euclidean. The leading idea here is that the information encoded in the global structure of E^n, through correlation patterns, is what provides the relevant knowledge about the underlying phenomena which data represent. An instance of a data set for which such significant global features are present is the point cloud data, a collection of points in E^n providing a sample of significant points on a lower-dimensional subset of E^n, such as in computer representations of physical objects, or for motion data recorded as time series.
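The definition of the Betti numbers recalled above can be written out as a short worked formula. The boundary operators ∂_i acting on i-chains are standard notation assumed here (they do not appear in the proposal text), and the second equality is stated for coefficients in a field:

```latex
H_i(X) = \ker \partial_i \,/\, \operatorname{im} \partial_{i+1},
\qquad
b_i = \operatorname{rank} H_i(X)
    = \dim \ker \partial_i - \operatorname{rank} \partial_{i+1}
```

For instance, a circle has b_0 = 1 (one connected component) and b_1 = 1 (one loop), while a filled disc has b_0 = 1 and b_1 = 0.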

[Flow chart labels: Data, Information, Knowledge]

M. Rasetti, E. Merelli, EIB, 9 November 2012

Building blocks of the theory

The Topological Field Theory of Data: a program towards a novel strategy for data mining through data language

M Rasetti¹ and E Merelli²

¹ ISI Foundation, Via Alassio 11-C, 10126 Torino (Italy)

² School of Science and Technology, University of Camerino, Via del Bastione 1, 62032 Camerino (Italy)

E-mail: [email protected] ; [email protected]

Abstract. This paper aims to challenge the current thinking in IT for the ‘Big Data’ question, proposing – almost verbatim, with no formulas – a program aiming to construct an innovative methodology to perform data analytics in a way that returns an automaton as a recognizer of the data language: a Field Theory of Data. We suggest to build, directly out of probing data space, a theoretical framework enabling us to extract the manifold hidden relations (patterns) that exist among data, as correlations depending on the semantics generated by the mining context. The program, that is grounded in the recent innovative ways of integrating data into a topological setting, proposes the realization of a Topological Field Theory of Data, transferring and generalizing to the space of data notions inspired by physical (topological) field theories and harnesses the theory of formal languages to define the potential semantics necessary to understand the emerging patterns.

1. The landscape

Complex Systems are ubiquitous; complex, multi-level, multi-scale systems are everywhere: in Nature, but also in the Internet, the brain, the climate, the spread of pandemics, in economy and finance; in other words in Society. A deep, intriguing question that has been raised about complex systems is: can we envisage the construction of a bona fide Complexity Science Theory? In other words, does it make sense to think of a conceptual construct playing for complex systems the same role that Statistical Mechanics played for Thermodynamics?

The challenge is indeed enormous. In statistical mechanics a number of basic restrictive assumptions play a crucial role: ergodicity ensures that all accessible states of a system are reached with equal probability; the thermodynamic limit, N → ∞, induces the number of particles into play, measured in terms of the Avogadro’s number, to be essentially infinite; particles are identical and indistinguishable, constraint that is not even mentioned when studying the features of collections of particles, but it is there – particles of the same species are identical and interact with each other pairwise all in the same way, that is, with the same interaction law – in the quantum case they are indistinguishable; analytical structure can be defined for the underlying dynamics, that is, regular equations of motion exist at the micro-scale – analyticity breaking and singularities only appear as signal of the macro-phenomenon of phase transitions; ‘experiment-based’ phenomenology is repeatable, as in reductionist science, under the same

7th International Workshop DICE2014 Spacetime – Matter – Quantum Mechanics, IOP Publishing

Journal of Physics: Conference Series 626 (2015) 012005, doi:10.1088/1742-6596/626/1/012005

the approach

emerging patterns

data collections, behavioural semantics, contextual semantics

?


CONTEXTUALIZE - WHERE ARE WE?

With respect to the literature, our methodology is on the edge between the two principal approaches for modeling complex systems: top-down and bottom-up.

We extract information at the meso-scale to connect the micro-scale with the macro-scale

Which are the languages our TFTD automaton can recognize?

Theory of computation: an automaton is an abstract machine (such as a finite-state machine), a mathematical model of computation capable of recognizing a formal language

each fj is a TM: non-linearly interacting TMs?
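The idea of an automaton recognizing a formal language can be made concrete with the simplest case, a deterministic finite automaton. This is only a minimal sketch of the general notion: the helper `make_dfa` and the example language (binary strings with an even number of 1s) are hypothetical and are not the TFTD automaton discussed in the slides.

```python
def make_dfa(transitions, start, accepting):
    """Build a recognizer from a transition table {(state, symbol): next_state}."""
    def accepts(word):
        state = start
        for symbol in word:
            key = (state, symbol)
            if key not in transitions:   # undefined move: reject the word
                return False
            state = transitions[key]
        return state in accepting
    return accepts

# Hypothetical example language: binary strings with an even number of 1s.
even_ones = make_dfa(
    transitions={
        ("even", "0"): "even", ("even", "1"): "odd",
        ("odd", "0"): "odd", ("odd", "1"): "even",
    },
    start="even",
    accepting={"even"},
)

print(even_ones("1010"))   # True: two 1s
print(even_ones("1"))      # False: one 1
```

Finite automata recognize exactly the regular languages; Turing machines (TMs) are strictly more powerful, which is why the class of languages the TFTD automaton can recognize is posed as an open question on this slide.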

Within the TOPDRIM consortium, the dream has been that of taming BIG DATA with

Topology -- the geometry of ‘shapes’

A global topological vision of data space

for reconstructing the dynamics of multi-level

complex systems

The integrated approach

http://www.topdrim.eu
