From Kolmogorov and Shannon to Bioinformatics and Grid Computing Raffaele Giancarlo Dipartimento di Matematica, Università di Palermo


Page 1

From Kolmogorov and Shannon to Bioinformatics and Grid Computing

Raffaele Giancarlo

Dipartimento di Matematica, Università di Palermo

Page 2

Aim

• Give a flavour of fundamental novel discoveries about indexing and compression: a string, and any compact encoding of it, is the best index for itself

• Give a flavour of some fundamental novel discoveries about distance functions and classification, particularly relevant for Bioinformatics

• On the way, mention uses of: suffix trees, suffix arrays, the Burrows-Wheeler Transform, Move-to-Front…

• In 30 min., an incredibly long journey: from Kolmogorov and Shannon to Grid Computing

References: available on-line

Page 3

What do we mean by “Indexing” ?

Types of data

• Raw sequence of characters or bytes: DNA sequences, audio-video files, executables

Types of query

• Character-based query: arbitrary substring

Indexing approaches:

• Full-text indexes: Suffix Array, Suffix tree,…
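As a rough illustration of a character-based query answered by a full-text index, the sketch below builds a suffix array naively and reports every occurrence of an arbitrary substring by binary search. The function names are illustrative and do not come from the talk, and the sort-based construction is chosen for readability only, not efficiency.

```python
def build_suffix_array(text):
    """Return the 0-based starting positions of all suffixes, in sorted order."""
    # Naive construction: sorts the suffixes directly (fine for short texts).
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted 0-based positions at which pattern occurs in text."""
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes in sa[start:lo] begin with pattern.
    return sorted(sa[start:lo])
```

All suffixes that start with the pattern are contiguous in the suffix array, which is why two binary searches are enough to delimit them.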

Page 4

What do we mean by “Compression” ?

Any algorithm that squeezes data: lossless, lossy

From March 2001 the Memory eXpansion Technology (MXT) is available on IBM eServers x330MXT

Same performance as a PC with double the memory, but at half the cost

Moral: it is more economical to store data in compressed form than uncompressed

» CPU speed nowadays makes (de)compression “costless”!!

Page 5

What do we mean by “Classification” ?

Any tool that can group “related” objects together, e.g. the unaligned mitochondrial genomes

NCBI Classification

Page 6

Compression and Indexing: Two sides of the same coin !

Do we witness a paradoxical situation?

• An index injects redundant data, in order to speed up the pattern searches

• Compression removes redundancy, in order to squeeze the space occupancy

NO, new results proved a mutual reinforcement behaviour!

• Better indexes can be designed by exploiting compression techniques (in terms of space occupancy)

• Better compressors can be designed by exploiting indexing techniques (also in terms of compression ratio)

• Classification is the “third side” of the coin: Kolmogorov Complexity, Information Theory, Compression and Indexing

Page 7

Our journey, today...

• Index design (Weiner ’73): Suffix Array (1990)

• Compressor design (Shannon ’48): Burrows-Wheeler Transform (1994)

• Compressed Index: space close to gzip, bzip; query time close to O(|P|)

• Compression Booster: tool to transform a poor compressor into a better compression algorithm

• Kolmogorov: Universal Distances and Classification

Page 8

Investigate

Indexing ideas → Compressor design

First Lap…in record time!!!

Booster

Page 9

Key Idea 1: Suffix Tree [Weiner 73, McCreight 76, Ukkonen 92]

String: mississippi#

[Figure: suffix tree of mississippi#; edges are labelled with substrings and leaves with the starting positions of the suffixes]

Page 10

Key Idea 2: Burrows-Wheeler Compression (1994)

Let us be given a string s = mississippi#

All rotations of s:

    mississippi#
    ississippi#m
    ssissippi#mi
    sissippi#mis
    issippi#miss
    ssippi#missi
    sippi#missis
    ippi#mississ
    ppi#mississi
    pi#mississip
    i#mississipp
    #mississippi

Sort the rows (the last column is bwt(s)):

    #mississipp i
    i#mississip p
    ippi#missis s
    issippi#mis s
    ississippi# m
    mississippi #
    pi#mississi p
    ppi#mississ i
    sippi#missi s
    sissippi#mi s
    ssippi#miss i
    ssissippi#m i

bwt(s) = ipssm#pissii
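The rotate-and-sort construction can be sketched in a few lines. This is a minimal, quadratic-space illustration that assumes the end marker `#` occurs exactly once and sorts before every other character; the function names are illustrative, and practical implementations build the transform from a suffix array instead of materialising all rotations.

```python
def bwt(s):
    """Burrows-Wheeler Transform: last column of the sorted rotation matrix."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)


def inverse_bwt(last, end_marker="#"):
    """Rebuild the original string from its BWT (naive table method).

    Assumes end_marker is the unique final character of the original string.
    """
    table = [""] * len(last)
    for _ in range(len(last)):
        # Prepend the last column to every row, then re-sort the rows.
        table = sorted(last[i] + table[i] for i in range(len(last)))
    # The original string is the row that ends with the end marker.
    return next(row for row in table if row.endswith(end_marker))
```

The inverse shows that no information is lost: the transform only permutes the characters of s, so it is a preprocessing step for a compressor rather than a compressor in itself.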

Page 11

Burrows and Wheeler Compression

Why it works:

BWT creates a locally homogeneous string:

abaababa → bbbaaaaa

MTF transforms it into a globally homogeneous sequence of integers:

bbbaaaaa → 00010000

The final string is “easy” to compress

Experimentally: compressibility is proportional to % of zeros
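A small sketch of Move-to-Front makes the effect concrete. The function name and the explicit `alphabet` parameter are choices made for this example; the output 00010000 shown on the slide corresponds to an initial list with `b` in front of `a`, which is assumed here.

```python
def move_to_front(text, alphabet):
    """Encode text as a list of ranks in a self-organising symbol list.

    alphabet gives the initial order of the list (front of the list first).
    """
    symbols = list(alphabet)
    ranks = []
    for ch in text:
        rank = symbols.index(ch)      # current position of the symbol
        ranks.append(rank)
        symbols.insert(0, symbols.pop(rank))  # move the symbol to the front
    return ranks
```

Runs of equal characters, which the BWT tends to produce, become runs of zeros, while the untransformed string abaababa keeps alternating between ranks 0 and 1.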

Page 12

Boosting [Ferragina, Giancarlo, Manzini, Sciortino, 03,04,05]

The technique takes a poor compressor A and turns it into a compressor Aboost with better performance guarantee

[Diagram: A compresses s into c; the Booster, using A, compresses s into c’]

• The better is A, the better is Aboost

• The more compressible is s, the better is Aboost

Qualitatively, it can be shown:

• c’ is shorter than c, if s is compressible

• Time(Aboost) = Time(A), i.e. no slowdown

• A is used as a black-box

Page 13

We investigated:

Index ideas → Compression design

Let’s now turn to the other direction

Compression ideas → Index design

Second Lap…Even faster

Compressed Indexes

Page 14

Suffix Array vs. BW-transform

T = mississippi#

    SA   Rotated text   L
    12   #mississipp    i
    11   i#mississip    p
     8   ippi#missis    s
     5   issippi#mis    s
     2   ississippi#    m
     1   mississippi    #
    10   pi#mississi    p
     9   ppi#mississ    i
     7   sippi#missi    s
     4   sissippi#mi    s
     6   ssippi#miss    i
     3   ssissippi#m    i

L includes SA and T. Can we search within L ?
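The link between the two columns can be checked directly: each character of L is the character that cyclically precedes the corresponding suffix in T. The sketch below assumes T ends with a unique marker `#` that sorts first, uses 1-based positions as on the slide, and builds the suffix array naively; the function names are illustrative.

```python
def suffix_array_1based(t):
    """Starting positions (1-based) of the suffixes of t in sorted order."""
    return [i + 1 for i in sorted(range(len(t)), key=lambda i: t[i:])]


def last_column(t, sa):
    """Derive L from the suffix array: L[i] is the character preceding
    the suffix that starts at 1-based position sa[i], wrapping around."""
    # 1-based position p is index p - 1; its predecessor is index p - 2.
    # For p = 1 the index -1 wraps to the final character (the marker).
    return "".join(t[p - 2] for p in sa)
```

With a unique end marker, sorting suffixes and sorting rotations give the same order, which is why SA and L line up row by row.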

Page 15

A compressed index [Ferragina-Manzini, IEEE FOCS 2000]

In practice, the index is very appealing:

• Space close to the best known compressors, i.e. bzip

• Query time of a few milliseconds on hundreds of MBs

The theoretical result:

• Query complexity: O(p + occ log N) time, where p is the pattern length and occ the number of occurrences

• Space occupancy: O(N Hk(T)) + o(N) bits, where Hk(T) is the k-th order empirical entropy of T
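Searching within L is usually done by backward search, which processes the pattern from its last character to its first and maintains the range of sorted rows prefixed by the part read so far. The sketch below only counts occurrences and computes ranks by scanning L, so it does not achieve the stated bounds; a real compressed index replaces the scans with succinct rank structures. Names are illustrative.

```python
def backward_count(last, pattern):
    """Count occurrences of pattern in the text whose BWT is `last`."""
    # c_table[ch] = number of characters in the text smaller than ch.
    counts = {}
    for ch in last:
        counts[ch] = counts.get(ch, 0) + 1
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]

    def occ(ch, i):
        """Occurrences of ch in last[:i] (linear scan, for clarity only)."""
        return last[:i].count(ch)

    # Half-open range [sp, ep) of rows of the sorted rotation matrix.
    sp, ep = 0, len(last)
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        sp = c_table[ch] + occ(ch, sp)
        ep = c_table[ch] + occ(ch, ep)
        if sp >= ep:
            return 0
    return ep - sp
```

For example, `backward_count("ipssm#pissii", "ssi")` works on the transform of mississippi# alone, without access to the text or the suffix array.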

Page 16

Universal Distances and Classification

Third Lap…

Page 17

Large Data Sets

Classification of Sequences on a Genome-wide Scale

• Distances based on alignments are either not applicable or too slow

• Fast and reliable alignment-free methods are badly needed

Classification of Proteins, both for Function and Structure

• Lagging behind sequence data

Page 18

Proteins and Their String Representations

Amino acid sequence (FASTA format);

Atomic coordinates (Atom lines);

Page 19

Protein Representations

Topological Models (Top Diagrams)

Page 20

Kolmogorov Complexity

The Kolmogorov Complexity K(x) of a string x is defined as the length of the shortest binary program that produces x.

The conditional Kolmogorov Complexity K(x|y) represents the minimum amount of information required to generate x by an effective computation when y is given as an input to the computation.

The Kolmogorov Complexity K(x,y) of a pair of objects x and y is the length of the shortest binary program that produces x and y and a way to tell them apart.

Page 21

Universal Similarity metric (USM)

Problem: USM(x,y) is based on Kolmogorov Complexity, which is non-computable in the Turing sense.

Solution: K(x) can be approximated via data compression by using its relationship with Shannon Information Theory.

USM is a methodology rather than a formula quantifying the similarity of two strings.
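The defining formula does not appear in this transcript. As a reminder, and assuming the talk follows the usual statement from the literature on normalized information distance, USM is commonly written as below, where x* denotes a shortest program for x and K is the Kolmogorov Complexity defined earlier.

```latex
\mathrm{USM}(x,y) \;=\;
  \frac{\max\{\,K(x \mid y^{*}),\; K(y \mid x^{*})\,\}}
       {\max\{\,K(x),\; K(y)\,\}}
```

Both numerator and denominator involve K, which is why the quantity cannot be computed exactly and has to be approximated.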

Page 22

Approximations of USM

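K(x) can be approximated by C(x), K(x,y) by C(xy) and K(x|y*) by C(xy) – C(x), where C(x) is the length of the output of a compressor C on input x. We obtain three approximations to USM (UCD, NCD and CD):

[Formulas not preserved in this transcript]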
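The three compression-based dissimilarities can be sketched as follows. The formulas used here are the ones usually given for UCD, NCD and CD in the compression-based distance literature and are an assumption, since the slide's own formulas are missing; `zlib` stands in for C, and any other byte compressor (for instance `bz2.compress`) can be passed instead.

```python
import zlib


def csize(data, compress=zlib.compress):
    """C(x): length in bytes of the compressed version of data."""
    return len(compress(data))


def ucd(x, y, compress=zlib.compress):
    """Assumed form: max{C(xy) - C(x), C(yx) - C(y)} / max{C(x), C(y)}."""
    cx, cy = csize(x, compress), csize(y, compress)
    cxy, cyx = csize(x + y, compress), csize(y + x, compress)
    return max(cxy - cx, cyx - cy) / max(cx, cy)


def ncd(x, y, compress=zlib.compress):
    """Assumed form: (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}."""
    cx, cy = csize(x, compress), csize(y, compress)
    return (csize(x + y, compress) - min(cx, cy)) / max(cx, cy)


def cd(x, y, compress=zlib.compress):
    """Assumed form: C(xy) / (C(x) + C(y))."""
    cx, cy = csize(x, compress), csize(y, compress)
    return csize(x + y, compress) / (cx + cy)
```

Inputs are `bytes` objects, for example the contents of two FASTA files. Smaller values mean that compressing the two strings together saves more space, that is, the strings share more information.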

Page 23

Experiments [Ferragina, Giancarlo, Greco, Manzini, Valiente, 2007]

Experimental setup:

• Five benchmark datasets of proteins (several alternative representations);

• A benchmark dataset of genomic sequences (complete unaligned mitochondrial genomes);

• Twenty-five compression algorithms;

• Three dissimilarity functions based on USM.

Two sets of experiments to compare USM with methods both based on alignments and not:

• via ROC Analysis;

• via UPGMA and NJ.
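For readers unfamiliar with UPGMA, the following is a textbook-style sketch of how a tree is built from a matrix of pairwise dissimilarities. It is not the software used in the experiments; the function name, the input layout (a list of labels plus a symmetric list-of-lists matrix) and the nested-tuple output are choices made for this example.

```python
from itertools import combinations


def upgma(labels, matrix):
    """Average-linkage clustering; returns the tree as nested tuples."""
    # Active clusters: id -> (subtree, number of leaves).
    clusters = {i: (labels[i], 1) for i in range(len(labels))}
    dist = {frozenset((i, j)): matrix[i][j]
            for i, j in combinations(range(len(labels)), 2)}
    next_id = len(labels)

    while len(clusters) > 1:
        # Pick the two active clusters at minimum distance.
        i, j = min(combinations(sorted(clusters), 2),
                   key=lambda pair: dist[frozenset(pair)])
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)

        # Distance from the merged cluster to each remaining cluster is
        # the size-weighted average of the two old distances.
        for k in clusters:
            dist[frozenset((next_id, k))] = (
                n_i * dist[frozenset((i, k))] + n_j * dist[frozenset((j, k))]
            ) / (n_i + n_j)

        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1

    tree, _size = next(iter(clusters.values()))
    return tree
```

The matrix can be filled with any of the USM approximations, so the same clustering step serves both compression-based and alignment-based distances.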

Page 24

An example: unaligned complete mitochondrial DNA genomes

Page 25

Results and Conclusions

Useful Guidelines for Use of USM Methodology for Biological Investigation

• Which compressor to use

• Which among UCD, NCD and CD to use

• Which data representation is best

• Etc…

Page 26

Software

Kolmogorov Library: http://www.math.unipa.it/~raffaele/kolmogorov/

Sequential processing is too slow even for relatively small data sets, e.g., classification of 278 files (1.5Mb) takes 12 hours on a state-of-the-art PC… half an hour on Grid

Soon available as a Grid-aware Web Service on the COMETA Portal
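The speed-up is possible because the pairwise distances are independent of each other and can be computed in any order. The sketch below illustrates this on a single machine with a thread pool; it is a stand-in and does not reproduce the Grid setup, which the slides do not describe. The NCD formula and all names are assumptions made for this example.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def ncd(x, y):
    """Compression-based dissimilarity (assumed NCD form, zlib as compressor)."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    return (len(zlib.compress(x + y)) - min(cx, cy)) / max(cx, cy)


def pairwise_ncd(items, workers=4):
    """Compute NCD for every unordered pair of items in parallel.

    Returns a dict mapping index pairs (i, j), i < j, to the distance.
    """
    pairs = list(combinations(range(len(items)), 2))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each pair is an independent task; map preserves the input order.
        values = list(pool.map(lambda p: ncd(items[p[0]], items[p[1]]), pairs))
    return dict(zip(pairs, values))
```

With n files there are n(n-1)/2 such tasks, so the work splits naturally across many workers; on a real grid each task, or batch of tasks, would be submitted as a separate job.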

Page 28

Advertisement 2

20th Edition of the Lipari International Summer School for Computer Scientists

TOPIC: Algorithms, Science and Engineering

See Lipari School Website