
Massive Data Sets and Information Theory

Ziv Bar-Yossef
Department of Electrical Engineering
Technion

What are Massive Data Sets?

Technology
• The World-Wide Web
• IP packet flows
• Phone call logs

Science
• Astronomical sky surveys
• Weather data

Business
• Credit card transactions
• Billing records
• Supermarket sales

Traditionally: cope with the complexity of the problem.

Massive Data Sets: cope with the complexity of the data.

Nontraditional challenges:
• Restricted access to the data
• Not enough time to read the whole data
• Only a tiny fraction of the data can be held in main memory

Computing over Massive Data Sets

Data x (n is very large) → Computer Program → Approximation of f(x)

• Approximation of f(x) is sufficient
• Program can be randomized

Examples: Mean, Parity

Models for Computing over Massive Data Sets
• Sampling
• Data Streams
• Sketching

Sampling

Query a few data items.

Data x (n is very large) → Computer Program → Approximation of f(x)

Examples
• Mean: O(1) queries
• Parity: n queries
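To make the contrast concrete, here is a minimal Python sketch of the sampling model (the sample-size formula is an illustrative Chebyshev-style choice, not from the talk): the mean of values in [0, 1] can be estimated from a number of random queries that is independent of n, while parity requires all n queries, since any unread bit could flip the answer.

```python
import random

def sample_mean(query, n, eps=0.1, delta=0.05):
    """Estimate the mean of x[0..n-1] from O(1/eps^2) random queries.

    `query(i)` returns x[i]; the number of queries is independent of n.
    The sample size is a loose, illustrative Chebyshev-style choice
    for values in [0, 1].
    """
    q = int(1 / (eps ** 2 * delta)) + 1
    total = sum(query(random.randrange(n)) for _ in range(q))
    return total / q

def exact_parity(query, n):
    """Parity admits no sublinear sampling algorithm: any unqueried
    bit could flip the answer, so all n positions must be read."""
    p = 0
    for i in range(n):
        p ^= query(i)
    return p

# Usage on a toy data set:
x = [random.randint(0, 1) for _ in range(10_000)]
print(sample_mean(lambda i: x[i], len(x)))   # close to the true mean
print(exact_parity(lambda i: x[i], len(x)))
```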

Data Streams

Stream through the data; use limited memory.

Data x (n is very large) → Computer Program → Approximation of f(x)

Examples
• Mean: O(1) memory
• Parity: 1 bit of memory
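A minimal streaming sketch (assuming, for illustration, that the data arrives as a Python iterator): the mean needs only a running sum and a count, and parity needs a single bit, matching the O(1)-memory claims above.

```python
def stream_mean(stream):
    """One pass over the data, O(1) memory: a running sum and a count."""
    total, count = 0.0, 0
    for v in stream:
        total += v
        count += 1
    return total / count

def stream_parity(stream):
    """One pass, a single bit of memory: XOR all bits together."""
    p = 0
    for b in stream:
        p ^= b
    return p

print(stream_mean(iter([1, 2, 3, 4])))    # 2.5
print(stream_parity(iter([1, 0, 1, 1])))  # 1
```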

Sketching

Compress each data segment into a small "sketch"; compute over the sketches.

Data1 → Sketch1, Data2 → Sketch2 (n is very large)

Examples
• Equality: O(1)-size sketch
• Hamming distance: O(1)-size sketch
• Lp distance (p > 2): Θ(n^{1-2/p})-size sketch
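To illustrate the O(1)-size equality sketch, here is a minimal Python sketch of the classical random-fingerprint idea (the keyed blake2b hash is an illustrative stand-in for shared randomness, not the construction from the talk): equal segments always produce equal fingerprints, and unequal segments collide with probability about 2^-64.

```python
import hashlib
import os

SHARED_KEY = os.urandom(16)  # shared randomness known to both parties

def sketch(data: bytes) -> bytes:
    """Compress a (possibly huge) segment into an 8-byte fingerprint."""
    return hashlib.blake2b(data, key=SHARED_KEY, digest_size=8).digest()

def probably_equal(sketch1: bytes, sketch2: bytes) -> bool:
    """Compare sketches only; unequal inputs collide w.p. ~2**-64."""
    return sketch1 == sketch2

data1 = b"x" * 10**6
data2 = b"x" * (10**6 - 1) + b"y"
print(probably_equal(sketch(data1), sketch(data1)))  # True
print(probably_equal(sketch(data1), sketch(data2)))  # almost surely False
```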

Algorithms for Massive Data Sets

Sampling
• Mean and other moments
• Median and other quantiles
• Volume estimations
• Histograms
• Graph problems
• Low-rank matrix approximations

Data Streams
• Frequency moments
• Distinct elements
• Functional approximations
• Geometric problems
• Graph problems
• Database problems

Sketching
• Equality
• Hamming distance
• Edit distance
• Lp distance

Our goal

Study the limits of computing over massive data sets:
• Query complexity lower bounds
• Data stream memory lower bounds
• Sketch size lower bounds

Main tools: communication complexity, information theory, statistics.

Communication Complexity [Yao 79]

Alice and Bob alternately send messages m1, m2, m3, m4, …; the Referee outputs the answer from the transcript.

Π(a,b): "transcript"

cost(Π) = Σ_i |m_i|

CC(f) = min_{Π : Π computes f} cost(Π)

Communication Complexity View of Sampling

Alice sends query indices and Bob, who holds the data x, answers them: i1, x[i1], i2, x[i2], …; the Referee outputs an approximation of f(x).

Π(x): "transcript"

cost(Π) = # of queries

QC(f) = min_{Π : Π computes f} cost(Π)

Information Complexity [Chakrabarti, Shi, Wirth, Yao 01]

μ: distribution on inputs to f
X: random variable with distribution μ

icost_μ(Π) = I(X; Π(X))
IC_μ(f) = min_{Π : Π computes f} icost_μ(Π)

Information complexity: the minimum amount of information a protocol that computes f has to reveal about its input.

Note: for some functions, any protocol must reveal much more information about X than just f(X).

CC Lower Bounds via IC Lower Bounds

Useful properties of information complexity:

• Lower bounds communication complexity

• Amenable to “direct sum” decompositions
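The first property follows from a short chain of standard inequalities, sketched here in LaTeX (a textbook argument, not spelled out on the slide):

```latex
% The transcript cannot carry more information about X than its own
% length in bits, so information cost lower-bounds communication cost.
\[
  \mathrm{icost}_\mu(\Pi) = I(X;\Pi(X)) \le H(\Pi(X))
  \le \mathbb{E}\,|\Pi(X)| \le \mathrm{cost}(\Pi),
\]
\[
  \text{and minimizing over protocols gives }
  \mathrm{IC}_\mu(f) \le \mathrm{CC}(f) \text{ for every } \mu .
\]
```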

Framework for bounding CC via IC:

1. Find an appropriate "hard input distribution" μ.
2. Prove a lower bound on IC_μ(f):
   a. Decomposition: decompose IC_μ(f) into "simple" information quantities.
   b. Basis: prove a lower bound on the simple quantities.

Applications of Information Complexity

• Data streams [Bar-Yossef, Jayram, Kumar, Sivakumar 02] [Chakrabarti, Khot, Sun 03]

• Sampling [Bar-Yossef 03]

• Communication complexity and decision tree complexity [Jayram, Kumar, Sivakumar 03]

• Quantum communication complexity [Jain, Radhakrishnan, Sen 03]

• Cell probe [Sen 03] [Chakrabarti, Regev 04]

• Simultaneous messages [Chakrabarti, Shi, Wirth, Yao 01]

The "Election Problem"

• Input: a sequence x of n votes for k parties
• f(x): the vote distribution, i.e., the fraction of votes each party received
• Goal: output D s.t. ||D − f(x)|| < ε

(Example: n = 18, k = 6; vote distribution 7/18, 4/18, 3/18, 2/18, 1/18, 1/18.)

Theorem: QC(f) = Θ(k/ε²)
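For intuition about the upper-bound side, the natural estimator samples O(k/ε²) votes uniformly and outputs the empirical distribution; a minimal Python sketch (the constant in the sample size is an illustrative choice, not the one from the analysis):

```python
import random
from collections import Counter

def estimate_vote_distribution(votes, k, eps=0.1):
    """Sample O(k / eps^2) votes uniformly and return the empirical
    distribution; only the sampled positions are actually read."""
    q = int(4 * k / eps ** 2)  # the constant 4 is illustrative
    counts = Counter(votes[random.randrange(len(votes))] for _ in range(q))
    return [counts[p] / q for p in range(k)]

# Toy instance echoing the slide's example (n = 18, k = 6):
votes = [0]*7 + [1]*4 + [2]*3 + [3]*2 + [4]*1 + [5]*1
print(estimate_vote_distribution(votes, k=6))
```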

Sampling Lower Bound

Lemma 1 (Normal form lemma): WLOG, in any protocol that computes the election problem, the queries are uniformly distributed and independent.

Π(x): transcript of the full protocol
Π'(x): transcript of a single-random-query "protocol"

If cost(Π) = q, then Π(x) = (Π'(x), …, Π'(x)) (q times).

Sampling Lower Bound (cont.)

Lemma 2 (Decomposition lemma): For any X, I(X; Π(X)) ≤ q · I(X; Π'(X)).

I(X; Π(X)): information cost of Π w.r.t. X
I(X; Π'(X)): information cost of Π' w.r.t. X

Therefore, q ≥ I(X; Π(X)) / I(X; Π'(X)).

Combinatorial Designs

A family of subsets B1, …, Bm of a universe U s.t.
1. Each of them constitutes half of U.
2. The intersection of any two of them is relatively small.

[Figure: three overlapping half-sets B1, B2, B3 inside U]

Fact: there exist designs of size exponential in |U|. (These can be obtained from constant-rate, constant relative minimum distance binary error-correcting codes.)
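The existence fact is probabilistic and easy to test empirically; a minimal Python sketch (with illustrative sizes) that draws random half-sets of U and checks that pairwise intersections stay well below |U|/2:

```python
import random

def random_design(universe_size, m, seed=0):
    """Draw m uniformly random half-subsets of {0, ..., universe_size-1}."""
    rng = random.Random(seed)
    u = list(range(universe_size))
    return [frozenset(rng.sample(u, universe_size // 2)) for _ in range(m)]

def max_pairwise_intersection(sets):
    return max(len(a & b) for i, a in enumerate(sets) for b in sets[i + 1:])

# Two random half-sets of a 100-element universe intersect in ~25
# elements on average; a Chernoff bound keeps even the maximum over
# many pairs well below 50, which is what the design requires.
design = random_design(100, 50)
print(max_pairwise_intersection(design))  # typically around 30
```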

Hard Input Distribution for the Election Problem

Let B1, …, Bm be a family of subsets of {1, …, k} that form a design of size m = 2^Ω(k).

X is uniformly chosen among x1, …, xm, where in xi:
• ½ + ε of the votes are split among the parties in Bi.
• ½ − ε of the votes are split among the parties in Bi^c.

1. Unique decoding: for every i ≠ j, ||f(xi) − f(xj)|| > 2ε. Therefore, I(X; Π(X)) ≥ Ω(H(X)) = Ω(k).
2. Low diameter: I(X; Π'(X)) = O(ε²).
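Combining the decomposition lemma with these two properties of the hard distribution yields the theorem:

```latex
% Decomposition lemma + unique decoding + low diameter give the bound:
\[
  q \;\ge\; \frac{I(X;\Pi(X))}{I(X;\Pi'(X))}
    \;\ge\; \frac{\Omega(k)}{O(\varepsilon^2)}
    \;=\; \Omega\!\left(\frac{k}{\varepsilon^2}\right).
\]
```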

Conclusions

• Information theory plays an increasingly important role in complexity theory.
• Are there more applications of information complexity?
• Can we use deeper information theory in complexity theory?

Thank You
