Building Wavelet Histograms on Large Data in MapReduce

Jeffrey Jestes (1), Ke Yi (2), Feifei Li (1)

(1) School of Computing, University of Utah
(2) Department of Computer Science, Hong Kong University of Science and Technology

November 16, 2011

Introduction

Record ID  User ID  Object ID  ...
1          1        12872      ...
2          8        19832      ...
3          4        231        ...
...        ...      ...        ...

For large data we often wish to obtain a concise summary.

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Outline

1 Introduction and Motivation
    Histograms
    MapReduce and Hadoop

2 Exact Top-k Wavelet Coefficients
    Naive Solution
    Hadoop Wavelet Top-k: Our Efficient Exact Solution

3 Approximate Top-k Wavelet Coefficients
    Linearly Combinable Sketch Method
    Our First Sampling Based Approach
    An Improved Sampling Approach
    Two-Level Sampling

4 Experiments

5 Conclusions
    Hadoop Wavelet Top-k in Hadoop

Introduction: Histograms

A widely accepted and utilized summarization tool is the histogram.

Let A be an attribute of dataset R.

Values of A are drawn from a finite domain [u] = {1, ..., u}.
Define for each x ∈ {1, ..., u}, v(x) = count(R.A = x).
Define the frequency vector v of R.A as v = (v(1), ..., v(u)).
A histogram over R.A is any compact (lossy) representation of v.

Record ID  User ID  Object ID  ...
1          1        12872      ...
2          8        19832      ...
3          4        231        ...
...        ...      ...        ...

x     1  2   3  4  5  6   7   8
v(x)  3  5  10  8  2  2  10  14
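The definition of the frequency vector can be sketched in a few lines of Python. This is only an illustration on a single machine: the function name `frequency_vector` and the sample `user_ids` column are hypothetical, not taken from the slides.

```python
from collections import Counter

def frequency_vector(values, u):
    """Return v = (v(1), ..., v(u)) as a list, where v(x) counts occurrences of x."""
    counts = Counter(values)
    # Values that never occur get frequency 0.
    return [counts.get(x, 0) for x in range(1, u + 1)]

# Hypothetical User ID column of a tiny dataset with domain size u = 8.
user_ids = [1, 8, 4, 8, 1, 1]
v = frequency_vector(user_ids, 8)
print(v)  # [3, 0, 0, 1, 0, 0, 0, 2]
```

The list index is shifted by one: `v[0]` holds v(1).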

Introduction: Wavelet Histograms

A common choice for a histogram is the Haar wavelet histogram.

We obtain the Haar wavelet coefficients w_i recursively as follows:

Original data signal at level ℓ = log2 u (here ℓ = 3):

        v(1)  v(2)  v(3)  v(4)  v(5)  v(6)  v(7)  v(8)
ℓ = 3     3     5    10     8     2     2    10    14

Compute detail coefficients c_i and average coefficients a_i for level ℓ = 2:

a4 = (v(2) + v(1))/2
c4 = (v(2) − v(1))/2

The remaining pairs (v(3), v(4)), (v(5), v(6)) and (v(7), v(8)) give a5, c5 through a7, c7 in the same way.

ℓ = 2   i     4    5    6    7
        a_i   4    9    2   12
        c_i   1   −1    0    2
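One level of this averaging and differencing can be sketched as follows. The function name `haar_step` is hypothetical, and the sketch assumes the input has even length.

```python
def haar_step(signal):
    """One level of the unnormalized Haar transform on an even-length list."""
    pairs = range(0, len(signal), 2)
    # a = (right + left) / 2 and c = (right - left) / 2 for each adjacent pair.
    averages = [(signal[i + 1] + signal[i]) / 2 for i in pairs]
    details = [(signal[i + 1] - signal[i]) / 2 for i in pairs]
    return averages, details

v = [3, 5, 10, 8, 2, 2, 10, 14]
a, c = haar_step(v)
print(a)  # [4.0, 9.0, 2.0, 12.0]  -> a4, a5, a6, a7
print(c)  # [1.0, -1.0, 0.0, 2.0]  -> c4, c5, c6, c7
```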

Compute detail coefficients c_i and average coefficients a_i for level ℓ = 1:

a2 = (a5 + a4)/2
c2 = (a5 − a4)/2

ℓ = 1   i     2     3
        a_i   6.5   7
        c_i   2.5   5

Compute detail coefficients c_i and average coefficients a_i for level ℓ = 0:

a1 = (a3 + a2)/2
c1 = (a3 − a2)/2

ℓ = 0   a1 = 6.8 (total average)   c1 = 0.3

The values are rounded to one decimal; unrounded, a1 = 6.75 and c1 = 0.25.

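The wavelet coefficients w_i are [a1, c1, ..., c_{u−1}]:

ℓ = 0   w1 = a1 = 6.8 (total average)   w2 = c1 = 0.3
ℓ = 1   w3 = c2 = 2.5   w4 = c3 = 5
ℓ = 2   w5 = c4 = 1     w6 = c5 = −1   w7 = c6 = 0   w8 = c7 = 2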
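The full recursion, returning the coefficients in the order [a1, c1, ..., c_{u−1}], might look like the sketch below. The function name `haar_coefficients` is hypothetical; the sketch assumes u is a power of two and holds the whole signal in memory.

```python
def haar_coefficients(v):
    """Return [a1, c1, c2, ..., c_{u-1}] for a signal whose length u is a power of two."""
    u = len(v)
    if u == 0 or u & (u - 1):
        raise ValueError("length must be a power of two")
    averages = list(v)
    details = []  # detail coefficients, coarsest level first
    while len(averages) > 1:
        pairs = range(0, len(averages), 2)
        level_details = [(averages[i + 1] - averages[i]) / 2 for i in pairs]
        averages = [(averages[i + 1] + averages[i]) / 2 for i in pairs]
        details = level_details + details  # coarser levels go in front
    return averages + details

w = haar_coefficients([3, 5, 10, 8, 2, 2, 10, 14])
print(w)  # [6.75, 0.25, 2.5, 5.0, 1.0, -1.0, 0.0, 2.0]
```

The printed values are the unrounded versions of w1, ..., w8 in the example.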

One scales each w_i by $\sqrt{u/2^{\ell}}$, where ℓ is the level of w_i, to preserve energy, i.e.

$\|v\|_2^2 = \sum_{i=1}^{u} v(i)^2 = \sum_{i=1}^{u} w_i^2$

Scaled coefficients:

ℓ = 0   w1 = 19    w2 = 0.8
ℓ = 1   w3 = 5     w4 = 10
ℓ = 2   w5 = 1.4   w6 = −1.4   w7 = 0   w8 = 2.8

The value 0.8 for w2 comes from scaling the rounded c1 = 0.3; scaling the unrounded c1 = 0.25 gives about 0.7.
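The scaling and the energy identity can be checked numerically. In this sketch the function name `scale_coefficients` is hypothetical, the coefficients are assumed to be stored in the order [a1, c1, ..., c_{u−1}], and the level of each position is derived from its index.

```python
import math

def scale_coefficients(w):
    """Scale each w_i by sqrt(u / 2^level); w is ordered [a1, c1, ..., c_{u-1}]."""
    u = len(w)
    scaled = []
    for j, coeff in enumerate(w):  # j = i - 1 is the 0-based position
        # a1 and c1 sit at level 0; c_j for j >= 1 sits at level floor(log2(j)).
        level = 0 if j == 0 else j.bit_length() - 1
        scaled.append(coeff * math.sqrt(u / 2 ** level))
    return scaled

v = [3, 5, 10, 8, 2, 2, 10, 14]
w = [6.75, 0.25, 2.5, 5.0, 1.0, -1.0, 0.0, 2.0]  # unrounded example coefficients
scaled = scale_coefficients(w)

energy_v = sum(x * x for x in v)
energy_w = sum(x * x for x in scaled)
print(energy_v, round(energy_w, 6))  # both energies equal 502
```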

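Select the top-k w_i in absolute value to obtain the best k-term representation, i.e. the one minimizing the error in energy

$\sum_{i=1}^{u} v(i)^2 - \sum_{i=1}^{u} w_i^2$

where every w_i that is not selected counts as 0 in the second sum.

Example: k = 3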

We maintain the best k-term w_i. Other w_i are treated as 0.

ℓ = 0   w1 = 19   w2 = 0
ℓ = 1   w3 = 5    w4 = 10
ℓ = 2   w5 = 0    w6 = 0    w7 = 0   w8 = 0

The error in energy is

$\sum_{i=1}^{u} v(i)^2 - \sum_{i=1}^{u} w_i^2 = 502 - 489.5 = 12.5$
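Top-k selection and the resulting error in energy can be reproduced with the short sketch below. The helper names `top_k` and `energy_error` are hypothetical, and the scaled coefficients are entered directly from the unrounded example values.

```python
import math

def top_k(coeffs, k):
    """Keep the k coefficients of largest absolute value; treat the rest as 0."""
    order = sorted(range(len(coeffs)), key=lambda j: abs(coeffs[j]), reverse=True)
    keep = set(order[:k])
    return [c if j in keep else 0.0 for j, c in enumerate(coeffs)]

def energy_error(v, kept):
    """Energy of the signal minus the energy of the retained coefficients."""
    return sum(x * x for x in v) - sum(c * c for c in kept)

v = [3, 5, 10, 8, 2, 2, 10, 14]
r2, r8 = math.sqrt(2), math.sqrt(8)
scaled = [6.75 * r8, 0.25 * r8, 5.0, 10.0, r2, -r2, 0.0, 2 * r2]

kept = top_k(scaled, 3)
print([j + 1 for j, c in enumerate(kept) if c != 0.0])  # [1, 3, 4]
print(round(energy_error(v, kept), 6))                  # 12.5
```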

To reconstruct the original signal we compute the average and difference coefficients in reverse, i.e. top to bottom. The coefficients below are shown unscaled.

ℓ = 0   w1 = a1 = 6.8 (total average)   w2 = 0
ℓ = 1   w3 = 2.5   w4 = 5     a2 = 6.8   a3 = 6.8
ℓ = 2   w5 = w6 = w7 = w8 = 0   a4 = 4.3   a5 = 9.3   a6 = 1.8   a7 = 12

x              1    2    3    4    5    6    7    8
v(x)           3    5   10    8    2    2   10   14
reconstructed  4.3  4.3  9.3  9.3  1.8  1.8  12   12

The reconstructed signal is a reasonably close approximation.
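A top-to-bottom reconstruction can be sketched as follows, using l = a − c and r = a + c to undo a = (r + l)/2 and c = (r − l)/2. The function name `haar_reconstruct` is hypothetical and the input is assumed to be unscaled and ordered [a1, c1, ..., c_{u−1}].

```python
def haar_reconstruct(w):
    """Invert the unnormalized Haar transform; w is [a1, c1, c2, ..., c_{u-1}]."""
    averages = [w[0]]
    pos = 1
    while pos < len(w):
        details = w[pos:pos + len(averages)]  # detail coefficients of this level
        next_level = []
        for a, c in zip(averages, details):
            next_level.extend([a - c, a + c])  # left child, right child
        pos += len(averages)
        averages = next_level
    return averages

# Best 3-term representation from the example (w1, w3, w4 kept, unscaled).
approx = haar_reconstruct([6.75, 0.0, 2.5, 5.0, 0.0, 0.0, 0.0, 0.0])
print(approx)  # [4.25, 4.25, 9.25, 9.25, 1.75, 1.75, 11.75, 11.75]
```

Feeding in all eight unrounded coefficients returns the original signal exactly.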

[GKMS01] A.C. Gilbert, et al. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, 2001.

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Introduction: Wavelet Histograms

A common choice for a histogram is the Haar wavelet histogram.

We obtain the Haar wavelet coefficients $w_i$ recursively as follows:

x       1    2    3    4    5    6    7    8
v(x)    3    5   10    8    2    2   10   14

ℓ = 3     v(1), ..., v(8) = 3  5  10  8  2  2  10  14
ℓ = 2     w5 = 1, a4 = 4;   w6 = -1, a5 = 9;   w7 = 0, a6 = 2;   w8 = 2, a7 = 12
ℓ = 1     w3 = 2.5, a2 = 6.5;   w4 = 5, a3 = 7
ℓ = 0     w2 = 0.3, a1 = 6.8
          w1 = 6.8 (total average)

All $w_i$ can be calculated in $O(u \log u)$ time:
1. We maintain $O(\log u)$ partial $w_i$s at a time.
2. Compute the affected $w_i$ and the contribution from each v(x) in $O(\log u)$ time.
3. Process the v(x)s in sorted order. [GKMS01]

[GKMS01] A.C. Gilbert, et al. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, 2001.
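To illustrate step 2, the sketch below adds the contribution of a single pair (x, v(x)) to the coefficients it affects: w1 plus one coefficient per level, so 1 + log2(u) updates per key. It is a simplification under stated assumptions: the function name `update` is hypothetical, the coefficients are the scaled ones, and the full array of u coefficients is kept in memory, whereas the one-pass method of [GKMS01] keeps only O(log u) partial coefficients while reading keys in sorted order.

```python
import math

def update(w, x, count, u):
    """Add the contribution of key x (1-based) with frequency count to w."""
    pos = x - 1
    w[0] += count / math.sqrt(u)          # w1 sees every key with weight 1/sqrt(u)
    size = u                              # support length u / 2^l at level l
    block = 1                             # 2^l, also the array offset of level l
    while size > 1:
        k = pos // size                   # the one coefficient at this level covering x
        sign = 1.0 if pos % size >= size // 2 else -1.0   # right half adds, left half subtracts
        w[block + k] += sign * count / math.sqrt(size)
        size //= 2
        block *= 2

u = 8
v = [3, 5, 10, 8, 2, 2, 10, 14]
w = [0.0] * u
for x, count in enumerate(v, start=1):    # keys arrive in sorted order
    update(w, x, count, u)
print([round(c, 2) for c in w])
```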

Introduction: Histograms

We may also compute $w_i$ with the wavelet basis vectors $\psi_i$:

$$w_i = v \cdot \psi_i \quad \text{for } i = 1, \ldots, u$$
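As an illustration of the dot-product form, the sketch below builds explicit basis vectors for a small domain and checks that they are orthonormal. The construction is an assumption consistent with the scaling factor $\sqrt{u/2^{\ell}}$: $\psi_1$ is constant $1/\sqrt{u}$, and every other $\psi_i$ is $-1/\sqrt{s}$ on the left half and $+1/\sqrt{s}$ on the right half of its support of length $s = u/2^{\ell}$. The function names `haar_basis` and `dot` are hypothetical.

```python
import math

def haar_basis(u):
    """Return [psi_1, ..., psi_u] as lists of length u (u a power of two)."""
    psi = [[1 / math.sqrt(u)] * u]               # psi_1: the scaled total average
    size = u
    while size > 1:
        for start in range(0, u, size):          # one vector per block of this size
            vec = [0.0] * u
            for p in range(start, start + size):
                sign = 1.0 if p - start >= size // 2 else -1.0
                vec[p] = sign / math.sqrt(size)
            psi.append(vec)
        size //= 2
    return psi

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

v = [3, 5, 10, 8, 2, 2, 10, 14]
psi = haar_basis(len(v))
w = [dot(v, p) for p in psi]                     # w_i = v . psi_i
print([round(c, 2) for c in w])
```

Because the vectors are orthonormal, the energy identity for the scaled coefficients follows directly.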

Outline

1 Introduction and Motivation
    Histograms
    MapReduce and Hadoop
2 Exact Top-k Wavelet Coefficients
    Naive Solution
    Hadoop Wavelet Top-k: Our Efficient Exact Solution
3 Approximate Top-k Wavelet Coefficients
    Linearly Combinable Sketch Method
    Our First Sampling Based Approach
    An Improved Sampling Approach
    Two-Level Sampling
4 Experiments
5 Conclusions
    Hadoop Wavelet Top-k in Hadoop

Introduction: MapReduce and Hadoop

Record ID   User ID   Object ID   ...
1           1         12872       ...
2           8         19832       ...
3           4         231         ...
...         ...       ...         ...

[Figure: the data set R, distributed so that R = R1 ∪ R2 ∪ R3 ∪ R4.]

Traditionally data is stored in a centralized setting.

Now stored data has skyrocketed, and is increasingly distributed.

We study computing the top-k coefficients efficiently on distributed data in MapReduce, using Hadoop to illustrate our ideas.

Background: Hadoop Distributed File System (HDFS)

Hadoop requires a Distributed File System (DFS); we utilize the Hadoop Distributed File System (HDFS).

[Figure: a NameNode and several DataNodes. The data set R is stored as 64 MB data chunks (splits).]

Background: Hadoop Core

Hadoop Core consists of one master JobTracker and several TaskTrackers.

We assume one TaskTracker per physical machine.

[Figure: a JobTracker (Job Scheduling, Mapper Task Scheduling, Reduce Task Scheduling) and TaskTrackers (Mappers, Reducers).]

Background: Hadoop Cluster

In a Hadoop cluster one machine typically runs both the NameNode and JobTracker tasks and is called the master.

The other machines run DataNode and TaskTracker tasks and are called slaves.

[Figure: Master (NameNode + JobTracker) and Slaves (DataNodes + TaskTrackers).]

Background: MapReduce Job Overview

[Figure: Map Phase. The JobTracker, the Job Configuration, and the Distributed Cache serve four Mappers, one per input split (split 1 to split 4). Each Mapper reads (k1, v1) pairs and writes (k2, v2) pairs to partitions p1 and p2.]

Next we look at an overview of a typical MapReduce Job.

Job specific variables are first placed in the Job Configuration, which is sent to each Mapper Task by the JobTracker.

Large data such as files or libraries are then put in the Distributed Cache, which is copied to each TaskTracker by the JobTracker.

The JobTracker next assigns each InputSplit to a Mapper task on a TaskTracker; we assume m Mappers and m InputSplits.

Each Mapper maps a (k1, v1) pair to an intermediate (k2, v2) pair and partitions by k2, i.e. hash(k2) = $p_i$ for $i \in [1, r]$, where r = |reducers| is the number of Reducers.
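The partitioning rule hash(k2) = $p_i$ can be sketched as below. This is an illustration only: the function name `partition` is hypothetical, and `zlib.crc32` is used as a stand-in hash because it is deterministic across runs; it is not claimed to be the hash function Hadoop uses.

```python
import zlib

def partition(key, r):
    """Return the 1-based partition index i in [1, r] for intermediate key k2."""
    digest = zlib.crc32(str(key).encode("utf-8"))   # deterministic stand-in for hash(k2)
    return digest % r + 1

r = 2                                               # number of Reducers
pairs = [(12872, 1), (19832, 1), (231, 1), (12872, 1)]
buckets = {i: [] for i in range(1, r + 1)}
for k2, v2 in pairs:
    buckets[partition(k2, r)].append((k2, v2))      # same key always lands in the same p_i
print(buckets)
```

Since the partition depends only on k2, every pair with the same key reaches the same Reducer, whichever Mapper produced it.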

Background: MapReduce Job Overview

[Figure: Map, Shuffle/Sort, and Reduce phases. After each Mapper, a Combiner turns (k2, list(v2)) into (k2, v2) pairs in partitions p1 and p2. Two Reducers read the partitions and write (k3, v3) pairs to output files o1 and o2.]

An optional Combiner is executed over (k2, list(v2)).

The Combiner aggregates v2 for a k2 and a (k2, v2) is written to a partition on disk.

The JobTracker assigns two TaskTrackers to run the Reducers; each Reducer copies and sorts its inputs from the corresponding partitions.

Combiners reduce communication overhead!

Each Reducer reduces a (k2, list(v2)) to a single (k3, v3) and writes the results to a DFS file, $o_i$ for $i \in [1, r]$.
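The whole flow (map, combine, partition, shuffle/sort, reduce) can be imitated in a single process. The sketch below is a toy model, not Hadoop code: `run_job` and its parameters are hypothetical names, the partitions are in-memory dictionaries rather than files on disk, and `zlib.crc32` stands in for the hash function.

```python
from collections import defaultdict
import zlib

def run_job(splits, map_fn, combine_fn, reduce_fn, r):
    """Run one toy MapReduce job; returns r output lists o_1, ..., o_r."""
    mapper_partitions = []
    for split in splits:                             # one Mapper per input split
        groups = defaultdict(list)
        for k1, v1 in split:
            for k2, v2 in map_fn(k1, v1):
                groups[k2].append(v2)                # (k2, list(v2)) seen by the Combiner
        parts = [dict() for _ in range(r)]
        for k2, values in groups.items():
            p = zlib.crc32(str(k2).encode("utf-8")) % r
            parts[p][k2] = combine_fn(k2, values)    # one combined (k2, v2) per key
        mapper_partitions.append(parts)

    outputs = []
    for i in range(r):                               # Reducer i copies partition i of every Mapper
        merged = defaultdict(list)
        for parts in mapper_partitions:
            for k2, v2 in parts[i].items():
                merged[k2].append(v2)
        outputs.append([reduce_fn(k2, merged[k2]) for k2 in sorted(merged)])
    return outputs

# Usage: count how often each key occurs across two splits.
splits = [[(1, None), (8, None), (1, None)], [(4, None), (1, None)]]
outputs = run_job(
    splits,
    map_fn=lambda k1, v1: [(k1, 1)],
    combine_fn=lambda k2, values: sum(values),
    reduce_fn=lambda k2, values: (k2, sum(values)),
    r=2,
)
print(outputs)
```

In this usage the first Mapper sends one combined pair for key 1 instead of two raw pairs, which is the communication saving a Combiner provides.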

Exact Top-k Wavelet Coefficients: Naive Solution

[Figure: four input splits feed four Mappers, each followed by a Combiner that writes to partition p1. A single Reducer runs the Centralized Wavelet Top-k algorithm and writes output file o1.]

Each of the m Mappers reads the input key x from its input split, as (x, null).

Each Mapper emits (x, 1) for combining by the Combiner.

Each Combiner emits (x, $v_j(x)$), where $v_j(x)$ is the local frequency of x.

The Reducer utilizes a Centralized Wavelet Top-k algorithm, supplying the (x, v(x)) in a streaming fashion.

At the end of the Reduce phase, the Reducer's close() method is invoked. The Reducer then requests the top-k $|w_i|$.

The centralized algorithm computes the top-k $|w_i|$ and returns these to the Reducer.

Finally, the Reducer writes the top-k $|w_i|$ to its HDFS output file o1.

Too Expensive! O(n) Communication!

n = total records in the input file.
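A single-process sketch of the naive solution, under stated assumptions: `naive_top_k` and `haar` are hypothetical names, a `Counter` per split plays the role of Mapper plus Combiner, and the centralized step simply transforms the full frequency vector and sorts by magnitude (the centralized top-k algorithm itself is not specified on these slides).

```python
from collections import Counter
import math

def haar(v):
    """Scaled Haar coefficients [w1, ..., wu]; len(v) must be a power of two."""
    u, w, avgs = len(v), [0.0] * len(v), list(v)
    while len(avgs) > 1:
        half = len(avgs) // 2
        scale = math.sqrt(u / half)
        w[half:2 * half] = [scale * (avgs[2 * k + 1] - avgs[2 * k]) / 2 for k in range(half)]
        avgs = [(avgs[2 * k] + avgs[2 * k + 1]) / 2 for k in range(half)]
    w[0] = math.sqrt(u) * avgs[0]
    return w

def naive_top_k(splits, u, k):
    """Return ([(i, w_i), ...] for the k largest |w_i|, number of pairs shuffled)."""
    local = [Counter(split) for split in splits]     # each Combiner emits (x, v_j(x))
    shuffled = sum(len(c) for c in local)            # pairs sent to the single Reducer
    v = [0] * u
    for c in local:                                  # Reducer: v(x) = sum over j of v_j(x)
        for x, count in c.items():
            v[x - 1] += count
    w = haar(v)
    top = sorted(range(u), key=lambda i: -abs(w[i]))[:k]
    return [(i + 1, w[i]) for i in top], shuffled

# Records whose key frequencies match the running example, spread over m = 4 splits.
freq = [3, 5, 10, 8, 2, 2, 10, 14]
records = [x for x, c in enumerate(freq, start=1) for _ in range(c)]
splits = [records[j::4] for j in range(4)]
top, shuffled = naive_top_k(splits, u=8, k=3)
print(top, shuffled, len(records))
```

Every distinct key in every split still travels to the one Reducer, which is the cost the slide objects to.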

Page 86: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram

Outline

1 Introduction and Motivation
    Histograms
    MapReduce and Hadoop

2 Exact Top-k Wavelet Coefficients
    Naive Solution
    Hadoop Wavelet Top-k: Our Efficient Exact Solution

3 Approximate Top-k Wavelet Coefficients
    Linearly Combinable Sketch Method
    Our First Sampling Based Approach
    An Improved Sampling Approach
    Two-Level Sampling

4 Experiments

5 Conclusions
    Hadoop Wavelet Top-k in Hadoop


Exact Top-k Wavelet Coefficients: Our Solution

We can try to model the problem as a distributed top-k:

$w_i = v \cdot \psi_i = \left(\sum_{j=1}^{m} v_j\right) \cdot \psi_i = \sum_{j=1}^{m} w_{i,j}$.

Previous solutions assume local scores $s_{i,j} \ge 0$ and want the largest $s_i = \sum_{j=1}^{m} s_{i,j}$.

Here the local values $w_{i,j}$ can be negative as well as non-negative, and we want the largest $|w_i|$.

[Figure: $m = 4$ splits, where split $j$ holds the local coefficients $w_{1,j}, w_{2,j}, w_{3,j}, \ldots, w_{u,j}$. Each split sends its $w_{i,j}$ to the Coordinator, which computes $w_i = \sum_{j=1}^{m} w_{i,j}$.]

$w_{i,j}$ is the local value of $w_i$ in split $j$.
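The decomposition $w_i = \sum_j w_{i,j}$ follows from the linearity of the wavelet transform. The sketch below checks it numerically, assuming an orthonormal Haar transform (the normalization and coefficient ordering used in the deck may differ) and hypothetical local frequency vectors for $m = 4$ splits over a domain of size $u = 8$.

```python
import math

def haar(v):
    """Orthonormal Haar transform of a vector whose length is a power of two."""
    coeffs = [float(a) for a in v]
    n = len(coeffs)
    while n > 1:
        half = n // 2
        avg = [(coeffs[2 * i] + coeffs[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        diff = [(coeffs[2 * i] - coeffs[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        coeffs[:n] = avg + diff  # averages first, details after; recurse on the averages
        n = half
    return coeffs

# Hypothetical local frequency vectors v_j, one per split.
local_v = [
    [3, 0, 1, 0, 0, 2, 0, 0],
    [0, 4, 0, 0, 1, 0, 0, 2],
    [1, 1, 0, 5, 0, 0, 3, 0],
    [0, 0, 2, 0, 0, 1, 0, 6],
]
v = [sum(col) for col in zip(*local_v)]           # global v = sum of the v_j
w = haar(v)                                        # global coefficients w_i
w_local = [haar(vj) for vj in local_v]             # local coefficients w_{i,j}
w_sum = [sum(col) for col in zip(*w_local)]        # sum over j of w_{i,j}
has_negative_local = any(c < 0 for wj in w_local for c in wj)
```

Even though every frequency is non-negative, some local coefficients come out negative, which is why non-negative-score top-k methods do not apply directly.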


Exact Top-k Wavelet Coefficients: Our Solution

k = 1

        node 1                  node 2                  node 3
id     x   $s_1(x)$     id     x   $s_2(x)$     id     x   $s_3(x)$
e1,1   5    20          e2,1   5    12          e3,1   1    10
e1,2   2     7          e2,2   4     7          e3,2   3     6
e1,3   1     6          e2,3   1     2          e3,3   4     5
e1,4   4    -2          e2,4   2    -5          e3,4   2    -3
e1,5   6   -15          e2,5   3   -14          e3,5   5    -6
e1,6   3   -30          e2,6   6   -20          e3,6   6   -10

• An item $x$ has a local score $s_i(x)$ at node $i$, for all $i \in [1 \ldots m]$; if $x$ does not appear at node $i$, then $s_i(x) = 0$.

• Each node sends the coordinator its top-k most positive scored items and its top-k most negative scored items.

Items received by the coordinator ($R$):

id     x   $s_j(x)$
e1,1   5    20
e1,6   3   -30
e2,1   5    12
e2,6   6   -20
e3,1   1    10
e3,6   6   -10
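The first-round message of each node can be sketched as follows. The scores are copied from the node tables; the function name `round1_message` is illustrative, and the sketch assumes every node holds at least k positive and k negative scores.

```python
# Local scores from the example: NODES[j][x] is the score of item x at node j + 1.
NODES = [
    {5: 20, 2: 7, 1: 6, 4: -2, 6: -15, 3: -30},
    {5: 12, 4: 7, 1: 2, 2: -5, 3: -14, 6: -20},
    {1: 10, 3: 6, 4: 5, 2: -3, 5: -6, 6: -10},
]

def round1_message(scores, k):
    """Return the k most positive and the k most negative (item, score) pairs."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k], ranked[-k:]

R = [round1_message(scores, k=1) for scores in NODES]
for j, (positive, negative) in enumerate(R, start=1):
    print(f"node {j}: most positive {positive}, most negative {negative}")
```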

Exact Top-k Wavelet Coefficients: Our Solution

• The coordinator computes useful bounds for each received item.

• $s(x)$ denotes the partial score sum for $x$.

• $F_x$ is a receipt indication bit vector: $F_x(i) = 1$ if $s_i(x)$ has been received, else $F_x(i) = 0$.

• $\tau^+(x)$ is an upper bound on the total score $s(x)$. For each node $i$: if $s_i(x)$ was received then $\tau^+(x) = \tau^+(x) + s_i(x)$, else $\tau^+(x) = \tau^+(x) +$ the k-th most positive score from node $i$.

• $\tau^-(x)$ is a lower bound on the total score $s(x)$. For each node $i$: if $s_i(x)$ was received then $\tau^-(x) = \tau^-(x) + s_i(x)$, else $\tau^-(x) = \tau^-(x) +$ the k-th most negative score from node $i$.

• $\tau(x)$ is a lower bound on $|s(x)|$: $\tau(x) = 0$ if $\mathrm{sign}(\tau^+(x)) \ne \mathrm{sign}(\tau^-(x))$, and $\tau(x) = \min(|\tau^+(x)|, |\tau^-(x)|)$ otherwise.

Bounds at the coordinator after round 1 (k = 1):

x   $s(x)$   $F_x$   $\tau^+(x)$   $\tau^-(x)$   $\tau(x)$
1     10      001        42           -40           0
3    -30      100        -8           -60           8
5     32      110        42            22          22
6    -30      011       -10           -60          10
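The bound definitions above can be checked against the example with a short sketch. It recomputes the round-1 table from the node scores; the function name `round1_bounds` and the tuple layout are illustrative choices, not part of the slides.

```python
# Local scores from the example: NODES[j][x] is the score of item x at node j + 1.
NODES = [
    {5: 20, 2: 7, 1: 6, 4: -2, 6: -15, 3: -30},
    {5: 12, 4: 7, 1: 2, 2: -5, 3: -14, 6: -20},
    {1: 10, 3: 6, 4: 5, 2: -3, 5: -6, 6: -10},
]

def round1_bounds(nodes, k):
    """Return {x: (s, F, tau_plus, tau_minus, tau)} for every received item."""
    received, kth_pos, kth_neg = [], [], []
    for scores in nodes:
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        received.append(dict(ranked[:k] + ranked[-k:]))  # what the node sent
        kth_pos.append(ranked[k - 1][1])  # k-th most positive local score
        kth_neg.append(ranked[-k][1])     # k-th most negative local score

    table = {}
    for x in sorted(set().union(*received)):
        s = sum(r.get(x, 0) for r in received)                 # partial score sum
        F = "".join("1" if x in r else "0" for r in received)  # receipt bit vector
        tau_plus = sum(r.get(x, kth_pos[j]) for j, r in enumerate(received))
        tau_minus = sum(r.get(x, kth_neg[j]) for j, r in enumerate(received))
        if tau_plus * tau_minus < 0:   # bounds straddle zero
            tau = 0
        else:
            tau = min(abs(tau_plus), abs(tau_minus))
        table[x] = (s, F, tau_plus, tau_minus, tau)
    return table

bounds = round1_bounds(NODES, k=1)
for x, row in bounds.items():
    print(x, row)
```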

Exact Top-k Wavelet Coefficients: Our Solution

• We select the item with the k-th largest $\tau(x)$. Its value $T_1$ is a lower bound on the k-th largest $|s(x)|$, which any unseen item $x$ must beat.

$T_1 = 22$, $T_1/m = 22/3$

• To get into the top-k, any unseen item $x$ must have at least one $s_i(x) > T_1/m$ or one $s_i(x) < -T_1/m$.

Round 1 End
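One way to see why the per-node threshold $T_1/m$ is safe is the triangle inequality. This is a short derivation under the slide's definitions, not taken from the deck itself.

```latex
% Suppose an unseen item x has |s_i(x)| <= T_1/m at every node i. Then
\[
|s(x)| = \Bigl|\sum_{i=1}^{m} s_i(x)\Bigr|
       \le \sum_{i=1}^{m} |s_i(x)|
       \le m \cdot \frac{T_1}{m} = T_1 ,
\]
% so x cannot beat the current k-th candidate, whose |s(x)| is at least T_1.
% In the example, T_1 = 22 and m = 3, so the per-node threshold is
\[
\frac{T_1}{m} = \frac{22}{3} \approx 7.33 .
\]
```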

Exact Top-k Wavelet Coefficients: Our Solution

$T_1 = 22$, $T_1/m = 22/3$

• $T_1/m$ is sent to each site.

• Each site finds its items with $s_i(x) > T_1/m$ or $s_i(x) < -T_1/m$.

• Items with $|s_i(x)| > T_1/m$ are sent to the coordinator.

Items received by the coordinator ($R$):

id     x   $s_j(x)$
e1,1   5    20
e1,5   6   -15
e1,6   3   -30
e2,1   5    12
e2,5   3   -14
e2,6   6   -20
e3,1   1    10
e3,6   6   -10
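The second-round filter at each site can be sketched as below, using the example's scores and $T_1 = 22$. The function name `round2_message` is illustrative.

```python
# Local scores from the example: NODES[j][x] is the score of item x at node j + 1.
NODES = [
    {5: 20, 2: 7, 1: 6, 4: -2, 6: -15, 3: -30},
    {5: 12, 4: 7, 1: 2, 2: -5, 3: -14, 6: -20},
    {1: 10, 3: 6, 4: 5, 2: -3, 5: -6, 6: -10},
]

def round2_message(scores, threshold):
    """Keep the items whose local score exceeds the threshold in absolute value."""
    return {x: s for x, s in scores.items() if abs(s) > threshold}

T1 = 22
m = len(NODES)
R2 = [round2_message(scores, T1 / m) for scores in NODES]
for j, sent in enumerate(R2, start=1):
    print(f"node {j} sends {sent}")
```

Scores of 7 (item 2 at node 1, item 4 at node 2) stay local because 7 is below 22/3.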

Exact Top-k Wavelet Coefficients: Our Solution

• The coordinator updates the bounds for each item it has ever received.

• Partial score sum: $s(5) = 20 + 12 = 32$.

• Receipt vector: $F_5 = [110]$.

• $\tau^+(x)$ is now tighter. For each node $i$: if $s_i(x)$ was received then $\tau^+(x) = \tau^+(x) + s_i(x)$, else $\tau^+(x) = \tau^+(x) + T_1/m$.

• $\tau^-(x)$ is also tighter. For each node $i$: if $s_i(x)$ was received then $\tau^-(x) = \tau^-(x) + s_i(x)$, else $\tau^-(x) = \tau^-(x) - T_1/m$.

• Score absolute value bound: $\tau(5) = \min(39.3, 24.6) = 24.6$.

• $\tau'(x)$ is an upper bound on $|s(x)|$: $\tau'(x) = \max\{|\tau^+(x)|, |\tau^-(x)|\}$.

Bounds at the coordinator after round 2 ($T_1 = 22$, $T_1/m = 22/3$):

x   $s(x)$   $F_x$   $\tau^+(x)$   $\tau^-(x)$   $\tau(x)$   $\tau'(x)$
1     10      001       24.6          -4.6          0          24.6
3    -44      110      -36.6         -51.3         36.6        51.3
5     32      110       39.3          24.6         24.6        39.3
6    -45      111      -45           -45           45          45

k = 1

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Page 122: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram

Exact Top-k Wavelet Coefficients: Our Solution

node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30

node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20

node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10

T1 = 22, T1/m = 22/3

Rid x sj(x)e1,1 5 20e1,5 6 -15e1,6 3 -30e2,1 5 12e2,5 3 -14e2,6 6 -20e3,1 1 10e3,6 6 -10

Rx s(x) Fx τ+(x) τ−(x) τ (x) τ ′(x)1 10 001 24.6 -4.6 0 24.63 -44 110 -36.6 -51.3 36.6 51.35 32 110 39.3 24.6 24.6 39.36 -45 111 -45 -45 45 45

•We select the item x with the kth largest τ (x), which serves as anew lower bound T2 on |s(x)| for any item.

k = 1

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Page 123: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram

Exact Top-k Wavelet Coefficients: Our Solution

node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30

node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20

node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10

T1 = 22, T1/m = 22/3

Rid x sj(x)e1,1 5 20e1,5 6 -15e1,6 3 -30e2,1 5 12e2,5 3 -14e2,6 6 -20e3,1 1 10e3,6 6 -10

Rx s(x) Fx τ+(x) τ−(x) τ (x) τ ′(x)1 10 001 24.6 -4.6 0 24.63 -44 110 -36.6 -51.3 36.6 51.35 32 110 39.3 24.6 24.6 39.36 -45 111 -45 -45 45 45

T2 = 45

•We select the item x with the kth largest τ (x), which serves as anew lower bound T2 on |s(x)| for any item.

k = 1


• Any item with τ′(x) < T2 cannot be in the top-k and is pruned from R.


• Any remaining items with a 0 in vector Fx, i.e., with a local score not yet received from some node, are selected.


Round 2 End
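The Round 2 bookkeeping above can be sketched in a few lines of Python. This is only an illustrative sketch using the numbers from the slides, not the authors' Hadoop implementation. It assumes that every local score a node did not send has magnitude at most T1/m, which is consistent with the table values; the function names `round2_bounds` and `prune` are hypothetical.

```python
from fractions import Fraction

m, k = 3, 1                 # three nodes, top-1, as in the slides
T1 = Fraction(22)           # Round 1 threshold from the slides

# Entries the coordinator holds after Round 2: (node j, item x, local score sj(x))
received = [
    (1, 5, 20), (1, 6, -15), (1, 3, -30),
    (2, 5, 12), (2, 3, -14), (2, 6, -20),
    (3, 1, 10), (3, 6, -10),
]

def round2_bounds(received, m, T1):
    """Return {x: (s, reporting_nodes, tau, tau_prime)} from the received entries."""
    partial, seen = {}, {}
    for node, x, score in received:
        partial[x] = partial.get(x, 0) + score
        seen.setdefault(x, set()).add(node)
    bounds = {}
    for x, s in partial.items():
        # Assumption: every score a node did not send has magnitude <= T1/m.
        slack = (m - len(seen[x])) * T1 / m
        upper, lower = s + slack, s - slack          # tau+(x), tau-(x)
        # tau(x) is a lower bound on |s(x)|, tau'(x) an upper bound.
        tau = 0 if lower <= 0 <= upper else min(abs(upper), abs(lower))
        tau_prime = max(abs(upper), abs(lower))
        bounds[x] = (s, seen[x], tau, tau_prime)
    return bounds

def prune(bounds, k):
    """Pick T2 as the k-th largest tau(x) and drop items with tau'(x) < T2."""
    T2 = sorted((b[2] for b in bounds.values()), reverse=True)[k - 1]
    survivors = sorted(x for x, b in bounds.items() if b[3] >= T2)
    return T2, survivors

bounds = round2_bounds(received, m, T1)
T2, survivors = prune(bounds, k)
# Survivors that lack a score from some node must be completed in Round 3.
missing = {x: sorted(set(range(1, m + 1)) - bounds[x][1])
           for x in survivors if len(bounds[x][1]) < m}
print(T2, survivors, missing)
```

On the example data this reproduces the slide values: T2 = 45, items 3 and 6 survive, and only s3(3) is still missing.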


• The coordinator asks for the missing scores of items still in R; here it requests s3(3) from node 3.


R (received entries, now including e3,2)
id     x   sj(x)
e1,1   5    20
e1,5   6   -15
e1,6   3   -30
e2,1   5    12
e2,5   3   -14
e2,6   6   -20
e3,1   1    10
e3,2   3     6
e3,6   6   -10


R (aggregated by item)
x   s(x)   Fx
1    10    001
3   -38    111
5    32    110
6   -45    111

• After collecting all scores, the coordinator can determine the top-k |s(x)|.


s(6) = −45?


Round 3 End
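Round 3 can be sketched in the same spirit. The snippet below is a minimal illustration with the slide values, assuming the coordinator already holds the partial sums of the surviving items and the list of scores it still needs; `round3` and the dictionary layout are hypothetical names, not part of the authors' code.

```python
k = 1
# Partial sums for the items that survived Round 2 (values from the slides).
partial = {3: -44, 6: -45}
# Scores still missing: item -> list of nodes that have not reported it.
missing = {3: [3]}
# Local scores held by node 3 (from the slides); only node 3 is queried here.
node_scores = {3: {1: 10, 3: 6, 4: 5, 2: -3, 5: -6, 6: -10}}

def round3(partial, missing, node_scores, k):
    """Fetch the missing local scores, then rank items by exact |s(x)|."""
    exact = dict(partial)
    for x, nodes in missing.items():
        for j in nodes:
            exact[x] += node_scores[j].get(x, 0)
    ranked = sorted(exact.items(), key=lambda item: abs(item[1]), reverse=True)
    return exact, ranked[:k]

exact, top_k = round3(partial, missing, node_scores, k)
print(exact, top_k)
```

With s3(3) = 6 added, s(3) becomes -38, so the top-1 item by magnitude is x = 6 with s(6) = -45, as on the slide.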


Outline

1 Introduction and Motivation
    Histograms
    MapReduce and Hadoop

2 Exact Top-k Wavelet Coefficients
    Naive Solution
    Hadoop Wavelet Top-k: Our Efficient Exact Solution

3 Approximate Top-k Wavelet Coefficients
    Linearly Combinable Sketch Method
    Our First Sampling Based Approach
    An Improved Sampling Approach
    Two-Level Sampling

4 Experiments

5 Conclusions
    Hadoop Wavelet Top-k in Hadoop


Approximate Top-k Wavelet Coefficients

Hadoop Wavelet Top-k is a good solution if the exact top-k |wi| must be retrieved, but requires multiple phases.

If we are allowed an approximation, we could further improve:
1. communication cost
2. number of MapReduce rounds
3. amount of I/O incurred


Some natural improvement attempts:
1. Approximate distributed top-k.
2. Approximating local coefficients with a linearly combinable sketch.
    • For set A = A1 ∪ A2, Sketch(A) = Sketch(A1) op Sketch(A2) for operator op.
    • The state-of-the-art wavelet sketch is the GCS Sketch [CGS06].
    • The GCS gives us, for v = v1 + v2, GCS(v) = GCS(v1) + GCS(v2).
3. Random sampling techniques.

[CGS06] G. Cormode, et al. Fast approximate wavelet tracking on streams. In EDBT, 2006.
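The linear-combinability property can be illustrated with a toy sketch. The class below is a simple Count-Sketch-style structure used purely as a stand-in; it is not the GCS of [CGS06], and the class name, hash scheme, and parameters are all assumptions made for illustration. It only shows that when two sketches share the same hash functions, adding their tables gives the sketch of the summed frequency vectors.

```python
import random

PRIME = 2_147_483_647  # modulus for the toy hash functions (assumption)

class ToyLinearSketch:
    """Count-Sketch-style linear sketch; a stand-in, NOT the GCS."""

    def __init__(self, width=16, depth=3, seed=7):
        rng = random.Random(seed)  # same seed => same hash functions
        self.width = width
        self.rows = [tuple(rng.randrange(1, PRIME) for _ in range(4))
                     for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def update(self, x, count):
        """Add `count` occurrences of key x to every row."""
        for row, (a, b, c, d) in zip(self.table, self.rows):
            bucket = ((a * x + b) % PRIME) % self.width
            sign = 1 if ((c * x + d) % PRIME) % 2 == 0 else -1
            row[bucket] += sign * count

    def add(self, other):
        """Combine with another sketch built with the same hash functions."""
        assert self.rows == other.rows, "sketches must share hash functions"
        for mine, theirs in zip(self.table, other.table):
            for i, value in enumerate(theirs):
                mine[i] += value
        return self

def sketch_of(freqs):
    sketch = ToyLinearSketch()
    for x, count in freqs.items():
        sketch.update(x, count)
    return sketch

v1 = {1: 6, 5: 20, 6: 15}
v2 = {1: 2, 5: 12, 3: 14}
v = {x: v1.get(x, 0) + v2.get(x, 0) for x in set(v1) | set(v2)}
merged = sketch_of(v1).add(sketch_of(v2))
print(merged.table == sketch_of(v).table)
```

Because every update is a signed addition into a fixed bucket, the merged table equals the table built directly from v = v1 + v2, which is the property the sketch-based method relies on.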


Approximate Top-k Wavelet Coefficients: Sketch

[Figure: MapReduce data flow for the sketch method. The JobTracker assigns splits 1 to 4 to Mappers. Each Mapper reads records (x, null), accumulates (x, 1) in an in-memory map, and at close() inserts each (x, vj(x)) into a local wavelet sketch via Sketch(x, vj(x)), which it emits to partition p1. The Reducer combines the wavelet sketches and at close() extracts the top-k |wi| from the combined wavelet sketch, writing them to output o1.]
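The data flow in this figure can be mimicked in plain Python. The sketch below is only an illustration of the pattern (aggregate in memory, emit one summary per split at close(), combine in a single reducer). The classes `SketchMapper` and `SketchReducer` and the bucketed-sum `toy_sketch` are hypothetical stand-ins, not Hadoop APIs and not a real wavelet sketch, and the final top-k extraction step is omitted.

```python
from collections import Counter

WIDTH = 8  # toy summary width (assumption)

def toy_sketch(freqs):
    """Hypothetical stand-in for a wavelet sketch: bucketed sums, linear in freqs."""
    table = [0] * WIDTH
    for x, count in freqs.items():
        table[x % WIDTH] += count
    return table

class SketchMapper:
    def __init__(self):
        self.local = Counter()       # the in-memory map of (x, count)

    def map(self, x, _value=None):
        self.local[x] += 1           # record (x, null) contributes (x, 1)

    def close(self):
        return toy_sketch(self.local)  # emit one summary per split

class SketchReducer:
    def __init__(self):
        self.combined = [0] * WIDTH

    def reduce(self, sketch):
        # Combine sketches by entry-wise addition.
        self.combined = [a + b for a, b in zip(self.combined, sketch)]

    def close(self):
        return self.combined

splits = [[1, 1, 5], [5, 6, 6, 6], [1, 9]]
reducer = SketchReducer()
for split in splits:
    mapper = SketchMapper()
    for x in split:
        mapper.map(x)
    reducer.reduce(mapper.close())
combined = reducer.close()
print(combined)
```

Each mapper sends a fixed-size summary rather than its raw records, so the communication depends on the summary size and the number of splits, not on the split contents.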


Approximate Top-k Wavelet Coefficients: Basic Random Sampling

• Split j holds nj records.

• Well-known fact: to approximate each v(x) with standard deviation σ = O(εn), a sample of size Θ(1/ε²) is required.

• Node j samples tj = nj · p records, where p = 1/(ε²n), and computes sj(x), the sampled frequency counts.

• Each node emits its sj(x) to the coordinator, which receives s1(x), ..., sm(x).

• The coordinator constructs s(x) = ∑_{j=1}^{m} sj(x).

• The coordinator constructs v̂(x) = s(x)/p, which is an unbiased estimator of v(x).

[Figure: example with twelve sampled counts s(x) = 2345, 1897, 1673, 189, 10, 4, 1762, 1543, 3451, 20, 12, 1356, each scaled to s(x)/p to obtain v̂(x).]
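The sampling scheme on this slide can be sketched as follows. This is a minimal single-process illustration, not the authors' MapReduce code: it assumes each split is a Python list, draws tj = round(nj · p) records per split without replacement, and caps p at 1 for tiny inputs; `sample_split` and `estimate` are hypothetical names.

```python
import random
from collections import Counter

def sample_split(records, p, rng):
    """Node j: sample tj = round(nj * p) records and return the counts sj(x)."""
    t_j = round(len(records) * p)
    return Counter(rng.sample(records, t_j))

def estimate(splits, eps, seed=0):
    """Coordinator: sum the sampled counts and scale by 1/p."""
    n = sum(len(records) for records in splits)
    p = min(1.0, 1.0 / (eps * eps * n))   # p = 1/(eps^2 n), capped at 1 (assumption)
    rng = random.Random(seed)
    s = Counter()
    for records in splits:
        s.update(sample_split(records, p, rng))   # s(x) = sum over j of sj(x)
    return {x: count / p for x, count in s.items()}, p

splits = [[1, 1, 1, 5, 5, 6, 6, 6], [1, 5, 5, 5, 6, 3, 3, 3]]
v_hat, p = estimate(splits, eps=0.5)
print(p, v_hat)
```

In this toy run n = 16 and ε = 0.5, so p = 0.25 and each split contributes two sampled records. The estimates always sum to n here, while the individual v̂(x) values depend on the random draw.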


Note: ε must be small for v̂ to approximate v well.

Typical values for ε are 10⁻⁴ to 10⁻⁶.

The communication for basic sampling is O(1/ε²).

With 1-byte keys, 100MB to 1TB of data must be communicated!

We improve basic random sampling with Improved Sampling.

Key idea: ignore sampled keys with small frequencies in a split.
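The 100MB to 1TB figure follows directly from the sample size 1/ε². The short calculation below is a back-of-the-envelope check that takes the constant hidden in the O(·) to be 1 and uses decimal units; the helper name `sample_bytes` is hypothetical.

```python
def sample_bytes(exponent, key_bytes=1):
    """Bytes sent for eps = 10**-exponent, i.e. key_bytes * (1 / eps**2) sampled keys."""
    return key_bytes * 10 ** (2 * exponent)

low = sample_bytes(4)    # eps = 10^-4
high = sample_bytes(6)   # eps = 10^-6
print(low // 10**6, "MB", high // 10**12, "TB")
```

Integer exponents are used instead of floating-point ε so that the arithmetic is exact.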


Outline

1 Introduction and Motivation
    Histograms
    MapReduce and Hadoop

2 Exact Top-k Wavelet Coefficients
    Naive Solution
    Hadoop Wavelet Top-k: Our Efficient Exact Solution

3 Approximate Top-k Wavelet Coefficients
    Linearly Combinable Sketch Method
    Our First Sampling Based Approach
    An Improved Sampling Approach
    Two-Level Sampling

4 Experiments

5 Conclusions
    Hadoop Wavelet Top-k in Hadoop

Approximate Top-k Wavelet Coefficients: Improved Sampling

Node $j$ samples $t_j = n_j \cdot p$ records using Basic Sampling, where $p = 1/(\varepsilon^2 n)$ and $n_j$ is the number of records in split $j$.

$s_j(x)$: sampled frequency count of key $x$ at node $j$.

Node $j$ sends $(x, s_j(x))$ only if $s_j(x) > \varepsilon t_j$.

• The error in $\hat{s}(x)$ is $\le \sum_{j=1}^{m} \varepsilon t_j = \varepsilon p n = 1/\varepsilon$.

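Coordinator: from the emitted counts $s_1(x), \ldots, s_m(x)$,

• Construct $\hat{s}(x) = \sum_{j=1}^{m} s_j(x)$.

• Construct $\hat{v}(x) = \hat{s}(x)/p$.

[Figure: the coordinator sums the emitted counts per key (2345, 1897, 1673, 189, ...) and divides each sum by $p$ (2345/p, 1897/p, 1673/p, 189/p, ...).]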

Each node sends at most $t_j/(\varepsilon t_j) = 1/\varepsilon$ keys.

The total communication is $O(m/\varepsilon)$.

$E[\hat{v}(x)]$ may be $\varepsilon n$ away from $v(x)$, as counts $s_j(x) < \varepsilon t_j$ are ignored.
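
The steps of Improved Sampling can be sketched in a few lines. This is a single-process simulation under stated assumptions, not the Hadoop implementation: Basic Sampling is modelled as an independent coin flip with probability $p$ per record, $p$ is capped at 1 for tiny inputs, and the function name `improved_sampling` is illustrative.

```python
import random
from collections import Counter

def improved_sampling(splits, eps, seed=0):
    """Estimate key frequencies from `splits` (a list of lists of keys)."""
    rng = random.Random(seed)
    n = sum(len(split) for split in splits)
    p = min(1.0, 1.0 / (eps * eps * n))    # first-level sampling probability
    s_hat = Counter()                      # coordinator-side summed counts
    sent = 0                               # number of (key, count) pairs shipped
    for split in splits:
        # Node j: keep each record independently with probability p (assumption).
        s_j = Counter(x for x in split if rng.random() < p)
        t_j = sum(s_j.values())
        # Ship only the keys whose sampled count exceeds eps * t_j.
        for x, count in s_j.items():
            if count > eps * t_j:
                s_hat[x] += count
                sent += 1
    v_hat = {x: count / p for x, count in s_hat.items()}
    return v_hat, sent

# Tiny input where p = 1: key 2 (count 1 <= eps * t_j = 2) is dropped,
# which illustrates the bias noted above.
print(improved_sampling([[1, 1, 1, 2]], eps=0.5))
```

Because the shipped counts at a node each exceed $\varepsilon t_j$ and sum to at most $t_j$, fewer than $1/\varepsilon$ keys leave any node, which is why `sent` never exceeds $m/\varepsilon$.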

Approximate Top-k Wavelet Coefficients: Two-Level Sampling

Node $j$ samples $t_j = n_j \cdot p$ records using Basic Sampling, where $p = 1/(\varepsilon^2 n)$ and $n_j$ is the number of records in split $j$.

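$s_j(x)$: sampled frequency count of key $x$ at node $j$.

Sample record $x$ with probability $\min\{\varepsilon\sqrt{m} \cdot s_j(x), 1\}$.

• If $s_j(x) \ge 1/(\varepsilon\sqrt{m})$, emit $(x, s_j(x))$.

• Else emit $(x, \text{null})$ with probability $\varepsilon\sqrt{m} \cdot s_j(x)$.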
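
The second-level rule above can be sketched as a small function. This is an illustration only: `None` stands in for null, the sampled counts $s_j$ are passed in as a plain dict, and the name `second_level_emit` is hypothetical.

```python
import math
import random

def second_level_emit(s_j, eps, m, rng):
    """Return the (key, value) pairs node j emits; None plays the role of null."""
    threshold = 1.0 / (eps * math.sqrt(m))
    out = []
    for x, count in s_j.items():
        if count >= threshold:
            out.append((x, count))      # heavy key: send the exact sampled count
        elif rng.random() < count / threshold:
            out.append((x, None))       # light key: send a marker, prob. eps*sqrt(m)*count
    return out

# With eps = 0.25 and m = 4 the threshold is 2, so 'a' and 'b' are sent exactly
# and 'c' (count 0) is never sent.
print(second_level_emit({"a": 7, "b": 2, "c": 0}, 0.25, 4, random.Random(0)))
```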

Coordinator

[Figure: the coordinator receives from each node $j$ either sampled counts $s_j(x)$ or null markers, constructs $\hat{s}(x)$, the estimator of $s(x)$, for each key, and then constructs $\hat{v}(x) = \hat{s}(x)/p$.]

To construct $\hat{s}(x)$:

• If $(x, s_j(x))$ received, $\rho(x) = \rho(x) + s_j(x)$.

• Else if $(x, \text{null})$ received, $M(x) = M(x) + 1$.

Finally, $\hat{s}(x) = \rho(x) + M(x)/(\varepsilon\sqrt{m})$.

Then, $\hat{v}(x) = \hat{s}(x)/p$ is an unbiased estimator for $v(x)$.
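
The coordinator's bookkeeping can be sketched as follows, again with `None` standing in for null and with illustrative names (`combine`, `messages`). Each null marker is worth $1/(\varepsilon\sqrt{m})$: a light key is sent with probability $\varepsilon\sqrt{m} \cdot s_j(x)$, so its expected contribution is exactly $s_j(x)$.

```python
import math
from collections import defaultdict

def combine(messages, eps, m, p):
    """Build s_hat and v_hat from received (key, count-or-None) pairs."""
    rho = defaultdict(float)   # sum of exactly reported counts per key
    M = defaultdict(int)       # number of null markers per key
    for x, value in messages:
        if value is None:
            M[x] += 1
        else:
            rho[x] += value
    scale = 1.0 / (eps * math.sqrt(m))       # value credited to each null marker
    keys = set(rho) | set(M)
    s_hat = {x: rho[x] + M[x] * scale for x in keys}
    v_hat = {x: s / p for x, s in s_hat.items()}
    return s_hat, v_hat

messages = [("a", 7), ("a", None), ("b", None), ("b", None)]
s_hat, v_hat = combine(messages, eps=0.25, m=4, p=0.5)
print(s_hat, v_hat)
```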

Theorem

$\hat{s}(x)$ is an unbiased estimator of $s(x)$ with standard deviation at most $1/\varepsilon$.

Corollary

$\hat{v}(x)$ is an unbiased estimator of $v(x)$ with standard deviation at most $\varepsilon n$.
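
The $\varepsilon n$ bound in the corollary is the theorem's $1/\varepsilon$ bound rescaled by $1/p$. A short worked version of that scaling step, using only $\hat{v}(x) = \hat{s}(x)/p$ and $p = 1/(\varepsilon^2 n)$:

```latex
\[
\operatorname{sd}\bigl[\hat{v}(x)\bigr]
  = \frac{\operatorname{sd}\bigl[\hat{s}(x)\bigr]}{p}
  \le \frac{1}{\varepsilon} \cdot \varepsilon^{2} n
  = \varepsilon n .
\]
```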

Theorem

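• $\hat{w}_i$ is an unbiased estimator for any $w_i$.

• Recall $w_i = \langle \mathbf{v}, \psi_i \rangle$, for $\psi_i = (-\phi_{j+1,2k} + \phi_{j+1,2k+1})/\sqrt{u/2^j}$, where $\phi$ is a $[0, 1]$ vector defined for $j = 1, \ldots, \log u$ and $k = 0, \ldots, 2^j - 1$. The variance of $\hat{w}_i$ is bounded by

$$\frac{\varepsilon 2^j n}{u\sqrt{m}} \sum_{x = 2ku/2^{j+1}+1}^{(2k+2)u/2^{j+1}} s(x).$$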
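
To make the definition of $\psi_i$ concrete, here is a small sketch that evaluates $\langle \mathbf{v}, \psi_i \rangle$ directly. It assumes $\phi_{j,k}$ is the indicator of the $k$-th dyadic range of width $u/2^j$ and uses 0-based positions; the function names are illustrative. Since the coefficient is linear in $\mathbf{v}$, plugging in an unbiased $\hat{\mathbf{v}}$ gives an unbiased $\hat{w}_i$.

```python
import math

def phi(u, j, k):
    """Indicator vector of the k-th dyadic range of width u / 2**j (assumed form)."""
    width = u // 2**j
    return [1 if k * width <= x < (k + 1) * width else 0 for x in range(u)]

def haar_coefficient(v, j, k):
    """w = <v, psi> with psi = (-phi[j+1, 2k] + phi[j+1, 2k+1]) / sqrt(u / 2**j)."""
    u = len(v)
    left = phi(u, j + 1, 2 * k)
    right = phi(u, j + 1, 2 * k + 1)
    norm = math.sqrt(u / 2**j)
    return sum((r - l) * f for l, r, f in zip(left, right, v)) / norm

v = [4, 2, 1, 1]
print(haar_coefficient(v, 0, 0))  # (-(4 + 2) + (1 + 1)) / sqrt(4)
print(haar_coefficient(v, 1, 0))  # (-4 + 2) / sqrt(2)
```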

Theorem

The expected total communication cost of our two-level sampling algorithm is $O(\sqrt{m}/\varepsilon)$.
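
A sketch of why the bound holds, counting emitted pairs only and using the second-level emission probability $\min\{\varepsilon\sqrt{m} \cdot s_j(x), 1\}$, the identity $\sum_x s_j(x) = t_j$, and $\sum_j t_j = pn = 1/\varepsilon^2$ (taking $t_j = n_j p$ as on the earlier slide):

```latex
\begin{align*}
E[\text{pairs emitted by node } j]
  &= \sum_{x} \min\{\varepsilon\sqrt{m}\, s_j(x),\, 1\}
   \le \varepsilon\sqrt{m} \sum_{x} s_j(x)
   = \varepsilon\sqrt{m}\, t_j, \\
E[\text{total pairs}]
  &\le \varepsilon\sqrt{m} \sum_{j=1}^{m} t_j
   = \varepsilon\sqrt{m} \cdot p n
   = \frac{\varepsilon\sqrt{m}}{\varepsilon^{2}}
   = \frac{\sqrt{m}}{\varepsilon}.
\end{align*}
```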

split 1split 2split 3split 4

JobTracker

MapRunner

RandomizedRecordReader

(x , null)

(x , null)Mapper

In-MemoryMap

(x , sj(x))

nj= records in split jsj= split j sample frequency vector

1 RandomizedRecordReader j (RRj) samples nj/ε2n records.

1 RRj randomly selects nj/ε2n offsets in split j .

2 RRj sorts the offsets in ascending order then seeks the record at eachsampled offset.

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Page 222: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram

Approximate Top-k Wavelet Coefficients: Two-Level Sampling

Figure: two-level sampling data flow. Splits 1-4 (JobTracker, MapRunner) → RandomizedRecordReader → (x, null) → Mapper with In-Memory Map → on close(), (x, s_j(x)) → (x, s_j(x) | 0) to p1.

n_j = number of records in split j; s_j = sample frequency vector of split j.

2 Mapper j samples key x from s with probability min{ε√m · s_j(x), 1}.

  If s_j(x) ≥ 1/(ε√m), emit (x, s_j(x)).

  Else emit (x, 0) with probability ε√m · s_j(x).
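A minimal Python sketch of this second-level sampling rule follows. The function name `second_level_emit` and the use of a `Counter` as the in-memory map are assumptions for illustration; m is taken to be the number of splits, which this part of the slides does not define.

```python
import math
import random
from collections import Counter


def second_level_emit(sampled_keys, eps, m, rng):
    """Return the (key, value) pairs Mapper j would emit on close()."""
    s_j = Counter(sampled_keys)      # in-memory map: key x -> s_j(x)
    scale = eps * math.sqrt(m)       # the factor eps * sqrt(m)
    out = []
    for x, count in s_j.items():
        if count >= 1.0 / scale:
            # Frequent key: send its exact local sample count.
            out.append((x, count))
        elif rng.random() < scale * count:
            # Infrequent key: send only a marker, with probability
            # eps * sqrt(m) * s_j(x).
            out.append((x, 0))
    return out


# Usage: with eps = 0.1 and m = 4 the threshold 1/(eps*sqrt(m)) is 5.
pairs = second_level_emit([7] * 6 + [3] * 2 + [9], eps=0.1, m=4,
                          rng=random.Random(0))
```

Key 7 (count 6) is always emitted with its count, while keys 3 and 9 are each either dropped or emitted as a zero-valued marker.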


Approximate Top-k Wavelet Coefficients: Two-Level Sampling

Figure: complete two-level sampling data flow. Splits 1-4 (JobTracker, MapRunner) → RandomizedRecordReader → (x, null) → Mapper with In-Memory Map → on close(), (x, s_j(x)) → (x, s_j(x) | 0) via p1 → Reducer → (x, s(x)) → (x, v̂(x)) → Centralized Wavelet Top-k → on close(), top-k |w_i| to o1.

n_j = number of records in split j; s_j = sample frequency vector of split j.

3 Construct s(x).

  If (x, s_j(x)) received, ρ(x) = ρ(x) + s_j(x).

  Else if (x, 0) received, M(x) = M(x) + 1.

4 Finally, s(x) = ρ(x) + M(x)/(ε√m).

5 Reducer uses v̂(x) = s(x)/p, our unbiased estimator for v(x).
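The reducer-side bookkeeping in steps 3 to 5 can be sketched as follows. The function name `reduce_two_level` is hypothetical, and p is assumed to be the first-level sampling rate 1/(ε²n) implied by step 1, since these slides do not define it explicitly.

```python
import math
from collections import defaultdict


def reduce_two_level(pairs, eps, m, p):
    """Combine mapper output into estimates v_hat(x) = s(x) / p."""
    rho = defaultdict(int)   # rho(x): sum of exact local counts received
    M = defaultdict(int)     # M(x): number of (x, 0) markers received
    for x, value in pairs:
        if value > 0:
            rho[x] += value
        else:
            M[x] += 1
    scale = eps * math.sqrt(m)
    v_hat = {}
    for x in set(rho) | set(M):
        # Each marker stands for an expected local count of 1/(eps*sqrt(m)).
        s_x = rho[x] + M[x] / scale
        v_hat[x] = s_x / p
    return v_hat


# Usage with made-up mapper output: eps*sqrt(m) = 0.2, p = 0.5.
estimates = reduce_two_level([(7, 6), (7, 8), (7, 0), (3, 0), (3, 0)],
                             eps=0.1, m=4, p=0.5)
```

In this toy input, key 7 gets s(7) = 14 + 1/0.2 = 19 and key 3 gets s(3) = 2/0.2 = 10, each then divided by p. The resulting v̂ values would be handed to the centralized wavelet top-k computation shown in the figure.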


Approximate Top-k Wavelet Coefficients: Two-Level Sampling

Consider: ε = 10^-4, m = 10^3, and 4-byte keys.

The communication for basic sampling is O(1/ε²).

  Approximately 400MB of data must be communicated!

The communication for improved sampling is O(m/ε).

  Approximately 40MB of data must be communicated.

The communication for two-level sampling is O(√m/ε).

  Only 1.2MB of data needs to be communicated!

  330-fold reduction over basic sampling and 33-fold reduction over improved sampling!
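The figures above can be reproduced with a short calculation. This sketch assumes the constants hidden by the O() notation are 1 and that only the 4-byte keys are counted; the function name `communication_bytes` is made up for the example.

```python
import math


def communication_bytes(eps, m, key_bytes=4):
    """Bytes sent by each sampling scheme, taking the O() constants as 1."""
    return {
        "basic": key_bytes / eps**2,                  # O(1/eps^2) keys
        "improved": key_bytes * m / eps,              # O(m/eps) keys
        "two_level": key_bytes * math.sqrt(m) / eps,  # O(sqrt(m)/eps) keys
    }


costs = communication_bytes(eps=1e-4, m=1e3)
for name, size in costs.items():
    print(f"{name}: {size / 1e6:.2f} MB")
```

Under these assumptions the three schemes come to about 400 MB, 40 MB, and 1.26 MB. The unrounded ratios are roughly 316 and 32; the slide's 330-fold and 33-fold figures follow from dividing by the rounded 1.2MB.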


Outline

1 Introduction and Motivation
  Histograms
  MapReduce and Hadoop

2 Exact Top-k Wavelet Coefficients
  Naive Solution
  Hadoop Wavelet Top-k: Our Efficient Exact Solution

3 Approximate Top-k Wavelet Coefficients
  Linearly Combinable Sketch Method
  Our First Sampling Based Approach
  An Improved Sampling Approach
  Two-Level Sampling

4 Experiments

5 Conclusions
  Hadoop Wavelet Top-k in Hadoop


Experiments: Algorithms

We implement the following methods in Hadoop 0.20.2:

Exact Methods:

  The baseline solution is denoted Send-V.
  Our three-round exact solution is denoted H-WTopk (meaning "Hadoop Wavelet Top-k").

Approximate Methods:

  Improved Sampling is denoted Improved-S.
  Two-Level Sampling is denoted TwoLevel-S.
  The Sketch-Based Approximation using the GCS-Sketch is denoted Send-Sketch.


Experiments: Setup

Experiments are performed in a heterogeneous Hadoop cluster with 16 machines:

1 9 machines with 2GB of RAM and an Intel Xeon 1.86GHz CPU

2 4 machines with 4GB of RAM and an Intel Xeon 2GHz CPU
  One is reserved for the master (running JobTracker and NameNode).

3 2 machines with 6GB of RAM and an Intel Xeon 2.13GHz CPU
  One is reserved for the (only) Reducer.

4 1 machine with 2GB of RAM and an Intel Core 2 1.86GHz CPU

All machines are directly connected to a 1000Mbps switch.


Experiments: Datasets

We utilize the WorldCup dataset to test all algorithms on real data.

  There are a total of 1.35 billion records.
  Each record has ten 4-byte integer attributes, including a client id and an object id.
  We assign each record a 4-byte integer clientobject id from a domain of size u = 2^29, which is distinct for each unique pairing of a client id and an object id.
  WorldCup is stored in binary format; in total it is 50GB.

We utilize large synthetic Zipfian datasets to evaluate all methods.

  Keys are randomly permuted and discontiguous in a dataset.
  Each key is a 4-byte integer and stored in binary format.

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Page 260: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram

Experiments: Datasets

We utilize the WorldCup dataset to test all algorithms on real data.

There are a total of 1.35 billion records.

Each record has 10 4 byte integer attributes including a client id andobject id.We assign each record a clientobject 4 byte integer id in u = 229

which is distinct for unique parings of a client id and object id.WorldCup is stored in binary format, in total it is 50GB.

We utilize large synthetic Zipfian datasets to evaluate all methods.

Keys are randomly permuted and discontiguous in a dataset.Each key is a 4-byte integer and stored in binary format.


Experiments: Defaults

Default values:

Symbol   Definition           Default
α        Zipfian skewness     1.1
u        max key in domain    log2 u = 29
n        total records        13.4 billion
         dataset size         50GB
β        split size           256MB
m        number of splits     200
B        network bandwidth    500Mbps
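The defaults for n and m follow from the dataset size, the 4-byte key width, and the split size. The short check below makes that arithmetic explicit; it assumes binary units (1GB = 2^30 bytes, 1MB = 2^20 bytes), which the slides do not state.

```python
GB = 2 ** 30  # assumption: binary gigabytes
MB = 2 ** 20  # assumption: binary megabytes

dataset_bytes = 50 * GB
key_bytes = 4             # each key is a 4-byte integer
split_bytes = 256 * MB    # default split size β

n = dataset_bytes // key_bytes    # total records
m = dataset_bytes // split_bytes  # number of splits

print(n)  # 13421772800, about 13.4 billion
print(m)  # 200
```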


Experiments: Vary k

[Figure: communication (bytes, log scale) vs. number of coefficients k (10 to 50) for Send-V, H-WTopk, Send-Sketch, Improved-S, and TwoLevel-S.]

[Figure: time (seconds, log scale) vs. number of coefficients k for the same five methods.]

[Figure: SSE (log scale) vs. number of coefficients k for the same five methods, with the ideal SSE shown for reference.]

Experiments: Vary ε

[Figure: SSE (log scale) vs. ε (10^-5 to 10^-1) for H-WTopk, Improved-S, and TwoLevel-S, with the ideal SSE shown for reference.]

[Figure: communication (bytes, log scale) vs. ε for Improved-S and TwoLevel-S.]

[Figure: time (seconds, log scale) vs. ε for Improved-S and TwoLevel-S.]

Experiments: Vary ε

10−5

10−4

10−3

10−2

10−1

101

102

103

ε

Tim

e (S

econ

ds)

Improved−STwoLevel−S

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Page 274: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram

Experiments: Vary n

[Figure: communication (bytes, log scale) vs. dataset size (0 to 200GB) for Send-V, H-WTopk, Send-Sketch, Improved-S, and TwoLevel-S.]

[Figure: time (seconds, log scale) vs. dataset size (0 to 200GB) for the same five methods.]

Experiments: Vary u

[Figure: communication (bytes, log scale) vs. log2 u (20 to 32) for Send-V, H-WTopk, Send-Sketch, Improved-S, and TwoLevel-S.]

[Figure: time (seconds, log scale) vs. log2 u (20 to 32) for the same five methods.]

Experiments: Vary β

[Figure: communication (bytes, log scale) vs. split size (64, 128, 256, 512MB) for Send-V, H-WTopk, Send-Sketch, Improved-S, and TwoLevel-S.]

[Figure: time (seconds, log scale) vs. split size (64, 128, 256, 512MB) for the same five methods.]

Experiments: Vary α

[Figure: communication (bytes, log scale) for α = 0.8, 1.1, and 1.4, for Send-V, H-WTopk, Send-Sketch, Improved-S, and TwoLevel-S.]

[Figure: time (seconds, log scale) for α = 0.8, 1.1, and 1.4, for the same five methods.]

Experiments: Vary B

[Figure: time (seconds, log scale) vs. bandwidth (0 to 100% of 100Mbps) for Send-V, H-WTopk, Send-Sketch, Improved-S, and TwoLevel-S.]

Experiments: WorldCup Dataset

[Figure: communication (bytes, log scale) on the WorldCup dataset for Send-V, H-WTopk, Send-Sketch, Improved-S, and TwoLevel-S.]

[Figure: time (seconds, log scale) on the WorldCup dataset for the same five methods.]

Conclusions

We study the problem of efficiently computing wavelet histograms in MapReduce clusters.

We present both exact and approximate algorithms.
TwoLevel-S is especially easy to implement and ideal in practice.

For 200GB of data with log2 u = 29 it takes 10 minutes with only 2MB of communication!

Our work is just the tip of the iceberg for data summarization techniques in MapReduce.

Many others remain, including:

  other histograms, including the V-optimal histogram,
  sketches and synopses,
  geometric summaries (ε-approximations and coresets),
  graph summaries (distance oracles).


The End

Thank You

Q and A


Introduction: Histograms

We may also compute wi with the wavelet basis vectors ψi.


x      1   2   3   4   5   6   7   8
v(x)   3   5  10   8   2   2  10  14

ℓ = 3:  v(1), ..., v(8) = 3, 5, 10, 8, 2, 2, 10, 14
ℓ = 2:  averages a4 = 4, a5 = 9, a6 = 2, a7 = 12;  coefficients w5 = 1, w6 = -1, w7 = 0, w8 = 2
ℓ = 1:  averages a2 = 6.5, a3 = 7;  coefficients w3 = 2.5, w4 = 5
ℓ = 0:  average a1 = 6.8;  coefficient w2 = 0.3
w1 = 6.8 (total average)


w6 = 〈v, ψ6〉 = 〈[0 0 v(3) 0 0 0 0 0], ψ6〉 + 〈[0 0 0 v(4) 0 0 0 0], ψ6〉


w3 = 〈v, ψ3〉


w2 = 〈v, ψ2〉


w1 = 〈v, ψ1〉
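The equivalence between the averaging-and-differencing tree and the inner products 〈v, ψi〉 can be checked on the example signal. The sketch below is an illustration only: the function names are made up for this example, and the basis vectors use an unnormalized scaling (entries of ±1/s on a support of size s) chosen so that the inner products reproduce the numbers shown in the tree. The normalization used elsewhere in the work may differ.

```python
def haar_tree(v):
    """Coefficients [w1, ..., wu] by repeated pairwise averaging and differencing."""
    level = list(v)
    details = []  # detail coefficients, coarsest level first
    while len(level) > 1:
        pairs = range(0, len(level), 2)
        avgs = [(level[i] + level[i + 1]) / 2 for i in pairs]
        diffs = [(level[i + 1] - level[i]) / 2 for i in pairs]
        details.insert(0, diffs)
        level = avgs
    coeffs = [level[0]]  # w1 is the total average
    for d in details:
        coeffs.extend(d)
    return coeffs

def haar_basis(u, i):
    """Unnormalized basis vector psi_i (1-indexed) for a signal of length u."""
    if i == 1:
        return [1.0 / u] * u
    level = (i - 1).bit_length() - 1   # i = 2 -> 0; i = 3, 4 -> 1; i = 5..8 -> 2
    j = (i - 1) - 2 ** level           # position of the coefficient within its level
    support = u // 2 ** level
    start, half = j * support, support // 2
    psi = [0.0] * u
    for x in range(start, start + half):
        psi[x] = -1.0 / support        # left half of the support
    for x in range(start + half, start + support):
        psi[x] = 1.0 / support         # right half of the support
    return psi

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

v = [3, 5, 10, 8, 2, 2, 10, 14]
by_tree = haar_tree(v)
by_basis = [dot(v, haar_basis(len(v), i)) for i in range(1, len(v) + 1)]
```

Both lists come out as 6.75, 0.25, 2.5, 5, 1, -1, 0, 2; the tree on the slide shows the first two rounded to 6.8 and 0.3. Because each ψi is nonzero only on its support, w6 depends only on v(3) and v(4), which is what the decomposition of 〈v, ψ6〉 above expresses.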


Background: Hadoop MapReduce, Map Phase

The JobTracker assigns an InputSplit to a TaskTracker; a MapRunner task runs on the TaskTracker to process the split.

The MapRunner acquires a RecordReader from the InputFormat for the file to view the InputSplit as a stream of records, (k1, v1).

The MapRunner invokes the user-specified Mapper for each (k1, v1); the Mapper emits (k2, v2) pairs, which are stored in an in-memory buffer.

When the buffer fills, the optional Combiner is executed over (k2, list(v2)), and the resulting (k2, v2) pairs are dumped to a partition on disk.

[Diagram: splits 1 to 4 -> MapRunner (assigned by the JobTracker) -> RecordReader -> (k1, v1) -> Mapper -> (k2, v2) -> Buffer -> (k2, list(v2)) -> Combiner -> (k2, v2) -> partitions p1, p2.]


Background: Hadoop MapReduce, Shuffle and Sort Phase

The JobTracker assigns Reducers to TaskTrackers for each partition; each Reducer first copies its (k2, v2) pairs and then sorts them on k2.


Background: Hadoop MapReduce, Reduce Phase

The sorted output (k2, list(v2)) is processed one k2 at a time and reduced; the reduced output (k3, v3) is written to the reducer output oi.

[Diagram: partitions p1, p2 -> Reducer: Copy -> (k2, v2) -> Reducer: Sort -> (k2, list(v2)) -> Reducer: Reduce -> (k3, v3) -> output oi, with Reducers assigned by the JobTracker.]
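The data flow described in the Map, Shuffle and Sort, and Reduce slides can be mimicked in a few lines of single-process Python. This is a toy sketch, not Hadoop code: the function name and the way records are partitioned are invented for illustration, and the Combiner here runs once per split rather than each time a buffer fills.

```python
from collections import defaultdict

def run_mapreduce(splits, mapper, combiner, reducer, num_partitions=2):
    """Toy, in-memory imitation of the map -> combine -> shuffle/sort -> reduce flow."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for split in splits:                        # one map task per split
        buffer = defaultdict(list)              # stands in for the in-memory buffer
        for k1, v1 in split:                    # stream of (k1, v1) records
            for k2, v2 in mapper(k1, v1):
                buffer[k2].append(v2)
        for k2, values in buffer.items():       # Combiner sees (k2, list(v2))
            for ck, cv in combiner(k2, values):
                partitions[hash(ck) % num_partitions][ck].append(cv)
    output = []
    for part in partitions:                     # one reducer per partition
        for k2 in sorted(part):                 # sort on k2, then reduce one k2 at a time
            output.extend(reducer(k2, part[k2]))
    return output

# Example: count how often each key x occurs; k1 is a record offset, v1 is the key.
def count_mapper(k1, v1):
    yield (v1, 1)

def sum_values(k2, values):
    yield (k2, sum(values))

splits = [[(0, 3), (1, 5), (2, 3)], [(0, 5), (1, 5)]]
counts = run_mapreduce(splits, count_mapper, sum_values, sum_values)
```

Using the same summing function as Combiner and Reducer works here because addition is associative and commutative, so partial sums per split can be merged safely.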


Exact Top-k Wavelet Coefficients: Hadoop Phase 1

[Diagram, built up over several slides: the JobTracker assigns splits 1 to 3 to three Mappers. Each Mapper reads records (x, null) and accumulates (x, 1) in an in-memory map. In close(), each Mapper passes its local frequencies (x, vj(x)) to a streaming computation of the local coefficients wi,j. For each split j, one priority queue holds the largest wi,j, another holds the smallest wi,j, and the wi,j are streamed to disk in HDFS.]

Local coefficients wi,j computed at each split, listed in decreasing order of value:

split 1          split 2          split 3
i   wi,1         i   wi,2         i   wi,3
5    20          5    12          1    10
2     7          4     7          3     6
1     6          1     2          4     5
4    -2          2    -5          2    -3
6   -15          3   -14          5    -6
3   -30          6   -20          6   -10

Each Mapper emits its largest and its smallest local coefficient to partition p1 as (i, (split j, wi,j)):

split 1:  i = 5, wi,1 = 20    and   i = 3, wi,1 = -30
split 2:  i = 5, wi,2 = 12    and   i = 6, wi,2 = -20
split 3:  i = 1, wi,3 = 10    and   i = 6, wi,3 = -10
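Selecting the extreme local coefficients at each split can be sketched with Python's heapq module. The helper name and the parameter k are illustrative; the example on the slides corresponds to keeping one largest and one smallest coefficient per split, and the local values are taken from the tables above.

```python
import heapq

def extremes(local_coeffs, k):
    """Return the k largest and k smallest local coefficients as lists of (i, w)."""
    items = local_coeffs.items()
    largest = heapq.nlargest(k, items, key=lambda iw: iw[1])
    smallest = heapq.nsmallest(k, items, key=lambda iw: iw[1])
    return largest, smallest

# Local coefficients w_{i,j} per split j, keyed by coefficient index i.
split_coeffs = {
    1: {5: 20, 2: 7, 1: 6, 4: -2, 6: -15, 3: -30},
    2: {5: 12, 4: 7, 1: 2, 2: -5, 3: -14, 6: -20},
    3: {1: 10, 3: 6, 4: 5, 2: -3, 5: -6, 6: -10},
}

# Emit (i, (split j, w_{i,j})) for the extremes of every split.
emitted = []
for j, coeffs in split_coeffs.items():
    largest, smallest = extremes(coeffs, k=1)
    emitted.extend((i, (j, w)) for i, w in largest + smallest)
```

Only six pairs leave the Mappers in this round, regardless of how many coefficients each split holds locally.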

The Reducer collects the received coefficients in R:

R
i   j (split)   wi,j
1   3            10
3   1           -30
5   1            20
5   2            12
6   2           -20
6   3           -10

R
i   wi    Fx    τ+(wi)   τ−(wi)   τ(wi)
1    10   001     42      -40       0
3   -30   100     -8      -60       8
5    32   110     42       22      22
6   -30   011    -10      -60      10
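The numbers in this table are consistent with the following reading, which is inferred from the example rather than stated on the slides: wi is the sum of the values received for coefficient i, Fx marks which splits reported it, τ+(wi) adds the largest reported value of every split that did not report i, τ−(wi) adds the smallest reported value of every such split, and τ(wi) is 0 when the two bounds have opposite signs and the smaller absolute value otherwise. The sketch below, with made-up function and variable names, reproduces the table under that assumption.

```python
def phase1_bounds(received, largest, smallest):
    """Compute (partial sum, upper bound, lower bound, tau) per coefficient.

    received: {i: {split j: w_ij}} values that reached the reducer.
    largest / smallest: {split j: the largest / smallest value split j reported}.
    """
    bounds = {}
    for i, parts in sorted(received.items()):
        partial = sum(parts.values())
        missing = [j for j in largest if j not in parts]  # splits that did not report i
        upper = partial + sum(largest[j] for j in missing)
        lower = partial + sum(smallest[j] for j in missing)
        if upper * lower < 0:          # bounds straddle zero
            tau = 0
        else:
            tau = min(abs(upper), abs(lower))
        bounds[i] = (partial, upper, lower, tau)
    return bounds

received = {1: {3: 10}, 3: {1: -30}, 5: {1: 20, 2: 12}, 6: {2: -20, 3: -10}}
largest = {1: 20, 2: 12, 3: 10}
smallest = {1: -30, 2: -20, 3: -10}
table = phase1_bounds(received, largest, smallest)
```

For example, coefficient 5 was reported by splits 1 and 2 (20 + 12 = 32); split 3 could add at most 10 and at least -10, which gives the bounds 42 and 22.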

[Diagram annotations added in the final builds of this slide: Reduce Phase, Close Phase.]

Exact Top-k Wavelet Coefficients: Hadoop Phase 1

split 1split 2split 3

(x , null)

(x , null)

(x , null)

JobTracker

Mapper

Mapper

Mapper

i wi ,1

5 20

i wi ,1

3 -30

i wi ,2

5 12

i wi ,2

6 -20

i wi ,3

1 10

i wi ,3

6 -10

p1

(i , (split 1,wi ,1))

p1

(i , (split 2,wi ,2))

p1

(i , (split 3,wi ,3))

Reducer

Ri (split) j wi ,j

1 3 103 1 -305 1 205 2 126 2 -206 3 -10

Ri wi Fx τ+(wi) τ−(wi) τ (wi)1 10 001 42 -40 03 -30 100 -8 -60 85 32 110 42 22 226 -30 011 -10 -60 10

Close Phase

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Page 319: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram

Exact Top-k Wavelet Coefficients: Hadoop Phase 1

split 1split 2split 3

(x , null)

(x , null)

(x , null)

JobTracker

Mapper

Mapper

Mapper

i wi ,1

5 20

i wi ,1

3 -30

i wi ,2

5 12

i wi ,2

6 -20

i wi ,3

1 10

i wi ,3

6 -10

p1

(i , (split 1,wi ,1))

p1

(i , (split 2,wi ,2))

p1

(i , (split 3,wi ,3))

Reducer

Ri (split) j wi ,j

1 3 103 1 -305 1 205 2 126 2 -206 3 -10

Ri wi Fx τ+(wi) τ−(wi) τ (wi)1 10 001 42 -40 03 -30 100 -8 -60 85 32 110 42 22 226 -30 011 -10 -60 10

Close Phase

T1 = 22, T1/m = 22/3

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Page 320: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram

Exact Top-k Wavelet Coefficients: Hadoop Phase 1

split 1split 2split 3

(x , null)

(x , null)

(x , null)

JobTracker

Mapper

Mapper

Mapper

i wi ,1

5 20

i wi ,1

3 -30

i wi ,2

5 12

i wi ,2

6 -20

i wi ,3

1 10

i wi ,3

6 -10

p1

(i , (split 1,wi ,1))

p1

(i , (split 2,wi ,2))

p1

(i , (split 3,wi ,3))

Reducer

Ri (split) j wi ,j

1 3 103 1 -305 1 205 2 126 2 -206 3 -10

Ri wi Fx τ+(wi) τ−(wi) τ (wi)1 10 001 42 -40 03 -30 100 -8 -60 85 32 110 42 22 226 -30 011 -10 -60 10

Close Phase

T1 = 22, T1/m = 22/3

CoordinatorState

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Page 321: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram

Exact Top-k Wavelet Coefficients: Hadoop Phase 1

split 1split 2split 3

(x , null)

(x , null)

(x , null)

JobTracker

Mapper

Mapper

Mapper

i wi ,1

5 20

i wi ,1

3 -30

i wi ,2

5 12

i wi ,2

6 -20

i wi ,3

1 10

i wi ,3

6 -10

p1

(i , (split 1,wi ,1))

p1

(i , (split 2,wi ,2))

p1

(i , (split 3,wi ,3))

Reducer

Ri (split) j wi ,j

1 3 103 1 -305 1 205 2 126 2 -206 3 -10

Ri wi Fx τ+(wi) τ−(wi) τ (wi)1 10 001 42 -40 03 -30 100 -8 -60 85 32 110 42 22 226 -30 011 -10 -60 10

Close Phase

T1 = 22, T1/m = 22/3

HDFS or Local

CoordinatorState

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Page 322: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram

Exact Top-k Wavelet Coefficients: Hadoop Phase 1

split 1split 2split 3

(x , null)

(x , null)

(x , null)

JobTracker

Mapper

Mapper

Mapper

i wi ,1

5 20

i wi ,1

3 -30

i wi ,2

5 12

i wi ,2

6 -20

i wi ,3

1 10

i wi ,3

6 -10

p1

(i , (split 1,wi ,1))

p1

(i , (split 2,wi ,2))

p1

(i , (split 3,wi ,3))

Reducer

Ri (split) j wi ,j

1 3 103 1 -305 1 205 2 126 2 -206 3 -10

Ri wi Fx τ+(wi) τ−(wi) τ (wi)1 10 001 42 -40 03 -30 100 -8 -60 85 32 110 42 22 226 -30 011 -10 -60 10

Close Phase

T1 = 22, T1/m = 22/3

T1/m

CoordinatorState

HDFS

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
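The bound table on this slide can be reproduced with a short calculation. The following is a minimal sketch, not the authors' Hadoop code: it assumes k = 1, that a split which has not reported i contributes at most its largest sent coefficient and at least its smallest, and that τ is 0 when the two bounds differ in sign. All variable and function names are hypothetical.

```python
from collections import defaultdict

# (i, split j, w_ij) pairs received by the Reducer, copied from the slide.
received = [(1, 3, 10), (3, 1, -30), (5, 1, 20), (5, 2, 12), (6, 2, -20), (6, 3, -10)]
m, k = 3, 1  # three splits, top-1 (this sketch only handles k = 1)

# Largest and smallest coefficient sent by each split (assumption: k = 1).
hi = {j: max(w for _, s, w in received if s == j) for j in range(1, m + 1)}
lo = {j: min(w for _, s, w in received if s == j) for j in range(1, m + 1)}

partial = defaultdict(int)  # partial sum of w_i over the splits that reported i
seen = defaultdict(set)     # splits that reported i (the bit vector F)
for i, j, w in received:
    partial[i] += w
    seen[i].add(j)

def bits(i):
    """Bit vector F as a string, one bit per split."""
    return "".join("1" if j in seen[i] else "0" for j in range(1, m + 1))

def bounds(i):
    """Return (tau_plus, tau_minus, tau) for coefficient i."""
    missing = [j for j in range(1, m + 1) if j not in seen[i]]
    upper = partial[i] + sum(hi[j] for j in missing)
    lower = partial[i] + sum(lo[j] for j in missing)
    tau = 0 if lower < 0 < upper else min(abs(upper), abs(lower))
    return upper, lower, tau

# T1 is the k-th largest lower bound tau on the magnitude.
T1 = sorted((bounds(i)[2] for i in partial), reverse=True)[k - 1]

for i in sorted(partial):
    print(i, partial[i], bits(i), *bounds(i))
print("T1 =", T1, " T1/m =", T1 / m)
```

Under these assumptions the sketch yields the same rows as the table on the slide, which suggests this is how the slide's numbers were derived.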

Exact Top-k Wavelet Coefficients: Hadoop Phase 1

JobTracker: T1/m = 22/3 is passed in the Job Configuration

The input splits are bypassed; each Mapper reads the coefficients saved for its split:

split 1: saved w_{i,j}
split 2: saved w_{i,j}
split 3: saved w_{i,j}
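A minimal sketch of what a Mapper might do with the threshold from the Job Configuration, assuming it emits every saved coefficient whose magnitude exceeds T1/m and that it did not already send in the first round. The coefficient values are taken from the deck's running example; the dictionary layout and the function name are hypothetical.

```python
T1_over_m = 22 / 3  # threshold read from the Job Configuration

# Saved local coefficients per split: {split j: {i: w_ij}} (running example).
saved = {
    1: {5: 20, 2: 7, 1: 6, 4: -2, 6: -15, 3: -30},
    2: {5: 12, 4: 7, 1: 2, 2: -5, 3: -14, 6: -20},
    3: {1: 10, 3: 6, 4: 5, 2: -3, 5: -6, 6: -10},
}
# Indices each split already sent in the first round (its largest and smallest).
already_sent = {1: {5, 3}, 2: {5, 6}, 3: {1, 6}}

def second_round(split):
    """Emit (i, (split, w_ij)) for unsent coefficients with |w_ij| > T1/m."""
    return [
        (i, (split, w))
        for i, w in saved[split].items()
        if abs(w) > T1_over_m and i not in already_sent[split]
    ]

for j in saved:
    print("split", j, "emits", second_round(j))
```

With these inputs only split 1 and split 2 emit anything new, one pair each; split 3 has no unsent coefficient above the threshold.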

Exact Top-k Wavelet Coefficients: Hadoop Phase 3

JobTracker: T1/m = 22/3 in the Job Configuration

The input splits are bypassed; each Mapper reads its saved coefficients:

split 1          split 2          split 3
i   w_{i,1}      i   w_{i,2}      i   w_{i,3}
5    20          5    12          1    10
2     7          4     7          3     6
1     6          1     2          4     5
4    -2          2    -5          2    -3
6   -15          3   -14          5    -6
3   -30          6   -20          6   -10

Mapper output to the Reducer: (6, (split 1, -15)), (3, (split 2, -14))

Reducer: table R loaded from CoordinatorState

i   w_i   F_x
1    10   001
3   -30   100
5    32   110
6   -30   011

reduce phase: table R after update

i   w_i   F_x   τ+(w_i)   τ−(w_i)   τ(w_i)   τ′(w_i)
1    10   001    24.6      -4.6       0       24.6
3   -44   110   -36.6     -51.3      36.6     51.3
5    32   110    39.3      24.6      24.6     39.3
6   -45   111   -45       -45        45       45

close phase: T2 = 45

CoordinatorState is saved to HDFS or Local; each w_i ∈ R that is still missing a w_{i,j} is saved to HDFS.
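The refined bounds and T2 on this slide can be checked numerically. This is a hedged sketch rather than the authors' implementation: it assumes that every coefficient a split has not yet reported lies within ±T1/m, that τ′ is the larger magnitude of the two bounds, and that a candidate survives when τ′ ≥ T2. Names are hypothetical. The slide appears to truncate values to one decimal (24.6 for 74/3).

```python
m, k = 3, 1
T1_over_m = 22 / 3

# CoordinatorState from the previous round: partial sums and reporting splits.
partial = {1: 10, 3: -30, 5: 32, 6: -30}
seen = {1: {3}, 3: {1}, 5: {1, 2}, 6: {2, 3}}

# Newly received (i, split j, w_ij) pairs from the Mappers.
for i, j, w in [(6, 1, -15), (3, 2, -14)]:
    partial[i] += w
    seen[i].add(j)

def refine(i):
    """Return (tau_plus, tau_minus, tau, tau_prime) for coefficient i."""
    missing = m - len(seen[i])
    upper = partial[i] + missing * T1_over_m
    lower = partial[i] - missing * T1_over_m
    tau = 0 if lower < 0 < upper else min(abs(upper), abs(lower))
    tau_prime = max(abs(upper), abs(lower))
    return upper, lower, tau, tau_prime

rows = {i: refine(i) for i in partial}

# T2 is the k-th largest lower bound; prune candidates whose upper bound is below it.
T2 = sorted((r[2] for r in rows.values()), reverse=True)[k - 1]
survivors = sorted(i for i, r in rows.items() if r[3] >= T2)

# For each survivor, the splits whose local coefficient is still unknown.
to_fetch = {
    i: [j for j in range(1, m + 1) if j not in seen[i]]
    for i in survivors
    if len(seen[i]) < m
}
print("T2 =", T2, "survivors =", survivors, "to fetch =", to_fetch)
```

In this run only coefficients 3 and 6 survive, and only coefficient 3 still lacks a value, from split 3.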

Exact Top-k Wavelet Coefficients: Hadoop Phase 3

JobTracker: T1/m = 22/3 in the Job Configuration; i = 3 in the Distributed Cache

The input splits are bypassed; each Mapper reads its saved coefficients:

split 1          split 2          split 3
i   w_{i,1}      i   w_{i,2}      i   w_{i,3}
5    20          5    12          1    10
2     7          4     7          3     6
1     6          1     2          4     5
4    -2          2    -5          2    -3
6   -15          3   -14          5    -6
3   -30          6   -20          6   -10

Mapper output to the Reducer: (3, (split 3, 6))

Reducer: table R loaded from CoordinatorState

i   w_i   F_x
3   -44   110
6   -45   111

Table R after update

i   w_i   F_x
3   -38   111
6   -45   111

Check: w_6 = −45?

Results: w_6 = −45
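The last round amounts to filling in the missing local values and picking the k coefficients of largest magnitude. A minimal sketch with k = 1, using the numbers from the slide; the variable names are hypothetical.

```python
import heapq

k = 1

# Surviving candidates and their partial sums before the last round.
R = {3: -44, 6: -45}

# Replies from the Mappers for the indices listed in the Distributed Cache.
replies = [(3, 3, 6)]  # (i, split j, w_ij)
for i, j, w in replies:
    R[i] += w

# All candidates are now exact; keep the k with the largest magnitude.
top_k = heapq.nlargest(k, R.items(), key=lambda item: abs(item[1]))
print(top_k)
```

After the update w_3 is exact and smaller in magnitude than w_6, so w_6 is the one reported.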

Outline

1 Introduction and Motivation
    Histograms
    MapReduce and Hadoop

2 Exact Top-k Wavelet Coefficients
    Naive Solution
    Hadoop Wavelet Top-k: Our Efficient Exact Solution

3 Approximate Top-k Wavelet Coefficients
    Linearly Combinable Sketch Method
    Our First Sampling Based Approach
    An Improved Sampling Approach
    Two-Level Sampling

4 Experiments

5 Conclusions
    Hadoop Wavelet Top-k in Hadoop

Exact Top-k Wavelet Coefficients: Hadoop Phase 1

k = 1

JobTracker: split 1, split 2, split 3, one Mapper per split, input records (x, null)

Map: each Mapper adds (x, 1) to its In-Memory Map.

close(): Mapper j passes (x, v_j(x)) to Streaming Compute w_i, which feeds:

Priority Queue: Largest w_{i,j}
Priority Queue: Smallest w_{i,j}
HDFS: w_{i,j} streamed to disk

Local coefficients per split, in the order shown on the slide; the k largest and the k smallest of each split are sent to p1:

split 1: w_{2,1}, w_{1,1}, w_{4,1}, w_{5,1}, w_{6,1}, w_{3,1}
split 2: w_{1,2}, w_{2,2}, w_{6,2}, w_{3,2}, w_{5,2}, w_{4,2}
split 3: w_{2,3}, w_{1,3}, w_{6,3}, w_{4,3}, w_{3,3}, w_{5,3}
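The Mapper-side bookkeeping on these slides (an in-memory count per key, then two priority queues over the streamed coefficients) can be sketched as follows. This is an illustration under assumptions: the wavelet transform itself ("Streaming Compute w_i") is not reproduced, so the sketch starts from an already computed stream of (i, w_{i,j}) pairs, and both function names are hypothetical.

```python
import heapq
from collections import Counter

def local_frequencies(keys):
    """In-memory map: each record key x contributes (x, 1); returns v_j(x)."""
    return Counter(keys)

def select_extremes(coefficients, k):
    """Stream (i, w) pairs once, keeping the k largest and k smallest w."""
    largest = []   # min-heap of (w, i): root is the smallest of the current top k
    smallest = []  # min-heap of (-w, i): root is the largest of the current bottom k
    for i, w in coefficients:
        if len(largest) < k:
            heapq.heappush(largest, (w, i))
        elif w > largest[0][0]:
            heapq.heapreplace(largest, (w, i))
        if len(smallest) < k:
            heapq.heappush(smallest, (-w, i))
        elif -w > smallest[0][0]:
            heapq.heapreplace(smallest, (-w, i))
    top = sorted(((i, w) for w, i in largest), key=lambda p: -p[1])
    bottom = sorted(((i, -nw) for nw, i in smallest), key=lambda p: p[1])
    return top, bottom

# Split 1 of the deck's numeric example, as (i, w_{i,1}) pairs.
split1 = [(5, 20), (2, 7), (1, 6), (4, -2), (6, -15), (3, -30)]
print(select_extremes(split1, 1))
```

Bounded heaps keep memory at O(k) per Mapper regardless of how many coefficients are streamed, which is presumably why the slides show priority queues while the full list goes to disk.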

Exact Top-k Wavelet Coefficients: Hadoop Phase 1

k = 1

JobTracker: split 1, split 2, split 3, one Mapper per split, input records (x, null)

Each Mapper j sends its largest coefficient w_j^+ and its smallest coefficient w_j^- to partition p1 as (i, (split j, w_{i,j})):

split 1: w_1^+ = w_{2,1}, w_1^- = w_{3,1}
split 2: w_2^+ = w_{1,2}, w_2^- = w_{4,2}
split 3: w_3^+ = w_{2,3}, w_3^- = w_{5,3}

Reducer: Compute R, holding for each received i: w_i, F_i, τ+(w_i), τ−(w_i), τ(w_i)

Reduce phase

k = 1

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Page 366: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram

Exact Top-k Wavelet Coefficients: Hadoop Phase 1

split 1split 2split 3

(x , null)

(x , null)

(x , null)

JobTracker

Mapper

Mapper

Mapper

w+1 w2,1

p1

(i , (split 1,wi ,1))

p1

(i , (split 2,wi ,2))

p1

(i , (split 3,wi ,3))

Reducer

w−1 w3,1

w+2 w1,2

w−2 w4,2

w+3 w2,3

w−3 w5,3

Compute

R

wi

Fiτ+(wi)τ−(wi)τ (wi)

Close phase

k = 1

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Page 367: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram

Exact Top-k Wavelet Coefficients: Hadoop Phase 1

split 1split 2split 3

(x , null)

(x , null)

(x , null)

JobTracker

Mapper

Mapper

Mapper

w+1 w2,1

p1

(i , (split 1,wi ,1))

p1

(i , (split 2,wi ,2))

p1

(i , (split 3,wi ,3))

Reducer

w−1 w3,1

w+2 w1,2

w−2 w4,2

w+3 w2,3

w−3 w5,3

Compute

T1

k ’th largest τ (wi)

R

wi

Fiτ+(wi)τ−(wi)τ (wi)

Close phase

k = 1

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Page 368: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram

Exact Top-k Wavelet Coefficients: Hadoop Phase 1

split 1split 2split 3

(x , null)

(x , null)

(x , null)

JobTracker

Mapper

Mapper

Mapper

w+1 w2,1

p1

(i , (split 1,wi ,1))

p1

(i , (split 2,wi ,2))

p1

(i , (split 3,wi ,3))

Reducer

w−1 w3,1

w+2 w1,2

w−2 w4,2

w+3 w2,3

w−3 w5,3

ComputeCoordinatorState

HDFS

Local or HDFS

R

wi

Fiτ+(wi)τ−(wi)τ (wi)

Close phase

T1

k ’th largest τ (wi)

k = 1

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Page 369: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram
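The Phase 1 slide can be read as the following minimal sketch in plain Python (no Hadoop). It is an illustration under assumptions, not the authors' implementation: the function names `phase1_map` and `phase1_reduce`, the dictionary layout of R, and the sample data are hypothetical. The bound rules are also assumed: a split that did not report index i contributes at most its k'th largest and at least its k'th smallest reported value, and $\tau(w_i)$ is 0 when the two bounds have opposite signs and the smaller magnitude otherwise.

```python
import heapq

def phase1_map(local_coeffs, k):
    """Return the k largest and k smallest (index, value) pairs of one split."""
    items = list(local_coeffs.items())
    largest = heapq.nlargest(k, items, key=lambda iv: iv[1])
    smallest = heapq.nsmallest(k, items, key=lambda iv: iv[1])
    return largest, smallest

def phase1_reduce(reports, k):
    """Build the table R and the threshold T1 from the per-split reports."""
    m = len(reports)
    known = [dict(largest + smallest) for largest, smallest in reports]
    kth_hi = [largest[-1][1] for largest, _ in reports]    # k'th largest per split
    kth_lo = [smallest[-1][1] for _, smallest in reports]  # k'th smallest per split
    R = {}
    for i in set().union(*known):
        F = [j for j in range(m) if i in known[j]]  # splits that reported index i
        partial = sum(known[j][i] for j in F)
        # Assumed bound rule: unreported entries lie between kth_lo[j] and kth_hi[j].
        tau_plus = partial + sum(kth_hi[j] for j in range(m) if j not in F)
        tau_minus = partial + sum(kth_lo[j] for j in range(m) if j not in F)
        if tau_plus * tau_minus < 0:
            tau = 0  # bounds straddle zero, so |w_i| may be arbitrarily small
        else:
            tau = min(abs(tau_plus), abs(tau_minus))
        R[i] = {"w": partial, "F": F, "tau_plus": tau_plus,
                "tau_minus": tau_minus, "tau": tau}
    T1 = sorted((row["tau"] for row in R.values()), reverse=True)[k - 1]
    return R, T1

# Hypothetical local coefficients for three splits (index -> value).
splits = [
    {1: 5, 2: 9, 3: -4, 4: 1},
    {1: 8, 2: 2, 3: -1, 4: -6},
    {1: 3, 2: 7, 3: 0, 4: -4},
]
reports = [phase1_map(s, k=1) for s in splits]
R, T1 = phase1_reduce(reports, k=1)
print("T1 =", T1)
```

In a real job the Mapper would keep two bounded priority queues while streaming rather than materializing all coefficients; `heapq.nlargest` and `heapq.nsmallest` stand in for that here.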

Exact Top-k Wavelet Coefficients: Hadoop Phase 2

[Diagram, built up over several animation frames; k = 1 in the example.]

Map phase

• The JobTracker passes $T_1/m$ to the Mappers through the Job Configuration.

• Each Mapper bypasses its input split (split 1, split 2, split 3) and reads the $w_{i,j}$ it saved for that split.

• Saved local coefficients, as ordered in the diagram:
  split 1: $w_{2,1}, w_{1,1}, w_{4,1}, w_{5,1}, w_{6,1}, w_{3,1}$
  split 2: $w_{1,2}, w_{2,2}, w_{6,2}, w_{3,2}, w_{5,2}, w_{4,2}$
  split 3: $w_{2,3}, w_{1,3}, w_{6,3}, w_{4,3}, w_{3,3}, w_{5,3}$

• Mapper j emits (i, (split j, $w_{i,j}$)) for every coefficient with $|w_{i,j}| > T_1/m$.

Reduce phase

• The Reducer loads its coordinator state and refines the rows of R ($w_i$, $F_i$, $\tau^+(w_i)$, $\tau^-(w_i)$, $\tau(w_i)$) with the received $w_{i,j}$.

Close phase

• The Reducer computes $T_2$, the k'th largest $\tau(w_i)$.

• The Reducer computes $\tau'(w_i)$ and prunes from R every $w_i$ with $\tau'(w_i) < T_2$.

• The coordinator state (R) is saved locally or to HDFS.

• The indices i of the $w_i \in R$ that are still missing a $w_{i,j}$ are written to HDFS.
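A minimal sketch of the Phase 2 refinement in plain Python follows. It is an illustration under assumptions, not the authors' code: the names `phase2_map` and `phase2_reduce` and the data are hypothetical. It also assumes that every coefficient not yet received from a split satisfies $|w_{i,j}| \le T_1/m$, and that $\tau'(w_i)$ is the larger of $|\tau^+(w_i)|$ and $|\tau^-(w_i)|$, that is, an upper bound on $|w_i|$.

```python
def phase2_map(local_coeffs, threshold):
    """Emit every saved coefficient whose magnitude exceeds T1/m."""
    return {i: w for i, w in local_coeffs.items() if abs(w) > threshold}

def phase2_reduce(received, T1, k):
    """Refine bounds, compute T2, prune R, and list what is still missing.

    received[j] holds every coefficient split j has sent so far.
    """
    m = len(received)
    bound = T1 / m  # assumed: any unsent coefficient has magnitude <= T1/m
    R = {}
    for i in set().union(*received):
        F = [j for j in range(m) if i in received[j]]
        partial = sum(received[j][i] for j in F)
        slack = (m - len(F)) * bound
        tau_plus, tau_minus = partial + slack, partial - slack
        if tau_plus * tau_minus < 0:
            tau = 0.0
        else:
            tau = min(abs(tau_plus), abs(tau_minus))
        tau_prime = max(abs(tau_plus), abs(tau_minus))  # assumed definition
        R[i] = {"w": partial, "F": F, "tau": tau, "tau_prime": tau_prime}
    T2 = sorted((row["tau"] for row in R.values()), reverse=True)[k - 1]
    R = {i: row for i, row in R.items() if row["tau_prime"] >= T2}
    missing = {i: [j for j in range(m) if j not in row["F"]]
               for i, row in R.items() if len(row["F"]) < m}
    return R, T2, missing

# Hypothetical data: saved local coefficients and what Phase 1 already sent.
splits = [
    {1: 5, 2: 9, 3: -4, 4: 1},
    {1: 8, 2: 2, 3: -1, 4: -6},
    {1: 3, 2: 7, 3: 0, 4: -4},
]
round1 = [{2: 9, 3: -4}, {1: 8, 4: -6}, {2: 7, 4: -4}]
T1 = 10
received = [{**r1, **phase2_map(s, T1 / len(splits))}
            for r1, s in zip(round1, splits)]
R, T2, missing = phase2_reduce(received, T1, k=1)
print("T2 =", T2, "candidates =", sorted(R), "missing =", missing)
```

The `missing` dictionary plays the role of the list of indices written to HDFS at the end of the Close phase.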

Exact Top-k Wavelet Coefficients: Hadoop Phase 3

[Diagram, built up over several animation frames; k = 1 in the example.]

Map phase

• The JobTracker ships the indices $i \in R$ that are missing a $w_{i,j}$ through the Distributed Cache, and $T_1/m$ through the Job Configuration.

• Each Mapper again bypasses its input split (split 1, split 2, split 3) and reads the $w_{i,j}$ it saved for that split.

• Saved local coefficients, as ordered in the diagram:
  split 1: $w_{2,1}, w_{1,1}, w_{4,1}, w_{5,1}, w_{6,1}, w_{3,1}$
  split 2: $w_{1,2}, w_{2,2}, w_{6,2}, w_{3,2}, w_{5,2}, w_{4,2}$
  split 3: $w_{2,3}, w_{1,3}, w_{6,3}, w_{4,3}, w_{3,3}, w_{5,3}$

• Mapper j emits (i, (split j, $w_{i,j}$)) for every $i \in R$ with $|w_{i,j}| \le T_1/m$.

Reduce phase

• The Reducer loads its coordinator state and updates the rows of R ($w_i$, $F_i$, $\tau^+(w_i)$, $\tau^-(w_i)$, $\tau(w_i)$) with the received $w_{i,j}$; R then holds the exact $w_i$.

• Results: top-k (i, $|w_i|$).
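The final round can be sketched in plain Python as below. This is a hedged illustration, not the authors' code: `phase3_map`, `phase3_reduce`, and the data are hypothetical, and the sketch assumes that a coefficient re-sent by a split carries the same value as before, so merging by overwrite is harmless.

```python
def phase3_map(local_coeffs, candidates, threshold):
    """Emit the small coefficients (|w| <= T1/m) of the surviving candidates."""
    return {i: w for i, w in local_coeffs.items()
            if i in candidates and abs(w) <= threshold}

def phase3_reduce(received, late, candidates, k):
    """Combine earlier and late coefficients into exact w_i, then take top-k."""
    exact = {i: 0 for i in candidates}
    for known, new in zip(received, late):
        merged = {**known, **new}  # per split: one value per index
        for i, w in merged.items():
            if i in exact:
                exact[i] += w
    ranked = sorted(exact.items(), key=lambda iw: abs(iw[1]), reverse=True)
    return [(i, abs(w)) for i, w in ranked[:k]]

# Hypothetical data: saved coefficients, what was sent in Phases 1 and 2,
# and the candidate set R left after pruning.
splits = [
    {1: 5, 2: 9, 3: -4, 4: 1},
    {1: 8, 2: 2, 3: -1, 4: -6},
    {1: 3, 2: 7, 3: 0, 4: -4},
]
received = [{1: 5, 2: 9, 3: -4}, {1: 8, 4: -6}, {2: 7, 4: -4}]
candidates = {1, 2, 4}
T1 = 10
late = [phase3_map(s, candidates, T1 / len(splits)) for s in splits]
print(phase3_reduce(received, late, candidates, k=1))
```

With these hypothetical inputs the exact sums for indices 1, 2 and 4 are 16, 18 and -9, so index 2 is the top-1 coefficient by magnitude.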

Approximate Top-k Wavelet Coefficients: Basic RandomSampling

split 1split 2split 3split 4

JobTracker

MapRunner

nj= records in split jsj= split j sample frequency vector

1 RandomizedRecordReader j (RRj) samples nj/ε2n records.

1 RRj randomly selects nj/ε2n offsets in split j .

2 RRj sorts the offsets in ascending order then seeks the record at eachsampled offset.

2 Reducer uses v(x) = s(x)/p, our unbiased estimator for v(x).

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Page 392: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram

Approximate Top-k Wavelet Coefficients: Basic RandomSampling

split 1split 2split 3split 4

JobTracker

MapRunner

RandomizedRecordReader

(x , null)

nj= records in split jsj= split j sample frequency vector

1 RandomizedRecordReader j (RRj) samples nj/ε2n records.

1 RRj randomly selects nj/ε2n offsets in split j .

2 RRj sorts the offsets in ascending order then seeks the record at eachsampled offset.

2 Reducer uses v(x) = s(x)/p, our unbiased estimator for v(x).

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Page 393: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram

Approximate Top-k Wavelet Coefficients: Basic RandomSampling

split 1split 2split 3split 4

JobTracker

MapRunner

RandomizedRecordReader

(x , null)

nj= records in split jsj= split j sample frequency vector

1 RandomizedRecordReader j (RRj) samples nj/ε2n records.

1 RRj randomly selects nj/ε2n offsets in split j .

2 RRj sorts the offsets in ascending order then seeks the record at eachsampled offset.

2 Reducer uses v(x) = s(x)/p, our unbiased estimator for v(x).

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Page 394: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram

Approximate Top-k Wavelet Coefficients: Basic RandomSampling

split 1split 2split 3split 4

JobTracker

MapRunner

RandomizedRecordReader

(x , null)

nj= records in split jsj= split j sample frequency vector

1 RandomizedRecordReader j (RRj) samples nj/ε2n records.

1 RRj randomly selects nj/ε2n offsets in split j .

2 RRj sorts the offsets in ascending order then seeks the record at eachsampled offset.

2 Reducer uses v(x) = s(x)/p, our unbiased estimator for v(x).

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Page 395: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram

Approximate Top-k Wavelet Coefficients: Basic RandomSampling

split 1split 2split 3split 4

JobTracker

MapRunner

RandomizedRecordReader

(x , null)

(x , null)Mapper

nj= records in split jsj= split j sample frequency vector

1 RandomizedRecordReader j (RRj) samples nj/ε2n records.

1 RRj randomly selects nj/ε2n offsets in split j .

2 RRj sorts the offsets in ascending order then seeks the record at eachsampled offset.

2 Reducer uses v(x) = s(x)/p, our unbiased estimator for v(x).

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Page 396: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram

Approximate Top-k Wavelet Coefficients: Basic RandomSampling

split 1split 2split 3split 4

JobTracker

MapRunner

RandomizedRecordReader

(x , null)

(x , null)Mapper

(x , 1)

In-MemoryMap

nj= records in split jsj= split j sample frequency vector

1 RandomizedRecordReader j (RRj) samples nj/ε2n records.

1 RRj randomly selects nj/ε2n offsets in split j .

2 RRj sorts the offsets in ascending order then seeks the record at eachsampled offset.

2 Reducer uses v(x) = s(x)/p, our unbiased estimator for v(x).

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Page 397: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram

Approximate Top-k Wavelet Coefficients: Basic RandomSampling

split 1split 2split 3split 4

JobTracker

MapRunner

RandomizedRecordReader

(x , null)

(x , null)Mapper

In-MemoryMap

close()

nj= records in split jsj= split j sample frequency vector

1 RandomizedRecordReader j (RRj) samples nj/ε2n records.

1 RRj randomly selects nj/ε2n offsets in split j .

2 RRj sorts the offsets in ascending order then seeks the record at eachsampled offset.

2 Reducer uses v(x) = s(x)/p, our unbiased estimator for v(x).

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Page 398: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram

Approximate Top-k Wavelet Coefficients: Basic RandomSampling

split 1split 2split 3split 4

JobTracker

MapRunner

RandomizedRecordReader

(x , null)

(x , null)Mapper

In-MemoryMap

(x , sj(x))

nj= records in split jsj= split j sample frequency vector

1 RandomizedRecordReader j (RRj) samples nj/ε2n records.

1 RRj randomly selects nj/ε2n offsets in split j .

2 RRj sorts the offsets in ascending order then seeks the record at eachsampled offset.

2 Reducer uses v(x) = s(x)/p, our unbiased estimator for v(x).

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Page 399: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram

Approximate Top-k Wavelet Coefficients: Basic RandomSampling

split 1split 2split 3split 4

JobTracker

MapRunner

RandomizedRecordReader

(x , null)

(x , null)Mapper

(x , sj(x))p1

In-MemoryMap

(x , sj(x))

nj= records in split jsj= split j sample frequency vector

close()

1 RandomizedRecordReader j (RRj) samples nj/ε2n records.

1 RRj randomly selects nj/ε2n offsets in split j .

2 RRj sorts the offsets in ascending order then seeks the record at eachsampled offset.

2 Reducer uses v(x) = s(x)/p, our unbiased estimator for v(x).

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Page 400: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram

Approximate Top-k Wavelet Coefficients: Basic RandomSampling

split 1split 2split 3split 4

JobTracker

MapRunner

RandomizedRecordReader

(x , null)

(x , null)Mapper

(x , sj(x))Reducerp1

In-MemoryMap

(x , sj(x))

(x , sj(x))

nj= records in split jsj= split j sample frequency vector

close()

1 RandomizedRecordReader j (RRj) samples nj/ε2n records.

1 RRj randomly selects nj/ε2n offsets in split j .

2 RRj sorts the offsets in ascending order then seeks the record at eachsampled offset.

2 Reducer uses v(x) = s(x)/p, our unbiased estimator for v(x).

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Page 401: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram

Approximate Top-k Wavelet Coefficients: Basic RandomSampling

split 1split 2split 3split 4

JobTracker

MapRunner

RandomizedRecordReader

(x , null)

(x , null)Mapper

(x , sj(x))Reducerp1

In-MemoryMap

(x , sj(x))

(x , sj(x))

nj= records in split jsj= split j sample frequency vector

close()

(x , s(x))

1 RandomizedRecordReader j (RRj) samples nj/ε2n records.

1 RRj randomly selects nj/ε2n offsets in split j .

2 RRj sorts the offsets in ascending order then seeks the record at eachsampled offset.

2 Reducer uses v(x) = s(x)/p, our unbiased estimator for v(x).

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Page 402: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram

Approximate Top-k Wavelet Coefficients: Basic RandomSampling

split 1split 2split 3split 4

JobTracker

MapRunner

RandomizedRecordReader

(x , null)

(x , null)Mapper

(x , sj(x))Reducerp1

In-MemoryMap

(x , sj(x))

(x , sj(x))

(x , v(x))

nj= records in split jsj= split j sample frequency vector

close()

(x , s(x))

1 RandomizedRecordReader j (RRj) samples nj/ε2n records.

1 RRj randomly selects nj/ε2n offsets in split j .

2 RRj sorts the offsets in ascending order then seeks the record at eachsampled offset.

2 Reducer uses v(x) = s(x)/p, our unbiased estimator for v(x).

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Page 403: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram

Approximate Top-k Wavelet Coefficients: Basic RandomSampling

split 1split 2split 3split 4

JobTracker

MapRunner

RandomizedRecordReader

(x , null)

(x , null)Mapper

(x , sj(x))Reducerp1

In-MemoryMap

(x , sj(x))

(x , sj(x))

Centralized Wavelet Top-k

(x , v(x))

nj= records in split jsj= split j sample frequency vector

close()

(x , s(x))

1 RandomizedRecordReader j (RRj) samples nj/ε2n records.

1 RRj randomly selects nj/ε2n offsets in split j .

2 RRj sorts the offsets in ascending order then seeks the record at eachsampled offset.

2 Reducer uses v(x) = s(x)/p, our unbiased estimator for v(x).

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Page 404: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram

Approximate Top-k Wavelet Coefficients: Basic RandomSampling

split 1split 2split 3split 4

JobTracker

MapRunner

RandomizedRecordReader

(x , null)

(x , null)Mapper

(x , sj(x))Reducerp1

In-MemoryMap

(x , sj(x))

(x , sj(x))

Centralized Wavelet Top-k

(x , v(x))

close()

nj= records in split jsj= split j sample frequency vector

close()

(x , s(x))

1 RandomizedRecordReader j (RRj) samples nj/ε2n records.

1 RRj randomly selects nj/ε2n offsets in split j .

2 RRj sorts the offsets in ascending order then seeks the record at eachsampled offset.

2 Reducer uses v(x) = s(x)/p, our unbiased estimator for v(x).

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Page 405: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram

Approximate Top-k Wavelet Coefficients: Basic RandomSampling

split 1split 2split 3split 4

JobTracker

MapRunner

RandomizedRecordReader

(x , null)

(x , null)Mapper

(x , sj(x))Reducerp1

In-MemoryMap

(x , sj(x))

(x , sj(x))

Centralized Wavelet Top-k

(x , v(x))

top-k |wi |

nj= records in split jsj= split j sample frequency vector

close()

(x , s(x))

1 RandomizedRecordReader j (RRj) samples nj/ε2n records.

1 RRj randomly selects nj/ε2n offsets in split j .

2 RRj sorts the offsets in ascending order then seeks the record at eachsampled offset.

2 Reducer uses v(x) = s(x)/p, our unbiased estimator for v(x).

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Page 406: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram

Approximate Top-k Wavelet Coefficients: Basic Random Sampling

[Figure: MapReduce data flow. Components: splits 1-4, JobTracker, MapRunner, RandomizedRecordReader, Mapper, In-Memory Map, Reducer, Centralized Wavelet Top-k. Edge labels: $(x, \text{null})$, $(x, s_j(x))$, $(x, s(x))$, $(x, \hat{v}(x))$, close(), top-k $|w_i|$, p1, o1.]

$n_j$ = number of records in split $j$

$s_j$ = sample frequency vector of split $j$

1 RandomizedRecordReader $j$ ($RR_j$) samples $n_j/(\varepsilon^2 n)$ records.

    1 $RR_j$ randomly selects $n_j/(\varepsilon^2 n)$ offsets in split $j$.

    2 $RR_j$ sorts the offsets in ascending order, then seeks the record at each sampled offset.

2 The Reducer uses $\hat{v}(x) = s(x)/p$, our unbiased estimator for $v(x)$.
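The sampling steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' Hadoop implementation: each split is assumed to be an in-memory list, so an "offset" is simply a list index, and the names `sample_split` and `estimate_frequencies` are hypothetical. The sketch takes $p = 1/(\varepsilon^2 n)$, so that the per-split sample size $n_j/(\varepsilon^2 n)$ equals $p \cdot n_j$.

```python
import random
from collections import Counter

def sample_split(records, n_total, eps, rng):
    """Sample about n_j / (eps^2 * n) records from one split (hypothetical helper)."""
    n_j = len(records)
    t_j = min(n_j, max(1, round(n_j / (eps ** 2 * n_total))))
    # Pick distinct offsets, then visit them in ascending order,
    # mirroring the sort-then-seek access pattern of the record reader.
    offsets = sorted(rng.sample(range(n_j), t_j))
    return Counter(records[o] for o in offsets)  # s_j: sample frequency vector

def estimate_frequencies(splits, eps, seed=0):
    """Combine per-split samples and scale by 1/p to estimate v(x)."""
    rng = random.Random(seed)
    n = sum(len(records) for records in splits)
    p = 1.0 / (eps ** 2 * n)                     # assumed sampling probability
    s = Counter()
    for records in splits:
        s.update(sample_split(records, n, eps, rng))
    return {x: count / p for x, count in s.items()}  # v_hat(x) = s(x) / p

# Example: 4 splits of 1000 records each, keys 0..6.
splits = [[i % 7 for i in range(1000)] for _ in range(4)]
v_hat = estimate_frequencies(splits, eps=0.05)
```

With these inputs each split contributes 100 sampled records, and the estimates sum to roughly the total record count $n = 4000$.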


Approximate Top-k Wavelet Coefficients: Two-Level Sampling

Theorem

$\hat{s}(x)$ is an unbiased estimator of $s(x)$.

Proof.

1 Our estimator is $\hat{s}(x) = \rho(x) + M/(\varepsilon\sqrt{m})$.

Assume that in the first $m'$ splits $s_j(x) < 1/(\varepsilon\sqrt{m})$.

Let $X_j = 1$ if $x$ is sampled in split $j$ and $0$ otherwise.

2 $E[X_j] = \varepsilon\sqrt{m} \cdot s_j(x)$.

Let $M = \sum_{j=1}^{m'} X_j$.

3 $E[M] = \sum_{j=1}^{m'} \varepsilon\sqrt{m} \cdot s_j(x) = \varepsilon\sqrt{m}\,(s(x) - \rho(x))$.

4 $E[\hat{s}(x)] = E[\rho(x) + M/(\varepsilon\sqrt{m})] = \rho(x) + (s(x) - \rho(x)) = s(x)$.
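The estimator in this proof can be illustrated with a small Python sketch. It assumes the emission rule stated for two-level sampling (exact count when $s_j(x) \ge 1/(\varepsilon\sqrt{m})$, otherwise a null marker with probability $\varepsilon\sqrt{m} \cdot s_j(x)$); the function names `two_level_map` and `two_level_reduce` and the toy counts are hypothetical, and the code is not the authors' implementation.

```python
import math
import random
from collections import Counter

def two_level_map(s_j, m, eps, rng):
    """Second-level sampling for one split; s_j holds first-level sample counts."""
    scale = eps * math.sqrt(m)
    emitted = []
    for x, count in s_j.items():
        if count >= 1.0 / scale:
            emitted.append((x, count))   # heavy in this split: send exact count
        elif rng.random() < scale * count:
            emitted.append((x, None))    # light: send a null marker
    return emitted

def two_level_reduce(pairs, m, eps):
    """Form s_hat(x) = rho(x) + M / (eps * sqrt(m)) for every received key."""
    rho, M = Counter(), Counter()
    for x, value in pairs:
        if value is None:
            M[x] += 1                    # number of null markers for x
        else:
            rho[x] += value              # sum of exact counts for x
    scale = eps * math.sqrt(m)
    return {x: rho[x] + M[x] / scale for x in set(rho) | set(M)}

# Toy check: key "a" has sample counts 7, 2, 3, 1 in m = 4 splits, so s(a) = 13.
rng = random.Random(1)
split_samples = [Counter(a=7), Counter(a=2), Counter(a=3), Counter(a=1)]
trials = 20000
total = 0.0
for _ in range(trials):
    pairs = [p for s_j in split_samples for p in two_level_map(s_j, 4, 0.1, rng)]
    total += two_level_reduce(pairs, 4, 0.1)["a"]
mean_estimate = total / trials           # should be close to s(a) = 13
```

Here the threshold is $1/(\varepsilon\sqrt{m}) = 5$, so only the first split sends an exact count ($\rho(a) = 7$) and the other three send null markers with probabilities 0.4, 0.6 and 0.2. Averaged over many trials, the estimate should settle near 13, as the theorem predicts.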


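Approximate Top-k Wavelet Coefficients: Two-Level Sampling

Theorem

$\hat{s}(x)$ is an estimator of $s(x)$ with standard deviation at most $1/\varepsilon$.

Proof.

1 Our estimator is $\hat{s}(x) = \rho(x) + M/(\varepsilon\sqrt{m})$.

Assume that in the first $m'$ splits $s_j(x) < 1/(\varepsilon\sqrt{m})$.

Let $X_j = 1$ if $x$ is sampled in split $j$ and $0$ otherwise.

2 $\mathrm{Var}[X_j] = \varepsilon\sqrt{m} \cdot s_j(x)\,(1 - \varepsilon\sqrt{m} \cdot s_j(x)) \le \varepsilon\sqrt{m} \cdot s_j(x)$.

Let $M = \sum_{j=1}^{m'} X_j$.

3 $\mathrm{Var}[M] \le \sum_{j=1}^{m'} \mathrm{Var}[X_j] \le \sum_{j=1}^{m'} \varepsilon\sqrt{m} \cdot s_j(x) \le m' \cdot \varepsilon\sqrt{m} \cdot 1/(\varepsilon\sqrt{m}) = m'$.

4 $\mathrm{Var}[\hat{s}(x)] = \mathrm{Var}[M/(\varepsilon\sqrt{m})] = \mathrm{Var}[M]/(\varepsilon^2 m) \le m'/(\varepsilon^2 m) \le 1/\varepsilon^2$, so the standard deviation of $\hat{s}(x)$ is at most $1/\varepsilon$.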
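As a numerical sanity check of this bound, the variance can be computed exactly for a toy input. The sketch below assumes the indicators $X_j$ are independent across splits (so their variances add) and treats $\rho(x)$ as fixed; the function name `two_level_variance` and the example counts are hypothetical.

```python
import math

def two_level_variance(light_counts, m, eps):
    """Exact Var[s_hat(x)] given the sample counts s_j(x) of the light splits."""
    scale = eps * math.sqrt(m)
    var_M = 0.0
    for count in light_counts:
        q = scale * count                # P[X_j = 1] = eps * sqrt(m) * s_j(x)
        if q >= 1.0:
            raise ValueError("count is not below 1/(eps*sqrt(m))")
        var_M += q * (1.0 - q)           # variance of a Bernoulli(q) indicator
    return var_M / (eps ** 2 * m)        # Var[M / (eps * sqrt(m))]

# m = 4 splits, eps = 0.1, so the light/heavy threshold is 5.
variance = two_level_variance([2, 3, 1], m=4, eps=0.1)
std_dev = math.sqrt(variance)
bound = 1.0 / 0.1                        # the theorem's bound 1/eps
```

For these inputs the variance comes out at about 16 and the standard deviation at about 4, comfortably below the bound $1/\varepsilon = 10$.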


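Approximate Top-k Wavelet Coefficients: Two-Level Sampling

Theorem

The expected total communication cost of our two-level sampling algorithm is $O(\sqrt{m}/\varepsilon)$.

Proof.

1 Our estimator is $\hat{s}(x) = \rho(x) + M/(\varepsilon\sqrt{m})$.

The first-level sample size is $pn = 1/\varepsilon^2$.

If $s_j(x) \ge 1/(\varepsilon\sqrt{m})$, we emit $(x, s_j(x))$.

2 There are at most $(1/\varepsilon^2)/(1/(\varepsilon\sqrt{m})) = \sqrt{m}/\varepsilon$ such keys.

If $s_j(x) < 1/(\varepsilon\sqrt{m})$, we emit $(x, \text{null})$ with probability $\varepsilon\sqrt{m} \cdot s_j(x)$.

3 In expectation, there are $\sum_j \sum_x \varepsilon\sqrt{m} \cdot s_j(x) \le \varepsilon\sqrt{m} \cdot 1/\varepsilon^2 = \sqrt{m}/\varepsilon$ such keys.

By (2) and (3), the total number of emitted keys is $O(\sqrt{m}/\varepsilon)$.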
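The two counts in this proof can be checked on a toy first-level sample. The sketch below is only an illustration: the function name `expected_emitted_pairs` and the example data are hypothetical, and the first-level sample is constructed to have total size $1/\varepsilon^2 = 100$ as the proof assumes.

```python
import math
from collections import Counter

def expected_emitted_pairs(split_samples, eps):
    """Return (#exact pairs, expected #null pairs) over all splits."""
    m = len(split_samples)
    scale = eps * math.sqrt(m)
    heavy = 0        # pairs (x, s_j(x)) sent with certainty
    light = 0.0      # expected number of (x, null) pairs
    for s_j in split_samples:
        for count in s_j.values():
            if count >= 1.0 / scale:
                heavy += 1
            else:
                light += scale * count   # emission probability of this key
    return heavy, light

# eps = 0.1, m = 4: threshold 5, first-level sample of 4 * 25 = 100 records.
eps = 0.1
split_samples = [
    Counter({"a": 10, "b": 5, **{f"k{j}_{i}": 1 for i in range(10)}})
    for j in range(4)
]
heavy, light = expected_emitted_pairs(split_samples, eps)
per_part_bound = math.sqrt(len(split_samples)) / eps   # sqrt(m)/eps = 20
```

In this example there are 8 exact pairs and about 8 expected null pairs, each within the $\sqrt{m}/\varepsilon = 20$ bound from steps 2 and 3.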

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Page 435: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram

Approximate Top-k Wavelet Coefficients: Two-LevelSampling

Theorem

The expected total communication cost of our two-level samplingalgorithm is O(

√m/ε).

Proof.

1 Our estimator is s(x) = ρ(x) + M/ε√

m.

The first-level sample size is pn = 1/ε2.

If sj(x) ≥ 1/(ε√

m) we emit (x , sj(x)).2 There are ≤ (1/ε2)/(1/ε

√m) =

√m/ε such keys.

If sj(x) < 1/(ε√

m),we emit (x , null) with probability ε

√m · sjx .

3 On expectation there are,∑j

∑x ε√m · sj(x) ≤ ε

√m · 1/ε2 =

√m/ε.

By (2) and (3), the total number of emitted keys is O(√

m/ε).

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Page 436: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram

Approximate Top-k Wavelet Coefficients: Two-LevelSampling

Theorem

The expected total communication cost of our two-level samplingalgorithm is O(

√m/ε).

Proof.

1 Our estimator is s(x) = ρ(x) + M/ε√

m.

The first-level sample size is pn = 1/ε2.

If sj(x) ≥ 1/(ε√

m) we emit (x , sj(x)).2 There are ≤ (1/ε2)/(1/ε

√m) =

√m/ε such keys.

If sj(x) < 1/(ε√

m),we emit (x , null) with probability ε

√m · sjx .

3 On expectation there are,∑j

∑x ε√m · sj(x) ≤ ε

√m · 1/ε2 =

√m/ε.

By (2) and (3), the total number of emitted keys is O(√

m/ε).

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Page 437: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram

Approximate Top-k Wavelet Coefficients: Two-LevelSampling

Theorem

The expected total communication cost of our two-level samplingalgorithm is O(

√m/ε).

Proof.

1 Our estimator is s(x) = ρ(x) + M/ε√

m.

The first-level sample size is pn = 1/ε2.

If sj(x) ≥ 1/(ε√

m) we emit (x , sj(x)).

2 There are ≤ (1/ε2)/(1/ε√m) =

√m/ε such keys.

If sj(x) < 1/(ε√

m),we emit (x , null) with probability ε

√m · sjx .

3 On expectation there are,∑j

∑x ε√m · sj(x) ≤ ε

√m · 1/ε2 =

√m/ε.

By (2) and (3), the total number of emitted keys is O(√

m/ε).

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Page 438: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram

Approximate Top-k Wavelet Coefficients: Two-LevelSampling

Theorem

The expected total communication cost of our two-level samplingalgorithm is O(

√m/ε).

Proof.

1 Our estimator is s(x) = ρ(x) + M/ε√

m.

The first-level sample size is pn = 1/ε2.

If sj(x) ≥ 1/(ε√

m) we emit (x , sj(x)).2 There are ≤ (1/ε2)/(1/ε

√m) =

√m/ε such keys.

If sj(x) < 1/(ε√

m),we emit (x , null) with probability ε

√m · sjx .

3 On expectation there are,∑j

∑x ε√m · sj(x) ≤ ε

√m · 1/ε2 =

√m/ε.

By (2) and (3), the total number of emitted keys is O(√

m/ε).

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Page 439: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram

Approximate Top-k Wavelet Coefficients: Two-LevelSampling

Theorem

The expected total communication cost of our two-level samplingalgorithm is O(

√m/ε).

Proof.

1 Our estimator is s(x) = ρ(x) + M/ε√

m.

The first-level sample size is pn = 1/ε2.

If sj(x) ≥ 1/(ε√

m) we emit (x , sj(x)).2 There are ≤ (1/ε2)/(1/ε

√m) =

√m/ε such keys.

If sj(x) < 1/(ε√

m),we emit (x , null) with probability ε

√m · sjx .

3 On expectation there are,∑j

∑x ε√m · sj(x) ≤ ε

√m · 1/ε2 =

√m/ε.

By (2) and (3), the total number of emitted keys is O(√

m/ε).

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Page 440: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram

Approximate Top-k Wavelet Coefficients: Two-LevelSampling

Theorem

The expected total communication cost of our two-level samplingalgorithm is O(

√m/ε).

Proof.

1 Our estimator is s(x) = ρ(x) + M/ε√

m.

The first-level sample size is pn = 1/ε2.

If sj(x) ≥ 1/(ε√

m) we emit (x , sj(x)).2 There are ≤ (1/ε2)/(1/ε

√m) =

√m/ε such keys.

If sj(x) < 1/(ε√

m),we emit (x , null) with probability ε

√m · sjx .

3 On expectation there are,∑j

∑x ε√m · sj(x) ≤ ε

√m · 1/ε2

=√m/ε.

By (2) and (3), the total number of emitted keys is O(√

m/ε).

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Page 441: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram

Approximate Top-k Wavelet Coefficients: Two-LevelSampling

Theorem

The expected total communication cost of our two-level samplingalgorithm is O(

√m/ε).

Proof.

1 Our estimator is s(x) = ρ(x) + M/ε√

m.

The first-level sample size is pn = 1/ε2.

If sj(x) ≥ 1/(ε√

m) we emit (x , sj(x)).2 There are ≤ (1/ε2)/(1/ε

√m) =

√m/ε such keys.

If sj(x) < 1/(ε√

m),we emit (x , null) with probability ε

√m · sjx .

3 On expectation there are,∑j

∑x ε√m · sj(x) ≤ ε

√m · 1/ε2 =

√m/ε.

By (2) and (3), the total number of emitted keys is O(√

m/ε).

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Page 442: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram

Approximate Top-k Wavelet Coefficients: Two-LevelSampling

Theorem

The expected total communication cost of our two-level samplingalgorithm is O(

√m/ε).

Proof.

1 Our estimator is s(x) = ρ(x) + M/ε√

m.

The first-level sample size is pn = 1/ε2.

If sj(x) ≥ 1/(ε√

m) we emit (x , sj(x)).2 There are ≤ (1/ε2)/(1/ε

√m) =

√m/ε such keys.

If sj(x) < 1/(ε√

m),we emit (x , null) with probability ε

√m · sjx .

3 On expectation there are,∑j

∑x ε√m · sj(x) ≤ ε

√m · 1/ε2 =

√m/ε.

By (2) and (3), the total number of emitted keys is O(√

m/ε).

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Page 443: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram

Approximate Top-k Wavelet Coefficients: ImprovedSampling

split 1split 2split 3split 4

JobTracker

MapRunner

nj= records in split jsj= split j sample frequency vector

1 RandomizedRecordReader j samples tj = nj/ε2n records.

2 If sj(x) > εtj , the Mapper emits (x , sj(x)).

3 Reducer uses v(x) = s(x)/p, our estimator for v(x).

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Page 444: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram

Approximate Top-k Wavelet Coefficients: ImprovedSampling

split 1split 2split 3split 4

JobTracker

MapRunner

RandomizedRecordReader

(x , null)

nj= records in split jsj= split j sample frequency vector

1 RandomizedRecordReader j samples tj = nj/ε2n records.

2 If sj(x) > εtj , the Mapper emits (x , sj(x)).

3 Reducer uses v(x) = s(x)/p, our estimator for v(x).

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Page 445: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram

Approximate Top-k Wavelet Coefficients: ImprovedSampling

split 1split 2split 3split 4

JobTracker

MapRunner

RandomizedRecordReader

(x , null)

(x , null)Mapper

nj= records in split jsj= split j sample frequency vector

1 RandomizedRecordReader j samples tj = nj/ε2n records.

2 If sj(x) > εtj , the Mapper emits (x , sj(x)).

3 Reducer uses v(x) = s(x)/p, our estimator for v(x).

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Page 446: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram

Approximate Top-k Wavelet Coefficients: ImprovedSampling

split 1split 2split 3split 4

JobTracker

MapRunner

RandomizedRecordReader

(x , null)

(x , null)Mapper

(x , 1)

In-MemoryMap

nj= records in split jsj= split j sample frequency vector

1 RandomizedRecordReader j samples tj = nj/ε2n records.

2 If sj(x) > εtj , the Mapper emits (x , sj(x)).

3 Reducer uses v(x) = s(x)/p, our estimator for v(x).

Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce

Approximate Top-k Wavelet Coefficients: Improved Sampling

[Figure: Improved Sampling data flow. Splits 1-4 are read by a RandomizedRecordReader (under the JobTracker and MapRunner), which passes (x, null) to the Mapper; the Mapper accumulates (x, 1) in an In-Memory Map and, on close(), emits (x, $s_j(x)$) to partition p1; the Reducer sums these to (x, $s(x)$), forms (x, $\hat{v}(x)$), and on close() runs Centralized Wavelet Top-k, writing the top-k $|w_i|$ to output o1.]

$n_j$ = number of records in split j
$s_j$ = sample frequency vector of split j

1 RandomizedRecordReader j samples $t_j = n_j / (\varepsilon^2 n)$ records.

2 If $s_j(x) > \varepsilon t_j$, the Mapper emits $(x, s_j(x))$.

3 The Reducer uses $\hat{v}(x) = s(x)/p$ as the estimator for $v(x)$.
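The three steps above can be sketched as a small single-process simulation in Python. This is only an illustration under stated assumptions, not the authors' Hadoop implementation: the function names `sample_split` and `reduce_estimates` are hypothetical, sampling is done without replacement, and $p$ is taken to be $1/(\varepsilon^2 n)$, which is consistent with $t_j = n_j/(\varepsilon^2 n) = p \, n_j$ but is not restated on this slide.

```python
import random
from collections import Counter


def sample_split(records, n, eps, rng):
    """Steps 1-2 for one split (hypothetical helper, not a Hadoop API).

    Samples t_j = n_j / (eps^2 * n) records from the split, builds the
    sample frequency vector s_j, and keeps only keys with s_j(x) > eps * t_j.
    """
    n_j = len(records)
    t_j = min(n_j, round(n_j / (eps * eps * n)))
    # Assumption: sampling without replacement within the split.
    sample = rng.sample(records, t_j)
    s_j = Counter(sample)
    return {x: count for x, count in s_j.items() if count > eps * t_j}


def reduce_estimates(emitted, n, eps):
    """Step 3: sum the emitted s_j(x) into s(x) and scale by 1/p.

    Assumption: p = 1 / (eps^2 * n), the per-record sampling rate implied
    by t_j = p * n_j.
    """
    p = 1.0 / (eps * eps * n)
    s = Counter()
    for part in emitted:
        s.update(part)  # adds counts key by key
    return {x: s_x / p for x, s_x in s.items()}


if __name__ == "__main__":
    rng = random.Random(0)
    eps = 0.05
    # Four splits of 10,000 records: key 7 fills half of each split,
    # every other key occurs exactly once.
    splits = [
        [7] * 5000 + list(range(1000 + 5000 * j, 1000 + 5000 * (j + 1)))
        for j in range(4)
    ]
    n = sum(len(split) for split in splits)
    emitted = [sample_split(split, n, eps, rng) for split in splits]
    estimates = reduce_estimates(emitted, n, eps)
    print(estimates)  # only the frequent key 7 survives the threshold
```

In this sketch, keys whose sampled count never exceeds $\varepsilon t_j$ in any split are not emitted at all, so they are simply absent from the estimates; the resulting estimated frequency vector is what would then be handed to the centralized wavelet top-k step.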
