Detecting Duplicates over Sliding Windows with RAM-Efficient Detached Counting Bloom Filter Arrays

Jiansheng Wei†, Hong Jiang‡, Ke Zhou†, Dan Feng†, Hua Wang†

†School of Computer, Huazhong University of Science and Technology, Wuhan, China

Wuhan National Laboratory for Optoelectronics, Wuhan, China

‡Dept. of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, NE, USA

2011-7-29 2

Background and Motivation

Duplicate detection is an important technique for monitoring and analysing data streams. Recent studies estimate that about 294 billion emails were sent per day worldwide in 2010 [1] and that around 89.1% of them were spam [2].

If all the emails must be scanned for the purposes of anti-spam, anti-virus or homeland security, it is important for the email server to quickly identify duplicates and analyse only unique emails.

Since the real-world data streams can be updated frequently and grow to very large sizes, it is impractical to trace and analyse all the elements in such data streams.

[1] The Radicati Group, Inc. (2010) Email Statistics Report, 2010-2014.

[2] Symantec Corp. (2010) MessageLabs Intelligence: 2010 Annual Security Report.

Background and Motivation

Existing approaches usually employ a decaying window model to evict stale items and record the most recent elements.

[Figure: three decaying window models over a data stream, each marking the expired, monitored and incoming portions. Landmark Window Model: windows delimited by landmark i, landmark i+1, landmark i+2. Jumping Window Model: k equal-sized sub-windows, jumping forward. Sliding Window Model: N elements, sliding forward.]

The challenge is how to design a membership representation scheme that supports fast search, insertion and deletion of time-ordered elements with low RAM consumption.


Background and Motivation

To achieve high space efficiency and maintain high query performance, Bloom filters (BFs) have been widely used to represent membership of static data sets.

A BF for representing a static set S = {e1, e2, …, en} of n elements consists of an array of m bits and a group of k independent hash functions h1, …, hk with the range of {1, …, m}.

A BF faces challenges when dealing with a dynamic set, i.e., it does not support element deletion.

Fig. 1 Insert elements into a Bloom filter. [Figure: elements x and y each set the bits at h1(·), h2(·), …, hk(·) in the array of bits 1 … m.]

Fig. 2 Query a Bloom filter. [Figure: elements y and z are each looked up at their k hashed bit positions.]

False positive: y is probably a member; No false negative: z is definitely not a member.
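The insert and query operations of a Bloom filter can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the class name `BloomFilter`, the SHA-256-based derivation of the k hash functions, and the 0-based bit positions are assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter: an m-bit array probed by k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit keeps the sketch readable

    def _positions(self, item: bytes):
        # Assumption: h_1..h_k are derived by prefixing the item with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(i.to_bytes(4, "little") + item).digest()
            yield int.from_bytes(digest[:8], "little") % self.m

    def insert(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits[p] = 1

    def query(self, item: bytes) -> bool:
        # True means "probably a member"; False means "definitely not a member".
        return all(self.bits[p] for p in self._positions(item))


bf = BloomFilter(m=1024, k=3)
bf.insert(b"x")
bf.insert(b"y")
print(bf.query(b"x"), bf.query(b"y"))  # inserted elements always report True
```

The sketch has no delete method on purpose: clearing a bit could also erase the evidence of another element that shares it, which is why a plain BF cannot evict stale elements.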


Outline

- Background and Motivation
- The DCBA Scheme
- Analysis and Evaluation
- Conclusions

The DCBA Scheme

A detached counting Bloom filter array (DCBA) consists of an array of detached counting Bloom filters (DCBFs) that are homogeneous and share the same hash functions.

A DCBF is a Bloom filter (BF) with each of its bits being associated with a counter, which functions as a timer to record the timestamp of the represented element.

All the timers of a DCBF are further grouped into a timer array (TA) to improve the access efficiency.


The DCBA Scheme

A window with size N slides over a data stream, and all the monitored elements are represented by an array of g DCBFs with each having a capacity of N/(g−1) elements.

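Fig. 3 Using DCBA to represent a sliding window in a single node. [Figure: the N monitored elements of the data stream are covered by BF1 … BFg and their timer arrays TA1 … TAg, with a decaying DCBF at the expired end and a filling DCBF at the incoming end; timer arrays can be loaded from and offloaded to flash or disk, while the rest resides in RAM.]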

The DCBA Scheme

The g DCBFs logically function as a circular FIFO queue. A fully filled DCBF is retained for queries only until its elements become stale, and its timer array (TA), which can consume a large amount of memory, may optionally be offloaded to hard disk or flash storage to save RAM.


The DCBA Scheme

A DCBA can also be split and maintained by r nodes, where each node holds a group of g DCBFs. All the r×g DCBFs logically function as a circular FIFO queue, and there is a filling DCBF and a decaying DCBF to accommodate fresh elements and evict stale elements respectively.

Fig. 4 Using DCBA to represent a sliding window in a decentralized (clustered) system. [Figure: BF1 … BFg×r and their timer arrays TA1 … TAg×r are distributed over node 1 … node r and together cover the N monitored elements of the data stream, with one decaying and one filling DCBF.]

The DCBA Scheme

Bits belonging to different BFs but with the same offset are mapped and stored in the same bit vector so that they can be read or written simultaneously in a single memory access.

Considering that k hash functions are used and shared among the DCBFs, the memory access complexity of querying an item will be O(k) rather than O(k×g).

If a positive is produced by the decaying BF, the associated decaying TA will be queried to check whether the item has already expired.
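The bit-vector layout can be illustrated with a short sketch. It is a simplified model under stated assumptions: each of the m bit vectors is held as a Python integer whose bit j belongs to BF j, the class name `ColumnarFilters` and the SHA-256-based hash derivation are hypothetical, and the timer arrays are omitted.

```python
import hashlib


class ColumnarFilters:
    """g Bloom filters stored column-wise: vectors[p] packs bit p of every filter."""

    def __init__(self, g: int, m: int, k: int) -> None:
        self.g, self.m, self.k = g, m, k
        self.vectors = [0] * m  # bit j of vectors[p] is bit p of filter j

    def _positions(self, item: bytes):
        # Assumption: the k shared hash functions are derived from SHA-256.
        for i in range(self.k):
            digest = hashlib.sha256(i.to_bytes(4, "little") + item).digest()
            yield int.from_bytes(digest[:8], "little") % self.m

    def insert(self, j: int, item: bytes) -> None:
        """Insert item into filter j (the filling filter in a DCBA)."""
        for p in self._positions(item):
            self.vectors[p] |= 1 << j

    def query(self, item: bytes) -> int:
        """Return a bitmask of the filters that report a positive for item."""
        hits = (1 << self.g) - 1
        for p in self._positions(item):  # k reads, independent of g
            hits &= self.vectors[p]
        return hits


filters = ColumnarFilters(g=8, m=1024, k=4)
filters.insert(3, b"chunk-a")
print(bin(filters.query(b"chunk-a")))  # 0b1000: only filter 3 holds the item
```

Because one read returns the bits of all g filters at a given offset, a query touches k bit vectors regardless of g. A set bit for the decaying filter in the returned mask is the case that would then require the timer check.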

Fig. 5 The in-memory structure of a DCBA. [Figure: bit vector 1 … bit vector m each pack the bits of BF1 … BFg at one offset selected by hi(·); the timer arrays TA1 … TAg hold timers with values up to $2^d-1$.]

The DCBA Scheme

The bit width d of each timer is fundamentally determined by the capacity of the DCBA. Suppose that the capacity of a DCBA is N. Each DCBF will then be designed to hold N/(g−1) elements, and each timer will contain $d = \lceil \log_2(N/(g-1)) \rceil$ bits to count from 0 to N/(g−1)−1, where 0 denotes the oldest timestamp.

E.g., if a DCBF is designed to accommodate 1M ($2^{20}$) elements, then each of its constituent timers will consume 20 bits to count from 0 to $2^{20}-1$.
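The timer width can be checked with a few lines of arithmetic. The helper name `timer_bits` is hypothetical, and rounding the per-DCBF capacity up when N is not a multiple of g−1 is an assumption of this sketch.

```python
def timer_bits(window_size: int, g: int) -> int:
    """Bits per timer, d = ceil(log2(N/(g-1))), for a window of N elements and g DCBFs."""
    capacity = -(-window_size // (g - 1))  # ceil(N / (g - 1)) elements per DCBF
    return (capacity - 1).bit_length()     # ceil(log2(capacity)) for capacity >= 1


print(timer_bits(16 * 2**20, 17))  # each DCBF holds 2**20 elements -> 20
print(timer_bits(64 * 2**20, 16))  # N = 64M, g = 16 -> 23
```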


The DCBA Scheme

The DCBA scheme maintains a base timer that counts from 0 to N/(g−1)−1 in a circular manner to generate timestamps for the monitored elements.

To insert an element x, k bits in the filling BF will be chosen and set to 1 according to the hash functions hi(x) (1≤i≤k), and the k associated timers in the filling TA will be set to the value of the base timer.

The base timer will be incremented by 1 after each insert operation.

An element in the decaying DCBF will be considered expired once its timestamp becomes smaller than the base timer.
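The insertion and expiry rules can be sketched as below. The names `DCBF`, `next_base_timer` and `is_expired`, the SHA-256-based hash derivation, and the tiny parameters are assumptions for illustration; rotation of the filling and decaying DCBFs is left out.

```python
import hashlib


class DCBF:
    """One detached counting Bloom filter: m bits plus a detached array of m timers."""

    def __init__(self, m: int, k: int) -> None:
        self.m, self.k = m, k
        self.bits = bytearray(m)
        self.timers = [0] * m  # the timer array (TA); timers[p] belongs to bits[p]

    def positions(self, item: bytes):
        # Assumption: h_1..h_k are derived from SHA-256 with an index prefix.
        result = []
        for i in range(self.k):
            digest = hashlib.sha256(i.to_bytes(4, "little") + item).digest()
            result.append(int.from_bytes(digest[:8], "little") % self.m)
        return result

    def insert(self, item: bytes, base_timer: int) -> None:
        for p in self.positions(item):
            self.bits[p] = 1
            self.timers[p] = base_timer  # stamp the element with the current time


def next_base_timer(base_timer: int, capacity: int) -> int:
    """Advance the circular base timer, which counts 0 .. capacity-1."""
    return (base_timer + 1) % capacity


def is_expired(timestamp: int, base_timer: int) -> bool:
    """Expiry rule for the decaying DCBF: timestamps below the base timer are stale."""
    return timestamp < base_timer


capacity = 4  # N/(g-1) elements per DCBF
filling = DCBF(m=64, k=3)
base_timer = 0
for item in (b"e0", b"e1", b"e2"):
    filling.insert(item, base_timer)
    base_timer = next_base_timer(base_timer, capacity)
print(base_timer)                 # 3
print(is_expired(2, base_timer))  # True
print(is_expired(3, base_timer))  # False
```

Each insertion into the filling DCBF thus retires exactly one timestamp in the decaying DCBF, which keeps the number of monitored elements at N.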


The DCBA Scheme

Since a representing bit as well as its associated timer can be shared by multiple elements in a DCBF, we determine the timestamp of an element according to a count-min policy.

The minimal value t among all the k timers corresponding to an element x will be considered as its timestamp.

The probability that all the k timers corresponding to x happen to be shared and set by other elements with larger timestamps than x is very small and can be constrained by restricting the target error rate of each DCBF.
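A small worked example may make the count-min policy concrete. The function name `timestamp_of` and the hand-picked bit positions are hypothetical; real positions would come from the k hash functions.

```python
def timestamp_of(timers, positions):
    """Count-min policy: the smallest of the k timers is taken as the element's timestamp."""
    return min(timers[p] for p in positions)


timers = [0] * 16

# x is inserted at time 3 and maps to positions 2, 5 and 9 (hypothetical hash values).
x_positions = [2, 5, 9]
for p in x_positions:
    timers[p] = 3

# y is inserted later, at time 7, and shares position 5 with x.
y_positions = [5, 11, 14]
for p in y_positions:
    timers[p] = 7

print([timers[p] for p in x_positions])   # [3, 7, 3]: timer 5 was overwritten by y
print(timestamp_of(timers, x_positions))  # 3: the minimum still recovers x's timestamp
print(timestamp_of(timers, y_positions))  # 7
```

Only if every one of x's k timers were overwritten by later elements would x appear fresher than it really is.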



Analysis and Evaluation

RAM Consumption

If the overall false positive rate of the DCBA is constrained to $\varepsilon_{DCBA}$, the error rate threshold of each DCBF should be

$\varepsilon_{DCBF} = 1 - (1 - \varepsilon_{DCBA})^{1/g}$.

The optimal number of hash functions can be derived as

$k_{DCBF} = \lceil \log_2(1/\varepsilon_{DCBF}) \rceil$.

The total space requirement of a DCBA is expressed as

$m_{DCBA\text{-}total} = g \times (d+1) \times \lceil \log_2 e \times k_{DCBF} \times N/(g-1) \rceil$.

Since the DCBA scheme allows up to g−2 TAs to be offloaded to disks, the minimal RAM consumption of a DCBA is therefore

$m_{DCBA\text{-}RAM} = (g + 2d) \times \lceil \log_2 e \times k_{DCBF} \times N/(g-1) \rceil$.
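The sizing formulas can be evaluated numerically with a short script. The function name `dcba_sizing`, the returned dictionary keys, and the rounding of the per-DCBF capacity are assumptions of this sketch; the timer width d follows the definition given earlier for a window of N elements.

```python
import math


def dcba_sizing(n: int, g: int, eps_dcba: float) -> dict:
    """Evaluate the sizing formulas for window size n, g DCBFs and target error eps_dcba."""
    # eps_DCBF = 1 - (1 - eps_DCBA)^(1/g), computed stably for small error rates.
    eps_dcbf = -math.expm1(math.log1p(-eps_dcba) / g)
    k = math.ceil(math.log2(1.0 / eps_dcbf))
    capacity = math.ceil(n / (g - 1))  # elements per DCBF (rounded up)
    d = (capacity - 1).bit_length()    # timer width, ceil(log2(capacity))
    m = math.ceil(math.log2(math.e) * k * n / (g - 1))  # bits per Bloom filter
    return {
        "k": k,
        "d": d,
        "total_bits": g * (d + 1) * m,  # g BFs plus g timer arrays
        "ram_bits": (g + 2 * d) * m,    # g BFs plus two timer arrays kept in RAM
    }


for g in (16, 32, 64):
    s = dcba_sizing(32 * 2**20, g, 2**-7)
    print(g, s["k"], s["d"], round(s["ram_bits"] / 8 / 2**20), "MB in RAM")
```

For ε = 1/2^7 the script yields k = 11, 12 and 13 for g = 16, 32 and 64, which agrees with the k values quoted for DCBA-16, DCBA-32 and DCBA-64 in the evaluation.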


Analysis and Evaluation

RAM Consumption

Fig. 6 Memory consumption of a DCBA. Each panel plots RAM consumption (MB) against the number of DCBFs (logarithmic scale) for BF, TBF, DCBA-RAM and DCBA-total.

(a) Representing a sliding window with size 64M (1M = $2^{20}$): panels for $\varepsilon = 1/2^{7}$ and $\varepsilon = 1/2^{14}$, with 8 to 128 DCBFs.

(b) Representing a sliding window with size 1,024M (1M = $2^{20}$): panels for $\varepsilon = 1/2^{14}$ and $\varepsilon = 1/2^{20}$, with 64 to 1,024 DCBFs.

Analysis and Evaluation

We have collected a sequence of 2,026,005,927 chunk fingerprints that contains 83,733,597 unique elements, which provides a real-world data set for measuring the performance and query accuracy of the DCBA scheme.

The experimental server that maintains a DCBA for monitoring the chunk fingerprint stream over a sliding window is configured as follows:
- a 32-bit Windows operating system
- a quad-core CPU running at 2 GHz
- 4×2GB RAM
- 16×1TB hard disks organized as a RAID-5 partition
- 2 gigabit network interface cards

Analysis and Evaluation

Query Performance

Fig. 7 Average query performance of a DCBA. Both panels plot query performance (fingerprints/second) against the number of processed unique fingerprints (M, 1M = $2^{20}$) for N = 32M.

(a) Representing a sliding window with an error rate threshold of $1/2^{7}$: TBF (k=7), DCBA-16 (k=11), DCBA-32 (k=12), DCBA-64 (k=13).

(b) Representing a sliding window with an error rate threshold of $1/2^{14}$: TBF (k=14), DCBA-16 (k=18), DCBA-32 (k=19), DCBA-64 (k=20).

Analysis and Evaluation

Query Accuracy

Fig. 8 False positive rate of a DCBA. Both panels plot the false positive rate against the number of processed unique fingerprints (M, 1M = $2^{20}$) for N = 32M, together with the error rate threshold.

(a) Representing a sliding window with an error rate threshold of $1/2^{7}$: TBF (k=7), DCBA-16 (k=11), DCBA-32 (k=12), DCBA-64 (k=13).

(b) Representing a sliding window with an error rate threshold of $1/2^{14}$: TBF (k=14), DCBA-16 (k=18), DCBA-32 (k=19), DCBA-64 (k=20).


Conclusion

This paper proposes a Detached Counting Bloom filter Array (DCBA) scheme to address the problem of detecting duplicates in data streams over sliding windows.
- High query performance
- High space efficiency
- Scalability
- Easy to synchronize

Mathematical analysis and experimental results show that:
- a DCBA can achieve a high query performance that is comparable to the state-of-the-art timing Bloom filter approach in the same environment, with a much lower RAM overhead than the latter;
- the actual error rate of a DCBA can be well constrained at or below its predefined threshold.

In general, a DCBA outperforms existing schemes and is more flexible in representing massive stream elements in sliding windows.

Thanks! Questions?

Acknowledgment

This work is supported in part by the National Basic Research Program (973 Program) of China under Grant No. 2011CB302305, the National High Technology Research and Development Program (863 Program) of China under Grant No. 2009AA01A402, and the US NSF under Grants NSF-IIS-0916859, NSF-CCF-0937993 and NSF-CNS-1016609. The authors are grateful to the anonymous reviewers for their valuable comments and suggestions.