
CS6100: Topics in Design and Analysis of Algorithms

Range Searching

John Augustine

CS6100 (Even 2012): Range Searching

The Range Searching Problem

Given a set P of n points in R^d, for fixed integer d ≥ 1, we want to preprocess and store it in a data structure so that, given a query range, typically an axis-parallel rectangle, we can report all the points in the range quickly.

For 1D range searching, we will study (i) balanced binary search trees and (ii) skip lists.

For 2D point sets, we will study (i) kd-trees and (ii) range trees, both of which can be extended to arbitrary d-dimensional point sets.

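[Figure: records plotted by date of birth and salary. The query rectangle spans dates of birth 19,500,000 to 19,559,999 and salaries 3,000 to 4,000, and contains the record G. Ometer (born Aug 19, 1954; salary $3,500). A second panel shows the same ranges with an additional axis ranging from 2 to 4.]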


Balanced Binary Search Trees (BBST)

Given a set P of n points in R stored in a sorted array A, we can construct a tree that has depth O(log n). For simplicity, we begin with the assumption that n = 2^k for some integer k.

The data nodes are in the leaves. The internal nodes store values that guide the search.

The root node stores the 2^(k−1)-th element in A. While searching for a value x in the query phase, if x is less than or equal to the value stored in the root, the search is guided to the left subtree. Otherwise, the search is guided to the right subtree.

The left subtree is constructed recursively over the points in A stored at locations 1 through 2^(k−1). The right subtree is constructed over the points in A located at positions 2^(k−1) + 1 through 2^k.

When constructing the internal node on a 2-element point set, the left subtree simply points to the smaller of the two points and the right subtree to the larger, thus terminating the recursion.

The construction can be easily adapted for arbitrary n. See below for an example.

[Figure: an example BBST whose leaves store the points 3, 10, 19, 23, 30, 37, 49, 59, 62, 70, 80, 89, 100, 105, with the endpoints µ and µ′ of a query range marked.]

Lemma 1. If the set of points is sorted, we can construct the BBST in O(n) time. If not, it takes O(n log n) time, as we have to sort the point set. The BBST data structure requires O(n) storage space.


To search for a single value µ, we start at the root node and ask if µ is greater than the value stored in the root. If it is, we move to the right subtree; otherwise, we move to the left. We continue recursively until we reach a leaf, where we can report whether µ is present.

To query a range [µ, µ′], we traverse the tree for both µ and µ′ until we find the internal node where the two search paths diverge; call it νsplit.

[Figure: the search paths for µ and µ′ from root(T), diverging at νsplit; the selected subtrees hang between the two paths.]

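At νsplit, the search paths for µ and µ′ part ways. As we traverse towards µ (past νsplit), each time we move to a left subtree, we report all points in the right subtree of the node we are leaving. We deal with µ′ symmetrically. Finally, we check the two leaves where the paths end and report them if they lie in [µ, µ′].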


Lemma 2. The time to report points in some range [µ, µ′] is O(k + log n), where k is the number of points in [µ, µ′].

Proof. The tree traversal requires O(log n) time. Reporting the points in a selected subtree requires O(k′) time, where k′ is the number of points on which that particular subtree is built. Since the selected subtrees are disjoint and contain only points in the range, O(k) time is required to report all k points.
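The construction and the range query described above can be sketched in Python as follows. This is a minimal illustration, not the course's implementation: the names BNode, build_bbst and range_query are hypothetical, the input is assumed to be a non-empty sorted list, and each internal node is assumed to store the largest value in its left subtree (which matches the "less than or equal goes left" rule).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BNode:
    key: float                       # leaf: the point; internal: max of left subtree
    left: Optional["BNode"] = None
    right: Optional["BNode"] = None

    @property
    def is_leaf(self) -> bool:
        return self.left is None


def build_bbst(a: List[float], lo: int = 0, hi: Optional[int] = None) -> BNode:
    """Build a leaf-oriented balanced BST over the sorted slice a[lo:hi] in O(n)."""
    if hi is None:
        hi = len(a)
    if hi <= lo:
        raise ValueError("the point set must be non-empty")
    if hi - lo == 1:
        return BNode(a[lo])
    mid = (lo + hi + 1) // 2         # the left half gets the extra element
    return BNode(a[mid - 1], build_bbst(a, lo, mid), build_bbst(a, mid, hi))


def report_subtree(v: BNode, out: List[float]) -> None:
    """Append every point stored in the leaves below v."""
    if v.is_leaf:
        out.append(v.key)
    else:
        report_subtree(v.left, out)
        report_subtree(v.right, out)


def range_query(root: BNode, mu: float, mu2: float) -> List[float]:
    """Report all points in [mu, mu2] (in no particular order)."""
    out: List[float] = []
    # Walk down while both search paths agree; stop at the split node.
    v = root
    while not v.is_leaf and (mu2 <= v.key or mu > v.key):
        v = v.left if mu2 <= v.key else v.right
    if v.is_leaf:
        if mu <= v.key <= mu2:
            out.append(v.key)
        return out
    # Path towards mu: whenever we go left, the right subtree is in range.
    u = v.left
    while not u.is_leaf:
        if mu <= u.key:
            report_subtree(u.right, out)
            u = u.left
        else:
            u = u.right
    if mu <= u.key <= mu2:
        out.append(u.key)
    # Path towards mu2: whenever we go right, the left subtree is in range.
    w = v.right
    while not w.is_leaf:
        if mu2 > w.key:
            report_subtree(w.left, out)
            w = w.right
        else:
            w = w.left
    if mu <= w.key <= mu2:
        out.append(w.key)
    return out
```

For example, with the points 3, 10, 19, 23, 30, 37, 49, 59, 62, 70, 80, 89, 100, 105, calling sorted(range_query(build_bbst(points), 18, 77)) gives [19, 23, 30, 37, 49, 59, 62, 70].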

Operation                         Bound
Preprocessing time                O(n log n)
Space                             O(n)
Searching for 1 element           O(log n)
Reporting a range with k items    O(k + log n)
Insertion                         O(log n)
Deletion                          O(log n)

Table 1: Performance bounds of a BBST containing n points.


Skip List

While the static implementation of a binary search tree is very straightforward, making the data structure dynamic (i.e., adding and deleting points from the point set) is non-trivial.

The skip list is a randomized data structure that allows easy implementation, including updates (insertions and deletions).

In expectation, it has the same performance bounds as BBSTs (in Table 1).

[Figure: a skip list, with the head pointer marked.]


Construction

Again, we assume that the set P of n points is given to us in sorted order. We denote the ith element of P in the sorted list by p_i.

In our data structure, we use nodes with four pointers: left, right, top and bottom.

We first construct the bottom level (or level 0), which is a linked list of the sorted list using the four-pointer node structure. The bottom pointers are set to null.

For each p_i, we toss a fair coin repeatedly until we get Heads. Let ℓ_i be the number of Tails before we obtain the first Heads.

Vertical Pointers. We make ℓ_i additional nodes containing p_i, one for each level from 1 up to ℓ_i, and we chain them up as follows, treating the level-0 node of p_i as node 0. For j < ℓ_i, the top pointer of the jth node points to the (j+1)th node and the bottom pointer of the (j+1)th node points to node j. The top pointer of the ℓ_i-th node is null.


The topmost level is ℓ = max_i ℓ_i. For each level, we have two special boundary nodes, one to the left of all nodes in that level, and the other to the right. The boundary nodes are also chained up.

Horizontal Pointers. We establish horizontal links at each level j, starting from j = 1 up to j = ℓ. We start from the left boundary of level j. For each node η in level j (starting from the left boundary), we step down to its copy in level j − 1 and traverse to the right until we come to a node in level j − 1 that has a copy η′ in level j. We establish bidirectional links between η and η′ and continue this process from η′ until we reach the right boundary.

The head pointer points to the left boundary of level ℓ.
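As a short worked calculation (a sketch under the assumption that the coin tosses are fair and independent, as in the construction above), the expected size and the likely height of the structure follow directly from the distribution of ℓ_i:

```latex
% Tail probability of the number of Tails before the first Heads
\Pr[\ell_i \ge j] = 2^{-j} \quad (j \ge 0)

% Expected number of extra copies of p_i, hence expected number of data nodes
\mathbb{E}[\ell_i] = \sum_{j \ge 1} \Pr[\ell_i \ge j] = \sum_{j \ge 1} 2^{-j} = 1,
\qquad
\mathbb{E}\Big[\sum_{i=1}^{n} (1 + \ell_i)\Big] = 2n

% Union bound on the topmost level \ell = \max_i \ell_i
\Pr[\ell \ge j] \le n \cdot 2^{-j},
\qquad\text{so}\qquad
\Pr[\ell \ge 2\log_2 n] \le \frac{1}{n}
```

So the expected number of data nodes is linear in n, and the number of levels is O(log n) with probability at least 1 − 1/n.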


Searching for a Point p

Here, given p, we want to report if P (stored using the skip list data structure) contains p.

For simplicity, assume that the left boundary nodes store −∞ and the right boundary nodes store +∞.

Start from the head pointer.

Repeat the following steps:

1. Moving right along the current level, find the last node whose value is at most p. If the value is exactly p, we have found it, so we can terminate.

2. Else, if we have reached level 0, then report that p is not in P and terminate.

3. Else, step directly down one level.
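The construction and the search steps above can be sketched in Python as follows. This is an illustrative sketch, not the course's code: the names SkipNode, build_skip_list and contains are hypothetical, the input is assumed to be sorted, and the query value is assumed to be finite. For brevity the sketch links each level left to right as it is created, which yields the same pointers as the column-then-row procedure described earlier.

```python
import math
import random
from typing import Dict, List, Optional, Tuple


class SkipNode:
    """A node with the four pointers: left, right, top and bottom."""

    def __init__(self, value: float) -> None:
        self.value = value
        self.left: Optional["SkipNode"] = None
        self.right: Optional["SkipNode"] = None
        self.top: Optional["SkipNode"] = None
        self.bottom: Optional["SkipNode"] = None


def tails_before_heads(rng: random.Random) -> int:
    """Toss a fair coin until Heads; return the number of Tails seen."""
    count = 0
    while rng.random() < 0.5:
        count += 1
    return count


def build_skip_list(points: List[float], rng: random.Random) -> SkipNode:
    """Build a skip list over sorted points and return the head pointer."""
    heights = [tails_before_heads(rng) for _ in points]
    top_level = max(heights, default=0)
    # One column per value; the boundary columns reach the topmost level.
    columns: List[Tuple[float, int]] = [(-math.inf, top_level)]
    columns += list(zip(points, heights))
    columns.append((math.inf, top_level))

    below: Dict[int, SkipNode] = {}   # column index -> its node one level down
    head: Optional[SkipNode] = None
    for level in range(top_level + 1):
        prev: Optional[SkipNode] = None
        for idx, (value, height) in enumerate(columns):
            if height < level:
                continue              # this column does not reach this level
            node = SkipNode(value)
            if idx in below:          # vertical pointers
                node.bottom = below[idx]
                below[idx].top = node
            below[idx] = node
            if prev is None:
                head = node           # left boundary of the current level
            else:                     # horizontal pointers
                prev.right = node
                node.left = prev
            prev = node
    assert head is not None
    return head


def contains(head: SkipNode, p: float) -> bool:
    """Report whether the finite value p is stored in the skip list."""
    node = head
    while True:
        # Step 1: last node on this level whose value is at most p.
        while node.right is not None and node.right.value <= p:
            node = node.right
        if node.value == p:
            return True
        # Step 2: at level 0 there is nowhere left to look.
        if node.bottom is None:
            return False
        # Step 3: step directly down one level.
        node = node.bottom
```

Passing a seeded random.Random instance makes a particular run reproducible; the answers returned by contains do not depend on the coin tosses, only the running time does.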


Exercises

1. How do we search for points in a range?

2. How do we insert a new node?

3. How do we delete a node?

4. Suppose you are given a skip list. Can you strategically add and delete points so that the query times become bad (i.e., ω(log n))? Note that you will have to play the role of an adaptive adversary that can see the coin tosses (and therefore see the data structure as it evolves).

5. Suppose the coin tosses are hidden from you and you can’t measure the actual query times. Can you still strategically add and delete points so that the query times become bad? (Such an adversary that cannot see the coin tosses is called an oblivious adversary.)


6. An alternative way to ask the previous question is the following. How do we prove that, under an oblivious adversary, the expected performance bounds of a skip list match Table 1?


kd-Trees

Recall that we now want to perform 2D range searches.

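[Figure: records plotted by date of birth and salary. The query rectangle spans dates of birth 19,500,000 to 19,559,999 and salaries 3,000 to 4,000, and contains the record G. Ometer (born Aug 19, 1954; salary $3,500).]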

So, we need a data structure that considers both the x AND the y coordinates.

Kd-trees achieve this by alternating between x and y.

Let us now recursively construct the kd-tree given a set P of n points in 2D.

As in BBSTs, the data is stored in the leaves. The internal nodes serve the purpose of guiding searches to the required leaves.

The root node (level 0) of the kd-tree corresponds to the entire data set.


To construct the level 1 nodes, i.e., the left and right children of the root, we split the data along the x median. The subtree rooted at the left child of the root node stores all points with x-coordinate values no more than the x median. The rest are stored in the right subtree of the root node.

[Figure: a vertical line ℓ through the x median splits P into Pleft and Pright.]

To construct the level 2 nodes, we again split the points stored in the subtrees rooted at each of the level 1 nodes into two roughly equal halves. However, this time, we split along the y median.

We continue recursively, alternating between splitting along x and y medians.


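[Figure: a kd-tree on ten points p1, ..., p10. One panel shows the plane subdivided by the splitting lines ℓ1, ..., ℓ9; the other shows the corresponding binary tree, with the splitting lines at the internal nodes and the points at the leaves.]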

Algorithm BUILDKDTREE(P, depth)
Input. A set of points P and the current depth depth.
Output. The root of a kd-tree storing P.
if P contains only one point
    then return a leaf storing this point
    else if depth is even
        then Split P into two subsets with a vertical line ℓ through the median x-coordinate of the points in P. Let P1 be the set of points to the left of ℓ or on ℓ, and let P2 be the set of points to the right of ℓ.
        else Split P into two subsets with a horizontal line ℓ through the median y-coordinate of the points in P. Let P1 be the set of points below ℓ or on ℓ, and let P2 be the set of points above ℓ.
    νleft ← BUILDKDTREE(P1, depth + 1)
    νright ← BUILDKDTREE(P2, depth + 1)
    Create a node ν storing ℓ, make νleft the left child of ν, and make νright the right child of ν.
    return ν


Preprocessing Time and Storage

At each internal node, we have to split P into two sets. This requires O(n) time if the internal node is built on n elements (for example, using linear-time median selection, or lists presorted by x and by y). Subsequently, two recursive calls are made on point sets that contain roughly n/2 elements. Therefore, the recurrence for the preprocessing time on n elements is:

T(n) = O(n) + 2T(n/2),

which evaluates to T(n) = O(n log n).
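For completeness, the recurrence can be unrolled as follows. This is a sketch that assumes n is a power of 2 and writes the O(n) splitting cost as at most cn for some constant c (both are simplifying assumptions):

```latex
\begin{aligned}
T(n) &\le cn + 2\,T(n/2) \\
     &\le cn + 2\left(c\,\tfrac{n}{2} + 2\,T(n/4)\right) = 2cn + 4\,T(n/4) \\
     &\;\;\vdots \\
     &\le i \cdot cn + 2^{i}\,T\!\left(n/2^{i}\right)
\end{aligned}
```

Setting i = log_2 n gives T(n) ≤ cn log_2 n + n T(1) = O(n log n).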

To analyse the space required by a kd-tree that stores n points, first note that if a binary tree T has n leaves and each of its internal nodes has exactly two children, then T has n − 1 internal nodes. Since any kd-tree is such a tree, the space required is O(n).


Region of a node

Note that each node in the kd-tree has a region associated with it. The region associated with the root is the entire plane. Subsequently, the region gets divided based on where the points are split.

[Figure: a node ν reached by following the splitting lines ℓ1, ℓ2, ℓ3, and the corresponding region(ν) bounded by those lines.]


Query Procedure

Traverse the kd-tree, but only visit nodes whose regions intersect the query rectangle.

• When a region is fully contained in the query rectangle, just report all points in the subtree.
• When traversal reaches a leaf, check its containment in the query rectangle and report if necessary.
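The construction and the query procedure can be sketched together in Python as follows. This is an illustrative sketch under stated assumptions: the names KDNode, build_kd_tree and search_kd_tree are hypothetical, rectangles are closed and written as (x_lo, x_hi, y_lo, y_hi), and the median is found by sorting at every call, which is simpler than the linear-time split assumed in the analysis and costs an extra logarithmic factor during preprocessing.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]    # (x_lo, x_hi, y_lo, y_hi), closed


@dataclass
class KDNode:
    point: Optional[Point] = None    # set only at leaves
    axis: int = 0                    # 0: vertical line (split on x); 1: horizontal
    split: float = 0.0               # coordinate of the splitting line
    left: Optional["KDNode"] = None
    right: Optional["KDNode"] = None


def build_kd_tree(points: List[Point], depth: int = 0) -> KDNode:
    """Recursive construction, alternating between x and y by depth."""
    if not points:
        raise ValueError("the point set must be non-empty")
    if len(points) == 1:
        return KDNode(point=points[0])
    axis = depth % 2
    pts = sorted(points, key=lambda p: p[axis])   # simple, not linear-time
    mid = (len(pts) + 1) // 2
    return KDNode(None, axis, pts[mid - 1][axis],
                  build_kd_tree(pts[:mid], depth + 1),
                  build_kd_tree(pts[mid:], depth + 1))


def report_subtree(v: KDNode, out: List[Point]) -> None:
    """Append every point stored in the leaves below v."""
    if v.point is not None:
        out.append(v.point)
    else:
        report_subtree(v.left, out)
        report_subtree(v.right, out)


def contains(outer: Rect, inner: Rect) -> bool:
    return (outer[0] <= inner[0] and inner[1] <= outer[1]
            and outer[2] <= inner[2] and inner[3] <= outer[3])


def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[1] and b[0] <= a[1] and a[2] <= b[3] and b[2] <= a[3]


def search_kd_tree(v: KDNode, query: Rect,
                   region: Rect = (-math.inf, math.inf, -math.inf, math.inf),
                   out: Optional[List[Point]] = None) -> List[Point]:
    """Report all points below v that lie in the query rectangle."""
    if out is None:
        out = []
    if v.point is not None:                       # leaf: test the point itself
        x, y = v.point
        if query[0] <= x <= query[1] and query[2] <= y <= query[3]:
            out.append(v.point)
        return out
    for child, is_left in ((v.left, True), (v.right, False)):
        bounds = list(region)
        # The left child keeps the low side of the split, the right the high side.
        bounds[2 * v.axis + (1 if is_left else 0)] = v.split
        child_region = (bounds[0], bounds[1], bounds[2], bounds[3])
        if contains(query, child_region):         # fully inside: report subtree
            report_subtree(child, out)
        elif intersects(child_region, query):     # partially inside: recurse
            search_kd_tree(child, query, child_region, out)
    return out
```

The region of each node is not stored; it is recomputed on the way down from the splitting lines, which keeps the storage at O(n). Treating the right child's region as closed at the splitting line is a conservative choice that stays correct even when several points share a coordinate.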

Lemma 3. A query with an axis-parallel rectangle in a kd-tree of n points takes O(√n + k) time, where k is the number of points reported.

Proof Sketch.

Reporting all points in a region fully contained in the query rectangle takes time linear in the number of points in the region. Therefore, reporting all points in regions contained within the query rectangle takes O(k) time.

Consider the nodes that were visited, but whose regions were not fully contained in the query rectangle. We only spend O(1) time in each such node. Therefore, we can account for the remaining running time by (asymptotically) counting the number of such nodes.

• The region of each such node is cut by one of the four boundaries of the query rectangle.
• Therefore, the number of such nodes is asymptotically upper bounded by the maximum number of intersections of a vertical or horizontal line with regions in the kd-tree.
• We build a recurrence Q(n) that captures the maximum number of regions in a kd-tree on n points that such a line can intersect.
• Since the kd-tree alternates between vertical and horizontal splits, Q(n) must be defined across two levels. In particular, Q(n) = 2 + 2Q(n/4), which evaluates to Q(n) = O(√n).

Thus the total query time is O(√n + k).
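The bound on Q(n) can be checked by unrolling the recurrence. This is a sketch that assumes n is a power of 4 and Q(1) = O(1) (both simplifying assumptions):

```latex
\begin{aligned}
Q(n) &= 2 + 2\,Q(n/4) \\
     &= 2 + 4 + 4\,Q(n/16) \\
     &\;\;\vdots \\
     &= \sum_{j=1}^{i} 2^{j} + 2^{i}\,Q\!\left(n/4^{i}\right)
\end{aligned}
```

With i = log_4 n we have 2^i = √n, so the sum is 2^(i+1) − 2 < 2√n and the last term is √n · Q(1). Hence Q(n) = O(√n).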


Range Trees

The range tree is a data structure for range searching whose query time (more precisely, its non-output-sensitive term) is polylogarithmic in n instead of O(√n).

Its preprocessing time and space complexity are both O(n log n).

The key to designing multi-dimensional range searching data structures is to combine searching along multiple coordinate axes.

While we alternated between the x and y coordinates in kd-trees, in range trees we first build a tree on the x-coordinate and then, for each internal node of the x-coordinate tree, we build a separate tree on the y-coordinate.

To construct the range tree, it is helpful to store two copies of the set of points (at each recursive call), one sorted according to the x-coordinates and the other sorted according to the y-coordinates.


Recall 1D Range Searching

Before we see how 2D range trees can be constructed, we first recall 1D BBSTs.

[Figure: a 1D range query on a BBST, with the search paths for µ and µ′ diverging at νsplit.]

We store the data as leaves in a balanced binary search tree.

The canonical subset P(v) of a node v is the data stored in the leaves of the subtree rooted at v.

In 2D range trees, the primary tree is a 1D BBST based on the x-coordinate of the points. For each internal node v, we additionally store an associated tree based on the y-coordinates of the canonical subset P(v) of v.


2D Range Tree

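[Figure: the primary tree T is a binary search tree on x-coordinates. Each node ν, with canonical subset P(ν), points to an associated structure Tassoc(ν), a binary search tree on the y-coordinates of P(ν).]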

Algorithm BUILD2DRANGETREE(P)
Input. A set P of points in the plane.
Output. The root of a 2-dimensional range tree.
Construct the associated structure: Build a binary search tree Tassoc on the set Py of y-coordinates of the points in P. Store at the leaves of Tassoc not just the y-coordinates of the points in Py, but the points themselves.
if P contains only one point
    then Create a leaf ν storing this point, and make Tassoc the associated structure of ν.
    else Split P into two subsets; one subset Pleft contains the points with x-coordinate less than or equal to xmid, the median x-coordinate, and the other subset Pright contains the points with x-coordinate larger than xmid.
        νleft ← BUILD2DRANGETREE(Pleft)
        νright ← BUILD2DRANGETREE(Pright)
        Create a node ν storing xmid, make νleft the left child of ν, make νright the right child of ν, and make Tassoc the associated structure of ν.
return ν


Lemma 4. A 2D range tree on n data points takes O(n log n) storage.

Proof. A data point p is stored only in the associated trees attached to the nodes of the first-level tree on the path from the root to p. At a given level, a point p is stored in only one associated structure. Since each associated tree uses storage linear in the size of its canonical subset, each data point contributes O(1) storage at each of the O(log n) levels. Therefore, the total space is O(n log n).

Algorithm 2DRANGEQUERY(T, [x : x′] × [y : y′])
Input. A 2-dimensional range tree T and a range [x : x′] × [y : y′].
Output. All points in T that lie in the range.
νsplit ← FINDSPLITNODE(T, x, x′)
if νsplit is a leaf
    then Check if the point stored at νsplit must be reported.
    else (∗ Follow the path to x and call 1DRANGEQUERY on the subtrees right of the path. ∗)
        ν ← lc(νsplit)
        while ν is not a leaf
            do if x ≤ xν
                then 1DRANGEQUERY(Tassoc(rc(ν)), [y : y′])
                     ν ← lc(ν)
                else ν ← rc(ν)
        Check if the point stored at ν must be reported.
        Similarly, follow the path from rc(νsplit) to x′, call 1DRANGEQUERY with the range [y : y′] on the associated structures of subtrees left of the path, and check if the point stored at the leaf where the path ends must be reported.


Theorem 1. A 2D range tree on n data points can be constructed in O(n log n) time and occupies O(n log n) space. A range search query on a range with k points in it takes O(k + log^2 n) time.

Proof. The construction time can be proved using ideas from the proof of Lemma 4. On the primary BBST (based on points sorted according to x-coordinates), we perform a 1D range search for nodes whose canonical subsets have x-coordinates that fall within the x-range that we are searching for. There are O(log n) such nodes. For each of these nodes, we look at the associated BBST (based on the canonical subset sorted according to the y-coordinate) and perform a 1D range search for points whose y-coordinates fall within the range we are searching for. Overall, these traversals require O(log^2 n) time.

In these associated BBSTs, we look for subtrees that are fully contained within our search range and report all points in such subtrees. Since such reporting is linear in the number of points stored in those subtrees, this adds an O(k) term to the query time. Therefore, the total query time is O(k + log^2 n).
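The construction and query can be sketched in Python as follows. This is an illustrative sketch, not the slides' algorithm verbatim: the names RTNode, build_range_tree and query_2d are hypothetical, the points are assumed to have distinct x-coordinates and to be passed in sorted by x, and each associated structure is a y-sorted list searched with bisect rather than a BBST (a swapped-in simplification that keeps the O(log n + k′) cost of each 1D query). The associated lists are built bottom-up by merging the children's lists, so each level of the primary tree costs O(n).

```python
import bisect
import heapq
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class RTNode:
    x_mid: float                     # leaf: the point's x; internal: max x on the left
    assoc: List[Point]               # canonical subset P(v), sorted by y
    ys: List[float]                  # y-coordinates of assoc, used by bisect
    left: Optional["RTNode"] = None
    right: Optional["RTNode"] = None


def build_range_tree(points_by_x: List[Point]) -> RTNode:
    """Build the primary tree on x together with the y-sorted associated lists."""
    if not points_by_x:
        raise ValueError("the point set must be non-empty")
    if len(points_by_x) == 1:
        p = points_by_x[0]
        return RTNode(p[0], [p], [p[1]])
    mid = (len(points_by_x) + 1) // 2
    left = build_range_tree(points_by_x[:mid])
    right = build_range_tree(points_by_x[mid:])
    # Merge the children's y-sorted lists in linear time.
    assoc = list(heapq.merge(left.assoc, right.assoc, key=lambda p: p[1]))
    return RTNode(points_by_x[mid - 1][0], assoc, [p[1] for p in assoc], left, right)


def query_1d(v: RTNode, y1: float, y2: float, out: List[Point]) -> None:
    """Report the points of P(v) whose y-coordinate lies in [y1, y2]."""
    lo = bisect.bisect_left(v.ys, y1)
    hi = bisect.bisect_right(v.ys, y2)
    out.extend(v.assoc[lo:hi])


def query_2d(root: RTNode, x1: float, x2: float, y1: float, y2: float) -> List[Point]:
    """Report all points in [x1, x2] x [y1, y2] (in no particular order)."""
    out: List[Point] = []

    def check_leaf(v: RTNode) -> None:
        x, y = v.assoc[0]
        if x1 <= x <= x2 and y1 <= y <= y2:
            out.append((x, y))

    # Find the split node of the two x-search paths.
    v = root
    while v.left is not None and (x2 <= v.x_mid or x1 > v.x_mid):
        v = v.left if x2 <= v.x_mid else v.right
    if v.left is None:
        check_leaf(v)
        return out
    # Path to x1: query the associated structure of every right sibling.
    u = v.left
    while u.left is not None:
        if x1 <= u.x_mid:
            query_1d(u.right, y1, y2, out)
            u = u.left
        else:
            u = u.right
    check_leaf(u)
    # Path to x2: query the associated structure of every left sibling.
    w = v.right
    while w.left is not None:
        if x2 > w.x_mid:
            query_1d(w.left, y1, y2, out)
            w = w.right
        else:
            w = w.left
    check_leaf(w)
    return out
```

A typical call would be query_2d(build_range_tree(sorted(points)), x1, x2, y1, y2). Each point appears in one associated list per level of the primary tree, which is the O(n log n) storage of Lemma 4.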
