Answering Approximate Queries over Autonomous Web Databases Xiangfu Meng, Z. M. Ma, and Li Yan College of Information Science and Engineering, Northeastern

Answering Approximate Queries

over Autonomous Web Databases

Xiangfu Meng, Z. M. Ma, and Li Yan

College of Information Science and Engineering, Northeastern University

Shenyang, 110004, China

3.Approximate Query Results Ranking

4.Algorithmic Solutions

1.Motivation

5.Conclusion and Further Research

OrganizationOrganization

2.Approximate Query

　 1 Motivation　 1 Motivation

Empty answers: When the query over the Web database is too selective,

the answer may be empty or too little. In that case, it is desirable to have

the option of relaxing the original query for presenting more relevant

items that can meet user’s needs and preferences closely.

Approximate query results ranking: Users’ preferences are usually used

to be a hint for ranking. In real applications, however, the preferences

are often associated to specified contexts, and the ranking accuracy is

affected greatly by the interest degrees of pair-wise preferences.

2 Approximate Query2 Approximate Query

Definition 1. (approximate query) Consider an autonomous Web database R with categorical and numerical attributes A = {A1, A2,…, Am} and a query Q over R with a conjunctive selection condition of the form Q = ∧i{1,…, k}(Ai θ ai), where k m and θ {>, <, =, , , , between}. Note that, if is the operator between and ai is an interval which is represented by [ai1, ai2], Ai θ ai has the form of “Ai between ai1 AND ai2”. Each Ai in the query condition is an attribute from A and ai is a value (or interval) in its domain. By relaxing Q, an approximate query Q which is used to find all tuples of R that show similarity to Q above a thresholdTsim(0, 1) is obtained. Specifically,

Q(R) = {t | tR, Similarity (Q, t) > Tsim}.

2.1 Definition of Approximate query2.1 Definition of Approximate query

2.2 Attribute Weight Assignment2.2 Attribute Weight Assignment

• Importance of Specified Categorical Attributes For a point query “Ai = v”, we define IDFi(v) as log(n/Fi(v)) which represents the importance of attribute value v in the database, where n is the number of tuples in the database and Fi(v) is the frequency of tuples in the database of Ai = v. In this paper, the importance of specified categorical attribute value is treated as the importance of its corresponding attribute.


• Importance of Specified Numerical Attributes Let {v1, v2, …, vn} be the values of attribute A that occur in the database. For specified attribute value v in the query, it defined IDF(v) as shown in Equation (1) where h is the bandwidth parameter.

(1)

A popular estimate for the bandwidth is h = 1.06n−1/5, where is the standard deviation of {v1, v2, …, vn}. Intuitively, the denominator in Equation represents the sum of “contributions” to v from every the other point vi in the database. These contributions are modeled as (scaled) Gaussian distributions, so that the further v is from vi, the smaller is the contribution from vi. The importance of specified numerical attribute value is also treated as the importance of its corresponding attribute.


Moreover, if query condition is generalized as “Ai IN Qi”, where Qi is a set of values for categorical attributes, or a range [lb, ub] for numeric attributes, We define the maximum log(n/Fi(v)) of each different value v in Qi. The generalized importance measuring function is shown in Equation (2).

(2)

By normalized processing, the weight wi of attribute Ai specified by the query can be calculated by

(3)

in which, k is the number of attributes specified by the query.

2.3 Attribute Values Similarity Assessment2.3 Attribute Values Similarity Assessment

• Categorical Attribute Values The similarity between two AV-pairs can be measured as the similarity shown by their supertuples. The supertuples contain sets of keywords for each attribute in the relation and the Jaccard Coefficient is used to determine the similarity between two supertuples. Thus, in this paper the similarity coefficient between two categorical values is then calculated as a sum of the Set similarity on each attribute,

(4)

where C1, C2 are supertuples with m attributes, A is the Set corresponding to the i-th attribute, J(,) is the Jaccard Coefficient and is computed as J(A, B) = |AB|/|AB|.

2.3 Attribute Values Similarity Assessment2.3 Attribute Values Similarity Assessment

• Numerical Attribute Values Because of the continuity of numerical data, we propose an approach to estimate the similarity coefficient between two numerical values. Let {v1, v

2, …, vn} be the values of numerical attribute A occurring in the database. Then the similarity coefficient NSim(v, q) between v and q can be defined by Equation (5), where h is the same as mentioned in Equation (1).

(5)

For a numerical condition Ai = q of Q, let i be a sub-threshold for Ai, according to Equation (5), we can then get the relaxation range of numerical attribute Ai as follows:

(6)

For each condition Ci in the original query Q, by extracting values of its corresponding attribute Ai having similarity above the sub-threshold i and adding them into its query range, we can get the relaxed condition Ci. By join all the relaxed conditions, the approximate query Q is formed.Algorithm 1 The query rewriting algorithmInput: Original query Q= {C1, …,Ck}, sub-threshold {1,…, k}.Output: An approximate query Q = {C1, …, Ck}.1. For i = 1, …, k2. Ci ← Ci3. If Ai is a categorical attribute4. v Dom(Ai)5. If VSim(ai, v) > i

6. Add v into the query range of Ci 7. End If8. End If9. If Ai is a numerical attribute10. Replace query range of Ci with [q–h, q+h]11. End If12. Q = Q Ci

13. End For14. Return Q

2.4 Query Relaxation Algorithm2.4 Query Relaxation Algorithm

　　 3 Approximate Query Results Ranking

　　 3 Approximate Query Results Ranking

Consider a database relation r with n tuples r = {t1,…, tn} with schema R

(A1,…, Ak). Let Dom(Ai) is the active domain of attribute Ai.

Definition (contextual preferences) Contextual preferences are of the form

{Ai = ai1 Ai = ai2 , d | X}, where X is ∧jl(Aj θ aj), with ai1, ai2 Dom(Ai),

l {1,…, k}, θ {>, <, =, , , , between, in} and aj Dom(Aj), d is

the interest degree of preference (i.e., compared to ai2 , the interest degree

of ai1 is d, where 0.5 d 1, and it can be learned from the past data or t

he users feedback). The left-hand side of a preference specifies the choice

and the interest degree while the right-hand side is the context.

3.1 Definition of Contextual Preferences3.1 Definition of Contextual Preferences

For collecting the contextual preferences with interest degrees, we employed the association-rules mining to execute on the workload of CarDB. We say that {A1 = a A1 = b, d | X } if conf( X a )>conf(X b), where conf(X a) is the confidence of the association rule X a in the workload, i.e.,

And then, the interest degree d can be obtained by:

frequency( )

frequency( )

X a

X

conf ( )X a

conf ( )

conf ( ) conf ( )

X a

X a X b

d

For any single preference p and any pair of tuples (ti, tj), p either prefers ti to

tj (denoted by ti p tj) or tj to ti (denoted by tj p ti) or it is inapplicable with

respect to ti and tj (denoted by ti~ p tj). Thus, every preference p defines an

interest degree of preference (dpref) over any pair of tuples t, t’ that evaluates

as follows:

3.2 Contextual Preferences Processing3.2 Contextual Preferences Processing

Problem 1. (approximate query results ranking problem): Assume

a set of preferences P ={ ,…, } and an approximate query Q.

The approximate query results ranking problem asks for an order

of of Q(R) such that

where,

The objective of the problem is to find the order over the set of tu

ples in Q(R) that agrees as much as possible with the input prefe

rences. Additionally, the degree of agreement with a class of prefe

rences is weighted by the similarity between the contexts of

those preferences and the approximate query Q.

1XPmXP

3.3 Problem of Query Results Ranking3.3 Problem of Query Results Ranking

The similarity between context X and approximate query Q using their vector representations as follows:

where, the vector representation of an approximate query Q is a vector VQ of size N. The i-th element of the vector corresponds to pair D[i]. If D[i] satisfies one of the conditions of Q, then VQ[i] can

be computed as follows:

Proc 1 (creating orders): For each preference class PXi create a order of the tuples i

n R such that

'arg max Agree( ', )

ii

i i XP

The output of this procedure is a set of m context, order pairs of the form Xi, ,

where Xi is the context for the preferences in PXi and is the order of the tuples in

R that (approximately) satisfies the following Equation. The score of tuple t in tha

t corresponds to Xi is:

i

i

i( | ) ( ) 1is t X n t

where (t) represents the position of t in order . i i

3.4 Approach of Query Results Ranking3.4 Approach of Query Results Ranking

Proc 2 (clustering orders): In order to reduce the number of context, order pair

s, we need to find representative orders ,…, for m initial pairs Xi, , where, l

< m. These orders partition the space of m initial context, order pairs into l group

s. Each group i is characterized by order and a disjunction of contexts

such that for each order is a representative order for the initial order .

The score of tuple t in is given by:

1 l i

1{ ,..., }i mX X X

, iiX j

( | ) ( ) 1iis t X n t

where (t) represents the position of t in order .

ii

ii

Proc 3 (ranking the top-k answers) For an approximate query Q over relati

on R, using the output of step 2, compute the set Qk(R) Q(R) R with

| Qk(R)| = k, such that t Qk(R) and t’{R- Qk(R)} it holds that score(t,

Q) > score(t’, Q), with

.

score(t, Q) =

　　 4 Algorithmic Solutions 　　 4 Algorithmic Solutions

4.1 Creating Orders

Input: r ={t1,…, tn}, a set of preferences from a class

Output: A pair X, where is an order of the tuples in r

such that as many of the preferences in are satisfied.

1. S = {t1,…,tn}, rank = 0

2. For all i{1,…, n} Do

3.

4. End For

5. While S Do

6. rank + = 1,

7. (tv) = rank, S = S – { tv }

8. For all tS Do

9. p(t) = p(t) –wX (t tv)

10. End For

11. End While

1( ) ( )

n

i X i jjp t w t t

Sarg max ( )uv t ut p t

4.2 Finding Representative Orders

Input: Tm = { ,…, }, U ={ , | Tm, Tm}, l

Output: A set of representative l orders Tl = { , ... , }

1. Let B = {} be a buffer that can hold m , 2. While Tm ≠ and l > 0 Do

3. B ← 4. For each Tm Do

5. Pick si = , with minimum rsi from Ui = { , | Tm, | | = [2,

|Tm| - l + 1]}

6. B ←B +{si}

7. End For

8. Pick s = , with minimum rs from B

9. Tm← Tm – – { }, Tl ←Tl + s, l← l – 1

10. End While

11. Return Tl

1 m i iT iT1 1T l lT

i iT

i iT i iT iT

i iTiiT

4.3 Ranking top-k AnswersInput: Tl = { ,…, }, {X1,…,Xm}, an approximate query Q .Output: top-k answer tuples.

1. B ={} be a buffer that can hold k tuples ordered by score

2. L be an l size array storing the last score from each order

3. Repeat

4. For all i {1,…,l} Do

5. Retrieve next tuple t from

6. Compute as the score of t

7. Update L with score of t in

8. If t Q (R)

9. Get score of t from other orders { | Tl and j i} via random access

10. score(t, Q) summing up of all the retrieved scores

11. Insert t, score(t, Q) in the correct position in B

12. End If

13. End For

14. Until

15. Return B

1 l

i

i

ji

1

[ ].score [ ]l

i

B K L i

score( , ) sim( , ) ( | )i it Q Q X s t X % %

iX

Proposed an approximate query method for finding relevant answer items;

Given a formal definition of the contextual preference with interested degree;

Proposed an approximate query results ranking approach based on contextual

preferences;

it would be interesting to research how to minimize the updating cost when the

database and preferences are varied.

　　 Conclusions and Outlook 　　 Conclusions and Outlook

Documents

Answering Approximate Queries over Autonomous Web Databases Xiangfu Meng, Z. M. Ma, and Li Yan College of Information Science and Engineering, Northeastern