4. Querying

4. Querying

인공지능연구실

2

차례• Accessing the lexicon

• Partially specified query terms

• Boolean query processing

• Ranking and information retrieval

• Evaluating retrieval effectiveness

• Implementation of the cosine measure

• Interactive retrieval

3

Preview(1)

• Boolean query and Ranked query• Boolean query

– AND, OR, NOT 과 같은 접속어로 연결된 term 들의 list 로 구성

– high recall vs. high precision– query 에서의 작은 차이가 매우 다른 결과를

낼 수 있음– query 를 만드는 데는 통찰력과 언어적인

기술이 요구됨

4

Preview(2)

• Ranked query– query 와 각 문서와의 similarity 를 측정하기

위한 heuristic 을 사용– 측정한 numeric indicator 를 바탕으로 r most

closely matching document 가 결과로 나오게 됨 ( r 은 10 이거나 100)

– low precision + high recall– high precision + low recall

5

Accessing the lexicon

• lexicon for an inverted file index 에는 term 뿐만 아니라 보조 정보도 저장– It : address in the inverted file of the correspon

ding list of document numbers

– ft : the number of documents containing the term

• lexicon 들이 어떻게 저장되는 방식 중요

6

Access structures

• Array of records (Figure 4.1)– 20byte string + 4byte address field + 4byte ft

• Array of string pointers (Figure 4.2)– One long contiguous string + an array of 4-byte charact

er pointer

• one word in four indexed (Figure 4.3)– blocking 을 이용하여 string pointer 를 줄임– blocking 은 searching process 를 더 복잡하게 만듦

7

Front coding(1)

• 정렬된 list 에서의 연속된 단어들은 같은 prefix 를 공유할 가능성이 많다는 점 이용

• complete front coding– p.122 Table 4.1 참조– 앞 단어와 같은 prefix characters 의 개수 + 나머지

character 들의 개수 + 남은 문자열– 더 큰 lexicon 에서는 더 많은 메모리를 절약할 수

있다 .

– Binary search 가 불가능

8

Font coding(2)

• Partial “3-in-4” front coding– p.122 Table 4.1 참조– block pointer 가 가리키는 단어들은 완전한

문자열과 길이를 유지– block 내 마지막 단어는 suffix length 가

생략됨– binary search 가능

9

Minimal perfect hashing(1)

• hash function : mechanism for mapping a set L of n keys xj into a set of integer values h(xj) in the range 1≤ h(xj)≤m, with duplicates allowed

• h(x) = x mod m (m > n/a, a : loading(record와 사용가능한 주소와의 비율 ))

• a 값이 작을 수록 두 key 가 같은 hash value 에서 충돌할 가능성이 낮아짐

10


• Perfect hash function– hash function + i = j 일 때만 h(i) = h(j)

– no collision

– minimal perfect hash function(MPHF)• perfect hash function

• m = n

• each of n keys hash to a unique integer between 1 and n

• a = 1.0

– order-preserving minimal perfect hash function• minimal perfect hash function + xi < xj then h(xi) < h(xj)

11


miitt wht

i

mod])[][()(1

||

11

miitt wh s

t

i

mod])[][()(||

12

))(())(()(21tgtgth hh n

12

Design of a minimal perfect hash function(1)

• Probability or the birthday paradox– n items are to be hashed into m slots

– m = 365 and n = 22 → 0.524, n = 23 → 0.493

• Checking for acyclicity and assigning a mapping– figure 4.5 참조

n

in

mnm

m

m

im

m

nm

m

m

m

m

m

m

1 )!(

!1)1(210

13


14

12

11

13

10

9

8

1

76

2

3

5

4

0

0

1

2

3

4

5

67

8

9

10

11

14


• Generating a perfect hash function– m 값을 정한다– w1[i] 와 w2[i] 를 랜덤하게 정한다

– 그래프 G=(V,E) 를 생성한다 (V={1,…,m}, E={(h1

(t), h2(t))}t L})∈– figure 4.5 를 이용해서 mapping g 를 계산– labeling algotirhm 이 실패하면 step 2 로 돌아간다– w1, w2, g 를 return

15

Disk-based lexicon storage• p.131 Figure 4.7 참조 • lexicon 을 저장하는데 primary memory 를 줄일 수

있는 방법 (put it on disk)

• primary memory 에는 각 term 에 해당하는 disk block을 구별하기 위한 정보 저장

• 정보를 찾을 때는 in-memory index 에서 block number를 찾은 다음 그 block 을 buffer 로 읽어 와서 계속 찾음

• 간단하고 작은 양의 primary memory 필요• 비교적 작은 양의 lexicon 을 작은 workstation 이나 pe

rsonal computer 에 저장할 때 효과적

16

Partially specified query terms(1)

• 찾고자 하는 term 이 wildcard character, *, 를 포함하고 있을 때

• Brute force string matching– 전체 lexicon 을 main memory 에 있다면 fast pattern-mat

ching algorithm 을 이용하여 모든 term 들과 pattern 을 비교할 수 있음

• Indexing using n-grams– p.133 Table 4.3 참조– query term 을 n-grams 으로 분해– false match 가 일어날 수 있으므로 결과를 다시 pattern

matcher 를 이용하여 체크해야함 ( ex : lab*r laaber, lavacaber)

17

Partially specified query terms(2)

• Rotated lexicons– p.135 Table 4.4 참조– indexing pointer 를 각각의 lexicon character 에 유지– 속도는 빠르나 메모리 사용이 많다 .– wildcard character 가 앞뒤에 다 있을 때는 하나로

통일– multiple wildcard character

• query 에서 가장 긴 character 가 match 되는 후보 set 을 만듦

• each specified component 를 차례로 검사• pattern matcher 를 이용하여 check

18

Boolean query processing(1)• 간단하고 , AND, OR, NOT 과 같은 접속어로

연결된 term 들의 list 로 구성• inverted file index 를 이용하여 처리할 때 비교적

수월함• Conjunctive queries

– 모든 term 이 AND operation 으로 연결됨– 처리과정

• 각각의 term 들을 기본형으로 만들고 lexicon 에서 찾는다• term 들을 frequency 라 증가하는 순서로 정렬• least frequent term 에 대한 inverted file entry 를 메모리로

가져온다• 다른 term 들의 entry 들을 위의 후보 set 에 대해 처리

19

Boolean query processing(2)

• Term processing order– 후보 set 을 만들 때 가장 least frequent term 을

선택하는 이유• query processing 에 요구되는 메모리 공간을 양을 줄이기

위해서• frequency 순서대로 처리하면 더 빠르기 때문에

(look-up operation 과 merge operation 에 드는 시간 비교 )

– compressed inverted file 을 이용하면 공간은 많이 절약되지만 conjunctive query processing 중에 시간이 더 필요함 ( 압축을 풀고 merge 하는데 드는 시간 )

20

Boolean query processing(3)

• Random access and fast lookup– 빠른 searching 을 지원하기 위해서는 inverted file e

ntry 내에서 random access 가 제공되어야 함– 세 가지 issue

• storage mechanism used for the index• suitable value for bt, the blocking constant for term t• trade-off of time for space

• Nonconjunctive queries– Boolean query expression 이 복잡해질 때는 informal

or ranked query 를 이용

21

Ranking and Information Retrieval

• Boolean Query– Data 에 대한 정보가 확실히 알려져 있는 경우– Exact search 가 가능– Commercial DB, Bibliographic system

• Coordinate matching– Query term 의 수가 더 많은 문서가 더 유사하다– 한 개의 term 이라도 있으면 관련된 문서로 판단

• Ranking– Query 와 document 간의 유사도를 측정– 유사도가 큰 문헌 순서대로 정렬하여 display

22

• Vector 의 내적을 이용한 유사도 계산– Query 를 document 로 취급하여 계산

• 문제점– Term frequency 의 고려가 없다– Scarcity 의 고려가 없다

• 대책– Vector 값을 binary 에서 정수로– 불용어 제거 , idf 사용

Inner Product Similarity

23

• Query 와 Document 를 벡터 공간 상에서 한 점으로 취급

• 벡터 공간은 문서 collection 에 나타나는 색인어에 의해 결정

• 벡터 공간 모델에서 두 문서간의 유사도는 두 벡터 사이의 각의 cosine 값을 이용– 각이 작을수록 유사도가 높다는 아이디어에서 출발– 문서의 길이가 긴 경우 유사도가 커질 가능성을 배제

Vector Space Models

24

Weight Vector

fi

NWt log

• TF 만으로는 문제점 발생• 문서집단의 크기 , text 의 길이 , 단어의

사용빈도를 고려하지 않음• Weight : idf 를 사용

ttdttdtd f

Nfwfw log • TF*IDF 사용

tttqtq f

Nwfw log

• 실제 Query 에서는 TF 가 의미가 없음

25

Cosine Measure

cosYXYX YX

YX cos

td

n

ttq

dqd

dd WW

WWDQ

DQDQine

1

1),(cos

2

1

log1

),(cos

t

n

ttd

dqd f

Nf

WWDQine

2/1

1

2

n

ttdd wW

2/1

1

2

n

ttqq wW

26

• Effectiveness– 얼마나 많은 적합문서가 검색되었나 ?

• Recall( 재현율 )– 전체 적합 문서 중 검색된 적합 문서 수

• Precision( 정확도 )– 검색된 문서 중 적합한 문서 수

• TREC– IR System 의 평가자료

Evaluating retrieval effectiveness

27

• Inverted file 을 만들 때 frequency 를 삽입– <apple;3;[1,2,5]>

– <apple;3;[(1,3),(2,1),(5,1)]>

• 문서 set 이 작을 경우 : unary code

• 평균적으로 감마코드가 이상적

Inverted file 의 구성

28

• Issue : Memory 냐 Disk 냐 ?

• 전체 data 에 대해 모두 처리할 것이냐 , 일부만을 처리할 것이냐 ?

• N-best algorithm

Implementation of the cosine measure

29

• 왜 필요한가 ?– 개념은 있는데 단어가 생각나지 않을 때– 개념과 뜻이 통하지 않는 단어를 사용– 개념을 나타내지 못하는 적은 수의 단어를 사용할 때– 검색결과가 너무 적을 때

• 어떻게 해결할 수 있는가 ?– 사용자 질의어와 의미가 비슷한 말들을 추가하는 방법– 사용자가 좋은 질의를 할 수 있도록 질의어를 추천해 주는

시스템– 검색결과를 보고 사용자가 정보를 더 추가하면 시스템이 이를

처리해 사용자의 의도와 맞는 결과를 다시 보내 주는 방법

Interactive retrieval

30

Relevance Feedback

31

Probabilistic models

• 개념– 특정 query 에 대해 각 문서가 적합할 확률과

부적합할 확률을 계산하여 적합할 확률이 큰 문서를 검색

• 적합문서와 부적합문서 내의 색인어 출현 정보에 의존

• Bayes’ theorem 에 기초

• 이전에 적합성 정보가 준비되어 있어야 한다 .

Documents

4. Querying