Indexing and Inverted Files (Pengindeksan Dan Fail Songsang)


Page 1: Pengindeksan  Dan  Fail Songsang (inverted File)

Indexing and Inverted Files

Page 2: Pengindeksan  Dan  Fail Songsang (inverted File)

Generating the Inverted Index File

[Diagram: the original documents are assigned document IDs; word extraction assigns word IDs; the result is the inverted file, which lists the documents for each word.]

W1: d1, d2, d3

W2: d2, d4, d7, d9

...

Wn: di, …, dn

Page 3: Pengindeksan  Dan  Fail Songsang (inverted File)

Inverted Index

Information retrieval systems build an inverted index so that keywords can be found efficiently in a document collection.

An inverted index has two components: a list of all keywords, called the index, and a list called the posting list.

[Diagram: index entries pointing to a long posting list and a short posting list.]

• It is best if the index is kept in main memory.

• Because of its size, the posting list is stored on disk.

Page 4: Pengindeksan  Dan  Fail Songsang (inverted File)

Generating the Inverted Index File

Map the file names to file IDs. Consider the following original documents:

D1  The Department of Computer Science was established in 1984.

D2  The Department launched its first BSc(Hons) in Computer Studies in 1987.

D3  followed by the MSc in Computer Science which was started in 1991.

D4  The Department also produced its first PhD graduate in 1994.

D5  Our staff have contributed intellectually and professionally to the advancements in these fields.

Page 5: Pengindeksan  Dan  Fail Songsang (inverted File)

Generating the Inverted Index File

[Slide: documents D1 to D5 shown again, with the stop words highlighted in green.]

Page 6: Pengindeksan  Dan  Fail Songsang (inverted File)

Generating the Inverted Index File

After stemming, lowercasing (optional), and deleting numbers (optional):

D1  depart comput scienc establish

D2  depart launch bsc hons comput studi

D3  follow msc comput scienc start

D4  depart produc phd graduat

D5  staff contribut intellectu profession advanc field
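The step from the stemmed documents to posting lists can be sketched in a few lines of Python. This is only an illustration: the stemmed terms are copied from the slide, and the function name build_inverted_index is an assumption, not something the deck defines.

```python
from collections import defaultdict

# Stemmed terms per document, copied from the slide.
docs = {
    "d1": "depart comput scienc establish",
    "d2": "depart launch bsc hons comput studi",
    "d3": "follow msc comput scienc start",
    "d4": "depart produc phd graduat",
    "d5": "staff contribut intellectu profession advanc field",
}


def build_inverted_index(docs):
    """Map each term to the list of document IDs that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in docs[doc_id].split():
            if doc_id not in index[term]:      # record each document once per term
                index[term].append(doc_id)
    return dict(index)


index = build_inverted_index(docs)
for term in sorted(index):                     # print the index sorted by term
    print(term, ",".join(index[term]))
```

Terms that occur in several documents, such as depart, comput and scienc, end up with posting lists longer than one entry.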

Page 7: Pengindeksan  Dan  Fail Songsang (inverted File)

Generating the Inverted Index File (unsorted)

Words        Documents
depart       d1,d2,d4
comput       d1,d2,d3
scienc       d1,d3
establish    d1
launch       d2
bsc          d2
hons         d2
studi        d2
follow       d3
msc          d3
start        d3
produc       d4
phd          d4
graduat      d4
staff        d5
contribut    d5
intellectu   d5
profession   d5
advanc       d5
field        d5

Page 8: Pengindeksan  Dan  Fail Songsang (inverted File)

Generating the Inverted Index File (sorted)

Words        Documents
advanc       d5
bsc          d2
comput       d1,d2,d3
contribut    d5
depart       d1,d2,d4
establish    d1
field        d5
follow       d3
graduat      d4
hons         d2
intellectu   d5
launch       d2
msc          d3
phd          d4
produc       d4
profession   d5
scienc       d1,d3
staff        d5
start        d3
studi        d2

Page 9: Pengindeksan  Dan  Fail Songsang (inverted File)

Index Construction

Each document is represented as a vector:

• <term1, term2, term3, …, termn>

• Each entry gives the number of times the term occurs in that document.

[Table: term counts for the documents with IDs A to I over the terms nova, galaxy, heat, h'wood, film, role, diet, fur.]

Page 10: Pengindeksan  Dan  Fail Songsang (inverted File)

Inverted Index

Conceptually, this was already covered in the vector space model, where the data is generated as term-versus-document vectors.

The inverted file is the "inversion" of the vector file: columns become rows and rows become columns.

docs   t1  t2  t3
D1     1   0   1
D2     1   0   0
D3     0   1   1
D4     1   0   0
D5     1   1   1
D6     1   1   0
D7     0   1   0
D8     0   1   0
D9     0   0   1
D10    0   1   1

Terms  D1  D2  D3  D4  D5  D6  D7  …
t1     1   1   0   1   1   1   0
t2     0   0   1   0   1   1   1
t3     1   0   1   0   1   0   0
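The inversion is just a transpose of the document-by-term matrix. A minimal sketch, using the matrix from the slide; the variable names are illustrative only:

```python
# Document-by-term matrix from the slide: one row per document, columns t1, t2, t3.
doc_vectors = {
    "D1": [1, 0, 1], "D2": [1, 0, 0], "D3": [0, 1, 1], "D4": [1, 0, 0],
    "D5": [1, 1, 1], "D6": [1, 1, 0], "D7": [0, 1, 0], "D8": [0, 1, 0],
    "D9": [0, 0, 1], "D10": [0, 1, 1],
}
terms = ["t1", "t2", "t3"]
doc_ids = list(doc_vectors)

# zip(*rows) swaps rows and columns, giving one row per term.
term_rows = dict(zip(terms, (list(col) for col in zip(*doc_vectors.values()))))

# Keeping only the documents with a nonzero entry gives each term's posting list.
postings = {
    term: [doc for doc, bit in zip(doc_ids, row) if bit]
    for term, row in term_rows.items()
}

for term in terms:
    print(term, term_rows[term], postings[term])
```

Storing only the nonzero entries, as in postings, is what keeps an inverted file compact when the matrix is sparse.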

Page 11: Pengindeksan  Dan  Fail Songsang (inverted File)

Building the Inverted File

Each document is parsed into tokens, and each token is stored together with its document ID.

Doc 1: Now is the time for all good men to come to the aid of their country

Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

Term       Doc #
now        1
is         1
the        1
time       1
for        1
all        1
good       1
men        1
to         1
come       1
to         1
the        1
aid        1
of         1
their      1
country    1
it         2
was        2
a          2
dark       2
and        2
stormy     2
night      2
in         2
the        2
country    2
manor      2
the        2
time       2
was        2
past       2
midnight   2

Page 12: Pengindeksan  Dan  Fail Songsang (inverted File)

Building the Inverted File

Once all documents have been parsed, the inverted file is sorted.

Term       Doc #
a          2
aid        1
all        1
and        2
come       1
country    1
country    2
dark       2
for        1
good       1
in         2
is         1
it         2
manor      2
men        1
midnight   2
night      2
now        1
of         1
past       2
stormy     2
the        1
the        1
the        2
the        2
their      1
time       1
time       2
to         1
to         1
was        2
was        2

Page 13: Pengindeksan  Dan  Fail Songsang (inverted File)

Building the Inverted File

Repeated occurrences of a term within a document are merged, and a frequency value is added.

Term       Doc #  Freq
a          2      1
aid        1      1
all        1      1
and        2      1
come       1      1
country    1      1
country    2      1
dark       2      1
for        1      1
good       1      1
in         2      1
is         1      1
it         2      1
manor      2      1
men        1      1
midnight   2      1
night      2      1
now        1      1
of         1      1
past       2      1
stormy     2      1
the        1      2
the        2      2
their      1      1
time       1      1
time       2      1
to         1      2
was        2      2

Page 14: Pengindeksan  Dan  Fail Songsang (inverted File)

Building the Inverted File

The file can then be split into two:

• the Dictionary file, and

• the Postings file

Page 15: Pengindeksan  Dan  Fail Songsang (inverted File)

Building the Inverted File

Dictionary                      Postings
Term       N docs  Tot Freq     (Doc #, Freq)
a          1       1            (2, 1)
aid        1       1            (1, 1)
all        1       1            (1, 1)
and        1       1            (2, 1)
come       1       1            (1, 1)
country    2       2            (1, 1) (2, 1)
dark       1       1            (2, 1)
for        1       1            (1, 1)
good       1       1            (1, 1)
in         1       1            (2, 1)
is         1       1            (1, 1)
it         1       1            (2, 1)
manor      1       1            (2, 1)
men        1       1            (1, 1)
midnight   1       1            (2, 1)
night      1       1            (2, 1)
now        1       1            (1, 1)
of         1       1            (1, 1)
past       1       1            (2, 1)
stormy     1       1            (2, 1)
the        2       4            (1, 2) (2, 2)
their      1       1            (1, 1)
time       2       2            (1, 1) (2, 1)
to         1       2            (1, 2)
was        1       2            (2, 2)
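The whole construction (parse, sort, merge, split) can be sketched as follows for the two example documents. The tokenizer here (lowercase, letters only) is an assumption chosen to match the example; the slides do not specify one.

```python
import re
from collections import Counter

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}

# Step 1: parse each document into (term, doc_id) pairs.
pairs = [
    (term, doc_id)
    for doc_id, text in docs.items()
    for term in re.findall(r"[a-z]+", text.lower())
]

# Step 2: sort by term, then by document ID.
pairs.sort()

# Step 3: merge repeated (term, doc_id) pairs into a frequency count.
merged = sorted(Counter(pairs).items())        # [((term, doc_id), freq), ...]

# Step 4: split into a postings file and a dictionary file.
postings = {}                                  # term -> [(doc_id, freq), ...]
for (term, doc_id), freq in merged:
    postings.setdefault(term, []).append((doc_id, freq))

dictionary = {                                 # term -> (N docs, Tot Freq)
    term: (len(plist), sum(freq for _, freq in plist))
    for term, plist in postings.items()
}

print(dictionary["the"], postings["the"])
```

In a real system the dictionary entry would also hold a pointer (for example a file offset) to the start of the term's posting list.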

Page 16: Pengindeksan  Dan  Fail Songsang (inverted File)

Inverted Files

Advantage

• Improves search efficiency.

Disadvantages

• A data structure 10 to 100% larger than the text itself must be stored, and the index must be changed whenever the data changes.

• Updating the index is expensive, although sorted arrays are easy and fast to generate.

Page 17: Pengindeksan  Dan  Fail Songsang (inverted File)

Data Structures Used in Inverted Files

• Sorted arrays

• B-trees

• Hashing structures

• Tries (digital search trees)

Page 18: Pengindeksan  Dan  Fail Songsang (inverted File)

Sorted Arrays

An inverted file built this way stores the keywords in a sorted array, together with the number of documents that contain each keyword and a link to those documents.

Searching the array is done by binary search.

Advantage: easy to implement.

Disadvantage: updating the index is rather expensive.
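A lookup in such a sorted array can be sketched with Python's standard bisect module. The entries reuse a few stemmed terms from the earlier example; the layout of each entry (keyword, document count, posting list) is an assumption made for illustration.

```python
from bisect import bisect_left

# Sorted array of (keyword, number of documents, posting list) entries.
index = [
    ("comput", 3, ["d1", "d2", "d3"]),
    ("depart", 3, ["d1", "d2", "d4"]),
    ("phd", 1, ["d4"]),
    ("scienc", 2, ["d1", "d3"]),
    ("staff", 1, ["d5"]),
]
keywords = [entry[0] for entry in index]


def lookup(term):
    """Binary-search the sorted keyword array; return the posting list or []."""
    i = bisect_left(keywords, term)
    if i < len(keywords) and keywords[i] == term:
        return index[i][2]
    return []


print(lookup("scienc"))
print(lookup("msc"))
```

Inserting a new keyword means shifting every entry after it, which is why updates to a sorted array are expensive.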

Page 19: Pengindeksan  Dan  Fail Songsang (inverted File)

Sorted Arrays

Building a sorted-array inverted file can be divided into 2 or 3 steps:

1. The input text is parsed into a list of words together with their locations in the text. (Decide at this point whether stop-word removal and stemming are to be applied; this depends on the time and storage constraints of the indexing operation.)

2. The word list is inverted, from a list of words in location order to a list of words sorted for searching. The sort follows a chosen ordering and keeps all associated locations with each term/word.

Page 20: Pengindeksan  Dan  Fail Songsang (inverted File)

Sorted Arrays

3. Post-processing of the resulting inverted file, such as adding term weights, reorganizing the file, or compressing it. (This step is optional.)

Page 21: Pengindeksan  Dan  Fail Songsang (inverted File)

B-Tree

B-trees are commonly used for searching data. Each node holds a number of keys and children. A tree of order m is a tree in which every node has at most m children. For each node, if k is the actual number of children of the node, then k - 1 is the number of keys in the node.

See the diagram below, where the first row shows the keys of a node and the second row shows the pointers to its child nodes.

Page 22: Pengindeksan  Dan  Fail Songsang (inverted File)

B-Tree

A search tree of order 4 must satisfy the following conditions:

The keys in each node are in ascending order. For each node, the following hold:

• The subtree starting at Node.Branch[0] contains only keys less than Node.Key[0].

• The subtree starting at Node.Branch[1] contains only keys greater than Node.Key[0] and less than Node.Key[1].

• The subtree starting at Node.Branch[2] contains only keys greater than Node.Key[1] and less than Node.Key[2].

• The subtree starting at Node.Branch[3] contains only keys greater than Node.Key[2].

Page 23: Pengindeksan  Dan  Fail Songsang (inverted File)

B-Tree

The following is an example of a B-tree of order 5. This means that all internal nodes other than the root have at least ceil(5/2) = 3 children. The maximum number of children of a node is 5 (so 4 is the maximum number of keys). Every leaf node must contain at least 2 keys.

Page 24: Pengindeksan  Dan  Fail Songsang (inverted File)

B-Tree (Inserting New Data)

Suppose new data is inserted into an empty B-tree of order 5.

Given the following letters: C N G A H E K Q M F W L T Z D P R X Y S.

Order 5 means that a node can have at most 5 children and 4 keys. Every node other than the root must have at least 2 keys.

The first 4 letters are placed in a single node, as shown in the accompanying diagram.

Page 25: Pengindeksan  Dan  Fail Songsang (inverted File)

B-Tree (Inserting New Data)

Insert H.

Insert E, K, and Q.

Insert M.

Page 26: Pengindeksan  Dan  Fail Songsang (inverted File)

B-Tree (Inserting New Data)

Insert the letters F, W, L, and T.

Insert Z.

Page 27: Pengindeksan  Dan  Fail Songsang (inverted File)

B-Tree (Inserting New Data)

Insert D.

Insert S.
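The insertion sequence can be replayed with a short sketch. It assumes the common insert-then-split scheme (place the key in a leaf, split any node that overflows around its median key, and push the median up to the parent), which the slides show only as diagrams. The names Node, insert and MAX_KEYS are illustrative.

```python
import bisect

MAX_KEYS = 4  # order 5: at most 5 children, hence at most 4 keys per node


class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []         # an empty list means the node is a leaf


def _insert(node, key):
    """Insert key below node; return (median, right_node) if node split, else None."""
    i = bisect.bisect_left(node.keys, key)
    if not node.children:                      # leaf: place the key in sorted order
        node.keys.insert(i, key)
    else:
        split = _insert(node.children[i], key)
        if split:                              # child split: adopt its median and new sibling
            median, right = split
            node.keys.insert(i, median)
            node.children.insert(i + 1, right)
    if len(node.keys) <= MAX_KEYS:
        return None
    mid = len(node.keys) // 2                  # overflow (5 keys): the median is at index 2
    median = node.keys[mid]
    right = Node(node.keys[mid + 1:], node.children[mid + 1:])
    node.keys = node.keys[:mid]
    node.children = node.children[:mid + 1]
    return median, right


def insert(root, key):
    """Insert key and return the (possibly new) root."""
    split = _insert(root, key)
    if split:
        median, right = split
        return Node([median], [root, right])   # the tree grows by one level
    return root


root = Node()
for letter in "CNGAHEKQMFWLTZDPRXYS":
    root = insert(root, letter)

print(root.keys, [child.keys for child in root.children])
```

Under this scheme, inserting H splits the first node around G, and the final insertion of S splits both a leaf and the root, leaving M as the only key in the new root.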

Page 28: Pengindeksan  Dan  Fail Songsang (inverted File)

B-Tree (Deleting Data)

Delete the letter H.

Page 29: Pengindeksan  Dan  Fail Songsang (inverted File)

B-Tree (Deleting Data)

Delete the letter T.

Page 30: Pengindeksan  Dan  Fail Songsang (inverted File)

Hashing

What is hashing?

• A technique for determining the index or location at which data is stored in a data structure.

A hash function:

• Takes the search key as its input.

• Is a transformation applied to the key.

• Is usually expressed as a mathematical formula.

• Returns the index in the table at which the data is stored and from which it is retrieved.

Page 31: Pengindeksan  Dan  Fail Songsang (inverted File)

Basic Concepts

We can think of hashing as a key-to-address transformation: the keys map to addresses in a list.

Page 32: Pengindeksan  Dan  Fail Songsang (inverted File)

Hashing

A hash function is a function h(k) that converts a key k into an address in the range 0 to TableSize - 1.

A hash function is used to map keys to slots in the hash table.

Example:

• Suppose we decide to use 1000 addresses. If U is the set of all possible keys, then the hash function maps U to {0, 1, 2, …, 999}.

k        ASCII codes of first 2 letters   Product (d)       h(k) = d mod 1000
BALL     66, 65                           66 × 65 = 4290    290
LOWELL   76, 79                           76 × 79 = 6004    004
TREE     84, 82                           84 × 82 = 6888    888

Address  Key
000
001
..
004      LOWELL
..
290      BALL
..
888      TREE
..
999
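The example hash can be checked with a few lines of Python. The function name h follows the slide; the extra key BALLOON is a made-up key, added only to show how two keys with the same first two letters land on the same address.

```python
TABLE_SIZE = 1000


def h(key):
    """Multiply the ASCII codes of the first two letters, then take mod 1000."""
    d = ord(key[0]) * ord(key[1])
    return d % TABLE_SIZE


table = {}
for word in ["BALL", "LOWELL", "TREE"]:
    table[h(word)] = word

for address in sorted(table):
    print(f"{address:03d} {table[address]}")

# Any two keys that share their first two letters get the same address.
print(h("BALLOON") == h("BALL"))
```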

Page 33: Pengindeksan  Dan  Fail Songsang (inverted File)

Example

int Hash(const char *Key, const int TableSize)
{
    int HashVal = 0;
    /* Add up the character codes of the key. */
    while (*Key != '\0')
        HashVal += *Key++;
    /* Reduce the sum to a table index. */
    return HashVal % TableSize;
}

[Figure: the actual keys K (k1 to k8), a subset of the universe of keys U, mapped by the hash function into slots of the table T.]

Page 34: Pengindeksan  Dan  Fail Songsang (inverted File)

A Good Hash Function

/* Mix every character of the key into a running, rotated value. */
for (hash = len; len--; )
{
    hash = ((hash << 5) ^ (hash >> 27)) ^ *key++;
}
hash = hash % prime;

Page 35: Pengindeksan  Dan  Fail Songsang (inverted File)

Hashing

However, different keys may be sent to the same address, in which case a collision occurs.

This happens in the earlier example whenever two or more keys begin with the same first two letters.

A process called rehashing must then be carried out.

[Figure: the actual keys K (k1 to k8), a subset of the universe of keys U, hashed into the table T.]

Page 36: Pengindeksan  Dan  Fail Songsang (inverted File)

Rehashing

A simple example of a rehash function:

rehash(k) = (k + 1) % prime

Rehash function

A second function that can be used to choose a table location for the new item being inserted. If that location is also in use, the rehash function can be applied again to obtain a third location, and so on.

Page 37: Pengindeksan  Dan  Fail Songsang (inverted File)

Rehashing

Ways to reduce collisions:

• Try to find the hash function that distributes the records best.

• Use more memory. Enlarge the address space: for example, if 1000 addresses are needed, allocate as many as 2000.

• Store more than one record at a single address (use buckets).

Page 38: Pengindeksan  Dan  Fail Songsang (inverted File)

Chaining

Chaining puts elements that hash to the same slot in a linked list:

[Figure: the actual keys K (k1 to k8), a subset of the universe of keys U, hashed into the table T; keys that share a slot, such as k1 and k4, are linked into one list.]
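A minimal sketch of chaining, using Python lists in place of linked lists. The table size of 7, the additive string hash, and the made-up key "abll" (an anagram that is guaranteed to collide with "ball" under an additive hash) are all assumptions for illustration.

```python
class ChainedHashTable:
    """Hash table in which each slot holds a chain of the keys that hash to it."""

    def __init__(self, size=7):                # size 7 is an arbitrary choice
        self.slots = [[] for _ in range(size)]

    def slot_of(self, key):
        # Additive hash: sum of the character codes, reduced to a slot number.
        return sum(ord(ch) for ch in key) % len(self.slots)

    def insert(self, key):
        chain = self.slots[self.slot_of(key)]
        if key not in chain:                   # colliding keys simply extend the chain
            chain.append(key)

    def contains(self, key):
        return key in self.slots[self.slot_of(key)]


table = ChainedHashTable()
for word in ["ball", "tree", "lowell", "abll"]:
    table.insert(word)

for number, chain in enumerate(table.slots):
    print(number, chain)
```

A lookup hashes the key once and then scans only the chain in that slot, so the cost depends on chain length rather than on table size.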

Page 39: Pengindeksan  Dan  Fail Songsang (inverted File)

Hashing (Abu Ata)

• Allows an address to be stored and accessed directly, quickly, and correctly.

• The address is computed from the ASCII code of each letter and serves as a link between the letters of the indexed word.

• The link is computed from a letter's position within the letter set and, for the following letter, from the position of the adjacent letter.

• Collisions may occur. Rehashing is then performed, and a new address is generated to find an empty record.

Page 40: Pengindeksan  Dan  Fail Songsang (inverted File)

Hashing (Abu Ata)

1. All letters are converted to lowercase.

2. The 26 letters of the alphabet are given values according to their position in the alphabet, for example a=1, b=2, c=3, …, y=25, z=26.

3. For a letter of the word and the letter that follows it, the calculation is:

i. The position of the first letter in the alphabet (rule 2)

ii. The position of the second letter in the alphabet (rule 2)

iii. The result of (i) multiplied by 26

iv. The sum of the results of (ii) and (iii)

4. Add the result of rule 2 to the result of rule 3.

Page 41: Pengindeksan  Dan  Fail Songsang (inverted File)

Basic Concepts

In this case, we must use the collision resolution algorithm to determine the next possible location for the data and continue until we find the correct data.

Each calculation of an address and test for success is known as a probe.

Source: http://www.ee.udel.edu/~durbano/teaching/CISC220/slides/38

Page 42: Pengindeksan  Dan  Fail Songsang (inverted File)

Hashing Methods

There are several hashing methods that we will discuss:

• Direct

• Subtraction

• Modulo-Division

• Digit Extraction

• Midsquare

• Folding

• Rotation

• Pseudorandom Generation

Page 43: Pengindeksan  Dan  Fail Songsang (inverted File)

Direct Method

In direct hashing, the key is the address without any algorithmic manipulation.

Page 44: Pengindeksan  Dan  Fail Songsang (inverted File)

Direct Method

In this case, the hash table must contain an element for every possible key.

Although it has a limited use, it is powerful in the sense that it is easy to code and there are no synonyms.

Page 45: Pengindeksan  Dan  Fail Songsang (inverted File)

Direct Method

As an example, consider a small company with fewer than 100 employees.

Each employee is assigned an employee number (from 0 to 99).

By storing the employees in an array of size 100, we can reference an employee simply by using the employee number as the index into the array.

Page 46: Pengindeksan  Dan  Fail Songsang (inverted File)

Direct Method

Obviously, the direct method has limited uses. Namely, it can only be used on small data sets.

For example, it would be impractical to use direct hashing via the SSN of our employees.

If we did, we would have a 9-digit number as the index into our array (i.e., we would need an array of size 1 billion but would use fewer than 100 entries!)

Page 47: Pengindeksan  Dan  Fail Songsang (inverted File)

Subtraction Method

Sometimes, keys may be consecutive, but may not start from ‘1’.

Consider our small company – what if we assigned employee numbers from 1000 to 1099?

In this case, our hashing function would simply subtract 1000 from the key value to produce the address (0 to 99).

Page 48: Pengindeksan  Dan  Fail Songsang (inverted File)

Subtraction Method

Algorithm:

address = key - subtractionConstant

As with the direct method, the subtraction method is easy to implement, guarantees no collisions, and has limited uses.

Page 49: Pengindeksan  Dan  Fail Songsang (inverted File)

Modulo-Division Method

Also known as the division remainder method, the modulo-division method divides the key by the array size and uses the remainder for the address.

Algorithm

address = key % listSize

Page 50: Pengindeksan  Dan  Fail Songsang (inverted File)

Modulo-Division Method

Although this algorithm will work with any size list, we typically choose a list size that is a prime number.

This has the effect of reducing the number of collisions.

Page 51: Pengindeksan  Dan  Fail Songsang (inverted File)

Modulo-Division Method

To continue with our small company example, let’s say we are planning on expanding our company.

In our new system, employees will receive employee numbers from 0 to 999,999 and we will provide space in our data structure for up to 300 employees.

Page 52: Pengindeksan  Dan  Fail Songsang (inverted File)

Modulo-Division Method

We start by choosing a list size of 307 (the first prime number above 300).

Therefore, our available address space is 0 to 306 (key%307=[0,306]).

As an example, let’s say we want to hash Bryan’s employee number 121267:

121267/307 = 395 remainder 2

Therefore, hash(121267)=2
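The arithmetic on this slide is easy to check in code. In this sketch the helper names is_prime, next_prime_above and hash_mod are illustrative; only the numbers 300, 307 and 121267 come from the slides.

```python
def is_prime(n):
    """Trial division; adequate for small list sizes."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def next_prime_above(n):
    """Smallest prime strictly greater than n."""
    candidate = n + 1
    while not is_prime(candidate):
        candidate += 1
    return candidate


LIST_SIZE = next_prime_above(300)              # the list size chosen on the slide


def hash_mod(key, list_size=LIST_SIZE):
    """Modulo-division hashing: the remainder is the address."""
    return key % list_size


quotient, remainder = divmod(121267, LIST_SIZE)
print(LIST_SIZE, quotient, remainder, hash_mod(121267))
```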

Page 53: Pengindeksan  Dan  Fail Songsang (inverted File)

Modulo-Division Method

Page 54: Pengindeksan  Dan  Fail Songsang (inverted File)

Modulo-Division Method

Note: in a test situation, I expect you to be able to perform the modulus operation on small numbers.

Page 55: Pengindeksan  Dan  Fail Songsang (inverted File)

Digit-Extraction Method

Using digit extraction, selected digits are extracted from the key and used as an address.

For example, using our 6-digit employee number from before, if we wanted to realize a 3-digit address, we could select the 1st, 2nd, and last digits to create our address:

379452 → 372

121267 → 127
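A small sketch of digit extraction for the two keys shown. The function name, the 0-based positions tuple and the zero-padding to 6 digits are assumptions made for the example.

```python
def digit_extraction(key, positions=(0, 1, 5), width=6):
    """Build an address from selected digits of the key (0-based positions)."""
    digits = str(key).zfill(width)             # pad so every key has `width` digits
    return int("".join(digits[p] for p in positions))


for key in (379452, 121267):
    print(key, digit_extraction(key))
```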

Page 56: Pengindeksan  Dan  Fail Songsang (inverted File)

Midsquare Method

In midsquare hashing, the key is squared and the address is selected from the middle of the result.

For example, if our key value were 9452:

9452*9452=89340304

address = 3403
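The midsquare calculation can be sketched as below. The function name and the address_digits parameter are illustrative. When the digits cannot be centered exactly, which digits count as "the middle" is a convention; this sketch leans toward the left, and other choices are equally valid.

```python
def midsquare(key, address_digits=4):
    """Square the key and take `address_digits` digits from the middle of the result."""
    squared = str(key * key)
    start = (len(squared) - address_digits) // 2
    return int(squared[start:start + address_digits])


print(9452 * 9452, midsquare(9452))
```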

Page 57: Pengindeksan  Dan  Fail Songsang (inverted File)

Midsquare Method

A limitation to the use of this method is the size of key.

Because squaring a key produces a number twice the length of the key, this method will only work for small key values.

Page 58: Pengindeksan  Dan  Fail Songsang (inverted File)

Midsquare Method

However, if we wish to apply the midsquare method to large key values, we can simply choose a subset of the digits of the key to square (sort of like digit extraction).

For example:

key 379452: 379*379 = 143641, address = 364

Page 59: Pengindeksan  Dan  Fail Songsang (inverted File)

Folding Methods

Two folding methods are used:

• Fold shift

• Fold boundary

Page 60: Pengindeksan  Dan  Fail Songsang (inverted File)

Fold Shift

In fold shift, the key value is divided into parts whose size matches the size of the required address.

Then, the left and right parts are shifted and added with the middle part.

Should the addition result in a carry digit, that digit is simply dropped.

Page 61: Pengindeksan  Dan  Fail Songsang (inverted File)

Fold Boundary

In fold boundary, the left and right numbers are folded on a fixed boundary between them and the center number.

The two outside values are thus reversed.

Should the addition result in a carry digit, that digit is simply dropped.
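Both folding methods can be sketched for a 3-digit address. The key 123456789 is a made-up example, not taken from the slides, and the function names are illustrative.

```python
def split_parts(key, size):
    """Split the decimal digits of key into consecutive parts of `size` digits."""
    digits = str(key)
    return [digits[i:i + size] for i in range(0, len(digits), size)]


def fold_shift(key, size=3):
    """Add the parts as they are, then drop any carry digit."""
    total = sum(int(part) for part in split_parts(key, size))
    return total % 10 ** size


def fold_boundary(key, size=3):
    """Reverse the digits of the two outside parts, add, then drop any carry digit."""
    parts = split_parts(key, size)
    parts[0], parts[-1] = parts[0][::-1], parts[-1][::-1]
    total = sum(int(part) for part in parts)
    return total % 10 ** size


key = 123456789                                # hypothetical key
print(fold_shift(key))                         # 123 + 456 + 789 = 1368, carry dropped
print(fold_boundary(key))                      # 321 + 456 + 987 = 1764, carry dropped
```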

Page 62: Pengindeksan  Dan  Fail Songsang (inverted File)

Folding Methods

Page 63: Pengindeksan  Dan  Fail Songsang (inverted File)

Rotation Method

In the rotation method, we rotate a digit to the front or back of the key.

This has the effect of spreading the keys more evenly over the key space.

Rotation hashing is usually used in conjunction with other hashing methods, which results in a more effective hash.

Example: imagine selecting only the first 3 digits of the following keys.

Page 64: Pengindeksan  Dan  Fail Songsang (inverted File)

Rotation Method

Page 65: Pengindeksan  Dan  Fail Songsang (inverted File)

Pseudorandom Method

Here, we use the key as the seed in a pseudorandom number generator.

The resulting random number is then scaled into the possible address range using modulo division.

Page 66: Pengindeksan  Dan  Fail Songsang (inverted File)

Pseudorandom Method

One example of a random number generator is

y = ax + c

where

x = key

a = scaling coefficient

c = constant

address = y % listSize

For maximum efficiency, a and c should be prime numbers.

Page 67: Pengindeksan  Dan  Fail Songsang (inverted File)

Collision Resolution

With the exception of direct hashing and subtraction hashing, none of the hashing methods we discussed result in a one-to-one mapping.

Therefore, as discussed before, a collision may occur.

Fortunately, there are many methods of dealing with collisions (all of which are independent of the hashing method used).

That is, any collision resolution algorithm can be used with any hashing algorithm.

Page 68: Pengindeksan  Dan  Fail Songsang (inverted File)

Collision Resolution

Generally, there are two different approaches to resolving collisions:

• Open addressing

• Linked Lists

A third concept, buckets, defers collisions, but does not fully resolve them.

Page 69: Pengindeksan  Dan  Fail Songsang (inverted File)

Open Addressing

Open addressing resolves collisions in the prime area (the area that contains all of the home addresses).

This technique is contrasted with linked list resolution, in which the collisions are resolved by placing the data in a separate overflow area.

Page 70: Pengindeksan  Dan  Fail Songsang (inverted File)

Open Addressing

When a collision occurs, the prime area addresses are searched for an open element where the new data can be placed.

We will discuss 4 methods of open addressing:

• Linear probe

• Quadratic probe

• Double hashing

• Key offset

Page 71: Pengindeksan  Dan  Fail Songsang (inverted File)

Linear Probe

When data cannot be stored at the home address, we resolve the collision by adding 1 to the current address.

Here, we get a collision at address 1. To resolve this, we try to insert the data at 2. However, this location is occupied, so we try address 3.

Page 72: Pengindeksan  Dan  Fail Songsang (inverted File)

Linear Probe

As an alternative to a simple linear probe, we can add 1, subtract 2, add 3, subtract 4, etc. until we locate an empty element.

Note: the code that does the collision resolution must verify that the new address is within the address space.

For example, if we are at the last element of the list, when we add 1, we must start back at the beginning of the list.
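A minimal sketch of the simple linear probe with wrap-around. The table is a plain Python list with None marking a free element; the list size of 7 and the stored values are arbitrary choices for the example.

```python
def linear_probe_insert(table, key, home):
    """Insert key at its home address, or at the next free address after it."""
    size = len(table)
    for step in range(size):
        address = (home + step) % size         # wrap around past the last element
        if table[address] is None:
            table[address] = key
            return address
    raise RuntimeError("hash table is full")


table = [None] * 7
table[1], table[2] = "A", "B"                  # addresses 1 and 2 are already occupied
placed = linear_probe_insert(table, "C", home=1)
print(placed, table)
```

The modulo in the address calculation is what keeps the probe inside the address space when it runs past the last element.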

Page 73: Pengindeksan  Dan  Fail Songsang (inverted File)

Linear Probe

Linear probes have 2 advantages:

• Easy to implement

• Data tends to remain near their home addresses (good for caching)

However, linear probes tend to produce primary clustering.

Page 74: Pengindeksan  Dan  Fail Songsang (inverted File)

Linear Probe

After the collision has been resolved, hashing continues as it did before the collision.

The next time a collision occurs, we re-start our resolution algorithm by adding 1 to the address and then continue as before.

Page 75: Pengindeksan  Dan  Fail Songsang (inverted File)

Quadratic Probe

We can eliminate the primary clustering phenomenon in the linear probe by adding a number other than 1 to the address.

One example of this is the quadratic probe.

Here, the increment is the collision probe number squared.

Thus, for the first collision we add 1², the second collision we add 2², the third collision 3², etc.

Page 76: Pengindeksan  Dan  Fail Songsang (inverted File)

Quadratic Probe

Again, we have to make sure that we don’t run off the end of the address list.

To do this, we use the modulus of the new address and the list size.

new address = (last address tried + probe²) % listSize
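The formula can be sketched as follows. The list size of 11, the occupied addresses and the give-up limit are assumptions for the example; the update rule itself is the one stated on the slide (last address tried plus the probe number squared, modulo the list size).

```python
def quadratic_probe_insert(table, key, home):
    """Insert key at home, or at the first free address on the quadratic probe path."""
    size = len(table)
    address = home
    probe = 0
    while table[address] is not None:
        probe += 1
        if probe > size:                       # give up: not every address is reachable
            raise RuntimeError("no free address found")
        address = (address + probe ** 2) % size
    table[address] = key
    return address


table = [None] * 11                            # a prime list size
table[1], table[2] = "A", "B"                  # addresses 1 and 2 are already occupied
placed = quadratic_probe_insert(table, "C", home=1)
print(placed)                                  # tries 1, then 1 + 1 = 2, then 2 + 4 = 6
```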

Page 77: Pengindeksan  Dan  Fail Songsang (inverted File)

Quadratic Probe

Disadvantages:

• The time it takes to perform the ‘square’ operation

• Produces secondary clustering

• It is not possible to generate a new address for every element in the list

To help alleviate the last disadvantage, we choose a list size that is a prime number.

This will allow at least half of the list to be reachable (a reasonable number).

Page 78: Pengindeksan  Dan  Fail Songsang (inverted File)

Quadratic Probe

After the collision has been resolved, hashing continues as it did before the collision.

The next time a collision occurs, we start resolution again with 1² and continue as before.

Page 79: Pengindeksan  Dan  Fail Songsang (inverted File)

Double Hashing

The next two open addressing methods are collectively known as double hashing.

In double hashing, rather than use an arithmetic probe function, as in the linear and quadratic probes, we rehash the address.

This prevents primary clustering.

Page 80: Pengindeksan  Dan  Fail Songsang (inverted File)

Double Hashing

The probe sequences used by both linear and quadratic probing are key independent.

For example, linear probing inspects the table locations sequentially, no matter what the value of the key is.

In contrast, double hashing defines key-dependent probe sequences.

In this scheme, the probe sequence still searches the table in a linear order, but a second hash determines the size of the steps taken.

Page 81: Pengindeksan  Dan  Fail Songsang (inverted File)

Pseudorandom Collision Resolution

The first method uses a pseudorandom number to resolve the collision.

This is basically the same process as the pseudorandom hashing function.

In this case, however, instead of using the key as the seed to the pseudorandom number generator, we use the collision address as the seed.

Page 82: Pengindeksan  Dan  Fail Songsang (inverted File)

Pseudorandom Collision Resolution

Here, we have a collision at address 1. To resolve this collision, we use the collision address (1) in our pseudorandom number generator

y=3(1)+5=8

Therefore, we try address 8 as our new address.

Page 83: Pengindeksan  Dan  Fail Songsang (inverted File)

Pseudorandom Collision Resolution

Disadvantage: all of the keys will follow only 1 collision resolution path through the list.

Therefore, this method will lead to secondary clustering.

Page 84: Pengindeksan  Dan  Fail Songsang (inverted File)

Key Offset

The second double hashing method is key offset.

This method will produce different collision paths for different keys.

Whereas the pseudorandom number generator produces a new address as a function of only the collision address, the key offset method uses both the original key and the collision address to calculate the new address.

Page 85: Pengindeksan  Dan  Fail Songsang (inverted File)

Key Offset

Here is one of the simplest implementations:

offset = key/listSize; // integer arithmetic

address = (old address + offset)%listSize;

Here, we calculate an offset value based on the key and add this value to the collision address.

Does this method lead to primary or secondary clustering?
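A sketch of the key offset calculation, reusing the list size 307 from the modulo-division example. The two keys are made-up values, chosen because both hash to home address 1 under key % 307; the function name is illustrative.

```python
LIST_SIZE = 307


def key_offset_probe(key, old_address, list_size=LIST_SIZE):
    """Next address to try: the old address plus an offset derived from the key."""
    offset = key // list_size                  # integer arithmetic
    return (old_address + offset) % list_size


for key in (166702, 572556):                   # hypothetical keys with the same home address
    home = key % LIST_SIZE
    print(key, home, key_offset_probe(key, home))
```

Although the two keys collide at the same home address, their offsets differ, so they continue along different collision paths.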

Page 86: Pengindeksan  Dan  Fail Songsang (inverted File)

Linked List Resolution

In open addressing, we resolve collisions by placing the data in the same memory area as the rest of the data (the prime area).

One problem with this approach is that each resolved collision increases the probability of future collisions.

This disadvantage can be eliminated by using a linked list resolution approach rather than an open addressing approach.

Page 87: Pengindeksan  Dan  Fail Songsang (inverted File)

Linked List Resolution

Linked list resolution uses a separate area (the overflow area) to store the collisions and chains all of the synonyms together in a linked list.

When a collision occurs, one element is stored in the prime area and the other element is stored in the overflow area.

Page 88: Pengindeksan  Dan  Fail Songsang (inverted File)

Linked List Resolution

Here, we have had 2 collisions at address 1. The collision is resolved by placing the synonyms in a linked list with the head element in the prime area.

Page 89: Pengindeksan  Dan  Fail Songsang (inverted File)

Linked List Resolution

Items are usually inserted in a last-in, first-out (LIFO) order. This allows for fast insertions as the list need not be scanned … the element is simply inserted in the prime area.

Another possible ordering is a key sequenced list where the data with the smallest key value is stored in the prime area, allowing for fast retrievals.

Page 90: Pengindeksan  Dan  Fail Songsang (inverted File)

Bucket Hashing

Another approach to handling the collision problem is bucket hashing.

A bucket is a node that can accommodate multiple data occurrences.

Because the bucket can hold multiple data values, collisions can be postponed until the bucket is full.

Page 91: Pengindeksan  Dan  Fail Songsang (inverted File)

Bucket Hashing

Here, we see an implementation with a bucket size of 3. This structure can accommodate up to 3 synonyms before a collision will occur.

Page 92: Pengindeksan  Dan  Fail Songsang (inverted File)

Bucket Hashing

Disadvantages:

• More space is used (many buckets will be empty or partially empty)

• It does not completely resolve collisions

When a collision does occur, a typical resolution is to use a linear probe.

Here, we assume that the adjacent bucket will have an empty space.

Page 93: Pengindeksan  Dan  Fail Songsang (inverted File)

Bucket Hashing

Question: Why not just increase the size of the hash table instead of using buckets?

Answer: Entire bucket will probably be cached (its contents are adjacent in memory). Thus, multiple “probes” will likely “hit” in cache.

Page 94: Pengindeksan  Dan  Fail Songsang (inverted File)

Combination Approaches

We often use multiple steps to resolve collisions.

For example, we might use bucket hashing. Should a collision occur, we will perform up to, say, 3 linear probes to try to resolve the collision. Then, we may resort to a linked list resolution.

Page 95: Pengindeksan  Dan  Fail Songsang (inverted File)

What Makes A Good Hashing Function?

1) A hashing function should be fast and easy to compute.

2) A hashing function should scatter the data evenly throughout the hash table.

• How well does the hash function scatter random data? Nonrandom data?

Page 96: Pengindeksan  Dan  Fail Songsang (inverted File)

Tips For Developing Good Hashing Functions

The calculation of the hashing function should involve the entire search key.

Thus, for example, computing the modulus of the entire ID number is much safer than using only its first 2 digits.

Page 97: Pengindeksan  Dan  Fail Songsang (inverted File)

Tips For Developing Good Hashing Functions

If a hashing function uses modulo arithmetic, the base should be a prime number.

That is, if h is of the form:

h(x) = x mod listSize

then listSize should be a prime number.

This is a safeguard against many subtle kinds of patterns in the data (for example, search keys whose digits are likely to be multiples of one another).

Page 98: Pengindeksan  Dan  Fail Songsang (inverted File)

Disadvantage of Hashing

For all of its advantages, one of the major disadvantages of hashing is trying to traverse the data in sorted order.

Traversals are inefficient because a good hashing function scatters items as randomly as possible throughout the array.

Hence, in order to traverse the table in sorted order, you would first have to sort the items.