Searching: Needles and Haystacks

- Searching for stuff
- Why it's important
- How it's done
- Technical difficulties
- Social difficulties

Search and Hugh

Involved in this since about 1982, mainly EEC projects especially Celex

I currently work for Vienna U on a multimedia database for stored manuscripts etc.

Computing industry since about 1974

As a Human Activity

- Looking for keys
- Remembering names and birthdays
- Looking up in a book
- And [the subject of this] making tools for the intertubes
- Getting a clue, from the above...

How it's Done: 1

- Health warning: this explanation is simplified!
- Let's take Google
- How does it find one zillion documents/images with 'lolcat' in them, within a few seconds?

How it's Done: 2

- It did it already
- A key concept: Indexing
- Another key concept: Inverted index [see wikipedia]: https://en.wikipedia.org/wiki/Inverted_index
  - lolcat in document x at position y
  - highlighted cat in document x at position y [?]
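
The "lolcat in document x at position y" idea can be sketched as a positional inverted index. This is a minimal illustration only, not how Google stores its index: the document ids and texts are made up, and build_positional_index is a hypothetical helper name.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each word to a list of (document id, word position) pairs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return index


# Made-up documents, purely for illustration.
docs = {
    "x1": "my lolcat is a happy lolcat",
    "x2": "no cats here",
}

index = build_positional_index(docs)
print(index["lolcat"])  # [('x1', 1), ('x1', 5)]
```

The work happens once, at indexing time; answering the query afterwards is a single dictionary lookup, which is the sense in which "it did it already".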

Why do this at all?

- Since Google, Bing, Yahoo already did it...
- Lots of interesting technical pieces
- Self education
- Fun and profit, do it 'better' [?]
- Internal search engines, intranet search engines
- Domain specific engines
- School or research projects

Parts of the Search Engine

- Spider or Harvester [?]
- Parser/Indexer
- Index Storage
- Retrieval [the bit of Google that we see!]

I'm going to go through these in order...

Spider or Harvester

- Go and get a load of stuff from the web
- Think of it as a programmatic super-surfer
- Actually there are tons of ready-made ones: http://search.cpan.org/~johnd/WWW-Crawler-Lite-0.005/lib/WWW/Crawler/Lite.pm
- Be polite: user-agent name, robots.txt, throttling etc. [?]
- Can you think of some of the problems?
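
Being polite can be sketched with Python's standard urllib.robotparser. This is a rough sketch under assumptions: the user-agent name, the robots.txt content and the URLs are invented for the example, and polite_filter is a hypothetical helper. A real spider would fetch robots.txt from the site and sleep for the crawl delay between requests.

```python
from urllib import robotparser

USER_AGENT = "ExampleSpider/0.1"  # hypothetical user-agent name

# Invented robots.txt content; a real spider would download this file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def polite_filter(urls, robots_lines, user_agent=USER_AGENT):
    """Return the URLs we may fetch, plus the requested delay in seconds."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_lines)
    delay = rp.crawl_delay(user_agent) or 0  # 0 when no Crawl-delay is given
    allowed = [url for url in urls if rp.can_fetch(user_agent, url)]
    return allowed, delay


urls = [
    "http://example.com/index.html",
    "http://example.com/private/secret.html",
]
allowed, delay = polite_filter(urls, ROBOTS_TXT.splitlines())
print(allowed)  # ['http://example.com/index.html']
print(delay)    # 2
```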

Spidering/Harvesting/Darkweb

- Spidering: start at top, follow links
- Directory based: index a load of things in a given directory
- Harvesting: academic harvester interfaces, domain specific, I do this at present
- Robots.txt: courtesy, the interactive web and the darkweb
- Problems
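
"Start at top, follow links" might look like the breadth-first sketch below. To keep it runnable without a network, the "web" here is an invented in-memory dictionary (FAKE_WEB), and LinkExtractor and spider are hypothetical names; a real spider would fetch pages over HTTP and respect robots.txt.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Invented pages standing in for a web site.
FAKE_WEB = {
    "/": '<a href="/cats">cats</a> <a href="/about">about</a>',
    "/cats": '<a href="/">home</a> <a href="/cats/lolcat">lolcat</a>',
    "/about": "no links here",
    "/cats/lolcat": '<a href="/missing">gone</a>',
}


def spider(start, fetch):
    """Visit pages breadth-first from start; fetch returns HTML or None."""
    seen = {start}
    queue = deque([start])
    visited = []
    while queue:
        url = queue.popleft()
        page = fetch(url)
        if page is None:
            continue  # dead link: one of the "problems"
        visited.append(url)
        extractor = LinkExtractor()
        extractor.feed(page)
        for link in extractor.links:
            if link not in seen:  # avoid going round in circles
                seen.add(link)
                queue.append(link)
    return visited


print(spider("/", FAKE_WEB.get))
# ['/', '/cats', '/about', '/cats/lolcat']
```

Note that a page nobody links to is never found this way, which is one reason directory-based indexing and harvester interfaces exist.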

Parser/Indexer

- Now the fun begins!
- Breaking down the stuff you get into tokens
- <b>lolcat</b> is indexed as 'lolcat', for example
- Some tagging is preserved as meta-information, <h1>Cat</h1> for example
- Parsing document types: text, html, pdf etc
- Swish-e: http://www.swish-e.org/ is a small scale harvester/parser, for example
- Some problems/opportunities
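
A tokenizer that strips the markup but remembers the enclosing tag as meta-information could be sketched as follows, using Python's standard html.parser. TokenParser is a hypothetical name and the sample HTML is made up; this is not how Swish-e does it.

```python
import re
from html.parser import HTMLParser


class TokenParser(HTMLParser):
    """Collect lowercase word tokens together with their enclosing tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tags
        self.tokens = []     # (token, enclosing tag or None) pairs

    def handle_starttag(self, tag, attrs):
        # Simplification: void tags such as <br> are not treated specially.
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Pop back to the matching open tag (tolerates sloppy HTML).
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        tag = self.open_tags[-1] if self.open_tags else None
        for word in re.findall(r"\w+", data.lower()):
            self.tokens.append((word, tag))


parser = TokenParser()
parser.feed("<h1>Cat</h1><p>My <b>lolcat</b> sleeps</p>")
parser.close()
print(parser.tokens)
# [('cat', 'h1'), ('my', 'p'), ('lolcat', 'b'), ('sleeps', 'p')]
```

An indexer could then, for instance, give the 'h1' token more weight than the 'p' tokens when ranking.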

Index

What does it look like? Given the texts

T[0] = "it is what it is"

T[1] = "what is it"

T[2] = "it is a banana"

we have the following inverted file index (where the integers in the set notation brackets refer to the indexes (or keys) of the text symbols, T[0], T[1] etc.):

"a": {2}

"banana": {2}

"is": {0, 1, 2}

"it": {0, 1, 2}

"what": {0, 1}

Problems: stop words !!!
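
The index above can be reproduced in a few lines. This is a minimal sketch: build_index is a hypothetical helper, and the stop list is an assumption chosen to show the problem, not a standard list.

```python
def build_index(texts, stop_words=frozenset()):
    """Map each word to the set of text numbers that contain it."""
    index = {}
    for key, text in enumerate(texts):
        for word in text.split():
            if word not in stop_words:
                index.setdefault(word, set()).add(key)
    return index


T = ["it is what it is", "what is it", "it is a banana"]

index = build_index(T)
for word in sorted(index):
    print(word, sorted(index[word]))
# a [2]
# banana [2]
# is [0, 1, 2]
# it [0, 1, 2]
# what [0, 1]

# An illustrative stop list (an assumption, not a standard one).
STOP_WORDS = frozenset({"a", "is", "it"})
small = build_index(T, STOP_WORDS)
print(sorted(small))  # ['banana', 'what']
```

Dropping stop words shrinks the index, but the query "what is it" is then reduced to just "what", which is the problem the slide points at.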

Storage

- This used to be easy, now lots of options
- Sparse data: some terms, like 'lolcats', have millions of entries; pyx [look it up] won't have many
- It's a 'lot' of data: the name Google came about by misspelling googol
- Relationals are fairly unsuitable
- NoSQL and ready-mades: http://solr-vs-elasticsearch.com/ for example

Retrieval

- Here you get results of all this work
- Simple: one field, one word
- Booleans and implied booleans [lots of words ANDed together]
- Relevant results: this is the main thing, and it links back to the storage and parsing
- Some problems: multilingual, non-latin, synonyms
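
An implied boolean (every word of the query ANDed together) comes down to intersecting the sets stored in the inverted index. A minimal sketch, with search_all as a hypothetical name and the small index from the Index slide typed in by hand; ranking the results by relevance is not attempted here.

```python
def search_all(index, query):
    """Return ids of documents containing every query word (implied AND)."""
    words = query.lower().split()
    if not words:
        return set()
    postings = [index.get(word, set()) for word in words]
    postings.sort(key=len)  # start from the rarest word to keep sets small
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result


index = {
    "a": {2},
    "banana": {2},
    "is": {0, 1, 2},
    "it": {0, 1, 2},
    "what": {0, 1},
}

print(search_all(index, "what is it"))   # {0, 1}
print(search_all(index, "banana what"))  # set()
```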

Technical Difficulties

- Finding all the documents
- Documents that change, appear or disappear
- Making the index and looking at it [it's big!]
- Non-latin/accented scripts for latin speakers: appauvrissement
- Can you think of others?

Technical Difficulties: 2

- Looking for André
- Looking for 中国 [what's that incidentally?]
- Looking for cat [furry] and cat [computer command]
- Speed of index refresh [days]
- Storage and computation
- Semantic search
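
One common way to make a search for "Andre" find "André" is to fold accents at both index and query time. The sketch below uses Python's standard unicodedata module; fold is a hypothetical helper, and accent folding is only one possible policy (in some languages the accent changes the meaning).

```python
import unicodedata


def fold(text):
    """Strip combining accents and case so 'André' matches 'andre'."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return stripped.casefold()


print(fold("André"))        # andre
print(fold("Andre\u0301"))  # andre (e followed by a combining acute accent)
print(fold("中国"))          # 中国
```

Folding leaves 中国 unchanged, and it does nothing about the other difficulty with such text: there are no spaces between words, so splitting on whitespace does not produce usable tokens.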

Social Difficulties

- Right to be forgotten
- Security services and data mining
- Privacy and doxing, see visual tagging too
- Linking the unlinked
- Automatic visual tagging [facebook]
- Automatic geolocation [most smartphones]
- Any more?

Conclusions

- It's a central human activity
- It's a vital activity for the web
- Very simple central idea, but lots of evolution possible
- There's a societal debate to go with the technical evolution

Thanks!

Thanks for listening and questions!