Searching: Needles and Haystacks
- Searching for stuff
- Why it's important
- How it's done
- Technical difficulties
- Social difficulties
Search and Hugh
- Involved in search since about 1982, mainly on EEC projects, especially Celex
- Currently working for Vienna U on a multimedia database for stored manuscripts etc.
- In the computing industry since about 1974
As a Human Activity
- Looking for keys
- Remembering names and birthdays
- Looking up in a book
- And [the subject of this] making tools for the intertubes
- Getting a clue, from the above...
How it's Done: 1
- Health warning: this explanation is simplified!
- Let's take Google
- How does it find one zillion documents/images with 'lolcat' in them, within a few seconds?
How it's Done: 2
- It did it already
- A key concept: Indexing
- Another key concept: Inverted index [see wikipedia]: https://en.wikipedia.org/wiki/Inverted_index
  - lolcat in document x at position y
  - highlighted cat in document x at position y [?]
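To make "lolcat in document x at position y" concrete, here is a minimal sketch of a positional inverted index. The function name and the two sample documents are made up for illustration, and a real engine stores far more than this.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each token to {doc_id: [positions where it occurs]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Naive tokenisation: lowercase and split on whitespace.
        for position, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(position)
    return index


# Hypothetical documents, keyed by a document id.
docs = {
    "x": "my lolcat saw another lolcat",
    "z": "a highlighted cat",
}
index = build_positional_index(docs)
print(index["lolcat"])  # {'x': [1, 4]}
```

The work is done once, at indexing time. Answering a query for 'lolcat' is then a single dictionary lookup, which is why "it did it already".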
Why do this at all?
- Since Google, Bing, Yahoo already did it...
- Lots of interesting technical pieces
- Self education
- Fun and profit, do it 'better' [?]
- Internal search engines, intranet search engines
- Domain specific engines
- School or research projects
Parts of the Search Engine
- Spider or Harvester [?]
- Parser/Indexer
- Index Storage
- Retrieval [the bit of Google that we see!]
I'm going to go through these in order...
Spider or Harvester
- Go and get a load of stuff from the web
- Think of it as a programmatic super-surfer
- Actually there are tons of ready-made ones: http://search.cpan.org/~johnd/WWW-Crawler-Lite-0.005/lib/WWW/Crawler/Lite.pm
- Be polite: user-agent name, robots.txt, throttling etc. [?]
- Can you think of some of the problems?
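The "be polite" points can be sketched with Python's standard urllib.robotparser. The user-agent name, the robots.txt content and the URLs below are invented for the example, and the fetch function is passed in, so nothing here touches the network.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-spider"  # hypothetical name; a real spider should identify itself

# A made-up robots.txt, parsed from memory instead of fetched from a site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def make_robots(text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def crawl(urls, robots, fetch, sleep=time.sleep, default_delay=1.0):
    """Fetch only allowed URLs, pausing between requests (throttling)."""
    delay = robots.crawl_delay(USER_AGENT) or default_delay
    pages = {}
    for url in urls:
        if not robots.can_fetch(USER_AGENT, url):
            continue  # the site asked us not to go there
        if pages:
            sleep(delay)  # wait before every request after the first
        pages[url] = fetch(url)
    return pages
```

A dry run might look like `crawl(urls, make_robots(ROBOTS_TXT), fetch=my_fetch)`, where `my_fetch` is whatever function actually downloads a page.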
Spidering/Harvesting/Darkweb
- Spidering: start at top, follow links
- Directory based: index a load of things in a given directory
- Harvesting: academic harvester interfaces, domain specific, I do this at present
- Robots.txt: courtesy, the interactive web and the darkweb
- Problems
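"Start at top, follow links" is roughly a breadth-first walk over the link graph. The sketch below runs against a tiny made-up site held in a dictionary, so the URLs and page contents are assumptions. It also shows one of the problems: a page nobody links to is never found.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def spider(start_url, fetch, max_pages=100):
    """Breadth-first crawl; fetch(url) returns HTML text or None."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # broken link or vanished document
        visited.append(url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# A hypothetical four-page site plus one page with no inbound links.
SITE = {
    "http://example.org/": '<a href="/cats.html">cats</a> <a href="/dogs.html">dogs</a>',
    "http://example.org/cats.html": '<a href="/">home</a> <a href="/lolcat.html">lolcat</a>',
    "http://example.org/dogs.html": "no links here",
    "http://example.org/lolcat.html": '<a href="/missing.html">gone</a>',
    "http://example.org/unlinked.html": "nobody links to this page",
}
visited = spider("http://example.org/", SITE.get)
```

Here `visited` ends up holding the home page, cats, dogs and lolcat, in that order. The missing page is skipped and the unlinked page is never reached.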
Parser/Indexer
- Now the fun begins!
- Breaking down the stuff you get into tokens
- <b>lolcat</b> is indexed as 'lolcat', for example
- Some tagging is preserved as meta-information, <h1>Cat</h1> for example
- Parsing document types: text, html, pdf etc
- Swish-e: http://www.swish-e.org/ is a small scale harvester/parser, for example
- Some problems/opportunities
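A rough sketch of the tokenising step for HTML, using the standard html.parser module. Which tags count as meta-information is an assumption made for this example (the META_TAGS set), and this is not how Swish-e or any particular engine does it.

```python
import re
from html.parser import HTMLParser

META_TAGS = {"h1", "title"}  # assumption: the tags we choose to remember


class Tokenizer(HTMLParser):
    """Turn HTML into (token, position, meta_tag) triples."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Pop back to, and including, the most recent matching open tag.
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Innermost enclosing tag that we treat as meta-information, if any.
        meta = next((t for t in reversed(self.open_tags) if t in META_TAGS), None)
        for word in re.findall(r"\w+", data.lower()):
            self.tokens.append((word, len(self.tokens), meta))


def tokenize(html):
    parser = Tokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


print(tokenize("<h1>Cat</h1> my <b>lolcat</b>"))
# [('cat', 0, 'h1'), ('my', 1, None), ('lolcat', 2, None)]
```

The <b> tag is thrown away, so 'lolcat' is indexed as a plain word, while 'cat' keeps a note that it came from a heading. That note could later be used to rank the document higher.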
Index
- What does it look like? Given the texts
T[0] = "it is what it is"
T[1] = "what is it"
T[2] = "it is a banana"
we have the following inverted file index (the integers in braces are the keys of the texts T[0], T[1], etc.):
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
- Problems: stop words!!! ('is' and 'it' appear in every text above)
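The index above can be rebuilt in a few lines. This is a minimal sketch: the function name is made up, and the stop-word list is just an assumption picked to show the effect.

```python
T = ["it is what it is", "what is it", "it is a banana"]


def build_index(texts, stop_words=frozenset()):
    """Map each word to the set of text keys that contain it."""
    index = {}
    for key, text in enumerate(texts):
        for word in text.split():
            if word not in stop_words:
                index.setdefault(word, set()).add(key)
    return index


index = build_index(T)
print(index["what"])  # {0, 1}

# Assumed stop-word list, for illustration only.
filtered = build_index(T, stop_words={"a", "is", "it"})
print(sorted(filtered))  # ['banana', 'what']
```

Dropping stop words shrinks the index a lot, but there is a price: a query made only of stop words, such as "it is", can no longer be answered from the filtered index.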
Storage
- This used to be easy, now there are lots of options
- Sparse data: some terms, like 'lolcats', have millions of entries; 'pyx' [look it up] won't have many
- It's a 'lot' of data: Google got its name from a misspelling of 'googol' (10^100)
- Relational databases are fairly unsuitable
- NoSQL and ready-mades: http://solr-vs-elasticsearch.com/ for example
Retrieval
- Here you get results of all this work
- Simple: one field, one word
- Booleans and implied booleans [lots of words ANDed together]
- Relevant results: this is the main thing, and it links back to the storage and parsing
- Some problems: multilingual, non-latin, synonyms
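An implied boolean is just a set intersection over the index. The sketch below reuses the small banana index from the Index slide. The function names are made up, and ranking by relevance is left out entirely.

```python
INDEX = {
    "a": {2},
    "banana": {2},
    "is": {0, 1, 2},
    "it": {0, 1, 2},
    "what": {0, 1},
}


def search_all(query, index):
    """Implied AND: texts that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Start with the rarest word so the working set stays small.
    postings = sorted((index.get(word, set()) for word in words), key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result


def search_any(query, index):
    """OR: texts that contain at least one word of the query."""
    result = set()
    for word in query.lower().split():
        result |= index.get(word, set())
    return result


print(search_all("what is it", INDEX))   # {0, 1}
print(search_any("what banana", INDEX))  # {0, 1, 2}
```

Note that this only says which texts match. Deciding which match comes first is the relevance problem, and that is where the meta-information kept by the parser starts to matter.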
Technical Difficulties
- Finding all the documents
- Documents that change, appear or disappear
- Making the index and looking at it [it's big!]
- Non-latin/accented scripts for latin speakers: appauvrissement [French: impoverishment]
- Can you think of others?
Technical Difficulties: 2
- Looking for André
- Looking for 中国 [what's that incidentally?]
- Looking for cat [furry] and cat [computer command]
- Speed of index refresh [days]
- Storage and computation
- Semantic search
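One common way to make 'Andre' find 'André' is to fold accents away at both indexing and query time. This is a sketch of that one technique using the standard unicodedata module, not a full answer to multilingual search.

```python
import unicodedata


def fold(text):
    """Strip combining accents and case so 'André' and 'andre' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


print(fold("André"))  # andre
print(fold("中国"))    # 中国 (unchanged: there are no accents to strip)
```

Folding loses information, which is the impoverishment mentioned on the previous slide: words that differ only by an accent become the same token. It also does nothing for 中国, where the harder problem is that the text has no spaces to split on.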
Social Difficulties
- Right to be forgotten
- Security services and data mining
- Privacy and doxing, see visual tagging too
- Linking the unlinked
- Automatic visual tagging [facebook]
- Automatic geolocation [most smartphones]
- Any more?
Conclusions
- It's a central human activity
- It's a vital activity for the web
- Very simple central idea, but lots of evolution possible
- There's a societal debate to go with the technical evolution
Thanks!
Thanks for listening and questions!