DL - 2004Introduction – Beeri/Feitelson1 Digital Libraries From Information retrieval to search engines e-books, e-libraries & related topics

  • View
    219

  • Download
    6

Embed Size (px)

Text of DL - 2004Introduction – Beeri/Feitelson1 Digital Libraries From Information retrieval to search...

  • Slide 1
  • DL - 2004Introduction Beeri/Feitelson1 Digital Libraries From Information retrieval to search engines e-books, e-libraries & related topics
  • Slide 2
  • DL - 2004Introduction Beeri/Feitelson2 General background Driving forces rapid evolution of : Computing power Memory Networking (internet) DB systems IR systems Hypertext (WWW) GUI/presentation tools (html) hardware software
  • Slide 3
  • DL - 2004Introduction Beeri/Feitelson3 Consequences: Transformation of existing applications Generation of new applications related to data collection, organization, classification, access Some examples:
  • Slide 4
  • DL - 2004Introduction Beeri/Feitelson4 (Classical) Libraries: Automation of catalogs (old stuff) On-line e-journals Collections of born-digital materials New: Digitized collections (images, maps,..) on-line archives On-line, virtual museums
  • Slide 5
  • DL - 2004Introduction Beeri/Feitelson5 Digital libraries & bibliographic services: ACM Digital Library acm-diglibacm-diglib collection of all (full) papers from ACM journals SIGMOD digital anthology anthologyanthology DBLP dblpdblp collection of bibliographic information Citeseer citeseerciteseer citation and impact factor data
  • Slide 6
  • DL - 2004Introduction Beeri/Feitelson6 Portals, directories, search engines Yahoo a directory (manual labor) Google a search engine (fully automatic) IR technology & hypertext structure Amazon (& similar on-line sales companies) (book/dvd/.. Portal/directory)
  • Slide 7
  • DL - 2004Introduction Beeri/Feitelson7 Data and information services: Medline & US national library of medicine: nlmed nlmed On-line encyclopedias: Wikipedia, art encyclopedia Wikipediaart encyclopedia Lexis-nexis (and many like it)Lexis-nexis
  • Slide 8
  • DL - 2004Introduction Beeri/Feitelson8 Scientific data directories/repositories/portals : Bioinformatics: -- over 500 dbs/data sources bio-source Astronomy the world-wide telescope project wwt wwt Classical humanities the Perseus digital library perseus
  • Slide 9
  • DL - 2004Introduction Beeri/Feitelson9 Issues: Heterogeneity data transformation/integration Dependence on experiments scientific experiment meta-data Huge volumes fast, approximate, on-line stream processing
  • Slide 10
  • DL - 2004Introduction Beeri/Feitelson10 New kinds of services: Query subscription on XML/data streams niagara, niagaracqniagaraniagaracq news Stocks satellite data Issues: Fast streams Millions and more queries & subscribers Needs ultra-fast stream processing Query evaluation/ data routing
  • Slide 11
  • DL - 2004Introduction Beeri/Feitelson11 A few more relevant buzzwords : Customer profiling targeted advertisment Knowledge management Collecting, managing, exploiting the know-how in large organizations E-learning E-publication a End of examples
  • Slide 12
  • DL - 2004Introduction Beeri/Feitelson12 Library/IR basic concepts Classification: Axiom: unique position for a book Unique call number ' (+ secondary subjects) hierarchical & universal classification system Dewey (1876) LC (1900) + subjects books x Single main catalog x Claim to universality
  • Slide 13
  • DL - 2004Introduction Beeri/Feitelson13 Metadata data describing a collection/item Example: library bibliographic record for a book See: Dublin Core & its use in stanfordDublin Core stanford
  • Slide 14
  • DL - 2004Introduction Beeri/Feitelson14 Operations in collection creation/maintenance : Cataloging: create metadata record for an item (many style/spelling conventions used here) Indexing: identify key terms (in all text/some fields) Controlled -- uses a fixed vocabulary Uncontrolled terms chosen by indexer Abstraction: create short description (of key ideas)
  • Slide 15
  • DL - 2004Introduction Beeri/Feitelson15 Products: Bibliographic record db Author/title catalog Subject (header) catalog (provides entry points in on-line catalogs) Currently, operations performed manually Expensive, slow, time-consuming Require experts Results are non-uniform (even with experts) Automatic indexing & abstracting --- research areas
  • Slide 16
  • DL - 2004Introduction Beeri/Feitelson16 Thesaurus: , a vocabulary of standard terms/concepts Hierarchical organization every term has BT broader term, NT narrower terms Additional relationships Synonyms, RT related terms Used for: Indexing standardization of index terms Querying standardization of query terms (supposedly solves problem of non-uniform indexing) Creation: complex, lengthy, community process See also: ontology, KWIC (google them!)
  • Slide 17
  • DL - 2004Introduction Beeri/Feitelson17 Technology/theory background Structured data dbms Schema (structure) describes data precisely Queries & query language (based on structure) Indices, query optimization covered in DB course Big challenge: integration of multiple heterogeneous autonomous sources
  • Slide 18
  • DL - 2004Introduction Beeri/Feitelson18 Unstructured data (free) text : information retrieval Data = collection of texts Query = set of terms (words) Indices = inverted lists (words to locations) Results = ranked answers Issues/challenges: Answers are imprecise, approximate Difficult to evaluate answer goodness
  • Slide 19
  • DL - 2004Introduction Beeri/Feitelson19 Hypertext/www : html, http, soap, . A browsing model covered in bsdi course Issues/ disadvantages: No notion of query, just browsing No structure on data no data quality guarantees
  • Slide 20
  • DL - 2004Introduction Beeri/Feitelson20 Semi-structured data --- XML : A standard for data exchange, also stored covered in bsdi course Self-describing data Optional meta-data --- DTD/schemas Validation tools query language Stream processing tools (under development) Current move: extend to semantic websemantic web
  • Slide 21
  • DL - 2004Introduction Beeri/Feitelson21 Machine learning: Used for automatic Classification Indexing Abstracting Clustering Of free text/semi-structured data covered in machine learning courses End of technology survey
  • Slide 22
  • DL - 2004Introduction Beeri/Feitelson22 The course IR -- classical to Google System architectures Kinds of queries Auxiliary data structures (indices) & efficient query processing Compression, a bit of theory, uses in IR Extensions to hypertext (using link structure) Conceptual topics E-books E-publishing
  • Slide 23
  • DL - 2004Introduction Beeri/Feitelson23 End of Introduction