New Generation Database Systems: IR Systems and the Grid/Cloud

IS 257 – Fall 2010 2010.11.16- SLIDE 1


University of California, Berkeley

School of Information

IS 257: Database Management


Lecture Outline

• XML and DBMS
  – Cheshire II as XML Database

• The Grid and DBMS
  – The Grid
  – Data Grids
  – Grid-based DBMS




Standards: XML/SQL

• That table can be mapped to:

<EMPLOYEE>
  <row>
    <EMPNO>000020</EMPNO>
    <FIRSTNAME>John</FIRSTNAME>
    <LASTNAME>Smith</LASTNAME>
    <BIRTHDATE>1955-08-21</BIRTHDATE>
    <SALARY>52300.00</SALARY>
  </row>
  <row> … etc. …
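The table-to-XML mapping shown here can be sketched in a few lines of Python. This is only an illustration, assuming one `<row>` element per tuple and one child element per column; the function name `table_to_xml` is hypothetical and is not part of any SQL/XML standard API.

```python
import xml.etree.ElementTree as ET

def table_to_xml(table_name, columns, rows):
    """Map a table to XML: one <row> per tuple, one child element per column."""
    root = ET.Element(table_name)
    for values in rows:
        row = ET.SubElement(root, "row")
        for column, value in zip(columns, values):
            ET.SubElement(row, column).text = str(value)
    return root

# The single row shown on the slide
columns = ["EMPNO", "FIRSTNAME", "LASTNAME", "BIRTHDATE", "SALARY"]
rows = [("000020", "John", "Smith", "1955-08-21", "52300.00")]

employee = table_to_xml("EMPLOYEE", columns, rows)
xml_text = ET.tostring(employee, encoding="unicode")
print(xml_text)
```

Column names become element names directly, so this sketch only works when the column names are also legal XML names.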


XML to Relational Database Mapping

The following slides are adapted from Bhavin Kansara.


Introduction

• XML/relational mapping means data transformation between XML and relational data models

• XML documents can be transformed to relational data models or vice versa.

• Mapping method is the way the mapping is done



DTD graph



Inlined DTD graph

• Given a DTD graph, a node is inlinable if and only if it has exactly one incoming edge and that edge is a normal edge.
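The inlinability rule can be checked mechanically. The sketch below is a minimal illustration, assuming the DTD graph is given as a list of (parent, child, kind) edges, where kind is either "normal" or "star" (a label assumed here for any non-normal edge, such as a repeated child). The element names are hypothetical.

```python
from collections import defaultdict

def inlinable_nodes(edges):
    """Return the nodes with exactly one incoming edge, where that edge is normal."""
    incoming = defaultdict(list)
    for _parent, child, kind in edges:
        incoming[child].append(kind)
    return {node for node, kinds in incoming.items() if kinds == ["normal"]}

# Hypothetical DTD graph, for illustration only
edges = [
    ("book", "title", "normal"),
    ("book", "author", "star"),
    ("author", "name", "normal"),
    ("article", "author", "star"),
    ("article", "year", "normal"),
]

result = inlinable_nodes(edges)
print(sorted(result))
```

Here `author` is not inlinable because it has two incoming edges, and neither of them is normal.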



Inlined DTD graph



Generated Database Schema



Data Mapping

• XML file is used to insert data into generated database schema

• Parser is used to fetch data from XML file.
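As a rough sketch of this data-mapping step, the following Python uses the standard-library XML parser to fetch values and insert them into a table. The table layout, the in-memory SQLite database, and the second sample row are assumptions made for illustration; they do not come from the original slides.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Sample input: the first row follows the earlier EMPLOYEE example,
# the second row is hypothetical sample data.
XML_TEXT = """
<EMPLOYEE>
  <row><EMPNO>000020</EMPNO><FIRSTNAME>John</FIRSTNAME><LASTNAME>Smith</LASTNAME></row>
  <row><EMPNO>000030</EMPNO><FIRSTNAME>Ann</FIRSTNAME><LASTNAME>Lee</LASTNAME></row>
</EMPLOYEE>
"""

def load_rows(xml_text, conn):
    """Parse the XML and insert one table row per <row> element."""
    conn.execute(
        "CREATE TABLE employee (empno TEXT PRIMARY KEY, firstname TEXT, lastname TEXT)"
    )
    root = ET.fromstring(xml_text.strip())
    for row in root.iter("row"):
        conn.execute(
            "INSERT INTO employee VALUES (?, ?, ?)",
            (row.findtext("EMPNO"), row.findtext("FIRSTNAME"), row.findtext("LASTNAME")),
        )
    conn.commit()

conn = sqlite3.connect(":memory:")
load_rows(XML_TEXT, conn)
loaded = conn.execute("SELECT empno, lastname FROM employee ORDER BY empno").fetchall()
print(loaded)
```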



Summary

• Simplify DTD

• Create DTD graph from simplified DTD

• Create inlined DTD graph from DTD graph

• Use inlined DTD graph to generate database schema

• Insert values from XML file into generated tables



Issues

• So, we can convert the XML to a relational database, but can we then export as an XML document?
  – This is equally challenging
• But MOSTLY involves just re-joining the tables
• How do you store and put back the wrapping tags for sets of subelements?
• Since the decomposition of the DTD was approximate, the output MAY not be identical to the input
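To make the export direction concrete, here is a minimal sketch that re-joins two tables and rebuilds an XML document. The `book`/`author` schema, the `position` column used to restore element order, and the `<authors>` wrapper tag are all hypothetical. The wrapper is re-created by the export code because the tables themselves do not store it, which is exactly the issue raised above.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE author (book_id INTEGER, position INTEGER, name TEXT);
INSERT INTO book VALUES (1, 'Sample Title');
INSERT INTO author VALUES (1, 2, 'Second Author');
INSERT INTO author VALUES (1, 1, 'First Author');
""")

def export_book(conn, book_id):
    """Re-join the tables and rebuild the document (assumes at least one author)."""
    rows = conn.execute(
        "SELECT b.title, a.name FROM book b JOIN author a ON a.book_id = b.id "
        "WHERE b.id = ? ORDER BY a.position",
        (book_id,),
    ).fetchall()
    book = ET.Element("book")
    ET.SubElement(book, "title").text = rows[0][0]
    authors = ET.SubElement(book, "authors")  # wrapper tag is not stored in any table
    for _title, name in rows:
        ET.SubElement(authors, "author").text = name
    return ET.tostring(book, encoding="unicode")

exported = export_book(conn, 1)
print(exported)
```

Whitespace, comments, and attribute order from the source document are not recoverable in this sketch, so the output may differ from the original input.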


Anatomy of a Native XML database

• The next set of slides (available on the class web site) comes from George Feinberg of Sleepycat Software
  – Sleepycat is now part of Oracle


Further comments on NXD

• Native XML databases are most often used for storing “document-centric” XML documents
  – I.e., the unit of retrieval would typically be the entire document and not a particular node or subelement

• This supports query languages like XQuery
  – Able to ask for “all documents where the third chapter contains a page that has a boldfaced word”
  – Very difficult to do that kind of query in SQL
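The example query can be approximated with the limited XPath support in Python's standard library. This is a sketch of the idea rather than XQuery itself; the element names (`chapter`, `page`, `b`) and the two sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document collection, for illustration only
DOCS = {
    "doc1": "<book><chapter/><chapter/>"
            "<chapter><page>plain <b>bold</b> text</page></chapter></book>",
    "doc2": "<book><chapter><page><b>bold</b></page></chapter><chapter/>"
            "<chapter><page>plain</page></chapter></book>",
}

def third_chapter_has_bold(xml_text):
    """True if the third <chapter> contains a <page> with a <b> descendant."""
    root = ET.fromstring(xml_text)
    third = root.find("./chapter[3]")  # positional predicate, 1-based
    if third is None:
        return False
    return any(page.find(".//b") is not None for page in third.iter("page"))

matches = [doc_id for doc_id, text in DOCS.items() if third_chapter_has_bold(text)]
print(matches)
```

The query depends on document order ("third chapter") and nesting depth, which is what makes it awkward to express over flat relational tables.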


XML-Based IR - Cheshire II

• I thought I would take a little time to talk about how the Cheshire system (which I have been working on for nearly 20 years) uses XML, since it has some similarities to (and many differences from) XML database systems

• Cheshire II (and Cheshire 3) are document-centric and involve parsing the XML for the purposes of indexing (and sometimes for retrieval of partial documents)


Cheshire II SGML/XML Support

• Underlying native format for all data is SGML or XML

• The DTD defines the file format for each file

• Full SGML/XML parsing

• SGML/XML Format Configuration Files define the database

• USMARC DTD and MARC to SGML conversion (and back again)

• Access to full-text via special SGML/XML tags


SGML/XML Support

• Example XML record for a DL document

<ELIB-BIB>
<BIB-VERSION>ELIB-v1.0</BIB-VERSION>
<ID>756</ID>
<ENTRY>June 12, 1996</ENTRY>
<DATE>June 1996</DATE>
<TITLE>Cumulative Watershed Effects: Applicability of Available Methodologies to the Sierra Nevada</TITLE>
<ORGANIZATION>University of California</ORGANIZATION>
<TYPE>report</TYPE>
<AUTHOR-INSTITUTIONAL>USDA Forest Service</AUTHOR-INSTITUTIONAL>
<AUTHOR-PERSONAL>Neil H. Berg</AUTHOR-PERSONAL>
<AUTHOR-PERSONAL>Ken B. Roby</AUTHOR-PERSONAL>
<AUTHOR-PERSONAL>Bruce J. McGurk</AUTHOR-PERSONAL>
<PROJECT>SNEP</PROJECT>
<SERIES>Vol 3</SERIES>
<PAGES>40</PAGES>
<TEXT-REF>/elib/data/docs/0700/756/HYPEROCR/hyperocr.html</TEXT-REF>
<PAGED-REF>/elib/data/docs/0700/756/OCR-ASCII-NOZONE</PAGED-REF>
</ELIB-BIB>


<USMARC Material="BK" ID="00000003"><leader><LRL>00722</LRL><RecStat>n</RecStat> <RecType>a</RecType><BibLevel>m</BibLevel><UCP></UCP><IndCount>2</IndCount> <SFCount>2</SFCount><BaseAddr>00229</BaseAddr><EncLevel> </EncLevel> <DscCatFm></DscCatFm><LinkRec></LinkRec><EntryMap><FLength>4</Flength><SCharPos> 5</SCharPos><IDLength>0</IDLength><EMUCP></EMUCP></EntryMap></Leader> <Directry>001001400000005001700014008004100031010001400072035002000086035001700106100001900123245010500142250001100247260003200258300003300290504005000323650003600373700002200409700002200431950003200453998000700485</Directry><VarFlds> <VarCFlds><Fld001>CUBGGLAD1282B</Fld001><Fld005>19940414143202.0</Fld005> <Fld008>830810 1983 nyu eng u</Fld008></VarCFlds> <VarDFlds><NumbCode><Fld010 I1="Blank" I2="Blnk"><a>82019962 </a></Fld010> <Fld035 I1="Blank" I2="Blnk"><a>(CU)ocm08866667</a></Fld035><Fld035 I1="Blank" I2="Blnk"><a>(CU)GLAD1282</a></Fld035></NumbCode><MainEnty><Fld100 NameType="Single" I2=""><a>Burch, John G.</a></Fld100></MainEnty><Titles><Fld245 AddEnty="Yes" NFChars="0"><a>Information systems :</a>theory and practice /<c>John G. Burch, Jr., Felix R. Strater, Gary Grudnitski</c></Fld245></Titles><EdImprnt><Fld250 I1="Blank" I2="Blnk"><a>3rd ed</a></Fld250><Fld260 I1="" I2="Blnk"><a>New York :</a>J. Wiley,<c>1983</c></Fld260></EdImprnt><PhysDesc><Fld300 I1="Blank" I2="Blnk"><a>xvi, 632 p. :</a>ill. ;<c>24 cm</c></Fld300></PhysDesc><Series></Series><Notes><Fld504 I1="Blank" I2="Blnk"><a>Includes bibliographical references and index</a></Fld504></Notes><SubjAccs><Fld650 SubjLvl="NoInfo" SubjSys="LCSH"><a>Management information systems.</a></Fld650> ...

SGML Support

• Example SGML/MARC Record


SGML Support

• Mini-TREC document…

<DOC>
<DOCNO>FT931-3566</DOCNO>
<PROFILE>_AN-DCPCCAA3FT</PROFILE>
<DATE>930316</DATE>
<HEADLINE>FT 16 MAR 93 / Italy's Corruption Scandal: Magistrates hold key to unlocking Tangentopoli - They will set the investigation agenda</HEADLINE>
<BYLINE> By ROBERT GRAHAM</BYLINE>
<TEXT>OVER the weekend the Italian media felt obliged to comment on a non-event. No new arrests had taken place in any of the country's ever more numerous corruption scandals which centre on the illicit funding of political parties...</TEXT>
<XX> …


…Companies:-</XX>
<CO>Ente Nazionale Idrocarburi. Ente Nazionale per L'Energia Electtrica. Ente Partecipazioni E Finanziamento Industria Manifatturiera. IRI Istituto per La Ricostruzione Industriale.</CO>
<XX>Countries:-</XX>
<CN>ITZ Italy, EC.</CN>
<XX>Industries:-</XX>
<IN>P9222 Legal Counsel and Prosecution. P91 Executive, Legislative and General Government. P13 Oil and Gas Extraction. P9631 Regulation, Administration of Utilities. P6719 Holding Companies, NEC.</IN>
<XX>Types:-</XX> …


…

<TP>CMMT Comment & Analysis.

GOVT Legal issues.

</TP>

<PUB>The Financial Times

</PUB>

<PAGE>

London Page 4

</PAGE>

</DOC>


SGML/XML Support

• Configuration files for the Server are also SGML/XML:
  – They include tags describing all of the data files and indexes for the database.
  – They also include instructions on how data is to be extracted for indexing and how Z39.50 attributes map to the indexes for a given database.


Cheshire Configuration Files

<DBCONFIG>
<DBENV>/projects/is240/GroupX/indexes </DBENV>



<FILEDEF TYPE=SGML>

<DEFAULTPATH>/projects/is240/GroupX </DEFAULTPATH>

<FILETAG> trec </FILETAG>

<FILENAME> /projects/is240/ft </FILENAME>

<CONTINCLUDE> /projects/is240/ft.CONT </CONTINCLUDE>

<FILEDTD> /projects/is240/TREC.FT.DTD </FILEDTD>

<ASSOCFIL> ft.assoc </ASSOCFIL>

<HISTORY> cheshire_index/TESTDATA.history </HISTORY>…


<INDEXES>

<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE PRIMARYKEY=IGNORE><INDXNAME> cheshire_index/trec.docno.index </INDXNAME><INDXTAG> docno </INDXTAG>

<INDXMAP><USE> 12 </USE><struct> 1 </struct> </INDXMAP>



<INDXKEY><TAGSPEC><FTAG>DOCNO </FTAG></TAGSPEC> </INDXKEY> </INDEXDEF>…


<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM><INDXNAME> cheshire_index/trec.topic.index </INDXNAME><INDXTAG> topic </INDXTAG>

<INDXMAP><USE> 29 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXMAP><USE> 29 </USE><RELAT> 102 </RELAT><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
…
<STOPLIST> cheshire_index/topicstoplist </STOPLIST>
<INDXKEY><TAGSPEC><FTAG>HEADLINE </FTAG><FTAG>DATELINE </FTAG><FTAG>BYLINE </FTAG><FTAG>TEXT </FTAG></TAGSPEC> </INDXKEY>
</INDEXDEF>
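The INDXMAP entries pair a combination of Z39.50 attribute values with an index. The Python sketch below illustrates that lookup using the attribute values from the two index definitions above. The exact-match rule and the names `INDEX_MAPS` and `resolve_index` are assumptions made for illustration; the matching logic Cheshire actually uses may be more permissive.

```python
# Attribute combinations copied from the INDXMAP entries in the configuration;
# the dictionary keys are illustrative names for USE, RELAT, POSIT and struct.
INDEX_MAPS = [
    ({"use": 12, "structure": 1}, "docno"),
    ({"use": 29, "position": 3, "structure": 6}, "topic"),
    ({"use": 29, "relation": 102, "position": 3, "structure": 6}, "topic"),
]

def resolve_index(query_attributes):
    """Return the index tag whose attribute map equals the query's attributes."""
    for attributes, index_tag in INDEX_MAPS:
        if attributes == query_attributes:  # assumed rule: exact match only
            return index_tag
    return None

hit = resolve_index({"use": 29, "relation": 102, "position": 3, "structure": 6})
miss = resolve_index({"use": 4})
print(hit, miss)
```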


Cluster Definitions



<CLUSTER>
<clusname> classcluster </clusname>
<cluskey normal=CLASSCLUS>
  <tagspec><FTAG>FLD950 </FTAG> <s> ^a </s></tagspec>
</cluskey>
<stoplist> /usr3/cheshire2/data2/clasclusstoplist </stoplist>
<clusmap>
  <from> <tagspec>
    <ftag>FLD245</ftag><s>^[ab]</s>
    <ftag>FLD440</ftag><s>^a</s>
    <ftag>FLD490</ftag><s>^a</s>
    <ftag>FLD830</ftag><s>^a</s>
    <ftag>FLD740</ftag><s>^a</s>
  </tagspec></from>
  <to> <tagspec> <ftag>titles</ftag> </tagspec></to>
  <from> <tagspec> <ftag>FLD6..</ftag><s>^[abcdxyz]</s> </tagspec></from>
  <to> <tagspec> <ftag>subjects</ftag> </tagspec></to>
  <summarize> <maxnum> 5 </maxnum>
    <tagspec> <ftag>subjsum</ftag></tagspec>
  </summarize>
</clusmap>
</CLUSTER>


Component Definitions

<COMPONENTS>
<COMPONENTDEF>
<COMPONENTNAME> TESTDATA/COMPONENT_DB1 </COMPONENTNAME>
<COMPONENTNORM>NONE</COMPONENTNORM>
<COMPSTARTTAG> <TAGSPEC> <FTAG>mainenty </FTAG> <FTAG>titles </FTAG> </TAGSPEC></COMPSTARTTAG>
<COMPENDTAG> <TAGSPEC><FTAG>Fld300 </FTAG></TAGSPEC></COMPENDTAG>
<COMPONENTINDEXES>
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE>
<INDXNAME> TESTDATA/comp1index1.author…
</INDEXDEF>
</COMPONENTDEF>
</COMPONENTS>


Result Formatting (Display)

<DISPOPTIONS>KEEP_ENTITIES</DISPOPTIONS>

<DISPLAY>
<FORMAT NAME="B" OID="1.2.840.10003.5.105" DEFAULT>
<convert function="TAGSET-G">
<clusmap>
  <from> <tagspec> <ftag>DOCNO</ftag> </tagspec></from>
  <to> <tagspec> <ftag>28</ftag> </tagspec></to>
  <from> <tagspec> <ftag>#DOCID#</ftag> </tagspec></from>
  <to> <tagspec> <ftag>5</ftag> </tagspec></to>
</clusmap>
</convert>
</FORMAT>
</DISPLAY>


Indexing

• Any SGML/XML tagged field or attribute can be indexed:
  – B-Tree and Hash access via Berkeley DB (Sleepycat)
  – Stemming, keyword, exact keys and “special keys”
  – Mapping from any Z39.50 Attribute combination to a specific index
  – Underlying postings information includes term frequency for probabilistic searching.
  – SGML may include address of full-text for indexing

• New indexes can be easily added, or old ones deleted
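To illustrate what postings with term frequency look like, here is a small in-memory sketch. It is not Cheshire's Berkeley DB format: the tokenizer, the dictionary layout, and the second sample record are assumptions, and no stemming is applied.

```python
import re
from collections import Counter, defaultdict

def build_postings(records, stopwords=frozenset()):
    """Build postings of the form term -> {record id: term frequency}."""
    postings = defaultdict(dict)
    for doc_id, text in records.items():
        terms = [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in stopwords]
        for term, freq in Counter(terms).items():
            postings[term][doc_id] = freq
    return postings

records = {
    "FT931-3566": "Magistrates hold key to unlocking Tangentopoli",
    "doc-2": "the key to the key index",  # hypothetical second record
}

postings = build_postings(records, stopwords={"the", "to"})
print(postings["key"])
```

Keeping the per-record frequency in the postings is what lets a probabilistic ranking function weight a record by how often the query term occurs in it.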


Database Storage

• All data stored as SGML/XML flat text files or in a BerkeleyDB database

• File format is defined through an SGML/XML DTD (also a flat text file) or XML Schema

• “Associator” files provide indexed direct access to each record in SGML/XML files.
  – Contain offset and record length for each “record”
  – Associators can be built to index any conformant document in a directory sub-tree
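The associator idea, a table of (offset, length) pairs giving direct access to records in a flat file, can be sketched as follows. The binary entry layout (8-byte offset, 4-byte length) and the `<DOC>` record delimiters are assumptions made for illustration and do not describe Cheshire's actual on-disk format.

```python
import io
import struct

RECORD_START = b"<DOC>"
RECORD_END = b"</DOC>"
ENTRY = struct.Struct("<QI")  # assumed layout: 8-byte offset, 4-byte length

def build_associator(data):
    """Scan the flat file once and pack one (offset, length) entry per record."""
    entries = []
    start = 0
    while True:
        begin = data.find(RECORD_START, start)
        if begin == -1:
            break
        end = data.find(RECORD_END, begin)
        if end == -1:
            break
        end += len(RECORD_END)
        entries.append(ENTRY.pack(begin, end - begin))
        start = end
    return b"".join(entries)

def fetch_record(data_file, associator, record_number):
    """Seek straight to a record using its associator entry (0-based)."""
    offset, length = ENTRY.unpack_from(associator, record_number * ENTRY.size)
    data_file.seek(offset)
    return data_file.read(length)

data = b"<DOC><DOCNO>A</DOCNO></DOC>\n<DOC><DOCNO>B</DOCNO></DOC>\n"
associator = build_associator(data)
second = fetch_record(io.BytesIO(data), associator, 1)
print(second)
```

Because every entry has the same size, record number n is found at a fixed position in the associator, so no scan of the data file is needed at retrieval time.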


Database Storage

[Figure: Cheshire database storage components: Config File, SGML/XML File, DTD File, Associator Files, History File, Cluster File, Index Files, Postings File, Page Data File, Prox Data File, and Remote RDBMS]


Client/Server Architecture

• Server Supports:
  – Database storage
  – Indexing
  – Z39.50 access to local data
  – Boolean and Probabilistic Searching
  – Relevance Feedback
  – External SQL database support

• Client Supports:
  – Programmable (Tcl/Tk – Python in C3) Graphical User Interface
  – Z39.50 access to remote servers
  – SGML & MARC formatting

• Combined Client/Server CGI scripting via WebCheshire


Z39.50 Overview

[Figure: Z39.50 overview diagram with components labeled UI, Map Query, Map Results, Internet, and Search Engine]




Grid-based Digital Libraries

• So what’s this Grid thing anyhow?

• Data Grids and Distributed Storage

• Grid-Based IR

• Grid-Based Digital Libraries

• Grid vs “Cloud”

This lecture borrows heavily from presentations by Ian Foster (Argonne National Laboratory & University of Chicago), Reagan Moore and others from San Diego Supercomputer Center


The Grid: On-Demand Access to Electricity

[Figure: chart with vertical axis “Quality, economies of scale” and horizontal axis “Time”]

Source: Ian Foster


By Analogy, A Computing Grid

• Decouples production and consumption
  – Enable on-demand access
  – Achieve economies of scale
  – Enhance consumer flexibility
  – Enable new devices

• On a variety of scales
  – Department
  – Campus
  – Enterprise
  – Internet

Source: Ian Foster


What is the Grid?

“The short answer is that, whereas the Web is a service for sharing information over the Internet, the Grid is a service for sharing computer power and data storage capacity over the Internet. The Grid goes well beyond simple communication between computers, and aims ultimately to turn the global network of computers into one vast computational resource.”

Source: The Global Grid Forum


Not Exactly a New Idea …

• “The time-sharing computer system can unite a group of investigators …. one can conceive of such a facility as an … intellectual public utility.”
  – Fernando Corbato and Robert Fano, 1966

• “We will perhaps see the spread of ‘computer utilities’, which, like present electric and telephone utilities, will service individual homes and offices across the country.”
  – Len Kleinrock, 1967

Source: Ian Foster


But, Things are Different Now

• Networks are far faster (and cheaper)
  – Faster than computer backplanes

• “Computing” is very different than pre-Net
  – Our “computers” have already disintegrated
  – E-commerce increases size of demand peaks
  – Entirely new applications & social structures

• We’ve learned a few things about software

Source: Ian Foster


Computing isn’t Really Like Electricity

• I import electricity but must export data

• “Computing” is not interchangeable but highly heterogeneous: data, sensors, services, …

• This complicates things; but also means that the sum can be greater than the parts
  – Real opportunity: Construct new capabilities dynamically from distributed services

• Raises three fundamental questions
  – Can I really achieve economies of scale?
  – Can I achieve QoS across distributed services?
  – Can I identify apps that exploit synergies?

Source: Ian Foster


Why the Grid? (1) Revolution in Science

• Pre-Internet
  – Theorize &/or experiment, alone or in small teams; publish paper

• Post-Internet
  – Construct and mine large databases of observational or simulation data
  – Develop simulations & analyses
  – Access specialized devices remotely
  – Exchange information within distributed multidisciplinary teams

Source: Ian Foster


Why the Grid? (2) Revolution in Business

• Pre-Internet
  – Central data processing facility

• Post-Internet
  – Enterprise computing is highly distributed, heterogeneous, inter-enterprise (B2B)
  – Business processes increasingly computing- & data-rich
  – Outsourcing becomes feasible => service providers of various sorts

Source: Ian Foster


The Information Grid

Imagine a web of data

• Machine Readable
  – Search, Aggregate, Transform, Report On, Mine Data – using more computers, and fewer humans

• Scalable
  – Machines are cheap – can buy 50 machines with 100 GB of memory and 100 TB disk for under $100K, and dropping
  – Network is now faster than disk

• Flexible
  – Move data around without breaking the apps

Source: S. Banerjee, O. Alonso, M. Drake - ORACLE


The Foundations are Being Laid

[Figure: network map of grid sites: Cambridge, Newcastle, Edinburgh, Oxford, Glasgow, Manchester, Cardiff, Soton, London, Belfast, DL, RAL, and Hinxton. Legend: Tier0/1 facility, Tier2 facility, Tier3 facility, 10 Gbps link, 2.5 Gbps link, 622 Mbps link, other link]


Data Grid Problem

• “Enable a geographically distributed community [of thousands] to pool their resources in order to perform sophisticated, computationally intensive analyses on Petabytes of data”

• Note that this problem:
  – Is common to many areas of science
  – Overlaps strongly with other Grid problems


Data Grids for High Energy Physics

[Figure: tiered data grid for high energy physics, with levels labeled Tier 0, Tier 1, Tier 2, and Tier 4. Nodes shown: Online System; Offline Processor Farm (~20 TIPS); CERN Computer Centre; FermiLab (~4 TIPS); France, Italy, and Germany Regional Centres; Caltech (~1 TIPS) and four Tier2 Centres (~1 TIPS each); Institutes (~0.25 TIPS) with a physics data cache; physicist workstations. Link labels: ~PBytes/sec, ~100 MBytes/sec, ~622 Mbits/sec, ~622 Mbits/sec or Air Freight (deprecated), ~1 MBytes/sec]

• There is a “bunch crossing” every 25 nsecs.

• There are 100 “triggers” per second

• Each triggered event is ~1 MByte in size

• Physicists work on analysis “channels”.

• Each institute will have ~10 physicists working on one or more channels; data for these channels should be cached by the institute server

• 1 TIPS is approximately 25,000 SpecInt95 equivalents

Image courtesy Harvey Newman, Caltech
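The figures quoted on this slide can be checked with a little arithmetic. The sketch below uses only the numbers stated here (25 ns bunch crossings, 100 triggers per second, ~1 MByte per event). The per-day total assumes continuous running and decimal units, which is an illustrative assumption rather than a claim about the actual experiment.

```python
BUNCH_CROSSING_INTERVAL_S = 25e-9   # one bunch crossing every 25 nsecs
TRIGGERS_PER_SECOND = 100           # events kept per second
EVENT_SIZE_MBYTES = 1               # ~1 MByte per triggered event

# How many crossings occur, and how few of them are kept
crossings_per_second = 1 / BUNCH_CROSSING_INTERVAL_S
crossings_per_trigger = crossings_per_second / TRIGGERS_PER_SECOND

# Recorded data rate: 100 events/s x ~1 MByte, consistent with the
# ~100 MBytes/sec link label in the figure
recorded_mbytes_per_second = TRIGGERS_PER_SECOND * EVENT_SIZE_MBYTES

# Assumption: continuous running, 1 TByte = 1e6 MBytes
seconds_per_day = 24 * 60 * 60
recorded_tbytes_per_day = recorded_mbytes_per_second * seconds_per_day / 1e6

print(round(crossings_per_second), round(crossings_per_trigger))
print(recorded_mbytes_per_second, recorded_tbytes_per_day)
```

Roughly one crossing in 400,000 is kept, and even that filtered stream accumulates several terabytes per day, which is why the data must be distributed across tiers.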


Grids and Open Standards

[Figure: chart with vertical axis “Increased functionality, standardization” and horizontal axis “Time”. Labels: Custom solutions; Globus Toolkit; De facto standards, GGF: GridFTP, GSI; X.509, LDAP, FTP, …; Web services; Open Grid Services Arch; GGF: OGSI, … (+ OASIS, W3C); Multiple implementations, including Globus Toolkit; App-specific Services]


The Grid as Enabler of 21st Century Science

• Entirely new approaches to enquiry based on
  – Deep analysis of huge quantities of data
  – Interdisciplinary collaboration
  – Large-scale simulation
  – Smart instrumentation

• Enabled by an infrastructure that provides access to, and integration of, resources & services without regard for location


Not only Science…

• The Database world is moving to the Grid for large-scale applications

• Oracle 10g is specifically designed to exploit clustered/grid computing using RACs (Real Application Clusters)

• An example from the Information/Publishing world…
  – Presentation from Oracle about Thomson Legal’s use of Oracle 10g and RACs
