Managing Semi-Structured Data

Managing Semi-Structured Data. Is the web a database?


Page 1: Managing Semi-Structured Data. Is the web a database?

Managing Semi-Structured Data

Page 2: Managing Semi-Structured Data. Is the web a database?

Is the web a database?

Page 3: Managing Semi-Structured Data. Is the web a database?

Rules: What Rules?

• Easy to create web information
• Cannot all be stored in relational databases
• Cannot be queried in traditional ways

“The web changed the digital information rules.”

Page 4: Managing Semi-Structured Data. Is the web a database?

Semi-structured Data

• Fully structured data
  – Databases
  – Hidden web
• Fully unstructured data: ordinary text
• Semi-structured data: the grey area in between
  – No “good solutions;” no good “software, tools, or methodologies to manipulate [semi-structured data]”
  – “[Researchers] don’t even agree on the shape of the problem, much less good approaches to solving it.”

Page 5: Managing Semi-Structured Data. Is the web a database?

Nature of the Problem

• Information embedded in text
  – Keyword search insufficient to answer queries
  – Natural language processing also insufficient
• Lack of agreement on vocabularies and schemas
  – “Reaching schema agreements among different communities is one of the most expensive steps in software design.”
  – “We need to be able to process information without requiring … a priori schema and vocabulary agreements among participants.”

Page 6: Managing Semi-Structured Data. Is the web a database?

Example: eBay

• “Impossible for … developers to define an a priori schema for the information.”
• “Information stored in raw text and searched using only keywords, significantly limiting its usability.”
• “Some standard entities (e.g., buyer, date, ask, bid …), but the meat of the information, the item descriptions, has a rich and evolving structure that isn’t captured.”
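The gap between keyword search and an actual query can be shown with a minimal Python sketch. Everything in it is a hypothetical illustration: the listings are invented, and `keyword_search`, `extract_price`, and `cameras_under` are names made up for this example, not part of any eBay system.

```python
import re

# Hypothetical item descriptions stored as raw text (invented for illustration).
LISTINGS = [
    "Vintage Nikon film camera, works great, $150 or best offer",
    "Camera bag, fits most DSLR bodies, $25",
    "Canon digital camera with two lenses, asking $450",
]

def keyword_search(listings, keyword):
    """Return every listing whose text contains the keyword (case-insensitive)."""
    return [text for text in listings if keyword.lower() in text.lower()]

def extract_price(text):
    """Pull the first dollar amount out of the text, or None if there is none."""
    match = re.search(r"\$(\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else None

def cameras_under(listings, limit):
    """Try to answer 'cameras priced below limit' by extracting a price attribute."""
    results = []
    for text in keyword_search(listings, "camera"):
        price = extract_price(text)
        if price is not None and price < limit:
            results.append((text, price))
    return results

hits = keyword_search(LISTINGS, "camera")   # finds text, cannot compare prices
cheap = cameras_under(LISTINGS, 200)        # needs an extracted attribute
print(len(hits))
print(cheap)
```

Keyword search alone returns all three listings and cannot filter by price. Extracting a price attribute helps, but the camera bag still slips through, because nothing in the raw text says what kind of item is being sold.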

Page 7: Managing Semi-Structured Data. Is the web a database?

Why Schemas?

• “Schemas assign meaning to the data and … allow automatic data search, comparison, and processing.”
• Hierarchy of meaning
  – Raw text: strings (values)
  – Data: attribute-value pairs
  – Information: data in a conceptual framework
  – Knowledge: information with a degree of certainty or community agreement
  – Meaning: knowledge that is relevant or activates
• “We have to learn to use and exploit schemas as helpers, but not rely on their existence or allow them to be constraining factors.”

Page 8: Managing Semi-Structured Data. Is the web a database?

Schema-Agnostic Tools

• Information retrieval (sophisticated search engines?)
  – Find (maybe?) but not answer
  – No DB-like query logic, updates, transactions
• XML
  – XML data can exist with or without schemas; schemas can be defined before or after
  – Mixed text/data content
  – Languages for query (XQuery) and transformation (XSLT)
• OWL & RDF
  – RDF: subject-predicate-object triples
  – OWL: ontological descriptions usually over RDF triples
  – Classification & inferencing
  – Semantic annotation and tagging

Possible Places to Start
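As a rough illustration of the XML and RDF bullets, the Python sketch below parses a small XML fragment with mixed text/data content and no schema, then flattens its tagged parts into subject-predicate-object triples. The document, the `to_triples` and `match` helpers, and the use of plain tuples are assumptions made for this example. Real RDF identifies resources with IRIs and is normally handled by a dedicated library, so this only mimics the triple shape.

```python
import xml.etree.ElementTree as ET

# A hypothetical listing with mixed text/data content and no schema.
DOC = """<listing id="item42">Vintage <brand>Nikon</brand> camera,
asking <price currency="USD">150</price>, ships from <city>Provo</city>.</listing>"""

def to_triples(xml_text):
    """Turn each child element into a (subject, predicate, object) tuple."""
    root = ET.fromstring(xml_text)
    subject = root.get("id")
    return [(subject, child.tag, child.text) for child in root]

def match(triples, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

triples = to_triples(DOC)
# The surrounding prose is still available as ordinary text.
full_text = "".join(ET.fromstring(DOC).itertext())

print(triples)
print(match(triples, p="price"))
print(full_text)
```

The same fragment can be read as text or queried as data, and no schema had to exist beforehand. The price is still the string "150", though: without a schema, nothing says it is a number.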

Page 9: Managing Semi-Structured Data. Is the web a database?

Are We Stuck?

• Better information-authoring tools (annotation assistance)
• Information extraction (automatic annotation)
• Creation and reuse of standard schemas and vocabularies (ontology generation)
• Mapping schemas to each other (schema mapping)
• Automatic data linking (data linking & merging)
• Automatic processing of semi-structured data (free-form queries)

What’s Next?

– Florescu (Embley)
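Two of the directions above, schema mapping and data linking and merging, can be sketched in a few lines of Python. The two sources, the attribute names, the hand-written mappings, and the choice of a normalized name as the linking key are all hypothetical. The research problem is doing these steps automatically, which this sketch does not attempt.

```python
# Two hypothetical sources describing the same item with different vocabularies.
SOURCE_A = [{"title": "Nikon F3", "cost": "150", "seller": "alice"}]
SOURCE_B = [{"name": "nikon f3", "price_usd": 150.0, "condition": "used"}]

# Hand-written mappings from each source's attribute names to a shared vocabulary.
MAP_A = {"title": "name", "cost": "price", "seller": "seller"}
MAP_B = {"name": "name", "price_usd": "price", "condition": "condition"}

def apply_mapping(record, mapping):
    """Rename attributes using the mapping; unmapped attributes are kept as-is."""
    return {mapping.get(key, key): value for key, value in record.items()}

def normalize(record):
    """Normalize values so records from different sources can be compared."""
    out = dict(record)
    out["name"] = str(out["name"]).strip().lower()
    out["price"] = float(out["price"])
    return out

def link_and_merge(records):
    """Merge records that share the same normalized name (a naive linking key)."""
    merged = {}
    for record in records:
        merged.setdefault(record["name"], {}).update(record)
    return list(merged.values())

mapped = [normalize(apply_mapping(r, MAP_A)) for r in SOURCE_A]
mapped += [normalize(apply_mapping(r, MAP_B)) for r in SOURCE_B]
merged = link_and_merge(mapped)
print(merged)
```

After mapping, the two records collapse into one that carries both the seller and the condition. Writing `MAP_A` and `MAP_B` by hand is exactly the costly agreement step the earlier slides describe.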

Page 10: Managing Semi-Structured Data. Is the web a database?

Dataspace System

• Supports data and applications in a wide variety of formats all within a dataspace.
• Offers an integrated means of searching, querying, updating, and administering the dataspace.
• Has varying levels of service (e.g., “best-effort” or approximate answers)
• Includes tools to create tighter integration of the data, as necessary.

What’s beyond a database system?

– Franklin, Halevy, Maier
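The idea of varying levels of service can be illustrated with a small Python sketch. It is not based on any actual dataspace implementation: the `DATASPACE` contents, the `query` function, and the "exact" and "best-effort" labels are assumptions chosen for this example.

```python
# A hypothetical dataspace holding items in different formats side by side.
DATASPACE = [
    {"kind": "record", "data": {"name": "Nikon F3", "price": 150.0}},
    {"kind": "text", "data": "Selling my old Nikon F3, barely used."},
    {"kind": "text", "data": "Camera bag for sale."},
]

def query(dataspace, term):
    """Return (quality, item) pairs: 'exact' for attribute matches on records,
    'best-effort' for keyword matches on unstructured text."""
    answers = []
    needle = term.lower()
    for item in dataspace:
        if item["kind"] == "record":
            if str(item["data"].get("name", "")).lower() == needle:
                answers.append(("exact", item["data"]))
        elif needle in item["data"].lower():
            answers.append(("best-effort", item["data"]))
    return answers

answers = query(DATASPACE, "nikon f3")
for quality, item in answers:
    print(quality, item)
```

One query interface covers both formats, and each answer carries a label saying how much to trust it. Tighter integration would mean turning more of the text items into records over time.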

Page 11: Managing Semi-Structured Data. Is the web a database?

“We are still at day one.”

“We need to find a compromise to the tension between the advantages of having schemas, in terms of better understanding and automatically processing the data, and disadvantages imposed by schemas, in terms of inflexibility and lack of evolution.” – Florescu