Information Integration
Instructor: Pankaj Mehra
Teaching Assistant: Raghav Gautam
Lec. 9, May 13, 2010
ISM 158
Enterprise Information
[Figure: enterprise information integration architecture. Sources shown: Web Service, Application, Inter-application messages, E-mail, File system, Instant messages, Web content server, Database (SQL). Other labels: Tag, 1st-level index, CQL, Central archives, 2nd-level metadata, Integration Hub (Distributed Query Optimizer, Enterprise Information Schema, 2nd-level index, 2nd-level cache), and a co- or sub-repository with separate data, metadata & index.]
Centralized versus Distributed?
• Distributed systems occur naturally
• State of the art does not allow complex queries or deep analysis against distributed information
• Centralization may also be favored due to lower costs of infrastructure, license and labor, as well as its ability to better enforce tighter integrity constraints and other information management policies
• Ultimately, the decision needs to take into account issues of ownership and control
  – Technology considerations are often secondary; even so, rational rules for resolving these considerations exist, as described in the Distributed Computing Economics paper
Contrasting Business & Technical Information
[Figure: diagram contrasting a Business domain and a Technical domain. Labels: Metadata scaling; Data bandwidth scaling; SQL schema & query; XML or WS schema & query; File schema & query; Centralized metadata; Real-time information; Ad hoc query; Inconsistent information; Pivoting (appears twice); Data mining; Search federation; Structured sources; Distributed archives; Distributed complex controls; Central control; Central archive; Stable schemata; Schema evolution; Unstructured sources; Heavy data processing, simple metadata fusion; Complex metadata, simpler data fusion; ETL (appears twice); Streaming A/V; Visualization; Dashboards; Steering; Deep linguistics.]
The Guiding Principles
• It is a bad idea to address the following as afterthoughts
  – Scale
  – Availability
  – Integrity
  – Privacy and security
  – Compliance / auditability
  – Retention requirements
  – Business value
  – Information quality
• The ability to embed function close to data is fundamental to scalable information processing
• In order to deliver the best performance/$, systems tend to scale out from the technology sweet spot of the day
• Redundancy configured in from the start, as well as mechanisms for early detection and isolation of faults
• Optimize availability by optimizing recovery
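The principle of embedding function close to data can be illustrated with a minimal sketch. Everything here is hypothetical and not from the lecture: `DataNode`, `fetch_all`, `run_local`, and the `size_kb` field are names invented for illustration. The sketch compares moving every record to a central evaluator with shipping a small function to each node and moving only its result.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Hypothetical storage node that can run a function next to its records."""
    records: list = field(default_factory=list)

    def fetch_all(self):
        # Ship-the-data approach: every record leaves the node.
        return list(self.records)

    def run_local(self, fn):
        # Ship-the-function approach: only the function's result leaves the node.
        return fn(self.records)


def count_large(records):
    """Count records larger than 100 KB (an arbitrary illustrative predicate)."""
    return sum(1 for r in records if r["size_kb"] > 100)


nodes = [
    DataNode([{"id": 1, "size_kb": 40}, {"id": 2, "size_kb": 250}]),
    DataNode([{"id": 3, "size_kb": 900}, {"id": 4, "size_kb": 10},
              {"id": 5, "size_kb": 120}]),
]

# Centralized evaluation: all records are moved, then filtered in one place.
moved = [r for n in nodes for r in n.fetch_all()]
central_answer = count_large(moved)

# Function close to data: one integer per node is moved instead.
partials = [n.run_local(count_large) for n in nodes]
pushed_answer = sum(partials)
```

Both paths give the same answer, but the second moves one partial result per node rather than every record, which is where the scaling benefit would come from as record counts grow.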
Scalable Content Processing
• Enterprise information is complex
• Diversity of information sources and formats
  – Entails complex integration and processing flows
  – Metadata generation and indexing
  – Content indexing
• Protection and security
[Figure: storage, data and content reached through connectors on either side of a scalable repository (e.g. JCR API) and scalable processing.]
Smart Cells
• Scalable distributed system of self-contained, all-inclusive data repositories
• Principles
  – Scale-out
  – Federation
  – Intelligence close to data
  – Pluggable platforms supporting proprietary and 3rd-party storage services
• Example
  – Platforms for Information Lifecycle Management services
  – Scale-out architecture used under cloud information services
[Figure: many SmartCell units and a Smart Query Fabric. Labels: Storage: Block, File, Object & Fragment; Content indexing; Attribute indexing; Supported protocols and APIs.]
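A minimal sketch of the federation idea behind Smart Cells follows. It is an illustration under stated assumptions, not the actual Smart Cell design: the `SmartCell` and `QueryFabric` classes, their methods, and the sample attributes are all hypothetical. Each cell keeps its own data and attribute index, and a fabric layer fans a query out to the cells and merges the answers.

```python
from collections import defaultdict


class SmartCell:
    """Hypothetical self-contained repository: objects plus a local attribute index."""

    def __init__(self, name):
        self.name = name
        self.objects = {}                    # object id -> attribute dict
        self.attr_index = defaultdict(set)   # (attribute, value) -> object ids

    def put(self, obj_id, **attrs):
        self.objects[obj_id] = attrs
        for key, value in attrs.items():
            self.attr_index[(key, value)].add(obj_id)

    def query(self, key, value):
        # Answered entirely inside the cell, using only its local index.
        return sorted(self.attr_index.get((key, value), ()))


class QueryFabric:
    """Hypothetical federation layer: fans a query out and merges the results."""

    def __init__(self):
        self.cells = []

    def add_cell(self, cell):
        # Scale-out: capacity grows by adding cells, not by growing one cell.
        self.cells.append(cell)

    def query(self, key, value):
        merged = {}
        for cell in self.cells:
            hits = cell.query(key, value)
            if hits:
                merged[cell.name] = hits
        return merged


cell_a = SmartCell("cell-a")
cell_a.put("doc1", owner="hr", kind="contract")
cell_a.put("doc2", owner="legal", kind="contract")
cell_b = SmartCell("cell-b")
cell_b.put("doc3", owner="hr", kind="memo")

fabric = QueryFabric()
fabric.add_cell(cell_a)
fabric.add_cell(cell_b)
hr_hits = fabric.query("owner", "hr")
memo_hits = fabric.query("kind", "memo")
```

In this sketch the fabric holds no index of its own, so every query touches every cell; a real fabric would likely add routing metadata to avoid that.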
Considerations in Distributed Information Management
• Information is distributed across heterogeneous sources and has varied provenance
  – Integration
• Information management requires information about information
  – Metadata
• Useful information is timely and findable
  – Real-time integration and caching
  – Indexing
  – Semantic analysis
  – Context
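One way a cache could be kept timely is by replaying a source's change log into it. The sketch below is hypothetical throughout: the log format `(sequence, operation, key, value)`, the operation names `upsert` and `delete`, and the function `apply_change_log` are assumptions made for illustration, and the log is assumed to be ordered by sequence number.

```python
def apply_change_log(cache, log, last_seq):
    """Replay log entries newer than last_seq into cache.

    Returns the new high-water mark so the next call can resume from it.
    Assumes log entries are ordered by ascending sequence number.
    """
    for seq, op, key, value in log:
        if seq <= last_seq:
            continue                  # already applied on an earlier pass
        if op == "upsert":
            cache[key] = value
        elif op == "delete":
            cache.pop(key, None)
        last_seq = seq
    return last_seq


log = [
    (1, "upsert", "cust:7", {"city": "Santa Cruz"}),
    (2, "upsert", "cust:9", {"city": "San Jose"}),
    (3, "delete", "cust:7", None),
]

cache = {}
mark = apply_change_log(cache, log[:2], 0)   # first poll sees two entries
snapshot_after_first_poll = dict(cache)
mark = apply_change_log(cache, log, mark)    # second poll applies only entry 3
```

Tracking the high-water mark makes replay idempotent: handing the function entries it has already seen does not change the cache.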
Dimensions of Integration
Information Integration Methodologies
• Access mechanism
  – Scheduled crawl
  – Triggered crawl
  – Tap update operations
  – Tap change log
  – Tap message flow
  – Tap streaming data
  – Subscribe to data
  – Subscribe to metadata
• Query language
  – XQuery
  – SQL DML
  – SPARQL
  – Proprietary API
  – Proprietary protocol
  – Search terms
• Query processing technique
  – Centralized
  – Distributed, 1-pass, forwarded
  – Distributed, 1-pass, flooded
  – Distributed, two-pass
  – Optional DQO (chaining, referral, recruiting, virtual stored procedures)
  – Optional results caching for multi-step queries
• Indexing technique
  – Centralized
  – Distributed, one-level
  – Distributed, two-level
• Statefulness
  – Stateful: local queries on cached data
  – Stateful: distributed query; DQO & intermediate result caching
  – Stateless
• Schema definition language
  – SQL DDL
  – XML Schema
  – WSDL
  – GGF DAIS
  – Navigable filesystem metadata
  – Navigable repository metadata
• Metadata architecture
  – Centralized, one-level
  – Distributed, one-level
  – Distributed, two-level
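The difference between a flooded one-pass query and a query routed through a two-level index can be sketched as follows. This is one possible reading of those terms, not a definition taken from the slides: here the first-level index lives in each repository (term to documents) and the second-level index lives at a hub (term to repositories). The `Repository` and `Hub` classes and the sample documents are hypothetical.

```python
from collections import defaultdict


class Repository:
    """Hypothetical source with a first-level index: term -> document ids."""

    def __init__(self, name, docs):
        self.name = name
        self.index = defaultdict(set)
        for doc_id, text in docs.items():
            for term in text.lower().split():
                self.index[term].add(doc_id)

    def search(self, term):
        return sorted(self.index.get(term, ()))


class Hub:
    """Hypothetical integration hub with a second-level index: term -> repositories."""

    def __init__(self, repos):
        self.repos = {r.name: r for r in repos}
        self.index = defaultdict(set)
        for r in repos:
            for term in r.index:
                self.index[term].add(r.name)

    def search_flooded(self, term):
        # Flooded: every repository is asked, whether or not it can answer.
        contacted = sorted(self.repos)
        hits = {}
        for name in contacted:
            found = self.repos[name].search(term)
            if found:
                hits[name] = found
        return hits, contacted

    def search_two_level(self, term):
        # Two-level: the hub index picks the repositories worth asking.
        contacted = sorted(self.index.get(term, ()))
        hits = {name: self.repos[name].search(term) for name in contacted}
        return hits, contacted


hub = Hub([
    Repository("mail", {"m1": "invoice overdue", "m2": "lunch plans"}),
    Repository("files", {"f1": "invoice template"}),
    Repository("wiki", {"w1": "onboarding guide"}),
])

flooded_hits, flooded_contacts = hub.search_flooded("invoice")
routed_hits, routed_contacts = hub.search_two_level("invoice")
```

Both strategies return the same hits here; the routed query simply skips the repository that the hub index shows cannot match. The price is that the hub index must be kept in step with the repositories.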
Ecosystem of integration products
• Metadata
  – Determines information richness
• Service Orientation
  – Determines protocol richness
• Future
  – Integration as syndication
  – Integration aaS
[Figure: product categories plotted on two axes, Metadata and Service-orientedness.]
• SQL-based EII: SAP, Oracle, Composite
• XML-based EII: BEA LiquidData, Mark Logic
• JSR 170 ECI: Day
• WS-based SOA: Microsoft, IBM
• RSS-based: NewsGator
• Pure EAI: Tibco, SAG
• Uniform access: MOSS, Attivio
Points for Discussion in class
• Consider a healthcare patient information scenario.
  – Is it mainly transactional or mainly analytic?
  – Would you lean toward a distributed (EAI) approach or a centralized one (warehouse)?
• Consider a scenario in which a company wants to drill down into the root causes of customer complaints.
  – Again, centralized or distributed?
    • Identifying the root cause
    • Tracking the problem
  – Would real-time integration become a requirement?
Points to ponder at home
• Pros of integration
  – Connecting the dots
  – Single view of …
  – Quality control over
    • Inconsistency
    • Staleness
    • Gaps
• Cons of integration
  – Loss of context
  – Often, read only
  – Cost
  – Duplication
  – Scale
  – Losing battle?
  – Risk
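The three quality problems named under the pros (inconsistency, staleness, gaps) can be made concrete with a small sketch that compares two sources holding records for the same customers. All of it is hypothetical: the `quality_report` function, the `email` and `updated` fields, the one-year staleness threshold, and the sample records are assumptions made for illustration.

```python
from datetime import date


def quality_report(source_a, source_b, today, max_age_days=365):
    """Classify shared keys as inconsistent or stale, and unshared keys as gaps."""
    report = {"inconsistent": [], "stale": [], "gaps": []}
    for key in sorted(set(source_a) | set(source_b)):
        a, b = source_a.get(key), source_b.get(key)
        if a is None or b is None:
            report["gaps"].append(key)          # present in only one source
            continue
        if a["email"] != b["email"]:
            report["inconsistent"].append(key)  # sources disagree
        newest = max(a["updated"], b["updated"])
        if (today - newest).days > max_age_days:
            report["stale"].append(key)         # neither copy is recent
    return report


crm = {
    "c1": {"email": "a@example.com", "updated": date(2010, 4, 1)},
    "c2": {"email": "b@example.com", "updated": date(2008, 1, 15)},
    "c3": {"email": "c@example.com", "updated": date(2010, 5, 1)},
}
billing = {
    "c1": {"email": "a@example.org", "updated": date(2010, 3, 1)},
    "c2": {"email": "b@example.com", "updated": date(2008, 6, 30)},
}

report = quality_report(crm, billing, today=date(2010, 5, 13))
```

None of these checks is possible while the sources stay separate, which is the sense in which integration enables quality control.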
Where to learn more
• Data Integration: The Relational Logic Approach by Michael Genesereth, Morgan & Claypool Publishers, 2010
Upcoming guest lectures in May
• Dr. V. Galotra, Oracle
  – SOA Deep Dive
• Rahul Nim, Efficient Frontier
  – Online marketing
Questions?
NEWS PRESENTATION