© 2002 IBM Corporation
IBM Research
Impliance -- Information Management Appliance
Impliance: an Information Management Appliance
Bishwaranjan Bhattacharjee (IBM Watson Research Center)
Vuk Ercegovac, Joseph Glider, Richard Golding, Guy Lohman, Volker Markl, Hamid Pirahesh, Jun Rao, Robert Rees, Frederick Reiss, Eugene Shekita, Garret Swart (Almaden Research Center)
2 Impliance -- Information Management Appliance © 2007 IBM Corporation
Agenda
Motivation: Observations → Requirements
What is Impliance?
How is Impliance different from…?
Research opportunities
Conclusions
After all our successes (and last night’s revelry), it’s easy to become self-congratulatory.
Sorry, time for…
Some embarrassing questions:
Why is most (>80%) of the world’s data still not in databases?
Didn’t we “solve” this problem in the 1980s with object-relational systems?
Do you use a database to store your data on your laptop?
Why not? (You are a database bigot, aren’t you?)
Have you ever tried to query (with SQL) a database that:
– You didn’t create, and…
– Had more than 500 tables?
Just how easy is it to incrementally add DB capacity beyond 1 machine? 100 machines?
Have “self-managing” databases significantly simplified administration?
Observation → Requirements (1 of 5)
Observation #1: Information converging
– Many types of data in today’s enterprise
  • Structured (traditional Data Base)
  • Semi-structured (traditional Content Management, XML)
  • Unstructured (text, multimedia)
– Each needs a different search interface, today
  • SQL
  • JSR-170
  • Keyword search / Information Retrieval
Requirement #1: Store / Search / Analyze all data
– Need to rapidly relate information of different types
– With one unified interface!
– Real use cases in paper
Observation → Requirements (2 of 5)
Observation #2: Awash in data, but not information
– Typical complaint: “I can’t find what I’m looking for!”
– But just finding data isn’t enough!
– Today’s Business Intelligence is too human-intensive
Requirement #2: Pro-actively derive useful information
– Need to glean more business value from enterprise data
– What sort of analytics exploit unstructured data?
– Need to automatically extract the semantics of text
– A rebirth of data mining?
Observation → Requirements (3 of 5)
Obs. #3: Total Cost of Ownership (TCO) is paramount
– People costs dominate TCO
  • Hardware often less than 50% of TCO
– Minimize Time To Value
  • Databases take too long to set up!
– Wizards & Advisors simply mask complexity, add brittleness
Reqmt. #3: System must be simple, robust, & secure
– Sacrifice resource utilization for radical simplification of:
  • Setup / Configuration / Deployment (e.g., Self-Organizing)
  • Operation
– KISS (you know this one)
– KIWI – Kill It With Iron [Weikum]!
– Example: “Good enough” plans exploiting massive parallelism
Observation → Requirements (4 of 5)
Observation #4: Data volumes growing fast
– Data is kept longer
– Lots of new kinds of data: RFID, email, photos, videos
– Disk densities improving, but not seek times!
  • 1 TB disk for $399 (Hitachi)
Requirement #4: Simple & massive scale-out
– 1000s of nodes
– With low management overhead
– No single point of failure
Observation → Requirements (5 of 5)
Obs. #5: Today’s Info. Mgmt. software based upon hardware 30 yrs. ago
– Example: Update-in-place databases due to expensive disk
– Today: Cheap CPUs, large storage, fast networks
Requirement #5: Need new (software) architecture
Opportunity to radically rethink Info. Mgmt. software architecture (Stonebraker: “refactor”), based upon:
– Hardware economics
  • e.g., cheap (multi-core) CPUs, storage, memory, network
– Software:
  • Formats (e.g., XML, semi-structured data)
  • Functionality required (e.g., unstructured search, analytics)
– Specified in the right order:
  • Service requirements → Software → Hardware
What is Impliance?
– Scalable: Massively parallel scale-out… to Petabytes!
– Administrator-less: Low Time to Value by Self-Organizing; Low Total Cost of Ownership
– Manage and Search All Data: Structured Data (Tables), Semi-Structured (XML), … Even Unstructured Text!
– Pro-actively Mine Information: Glean business insight from data
– Bundled: HW & SW; Pre-configured; Pre-tuned; Limited APIs
What Does Impliance Actually Do?
All enterprise information:
√ Stores & Retrieves (Search / Query)
√ Composes / Integrates / Mashups
√ Finds trends & exceptions (Business Intelligence)
Think of Impliance as…
– Content Management on steroids (beyond JSR-170)
– File System with all content searchable
– Data Warehouse with all your enterprise’s data
– Not just structured information
– Excluding high-rate OLTP (web site)
– A Jambalaya
Where does Impliance fit?
[Figure: products positioned by Types of Data (Structured, Semi-Structured, Un-Structured) against Lifetime of Data (Transaction, Ingestion, Warehousing/OLAP, Archiving). Labeled regions: DBMS, OLTP, XML, Content Management, Archiving Products, Impliance.]
How is Impliance related to…
Google Base?
– Primary data store
– Appliance (product, i.e., sits in customer site), not a Service
– Enterprise, not “the masses”
DataSpaces / Google “Pay as you go”?
– Primary data store (vs. lazy federation of existing data sources)
– Enterprise, not “the web”
Database “Appliances” (Netezza, DATAllegro, Greenplum, etc.)?
– Not just structured (relational) data
– Discovery of semantics
– More pro-active
Research Opportunities
Reducing TCO
– Make categories of administration just GO AWAY
– Self-Organizing to obviate database design
– Exploit appliance’s limited externalized interfaces
New HW & SW architectures using off-the-shelf components
– Achieving fine-grained scale-out
– Targeting robust, “good enough” designs
– Exploiting integration of components
Data and query models that
– Unify all data, yet are simple
– Tolerate “schema chaos”
– Combine best features of keyword search & SQL
Automated discovery of
– Data & query semantics for
  • Improving precision of queries
  • Organizing data adaptively
– Trends, exceptions, etc. (pro-active Business Intelligence)
Conclusions
We’ve come a long way towards
– the autonomic dream
– incorporating all data
But we can do much more!
Impliance provides exciting opportunity for DB research
– To lower TCO for information management
– To exploit today’s hardware and software advances
– To rethink information management in a fundamentally new way
Join us!
Thank You
Merci, Grazie, Gracias, Obrigado, Danke
Appendix
Redefining Information Systems -- Players
Web 2.0 oriented next generation systems (delivered through services or appliances): Google, Yahoo, MSN, (IBM)
– Google Base (a semi-structured/un-structured information base)
– Google OneBox
NextGen systems built by integration of successful open source (Greenplum)
– Data models: RSS/ATOM/Wiki/…
– Architecture: DB+Search+Content systems (e.g., MySQL+Lucene+Jackrabbit)
Entrenched HW/Storage/middleware companies
– Storage-driven:
  • EMC -- Moving up the value chain, brought in a classic Content system
  • IBM – IDS: synergy between classic CM (JCR) and storage
– Server-driven:
  • Netezza, DATAllegro (for BI)
  • Zantaz (for email compliance)
  • DataPower (XSLT filtering)
– Middleware-driven (IBM, Oracle, Microsoft)
  • Oracle Secure Enterprise Search
Research Focus 1: Reducing TCO
Make entire categories of administration JUST GO AWAY
Reducing time-to-value through new design principles
– Self-organization of “schema chaos” obviates lengthy logical & physical design, REORG
– Fine-grained scale-out (instead of scale-up) obviates need for load balancing, etc.
New software architecture
– Target robust, highly-predictable, “good enough” utilization (KIWI = Kill It With Iron)
– Componentization
  • Each component simple, robust, and adaptive
– Virtual service model
  • Service Broker optimizes resources and assigns the workload
Exploit integrated hardware and storage systems to provide
– Built-in redundancy and availability
– Automated backup and archiving (ILM)
– Easy cluster management
– Schema chaos support at storage level (semantic storage)
– Ability to use new types of grid elements (cell blade server) seamlessly
Research Focus 2: Scalability
True Grid Model
– Off-the-shelf, commodity hardware
– Dedicate blades to different tasks
  • Data: storage and simple filtering
  • Analytical: aggregation & mining
  • Transaction: search, transactional get/put
Supports Mixed Workloads
– Analytics, Search, Content, …
Fine-grained scale-out
– Different blade types scale independently
– From SMB to largest enterprises
Integrating modern HW & storage, e.g.
– BC3, intelligent bricks
– Logic pushdown into storage
  • Predicate application
  • Aggregation
  • Redundancy management
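The "logic pushdown into storage" item above can be illustrated with a small sketch. It is not Impliance code: the blade contents, the `blade_scan` and `merge_partials` names, and the sample predicate are all assumptions made for illustration. Each storage node applies the predicate and pre-aggregates locally, so only small partial results cross the interconnect.

```python
from collections import defaultdict

# Hypothetical rows held by two storage nodes ("data blades").
BLADES = [
    [{"region": "EU", "amount": 120}, {"region": "US", "amount": 80}],
    [{"region": "EU", "amount": 30}, {"region": "US", "amount": 200}],
]


def blade_scan(rows, predicate, key, value):
    """Runs on the storage node: filter rows, then sum `value` per `key`."""
    partial = defaultdict(int)
    for row in rows:
        if predicate(row):
            partial[row[key]] += row[value]
    return dict(partial)


def merge_partials(partials):
    """Runs on the analytic node: combine the per-blade partial sums."""
    total = defaultdict(int)
    for partial in partials:
        for group, subtotal in partial.items():
            total[group] += subtotal
    return dict(total)


def is_large(row):
    """Sample predicate pushed down to the blades (assumed threshold)."""
    return row["amount"] >= 50


partials = [blade_scan(rows, is_large, "region", "amount") for rows in BLADES]
result = merge_partials(partials)
print(result)
```

Only one small dictionary per blade is shipped to the analytic node, rather than every row, which is the point of applying predicates and aggregation close to the disks.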
[Figure: grid architecture diagram. Labels: Data Array, Data Blade (Data proc, RAID), Analytic Grid (Analytic Blade), Transactional Cluster (Transaction Blade), Commodity Interconnect, Data+Content+Search+Digital Media. Streams: Content Stream, Data Stream, Archive/ILM Stream, Xaction Stream.]
Parallel Run-time: Comparison of Plumbing
Platform | Application | Querying model | Parallelism | Fault tolerance | Resource scheduling
WS XD | Transactional (composition; no search, no BI) | limited | moderate | yes | yes
DataStage (E2) | ETL (streaming) (cleansing, transformation, composition) | rich | high | yes | yes
GPFS | Storage | extremely limited | extremely high | yes | limited
DB2 ESE with DPF | Analytics for relational | rich | high | yes | yes
Google Map/Reduce | Analytics for anything (search, transformation, simplistic composition) | limited | extremely high | yes | yes
Impliance | Analytics for anything, Search, Composition | rich | extremely high | yes | yes
[Figure: layered architecture diagram. Components labeled: Applications; Query, Discovery, Data Analyzer; Data/Query Modeler; Resource Modeler; Objects; Security Control; Scalable Reliable Runtime Support (SRRS); Distributed Data Store (DDS); Virtual Storage and Computing Resource (VSCR). Data types and interfaces: Relational data (SQL), content (JCR), XML (XSLT), Web page (HTTP), Video, Archive (ILM), …]
Data Analyzer, Discovery, Query: Large-scale computation
Data Modeler: Simple, generic
SRRS: Fault tolerant
DDS: Provide reliability
VSCR: Commodity HW
Research Focus 3: Information Modeling and Querying
Simple, rich, unified information model & associated query languages, e.g.
– Google Base approach promising
  • Defined typed attributes for navigation
  • Defined label for keyword search
– Infosphere, MUSIC
– Open community (RSS / Atom / wiki)
Automatic schema discovery and integration – self-organizing!
– Integrating solutions from Infosphere, CLIO
Intelligence discovery
– Automatic discovery of semantics (UIMA, Web Fountain, Avatar)
– Pro-active, continuous mining (vs. passive BI model)
– Contextual information supply
– Including reporting and advanced analytics
Eliminate Admin Tasks… …Rather than adding layers (1 of 3):
Special-purpose, turn-key appliances for basic services
– vs. today’s general-purpose SW (but still uses off-the-shelf hardware!)
– Bundled, Pre-installed, Pre-configured, Pre-tuned software!
– Examples:
  • Information Management appliance
  • Web Server appliance
– Minimizes interfaces user has to worry about
  • No need to externalize underlying operating system, storage details
– Eliminates need to install, configure, and tune
Self-organizing data systems
– Automatic discovery of data structure
– Obviates need to
  • Define logical and physical schema a priori, reducing time to value
  • Migrate schema when organization changes
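As a rough illustration of "automatic discovery of data structure", the sketch below infers attribute names, observed value types, and optionality from records that arrive without a declared schema. It is a toy under stated assumptions, not the approach used by Impliance or any of the projects named in this deck; `discover_schema` and the sample records are hypothetical.

```python
from collections import defaultdict


def discover_schema(records):
    """Infer, per attribute, the observed value types and whether it is optional."""
    seen = defaultdict(lambda: {"types": set(), "count": 0})
    for record in records:
        for name, value in record.items():
            seen[name]["types"].add(type(value).__name__)
            seen[name]["count"] += 1
    total = len(records)
    return {
        name: {
            "types": sorted(info["types"]),
            # An attribute missing from some records is treated as optional.
            "optional": info["count"] < total,
        }
        for name, info in seen.items()
    }


# Hypothetical records loaded with no schema declared up front.
records = [
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "Bob", "email": "bob@example.com"},
    {"id": "3", "name": "Cy"},
]

schema = discover_schema(records)
for attribute, info in schema.items():
    print(attribute, info)
```

Because the schema is derived from the data rather than declared a priori, a new attribute such as `email` simply appears in the result the next time discovery runs, with no migration step.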
Eliminate Admin Tasks (2 of 3): Universal Data Management
Today:
– Plethora of special-purpose data managers:
  • Databases for structured data
  • Content managers for semi-structured data
  • File systems for unstructured data
– For each, very different
  • User interfaces (SQL, JSR 170, file interface)
  • Degrees of semantic knowledge about the data’s contents
  • Degrees of searchability
  • Consistency semantics (e.g., transactions) when updated
  • Management capabilities and interfaces
Tomorrow: Single mechanism for managing all data
– Uniform interfaces for all types of data, for
  • Searching
  • Updating
  • Management
– Universal indexing (“Google model”) of all data – default search mechanism
  • Plus more precise searching for auto-discovered (above) structured information
– Obviates need to
  • Impose naming conventions to find desired data
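One way to picture "universal indexing as the default search mechanism, plus more precise searching on structured information" is the minimal sketch below. The `UniversalIndex` class, its methods, and the sample items are assumptions for illustration only; the slides do not describe an actual API.

```python
import re
from collections import defaultdict


class UniversalIndex:
    """Keyword index over every item, with optional attribute filters."""

    def __init__(self):
        self.items = {}
        self.keyword_index = defaultdict(set)

    def add(self, item_id, text="", attributes=None):
        attributes = attributes or {}
        self.items[item_id] = {"text": text, "attributes": attributes}
        # Index free text and attribute values alike, so keyword search
        # reaches structured rows as well as documents.
        values = " ".join(str(v) for v in attributes.values())
        for token in re.findall(r"\w+", f"{text} {values}".lower()):
            self.keyword_index[token].add(item_id)

    def search(self, keywords="", **attribute_filters):
        tokens = re.findall(r"\w+", keywords.lower())
        if tokens:
            hits = set.intersection(
                *(self.keyword_index.get(t, set()) for t in tokens)
            )
        else:
            hits = set(self.items)
        # Attribute filters give the more precise, structured search.
        return sorted(
            item_id
            for item_id in hits
            if all(
                self.items[item_id]["attributes"].get(name) == wanted
                for name, wanted in attribute_filters.items()
            )
        )


index = UniversalIndex()
index.add("email-1", "Invoice 42 is overdue, please pay", {"type": "email"})
index.add("row-7", attributes={"type": "invoice", "number": 42, "status": "overdue"})
index.add("doc-3", "Holiday schedule", {"type": "memo"})

print(index.search("overdue"))
print(index.search("overdue", type="invoice"))
```

The first query finds both the email and the table row through the same keyword interface; the second narrows the result using a structured attribute, without the caller needing to know where either item is stored.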
Eliminate Admin Tasks (3 of 3):
Robust storage mechanisms to eliminate need for backups
– Never throw out data – keep versions!
  • Update-in-place is an anachronism from days of expensive disk, increases complexity of transactions, and jeopardizes compliance requirements (Sarbanes-Oxley)
  • Versions permit queries “as of” some time
  • Exploits storage density increases (relative to number of disk arms)
– RAID provides local reliability
  • Widely accepted and deployed
  • Weaver Codes extend to multiple simultaneous failures
– How to provide universal reliability (i.e., against site disasters)?
  • Selective, automated replication of new versions?
  • Cross-site RAID?
Universal “Call Home” technology for remote management of
– Monitoring
– Problem determination
– Software maintenance & upgrades
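The "keep versions instead of updating in place" idea on this slide can be sketched as an append-only store that answers "as of" queries. This is a minimal illustration, assuming integer timestamps and a hypothetical `VersionedStore` class; it says nothing about how Impliance stores versions.

```python
import bisect


class VersionedStore:
    """Append-only store: every put adds a version, nothing is overwritten."""

    def __init__(self):
        self._versions = {}  # key -> (sorted timestamps, matching values)

    def put(self, key, value, timestamp):
        times, values = self._versions.setdefault(key, ([], []))
        if times and timestamp <= times[-1]:
            raise ValueError("timestamps must increase for each key")
        times.append(timestamp)
        values.append(value)

    def get(self, key, as_of=None):
        """Return the latest value, or the value visible at time `as_of`."""
        times, values = self._versions.get(key, ([], []))
        if not times:
            return None
        if as_of is None:
            return values[-1]
        # Number of versions written at or before `as_of`.
        position = bisect.bisect_right(times, as_of)
        return values[position - 1] if position else None


store = VersionedStore()
store.put("price", 10, timestamp=100)
store.put("price", 12, timestamp=200)

print(store.get("price"))             # current value
print(store.get("price", as_of=150))  # value as of time 150
```

Since old versions are never destroyed, a compliance question such as "what did this record say at time 150?" becomes an ordinary lookup rather than a restore from backup.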
Observation / Requirements
Information converging: Store / Search / Analyze ALL data
– Structured (traditional Data Base)
– Semi-structured (traditional Content Management, XML, multi-media, call center records)
– Unstructured (text)
– Same advanced functionality required
Data volume growing fast: On Demand strategy requires massive scale-out
– Lots of new data: RFID, email, photos, videos (Deep Internet-scale systems being built)
– Data is kept longer, due to compliance requirements
Total Cost of Ownership (TCO) is paramount: System simple & robust (not smart & fragile)
– People costs dominate TCO: Hardware often less than 50% of TCO
– Hence, sacrifice resource utilization for radical simplification
– Delivered in services or appliances
Today’s IM software based upon hardware 30 yrs ago: Need new software architecture
– Cheap CPUs, large storage, fast network in hardware
– Opportunity to radically rethink IM software architecture, based upon:
  • Hardware economics (e.g., cheap CPUs, storage, memory, & network)
  • Data: formats (e.g., XML, semi-structured data) and functionality required (e.g., unstructured search, analytics)
29 Impliance -- Information Management Appliance© 2006 IBM Corporation
Total Cost of Ownership is the Driver
Cost of management and administration is outpacing spending on new systems
[Figure: chart covering 1996 to 2008 with two series: cost of management and administration (10% CAGR) and new server spending (US$M, 3% CAGR). Axes: Spending (US$B), $0 to $160; Installed base (M Units), 5 to 35.]
Source: IDC, On-Demand Enterprises and Utility Computing: A Current Market Assessment and Outlook, IDC #31513, July 2004
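To see why the two growth rates on this chart diverge so sharply, the compounding can be worked through directly. The 10% and 3% rates and the 1996 to 2008 span come from the chart labels; the starting dollar values are not given here, so the sketch computes only growth multipliers. The `growth_factor` helper is a name introduced for illustration.

```python
def growth_factor(cagr, years):
    """Total multiplier after compounding at rate `cagr` for `years` years."""
    return (1 + cagr) ** years


YEARS = 2008 - 1996  # span of the chart's time axis

admin_factor = growth_factor(0.10, YEARS)   # management and administration
server_factor = growth_factor(0.03, YEARS)  # new server spending

print(f"admin cost multiplier:   {admin_factor:.2f}x")
print(f"server spend multiplier: {server_factor:.2f}x")
print(f"relative shift:          {admin_factor / server_factor:.2f}x")
```

Over twelve years, a 10% rate roughly triples the starting amount while a 3% rate adds less than half, so administration cost grows to more than twice its original size relative to server spending.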
Changing Characteristics of Data
[Figure: three classes of data (transactions and structured data; text and other human data; machine-generated and unstructured data), each rated on Heterogeneity, Actionability, and Scale.]
Examples:
– Seat on an airplane: Easy to find, structured
– LifeScience data (protein folding, gene expression): Difficult to analyze but we know where to look
– Satellite and surveillance data: An infinite space of "patterns"
Impliance
Impliance: A Highly-Scalable, Rich-Functional Information Management Appliance
A box with software pre-installed
– How delivered to enterprise: appliance or service
What functions?
– Store and manage all information
  • accept all types of enterprise data
– Deliver all intelligence
  • Integrate cross silo information
  • Advanced analytics with richer semantics
What properties?
– Low TCO
  • easy to deploy (“plug & play”)
  • simple and stable
– Scalability
  • From SMB to Very Large (PetaBytes)
  • (Not for high-end OLTP!)
[Figure: Data+Content+Digital Media. Data types and interfaces: Relational data (SQL), content (JCR), XML (XSLT), Web page (HTTP), Video, Archive (ILM), …, with a native retrieval interface and a native update/load interface.]