A scalable approach to processing large XML data volumes
Dr. Peter Fankhauser
Fraunhofer IPSI, Darmstadt
Dr. Tim Weitzel
Institute of IS, Frankfurt University
Dr. Thomas Tesch
Infonyte GmbH, Darmstadt
we scale your XML
http://www.infonyte.de
"one half of the world uses XML ... the other half has to"
• increasing XML penetration and data volumes
• document management, content management
• data and process integration
• deregulated electricity markets
• straight-through processing in stock trading ("garage clearing")
• challenge: develop scalable XML tools
• IETD (3.5 GB of XML manuals)
• trading platform integration
• 40,000 transactions every hour
• 1 MB SWIFT = 10 MB swiftML = 100 MB RAM consumption
main memory as bottleneck
XML and main memory
• scalability is challenging even on huge systems; it is often not a relative problem
• try editing the 3.5 GB XML manual of a Boeing airplane with XML Spy
• reason: DOM implementations represent the entire DOM tree in main memory
• depending on the XML document and the DOM implementation, textual XML becomes up to 20 times as big in a main-memory DOM
• analogous for XSLT: a 20 MB XML document requires 200-400 MB
• EDI example: SWIFT swiftML
• scalability problem: main memory restrictions, mobile devices, embedded systems
• many architectures don't require permanent XML storage but rather import data into an "XML warehouse" (complementary to relational systems) for subsequent processing (XSLT, XPath, XML Schema validation, aggregation, synchronization, retrieval, filter, format, transform)
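The memory problem described on this slide comes from materializing the whole tree at once. As a rough illustration of the alternative, the following Python sketch processes one record at a time and discards it afterwards, using the standard library's `xml.etree.ElementTree.iterparse`. It is not Infonyte's PDOM; the `cd` and `title` element names and the tiny sample document are assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET

def count_titles_streaming(source, record_tag="cd"):
    """Count <title> children of each record without keeping full records in memory."""
    count = 0
    # "end" events fire once an element and all of its children have been parsed.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            count += len(elem.findall("title"))
            elem.clear()  # drop the record's children and text once handled
    return count

# Hypothetical sample; a real input would be a large file opened in binary mode.
sample = b"<cds><cd><title>A</title></cd><cd><title>B</title></cd></cds>"
print(count_titles_streaming(io.BytesIO(sample)))
```

Streaming keeps memory low but gives up random access to the document, which is the gap a persistent, indexed DOM is meant to close.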
XML processing
[Figure: XML processing with Infonyte. Labels recovered from the diagram: stages Generation, Production, Deployment; systems SGML/XML, RDBMS, Legacy, EDI, CMS/DMS, Text; operations Import, Checkin, Checkout, Replace, Reuse, Search, Assembly, Validate, Formatting, Filtering, Transformation, Aggregation; outputs Web Server, PDF + Print, Wireless/PDA/eBook, XML Message, Infonyte CD-ROM. Inset: Infonyte component stack (XML Application; Servlet, Java API, Command Line; W3C DOM API, Collection API; XQuery, XPath, XQL, XSLT; Algebraic Query Optimizer; Persistent DOM (PDOM); Index Manager; Dataserver I/O Manager; PDOM File, RDBMS, Paged I/O, Main Memory).]
IDB – Infonyte Data Base
IDB uses Persistent DOM (PDOM)
• result of more than 10 person-years of OO/XML database research at Germany's main think tank
• compact, binary, indexed XML format for representing DOM (directly processing well-formed XML)
Basic elements of IDB:
• PDOM
• persistent XSLT processor (PXSLT)
• query engines for XPath, XQL
• document collection support
• XML workbench
PDOM
PDOM for storing and accessing XML documents according to W3C DOM API
• binary representation of XML instances, accessed using DOM Level 2 Interface
• also: structural indices for reconstructing document sequence and increasing query performance; PDOM engine for optimizing allocation of XML documents between main and secondary memory
• PDOM can store up to 2^30 XML nodes or 1 Terabyte XML
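Because PDOM is accessed through the W3C DOM interface, application code should look the same as code written against a main-memory DOM. The sketch below shows that style of access with Python's `xml.dom.minidom` standing in for a DOM implementation. PDOM itself is a Java component and is not used here; the sample document and the `titles_by_id` helper are hypothetical.

```python
from xml.dom import minidom

# Hypothetical sample document, parsed into an ordinary in-memory DOM.
doc = minidom.parseString(
    '<cds><cd id="1"><title>Blue</title></cd><cd id="2"><title>Red</title></cd></cds>'
)

def titles_by_id(document):
    """Map each cd's id attribute to the text of its first <title> descendant."""
    result = {}
    for cd in document.getElementsByTagName("cd"):
        title = cd.getElementsByTagName("title")[0]
        result[cd.getAttribute("id")] = title.firstChild.data
    return result

print(titles_by_id(doc))
```

Only standard DOM calls (`getElementsByTagName`, `getAttribute`, `firstChild`) appear in the helper, so the storage behind the interface can change without touching application logic.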
Architecture
• modular (e.g. use parts of IDB as a highly scalable XML backend for the J2EE-conforming IBM WebSphere Application Server)
• PDOM
• IDB components 400-800 KB code size, require 16 MB RAM
• access the system via command line, web server, or Java interfaces
• can use schema-less XML
• all index and storage structures are derived from the XML instance; there is no need to define mappings onto physical data models (as in relational systems and some XML databases)
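To illustrate what "derived from the XML instance" can mean, here is a minimal Python sketch that builds a structural path index directly from a schema-less document, with no mapping defined up front. It is an illustration of the idea only, not IDB's actual index structure; `build_path_index` and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_path_index(root):
    """Map each element path (e.g. 'cds/cd/title') to the elements found there."""
    index = defaultdict(list)

    def walk(elem, prefix):
        path = f"{prefix}/{elem.tag}" if prefix else elem.tag
        index[path].append(elem)
        for child in elem:
            walk(child, path)

    walk(root, "")
    return index

# Hypothetical schema-less input: the second cd has no <track>.
root = ET.fromstring(
    "<cds><cd><title>Blue</title><track>One</track></cd><cd><title>Red</title></cd></cds>"
)
index = build_path_index(root)
print(sorted(index))
print([e.text for e in index["cds/cd/title"]])
```

The set of paths is discovered while walking the document, so irregular records simply contribute fewer entries instead of violating a predefined model.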
IDB component architecture
[Figure: IDB component stack. Labels recovered from the diagram: XML Application; Servlet, Java API, Command Line; W3C DOM API, Collection API; XQuery, XPath, XQL, XSLT; Algebraic Query Optimizer; Persistent DOM (PDOM); Index Manager; Dataserver I/O Manager; PDOM File, RDBMS, Paged I/O, Main Memory.]
Performance
Test using an XML-ified version of the freely available freeDB CD database (FreeDB 2002)
• FreeDB consists of about 500,000 CD descriptions
• XML version about 500 MB
On a standard PC (1.8 GHz, 512 MB RAM)
• parsing and PDOM creation (32 million XML nodes, 400 MB) including all structural indices takes about 4 minutes (~2 MB/s)
• generating a user-defined index for all CD keys (indexes 548,000 nodes or 1.7% of the entire database) takes about 88 seconds
• generating a full-text index (28 million nodes, 89% of the entire database) takes 17 minutes, resulting in an index size of 90 MB
• XSLT processing (generating HTML) reaches a throughput of up to 10 MB per second
• when searching for CDs with particular titles or tracks using the full-text index, first results are delivered within 5-10 milliseconds; subsequent hits arrive similarly fast
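Fast lookups like the title and track searches above are typical of an inverted index, where each word maps to the nodes containing it. The following Python sketch shows that general technique with a Boolean AND search; it is not the Infonyte index format, and the node ids, sample texts, and function names are hypothetical.

```python
import re
from collections import defaultdict

def build_fulltext_index(texts):
    """Map each lowercase word to the set of node ids whose text contains it."""
    index = defaultdict(set)
    for node_id, text in texts.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(node_id)
    return index

def search_all(index, *words):
    """Return ids of nodes containing every given word (Boolean AND)."""
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()

# Hypothetical node texts keyed by node id.
texts = {1: "Kind of Blue", 2: "Blue Train", 3: "Giant Steps"}
index = build_fulltext_index(texts)
print(sorted(search_all(index, "blue")))
print(sorted(search_all(index, "blue", "train")))
```

A query touches only the postings of the words it names, so lookup cost depends on the number of hits rather than on the size of the database.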
Scalability
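[Figure: scalability comparison of PDOM 1.3.8 and a main-memory DOM (Apache Xalan 2.0.1, JDK 1.3.0_02) on a document with 524,288 elements. Labels recovered from the chart: CPU; "Parser done"; low-memory error 0x8007000E; memory values 95 MB, 125 MB, 356 MB, 1026 MB; run times 5:13 min and 13:48 min.]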
Applications I
XML Warehouse
• business process integration
• congregating data from different information systems into one common XML representation
• all data is then reformatted, e.g. for publishing on a web server, using XSLT or XQL/XPath commands
• huge US-based financial information and service provider
• based on IDB, an application was developed for individualized messaging and for feeding a web portal that allows customers to get their individual transaction data in real time
• the Infonyte system receives 10 GB of raw XML data every day, indexes it, and makes it available for ten days
• significant savings by straightforwardly processing these large amounts of data, along with access times in the millisecond range
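The warehouse pattern on this slide (collect data from several systems into one XML representation, then query it per customer) can be sketched in a few lines of Python with the standard `xml.etree.ElementTree` module. This is an illustration under assumptions, not the deployed application: the source names, the `warehouse`/`transaction`/`amount` elements, and the sample records are all hypothetical.

```python
import xml.etree.ElementTree as ET

def to_common_xml(records_by_source):
    """Merge records from several sources into one <warehouse> document."""
    warehouse = ET.Element("warehouse")
    for source, records in records_by_source.items():
        for rec in records:
            tx = ET.SubElement(
                warehouse, "transaction", source=source, customer=rec["customer"]
            )
            ET.SubElement(tx, "amount").text = str(rec["amount"])
    return warehouse

# Hypothetical input from two information systems.
data = {
    "trading": [{"customer": "c1", "amount": 100}],
    "billing": [{"customer": "c1", "amount": 25}, {"customer": "c2", "amount": 40}],
}
warehouse = to_common_xml(data)

# ElementTree supports a limited XPath subset, including attribute predicates.
amounts = [
    int(a.text) for a in warehouse.findall("./transaction[@customer='c1']/amount")
]
print(amounts)
```

Once everything shares one representation, a per-customer view becomes a single path expression instead of one query per source system.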
Applications II
Interactive Electronic Technical Documentation (IETD)
• aviation industry with a long SGML history, now many systems as browser-based XML applications
• main challenge: designing a distributed authoring environment with a centralized data repository and an efficient production process for compiling and formatting electronic manuals for different user groups
• Sikorsky Aircraft Corporation
• XML-IETD system based on Infonyte
• IDB is used for the production process as well as for providing the documents via a web server
• production: the Infonyte XSLT processor is the key element for demand-driven compilation of large XML data volumes
• subsequent usage of the technical manuals in a reading environment: Infonyte is used as a client-side tool that enables XML query languages to retrieve relevant document fragments
• the architecture helped Sikorsky realize substantial cost and service improvements
Applications III
Mobile Information Management
• challenge
• low memory consumption, platform independence through Java, and the compact PDOM format make Infonyte the ideal XML-based mobile application kernel
• Mobile Sales Force Automation
• US-based Vaultus (http://www.vaultus.com) used Infonyte technology as the foundation of their mobile information platform. In addition to data management, the system offers offline capabilities, secure transactions, network independence, and remote maintenance services
Performance
Performance of IDB on mobile devices
• developed mobile demo scenario using the full freeDB
• with a limited version consisting only of the data server, the PDOM, and the index and collection APIs (about 300 KB in all), the full FreeDB demo runs on a Pocket PC (iPAQ Pocket PC H3800 with 64 MB RAM, 32 MB ROM, 206 MHz ARM processor, 1 GB IBM Microdrive, Personal Java 1.2, Insignia Jeode)
• using the indices, the response time for a Boolean search on this limited platform is 1-2 seconds; searching for a single criterion is even faster
Performance: an EDI example
[Figure: EDI processing chain with Infonyte. Labels recovered from the diagram: stages Source, Production, Destination; formats SWIFT, FIX, SWIFTML, FpML, EDI; operations Import, Checkin, Checkout, Replace, Reuse, Search, Assembly, Validate, Formatting, Filtering, Transformation, Aggregation; outputs Web, PDF + Print, XML Message, EDI, PDOM, PDOM CD-ROM. Inset: IDB component stack (XML Application; Servlet, Java API, Command Line; W3C DOM API, Collection API; XQuery, XPath, XQL, XSLT; Algebraic Query Optimizer; Persistent DOM (PDOM); Index Manager; Dataserver I/O Manager; PDOM File, RDBMS, Paged I/O, Main Memory).]
SWIFT2XML
• processing SWIFT messages with XML
• SWIFT to XML
• developed parser
• fully XML-ified (i.e. no information loss)
• generic XML multi-step optimization of the process chain, trading off bandwidth against document construction time (multiple calculations like PDOM creation and full-text index)
• XML processing
• processing of well-formed XML
• storage as PDOM
• access using full-text indices and data indices
• visualization using XSLT, integration with web server
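As a rough idea of what a SWIFT-to-XML step can look like, the Python sketch below turns `:tag:value` lines, the field syntax found in the text block of SWIFT MT messages, into XML elements. It is not the parser described on this slide and does not produce swiftML: the `message` and `field` element names, the regular expression, and the sample message are assumptions for illustration only.

```python
import re
import xml.etree.ElementTree as ET

# Assumed field syntax: ':' + two digits + optional letter + ':' + value.
FIELD = re.compile(r"^:(\d{2}[A-Z]?):(.*)$")

def swift_fields_to_xml(text):
    """Turn ':tag:value' lines into <field> elements; other lines continue the previous field."""
    message = ET.Element("message")
    current = None
    for line in text.splitlines():
        match = FIELD.match(line)
        if match:
            current = ET.SubElement(message, "field", {"tag": match.group(1)})
            current.text = match.group(2)
        elif current is not None:
            current.text += "\n" + line  # continuation line of the previous field
    return message

# Hypothetical sample with one multi-line field.
sample = ":20:REF123\n:32A:020101EUR1000,\n:59:/12345\nJOHN DOE"
message_xml = swift_fields_to_xml(sample)
print(ET.tostring(message_xml, encoding="unicode"))
```

Keeping each field's tag and raw value, including continuation lines, is one simple way to aim for a conversion without information loss; a real converter would also have to handle the message header blocks and field-specific substructure.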
                                 SWIFT                    XML                     PDOM          full-text index
data volume                      100 MB                   430 MB                  200 MB        40 MB
compressed size (reduction)      8 MB (92%)               12.9 MB (97%)           54 MB (73%)   12.4 MB (69%)
transfer and parsing (10 MB/s)   ~12 min (+7 min)         ~7 min (+7 min)         6 sec + ~2 sec
transfer and parsing (2 MB/s)    4 s + ~12 min (+7 min)   6 s + ~7 min (+7 min)   33 sec + ~7 sec
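The figures in this table can be cross-checked with a little arithmetic. The Python sketch below recomputes the compressed sizes from the stated reductions and the pure transfer times at both line speeds. One assumption is made and labeled in the code: the third value in each transfer row is taken to cover the compressed PDOM and full-text index together, since their combined size matches the stated 33 seconds at 2 MB/s.

```python
def compressed_mb(size_mb, reduction_pct):
    """Size remaining after compression that removes reduction_pct percent."""
    return round(size_mb * (100 - reduction_pct) / 100, 1)

def transfer_seconds(size_mb, rate_mb_per_s):
    """Pure transfer time, ignoring parsing and index construction."""
    return size_mb / rate_mb_per_s

# (uncompressed MB, reduction in percent) as stated in the table
stages = {
    "SWIFT": (100, 92),
    "XML": (430, 97),
    "PDOM": (200, 73),
    "full-text index": (40, 69),
}
sizes = {name: compressed_mb(mb, pct) for name, (mb, pct) in stages.items()}
print(sizes)

# Assumption: PDOM and full-text index are shipped together in the third case.
pdom_plus_index = sizes["PDOM"] + sizes["full-text index"]
for rate in (10, 2):
    print(
        rate,
        transfer_seconds(sizes["SWIFT"], rate),
        transfer_seconds(sizes["XML"], rate),
        round(transfer_seconds(pdom_plus_index, rate), 1),
    )
```

The recomputed sizes are 8, 12.9, 54, and 12.4 MB, and at 2 MB/s the transfers take about 4, 6, and 33 seconds, in line with the table. The trade-off is visible in the totals: shipping the prebuilt PDOM and index costs more bandwidth but avoids minutes of construction time at the receiver.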