A scalable approach to processing large XML data volumes

Dr. Peter Fankhauser
Fraunhofer IPSI, Darmstadt
[email protected]

Dr. Tim Weitzel
Institute of IS, Frankfurt University
tim@xml-network.de

Dr. Thomas Tesch
Infonyte GmbH, Darmstadt
tesch@infonyte.de

we scale your XML

http://www.infonyte.de

„one half of the world uses XML...the other half has to“

• increasing XML penetration and data volumes
• document management, content management
• data and process integration
• deregulated electricity markets
• straight-through processing in stock trading („garage clearing“)

• challenge: develop scalable XML tools
• IETD (3.5 GB XML manuals)
• trading platform integration
• 40,000 transactions every hour
• 1 MB SWIFT = 10 MB swiftML = 100 MB RAM consumption

main memory as bottleneck

XML and main memory

• scalability is challenging even on huge systems, often not a relative problem
• try editing the 3.5 GB XML manual of a Boeing airplane with XML Spy

• reason: DOM implementations represent the entire DOM tree in main memory

• depending on the XML document and the DOM implementation, textual XML takes up to 20 times as much space in a main memory DOM

• analogous for XSLT: a 20 MB XML document requires 200-400 MB

• EDI example: SWIFT swiftML

• scalability problem: main memory restrictions, mobile devices, embedded systems

• many architectures don't require permanent XML storage but rather import data into an "XML warehouse" (complementary to relational systems) for subsequent processing (XSLT, XPath, XML Schema validation, aggregation, synchronization, retrieval, filter, format, transform)
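The blow-up figures above can be turned into a quick back-of-the-envelope check. The following Python sketch is only an illustration: the function names are hypothetical, the factors 10 and 20 are the ones implied by the "20 MB requires 200-400 MB" bullet, and the measurement uses Python's in-memory `xml.dom.minidom` as a stand-in for a main memory DOM (it is not one of the DOM implementations the slides refer to, and `tracemalloc` only sees Python-level allocations).

```python
import tracemalloc
from xml.dom import minidom


def estimate_dom_memory_mb(xml_size_mb, blowup_factor):
    """Rule-of-thumb estimate: textual XML size times a DOM blow-up factor."""
    if xml_size_mb < 0 or blowup_factor <= 0:
        raise ValueError("size must be >= 0 and factor must be > 0")
    return xml_size_mb * blowup_factor


def measure_dom_blowup(xml_text):
    """Ratio of Python-level DOM allocations to the size of the XML text."""
    text_bytes = len(xml_text.encode("utf-8"))
    tracemalloc.start()
    doc = minidom.parseString(xml_text)  # whole tree is built in main memory
    dom_bytes, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    doc.unlink()  # break reference cycles so the tree can be freed
    return dom_bytes / text_bytes


if __name__ == "__main__":
    # 20 MB document with the assumed factors 10 and 20
    print(estimate_dom_memory_mb(20, 10), estimate_dom_memory_mb(20, 20))
    # Hypothetical sample document, generated only for the measurement
    sample = "<cds>" + "<cd><title>t</title></cd>" * 1000 + "</cds>"
    print(round(measure_dom_blowup(sample), 1))
```

The measured ratio depends on the document shape and the Python version, so it should be read as an order-of-magnitude indication rather than a benchmark.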

XML processing

[Figure: XML processing with Infonyte across three stages (generation, production, deployment). Systems and formats shown: web server; PDF + print; wireless, PDA, eBook; XML message; SGML/XML; RDBMS; legacy; EDI; CMS/DMS; Infonyte; Infonyte CD-ROM. Operations shown: import, check-in, check-out, replace, reuse, search, assembly, validate, formatting, filtering, transformation, aggregation. The diagram embeds the Infonyte component stack: XML application; servlet, Java API, command line; XSLT, XQuery, XPath, XQL; W3C DOM API, collection API; algebraic query optimizer; Persistent DOM (PDOM); index manager; data server, I/O manager; PDOM file, RDBMS, paged I/O, main memory.]

IDB – Infonyte Data Base

IDB uses Persistent DOM (PDOM)
• result of more than 10 person-years of OO/XML database research at Germany's main think tank
• compact, binary, indexed XML format for representing DOM (directly processing well-formed XML)

basic elements of IDB:
• PDOM
• persistent XSLT processor (PXSLT)
• query engines for XPath, XQL
• document collection support
• XML workbench

PDOM

PDOM for storing and accessing XML documents according to W3C DOM API

• binary representation of XML instances, accessed using DOM Level 2 Interface

• also: structural indices for reconstructing document sequence and increasing query performance; PDOM engine for optimizing allocation of XML documents between main and secondary memory

• PDOM can store up to 2^30 XML nodes or 1 Terabyte XML
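To make "accessed using the DOM interface" concrete, here is a minimal sketch of DOM-style access in Python. It uses the standard library's `xml.dom.minidom`, which keeps the tree in main memory and is therefore not PDOM; the point is only that application code written against W3C DOM calls such as `getElementsByTagName` does not need to know how the nodes are stored. The sample document, element names, and helper functions are made up for illustration.

```python
from xml.dom import minidom

# Hypothetical sample data, not taken from FreeDB
SAMPLE = """<cds>
  <cd><artist>Artist A</artist><title>First Album</title></cd>
  <cd><artist>Artist B</artist><title>Second Album</title></cd>
  <cd><artist>Artist A</artist><title>Third Album</title></cd>
</cds>"""


def text_of(node):
    """Concatenate the text children of a DOM node."""
    return "".join(
        child.data for child in node.childNodes
        if child.nodeType == child.TEXT_NODE
    )


def titles_by_artist(doc, artist):
    """Return the titles of all <cd> elements whose <artist> matches."""
    titles = []
    for cd in doc.getElementsByTagName("cd"):
        artists = cd.getElementsByTagName("artist")
        cd_titles = cd.getElementsByTagName("title")
        if artists and cd_titles and text_of(artists[0]) == artist:
            titles.append(text_of(cd_titles[0]))
    return titles


if __name__ == "__main__":
    document = minidom.parseString(SAMPLE)
    print(titles_by_artist(document, "Artist A"))
```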

Architecture

• modular (e.g. use parts of IDB as highly scalable XML backend for J2EE conforming IBM WebSphere Application Server)

• PDOM

• IDB components 400-800 KB code size, require 16 MB RAM

• access system via command line, web server, or Java interfaces

• can use schema-less XML

• all index and storage structures are derived from the XML instance; no need to define mappings to physical data models (as in relational systems and some XML databases)
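As an illustration of deriving an index purely from the XML instance, the following Python sketch builds a simple path index (element path to matching elements) without any schema or mapping. This is an assumed, simplified structure for explanation only; the slides do not describe the actual layout of Infonyte's structural indices, and `build_path_index` is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_path_index(root):
    """Map each element path (e.g. '/db/cd/title') to its elements,
    in document order, using only what the instance itself contains."""
    index = defaultdict(list)

    def visit(elem, parent_path):
        path = f"{parent_path}/{elem.tag}"
        index[path].append(elem)
        for child in elem:
            visit(child, path)

    visit(root, "")
    return dict(index)


if __name__ == "__main__":
    # Hypothetical instance; note that <year> occurs in only one record
    root = ET.fromstring(
        "<db><cd><title>A</title></cd>"
        "<cd><title>B</title><year>2002</year></cd></db>"
    )
    idx = build_path_index(root)
    print(sorted(idx))
    print([e.text for e in idx["/db/cd/title"]])
```

Because the paths are discovered while walking the document, irregular records (such as the optional `year` element here) need no prior declaration.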

IDB component architecture

[Figure: IDB component stack. Labels shown: XML application; servlet, Java API, command line; XSLT, XQuery, XPath, XQL; W3C DOM API, collection API; algebraic query optimizer; Persistent DOM (PDOM); index manager; data server, I/O manager; PDOM file, RDBMS, paged I/O, main memory.]

Performance

test using an XML-ified version of the freely available freeDB CD database (FreeDB 2002)

• FreeDB consists of about 500,000 CD descriptions
• XML version about 500 MB

On a standard PC (1.8 GHz, 512 MB RAM)
• parsing and PDOM creation (32 million XML nodes, 400 MB) including all structural indices takes about 4 minutes (~2 MB/s)
• generating a user-defined index for all CD keys (indexes 548,000 nodes or 1.7% of the entire database) takes about 88 seconds
• generating a full-text index (28 million nodes, 89% of the entire database) takes 17 minutes, resulting in an index size of 90 MB
• XSLT processing (generating HTML) reaches a throughput of up to 10 MB per second
• when searching for CDs with particular titles or tracks using the full-text index, first results are delivered within 5-10 milliseconds; subsequent hits arrive similarly fast
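Why a full-text index makes such lookups fast can be shown with a toy inverted index: each word maps to the set of records containing it, so a query only touches the posting sets of its terms instead of scanning the data. The Python sketch below is a generic textbook structure, not Infonyte's index format; the record ids, texts, and function names are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_fulltext_index(records):
    """Build word -> set of record ids from a dict of id -> text."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for word in tokenize(text):
            index[word].add(record_id)
    return index


def search_all(index, *terms):
    """Boolean AND: ids of records that contain every search term."""
    postings = [index.get(term.lower(), set()) for term in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


if __name__ == "__main__":
    # Made-up CD descriptions for illustration only
    cds = {
        "cd1": "Bowie live at the BBC",
        "cd2": "Bowie studio album",
        "cd3": "BBC radio sessions",
    }
    idx = build_fulltext_index(cds)
    print(search_all(idx, "bowie", "bbc"))
```

A combined query such as "bowie" and "bbc" reduces to one set intersection, which is why response time is largely independent of the total data volume once the index exists.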

Search results for “bowie” on “bbc”

Scalability

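[Figure: memory consumption and CPU time for processing 524,288 elements, PDOM versus main memory DOM. Test setup: PDOM 1.3.8, Apache Xalan 2.0.1, JDK 1.3.0_02. Values shown: 95 MB, 125 MB, 356 MB, 1026 MB; 5:13 min, 13:48 min. Status labels shown: "Parser done", "Low Mem Error 0x8007000E".]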

Applications I

XML Warehouse
• business process integration

• gathering data from different information systems into one common XML representation

• all data then reformatted, e.g. for publishing on a web server, using XSLT or XQL/XPath commands

• huge US-based financial information and service provider
• based on IDB, an application was developed for individualized messaging and for feeding a web portal that allows customers to get their individual transaction data in real time

• the Infonyte system receives 10 GB of raw XML data every day, indexes it, and makes it available for ten days

• significant savings from processing these large amounts of data directly, combined with access times in the millisecond range
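The warehouse pattern described above (bring data from several systems into one common XML representation, then query it) can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the element and attribute names, the two source systems, and the sample rows are invented, and the standard library's `ElementTree` with its limited XPath subset stands in for the XSLT and XQL/XPath processing named on the slide.

```python
import xml.etree.ElementTree as ET


def to_common_xml(source_name, rows):
    """Convert rows (dicts) from one source system into a common XML batch."""
    batch = ET.Element("transactions", {"source": source_name})
    for row in rows:
        tx = ET.SubElement(batch, "transaction", {"customer": str(row["customer"])})
        ET.SubElement(tx, "amount").text = str(row["amount"])
    return batch


def build_warehouse(batches):
    """Collect all batches under a single warehouse root element."""
    warehouse = ET.Element("warehouse")
    warehouse.extend(batches)
    return warehouse


def transactions_for(warehouse, customer):
    """Select one customer's transactions with an XPath-style predicate."""
    return warehouse.findall(f".//transaction[@customer='{customer}']")


if __name__ == "__main__":
    # Hypothetical rows from two hypothetical source systems
    trading = to_common_xml("trading", [{"customer": "c1", "amount": 100}])
    billing = to_common_xml("billing", [{"customer": "c1", "amount": 25},
                                        {"customer": "c2", "amount": 40}])
    wh = build_warehouse([trading, billing])
    print([tx.find("amount").text for tx in transactions_for(wh, "c1")])
```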

Applications II

Interactive Electronic Technical Documentation (IETD)
• aviation industry with a long SGML history, now many systems as browser-based XML applications
• main challenge: designing a distributed authoring environment with a centralized data repository and an efficient production process for compiling and formatting electronic manuals for different user groups

• Sikorsky Aircraft Corporation
• XML-IETD system based on Infonyte
• IDB used for the production process as well as for providing the documents via a web server
• production: the Infonyte XSLT processor is the key element for demand-driven compilation of large XML data volumes
• subsequent usage of the technical manuals in a reading environment: Infonyte is used as a client-side tool that enables XML query languages to retrieve relevant document fragments

• the architecture helped Sikorsky realize substantial cost and service improvements

Applications III

Mobile Information Management
• challenge

• low memory consumption, platform independence via Java, and the compact PDOM format make Infonyte the ideal XML-based mobile application kernel

• Mobile Sales Force Automation
• US-based Vaultus (http://www.vaultus.com) used Infonyte technology as the foundation of their mobile information platform. In addition to data management, the system offers offline capabilities, secure transactions, network independence, and remote maintenance services

Performance

Performance of IDB on mobile devices

• developed a mobile demo scenario using the full freeDB

• with a limited version consisting only of the data server, the PDOM, and the index and collection APIs (about 300 KB in total), the full FreeDB demo runs on a Pocket PC (iPAQ Pocket PC H3800 with 64 MB RAM, 32 MB ROM, 206 MHz ARM processor, 1 GB IBM Microdrive, Personal Java 1.2 Insignia Jeode)

• using the indices, response time for Boolean search on this limited platform is 1-2 seconds; searching for a single criterion is even faster

Performance: an EDI example

[Figure: EDI processing chain with three stages (source, production, destination). Message formats shown: SWIFT, FIX, SWIFTML, FpML, EDI. Other labels shown: web; PDF + print; XML message; PDOM; PDOM CD-ROM. Operations shown: import, check-in, check-out, replace, reuse, search, assembly, validate, formatting, filtering, transformation, aggregation. The diagram embeds the IDB component stack: XML application; servlet, Java API, command line; XSLT, XQuery, XPath, XQL; W3C DOM API, collection API; algebraic query optimizer; Persistent DOM (PDOM); index manager; data server, I/O manager; PDOM file, RDBMS, paged I/O, main memory.]

SWIFT2XML

• processing SWIFT messages with XML

• SWIFT to XML
• developed parser
• fully XML-ified (i.e. no information loss)
• generic XML multi-step optimization of the process chain, trading off bandwidth against document construction time (multiple calculations like PDOM creation and full-text index)

• XML processing
• processing of well-formed XML
• storage as PDOM
• access using full-text indices and data indices
• visualization using XSLT, integration with web server

                                 SWIFT                    XML                     PDOM           full-text index
data volume                      100 MB                   430 MB                  200 MB         40 MB
compression (compressed size)    92% (8 MB)               97% (12.9 MB)           73% (54 MB)    69% (12.4 MB)
transfer and parsing (10 MB/s)   ~12 min (+7 min)         ~7 min (+7 min)         6 sec + ~2 sec (PDOM and index together)
transfer and parsing (2 MB/s)    4 s + ~12 min (+7 min)   6 s + ~7 min (+7 min)   33 sec + ~7 sec (PDOM and index together)
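The size and transfer figures in the table can be rechecked with simple arithmetic. The Python sketch below recomputes the compressed sizes from the compression percentages and the pure transfer times from the two bandwidths; the function names are hypothetical. That the last transfer value covers PDOM and full-text index together is an inference from the numbers (54 MB + 12.4 MB = 66.4 MB, which takes about 33 seconds at 2 MB/s), not something the slide states. Parsing and index construction times are not modeled.

```python
def compressed_mb(volume_mb, compression_pct):
    """Size after compression, given the percentage saved."""
    return volume_mb * (100 - compression_pct) / 100


def transfer_seconds(size_mb, bandwidth_mb_per_s):
    """Pure transfer time, ignoring parsing and index construction."""
    return size_mb / bandwidth_mb_per_s


if __name__ == "__main__":
    # (data volume in MB, compression in percent), as given in the table
    formats = {
        "SWIFT": (100, 92),
        "XML": (430, 97),
        "PDOM": (200, 73),
        "full-text index": (40, 69),
    }
    sizes = {name: compressed_mb(v, p) for name, (v, p) in formats.items()}
    for name, size in sizes.items():
        print(f"{name}: {size:.1f} MB compressed")

    pdom_plus_index = sizes["PDOM"] + sizes["full-text index"]
    for bandwidth in (10, 2):
        print(
            f"{bandwidth} MB/s:",
            f"SWIFT {transfer_seconds(sizes['SWIFT'], bandwidth):.1f} s,",
            f"XML {transfer_seconds(sizes['XML'], bandwidth):.1f} s,",
            f"PDOM + index {transfer_seconds(pdom_plus_index, bandwidth):.1f} s",
        )
```

The comparison shows the trade-off named on the slide: shipping the prebuilt PDOM and index costs more bandwidth than shipping compressed SWIFT or XML, but avoids minutes of document construction at the receiving end.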

[email protected]

download IDB, FreeDB etc.: www.infonyte.com

papers etc.: http://tim.weitzel.com