Innovations in Database Technology IRMAC BI/DW SIG May 28, 2009


Innovations in Database Technology

IRMAC BI/DW SIG May 28, 2009


Agenda

About Infobright

Data Warehousing Challenge

Use Cases

Infobright Approach

Infobright Architecture

Infobright Versions & System Requirements


About Infobright


Founded 2006

Headquarters Toronto, Canada; offices in Boston, MA and Warsaw, Poland

The Infobright Data Warehouse

• Simplicity: No new schemas, no indices, no data partitioning, easy to maintain
• Scalability: Designed for rapidly growing volumes. Ideal for up to 30 TB
• Low TCO: Industry-leading compression, less storage, industry-standard servers, low software costs, minimal ongoing operational expenses

The Open Source Solution

• Community (open source) and Enterprise Editions are available

MySQL Integration

• Leverages MySQL connectivity to ETL and BI
• Provides MySQL customers with a scalable, enterprise-ready data warehouse
• MySQL/Sun Microsystems invests in Infobright (Sept 15, 2008)


Data Warehousing Challenge


Data Warehousing Challenges


Traditional Data Warehousing

Labor intensive, heavy indexing and partitioning

Hardware intensive: massive storage; big servers

Expensive and complex

More Data, More Data Sources

More Kinds of Output Needed by More Users, More Quickly

Limited Resources and Budget

[Figure: streams of binary data arriving from three kinds of sources: Real time data, Multiple databases, External Sources]


Data Warehousing – Raising The Bar

Early Data Warehouse Characteristics:
• Integration of internal systems
• Monthly and weekly loads
• Heavy use of aggregates

Data Warehousing Matures:
• Near real time updates
• Integration with master data management
• Data mining using discrete business transactions
• Provision of data for business critical applications

New Demands:
• Larger transaction volumes driven by the internet
• Impact of Cloud Computing
• More -> Faster -> Cheaper


Use Cases


Use Cases

Infobright is a good fit for:
• Loading millions of transactions with a limited batch window
• Summarizing transactional data for trend analysis
• Extracting transactional detail based on specific constraints
• Ad hoc query support across many dimensional attributes

Avoid using Infobright for:
• Real-time transactional updates (operational data entry)
• Full data extracts (select * from …)
• Row based operations that need to access all columns of a table; these are typically better suited to row based databases


Customer Experience – Load Speed

Business Requirement
• Mavenir - OEM customer deploying a worldwide telco application
• Application provides operators with access to detailed SMS traffic
• Needed a low cost solution with the ability to load 20K records per second
• Peak of 70M messages per hour during Chinese New Year

Solution
• Custom front end developed using MySQL JDBC driver
• Completed design, test, deployment in < 3 months with no assistance from Infobright
• Allowed for expansion from 7 to 90 days of online SMS history
• Supports plan for 70% annual growth
• Rollout to allow for 120 concurrent users
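As a rough sanity check on the figures in this case study, the stated peak of 70M messages per hour can be converted to a per-second rate and compared with the 20K records per second load requirement. This is plain arithmetic on the two numbers from the slide; the variable names are illustrative only.

```python
# Convert the stated peak hourly volume into a per-second rate.
PEAK_MESSAGES_PER_HOUR = 70_000_000  # peak figure quoted on the slide
REQUIRED_LOAD_RATE = 20_000          # records per second, quoted on the slide

peak_per_second = PEAK_MESSAGES_PER_HOUR / 3600
headroom = REQUIRED_LOAD_RATE - peak_per_second

print(f"peak rate: {peak_per_second:,.0f} records/s")
print(f"headroom at 20K records/s: {headroom:,.0f} records/s")
```

The peak works out to a little under 20K records per second, which is consistent with the load requirement quoted for the solution.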


Customer Experience – Query Performance

Business Requirement
• Sulake - Online social networking service with 126M users across 31 countries
• 990M page impressions per month
• Need to quickly analyze online spend on a daily basis to enhance online experience and drive additional revenue
• Existing InnoDB solution was unable to process business queries in a reasonable time frame (queries taking hours to complete)
• Business opportunities were being lost due to inability to analyze subscriber behavior using transactions

Solution
• Customer used existing data model and deployed the application using Business Objects: Data Integrator for ETL, Web Intelligence for BI
• Existing ETL workflows were converted to Infobright in less than 4 weeks without assistance
• Historically long running queries (hours) now running in minutes and seconds
• Additional benefits due to compression were a reduced need for disk storage and an overall reduction in I/O and network traffic


Customer Experience - TCO

Business Requirement
• A global provider of electronic trading solutions across 22 time zones and 700 financial exchanges
• Wanted to expand analytical access to financial transactions to include both current (30 days) and archived transactions (4 years)
• Expansion of existing Sybase solution was too costly

Solution
• Infobright was able to achieve performance benchmarks within the first 3 days of a proof of concept using production data
• 28,000 records per second load speed
• Join of a 100M row table with a 30M row table -> 400k rows, returned in 185 seconds
• Additional queries that did not complete using Sybase finished in minutes using Infobright
• Final solution deployed using Pentaho Kettle for ETL and Crystal Reports for BI
• Success with modest data size (150GB) has opened opportunities for additional, more detailed transactional analysis


Customer Experience – Query Performance and TCO

Business Requirement
• TradeDoubler - Based in Sweden, a global digital marketing company serving 1600+ online advertisers across Europe and Asia
• TradeDoubler optimizes Web marketing campaigns by analyzing Web clicks, impressions and purchases
• Analyzing terabytes of data about the results of its programs is central to the company's success
• Selected Infobright for rapid analytical results, seamless interoperability with their MySQL database, and low TCO

Solution
• Deployed solution using a single, $12,500 Dell server with 8 CPU cores and 16 GB RAM
• Used Pentaho Kettle for ETL and Jaspersoft Server Pro Reports for BI
• Needed to process and analyze data for 20 billion online transactions/month
• In POC, loaded > 3.2 billion rows at > 300,000 rows / second
• In production, achieved 30x data compression
• Extremely fast query speed: 3 queries that previously did not return now returned within a minute


Infobright Approach


Introducing Infobright

Column Orientation

Data Packs – data stored in manageably sized, highly compressed data packs
• Data compressed using algorithms tailored to data type

Knowledge Grid – statistics and metadata "describing" the super-compressed data

Smarter architecture
• Load data and go
• No indices or partitions to build and maintain
• Knowledge Grid automatically updated as data packs are created or updated
• Super-compact data footprint can leverage off-the-shelf hardware


Column vs. Row-Oriented

EMP_ID | FNAME | LNAME  | SALARY
1      | Moe   | Howard | 10000
2      | Curly | Joe    | 12000
3      | Larry | Fine   | 9000

Row Oriented (1,Moe,Howard,10000; 2,Curly,Joe,12000; 3,Larry,Fine,9000;)
• Works well if all the columns are needed for every query
• Efficient for transactional processing if all the data for the row is available

Column Oriented (1,2,3; Moe,Curly,Larry; Howard,Joe,Fine; 10000,12000,9000;)
• Works well with aggregate results (sum, count, avg.)
• Only columns that are relevant need to be touched
• Consistent performance with any database design
• Allows for very efficient compression
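The difference between the two layouts can be sketched in a few lines of Python using the three-row employee table from this slide. This is only an illustration of the storage idea, not Infobright's storage format; the function names are made up for the example.

```python
# The three-row employee table from the slide, stored row by row.
COLUMNS = ("EMP_ID", "FNAME", "LNAME", "SALARY")
rows = [
    (1, "Moe", "Howard", 10000),
    (2, "Curly", "Joe", 12000),
    (3, "Larry", "Fine", 9000),
]

# Column-oriented layout: one contiguous list per column.
columns = {name: [row[i] for row in rows] for i, name in enumerate(COLUMNS)}


def avg_salary_row_store(rows):
    """Row store: every full row is visited even though one field is used."""
    return sum(row[3] for row in rows) / len(rows)


def avg_salary_column_store(columns):
    """Column store: only the SALARY column is touched."""
    salaries = columns["SALARY"]
    return sum(salaries) / len(salaries)


print(columns["SALARY"])
print(avg_salary_column_store(columns))
```

Both functions return the same average; the point is that the column version never reads EMP_ID, FNAME or LNAME, which is why aggregates over a few columns favor this layout.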


Data Packs and Compression

[Figure: a column divided into data packs, each labeled 64K]

Data Packs
• Each data pack contains 65,536 data values
• Compression is applied to each individual data pack
• The compression algorithm varies depending on data type and data distribution

Compression
• Results vary depending on the distribution of data among data packs
• A typical overall compression ratio seen in the field is 10:1
• Some customers have seen results as high as 40:1

Patent Pending Compression Algorithms
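A minimal sketch of the data pack idea follows: a column is cut into packs of 65,536 values and each pack is compressed on its own. Infobright's compression algorithms are proprietary and not described here, so zlib is used purely as a stand-in; the sample column and the 64-bit integer encoding are also assumptions made for the example.

```python
import struct
import zlib

PACK_SIZE = 65_536  # values per data pack, as stated on the slide


def split_into_packs(values, pack_size=PACK_SIZE):
    """Cut one column into consecutive packs of at most pack_size values."""
    return [values[i:i + pack_size] for i in range(0, len(values), pack_size)]


def compress_pack(pack):
    """Encode a pack as 64-bit integers and compress it independently.

    zlib is a stand-in here, not the algorithm Infobright uses.
    """
    raw = struct.pack(f"<{len(pack)}q", *pack)
    return raw, zlib.compress(raw)


# Hypothetical column: 150,000 slowly increasing integers.
column = [1000 + i // 100 for i in range(150_000)]
packs = split_into_packs(column)

for n, pack in enumerate(packs):
    raw, packed = compress_pack(pack)
    print(f"pack {n}: {len(pack)} values, ratio {len(raw) / len(packed):.1f}:1")
```

Because each pack is compressed separately, the ratio depends on how the values happen to be distributed within that pack, which matches the slide's remark that results vary with data distribution.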


Knowledge Grid

This metadata layer = 1% of the compressed volume

Data Pack Nodes (DPN)
A separate DPN is created for every data pack in the database to store basic statistical information.

Character Maps (CMAPs)
For every Data Pack that contains text, a matrix is created that records the occurrence of every possible ASCII character.

Histograms
A histogram is created for every Data Pack that contains numeric data; each histogram has 1024 MIN-MAX intervals.

Pack-to-Pack Nodes (PPN)
PPNs track relationships between Data Packs when tables are joined. Query performance gets better as the database is used.
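To make two of these structures concrete, here is a small sketch of a Data Pack Node and a character map. The slide only says that a DPN holds "basic statistical information" and that a CMAP records character occurrences, so the specific fields (min, max, sum, count) and the per-position layout of the map are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class DataPackNode:
    """Basic statistics for one numeric data pack (field choice is assumed)."""
    minimum: int
    maximum: int
    total: int
    count: int


def build_dpn(pack):
    """Summarize a numeric pack without keeping its values."""
    return DataPackNode(min(pack), max(pack), sum(pack), len(pack))


def build_cmap(pack, max_len=8):
    """For each character position, record which characters occur there."""
    cmap = [set() for _ in range(max_len)]
    for text in pack:
        for pos, ch in enumerate(text[:max_len]):
            cmap[pos].add(ch)
    return cmap


def cmap_may_contain(cmap, literal):
    """Return False only when no string in the pack can start with literal."""
    return all(ch in cmap[pos] for pos, ch in enumerate(literal[:len(cmap)]))


dpn = build_dpn([42000, 51000, 38000])
cmap = build_cmap(["Shipping", "Sales", "Support"])

print(dpn)
print(cmap_may_contain(cmap, "Shipping"))  # True: the pack must be inspected
print(cmap_may_contain(cmap, "Finance"))   # False: the pack can be skipped
```

A `True` answer from the map is only a "maybe" (characters from different strings can combine into a false positive), while a `False` answer is definite, which is what allows a pack to be skipped without decompressing it.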


A Simple Query using the Knowledge Grid

SELECT count(*) FROM employees
WHERE salary > 50000
  AND age < 65
  AND job = 'Shipping'
  AND city = 'TORONTO';

[Figure: the salary, age, job and city columns, each split into Data Packs for rows 1 to 65,536, 65,537 to 131,072, and 131,073 to ……. Each pack is marked as Completely Irrelevant, Suspect, or All values match. Three row ranges are labeled "All packs ignored"; one pack is labeled "Only this pack will be decompressed".]

1. Find the Data Packs with salary > 50000
2. Find the Data Packs that contain age < 65
3. Find the Data Packs that have job = 'Shipping'
4. Find the Data Packs that have city = 'Toronto'
5. Now we eliminate all rows that have been flagged as irrelevant.
6. Finally we have identified the data pack that needs to be decompressed
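The three pack states used on this slide (Completely Irrelevant, Suspect, All values match) can be derived from per-pack minimum and maximum values alone. The sketch below shows that classification for two of the query's predicates; the statistics are invented for the example, and the real optimizer also draws on histograms and character maps, which are not modeled here.

```python
from enum import Enum


class Status(Enum):
    IRRELEVANT = "completely irrelevant"
    SUSPECT = "suspect"
    ALL_MATCH = "all values match"


def classify_greater_than(pack_min, pack_max, threshold):
    """Classify a pack for `value > threshold` using only its min and max."""
    if pack_max <= threshold:
        return Status.IRRELEVANT
    if pack_min > threshold:
        return Status.ALL_MATCH
    return Status.SUSPECT


def classify_less_than(pack_min, pack_max, threshold):
    """Classify a pack for `value < threshold` using only its min and max."""
    if pack_min >= threshold:
        return Status.IRRELEVANT
    if pack_max < threshold:
        return Status.ALL_MATCH
    return Status.SUSPECT


def combine(statuses):
    """AND several predicates that cover the same range of rows."""
    if Status.IRRELEVANT in statuses:
        return Status.IRRELEVANT
    if all(s is Status.ALL_MATCH for s in statuses):
        return Status.ALL_MATCH
    return Status.SUSPECT


# Hypothetical (min, max) statistics for three row ranges.
salary_stats = [(20000, 45000), (30000, 90000), (60000, 120000)]
age_stats = [(18, 40), (25, 70), (22, 64)]

results = [
    combine([classify_greater_than(*sal, 50000), classify_less_than(*age, 65)])
    for sal, age in zip(salary_stats, age_stats)
]
for n, status in enumerate(results):
    print(f"row range {n}: {status.value}")
```

Only row ranges that end up as suspect need their packs decompressed; irrelevant ranges are skipped, and for a `count(*)` a range where all values match can be answered from the stored row count.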


A Join Query using the Knowledge Grid

SELECT MIN(sale), MAX(discount), name
FROM carsales, salesperson
WHERE carsales.id = salesperson.id
  AND carsales.prov = 'ON'
  AND carsales.date = '2008-02-29'
GROUP BY name;

Car Sales: id, sale, discount, prov, date

Sales Person: id, name

1. Eliminate the Car Sales Data Packs that are irrelevant based on constraints in the SQL
2. Determine the related Sales Person Data Packs based on the values of carsales_id found in the relevant Car Sales Data Packs.
3. Create a Pack-to-Pack node that stores the results of the join condition between Car Sales and Sales Person.
4. Any subsequent queries will be able to use the PPN to resolve joins between Car Sales and Sales Person

Pack-to-Pack: carsales_id vs salesperson_id

0 1 0
1 1 0

A 1 indicates that the Data Packs are related
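A Pack-to-Pack node can be pictured as a small 0/1 matrix with one row per pack of the first join column and one column per pack of the second. The sketch below builds such a matrix by checking whether two packs share any join key. The pack contents are invented, chosen so that the result reproduces the 0/1 pattern on the slide, and building the matrix from set intersections is an assumption about how it could be done rather than a description of Infobright's implementation.

```python
def build_pack_to_pack(left_packs, right_packs):
    """matrix[i][j] is 1 when left pack i and right pack j share a join key."""
    right_sets = [set(pack) for pack in right_packs]
    return [
        [0 if right.isdisjoint(left) else 1 for right in right_sets]
        for left in left_packs
    ]


def related_pairs(matrix):
    """List the (left pack, right pack) pairs a join actually has to visit."""
    return [
        (i, j)
        for i, row in enumerate(matrix)
        for j, flag in enumerate(row)
        if flag
    ]


# Hypothetical packs of join keys (tiny, for readability).
carsales_id_packs = [[7, 8, 9], [1, 2, 8]]
salesperson_id_packs = [[1, 2, 3], [7, 8, 9], [20, 21]]

matrix = build_pack_to_pack(carsales_id_packs, salesperson_id_packs)
for row in matrix:
    print(*row)
print(related_pairs(matrix))
```

Once the matrix exists, a later join between the same columns only needs to pair up packs whose entry is 1, which is the sense in which query performance improves as the database is used.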


Infobright Architecture


MySQL/Infobright Architecture

[Figure: architecture diagram. Components shown: Connectors (Native C API, JDBC, ODBC, .NET, PHP, Python, Perl, Ruby, VB); Connection Pool (Authentication, Thread Reuse, Connection Limits, Check Memory, Caches); Management Services & Utilities; SQL Interface; Parser; Caches & Buffers; MySQL Loader; MySQL Optimizer; Infobright Loader / Unloader; Infobright Optimizer and Executor; MyISAM (Views, Users, Permissions, Table Defs); Knowledge Grid (Knowledge Nodes, Data Pack Nodes); Compressor / Decompressor; Data Packs.]

Infobright – Embedded With MySQL

Infobright Components
• IB Storage Engine consisting of 64K Data Packs, Compressor, and the Knowledge Grid
• IB Optimizer that uses rough set algorithms and the Knowledge Grid to navigate the database
• IB Loader supports text based and binary data formats

Infobright ships with the full MySQL binaries. The MySQL architecture is used to support database components such as connectors, security and memory management.


Optimized SQL for Infobright

The Infobright Optimizer supports a large subset of MySQL syntax and functions. When the optimizer encounters SQL syntax that it does not support, the query is executed using the MySQL optimizer instead.

MySQL

Infobright Optimized SQL
• Select Statements
• Comparison Operators
• Logical Operators
• String Comparison Functions (LIKE, ..)
• Aggregate Functions
• Arithmetic Operators
• Data Manipulation Language (I/U/D)

• Data Definition Language (CREATE & DROP)

• String Functions
• Date/Time Functions
• Numeric Functions
• Trigonometric Functions
• Case Statements


Infobright Data Types

[Figure: table of supported data types, grouped as Numeric, Date, and String]

Most of the data types expected for a MySQL database engine are fully supported. The data types that are currently not implemented within Infobright include BLOB, ENUM, SET and Auto Increment.


ETL Integration

Increased efficiency with popular platforms

Deeper ETL Integration
• Jaspersoft, Talend, Pentaho
• Leverages end-to-end data management provided by ETL tools
• Improved support for Data Manipulation Language (DML)

Leverage existing IT tools and resources for fast, simple deployments and low TCO


Data Loading with & without custom ETL connectors

Loading Infobright tables with custom connectors:
• Kettle from Pentaho
• Talend ETL from Talend
• Jaspersoft ETL (Talend) from Jaspersoft

Two ways to invoke the Infobright loader without connectors:
1. Generate a CSV or binary file and invoke the Infobright loader to load the file
2. Named pipe technique:
• Create a named pipe (e.g. mkfifo /home/mysql/s_mysession1.pipe)
• Launch the Infobright loader in the background to read from the pipe
• Launch the ETL process that writes data to the named pipe
• When the ETL process runs, as records are written to the named pipe, the loader reads them and writes them to an Infobright database table
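The named pipe technique can be tried out on any Unix-like system without a database. In the sketch below a background thread stands in for the Infobright loader and simply collects the records it reads; the real loader invocation is not shown on the slide and is not reproduced here. The pipe location, function names and sample records are assumptions for illustration.

```python
import csv
import os
import tempfile
import threading


def stand_in_loader(pipe_path, loaded_rows):
    """Placeholder for the real loader: reads CSV records from the pipe."""
    with open(pipe_path, newline="") as pipe:
        for record in csv.reader(pipe):
            loaded_rows.append(record)


def etl_writer(pipe_path, records):
    """Placeholder ETL step: streams records into the pipe as CSV."""
    with open(pipe_path, "w", newline="") as pipe:
        csv.writer(pipe).writerows(records)


# Step 1: create the named pipe (Unix only; temporary directory for the demo).
pipe_path = os.path.join(tempfile.mkdtemp(), "s_mysession1.pipe")
os.mkfifo(pipe_path)

# Step 2: start the loader in the background so it is ready to read.
loaded = []
loader = threading.Thread(target=stand_in_loader, args=(pipe_path, loaded))
loader.start()

# Step 3: run the ETL process, which writes records into the pipe.
etl_writer(pipe_path, [(1, "Moe", 10000), (2, "Curly", 12000)])

# Step 4: the loader finishes once the writer closes its end of the pipe.
loader.join()
os.remove(pipe_path)

print(loaded)
```

The reader must be started before (or concurrently with) the writer, because opening a FIFO blocks until the other end is opened. The benefit of the technique is that no intermediate file is ever written to disk.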

 


Infobright Versions & System Requirements


Comparison of ICE and IEE

Features | ICE | IEE
Technical Support | Forums and/or one-time 4-hr support pack | Available
Warranty and Indemnification | No | Included
INSERT/UPDATE/DELETE | No | Supported
Infobright Loader | Up to 50 GB/hr | Multi-threaded, up to 300 GB/hr
Data Load Types | Text only | Text & Binary (100% faster)
MySQL Loader | No | Supported
Platform Support | 64-bit Intel and AMD: RHEL 5, CentOS 5, Debian; 32-bit Intel and AMD: Windows XP, Ubuntu 8.04, Fedora 9 | 64-bit Intel and AMD: Windows Server 2003, Windows Server 2008, RHEL 5, CentOS 5, Debian, Solaris 10


System Requirements


For More Information

Thank you

Bob Newell, Data Warehouse Evangelist

[email protected]

Or join our open source community at www.infobright.org


Query performance

Query # | Query name | Interval | Infobright, no cache | Infobright, cache | Traditional DB, no cache | Traditional DB, cache
1 | Affiliate/minor/sum(order)/year | 20060101-20061231 | 7,72 | 0,99 | 13,00,91 | 4,03,21
2 | Affiliate/major/sum(order)/year | 20060101-20061231 | 31,52 | 7,81 | N/A | N/A
1 | Affiliate/minor/sum(order)/month | 20060101-20060131 | 1,32 | 0,43 | 1,00,43 | 10,69
2 | Affiliate/major/sum(order)/month | 20060101-20060131 | 3,23 | 0,65 | 2,12,34 | 18,55
3 | Events/Cat=2/Country/sum(no of)/year | 20060101-20061231 | 37,16 | 24,42 | N/A | N/A
4 | Events/Cat=*/Country/sum(no of)/year | 20060101-20061231 | 41,67 | 29,62 | N/A | N/A
3 | Events/Cat=2/Country/sum(no of)/month | 20060101-20060131 | 15,16 | 7,15 | 8,08,13 | 2,10,15
4 | Events/Cat=*/Country/sum(no of)/month | 20060101-20060131 | 22,12 | 8,01 | 15,08,32 | 3,12,82

Time in minutes, seconds, milliseconds