Data Processing over Very Large Relational Databases

Final presentation of my dissertation thesis, focused on orientation in, analysis of, and finding information in large or unknown relational databases, and on data visualisation.

Data Processing over Very Large Databases

Ing. Ľuboš Takáč

Supervisor: doc. Ing. Michal Zábovský, PhD.

Faculty of Management Science and Informatics

University of Žilina

Large Databases

• VLDB (very large databases)

• Relational Databases with hundreds of tables and millions of rows

The Problem

• How to understand a relational database model so that we can find information in it.

• Orientation in a large RDB – made difficult by the complexity of the RDB model

• Modification and development of an RDB.

Existing approaches

• Database metrics

• Database visualization

• Database to ontology mapping and examination of ontology

Database Metrics

• A database metric is a function that assigns a numeric value to an object from the database.

• Examples of table metrics

– DRT(T) – depth of relational tree

– TS(T) – table size

– RD(T) – referential degree

– …

• Rankings – grouping metrics with different weights.
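The slide names the metrics without defining them, so the following is only a minimal sketch under stated assumptions: TS(T) is taken as the row count, RD(T) as the number of foreign keys declared in T, DRT(T) as the length of the longest foreign-key path starting at T, and a ranking as a weighted sum of metrics normalized by their maximum. The toy schema, the weights and all function names are hypothetical.

```python
# Toy schema: table -> row count and the tables it references via foreign keys.
# All table names and numbers are made up for illustration.
SCHEMA = {
    "customers":   {"rows": 1000,  "refs": []},
    "products":    {"rows": 200,   "refs": []},
    "orders":      {"rows": 5000,  "refs": ["customers"]},
    "order_items": {"rows": 20000, "refs": ["orders", "products"]},
}

def ts(table):
    """TS(T): table size, taken here as the number of rows (assumption)."""
    return SCHEMA[table]["rows"]

def rd(table):
    """RD(T): referential degree, taken here as the number of foreign keys in T."""
    return len(SCHEMA[table]["refs"])

def drt(table, _seen=frozenset()):
    """DRT(T): length of the longest foreign-key path starting at T."""
    seen = _seen | {table}  # guards against cyclic references
    depths = [1 + drt(ref, seen) for ref in SCHEMA[table]["refs"] if ref not in seen]
    return max(depths, default=0)

def ranking(table, weights):
    """Weighted sum of metrics, each normalized by its maximum over the schema."""
    score = 0.0
    for metric, weight in weights.items():
        top = max(metric(t) for t in SCHEMA) or 1  # avoid division by zero
        score += weight * metric(table) / top
    return score

# Hypothetical weights; a real ranking would tune these to the question asked.
WEIGHTS = {ts: 0.5, rd: 0.3, drt: 0.2}
ranked = sorted(SCHEMA, key=lambda t: ranking(t, WEIGHTS), reverse=True)
```

With these made-up numbers, `ranked` lists the tables from most to least important under the chosen weights; changing the weights changes which tables surface first.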

RDB Visualization

• Database schema visualization.

• A standard ER diagram is insufficient for a large RDB model.

SchemaBall

• Visualization of large or complex RDB schemas.

• Using RDB metrics and rankings.

• We implemented and enhanced this solution.

SchemaBall

Visualization of RDB schema graph

• Vertex and edge weighted graph based on RDB metrics.

• Using Gephi for visualization

– automatically generated layout

– interactive visualization (selection and examination of nodes and edges)

– using graph algorithms

Analysis of RDB Graph

• Three approaches

– graph of the RDB model (vertex – table, edge – foreign key relation)

– alternative (vertex – table, edge – foreign key relation for each tuple)

– graph of tuples (vertex – tuple, edge – foreign key relation between tuples)
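To make the difference between the three graphs concrete, here is a small sketch that builds each of them from a toy in-memory database. The data layout (`ROWS`, `FOREIGN_KEYS`), the assumption that every parent table is keyed by an `id` column, and the function names are illustrative choices, not taken from the thesis.

```python
from collections import Counter

# Toy database (all names and rows are illustrative assumptions).
# Each foreign key is (child table, child column, parent table);
# parent rows are identified by their "id" column.
ROWS = {
    "customers": [{"id": 1}, {"id": 2}],
    "orders": [
        {"id": 10, "customer_id": 1},
        {"id": 11, "customer_id": 1},
        {"id": 12, "customer_id": 2},
    ],
}
FOREIGN_KEYS = [("orders", "customer_id", "customers")]

def schema_graph():
    """Approach 1: vertex = table, one edge per foreign key relation."""
    return [(child, parent) for child, _col, parent in FOREIGN_KEYS]

def per_tuple_schema_graph():
    """Approach 2: vertex = table, one edge per tuple that uses the foreign key."""
    edges = Counter()
    for child, col, parent in FOREIGN_KEYS:
        edges[(child, parent)] += sum(1 for row in ROWS[child] if row.get(col) is not None)
    return edges

def tuple_graph():
    """Approach 3: vertex = tuple, edge = foreign key relation between two tuples."""
    edges = []
    for child, col, parent in FOREIGN_KEYS:
        parent_ids = {row["id"] for row in ROWS[parent]}
        for row in ROWS[child]:
            if row.get(col) in parent_ids:
                edges.append(((child, row["id"]), (parent, row[col])))
    return edges

def degrees(edges):
    """Vertex degree for an iterable of (u, v) edges."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return degree
```

For example, `degrees(tuple_graph())` gives the per-tuple vertex degrees from which a degree distribution can be computed.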

Analysis of RDB Graph – first approach

[Figure: distribution function of vertex degree. x-axis: vertex degree (1 to 29); y-axis: probability (0.00 to 0.16).]

Analysis of RDB Graph – second approach

[Figure: distribution function of vertex degree. x-axis: vertex degree (0 to 1.35E+07); y-axis: probability (0 to 1).]

Analysis of RDB Graph – third approach

[Figure: distribution function of vertex degree. Axes: vertex degree, count.]

Analysis of RDB Graph – Scale-free networks

• A connected graph with a Yule-Simon distribution of vertex degree.

• $P(k) \sim k^{-\gamma}$, with $\gamma$ usually between 2 and 3
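The degree distribution of a scale-free network is commonly modelled as a power law, P(k) ~ k^(-γ). One rough way to check a degree sequence against this model is to fit a straight line to log P(k) versus log k and read the exponent off the slope. The sketch below does exactly that on a synthetic sample; the function names and data are hypothetical, and a log-log least-squares fit is only a crude estimator (maximum-likelihood fitting is generally preferred on real data).

```python
import math
from collections import Counter

def degree_distribution(degrees):
    """Map each degree k to the fraction of vertices having that degree."""
    counts = Counter(degrees)
    total = len(degrees)
    return {k: c / total for k, c in sorted(counts.items())}

def fit_power_law_exponent(distribution):
    """Least-squares slope of log P(k) versus log k, returned as gamma = -slope.

    Needs at least two distinct positive degrees, otherwise the slope is undefined.
    """
    points = [(math.log(k), math.log(p)) for k, p in distribution.items() if k > 0 and p > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return -sxy / sxx

# Synthetic degree sequence whose counts follow k^-2 exactly (made-up data).
sample = [1] * 144 + [2] * 36 + [3] * 16 + [4] * 9
gamma = fit_power_law_exponent(degree_distribution(sample))
```

On this synthetic sample the fitted exponent comes out at 2 (up to floating-point error), because the counts were constructed to follow k^-2.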

Visualization of RDB schema network

Analysis of RDB Graph – Conclusion

• The RDB model is scale-free.

• To understand an RDB, you must first understand its centers (there are only a few of them).

• The very useful metric NR(T) – number of references – was validated by the analysis of the RDB graph.

• We created 2 new metrics based on the three approaches mentioned above.

A Method for Analyzing Large RDB

• Find the components of the schema graph (tables = vertices, FKs = edges).

• Examine each component in order, starting with the largest:

– If you get a standalone table, it is very probably an archive; check this or find another purpose for it.

– Otherwise, visualize the component via an ER diagram, SchemaBall, or a graph using table metrics.
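The first two steps of the method can be sketched as a connected-components search over the undirected schema graph, followed by sorting the components by size. This is a minimal illustration, not the RDBAnalyzer implementation; the table names and the `components` helper are assumptions.

```python
from collections import defaultdict

def components(tables, foreign_keys):
    """Connected components of the undirected schema graph, largest first.

    tables: iterable of table names (vertices)
    foreign_keys: iterable of (child table, parent table) pairs (edges)
    """
    neighbours = defaultdict(set)
    for child, parent in foreign_keys:
        neighbours[child].add(parent)
        neighbours[parent].add(child)
    seen, result = set(), []
    for start in tables:
        if start in seen:
            continue
        # Depth-first traversal collecting everything reachable from `start`.
        stack, component = [start], set()
        seen.add(start)
        while stack:
            table = stack.pop()
            component.add(table)
            for other in neighbours[table] - seen:
                seen.add(other)
                stack.append(other)
        result.append(component)
    return sorted(result, key=len, reverse=True)

# Hypothetical schema: one connected component plus two standalone tables.
TABLES = ["orders", "customers", "order_items", "products", "orders_A", "log_A"]
FKS = [("orders", "customers"), ("order_items", "orders"), ("order_items", "products")]

found = components(TABLES, FKS)
# Standalone tables are the archive candidates the method asks you to check.
standalone = sorted(next(iter(c)) for c in found if len(c) == 1)
```

Components with more than one table are the ones worth visualizing; the standalone tables are only candidates and still need a manual check.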

Practical Example

• Unknown complex RDB

– 332 tables

– 2339 attributes

– 192 foreign keys

– Size 2.4 GB

All tables

Archive Tables

• Each standalone table is an archive table, following the “_A” naming convention.

Component A

Component B

RDBAnalyzer

• Supports all RDB systems that support JDBC; easily scalable; online connection

• Features

– large online RDB schema visualization

– finding the components of the graph

– schema graph creation, visualization and export (Gephi)

– transformation of the RDB to a tuple graph

– metrics charts, parallel coordinates visualization
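As an illustration of schema graph creation and export, the sketch below reads tables and foreign keys from a live database and writes an edge list as CSV. RDBAnalyzer itself connects through JDBC; here SQLite's catalog and `PRAGMA foreign_key_list` stand in for the database metadata, which is purely an assumption made to keep the example self-contained. The `Source`/`Target` header follows the convention Gephi's spreadsheet import uses for edge tables; the function names are hypothetical.

```python
import csv
import io
import sqlite3

def read_schema_graph(conn):
    """Return (tables, edges); each edge is (child table, parent table)."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name")]
    edges = []
    for table in tables:
        # Result columns: id, seq, table, from, to, on_update, on_delete, match.
        for fk in conn.execute(f'PRAGMA foreign_key_list("{table}")'):
            if fk[1] == 0:  # seq == 0: count a multi-column key only once
                edges.append((table, fk[2]))
    return tables, edges

def to_edge_csv(edges):
    """Serialize edges as a CSV edge list with a Source,Target header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target"])
    writer.writerows(edges)
    return buffer.getvalue()

# Made-up schema in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id)
    );
    CREATE TABLE orders_A (id INTEGER PRIMARY KEY);
""")
tables, edges = read_schema_graph(conn)
edge_csv = to_edge_csv(edges)
```

Vertex and edge weights derived from table metrics could be added as extra CSV columns; they are left out here to keep the sketch short.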

RDBAnalyzer

RDB to Ontology Mapping

– better understanding and searching for information without knowledge of the RDB model; data mining from the RDB

– can be used by web search engines to search in RDBs

– lets people who do not understand RDB technology (laypeople) get information from an RDB

– a method for merging multiple databases (ontology merging)

– interactive searching for information (Protégé)
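The slides do not describe the mapping rules, so the following sketch assumes a common simplification: a table becomes a class, a row becomes an individual, a column becomes a property, and a foreign key column links to the individual of the referenced row. Triples are plain Python tuples, and the identifier scheme (`Table/key`, `Table#column`) and the `match` helper are hypothetical; the rows are made up and only borrow Northwind-style table names.

```python
def rows_to_triples(table, rows, primary_key, foreign_keys=None):
    """Map rows to (subject, predicate, object) triples.

    foreign_keys maps a column name to the table it references.
    Simplification: the primary key is used only to name the individual.
    """
    foreign_keys = foreign_keys or {}
    triples = []
    for row in rows:
        subject = f"{table}/{row[primary_key]}"
        triples.append((subject, "rdf:type", table))
        for column, value in row.items():
            if column == primary_key or value is None:
                continue
            if column in foreign_keys:
                # Foreign key: link to the referenced individual.
                triples.append((subject, f"{table}#ref-{column}", f"{foreign_keys[column]}/{value}"))
            else:
                # Ordinary column: attach the value as a literal.
                triples.append((subject, f"{table}#{column}", value))
    return triples

def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s) and (p is None or t[1] == p) and (o is None or t[2] == o)]

# Made-up rows for illustration.
triples = (
    rows_to_triples("Customers", [{"CustomerID": "ALFKI", "CompanyName": "Example Co"}], "CustomerID")
    + rows_to_triples("Orders", [{"OrderID": 1, "CustomerID": "ALFKI"}], "OrderID",
                      {"CustomerID": "Customers"})
)
orders_of_customer = match(triples, p="Orders#ref-CustomerID", o="Customers/ALFKI")
```

The `match` helper mimics a single triple pattern of the kind a query language would evaluate, which is enough to show how a question can be asked without knowing the join structure of the original schema.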

RDB Schema NORTHWIND (ER-Diagram)

OntoGraf (Protégé)

How to find information in Ontologies

• using query language (SPARQL)

• interactive (e.g. Protégé)

– using OntoGraf combined with text searching

– explore entities and individuals

Disadvantages & Problems of RDBs Mapped to Ontologies

• It is difficult to keep the data up to date (static and dynamic ontology creation).

• Aggregate queries are very slow.

• Existing tools cannot cope with large RDBs (or large ontologies).

Conclusion & Scientific Contribution

• Design and creation of a method for orientation in, understanding of, and finding information in large or unknown relational databases (RDBAnalyzer supports the principles mentioned).

• Detection of RDB graph characteristics (scale-free network) and use of this knowledge to create 2 new metrics and validate 1 existing metric.

• Design and creation of a method for finding information in ontologies generated from an RDB.

Thank you for your attention!

lubos.takac@gmail.com
