Data Processing over Very Large Databases Ing. Ľuboš Takáč Supervisor: doc. Ing. Michal Zábovský, PhD. Faculty of Management Science and Informatics University of Žilina

Data Processing over very Large Relational Databases


DESCRIPTION

Final presentation of my dissertation thesis, focused on orientation in, analysis of, and finding information in large or unknown relational databases, and on data visualisation.


Page 1: Data Processing over very Large Relational Databases

Data Processing over Very Large Databases

Ing. Ľuboš Takáč

Supervisor: doc. Ing. Michal Zábovský, PhD.

Faculty of Management Science and Informatics

University of Žilina

Page 2: Data Processing over very Large Relational Databases

Large Databases

• VLDB (very large databases)

• Relational Databases with hundreds of tables and millions of rows

Page 3: Data Processing over very Large Relational Databases

The Problem

• How to understand a relational database model well enough to find information in it.

• Orientation in a large RDB – difficulty given by the complexity of the RDB model

• Modification and development of RDB.

Page 4: Data Processing over very Large Relational Databases

Existing approaches

• Database metrics

• Database visualization

• Database to ontology mapping and examination of ontology

Page 5: Data Processing over very Large Relational Databases

Database Metrics

• A database metric is a function that assigns a numeric value to an object from the database.

• Examples of table metrics

– DRT(T): depth of relational tree

– TS(T): table size

– RD(T): referential degree

– …

• Rankings – grouping metrics with different weights.
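The metrics and ranking idea above can be sketched in a few lines of Python. This is only an illustration: the slides do not define the metrics precisely, so the sketch assumes RD(T) is the number of foreign keys declared in T, DRT(T) is the longest chain of foreign key references starting at T, and TS(T) is simply the number of columns. The `Table` class, the sample schema and the weights are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    name: str
    columns: int                                          # number of attributes
    references: list[str] = field(default_factory=list)   # tables referenced via FK


def rd(table: Table) -> int:
    """RD(T): assumed to be the number of foreign keys declared in T."""
    return len(table.references)


def ts(table: Table) -> int:
    """TS(T): assumed to be the number of columns (a stand-in for table size)."""
    return table.columns


def drt(table: Table, schema: dict, seen=frozenset()) -> int:
    """DRT(T): assumed longest chain of FK references starting at T.

    Tables already on the current path are skipped, so cycles terminate.
    """
    seen = seen | {table.name}
    depths = [
        1 + drt(schema[ref], schema, seen)
        for ref in table.references
        if ref in schema and ref not in seen
    ]
    return max(depths, default=0)


def ranking(table: Table, schema: dict, weights: dict) -> float:
    """A ranking: weighted sum of several metrics (weights are arbitrary here)."""
    return (weights["rd"] * rd(table)
            + weights["drt"] * drt(table, schema)
            + weights["ts"] * ts(table))


# Hypothetical four-table schema.
schema = {
    t.name: t
    for t in [
        Table("customers", 5),
        Table("products", 4),
        Table("orders", 6, ["customers"]),
        Table("order_items", 4, ["orders", "products"]),
    ]
}
weights = {"rd": 1.0, "drt": 1.0, "ts": 0.1}
ranked = sorted(schema, key=lambda n: ranking(schema[n], schema, weights), reverse=True)
```

Sorting tables by such a ranking gives a first guess at which tables deserve attention; changing the weights changes which property dominates.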

Page 6: Data Processing over very Large Relational Databases

RDB Visualization

• Database schema visualization.

• A standard ER diagram is insufficient for a large RDB model.

Page 7: Data Processing over very Large Relational Databases
Page 8: Data Processing over very Large Relational Databases
Page 9: Data Processing over very Large Relational Databases

SchemaBall

• Visualization of large or complex RDB schemas.

• Using RDB metrics and rankings.

• We implemented and enhanced such a solution.

Page 10: Data Processing over very Large Relational Databases

SchemaBall

Page 11: Data Processing over very Large Relational Databases

Visualization of RDB schema graph

• Vertex- and edge-weighted graph based on RDB metrics.

• Using Gephi for visualization

– automatically generated layout

– interactive visualization (selections, examinations of nodes and edges)

– using graph algorithms

Page 12: Data Processing over very Large Relational Databases
Page 13: Data Processing over very Large Relational Databases
Page 14: Data Processing over very Large Relational Databases

Analysis of the RDB Graph

• Three approaches

– graph of RDB model (vertex = table, edge = foreign key relation)

– alternative (vertex = table, edge = foreign key relation for each tuple)

– graph of tuples (vertex = tuple, edge = foreign key relation between tuples)
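A minimal sketch of how the vertex degrees for the first two approaches could be computed, and how a degree distribution follows from them. The foreign key list and row counts are made-up sample data, and the function names are illustrative only.

```python
from collections import Counter

# Hypothetical foreign keys: (referencing table, referenced table, rows using the FK)
foreign_keys = [
    ("orders", "customers", 1000),
    ("order_items", "orders", 5000),
    ("order_items", "products", 5000),
    ("invoices", "orders", 900),
]


def schema_degrees(fks):
    """First approach: each foreign key contributes one edge between two tables."""
    deg = Counter()
    for src, dst, _rows in fks:
        deg[src] += 1
        deg[dst] += 1
    return deg


def tuple_weighted_degrees(fks):
    """Second approach: each foreign key contributes one edge per referencing tuple."""
    deg = Counter()
    for src, dst, rows in fks:
        deg[src] += rows
        deg[dst] += rows
    return deg


def degree_distribution(deg):
    """Map each degree k to the fraction of vertices that have degree k."""
    counts = Counter(deg.values())
    n = len(deg)
    return {k: c / n for k, c in sorted(counts.items())}


dist = degree_distribution(schema_degrees(foreign_keys))
```

The third approach (one vertex per tuple) uses the same counting, but over individual rows instead of tables, which is why its graph is far larger.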

Page 15: Data Processing over very Large Relational Databases

Analysis of the RDB Graph – first approach

[Figure: distribution function of vertex degree. x-axis: vertex degree (observed values from 1 to 29); y-axis: probability (0.00 to 0.16).]

Page 16: Data Processing over very Large Relational Databases

Analysis of the RDB Graph – second approach

[Figure: distribution function of vertex degree. x-axis: vertex degree (0 to 1.35E+07); y-axis: probability (0 to 1).]

Page 17: Data Processing over very Large Relational Databases

Analysis of the RDB Graph – third approach

[Figure: distribution function of vertex degree. x-axis: vertex degree; y-axis: count.]

Page 18: Data Processing over very Large Relational Databases

Analysis of the RDB Graph – Scale-free networks

• Connected graph with Yule-Simon distribution of vertex degree.

• $P(k) \sim k^{-\gamma}$, with $\gamma$ usually between 2 and 3
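If the tail of the degree distribution is assumed to follow a power law with exponent γ, the exponent can be estimated from the observed degrees. The sketch below uses a commonly cited approximate maximum-likelihood estimator for discrete data; it is not taken from the slides, the function name is hypothetical, and the approximation is known to be rough for small `k_min`.

```python
import math


def estimate_gamma(degrees, k_min=1):
    """Approximate MLE of a power-law exponent for integer degrees >= k_min.

    Uses gamma ~ 1 + n / sum(ln(k_i / (k_min - 0.5))).
    """
    tail = [k for k in degrees if k >= k_min]
    if not tail:
        raise ValueError("no degrees at or above k_min")
    log_sum = sum(math.log(k / (k_min - 0.5)) for k in tail)
    return 1.0 + len(tail) / log_sum


# Hypothetical degree sequence with a few hubs.
sample_degrees = [1, 1, 1, 1, 2, 2, 3, 5, 8, 29]
gamma = estimate_gamma(sample_degrees)
```

A heavier tail (a few vertices with very high degree) yields a smaller estimate, which matches the intuition that hubs dominate such networks.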

Page 19: Data Processing over very Large Relational Databases

Visualization of RDB schema network

Page 20: Data Processing over very Large Relational Databases

Analysis of the RDB Graph – Conclusion

• RDB model is scale-free.

• To understand an RDB, you must first understand its centers (there are only a few of them).

• The metric NR(T), number of references, is very useful; it was validated by analyzing the RDB graph.

• We created 2 new metrics based on the three approaches mentioned above.

Page 21: Data Processing over very Large Relational Databases

A Method for Analyzing Large RDB

• Find components of schema graph (tables = vertices, FK = edges)

• Examine each component in order, largest first

– If a component is a single isolated table, it is very probably an archive; check this or find its other purpose.

– Otherwise visualize it via an ER diagram, SchemaBall, or a graph using table metrics.
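The first two steps of the method can be sketched as a connected-components search over the schema graph. The table names and foreign keys below are invented sample data, and flagging isolated tables as likely archives follows the heuristic stated on the slide, not a guarantee.

```python
from collections import defaultdict, deque


def components(tables, foreign_keys):
    """Return connected components of the schema graph, largest first.

    tables: iterable of table names (vertices)
    foreign_keys: iterable of (referencing, referenced) pairs (undirected edges)
    """
    adj = defaultdict(set)
    for src, dst in foreign_keys:
        adj[src].add(dst)
        adj[dst].add(src)

    seen, result = set(), []
    for start in tables:
        if start in seen:
            continue
        seen.add(start)
        comp, queue = [], deque([start])
        while queue:                      # breadth-first search from `start`
            cur = queue.popleft()
            comp.append(cur)
            for nxt in adj[cur]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        result.append(sorted(comp))
    return sorted(result, key=len, reverse=True)


# Hypothetical schema with two isolated tables.
tables = ["orders", "customers", "order_items", "products", "orders_A", "log_A"]
fks = [("orders", "customers"), ("order_items", "orders"), ("order_items", "products")]

comps = components(tables, fks)
likely_archives = [c[0] for c in comps if len(c) == 1]
```

Each multi-table component can then be passed to whichever visualization suits its size.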

Page 22: Data Processing over very Large Relational Databases

Practical Example

• Unknown complex RDB

– 332 tables

– 2339 attributes

– 192 foreign keys

– Size 2.4 GB

Page 23: Data Processing over very Large Relational Databases

All tables

Page 24: Data Processing over very Large Relational Databases

Archive Tables

• Each isolated table is an archive table, named with the “_A” convention.

Page 25: Data Processing over very Large Relational Databases

Component A

Page 26: Data Processing over very Large Relational Databases

Component B

Page 27: Data Processing over very Large Relational Databases
Page 28: Data Processing over very Large Relational Databases

RDBAnalyzer

• supports all RDB systems that support JDBC, easily scalable, online connection

• features

– large online RDB schema visualization

– finding the components of graph

– schema graph creation, visualization and export (Gephi)

– transform RDB to tuple graph

– metrics charts, parallel coordinates visualization
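As a rough illustration of schema graph creation and export, the sketch below reads foreign keys from database metadata and writes an edge list that Gephi's spreadsheet import can read (columns `Source` and `Target`). It is not RDBAnalyzer's code: RDBAnalyzer works through JDBC, while this stand-in uses Python's built-in `sqlite3` module and SQLite's `PRAGMA foreign_key_list`; the tables are invented.

```python
import csv
import io
import sqlite3


def foreign_key_edges(conn):
    """Return (tables, edges) where each edge is (referencing, referenced)."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'")]
    edges = []
    for table in tables:
        # One row per FK column; index 0 is the FK id, index 2 the referenced table.
        # The set collapses multi-column foreign keys into a single edge.
        fks = {(fk[0], fk[2])
               for fk in conn.execute(f'PRAGMA foreign_key_list("{table}")')}
        edges.extend((table, ref) for _fk_id, ref in sorted(fks))
    return tables, edges


def to_gephi_csv(edges):
    """Serialize edges as a CSV edge list with Source/Target headers."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Source", "Target"])
    writer.writerows(edges)
    return buf.getvalue()


# Hypothetical in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(id));
    CREATE TABLE orders_A (id INTEGER PRIMARY KEY);
""")
tables, edges = foreign_key_edges(conn)
edge_csv = to_gephi_csv(edges)
```

With another database system, only the metadata query would change; the edge list format stays the same.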

Page 29: Data Processing over very Large Relational Databases

RDBAnalyzer

Page 30: Data Processing over very Large Relational Databases

RDB to Ontology Mapping

– better understanding and searching for information without knowledge of RDB model, data mining from RDB

– can be used by web search engines to search in RDBs

– getting information from RDB by people who do not understand RDB technology (laymen)

– a method for merging multiple databases (ontology merging)

– interactive searching for information (Protégé)

Page 31: Data Processing over very Large Relational Databases

RDB Schema NORTHWIND (ER-Diagram)

Page 32: Data Processing over very Large Relational Databases

OntoGraf (Protégé)

Page 33: Data Processing over very Large Relational Databases
Page 34: Data Processing over very Large Relational Databases

How to find information in Ontologies

• using query language (SPARQL)

• interactive (e.g. Protégé)

– using OntoGraf combined with text searching

– explore entities and individuals

Page 35: Data Processing over very Large Relational Databases
Page 36: Data Processing over very Large Relational Databases

Disadvantages & Problems of Mapping RDBs to Ontologies

• Difficult to keep the data up to date (static & dynamic ontology creation).

• Aggregated queries are very slow.

• Existing tools cannot cope with large RDBs (or large ontologies).

Page 37: Data Processing over very Large Relational Databases

Conclusion & Scientific Contribution

• Design and creation of a method for orientation, understanding and finding information in large or unknown relational databases (RDBAnalyzer supports the mentioned principles).

• Detection of RDB graph characteristics (scale-free network) and use of this knowledge to create 2 new metrics and validate 1 existing metric.

• Design and creation of a method for finding information in ontologies generated from RDB.

Page 38: Data Processing over very Large Relational Databases

Thank you for your attention!

[email protected]