48
Paper PDF: http:// www.icmc.usp.br/pessoas/junio/PublishedPapers/RodriguesJr_et_al-IV2010.pdf Institute of Mathematics and Computer Sciences University of Sao Paulo (USP) - Brazil Combining Visual Analytics and Content Based Data Retrieval Technology for Efficient Data Analysis Jose F Rodrigues Jr – [email protected]

Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

Embed Size (px)

DESCRIPTION

Jose Rodrigues, Luciana A S Romani, Agma Juci Machado Traina, Caetano Traina Jr (2010) Combining Visual Analytics and Content Based Data Retrieval Technology for Efficient Data Analysis In: 14th International Conference on Information Visualisation (IV) 61-67 IEEE Press. @inproceedings { DBLP:conf/iv/RodriguesRTT10, title = "Combining Visual Analytics and Content Based Data Retrieval Technology for Efficient Data Analysis", year = "2010", author = "Jose Rodrigues and Luciana A S Romani and Agma Juci Machado Traina and Caetano Traina Jr", booktitle = "14th International Conference on Information Visualisation (IV)", pages = "61-67", publisher = "IEEE Press", doi = "10.1109/IV.2010.101", url = "http://www.icmc.usp.br/~junio/PublishedPapers/RodriguesJr_et_al-IV2010.pdf", urllink = "http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5571357&tag=1", abstract = "One of the most useful techniques to help visual data analysis systems is interactive filtering (brushing). However, visualization techniques often suffer from overlap of graphical items and multiple attributes complexity, making visual selection inefficient. In these situations, the benefits of data visualization are not fully observable because the graphical items do not pop up as comprehensive patterns. In this work we propose the use of content-based data retrieval technology combined with visual analytics. The idea is to use the similarity query functionalities provided by metric space systems in order to select regions of the data domain according to user-guidance and interests. After that, the data found in such regions feed multiple visualization workspaces so that the user can inspect the correspondent datasets. Our experiments showed that the methodology can break the visual analysis process into smaller problems (views) and that the views hold the expectations of the analyst according to his/her similarity query selection, improving data perception and analytical possibilities. Our contribution introduces a principle that can be used in all sorts of visualization techniques and systems, this principle can be extended with different kinds of integration visualization-metric-space, and with different metrics, expanding the possibilities of visual data analysis in aspects such as semantics and scalability.", keywords = "Data analysis , Data visualization , Extraterrestrial measurements , Feature extraction , Nearest neighbor searches , Visualization"}

Citation preview

Page 1: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

Paper PDF: http://www.icmc.usp.br/pessoas/junio/PublishedPapers/RodriguesJr_et_al-IV2010.pdf

Institute of Mathematics and Computer SciencesUniversity of Sao Paulo (USP) - Brazil

London, UK - July/2010

Combining Visual Analytics and Content Based Data Retrieval Technology for Efficient Data Analysis

Jose F Rodrigues Jr – [email protected]

Page 2: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

Context and Context and problems:problems:

content-based content-based data retrieval data retrieval

and and visualizationvisualization

Page 3: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

Context: Content-based data retrieval

Metric spaces: de facto solution for non-orderable data domains Multivariate data Image Video Sound Text

Metric spaces Content-based Data Retrieval It demands three things:

1.Features extraction/annotation2.A distance function

Page 4: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

Context: Content-based data retrievalFeatures extraction/annotation

Features extraction transforms data objects into multivariate data

Features annotation translates real world objects into multivariate data

Page 5: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

Distance functions measure the distance in between features vectors

L2=Euclidiana

r

L0=LInfinity=Chebychev

L1=Manhatan

Context: Content-based data retrieval Distance functions

Page 6: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

Distance FunctionFeatures/Multivariate Data

Context: Content-based data retrieval

Page 7: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

Distance FunctionFeatures/Multivariate Data

Context: Content-based data retrieval

Similarity QueriesSimilarity Queries

Page 8: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

Context: Metric spaces

Metric Structure

Distance Function

Features ExtractionFeatures ExtractionContent-based data retrieval

Page 9: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

Context: Metric spaces

Metric Structure

Distance Function

Features ExtractionFeatures ExtractionContent-based data retrieval

Data retrieval

Page 10: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

Context: Metric spaces

Problem 1 - it is difficult to make sense of metric spaces:High dimensionalityToo many data itemsSimilarity queries work like a black box:

no opportunity to check the context of what was retrieved

nor what else could have been retrieved

Page 11: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

Context: Visualization

The goal of this conference

Important aid for data analysis

Interactive

Intuitive

Page 12: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

Context: Visualization

Visualization mantra:overview first, zoom & filter, then details on

demand

Hence: filtering is an essential part of visualization

The principal means for visual filtering is Interactive brushing

Page 13: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

Context: Visualization

Problem 2 - interactive brushing may be inefficient:Overlap of graphical elementsToo many data elements to select one at a timeSelection based on geometrical primitives (as

spheres and cubes) will brush more elements than what is desired

Page 14: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

Context: Visualization

Problem 3 - the user’s interests may yield filterings that cannot be accomplished with brushing:Semantical interests lead to data items that are:

Not equally importantNot visually adjacentNot at the same analytical context

Page 15: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

Methodology:Methodology:

Why not put Why not put together together

content-based content-based data retrieval data retrieval

and and visualization?visualization?

Page 16: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

Linking CBDR and visualization

Metric spaces are properly spatial

Visualizations use space as one of its main channels for data coding

Page 17: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

Linking CBDR and visualization

?

CBDR Visualization

Page 18: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

Linking CBDR and visualization

Page 19: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

Linking CBDR and visualization

Slim-Tree - Family of Metric Access Methods(Traina Jr. et al. 2000)

Page 20: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

Linking CBDR and visualization

Multidimensional Projection with Fastmap algorithm(Faloutsos and K.-I. Lin 1995)

Slim-Tree - Family of Metric Access Methods(Traina Jr. et al. 2000)

Page 21: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

Methodology:Methodology:

content-based content-based data retrieval viadata retrieval via

Slim-TreeSlim-Tree

Page 22: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

Slim-Tree Metric Access Method

Goal: performance in similarity queries Challenge: multidimensional data, and queries based on

similarity rather than equality Demands: a distance function satisfying triangular

inequality, symmetry and non-negativity Principle: use of progressively hierarchical (tree)

restricting radiuses around representative objects Engineer: clustering subtrees and use of trigonometry

for branch pruning during search Mechanism: dynamic, balanced, bottom-up, disk-

oriented Data: representatives at inner nodes, objects at the leaves

Page 23: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

Video: Metric-Tree-Example-faster.avi

Slim-Tree Metric Access Method

Page 24: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

Oq

• Example: Range Query

Query object Oq and a query radius rq

Slim-Tree Metric Access Method

Oj

Op

Okrepr0(O )p

repr1(O )p

obj(Op)

Oi

Ol Om

repr1(O )i

repr (O1 j)

obj(O m)obj(Ok)

obj(O i)obj(O l)

obj(O j)

...

......

Page 25: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

Methodology:Methodology:

visualization viavisualization via

FastMapFastMap

Page 26: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

FastMap

Projection of p-dimensional points in k mutually orthogonal directions; p > k

Imagine that the points are in a Euclidean space Select two pivot points xa and xb that are far apart Compute a pseudo-projection of the remaining points

along the “line” xaxb

“Project” the points to an orthogonal subspace and recurse

Page 27: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

FastMap: Selecting the Pivot Points

Find the two-furthest points O(n2) Use heuristic 2*(n-1) O(n)

Select any point x0

Let x1 be the furthest from x0

Let x2 be the furthest from x1

Return (x1, x2)x0

x2

x1

Page 28: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

FastMap: Selecting the Pivot Points

Find the two-furthest points O(n2) Use heuristic 2*(n-1) O(n)

Select any point x0

Let x1 be the furthest from x0

Let x2 be the furthest from x1

Return (x1, x2)x0

x2

x1

Page 29: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

FastMap: Selecting the Pivot Points

Find the two-furthest points O(n2) Use heuristic 2*(n-1) O(n)

Select any point x0

Let x1 be the furthest from x0

Let x2 be the furthest from x1

Return (x1, x2)x0

x2

x1

Page 30: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

FastMap: Selecting the Pivot Points

Find the two-furthest points O(n2) Use heuristic 2*(n-1) O(n)

Select any point x0

Let x1 be the furthest from x0

Let x2 be the furthest from x1

Return (x1, x2)x0

x2

x1

XaXb

Page 31: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

FastMap: Pseudo-Projections

Given pivots (xa , xb ), and any third point y

From the law of cosines, one can get to:

The pseudo-projection for y is

This is the first of k coordinates to compute

xa

xb

y

cy da,y

db,y

da,b 2 2 2 2by ay ab y abd d d c d

2 2 2

2ay ab by

yab

d d dc

d

Page 32: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

FastMap: “Project to orthogonal plane”

For orthogonality, the points are considered as if projected on a hyperplane perpendicular to XaXb;

hence p becomes p-1 The projection is not done properly

said, but the distance function for such configuration is determined

We can compute distances within the “orthogonal hyperplane” using the Pythagorean theorem

2 2'( ', ') ( , ) ( )z yd y z d y z c c

xb

xa

y

z

y’ z’dy’,z’

dy,z

cz-cy

Get back to the previous slide and use d ’ to calculate the next cy

coordinate; repeat until k coordinates are computed

Page 33: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

FastMap: “Project to orthogonal plane”

For orthogonality, the points are considered as if projected on a hyperplane perpendicular to XaXb;

hence p becomes p-1 The projection is not done properly

said, but the distance function for such configuration is determined

We can compute distances within the “orthogonal hyperplane” using the Pythagorean theorem

2 2'( ', ') ( , ) ( )z yd y z d y z c c

xb

xa

y

z

y’ z’dy’,z’

dy,z

cz-cy

Get back to the previous slide and use d ’ to calculate the next cy

coordinate; repeat until k coordinates are computed

More important than what it does, is how it does; and this step is the main engineer behind FastMap – it allows it to work recursively and fast.

Page 34: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

FastMap: “Project to orthogonal plane”

For orthogonality, the points are considered as if projected on a hyperplane perpendicular to XaXb;

hence p becomes p-1 The projection is not done properly

said, but the distance function for such configuration is determined

We can compute distances within the “orthogonal hyperplane” using the Pythagorean theorem

2 2'( ', ') ( , ) ( )z yd y z d y z c c

xb

xa

y

z

y’ z’dy’,z’

dy,z

cz-cy

Get back to the previous slide and use d ’ to calculate the next coordinate; repeat until k coordinates are computed

• Good point: overall complexity is k*n O(n), better than any other

• Bad point: original distances are not strictly maintained stress (like any other technique of its kind)

• All in all, FastMap is a top five multidimensional projection technique (along with MDS, PCA, LSP, FDP), but with the best performance

Page 35: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

Solving the stated Solving the stated problems:problems:

innovating withinnovating with

MetricSPlatMetricSPlat

Page 36: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

Breast Cancer Wisconsin (Diagnostic) Data Set• 9 descriptive attributes annotated from tissue biopsies:

1. ClumpThickness2. UniforSize3. UniforShape4. MargAdhes5. SingleEpithSize6. BareNuclei7. BlandChromatin8. NormalNucleoli9. Mitoses

• Breast Cancer classification:• benign (class 0) • malign (class 1)

Demonstration dataset

Page 37: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

Problem 1

Problem 1 - it is difficult to make sense of metric spaces

Choose a distance function, which will be the same for FastMap and SlimTree

Build a SlimTreeChoose an element of interest to be the center of the

projectionRetrieve the entire dataset out of the SlimTreeFastMap-project all the elements of the dataset in 3DUse central point for translation

Page 38: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

Video: Problem1-final.avi

Page 39: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

Problem 2

Problem 2 - interactive brushing may be inefficient Data selections may demand too many brushing steps Attribute by attribute selection may not correspond to

similarity as expected by the user In 3D projection, spheres and cubes do not apply for non-

regular regions of interest Due to stress, projection does not exactly correpond to

the original data configuration Approach:

Brush one element of interestPerform a KNN/Range query having this element as

the query center

Page 40: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

Video: Problem2-final.avi

Page 41: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

Problem 3

Problem 3 - the user’s interests may yield filterings that cannot be accomplished with brushingWeight attributes according to perception and

knowledge domainRedefine metric space and projectionPropagate data to multivariate visualization

techniques

Page 42: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

Video: Problem3-final.avi

Page 43: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

Without weighting of attributes 2 and 3 After weighting of attributes 2 and 3

KNN(15, 441) = <445, 444, 449, 433, 450, 448, 447, 446, 443, 442, 440, 439, 438, 437, 436> = R

KNN(15, 441) = <456, 454, 453, 452, 450, 449, 448, 447, 443, 441, 436, 433, 430, 425, 424> = R´

Page 44: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

Without weighting of attributes 2 and 3 After weighting of attributes 2 and 3

KNN(15, 441) = <445, 444, 449, 433, 450, 448, 447, 446, 443, 442, 440, 439, 438, 437, 436> = R

KNN(15, 441) = <456, 454, 453, 452, 450, 449, 448, 447, 443, 441, 436, 433, 430, 425, 424> = R´

Page 45: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

Without weighting of attributes 2 and 3 After weighting of attributes 2 and 3

KNN(15, 441) = <445, 444, 449, 433, 450, 448, 447, 446, 443, 442, 440, 439, 438, 437, 436> = R

KNN(15, 441) = <456, 454, 453, 452, 450, 449, 448, 447, 443, 441, 436, 433, 430, 425, 424> = R´

The query answer is 60% different:

R’ - R = {456, 454, 453, 452, 450, 441, 430, 425, 424}R ∩ R’ = {449, 448, 447, 443, 436, 433}

It retrieves elements whose attributes 2 and 3 are more similar to the query center 441 at attributes 2 and 3.

Page 46: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

References

C. Faloutsos and K.-I. Lin, “FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets”, Proc. ACM SIGMOD, 1995, 163-174

C. Traina Jr., A. Traina, B. Seeger, C. Faloutsos, “Slim-trees: High Performance Metric Trees Minimizing Overlap Between Nodes”, Int. Conference on Extending Database Technology (EDBT), 2000, 51--65

Page 47: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

Resources

D. Mount, “Feature Selection”, Presentation for CMSC 828K - University of Maryland - http://www.kanungo.com/teaching/cmsc828K/dave/feature.ppt- accessed 07/2010

Tomáš Skopal, Jaroslav Pokorný, Michal Krátký, Václav Snášel, “Revisiting M-tree Building Principles”, Presentation for Advances in Databases and Information Systems 2003

http://siret.ms.mff.cuni.cz/skopal/pres/mtree.ppt - accessed 07/2010

Clemens Marschner, “MTree Tester Applet”

http://www.cmarschner.net/mtree.html - accessed 07/2010

Texas A&M University, “Function Plotter”

http://www.math.tamu.edu/AppliedCalc/Classes/Plot2/CardPlot.html

Page 48: Combining visual analytics and content based data retrieval technology for efficient data analysis sem notas

- 14th International Conference on Information Visualisation (IV 2010) -

Thank youThank youhttp://www.icmc.usp.br/~junio

Link “MetricSPlat”

[email protected]