
Page 1: The  Versus Comparison Framework

National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign

The Versus Comparison Framework

Kenton McHenry, Ph.D.
Research Scientist
National Center for Supercomputing Applications

Page 2: The  Versus Comparison Framework

The Problem

• The abundance of file formats is a problem when preserving electronic records

• Why?
  • Will there be software to load the file in the future?
  • If not, will the specification for the format still exist?
  • Was the specification ever available to begin with (closed/proprietary formats)?

Page 3: The  Versus Comparison Framework

*.dwg, *.max, *.3ds, *.blend, *.k3d, *.w3d, *.ma, *.mb, *.mp, *.iam, *.lwo, *.c4d, *.pdf (*.prc, *.u3d), *.vtk, *.vtp, *.skp

Page 4: The  Versus Comparison Framework

Available 3D File Formats…

Page 5: The  Versus Comparison Framework

Converting Formats

• In order to preserve content for future use, one option is to convert the file to an open/standardized format that is likely to be supported for some time.
  • Store both this file and the original for provenance

• Ideally, with one file format for a particular content type, it will be easy for users to view/use the data.

Page 6: The  Versus Comparison Framework

Converting Formats (continued…)

• How and which format!?

• Fully supporting the many available formats is an enormous undertaking

• If a file format is closed/proprietary it may be difficult to retrieve the data directly from the file
  • May be possible to reverse engineer and recover some of the content

• Vendor file formats sometimes store pieces of information specific to application features that are not supported in other formats
  • Examples include: animations, physics, …
  • When converting to a format that doesn’t have a place for such information we must drop it.
  • Information loss…

Page 7: The  Versus Comparison Framework

Converting Formats (continued…)

• There are different ways of storing the 3D content itself
  • Faceted:
    • Comprised of vertices and faces
    • Popular within the graphics community
  • Boundary Representation:
    • Comprised of vertices, edges, edge loops, and primitive surfaces
    • Popular among CAD users
  • Constructive Solid Geometry:
    • Comprised of boolean operations on primitive volumes
  • …

Page 8: The  Versus Comparison Framework

Converting Formats (continued…)

• Translating geometry representation may not be trivial
  • B-Rep to Faceted:
    • Translating involves triangulating the surfaces created from the bounded primitives (tessellation)
    • The resulting sampled surface will suffer from aliasing at high viewing resolutions
    • Can compensate by performing a finer triangulation (i.e. more triangles and a larger file)
  • Faceted to B-Rep:
    • Translating in this direction is non-trivial!
    • How does one decide if a group of triangles should be grouped together as part of some larger primitive (e.g. part of a cylinder)?

Page 9: The  Versus Comparison Framework

Format | Geometry: Faceted, Parametric, CSG, B-Rep | Appearance: Color, Material, Texture, Bump | Scene: Lights, Views, Trans., Groups | Animation

3ds √ √ √ √ √ √ √ √ √

igs √ √ √ √ √ √ √

lwo √ √ √ √ √ √

obj √ √ √ √ √ √ √

ply √ √ √ √ √

stp √ √ √ √ √ √

wrl √ √ √ √ √ √ √ √ √ √ √

u3d √ √ √ √ √ √ √ √ √

x3d √ √ √ √ √ √ √ √ √ √ √

Converting Formats (continued…)

• How do we measure information loss?
• Which is the best format to use for preservation?

Page 10: The  Versus Comparison Framework

NCSA Polyglot (2009)

• Conversion service based on utilizing any and all available 3rd party software
  • Imposed Code Reuse: Re-attaching a programmable interface to compiled software.
  • Scripted operations within software
  • GUI scripting (e.g. AutoHotkey)
• Created a simple workflow referred to as an Input/Output Graph
• Compared files before/after conversion to measure information loss
• Distributed across multiple machines
• Web access

Page 11: The  Versus Comparison Framework

ISDA File Migration Tools

• Conversion Software Registry
• Software Servers
• Polyglot
• Versus

Page 12: The  Versus Comparison Framework

Software that can Convert between Formats

• There is a lot of software available, each with its own unique capabilities

• A lot of it is not free
  • It would be expensive to buy a package just to check if it truly is capable of converting between a desired pair of formats
• How can someone know what software to get for their needs?

http://isda.ncsa.illinois.edu/NARA/CSR

Page 13: The  Versus Comparison Framework

The Conversion Software Registry

Adobe 3D Reviewer

Page 15: The  Versus Comparison Framework

Input/Output Graphs

Adobe 3D Reviewer

Page 16: The  Versus Comparison Framework

Input/Output Graphs

3DS Max

Adobe 3D Reviewer

AutoCAD

Blender

Cinema 4D

K-3D

LightWave 3D

Maya

Wings 3D

Page 17: The  Versus Comparison Framework

Shortest conversion path

Input/Output Graphs

Page 18: The  Versus Comparison Framework

Software Servers

Page 19: The  Versus Comparison Framework

Software Server

• To program against software
  • i.e. to write new code that can utilize functionality within arbitrary software (compiled code, where the source code is probably not available).

• Imposed Code Reuse (or Software Reuse): The process of attaching an API-like interface to software so that its functionality can be called within new code.

Page 20: The  Versus Comparison Framework

Software Servers

• Shares the functionality of software over the web
  • In contrast to services which share data: ftpd, nfsd, smbd, httpd
  • Similar to services such as: telnetd, sshd, VNC, rdesktop

• The main difference is in the interface:
  • Uniform across all software

http://host:8182/software/<Application>/<Task>/<Output Format>/<InputFile>

  • Simple
  • Widely accessible
  • Capable of being programmed against

• Allows any desktop application to become a cloud based web service*
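
Because the interface is just a URL, a client only has to fill in the template shown above. The following is a minimal sketch, not the Software Server's own client code: the function name, the host `localhost`, the task name `convert`, and the file name are illustrative assumptions; only the path layout and port 8182 come from the slide.

```python
from urllib.parse import quote


def software_server_url(host, application, task, output_format, input_file, port=8182):
    """Fill in the slide's URL template.

    Each path segment is percent-encoded so that names containing spaces
    or slashes do not break the path structure.
    """
    segments = [application, task, output_format, input_file]
    path = "/".join(quote(segment, safe="") for segment in segments)
    return f"http://{host}:{port}/software/{path}"


# Hypothetical request: ask a Software Server to have Blender convert
# model.stl to OBJ. The task name "convert" is an assumption.
url = software_server_url("localhost", "Blender", "convert", "obj", "model.stl")
print(url)  # http://localhost:8182/software/Blender/convert/obj/model.stl
```

Any HTTP-capable tool (a browser, a script, another service) could then issue the request, which is what makes the interface "capable of being programmed against".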

Page 21: The  Versus Comparison Framework

Software Functionality Sharing

Page 28: The  Versus Comparison Framework

Software

• Excel 2010
• Word 2010
• PowerPoint 2010
• Publisher 2010
• OneNote
• Access 2007
• Wordpad
• Notepad
• Calculator2010
• Internet Explorer 8
• WinZip
• Photoshop CS5
• Adobe Acrobat
• ABBYY

• 3D Studio Max
• Adobe 3D Reviewer
• Google SketchUp
• Wings 3D
• Blender
• K-3D
• ParaView
• VTK
• Cyberware PLYTool
• NIST X3D Tool
• ImageMagick
• IrfanView
• GIMP
• Microsoft Paint

Page 29: The  Versus Comparison Framework

Polyglot

• Listens for Software Server broadcasts on the network
• Catalogues available input/output operations and constructs an I/O graph
• Identifies conversion paths between input and output formats
• Carries out CHAINED conversions
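
The I/O graph and path search described above can be sketched as follows. This is an illustrative sketch rather than Polyglot's actual implementation: formats are nodes, each (application, input, output) capability is a directed edge, and a plain breadth-first search stands in for whatever path selection Polyglot really uses. The function names and the capability list are assumptions.

```python
from collections import deque


def build_io_graph(capabilities):
    """capabilities: iterable of (application, input_format, output_format)."""
    graph = {}
    for application, source, target in capabilities:
        graph.setdefault(source, []).append((target, application))
    return graph


def shortest_conversion_path(graph, source, target):
    """Breadth-first search over formats.

    Returns a list of (application, from_format, to_format) steps with the
    fewest conversions, [] if source == target, or None if unreachable.
    """
    if source == target:
        return []
    parents = {source: None}
    queue = deque([source])
    while queue:
        fmt = queue.popleft()
        for nxt, application in graph.get(fmt, []):
            if nxt in parents:
                continue
            parents[nxt] = (fmt, application)
            if nxt == target:
                # Walk back through the parent links to recover the chain.
                steps, node = [], target
                while parents[node] is not None:
                    prev, via = parents[node]
                    steps.append((via, prev, node))
                    node = prev
                return steps[::-1]
            queue.append(nxt)
    return None


# Illustrative capabilities only; a real catalogue would come from
# Software Server broadcasts.
capabilities = [
    ("A3D Reviewer", "stp", "wrl"),
    ("Vrml97ToX3d", "wrl", "x3d"),
    ("X3dToVrml97", "x3d", "wrl"),
    ("A3D Reviewer", "wrl", "stp"),
]
graph = build_io_graph(capabilities)
path = shortest_conversion_path(graph, "stp", "x3d")
print(path)
```

Each step in the returned chain corresponds to one conversion request, so a chained conversion amounts to executing the steps in order and feeding each output into the next step.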

Page 32: The  Versus Comparison Framework

Versus

• Java library/framework for comparing file content
• Distributed architecture
• RESTful Web Interface
  • http://<host>/versus/comparisons
    • dataset1, dataset2
    • adapter, extractor, measure
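
A comparison request against this endpoint might be assembled as in the sketch below. Only the endpoint path and the five parameter names come from the slide; the use of a form-encoded POST, the host, the dataset URLs, and the adapter/extractor/measure identifiers are all assumptions made for illustration. The request is built but deliberately not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def comparison_request(host, dataset1, dataset2, adapter, extractor, measure):
    """Build (but do not send) a request for one Versus comparison.

    Assumption: the parameters travel as a form-encoded POST body.
    """
    fields = {
        "dataset1": dataset1,
        "dataset2": dataset2,
        "adapter": adapter,
        "extractor": extractor,
        "measure": measure,
    }
    body = urlencode(fields).encode("utf-8")
    return Request(
        f"http://{host}/versus/comparisons",
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# All values below are hypothetical placeholders.
request = comparison_request(
    "localhost:8080",
    "http://example.org/before.stl",
    "http://example.org/after.stl",
    "MeshAdapter",
    "SurfaceAreaExtractor",
    "EuclideanDistanceMeasure",
)
print(request.get_method(), request.full_url)
```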

Page 33: The  Versus Comparison Framework

Adapters

Name | Package | Description

Mesh | 3D | Loads 3D file content into a mesh made up of vertices and polygons connecting those vertices.

Audio | Audio | Encapsulation of audio files.

Bytes | Core | Simplest possible representation of data.

PDF | Doc2Learn | Encapsulation of the Doc2Learn PDF document.

Buffered Image | Image | Standard Java representation of image data.

Image Object | Image | Encapsulation of the Im2Learn Image Object.

SIFT GPU | GPU | Encapsulation of image data for SIFT GPU specific processing.

Page 34: The  Versus Comparison Framework

Descriptors

Name | Package | Description

Double Array | Core | A single dimensional array containing double values.

MD5 Digest | Core | A data integrity structure generated from the raw data.

Three Dimensional Double Array | Core | A three-dimensional array containing double values.

Vector | Core | A list of generic elements, allows greater storage flexibility.

Label Histogram | Doc2Learn | A histogram of labels obtained through Doc2Learn.

Keypoint | Image | Generic container for invariant feature detectors.

Pixel | Image | Generic type for various image package descriptors.

Color Layout | Image | A two dimensional grid of sub-images over the input image.

Grayscale Histogram | Image | A one-dimensional grayscale image histogram.

RGB Histogram | Image | A three-dimensional RGB color histogram.

Pixel Histogram | Image | A multidimensional histogram for a pixel’s intensity and position.

MOPS Features | Fiji | Invariant feature type used for image stitching.

SIFT Features | Fiji | Popular invariant feature type used for image comparison and object matching.

SIFT Gpu | Gpu | Same as SIFT but implemented through Gpu libraries.

Harris Corners | OpenCV | Well-known corner detector used for image inference, tracking, and recognition.

Hough Circles | OpenCV | Circles detected in an image with the Hough Transform.

Hough Lines | OpenCV | Lines detected in an image with the Hough Transform.

SURF Features | OpenCV | Invariant feature type that can be computed faster than standard SIFT.

Page 35: The  Versus Comparison Framework

Extractors

Name | Package | Description

Light Field | 3D | Surface is represented by silhouettes taken from 3 canonical positions capturing the surface shape minus any concavities (i.e. the convex hull).

Statistics | 3D | Ignores the surface and focuses on the vertices of a 3D object returning their mean and standard deviation. Simple, but fast to compute.

Surface Area | 3D | The sum of the area occupied by the polygons making up a surface. Considers surface and is still fast to compute.

Audio | Audio | Sampling of audio from existing file for histogram usage and comparison.

MD5 | Core | Creation of the MD5 hash from data.

Image Histogram | Doc2Learn | Generates a non-standard color histogram.

Line Graphics Histogram | Doc2Learn | Generates a histogram to compare vector graphics found in documents.

Text Histogram | Doc2Learn | Generates a label histogram based on word frequency.

Array Feature | Image | Generates the three-dimensional double array; a generic image container.

Color Average Vector Feature | Image | Generates an average RGB color over 9 regions taken from the image.

Grayscale Histogram | Image | Generates the histogram for grayscale images. Useful for image comparison.

Pixel Histogram | Image | Generates the multidimensional histogram for feature matching.

RGB Histogram | Image | Generates the histogram for color images. Useful for image comparison.

Signature Vector | Image | Feature vector (for an image) containing colorspace information and pixel position.

MOPS Features | Fiji | Open source implementation of the MOPS detector.

SIFT Features | Fiji | Open source implementation of Lowe’s method.

SIFT Gpu | Gpu | Gpu implementation of the SIFT detector.

Harris Corners | OpenCV | Corner detector for images.

Hough Circles | OpenCV | Circle detector for images.

Hough Lines | OpenCV | Line detector for images.

SURF Features | OpenCV | Open source implementation of the SURF detector.

Page 36: The  Versus Comparison Framework

Measures

Name | Package | Description

Chessboard Distance | Core | Also known as Chebyshev; the greatest difference along any coordinate dimension (between two vectors).

Dynamic Time Warping | Core | Similarity metric between two (possibly) varying sequences over time.

Euclidean Distance | Core | Distance between two n-dimensional points in Euclidean space.

Manhattan Distance | Core | Sum of the absolute differences of the coordinates of two points; the distance between them measured along right-angled axes.

MD5 Hash | Core | Binary measure; either equal or not.

Bhattacharyya Distance | Image | Measures the overlap between two probability distributions.

Neyman’s χ² | Image | Tests the goodness of fit between two distributions. Variant of the standard χ² test.

Czekanowski Distance | Image | Sum of the absolute value of the difference of two distributions divided by the sum of the two distributions.

Histogram Euclidean Distance | Doc2Learn / Image | Bin-by-bin comparison using the standard Euclidean distance. Well known and widely used.

Histogram Intersection | Doc2Learn / Image | Sum of the absolute value of the difference of two distributions, scaled by one-half. Well known and widely used.

KL Divergence | Image | Non-symmetric measure of the difference between two probability distributions. Well-known measure of relative entropy.

Jeffrey Divergence | Image | Symmetric measure of the difference between two probability distributions.

Motyka Distance | Image | Sum of the maximum of two distributions divided by the sum of the two distributions.

Normalized Cross Correlation | Image | Similar to sum of squared differences; invariant to the magnitude of two points.

Ruzicka Similarity | Image | Sum of the minimum of two distributions divided by the sum of the maximum of the two distributions.

Sum of Squared Differences | Image | Sum of squared differences between two arrays, cheaper to compute than Euclidean distance.

Tanimoto Distance | Image | Sum of the difference of the max and the min of two distributions divided by the sum of the max.

Wave Hedges Distance | Image | Sum of the absolute value of the difference between two distributions divided by their maximum.

Invariant Feature Comparison | Fiji / OpenCV / SIFT Gpu | Compares the invariant features between two images by calculating the pairwise Euclidean distance and voting for a match using a predetermined threshold.

Earth Mover’s Distance | OpenCV | Measures the similarity between two probability distributions. This is the minimum cost of transforming one distribution to the other.
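
To make a few of the table's descriptions concrete, here is a minimal sketch of five of the measures written directly from their one-line definitions. It is plain Python for illustration and is not the Versus (Java) implementation; the function names and sample values are assumptions, and the inputs are assumed to be equal-length sequences with non-zero sums.

```python
import math


def chessboard(a, b):
    """Greatest difference along any coordinate (Chebyshev)."""
    return max(abs(x - y) for x, y in zip(a, b))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Straight-line distance between two n-dimensional points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def czekanowski(p, q):
    """Sum of |p - q| divided by the sum of p + q."""
    return sum(abs(x - y) for x, y in zip(p, q)) / sum(x + y for x, y in zip(p, q))


def ruzicka(p, q):
    """Sum of bin-wise minima divided by the sum of bin-wise maxima."""
    return sum(min(x, y) for x, y in zip(p, q)) / sum(max(x, y) for x, y in zip(p, q))


a, b = [1, 2, 3], [4, 6, 3]
print(chessboard(a, b), manhattan(a, b), euclidean(a, b))

# Two small normalized histograms (made-up values).
p, q = [0.5, 0.5, 0.0], [0.25, 0.25, 0.5]
print(czekanowski(p, q), ruzicka(p, q))
```

Note that Ruzicka is a similarity (1.0 for identical histograms) while the others are distances (0 for identical inputs), so scores from different measures are not directly comparable without normalization.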

Page 37: The  Versus Comparison Framework

Extractor Previews

Page 38: The  Versus Comparison Framework

Measuring Information Loss

We would like to assign a value to each conversion edge …

• With a “universal” converter we could convert files from every format A to every other format B

• Assuming we then had a loader for both format A and format B we could load and compare the 3D content independent of how it is stored.

Page 39: The  Versus Comparison Framework

Measuring 3D Information Loss

good… (e.g. 1.0) not so good… (e.g. 0.1)

• Adobe 3D Reviewer
• Blender
• Cyberware PlyTool
• K-3D
• NIST VRML/X3D
• VTK

Page 40: The  Versus Comparison Framework

Measuring 3D Information Loss

• Data representation
  • Meshes
  • Loaders
• Use 3D similarity as a means of comparing 3D models
  • Statistics
  • Surface Area [Brunnermeier, RTI 1999]
  • Spin Images [Johnson, PAMI 1999]
  • Light Fields [Chen, Eurographics 2003]

Page 41: The  Versus Comparison Framework

Statistics

• Use the mean and standard deviation of the vertices to represent the model

• Simple but fast to compute
• Sensitive to size and orientation of the model
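
A minimal sketch of this descriptor follows. Two details are assumptions, since the slide does not specify them: the statistics are taken per coordinate axis, and the population standard deviation is used. The function name and the cube example are illustrative.

```python
from statistics import fmean, pstdev


def vertex_statistics(vertices):
    """Return (mean, std) of the vertex coordinates, one value per axis.

    vertices: list of (x, y, z) tuples. Faces are ignored entirely.
    """
    axes = list(zip(*vertices))  # transpose into per-axis coordinate lists
    mean = tuple(fmean(axis) for axis in axes)
    std = tuple(pstdev(axis) for axis in axes)
    return mean, std


# Corners of a unit cube.
cube = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]
mean, std = vertex_statistics(cube)

# The same cube scaled by 2: both statistics change, which illustrates
# the sensitivity to model size noted above.
scaled = [(2 * x, 2 * y, 2 * z) for x, y, z in cube]
scaled_mean, scaled_std = vertex_statistics(scaled)
print(mean, std, scaled_mean, scaled_std)
```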

Page 42: The  Versus Comparison Framework

Surface Area

• Use the sum of face areas to represent the model
• Also simple and fast to compute
• Sensitive to size, somewhat sensitive to shape. Will detect loss of faces.
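
The surface area descriptor can be sketched as below. This is an illustration under stated assumptions, not the Versus extractor: faces are given as tuples of vertex indices, and polygons with more than three vertices are fan-triangulated from their first vertex, which is only correct for planar convex polygons.

```python
import math


def triangle_area(a, b, c):
    """Half the magnitude of the cross product of two edge vectors."""
    ux, uy, uz = b[0] - a[0], b[1] - a[1], b[2] - a[2]
    vx, vy, vz = c[0] - a[0], c[1] - a[1], c[2] - a[2]
    cx = uy * vz - uz * vy
    cy = uz * vx - ux * vz
    cz = ux * vy - uy * vx
    return 0.5 * math.sqrt(cx * cx + cy * cy + cz * cz)


def surface_area(vertices, faces):
    """Sum of face areas; each face is a tuple of indices into vertices."""
    total = 0.0
    for face in faces:
        anchor = vertices[face[0]]
        # Fan triangulation: (v0, v1, v2), (v0, v2, v3), ...
        for i in range(1, len(face) - 1):
            total += triangle_area(anchor, vertices[face[i]], vertices[face[i + 1]])
    return total


# A unit square stored as two triangles.
vertices = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
faces = [(0, 1, 2), (0, 2, 3)]
full = surface_area(vertices, faces)

# Dropping one face, as a lossy conversion might, halves the area.
damaged = surface_area(vertices, faces[:1])
print(full, damaged)
```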

Page 43: The  Versus Comparison Framework

Light Fields [Chen, 2003]

• Compares silhouettes from various viewing angles around a model.

Page 48: The  Versus Comparison Framework

Light Fields

• Fairly fast to compute
• Sensitive to shape of convex hull, invariant to rigid transformations

Page 49: The  Versus Comparison Framework

Spin Images [Johnson, 1999]

• 2D histograms of the in plane and out of plane distances of vertices neighboring a given vertex.

[Figure: vertex p with normal N, a neighboring vertex q, and the two distances a and b]
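
The two distances that index a spin image can be computed as in the sketch below, which follows the standard definition from Johnson's work: for an oriented point p with unit normal n, a neighbor x maps to β (signed distance along the normal) and α (distance from the line through p along n). The binning here is a simplification that counts each point in a single bin, whereas the original method spreads contributions by bilinear interpolation; the function names, bin size, and image size are assumptions.

```python
import math


def spin_coordinates(p, n, x):
    """Return (alpha, beta) of point x relative to oriented point (p, n).

    n is assumed to be a unit-length normal.
    """
    d = [x[i] - p[i] for i in range(3)]
    beta = sum(n[i] * d[i] for i in range(3))
    # Clamp at zero to guard against tiny negative values from rounding.
    alpha = math.sqrt(max(sum(c * c for c in d) - beta * beta, 0.0))
    return alpha, beta


def spin_image(p, n, points, bin_size, size):
    """Accumulate a size x size 2D histogram of (alpha, beta) pairs.

    Rows index beta (centered on zero), columns index alpha.
    Points falling outside the image are ignored.
    """
    image = [[0] * size for _ in range(size)]
    for x in points:
        alpha, beta = spin_coordinates(p, n, x)
        col = int(alpha // bin_size)
        row = int((beta + size * bin_size / 2) // bin_size)
        if 0 <= row < size and 0 <= col < size:
            image[row][col] += 1
    return image


p, n = (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
alpha, beta = spin_coordinates(p, n, (3.0, 4.0, 2.0))
image = spin_image(p, n, [(3.0, 4.0, 2.0), (1.0, 0.0, -1.0)], bin_size=2.0, size=4)
print(alpha, beta)
```

Because α and β depend only on positions relative to p and n, the histogram is unchanged when the whole model is rotated or translated.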

Page 55: The  Versus Comparison Framework

Spin Images

• Expensive to compute
• Sensitive to relative vertex position, ignores surface, invariant to rotations and translations

Page 56: The  Versus Comparison Framework

3D Information Loss and 3D File Loaders

• If we were able to load every file format we really wouldn't need to use software reuse to convert via 3rd party applications.

• Implement loaders for a small number of formats that will make up our test data set
  • Convert from format A along path to some format B then back to A again
  • Estimate path scores by comparing before/after content and assigning scores to all edges along path
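
One simple way to turn round-trip results into edge values is sketched below. The slides do not say how a path score is distributed over its edges, so the rule used here (each edge receives the mean score of every round trip that traverses it) is purely an assumption, as are the function name and the sample scores.

```python
from collections import defaultdict


def edge_scores(round_trips):
    """Average round-trip scores onto the edges they traverse.

    round_trips: iterable of (edges, score), where edges is a list of
    (application, from_format, to_format) tuples and score is the
    before/after similarity measured for that round trip.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for edges, score in round_trips:
        for edge in edges:
            totals[edge] += score
            counts[edge] += 1
    return {edge: totals[edge] / counts[edge] for edge in totals}


# Made-up scores for two round trips that share their first edge.
stp_to_wrl = ("A3D Reviewer", "stp", "wrl")
wrl_to_stp = ("A3D Reviewer", "wrl", "stp")
wrl_to_x3d = ("Vrml97ToX3d", "wrl", "x3d")
x3d_to_wrl = ("X3dToVrml97", "x3d", "wrl")

round_trips = [
    ([stp_to_wrl, wrl_to_stp], 0.8),
    ([stp_to_wrl, wrl_to_x3d, x3d_to_wrl, wrl_to_stp], 0.4),
]
scores = edge_scores(round_trips)
print(scores[stp_to_wrl], scores[wrl_to_x3d])
```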

Page 57: The  Versus Comparison Framework

STP to X3D to STP

[Figure: conversion path STP → WRL (A3D Reviewer) → X3D (Vrml97ToX3d) → WRL (X3dToVrml97) → STP (A3D Reviewer)]

Page 58: The  Versus Comparison Framework

[Figure: architecture diagram. Components shown: Web Browser, Software Server Client, Polyglot Client, Polyglot Panel, Polyglot Web Server, Polyglot Server, Software Servers (each with Application Wrappers A, B, C … X, Y, Z around Applications A, B, C … X, Y, Z), IOGraph Weights Tool, Versus. Layers labeled: Closed Source Software, Software Reuse, Conversion (Polyglot), Comparison (Versus), Polyglot Steward.]

Page 59: The  Versus Comparison Framework

Which conversion preserved the most?

• Using the light fields measure:
  • Emphasizes shape through silhouettes
  • Adobe 3D Reviewer between *.pdf and *.stp (61.67)

• Using the spin image measure:
  • Emphasizes shape through relative vertex positions
  • Adobe 3D Reviewer between *.obj and *.pdf (59.07)

Page 60: The  Versus Comparison Framework

Which is the best format?

Within the context of preservation we can define this as the format that retains on average the most information when other formats are converted to it.

• Using the light fields measure:
  • Emphasizes shape through silhouettes
  • *.stp (40.73)

• Using the spin image measure:
  • Emphasizes shape through relative vertex positions
  • *.stl (34.89)
  • *.stp being a CAD format has more variability in vertex positions due to tessellation
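
The definition above amounts to averaging conversion scores per target format and picking the highest mean. A minimal sketch, assuming higher scores mean more information retained; the function name and all score values are made up for illustration and are not the results reported on this slide.

```python
from collections import defaultdict
from statistics import fmean


def rank_target_formats(conversions):
    """Rank formats by mean score of conversions INTO them.

    conversions: iterable of (source_format, target_format, score).
    Returns a list of (mean_score, target_format), best first.
    """
    by_target = defaultdict(list)
    for _source, target, score in conversions:
        by_target[target].append(score)
    return sorted(((fmean(s), t) for t, s in by_target.items()), reverse=True)


# Hypothetical scores.
conversions = [
    ("obj", "stp", 0.9),
    ("stl", "stp", 0.7),
    ("obj", "stl", 0.6),
    ("stp", "stl", 0.5),
]
ranking = rank_target_formats(conversions)
print(ranking[0][1])  # format with the highest mean incoming score
```

As the two bullets above show, the answer depends on the measure chosen, so a ranking like this is only meaningful per measure.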

Page 61: The  Versus Comparison Framework

Word Spotting in Versus

• Include Word Spotting feature code as extractor in Versus
  • Just another form of content based comparison
  • Specific to handwriting of course

• Include general purpose efficient indexing into Versus
  • e.g. hierarchical agglomerative clustering
  • Not just for Word Spotting

Page 62: The  Versus Comparison Framework

Conclusion

Image Formats by Information Preservation

1. PPM
2. PNG
3. GIF
4. JPG
5. …

Image Formats by File Size

1. JPG
2. GIF
3. PNG
4. PPM
5. …

Image Software by Information Preservation

1. ImageMagick
2. Adobe Photoshop
3. GIMP
4. Microsoft Paint
5. …

3D Formats by File Size

1. MP4
2. MAX
3. PLY
4. OBJ
5. …

3D Formats by Information Preservation

1. STL
2. STP
3. OBJ
4. MAX
5. …

3D Software by Information Preservation

1. Adobe 3D Reviewer
2. 3DS Max
3. Maya
4. Blender
5. …

Video Formats by Information Preservation

1. AVI
2. MOV
3. WMV
4. MPG
5. …

*Rankings shown demonstrate the type of output that will be obtained from future work and do NOT represent actual results!

1,682 formats
2,007 applications

Page 63: The  Versus Comparison Framework

Versus and CBR

• CBIR: Content based image retrieval
• Versus as a content based comparison framework can serve as the back end for general purpose content based retrieval

Page 64: The  Versus Comparison Framework

Acknowledgements

• This research was partially supported by a National Archives and Records Administration (NARA) supplement to NSF PACI cooperative agreement CA #SCI-9619019 and by NCSA Industrial Partners.


• The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the National Science Foundation, the National Archives and Records Administration, or the U.S. government.

Page 65: The  Versus Comparison Framework

The ISDA Tools (Free and Open Source)

Image, Spatial, and Data Analysis Group

http://isda.ncsa.illinois.edu

Kenton McHenry
Rob Kooper
Michal Ondrejcek
Luigi Marini