
Creating Good Test Data

Oliver Zendel

oliver.zendel@ait.ac.at

AIT Austrian Institute of Technology

www.vitro-testing.com

ECCV 2016 Workshop on Datasets and Performance Analysis in Early Vision

Saturday, 2016-10-08 10h15, Oudemanhuispoort of the University of Amsterdam

What should be present in my test data?

Goal

Good test data includes:

Enough variation

Systematically organized

Low redundancy


Outline

Basics

Why is validation of CV hard?

CV-HAZOP

Tool for evaluation and planning of test data

Outlook

Relevant complementing topics



SW Quality Assurance in General

Verification

System meets specification

Implementation is correct

“Are we building the solution right?”

Validation

System fulfills intended purpose

Algorithm is suitable/robust

“Are we building the right solution?”


Quality Assurance in CV (1)

Verification

Interested in code coverage

Formal methods with proofs -> can guarantee completeness

A rich tool set exists

Generic SW tools are applicable to CV implementations

Examples: clang static code analysis; Event-B


Quality Assurance in CV (2)

Validation

Interested in data coverage

Based on experience – no 100% guarantees

Very application/intent-specific; special considerations for CV

Amount of possible input data >> amount of possible test data

State of the Art

Example: Test data in Stereo Vision

Middlebury stereo database: Scharstein et al. [Sch2002], [Sch2014]

KITTI Vision Benchmark Suite: Geiger et al. [Gei2012], [Men2015]

Synthetic: Sintel [But2012]; VKITTI [Gai2016]; SYNTHIA [Ros2016]

Evaluation of dataset bias:

Ponce et al. [Pon2006]

Pinto et al. [Pin2008]

Torralba and Efros [Tor2011]

Systematic high-level approach?

Equivalence Classes

Partitioning of input data

Cluster/Segment input data into distinct classes

Represent each class by finite number of test cases



Continuous parameters -> finite number of test cases
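
As a rough illustration of how a continuous parameter can be reduced to a finite number of test cases, the sketch below bins a scalar parameter (here a hypothetical scene-brightness value) into equivalence classes and keeps one representative per class; the parameter, class names, and bin edges are illustrative assumptions, not part of the CV-HAZOP method itself.

# Minimal sketch: partition a continuous parameter into equivalence classes
# and keep one representative test case per class. Parameter and bin edges
# are illustrative assumptions.
import bisect

# Hypothetical equivalence classes for a "scene brightness" parameter (in lux).
BIN_EDGES = [10, 100, 1000, 10000]            # boundaries between classes
CLASS_NAMES = ["night", "indoor dim", "indoor bright", "overcast", "sunny"]

def equivalence_class(brightness_lux: float) -> str:
    """Map a continuous brightness value to its equivalence class."""
    return CLASS_NAMES[bisect.bisect_right(BIN_EDGES, brightness_lux)]

def pick_representatives(samples: list[float]) -> dict[str, float]:
    """Keep a single representative sample per equivalence class."""
    reps: dict[str, float] = {}
    for value in samples:
        reps.setdefault(equivalence_class(value), value)
    return reps

if __name__ == "__main__":
    measured = [3.0, 55.0, 420.0, 480.0, 20000.0, 7.0]
    print(pick_representatives(measured))
    # {'night': 3.0, 'indoor dim': 55.0, 'indoor bright': 420.0, 'sunny': 20000.0}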

Equivalent Test Images?

Mathematical definition for:

"All images showing trees"

"All images without a reindeer"


Not feasible!

Use Semantics

Test images classified using two semantics:

Domain aspects

Vulnerability aspects

Domain aspects:

Open-world vs. closed-world

Vulnerabilities:

Situations / Relations known to cause problems for CV

Examples:

Low contrast, Reflections, Glare, Shadows, Image Noise, Occlusions, …


Mapping to the Goals

Good test data includes:

Enough variation

Enough equivalence classes

Systematically organized

Meaningful organization also for equivalence classes

Low redundancy

Right amount of representatives per class


Outline

Basics

Why is validation of CV hard?

CV-HAZOP

Tool for evaluation and planning of test data

Outlook

Relevant complementing topics


Our Approach: Checklist!

All critical situations that decrease CV output quality

Just tick off entries for your test data

Coverage vs. all entries (see sketch below)

Systematic Approach: Risk Analysis

Analyze a complex system and its interactions

Chosen method: HAZOP

Hazard and Operability Study
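
A hedged sketch of the "tick off entries" idea above: given the set of checklist entry ids a dataset covers, coverage is simply the covered fraction of all applicable entries. The entry ids and the example counts are made up for illustration (the deck later states that roughly 500 entries apply to stereo vision).

# Sketch: checklist coverage = ticked-off entries vs. all applicable entries.
# Entry ids and counts below are illustrative, not actual CV-HAZOP ids.
def checklist_coverage(applicable: set[str], covered: set[str]) -> float:
    """Fraction of applicable checklist entries covered by the test data."""
    if not applicable:
        return 0.0
    return len(covered & applicable) / len(applicable)

applicable_entries = {f"entry_{i}" for i in range(1, 501)}   # ~500 stereo-relevant entries
covered_entries = {f"entry_{i}" for i in range(1, 181)}      # entries found in the dataset
print(f"coverage: {checklist_coverage(applicable_entries, covered_entries):.1%}")  # 36.0%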


The Generic CV Model

One generic model

Identify subcomponents valid for all CV applications

Information flow as a flow diagram

HAZOP Ingredients

Locations

Smallest components/parts of model

Parameters

Descriptive for each location

Guide Words

How parameters can deviate from the expected

Locations

Model Parts -> Locations

Recursion = own location “Objects”

Observer as two locations:

“Opto-mechanics” and “Electronics”


Parameters: Characterize Locations

Example: Light Sources

Number

Position

Area

Spectrum

Texture

Intensity

Beam properties

Wave properties


Guide Words


HAZOP Ingredients

7 Locations

Subcomponents of the model

52 Parameters

Descriptive for each location

E.g. for Light Sources: Number, Position, Area, Spectrum, Texture, Intensity, Beam properties, and Wave properties

17 Guide Words

How parameters can deviate from the expected

E.g. More, Less, Other Than, Faster, …


CV-HAZOP Execution


Experts assign meanings to each Parameter / Guide Word combination

Derive Consequences and Risks from each Meaning

CV-HAZOP Example

Combination: Location "Light Sources" x Parameter "Intensity" x Guide Word "More"

Meaning: A light source shines stronger than expected

Consequence: Too much light is in the scene

Risk: Overexposure

CV-HAZOP Example (2)

Combination: Location "Medium" x Parameter "Particles" x Guide Word "Faster"

Meaning: Particles move faster than expected

Consequence: Motion blur of particles

Risk: Scene is severely occluded
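
One way to hold such entries in code is a small record per Location x Parameter x Guide Word combination, carrying the expert-assigned meaning, consequence, and risk. This is only a sketch of a possible representation with assumed field names; the actual CV-HAZOP list is available via vitro-testing.com.

# Sketch: a possible in-code representation of CV-HAZOP entries.
# Field names are assumptions; the official list is published on vitro-testing.com.
from dataclasses import dataclass

@dataclass
class HazopEntry:
    location: str      # e.g. "Light Sources", "Medium", "Objects"
    parameter: str     # e.g. "Intensity", "Particles"
    guide_word: str    # e.g. "More", "Faster", "Less"
    meaning: str       # expert interpretation of the deviation
    consequence: str   # effect on the observed scene/signal
    risk: str          # resulting hazard for the CV algorithm

entries = [
    HazopEntry("Light Sources", "Intensity", "More",
               "A light source shines stronger than expected",
               "Too much light is in the scene", "Overexposure"),
    HazopEntry("Medium", "Particles", "Faster",
               "Particles move faster than expected",
               "Motion blur of particles", "Scene is severely occluded"),
]

for e in entries:
    print(f"{e.location} x {e.parameter} x {e.guide_word}: {e.risk}")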

HAZOP Combinations

52 Parameters x 17 Guide Words = 884 Combinations
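
The combination count follows directly from enumerating every parameter with every guide word; a minimal sketch, with placeholder lists since the full parameter and guide-word tables are not reproduced here:

# Sketch: the raw combination space is the cross product of all parameters
# (across all locations) with all guide words. The lists below are stand-ins;
# the real CV-HAZOP tables contain 52 parameters and 17 guide words.
from itertools import product

parameters = [f"param_{i}" for i in range(52)]     # placeholder for the 52 parameters
guide_words = [f"gw_{i}" for i in range(17)]       # placeholder for the 17 guide words

combinations = list(product(parameters, guide_words))
print(len(combinations))   # 884 = 52 x 17; experts then keep only the meaningful ones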


Valid Combinations


Results

Nine experts, one year -> 947 unique entries

See vitro-testing.com


Evaluation

Proof-of-concept:

Generic to specific

Entries in list -> algorithm output quality decreases

Applied to stereo vision test data sets

Middlebury (Original) [Sch2002]

Middlebury 2014 [Sch2014]

KITTI [Gei2012]

Found any Risks? Where?

About 500 entries valid for the stereo vision task

Examples:


Dataset   Image Pairs   Images w. Risks   Found Risks   Number of Annotations
MB 06     26            19                34            55
MB 14     23            17                57            80
KITTI     194           62                76            101

Example risks annotated: Glare, No Texture, Mirroring, Interlens Reflection, Underexposure

Evaluating our Approach


Test influence on output quality of:

Shape of risk region (shape)

Only position and area (box)

In comparison, controls:

Random position, same area (rand)

The entire image (all)

Annotation for No Texture

Percentage of erroneous disparity output:

Stereo Vision Evaluation


Figure: percentage of erroneous disparities (0% to 100%) for the shape, box, rand, and all regions

Identified risks indeed increase test data difficulty

Areas identified by checklist result in higher error ratios
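
A hedged sketch of how such per-region error percentages can be computed: count bad disparities (error above a threshold) inside each mask and divide by the valid pixels in that mask. The array names, the 3-pixel threshold, and the mask construction are assumptions for illustration, not the exact evaluation code behind the slides.

# Sketch: percentage of erroneous disparities inside an annotated risk region.
# Threshold and variable names are illustrative assumptions.
import numpy as np

def error_ratio(disparity: np.ndarray, gt: np.ndarray, mask: np.ndarray,
                threshold: float = 3.0) -> float:
    """Fraction of pixels in `mask` whose disparity error exceeds `threshold`."""
    valid = mask & np.isfinite(gt) & np.isfinite(disparity)
    if valid.sum() == 0:
        return float("nan")
    errors = np.abs(disparity[valid] - gt[valid]) > threshold
    return float(errors.mean())

# Usage with the four region types from the slide:
# shape = annotated hazard outline, box = its bounding box,
# rand = same-area box at a random position, all = whole image.
# ratios = {name: error_ratio(disp, gt, m) for name, m in
#           {"shape": shape_mask, "box": box_mask,
#            "rand": rand_mask, "all": np.ones_like(gt, dtype=bool)}.items()}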

Multiple Algorithms and Datasets

Percentage of erroneous disparity output:

SAD [Kon1998], CENSUS [Hum2010], SGBM [Hir2008], CVF [Rhe2011], PatchMatch [Ble2011], ST-2 [Mei2013]

Areas identified by checklist result in higher error ratios


How do I apply this?

Start with the whole list

Visit vitro-testing.com

Filter out specific entries

Concretize entries

CV-HAZOP entries are generic (up to hazard level)

Interpret entries for your application (intent + domain)

Guide test data creation

Use list to guide and categorize test data

Test-driven development: Iterate!

Change focus / test data amount based on results from evaluation


Example:

Concretize entry for stereo vision:

Object / Less / Texture

Meaning:

Object has less texture than expected

Consequence:

Texture correlation quality is reduced

Hazard (for stereo vision):

Images with large textureless surfaces on the same epipolar lines prevent correct correlation

Find in existing test data sets or create anew

Starting distribution: based on experience, e.g. 10 images per equivalence class (= hazard entry)

Evaluation: indeed problematic -> create/use more test data here

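
A minimal sketch of this iterate step, assuming a per-hazard error rate from the last evaluation round: start with a flat budget (e.g. 10 images per hazard entry) and shift images toward the entries that turned out to be problematic. The weighting rule and numbers are assumptions, not a prescribed CV-HAZOP procedure.

# Sketch: test-driven iteration on the per-hazard test data budget.
# Starting budget and weighting rule are illustrative assumptions.
def reallocate_budget(error_rates: dict[str, float],
                      total_images: int,
                      min_per_entry: int = 5) -> dict[str, int]:
    """Give each hazard entry a floor, then spend the rest proportional to its error rate."""
    floor = {e: min_per_entry for e in error_rates}
    remaining = total_images - min_per_entry * len(error_rates)
    total_err = sum(error_rates.values()) or 1.0
    extra = {e: int(round(remaining * r / total_err)) for e, r in error_rates.items()}
    return {e: floor[e] + extra[e] for e in error_rates}

# Round 1: 10 images per entry (30 total), then re-weight by observed error rates.
observed = {"less_texture": 0.42, "glare": 0.25, "motion_blur": 0.08}
print(reallocate_budget(observed, total_images=30))
# {'less_texture': 13, 'glare': 10, 'motion_blur': 7}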


Goals in Regard to Vulnerabilities

Good test data includes:

Enough variation

Checklist enforces many different scenarios

Systematically organized

Test cases are organized by checklist order/categories

Low redundancy

Process of applying domain/intent gives insights into priorities


Outline

Basics

Why is validation of CV hard?

CV-HAZOP

Tool for evaluation and planning of test data

Outlook

Relevant complementing topics


Ground Truth Quality

Label noise [Bow2001]

People interpret the same data differently

Robust statistics over many annotations

Use crowd-sourcing for mass annotation [Don2013]

Measurement errors distort GT

Add error bars

Stereo Ground Truth With Error Bars [Kon2015]:

LIDAR based data transformed to disparity

Combination of multiple influences:

2D feature annotation, pose estimation, bundle adjustment, stereo camera calibration (intrinsic and extrinsic)
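
To make the label-noise point concrete, here is a small sketch (my own illustration, not the [Kon2015] error-propagation pipeline) that aggregates several annotators' disparity maps with a per-pixel median and reports a spread as a simple error bar.

# Sketch: robust statistics over many annotations - per-pixel median as the GT
# estimate and the median absolute deviation as a crude error bar.
# This illustrates the idea only; it is not the [Kon2015] method.
import numpy as np

def aggregate_annotations(annotations: np.ndarray):
    """annotations: stack of shape (num_annotators, H, W)."""
    gt_estimate = np.median(annotations, axis=0)                      # robust central value
    error_bar = np.median(np.abs(annotations - gt_estimate), axis=0)  # per-pixel MAD
    return gt_estimate, error_bar

annos = np.stack([np.full((2, 2), 10.0),
                  np.full((2, 2), 10.5),
                  np.full((2, 2), 14.0)])    # one annotator disagrees strongly
gt, err = aggregate_annotations(annos)
print(gt[0, 0], err[0, 0])   # 10.5 and 0.5: the outlier barely moves the estimate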


Synthetic vs. Real

Is artificial test data valid for real-world applications?

Rendering artifacts can create false alarms

Clean data can be too easy (e.g. no sensor noise)

Realism must match the algorithm, not humans

Computer graphics realism is improving rapidly

Physically correct rendering is getting faster; at least for offline data it is feasible

Many benefits

Perfect ground truth without label noise

Safe simulation of dangerous situations

Generation of specific scenes

Systematic sampling of parameters


Domain aspects

Building blocks of our scenery

Possible objects (expected and unexpected ones!)

Static scenery, background, clutter

Relations / Rules of environment

Physics

Behavior of actors

Interaction between actors

For more: Workshop on Quality Assurance in Computer Vision at ICTSS-2016, 2016-10-19, Graz, Austria

Outlook CV-HAZOP

Apply checklist to existing test data sets

Create merged test data set spotlighting hazards

Create new test data for missing hazards

Create test data set for parameter sweeping

Sample specific parameter

Find breaking point of algorithm (see sketch below)

Deep learning training data vs. test data

Different ratio of corner cases vs. normality

Too many difficult cases will prevent learning
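
A hedged sketch of the parameter-sweeping idea referenced above: sample one parameter (here, synthetic image noise) over a range, evaluate the algorithm at each step, and report the first value where quality drops below a chosen threshold. The quality metric, the 0.8 threshold, and the run_algorithm hook are assumptions; any rendered parameter (fog density, light intensity, ...) could take the place of noise.

# Sketch: sweep one parameter and report the algorithm's breaking point.
# `run_algorithm` is a placeholder for the system under test; the quality
# threshold of 0.8 is an arbitrary example.
from typing import Callable, Optional

def find_breaking_point(run_algorithm: Callable[[float], float],
                        values: list[float],
                        min_quality: float = 0.8) -> Optional[float]:
    """Return the first swept value at which quality falls below `min_quality`."""
    for v in values:                      # assumes values are ordered easy -> hard
        quality = run_algorithm(v)        # e.g. fraction of correct disparities
        if quality < min_quality:
            return v
    return None                           # never broke within the swept range

# Example with a fake system whose quality degrades with noise sigma:
fake_system = lambda sigma: max(0.0, 1.0 - 0.04 * sigma)
noise_levels = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
print(find_breaking_point(fake_system, noise_levels))   # 6.0 (quality 0.76 < 0.8)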


Conclusion

Validation for CV has unique needs

Usage of checklists increases the quality of new test data

CV-HAZOP is a useful tool to create checklists for robustness testing

Better test data -> better systems

Participate, access CV-HAZOP and data sets:

www.vitro-testing.com

Contact: oliver.zendel@ait.ac.at

References


[Bow2001] K. Bowyer, C. Kranenburg, and S. Dougherty. Edge Detector Evaluation Using Empirical ROC Curves. Computer Vision and Image Understanding, 84, 2001.

[Ble2011] M. Bleyer, C. Rhemann, and C. Rother. PatchMatch stereo - stereo matching with slanted support windows. In British Machine Vision Conference, 2011.

[Don2013] A. Donath and D. Kondermann. Is crowdsourcing for optical flow ground truth generation feasible? In Proceedings of the International Conference on Computer Vision Systems, 2013.

[But2012] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A Naturalistic Open Source Movie for Optical Flow Evaluation. In European Conference on Computer Vision, 2012.

[Gai2016] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig. Virtual Worlds as Proxy for Multi-Object Tracking Analysis. In Computer Vision and Pattern Recognition, 2016.

[Gei2012] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Computer Vision and Pattern Recognition, 2012.

[Hir2008] H. Hirschmüller. Stereo processing by semiglobal matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):328-341, 2008.

[Hum2010] M. Humenberger, C. Zinner, M. Weber, W. Kubinger, and M. Vincze. A fast stereo matching algorithm suitable for embedded real-time systems. Computer Vision and Image Understanding, 2010.

[Kon1998] K. Konolige. Small vision systems: Hardware and implementation. In Robotics Research. Springer, 1998.

[Kon2015] D. Kondermann, R. Nair, S. Meister, W. Mischler, B. Güssefeld, K. Honauer, S. Hofmann, C. Brenner, and B. Jähne. Stereo ground truth with error bars. In Asian Conference on Computer Vision, 2015.

[Mei2013] X. Mei, X. Sun, W. Dong, H. Wang, and X. Zhang. Segment-tree based cost aggregation for stereo matching. In Computer Vision and Pattern Recognition, pages 313-320, 2013.

[Men2015] M. Menze and A. Geiger. Object Scene Flow for Autonomous Vehicles. In Computer Vision and Pattern Recognition, 2015.

[Pin2008] N. Pinto, D. D. Cox, and J. J. DiCarlo. Why is real-world visual object recognition hard? PLoS Computational Biology, 4(1), 2008.

[Pon2006] J. Ponce, T. L. Berg, M. Everingham, D. A. Forsyth, M. Hebert, S. Lazebnik, M. Marszalek, C. Schmid, B. C. Russell, A. Torralba, et al. Dataset issues in object recognition. In Toward Category-Level Object Recognition, pages 29-48. Springer, 2006.

[Rhe2011] C. Rhemann, A. Hosni, M. Bleyer, C. Rother, and M. Gelautz. Fast cost-volume filtering for visual correspondence and beyond. In Computer Vision and Pattern Recognition, pages 3017-3024, 2011.

[Ros2016] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. Lopez. The SYNTHIA Dataset. In Computer Vision and Pattern Recognition, 2016.

[Sch2002] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47(1):7-42, 2002.

[Sch2014] D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nesic, X. Wang, and P. Westling. High-resolution stereo datasets with subpixel-accurate ground truth. In Pattern Recognition, pages 31-42. Springer, 2014.

[Tor2011] A. Torralba and A. A. Efros. Unbiased look at dataset bias. In Computer Vision and Pattern Recognition, pages 1521-1528, 2011.


Algorithms used:

SAD [Kon1998], CENSUS [Hum2010], SGBM [Hir2008], CVF [Rhe2011], PatchMatch [Ble2011], ST-2 [Mei2013]


Annotation

Per dataset and for each entry in the HAZOP list:

Look for first image that includes identified hazard

Annotation of shape saved together with risk entry id

Images are randomly ordered (no bias from starting point)
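
The same protocol can be written down in a few lines; the sketch below shuffles the image list with a fixed seed and records, per checklist entry, the first image in which the hazard was found. The find_hazard annotator hook and the record format are assumptions.

# Sketch of the annotation protocol: random image order (to avoid a starting-point
# bias), then for each HAZOP entry record the first image containing the hazard.
# `find_hazard` stands in for the human annotator and is an assumption.
import random

def annotate_dataset(image_ids, entry_ids, find_hazard, seed=0):
    """Return {entry_id: (image_id, shape)} for the first image showing each hazard."""
    order = list(image_ids)
    random.Random(seed).shuffle(order)            # fixed seed keeps the order reproducible
    annotations = {}
    for entry in entry_ids:
        for image in order:
            shape = find_hazard(image, entry)     # polygon/mask or None
            if shape is not None:
                annotations[entry] = (image, shape)
                break                             # only the first occurrence is recorded
    return annotations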


Thank You, Experts

For your contributions:

Lawitzky G., Wichert G., Feiten W. (Siemens Munich)

Köthe U. (HCI Heidelberg)

Fischer J. (Fraunhofer IPA)

Zinner C. (AIT)


Parameters (1)


Parameters (2)


The Main Question:

Which situations should be covered by the test data?

Domain aspects: elements and situations from actual real world

Vulnerability aspects: situations and relations known to cause problems

Validation <-> Experience!

When have we tested enough to reach a conclusion?

Reduce redundancies

Don't use 1 million kilometers of road data showing the same stuff!
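
A rough sketch of one way to cut such redundancy: hash a heavily downscaled, grayscale version of each frame and keep only one frame per hash bucket, so near-identical stretches of road collapse to a few representatives. The 8x8 average hash and the frame source are illustrative assumptions, not a recommendation of a specific tool.

# Sketch: drop near-duplicate frames via a tiny average hash, so long stretches
# of visually identical road data do not dominate the test set.
# The 8x8 hash size is an illustrative choice.
import numpy as np

def average_hash(gray: np.ndarray, size: int = 8) -> int:
    """Downscale to size x size by block averaging and threshold at the mean."""
    h, w = gray.shape
    small = gray[:h - h % size, :w - w % size]
    small = small.reshape(size, h // size, size, w // size).mean(axis=(1, 3))
    bits = (small > small.mean()).flatten()
    return int("".join("1" if b else "0" for b in bits), 2)

def deduplicate(frames: list[np.ndarray]) -> list[int]:
    """Return indices of frames kept after removing hash duplicates."""
    seen, kept = set(), []
    for i, frame in enumerate(frames):
        key = average_hash(frame)
        if key not in seen:
            seen.add(key)
            kept.append(i)
    return kept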
