Motivation
• A potentially rich control paradigm, allowing for nuance.
• Removes the barrier of some funny plastic controller.
• Successful experiment: Konami’s Police 911
My game: Air Guitar
• A “beat-matching” game where you stand and play air guitar to your favorite songs.
• Previous beat-matching games (Parappa, DDR) are very digital; I want to use a webcam to make Air Guitar more organic and to allow the user to be expressive.
• Technically demanding as a vision app (needs semantics about what is what).
Real-World Concerns
• Noise
• Illumination changes
• Camera auto-adjusts
• Background changes / camera moves
• Shadows
• Camera saturation / under-excitement
Varying Lighting Conditions
• Can’t rely on RGB values to identify pixels
• Need context… hmm… this becomes a hard AI problem.
Vision Techniques That Suck
• Background subtraction (shadows, motion!)
• Noise reduction by smoothing (resolution!)
• Turning functions (unstable)
• Frame coherence (just a band-aid)
• Edge detection
• Hysteresis (Latin for “cheap hack”)
• Discreteness
General Paradigm
• Technique should:
– Work on a still image
– Be robust: avoid discrete decisions wherever possible.
– Work in as general a case as we can manage, but we won’t strive to be ideally general.
• We will do “whatever it takes” to get the job done.
Restrained Ambition
• Only trying to roughly determine the positions of torso and arms
• Okay to say “the user must wear a long-sleeved shirt of uniform color that contrasts with the background”
• We won’t dictate the color of the shirt (too restrictive!)
• We won’t dictate colors of other things (user’s skin, background).
Early Segmentation
• Divide up the image into regions of “like” pixels to ease computation.
• Ad hoc technique: iterate over scanlines potentially adding each pixel to its neighbor’s group.
• This technique sucks.
The Unreasonable Instability of Approximate Clustering
• “Real” clustering is slow
• “Loose” clustering is interactively unstable
• Even just the small amount of camera noise makes things go berserk… motion is even worse.
• Clustering is about continuous ==> discrete. We wanted to avoid that so we should be very careful.
My solution: Be Inflexible
• Simply divide the image into square regions of constant size.
• If any region needs more detail, subdivide it.
• Noise still affects this system (some regions subdividing / recombining from frame to frame) but it’s relatively stable.
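A minimal sketch of the fixed-grid idea, assuming a grayscale image stored as a list of rows. The slides do not say how a region is judged to "need more detail"; the variance test, the function names and the default sizes below are illustrative assumptions.

```python
def variance(img, x0, y0, size):
    """Variance of the pixel values inside one square region."""
    vals = [img[y][x] for y in range(y0, y0 + size) for x in range(x0, x0 + size)]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)


def subdivide(img, x0, y0, size, threshold, min_size, out):
    """Keep a flat region whole; split a busy one into four quadrants."""
    if size > min_size and variance(img, x0, y0, size) > threshold:
        half = size // 2
        for dy in (0, half):
            for dx in (0, half):
                subdivide(img, x0 + dx, y0 + dy, half, threshold, min_size, out)
    else:
        out.append((x0, y0, size))


def grid_regions(img, cell=8, threshold=100.0, min_size=2):
    """Tile the image with fixed cells (cell should be a power of two)."""
    out = []
    for y0 in range(0, len(img) - cell + 1, cell):
        for x0 in range(0, len(img[0]) - cell + 1, cell):
            subdivide(img, x0, y0, cell, threshold, min_size, out)
    return out
```

Because the top-level cells never move, noise can only toggle a subdivision on or off; it cannot reshuffle region boundaries the way loose clustering does.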
Which color space do we work in?
• Want to group pixels that are “alike”: nearby in some color space.
• Choices: nonlinear RGB, linear-light RGB, CIE LAB, many others.
• CIE LAB produced nicer results for some ad hoc segmentation experiments, but is expensive to compute.
• Linear-light RGB is the right thing for inverse rendering techniques; it is cheap to compute.
• I started with CIE LAB, but now use linear RGB.
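A sketch of why linear-light RGB is cheap: the conversion is one table lookup per channel. This assumes the camera delivers 8-bit values encoded with the standard sRGB transfer curve, which is an assumption; a real webcam may use a different gamma.

```python
def srgb_to_linear(c8):
    """Map one 8-bit sRGB-encoded channel value to linear light in [0, 1]."""
    c = c8 / 255.0
    if c <= 0.04045:
        return c / 12.92
    return ((c + 0.055) / 1.055) ** 2.4


# Build the table once; per-pixel conversion is then three lookups.
LUT = [srgb_to_linear(i) for i in range(256)]


def pixel_to_linear(rgb):
    """Convert an (r, g, b) tuple of 8-bit values to linear-light floats."""
    return tuple(LUT[v] for v in rgb)
```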
Simple Inverse Rendering
• Assume all surfaces have Lambertian reflectance
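• p = m·l·cos θ, where p is the observed pixel color, m the material color, l the illuminant color, and θ the angle between the light direction and the surface normal.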
• Can’t disambiguate material color from illuminant color
• The compound color ml, under varying scale, forms a vector through the origin in RGB space.
• This is a much more specific relation than e.g. Euclidean distance.
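One way to test the "vector through the origin" relation is to compare directions rather than positions in linear RGB. The sketch below measures the angle between a pixel and a reference color; the function name and the handling of black pixels are assumptions made for illustration.

```python
import math


def angle_to_ray(pixel, ref):
    """Angle in radians between a linear-RGB pixel and a reference color ray."""
    dot = sum(p * r for p, r in zip(pixel, ref))
    norm_p = math.sqrt(sum(p * p for p in pixel))
    norm_r = math.sqrt(sum(r * r for r in ref))
    if norm_p == 0.0 or norm_r == 0.0:
        # Black has no direction; treat it as maximally uncertain (assumption).
        return math.pi / 2
    cos_angle = max(-1.0, min(1.0, dot / (norm_p * norm_r)))
    return math.acos(cos_angle)
```

A shirt pixel in shadow is a scaled copy of the lit shirt color, so its angle to the reference stays near zero even though its Euclidean distance is large.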
Covariance Bodies:
• 5 numbers’ worth of storage
• Ellipsoid-shaped (take eigenvectors of matrix)
• Statistical significance: expected value of points
• Advantage: consistency under summation
• Can use them to vaguely characterize shapes.
• Generalizes to n dimensions.
$$\sum x,\quad \sum y,\quad \begin{bmatrix}\sum x^2 & \sum xy\\ \sum xy & \sum y^2\end{bmatrix}$$
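A minimal sketch of a 2D covariance body built from running sums. The class and method names are hypothetical, and the point count `n` is an extra field beyond the five sums on the slide, added here so that means can be recovered.

```python
import math
from dataclasses import dataclass


@dataclass
class CovBody:
    n: float = 0.0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0
    syy: float = 0.0

    def add_point(self, x, y):
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y
        self.syy += y * y

    def __add__(self, other):
        # Consistency under summation: merging bodies is field-wise addition.
        return CovBody(self.n + other.n, self.sx + other.sx, self.sy + other.sy,
                       self.sxx + other.sxx, self.sxy + other.sxy,
                       self.syy + other.syy)

    def ellipse(self):
        """Return (mean, (major_sd, minor_sd), major-axis angle in radians)."""
        mx, my = self.sx / self.n, self.sy / self.n
        cxx = self.sxx / self.n - mx * mx
        cxy = self.sxy / self.n - mx * my
        cyy = self.syy / self.n - my * my
        # Closed-form eigenvalues of the symmetric 2x2 covariance matrix.
        half_trace = (cxx + cyy) / 2
        d = math.hypot((cxx - cyy) / 2, cxy)
        major = math.sqrt(max(half_trace + d, 0.0))
        minor = math.sqrt(max(half_trace - d, 0.0))
        angle = 0.5 * math.atan2(2 * cxy, cxx - cyy)
        return (mx, my), (major, minor), angle
```

A long, thin ellipse has a major deviation much larger than its minor one, which is the kind of vague shape description the slide mentions.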
Covariance Bodies for Color Plane Fitting
• Least-squares plane fit uses the same matrix.
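• Track RHS. 3 more numbers:
$$\sum xz,\quad \sum yz,\quad \sum z$$
• Sum these to get group plane fits.
$$\begin{bmatrix}\sum x^2 & \sum xy\\ \sum xy & \sum y^2\end{bmatrix}\begin{bmatrix}p_x\\ p_y\end{bmatrix}=\begin{bmatrix}\sum xz\\ \sum yz\end{bmatrix}$$
• (example)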
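A sketch of a plane fit driven entirely by running sums, so that group fits come from adding the sums of their members. It assumes z is one color channel value at image position (x, y) and fits z = p_x·x + p_y·y + c; centering the sums with the point count and solving for the offset c are assumptions about how the extra sums are used, and all names are hypothetical.

```python
KEYS = ("n", "x", "y", "z", "xx", "xy", "yy", "xz", "yz")


def sums(points):
    """Accumulate the running sums for a list of (x, y, z) samples."""
    s = dict.fromkeys(KEYS, 0.0)
    for x, y, z in points:
        for key, value in zip(KEYS, (1, x, y, z, x * x, x * y, y * y, x * z, y * z)):
            s[key] += value
    return s


def merge(a, b):
    """Sums for a group are the field-wise sums of its members."""
    return {k: a[k] + b[k] for k in KEYS}


def fit_plane(s):
    """Least-squares (p_x, p_y, c) for z = p_x*x + p_y*y + c."""
    n = s["n"]
    # Centered second moments (assumption: fit is about the centroid).
    sxx = s["xx"] - s["x"] * s["x"] / n
    sxy = s["xy"] - s["x"] * s["y"] / n
    syy = s["yy"] - s["y"] * s["y"] / n
    sxz = s["xz"] - s["x"] * s["z"] / n
    syz = s["yz"] - s["y"] * s["z"] / n
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        raise ValueError("points are collinear; plane is not determined")
    px = (sxz * syy - syz * sxy) / det  # Cramer's rule on the 2x2 system
    py = (syz * sxx - sxz * sxy) / det
    c = (s["z"] - px * s["x"] - py * s["y"]) / n
    return px, py, c
```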
Calibration Mode
• Stand in a fixed pose
• Pose designed to be easily recognizable
• Gives us things that help later:
– Body measurements
– Background of scene
– Shirt color (and histogram)
– Skin color
– Coarse model of environment illumination
How We Recognize This Pose
• Pick a color to look for; isolate it.
• Project this color to the X and Y image axes
• Find spikes in projection
• Use heuristics to judge shape and give a confidence value:
– Outliers
– Relative spike sizes
– Screen real-estate occupied
• (example)
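The projection and spike-finding steps can be sketched as follows, assuming the isolated color is already available as a binary mask (a list of rows of 0/1). The half-of-peak threshold and the function names are illustrative assumptions; the scoring heuristics are not reproduced here.

```python
def project(mask):
    """Return (column sums, row sums) of a binary mask."""
    cols = [sum(col) for col in zip(*mask)]
    rows = [sum(row) for row in mask]
    return cols, rows


def spikes(profile, frac=0.5):
    """Return inclusive (start, end) runs where the profile is >= frac * peak."""
    peak = max(profile)
    if peak == 0:
        return []
    runs, start = [], None
    for i, value in enumerate(profile):
        if value >= frac * peak:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(profile) - 1))
    return runs
```

For a figure with arms held out sideways (an assumed pose, used only as an example), the column projection spikes at the torso and the row projection spikes at arm height.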
Try many colors.
• Sort colors present in scene by popularity; cluster them.
• Create a fuzzy color cone through each cluster.
• Vary the cone radius.
• Do the recognition listed on previous slide; select the color cone with the best score.
• Fixed color grid (to combat instability!)
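A rough sketch of ranking scene colors by popularity on a fixed color grid, assuming 8-bit RGB pixels. The grid resolution, the function name and the choice to report each cell by its center color are assumptions; the clustering and cone-scoring steps are not shown.

```python
from collections import Counter


def popular_colors(pixels, levels=8, top=5):
    """Count pixels per fixed color cell; return [(cell center, count), ...]."""
    step = 256 // levels
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    return [
        (tuple(c * step + step // 2 for c in cell), n)
        for cell, n in counts.most_common(top)
    ]
```

Because the cell boundaries are fixed, a small change in the pixels can shift counts between cells but cannot move the cells themselves from frame to frame.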
Head Finding
• Many heuristics:
– Medium-detail region (Flatness + sharpness)
• But not a long sharp edge
– Compact body
– Skin-colored
– Not the background
Skin color?
• Fit points in RGB space with an approximating surface?
• Where do I get a good skin color database?
Gameplay Recognition Mode
• Goal: Find positions of user’s torso and arms.
• When we’re actually playing the game, we use the info provided by calibration to help us.
• Currently only use shirt + skin color.
Body Shape Analysis
• Slide a square window across the image; for each window position, use the pixel regions falling within the window to perform a local shape analysis.
• Examine the resulting ellipses to find the arms. These are long, centered ellipses; round regions are the torso. (example)
• Path-trace these to get an ordered series of points representing each arm.
• Fit one or two line segments to this series of points (one segment = straight arm, two = bent).
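The last step can be sketched with a single split of the kind used in Ramer-Douglas-Peucker line simplification, which is a stand-in technique here, not necessarily what the game uses. The tolerance value and function names are assumptions.

```python
import math


def point_line_distance(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)
    if length == 0:
        return math.hypot(px - ax, py - ay)
    return abs(dx * (py - ay) - dy * (px - ax)) / length


def fit_arm(points, tol=2.0):
    """Return [shoulder, hand] for a straight arm or [shoulder, elbow, hand]."""
    a, b = points[0], points[-1]
    if len(points) < 3:
        return [a, b]
    # The interior point farthest from the shoulder-hand chord is the elbow candidate.
    index, dist = max(
        ((i, point_line_distance(p, a, b)) for i, p in enumerate(points[1:-1], 1)),
        key=lambda item: item[1],
    )
    return [a, b] if dist <= tol else [a, points[index], b]
```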
Hands in front of body?
• The arm will blend into the body.
• The hands will look like “holes” in the body.
• This messes up arm detection.
Multi-step Process:
• Do a sliding window pass; approximate extents of torso using initial set of regions (holes may be there).
• Look for hand-colored blobs in this area.
• Merge those blobs with the set of torso regions.
• Do another sliding window pass, now detecting elongated shapes (for arms).
Creating a 3D character pose from 2D information
• Resolve ambiguities with game-domain constraints (e.g. hands always within some plane in front of torso).
• Use inverse kinematics and some simple body knowledge to recover 3D joint angles.
• See the column “The Inner Product” in the April 2002 issue of Game Developer for an explanation of 3D IK, and source code.
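As a simplified illustration of the inverse kinematics step, here is a planar two-bone solver based on the law of cosines. It is a 2D sketch under assumed conventions (shoulder at the origin, angles in radians), not the 3D method from the column cited above; all names are hypothetical.

```python
import math


def clamp(v, lo, hi):
    return max(lo, min(hi, v))


def two_bone_ik(tx, ty, upper, fore):
    """Return (shoulder angle, interior elbow angle) reaching toward (tx, ty)."""
    # Clamp the target distance to what the two bones can actually reach.
    d = clamp(math.hypot(tx, ty), abs(upper - fore), upper + fore)
    cos_elbow = (upper * upper + fore * fore - d * d) / (2 * upper * fore)
    elbow = math.acos(clamp(cos_elbow, -1.0, 1.0))
    if d == 0:
        return 0.0, elbow  # direction is arbitrary when the hand is on the shoulder
    cos_offset = (upper * upper + d * d - fore * fore) / (2 * upper * d)
    shoulder = math.atan2(ty, tx) + math.acos(clamp(cos_offset, -1.0, 1.0))
    return shoulder, elbow


def hand_position(shoulder, elbow, upper, fore):
    """Forward kinematics matching the angle convention of two_bone_ik."""
    ex, ey = upper * math.cos(shoulder), upper * math.sin(shoulder)
    fore_dir = shoulder - (math.pi - elbow)
    return ex + fore * math.cos(fore_dir), ey + fore * math.sin(fore_dir)
```

The solver always picks one of the two mirror-image elbow positions; a game-domain constraint, such as elbows bending away from the torso, would decide which one to keep.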
Method Advantages
• It’s reasonably fast
• Works with moving background / camera
• Doesn’t care much about shadows
Method Shortcomings
• Currently confused by similar colors (low clustering resolution)
• Requires a few more technical solutions before it will be truly robust (e.g. auto gamma detection).
Future Work
• Performance: 640x480 @ 30fps
• More inverse rendering work (specularity)
• Local surface modeling (eliminate confusion due to similar colors)
• Texture classification
• Mental model feedback
Coding Issues
• How do you get video images from a webcam in Windows?
– VFW code by Nathan d’Obrenan in Game Programming Gems 2
– Unfortunately, VFW is a legacy API
– DirectShow is the thing you need to use for future compatibility.
DirectShow is terrible!
• Needlessly complex and bloated.
• The base classes provided in the DirectX SDK induce a lot of latency (latency = death)
• A minimal implementation of “just give me a damn frame from the camera” took 1,500 lines of code; should have taken 8.
• Ask me if you want the source code (jon@bolt-action.com)
• Or use VFW or a proprietary API.