This presentation was given as a tutorial at NCVPRIPG (http://www.iitj.ac.in/ncvpripg/) at IIT Jodhpur on 18-Dec-2013. Kinect is a multimedia sensor from Microsoft, shipped as the touch-free controller for the Xbox 360 video gaming platform. Kinect comprises an RGB camera, a depth sensor (IR emitter and IR camera) and a microphone array. It produces a multi-stream video containing RGB, depth, skeleton, and audio streams. Compared to common depth cameras (laser or Time-of-Flight), a Kinect is quite inexpensive because it estimates depth with a novel structured-light diffraction and triangulation technology. In addition, Kinect ships with software to detect human figures and produce their 20-joint skeletons. Though Kinect was built for touch-free gaming, its cost effectiveness and human-tracking features have proved useful in many indoor applications beyond gaming, such as robot navigation, surveillance, medical assistance and animation.
<ul><li>1. Looking Deep into Depth. Prof. Partha Pratim Das, firstname.lastname@example.org & partha.p.das@gm, Department of Computer Science & Engineering, Indian Institute of Technology Kharagpur, 18th December 2013. Tutorial: NCVPRIPG 2013</li></ul><p>2. Natural Interaction. Understanding human nature: how we interact with other people, how we interact with the environment, how we learn. Interaction methods the user is familiar and comfortable with: touch, gesture, voice, eye gaze. 3. Gesture. Voice. Eye Gaze. Touch. 4. Natural User Interface. A revolutionary change in human-computer interaction: interpret natural human communication, so that computers communicate more like people. The user interface becomes invisible, or looks and feels so real that it is no longer iconic. 5. Source: http://sites.amd.com/us/Documents/AMD_TFE2011_032SW.pdf 6. Source: http://sites.amd.com/us/Documents/AMD_TFE2011_032SW.pdf 7. Source: http://sites.amd.com/us/Documents/AMD_TFE2011_032SW.pdf 8. Source: http://sites.amd.com/us/Documents/AMD_TFE2011_032SW.pdf 9. Factors for the NUI Revolution. 10. Sensor: detects human behavior. NUI middleware libraries: interpret human behavior. Cloud: extends NUI across multiple devices. Sensors, Libraries, Application, Cloud. 11. Sensors, Libraries, Application. 12. Touch: mobile phones, tablets and ultrabooks, portable audio players, portable game consoles. Touch-less: RGB camera; depth camera (Lidar, CamBoard nano); RGBD camera (Kinect, Asus Xtion Pro, Carmine 1.08); eye tracker (Tobii eye tracker); voice recognizer (Kinect, Asus Xtion Pro, Carmine 1.08). 13. Kinect: the most popular NUI sensor. 14. 15. Games / Apps: Sports, Dance/Music, Fitness, Family, Action, Adventure, Videos. Xbox 360, Fitnect, Humanoid, Magic Mirror. 16. 
Xbox 360. Fitnect: Virtual Fitting Room, http://www.openni.org/solutions/fitnect-virtual-fitting-room/ and http://www.fitnect.hu/ Controlling a Humanoid Robot with Kinect: http://www.youtube.com/watch?v=_xLTI-b-cZU and http://www.wimp.com/robotkinect/ Magic Mirror in Bathroom: http://www.extremetech.com/computing/94751-the-newyork-times-magic-mirror-will-bring-shopping-to-thebathroom 17. A low-cost sensor device providing real-time depth, color and audio data. 18. How does Kinect support NUI? 19. 20. 21. (1) Kinect hardware: the hardware components, including the Kinect sensor and the USB hub through which the Kinect sensor is connected to the computer. (2) Kinect drivers: Windows drivers for the Kinect, exposing the microphone array as a kernel-mode audio device accessible through the standard audio APIs in Windows; audio and video streaming controls (color, depth, and skeleton); and device enumeration functions for using more than one Kinect. (3) Audio and video components: Kinect NUI for skeleton tracking, audio, and color and depth imaging. (4) DirectX Media Object (DMO): for microphone array beamforming and audio source localization. (5) Windows 7 standard APIs: the audio, speech, and media APIs and the Microsoft Speech SDK. 22. Hardware 23. 24. RGB camera. Infrared (IR) emitter and IR depth sensor to estimate depth. Multi-array microphone: four microphones for capturing sound. 3-axis accelerometer: configured for a 2g range, where g is the acceleration due to gravity; determines the current orientation of the Kinect. 25. 26. NUI API 27. The NUI API lets the user programmatically control and access four data streams: Color Stream, Infrared Stream, Depth Stream, Audio Stream. 28. Source: Kinect for Windows Developer Toolkit v1.8 http://www.microsoft.com/en-in/download/details.aspx?id=40276 App Examples: Color Basics - D2D, Infrared Basics - D2D, Depth Basics - D2D / Depth - D3D, Depth with Color - D3D, Kinect Explorer - D2D, Kinect Explorer - WPF 29. 
Source: Kinect for Windows Developer Toolkit v1.8 http://www.microsoft.com/en-in/download/details.aspx?id=40276 App Examples: Color Basics - D2D 30. Color formats and resolutions: RGB at 640x480, 30 fps or 1280x960, 12 fps; YUV at 640x480, 15 fps; Bayer at 640x480, 30 fps or 1280x960, 12 fps. New features from SDK 1.6 onwards: capture in low-light (or brightly lit) scenes; use hue, brightness, or contrast to improve visual clarity; use gamma to adjust the way the display appears on certain hardware. 31. Source: Kinect for Windows Developer Toolkit v1.8 http://www.microsoft.com/en-in/download/details.aspx?id=40276 App Examples: Infrared Basics - D2D 32. The IR stream is the test pattern observed by both the RGB and IR cameras. Resolution 640x480, 30 fps. 33. Source: Kinect for Windows Developer Toolkit v1.8 http://www.microsoft.com/en-in/download/details.aspx?id=40276 App Examples: Depth Basics - D2D, Depth - D3D, Depth with Color - D3D 34. The depth data stream merges two separate types of data: depth data, in millimeters; and player segmentation data, where each player segmentation value is an integer indicating the index of a unique player detected in the scene. The Kinect runtime processes depth data to identify up to six human figures in a segmentation map. 35. 36. Source: Kinect for Windows Developer Toolkit v1.8 http://www.microsoft.com/en-in/download/details.aspx?id=40276 App Examples: Kinect Explorer - D2D, Kinect Explorer - WPF 37. High-quality audio capture. Audio input from +50 to -50 degrees in front of the sensor. The array can be pointed in 10-degree increments within that range. Identification of the direction of audio sources. Raw voice data access. 38. 39. In addition to the hardware capabilities, the Kinect software runtime implements: Skeleton Tracking; Speech Recognition, with integration with the Microsoft Speech APIs. 40. 41. 
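The depth stream described above carries two values per pixel: depth in millimeters and a player segmentation index. A minimal sketch of the unpacking, assuming the SDK's documented 16-bit pixel layout (low 3 bits for the player index, upper 13 bits for depth in millimeters):

```python
# Sketch: unpacking a Kinect depth-stream pixel (assumed 16-bit layout:
# low 3 bits = player index 0-6, remaining high bits = depth in millimeters).
PLAYER_INDEX_BITMASK_WIDTH = 3
PLAYER_INDEX_BITMASK = (1 << PLAYER_INDEX_BITMASK_WIDTH) - 1  # 0b111

def unpack_depth_pixel(raw: int):
    """Split a raw 16-bit depth pixel into (depth_mm, player_index)."""
    player_index = raw & PLAYER_INDEX_BITMASK        # 0 = no player at pixel
    depth_mm = raw >> PLAYER_INDEX_BITMASK_WIDTH     # distance in millimeters
    return depth_mm, player_index

# Example: a pixel 2000 mm away, tagged as player 1
raw = (2000 << PLAYER_INDEX_BITMASK_WIDTH) | 1
print(unpack_depth_pixel(raw))  # (2000, 1)
```

A player index of 0 marks background; indices 1-6 correspond to the up to six figures in the segmentation map.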
Source: Kinect for Windows Developer Toolkit v1.8 http://www.microsoft.com/en-in/download/details.aspx?id=40276 App Examples: Skeleton Basics - D2D, Kinect Explorer - D2D, Kinect Explorer - WPF 42. Tracking modes (seated/default). Tracking skeletons in near depth range. Joint orientation. Joint filtering. Skeleton tracking with multiple Kinect sensors. 43. Source: Kinect for Windows Developer Toolkit v1.8 http://www.microsoft.com/en-in/download/details.aspx?id=40276 App Examples: Speech Recognition, with integration with the Microsoft Speech APIs: Speech Basics - D2D, Tic-Tac-Toe - WPF 44. The microphone array is an excellent input device for speech recognition. Applications can use the Microsoft.Speech API for the latest acoustical algorithms. Acoustic models have been created to allow speech recognition in several locales in addition to the default locale of en-US. 45. Face Tracking. Kinect Fusion. Kinect Interaction. 46. Source: Kinect for Windows Developer Toolkit v1.8 http://www.microsoft.com/en-in/download/details.aspx?id=40276 App Examples: Face Tracking 3D - WPF, Face Tracking Basics - WPF, Face Tracking Visualization 47. Input images: the Face Tracking SDK accepts Kinect color and depth images as input; the tracking quality may be affected by the image quality of these input frames. Face tracking outputs: tracking status, 2D points, 3D head pose, Action Units (AUs). Source: http://msdn.microsoft.com/en-us/library/jj130970.aspx 48. 2D points. Source: http://msdn.microsoft.com/en-us/library/jj130970.aspx 49. 3D head pose. Source: http://msdn.microsoft.com/en-us/library/jj130970.aspx 50. AUs: AU0 Upper Lip Raiser, AU1 Jaw Lowerer, AU2 Lip Stretcher, AU3 Brow Lowerer. Source: http://msdn.microsoft.com/en-us/library/jj130970.aspx 51. Source: Kinect for Windows Developer Toolkit v1.8 http://www.microsoft.com/en-in/download/details.aspx?id=40276 App Examples: Kinect Fusion Basics - D2D, Kinect Fusion Color Basics - D2D, Kinect Fusion Explorer - D2D 52. 
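The joint filtering listed above smooths noisy joint positions across frames. A simplified sketch of the idea; the real SDK exposes a richer set of smoothing parameters (TransformSmoothParameters), so the plain exponential smoother and the alpha value here are illustrative assumptions:

```python
# Sketch: joint filtering as a simple exponential smoother over a stream
# of noisy 3D joint positions. alpha controls responsiveness vs. smoothness:
# alpha = 1.0 passes raw data through, small alpha smooths heavily but lags.
def smooth_joints(frames, alpha=0.5):
    """Blend each incoming joint position with the previous smoothed one."""
    smoothed = None
    out = []
    for frame in frames:  # frame is an (x, y, z) position for one joint
        if smoothed is None:
            smoothed = frame           # first frame: nothing to blend with
        else:
            smoothed = tuple(alpha * new + (1 - alpha) * old
                             for new, old in zip(frame, smoothed))
        out.append(smoothed)
    return out

print(smooth_joints([(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)])[-1])  # (0.5, 0.5, 0.5)
```

This is the trade-off the SDK's filtering settings expose: heavier smoothing suppresses jitter but adds latency, which matters for rapid gestures.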
Kinect Fusion takes depth images from the Kinect camera, with lots of missing data, and, as the sensor is moved around, produces within a few seconds a realistic, smooth 3D reconstruction of a static scene. From this, a point cloud or a 3D mesh can be produced. 53. 54. Source: Kinect for Windows Developer Toolkit v1.8 http://www.microsoft.com/en-in/download/details.aspx?id=40276 App Examples: Interaction Gallery - WPF 55. Kinect Interaction provides: identification of up to 2 users and identification and tracking of their primary interaction hand; detection services for the user's hand location and state; grip and grip-release detection; press and scroll detection; and information on the control targeted by the user. 56. Third-Party Libraries 57. Matlab, OpenCV, PCL (Point Cloud Library), BoofCV, ROS. 58. Matlab is a high-level language and interactive environment for numerical computation, visualization, and programming. Open Source Computer Vision (OpenCV) is a library of programming functions for real-time computer vision. MATLAB and OpenCV functions perform numerical computation and visualization on video and depth data from the sensor. Source: Matlab http://msdn.microsoft.com/en-us/library/dn188693.aspx 59. Source: PCL 60. ROS provides libraries and tools to help software developers create robot applications: hardware abstraction, device drivers, libraries, visualizers, message passing, package management, and more. It provides additional RGB-D processing capabilities. Source: ROS 61. BoofCV: binary image processing, image registration and model fitting, interest point detection, camera calibration. Provides RGB-D processing capabilities: a 3D point cloud created from the RGB and depth images of the Kinect sensor. Source: BoofCV 62. Applications 63. Gesture Recognition. Gait Detection. Activity Recognition. 64. Body Parts | Gestures | Applications | References: Fingers | 9 hand gestures for numbers 1 to 9 | Sudoku game | Ren et al.
2011. Hand, fingers | Pointing gestures: pointing at objects or locations of interest | Real-time hand gesture interaction with the robot; pointing gestures translated into goals for the robot | Bergh et al. 2011. Arms | Lifting both arms to the front, to the side and upwards | Physical rehabilitation | Chang 2011. Hand | 10 hand gestures | HCI system; experimental datasets include hand-signed digits | Doliotis 2011. 65. Body Parts | Gestures | Applications | References: Hand and head | Clap, Call, Greet, Yes, No, Wave, Clasp, Rest | HCI applications; not built for any specific application | Biswas 2011. Single hand | Grasp and Drop | Various HCI applications | Tang 2012. Right hand, left hand, head, foot or knee | Push, Hover (default) | View and select recipes for cooking in the kitchen when hands are messy | Panger 2012. Hand, arms | Right/left-arm swing, right/left-arm push, right/left-arm back, zoom-in/out | Various HCI applications; not built for any specific application | Lai 2012. 66. Image source: Kinect for Windows Human Interface Guidelines v1.5. Left and right swipe: start presentation, end presentation. Hand as a marker. Circling gesture. Push gesture. 67. Right swipe gesture paths [charts of sample swipe trajectories in x-y coordinates]. 68. Consider an example: take a left swipe path [chart of a sample left-swipe trajectory]. The sequence of angles will be of the form [180, 170, 182, ..., 170, 180, 185]. The corresponding quantized feature vector is [9, 9, 9, ..., 9, 9, 9]. How does it distinguish left swipe from right swipe? Very straightforward! 69. HMM model for right swipe: states L1, L2, L3, L4. A real-time gesture returns likelihoods, which are converted to probabilities by normalization, followed by (empirically chosen) thresholding to classify gestures. 70. 
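The angle quantization in the swipe example above can be sketched as follows. The 20-degree bin width (18 symbols) is an assumption chosen to reproduce the slide's mapping, where path angles near 180 degrees all become symbol 9:

```python
# Sketch: quantizing swipe-path direction angles into a discrete feature
# vector for the HMM classifier. Bin width of 20 degrees (18 bins) is an
# assumed parameter, not stated in the slides.
def quantize_angles(angles, bin_width=20):
    """Map each angle in degrees to the nearest bin index, wrapping at 360."""
    n_bins = 360 // bin_width
    return [int(a / bin_width + 0.5) % n_bins for a in angles]

# Left swipe (motion pointing ~180 deg) vs right swipe (motion pointing ~0 deg)
print(quantize_angles([180, 170, 182, 170, 180, 185]))  # [9, 9, 9, 9, 9, 9]
print(quantize_angles([0, 5, 355]))                     # [0, 0, 0]
```

This makes the distinction the slide calls straightforward: a left swipe yields a run of 9s and a right swipe a run of 0s, two easily separable symbol sequences for the HMMs.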
Here are the results obtained by the implementation. Of 300 correct gestures, 281 were correctly classified and 19 wrongly rejected; of 300 incorrect gestures, 292 were correctly rejected and 8 wrongly accepted. Precision = 97.23%, Recall = 93.66%, F-measure = 95.41%, Accuracy = 95.5%. 71. Source: 72. (a) Fixed camera, (b) freely moving camera. Recognition rate 97%. 73. Source: 74. Windows SDK vis-à-vis OpenNI 75. 76. Additional libraries and APIs. Kinect SDK: Skeleton Tracking, KinectInteraction, Kinect Fusion, Face Tracking. OpenNI middleware: NITE, 3D Hand, KSCAN3D, 3-D FACE. Source: NITE http://www.openni.org/files/nite/ 3D Hand http://www.openni.org/files/3d-hand-tracking-library/ 3D face http://www.openni.org/files/3-d-face-modeling/ KSCAN3D http://www.openni.org/files/kscan3d-2/ 77. Evaluating a Dancer's Performance using Kinect-based Skeleton Tracking: joint positions, joint velocity, 3D flow error. Source: 78. Kinect SDK: better skeleton representation; 20-joint skeleton; joints are mobile; higher frame rate, better for tracking actions involving rapid pose changes; significantly better skeletal tracking algorithm (by Shotton, 2013); robust and fast tracking even in complex body poses. OpenNI: simpler skeleton representation; 15-joint skeleton; fixed basic frame of the [shoulder left : shoulder right : hip] triangle and the [head : shoulder] vector; lower frame rate, slower tracking of actions; weaker skeletal tracking. 79. 80. Depth Sensing 81. Source: http://www.futurepicture.org/?p=97 82. The IR emitter projects an irregular pattern of IR dots of varying intensities. The IR camera reconstructs a depth image by recognizing the distortion in this pattern. Kinect works on a stereo matching algorithm, but captures stereo with only one IR camera. 83. Project a speckle pattern onto the scene; infer depth from the deformation of the speckle pattern. Source: http://users.dickinson.edu/~jmac/selected-talks/kinect.pdf 84. Skeleton Tracking 85. 
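The depth-from-speckle-deformation step described under Depth Sensing is, at heart, triangulation between the IR emitter and the IR camera: the projected pattern appears shifted by a disparity that shrinks with distance. A hedged sketch; the baseline and focal length values below are illustrative, not Kinect's calibrated parameters:

```python
# Sketch: structured-light depth by triangulation, Z = f * b / d, where
# f is the IR camera focal length (pixels), b the emitter-to-camera
# baseline (meters), and d the observed disparity of a speckle (pixels).
# Both constants below are assumed values for illustration.
BASELINE_M = 0.075   # assumed emitter-to-camera baseline
FOCAL_PX = 575.0     # assumed IR camera focal length

def depth_from_disparity(disparity_px, f=FOCAL_PX, b=BASELINE_M):
    """Larger disparity means the surface is closer to the sensor."""
    if disparity_px <= 0:
        return None  # speckle not matched: a missing depth value
    return f * b / disparity_px

# With these constants, a speckle shifted ~21.6 px sits roughly 2 m away
print(round(depth_from_disparity(21.5625), 3))
```

The same relation explains the limited range noted later: beyond a few meters the disparity change per millimeter of depth becomes too small to resolve, and unmatched speckles show up as missing depth values.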
Capture depth image and remove background; infer body parts per pixel; cluster pixels to hypothesize body joint positions; fit model and track skeleton. Source: 86. A highly varied training dataset: driving, dancing, kicking, running, navigating menus, etc. (500k images). The pairs of depth and body-part images are used for learning the classifier. Randomized decision trees and forests are used to classify each pixel to a body part based on some depth image feature and threshold. 87. Find a joint point in each body part: a local mode-finding approach based on mean shift with a weighted Gaussian kernel. This process considers both the inferred body-part probability at the pixel and the world surface area of the pixel. 88. Input depth; inferred body parts; side view, top view, front view; inferred body joint positions, with no tracking or smoothing. Source: http://research.microsoft.com/pubs/145347/Kinect%20Slides%20CVPR2011.pptx 89. 90. PROS: multimedia sensor; ease of use; low cost; strong library support. CONS: 2.5D depth data; limited field of view (43 degrees vertical, 57 degrees horizontal); limited range (0.8 to 3.5 m); uni-directional view and depth shadows; limited resolution and missing depth values. 91. For a single Kinect, a minimum of 10 feet by 10 feet of space (3 meters by 3 meters) is needed. Side view, top view. Source: iPiSoft Wiki: http://wiki.ipisoft.com/User_Guide_for_Dual_Depth_Sensor_Configuration 92. Source: Multiple Kinects -- possible? Moving while capturing -- possible? http://www.youtube.com/watch?v=ttMHme2EI9I 93. Use multiple Kinects. 94. 360-degree reconstruction. Source: Huawei/3DLife ACM Multimedia Grand Challenge for 2013 http://mmv.eecs.qmul.ac.uk/mmgc2013/ 95. Increases field of view; super resolution. Source: Maimone & Fuchs, 2012 96. IR interference noise. Source: Berger et al., 2011a 97. Individual capture. Simultaneous capture...</p>
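The field-of-view and range limits quoted above (57 degrees horizontal, 43 degrees vertical, 0.8 to 3.5 m) determine how much scene a single Kinect can cover, which is why the slides recommend roughly a 3 m by 3 m space and why multiple Kinects are used to widen coverage. A quick sketch of the covered area at a few distances:

```python
import math

# Sketch: the visible extent at distance z for a camera with a given
# field of view: extent = 2 * z * tan(fov / 2). FOV values are the ones
# quoted in the slides for Kinect.
def coverage(z_m, fov_deg):
    """Linear extent (meters) covered at distance z_m for the given FOV."""
    return 2.0 * z_m * math.tan(math.radians(fov_deg) / 2.0)

for z in (0.8, 2.0, 3.5):
    w = coverage(z, 57.0)  # horizontal extent
    h = coverage(z, 43.0)  # vertical extent
    print(f"at {z} m: {w:.2f} m wide x {h:.2f} m high")
```

At the maximum 3.5 m range this gives a view roughly 3.8 m wide, consistent with the recommended 3 m by 3 m capture space for a single sensor.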