Generating a time shrunk lecture video by event detection

Presented by: Mona Ragheb

Yara Ali

Supervised by: Dr. Aliaa Youssif


Agenda

Introduction
Generating lecture video using virtual camera work
Event detection steps
Evaluation
Results
Conclusion
References


Introduction

E-learning has become a popular method used in higher education.

However, video recording by a cameraman and video editing take a long time and cost a great deal.

To solve this problem, a system has been developed to generate a dynamic lecture video using virtual camerawork from the high resolution images recorded by an HDV (high-definition video) camcorder.


How does the system work?

The system generates a lecture video:

1. Using virtual camerawork based on shooting techniques of broadcast cameramen

2. By cropping from the high resolution image to track the region of interest (ROI) such as the instructor.

3. Generating a time shrunk video using event detection

Camera motion analysis is used to detect scene changes.


Shooting techniques

People invariably make the same set of mistakes when they first start shooting video:

1. Trees or telephone poles sticking out of the back of someone's head

2. Interview subjects who are just darkened blurs because there was bright light in the background

3. Boring shots of buildings with no action


Event Detection

Two kinds of events were detected:

1. A speech period

2. Chalkboard writing period


Kinds of event detection

1. Speech period:

It is detected by voice activity detection with the LPC cepstrum and classified into speech or non-speech using the Mahalanobis distance.
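A minimal sketch of the classification idea: label a feature frame by its Mahalanobis distance to a speech model and a non-speech model. The 2-D features, class means, and identity covariances below are toy assumptions for illustration, not statistics from the paper:

```python
import numpy as np

def mahalanobis(x, mean, cov_inv):
    """Mahalanobis distance of a feature vector x from a class model."""
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

def classify_frame(x, speech_model, nonspeech_model):
    """Label a frame by whichever class model it is closer to."""
    ds = mahalanobis(x, *speech_model)
    dn = mahalanobis(x, *nonspeech_model)
    return "speech" if ds < dn else "non-speech"

# Toy 2-D cepstral features: well-separated means, identity covariances.
speech_model = (np.array([1.0, 1.0]), np.eye(2))
nonspeech_model = (np.array([-1.0, -1.0]), np.eye(2))

labels = [classify_frame(np.array(f), speech_model, nonspeech_model)
          for f in [(0.9, 1.2), (-1.1, -0.8)]]
```

In practice the feature vectors would be LPC cepstra per analysis frame and the covariances would be estimated from labeled training data.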


LPC

Linear predictive coding (LPC) is a tool used mostly in audio signal processing and speech processing for representing the spectral envelope of a digital signal of speech in compressed form, using the information of a linear predictive model.

It is one of the most useful methods for encoding good quality speech at a low bit rate, and it provides extremely accurate estimates of speech parameters.

A spectral envelope is a curve in the frequency-amplitude plane, derived from a Fourier magnitude spectrum. It describes one point in time (one window, to be precise).


How does LPC work?

LPC analyzes the speech signal by estimating the formants, removing their effects from the speech signal, and estimating the intensity and frequency of the remaining buzz.

The process of removing the formants is called inverse filtering, and the remaining signal after the subtraction of the filtered modeled signal is called the residue.

Synthesis reverses the process: it uses the buzz parameters and the residue to create a source signal, uses the formants to create a filter (which represents the vocal tract), and runs the source through the filter, resulting in speech.
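The analysis step can be sketched with the textbook autocorrelation method plus the Levinson-Durbin recursion; the AR(2) test signal and the model order below are assumptions chosen only so the recovered coefficients can be checked:

```python
import numpy as np

def lpc(x, order):
    """LPC coefficients via autocorrelation + Levinson-Durbin.

    Returns the polynomial a = [1, a1, ..., ap] of the inverse
    (whitening) filter and the final prediction-error power.
    """
    n = len(x)
    r = np.array([x[:n - k] @ x[k:] for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + a[1:i] @ r[i - 1:0:-1]
        k = -acc / err
        a_prev = a.copy()
        a[1:i] = a_prev[1:i] + k * a_prev[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

# Synthetic AR(2) "speech": x[n] = 0.5 x[n-1] - 0.3 x[n-2] + noise.
rng = np.random.default_rng(0)
e = rng.standard_normal(20000)
x = np.zeros_like(e)
for n in range(2, len(x)):
    x[n] = 0.5 * x[n - 1] - 0.3 * x[n - 2] + e[n]

a, err = lpc(x, 2)                      # expect a ~ [1, -0.5, 0.3]
residue = np.convolve(x, a)[:len(x)]    # inverse filtering leaves the residue
```

Running the estimated inverse filter over the signal is exactly the "removing the formants" step: what remains is the residue that synthesis would later re-excite the filter with.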


Formants


Kinds of event detection (Cont.)

2. Chalkboard writing period:

It is detected by using a graph cuts technique to segment a precise region of interest, such as the instructor.

By deleting content-free periods, i.e., periods without speech or writing events, and fast-forwarding the writing periods, our method can generate a time shrunk lecture video automatically.
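A toy sketch of the shrinking rule just described, assuming per-second event labels and an illustrative 4x fast-forward rate for writing periods (the slides do not specify these exact values):

```python
def shrink(labels, writing_speed=4.0):
    """Map per-second event labels to output seconds.

    Speech is kept at 1x, writing is fast-forwarded, and content-free
    seconds (no speech, no writing) are deleted entirely.
    """
    out = 0.0
    for lab in labels:
        if lab == "speech":
            out += 1.0
        elif lab == "writing":
            out += 1.0 / writing_speed
        # "none" contributes nothing: the period is cut.
    return out

# 18 s of lecture: 6 s speech, 8 s writing, 4 s of silence.
timeline = ["speech"] * 6 + ["writing"] * 8 + ["none"] * 4
shrunk = shrink(timeline)   # 6 + 8/4 = 8.0 output seconds
```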


Generating lecture video using virtual camera work

An HDV camcorder is located at the back of the classroom to videotape images with high resolution (1,400 × 810 pixels), which contain the whole area of the chalkboard, so that students can read the handwritten characters on the chalkboard.

Problem: it is impossible to display the high resolution image on the small screen of a general notebook PC.
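One way to picture the cropping step that answers this problem: extract a notebook-sized window around the ROI, clamped to the frame borders so the crop never leaves the image. The 640 × 360 output size is an assumption for illustration:

```python
import numpy as np

def crop_roi(frame, center, out_w=640, out_h=360):
    """Crop an output-sized window centered on the ROI, clamped to the frame."""
    h, w = frame.shape[:2]
    x = min(max(center[0] - out_w // 2, 0), w - out_w)
    y = min(max(center[1] - out_h // 2, 0), h - out_h)
    return frame[y:y + out_h, x:x + out_w]

frame = np.zeros((810, 1400), dtype=np.uint8)  # HDV-sized grayscale frame
view = crop_roi(frame, center=(1350, 50))      # ROI near the top-right corner
```

Because the window is clamped, an instructor standing at the edge of the chalkboard still yields a full-sized crop.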


Generating lecture video using virtual camera work (Cont.)


Solution?

1. The system detects a moving object by temporal differencing

2. The timing for virtual camerawork is detected using bilateral filtering and zero crossing.

Bilateral Filter

The bilateral filter was introduced by Tomasi and Manduchi as a non-iterative means of smoothing images while retaining edge detail.

It involves a weighted convolution in which the weight for each pixel depends not only on its distance from the center pixel, but also on its difference in intensity.
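A naive grayscale sketch of that weighting, in which each output pixel is a normalized sum over a neighborhood with a spatial Gaussian times an intensity ("range") Gaussian. The parameter values and the toy step-edge image are illustrative; production code would use an optimized implementation:

```python
import numpy as np

def bilateral(img, radius=2, sigma_s=2.0, sigma_r=25.0):
    """Naive bilateral filter: weights combine spatial distance
    and intensity difference from the center pixel."""
    img = img.astype(float)
    pad = np.pad(img, radius, mode="edge")
    out = np.empty_like(img)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(xs ** 2 + ys ** 2) / (2 * sigma_s ** 2))
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            patch = pad[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            range_w = np.exp(-((patch - img[i, j]) ** 2) / (2 * sigma_r ** 2))
            w = spatial * range_w
            out[i, j] = (w * patch).sum() / w.sum()
    return out

# A hard step edge: pixels across the edge get near-zero range weight,
# so the edge survives while flat regions are smoothed.
step = np.concatenate([np.zeros((8, 4)), 200 * np.ones((8, 4))], axis=1)
smoothed = bilateral(step)
```

Because the intensity gap (200) is large relative to sigma_r (25), contributions from the other side of the edge are negligible, which is exactly the edge-retaining behavior described above.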


Bilateral Filter (Cont.)

(a) and (b) show the potential of bilateral filtering for the removal of texture. The picture "simplification" illustrated by figure 2 (b) can be useful for data reduction without loss of overall shape features in applications such as image transmission, picture editing and manipulation, and image description for retrieval.


Generating lecture video using virtual camerawork (Cont.)

If the ROI has a large movement, this period of the video is classified into panning, and if the ROI has no motion but voice activity, this period is classified into zooming.

Panning is used to show motion and speed. It is a technique that requires practice, since it has to be done in one smooth, continuous motion.
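The classification rule above can be sketched as a small decision function; the motion threshold and the "fixed" fallback shot are illustrative assumptions, not values from the slides:

```python
def classify_period(roi_motion, voice_active, motion_thresh=20.0):
    """Pick virtual camerawork for a period: large ROI movement -> panning,
    still ROI with voice activity -> zooming, otherwise a fixed full shot."""
    if roi_motion > motion_thresh:
        return "panning"
    if voice_active:
        return "zooming"
    return "fixed"

# (ROI motion in pixels/frame, voice activity) for three periods.
shots = [classify_period(m, v)
         for m, v in [(35.0, True), (2.0, True), (1.0, False)]]
```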


Event detection steps

1. Voice activity detection

2. Chalkboard writing detection
   2-1. Object detection and segmentation
   2-2. Generation of current chalkboard image
   2-3. Chalkboard writing detection

3. Generating a time shrunk video


1- Voice activity detection

1- Voice activity detection (Cont.)

Whenever you do a finite Fourier transform, you're implicitly applying it to an infinitely repeating signal. So, for instance, if the start and end of your finite sample don't match, then that will look just like a discontinuity in the signal and show up as lots of high-frequency nonsense in the Fourier transform, which you don't really want.


1- Voice activity detection (Cont.)

If your sample happens to be a beautiful sinusoid, but an integer number of periods doesn't happen to fit exactly into the finite sample, your FT will show appreciable energy in all sorts of places nowhere near the real frequency. You don't want any of that.

Windowing the data makes sure that the ends match up while keeping everything reasonably smooth; this greatly reduces the sort of "spectral leakage" described above.
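A small numerical sketch of this effect: a sinusoid with a non-integer number of periods is transformed with and without a Hann window, and the energy that lands outside a narrow band around the peak is compared. The band half-width of 3 bins is an arbitrary illustrative choice:

```python
import numpy as np

n = 256
t = np.arange(n)
x = np.sin(2 * np.pi * 10.5 * t / n)     # 10.5 periods: the ends don't match

rect_spec = np.abs(np.fft.rfft(x))                  # no window (rectangular)
hann_spec = np.abs(np.fft.rfft(x * np.hanning(n)))  # Hann-windowed

def leakage(spec, halfwidth=3):
    """Fraction of spectral energy outside a small band around the peak bin."""
    peak = int(np.argmax(spec))
    band = spec[max(peak - halfwidth, 0):peak + halfwidth + 1]
    return 1.0 - float((band ** 2).sum() / (spec ** 2).sum())

rect_leak = leakage(rect_spec)
hann_leak = leakage(hann_spec)   # much smaller: windowing tames the leakage
```

The Hann window forces the sample's endpoints toward zero, so the implicit repetition has no discontinuity and the energy stays concentrated near the true frequency.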


2- Chalkboard writing detection: 2-1- Object detection and segmentation

Extracting a precise object region is needed for detecting periods of writing characters on the chalkboard.

Temporal differencing (object detection) is robust to lighting change.

Temporal differencing cannot extract all foreground pixels of moving objects, so another technique, graph cuts, is used to support it.
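A minimal sketch of temporal differencing on a toy frame pair: pixels whose intensity changed more than a threshold between consecutive frames are marked as moving. The threshold value and frame sizes are assumptions for illustration:

```python
import numpy as np

def temporal_diff(prev, curr, thresh=25):
    """Moving-object mask: pixels whose intensity changed by more than thresh."""
    return np.abs(curr.astype(int) - prev.astype(int)) > thresh

prev = np.zeros((6, 6), dtype=np.uint8)
curr = prev.copy()
curr[2:4, 2:4] = 200              # an object appears in the middle
mask = temporal_diff(prev, curr)
moving_pixels = int(mask.sum())   # only the 4 changed pixels are flagged
```

As the slide notes, this only finds pixels that changed; interior pixels of a slowly moving, uniformly colored object are missed, which is why a segmentation step such as graph cuts is added on top.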


2- Chalkboard writing detection: 2-1- Object detection and segmentation


2- Chalkboard writing detection: 2-1- Object detection and segmentation (Cont.)

Image Segmentation

Image segmentation is an important problem in computer vision and medical image analysis.

The objective of image segmentation is to provide a visually meaningful partition of the image domain. Although it is usually an easy task for humans to separate the background and different objects in a given image, it remains a hard problem for computers.


2- Chalkboard writing detection: 2-2- Generation of current chalkboard image


2- Chalkboard writing detection: 2-3- Chalkboard writing detection


2-3- Chalkboard writing detection (Cont.)


3- Generating a time shrunk video


Evaluation

We videotaped 3 lectures (each video is 90 minutes long) with an HDV camcorder.

In this evaluation, we use “recall” and “precision” for determining effectiveness of detection result.

Precision Vs Recall

• Precision (also called positive predictive value): the fraction of retrieved frames that are correct.

• Recall (also known as sensitivity): the fraction of relevant frames that are actually retrieved.

In this figure the relevant items are to the left of the straight line, while the retrieved items are within the oval. The red regions represent errors. On the left these are the relevant items not retrieved (false negatives), while on the right they are the retrieved items that are not relevant (false positives).
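The two measures can be sketched over sets of frame indices; the toy ground truth and detector output below are made up for illustration:

```python
def precision_recall(retrieved, relevant):
    """Precision: correct / retrieved.  Recall: correct / relevant."""
    correct = len(retrieved & relevant)
    return correct / len(retrieved), correct / len(relevant)

relevant = set(range(0, 100))     # ground-truth speech frames
retrieved = set(range(20, 120))   # frames the detector flagged
p, r = precision_recall(retrieved, relevant)   # 80/100 = 0.8 for both
```

Here frames 100-119 are the false positives (retrieved but not relevant) and frames 0-19 are the false negatives (relevant but missed), matching the two red regions of the figure.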


Results

There are 200 false positive frames in a 90 min video because of the students’ voices.


Conclusion

This paper presents a novel approach for generating a time shrunk lecture video using event detection.

Our method detects speech periods by voice activity detection and chalkboard writing periods by a combination of object detection and segmentation techniques.

By deleting the content-free periods and fast-forwarding the chalkboard writing periods, our method can generate a time shrunk lecture video automatically.

The resulting generated video is about 20%∼30% shorter than the original video in time. This is almost the same as the results of manual editing by a human operator.


References

1. http://www.vision.cs.chubu.ac.jp/04/pdf/e-learning08.pdf

2. http://research.cs.tamu.edu/prism/lectures/sp/l9.pdf

3. http://www.princeton.edu/~achaney/tmve/wiki100k/docs/Linear_predictive_coding.html

4. http://www.ee.columbia.edu/~dpwe/e4896/lectures/E4896-L06.pdf


ANY QUESTIONS?


THANK YOU!
