Video & Audio Representation and Coding
Sistemi Multimediali - DIS 2011
5.1 Types of Video Signals

Component Video
• Component video: higher-end video systems use three separate video signals, one each for the red, green, and blue image planes. Each color channel is sent as a separate video signal.
(a) Most computer systems use component video, with separate signals for the R, G, and B components.
(b) For any color-separation scheme, component video gives the best color reproduction, since there is no "crosstalk" between the three channels.
(c) This is not the case for S-Video or composite video, discussed next.
• Component video, however, requires more bandwidth and good synchronization of the three components.

Li & Drew
Composite Video — 1 Signal
• Composite video: color ("chrominance") and intensity ("luminance") signals are mixed into a single carrier wave.
(a) Chrominance is a composition of two color components (I and Q, or U and V).
(b) In NTSC TV, for example, I and Q are combined into a chroma signal, and a color subcarrier is then employed to put the chroma signal at the high-frequency end of the signal shared with the luminance signal.
(c) The chrominance and luminance components can be separated at the receiver end, and the two color components can then be further recovered.
(d) When connecting to TVs or VCRs, composite video uses only one wire; the video color signals are mixed, not sent separately. The audio and sync signals are additions to this one signal.
• Since color and intensity are wrapped into the same signal, some interference between the luminance and chrominance signals is inevitable.
S-Video — 2 Signals
• S-Video (separated video, or super-video, e.g., in S-VHS): as a compromise, uses two wires, one for luminance and another for a composite chrominance signal.
• As a result, there is less crosstalk between the color information and the crucial gray-scale information.
• The reason for placing luminance into its own part of the signal is that black-and-white information is most crucial for visual perception.
– In fact, humans are able to differentiate spatial resolution in grayscale images with much higher acuity than for the color part of color images.
– As a result, we can send less accurate color information than must be sent for intensity information — we can only see fairly large blobs of color, so it makes sense to send less color detail.
5.2 Analog Video
• An analog signal f(t) samples a time-varying image. So-called "progressive" scanning traces through a complete picture (a frame) row-wise for each time interval.
• In TV, and in some monitors and multimedia standards as well, another system, called "interlaced" scanning, is used:
(a) The odd-numbered lines are traced first, and then the even-numbered lines are traced. This results in "odd" and "even" fields — two fields make up one frame.
(b) In fact, the odd lines (starting from 1) end at the middle of a line at the end of the odd field, and the even scan starts at that half-way point.
• Table 5.2 gives a comparison of the three major analog broadcast TV systems.

Table 5.2: Comparison of Analog Broadcast TV Systems

TV System | Frame Rate (fps) | # of Scan Lines | Total Channel Width (MHz) | Bandwidth Allocation (MHz): Y | I or U | Q or V
NTSC      | 29.97            | 525             | 6.0                       | 4.2 | 1.6 | 0.6
PAL       | 25               | 625             | 8.0                       | 5.5 | 1.8 | 1.8
SECAM     | 25               | 625             | 8.0                       | 6.0 | 2.0 | 2.0
5.3 Digital Video
• The advantages of digital representation for video are many. For example:
(a) Video can be stored on digital devices or in memory, ready to be processed (noise removal, cut and paste, etc.), and integrated into various multimedia applications;
(b) Direct access is possible, which makes nonlinear video editing achievable as a simple, rather than a complex, task;
(c) Repeated recording does not degrade image quality;
(d) Ease of encryption and better tolerance to channel noise.
CCIR Standards for Digital Video
• CCIR is the Consultative Committee for International Radio, and one of the most important standards it has produced is CCIR-601, for component digital video.
– This standard has since become standard ITU-R-601, an international standard for professional video applications — adopted by certain digital video formats, including the popular DV video.
HDTV (High Definition TV)
• The main thrust of HDTV (High Definition TV) is not to increase the "definition" in each unit area, but rather to increase the visual field, especially in its width.
(a) The first generation of HDTV was based on an analog technology developed by Sony and NHK in Japan in the late 1970s.
(b) MUSE (MUltiple sub-Nyquist Sampling Encoding) was an improved NHK HDTV with hybrid analog/digital technologies that was put in use in the 1990s. It has 1,125 scan lines, interlaced (60 fields per second), and a 16:9 aspect ratio.
(c) Since uncompressed HDTV will easily demand more than 20 MHz bandwidth, which will not fit in the current 6 MHz or 8 MHz channels, various compression techniques are being investigated.
(d) It is also anticipated that high-quality HDTV signals will be transmitted using more than one channel even after compression.
• A brief history of HDTV evolution:
(a) In 1987, the FCC decided that HDTV standards must be compatible with the existing NTSC standard and be confined to the existing VHF (Very High Frequency) and UHF (Ultra High Frequency) bands.
(b) In 1990, the FCC announced a very different initiative, i.e., its preference for full-resolution HDTV, and it was decided that HDTV would be simultaneously broadcast with the existing NTSC TV and eventually replace it.
(c) Witnessing a boom of proposals for digital HDTV, the FCC made a key decision to go all-digital in 1993. A "grand alliance" was formed that included four main proposals: by General Instruments, MIT, Zenith, and AT&T, and by Thomson, Philips, Sarnoff, and others.
(d) This eventually led to the formation of the ATSC (Advanced Television Systems Committee), responsible for the standard for TV broadcasting of HDTV.
(e) In 1995, the U.S. FCC Advisory Committee on Advanced Television Service recommended that the ATSC Digital Television Standard be adopted.
• The standard supports the video scanning formats shown in Table 5.4. In the table, "I" means interlaced scan and "P" means progressive (non-interlaced) scan.

Table 5.4: Advanced Digital TV formats supported by ATSC

# of Active Pixels per Line | # of Active Lines | Aspect Ratio | Picture Rate
1,920                       | 1,080             | 16:9         | 60I 30P 24P
1,280                       |   720             | 16:9         | 60P 30P 24P
  704                       |   480             | 16:9 & 4:3   | 60I 60P 30P 24P
  640                       |   480             | 4:3          | 60I 60P 30P 24P
• For video, MPEG-2 is chosen as the compression standard. For audio, AC-3 is the standard. It supports the so-called 5.1 channel Dolby surround sound, i.e., five surround channels plus a subwoofer channel.
• The salient differences between conventional TV and HDTV:
(a) HDTV has a much wider aspect ratio of 16:9 instead of 4:3.
(b) HDTV moves toward progressive (non-interlaced) scan. The rationale is that interlacing introduces serrated edges to moving objects and flicker along horizontal edges.
• The FCC planned to replace all analog broadcast services with digital TV broadcasting by the year 2009. The services provided will include:
– SDTV (Standard Definition TV): the current NTSC TV or higher.
– EDTV (Enhanced Definition TV): 480 active lines or higher, i.e., the third and fourth rows in Table 5.4.
– HDTV (High Definition TV): 720 active lines or higher.
6.1 Digitization of Sound

What is Sound?
• Sound is a wave phenomenon like light, but it is macroscopic and involves molecules of air being compressed and expanded under the action of some physical device.
(a) For example, a speaker in an audio system vibrates back and forth and produces a longitudinal pressure wave that we perceive as sound.
(b) Since sound is a pressure wave, it takes on continuous values, as opposed to digitized ones.
(c) Even though such pressure waves are longitudinal, they still have ordinary wave properties and behaviors, such as reflection (bouncing), refraction (change of angle when entering a medium with a different density), and diffraction (bending around an obstacle).
(d) If we wish to use a digital version of sound waves, we must form digitized representations of audio information.
Digitization
• Digitization means conversion to a stream of numbers, and preferably these numbers should be integers for efficiency.
• Fig. 6.1 shows the 1-dimensional nature of sound: amplitude values depend on a 1D variable, time. (Note that images depend instead on a 2D set of variables, x and y.)
Fig. 6.1: An analog signal: continuous measurement of a pressure wave.
• The graph in Fig. 6.1 has to be made digital in both time and amplitude. To digitize, the signal must be sampled in each dimension: in time, and in amplitude.
(a) Sampling means measuring the quantity we are interested in, usually at evenly spaced intervals.
(b) The first kind of sampling — using measurements only at evenly spaced time intervals — is simply called sampling. The rate at which it is performed is called the sampling frequency (see Fig. 6.2(a)).
(c) For audio, typical sampling rates are from 8 kHz (8,000 samples per second) to 48 kHz. This range is determined by the Nyquist theorem, discussed later.
(d) Sampling in the amplitude or voltage dimension is called quantization. Fig. 6.2(b) shows this kind of sampling.
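The two steps above — sampling at evenly spaced time intervals, then quantizing each amplitude — can be sketched in a few lines. This is our own illustration, not from the slides; the 1 kHz test tone, the 8 kHz rate, and the signed 8-bit quantizer are arbitrary choices.

```python
import math

def sample_and_quantize(freq_hz, sample_rate, n_samples, bits=8):
    """Sample a sine wave at evenly spaced time intervals, then
    quantize each amplitude to a signed integer with `bits` bits."""
    levels = 2 ** (bits - 1) - 1          # e.g. 127 for signed 8-bit
    samples = []
    for n in range(n_samples):
        t = n / sample_rate               # sampling: discrete time steps
        amplitude = math.sin(2 * math.pi * freq_hz * t)
        samples.append(round(amplitude * levels))  # quantization
    return samples

# A 1 kHz tone sampled at 8 kHz: 8 samples cover exactly one period.
print(sample_and_quantize(1000, 8000, 8))
# [0, 90, 127, 90, 0, -90, -127, -90]
```

Both kinds of error are visible here: time is reduced to 8 instants per period, and each amplitude is forced to the nearest of 255 integer levels.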
Signal to Noise Ratio (SNR)
• The ratio of the power of the correct signal to that of the noise is called the signal-to-noise ratio (SNR) — a measure of the quality of the signal.
• The SNR is usually measured in decibels (dB), where 1 dB is a tenth of a bel. The SNR value, in units of dB, is defined in terms of base-10 logarithms of squared voltages, as follows:

$\mathrm{SNR} = 10\log_{10}\frac{V_{signal}^2}{V_{noise}^2} = 20\log_{10}\frac{V_{signal}}{V_{noise}}$    (6.2)
a) The power in a signal is proportional to the square of the voltage. For example, if the signal voltage V_signal is 10 times the noise voltage, then the SNR is 20 * log10(10) = 20 dB.
b) In terms of power, if the power from ten violins is ten times that from one violin playing, then the ratio of power is 10 dB, or 1 B.
c) To remember: Power — 10; Signal Voltage — 20.
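The "Power — 10; Voltage — 20" rule of Eq. (6.2) can be checked numerically. A small sketch (the helper names are ours):

```python
import math

def snr_db_from_voltages(v_signal, v_noise):
    # 20 log10 of the voltage ratio, Eq. (6.2)
    return 20 * math.log10(v_signal / v_noise)

def snr_db_from_powers(p_signal, p_noise):
    # 10 log10 of the power ratio; power goes as voltage squared
    return 10 * math.log10(p_signal / p_noise)

# Signal voltage 10x the noise voltage -> 20 dB
print(snr_db_from_voltages(10.0, 1.0))   # 20.0
# The same ratio expressed as powers (100:1) -> also 20 dB
print(snr_db_from_powers(100.0, 1.0))    # 20.0
# Ten violins vs. one: power ratio 10 -> 10 dB, i.e. 1 B
print(snr_db_from_powers(10.0, 1.0))     # 10.0
```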
• The usual levels of sound we hear around us are described in terms of decibels, as a ratio to the quietest sound we are capable of hearing. Table 6.1 shows approximate levels for these sounds.

Table 6.1: Magnitude levels of common sounds, in decibels

Threshold of hearing        0
Rustle of leaves           10
Very quiet room            20
Average room               40
Conversation               60
Busy street                70
Loud radio                 80
Train through station      90
Riveter                   100
Threshold of discomfort   120
Threshold of pain         140
Damage to eardrum         160
Audio Filtering
• Prior to sampling and AD conversion, the audio signal is usually filtered to remove unwanted frequencies. The frequencies kept depend on the application:
(a) For speech, typically the range from 50 Hz to 10 kHz is retained; other frequencies are blocked by a band-pass filter that screens out lower and higher frequencies.
(b) An audio music signal will typically contain frequencies from about 20 Hz up to 20 kHz.
(c) At the DA converter end, high frequencies may reappear in the output, because after sampling and quantization the smooth input signal has been replaced by a series of step functions containing all possible frequencies.
(d) So at the decoder side, a low-pass filter is used after the DA circuit.
Audio Quality vs. Data Rate
• The uncompressed data rate increases as more bits are used for quantization. Stereo doubles the bandwidth needed to transmit a digital audio signal.

Table 6.2: Data rate and bandwidth in sample audio applications

Quality    | Sample Rate (kHz) | Bits per Sample | Mono/Stereo | Data Rate, uncompressed (kB/s) | Frequency Band (kHz)
Telephone  | 8                 | 8               | Mono        | 8                              | 0.200-3.4
AM Radio   | 11.025            | 8               | Mono        | 11.0                           | 0.1-5.5
FM Radio   | 22.05             | 16              | Stereo      | 88.2                           | 0.02-11
CD         | 44.1              | 16              | Stereo      | 176.4                          | 0.005-20
DAT        | 48                | 16              | Stereo      | 192.0                          | 0.005-20
DVD Audio  | 192 (max)         | 24 (max)        | 6 channels  | 1,200 (max)                    | 0-96 (max)
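The data-rate column in Table 6.2 is just sample rate × bytes per sample × channels. A quick check (our own helper; we assume the table's convention of 1 kB = 1000 bytes):

```python
def data_rate_kbps(sample_rate_hz, bits_per_sample, channels):
    """Uncompressed audio data rate in kB/s (1 kB = 1000 bytes,
    matching the convention used in Table 6.2)."""
    return sample_rate_hz * (bits_per_sample / 8) * channels / 1000

print(data_rate_kbps(8000, 8, 1))    # Telephone: 8.0 kB/s
print(data_rate_kbps(44100, 16, 2))  # CD: 176.4 kB/s
print(data_rate_kbps(48000, 16, 2))  # DAT: 192.0 kB/s
```

Note how stereo CD audio is exactly double the 88.2 kB/s that a mono stream at the same rate and depth would need, as the bullet above states.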
6.2 MIDI: Musical Instrument Digital Interface
• To use the sound card's built-in (default) sounds, a simple scripting language and hardware setup called MIDI is used.
• MIDI Overview
(a) MIDI is a scripting language — it codes "events" that stand for the production of sounds. E.g., a MIDI event might include values for the pitch of a single note, its duration, and its volume.
(b) MIDI is a standard adopted by the electronic music industry for controlling devices, such as synthesizers and sound cards, that produce music.
(c) The MIDI standard is supported by most synthesizers, so sounds created on one synthesizer can be played and manipulated on another synthesizer and sound reasonably close.
(d) Computers must have a special MIDI interface, but this is incorporated into most sound cards. The sound card must also have both D/A and A/D converters.
MIDI Concepts
• MIDI channels are used to separate messages.
(a) There are 16 channels, numbered from 0 to 15. The channel forms the last 4 bits (the least significant bits) of the message.
(b) Usually a channel is associated with a particular instrument: e.g., channel 1 is the piano, channel 10 is the drums, etc.
(c) Nevertheless, one can switch instruments midstream, if desired, and associate another instrument with any channel.
• System messages
(a) There are several other types of messages, e.g., a general message for all instruments indicating a change in tuning or timing.
(b) If the first 4 bits are all 1s, then the message is interpreted as a system common message.
• A synthetic musical instrument usually responds to a MIDI message by simply ignoring any "play sound" message that is not for its channel.
– If several messages are for its channel, then the instrument responds, provided it is multi-voice, i.e., can play more than a single note at once.
• It is easy to confuse the term voice with the term timbre — the latter is MIDI terminology for just what instrument is being emulated, e.g., a piano as opposed to a violin: it is the quality of the sound.
(a) An instrument (or sound card) that is multi-timbral is one that is capable of playing many different sounds at the same time, e.g., piano, brass, drums, etc.
(b) On the other hand, the term voice, while sometimes used by musicians to mean the same thing as timbre, is used in MIDI to mean every different timbre and pitch that the tone module can produce at the same time.
• Different timbres are produced digitally by using a patch — the set of control settings that define a particular timbre. Patches are often organized into databases, called banks.
• A MIDI status byte has a value between 128 and 255; each of the data bytes is between 0 and 127. Actual MIDI bytes are 10-bit, including a 0 start and 0 stop bit.

Fig. 6.8: Stream of 10-bit bytes; for typical MIDI messages, these consist of {Status byte, Data Byte, Data Byte} = {Note On, Note Number, Note Velocity}
Hardware Aspects of MIDI
• The MIDI hardware setup consists of a 31.25 kbps serial connection. Usually, MIDI-capable units are either input devices or output devices, not both.
• A traditional synthesizer is shown in Fig. 6.10.

Fig. 6.10: A MIDI synthesizer
• The physical MIDI ports consist of 5-pin connectors for IN and OUT, as well as a third connector called THRU.
(a) MIDI communication is half-duplex.
(b) MIDI IN is the connector via which the device receives all MIDI data.
(c) MIDI OUT is the connector through which the device transmits all the MIDI data it generates itself.
(d) MIDI THRU is the connector by which the device echoes the data it receives from MIDI IN. Note that only the MIDI IN data is echoed by MIDI THRU — all the data generated by the device itself is sent via MIDI OUT.
• A typical MIDI sequencer setup is shown in Fig. 6.11.

Fig. 6.11: A typical MIDI setup
Structure of MIDI Messages
• MIDI messages can be classified into two types, channel messages and system messages, as in Fig. 6.12.

Fig. 6.12: MIDI message taxonomy
• A. Channel messages: can have up to 3 bytes.
a) The first byte is the status byte (the opcode, as it were); it has its most significant bit set to 1.
b) The 4 low-order bits identify which channel this message belongs to (for 16 possible channels).
c) The 3 remaining bits hold the message. For a data byte, the most significant bit is set to 0.
• A.1. Voice messages:
a) This type of channel message controls a voice, i.e., sends information specifying which note to play or to turn off, and encodes key pressure.
b) Voice messages are also used to specify controller effects such as sustain, vibrato, tremolo, and the pitch wheel.
c) Table 6.3 lists these operations.
Table 6.3: MIDI voice messages

Voice Message      | Status Byte | Data Byte 1     | Data Byte 2
Note Off           | &H8n        | Key number      | Note Off velocity
Note On            | &H9n        | Key number      | Note On velocity
Poly. Key Pressure | &HAn        | Key number      | Amount
Control Change     | &HBn        | Controller num. | Controller value
Program Change     | &HCn        | Program number  | None
Channel Pressure   | &HDn        | Pressure value  | None
Pitch Bend         | &HEn        | MSB             | LSB

(** &H indicates hexadecimal, and 'n' in the status byte hex value stands for a channel number. All values are in 0..127, except Controller number, which is in 0..120.)
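Packing a Note On message from Table 6.3 is simple bit manipulation: the status byte is &H9n, i.e. opcode 9 in the high nibble and the channel in the low nibble, followed by two data bytes. A sketch under the table's conventions (the helper name is ours):

```python
def note_on(channel, key, velocity):
    """Build a 3-byte MIDI Note On message: status byte &H9n,
    then two data bytes. Channel is 0..15; data bytes 0..127."""
    assert 0 <= channel <= 15 and 0 <= key <= 127 and 0 <= velocity <= 127
    status = 0x90 | channel   # high nibble = opcode, low nibble = channel
    return bytes([status, key, velocity])

msg = note_on(channel=0, key=60, velocity=100)    # middle C on channel 0
print(msg.hex())                                  # '903c64'
print(msg[0] >= 128, msg[1] < 128, msg[2] < 128)  # status vs. data byte ranges
```

The last line checks the rule stated earlier: the status byte is in 128..255 (most significant bit 1), while data bytes stay in 0..127.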
General MIDI
• General MIDI is a scheme for standardizing the assignment of instruments to patch numbers.
a) A standard percussion map specifies 47 percussion sounds.
b) Where a "note" appears on the musical score determines which percussion instrument is being struck: a bongo drum, a cymbal, etc.
c) Other requirements for General MIDI compatibility: a MIDI device must support all 16 channels; a device must be multi-timbral (i.e., each channel can play a different instrument/program); a device must be polyphonic (i.e., each channel is able to play many voices); and there must be a minimum of 24 dynamically allocated voices.
• General MIDI Level 2: an extended General MIDI has recently been defined, with a standard .smf ("Standard MIDI File") format defined — including extra character information, such as karaoke lyrics.
MIDI to WAV Conversion
• Some programs, such as early versions of Premiere, cannot include .mid files — instead, they insist on .wav format files.
a) Various shareware programs exist for approximating a reasonable conversion between MIDI and WAV formats.
b) These programs essentially consist of large lookup files that try to substitute pre-defined or shifted WAV output for MIDI messages, with inconsistent success.
7.1 Introduction
• Compression: the process of coding that will effectively reduce the total number of bits needed to represent certain information.

Fig. 7.1: A General Data Compression Scheme.
Introduction (cont'd)
• If the compression and decompression processes induce no information loss, then the compression scheme is lossless; otherwise, it is lossy.
• Compression ratio:

$\mathit{compression\ ratio} = \frac{B_0}{B_1}$    (7.1)

B0 – number of bits before compression
B1 – number of bits after compression
7.2 Basics of Information Theory
• The entropy η of an information source with alphabet S = {s1, s2, . . . , sn} is:

$\eta = H(S) = \sum_{i=1}^{n} p_i \log_2 \frac{1}{p_i}$    (7.2)

$\quad\;\; = -\sum_{i=1}^{n} p_i \log_2 p_i$    (7.3)

pi – probability that symbol si will occur in S.
$\log_2 \frac{1}{p_i}$ indicates the amount of information (the self-information defined by Shannon) contained in si, which corresponds to the number of bits needed to encode si.
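Eq. (7.3) translates directly into code. A sketch (the function name is ours) that also reproduces the entropy value quoted below for the two-valued image of Fig. 7.2(b), assuming its two gray levels occur with probabilities 1/3 and 2/3:

```python
import math

def entropy(probs):
    """Shannon entropy, Eq. (7.3): -sum p_i log2 p_i, in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform 256-level image: entropy is log2(256) = 8 bits (Eq. 7.4).
print(entropy([1 / 256] * 256))           # 8.0
# A two-valued image with probabilities 1/3 and 2/3: entropy ~0.92 bits,
# matching the value quoted for Fig. 7.2(b).
print(round(entropy([1 / 3, 2 / 3]), 2))  # 0.92
```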
Distribution of Gray-Level Intensities

Fig. 7.2: Histograms for Two Gray-level Images.

• Fig. 7.2(a) shows the histogram of an image with a uniform distribution of gray-level intensities, i.e., ∀i pi = 1/256. Hence, the entropy of this image is:

log2 256 = 8    (7.4)

• Fig. 7.2(b) shows the histogram of an image with two possible values. Its entropy is 0.92.
Entropy and Code Length
• As can be seen in Eq. (7.3), the entropy η is a weighted sum of the terms $\log_2 \frac{1}{p_i}$; hence it represents the average amount of information contained per symbol in the source S.
• The entropy η specifies the lower bound for the average number of bits needed to code each symbol in S, i.e.,

$\eta \le \bar{l}$    (7.5)

$\bar{l}$ – the average length (measured in bits) of the codewords produced by the encoder.
7.3 Run-Length Coding
• Memoryless source: an information source that is independently distributed, i.e., the value of the current symbol does not depend on the values of the previously appeared symbols.
• Instead of assuming a memoryless source, Run-Length Coding (RLC) exploits the memory present in the information source.
• Rationale for RLC: if the information source has the property that symbols tend to form continuous groups, then such a symbol and the length of the group can be coded.
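The rationale above can be made concrete with a minimal run-length coder over a symbol sequence (our own sketch, not a specific standard's RLC format):

```python
def rle_encode(symbols):
    """Encode a sequence as (symbol, run_length) pairs."""
    runs = []
    for s in symbols:
        if runs and runs[-1][0] == s:
            runs[-1] = (s, runs[-1][1] + 1)   # extend the current run
        else:
            runs.append((s, 1))               # start a new run
    return runs

def rle_decode(runs):
    return [s for s, n in runs for _ in range(n)]

data = list("aaaabbbcca")
encoded = rle_encode(data)
print(encoded)                      # [('a', 4), ('b', 3), ('c', 2), ('a', 1)]
print(rle_decode(encoded) == data)  # True
```

Ten symbols become four (symbol, length) pairs; the scheme pays off exactly when runs are long, and can expand the data when they are not.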
7.4 Variable-Length Coding (VLC)

Shannon-Fano Algorithm — a top-down approach
1. Sort the symbols according to the frequency count of their occurrences.
2. Recursively divide the symbols into two parts, each with approximately the same number of counts, until all parts contain only one symbol.

An example: coding of "HELLO"

Frequency count of the symbols in "HELLO":

Symbol | H | E | L | O
Count  | 1 | 1 | 2 | 1
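The two steps above can be sketched recursively. This is our own implementation; ties in the split point are broken by taking the earliest balanced cut, which for "HELLO" happens to reproduce the usual textbook code lengths (1 bit for L, more for the singletons).

```python
def shannon_fano(counts):
    """counts: dict symbol -> frequency. Returns dict symbol -> bitstring."""
    # Step 1: sort symbols by descending frequency.
    symbols = sorted(counts, key=counts.get, reverse=True)
    codes = {s: "" for s in symbols}

    def split(group):
        if len(group) <= 1:
            return
        total = sum(counts[s] for s in group)
        # Step 2: cut where the two halves' counts are most nearly equal.
        def imbalance(k):
            left = sum(counts[s] for s in group[:k])
            return abs(2 * left - total)
        cut = min(range(1, len(group)), key=imbalance)
        for s in group[:cut]:
            codes[s] += "0"
        for s in group[cut:]:
            codes[s] += "1"
        split(group[:cut])
        split(group[cut:])

    split(symbols)
    return codes

print(shannon_fano({"H": 1, "E": 1, "L": 2, "O": 1}))
# {'L': '0', 'H': '10', 'E': '110', 'O': '111'}
```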
Huffman Coding

ALGORITHM 7.1 Huffman Coding Algorithm — a bottom-up approach
1. Initialization: put all symbols on a list sorted according to their frequency counts.
2. Repeat until the list has only one symbol left:
(1) From the list, pick the two symbols with the lowest frequency counts. Form a Huffman subtree that has these two symbols as child nodes and create a parent node.
(2) Assign the sum of the children's frequency counts to the parent and insert it into the list such that the order is maintained.
(3) Delete the children from the list.
3. Assign a codeword to each leaf based on the path from the root.
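Algorithm 7.1 maps naturally onto a priority queue. A compact sketch of ours using `heapq` (a tie-break counter keeps heap comparisons well-defined; tie-breaking may produce a different tree than Fig. 7.5, but any Huffman tree has the same optimal total cost):

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """freqs: dict symbol -> count. Returns dict symbol -> bitstring."""
    tiebreak = count()
    # Initialization: a min-heap of single-leaf subtrees.
    heap = [(f, next(tiebreak), {s: ""}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Pick the two lowest-count subtrees and merge them under a parent.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

freqs = {"H": 1, "E": 1, "L": 2, "O": 1}
codes = huffman_codes(freqs)
total_bits = sum(freqs[s] * len(codes[s]) for s in freqs)
print(total_bits)  # 10 -- the optimal cost for "HELLO", whatever the tie-breaks
```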
Fig. 7.5: Coding Tree for "HELLO" using the Huffman Algorithm.
Huffman Coding (cont'd)
In Fig. 7.5, new symbols P1, P2, P3 are created to refer to the parent nodes in the Huffman coding tree. The contents of the list are illustrated below:

After initialization:  L H E O
After iteration (a):   L P1 H
After iteration (b):   L P2
After iteration (c):   P3
Properties of Huffman Coding
1. Unique prefix property: no Huffman code is a prefix of any other Huffman code — this precludes any ambiguity in decoding.
2. Optimality: a minimum-redundancy code — proved optimal for a given data model (i.e., a given, accurate, probability distribution):
• The two least frequent symbols will have the same length for their Huffman codes, differing only in the last bit.
• Symbols that occur more frequently will have shorter Huffman codes than symbols that occur less frequently.
• The average code length for an information source S is strictly less than η + 1. Combined with Eq. (7.5), we have:

$\eta \le \bar{l} < \eta + 1$    (7.6)
7.7 Lossless Image Compression
• Approaches to differential coding of images:
– Given an original image I(x, y), using a simple difference operator we can define a difference image d(x, y) as follows:

d(x, y) = I(x, y) − I(x − 1, y)    (7.9)

or use the discrete version of the 2-D Laplacian operator to define a difference image d(x, y) as

d(x, y) = 4 I(x, y) − I(x, y − 1) − I(x, y + 1) − I(x + 1, y) − I(x − 1, y)    (7.10)

• Due to the spatial redundancy present in normal images I, the difference image d will have a narrower histogram and hence a smaller entropy, as shown in Fig. 7.9.
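The entropy reduction promised by Eq. (7.9) is easy to demonstrate on a toy example. This sketch (ours) uses a single smooth "image row" rather than a real image, and an entropy estimate from symbol counts:

```python
import math
from collections import Counter

def entropy(values):
    """Entropy of a symbol sequence, estimated from its histogram."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

# A smooth 1-row "image": neighboring pixels are similar.
row = [100, 101, 103, 104, 104, 105, 107, 108, 110, 111]
# Difference image d(x) = I(x) - I(x-1), as in Eq. (7.9); first pixel kept as-is.
diff = [row[0]] + [row[i] - row[i - 1] for i in range(1, len(row))]

# The differences cluster around a few small values, so entropy drops.
print(round(entropy(row), 2), round(entropy(diff), 2))
```

The original row uses nine distinct values with a nearly flat histogram; the difference sequence is dominated by 1s and 2s, so its histogram is much narrower and its entropy much smaller, exactly the effect shown in Fig. 7.9.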
Fig. 7.9: Distributions for Original versus Derivative Images. (a,b): Original gray-level image and its partial derivative image; (c,d): Histograms for the original and derivative images. (This figure uses a commonly employed image called "Barb".)
8.1 Introduction
• Lossless compression algorithms do not deliver compression ratios that are high enough. Hence, most multimedia compression algorithms are lossy.
• What is lossy compression?
– The compressed data is not the same as the original data, but a close approximation of it.
– It yields a much higher compression ratio than lossless compression.
8.2 Distortion Measures
• The three most commonly used distortion measures in image compression are:
– mean square error (MSE) σ²:

$\sigma^2 = \frac{1}{N}\sum_{n=1}^{N}(x_n - y_n)^2$    (8.1)

where xn, yn, and N are the input data sequence, the reconstructed data sequence, and the length of the data sequence, respectively.
– signal-to-noise ratio (SNR), in decibel units (dB):

$\mathrm{SNR} = 10\log_{10}\frac{\sigma_x^2}{\sigma_d^2}$    (8.2)

where $\sigma_x^2$ is the average square value of the original data sequence and $\sigma_d^2$ is the MSE.
– peak signal-to-noise ratio (PSNR):

$\mathrm{PSNR} = 10\log_{10}\frac{x_{peak}^2}{\sigma_d^2}$    (8.3)
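Eqs. (8.1) and (8.3) in code, as a sketch (the function names and the 8-bit peak value of 255 are our assumptions; the sample sequences are arbitrary):

```python
import math

def mse(x, y):
    # Eq. (8.1): mean square error between original x and reconstruction y
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

def psnr(x, y, peak=255):
    # Eq. (8.3): peak signal-to-noise ratio in dB, for 8-bit data by default
    return 10 * math.log10(peak ** 2 / mse(x, y))

original      = [52, 55, 61, 66, 70, 61, 64, 73]
reconstructed = [53, 55, 60, 65, 70, 62, 64, 72]
print(round(mse(original, reconstructed), 3))   # 0.625
print(round(psnr(original, reconstructed), 1))  # 50.2
```

Note the difference between Eq. (8.2) and Eq. (8.3): SNR normalizes by the signal's own average power, whereas PSNR uses the fixed peak value, which is why PSNR is the more common figure for 8-bit images.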
Spatial Frequency and DCT
• Spatial frequency indicates how many times pixel values change across an image block.
• The DCT formalizes this notion with a measure of how much the image contents change in correspondence to the number of cycles of a cosine wave per block.
• The role of the DCT is to decompose the original signal into its DC and AC components; the role of the IDCT is to reconstruct (re-compose) the signal.
Definition of DCT: given an input function f(i, j) over two integer variables i and j (a piece of an image), the 2D DCT transforms it into a new function F(u, v), with integers u and v running over the same range as i and j. The general definition of the transform is:

$F(u,v) = \frac{2\,C(u)\,C(v)}{\sqrt{MN}} \sum_{i=0}^{M-1}\sum_{j=0}^{N-1} \cos\frac{(2i+1)\,u\pi}{2M}\cdot\cos\frac{(2j+1)\,v\pi}{2N}\cdot f(i,j)$    (8.15)

where i, u = 0, 1, . . . , M − 1; j, v = 0, 1, . . . , N − 1; and the constants C(u) and C(v) are determined by

$C(\xi) = \begin{cases}\dfrac{\sqrt{2}}{2} & \text{if } \xi = 0,\\[4pt] 1 & \text{otherwise.}\end{cases}$    (8.16)
2D Discrete Cosine Transform (2D DCT):

$F(u,v) = \frac{C(u)\,C(v)}{4} \sum_{i=0}^{7}\sum_{j=0}^{7} \cos\frac{(2i+1)\,u\pi}{16}\cos\frac{(2j+1)\,v\pi}{16}\, f(i,j)$    (8.17)

where i, j, u, v = 0, 1, . . . , 7, and the constants C(u) and C(v) are determined by Eq. (8.16).

2D Inverse Discrete Cosine Transform (2D IDCT): the inverse function is almost the same, with the roles of f(i, j) and F(u, v) reversed, except that now C(u)C(v) must stand inside the sums:

$\tilde{f}(i,j) = \sum_{u=0}^{7}\sum_{v=0}^{7} \frac{C(u)\,C(v)}{4} \cos\frac{(2i+1)\,u\pi}{16}\cos\frac{(2j+1)\,v\pi}{16}\, F(u,v)$    (8.18)

where i, j, u, v = 0, 1, . . . , 7.
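Eqs. (8.17) and (8.18) can be coded directly. A naive O(N^4) sketch of ours that round-trips an arbitrary 8×8 block (the test block is made up; a real codec would use the separable form shown later, or an FFT-style factorization):

```python
import math

def C(x):
    # Eq. (8.16)
    return math.sqrt(2) / 2 if x == 0 else 1.0

def dct2(f):
    """Naive 8x8 2D DCT, Eq. (8.17)."""
    return [[C(u) * C(v) / 4 * sum(
                math.cos((2 * i + 1) * u * math.pi / 16) *
                math.cos((2 * j + 1) * v * math.pi / 16) * f[i][j]
                for i in range(8) for j in range(8))
             for v in range(8)] for u in range(8)]

def idct2(F):
    """Naive 8x8 2D IDCT, Eq. (8.18); note C(u)C(v) is inside the sums."""
    return [[sum(C(u) * C(v) / 4 *
                 math.cos((2 * i + 1) * u * math.pi / 16) *
                 math.cos((2 * j + 1) * v * math.pi / 16) * F[u][v]
                 for u in range(8) for v in range(8))
             for j in range(8)] for i in range(8)]

block = [[(i + j) % 8 for j in range(8)] for i in range(8)]
rebuilt = idct2(dct2(block))
err = max(abs(block[i][j] - rebuilt[i][j]) for i in range(8) for j in range(8))
print(err < 1e-9)  # True: DCT followed by IDCT reconstructs the block
```

Without quantization the transform itself is lossless up to floating-point error; the loss in JPEG comes entirely from the quantization step described in Chapter 9.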
The DCT is a linear transform: in general, a transform T (or function) is linear iff

$T(\alpha p + \beta q) = \alpha\, T(p) + \beta\, T(q)$    (8.21)

where α and β are constants, and p and q are any functions, variables, or constants. From the definition in Eq. (8.17) or (8.19), this property can readily be proven for the DCT, because it uses only simple arithmetic operations.
The Cosine Basis Functions
• Functions Bp(i) and Bq(i) are orthogonal if

$\sum_i B_p(i)\,B_q(i) = 0 \quad \text{if } p \ne q$    (8.22)

• Functions Bp(i) and Bq(i) are orthonormal if they are orthogonal and

$\sum_i B_p(i)\,B_q(i) = 1 \quad \text{if } p = q$    (8.23)

• It can be shown that:

$\sum_{i=0}^{7} \cos\frac{(2i+1)\,p\pi}{16}\cdot\cos\frac{(2i+1)\,q\pi}{16} = 0 \quad \text{if } p \ne q$

$\sum_{i=0}^{7} \frac{C(p)}{2}\cos\frac{(2i+1)\,p\pi}{16}\cdot\frac{C(q)}{2}\cos\frac{(2i+1)\,q\pi}{16} = 1 \quad \text{if } p = q$

Fig. 8.9: Graphical Illustration of the 8 × 8 2D DCT basis.
2D Separable Basis
• The 2D DCT can be separated into a sequence of two 1D DCT steps:

$G(i,v) = \frac{1}{2}\, C(v) \sum_{j=0}^{7} \cos\frac{(2j+1)\,v\pi}{16}\, f(i,j)$    (8.24)

$F(u,v) = \frac{1}{2}\, C(u) \sum_{i=0}^{7} \cos\frac{(2i+1)\,u\pi}{16}\, G(i,v)$    (8.25)

• It is straightforward to see that this simple change saves many arithmetic steps. The number of iterations required is reduced from 8 × 8 to 8 + 8.
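The row-column decomposition of Eqs. (8.24)-(8.25) produces exactly the same coefficients as the direct 2D formula of Eq. (8.17). A quick numerical check (our own sketch; the input block is arbitrary):

```python
import math

def C(x):
    return math.sqrt(2) / 2 if x == 0 else 1.0

def dct2_direct(f):
    # Direct 2D DCT, Eq. (8.17): an 8x8 double sum per coefficient.
    return [[C(u) * C(v) / 4 * sum(
                math.cos((2 * i + 1) * u * math.pi / 16) *
                math.cos((2 * j + 1) * v * math.pi / 16) * f[i][j]
                for i in range(8) for j in range(8))
             for v in range(8)] for u in range(8)]

def dct2_separable(f):
    # Eq. (8.24): 1D DCT along each row of f -> intermediate G(i, v).
    G = [[C(v) / 2 * sum(math.cos((2 * j + 1) * v * math.pi / 16) * f[i][j]
                         for j in range(8)) for v in range(8)]
         for i in range(8)]
    # Eq. (8.25): 1D DCT along each column of G -> F(u, v).
    return [[C(u) / 2 * sum(math.cos((2 * i + 1) * u * math.pi / 16) * G[i][v]
                            for i in range(8)) for v in range(8)]
            for u in range(8)]

f = [[(3 * i + 5 * j) % 17 for j in range(8)] for i in range(8)]
A, B = dct2_direct(f), dct2_separable(f)
diff = max(abs(A[u][v] - B[u][v]) for u in range(8) for v in range(8))
print(diff < 1e-9)  # True: the two factorizations agree
```

Per coefficient, the direct form sums 64 cosine products, while the separable form needs only two 8-term sums once G is computed, which is where the 8 × 8 versus 8 + 8 saving comes from.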
9.1 The JPEG Standard
• JPEG is an image compression standard developed by the Joint Photographic Experts Group. JPEG was formally accepted as an international standard in 1992.
• JPEG is a lossy image compression method. It employs a transform coding method using the DCT (Discrete Cosine Transform).
• An image is a function of i and j (or conventionally x and y) in the spatial domain. The 2D DCT is used as one step in JPEG in order to yield a frequency response, a function F(u, v) in the spatial frequency domain, indexed by two integers u and v.
Observations for JPEG Image Compression
• The effectiveness of the DCT transform coding method in JPEG relies on 3 major observations:

Observation 1: useful image contents change relatively slowly across the image, i.e., it is unusual for intensity values to vary widely several times in a small area, for example, within an 8×8 image block.
• Much of the information in an image is repeated — hence "spatial redundancy".
Observations for JPEG Image Compression (cont'd)

Observation 2: psychophysical experiments suggest that humans are much less likely to notice the loss of very high spatial frequency components than the loss of lower-frequency components.
• The spatial redundancy can be reduced by largely reducing the high spatial frequency contents.

Observation 3: visual acuity (accuracy in distinguishing closely spaced lines) is much greater for gray ("black and white") than for color.
• Chroma subsampling (4:2:0) is used in JPEG.

Fig. 9.1: Block diagram for JPEG encoder.
9.1.1 Main Steps in JPEG Image Compression
• Transform RGB to YIQ or YUV and subsample color.
• DCT on image blocks.
• Quantization.
• Zig-zag ordering and run-length encoding.
• Entropy coding.
DCT on image blocks
• Each image is divided into 8 × 8 blocks. The 2D DCT is applied to each block image f(i, j), with the output being the DCT coefficients F(u, v) for each block.
• Using blocks, however, has the effect of isolating each block from its neighboring context. This is why JPEG images look choppy ("blocky") when a high compression ratio is specified by the user.
Quan0za0on
(9.1) • F(u, v) represents a DCT coefficient, Q(u, v) is a “quanHzaHon matrix” entry,
and represents the quan+zed DCT coefficients which JPEG will use in the succeeding entropy coding.
– The quantization step is the main source for loss in JPEG compression.
– The entries of Q(u, v) tend to have larger values towards the lower right corner. This aims to introduce more loss at the higher spatial frequencies, a practice supported by Observations 1 and 2.
– Tables 9.1 and 9.2 show the default Q(u, v) values, obtained from psychophysical studies with the goal of maximizing the compression ratio while minimizing perceptual losses in JPEG images.
Table 9.1 The Luminance Quantization Table
Table 9.2 The Chrominance Quantization Table
16  11  10  16  24  40  51  61
12  12  14  19  26  58  60  55
14  13  16  24  40  57  69  56
14  17  22  29  51  87  80  62
18  22  37  56  68 109 103  77
24  35  55  64  81 104 113  92
49  64  78  87 103 121 120 101
72  92  95  98 112 100 103  99
17  18  24  47  99  99  99  99
18  21  26  66  99  99  99  99
24  26  56  99  99  99  99  99
47  66  99  99  99  99  99  99
99  99  99  99  99  99  99  99
99  99  99  99  99  99  99  99
99  99  99  99  99  99  99  99
99  99  99  99  99  99  99  99
An 8 × 8 block from the Y image of ‘Lena’
Fig. 9.2: JPEG compression for a smooth image block.
200 202 189 188 189 175 175 175
200 203 198 188 189 182 178 175
203 200 200 195 200 187 185 175
200 200 200 200 197 187 187 187
200 205 200 200 195 188 187 175
200 200 200 200 200 190 187 175
205 200 199 200 191 187 187 175
210 200 200 200 188 185 187 186
f(i, j)
515  65 -12   4   1   2  -8   5
-16   3   2   0   0 -11  -2   3
-12   6  11  -1   3   0   1  -2
 -8   3  -4   2  -2  -3  -5  -2
  0  -2   7  -5   4   0  -1  -4
  0  -3  -1   0   4   1  -1   0
  3  -2  -3   3   3  -1  -1   3
 -2   5  -2   4  -2   2  -3   0
F(u, v)
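Applying Eq. (9.1) to this block is direct; a minimal sketch using the top-left 4 × 4 corners of the F(u, v) coefficients above and the luminance table Q(u, v) (Table 9.1):

```python
import numpy as np

# Top-left 4x4 corner of the luminance quantization table Q(u, v) (Table 9.1).
Q = np.array([[16, 11, 10, 16],
              [12, 12, 14, 19],
              [14, 13, 16, 24],
              [14, 17, 22, 29]], dtype=float)

# Matching corner of the DCT coefficients F(u, v) from the 'Lena' block.
F = np.array([[515, 65, -12,  4],
              [-16,  3,   2,  0],
              [-12,  6,  11, -1],
              [ -8,  3,  -4,  2]], dtype=float)

# Eq. (9.1): these quantized values feed the succeeding entropy coding.
F_hat = np.round(F / Q)
```

Note how the small high-frequency coefficients collapse to zero, which is precisely what makes the subsequent run-length coding effective.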
9.1.2 Four Commonly Used JPEG Modes
• Sequential Mode — the default JPEG mode, implicitly assumed in the discussions so far. Each grayscale image or color image component is encoded in a single left-to-right, top-to-bottom scan.
• Progressive Mode.
• Hierarchical Mode.
• Lossless Mode.
9.1.3 A Glance at the JPEG Bitstream
Fig. 9.6: JPEG bitstream.
9.2 The JPEG2000 Standard
• Design Goals:
– To provide a better rate-distortion tradeoff and improved subjective image quality.
– To provide additional functionalities lacking in the current JPEG standard.
• The JPEG2000 standard addresses the following problems:
– Lossless and Lossy Compression: There is currently no standard that can provide superior lossless compression and lossy compression in a single bitstream.
– Low Bit-rate Compression: The current JPEG standard offers excellent rate-distortion performance at mid and high bit-rates. However, at bit-rates below 0.25 bpp, subjective distortion becomes unacceptable. This matters if we hope to receive images on web-enabled ubiquitous devices, such as web-aware wristwatches and so on.
– Large Images: The new standard allows image resolutions greater than 64K by 64K without tiling. It can handle image sizes up to 2^32 − 1.
– Single Decompression Architecture: The current JPEG standard has 44 modes, many of which are application-specific and not used by the majority of JPEG decoders.
– Transmission in Noisy Environments: The new standard provides improved error resilience for transmission in noisy environments such as wireless networks and the Internet.
– Progressive Transmission: The new standard provides seamless quality and resolution scalability from low to high bit-rate. The target bit-rate and reconstruction resolution need not be known at the time of compression.
– Region of Interest Coding: The new standard allows the specification of Regions of Interest (ROI), which can be coded with higher quality than the rest of the image. One might like to code the face of a speaker with higher quality than the surrounding furniture.
– Computer-Generated Imagery: The current JPEG standard is optimized for natural imagery and does not perform well on computer-generated imagery.
– Compound Documents: The new standard offers metadata mechanisms for incorporating additional non-image data as part of the file. This might be useful for including text along with imagery, as one important example.
• In addition, JPEG2000 is able to handle up to 256 channels of information, whereas the current JPEG standard is only able to handle three color channels.
Properties of JPEG2000 Image Compression
• Uses the Embedded Block Coding with Optimized Truncation (EBCOT) algorithm, which partitions each subband (LL, LH, HL, HH) produced by the wavelet transform into small blocks called "code blocks".
• A separate scalable bitstream is generated for each code block ⇒ improved error resilience.
Fig. 9.7: Code block structure of EBCOT.
Main Steps of JPEG2000 Image Compression
• Embedded block coding and bitstream generation.
• Post-compression rate-distortion (PCRD) optimization.
• Layer formation and representation.
Region of Interest Coding in JPEG2000
• Goal:
– Particular regions of the image may contain important information, and thus should be coded with better quality than others.
• Usually implemented using the MAXSHIFT method, which scales up the coefficients within the ROI so that they are placed into higher bit-planes.
• During the embedded coding process, the resulting bits are placed in front of the non-ROI part of the image. Therefore, given a reduced bit-rate, the ROI will be decoded and refined before the rest of the image.
Fig. 9.11: Region of interest (ROI) coding of an image using a circularly shaped ROI. (a) 0.4 bpp, (b) 0.5 bpp, (c) 0.6 bpp, and (d) 0.7 bpp.
Fig. 9.12: Performance comparison for JPEG and JPEG2000 on different image types. (a): Natural images.
Fig. 9.13: Comparison of JPEG and JPEG2000. (a) Original image.
Fig. 9.13 (Cont'd): Comparison of JPEG and JPEG2000. (b) JPEG (left) and JPEG2000 (right) images compressed at 0.75 bpp. (c) JPEG (left) and JPEG2000 (right) images compressed at 0.25 bpp.
9.3 The JPEG-LS Standard
• JPEG-LS is the current ISO/ITU standard for lossless or "near lossless" compression of continuous-tone images.
• It is part of a larger ISO effort aimed at better compression of medical images.
• Uses the LOCO-I (LOw COmplexity LOssless COmpression for Images) algorithm proposed by Hewlett-Packard.
• Motivated by the observation that complexity reduction is often more important than the small increases in compression offered by more complex algorithms.
Main Advantage: Low complexity!
10.1 Introduction to Video Compression
• A video consists of a time-ordered sequence of frames, i.e., images.
• An obvious solution to video compression would be predictive coding based on previous frames. Compression proceeds by subtracting images: subtract in time order and code the residual error.
• It can be done even better by searching for just the right parts of the image to subtract from the previous frame.
10.2 Video Compression with Motion Compensation
• Consecutive frames in a video are similar: temporal redundancy exists.
• Temporal redundancy is exploited so that not every frame of the video needs to be coded independently as a new image. The difference between the current frame and other frame(s) in the sequence will be coded; the differences have small values and low entropy, which is good for compression.
• Steps of video compression based on Motion Compensation (MC):
1. Motion Estimation (motion vector search).
2. MC-based Prediction.
3. Derivation of the prediction error, i.e., the difference.
Motion Compensation
• Each image is divided into macroblocks of size N × N.
– By default, N = 16 for luminance images. For chrominance images, N = 8 if 4:2:0 chroma subsampling is adopted.
• Motion compensation is performed at the macroblock level.
– The current image frame is referred to as the Target Frame.
– A match is sought between the macroblock in the Target Frame and the most similar macroblock in previous and/or future frame(s) (referred to as Reference frame(s)).
– The displacement of the reference macroblock to the target macroblock is called a motion vector MV.
– Figure 10.1 shows the case of forward prediction, in which the Reference frame is taken to be a previous frame.
• MV search is usually limited to a small immediate neighborhood: both horizontal and vertical displacements in the range [−p, p]. This makes a search window of size (2p + 1) × (2p + 1).
Fig. 10.1: Macroblocks and Motion Vector in Video Compression.
10.3 Search for Motion Vectors
• The difference between two macroblocks can then be measured by their Mean Absolute Difference (MAD):
MAD(i, j) = (1/N²) Σ_{k=0}^{N−1} Σ_{l=0}^{N−1} | C(x + k, y + l) − R(x + i + k, y + j + l) |   (10.1)
N — size of the macroblock,
k and l — indices for pixels in the macroblock,
i and j — horizontal and vertical displacements,
C(x + k, y + l) — pixels in the macroblock in the Target frame,
R(x + i + k, y + j + l) — pixels in the macroblock in the Reference frame.
• The goal of the search is to find a vector (i, j) as the motion vector MV = (u, v), such that MAD(i, j) is minimum:
(u, v) = [ (i, j) | MAD(i, j) is minimum, i ∈ [−p, p], j ∈ [−p, p] ]   (10.2)
Sequential Search
• Sequential search: sequentially search the whole (2p + 1) × (2p + 1) window in the Reference frame (also referred to as Full search).
– A macroblock centered at each of the positions within the window is compared to the macroblock in the Target frame pixel by pixel, and their respective MAD is then derived using Eq. (10.1).
– The vector (i, j) that offers the least MAD is designated as the MV (u, v) for the macroblock in the Target frame.
– The sequential search method is very costly. Assuming each pixel comparison requires three operations (subtraction, absolute value, addition), the cost of obtaining a motion vector for a single macroblock is (2p + 1) · (2p + 1) · N² · 3 ⇒ O(p²N²).
PROCEDURE 10.1 Motion-vector: sequential-search
begin
    min_MAD = LARGE_NUMBER; /* Initialization */
    for i = −p to p
        for j = −p to p
        {
            cur_MAD = MAD(i, j);
            if cur_MAD < min_MAD
            {
                min_MAD = cur_MAD;
                u = i; /* Get the coordinates for MV. */
                v = j;
            }
        }
end
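Procedure 10.1 translates almost line for line into code. A sketch, with simplifying assumptions: arrays are indexed [row, column], and the frames are large enough that the whole search window stays in bounds:

```python
import numpy as np

def mad(target, ref, x, y, i, j, N):
    """Eq. (10.1): mean absolute difference between the N x N target
    macroblock at (x, y) and the reference block displaced by (i, j)."""
    C = target[y:y + N, x:x + N]
    R = ref[y + j:y + j + N, x + i:x + i + N]
    return np.abs(C - R).mean()

def sequential_search(target, ref, x, y, N, p):
    """Full (sequential) search over the (2p+1) x (2p+1) window,
    as in Procedure 10.1; returns the motion vector (u, v)."""
    min_mad, u, v = float("inf"), 0, 0
    for i in range(-p, p + 1):
        for j in range(-p, p + 1):
            cur = mad(target, ref, x, y, i, j, N)
            if cur < min_mad:
                min_mad, u, v = cur, i, j
    return u, v
```

The nested loop makes the O(p²N²) cost visible: (2p + 1)² MAD evaluations, each touching N² pixels.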
2D Logarithmic Search
• Logarithmic search: a cheaper version that is suboptimal but still usually effective.
• The procedure for 2D Logarithmic Search of motion vectors takes several iterations and is akin to a binary search:
– As illustrated in Fig. 10.2, initially only nine locations in the search window are used as seeds for a MAD-based search; they are marked as '1'.
– After the one that yields the minimum MAD is located, the center of the new search region is moved to it and the step-size ("offset") is reduced to half.
– In the next iteration, the nine new locations are marked as '2', and so on.
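The iteration above can be sketched compactly. The `cost` callable standing in for the MAD evaluation is an assumption of this sketch, and clamping candidates to the [−p, p] window is omitted for brevity:

```python
def logarithmic_search(cost, p):
    """2D logarithmic search sketch. cost(i, j) returns the MAD at
    displacement (i, j); the nine-point pattern shrinks by half each round."""
    ci, cj = 0, 0                 # center of the current search region
    step = (p + 1) // 2           # initial offset, roughly p / 2
    while True:
        # Nine candidate positions around the current center.
        candidates = [(ci + di, cj + dj)
                      for di in (-step, 0, step)
                      for dj in (-step, 0, step)]
        # Move the center to the candidate with the minimum cost.
        ci, cj = min(candidates, key=lambda ij: cost(*ij))
        if step == 1:
            return ci, cj
        step = max(1, step // 2)
```

Instead of (2p + 1)² evaluations, only about 9·log₂(p) cost evaluations are needed, which is why the method is cheaper yet suboptimal: it can settle in a local minimum of the MAD surface.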
Hierarchical Search
• The search can benefit from a hierarchical (multiresolution) approach in which an initial estimate of the motion vector is obtained from images with a significantly reduced resolution.
• Figure 10.3: a three-level hierarchical search in which the original image is at Level 0, images at Levels 1 and 2 are obtained by down-sampling from the previous levels by a factor of 2, and the initial search is conducted at Level 2. Since the size of the macroblock is smaller and p can also be proportionally reduced, the number of operations required is greatly reduced.
Fig. 10.3: A Three-level Hierarchical Search for Motion Vectors.
Hierarchical Search (Cont'd)
• Given the estimated motion vector (uk, vk) at Level k, a 3 × 3 neighborhood centered at (2·uk, 2·vk) at Level k − 1 is searched for the refined motion vector.
• The refinement is such that at Level k − 1 the motion vector (uk−1, vk−1) satisfies:
(2uk − 1 ≤ uk−1 ≤ 2uk + 1,  2vk − 1 ≤ vk−1 ≤ 2vk + 1)
• Let (x0^k, y0^k) denote the center of the macroblock at Level k in the Target frame. The procedure for hierarchical motion vector search for the macroblock centered at (x0^0, y0^0) in the Target frame can be outlined as follows:
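The 3 × 3 refinement neighborhood at Level k − 1 can be enumerated directly; a one-function sketch of the constraint above (the function name is illustrative):

```python
def refinement_neighborhood(uk, vk):
    """Candidate motion vectors at Level k-1, given (uk, vk) at Level k:
    the 3x3 neighborhood centered at (2*uk, 2*vk), so that
    2*uk - 1 <= u_{k-1} <= 2*uk + 1 and 2*vk - 1 <= v_{k-1} <= 2*vk + 1."""
    return [(2 * uk + di, 2 * vk + dj)
            for di in (-1, 0, 1)
            for dj in (-1, 0, 1)]
```

Only nine MAD evaluations are needed per level after the coarse search, which is where the large saving over a full-resolution search comes from.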
10.4 H.261
• H.261: an early digital video compression standard; its principle of MC-based compression is retained in all later video compression standards.
– The standard was designed for videophone, video conferencing and other audiovisual services over ISDN.
– The video codec supports bit-rates of p × 64 kbps, where p ranges from 1 to 30 (hence it is also known as p * 64).
– The standard requires that the delay of the video encoder be less than 150 msec, so that the video can be used for real-time bidirectional video conferencing.
ITU Recommendations & H.261 Video Formats
• H.261 belongs to the following set of ITU recommendations for visual telephony systems:
1. H.221 — Frame structure for an audiovisual channel supporting 64 to 1,920 kbps.
2. H.230 — Frame control signals for audiovisual systems.
3. H.242 — Audiovisual communication protocols.
4. H.261 — Video encoder/decoder for audiovisual services at p × 64 kbps.
5. H.320 — Narrow-band audiovisual terminal equipment for p × 64 kbps transmission.
Table 10.2 Video Formats Supported by H.261
Fig. 10.4: H.261 Frame Sequence.
H.261 Frame Sequence
• Two types of image frames are defined: Intra-frames (I-frames) and Inter-frames (P-frames):
– I-frames are treated as independent images. A transform coding method similar to JPEG is applied within each I-frame, hence "Intra".
– P-frames are not independent: they are coded by a forward predictive coding method (prediction from a previous P-frame is allowed, not just from a previous I-frame).
– Temporal redundancy removal is included in P-frame coding, whereas I-frame coding performs only spatial redundancy removal.
– To avoid propagation of coding errors, an I-frame is usually sent a couple of times in each second of the video.
• Motion vectors in H.261 are always measured in units of full pixels and have a limited range of ±15 pixels, i.e., p = 15.
Intra-frame (I-frame) Coding
Fig. 10.5: I-frame Coding.
• Macroblocks are of size 16 × 16 pixels for the Y frame, and 8 × 8 for the Cb and Cr frames, since 4:2:0 chroma subsampling is employed. A macroblock consists of four Y, one Cb, and one Cr 8 × 8 blocks.
• For each 8 × 8 block a DCT transform is applied; the DCT coefficients then go through quantization, zigzag scan, and entropy coding.
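The zigzag scan order is fixed by the block size and can be generated rather than tabulated. The slides do not give this construction; the following is a standard formulation, sketched here for illustration:

```python
def zigzag_order(n=8):
    """Visit an n x n block along anti-diagonals, alternating direction,
    so low-frequency coefficients come first in the run-length encoder's
    input and the trailing zeros cluster at the end."""
    def key(ij):
        i, j = ij
        s = i + j
        # Odd diagonals run top-right to bottom-left (j descending),
        # even diagonals run bottom-left to top-right (j ascending).
        return (s, -j if s % 2 else j)
    return sorted(((i, j) for i in range(n) for j in range(n)), key=key)
```

Reading the quantized coefficients in this order is what turns the many high-frequency zeros produced by Eq. (9.1) into long, cheaply coded runs.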
Inter-frame (P-frame) Predictive Coding
• Figure 10.6 shows the H.261 P-frame coding scheme based on motion compensation:
– For each macroblock in the Target frame, a motion vector is allocated by one of the search methods discussed earlier.
– After the prediction, a difference macroblock is derived to measure the prediction error.
– Each of these 8 × 8 blocks goes through DCT, quantization, zigzag scan and entropy coding procedures.
• The P-frame coding encodes the difference macroblock (not the Target macroblock itself).
• Sometimes, a good match cannot be found, i.e., the prediction error exceeds a certain acceptable level.
– The MB itself is then encoded (treated as an Intra MB), and in this case it is termed a non-motion-compensated MB.
• For a motion vector, the difference MVD is sent for entropy coding:
MVD = MV_Preceding − MV_Current   (10.3)
Fig. 10.6: H.261 P-frame Coding Based on Motion Compensation.
11.1 Overview
• MPEG: Moving Pictures Experts Group, established in 1988 for the development of digital video.
• It was appropriately recognized that proprietary interests need to be maintained within the family of MPEG standards:
– Accomplished by defining only a compressed bitstream that implicitly defines the decoder.
– The compression algorithms, and thus the encoders, are completely up to the manufacturers.
11.2 MPEG-1
• MPEG-1 adopts the CCIR601 digital TV format, also known as SIF (Source Input Format).
• MPEG-1 supports only non-interlaced video. Normally, its picture resolution is:
– 352 × 240 for NTSC video at 30 fps
– 352 × 288 for PAL video at 25 fps
– It uses 4:2:0 chroma subsampling.
• The MPEG-1 standard is also referred to as ISO/IEC 11172. It has five parts: 11172-1 Systems, 11172-2 Video, 11172-3 Audio, 11172-4 Conformance, and 11172-5 Software.
Motion Compensation in MPEG-1
• Motion Compensation (MC) based video encoding in H.261 works as follows:
– In Motion Estimation (ME), each macroblock (MB) of the Target P-frame is assigned a best-matching MB from the previously coded I or P frame: prediction.
– Prediction error: the difference between the MB and its matching MB, sent to the DCT and its subsequent encoding steps.
– The prediction is from a previous frame: forward prediction.
Fig 11.1: The Need for Bidirectional Search.
The MB containing part of a ball in the Target frame cannot find a good matching MB in the previous frame because half of the ball was occluded by another object. A match, however, can readily be obtained from the next frame.
Motion Compensation in MPEG-1 (Cont'd)
• MPEG introduces a third frame type, B-frames, and its accompanying bidirectional motion compensation.
• The MC-based B-frame coding idea is illustrated in Fig. 11.2:
– Each MB from a B-frame will have up to two motion vectors (MVs), one from the forward and one from the backward prediction.
– If matching in both directions is successful, then two MVs will be sent, and the two corresponding matching MBs are averaged (indicated by '%' in the figure) before comparing to the Target MB for generating the prediction error.
– If an acceptable match can be found in only one of the reference frames, then only one MV and its corresponding MB will be used, from either the forward or backward prediction.
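The two cases above can be sketched in a few lines. The averaging of the two matching MBs is elementwise; the function name and the fallback signature are illustrative:

```python
import numpy as np

def bframe_prediction(fwd_mb, bwd_mb=None):
    """B-frame prediction sketch: average the forward and backward
    matching macroblocks when both matches succeed (the '%' operation
    in Fig. 11.2); otherwise fall back to the single available match."""
    if bwd_mb is None:
        return fwd_mb
    return (fwd_mb + bwd_mb) / 2.0

# Constant 16x16 blocks keep the arithmetic easy to follow.
target = np.full((16, 16), 100.0)
pred = bframe_prediction(np.full((16, 16), 96.0),
                         np.full((16, 16), 102.0))   # elementwise (96+102)/2
error = target - pred   # the residual actually sent to DCT and quantization
```

Only the small residual `error` goes on to transform coding, which is the entire point of bidirectional prediction.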
Fig 11.2: B-frame Coding Based on Bidirectional Motion Compensation.
Fig 11.3: MPEG Frame Sequence.
Fig 11.5: Layers of MPEG-1 Video Bitstream.
11.3 MPEG-2
• MPEG-2: for higher-quality video at a bit-rate of more than 4 Mbps.
• Defined seven profiles aimed at different applications:
– Simple, Main, SNR scalable, Spatially scalable, High, 4:2:2, Multiview.
– Within each profile, up to four levels are defined (Table 11.5).
– The DVD video specification allows only four display resolutions: 720×480, 704×480, 352×480, and 352×240, a restricted form of the MPEG-2 Main profile at the Main and Low levels.
Table 11.5: Profiles and Levels in MPEG‐2
Table 11.6: Four Levels in the Main Profile of MPEG‐2
Level      | Simple  | Main    | SNR Scalable | Spatially Scalable | High    | 4:2:2   | Multiview
           | profile | profile | profile      | profile            | profile | profile | profile
High       |         |   *     |              |                    |   *     |         |
High 1440  |         |   *     |              |         *          |   *     |         |
Main       |    *    |   *     |      *       |                    |   *     |    *    |    *
Low        |         |   *     |      *       |                    |         |         |
Level      | Max. Resolution | Max fps | Max pixels/sec | Max coded data rate (Mbps) | Application
High       | 1,920 × 1,152   |   60    |  62.7 × 10^6   |            80              | film production
High 1440  | 1,440 × 1,152   |   60    |  47.0 × 10^6   |            60              | consumer HDTV
Main       |   720 × 576     |   30    |  10.4 × 10^6   |            15              | studio TV
Low        |   352 × 288     |   30    |   3.0 × 10^6   |             4              | consumer tape equiv.
Supporting Interlaced Video
• MPEG-2 must support interlaced video as well, since this is one of the options for digital broadcast TV and HDTV.
• In interlaced video each frame consists of two fields, referred to as the top-field and the bottom-field.
– In a Frame-picture, all scanlines from both fields are interleaved to form a single frame, then divided into 16×16 macroblocks and coded using MC.
– If each field is treated as a separate picture, then it is called a Field-picture.
Fig. 11.6: Field pictures and Field-prediction for Field-pictures in MPEG-2. (a) Frame-picture vs. Field-pictures, (b) Field Prediction for Field-pictures.
Five Modes of Predictions
• MPEG-2 defines Frame Prediction and Field Prediction, as well as five prediction modes:
1. Frame Prediction for Frame-pictures: identical to MPEG-1 MC-based prediction methods in both P-frames and B-frames.
2. Field Prediction for Field-pictures: a macroblock size of 16 × 16 from Field-pictures is used. For details, see Fig. 11.6(b).
3. Field Prediction for Frame-pictures: The top-field and bottom-field of a Frame-picture are treated separately. Each 16 × 16 macroblock (MB) from the target Frame-picture is split into two 16 × 8 parts, each coming from one field. Field prediction is carried out for these 16 × 8 parts in a manner similar to that shown in Fig. 11.6(b).
4. 16×8 MC for Field-pictures: Each 16×16 macroblock (MB) from the target Field-picture is split into top and bottom 16 × 8 halves. Field prediction is performed on each half. This generates two motion vectors for each 16×16 MB in the P-Field-picture, and up to four motion vectors for each MB in the B-Field-picture. This mode is good for finer MC when motion is rapid and irregular.
5. Dual-Prime for P-pictures: First, field prediction from each previous field with the same parity (top or bottom) is made. Each motion vector mv is then used to derive a calculated motion vector cv in the field with the opposite parity, taking into account the temporal scaling and vertical shift between lines in the top and bottom fields. For each MB the pair mv and cv yields two preliminary predictions. Their prediction errors are averaged and used as the final prediction error. This mode mimics B-picture prediction for P-pictures without adopting backward prediction (and hence with less encoding delay). This is the only mode that can be used for either Frame-pictures or Field-pictures.
Alternate Scan and Field DCT
• Techniques aimed at improving the effectiveness of the DCT on prediction errors, only applicable to Frame-pictures in interlaced videos:
– Due to the nature of interlaced video, the consecutive rows in the 8×8 blocks are from different fields; there exists less correlation between them than between the alternate rows.
– Alternate scan recognizes the fact that in interlaced video the vertically higher spatial frequency components may have larger magnitudes, and thus allows them to be scanned earlier in the sequence.
• In MPEG-2, Field_DCT can also be used to address the same issue.
Fig 11.7: Zigzag and Alternate Scans of DCT Coefficients for Progressive and Interlaced Videos in MPEG‐2.
12.1 Overview of MPEG-4
• MPEG-4: a newer standard. Besides compression, it pays great attention to issues of user interactivity.
• MPEG-4 departs from its predecessors in adopting a new object-based coding:
– Offering a higher compression ratio, it is also beneficial for digital video composition, manipulation, indexing, and retrieval.
– Figure 12.1 illustrates how MPEG-4 videos can be composed and manipulated by simple operations on the visual objects.
• The bit-rate for MPEG-4 video now covers a large range, from 5 kbps to 10 Mbps.
Fig. 12.1: Composition and Manipulation of MPEG-4 Videos.
Overview of MPEG-4 (Cont'd)
• MPEG-4 (Fig. 12.2(b)) is an entirely new standard for:
(a) Composing media objects to create desirable audiovisual scenes.
(b) Multiplexing and synchronizing the bitstreams for these media data entities so that they can be transmitted with guaranteed Quality of Service (QoS).
(c) Interacting with the audiovisual scene at the receiving end. It provides a toolbox of advanced coding modules and algorithms for audio and video compression.
Fig. 12.2: Comparison of interactivity in MPEG standards: (a) reference models in MPEG-1 and 2 (interaction in dashed lines supported only by MPEG-2); (b) MPEG-4 reference model.
Overview of MPEG-4 (Cont'd)
• The hierarchical structure of MPEG-4 visual bitstreams is very different from that of MPEG-1 and -2; it is very much video-object-oriented.
Fig. 12.3: Video-Object-Oriented Hierarchical Description of a Scene in MPEG-4 Visual Bitstreams.
Overview of MPEG-4 (Cont'd)
1. Video-object Sequence (VS) — delivers the complete MPEG-4 visual scene, which may contain 2-D or 3-D natural or synthetic objects.
2. Video Object (VO) — a particular object in the scene, which can be of arbitrary (non-rectangular) shape, corresponding to an object or background of the scene.
3. Video Object Layer (VOL) — facilitates a way to support (multi-layered) scalable coding. A VO can have multiple VOLs under scalable coding, or a single VOL under non-scalable coding.
4. Group of Video Object Planes (GOV) — groups Video Object Planes together (optional level).
5. Video Object Plane (VOP) — a snapshot of a VO at a particular moment.
12.2 Object-based Visual Coding in MPEG-4
VOP-based vs. Frame-based Coding
• MPEG-1 and -2 do not support the VOP concept, and hence their coding method is referred to as frame-based (also known as block-based coding).
• Fig. 12.4(c) illustrates a possible example in which both potential matches yield small prediction errors for block-based coding.
• Fig. 12.4(d) shows that each VOP is of arbitrary shape and ideally will obtain a unique motion vector consistent with the actual object motion.
Fig. 12.4: Comparison between Block‐based Coding and Object‐based Coding.
VOP-based Coding
• MPEG-4 VOP-based coding also employs the Motion Compensation technique:
– An intra-frame-coded VOP is called an I-VOP.
– Inter-frame-coded VOPs are called P-VOPs if only forward prediction is employed, or B-VOPs if bidirectional predictions are employed.
• The new difficulty for VOPs: they may have arbitrary shapes, so shape information must be coded in addition to the texture of the VOP.
Note: texture here actually refers to the visual content, that is, the gray-level (or chroma) values of the pixels in the VOP.
VOP-based Motion Compensation (MC)
• MC-based VOP coding in MPEG-4 again involves three steps:
(a) Motion Estimation.
(b) MC-based Prediction.
(c) Coding of the prediction error.
• Only pixels within the VOP of the current (Target) VOP are considered for matching in MC.
• To facilitate MC, each VOP is divided into many macroblocks (MBs). MBs are by default 16×16 in luminance images and 8 × 8 in chrominance images.
• MPEG-4 defines a rectangular bounding box for each VOP (see Fig. 12.5 for details).
• The macroblocks that are entirely within the VOP are referred to as Interior Macroblocks. The macroblocks that straddle the boundary of the VOP are called Boundary Macroblocks.
• To help match every pixel in the target VOP and meet the mandatory requirement of rectangular blocks in transform coding (e.g., DCT), a pre-processing step of padding is applied to the Reference VOPs prior to motion estimation.
Note: Padding only takes place in the Reference VOPs.
Fig. 12.5: Bounding Box and Boundary Macroblocks of VOP.
I. Padding
• For all Boundary MBs in the Reference VOP, Horizontal Repetitive Padding is invoked first, followed by Vertical Repetitive Padding.
Fig. 12.6: A Sequence of Paddings for Reference VOPs in MPEG-4.
• Afterwards, for all Exterior Macroblocks that are outside of the VOP but adjacent to one or more Boundary MBs, extended padding will be applied.
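One pass of repetitive padding over a single row (or column) can be sketched as below. This is a simplified per-line version, not the full normative procedure: each exterior pixel takes the nearest VOP pixel on its line, or the average of the two nearest when VOP pixels lie on both sides. The vertical pass applies the same routine column by column:

```python
import numpy as np

def pad_line(vals, mask):
    """Repetitive padding of one row/column of a boundary macroblock.
    vals: pixel values; mask: True where the pixel belongs to the VOP.
    Returns (padded values, whether the line had any VOP pixel)."""
    out = vals.astype(float).copy()
    inside = np.flatnonzero(mask)
    if inside.size == 0:
        return out, False            # nothing to propagate on this line
    for idx in np.flatnonzero(~mask):
        left = inside[inside < idx]
        right = inside[inside > idx]
        if left.size and right.size:
            # Exterior pixel flanked by VOP pixels: average the two.
            out[idx] = (vals[left[-1]] + vals[right[0]]) / 2.0
        elif left.size:
            out[idx] = vals[left[-1]]    # repeat rightmost VOP pixel
        else:
            out[idx] = vals[right[0]]    # repeat leftmost VOP pixel
    return out, True
```

After both passes, every pixel of the boundary MB carries a plausible value, so the rectangular-block matching and DCT can proceed as if the VOP were rectangular.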
Example 12.1: Repetitive Paddings
Fig. 12.7: An example of Repetitive Padding in a boundary macroblock of a Reference VOP: (a) Original pixels within the VOP, (b) After Horizontal Repetitive Padding, (c) Followed by Vertical Repetitive Padding.
Shape Coding
• MPEG-4 supports two types of shape information: binary and gray-scale.
• Binary shape information can be in the form of a binary map (also known as a binary alpha map) that is the same size as the rectangular bounding box of the VOP.
• A value '1' (opaque) or '0' (transparent) in the bitmap indicates whether the pixel is inside or outside the VOP.
• Alternatively, gray-scale shape information actually refers to the transparency of the shape, with gray values ranging from 0 (completely transparent) to 255 (opaque).
I. Binary Shape Coding
• BABs (Binary Alpha Blocks): to encode the binary alpha map more efficiently, the map is divided into 16×16 blocks.
• It is the boundary BABs that contain the contour, and hence the shape information for the VOP; they are the subject of binary shape coding.
• Two bitmap-based algorithms:
(a) Modified Modified READ (MMR).
(b) Context-based Arithmetic Encoding (CAE).
Modified Modified READ (MMR)
• MMR is basically a series of simplifications of the Relative Element Address Designate (READ) algorithm.
• The READ algorithm starts by identifying five pixel locations in the previous and current lines:
– a0: the last pixel value known to both the encoder and decoder;
– a1: the transition pixel to the right of a0;
– a2: the second transition pixel to the right of a0;
– b1: the first transition pixel whose color is opposite to a0 in the previously coded line; and
– b2: the first transition pixel to the right of b1 on the previously coded line.
II. Gray-scale Shape Coding
• The gray scale here is used to describe the transparency of the shape, not the texture.
• Gray-scale shape coding in MPEG-4 employs the same technique as in the texture coding described above.
– It uses the alpha map and block-based motion compensation, and encodes the prediction errors by DCT.
– The boundary MBs need padding, as before, since not all pixels are in the VOP.
Static Texture Coding
• MPEG-4 uses wavelet coding for the texture of static objects.
• The coding of subbands in MPEG-4 static texture coding is conducted in the following manner:
– The subbands with the lowest frequency are coded using DPCM. Prediction of each coefficient is based on three neighbors.
– Coding of the other subbands is based on a multiscale zero-tree wavelet coding method.
• The multiscale zero-tree has a Parent-Child Relation tree (PCR tree) for each coefficient in the lowest-frequency subband, to better track the locations of all coefficients.
• The degree of quantization also affects the data rate.
Sprite Coding
• A sprite is a graphic image that can freely move around within a larger graphic image or a set of images.
• To separate the foreground object from the background, we introduce the notion of a sprite panorama: a still image that describes the static background over a sequence of video frames.
– The large sprite panoramic image can be encoded and sent to the decoder only once, at the beginning of the video sequence.
– When the decoder receives the separately coded foreground objects and the parameters describing the camera movements thus far, it can reconstruct the scene efficiently.
– Fig. 12.10 shows a sprite, that is, a panoramic image stitched from a sequence of video frames.
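The decoder side can be sketched as below, assuming a pure translational camera offset for simplicity (real sprite warping also handles zoom and rotation); the function and parameter names are illustrative:

```python
import numpy as np

def reconstruct_frame(panorama, offset, fg, alpha):
    """Cut the background window out of the sprite panorama at the
    decoded camera offset (y, x), then composite the separately coded
    foreground object wherever its alpha map is opaque."""
    y, x = offset
    h, w = alpha.shape
    frame = panorama[y:y + h, x:x + w].copy()   # background window
    frame[alpha > 0] = fg[alpha > 0]            # paste foreground object
    return frame
```

Because the panorama is sent once, each subsequent frame costs only the camera parameters plus the (usually small) foreground object.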
Fig. 12.10: Sprite Coding. (a) The sprite panoramic image of the background; (b) the foreground object (piper) in a blue‐screen image; (c) the composed video scene. Piper image courtesy of the Simon Fraser University Pipe Band.
Global Motion Compensation (GMC)
• “Global” refers to the overall change due to camera motions (pan, tilt, rotation, and zoom). Without GMC, such motion causes a large number of significant motion vectors.
• There are four major components within the GMC algorithm:
– Global motion estimation
– Warping and blending
– Motion trajectory coding
– Choice of LMC (Local Motion Compensation) or GMC
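The warping step can be sketched as follows, assuming a 6‑parameter affine global motion model and nearest‑neighbor sampling (actual GMC uses sub‑pel interpolation and blending, omitted here):

```python
import numpy as np

def gmc_predict(ref, a):
    """Predict the current frame by warping the reference frame with an
    affine global motion model:
        (sx, sy) = (a0*x + a1*y + a2, a3*x + a4*y + a5)
    Samples nearest-neighbor; out-of-frame sources are left at zero."""
    h, w = ref.shape
    pred = np.zeros_like(ref)
    for y in range(h):
        for x in range(w):
            sx = int(round(a[0] * x + a[1] * y + a[2]))
            sy = int(round(a[3] * x + a[4] * y + a[5]))
            if 0 <= sx < w and 0 <= sy < h:
                pred[y, x] = ref[sy, sx]
    return pred
```

A single parameter set replaces a whole field of per-macroblock motion vectors when the scene change really is a camera motion; otherwise the encoder falls back to LMC.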
12.3 Synthetic Object Coding in MPEG‐4
2D Mesh Object Coding
• 2D mesh: a tessellation (or partition) of a 2D planar region using polygonal patches:
– The vertices of the polygons are referred to as the nodes of the mesh.
– The most popular meshes are triangular meshes, in which all polygons are triangles.
– The MPEG‐4 standard makes use of two types of 2D mesh: the uniform mesh and the Delaunay mesh.
– 2D mesh object coding is compact. All coordinate values of the mesh are coded in half‐pixel precision.
– Each 2D mesh is treated as a mesh object plane (MOP).
Fig. 12.11: 2D Mesh Object Plane (MOP) Encoding Process
I. 2D Mesh Geometry Coding
• MPEG‐4 allows four types of uniform meshes with different triangulation structures.
Fig. 12.12: Four Types of Uniform Meshes.
• Definition: If D is a Delaunay triangulation, then any of its triangles tn = (Pi, Pj, Pk) ∈ D satisfies the property that the circumcircle of tn does not contain any other node point Pl in its interior.
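The circumcircle property can be checked with the classic in‑circle determinant test; a minimal sketch, assuming the triangle's vertices are listed counter‑clockwise:

```python
import numpy as np

def in_circumcircle(p_i, p_j, p_k, p_l):
    """True iff p_l lies strictly inside the circumcircle of the
    counter-clockwise triangle (p_i, p_j, p_k). The 3x3 in-circle
    determinant is positive exactly in that case."""
    m = np.array([[p[0] - p_l[0],
                   p[1] - p_l[1],
                   (p[0] - p_l[0]) ** 2 + (p[1] - p_l[1]) ** 2]
                  for p in (p_i, p_j, p_k)], dtype=float)
    return np.linalg.det(m) > 0

# The triangle (0,0), (1,0), (0,1) satisfies the Delaunay property
# with respect to a far-away node, but not one at the circle's center:
print(in_circumcircle((0, 0), (1, 0), (0, 1), (2, 2)))      # False
print(in_circumcircle((0, 0), (1, 0), (0, 1), (0.5, 0.5)))  # True
```

A triangulation is Delaunay precisely when this test is False for every triangle against every other node.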
• A Delaunay mesh for a video object can be obtained in the following steps:
1. Select boundary nodes of the mesh: a polygon is used to approximate the boundary of the object.
2. Choose interior nodes: feature points, e.g., edge points or corners, within the object boundary can be chosen as interior nodes for the mesh.
3. Perform Delaunay triangulation: a constrained Delaunay triangulation is performed on the boundary and interior nodes, with the polygonal boundary used as a constraint.
I. Face Object Coding and Animation
• MPEG‐4 has adopted a generic default face model, developed by the VRML Consortium.
• Face Animation Parameters (FAPs) can be specified to achieve desirable animations — deviations from the original “neutral” face.
• In addition, Face Definition Parameters (FDPs) can be specified to better describe individual faces.
• Fig. 12.16 shows the feature points for FDPs. Feature points that can be affected by animation (FAPs) are shown as solid circles; those that are not affected are shown as empty circles.
Fig. 12.16: Feature Points for Face Definition Parameters (FDPs). (Feature points for teeth and tongue not shown.)
II. Body Object Coding and Animation
• MPEG‐4 Version 2 introduced body objects, which are a natural extension of face objects.
• Working with the Humanoid Animation (H‐Anim) Group in the VRML Consortium, MPEG adopted a generic virtual human body with a default posture.
– The default posture is standing, with the feet pointing to the front, the arms at the sides, and the palms facing inward.
– There are 296 Body Animation Parameters (BAPs). When applied to any MPEG‐4 compliant generic body, they will produce the same animation.
– A large number of BAPs describe the joint angles connecting different body parts: spine, shoulder, clavicle, elbow, wrist, finger, hip, knee, ankle, and toe. This yields 186 degrees of freedom for the body, and 25 degrees of freedom for each hand alone.
– Some body movements can be specified in multiple levels of detail.
• For specific bodies, Body Definition Parameters (BDPs) can be specified for body dimensions, body surface geometry, and, optionally, texture.
• The coding of BAPs is similar to that of FAPs: quantization and predictive coding are used, and the prediction errors are further compressed by arithmetic coding.
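The quantization/prediction chain for BAPs (and FAPs) can be sketched as below. The uniform quantization step and the first‑order (previous‑value) predictor are illustrative assumptions, and the final arithmetic‑coding stage is omitted:

```python
def code_params(values, step):
    """Quantize a sequence of animation parameters, then DPCM-code the
    quantized values: each one is predicted by its predecessor and only
    the prediction error is kept (the errors would then be fed to an
    arithmetic coder)."""
    q = [round(v / step) for v in values]
    return [q[0]] + [q[i] - q[i - 1] for i in range(1, len(q))]

def decode_params(errors, step):
    """Undo the prediction and dequantize."""
    q = []
    for e in errors:
        q.append(e if not q else q[-1] + e)
    return [v * step for v in q]

print(code_params([0.0, 1.0, 2.5], 0.5))   # [0, 2, 3]
```

Because consecutive frames of an animation change slowly, the errors cluster near zero, which is exactly what an arithmetic coder exploits.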
12.4 MPEG‐4 Object Types, Profiles and Levels
• The standardization of Profiles and Levels in MPEG‐4 serves two main purposes:
(a) ensuring interoperability between implementations;
(b) allowing testing of conformance to the standard.
• MPEG‐4 specifies not only Visual profiles and Audio profiles, but also Graphics profiles, Scene description profiles, and one Object descriptor profile in its Systems part.
• The notion of an Object type is introduced to define the tools needed to create video objects and how they can be combined in a scene.
Table 12.1: Tools for MPEG‐4 Natural Visual Object Types
Table 12.2: MPEG‐4 Natural Visual Object Types and Profiles
• For the “Main Profile”, for example, only the Object Types “Simple”, “Core”, “Main”, and “Scalable Still Texture” are supported.
Table 12.3: MPEG‐4 Levels in Simple, Core, and Main Visual Profiles
12.6 MPEG‐7
• The main objective of MPEG‐7 is to serve the needs of audiovisual content‐based retrieval (or audiovisual object retrieval) in applications such as digital libraries.
• Nevertheless, it is also applicable to any multimedia application involving the generation (content creation) and usage (content consumption) of multimedia data.
• MPEG‐7 became an International Standard in September 2001, with the formal name Multimedia Content Description Interface.
Applications Supported by MPEG‐7
• MPEG‐7 supports a variety of multimedia applications. Its data may include still pictures, graphics, 3D models, audio, speech, video, and composition information (how to combine these elements).
• These MPEG‐7 data elements can be represented in textual format, in binary format, or both.
• Fig. 12.17 illustrates some possible applications that will benefit from the MPEG‐7 standard.
Fig. 12.17: Possible Applications using MPEG‐7.
MPEG‐7 and Multimedia Content Description
• MPEG‐7 has developed Descriptors (D), Description Schemes (DS), and a Description Definition Language (DDL). The following are some of the important terms:
– Feature — a characteristic of the data.
– Description — a set of instantiated Ds and DSs that describes the structural and conceptual information of the content, the storage and usage of the content, etc.
– D — the definition (syntax and semantics) of a feature.
– DS — the specification of the structure and relationships between Ds and between DSs.
– DDL — syntactic rules to express and combine DSs and Ds.
• The scope of MPEG‐7 is to standardize the Ds, DSs, and DDL for descriptions. The mechanisms and processes for producing and consuming the descriptions are beyond the scope of MPEG‐7.
Descriptor (D)
• The descriptors are chosen based on a comparison of their performance, efficiency, and size. The low‐level visual descriptors for basic visual features include:
– Color
∗ Color space: (a) RGB, (b) YCbCr, (c) HSV (hue, saturation, value), (d) HMMD (HueMaxMinDiff), (e) a 3D color space derivable by a 3 × 3 matrix from RGB, (f) monochrome.
∗ Color quantization: (a) linear, (b) nonlinear, (c) lookup tables.
∗ Dominant colors.
∗ Scalable color.
∗ Color layout.
∗ Color structure.
∗ Group of Frames/Group of Pictures (GoF/GoP) color.
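As a toy illustration of a histogram‑style color descriptor (MPEG‐7's actual scalable‑color and color‑structure descriptors work in HSV/HMMD space with Haar transforms and structuring elements, all omitted here):

```python
import numpy as np

def color_histogram_descriptor(img, bins=4):
    """Uniformly quantize each 8-bit RGB channel into `bins` levels and
    return the normalized joint histogram over the bins**3 color cells.
    Two images can then be compared by the distance between histograms."""
    q = img // (256 // bins)                       # per-channel bin index
    idx = q[..., 0] * bins * bins + q[..., 1] * bins + q[..., 2]
    hist = np.bincount(idx.ravel(), minlength=bins ** 3)
    return hist / hist.sum()

# A uniformly black image puts all of its mass in the first cell:
d = color_histogram_descriptor(np.zeros((8, 8, 3), dtype=np.uint8))
print(d[0])   # 1.0
```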
– Texture
∗ Homogeneous texture.
∗ Texture browsing.
∗ Edge histogram.
– Shape
∗ Region‐based shape.
∗ Contour‐based shape.
∗ 3D shape.
– Motion
∗ Camera motion (see Fig. 12.18).
∗ Object motion trajectory.
∗ Parametric object motion.
∗ Motion activity.
– Localization
∗ Region locator.
∗ Spatiotemporal locator.
– Others
∗ Face recognition.
Fig. 12.18: Camera motions: pan, tilt, roll, dolly, track, and boom.
Description Scheme (DS)
• Basic elements
– Datatypes and mathematical structures.
– Constructs.
– Schema tools.
• Content Management
– Media Description.
– Creation and Production Description.
– Content Usage Description.
• Content Description
– Structural Description.
A Segment DS, for example, can be implemented as a class object. It can have five subclasses: Audiovisual segment DS, Audio segment DS, Still region DS, Moving region DS, and Video segment DS. The subclass DSs can recursively have their own subclasses.
– Conceptual Description.
• Navigation and Access
– Summaries.
– Partitions and Decompositions.
– Variations of the Content.
• Content Organization
– Collections.
– Models.
• User Interaction
– UserPreference.
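The recursive Segment DS hierarchy noted under Content Description can be sketched as a small class tree. The class names mirror the five subclasses above; real DSs carry attributes and decomposition rules that this sketch omits:

```python
class SegmentDS:
    """Base Segment DS: a segment can recursively be decomposed into
    child segments belonging to any Segment DS subclass."""
    def __init__(self, *children):
        self.children = list(children)

    def count(self):
        """Total number of segments in this subtree, including self."""
        return 1 + sum(c.count() for c in self.children)

class AudioVisualSegmentDS(SegmentDS): pass
class AudioSegmentDS(SegmentDS): pass
class StillRegionDS(SegmentDS): pass
class MovingRegionDS(SegmentDS): pass
class VideoSegmentDS(SegmentDS): pass

# A video segment decomposed into a still region and a moving region,
# the latter containing a nested still region:
scene = VideoSegmentDS(StillRegionDS(), MovingRegionDS(StillRegionDS()))
print(scene.count())   # 4
```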
Fig. 12.19: MPEG‐7 video segment.
Fig. 12.20: A video summary.
Description Definition Language (DDL)
• MPEG‐7 adopted the XML Schema Language, initially developed by the WWW Consortium (W3C), as its Description Definition Language (DDL). Since the XML Schema Language was not designed specifically for audiovisual content, some extensions are made to it:
– Array and matrix data types.
– Multiple media types, including audio, video, and audiovisual presentations.
– Enumerated data types for MimeType, CountryCode, RegionCode, CurrencyCode, and CharacterSetCode.
– Intellectual Property Management and Protection (IPMP) for Ds and DSs.
12.7 MPEG‐21
• The development of the newest standard, MPEG‐21: Multimedia Framework, started in June 2000; it was expected to become an International Standard by 2003.
• The vision for MPEG‐21 is to define a multimedia framework that enables transparent and augmented use of multimedia resources across a wide range of networks and devices used by different communities.
• The seven key elements in MPEG‐21 are:
– Digital item declaration — to establish a uniform and flexible abstraction and an interoperable schema for declaring Digital Items.
– Digital item identification and description — to establish a framework for standardized identification and description of digital items, regardless of their origin, type, or granularity.
– Content management and usage — to provide an interface and protocol that facilitate the management and usage (searching, caching, archiving, distributing, etc.) of content.
– Intellectual property management and protection (IPMP) — to enable content to be reliably managed and protected.
– Terminals and networks — to provide interoperable and transparent access to content, with Quality of Service (QoS), across a wide range of networks and terminals.
– Content representation — to represent content in a way adequate for pursuing the objective of MPEG‐21, namely “content anytime, anywhere”.
– Event reporting — to establish metrics and interfaces for reporting events (user interactions), so as to understand performance and alternatives.