Video Coding Spec Threads Net-Centric Media Needle


iApplianceWeb
(08/16/02, 03:50:02 PM EDT)

By Vishal Markandey and Jeremiah Golston, Texas Instruments Inc., Dallas, Texas, and Faouzi Kossentini, Foued Ben Amara, and Ali Jerbi, UB Video Inc., Vancouver, British Columbia

Digital video is being adopted in a proliferating array of applications, ranging from fixed and mobile video telephony and videoconferencing to DVD and digital TV.

Fueling the growth is the development of video-coding standards, such as MPEG-4 and H.26L, that provide the means needed to achieve interoperability between systems. Those standards have also striven to reduce bandwidth requirements across the network infrastructure to allow better-quality video at a fixed bit rate while reducing storage requirements for content.

Just a few years ago, block-based compression algorithms appeared to be reaching a point of saturation. Now, the emerging H.26L standard is delivering a breakthrough in coding efficiency with up to a 50 percent bit-rate reduction by extending the basic techniques of prior standards.

Key improvements include intra-prediction, more flexible motion compensation, a new 4 x 4 integer transform and enhanced entropy coding. The variety of coding tools offered by H.26L allows system developers to differentiate products by optimizing their algorithms for the specific end-system application.

But before launching into development with H.26L, it is important to understand not only what it does but also how and why it does it. Only then can the designer take full advantage of what this emerging video-coding standard has to offer.

H.26L overview
The ITU-T is one of two formal organizations that develop video-coding standards, the other being ISO/IEC JTC1. The ITU-T video-coding standards are called recommendations and are denoted with H.26x (e.g., H.261, H.262, H.263 and H.26L). The ISO/IEC standards are denoted with MPEG-x (e.g., MPEG-1, MPEG-2 and MPEG-4).

The ITU-T recommendations have been designed primarily for real-time video communication applications, such as videoconferencing and video telephony. On the other hand, the MPEG standards have been designed largely to address the needs of video storage (DVD), broadcast video (broadcast TV), and video-streaming (e.g., video-over-Internet, video-over-DSL and video-over-wireless) applications. For the most part, the two standardization committees have worked independently. The only exception has been the jointly developed H.262/MPEG-2 standard.

Recently, ITU-T and ISO/IEC JTC1 have agreed to join efforts in the development of the emerging H.26L standard, which was initiated by the ITU-T committee. The two committees have adopted H.26L because it represents a departure in terms of performance from all existing video-coding standards. Figure 1 summarizes the evolution of the ITU-T recommendations and the ISO/IEC MPEG standards.

The objective of the H.26L project is to develop a high-performance video-coding standard by adopting a "back-to-basics" approach: a simple and straightforward design built from well-known building blocks. The ITU-T Video Coding Experts Group (VCEG) initiated the work on the H.26L standard in 1997. Toward the end of 2001, after seeing H.26L-based software deliver video quality superior to that of even the most optimized existing MPEG-4-based software, ISO/IEC MPEG joined ITU-T VCEG to form a Joint Video Team (JVT), which took over the H.26L project. The JVT's objective is to create a single video-coding standard that will simultaneously become a new part (likely Part 10) of the MPEG-4 family of standards and a new ITU-T recommendation (likely H.264).

H.26L development is an ongoing effort, with the first version of the standard expected to be finalized technically before the end of 2002 and officially before the end of 2003.

The emerging H.26L standard has a number of features that distinguish it from existing standards. Its key features include:

--Up to a 50 percent bit-rate savings. Compared with H.263v2 (H.263+) or MPEG-4 Simple Profile, H.26L permits an average reduction in bit rate of up to 50 percent for a similar degree of encoder optimization at most bit rates.

--High-quality video. H.26L offers consistently high video quality, even at low bit rates.

--Adaptation to delay constraints. H.26L can operate in a low-delay mode to adapt to real-time communications applications (such as videoconferencing), while allowing higher processing delay in applications with no delay constraints (such as video storage).

--Error resilience. H.26L provides the tools necessary to deal with packet loss in packet networks and bit errors in error-prone wireless networks.

--Network friendliness. A new feature is the conceptual separation between a video-coding layer, which provides the core high-compression representation of the video picture content, and a network adaptation layer, which packages that representation for delivery over a particular type of network. This facilitates packetization and improves information-priority control.

The above features translate into a number of advantages for video applications. Advantages for the specific application of videoconferencing are discussed at the end of this paper.

How does it do that?
The main objective of H.26L is to provide a means to achieve substantially higher video quality compared with what could be achieved using any of the existing video-coding standards. Nonetheless, the underlying approach of H.26L is similar to that adopted in previous standards, such as H.263 and MPEG-4, and consists of the following four main stages:

--Dividing each video frame into blocks of pixels so that processing of the video frame can be conducted at the block level.

--Exploiting the spatial redundancies that exist within the video frame by coding some of the original blocks through transform, quantization and entropy coding (or variable-length coding).

--Exploiting the temporal dependencies that exist between blocks in successive frames, so that only changes between successive frames need to be encoded. This is accomplished by using motion estimation and compensation.

--Exploiting any remaining spatial redundancies that exist within the video frame by coding the residual blocks, i.e., the difference between the original blocks and the corresponding predicted blocks, again through transform, quantization and entropy coding.
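
The sketch below (in Python) ties the four stages together in a minimal block-based encoding loop. It is purely illustrative: the transform, quantization and entropy-coding functions are stubs standing in for the tools described in the rest of this article, and a zero-motion predictor takes the place of real motion estimation.

    import numpy as np

    # Placeholder stages; a real codec substitutes the tools described below.
    def transform(residual):           # e.g., the 4 x 4 integer transform
        return residual
    def quantize(coeffs, step=10):     # scalar quantization
        return np.round(coeffs / step).astype(np.int32)
    def entropy_code(levels):          # e.g., UVLC or arithmetic coding
        return levels.tobytes()

    def encode_frame(frame, reference, block=16):
        """Hybrid-coding loop: divide into blocks, predict, code the residual."""
        out = []
        h, w = frame.shape
        for y in range(0, h, block):
            for x in range(0, w, block):
                cur = frame[y:y+block, x:x+block].astype(np.int32)
                # Temporal prediction (zero-motion placeholder for brevity).
                pred = reference[y:y+block, x:x+block].astype(np.int32)
                residual = cur - pred   # only the change gets coded
                out.append(entropy_code(quantize(transform(residual))))
        return out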

From the coding point of view, the main differences between H.26L and the other standards are summarized in Figure 2 through an encoder block diagram. On the motion estimation/compensation side, H.26L employs blocks of different sizes and shapes, higher-resolution subpel motion estimation and multiple-reference-frame selection. On the transform side, H.26L uses an integer-based transform that approximates the DCT used in previous standards but does not suffer the inverse-transform mismatch problem.

In H.26L, entropy coding can be performed using either a single universal variable-length code (UVLC) table or context-based adaptive binary arithmetic coding. Those and other features are discussed in more detail in the following sections.

Bit stream organization
As mentioned above, a given video picture is divided into a number of small blocks, referred to as macroblocks. For example, a picture with QCIF resolution (176 x 144 pixels) is divided into 99 macroblocks of 16 x 16 pixels each, as indicated in Figure 3. A similar macroblock segmentation is used for other frame sizes.

The luminance component of the picture is sampled at those frame resolutions, while the chrominance components, Cb and Cr, are downsampled by two in the horizontal and vertical directions. In addition, a picture may be divided into an integer number of "slices," which are valuable for resynchronization should some data be lost.
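
The arithmetic is easy to check. The short sketch below computes the macroblock grid for QCIF and the size of the downsampled chrominance planes; the function name is an illustrative placeholder.

    def macroblock_grid(width, height, mb=16):
        """Number of 16 x 16 macroblocks covering a frame whose dimensions
        are multiples of 16, as with QCIF."""
        return (width // mb) * (height // mb)

    # QCIF: 176 x 144 luminance samples -> 11 x 9 = 99 macroblocks.
    assert macroblock_grid(176, 144) == 99

    # Cb and Cr are downsampled by two in each direction, so each
    # chrominance plane of a QCIF picture is 88 x 72 samples.
    print(176 // 2, 144 // 2)   # 88 72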

Intraprediction, coding
Intracoding refers to the case where only spatial redundancies within a video picture are exploited. The resulting frame is referred to as an I-picture. I-pictures are typically encoded by directly applying the transform to the different macroblocks in the frame. As a consequence, encoded I-pictures are large, since a large amount of information is usually present in the frame and no temporal information is used as part of the encoding process.

To increase the efficiency of the intracoding process in H.26L, spatial correlation between adjacent macroblocks in a given frame is exploited. The idea is based on the observation that adjacent macroblocks tend to have similar properties. Therefore, as a first step in the encoding process for a given macroblock, one may predict the macroblock of interest from the surrounding macroblocks (typically the ones located on top and to the left of the macroblock of interest, since those macroblocks have already been encoded). The difference between the actual macroblock and its prediction is then coded, which results in fewer bits to represent the macroblock of interest compared with applying the transform directly to the macroblock itself.

To perform the intraprediction mentioned above, H.26L offers six modes for prediction of 4 x 4 luminance blocks, including dc prediction (Mode 0) and five directional modes (labeled 1 through 5 in Figure 4). In this figure, pixels A to I from neighboring blocks have already been encoded and may be used for prediction.

For example, if Mode 2 is selected, then pixels a, e, i and m are predicted by setting them equal to pixel A, and pixels b, f, j and n are predicted by setting them equal to pixel B. For regions with less spatial detail (flat regions), H.26L also supports 16 x 16 intracoding where one of four prediction modes is chosen for the prediction of the entire macroblock.
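
A sketch of that directional prediction, assuming the pixel layout just described (A through D sitting directly above the 4 x 4 block), might look as follows; the dc mode is included for comparison, and both functions are illustrative rather than taken from the draft text.

    import numpy as np

    def predict_4x4_mode2(above):
        """Mode 2 as described: every pixel in a column is set equal to
        the reconstructed pixel directly above it (A, B, C or D)."""
        return np.tile(np.asarray(above), (4, 1))   # a,e,i,m <- A; b,f,j,n <- B; ...

    def predict_4x4_dc(above, left):
        """Mode 0 (dc): predict all 16 pixels with the mean of the neighbors."""
        mean = int(round((sum(above) + sum(left)) / 8.0))
        return np.full((4, 4), mean, dtype=int)

    pred = predict_4x4_mode2([100, 102, 104, 106])
    # The encoder then codes the residual: the actual 4 x 4 block minus pred.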

Finally, the prediction mode for each block is efficiently coded by assigning shorter symbols to more likely modes, where the probability of each mode is determined based on the modes used for coding surrounding blocks.

Interprediction, coding
Interprediction and coding tap motion estimation and compensation to take advantage of the temporal redundancies that exist between successive frames, thus providing very efficient coding of video sequences. When a selected reference frame for motion estimation is a previously encoded frame, the frame to be encoded is referred to as a P-picture. When both a previously encoded frame and a future frame are chosen as reference frames, then the frame to be encoded is referred to as a B-picture.

Motion estimation in H.26L supports most of the key features found in earlier video standards, but its efficiency is improved through added flexibility and functionality. In addition to supporting P-pictures (with single and multiple reference frames) and B-pictures, H.26L supports a new, interstream transitional picture called an SP-picture. The following four sections describe the main motion estimation features used in H.26L: various block sizes and shapes, high-precision subpel motion vectors, multiple reference frames and deblocking filters in the prediction loop.

--Block sizes: Motion compensation on each 16 x 16 macroblock can be performed using a number of different block sizes and shapes, illustrated in Figure 5 (and enumerated in the sketch after this list). Individual motion vectors can be transmitted for blocks as small as 4 x 4, so up to 16 motion vectors may be transmitted for a single macroblock. Block sizes of 16 x 8, 8 x 16, 8 x 8, 8 x 4 and 4 x 8 are also supported as shown. Smaller motion-compensation blocks improve prediction in general; in particular, they let the model handle fine motion detail and yield better subjective viewing quality because they do not produce large blocking artifacts.

--Motion estimation accuracy: The prediction capability of the motion compensation algorithm in H.26L is further improved by allowing motion vectors to be determined with higher levels of spatial accuracy than in existing standards. Quarter-pixel-accurate motion compensation is the lowest-accuracy form of motion compensation in H.26L, while eighth-pixel accuracy is being adopted as a feature that will likely be useful for increased coding efficiency at high bit rates and high video resolutions.

--Multiple reference-picture selection: The H.26L standard offers the option of having multiple reference frames in interpicture coding. Up to five different reference frames can be selected, resulting in better subjective video quality and more efficient coding of the video frame under consideration. Moreover, using multiple reference frames can help make the H.26L bit stream more error-resilient.

--Deblocking filter: H.26L specifies the use of an adaptive deblocking filter that operates on the horizontal and vertical block edges within the prediction loop in order to remove artifacts caused by block prediction errors. The filtering is generally based on 4 x 4 block boundaries, in which two pixels on either side of the boundary may be updated using a three-tap filter. The rules for applying the deblocking filter are intricate.
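
As promised above, the following lines enumerate the macroblock partitions named in the block-sizes item and the number of motion vectors each implies; nothing here is drawn from the draft beyond those shapes.

    # Macroblock partitions named above, as (width, height) pairs, and the
    # number of motion vectors each implies for one 16 x 16 macroblock.
    PARTITIONS = [(16, 16), (16, 8), (8, 16), (8, 8), (8, 4), (4, 8), (4, 4)]

    for w, h in PARTITIONS:
        vectors = (16 // w) * (16 // h)
        print(f"{w:2d} x {h:2d}: {vectors:2d} motion vector(s)")
    # 4 x 4 partitioning yields the maximum of 16 vectors cited above.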

Integer transform
The information contained in a prediction error block resulting from either intraprediction or interprediction is subsequently re-expressed in the form of transform coefficients. H.26L is unique in that it employs a purely integer 4 x 4 spatial transform (an approximation of the discrete cosine transform), as opposed to the floating-point 8 x 8 DCT, specified with rounding-error tolerances, used in earlier standards. The small size helps reduce blocking and ringing artifacts, while the integer specification eliminates any mismatch between the encoder and decoder in the inverse transform.
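
The sketch below shows a 4 x 4 integer DCT approximation of the kind H.26L adopts; the particular matrix is the one that appears in later drafts of the standard, shown for illustration, and the normalization a true DCT would apply is assumed to be folded into quantization.

    import numpy as np

    # Integer approximation of the 4 x 4 DCT (matrix as in later drafts of
    # the standard). All arithmetic is exact, so the encoder and decoder
    # can never drift apart in the inverse transform.
    C = np.array([[1,  1,  1,  1],
                  [2,  1, -1, -2],
                  [1, -1, -1,  1],
                  [1, -2,  2, -1]], dtype=np.int64)

    def forward_transform_4x4(block):
        """Y = C X C^T; norm correction is assumed folded into quantization."""
        return C @ np.asarray(block, dtype=np.int64) @ C.T

    residual = np.arange(16).reshape(4, 4) - 8    # toy residual block
    coeffs = forward_transform_4x4(residual)      # exact integer coefficients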

Quantization
A significant portion of data compression takes place in the quantization step. In H.26L, the transform coefficients are quantized using scalar quantization with no widened dead zone. Thirty-two different quantization step sizes can be chosen on a macroblock basis, similar to prior standards (H.263 supports 31, for example), but in H.26L the step sizes increase at a compounding rate of approximately 12.5 percent per step rather than by a constant increment. The fidelity of chrominance components is improved by using finer quantization step sizes than those used for the luminance coefficients, particularly when the luminance coefficients are coarsely quantized.
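
A compounding rate of 12.5 percent means the step size roughly doubles every six steps, as the sketch below verifies; the base step value is an arbitrary illustration, not a figure from the draft.

    # Step sizes grow ~12.5 percent per index instead of by a fixed
    # increment; base_step is an arbitrary illustrative starting value.
    base_step = 0.625
    steps = [base_step * 1.125 ** qp for qp in range(32)]

    def quantize(coeff, qp):
        """Scalar quantization with the compounding step sizes above."""
        return int(round(coeff / steps[qp]))

    print(steps[6] / steps[0])   # ~2.03: the step roughly doubles every six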

The quantized transform coefficients correspond to different frequencies, with the coefficient in the top left-hand corner in Figure 6 representing the dc value and the rest of the coefficients corresponding to different nonzero-frequency values. The next step in the encoding process is to arrange the quantized coefficients in a one-dimensional array, starting with the dc coefficient.

Two different coefficient-scanning patterns are available in H.26L (see Figure 6). The simple zigzag scan is used in most cases and is identical to the conventional scan used in earlier video-coding standards. The zigzag scan arranges the coefficients in ascending order of their corresponding spatial frequencies. The double scan is used, for improved coding efficiency, only for intrablocks that use a small quantization step size.
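
For a 4 x 4 block, the conventional zigzag order can be written down directly, as in the sketch below; after scanning, the nonzero coefficients cluster at the front of the array, which is what makes the subsequent entropy coding effective.

    import numpy as np

    # Conventional zigzag order for a 4 x 4 block: (row, col) pairs running
    # from the dc coefficient toward the highest frequencies.
    ZIGZAG_4x4 = [(0,0),(0,1),(1,0),(2,0),(1,1),(0,2),(0,3),(1,2),
                  (2,1),(3,0),(3,1),(2,2),(1,3),(2,3),(3,2),(3,3)]

    def zigzag_scan(coeffs):
        """Flatten a 4 x 4 block into a 1-D list in ascending-frequency order."""
        return [coeffs[r][c] for r, c in ZIGZAG_4x4]

    block = np.zeros((4, 4), dtype=int)
    block[0, 0], block[0, 1] = 25, -3   # dc plus one low-frequency coefficient
    print(zigzag_scan(block))           # nonzeros land at the front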

Entropy coding
The last step in the video-coding process is entropy coding. So far, H.26L has adopted two approaches for entropy coding. The first approach is based on the use of universal variable-length codes (UVLCs); the second is based on context-based adaptive binary arithmetic coding. Substantial efforts are being made to adopt a single approach, which will likely be based on the adaptive use of special VLCs.

Entropy coding based on VLCs is the most widely used method to compress quantized transform coefficients, motion vectors and other encoder information. VLCs are based on assigning shorter code words to symbols with higher probabilities of occurrence and longer code words to symbols with less-frequent occurrences. The symbols and the associated code words are organized in lookup tables, called VLC tables, which are stored at the encoder and decoder.

In some video-coding standards, such as H.263, a number of VLC tables are used, depending on the type of data under consideration (e.g., transform coefficients or motion vectors). H.26L offers a single, universal VLC table that is to be used in entropy coding of all symbols in the encoder, regardless of the type of data those symbols represent.
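
To make the idea of a universal code concrete, here is a sketch in the exponential-Golomb style: smaller symbol indices, assumed to be the more probable ones, get shorter codewords. The actual H.26L UVLC interleaves its information bits differently, but it assigns codewords of the same lengths.

    def exp_golomb(code_num):
        """Universal code: M leading zeros, then the (M+1)-bit binary value
        of code_num + 1. One table covers every symbol type."""
        suffix = bin(code_num + 1)[2:]            # binary of code_num + 1
        return "0" * (len(suffix) - 1) + suffix

    for n in range(5):
        print(n, exp_golomb(n))
    # 0 -> 1, 1 -> 010, 2 -> 011, 3 -> 00100, 4 -> 00101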

Context-based adaptive binary arithmetic coding makes use of a probability model at both the encoder and decoder for all the syntax elements (transform coefficients, motion vectors). To increase the coding efficiency of arithmetic coding, the underlying probability model is adapted to the changing statistics within a video frame through context modeling. That process provides estimates of conditional probabilities of the coding symbols.

Using suitable context models, the given intersymbol redundancy can be exploited by switching among probability models according to already coded symbols in the neighborhood of the current symbol. Each syntax element maintains a different model; for example, motion vectors and transform coefficients have different models. If a given symbol is nonbinary-valued, it will be mapped onto a sequence of binary decisions, or "bins." The actual binarization is done according to a given binary tree; in this case the UVLC binary tree is used.

Each binary decision is then encoded with the arithmetic encoder using the new probability estimates, which have been updated during the previous context-modeling stage. After encoding of each bin, the probability estimate is adjusted upward for the binary symbol that was just encoded.
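
The toy sketch below captures that adaptation loop: one probability model per context, consulted before each binary decision and updated afterward. The count-based estimator and the context keys are stand-ins of our own; the actual draft defines its probability update and context selection precisely.

    class ContextModel:
        """Toy adaptive estimate for one binary context: symbol counts
        stand in for the draft's precisely defined state update."""
        def __init__(self):
            self.counts = [1, 1]          # smoothed counts of 0s and 1s

        def p_one(self):
            return self.counts[1] / sum(self.counts)

        def update(self, bit):            # adapt toward the bin just coded
            self.counts[bit] += 1

    # One model per syntax element and neighborhood context; the keys here
    # are illustrative placeholders.
    models = {("mvd", 0): ContextModel(), ("coeff", 0): ContextModel()}

    m = models[("mvd", 0)]
    for bit in [0, 0, 1, 0]:
        p = m.p_one()    # probability handed to the arithmetic-coding engine
        m.update(bit)    # estimate adjusted after each bin, as described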

Implementing H.26L
H.26L implementations on digital signal processors fully exploit the new techniques, offering improved video quality and low latency for real-time video communication. A real-time videoconferencing application best illustrates what an H.26L implementation offers.

The major requirements for a typical videoconferencing session are consistent video quality (even when using limited bandwidth), low delay and robustness to packet loss. In a videoconferencing call involving audio, video and data, the video component typically consumes most of the bandwidth available for the call.

There would be much wider acceptance of videoconferencing systems over low-bandwidth networks if the required bandwidth for the video component could be reduced. H.26L reduces the bandwidth requirement from 320 kbits/s for H.263+ to 160 kbits/s.

In addition, while many videoconferencing solutions still perform two-pass encoding in order to guarantee satisfactory video quality, the two-pass encoding method can introduce an objectionable delay during a conference call. The H.26L solution guarantees excellent video quality even for single-pass encoding, thereby reducing processing latency.

Although most current videoconferencing calls take place over local private networks, a certain level of packet loss still occurs during transmission. H.26L provides error resilience on the encoding side and error concealment on the decoding side that combat packet loss even when the loss rate is high.


About the Authors
Vishal Markandey, distinguished member of the technical staff in Texas Instruments Inc.'s DSP video and imaging emerging end-equipment business, holds an MEE from Rice University and a bachelor's degree in electronics and communication engineering from India's Osmania University.

Jeremiah Golston is the chief technology officer for TI's DSP video and imaging emerging end-equipment business. He holds BSEE and MSEE degrees from the University of Missouri-Rolla.

Faouzi Kossentini is the president and CEO of UB Video Inc. He received BSEE, MSEE and PhD degrees from the Georgia Institute of Technology (Atlanta).

Foued Ben Amara, the business development manager at UB Video, holds PhD and master's degrees from the University of Michigan, Ann Arbor.

Ali Jerbi is a product development manager at UB Video and teaches image-processing courses in the EE department of the University of British Columbia. He holds BS, MS and PhD degrees in ECE from the Georgia Institute of Technology (Atlanta).


