Sep 25, 2008

Video Codecs

Video codecs are much easier to tell apart than audio codecs. Video quality is still on the rise, and each new codec release improves on the last. However, most video codecs are proprietary, which means they may not play back in other vendors' players. If you're embedding your podcast in a Web page, you can use JavaScript to check whether visitors have a particular plug-in installed. If you're targeting portable media players, however, you'll have to be careful about which video codec you choose.

H.264: Part of the MPEG-4 standard, H.264 is a significant improvement over earlier codecs and, more importantly, is the video standard supported by the iPod.

Windows Media Video (WMV): Currently at version 11, the WMV codec provides outstanding quality along with lots of advanced functionality. It is also supported by the PlaysForSure family of portable media players.

RealVideo (RV): Recently named the best video codec by author Jan Ozer, RV provides lots of advanced functionality but is not supported on the iPod or on PlaysForSure devices. It is, however, supported on a number of cell phones.

Ogg Theora: The sister project to the Ogg Vorbis audio codec. Theora videos play back in a number of open source players, as well as in RealPlayer and QuickTime, though these require the installation of an additional component.

Sep 19, 2008

Audio Codecs: Speech-optimized codecs & Music-optimized codecs

Now that you know how codecs work, it's time to see what codecs are available to podcasters, how they differ, and why you might want to use them. The first thing to consider is whether you're planning on using music in your podcast. If you are, then you definitely want to use a codec that is suited for music. If your podcast is just speech, then you may want to consider using a speech codec, because you'll be able to get very good quality at ridiculously low bit rates, thereby saving you money on bandwidth.
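To see how bit rate turns into bandwidth, here is a minimal Python sketch. The 64 kbps and 16 kbps bit rates and the 1,000 downloads are hypothetical numbers chosen for illustration, not recommendations from this post, and the function name `transfer_megabytes` is made up. It assumes a constant bit rate and 1 MB = 1,000,000 bytes.

```python
def transfer_megabytes(bit_rate_kbps: int, minutes: int, downloads: int = 1) -> float:
    """Approximate data transferred, in MB (10**6 bytes), assuming a constant bit rate."""
    bits = bit_rate_kbps * 1000 * minutes * 60 * downloads
    return bits / 8 / 1_000_000


# Hypothetical comparison: a 20-minute episode downloaded 1,000 times,
# encoded with a music codec at 64 kbps vs. a speech codec at 16 kbps.
music = transfer_megabytes(64, 20, downloads=1000)
speech = transfer_megabytes(16, 20, downloads=1000)
print(f"music codec:  {music:,.0f} MB")   # music codec:  9,600 MB
print(f"speech codec: {speech:,.0f} MB")  # speech codec: 2,400 MB
```

Because transfer scales linearly with bit rate, cutting the bit rate to a quarter cuts the bandwidth bill to a quarter as well.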

Choosing a codec is a tricky business. Newer codecs offer better quality, and some offer advanced functionality such as bookmarking and embedded images. However, many of the newer codecs play back on only a limited number of portable devices. If you want the latest and greatest features but also want to reach the widest possible audience, consider encoding to multiple formats.

Music-optimized codecs
As mentioned previously, if you're going to include any music at all in your podcast, you must use a music codec. Luckily, you're spoiled for choice. Here's a list of possible candidates:

  • MP3: The granddaddy of them all. MP3 wasn't originally designed as a low bit rate codec, so other codecs sound much better at low bit rates. It also does not support bookmarking. But just about every computer and portable media device in the world will play back an MP3 file.

  • Advanced Audio Coding (AAC): The new and improved MPEG audio codec, intended to replace MP3. AAC enables advanced features such as bookmarking and embedded images. The only problem is that some portable players don't support it.

  • Windows Media Audio (WMA): The standard on Microsoft PCs. It has many advanced features such as markers, script commands, and embedded links. WMA is not supported on iPods, though it is supported on the PlaysForSure family of portable media devices.

  • RealAudio (RA): The default audio codec of the RealPlayer, which offers embedded links and script commands. It is supported on a number of cell phones.

  • Ogg Vorbis: An open source audio codec offering excellent quality. Unfortunately, Vorbis isn't supported by many proprietary players, nor by the iPod.

Speech-optimized codecs
If your podcast doesn't include music, you should consider using a speech codec. Speech codecs provide better quality than a music codec at the same bit rate, or the same quality at a lower bit rate.

  • Audible Audio (AA): Developed for the first portable digital media player, which Audible released to play back audiobooks. AA supports a number of advanced features such as bookmarking. Unfortunately, Audible doesn't make an AA encoder publicly available.

  • The granddaddy of voice codecs. In fact, the AA format is based on the codec. is supported by both the Windows Media and Real players.

  • Ogg Speex: Another branch of the Ogg open source project, specializing in low bit rate speech compression.

  • Windows Media Audio Voice Codec: During Windows Media encoding you can specify that you're encoding voice content, and the Windows Media encoder will use a voice-optimized codec.

Sep 13, 2008

    How video codecs work

    Video codecs have also improved dramatically. Encoding video, however, is orders of magnitude more difficult than encoding audio. A minute of CD-quality audio is about 10 MB before it is encoded, and that's nothing compared to video. If the video is digitized in the RGB color space, each pixel uses 24 bits (8 for red, 8 for green, and 8 for blue). At the 720×486 resolution of standard-definition NTSC video, that means a single frame uses:

    720 pixels/line × 486 lines × 24 bits/pixel = 8,398,080 bits
    = 1,049,760 bytes
    ≈ 1 MB per frame of video

    To get the file size of a 20-minute podcast, remember that NTSC video runs at about 30 frames per second, so:

    1 MB/frame × 30 frames/second × 60 seconds/minute × 20 minutes = 36,000 MB ≈ 35.2 GB

    Yes, you read that right. A 20-minute podcast can chew up an entire hard drive, or at least a good chunk of one. Of course, the preceding calculations assumed uncompressed RGB video, and most podcasts are done using a DV camera. Because DV video is compressed at a 5:1 ratio, you're only looking at around 7 GB for your 20-minute podcast. But imagine downloading a 7 GB file! That's not going to happen in a flash. It's going to take a good long time.
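The arithmetic above can be checked with a few lines of Python. This sketch uses the same assumptions as the text: 720×486 RGB frames at 24 bits per pixel, about 30 frames per second, one frame rounded to 1 MB, 1 GB = 1,024 MB, and a 5:1 compression ratio for DV.

```python
WIDTH, HEIGHT = 720, 486   # pixels per line, lines per frame (from the text)
BITS_PER_PIXEL = 24        # 8 bits each for red, green, and blue
FPS = 30                   # approximate NTSC frame rate used in the text
MINUTES = 20

# Exact size of one uncompressed RGB frame.
bytes_per_frame = WIDTH * HEIGHT * BITS_PER_PIXEL // 8
print(bytes_per_frame)  # 1049760

# Like the text, round one frame to 1 MB, then convert MB to GB by dividing by 1,024.
total_mb = 1 * FPS * 60 * MINUTES
print(total_mb, round(total_mb / 1024, 1))  # 36000 35.2

# DV compresses roughly 5:1, per the text.
dv_gb = total_mb / 1024 / 5
print(round(dv_gb, 1))  # 7.0
```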

    So the first thing to consider is reducing the resolution of the video so there are fewer pixels to encode in each frame. If you resize down to 320×240, you've cut the amount of data by nearly 80 percent. You can also halve the frame rate for further data reduction. But it turns out that this is still nowhere near enough reduction to deliver the video reliably, in an acceptable amount of time, and without breaking your bandwidth budget. To get there, video codecs rely on perceptual coding, using intra-frame and inter-frame encoding.
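Here is a quick Python check of how much resizing and halving the frame rate buy you, assuming a 720×486 source as in the earlier frame-size calculation. Measured from 720×486, the pixel count drops by about 78 percent; an exact 75 percent reduction corresponds to scaling 640×480 down to 320×240.

```python
full = 720 * 486    # pixels per frame in the source (assumed 720×486)
small = 320 * 240   # pixels per frame after resizing

pixel_reduction = 1 - small / full
print(f"{pixel_reduction:.0%}")  # 78%

# Halving the frame rate halves the remaining data per second again.
remaining = (small / full) * 0.5
print(f"{remaining:.0%} of the original data per second")  # 11% of the original data per second
```

Even with both reductions, roughly a ninth of the raw data remains, which is why codecs still need intra-frame and inter-frame compression.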

    Intra-frame encoding compresses each frame individually, much as a JPEG codec compresses a still image. Inter-frame encoding is a more sophisticated approach that looks at the differences between frames and encodes only what has changed from one frame to the next. This is illustrated in Figure 1.

    Figure 1: Inter-frame compression encodes only the differences between frames.

    To encode the differences between frames, the codec starts by encoding a full frame of video, known as a key frame. After the key frame, it encodes a series of difference frames, which, unsurprisingly, contain only what has changed from the previous frame to the current one. The codec keeps encoding difference frames until a scene change occurs or the amount of change crosses a predetermined threshold, at which point it encodes a new key frame. The sequence of key frames and difference frames is illustrated in Figure 2.

    Figure 2: Inter-frame compression uses a sequence of key frames and difference frames.
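The key-frame and difference-frame scheme can be sketched as a toy encoder and decoder in Python. This is an illustration under simplifying assumptions, not how any real codec works internally: frames are flat lists of pixel values, a difference frame stores (index, new value) pairs for changed pixels, and a scene change is approximated by a hypothetical `threshold` on the fraction of changed pixels. Real codecs encode the differences far more cleverly, and lossily.

```python
from typing import List, Tuple, Union

Frame = List[int]              # a frame as a flat list of pixel values
Diff = List[Tuple[int, int]]   # (pixel index, new value) pairs
Packet = Tuple[str, Union[Frame, Diff]]


def encode(frames: List[Frame], threshold: float = 0.5) -> List[Packet]:
    """Toy inter-frame encoder: a key frame, then difference frames.

    A new key frame is emitted when the fraction of changed pixels exceeds
    the hypothetical `threshold`, standing in for scene-change detection.
    """
    encoded: List[Packet] = []
    previous = None
    for frame in frames:
        if previous is None:
            encoded.append(("key", list(frame)))  # first frame is always a key frame
        else:
            changes = [(i, new) for i, (old, new) in enumerate(zip(previous, frame)) if old != new]
            if len(changes) / len(frame) > threshold:
                encoded.append(("key", list(frame)))  # too much change: start over
            else:
                encoded.append(("diff", changes))     # store only what changed
        previous = frame
    return encoded


def decode(encoded: List[Packet]) -> List[Frame]:
    """Rebuild frames by applying each difference frame to the last frame."""
    frames: List[Frame] = []
    current: Frame = []
    for kind, payload in encoded:
        if kind == "key":
            current = list(payload)
        else:
            current = list(current)
            for i, value in payload:
                current[i] = value
        frames.append(current)
    return frames


clip = [
    [0, 0, 0, 0],
    [0, 9, 0, 0],  # one pixel changes -> difference frame
    [5, 5, 5, 5],  # every pixel changes -> new key frame
]
stream = encode(clip)
print([kind for kind, _ in stream])  # ['key', 'diff', 'key']
```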

    The combination of reduced screen resolutions, reduced frame rates, intra-frame compression, and inter-frame compression is sufficient to create satisfactory video experiences at amazingly low bit rates. Although no one would want to pay to watch it, you can create video files at bit rates as low as 32 kbps. Of course, we recommend using roughly ten times that much for your video podcast: at 300 kbps and above, you can deliver an entirely satisfactory video experience. It won't be perfect, but it should be more than adequate.
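To put the 300 kbps recommendation in perspective, this Python sketch compares it with the raw data rate of 320×240 RGB video at 15 frames per second (the halved frame rate mentioned earlier, assumed here) and computes the size of a 20-minute file. It ignores the audio track and assumes a constant bit rate.

```python
WIDTH, HEIGHT, BITS_PER_PIXEL = 320, 240, 24
FPS = 15            # assumption: frame rate halved from about 30 fps
TARGET_KBPS = 300   # the recommended minimum from the text
MINUTES = 20

raw_bps = WIDTH * HEIGHT * BITS_PER_PIXEL * FPS               # uncompressed video data rate
ratio = raw_bps / (TARGET_KBPS * 1000)                        # compression the codec must achieve
size_mb = TARGET_KBPS * 1000 * MINUTES * 60 / 8 / 1_000_000   # encoded file size in MB

print(f"raw: {raw_bps / 1e6:.1f} Mbps")      # raw: 27.6 Mbps
print(f"compression needed: {ratio:.0f}:1")  # compression needed: 92:1
print(f"20-minute file: {size_mb:.0f} MB")   # 20-minute file: 45 MB
```

Compared with roughly 7 GB of DV, a 45 MB file is something listeners can actually download.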

    Codec side effects
    No codec is perfect. Even when a codec claims to be transparent, an expert somewhere can tell the difference. At higher bit rates, the differences between the original and the encoded version are minimal. As the bit rate decreases, however, the differences become easy to spot. Perceptual codecs attempt to remove things that we won't notice, but unfortunately they're not always successful.

    Because so much information must be removed, you get less of everything in the encoded version of your file. Both the frequency range and the dynamic range are reduced. If you're encoding video, you also get a smaller screen resolution and possibly a reduced frame rate. On top of that, you may see or hear artifacts in your podcast.

    Artifacts are things that weren't in the original file. In encoded audio files, artifacts can be heard as low rumbling noises, pops, clicks, and what is known as "pre-echo," which gives speech content a lisping quality. For video files, you may notice blocking artifacts, where the video is broken up into blocks that move around the screen. You also may see smearing, where the video image looks muddy and lacks detail.

    If your podcast has audible or visible artifacts, check your encoding settings. Audio podcasts in particular should not have artifacts; you should be more than capable of producing a high-quality audio podcast. Video, however, is a different matter. If you're delivering a 320×240 video podcast encoded at 300 kbps, chances are good that you'll encounter a few artifacts. They shouldn't interfere with your audience's enjoyment of the podcast. If they do, you'll need to revisit your equipment or your shooting and editing style, or simply encode your video podcast at a higher bit rate.

    Sep 5, 2008

    How perceptual codecs & audio codecs work

    Perceptual codecs take advantage of how we actually perceive audio and video, and use this knowledge to make intelligent decisions about what information can safely be discarded. Because of this, perceptual codecs are by definition lossy: the original cannot be recreated from the encoded file. Instead, an approximation is constructed that attempts to retain as much fidelity as possible. The idea is that we won't notice what has been discarded.

    Our ears are extremely sensitive. We can hear frequencies from 20 Hz to 20,000 Hz, across a wide dynamic range from a whisper to a scream. We can pick out a conversation at the next table in a crowded restaurant if the topic happens to catch our ear. We can do this because our brains filter out the information that is not of interest and focus on the rest. Our brains effectively prioritize incoming sound information.

    For example, even a quiet classroom has plenty of sounds, such as the hum of air conditioning, people shuffling papers, and the teacher lecturing at the front. If someone sneezes in the room, for that split second, everyone notices the sneeze and nothing else. The sneeze is the loudest thing in the room and takes precedence over everything else.

    Similarly, our eyes can take in a wide range of visual information, the entire color spectrum from red all the way through purple, and from very dim environments to very bright environments. Our field of vision is approximately 180 degrees from left to right. What we actually pay attention to, though, is much more focused. In general, we pay more attention to things that are brightly colored and things that are moving.

    Perceptual codecs use this information to make better decisions about what information in audio and video files can be discarded or encoded with less detail. Perceptual codecs prioritize the loudest frequencies in an audio file, knowing that's what our ears pay most attention to. When encoding video, perceptual codecs prioritize bright colors and any motion in the frame.

    At higher bit rates, perceptual codecs are extremely effective. A 128 kbps MP3 file is generally considered close to CD quality, yet it is only about one-tenth the size of the original, which is pretty incredible if you think about it. Some of the savings comes from encoding efficiency, but most of it comes from perceptual encoding. As the bit rate is lowered and the codec is forced to discard more and more of the original information, fidelity drops and the effects of perceptual encoding become more audible. Still, you should always balance the required fidelity of your podcast against the realities of bandwidth and throughput.
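The CD figures used in these posts follow from simple arithmetic on the CD audio format: 44.1 kHz sampling, 16 bits per sample, two channels. This Python sketch computes the CD bit rate, the roughly 10 MB per minute of uncompressed audio, and the ratio to a 128 kbps MP3, which works out to closer to one-eleventh than one-tenth.

```python
SAMPLE_RATE = 44_100   # samples per second per channel (CD audio)
BIT_DEPTH = 16         # bits per sample
CHANNELS = 2           # stereo

cd_bps = SAMPLE_RATE * BIT_DEPTH * CHANNELS
print(cd_bps)                       # 1411200 (bits per second)
print(cd_bps * 60 / 8 / 1_000_000)  # 10.584 (MB per minute, uncompressed)
print(round(cd_bps / 128_000, 1))   # 11.0 (CD bit rate vs. a 128 kbps MP3)
```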

    How audio codecs work
    Audio codec technology has made spectacular advances in the last few years. It's now possible to encode FM-quality audio in as little as 32 kbps (in mono, that is). Modern codecs such as Windows Media Audio, RealAudio, and QuickTime AAC can achieve near-CD quality at approximately 64 kbps. How do they do it?

    The idea is to capture as much of the frequency and dynamic range as possible, along with the entire stereo image. Given the target bit rate, however, the codec usually has to decide what a reasonable frequency range is. At a given bit rate, files encoded in mono are generally slightly higher fidelity, because the encoder has to spend its bits on only one channel, not two.

    Another economy can be made if the codec knows that it will be encoding speech. Speech tends to stay in a very limited frequency and dynamic range. If someone is talking, it's unlikely that her voice will suddenly drop down an octave, or that she'll start screaming for no reason. Knowing this, a codec can take shortcuts when encoding the frequency and dynamics information.

    Caution: Don't try to encode music using a speech codec. The shortcuts a speech codec takes are totally unsuitable for music, because music uses a very wide frequency range and is generally very dynamic. If you encode music with a speech codec, it will sound awful. So don't do it.

    After the frequency range has been determined, the codec must somehow squeeze as much information as possible into the encoded file and decide what can be discarded. Perceptual audio codecs use the concept of masking to help make that decision. If one frequency is very loud, it masks other frequencies, so the codec can safely discard them because we wouldn't perceive them.

    This is why you must minimize background noise in your original recordings and keep your programming nice and loud. That way, your program material is what the codec treats as most important, and none of it gets discarded.
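Masking can be illustrated with a deliberately simplified Python sketch. The rule and both parameters (`neighborhood` and `margin_db`) are hypothetical: a quiet component is dropped if a much louder component sits close to it in frequency. Real perceptual codecs use far more detailed psychoacoustic models, so treat this only as an illustration of the idea.

```python
from typing import List, Tuple

Component = Tuple[float, float]  # (frequency in Hz, level in dB)


def drop_masked(components: List[Component],
                neighborhood: float = 0.2,
                margin_db: float = 20.0) -> List[Component]:
    """Keep only components that are not masked by a louder nearby component.

    Hypothetical toy rule: component B is masked if some other component A lies
    within `neighborhood` (as a fraction of A's frequency) and is at least
    `margin_db` louder than B.
    """
    kept = []
    for i, (freq, level) in enumerate(components):
        masked = any(
            j != i
            and abs(freq - other_freq) <= neighborhood * other_freq
            and other_level - level >= margin_db
            for j, (other_freq, other_level) in enumerate(components)
        )
        if not masked:
            kept.append((freq, level))
    return kept


tones = [(1000.0, 80.0), (1100.0, 50.0), (4000.0, 45.0)]
print(drop_masked(tones))  # [(1000.0, 80.0), (4000.0, 45.0)]
```

Here the 50 dB tone at 1,100 Hz sits right next to an 80 dB tone and is discarded, while the quieter 45 dB tone at 4,000 Hz survives because nothing loud is nearby.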