Sep 5, 2008

How perceptual codecs & audio codecs work

Perceptual codecs take advantage of how we actually perceive audio and video, and use this information to make intelligent decisions about what information can safely be discarded. Perceptual codecs are by definition lossy because of this. The original cannot be recreated from the encoded file. Instead, an approximation that attempts to retain as much fidelity as possible is constructed. The idea is that we won't notice what has been discarded.

Our ears are extremely sensitive. We can hear frequencies from roughly 20 Hz to 20,000 Hz, and sounds over a wide dynamic range, from a whisper to a scream. We can pick out a conversation at the next table in a crowded restaurant if the topic happens to catch our ear. We can do this because our brains filter out the information that is not of interest and focus on the rest. Our brains effectively prioritize incoming sound information.

For example, even a quiet classroom has plenty of sounds, such as the hum of air conditioning, people shuffling papers, and the teacher lecturing at the front. If someone sneezes in the room, for that split second, everyone notices the sneeze and nothing else. The sneeze is the loudest thing in the room and takes precedence over everything else.

Similarly, our eyes can take in a wide range of visual information: the entire visible spectrum from red all the way through violet, in environments ranging from very dim to very bright. Our field of vision is approximately 180 degrees from left to right. What we actually pay attention to, though, is much more focused. In general, we pay more attention to things that are brightly colored and things that are moving.

Perceptual codecs use this information to make better decisions about what information in audio and video files can be discarded or encoded with less detail. Perceptual codecs prioritize the loudest frequencies in an audio file, knowing that's what our ears pay most attention to. When encoding video, perceptual codecs prioritize bright colors and any motion in the frame.

At higher bit rates, perceptual codecs are extremely effective. A 128 kbps MP3 file is widely regarded as close to CD quality for most listeners and is only about one-tenth the size of the original, which is pretty incredible if you think about it. Some of the savings comes from encoding efficiency, but the majority of it comes from perceptual encoding. As the bit rate is lowered and the codec is forced to discard more and more of the original information, the fidelity is reduced and the effects of perceptual encoding become more audible. Still, you should always balance the required fidelity of your podcast with the realities of bandwidth and throughput.
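
The size comparison above can be checked with a little arithmetic. The sketch below assumes the standard CD audio parameters (44.1 kHz sampling, 16 bits per sample, two channels), which this post does not spell out; the function names are illustrative only.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample, channels):
    """Bit rate of uncompressed PCM audio, in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


def bytes_per_minute(bit_rate_bps):
    """Storage needed for one minute of audio at the given bit rate."""
    return bit_rate_bps * 60 // 8


# Assumed CD audio parameters: 44.1 kHz, 16-bit, stereo.
cd_bps = pcm_bit_rate(44_100, 16, 2)
mp3_bps = 128_000

ratio = cd_bps / mp3_bps

print(f"CD bit rate:  {cd_bps:,} bps")
print(f"MP3 bit rate: {mp3_bps:,} bps")
print(f"Size ratio:   about {ratio:.1f} to 1")
print(f"One minute:   {bytes_per_minute(cd_bps):,} bytes (CD) "
      f"vs {bytes_per_minute(mp3_bps):,} bytes (MP3)")
```

Under those assumptions the ratio works out to roughly 11 to 1, which is where the "one-tenth the size" rule of thumb comes from.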

How audio codecs work

Audio codec technology has made spectacular advances in the last few years. It's now possible for FM quality to be encoded in as little as 32 kbps (in mono, that is). Modern codecs such as Windows Media, Real, and QuickTime AAC can approach CD quality at approximately 64 kbps. How do they do it?

The idea is to capture as much of the frequency and dynamic range as possible and to capture the entire stereo image. However, the target bit rate limits what is achievable, so the codec usually determines a reasonable frequency range for that bit rate. At a given bit rate, files encoded in mono generally have slightly higher fidelity, because the encoder can spend its entire bit budget on one channel instead of two.
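
To see why mono helps at a fixed bit rate, consider how many bits per second are available to each channel. This is a deliberately simplified sketch: it assumes the budget is split evenly between channels, whereas real codecs often use joint-stereo techniques that share information across channels. The function name is made up for illustration.

```python
def per_channel_budget(total_bps, channels):
    """Bits per second available to each channel, assuming an even split."""
    if channels < 1:
        raise ValueError("channels must be at least 1")
    return total_bps / channels


# 32 kbps is the low bit rate mentioned earlier in this post.
for label, channels in (("mono", 1), ("stereo", 2)):
    budget = per_channel_budget(32_000, channels)
    print(f"{label}: {budget:,.0f} bps per channel")
```

At 32 kbps, a mono encode has twice as many bits to describe its single channel as an evenly split stereo encode has for each of its two.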

Another economy can be made if the codec knows that it will be encoding speech. Speech tends to stay in a very limited frequency and dynamic range. If someone is talking, it's unlikely that her voice will suddenly drop down an octave, or that she'll start screaming for no reason. Knowing this, a codec can take shortcuts when encoding the frequency and dynamics information.

Caution: Don't try to encode music using a speech codec. The shortcuts a speech codec uses are totally unsuitable for music, because music uses a very wide frequency range and is generally very dynamic. Music encoded with a speech codec sounds awful. So don't do it.


After the frequency range has been determined, the codec must somehow squeeze as much information as possible into the encoded file and decide what can be discarded. Perceptual audio codecs use the concept of masking to help make that decision. If one frequency is very loud, it masks quieter frequencies close to it, so the codec can safely discard those quieter frequencies because we wouldn't perceive them anyway.
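
The masking idea can be illustrated with a toy model. The sketch below is not how any real codec computes masking; real encoders use far more detailed psychoacoustic models. Here each component is assumed to cast a masking threshold that starts 10 dB below its own level and drops by 30 dB for every octave of distance. Both numbers, and the function name, are arbitrary assumptions chosen only to make the example readable.

```python
import math


def audible_components(components, offset_db=10.0, slope_db_per_octave=30.0):
    """Return the (frequency_hz, level_db) pairs not masked by another one.

    Toy model (assumed, not taken from any real codec): a masker's threshold
    is its level minus offset_db, minus slope_db_per_octave for each octave
    between the masker and the component being tested.
    """
    kept = []
    for i, (freq, level) in enumerate(components):
        masked = False
        for j, (masker_freq, masker_level) in enumerate(components):
            if i == j:
                continue
            octaves = abs(math.log2(freq / masker_freq))
            threshold = masker_level - offset_db - slope_db_per_octave * octaves
            if level < threshold:
                masked = True
                break
        if not masked:
            kept.append((freq, level))
    return kept


# A loud 1 kHz tone with a quiet neighbor at 1.1 kHz and two distant tones.
spectrum = [(1000, 80), (1100, 40), (4000, 35), (200, 60)]
print(audible_components(spectrum))
```

In this example the quiet 1.1 kHz tone sits right next to the loud 1 kHz tone and is dropped, while the tones at 4 kHz and 200 Hz are far enough away to survive.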

This is why all background noise must be minimized in your original recordings and your programming must be nice and loud. This ensures that the codec doesn't discard any of the programming information.
