Digital Sound & Music: Concepts, Applications, & Science, Chapter 5, last updated 6/25/2013

Figure 5.49 MP3 compression

Let’s look at the steps in this algorithm more closely.

1. Divide the audio signal into frames.

MP3 compression processes the original audio signal in frames of 1152 samples. Each frame is split into two granules of 576 samples each. Frames are encoded in a number of bytes consistent with the bit rate set for the compression at hand. In the example described above (with a sampling rate of 44.1 kHz and a requested bit rate of 128 kb/s), 1152 samples span about 26 ms, so they are compressed into a frame of approximately 418 bytes, which includes a 4-byte header and, for a stereo signal, 32 bytes of side information.
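
The arithmetic behind the frame size can be checked with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and it computes just the raw byte budget that a constant bit rate allows for the duration of one frame, ignoring padding and how those bytes are divided up inside the frame.

```python
SAMPLES_PER_FRAME = 1152   # from the text: one MP3 frame
GRANULES_PER_FRAME = 2     # from the text: two granules of 576 samples


def frame_duration_seconds(sample_rate_hz):
    """Time span of audio covered by one frame."""
    return SAMPLES_PER_FRAME / sample_rate_hz


def frame_byte_budget(sample_rate_hz, bit_rate_bps):
    """Bytes available per frame at a constant bit rate (no padding)."""
    bits = frame_duration_seconds(sample_rate_hz) * bit_rate_bps
    return bits / 8


if __name__ == "__main__":
    duration = frame_duration_seconds(44_100)
    budget = frame_byte_budget(44_100, 128_000)
    print(f"granule size: {SAMPLES_PER_FRAME // GRANULES_PER_FRAME} samples")
    print(f"frame duration: {duration * 1000:.2f} ms")
    print(f"byte budget: {budget:.1f} bytes")
```

For a sampling rate of 44.1 kHz and a bit rate of 128 kb/s, this gives a frame duration of about 26.12 ms and a budget of about 418 bytes.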

2. Use the Fourier transform to transform the time domain data to the frequency domain, sending the results to the psychoacoustical analyzer.

The fast Fourier transform changes the data to the frequency domain. The frequency domain data is then sent to a psychoacoustical analyzer. One purpose of this analysis is to identify masking tones and masked frequencies in a local neighborhood of frequencies over a small window of time. The psychoacoustical analyzer outputs a set of signal-to-mask ratios (SMRs) that can be used later in quantizing the data. The SMR is the ratio between the amplitude of a masking tone and the amplitude of the minimum masked frequency in the chosen vicinity. The compressor uses these values to choose scaling factors and quantization levels such that quantization error mostly falls below the masking threshold. Step 5 explains this process further.
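
To make the link between an SMR and the choice of quantization levels concrete, here is a minimal sketch. It rests on two assumptions that are not stated in the text: the SMR is expressed in decibels as 20·log10 of the amplitude ratio, and each quantization bit buys roughly 6.02 dB of signal-to-noise ratio (a common rule of thumb for uniform quantization). The function names are hypothetical, and an actual MP3 encoder allocates bits through a more involved procedure.

```python
import math

DB_PER_BIT = 6.02  # assumed rule of thumb: ~6.02 dB of SNR per quantization bit


def smr_db(signal_amplitude, mask_amplitude):
    """Signal-to-mask ratio in decibels for two linear amplitudes."""
    return 20.0 * math.log10(signal_amplitude / mask_amplitude)


def bits_to_hide_noise(smr):
    """Smallest bit count whose estimated quantization SNR is at least the SMR."""
    if smr <= 0:
        return 0  # the signal is already at or below the mask
    return math.ceil(smr / DB_PER_BIT)


if __name__ == "__main__":
    ratio = smr_db(1000.0, 10.0)  # amplitudes in arbitrary linear units
    print(f"SMR: {ratio:.1f} dB, bits needed: {bits_to_hide_noise(ratio)}")
```

Under these assumptions, a band whose signal amplitude is 100 times the masked level has an SMR of 40 dB and needs about 7 bits for the quantization noise to stay below the mask.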

Another purpose of the psychoacoustical analysis is to identify the presence of transients and temporal masking. When the MDCT is applied in a later step, transients have to be handled with smaller windows to achieve better time resolution in the encoding. Otherwise, one transient sound can mask another that occurs close to it in time. Thus, in the presence of transients, windows are made one third their normal size in the MDCT.
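
The window-switching decision can be sketched as follows. This is a simplified stand-in, not the detector an MP3 encoder actually uses: it flags a transient when the energy of one third of a block rises sharply relative to the previous third. The threshold `jump_ratio` and the function names are hypothetical, and the long window length of 36 samples per band is taken from the MPEG-1 Layer III design rather than from the text above.

```python
LONG_WINDOW = 36                 # samples per long MDCT window (per band)
SHORT_WINDOW = LONG_WINDOW // 3  # one third the size, as described in the text


def block_energies(samples, n_blocks=3):
    """Split samples into n_blocks equal parts and return each part's energy."""
    size = len(samples) // n_blocks
    return [
        sum(x * x for x in samples[i * size:(i + 1) * size])
        for i in range(n_blocks)
    ]


def choose_window(samples, jump_ratio=8.0):
    """Return SHORT_WINDOW if energy rises sharply between parts, else LONG_WINDOW.

    jump_ratio is a hypothetical threshold chosen for illustration.
    """
    energies = block_energies(samples)
    floor = 1e-12  # avoids division by zero on silent parts
    for before, after in zip(energies, energies[1:]):
        if after / max(before, floor) > jump_ratio:
            return SHORT_WINDOW
    return LONG_WINDOW


if __name__ == "__main__":
    steady = [0.5] * 36
    attack = [0.01] * 24 + [1.0] * 12  # quiet, then a sudden onset
    print(choose_window(steady), choose_window(attack))
```

A steady block keeps the long window, while a block with a sudden onset switches to the short one, trading frequency resolution for time resolution.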

3. Divide each frame into 32 frequency bands.

Steps 2 and 3 are independent and could be done in parallel. Dividing the frame into frequency bands is done with a filter bank, a set of bandpass filters, each of which allows only a range of frequencies to pass through. (Chapter 7 gives more details on bandpass filters.) The complete range of frequencies that can appear in the original signal is 0 to ½ the sampling rate, as we know from the Nyquist theorem. For example, if the sampling rate of the signal is 44.1 kHz, then the highest frequency that can be present in the signal is 22.05 kHz. Thus, the filter bank yields 32 frequency bands between 0 and 22.05 kHz, each of width 22050/32, or about 689 Hz.
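
The band layout described above can be worked out in a few lines. This sketch assumes idealized, equal-width bands with sharp edges, whereas the passbands of real filters overlap somewhat. The function names are hypothetical.

```python
N_BANDS = 32  # from the text


def band_width_hz(sample_rate_hz, n_bands=N_BANDS):
    """Width of each band when 0..Nyquist is split into equal bands."""
    nyquist = sample_rate_hz / 2
    return nyquist / n_bands


def band_edges_hz(sample_rate_hz, n_bands=N_BANDS):
    """List of (low, high) edges in Hz for each band."""
    width = band_width_hz(sample_rate_hz, n_bands)
    return [(k * width, (k + 1) * width) for k in range(n_bands)]


def band_index(frequency_hz, sample_rate_hz, n_bands=N_BANDS):
    """Index of the band containing frequency_hz (clamped to the last band)."""
    width = band_width_hz(sample_rate_hz, n_bands)
    return min(int(frequency_hz // width), n_bands - 1)


if __name__ == "__main__":
    print(f"band width: {band_width_hz(44_100):.2f} Hz")
    print(f"first band: {band_edges_hz(44_100)[0]}")
    print(f"band holding 1000 Hz: {band_index(1000, 44_100)}")
```

At 44.1 kHz each band is 689.0625 Hz wide, so a 440 Hz tone falls in band 0 and a 1000 Hz tone in band 1.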

The 32 resulting bands are still in the time domain. Note that dividing the audio signal into frequency bands increases the amount of data by a factor of 32 at this point. That is, there are 32 sets of 1152 time-domain samples, each holding just the frequencies in its band. (You can