Digital Sound & Music: Concepts, Applications, & Science, Chapter 5, last updated 6/25/2013
Figure 5.49 MP3 compression
Let’s look at the steps in this algorithm more closely.
1. Divide the audio signal into frames.
MP3 compression processes the original audio signal in frames of 1152 samples. Each
frame is split into two granules of 576 samples each. Frames are encoded in a number of bytes
consistent with the bit rate set for the compression at hand. In the example described above
(with sampling rate of 44.1 kHz and requested bit rate of 128 kb/s), 1152 samples are
compressed into a frame of approximately 450 bytes – 418 bytes for data and 32 bytes for the
2. Use the Fourier transform to convert the time-domain data to the frequency
domain, sending the results to the psychoacoustical analyzer.
The fast Fourier transform changes the data to the frequency domain. The
frequency-domain data is then sent to a psychoacoustical analyzer. One purpose of this analysis is to
identify masking tones and masked frequencies in a local neighborhood of frequencies over a
small window of time. The psychoacoustical analyzer outputs a set of signal-to-mask ratios
(SMRs) that can be used later in quantizing the data. The SMR is the ratio between the
amplitude of a masking tone and the amplitude of the minimum masked frequency in the chosen
vicinity. The compressor uses these values to choose scaling factors and quantization levels such
that quantization error mostly falls below the masking threshold. Step 5 explains this process.
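To make the SMR definition above concrete, the ratio of two amplitudes can be expressed in decibels, the usual unit for such ratios. This is only a sketch: the helper name `smr_db` and the sample amplitudes are assumptions chosen for illustration, not values produced by a real psychoacoustical model.

```python
import math

def smr_db(masker_amplitude, min_masked_amplitude):
    """Signal-to-mask ratio in dB from two linear amplitudes (hypothetical helper)."""
    if masker_amplitude <= 0 or min_masked_amplitude <= 0:
        raise ValueError("amplitudes must be positive")
    # Amplitude ratios convert to decibels with 20 * log10(ratio).
    return 20 * math.log10(masker_amplitude / min_masked_amplitude)

# Made-up amplitudes: a masking tone 100 times stronger than the
# weakest masked component in its neighborhood.
example_smr = smr_db(0.5, 0.005)
```

A larger SMR means the masking tone stands further above the masked component in its vicinity.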
Another purpose of the psychoacoustical analysis is to identify the presence of transients
and temporal masking. When the MDCT is applied in a later step, transients have to be processed
with smaller windows to achieve better time resolution in the encoding. Otherwise, one transient
sound can mask another that occurs close to it in time. Thus, in the presence of transients,
windows are made one-third of their normal size in the MDCT.
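The time-to-frequency conversion that begins step 2 can be sketched as follows. To stay self-contained, the sketch uses a direct (slow) discrete Fourier transform rather than a true FFT; both give the same frequency-domain values. The function names and the 1000 Hz test tone are assumptions for illustration, and no windowing or psychoacoustical modeling is attempted.

```python
import cmath
import math

SAMPLE_RATE = 44100   # Hz, matching the chapter's running example
FRAME_SIZE = 1152     # samples per MP3 frame

def dft_magnitudes(frame):
    """Magnitudes of bins 0..N/2 via a direct DFT (a stand-in for an FFT)."""
    n = len(frame)
    # Precompute the n complex roots of unity shared by every bin.
    twiddles = [cmath.exp(-2j * math.pi * i / n) for i in range(n)]
    magnitudes = []
    for k in range(n // 2 + 1):
        total = sum(x * twiddles[(k * t) % n] for t, x in enumerate(frame))
        magnitudes.append(abs(total))
    return magnitudes

def peak_frequency(frame, sample_rate=SAMPLE_RATE):
    """Frequency in Hz of the strongest bin in the frame."""
    magnitudes = dft_magnitudes(frame)
    peak_bin = max(range(len(magnitudes)), key=lambda k: magnitudes[k])
    # Bin k corresponds to k * sample_rate / N Hz.
    return peak_bin * sample_rate / len(frame)

# Hypothetical test signal: one frame of a 1000 Hz sine tone.
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(FRAME_SIZE)]
peak_hz = peak_frequency(tone)
```

With 1152 samples at 44.1 kHz, the bins are about 38 Hz apart, so `peak_hz` can only be expected to land within one bin of the true tone frequency.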
3. Divide each frame into 32 frequency bands.
Steps 2 and 3 are independent and could actually be done in parallel. The frame is divided
into frequency bands by a filter bank. Each filter in the bank is a bandpass filter that allows
only a range of frequencies to pass through. (Chapter 7 gives more details on bandpass filters.)
The complete range of frequencies that can appear in the original signal is 0 to ½ the sampling
rate, as we know from the Nyquist theorem. For example, if the sampling rate of the signal is
44.1 kHz, then the highest frequency that can be present in the signal is 22.05 kHz. Thus, the
filter bank yields 32 frequency bands between 0 and 22.05 kHz, each of width 22050/32, or
about 689 Hz.
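The band-width arithmetic above can be checked with a short sketch. It only computes the edges of 32 equal-width bands between 0 and the Nyquist frequency; it does not implement the bandpass filters themselves. The function name `subband_edges` is an assumption for illustration.

```python
def subband_edges(sample_rate, num_bands=32):
    """Return (low_hz, high_hz) for each of num_bands equal-width bands."""
    nyquist = sample_rate / 2        # highest representable frequency
    width = nyquist / num_bands      # 22050 / 32 = 689.0625 Hz at 44.1 kHz
    return [(i * width, (i + 1) * width) for i in range(num_bands)]

edges = subband_edges(44100)
first_band = edges[0]    # lowest band, starting at 0 Hz
last_band = edges[-1]    # highest band, ending at the Nyquist frequency
```

For a 44.1 kHz signal, the first band spans 0 to 689.0625 Hz and the last band ends at 22050 Hz.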
The 32 resulting bands are still in the time domain. Note that dividing the audio signal into
frequency bands increases the amount of data by a factor of 32 at this point. That is, there are 32
sets of 1152 time-domain samples, each holding just the frequencies in its band. (You can