The FFT and Exploring Spectra

So far in the course we have generally considered the sound data we deal with to be in the time domain, as a sequence of samples (or a varying voltage) in time. Sometimes, however, it can be useful to work with sound in the frequency domain, as an indication of how energy is distributed in frequency, or the time-frequency domain, which shows how this distribution varies over time.

Here, for instance, is a spectral analyser that gives us a snapshot (or average of a series of snapshots) of the spectrum at a particular moment. This shows us the energy (y axis) across frequency, the magnitude spectrum.

A spectrum snapshot

Whereas here is a spectrogram that shows the progression of the sound’s spectrum over time. The brightness is the energy, and the vertical axis is now frequency, so it is like a series of the above snapshots glued together.

A Spectrogram

We can use spectral representations of sound for a number of things:

  • Looking at your sound
  • Processing your sound
  • Extracting further analysis of your sound

We’ll cover these each in turn. First it’s worthwhile looking briefly under the hood at how this works.

The Phase Vocoder and the FFT

Spectral processing is almost ubiquitously done using something called the Phase Vocoder, which is, in turn, almost universally implemented using the Fast Fourier Transform (FFT). As sound designers it is important that we understand these to the extent that we are aware of the strengths and weaknesses of this scheme for working with sound so that we know where best to apply it, and how to cope with artefacts.

The Phase Vocoder came about as an extension to the Channel Vocoder, which is what we more normally associate with the term ‘vocoder’. The latter was originally a scheme for encoding speech. It works by running a signal through a bank of band-pass filters and measuring the energy from each filter; these measurements are then used upon ‘decoding’ to modulate some new carrier signal (taken to be something similar to the waveform produced by our vocal folds). The Phase Vocoder extends this idea by reporting not only the amplitude in a channel, but the phase of that channel also, which in turn allows tracking of information about frequency. When the Phase Vocoder was implemented on computers, use was made of the FFT for computational efficiency.

The Fourier Transform (fast or otherwise) derives from a theory that any waveform can be described as a sum of cosine and sine waves (they are identical, but have a 90° phase difference) of various frequencies and amplitudes. This is only strictly true if the waveform in question is both perfectly periodic and infinite, neither of which are conditions likely to be met for sound. However, it turns out to be true enough to be useful.

The FT is needed for the Phase Vocoder because there are now two pieces of information to be extracted from each filter channel and, if we use ‘normal’ filters, only one measurement. When confronted with two unknowns, we need two measurements, which we get by making our measurement complex (in the mathematical sense, rather than in the sense of being complicated). What we end up with in each channel is a ‘cosine’ component and a ‘sine’ component; using high school trigonometry turns out to be enough to get the amplitude and phase from these.
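As a rough illustration of that trigonometry, here is a minimal Python sketch. The function name `to_polar` and the example values are my own, chosen only for illustration; the cosine and sine components are treated as the real and imaginary parts of a complex number.

```python
import math

def to_polar(cos_part, sin_part):
    """Turn a channel's cosine and sine components into (amplitude, phase).

    Hypothetical helper, for illustration only.
    """
    amplitude = math.hypot(cos_part, sin_part)   # Pythagoras: sqrt(c^2 + s^2)
    phase = math.atan2(sin_part, cos_part)       # angle in radians, between -pi and pi
    return amplitude, phase

amplitude, phase = to_polar(3.0, 4.0)
print(amplitude, phase)
```

Going the other way (amplitude and phase back to cosine and sine components) is just `amplitude * cos(phase)` and `amplitude * sin(phase)`.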

Doing it Digitally: The Discrete Fourier Transform

So, because the FT describes the signal as a stack of cosines and sines, we can use this as a complex number from which we get an amplitude and phase measurement from each channel.

However, because we’re implementing this on computer, we need to take into account that computers can only deal with things in discrete chunks, that are neither infinitesimally variable nor infinite in duration.  When implementing the Fourier Transform on a computer, it becomes known as the Discrete Fourier Transform (DFT), and the FFT is a particular algorithm for performing a DFT in a reasonable amount of time.

The way the DFT works is that it has a finite length, some number of samples. After processing, there will be as many frequency divisions as there were samples. The whole spectrum is divided up across these divisions so that each one doesn’t so much represent a specific frequency, but rather a ‘bin’ into which a certain region’s spectral energy gets put.

So, if I put eight samples in, I get my spectrum divided up into eight sections. If I put 1 million samples in, I get a much finer frequency division. In order for the FFT to do its fast stuff, this number of samples needs to be a power of 2 (2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, etc.) – we can always add zeros to our signal to ‘pad’ up to a suitable length if needed.
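To make the padding idea concrete, here is a small sketch using NumPy. The helper name `pad_to_power_of_two` is my own invention for this example, not part of any library.

```python
import numpy as np

def pad_to_power_of_two(x):
    """Append zeros so that len(x) becomes the next power of two (hypothetical helper)."""
    n = len(x)
    target = 1 << (n - 1).bit_length()    # e.g. 1000 -> 1024, 1024 stays 1024
    return np.concatenate([x, np.zeros(target - n)])

signal = np.random.default_rng(0).standard_normal(1000)   # 1000 arbitrary samples
padded = pad_to_power_of_two(signal)
spectrum = np.fft.fft(padded)

# As many frequency divisions come out as samples went in.
print(len(signal), len(padded), len(spectrum))
```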

Short time → Coarse Frequency, Long time → Fine frequency: Time-frequency uncertainty.

A DFT is perfectly invertible. If we take a DFT of some samples and then take an inverse DFT, we end up with exactly the same sequence of samples so long as we haven’t done anything to the data. So, even though it’s not immediately apparent, a single DFT still contains timing information, but it is tangled up in the phase.
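Both claims are easy to check numerically. The sketch below (variable names are mine) round-trips some random samples through NumPy’s FFT, then shifts the signal in time to show that the magnitudes stay put while the phases change.

```python
import numpy as np

x = np.random.default_rng(1).standard_normal(256)

# Forward then inverse transform: we get the samples back (to rounding error).
X = np.fft.fft(x)
round_trip_error = np.max(np.abs(x - np.fft.ifft(X).real))

# Rotate the signal by 10 samples: same magnitudes, different phases.
X_shifted = np.fft.fft(np.roll(x, 10))
same_magnitudes = np.allclose(np.abs(X), np.abs(X_shifted))

print(round_trip_error, same_magnitudes)
```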

A further consequence of DFTs in practice is due to their finite length. The ideal, non-discrete FT of a stationary, infinite sine or cosine would give us an exact point on the frequency axis. However, in addition to no longer dealing with a continuous frequency scale but instead having bins, truncating the signal so that it is no longer infinite smears this single exact point across neighbouring bins. This smearing is minimal when the frequency of the input matches the central frequency of a bin exactly, and at its worst when the frequency is near the boundary between bins.

This gives us a clue as to why the DFT loses its perfect invertibility when we do processing in the frequency domain, as the energy of individual components is now spread (in a slightly complicated way) across a number of neighbouring bins.

Leakage in the Discrete Fourier Transform: Due to the finite time of actual signals, energy is smeared across neighbouring bins
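A quick way to see leakage for yourself is to compare a sine that sits exactly on a bin centre with one that falls halfway between two bins. This is only a sketch; the frame length and helper names are arbitrary choices of mine.

```python
import numpy as np

N = 64
n = np.arange(N)

def magnitude_spectrum(cycles):
    """Magnitudes of a sine completing `cycles` periods in N samples."""
    return np.abs(np.fft.rfft(np.sin(2 * np.pi * cycles * n / N)))

def energy_outside_peak(mags):
    """Fraction of the spectral energy that is not in the strongest bin."""
    energy = mags ** 2
    return 1.0 - energy.max() / energy.sum()

on_bin = magnitude_spectrum(8.0)     # exactly on the centre of bin 8
off_bin = magnitude_spectrum(8.5)    # on the boundary between bins 8 and 9

print(energy_outside_peak(on_bin))   # essentially nothing leaks
print(energy_outside_peak(off_bin))  # most of the energy is outside the peak bin
```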

This smearing can be reduced by applying a windowing function, which tapers the signal towards zero at its ends.

Windows: The effect of these is to suppress the ‘sidelobes’ of energy appearing in spurious bins; different window shapes have different strengths, which I’m not going to go into. The trade-offs of choosing a window function are (literally) the same as trade-offs we make when designing digital filters (because windowing is actually low pass filtering). We can choose, for instance, between greater suppression of side lobes versus less spurious amplitude ripple in the pass-band. A ‘hann’ window is often a good place to start, but it is worth proceeding by ear to see how different windows interact with your particular material.
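The sidelobe suppression can be sketched numerically by comparing an unwindowed frame with a Hann-windowed one. The test frequency and the bin I inspect are arbitrary choices for illustration; `np.hanning` is NumPy’s Hann window.

```python
import numpy as np

N = 256
n = np.arange(N)
x = np.sin(2 * np.pi * 20.5 * n / N)           # worst case: halfway between bins

rect = np.abs(np.fft.rfft(x))                  # no window (i.e. a rectangular one)
hann = np.abs(np.fft.rfft(x * np.hanning(N)))  # tapered with a Hann window

def level_db(mags, bin_index):
    """Level of one bin relative to the spectrum's own peak, in dB."""
    return 20 * np.log10(mags[bin_index] / mags.max())

far_bin = 60   # a bin well away from the sine, where only leakage lives
print(level_db(rect, far_bin), level_db(hann, far_bin))
```

The price paid is a wider main peak: the Hann-windowed sine occupies a few more bins around its true frequency than the unwindowed one does.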

The astute reader will notice something about the above plots – there are two frequency lines for our (real valued) sine wave and the frequency axis goes below zero: negative frequency! What’s up with that?

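The answer is that the DFT of a real-valued signal is symmetric: each negative-frequency bin is the complex conjugate of its positive-frequency counterpart, so it carries no new information. The upshot of this is that only half of the data from a DFT (of a real valued signal) is of any interest so we can simply discard the rest, under most circumstances.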

This means that, in fact, there are only about half as many useful frequency bins as input samples (N/2 + 1 for N samples, counting the bins at 0 Hz and at half the sample rate), but we can save half the calculations as well.
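Here is a small check of that redundancy, using NumPy’s `fft` (all bins) and `rfft` (only the non-negative frequencies). The variable names are mine.

```python
import numpy as np

N = 16
x = np.random.default_rng(2).standard_normal(N)   # any real-valued signal

full = np.fft.fft(x)    # N complex bins, 'negative' frequencies included
half = np.fft.rfft(x)   # only the bins from 0 Hz up to half the sample rate

# Bin N-k is the complex conjugate of bin k, so the upper half is redundant.
mirror_ok = np.allclose(full[1:], np.conj(full[:0:-1]))

print(len(full), len(half), mirror_ok)
```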

Getting Our Groove Back: The Short-Time Fourier Transform

Because timing information is embedded but not directly accessible in the DFT, any processing we do is liable to cause an amount of temporal smearing. If we took a great big FFT of a whole sound, we’d find that a lot of what we did would tend to produce drones, as the temporal relationships are affected (I’ll demonstrate this later). Also, taking enormous DFTs isn’t useful for real-time purposes, as the latency (and CPU load) would be significant.

What is normally done then is to take a series of short FFTs as frames over time, so that amplitude and phase in each frame can be tracked. To minimize artefacts, these frames normally overlap and are windowed (as above). This is called the Short-Time Fourier Transform (STFT), and gets us our Phase Vocoder. The gap between the starts of successive analysis windows is called the hop size, and is usually some fraction of the window length (half, quarter etc.).

Overlapping windows of signal for STFT
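A bare-bones STFT can be sketched in a few lines of NumPy. The function name, the default window and hop sizes, and the test tone are all my own assumptions for the sake of the example; real implementations also deal with padding at the edges of the file.

```python
import numpy as np

def stft(x, window_size=1024, hop_size=256):
    """Return complex frames with shape (n_frames, window_size // 2 + 1).

    Minimal sketch: assumes len(x) >= window_size and ignores any leftover tail.
    """
    window = np.hanning(window_size)
    n_frames = 1 + (len(x) - window_size) // hop_size
    frames = np.stack([
        x[i * hop_size : i * hop_size + window_size] * window
        for i in range(n_frames)
    ])
    return np.fft.rfft(frames, axis=1)

sample_rate = 44100
t = np.arange(sample_rate) / sample_rate
x = np.sin(2 * np.pi * 440.0 * t)      # one second of a 440 Hz sine

spectrogram = stft(x)
magnitudes = np.abs(spectrogram)       # amplitude per bin, per frame
phases = np.angle(spectrogram)         # phase per bin, per frame
print(spectrogram.shape)
```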

The amount of overlap between windows affects the temporal resolution of our Phase Vocoder. You can think of it as a filter bank where the output of each channel is downsampled (i.e. at a lower sample rate than the input) by a factor related to the hop size. If we took a STFT every sample (hop size = 1) then each channel would be at full sample rate. This is computationally impractical on normal CPUs, however.

The upshot of this is that when we manipulate spectral data with the STFT, we are compelled to assume that the phases and amplitudes remain constant within each frame (actually, there are some very complex and experimental ways of trying to get around this…).

The estimation of how frequency is changing in time is done by looking at the change of phase in a given bin between successive slices in time (remembering that we are considering instantaneous frequency to be the speed at which the phase of a signal is changing). As such, it should be clear that having slices closer together enables us to track faster frequency modulations more accurately.

In fact, we still have a practical problem with this frequency estimate, which calls for something called phase unwrapping. Because we are estimating the phase using an inverse tangent function, the numbers we get will always lie within a single 360° (2π radian) range, typically −180° to 180°. However, many different phase values map onto the same result (in the same way that sin(720°) is the same as sin(360°)). What we get is known as the principal value, which may or may not be the one we want, and gives the phase signal the appearance of a sawtooth wave. The unwrapping process is the attempt to convert this to a continuous function, otherwise we get spurious spikes in our frequency estimates. There is still no completely robust algorithm for this.
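One common way of sketching the frequency estimate is shown below: subtract the phase advance we would expect for a tone at the bin centre, wrap what is left back into the principal range, and convert that deviation into Hz. The function names are hypothetical, and this is only valid while the true frequency stays within a couple of bins of the bin centre (for a hop of a quarter window).

```python
import numpy as np

def wrap(phase):
    """Wrap phase values into the principal range [-pi, pi)."""
    return (phase + np.pi) % (2 * np.pi) - np.pi

def estimate_frequencies(phase_a, phase_b, hop, fft_size, sample_rate):
    """Per-bin frequency (Hz) from the phase change between two frames."""
    k = np.arange(len(phase_a))
    expected = 2 * np.pi * k * hop / fft_size        # advance of a bin-centre tone
    deviation = wrap(phase_b - phase_a - expected)   # what is left over
    return (k / fft_size + deviation / (2 * np.pi * hop)) * sample_rate

sample_rate, fft_size, hop = 44100, 1024, 256
t = np.arange(fft_size + hop) / sample_rate
x = np.sin(2 * np.pi * 440.0 * t)
window = np.hanning(fft_size)

phase_a = np.angle(np.fft.rfft(window * x[:fft_size]))
phase_b = np.angle(np.fft.rfft(window * x[hop : hop + fft_size]))
freqs = estimate_frequencies(phase_a, phase_b, hop, fft_size, sample_rate)

# Bin 10 is the nearest bin centre (about 430.7 Hz), yet the estimate
# should come out very close to the true 440 Hz.
print(freqs[10])
```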

Models of Signals: Doing Things with the STFT

As I’ve said, if we don’t touch the data, the DFT (and STFT) will invert perfectly back to our original data. However, by and large we do want to do things, and at this point we have to start making some assumptions about what the data means: we are modelling our signal in some way when we do this.

As should be evident from the discussion of leakage above, we can’t readily interpret the data as containing a well-defined harmonic in each bin.

This is due to the discretisation and the truncation in time: the former means that the bins are linearly spaced in frequency according to a fundamental frequency dictated by the FFT size (a power of 2, remember), probably unrelated to the fundamental of our signal. The latter means that, unless partials of the signal happen to fall bang in the middle of a bin, there will be a certain amount of smearing into adjacent bins.

We also have a frequency resolution problem, at least with respect to human hearing. The FFT is linear in frequency. Each bin will be spaced an equal number of Hz from each other. For example, at 44100 Hz for a 1024 sample analysis window, the bin spacing will be 44100/1024 = 43.1 Hz; so bin 0 sits at 0 Hz (DC), bin 1 is centred at 43.1 Hz, bin 2 at 86.1 Hz, etc.

This means that there is generally insufficient resolution at low frequencies (an octave, by definition, between bins 1 & 2) and ‘too much’ resolution between higher bins (0.03 semitones at the other extreme, way below our ability to discriminate pitch). For rich signals, with lots going on in the lows, bins are likely to contain contributions from a number of sources.
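The figures above can be reproduced with a little arithmetic. The function names here are just for illustration.

```python
import math

def bin_spacing_hz(sample_rate, fft_size):
    """Distance in Hz between neighbouring bin centres."""
    return sample_rate / fft_size

def interval_semitones(bin_a, bin_b):
    """Musical interval between two bin centres (bins are linear in Hz)."""
    return 12 * math.log2(bin_b / bin_a)

spacing = bin_spacing_hz(44100, 1024)     # about 43.07 Hz
low = interval_semitones(1, 2)            # a full octave: 12 semitones
high = interval_semitones(511, 512)       # about 0.034 semitones

print(spacing, low, high)
```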

This starts to indicate something important: the further our signal is from being composed of well-spaced, stationary partials, the less valid the assumption becomes that bins might contain well-defined frequency information.

At the limit case, the spectrum of white noise turns out to be white noise. Nevertheless, the approximation turns out to be good enough that we can often treat prominent peaks in the spectrum as representing partials (for certain kinds of signals).

Magnitude Spectra of Speech and White Noise

Looking at the speech above, we can see well defined peaks in the spectrum, compared to the noise.

The Phase Vocoder model also has assumptions (or forces us to make assumptions) about how fast the signal is changing. Even if we have a harmonic signal of well-defined partials, if their frequency or amplitude modulates significantly within the time span of the analysis window, this will be obscured. This is particularly problematic for transients in the signal, which are poorly described by stationary sinusoids and tend to exhibit a great deal of both frequency and amplitude modulation in a short space of time.

Standard Phase Vocoder approaches don’t pay much attention to maintaining phase relationships between bins, because for stationary sounds it isn’t terribly important from an auditory point of view. However, the phase relationships are important to our perception of transients, and the consequences of this not being attended to in a spectral processor is that they can be softened, smeared or obliterated entirely. More sophisticated implementations go to some lengths to detect transients, and handle them differently.

It is also possible to develop more specialised models. For instance, by keeping track of peaks in the spectrum between frames, one can try to work out which new peaks are continuations of previous peaks, and develop a partial tracker. This is what’s called a sinusoidal model, which gives us a (smaller) number of sinusoidal tracks that can be resynthesised directly or analogously to additive synthesis. Needless to say, this works best for sounds with stable, well-defined partials.

More sophisticated still, one can then try and deal sensibly with what’s left – the stuff that was deemed not-sinusoidal. One could assume that all of this is ‘noisy’ material, at which point we have a sines + noise or sines + residue model, or one could try and separate further still into sines + noise + transients so that each type of material can be handled separately and appropriately.

Phase Vocoder Summary

  • The Phase Vocoder is a system for splitting a signal into frequency channels and allowing us to track the amplitude and phase in those channels.
  • It is most commonly implemented using the FFT.
  • The Fourier Transform gives us a view of a signal in terms of cosines and sines; it is simply another way of looking at the signal.
  • In digital systems we have to use the DFT, which has some consequences, such as smearing and frequency quantisation.
  • There is a trade-off with the DFT between time resolution and frequency resolution.
  • To track time varying signals with a Phase Vocoder, we use the STFT.
  • Parameters for the STFT are the analysis window size, the hop size and the FFT size.
  • We can get our amplitude and phase numbers from the complex output of the STFT using basic maths to convert between cartesian and polar coordinates.
  • By and large, the FFT based Phase Vocoder works best on signals with well defined partial structure that don’t change too rapidly.
  • Conversely, it will work quite strangely (or poorly) for processing sounds that are noisy or very transient heavy (drums, for instance) or that contain many different sources.
  • It is possible to build more sophisticated models on top of the STFT based Phase Vocoder.

Looking at Our Sounds

Sometimes it can be useful to see what’s going on in a sound. This comes with some caveats, however!

Whilst we can use spectrograms to locate particular things by eye, it does not follow that mixing-by-spectrum-analyser is in any sense a good idea. As is hopefully clear from above, putting complex sources through a spectrum analyser is liable to create all kinds of exciting interference between bins, particularly in the lower frequencies, and to report a deluge of redundant information in the high frequencies. Using them to try and guide eq decisions on complex material is simply unreliable, and no substitute for doing this by ear.

Aiming for a particular ‘shape’ of spectrum is even worse, and just makes no sense. Don’t.

However, analysers can occasionally be useful for pinpointing a specific peak (or notch) if you are having trouble finding it; you just have to be sure that it’s real (i.e. that it relates to something you hear, rather than trying to hear something you see).

Spectrograms, rather than real-time analysers, are usually rather more useful as a visual aid. Again, this is mostly to confirm things you think you already hear, rather than to use as a definitive template for working. For instance, sometimes a file may have a consistent very high frequency tone embedded in it, often by contamination by a computer screen or similar:

High Frequency Interference in a Recording

The snippet above is of a vocal part in a song I was handed to mix. The high frequency tone wasn’t especially noticeable until listening at a good volume, but started to become much more noticeable post-compression and EQ. By using a spectrogram with a long FFT size, which will emphasise stationary components, these very static bands become quite evident.

Sometimes a spectrogram can be a useful aid in helping orientate around a piece of audio.

A Spectrogram of an Electronic Instrumental

In the example above, from an instrumental electronic track, we can clearly see distinct sections (more clearly than with the normal waveform) and also identify areas of similarity. This could be a starting point for analysis, or editing. (Some) electronica, being composed of particularly steady waveforms, is well-suited to being looked at this way. What about something more organic?

Here’s a section of a field recording made in a station:

Field Recording Spectrogram, Linear Frequency

Not especially enlightening. However, we can improve matters somewhat if we change the view of the frequency scale from linear (good for spotting harmonics and high frequency activity) to logarithmic, as most of the action in this recording is at lower frequencies:

Field Recording, Log Frequency

Here we can see a great deal more going on; for example all kinds of exciting stuff in the low-mids, much of which turns out to be snatches of speech.  Again, this could form a basis for making an initial exploration of a longish recording in order to pull out bits of interest. Notice how the background manifests as a sort of blurry bed across everything.

In the next example, I’m trying to pinpoint the position of crackles on a recording taken from vinyl, which are very often easier to locate on a spectrogram than with the time domain waveform. Here it is:

By setting the analysis window to be very short, we can emphasise transients, which manifest as narrow vertical stripes, just as steady tones appear as narrow horizontal stripes:

Looking for crackle and clicks with a short window

So, we can use spectrograms for a variety of tasks involving looking at our sounds, tailoring the analysis settings towards the kinds of things we’re looking for.


Time Stretching and Pitch Shifting

One of the most common uses of the Phase Vocoder is for performing time stretches (without changing pitch) and pitch shifts (without altering timing).

Time stretching tends to be an offline process, as it needs the data already in place to read back slower (stretching) or faster (compressing) than the original. Conceptually, time stretching with the Phase Vocoder is pretty simple: we just stick the frames together with more temporal space between them when the sound is resynthesised (by changing the hop size and calculating new appropriate phase values), possibly ‘inventing’ new frames by interpolation as needed. Alternatively, if we’re using a sinusoidal model, we interpolate the amplitude and frequency (and possibly phase) functions for each partial.
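For the curious, here is a bare-bones sketch of the standard Phase Vocoder version in NumPy. It is one possible arrangement rather than a definitive implementation: it reads through the analysis frames in fractional steps, interpolates magnitudes between neighbouring frames, and accumulates phase from the measured per-hop phase increments. The function name and defaults are my own, and there is no transient handling of any kind.

```python
import numpy as np

def time_stretch(x, factor, fft_size=1024, hop=256):
    """Stretch x by `factor` (2.0 = twice as long) without changing pitch."""
    window = np.hanning(fft_size)
    n_frames = 1 + (len(x) - fft_size) // hop
    frames = np.array([np.fft.rfft(window * x[i * hop : i * hop + fft_size])
                       for i in range(n_frames)])
    mags, phases = np.abs(frames), np.angle(frames)

    # Phase advance per hop expected for a tone at each bin's centre frequency.
    expected = 2 * np.pi * hop * np.arange(fft_size // 2 + 1) / fft_size

    # Read through the analysis frames in fractional steps of 1 / factor.
    positions = np.arange(0, n_frames - 1, 1.0 / factor)
    out = np.zeros(len(positions) * hop + fft_size)
    phase = phases[0].copy()
    for j, pos in enumerate(positions):
        i = min(int(pos), n_frames - 2)        # guard against rounding at the end
        frac = pos - i
        mag = (1 - frac) * mags[i] + frac * mags[i + 1]       # 'invented' frame
        frame = np.fft.irfft(mag * np.exp(1j * phase), fft_size)
        out[j * hop : j * hop + fft_size] += window * frame   # overlap-add
        # Advance the running phase by the measured (wrapped) increment.
        delta = phases[i + 1] - phases[i] - expected
        delta = (delta + np.pi) % (2 * np.pi) - np.pi
        phase += expected + delta
    return out / (np.sum(window ** 2) / hop)   # undo the gain of overlapped windows

sample_rate = 44100
x = np.sin(2 * np.pi * 440.0 * np.arange(sample_rate) / sample_rate)
y = time_stretch(x, 2.0)
print(len(x), len(y))    # roughly twice as long, still at 440 Hz
```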

However, there are significant complications to making this effective on arbitrary material, for reasons that I’ve explained above:

  1. The extent to which the interpolation will sound ‘natural’ depends very much on the temporal resolution of the analysis. Recall that within a frame we only have an average across time to deal with; at some point, the material will begin to sound metallic.
  2. Transients require special handling. The most sophisticated approaches try not to stretch transient regions at all, in order to maintain the grain of the original file. If this isn’t done, then the transients become softened.
  3. Noisy material stretched this way can sound artificial, as correlations related to the window size begin to creep in, introducing a strange (and distracting) phasiness.

Some examples. Here’s the speech from above stretched to four times its original length. First with a standard Phase Vocoder:

Now with a sinusoidal model:

So, it’s pretty clear that both these methods produce audible (but possibly liveable-with) artefacts, which become clearer at larger stretch factors. If we’re using small factors just to tidy up the timing of snippets, it is pretty easy to be more transparent. When dealing with whole passages like this where you want to minimize artefacts, it can sometimes be worth segmenting by hand to start with and only stretching the non-transient parts.

Of course, we can also use time stretching more creatively by using large and/or time-varying stretch factors to make more abstract sounds. Here’s the same speech example with a variable stretch that looks like this:

Variable time stretch curve used in the following audio example

Which turns the ~10s original into a 3 minute drone piece:


Pitch Shifting

Shifting the pitch can be approached in a number of ways. If we’re happy enough to do it offline, one can ‘simply’ time stretch the audio and then resample by the appropriate factor to get shifted pitch with the same original length. So, if I want to shift up by an octave, I can time stretch by two, then resample so that the file is half as long (raising it by an octave in the process).
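The resampling half of that recipe can be sketched as follows. The helper names are mine, and the linear-interpolation resampler is deliberately crude (a proper one would also filter to avoid aliasing). To keep the example short, the ‘already time-stretched’ audio is simply stood in for by two seconds of a 220 Hz sine.

```python
import numpy as np

def semitones_to_ratio(semitones):
    """Frequency ratio for a pitch shift, e.g. +12 semitones -> 2.0."""
    return 2.0 ** (semitones / 12.0)

def resample_linear(x, ratio):
    """Read through x `ratio` times faster, using linear interpolation."""
    n_out = int(round(len(x) / ratio))
    read_positions = np.arange(n_out) * ratio
    return np.interp(read_positions, np.arange(len(x)), x)

sample_rate = 8000
t = np.arange(2 * sample_rate) / sample_rate
stretched = np.sin(2 * np.pi * 220.0 * t)   # stand-in for a 1 s sound stretched to 2 s

ratio = semitones_to_ratio(12)              # up one octave
shifted = resample_linear(stretched, ratio)
print(len(stretched), len(shifted))         # back to the original 1 s length, now at 440 Hz
```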

Alternatively, pitch shifting can be attempted by working out the frequencies in each bin and multiplying them by some factor. This is more like a sinusoidal modelling approach insofar as it assumes that there is sensible frequency information to be found.

One thing we have to watch out for, particularly when pitch shifting the voice, is how the formants are handled in any particular case. Formants are spectral peaks in a sound that arise not due to the particular harmonic spectrum of that sound (i.e. the relative strength of partials), but due to the various resonances that go into producing the sound beyond its ‘basic waveform’.

In the case of the voice, there is a waveform produced by our vocal folds that has a particular harmonic structure, and is what determines the pitch of our vocalisations. The formants arise due to the various cavities that form our vocal systems (mouth, nose, throat etc.), the shaping of which is involved in producing particular vowel sounds.

If we pitch up the voice with its formants, we end up with a ‘chipmunk’ effect:

If we attempt to preserve the formants, it sounds better (assuming we weren’t going for chipmunks), although still in need of some attention if it’s to pass as a normal human voice:

Cross Synthesis

Cross synthesis means making a new sound by combining characteristics of different source sounds. The most basic sense of this is straightforward vocoding, where the amplitudes of one sound are imposed on the phases of another.
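For a single frame, that basic recipe might be sketched like this. The function name is hypothetical, and the two random frames merely stand in for frames of real audio; in practice you would do this for every frame of an STFT and overlap-add the results.

```python
import numpy as np

def cross_synthesise(frame_a, frame_b):
    """Combine the magnitudes of frame_a with the phases of frame_b."""
    window = np.hanning(len(frame_a))
    A = np.fft.rfft(frame_a * window)
    B = np.fft.rfft(frame_b * window)
    hybrid = np.abs(A) * np.exp(1j * np.angle(B))   # polar back to complex
    return np.fft.irfft(hybrid, len(frame_a))

rng = np.random.default_rng(3)
frame_a = rng.standard_normal(512)   # stand-in for a frame of speech
frame_b = rng.standard_normal(512)   # stand-in for a frame of the instrumental
hybrid_frame = cross_synthesise(frame_a, frame_b)
print(hybrid_frame.shape)
```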

So if we take our trusty speech sample:

And something instrumental:

We can get a familiar, but more dynamic, vocoder type effect by using the speech amplitudes with the phases from the instrumental sample:

We get a much stranger result if we use the amplitudes of the instrumental material with the phases from the speech:

If we mix them together, we get something slightly different to a normal time domain mixture, because of the way the phases are combined:

One can try to morph between different sounds by altering mixtures with a Phase Vocoder. However, simple cross-fades rarely give satisfactory results; often it sounds like one input, then suddenly shifts to the other without much of a satisfying journey in-between. In practice, to get worthwhile morphs between sounds this way it takes a lot of trial and error and adjustment, as well as careful consideration of what sounds you’re trying to go between. Different, more sophisticated approaches to this are an open research question.

Whilst we’re here, it is worth noting what happens if we play back the amplitude tracks of either sound, without any of the phase information. Recall that the phase is telling us about the deviation of that channel from the central frequency of its bin. Consequently, the effect of zero-ing all the phases is to quantise the sound to the bin frequencies of the particular size FFT we’re using.

With the voice, this gives a handy way of getting a ‘robot’ kind of an effect:

With more complex (and harmonic) mixtures, like the instrumental sample, we can get strange (and usually eerie) retunings:

Other Effects

There are all kinds of other things one can do with the Phase Vocoder bin data to get a variety of more-or-less strange effects.

Just for instance…

We can ‘blur’ our sound by averaging between successive frames:

We can ‘freeze’ portions of the sound by looping a frame:

We can thin the sound out by selectively setting the amplitude of bins to zero (tracing). Here I’m retaining only the bins with most energy, starting with around half of them, and throwing more and more away:

And, of course, we can combine these effects to our hearts’ content; here’s tracing and blurring:
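As a sketch of how two of these effects might work on an array of STFT frames (one row per frame, one column per bin): the function names, the choice to blur magnitudes only, and the random stand-in data are all my own assumptions.

```python
import numpy as np

def blur(frames, n_average):
    """Average each bin's magnitude over the current and previous frames."""
    mags = np.abs(frames)
    blurred = np.empty_like(mags)
    for t in range(len(mags)):
        start = max(0, t - n_average + 1)
        blurred[t] = mags[start : t + 1].mean(axis=0)
    return blurred * np.exp(1j * np.angle(frames))   # original phases kept

def trace(frames, n_keep):
    """In every frame, zero all but the n_keep strongest bins."""
    traced = np.zeros_like(frames)
    for t, frame in enumerate(frames):
        strongest = np.argsort(np.abs(frame))[len(frame) - n_keep:]
        traced[t, strongest] = frame[strongest]
    return traced

rng = np.random.default_rng(4)
frames = rng.standard_normal((6, 9)) + 1j * rng.standard_normal((6, 9))

thinned = trace(frames, 3)                 # keep 3 bins per frame
smeared = blur(frames, 3)                  # average over 3 frames
both = blur(trace(frames, 3), 3)           # tracing, then blurring
print(np.count_nonzero(thinned, axis=1))
```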

Noise Reduction

We can also use the Phase Vocoder as a basis for reducing noise in a file. The simplest scheme is similar to the tracing effect above; bins whose energy falls below a threshold are discarded.

This tends to give rise to very noticeable chirping and warbling artefacts, similar to those we heard in the tracing example and often more distracting than the noise itself. These arise from bins being turned off and on again, often in the mids, where we may have insufficient resolution; you hear the same artefacts with low-rate lossy encoded audio, such as 64 kbps MP3, for the same reason.

The thresholds for each bin are normally set by sampling a ‘silent’ part of the data that only has noise in it (which is why making sure you’ve always got some room tone with everyone being quiet at the start of your takes is a Good Idea).

The assumption then is that the noise stays statistically stationary over the course of the file (which is why, if you’re going to de-noise location footage, doing so before you’ve comped together takes from different mics, and even locations, is Also a Good Idea).
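The simplest scheme described above might be sketched like this. The function names and the `margin` parameter are hypothetical, and the tiny hand-made arrays stand in for real STFT frames.

```python
import numpy as np

def noise_thresholds(noise_frames, margin=2.0):
    """Per-bin threshold: average magnitude of noise-only frames, times a margin.

    `margin` is an assumed tuning parameter, not something from a real tool.
    """
    return margin * np.abs(noise_frames).mean(axis=0)

def spectral_gate(frames, thresholds):
    """Discard (zero) every bin whose magnitude falls below its threshold."""
    keep = np.abs(frames) >= thresholds
    return frames * keep

noise_frames = np.full((4, 5), 0.1 + 0j)    # pretend 'room tone' frames
frames = np.array([[1.0, 0.05, 0.3, 0.1, 0.19]], dtype=complex)

thresholds = noise_thresholds(noise_frames)     # 0.2 in every bin
gated = spectral_gate(frames, thresholds)
print(np.abs(gated))
```

Because each bin is switched fully on or off from one frame to the next, this hard gate is exactly the kind of thing that produces chirps and warbles.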

If we take some nice noisy audio, such as a snippet of the field recording we looked at above:

We might decide, for some reason, to try and get rid of the noise. Here’s what happens with a naïve approach:

Which is pretty bad. I think I preferred the noise. Matters can be improved somewhat by smoothing the function that turns bins on and off. A simple way of doing this would be something like the smoothing you see in compressors and expanders with attack and release settings, but there are more complex and effective schemes. Here’s one with smoothing, from iZotope RX:

Better, still slightly warbly. Here’s an example with iZotope’s fanciest algorithm, which runs considerably slower than real-time, and involves making the decision on whether to suppress a bin on the basis of an examination of that bin’s neighbours in time and frequency (described in this paper, for the interested).


Diminishing returns, perhaps, but the above example does a better job of preserving transients and dulls the signal less, to my ears.

Some practical suggestions about NR and its use:

  • If you can avoid it, do – even if it means re-recording; is the noise really so bad?
  • Make sure you’ve recorded some tone, in case you have to.
  • Work on the earliest generation file you can, not post-processing if at all possible.
  • Don’t try to use the same noise profile to fix different recordings. It won’t work.
  • Sometimes it works better to apply NR in small successive stages, rather than all in one go.
  • Use the best algorithm you can, and be prepared for it to take a long time.
  • Take off less than you think; the aim is to reduce the noise, not eliminate it.


Convolution

We’re familiar with convolution as a buzz-word, but what does it mean?

Convolution in fact describes a numerical operation that is fundamental to digital signal processing. In the time domain, it consists of combining two signals by multiplying together every sample in each signal with every other and overlaying (mixing) the results. This is very computationally expensive in the time domain, and therefore almost always done using the FFT in practice.
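Here is a small sketch comparing the two routes. The function names are mine; `np.convolve` is NumPy’s own time-domain convolution, used here only as a cross-check.

```python
import numpy as np

def convolve_direct(x, h):
    """Time-domain convolution: every sample of x scales a copy of h,
    and the scaled copies are overlaid (mixed) at that sample's position."""
    y = np.zeros(len(x) + len(h) - 1)
    for i, sample in enumerate(x):
        y[i : i + len(h)] += sample * h
    return y

def convolve_fft(x, h):
    """The same result, obtained by multiplying the two spectra."""
    n = len(x) + len(h) - 1               # full length of the result
    return np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(h, n), n)

rng = np.random.default_rng(5)
x = rng.standard_normal(200)
h = rng.standard_normal(50)

direct = convolve_direct(x, h)
via_fft = convolve_fft(x, h)
print(np.max(np.abs(direct - via_fft)))   # differences at rounding-error level
```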

We often talk of gathering impulse responses in the context of convolution reverb. The impulse response describes, literally, the response of a system to a single impulse (a solitary, unit-valued sample).

Here are some details from Varun on creative uses of convolution for gathering impulse responses and creating reverbs. Commonly, impulse responses for convolution reverbs are gathered by stimulating our system with a chirp (a sinusoidal sweep) and recording the response back: the amplitude will reflect the frequency response of the system.

Getting to an impulse response from this recording involves the inverse operation of convolution: deconvolution. In order for this operation to be possible, we need to know exactly what signal we excited the system with (a chirp), which allows us to separate the room frequency response from the dry signal.

Turning the frequency response into an impulse response rests on being aware that convolution in the time domain fortuitously turns out to be equal to straight multiplication in the Fourier domain: much cheaper. The impulse response is the time domain dual of the frequency response, so having separated our sweep from the room response, we get an impulse response just by doing an inverse FFT.
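A minimal sketch of the idea, assuming a noiseless recording: divide the spectrum of the recording by the spectrum of the sweep, then inverse FFT. The function name, the small `eps` term (there to avoid dividing by nearly nothing) and the made-up decaying ‘room’ are all my own assumptions; real measurement tools are considerably more careful.

```python
import numpy as np

def deconvolve(recording, excitation, eps=1e-12):
    """Estimate an impulse response by (regularised) spectral division."""
    n = len(recording)
    R = np.fft.rfft(recording, n)
    E = np.fft.rfft(excitation, n)               # zero-padded to the same length
    H = R * np.conj(E) / (np.abs(E) ** 2 + eps)  # R / E, but safe where E is tiny
    return np.fft.irfft(H, n)

n_sweep = 512
n = np.arange(n_sweep)
sweep = np.sin(np.pi * n ** 2 / (2 * n_sweep))   # sweeps from 0 Hz up to Nyquist
true_ir = 0.8 ** np.arange(32)                   # stand-in for a 'room'
recording = np.convolve(sweep, true_ir)          # what the microphone would hear

estimate = deconvolve(recording, sweep)
print(np.max(np.abs(estimate[:32] - true_ir)))   # should be very small
```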

However, when processing a convolution reverb, this will be done back in the Fourier domain. The standard scheme is to break up the (presumably long) impulse response into sections: the first short section is performed in the time domain for minimal latency (the computational advantage of the FFT only kicks in for impulses > ~100 samples), and then a sequence of FFTs handles the remaining chunks of the impulse.

Aside from reverb, convolution also happens to be the basis on which all linear digital filtering takes place, which implies that we can also use the Phase Vocoder as a great big EQ, which indeed we can.

As you might suspect, this is often most useful at higher frequencies where we have all that extra resolution, such as for dealing with the kinds of HF interference I showed earlier.

FFTs are also used to build linear phase EQ effects, which (because they involve convolution with moderately long impulse responses) would take too long to perform in the time domain for real-time use.

A nice effect can be achieved by convolving a file with itself (auto-convolution), as it strengthens all the prominent frequencies at the expense of weaker ones. If we use an FFT of the whole file, rather than an STFT, we also get a ‘free’ time stretch thrown in.

It also obliterates all the timing, as I promised above that DFT effects will do. The bonus time stretch is important though, because it alerts us to an important thing to bear in mind: the length of any convolved sequences is the combined length of the input and the impulse response (because the final sample of the input is multiplied by every sample in the IR, forming its ‘tail’) so for an auto-convolution, this means double length.

Bearing this in mind, we can do a more coherent auto-convolution with the STFT. In order for it to work, I have to make sure that the FFT length is at least double the analysis window length in order to provide room for the tail on re-synthesis.

If I don’t do this, I get what is called a ‘circular convolution’, meaning that the tail folds back onto the beginning of the window, causing what is called time aliasing.
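The difference is easy to see numerically. In this small Python/NumPy sketch (the helper name fft_multiply and the 8-sample toy frame are assumptions), the same frame is auto-convolved once with an FFT twice the window length and once with an FFT only as long as the window.

```python
import numpy as np

def fft_multiply(a, b, nfft):
    """Multiply two spectra at a given FFT size and return to the time domain."""
    return np.fft.irfft(np.fft.rfft(a, nfft) * np.fft.rfft(b, nfft), nfft)

win = 8
frame = np.arange(1.0, win + 1)               # toy analysis frame: 1, 2, ..., 8

linear = np.convolve(frame, frame)            # reference result, 2 * win - 1 samples
padded = fft_multiply(frame, frame, 2 * win)  # FFT long enough to hold the tail
wrapped = fft_multiply(frame, frame, win)     # FFT too short: circular convolution

# The too-short result is the linear one with its tail folded back
# onto the start of the window (time aliasing).
folded = linear[:win].copy()
folded[:win - 1] += linear[win:]
```

Here padded matches linear (with one trailing zero), whereas wrapped matches folded rather than the first 8 samples of linear.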

Feature Extraction

Finally, it is possible to use information from the Phase Vocoder to try and infer things about a piece of audio. Such techniques are popular for music information retrieval, sonification, and building more-or-less autonomous instruments / installations / effects.

I’m not going to dwell too long on this, as it’s slightly marginal to the business of working directly with sound, but it is very useful should you come to build your own devices that need to interpret sound, whether in other courses or further on in your careers as sound-workers.

Things we can try and extract include:

  • Fundamental Frequency (F0): This is generally taken to be related to what we perceive as the pitch (of those sounds that have a perceivable sense of pitch). It is not necessarily the lowest identifiable partial in a spectrum, given that our pitch perception depends on the spacing of partials: we can ‘fill in’ a missing fundamental based on the pattern of harmonics.
  • Spectral Flatness is used as a rough measure of how ‘noisy’ a spectrum is – recall that the spectrum of noise has fewer sharp peaks than a spectrum with clear partials.
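  • Spectral Centroid is the ‘centre of gravity’ frequency of the spectrum, that is, the average frequency weighted by the magnitude (or energy) in each bin. This is not quite the same thing as the point with half the energy above and half below, which is the spectral median. The centroid is correlated to our sense of how ‘bright’ a sound is.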
  • Spectral Envelope: if we trace a line across the top of the peaks of a spectrum, this is called an envelope, in the same way that we can take an envelope in the time domain to track amplitude. We can use this envelope as a shape to filter other sounds with, similar to the vocoder, or we can try and trace the ‘formants’ (spectral peaks) over time.
  • Spectral Flux is a measure of how much change there is between frames. It can be used as a cheap and cheerful transient detector, on the basis that there is most spectral change at these points.
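Three of these descriptors are simple enough to sketch in Python/NumPy from a single magnitude spectrum. Definitions vary between toolkits, so treat the following as one plausible set under stated assumptions rather than a standard: the function names, the small constant eps, and the choice of a plain Euclidean distance for flux are all assumptions.

```python
import numpy as np

def spectral_centroid(mag, freqs):
    """Magnitude-weighted mean frequency of one frame."""
    total = np.sum(mag)
    return np.sum(freqs * mag) / total if total > 0 else 0.0

def spectral_flatness(mag, eps=1e-12):
    """Geometric mean over arithmetic mean of the power spectrum.

    Close to 0 for a few sharp peaks, closer to 1 for a flat, noisy spectrum.
    """
    power = mag ** 2 + eps               # eps avoids taking log(0)
    return np.exp(np.mean(np.log(power))) / np.mean(power)

def spectral_flux(mag, prev_mag):
    """Euclidean distance between two successive magnitude frames."""
    return np.sqrt(np.sum((mag - prev_mag) ** 2))

sr, n = 8000, 1024
t = np.arange(n) / sr
window = np.hanning(n)
freqs = np.fft.rfftfreq(n, 1 / sr)
# Toy frames: a 1 kHz sine and a burst of white noise.
tone = np.abs(np.fft.rfft(window * np.sin(2 * np.pi * 1000 * t)))
noise = np.abs(np.fft.rfft(window * np.random.default_rng(0).standard_normal(n)))
```

On these toy frames the tone has a centroid close to 1000 Hz and a much lower flatness than the noise, and the flux between the tone frame and the noise frame is large, as you would expect at an abrupt change of material.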

These are all low level features, meaning that they are basic measures that may not have a great deal to do with how it is we actually hear, let alone interpret, a sonic experience.

A great deal of research is ongoing into deriving higher level features that may bridge this ‘semantic gap’ (although given the extent to which interpretation happens in context, it may well be unbridgeable). All this notwithstanding, these and other measurements can be useful.


The Phase Vocoder and the STFT are useful for a range of audio tasks, including visual inspection, processing and computer analysis. It is always important to bear in mind that the STFT does not deliver to us an unambiguous model of our signal with well defined partials, but that the distribution of its energy will depend on analysis parameters and the type of material being analysed. Consequently, no treatment is completely without artefacts, and certain types of sound (stable, harmonic) work better than others.

The most common processing uses of the Phase Vocoder are time stretching / pitch shifting and cross synthesis, although there are a host of other interesting techniques.

There are also various interesting sorts of supplementary data we can try and derive from spectral analyses; again it is important to remember that these are results of modelling assumptions, and shouldn’t be treated as altogether authoritative.


There is an awful lot of material available on the STFT.

A good thing to start with would be some listening.

Much of the most nuanced and skilful use of the Phase Vocoder can be found in acousmatic music, particularly that of Denis Smalley.

For greater technical detail on the STFT, FFT et al:

Roads, C., J. Strawn, C. Abbott, J. Gordon, and P. Greenspun. 1996. The Computer Music Tutorial. Vol. 81. Cambridge, Massachusetts: MIT Press.

A useful and more gentle introduction to the FFT et al (some maths, but very very well explained):

Lyons, Richard G. 2004. Understanding Digital Signal Processing. Upper Saddle River, NJ: Prentice Hall PTR.

If you want yet more detail, and are happy about maths:

Zölzer, Udo. 2002. DAFX: Digital Audio Effects. Chichester: Wiley.

Julius Smith has a whole book online, for free. Free! Incredibly detailed and valuable on the technical side of things.

For more discussion of processing techniques (without maths), and a generally valuable read from a practical point of view:

Wishart, T. 1994. Audible Design: a Plain and Easy Introduction to Practical Sound Composition. Orpheus the Pantomime.

  • Some of the examples in the notes were made with Trevor Wishart’s (and others’) Composers Desktop Project software.
  • For a very full featured Phase Vocoder, Ircam’s audiosculpt is the thing (but not cheap).
  • Spear is a free application, using a sinusoidal model. It can export SDIF files of its analysis, which can then be further processed and played back in Max (using FTM or the CNMAT tools).
  • ATS (h/t Rafael Subía Valdez) uses a sines + noise model to improve on some of the phase vocoder’s shortcomings, and interfaces with a number of systems (Common Lisp Music, Pd, Supercollider, GTK, CSound)
  • SMSTools also uses a sines + noise model, and is based in Python.
  • zsa.descriptors is a set of Max/MSP objects for spectral feature extraction. This paper by Geoffroy Peeters tells you possibly all you ever need to know (maybe more) about how to derive a whole host of descriptors (mildly mathsy).