Echo Cancellation — Architecture, Algorithms, and Implementation

For a concise glossary entry, see aec. This article covers detailed AEC algorithm theory and implementation.

Acoustic echo cancellation (AEC) is among the most technically demanding audio processing tasks in conferencing system design. Unlike equalization or dynamics processing, where the algorithm applies a fixed or slowly changing transform to the signal, AEC must continuously model a changing, unpredictable physical environment and subtract its effects from the microphone signal in real time, within a latency budget of a few milliseconds.

When a remote participant's voice plays through the room loudspeaker, that sound travels through the air, reflects off walls and surfaces, and eventually reaches the room's microphone capsule. This acoustic path introduces delay (typically 5-50 ms for a conference room) and coloration (frequency-dependent absorption and reflection from room surfaces). The microphone signal is the sum of the desired near-end speech and this unwanted loudspeaker echo.

Without cancellation, the remote participant hears their own voice returned to them at their audio output — delayed by the round-trip network latency plus the acoustic propagation time. This echo is disorienting and makes natural conversation impossible. In large rooms with high reverberation, the echo tail extends hundreds of milliseconds, compounding the problem.

AEC models the acoustic path from loudspeaker to microphone as a finite impulse response (FIR) filter — a sequence of coefficients representing the room impulse response. If the filter model perfectly matched the true acoustic path, convolving the loudspeaker signal (reference) with the filter would produce an exact replica of the echo component in the microphone signal. Subtracting this replica from the microphone input would leave only the near-end speech.
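The signal model above can be illustrated with a toy example. This is a minimal sketch, not a real room: the five-coefficient `room_path`, the sample values, and the helper name `convolve` are all made up for illustration. It shows that when the filter matches the acoustic path exactly, subtracting the predicted echo recovers the near-end speech.

```python
def convolve(signal, impulse_response):
    """Direct-form FIR convolution, truncated to len(signal) output samples."""
    out = [0.0] * len(signal)
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(impulse_response):
            if n - k < 0:
                break  # no reference samples exist before the start
            acc += h * signal[n - k]
        out[n] = acc
    return out

# Hypothetical toy "room": direct path after 2 samples plus one weaker reflection.
room_path = [0.0, 0.0, 0.6, 0.0, 0.25]

far_end = [1.0, -0.5, 0.3, 0.0, 0.8, -0.2, 0.0, 0.4]    # loudspeaker reference
near_end = [0.0, 0.0, 0.1, 0.2, -0.1, 0.0, 0.05, 0.0]   # local talker

# Microphone signal = near-end speech + echo of the far-end signal.
echo = convolve(far_end, room_path)
mic = [s + e for s, e in zip(near_end, echo)]

# With a perfect model of the path, the predicted echo equals the real echo,
# so the residual is the near-end speech (up to floating-point rounding).
predicted_echo = convolve(far_end, room_path)
residual = [m - p for m, p in zip(mic, predicted_echo)]
```

In practice the true path is unknown and far longer than five coefficients, which is why the coefficients have to be estimated adaptively.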

The challenge: the acoustic path changes continuously as people move, doors open, HVAC airflow shifts, and room temperature changes. The filter coefficients must adapt in real time to track these changes.

NLMS (Normalized Least Mean Squares) is the standard adaptive algorithm used in AEC. At each sample, NLMS adjusts the filter coefficients in the direction that reduces the error (residual echo) between the predicted echo and the actual microphone signal, normalized by the power of the reference signal.

Step size tradeoffs: Higher step size = faster adaptation to room changes, but more susceptibility to double-talk errors. Lower step size = more stable filter, slower adaptation, better double-talk performance. Professional AEC processors use variable step size algorithms that increase adaptation speed during single-talk periods and reduce it during double-talk.
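As a rough illustration of the update described above, the following is a minimal time-domain NLMS sketch in plain Python. The function name `nlms_cancel`, the parameter names `mu` and `eps`, and the four-tap `true_path` are assumptions made for this example; production AEC uses far longer filters, usually in the frequency domain, with variable step size control.

```python
import random

def nlms_cancel(reference, mic, num_taps, mu=0.5, eps=1e-8):
    """Run NLMS sample by sample; return (error_signal, final_weights)."""
    w = [0.0] * num_taps   # filter coefficients (estimate of the echo path)
    x = [0.0] * num_taps   # most recent reference samples, newest first
    errors = []
    for ref_sample, mic_sample in zip(reference, mic):
        x = [ref_sample] + x[:-1]
        predicted_echo = sum(wi * xi for wi, xi in zip(w, x))
        e = mic_sample - predicted_echo            # residual after cancellation
        power = sum(xi * xi for xi in x) + eps     # reference power (normalizer)
        g = mu * e / power                         # normalized step
        w = [wi + g * xi for wi, xi in zip(w, x)]
        errors.append(e)
    return errors, w

# Toy single-talk test: the microphone hears only echo through a made-up path.
random.seed(0)
true_path = [0.0, 0.5, -0.3, 0.1]
reference = [random.uniform(-1, 1) for _ in range(2000)]
mic = [
    sum(h * reference[n - k] for k, h in enumerate(true_path) if n - k >= 0)
    for n in range(len(reference))
]

errors, weights = nlms_cancel(reference, mic, num_taps=4)
```

Here `mu` plays the role of the step size discussed above: values near 1 converge quickly in this noise-free case, while smaller values adapt more slowly but are disturbed less when near-end speech leaks into the error signal.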

The tail length (filter length) of the adaptive filter determines how much of the room's echo the AEC can cancel. The filter must have enough coefficients to span the entire acoustic echo duration, including late reflections.

Tail length required ≈ Room RT60 + direct echo delay + safety margin

For a small conference room with an RT60 of 0.4 seconds, 400 ms is the minimum tail length before the direct delay and safety margin are added. A boardroom with a 0.7 s RT60 needs 700 ms or more. Large reverberant spaces with RT60 above 1.5 seconds push the boundary of what practical AEC can handle.
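The rule of thumb above is simple arithmetic. The sketch below assumes a safety margin of 25% of RT60 (the figure this article uses when discussing tail length pitfalls); the function name and the example direct delays are hypothetical.

```python
def required_tail_ms(rt60_s, direct_delay_ms, margin_fraction=0.25):
    """Tail length estimate: RT60 + direct echo delay + safety margin.

    margin_fraction is expressed as a fraction of RT60 (an assumption here).
    """
    rt60_ms = rt60_s * 1000.0
    return rt60_ms + direct_delay_ms + margin_fraction * rt60_ms

small_room = required_tail_ms(0.4, direct_delay_ms=10)   # about 510 ms
boardroom = required_tail_ms(0.7, direct_delay_ms=20)    # about 895 ms
```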

Computational cost: FIR filter computation scales linearly with tail length. A 500 ms filter at 48 kHz sampling rate has 24,000 coefficients. This is multiplied by the number of microphone channels. Modern professional DSPs implement AEC using FFT-based frequency-domain processing, which reduces computational cost significantly.
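The coefficient count follows directly from tail length and sample rate. The sketch below reproduces the 24,000-tap figure and estimates the multiply-accumulate (MAC) rate for direct time-domain filtering only; the function names are illustrative, and the NLMS coefficient update adds a comparable number of operations on top.

```python
def fir_taps(tail_ms, sample_rate_hz):
    """Number of FIR coefficients needed to span tail_ms of echo."""
    return round(tail_ms * sample_rate_hz / 1000)

def filter_macs_per_second(tail_ms, sample_rate_hz, channels=1):
    """MACs per second for direct time-domain FIR filtering (no adaptation)."""
    return fir_taps(tail_ms, sample_rate_hz) * sample_rate_hz * channels

taps = fir_taps(500, 48_000)                                # 24000 coefficients
one_mic = filter_macs_per_second(500, 48_000)               # 1.152e9 MAC/s
eight_mics = filter_macs_per_second(500, 48_000, channels=8)
```

At more than a billion MACs per second per microphone channel, the appeal of FFT-based block processing is clear.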

After the adaptive filter subtracts the predicted echo, residual echo remains due to imperfect filter convergence, non-linear acoustic effects, and model mismatch. Non-Linear Processing (NLP) is a second-stage suppressor that reduces this residual.

NLP estimates the residual echo level from how closely the adaptive filter's prediction matches the actual microphone signal. It applies gain reduction during remote-talker-only periods, when the signal remaining after the adaptive filter is presumed to be mostly residual echo rather than speech.

NLP aggressiveness tradeoffs: Higher NLP aggressiveness eliminates more residual echo but risks clipping soft near-end speech that the suppressor misidentifies as echo. Lower aggressiveness sounds more natural but may leave perceptible echo in very reverberant rooms. Calibrating NLP aggressiveness for each room is part of AEC commissioning.
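The aggressiveness tradeoff can be made concrete with a simple gain rule. The article does not specify how any particular product computes its suppression gain, so the following is only a sketch of one generic Wiener-style rule; `nlp_gain`, `aggressiveness`, and `floor` are hypothetical names, and the residual echo power estimate is assumed to come from elsewhere in the system.

```python
def nlp_gain(error_power, est_residual_echo_power, aggressiveness=1.0, floor=0.1):
    """Suppression gain (0..1) applied to the signal after the adaptive filter.

    error_power:             power of the post-filter (error) signal
    est_residual_echo_power: estimated residual echo power in that signal
    aggressiveness:          scales how much estimated echo is removed
    floor:                   minimum gain, limits audible gating artifacts
    """
    if error_power <= 0.0:
        return 1.0  # silent input, nothing to suppress
    gain = 1.0 - aggressiveness * est_residual_echo_power / error_power
    return max(floor, min(1.0, gain))

mostly_speech = nlp_gain(1.0, 0.05)                      # gain stays near 1
mostly_echo = nlp_gain(1.0, 0.9)                         # gain drops to the floor
over_aggressive = nlp_gain(1.0, 0.5, aggressiveness=2.0) # speech is cut as well
```

With `aggressiveness=2.0`, a signal that is half speech is pushed to the floor, which is the soft-speech clipping described above.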

Double-talk — simultaneous near-end and far-end speech — is the most difficult condition for AEC. During double-talk:

The microphone signal contains both near-end speech and far-end echo superimposed
The NLMS algorithm cannot distinguish which component is "error" to minimize
If the filter adapts during double-talk, it corrupts its model by treating near-end speech as echo

Double-talk detectors use various approaches:

Cross-correlation: If the microphone signal is highly correlated with the reference (loudspeaker), echo is dominant. If correlation drops, near-end speech has entered.
Power ratio: If near-end microphone power exceeds a threshold relative to reference power, near-end speech is likely active.
Geigel algorithm: Compares maximum of reference signal over a window with microphone signal level — simple and fast, though less accurate than correlation methods.

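During detected double-talk, the NLMS step size is reduced to near zero, freezing filter adaptation. Poor double-talk detection fails in one of two ways. A detector that misses double-talk lets the filter adapt on near-end speech and diverge, which produces residual echo and can distort or suppress the near-end talker. A detector that triggers too readily keeps adaptation frozen, so the filter is slow to converge or reconverge after double-talk periods.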
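Of the detectors listed above, the Geigel test is the easiest to sketch. The version below assumes a threshold of 0.5, which corresponds to assuming the echo path attenuates the loudspeaker signal by at least about 6 dB; the function names and the sample values are illustrative.

```python
def geigel_double_talk(mic_sample, recent_reference, threshold=0.5):
    """Return True if near-end speech is likely present.

    Compares the microphone magnitude with the peak magnitude of the
    reference over a recent window. Echo alone should stay below
    threshold * peak if the echo path attenuates by the assumed amount.
    """
    peak = max((abs(x) for x in recent_reference), default=0.0)
    return abs(mic_sample) > threshold * peak

def step_size(mic_sample, recent_reference, mu_normal=0.5):
    """Freeze adaptation (step size 0) while double-talk is detected."""
    if geigel_double_talk(mic_sample, recent_reference):
        return 0.0
    return mu_normal

window = [1.0, -0.8, 0.3, 0.1]        # recent loudspeaker samples
echo_only = step_size(0.2, window)    # below 0.5 * peak, keep adapting
double_talk = step_size(0.7, window)  # above 0.5 * peak, freeze
```

A fixed threshold is the weakness of this method: if the real echo path loss is lower than assumed, echo alone trips the detector and adaptation stalls.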

Purpose-built hardware AEC (Biamp Tesira, QSC Q-SYS, Shure IntelliMix P300, Sennheiser TeamConnect Ceiling):

Dedicated DSP chips with deterministic clock timing
Filter tail lengths of 500 ms to 1000 ms or more
AEC reference signal carried directly on-chip, eliminating latency uncertainty
Optimized for extended operation in fixed rooms — warm-up time allows filter to converge before users arrive
QSC Q-SYS and Biamp allow integration of beamforming mic outputs directly into the AEC input

Software AEC in conferencing clients (Teams, Zoom, Webex):

Runs on general-purpose CPU with variable latency
Tail length typically 200-400 ms
Quality varies significantly across platforms and CPU generations
Microsoft Teams' AEC is widely regarded as among the best software implementations

The double AEC problem: When hardware AEC and software AEC both run on the same audio path, each stage treats the output of the other as a signal to suppress. The result is severe clipping, distortion, and dropped speech — particularly during double-talk. Always disable software AEC when hardware DSP AEC is in the signal chain. Every major DSP manufacturer (QSC, Biamp, Shure, Extron) publishes guidance for disabling AEC in Teams, Zoom, and Webex when using their hardware.

The adaptive filter requires a reference signal representing the loudspeaker output as it enters the room. The reference must be:

Post-mixing — after all far-end audio sources are combined onto the output bus
Pre-amplification — before any gain changes that the amplifier might apply
Pre-room — ideally post-DAC, before the loudspeaker driver; never from a microphone
Time-aligned — delayed by the digital processing latency to match the actual echo arrival time

A reference taken at the wrong point produces a mismatch between the predicted echo and the actual echo in the microphone. This mismatch causes residual echo that no amount of NLP tuning can fully suppress.
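Time alignment can be checked by estimating the bulk delay between the reference and the microphone signal. The sketch below uses a brute-force cross-correlation peak search on synthetic data; the function names, the 7-sample delay, and the 0.5 echo gain are assumptions for the example, and it assumes the echo arrives with positive polarity.

```python
import random

def estimate_delay(reference, mic, max_lag):
    """Return the lag in samples that maximizes reference/mic cross-correlation."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        score = sum(reference[n - lag] * mic[n] for n in range(lag, len(mic)))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def delay_signal(signal, lag):
    """Delay a signal by lag samples, padding the start with silence."""
    return [0.0] * lag + signal[:len(signal) - lag]

# Synthetic test: the "echo" is the reference delayed by 7 samples at half level.
random.seed(1)
reference = [random.uniform(-1, 1) for _ in range(400)]
mic = [0.5 * s for s in delay_signal(reference, 7)]

lag = estimate_delay(reference, mic, max_lag=32)
aligned_reference = delay_signal(reference, lag)
```

Removing the bulk delay this way means the adaptive filter's coefficients are spent on the room response rather than on leading zeros.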

In QSC Q-SYS: The AEC reference input is a separate audio component output in the design. It must be explicitly wired from the speaker output bus post-processing. A common commissioning error is omitting this wire entirely — AEC appears active in the design but performs no cancellation.

In Biamp Tesira: The AEC reference is assigned in the AEC processing block settings. The reference channel must be routed from the output path serving the room's loudspeaker, post all mixing and processing.

Modern beamforming microphone arrays (Shure MXA series, Biamp Parle, Sennheiser TeamConnect Ceiling 2) integrate AEC processing directly in the array or in a co-designed DSP. The beamformed output feeds the AEC as the near-end signal.

Beamforming complements AEC by providing spatial filtering — the beam formed toward a talker attenuates the loudspeaker (which is in a different direction) by 15-25 dB before AEC even begins. This significantly reduces the echo-to-speech ratio at the AEC input, improving convergence stability and reducing NLP suppression artifacts.
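To put the 15-25 dB figure in linear terms, the conversion below shows how much of the loudspeaker's amplitude survives at the beamformer output. The helper name is illustrative; the conversion itself is the standard amplitude-ratio formula.

```python
def db_to_amplitude_ratio(db):
    """Convert a level change in dB to a linear amplitude ratio."""
    return 10 ** (db / 20)

# Loudspeaker pickup remaining after 15 dB and 25 dB of spatial attenuation.
remaining_15 = db_to_amplitude_ratio(-15)
remaining_25 = db_to_amplitude_ratio(-25)
print(f"15 dB: {remaining_15:.3f}  25 dB: {remaining_25:.3f}")
```

The echo reaching the AEC input is reduced to roughly 18% and 6% of its original amplitude respectively, which leaves the adaptive filter far less to remove.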

Some systems (Shure IntelliMix P300 + MXA microphone combination) automatically coordinate beam steering and AEC reference based on loudspeaker direction. This automates part of the commissioning process that previously required manual configuration.

Verify AEC reference routing in DSP software before powering up the room. Trace the signal path from the loudspeaker output to the AEC reference input explicitly.
Set gain structure. With a remote caller active, set the speaker output to a normal conversational level and adjust microphone gain so that near-end speech averages about -18 dBFS. See gain-structure.
Test echo cancellation — with the remote caller speaking, listen at the remote end. Any echo from the near end indicates incomplete cancellation. Common causes: AEC reference not connected, gain too high, tail length too short.
Adjust NLP aggressiveness — increase until residual echo is inaudible; verify near-end speech is not clipped during single-talk.
Test double-talk — both participants speak simultaneously. Verify near-end speech is preserved, not suppressed.
Test convergence after reset — restart the DSP and test echo cancellation immediately. Well-designed systems reconverge within seconds; poorly designed ones take minutes.
Test with room at occupancy — people in the room change the acoustic environment significantly. Test AEC performance with the room fully occupied.

AEC reference not connected — the most common commissioning error. AEC component active in design but no reference signal wired to it. Result: echo on every call. Verify reference signal path explicitly before calling commissioning complete.
Double AEC (hardware DSP + conferencing client) — disabling software AEC is a required step, not optional. QSC, Biamp, and Shure publish vendor-specific instructions for disabling AEC in Teams, Zoom, and Webex.
Tail length shorter than room RT60 — a 200 ms tail in a 0.6s RT60 room leaves 400 ms of uncanceled echo. Increase tail length setting in DSP to match room RT60 plus 25% safety margin.
Microphone too close to loudspeaker — direct acoustic coupling overwhelms the adaptive filter. Minimum microphone-to-loudspeaker distance: 3 feet.
Gain structure causing reference mismatch — if the amplifier gain is changed after the AEC reference tap point, the reference no longer matches actual loudspeaker output level. Fix amplifier gain during commissioning and control volume via the DSP only.
Room change after initial convergence — major room changes alter the impulse response. The filter may take several minutes to reconverge. Some systems support a manual convergence trigger to accelerate reconvergence after known room changes.

The Echo Problem in Detail

The Adaptive Filter Model

Tail Length and RT60

Non-Linear Processing (NLP)

Double-Talk Detection

Hardware vs. Software AEC: Implementation Differences

AEC Reference Signal Path — The Critical Detail

AEC in Beamforming Systems

Commissioning AEC — Step by Step

Common Pitfalls
