Neural Kalman Filters for
Acoustic Echo Cancellation

Ernst Seidel, Gerald Enzner, Pejman Mowlaee, Tim Fingscheidt (2024)

Presenter: SZ2503007 丁阳奕
Date: 2026.4.20

1. The Acoustic Echo Problem

Hands-free & Speakerphone Scenarios

  • The Loop: Far-end speech x(n) plays through a loudspeaker, travels through the room, and is picked up by the microphone as echo.
  • Signal Mixture: The microphone signal is y(n) = s(n) + n(n) + d(n), where s(n) is near-end speech, n(n) is background noise, and d(n) is echo.
  • The Goal: Use the reference x(n) to estimate and subtract the echo.
  • The Challenge: It is not just about "lowering the volume". During double-talk, the algorithm must suppress echo while preserving near-end speech, and it must re-adapt quickly when the echo path changes.
AEC System Overview
Fig 1: Signal relationships in an AEC system
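The signal relationships above can be sketched in a few lines. This is a toy simulation, not the paper's data pipeline: the room impulse response and all signals are random placeholders, chosen only to make the mixture y(n) = s(n) + n(n) + d(n) concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 16_000                          # 16 kHz sampling rate, as in the paper's setup

x = rng.standard_normal(fs)          # far-end excitation x(n) (placeholder signal)
h = rng.standard_normal(512) * np.exp(-np.arange(512) / 100)  # toy decaying RIR
h /= np.linalg.norm(h)

d = np.convolve(x, h)[: len(x)]      # echo d(n): far-end filtered by the room
s = 0.5 * rng.standard_normal(fs)    # near-end speech s(n) (placeholder)
n = 0.05 * rng.standard_normal(fs)   # background noise n(n)

y = s + n + d                        # microphone signal y(n) = s(n) + n(n) + d(n)
e = y - d                            # ideal AEC output: echo perfectly subtracted
```

With a perfect echo estimate, the output e(n) contains exactly the near-end speech plus noise; everything in the paper is about approaching this ideal from x(n) and y(n) alone.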

2. From Adaptive Filtering to FDKF

State-Space Modeling for AEC

  • Traditional Methods (LMS, NLMS): Rely on identifying the room impulse response h(n) via d(n) = xᵀ(n) h(n).
  • The Core Conflict: When error increases, is the echo path changing, or is near-end speech/noise interfering?
  • Step-Size Dilemma:
    • Large step = destroys near-end speech during double-talk.
    • Small step = slow re-convergence after path changes.
  • FDKF Contribution: Unifies filter updates and step-size control in a Kalman/state-space framework.
LMS / NLMS (manual/heuristic step size; conflict: adaptation vs. double-talk)
→ State-Space Modeling
→ FDKF (Frequency-Domain Kalman Filter): Kalman gain = structural, model-based step size
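The "model-based step size" idea can be illustrated with a simplified, diagonal per-frequency-bin Kalman update. This is a sketch of the standard FDKF recursion, not the paper's exact implementation; the transition factor A, process noise Q, and observation-noise power psi_s are assumed example values.

```python
import numpy as np

def fdkf_bin_update(W, P, X, Y, A=0.999, Q=1e-4, psi_s=1e-2):
    """One simplified per-frequency-bin Kalman update (diagonal FDKF sketch).

    W: current echo-path estimate for this bin (complex scalar)
    P: state-uncertainty estimate
    X, Y: far-end and microphone spectra in this bin
    A, Q: state-transition factor and process-noise power (assumed values)
    psi_s: near-end speech plus noise power (observation noise)
    """
    # Prediction: the echo path is modeled as a slowly varying random walk.
    W_prior = A * W
    P_prior = A**2 * P + Q

    # Innovation: residual after subtracting the predicted echo.
    E = Y - X * W_prior

    # Kalman gain: large when the state is uncertain, small when near-end
    # speech/noise dominates -- the structural, model-based step size.
    K = P_prior * np.conj(X) / (np.abs(X) ** 2 * P_prior + psi_s)

    # Correction.
    W_post = W_prior + K * E
    P_post = (1 - K * X) * P_prior
    return W_post, P_post, E
```

The gain K resolves the step-size dilemma by construction: high uncertainty P pushes K up (fast adaptation after a path change), while large near-end power psi_s pushes K down (cautious updates during double-talk).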

3. Why Neural Kalman Filters?

Combining the Best of Both Worlds

  • FDKF Advantages: Low complexity, few parameters, clear structure, excellent near-end speech protection.
  • FDKF Weaknesses: Strong linear assumptions (fails on loudspeaker nonlinearities), hard to estimate covariance priors accurately.
  • Pure DNN AEC: Powerful, but often requires massive models or sacrifices near-end speech quality (aggressive suppression).
  • The Hybrid Approach: Keep the FDKF state-space skeleton, but let the DNN learn the hardest-to-model local components:
    • Distortion Model
    • Filter-State Update
    • Kalman Gain
Neural Kalman Overview
Fig 2: Neural Kalman / FDKF Framework (Right: injectable DNN modules)

4. Where Exactly Does the DNN Go?

Architectural Variations

Architectures
Fig 3: Internal structures of NeuralKalman vs DeepAdaptive
Method | What the DNN Learns | Architecture Type | Characteristics
------ | ------------------- | ----------------- | ---------------
FDKF | (none: fully statistical) | N/A | Baseline, lowest complexity
DLAC-Kalman / NKF | Kalman gain (state uncertainty) | Per-bin (small shared network) | High flexibility, few parameters, but high FLOPS
NeuralKalman | Distortion + state update | Fully connected (joint-frequency) | Balanced, lower FLOPS; Kalman gain stays model-based
DeepAdaptive | Distortion + aggressive update | Fully connected (joint-frequency) | Acts more like an aggressive suppressor; higher risk to near-end speech
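The "per-bin with shared weights" design can be sketched as follows. This is an illustrative stand-in, not the paper's architecture: a tiny untrained MLP (random weights, assumed feature set of far-end magnitude, residual magnitude, and uncertainty) replaces the closed-form Kalman gain, applied independently to each frequency bin with one shared set of weights.

```python
import numpy as np

rng = np.random.default_rng(1)

# Tiny shared MLP mapping per-bin features -> real/imaginary parts of the gain.
# Hidden size and feature choice are illustrative assumptions.
W1 = 0.1 * rng.standard_normal((3, 8))
W2 = 0.1 * rng.standard_normal((8, 2))

def learned_gain(X, E, P):
    """Predict a complex Kalman gain for every frequency bin.

    X, E, P: arrays of shape (bins,) holding far-end magnitude, residual
    magnitude, and uncertainty -- one feature vector per bin, one shared
    network across all bins (the per-bin design in the table above).
    """
    feats = np.stack([np.abs(X), np.abs(E), P], axis=-1)  # (bins, 3)
    hidden = np.tanh(feats @ W1)                          # shared weights
    out = hidden @ W2                                     # (bins, 2)
    return out[:, 0] + 1j * out[:, 1]                     # complex gain per bin

bins = 257
K = learned_gain(rng.standard_normal(bins),
                 rng.standard_normal(bins),
                 np.abs(rng.standard_normal(bins)))
```

Because the weights are shared across bins, the parameter count stays small regardless of the FFT size, but the network is evaluated once per bin per frame, which is exactly why the per-bin variants trade few parameters for high FLOPS.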

5. Fair Experimental Setup

Ensuring an Apples-to-Apples Comparison

  • Unified Framework: All models trained in the same PyTorch environment for AEC-only tasks.
  • Strict Constraints: 16 kHz sampling rate, unified algorithmic delay, and identical reference lengths (no "looking ahead" advantages).
  • Evaluation Datasets (Unseen Data):
    • Dtest: Standard speech far-end excitation.
    • DWGN_test: White Gaussian Noise far-end (system ID stress test).
    • DNL_test: Strong loudspeaker non-linearities (mismatch test).
  • Segments include Single-Talk Far-End (STFE), Near-End (STNE), and Double-Talk (DT).
Training: VCTK + synthetic RIRs + SEF nonlinearity + DEMAND/QUT noise, in a unified PyTorch AEC framework
Testing: TIMIT + real Aachen RIRs + ETSI noise
Metrics: ERLE, AECMOS, PESQ, STOI, LPS, FLOPS, Params
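Of the metrics above, ERLE is the simplest to state precisely. A minimal sketch (standard definition, evaluated here on placeholder arrays rather than the paper's test sets):

```python
import numpy as np

def erle_db(d, e, eps=1e-12):
    """Echo Return Loss Enhancement in dB: how much echo power the AEC removes.

    d: echo component at the microphone; e: residual echo after cancellation.
    Evaluated on echo-only (STFE) segments; higher is better.
    """
    return 10 * np.log10((np.mean(d**2) + eps) / (np.mean(e**2) + eps))

d = np.ones(1000)
print(erle_db(d, 0.1 * d))  # residual amplitude reduced 10x -> ~20 dB ERLE
```

Note that ERLE only measures suppression of the echo itself; it says nothing about near-end preservation, which is why PESQ, STOI, and LPS are reported alongside it.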

6. Convergence Behavior (Single Example)

ERLE over time in Dtest

  • Initial STFE Phase: All neural Kalman filters reach higher Echo Return Loss Enhancement (ERLE) much faster than the standard FDKF.
  • DeepAdaptive: Most aggressive echo suppression, achieving very high ERLE (behaves much like a pure suppressor).
  • NKF & DLAC-Kalman: Well-balanced. They improve convergence speed over FDKF without being overly aggressive.
  • Double-Talk (DT): Overall ERLE naturally drops. Reminder: high suppression during DT often correlates with near-end speech degradation!
Dtest single example
Fig 4: Time-domain signals and ERLE curves across STFE, STNE, and DT segments.

7. Re-convergence After RIR Switch

Adapting to Sudden Echo Path Changes (at 4s)

  • FDKF Shortcoming: Updates are too conservative, leading to painfully slow re-convergence (Fig 5).
  • Speech Excitation: DeepAdaptive & NeuralKalman recover ERLE very quickly. NKF and DLAC-Kalman are steadier but still clearly outperform FDKF.
  • WGN Excitation (Stress Test): When the far-end is replaced by white Gaussian noise (Fig 6), NKF, DLAC-Kalman, and FDKF excel, since this is a classic system-identification scenario.
  • However, DeepAdaptive & NeuralKalman struggle with noise-like echo, showing that fast ERLE recovery does not always equal true echo-path identification.
Speech Re-convergence WGN Re-convergence
Fig 5 (Top): Speech Excitation. Fig 6 (Bottom): WGN Excitation.

8. Suppression vs. Preservation

Comprehensive Double-Talk Performance

  • Echo Suppression: DeepAdaptive leads in ERLE/AECMOS Echo, acting as a strong suppressor.
  • Near-End Preservation: DLAC-Kalman leads neural models in PESQ, STOI, and LPS. Learning the Kalman gain is the safest way to protect near-end speech.
  • The Trade-off: NeuralKalman & DeepAdaptive sacrifice speech quality (lower PESQ) for extreme echo suppression.
  • Non-linearities (Red lines): DNL_test severely impacts speech quality metrics, even if ERLE looks okay.
  • Resources: FDKF is cheapest. NKF/DLAC-Kalman have few parameters but high FLOPS; fully connected hybrids have low FLOPS but many parameters.
Comprehensive Metrics
Fig 7: Echo metrics, Speech metrics, and Complexity (Blue = Linear, Red = Non-linear)

9. Conclusions & Design Implications

How to design the next generation of AEC

  • Hybrid Superiority: Neural Kalman filters drastically improve convergence and echo suppression over standard FDKF.
  • Not All DNNs Are Equal: From a control-theory perspective, learning the Kalman gain is the most stable, balanced way to enhance AEC without destroying near-end speech.
  • Architecture Trade-offs:
    • Per-bin models (NKF): Flexible, parameter-light, but computationally heavy (FLOPS).
    • Fully-connected: Cheaper to compute, but less adaptable and risk acting like blind suppressors.
  • Post-filters matter: Residual echo from non-linearities and long echo tails still require post-filtering.
Final Principle: Keep the state-space Kalman loop intact. Let the DNN estimate only the most uncertain, non-linear, and hardest-to-model variables.