PASE: Leveraging the Phonological Prior of WavLM
for Low-Hallucination Generative Speech Enhancement

Xiaobin Rong1,2, Qinwen Hu1,2, Mansur Yesilbursa2, Kamil Wojcicki2, Jing Lu1
1Key Laboratory of Modern Acoustics, Nanjing University
2Collaboration AI, Cisco Systems, Inc.
📁 Code Repository 📄 arXiv Paper

Abstract

Generative models have shown remarkable performance in speech enhancement (SE), achieving superior perceptual quality over traditional discriminative approaches. However, existing generative SE approaches often overlook the risk of hallucination under severe noise, leading to incorrect spoken content or inconsistent speaker characteristics, which we term linguistic and acoustic hallucinations, respectively.

We argue that linguistic hallucination, which stems from a model's failure to constrain its output to valid phonological structures, is the more fundamental challenge. While language models (LMs) are well suited to capturing the underlying structure of speech by modeling the distribution of discrete tokens, existing approaches learn from noise-corrupted representations, which can contaminate the prior and induce hallucinations.

To overcome these limitations, we propose the Phonologically Anchored Speech Enhancer (PASE), a generative SE framework that leverages the robust phonological prior embedded in the pre-trained WavLM model to mitigate hallucinations. First, we adapt WavLM into a denoising expert via representation distillation to clean its final-layer features. Guided by the model's intrinsic phonological prior, this process enables robust denoising with strong resistance to linguistic hallucination.
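The distillation objective can be illustrated with a minimal toy sketch: a frozen "teacher" pass over clean speech provides target features, and a "student" pass over the noisy mixture is regressed onto them. Everything below is a stand-in assumption (random arrays instead of WavLM features, a single linear projection instead of the adapted model); only the regression objective mirrors the idea described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for WavLM final-layer features (T frames x D dims).
# In the actual framework, the teacher sees clean speech and the student
# sees the noisy mixture; here we fake both with random arrays.
T, D = 50, 8
clean_feats = rng.normal(size=(T, D))        # frozen teacher targets
noisy_feats = clean_feats + 0.5 * rng.normal(size=(T, D))

# Toy "student": a single trainable linear projection of the noisy features.
W = np.eye(D)

def distill_loss(W):
    # L2 regression of student (noisy) features onto teacher (clean) features
    diff = noisy_feats @ W - clean_feats
    return float(np.mean(diff ** 2))

# A few steps of plain gradient descent (closed-form gradient of the L2 loss).
lr = 0.05
for _ in range(200):
    grad = 2.0 * noisy_feats.T @ (noisy_feats @ W - clean_feats) / (T * D)
    W -= lr * grad

print(distill_loss(np.eye(D)), "->", distill_loss(W))
```

The point of the sketch is only that the student is supervised in *feature* space rather than waveform space, which is what lets the model's phonological prior shape the denoised representation.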

To further reduce acoustic hallucinations, we train the vocoder with a dual-stream representation: a high-level phonetic representation provides clean linguistic content, while a low-level acoustic representation retains speaker identity and prosody. Experimental results demonstrate that PASE not only surpasses state-of-the-art discriminative models in perceptual quality, but also significantly outperforms prior generative models, with substantially fewer linguistic and acoustic hallucinations.
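The dual-stream conditioning can be sketched as follows. The feature dimensions, the random arrays, and the frame-wise concatenation are all illustrative assumptions (the paper may fuse the streams differently); the sketch only shows the separation of roles: one stream for linguistic content, one for speaker/prosody cues.

```python
import numpy as np

rng = np.random.default_rng(1)

T = 100                   # number of frames
D_PHON, D_ACOU = 16, 4    # hypothetical feature dimensions

# Stand-ins: in PASE these would be the denoised high-level (phonetic)
# features and low-level acoustic features extracted from the input.
phonetic = rng.normal(size=(T, D_PHON))   # carries linguistic content
acoustic = rng.normal(size=(T, D_ACOU))   # carries speaker identity / prosody

def dual_stream_input(phonetic, acoustic):
    # One simple fusion strategy (an assumption, not the paper's exact design):
    # frame-wise concatenation of the two streams before the vocoder.
    assert phonetic.shape[0] == acoustic.shape[0], "streams must be frame-aligned"
    return np.concatenate([phonetic, acoustic], axis=-1)

fused = dual_stream_input(phonetic, acoustic)
print(fused.shape)  # (100, 20)
```

Keeping the acoustic stream low-level means speaker identity never has to be regenerated from the phonetic stream, which is what limits acoustic hallucination.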

PASE Model Framework
Figure 1: (a) Overall architecture of the proposed PASE framework. (b) Diagram of denoising representation distillation.
Audio Demos
File ID 1
Noisy Signal
Noisy 1
Enhanced by TF-GridNet [1]
TF-GridNet Enhanced 1
Enhanced by LLaSE-G1 [2]
LLaSE-G1 Enhanced 1
Enhanced by AES-V2 [3]
AES-V2 Enhanced 1
Enhanced by PASE (Ours)
PASE Enhanced 1
Clean Signal
Clean 1
File ID 2
Noisy Signal
Noisy 2
Enhanced by TF-GridNet [1]
TF-GridNet Enhanced 2
Enhanced by LLaSE-G1 [2]
LLaSE-G1 Enhanced 2
Enhanced by AES-V2 [3]
AES-V2 Enhanced 2
Enhanced by PASE (Ours)
PASE Enhanced 2
Clean Signal
Clean 2
File ID 3
Noisy Signal
Noisy 3
Enhanced by TF-GridNet [1]
TF-GridNet Enhanced 3
Enhanced by LLaSE-G1 [2]
LLaSE-G1 Enhanced 3
Enhanced by AES-V2 [3]
AES-V2 Enhanced 3
Enhanced by PASE (Ours)
PASE Enhanced 3
Clean Signal
Clean 3
File ID 4
Noisy Signal
Noisy 4
Enhanced by TF-GridNet [1]
TF-GridNet Enhanced 4
Enhanced by LLaSE-G1 [2]
LLaSE-G1 Enhanced 4
Enhanced by AES-V2 [3]
AES-V2 Enhanced 4
Enhanced by PASE (Ours)
PASE Enhanced 4
Clean Signal
Clean 4
File ID 5
Noisy Signal
Noisy 5
Enhanced by TF-GridNet [1]
TF-GridNet Enhanced 5
Enhanced by LLaSE-G1 [2]
LLaSE-G1 Enhanced 5
Enhanced by AES-V2 [3]
AES-V2 Enhanced 5
Enhanced by PASE (Ours)
PASE Enhanced 5
Clean Signal
Clean 5

[1] Wang, Z.-Q.; Cornell, S.; Choi, S.; Lee, Y.; Kim, B.-Y.; and Watanabe, S. 2023. TF-GridNet: Integrating full-and sub-band modeling for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:3221–3236.

[2] Kang, B.; Zhu, X.; Zhang, Z.; Ye, Z.; Liu, M.; Wang, Z.; Zhu, Y.; Ma, G.; Chen, J.; Xiao, L.; et al. 2025. LLaSE-G1: Incentivizing generalization capability for LLaMA-based speech enhancement. arXiv preprint arXiv:2503.00493.

[3] Adobe Podcast Enhance Speech. https://podcast.adobe.com/en/enhance