StuPASE: Towards Low-Hallucination
Studio-Quality Generative Speech Enhancement

Xiaobin Rong1,2, Jun Gao1,3, Zheng Wang1,3, Mansur Yesilbursa2, Kamil Wojcicki2, Jing Lu1,3
1Key Laboratory of Modern Acoustics, Nanjing University
2Collaboration AI, Cisco Systems, Inc.
3NJU-Horizon Intelligent Audio Lab, Horizon Robotics
Audio Examples of the Effect of Early Reflections

Dry Signal: Original clean speech without artificially added reflections.
Early-Reflected Signal: Clean speech convolved with the first 50 ms of a room impulse response (RIR), simulating early reflections.
Reverberant Signal: Clean speech convolved with the full RIR, including both early and late reverberation.
The RIRs are sourced from the open datasets openSLR26 and openSLR28 (https://openslr.org/resources.php).
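
The three signal types above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' pipeline: the function name `simulate_reflections`, the synthetic decaying-noise RIR, and the choice to count the 50 ms from the first sample of the RIR (rather than from the direct-path peak) are all assumptions made for the example.

```python
import numpy as np


def simulate_reflections(dry, rir, sample_rate, early_ms=50.0):
    """Return (early_reflected, reverberant) versions of a dry signal.

    Assumption: the early part is simply the first `early_ms` milliseconds
    of the RIR, counted from its first sample.
    """
    n_early = int(round(sample_rate * early_ms / 1000.0))
    early_rir = rir[:n_early]  # keep only the early reflections
    # Convolve and trim back to the dry-signal length for easy comparison.
    early = np.convolve(dry, early_rir)[: len(dry)]
    reverberant = np.convolve(dry, rir)[: len(dry)]
    return early, reverberant


# Toy data (hypothetical): white noise stands in for clean speech, and an
# exponentially decaying noise burst stands in for a measured RIR.
rng = np.random.default_rng(0)
sample_rate = 16000
dry = rng.standard_normal(sample_rate)             # 1 s "dry" signal
t = np.arange(int(0.4 * sample_rate)) / sample_rate
rir = rng.standard_normal(t.size) * np.exp(-t / 0.08)
rir[0] = 1.0                                       # direct path

early, reverberant = simulate_reflections(dry, rir, sample_rate)
```

With real data, `dry` would be a clean utterance and `rir` an impulse response loaded from one of the RIR datasets, both at the same sampling rate. The early-reflected and reverberant outputs are identical for the first 50 ms and diverge once the late part of the RIR starts to contribute.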

fileid_0
Reverberant Signal
Early-Reflected Signal
Dry Signal
fileid_1
Reverberant Signal
Early-Reflected Signal
Dry Signal
Audio Examples for Comparison
Noisy Signal
TF-GridNet [1] Output
FlowSE [2] Output
PASE [3] Output
PASE-R Output
AES-V2 [4] Output
SenSE [5] Output
StuPASE Output (Ours)
Clean
Noisy Signal
TF-GridNet [1] Output
FlowSE [2] Output
PASE [3] Output
PASE-R Output
AES-V2 [4] Output
SenSE [5] Output
StuPASE Output (Ours)
Clean
Noisy Signal
TF-GridNet [1] Output
FlowSE [2] Output
PASE [3] Output
PASE-R Output
AES-V2 [4] Output
SenSE [5] Output
StuPASE Output (Ours)
Clean
Noisy Signal
TF-GridNet [1] Output
FlowSE [2] Output
PASE [3] Output
PASE-R Output
AES-V2 [4] Output
SenSE [5] Output
StuPASE Output (Ours)
Clean
Noisy Signal
TF-GridNet [1] Output
FlowSE [2] Output
PASE [3] Output
PASE-R Output
AES-V2 [4] Output
SenSE [5] Output
StuPASE Output (Ours)
Clean
Noisy Signal
TF-GridNet [1] Output
FlowSE [2] Output
PASE [3] Output
PASE-R Output
AES-V2 [4] Output
SenSE [5] Output
StuPASE Output (Ours)
Clean
Noisy Signal
TF-GridNet [1] Output
FlowSE [2] Output
PASE [3] Output
PASE-R Output
AES-V2 [4] Output
SenSE [5] Output
StuPASE Output (Ours)
Clean
Noisy Signal
TF-GridNet [1] Output
FlowSE [2] Output
PASE [3] Output
PASE-R Output
AES-V2 [4] Output
SenSE [5] Output
StuPASE Output (Ours)
Clean
Noisy Signal
TF-GridNet [1] Output
FlowSE [2] Output
PASE [3] Output
PASE-R Output
AES-V2 [4] Output
SenSE [5] Output
StuPASE Output (Ours)
Clean
Noisy Signal
TF-GridNet [1] Output
FlowSE [2] Output
PASE [3] Output
PASE-R Output
AES-V2 [4] Output
SenSE [5] Output
StuPASE Output (Ours)
Clean
References

[1] Z.-Q. Wang, S. Cornell, S. Choi, Y. Lee, B.-Y. Kim, and S. Watanabe, “TF-GridNet: Integrating full- and sub-band modeling for speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 3221-3236, 2023.

[2] Z. Wang, Z. Liu, X. Zhu, Y. Zhu, M. Liu, J. Chen, L. Xiao, C. Weng, and L. Xie, “FlowSE: Efficient and High-Quality Speech Enhancement via Flow Matching,” in Interspeech 2025, 2025, pp. 4858-4862.

[3] X. Rong, Q. Hu, M. Yesilbursa, K. Wojcicki, and J. Lu, “PASE: Leveraging the Phonological Prior of WavLM for Low-Hallucination Generative Speech Enhancement,” in Proceedings of the 40th AAAI Conference on Artificial Intelligence (AAAI 2026), 2026, accepted.

[4] Adobe Podcast, “Enhance Speech.” [Online]. Available: https://podcast.adobe.com/enhance

[5] X. Li, H. Xie, Z. Wang, Z. Zhang, L. Xiao, and L. Xie, “SenSE: Semantic-aware high-fidelity universal speech enhancement,” 2025. [Online]. Available: https://arxiv.org/abs/2509.24708