At the ICASSP 2025 conference, held in April, researchers from the Idiap Research Institute, a partner in the ELOQUENCE project, presented their latest advance in automatic speech recognition (ASR): the XLSR-Transducer.
This novel approach brings streaming capabilities to self-supervised pretrained models, bridging the gap between cutting-edge machine learning research and the real-time demands of speech technologies in practice.
ASR systems have made great strides in recent years, particularly with the rise of self-supervised learning, which allows models to learn from vast amounts of unlabeled audio. However, many of these models are designed for offline processing, making them difficult to apply in scenarios that require an immediate response, such as live transcription, voice assistants, or interactive applications. The XLSR-Transducer was developed to address precisely this limitation.
By adapting self-supervised models like XLS-R to a streaming framework, the team at Idiap has enabled low-latency transcription without compromising the performance typically associated with large pretrained models. This innovation not only improves the responsiveness of ASR systems but also opens up new possibilities for deploying voice technologies in multilingual and dynamic environments, one of the main goals of the ELOQUENCE project.
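To give a flavour of how a transformer encoder such as XLS-R can be made streamable, the sketch below illustrates one common ingredient: a chunked attention mask that lets each frame attend only to its own chunk and a limited amount of left context, so output can be emitted as soon as a chunk of audio arrives. This is a minimal, hypothetical PyTorch illustration, not the authors' implementation; the function name, chunk size, and left-context setting are assumptions for demonstration only.

```python
import torch

def chunked_attention_mask(num_frames: int, chunk_size: int,
                           num_left_chunks: int) -> torch.Tensor:
    """Boolean (num_frames x num_frames) mask; True where attention is allowed.

    Each query frame may attend to frames in its own chunk and in up to
    `num_left_chunks` preceding chunks, but never to future chunks.
    (Hypothetical helper for illustration, not from the paper.)
    """
    chunk_idx = torch.arange(num_frames) // chunk_size
    q = chunk_idx.unsqueeze(1)  # chunk index of each query frame
    k = chunk_idx.unsqueeze(0)  # chunk index of each key frame
    return (k <= q) & (k >= q - num_left_chunks)

# Example: 10 frames, chunks of 4 frames, one chunk of left context.
mask = chunked_attention_mask(10, chunk_size=4, num_left_chunks=1)

# torch.nn.MultiheadAttention expects True where attention is *disallowed*,
# hence the negation when passing the mask.
attn = torch.nn.MultiheadAttention(embed_dim=16, num_heads=2, batch_first=True)
x = torch.randn(1, 10, 16)  # dummy frame sequence (batch, time, features)
out, _ = attn(x, x, x, attn_mask=~mask)
```

Because no frame ever attends to future chunks, the encoder's latency is bounded by the chunk length rather than by the full utterance, which is what makes low-latency transcription possible.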
The paper, titled “XLSR-Transducer: Streaming ASR for Self-Supervised Pretrained Models”, was authored by Shashi Kumar, Srikanth Madikeri, Juan Zuluaga-Gomez, Esaú Villatoro-Tello, Iuliia Thorbecke, Petr Motlicek, Manjunath K E, and Aravind Ganapathiraju. It was published in the ICASSP 2025 proceedings and is available via our publication page.
By sharing these findings with both the research community and industry stakeholders, the team aims to foster further collaboration and awareness around the integration of audio, speech, and language models in real-world applications.
Their work demonstrates how the ELOQUENCE project is helping to redefine the future of conversational AI, one model at a time.
