International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025
VibE: A Visual Analytics Workflow for Semantic Error Analysis of CVML Models at Subgroup...
Effective error analysis is critical for the successful development and deployment of CVML models. One approach to understanding model errors is to summarize...
SELMA: A Speech-Enabled Language Model for Virtual Assistant Interactions
In this work, we present and evaluate SELMA, a Speech-Enabled Language Model for virtual Assistant interactions that integrates audio and text as inputs...
M2R2: Mixture of Multi-Rate Residuals for Efficient Transformer Inference
Residual transformations enhance the representational depth and expressive power of large language models (LLMs). However, applying static residual transformations across all tokens in...
Towards Automatic Assessment of Self-Supervised Speech Models Using Rank
This study explores using embedding rank as an unsupervised evaluation metric for general-purpose speech encoders trained via self-supervised learning (SSL). Traditionally, assessing the...
DR-MPC: Deep Residual Model Predictive Control for Real-World Social Navigation
How can a robot safely navigate around people with complex motion patterns? Deep Reinforcement Learning (DRL) in simulation holds some promise, but much...
Towards AI-Driven Sign Language Generation with Non-Manual Markers
Sign languages are essential for the Deaf and Hard-of-Hearing (DHH) community. Sign language generation systems have the potential to support communication by translating...
When Does a Predictor Know Its Own Loss?
Given a predictor and a loss function, how well can we predict the loss that the predictor will incur on an input? This...
An Efficient and Streaming Audio Visual Active Speaker Detection System
This paper delves into the challenging task of Active Speaker Detection (ASD), where the system needs to determine in real time whether a person...
Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis
In this paper, we propose a new task - generating speech from videos of people and their transcripts (VTTS) - to motivate new...