Contextualization of ASR with LLM Using Phonetic Retrieval-Based Augmentation
Large language models (LLMs) have shown a superb capability for modeling multimodal signals, including audio and text, allowing the model to generate spoken or...
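The truncated abstract stops before describing the retrieval step, but the "phonetic retrieval-based augmentation" in the title is commonly realized by matching a recognized span against a personal catalog by phoneme similarity. The snippet below is a hypothetical sketch of that generic idea only; the `phonetic_retrieve` helper, the catalog layout, and the similarity measure are assumptions, not the paper's method.

```python
from difflib import SequenceMatcher

def phonetic_retrieve(hyp_phonemes, catalog):
    """Return the catalog entry whose phoneme sequence is closest to the
    (possibly misrecognized) hypothesis span. Similarity here is a plain
    sequence-match ratio; the paper's actual retriever may differ."""
    def similarity(a, b):
        return SequenceMatcher(None, a, b).ratio()
    return max(catalog, key=lambda entry: similarity(hyp_phonemes, entry["phonemes"]))

# Hypothetical contact catalog with precomputed phoneme sequences.
catalog = [
    {"name": "Katie", "phonemes": ["K", "EY", "T", "IY"]},
    {"name": "Cole",  "phonemes": ["K", "OW", "L"]},
]

# An ASR hypothesis rendered as "Cady" retrieves the phonetically closest entry.
print(phonetic_retrieve(["K", "EY", "D", "IY"], catalog)["name"])  # -> "Katie"
```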
Compress and Compare: Interactively Evaluating Efficiency and Behavior Across ML Model Compression Experiments
To deploy machine learning models on-device, practitioners use compression algorithms to shrink and speed up models while maintaining their high-quality output. A...
Generalizable Error Modeling for Human Data Annotation: Evidence from an Industry-Scale Search Data Annotation...
Machine learning (ML) and artificial intelligence (AI) systems rely heavily on human-annotated data for training and evaluation. A major challenge in this context...
Misty: UI Prototyping Through Interactive Conceptual Blending
UI prototyping often involves iterating and blending elements from examples such as screenshots and sketches, but current tools offer limited support for incorporating...
Optimizing Byte-level Representation for End-to-End ASR
This paper was accepted at the IEEE Spoken Language Technology Workshop (SLT) 2024.
In this paper, we propose an algorithm to optimize a byte-level...
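The abstract is cut off before the proposed algorithm is described. As background, the sketch below only illustrates the vanilla byte-level representation such work builds on, where text is tokenized as raw UTF-8 bytes; the optimized representation the paper proposes is not reproduced here.

```python
def to_byte_tokens(text: str) -> list[int]:
    """Represent text as a sequence of UTF-8 byte IDs (0-255).
    This is the plain byte-level representation, not the optimized
    scheme proposed in the paper."""
    return list(text.encode("utf-8"))

def from_byte_tokens(tokens: list[int]) -> str:
    """Decode byte IDs back to text, skipping any invalid byte
    sequences a model might emit."""
    return bytes(tokens).decode("utf-8", errors="ignore")

# A multilingual string fits in a 256-symbol vocabulary, at the cost of
# longer sequences for non-Latin scripts.
tokens = to_byte_tokens("hello 你好")
print(tokens)                    # [104, 101, 108, 108, 111, 32, 228, 189, 160, 229, 165, 189]
print(from_byte_tokens(tokens))  # "hello 你好"
```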
Classifier-Free Guidance Is a Predictor-Corrector
We investigate the unreasonable effectiveness of classifier-free guidance (CFG). CFG is the dominant method of conditional sampling for text-to-image diffusion models, yet unlike other aspects...
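For reference, the standard CFG update combines the conditional and unconditional noise predictions with a guidance weight. The sketch below shows only that textbook combination; the predictor-corrector interpretation developed in the paper is not shown, and the helper name and toy values are illustrative.

```python
import numpy as np

def cfg_combine(eps_uncond: np.ndarray, eps_cond: np.ndarray, w: float) -> np.ndarray:
    """Standard classifier-free guidance: extrapolate from the unconditional
    noise prediction toward the conditional one. With this convention,
    w = 1 recovers plain conditional sampling and w > 1 sharpens guidance."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy example with dummy noise predictions and a guidance weight of 7.5,
# a value commonly used for text-to-image diffusion models.
eps_u = np.zeros(4)
eps_c = np.ones(4)
print(cfg_combine(eps_u, eps_c, w=7.5))  # [7.5 7.5 7.5 7.5]
```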
Apple Workshop on Privacy-Preserving Machine Learning 2024
At Apple, we believe privacy is a fundamental human right. It’s also one of our core values, influencing both our research and the...
Positional Description for Numerical Normalization
We present a Positional Description Scheme (PDS) tailored for digit sequences, integrating placeholder value information for each digit. Given the structural limitations of...
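The abstract states only that each digit is paired with its place-value information. The sketch below is a hypothetical illustration of that general idea; the function name and the verbalized output format are assumptions, and the paper's actual PDS format may differ.

```python
def describe_digits(number: str) -> str:
    """Pair each digit with its place value, e.g. '8704' ->
    '8 thousands 7 hundreds 0 tens 4 units'. Illustrative only;
    the paper's PDS may encode this information differently."""
    places = ["units", "tens", "hundreds", "thousands",
              "ten-thousands", "hundred-thousands", "millions"]
    n = len(number)
    assert n <= len(places), "extend the place list for longer numbers"
    return " ".join(f"{d} {places[n - 1 - i]}" for i, d in enumerate(number))

print(describe_digits("8704"))  # 8 thousands 7 hundreds 0 tens 4 units
```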
AV-CPL: Continuous Pseudo-Labeling for Audio-Visual Speech Recognition
Audio-visual speech contains synchronized audio and visual information that provides cross-modal supervision to learn representations for both automatic speech recognition (ASR) and visual...