Model Compression in Practice: Lessons Learned from Practitioners Creating On-device Machine Learning Experiences
On-device machine learning (ML) promises to improve the privacy, responsiveness, and proliferation of new, intelligent user experiences by moving ML computation onto everyday...
Talaria: Interactively Optimizing Machine Learning Models for Efficient Inference
On-device machine learning (ML) moves computation from the cloud to personal devices, protecting user privacy and enabling intelligent user experiences. However, fitting models...
Frequency-Aware Masked Autoencoders for Multimodal Pretraining on Biosignals
Inspired by the advancements in foundation models for language-vision modeling, we explore the utilization of transformers and large-scale pretraining on biosignals. In this...
OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework
The reproducibility and transparency of large language models are crucial for advancing open research, ensuring the trustworthiness of results, and enabling investigations into...
A Multi-signal Large Language Model for Device-directed Speech Detection
We present an architecture for device-directed speech detection that treats the task as a text-generation problem. We use a multi-modal fusion approach that...
Towards a World-English Language Model
Neural Network Language Models (NNLMs) of Virtual Assistants (VAs) are generally language-, region-, and in some cases, device-dependent, which increases the effort to...
Streaming Anchor Loss: Augmenting Supervision with Temporal Significance
Streaming neural network models for fast frame-wise responses to various speech and sensory signals are widely adopted on resource-constrained platforms. Hence, increasing the...
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
Contrastive pretraining of image-text foundation models, such as CLIP, demonstrated excellent zero-shot performance and improved robustness on a wide range of downstream...
International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024
Overcoming the Pitfalls of Vision-Language Model Finetuning for OOD Generalization
Existing vision-language models exhibit strong generalization on a variety of visual domains and tasks. However, such models mainly perform zero-shot recognition in a...