VeCLIP: Improving CLIP Training via Visual-enriched Captions
Large-scale web-crawled datasets are fundamental for the success of pre-training vision-language models, such as CLIP. However, the inherent noise and potential...
Revisit Large-Scale Image–Caption Data in Pre-training Multimodal Foundation Models
Recent advancements in multimodal models highlight the value of rewritten captions for improving performance, yet key challenges remain. Notably, the role of synthetic...
Disentangled Representational Learning with the Gromov-Monge Gap
Learning disentangled representations from unlabelled data is a fundamental challenge in machine learning. Solving it may unlock other problems, such as generalization, interpretability,...
ELEGNT: Expressive and Functional Movement Design for Non-Anthropomorphic Robot
Nonverbal behaviors such as posture, gestures, and gaze are essential for conveying internal states, both consciously and unconsciously, in human interaction. For robots...
Pseudo-Generalized Dynamic View Synthesis from a Video
Rendering scenes observed in a monocular video from novel viewpoints is a challenging problem. For static scenes the community has studied both...
Construction of Paired Knowledge Graph – Text Datasets Informed by Cyclic Evaluation
Datasets that pair Knowledge Graphs (KG) and text together (KG-T) can be used to train forward and reverse neural models that generate text...
Transfer Learning for Structured Pruning under Limited Task Data
This paper was accepted at the Efficient Natural Language and Speech Processing (ENLSP-III) Workshop at NeurIPS.
Large, pre-trained models are problematic to use in...
Acoustic Model Fusion for End-to-end Speech Recognition
Recent advances in deep learning and automatic speech recognition (ASR) have enabled the end-to-end (E2E) ASR system and boosted its accuracy to a...
Model Compression in Practice: Lessons Learned from Practitioners Creating On-device Machine Learning Experiences
On-device machine learning (ML) promises to improve the privacy, responsiveness, and proliferation of new, intelligent user experiences by moving ML computation onto everyday...
FocalLens: Instruction Tuning Enables Zero-Shot Conditional Image Representations
This paper was accepted at the Workshop on Foundation Models in the Wild at ICLR 2025.
Visual understanding is inherently contextual - what we...