DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation
Diffusion models have become the dominant approach for visual generation. They are trained by denoising a Markovian process which gradually adds noise to...
FastVLM: Efficient Vision Encoding for Vision Language Models
Scaling the input image resolution is essential for enhancing the performance of Vision Language Models (VLMs), particularly in text-rich image understanding tasks. However,...
Disentangled Representation Learning with the Gromov-Monge Gap
Learning disentangled representations from unlabelled data is a fundamental challenge in machine learning. Solving it may unlock other problems, such as generalization, interpretability,...
CoMotion: Concurrent Multi-Person 3D Motion
We introduce an approach for detecting and tracking detailed 3D poses of multiple people from a single monocular camera stream. Our system maintains...
Step-by-Step Diffusion: An Elementary Tutorial
We present an accessible first course on the mathematics of diffusion models and flow matching for machine learning. We aim to teach diffusion...
Language Models Know More Than They Show: Exploring Hallucinations From the Model’s Viewpoint
Large language models (LLMs) often produce errors, including factual inaccuracies, biases, and reasoning failures, collectively referred to as "hallucinations". Recent studies have demonstrated...
FocalLens: Instruction Tuning Enables Zero-Shot Conditional Image Representations
This paper was accepted at the Workshop on Foundation Models in the Wild at ICLR 2025.
Visual understanding is inherently contextual - what we...
EC-DIT: Scaling Diffusion Transformers with Adaptive Expert-Choice Routing
Diffusion transformers have been widely adopted for text-to-image synthesis. While scaling these models up to billions of parameters shows promise, the effectiveness of...
Scaling Laws for Native Multimodal Models
Building general-purpose models that can effectively perceive the world through multimodal signals has been a long-standing goal. Current approaches involve integrating separately pre-trained...
Simple ReFlow: Improved Techniques for Fast Flow Models
Diffusion and flow-matching models achieve remarkable generative performance but at the cost of many sampling steps, which slows inference and limits applicability to...