Poly-View Contrastive Learning
Contrastive learning typically matches pairs of related views among a number of unrelated negative views.
Views can be generated (e.g. by augmentations) or be...
MOFI: Learning Image Representation from Noisy Entity Annotated Images
In this paper, we introduce a novel approach to automatically assign entity labels to images from existing noisy image-text pairs. The approach employs...
How Far Are We from Intelligent Visual Deductive Reasoning?
This paper was accepted at the How Far Are We from AGI? workshop at ICLR 2024.
Vision-Language Models (VLMs) such as GPT-4V have recently...
Guiding Instruction-based Image Editing via Multimodal Large Language Models
Instruction-based image editing improves the controllability and flexibility of image manipulation via natural commands without elaborate descriptions or regional masks. However, human instructions...
Pseudo-Generalized Dynamic View Synthesis from a Video
Rendering scenes observed in a monocular video from novel viewpoints is a challenging problem. For static scenes the community has studied both...
When can transformers reason with abstract symbols?
We investigate the capabilities of transformer models on relational reasoning tasks. In these tasks, models are trained on a set of strings encoding...
Vanishing Gradients in Reinforcement Finetuning of Language Models
Pretrained language models are commonly adapted to comply with human intent and downstream tasks via finetuning. The finetuning process involves supervised finetuning (SFT),...
Think While You Write: Hypothesis Verification Promotes Faithful Knowledge-to-Text Generation
Neural knowledge-to-text generation models often struggle to faithfully generate descriptions for the input facts: they may produce hallucinations that contradict the given facts,...
Label-Efficient Sleep Staging Using Transformers Pre-trained with Position Prediction
Sleep staging is a clinically important task for diagnosing various sleep disorders but remains challenging to deploy at scale because it requires clinical...
CatLIP: CLIP-level Visual Recognition Accuracy with 2.7× Faster Pre-training on Web-scale Image-Text Data
Contrastive learning has emerged as a transformative method for learning effective visual representations through the alignment of image and text embeddings. However, pairwise...