An Efficient and Streaming Audio Visual Active Speaker Detection System
This paper addresses the challenging task of Active Speaker Detection (ASD), where the system must determine in real time whether a person...
Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis
In this paper, we propose a new task, generating speech from videos of people and their transcripts (VTTS), to motivate new...
Exploring Prediction Targets in Masked Pre-Training for Speech Foundation Models
Speech foundation models, such as HuBERT and its variants, are pre-trained on large amounts of unlabeled speech data and then used for a...
dMel: Speech Tokenization Made Simple
Large language models have revolutionized natural language processing by leveraging self-supervised pretraining on vast textual data. Inspired by this success, researchers have investigated...
Novel View Synthesis with Pixel-Space Diffusion Models
Synthesizing a novel view from a single input image is a challenging task. Traditionally, this task was approached by estimating scene depth, warping,...
ARMOR: Egocentric Perception for Humanoid Robot Collision Avoidance and Motion Planning
Humanoid robots have significant gaps in their sensing and perception, making it hard to perform motion planning in dense environments. To address this,...
Transfer Learning in Scalable Graph Neural Network for Improved Physical Simulation
In recent years, graph neural network (GNN)-based models have shown promising results in simulating complex physical systems. However, training dedicated graph network simulator...
FlexTok: Resampling Images into 1D Token Sequences of Flexible Length
This work was done in collaboration with the Swiss Federal Institute of Technology Lausanne (EPFL).
Image tokenization has enabled major advances in autoregressive image...
From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons
We examine the capability of Multimodal Large Language Models (MLLMs) to tackle diverse domains that extend beyond the traditional language and vision tasks...
KV Prediction for Improved Time to First Token
Inference with transformer-based language models begins with a prompt processing step. In this step, the model generates the first output token and stores...