Revealing the Utilized Rank of Subspaces of Learning in Neural Networks
In this work, we study how well the learned weights of a neural network utilize the space available to them. This notion is...
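The excerpt cuts off before the paper's definition; as orientation, one common proxy for how much of its available rank a weight matrix actually uses is the number of non-negligible singular values. A minimal NumPy sketch, where the threshold (and the metric itself) is our assumption, not the paper's:

```python
import numpy as np

def utilized_rank(W, tol=1e-3):
    """Count singular values above a fraction of the largest one.

    An illustrative proxy for how much of its available rank a weight
    matrix actually uses; the threshold is an assumption, not the
    paper's definition.
    """
    s = np.linalg.svd(W, compute_uv=False)  # sorted descending
    return int(np.sum(s > tol * s[0]))

rng = np.random.default_rng(0)
W_full = rng.standard_normal((256, 64))                         # generic: full rank
W_low = rng.standard_normal((256, 8)) @ rng.standard_normal((8, 64))
print(utilized_rank(W_full), "of", 64)   # 64 of 64
print(utilized_rank(W_low), "of", 64)    # 8 of 64
```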
Enhancing CTC-based Speech Recognition with Diverse Modeling Units
In recent years, the evolution of end-to-end (E2E) automatic speech recognition (ASR) models has been remarkable, largely due to advances in deep learning...
Bytes Are All You Need: Transformers Operating Directly On File Bytes
Modern deep learning approaches usually utilize modality-specific processing. For example, the most common deep learning approach to image classification involves decoding image file...
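The excerpt contrasts this modality-specific decoding with operating directly on file bytes. A hedged sketch of the byte-as-token idea, where every size, layer count, and the mean-pooling head are illustrative choices rather than the paper's architecture:

```python
import torch
import torch.nn as nn

class ByteClassifier(nn.Module):
    """Toy classifier over raw file bytes: each byte (0-255) is a token."""
    def __init__(self, num_classes=10, dim=128, depth=2):
        super().__init__()
        self.embed = nn.Embedding(256, dim)  # one embedding per byte value
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, byte_ids):            # (batch, seq_len) ints in [0, 255]
        h = self.encoder(self.embed(byte_ids))
        return self.head(h.mean(dim=1))     # mean-pool over positions, classify

raw = open(__file__, "rb").read()[:512]    # leading bytes of any file
x = torch.tensor(list(raw)).unsqueeze(0)
print(ByteClassifier()(x).shape)           # torch.Size([1, 10])
```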
On Computationally Efficient Multi-Class Calibration
Consider a multi-class labelling problem, where the labels can take values in [k], and a predictor predicts a distribution over the labels. In...
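For orientation, the standard notion of canonical multi-class calibration reads as follows; the paper's precise variant and relaxations may differ:

```latex
% A predictor p : X -> \Delta_k is (canonically) calibrated if, conditioned
% on its own prediction, the true label distribution matches that prediction:
\mathbb{E}\big[\, e_y \;\big|\; p(x) = v \,\big] = v
\qquad \text{for every } v \text{ in the support of } p(x),
% where e_y is the one-hot encoding of the label y and \Delta_k is the
% probability simplex over [k].
```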
Omnipredictors for Regression and the Approximate Rank of Convex Functions
Consider the supervised learning setting where the goal is to learn to predict labels y given points x from a distribution. An omnipredictor...
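The omnipredictor notion from this line of work, stated informally; the regression setting of this paper refines it, so treat this as orientation only:

```latex
% f is an (L, C)-omnipredictor if, for every loss l in the class L, some
% simple post-processing k_l of f's prediction competes with the best c in C:
\mathbb{E}\big[\ell\big(k_\ell(f(x)),\, y\big)\big]
\;\le\; \min_{c \in \mathcal{C}} \mathbb{E}\big[\ell\big(c(x),\, y\big)\big]
+ \varepsilon
\qquad \text{for every loss } \ell \in \mathcal{L}.
```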
Superposition Prompting: Improving and Accelerating Retrieval-Augmented Generation
Despite their successes, large language models (LLMs) exhibit significant drawbacks, particularly when processing long contexts. Their inference cost scales quadratically with...
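The quadratic scaling referred to comes from standard self-attention, which forms an n-by-n score matrix over a length-n context:

```latex
\mathrm{Attn}(Q, K, V)
= \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V,
\qquad Q, K, V \in \mathbb{R}^{n \times d},
% so forming Q K^T already costs O(n^2 d) time and O(n^2) memory
% in the context length n.
```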
How Smooth Is Attention?
Self-attention and masked self-attention are at the heart of Transformers' outstanding success. Still, our mathematical understanding of attention, in particular of its Lipschitz...
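For concreteness, unmasked self-attention acts on a sequence as below, and the Lipschitz question asks for the smallest constant L bounding how much the output can move when the input does. On an unbounded domain, dot-product self-attention is known not to be Lipschitz, which is why such bounds are typically sought on restricted domains:

```latex
% Row-wise softmax self-attention as a map on sequences X in R^{n x d}:
f(X) = \operatorname{softmax}\!\left(
    \frac{X W_Q W_K^{\top} X^{\top}}{\sqrt{d}}
\right) X W_V .
% Its Lipschitz constant on a domain D is the smallest L with
\| f(X) - f(Y) \| \le L \, \| X - Y \|
\qquad \text{for all } X, Y \in D .
```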
Optimization Without Retraction on the Random Generalized Stiefel Manifold
Optimization over the set of matrices X that satisfy X^T B X = I_p, referred to as the generalized Stiefel manifold, appears in many applications...
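One concrete way such points arise is from symmetric-definite generalized eigenvalue problems, as in CCA or ICA: SciPy normalizes the eigenvectors to be B-orthonormal, i.e. to lie on this manifold. A minimal check, with illustrative data:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
n, p = 6, 3

# Symmetric positive-definite B and symmetric A (illustrative data).
M = rng.standard_normal((n, n))
B = M @ M.T + n * np.eye(n)
A = rng.standard_normal((n, n))
A = A + A.T

# eigh(A, B) solves A v = lambda B v and normalizes eigenvectors so that
# V^T B V = I; the first p columns are thus a point on the generalized
# Stiefel manifold {X : X^T B X = I_p}.
_, V = eigh(A, B)
X = V[:, :p]
print(np.allclose(X.T @ B @ X, np.eye(p)))  # True
```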
Careful With That Scalpel: Improving Gradient Surgery With an EMA
Beyond minimizing a single training loss, many deep learning estimation pipelines rely on an auxiliary objective to quantify and encourage desirable properties of...
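The excerpt stops before the method; as a hedged sketch of the general "gradient surgery with an EMA" idea, one can smooth the main-loss gradient with an exponential moving average and strip from the auxiliary gradient its component along that smoothed direction. The update rule and hyperparameters below are our illustration, not necessarily the paper's exact algorithm:

```python
import numpy as np

def surgery_step(g_main, g_aux, ema, beta=0.9, lam=0.1, lr=1e-2):
    """One illustrative gradient-surgery step using an EMA.

    Smooths the main-loss gradient with an EMA and removes from the
    auxiliary gradient its component along that smoothed direction before
    combining, so the auxiliary term does not, on average, push against
    the primary objective.
    """
    ema = beta * ema + (1 - beta) * g_main
    u = ema / (np.linalg.norm(ema) + 1e-12)   # smoothed main direction
    g_aux_orth = g_aux - (g_aux @ u) * u      # drop component along it
    return lr * (g_main + lam * g_aux_orth), ema

g_main = np.array([1.0, 0.0])
g_aux = np.array([0.5, 2.0])   # auxiliary gradient with a component along g_main
step, ema = surgery_step(g_main, g_aux, ema=np.zeros(2))
print(step)                    # auxiliary pull along the main direction removed
```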
Accurate Knowledge Distillation via N-best Reranking
We propose utilizing n-best reranking to enhance Sequence-Level Knowledge Distillation (Kim and Rush, 2016), where we extract pseudo-labels for the student model’s training data...
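A hedged sketch of the n-best reranking step: the teacher proposes n candidates and a weighted combination of scorers picks the pseudo-label. Here teacher_generate and the scorer interface are hypothetical hooks, not the paper's API:

```python
def distill_pseudo_label(teacher_generate, scorers, weights, source, n=8):
    """Select a pseudo-label by reranking the teacher's n-best list.

    teacher_generate(source, n) is a hypothetical hook returning n candidate
    hypotheses; each scorer maps (source, hypothesis) to a float. A weighted
    sum of scores reranks the candidates, so the student trains on a better
    pseudo-label than the teacher's plain top-1 output.
    """
    candidates = teacher_generate(source, n)
    def total(hyp):
        return sum(w * s(source, hyp) for w, s in zip(weights, scorers))
    return max(candidates, key=total)

# Toy usage with stand-in components.
fake_teacher = lambda src, n: [src.upper(), src.title(), src]
length_score = lambda src, hyp: -abs(len(hyp) - len(src))
match_score = lambda src, hyp: sum(a == b for a, b in zip(src, hyp))
print(distill_pseudo_label(fake_teacher, [length_score, match_score],
                           [1.0, 0.5], "hello world"))
```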