334 Publications

Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models

Frederik Kunstner, Robin Yadav, Alan Milligan, Mark Schmidt, A. Bietti

Adam has been shown to outperform gradient descent on large language models by a larger margin than on other tasks, but it is unclear why. We show that a key factor in this performance gap is the heavy-tailed class imbalance found in language tasks. When trained with gradient descent, the loss of infrequent words decreases more slowly than the loss of frequent ones. This leads to a slow decrease in the average loss, as most samples come from infrequent words. On the other hand, Adam and sign-based methods are less sensitive to this problem. To establish that this behavior is caused by class imbalance, we show empirically that it can be reproduced across architectures and data types, on language transformers, vision CNNs, and linear models. On a linear model with cross-entropy loss, we show that class imbalance leads to imbalanced, correlated gradients and Hessians that have been hypothesized to benefit Adam. We also prove that, in continuous time, gradient descent converges slowly on low-frequency classes while sign descent does not.
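
As a rough illustration of the phenomenon (a toy sketch, not the authors' code): train a softmax linear model on data with a Zipf-like label distribution and compare plain gradient descent to sign descent, tracking the loss on the frequent and infrequent halves of the classes separately.

```python
# Toy sketch (not the paper's code): softmax linear model under a Zipf-like
# (heavy-tailed) label distribution, comparing gradient descent with sign descent
# and tracking the loss on frequent vs. infrequent classes separately.
import numpy as np

rng = np.random.default_rng(0)
n, d, c = 2000, 32, 50
freq = 1.0 / np.arange(1, c + 1)
freq /= freq.sum()                                          # Zipf-like class frequencies
y = rng.choice(c, size=n, p=freq)
X = rng.normal(size=(n, d)) + 0.5 * rng.normal(size=(c, d))[y]  # class-dependent means

def split_loss(W):
    """Mean cross-entropy on the frequent half and the infrequent half of the classes."""
    logits = X @ W.T
    m = logits.max(axis=1, keepdims=True)
    logp = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    nll = -logp[np.arange(n), y]
    return nll[y < c // 2].mean(), nll[y >= c // 2].mean()

def grad(W):
    logits = X @ W.T
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(n), y] -= 1.0
    return p.T @ X / n

def train(transform, lr, steps=1000):
    W = np.zeros((c, d))
    for _ in range(steps):
        W -= lr * transform(grad(W))
    return split_loss(W)

gd_freq, gd_rare = train(lambda g: g, lr=0.05)      # gradient descent
sd_freq, sd_rare = train(np.sign, lr=0.005)         # sign descent (Adam-like behavior)
print(f"GD    frequent {gd_freq:.3f}   infrequent {gd_rare:.3f}")
print(f"sign  frequent {sd_freq:.3f}   infrequent {sd_rare:.3f}")
```

Under the paper's hypothesis, the infrequent-class loss should lag noticeably under gradient descent but much less so under sign descent.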


Good Rates From Bad Coordinates: The Exponential Average Time-dependent Rate Approach

Nicodemo Mazzaferro, Subarna Sasmal, P. Cossio, Glen M. Hocky

Our ability to calculate rate constants of biochemical processes using molecular dynamics simulations is severely limited by the fact that the time scales for reactions, or changes in conformational state, scale exponentially with the relevant free-energy barrier heights. In this work, we improve upon a recently proposed rate estimator that allows us to predict transition times with molecular dynamics simulations biased to rapidly explore one or several collective variables (CVs). This approach relies on the idea that not all bias goes into promoting transitions, and along with the rate, it estimates a concomitant scale factor for the bias termed the “CV biasing efficiency” γ. First, we demonstrate mathematically that our new formulation allows us to derive the commonly used Infrequent Metadynamics (iMetaD) estimator when using a perfect CV, where γ = 1. After testing it on a model potential, we then study the unfolding behavior of a previously well-characterized coarse-grained protein, which is sufficiently complex that we can choose many different CVs to bias, but which is sufficiently simple that we are able to compute the unbiased rate directly. For this system, we demonstrate that predictions from our new Exponential Average Time-Dependent Rate (EATR) estimator converge to the true rate constant more rapidly as a function of bias deposition time than does the previous iMetaD approach, even for bias deposition times that are short. We also show that the γ parameter can serve as a good metric for assessing the quality of the biasing coordinate. We demonstrate that these results hold when applying the methods to an atomistic protein folding example. Finally, we demonstrate that our approach works when combining multiple less-than-optimal bias coordinates, and adapt our method to the related “OPES flooding” approach. Overall, our time-dependent rate approach offers a powerful framework for predicting rate constants from biased simulations.
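
For orientation, a sketch of the rescaled-time quantities involved (notation assumed here, not taken from the paper):

```latex
% Sketch of the rescaled-time idea behind iMetaD and its EATR generalization
% (assumed notation, not the paper's exact expressions).
% iMetaD maps a biased transition time t_b to an effective unbiased time by
% exponentially reweighting the accumulated bias V(s,t):
\[
  t_{\mathrm{eff}} \;=\; \int_0^{t_b} e^{\,\beta V\left(s(t'),\,t'\right)}\,\mathrm{d}t',
  \qquad \beta = \frac{1}{k_B T}.
\]
% EATR assumes only a fraction of the bias promotes the transition, controlled by
% the CV biasing efficiency \gamma, with \gamma = 1 recovering the iMetaD limit:
\[
  t_{\mathrm{eff}}(\gamma) \;=\; \int_0^{t_b} e^{\,\gamma\beta V\left(s(t'),\,t'\right)}\,\mathrm{d}t',
\]
% and \gamma is estimated jointly with the rate from the survival statistics of
% many biased transition times.
```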


AstroCLIP: a cross-modal foundation model for galaxies

Liam Parker, Francois Lanusse, Siavash Golkar, Leopoldo Sarra, Miles Cranmer, A. Bietti, Michael Eickenberg, Geraud Krawezik, Michael McCabe, R. Morel, R. Ohana, B. Régaldo-Saint Blancard, et al.

We present AstroCLIP, a single, versatile model that can embed both galaxy images and spectra into a shared, physically meaningful latent space. These embeddings can then be used – without any model fine-tuning – for a variety of downstream tasks including (1) accurate in-modality and cross-modality semantic similarity search, (2) photometric redshift estimation, (3) galaxy property estimation from both images and spectra, and (4) morphology classification. Our approach to implementing AstroCLIP consists of two parts. First, we embed galaxy images and spectra separately by pre-training separate transformer-based image and spectrum encoders in self-supervised settings. We then align the encoders using a contrastive loss. We apply our method to spectra from the Dark Energy Spectroscopic Instrument and images from its corresponding Legacy Imaging Survey. Overall, we find remarkable performance on all downstream tasks, even relative to supervised baselines. For example, for a task like photometric redshift prediction, we find similar performance to a specifically trained ResNet18, and for additional tasks like physical property estimation (stellar mass, age, metallicity, and specific-star-formation rate), we beat this supervised baseline by 19 per cent in terms of R². We also compare our results with a state-of-the-art self-supervised single-modal model for galaxy images, and find that our approach outperforms this benchmark by roughly a factor of two on photometric redshift estimation and physical property prediction in terms of R², while remaining roughly in line in terms of morphology classification. Ultimately, our approach represents the first cross-modal self-supervised model for galaxies, and the first self-supervised transformer-based architectures for galaxy images and spectra.
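
A minimal sketch of the contrastive alignment stage (illustrative only, not the AstroCLIP implementation; `image_encoder` and `spectrum_encoder` stand in for the pre-trained transformer encoders):

```python
# Sketch of a CLIP-style alignment step between two pre-trained encoders.
# The symmetric InfoNCE loss pulls matched image/spectrum pairs together in the
# shared latent space and pushes mismatched pairs apart.
import torch
import torch.nn.functional as F

def clip_alignment_loss(image_encoder, spectrum_encoder, images, spectra, temperature=0.07):
    z_img = F.normalize(image_encoder(images), dim=-1)        # (B, D) image embeddings
    z_spec = F.normalize(spectrum_encoder(spectra), dim=-1)   # (B, D) spectrum embeddings
    logits = z_img @ z_spec.T / temperature                   # (B, B) pairwise similarities
    targets = torch.arange(len(images), device=logits.device) # i-th image matches i-th spectrum
    # Symmetric cross-entropy over rows (image -> spectrum) and columns (spectrum -> image).
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```

Once the encoders are aligned, downstream tasks consume the frozen embeddings, which is why no model fine-tuning is needed.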


High-order and adaptive optical conductivity calculations using Wannier interpolation

Lorenzo Van Muñoz, J. Kaye, A. Barnett, Sophie Beck

We present an automatic, high-order accurate, and adaptive Brillouin zone integration algorithm for the calculation of the optical conductivity with a non-zero but small broadening factor η, focusing on the case in which a Hamiltonian in a downfolded model can be evaluated efficiently using Wannier interpolation. The algorithm uses iterated adaptive integration to exploit the localization of the transport distribution near energy and energy-difference iso-surfaces, yielding polylogarithmic computational complexity with respect to η. To demonstrate the method, we compute the AC optical conductivity of a three-band tight-binding model, and are able to resolve the Drude and interband peaks with broadening in the sub-meV regime to several digits of accuracy. Our algorithm automates convergence testing to a user-specified error tolerance, providing an important tool in black-box first-principles calculations of electrical transport phenomena and other response functions.
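
To illustrate the "iterated adaptive integration" idea on a toy problem (a generic sketch, not the authors' algorithm, which uses high-order panel quadrature and Wannier-interpolated Hamiltonians rather than scipy):

```python
# Toy illustration of iterated (nested) adaptive integration over a 2D Brillouin
# zone for a Lorentzian-broadened integrand. As the broadening eta shrinks, the
# integrand localizes near the energy iso-surface and the adaptive rule refines there.
import numpy as np
from scipy.integrate import quad

eta = 1e-2                       # small broadening factor (sharper peaks as eta -> 0)

def band(kx, ky):                # toy tight-binding dispersion on the unit cell [0,1)^2
    return -2.0 * (np.cos(2 * np.pi * kx) + np.cos(2 * np.pi * ky))

def integrand(kx, ky):           # sharply peaked near the iso-surface band(k) = 0
    return eta / (band(kx, ky) ** 2 + eta ** 2)

def inner(kx):                   # adaptive integral over ky with kx held fixed
    val, _ = quad(lambda ky: integrand(kx, ky), 0.0, 1.0, limit=400)
    return val

outer, _ = quad(inner, 0.0, 1.0, limit=400)   # adaptive integral over kx
print(f"eta = {eta:g}, integral ~ {outer:.6f}")
```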


Variational Inference for Uncertainty Quantification: an Analysis of Trade-offs

C. Margossian, L. Pillaud-Vivien, L. Saul

Given an intractable distribution p, the problem of variational inference (VI) is to find the best approximation from some more tractable family Q. Commonly, one chooses Q to be a family of factorized distributions (i.e., the mean-field assumption), even though p itself does not factorize. We show that this mismatch leads to an impossibility theorem: if p does not factorize, then any factorized approximation q ∈ Q can correctly estimate at most one of the following three measures of uncertainty: (i) the marginal variances, (ii) the marginal precisions, or (iii) the generalized variance (which can be related to the entropy). In practice, the best variational approximation in Q is found by minimizing some divergence D(q, p) between distributions, and so we ask: how does the choice of divergence determine which measure of uncertainty, if any, is correctly estimated by VI? We consider the classic Kullback-Leibler divergences, the more general Rényi divergences, and a score-based divergence which compares ∇ log p and ∇ log q. We provide a thorough theoretical analysis in the setting where p is a Gaussian and q is a (factorized) Gaussian. We show that all the considered divergences can be …
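
A standard worked example in the Gaussian setting described above (notation assumed) makes the trade-off concrete:

```latex
% Two-dimensional example (standard facts, assumed notation): target p = N(0, Sigma)
% with correlation rho, factorized Gaussian approximation q = N(0, diag(s_1^2, s_2^2)).
\[
  \Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix},
  \qquad
  \Sigma^{-1} = \frac{1}{1-\rho^{2}} \begin{pmatrix} 1 & -\rho \\ -\rho & 1 \end{pmatrix}.
\]
% Minimizing the reverse KL divergence KL(q || p) matches the marginal precisions,
\[
  s_i^{2} \;=\; \bigl[(\Sigma^{-1})_{ii}\bigr]^{-1} \;=\; 1-\rho^{2} \;<\; \Sigma_{ii} = 1,
\]
% so the marginal variances are underestimated; minimizing the forward KL divergence
% KL(p || q) instead matches the marginal variances, s_i^2 = Sigma_{ii} = 1, but then
% misestimates the precisions. Neither choice recovers the generalized variance
% det(Sigma) = 1 - rho^2, illustrating that a single factorized q cannot get all three right.
```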


How Truncating Weights Improves Reasoning in Language Models

Lei Chen, Joan Bruna, A. Bietti

In addition to the ability to generate fluent text in various languages, large language models have been successful at tasks that involve basic forms of logical "reasoning" over their context. Recent work found that selectively removing certain components from weight matrices in pre-trained models can improve such reasoning capabilities. We investigate this phenomenon further by carefully studying how certain global associations tend to be stored in specific weight components or Transformer blocks, in particular feed-forward layers. Such associations may hurt predictions in reasoning tasks, and removing the corresponding components may then improve performance. We analyze how this arises during training, both empirically and theoretically, on a two-layer Transformer trained on a basic reasoning task with noise, a toy associative memory model, and on the Pythia family of pre-trained models tested on simple reasoning tasks.
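
As a concrete picture of what "removing components from a weight matrix" can mean (a sketch of one common choice, truncated SVD; which components to remove, and in which blocks, is exactly what the paper studies):

```python
# Minimal sketch (not the paper's procedure): replace a feed-forward weight matrix
# by a truncated-SVD approximation, i.e. drop the components associated with the
# smallest singular values. Only the mechanical truncation step is shown here.
import torch

@torch.no_grad()
def truncate_weight(linear: torch.nn.Linear, rank: int) -> None:
    """Overwrite `linear.weight` in place with its best rank-`rank` approximation."""
    W = linear.weight                                   # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    W_low = U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank, :]
    linear.weight.copy_(W_low)

# Example: truncate one MLP projection in a (hypothetical) transformer block.
layer = torch.nn.Linear(1024, 4096)
truncate_weight(layer, rank=64)
```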


Contextual Counting: A Mechanistic Study of Transformers on a Quantitative Task

Siavash Golkar, A. Bietti, Mariel Pettee, Michael Eickenberg, et al.

Transformers have revolutionized machine learning across diverse domains, yet understanding their behavior remains crucial, particularly in high-stakes applications. This paper introduces the contextual counting task, a novel toy problem aimed at enhancing our understanding of Transformers in quantitative and scientific contexts. This task requires precise localization and computation within datasets, akin to object detection or region-based scientific analysis. We present theoretical and empirical analysis using both causal and non-causal Transformer architectures, investigating the influence of various positional encodings on performance and interpretability. In particular, we find that causal attention is much better suited for the task, and that no positional embeddings lead to the best accuracy, though rotary embeddings are competitive and easier to train. We also show that out-of-distribution performance is tightly linked to which tokens the model uses as a bias term.


Neurosift: DANDI exploration and NWB visualization in the browser

J. Magland, J. Soules, Cody Baker, Benjamin Dichter

Neurosift, a browser-based visualization tool, is designed for the interactive exploration of Neurodata Without Borders (NWB) files, whether stored locally, on remote servers, or within the Distributed Archives for Neurophysiology Data Integration (DANDI). NWB (Rübel et al., 2022; Teeters et al., 2015) is an open data standard for neurophysiology that enables the sharing, archiving, and analysis of various types of neurophysiology data. DANDI (Rübel et al., 2022) is a cloud-based platform that supports the storage, sharing, and analysis of neurophysiology data including NWB files. With Neurosift integration, users browsing DANDI can easily open any NWB file in the browser and explore its contents, including timeseries data, images, and more. Neurosift can also be used to browse the DANDI database or individual Dandisets. Overall, Neurosift simplifies the visualization and exploration of complex NWB file structures, making it a valuable tool for neuroscientists.


Efficient convergent boundary integral methods for slender bodies

The interaction of fibers in a viscous (Stokes) fluid plays a crucial role in industrial and biological processes, such as sedimentation, rheology, transport, cell division, and locomotion. Numerical simulations generally rely on slender body theory (SBT), an asymptotic, nonconvergent approximation whose error blows up as fibers approach each other. Yet convergent boundary integral equation (BIE) methods which completely resolve the fiber surface have so far been impractical due to the prohibitive cost of layer-potential quadratures in such high aspect-ratio 3D geometries. We present a high-order Nyström quadrature scheme with aspect-ratio independent cost, making such BIEs practical. It combines centerline panels (each with a small number of poloidal Fourier modes), toroidal Green's functions, generalized Chebyshev quadratures, HPC parallel implementation, and FMM acceleration. We also present new BIE formulations for slender bodies that lead to well conditioned linear systems upon discretization. We test Laplace and Stokes Dirichlet problems, and Stokes mobility problems, for slender rigid closed fibers with (possibly varying) circular cross-section, at separations down to 1/20 of the slender radius, reporting convergence typically to at least 10 digits. We use this to quantify the breakdown of numerical SBT for close-to-touching rigid fibers. We also apply the methods to time-step the sedimentation of 512 loops with up to 1.65 million unknowns at around 7 digits of accuracy.
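
For context, the generic Stokes single-layer representation that such boundary integral solvers discretize looks as follows (standard notation; the paper's specific well-conditioned formulations differ in detail):

```latex
% Generic Stokes single-layer potential on the fiber surface Gamma with unknown
% density sigma; S is the Stokeslet (Oseen) kernel with viscosity mu.
\[
  u_i(\mathbf{x}) \;=\; \int_{\Gamma} S_{ij}(\mathbf{x}-\mathbf{y})\,\sigma_j(\mathbf{y})\,
  \mathrm{d}S_{\mathbf{y}},
  \qquad
  S_{ij}(\mathbf{r}) \;=\; \frac{1}{8\pi\mu}\left(\frac{\delta_{ij}}{|\mathbf{r}|}
  + \frac{r_i r_j}{|\mathbf{r}|^{3}}\right).
\]
% Imposing the Dirichlet velocity data on Gamma yields the linear system; the Nystrom
% scheme's task is to evaluate these nearly singular layer potentials accurately on
% long, thin fiber geometries at a cost independent of the aspect ratio.
```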


Level Set Teleportation: An Optimization Perspective

Aaron Mishkin, A. Bietti, R. M. Gower

We study level set teleportation, an optimization sub-routine which seeks to accelerate gradient methods by maximizing the gradient norm on a level set of the objective function. Since the descent lemma implies that gradient descent (GD) decreases the objective proportionally to the squared norm of the gradient, level set teleportation maximizes this one-step progress guarantee. For convex functions satisfying Hessian stability, we prove that GD with level set teleportation obtains a combined sub-linear/linear convergence rate which is strictly faster than standard GD when the optimality gap is small. This is in sharp contrast to the standard (strongly) convex setting, where we show level set teleportation neither improves nor worsens convergence rates. To evaluate teleportation in practice, we develop a projected-gradient-type method requiring only Hessian-vector products. We use this method to show that gradient methods with access to a teleportation oracle uniformly outperform their standard versions on a variety of learning problems.
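
A minimal sketch of a projected-gradient-type teleportation step using only Hessian-vector products (illustrative, with assumed step sizes; not the paper's algorithm):

```python
# Sketch of one teleportation step: ascend ||grad f||^2 using a Hessian-vector
# product, then approximately project back onto the level set {f = f(x0)}.
import torch

def teleport_step(f, x, ascent_lr=0.1, newton_steps=8):
    x = x.clone().requires_grad_(True)
    f0 = f(x).detach()                                        # level to preserve
    g = torch.autograd.grad(f(x), x, create_graph=True)[0]
    # d/dx (1/2 ||grad f||^2) = H(x) grad f(x): a single Hessian-vector product.
    hvp = torch.autograd.grad(0.5 * g.pow(2).sum(), x)[0]
    with torch.no_grad():
        x = x + ascent_lr * hvp                               # ascend the gradient norm
    # Approximately project back onto the level set along the current gradient.
    for _ in range(newton_steps):
        x = x.detach().requires_grad_(True)
        fx = f(x)
        gx = torch.autograd.grad(fx, x)[0]
        with torch.no_grad():
            x = x - ((fx - f0) / (gx.pow(2).sum() + 1e-12)) * gx
    return x.detach()

# Example on a toy quadratic with ill-conditioned curvature.
A = torch.diag(torch.tensor([1.0, 10.0]))
f = lambda x: 0.5 * x @ A @ x
x0 = torch.tensor([1.0, 0.1])
x1 = teleport_step(f, x0)
print(f(x0).item(), f(x1).item())   # roughly equal objective values; x1 has a larger gradient
```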
