Stepping in the Right Direction: Optimizing Machine Learning
Machine learning models, also called artificial intelligence models, have made astonishing leaps in complexity in recent years and are entering the public realm en masse. Highly visible examples are the chatbot ChatGPT and the image generator DALL·E 2. Despite these advances, a lot of work is still needed to make machine learning more efficient, reliable, safe and fair before it can be fully integrated into society — whether in the context of a text generator or in high-stakes technology like self-driving cars.
Flatiron Research Fellow Neha Wadia is one of the researchers striving to make machine learning more efficient and reliable. One project she’s working on applies methods from numerical analysis to train machine learning models more efficiently.
Wadia joined the Flatiron Institute’s Center for Computational Mathematics, or CCM, in 2022. Prior to that she earned her Ph.D. in biophysics from the University of California, Berkeley, her master’s in physics from the Perimeter Institute for Theoretical Physics, and her bachelor’s in physics from Amherst College. She was a junior research fellow at the National Centre for Biological Sciences in Bangalore, India.
Wadia recently spoke with the Simons Foundation about her work and the future of machine learning. The conversation has been edited for clarity.
What machine learning projects are you working on?
One of the main projects I’m working on right now is in the area of optimization, which is the technical workhorse of machine learning. In machine learning, we typically begin with a dataset and a model, which we train to learn a function of the data. Training is achieved by solving an optimization problem in which we minimize some measure of the error in the model’s performance by tuning its parameters. For example, if we’re training a model to identify humans in images, we can measure performance by the number of images the model labels incorrectly.
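To make this concrete, here is a minimal sketch in Python of that kind of error measure. The labels, the predictions and the zero-one error below are illustrative assumptions, not details of any particular model.

```python
import numpy as np

# Hypothetical ground-truth labels for five images:
# 1 means "contains a human," 0 means it does not.
true_labels = np.array([1, 0, 1, 1, 0])

# Hypothetical outputs from the model being trained.
predictions = np.array([1, 1, 1, 0, 0])

# One simple measure of error: the fraction of images labeled incorrectly.
error = np.mean(predictions != true_labels)
print(error)  # 0.4, i.e., 2 of the 5 images are misclassified
```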
Here’s an optimization analogy: Imagine you’re walking in a mountainous landscape and trying to find the deepest valley. A reasonable strategy is to look around you for the steepest slope and take a step in that direction. Hopefully, by repeating this process over and over, you will eventually reach the deepest point, or at least the deepest point that you could have reached given your initial location. In this analogy, each point on the landscape is a possible set of parameter values for the model, and the height of the landscape is the measure of the error of the model. Walking corresponds to tuning the model parameters.
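In code, that walk downhill is essentially gradient descent. Below is a minimal sketch on a toy bowl-shaped landscape; the quadratic loss, the starting point and the fixed step size are arbitrary choices for illustration, not details of Wadia’s work.

```python
import numpy as np

def loss(params):
    # Height of the landscape: a simple bowl whose lowest point is at (3, -2).
    return (params[0] - 3.0) ** 2 + (params[1] + 2.0) ** 2

def gradient(params):
    # Direction of steepest ascent; we walk the opposite way.
    return np.array([2.0 * (params[0] - 3.0), 2.0 * (params[1] + 2.0)])

params = np.array([0.0, 0.0])   # starting point on the landscape
step_size = 0.1                 # how far to walk at each iteration

for _ in range(50):
    params = params - step_size * gradient(params)

print(params, loss(params))    # params approach (3, -2); the loss approaches 0
```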
It turns out that it is not hard to compute the direction in which to take the next step. What is difficult to decide is how big of a step to take. To understand this, think of walking on a bowl-shaped surface. If you take too long a step, you might overshoot the bottom and end up on the other side of the bowl. If you take too small a step, you may exhaust your computational budget before reaching the bottom. We have to choose the size of each step so that we don’t take too many steps but also don’t miss the valley we’re looking for.
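Continuing the same toy picture in one dimension, a short experiment shows the trade-off: with a fixed budget of iterations, a step that is too large overshoots and diverges, while a step that is too small runs out of budget far from the bottom. The particular loss, budget and step sizes are arbitrary choices for illustration.

```python
def grad(x):
    return 2.0 * x                 # gradient of the toy loss L(x) = x**2, minimized at x = 0

for step_size in (1.1, 0.01, 0.4):
    x = 5.0                        # starting point
    for _ in range(20):            # a fixed "computational budget" of 20 steps
        x = x - step_size * grad(x)
    print(f"step size {step_size}: ended at x = {x:.4f}")

# step size 1.1  overshoots the bottom and lands farther away each time (divergence)
# step size 0.01 is stable but still far from the bottom when the budget runs out
# step size 0.4  reaches the bottom comfortably within the budget
```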
To choose the right-size steps, we often use what are called adaptive step size methods, a class of algorithms that adjust the step length at each iteration in a way that respects the trade-off between making fast progress and staying within the computational budget. The downside of these methods is that they are often hard to interpret and can require a larger computational budget per iteration than some other methods. For both of these reasons, I am working on developing new, efficient and interpretable adaptive methods for optimization. Along with my collaborators, I am borrowing techniques from numerical analysis (the area of mathematics concerned with simulating dynamical systems, among other things) and applying them in optimization. Roughly speaking, the main idea is to reformulate the training process of a machine learning model as a dynamical system and to leverage existing efficient and principled techniques from numerical analysis to simulate that system.
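To give a flavor of that dynamical-systems view, here is a generic, textbook-style sketch, not Wadia’s actual method: training is written as gradient flow, and the step size is controlled the way adaptive ODE solvers do it, by comparing one full step against two half steps to estimate the local error. The toy loss, the tolerance and the step-size cap are illustrative assumptions.

```python
def grad(x):
    return 2.0 * x          # gradient of the toy loss L(x) = x**2

def euler_step(x, h):
    return x - h * grad(x)  # one explicit Euler step of the gradient flow dx/dt = -grad L(x)

x, h, tol = 5.0, 0.5, 1e-3
for _ in range(200):
    full = euler_step(x, h)                         # one step of size h
    half = euler_step(euler_step(x, h / 2), h / 2)  # two steps of size h/2
    err = abs(full - half)                          # local error estimate, as in step-doubling ODE solvers
    if err < tol:
        x = half                                    # accept the more accurate of the two
        if abs(grad(x)) < 1e-8:                     # stop once the slope is essentially flat
            break
    # Grow or shrink the step size based on the error estimate;
    # the cap at 0.45 keeps the explicit Euler step stable on this toy problem.
    h = min(0.45, h * min(2.0, max(0.2, 0.9 * (tol / (err + 1e-16)) ** 0.5)))

print(x)  # x ends up very close to the minimum at 0
```

The appeal of this style of controller is interpretability: the step size is set by an explicit rule tied to a quantity you can reason about, rather than tuned by trial and error.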
It’s been really useful to be working here on this project because the CCM has one of the best numerical analysis groups in the world. I’ve been learning proof techniques from the group so that I can show that the numerical analysis machinery also works in the context of optimization. It’s great to be able to walk next door and lean on their expertise. It’s cool to see ideas from one field of mathematics used successfully in another. It celebrates the unity of the computational sciences in a beautiful way.
Are you close to large-scale applications?
Not yet. Preliminary experimental results seem to indicate that the method works really well and is very computationally efficient on small scales — models with a couple hundred parameters. This is not a lot by modern standards, where models often have millions of parameters. I’ll soon start coding the method up on a larger scale and see how it performs.
Given the results I’m seeing so far in my experiments, I think my work will certainly be of interest in the optimization theory community. If the method performs well at larger scales, it may also be of interest to the folks building the large models. It would provide an interpretable alternative to current adaptive methods that are used at large scales.
One interesting thing I find in machine learning is that the answers to the kinds of “first-principles” questions you might ask in order to understand how some part of the pipeline works usually also come with practical implications or generate new algorithms. The work I am doing now on efficient adaptive step size methods grew out of trying to understand the dynamics of commonly used optimization algorithms in machine learning. The dynamics are crucial because they influence exactly what is learned; if we can better understand, and thus better control, the dynamics, we could directly engineer greater efficiency and other properties we care about into our models.
Efficiency is important because large models require tremendous computing power and have large carbon footprints. The training process of a large model can emit as much carbon dioxide as is released by several hundred flights between New York and San Francisco, calculated per passenger. Clearly, the more efficiently we can train, the more we can reduce carbon emissions.
Another property we care about that we know is heavily influenced by optimization dynamics is robustness, which is critically important when machine learning models are used in a societal context, especially when their outputs are the basis for potentially high-stakes decision-making. For example, if the image recognition system in a self-driving car cannot reliably differentiate between a stop sign and a speed limit sign when they’re covered in graffiti, it could create a dangerous situation.
Where do you see machine learning in a decade?
Well, 10 years ago, if you told a machine learning researcher that we’d have these generative models — like DALL·E 2 and ChatGPT — in a decade, they would not have believed you.
We’re at an exciting point in this field because the people who are building the models are really surprising us with what those models can do, in terms of new applications like chatbots and art generators. I suspect that will continue in ways I can’t even predict.
However, there are also huge challenges, both on the research end and on the societal end, which are intertwined. I believe that achieving an understanding of training dynamics will allow us to address many problems at least partially, including the lack of robustness, efficiency and fairness.
Another big problem that I hope we will have made progress on in 10 years is how to assign an error bar to the accuracy of a model. For example, if you’re using an image recognition algorithm to assist a doctor in identifying cancerous tumors, you need to be able to give some guarantee on how often the model is correct. The whole science of uncertainty quantification in machine learning is almost completely missing at the moment.
I have two other hopes for machine learning research in the near future. First, while we are pushing the limits of what machine learning models can do, I think it is equally important to understand what they can’t do. As a researcher in this field, it makes me nervous when models are used in the public sphere — or even in science itself — despite the fact that we don’t understand their limitations. Second, it’s also quite clear to me that even those of us in machine learning who are mostly doing pen-and-paper research are having impacts on the world. As scientists, we are not trained to think about the impact our research and our opinions can have. That needs to change. I think we need to be more vocal about how our work is used.