Scoring the Brain: How Benchmark Datasets and Other Tools Are Solving Key Challenges in Neuroscience
Over the last decade, ‘machine learning’ has become a familiar term. Although the story of machine learning’s ascent is nuanced, many experts trace its rapid growth to a single development within the field. Since 2010, an annual contest called the ImageNet Large Scale Visual Recognition Challenge (often referred to simply as the ImageNet Challenge) has invited participants to submit a mathematical model that can best identify each subject — such as a car or a strawberry — within a dataset of pictures. In just 10 years, the accuracy of the best-performing model catapulted from just over 50% to more than 90%. Many believe two elements catalyzed this explosive growth: the field-wide adoption of a standardized dataset for tracking progress and a framework in which to do the tracking.
Over roughly the same time frame, the size and number of datasets in neuroscience have ballooned, fueled by rapid developments in experimental methods for acquiring data. These new datasets have sparked the development of new models aimed at correctly preprocessing this deluge of data, as well as models that seek to distill the data into illuminating pictures of how the brain works. However, lacking standardized datasets and benchmarks for comparing this proliferation of models, many researchers have struggled with the same battery of questions: How accurate is a model, and how does that accuracy depend on the particulars of a dataset? For a given question or brain region of interest, which model is the right one?
“For a long time it’s been clear to many of us in the field that there is no standardization of either data or models, and so it makes it very hard to know whether you’re making progress,” says Jim DiCarlo, a neuroscientist at the Massachusetts Institute of Technology, an investigator with the Simons Collaboration on the Global Brain (SCGB), and one of many researchers seeking to rectify this situation.
To overcome these challenges, a patchwork collective of investigators is spearheading initiatives across different areas of neuroscience research, from basic data processing to modeling brain activity. These initiatives employ a common set of tools, including community-based research challenges, standardized datasets, publicly available computer code and easily navigable websites, with the hope that the success the ImageNet Challenge triggered in machine learning will transfer to neuroscience.
Building a benchmark
A key ingredient for the success of the ImageNet Challenge was the establishment of a benchmark dataset that the majority of the field rallied around. Establishing a similar dataset for neuroscience, however, is much more challenging.
Take the case of spike sorting — the process of identifying action potentials, or spikes, within electrophysiological recordings. This early step in an analysis pipeline transforms one of the most commonly collected types of neuroscience data from a raw format to a usable one. But different spike-sorting algorithms can produce different results.
To create a ‘gold standard’ dataset for evaluating these algorithms, scientists need to simultaneously record from a neuron using ‘extracellular’ methods, which record the action potentials of multiple cells as well as other types of electrical activity, and from within the cell itself, which unambiguously identifies action potentials. This approach is time-consuming and feasible only for small numbers of neurons. With support from the SCGB, Dan English, a neuroscientist at Virginia Tech, is in the process of collecting such a dataset, which many in the field believe will be invaluable to spike-sorting benchmarking.
A less laborious alternative is to generate a synthetic benchmark dataset by simulating the physical processes believed to underlie the data. Such ‘synthetic data’ can be used as is or can be blended into a true electrophysiology dataset to generate a ‘hybrid-synthetic’ dataset. MEArec, a recently developed Python tool, provides this functionality and integrates nicely into commonly used spike-sorting software packages.
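As a rough illustration of the hybrid-synthetic idea, a known spike waveform can be added into a real extracellular trace at chosen times, so that the ground truth is known by construction. The sketch below is a toy version of that procedure in plain NumPy, not MEArec’s actual interface; the waveform, sampling rate and array shapes are hypothetical.

```python
import numpy as np

def make_hybrid_recording(real_trace, template, spike_times):
    """Inject a known spike waveform into a real extracellular trace.

    real_trace  : 1-D array of voltage samples from a real recording
    template    : 1-D array holding the simulated spike waveform
    spike_times : sample indices at which to add the waveform (ground truth)
    """
    hybrid = real_trace.copy()
    width = len(template)
    for t in spike_times:
        if t + width <= len(hybrid):
            hybrid[t:t + width] += template  # superimpose the known spike
    return hybrid, np.sort(np.asarray(spike_times))  # recording plus ground-truth times

# Example with noise standing in for 10 seconds of a real recording at 30 kHz
rng = np.random.default_rng(0)
real_trace = rng.normal(0, 10, size=30_000 * 10)
template = -80 * np.exp(-np.arange(60) / 15.0)  # crude spike-like waveform
spike_times = rng.choice(len(real_trace) - 100, size=200, replace=False)
hybrid, ground_truth = make_hybrid_recording(real_trace, template, spike_times)
```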
Optical microscopy methods such as calcium imaging, which monitor neural activity indirectly, face similar benchmarking challenges. A typical microscopy dataset consists of videos of twinkling neurons, and an analysis pipeline has to determine which neurons were active when, and how the fluorescence signal relates to the neurons’ true activity. Acquiring gold-standard data for benchmarking these analyses has been difficult, requiring simultaneous optical and electrophysiological recordings under a range of imaging conditions. A recent study by researchers in Switzerland, published in Nature Neuroscience in August, collected such a dataset by performing these dual recordings in many brain regions and multiple animal species, using many different types of optical ‘indicators,’ the molecules imaged by the microscope. With such a diversity of benchmarking data, the researchers were able to develop a preprocessing method that takes the idiosyncrasies of different datasets into account, making it robust to new, unseen datasets with their own quirks.
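That relationship between spiking and fluorescence is often approximated with a simple forward model: each spike adds a pulse of indicator fluorescence that decays over subsequent imaging frames, with noise on top. The toy sketch below illustrates that generative step only; it is not the method from the study, and the decay, amplitude and noise values are arbitrary.

```python
import numpy as np

def spikes_to_fluorescence(spikes, decay=0.95, amplitude=1.0, noise_sd=0.1, seed=0):
    """Turn a binary spike train into a noisy fluorescence trace.

    Each spike adds `amplitude` to a calcium variable that decays by `decay`
    per imaging frame; the observed fluorescence is that calcium plus noise.
    """
    rng = np.random.default_rng(seed)
    calcium = np.zeros(len(spikes))
    for t in range(1, len(spikes)):
        calcium[t] = decay * calcium[t - 1] + amplitude * spikes[t]
    return calcium + rng.normal(0, noise_sd, size=len(spikes))

spikes = (np.random.default_rng(1).random(500) < 0.02).astype(float)  # sparse spiking
fluorescence = spikes_to_fluorescence(spikes)
```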
However, producing these datasets is labor-intensive, and some researchers are wary of certain types of ground-truth datasets used for benchmarking optical imaging methods because they sometimes rely on humans to identify the location of cells, a procedure prone to inaccuracies. “Often manual labeling of these types of data is, how do I say this, well, it’s variable. Humans make errors all the time,” says Adam Charles, a professor of biomedical engineering at Johns Hopkins University. This is another situation where simulation can lend a hand.
The NAOMi simulator, a new method developed in a collaboration led by Charles that includes SCGB investigators Jonathan Pillow and David Tank, offers a powerful tool for generating synthetic ground-truth data for benchmarking analyses of functional microscopy data. NAOMi creates a detailed, end-to-end simulation of brain activity, along with the distortions and noise that imaging methods introduce, to yield synthetic datasets that can be used to test the accuracy of analysis tools normally applied to real imaging data. Because the ground truth is known, NAOMi lets researchers see exactly how effective their models for preprocessing functional imaging data are, and where they fall short. “The idea is to get a robust standardized method of testing these analysis methods where you could literally say all else is equal,” says Charles.
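The evaluation logic such a simulator enables is simple: because the true activity is known, any preprocessing method can be scored directly against it. Below is a schematic version of that comparison step; the simulation and pipeline calls are placeholders, not NAOMi’s interface.

```python
import numpy as np

def score_recovery(true_traces, recovered_traces):
    """Correlate each recovered activity trace with its ground-truth counterpart."""
    scores = [np.corrcoef(true, rec)[0, 1]
              for true, rec in zip(true_traces, recovered_traces)]
    return float(np.mean(scores))

# Hypothetical workflow (placeholder function names, not NAOMi's API):
# synthetic_movie, true_traces = simulate_imaging_experiment(...)  # NAOMi-style simulation
# recovered_traces = my_preprocessing_pipeline(synthetic_movie)    # the method under test
# print(score_recovery(true_traces, recovered_traces))
```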
Standardizing first steps
Accurately benchmarking early analysis steps, such as spike sorting and preprocessing of functional microscopy data, is critical because scientific results derived from this processed data can be sensitive to these initial steps. With the advent of high-density electrophysiological methods, such considerations have gone from concerning to grave.
“It’s super important because it’s the thing that we all rely on as the primary source of data — it’s our microscope, it’s our sequencing gel, it’s our telescope and voltmeter all rolled into one,” says Karel Svoboda, a group leader at the Howard Hughes Medical Institute’s Janelia Research Campus and an SCGB investigator. Although many canonical electrophysiology-based discoveries are reproducible, discrepancies exist. It’s unclear if these differences are genuine or if they simply reflect differences in the way individual groups process their data. “It’s a little shocking that we don’t understand extracellular electrophysiology very well,” says Svoboda.
SpikeForest, an initiative from the Simons Foundation’s Flatiron Institute, is one effort to standardize and benchmark spike-sorting algorithms and make them easier to use. SpikeForest focuses on three key challenges in spike sorting: curating benchmark datasets, maintaining up-to-date performance results on an easy-to-navigate website, and lowering technical barriers to using existing spike-sorting software.
The SpikeForest software suite includes hundreds of benchmark datasets, spanning gold-standard, synthetic and hybrid-synthetic data, for testing state-of-the-art spike-sorting algorithms. Whenever benchmark datasets or spike-sorting algorithms are updated, the SpikeForest team reruns the algorithms on all existing benchmark datasets and publishes accuracy metrics — measurements of how well the algorithms agree with the ground-truth data — on the SpikeForest website. The highly interactive website gives users up-to-date information on algorithms’ performance and the ability to explore the results themselves.
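A common way to compute such an accuracy metric is to match each sorted spike to a ground-truth spike within a small time tolerance and count hits, misses and false positives. The sketch below is a simplified, illustrative version of that idea, not SpikeForest’s exact scoring code.

```python
import numpy as np

def spike_accuracy(true_times, sorted_times, tol=0.001):
    """Agreement between ground-truth and sorted spike times (in seconds).

    A sorted spike is a hit if it falls within `tol` of an unmatched true
    spike. Accuracy = hits / (hits + misses + false positives).
    """
    true_times = np.sort(np.asarray(true_times))
    sorted_times = np.sort(np.asarray(sorted_times))
    matched = np.zeros(len(true_times), dtype=bool)
    hits = 0
    for s in sorted_times:
        idx = np.searchsorted(true_times, s)
        for j in (idx - 1, idx):  # nearest candidates on either side
            if 0 <= j < len(true_times) and not matched[j] and abs(true_times[j] - s) <= tol:
                matched[j] = True
                hits += 1
                break
    misses = len(true_times) - hits
    false_positives = len(sorted_times) - hits
    return hits / (hits + misses + false_positives)

print(spike_accuracy([0.10, 0.25, 0.40], [0.1005, 0.2502, 0.55]))  # 0.5
```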
One of the most worrisome findings to emerge from these types of benchmarking tools is a low concordance between different sorters for challenging cases, when the size of spikes is small compared to background noise. The discrepancies confirm the utility of benchmarking efforts and illustrate how important it is for users to be able to run different sorters on their data and share the exact details of what they did with other labs. “Currently, it’s sort of the wild west — each lab has their own workflow that’s not really transferable from one group to another,” says Jeremy Magland, a senior data scientist at the Flatiron Institute and the lead author on the SpikeForest initiative. In fact, different labs use different spike-sorting software for a somewhat embarrassing reason: “It’s just difficult to install spike-sorting software,” says Magland.
To lower the barrier to using existing software reliably, SpikeForest, in collaboration with another platform for standardizing spike sorting called SpikeInterface, released a software package that bundles together the most commonly used spike-sorting software and other necessary components. With this package, users can reliably run all the supported spike-sorting algorithms without deep technical expertise, regardless of the specifics of their computer system.
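In practice, that kind of unified workflow looks roughly like the sketch below. It assumes SpikeInterface’s documented run_sorter and ground-truth comparison entry points; exact argument names and helper functions vary across versions, and the individual sorters still need to be installed, so treat this as illustrative rather than a drop-in script.

```python
# Illustrative only: function names follow SpikeInterface's documented API,
# but exact signatures vary across versions and each sorter must be installed.
import spikeinterface.extractors as se
import spikeinterface.sorters as ss
import spikeinterface.comparison as sc

# Generate a small toy recording together with its ground-truth sorting.
recording, gt_sorting = se.toy_example(duration=10, num_channels=4, seed=0)

# Run two different sorters on the same recording through one interface.
sorting_a = ss.run_sorter(sorter_name="tridesclous", recording=recording)
sorting_b = ss.run_sorter(sorter_name="spykingcircus", recording=recording)

# Score each sorter's output against the ground truth.
cmp_a = sc.compare_sorter_to_ground_truth(gt_sorting, sorting_a)
cmp_b = sc.compare_sorter_to_ground_truth(gt_sorting, sorting_b)
print(cmp_a.get_performance())
print(cmp_b.get_performance())
```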
In the future, Magland hopes to simplify this process even further by setting up a ‘go-to’ website where users can upload their data, run a spike sorter of their choice, and get easily shareable results. “We view our initial effort as one step towards the real goal, which is to set up a software infrastructure where all spike sorters are equally easy to run and that when you do run it, you can share the provenance of exactly what you used for the input parameters and what you got for the output,” says Magland.
The field-wide adoption of data standardization and benchmarking would bring neuroscience into line with more mature scientific fields, where data reuse is standard, speeding the pace of discovery. “With benchmarked data, you can now do meta-analyses across many labs without doing all of the data analyses yourself,” says Svoboda.
Finding the most brainlike model
As important and admirable as these initiatives are, to many investigators they are simply a prelude to the deeper modeling efforts currently underway in neuroscience — using correctly processed experimental data to develop models of how the brain works. Here again, the field is confronted with the same set of challenges: establishing benchmark datasets and metrics by which to measure progress and lowering technical and sociological barriers to sharing these models among the community. Three notable efforts, covering a range of neuroscientific questions, have sprung up to address these challenges, all inspired by the success of the ImageNet Challenge.
The first, a collaborative effort called Brain-Score, led by SCGB investigator Jim DiCarlo, seeks to compare different models of the ventral visual stream, a series of brain areas thought to underlie our ability to visually identify objects. Brain-Score follows a now familiar pattern: Models are ranked based on how well they account for a set of benchmark datasets, with rankings shared via a website.
However, Brain-Score is unique in that it demands that a model predict not just one aspect of a dataset but multiple aspects — the composite ‘brain score’ is based on how well the model captures real neural responses in multiple brain regions recorded while a subject performed an object identification task, and how well it predicts the choices the subject made during the task. “There’s many things that we want to simultaneously explain,” says DiCarlo. “And that way, you can start to see — ‘What if I explain this better? Does it also explain that better?’ That’s something that falls naturally out of this approach.”
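The spirit of a composite score can be illustrated very simply: score a model against each benchmark separately, then aggregate. The toy sketch below uses plain correlations and an unweighted average; Brain-Score’s actual scoring is more careful (for example, in how it handles noise in the neural data), and the benchmark names here are hypothetical.

```python
import numpy as np

def composite_brain_score(model_predictions, benchmark_data):
    """Average a model's per-benchmark scores into a single composite number.

    `benchmark_data` maps a benchmark name (e.g., neural responses in one
    brain region, or behavioral choices) to recorded values; `model_predictions`
    maps the same names to the model's predictions. Each per-benchmark score
    here is just a Pearson correlation.
    """
    scores = {name: np.corrcoef(observed, model_predictions[name])[0, 1]
              for name, observed in benchmark_data.items()}
    return float(np.mean(list(scores.values()))), scores

# Hypothetical benchmarks: two neural recordings and one behavioral readout
rng = np.random.default_rng(0)
benchmark_data = {name: rng.normal(size=50) for name in ["V4", "IT", "behavior"]}
predictions = {name: data + rng.normal(0, 0.5, size=50)
               for name, data in benchmark_data.items()}
composite, per_benchmark = composite_brain_score(predictions, benchmark_data)
print(composite, per_benchmark)
```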
Brain-Score is also unique in that, unlike in the ImageNet Challenge, the top-performing models will help determine the next key neuroscience experiments that should be done. “The next turn of the crank should be looking at the top lists of Brain-Score models, then designing an experiment based on the outputs of those [models],” says DiCarlo.
In the future, Martin Schrimpf, a graduate student in the DiCarlo lab and the lead author of the Brain-Score effort, plans to hold an annual Brain-Score challenge, much like the ImageNet Challenge. “People can submit their models on a set of brain benchmarks for that year, and then we’ll keep adding to it, so perhaps every year or every couple years we will have a competition to really motivate people to both submit better models, but also to motivate experimentalists to break the models and show the shortcomings for future competitions,” says Schrimpf.
A second initiative, the Neural Latents Benchmark (NLB) challenge, spearheaded by Chethan Pandarinath, a neuroscientist at Emory University and Georgia Tech, takes a similar approach to assess a different class of models. While Brain-Score rates mechanistic models designed to illuminate how the brain classifies objects, NLB will focus on ranking and organizing models that capture the low-dimensional covariation often seen in neural activity, with an initial focus on sensorimotor and cognitive brain regions.
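‘Low-dimensional covariation’ means that the activity of many neurons can be summarized by a handful of shared latent signals. The sketch below illustrates that idea with a simulated population and principal component analysis; it is only a conceptual example, not the evaluation NLB itself uses.

```python
import numpy as np

# Simulate a population whose activity is driven by three shared latent signals
rng = np.random.default_rng(0)
n_timepoints, n_neurons, n_latents = 1000, 100, 3
latents = rng.normal(size=(n_timepoints, n_latents))
weights = rng.normal(size=(n_latents, n_neurons))
activity = latents @ weights + 0.2 * rng.normal(size=(n_timepoints, n_neurons))

# Principal component analysis via the covariance eigenspectrum:
# most of the variance is concentrated in just a few dimensions.
centered = activity - activity.mean(axis=0)
eigenvalues = np.linalg.eigvalsh(np.cov(centered.T))[::-1]
explained = eigenvalues / eigenvalues.sum()
print(f"Top 3 components explain {explained[:3].sum():.0%} of the variance")
```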
To encourage modelers to develop methods with broad applications, the NLB organizers selected benchmark datasets that span different brain areas and different behaviors. Methods that can accurately account for neural responses in a variety of brain areas offer a potential window into the general principles that neural populations employ to perform computations. “Don’t just develop a method that you test on your V1 dataset or I test on my motor cortex dataset,” Pandarinath says. “The principles of how a population of neurons works together, those are general.”
The organizers believe that encouraging models that generalize to different brain regions will push the field forward. “With these benchmarks, we think it’s important that we go a little bit beyond what the field can do right now,” says Pandarinath. “Being a little ahead of what the field can currently do will allow us to track progress towards that over time.”
The first NLB challenge will be released in August 2021 on the EvalAI platform. The organizers hope to hold a workshop at the 2022 Cosyne conference to announce the winners of the challenge, discuss the competition and solicit ideas for future competitions.
The third effort, the Multi-Agent Behavior (MABe) challenge, led by Ann Kennedy, a neuroscientist at Northwestern University, uses many of the same tools as Brain-Score and NLB but focuses on models of behavior. Much as the volume of neural data in neuroscience has exploded, advances in machine vision have fueled a boom in the volume of behavioral video data, leaving the field of behavioral analysis in a situation much like that of neural data analysis. The MABe challenge centers on a particularly challenging aspect of behavior — social interactions among multiple animals.
“In creating this challenge we wanted to encourage people to focus on particular things that we think are current barriers to progress in the field of computational behavior analysis,” says Kennedy. One of those barriers is the question of why different researchers categorize the same animal behavior differently. In one of the challenge’s tasks, participants had to submit models that predicted the idiosyncratic ways in which individual researchers annotated videos of animals engaged in social interactions.
“This was like a mini-model of what happens when lab A has made a classifier for ‘attack’ and they give it to lab B, and it disagrees with what lab B defines as ‘attack,’” says Kennedy. “Where are those differences coming from?” By understanding the specific features of behavior that lead two people to differ in their definitions, Kennedy hopes to drive our understanding of animal behavior forward. “By going from one person’s binary definitions of a behavior to having more people describe these behaviors we’re hoping we can build towards a richer vocabulary of animals’ actions.”
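One way to frame the annotation-prediction task Kennedy describes is as a supervised learning problem per annotator: given pose-derived features of the interacting animals, predict the label a particular annotator would assign to each frame. The scikit-learn sketch below is a hypothetical illustration; the features, labels and model choice are placeholders, not the MABe specification.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical pose features per video frame (e.g., distances and angles
# between two animals) and frame-level labels from a single annotator.
rng = np.random.default_rng(0)
features = rng.normal(size=(5000, 12))                       # 12 pose-derived features
labels = (features[:, 0] + features[:, 3] > 1).astype(int)   # stand-in "attack" labels

X_train, X_test, y_train, y_test = train_test_split(features, labels, random_state=0)

# Fit one classifier per annotator; comparing the fitted models hints at
# where annotators' definitions of a behavior diverge.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Held-out agreement with this annotator: {clf.score(X_test, y_test):.2f}")
```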
Although ImageNet offers an excellent blueprint for understanding the key elements needed to correctly benchmark and accelerate progress in science, whether neuroscience can mirror ImageNet’s success will ultimately rest on the community’s ability to work together toward a common goal. “[Brain-Score] is not my lab’s,” says DiCarlo. “It’s something for the community. Even though we’ve been the driver, we think of this as something that’s bigger than us, and we’re just trying to grease the wheels.”