Why Neuroscience Needs Data Scientists
As a professor of both statistics and neuroscience at Columbia University, Liam Paninski straddles two worlds. Paninski, an investigator with the Simons Collaboration on the Global Brain, argues that this type of scientific bilingualism is becoming increasingly important in neuroscience, where new approaches are needed to analyze ever-expanding datasets.
Paninski’s team focuses on developing tools to extract the information encoded in large populations of neurons from data such as calcium imaging and multielectrode array recordings. As part of the International Brain Laboratory, a large-scale collaboration funded in part by the SCGB, his group is developing a data processing pipeline that he hopes will enable much broader data sharing.
Paninski described some of these advances in a lecture, “Neural Data Science: Accelerating the Experiment-Analysis-Theory Cycle in Large-Scale Neuroscience,” which he presented at the Society for Neuroscience conference in San Diego in November. He spoke with the SCGB about the research and about the need to improve data-sharing in neuroscience. An edited version of the conversation follows.
You described this as the golden age of statistical neuroscience. What makes it such an exciting time?
We’re in a time of fast, cheap computation. We have lots of optical and genetic data to work with. For example, the Neuropixels probe, which can record from 1,000 cells, just became publicly available. For people like me, there is way too much exciting big data. (For more on Neuropixels, see “‘Neuropixels’ Expand Access to the Brain” and “Coming Soon: Routine Recording From Thousands of Neurons.”)
You called for more data scientists in neuroscience. Why?
Many of the big questions in neuroscience are statistical questions in disguise. What information is encoded in neural populations? How do we define cell types? How do we decide which experiments to run next? What is one region of the brain telling another? We need more people who can speak the languages of neuroscience and statistics simultaneously to translate neuroscience questions, theories and prior knowledge into statistical models. There will be a huge impact from people who can bridge these two worlds and make progress on both sides.
Do you have recommendations for people with expertise in one field who want to cross over?
The main thing is to have a real knowledge of both neuroscience and statistics, machine learning and data science. The last three all blend together — I don’t see them as completely distinct things. For people in neuroscience who want to learn more data science, I recommend picking up a book on statistical machine learning to get familiar with some of the basic models that we use. For statisticians who want to learn more about neuroscience, I usually point them to some recent papers on specific topics that might be of interest, such as calcium imaging or analysis of multiple-neuron datasets. I’ve collected a bunch of relevant papers and links here. We eventually plan to turn this material into a textbook.
You’re passionate about data sharing. Why is this such an important issue in neuroscience?
Data sharing in neuroscience remains rare and primitive. In my utopian vision, I want all data to be shared immediately. Imagine how much progress we could make if all data were public and searchable in an easy way. I would love to see faster progress on this. It should be a global effort, and we need both carrots and sticks.
Has there been a philosophical shift in how the field approaches developing these types of tools?
I think people working in the field and funding agencies are both starting to become aware of this. The BRAIN Initiative has been hugely helpful for pushing people to do analysis in a more reproducible way. The Allen Institute and the Simons Foundation’s Flatiron Institute have been good at pushing resources into those kinds of efforts. It’s harder to do in an academic setting. Our experience has been to start with prototype ideas, then make software that people can use, including non-experts. But that requires a significant effort — it can be hard to hand off a prototype method to a software engineer when the prototype is still changing every day.
You’re working on a data processing and data sharing pipeline as part of the International Brain Laboratory, a collaboration among 21 labs. What is your role in the IBL?
The idea with the IBL is to record from every part of the brain with multiple modalities as a mouse performs a specific, well-defined, standardized behavior. Experimental labs will soon start generating large electrophysiology, two-photon, and one-photon calcium imaging datasets, as well as rich behavioral datasets. Our goal is to set up a data pipeline to take the signal from that data and push it into the cloud. Eventually the data will be completely public so that theorists around the world can take a hack at it and try to figure out what’s going on. In some sense, every dataset in neuroscience should be made public. We’re trying to build tools to facilitate that. We want to develop solutions so that no one else has to worry about these problems.
What are the major challenges faced by the IBL and other neuroscience data sharing efforts in developing this pipeline?
This is something that’s never been done before — it’s very ambitious. Both calcium imaging and spike sorting involve big datasets that are getting bigger and more interesting each year. We don’t have the capacity to put all the data online. The dream is to extract the useful information, push it to the cloud, and discard the rest without losing anything we need. These are difficult statistical and software engineering problems. You need a team of engineers working to create open-source code that everyone can use. The lack of such tools has really been holding back the field, slowing the sharing of data and the reproducibility of results.
Are there particular advances in your lab or at other places that are helping to solve these challenges?
For calcium imaging data, we start with large raw movies and figure out where the cells are. A lot of cells overlap in space, and we need to ‘de-mix’ them. The first step is recognizing that the de-mixing problem is the one you want to solve. We are thinking about ways to attack this problem, either to make the process computationally faster or to make it more accurate or more robust.
To determine where the cells are in space in raw calcium imaging movies, we use an approach called constrained non-negative matrix factorization. We take a movie and form a matrix out of it — we vectorize each frame of the movie and stack the vectors into a matrix with the dimensions pixels by timesteps. Then we separate this matrix into three components: signal, background and noise. Each neuron has a spatial footprint — which indicates the location and shape of the neuron — and a calcium trace. We have a mathematical model to estimate the calcium concentration of that neuron over time. Different groups have taken different approaches using different models of background or noise, but matrix factorization is a starting point for all of them. (For more information on these tools, see CaImAn, OnACID and “Pixelated Movies Yield Faster Way to Image Many Neurons.”)
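As a rough illustration of this matrix-factorization view, the sketch below vectorizes a movie into a pixels-by-timesteps matrix and factors it into spatial footprints and temporal traces. It uses scikit-learn's generic NMF with one extra component standing in for background; the constrained NMF used in practice adds models of calcium dynamics, background and noise that this toy omits. The function name `demix_movie` is a hypothetical placeholder.

```python
# Minimal sketch of the matrix-factorization view of calcium imaging de-mixing.
# This is NOT the constrained NMF used by CaImAn; it only illustrates the
# decomposition Y ~= A @ C + background, where
#   Y : (pixels x timesteps) movie matrix,
#   A : (pixels x components) spatial footprints,
#   C : (components x timesteps) temporal traces.
import numpy as np
from sklearn.decomposition import NMF

def demix_movie(movie, n_neurons):
    """movie: non-negative array of shape (height, width, timesteps)."""
    h, w, t = movie.shape
    Y = movie.reshape(h * w, t)          # vectorize each frame, stack into a matrix

    # One extra component serves as a crude stand-in for background.
    model = NMF(n_components=n_neurons + 1, init="nndsvda", max_iter=500)
    A = model.fit_transform(Y)           # spatial footprints (pixels x components)
    C = model.components_                # temporal traces (components x timesteps)

    footprints = A[:, :n_neurons].reshape(h, w, n_neurons)
    traces = C[:n_neurons]
    return footprints, traces
```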
One thing we did this year that I’m excited about: we figured out a good way to compress and de-noise calcium imaging datasets, reducing them by a factor of 100 or so, so that instead of sharing big raw datasets with each other, we can share the cleaned-up versions. That makes data sharing and collaboration much easier.
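The sketch below illustrates the basic idea of low-rank compression with a plain truncated SVD: store a few factors instead of the full movie. This is only a toy stand-in for the group's actual compression and de-noising method; the function names and the specific rank are illustrative assumptions.

```python
# Simplified illustration of compressing a calcium imaging movie with a
# low-rank (truncated SVD) approximation. The real pipeline uses a more
# careful decomposition; this sketch only shows why a low-rank
# representation can shrink storage dramatically.
import numpy as np

def compress_movie(Y, rank):
    """Y: (pixels x timesteps) movie matrix. Returns factors to store."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U[:, :rank], s[:rank], Vt[:rank]    # keep these instead of Y

def decompress(U, s, Vt):
    return (U * s) @ Vt                        # de-noised reconstruction

# Storage is rank * (pixels + timesteps) floats instead of pixels * timesteps:
# e.g. a 512x512-pixel, 10,000-frame movie kept at rank 100 is roughly
# 100x smaller than the raw matrix.
```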
We’ve also started working on voltage sensors, which have become much more powerful in the last couple of years. That’s opening a lot of exciting potential experiments. We are working with groups on that and have some new methods for extracting signals from voltage imaging datasets with a low signal-to-noise ratio.
How is your approach different from that of other groups?
Our focus has been on trying to extract every signal we can from the data. We aren’t happy grabbing the big signals and leaving the little stuff behind. Maybe we are more obsessive than most. But lots of smart people are working on it. I see it more as a big collaboration.
We don’t want to lose any information in processing because we want scientists to be able to tackle subtle questions about neural coding. That’s driven a lot of our efforts over the last few years. For example, can you figure out which neuron is connected to which just by looking at correlational structure in populations of neurons? To do that, we need to maintain these small signals. In particular, you have to be careful about catching all the spikes and making sure spikes from one neuron do not contaminate signals from other neurons. You need to treat the de-mixing problem very thoroughly.
Do you think that the IBL model will catch on?
Working on the IBL has been exciting. It’s a fantastic bunch of scientists, and it’s inspiring to have a shared purpose and be part of a large team devoted to doing open science in a reproducible way. The scale of the problems we’re trying to solve in neuroscience is vast, so I think it’s natural to see more large-scale team efforts. The Allen Institute is a great example, as is the GENIE project at Janelia. The Flatiron Institute is starting to do some things like this.
You mentioned that one of the challenges with calcium imaging is the lack of a ground-truth dataset. What do you mean by that?
We want to take a calcium video dataset and separate the spatial and temporal components cleanly into signals from neuron A and neuron B. That’s a de-mixing problem. But there is currently no ground truth, which means two people can look at the same video and disagree about whether a given pixel belongs to neuron A or neuron B. Electron microscopy can provide the ground truth on where the cells are. There has been a lot of effort to combine electron microscopy and calcium imaging data.
Have you made progress in developing this kind of ground-truth dataset?
We are working on this with several groups as part of the MICrONS project. The Tolias group at Baylor College of Medicine images from multiple planes in the same volume of the mouse cortex. They then ship the mouse to the Allen Institute, which does a dense electron microscopy reconstruction of the same volume. They then send that massive dataset to Princeton University, where Sebastian Seung’s group uses machine learning methods to create a wiring map. (For more on Seung’s work, see “Using Artificial Intelligence to Map the Brain’s Wiring.”) They send us the map and the calcium movies, and we match neural components from the electron microscopy with two-dimensional slices of functional imaging data. We’ve had some nice results over the last year, where we think we have a new gold standard. We will make these datasets public.
What is the challenge of bringing electron microscopy and calcium imaging data together?
The challenge is that there is a mismatch in the resolution of the two datasets — the spatial resolution of the two techniques is orders of magnitude apart. There are many more electron microscopy components in the microanatomy than we can see in the calcium imaging data. If it were just the puzzle of matching 2-D slices and 3-D images, that would be challenging but conceptually straightforward. But the problem is you don’t have an exact match. The calcium indicator is only expressed in a subset of the cell. It’s a challenging statistical matching problem to match the electron microscopy components with the calcium components.
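To make the matching formulation concrete, here is a toy sketch that scores spatial overlap between EM-derived cell masks and calcium-imaging footprints and solves the resulting assignment problem. The real co-registration is far harder, for the resolution-mismatch and partial-expression reasons described above; the function and overlap score here are illustrative assumptions, not the method the group actually uses.

```python
# Toy sketch: match EM cell masks to calcium footprints by spatial overlap,
# posed as a bipartite assignment problem (Hungarian algorithm).
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_components(em_masks, ca_footprints):
    """em_masks: (n_em, pixels) binary; ca_footprints: (n_ca, pixels) non-negative."""
    # Overlap score: fraction of each calcium footprint that falls inside each EM mask.
    score = em_masks @ ca_footprints.T / (ca_footprints.sum(axis=1) + 1e-9)
    em_idx, ca_idx = linear_sum_assignment(-score)   # maximize total overlap
    return list(zip(em_idx, ca_idx, score[em_idx, ca_idx]))
```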
How do you use imaging data to estimate network connectivity?
A big issue in doing that is the dreaded ‘common input’ problem. If firing of neuron 1 always leads to firing of neuron 2, it could mean that neurons 1 and 2 are connected, or that both are connected to a third neuron that you didn’t observe in that trial. What experimental design can help solve the common input problem? We proposed an approach that resembles shotgun sequencing. In shotgun sequencing, you sequence many different overlapping fragments of the genome and use statistical methods to stitch them back together. We can do the same thing with network estimation. We can only observe a subset of neurons at a time. But instead of parking the microscope over one set of neurons, we observe as many subsets as possible. In small simulated networks, you can recover the ground-truth connectivity even when observing only 10 percent of the population simultaneously. We would love for more labs around the world to run these shotgun experiments.
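The toy simulation below illustrates the shotgun idea under simplifying assumptions (a linear-Gaussian network and a plain per-session least-squares fit): each session observes a different random subset of neurons, and each pairwise coupling is estimated only from the sessions in which both neurons were observed, then averaged. It sketches the sampling scheme, not the statistical estimators used in the actual shotgun-connectivity work.

```python
# Toy illustration of the 'shotgun' design: observe different random subsets
# of a simulated network on different sessions, then estimate each pairwise
# coupling only from sessions where both neurons were observed.
import numpy as np

rng = np.random.default_rng(0)
N, T, n_sessions, frac = 50, 2000, 40, 0.3   # neurons, timesteps, sessions, observed fraction

# Ground-truth sparse coupling matrix driving a simple linear-Gaussian network.
W_true = (rng.random((N, N)) < 0.1) * rng.normal(0, 0.2, (N, N))
np.fill_diagonal(W_true, 0)

W_est = np.zeros((N, N))
counts = np.zeros((N, N))

for _ in range(n_sessions):
    x = np.zeros(N)
    activity = np.empty((T, N))
    for t in range(T):                        # simulate the full network...
        x = W_true @ x + rng.normal(0, 1, N)
        activity[t] = x
    observed = rng.choice(N, size=int(frac * N), replace=False)   # ...but observe only a subset

    # Regress each observed neuron's activity on the lagged activity of the
    # other observed neurons (ordinary least squares per target neuron).
    X, Y = activity[:-1][:, observed], activity[1:][:, observed]
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)   # shape (n_observed, n_observed)
    W_est[np.ix_(observed, observed)] += coef.T    # accumulate target-by-source estimates
    counts[np.ix_(observed, observed)] += 1

# Average over the sessions in which each pair was co-observed.
W_est = np.divide(W_est, counts, out=np.zeros_like(W_est), where=counts > 0)
```

Averaging per-session estimates is the crudest possible aggregation; the point is simply that, with enough random subsets, every pair of neurons is eventually observed together and every coupling gets covered.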