Physics-Informed AI Method Could Help Make CRISPR Safer
CRISPR promises to be a game changer in the development of treatments for genetic diseases, but sometimes the technology slices DNA where it’s not supposed to. A new model uses biophysics-informed artificial intelligence to predict the risk of these off-target cuts, offering a potential way to make CRISPR more reliable.
CRISPR, the genome editing system that has revolutionized biological research over the last decade, is generally very good at its job. The technology is so accurate in altering an organism’s genome that it can even be used as a human therapy: Last year, regulatory agencies in the U.S. and U.K. approved a CRISPR therapy that deactivates mutated genes in the genomes of patients with sickle cell disease and beta thalassemia, allowing them to make healthy red blood cells.
But CRISPR can occasionally make mistakes and cut DNA where it’s not supposed to. Now, a new artificial intelligence–based model developed at the Flatiron Institute’s Center for Computational Biology (CCB) can predict the chances that a given CRISPR system’s Cas9 enzyme will make ‘off-target’ cuts. The model could lead to safer genome editing and reveal new scientific insights into how CRISPR and other enzyme systems work in cells.
If Cas9 stays in the cell long enough without reaching its target or degrading, it will find something to cut, even if it’s an off-target site. “The key innovation of our work is designing a new AI that learns an overlooked aspect of Cas9 off-target cutting: time,” says Zijun Zhang, a former CCB research fellow who is now an assistant professor at Cedars-Sinai Medical Center in Los Angeles. Explaining time-dependent phenomena is “new for genomic AI models, but a classical biophysical problem solved by kinetic modeling,” he says.
The method could improve predictions of how CRISPR technologies will behave in a variety of settings, says CCB research fellow Adam Lamson. “By bringing in more physics and trying to explain when and why something is cut, we can generalize these patterns to different cell types, different cell contexts or even different organisms,” he says.
Lamson and Zhang worked on the project with CCB director Michael Shelley and CCB deputy director for genomics Olga Troyanskaya. The researchers presented the new model in a paper published December 14 in Nature Computational Science.
CRISPR technology was adapted from a natural immune system in bacteria that detects and destroys viruses called bacteriophages that infect the bacteria. The editing system is made up of two components: a guide RNA that contains about 20 nucleotide ‘letters’ designed to match with a particular target genetic sequence, and the Cas9 enzyme that cuts DNA. The two form a complex that binds a cell’s DNA and then slides and hops along it looking for the guide RNA’s complementary sequence. The complex searches the DNA by unzipping the double helix’s two strands and comparing nucleotides. Once it finds a sufficiently similar sequence, the complex cuts the DNA at that site.
The match doesn’t have to be perfect, which is useful for bacteria. A flexible CRISPR system gives the microbes some wiggle room to detect slightly different strains of bacteriophages with different genomes. However, that sequence flexibility can be a problem for researchers trying to make specific edits to a cell’s genome. If the complex finds another spot in the genome that matches most of its 20 nucleotides, it may cut there instead of at the designated site, potentially creating harmful mutations.
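The tolerance for imperfect matches can be illustrated with a minimal sketch that slides a 20-letter guide along a DNA string and reports every window within a few mismatches. This is a toy under stated assumptions, not how any production screening tool works: the sequences are made up, the names `hamming` and `candidate_sites` are hypothetical, and the sketch ignores the reverse strand, the distinction between RNA and DNA letters, and every other requirement Cas9 has for a site.

```python
def hamming(a: str, b: str) -> int:
    """Count positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def candidate_sites(genome: str, guide: str, max_mismatches: int = 3):
    """Return (position, site, mismatches) for every window close to the guide."""
    n = len(guide)
    hits = []
    for i in range(len(genome) - n + 1):
        window = genome[i:i + n]
        d = hamming(window, guide)
        if d <= max_mismatches:
            hits.append((i, window, d))
    return hits


# Made-up sequences for illustration only.
guide = "GACGTTACCGGATTCAGCTA"        # hypothetical 20-letter guide
off_target = "GACGTTACCGGATTCAGGTA"   # differs from the guide at one position
genome = "TTTT" + guide + "CCCC" + off_target + "AAAA"

hits = candidate_sites(genome, guide, max_mismatches=2)
for position, site, mismatches in hits:
    print(position, site, mismatches)
```

A count of mismatches only flags candidate sites. It says nothing about how likely each one is to be cut, which is the question the new model addresses.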
So researchers developing CRISPR systems for different applications — particularly human therapies — need to be able to calculate the chances that a given complex of Cas9 and guide RNA will make off-target cuts. Biotech companies scan organisms’ genomes for potential unintended sequences that could match their products’ guide RNAs and have even developed machine learning algorithms to better identify these sites. And regulatory agencies evaluating CRISPR therapies, such as for sickle cell treatment, have required researchers to screen their guide RNAs against large databases of human genomes to ensure the therapy’s safety in diverse populations.
However, not all machine learning algorithms work well when applied to datasets other than the ones they were trained on. Different organisms, for instance, may have different numbers of potential sites that would match a given guide RNA sequence. And even within an organism, the condition of the DNA can vary widely from cell to cell: The guide RNA’s ideal target site might be blocked by another protein already bound to the DNA, for example. This generalization problem has greatly hindered the development of methods to confidently predict where and when CRISPR therapies will edit DNA.
Scientists can calculate these kinds of probabilities in vitro by putting the CRISPR-Cas9 complex in a tube with DNA and measuring how Cas9 interacts with various sequences. However, determining how time factors into these calculations is challenging to do in vivo without killing the cells. With that in mind, Zhang, Lamson, Shelley and Troyanskaya set out to develop a more complete model to predict how CRISPR-Cas9 would work in any system or context by merging biophysical knowledge with machine learning techniques.
First, they turned to datasets that have carefully quantified CRISPR’s kinetics in vitro, incorporating factors such as the amount of time it spends unzipping DNA versus cutting it, or how the structural variations of DNA affect which sites are cut. Using these data, they built a set of machine learning models they call kinetically interpretable neural networks (KINNs). The novelty of KINNs is that they incorporate biophysically relevant kinetic equations into their structure. Once trained to determine how an enzyme will act in a controlled in vitro scenario, they can predict how enzymes will behave in more complex scenarios since the same physics must apply.
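To see why time matters in a kinetic description, consider a deliberately simple two-step scheme in which Cas9 first unzips the DNA at a site and then cuts it. The sketch below is not the KINN architecture from the paper. It assumes both steps are irreversible first-order reactions, and the rate constants `k_unzip` and `k_cut` and their values are invented for illustration. Under those assumptions, the waiting time until the cut is the sum of two exponential waiting times, which gives a closed-form probability that the site has been cut by time t.

```python
import math


def cut_probability(k_unzip: float, k_cut: float, t: float) -> float:
    """P(site is cut by time t) for an irreversible unzip -> cut scheme.

    Both steps are assumed to be first-order, so the time to cutting is the
    sum of two independent exponential waiting times.
    """
    if k_unzip <= 0 or k_cut <= 0 or t < 0:
        raise ValueError("rates must be positive and time non-negative")
    if math.isclose(k_unzip, k_cut):
        # Equal rates: use the limiting form to avoid dividing by zero.
        k = k_unzip
        return 1.0 - (1.0 + k * t) * math.exp(-k * t)
    numerator = k_cut * math.exp(-k_unzip * t) - k_unzip * math.exp(-k_cut * t)
    return 1.0 - numerator / (k_cut - k_unzip)


# Hypothetical rates (arbitrary units): mismatches are assumed to slow unzipping.
on_target = {"k_unzip": 1.0, "k_cut": 0.5}
off_target = {"k_unzip": 0.01, "k_cut": 0.5}

for t in (1.0, 10.0, 1000.0):
    p_on = cut_probability(t=t, **on_target)
    p_off = cut_probability(t=t, **off_target)
    print(f"t={t:>6}: on-target {p_on:.4f}, off-target {p_off:.4f}")
```

In this toy picture, the slow site is rarely cut at short times but is almost certain to be cut if the enzyme persists long enough, which is the time dependence Zhang describes.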
But in real life, cells are much more complicated than the in vitro systems these models were trained to predict. So the researchers built a second machine learning algorithm called Elektrum, which they trained using the complex, messy in vivo data that research teams have gathered from studying CRISPR-Cas9 effects in real cells. “There’s so many unknown unknowns,” Lamson says. “That’s why we have needed these really large models in the past to make predictions.”
To train the model, the researchers feed a guide RNA sequence into Elektrum. The program tests six different KINNs to see which one best fits the way the CRISPR complex will act in living cells. By combining this KINN with other parameters that it optimizes, it models which DNA sequences are most likely to be cut in that scenario.
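The idea of testing several candidate kinetic models and keeping the one that fits best can be sketched in a few lines. This is only an illustration of model selection by goodness of fit: the article does not describe Elektrum's actual search procedure, the three candidates below are fixed toy curves rather than neural networks, and the function names, rate values and synthetic "observed" data are all assumptions.

```python
import math


def mse(model, times, observed):
    """Mean squared error between a model's predictions and observations."""
    return sum((model(t) - y) ** 2 for t, y in zip(times, observed)) / len(times)


def select_best(candidates, times, observed):
    """Score every candidate and return the best name plus all scores."""
    scores = {name: mse(f, times, observed) for name, f in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical candidate models: fraction of sites cut as a function of time.
candidates = {
    "one_step_fast": lambda t: 1 - math.exp(-1.0 * t),
    "one_step_slow": lambda t: 1 - math.exp(-0.1 * t),
    "two_step": lambda t: 1 - (1 + 0.5 * t) * math.exp(-0.5 * t),
}

# Synthetic observations generated from the slow one-step curve.
times = [0, 1, 2, 5, 10]
observed = [1 - math.exp(-0.1 * t) for t in times]

best, scores = select_best(candidates, times, observed)
print("best-fitting candidate:", best)
```

A real search would also tune each candidate's parameters and evaluate on held-out data, so that the winner is not simply the model that best memorizes its training set.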
Training Elektrum on the Simons Foundation’s supercomputers can take days because the calculations are so complex. However, once the model is trained, researchers can quickly test a guide RNA’s ability to cut locations in the organism’s genome. Genomes may contain multiple close matches, but Elektrum gives quantitative predictions of how likely each one of these sites is to be cut. Depending on the sequence and where nucleotide mismatches occur, similar matches can have very low probabilities of being edited, while more dissimilar off-target sites may have a high probability. This knowledge informs researchers how dangerous a specific guide RNA sequence could be to an organism if used.
The researchers compared Elektrum to existing machine learning algorithms by having each predict off-target cuts in a new dataset that Elektrum had not been trained on. Elektrum predicted CRISPR-Cas9 activity more accurately than the other models, both in vitro and in vivo. The team is now trying to develop high-throughput experiments that will allow them to better interpret the algorithm’s predictions.
Unlike many other AI-powered products, Elektrum learns and discovers the underlying physical rules governing the system. This means that the team can mathematically compare Elektrum’s reasoning against known and novel physics to make sure the results make sense, a property known as interpretability. This also prevents the algorithm from becoming overly attuned to its training data, a problem known as overfitting that results in inaccurate predictions.
Elektrum’s prowess also means that it can help generate ideas and predictions for new experiments, Lamson says. For instance, it can predict what happens when different Cas9 proteins compete inside the same nucleus for the same target. In addition, Lamson says, Elektrum may even be able to directly generate new scientific insights. Researchers may find new cellular mechanisms that affect enzymatic activity by comparing the deep learning algorithm’s results to those of simple kinetic models. “Basically, we’re trying to make an auto-scientist,” he says. “I could just throw an optimization algorithm at it, and it’ll spit out a physical model for us.”
As one example, the team has already found that the CRISPR complex seems to get “stuck” in genetic regions that contain a lot of guanines, one of the four nucleotide letters that make up DNA’s genetic code. With nothing else to do, Cas9 begins cutting DNA at these sites instead of sites that would be better matches for its guide RNA. The researchers say more work is needed to determine how the structure of these nucleotides changes Cas9’s normal kinetic behavior.
Having this adaptable model, the researchers say, will be particularly helpful for therapies because humans’ genomes vary so widely and can contain different numbers and types of off-target sites. The model can also be expanded to incorporate other kinds of messy cell data, such as whether the DNA is wound around proteins known as histones or if it contains chemical tags that affect how a sequence is expressed. “This optimization process really enables us to generate knowledge from the data’s perspective instead of being restricted to existing human knowledge,” Zhang says.
Zhang and Lamson say they’re now studying how Elektrum could help researchers design new applications for CRISPR technology. For example, it could warn scientists about potentially biased results when they use the technology to attach fluorescent proteins to specific DNA sequences so that the sequences show up under a microscope. The algorithm could also be applied to other systems, they say, such as prime editing, which swaps one nucleotide for another without breaking both strands of the DNA.
And because Elektrum can work with any molecule encoded by a sequence, researchers could even use it to predict how proteins interact with DNA or other proteins by feeding it the sequence of amino acids that make up a protein. Zhang says the sky is the limit when it comes to using machine learning to understand enzyme kinetics. “We’re trying to go to the wild world of things that we don’t know yet.”