Natural-language processing (NLP) algorithms are now able to generate protein sequences and predict virus mutations, including key changes that help the coronavirus evade the immune system.

In a study just published in Science, NLP was used to predict mutations that allow viruses to avoid being detected by antibodies in the human immune system, a process known as viral immune escape. In essence, the interpretation of a virus by an immune system is analogous to the interpretation of a sentence by a human.

“We’re learning the language of evolution,” says Bonnie Berger, one of the authors of the study and a computational biologist at the Massachusetts Institute of Technology.

Berger’s team used two different linguistic concepts: grammar and semantics (or meaning). The genetic or evolutionary fitness of a virus—characteristics such as how good it is at infecting a host—can be interpreted in terms of grammatical correctness. A successful, infectious virus is grammatically correct; an unsuccessful one is not.

Similarly, mutations of a virus can be interpreted in terms of semantics. Mutations that make a virus appear different to things in its environment—such as changes in its surface proteins that make it invisible to certain antibodies—have altered its meaning. Viruses with different mutations can have different meanings, and a virus with a different meaning may need different antibodies to read it.

To model these properties, the researchers used an LSTM (long short-term memory network), a type of neural network that predates the transformer-based ones used by large language models like GPT-3. These older networks can be trained on far less data than transformers and still perform well for many applications.

Instead of millions of sentences, they trained the NLP model on thousands of genetic sequences taken from three different viruses: 45,000 unique sequences for a strain of influenza, 60,000 for a strain of HIV, and between 3,000 and 4,000 for a strain of SARS-CoV-2, the virus that causes covid-19.

NLP models work by encoding words in a mathematical space in such a way that words with similar meanings are closer together than words with different meanings. This is known as an embedding. For viruses, the embedding of the genetic sequences grouped viruses according to how similar their mutations were.
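To make the idea of an embedding concrete, here is a toy sketch (not the study's learned model): it embeds sequences as k-mer counts and measures cosine similarity, so a sequence with one mutation stays close to the original while an unrelated sequence lands far away. The sequences are made-up amino-acid strings for illustration.

```python
from collections import Counter
from math import sqrt

def kmer_embedding(seq, k=3):
    """Map a sequence to a vector of k-mer counts — a toy stand-in
    for the learned embedding described in the article."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

wild_type    = "MKTIIALSYIFCLVFA"  # hypothetical sequence
close_mutant = "MKTIIALSYIFCLVFG"  # one substitution at the end
distant_seq  = "GGGGGGGGGGGGGGGG"  # unrelated sequence

# The mutant stays close to the original; the unrelated sequence does not.
print(cosine(kmer_embedding(wild_type), kmer_embedding(close_mutant)))
print(cosine(kmer_embedding(wild_type), kmer_embedding(distant_seq)))
```

A learned embedding groups sequences by function rather than raw string overlap, but the clustering intuition is the same.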

The overall aim of the approach is to identify mutations that might let a virus escape an immune system without making it less infectious—that is, mutations that change a virus’s meaning without making it grammatically incorrect.
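This search rule can be sketched as ranking candidate mutations by both criteria at once. The scores below are made up, and combining the two per-criterion ranks by summation is one plausible implementation of the idea, not necessarily the study's exact procedure.

```python
def escape_priority(mutations):
    """Rank candidate mutations so those scoring high on BOTH
    semantic change (looks different to antibodies) and grammaticality
    (still a fit, infectious virus) come first.

    `mutations` maps a mutation name to (semantic_change, grammaticality),
    each a higher-is-more score. Summing the two per-criterion ranks is a
    simple way to require both properties simultaneously."""
    by_semantics = sorted(mutations, key=lambda m: mutations[m][0])
    by_grammar = sorted(mutations, key=lambda m: mutations[m][1])
    combined = {m: by_semantics.index(m) + by_grammar.index(m) for m in mutations}
    return sorted(mutations, key=lambda m: -combined[m])

# Made-up scores for illustration.
candidates = {
    "E484K": (0.9, 0.8),  # changes meaning, still grammatical -> likely escape
    "A222V": (0.2, 0.9),  # grammatical but looks the same -> not escape
    "X999Z": (0.8, 0.1),  # looks different but unfit -> not viable
}
print(escape_priority(candidates)[0])  # the mutation high on both axes
```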

To test their approach, the team used a common metric for assessing predictions made by machine-learning models, which scores accuracy on a scale from 0.5 (no better than chance) to 1 (perfect). In this case, they took the top mutations identified by the tool and, using real viruses in a lab, checked how many of them were actual escape mutations. Their results ranged from 0.69 for HIV to 0.85 for one coronavirus strain. The team claims these results are better than those of other state-of-the-art models.
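The metric described matches the area under the ROC curve (AUC): the probability that a randomly chosen true escape mutation is scored higher than a randomly chosen non-escape one. A minimal sketch, with made-up predictions and labels:

```python
def auc(scores, labels):
    """AUC via the Mann-Whitney formulation: the fraction of
    (positive, negative) pairs where the positive is scored higher.
    0.5 = no better than chance, 1.0 = perfect."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up model scores; 1 = lab-confirmed escape mutation.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0,   0]
print(auc(scores, labels))  # one wrongly ordered pair pulls this below 1.0
```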

Knowing what mutations might be coming could make it easier for hospitals and public health authorities to plan ahead. Since undertaking the work, the team has been running models on new variants of the coronavirus, including the so-called UK mutation, the mink mutation from Denmark, and variants taken from South Africa, Singapore and Malaysia.


The full study can be read in Science.