“Write me a concise summary of Mission Impossible characters and plots to date,” I recently asked ChatGPT before catching the latest franchise entry. It delivered. I didn’t need to understand its code or know its training dataset. All I needed to do was ask.
ChatGPT and other chatbots powered by large language models, or LLMs, are more popular than ever. Scientists are taking note. Proteins—the molecular workhorses of cells—keep our bodies running smoothly. They also have a language all their own. Scientists assign a shorthand letter to each of the 20 amino acids that make up proteins. Like words, strings of these letters link together to form working proteins, their sequence determining shape and function.
Inspired by LLMs, scientists are now building protein language models that design proteins from scratch. Some of these algorithms are publicly available, but they require technical skills. What if your average researcher could simply ask an AI to design a protein with a single prompt?
Last month, researchers gave protein design AI the ChatGPT treatment. From a description of the type, structure, or functionality of a protein that you’re looking for, the algorithm churns out potential candidates. In one example, the AI, dubbed Pinal, successfully made multiple proteins that could break down alcohol when tested inside living cells. It is available to try online.
Pinal is the latest in a growing set of algorithms that translate everyday English into new proteins. These protein designers understand plain language and structural biology, and act as guides for scientists exploring custom proteins, with little technical expertise needed.
It’s an “ambitious and general approach,” the international team behind Pinal wrote in a preprint posted to bioRxiv. The AI taps the “descriptive power and flexibility of natural language” to make designer proteins more accessible to biologists.
Pitted against existing protein design algorithms, Pinal better understood the main goal for a target protein and upped the chances it would work in living cells.
“We are the first to design a functional enzyme using only text,” Fajie Yuan, the AI scientist at Westlake University in China who led the team, told Nature. “It’s just like science fiction.”
Beyond Evolution
Proteins are the building blocks of life. They form our bodies, fuel metabolism, and are the target of many medications. These intricate molecules start from a sequence of amino acid “letters,” which bond to each other and eventually fold into intricate 3D structures. Many structural elements—a loop here, a weave or pocket there—are essential to their function.
Scientists have long sought to engineer proteins with new abilities, such as enzymes that efficiently break down plastics. Traditionally, they’ve customized existing proteins for a certain biological, chemical, or medical use. These strategies “are limited by their reliance on existing protein templates and natural evolutionary constraints,” wrote the authors. Protein language models, in contrast, can dream up a universe of new proteins untethered from evolution.
Rather than absorbing text, image, or video files, like LLMs, these algorithms learn the language of proteins by training on protein sequences and structures. EvolutionaryScale’s ESM3, for example, trained on over 2.7 billion protein sequences, structures, and functions. Similar models have already been used to design antibodies that fight off viral attacks and new gene editing tools.
But these algorithms are difficult to use without expertise. Pinal, in contrast, aims for the average-Joe scientist. Like a DSLR camera on auto, the model “bypasses manual structural specifications,” wrote the team, making it simpler to make the protein you want.
Talk to Me
To use Pinal, a user asks the AI to build a protein with a prompt of several keywords, phrases, or an entire paragraph. On the front end, the AI parses the specific requirements in the prompt. On the back end, it transforms these instructions into a functional protein.
It’s a bit like asking ChatGPT to write you a restaurant review or an essay. But of course, proteins are harder to design. Though they’re also made up of “letters,” their final shape determines how (or if) they work. One approach, dubbed end-to-end training, directly translates a prompt into protein sequences. But this opens the AI to a vast world of potential sequences, making it harder to home in on the sequences of working proteins. Compared to sequences, protein structure (the final 3D shape) is easier for the algorithm to generate and decipher.
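To get a rough feel for how vast that world of sequences is, here is a back-of-the-envelope sketch. It assumes only the standard 20-letter amino acid alphabet. The function name and the example length of 100 amino acids are illustrative choices, not figures from the Pinal preprint.

```python
# The 20 standard amino acids, one shorthand letter each.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def count_possible_sequences(length: int) -> int:
    """Number of distinct amino acid strings of the given length."""
    if length < 0:
        raise ValueError("length must be non-negative")
    return len(AMINO_ACIDS) ** length

# Hypothetical example: a small protein 100 amino acids long.
n = count_possible_sequences(100)
print(f"about 10^{len(str(n)) - 1} possible sequences")
```

Only a tiny fraction of those strings would fold into anything useful, which is one way to see why going straight from text to sequence is a hard search problem.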
Then there’s the headache of training data. Here, the team turned to existing protein databases and used LLMs to label them. The end result was a vast library of 1.7 billion protein-text pairs, in which protein structures are matched up with text descriptions of what they do.
The completed algorithm uses 16 billion parameters—these are an AI’s internal connections—to translate plain English into the language of biology.
Pinal follows two steps. First it translates prompts into structural information. This step breaks a protein down into structural elements, or “tokens,” that are easier to process. In the second step, a protein-language model called SaProt considers user intent and protein functionality to design protein sequences most likely to fold into a working protein that meets the user’s needs.
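The two-step flow can be pictured with a toy sketch. Everything here is a placeholder: the function names are hypothetical, and the arithmetic inside them merely stands in for trained models. It is not Pinal's or SaProt's actual interface, only the shape of the data passing from prompt to structure tokens to sequence.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # standard 20-letter alphabet

def text_to_structure_tokens(prompt: str) -> List[int]:
    """Step 1 (toy stand-in): turn a prompt into structure tokens.

    A real model would predict tokens describing 3D structure. Here each
    word is reduced to an integer from 0 to 19 so the data flow is visible.
    """
    return [sum(ord(c) for c in word) % 20 for word in prompt.lower().split()]

def tokens_to_sequence(prompt: str, tokens: List[int]) -> str:
    """Step 2 (toy stand-in): choose one amino acid letter per token.

    The prompt is passed in again because the second step also looks at
    user intent. Here it only shifts which letter each token maps to.
    """
    shift = len(prompt) % 20
    return "".join(AMINO_ACIDS[(t + shift) % 20] for t in tokens)

def design_protein(prompt: str) -> str:
    """Chain the two steps: text -> structure tokens -> sequence."""
    tokens = text_to_structure_tokens(prompt)
    return tokens_to_sequence(prompt, tokens)

print(design_protein("Please design a protein that is an alcohol dehydrogenase."))
```

The point of the split is that the first step works in the smaller, easier space of structure, and the second step only has to find sequences consistent with that structure and the prompt.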
Compared to state-of-the-art protein design algorithms that also use text as input, including ESM3, Pinal outperformed on accuracy and novelty—that is, generating proteins not known to nature. Using a few keywords to design a protein, “half of the proteins from Pinal exhibit predictable functions, only around 10 percent of the proteins generated by ESM3 do so.”
In a test, the team gave the AI a short prompt: “Please design a protein that is an alcohol dehydrogenase.” These enzymes break down alcohol. Out of over 1,600 candidate proteins, the team picked the most promising eight and tested them in living cells. Two successfully broke down alcohol at body temperature, while others were more active at a sweaty 158 degrees Fahrenheit.
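For readers who think in Celsius, or who want to see how steep the screening funnel was, the numbers above work out as follows. This is a small arithmetic sketch; the variable names are illustrative, and 1,600 is used as a lower bound because the article says "over 1,600."

```python
def fahrenheit_to_celsius(deg_f: float) -> float:
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (deg_f - 32) * 5 / 9

candidates = 1600  # lower bound: the team generated "over 1,600"
tested = 8         # the most promising candidates, tested in living cells

print(fahrenheit_to_celsius(158))  # 70.0
print(f"at most {tested / candidates:.1%} of candidates were tested")  # at most 0.5%
```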
More elaborate prompts that included a protein’s function and examples of similar molecules yielded candidates for antibiotics and proteins to help cells recover from infection.
Pinal isn’t the only text-to-protein AI. The startup 310 AI has developed an AI dubbed MP4 to generate proteins from text, with results the company says could benefit the treatment of heart disease.
The approach isn’t perfect. Like LLMs, which often “hallucinate,” protein language models also dream up unreliable or repetitive sequences that lower the chances of a working end result. The precise phrasing of prompts also affects the final protein structure. Still, the AI is like the first version of DALL-E: Play with it and then validate the resulting protein using other methods.