
Building AI Systems That Think Like Scientists in Life Sciences – with Annabel Romero of Deloitte


This interview analysis is sponsored by Deloitte and was written, edited, and published in alignment with our Emerj sponsored content guidelines. Learn more about our thought leadership and content creation services on our Emerj Media Services page.

Large language models have significantly advanced the field of genomics by enabling the prediction of genome-wide variant effects, the identification of regulatory elements, and the detection of DNA-protein interactions directly from DNA sequences. 

According to the research paper published in the National Library of Medicine, models such as DNABERT, DNABERT-2, and HyenaDNA have demonstrated the ability to prioritize functional genetic variants and detect long-range interactions, surpassing the limitations of traditional genomic analysis methods. 

In terms of protein targeting, AlphaFold has revolutionized life sciences by achieving near-experimental accuracy in protein structure prediction, as demonstrated by its performance in the CASP14 assessment. 

According to the paper cited above, the latest iteration, AlphaFold 3, further enhances drug discovery by accurately predicting protein-ligand interactions and binding sites. The enhancement enables the streamlining of the drug design process and facilitates the rapid identification of promising therapeutic candidates.

Emerj Editorial Director Matthew DeMello recently sat down with Annabel Romero to discuss how AI, particularly tools like AlphaFold and large language models, is transforming the way scientists understand and design proteins, with significant implications for drug discovery, allergy research, and agricultural biotech.

Their conversation highlights how LLMs are redefining protein science by enabling faster and more precise structure prediction, as well as custom protein design, across drug discovery and agriculture.

This article examines two key insights from their conversation:

  • Leveraging LLMs to recognize protein patterns: Recognizing protein patterns across species using LLMs to reveal structure-function relationships, enabling faster biological insights and discovery.
  • Enabling custom protein design with advanced LLMs: Leveraging new models like Atlas to generate new proteins that precisely target biological structures, unlocking novel paths in drug discovery.


Guest: Annabel Romero, Specialist Leader in AI for Drug Discovery, Deloitte

Expertise: Structural Biology, Protein Design, Drug Discovery

Brief Recognition: Annabel is the development leader of Atlas AI for data harmonization at Deloitte. She previously worked with SFL Scientific and Regeneron. She holds a PhD in Structural Biology.

Leveraging LLMs to Recognize Protein Patterns

Annabel opens the conversation by explaining that when people ask her about the connection between biology and AI, they typically think of large language models used for human languages, such as English, Spanish, or French. However, she emphasizes that biology also has its own inherent language, which she refers to as the ‘language of evolution.’

She describes how the information that encodes everything in our bodies is composed of two key biological languages: 

  • The language of DNA, which consists of a four-letter nucleotide code
  • The language of proteins, which involves 20 different amino acids
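To make the two alphabets concrete, here is a minimal sketch of how a sequence in either "language" could be turned into integer tokens before a model sees it. The function names and the one-symbol-per-token scheme are illustrative assumptions; the interview does not describe how any particular model tokenizes its input, and real models may use different vocabularies.

```python
# Illustrative only: one token per symbol, using the standard
# single-letter codes for the 4 nucleotides and 20 amino acids.
DNA_ALPHABET = "ACGT"
PROTEIN_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_vocab(alphabet):
    """Map each symbol in the alphabet to an integer id."""
    return {symbol: index for index, symbol in enumerate(alphabet)}


def encode(sequence, vocab):
    """Convert a sequence string into a list of integer ids."""
    return [vocab[symbol] for symbol in sequence.upper()]


dna_vocab = build_vocab(DNA_ALPHABET)
protein_vocab = build_vocab(PROTEIN_ALPHABET)

# Short made-up sequences, used only to show the encoding.
dna_ids = encode("GATTACA", dna_vocab)
protein_ids = encode("MKV", protein_vocab)

print(len(dna_vocab), len(protein_vocab))
print(dna_ids)
print(protein_ids)
```

The only point of the sketch is the difference in vocabulary size: four symbols for DNA versus twenty for proteins.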

Annabel also compares proteins to molecular machines of living cells and explains that, like human sentences, proteins are composed of smaller parts called domains that often appear in recognizable patterns. 

Just as a model can understand and reconstruct a natural sentence like ‘She went outside to watch the sunset,’ a model can also identify and interpret these biological patterns. 

Annabel goes on to explain that a key aspect of LLMs is the amount of data used to train them. She notes that these models don’t fundamentally “think” like humans do, especially when it comes to intuitive or straightforward tasks that may seem obvious to people but are not to the model. 

Focusing specifically on protein models, she emphasizes that they are trained on thousands of known protein sequences, collected over decades, that span a vast diversity of organisms, including humans, mice, chimpanzees, plants like corn and soybeans, as well as bacteria and viruses. She highlights that, unlike human language models, which reflect human bias in writing styles and topics, protein models benefit from this broad biological diversity.

She then explains the concept of protein conservation, specifically how certain protein domains remain similar across species. For instance, a protein found in human mitochondria might have a very similar version in other mammals.

She said these repeating patterns carry essential information about proteins. What matters, she added, is that the structure of a protein tells us about its function. LLMs can pick up on the patterns in ways that traditional biologists could not easily see before.

“Going back to your point about how natural it feels to write a letter or perform an operation — it’s second nature because we’ve practiced those tasks long before we even learned to read or write. 

In contrast, understanding the language of proteins takes significant training and study. What I now jokingly call ‘old-school’ methods — early bioinformatics techniques — focused on analyzing amino acid sequences to determine evolutionary relationships. Those foundations have since been modernized with AI.”

– Annabel Romero, Specialist Leader Focusing on AI for Drug Discovery at Deloitte

Annabel cites cytochrome c as an example: a key protein for respiration that is highly conserved across species, including whales.

She notes that identifying such patterns used to require extensive manual work by bioinformaticians. Now, LLMs can recognize them instantly, including not just common elements but also their structural arrangement, a significant shift from how things had been done since the 1990s.
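The kind of sequence comparison described in this section, checking how similar a protein is across species, can be sketched in a few lines. This is a simplified illustration, not the method used by Annabel's team or by any specific tool: it assumes the sequences are already aligned and of equal length, and the three sequences are invented toy strings, not real proteins.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of positions where two pre-aligned sequences match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


def conserved_positions(aligned_sequences):
    """Indices of columns where every sequence has the same residue."""
    return [
        index
        for index, column in enumerate(zip(*aligned_sequences))
        if len(set(column)) == 1
    ]


# Hypothetical toy sequences standing in for one protein in three species.
species_a = "MKTAYIAK"
species_b = "MKTAYLAK"
species_c = "MKSAYIAK"

identity_ab = percent_identity(species_a, species_b)
conserved = conserved_positions([species_a, species_b, species_c])

print(identity_ab)
print(conserved)
```

Positions that stay identical in every sequence are the "conserved" ones. Real analyses first compute the alignment itself and use substitution scoring rather than exact matches, which is part of the manual effort the section refers to.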

Enabling Custom Protein Design with Atlas

Annabel goes on to share her perspective on AlphaFold, shaped by her background as a structural biologist. She explained that the experimental protein structures she worked on were the same ones deposited in public databases and later used to train AlphaFold. 

Coming from the experimental side and now working in AI, she described AlphaFold’s advantage as its ability to predict protein structures based on similarities in domains and functions. It is especially valuable in cases where obtaining the protein’s structure experimentally is very difficult.

She gives an example: if an experimental structure is already available for a mouse protein but researchers are interested in the human version for drug discovery, it used to take years of lab work to obtain that human structure. Now, with AlphaFold, researchers can generate a model more quickly. However, she cautions that AlphaFold, like any AI model, comes with limitations.

Annabel points out that interpreting AlphaFold’s output requires understanding the context: whether the protein is part of a complex, which part of it is being modeled, and whether it is behaving in isolation. She encourages using human experience and judgment, such as her ability to recognize when a predicted structure appears off, to get the most value from such tools.

She then explains that combining protein-specific large language models, such as AlphaFold, with tools for designing small molecules can bridge the gap between drug discovery and agricultural technology. While these two fields may seem unrelated, she tells the Emerj podcast audience that they follow a similar logic: in drug discovery, the patient is human; in agriculture, the plant is treated as the patient. Whether you’re targeting a human disease like cancer or a plant disease caused by fungi, the goal is the same: identify and act on the right protein target.

Annabel then points out that a key difference between human and plant research is the availability of data. Due to the focus on human health, there is much less information available for plants, which creates a challenge for plant scientists. 

However, she notes that AI tools, such as large language models and AlphaFold, enable the transfer of knowledge from well-studied human proteins and genomes to lesser-known plant systems.

Annabel added that over decades of research, scientists discovered cross-reactivity in allergies by analyzing samples from many patients — some might be allergic to bananas but not apples or to peaches but not grapes. 

These trial-and-error approaches revealed that the allergens in different foods often have structural similarities, which confuse the immune system. For example, someone with a birch tree pollen allergy might also react to apples because their allergens are structurally similar, triggering a similar immune response.

She explains that this discovery required solving the structures of many allergens and studying patient data over years. However, by utilizing large language models in Atlas, her team was able to predict these links quickly, identifying proteins in apples and elder trees as closely related to the birch allergen, aligning with what the literature has confirmed as “cross-reactive” proteins.
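The interview does not say how Atlas identifies related proteins. One common approach, shown here purely as an assumed illustration, is to represent each protein as a numeric vector (an embedding) produced by a model and to rank candidates by cosine similarity to a query protein. The vectors and candidate names below are hypothetical placeholders, not model outputs or real allergens.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, candidates):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [
        (name, cosine_similarity(query, vector))
        for name, vector in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical 3-dimensional embeddings; real embeddings are much larger.
query = [1.0, 0.0, 1.0]
candidates = {
    "candidate_a": [0.9, 0.1, 1.1],
    "candidate_b": [0.0, 1.0, 0.0],
    "candidate_c": [1.0, 1.0, 0.0],
}

ranking = rank_by_similarity(query, candidates)
for name, score in ranking:
    print(name, round(score, 3))
```

In a setup like this, the highest-ranked candidates would be the ones to check against the literature, as the team did with the known cross-reactive proteins.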

Annabel emphasizes that understanding the structure of a protein helps researchers determine where a drug binds, which is increasingly essential for meeting FDA requirements to demonstrate the mechanism of action at the molecular level. Tools like AlphaFold can assist in this by guiding experiments that clarify how a drug works.

She also introduces the concept of de novo protein generation, which she finds particularly exciting. These are entirely new proteins, also called “binders,” created using diffusion models. 

Scientists can control the size and design of these proteins, essentially building a new one from scratch. Although no generated protein has yet been published with a final therapeutic use, they are being actively explored in academic labs and industry.

Annabel highlights that these binders can be engineered to activate, block, or interact with other proteins in specific ways. When used alongside AlphaFold (which helps identify a target protein’s structure), diffusion models can then be used to design an ideal binder for that target. 

Combining these systems expands the possibilities for drug development beyond traditional methods, such as antibodies or small molecules, opening new avenues to treat diseases by designing proteins that precisely interact with specific biological targets.
