2025-07-03
Artificial Intelligence Unveils the “Dark Matter” in DNA
Source: Science and Technology Daily
In 2003, scientists completed the first sequencing of the human genome, revealing the entire DNA sequence that serves as the “blueprint” of human life. Although 98% of the genome does not directly code for proteins, these non-coding regions profoundly influence gene regulation and cellular function. Once dismissed as “junk DNA”, these regions are now thought to harbor crucial biological secrets, akin to the “dark matter” of the genome.
On June 25 this year, DeepMind announced the development of an artificial intelligence (AI) model named AlphaGenome, which may represent a breakthrough in decoding this genomic “dark matter.” According to Nature, this sequence-to-function model can predict how small variations in DNA affect a wide range of molecular processes, offering a new avenue for unraveling the mechanisms of human gene regulation.
A Unified Tool for Interpreting DNA Sequences
In 2020, DeepMind introduced AlphaFold 2, which successfully solved a long-standing scientific challenge: accurately predicting a protein’s three-dimensional structure from its amino acid sequence. This breakthrough revolutionized structural biology and accelerated drug discovery.
Understanding what a DNA sequence does is far harder than predicting a protein’s structure, because there is no single “correct answer.” DNA’s function lies primarily in regulating gene expression: determining when and where genes are turned on or off, and to what extent they are expressed.
If predicting protein structure is like assembling a 3D model of a “part”, then interpreting DNA functionality is like deciphering an instruction manual, understanding every symbol, annotation, control switch, and even the meaning of its “dark matter” regions. This involves more intricate layers of information and broader interconnections. A single DNA fragment may play different roles across various cell types or time points, making it significantly more challenging to model than proteins.
For decades, biologists have employed a range of computational tools to explore the intricate and elusive mechanisms of DNA regulation. However, most of these models focus on a single function. Scientists have long sought an integrated tool for interpreting DNA sequences, and AlphaGenome was developed to fill that gap.
According to Interesting Engineering, unlike previous models that had to trade off between sequence length and prediction accuracy, AlphaGenome achieves both. It captures long-range genomic context while providing precise, base-level predictions, broadening research possibilities in areas such as disease biology, rare variant analysis, and synthetic DNA design.
Processing One Million Base Pairs at Once
According to DeepMind’s official website, the model can process up to one million base pairs in a single run and predict thousands of molecular features, including gene expression, splicing patterns, protein-binding sites, and chromatin accessibility, across a wide range of cell types. This marks the first time an AI system has jointly modeled such a broad spectrum of regulatory features.
The dataset used to train AlphaGenome was compiled from multiple publicly available, large-scale resources. Notably, training a complete model took only four hours and required just half the computational power of its predecessors. Across 26 benchmark evaluations, AlphaGenome matched or outperformed dedicated models in 24.
A highlight of the new model is its variant scoring system, which efficiently compares DNA sequences before and after mutations, evaluating their impact across diverse biological pathways.
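The before-and-after comparison described above can be illustrated with a toy sketch. The code below is not AlphaGenome or its API: `toy_predict` is a hypothetical stand-in that returns two simple sequence statistics in place of real model outputs, and `apply_snv` and `score_variant` are illustrative names. It only shows the general idea of scoring a variant as the difference between predictions for the mutated and the reference sequence.

```python
from typing import Callable, Dict

# A predictor maps a DNA string to named numeric "tracks".
Predictor = Callable[[str], Dict[str, float]]

VALID_BASES = {"A", "C", "G", "T"}


def apply_snv(sequence: str, position: int, alt: str) -> str:
    """Return `sequence` with the base at 0-based `position` replaced by `alt`."""
    if alt not in VALID_BASES:
        raise ValueError(f"alt must be one of A, C, G, T, got {alt!r}")
    if not 0 <= position < len(sequence):
        raise IndexError("position is outside the sequence")
    return sequence[:position] + alt + sequence[position + 1:]


def toy_predict(sequence: str) -> Dict[str, float]:
    """Hypothetical stand-in for a sequence-to-function model (assumption).

    A real model would predict features such as gene expression or
    chromatin accessibility; here we compute two trivial statistics.
    """
    gc_count = sequence.count("G") + sequence.count("C")
    return {
        "gc_fraction": gc_count / len(sequence),
        "tata_motifs": float(sequence.count("TATA")),
    }


def score_variant(
    sequence: str,
    position: int,
    alt: str,
    predict: Predictor = toy_predict,
) -> Dict[str, float]:
    """Score a variant as (prediction for mutated) - (prediction for reference)."""
    ref_prediction = predict(sequence)
    alt_prediction = predict(apply_snv(sequence, position, alt))
    return {
        track: alt_prediction[track] - ref_prediction[track]
        for track in ref_prediction
    }


if __name__ == "__main__":
    # Replace the first A of the TATA motif with G and inspect each track.
    print(score_variant("GGTATAAC", position=3, alt="G"))
```

In this toy case the single-base change removes the only TATA motif and raises the GC fraction, so each track reports its own signed difference. A real system would compute the same kind of difference over thousands of predicted features and cell types.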
AlphaGenome also features splice-site modeling, making it the first model to predict RNA splicing abnormalities associated with diseases such as cystic fibrosis and spinal muscular atrophy.
In synthetic biology, AlphaGenome could be used to design specific regulatory sequences, such as activating certain genes only in neurons while keeping them silent in muscle cells. It also holds promise for studying rare, high-impact genetic variants, such as those causing Mendelian disorders.
In one validation experiment, researchers applied AlphaGenome to a previously identified leukemia-related mutation. The model accurately predicted that certain non-coding variants indirectly activate the nearby TAL1 oncogene, a known pathogenic mechanism in T-cell acute lymphoblastic leukemia.
Not Yet Suitable for Individual Diagnosis
Despite AlphaGenome’s impressive capabilities, the DeepMind team emphasized that the system still has several limitations. It is not designed for interpreting individual genomes and cannot be used to predict disease risk or ancestry information in the way that services like 23andMe or clinical genetic testing can. In other words, the model is not currently suitable for individual diagnostics or medical decision-making.
At present, AlphaGenome’s training data is limited to humans and mice, and its cross-species adaptability remains to be validated. Additionally, it still struggles to identify connections between regulatory elements and distant target genes (over 100,000 bases apart) and cannot yet fully model the dynamic regulation of cells across different states and tissues.
As Peter Koo, a computational biologist at Cold Spring Harbor Laboratory in the United States, pointed out: “These models are often trained under fixed conditions, but real-world cells are dynamic. Protein levels, DNA chemical modifications, and transcriptional states all change with time and environmental factors, which can significantly influence how a given DNA sequence behaves.” Future models will therefore need to incorporate more multimodal and multi-timescale inputs to better simulate biological processes.