Gene activation discovery with Artificial Intelligence (AI)
A long-awaited breakthrough in decoding gene activation has been achieved with the aid of ‘machine learning’, revealing the potential applications of AI in biomedicine.
Gene activation in humans depends on precise instructions written in our DNA. Researchers, with the help of artificial intelligence, have now solved a long-standing mystery of the DNA activation code. The sequence they identified, termed the downstream core promoter region (DPR), could eventually be used in biotechnology applications to control the activation of a gene.
Scientists have long known that human genes are governed by instructions delivered through the precise order of the DNA bases (A, T, G, C).
It is also known that the transcription of about 25% of our genes is driven by sequences that resemble TATAAA, the ‘TATA box’. How the remaining three-quarters are turned on has remained a mystery, owing to the enormous number of possible DNA base sequences. This has kept information about the activation of those genes hidden.
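To make the idea of a TATA box concrete, the following is a minimal sketch in Python of how one might scan a DNA string for the TATAAA motif. The function name `find_motif` and the example sequence are hypothetical and are not taken from the study. The sketch looks only for exact matches, whereas real TATA boxes merely resemble TATAAA, so this is a simplification.

```python
def find_motif(sequence, motif="TATAAA"):
    """Return the 0-based start positions of every exact match of motif.

    Exact matching is a simplification: real TATA boxes only resemble
    the TATAAA consensus, so this is an illustration, not a detector.
    """
    sequence = sequence.upper()
    motif = motif.upper()
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        # Continue one base further on so overlapping matches are not missed.
        start = sequence.find(motif, start + 1)
    return positions


# Made-up example sequence containing the motif twice.
example = "GGCTATAAAGGCCGCGTATAAATT"
hits = find_motif(example)
print(hits)
```

A sequence with no match returns an empty list, which is the situation for the genes that lack a TATA box.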
Now, researchers at the University of California San Diego, with the help of artificial intelligence, have identified a DNA activation code that is used as frequently as the TATA box in humans. Their discovery, the downstream core promoter region (DPR), could eventually be used to control gene activation in various bioscience applications. Details of the study are published in the journal Nature.
Professor James T. Kadonaga of UC San Diego, the senior author of the paper, states that the identification of the DPR reveals a key step in the activation of about a quarter to a third of human genes. He also remarked that the DPR had been an enigma and that even its existence in humans had been controversial. Fortunately, the team has resolved this puzzle with the help of machine learning.
In 1996, Kadonaga and his colleagues identified a novel gene activation sequence in fruit flies. They termed the sequence the DPE (it corresponds to a portion of the DPR). This sequence enables genes to be turned on in the absence of the TATA box. In 1997, they found a single DPE-like sequence in humans. Since then, however, deciphering the details and prevalence of the human DPE has proved elusive.
Most strikingly, only two or three active DPE-like sequences had been found among the tens of thousands of human genes. To crack this mystery about 20 years later, Kadonaga worked with a team consisting of Long Vo Ngoc, lead author and post-doctoral scholar; Cassidy Yunjing Huang; Claudia Medrano; and Jack Cassidy, a retired computer scientist who helped the team leverage the powerful tools of artificial intelligence.
In a ‘fairly serious computation’ brought to bear on a biological problem, the researchers created a pool of 500,000 random versions of DNA sequences and evaluated the DPR activity of each. Two hundred thousand of these versions were then used to build a machine learning model designed to accurately predict DPR activity in human DNA.
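For readers curious about what preparing sequences for a machine learning model can look like, here is a minimal sketch in Python. It is not the team's actual pipeline, and every name in it (`one_hot`, `random_sequence`, the pool sizes) is a hypothetical stand-in. It builds a scaled-down pool of random 19-base sequences, selects a subset for training, and turns each sequence into numbers with one-hot encoding, a common way to feed DNA to a model. The measured DPR activity that a real model would learn to predict is deliberately left out, because no such data appears here.

```python
import random

BASES = "ACGT"


def one_hot(sequence):
    """Encode a DNA string as a flat list of 0/1 values, four per base."""
    encoded = []
    for base in sequence.upper():
        # One position per base in BASES; exactly one is set for A, C, G or T.
        encoded.extend(1 if base == b else 0 for b in BASES)
    return encoded


def random_sequence(length, rng):
    """Draw a random DNA sequence of the given length."""
    return "".join(rng.choice(BASES) for _ in range(length))


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Scaled-down stand-ins for the 500,000-sequence pool and the
# 200,000-sequence training set described in the article.
pool = [random_sequence(19, rng) for _ in range(500)]
training = rng.sample(pool, 200)

# Each 19-base sequence becomes 19 * 4 = 76 numeric features.
features = [one_hot(seq) for seq in training]
print(len(features), len(features[0]))
```

In a real study, each feature vector would be paired with a measured activity value, and the sequences left out of training would serve to check the model's predictions.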
Kadonaga described the results of combining artificial intelligence with the study of gene activation as “absurdly good”. He added that the results were so good that the team built a similar machine learning model as a new way to identify TATA box sequences. They tested the predictive ability of the models on thousands of test cases in which the TATA box and DPR results were already known. Kadonaga called the predictive ability “incredible”.
These results clearly revealed the existence of the DPR motif in human genes. Moreover, the DPR appears about as frequently as the TATA box. An intriguing duality was also observed between the DPR and the TATA box: genes activated by TATA box sequences lack DPR sequences, and vice versa.
Finding the six bases of the TATA box sequence was straightforward, says Kadonaga. Cracking the code of the 19-base DPR was much more difficult.
Kadonaga said that the DPR had gone undetected because the motif has no clearly apparent sequence pattern. The information that makes the DPR an active element is, in effect, encrypted within it. The machine learning model can decipher this code, which humans could not.
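One way to see the difference in scale between a six-base and a 19-base motif is to count how many distinct sequences each length allows. The short Python sketch below does that arithmetic; the function name `count_sequences` is hypothetical. The size of the search space is only part of the difficulty, since the DPR also lacks an obvious pattern.

```python
def count_sequences(length, alphabet_size=4):
    """Number of distinct sequences of the given length over A, T, G, C."""
    return alphabet_size ** length


tata_space = count_sequences(6)    # six-base motif such as the TATA box
dpr_space = count_sequences(19)    # 19-base region such as the DPR

print(tata_space)  # 4096
print(dpr_space)   # 274877906944
```

A six-base motif has a few thousand possible versions, while a 19-base region has hundreds of billions.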
This knowledge will be helpful to researchers in biotechnology and the biomedical field, says Kadonaga. Artificial intelligence can be of further use in analyzing DNA sequence patterns, and this tool should increase researchers’ ability to understand and control gene activation in human cells.
Kadonaga said that, just as the DPR was identified with a machine learning model, other significant DNA sequence motifs could be studied with related artificial intelligence approaches. Many things that could not previously be explained can now be explained, he added.
The study was supported by the National Institute of General Medical Sciences (NIGMS) at the National Institutes of Health.
Author: Mayuree Hazarika