Convolutional neural net learns promoter sequence features driving transcription strength

10 pages • Published: March 11, 2020

Abstract

Promoters drive gene expression and help regulate cellular responses to the environment. In recent research, machine learning models have been developed to predict a bacterial promoter’s transcriptional initiation rate, although these models utilize expert-labeled sequence elements across a defined set of DNA building blocks. The generalizability of these methods is therefore limited by the necessary labeling of the specific components studied. As a result, current models have not been used to predict the transcriptional initiation rates of promoters with generalized nucleotide sequences. If generalizable models existed, they could greatly facilitate the design of synthetic genetic circuits with well-controlled transcription rates in bacteria.

To address these limitations, we used a convolutional neural network (CNN) to predict a promoter’s transcriptional initiation rate directly from its DNA nucleotide sequence. We first evaluated the model on a published promoter component dataset. Trained using only the sequence as input, our model fits held-out test data with R² = 0.90, comparable to published models that fit expert-labeled sequence elements. We produced a new promoter strength dataset including non-repetitive promoters with high sequence variation and not limited to combinations of discrete expert-labeled components. Our CNN trained on this more varied dataset fits held-out promoter strength with R² = 0.61. Previously-published models are intractable on a dataset like this with highly diverse inputs. The CNN outperforms classical approach baselines like LASSO on a bag of words for promoter sequence elements (R² = 0.42). We applied recent machine learning approaches to quantify the contribution of individual nucleotides to the CNN's promoter strength prediction.
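
As a rough illustration of the inputs and the metric mentioned above, the sketch below shows one common way to turn a promoter sequence into a one-hot matrix (a typical CNN input), one possible bag-of-words featurization, and the R² score. It is not the authors' code: the function names `one_hot`, `kmer_counts`, and `r_squared` are hypothetical, the abstract does not describe the encoding or network architecture, and counting k-mers is only an assumed stand-in for the "bag of words for promoter sequence elements" used by the LASSO baseline.

```python
import numpy as np
from itertools import product

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) array with one column per base."""
    index = {base: i for i, base in enumerate(BASES)}
    encoded = np.zeros((len(seq), 4), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        if base in index:  # ambiguous bases such as N stay all-zero
            encoded[pos, index[base]] = 1.0
    return encoded

def kmer_counts(seq, k=3):
    """Bag-of-words features (assumed form): counts of every length-k word."""
    vocab = {"".join(p): i for i, p in enumerate(product(BASES, repeat=k))}
    counts = np.zeros(len(vocab), dtype=np.float32)
    seq = seq.upper()
    for start in range(len(seq) - k + 1):
        word = seq[start:start + k]
        if word in vocab:
            counts[vocab[word]] += 1.0
    return counts

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Example: the sigma-70 consensus -35 hexamer as a CNN-style input.
x = one_hot("TTGACA")
print(x.shape)
```

In a setup like this, the one-hot matrix would be fed to convolutional layers, while the k-mer count vector would be the input to a sparse linear model such as LASSO; both would be scored with `r_squared` on held-out promoters.
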
Learning directly from DNA sequence, our model identified the consensus -35 and -10 hexamer regions as well as the discriminator element as key contributors to σ70 promoter strength. It also replicated a finding that a perfect consensus sequence match does not yield the strongest promoter. The model's ability to independently learn biologically-relevant information directly from sequence, while performing similarly to or better than classical methods, makes it appealing for further prediction optimization and research into generalizability. This approach may be useful for synthetic promoter design, as well as for sequence feature identification.

Keyphrases: convolutional neural network, neural network, promoter, sequence model, synthetic biology, transcription rate

In: Qin Ding, Oliver Eulenstein and Hisham Al-Mubaid (editors). Proceedings of the 12th International Conference on Bioinformatics and Computational Biology, vol 70, pages 163-172.