Optimising mRNA sequences to achieve stability and high-level expression of the encoded protein is a key challenge in the development of mRNA therapies. This is traditionally done through codon optimisation, in which synonymous codons are substituted according to the host organism's codon usage bias and tRNA abundances. However, this approach does not account for the full complexity of codon combinations and ignores key properties of the target protein, such as abundance and function. Here we propose leveraging Large Language Models and generative AI to address these challenges and to optimise mRNA for high protein expression more effectively. Our approach comprehensively explores the context of all codons in the reference mRNA within the embedding space. During this exploration, it performs two simultaneous optimisations: one to improve the translation output and one to maximise the structural similarity with the target protein.
The translation output is predicted by a regression model whose input is the embedding of the mRNA sequence and whose output is the protein abundance. By optimising protein abundance rather than translation efficiency based on ribosome occupancy, we focus on a metric that is more relevant to the efficiency of the mRNA therapy. The structural similarity is estimated using a deep learning approach that compares the structure of the target protein with that of the protein encoded by the generated mRNA. This optimisation ensures that the target function is preserved while allowing a broader range of mRNA sequences to be tested. The dual optimisation strategy allows for a more comprehensive exploration of the sequence space, considering both codon usage and the function of the encoded protein. By integrating these computational techniques, our method promises to enhance the design of mRNA sequences for therapeutic applications, potentially improving the efficacy and stability of mRNA-based treatments.
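
As a rough illustration of the dual optimisation described above, the sketch below ranks candidate mRNA sequences by a weighted sum of a predicted protein abundance and a structural similarity score. It is a minimal sketch under stated assumptions, not the method itself: the weighted-sum combination, the `weight` parameter, and the names `dual_objective_score`, `select_best`, `toy_abundance` and `toy_similarity` are all hypothetical, and the two toy scoring functions are placeholders standing in for the embedding-based regressor and the deep learning structure comparison.

```python
from typing import Callable, Iterable, Tuple


def dual_objective_score(abundance: float, similarity: float,
                         weight: float = 0.5) -> float:
    """Weighted sum of two objectives, both assumed to be scaled to [0, 1].

    The weighted sum is an assumption made for illustration; the text does
    not specify how the two optimisations are combined.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    return weight * abundance + (1.0 - weight) * similarity


def select_best(candidates: Iterable[str],
                predict_abundance: Callable[[str], float],
                structural_similarity: Callable[[str], float],
                weight: float = 0.5) -> Tuple[str, float]:
    """Return the candidate with the highest combined score, and that score."""
    best_seq, best_score = None, float("-inf")
    for seq in candidates:
        score = dual_objective_score(predict_abundance(seq),
                                     structural_similarity(seq), weight)
        if score > best_score:
            best_seq, best_score = seq, score
    if best_seq is None:
        raise ValueError("no candidates supplied")
    return best_seq, best_score


# Placeholder scorers (hypothetical): real ones would be trained models.
def toy_abundance(seq: str) -> float:
    """Stand-in for the abundance regressor: fraction of G and C bases."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def toy_similarity(seq: str) -> float:
    """Stand-in for the structure comparison: 1.0 if a start codon is kept."""
    return 1.0 if seq.startswith("AUG") else 0.0


if __name__ == "__main__":
    pool = ["AUGGCC", "AUGGCU", "GCCGCC"]
    print(select_best(pool, toy_abundance, toy_similarity))
```

With equal weights, a candidate that scores well on abundance alone but loses structural similarity is ranked below one that balances both, which is the behaviour the dual strategy is meant to encourage.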