Nature Genetics volume 55, pages 1512–1522 (2023)
Predicting the effects of coding variants is a major challenge. While recent deep-learning models have improved variant effect prediction accuracy, they cannot analyze all coding variants due to dependency on close homologs or software limitations. Here we developed a workflow using ESM1b, a 650-million-parameter protein language model, to predict all ~450 million possible missense variant effects in the human genome, and made all predictions available on a web portal. ESM1b outperformed existing methods in classifying ~150,000 ClinVar/HGMD missense variants as pathogenic or benign and predicting measurements across 28 deep mutational scan datasets. We further annotated ~2 million variants as damaging only in specific protein isoforms, demonstrating the importance of considering all isoforms when predicting variant effects. Our approach also generalizes to more complex coding variants such as in-frame indels and stop-gains. Together, these results establish protein language models as an effective, accurate and general approach to predicting variant effects.
Determining the phenotypic consequences of genetic variants, known as variant effect prediction (VEP), is a key challenge in human genetics [1–4]. Coding variants, which alter the amino acid sequences of proteins, are of special interest due to their enrichment in disease associations, better-understood mechanisms and therapeutic actionability [5–8]. Most naturally occurring coding variants are missense, substituting one amino acid with another [9]. Despite progress in functional genomics and genetic studies, distinguishing damaging variants that disrupt protein function from neutral ones remains a challenge. Furthermore, most human genes are alternatively spliced, and the same variant may be damaging to some protein isoforms but neutral to others, depending on its interactions with the rest of the protein. Thus, most missense variants remain variants of uncertain significance (VUS), limiting the utility of exome sequencing in clinical diagnosis [2,10]. VEP is even more challenging for coding variants affecting multiple residues, such as in-frame indels.
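To make concrete what "all possible missense variants" of a protein means, the following minimal sketch enumerates every single-residue substitution of a sequence. It is an illustration only, not the workflow used in this study: the function name `enumerate_missense` and the toy peptide are hypothetical, and the sketch assumes substitutions are counted at the amino acid level (19 alternatives per position), without restricting them to those reachable by a single-nucleotide change.

```python
# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def enumerate_missense(sequence):
    """Yield (position, wild_type, mutant) for every single-residue
    substitution of `sequence`; positions are 1-based."""
    for pos, wt in enumerate(sequence, start=1):
        for mt in AMINO_ACIDS:
            if mt != wt:
                yield pos, wt, mt


# Hypothetical 8-residue peptide, used purely for illustration.
toy_sequence = "MKTAYIAK"
variants = list(enumerate_missense(toy_sequence))

# Each standard residue has 19 alternatives, so the count is 19 * length.
print(len(variants))

# Conventional short notation: wild type, position, mutant.
pos, wt, mt = variants[0]
print(f"{wt}{pos}{mt}")
```

Because the count grows as 19 times the sequence length for each isoform, scoring every isoform separately multiplies the number of predictions, which is why exhaustive coverage requires a method that does not depend on per-protein homolog alignments.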