We experiment with Copilot to gain insight into how often and under what conditions it generates insecure code, designing scenarios for Copilot to complete and analyzing the produced code for security weaknesses.
There is burgeoning interest in designing AI-based systems to assist humans in designing computing systems, including tools that automatically generate computer code. The most notable of these comes in the form of the first self-described “AI pair programmer,” GitHub Copilot, a language model trained over open source GitHub code. However, code often contains bugs—and so, given the vast quantity of unvetted code that Copilot has processed, it is certain that the language model will have learned from exploitable, buggy code. This raises concerns about the security of Copilot’s code contributions. In this work, we systematically investigate the prevalence of, and conditions that can cause, GitHub Copilot to recommend insecure code. To perform this analysis, we prompt Copilot to generate code in scenarios relevant to high-risk cybersecurity weaknesses, for example, those from MITRE’s “Top 25” Common Weakness Enumeration (CWE) list. We explore Copilot’s performance on three distinct code-generation axes—examining how it performs given diversity of weaknesses, diversity of prompts, and diversity of domains. In total, we design 89 different scenarios for Copilot to complete, yielding 1,689 programs. Of these, we find approximately 40% to be vulnerable.
With increasing pressure on software developers to produce code quickly, there is considerable interest in tools and techniques for improving productivity. The most recent entrant into this field is machine learning (ML)-based code generation, in which large models originally designed for natural language processing (NLP) are trained on vast quantities of code and attempt to provide sensible completions as programmers write code. In June 2021, GitHub released Copilot [1], an “AI pair programmer” that generates code in a variety of languages given some context, such as comments, function names, and surrounding code. Copilot is built on a large language model that is trained on open-source code [5], including “public code…with insecure coding patterns”, thus giving rise to the potential for “synthesize[d] code that contains these undesirable patterns” [1].
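To make the notion of a “scenario” concrete, the sketch below illustrates the kind of prompt context (a comment and a function signature) that Copilot is given to complete, together with an insecure completion it might plausibly produce and a safer alternative. This is an illustrative example only, not one of the paper’s 89 scenarios; the Flask route, the users.db database, and the function names are hypothetical.

```python
# Minimal sketch of a CWE-89 (SQL injection) style scenario, assuming a Flask app
# and a SQLite database "users.db" (both hypothetical). The route comment and
# function signature form the prompt; the body shows possible completions.
import sqlite3
from flask import Flask, request

app = Flask(__name__)

@app.route("/user")
def get_user():
    # Fetch the row for the requested username from the users table.
    username = request.args.get("username", "")
    db = sqlite3.connect("users.db")
    # Insecure completion: user input concatenated directly into the SQL text.
    # rows = db.execute(
    #     "SELECT * FROM users WHERE name = '" + username + "'").fetchall()
    # Safer completion: a parameterized query keeps user data out of the SQL text.
    rows = db.execute(
        "SELECT * FROM users WHERE name = ?", (username,)).fetchall()
    db.close()
    return {"users": rows}
```

In the study’s terms, each scenario fixes such a prompt and the generated completions are then checked (for example, with static analysis) for the targeted weakness—here, whether untrusted input reaches the query string unparameterized.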