In the last few months, I finally started using GitHub Copilot, and it quickly increased my productivity. It's the first developer tool I've tried in a long time to have such an unambiguous impact. Compared to any other application in my day-to-day life that has large language models (LLMs) as a component, it delivers by far the highest value on a per-generated-token basis. For people like you and me, every marginal percentage point of productivity compounds into a substantial delta in the long run. Copilot feels like something contributing to that growth, but I'm not sure the tools predicting the next word of my emails do the same.
Releasing the alpha version of StarChat was my first time playing with code models. I was a pretty clean slate with these until ChatGPT came out. For StarChat, we instruction-tuned BigCode's StarCoder model, and we were surprised by how easy it was to get a qualitative improvement at answering code questions. At the implementation level, instruction-tuning is just continuing to train the language model with the original loss function (autoregressive next-token prediction) on a set of question-and-answer style prompts. A few hours and a few GPUs later, the instruction-tuned model's responses to coding questions were preferred over the base model's in more than 95% of GPT-4 evaluations (not rigorous, I know, but interesting). These two models, the base model and the instruction-tuned chat model, have very different use cases.
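To make that concrete, here's a minimal sketch of the idea that instruction-tuning changes only the data, not the objective: each Q&A pair is flattened into one text sequence and scored with the ordinary next-token loss. The chat template and the toy uniform "model" below are illustrative assumptions, not StarChat's actual format or weights.

```python
import math

def format_example(question, answer):
    # Hypothetical chat template; real projects define their own special tokens.
    return f"<|user|>\n{question}\n<|assistant|>\n{answer}"

def autoregressive_nll(token_ids, next_token_probs):
    """Mean negative log-likelihood of each token given its prefix.

    `next_token_probs(prefix) -> {token: prob}` stands in for a real
    language model's softmax output. This is the same loss used for
    base-model pretraining; only the sequences differ.
    """
    losses = []
    for i in range(1, len(token_ids)):
        probs = next_token_probs(token_ids[:i])
        losses.append(-math.log(probs.get(token_ids[i], 1e-9)))
    return sum(losses) / len(losses)

# Toy stand-in model: uniform distribution over a 4-symbol vocabulary.
vocab = ["a", "b", "c", "d"]
uniform = lambda prefix: {t: 1.0 / len(vocab) for t in vocab}

sample = format_example("How do I reverse a list in Python?", "Use lst[::-1].")
print(sample.splitlines()[0])                                # -> <|user|>
print(round(autoregressive_nll(list("abcd"), uniform), 4))   # -> 1.3863 (= ln 4)
```

In real fine-tuning the loop is the same shape: format the pairs, tokenize, and minimize this loss with gradient descent on the model's parameters.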
Base code models (e.g. the model behind Copilot) really want to be good at next-token prediction. When you write a function name and a doc-string, predicting the tokens that follow works really well. This is the case where the engineer knows what they want to build and the AI is a tool to help them along the way.
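As an illustration of why this setup suits next-token prediction so well, here is the kind of prompt a base code model sees, followed by a plausible continuation. The completion below is hand-written for the example, not model-generated; the point is that prompt plus completion form valid code, which is exactly the distribution these models are trained on.

```python
# What the engineer types: a signature and a doc-string.
prompt = '''def merge_sorted(a, b):
    """Merge two sorted lists into one sorted list."""
'''

# A plausible continuation (hand-written here, not sampled from a model).
completion = '''    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]
'''

# The concatenation runs as-is, which is why completing token by token
# from this kind of context works so well.
namespace = {}
exec(prompt + completion, namespace)
print(namespace["merge_sorted"]([1, 3], [2, 4]))  # -> [1, 2, 3, 4]
```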