AI Coding Assistants Review: GitHub Copilot's Shocking Comeback
Well, this is quickly becoming my favorite article to write. It's only my second one, but I love evaluating every AI assistant I can get my hands on so I have the information I need to decide what I actually want to commit to for the following month.
This month, we've added a few new contenders to the list: Codeium, Gemini CLI, Kilo, OpenCode, Aider, and Warp.dev. The list has gotten a lot longer; I believe I have 17 total that I ran through my evaluation process. This took an enormous amount of time to get through, and I've changed a few things in my methodology.
As always, there are some I haven't tested, and I'm sure you can name many more. There are some I know of but am not currently testing, like CodiumAI or Bloop. I'm trying to get as many as I can.
Methodology
What am I measuring here? I am measuring instruction-following. The way I do that is by providing a very, very detailed prompt. I don't want to reveal these prompts fully, as I'm concerned that the teams building these AI coding tools, or the LLM providers behind them, might fine-tune their models specifically to my tests.
What I care about is this: if I tell an assistant to do something, does it do it? Does it do it accurately? And what is the quality of the output?
To measure this, I have a series of unit tests that sit behind each evaluation. I iterate repeatedly to fine-tune the instructions so that if the assistant follows them, the unit tests will pass. I then use an LLM as a judge, giving it strict criteria about what is good and what is bad, and that judgment feeds into the weighting as well. For this evaluation, I've moved to using Claude for automated grading, which lets me run a batch of tests and then have Claude grade everything. That makes the process much easier, since I can do it all through the CLI.
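I won't share the real harness, but to make the "LLM as a judge" step a little more concrete, here is a minimal sketch of what that kind of automated grader can look like using the Anthropic Node SDK. The rubric format, function name, and model string below are purely illustrative; my actual prompts and criteria stay private for the reasons above.

```javascript
// Minimal LLM-as-judge sketch (illustrative only), using the Anthropic Node SDK.
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Grade an assistant's output against a strict, pre-written rubric.
// `spec` is the prompt the assistant was given, `diff` is what it produced,
// and `rubric` is a list of pass/fail criteria (all hypothetical names here).
async function gradeSubmission(spec, diff, rubric) {
  const response = await client.messages.create({
    model: "claude-sonnet-4-20250514", // model string may differ in your account
    max_tokens: 1024,
    messages: [
      {
        role: "user",
        content:
          `You are a strict code reviewer. Grade the submission against each criterion.\n` +
          `Respond with JSON: [{"criterion": "...", "pass": true|false, "notes": "..."}]\n\n` +
          `SPEC:\n${spec}\n\nSUBMISSION:\n${diff}\n\nCRITERIA:\n${rubric.join("\n")}`,
      },
    ],
  });

  // The reply comes back as content blocks; we expect the first text block
  // to contain the JSON verdicts requested above.
  return JSON.parse(response.content[0].text);
}
```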
Here is an example snippet from one of the prompts:
> Implement a new feature in the existing project. The feature should add a caching layer using Redis. You must create a new module named `caching.js`. This module must export a function `getOrSetCache(key, callback)` that takes a key and a callback function. If the data is in the cache, return it. If not, execute the callback, store its result in the cache, and then return the result. Ensure the cache has a default TTL of 60 seconds.
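For reference, a submission that satisfies that snippet could look something like the sketch below; the assistants are free to structure it however they want as long as the exported function behaves as specified, and the choice of Redis client here (ioredis) is just mine for illustration.

```javascript
// caching.js — one possible implementation of the spec above (sketch, using ioredis)
const Redis = require("ioredis");

const redis = new Redis(); // defaults to localhost:6379
const DEFAULT_TTL_SECONDS = 60;

// Return the cached value for `key` if present; otherwise run `callback`,
// cache its result with a 60-second TTL, and return it.
async function getOrSetCache(key, callback) {
  const cached = await redis.get(key);
  if (cached !== null) {
    return JSON.parse(cached);
  }

  const result = await callback();
  await redis.set(key, JSON.stringify(result), "EX", DEFAULT_TTL_SECONDS);
  return result;
}

module.exports = { getOrSetCache };
```

A unit test behind the evaluation could then be as simple as calling `getOrSetCache` twice with the same key and asserting that the callback only ran once.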
The evaluations are larger things, like fixing bugs in existing projects or implementing a feature. It's not just a single file or a single algorithm. If I give it a really detailed spec and tell it specifically what functions need to be in there, what things need to be named, and the functional and non-functional criteria, how much of that can it do by itself?
I've moved to a points-based ranking system. You don't have to worry about what the numbers mean; just know that each of my evaluations is worth the same number of points. Everything the assistant accomplishes adds to its point total. What matters most is the relative difference between assistants. Month over month, you're going to see these points change, because I want to keep adding more tests and getting better at this.
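If it helps to picture the arithmetic, it conceptually boils down to something like this; the numbers, weights, and names below are made up purely for illustration.

```javascript
// Illustrative scoring sketch — every evaluation is worth the same maximum,
// and whatever the assistant accomplishes adds into its running total.
const MAX_POINTS_PER_EVALUATION = 1000; // hypothetical value

// `results` is one evaluation's outcome: each graded criterion carries a
// weight (summing to 1.0) and a pass/fail verdict from the tests or the judge.
function scoreEvaluation(results) {
  return results.reduce(
    (total, { weight, passed }) =>
      total + (passed ? weight * MAX_POINTS_PER_EVALUATION : 0),
    0
  );
}

// An assistant's overall score is just the sum across all evaluations.
function scoreAssistant(evaluations) {
  return evaluations.reduce((total, results) => total + scoreEvaluation(results), 0);
}
```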
This month, I am primarily testing with Sonnet 4, as it seems to perform the most consistently across all of them. There were some from last month where other models were pretty good, but Sonnet 4 was the most consistent. I do have some exceptions, which I'll talk about a little bit later, as I am testing Codeium with GPT-4.1 and Gemini CLI with Gemini 2.5 Pro.
The Rankings
Here is the overall chart.
| Rank | Assistant | Model | Score |
| :--- | :--- | :--- | :--- |
| 1 | Claude Code | Sonnet 4 | 17,000+ |
| 2 | OpenCode | Sonnet 4 | ~16,000 |
| 3 | GitHub Copilot | Sonnet 4 | ~15,950 |
| 4 | Zed | Opus | ~15,800 |
| 5 | Kilo Code | Sonnet 4 | 15,714 |
| 6 | Cline | Sonnet 4 | 15,714 |
| 7 | Roo Code | Sonnet 4 | 15,714 |
| 8 | Cursor | Sonnet 4 | 14,824 |
| 9 | Windsurf | Sonnet 4 | 14,814 |
| 10 | Augment Code | Sonnet 4 | ~14,000 |
| 11 | Aider | Sonnet 4 | ~13,500 |
| 12 | Void | Sonnet 4 | ~9,000 |
| 13 | Gemini CLI | Gemini 2.5 Pro | 8,780 |
| 14 | Codeium | GPT-4.1 | 1,700 |
Key Surprises and Observations
GitHub Copilot's Turnaround
Number three, running Sonnet 4, is GitHub Copilot. What the crap? I think they fixed it. This is a turnaround I never expected. I reran this several times because I thought there was some sort of anomaly, but no. I am bringing my own key to this, so I don't know if that impacts the quality, but it's worth noting. From last month to this month, GitHub Copilot has gone from what I thought was basically unusable to actually pretty dang solid. I was impressed with that overall.
OpenCode's Strong Debut
This might actually surprise you, but number two is OpenCode. I am so thankful that people pushed me to bring OpenCode into the lineup. It is phenomenal. It did an exceptional job, and the cost of using it is really low compared to some of the other AI coding tools. I gave it the full prompt, and it executed really, really well.
Claude Code Remains King
Number one is Claude Code. It has just come a long way. It crushed the scores: they were higher in this month's evaluation than last month's, and on the new tests it pulled further ahead than any other assistant. It's the only AI coding assistant to break 17,000 in any run of this evaluation so far.
The Laggards
My gosh, Gemini CLI and Codeium with GPT-4.1 are painful. My thinking behind that pairing was that I wanted to see where OpenAI stands in all of this and how it compares to the others. Codeium scored 1,700. That is atrocious; it is unbelievably low. Gemini CLI with Gemini 2.5 Pro scored 8,780, which is also not good. The problem with these two is that it's hard to pin down whether the fault lies with the AI coding agent or with the model itself, because I actually really like Gemini 2.5 Pro. I still use it in other tools, and it is not a bad model. But I really do want Gemini CLI to get good. I'm pulling for it, and I want Gemini 2.5 Pro to be amazing inside Gemini CLI.
Kilo Code, Roo Code, and Cline
Kilo Code, Roo Code, and Cline all landed on the exact same score: 15,714. That shows you how consistent my tests have become at this point: three tools that are basically identical fall within the same scoring. It is kind of interesting. I do want to say a little more about Kilo Code in a second, because that one is interesting and I have a lot of thoughts about it.
A Note on Kilo Code
I get these ads all the time. This particular one says something like "Cursor + Windsurf + Cline + Roo, the best parts of each, and it's open source." I don't know what to make of this, honestly, because I've spent some time going through the source code. You can see things like their features page, where they say, "Hey, MCP Marketplace, Roo doesn't have that." That should actually be a check for Roo; the feature is fairly new, so they just haven't updated their comparison yet. Meanwhile, there's a PR staged to switch Kilo's MCP Marketplace over to the Roo Code version, replacing the Kilo version, or the Cline version, or whatever version was there before. It's very fascinating to me how this is all playing out.
I also do not quite understand one thing they claim: "OpenRouter without the 5% markup." When you click on it, it just tells you about bringing your own keys and the extra 5% on top of the original model price. I don't quite understand how they're getting around that markup, nor can I find anywhere that they explain it. I am okay with people forking things; I think that's just the nature of it. But I'm a little worried about Kilo Code's approach here, because they're trying to take all of these things and cram them into a single tool without really bringing anything new to the table. They did implement autocomplete; I tested it, and there's nothing special about it. So I have a hard time recommending Kilo Code at this point, but it is something I'm going to keep monitoring.
Final Thoughts and My Personal Rankings
My subjective rankings, based on what I will be using for the next month, are as follows:
- Claude Code: It's just so good. The $100/month Max plan is unbeatable. I've actually debated going back up to the $200/month plan and canceling or downgrading my OpenAI subscription, because I've found Claude Code is just so freaking good.
- OpenCode: I have found over the last week or so that I have come to really like CLI-based coding tools. OpenCode lets me switch between a bunch of different models. It works exceptionally well with Sonnet 4, so much so that I think it goes neck-and-neck with Claude Code. I'm going to start experimenting with using it in some of my automated workflows.
- Augment Code: I have dropped it from two to three. I still really, really love Augment Code; this isn't anything negative about it, it's just been surpassed in how much I'm actually using the others. Augment Code's context engine is amazing, it's super fast, and it's very affordable. Its task list feature is great across the board. I have it at number three because the amount of code I'm creating with Augment Code has dropped.
This has been a fascinating evaluation. The rise of GitHub Copilot and the strong showing of OpenCode have really shaken things up. I'm excited to see where this goes long term.