A Developer's Guide to the Top 5 AI Coding Models

Every time a new AI model comes out, I try to code with it. Sure, I do the fun little one-shots or three-shots, however many shots it takes, but then I use it the way you actually would as a software engineer, and that is in Windsurf or in Cursor, whatever you use—something that is integrated into your development environment. Somebody should really coin that term "integrated development environment"; it has a ring to it.

So I figured I'd do all of that with the five most popular AI models for coding today. That means going over everything I've experienced coding with these in real codebases, refactoring some simple code in the browser interfaces, and seeing if each one can one-shot a p5.js game. We'll discover their strengths, we'll discover their weaknesses, as well as which one is better for which tasks. And what you will be reading about is quintessential vibe coding, because even though I do like Tab-Tab-Tab, that does not use the model you choose; it uses some Windsurf built-in thing. So in order to actually test the models, we'll need to be prompting in the chat panel and seeing how the output integrates into the codebase.

Claude 3.5 Sonnet

Starting things off with Claude 3.5 Sonnet. Honestly, this was my first foray into AI coding, and it was awesome. It is incredibly precise. It executes exactly what I ask it to with minimal wandering—almost no wandering in my experience. But it also gets full context of everything that it needs. You ask it to do one thing in a particular code file, and it'll see, "Oh, well this is pulling in that file over here and that file over there." It analyzes everything that is called within that file so it has full context, and then it can write the best code possible.

It also keeps very good context. I don't have to reiterate something that I mentioned five messages prior, like with some of these other models, because it remembers. And while it is kind of slow, I'd rather have it work slower with less debugging on my end than work faster and leave me spending a whole ton of time debugging that code. And even though this is an older model, it is still one of the best. If you want something that is very precise and ideal for tasks that need careful, accurate execution, this is your model.

However, it does tend to play it a bit safe. While it may analyze all of the files related to the target file, it won't refactor those files when it sees something has gone wrong or could be improved. That's good because it stays on task, but bad because it doesn't make improvements where it could.

Claude 3.7 Sonnet

This model is, I would say, overly ambitious. It feels like it reads more than what it needs for that specific file, and then for every single file it reads, it's like, "Oh, this could be refactored a little bit, or this function can be deleted, or this over here and this over there." It just sticks its arm into everything to the point where you have five or six diffs to review before you can accept the code when you only asked for one. So it is like 3.5 with a bit more horsepower, if you will, but it's not as focused.

But that ambition often leads to overreaching, to where maybe it will delete a function over here but forget that it's supposed to replace it with something else. Or you're looking at it like, "Why did it delete that function? I need that function for this file over here." That's not a good thing. And then when it comes to the extended thinking mode, I don't like it. It hallucinates too much, it takes too long, it's too expensive, and it tries to be excessively complex. So thinking is not really an option for me. As for 3.7 itself, I wouldn't really recommend it to anybody because it feels like just a worse version of the new Gemini 2.5 Pro.

Gemini 2.5 Pro

Gemini 2.5 Pro feels like all the best parts of 3.5 and all the best parts of 3.7 combined. It's just as accurate as 3.5, if not more, and it has amazing breadth like 3.7, but it doesn't touch as much unrelated code. Like I talked about, 3.5 will analyze all of the related files but only write code where you wanted it to write code, and maybe wander a little bit. 2.5 Pro, unless you explicitly tell it otherwise, will actually recommend revisions to those related files, whether that's refactoring or things of that nature.

And it has such a large context window that it doesn't delete a function over here and forget about it when it was really supposed to replace it like 3.7. It remembers everything that you asked it to do and everything that it had seen, thanks to that large context window. And I found that the mistakes that it makes are much more minor than what you will see in a lot of other models as well. So sure, sometimes you don't want the AI model to touch code that you didn't tell it to touch, but if there is one that does it and does it right, it's Gemini 2.5 Pro. So if you have a large codebase, you have a lot that needs to get done, you have a huge refactor in mind, something a bit more complex or high-stakes, this is the model I would recommend.

Note: Gemini 2.5 Pro is my go-to right now, even over 3.5 Sonnet, because even though it is, again, a little bit broader, the code quality across the board appears to be better for me.

o3 Mini (Medium Reasoning)

Then we have the model that feels like the opposite of 3.7 Sonnet, and that is o3 Mini with medium reasoning in Windsurf. Where 3.7 Sonnet likes to reach everywhere and touch every piece of code, o3 Mini does not like that at all. It barely even writes all of the code that you ask it to. It writes most of it, and then you have to do a manual iteration. "Oh, you need to add this." Okay, let's add just that one line right there. "Oh, and then you need to add this." Let's add just another line or two there, until you get some code that is precise and typically accurate, but over many manual iterations and without context of the larger codebase, because it doesn't really analyze much of the codebase around it at all.

So if you want more control and more precision, if you will, and to know exactly what is happening, compared to something like 3.5, then o3 Mini may be your best bet. But at that point, I would just be inside the code file and use Tab-Tab-Tab. That's what it feels like: just a less convenient version of Tab-Tab-Tab, because you have to prompt.

This was the craziest one. The last time it actually wrote some code, it said, "I've updated this. Please test the button." I said, "It worked. Now let's store the data." It says, "I'll update this." I say, "Yes, let's apply these changes." "I'll now update this." "Wait, you didn't do the previous changes." "I'll update this." I say, "Okay, yes, code these changes in." "I'll now apply these changes." "I'll now apply these changes." Okay, so at this point we still have the same code, and I want to give it an actual prompt. It's like, "I'll now apply these changes." "Okay, please do." And then it wrote the code as a diff and didn't add it to the codebase. So I said, "Add it to the codebase," and then it said, "I'll now apply these changes in a single edit." That's o3 Mini in Windsurf. I don't know how it is in Cursor, but that is a horrendous user experience if I've ever seen one.

GPT-4o

Finally, we get to GPT-4o, which is supposed to be one of the best coding AI models out there according to a benchmark, with its new March 26th update. You know, the same update that brought the image generation with all the Studio Ghibli this and that. Yeah, they also updated its coding ability. And what it feels like is that it's trying to be Claude 3.5, but it's just not as good, not as accurate, not as precise, and with more hallucinations. And something it really likes to do, for whatever reason, is overwrite a lot of code with the exact same code.

The only thing it's better at than 3.5 is speed. But if something's going to be faster and way more wrong, then I'd rather the other one take longer. You really need heavy code review. In other words, don't use 4o for coding. Use it for chat. It's a wonderful chat companion when I just want to bounce some ideas off it and things of that nature. It'll tell me like, "Bro, you're cooking now. That's low-key fire, dude." Or I don't know what the lingo is nowadays, but that's how it tries to talk. It tries to be very personable, but that has nothing to do with coding.

The P5.js Game Challenge

Now let's see which one is the best of them all. Unfortunately, Claude wants to charge money for 3.5 Sonnet, yet I could use 3.7 Sonnet for free, so we're going to skip 3.5 and go straight to 3.7, which, in theory and in other tests I've done previously, is better at trying to one-shot things. But spoiler alert, it's not the best. Let's see how it does.

I'm entering this prompt:

Make an addictive launch-style game like Kitten Cannon. Use p5.js only, no HTML. Show instructions on screen. I like pixelated animals, funny physics, and random obstacles that send you flying or stop you cold.

Claude 3.7 Sonnet

After about a minute and 40 seconds, this is what it produced. I'm going to open up the p5.js web editor, throw it in, and launch it. That wasn't really what I expected, but it should be an easy fix. So I just say, "It works as in there's no errors, but the screen isn't following the character once launched. I need to also be able to aim up and down to adjust trajectory." Well, that should fix everything, right? Well, it fixed those aspects, but now we have floating obstacles that are... I don't even know what to say. The obstacles are floating up and down in odd ways; they should stay where they are. I'm sure I could have fixed it eventually, but if we try it again... okay, that didn't fix it. We're done. Wait, are we going backwards now? Let's move on to the next one.

Gemini 2.5 Pro

This one gave me an error. Then I prompted it to fix it. The game worked; however, I got a collision error, which it was able to fix with another prompt. And then this is the game. Pretty dang good if you ask me. It wrote all of the code and just needed a couple of error fixes, which it handled itself in true vibe coding fashion. Definitely better than 3.7.

GPT-4o

I gave it the same prompt (all of these got the same prompt), and it did in fact work on the very first try. Well, it depends on what you consider working. My issue here is that there were too many things wrong. There was no charge, there was no aiming, the camera didn't work properly, it didn't launch very far, and the pixelated sprites float above the ground, though at least they're not moving around like 3.7's were. So in all honesty, I didn't give this one as many shots as the others, but I don't think it deserved them.

o3 Mini High

I did get o3 Mini High to give it a shot. However, I forgot that it had memory of other chats, which I think is why it looks kind of similar to what 4o gave me. It does have an interesting launch system, but it's not a very powerful one. I actually need to try this again. Oh, that sends it way further. I thought it was just bad, like low power, but it just depends on how far you drag. It looks like it doesn't have any obstacles past a certain point, so it doesn't generate them infinitely. Yeah, interesting. And red ones slow you down, green ones speed you up. That's actually a very cool mechanic, I ain't going to lie. However, that means it didn't listen to the prompt. I said, "random obstacles that send you flying or stop you cold." So I guess the green doesn't send you flying, it just gives you a boost, and the red doesn't stop you cold, it only slows you down. So while cool and unique, it didn't exactly follow the prompt.

Game Challenge Results

So what that means is Gemini 2.5 Pro, even though it took three iterations, where it wrote most of the code at first and then I had it fix two errors, produced the best game and the one most accurate to the prompt. o3 Mini comes in second place. Even though it didn't follow the prompt exactly, it did one-shot 200 lines of code, had some pretty cool mechanics, tried to put its own unique spin on the game, I suppose, and it worked. Whereas, I don't know, 3.7 and 4o, I don't even think they deserve third and fourth place because those were kind of trash.

The Rust Refactoring Challenge

Now for the Rust refactoring.

  • Common Successes: What all four AIs got right was changing the Vec in is_safe to a slice, which just avoids unnecessary cloning. They all also changed windows(2).next() to windows(2).all(), which is just more efficient, more readable, and idiomatic. It's better.
  • Error Handling: Claude, GPT-4o, and o3 Mini High used expect instead of unwrap, which provides a better message but still panics. Gemini 2.5 Pro, however, used the ? operator and match logic, so it logs bad lines and just keeps going. They all work, but it seems like 2.5 Pro is just quite a bit better.
  • Vector Manipulation: Those same three (Claude, GPT-4o, o3 Mini) cloned the full vector to remove an item, which is just inefficient. Gemini built a new vector while skipping one index using .filter_map() or slicing, which is more efficient with less memory churn.
  • Logic: What's interesting is that Gemini and Claude both had report_less_than_two come out true, which is just logical. But the OpenAI models returned false, which is technically incorrect.
  • Looping: Claude and o3 Mini both used a map().sum() chain, which is nice and works perfectly fine, with limited error handling. 4o and Gemini used for loops for this specific thing, which is not as elegant, if you will, but does allow for better error handling and more control. Is that necessary in this instance? I'll let y'all be the judge. (I've put a rough sketch of all these refactors right after this list.)
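
To make those bullets concrete, here's a minimal sketch of what the stronger refactors look like, roughly in the direction Gemini 2.5 Pro went. The original Rust file isn't shown here, so the exact signatures, the input format, and the helper names (parse_report, without_index, count_safe) are my own assumptions for illustration, not code any of the models actually produced.

```rust
// A rough sketch, not the models' actual output. parse_report, without_index,
// and count_safe are assumed helper names; only is_safe and the general shape
// of the exercise come from the write-up above.

fn is_safe(levels: &[i32]) -> bool {
    // Take a slice instead of a Vec, so callers don't have to clone or move anything.
    if levels.len() < 2 {
        // The "report_less_than_two" case: with fewer than two levels there is
        // nothing to violate, so true is the logical answer.
        return true;
    }
    // windows(2).all(...) checks every adjacent pair, not just the first one.
    let increasing = levels.windows(2).all(|w| w[1] > w[0] && w[1] - w[0] <= 3);
    let decreasing = levels.windows(2).all(|w| w[0] > w[1] && w[0] - w[1] <= 3);
    increasing || decreasing
}

// Build a new vector that skips one index, instead of cloning the whole Vec
// and then removing an element from the clone.
fn without_index(levels: &[i32], skip: usize) -> Vec<i32> {
    levels
        .iter()
        .enumerate()
        .filter_map(|(i, &v)| if i == skip { None } else { Some(v) })
        .collect()
}

// Parse one line of whitespace-separated numbers. The ? operator propagates a
// parse error to the caller instead of panicking like unwrap()/expect() would.
fn parse_report(line: &str) -> Result<Vec<i32>, std::num::ParseIntError> {
    let mut levels = Vec::new();
    for token in line.split_whitespace() {
        levels.push(token.parse()?);
    }
    Ok(levels)
}

fn count_safe(input: &str) -> usize {
    let mut safe = 0;
    for line in input.lines() {
        // match on the result: log the bad line and keep going, rather than
        // aborting the whole run on the first malformed report.
        match parse_report(line) {
            Ok(levels) => {
                if is_safe(&levels)
                    || (0..levels.len()).any(|i| is_safe(&without_index(&levels, i)))
                {
                    safe += 1;
                }
            }
            Err(e) => eprintln!("skipping bad line {line:?}: {e}"),
        }
    }
    safe
}

fn main() {
    let input = "7 6 4 2 1\n1 2 7 8 9\n9 7 6 2 1";
    println!("safe reports: {}", count_safe(input));
}
```

The point is just the shape of it: a slice parameter, windows(2).all(), ? plus match instead of unwrap, and skipping an index without cloning the whole vector.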

Refactoring Results

So what we have here is just 2.5 Pro appearing to be a lot better. 3.7 Sonnet would come in second place because it did some of the things that 2.5 Pro did that I felt were a little bit better. And then o3 Mini and 4o were very similar, with o3 Mini edging it out just a little bit, but still not great.

Conclusion

Anyway, that's what I got for you. There's no point in recapping because I talked about everything throughout the article: what was better for what, and what I recommend for what. This will vary based on whether you use new frameworks or unpopular languages and things of that nature, and on the size of your codebase. There are a lot of variables here that will dictate which is better and which is worse. But what I did in this article was as broad as I could get.