Automated Web App Testing with Operative.sh and Cursor

By 10xdev team July 24, 2025

Let's say you're building a web app in Cursor; you've got everything set up and you're deep into development. Now, before anything goes live, every part of the app needs to be tested properly.

The Importance of Thorough Testing

Take something simple like the login page. It might seem basic, but it's one of the most critical parts. You need to make sure it handles everything: invalid inputs, edge cases, even potential attacks. Someone could enter a crafted input, like a SQL injection string, that tries to delete your entire database, and if you're not prepared for that, things can go very wrong. That's why testing every possible use case matters, even for something as straightforward as logging in.

AI-Powered Automated Testing

Now, imagine if Cursor could do all that testing for you—not just the login page, but your entire app, from the front end to the back end, making sure every component works exactly as it should.

That's where this agent comes in. The one I'm talking about is called operative.sh, and what it does is let your AI agent debug itself. Cursor can access this agent through MCP, and whatever code it has written, the agent can test it for you and carry out the steps you'd usually handle manually. So you don't have to go through the trouble of testing everything on your own.

Let's say you've built a web app. You don't need to break it down into separate components; you can just ask it to test the whole app. Cursor already knows how it wrote the app, so you can give it instructions in plain English. Just tell it what the app does and what needs to be tested, and it takes care of the rest.

Installation Guide

Let me take you through the installation. First, on their GitHub, they've provided a way to install it manually by setting up each component one by one. Or, if you prefer a quicker method, you can just run the installer, which is available right on their site.

1. Get Your API Key

Before we install it, we need to get the API key for this tool, and yes, it's free.

  1. First, go ahead and log in to the operative.sh website.
  2. Once you're logged in, head over to the dashboard. Inside the dashboard, you'll also find some guides, and if you want, you can check those out as well.
  3. On the sidebar, there's a section for API keys. You get 100 browser chat completion requests per month, and once that limit is reached, you will need to upgrade your plan.
  4. For now, let's just create our key. Go ahead and name it, create the key, and as you can see, it's already copied and ready to use.

2. Run the Installer

Now that you've copied the API key, the next step is to copy the installer command. This command will fetch the installation script, run it to install everything automatically, and then delete the script once it's done.
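
The exact command is the one you copy from their site, but conceptually it follows the familiar fetch-run-clean-up pattern, something along these lines (the URL here is a placeholder, not necessarily the real one):

curl -fsSL https://operative.sh/install.sh -o install.sh   # placeholder URL; copy the real command from their site
bash install.sh                                            # runs the interactive installer
rm install.sh                                              # deletes the script once it's done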

So, let's open the terminal, paste the command we just copied, and run it. As you can see, we're getting an interactive installation process.

The first question it asks is about the installation type; in other words, where we want to install the MCP server. Since this is an MCP installation, it will modify the MCP configuration file of whichever tool we choose. Let's go ahead and select Cursor here.

What it's doing now is setting up the directory, checking for any required dependencies, and downloading the additional components needed to get the tool running properly. The installation has finished, and it automatically integrated itself with Cursor.
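
To give you an idea of what that integration looks like, the installer adds an entry to Cursor's MCP configuration roughly shaped like this (the actual command, arguments, and environment variable name are whatever the installer writes for your setup; this is only a sketch):

{
  "mcpServers": {
    "web-eval-agent": {
      "command": "...",
      "args": ["..."],
      "env": {
        "OPERATIVE_API_KEY": "your-api-key"
      }
    }
  }
}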

Now we have the web_eval_agent server along with its tools, which include the web_eval_agent tool itself and the setup_browser_state tool. We'll take a closer look at what these do in just a moment. During the installation, it also asks for your API key, so you'll need to paste that in as part of the setup. Finally, it launches a browser instance, which means Playwright was installed as well.

Note: One important thing they mention at the end of the installation is that you need to restart whichever app you're using for everything to work correctly. If you skip this step, it might not appear or function as expected. If you open Cursor and it's not working, try hitting the refresh button; that usually solves it. If it still doesn't show up, just close and reopen Cursor, and that should take care of the issue.

Testing a Sample Application

This is a website I quickly put together just to test this tool. We're going to run some tests on it, and the main area we'll be focusing on is the login and sign-up flow.

Let me go ahead and log in. All right, now that we're signed in, this is the dashboard. It has a really clean look because I built it using Aceternity UI, which is a great UI library. There isn't much else going on here aside from that, but our main goal is to test the login functionality. So for now, let's go ahead and sign out, and I'll show you what you can do next.

Understanding the Tools

First, let me give you a little background on the tool. Right now, it includes two main components: the web_eval_agent and the setup_browser_state. If we take a step back, you'll see that each one serves a different purpose.

  • The web_eval_agent acts as an automatic emulator that uses the browser to carry out any task you describe in natural language, and it does this using Playwright.
  • On the other hand, the setup_browser_state lets you sign into your browser once if the site you're testing requires authentication, so you won't have to handle that manually each time.

These are the two core tools that come with the setup.

Tool Arguments

Now, let us talk about what arguments these tools actually require.

web_eval_agent

  • url: This is the address where your app is running. If it is hosted elsewhere, you can enter the corresponding URL where your app is live.
  • task: This is a natural language description of what you want the agent to do. You do not need to include any technical details; just describe what a normal user would do while interacting with your app.
  • headless_browser: By default, it is set to false, which means you will see the browser window while the agent performs the task. If you want to run it silently in the background without opening a visible window, you can set this to true by instructing Cursor.

setup_browser_state

  • url (optional): Its main function is to let you sign in once, and it will save that browser session so you do not need to log in again the next time you run your tests. A sample call is shown below.
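
For reference, a setup_browser_state call is about as simple as it gets; a hypothetical invocation, written in the same style as the tool call shown later, might look like this (the login URL is just an example):

{
  "tool": "setup_browser_state",
  "url": "http://localhost:3000/login"
}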

Running a Simple Login Test

So, I ran a login test just to make sure everything was working properly, and here is how it went. First, I asked it in simple language to test the login. If Cursor understands the task, it automatically translates that instruction into step-by-step actions. All you need to do is describe it naturally, and it fills in the arguments with the required details.

In my case, the app was running locally on port 3000, so I did not have to provide any technical arguments. I just told it where the app was running, and Cursor handled everything else. This is the MCP tool call that was made:

{
  "tool": "web_eval_agent",
  "url": "http://localhost:3000",
  "task": "Go to the login page, create a new user account, log in with the new credentials, and then log out.",
  "headless_browser": false
}

You can see that we invoked the web_eval_agent tool. Both the url and the task were filled in automatically. The task itself was broken down into step-by-step instructions, and I did not have to write any detailed logic for that to happen. You will also notice that headless_browser was set to false, which meant I could actually see the browser performing the actions live instead of running quietly in the background.

It opened a dashboard that acted like a control center.

  • On the left side, it showed a live preview of what was happening. Even if you run it in headless mode, you can still go to the dashboard and watch the process in real time.
  • There was a status tab that showed the current state of the agent, although it did not give many details about the actual testing steps.
  • In the console tab, however, all the logs were visible.
  • It also captured and displayed every network request and response.

This gave us full visibility into the test results. Errors, logs, screenshots, and everything else were sent back to Cursor. The results from the login test showed that everything was working correctly. It successfully went through the entire flow by creating an ID, signing up, logging in, and then logging out.

However, it did not test any edge cases. Right now, I do not think there is any protection built in for those kinds of situations, like when a user enters something invalid. So the next step is to ask Cursor to generate some edge cases for the login test, run them through this agent, and make sure those cases are handled correctly. This is the workflow we are trying to set up using this MCP.
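
To sketch what that follow-up might look like, the next tool call could carry an edge-case-oriented task along these lines (the wording is illustrative; in practice Cursor writes its own task description):

{
  "tool": "web_eval_agent",
  "url": "http://localhost:3000",
  "task": "On the login page, try submitting an empty email, a malformed email, a wrong password, and an injection-style input such as ' OR 1=1 --, and confirm that each attempt is rejected with a clear error message.",
  "headless_browser": false
}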

Extensive Login Functionality Testing

Okay, so here is how I extensively tested the login functionality. The tests are still running, and as you can see, it is still generating results. If I go into the dashboard and open the control center, you can see the tests in progress right here. There is a live preview on one side, the agent status on the other, along with the console logs and network requests. Everything is clearly displayed, and the tests are running in the background.

Here is essentially what I did: I asked it to write test cases for the login functionality, including edge cases. I created a file called login_test_cases.md, and it generated all the test cases inside that file. You can see there is a section for the actual result, and the test cases are being worked through one at a time. The first test case was completed, then the second, and so on. Right now, I think it is on test case 5 because that section has not updated yet. What happens is that the agent performs a test case, then comes back and edits the result directly into the file.
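
The file itself is just Markdown, so its structure is simple. Each entry looks roughly like this (the exact wording, numbering, and fields are whatever Cursor generates; this is only the general shape):

Test Case 3: Login with an invalid email format
  Steps: Enter "user@invalid" in the email field, enter any password, and click Sign In.
  Expected result: An inline validation error is shown and no login request is sent.
  Actual result: (filled in by the agent after the run, e.g. Passed, or Failed with notes)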

At this point, Cursor has generated around 28 test cases, all of them very granular. Every small detail is checked to make sure nothing is overlooked. This is how real software development works: every edge case is accounted for because as the app grows, the likelihood of bugs increases a lot. This is how you can make sure everything works correctly while you are still building. You are not missing anything, and every possible scenario is covered. You just write the test cases, and Playwright handles the actual testing.

Right now, the browser is not visible because I ran the tests in headless mode. After writing the file, I told the agent to go ahead and test each use case and then return to mark each one as passed or failed. I set the app location to localhost on port 3000, provided the URL, and enabled headless mode. You can see that the tests have started running, and right now, let me check... yes, it is currently on test case 10. It has been about 7 minutes since the test started, and it has already worked through 10 use cases.

Now, I do want to mention something important: using AI for testing is a slow process. It does take time, but the advantage is that I do not have to write any scripts. This is a simple example, but it can be applied in any situation. The AI looks at the page's markup, writes the code, and runs the tests automatically. All you have to do is provide the use cases, and it tests them, reports back, and updates the results. This may be a small implementation, but it could easily grow into a complete system that tracks which test cases were run, which ones were missed, and automatically handles the rest.

A Note on the AI Model

And finally, I wanted to mention Claude 4. It was released just 2 or 3 days ago at the time of this recording, and I have to say the model is really impressive. So far, I have genuinely enjoyed using it. It does not have the same frustrating issues that earlier models did, and overall, it has been a great experience working with it.

Final Results

And here are the final results. A total of nine tests were not executed. This was either because they required specific tools that were not available, needed some manual configuration inside the browser, or were skipped due to other limitations in the environment. Regardless of the reason, those nine tests did not run.

Out of the remaining tests, 60% passed successfully, and this is exactly the kind of process you want in place during development. If any of the tests had failed or if the outcome had not matched the expected results, that information would have been sent back to Cursor. From there, we could have identified the issue and fixed it right away.

This entire workflow can now be reused. Whether you are building a new feature, working on a specific component, or testing your full application, this same process applies and helps ensure everything is functioning as expected.

