First I had it write stories based on the pages and components. I obviously had to review the work and add more cases myself.
Then I had it generate a markdown file documenting the purpose, usage, and APIs for those, and combined it with user stories written in our project management tool, which I copy-pasted into separate files. It helped that our user stories are written in a Gherkin-like fashion (when/and/or/then), which is computer-friendly.
As most of the components had unique identifiers in the form of data-test attributes, I could then ask it to implement more e2e cases.
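To give a concrete idea, here is a minimal Playwright sketch in that spirit. The page, flow, and test IDs are hypothetical; the when/then comments mirror the Gherkin-like stories; and getByTestId is assumed to be pointed at data-test attributes via the testIdAttribute option in the Playwright config (it defaults to data-testid).

```ts
// e2e/checkout.spec.ts — hypothetical page and test IDs, for illustration only
import { test, expect } from '@playwright/test';

test('user can submit the checkout form', async ({ page }) => {
  // Given: the user is on the checkout page
  await page.goto('/checkout');

  // When: they fill in their details and submit
  await page.getByTestId('checkout-email').fill('user@example.com');
  await page.getByTestId('checkout-submit').click();

  // Then: a confirmation is shown
  await expect(page.getByTestId('checkout-confirmation')).toBeVisible();
});
```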
Overall I was very satisfied with the cost/benefit ratio.
Stories were the most complicated part, as Cursor tended to redeclare mocks multiple times rather than sharing them across stories, and it wasn't consistent in the API choices it made (Storybook has too many ways to accomplish the same thing).
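For illustration, this is one way to share a mock across stories in CSF3 (the component and mock names are hypothetical): the mock is declared once at module level and flows to every story through the meta-level args, instead of being redeclared per story.

```ts
// UserCard.stories.ts — a minimal CSF3 sketch with a hypothetical component and mock
import type { Meta, StoryObj } from '@storybook/react';
import { UserCard } from './UserCard';

// Declare the mock once at module level so every story reuses it,
// instead of each story redeclaring its own copy.
const mockUser = { id: '1', name: 'Ada Lovelace', role: 'admin' };

const meta: Meta<typeof UserCard> = {
  component: UserCard,
  // Shared args apply to all stories below unless overridden.
  args: { user: mockUser },
};
export default meta;

type Story = StoryObj<typeof meta>;

export const Default: Story = {};

export const ReadOnly: Story = {
  // Only the delta from the shared args is declared here.
  args: { editable: false },
};
```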
E2Es with Playwright were the easiest part. The criticism here was that I used data attributes (which users don't see) over user-visible elements like text. I very much agree with that, as I'm myself a fan of testing the way users would. The problem is that our application is localized, and I had to compromise to keep the tests parallel and fast: many tests change locale settings, which interfered with each other, since newly loaded pages came up in a different locale than expected. I'm not the only one using such attributes for testing; I know it's common practice in big cushy tech too.
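A sketch of the trade-off, with hypothetical selectors and copy: a role/text locator is closest to what the user sees but hard-codes one locale's strings, while a data-test locator stays stable when parallel tests keep flipping the locale.

```ts
// Illustration only — selectors and copy are hypothetical.
import { test, expect } from '@playwright/test';

test('order history is reachable from the account menu', async ({ page }) => {
  await page.goto('/account');

  // Text-based locator: what the user actually sees, but it hard-codes one
  // locale's copy. If another worker (or an earlier step) switched the locale,
  // "Order history" is no longer on the page and the test flakes.
  // await page.getByRole('link', { name: 'Order history' }).click();

  // data-test locator: invisible to users, but locale-agnostic, so the suite
  // can stay parallel without pinning every test to English.
  await page.getByTestId('account-order-history-link').click();

  await expect(page.getByTestId('order-history-list')).toBeVisible();
});
```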
One thing I want to note: you can't do it in a few prompts; it feels like having to iteratively convince the agent to do what you ask.
I'm still convinced of the cost/benefit ratio, and with practice you get better at prompting. You work toward the result you want by manual editing and chatting, then feed the example result back in to generate more.
> One thing I want to note: you can't do it in a few prompts; it feels like having to iteratively convince the agent to do what you ask.
Success with current-day LLMs isn't about getting them to output perfect code. Having them do the parts they're good at - rough initial revs - and then iterating from there is more effective. The important metric is code (not LoC, mind you) that gets checked into git/revision control, sent for PR, and merged. The sweet spot is realizing when convincing the LLM to output flawless code is taking you in circles and becoming unproductive, without throwing away the LLM as a useful tool.
In my experience, since Cursor doesn't know what a frontend app looks like, nor can it run a browser, the tests it writes are often inane.
Can you tell me what testing stack you use, and how you approach writing large test suites for mature codebases with Cursor?