Test-driven development as prompt engineering

As AI code generator tools gain adoption, Test-Driven Development (TDD) becomes even more essential and a key differentiator for developers.

As Kent Beck said: "90% of my skills just dropped to $0. The leverage for the remaining 10% went up 1000x." I don't think we're anywhere close to 90% currently, but it's likely we'll get there. TDD is going to be a large portion of that remaining 10% that'll become the majority of meaningful work we do as programmers.

I briefly trialed GitHub Copilot in the summer of 2023. While it produced interesting results, I haven't integrated it into my daily work, though I'll be revisiting this decision periodically. Since then, I've gained further experience by working with clients who use these tools and by following videos and posts from the wider community.

Throughout this period, I've observed people being frustrated with the code output of tools like Copilot and ChatGPT, and there are early signs that AI code generators might be lowering code quality at scale.

While the output of these tools can definitely vary, I've learned that its quality is highly dependent on prompt engineering.

With AI code generator tools like GitHub Copilot, most people write code comments, wait for generated responses, and are underwhelmed by the suggestions they get.

If you use these tools, you'll get the best results by instead following a test-driven development (TDD) workflow.

With TDD, you follow a loop of Red, Green, Refactor: you first write a failing test (Red), then just enough code to make that test pass (Green), then safely improve your code while ensuring it still works (Refactor).
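
As a minimal sketch of one pass through that loop, assume a hypothetical `slugify` function (the name and behavior are illustrative, not taken from any real project). The test comes first, then the smallest implementation that passes, then a cleanup that the test protects:

```python
# Red: the test is written first. Running it before `slugify` exists
# fails with a NameError, which is the expected "red" state.
def test_slugify_replaces_spaces_with_hyphens():
    assert slugify("Hello World") == "hello-world"


# Green: just enough code to make the test pass.
def slugify(title):
    return title.lower().replace(" ", "-")


test_slugify_replaces_spaces_with_hyphens()


# Refactor: same behavior, with type hints and a docstring added.
# Re-running the test confirms nothing broke.
def slugify(title: str) -> str:
    """Return a lowercase, hyphen-separated version of the title."""
    return title.lower().replace(" ", "-")


test_slugify_replaces_spaces_with_hyphens()
```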

Typically, I only work on one test at a time, but a tradeoff with copilot tools is that they benefit from you writing every test you can think of in advance. This helps guide the generated code to address all expected scenarios at once. Otherwise, you'd have to resort to prompting the copilot to iterate on the code for each new scenario, which currently doesn't seem as successful.

My preferred copilot workflow is the following:

Write all tests you can think of for a function (Red)
Use copilot tools only for the implementation stage: writing just enough code to make the tests pass (Green)
Improve the generated code to meet coding standards and add any other tests that come to mind (Refactor)

This is similar to ping-pong pairing (also known as strict pairing), where one pairing partner writes a test, the other partner writes the code, and then the pair trade roles. Except with copilot tools, you should always be the one writing the tests. It's fine to prompt the tool for additional test ideas, but make sure you are the one setting expectations for how the code should work.
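
The workflow above might look something like the following sketch. The `parse_price` function and its scenarios are hypothetical, and the implementation is a hand-written stand-in for whatever a copilot tool would suggest once the tests are in place:

```python
from decimal import Decimal


# Red: every scenario I can think of, written before any implementation.
def test_parses_whole_dollars():
    assert parse_price("$12") == 1200


def test_parses_dollars_and_cents():
    assert parse_price("$12.50") == 1250


def test_ignores_thousands_separators():
    assert parse_price("$1,234.56") == 123456


def test_ignores_surrounding_whitespace():
    assert parse_price("  $3.05 ") == 305


def test_rejects_empty_input():
    try:
        parse_price("")
    except ValueError:
        return
    raise AssertionError("expected ValueError for empty input")


# Green: stand-in for the generated implementation (an assumption,
# not real tool output). It returns the price in cents.
def parse_price(text):
    cleaned = text.strip().lstrip("$").replace(",", "")
    if not cleaned:
        raise ValueError("price is empty")
    return int(Decimal(cleaned) * 100)


# Run every test; any failure raises an AssertionError.
for test in (
    test_parses_whole_dollars,
    test_parses_dollars_and_cents,
    test_ignores_thousands_separators,
    test_ignores_surrounding_whitespace,
    test_rejects_empty_input,
):
    test()
```

With all five tests passing, the Refactor step is where the code gets brought in line with team standards and any newly discovered scenario gets its own test.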

Tests are essential to ensure the code does what you want
Tests are essential to refactoring
- If the copilot output passes the tests, that's a good stopping point
- If you want to refactor the code, you can safely do so with test coverage
It's a great learning mechanism
- I can often state the objective of code, with the function name, inputs, and outputs I'd like to have, but I might deliberate over the how too much

It's important for humans to write tests for the behavior that is interesting to our work, and "human judgement is always part of TDD."

Issues with common copilot workflows

The most common workflow I've observed with copilot tools is developers writing code comments that describe the implementation they want. The copilot generates a function, and the developer may make some small tweaks. At best, the developer might then prompt the copilot to generate some tests for the new function.

The first issue with this workflow is waste. Prompting with code comments can be quite verbose and can require multiple attempts. Plus, these code comments have to be deleted after the code is generated or they add noise to the codebase. Additionally, rework may be needed to get the code into the desired shape and functionality.

The second and more important issue is bugs hidden by false positives. Even without copilot tools, writing tests afterward (Test-After Development, or TAD) introduces the risk of only demonstrating what the code currently does. Even if all the tests pass, they simply describe the current functionality; they don't prove that the code under test has the desired functionality. By writing tests first, we ensure the generated code meets our needs and accounts for all the scenarios or potential errors that matter to us.

Benefits of TDD as prompt engineering

TDD makes quality suggestions much more likely.

Writing tests first sets expectations for the generated code. It defines:

What the function is called
What it should do (at least for the current case)
What arguments the function accepts (its signature)
What it returns

This gives much more effective guidance for AI as your pairing partner to generate useful code, making it more likely that the generated code will handle the implementation successfully.
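
As a rough illustration, a single test can carry all four pieces of information. The `full_name` function here is hypothetical, and the implementation is a hand-written stand-in for a generated suggestion:

```python
# Each part of this test answers one of the questions listed above.
def test_full_name_joins_first_and_last_name():  # what it should do
    result = full_name("Ada", "Lovelace")        # its name and arguments
    assert result == "Ada Lovelace"              # what it returns


# Stand-in for a generated suggestion that satisfies the test
# (an assumption, not real tool output).
def full_name(first_name, last_name):
    return f"{first_name} {last_name}"


test_full_name_joins_first_and_last_name()
```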

After writing tests, it's helpful to evaluate the generated suggestions (using the Ctrl/Cmd + Enter command in VS Code, for example) and pick the one you feel is most likely to pass the tests. This is a good opportunity to make a hypothesis/bet and learn from the results.

Because TDD encourages good design and the single responsibility principle for testable functions, copilot tools are more likely to generate more focused, cleaner functions.

Tips

Code comment prompts as code smells

If you have to resort to scaffold code or comments for prompts, you're not writing tests first, and you're probably taking too big a step, doing more than one thing, and working at too high a level of abstraction.

Instead, write tests for each operation, which will prompt the generated code to first create more testable helper functions that are largely one liners.
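
A sketch of what that can look like, assuming a hypothetical order-total calculation with amounts in integer cents. Each operation gets its own test, and the implementations (hand-written stand-ins for generated code) end up as one-line helpers plus a function that composes them:

```python
# One test per operation, written before any implementation.
def test_subtotal_sums_item_prices():
    assert subtotal([500, 250]) == 750


def test_apply_discount_subtracts_percentage():
    assert apply_discount(1000, percent=10) == 900


def test_add_tax_adds_percentage():
    assert add_tax(1000, percent=5) == 1050


def test_order_total_composes_the_helpers():
    assert order_total([500, 500], discount_percent=10, tax_percent=5) == 945


# Stand-ins for generated code: each helper does one thing.
def subtotal(prices):
    return sum(prices)


def apply_discount(amount, percent):
    return amount - amount * percent // 100


def add_tax(amount, percent):
    return amount + amount * percent // 100


def order_total(prices, discount_percent, tax_percent):
    return add_tax(apply_discount(subtotal(prices), discount_percent), tax_percent)


# Run every test; any failure raises an AssertionError.
for test in (
    test_subtotal_sums_item_prices,
    test_apply_discount_subtracts_percentage,
    test_add_tax_adds_percentage,
    test_order_total_composes_the_helpers,
):
    test()
```

No comment prompt was needed to describe the calculation: the test names and assertions already say what each step should do.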

Have a pair programming mindset

Working with AI is like being the navigator in pair programming. I found the constant code suggestions to be quite distracting at first, but when I started thinking of it as another form of pairing, I reframed the suggestions as immediate feedback on my role as navigator and on my use of tests as prompts.

Speed and scale are dangerous priorities

AI code generator tools are an accelerator, but accelerating isn't always desirable. Going faster doesn't mean anything if it's in the wrong direction. Going in the right direction, then going faster, is a true advantage.

Speed and scale in isolation create waste, not value. In the world of lean manufacturing, the ranked criteria for any improvement are:

Safer
Better quality
Simpler
Faster

The post "Safety, Accuracy, Efficiency, Then Scale" offers a related perspective focused on product teams.

My experience working with TDD, with or without AI code generators, is that it results in much, much greater speed and scalability over time. This is thanks to fewer bugs, less manual debugging, better documented/simpler code, and improved system design.

So, if copilot tools are appealing for their speed gains, the best way to leverage these gains is by incorporating TDD.

TDD is even more important as AI code generators gain adoption

As more and more teams start using AI code generators to churn out code of variable quality, following Test-Driven Development will be a real competitive advantage.

In programming, typing isn't the bottleneck. Thinking is. Thinking about the tests ("what" the code should do) is far more important than thinking about the implementation code ("how" the code should do it). Without tests, we're handing off too much responsibility for copilot tools to determine both "what" and "how".

Prompt engineering is shaping, and TDD is prompt engineering. Without tests, we're not actually capturing or asserting our thinking about what generated code should do. We're just taking suggestions while our code quality, long-term productivity, and maintainability suffer at scale.

If you or your team need help implementing test-driven development and want to achieve meaningful speed and scale, let's talk: team@buildux.co