Emerging Responsibilities for Engineers

Prompt Engineering as a Quality Skill

Quality engineers must develop new skills to unlock the full potential that GenAI can provide. One of these skills is prompt engineering. Far from being a niche technique, prompting is becoming a core skill for shaping AI behavior and, by extension, the quality of the software artifacts it helps produce. Much like writing a good test case, writing a good prompt is both an art and a science. It demands clarity, precision, and intent. In this building block, we focus on the relevance of prompt crafting for quality engineers.

Crafting Effective Prompts to Shape Useful AI Output

Generative AI is only as effective as the instructions it receives. A vague prompt like “Write tests for this function” will likely yield superficial, happy-path cases. In contrast, a targeted prompt such as “Generate boundary tests for input validation logic in a payment processing module, covering null values, type mismatches, and maximum limits” is far more likely to result in relevant, risk-aligned output.
In this context, prompt engineering becomes a lever for quality. It allows engineers to steer the AI toward specific risks, domain constraints, or coverage goals. The ability to deconstruct a problem and translate it into a prompt that elicits meaningful results is a practical expression of testing expertise.
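
To make this concrete, a targeted prompt can be assembled programmatically so that the target, risk areas, and constraints are stated explicitly rather than left for the model to infer. The following is a minimal Python sketch; the module name, risks, constraints, and helper function are illustrative, not a prescribed format.

    # Minimal sketch: building a targeted, risk-aligned prompt instead of a vague one.
    # The module name, risk areas, and constraints are hypothetical examples.
    def build_test_prompt(module: str, focus: str, risks: list[str], constraints: list[str]) -> str:
        """Compose a prompt that states the target, the risks to cover, and explicit constraints."""
        risk_lines = "\n".join(f"- {r}" for r in risks)
        constraint_lines = "\n".join(f"- {c}" for c in constraints)
        return (
            f"Generate {focus} for the {module} module.\n"
            f"Cover the following risk areas:\n{risk_lines}\n"
            f"Respect these constraints:\n{constraint_lines}\n"
            "Return each test with a one-line comment explaining the risk it addresses."
        )

    prompt = build_test_prompt(
        module="payment processing",
        focus="boundary tests for input validation logic",
        risks=["null values", "type mismatches", "maximum limits"],
        constraints=["use pytest", "no network calls", "one assertion per behavior"],
    )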

Prompting for Edge Cases, Regulatory Constraints, and Security

One of the most valuable uses of prompt engineering is to expose the AI to considerations it would otherwise ignore. For example:

  • “Include tests for scenarios where the user session expires during a transaction.”
  • “Generate test cases that validate GDPR-compliant data handling in user profile updates.”
  • “Suggest security-focused test cases to check for SQL injection and input sanitization.”

These kinds of prompts embed quality objectives directly into the AI interaction, effectively guiding the tool to behave like a seasoned engineer. This is especially powerful when prompting for less obvious areas: edge conditions, abuse cases, or regulatory requirements that the AI is unlikely to infer from code alone.

Feedback Loops and Iteration

Prompt engineering is an incremental process. Like test design, it benefits from iterative refinement. Initial outputs often expose ambiguities in the prompt, leading the engineer to clarify assumptions or adjust focus. This back-and-forth creates a valuable feedback loop: engineers learn to express their quality expectations more clearly, and the AI becomes a more aligned assistant.


Over time, quality engineers can develop reusable prompt templates tailored to specific domains, risk areas, or compliance needs. When paired with intelligent review, these templates can form the basis for scalable, high-quality test generation practices.
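
As a minimal sketch of such a template, the example below uses Python’s standard string.Template; the placeholder names and the GDPR wording are illustrative and would need tailoring to a team’s actual domain and compliance obligations.

    from string import Template

    # Minimal sketch of a reusable, compliance-oriented prompt template.
    # Placeholder names and the GDPR wording are illustrative assumptions.
    GDPR_TEST_PROMPT = Template(
        "Generate test cases for the $feature feature.\n"
        "Focus on $risk_area.\n"
        "Every case must check GDPR-compliant handling of personal data: "
        "consent, data minimization, and the right to erasure.\n"
        "Flag any scenario where personal data could leave the $region region."
    )

    prompt = GDPR_TEST_PROMPT.substitute(
        feature="user profile updates",
        risk_area="data retention and deletion",
        region="EU",
    )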

Summary

Prompt engineering is not just about “talking to the AI”; it’s about thinking like a quality engineer while doing so. It requires understanding risk, context, and intent, and converting that understanding into precise instructions. As generative AI becomes a fixture in the software development lifecycle, the ability to shape its outputs through smart prompting will be a defining skill for modern quality engineers.

Verifying AI Output: Expert-in-the-Loop Testing

Generative AI can produce tests, code, and documentation at a scale and speed that was previously unattainable. But with this acceleration comes a critical need: human oversight. Without it, flawed or incomplete outputs can easily slip into production. This is where expert-in-the-loop testing becomes essential, placing engineers at the center of quality control, not as bottlenecks, but as intelligent filters and decision-makers.

Manual Review of Test Logic, Assertions, and Coverage

AI-generated tests often look correct, but visual polish is no substitute for logical soundness. Engineers must rigorously evaluate:

  • Test logic: Does the test actually validate the correct behavior? Are conditions meaningful and assertions valid?
  • Coverage relevance: Are key risks, failure modes, and business scenarios addressed?
  • Noise and redundancy: Are tests duplicative or overly trivial, adding maintenance overhead without quality value?

This review process isn’t about mistrusting AI; it’s about acknowledging its limitations and inserting human judgment where nuance and context matter most.
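
As a small illustration, consider the hypothetical pytest example below: both tests execute and pass against a correct implementation, but only the reviewed version would catch a defect in the discount calculation. The function and values are assumptions made for the example.

    # Hypothetical example: the generated test runs, but its assertion is too weak to catch defects.
    def apply_discount(price: float, percent: float) -> float:
        return round(price * (1 - percent / 100), 2)

    def test_discount_generated():
        # AI-generated: passes even if the arithmetic is wrong
        assert apply_discount(100.0, 10) is not None

    def test_discount_reviewed():
        # Reviewed: pins the expected value and a boundary condition
        assert apply_discount(100.0, 10) == 90.0
        assert apply_discount(100.0, 100) == 0.0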

Techniques for Validating Generative Results

To make human-in-the-loop testing effective and efficient, quality engineers can adopt structured techniques for reviewing AI outputs:

  • Risk-based triage: First, focus review efforts on high-impact areas, such as security features, financial flows, or critical user journeys.
  • Sampling and prioritization: Rather than inspecting every generated test, review a representative subset to assess consistency, accuracy, and coverage trends (see the sketch after this list).
  • Fault seeding and mutation testing: Introduce intentional faults into the system to evaluate whether the generated tests can detect them. This helps validate the effectiveness of the test suite, not just its structure.
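
The sampling step can be as simple as weighting review effort by module risk. The sketch below shows one possible approach; the module names, risk weights, and sampling rates are assumptions made for illustration.

    import random

    # Minimal sketch: sample more generated tests for review from higher-risk modules.
    # Module names, risk weights, and rates are illustrative assumptions.
    RISK_WEIGHTS = {"payments": 1.0, "auth": 0.8, "profile": 0.4, "reporting": 0.2}

    def select_for_review(tests_by_module: dict[str, list[str]], base_rate: float = 0.1) -> list[str]:
        selected = []
        for module, tests in tests_by_module.items():
            # Riskier modules get a larger sample; unknown modules default to a middle weight.
            rate = min(1.0, base_rate + RISK_WEIGHTS.get(module, 0.5))
            sample_size = max(1, int(len(tests) * rate))
            selected.extend(random.sample(tests, min(sample_size, len(tests))))
        return selected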


In many cases, AI can assist in this review by generating test summaries, highlighting redundant patterns, or clustering similar tests. But final decisions still require human interpretation.

When to Trust, When to Intervene

A key question in expert-in-the-loop testing is knowing when to accept AI output as-is and when to step in. This decision should be based on:

  • Domain criticality: Human validation should be non-negotiable in safety-critical or regulated systems.
  • AI maturity and track record: Some AI tools may perform reliably in well-scoped areas (e.g. frontend UI testing), but falter in complex or edge-case-rich systems.
  • Confidence thresholds: Teams can define criteria for “trustable” outputs, such as test alignment with acceptance criteria, mutation score thresholds, or code path coverage metrics.


Ultimately, the expert-in-the-loop model isn’t about slowing down automation; it’s about making automation accountable. It’s a guardrail that preserves quality while benefiting from speed.


AI may generate artifacts, but quality still hinges on human judgment. Expert-in-the-loop testing ensures that generative tools serve as amplifiers of engineering judgment, not replacements for it. By systematically reviewing, validating, and refining AI outputs, quality engineers maintain control over what matters most: delivering trustworthy, resilient, and context-aware software.

Meta-Testing: Testing the AI’s Tests

Frameworks for Evaluating the Quality of AI-Generated Artifacts

Traditional test review methods are no longer sufficient when dealing with AI-generated output at scale. Quality engineers need structured frameworks to evaluate test artifacts across several key dimensions:

  • Relevance: Does the test relate to critical functionality or business logic?
  • Effectiveness: Can the test detect faults or regressions if they occur?
  • Clarity: Is the test readable, maintainable, and logically sound?
  • Redundancy: Do multiple tests cover the same condition unnecessarily?


Meta-testing frameworks often draw from established testing heuristics, such as RCRCRC (Recent, Core, Risky, Configuration-sensitive, Repaired, Chronic) [Johnson 2009], repurposed to evaluate the test set rather than the application under test.
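
One lightweight way to apply these dimensions is to turn them into an explicit rubric with an acceptance threshold. The sketch below is illustrative; the scoring scale and threshold are assumptions rather than an established standard.

    from dataclasses import dataclass

    # Minimal sketch: scoring a generated test against the four dimensions above.
    # The 0-2 scale and the acceptance threshold are illustrative assumptions.
    @dataclass
    class TestScore:
        relevance: int      # 0-2: tied to critical functionality or business logic?
        effectiveness: int  # 0-2: would it detect a fault or regression?
        clarity: int        # 0-2: readable, maintainable, logically sound?
        redundancy: int     # 0-2: 2 = unique, 0 = duplicates an existing test

        def accept(self, threshold: int = 6) -> bool:
            return self.relevance + self.effectiveness + self.clarity + self.redundancy >= threshold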

Sanity Checks, Mutation Testing, and Redundancy Analysis

Three practical techniques stand out in the meta-testing toolkit:

  • Sanity checks: A quick but essential step, ensuring that automated tests compile, execute, and behave as expected. This filters out malformed or non-executable test automation that AI may occasionally produce.
  • Mutation testing: By injecting small faults (mutations) into the codebase and checking whether the test suite catches them, teams can measure the real effectiveness of their tests. High mutation scores indicate strong fault detection; low scores signal gaps, even if test coverage appears high (see the sketch after this list).
  • Redundancy analysis: AI often generates repetitive or slightly varied tests. Teams can reduce noise and maintenance effort by identifying duplicate logic or overlapping assertions while preserving actual coverage.
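
To make the mutation-testing step concrete, the sketch below seeds a single fault by hand; in practice a dedicated tool (such as mutmut for Python or PIT for Java) generates and runs the mutants automatically. The function and test are illustrative assumptions.

    # Minimal sketch of the mutation-testing idea: a seeded fault shown next to the original.
    # A mutation tool would swap the mutant in and rerun the suite automatically.
    def is_adult(age: int) -> bool:
        return age >= 18              # original logic

    def is_adult_mutant(age: int) -> bool:
        return age > 18               # seeded fault: boundary operator flipped

    def test_is_adult_boundary():
        # If a tool substituted the mutant for is_adult, this boundary check would fail,
        # i.e. the mutant would be "killed". A suite that only tested age=30 would let it survive.
        assert is_adult(18) is True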


These methods shift the focus from quantity to quality, turning test validation into a first-class engineering activity rather than a checkbox exercise.

Creating Confidence Thresholds

To operationalize meta-testing, organizations can establish confidence thresholds: criteria that define when a set of AI-generated tests is “good enough” to be trusted. These might include:

  • Minimum mutation score thresholds per module or feature.
  • Maximum allowed redundancy ratio across a test suite.
  • Required coverage for specific risk categories or user stories.
  • Minimum pass rates for human-reviewed samples.


These thresholds help scale quality assurance without sacrificing rigor. They also provide clear guidance on when expert review is required and when the system can proceed autonomously.
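
One way to operationalize such thresholds is to encode them as an explicit, reviewable policy that the pipeline checks before accepting a batch of generated tests. The sketch below is illustrative; the metric names and values are assumptions, not recommended targets.

    # Minimal sketch: confidence thresholds as an explicit, reviewable policy.
    # Metric names and values are illustrative assumptions for a single module.
    THRESHOLDS = {
        "min_mutation_score": 0.70,    # fraction of seeded faults the suite must catch
        "max_redundancy_ratio": 0.15,  # fraction of tests flagged as duplicates
        "min_risk_coverage": 0.90,     # coverage of identified risk categories
        "min_sample_pass_rate": 0.80,  # acceptance rate of human-reviewed samples
    }

    def can_proceed_autonomously(metrics: dict[str, float]) -> bool:
        """Return True only if every measured value clears its threshold; otherwise route to expert review."""
        return (
            metrics["mutation_score"] >= THRESHOLDS["min_mutation_score"]
            and metrics["redundancy_ratio"] <= THRESHOLDS["max_redundancy_ratio"]
            and metrics["risk_coverage"] >= THRESHOLDS["min_risk_coverage"]
            and metrics["sample_pass_rate"] >= THRESHOLDS["min_sample_pass_rate"]
        )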