Test Design for Intelligent Machines

Test design is an important topic in TMAP. Structured testing requires careful consideration of what to test and how to test it. We use the term “test design” for the entirety of these activities, even though in some approaches (like exploratory testing) there is no actual up-front test case design involved.
To prepare and perform tests, we distinguish two overall approaches:

Figure. Testing consists of experience-based and coverage-based approaches.

Test Design Approaches for AI

In the more than 30 years of TMAP’s existence, we have described many different test design techniques and approaches. But with the rise of intelligent machines the focus shifts, the existing techniques and approaches described below, gain importance, and new techniques and approaches also become relevant.

Coverage-Based Test Design

Coverage-based test design that pertains to testing of AI based solutions:

A/B testing is a method where two versions (A and B) of a system are compared to determine which variant best fits the expectations of the users and other stakeholders. A and B can be two different versions of a GenAI based tool that are compared.
Explainability refers to making the model’s behavior, decisions, and outputs understandable and interpretable to humans, especially since these outputs are complex results of generative models that create not only text but also images, code, sound, etc. The need for explainability is important due to the opacity, unpredictability, and potential risks of these non-deterministic systems.
Parallel testing consists of running multiple test scenarios concurrently to reduce the test execution time. Usually this is done automatically using test automation scripts. Parallel testing is traditionally used to execute many tests for one AI-based solution and shorten execution time. In the context of intelligent machines it is used to execute the same test for multiple different AI-based solutions (for example calling different AI models) to compare the results in a short time and compare the quality level of those solutions.
Metamorphic testing is a property-based testing technique that addresses the test oracle problem and the test case generation problem. It relies on relationships between multiple executions of a program, rather than just comparing a single input-output pair. By observing how outputs change in response to modified inputs (metamorphic relations), it can detect faults even when a traditional test oracle (the expected output) is unavailable.

Experience-Based Test Design

Experience-based test design approaches that pertain to the testing of AI-based solutions are:

Exploratory testing by pairs (or larger teams) of experts, using prepared charters and heuristics specifically aimed at testing intelligent machines.
Adversarial testing is a test design approach used to evaluate the robustness and security of Generative AI systems by exposing them to intentionally crafted, challenging inputs designed to trigger failures, biases, or policy violations. It helps identify vulnerabilities and edge cases that traditional testing may overlook, ensuring more secure and trustworthy AI deployments.
Examination of an AI-based system serves as a validation step to ensure it meets predefined standards of knowledge and operational skill. See a further explanation below.

By no means do we claim that the above lists of coverage-based and experience-based test design are exhaustive nor definitive lists, since the field GenAI is rapidly evolving new techniques and approaches will very likely pop up. And please remember that the traditional test design techniques and approaches can also be used.

Exploratory Testing Is Often Preferred Over Prepared Test Cases

When preparing for testing an AI-based system, especially when it is Generative AI, a proper preparation is to create one or more charters for exploratory testing. This will be more effective than creating detailed test cases upfront, because of the probabilistic nature of the AI models. This probabilistic nature implies that it is always uncertain what the output of a process will be, therefore it is fundamentally less reliable when compared with a rule-based solution. If, however, the risks are assessed to be within acceptable range, and in balance with expected benefits (such as faster IT delivery), organizations may still decide to apply AI-based solutions. In this situation, performing well organized structured exploratory testing can be a good way to provide information about the quality level. This information helps stakeholders establish their confidence in achieving the pursued business value.

Examination of Intelligent Machines

People learn all sorts of skills and have to prove to be proficient in an exam, for example to obtain a driving license. The exam checks if they know the rules and are able to apply them correctly. After they pass the examination, we trust in the future these people will always make the right decisions.
As with people, we want to know whether an intelligent machine will perform well enough. Testing is, of course, the basic approach to evaluate the quality level. Since it isn’t possible to test all possibilities, we can use the approach of examination to decide if the intelligent machine is to be trusted to perform the task. As soon as the AI passes the exam it can be used in live operation. If the learning is frozen, so the intelligent machine doesn’t change its behavior after the exam, this may be a good approach.

If the intelligent machine keeps on learning, one exam is not sufficient, we should do an examination every once in a while. With people, periodic examination probably wouldn’t be feasible. But with an intelligent system a periodic exam would be possible, provided that the examination itself can also be done by an automated system. With periodic examination, a fundamental question is how often this examination should take place. That has to do with the periodicity of use of the intelligent machine, but most machines will be functioning continuously. So, there will probably be a desire to do continuous examination. Because it’s an operational system, it is better to use the term continuous monitoring. When stakeholders know the system is continuously monitored in a structured and formal way, this will support the confidence in such an operational system. Especially with closed-loop machine learning and other forms of AI systems that continuously improve their behavior, it is important to implement continuous monitoring of the results to check if the results remain within the boundaries of tolerance that were defined as “good behavior”.

[Reference: Testing in a digital age / chapter 4.5 [Ven, 2018]]