Organizations often want to decide which GenAI model to use. Sometimes they even want to use multiple GenAI models in parallel, for different purposes. To determine which model is best suited for a specific use, a comparison of models is necessary. Keep in mind that a model that performs well for one type of use may not be the best choice for other applications. Comparing models isn’t a trivial task and should be well considered and organized.
To compare models, an intuitive approach would be to use a standard test dataset, feed it to the different models, and check and compare their behavior. Unfortunately, experience shows that this approach doesn’t work for real-world examples. One reason is that once a GenAI model has seen the dataset, it remembers it and adjusts its behavior, which means that you would need a new dataset for every test run. This is a drawback of public test datasets. AI teams therefore apply private test datasets for benchmarks, so that the dataset isn’t leaked and cannot be used for training.
Another reason is that a model may be good at answering human questions but bad at performing a task in a specific application or domain.
New methods for comparing models are emerging—for example, through tools like LLM Arena (https://lmarena.ai/). Users ask questions to multiple models and compare and contrast the responses, ultimately choosing the “best answer”. This leads to an “Elo score” (named after inventor Arpad Elo) that indicates a comparative level of quality for LLMs. This may be a proper measure of how good a model is at answering human questions.
Still, there are some systematic drawbacks, as described for example in this paper: https://cohere.com/research/papers/the-leaderboard-illusion-2025-04-30.
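To illustrate the mechanism, below is a minimal sketch of how such an Elo rating could be updated after a single pairwise vote. The model names, starting ratings and K-factor are illustrative assumptions, not the arena's actual parameters.

```python
# Minimal sketch of an Elo update from one pairwise "best answer" vote.
# Starting ratings and the K-factor are illustrative assumptions.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    """Return the new ratings after one pairwise comparison."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_wins else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Example: two hypothetical models start at 1000; model A wins one vote.
ratings = {"model-a": 1000.0, "model-b": 1000.0}
ratings["model-a"], ratings["model-b"] = elo_update(
    ratings["model-a"], ratings["model-b"], a_wins=True
)
print(ratings)  # model-a rises, model-b drops by the same amount
```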
However, comparing models may be a lot of work and does not always lead to a score relevant to your situation. Another possibility is to use “LLM as a judge”, where the answer of one GenAI model is entered into another GenAI model along with a prompt asking it to evaluate the quality of the answer. The advantage of this approach is that it scales really well. The second and more important advantage is that you can build it into your application flow and get “real” scores of a model performing an actual task.
The disadvantage, however, is that GenAI models often favor their own style of reasoning and may evaluate responses from other types of models less favorably.
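As an illustration, the following sketch shows the basic shape of the “LLM as a judge” pattern. The `call_model` function is a hypothetical placeholder for whatever client your ecosystem provides, and the rubric and 1-5 scale are assumptions to be adapted to your own quality criteria.

```python
# Minimal sketch of the "LLM as a judge" pattern. `call_model` is a
# hypothetical placeholder for your provider's SDK; the rubric is an assumption.
import json

JUDGE_PROMPT = """You are a strict evaluator.
Question: {question}
Answer: {answer}
Rate the answer for correctness and completeness on a scale of 1-5.
Respond as JSON: {{"score": <int>, "reason": "<short explanation>"}}"""

def call_model(model: str, prompt: str) -> str:
    """Placeholder: send a prompt to the given model and return its reply."""
    raise NotImplementedError("wire this to your provider's SDK")

def judge_answer(question: str, answer: str, judge_model: str = "judge-model") -> dict:
    """Ask a second model to grade the first model's answer."""
    reply = call_model(judge_model, JUDGE_PROMPT.format(question=question, answer=answer))
    return json.loads(reply)  # e.g. {"score": 4, "reason": "..."}

# Built into the application flow, every production answer can be scored
# this way, which is what makes the approach scale.
```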
At a large government organization, there was a need to develop a custom chatbot to answer questions about the huge number of documents available in the organization.
The quality of the answers of the GenAI-based chatbot is monitored using the “LLM as a judge” approach: another GenAI model evaluates all answers of the chatbot and the trend in the scores is analyzed. This helped detect drift in the answers, caused by evolution of the model involved or by problems with the flexibility of the API used to interface with the model.
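One simple way to operationalize such trend analysis is sketched below: compare the recent average judge score against an earlier baseline and flag a noticeable drop. The window sizes and threshold are assumptions that would have to be tuned for the chatbot in question.

```python
# Illustrative drift check on judge scores: compare the recent average
# against a baseline. Window sizes and threshold are assumptions.
from statistics import mean

def detect_drift(scores: list[float], baseline_window: int = 100,
                 recent_window: int = 20, max_drop: float = 0.5) -> bool:
    """Return True if the recent average judge score dropped noticeably
    compared to the earlier baseline."""
    if len(scores) < baseline_window + recent_window:
        return False  # not enough data yet
    baseline = mean(scores[:baseline_window])
    recent = mean(scores[-recent_window:])
    return (baseline - recent) > max_drop

# Example with hypothetical 1-5 judge scores: a sudden drop triggers the alert.
history = [4.2] * 100 + [3.4] * 20
print(detect_drift(history))  # True -> investigate model or API changes
```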
When deciding which model to use, users often have to be involved to get an unbiased, fair and usable comparison.
Many organizations, however, don’t allow a free choice of model. Often, an organization has selected a specific ecosystem (such as Microsoft, AWS, Google, etc.) and co-workers can only use the GenAI models available in that ecosystem. Even in that situation, it’s wise to examine the strengths and weaknesses of the models, so that the users can understand what to expect. Be aware that such scores are only valid for a limited period of time, since GenAI models evolve quickly. This evolution has two causes: first, like any IT system, the teams involved work on improving and extending their IT system; second, with GenAI tools, the underlying models themselves are improving at a high pace.
This means that the perceived quality of the model in use may change over time. The teams involved should define a minimum required quality level and regularly check whether the GenAI model in use still meets this level. If the GenAI model does not meet this level, that is a trigger to challenge the selection.
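A minimal sketch of such a recurring check is shown below, assuming a fixed (private) evaluation set and an agreed minimum pass rate; the `evaluate_case` hook and the 90% threshold are illustrative assumptions.

```python
# Sketch of a recurring quality gate: rerun a fixed evaluation set and
# compare the pass rate against the agreed minimum level.
# `evaluate_case` and MIN_PASS_RATE are illustrative assumptions.

MIN_PASS_RATE = 0.90  # the minimum level the team agreed on

def evaluate_case(case: dict) -> bool:
    """Placeholder: run one evaluation case against the current model."""
    raise NotImplementedError("call the model and score the result here")

def quality_gate(cases: list[dict]) -> bool:
    """Return True if the model still meets the minimum quality level."""
    passed = sum(1 for case in cases if evaluate_case(case))
    pass_rate = passed / len(cases)
    print(f"pass rate: {pass_rate:.2%} (minimum {MIN_PASS_RATE:.0%})")
    return pass_rate >= MIN_PASS_RATE

# Run this on a schedule (e.g. nightly); a failing gate is the trigger
# to challenge the model selection.
```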
When we see GenAI as part of the IT system as a whole, testing such an IT system is not that much different from testing traditional IT systems. There are different varieties of testing, including testing the components, testing integrations and end-to-end testing of the business process. One major difference, however, comes from the non-deterministic nature of GenAI models. This means that what works perfectly today may not work correctly tomorrow, because the model could change based on the input it receives.
Rule one: be aware that GenAI models are non-deterministic. Traditionally programmed (i.e. rule-based) IT systems will not change their behavior unless the code is changed. GenAI models may give different outputs due to their probabilistic nature. This may lead to better results over time, but it may also lead to worse results.
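One practical consequence for testing is to assert on a pass rate over repeated runs rather than on a single, exactly reproducible output. The sketch below assumes a hypothetical `run_case` hook and an 80% acceptance rate; both are illustrative choices.

```python
# Sketch of testing a non-deterministic component: run the same case
# several times and assert on the pass rate, not on one exact output.
# `run_case` and the thresholds are illustrative assumptions.

def run_case(prompt: str, expected_keyword: str) -> bool:
    """Placeholder: call the model once and check whether the response
    contains the expected keyword (or passes another check)."""
    raise NotImplementedError

def flaky_tolerant_check(prompt: str, expected_keyword: str,
                         runs: int = 10, required_rate: float = 0.8) -> bool:
    """Accept the case if at least `required_rate` of the runs succeed."""
    successes = sum(run_case(prompt, expected_keyword) for _ in range(runs))
    return successes / runs >= required_rate
```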
Therefore, constant quality monitoring should be implemented when using GenAI-based systems. This includes logging of decisions made and process steps performed so that the concept of “explainable AI” is followed.
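As an illustration, structured logging of decisions and process steps could look like the sketch below; the field names and step labels are assumptions, not a prescribed format.

```python
# Sketch of structured logging for decisions and process steps, so that an
# answer can later be traced and explained. Field names are assumptions.
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("genai-audit")

def log_step(session_id: str, step: str, detail: dict) -> None:
    """Write one process step as a structured, machine-readable record."""
    record = {
        "timestamp": time.time(),
        "session": session_id,
        "step": step,      # e.g. "retrieval", "generation", "judge"
        "detail": detail,  # e.g. prompt, retrieved sources, score
    }
    logger.info(json.dumps(record))

# Example: trace how an answer was produced and how it was judged.
log_step("sess-42", "generation", {"model": "chat-model", "prompt_tokens": 512})
log_step("sess-42", "judge", {"score": 4, "reason": "complete and correct"})
```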
The pitfall of this need for continuous monitoring is that it will require some level of human involvement (“expert in the loop”).
We doubt whether, in the long run, organizations will remain willing to retain a sufficient number of human experts to support this monitoring process. The concept of “human oversight” as mentioned in the EU AI Act addresses this need.
However, how much human oversight organizations are willing to implement—considering the high costs and perceived low number of failures—remains to be seen.
People who previously needed to rely on specialists to accomplish certain tasks now have very powerful tools, leading them to believe they no longer need specialists and can perform the tasks themselves in collaboration with GenAI tools. In coding, this concept is known as Vibe Coding (see Section 6.5.1). This concept also applies to other forms of GenAI use. For simple tasks, this “vibe behavior” may result in acceptable results. In the long run, however, especially with more complex tasks that are not one-off, this may lead to IT systems producing unpredictable (and often incorrect) results. Teams often don’t have the knowledge and skills to oversee all relevant aspects of their situation. They may also lack a clear vision of what the end result should be like, and, finally, they often do not have the experience to anticipate what might go wrong.
This calls for implementing amplified quality engineering, where you strive to build the right level of quality from the start. First, the goal and vision are elaborated, then the overall concept for the IT solution is worked out, and only then is the IT system created. Don’t get us wrong, all of those steps can be supported by collaborating with GenAI tools, but it still requires a structured process to start with. Organizations should work towards empowering their teams to do exactly this, supplying them with proven tools and supporting them in using those tools. While the speed that can be achieved by collaborating with GenAI tools is highly desirable, it is essential to establish guardrails to ensure the resulting IT systems maintain the right quality level.
A comforting thought: we think that the problem of people doing things they had better not do is largely temporary. Over time, people will get more experienced with how to use GenAI and what not to do with GenAI, just as they did with previous technological advancements. The only major difference is that today’s technological advancements are happening at an unprecedented pace, the power of the tools is equally unprecedented, and this technology is enabled with unprecedented access. (Books, when first introduced, were only available to people who could read. The internet was only available to people who had a computer. Nowadays digital devices are commodities and, with free access to GenAI tooling, almost everyone is able to use it.)
Keep this wisdom about the use of GenAI in mind: A fool with a tool is just a fool…, but a fool with a powerful tool is a dangerous fool!!!