While GenAI tools can function independently, they rely on human input, via prompts, for meaningful direction. Writing effective prompts is not a trivial task; it is a true skill. That’s why we have titled this part of the book “Prompt Crafting”. We trust you will benefit greatly from mastering this craft.
When working with Generative AI tools or models, you usually use prompts to guide these tools. But what exactly is a prompt, and how do you create an effective prompt to get the output you want? Writing prompts is also known as Prompt Engineering, Prompt Crafting or Prompt Design, but what does that really mean? What happens behind the scenes when you enter a prompt? How do you create a prompt that is sustainable and easy to maintain, and how can you effectively share it with others within your organization? That’s what we’ll explain in this module.
A prompt is the input you give to a model. Think of it as sending a message to tools like ChatGPT or Copilot. The message you type in is called a prompt. This could be a simple question or a more detailed instruction about how the model should generate the output.
Prompt Engineering, Prompt Crafting or Prompt Design refers to the process of creating these inputs (or prompts). A Prompt Engineer carefully considers the desired output and what the Large Language Model (LLM) or Diffusion Model needs to accomplish. Based on that, the Prompt Engineer formulates a prompt to guide the model toward producing the intended result.

It’s important to note that there are different forms of Prompt Engineering, including user prompts and system prompts. When working with prompts within teams, it’s crucial to clearly define which type of prompt you’re discussing to avoid confusion.
Since user and system prompts serve different purposes and have distinct requirements, they need to be approached differently. Clear communication about which type of prompt is being used ensures consistency and prevents misunderstandings within a team.

Additionally, the structure of user prompts can vary significantly based on the specific use case. For instance, whether the goal is textual or visual output will influence how prompts are formulated. Clearly specifying the tool or model for which the prompts are being written, along with detailing the desired outcome, can greatly facilitate internal discussions and enhance clarity.
A Large Language Model (LLM) is a type of artificial intelligence designed to recognize and generate human-like text based on vast amounts of data. These models are trained using extensive datasets from books, articles, websites, and other text sources to learn language patterns, grammar, context, and semantics. LLMs can perform various tasks, such as language translation, text summarization, question answering, creative writing or coding. They leverage complex algorithms and deep learning techniques to predict and generate coherent and contextually appropriate text, making them powerful tools for natural language processing applications.
It’s important to understand the concept of predictions, the foundation of how Large Language Models (LLMs) operate. LLMs are designed to predict the next token in a sequence based on the context provided by previous tokens available in the context window. A token can be a word, part of a word, or even a character, depending on the model’s structure and training. The image below displays a prompt that has been tokenized using OpenAI’s playground. Each color represents a different token.
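As a minimal illustration of how text becomes token IDs, the sketch below uses a hand-made toy vocabulary. Real models use learned subword tokenizers (such as BPE), so actual token boundaries and IDs differ per model; the vocabulary and IDs here are invented purely for illustration.

```python
# Toy tokenizer: greedily match the longest known token at each position.
# Real tokenizers use learned subword vocabularies with tens of thousands
# of entries; this vocabulary is invented for illustration only.
toy_vocab = {"This": 0, " is": 1, " an": 2, " example": 3, " of": 4,
             " a": 5, " prompt": 6, ".": 7}

def encode(text: str, vocab: dict) -> list[int]:
    """Convert text into a list of token IDs using longest-match-first."""
    ids = []
    while text:
        for tok in sorted(vocab, key=len, reverse=True):
            if text.startswith(tok):
                ids.append(vocab[tok])
                text = text[len(tok):]
                break
        else:
            raise ValueError(f"no token matches: {text!r}")
    return ids

print(encode("This is an example of a prompt.", toy_vocab))
# → [0, 1, 2, 3, 4, 5, 6, 7]
```

Note how tokens can include leading spaces: a tokenizer splits raw text, not neatly separated words, which is why token counts rarely equal word counts.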
LLMs analyze massive amounts of text data to identify patterns and relationships between tokens using a neural network. When you input a prompt, the model generates a response by predicting each next token, creating coherent and contextually appropriate text. Consider the illustration provided below. It represents the process by which the model uses probabilities to identify the most likely next tokens and places them into a spinning wheel mechanism. The wheel is rotated, and the position at which it halts determines the token that will be added to the sequence. Since this process is based on probabilities, the model can sometimes make mistakes or generate irrelevant answers. This makes precise and well-structured prompts essential for guiding the model toward the right outcome.
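The “spinning wheel” described above is, in effect, weighted random sampling: likelier tokens get a bigger slice of the wheel, but any token with non-zero probability can be picked. A minimal sketch, with invented probabilities for a hypothetical prompt:

```python
import random

# Hypothetical next-token probabilities after the prompt "The cat sat on the".
# The values are invented for illustration; a real model produces a
# probability for every token in its vocabulary.
next_token_probs = {" mat": 0.55, " sofa": 0.20, " floor": 0.15, " moon": 0.10}

def spin_wheel(probs: dict) -> str:
    """Pick the next token by weighted random sampling (the 'wheel')."""
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights, k=1)[0]

print(spin_wheel(next_token_probs))  # usually " mat", but sometimes " moon"
```

Because the pick is probabilistic, running this repeatedly produces different tokens, which is exactly why the same prompt can yield different answers on different runs.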
In addition to predicting tokens, it is essential to understand how those tokens are internally represented as vectors. Before any prediction can occur, each token is first mapped to a token ID, a numerical index in the model’s vocabulary. These IDs are then transformed into high-dimensional vectors, or embeddings, which capture both semantic and syntactic meaning.
These token vectors are how the model processes language. By comparing and combining them, the model identifies relationships, similarities, and patterns, allowing it to generate contextually relevant outputs. The distances between vectors help explain how the model recognizes meaning. Tokens with similar meanings are represented by vectors that are close together in the model’s high-dimensional space. This vector-based representation is a fundamental part of how large language models understand and generate language. It also helps explain why their responses can sometimes be unexpected or differ from what we might anticipate.

For example, the sentence “This is an example of a prompt that has been tokenized.” is first broken into individual tokens, which are then converted into token IDs. These IDs are shown in the figure below.
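The idea that “tokens with similar meanings are close together” can be sketched with cosine similarity. The 3-dimensional vectors below are invented for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math

# Toy 3-dimensional "embeddings"; the values are invented purely for
# illustration. Real models learn embeddings with far more dimensions.
embeddings = {
    "cat": [0.90, 0.80, 0.10],
    "dog": [0.85, 0.75, 0.20],
    "car": [0.10, 0.20, 0.90],
}

def cosine_similarity(a: list, b: list) -> float:
    """1.0 means pointing the same way; near 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))  # close to 1
print(cosine_similarity(embeddings["cat"], embeddings["car"]))  # much lower
```

“cat” and “dog” end up far more similar to each other than either is to “car”, which is the geometric intuition behind the model recognizing related meanings.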
The context window defines the maximum number of tokens an LLM can process at one time. It serves as the model’s “working memory.”

A typical context window includes:

- The system prompt
- The user’s input
- The model’s output
- Any prior conversation history
When the total token count exceeds the context window’s limit (e.g., 128,000 tokens in GPT-4o or even a million for GPT-4.1), the model starts to forget older parts of the conversation by discarding the earliest tokens. This can cause loss of context, leading to less accurate responses.
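The “discard the earliest tokens” behavior can be sketched as a simple sliding window. The tiny limit below is for illustration; real limits are tens of thousands of tokens or more.

```python
# Minimal sketch of context-window truncation: when the running token
# count exceeds the limit, the oldest tokens are dropped first.
# An 8-token limit is absurdly small, but it makes the effect visible.
CONTEXT_LIMIT = 8

def trim_to_window(tokens: list, limit: int = CONTEXT_LIMIT) -> list:
    """Keep only the most recent `limit` tokens; older ones are forgotten."""
    return tokens[-limit:]

history = ["Hello", ",", " how", " are", " you", "?", " Fine", ",", " thanks", "."]
print(trim_to_window(history))
# → [' how', ' are', ' you', '?', ' Fine', ',', ' thanks', '.']
```

Notice that "Hello" has silently disappeared: the model never chose to forget it, the window simply no longer contains it.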
The context window is also why LLMs don’t have true memory. If you ask the model to think of an animal and then try to guess it, the model can’t “remember” the animal: the animal it supposedly thought of was never stored anywhere in the context window. It is simply predicting plausible responses based on the current context.
Understanding how the context window functions allows you to design prompts that fit within these constraints, ensuring more consistent and relevant responses from the model.
Properly managing the context window enhances both the speed and quality of the model’s responses. For instance, clearing the context window can typically be achieved by initiating a new conversation.
RAG (Retrieval-Augmented Generation) is a technique where an additional layer of information is used to enhance a model’s output. This often involves connecting the model to external data sources—such as your own documents or files—which the model can retrieve and incorporate into its responses. This allows the tool to generate more accurate and context-aware results based on specific, user-provided information.
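A minimal RAG sketch: retrieve the document that best matches the question, then prepend it to the prompt. The documents and the word-overlap scoring below are invented for illustration; real systems use embedding similarity and a vector store instead.

```python
import re

# Invented example documents standing in for a user's own files.
documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available on weekdays between 9:00 and 17:00.",
]

def words(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))

def retrieve(question: str, docs: list) -> str:
    """Pick the document sharing the most words with the question.
    (A real RAG pipeline would compare embedding vectors instead.)"""
    q = words(question)
    return max(docs, key=lambda d: len(q & words(d)))

def build_prompt(question: str) -> str:
    context = retrieve(question, documents)
    return f"Answer using this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("What is the refund policy?"))
```

The model then answers from the retrieved snippet rather than from its training data alone, which is what makes the output more accurate and context-aware.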
In leading tools that use these models, “memory” is sometimes implemented as well. You can think of this as a type of database where information inputted by the user is stored over time. The tool can use this memory to enhance responses similarly to how RAG works. However, when memory is enabled, it’s essential to manage it carefully. If data is stored automatically that wasn’t intended to be retained, it may affect the quality and relevance of future results.
Tool integrations, such as web browsing, code execution, and file uploads, extend the principles behind RAG and memory. Like RAG, they allow the model to retrieve external information. Like memory, they can store or recall relevant context over time. Many GenAI platforms, such as ChatGPT, (GitHub) Copilot and Windsurf, offer tool integrations to enhance what the model can do beyond its built-in training and context window.
Examples include:

- Web browsing, to retrieve up-to-date information from the internet
- Code execution, to run and verify generated code
- File uploads, so the model can read user-provided documents
These tools may be triggered automatically based on the prompt or enabled manually. When used effectively, they allow GenAI models to generate more accurate, task-relevant responses and turn static conversations into dynamic, interactive workflows.
When models are implemented in tools, the response to a prompt often depends on a few important settings behind the scenes. These are called output parameters, and they help shape the tone, length, and style of the model’s responses. Some tools allow users to adjust these settings. Others don’t, but understanding what they do can help explain why a model feels more creative, more direct, or more repetitive at times.
The most common parameters include:

- Temperature: controls randomness; lower values give more deterministic output, higher values give more varied, creative output
- Top-p (nucleus sampling): limits token choices to the smallest set of most probable tokens
- Maximum tokens: caps the length of the response
- Frequency and presence penalties: discourage repetition and encourage new topics
These settings don’t change the knowledge inside the model, or what documents it has access to—but they do influence how the model uses that information to reply. Even if you’re just prompting inside a tool, knowing this can help you shape better, more reliable outputs.
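Temperature, the best-known output parameter, can be sketched with the softmax function: it rescales the model’s raw scores (logits) before they become probabilities. The logits below are invented for illustration.

```python
import math

# Invented logits (raw scores) for three candidate next tokens.
logits = {" mat": 2.0, " sofa": 1.0, " moon": 0.1}

def softmax_with_temperature(logits: dict, temperature: float) -> dict:
    """Low temperature sharpens the distribution (more deterministic);
    high temperature flattens it (more varied output)."""
    scaled = {tok: v / temperature for tok, v in logits.items()}
    total = sum(math.exp(v) for v in scaled.values())
    return {tok: math.exp(v) / total for tok, v in scaled.items()}

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, {tok: round(p, 3) for tok, p in probs.items()})
```

At temperature 0.2 almost all probability mass lands on the top token, while at 2.0 the choices are far more even, which is why high temperatures feel more creative and low ones more repetitive.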
Most tools that use models have guardrails implemented. Guardrails refer to the mechanisms and constraints put in place to ensure safe, ethical, and responsible use of the model. These are crucial for preventing harmful behavior, misuse, or unintended output. For example, if you are asking most tools “How to build a bomb”, it will refuse to answer and output: “Sorry I cannot assist you with that”. This is because of the set guardrails for this tool to make sure people don’t use the tools for unintended output. Guardrails can be technical (such as content filtering, which automatically blocks or removes unsafe or inappropriate inputs or outputs, and flagging, which marks certain content for review or further action), procedural, or policy based.
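A technical guardrail of the content-filtering kind can be sketched as a check that runs before the prompt ever reaches the model. The blocked phrases and the model stub below are invented; real guardrails combine trained classifiers, policies, and human review rather than simple keyword matching.

```python
# Invented blocklist for illustration; real content filters use trained
# classifiers, not literal string matching.
BLOCKED_PHRASES = {"build a bomb", "synthesize a virus"}

def call_model(prompt: str) -> str:
    """Stub standing in for the actual model call."""
    return f"(model answers: {prompt})"

def guarded_call(prompt: str) -> str:
    """Refuse blocked prompts before they reach the model."""
    if any(phrase in prompt.lower() for phrase in BLOCKED_PHRASES):
        return "Sorry, I cannot assist you with that."
    return call_model(prompt)

print(guarded_call("How to build a bomb"))
# → Sorry, I cannot assist you with that.
print(guarded_call("How do I write a test plan?"))
```

The same filtering idea is typically applied on the output side as well, so unsafe completions are blocked or flagged even when the prompt itself looked harmless.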
One of the most well-known limitations of Large Language Models (LLMs) is hallucination. In this context, hallucination refers to when the model generates output that is factually incorrect, nonsensical, or entirely made up, even though it may sound plausible or convincing. Keep in mind that we prefer the more precise term confabulation, but to align with common literature we also keep using hallucination in some places.
Because LLMs generate text by predicting the next most likely token based on patterns in their training data, they do not truly “understand” facts in the way humans do. Instead, they simulate understanding by generating statistically likely sequences. As a result, the model can sometimes confidently present false information as if it were true; this is a hallucination.
Related to hallucination is the concept of confabulation; both terms originate from psychology. Confabulation is based on the verb “to confabulate”, meaning: to fabricate imaginary experiences to compensate for memory loss. In Generative AI, this refers to a model filling in knowledge gaps and generating content to produce a complete output, though it may not always be accurate. A significant risk with Generative AI is that it produces seemingly correct and thorough results, but the model might have added false information to fill in missing parts. Although we think that confabulation is a more accurate term, we notice that the commonly used term in the world of AI is hallucination.
Hallucination arises from the probabilistic nature of LLMs. The model is trained to generate text that looks like human writing, not to retrieve verified truths. It doesn’t “know” facts the way a human does. Instead, it builds responses from statistical associations between words and phrases it has seen during training.

Critically, these associations are based on past data. Most LLMs are trained on data that only goes up to a certain cutoff date. That means they lack awareness of events, trends, or changes in the world that occurred after that point. They cannot browse the internet or understand real-time situations unless explicitly connected to live data sources via tools (APIs, MCP).

For example, when the COVID-19 pandemic started, early AI systems would still predict morning traffic jams, crowded events, or open restaurants in generated content, despite lockdowns and major global disruptions. This wasn’t due to malice or ignorance, but because the models were trained on pre-pandemic data and had no inherent understanding of what “COVID” meant or how it changed human behavior.
In short:

- LLMs generate statistically likely text; they do not retrieve verified truths
- Their knowledge ends at a training cutoff date
- They have no real-time awareness unless explicitly connected to live data sources via tools
This is why verifying outputs, stating timeframes clearly, and being precise in your prompts are key strategies to reduce hallucination and improve relevance.
Confabulation and hallucination in generative AI pose significant risks to quality engineering and testing by introducing convincing falsehoods that can mislead teams and compromise system reliability. When AI generates plausible yet entirely fabricated content, such as invented code snippets, nonexistent APIs, or imagined function parameters, it can appear authoritative while being fundamentally incorrect. This undermines trust and can lead to faulty implementations, especially when teams unknowingly incorporate these fabrications into their development or automation pipelines. Moreover, hallucinated test cases for features that don’t exist can waste valuable resources, distort test coverage, and create confusion about system requirements. For quality engineers, recognizing and mitigating these risks is critical to ensure the integrity and validity of AI-assisted outputs.
You must verify sources yourself. For instance, many tools can visit a JavaScript-based website but cannot actually read its content. They may then cite other sources while still listing that site as one of them, which can lead to incorrect conclusions if the site actually contains different information.
Understanding hallucination is key to using LLMs responsibly. They can be powerful assistants, but they need supervision, especially when accuracy matters.
Proper Prompt Engineering can help address several challenges when working with Large Language Models (LLMs), including bias, unpredictability, lack of real-world knowledge, and language and cultural nuances. However, understanding the roots and limits of bias in these systems is essential to use them responsibly and effectively.
LLMs are trained on massive datasets collected from the internet, books, articles, and other sources of publicly available text. This data reflects the world as it is, or at least how it is represented online, which includes human biases.
These can be:

- Gender bias (e.g., associating certain professions predominantly with men or women)
- Cultural or geographic bias (over-representing certain regions or languages)
- Political or ideological bias
- Racial or ethnic bias
For example, if you ask a model to “generate an image of a doctor” or “describe a CEO,” the default result may skew toward a male figure. That’s not because the model is intentionally biased, but because its training data disproportionately reflects certain stereotypes.
Prompt Engineering allows users to steer the model away from these biases by being explicit and intentional in their prompts. Instead of relying on the model’s default assumptions, you can clarify your expectations.
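As a simple illustration, compare a prompt that relies on the model’s defaults with one that states expectations explicitly. Both prompts are invented examples, not prescribed wording:

```python
# Invented example prompts contrasting implicit defaults with explicit
# qualifiers that steer the model away from stereotyped output.
default_prompt = "Describe a CEO."

explicit_prompt = (
    "Describe a CEO. Do not assume gender, age, or nationality; "
    "if you give multiple examples, vary these attributes across them."
)

print(default_prompt)
print(explicit_prompt)
```

The explicit version costs a few extra tokens but removes the ambiguity that lets training-data stereotypes fill the gap.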
By adding specific qualifiers, you can guide the model toward more inclusive and representative outputs. Please note that while these are examples, most tools have guardrails in place that also steer the model toward less biased answers. However, what counts as bias may vary across cultures and individuals. Clearly state your goals in the prompt for the best results.
Beyond individual biases, some biases are systemically built into the structure and management of the models themselves. For instance, models hosted or controlled in countries with strict content regulations may reflect those nations’ values, censorship rules, or political boundaries.
Bias in GenAI systems poses a significant challenge for quality engineering and testing, particularly when generating test cases. Because large language models are trained on publicly available data, they tend to perform well in domains with ample representation, such as mobile banking apps, which are extensively documented and discussed online. As a result, GenAI can produce reasonably accurate and relevant test cases for such mainstream applications. However, the performance drops noticeably when applied to bespoke software with unique or proprietary business processes, where relevant data is scarce or non-existent in the training corpus. This discrepancy highlights a critical limitation for quality engineering in our typical context, where we focus on testing custom-built enterprise systems. Without tailored finetuning or domain-specific input, GenAI-generated outputs for these systems risk being inaccurate, incomplete, or irrelevant, thereby undermining their usefulness in our quality processes.
Even with careful prompt design, no prompt can completely eliminate bias; we humans are not free of bias either. There will always be some influence from the model’s training data and the decisions made during development. But awareness, critical thinking, and effective prompting can help mitigate these effects and produce more ethical, diverse, and accurate outputs. What is considered bias also depends on the context: Western countries might consider things biased that Eastern countries do not, and vice versa.