Prompt Crafting

While GenAI tools can function independently, they rely on human input, via prompts, for meaningful direction. Writing effective prompts is not a trivial task; it is a true skill. That’s why
we have titled this part of the book “Prompt Crafting”. We trust you will benefit greatly from mastering this craft.

Crafting AI Prompts

When working with Generative AI tools or models, you usually use prompts to guide these tools. But what exactly is a prompt, and how do you create an effective prompt to get the output you want? Writing prompts is also known as Prompt Engineering, Prompt Crafting or Prompt Design, but what does that really mean? What happens behind the scenes when you enter a prompt? How do you create a prompt that is sustainable and easy to maintain, and how can you effectively share it with others within your organization? That’s what we’ll explain in this module.

What is a Prompt?

A prompt is the input you give to a model. Think of it as sending a message to tools like ChatGPT or Copilot. The message you type in is called a prompt. This could be a simple question or a more detailed instruction about how the model should generate the output.

Figure. Input field for entering a prompt.

What is Prompt Engineering?

Prompt Engineering, Prompt Crafting or Prompt Design refers to the process of creating these inputs (or prompts). A Prompt Engineer carefully considers the desired output and what the Large Language Model (LLM) or Diffusion Model needs to accomplish. Based on that, the Prompt Engineer formulates a prompt to guide the model toward producing the intended result.
It’s important to note that there are different forms of Prompt Engineering, including user prompts and system prompts. When working with prompts within teams, it’s crucial to clearly define which type of prompt you’re discussing to avoid confusion.

  • User Prompts: These are the inputs provided directly by the user, such as questions, instructions, or requests. They are typically dynamic and may vary depending on the use case.
  • System Prompts: These are hidden instructions used to guide the model’s behavior, identity, tone, and response style. They define underlying rules for how the model should operate and are created during the implementation of a GenAI tool.

 

Since user and system prompts serve different purposes and have distinct requirements, they need to be approached differently. Clear communication about which type of prompt is being used ensures consistency and prevents misunderstandings within a team.
Additionally, the structure of “User Prompts” can vary significantly based on the specific use case. For instance, whether the goal is textual or visual output will influence how prompts are formulated. Clearly specifying the tool or model for which the prompts are being written, along with detailing the desired outcome, can greatly facilitate internal discussions and enhance clarity.

What is a Large Language Model?

A Large Language Model (LLM) is a type of artificial intelligence designed to recognize and generate human-like text based on vast amounts of data. These models are trained using extensive datasets from books, articles, websites, and other text sources to learn language patterns, grammar, context, and semantics. LLMs can perform various tasks, such as language translation, text summarization, question answering, creative writing or coding. They leverage complex algorithms and deep learning techniques to predict and generate coherent and contextually appropriate text, making them powerful tools for natural language processing applications.

Prediction-Based Responses

It’s important to understand the concept of predictions, the foundation of how Large Language Models (LLMs) operate. LLMs are designed to predict the next token in a sequence based on the context provided by previous tokens available in the context window. A token can be a word, part of a word, or even a character, depending on the model’s structure and training. The image below displays a prompt that has been tokenized using OpenAI’s playground. Each color represents a different token.

Figure. Tokenized prompt.

LLMs analyze massive amounts of text data to identify patterns and relationships between tokens using a neural network. When you input a prompt, the model generates a response by predicting each next token, creating coherent and contextually appropriate text. Consider the illustration provided below. It represents the process by which the model uses probabilities to identify the most likely next tokens and places them into a spinning wheel mechanism. The wheel is rotated, and the position at which it halts determines the token that will be added to the sequence. Since this process is based on probabilities, the model can sometimes make mistakes or generate irrelevant answers. This makes precise and well-structured prompts essential for guiding the model toward the right outcome.

Figure. Prediction-based nature of LLMs.
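The spinning-wheel analogy corresponds to sampling from a probability distribution. A minimal sketch in Python, with made-up token probabilities for illustration:

```python
import random

# Hypothetical next-token probabilities after the prompt "The cat sat on the".
# Real models assign probabilities to tens of thousands of tokens.
next_token_probs = {
    " mat": 0.55,
    " sofa": 0.20,
    " floor": 0.15,
    " keyboard": 0.10,
}

def spin_the_wheel(probs: dict[str, float]) -> str:
    """Pick the next token at random, weighted by its probability."""
    tokens = list(probs)
    weights = list(probs.values())
    return random.choices(tokens, weights=weights, k=1)[0]

# Each call may return a different token, which is why the same prompt
# can produce different completions, and occasionally an unlikely one.
print(spin_the_wheel(next_token_probs))
```

Because the pick is probabilistic, even a low-probability token is sometimes selected, which is one reason outputs vary between runs.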

Vectors

In addition to predicting tokens, it is essential to understand how those tokens are internally represented as vectors. Before any prediction can occur, each token is first mapped to a token ID, a numerical index in the model’s vocabulary. These IDs are then transformed into high-dimensional vectors, or embeddings, which capture both semantic and syntactic meaning.


These token vectors are how the model processes language. By comparing and combining them, the model identifies relationships, similarities, and patterns, allowing it to generate contextually relevant outputs. The distances between vectors help explain how the model recognizes meaning. Tokens with similar meanings are represented by vectors that are close together in the model’s high-dimensional space. This vector-based representation is a fundamental part of how large language models understand and generate language. It also helps explain why their responses can sometimes be unexpected or differ from what we might anticipate.
For example, the sentence “This is an example of a prompt that has been tokenized.” is first broken into individual tokens, which are then converted into token IDs. These IDs are shown in the figure below.

Figure. Prompt represented as token IDs.
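The idea that similar tokens sit close together can be illustrated with cosine similarity on toy vectors. The embeddings below are invented for illustration; real models use hundreds or thousands of dimensions:

```python
import math

# Invented 4-dimensional "embeddings"; real embeddings are learned during training.
embeddings = {
    "dog": [0.8, 0.1, 0.6, 0.0],
    "puppy": [0.7, 0.2, 0.5, 0.1],
    "car": [0.0, 0.9, 0.1, 0.8],
}

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Measure how closely two vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# "dog" and "puppy" point in similar directions, so their similarity
# is higher than that of "dog" and "car".
print(cosine_similarity(embeddings["dog"], embeddings["puppy"]))
print(cosine_similarity(embeddings["dog"], embeddings["car"]))
```

This distance-based view is why semantically related words end up interchangeable in the model’s predictions.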

Context Window

The context window defines the maximum number of tokens an LLM can process at one time. It serves as the model’s “working memory,” encompassing the system prompt, user’s input,
model’s output, and any prior conversation history.

 

A typical context window includes:

 

  • System Prompt: a hidden instruction that guides the model’s behavior and tone.
  • User Prompt: the input provided by the user.
  • Model Output: the model’s own responses, which are fed back into the context window.
  • Previous Conversation History: past interactions that help maintain consistency.

Figure. Components of an LLM context window.

When the total token count exceeds the context window’s limit (e.g., 128,000 tokens in GPT-4o or even a million for GPT-4.1), the model starts to forget older parts of the conversation by discarding the earliest tokens. This can cause loss of context, leading to less accurate responses.


The context window is also why LLMs don’t have true memory. If you ask the model to think of an animal so you can guess it, the model can’t “remember” an animal: it never actually chose one, because nothing is stored outside the context window. It is simply predicting responses based on the current context.

 

Understanding how the context window functions allows you to design prompts that fit within these constraints, ensuring more consistent and relevant responses from the model.

 

Properly managing the context window enhances both the speed and quality of the model’s responses. For instance, clearing the context window can typically be achieved by initiating a new conversation.
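The truncation behavior described above can be sketched as a loop that drops the oldest messages until the conversation fits. This is a simplification: token counts here are approximated as word counts, whereas real tools use the model’s tokenizer:

```python
def fit_context(messages: list[str], max_tokens: int) -> list[str]:
    """Drop the oldest messages until the estimated token count fits."""
    def estimate_tokens(text: str) -> int:
        return len(text.split())  # crude approximation of tokenization

    kept = list(messages)
    while kept and sum(estimate_tokens(m) for m in kept) > max_tokens:
        kept.pop(0)  # forget the earliest message first
    return kept

history = [
    "System: You are a helpful assistant.",
    "User: Summarize chapter one.",
    "Assistant: Chapter one introduces prompts.",
    "User: And chapter two?",
]
# With a tight limit, the system prompt and first exchange are discarded,
# which is exactly how context loss creeps into long conversations.
print(fit_context(history, max_tokens=12))
```

Note that real tools usually protect the system prompt from truncation; this sketch drops strictly oldest-first to keep the mechanism visible.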

RAG and Memory

RAG (Retrieval-Augmented Generation) is a technique where an additional layer of information is used to enhance a model’s output. This often involves connecting the model to external data sources—such as your own documents or files—which the model can retrieve and incorporate into its responses. This allows the tool to generate more accurate and context-aware results based on specific, user-provided information.
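The retrieve-then-generate idea can be sketched in a few lines. Here the retrieval step is naive keyword overlap, purely for illustration; real RAG systems use vector similarity search over embeddings:

```python
# User-provided documents the model itself was never trained on.
documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available on weekdays from 9:00 to 17:00.",
    "Shipping to Europe typically takes 3 to 5 business days.",
]

def retrieve(question: str, docs: list[str]) -> str:
    """Return the document sharing the most words with the question."""
    q_words = set(question.lower().split())
    return max(docs, key=lambda d: len(q_words & set(d.lower().split())))

def build_prompt(question: str) -> str:
    context = retrieve(question, documents)
    # The retrieved text is injected into the prompt so the model can
    # ground its answer in specific, user-provided information.
    return f"Answer using this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("What is the refund policy?"))
```

The key point is that retrieval happens before generation: the model never “learns” the documents; it simply reads the retrieved excerpt inside its context window.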

 

In leading tools that use these models, “memory” is sometimes implemented as well. You can think of this as a type of database where information inputted by the user is stored over time. The tool can use this memory to enhance responses similarly to how RAG works. However, when memory is enabled, it’s essential to manage it carefully. If data is stored automatically that wasn’t intended to be retained, it may affect the quality and relevance of future results.

Tool Integrations

Tool integrations, such as web browsing, code execution, and file uploads, extend the principles behind RAG and memory. Like RAG, they allow the model to retrieve external information. Like memory, they can store or recall relevant context over time. Many GenAI platforms, such as ChatGPT, (GitHub) Copilot and Windsurf, offer tool integrations to enhance what the model can do beyond its built-in training and context window.

 

Examples include:

  • Web browsing, for real-time information.
  • Code interpreters, for calculations or data analysis.
  • Plugins & MCP, for domain-specific tasks such as querying databases, calling APIs, or coordinating actions through autonomous agents.

 

These tools may be triggered automatically based on the prompt or enabled manually. When used effectively, they allow GenAI models to generate more accurate, task-relevant responses and turn static conversations into dynamic, interactive workflows.
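At its core, a tool integration is a dispatch step: the model requests a tool by name, the platform runs it, and the result is fed back into the context window. The tool names and functions below are invented for illustration; real platforms let the model emit structured tool calls:

```python
# Hypothetical tools the assistant can call (stubs for illustration).
def web_search(query: str) -> str:
    return f"(search results for '{query}')"

def run_code(snippet: str) -> str:
    return f"(result of executing '{snippet}')"

TOOLS = {"web_search": web_search, "run_code": run_code}

def handle_tool_call(name: str, argument: str) -> str:
    """Dispatch a tool call requested by the model and return its output."""
    if name not in TOOLS:
        return f"Unknown tool: {name}"
    # The tool's output is appended to the context window so the model
    # can incorporate it into its next response.
    return TOOLS[name](argument)

print(handle_tool_call("web_search", "current exchange rate"))
```

This loop of request, execute, and feed back is what turns a static conversation into an interactive workflow.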

Model Output Parameters

 When models are implemented in tools, the response to a prompt often depends on a few important settings behind the scenes. These are called output parameters, and they help shape the tone, length, and style of the model’s responses. Some tools allow users to adjust these settings. Others don’t, but understanding what they do can help explain why a model feels more creative, more direct, or more repetitive at times.

 

The most common parameters include:

 

  • Temperature: This controls how creative or random the model is. A low temperature makes responses more focused and predictable; a high temperature allows for more variation.
  • Max tokens: This limits how long the model’s response can be. If it’s set too low, answers may be cut off. If it’s high, the response may be long, off-topic, or more expensive to generate.
  • Top-P and Top-K: These limit how many candidate tokens the model considers at each step. They affect how diverse or consistent the answers are.

 

These settings don’t change the knowledge inside the model or what documents it has access to, but they do influence how the model uses that information to reply. Even if you’re just prompting inside a tool, knowing this can help you shape better, more reliable outputs.
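The effect of temperature and top-k can be sketched by reshaping the model’s probability distribution before sampling. The raw scores (logits) below are invented for illustration:

```python
import math

# Invented raw scores (logits) for candidate next tokens.
logits = {"great": 2.0, "good": 1.5, "fine": 1.0, "terrible": 0.1}

def adjusted_probs(logits: dict[str, float], temperature: float, top_k: int) -> dict[str, float]:
    """Apply temperature scaling, keep the top-k tokens, and renormalize."""
    scaled = {t: s / temperature for t, s in logits.items()}
    kept = dict(sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:top_k])
    total = sum(math.exp(s) for s in kept.values())
    return {t: math.exp(s) / total for t, s in kept.items()}

# Low temperature sharpens the distribution toward the top token;
# high temperature flattens it, allowing more variation.
print(adjusted_probs(logits, temperature=0.5, top_k=3))
print(adjusted_probs(logits, temperature=2.0, top_k=3))
```

Top-p works analogously, except it keeps the smallest set of tokens whose cumulative probability exceeds the threshold p, rather than a fixed count.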

Guardrails

Most tools that use models have guardrails implemented. Guardrails refer to the mechanisms and constraints put in place to ensure safe, ethical, and responsible use of the model. They are crucial for preventing harmful behavior, misuse, or unintended output. For example, if you ask most tools how to build a bomb, they will refuse and respond with something like: “Sorry, I cannot assist you with that.” This is because of the guardrails set for the tool, which ensure people don’t use it to produce unintended output. Guardrails can be technical (such as content filtering, which automatically blocks or removes unsafe or inappropriate inputs or outputs, and flagging, which marks certain content for review or further action), procedural, or policy-based.

Hallucination

One of the most well-known limitations of Large Language Models (LLMs) is hallucination. In this context, hallucination refers to when the model generates output that is factually incorrect, nonsensical, or entirely made up, even though it may sound plausible or convincing. Keep in mind that we prefer the more precise term confabulation, but to align with common literature we also keep using hallucination in some places.

 

 Because LLMs generate text by predicting the next most likely token based on patterns in their training data, they do not truly “understand” facts in the way humans do. Instead, they simulate understanding by generating statistically likely sequences. As a result, the model can sometimes confidently present false information as if it were true; this is a hallucination.

Types of Hallucinations

  • Factual Hallucination: When the model invents names, dates, numbers, sources, or events that never occurred.
    Example: “Albert Einstein won the Nobel Prize for his Theory of Relativity in 1921.” (In fact, he won it for the photoelectric effect.)
  • Contextual Hallucination: When the model loses track of context and fabricates details based on an incorrect assumption.
    Example: Misremembering a character’s name or plot point in a conversation about a book.
  • Unsupported Citation: When the model generates fake academic references or URLs that appear legitimate but don’t exist.

Confabulation

Related to hallucination is the concept of confabulation; both terms originate from psychology. To confabulate means to fabricate imaginary experiences to compensate for memory loss. In Generative AI, this refers to a model filling in knowledge gaps and generating content to produce a complete output, though it may not always be accurate. A significant risk with Generative AI is that it produces seemingly correct and thorough results, while the model might have added false information to fill in missing parts.
Although we think that confabulation is a more accurate term, we notice that the commonly used term in the world of AI is hallucination.

Why do Hallucinations happen?

Hallucination arises from the probabilistic nature of LLMs. The model is trained to generate text that looks like human writing, not to retrieve verified truths. It doesn’t “know” facts the way a human does. Instead, it builds responses from statistical associations between words and phrases it has seen during training.
Critically, these associations are based on past data. Most LLMs are trained on data that only goes up to a certain cutoff date. That means they lack awareness of events, trends, or changes in the world that occurred after that point. They cannot browse the internet or understand real-time situations unless explicitly connected to live data sources via tools (APIs, MCP).
For example, when the COVID-19 pandemic started, early AI systems would still predict morning traffic jams, crowded events, or open restaurants in generated content—despite lockdowns and major global disruptions. This wasn’t due to malice or ignorance, but because the models were trained on pre-pandemic data and had no inherent understanding of what “COVID” meant or how it changed human behavior.

 

In short:

 

  • The model may hallucinate when a prompt is vague, ambiguous, or outside its expertise (training data).
  • It may confidently generate incorrect information if the topic involves events or developments after its training data cutoff.
  • And it can misrepresent dynamic or evolving situations, since it has no access to live updates unless specifically designed for that.

 

This is why verifying outputs, stating timeframes clearly, and being precise in your prompts are key strategies to reduce hallucination and improve relevance.

Consequences of Hallucinations

Confabulation and hallucination in generative AI pose significant risks to quality engineering and testing by introducing convincing falsehoods that can mislead teams and compromise system reliability. When AI generates plausible yet entirely fabricated content, such as invented code snippets, nonexistent APIs, or imagined function parameters, it can appear authoritative while being fundamentally incorrect. This undermines trust and can lead to faulty implementations, especially when teams unknowingly incorporate these fabrications into their development or automation pipelines. Moreover, hallucinated test cases for features that don’t exist can waste valuable resources, distort test coverage, and create confusion about system requirements. For quality engineers, recognizing and mitigating these risks is critical to ensure the integrity and validity of AI-assisted outputs.

How to Mitigate Hallucination

  • Be specific: Precise and detailed prompts reduce ambiguity and guide the model more effectively.
  • Fact-check output: Always verify critical information, especially in domains like medicine, law, or science.
  • Use tools and references: When accuracy is essential, pair the LLM with external tools (e.g., a search API, database, RAG, MCP) or reference documents that it must include.
  • Ask for sources cautiously: If a model cites something, double-check it; the citation may be fabricated.

 

You must verify sources yourself. For instance, when retrieving data from a JavaScript-based website, many tools can visit the site but cannot read its JavaScript-rendered content. They may then cite other sources while still listing the site as one of them, which can lead to incorrect conclusions if the site actually contains different information.

Understanding hallucination is key to using LLMs responsibly. They can be powerful assistants, but they need supervision, especially when accuracy matters.

Bias

Proper Prompt Engineering can help address several challenges when working with Large Language Models (LLMs), including bias, unpredictability, lack of real-world knowledge, and language and cultural nuances. However, understanding the roots and limits of bias in these systems is essential to use them responsibly and effectively.

Bias in Training Data

LLMs are trained on massive datasets collected from the internet, books, articles, and other sources of publicly available text. This data reflects the world as it is, or at least how it is represented online, which includes human biases.

These can be:

  • Gender bias
  • Racial bias
  • Cultural bias
  • Socioeconomic bias
  • Political bias

 

For example, if you ask a model to “generate an image of a doctor” or “describe a CEO,” the default result may skew toward a male figure. That’s not because the model is intentionally biased, but because its training data disproportionately reflects certain stereotypes.

Figure. Examples of biases.

How Prompt Engineering Helps

Prompt Engineering allows users to steer the model away from these biases by being explicit and intentional in their prompts. Instead of relying on the model’s default assumptions, you can clarify your expectations.

Example

 
Prompt: “Generate an image of a doctor.”
Likely output: A white male doctor in a lab coat.

Prompt: “Generate an image of a female doctor of color in her 30s, working in a hospital.”
Likely output: A more diverse and representative result.

Figure. Outputs of a model may be biased.

 

By adding specific qualifiers, you can guide the model toward more inclusive and representative outputs. Note that these are illustrative examples; most tools also have guardrails in place that steer the model toward less biased answers. What counts as bias, however, varies across cultures and individuals, so clearly state your goals in the prompt for the best results.

Cultural and Systemic Bias

Beyond individual biases, some biases are systemically built into the structure and management of the models themselves. For instance, models hosted or controlled in countries with strict content regulations may reflect those nations’ values, censorship rules, or political boundaries.

 

Example

 
In China, models may avoid or omit discussion of politically sensitive topics like Tiananmen Square or Taiwan. This cultural or political filtering shapes the model’s responses and creates regional variation in which topics are accessible and how they’re framed. Such restrictions may even be enforced through the tool’s guardrails, blocking any generated answers on these topics.

Bias and Testing

Bias in GenAI systems poses a significant challenge for quality engineering and testing, particularly when generating test cases. Because large language models are trained on publicly available data, they tend to perform well in domains with ample representation, such as mobile banking apps, which are extensively documented and discussed online. As a result, GenAI can produce reasonably accurate and relevant test cases for such mainstream applications. However, the performance drops noticeably when applied to bespoke software with unique or proprietary business processes, where relevant data is scarce or non-existent in the training corpus. This discrepancy highlights a critical limitation for quality engineering in our typical context, where we focus on testing custom-built enterprise systems. Without tailored finetuning or domain-specific input, GenAI-generated outputs for these systems risk being inaccurate, incomplete, or irrelevant, thereby undermining their usefulness in our quality processes.

Bias Will Always Exist, but Awareness Helps

Even with careful prompt design, no prompt can completely eliminate bias. We humans are also not free of bias. There will always be some influence from the model’s training data and the decisions made during development. But awareness, critical thinking, and effective prompting can help mitigate these effects and produce more ethical, diverse, and accurate outputs.
What is considered bias also depends on context: Western cultures might regard things as biased that Eastern cultures do not, and vice versa.