Risks of Using GenAI

Alongside the opportunities and possibilities offered by this new technology, there are also certain risks to consider. In the context of Prompt Engineering, one notable risk is Adversarial Prompting. Additionally, when using Generative AI, several risks should be taken into account.

Adversarial Prompting

Adversarial Prompting is a range of techniques, including Prompt Injection, Prompt Leaking and Prompt Jailbreaking designed to exploit vulnerabilities in Models. These attacks manipulate the input prompts to elicit harmful, unintended or sensitive outputs from the model.

Prompt Injections

Prompt Injection attacks are security exploits that involve the subtle manipulation or injection of malicious prompts. For instance, an attacker might embed a prompt in white text on a white background within a document, rendering it invisible to users but detectable by the model upon upload. When such a document is uploaded, the hidden prompt may be executed rather than performing the intended task. These attacks can range from causing the model to return specific outputs to more severe actions, such as prompting the model to transmit data to an API controlled by the attacker. The principal risk is that these manipulations can alter expected outcomes or potentially result in the inadvertent disclosure of sensitive information to unauthorized parties.

Prompt Leaking

Models are configured using system prompts and may include background files or other forms or knowledge. For instance, within the ChatGPT platform, Custom GPTs, customized versions of ChatGPT, allow users to store both a system prompt and relevant knowledge, thereby tailoring the tool for specific objectives. Prompt leaking refers to the risk of disclosing the system prompt or any uploaded data. This poses significant risks when business models rely on proprietary system prompts or uploaded content. With Custom GPTs, there is a risk that system prompts and uploaded information could be exposed, downloaded, or replicated elsewhere. While some guardrails can be implemented through system prompts (as seen in the Custom GPT example), it is generally more effective in enterprise settings to establish safeguards at additional levels, such as within the codebase. Prompt leaking mainly affects those who build model-based applications.

Prompt Jailbreaking

Tools use guardrails to restrict certain actions. For example, if a user asks a tool like ChatGPT, “How to build a bomb,” it will not provide an answer and will respond with, “Sorry, I can’t help you with that.” These responses are due to the presence of guardrails, which are safety measures designed to prevent unethical or unsafe requests. Guardrails can be implemented through code, system prompts, or guidelines.

With prompt jailbreaking, users find ways to circumvent guardrails. For example, if the word “bomb” is flagged in the code, triggering a restriction that prevents the user from asking related questions, the user might remove the word and replace it with a disguised version. This “mask” still provides enough context for the model to understand the intent, but avoids using the exact word that would trigger the block. For instance, “bomb” might be replaced with ASCII art. While the word itself is no longer present, the model can still interpret the meaning and respond to the question.

Figure. A model can interpret a picture as a prompt.

Most of these issues will be addressed and resolved by the companies responsible for the development of these models or tools. However, as these technologies are relatively new, users continue to find alternative methods to bypass safeguards, such as using certain emoticons to manipulate tokenization or exploit architectural features in other ways.

Note: The examples shown are intended solely for educational purposes. While we demonstrate how certain attacks work (as illustrated above and in the examples that follow), this is to promote awareness and understanding of potential vulnerabilities. We strongly discourage any attempt to replicate or exploit these techniques in practice.

Examples

The following examples demonstrate prompt injection and jailbreaking, both of which may present considerable risks to users. Please be aware that these are only a select number of instances; numerous other examples exist. This information is provided to enhance understanding the associated risks. Again, we strongly discourage any attempt to replicate or exploit these techniques in practice! These examples are for educational purposes only.

Image Prompt Injection

In one instance, a white image is presented, with a user asking what appears in the image. The expected response from the model would be “A white image” or similar. However, the model instead replies with “You got hacked!” because a prompt is hidden within the white image. Since this is an image, it is challenging for users to notice, but the color of the hidden text cannot match the background exactly. This limitation arises due to pixel-based image creation, where text may not be preserved.
By selecting a color close to the background, the hidden prompt remains difficult for users to detect, yet the model still processes it.

Figure. An AI tool can see what a human cannot see.

Document Prompt Injection

With documents, you can go even further. Unlike pixel-based images, documents allow you to change the text color to match the background exactly, yet the model can still interpret the
hidden text.
In the below example you see a document of an invoice that has to be paid. A user cleverly put a prompt injection (visible) on top of the document, mentioning a different total that described on the invoice. In this case it’s visible, but as mentioned, the attacker could also change the text color to white, making it almost impossible for users to detect (except if they select it).

Figure. An AI tool will interpret text differently than a human would.

The primary concern arises when this process is integrated into automation. For example, in one instance, a company automated its payment workflow: upon receipt of an invoice, a user verified and approved it for payment. Generative AI then extracted the relevant details, updated internal systems, and supplied the necessary information for processing the payment. However, when prompt injection was embedded within an invoice, it resulted in significant risk. Therefore, it is strongly recommended to always maintain human oversight within the process, specifically, to ensure human involvement at the final stage of review. “Keep a human expert involved at the end of the loop”.

Memory Injection

As previously mentioned, memory can significantly improve the consistency of recurring outputs. For instance, it becomes unnecessary to repeatedly provide the model with identical
explanations, as this information is stored in its memory. This feature is particularly beneficial for programming tasks, where retaining certain implementations outside of rule or planning files can streamline workflows. However, the use of memory also presents potential risks. Prompt injection attacks that modify memory may have a lasting influence on all subsequent outputs generated by the tool. For example, if instructions are injected to consistently retrieve resources from a specific website, this behavior could persist across future prompts.

Another example involves an attacker obtaining information from memory. An individual might use prompt injection to ask the model to summarize all contents stored in its memory and send this summary to an external endpoint, such as an API. In this scenario, the attacker could gain access to all information stored by the user in the model’s memory.

Code Package Injection

Exercise caution when coding with AI tools like GitHub Copilot or Windsurf, as they may introduce packages that contain prompt injections. Such vulnerabilities have been exploited to alter code behavior or restrict model outputs. Always carefully check which packages are added to your code.