Tame your LLM: prompt engineering as the basis for large language models

05.03.2025

How do you get a large language model to deliver exactly what users need? Prompt engineering is the key: it is through targeted inputs that you can control large language models (LLMs). But what makes a good prompt, how to prevent hallucinations, and what are the principles of prompt engineering? Trainers Martin Pfister, Simeon Harrison, and Thomas Haschka from EuroCC Austria provide answers in an interview with Bettina Benesch.

I am sitting here with a mechanical engineer (Simeon), a physicist (Martin), and a biophysicist (Thomas), all of whom have specialised in machine learning. Which terms do you associate with prompt engineering?


Simeon Harrison: I would say that prompt engineering is primarily about taming large language models. So, taming would be my first keyword. It’s also about making something repeatable and comprehensible. That means I want to get a reproducible response to a given prompt. Reproducibility is therefore another important term.

Martin Pfister: For me, guidance is also part of it – regulating what the LLM says and what it doesn’t. ChatGPT, for example, won’t give you instructions on how to build a bomb. Another keyword would be trial and error: unlike in mathematics, there is no definitive right or wrong in prompt engineering. It’s more about experimenting.

What exactly is prompt engineering?


Thomas Haschka: Prompt engineering is a very broad term. Essentially, it involves expanding the user’s original prompt in order to control the behaviour of a large language model. Prompt engineering allows us to influence how an LLM responds. For example, with a chatbot for medical consultation, we might add the instruction: “Answer the question precisely and using medical terminology” to the user’s request. Under the same conditions, we could instead append: “Answer the query in a child-friendly manner.” In this case, we would see two different styles of language in the LLM’s responses. This invisible influence, exerted through additional text in the query, is the core of prompt engineering.

Simeon Harrison: Prompt engineering is the easiest way to adapt an LLM to your needs, compared to other approaches such as RAG* (Retrieval-Augmented Generation) or fine-tuning. It’s essentially the gateway, and from there, you build further.

What is the difference between user prompting and system prompts?


Simeon Harrison: In principle, they are the same, but the system prompt is hidden, while the user prompt is not.


What can be controlled with a system prompt?


Simeon Harrison: You can instruct the LLM on whether the answer should be delivered as a list, continuous text, or another format; decide on style and tone; specify whether simple language should be used or if the output should be more complex; define an ethical framework; regulate the accuracy of responses; set the model’s reasoning approach, such as "work step by step"; establish how the LLM should respond when uncertain about an answer; or specify that sources should be cited.

 

Providing the LLM with examples is also helpful. For instance, I might supply an example of another product, perhaps a blender, that also has animal-themed names. This can influence the output. A sample prompt might be:

"Come up with three product names for kid’s shoes. The shoes should embody sportiness, a love of nature and adventure. The names should be in vibrant English and include an animal name. Return the results as a comma-separated list in this format:

Product Description: Children's casual shoe
Product Names: [List of 3 names]

Examples:
Product Description: Blender
Product Names: [Fruity Flamingo, Cheery Cheetah, Sunny Seahorse]"

The more examples I provide and the less variance in those examples, the narrower the LLM's response. It becomes precise and reproducible – but, of course, this also reduces the model’s creative freedom.
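To make the split between system prompt and user prompt concrete, here is a minimal sketch in Python using the OpenAI chat completions API, the library behind ChatGPT. The model name, temperature and exact wording are illustrative assumptions rather than a recipe from the interview:

# Minimal sketch: the system prompt fixes format, tone and a few-shot example,
# while the user prompt only carries the actual request.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment;
# the model name "gpt-4o-mini" is an illustrative choice.
from openai import OpenAI

client = OpenAI()

system_prompt = (
    "You are a naming assistant for consumer products. "
    "Always answer with exactly three names as a comma-separated list in this format:\n"
    "Product Description: <description>\n"
    "Product Names: [Name 1, Name 2, Name 3]\n\n"
    "Example:\n"
    "Product Description: Blender\n"
    "Product Names: [Fruity Flamingo, Cheery Cheetah, Sunny Seahorse]"
)

user_prompt = (
    "Come up with three product names for kids' shoes. The shoes should embody "
    "sportiness, a love of nature and adventure. Each name should include an animal."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ],
    temperature=0.7,  # lower values make the output more reproducible
)

print(response.choices[0].message.content)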


What is the average length of such prompts?


Simeon Harrison: It depends on the use case. Generally, the more extensive the prompt, the more precise and reproducible the output. But this also sacrifices part of the input sequence, as user and system prompts together have a limited number of tokens (one token is approximately ¾ of an English word). If your system prompt is very long, the user has less space for their input. So, the prompt should be short and to the point.

Martin Pfister: If I ask ChatGPT, "Explain what RAG is," the request is very short, and the model might respond: "RAG is a technique that combines large language models with external knowledge sources to generate more accurate and up-to-date answers," followed by a few more sentences about RAG. However, you can also give the LLM a specific task, such as: "Take this 200-page text, which currently has no punctuation, and insert punctuation and paragraphs where appropriate." That would be a very long prompt because my 200-page text is part of the input.

Simeon Harrison: And if you’re designing an LLM for children, you would want to regulate the output strictly, meaning the system prompt would need to be more comprehensive.
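As a rough illustration of the token budget Simeon describes above, the following sketch counts how many tokens a system prompt and a user prompt consume together. It assumes the tiktoken library; the cl100k_base encoding matches many recent OpenAI models and only approximates the tokenizers of other model families:

# Rough sketch: estimating how much of the context window a prompt uses.
# Assumes the tiktoken library; the context window size is an illustrative value.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

system_prompt = "Answer in simple, child-friendly language and cite your sources."
user_prompt = "Explain what RAG is."

system_tokens = len(encoding.encode(system_prompt))
user_tokens = len(encoding.encode(user_prompt))

context_window = 8192  # illustrative limit; the real value depends on the model
print(f"System prompt: {system_tokens} tokens")
print(f"User prompt:   {user_tokens} tokens")
print(f"Remaining budget: {context_window - system_tokens - user_tokens} tokens")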

What libraries are useful for working with LLMs, and which do you use in your courses?


Simeon Harrison: It varies, but an important one is OpenAI’s own library, the one behind ChatGPT. LangChain is also a key tool. It’s useful for turning individual models, prompts, or databases used for RAG into a finished, robust product.
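As a hedged sketch of what such a pipeline can look like, the snippet below uses LangChain to pipe a reusable prompt template into a chat model. The package names (langchain_core, langchain_openai) reflect current LangChain releases, and the model name is an illustrative assumption:

# Minimal LangChain sketch: a reusable system prompt template piped into a model.
# Assumes the langchain-core and langchain-openai packages and an OPENAI_API_KEY.
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_messages([
    ("system", "You answer questions about {topic} in a child-friendly way."),
    ("user", "{question}"),
])

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

chain = prompt | llm  # LangChain's pipe syntax composes prompt and model

answer = chain.invoke({"topic": "high-performance computing",
                       "question": "What does a supercomputer do?"})
print(answer.content)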

 

LLMs frequently hallucinate, delivering incorrect answers with confidence. What role does prompt engineering play in preventing hallucinations?


Simeon Harrison: ChatGPT is no longer as prone to hallucinations as it once was. Current LLMs are larger and trained for longer, and problem areas are being identified and addressed with each new release. But hallucinations are still an issue. In the system prompt, you can mitigate this by instructing the model to respond only when it is sufficiently confident. For example, if it is only 60% sure about an answer, it could respond: "I currently lack the data to make a solid statement."

Martin Pfister: I recently asked ChatGPT for help with a programming task involving a library I wasn’t very familiar with. The LLM generated a beautifully formatted piece of code that didn’t work. I then instructed the model to provide a link to the documentation for each function, which made it easier to verify the responses.
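A minimal sketch of how these two mitigations, refusing when unsure and pointing to documentation, might be phrased in a system prompt. The wording and the confidence framing are assumptions, and no prompt can rule out hallucinations entirely:

# Sketch of a system prompt applying the two mitigations from the interview:
# refuse when unsure, and point to documentation so answers can be verified.
# The wording is illustrative; it reduces, but does not eliminate, hallucinations.
system_prompt = (
    "Answer only when you are sufficiently confident. If you are not, reply: "
    "'I currently lack the data to make a solid statement.'\n"
    "When your answer uses a library function, add a link to the official "
    "documentation for that function so the user can verify it."
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "How do I read a Parquet file with pandas?"},
]
# `messages` can now be passed to any chat-completion style API.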
 

Are there ready-made standard prompts for prompt engineering that can be adapted to individual needs?


Thomas Haschka: Yes, there are, and they are even implemented in llama.cpp**. There is a web interface where you can select standard prompts, such as Chain-of-Thought, where intermediate steps in reasoning are explicitly carried out. However, it’s not as simple as it might seem: If you instruct the LLM in the system prompt to evaluate its own output, it has no reference point on which to base this evaluation. That only works if there is a database in the background from which the LLM retrieves the correct answer. That would be the RAG approach.
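As an illustration of the standard prompts Thomas mentions, the sketch below sends a Chain-of-Thought style system prompt to a model served locally by llama.cpp, whose llama-server component exposes an OpenAI-compatible endpoint. The port, wording and arithmetic task are assumptions for the sake of the example:

# Sketch: a Chain-of-Thought style system prompt sent to a local llama.cpp server.
# Assumes llama-server is running with its OpenAI-compatible endpoint (default port 8080).
import requests

payload = {
    "messages": [
        {"role": "system", "content": "Work step by step. Lay out your intermediate "
                                      "reasoning before giving the final answer."},
        {"role": "user", "content": "A training run needs 6 GPU hours per epoch and "
                                    "runs for 40 epochs. How many GPU hours in total?"},
    ],
    "temperature": 0.2,
}

response = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
print(response.json()["choices"][0]["message"]["content"])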


Couldn’t quality also be assessed by human reviewers?


Simeon Harrison: In principle, yes, but evaluation is generally very, very difficult and poses a major challenge for AI research and development. This applies to every area of LLMs. Take translations, for example: A sentence can be translated in countless different ways. Determining the best version is not always straightforward.

Human assessments are certainly the gold standard. Ideally, you would have experts available who conduct a statistically significant number of experiments and evaluate them. However, this is time-consuming and expensive.

Aside from that, there is A/B testing. You might recognise this from ChatGPT when you're asked to choose between two different responses. But it can also be done covertly. Essentially, A/B testing is an experimental method in which two variants are compared. The goal is always to determine which version yields better results. The results are statistically analysed, and based on that, a decision is made on which version should be used in production. This way, LLMs can be continuously optimised.
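To make "statistically analysed" a little more concrete, here is a small sketch that checks whether a preference for response variant A over variant B is likely to be more than chance. The counts are invented, and real evaluations may use different tests:

# Sketch of a simple A/B analysis: did users prefer variant A significantly more
# often than chance would suggest? Counts are invented for illustration.
from scipy.stats import binomtest

prefers_a = 230  # hypothetical: users who picked response variant A
prefers_b = 170  # hypothetical: users who picked response variant B

result = binomtest(prefers_a, prefers_a + prefers_b, p=0.5, alternative="greater")
print(f"Preference for A: {prefers_a / (prefers_a + prefers_b):.1%}")
print(f"p-value: {result.pvalue:.4f}")  # a small p-value means the preference is unlikely to be chance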

Thomas Haschka: One problem with assessing quality is that evaluation datasets can sometimes end up in training data, which distorts the assessment of model quality. In practice, here’s what happens: To compare LLMs like DeepSeek or Llama, developers create question-answer pairs and make this evaluation dataset publicly available online. This allows anyone to use the dataset to check the quality of an LLM's responses based on these question-answer pairs.

Unfortunately, it sometimes happens that this dataset is also included in training data when a web crawler scrapes the internet for text to create a new training dataset. It's like a maths teacher giving students the test questions and answers in advance: You don’t know whether the students have actually understood the material or just memorised the answers.

For LLMs, this means they might not be generating responses independently but instead recalling information from the evaluation dataset. In such cases, it becomes impossible to reliably verify whether the LLM is truly "thinking" or just regurgitating memorised answers.


One way to ensure quality is the RAG approach, which we briefly mentioned earlier. On 9 April, we’ll publish a blog post on this fascinating topic. Sign up for our newsletter here and have the post delivered directly to your inbox.

Interested in Prompt Engineering, RAG, and Fine-tuning? Here are the next free courses from EuroCC Austria.


The five principles of prompting

 

  1. Provide clear and specific instructions
    The more precise the instruction, the more likely the AI model will deliver the desired response. Vague prompts can lead to inaccurate or irrelevant results.
     
  2. Define formatting and structure
    Specifying a preferred format or structure helps the model tailor its output accordingly, which is especially useful for specific applications.
     
  3. Provide examples
    Including examples in the prompt helps the model understand the desired style or format and respond accordingly.
     
  4. Break down complex tasks into steps
    For large or complex tasks, breaking them into smaller, manageable steps often leads to clearer and more precise answers.
     
  5. Give the model time to think
    Instructions such as "Think step by step" can encourage the model to process information more thoroughly, leading to more detailed and logical responses. A short sketch that combines all five principles follows below.
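Taken together, the five principles might translate into a single prompt such as the following sketch; the task and wording are invented for illustration:

# Sketch: one prompt that applies the five principles above.
prompt = "\n".join([
    # 1. Clear and specific instructions
    "Write a short introduction to prompt engineering for complete beginners.",
    # 2. Define formatting and structure
    "Return the answer as a numbered list of five points, each at most two sentences long.",
    # 3. Provide an example
    "Match this tone: 'Think of the system prompt as stage directions the audience never sees.'",
    # 4. Break down the task into steps
    "First collect the key ideas, then order them by importance, then write the final list.",
    # 5. Give the model time to think
    "Think step by step before writing your answer.",
])
print(prompt)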

Meet our experts

Martin Pfister
Martin studied physics at TU Wien and is currently pursuing a doctorate in medical physics at MedUni Wien. At EuroCC Austria, he helps clients run projects on the Austrian VSC Supercomputer and teaches Deep Learning and Machine Learning.

Simeon Harrison
Simeon is a trainer at EuroCC Austria, specialising in Deep Learning and Machine Learning. A former mathematics teacher, he receives feedback from course participants such as: "Simeon Harrison is an excellent teacher. His teaching style is clear, engaging, and well-structured, making the topics more enjoyable to learn.”

Thomas Haschka
Thomas made his way from simulation and biophysics to data science and ultimately to artificial intelligence. He earned his PhD in France and conducted research at the Muséum National d'Histoire Naturelle, Institut Pasteur, and the Paris Brain Institute. Before returning to TU Wien, he spent over a year teaching artificial intelligence at the American University of Beirut.


* RAG (Retrieval-Augmented Generation) is a technique that combines large language models with an external knowledge source. Instead of relying solely on what the model learned during training, RAG searches a database or the internet for relevant information and uses it to generate more accurate and up-to-date responses. This allows the model to work with current or specialized information that it wouldn’t otherwise know. As a result, RAG is particularly useful for applications like chatbots, enterprise knowledge systems, or scientific research.

** llama.cpp is open-source software that enables large language models to be deployed efficiently, including on local hardware.