Optimizing Multimodal Agents
Multimodal AI agents, those that can process text and images (or other media), are rapidly entering real-world domains like autonomous driving, healthcare, and robotics. In these settings, we have traditionally used vision models like CNNs; in the post-GPT era, we can instead use vision and multimodal language models that follow human instructions in the form of prompts, rather than narrow, task-specific vision models.
However, ensuring good outcomes from these models requires effective instructions, or, more commonly, prompt engineering. Existing prompt engineering methods rely heavily on trial and error, a problem exacerbated by the complexity and higher token cost of working across non-text modalities such as images. Automatic prompt optimization is a recent advancement in the field that systematically tunes prompts to produce more accurate, consistent outputs.
For example, a self-driving car perception system might use a vision-language model to answer questions about road images. A poorly phrased prompt can lead to misunderstandings or errors with serious consequences. Instead of fine-tuning or reinforcement learning, we can use another multimodal model with reasoning capabilities to learn and adapt the agent's prompts.

Although these automatic methods can be applied to text-based agents, they are often not well documented for complex, real-world applications beyond basic toy datasets such as handwriting or image classification. To best demonstrate how these concepts work in a more complex, dynamic, and data-intensive setting, we will walk through an example using a self-driving car agent.
What Is Agent Optimization?
Agent optimization is part of automatic prompt engineering, but it involves working with the various parts of an agent, such as multiple prompts, tool calling, RAG, agent architecture, and different modalities. There are a number of research projects and libraries, such as GEPA; however, many of these tools do not provide end-to-end support for tracing, evaluation, and managing datasets that include images.
For this walk-through, we will use the Opik Agent Optimizer SDK (opik-optimizer), an open-source agent optimization toolkit that automates this process using LLMs internally, along with optimization algorithms like GEPA and a variety of its own, such as HRPO, for various use cases, so you can iteratively improve prompts without manual trial and error.
How Can LLMs Optimize Prompts?
Essentially, an LLM can “act as” a prompt engineer and rewrite a given prompt. We take the traditional trial-and-error approach a prompt engineer would use and ask a small agent to review its work across a few examples, fix its mistakes, and create a new prompt.
Meta-prompting is a classic example: we use chain-of-thought (CoT) style instructions, such as “explain the reason why you gave me this prompt”, during new prompt generation, and we keep iterating across multiple rounds of prompt generation. Below is an example of an LLM-based meta-prompting optimizer adjusting the prompt and generating new candidates.
[Figure: an LLM-based meta-prompting optimizer generating new candidate prompts]
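To make the idea concrete, here is a minimal sketch of that loop in plain Python. It is illustrative only, not the SDK's implementation: ask_llm and run_agent are hypothetical stand-ins for your model calls, and the scoring is a placeholder heuristic.
from difflib import SequenceMatcher

# Hypothetical stand-ins for real model calls; swap in your own LLM client.
def ask_llm(meta_prompt: str) -> str:
    return "Improved prompt: describe the single most likely hazard."

def run_agent(prompt: str, question: str) -> str:
    return "A pedestrian is crossing ahead."

def score(prompt: str, examples: list[dict]) -> float:
    # Average similarity between agent outputs and expected answers.
    return sum(
        SequenceMatcher(None, run_agent(prompt, ex["input"]), ex["answer"]).ratio()
        for ex in examples
    ) / len(examples)

def meta_optimize(prompt: str, examples: list[dict], rounds: int = 5) -> str:
    best_prompt, best_score = prompt, score(prompt, examples)
    for _ in range(rounds):
        # Ask a "meta" LLM to explain its reasoning and rewrite the prompt.
        candidate = ask_llm(
            "You are a prompt engineer. Review this prompt and its results, "
            "explain the reason for your changes, then write a better prompt.\n"
            f"Prompt:\n{best_prompt}\nScore: {best_score:.2f}"
        )
        candidate_score = score(candidate, examples)
        if candidate_score > best_score:  # hill climb: keep only improvements
            best_prompt, best_score = candidate, candidate_score
    return best_prompt

examples = [{"input": "What is the hazard?", "answer": "A pedestrian is crossing ahead."}]
print(meta_optimize("Identify the hazards in this image.", examples))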
In the toolkit, there is a meta-prompt-based optimizer called metaprompter, and we can demonstrate how the optimization works:
- It starts with an initial ChatPrompt, an OpenAI-style chat prompt object with system and user prompts,
- a dataset (of input and answer examples),
- and a metric (reward signal) to optimize against, which can be an LLMaaJ (LLM-as-a-judge) or an even simpler heuristic metric, such as an exact-match comparison of the dataset's expected outputs against the model's outputs.
Opik then uses various algorithms, including LLMs, to iteratively mutate the prompt and evaluate performance, automatically tracking results. Essentially, it acts as our very own machine-driven prompt engineer!
Getting Started
In this walkthrough, we will use a small dataset of self-driving car dashcam images and tune the prompts of a multimodal hazard-detection agent using automatic prompt optimization.
We need to set up our environment and install the toolkit to get going. First, you will need an open-source Opik instance, either in the cloud or locally, to log traces, manage datasets, and store optimization results. You can go to the repository and run the Docker start command to run the Opik platform or set up a free account on their website.
Once set up, you’ll need Python (3.10 or higher) and a few libraries. First, install the opik-optimizer package; it will also install the opik core package, which handles datasets and evaluation.
Install and configure using uv (recommended):
# install with venv and py version
uv venv .venv --python 3.11
# install optimizer package
uv pip install opik-optimizer
# post-install configure SDK
opik configure
Or alternatively, install and configure using pip:
# Setup venv
python -m venv .venv
# load venv
source .venv/bin/activate
# install optimizer package
pip install opik-optimizer
# post-install configure SDK
opik configure
You’ll also need API keys for any LLM models you plan to use. The SDK uses LiteLLM, so you can mix providers; see the LiteLLM docs for a full list of models and for other integrations like Ollama and vLLM if you want to run models locally.
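Model names follow LiteLLM's provider/model convention. The identifiers below are examples only, so verify the exact strings against the LiteLLM model list for your providers:
# Examples of LiteLLM-style model identifiers (verify exact names in the
# LiteLLM docs); each provider needs its own API key in the environment.
OPENAI_MODEL = "openai/gpt-4o"                   # needs OPENAI_API_KEY
ANTHROPIC_MODEL = "anthropic/claude-3-5-sonnet"  # needs ANTHROPIC_API_KEY
LOCAL_MODEL = "ollama/llama3"                    # local via Ollama, no key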
In our example, we will be using OpenAI models, so you need to set your key in your environment. Adjust this step as needed to load the API keys for your model:
export OPENAI_API_KEY="sk-…"
Now that we have our Opik environment set up and our keys configured to access LLM models for optimization and evaluation, we can get to work on our datasets to tune our agent.
Working with Datasets To Tune the Agent
Before we can start with prompts and models, we need a dataset. To tune an AI agent (or even just to optimize a simple prompt), we need examples that serve as our “preferences” for the outcomes we want to achieve. You would normally maintain a “golden” dataset of example input and output pairs for your AI agent, which you treat as the prime examples to evaluate the agent against.
For this example project, we will use an off-the-shelf dataset for self-driving cars that is already set up as a demo dataset in the optimizer SDK. The dataset contains dashcam images and human-labeled hazards. Our goal is to use a very basic prompt and have the optimizer “discover” the optimal prompt by reviewing the images and the test outputs it will run.
The dataset, DHPR (Driving Hazard Prediction and Reasoning), is available on Hugging Face and is already mapped in the SDK as the driving_hazard dataset (it is released under the BSD 3-Clause license). The SDK's internal mapping handles the Hugging Face download, image resizing and compression (including PNG-to-JPEG conversion), and conversion to an Opik-compatible dataset. The SDK includes helper utilities if you wish to use your own multimodal dataset.
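If you would rather bring your own images, a sketch like the following can work, assuming dashcam JPEGs on disk. The Opik client calls (Opik, get_or_create_dataset, insert) come from the core SDK; the resizing mirrors what the SDK's helper does (JPEG at 60% quality, max 512 pixels), and the field names are our own choice:
import base64
import io

import opik
from PIL import Image  # pip install pillow

def to_data_url(path: str, max_side: int = 512, quality: int = 60) -> str:
    # Resize and JPEG-compress an image, then encode it as a data URL.
    img = Image.open(path).convert("RGB")
    img.thumbnail((max_side, max_side))  # caps width/height in place
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    return "data:image/jpeg;base64," + base64.b64encode(buf.getvalue()).decode()

client = opik.Opik()
my_dataset = client.get_or_create_dataset(name="my_dashcam_hazards")
my_dataset.insert([
    {
        "question": "Based on my dashcam image, what is the potential hazard?",
        "image": to_data_url("dashcam/frame_001.jpg"),
        "hazard": "Entity #1 merges into my lane without indicating.",
    },
])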

The DHPR dataset includes a few fields that we will use to ground our agent’s behavior against human preferences during our optimization process. Here is a breakdown of what’s in the dataset:
- question: the question posed to the human annotator, “Based on my dashcam image, what is the potential hazard?”
- hazard: the hazard description from the human labeling
- bounding_box: the marked hazard region, which can be overlaid on the image
- plausible_speed: the annotator’s estimate of the car’s speed from the predefined set [10, 30, 50+]
- image_source: metadata on where the source images were recorded
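Put together, a single dataset item looks roughly like this (the values are illustrative, not copied from DHPR):
example_item = {
    "question": "Based on my dashcam image, what is the potential hazard?",
    "hazard": "Entity #1 brakes suddenly and Entity #2 cannot stop in time.",
    "bounding_box": [412, 198, 660, 355],   # hazard region to overlay
    "plausible_speed": 30,                  # from the set [10, 30, 50+]
    "image_source": "urban dashcam footage",
    "image": "data:image/jpeg;base64,...",  # resized/compressed by the SDK
}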
Now, let’s create a new Python file, optimize_multimodal.py, and begin with the dataset we will use to train and validate our optimization process:
from opik_optimizer.datasets import driving_hazard
dataset = driving_hazard(count=20)
validation_dataset = driving_hazard(count=5)
This code, when executed, ensures the Hugging Face dataset is downloaded and added to your Opik platform UI as a dataset we can optimize or test with. We will pass the variables dataset and validation_dataset to the optimization steps later in the code. Note that we set the count values to low numbers, 20 and 5, to load a small sample and avoid processing the entire dataset for our walk-through, which would be resource-intensive.
When you run a full optimization process in a live environment, you should aim to use as much of the dataset as possible. It’s good practice to start small and scale up, as diagnosing long-running optimizations can be problematic and resource-intensive.
We also configured the optional validation_dataset, which is used to test our optimization at the start and end on a hold-out set to ensure the recorded improvement holds on unseen data. Out of the box, the optimizers' pre-configured datasets all come with pre-set splits, which you can access via the split argument. See the examples below:
# example a) driving_hazard pre-configured splits
from opik_optimizer.datasets import driving_hazard
trainset = driving_hazard(split="train")
valset = driving_hazard(split="validation")
testset = driving_hazard(split="test")
# example b) gsm8k math dataset pre-configured splits
from opik_optimizer.datasets import gsm8k
trainset = gsm8k(split="train")
valset = gsm8k(split="validation")
testset = gsm8k(split="test")
The splits also ensure there is no overlapping data: the dataset is shuffled and split into three parts. We skip these splits here so we don't need very large datasets and long runs while getting started.
Let’s go ahead and run our code optimize_multimodal.py with just the driving hazard dataset. The dataset will be loaded into Opik and can be seen in our dashboard (figure 4 below) under “driving_hazard_train_20”.
[Figure 4: the driving_hazard_train_20 dataset in the Opik dashboard]
With our dataset loaded in Opik, we can also open it in the Opik playground, a nice and easy way to see how various prompts behave and to test them against a simple prompt such as “Identify the hazards in this image.”
[Figure 5: testing prompts against the dataset in the Opik playground]
As you can see from the example (figure 5 above), we can use the playground to test prompts for our agent quite quickly. This mirrors the usual manual prompt-engineering process: adjusting the prompt in a playground-like environment and simulating how various changes affect the model's outputs.
For some scenarios this could be sufficient, with some automated scoring and intuition guiding prompt adjustments. But you can already see how bringing prompt optimization into a more visual and systematic process lets subtle changes be tested easily against our golden dataset (our sample of 20 for now).
Defining Evaluation Metrics To Optimize With
Next, we define the evaluation metrics that let the optimizer know which changes are working and which are not. For this, we will use an evaluation metric as the “reward”: a simple score that the optimizer uses to decide which prompt changes to keep.
These evaluation metrics can be simple (e.g., Equals) or more complex (e.g., LLM-as-a-judge). Since Opik is a fully open-source evaluation suite, you can choose from a number of built-in metrics, which you can explore in the documentation.
Logically, you might think that when we compare the dataset ground truth (a) to the model output (b), we should use a simple equality metric like (a == b), which returns a boolean true or false. But a direct comparison can be harmful to the optimizer: a binary pass/fail gives it almost no signal to climb, since candidate prompts rarely produce the exact expected answer right from the start (or at any point during the optimization process).
Below is one of the human-annotated examples from the dataset we are trying to get the optimizer to match; you can see how getting the LLM to produce exactly the same output blindly would be challenging:
Entity #1 brakes his car in front of Entity #2. Seeing that Entity #2 also pulled his brakes. At a speed of 45 km/h, I can't stop my car in time and hit Entity #2.
To support the hill-climbing the optimizer needs, we will use a comparison metric that provides a graded similarity score on a scale of 0.0 to 1.0. For this scenario, we will use the Levenshtein ratio (LR), a simple math-based measure of how closely the characters and words in the output match those in the ground-truth dataset. With the Levenshtein ratio, a body of text that is only a few characters off could still score, for example, 98% (0.98), as the strings are very similar (figure 6 below).
[Figure 6: near-identical strings scoring highly under the Levenshtein ratio]
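As a quick sanity check, you can score two near-identical strings directly with the metric the SDK provides (the strings here are made up):
from opik.evaluation.metrics import LevenshteinRatio

metric = LevenshteinRatio()
result = metric.score(
    reference="Entity #1 brakes hard and Entity #2 hits Entity #1.",
    output="Entity #1 brakes hard, and Entity #2 hits Entity #1.",
)
print(result.value)  # roughly 0.98: the strings differ by one character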
In our Python script, we define this custom metric as a function over the input and output variables from our dataset. In practice, we map the dataset's hazard field to the model's llm_output and define the scoring function to pass to the optimizer. There are more metric examples in the documentation, but for now, we will use the following setup in our code after the dataset creation:
from typing import Any

from opik.evaluation.metrics import LevenshteinRatio
from opik.evaluation.metrics.score_result import ScoreResult

def levenshtein_ratio(
    dataset_item: dict[str, Any],
    llm_output: str,
) -> ScoreResult:
    metric = LevenshteinRatio()
    metric_score = metric.score(
        reference=dataset_item["hazard"], output=llm_output
    )
    return ScoreResult(
        value=metric_score.value,
        name=metric_score.name,
        reason=f"Levenshtein ratio between `{dataset_item['hazard']}` and `{llm_output}` is `{metric_score.value}`.",
    )
Setting Up Our Base Agent & Prompt
Here we configure the agent's starting point. In this case, we assume we already have an agent and a handwritten prompt; if you were optimizing your own agent, you would replace these placeholders. We start by importing the ChatPrompt class, which lets us configure the agent as a simple chat prompt. The optimizer SDK handles inputs via the ChatPrompt, and you can extend it with tool/function calling and more multi-prompt/agent scenarios for your own use cases.
from opik_optimizer import ChatPrompt
# Define the prompt to optimize
system_prompt = """You are an expert driving safety assistant
specialized in hazard detection. Your task is to analyze dashcam
images and identify potential hazards that a driver should be aware of.
For each image:
1. Carefully examine the visual scene
2. Identify any potential hazards (pedestrians, vehicles,
road conditions, obstacles, etc.)
3. Assess the urgency and severity of each hazard
4. Provide a clear, specific description of the hazard
Be precise and actionable in your hazard descriptions.
Focus on safety-critical information."""
# Map into an OpenAI-style chat prompt object
prompt = ChatPrompt(
    messages=[
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "{question}"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "{image}",
                    },
                },
            ],
        },
    ],
)
In our example, we have a system prompt and a user prompt built from the question {question} and the image {image} in the dataset we created earlier. We are going to optimize the system prompt, while the user input changes for each image (as we saw in the playground earlier). The fields in curly braces, like {data_field}, are columns in our dataset that the SDK will automatically map and, for things like multimodal images, convert as needed.
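If your agent also calls tools, the ChatPrompt can carry them along so the optimizer exercises them during trials. The parameter names below (tools, function_map) are assumptions based on the SDK's tool-calling examples at the time of writing, so verify them against the current opik-optimizer docs:
from opik_optimizer import ChatPrompt

# Hedged sketch: a ChatPrompt with a tool attached. `tools` and
# `function_map` are assumed parameter names; check your SDK version.
def lookup_speed_limit(road_id: str) -> str:
    # Hypothetical helper the agent can call during a trial.
    return "50 km/h"

tool_prompt = ChatPrompt(
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "{question}"},
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "lookup_speed_limit",
            "description": "Look up the speed limit for a road segment.",
            "parameters": {
                "type": "object",
                "properties": {"road_id": {"type": "string"}},
                "required": ["road_id"],
            },
        },
    }],
    function_map={"lookup_speed_limit": lookup_speed_limit},
)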
Loading and Wiring the Optimizers
The toolkit comes with a range of optimizers, from simple meta-prompting, which uses chain-of-thought reasoning to update prompts, to GEPA and more advanced reflective optimizers. For this walk-through, we will use the hierarchical reflective prompt optimizer (HRPO), as it is well suited to complex and ambiguous tasks.
The HRPO optimization algorithm (figure 7 below) uses hierarchical root cause analysis to identify and address specific failure modes in your prompts. It analyzes evaluation results, identifies patterns in failures, and generates targeted improvements to systematically address each failure mode.
[Figure 7: the HRPO optimization algorithm's hierarchical root cause analysis loop]
So far in our project, we have set up the base dataset, evaluation metric, and prompt for our agent, but have not wired these up to any optimizer. Let's wire HRPO into our project. We need to load and configure it, including the model the optimizer itself will use:
from opik_optimizer import HRPO

# Set up the optimizer and its configuration parameters
optimizer = HRPO(
    model="openai/gpt-5.2",
    model_parameters={"temperature": 1},
)
There are additional parameters we can set, such as the number of threads for multi-threading, or model parameters passed directly to the LLM calls, as we demonstrate by setting our temperature value.
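A slightly fuller configuration might look like the following. Note that n_threads is our assumption for the concurrency parameter's name, so check the optimizer's signature in your installed version:
# Extended configuration (parameter names beyond `model` and
# `model_parameters` are assumptions; verify in your SDK version).
optimizer = HRPO(
    model="openai/gpt-5.2",               # model driving the optimizer
    model_parameters={"temperature": 1},  # passed through to LLM calls
    n_threads=4,                          # assumed concurrency knob
)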
It’s Time: Running the Optimizer
Now we have everything we need: our starting agent, dataset, metric, and the optimizer. To execute the optimizer, we call its optimize_prompt function and pass in all the components, along with any additional parameters. So really, at this stage, all that remains is the optimize_prompt() call, which, when executed, runs the optimizer we configured (optimizer).
# Execute optimizer
optimization_result = optimizer.optimize_prompt(
    prompt=prompt,                          # our ChatPrompt
    dataset=dataset,                        # our Opik dataset
    validation_dataset=validation_dataset,  # optional, hold-out test
    metric=levenshtein_ratio,               # our custom metric
    max_trials=10,                          # optional, number of runs
)
# Output and display results
optimization_result.display()
You will notice some additional arguments we passed; the max_trials argument limits the number of trials (optimization loops) the optimizer will run before stopping. You should start with a low number, as some datasets and optimizer loops can be token-heavy, especially with image-based runs, which can become very long, time-consuming, and costly. Once we are happy with our setup, we can always come back and scale this up.
Let’s run our full script now and see the optimizer in action. It’s best to execute this in your terminal, but it should also work fine in a notebook environment such as Jupyter:
[Figure: terminal output of the optimization run in progress]
The optimizer will run through 10 trials (optimization loops). In each loop, it surfaces a number (k) of failure cases to check, test, and develop new prompts for. At each trial, the new candidate prompts are tested and evaluated, and another trial begins. After a short while, we reach the end of our optimization; in our case, this happens after 10 full trials, which should take no more than a minute to execute.
Congratulations, we have optimized our multimodal agent, and we can now take the new system prompt and use it with the same model in production with improved accuracy. In a production scenario, you would copy this into your codebase. To analyze our optimization run, we can look at the results shown in the terminal and the dashboard:
[Figure: final optimization results in the terminal and Opik dashboard]
Based on the results, we can see that we have gone from a baseline score of 15% to 39% after 10 trials, a whopping 152% improvement with a new prompt in under a minute. These results are based on our comparison metric, which the optimizer used as its signal: a comparison of the model's output against the expected output in our dataset.
Digging into our results, a few key things to note:
- During the trial runs, the score shoots up very quickly, then slowly levels off. Try increasing the number of trials to see whether the optimizer needs more iterations to determine the next set of prompt improvements.
- The score will also be more “volatile” and prone to overfitting with samples as small as 20 (and 5 for validation), which we used to keep our test small; randomness will impact the scores massively. When you re-run, try using the full dataset or a larger sample (e.g., count=50) and see how the scores become more realistic.
Overall, as we scale this up, we need to give the optimizer more data and more time (signal) to “hill climb,” which can take multiple rounds.
At the end of our optimization, the new and improved system prompt has learned that it needs to label the interacting entities and match the reference output style. Here is our final improved prompt after 10 trials:
You are an expert driving incident analyst specialized in collision-causal description.
Your task is to analyze dashcam images and write the most likely collision-oriented causal narrative that matches reference-style answers.
For each image:
1. Identify the primary interacting participants and label them explicitly as "Entity #1", "Entity #2", etc. (e.g., vehicle, pedestrian, cyclist, obstacle).
2. Describe the single most salient accident interaction as an explicit causal chain using entity labels: "Entity #X [action/failure] → [immediate consequence/path conflict] → [impact]".
3. End with a clear impact outcome that MUST (a) use explicit collision language AND (b) name the entities involved (e.g., "Entity #2 rear-ends Entity #1", "Entity #1 side-impacts Entity #2",
"Entity #1 strikes Entity #2").
Output requirements (critical):
- Produce ONE short, direct causal statement (1–2 sentences).
- The statement MUST include: (i) at least two entities by label, (ii) a concrete action/failure-to-yield/encroachment, and (iii) an explicit collision outcome naming the entities. If any of these
are missing, the answer is invalid.
- Do NOT output a checklist, multiple hazards, severity/urgency ratings, or general driving advice.
- Avoid general risk discussion (visibility, congestion, pedestrians) unless it directly supports the single causal chain culminating in the collision/impact.
- Focus on the specific causal progression culminating in the impact (even if partially inferred from context); do not describe multiple possible crashes; commit to the single most likely one.
You can grab the full final code for the example end to end as follows:
from typing import Any
from opik_optimizer.datasets import driving_hazard
from opik_optimizer import ChatPrompt, HRPO
from opik.evaluation.metrics import LevenshteinRatio
from opik.evaluation.metrics.score_result import ScoreResult
# Import the dataset
dataset = driving_hazard(count=20)
validation_dataset = driving_hazard(split="test", count=5)
# Define the metric to optimize on
def levenshtein_ratio(dataset_item: dict[str, Any], llm_output: str) -> ScoreResult:
    metric = LevenshteinRatio()
    metric_score = metric.score(reference=dataset_item["hazard"], output=llm_output)
    return ScoreResult(
        value=metric_score.value,
        name=metric_score.name,
        reason=f"Levenshtein ratio between `{dataset_item['hazard']}` and `{llm_output}` is `{metric_score.value}`.",
    )
# Define the prompt to optimize
system_prompt = """You are an expert driving safety assistant specialized in hazard detection.
Your task is to analyze dashcam images and identify potential hazards that a driver should be aware of.
For each image:
1. Carefully examine the visual scene
2. Identify any potential hazards (pedestrians, vehicles, road conditions, obstacles, etc.)
3. Assess the urgency and severity of each hazard
4. Provide a clear, specific description of the hazard
Be precise and actionable in your hazard descriptions. Focus on safety-critical information."""
prompt = ChatPrompt(
    messages=[
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "{question}"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "{image}",
                    },
                },
            ],
        },
    ],
)
# Initialize HRPO (Hierarchical Reflective Prompt Optimizer)
optimizer = HRPO(model="openai/gpt-5.2", model_parameters={"temperature": 1})
# Run optimization
optimization_result = optimizer.optimize_prompt(
    prompt=prompt,
    dataset=dataset,
    validation_dataset=validation_dataset,
    metric=levenshtein_ratio,
    max_trials=10,
)
# Show results
optimization_result.display()
Going Further and Common Pitfalls
Now that you're done with your first optimization run, here are some additional tips for working with optimizers, especially multimodal agents, covering more advanced scenarios and some common anti-patterns to avoid:
- Model Costs and Choice: Multimodal prompts send larger payloads, so monitor token usage in the Opik dashboard; if cost is an issue, use a smaller vision model. Running these optimizers through multiple loops can get quite expensive. At the time of publication, on GPT 5.2, this example cost us about $0.15 USD. Monitor this as you run examples to see how the optimizer is behaving and catch any issues before you scale out.
- Model Selection and Vision Support: Double-check that your chosen model supports images. Some very recent model releases may not be mapped yet, so you might have issues. Keep your Python packages updated.
- Dataset Image Size and Format: Consider using JPEGs and lower-resolution images, which are more efficient than large-resolution PNGs, which can be token-hungry due to their size. Test how the model behaves via direct API calls, the playground, and small trial runs before scaling out. In the demo we ran, the dataset images were automatically converted by the SDK to JPEG (60% quality) with a max height/width of 512 pixels, a pattern you are welcome to follow.
- Dataset Split: If you have many examples, split into training/validation. Use a subset (n_samples) during optimization to find a better prompt, and reserve unseen data to confirm the improvement generalizes. This prevents overfitting the prompt to a few items.
- Evaluation Metric Design: For the hierarchical reflective optimizer, return a ScoreResult with a reason for each example. These reasons drive its root-cause analysis, and poor or missing reasons can make the optimizer less effective. Other optimizers behave differently, so knowing that evaluations are critical to success is key; you can also consider LLM-as-a-judge as an evaluation metric for more complex scenarios.
- Iteration and Logging: The example script automatically logs each trial's prompts and scores. Inspect these to understand how the prompt changed. If results stagnate, try increasing max_trials or using a different optimizer algorithm. You can also chain optimizers: take the output prompt from one optimizer and feed it into another. This is a good way to combine multiple approaches and ensemble optimizers for higher combined efficiency.
- Combine with Other Methods: We can also feed extra signals into the optimizer, such as bounding boxes, or add metadata through purpose-built visual processing models like Meta's SAM 3 to annotate our data. In practice, our input dataset could have image and image_annotated fields, which can be used as inputs to the optimizer, as the sketch below shows.
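As a sketch of that last idea, a dataset item could carry both the raw and annotated frames, and the ChatPrompt could pass both images to the model. The field names and the annotation step are our own convention (the SAM 3 annotation itself is not shown), and system_prompt is the one defined earlier:
from opik_optimizer import ChatPrompt

# Hedged sketch: pair each raw frame with a pre-annotated copy, e.g. with
# hazards outlined by a segmentation model ahead of time.
annotated_item = {
    "question": "Based on my dashcam image, what is the potential hazard?",
    "image": "data:image/jpeg;base64,...",            # raw frame
    "image_annotated": "data:image/jpeg;base64,...",  # frame with overlays
    "hazard": "Entity #1 cuts across Entity #2's lane.",
}

prompt = ChatPrompt(
    messages=[
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "{question}"},
                {"type": "image_url", "image_url": {"url": "{image}"}},
                {"type": "image_url", "image_url": {"url": "{image_annotated}"}},
            ],
        },
    ],
)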
Takeaways and Future Outlook of Optimizers
Thanks for following along with this. As part of this walk-through, we explored:
- Getting started with open-source agent & prompt optimization
- Creating a process to optimize a multi-modal vision-based agent
- Evaluating with image-based datasets in the context of LLMs
Moving forward, automating prompt design is becoming increasingly important as vision-capable LLMs advance. Thoughtfully optimized prompts can significantly improve model performance on complex multimodal tasks. Optimizers show how we can harness LLMs themselves to refine instructions, turning a long, tedious, and very manual process into a systematic search.
Looking ahead, we can start to see new ways of working in which automatic prompts and agent-optimization tools replace outdated prompt-engineering methods and fully leverage each model’s own understanding.
Enjoyed This Article?
Vincent Koc is a highly accomplished AI research engineer, writer, and lecturer with a wealth of experience across a number of global companies and works primarily in open-source development in artificial intelligence with a keen interest in optimization approaches. Feel free to connect with him on LinkedIn and X if you want to stay connected or have any questions about the hands-on example.
References
[1] Y. Choi et al., Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs, https://arxiv.org/abs/2510.09201
[2] M. Suzgun and A. T. Kalai, Meta-Prompting: Enhancing Language Models with Task-Agnostic Scaffolding, https://arxiv.org/abs/2401.12954
[3] K. Charoenpitaks et al., Exploring the Potential of Multi-Modal AI for Driving Hazard Prediction, https://ieeexplore.ieee.org/document/10568360 and https://github.com/DHPR-dataset/DHPR-dataset
[4] F. Yu et al., BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning, https://arxiv.org/abs/1805.04687 and https://bair.berkeley.edu/blog/2018/05/30/bdd/
[5] Chen et al., MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark, https://dl.acm.org/doi/10.5555/3692070.3692324 and https://mllm-judge.github.io/
[6] Opik, HRPO (Hierarchical Reflective Prompt Optimizer), https://www.comet.com/docs/opik/agent_optimization/algorithms/hierarchical_adaptive_optimizer and https://www.comet.com/site/products/opik/features/automatic-prompt-optimization/
[7] Meta, Introducing Meta Segment Anything Model 3 and Segment Anything Playground, https://ai.meta.com/blog/segment-anything-model-3/