
Finding Golden Examples: A Smarter Approach to In-Context Learning


Context

When working with Large Language Models (LLMs), In-Context Learning (ICL), where example inputs and outputs are provided to the LLM before it handles the actual input, has proven very effective at guiding the model in the right direction. Multiple ICL strategies have been developed. Popular ones include one-shot (provide one example), few-shot (provide multiple examples), and chain-of-thought (show step-by-step reasoning in the examples). Let's take a simple example to better understand in-context learning.

When you ask an LLM, "What animal makes the sound 'moo' and what is its type?" you might expect the answer "Cow, mammal." But LLMs tend to give much more information. For example, when asked the same question, ChatGPT gave the following answer:

User: What animal makes the sound 'moo' and what is its type?
The animal that makes the sound "moo" is a cow.
Animal: Cow
Type: Mammal (specifically, a domesticated ungulate belonging to the species Bos taurus)
Other examples of non-mammals include birds (like eagles 🦅), reptiles (like snakes 🐍), fish (like salmon 🐟), and insects (like butterflies 🦋).

Link: https://chatgpt.com/share/6886636f-7b48-8000-a477-54405edd7e43

Now, to teach the LLM to produce results in the format we expect (animal name, type), we can either fine-tune (train) the LLM to output in that format, which is usually not feasible since training LLMs requires significant resources, or we can provide examples at inference time, before asking the question, to teach the LLM our expected format. That is exactly what in-context learning is: the LLM learns the pattern from the provided examples to perform the task at hand. Here is the same interaction with ChatGPT with in-context examples:

User: What animal makes the 'woof' sound and what is its type?
Assistant: Dog, mammal
User: What animal makes the 'meow' sound and what is its type?
Assistant: Cat, mammal
User: What animal makes the sound 'moo' and what is its type?

This time, the LLM gave the expected answer: "Cow, mammal."

Link: https://chatgpt.com/share/688664f0-96f0-8000-9125-6a40b24d2773
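To make this concrete, here is a minimal sketch of how such a few-shot prompt could be assembled in code. It assumes the OpenAI Python SDK and an API key in the environment; the model name is arbitrary and not part of the original example.

```python
# Minimal sketch of few-shot in-context learning (assumes the OpenAI Python SDK).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

few_shot_messages = [
    {"role": "user", "content": "What animal makes the 'woof' sound and what is its type?"},
    {"role": "assistant", "content": "Dog, mammal"},
    {"role": "user", "content": "What animal makes the 'meow' sound and what is its type?"},
    {"role": "assistant", "content": "Cat, mammal"},
    # The real question comes last; the examples above teach the output format.
    {"role": "user", "content": "What animal makes the sound 'moo' and what is its type?"},
]

response = client.chat.completions.create(model="gpt-4o-mini", messages=few_shot_messages)
print(response.choices[0].message.content)  # expected: "Cow, mammal"
```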

As we can see, LLMs adapt well to In-Context Learning (ICL). Research has shown that ICL boosts the performance and accuracy of LLMs. But ICL is fragile. Performance is highly sensitive to which examples you choose, their order, and even minor formatting changes. ICL works through pattern matching rather than true learning, so it relies heavily on superficial cues. For a complex task like code repair or text-to-SQL, one set of examples might work well while an alternative set drops accuracy significantly. Hence, ICL's main challenge is: how do we select examples that actually help, not just any examples?

In this post, we're going to look at the research paper AuPair: Golden Example Pairs for Code Repair, published by Google DeepMind, which tackles these issues systematically. AuPair specifically addresses example selection for code repair tasks (fixing buggy code). This post aims to explain the core ideas behind their work and build a foundation for understanding how to systematically generate examples for ICL.

Effective Example Selection

Now we understand that the first challenge of ICL is finding the right set of examples. Before we look into how AuPair tackles this problem, let's look at the traditional approach to example selection. Normally, for domain-specific problems (like code generation/repair or text-to-SQL), we pick a few problems from the dataset using our own judgment, write examples for those problems, and use them at runtime for ICL. An extension of this is to build a pool of examples and use similarity search to pull the most relevant ones at runtime to inject as ICL context, as sketched below.
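Here is a rough, self-contained sketch of that similarity-search variant. The toy hashing embedding and the pool entries are stand-ins for illustration only; in practice you would use a real embedding model and your own example pool.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy bag-of-words hashing embedding; swap in a real embedding model in practice."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    return vec

def top_k_examples(query: str, example_pool: list[dict], k: int = 3) -> list[dict]:
    """Return the k pool examples most similar to the query by cosine similarity."""
    q = embed(query)
    sims = []
    for ex in example_pool:
        e = embed(ex["input"])
        sims.append(float(q @ e / (np.linalg.norm(q) * np.linalg.norm(e) + 1e-9)))
    best = np.argsort(sims)[::-1][:k]
    return [example_pool[int(i)] for i in best]

# Usage: the retrieved examples are then formatted as few-shot context before the real query.
pool = [{"input": "fix off-by-one loop bug", "output": "..."},
        {"input": "convert SQL join to left join", "output": "..."}]
print(top_k_examples("loop index bug in for loop", pool, k=1))
```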

In the traditional example curation process, we have no way to measure which example is most effective at anchoring the LLM in the right direction. Now, let's look at how AuPair addresses this problem. Instead of picking random examples, AuPair first builds a large dataset of candidate example pairs and then applies a greedy selection algorithm to select the best-performing ones. Let's look at each step one by one.

Phase 1: Example Pair Generation

Image by Author

The first step is to create a large collection of candidate repair pairs. AuPair starts with a dataset of coding problems that have test cases. For each problem, it asks the LLM to generate an initial solution (a guess). If this guess is partially correct (a score between 0 and 1, i.e., it passes some but not all test cases), it gets added to the training dataset.

The repair process takes this broken code and asks the LLM to fix it using a few-shot prompt with k randomly selected existing pairs as context (k = 32 was used in the experiment). If the generated fix scores better than the original guess, this becomes a candidate pair (guess → fix). The clever part is that if the fix is still not perfect, it becomes a new “broken” code that gets added back to the training dataset for further improvement in the next iteration. This creates chains of incremental improvements. AuPair repeats this process thousands of times to build a huge pool of candidate pairs covering different types of bugs and their fixes.
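The loop below is a simplified sketch of this generation phase, not the paper's actual implementation. The helpers generate_fix (an LLM call that uses k sampled pairs as few-shot context) and score (a test runner returning a value in [0, 1]) are assumed to exist.

```python
import random

def build_candidate_pairs(broken_solutions, generate_fix, score, k=32, budget=1000):
    """Sketch of Phase 1: iteratively repair broken code to collect (guess -> fix) pairs.

    broken_solutions: list of dicts with 'problem', 'code', and 'score' (partially correct guesses).
    generate_fix(problem, code, context_pairs): assumed LLM call returning repaired code.
    score(problem, code): assumed test runner returning a value in [0, 1].
    """
    candidate_pairs = []
    for _ in range(budget):
        item = random.choice(broken_solutions)
        # Few-shot context: k randomly sampled existing pairs (empty at the start).
        context = random.sample(candidate_pairs, k=min(k, len(candidate_pairs)))
        fix = generate_fix(item["problem"], item["code"], context)
        fix_score = score(item["problem"], fix)
        if fix_score > item["score"]:
            # The improvement becomes a candidate (guess -> fix) pair.
            candidate_pairs.append({"problem": item["problem"], "guess": item["code"], "fix": fix})
            if fix_score < 1.0:
                # Still imperfect: feed the fix back in as a new "broken" program to improve later.
                broken_solutions.append({"problem": item["problem"], "code": fix, "score": fix_score})
    return candidate_pairs
```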

Phase 2: Golden (Au) Pair Extraction

Once we have the candidate pair dataset, we need to pick the most effective pairs. This happens in two steps: first, we measure how much impact each candidate repair pair has; second, we select the best ones using a greedy algorithm.

Let’s first look into how the effectiveness of candidate repair pairs is measured.

Image by Author

To measure effectiveness, we first create a validation dataset, basically a set of broken code problems. Then, for each problem in the validation dataset, we take each candidate repair pair, use it as a 1-shot example along with the validation problem, and ask the LLM to generate a fix. The generated fix is tested against the unit test cases, and a score is calculated for that validation problem.

We create a quality matrix M where M[i,j] represents how well candidate pair i helps solve validation problem j, which gives us a comprehensive view of which pairs are most helpful across different types of problems. 
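A sketch of how such a matrix could be built is shown below. The helpers generate_fix_with_pair (a 1-shot LLM call that uses candidate pair i as the example) and score (the unit-test score in [0, 1]) are assumptions standing in for the paper's actual code.

```python
import numpy as np

def build_quality_matrix(candidate_pairs, validation_problems, generate_fix_with_pair, score):
    """M[i, j] = test score achieved on validation problem j when candidate pair i
    is used as the single in-context example (sketch of the evaluation step)."""
    M = np.zeros((len(candidate_pairs), len(validation_problems)))
    for i, pair in enumerate(candidate_pairs):
        for j, problem in enumerate(validation_problems):
            fix = generate_fix_with_pair(problem, pair)   # assumed 1-shot LLM call
            M[i, j] = score(problem, fix)                 # assumed unit-test score in [0, 1]
    return M
```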

Algorithm from AuPair paper

The next step is to find the AuPairs using the calculated effectiveness scores. The algorithm picks the candidate pair with the highest average score across all validation problems and adds it to the AuPair list. The crucial next step is to subtract this pair's contribution from all remaining pairs in the matrix. This ensures we don't pick redundant pairs but keep them complementary: each new AuPair must solve different problems than the previously selected ones. The process continues until the improvement falls below a threshold, resulting in an ordered list of golden pairs where each one teaches something unique.
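The greedy extraction can be sketched roughly as follows, operating on the quality matrix M from the previous step. This is a simplified rendering of the idea, not a verbatim reimplementation of the paper's algorithm.

```python
import numpy as np

def extract_aupairs(M, min_improvement=1e-3):
    """Greedy selection sketch: repeatedly pick the pair with the highest average
    remaining score, then subtract its contribution so later picks must be complementary."""
    M = M.astype(float).copy()
    aupairs = []
    available = set(range(M.shape[0]))
    while available:
        idx = list(available)
        means = M[idx].mean(axis=1)
        best_pos = int(np.argmax(means))
        best_idx, best_gain = idx[best_pos], float(means[best_pos])
        if best_gain < min_improvement:
            break  # remaining pairs add too little; stop.
        aupairs.append(best_idx)
        available.remove(best_idx)
        # Credit assignment: remove what the chosen pair already solves from every other pair.
        rest = list(available)
        M[rest] = np.maximum(M[rest] - M[best_idx], 0.0)
    return aupairs  # ordered list of golden pair indices
```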

Image from AuPair paper

Experiment Results

AuPair was benchmarked across 7 different coding problem datasets using 5 different LLMs. It consistently outperformed self-reflection and best-of-N sampling approaches. The results also show that AuPairs achieve 2–3x better compute efficiency: it takes only 12 AuPairs to reach the same performance that 32 random pairs require. Furthermore, AuPairs generated on the CodeForces dataset worked effectively on completely different datasets like HackerEarth and AtCoder, which shows that once we've built a good set of golden pairs, they generalize well to new problems in the same domain.

Limitations

AuPair shows promising results, but it has a few constraints too. First, it requires substantial compute for the LLM calls that generate candidate example pairs through iterative repair. Second, it relies heavily on evaluation metrics (like unit tests for code) to measure improvement, which may not be available in all domains. It also assumes that complementary examples lead to better performance; this held for coding problems, but it may not be true everywhere. Finally, AuPair was benchmarked on structured contest problems rather than more complex real-world codebases.

Conclusion

AuPair shows us a smarter way to do in-context learning for code repair tasks. Instead of randomly picking examples, it uses a systematic approach to find the most effective repair patterns, the ones that actually help the LLM perform better. While it needs significant upfront compute and works best when you have good evaluation metrics, the results suggest it is worth the investment, especially since the golden pairs work well across different datasets. This research opens up possibilities for applying similar example selection techniques to other domains (e.g., text-to-SQL) where we can systematically generate and measure example effectiveness.

References

AuPair: Golden Example Pairs for Code Repair, Google DeepMind.
