an LLM can see before it generates an answer. This includes the prompt itself, instructions, examples, retrieved documents, tool outputs, and even the prior conversation history.
Context has a huge impact on answer quality. For example, if you ask an LLM to write a SQL query without providing the data schema, the result will almost certainly be suboptimal. Worse, if the model has no access to the database at all, it may simply hallucinate a query that doesn’t work. Even when tools are available, the model still needs extra time and effort to infer the schema before it can produce a correct answer.
Because context plays such a central role in LLM-based applications, context engineering has emerged as a discipline focused on systematically optimising what information goes into a model’s prompt. The goal is to build “self-improving” systems that learn from experience without relying on expensive fine-tuning (retraining models and updating millions of parameters).
Context engineering comes with several key advantages:
- it’s more cost-effective and doesn’t require specialised fine-tuning expertise;
- context and instructions remain transparent, interpretable, and easy for humans to modify;
- iteration cycles are much faster, since updates can be made instantly without retraining or redeploying models;
- it’s more agile, especially when information needs to be forgotten for privacy or legal reasons.
With all these advantages, it’s not surprising that context engineering is gaining so much attention. What’s interesting, though, is how quickly the approaches themselves are evolving. In this article, I’ll walk through that evolution and then experiment with one of the newer frameworks for prompt optimisation: Agentic Context Engineering (ACE).
Evolution of context engineering approaches
Context engineering didn’t appear overnight. It has evolved through several distinct stages.
The earliest stage was static prompting. Here, prompts were hand-crafted instructions that never changed. Most of the effort went into classic prompt engineering: carefully choosing wording, structure, and formatting to squeeze better performance out of the model.
The next major step was dynamic retrieval. Instead of relying on a fixed prompt, systems began pulling in relevant information (documents, examples, or facts) at inference time. Retrieval-Augmented Generation (RAG) became one of the most popular approaches in this category. By grounding responses in external data, RAG significantly improved accuracy and reduced hallucinations, especially for knowledge-heavy tasks.
More recently, the focus has shifted toward self-improving contexts. Rather than treating context as something that is merely retrieved or injected, these approaches allow the system to update and refine its own context based on past performance. In other words, the prompt itself becomes adaptive, evolving through reflection and feedback.
A number of frameworks have emerged around this idea. Below are some of the most influential ones.
- One of the earliest and most significant works is “Reflexion: Language Agents with Verbal Reinforcement Learning” by Shinn et al. This research introduced the idea that language agents can learn from mistakes through natural language reflection rather than gradient-based updates. Reflexion agents analyse feedback from previous attempts, generate verbal reflections about what went wrong, and store these reflections in an episodic memory buffer. These stored reflections then guide better decision-making in subsequent trials.
- Another important contribution is “TextGrad: Automatic Differentiation via Text” by Yuksekgonul et al. TextGrad borrows concepts from deep learning optimisation (such as gradients, backpropagation, and gradient descent) but replaces numerical derivatives with natural language feedback. In this framework, LLMs generate textual critiques describing how a variable should change to improve the outcome. These “textual gradients” are then propagated backwards through the system using prompting, effectively performing a natural-language version of backpropagation across a compound AI system.
- The paper “GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning” by Agrawal et al. takes a different angle by combining evolutionary algorithms with language-based reflection. Prompts are treated like organisms: they mutate, compete, and evolve under selection pressure. Over time, better-performing prompts survive and propagate. This approach is implemented in DSPy, and Hugging Face provides a practical guide for applying it in real-world use cases.
- Finally, “Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory” by Suzgun et al. explores test-time learning through persistent memory. In this setup, a black-box LLM is given a notebook where it can write down useful strategies, patterns, and code snippets during inference. Instead of repeatedly rediscovering the same insights, the model accumulates and reuses knowledge across tasks. This adaptive memory significantly improves performance without requiring explicit labels or human feedback.
Agentic Context Engineering
Now that we’ve covered how context engineering has evolved, let’s take a closer look at Agentic Context Engineering (ACE), one of the more recent approaches and the main focus of this article. ACE is introduced in the paper “Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models” by Zhang et al., published in 2025.
The paper starts by identifying two key problems with existing self-improving context methods.
- Brevity bias. The tendency of systems to oversimplify important details and gradually collapse toward short, generic prompts. While compact prompts are attractive, they often lose the nuances that actually drive good performance.
- Context collapse. When systems repeatedly rewrite the entire prompt, they tend to forget useful knowledge accumulated earlier. Over time, this leads to instability and regressions rather than steady improvement.
To address these issues, the authors propose Agentic Context Engineering (ACE), a framework designed for scalable and efficient context adaptation in both offline settings (such as system prompt optimisation) and online scenarios (like test-time memory adaptation). Instead of compressing knowledge into a single static prompt, ACE allows the model to continuously evolve its context by accumulating successful strategies, reflecting on failures, and organising knowledge in a structured way. Conceptually, it resembles an AI assistant that improves over time by keeping detailed notes and refining its own playbook.
At the core of ACE is an agentic learning loop that mirrors how humans learn through experimentation: try, reflect, and consolidate. The framework consists of three components:
- Generator, which produces reasoning trajectories while solving tasks;
- Reflector, which analyses successes and failures and distils actionable insights;
- Curator, which integrates those insights into the shared context as small, incremental updates.
Rather than maintaining a single monolithic prompt, ACE organises context as a playbook made up of structured bullet points. Each bullet contains metadata (such as a unique identifier and counters tracking how often it has been helpful or harmful) as well as content representing a small, reusable unit of knowledge. This might be a general strategy, a domain-specific concept, or a common failure mode.

The ACE workflow consists of several phases.
- Generation phase. The Generator tackles new problems using the current playbook, marking which bullets were helpful or misleading.
- Reflection phase. The Reflector analyses the full trajectory, extracting lessons from both successes and failures through iterative refinement.
- Curation phase. The Curator turns these insights into compact “delta” updates — new or modified bullets that are merged into the existing playbook using lightweight, non-LLM logic.
- Grow-and-refine phase. New bullets are appended, existing ones are updated in place, and periodic deduplication removes redundancy using semantic embeddings.
This design enables parallel processing of multiple updates and supports multi-epoch adaptation, where the same queries can be revisited to progressively strengthen the context over time.
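To make the loop more concrete, here is a rough Python-style sketch of one adaptation pass as described above. This is purely illustrative pseudocode under my own naming (generator, reflector, curator, and playbook are hypothetical objects), not the framework's actual API.

def ace_adaptation_pass(tasks, playbook, generator, reflector, curator):
    """Illustrative sketch of one ACE pass: generate, reflect, curate, refine."""
    for task in tasks:
        # Generation: solve the task with the current playbook,
        # noting which bullets were helpful or misleading
        trajectory = generator.solve(task, playbook)

        # Reflection: distil lessons from the full trajectory,
        # covering both successes and failures
        insights = reflector.analyze(task, trajectory, playbook)

        # Curation: turn insights into compact "delta" updates
        # (new or modified bullets) merged with lightweight, non-LLM logic
        deltas = curator.to_deltas(insights, playbook)

        # Grow-and-refine: append new bullets, update existing ones in place,
        # and periodically deduplicate using semantic embeddings
        playbook = playbook.merge(deltas)
        playbook = playbook.deduplicate()

    return playbook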
Empirically, ACE delivers strong results. On benchmark evaluations, it outperforms other self-improving context approaches, achieving a +10.6% improvement on AI agent tasks and a +8.6% gain in specialised domains such as finance.

Beyond accuracy, ACE is also more cost-efficient thanks to its incremental update mechanism, showing 83.6% lower token costs compared to baseline methods.
Together, these results position ACE as a practical and scalable step forward in building self-improving LLM systems.
Using ACE for banking intent data
The ACE framework looks promising on paper, so the next step is to see how it performs in practice. Fortunately, the authors have shared an open-source implementation on GitHub, which gives us a solid starting point.
Loading the data
To keep the experiment focused, I decided to apply ACE to a classification task. I’m using a publicly available dataset of banking intents released by PolyAI. This dataset reflects a very common real-world problem: identifying customer intent when someone contacts customer support. Accurate intent classification is critical for routing requests to the right team, triggering semi-automated responses, or simply monitoring recurring issues.
In this dataset, each customer message (for example, “I’m not sure why my card didn’t work”) needs to be mapped to a specific banking intent, such as declined_card_payment. In total, there are 77 distinct intent categories.
To keep the experiment manageable, I sampled 500 examples from the dataset and split them into training, test, and validation sets. Below is the code used to load the data and create the splits.
import random
import pandas as pd

full_df = pd.read_csv('./poly_ai_banking_data/train.csv')

# params
total_number_of_samples = 500
train_share = 0.5
test_share = 0.4
val_share = 0.1

# sample a fixed subset for reproducibility
sample_df = full_df.sample(n=total_number_of_samples, random_state=42)\
    .reset_index(drop=True)

# assign each example to a split at random, weighted by the shares above
random.seed(42)
sample_df['group'] = random.choices(
    ['train', 'test', 'val'],
    weights=(train_share, test_share, val_share),
    k=total_number_of_samples
)

train_df = sample_df[sample_df['group'] == 'train'].reset_index(drop=True)
test_df = sample_df[sample_df['group'] == 'test'].reset_index(drop=True)
val_df = sample_df[sample_df['group'] == 'val'].reset_index(drop=True)
Extending ACE to banking intent data
The next step is to extend the ACE framework so it can work with our banking intent dataset. Fortunately, the authors provide a detailed guide that makes this process relatively straightforward.
In addition to plugging in the new dataset, I made a couple of small modifications to the core framework to support Anthropic models and configurable temperature settings. You can find the complete, modified version of the code on GitHub.
Preparing the data
The first thing we need to do is prepare the dataset in a format that ACE expects. I saved the training, validation, and test splits as CSV files under banking/data. Each example contains:
- text: the customer support message,
- category: the target intent label we want to predict,
- group: an auxiliary field indicating whether the example belongs to the train, test, or validation set.
The group field won’t be used later by the framework itself, but it’s convenient for dataset management and reproducibility.
Here’s what the data format looks like.
text,category,group
Is it possible for me to change my PIN number?,change_pin,test
What is the $1 transaction on my account?,extra_charge_on_statement,test
How much does top up fees cost?,top_up_by_card_charge,test
I live in the EU - can I get a card?,country_support,test
Next, we need to tell ACE where to find each split. This is done by specifying dataset paths in banking/data/task_config.json.
{
  "banking": {
    "train_data": "./banking/data/train.csv",
    "val_data": "./banking/data/val.csv",
    "test_data": "./banking/data/test.csv"
  }
}
Implementing the DataProcessor
To integrate a new task, the framework requires a custom DataProcessor module. According to the guide, this involves implementing three core methods: process_task_data, answer_is_correct and evaluate_accuracy.
In addition, we need a helper function to load the raw data from disk. Let’s start with that.
Below is the implementation of the data-loading function. It reads a CSV file, validates its existence, and converts each row into a dictionary that the rest of the pipeline can work with.
import csv
import os
from typing import Any, Dict, List

def load_data(data_path: str) -> List[Dict[str, Any]]:
    """
    Load and process data from a CSV file.

    Expected CSV format: text,category,group (with header)

    Args:
        data_path: Path to the CSV file

    Returns:
        List of dictionaries containing the data
    """
    if not os.path.exists(data_path):
        raise FileNotFoundError(f"Data file not found: {data_path}")

    data = []
    with open(data_path, 'r', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        for row in reader:
            data.append({
                'text': row['text'],
                'category': row['category'],
                'group': row.get('group', '')
            })

    print(f"Loaded {len(data)} samples from {data_path}")
    return data
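Using it is straightforward; for example, loading the training split:

train_raw = load_data("./banking/data/train.csv")
# prints: Loaded 248 samples from ./banking/data/train.csv
# each element is a plain dict with 'text', 'category', and 'group' keys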
With the data-loading function in place, we can move on to implementing the remaining DataProcessor methods.
The main purpose of process_task_data is to convert the raw dataset into ACE’s standardised input format.
ACE expects each example to contain three fields: context, question, and target. In our case, the mapping is fairly simple. We map the intent category directly to target, and we leave context empty since there’s no additional background information needed for classification.
The most important part here is the question. We added extra context to make it clear to the LLM that it should classify the query rather than answer it directly, while also providing the list of available topics to guide the model’s response.
def process_task_data(self, raw_data: List[Dict]) -> List[Dict]:
    """
    Convert raw CSV data into standardized format for ACE.

    Args:
        raw_data: Raw data loaded from CSV (list of dicts with 'text', 'category')

    Returns:
        List of dicts with keys: 'context', 'question', 'target'
    """
    processed_data = []

    # Gather the list of topics to include into the question
    topics_list = ", ".join(self.allowed_topics)

    for item in raw_data:
        customer_query = item.get('text', '')
        ground_truth_topic = item.get('category', '')

        # The question provides the classification task instruction
        question = (
            f"Classify the following banking customer support query into one of the predefined topics.\n\n"
            f"Customer Query: {customer_query}\n\n"
            f"Available Topics: {topics_list}\n\n"
            f"Respond with ONLY the topic name, nothing else."
        )

        processed_item = {
            "context": "",  # No additional context needed
            "question": question,
            "target": ground_truth_topic,
            "others": {
                "original_text": customer_query,
                "task": self.task_name,
            }
        }
        processed_data.append(processed_item)

    return processed_data
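To make the mapping concrete, here is roughly what the processed format looks like for one of the rows shown earlier (assuming processor is an initialised DataProcessor):

sample = [{'text': 'Is it possible for me to change my PIN number?',
           'category': 'change_pin', 'group': 'test'}]

processed = processor.process_task_data(sample)

processed[0]['target']   # -> 'change_pin'
processed[0]['context']  # -> '' (no extra context needed)
# processed[0]['question'] starts with
# "Classify the following banking customer support query into one of the predefined topics."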
The next method, answer_is_correct, checks whether a model’s prediction matches the ground truth label. Since we explicitly instruct the LLM to respond with only the category name, a simple case-insensitive string comparison is sufficient here.
def answer_is_correct(self, predicted: str, ground_truth: str) -> bool:
    """
    Check if the predicted topic matches the ground truth.

    Uses simple case-insensitive comparison.

    Args:
        predicted: Model's predicted topic
        ground_truth: Ground truth topic

    Returns:
        bool: True if prediction is correct, False otherwise
    """
    return predicted.lower().strip() == ground_truth.lower().strip()
The final method we need to implement is evaluate_accuracy, which computes overall classification accuracy across multiple predictions. There’s nothing fancy going on here. We simply calculate the fraction of cases where answer_is_correct(prediction, ground_truth) returns True.
def evaluate_accuracy(self, predictions: List[str], ground_truths: List[str]) -> float:
    """
    Calculate classification accuracy across multiple predictions.

    Args:
        predictions: List of model predictions
        ground_truths: List of ground truth topics

    Returns:
        Accuracy as a float between 0 and 1
    """
    if len(predictions) != len(ground_truths):
        raise ValueError("Predictions and ground truths must have same length")

    if not predictions:
        return 0.0

    correct = sum(
        1 for pred, truth in zip(predictions, ground_truths)
        if self.answer_is_correct(pred, truth)
    )
    return correct / len(predictions)
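For instance, with two predictions where only the first one matches the label, the method returns 0.5:

predictions   = ["change_pin", "top_up_limits"]
ground_truths = ["change_pin", "automatic_top_up"]

processor.evaluate_accuracy(predictions, ground_truths)  # -> 0.5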
Putting together the workflow script
With the DataProcessor in place, the next step is to assemble a comprehensive run script for ACE. I created a run_ace_workflow script that accepts several key arguments:
- api_provider selects the language model API to use ('anthropic', 'openai', 'together', or 'sambanova'), defaulting to 'anthropic'.
- generator_model specifies the model for the Generator agent (default: 'claude-haiku-4-5').
- reflector_model specifies the model for the Reflector agent (default: 'claude-sonnet-4-5').
- curator_model specifies the model for the Curator agent (default: 'claude-sonnet-4-5').
- max_train and max_test are optional limits on the train and test set sizes, useful for quick experiments or debugging.
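For reference, here is a minimal sketch of how these arguments could be wired up with argparse; the actual run_ace_workflow script in the repository may define them slightly differently.

import argparse

parser = argparse.ArgumentParser(description="Run ACE on the banking intent task")
parser.add_argument("--api_provider", default="anthropic",
                    choices=["anthropic", "openai", "together", "sambanova"])
parser.add_argument("--generator_model", default="claude-haiku-4-5")
parser.add_argument("--reflector_model", default="claude-sonnet-4-5")
parser.add_argument("--curator_model", default="claude-sonnet-4-5")
parser.add_argument("--max_train", type=int, default=None,
                    help="optional cap on the number of training samples")
parser.add_argument("--max_test", type=int, default=None,
                    help="optional cap on the number of test samples")
args = parser.parse_args()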
Let’s discuss how this script actually works. The script begins by loading the banking intent data and initialising the DataProcessor. Here’s the helper function I wrote for this.
def load_banking_data(max_train=None, max_test=None):
    """Load and process banking dataset."""
    from banking.data_processor import DataProcessor, load_data

    base_path = os.path.dirname(__file__)
    data_path = os.path.join(base_path, "data")

    # Load raw data
    train_raw = load_data(os.path.join(data_path, "train.csv"))
    val_raw = load_data(os.path.join(data_path, "val.csv"))
    test_raw = load_data(os.path.join(data_path, "test.csv"))

    # Limit samples if specified
    if max_train:
        train_raw = train_raw[:max_train]
        val_raw = val_raw[:max(max_train // 4, 10)]
    if max_test:
        test_raw = test_raw[:max_test]

    # Process data
    processor = DataProcessor(task_name="banking")
    train_samples = processor.process_task_data(train_raw)
    val_samples = processor.process_task_data(val_raw)
    test_samples = processor.process_task_data(test_raw)

    return train_samples, val_samples, test_samples, processor


train_samples, val_samples, test_samples, processor = load_banking_data(
    max_train=args.max_train,
    max_test=args.max_test
)
The next step is to define a playbook template. This is important because the current ACE implementation can’t dynamically create new sections, so we predefine the structure to guide the model. Here’s the template I used for the banking domain.
BANKING_PLAYBOOK_TEMPLATE = """
## GENERAL
## CLASSIFICATION PRINCIPLES
## CATEGORY DISAMBIGUATION
## BANKING DOMAIN KNOWLEDGE
## COMMON PATTERNS
## HANDLING AMBIGUOUS QUERIES
## COMMON MISTAKES TO AVOID
## OTHERS
"""
With the data and template ready, we can initialise the ACE object with the main parameters.
ace_system = ACE(
    api_provider=args.api_provider,
    generator_model=args.generator_model,
    reflector_model=args.reflector_model,
    curator_model=args.curator_model,
    max_tokens=4096,
    initial_playbook=BANKING_PLAYBOOK_TEMPLATE,
    use_bulletpoint_analyzer=True,  # enable deduplication of bullet points in the playbook
    generator_temperature=0.1,      # prioritise consistency for the generator
    reflector_temperature=0.7,      # prioritise creativity for the reflector and curator
    curator_temperature=0.7,
)
Finally, we define a function to run the ACE training workflow, which includes initial evaluation, iterative reflection, curation, and final evaluation.
def run_ace_training(ace_system, train_samples, val_samples, test_samples, processor, results_dir):
    """Train ACE to improve the playbook (includes initial and final evaluations)."""
    config = {
        'num_epochs': 1,
        'max_num_rounds': 3,      # max reflection rounds per sample
        'curator_frequency': 5,   # run curator every 5 steps
        'eval_steps': max(len(train_samples) // 10, 10),  # evaluate ~10 times during training
        'save_steps': max(len(train_samples) // 10, 10),
        'playbook_token_budget': 80000,
        'task_name': 'banking_ace',
        'json_mode': False,
        'no_ground_truth': False,
        'save_dir': os.path.join(results_dir, "training"),
        'test_workers': 10,
    }

    results = ace_system.run(
        mode='offline',
        train_samples=train_samples,
        val_samples=val_samples,
        test_samples=test_samples,
        data_processor=processor,
        config=config
    )

    # Extract results
    initial_acc = results.get('initial_test_results', {}).get('accuracy', 0)
    final_acc = results.get('final_test_results', {}).get('accuracy', 0)
    training_results = results.get('training_results', {})

    return ace_system.best_playbook, results


best_playbook, training_results = run_ace_training(
    ace_system, train_samples, val_samples, test_samples,
    processor, results_dir
)
And that’s it! That’s all the core logic we need to run ACE. I’ve added some logging on top of the workflow for convenience, but it’s not essential to the main functionality.
Results
Let’s take a look at the results and see how everything comes together. First, check out the best playbook, which you can find at results/banking_{dt}/best_playbook.txt. The playbook is organised into itemised bullets, grouped according to the categories we defined in our initial template. Each bullet contains detailed instructions and strategies, along with metadata showing how often it was marked helpful or harmful. This structure makes it easy to see which topics and strategies the system found most useful during training.
## GENERAL
## CLASSIFICATION PRINCIPLES
[cls-00001] helpful=1 harmful=0 :: Temporal indicators like 'was able to before', 'worked previously', or 'used to work' are strong signals that the issue is specific to the current transaction rather than a general system capability problem. These phrases suggest a change in status for a specific entity (beneficiary, card, account) rather than overall functionality.
[cls-00002] helpful=18 harmful=4 :: Apply specificity hierarchy: when multiple categories could apply, choose the most specific one that matches the contextual clues. For example, beneficiary_not_allowed (specific to recipient) is more specific than declined_transfer (general failure).
[cls-00009] helpful=0 harmful=3 :: Specificity hierarchy works bidirectionally: choose specific categories when contextual clues point to a particular transaction type, but use general categories (like 'extra_charge_on_statement') when the query lacks sufficient context to determine the specific nature of the transaction. Don't force specificity when the customer's query is inherently general.
[cls-00017] helpful=5 harmful=1 :: Process-oriented vs Status-tracking distinction: Differentiate between questions about HOW to obtain/acquire something (process-oriented) versus questions about WHEN something will arrive or WHETHER it has arrived (status-tracking). Process questions focus on the steps and components needed, while status questions focus on timing and delivery confirmation. Use this distinction to choose between acquisition categories and tracking/arrival categories.
## CATEGORY DISAMBIGUATION
[dis-00003] helpful=1 harmful=0 :: declined_transfer vs beneficiary_not_allowed: If the customer mentions they could transfer before but suddenly cannot, this strongly indicates beneficiary_not_allowed (recipient is blocked/restricted) rather than declined_transfer (general transfer failure due to funds, limits, or system errors).
[dis-00011] helpful=11 harmful=0 :: pending_* vs failed_* vs declined_*: Transaction state is critical for classification. 'Hasn't gone through yet' or 'taking too long' = pending state. 'Didn't work', 'was declined', or 'was rejected' = failed/declined state. 'Money came back' or 'was returned' = reverted state. Match the category to the actual transaction state described.
[dis-00012] helpful=0 harmful=1 :: country_support vs supported_cards_and_currencies: Queries about geographic availability ('which countries', 'where can I', 'what regions') should be classified as 'country_support'. In contrast, 'supported_cards_and_currencies' is for questions about card types (Visa, Mastercard) and currency options, not geographic availability.
[dis-00014] helpful=2 harmful=0 :: Cash withdrawal issues: Distinguish by transaction state and outcome: 'pending_cash_withdrawal' (not completed yet, still processing), 'declined_cash_withdrawal' (rejected, no cash received), 'cash_withdrawal_not_recognised' (customer doesn't recall the transaction), and 'wrong_amount_of_cash_received' (transaction completed but incorrect amount dispensed). If cash was received but the amount was wrong, use the most specific category: wrong_amount_of_cash_received.
[dis-00015] helpful=3 harmful=3 :: card_arrival vs get_physical_card: Distinguish between status-tracking questions (card_arrival) and process-acquisition questions (get_physical_card). 'card_arrival' is for tracking existing orders ('Has my card arrived?', 'Where is my card?'). 'get_physical_card' encompasses the entire process of obtaining a physical card including all components like PIN ('Where can I find my PIN?', 'How do I get my card and PIN?'). Questions about missing PINs with 'haven't gotten it yet' indicate the customer is in the acquisition process, not just tracking delivery.
[dis-00021] helpful=1 harmful=0 :: card_payment_not_recognised vs extra_charge_on_statement: When a customer mentions a 'payment' they don't recognize or didn't make ('payment I never submitted', 'payment I didn't authorize'), classify as 'card_payment_not_recognised' because 'payment' is a specific transaction type. Use 'extra_charge_on_statement' only when the customer describes unexpected amounts, fees, or charges WITHOUT specifying the transaction type (e.g., 'I see an extra $5 on my statement', 'there's a strange charge' without mentioning payment/transfer/withdrawal).
[dis-00024] helpful=0 harmful=1 :: Fee/charge category specificity: When customers ask about fees or charges, prioritize transaction-type-specific fee categories over 'extra_charge_on_statement'. If the query mentions a specific transaction type (transfer, payment, withdrawal, top-up), use the corresponding specific fee category: 'transfer_fee_charged' for transfer fees, 'card_payment_fee_charged' for payment fees, 'atm_fee_charged' for withdrawal fees, 'top_up_fee' for top-up fees. Reserve 'extra_charge_on_statement' only for fee queries where no specific transaction type is mentioned (e.g., 'Why is there an extra $5 charge?' without context).
[dis-00026] helpful=0 harmful=0 :: receiving_money vs transfer_into_account: Distinguish between passive receipt and active transfer. 'receiving_money' is for queries about receiving funds FROM another party (passive, initiated by sender). 'transfer_into_account' is for queries about the customer initiating a transfer TO add funds to their own account (active, self-initiated). Context clues: empty/low balance + asking about transfers = likely transfer_into_account. Questions about 'can I transfer funds' in the context of needing to add money = transfer_into_account, not receiving_money.
[dis-00029] helpful=0 harmful=0 :: beneficiary_not_allowed vs declined_transfer: When a query explicitly mentions 'beneficiary' or 'recipient' combined with restriction language ('not allowed', 'blocked', 'restricted', 'cannot add', 'unable to add'), classify as 'beneficiary_not_allowed' even without temporal indicators. The combination of the specific banking entity term (beneficiary/recipient) with restriction language is a strong direct signal for recipient-level restrictions rather than general transfer failures.
## BANKING DOMAIN KNOWLEDGE
[bank-00006] helpful=0 harmful=0 :: In banking, when a previously successful transfer suddenly fails, common causes include: beneficiary being flagged/blocked by fraud systems, beneficiary account restrictions, or beneficiary being removed from allowed list. These are distinct from general transfer declines due to insufficient funds or system errors.
[bank-00008] helpful=0 harmful=6 :: Small unexpected amounts (like £1, £0.01) appearing on statements often indicate authorization holds, verification charges, or miscellaneous fees. When customers question these without additional context, they should be classified as 'extra_charge_on_statement' rather than more specific transaction types.
[bank-00018] helpful=0 harmful=0 :: 'card_swallowed' is the banking industry term for ATM card retention scenarios where the machine keeps/retains the customer's card. This applies when cards are stuck, won't come out, or are held by the ATM, regardless of the specific phrasing used by the customer.
[bank-00020] helpful=10 harmful=4 :: Banking terminology has a specificity hierarchy for transaction references. Specific transaction type keywords include: 'payment' (card payments), 'transfer' (money transfers), 'withdrawal' (cash withdrawals), 'top-up' (account funding), 'direct debit', 'standing order'. Generic terms include: 'charge', 'amount', 'transaction', 'fee'. When a customer uses a specific transaction type keyword, it provides sufficient context to classify into transaction-type-specific categories rather than general categories.
## COMMON PATTERNS
[pat-00004] helpful=0 harmful=0 :: Pattern: 'It worked before, now it doesn't' + transfer context = likely beneficiary-level restriction rather than system-level decline. The previous success indicates the account and transfer mechanism are functional, pointing to a specific restriction on the current recipient.
[pat-00007] helpful=3 harmful=6 :: Pattern: Customer describes transaction as 'strange', 'unexpected', 'unexplained', or asks 'what is this charge' on their statement without providing specific transaction type context (transfer, payment, withdrawal, etc.) = classify as 'extra_charge_on_statement'. This is the appropriate general category when the nature of the charge is unclear.
[pat-00010] helpful=8 harmful=1 :: Pattern: Phrases like 'hasn't gone through yet', 'still waiting', 'not completed', or 'still pending' indicate a transaction in PENDING state, not a FAILED state. Choose 'pending_*' categories over 'failed_*' or 'declined_*' categories when these language cues are present.
[pat-00013] helpful=0 harmful=2 :: Pattern: Questions with geographic scope indicators like 'which countries', 'where can I', 'what regions', or 'in what locations' are asking about service availability by geography = classify as 'country_support'. The core intent is understanding geographic reach of services.
[pat-00016] helpful=2 harmful=9 :: Pattern: 'Where can I find' or 'How do I get' phrasing indicates process-oriented questions seeking information about obtaining or acquiring something, not status-tracking questions. These should typically map to acquisition/setup categories (like 'get_physical_card') rather than delivery/tracking categories (like 'card_arrival' or 'card_delivery_estimate').
[pat-00019] helpful=0 harmful=0 :: Pattern: Phrases indicating a card is physically retained by an ATM ('card stuck in ATM', 'card won't come out', 'ATM kept my card', 'get my card out of ATM', 'retrieve card from machine') should be classified as 'card_swallowed'. The key indicator is the card being physically held/retained by the machine rather than other card issues like damage, loss, or functionality problems.
[pat-00022] helpful=1 harmful=0 :: Pattern: Specific transaction type keyword + 'not recognized'/'didn't make'/'never submitted' = use transaction-type-specific 'not_recognised' category. Examples: 'payment I didn't make' → card_payment_not_recognised; 'transfer I don't recognize' → transfer_not_received_by_recipient or related transfer issue; 'withdrawal I never made' → cash_withdrawal_not_recognised. The presence of a specific transaction type keyword (payment, transfer, withdrawal) is sufficient context to avoid general categories.
[pat-00025] helpful=1 harmful=0 :: Pattern: Transaction type keyword + timing question ('how long', 'when will', 'how much time') + geographic mention = prioritize transaction-specific timing category (e.g., 'transfer_timing', 'card_delivery_estimate'). Treat geographic mentions as contextual information about the transaction origin/destination unless the query explicitly asks about service availability ('which countries', 'where can I use', 'is it available in'). Example: 'transfer from China, how long?' → 'transfer_timing' (not 'country_support').
[pat-00027] helpful=0 harmful=0 :: Pattern: Account balance context + transfer inquiry = intent to add funds. When a customer mentions their account is empty/has no funds/needs money AND asks about transferring, they are asking about moving funds INTO their account (transfer_into_account), not about receiving money from others (receiving_money). The account state provides critical context for disambiguating transfer-related intents.
## HANDLING AMBIGUOUS QUERIES
## COMMON MISTAKES TO AVOID
[err-00005] helpful=2 harmful=0 :: Don't default to general categories (like declined_transfer) when temporal context ('was able to before') suggests a more specific issue. The temporal change is a key discriminator that often points to entity-specific restrictions (beneficiary, card, account) rather than general failures.
[err-00023] helpful=2 harmful=0 :: Don't default to 'extra_charge_on_statement' when the customer mentions a specific transaction type (payment, transfer, withdrawal, top-up) they don't recognize. 'extra_charge_on_statement' should be reserved for truly ambiguous cases where no transaction type is specified. When a customer says 'payment I never made', the word 'payment' provides sufficient context to use 'card_payment_not_recognised' instead of the generic 'extra_charge_on_statement'.
[err-00028] helpful=0 harmful=0 :: Don't apply pattern rules or domain knowledge that are irrelevant to the query. If a query has no geographic indicators, don't apply geographic patterns. If there's no mention of fees, don't apply fee-related rules. Focus on rules that directly match the semantic content and context of the customer's query rather than grasping for any applicable rule. Irrelevant rule application leads to misclassification.
## OTHERS
For a deeper look at how each agent operates, you can explore the detailed execution logs at results/banking_{dt}/training/ace_run_{dt}/detailed_llm_logs. I highly recommend browsing these logs. At the very least, skim through the prompts and see how the Generator, Reflector, and Curator interact. It’s a great way to understand how ACE evolves the context step by step.
Of course, the most interesting metric is accuracy. You can find the initial and final test results in results/banking_{dt}/training/initial_test_results.json and results/banking_{dt}/training/final_test_results.json.
# initial results
{
  "test_results": {
    "accuracy": 0.7512437810945274,
    "correct": 151,
    "total": 201,
    "no_answer": 0
  },
  "error_log": {
    "accuracy": 0.7512437810945274,
    "errors": [
      {
        "index": 2,
        "prediction": "declined_card_payment",
        "ground_truth": "declined_transfer"
      },
      {
        "index": 9,
        "prediction": "top_up_limits",
        "ground_truth": "automatic_top_up"
      },
      {
        "index": 7,
        "prediction": "transfer_not_received_by_recipient",
        "ground_truth": "balance_not_updated_after_cheque_or_cash_deposit"
      },
      ...
    ]
  }
}
# final results
{
  "test_results": {
    "accuracy": 0.736318407960199,
    "correct": 148,
    "total": 201,
    "no_answer": 0
  },
  "error_log": {
    "accuracy": 0.736318407960199,
    "errors": [
      {
        "index": 9,
        "prediction": "top_up_limits",
        "ground_truth": "automatic_top_up"
      },
      {
        "index": 2,
        "prediction": "declined_card_payment",
        "ground_truth": "declined_transfer"
      },
      {
        "index": 7,
        "prediction": "pending_transfer",
        "ground_truth": "balance_not_updated_after_cheque_or_cash_deposit"
      },
      ...
    ]
  }
}
The results, admittedly, are not very impressive. In fact, accuracy slightly dropped after optimisation, from 75.1% to 73.6%. But even negative results can teach us something valuable.
There are a few likely reasons why ACE didn’t provide much benefit in this case:
- Limited data per category. We only had 248 training examples, 201 test examples, and 51 validation examples. However, our task involved 77 different categories. With so few examples per class, the model simply may not have had enough data to learn meaningful distinctions.
- Small and unrepresentative validation set. With only 51 examples, the validation set might not have captured the full diversity of customer queries, making it difficult for ACE to generate useful reflections and improvements.
- Task complexity. Our use case is relatively straightforward. As the authors note, ACE tends to shine in scenarios with large amounts of highly specialised domain knowledge or more complex agentic workflows, where reflection and iterative context refinement can significantly improve performance.
Using ACE for code generation
Despite the underwhelming results of the previous experiment, I decided to give ACE another try, this time on the Mostly Basic Python Problems (MBPP) dataset (available under the CC BY 4.0 license), hoping that the results would be more promising for a code generation task.
Data overview
Each example in the dataset contains three key components:
- Question, for example, “Write a function to reverse words in a given string.”
- Ground truth implementation — Python reference code. For example, for the question above
def reverse_words(s):
    return ' '.join(reversed(s.split()))
- Test cases are assertions to validate the generated code, such as
[
    'assert reverse_words("python program")==("program python")',
    'assert reverse_words("java language")==("language java")',
    'assert reverse_words("indian man")==("man indian")'
]
Adding a new task to the ACE framework
We can follow similar steps to extend the ACE framework to handle coding tasks. I won’t go into all the implementation details here, since you can find the full code on GitHub. However, it’s worth highlighting the key differences compared to the banking intent example.
Coding tasks are inherently more complex. In the banking intent case, the model outputs a single class out of 77, which is easy to compare directly with the ground truth. In code generation, however, the LLM can produce arbitrary code, so we cannot simply check for exact matches. Instead, we need to run tests to determine whether the generated solution is correct.
# banking
def answer_is_correct(self, predicted: str, ground_truth: str) -> bool:
    return predicted.lower() == ground_truth.lower()

# coding
def answer_is_correct(self, predicted: str, ground_truth: str,
                      test_list: List[str], idx: int, save_dir: str) -> bool:
    code = extract_code_from_response(predicted)
    result = execute_code_with_tests(code, test_list, timeout=5)
    return result['success']
Because of this added complexity, I had to implement several enhancements in the DataProcessor for code generation:
- Code extraction. LLMs often wrap their answers in extra context, such as Markdown formatting (```python ... ```), so we need to clean and extract the code before it can run correctly.
- Safe execution. Since we run the generated code to verify correctness, it’s important to implement basic safety measures, such as timeouts and isolated execution environments. (A simplified sketch of both helpers follows the question snippet below.)
- Providing full context. It’s crucial to include all necessary information in the question. If we just ask the LLM to generate code, it’s unlikely to pass the tests because it won’t be clear what function name or signature is expected. That’s why we provide all the necessary details in the question when standardising the data in the process_task_data function.
question = (
    f"Write a Python function to solve the following problem:\n\n"
    f"Problem: {problem_text}\n\n"
    f"Your code must pass the following test cases:\n"
    f"{test_cases_formatted}\n\n"
    f"Important: The test cases will be executed against your code. "
    f"Make sure your function name and signature match what the tests expect.\n\n"
    f"Respond with ONLY the Python code, no explanations."
)
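For illustration, here is a simplified sketch of what the two helpers, extract_code_from_response and execute_code_with_tests, might look like. The actual implementation in the repository is more elaborate (for example, it reports per-test results), so treat this as a minimal approximation: it strips Markdown fences with a regex and runs the code plus assertions in a separate process with a timeout.

import os
import re
import subprocess
import sys
import tempfile
from typing import Dict, List


def extract_code_from_response(response: str) -> str:
    """Strip Markdown code fences (```python ... ```) and return the raw code."""
    match = re.search(r"```(?:python)?\s*(.*?)```", response, re.DOTALL)
    return match.group(1).strip() if match else response.strip()


def execute_code_with_tests(code: str, test_list: List[str], timeout: int = 5) -> Dict:
    """Run the generated code together with the test assertions in a subprocess."""
    script = code + "\n\n" + "\n".join(test_list)
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(script)
        script_path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, script_path],
            capture_output=True, text=True, timeout=timeout
        )
        success = proc.returncode == 0
        return {
            "success": success,
            # simplification: all-or-nothing, no per-test breakdown
            "passed": len(test_list) if success else 0,
            "total": len(test_list),
            "timeout": False,
            "errors": [] if success else [proc.stderr.strip()],
        }
    except subprocess.TimeoutExpired:
        return {"success": False, "passed": 0, "total": len(test_list),
                "timeout": True, "errors": ["Execution exceeded the time limit"]}
    finally:
        os.remove(script_path)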
In the original ACE implementation, the Reflector compared generated code directly with the ground truth, which works for classification tasks. For coding, however, this approach doesn’t make sense: multiple correct solutions can exist, and optimising for code that “looks similar” to the reference doesn’t guarantee it will pass the tests.
To address this, I implemented a new method, get_test_feedback, which provides the Reflector with actual test execution results and error messages. The test output becomes the primary signal for correctness, giving much more informative feedback than simple code comparison.
def get_test_feedback(self, predicted: str, ground_truth: str, test_list: List[str] = None) -> str:
    """
    Get detailed test execution feedback for the reflector.

    This method provides the reflector with actual test results and error messages,
    which is more informative than just comparing generated code with ground truth.
    The test output is the primary signal for correctness in code generation tasks.

    Args:
        predicted: Model's predicted code
        ground_truth: Ground truth code (reference only, not used for evaluation)
        test_list: List of test assertions to run

    Returns:
        str: Detailed feedback string with test execution results
    """
    if test_list is None:
        return "No test cases provided - cannot evaluate code."

    # Extract code from response if needed
    code = extract_code_from_response(predicted)

    # Execute code with tests
    result = execute_code_with_tests(code, test_list, timeout=self.timeout)

    # Build detailed feedback
    feedback_parts = []

    if result['success']:
        feedback_parts.append(f"✓ All {result['total']} tests PASSED")
        feedback_parts.append("\nTest cases executed successfully:")
        for i, test in enumerate(test_list, 1):
            feedback_parts.append(f"  {i}. {test} ✓")
    else:
        feedback_parts.append(f"✗ Tests FAILED: {result['passed']}/{result['total']} tests passed")

        if result['timeout']:
            feedback_parts.append("\n⏱ TIMEOUT: Code execution exceeded time limit")

        if result['errors']:
            feedback_parts.append("\n--- ERROR DETAILS ---")
            for error in result['errors']:
                feedback_parts.append(f"  • {error}")

        # Show which tests passed vs failed
        feedback_parts.append("\n--- TEST RESULTS ---")
        for i, test in enumerate(test_list, 1):
            # Check if this specific test appears in errors
            test_failed = any(f"Test {i}" in err for err in result.get('errors', []))
            status = "✗ FAILED" if test_failed else "✓ passed"
            feedback_parts.append(f"  {i}. {test} - {status}")

    # Add extracted code for reference
    feedback_parts.append("\n--- EXTRACTED CODE ---")
    feedback_parts.append(code)

    return "\n".join(feedback_parts)
Alongside this new method, I created a dedicated Reflector prompt tailored for code generation. Its focus is on test results, not line-by-line code comparison.
You are an expert code reviewer and educator. Your job is to analyze why generated code passed or failed test cases, and identify patterns that lead to correct or incorrect solutions.
**IMPORTANT: Test execution results are the PRIMARY signal for correctness.**
- The code is correct if and only if ALL tests pass
- Do NOT compare implementations line-by-line with the reference - different implementations can be equally correct
- Focus on understanding WHY tests passed or failed based on the code's logic
**Instructions:**
- First, examine the Test Execution Results to determine if the code is correct
- If tests FAILED: Analyze what caused the failure (syntax errors, logic errors, edge cases, wrong algorithm)
- If tests PASSED: Identify what the model did well that led to success
- The "Possible Implementation" is just ONE way to solve the problem - the model's approach may be different but equally valid
- Provide actionable insights for improving code generation in the future
- Tag bulletpoints as helpful/harmful/neutral based on whether they contributed to passing tests
Your output should be a json object, which contains the following fields:
- reasoning: analyze the test results and the code's logic, explain why tests passed/failed
- error_identification: if tests failed, what specific issue caused the failure? If tests passed, state "No errors - all tests passed"
- root_cause_analysis: what underlying concept or pattern led to success or failure?
- correct_approach: what coding strategy or pattern should be used for similar problems?
- key_insight: what principle should be remembered for future code generation tasks?
- bullet_tags: a list of json objects with bullet_id and tag for each bulletpoint
**Question:**
{}
**Model's Reasoning Trace:**
{}
**Model's Generated Code:**
{}
**Possible Implementation (Reference Only - NOT the only correct solution):**
{}
**Test Execution Results (PRIMARY SIGNAL):**
{}
**Part of Playbook that's used by the generator to answer the question:**
{}
**Answer in this exact JSON format:**
{{
"reasoning": "[Analyze test results and code logic - why did tests pass or fail?]",
"error_identification": "[What caused test failures? Or 'No errors - all tests passed']",
"root_cause_analysis": "[What concept/pattern led to success or failure?]",
"correct_approach": "[What coding strategy works for this type of problem?]",
"key_insight": "[What principle should be remembered for future code generation?]",
"bullet_tags": [
{{"id": "code-00001", "tag": "helpful"}},
{{"id": "code-00002", "tag": "harmful"}}
]
}}
This coding-specific Reflector is automatically used whenever the task name contains "coding".
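The dispatch itself is tiny; conceptually it boils down to something like the following sketch (illustrative only, with hypothetical constant names):

# Illustrative only: pick the coding-specific Reflector prompt when the task
# name mentions coding, otherwise fall back to the default prompt.
reflector_prompt = (
    CODING_REFLECTOR_PROMPT if "coding" in task_name.lower()
    else DEFAULT_REFLECTOR_PROMPT
)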
Results
Finally, I ran the prompt optimisation process on a dataset of 500 samples, split into train, test, and validation sets. This time, the results are much more promising: accuracy improved significantly from 71.1% to 87.1%. In this case, ACE clearly helped optimise the prompts and guide the model toward correct solutions.
Looking at the best playbook, it’s quite extensive. Many of the most helpful patterns are general principles, such as:
- Write the simplest correct, Pythonic solution first,
- Treat test cases as the true specification,
- Verify correctness before any further optimisation.
At the same time, the playbook also includes very specific guidance, for example, detailed instructions for tasks like GCD calculations.
Overall, this shows that ACE can effectively capture both high-level strategies and task-specific tips.
## GENERAL
## COMMON MISTAKES TO AVOID
[err-00003] helpful=5 harmful=0 :: Don't add unnecessary complexity to recursive algorithms. For example, in GCD implementations, explicit min/max logic or special cases for checking if a value equals 1 are redundant when using the standard Euclidean algorithm.
[err-00007] helpful=0 harmful=0 :: Don't assume problem constraints match your algorithm's mathematical prerequisites. For example, Fermat's Little Theorem for modular inverse requires a PRIME modulus - verify the problem guarantees this before using pow(a, p-2, p). If constraints aren't specified, choose more general algorithms.
## OTHERS
## CODE GENERATION PRINCIPLES
[cgp-00002] helpful=41 harmful=2 :: Prefer minimal, mathematically sound implementations over complex ones. Avoid adding unnecessary preprocessing logic (like min/max) or special case checks when the core algorithm naturally handles all scenarios.
[cgp-00012] helpful=91 harmful=2 :: Always ensure generated code is syntactically complete before finalizing output. Verify all opened brackets, braces, and parentheses are properly closed, and all statements are fully formed. Incomplete code generation (truncation mid-statement) causes syntax errors that prevent execution regardless of algorithmic correctness.
[cgp-00020] helpful=6 harmful=0 :: When a problem explicitly requires using lambda functions, integrate them naturally with Python's functional programming tools (map, filter, reduce, sorted with key parameter). Don't force lambda usage where it's awkward - these built-in functions are designed to work seamlessly with lambdas for operations like filtering, transformation, and counting.
[cgp-00024] helpful=140 harmful=2 :: Prioritize readable, Pythonic solutions using built-in functions over performance-optimized complex algorithms unless the problem explicitly requires optimization or involves large-scale data. A clear solution using bin(), str methods, or list comprehensions is often preferable to bit manipulation or manual loops. Optimize only when necessary.
[cgp-00047] helpful=56 harmful=2 :: Follow a correctness-first development strategy: (1) implement the straightforward algorithm that correctly solves the problem, even if it's not optimally efficient, (2) verify correctness with test cases, (3) only then consider optimization if performance is inadequate or the problem explicitly requires it. A correct O(n) solution is infinitely better than a buggy O(log n) attempt. Premature optimization often introduces errors in logic, especially for mathematical or algorithmic problems.
[cgp-00050] helpful=0 harmful=0 :: When multiple algorithmically correct solutions exist, prefer the one with better time/space complexity. A correct O(1) formula-based solution is superior to a correct O(n) iterative solution. However, only optimize if you can maintain correctness - a working O(n) solution is infinitely better than a buggy O(1) attempt. Verify the more efficient approach passes all tests before committing to it.
[cgp-00053] helpful=0 harmful=0 :: When implementing mathematical optimizations (especially for pair/combination counting), verify the optimized approach against test cases through manual calculation BEFORE coding. For each test case: (1) apply your mathematical insight to predict the output, (2) confirm it matches expected output, (3) only then implement. This catches errors in mathematical reasoning early, preventing bugs that are harder to debug in code than in arithmetic.
[cgp-00057] helpful=0 harmful=0 :: Avoid shadowing Python built-in names (dict, list, str, int, set, tuple, etc.) when naming variables or parameters. Use descriptive alternatives instead: 'd' or 'data' instead of 'dict', 'lst' or 'items' instead of 'list', 's' or 'text' instead of 'str'. Shadowing built-ins makes them inaccessible in that scope and reduces code clarity, even though it's syntactically valid.
[cgp-00059] helpful=2 harmful=0 :: Include defensive programming practices (input validation, bounds checking, type checking) even when not explicitly tested by visible test cases. For string indexing, validate index bounds before access. For numeric conversions, verify the input is a valid digit. For list operations, check for empty collections. These safeguards increase code robustness and prevent runtime errors on edge cases that may exist in hidden tests, demonstrating production-quality coding practices.
[cgp-00074] helpful=0 harmful=0 :: For operations involving powers of 2, prefer bitwise shift operators over arithmetic operations for clarity and efficiency: use left shift (1 << k) instead of 2**k for computing powers of 2, and right shift (n >> k) instead of n // (2**k) for dividing by powers of 2. Bitwise operators make the bit-level intent explicit and are the idiomatic approach in bit manipulation contexts. This is especially valuable when working with bit positions and their corresponding values.
[cgp-00081] helpful=0 harmful=0 :: Before using standard library mathematical constants (math.pi, math.e, etc.), validate that test cases expect full-precision values by calculating one test output and comparing to expected. If expected outputs suggest truncated/simplified constants (pi=3.14, pi=3.1415, e=2.718), use hardcoded values matching test precision instead of library constants. Pattern: (1) identify mathematical constant needed, (2) calculate test output with standard constant, (3) if mismatch exists, derive the constant value that produces exact expected outputs, (4) use hardcoded value. Test case expectations override mathematical purity.
## COMMON PYTHON PATTERNS
[cpp-00010] helpful=23 harmful=0 :: For finding elements with maximum/minimum properties based on a criterion, use built-in max()/min() functions with the key parameter. Example: max(list_of_lists, key=len) finds the longest list. This is more Pythonic and readable than manual iteration with comparisons.
[cpp-00013] helpful=17 harmful=0 :: For counting or searching operations in Python collections (tuples, lists, strings), prioritize built-in methods: use .count() for occurrence counting, .index() for finding positions, .find() for strings. These are more reliable, efficient, and Pythonic than manual iteration with counters or loops.
[cpp-00014] helpful=3 harmful=0 :: When working with mixed-type data structures, use isinstance() for type checking to distinguish between different element types. Combine with len() checks to validate structure. Example: isinstance(item, list) and len(item) == 2 reliably identifies 2-element lists in mixed collections.
[cpp-00015] helpful=3 harmful=0 :: Use extend() instead of append() when adding multiple elements from a sequence to a list. extend() adds elements individually to the target list, while append() would add the entire sequence as a single nested element. Example: result.extend([value] * count) vs result.append([value] * count).
[cpp-00016] helpful=2 harmful=0 :: Use list multiplication ([value] * count) to efficiently repeat elements. This is more Pythonic and readable than manual loops for creating repeated elements. Combine with extend() for adding repeated elements to existing lists.
[cpp-00019] helpful=2 harmful=0 :: For counting elements matching a condition with lambda functions, use sum(map(lambda x: 1 if condition else 0, iterable)) as an elegant alternative to len(list(filter(lambda x: condition, iterable))). The sum(map()) approach maps elements to 1/0 and sums them, often more readable and efficient than filtering then counting.
[cpp-00026] helpful=14 harmful=0 :: For converting sequences (tuples, lists) of characters/strings into a single string, use str.join() method: ''.join(sequence) for character concatenation, or 'separator'.join(sequence) for joining with delimiters. This is the idiomatic Python approach - more readable and performant than manual loops with += or accumulation patterns.
[cpp-00030] helpful=1 harmful=0 :: For character classification with regex, use re.findall() with mutually exclusive character class patterns. For 'everything else' categories (like special characters), prefer negation patterns [^...] over enumerating specific characters - e.g., [^A-Za-z0-9] captures all non-alphanumeric characters comprehensively, avoiding the brittleness of lists like [,.!?]. Ensure patterns don't overlap to prevent double-counting.
[cpp-00031] helpful=2 harmful=0 :: For finding global maximum/minimum across nested iterables (list of tuples, list of lists, etc.), use nested generator expressions with built-in max()/min(): `max(element for container in containers for element in container)`. This pattern naturally flattens one level of nesting without creating intermediate lists, making it ideal for finding extremes across tuple records or sublists. More efficient and readable than manual iteration.
[cpp-00033] helpful=2 harmful=0 :: For index-based access to dictionary keys, use the pattern list(dict)[index] or list(dict.keys())[index]. This relies on Python 3.7+ guarantees that dictionaries maintain insertion order. Converting the dictionary to a list extracts keys in order, allowing standard list indexing. This is the idiomatic Python solution for mapping numeric indices to dictionary keys.
[cpp-00036] helpful=27 harmful=2 :: For mathematical operations (GCD, LCM, factorial, prime checking, trigonometry), check Python's math module FIRST before implementing algorithms manually. Built-in functions like math.gcd(), math.factorial(), math.isqrt() are well-tested, optimized, and reduce implementation errors. Pattern: (1) Understand the mathematical definition, (2) Check if math module provides the operation, (3) Use it directly or wrap it with problem-specific logic (e.g., is_coprime = math.gcd(a,b) == 1).
[cpp-00038] helpful=0 harmful=0 :: For checking if a number is a perfect square, use math.isqrt() instead of math.sqrt() to avoid floating-point precision errors. Pattern: b = math.isqrt(n); is_perfect_square = (b * b == n). The isqrt() function returns the integer square root, and squaring it back allows exact integer comparison without floating-point rounding issues.
[cpp-00043] helpful=0 harmful=0 :: For character filtering problems (removing/keeping characters based on membership criteria), use the set+comprehension+join pattern: (1) Convert filter criteria into a set for O(1) lookup (char_set = set(filter_string)), (2) Use list comprehension or generator expression to filter (char for char in source if char not in char_set), (3) Use ''.join() to reconstruct the string. This pattern is more Pythonic, readable, and maintainable than manual index manipulation or character counting approaches, while being equally correct and efficient.
[cpp-00049] helpful=0 harmful=0 :: When returning tuples or lists with mixed numeric types (integers and floats), use appropriate division operators for each component: integer division (//) for whole number results, regular division (/) for decimal results. Example: for sum and average, return (n*(n+1)//2, n*(n+1)/2/n) to ensure sum is int and average is float. This prevents type mismatches in test assertions.
[cpp-00054] helpful=0 harmful=0 :: For digit-by-digit comparison or manipulation problems (digit distance, digit sum differences, etc.): Use the string conversion pattern: (1) Convert integers to strings with str(), (2) Use zfill(max_length) to pad shorter numbers with leading zeros for equal length, (3) Use zip() to pair corresponding digit positions, (4) Apply operations on paired digits and aggregate results. Example: str(num1).zfill(length) and str(num2).zfill(length) then zip() for pairing. This handles different-length numbers elegantly and provides clean positional access to digits.
[cpp-00056] helpful=5 harmful=0 :: For checking if all/any elements in a collection satisfy a condition, use Python's built-in all() or any() functions with generator expressions. Pattern: all(condition for item in iterable) for universal quantification (all must satisfy), any(condition for item in iterable) for existential quantification (at least one satisfies). This is more Pythonic, readable, and efficient than manual loops with flags. Common use cases: all(v == target for v in dict.values()) for value uniformity, any(x > threshold for x in list) for threshold checking, all(isinstance(x, int) for x in collection) for type validation.
[cpp-00060] helpful=0 harmful=0 :: For whitespace normalization (collapsing multiple spaces/whitespace into single spaces), use the split-join pattern: ' '.join(s.split()). The key insight: str.split() without arguments has special behavior - it splits on ANY whitespace (spaces, tabs, newlines) AND automatically removes empty strings from the result, naturally collapsing consecutive whitespace. Combined with ' '.join(), this creates a clean solution without regex imports. This pattern is more Pythonic and maintainable than regex alternatives like re.sub(r' +', ' ', s) for simple whitespace normalization tasks.
[cpp-00062] helpful=0 harmful=0 :: For complex number operations (polar/rectangular conversion, phase calculation, magnitude), use Python's cmath module functions as the first choice: cmath.polar(z) for conversion to polar form (returns magnitude and angle), cmath.rect(r, phi) for polar to rectangular, cmath.phase(z) for angle extraction. These built-in functions handle edge cases correctly (e.g., treating real numbers as complex with imaginary part 0) and are more reliable than manual trigonometric calculations. Pattern: import cmath → use appropriate function → handle the return type (often tuples).
[cpp-00064] helpful=0 harmful=0 :: For grouping elements by a key while preserving insertion order (critical for tie-breaking in subsequent sorting), use collections.OrderedDict with setdefault pattern: from collections import OrderedDict; grouped = OrderedDict(); for item in items: grouped.setdefault(key, []).append(value). While Python 3.7+ dicts maintain insertion order, OrderedDict makes the intent explicit and is safer when order matters for downstream operations like sorting by aggregated properties where equal values should maintain original encounter order.
[cpp-00065] helpful=0 harmful=0 :: For creating tuples with variable-length unpacked elements, use the * unpacking operator: (first, *middle_elements, last) unpacks a list/tuple into individual tuple positions. Example: (key, *values, count) where values is a list creates a tuple with key, all values unpacked as separate elements, and count at the end. This is essential when output format requires flattening nested structures into single-level tuples with variable element counts.
[cpp-00069] helpful=0 harmful=0 :: For regex pattern matching problems requiring full string matches, choose between re.search(), re.match(), and re.fullmatch() based on matching scope: re.match() matches from the start, re.search() finds patterns anywhere, re.fullmatch() requires the entire string to match. When full string matching is needed, either use re.fullmatch() with the pattern directly, or use re.search()/re.match() with explicit anchors (^ for start, $ for end). Example: re.fullmatch('a.*b', s) is equivalent to re.search('^a.*b$', s). Both approaches are valid - fullmatch() makes the intent explicit, while search() with anchors provides more flexibility. Always analyze test cases to determine if partial or full string matching is required.
[cpp-00072] helpful=1 harmful=0 :: For counting elements in an iterable that match a condition, use the generator expression pattern with sum(): sum(1 for x in iterable if condition). This provides optimal balance of readability, memory efficiency, and Pythonic style compared to alternatives like len([x for x in iterable if condition]) which creates an intermediate list. For character-level string operations, prefer built-in string methods (isdigit(), isalpha(), isalnum(), isupper(), islower()) over manual ASCII range comparisons - they handle edge cases correctly, improve clarity, and are more maintainable.
[cpp-00073] helpful=0 harmful=0 :: For bit manipulation problems (finding set bits, MSB/LSB positions, bit counting), check Python's integer bit methods FIRST before implementing manual algorithms: bit_length() returns the number of bits needed to represent the integer (useful for MSB position), bit_count() counts set bits (Python 3.10+), as_integer_ratio() for rational representation. These built-in methods are optimized, handle edge cases (including 0), and often eliminate the need for manual bit-by-bit iteration. Pattern: understand what bit property you need, check if a built-in method provides it directly.
[cpp-00076] helpful=0 harmful=0 :: For grouping consecutive identical elements in a sequence, use itertools.groupby() as the canonical Python solution. Pattern: [list(group) for key, group in itertools.groupby(sequence)]. The groupby function returns (key, group_iterator) tuples where key is the element value and group is an iterator of consecutive occurrences. Convert each group iterator to a list to materialize results. Critical distinction: groupby groups CONSECUTIVE identical elements only - non-consecutive duplicates form separate groups, making it ideal for run-length encoding and consecutive duplicate detection without manual index tracking.
## HANDLING EDGE CASES
[hec-00021] helpful=2 harmful=0 :: When using mathematical operations like modulo (%), division, or exponentiation, verify the solution handles negative numbers correctly. For example, modulo operator works correctly for both positive and negative integers in Python (e.g., -18 % 2 == 0 for even number checking), but behavior may differ from expectations in other languages.
## ALGORITHM DESIGN
[ad-00001] helpful=1 harmful=2 :: For recursive GCD problems, use the Euclidean algorithm: base case is b == 0 (return a), recursive case is gcd(b, a % b). This handles all edge cases naturally including argument ordering, equal numbers, and divisibility.
[ad-00006] helpful=0 harmful=0 :: For bidirectional character swap problems (A↔B) using regex: use re.sub() with a callback function in a single pass. Pattern: (1) Create a character class matching all swap targets (e.g., r'[ _]'), (2) Implement callback that examines each match and returns its counterpart. This avoids ambiguity from sequential replacements where new characters become indistinguishable from originals.
[ad-00008] helpful=0 harmful=0 :: For modular arithmetic problems (nCr mod p, etc.), check if p must be prime. If p can be composite, avoid algorithms requiring modular inverse (like Fermat's Little Theorem). Instead, use approaches that avoid division entirely, such as Pascal's triangle with DP: C[j] = (C[j] + C[j-1]) % p, which works for ANY modulus.
[ad-00009] helpful=0 harmful=0 :: When division is needed in modular arithmetic: (1) If modulus is guaranteed prime, use Fermat's Little Theorem: a/b mod p = a * b^(p-2) mod p. (2) If modulus may be composite, use Extended Euclidean Algorithm for modular inverse, or better yet, redesign to avoid division (e.g., use recurrence relations like Pascal's triangle).
[ad-00017] helpful=1 harmful=0 :: For decoding problems with mixed encoded/non-encoded elements: (1) use type checking to distinguish element types, (2) validate encoded element structure, (3) handle each type appropriately in a single pass. Prioritize simple iterative approaches with explicit conditionals over complex comprehensions for better readability and maintainability.
[ad-00018] helpful=4 harmful=0 :: For maximum sum problems with non-adjacent element constraints: Use dynamic programming with recurrence dp[i] = max(arr[i] + dp[i-2], dp[i-1]), representing the choice to include current element (add to best from i-2) or exclude it (keep best from i-1). Handle edge cases: empty array returns 0, single element returns that element, initialize dp[0] = arr[0] and dp[1] = max(arr[0], arr[1]). Time: O(n), Space: O(n) or O(1) with optimization.
[ad-00023] helpful=0 harmful=0 :: For bit counting and parity checking problems: Multiple valid approaches exist with different trade-offs. (1) Pythonic approach: bin(n).count('1') - most readable and maintainable, (2) Bit manipulation: repeatedly use x & (x-1) to clear lowest set bit - better performance for large inputs, (3) XOR reduction for parity. Choose the Pythonic approach by default unless performance profiling shows it's a bottleneck.
[ad-00028] helpful=1 harmful=1 :: For bit toggling problems: (1) Create a mask with 1s at positions to be toggled, (2) Use XOR operation (n ^ mask) to toggle those bits. For variable-length numbers, use bit_length() to determine how many bits to process. Example: to toggle bits at positions 1,3,5 up to bit_length, generate mask = sum(1 << i for i in positions).
[ad-…] :: For selecting the k smallest and k largest elements from a collection: (1) Index-based selection over the sorted list, taking positions i < k or i >= len-k, ensuring each element is selected once, or (2) Set union: combine subsets with set(min_k + max_k) then sort to eliminate duplicates. Always consider edge cases where k*2 >= collection_size, as this guarantees overlap between minimum and maximum selections. Avoid simple list concatenation which creates duplicates when ranges overlap.
[ad-00045] helpful=0 harmful=0 :: For 'find the n-th number with property X' problems: Use the iterative counting pattern: (1) implement a helper function to check if a number satisfies the property, (2) iterate through candidate numbers starting from an appropriate initial value, (3) maintain a counter for numbers that satisfy the property, (4) return the candidate when counter reaches n. This pattern works for prime numbers, perfect squares, numbers with specific factorization properties, etc. It's straightforward to implement correctly and optimize later if needed.
[ad-00046] helpful=3 harmful=0 :: For counting distinct prime factors: Use the standard factorization pattern: (1) iterate potential divisors from 2 to sqrt(n), (2) for each divisor that divides n, increment the distinct factor count, then divide n by that divisor repeatedly until it no longer divides (this ensures each prime is counted once regardless of its power), (3) after the loop, if n > 1, it's a remaining prime factor (count it), (4) optimize by checking divisor 2 separately, then only odd numbers. This correctly distinguishes between distinct primes and their multiplicities.
[ad-00048] helpful=1 harmful=0 :: For mathematical sequence problems (sum of first n numbers, arithmetic/geometric series, factorial-related), check if a closed-form formula exists before implementing iterative solutions. Common formulas: sum(1..n) = n*(n+1)/2, sum of arithmetic series = n*(first+last)/2, sum of geometric series = a*(r^n - 1)/(r-1). Formula-based solutions provide O(1) time complexity vs O(n) for loops, are less error-prone, and demonstrate mathematical insight. Always verify formula correctness with test cases.
[ad-00051] helpful=1 harmful=0 :: For pair-counting problems (count pairs satisfying a condition), look for mathematical properties that eliminate the need for explicit enumeration. Pattern: (1) Identify what makes a pair valid, (2) Find mathematical properties characterizing valid pairs (e.g., for XOR being odd: one number must be even, other odd), (3) Transform into a counting problem (count elements in each category), (4) Use combinatorics to compute result (e.g., odd_count × even_count). This reduces O(n²) pair enumeration to O(n) categorization + O(1) calculation.
[ad-00052] helpful=0 harmful=0 :: For problems involving XOR operations, leverage bit-level properties for optimization: (1) XOR result is odd ⟺ operands have different parities (one even, one odd), because parity depends on the least significant bit, (2) XOR is commutative and associative, allowing reordering, (3) x ^ x = 0 and x ^ 0 = x, useful for cancellation patterns. Analyze the specific XOR property relevant to your problem to find mathematical shortcuts that avoid brute force computation.
[ad-00061] helpful=0 harmful=0 :: For iterative mathematical sequence problems (sum/product of first n terms with specific properties): Use a structured 3-step approach: (1) Identify the formula for generating the k-th element (e.g., 2k-1 for odd numbers, 2k for even numbers, k² for squares), (2) Determine the operation to apply to each element (exponentiation, multiplication, transformation), (3) Aggregate with appropriate function (sum, product, max). Implement using generator expressions with built-ins: sum(operation(formula(i)) for i in range(start, n+1)). Ensure range bounds match the sequence indexing (1-indexed sequences need range(1, n+1)). This pattern provides clarity and correctness for problems where closed-form formulas don't exist or aren't obvious.
[ad-00066] helpful=0 harmful=0 :: For problems requiring grouping, counting, and sorting by aggregated properties: (1) Group elements using dict/OrderedDict with setdefault() or defaultdict, choosing OrderedDict when insertion order affects tie-breaking in sorting, (2) Sort groups using sorted() with key function based on aggregated metric (e.g., key=lambda x: len(x[1]) for count), (3) Transform output to match required format using appropriate unpacking/restructuring. This pattern handles 'group by X, sort by count of Y' problems systematically.
[ad-00068] helpful=0 harmful=0 :: For heap-based 'top k' problems, verify OUTPUT ORDERING against test cases, not just which elements to return. Key distinction: (1) heappop() from a min-heap produces ASCENDING order by the heap key, (2) heapq.nlargest(k, items, key=func) produces DESCENDING order by key, (3) heapq.nsmallest(k, items, key=func) produces ASCENDING order by key. When implementing heap solutions, trace through test cases to determine if results should be ordered ascending or descending by frequency/priority. If ordering is wrong, either reverse the final list or switch between nlargest/nsmallest, or use the heappop pattern. Test case output ordering is authoritative when the problem description doesn't explicitly specify.
[ad-00070] helpful=0 harmful=0 :: For 2D grid problems with adjacency or selection constraints (can't pick adjacent cells/rows/columns): Look for opportunities to reduce dimensionality before applying DP. If constraints allow picking at most one element per column (or row), pre-compute the optimal choice for each column/row (e.g., max of two rows in a column), transforming the problem into a 1D array. Then apply standard 1D DP patterns (like 'house robber' for non-adjacency). This dimensional reduction simplifies state space and makes complex grid problems tractable using well-known DP templates.
[ad-00071] helpful=0 harmful=0 :: Recognize the 'house robber' DP pattern as a fundamental template applicable beyond linear arrays: any problem involving selecting non-adjacent elements to maximize/minimize a sum can use the recurrence dp[i] = max(value[i] + dp[i-2], dp[i-1]). This pattern appears in: linear arrays with spacing constraints, grid problems (after dimensional reduction), tree problems (with parent-child constraints), and sequence optimization. When you see 'maximize sum' + 'can't pick adjacent', immediately consider this template.
[ad-00075] helpful=0 harmful=0 :: For finding the most significant bit (MSB) value or position: Use the bit_length() method, which returns the number of bits required to represent an integer. For the MSB value, use the pattern 1 << (n.bit_length() - 1); for the MSB position (0-indexed), use n.bit_length() - 1.
These results show that ACE can significantly improve performance on complex tasks like code generation.
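To make the playbook’s advice concrete, here is a short Python sketch of my own (it is not part of the generated playbook or the authors’ code) that applies three of the bullets above: the exact perfect-square check from cpp-00038, the whitespace normalisation from cpp-00060, and the non-adjacent-sum dynamic programming recurrence from ad-00018.

```python
import math

# cpp-00038: exact perfect-square check with math.isqrt (no floating-point rounding)
def is_perfect_square(n: int) -> bool:
    if n < 0:
        return False
    root = math.isqrt(n)
    return root * root == n

# cpp-00060: collapse any run of whitespace into single spaces via split()/join()
def normalize_whitespace(s: str) -> str:
    return ' '.join(s.split())

# ad-00018: maximum sum of non-adjacent elements ("house robber" recurrence),
# O(1)-space variant that keeps only the best sums for positions i-1 and i-2.
# Assumes non-negative values, as in the typical problem statement.
def max_non_adjacent_sum(arr: list[int]) -> int:
    prev2, prev1 = 0, 0
    for x in arr:
        prev2, prev1 = prev1, max(x + prev2, prev1)
    return prev1

print(is_perfect_square(49), is_perfect_square(50))   # True False
print(normalize_whitespace("  hello   \t world \n"))  # "hello world"
print(max_non_adjacent_sum([3, 2, 7, 10]))            # 13 (3 + 10)
```

None of these idioms is exotic, and that is the point: the playbook nudges the model towards small, reliable patterns instead of letting it reinvent them on every task.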
Summary
In this article, we’ve covered a lot of ground on context engineering and the ACE approach, so let’s briefly recap the key takeaways:
- Context engineering has emerged as a critical field because it allows us to improve LLM performance without lengthy and costly fine-tuning.
- ACE (Agentic Context Engineering) is one of the latest approaches to prompt optimisation, leveraging detailed playbooks built from atomised bullet points that combine instructions with metadata (a minimal sketch of this bullet format follows this list).
- As our examples showed, prompt optimisation is not a silver bullet. It doesn’t improve performance in every case. According to the authors, ACE is most effective for agentic workflows or highly specialised domains. In our experiments, it made a clear difference in code generation, but had limited impact on banking intent classification.
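For reference, the bullets in the playbook above follow a simple, regular structure. Here is a minimal, hypothetical Python sketch of that format (my own illustration, not the authors’ implementation; the class and field names are made up), with helpful/harmful counters that a reflection step could update based on feedback:

```python
from dataclasses import dataclass

@dataclass
class PlaybookBullet:
    """One atomised playbook entry, e.g. '[cpp-00038] helpful=0 harmful=0 :: ...'."""
    bullet_id: str      # e.g. "cpp-00038"
    content: str        # the instruction itself
    helpful: int = 0    # times the bullet was judged useful
    harmful: int = 0    # times the bullet was judged misleading

    def render(self) -> str:
        # Reproduce the textual format used in the playbook shown earlier.
        return f"[{self.bullet_id}] helpful={self.helpful} harmful={self.harmful} :: {self.content}"

bullet = PlaybookBullet(
    bullet_id="cpp-00038",
    content="For checking if a number is a perfect square, use math.isqrt() ...",
)
bullet.helpful += 1  # a reflection step could bump counters based on task feedback
print(bullet.render())
```

Representing bullets this way makes it straightforward to render them back into the textual playbook format and to keep the metadata alongside each instruction.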
The main takeaway for me is that prompt optimisation won’t solve your task automatically. You still need a holistic understanding of what information the LLM and agents have during the optimisation process and how best to structure and refine it. Context matters, and thoughtful engineering of that context is what makes approaches like ACE effective.
Thank you for reading. I hope this article was insightful. Remember Einstein’s advice: “The important thing is not to stop questioning. Curiosity has its own reason for existing.” May your curiosity lead you to your next great insight.
Reference
This article is based on the paper “Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models” by Zhang et al. (2025).