You have probably heard the term Context Engineering by now. This article will cover the key ideas behind creating LLM applications using Context Engineering principles, visually explain these workflows, and share code snippets that apply these concepts practically.
Don’t worry about copy-pasting the code from this article into your editor. At the end of this article, I will share the GitHub link to the open-source code repository and a link to my 1-hour 20-minute YouTube course that explains the concepts presented here in greater detail.
Unless otherwise mentioned, all images used in this article are produced by the author and are free to use.
Let’s begin!
What is Context Engineering?
There is a significant gap between writing simple prompts and building production-ready applications. Context Engineering is an umbrella term that refers to the delicate art and science of fitting information into the context window of an LLM as it works on a task.
The exact scope of where the definition of Context Engineering begins and ends is debatable, but according to this tweet from Andrej Karpathy, we can identify the following key points:
- It is not just atomic prompt engineering, where you ask one question to the LLM and get a response
- It is a holistic approach that breaks up a larger problem into multiple subproblems
- These subproblems can be solved by multiple LLMs (or agents) in isolation. Each agent is provided with the appropriate context to carry out its task
- Each agent can be of appropriate capability and size depending on the complexity of the task.
- The context is not just the information we input – it also includes the intermediate tokens the LLM sees during generation (e.g., reasoning steps, tool results, etc.) as each agent works through the steps of its task
- The agents are connected with control flows, and we orchestrate exactly how information flows through our system
- The information available to the agents can come from multiple sources – external databases with Retrieval-Augmented Generation (RAG), tool calls (like web search), memory systems, or classic few-shot examples.
- Agents can take actions while generating responses. Each action the agent can take should be well-defined so the LLM can interact with it through reasoning and acting.
- Additionally, systems need to be evaluated with metrics and maintained with observability. Monitoring token usage, latency, and cost against output quality is a key consideration.
Important: How this article is structured
Throughout this article, I will be referring to the points above while providing examples of how they are applied in building real applications. Whenever I do so, I will use a block quote like this:
It is a holistic approach that breaks up a larger problem into multiple subproblems
When you see a quote in this format, the example that follows will apply the quoted concept programmatically.
But before that, we must ask ourselves one question…
Why not pass everything into the LLM?
Research has shown that cramming every piece of information into the context of an LLM is far from ideal. Even though many frontier models do claim to support “long-context” windows, they still suffer from issues like context poisoning or context rot.
(Source: Chroma)
Too much unnecessary information in an LLM’s context can pollute the model’s understanding, lead to hallucinations, and result in poor performance.
This is why simply having a large context window isn’t enough. We need systematic approaches to context engineering.
Why DSPy?
For this tutorial, I have chosen the DSPy framework. I will explain the reasoning for this choice shortly, but let me assure you that the concepts presented here apply to almost any prompting framework, including writing prompts in pure English.
DSPy is a declarative framework for building modular AI software. It neatly separates the two key aspects of any LLM task: (a) the input and output contracts passed into a module, and (b) the logic that governs how information flows.
Let’s see an example!
Imagine we want to use an LLM to write a joke. Specifically, we want it to generate a setup, a punchline, and the full delivery in a comedian’s voice.
Oh, and we also want the output in JSON format so that we can post-process individual fields of the dictionary after generation. For example, perhaps we want to print the punchline on a T-shirt (assume someone has already written a convenient function for that).
import json
import openai

system_prompt = """
You are a comedian who tells jokes, and you are always funny.
Generate the setup, punchline, and full delivery in the comedian's voice.
Output in the following JSON format:
{
    "setup": ...,
    "punchline": ...,
    "delivery": ...
}
Your response should be parsable without errors in Python using json.loads().
"""

client = openai.Client()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=1,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Write a joke about AI"},
    ],
)

joke = json.loads(response.choices[0].message.content)  # Hope for the best
print_on_a_tshirt(joke["punchline"])
Notice how we post-process the LLM’s response to extract the dictionary? What if something “bad” happens, like the LLM failing to generate the response in the desired format? Our entire code would fail, and there would be no printing on any T-shirts!
The above code is also quite difficult to extend. For example, if we wanted the LLM to do chain of thought reasoning before generating the answer, we would need to write additional logic to parse that reasoning text correctly.
Furthermore, it can be difficult to look at plain English prompts like these and understand what the inputs and outputs of these systems are. DSPy solves all of the above. Let’s write the above example using DSPy.
import dspy

class JokeGenerator(dspy.Signature):
    """You're a comedian who tells jokes. You're always funny."""
    query: str = dspy.InputField()
    setup: str = dspy.OutputField()
    punchline: str = dspy.OutputField()
    delivery: str = dspy.OutputField()

joke_gen = dspy.Predict(JokeGenerator)
joke_gen.set_lm(lm=dspy.LM("openai/gpt-4.1-mini", temperature=1))

result = joke_gen(query="Write a joke about AI")
print(result)
print_on_a_tshirt(result.punchline)
This approach gives you structured, predictable outputs that you can work with programmatically, eliminating the need for regex parsing or error-prone string manipulation.
DSPy Signatures make you explicitly define the inputs to the system (“query” in the above example) and the outputs (setup, punchline, and delivery), as well as their data types. The signature also tells the LLM the order in which you want the outputs to be generated.
dspy.Predict is an example of a DSPy Module. With modules, you define how the LLM converts inputs into outputs. dspy.Predict is the most basic one – you pass the query to it, as in joke_gen(query="Write a joke about AI"), and it creates a basic prompt to send to the LLM. Internally, DSPy just creates a prompt, as you can see below.
Once the LLM responds, DSPy creates Pydantic BaseModel objects that perform automatic schema validation and returns the output. If errors occur during this validation, DSPy automatically attempts to fix them by re-prompting the LLM, thereby significantly reducing the risk of a program crash.
Another common theme in context engineering is Chain of Thought. Here, we want the LLM to generate reasoning text before providing its final answer. This allows the LLM’s context to be populated with its self-generated reasoning before it generates the final output tokens.
To do that, you can simply replace dspy.Predict with dspy.ChainOfThought in the example above. The rest of the code remains the same. Now you can see that the LLM generates reasoning before the defined output fields.
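As a minimal sketch of that one-line swap (reusing the JokeGenerator signature defined above), the prediction now carries an extra reasoning field populated with the chain-of-thought text:

joke_gen = dspy.ChainOfThought(JokeGenerator)  # only this line changes
joke_gen.set_lm(lm=dspy.LM("openai/gpt-4.1-mini", temperature=1))

result = joke_gen(query="Write a joke about AI")
print(result.reasoning)   # the self-generated reasoning tokens
print(result.punchline)   # the original output fields are still available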
Multi-Step Interactions and Agentic Workflows
The best part of DSPy’s approach is how it decouples system dependencies (Signatures) from control flows (Modules), which makes writing code for multi-step interactions trivial (and fun!). In this section, let’s see how we can build some simple agentic flows.
Sequential Processing
Let’s remind ourselves about one of the key components of Context Engineering.
It is a holistic approach that breaks up a larger problem into multiple subproblems
Let’s continue with our joke generation example. We can easily separate out two subproblems from it. Generating the idea is one, creating a joke is another.
Let’s have two agents then — the first Agent generates a joke idea (setup and punchline) from a query. A second agent then generates the joke from this idea.
Each agent can be of appropriate capability and size depending on the complexity of the task
We are also running the first agent with gpt-4.1-mini and the second agent with the more powerful gpt-4.1.
Notice how we write our own dspy.Module called JokeGenerator. Here we use two separate DSPy modules – query_to_idea and idea_to_joke – to convert our original query into a JokeIdea and subsequently into a joke (as pictured above).
import dspy
from pydantic import BaseModel

class JokeIdea(BaseModel):
    setup: str
    contradiction: str
    punchline: str

class QueryToIdea(dspy.Signature):
    """Generate a joke idea with setup, contradiction, and punchline."""
    query = dspy.InputField()
    joke_idea: JokeIdea = dspy.OutputField()

class IdeaToJoke(dspy.Signature):
    """Convert a joke idea into a full comedian delivery."""
    joke_idea: JokeIdea = dspy.InputField()
    joke = dspy.OutputField()

class JokeGenerator(dspy.Module):
    def __init__(self):
        super().__init__()
        self.query_to_idea = dspy.Predict(QueryToIdea)
        self.idea_to_joke = dspy.Predict(IdeaToJoke)
        # A smaller model for the idea, a stronger model for the delivery
        self.query_to_idea.set_lm(lm=dspy.LM("openai/gpt-4.1-mini"))
        self.idea_to_joke.set_lm(lm=dspy.LM("openai/gpt-4.1"))

    def forward(self, query):
        idea = self.query_to_idea(query=query)
        joke = self.idea_to_joke(joke_idea=idea.joke_idea)
        return joke
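For completeness, here is a quick usage sketch of the composed module (the printed field follows from the IdeaToJoke signature above):

joke_pipeline = JokeGenerator()
result = joke_pipeline(query="Write a joke about AI")
print(result.joke)  # the final comedian-style delivery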
Iterative Refinement
You can also implement iterative improvement, where the LLM reflects on and refines its outputs. For example, we can write a refinement module whose context is the output of a previous LM and whose job is to act as a feedback provider. The first LM then takes this feedback as input and iteratively improves its response.
A feedback loop to iteratively improve the final joke. (Source: Author)
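As a rough illustration of this pattern (not the author's exact code), here is one way such a feedback loop could look in DSPy. The JokeFeedback and RefineJoke signatures and the fixed number of refinement rounds are assumptions made for the sketch:

class JokeFeedback(dspy.Signature):
    """Critique a joke and suggest concrete improvements."""
    joke: str = dspy.InputField()
    feedback: str = dspy.OutputField(desc="specific suggestions to make the joke funnier")

class RefineJoke(dspy.Signature):
    """Rewrite the joke, taking the feedback into account."""
    joke: str = dspy.InputField()
    feedback: str = dspy.InputField()
    improved_joke: str = dspy.OutputField()

class IterativeJokeRefiner(dspy.Module):
    def __init__(self, n_rounds=2):
        super().__init__()
        self.critic = dspy.Predict(JokeFeedback)
        self.refiner = dspy.Predict(RefineJoke)
        self.n_rounds = n_rounds

    def forward(self, joke):
        for _ in range(self.n_rounds):
            # The critic's feedback becomes part of the refiner's context
            feedback = self.critic(joke=joke).feedback
            joke = self.refiner(joke=joke, feedback=feedback).improved_joke
        return joke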
Conditional Branching and Multi-Output Systems
The agents are connected with control flows, and we orchestrate exactly how information flows through our system
Sometimes you want your agent to output multiple variations, and then select the best among them. Let’s look at an example of that.
Here we first define a joke judge – it takes several joke ideas as input and picks the index of the best one. The winning idea is then passed on to the next stage.
import asyncio

num_samples = 5

class JokeJudge(dspy.Signature):
    """Given a list of joke ideas, you must pick the best joke."""
    joke_ideas: list[JokeIdea] = dspy.InputField()
    best_idx: int = dspy.OutputField(
        le=num_samples,
        ge=1,
        description="The index of the funniest joke (1-based)")

class ConditionalJokeGenerator(dspy.Module):
    def __init__(self):
        super().__init__()
        self.query_to_idea = dspy.ChainOfThought(QueryToIdea)
        self.judge = dspy.ChainOfThought(JokeJudge)
        self.idea_to_joke = dspy.ChainOfThought(IdeaToJoke)

    async def forward(self, query):
        # Generate multiple ideas in parallel
        ideas = await asyncio.gather(*[
            self.query_to_idea.acall(query=query)
            for _ in range(num_samples)
        ])
        joke_ideas = [pred.joke_idea for pred in ideas]

        # Judge and rank the ideas (the judge returns a 1-based index)
        best_idx = (await self.judge.acall(joke_ideas=joke_ideas)).best_idx

        # Select the best idea and generate the final joke
        best_idea = joke_ideas[best_idx - 1]
        return await self.idea_to_joke.acall(joke_idea=best_idea)
Tool Calling
LLM applications often need to interact with external systems. This is where tool-calling steps in. You can imagine a tool to be any Python function. You just need two things to define a Python function as an LLM tool:
- A description of what the function does
- A list of inputs and their data types
Let’s see an example of fetching news. We first write a simple Python function that uses Tavily. The function takes a search query and fetches recent news articles from the last 7 days.
import os
from tavily import TavilyClient

tavily_client = TavilyClient(api_key=os.getenv("TAVILY_API_KEY"))

def fetch_recent_news(query: str) -> list[str]:
    """Inputs a query string, searches for news and returns top results."""
    response = tavily_client.search(query, search_depth="advanced",
                                    topic="news", days=7, max_results=3)
    return [x["content"] for x in response["results"]]
Now let’s use dspy.ReAct (short for REasoning and ACTing). The module automatically reasons about the user’s query, decides when to call which tools, and incorporates the tool results into the final response. Doing this is pretty easy:
class HaikuGenerator(dspy.Signature):
    """
    Generates a haiku about the latest news on the query.
    Also create a simple file where you save the final summary.
    """
    query = dspy.InputField()
    summary = dspy.OutputField(desc="A summary of the latest news")
    haiku = dspy.OutputField()

program = dspy.ReAct(signature=HaikuGenerator,
                     tools=[fetch_recent_news],
                     max_iters=2)
program.set_lm(lm=dspy.LM("openai/gpt-4.1", temperature=0.7))
pred = program(query="OpenAI")
When the above code runs, the LLM first reasons about what the user wants and which tool to call (if any). It then generates the name of the function and the arguments with which to call it. We call the news function with those generated arguments and pass the resulting news data back into the LLM. The LLM then decides whether to call more tools or to “finish”. If it reasons that it has enough information to answer the user’s original request, it chooses to finish and generates the answer.
Agents can take actions while generating responses. Each action the agent can take should be well defined so the LLM can interact with it through reasoning and acting.
Advanced Tool Usage — Scratchpad and File I/O
An evolving standard for modern applications is to give LLMs access to the file system: reading and writing files, moving between directories (with appropriate restrictions), grepping and searching text within files, and even running terminal commands!
This pattern opens a ton of possibilities. It transforms the LLM from a passive text generator into an active agent capable of performing complex, multi-step tasks directly within a user’s environment. For example, just displaying the list of tools available to Gemini CLI will reveal a short but incredibly powerful collection of tools.
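As a simple illustration (not Gemini CLI's actual tool set), here is how a couple of hypothetical file tools could be handed to the dspy.ReAct agent from the previous example; the function names and the single-directory workspace restriction are assumptions:

from pathlib import Path

WORKSPACE = Path("./workspace")  # restrict the agent to one directory
WORKSPACE.mkdir(exist_ok=True)

def read_file(filename: str) -> str:
    """Reads and returns the contents of a text file inside the workspace."""
    return (WORKSPACE / filename).read_text()

def write_file(filename: str, content: str) -> str:
    """Writes content to a text file inside the workspace and confirms success."""
    (WORKSPACE / filename).write_text(content)
    return f"Wrote {len(content)} characters to {filename}"

program = dspy.ReAct(signature=HaikuGenerator,
                     tools=[fetch_recent_news, read_file, write_file],
                     max_iters=4)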
A quick word on MCP Servers
Another new paradigm in the space of agentic systems is the MCP server. MCPs need their own dedicated article, so I won’t go over them in detail in this one.
MCP has quickly become the industry-standard way to serve specialized tools to LLMs. It follows the classic client-server architecture: the LLM (the client) sends a request to the MCP server, the server carries out the requested action, and the result is returned to the LLM for downstream processing. MCPs are a great fit for context engineering since you can declare system prompt formats, resources, restricted database access, and more for your application.
This repository has a great list of MCP servers that you can study to make your LLM applications connect with a wide variety of applications.
Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation has become a cornerstone of modern AI application development. It is an architectural approach that injects external, up-to-date information that is contextually relevant to the user’s query into the Large Language Model (LLM).
RAG pipelines consist of a preprocessing and an inference-time phase. During pre-processing, we process the reference data corpus and save it in a queryable format. In the inference phase, we process the user query, retrieve relevant documents from our database, and pass them into the LLM to generate a response.
The information available to the agents can come from multiple sources – external databases with Retrieval-Augmented Generation (RAG), tool calls (like web search), memory systems, or classic few-shot examples.
Building RAG systems is complicated, and a lot of great research and engineering optimization has made life easier. I made a 17-minute video that covers all the aspects of building a reliable RAG pipeline.
Some practical tips for Good RAG
- When preprocessing, generate additional metadata per chunk. This can be as simple as “questions this chunk answers”. When saving the chunks to your database, also save the generated metadata!
class ChunkAnnotator(dspy.Signature):
    chunk: str = dspy.InputField()
    possible_questions: list[str] = dspy.OutputField(
        description="list of questions that this chunk answers"
    )
- Query Rewriting: Directly using the user’s query for RAG retrieval is often a bad idea. Users write pretty random things, which may not match the distribution of text in your corpus. Query rewriting does what it says – it “rewrites” the query, perhaps fixing grammar and spelling errors, contextualizing it with the past conversation, or adding keywords that make querying easier.
class QueryRewriting(dspy.Signature):
    user_query: str = dspy.InputField()
    conversation: str = dspy.InputField(
        description="The conversation so far")
    modified_query: str = dspy.OutputField(
        description="a query contextualizing the user query with the conversation's context and optimized for retrieval search"
    )
- HYDE, or Hypothetical Document Embedding, is a type of query rewriting. In HYDE, we generate an artificial (or hypothetical) answer from the LLM’s internal knowledge. This response often contains important keywords that directly match the answers in the database. Vanilla query rewriting is great for searching a database of questions; HYDE is great for searching a database of answers. A possible signature is sketched below.
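Here is one possible HYDE signature – a sketch, assuming you embed the hypothetical answer (rather than the raw query) when searching the vector store:

class HyDE(dspy.Signature):
    """Write a short, plausible answer to the query using only your internal knowledge."""
    user_query: str = dspy.InputField()
    hypothetical_answer: str = dspy.OutputField(
        description="a hypothetical passage that could answer the query, used only for embedding"
    )

hyde = dspy.Predict(HyDE)
# hypothetical = hyde(user_query="...").hypothetical_answer
# vector = embed(hypothetical)  # embed() is your own embedding function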
- Hybrid search is almost always better than purely semantic or purely keyword-based search. For semantic search, I’d use cosine-similarity nearest-neighbor search with vector embeddings. And for keyword search, use BM25.
- RRF: You can use multiple strategies to retrieve documents and then use Reciprocal Rank Fusion (RRF) to combine them into one unified list! (A sketch of RRF follows the multi-hop example below.)
- Multi-Hop Search is an option to consider as well if you can afford additional latency. Here, you pass the retrieved documents back into the LLM to generate new queries, which are used to conduct additional searches on the database.
class MultiHopHyDESearch(dspy.Module):
    def __init__(self, retriever):
        super().__init__()
        # QueryGeneration is a signature (not shown here) that outputs a
        # semantic_query and a bm25_query given the original query and the
        # results retrieved so far.
        self.generate_queries = dspy.ChainOfThought(QueryGeneration)
        self.retriever = retriever

    def forward(self, query, n_hops=3):
        results = []
        for hop in range(n_hops):  # Notice we loop multiple times
            # Generate optimized search queries, conditioned on what we found so far
            search_queries = self.generate_queries(
                query=query,
                previous_jokes=results
            )
            # Retrieve using both semantic and keyword search
            semantic_results = self.retriever.semantic_search(
                search_queries.semantic_query
            )
            bm25_results = self.retriever.bm25_search(
                search_queries.bm25_query
            )
            # Fuse results from both retrievers
            hop_results = reciprocal_rank_fusion([
                semantic_results, bm25_results
            ])
            results.extend(hop_results)
        return results
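The reciprocal_rank_fusion helper used above is not defined in the article; here is a minimal sketch of the standard RRF scoring (each document gets the sum of 1 / (k + rank) over all result lists, with k commonly set to 60). It assumes documents are hashable (e.g., strings or ids):

def reciprocal_rank_fusion(result_lists, k=60):
    """Combine multiple ranked lists of documents into one fused ranking."""
    scores = {}
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            # Documents appearing near the top of any list accumulate higher scores
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)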
- Citations: When asking the LLM to generate responses from the retrieved documents, we can also ask it to cite the documents it found useful. This encourages the LLM to first generate a plan for how it is going to use the retrieved content. One possible signature is sketched below.
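A minimal sketch of such a signature, assuming the retrieved documents are passed in as a numbered list (the field names are illustrative). Note that placing cited_doc_ids before answer nudges the LLM to pick its sources before writing:

class CitedAnswer(dspy.Signature):
    """Answer the question using only the provided documents and cite your sources."""
    question: str = dspy.InputField()
    documents: list[str] = dspy.InputField(description="retrieved documents, numbered from 1")
    cited_doc_ids: list[int] = dspy.OutputField(description="indices of the documents used")
    answer: str = dspy.OutputField()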
- Memory: If you are building a chatbot, it is important to address the question of memory. You can think of memory as a combination of retrieval and tool calling. A well-known system is Mem0: the LLM observes new data and calls tools to decide whether it needs to add or modify its existing memories, and during question answering it retrieves relevant memories using RAG to generate answers. A rough sketch of this pattern follows.
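This is a rough, framework-agnostic sketch of the memory-as-tools idea (it is not Mem0's actual API); memory_store and the naive keyword matching stand in for whatever vector database and retriever you actually use:

memory_store = []  # stand-in for a real vector database

def add_memory(fact: str) -> str:
    """Stores a new fact about the user or conversation."""
    memory_store.append(fact)
    return f"Stored: {fact}"

def search_memory(query: str) -> list[str]:
    """Returns stored facts relevant to the query (naive keyword match here)."""
    return [m for m in memory_store if any(w in m.lower() for w in query.lower().split())]

class ChatWithMemory(dspy.Signature):
    """Answer the user, using the memory tools to recall or store relevant facts."""
    user_message: str = dspy.InputField()
    response: str = dspy.OutputField()

chatbot = dspy.ReAct(signature=ChatWithMemory, tools=[add_memory, search_memory])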
Best Practices and Production Considerations
This section is not directly about Context Engineering, but more about best practices to build LLM apps for production.
Additionally, systems need to be evaluated with metrics and maintained with observability. Monitoring token usage, latency, and cost against output quality is a key consideration.
1. Design Evaluation First
Before building features, decide how you’ll measure success. This helps scope your application and guides optimization decisions.
- If you can design verifiable or objective rewards, that’s the best. (example: classification tasks where you have a validation dataset)
- If not, can you define functions that heuristically evaluate LLM responses for your use case? (example: number of times a specific chunk is retrieved given a question)
- If not, can you get humans to annotate your LLM’s responses?
- If nothing else works, use an LLM as a judge to evaluate responses. In most cases, you want to set up your evaluation as a comparison study: the judge receives multiple responses produced with different hyperparameters/prompts and must rank which ones are best. A sketch of such a judge follows.
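As an illustration (the field names are assumptions, not a prescribed setup), an LLM-as-a-judge comparison study could be expressed as a DSPy signature like this:

class ResponseJudge(dspy.Signature):
    """Rank candidate responses to the same question from best to worst."""
    question: str = dspy.InputField()
    candidate_responses: list[str] = dspy.InputField(description="responses from different prompts/hyperparameters")
    ranking: list[int] = dspy.OutputField(description="candidate indices ordered from best to worst")
    justification: str = dspy.OutputField()

judge = dspy.ChainOfThought(ResponseJudge)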
2. Use Structured Outputs Almost Everywhere
Always prefer structured outputs over free-form text. It makes your system more reliable and easier to debug. You can add validation and retries as well!
3. Design for Failure
When designing prompts or DSPy modules, always consider “what happens if things go wrong?”
Like any good software, minimizing error states and failing gracefully is the ideal scenario.
4. Monitor Everything
DSPy integrates with MLflow to track:
- Individual prompts passed into the LLM and their responses
- Token usage and costs
- Latency per module
- Success/failure rates
- Model performance over time
Langfuse and Logfire are equally great alternatives.
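If you go the MLflow route, recent MLflow versions ship a DSPy autologging integration; a minimal sketch, assuming you have a local MLflow tracking server running, looks roughly like this:

import mlflow

mlflow.set_tracking_uri("http://127.0.0.1:5000")  # assumes a local MLflow server
mlflow.set_experiment("joke-generator")
mlflow.dspy.autolog()  # traces DSPy module calls, prompts, and responses

# Any DSPy calls made after this point are traced automatically
result = joke_gen(query="Write a joke about AI")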
Outro
Context engineering represents a paradigm shift from simple prompt engineering to building comprehensive and modular LLM applications.
The DSPy framework provides the tools and abstractions needed to implement these patterns systematically. As LLM capabilities continue to evolve, context engineering will become increasingly crucial for building applications that effectively leverage the power of large language models.
To watch the full video course on which this article is based, please visit this YouTube link.
To access the full GitHub repo, visit:
https://github.com/avbiswas/context-engineering-dspy
References
Author’s YouTube channel: https://www.youtube.com/@avb_fj
Author’s Patreon: https://www.patreon.com/NeuralBreakdownwithAVB
Author’s Twitter (X) account: https://x.com/neural_avb
Full Context Engineering video course: https://youtu.be/5Bym0ffALaU
Github Link: https://github.com/avbiswas/context-engineering-dspy