
LangChain for EDA: Build a CSV Sanity-Check Agent in Python


LLMs generate text; agents perform actions.

That’s exactly what we’re going to try out in today’s article.

In this article, we’ll use LangChain and Python to build our own CSV sanity check agent. With this agent, we’ll automate typical exploratory data analysis (EDA) tasks such as displaying columns, detecting missing values (NaNs), and retrieving descriptive statistics.

Agents decide step by step which tool to call and when to answer a question about our data. This is a big difference from an application in the traditional sense, where the developer defines how the process works (e.g., via if-else logic). It also goes far beyond simple prompting, because we are building a system that acts (albeit in a simple way) and doesn’t just talk.

This article is for you if you:

  • …work with Pandas and want to automate EDA.
  • …find LLMs exciting, but have little experience with LangChain so far.
  • …want to understand how agents really work (from setup to mini-evaluation) using a simple example.

Table of Contents
What we build & why
Hands-On Example: CSV Sanity-Check Agent with LangChain
Mini-Evaluation
Final Thoughts – Pitfalls, Tips and Next Steps
Where Can You Continue Learning?

What we build & why

An agent is a system to which we assign tasks. The system then decides for itself which tools to use to solve these tasks.

This requires three components:

Agent = LLM + Tools + Control logic

Let’s take a closer look at the three components:

  • The LLM provides the intelligence: It understands the question, plans steps, and decides what to do.
  • The tools are small Python functions that the agent is allowed to call (e.g., get_schema() or get_nulls()): They provide specific information from the data, such as column names or statistics.
  • The control logic (policy) ensures that the LLM does not respond immediately, but first decides whether it should use a tool. It thinks step by step: First, the question is analyzed, then the appropriate tool is selected, then the result is interpreted and, if necessary, a next step is selected, and finally a response is returned.

Instead of manually describing all data as in classic prompting, we transfer the responsibility to the agent: The system should act on its own, but only with the tools provided.
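Before we build this with LangChain, here is a minimal sketch of such a control loop in plain Python. It is illustrative only: llm_decide and the tools dictionary are hypothetical placeholders, not LangChain APIs.

def run_agent(question: str, tools: dict, llm_decide, max_steps: int = 3) -> str:
    """llm_decide(question, history) returns either ("tool", name, arg) or ("answer", text)."""
    history = []                                  # what the agent has done so far
    for _ in range(max_steps):
        decision = llm_decide(question, history)  # the LLM plans the next step
        if decision[0] == "answer":
            return decision[1]                    # enough information -> final answer
        _, tool_name, arg = decision              # otherwise: which tool, with which input?
        observation = tools[tool_name](arg)       # call the chosen tool
        history.append((tool_name, observation))  # remember the result for the next step
    return "Stopped: step limit reached."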

Let’s look at a simple example:

A user asks: “What is the average age in the CSV?”

At this point, the agent calls the describe tool we have defined, which wraps df.describe(). The output is a clearly structured value (e.g., “mean”: 29.7). This also shows how the approach reduces hallucinations: the system knows which function to apply and cannot return a vague answer such as “Probably between 20 and 40.”

LangChain as a framework

We use the LangChain framework for the agent. This allows us to connect LLMs with tools and build systems with defined behavior. The system can perform actions instead of just providing answers or generating text. A detailed explanation would make this article too long. But in a previous article, you can find an explanation of LangChain and a comparison with Langflow: LangChain vs Langflow: Build a Simple LLM App with Code or Drag & Drop.

What the agent does for us

When we receive a new CSV, we usually ask ourselves the following questions first (start of exploratory data analysis):

  • What columns are there?
  • Where is data missing?
  • What do the descriptive statistics look like?

This is exactly what we want the agent to do automatically.
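Done by hand in pandas, these three checks are just three one-liners (shown here for the Titanic CSV we create in the next section); the agent will wrap exactly these calls into tools:

import pandas as pd

df = pd.read_csv("titanic.csv")  # created later with prepare_data.py

print(df.dtypes)        # What columns are there (and which types)?
print(df.isna().sum())  # Where is data missing?
print(df.describe())    # Descriptive statistics for the numeric columns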

Tools we define for the agent

For the agent to work, it needs clearly defined tools. It is best to keep them as small, specific, and controlled as possible. This way, we avoid errors, hallucinations, and unclear outputs, because narrowly scoped tools make the output deterministic. They also make the agent reproducible and testable, because the same input should produce a consistent result.

In our example, we define three tools:

  • schema: Returns column names and data types.
  • nulls: Shows columns with missing values (including number).
  • describe: Provides descriptive statistics for numeric columns.

Later, we will add a small mini-evaluation to ensure that our agent is working correctly.

Why is this an agent and not an app?

We are not building a classic program with a fixed sequence (e.g., using if-else). Instead, the model itself plans based on the question, selects the appropriate tool, and combines steps as needed to arrive at an answer:

The image shows the difference between a traditional app and this agent.
Visualization by the author.

Hands-On Example: CSV Sanity-Check Agent with LangChain

1) Setup

Prerequisite: Python 3.10 or higher must be installed. Many packages in the AI tooling world require ≥ 3.10. You can find the code and the link to the repo below.

Tip for newbies:
You can check this by entering "python --version" in cmd.exe.

With the code below, we first create a new project folder, create an isolated Python environment, and activate it. We do this so that packages and versions are reproducible and do not conflict with other projects.

Tip for newbies:
I work with Windows. We open a terminal with Windows + R > cmd and paste the following code.

mkdir csv-agent
cd csv-agent
python -m venv .venv
.venv\Scripts\activate

Then we install the necessary packages:

pip install "langchain>=0.2,=0.1.7" "langchain-community>=0.2" pandas seaborn

With this command, we pin LangChain to the 0.2 line and install the OpenAI connection and the community package. We also install pandas for the EDA functions and seaborn for loading the Titanic sample dataset.

The image shows creating an environment and installing packages.
Screenshot taken by the author.
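If you want to confirm that the packages landed in the activated environment, a tiny version check is enough. The file name is just an example, and the exact version numbers will vary:

# check_env.py -- optional: confirm the packages were installed in .venv
import langchain
import pandas as pd
import seaborn as sns

print("langchain:", langchain.__version__)
print("pandas:", pd.__version__)
print("seaborn:", sns.__version__)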

Tip for newbies:
If you don’t want to use OpenAI, you can work locally with Ollama (e.g., with Llama or Mistral). This option is available later in the code.

2) Prepare the data set in prepare_data.py

Next, we create a Python file called prepare_data.py. I use Visual Studio Code for this, but you can also use another IDE. In this file, we load the Titanic dataset, as it is publicly available.

# prepare_data.py
import seaborn as sns
df = sns.load_dataset("titanic")
df.to_csv("titanic.csv", index=False)
print("Saved titanic.csv")

With seaborn.load_dataset("titanic"), we load the public dataset (891 data rows plus a header row with the column names) directly into memory and save it as titanic.csv. The dataset contains only numeric, Boolean, and categorical columns, making it ideal for an EDA agent.

Tips for newbies:

  • sns.load_dataset() requires internet access (the data comes from the seaborn repo).
  • Save the file in the project folder (csv-agent) so that main.py can find it.

In the terminal, we execute the Python file with the following command, so that the titanic.csv file is located in the project:

python prepare_data.py

We then see in the terminal that the csv has been saved and see the titanic.csv file in the folder:

The image shows the result in the terminal after the csv is saved.
Screenshot taken by the author.
The image shows the folder structure of the project.
Screenshot taken by the author.

Side Note – Titanic dataset

The analysis is based on the Titanic dataset (OpenML ID 40945), which is marked as public on OpenML.

When we open the file, we see the following 14 columns and 891 rows of data. The Titanic dataset is a classic example of exploratory data analysis (EDA). It contains information on 891 passengers of the Titanic and is often used to investigate the relationship between characteristics (e.g., gender, age, ticket class) and survival.

The image shows the Titanic dataset in Excel.
Screenshot taken by the author.

Here are the 14 columns with a brief explanation:

  • survived: Survived (1) or did not survive (0).
  • pclass: Ticket class (1 = 1st class, 2 = 2nd class, 3 = 3rd class).
  • sex: Gender of the passenger.
  • age: Age of the passenger (in years, may be missing).
  • sibsp: Number of siblings/spouses on board.
  • parch: Number of parents/children on board.
  • fare: Fare paid by the passenger.
  • embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).
  • class: Ticket class as text (First, Second, Third). Corresponds to pclass.
  • who: Categorization “man,” “woman,” “child.”
  • adult_male: Boolean field: Was the passenger an adult male (True/False)?
  • deck: Cabin deck (often missing).
  • embark_town: City of port of embarkation (Cherbourg, Queenstown, Southampton).
  • alone: Boolean field: Did the passenger travel alone (True/False)?

Optional for advanced readers
If you want to track and evaluate your agent runs later, you can use LangSmith.

3) Define tools in main.py

Next, we define the various tools. To do this, we create a new Python file called main.py and save it in the csv-agent folder as well. We add the following code to it:

# main.py
import os, json
import pandas as pd

# --- 0) Loading CSV ---
DF_PATH = "titanic.csv"
df = pd.read_csv(DF_PATH)

# --- 1) Defining tools as small, concise commands ---
# IMPORTANT: Tools return strings (in this case, JSON strings) so that the LLM sees clearly structured responses.

from langchain_core.tools import tool

@tool
def tool_schema(dummy: str) -> str:
    """Returns column names and data types as JSON."""
    schema = {col: str(dtype) for col, dtype in df.dtypes.items()}
    return json.dumps(schema)

@tool
def tool_nulls(dummy: str) -> str:
    """Returns columns with the number of missing values as JSON (only columns with >0 missing values)."""
    nulls = df.isna().sum()
    result = {col: int(n) for col, n in nulls.items() if n > 0}
    return json.dumps(result)

@tool
def tool_describe(input_str: str) -> str:
    """
    Returns describe() statistics.
    Optional: input_str can contain a comma-separated list of columns, e.g. "age, fare".
    """
    cols = None
    if input_str and input_str.strip():
        cols = [c.strip() for c in input_str.split(",") if c.strip() in df.columns]
    stats = df[cols].describe() if cols else df.describe()
    # Convert the describe() table to CSV text so the LLM gets a compact, readable result:
    return stats.to_csv(index=True)

After importing the necessary packages, we load titanic.csv into df once and define three small, narrowly defined tools. Let’s take a closer look at each of these tools:

  • tool_schema returns the column names and data types as JSON. This gives us an overview of what we are dealing with and is usually the first step in any data analysis. Even if a tool doesn’t need input (like schema), it must still accept one argument, because the agent always passes a string. We simply ignore it.
  • tool_nulls counts missing values per column and returns only columns with missing values.
  • tool_describe calls df.describe(). It is important to note that this tool only works for numeric columns. Strings or Booleans, on the other hand, are ignored. This is an important step in the sanity check or EDA. This allows us to quickly see the mean, min, max, etc. of the different columns. For large CSVs, describe() can take a long time. In this case, you could integrate df.sample(n=10000) as sampling logic, for example.

These tools are the controlled interfaces through which the LLM is allowed to access the data. They are deterministic and therefore reproducible. Tools should ideally be clear and limited: In other words, they should have only one function or task.
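Before wiring the tools into an agent, it is worth calling them once by hand. Functions decorated with @tool are LangChain Runnables, so .invoke() works on them; a quick, temporary check could look like this:

# Temporary check, e.g. at the bottom of main.py where the tools are defined
print(tool_schema.invoke(""))             # column names and dtypes as JSON
print(tool_nulls.invoke(""))              # e.g. {"age": 177, "deck": 688, ...}
print(tool_describe.invoke("age, fare"))  # describe() restricted to the listed columns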


Why do we need tools at all?

An LLM can generate text, but it cannot directly “see” data. In order for the LLM to work meaningfully with a CSV, we need to provide interfaces. That’s exactly what tools are for:

Tools are small Python functions that the agent is allowed to call. Instead of giving the model free rein over the data, we only allow very specific, reproducible actions.


What exactly does the code do?

With the @tool decorator, LangChain automatically infers the tool’s name, description and argument schema from the function signature and docstring. This means we only need to write the function itself. LangChain takes care of the rest.

  • The model passes arguments that match the tool’s schema (often JSON). In this tutorial we keep things simple and accept a single string argument (e.g., input_str: str or a dummy string we ignore).
  • Tools always return a string (text). JSON is ideal for structured data, which we define with return json.dumps(…).
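You can inspect what the decorator inferred by looking at the tool objects themselves (name, description, and args are standard attributes of LangChain tools):

# Inspect what @tool inferred (run anywhere below the tool definitions in main.py)
for t in (tool_schema, tool_nulls, tool_describe):
    print(t.name)         # e.g. "tool_schema"
    print(t.description)  # the docstring
    print(t.args)         # the inferred argument schema
    print("---")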
This image shows how the agent uses multi-step reasoning with tools.
Visualization by the author.

This is a multi-step thought process. The LLM plans iteratively. Instead of responding directly, it thinks step by step: it decides which tool to call, interprets the result, and may continue until it has enough information to respond.

4) Registering tools for LangChain in main.py

We add the code below to the same main.py file to register the previously defined tools for the agent:

# --- 2) Registering tools for LangChain ---

tools = [tool_schema, tool_nulls, tool_describe]

With this code, we simply collect the decorated functions into a list. Each function has already been converted into a LangChain tool by the @tool decorator.

5) Configuring LLM in main.py

Next, we configure the LLM that the agent uses. Here, you can either use the variant for OpenAI or for an open-source tool with Ollama.

I used OpenAI, which is why we first need to set the API key:

At OpenAI, we create a new API key:

The image shows how to create an API-Key in OpenAI.
Screenshot taken by the author.

We then copy it directly (it will not be displayed later) and set it as an environment variable in the terminal with the following command.

setx OPENAI_API_KEY "your_key"

It is important to restart cmd and reactivate .venv afterwards. We can then check whether the key has been saved with echo %OPENAI_API_KEY% in the new terminal.

The image shows how to check in the terminal, if the API-Key was saved.
Screenshot taken by the author.

Now we add the following code to the end of main.py:

# --- 3) Configure LLM ---
# Option A: OpenAI (simple)
#   export OPENAI_API_KEY=...    # Windows: setx OPENAI_API_KEY "YOUR_KEY"
#   Use a lower temperature for more stable tool usage
USE_OPENAI = bool(os.getenv("OPENAI_API_KEY"))

if USE_OPENAI:
    from langchain_openai import ChatOpenAI
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.1)
else:
    # Option B: Local with Ollama (make sure to pull the model first, e.g. 'ollama run llama3')
    from langchain_community.chat_models import ChatOllama
    llm = ChatOllama(model="llama3.1:8b", temperature=0.1)

The code uses OpenAI if an OPENAI_API_KEY environment variable is set; otherwise, it falls back to a local model via Ollama.

We set the temperature to 0.1. This ensures that the responses are more deterministic, which is important for the subsequent test.

We also use gpt-4o-mini as the LLM, a lightweight OpenAI model that works well for tool calling.

Tip for Newbies:
The temperature determines how creatively an LLM responds. If we enter 0.0, it responds deterministically. This means that the model almost always returns the same answer when the input is the same. This is good for structured tasks such as tool usage, code or facts, for example. If we specify 1.0, the model responds creatively and with a wide variety of options. This means that the model varies more and can suggest different formulations or solutions, which is good for brainstorming or text ideas, for example.

6) Defining the agent’s behavior in main.py using the policy

In this step, we define how the agent should behave. The system prompt sets the policy.

# --- 4) Narrow Policy/Prompt (Agent Behavior) ---
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

SYSTEM_PROMPT = (
    "You are a data-focused assistant. "
    "If a question requires information from the CSV, first use an appropriate tool. "
    "Use only one tool call per step if possible. "
    "Answer concisely and in a structured way. "
    "If no tool fits, briefly explain why.\n\n"
    "Available tools:\n{tools}\n"
    "Use only these tools: {tool_names}."
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", SYSTEM_PROMPT),
        ("human", "{input}"),
        MessagesPlaceholder(variable_name="agent_scratchpad"),
    ]
)

_tool_desc = "\n".join(f"- {t.name}: {t.description}" for t in tools)
_tool_names = ", ".join(t.name for t in tools)
prompt = prompt.partial(tools=_tool_desc, tool_names=_tool_names)

First, we import ChatPromptTemplate to structure our agent’s prompt. The most important part of the code is the system prompt: it defines the policy, i.e., the “rules of the game” for the agent. In it, we define that the agent may only use one tool per step, that it should be concise, and that it may only use the tools we have defined.

With the last two lines of the system prompt, {tools} lists all available tools with their descriptions, and {tool_names} ensures that the agent only uses these names and cannot invent fantasy tools.

In addition, we use MessagesPlaceholder("agent_scratchpad"). This is where the agent stores intermediate steps: which tools it has called and which results it has received. This allows it to continue its own chain of reasoning until it arrives at a final answer.
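If you want to see exactly what the model will receive, you can render the prompt once with an empty scratchpad (format_messages is part of ChatPromptTemplate; the question below is just an example):

# Optional: render the prompt once (assumes prompt from main.py; scratchpad left empty)
msgs = prompt.format_messages(
    input="Which columns have missing values?",
    agent_scratchpad=[],  # no intermediate steps yet
)
print(msgs[0].content)    # system prompt including the tool list
print(msgs[1].content)    # the user question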

7) Create tool-calling agent in main.py

In the last step, we define the agent:

# --- 5) Create & Run Tool-Calling Agent ---
from langchain.agents import create_tool_calling_agent, AgentExecutor

agent = create_tool_calling_agent(llm=llm, tools=tools, prompt=prompt)
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=False,   # optional: True for debug logs
    max_iterations=3,
)

if __name__ == "__main__":
    user_query = "Which columns have missing values? List 'Column: Count'."
    result = agent_executor.invoke({"input": user_query})
    print("\n=== AGENT ANSWER ===")
    print(result["output"])

With create_tool_calling_agent, we connect our LLM, the tools and the prompt to form a tool-calling agent.

To ensure that the process runs smoothly, we use the AgentExecutor. It takes care of the so-called agent loop: The agent first plans what needs to be done, then calls a tool, receives the result, and decides whether another tool is needed or whether it can provide the final answer. This cycle repeats until the result is ready.

With verbose=True, we can view the intermediate steps in the terminal, which is extremely helpful for debugging. For example, we can see which tool was called when and what data was returned. Once everything runs smoothly, we can set it back to False to keep the output cleaner.

With max_iterations=3, we limit how many reasoning–tool–response cycles the agent may perform. This helps prevent infinite loops or excessive tool calls. In our example, the agent might reasonably call schema → nulls → describe before answering.

With the last part of the code, the agent is executed with the sample input “Which columns have missing values?”. The result is printed in the terminal.
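If you prefer to inspect these steps programmatically instead of reading the verbose logs, AgentExecutor can also return them via return_intermediate_steps. A sketch (the extra executor below exists only for debugging and assumes agent and tools from main.py):

# Optional debug executor that also returns the intermediate steps
debug_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    max_iterations=3,
    return_intermediate_steps=True,
)

res = debug_executor.invoke({"input": "Which columns have missing values?"})
for action, observation in res["intermediate_steps"]:
    print(f"Tool: {action.tool} | Input: {action.tool_input} | Output: {observation[:100]}")
print(res["output"])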

Tip for newbies:
if __name__ == "__main__": is a standard Python pattern: If we execute the file directly in the terminal with python main.py, the code in this block will be run. However, if we only import the file (e.g., later in the mini_eval.py file), this block is skipped. This allows us to use the file as a standalone script or reuse it as a module in other projects.

8) Run main.py in the terminal

Now we enter python main.py in the terminal to start the agent. We then see the final answer in the terminal:

The image shows the result that the agent shows in the terminal (how many missing values).
Screenshot taken by the author.

Mini-Evaluation

Finally, we want to check our agent, which we do with a small evaluation. This ensures that the agent behaves correctly and does not introduce any “regressions” when we change something in the code later on.

At the end of main.py, we add the code below:

def ask_agent(query: str) -> str:
    return agent_executor.invoke({"input": query})["output"]

With ask_agent, we encapsulate the agent call in a function that simply returns a string. This allows us to call the agent later from other files.

The __main__ block at the bottom of main.py ensures that a test run is performed when the file is executed directly. If, on the other hand, we import main into another file (as mini_eval.py does below), that test query is skipped and we can simply use ask_agent.

Now we create the mini_eval.py file and insert the following code:

# mini_eval.py

from main import ask_agent

tests = [
    ("Which columns have missing values?", ["age", "embarked", "deck", "embark_town"]),
    ("Show me the first 3 columns with their data types.", ["survived", "pclass", "sex"]),
    ("Give me a statistical summary of the 'age' column.", ["mean", "min", "max"]),
]

def passed(q, out, must_include):
    text = out.lower()
    return all(str(m).lower() in text for m in must_include)

if __name__ == "__main__":
    ok = 0
    for q, must in tests:
        out = ask_agent(q)
        result = passed(q, out, must)
        print(f"[{'OK' if result else 'FAIL'}] {q}\n{out}\n")
        ok += int(result)
    print(f"Passed {ok}/{len(tests)}")

In the code, we define three test cases. Each test consists of a question for the agent and a list of keywords that must appear in the answer. The passed() function checks whether these keywords are included.

Expected test results

  • Test 1: “Which columns have missing values?”
    Expected: Output mentions age, deck, embarked, embark_town.
  • Test 2: “Show me the first 3 columns with their data types.” Expected: Output contains survived, pclass, sex with types such as int64 or object.
  • Test 3: “Give me a statistical summary of the ‘age’ column.” Expected output: Output contains mean ≈ 29.7, min = 0.42, max = 80.

If everything runs correctly, the script reports “Passed 3/3” at the end.

We run the evaluation with python mini_eval.py and get the following output in the terminal, so the test works:

The image shows the result of the mini-evaluation.
Screenshot taken by the author.

You can find the code & the csv in the repo on GitHub.

On my Substack Data Science Espresso, I share practical guides and bite-sized updates from the world of Data Science, Python, AI, Machine Learning, and Tech — made for curious minds like yours.

Have a look and subscribe on Medium or on Substack if you want to stay in the loop.


Final Thoughts – Pitfalls, Tips and Next Steps

LangChain is very practical for this example because it already includes and nicely illustrates the entire agent loop (planning, tool calling, control). For small or clearly structured tasks, however, alternatives such as pure function calling (e.g., via the OpenAI API) or classic EDA frameworks like Great Expectations might be sufficient. That said, LangChain does add some overhead. If you only need fixed EDA checks, a plain Python script would be leaner and faster. LangChain is especially worthwhile when you want to extend things flexibly or orchestrate multiple tools and agents.

When working with agents, there are a few things you should keep in mind:

One common pitfall is unclear tool descriptions: If the descriptions are too vague, the model can easily choose the wrong tool (misrouting). With precise and concrete descriptions, we can greatly reduce this.

Another important point is testing: Even a small mini-evaluation with three simple tests helps detect regressions (errors introduced by later changes that would otherwise go unnoticed) at an early stage.

It’s also worth starting small: In our example, we only worked with three clearly defined tools, but now we know that they work reliably.

With regard to this agent, it might also be useful to incorporate sampling (for example, df.sample(n=10000)) for very large CSV files to avoid performance issues. Keep in mind that LLM agents can also become costly if every question triggers multiple tool calls.
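As a sketch, such a sampling variant of the describe tool could look like the following (the tool name, threshold, and random_state are arbitrary choices, not part of the article’s code):

# Sketch: compute describe() on a random sample for very large CSVs
from langchain_core.tools import tool
import pandas as pd

MAX_ROWS = 10_000  # arbitrary threshold
df = pd.read_csv("titanic.csv")  # in main.py, df is already loaded

@tool
def tool_describe_sampled(input_str: str) -> str:
    """Returns describe() statistics, computed on a random sample for large files."""
    data = df.sample(n=MAX_ROWS, random_state=42) if len(df) > MAX_ROWS else df
    return data.describe().to_csv(index=True)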

In this article, we built a single agent that checks CSV files. In practice, multiple agents would often work together: For example, one agent could ensure data quality while a second agent creates visualizations. Such multi-agent systems are the next step in solving more complex tasks.

As a next step, we could also incorporate LangGraph to extend the agent loop with states and orchestration. This would allow us to assemble agents as in a flowchart, including interruptions, memory, or more flexible control logic.

Finally, in our example, we manually defined the three tools schema, nulls, and describe. With the Model Context Protocol (MCP), we could connect tools in a standardized way. For example, we could connect databases, APIs or IDEs.

Where Can You Continue Learning?
