If you’ve been following along, we’ve come a long way. In Part 1, we did the “dirty work” of cleaning and prepping.
In Part 2, we zoomed out to a high-altitude view of NovaShop’s world — spotting the big storms (high-revenue countries) and the seasonal patterns (the massive Q4 rush).
But here’s the thing: a business doesn’t actually sell to “months” or “countries.” It sells to human beings.
If you treat every customer exactly the same, you’re making two very expensive mistakes:
- Over-discounting: Giving a “20% off” coupon to someone who was already reaching for their wallet.
- Ignoring the “Quiet” Ones: Failing to notice when a formerly loyal customer stops visiting, until they’ve been gone for six months and it’s too late to win them back.
The Solution? Behavioural Segmentation.
Instead of guessing, we’re going to use the data to let the customers tell us who they are. We do this using the gold standard of retail analytics: RFM Analysis.
- Recency (R): How recently did they buy? (Are they still engaged with us?)
- Frequency (F): How often do they buy? (Are they loyal, or was it a one-off?)
- Monetary (M): How much do they spend? (What is their total business impact?)
By the end of this part, we’ll move beyond “Top 10 Products” and actually assign a specific, actionable Label to every single customer in NovaShop’s database.
Data Preparation: The “Missing ID” Pivot
Before we can start scoring, we have to address a decision we made back in Part 1.
If you remember our Initial Inspection, we noticed that about 25% of our rows were missing a CustomerID. At the time, we made a strategic business decision to keep those rows. We needed them to calculate the accurate total revenue and see which products were popular overall.
For RFM analysis, the rules change. You cannot track behavior without a consistent identity. We can’t know how “frequent” a customer is if we don’t know who they are!
So, our first step in Part 3 is to isolate our “Trackable Universe” by filtering for rows where a CustomerID exists.
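A minimal sketch of that filter, using a small made-up frame in place of NovaShop's data (the name df_customers and the sample rows are assumptions for illustration):

```python
import pandas as pd

# Toy stand-in for the transaction table; the values are invented.
df = pd.DataFrame({
    "InvoiceNo": ["A1", "A2", "A3", "A4"],
    "CustomerID": [12347.0, None, 12346.0, None],
    "Revenue": [25.0, 10.0, 40.0, 5.0],
})

# Keep only rows that can be tied to a known customer.
df_customers = df.dropna(subset=["CustomerID"]).copy()

# Share of rows that drop out of the RFM analysis.
dropped_share = 1 - len(df_customers) / len(df)
print(f"Trackable rows: {len(df_customers)} ({dropped_share:.0%} dropped)")
```

The snippets that follow use df for the filtered data, so in practice you would assign the result back to df. The unfiltered table stays useful for the revenue totals from the earlier parts.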
Engineering the RFM Metrics
Now that we have a dataset where every row is linked to a specific person, we need to aggregate all their individual transactions into three summary numbers: Recency, Frequency, and Monetary.
Defining the Snapshot Date
Before calculating RFM, we need a reference point in time, commonly called the snapshot date.
Here, we take the most recent transaction date in the dataset and add one day. This snapshot date represents the moment at which we’re evaluating customer behaviour.
import datetime as dt
snapshot_date = df['InvoiceDate'].max() + dt.timedelta(days=1)
We add one day so that customers who bought on the most recent date get a Recency of 1 rather than 0. This keeps the metric intuitive and avoids zero-valued edge cases.
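To see the effect of the extra day, here is a small worked example on invented dates. The dates are assumptions, and pd.Timedelta stands in for dt.timedelta; the two behave the same for this purpose.

```python
import pandas as pd

# Invented purchase dates, for illustration only.
dates = pd.Series(pd.to_datetime(["2011-12-01", "2011-12-08", "2011-12-09"]))

# Snapshot = latest purchase date plus one day.
snapshot_date = dates.max() + pd.Timedelta(days=1)

# Whole days between the snapshot and each purchase.
recency = (snapshot_date - dates).dt.days
print(recency.tolist())
```

The latest purchase ends up with a Recency of 1, and older purchases count up from there.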
Aggregating Transactions at the Customer Level
rfm = df.groupby('CustomerID').agg({
    'InvoiceDate': lambda x: (snapshot_date - x.max()).days,
    'InvoiceNo': 'nunique',
    'Revenue': 'sum'
})
Each row in our dataset represents a single transaction. To calculate RFM, we need to collapse these transactions into one row per customer.
We do this by grouping the data by CustomerID and applying different aggregation functions:
- Recency: For each customer, we find their most recent purchase date and calculate how many days have passed since then.
- Frequency: We count the number of unique invoices associated with each customer. This tells us how often they’ve made purchases.
- Monetary: We sum the total revenue generated by each customer across all transactions.
Renaming Columns for Clarity
rfm.rename(columns={
    'InvoiceDate': 'Recency',
    'InvoiceNo': 'Frequency',
    'Revenue': 'Monetary'
}, inplace=True)
The aggregation step keeps the original column names, which can be confusing. Renaming them makes the dataframe immediately readable and aligns it with standard RFM terminology.
Now each column clearly answers a business question:
- Recency → How recently did the customer purchase?
- Frequency → How often do they purchase?
- Monetary → How much revenue do they generate?
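As an alternative sketch, pandas named aggregation can produce the final column names in a single step, which makes the separate rename unnecessary. The toy transactions below are invented; only the column names come from the walkthrough above.

```python
import pandas as pd

# Invented transactions for two customers.
df = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2],
    "InvoiceNo": ["A1", "A1", "A2", "B1"],
    "InvoiceDate": pd.to_datetime(
        ["2011-12-01", "2011-12-01", "2011-12-08", "2011-11-20"]
    ),
    "Revenue": [10.0, 5.0, 20.0, 50.0],
})
snapshot_date = df["InvoiceDate"].max() + pd.Timedelta(days=1)

# Named aggregation: output column = (input column, aggregation).
rfm = df.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda x: (snapshot_date - x.max()).days),
    Frequency=("InvoiceNo", "nunique"),
    Monetary=("Revenue", "sum"),
)
print(rfm)
```

Both routes give the same one-row-per-customer table; which one reads better is largely a matter of taste.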
Inspecting the Result
print(rfm.head())
The final rfm dataframe contains one row per customer, with three intuitive metrics summarizing their behavior.
Output:

Let’s walk through this the way we would with NovaShop in a real conversation.
“When was the last time this customer bought from us?”
That’s exactly what Recency answers.
Take Customer 12347:
- Recency = 2
- Translation: “This customer bought something just two days ago.”
They’re fresh. They remember the brand. They’re still engaged.
Now compare that to Customer 12346:
- Recency = 326
- Translation: “They haven’t bought anything in almost a year.”
Even though this customer spent a lot in the past, they’re currently silent.
From NovaShop’s perspective: Recency tells us who’s still listening and who might need a nudge (or a wake-up call).
“Is this a one-time buyer or someone who keeps coming back?”
That’s where Frequency comes in.
Look again at Customer 12347:
- Frequency = 7
- They didn’t just buy once — they came back again and again.
Now look at several others:
- Frequency = 1
- One purchase, then gone.
From a business perspective, frequency separates casual shoppers from loyal customers.
“Who actually brings in the money?”
That’s the Monetary column.
And this is where things get interesting.
Customer 12346:
- Monetary = £77,183.60
- Frequency = 1
- Recency = 326
This tells a very specific story:
A single, very large order… a long time ago… and nothing since.
Now compare that to Customer 12347:
- Lower total spend
- Multiple purchases
- Very recent activity
Important insight for NovaShop: A “high-value” customer in the past isn’t necessarily a valuable customer today.
Why This View Changes the Conversation
If NovaShop only looked at total revenue, they might focus all their attention on customers like 12346.
But RFM shows us that:
- Some customers spent a lot once and disappeared
- Some spend less but stay loyal
- Some are active right now and ready to be engaged
This output helps NovaShop stop guessing and start prioritizing:
- Who should get retention emails?
- Who needs reactivation campaigns?
- Who is already loyal and should be rewarded?
Right now, these are still raw numbers.
In the next step, we’ll rank and score these customers, so NovaShop doesn’t have to interpret rows manually. Instead, they’ll see clear segments like:
- Champions
- Loyal Customers
- At-Risk
- Lost
That’s where this becomes a real decision-making tool — not just a dataframe.
Turning RFM Numbers Into Meaningful Customer Segments
At this stage, NovaShop has a table full of numbers. Useful — but not exactly decision-friendly.
A marketing team can’t realistically scan hundreds or thousands of rows asking:
- Is a Recency of 19 good or bad?
- Is Frequency = 2 impressive?
- How much Monetary value is “high”?
Our goal is to rank customers relative to one another and turn raw values into scores.
Step 1: Ranking Customers by Each RFM Metric
Instead of treating Recency, Frequency, and Monetary as absolute values, we look at where each customer stands compared to everyone else.
- Customers with more recent purchases should score higher
- Customers who buy more often should score higher
- Customers who spend more should score higher
In practice, we do this by splitting each metric into quantiles (usually 4 or 5 buckets).
However, there’s a small real-world wrinkle, one I ran into while working on this project.
In transactional datasets, it’s common to see:
- Many customers with the same Frequency (e.g. one-time buyers)
- Highly skewed Monetary values
- Small samples where quantile binning can fail
To keep things robust and readable, we’ll wrap the scoring logic in a small helper function.
def rfm_score(series, ascending=True, n_bins=5):
    # Rank the values to ensure uniqueness
    ranked = series.rank(method='first', ascending=ascending)
    # Use pd.qcut on the ranks to assign bins
    return pd.qcut(
        ranked,
        q=n_bins,
        labels=range(1, n_bins+1)
    ).astype(int)
To explain what’s going on here:
- We’re creating a helper function that turns a raw numeric column into a clean RFM score using quantile-based binning.
- First, the values are ranked. Instead of binning the raw values directly, we bin their ranks, which guarantees a unique ordering even when many customers share the same value (a common issue in RFM data).
- The ascending flag lets us flip the logic depending on the metric: lower recency is better, while higher frequency and monetary values are better.
- Next, we apply quantile-based binning. qcut splits the ranked values into n_bins equally sized groups. Each customer is assigned a score from 1 to 5 (by default), where the score represents their relative position within the distribution.
- Finally, the results are converted to integers for easy use in analysis and segmentation.
In short, this function provides a robust and reusable way to score RFM metrics without running into duplicate bin edge errors — and without overcomplicating the logic.
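To make the duplicate-edge problem concrete, here is a small sketch on invented Frequency values. The data is an assumption; the behaviour shown is that of pd.qcut with its default handling of duplicate edges.

```python
import pandas as pd

# Invented Frequency values: mostly one-time buyers.
frequency = pd.Series([1, 1, 1, 1, 1, 1, 1, 1, 2, 5])

# Binning the raw values fails because several quantile edges equal 1.
try:
    pd.qcut(frequency, q=5, labels=[1, 2, 3, 4, 5])
    raw_binning_failed = False
except ValueError as err:
    raw_binning_failed = True
    print("qcut on raw values failed:", err)

# Ranking first gives every customer a distinct position (1 to 10),
# so each of the five bins receives exactly two customers.
ranked = frequency.rank(method="first")
scores = pd.qcut(ranked, q=5, labels=[1, 2, 3, 4, 5]).astype(int)
print(scores.tolist())
```

One consequence worth knowing: with method='first', ties are broken by row order, so customers with identical values can land in neighbouring buckets. For a relative ranking that is usually an acceptable trade-off, but it is a choice rather than a free lunch.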
Step 2: Applying the Scores
Now we can score each metric cleanly and consistently:
# Assign R, F, M scores
rfm['R_Score'] = rfm_score(rfm['Recency'], ascending=False) # Recent purchases = high score
rfm['F_Score'] = rfm_score(rfm['Frequency']) # More frequent = high score
rfm['M_Score'] = rfm_score(rfm['Monetary']) # Higher spend = high score
The only special case here is Recency:
- Lower values mean more recent activity
- So we reverse the ranking with ascending=False
- Everything else follows the natural “higher is better” rule.
What This Means for NovaShop
Instead of seeing this:
Recency = 326
Frequency = 1
Monetary = 77,183.60
NovaShop now sees something like:
R = 1, F = 1, M = 5
That’s instantly more interpretable:
- Not recent
- Not frequent
- High spender (historically)
Step 3: Creating a Combined RFM Score
Now we combine these three scores into a single RFM code:
rfm['RFM_Score'] = (
rfm['R_Score'].astype(str) +
rfm['F_Score'].astype(str) +
rfm['M_Score'].astype(str)
)
This produces values like:
- 555 → Best customers
- 155 → Frequent, high-spending customers who haven’t returned recently
- 111 → Customers who are likely gone
Each customer now carries a compact behavioral fingerprint. And we’re not done yet.
Translating RFM Scores Into Customer Segments
Raw scores are nice, but let’s be honest: no marketing manager wants to look at 555, 154, or 311 all day.
NovaShop needs labels that make sense at a glance. That’s where RFM segments come in.
Step 1: Defining Segments
Using RFM scores, we can classify customers into meaningful categories. Here’s a common approach:
- Champions: Top Recency, top Frequency, top Monetary (e.g. 555); your best customers
- Loyal Customers: Regular buyers, may not be spending the most, but keep coming back
- Big Spenders: High Monetary, but not necessarily recent or frequent
- At-Risk: Used to buy, but haven’t returned recently
- Lost: Low scores in all three metrics — likely disengaged
- Promising / New: Recent customers with lower frequency or monetary spend
This transforms abstract numbers into a narrative that marketing and management can act on.
Step 2: Mapping Scores to Segments
Here’s an example using simple conditional logic:
def rfm_segment(row):
    if row['R_Score'] >= 4 and row['F_Score'] >= 4 and row['M_Score'] >= 4:
        return 'Champions'
    elif row['F_Score'] >= 4:
        return 'Loyal Customers'
    elif row['M_Score'] >= 4:
        return 'Big Spenders'
    elif row['R_Score']
Now each customer has a human-readable label, making it immediately actionable.
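For a version that runs end to end, here is a minimal sketch. The first three branches repeat the rules above; the cut-offs for Promising / New, Lost and At-Risk are assumptions chosen for illustration, as is the toy scores table.

```python
import pandas as pd

def rfm_segment(row):
    # These three branches follow the rules shown above.
    if row["R_Score"] >= 4 and row["F_Score"] >= 4 and row["M_Score"] >= 4:
        return "Champions"
    elif row["F_Score"] >= 4:
        return "Loyal Customers"
    elif row["M_Score"] >= 4:
        return "Big Spenders"
    # ASSUMPTION: the thresholds below are illustrative, not fixed rules.
    elif row["R_Score"] >= 4:
        return "Promising / New"
    elif row["R_Score"] <= 2 and row["F_Score"] <= 2 and row["M_Score"] <= 2:
        return "Lost"
    else:
        return "At-Risk"

# Invented scores for six hypothetical customers.
scores = pd.DataFrame(
    {
        "R_Score": [5, 2, 1, 5, 1, 3],
        "F_Score": [5, 4, 1, 1, 1, 2],
        "M_Score": [5, 2, 5, 2, 1, 3],
    },
    index=pd.Index([1, 2, 3, 4, 5, 6], name="CustomerID"),
)

# Apply the function row by row to label each customer.
scores["Segment"] = scores.apply(rfm_segment, axis=1)
print(scores)
```

Because the branches are checked in order, a customer who matches several rules gets the first label that applies. Reordering the branches changes the segments, so the order is worth agreeing on with whoever will use the labels.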
Let’s review our results using rfm.head()

Step 3: Turning Segments into Strategy
With labeled segments, NovaShop can:
- Reward Champions → Exclusive deals, loyalty points
- Re-engage Big Spenders & At-Risk customers → Personalized emails or discounts
- Focus marketing wisely → Don’t waste effort on customers who are truly lost
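To decide where effort goes first, it can help to size each segment. A minimal sketch, assuming a customer-level table with Segment and Monetary columns (the rows below are invented, and the summary and RevenueShare names are my own):

```python
import pandas as pd

# Invented customer-level table with segment labels already assigned.
rfm = pd.DataFrame({
    "Segment": ["Champions", "Champions", "At-Risk", "Lost", "Big Spenders"],
    "Monetary": [500.0, 700.0, 300.0, 40.0, 900.0],
})

# Customer count and total revenue per segment, largest revenue first.
summary = (
    rfm.groupby("Segment")
    .agg(Customers=("Monetary", "count"), Revenue=("Monetary", "sum"))
    .sort_values("Revenue", ascending=False)
)

# Each segment's share of total revenue.
summary["RevenueShare"] = summary["Revenue"] / summary["Revenue"].sum()
print(summary)
```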
This is the moment where data becomes strategy.
What NovaShop Should Do Next (Key Takeaways & Recommendations)
At the start of this analysis, NovaShop had a familiar problem:
A lot of transactional data, but limited clarity on customer behaviour.
By applying the RFM framework, we’ve turned raw purchase history into a clear, structured view of who NovaShop’s customers are — and how they behave.
Now let’s talk about what to actually do with it.
1. Protect and Reward Your Best Customers
Champions and Loyal Customers are already doing what every business wants:
- They buy recently
- They buy often
- They generate consistent revenue
These customers don’t need heavy discounts — they need recognition.
Recommended actions:
- Early access to sales
- Loyalty points or VIP tiers
- Personalized thank-you emails
The goal here isn’t acquisition, it’s retention.
2. Re-Engage High-Value Customers Before They’re Lost
The most dangerous segment for NovaShop isn’t “Lost” customers.
It’s At-Risk and Big Spenders.
These customers:
- Have shown clear value in the past
- But haven’t purchased recently
- Are one step away from churning completely
Recommended actions:
- Targeted win-back campaigns
- Personalized offers (not blanket discounts)
- Reminder emails tied to past purchase behavior
Winning back an existing customer is almost always cheaper than acquiring a new one.
3. Don’t Over-Invest in Truly Lost Customers
Some customers will inevitably churn. RFM helps NovaShop identify those customers early and avoid spending ad budget, discounts and marketing effort on users who are unlikely to return. This isn’t about being cold — it’s about being efficient.
4. Use RFM as a Living Framework, Not a One-Off Analysis
The real power of RFM comes when it’s:
- Recomputed monthly or quarterly
- Integrated into dashboards
- Used to track movement between segments over time
For NovaShop, this means asking questions like:
- How many At-Risk customers became Loyal this month?
- Are Champions increasing or shrinking?
- Which campaigns actually move customers up the ladder?
RFM turns customer behaviour into something measurable and trackable.
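One way to track that movement is a simple transition table between two snapshots. This is a sketch under assumptions: the customer IDs and labels are invented, and it presumes segments were computed at two dates and stored per customer.

```python
import pandas as pd

# Invented segment labels for the same customers at two snapshot dates.
previous = pd.Series(
    ["At-Risk", "At-Risk", "Champions", "Loyal Customers"],
    index=[101, 102, 103, 104],
    name="Previous",
)
current = pd.Series(
    ["Loyal Customers", "Lost", "Champions", "Champions"],
    index=[101, 102, 103, 104],
    name="Current",
)

# Rows = where customers were, columns = where they are now.
movement = pd.crosstab(previous, current)
print(movement)

# How many At-Risk customers became Loyal between the two snapshots?
at_risk_to_loyal = movement.loc["At-Risk", "Loyal Customers"]
print(at_risk_to_loyal)
```

Reading along a row shows where a segment's customers ended up, which maps directly onto the questions above.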
Final Thoughts: Closing the EDA in Public Series
When I started this EDA in Public series, I wasn’t trying to build the perfect analysis or demonstrate advanced techniques. I wanted to slow down and share how I actually think when working with real data. Not the polished version, but the messy, iterative process that usually stays hidden.
This project began with a noisy CSV and a lot of open questions. Along the way, there were small issues that only surfaced once I paid closer attention — dates stored as strings, assumptions that didn’t quite hold up, metrics that needed context before they made sense. Working through those moments in public was uncomfortable at times, but also genuinely valuable. Each correction made the analysis stronger and more honest.
One thing this process reinforced for me is that most meaningful insights don’t come from complexity. They come from slowing down, structuring the data properly, and asking better questions. By the time I reached the RFM analysis, the value wasn’t in the formulas themselves — it was in what they forced me to confront. A customer who spent a lot once isn’t necessarily valuable today. Recency matters. Frequency matters. And none of these metrics mean much in isolation.
Ending the series with RFM felt deliberate. It sits at the point where technical work meets business thinking, where tables turn into conversations and numbers turn into decisions. It’s also where exploratory analysis stops being purely descriptive and starts becoming practical. At that stage, the goal is no longer just to understand the data, but to decide what to do next.
Doing this work in public changed how I approach analysis. Writing things out forced me to explain my reasoning, question my assumptions, and be comfortable showing imperfect work. It reminded me that EDA isn’t a checklist you rush through — it’s a dialogue with the data. Sharing that dialogue makes you more thoughtful and more accountable.
This may be the final part of the EDA in Public series, but it doesn’t feel like an endpoint. Everything here could evolve into dashboards, automated pipelines, or deeper customer analysis.
And if you’re a founder, analyst, or team working with customer or sales data and trying to make sense of it, this kind of exploratory work is often where the biggest clarity comes from. These are exactly the kinds of problems I enjoy working through — slowly, thoughtfully, and with the business context in mind.
If you’re documenting your own analyses, I’d love to see how you approach it. And if you’re wrestling with similar questions in your data and want to talk through them, feel free to reach out on any of the platforms below. Good data conversations usually start there.
Thanks for following along!