To get the most out of this tutorial, you should have a solid understanding of how to compare two distributions. If you don’t, I recommend checking out this excellent article by @matteo-courthoud.
We automate the analysis and export the results to an Excel file using Python. If you already know the basics of Python and how to write to Excel, that will make things even easier.
I would like to thank everyone who took the time to read and engage with my article. Your support and feedback mean a lot.
In any data-driven project, whether academic or professional, the question of data representativeness between two samples arises frequently.
By representativeness, we mean the degree to which two samples resemble each other or share the same characteristics. This concept is essential, as it directly determines the accuracy of statistical conclusions or the performance of a predictive model.
At each stage of a model’s life cycle, the issue of data representativeness takes specific forms:
- During the construction phase: this is where it all begins. You gather the data, clean it, split it into training, test, and out-of-time samples, estimate the parameters, and carefully document every decision. You ensure that the test and the out-of-time samples are representative of the training data.
- In the application phase: once the model is built, it must be confronted with reality. Here a crucial question arises: do the new datasets truly resemble the ones used during construction? If not, much of the previous work may quickly lose its value.
- In the monitoring phase, or backtesting: over time, populations evolve. The model must therefore be regularly challenged. Do its predictions remain valid? Is the representativeness of the target portfolio still ensured?
Representativeness is therefore not a one-off constraint, but an issue that accompanies the model throughout its life cycle.
To answer the question of representativeness between two samples, the most common approach is to compare their distributions, proportions, and structures. This involves visual tools such as density functions, histograms, and boxplots, supplemented by statistical tests such as Student’s t-test, the Kruskal-Wallis test, the Wilcoxon test, or the Kolmogorov-Smirnov test. On this subject, @matteo-courthoud has published a great article, complete with practical code, to which we refer the reader for further information.
In this article, we will focus on two practical tools often used in credit risk management to check whether two datasets are comparable:
- The Population Stability Index (PSI) shows how much a distribution shifts, either over time or between two samples.
- Cramér’s V measures the strength of association between categories, helping us see if two populations share a similar structure.
We will then explore how these tools can help engineers and decision-makers by transforming statistical comparisons into clear data for faster and more reliable decisions.
In Section 1 of this article, we present two concrete examples where questions of representativeness between samples may arise. In Section 2, we evaluate representativeness between two datasets using PSI and Cramér’s V. Finally, in Section 3, we demonstrate how to implement and automate these analyses in Python, exporting the results into an Excel file.
1. Two real-world examples of the representativeness challenge
The issue of representativeness becomes important when a model is applied to a domain other than the one for which it was developed. Two typical situations illustrate this challenge:
1.1 When a model is applied to a new scope of clients
Imagine a bank developing a scoring model for small businesses. The model performs well and is recognized internally. Encouraged by this success, the leadership decides to extend its use to large corporations. Your manager asks for your opinion on the approach. What steps do you take before responding?
Since the development and application populations differ, using the model on the new population extends its scope. It is therefore crucial to confirm that this application is valid.
The statistician has several tools to address this question, in particular representativeness analysis comparing the development population with the application population. This can be done by examining their characteristics variable by variable, for example through tests of mean equality, tests of distribution equality, or by comparing the distribution of categorical variables.
1.2 When two banks merge and need to align their risk models
Now consider Bank A, a large institution with a substantial balance sheet and a proven model to assess client default risk. Bank A is studying the possibility of merging with Bank B. Bank B, however, operates in a weaker economic environment and has not developed its own internal model.
Suppose Bank A’s management approaches you, as the statistician responsible for its internal models. The strategic question is: would it be appropriate to apply Bank A’s internal models to Bank B’s portfolio in the event of a merger?
Before applying Bank A’s internal model to Bank B’s portfolio, it is crucial to compare the distributions of key variables across both portfolios. The model can only be transferred with confidence if the two populations are truly representative of each other.
We have just presented two concrete cases where verifying representativeness is essential for sound decision-making. In the next section, we address how to analyze representativeness between two portfolios by introducing two statistical tools: the Population Stability Index (PSI) and Cramér’s V.
2. Comparing Distributions to Assess Representativeness Between Two Populations Using the Population Stability Index (PSI) and Cramér’s V
In practice, the study of representativeness between two datasets consists of comparing the characteristics of the observed variables in both samples. This comparison relies on both statistical measures and visual tools.
From a statistical perspective, analysts often examine measures of central tendency (mean, median) and dispersion (variance, standard deviation), as well as more granular indicators such as quantiles.
On the visual side, common tools include histograms, boxplots, cumulative distribution functions, density curves, and QQ-plots. These visualizations help detect potential differences in shape, location, or dispersion between two distributions.
Such graphical analyses provide an essential first step: they guide the investigation and help formulate hypotheses. However, they must be complemented by statistical tests to confirm observations and reach rigorous conclusions. These tests include:
- Parametric tests, such as Student’s t-test (comparison of means) and Welch’s t-test (comparison of means under unequal variances),
- Nonparametric tests, such as the Kolmogorov–Smirnov test (comparison of distributions) and the chi-squared test (for categorical variables).
These approaches are well presented in the article by @matteo-courthoud. Beyond them, two indicators are particularly relevant in credit risk analysis for assessing distributional drift between populations and supporting decision-making: the Population Stability Index (PSI) and Cramér’s V.
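Before turning to them, here is a minimal sketch of how the classical tests above can be run with scipy; x_ref and x_new are hypothetical arrays standing in for the same variable observed in two samples:

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x_ref = rng.normal(0.0, 1.0, size=1000)  # variable in the reference sample
x_new = rng.normal(0.1, 1.2, size=1000)  # same variable in the new sample

# Parametric: Student's t-test, and Welch's t-test for unequal variances
print(stats.ttest_ind(x_ref, x_new))
print(stats.ttest_ind(x_ref, x_new, equal_var=False))

# Nonparametric: Kolmogorov-Smirnov test on the full distributions
print(stats.ks_2samp(x_ref, x_new))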
2.1. The Population Stability Index (PSI)
The PSI is a fundamental tool in the credit industry. It measures the difference between two distributions of the same variable:
- for example, between the training dataset and a more recent application dataset,
- or between a reference dataset at time T0 and another at time T1.
In other words, the PSI quantifies how much a population has drifted over time or across different scopes.
Here’s how it works in practice:
- For a categorical variable, we compute the proportion of observations in each category for both datasets.
- For a continuous variable, we first discretize it into bins. In practice, deciles are often used to obtain a balanced distribution.
The PSI then compares, bin by bin, the proportions observed in the reference dataset versus the target dataset. The final indicator aggregates these differences using a logarithmic formula:

PSI = Σᵢ (pᵢ − qᵢ) · ln(pᵢ / qᵢ)

Here, pᵢ and qᵢ represent the proportions in bin i for the reference dataset and the target dataset, respectively. The PSI can also be computed easily in an Excel file.
The interpretation is highly intuitive:
- A smaller PSI means the two distributions are closer.
- A PSI of 0 means the distributions are identical.
- A very large PSI (tending toward infinity) means the two distributions are fundamentally different.
In practice, industry guidelines often use the following thresholds:
- PSI < 0.1: the population is stable,
- 0.1 ≤ PSI < 0.25: the shift is noticeable—monitor closely,
- PSI ≥ 0.25: the shift is significant—the model may no longer be reliable.
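To make these numbers concrete, here is a minimal sketch of the computation on made-up bin proportions (the values are purely illustrative):

import numpy as np

# Hypothetical proportions per bin in the reference (p) and target (q) samples
p = np.array([0.20, 0.20, 0.20, 0.20, 0.20])
q = np.array([0.22, 0.19, 0.21, 0.18, 0.20])

psi = np.sum((p - q) * np.log(p / q))
print(f"PSI = {psi:.4f}")  # about 0.005, well below 0.1 -> stable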
2.2. Cramér’s V
When assessing the representativeness of a categorical variable (or a discretized continuous variable) between two datasets, a natural starting point is the Chi-square test of independence.
We build a contingency table crossing:
- the categories (modalities) of the variable of interest, and
- an indicator variable for dataset membership (Dataset 1 / Dataset 2).
The test is based on the following statistic:

χ² = Σᵢⱼ (Oᵢⱼ − Eᵢⱼ)² / Eᵢⱼ

where Oᵢⱼ are the observed counts and Eᵢⱼ are the expected counts under the assumption of independence.
- Null hypothesis H0: the variable has the same distribution in both datasets (independence).
- Alternative hypothesis H1: the distributions differ.
If H0 is rejected, we conclude that the variable does not follow the same distribution across the two datasets.
However, the Chi-square test has a major limitation: it only provides a binary answer (reject / do not reject), and its power is highly sensitive to sample size. With very large datasets, even tiny differences can appear statistically significant.
To address this limitation, we use Cramér’s V, which rescales the Chi-square statistic to produce a normalized measure of association bounded between 0 and 1:

V = √( χ² / (n · min(r − 1, c − 1)) )

where n is the total sample size, r is the number of rows, and c is the number of columns in the contingency table.
The interpretation is intuitive:
- V≈0 ⇒ The distributions are very similar; representativeness is strong.
- V→1 ⇒ The difference between distributions is large; the datasets are structurally different.
Unlike the Chi-square test, which simply answers “yes” or “no,” Cramér’s V provides a graded measure of the strength of the difference. This allows us to assess whether the difference is negligible, moderate, or substantial.
We use the same thresholds as those applied for the PSI to draw our conclusions. For both indicators, if the distribution of one or more variables differs significantly between the two datasets, we conclude that the datasets are not representative of each other.
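As an illustration, here is a minimal sketch computing Cramér’s V from a made-up 2 × 5 contingency table (rows = dataset membership, columns = segments; the counts are purely illustrative):

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: reference vs. target counts per segment
table = np.array([
    [400, 390, 410, 405, 395],  # reference dataset
    [ 42,  38,  41,  40,  39],  # target dataset
])

chi2, p_value, dof, expected = chi2_contingency(table, correction=False)
n = table.sum()
r, c = table.shape
v = np.sqrt(chi2 / (n * min(r - 1, c - 1)))
print(f"Chi2 = {chi2:.2f}, p-value = {p_value:.3f}, Cramér's V = {v:.4f}")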
3. Measuring Representativeness with PSI and Cramér’s V in Python
In a previous article, we applied different variable selection methods to reduce the Communities & Crime dataset to just 16 explanatory variables. This step was essential to simplify the model while keeping the most relevant information.
This dataset also includes a variable called fold, which splits the data into 10 subsamples. These folds are commonly used in cross-validation: they allow us to test the robustness of a model by training it on one part of the data and validating it on another. For cross-validation to be reliable, each fold should be representative of the global dataset:
- To ensure valid performance estimates.
- To prevent bias: a non-representative fold can distort model results
- To support generalization: representative folds provide a better indication of how the model will perform on new data.
In this example, we will focus on checking whether fold 1 is representative of the global dataset using our two indicators, PSI and Cramér’s V, by comparing the distribution of each of the 16 variables across the two samples. We will proceed in two steps:
Step 1: Start with the Target Variable
We begin with the target variable. The idea is simple: compare its distribution between fold 1 and the entire dataset. To quantify this difference, we’ll use two complementary indicators:
- the Population Stability Index (PSI), which measures distributional shifts,
- Cramér’s V, which measures the strength of association between two categorical variables.
Step 2: Automating the Analysis for All Variables
After illustrating the approach with the target, we extend it to all features. We’ll build a Python function that computes PSI and Cramér’s V for each of the 16 explanatory variables, as well as for the target variable.
To make the results easy to interpret, we’ll export everything into an Excel file with:
- one sheet per variable, showing the detailed comparison by segment,
- a Summary tab, aggregating results across all variables.
3.1 Comparing the target variable ViolentCrimesPerPop between the global dataset (reference) and fold 1 (target)
Before applying statistical tests or building decision indicators, it is essential to conduct a descriptive and graphical analysis. These are not just formalities; they provide an early intuition about the differences between populations and help interpret the results. In practice, a well-chosen chart often reveals the conclusions that indicators like PSI or Cramér’s V will later confirm (or challenge).
For visualization, we proceed in three steps:
1. Comparing continuous distributions. We begin with graphical tools such as boxplots, cumulative distribution functions, and probability density plots. These visualizations provide an intuitive way to examine differences in the target variable’s distribution between the two datasets.
2. Discretization into quantiles. Next, we discretize the variable in the reference dataset using quintile cut-off points, which creates five classes (Q1 through Q5). We then apply the exact same cut-off points to the target dataset, ensuring that each observation is mapped to intervals defined from the reference. This guarantees comparability between the two distributions.
3. Comparing categorical distributions. Finally, once the variable has been discretized, we can use visualization methods suited for categorical data — such as bar charts — to compare how frequencies are distributed across the two datasets.
The process depends on the type of variable:
For a continuous variable:
- Start with standard visualizations (boxplots, cumulative distributions, and density plots).
- Next, split the variable into segments (Q1 to Q5) based on the reference dataset’s quantiles.
- Finally, treat these segments as categories and compare their distributions.
For a categorical variable:
- No discretization is needed — it’s already in categorical form.
- Go straight to comparing category distributions, for example with a bar chart.
The code below prepares the two datasets we want to compare and then visualizes the target variable with a boxplot, showing its distribution in both the global dataset and in fold 1.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import chi2_contingency, ks_2samp
data = pd.read_csv("communities_data.csv")
# reference = the full dataset; target = the subset where fold == 1
data_ref = data
data_target = data[data["fold"] == 1]
# compare the two distributions of "ViolentCrimesPerPop" in the reference and target datasets with boxplots
# Build datasets with a "Group" column
df_ref = pd.DataFrame({
"ViolentCrimesPerPop": data_ref["ViolentCrimesPerPop"],
"Group": "Reference"
})
df_target = pd.DataFrame({
"ViolentCrimesPerPop": data_target["ViolentCrimesPerPop"],
"Group": "Target"
})
# Merge them
df_all = pd.concat([df_ref, df_target])
plt.figure(figsize=(8, 6))
# Boxplot with both distributions overlayed
sns.boxplot(
x="Group",
y="ViolentCrimesPerPop",
data=df_all,
palette="Set2",
width=0.6,
fliersize=3
)
# Add mean points
means = df_all.groupby("Group")["ViolentCrimesPerPop"].mean()
for i, m in enumerate(means):
plt.scatter(i, m, color="red", marker="D", s=50, zorder=3, label="Mean" if i == 0 else "")
# Title tells the story
plt.title("Violent Crimes Per Population by Group", fontsize=14, weight="bold")
plt.suptitle("Both groups show nearly identical distributions",
fontsize=10, color="gray")
plt.ylabel("Violent Crimes (Per Pop)", fontsize=12)
plt.xlabel("")
# Cleaner look
sns.despine()
plt.grid(False)
plt.legend()
plt.show()
The figure above suggests that both groups share similar distributions for the ViolentCrimesPerPop variable. To take a closer look, we can use Kernel Density Estimation (KDE) plots, which provide a smooth view of the underlying distribution and make it easier to spot subtle differences.
plt.figure(figsize=(8, 6))
# KDE plots with better styling
sns.kdeplot(
data=df_all,
x="ViolentCrimesPerPop",
hue="Group",
fill=True, # use shading for overlap
alpha=0.4, # transparency to show overlap
common_norm=False,
palette="Set2",
linewidth=2
)
# KS-test for distribution difference
g1 = df_all[df_all["Group"] == df_all["Group"].unique()[0]]["ViolentCrimesPerPop"]
g2 = df_all[df_all["Group"] == df_all["Group"].unique()[1]]["ViolentCrimesPerPop"]
stat, pval = ks_2samp(g1, g2)
# Add annotation
plt.text(df_all["ViolentCrimesPerPop"].mean(),
plt.ylim()[1]*0.9,
f"KS-test p-value = {pval:.3f}\nNo significant difference observed",
ha="center", fontsize=10, color="black")
# Titles with story
plt.title("Kernel Density Estimation of Violent Crimes Per Population", fontsize=14, weight="bold")
plt.suptitle("Distributions overlap almost completely between groups", fontsize=10, color="gray")
plt.xlabel("Violent Crimes (Per Pop)")
plt.ylabel("Density")
sns.despine()
plt.grid(False)
plt.show()
The KDE plot confirms that the two distributions are very similar, showing a high degree of overlap. The Kolmogorov-Smirnov (KS) test p-value of 0.976 also indicates that there is no significant difference between the two groups. To extend the analysis, we can now examine the cumulative distribution of the target variable.
# Cumulative distribution
plt.figure(figsize=(9, 6))
sns.histplot(
data=df_all,
x="ViolentCrimesPerPop",
hue="Group",
stat="density",
common_norm=False,
fill=False,
element="step",
bins=len(df_all),
cumulative=True,
)
# Titles tell the story
plt.title("Cumulative Distribution of Violent Crimes Per Population", fontsize=14, weight="bold")
plt.suptitle("ECDFs overlap extensively; central tendencies are nearly identical", fontsize=10)
# Labels & cleanup
plt.xlabel("Violent Crimes (Per Pop)")
plt.ylabel("Cumulative proportion")
plt.grid(visible=False)
plt.show()
The cumulative distribution plot provides additional evidence that the two groups are very similar. The curves overlap almost completely, suggesting that their distributions are nearly identical in both central tendency and spread.
As a next step, we’ll discretize the variable into quantiles in the reference dataset and then apply the same cut-off points to the target dataset (fold 1). The code below demonstrates how to do this. Finally, we’ll compare the resulting distributions using a bar chart.
def bin_numeric(ref, tgt, n_bins=5):
    """
    Discretize a numeric variable into quantile bins (ex: quintiles).
    - Quantile thresholds are computed only on the reference dataset.
    - Extend bins with -inf and +inf to cover all possible values.
    - Returns:
        * ref binned
        * tgt binned
        * bin labels (Q1, Q2, ...)
    """
    edges = np.unique(ref.dropna().quantile(np.linspace(0, 1, n_bins + 1)).values)
    if len(edges) < 2:
        # Degenerate case: the variable is (almost) constant -> a single bin
        edges = np.array([-np.inf, np.inf])
    else:
        edges = edges.astype(float)
        edges[0], edges[-1] = -np.inf, np.inf
    labels = [f"Q{i + 1}" for i in range(len(edges) - 1)]
    ref_binned = pd.cut(ref, bins=edges, labels=labels, include_lowest=True)
    tgt_binned = pd.cut(tgt, bins=edges, labels=labels, include_lowest=True)
    return ref_binned, tgt_binned, labels
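Using this helper, the bar chart comparing the discretized distributions can be produced with a minimal sketch like the one below (reusing data_ref and data_target defined earlier):

# Discretize the target variable with quintiles defined on the reference
ref_binned, tgt_binned, labels = bin_numeric(
    data_ref["ViolentCrimesPerPop"], data_target["ViolentCrimesPerPop"], n_bins=5
)

# Proportion of observations falling in each segment, per dataset
props = pd.DataFrame({
    "Reference": ref_binned.value_counts(normalize=True).reindex(labels),
    "Target": tgt_binned.value_counts(normalize=True).reindex(labels),
})

props.plot(kind="bar", figsize=(8, 5))
plt.title("Share of observations per segment (Q1-Q5)")
plt.ylabel("Proportion")
plt.xlabel("Segment")
plt.xticks(rotation=0)
plt.show()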
As before, we reach the same conclusion: the distributions in the reference and target datasets are very similar. To move beyond visual inspection, we will now compute the Population Stability Index (PSI) and Cramér’s V statistic. These metrics allow us to quantify the differences between distributions, both for all variables in general and for the target variable ViolentCrimesPerPop in particular.
3.2 Automating the Analysis for All Variables
As mentioned earlier, the results of the distribution comparisons for each variable between the two datasets, calculated using PSI and Cramér’s V, are presented in separate sheets within a single Excel file.
To illustrate, we begin by examining the results for the target variable ViolentCrimesPerPop when comparing the global dataset (reference) with fold 1 (target). Table 1 below summarizes how both PSI and Cramér’s V are computed.
Since both PSI and Cramér’s V are below 0.1, we can conclude that the target variable ViolentCrimesPerPop follows the same distribution in both datasets.
The code that generated this table is shown below. The same code can also be used to produce results for all variables and export them into an Excel file called representativity.xlsx.
EPS = 1e-12 # A very small constant to avoid division by zero or log(0)
# ============================================================
# 1. Basic functions
# ============================================================
def safe_proportions(counts):
"""
Convert raw counts into proportions in a safe way.
- If the total count = 0, return all zeros (to avoid division by zero).
- Clip values so no proportion is exactly 0 or 1 (numerical stability).
"""
total = counts.sum()
if total == 0:
return np.zeros_like(counts, dtype=float)
p = counts / total
return np.clip(p, EPS, 1.0)
def calculate_psi(p_ref, p_tgt):
"""
Compute the Population Stability Index (PSI) between two distributions.
PSI = sum( (p_ref - p_tgt) * log(p_ref / p_tgt) )
Interpretation:
    - PSI < 0.1 → stable
    - 0.1 ≤ PSI < 0.25 → noticeable shift
    - PSI ≥ 0.25 → major shift
"""
p_ref = np.clip(p_ref, EPS, 1.0)
p_tgt = np.clip(p_tgt, EPS, 1.0)
return float(np.sum((p_ref - p_tgt) * np.log(p_ref / p_tgt)))
def calculate_cramers_v(contingency):
"""
Compute Cramér's V statistic for association between two categorical variables.
- Input: a 2 x K contingency table (counts).
- Uses Chi² test.
- Normalizes the result to [0, 1].
* 0 → no association
* 1 → perfect association
"""
    # Guard against degenerate tables before running the Chi2 test
    n = contingency.sum()
    r, c = contingency.shape
    if n == 0 or min(r - 1, c - 1) == 0:
        return 0.0
    chi2, _, _, _ = chi2_contingency(contingency, correction=False)
    return float(np.sqrt(chi2 / (n * min(r - 1, c - 1))))
# ============================================================
# 2. Preparing variables
# ============================================================
def bin_numeric(ref, tgt, n_bins=5):
"""
Discretize a numeric variable into quantile bins (ex: quintiles).
- Quantile thresholds are computed only on the reference dataset.
- Extend bins with -inf and +inf to cover all possible values.
- Returns:
* ref binned
* tgt binned
* bin labels (Q1, Q2, ...)
"""
edges = np.unique(ref.dropna().quantile(np.linspace(0, 1, n_bins + 1)).values)
    if len(edges) < 2:
        # Degenerate case: the variable is (almost) constant -> a single bin
        edges = np.array([-np.inf, np.inf])
    else:
        edges = edges.astype(float)
        edges[0], edges[-1] = -np.inf, np.inf
    labels = [f"Q{i + 1}" for i in range(len(edges) - 1)]
    ref_binned = pd.cut(ref, bins=edges, labels=labels, include_lowest=True)
    tgt_binned = pd.cut(tgt, bins=edges, labels=labels, include_lowest=True)
    return ref_binned, tgt_binned, labels

# ============================================================
# 3. Excel formatting helpers
# ============================================================
def apply_traffic_light(ws, wb, first_row, last_row, col, low, high):
    """
    Apply traffic-light conditional formatting to one column:
    - value < low → green
    - low ≤ value < high → orange
    - value ≥ high → red
    Note: first_row, last_row, and col are zero-based indices (xlsxwriter convention).
    """
    green = wb.add_format({"bg_color": "#C6EFCE", "font_color": "#006100"})
    orange = wb.add_format({"bg_color": "#FCD5B4", "font_color": "#974706"})
    red = wb.add_format({"bg_color": "#FFC7CE", "font_color": "#9C0006"})
    if last_row < first_row:
        return
    ws.conditional_format(first_row, col, last_row, col,
                          {"type": "cell", "criteria": "<", "value": low, "format": green})
    ws.conditional_format(first_row, col, last_row, col,
                          {"type": "cell", "criteria": "between", "minimum": low,
                           "maximum": high, "format": orange})
    ws.conditional_format(first_row, col, last_row, col,
                          {"type": "cell", "criteria": ">", "value": high, "format": red})
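# ============================================================
# 4. Per-variable analysis (minimal sketch)
# ============================================================
def analyze_variable(ref, tgt, n_bins=5):
    """
    Minimal sketch of the per-variable comparison consumed by
    representativity_report below (an assumption consistent with the
    column layout and row order expected by the Excel formatting code):
    - one row per segment with counts, proportions, and PSI contribution,
    - then a Global PSI row and a Cramer's V row at the bottom.
    """
    if pd.api.types.is_numeric_dtype(ref):
        ref_b, tgt_b, labels = bin_numeric(ref, tgt, n_bins)
    else:
        labels = sorted(set(ref.dropna()) | set(tgt.dropna()))
        ref_b, tgt_b = ref, tgt
    ref_counts = ref_b.value_counts().reindex(labels, fill_value=0).to_numpy(dtype=float)
    tgt_counts = tgt_b.value_counts().reindex(labels, fill_value=0).to_numpy(dtype=float)
    p_ref = safe_proportions(ref_counts)
    p_tgt = safe_proportions(tgt_counts)
    psi_seg = (p_ref - p_tgt) * np.log(p_ref / p_tgt)
    psi = float(psi_seg.sum())
    # Drop segments that are empty in both datasets before the Chi2 step
    contingency = np.vstack([ref_counts, tgt_counts])
    contingency = contingency[:, contingency.sum(axis=0) > 0]
    v = calculate_cramers_v(contingency)
    df = pd.DataFrame({
        "Segment": labels,
        "Count Ref": ref_counts,
        "Count Target": tgt_counts,
        "% Ref": p_ref,
        "% Target": p_tgt,
        "PSI by Segment": psi_seg,
    })
    # Summary rows expected by the formatting code (NaN cells export as blanks)
    df.loc[len(df)] = ["Global PSI", np.nan, np.nan, np.nan, np.nan, psi]
    df.loc[len(df)] = ["Cramer's V", np.nan, np.nan, np.nan, np.nan, v]
    return df, {"psi": psi, "v_cramer": v}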
def representativity_report(ref_df, tgt_df, variables, output="representativity.xlsx",
n_bins=5, psi_thresholds=(0.10, 0.25),
v_thresholds=(0.10, 0.25), color_summary=True):
"""
Build a representativity report across multiple variables and export to Excel.
For each variable:
- Create a sheet with detailed PSI by segment, Global PSI, and Cramer's V.
- Apply traffic light colors for easier interpretation.
Create one "Résumé" sheet with overall Global PSI and Cramer's V for all variables.
"""
summary = []
with pd.ExcelWriter(output, engine="xlsxwriter") as writer:
wb = writer.book
fmt_header = wb.add_format({"bold": True, "bg_color": "#0070C0",
"font_color": "white", "align": "center"})
fmt_pct = wb.add_format({"num_format": "0.00%"})
fmt_ratio = wb.add_format({"num_format": "0.000"})
fmt_int = wb.add_format({"num_format": "0"})
for var in variables:
# Analyze variable
df, meta = analyze_variable(ref_df[var], tgt_df[var], n_bins)
sheet = var[:31] # Excel sheet names are limited to 31 characters
df.to_excel(writer, sheet_name=sheet, index=False)
ws = writer.sheets[sheet]
# Format headers and columns
for j, col in enumerate(df.columns):
ws.write(0, j, col, fmt_header)
ws.set_column(0, 0, 18)
ws.set_column(1, 2, 16, fmt_int)
ws.set_column(3, 4, 20, fmt_pct)
ws.set_column(5, 5, 18, fmt_ratio)
nrows = len(df) # number of data rows (excluding header)
col_psi = 5 # "PSI by Segment" column index
# PSI by Segment rows
apply_traffic_light(ws, wb, first_row=1, last_row=max(1, nrows-2),
col=col_psi, low=psi_thresholds[0], high=psi_thresholds[1])
# Global PSI row (second to last)
apply_traffic_light(ws, wb, first_row=nrows-1, last_row=nrows-1,
col=col_psi, low=psi_thresholds[0], high=psi_thresholds[1])
# Cramer's V row (last row)
apply_traffic_light(ws, wb, first_row=nrows, last_row=nrows,
col=col_psi, low=v_thresholds[0], high=v_thresholds[1])
            # Add summary info for the Summary sheet
summary.append({"Variable": var,
"Global PSI": meta["psi"],
"Cramer's V": meta["v_cramer"]})
        # Summary sheet
        df_sum = pd.DataFrame(summary)
        df_sum.to_excel(writer, sheet_name="Summary", index=False)
        ws = writer.sheets["Summary"]
for j, col in enumerate(df_sum.columns):
ws.write(0, j, col, fmt_header)
ws.set_column(0, 0, 28)
ws.set_column(1, 2, 16, fmt_ratio)
# Apply traffic light to summary sheet
if color_summary and len(df_sum) > 0:
last = len(df_sum)
# PSI column
apply_traffic_light(ws, wb, 1, last, 1, psi_thresholds[0], psi_thresholds[1])
# Cramer's V column
apply_traffic_light(ws, wb, 1, last, 2, v_thresholds[0], v_thresholds[1])
return output
# ============================================================
# Example
# ============================================================
if __name__ == "__main__":
    # column names, excluding fold
columns = [x for x in data.columns if x != "fold"]
# Generate the report
path = representativity_report(data_ref, data_target, columns, output="representativity.xlsx")
print(f" Report generated: {path}")
Finally, Table 2 shows the last sheet of the file, titled Summary, which brings together the results for all variables of interest.
This synthesis provides an overall view of representativeness between the two datasets, making interpretation and decision-making much easier. Since both PSI and Cramér’s V are below 0.1, we can conclude that all variables follow the same distribution in the global dataset and in fold 1. Therefore, fold 1 can be considered representative of the global dataset.
Conclusion
In this post, we explored how to study representativeness between two datasets by comparing the distributions of their variables. We introduced two key indicators, the Population Stability Index (PSI) and Cramér’s V, which are both easy to use, easy to interpret, and highly valuable for decision-making.
We also showed how these analyses can be automated, with results saved directly into an Excel file.
The main takeaway is this: if you build a model and end up with overfitting, one possible reason may be that your training and test sets are not representative of each other. A simple way to prevent this is to always run a representativeness analysis between datasets. Variables that show representativeness issues can then guide you in stratifying your data when splitting it into training and test sets, as sketched below.
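For instance, here is a minimal sketch of such a stratified split, assuming the Communities & Crime data loaded earlier; the choice of ViolentCrimesPerPop as the stratification variable is purely illustrative:

from sklearn.model_selection import train_test_split

# Stratify on quintiles of a variable that showed representativeness issues
# (ViolentCrimesPerPop is a hypothetical choice here, for illustration only)
strata = pd.qcut(data["ViolentCrimesPerPop"], q=5, labels=False, duplicates="drop")
train_df, test_df = train_test_split(
    data, test_size=0.3, stratify=strata, random_state=42
)

What about you? In what situations do you study representativeness between two datasets, for what reasons, and using what methods?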
References
Yurdakul, B. (2018). Statistical properties of population stability index. Western Michigan University.
Redmond, M. (2002). Communities and Crime [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C53W3X.
Data & Licensing
The dataset used in this article is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
This license allows anyone to share and adapt the dataset for any purpose, including commercial use, provided that proper attribution is given to the source.
For more details, see the official license text: CC BY 4.0.
Disclaimer
I write to learn, so mistakes are the norm, even though I try my best. Please let me know when you spot them. I also appreciate suggestions for new topics!