We discussed classification metrics like ROC-AUC and the Kolmogorov-Smirnov (KS) Statistic in previous blogs.
In this blog, we will explore another important classification metric called the Gini Coefficient.
Why do we have multiple classification metrics?
Every classification metric tells us the model performance from a different angle. We know that ROC-AUC gives us the overall ranking ability of a model, while KS Statistic shows us where the maximum gap between two groups occurs.
When it comes to the Gini Coefficient, it tells us how much better our model is than random guessing at ranking the positives higher than the negatives.
First, let’s see how the Gini Coefficient is calculated.
For this, we again use the German Credit Dataset.
Let’s use the same sample data that we used to understand the calculation of Kolmogorov-Smirnov (KS) Statistic.
This sample data was obtained by applying logistic regression on the German Credit dataset.
Since the model outputs probabilities, we selected a sample of 10 points from those probabilities to demonstrate the calculation of the Gini coefficient.
Calculation
Step 1: Sort the data by predicted probabilities.
The sample data is already sorted in descending order by predicted probabilities.
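As a minimal sketch (using the same 10 sample points that appear later in the ROC-AUC snippet), the sort looks like this in pandas:
import pandas as pd
# Sample data: actual classes and predicted probabilities for class 2
df = pd.DataFrame({
    "Actual": [2, 2, 2, 1, 2, 1, 1, 1, 1, 1],
    "Pred_Prob_Class2": [0.92, 0.63, 0.51, 0.39, 0.29, 0.20, 0.13, 0.10, 0.05, 0.01]
})
# Sort by predicted probability, highest first
df = df.sort_values("Pred_Prob_Class2", ascending=False).reset_index(drop=True)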
Step 2: Compute Cumulative Population and Cumulative Positives.
Cumulative Population: The cumulative number of records considered up to that row.
Cumulative Population (%): The percentage of the total population covered so far.
Cumulative Positives: How many actual positives (class 2) we’ve seen up to this point.
Cumulative Positives (%): The percentage of positives captured so far.
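Continuing the sketch above, these columns can be built with cumulative sums (the column names here are just illustrative):
# Flag positives (class 2), then accumulate counts and percentages
df["Is_Positive"] = (df["Actual"] == 2).astype(int)
df["Cum_Population"] = range(1, len(df) + 1)
df["Cum_Population_Pct"] = df["Cum_Population"] / len(df)
df["Cum_Positives"] = df["Is_Positive"].cumsum()
df["Cum_Positives_Pct"] = df["Cum_Positives"] / df["Is_Positive"].sum()
# The two percentage columns, with a (0, 0) origin point prepended, give the X and Y values plotted below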
Step 3: Plot X and Y values
X = Cumulative Population (%)
Y = Cumulative Positives (%)
Here, let’s use Python to plot these X and Y values.
Code:
import matplotlib.pyplot as plt
X = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
Y = [0.0, 0.25, 0.50, 0.75, 0.75, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00]
# Plot curve
plt.figure(figsize=(6,6))
plt.plot(X, Y, marker='o', color="cornflowerblue", label="Model Lorenz Curve")
plt.plot([0,1], [0,1], linestyle="--", color="gray", label="Random Model (Diagonal)")
plt.title("Lorenz Curve from Sample Data", fontsize=14)
plt.xlabel("Cumulative Population % (X)", fontsize=12)
plt.ylabel("Cumulative Positives % (Y)", fontsize=12)
plt.legend()
plt.grid(True)
plt.show()
Plot:
The curve we get when we plot Cumulative Population (%) and Cumulative Positives (%) is called the Lorenz curve.
Step 4: Calculate the area under the Lorenz curve.
When we discussed ROC-AUC, we found the area under the curve using the trapezoid formula.
Each region between two points was treated as a trapezoid, its area was calculated, and then all areas were added together to get the final value.
The same method is applied here to calculate the area under the Lorenz curve.
Area under the Lorenz curve
Area of Trapezoid:
$$
\text{Area} = \frac{1}{2} \times (y_1 + y_2) \times (x_2 - x_1)
$$
From (0.0, 0.0) to (0.1, 0.25):
\[
A_1 = \frac{1}{2}(0+0.25)(0.1-0.0) = 0.0125
\]
From (0.1, 0.25) to (0.2, 0.50):
\[
A_2 = \frac{1}{2}(0.25+0.50)(0.2-0.1) = 0.0375
\]
From (0.2, 0.50) to (0.3, 0.75):
\[
A_3 = \frac{1}{2}(0.50+0.75)(0.3-0.2) = 0.0625
\]
From (0.3, 0.75) to (0.4, 0.75):
\[
A_4 = \frac{1}{2}(0.75+0.75)(0.4-0.3) = 0.075
\]
From (0.4, 0.75) to (0.5, 1.00):
\[
A_5 = \frac{1}{2}(0.75+1.00)(0.5-0.4) = 0.0875
\]
From (0.5, 1.00) to (0.6, 1.00):
\[
A_6 = \frac{1}{2}(1.00+1.00)(0.6-0.5) = 0.100
\]
From (0.6, 1.00) to (0.7, 1.00):
\[
A_7 = \frac{1}{2}(1.00+1.00)(0.7-0.6) = 0.100
\]
From (0.7, 1.00) to (0.8, 1.00):
\[
A_8 = \frac{1}{2}(1.00+1.00)(0.8-0.7) = 0.100
\]
From (0.8, 1.00) to (0.9, 1.00):
\[
A_9 = \frac{1}{2}(1.00+1.00)(0.9-0.8) = 0.100
\]
From (0.9, 1.00) to (1.0, 1.00):
\[
A_{10} = \frac{1}{2}(1.00+1.00)(1.0-0.9) = 0.100
\]
Total Area Under Lorenz Curve:
\[
A = 0.0125+0.0375+0.0625+0.075+0.0875+0.100+0.100+0.100+0.100+0.100 = 0.775
\]
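We can verify this sum with a quick sketch that applies the same trapezoid formula to the X and Y lists from the plotting code above:
# Apply the trapezoid formula to each pair of consecutive points and add the pieces
area_model = sum(
    0.5 * (y1 + y2) * (x2 - x1)
    for x1, x2, y1, y2 in zip(X[:-1], X[1:], Y[:-1], Y[1:])
)
print(round(area_model, 4))  # 0.775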
We calculated the area under the Lorenz curve, which is 0.775.
Here, we plotted Cumulative Population (%) and Cumulative Positives (%), and we can observe that the area under this curve shows how quickly the positives (class 2) are being captured as we move down the sorted list.
In our sample dataset, we have 4 positives (class 2) and 6 negatives (class 1).
A perfect model captures 100% of the positives by the time we reach 40% of the population.
The curve looks like this for a perfect model.
Area under the Lorenz curve for the perfect model:
\[
\begin{aligned}
\text{Perfect Area} &= \text{Triangle (0,0 to 0.4,1)} + \text{Rectangle (0.4,1 to 1,1)} \\[6pt]
&= \frac{1}{2} \times 0.4 \times 1 \;+\; 0.6 \times 1 \\[6pt]
&= 0.2 + 0.6 \\[6pt]
&= 0.8
\end{aligned}
\]
There is also a general formula for the area under the curve for a perfect model.
\[
\text{Let }\pi \text{ be the proportion of positives in the dataset.}
\]
\[
\text{Perfect Area} = \frac{1}{2}\pi \cdot 1 + (1-\pi)\cdot 1
\]
\[
= \frac{\pi}{2} + (1-\pi)
\]
\[
= 1 - \frac{\pi}{2}
\]
For our dataset:
Here, we have 4 positives out of 10 records, so: π = 4/10 = 0.4.
\[
\text{Perfect Area} = 1 - \frac{0.4}{2} = 1 - 0.2 = 0.8
\]
We calculated the area under the Lorenz curve for our sample dataset and also for a perfect model with the same number of positives and negatives.
Now, if we go through the dataset in random order, the positives are, on average, evenly spread out. This means the rate at which we collect positives is the same as the rate at which we move through the population.
This is the random model, and it always gives an area under the curve of 0.5.
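As a small sketch, both reference areas can be computed directly from the proportion of positives:
# Proportion of positives in the sample (4 out of 10)
pi = 4 / 10
# Perfect model: triangle up to pi, then a flat rectangle at 1.0
area_perfect = 1 - pi / 2   # 0.8
# Random model: the diagonal always gives 0.5
area_random = 0.5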
Step 5: Calculate the Gini Coefficient
\[
A_{\text{model}} = 0.775
\]
\[
A_{\text{random}} = 0.5
\]
\[
A_{\text{perfect}} = 0.8
\]
\[
\text{Gini} = \frac{A_{\text{model}} - A_{\text{random}}}{A_{\text{perfect}} - A_{\text{random}}}
\]
\[
= \frac{0.775 - 0.5}{0.8 - 0.5}
\]
\[
= \frac{0.275}{0.3}
\]
\[
\approx 0.92
\]
We got Gini = 0.92, which means almost all the positives are concentrated at the top of the sorted list. This shows that the model does a very good job of separating positives from negatives, coming close to perfect.
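Putting the three areas together in code (continuing the sketches above):
gini = (area_model - area_random) / (area_perfect - area_random)
print(round(gini, 4))  # ≈ 0.9167, i.e., about 0.92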
As we have seen how the Gini Coefficient is calculated, let’s look at what we actually did during the calculation.
We considered a sample of 10 points consisting of output probabilities from logistic regression.
We sorted the probabilities in descending order.
Next, we calculated Cumulative Population (%) and Cumulative Positives (%) and then plotted them.
We got a curve called the Lorenz curve, and we calculated the area under it, which is 0.775.
Now, let’s understand what 0.775 actually means.
Our sample consists of 4 positives (class 2) and 6 negatives (class 1).
The output probabilities are for class 2, which means the higher the probability, the more likely the customer belongs to class 2.
In our sample data, all the positives are captured within the first 50% of the population, which means they are ranked near the top.
If the model is perfect, then the positives are captured within the first 4 rows, i.e., within the first 40% of the population, and the area under the curve for the perfect model is 0.8.
But we got an area under the Lorenz curve of 0.775, which is nearly perfect.
Here, we are trying to calculate the efficiency of the model. If more positives are concentrated at the top, it means the model is good at classifying positives and negatives.
Next, we calculated the Gini Coefficient, which is 0.92.
\[
\text{Gini} = \frac{A_{\text{model}} - A_{\text{random}}}{A_{\text{perfect}} - A_{\text{random}}}
\]
The numerator tells us how much better our model is than random guessing.
The denominator tells us the maximum possible improvement over random.
The ratio puts these two together, so the Gini coefficient always falls between 0 (random) and 1 (perfect).
Gini is used to measure how close the model is to being perfect in separating positive and negative classes.
But you might wonder why we calculated the Gini at all, and why we didn’t simply stop at 0.775.
0.775 is the area under the Lorenz curve for our model. It doesn’t tell us how close the model is to being perfect without comparing it to 0.8, which is the area for the perfect model.
So, we calculate the Gini Coefficient to standardize the area so that it falls between 0 and 1, which makes models easy to compare.
Banks also use the Gini Coefficient to evaluate credit risk models alongside ROC-AUC and the KS Statistic. Together, these measures give a complete picture of model performance.
Now, let’s calculate ROC-AUC for our sample data.
import pandas as pd
from sklearn.metrics import roc_auc_score
# Sample data
data = {
"Actual": [2, 2, 2, 1, 2, 1, 1, 1, 1, 1],
"Pred_Prob_Class2": [0.92, 0.63, 0.51, 0.39, 0.29, 0.20, 0.13, 0.10, 0.05, 0.01]
}
df = pd.DataFrame(data)
# Convert Actual: class 2 -> 1 (positive), class 1 -> 0 (negative)
y_true = (df["Actual"] == 2).astype(int)
y_score = df["Pred_Prob_Class2"]
# Calculate ROC-AUC
roc_auc = roc_auc_score(y_true, y_score)
roc_auc
We got AUC = 0.9583
Now, Gini = (2 * AUC) - 1 = (2 * 0.9583) - 1 ≈ 0.92
This is the relation between Gini & ROC-AUC.
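The same relation can be checked in code, reusing roc_auc from the snippet above:
gini_from_auc = 2 * roc_auc - 1
print(round(gini_from_auc, 4))  # ≈ 0.9167, matching the Lorenz-curve result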
Now let’s calculate Gini Coefficient on a full dataset.
Code:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
# Load dataset
file_path = "C:/german.data"
data = pd.read_csv(file_path, sep=" ", header=None)
# Rename columns
columns = [f"col_{i}" for i in range(1, 21)] + ["target"]
data.columns = columns
# Features and target
X = pd.get_dummies(data.drop(columns=["target"]), drop_first=True)
y = data["target"]
# Convert target: class 2 -> 1 (positive), class 1 -> 0 (negative)
y = (y == 2).astype(int)
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42, stratify=y
)
# Train logistic regression
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)
# Predicted probabilities
y_pred_proba = model.predict_proba(X_test)[:, 1]
# Calculate ROC-AUC
auc = roc_auc_score(y_test, y_pred_proba)
# Calculate Gini
gini = 2 * auc - 1
auc, gini
We got Gini = 0.60
Interpretation:
Gini > 0.5: acceptable.
Gini = 0.6–0.7: good model.
Gini = 0.8+: excellent, rarely achieved.
Dataset
The dataset used in this blog is the German Credit dataset, which is publicly available on the UCI Machine Learning Repository. It is provided under the Creative Commons Attribution 4.0 International (CC BY 4.0) License. This means it can be freely used and shared with proper attribution.
I hope you found this blog useful.
If you enjoyed reading, consider sharing it with your network, and feel free to share your thoughts.
If you haven’t read my earlier blogs on ROC-AUC and the Kolmogorov-Smirnov Statistic, you can check them out here.
Thanks for reading!