
Why MissForest Fails in Prediction Tasks: A Key Limitation You Need to Keep in Mind


The goal of this article is to explain that, in predictive settings, imputations must always be estimated on the training set and the resulting parameters or models saved. These should then be applied unchanged to the test, out-of-time, or application data in order to avoid data leakage and to ensure an unbiased assessment of generalization performance.

I want to thank everyone who took the time to read and engage with my article. Your support and feedback are greatly appreciated.

In practice, most real-world datasets contain missing values, making missing data one of the most common challenges in statistical modeling. If it is not handled properly, it can lead to biased coefficient estimates, reduced statistical power, and ultimately incorrect conclusions (Van Buuren, 2018). In predictive modeling, ignoring missing data by performing complete case analysis or by excluding predictor variables with missing values can limit the applicability of the model and result in biased or suboptimal performance.

The Three Missing-Data Mechanisms

To address this issue, statisticians classify missing data into three mechanisms that describe how and why values go missing. MCAR (Missing Completely at Random) refers to cases where the missingness occurs entirely at random and is independent of both observed and unobserved variables. MAR (Missing at Random) means that the probability of missingness depends on the observed variables but not on the missing value itself. MNAR (Missing Not at Random) describes the most complex case, in which the probability of missingness depends on the unobserved value itself.
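
To make these mechanisms concrete, here is a minimal simulation sketch in Python (the variable names, missingness rates, and distributions are illustrative and not taken from any real dataset):

import numpy as np

rng = np.random.default_rng(0)
n = 1000
age = rng.normal(40, 10, n)        # fully observed covariate
income = rng.normal(3000, 500, n)  # variable that will receive missing values

# MCAR: missingness is pure chance, unrelated to any variable
mcar_mask = rng.random(n) < 0.2

# MAR: missingness depends on the OBSERVED covariate (older respondents skip more)
mar_mask = rng.random(n) < 1 / (1 + np.exp(-(age - 40) / 5))

# MNAR: missingness depends on the UNOBSERVED value itself (high incomes withheld)
mnar_mask = rng.random(n) < 0.8 * (income > np.quantile(income, 0.8))

income_mcar = np.where(mcar_mask, np.nan, income)
income_mar = np.where(mar_mask, np.nan, income)
income_mnar = np.where(mnar_mask, np.nan, income)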

Classical Approaches for Handling Missing Data and Their Limits

Under the MAR assumption, it is possible to use the information contained in the observed variables to predict the missing values. Classical approaches based on this idea include regression-based imputation, k-nearest neighbors (kNN) imputation, and multiple imputation by chained equations (MICE). These methods are considered multivariate because they explicitly condition the imputation on the observed variables. They share a significant limitation, however: they do not handle mixed data (continuous and categorical variables) well, and they struggle to capture nonlinear relationships and complex interactions.

The Rise of MissForest and Its R Implementation

MissForest (Stekhoven & Bühlmann, 2012) was developed to overcome these limitations and has established itself as a benchmark method. Based on random forests, it can capture nonlinear relationships and complex interactions between variables, often outperforming traditional imputation techniques. However, when working on a project that required a generalizable modeling process, with a proper train/test split and out-of-time validation, we encountered a significant limitation: the R implementation of the missForest package does not store the imputation model parameters once fitted.

A Critical Limitation of MissForest in Prediction Settings

This creates a practical challenge: it is impossible to train the imputation model on the training set and then apply the exact same parameters to the test set. This limitation introduces a risk of information leakage during model evaluation or a degradation in the quality and consistency of imputations.

Existing Solutions and Their Risks

While looking for an alternative solution that would allow consistent imputation in a predictive modeling setting, we asked ourselves a simple but critical question:

How can we impute the test data in a way that remains fully consistent with the imputations learned on the training data?

Exploring this question led us to a discussion on CrossValidated, where another user was facing the exact same issue and asked:

“How to use missForest in R for test data imputation?”

Two main solutions were suggested to overcome this limitation. The first consists of merging the training and test data before running the imputation. This approach often improves the quality of the imputations because the algorithm has more data to learn from, but it introduces data leakage, since the test set influences the imputation model. The second approach imputes the test set separately from the training set, which prevents information leakage but forces the algorithm to build an entirely new imputation model using only the test data, which is often much smaller. This can lead to less stable imputations and a potential drop in predictive performance.

Even the well-known tutorial by Liam Morgan arrives at a similar workaround. His proposed solution involves imputing the training set, fitting a predictive model, and then combining the training and test data for a final imputation step, roughly as in the following sketch (object names are illustrative, not his verbatim code):

# 1) Impute the training set (missForest returns the imputed data in $ximp)
imp_train_X <- missForest(train_X)$ximp
# 2) Fit the predictive model on imp_train_X, then re-impute train and test together
imp_all_X <- missForest(rbind(train_X, test_X))$ximp

Although this approach often improves imputation quality, it suffers from the same weakness as the first method: the test data indirectly participate in the learning process, which may inflate model performance metrics and create an overly optimistic estimate of generalization.

These examples highlight a fundamental dilemma:

  • How do we impute missing values without biasing model evaluation?
  • How do we ensure that the imputations applied to the test set are consistent with those learned on the training set?

Research Question and Motivation

These questions motivated our exploration of a more robust solution that preserves generalization, avoids data leakage, and produces stable imputations suitable for predictive modeling pipelines.

This article is organized into four main sections:

  • Section 1 introduces the process of identifying and characterizing missing values, including how to detect, quantify, and describe them.
  • Section 2 discusses the MCAR (Missing Completely at Random) mechanism and presents methods for handling missing data under this assumption.
  • Section 3 focuses on the MAR (Missing at Random) mechanism, outlining appropriate imputation strategies and addressing the critical question: Why does the MissForest implementation in R fail in prediction settings?
  • Section 4 examines the MNAR (Missing Not at Random) mechanism and explores strategies for dealing with missing data when the mechanism depends on the unobserved values themselves.

1. Identification and Characterization of Missing Values

This step is critical and should be carried out in close collaboration with all stakeholders: model developers, domain experts, and future users of the model. The goal is to identify all missing values and mark them.

In Python, and particularly when using libraries such as Pandas, NumPy, and Scikit-Learn, missing values are represented as NaN. Values marked as NaN are ignored by many operations such as sum() and count(). You can mark missing values using the replace() function on the relevant subset of columns in a Pandas DataFrame.

Once the missing values have been marked, the next step is to evaluate their distribution for each variable. The isnull() function can be used to identify all NaN values as True, and combined with sum() to count the number of missing values per column.
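
As a minimal illustration (the sentinel code -999 and the column names are hypothetical):

import numpy as np
import pandas as pd

# Hypothetical raw data where -999 encodes a missing income
df_raw = pd.DataFrame({
    "income": [52000, -999, 61000, -999, 48000],
    "region": ["north", "south", "north", "east", "south"],
})

# Mark the sentinel value as NaN on the relevant column
df_raw["income"] = df_raw["income"].replace(-999, np.nan)

# Count and quantify missing values per column
print(df_raw.isnull().sum())         # absolute counts
print(df_raw.isnull().mean() * 100)  # percentage of missing values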

Understanding the distribution of missing values is crucial. With this information, stakeholders can assess whether the patterns of missingness are reasonable. It also allows you to define acceptable thresholds of missingness depending on the nature of each variable. For instance, you might decide that up to 10% missing values is acceptable for continuous variables, while the threshold for categorical variables should remain at 0%.

After selecting the relevant variables for modeling, including those containing missing values when they are important for prediction, it is essential to split the dataset into three samples:

  • Training set to estimate parameters and train the models,
  • Test set to evaluate model performance on unseen data,
  • Out-of-Time (OOT) set to validate the temporal robustness of the model.

This split should be performed to preserve the statistical representativeness of each subsample — for example, by using stratified sampling if the target variable is imbalanced.

The analysis of missing values should then be conducted exclusively on the training set:

  • Identify their mechanism (MCAR, MAR, MNAR) using statistical tests,
  • Select the appropriate imputation method,
  • Train the imputation models on the training set.

The imputation parameters and models obtained in this step must then be applied as is to the test set and to the Out-of-Time set. This step is essential to avoid information leakage and to ensure a correct evaluation of the model’s generalization performance.
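
A minimal sketch of this discipline using scikit-learn's SimpleImputer (the names train_df, test_df, oot_df, x1, and x2 are placeholders; any imputer exposing fit/transform works the same way):

from sklearn.impute import SimpleImputer

# Learn the imputation parameters on the TRAINING set only
imputer = SimpleImputer(strategy="mean")
imputer.fit(train_df[["x1", "x2"]])

# Apply the SAME fitted parameters, unchanged, to every other sample
train_df[["x1", "x2"]] = imputer.transform(train_df[["x1", "x2"]])
test_df[["x1", "x2"]] = imputer.transform(test_df[["x1", "x2"]])
oot_df[["x1", "x2"]] = imputer.transform(oot_df[["x1", "x2"]])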

In the next section, we will examine the MCAR mechanism in detail and present the imputation methods that are best suited for this type of missing data.

2. Understanding MCAR and Choosing the Right Imputation Methods

In simple terms, MCAR (Missing Completely at Random) describes a situation where the fact that a value is missing is entirely unrelated to either the value itself or any other variables in the dataset. In mathematical terms, this means that the probability of a data point being missing does not depend on the variable’s value nor on the values of any other variables: the missingness is completely random.

Before formally defining the MCAR mechanism, let us introduce the notations that will be used in this section and throughout the article:

  • Consider an independent and identically distributed sample of n observations:

yi = (yi1, . . ., yip)T, i = 1, 2, . . ., n

where p is the number of variables with missing values and n is the sample size.

  • Y ∈ R^(n×p) represents the variables that may contain missing values. This is the set on which we wish to perform imputation.
  • We denote the observed and missing entries of Y by Yo and Ym, respectively.
  • X ∈ R^(n×q) represents the fully observed variables, meaning they contain no missing values.
  • To indicate which components of yi are observed or missing, we define the indicator vector:

ri = (ri1, . . ., rip)T, i = 1, 2, . . ., n

with rik = 1 if yik is observed, and 0 otherwise.

  • Stacking these vectors yields the complete matrix of presence/absence indicators:

R = (r1, . . ., rn)T

Then the MCAR assumption is defined as:

Pr(R | Ym, Yo, X) = Pr(R) (1)

which implies that the missing indicators are completely independent of both the missing data, Ym, and the observed data, Yo. Note that here R is also independent of covariates X. Before presenting methods for handling missing values under the MCAR assumption, we will first introduce a few simple techniques to assess whether the MCAR assumption is likely to hold.

2.1 Assessing the MCAR Assumption

In this section, we will simulate a dataset with 10,000 observations and four variables under the MCAR assumption:

  • One continuous variable containing 20% missing values and one categorical variable with two levels (0 and 1) containing 10% missing values.
  • One continuous variable and one categorical variable that are fully observed, with no missing values.
  • Finally, a binary target variable named target, taking values 0 and 1.
import numpy as np
import pandas as pd

# --- Reproducibility ---
np.random.seed(42)

# --- Parameters ---
n = 10000

# --- Utility Functions ---
def generate_continuous(mean, std, size, missing_rate=0.0):
    """Generate a continuous variable with optional MCAR missingness."""
    values = np.random.normal(loc=mean, scale=std, size=size)
    if missing_rate > 0:
        values[np.random.rand(size) < missing_rate] = np.nan
    return values

def generate_categorical(levels, probs, size, missing_rate=0.0):
    """Generate a categorical variable with optional MCAR missingness."""
    values = np.random.choice(levels, size=size, p=probs).astype(float)
    if missing_rate > 0:
        values[np.random.rand(size) < missing_rate] = np.nan
    return values

# --- Simulated dataset (the distribution parameters below are illustrative) ---
df = pd.DataFrame({
    "cont_mcar": generate_continuous(50, 10, n, missing_rate=0.20),
    "cont_full": generate_continuous(30, 5, n),
    "cat_mcar": generate_categorical([0, 1], [0.6, 0.4], n, missing_rate=0.10),
    "cat_full": generate_categorical([0, 1], [0.5, 0.5], n),
    "target": np.random.binomial(1, 0.3, size=n),
})

Before performing any analysis, it is essential to split the dataset into two parts: a training set and a test set.

2.1.1 Preparing the Train and Test Data for the MCAR Analysis

It is essential to split the dataset into training and test sets while ensuring representativeness. This guarantees that both the model and the imputation methods are learned exclusively on the training set and then evaluated on the test set. Doing so prevents data leakage and provides an unbiased estimate of the model’s ability to generalize to unseen data.

from sklearn.model_selection import train_test_split
import pandas as pd

def stratified_split(df, strat_vars, test_size=0.3, random_state=None):
    """
    Split a DataFrame into train and test sets with stratification
    based on one or multiple variables.

    Parameters
    ----------
    df : pandas.DataFrame
        The input dataset.
    strat_vars : list or str
        Column name(s) used for stratification.
    test_size : float, default=0.3
        Proportion of the dataset to include in the test split.
    random_state : int, optional
        Random seed for reproducibility.

    Returns
    -------
    train_df : pandas.DataFrame
        Training set.
    test_df : pandas.DataFrame
        Test set.
    """
    # Ensure strat_vars is a list
    if isinstance(strat_vars, str):
        strat_vars = [strat_vars]

    # Create a combined stratification key
    strat_key = df[strat_vars].fillna("MISSING").astype(str).agg("_".join, axis=1)

    # Perform stratified split
    train_df, test_df = train_test_split(
        df,
        test_size=test_size,
        stratify=strat_key,
        random_state=random_state
    )

    return train_df, test_df


# --- Application ---
# Stratify on cat_mcar, cat_full, and target
train_df, test_df = stratified_split(df, strat_vars=["cat_mcar", "cat_full", "target"], test_size=0.3, random_state=42)

print(f"Train size: {train_df.shape[0]}  ({len(train_df)/len(df):.1%})")
print(f"Test size:  {test_df.shape[0]}  ({len(test_df)/len(df):.1%})")

2.1.2 Analyzing the MCAR Assumption for Continuous Variables with Missing Values

The first step is to create a binary indicator R (where 1 indicates an observed value and 0 indicates a missing value) and to compare the distributions of the fully observed variables (Yo and X) between the two groups (observed vs. missing).

Let us illustrate this process using the variable cont_mcar as an example. We will compare the distribution of cont_full between observations where cont_mcar is missing and where it is observed, using both a boxplot and a Kolmogorov–Smirnov test. We will then perform a similar analysis for the categorical variable cat_full, comparing proportions across the two groups with a bar plot and a chi-squared test.

import matplotlib.pyplot as plt
import seaborn as sns

# --- Step 1: Train/Test Split with Stratification ---
train_df, test_df = stratified_split(
    df,
    strat_vars=["cat_mcar", "cat_full", "target"],
    test_size=0.3,
    random_state=42
)

# --- Step 2: Create the R indicator on the training set ---
train_df = train_df.copy()
train_df["R_cont_mcar"] = np.where(train_df["cont_mcar"].isnull(), 0, 1)

# --- Step 3: Prepare the data for comparison ---
df_obs = pd.DataFrame({
    "cont_full": train_df.loc[train_df["R_cont_mcar"] == 1, "cont_full"],
    "Group": "Observed (R=1)"
})

df_miss = pd.DataFrame({
    "cont_full": train_df.loc[train_df["R_cont_mcar"] == 0, "cont_full"],
    "Group": "Missing (R=0)"
})

df_all = pd.concat([df_obs, df_miss])

# --- Step 4: KS Test before plotting ---
from scipy.stats import ks_2samp
stat, p_value = ks_2samp(
    train_df.loc[train_df["R_cont_mcar"] == 1, "cont_full"],
    train_df.loc[train_df["R_cont_mcar"] == 0, "cont_full"]
)

# --- Step 5: Visualization with KS result ---
plt.figure(figsize=(8, 6))
group_order = ["Observed (R=1)", "Missing (R=0)"]
sns.boxplot(
    x="Group",
    y="cont_full",
    data=df_all,
    order=group_order,
    palette="Set2",
    width=0.6,
    fliersize=3
)

# Add red diamonds for means (reindexed so they match the boxplot order)
means = df_all.groupby("Group")["cont_full"].mean().reindex(group_order)
for i, m in enumerate(means):
    plt.scatter(i, m, color="red", marker="D", s=50, zorder=3, label="Mean" if i == 0 else "")

# Title and KS test result
plt.title("Distribution of cont_full by Missingness of cont_mcar (Train Set)",
          fontsize=14, weight="bold")

# Add KS result as text box
textstr = f"KS Statistic = {stat:.3f}\nP-value = {p_value:.3f}"
plt.gca().text(
    0.05, 0.95, textstr,
    transform=plt.gca().transAxes,
    fontsize=10,
    verticalalignment='top',
    bbox=dict(boxstyle="round,pad=0.3", facecolor="white", alpha=0.8)
)

plt.ylabel("cont_full", fontsize=12)
plt.xlabel("")
sns.despine()
plt.legend()
plt.show()
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import chi2_contingency

# --- Step 1: Build contingency table on the TRAIN set ---
contingency_table = pd.crosstab(train_df["R_cont_mcar"], train_df["cat_full"])
chi2, p_value, dof, expected = chi2_contingency(contingency_table)

# --- Step 2: Compute proportions within each group (rows sum to 1) ---
props = contingency_table.div(contingency_table.sum(axis=1), axis=0)

# Transform for plotting: Group (R) on x-axis, Category as hue
df_props = props.reset_index().melt(
    id_vars="R_cont_mcar",
    var_name="Category",
    value_name="Proportion"
)

# Map R values to clear labels
df_props["Group"] = df_props["R_cont_mcar"].map({1: "Observed (R=1)", 0: "Missing (R=0)"})

# --- Plot: Group on x-axis, bars show proportions of each category ---
sns.set_theme(style="whitegrid")
plt.figure(figsize=(8,6))

sns.barplot(
    x="Group", y="Proportion", hue="Category",
    data=df_props, palette="Set2"
)

# Title and Chi² result
plt.title("Proportion of cat_full by Observed/Missing Status of cont_mcar (Train Set)",
          fontsize=14, weight="bold")

# Add Chi² result as a text box
textstr = f"Chi² = {chi2:.3f}, p = {p_value:.3f}"
plt.gca().text(
    0.05, 0.95, textstr,
    transform=plt.gca().transAxes,
    fontsize=10,
    verticalalignment='top',
    bbox=dict(boxstyle="round,pad=0.3", facecolor="white", alpha=0.8)
)

plt.xlabel("Observed / Missing Group (R)")
plt.ylabel("Proportion")
plt.legend(title="cat_full Category")
sns.despine()
plt.show()

The two figures above show that, under the MCAR assumption, the distributions of the fully observed variables (here cont_full and cat_full) remain unchanged regardless of the value of R (1 = observed, 0 = missing). These results are further supported by the Kolmogorov–Smirnov and chi-squared tests, which detect no significant differences between the observed and missing groups.

For categorical variables, the same analyses can be performed as described above. While these univariate checks can be time-consuming, they are useful when the number of variables is small, as they provide a quick and intuitive first look at the missing data mechanism. For larger datasets, however, multivariate methods should be considered.
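
Before turning to multivariate methods, note that a small helper can automate these univariate checks across many variables. The sketch below reuses the same scipy tests as above (the function name and output format are my own):

import pandas as pd
from scipy.stats import ks_2samp, chi2_contingency

def univariate_missingness_checks(df, miss_col, cont_vars, cat_vars):
    """Compare covariate distributions between rows where miss_col
    is observed (R=1) and missing (R=0)."""
    r = df[miss_col].notna()
    rows = []
    for col in cont_vars:
        stat, p = ks_2samp(df.loc[r, col].dropna(), df.loc[~r, col].dropna())
        rows.append((col, "KS", stat, p))
    for col in cat_vars:
        chi2, p, _, _ = chi2_contingency(pd.crosstab(r, df[col]))
        rows.append((col, "Chi2", chi2, p))
    return pd.DataFrame(rows, columns=["variable", "test", "statistic", "p_value"])

# Example on the simulated training set
print(univariate_missingness_checks(train_df, "cont_mcar", ["cont_full"], ["cat_full"]))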

2.1.3 Multivariate Analysis of the MCAR Assumption

To the best of my knowledge, only one multivariate statistical test is widely used to assess the MCAR assumption at the dataset level: Little's chi-squared test (Li, 2013), implemented for example as the mcartest command. This test compares the distributions of the observed variables across the different missing-data patterns and computes a global test statistic that follows a chi-squared distribution.

However, its main limitation is that it is not well suited to categorical variables, as it relies on the strong assumption that the variables are normally distributed. We now turn to the methods for imputing missing values under the MCAR assumption.

2.2 Methods for Handling Missing Data under MCAR

Under the MCAR assumption, the missingness indicators R are independent of Yo, Ym, and X. Since the data are missing completely at random, dropping incomplete observations does not introduce bias. However, this approach becomes inefficient when the proportion of missing values is high.

In such cases, simple imputation methods (replacing missing values with the mean, median, or most frequent category) are often preferred. They are easy to implement, require little computational effort, and can be maintained over time without adding complexity for modelers. While these methods do not bias point estimates under MCAR, they tend to underestimate variance and may distort relationships between variables.

By contrast, advanced methods such as regression-based imputation, kNN, or multiple imputation can improve statistical efficiency and help preserve information when the proportion of missing data is substantial. Their main drawback lies in their algorithmic complexity, higher computational cost, and the greater effort required to maintain them in production settings.

To impute missing values under the MCAR assumption for prediction purposes, proceed as follows:

  1. Learn imputation values from the training set only, using the mean for continuous variables and the most frequent category for categorical variables.
  2. Apply these values to replace missing data in both the training and the test sets.
  3. Evaluate the model on the test set, ensuring that no information from the test set was used during the imputation process.
import pandas as pd

def compute_impute_values(df, cont_vars, cat_vars):
    """
    Compute imputation values (mean for continuous, mode for categorical)
    from the training set only.
    """
    impute_values = {}
    for col in cont_vars:
        impute_values[col] = df[col].mean()
    for col in cat_vars:
        impute_values[col] = df[col].mode().iloc[0]
    return impute_values

def apply_imputation(train_df, test_df, impute_values, vars_to_impute):
    """
    Apply the learned imputation values to both train and test sets.
    """
    train_df[vars_to_impute] = train_df[vars_to_impute].fillna(value=impute_values)
    test_df[vars_to_impute] = test_df[vars_to_impute].fillna(value=impute_values)
    return train_df, test_df

# --- Example usage ---
train_df, test_df = stratified_split(
    df,
    strat_vars=["cat_mcar", "cat_full", "target"],
    test_size=0.3,
    random_state=42
)

# Variables to impute
cont_vars = ["cont_mcar"]
cat_vars = ["cat_mcar"]
vars_to_impute = cont_vars + cat_vars

# 1. Learn imputation values on TRAIN
impute_values = compute_impute_values(train_df, cont_vars, cat_vars)
print("Imputation values learned from train:", impute_values)

# 2. Apply them consistently to TRAIN and TEST
train_df, test_df = apply_imputation(train_df, test_df, impute_values, vars_to_impute)

# 3. Check
print("Remaining missing values in train:\n", train_df[vars_to_impute].isnull().sum())
print("Remaining missing values in test:\n", test_df[vars_to_impute].isnull().sum())

This section on understanding MCAR and selecting the appropriate imputation method provides a clear foundation for approaching similar strategies under the MAR assumption.

3. Understanding MAR and Choosing the Right Imputation Methods

The MAR assumption is defined as:

Pr(R | Ym, Yo, X) = Pr(R | Yo, X) (2)

In other words, the distribution of the missing indicators depends only on the observed data. Even in the case where R depends only on the covariates X,

Pr(R | Ym, Yo, X) = Pr(R | X) (3)

This still falls under the MAR assumption.

3.1 Analyzing the MAR Assumption for Variables with Missing Values

Under the MAR assumption, the missingness indicators R depend only on the observed variables Yo and X, but not on the missing data Ym.
To indirectly assess the plausibility of this assumption, common statistical tests (Student's t-test, Kolmogorov–Smirnov, chi-squared, etc.) can be applied by comparing the distributions of observed variables between groups with and without missing values.

For multivariate analysis, one may also use the mcartest command (Li, 2013), which extends Little's MCAR test to evaluate assumption (3), namely Pr(R | Ym, Yo, X) = Pr(R | X), under the assumption of multivariate normality of the variables.

If this test is not rejected, the missing-data mechanism can reasonably be considered MAR (assumption 3) given the auxiliary variables X.
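
As a complementary, pragmatic diagnostic in Python, one can model the missingness indicator directly as a function of the fully observed covariates: significant coefficients suggest that missingness depends on X, which is consistent with assumption (3) rather than MCAR. A sketch on the simulated data, using statsmodels:

import statsmodels.api as sm

# R = 1 if cont_mcar is observed, 0 if missing
# (run on the training set before any imputation is applied)
r = train_df["cont_mcar"].notna().astype(int)

# Logistic regression of R on the fully observed covariates
X = sm.add_constant(train_df[["cont_full", "cat_full"]].astype(float))
fit = sm.Logit(r, X).fit(disp=0)
print(fit.summary())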

We can now turn to the question of how to impute this type of missing data.

3.2 Methods for Handling Missing Data under MAR

Under the MAR assumption, the probability of missingness R depends only on the observed variables Yo and the covariates X. In this setting, a variable Yk with missing values can be explained using the other available variables Yo and X, which motivates the use of advanced imputation methods based on supervised learning.

These approaches involve building a predictive model in which the incomplete variable Yk serves as the target, while the other observed variables Yo and X act as predictors. The model is trained on the complete cases (the observed entries [Yk]o) and then applied to estimate the missing entries [Yk]m of Yk.

The most commonly used imputation methods in the literature include:

  • k-nearest neighbors (KNNimpute, Troyanskaya et al., 2001), primarily applied to continuous data;
  • the saturated multinomial model (Schafer, 1997), designed for categorical data;
  • multivariate imputation by chained equations (MICE, Van Buuren & Oudshoorn, 1999), suitable for mixed datasets but dependent on tuning parameters and the specification of a parametric model.

All of these approaches rely on assumptions about the underlying data distribution or on the ability of the chosen model to adequately capture relationships between variables.
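
For example, scikit-learn's IterativeImputer implements a MICE-like chained-equations scheme and, unlike the R missForest package, retains its fitted per-variable models, so it can be fit on the training set and reused unchanged on the test set. A sketch on the continuous variables of the simulated data:

import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

cols = ["cont_mcar", "cont_full"]
imputer = IterativeImputer(max_iter=10, random_state=42)

# Fit on TRAIN, then reuse the fitted models on TEST
train_imp = pd.DataFrame(imputer.fit_transform(train_df[cols]),
                         columns=cols, index=train_df.index)
test_imp = pd.DataFrame(imputer.transform(test_df[cols]),
                        columns=cols, index=test_df.index)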

More recently, MissForest (Stekhoven & Bühlmann, 2012) has emerged as a nonparametric alternative based on random forests, well-suited to mixed data types and robust to both interactions and nonlinear relationships.

The MissForest algorithm relies on random forests (RF) to impute missing values. The authors propose the following procedure:

MissForest algorithm
Source: [2] Stekhoven & Bühlmann (2012)

As defined, the MissForest algorithm cannot be used directly for prediction purposes. For each variable, between steps 6 and 7, the random forest model Ms used to predict ymis(s) from xmis(s) is not saved. Consequently, it is neither practical nor advisable for practitioners to rely on MissForest as a predictive imputation model in production.

The absence of stored models Ms or imputation parameters (estimated on the training set) makes it difficult to evaluate generalization performance on new data. Although some have attempted to work around this issue by following Liam Morgan's approach, the challenge remains unresolved.

Furthermore, this limitation increases algorithmic complexity and computational cost, since the entire algorithm must be rerun from scratch for each new dataset (for instance, when working with separate training and test sets).

What should be done? Should the MissForest algorithm still be used?

If the goal is to develop a model for classification or analysis solely on the available dataset, with no intention of applying it to new data, then MissForest is strongly recommended, as it offers high accuracy and robustness.

However, if the aim is to build a predictive model that will be applied to new datasets, MissForest should be avoided for the reasons discussed above. In such cases, it is preferable to use an algorithm that explicitly stores the imputation models or the parameters estimated from the training set.

Fortunately, an adapted version now exists: MissForestPredict, available since 2024 in both R and Python, specifically designed for predictive tasks. For further details, we refer the reader to Albu, Elena, et al. (2024).

The use of the MissForestPredict algorithm for prediction consists of applying the standard MissForest procedure to the training data. Unlike the original MissForest, however, this version returns and stores the individual models Ms associated with each variable, which makes it possible to reuse them for imputing missing values in new datasets.

MissForestPredict Based Imputation with Model Saving
Source: [4] Albu et al. (2024).

The algorithm below illustrates how to apply MissForestPredict to new observations, whether they come from the test set, an out-of-time sample, or an application dataset.

Illustration of MissForestPredict Applied to a New Observation
Source: [4] Albu et al. (2024).
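
To make the store-and-reuse idea concrete, here is a minimal sketch of a missForest-style imputer for continuous variables that keeps the final per-column model Ms for later reuse. This is my own illustrative implementation, not the missForestPredict API:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

class ForestImputer:
    """Iteratively fit one random forest per incomplete column on the
    training data, store the fitted models, and reuse them on new data."""

    def __init__(self, n_iter=5, **rf_kwargs):
        self.n_iter = n_iter
        self.rf_kwargs = rf_kwargs
        self.models_ = {}       # final model Ms per incomplete column
        self.init_values_ = {}  # initial (mean) imputation values

    def fit(self, X: pd.DataFrame):
        mask = X.isnull()
        self.init_values_ = X.mean().to_dict()
        X = X.fillna(self.init_values_)
        incomplete = [c for c in X.columns if mask[c].any()]
        for _ in range(self.n_iter):
            for col in incomplete:
                rf = RandomForestRegressor(**self.rf_kwargs)
                rf.fit(X.loc[~mask[col]].drop(columns=col),
                       X.loc[~mask[col], col])
                X.loc[mask[col], col] = rf.predict(
                    X.loc[mask[col]].drop(columns=col))
                self.models_[col] = rf
        return self

    def transform(self, X_new: pd.DataFrame) -> pd.DataFrame:
        mask = X_new.isnull()
        X_new = X_new.fillna(self.init_values_)  # same initialization as train
        for col, rf in self.models_.items():
            if mask[col].any():
                X_new.loc[mask[col], col] = rf.predict(
                    X_new.loc[mask[col]].drop(columns=col))
        return X_new

# Fit on TRAIN only, then impute TEST with the stored models
imputer = ForestImputer(n_estimators=100, random_state=42)
imputer.fit(train_df[["cont_mcar", "cont_full"]])
test_imputed = imputer.transform(test_df[["cont_mcar", "cont_full"]])

In practice, the missForestPredict package itself should be preferred, as it handles mixed data types and implements the full convergence logic of Albu et al. (2024).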

We now have all the elements needed to address the issues raised in the introduction. Let us turn to the final mechanism, MNAR, before moving on to the conclusion.

4. Understanding MNAR

Missing Not At Random (MNAR) occurs when the missing-data mechanism depends directly on the unobserved values themselves. In other words, if a variable Y contains missing values, the indicator variable R (with R = 1 if Y is observed and R = 0 otherwise) depends on the missing component Ym itself.

There is no universal statistical method to handle this type of mechanism, since the information needed to model the dependency is precisely what is missing. In such cases, the recommended approach is to rely on domain expertise to understand the reasons behind the nonresponse and to define context-specific strategies for analyzing and addressing the missing values.

It is important to emphasize, however, that MAR and MNAR cannot generally be distinguished empirically based on the observed data alone.

Conclusion

The objective of this article was to show how to impute missing values for predictive purposes without biasing the evaluation of model performance. To this end, we presented the main mechanisms that generate missing data (MCAR, MAR, MNAR), the statistical tests used to assess their plausibility, and the imputation methods best suited to each.

Our analysis highlights that, under MCAR, simple imputation methods are generally preferable, as they provide substantial time savings without introducing bias. In practice, however, missing data mechanisms are most often MAR. In this setting, advanced imputation approaches such as MissForest, based on machine learning models, are particularly appropriate.

Nevertheless, when the goal is to build predictive models, it is essential to use methods that store the imputation parameters or models learned from the training data and then replicate them consistently on the test, out-of-time, or application datasets. This is precisely the contribution of MissForestPredict (introduced in 2024 and available in both R and Python), which addresses the limitation of the original MissForest (2012), a method not originally designed for predictive tasks.

Using MissForest for prediction without adaptation may therefore lead to biased results, unless corrective measures are implemented. It would be highly valuable for practitioners who have deployed MissForest in production to share the strategies they developed to overcome this limitation.

References

[1] Audigier, V., White, I. R., Jolani, S., Debray, T. P., Quartagno, M., Carpenter, J., … & Resche-Rigon, M. (2018). Multiple imputation for multilevel data with continuous and binary variables.

[2] Stekhoven, D. J., & Bühlmann, P. (2012). MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112-118.

[3] Li, C. (2013). Little's test of missing completely at random. The Stata Journal, 13(4), 795-809.

[4] Albu, E., Gao, S., Wynants, L., & Van Calster, B. (2024). missForestPredict: Missing data imputation for prediction settings. arXiv preprint arXiv:2407.03379.

Image Credits

All images and visualizations in this article were created by the author using Python (pandas, matplotlib, seaborn, and plotly) and Excel, unless otherwise stated.

Disclaimer

I write to learn, so mistakes are the norm even though I try my best. Please let me know when you spot them. I also appreciate suggestions for new topics!
