Authors: Augusto Cerqua, Marco Letta, Gabriele Pinto
Machine learning (ML) has gained a central role in economics, the social sciences, and business decision-making. In the public sector, ML is increasingly used for so-called prediction policy problems: settings where policymakers aim to identify units most at risk of a negative outcome and intervene proactively, for instance by targeting public subsidies, predicting local recessions, or anticipating migration patterns. In the private sector, similar predictive tasks arise when firms seek to forecast customer churn or optimize credit risk assessment. In both domains, better predictions translate into more efficient allocation of resources and more effective interventions.
To achieve these goals, ML algorithms are increasingly applied to panel data, characterized by repeated observations of the same units over multiple time periods. However, ML models were not originally designed for use with panel data, which feature distinctive cross-sectional and longitudinal dimensions. When ML is applied to panel data, there is a high risk of a subtle but serious problem: data leakage. This occurs when information unavailable at prediction time accidentally enters the model training process, inflating predictive performance. In our paper “On the (Mis)Use of Machine Learning With Panel Data” (Cerqua, Letta, and Pinto, 2025), recently published in the Oxford Bulletin of Economics and Statistics, we provide the first systematic assessment of data leakage in ML with panel data, propose clear guidelines for practitioners, and illustrate the consequences through an empirical application with publicly available U.S. county data.
The Leakage Problem
Panel data combine two structures: a temporal dimension (units observed across time) and a cross-sectional dimension (multiple units, such as regions or firms). Standard ML practice, which splits the sample randomly into training and testing sets, implicitly assumes independent and identically distributed (i.i.d.) data. Panel data violate this assumption, and applying default ML procedures to them creates two main types of leakage:
- Temporal leakage: future information leaks into the model during training, making forecasts look unrealistically accurate; conversely, past information can end up in the testing set, turning ‘forecasts’ into retrospective predictions.
- Cross-sectional leakage: the same or very similar units appear in both training and testing sets, meaning the model has already “seen” most of the cross-sectional dimension of the data.
Figure 1 shows how different splitting strategies affect the risk of leakage. A random split at the unit–time level (Panel A) is the most problematic, as it introduces both temporal and cross-sectional leakage. Alternatives such as splitting by units (Panel B), by groups (Panel C), or by time (Panel D) mitigate one type of leakage but not the other. No strategy completely eliminates the problem: the appropriate choice depends on the task at hand (see below), since in some cases one form of leakage is not a real concern.
Figure 1 | Training and testing sets under different splitting rules
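To make these splitting rules concrete, here is a minimal sketch in Python (our illustration, not code from the paper), using a small simulated panel with hypothetical unit and year identifiers:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Small simulated balanced panel: 100 units observed over 2000-2019
panel = pd.DataFrame({
    "unit": np.repeat(np.arange(100), 20),
    "year": np.tile(np.arange(2000, 2020), 100),
})

# Panel A: random split at the unit-time level -> both temporal and
# cross-sectional leakage
test_a = panel.sample(frac=0.2, random_state=0)
train_a = panel.drop(test_a.index)

# Panel B: split by units -> no cross-sectional leakage, but every time
# period (including 'future' ones) is seen during training
held_out = rng.choice(panel["unit"].unique(), size=20, replace=False)
train_b = panel[~panel["unit"].isin(held_out)]
test_b = panel[panel["unit"].isin(held_out)]
# Panel C (splitting by groups of units) works analogously, with group
# identifiers in place of unit identifiers

# Panel D: split by time -> no temporal leakage, but every unit appears
# in both sets
train_d = panel[panel["year"] <= 2015]
test_d = panel[panel["year"] > 2015]
```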
Two Types of Prediction Policy Problems
A key insight of the study is that researchers must clearly define their prediction goal ex-ante. We distinguish two broad classes of prediction policy problems:
1. Cross-sectional prediction: The task is to predict outcomes across units within the same period, for example, imputing missing data on GDP per capita across regions when only some regions have reliable measurements. The best split here is at the unit level: different units are assigned to the training and testing sets, while all time periods are kept. This eliminates cross-sectional leakage; temporal leakage remains, but since forecasting is not the goal, it is not a real issue.
2. Sequential forecasting: The goal is to predict future outcomes based on historical data—for example, predicting county-level income declines one year ahead to trigger early interventions. Here, the correct split is by time: earlier periods for training, later periods for testing. This avoids temporal leakage but not cross-sectional leakage, which is not a real concern since the same units are being forecasted across time.
The wrong approach in both cases is the random split at the unit–time level (Panel A of Figure 1), which contaminates results with both types of leakage and produces misleadingly high performance metrics.
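For the cross-sectional case, here is a minimal sketch of a unit-level split used for imputation; the data are simulated, and the predictor (night-time lights, a common proxy for local economic activity) is our hypothetical choice, not taken from the paper. Note that contemporaneous predictors are unproblematic here, since forecasting is not the goal:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n_units, n_years = 100, 20
panel = pd.DataFrame({
    "unit": np.repeat(np.arange(n_units), n_years),
    "year": np.tile(np.arange(2000, 2020), n_units),
    "nightlights": rng.normal(size=n_units * n_years),
})
# Simulated outcome: GDP per capita driven by the observed predictor
panel["gdp_pc"] = 2.0 * panel["nightlights"] + rng.normal(size=len(panel))

# Unit-level split: the outcome is reliably measured only for some units;
# all time periods are kept on both sides of the split
measured = rng.choice(n_units, size=70, replace=False)
train = panel[panel["unit"].isin(measured)]
unmeasured = panel[~panel["unit"].isin(measured)]

# Fit on measured units and impute the outcome for the unmeasured ones
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(train[["nightlights"]], train["gdp_pc"])
imputed = unmeasured.assign(gdp_pc_hat=rf.predict(unmeasured[["nightlights"]]))
```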
Practical Guidelines
To help practitioners, we summarize a set of do’s and don’ts for applying ML to panel data:
- Choose the sample split based on the research question: unit-based for cross-sectional problems, time-based for forecasting.
- Temporal leakage can occur not only through observations, but also through predictors. For forecasting, use only lagged or time-invariant predictors: contemporaneous variables (e.g., unemployment in 2014 to predict income in 2014) are conceptually wrong and create temporal data leakage.
- Adapt cross-validation to panel data. The random k-fold CV found in most ready-to-use software packages is inappropriate, as it mixes future and past information. Instead, use rolling or expanding windows for forecasting, or CV stratified by units/groups for cross-sectional prediction (the sketch after this list illustrates the forecasting case).
- Ensure that out-of-sample performance is tested on truly unseen data, not on data already encountered during training.
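The sketch below, again our own illustration on simulated data, puts the forecasting-related guidelines together: predictors are lagged within each unit, and validation proceeds over expanding windows so that each test year is genuinely unseen:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
panel = pd.DataFrame({
    "unit": np.repeat(np.arange(100), 20),
    "year": np.tile(np.arange(2000, 2020), 100),
    "income": rng.normal(size=2000),
    "unemployment": rng.normal(size=2000),
}).sort_values(["unit", "year"])

# Do: use only lagged predictors, shifted within each unit
panel["income_lag1"] = panel.groupby("unit")["income"].shift(1)
panel["unemp_lag1"] = panel.groupby("unit")["unemployment"].shift(1)
panel = panel.dropna()
features = ["income_lag1", "unemp_lag1"]

# Do: validate with an expanding window (train on years <= t, test on
# t + 1) instead of random k-fold CV, which mixes future and past data
for cutoff in range(2005, 2019):
    train = panel[panel["year"] <= cutoff]
    test = panel[panel["year"] == cutoff + 1]
    model = LinearRegression().fit(train[features], train["income"])
    print(cutoff + 1, round(model.score(test[features], test["income"]), 2))
```

Since the simulated data are pure noise, the printed scores hover around zero; the point is the mechanics of the split, not the fit.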
Empirical Application
To illustrate these issues, we analyze a balanced panel of 3,058 U.S. counties from 2000 to 2019, focusing exclusively on sequential forecasting. We consider two tasks: a regression problem—forecasting per capita income—and a classification problem—forecasting whether income will decline in the subsequent year.
We run hundreds of models, varying split strategies, use of contemporaneous predictors, inclusion of lagged outcomes, and algorithms (Random Forest, XGBoost, Logit, and OLS). This comprehensive design allows us to quantify how leakage inflates performance. Figure 2 below reports our main findings.
Panel A of Figure 2 shows forecasting performance for classification tasks. Random splits yield very high accuracy, but this is illusory: the model has already seen similar data during training.
Panel B shows forecasting performance for regression tasks. Once again, random splits make models look far better than they really are, while correct time-based splits show much lower, yet realistic, accuracy.
Figure 2 | Temporal leakage in the forecasting problem (Panel A: classification task; Panel B: regression task)
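To convey the mechanism behind this gap, consider the following simulation (ours, not the paper’s code or data), in which income is driven by a persistent unit effect plus a common year shock. A random unit–time split lets the model learn each year’s shock from other units observed in the same year: information that would never be available in a true forecast.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
n_units, years = 300, np.arange(2000, 2020)

# Simulated panel: income = unit effect + common year shock + noise
unit_effect = rng.normal(scale=2.0, size=n_units)
year_effect = rng.normal(scale=3.0, size=len(years))
panel = pd.DataFrame({
    "unit": np.repeat(np.arange(n_units), len(years)),
    "year": np.tile(years, n_units),
})
panel["income"] = (unit_effect[panel["unit"]]
                   + year_effect[panel["year"] - 2000]
                   + rng.normal(size=len(panel)))
X, y = panel[["unit", "year"]], panel["income"]

# Leaky random unit-time split: training data cover the test years, so
# the model can memorize each year's common shock
test = rng.random(len(panel)) < 0.2
rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X[~test], y[~test])
print("random split R2:", round(rf.score(X[test], y[test]), 2))

# Correct time-based split: the shocks of 2015-2019 are genuinely unseen
train = (panel["year"] <= 2014).to_numpy()
rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X[train], y[train])
print("time split R2:", round(rf.score(X[~train], y[~train]), 2))
```

On this simulated panel the random split reports a much higher R² than the time-based split, even though the model and the data-generating process are identical; only the evaluation design differs.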
In the paper, we also show that the overestimation of model accuracy becomes significantly more pronounced during years marked by distribution shifts and structural breaks, such as the Great Recession, making the results particularly misleading for policy purposes.
Why It Matters
Data leakage is more than a technical pitfall; it has real-world consequences. In policy applications, a model that seems highly accurate during validation may collapse once deployed, leading to misallocated resources, missed crises, or misguided targeting. In business settings, the same issue can translate into poor investment decisions, inefficient customer targeting, or false confidence in risk assessments. The danger is especially acute when machine learning models are intended to serve as early-warning systems, where misplaced trust in inflated performance can result in costly failures.
By contrast, properly designed models, even if less accurate on paper, provide honest and reliable predictions that can meaningfully inform decision-making.
Takeaway
ML has the potential to transform decision-making in both policy and business, but only if applied correctly. Panel data offer rich opportunities, yet are especially vulnerable to data leakage. To generate reliable insights, practitioners should align their ML workflow with the prediction objective, account for both temporal and cross-sectional structures, and use validation strategies that prevent overoptimistic performance assessments. When these principles are followed, models avoid the trap of inflated performance and instead provide guidance that genuinely helps policymakers allocate resources and businesses make sound strategic choices. Given the rapid adoption of ML with panel data in both public and private domains, addressing these pitfalls is now a pressing priority for applied research.
References
A. Cerqua, M. Letta, and G. Pinto, “On the (Mis)Use of Machine Learning With Panel Data”, Oxford Bulletin of Economics and Statistics (2025): 1–13, https://doi.org/10.1111/obes.70019.