Context
In modern data centers, network slowdowns can appear out of nowhere. A sudden burst of traffic from distributed systems, microservices, or AI training jobs can overwhelm switch buffers in seconds. The problem is not just knowing when something goes wrong. It is being able to see it coming before it happens.
Telemetry systems are widely used to monitor network health, but most operate in a reactive mode. They flag congestion only after performance has degraded. Once a link is saturated or a queue is full, you are already past the point of early diagnosis, and tracing the original cause becomes significantly harder.
In-band Network Telemetry, or INT, tries to solve that gap by tagging live packets with metadata as they travel through the network. It gives you a real-time view of how traffic flows, where queues are building up, where latency is creeping in, and how each switch is handling forwarding. It is a powerful tool when used carefully. But it comes with a cost. Enabling INT on every packet can introduce serious overhead and push a flood of telemetry data to the control plane, much of which you might not even need.
What if we could be more selective? Instead of tracking everything, we forecast where trouble is likely to form and enable INT just for those regions and just for a short time. This way, we get detailed visibility when it matters most without paying the full cost of always-on monitoring.
The Problem with Always-On Telemetry
INT gives you a powerful, detailed view of what’s happening inside the network. You can track queue lengths, hop-by-hop latency, and timestamps directly from the packet path. But there’s a cost: this telemetry data adds weight to every packet, and if you apply it to all traffic, it can eat up significant bandwidth and processing capacity.
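To make that cost concrete, here is a rough sketch of the kind of per-hop metadata an INT-capable switch can embed in a packet. The field names loosely follow the public INT specification, but the exact layout and field widths vary by deployment, so treat this as illustrative only.

from dataclasses import dataclass

@dataclass
class IntHopMetadata:
    """Illustrative per-hop telemetry an INT switch may append to a packet."""
    switch_id: int             # which switch handled the packet
    ingress_port: int          # port the packet arrived on
    egress_port: int           # port the packet left on
    hop_latency_ns: int        # time the packet spent inside this switch
    queue_depth: int           # egress queue occupancy at enqueue time
    ingress_timestamp_ns: int  # when the packet entered the switch
    egress_timestamp_ns: int   # when it left

Every hop appends another record like this, which is exactly why applying INT to all traffic gets expensive.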
To get around that, many systems take shortcuts:
Sampling: Tag only a fraction (e.g., 1%) of packets with telemetry data.
Event-triggered telemetry: Turn on INT only when something bad is already happening, like a queue crossing a threshold.
These techniques help control overhead, but they miss the critical early moments of a traffic surge, the part you most want to understand if you’re trying to prevent slowdowns.
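As a rough illustration (the names and thresholds below are ours, not from any particular product), the two shortcuts boil down to decisions like these:

import random

SAMPLE_RATE = 0.01         # tag roughly 1% of packets (example value)
QUEUE_THRESHOLD = 0.9      # react once a queue is 90% full (example value)

def tag_with_sampling(packet) -> bool:
    # Sampling: statistically tag a small fraction of all traffic.
    return random.random() < SAMPLE_RATE

def tag_with_event_trigger(queue_utilization: float) -> bool:
    # Event-triggered: tag only once congestion is already visible.
    return queue_utilization > QUEUE_THRESHOLD

Neither decision looks at where traffic is heading, which is the gap the rest of this post tries to close.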
Introducing a Predictive Approach
Instead of reacting to symptoms, we designed a system that can forecast congestion before it happens and activate detailed telemetry proactively. The idea is simple: if we can anticipate when and where traffic is going to spike, we can selectively enable INT just for that hotspot and only for the right window of time.
This keeps overhead low but gives you deep visibility when it actually matters.
System Design
We came up with a simple approach that makes network monitoring more intelligent by predicting when and where monitoring is actually needed. The idea is neither to sample every packet nor to wait for congestion to happen. Instead, we wanted a system that could catch signs of trouble early and selectively enable high-fidelity monitoring only when it is needed.
So, how’d we get this done? We created the following four critical components, each for a distinct task.
Data Collector
We begin by monitoring how much traffic is moving through each network port at any given moment. We use sFlow for this because it exports the metrics we need with negligible impact on network performance. Samples are captured at regular intervals, giving us a continuously updated view of the network.
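As a minimal sketch, the collector can poll sFlow-RT's REST API for interface counters. The /metric endpoint and the ifinoctets counter are standard sFlow-RT features, but the address, the polling choices, and the simple sum across interfaces below are assumptions from our lab setup rather than a general recipe.

import requests

SFLOW_RT = "http://127.0.0.1:8008"   # assumed local sFlow-RT instance on its default port

def current_traffic(agent: str = "ALL", metric: str = "ifinoctets") -> float:
    """Return the latest value of a traffic metric, summed across reporting interfaces."""
    resp = requests.get(f"{SFLOW_RT}/metric/{agent}/{metric}/json", timeout=2)
    resp.raise_for_status()
    samples = resp.json()             # list of entries such as {"agent": ..., "metricValue": ...}
    return sum(s.get("metricValue", 0.0) for s in samples)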
Forecasting Engine
The forecasting engine is the most important component of our system. It's built using a Long Short-Term Memory (LSTM) model. We went with LSTM because it learns how patterns evolve over time, which makes it a good fit for network traffic. We're not looking for perfection here. The important thing is to spot the unusual traffic spikes that typically show up before congestion starts.
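A minimal sketch of this kind of model is below. The layer sizes, window length, and loss are illustrative choices, not the exact configuration of our prototype.

from tensorflow import keras

WINDOW = 20   # number of past samples (e.g., 30-second intervals) the model sees at once

def build_forecaster() -> keras.Model:
    """Single-feature LSTM that predicts the next traffic value from the last WINDOW samples."""
    model = keras.Sequential([
        keras.layers.Input(shape=(WINDOW, 1)),
        keras.layers.LSTM(32),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model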
Telemetry Controller
The controller listens to those forecasts and makes decisions. When a predicted spike crosses the alert threshold, the system responds: it sends a command to the switches to enter a detailed monitoring mode, but only for the flows or ports that matter. It also knows when to back off, turning the extra telemetry off once conditions return to normal.
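A sketch of that decision logic is below. The enable_int/disable_int helpers are hypothetical stand-ins for whatever pushes rules to the switches, and the two thresholds show one simple way (hysteresis) to implement backing off once conditions settle.

ALERT_THRESHOLD = 0.8   # forecast level that turns detailed telemetry on (example value)
CLEAR_THRESHOLD = 0.5   # lower level at which we turn it back off (example value)

class TelemetryController:
    def __init__(self, switch_api):
        self.switch_api = switch_api   # wrapper around P4Runtime rule installation
        self.int_active = set()        # ports currently under detailed monitoring

    def handle_forecast(self, port: str, forecast: float) -> None:
        if forecast > ALERT_THRESHOLD and port not in self.int_active:
            self.switch_api.enable_int(port)    # hypothetical helper: install an INT match rule
            self.int_active.add(port)
        elif forecast < CLEAR_THRESHOLD and port in self.int_active:
            self.switch_api.disable_int(port)   # hypothetical helper: remove that rule
            self.int_active.discard(port)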
Programmable Data Plane
The final piece is the switch itself. In our setup, we use P4 programmable BMv2 switches that let us adjust packet behavior on the fly. Most of the time, the switch simply forwards traffic without making any changes. But when the controller turns on INT, the switch begins embedding telemetry metadata into packets that match specific rules. These rules are pushed by the controller and let us target just the traffic we care about.
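For a sense of what pushing such a rule looks like from the controller side, here is a sketch using the p4runtime-shell Python package. The table name, action name, and match field are placeholders for whatever the P4 program actually defines, and the address and device ID depend on how BMv2 is launched.

import p4runtime_sh.shell as sh

# Connect to the switch's P4Runtime endpoint (address and device_id are deployment-specific).
sh.setup(device_id=0, grpc_addr="127.0.0.1:9559", election_id=(0, 1))

# Install a rule so traffic to a given destination starts carrying INT metadata.
# "ingress.int_source" and "ingress.activate_int" are placeholder table and action names.
entry = sh.TableEntry("ingress.int_source")(action="ingress.activate_int")
entry.match["hdr.ipv4.dst_addr"] = "10.0.2.2"
entry.insert()

sh.teardown()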
This avoids the tradeoff between constant monitoring and blind sampling. Instead, we get detailed visibility exactly when it is needed, without flooding the system with unnecessary data the rest of the time.
Experimental Setup
We built a full simulation of this system using:
- Mininet for emulating a leaf-spine network
- BMv2 (P4 software switch) for programmable data plane behavior
- sFlow-RT for real-time traffic stats
- TensorFlow + Keras for the LSTM forecasting model
- Python + gRPC + P4Runtime for the controller logic
The LSTM was trained on synthetic traffic traces generated in Mininet using iperf. Once trained, the model runs in a loop, making predictions every 30 seconds and storing forecasts for the controller to act on.
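As a rough sketch of how those traces become training examples (the preprocessing below is illustrative, not our exact pipeline), we slide a fixed-length window over the per-interval byte counts and train the model to predict the next value:

import numpy as np

def make_windows(series: np.ndarray, window: int = 20):
    """Turn a 1-D traffic series into (samples, window, 1) inputs and next-step targets."""
    xs, ys = [], []
    for i in range(len(series) - window):
        xs.append(series[i:i + window])
        ys.append(series[i + window])
    return np.array(xs)[..., np.newaxis], np.array(ys)

# traffic_trace: per-30-second byte counts from the iperf runs
# X, y = make_windows(traffic_trace)
# model = build_forecaster()            # see the Forecasting Engine sketch above
# model.fit(X, y, epochs=20, validation_split=0.1)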
Here’s a simplified version of the prediction loop:
import time

WINDOW_SIZE = 20          # samples the LSTM needs before it can forecast (example value)
ALERT_THRESHOLD = 0.8     # forecast level that triggers detailed telemetry (example value)
sliding_window = []

while True:
    latest_sample = data_collector.current_traffic()
    sliding_window = (sliding_window + [latest_sample])[-WINDOW_SIZE:]  # keep only the latest window
    if len(sliding_window) >= WINDOW_SIZE:
        forecast = forecast_engine.predict_upcoming_traffic(sliding_window)
        if forecast > ALERT_THRESHOLD:
            telem_controller.trigger_INT()
    time.sleep(30)        # one prediction cycle every 30 seconds
The switches respond immediately, changing telemetry modes for the targeted flows.
Why LSTM?
We went with an LSTM model because network traffic tends to have structure. It’s not entirely random. There are patterns tied to time of day, background load, or batch processing jobs, and LSTMs are particularly good at picking up on those temporal relationships. Unlike simpler models that treat each data point independently, an LSTM can remember what came before and use that memory to make better short-term predictions. For our use case, that means spotting early signs of an upcoming surge just by looking at how the last few minutes behaved. We didn’t need it to forecast exact numbers, just to flag when something abnormal might be coming. LSTM gave us just enough accuracy to trigger proactive telemetry without overfitting to noise.
Evaluation
We didn’t run large-scale performance benchmarks, but through our prototype and system behavior in test conditions, we can outline the practical advantages of this design approach.
Lead Time Advantage
One of the main benefits of a predictive system like this is its ability to catch trouble early. Reactive telemetry solutions typically wait until a queue threshold is crossed or performance degrades, which means you’re already behind the curve. By contrast, our design anticipates congestion based on traffic trends and activates detailed monitoring in advance, giving operators a clearer picture of what led to the issue, not just the symptoms once they appear.
Monitoring Efficiency
A key goal in this project was to keep overhead low without compromising visibility. Instead of applying full INT across all traffic or relying on coarse-grained sampling, our system selectively enables high-fidelity telemetry for short bursts, and only where forecasts indicate potential problems. While we haven’t quantified the exact cost savings, the design naturally limits overhead by keeping INT focused and short-lived, something that static sampling or reactive triggering can’t match.
Conceptual Comparison of Telemetry Strategies
While we didn’t record overhead metrics, the intent of the design was to find a middle ground, delivering deeper visibility than sampling or reactive systems but at a fraction of the cost of always-on telemetry. Here’s how the approach compares at a high level:
Conclusion
We wanted to find a better way to monitor network traffic. By combining machine learning and programmable switches, we built a system that predicts congestion before it happens and activates detailed telemetry in just the right place and time.
It seems like a minor change to predict instead of react, but it opens up a new level of observability. As telemetry becomes increasingly important in AI-scale data centers and low-latency services, this kind of intelligent monitoring will become a baseline expectation, not just a nice to have.