The attention mechanism is a game changer in Machine Learning. In fact, in the recent history of Deep Learning, the idea of allowing models to focus on the most relevant parts of an input sequence when making a prediction completely revolutionized the way we look at Neural Networks.
That being said, there is one controversial take that I have about the attention mechanism:
The best way to learn the attention mechanism is not through Natural Language Processing (NLP)
It is (technically) a controversial take for two reasons.
- First, people naturally use NLP cases (e.g., translation or next-sentence prediction) because NLP is the reason why the attention mechanism was developed in the first place. The original goal was to overcome the limitations of RNNs and CNNs in handling long-range dependencies in language (if you haven’t already, you should really read the paper Attention Is All You Need).
- Second, I have to admit that the general idea of putting the “attention” on a specific word to perform translation tasks is very intuitive.
Now, if we want to understand how attention REALLY works in a hands-on example, I believe that Time Series is the best framework to use. There are many reasons why I say that.
- Computers are not really “made” to work with strings; they work with ones and zeros. All the embedding steps that are necessary to convert the text into vectors add an extra layer of complexity that is not strictly related to the attention idea.
- The attention mechanism, though it was first developed for text, has many other applications (for example, in computer vision), so I like the idea of exploring attention from another angle as well.
- With time series specifically, we can create very small datasets and run our attention models in minutes (yes, including the training) without any fancy GPUs.
In this blog post, we will see how we can build an attention mechanism for time series, specifically in a classification setup. We will work with sine waves, and we will try to distinguish a normal sine wave from a “modified” sine wave. The “modified” sine wave is created by flattening a portion of the original signal: at a certain location in the wave, we simply remove the oscillation and replace it with a flat line, as if the signal had temporarily stopped or become corrupted.
To make things spicier, we will assume that the sine wave can have any frequency or amplitude, and that the location and extension (we call it length) of the “rectified” part are also parameters. In other words, the sine can be any sine, and we can put our “straight line” wherever we like on the wave.
Well, ok, but why should we even bother with the attention mechanism? Why are we not using something simpler, like Feed Forward Neural Networks (FFNNs) or Convolutional Neural Networks (CNNs)?
Well, because, again, we are assuming that the “modified” signal can be flattened anywhere in the time series and for any length. This means that a standard Neural Network is not that efficient, because the anomalous part of the time series is not always in the same portion of the signal. In other words, if you just try to deal with this using a linear weight matrix plus a non-linear function, you will get suboptimal results, because index 300 of time series 1 can be completely different from index 300 of time series 14. What we need instead is a dynamic approach that puts the attention on the anomalous part of the series. This is why (and where) the attention mechanism shines.
This blog post will be divided into these 4 steps:
- Code Setup. Before getting into the code, I will display the setup, with all the libraries we will need.
- Data Generation. I will provide the code that we will need for the data generation part.
- Model Implementation. I will provide the implementation of the attention model.
- Exploration of the results. The benefits of the attention model will be displayed through the attention scores, and the classification metrics will assess the performance of our approach.
It seems like we have a lot of ground to cover. Let’s get started! 🚀
1. Code Setup
Before delving into the code, let’s invoke some friends that we will need for the rest of the implementation.
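Here is a minimal sketch of what this import block might look like, assuming the NumPy, Matplotlib, PyTorch, and scikit-learn stack used throughout this post:

```python
# Assumed stack: NumPy for signal generation, Matplotlib for plots,
# PyTorch for the model, scikit-learn for metrics and splits.
import json

import numpy as np
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)
```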
These are fairly standard libraries that we will use throughout the project. What you see below is the short and sweet requirements.txt file.
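Given the stack above, something along these lines should be all you need (version pins are optional and left out on purpose):

```
numpy
matplotlib
torch
scikit-learn
```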
I like it when things are easy to change and modular. For this reason, I created a .json file where we can change everything about the setup. Some of these parameters are:
- The number of normal vs abnormal time series (the ratio between the two)
- The number of time series steps (how long your timeseries is)
- The size of the generated dataset
- The min and max locations and lengths of the linearized part
- Much more.
The .json file looks like this.
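As a reference, here is a sketch of what such a configuration could contain; the key names and values below are illustrative placeholders, not necessarily the exact ones used in the project:

```json
{
  "n_samples": 2000,
  "anomaly_ratio": 0.5,
  "n_steps": 500,
  "amplitude_range": [0.5, 2.0],
  "frequency_range": [1.0, 5.0],
  "min_flat_location": 50,
  "max_flat_location": 350,
  "min_flat_length": 20,
  "max_flat_length": 100,
  "random_seed": 42
}
```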
So, before going to the next step, make sure you have:
- The constants.py file in your work folder
- The .json file in your work folder (or in a path that you remember)
- The libraries in the requirements.txt file installed
2. Data Generation
Two simple functions build the normal sine wave and the modified (rectified) one. The code for this is found in data_utils.py:
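A minimal sketch of those two functions could look like this (the function names and signatures here are illustrative, not the exact ones from data_utils.py):

```python
import numpy as np

def generate_normal_sine(n_steps, amplitude, frequency):
    """Generate a plain sine wave with the given amplitude and frequency."""
    t = np.linspace(0, 1, n_steps)
    return amplitude * np.sin(2 * np.pi * frequency * t)

def generate_rectified_sine(n_steps, amplitude, frequency, flat_start, flat_length):
    """Generate a sine wave and flatten (rectify) a segment of it.

    The segment [flat_start, flat_start + flat_length) is replaced with the
    signal value at flat_start, i.e., a straight line.
    """
    wave = generate_normal_sine(n_steps, amplitude, frequency)
    flat_end = min(flat_start + flat_length, n_steps)
    wave[flat_start:flat_end] = wave[flat_start]
    return wave
```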
Now that we have the basics, we can do all the backend work in data.py. This is intended to be the function that does it all:
- Receives the setup information from the .json file (that’s why you need it!)
- Builds the modified and normal sine waves
- Performs the train/test and train/val/test splits for the model validation
The data.py script is the following:
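Here is a condensed sketch of what such a script could look like, assuming the illustrative config keys and helper functions sketched above (the function name build_dataset is my own):

```python
import json
import numpy as np
from sklearn.model_selection import train_test_split

from data_utils import generate_normal_sine, generate_rectified_sine

def build_dataset(config_path):
    """Read the .json setup, generate normal/rectified waves, and split them."""
    with open(config_path) as f:
        cfg = json.load(f)

    rng = np.random.default_rng(cfg["random_seed"])
    X, y = [], []
    for _ in range(cfg["n_samples"]):
        amplitude = rng.uniform(*cfg["amplitude_range"])
        frequency = rng.uniform(*cfg["frequency_range"])
        is_anomalous = rng.random() < cfg["anomaly_ratio"]
        if is_anomalous:
            start = rng.integers(cfg["min_flat_location"], cfg["max_flat_location"])
            length = rng.integers(cfg["min_flat_length"], cfg["max_flat_length"])
            wave = generate_rectified_sine(cfg["n_steps"], amplitude, frequency, start, length)
        else:
            wave = generate_normal_sine(cfg["n_steps"], amplitude, frequency)
        X.append(wave)
        y.append(int(is_anomalous))

    X, y = np.array(X, dtype=np.float32), np.array(y)
    # Train/val/test split (roughly 70/15/15 under these assumptions).
    X_train, X_tmp, y_train, y_tmp = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=cfg["random_seed"])
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=cfg["random_seed"])
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```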
The additional data script is the one that prepares the data for Torch (SineWaveTorchDataset), and it looks like this:
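A minimal sketch of such a wrapper (the tensor shapes assume one feature per time step):

```python
import torch
from torch.utils.data import Dataset

class SineWaveTorchDataset(Dataset):
    """Wrap the generated (X, y) arrays so they can be fed to a DataLoader."""

    def __init__(self, X, y):
        # Shape: (n_samples, n_steps, 1) so the LSTM sees one feature per step.
        self.X = torch.as_tensor(X, dtype=torch.float32).unsqueeze(-1)
        self.y = torch.as_tensor(y, dtype=torch.float32)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]
```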
If you want to take a look, this is a random anomalous time series:

And this is a non-anomalous time series:

Now that we have our dataset, we can worry about the model implementation.
3. Model Implementation
The implementation of the model, the training, and the loader can be found in the model.py code:
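The full script also handles the training loop and the loaders; a minimal sketch of the model class itself, along the lines described below (layer sizes and names here are illustrative), is:

```python
import torch
import torch.nn as nn

class AttentionLSTMClassifier(nn.Module):
    """Bidirectional LSTM + attention over the time steps, for binary classification."""

    def __init__(self, input_size=1, hidden_size=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_size, 1)   # one raw score per time step
        self.fc = nn.Linear(2 * hidden_size, 1)     # classification head

    def forward(self, x):
        # x: (batch, n_steps, 1)
        h, _ = self.lstm(x)                          # (batch, n_steps, 2*hidden)
        scores = self.attn(h).squeeze(-1)            # (batch, n_steps)
        alpha = torch.softmax(scores, dim=1)         # attention weights
        context = (alpha.unsqueeze(-1) * h).sum(1)   # weighted sum over time
        logit = self.fc(context).squeeze(-1)         # (batch,)
        return logit, alpha
```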
Now, let me take some time to explain why the attention mechanism is a game-changer here. Unlike FFNNs or CNNs, which treat all time steps equally, attention dynamically highlights the parts of the sequence that matter most for classification. This allows the model to “zoom in” on the anomalous section (regardless of where it appears), making it especially powerful for irregular or unpredictable time series patterns.
Let me be more precise here and talk about the Neural Network.
In our model, we use a bidirectional LSTM to process the time series, capturing both past and future context at each time step. Then, instead of feeding the LSTM output directly into a classifier, we compute attention scores over the entire sequence. These scores determine how much weight each time step should have when forming the final context vector used for classification. This means the model learns to focus only on the meaningful parts of the signal (i.e., the flat anomaly), no matter where they occur.
Now let’s connect the model and the data to see the performance of our approach.
4. A Practical Example
4.1 Training the Model
Given the big backend part that we developed, we can train the model with this super simple block of code.
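For reference, a sketch of such a training block, reusing the illustrative classes from the previous sections (the file name config.json is a placeholder, and the real backend also wires in the validation pass and early stopping):

```python
# Assumes build_dataset, SineWaveTorchDataset, and AttentionLSTMClassifier
# from the sketches above are in scope (or imported from your own scripts).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

(X_train, y_train), (X_val, y_val), (X_test, y_test) = build_dataset("config.json")

train_loader = DataLoader(SineWaveTorchDataset(X_train, y_train), batch_size=32, shuffle=True)
val_loader = DataLoader(SineWaveTorchDataset(X_val, y_val), batch_size=32)

model = AttentionLSTMClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()

for epoch in range(50):
    model.train()
    for xb, yb in train_loader:
        optimizer.zero_grad()
        logits, _ = model(xb)
        loss = criterion(logits, yb)
        loss.backward()
        optimizer.step()
    # The real backend evaluates on val_loader here and stops early
    # when the validation loss stops improving.
```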
This took around 5 minutes on the CPU to complete.
Notice that we implemented (in the backend) early stopping and a train/val/test split to avoid overfitting. We are responsible kids.
4.2 Attention Mechanism
Let’s use the following function here to display the attention mechanism together with the sine function.
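A sketch of what such a plotting helper could look like (the function name and figure layout are my own choices):

```python
import matplotlib.pyplot as plt
import torch

def plot_attention(model, series):
    """Plot a single time series together with its attention weights."""
    model.eval()
    with torch.no_grad():
        x = torch.as_tensor(series, dtype=torch.float32).reshape(1, -1, 1)
        _, alpha = model(x)

    fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(10, 5))
    ax1.plot(series)
    ax1.set_ylabel("Signal")
    ax2.plot(alpha.squeeze(0).numpy())
    ax2.set_ylabel("Attention weight")
    ax2.set_xlabel("Time step")
    plt.show()
```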
Let’s show the attention scores for a normal time series.

As we can see, the attention scores are localized (with a sort of time shift) on the areas where the signal is flattest, which for a normal sine wave is near the peaks. Nonetheless, these are only localized spikes.
Now let’s look at an anomalous time series.

As we can see here, the model recognizes (with the same time shift) the area where the function flattens out. This time, however, it is not a localized peak: there is a whole section of the signal with higher-than-usual scores. Bingo.
4.3 Classification Performance
Ok, this is nice and all, but does this work? Let’s implement the function to generate the classification report.
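A sketch of such a function, built on scikit-learn metrics (the function name and the 0.5 threshold are illustrative choices):

```python
import torch
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

def evaluate_classifier(model, X_test, y_test):
    """Compute the classification metrics reported below on the test set."""
    model.eval()
    with torch.no_grad():
        x = torch.as_tensor(X_test, dtype=torch.float32).unsqueeze(-1)
        logits, _ = model(x)
        probs = torch.sigmoid(logits).numpy()
    preds = (probs >= 0.5).astype(int)

    print("Accuracy :", accuracy_score(y_test, preds))
    print("Precision:", precision_score(y_test, preds))
    print("Recall   :", recall_score(y_test, preds))
    print("F1 Score :", f1_score(y_test, preds))
    print("ROC AUC  :", roc_auc_score(y_test, probs))
    print("Confusion Matrix:\n", confusion_matrix(y_test, preds))
```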
The results are the following:
Accuracy : 0.9775
Precision : 0.9855
Recall : 0.9685
F1 Score : 0.9769
ROC AUC Score : 0.9774

Confusion Matrix:
[[1002   14]
 [  31  953]]
Very high performance in terms of all the metrics. Works like a charm. 🙃
5. Conclusions
Thank you very much for reading through this article ❤️. It means a lot. Let’s summarize what we found on this journey and why it was helpful. In this blog post, we applied the attention mechanism to a classification task for time series. The classification was between normal time series and “modified” ones, where by “modified” we mean that a part (a random part, with random length) has been rectified (replaced with a straight line). We found that:
- Attention mechanisms were originally developed for NLP, but they also excel at identifying anomalies in time series data, especially when the location of the anomaly varies across samples. This flexibility is difficult to achieve with traditional CNNs or FFNNs.
- By using a bidirectional LSTM combined with an attention layer, our model learns what parts of the signal matter most. We saw that a posteriori through the attention scores (alpha), which reveal which time steps were most relevant for classification. This framework provides a transparent and interpretable approach: we can visualize the attention weights to understand why the model made a certain prediction.
- With minimal data and no GPU, we trained a highly accurate model (F1 score ≈ 0.98) in just a few minutes, proving that attention is accessible and powerful even for small projects.
6. About me!
Thank you again for your time. It means a lot ❤️
My name is Piero Paialunga, and I’m this guy here:

I am a Ph.D. candidate at the University of Cincinnati Aerospace Engineering Department. I talk about AI and Machine Learning in my blog posts and on LinkedIn, and here on TDS. If you liked the article and want to know more about machine learning and follow my studies, you can:
A. Follow me on Linkedin, where I publish all my stories
B. Follow me on GitHub, where you can see all my code
C. For questions, you can send me an email at [email protected]
Ciao!