Convolutional Neural Networks (CNNs) were first introduced for images, and for images they are often easy to understand.
A filter slides over pixels and detects edges, shapes, or textures. You can read this article I wrote earlier to understand how CNNs work for images with Excel.
For text, the idea is the same.
Instead of pixels, we slide filters over words.
Instead of visual patterns, we detect linguistic patterns.
And many important patterns in text are very local. Let’s take these very simple examples:
- “good” is positive
- “bad” is negative
- “not good” is negative
- “not bad” is often positive
In my previous article, we saw how to represent words as numbers using embeddings.
We also saw a key limitation: when we used a global average, word order was completely ignored.
From the model’s point of view, “not good” and “good not” looked exactly the same.
So the next challenge is clear: we want the model to take word order into account.
A 1D Convolutional Neural Network is a natural tool for this, because it scans a sentence with small sliding windows and reacts when it recognizes familiar local patterns.
1. Understanding a 1D CNN for Text: Architecture and Depth
1.1. Building a 1D CNN for text in Excel
In this article, we build a 1D CNN architecture in Excel with the following components:
- Embedding dictionary
We use a 2-dimensional embedding, because one dimension is not enough for this task. One dimension encodes sentiment, and the second encodes negation.
- Conv1D layer
This is the core component of a CNN architecture. It consists of filters that slide across the sentence with a window length of 2 words. We choose 2 words to keep things simple.
- ReLU and global max pooling
These steps keep only the strongest matches detected by the filters. We will also discuss the fact that ReLU is optional.
- Logistic regression
This is the final classification layer, which combines the detected patterns into a probability.

This pipeline corresponds to a standard CNN text classifier.
The only difference here is that we explicitly write and visualize the forward pass in Excel.
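For readers who also want to see the pipeline in code, here is a minimal Keras sketch of an equivalent architecture (an assumption on my side: the article itself only uses Excel, and the layer sizes below are simply the ones described above).

```python
import tensorflow as tf

# Minimal sketch of the architecture described above (not the Excel sheet itself).
# Vocabulary of 4 tokens: good, bad, not, and a catch-all for every other word.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(None,), dtype="int32"),                         # a sentence as a sequence of word ids
    tf.keras.layers.Embedding(input_dim=4, output_dim=2),                 # embedding dictionary (senti, neg)
    tf.keras.layers.Conv1D(filters=4, kernel_size=2, activation="relu"),  # 4 filters, window of 2 words
    tf.keras.layers.GlobalMaxPooling1D(),                                 # keep the strongest match per filter
    tf.keras.layers.Dense(1, activation="sigmoid"),                       # logistic regression on top
])
model.summary()
```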
1.2. What “deep learning” means in this architecture
Before going further, let us take a step back.
Yes, I know, I do this often, but having a global view of models really helps to understand them.
The definition of deep learning is often blurred.
For many people, deep learning simply means “many layers”.
Here, I will take a slightly different point of view.
What really characterizes deep learning is not the number of layers, but the depth of the transformation applied to the input data.
With this definition:
- Even a model with a single convolution layer can be considered deep learning, because the input is transformed into a more structured and abstract representation.
On the other hand, taking raw input data, applying one-hot encoding, and stacking many fully connected layers does not necessarily make a model deep in a meaningful sense.
In theory, if we don’t have any transformation, one layer is enough.
In CNNs, the presence of multiple layers has a very concrete motivation.
Consider a sentence like:
This movie is not very good
With a single convolution layer and a small window, we can detect simple local patterns such as: “very + good”
But we cannot yet detect higher-level patterns such as: “not + (very good)”
This is why CNNs are often stacked:
- the first layer detects simple local patterns,
- the second layer combines them into more complex ones.
In this article, we deliberately focus on one convolution layer.
This makes every step visible and easy to understand in Excel, while keeping the logic identical to deeper CNN architectures.
2. Turning words into embeddings
Let us start with some simple words. We want to detect negation, so we will use these terms, together with other words that we will not model:
- “good”
- “bad”
- “not good”
- “not bad”
We keep the representation intentionally small so that every step is visible.
We will only use a dictionary of three words: good, bad, and not.
All other words will have an all-zero embedding.
2.1 Why one dimension is not enough
In a previous article on sentiment detection, we used a single dimension.
That worked for “good” versus “bad”.
But now we want to handle negation.
One dimension can only represent one concept well.
So we need two dimensions:
- senti: sentiment polarity
- neg: negation marker
2.2 The embedding dictionary
Each word becomes a 2D vector:
- good → (senti = +1, neg = 0)
- bad → (senti = -1, neg = 0)
- not → (senti = 0, neg = +1)
- any other word → (0, 0)

This is not how real embeddings look. Real embeddings are learned, high-dimensional, and not directly interpretable.
But for understanding how Conv1D works, this toy embedding is perfect.
In Excel, this is just a lookup table.
In a real neural network, this embedding matrix would be trainable.
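As a side illustration (in Python rather than Excel, since I cannot embed a spreadsheet here), the lookup table is just a dictionary with a zero-vector default:

```python
import numpy as np

# Toy embedding dictionary: (senti, neg) for each known word, (0, 0) for everything else.
EMBEDDINGS = {
    "good": np.array([+1.0, 0.0]),
    "bad":  np.array([-1.0, 0.0]),
    "not":  np.array([ 0.0, 1.0]),
}

def embed(word):
    """Return the 2D embedding of a word, defaulting to the zero vector."""
    return EMBEDDINGS.get(word, np.zeros(2))

sentence = "it is not bad at all".split()
vectors = np.stack([embed(w) for w in sentence])  # shape: (6 words, 2 dimensions)
print(vectors)
```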

3. Conv1D filters as sliding pattern detectors
Now we arrive at the core idea of a 1D CNN.
A Conv1D filter is nothing mysterious. It is just a small set of weights plus a bias that slides over the sentence.
Because:
- each word embedding has 2 values (senti, neg)
- our window contains 2 words
each filter has:
- 4 weights (2 dimensions × 2 positions)
- 1 bias
That is all.
You can think of a filter as repeatedly asking the same question at every position:
“Do these two neighboring words match a pattern I care about?”
3.1 Sliding windows: how Conv1D sees a sentence
Consider this sentence:
it is not bad at all
We choose a window size of 2 words.
That means the model looks at every adjacent pair:
- (it, is)
- (is, not)
- (not, bad)
- (bad, at)
- (at, all)
Important point:
The filters slide everywhere, even when both words are neutral (all zeros).
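In code, the sliding is nothing more than enumerating adjacent pairs (a tiny Python sketch that reproduces the list above):

```python
# Every window of 2 adjacent words, exactly as the filters will see them.
sentence = "it is not bad at all".split()
windows = list(zip(sentence, sentence[1:]))
print(windows)
# [('it', 'is'), ('is', 'not'), ('not', 'bad'), ('bad', 'at'), ('at', 'all')]
```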

3.2 Four intuitive filters
To make the behavior easy to understand, we use four filters.

Filter 1 – “I see GOOD”
This filter looks only at the sentiment of the current word.
Plain-text equation for one window:
z = senti(current_word)
If the word is “good”, z = 1
If the word is “bad”, z = -1
If the word is neutral, z = 0
After ReLU, negative values become 0 (we will see later that ReLU is optional).
Filter 2 – “I see BAD”
This one is symmetric.
z = -senti(current_word)
So:
- “bad” → z = 1
- “good” → z = -1 → ReLU → 0
Filter 3 – “I see NOT GOOD”
This filter looks at two things at the same time:
- neg(previous_word)
- senti(current_word)
Equation:
z = neg(previous_word) + senti(current_word) – 1
Why the “-1”?
It acts like a threshold so that both conditions must be true.
Results:
- “not good” → 1 + 1 – 1 = 1 → activated
- “is good” → 0 + 1 – 1 = 0 → not activated
- “not bad” → 1 – 1 – 1 = -1 → ReLU → 0
Filter 4 – “I see NOT BAD”
Same idea, slightly different sign:
z = neg(previous_word) + (-senti(current_word)) – 1
Results:
- “not bad” → 1 + 1 – 1 = 1
- “not good” → 1 – 1 – 1 = -1 → 0
This is a very important intuition:
A CNN filter can behave like a local logical rule, learned from data.
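To make this concrete, here is a small Python sketch of the four filters written as explicit weight vectors. The weights are simply the hand-picked values implied by the equations above (a real network would learn them); each filter reads the concatenation (senti_prev, neg_prev, senti_curr, neg_curr) of a window.

```python
import numpy as np

# Each filter: 4 weights over (senti_prev, neg_prev, senti_curr, neg_curr) plus a bias.
FILTERS = {
    "good":     (np.array([0.0, 0.0,  1.0, 0.0]),  0.0),  # z = senti(current)
    "bad":      (np.array([0.0, 0.0, -1.0, 0.0]),  0.0),  # z = -senti(current)
    "not_good": (np.array([0.0, 1.0,  1.0, 0.0]), -1.0),  # z = neg(prev) + senti(curr) - 1
    "not_bad":  (np.array([0.0, 1.0, -1.0, 0.0]), -1.0),  # z = neg(prev) - senti(curr) - 1
}

def filter_score(prev_vec, curr_vec, weights, bias):
    """One Conv1D step: dot product of the window with the filter weights, plus the bias."""
    window = np.concatenate([prev_vec, curr_vec])
    return float(window @ weights + bias)

# Window "not bad": prev = "not" = (0, 1), curr = "bad" = (-1, 0)
prev, curr = np.array([0.0, 1.0]), np.array([-1.0, 0.0])
for name, (w, b) in FILTERS.items():
    print(name, filter_score(prev, curr, w, b))
# good -1.0, bad 1.0, not_good -1.0, not_bad 1.0  (values before ReLU)
```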
3.3 Final result of sliding windows
Here are the final results of these 4 filters.

4. ReLU and max pooling: from local to global
4.1 ReLU
After computing z for every window, we apply ReLU:
ReLU(z) = max(0, z)
Meaning:
- negative evidence is ignored
- positive evidence is kept
Each filter becomes a presence detector.
By the way, ReLU is simply an activation function in the neural network. So a neural network is not that difficult after all.

4.2 Global Max pooling
Then comes global max pooling.
For each filter, we keep only:
max activation over all windows
Interpretation:
“I do not care where the pattern appears, only whether it appears strongly somewhere.”
At this point, the whole sentence is summarized by 4 numbers:
- strongest “good” signal
- strongest “bad” signal
- strongest “not good” signal
- strongest “not bad” signal
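In code (a small numpy sketch, where the z values are the ones the four filters above produce for the sentence “it is not bad at all”), ReLU and global max pooling are one line each:

```python
import numpy as np

# Raw filter outputs z: one row per window, one column per filter,
# for the sentence "it is not bad at all" and the filters (good, bad, not_good, not_bad).
z = np.array([
    [ 0.0,  0.0, -1.0, -1.0],  # (it, is)
    [ 0.0,  0.0, -1.0, -1.0],  # (is, not)
    [-1.0,  1.0, -1.0,  1.0],  # (not, bad)
    [ 0.0,  0.0, -1.0, -1.0],  # (bad, at)
    [ 0.0,  0.0, -1.0, -1.0],  # (at, all)
])

activations = np.maximum(z, 0)       # ReLU: negative evidence is dropped
features = activations.max(axis=0)   # global max pooling: strongest match per filter
print(features)                      # [F_good, F_bad, F_not_good, F_not_bad] = [0, 1, 0, 1]
```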

4.3 What happens if we remove ReLU?
Without ReLU:
- negative values stay negative
- max pooling may select negative values
This mixes two ideas:
- absence of a pattern
- opposite of a pattern
The filter stops being a clean detector and becomes a signed score.
The model could still work mathematically, but interpretation becomes harder.
5. The final layer is logistic regression
Now we combine these signals.
We compute a score using a linear combination:
score = 2 × F_good – 2 × F_bad – 3 × F_not_good + 3 × F_not_bad – bias

Then we convert the score into a probability:
probability = 1 / (1 + exp(-score))
That is exactly logistic regression.
So yes:
- the CNN extracts features: this step can be considered as feature engineering, right?
- logistic regression makes the final decision: it is a classic machine learning model we know well
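Here is the same calculation as a tiny Python sketch. The weights come from the formula above; the bias value of 0 is only an assumption for the example.

```python
import math

def predict(F_good, F_bad, F_not_good, F_not_bad, bias=0.0):
    """Logistic regression on top of the pooled filter activations (bias = 0 is assumed)."""
    score = 2 * F_good - 2 * F_bad - 3 * F_not_good + 3 * F_not_bad - bias
    return 1 / (1 + math.exp(-score))

# "it is not bad at all": F_good = 0, F_bad = 1, F_not_good = 0, F_not_bad = 1
print(predict(0, 1, 0, 1))  # about 0.73, above 0.5 -> positive
```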

6. Full examples with sliding filters
Example 1
“it is bad, so it is not good at all”
The sentence contains “bad” and “not good” (and therefore also the word “good”).
After max pooling:
- F_good = 1 (because “good” exists)
- F_bad = 1
- F_not_good = 1
- F_not_bad = 0
Final score becomes strongly negative.
Prediction: negative sentiment.

Example 2
“it is good. yes, not bad.”
The sentence contains “good”, the word “bad”, and the pattern “not bad”.
After max pooling:
- F_good = 1
- F_bad = 1 (because the word “bad” appears)
- F_not_good = 0
- F_not_bad = 1
The final linear layer learns that “not bad” should outweigh “bad”.
Prediction: positive sentiment.
This also shows something important: max pooling keeps all strong signals.
The final layer decides how to combine them.
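Putting everything together, here is an end-to-end Python sketch of the forward pass on these two examples. It uses the same toy embeddings, filters, and weights as above; stripping punctuation and using a bias of 0 are assumptions of this sketch.

```python
import numpy as np

EMBEDDINGS = {"good": [1.0, 0.0], "bad": [-1.0, 0.0], "not": [0.0, 1.0]}
FILTER_W = np.array([
    [0.0, 0.0,  1.0, 0.0],   # "I see GOOD"
    [0.0, 0.0, -1.0, 0.0],   # "I see BAD"
    [0.0, 1.0,  1.0, 0.0],   # "I see NOT GOOD"
    [0.0, 1.0, -1.0, 0.0],   # "I see NOT BAD"
])
FILTER_B = np.array([0.0, 0.0, -1.0, -1.0])
FINAL_W = np.array([2.0, -2.0, -3.0, 3.0])  # logistic regression weights
FINAL_B = 0.0                               # assumed bias value

def predict(sentence):
    words = [w.strip(".,!") for w in sentence.lower().split()]
    vecs = np.array([EMBEDDINGS.get(w, [0.0, 0.0]) for w in words])
    windows = np.hstack([vecs[:-1], vecs[1:]])   # all pairs of adjacent words, shape (n-1, 4)
    z = windows @ FILTER_W.T + FILTER_B          # Conv1D: one score per window and per filter
    features = np.maximum(z, 0).max(axis=0)      # ReLU + global max pooling
    score = features @ FINAL_W - FINAL_B         # logistic regression
    return 1 / (1 + np.exp(-score))

print(predict("it is bad, so it is not good at all"))  # well below 0.5 -> negative
print(predict("it is good. yes, not bad."))            # well above 0.5 -> positive
```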

Example 3 – a limitation that explains why CNNs get deeper
Try this sentence:
“it is not very bad”
With a window of size 2, the model sees: (it, is), (is, not), (not, very), (very, bad).
It never sees (not, bad), so the “not bad” filter never fires.
This explains why real models use:
- larger windows
- multiple convolution layers
- or other architectures for longer dependencies

Conclusion
The strength of Excel is visibility.
You can see:
- the embedding dictionary
- all filter weights and biases
- every sliding window
- every ReLU activation
- the max pooling result
- the logistic regression parameters
Training is simply the process of adjusting these numbers.
Once you see that, CNNs stop being mysterious.
They become what they really are: structured, trainable pattern detectors that slide over data.