Understanding Neural Networks Deeply

Why Edges Come Before Shapes, Why We Use Conv2d, and Why ReLU Must Follow Convolution

When beginners first learn about neural networks — especially convolutional neural networks (CNNs) — they often ask:

  • Why do early layers detect edges, not shapes?
  • Why do we use Conv2d so much?
  • What does convolution actually mean?
  • Why must we apply ReLU after every Conv2d layer?
  • How does stacking layers let networks learn complex patterns?

This guide explains the answers deeply and intuitively, with examples and reasoning from math, engineering, and even neuroscience.
If you’ve ever wanted to truly understand how deep learning works “under the hood,” this article is for you.


🔥 Part 1 — Why Deeper Layers Learn Higher-Level Features

Neural networks learn in a hierarchy:

Layer   What It Learns   Why
1       Edges            simplest patterns, most information-rich
2       Shapes           edges combine into corners, curves
3       Textures         shapes combine into repeated patterns
4+      Object parts     eyes, wheels, leaves
Final   Objects          cat, dog, car, etc.

❗ But here’s the key:

We never program the network to do this.

There is no code like:

layer1.learn_edges()
layer2.learn_shapes()

Instead, the network has one objective:

Minimize the loss.

Through backpropagation, each layer learns whatever features are most useful to reduce that loss.
Edges happen to be the simplest, strongest signals → so they appear first.
Shapes require edges → so they appear later.

This is called emergent hierarchical feature learning.
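
To make that concrete, here is a minimal sketch of what training code actually contains (the tiny model and random batch are placeholders). Notice that nothing in it mentions edges or shapes, only a loss:

import torch
import torch.nn as nn

# Dummy model and batch: nothing here names edges, shapes, or textures
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(8 * 32 * 32, 10),
)
images = torch.randn(4, 3, 32, 32)
labels = torch.randint(0, 10, (4,))

loss = nn.CrossEntropyLoss()(model(images), labels)  # the single objective
loss.backward()  # gradients tell every filter how to reduce this one number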


🔍 Part 2 — What Convolution Really Means

Convolution is the heart of image understanding.

✔ Simple explanation:

Convolution = sliding a small filter (like a 3×3 grid) over an image to detect patterns.

Example filter:

[ 1  0 -1 ]
[ 1  0 -1 ]
[ 1  0 -1 ]

This detects vertical edges.

What the convolution does:

  1. Multiply the filter values element-wise with the pixels under the window
  2. Sum the results into a single number (one output pixel)
  3. Slide the filter one stride over
  4. Repeat across the whole image

This process extracts:

  • edges
  • corners
  • curves
  • textures
  • shapes

Deep layers stack these patterns into more complex concepts.

✔ Mathematical definition:

(I * K)(x,y) = \sum_{m,n} I(x+m, y+n)\cdot K(m,n)

But the intuition is enough: convolution is a pattern detector.
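
(Strictly speaking, the formula above, like nn.Conv2d itself, computes cross-correlation: the kernel is not flipped. Deep learning simply calls it convolution.)

To see the pattern detector in action, here is a minimal sketch that applies the vertical-edge filter from above to a toy image with a dark-to-bright boundary:

import torch
import torch.nn.functional as F

# The vertical-edge filter from above, shaped (out_ch, in_ch, H, W)
kernel = torch.tensor([[1., 0., -1.],
                       [1., 0., -1.],
                       [1., 0., -1.]]).reshape(1, 1, 3, 3)

# Toy 5x5 image: dark (0) on the left, bright (1) on the right
image = torch.tensor([0., 0., 1., 1., 1.]).repeat(5).reshape(1, 1, 5, 5)

# Slide, multiply, sum: exactly the four steps listed above
response = F.conv2d(image, kernel)
print(response.squeeze())
# tensor([[-3., -3., 0.],
#         [-3., -3., 0.],
#         [-3., -3., 0.]])
# Nonzero values mark the windows that straddle the vertical edge.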


🟦 Part 3 — Why We Almost Always Use Conv2d for Images

nn.Conv2d is the best tool for image tasks because:

✔ 1. Images have spatial structure

Nearby pixels are related. Convolution respects locality.

✔ 2. Convolution shares weights

One small filter is reused across the whole image → fewer parameters → less overfitting.

✔ 3. Translation equivariance

The same filter detects the same feature anywhere in the image (pooling layers then add a degree of translation invariance).

✔ 4. Efficiency

A fully connected layer on a 224×224×3 image would require over 150,000 weights per neuron (224 × 224 × 3 = 150,528).
A 3×3 Conv2d filter needs just 9 weights per input channel (27 in total for RGB input).
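
A quick sanity check in PyTorch makes the gap concrete (the layer sizes are illustrative):

import torch.nn as nn

fc = nn.Linear(224 * 224 * 3, 64)        # dense: every pixel feeds every unit
conv = nn.Conv2d(3, 64, kernel_size=3)   # 64 filters, each 3x3x3

print(sum(p.numel() for p in fc.parameters()))    # 9633856 (~9.6M)
print(sum(p.numel() for p in conv.parameters()))  # 1792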

✔ 5. Hierarchical feature extraction

Deep Conv2d layers naturally build up patterns:

Edges → Shapes → Textures → Object parts → Objects

This is why networks like AlexNet, VGG, ResNet, and MobileNet are based on convolution.

Even Vision Transformers (ViT) still use a convolution-like patch embedding at the input.
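
For example, a ViT-style patch embedding is commonly implemented as a strided convolution. The 16×16 patches and 768-dim embedding below are the ViT-Base defaults; the variable names are just for illustration:

import torch
import torch.nn as nn

# Each non-overlapping 16x16 patch becomes one 768-dim token
patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)

x = torch.randn(1, 3, 224, 224)
tokens = patch_embed(x).flatten(2).transpose(1, 2)
print(tokens.shape)  # torch.Size([1, 196, 768]); a 14x14 grid of patch tokens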


🔥 Part 4 — Understanding Conv2d Parameters

nn.Conv2d has several parameters, but the most important are:

nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding)

Parameter      Meaning
in_channels    input depth (e.g., 1 = grayscale, 3 = RGB, 32 = feature maps)
out_channels   number of filters to learn
kernel_size    size of each filter (3×3, 5×5, etc.)
stride         how far the filter moves at each step
padding        zeros added around the border to control the output size

A common Conv layer:

nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)

Means:

  • 3 input channels (RGB in)
  • 64 learned 3×3 filters → 64 output feature maps
  • stride 1 with padding 1 → the spatial size stays the same

This is the core building block of modern CNNs.
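
For reference, the output spatial size follows

H_{\text{out}} = \left\lfloor \frac{H_{\text{in}} + 2\cdot\text{padding} - \text{kernel\_size}}{\text{stride}} \right\rfloor + 1

so kernel 3, stride 1, padding 1 leaves the height and width unchanged. A quick shape check (the batch and image sizes are arbitrary):

import torch
import torch.nn as nn

conv = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)
x = torch.randn(1, 3, 224, 224)   # one RGB image
print(conv(x).shape)              # torch.Size([1, 64, 224, 224])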


⚡ Part 5 — Why ReLU Must Come After Conv2d

After every Conv2d, we apply ReLU:

nn.Conv2d(...)
nn.ReLU()

But why?

✔ 1. Convolution is linear

Without a nonlinear activation between them, stacked Conv layers collapse into one single linear function: two convolutions in a row are mathematically equivalent to a single (larger) convolution.

Such a model can still detect edges (an edge filter is linear), but it can never combine them into:

  • shapes
  • textures
  • nonlinear decision boundaries for classification

✔ 2. ReLU introduces nonlinearity

ReLU(x) = \max(0, x)

This breaks linearity and lets layers combine features in complex ways.
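
A small numerical check of both claims (random weights and inputs; the shapes are arbitrary): without ReLU, the stacked convolutions satisfy f(x1 + x2) = f(x1) + f(x2); with ReLU they do not.

import torch
import torch.nn as nn

linear_stack = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1, bias=False),
    nn.Conv2d(8, 1, 3, padding=1, bias=False),
)
relu_stack = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1, bias=False),
    nn.ReLU(),
    nn.Conv2d(8, 1, 3, padding=1, bias=False),
)

x1, x2 = torch.randn(1, 1, 8, 8), torch.randn(1, 1, 8, 8)

# Two convs with no activation are still one linear map
print(torch.allclose(linear_stack(x1 + x2),
                     linear_stack(x1) + linear_stack(x2), atol=1e-5))  # True

# Insert a ReLU and linearity breaks; the network gains expressive power
print(torch.allclose(relu_stack(x1 + x2),
                     relu_stack(x1) + relu_stack(x2), atol=1e-5))      # False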

✔ 3. ReLU prevents vanishing gradients

Unlike sigmoid/tanh, whose gradients shrink toward zero for large inputs, ReLU has a gradient of exactly 1 for every positive input → gradients survive deep stacks → faster training.

✔ 4. ReLU produces clean, activated features

Negative responses are zeroed out → activation maps become sparse, keeping only the locations where an edge or texture actually fired.

This combination is simple but incredibly powerful:

Conv → ReLU → Conv → ReLU → Conv → ReLU → …

This is the basic structure of almost every successful CNN.
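
In code, that pattern is just a repeated block. A VGG-style sketch (the channel widths are illustrative):

import torch.nn as nn

# Conv -> ReLU repeated, with occasional pooling to shrink the spatial size
features = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),   nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1),  nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
)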


🧠 Part 6 — Putting It All Together: How Deep Networks Learn

A neural network learns by:

1. Forward pass
Compute predictions.

2. Loss function
Measure the error.

3. Backpropagation
Compute gradients:

\frac{\partial L}{\partial W}

4. Weight update
Using an optimizer like SGD or Adam:

W \leftarrow W - \eta \, \frac{\partial L}{\partial W}

5. Repeat for many iterations
Each layer adjusts its weights to help reduce the loss.
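
Here are all five steps in one runnable sketch (the tiny model, dummy batch, and hyperparameters are placeholders):

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10),
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

images = torch.randn(8, 3, 32, 32)            # dummy batch
labels = torch.randint(0, 10, (8,))

logits = model(images)                        # 1. forward pass
loss = criterion(logits, labels)              # 2. loss function
optimizer.zero_grad()
loss.backward()                               # 3. backpropagation: dL/dW
optimizer.step()                              # 4. weight update: W <- W - lr * dL/dW
# 5. repeat over many batches and epochs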

This automatic process is what creates:

  • edge filters
  • shape detectors
  • texture patterns
  • object detectors

You never program these manually.
They emerge naturally because they help minimize the final loss.


🎯 Final Thoughts

Neural networks seem magical, but they work because of powerful ideas:

  • Convolution extracts local patterns.
  • ReLU introduces nonlinearity.
  • Backpropagation trains filters automatically.
  • Deep layers build hierarchical representations.
  • Edges → shapes → textures → objects is a natural consequence of how networks process information.

Once you understand these principles, you understand the core of modern deep learning.

