Understanding Training, Validation, and Testing in Machine Learning
A Complete Guide to How Models Learn, Improve, and Get Evaluated
When learning machine learning or deep learning, one of the most important foundations is understanding the three phases of model development:
- Training
- Validation
- Testing
These three phases ensure that a model not only learns patterns, but also generalizes and performs well in the real world.
This article explains each phase clearly and shows you how they fit together in a complete, automated workflow.
🔥 Part 1 — Training: Where the Model Learns
The training phase is where the neural network actually learns from data.
During training:
- The model performs a forward pass to make predictions
- Loss is computed
- Backpropagation calculates gradients
- The optimizer updates weights
- Repeat for many epochs
✔ Purpose of Training
- Learn patterns
- Adjust model parameters
- Reduce training loss
✔ PyTorch Example
model.train()
for x, y in train_loader:
optimizer.zero_grad()
preds = model(x)
loss = criterion(preds, y)
loss.backward()
optimizer.step()
This is the “learning loop” that makes the model smarter.
🔍 Part 2 — Validation: Where We Tune and Select the Best Model
The validation phase is NOT training.
The model only runs forward to estimate how well it generalizes.
During validation:
- No gradients
- No learning
- No weight updates
- Only evaluation
Validation tells us:
- Is the model overfitting?
- Should we adjust hyperparameters?
- Which epoch produced the best model?
✔ Validation in PyTorch
model.eval()
with torch.no_grad():
for x, y in val_loader:
preds = model(x)
val_loss += criterion(preds, y).item()
✔ Why validation matters
- Prevents overfitting
- Helps tune learning rate, architecture, dropout
- Allows us to save the best model checkpoint
⭐ Part 3 — “Validate Every Epoch → Save Best Model”
After each epoch, we validate the model.
If this epoch’s validation loss is the best so far, we save a checkpoint.
Example:
| Epoch | Train Loss | Val Loss | Action |
|---|---|---|---|
| 1 | 0.50 | 0.42 | save |
| 2 | 0.40 | 0.36 | save |
| 3 | 0.32 | 0.30 | save |
| 4 | 0.28 | 0.35 | no save |
| 5 | 0.26 | 0.31 | no save |
The best model is from epoch 3, not the last epoch.
This prevents overfitting and ensures we select the best version.
⚙️ Part 4 — Early Stopping (Optional Optimization)
Early stopping automatically ends training when validation stops improving.
Example:
patience = 5
Meaning:
👉 If validation loss does not improve for 5 consecutive epochs → stop training.
This saves time and stops overfitting.
🧪 Part 5 — Testing: The Final Exam
The test set is used only once, after selecting the best model.
Test set evaluates:
- Final accuracy
- Real-world performance
- Generalization ability
⚠️ Critical rule:
Do NOT tune model based on test results.
Otherwise test data becomes contaminated.
📊 Part 6 — Full Workflow (Mermaid.js Diagram)
Below is a complete, clean Mermaid.js diagram showing the whole process:
flowchart TD
A[Dataset] --> B[Split Train / Validation / Test]
B --> C[Training Loop<br>Forward + Backprop + Update]
C --> D[Validation Loop<br>No Gradient]
D --> E{Best Validation<br>Performance?}
E -->|Yes| F[Save Best Model Checkpoint]
E -->|No| G[Skip Saving]
F --> H[Continue Training]
G --> H[Continue Training]
H --> I{Training Finished<br>or Early Stopping?}
I -->|No| C
I -->|Yes| J[Load Best Checkpoint]
J --> K[Test Set Evaluation]
K --> L[Final Performance]
This diagram summarizes the entire machine learning lifecycle from data processing to final evaluation.
🧠 Part 7 — Summary
| Phase | Purpose | Learns? |
|---|---|---|
| Training | Learn weights & features | ✔ Yes |
| Validation | Tune hyperparameters, pick best model | ❌ No |
| Testing | Final unbiased evaluation | ❌ No |
The golden workflow:
Train → Validate → Save Best Model → Test Once
This ensures a model that learns well and generalizes well.
Get in Touch with us
Related Posts
- The Accounting Software Your Firm Uses Is Built for Your Clients, Not for You
- 2026年本地大模型(Local LLM)硬件选型实用指南
- Choosing Hardware for Local LLMs in 2026: A Practical Sizing Guide
- Why Your Finance Team Spends 40% of Their Week on Work AI Can Now Do
- 用纯开源方案搭建生产级 SOC:Wazuh + DFIR-IRIS + 自研集成层实战记录
- How We Built a Real Security Operations Center With Open-Source Tools
- FarmScript:我们如何从零设计一门农业IoT领域特定语言
- FarmScript: How We Designed a Programming Language for Chanthaburi Durian Farmers
- 智慧农业项目为何止步于试点阶段
- Why Smart Farming Projects Fail Before They Leave the Pilot Stage
- ERP项目为何总是超支、延期,最终令人失望
- ERP Projects: Why They Cost More, Take Longer, and Disappoint More Than Expected
- AI Security in Production: What Enterprise Teams Must Know in 2026
- 弹性无人机蜂群设计:具备安全通信的无领导者容错网状网络
- Designing Resilient Drone Swarms: Leaderless-Tolerant Mesh Networks with Secure Communications
- NumPy广播规则详解:为什么`(3,)`和`(3,1)`行为不同——以及它何时会悄悄给出错误答案
- NumPy Broadcasting Rules: Why `(3,)` and `(3,1)` Behave Differently — and When It Silently Gives Wrong Answers
- 关键基础设施遭受攻击:从乌克兰电网战争看工业IT/OT安全
- Critical Infrastructure Under Fire: What IT/OT Security Teams Can Learn from Ukraine’s Energy Grid
- LM Studio代码开发的系统提示词工程:`temperature`、`context_length`与`stop`词详解













