Understanding Training, Validation, and Testing in Machine Learning
A Complete Guide to How Models Learn, Improve, and Get Evaluated
When learning machine learning or deep learning, one of the most important foundations is understanding the three phases of model development:
- Training
- Validation
- Testing
These three phases ensure that a model not only learns patterns, but also generalizes and performs well in the real world.
This article explains each phase clearly and shows you how they fit together in a complete, automated workflow.
🔥 Part 1 — Training: Where the Model Learns
The training phase is where the neural network actually learns from data.
During training:
- The model performs a forward pass to make predictions
- Loss is computed
- Backpropagation calculates gradients
- The optimizer updates weights
- Repeat for many epochs
✔ Purpose of Training
- Learn patterns
- Adjust model parameters
- Reduce training loss
✔ PyTorch Example
model.train()
for x, y in train_loader:
optimizer.zero_grad()
preds = model(x)
loss = criterion(preds, y)
loss.backward()
optimizer.step()
This is the “learning loop” that makes the model smarter.
🔍 Part 2 — Validation: Where We Tune and Select the Best Model
The validation phase is NOT training.
The model only runs forward to estimate how well it generalizes.
During validation:
- No gradients
- No learning
- No weight updates
- Only evaluation
Validation tells us:
- Is the model overfitting?
- Should we adjust hyperparameters?
- Which epoch produced the best model?
✔ Validation in PyTorch
model.eval()
with torch.no_grad():
for x, y in val_loader:
preds = model(x)
val_loss += criterion(preds, y).item()
✔ Why validation matters
- Prevents overfitting
- Helps tune learning rate, architecture, dropout
- Allows us to save the best model checkpoint
⭐ Part 3 — “Validate Every Epoch → Save Best Model”
After each epoch, we validate the model.
If this epoch’s validation loss is the best so far, we save a checkpoint.
Example:
| Epoch | Train Loss | Val Loss | Action |
|---|---|---|---|
| 1 | 0.50 | 0.42 | save |
| 2 | 0.40 | 0.36 | save |
| 3 | 0.32 | 0.30 | save |
| 4 | 0.28 | 0.35 | no save |
| 5 | 0.26 | 0.31 | no save |
The best model is from epoch 3, not the last epoch.
This prevents overfitting and ensures we select the best version.
⚙️ Part 4 — Early Stopping (Optional Optimization)
Early stopping automatically ends training when validation stops improving.
Example:
patience = 5
Meaning:
👉 If validation loss does not improve for 5 consecutive epochs → stop training.
This saves time and stops overfitting.
🧪 Part 5 — Testing: The Final Exam
The test set is used only once, after selecting the best model.
Test set evaluates:
- Final accuracy
- Real-world performance
- Generalization ability
⚠️ Critical rule:
Do NOT tune model based on test results.
Otherwise test data becomes contaminated.
📊 Part 6 — Full Workflow (Mermaid.js Diagram)
Below is a complete, clean Mermaid.js diagram showing the whole process:
flowchart TD
A[Dataset] --> B[Split Train / Validation / Test]
B --> C[Training Loop<br>Forward + Backprop + Update]
C --> D[Validation Loop<br>No Gradient]
D --> E{Best Validation<br>Performance?}
E -->|Yes| F[Save Best Model Checkpoint]
E -->|No| G[Skip Saving]
F --> H[Continue Training]
G --> H[Continue Training]
H --> I{Training Finished<br>or Early Stopping?}
I -->|No| C
I -->|Yes| J[Load Best Checkpoint]
J --> K[Test Set Evaluation]
K --> L[Final Performance]
This diagram summarizes the entire machine learning lifecycle from data processing to final evaluation.
🧠 Part 7 — Summary
| Phase | Purpose | Learns? |
|---|---|---|
| Training | Learn weights & features | ✔ Yes |
| Validation | Tune hyperparameters, pick best model | ❌ No |
| Testing | Final unbiased evaluation | ❌ No |
The golden workflow:
Train → Validate → Save Best Model → Test Once
This ensures a model that learns well and generalizes well.
Get in Touch with us
Related Posts
- AI驱动的 Network Security Monitoring(NSM)
- AI-Powered Network Security Monitoring (NSM)
- 使用开源 + AI 构建企业级系统
- How to Build an Enterprise System Using Open-Source + AI
- AI会在2026年取代软件开发公司吗?企业管理层必须知道的真相
- Will AI Replace Software Development Agencies in 2026? The Brutal Truth for Enterprise Leaders
- 使用开源 + AI 构建企业级系统(2026 实战指南)
- How to Build an Enterprise System Using Open-Source + AI (2026 Practical Guide)
- AI赋能的软件开发 —— 为业务而生,而不仅仅是写代码
- AI-Powered Software Development — Built for Business, Not Just Code
- Agentic Commerce:自主化采购系统的未来(2026 年完整指南)
- Agentic Commerce: The Future of Autonomous Buying Systems (Complete 2026 Guide)
- 如何在现代 SOC 中构建 Automated Decision Logic(基于 Shuffle + SOC Integrator)
- How to Build Automated Decision Logic in a Modern SOC (Using Shuffle + SOC Integrator)
- 为什么我们选择设计 SOC Integrator,而不是直接进行 Tool-to-Tool 集成
- Why We Designed a SOC Integrator Instead of Direct Tool-to-Tool Connections
- 基于 OCPP 1.6 的 EV 充电平台构建 面向仪表盘、API 与真实充电桩的实战演示指南
- Building an OCPP 1.6 Charging Platform A Practical Demo Guide for API, Dashboard, and Real EV Stations
- 软件开发技能的演进(2026)
- Skill Evolution in Software Development (2026)













