Deep Learning 101: A Beginner's Complete Guide to Neural Networks & PyTorch

Part I

Foundations

Chapter 1

What Is Deep Learning?

Before we write a single line of code, let's understand what we're getting into — and why it matters.

1.1 A Brief History of Teaching Machines to Think

The dream of artificial intelligence is old. Ancient Greek myths told of Talos, a giant bronze automaton that guarded Crete. Medieval engineers built mechanical birds. But the real story of AI begins in the 1940s, when mathematicians and logicians first asked: can a machine think?

1943

McCulloch & Pitts propose the first mathematical model of a neuron — a simple unit that takes inputs, applies weights, and produces an output.

1957

Frank Rosenblatt builds the Perceptron, a machine that learns to classify patterns. The following year, The New York Times describes it as the embryo of a computer expected to "walk, talk, see, write, reproduce itself and be conscious of its existence."

1969

Minsky & Papert show that a single-layer Perceptron can't solve XOR (exclusive or). The result helps trigger an "AI Winter," and funding for neural network research dries up for over a decade.

1986

Rumelhart, Hinton & Williams popularize backpropagation, showing how multi-layer networks can learn complex patterns. The field warms up again.

2012

AlexNet, built by Alex Krizhevsky with Ilya Sutskever and Geoffrey Hinton, wins the ImageNet challenge by a massive margin using deep CNNs and GPUs. This is the Big Bang of modern deep learning.

2017

Vaswani et al. publish "Attention Is All You Need," introducing the Transformer architecture. It will reshape NLP, vision, and everything else.

2020+

GPT-3, DALL·E, Stable Diffusion, GPT-4 — foundation models demonstrate capabilities that seemed like science fiction five years earlier.

1.2 AI, Machine Learning, and Deep Learning

These three terms are often used interchangeably, but they're nested concepts:

Figure 1.1 — Deep Learning is a subset of Machine Learning, which is a subset of Artificial Intelligence.

Artificial Intelligence (AI) is the broadest concept: any system that exhibits intelligent behavior. This includes rule-based systems (if the temperature > 100°C, send an alert), search algorithms (GPS navigation), and much more. Most AI systems don't involve learning at all.

Machine Learning (ML) is a subset of AI where systems learn patterns from data rather than following hand-coded rules. Instead of programming explicit rules, you provide examples and the system figures out the rules itself.

Deep Learning (DL) is a subset of ML that uses neural networks with many layers — hence "deep." These layered networks can automatically discover the features that matter in data, eliminating much of the manual feature engineering that traditional ML requires.

Key Insight

In traditional ML, a human expert must carefully choose what features to look for. In deep learning, the network discovers these features on its own. This is why deep learning works so well for messy, unstructured data like images, audio, and text — the features are too complex for humans to define by hand.

1.3 What Can Deep Learning Do?

The honest answer: a lot. But it's not magic, and it's not good at everything. Here's a realistic landscape:

Task	Example	Architecture
Image Classification	Is this X-ray showing pneumonia?	CNN, Vision Transformer
Object Detection	Find all pedestrians in this street scene	YOLO, Faster R-CNN
Text Generation	Write a coherent paragraph about climate change	Transformer (GPT family)
Machine Translation	Translate English to French	Seq2Seq, Transformer
Speech Recognition	Convert audio to text	Whisper, CTC Networks
Image Generation	Create a photorealistic face that doesn't exist	GAN, Diffusion Model
Game Playing	Beat the world champion at Go	Reinforcement Learning + DL

1.4 What Deep Learning Is NOT Good At

Being honest about limitations is just as important as celebrating capabilities:

Small datasets — Deep learning is data-hungry. With fewer than a few hundred examples, traditional ML or even simple heuristics often win.
Explainability — A deep neural network's decisions are hard to interpret. If you need to explain why a decision was made (medical diagnosis, loan approval), this matters a lot.
Common sense reasoning — Models can generate fluent text but don't "understand" the way humans do. They can be confidently wrong.
Out-of-distribution generalization — A model trained on one kind of data can fail spectacularly when the data shifts.
Simple, structured problems — If your data fits neatly in a spreadsheet and has clear rules, gradient-boosted trees (XGBoost) are often better and faster.

1.5 The Building Blocks (A 30,000-Foot View)

Every deep learning system has the same fundamental ingredients:

Data — Examples the model learns from. Images, text, audio, numbers — organized into inputs and (usually) labels.
A Model (Architecture) — The structure of the neural network. How many layers, what type, how they connect.
A Loss Function — A mathematical measure of how wrong the model's predictions are.
An Optimizer — An algorithm that adjusts the model's internal parameters to reduce the loss.
Training Loop — The cycle: feed data → make predictions → measure error → adjust parameters → repeat.

Figure 1.2 — The training loop: the heartbeat of deep learning.

Every chapter in this book builds on these five ingredients. You'll understand each one deeply by the end.
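
To make the five ingredients concrete, here is a minimal sketch that fits a straight line with NumPy. The toy data, the variable names, and the learning rate are assumptions chosen for illustration; real projects swap each piece for something larger, but the shape of the loop stays the same.

```python
import numpy as np

# 1. Data: inputs and labels generated from a known rule, y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1

# 2. Model: a straight line with two parameters
w, b = 0.0, 0.0

# 4. Optimizer: plain gradient descent with a fixed step size (assumed value)
learning_rate = 0.05

# 5. Training loop: predict, measure error, adjust, repeat
for step in range(2000):
    pred = w * x + b                    # make predictions
    error = pred - y
    loss = np.mean(error ** 2)          # 3. Loss function: mean squared error
    grad_w = np.mean(2 * error * x)     # d(loss)/dw
    grad_b = np.mean(2 * error)         # d(loss)/db
    w -= learning_rate * grad_w         # adjust parameters
    b -= learning_rate * grad_b

print(f"w = {w:.3f}, b = {b:.3f}, loss = {loss:.6f}")
```

After training, w and b should sit very close to 2 and 1, the rule that generated the data.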

1.6 What You'll Build in This Book

This isn't a theory textbook. By the end of these chapters, you will have:

Built a neural network from scratch in pure Python
Trained an image classifier that recognizes handwritten digits
Created a text generator that writes Shakespeare-like prose
Used a pre-trained model to classify your own images
Built and deployed a complete deep learning application
Understood the Transformer architecture powering ChatGPT and its relatives

Let's start building.

✎ Exercises — Chapter 1

In your own words, explain the difference between AI, machine learning, and deep learning to a friend who has never taken a CS course.
Find three real-world applications of deep learning that you interact with daily (hint: your phone uses several).
For each application in the table in Section 1.3, think of one benefit and one risk.
Why might a company choose traditional ML over deep learning for a spam filter? When might deep learning be the better choice?

Chapter 2

The Math You Actually Need

Don't panic. You don't need a math degree — just a few core ideas, explained with pictures and code.

Who This Chapter Is For

If you're comfortable with basic algebra and have seen graphs (x-y plots), you're fine. We'll build up everything else. If you already know linear algebra and calculus, skip to Chapter 3.

2.1 Vectors: Lists of Numbers

A vector is simply an ordered list of numbers. That's it. In deep learning, vectors represent data — a single image, a word, a row in a spreadsheet.

Python
# A vector in Python (as a list)
student = [170, 65, 22]  # height(cm), weight(kg), age

# The same as a NumPy array
import numpy as np
student = np.array([170, 65, 22])
print(student.shape)  # (3,) — a 1D array with 3 elements

Think of a vector as an arrow pointing from the origin to a point in space. A 2D vector [3, 4] points to the point (3, 4). A 3D vector [1, 2, 3] points to (1, 2, 3) in 3D space. In deep learning, our vectors often have hundreds or thousands of dimensions — we can't draw them, but the math works the same way.

Vector Operations

Addition: add element by element.

[1, 2, 3] + [4, 5, 6] = [5, 7, 9]

Scalar multiplication: multiply every element by a number.

3 × [1, 2, 3] = [3, 6, 9]

Dot product: multiply corresponding elements, then sum. This is the most important operation in neural networks.

[1, 2, 3] · [4, 5, 6] = (1×4) + (2×5) + (3×6) = 32

Python
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

print(a + b)           # [5 7 9]
print(3 * a)           # [3 6 9]
print(np.dot(a, b))    # 32
print(a @ b)           # 32 (same as dot product)

2.2 Matrices: Tables of Numbers

A matrix is a 2D grid of numbers. In deep learning, weight matrices are the core of every layer — they transform input data into useful representations.

Python
# A 2×3 matrix (2 rows, 3 columns)
W = np.array([
    [1, 2, 3],
    [4, 5, 6]
])
print(W.shape)  # (2, 3)

Matrix-Vector Multiplication

This is the fundamental operation in neural networks. Each row of the matrix is dotted with the vector to produce the output:

Figure 2.1 — Matrix-vector multiplication: the engine of a neural network layer.

Python
W = np.array([[1, 2], [3, 4], [5, 6]])  # 3×2 matrix
x = np.array([10, 20])                    # 2-element vector

y = W @ x   # Matrix-vector multiplication
print(y)     # [ 50 110 170]

# Verify: row 1 is [1,2] · [10,20] = 10+40 = 50 ✓
#         row 2 is [3,4] · [10,20] = 30+80 = 110 ✓
#         row 3 is [5,6] · [10,20] = 50+120 = 170 ✓

2.3 Derivatives: Measuring Change

A derivative measures how much a function's output changes when its input changes a tiny bit. If you know the derivative of a function at a point, you know the slope — which direction to move to increase or decrease the output.

Intuition

Imagine you're blindfolded on a hill. The derivative is like poking the ground in different directions with a stick to figure out which way is downhill. That's exactly what gradient descent does — it uses derivatives to find the direction that reduces the loss.

For f(x) = x², the derivative is f'(x) = 2x. At x = 3, the slope is 6 — the function is increasing. At x = -2, the slope is -4 — it's decreasing.

Python
# Numerical derivative (finite differences)
def numerical_derivative(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: x ** 2
print(numerical_derivative(f, 3.0))   # ≈ 6.0 ✓
print(numerical_derivative(f, -2.0))  # ≈ -4.0 ✓

The Chain Rule

The chain rule is how we compute derivatives of composed functions. If y = f(g(x)), then:

dy/dx = f'(g(x)) × g'(x)

This is the mathematical foundation of backpropagation — the algorithm that makes training neural networks possible. We'll see it in action in Chapter 4.
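
A quick way to build trust in the chain rule is to check it numerically. The sketch below uses an example composition of our own choosing, y = sin(x²), and compares the chain-rule answer with a finite-difference estimate. The function names are illustrative, not part of any library.

```python
import math

def g(x):
    return x ** 2                 # inner function

def f(u):
    return math.sin(u)            # outer function

def chain_rule_derivative(x):
    # dy/dx = f'(g(x)) * g'(x) = cos(x^2) * 2x
    return math.cos(g(x)) * 2 * x

def numerical_derivative(func, x, h=1e-5):
    return (func(x + h) - func(x - h)) / (2 * h)

x0 = 1.5
analytic = chain_rule_derivative(x0)
numeric = numerical_derivative(lambda x: f(g(x)), x0)
print(analytic, numeric)          # the two values should agree to many decimal places
```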

2.4 Gradients: Derivatives in Many Dimensions

When your function has multiple inputs (as neural network loss functions always do), the gradient is a vector of all the partial derivatives. It points in the direction of steepest ascent.

∇f = [∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ]

If we want to minimize the function (reduce the loss), we move in the opposite direction of the gradient. This is the essence of gradient descent.

Python
# Simple gradient descent example
x = 10.0  # Start far from minimum
learning_rate = 0.1

for step in range(20):
    gradient = 2 * x            # Derivative of x²
    x = x - learning_rate * gradient  # Move opposite to gradient
    print(f"Step {step+1}: x = {x:.4f}, f(x) = {x**2:.4f}")

# x converges to 0 — the minimum of x²
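
The same idea carries over to functions of several inputs, where the gradient is a vector. As a small sketch, here is gradient descent on a two-variable bowl, f(x, y) = x² + 3y². The function, starting point, and step count are assumptions picked for illustration.

```python
import numpy as np

def f(p):
    x, y = p
    return x ** 2 + 3 * y ** 2

def grad_f(p):
    x, y = p
    return np.array([2 * x, 6 * y])     # [df/dx, df/dy]

p = np.array([4.0, -2.0])               # starting point
learning_rate = 0.1

for step in range(100):
    p = p - learning_rate * grad_f(p)   # move opposite to the gradient vector

print(p, f(p))                          # both coordinates shrink toward 0
```

Notice that each coordinate gets its own partial derivative, so the steeper y direction shrinks faster than the x direction.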

Don't Memorize — Understand

You don't need to memorize derivative rules. Modern frameworks like PyTorch compute gradients automatically (this is called automatic differentiation or autograd). What you need is to understand what gradients represent and why they matter.

2.5 Probability Basics

Classification models typically output probabilities rather than hard decisions. A model that classifies an image as "cat" with 92% confidence is more useful than one that just says "cat." Understanding a few probability concepts helps you interpret what models are really doing.

Softmax is a function that turns a list of numbers (called logits) into a probability distribution — all values are between 0 and 1, and they sum to 1:

Python
def softmax(z):
    exp_z = np.exp(z - np.max(z))  # subtract max for stability
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)        # [0.659 0.242 0.099]
print(probs.sum())   # 1.0 — a valid probability distribution

Cross-entropy loss measures how poorly the model's predicted probabilities match the true labels. For a single example, it is the negative log of the probability the model assigned to the correct class. Lower is better. This is the most common loss function for classification tasks.
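
For a single example with one correct class, cross-entropy reduces to the negative log of the probability the model gave that class. The sketch below reuses the logits from the softmax example; the helper name cross_entropy is our own, not a library function.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))   # subtract max for stability
    return exp_z / exp_z.sum()

def cross_entropy(probs, true_class):
    # Negative log of the probability assigned to the correct class
    return -np.log(probs[true_class])

probs = softmax(np.array([2.0, 1.0, 0.1]))

loss_if_class_0 = cross_entropy(probs, 0)   # the model favored the true class
loss_if_class_2 = cross_entropy(probs, 2)   # the model gave the true class little probability
print(loss_if_class_0, loss_if_class_2)
```

The loss is small when the model puts most of its probability on the right answer and grows quickly when it does not.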

✎ Exercises — Chapter 2

Compute the dot product of [2, 3, 4] and [5, 6, 7] by hand, then verify with NumPy.
Multiply the matrix [[1, 0], [0, 1], [1, 1]] by vector [3, 7]. What do you notice?
The function f(x) = x³ - 4x has derivative f'(x) = 3x² - 4. At what values of x is the derivative zero? (These are the local minimum and maximum.)
Write a Python function that performs gradient descent on f(x) = (x-3)² + 2. Start at x = 0 and find the minimum.
Apply softmax to [5, 5, 5] and [100, 0, 0]. What do you notice about the outputs?

Chapter 3

Python & Tools Setup

Let's get your development environment ready. You'll need Python, NumPy, and PyTorch — nothing more to start.

3.1 Installing Python

We recommend Python 3.10 or later. The easiest way to get everything set up is through Anaconda or Miniconda:

Terminal
# Option A: Miniconda (lightweight, recommended)
# Download from https://docs.conda.io/en/latest/miniconda.html

# Option B: If you already have Python, just use pip
pip install numpy matplotlib jupyter torch torchvision

3.2 Your Toolkit

Tool	Purpose	Why
NumPy	Matrix math	Fast, universal, the foundation everything is built on
Matplotlib	Plotting	Visualize data, loss curves, model outputs
Jupyter	Interactive notebooks	Write and run code in cells — great for learning
PyTorch	Deep learning framework	Intuitive, Pythonic, dominant in research

3.3 NumPy Crash Course

If you know Python but not NumPy, here's everything you need in 60 seconds:

Python
import numpy as np

# Create arrays
a = np.array([1, 2, 3])           # 1D
b = np.zeros((3, 4))              # 3×4 matrix of zeros
c = np.random.randn(2, 3)        # 2×3 random normal
d = np.ones((2, 3))               # 2×3 matrix of ones

# Shape and indexing
print(c.shape)      # (2, 3)
print(c[0, :])      # First row (all columns)
print(c[:, 1])      # Second column (all rows)

# Operations
print(c + d)         # Element-wise addition
print(c * 2)         # Scalar multiplication
print(c @ a)         # Matrix-vector multiply: (2,3) @ (3,) gives shape (2,)
print(c.reshape(3, 2))  # Reshape
print(np.exp(c))      # Element-wise exponential

Common Gotcha

NumPy uses broadcasting: operations between arrays of different shapes can work if dimensions are compatible. For example, adding a (3,4) matrix and a (1,4) vector works — the vector is "broadcast" across rows. This is powerful but can cause silent bugs if shapes don't match as you expect.
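
Here is a small sketch of both sides of broadcasting: the helpful case described above, and a classic silent bug where a (3,) vector meets a (3, 1) column. The array names are made up for illustration.

```python
import numpy as np

matrix = np.ones((3, 4))
row = np.array([[10, 20, 30, 40]])            # shape (1, 4)
summed = matrix + row                         # row is broadcast across all 3 rows
print(summed.shape)                           # (3, 4)

# The gotcha: a flat vector combined with a column vector
predictions = np.array([1.0, 2.0, 3.0])       # shape (3,)
targets = np.array([[1.0], [2.0], [3.0]])     # shape (3, 1)
diff = predictions - targets                  # no error is raised...
print(diff.shape)                             # (3, 3), not the (3,) you probably wanted

# Making the shapes agree gives the intended element-wise difference
fixed = predictions - targets.ravel()
print(fixed.shape)                            # (3,)
```

Printing .shape after each step is the cheapest way to catch this kind of bug early.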

3.4 Why PyTorch?

PyTorch has become the dominant framework in research and is increasingly common in industry. It's "Pythonic": if you know NumPy, PyTorch feels familiar. The key addition is automatic differentiation:

Python
import torch

# Create a tensor with gradient tracking
x = torch.tensor(3.0, requires_grad=True)

# Forward pass
y = x ** 2  # y = x²

# Backward pass (compute dy/dx)
y.backward()

print(x.grad)  # tensor(6.) — that's 2×3 = 6 ✓

PyTorch tracks every operation and automatically computes gradients. This is what makes training neural networks practical — you define the forward computation, and PyTorch handles the math of learning.

3.5 GPU Setup (Optional but Recommended)

A CPU is fine while you're learning. For anything larger than toy models, you'll want a GPU. Options:

Google Colab (free) — Browser-based Jupyter with GPU access. Best option to start.
Kaggle Notebooks (free) — Also provides GPU, plus datasets.
Local GPU — NVIDIA GPU with CUDA installed.

Python
# Check if GPU is available
import torch
print(torch.cuda.is_available())  # True if GPU is ready
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using: {device}")

✎ Exercises — Chapter 3

Create a 4×4 matrix of random numbers using NumPy. Extract the second row, third column, and the 2×2 submatrix in the top-left corner.
Use PyTorch to compute the gradient of f(x) = 3x³ + 2x² - 5x + 7 at x = 2.
Create a NumPy array and convert it to a PyTorch tensor and back. What types do you get?

Chapter 4

Your First Neural Network

We'll build a neural network from scratch in NumPy — no frameworks, no magic. Every calculation spelled out.

4.1 The Neuron

A neuron is the basic unit of a neural network. It does three things:

Weights each input — multiply each input by a learned weight
Sums — add them all up, plus a bias term
Activates — pass the sum through a nonlinear function

Figure 4.1 — A single neuron: weighted sum + activation.

Mathematically: y = σ(w₁x₁ + w₂x₂ + w₃x₃ + b) where σ is the activation function.
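
The formula translates almost directly into NumPy. In this sketch the input values, weights, and bias are arbitrary numbers chosen so the arithmetic is easy to follow, and sigmoid stands in for σ.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([1.0, 2.0, 3.0])        # inputs x1, x2, x3
w = np.array([0.5, -0.25, 0.1])      # weights w1, w2, w3 (arbitrary values)
b = 0.2                              # bias

z = np.dot(w, x) + b                 # weighted sum: 0.5 - 0.5 + 0.3 + 0.2 = 0.5
y = sigmoid(z)                       # activation squashes z into (0, 1)
print(z, y)
```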

4.2 Activation Functions

Without activation functions, stacking layers would just produce linear transformations (a linear function of a linear function is still linear). Nonlinearity is what gives deep networks their power.
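
You can see the collapse for yourself. The sketch below uses tiny hand-picked matrices (an assumption for readability) to show that two stacked linear layers equal one combined layer, and that putting a ReLU between them changes the result.

```python
import numpy as np

W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])     # first layer: 2 inputs -> 2 units
W2 = np.array([[1.0, 1.0]])      # second layer: 2 units -> 1 output
x = np.array([1.0, 2.0])

two_layers = W2 @ (W1 @ x)                  # two linear layers, no activation
one_layer = (W2 @ W1) @ x                   # one layer using the combined matrix
with_relu = W2 @ np.maximum(0, W1 @ x)      # same layers with ReLU in between

print(two_layers, one_layer, with_relu)     # [0.] [0.] [1.]
```

Without the activation, the extra layer adds no expressive power; with it, the network computes something a single matrix cannot.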

Python
def sigmoid(z):
    """Squashes values to range (0, 1)"""
    return 1 / (1 + np.exp(-z))

def relu(z):
    """Returns z if positive, else 0. The workhorse of modern DL."""
    return np.maximum(0, z)

def relu_derivative(z):
    return (z > 0).astype(float)

Function	Formula	Range	Used In
Sigmoid	1/(1+e⁻ᶻ)	(0, 1)	Output layer (binary classification)
Tanh	(eᶻ-e⁻ᶻ)/(eᶻ+e⁻ᶻ)	(-1, 1)	Hidden layers (older architectures)
ReLU	max(0, z)	[0, ∞)	Hidden layers (modern default)
Softmax	eᶻⁱ/Σeᶻʲ	(0, 1), sums to 1	Output layer (multi-class)

4.3 Building a Network from Scratch

Let's build a complete 2-layer neural network that learns the XOR function — the very problem that killed the Perceptron in 1969. Our network will solve it in seconds.

Python
import numpy as np

# ─── XOR Dataset ───
X = np.array([[0,0], [0,1], [1,0], [1,1]])
y = np.array([[0], [1], [1], [0]])

# ─── Initialize Weights ───
np.random.seed(42)
W1 = np.random.randn(2, 4)   # Input (2) → Hidden (4)
b1 = np.zeros((1, 4))
W2 = np.random.randn(4, 1)   # Hidden (4) → Output (1)
b2 = np.zeros((1, 1))

# ─── Activation ───
def sigmoid(z): return 1 / (1 + np.exp(-z))
def sigmoid_deriv(a): return a * (1 - a)

lr = 2.0  # Learning rate

# ─── Training Loop ───
for epoch in range(10000):

    # Forward pass
    z1 = X @ W1 + b1
    a1 = sigmoid(z1)
    z2 = a1 @ W2 + b2
    a2 = sigmoid(z2)

    # Loss (MSE)
    loss = np.mean((a2 - y) ** 2)

    # Backward pass (the constant 2/N from the MSE derivative is absorbed into lr)
    dz2 = (a2 - y) * sigmoid_deriv(a2)
    dW2 = a1.T @ dz2
    db2 = np.sum(dz2, axis=0, keepdims=True)

    dz1 = (dz2 @ W2.T) * sigmoid_deriv(a1)
    dW1 = X.T @ dz1
    db1 = np.sum(dz1, axis=0, keepdims=True)

    # Update weights
    W2 -= lr * dW2;  b2 -= lr * db2
    W1 -= lr * dW1;  b1 -= lr * db1

    if epoch % 2000 == 0:
        print(f"Epoch {epoch:5d}  Loss: {loss:.6f}")

print(f"\nPredictions:\n{np.round(a2, 3)}")

Expected Output

Epoch     0  Loss: 0.258731
Epoch  2000  Loss: 0.004649
Epoch  4000  Loss: 0.001232
Epoch  6000  Loss: 0.000564
Epoch  8000  Loss: 0.000316

Predictions:
[[0.015]
 [0.983]
 [0.983]
 [0.017]]

The network has learned XOR. Input [0,0] → ~0, [0,1] → ~1, [1,0] → ~1, [1,1] → ~0.

4.4 What Just Happened?

Let's trace through the key steps:

Forward pass: Data flows through the network. Each layer multiplies by weights, adds bias, applies activation. The output is a prediction.
Loss calculation: Compare prediction to truth. Mean Squared Error: how far off are we, on average?
Backward pass (backpropagation): Use the chain rule to compute how much each weight contributed to the error. Start from the output, work backward layer by layer.
Update: Nudge each weight in the direction that reduces the loss. The learning rate controls how big the nudge is.

This cycle repeats thousands of times. Each iteration, the network gets slightly better. That's all training is — repetition of this four-step process.
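
One way to convince yourself that backpropagation computes the right thing is a gradient check: compare the backprop gradient with a finite-difference estimate. This is a sketch on the XOR network shape from Section 4.3, with its own random initialization and with the 2/N factor of the MSE derivative kept explicit so the two numbers match. The names loss_fn, dW2_backprop, and dW2_numeric are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))

def loss_fn(W2_candidate):
    a1 = sigmoid(X @ W1 + b1)
    a2 = sigmoid(a1 @ W2_candidate + b2)
    return np.mean((a2 - y) ** 2)

# Backpropagation for dLoss/dW2 (chain rule, output layer only)
a1 = sigmoid(X @ W1 + b1)
a2 = sigmoid(a1 @ W2 + b2)
dz2 = (2 / len(X)) * (a2 - y) * a2 * (1 - a2)
dW2_backprop = a1.T @ dz2

# Finite differences: nudge one weight at a time and watch the loss
h = 1e-6
dW2_numeric = np.zeros_like(W2)
for i in range(W2.shape[0]):
    W_plus, W_minus = W2.copy(), W2.copy()
    W_plus[i, 0] += h
    W_minus[i, 0] -= h
    dW2_numeric[i, 0] = (loss_fn(W_plus) - loss_fn(W_minus)) / (2 * h)

max_diff = np.max(np.abs(dW2_backprop - dW2_numeric))
print(max_diff)   # should be tiny
```

Finite differences are far too slow for real training, but they are a handy debugging tool when you write a backward pass by hand.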

4.5 Visualizing the Learning Process

Python
import matplotlib.pyplot as plt

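# Re-run the training from Section 4.3, storing the loss at every epoch.
# Reuses np, X, y, sigmoid, sigmoid_deriv, and lr from that listing.
np.random.seed(42)
W1 = np.random.randn(2, 4);  b1 = np.zeros((1, 4))
W2 = np.random.randn(4, 1);  b2 = np.zeros((1, 1))

losses = []
for epoch in range(10000):
    a1 = sigmoid(X @ W1 + b1)                      # forward pass
    a2 = sigmoid(a1 @ W2 + b2)
    loss = np.mean((a2 - y) ** 2)
    dz2 = (a2 - y) * sigmoid_deriv(a2)             # backward pass
    dz1 = (dz2 @ W2.T) * sigmoid_deriv(a1)
    W2 -= lr * (a1.T @ dz2);  b2 -= lr * dz2.sum(axis=0, keepdims=True)
    W1 -= lr * (X.T @ dz1);   b1 -= lr * dz1.sum(axis=0, keepdims=True)
    losses.append(loss)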

plt.figure(figsize=(8, 4))
plt.plot(losses, color='#c0392b', linewidth=1.5)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training Loss Over Time')
plt.yscale('log')
plt.grid(True, alpha=0.3)
plt.show()

✎ Exercises — Chapter 4

Change the hidden layer from 4 neurons to 2. Can the network still learn XOR? What about 1 neuron? Explain why.
Replace sigmoid with ReLU in the hidden layer (keep sigmoid in the output). Does training improve?
Change the learning rate to 0.01 and then to 20. What happens to convergence?
Add a third input to XOR: the output should be 1 if an odd number of inputs are 1. Extend the network to handle this.
Plot the decision boundary of the trained XOR network. (Hint: create a grid of points, predict for each, and use plt.contourf.)

Part II

Core Deep Learning

Chapter 5

How Neural Networks Learn

Gradient descent, learning rates, overfitting, regularization — the mechanics of making networks actually work.

5.1 Gradient Descent Variants

In Chapter 4, we used batch gradient descent — computing the gradient over the entire dataset before updating. This works for XOR (4 samples), but what about a dataset with a million images?

Variant	Batch Size	Speed	Stability
Batch GD	Entire dataset	Slow per step	Very stable
Stochastic GD (SGD)	1 sample	Fast per step	Noisy, can escape local minima
Mini-batch SGD	32–256 samples	Fast	Good balance — the practical default

Python
# Mini-batch training loop
batch_size = 32
for epoch in range(num_epochs):
    # Shuffle data each epoch
    indices = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        end = start + batch_size
        X_batch = X[indices[start:end]]
        y_batch = y[indices[start:end]]
        # Forward → Loss → Backward → Update on this batch

5.2 Optimizers Beyond SGD

Vanilla SGD has a problem: it treats all parameters equally and uses a single learning rate. Modern optimizers adapt the learning rate per-parameter and use momentum to smooth out noisy gradients.

SGD with Momentum: Like a ball rolling downhill — it builds up speed in consistent directions and slows down when the gradient changes direction.

Adam (Adaptive Moment Estimation): The workhorse optimizer. Combines momentum with per-parameter learning rate adaptation. It's the safe default for almost every project.

Python
# In PyTorch, switching optimizers is one line:
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # ← Start here
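
To see what these optimizers do under the hood, here is a plain-Python sketch of one common formulation of the momentum and Adam update rules, applied to a single parameter. The function names and default values are assumptions for illustration; library implementations include further options not shown here.

```python
import math

def momentum_step(x, velocity, grad, lr=0.01, momentum=0.9):
    velocity = momentum * velocity + grad        # blend the new gradient into the running velocity
    return x - lr * velocity, velocity

def adam_step(x, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad           # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2      # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction, since m and v start at 0
    v_hat = v / (1 - beta2 ** t)
    return x - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# Momentum on f(x) = x², whose gradient is 2x
x, velocity = 10.0, 0.0
for step in range(500):
    x, velocity = momentum_step(x, velocity, 2 * x)
x_momentum = x

# Adam's first step has size close to lr, however large or small the gradient is
small, _, _ = adam_step(10.0, 0.0, 0.0, grad=0.02, t=1)
large, _, _ = adam_step(10.0, 0.0, 0.0, grad=2000.0, t=1)
print(x_momentum, 10.0 - small, 10.0 - large)
```

The last line shows the per-parameter adaptation: a gradient of 0.02 and a gradient of 2000 both produce a step of about 0.001, because Adam divides by the gradient's recent magnitude.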

5.3 Overfitting: The Enemy

Overfitting happens when the model memorizes the training data instead of learning generalizable patterns. You can spot it when:

Training loss keeps decreasing, but validation loss starts increasing
The model performs great on training data but poorly on new data

Figure 5.1 — Classic overfitting: the gap between training and validation loss widens.

5.4 Fighting Overfitting

1. Get more data. The single most effective solution, though not always possible.

2. Data augmentation. Artificially increase your dataset by transforming existing samples. For images: flip, rotate, crop, adjust colors.

Python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.2),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
])

3. Dropout. During training, randomly "turn off" a fraction of neurons. This prevents any single neuron from becoming too specialized and forces the network to learn redundant representations.

Python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # Drop 50% of neurons during training
    nn.Linear(256, 10),
)
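
Under the hood, dropout is a random mask plus a rescaling. The NumPy sketch below shows the "inverted dropout" scheme, where surviving activations are scaled up by 1/(1-p) during training so nothing needs to change at evaluation time. The function name and signature are our own.

```python
import numpy as np

def dropout(activations, p, training, rng):
    if not training:
        return activations                            # evaluation: pass everything through
    keep_mask = rng.random(activations.shape) >= p    # keep each unit with probability 1 - p
    return activations * keep_mask / (1 - p)          # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
a = np.ones((1000, 100))

train_out = dropout(a, p=0.5, training=True, rng=rng)
eval_out = dropout(a, p=0.5, training=False, rng=rng)

print((train_out == 0).mean())   # roughly half the units are zeroed
print(train_out.mean())          # the average stays close to 1
print(eval_out.mean())           # unchanged at evaluation time
```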

4. Weight decay (L2 regularization). Add a penalty to the loss proportional to the squared magnitude of the weights. This discourages large weights and keeps the model from relying too heavily on any single feature.

Python
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)

5. Early stopping. Monitor validation loss during training. When it starts increasing while training loss keeps decreasing — stop.
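
The logic of early stopping fits in a few lines. This sketch runs it over a made-up list of validation losses instead of a real training run; the function name and the patience parameter (how many epochs without improvement to tolerate) are assumptions for illustration.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the index of the epoch at which training would stop."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_loss:
            best_loss = val_loss
            epochs_without_improvement = 0       # a real loop would save a checkpoint here
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1                   # patience never ran out

# Made-up validation losses: improving at first, then rising as overfitting sets in
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.56, 0.60, 0.65]
stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch)   # 6: three epochs in a row failed to beat the best loss of 0.50
```

In practice you would also restore the weights saved at the best epoch, since the final weights are already slightly overfit.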

5.5 Learning Rate Scheduling

A large learning rate helps the model explore quickly at the start. A small learning rate helps it fine-tune near the end. A learning rate scheduler adjusts the rate during training:

Python
# Step decay: reduce by factor of 10 every 30 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

# Cosine annealing: smooth decay following a cosine curve
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(num_epochs):
    train_one_epoch(...)
    scheduler.step()
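
It helps to see the numbers a scheduler produces. The sketch below computes the closed-form step-decay and cosine-annealing schedules in plain Python, using the same settings as above. The function names are our own, and this is a simplified view that assumes the schedule runs uninterrupted from epoch 0.

```python
import math

def step_lr(base_lr, epoch, step_size=30, gamma=0.1):
    # Multiply by gamma once for every completed block of step_size epochs
    return base_lr * gamma ** (epoch // step_size)

def cosine_lr(base_lr, epoch, t_max=100, eta_min=0.0):
    # Half a cosine wave from base_lr (epoch 0) down to eta_min (epoch t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))

for epoch in (0, 29, 30, 50, 60, 100):
    print(epoch, step_lr(0.1, epoch), cosine_lr(0.1, epoch))
```

Step decay drops abruptly at epochs 30 and 60, while the cosine schedule glides down and reaches its minimum at epoch 100.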

5.6 Batch Normalization

BatchNorm normalizes each feature's activations across the current mini-batch to have mean 0 and variance 1, then scales and shifts them with learned parameters. Benefits:

Faster training (can use higher learning rates)
Acts as mild regularization
Makes the network less sensitive to weight initialization

Python
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),
    nn.ReLU(),
    nn.Linear(256, 10),
)
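
The training-time computation behind BatchNorm is short enough to write out. This NumPy sketch covers only the forward pass on one batch; the function name is our own, and it leaves out the running statistics that a real layer tracks for use at evaluation time.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)                        # per-feature mean over the batch
    var = x.var(axis=0)                          # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)      # normalize to mean 0, variance 1
    return gamma * x_hat + beta                  # learned scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 8))   # batch of 64 samples, 8 features

out = batch_norm_forward(x, gamma=np.ones(8), beta=np.zeros(8))
print(out.mean(axis=0).round(3))   # approximately 0 for every feature
print(out.std(axis=0).round(3))    # approximately 1 for every feature
```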

✎ Exercises — Chapter 5

Train a network on a dataset with and without dropout. Plot both training and validation curves. What do you observe?
Compare Adam (lr=0.001) vs SGD with momentum (lr=0.01) on the same task. Which converges faster? Which achieves better final accuracy?
Implement early stopping in your training loop. If validation loss doesn't improve for 10 epochs, stop training.
Experiment with different dropout rates (0.1, 0.3, 0.5, 0.7). How does it affect training vs validation performance?

Chapter 6

Building with PyTorch

From manual NumPy to professional PyTorch. The same concepts, but cleaner, faster, and GPU-ready.

6.1 The PyTorch Way

PyTorch organizes deep learning into clear abstractions:

torch.Tensor — like NumPy arrays, but with GPU support and automatic differentiation
nn.Module — base class for all models. You define layers and the forward pass.
torch.optim — optimizers (SGD, Adam, etc.)
DataLoader — handles batching, shuffling, and parallel data loading

6.2 Classifying Handwritten Digits (MNIST)

The MNIST dataset is the "Hello World" of deep learning: 70,000 grayscale images of handwritten digits (0–9), each 28×28 pixels, split into 60,000 training and 10,000 test images. Let's build a complete classifier:

Python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# ─── 1. Data ───
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))  # MNIST mean & std
])

train_data = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_data  = datasets.MNIST('./data', train=False, transform=transform)

train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
test_loader  = DataLoader(test_data, batch_size=1000)

# ─── 2. Model ───
class DigitClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.network = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 10)
        )

    def forward(self, x):
        return self.network(x)

model = DigitClassifier()

# ─── 3. Loss & Optimizer ───
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# ─── 4. Training Loop ───
for epoch in range(10):
    model.train()
    total_loss = 0
    for batch_x, batch_y in train_loader:
        optimizer.zero_grad()
        output = model(batch_x)
        loss = criterion(output, batch_y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    # ─── 5. Evaluate ───
    model.eval()
    correct = 0
    with torch.no_grad():
        for batch_x, batch_y in test_loader:
            preds = model(batch_x).argmax(dim=1)
            correct += (preds == batch_y).sum().item()

    acc = correct / len(test_data)
    print(f"Epoch {epoch+1}: Loss={total_loss / len(train_loader):.4f}, Accuracy={acc*100:.2f}%")

Expected Result

After 10 epochs, you should see ~97-98% accuracy on the test set. Not bad for a simple fully-connected network!

6.3 Key PyTorch Patterns

model.train() vs model.eval(): This switches the behavior of Dropout and BatchNorm. Always call the right one.

torch.no_grad(): Disables gradient computation during evaluation. Saves memory and computation.

optimizer.zero_grad(): PyTorch accumulates gradients. You must zero them each step or they'll add up.
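
Gradient accumulation is easy to observe directly. This sketch calls backward() twice on a single tensor without resetting in between, then resets the gradient by hand, which is the job optimizer.zero_grad() performs for every parameter in a model.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

(3 * w).backward()
first = w.grad.item()            # 3.0

(3 * w).backward()               # no reset in between
accumulated = w.grad.item()      # 6.0: the new gradient was added to the old one

w.grad.zero_()                   # reset the stored gradient
(3 * w).backward()
after_zero = w.grad.item()       # 3.0 again

print(first, accumulated, after_zero)
```

Forgetting the reset does not raise an error; the updates just become wrong, which is why this bug is so common.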

6.4 Saving and Loading Models

Python
# Save model weights
torch.save(model.state_dict(), 'mnist_model.pth')

# Load later
model = DigitClassifier()
model.load_state_dict(torch.load('mnist_model.pth'))
model.eval()

✎ Exercises — Chapter 6

Add a third hidden layer. Does accuracy improve? By how much?
Implement the training and evaluation as a reusable function that takes hyperparameters as arguments.
Add a confusion matrix visualization: which digits does the model confuse most often?
Train the same model with and without BatchNorm. Compare training curves.

Chapter 7

Convolutional Neural Networks

CNNs revolutionized computer vision by learning spatial hierarchies of features — edges → textures → parts → objects.

7.1 Why Convolutions?

A fully-connected layer for a 224×224 RGB image would need 224×224×3 = 150,528 input weights per neuron. This is wasteful — pixels near each other are related, pixels far apart usually aren't. Convolutional layers exploit this spatial locality:

Local connectivity: Each neuron only looks at a small patch (e.g., 3×3 pixels)
Weight sharing: The same filter slides across the entire image, detecting the same pattern everywhere
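Translation invariance: Because the same filter is applied everywhere, a cat detector responds whether the cat is in the top-left or bottom-right. (Strictly speaking, convolution is translation-equivariant: shift the input and the feature map shifts with it. Pooling then adds approximate invariance.)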

7.2 How Convolution Works

A small filter (kernel) slides across the input, computing dot products at each position:

Figure 7.1 — Convolution: a small filter slides across the input, computing dot products.

Python
import torch.nn as nn

# A single convolutional layer
conv = nn.Conv2d(
    in_channels=3,     # RGB input
    out_channels=16,   # 16 different filters
    kernel_size=3,     # 3×3 filter
    stride=1,          # Move 1 pixel at a time
    padding=1          # Keep same spatial size
)
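
To demystify the sliding filter, here is a NumPy sketch of the operation on a single-channel image with stride 1 and no padding. Like nn.Conv2d, it does not flip the kernel. The function name, the toy image, and the edge filter are assumptions for illustration.

```python
import numpy as np

def conv2d_single(image, kernel):
    """Slide kernel over image (stride 1, no padding), taking a dot product at each position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# A 5×5 image that is dark on the left and bright on the right
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)

# A vertical-edge filter: responds where brightness increases from left to right
kernel = np.array([[-1, 0, 1]] * 3, dtype=float)

response = conv2d_single(image, kernel)
print(response)
# [[3. 3. 0.]
#  [3. 3. 0.]
#  [3. 3. 0.]]
```

The filter fires (value 3) wherever its window straddles the dark-to-bright edge and stays silent (value 0) over the uniform bright region.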

7.3 Pooling

Pooling layers reduce the spatial dimensions, making the network faster and more robust to small shifts. Max pooling takes the maximum value in each patch:

Python
pool = nn.MaxPool2d(2, stride=2)  # Halves the spatial dimensions
# Input: (16, 28, 28) → Output: (16, 14, 14)
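
On a single feature map, 2×2 max pooling can be sketched in NumPy with a reshape. The function name and the sample values are made up for illustration; odd trailing rows or columns are simply dropped.

```python
import numpy as np

def max_pool_2x2(feature_map):
    h, w = feature_map.shape
    # Group pixels into 2×2 blocks, then take the max within each block
    trimmed = feature_map[:h - h % 2, :w - w % 2]
    blocks = trimmed.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

feature_map = np.array([
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
], dtype=float)

pooled = max_pool_2x2(feature_map)
print(pooled)
# [[4. 2.]
#  [2. 8.]]
```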

7.4 Building a Complete CNN

Python
class CNNClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1: 1×28×28 → 32×14×14
            nn.Conv2d(1, 32, 3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),

            # Block 2: 32×14×14 → 64×7×7
            nn.Conv2d(32, 64, 3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),

            # Block 3: 64×7×7 → 128×3×3
            nn.Conv2d(64, 128, 3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 3 * 3, 256),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, 10),
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x)
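
The shape comments in the model follow from one formula: output size = floor((size + 2 × padding - kernel) / stride) + 1. The sketch below applies it block by block to confirm where the 128 * 3 * 3 in the first linear layer comes from. The helper name is our own.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    return (size + 2 * padding - kernel) // stride + 1

size = 28   # MNIST images are 28×28
for block in range(1, 4):
    size = conv_output_size(size, kernel=3, stride=1, padding=1)   # 3×3 conv with padding 1
    size = conv_output_size(size, kernel=2, stride=2)              # 2×2 max pool
    print(f"after block {block}: {size}×{size}")

flattened = 128 * size * size
print(flattened)   # 1152 inputs to the first linear layer
```

The 7×7 map pools down to 3×3 rather than 3.5×3.5 because the division rounds down, so the last row and column are discarded.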

7.5 What CNNs See: Feature Visualization

The first convolutional layer typically learns to detect edges and simple textures. Deeper layers combine these into increasingly complex patterns:

Layer	Detects
Layer 1	Edges, gradients, simple colors
Layer 2	Textures, corners, small shapes
Layer 3	Object parts (eyes, wheels, petals)
Layer 4+	Entire objects, complex structures

7.6 Classic CNN Architectures

Architecture	Year	Key Innovation
LeNet-5	1998	Pioneered CNNs for digit recognition
AlexNet	2012	Deep CNN + GPU training + ReLU + Dropout
VGGNet	2014	Showed depth matters (16-19 layers)
ResNet	2015	Skip connections enabling 152+ layers
EfficientNet	2019	Optimized scaling of depth/width/resolution

✎ Exercises — Chapter 7

Train the CNN on MNIST. What accuracy do you get compared to the fully-connected network in Chapter 6?
Modify the CNN for CIFAR-10 (32×32 color images, 10 classes). What changes are needed?
Remove all pooling layers. How does this affect the output size and training time?
Add a skip connection: make the output of block 1 also feed into block 3 (concatenated or added).

Chapter 8

Recurrent Neural Networks

Sequence data — text, audio, time series — requires networks that remember the past. RNNs process inputs one at a time while maintaining a hidden state.

8.1 The Problem with Fixed-Size Inputs

CNNs and fully-connected networks take a fixed-size input and produce a fixed-size output. But language is variable-length. "I love deep learning" and "I think I might love deep learning someday" need different-sized inputs. RNNs solve this by processing sequences step by step.

8.2 How RNNs Work

An RNN cell processes one element at a time, maintaining a hidden state that acts as a memory:

Figure 8.1 — An unrolled RNN processing a sentence word by word.

At each step: h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b)

The same weights are used at every time step. The hidden state h_t carries forward information from all previous steps.
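
As a rough sketch of the update rule above, the loop below applies it by hand to one short sequence. The dimensions and the random weights are arbitrary assumptions; a trained model would learn W_xh, W_hh, and b.

```python
import torch

torch.manual_seed(0)
input_dim, hidden_dim, seq_len = 4, 3, 5

# Small random weights, for illustration only.
W_xh = torch.randn(hidden_dim, input_dim) * 0.1
W_hh = torch.randn(hidden_dim, hidden_dim) * 0.1
b = torch.zeros(hidden_dim)

xs = torch.randn(seq_len, input_dim)   # one sequence of 5 input vectors
h = torch.zeros(hidden_dim)            # initial hidden state h_0
history = []
for x_t in xs:
    # The same W_hh, W_xh, and b are reused at every step; only h changes.
    h = torch.tanh(W_hh @ h + W_xh @ x_t + b)
    history.append(h)

print(len(history), h.shape)  # 5 torch.Size([3])
```

Because the loop works for any number of steps, the same cell handles sequences of any length.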

8.3 The Vanishing Gradient Problem

Plain RNNs struggle to learn long-range dependencies. When backpropagating through many time steps, the gradient is multiplied by a similar factor at every step, so it either shrinks exponentially toward zero (vanishes) or grows exponentially (explodes). In practice, a vanilla RNN has trouble using information from 50 steps ago.
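
A quick back-of-the-envelope calculation shows how fast this happens. The per-step factors 0.9 and 1.1 are assumptions chosen only to illustrate repeated multiplication over 50 steps.

```python
steps = 50

# Assumed per-step scaling factors, chosen only for illustration.
shrinking = 0.9 ** steps   # each step scales the gradient by 0.9
growing = 1.1 ** steps     # each step scales the gradient by 1.1

print(f"0.9^{steps} = {shrinking:.4f}")  # about 0.005
print(f"1.1^{steps} = {growing:.1f}")    # about 117
```

A factor only slightly below 1 leaves almost no gradient after 50 steps, and a factor only slightly above 1 blows it up by two orders of magnitude.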

8.4 LSTMs and GRUs

LSTM (Long Short-Term Memory) addresses this with a gating mechanism and a separate cell state that acts as a conveyor belt for information:

Forget gate: What information to discard from the cell state
Input gate: What new information to store
Output gate: What to output from the cell state

GRU (Gated Recurrent Unit) is a simplified version with only two gates (reset and update). It's faster and often works just as well as LSTM.
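
One way to see why a GRU is lighter is to count parameters. In PyTorch, an LSTM layer stacks four weight blocks (three gates plus the candidate cell update) and a GRU stacks three (two gates plus the candidate state). The layer sizes below are arbitrary assumptions.

```python
import torch.nn as nn

def count_params(module):
    """Total number of learnable values in a module."""
    return sum(p.numel() for p in module.parameters())

# Arbitrary sizes picked for the comparison.
lstm = nn.LSTM(input_size=8, hidden_size=16)
gru = nn.GRU(input_size=8, hidden_size=16)

lstm_params = count_params(lstm)
gru_params = count_params(gru)
print(lstm_params, gru_params)  # 1664 1248
```

The 4:3 ratio holds for any layer size, so a GRU does roughly three quarters of the work of an LSTM of the same width.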

Python
# PyTorch LSTM
import torch
import torch.nn as nn

class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, num_layers=2)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        embedded = self.embedding(x)          # (batch, seq_len) → (batch, seq_len, embed_dim)
        output, (hidden, cell) = self.lstm(embedded)  # hidden: (num_layers, batch, hidden_dim)
        return self.fc(hidden[-1])            # Final hidden state of the top LSTM layer

8.5 Project: Text Generation

Train a character-level LSTM on a text corpus. Given a sequence of characters, predict the next one:

Python
# Simplified character-level generation
import torch

text = "to be or not to be that is the question"
chars = sorted(set(text))
char_to_idx = {c: i for i, c in enumerate(chars)}
idx_to_char = {i: c for c, i in char_to_idx.items()}

# Create training pairs: "to b" → "o be", "o be" → " be "
seq_length = 4
# ... (create input-target pairs) ...

# Generate text: seed with a prompt, predict one char at a time
def generate(model, seed, length=100):
    model.eval()
    generated = list(seed)  # separate name, so the global `chars` vocabulary is not shadowed
    with torch.no_grad():   # inference only, no gradients needed
        for _ in range(length):
            x = torch.tensor([[char_to_idx[c] for c in generated[-seq_length:]]])
            pred = model(x)  # assumed to return next-character logits of shape (1, vocab_size)
            next_char = idx_to_char[pred.argmax(-1).item()]
            generated.append(next_char)
    return ''.join(generated)
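
The pair-creation step is left out above. One possible way to build the pairs, assuming the target is simply the input window shifted by one character, is sketched here. The list name pairs is made up for this example.

```python
text = "to be or not to be that is the question"
seq_length = 4

# Assumed scheme: the target is the input window shifted one character to the right.
pairs = []
for i in range(len(text) - seq_length):
    inputs = text[i : i + seq_length]
    targets = text[i + 1 : i + seq_length + 1]
    pairs.append((inputs, targets))

print(pairs[0])    # ('to b', 'o be')
print(pairs[1])    # ('o be', ' be ')
print(len(pairs))  # 35
```

The last character of each target is the "next character" for that input window. Before training, both strings would still need to be mapped to indices with char_to_idx.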

✎ Exercises — Chapter 8

Train the text generator on a Shakespeare dataset. How does the output quality change with training epochs?
Compare LSTM vs GRU on the same task. Which converges faster?
Add temperature-based sampling to the generation function: divide logits by a temperature before softmax. What happens at temperature 0.5 vs 1.5?
Why are RNNs being replaced by Transformers? What are the computational disadvantages of processing sequences step by step?

Part III

Leveling Up

Chapter 9

Transformers & Attention

The most important architecture of the decade. Transformers power GPT, BERT, and virtually every state-of-the-art language model.

9.1 The Key Insight: Attention

Instead of processing a sequence step by step (like RNNs), the attention mechanism lets every element in a sequence look at every other element simultaneously. This is both more powerful and more parallelizable.

Think of it this way: when you read the word "it" in "The cat sat on the mat because it was tired," your brain attends to "cat" to understand what "it" refers to. Attention formalizes this process.

9.2 Self-Attention Step by Step

For each token in the input, we compute three vectors:

Query (Q): "What am I looking for?"
Key (K): "What do I contain?"
Value (V): "What information do I provide?"

The attention score between two tokens is the dot product of one's Query with the other's Key. Higher score = more attention.

Attention(Q, K, V) = softmax(Q · Kᵀ / √d_k) · V

Python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class SelfAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
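        # The embedding is split evenly across heads, so it must divide cleanly.
        assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads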

        self.Q = nn.Linear(embed_dim, embed_dim)
        self.K = nn.Linear(embed_dim, embed_dim)
        self.V = nn.Linear(embed_dim, embed_dim)
        self.out = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        B, T, C = x.shape

        q = self.Q(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.K(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.V(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)

        scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        weights = F.softmax(scores, dim=-1)
        out = (weights @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.out(out)
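
To see the formula at work without the multi-head bookkeeping, here is a single-head sketch on random tensors. The function name attention and the sizes are assumptions for illustration; the point is that every row of the weight matrix is a probability distribution over positions.

```python
import math
import torch
import torch.nn.functional as F

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for tensors of shape (seq_len, d_k)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    weights = F.softmax(scores, dim=-1)   # each row sums to 1
    return weights @ v, weights

torch.manual_seed(0)
seq_len, d_k = 4, 8
q = torch.randn(seq_len, d_k)
k = torch.randn(seq_len, d_k)
v = torch.randn(seq_len, d_k)

out, weights = attention(q, k, v)
print(out.shape, weights.shape)  # torch.Size([4, 8]) torch.Size([4, 4])
print(weights.sum(dim=-1))       # every entry is 1 (up to rounding)
```

Row i of weights says how much token i attends to each token in the sequence, and the output for token i is the corresponding weighted average of the value vectors.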

9.3 The Transformer Architecture

A full Transformer block has:

Multi-head self-attention (look at all positions)
Add & Norm (residual connection + layer normalization)
Feed-forward network (process each position independently)
Add & Norm again
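
The four components above can be sketched as a small module. This is a minimal version under some assumptions: it uses PyTorch's built-in nn.MultiheadAttention, places the normalization after each residual connection as in the list, and picks arbitrary sizes. Real models add dropout, masking, and positional information.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Attention -> Add & Norm -> feed-forward -> Add & Norm."""

    def __init__(self, embed_dim, num_heads, ff_dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_dim),
        )
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # self-attention: query = key = value = x
        x = self.norm1(x + attn_out)       # Add & Norm
        x = self.norm2(x + self.ff(x))     # feed-forward, then Add & Norm
        return x

block = TransformerBlock(embed_dim=32, num_heads=4, ff_dim=64)
x = torch.randn(2, 10, 32)   # (batch, seq_len, embed_dim)
y = block(x)
print(y.shape)  # torch.Size([2, 10, 32])
```

The output has the same shape as the input, which is what lets these blocks be stacked many times.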

9.4 Why Transformers Won

Feature	RNN	Transformer
Parallelization	Sequential across time steps (slow)	Parallel across positions during training (fast)
Long-range dependencies	Signal fades over long distances, even with gating	Direct attention to any position
Training on GPUs	Limited by sequential nature	Well suited to GPUs
Scalability	Hard to scale to very deep or very large models	Scales to billions of parameters

9.5 BERT, GPT, and the Foundation Model Era

BERT (2018): Encoder-only Transformer. Pre-trained to fill in masked words. Excels at understanding tasks (classification, Q&A).

GPT (2018–present): Decoder-only Transformer. Pre-trained to predict the next word. Excels at generation tasks (text completion, conversation).

T5 (2019): Full encoder-decoder Transformer. Frames everything as "text in → text out."

✎ Exercises — Chapter 9

Implement the Transformer block and train it on a simple sequence prediction task. Compare to an LSTM on the same data.
Visualize attention weights: which tokens attend to which? Feed a sentence through a Transformer and plot the attention matrix.
Why does scaling attention by √d_k matter? What happens without it?
Read the "Attention Is All You Need" paper (2017). Write a 1-paragraph summary of the key contributions.

Chapter 10

Transfer Learning

Why train from scratch when someone else already has? Use pre-trained models as a starting point and adapt them to your task.

10.1 The Big Idea

Training a large model from scratch requires millions of images and days of GPU time. Transfer learning says: take a model pre-trained on a massive dataset (like ImageNet's 1.2 million images), and adapt it to your task with your smaller dataset.

It works because early layers learn general features (edges, textures) that are useful for almost any vision task. Only the later layers are task-specific.

10.2 Feature Extraction vs Fine-Tuning

Strategy	What Changes	When to Use
Feature Extraction	Only the final classifier layer	Small dataset, similar domain
Fine-Tuning	All (or most) layers, at low learning rate	Larger dataset or different domain

Python
import torch
import torch.nn as nn
import torchvision.models as models

# Load pre-trained ResNet-18
model = models.resnet18(weights='IMAGENET1K_V1')

# Freeze all layers (feature extraction)
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer for your task
num_classes = 5  # e.g., 5 types of flowers
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only the new layer will be trained
optimizer = torch.optim.Adam(model.fc.parameters(), lr=0.001)

10.3 Fine-Tuning in Practice

Python
# Unfreeze all layers for fine-tuning
for param in model.parameters():
    param.requires_grad = True

# Use a LOWER learning rate (don't destroy learned features)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

# OR: gradually unfreeze layers (gradual unfreezing)
# Start training only the last layer, then unfreeze the last 2, etc.
# A related technique, discriminative fine-tuning, gives earlier layers smaller learning rates.
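
Here is a small sketch of staged unfreezing combined with per-layer learning rates. A tiny nn.Sequential stands in for the pre-trained network so the example runs without downloading weights; the layer sizes and learning rates are assumptions.

```python
import torch
import torch.nn as nn

# Stand-in for a pre-trained network: two "backbone" layers plus a new head.
model = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),   # early layers (general features)
    nn.Linear(32, 32), nn.ReLU(),   # later layers
    nn.Linear(32, 5),               # new task-specific head
)
late, head = model[2], model[4]

def count_trainable(m):
    return sum(p.numel() for p in m.parameters() if p.requires_grad)

# Stage 1: freeze everything, then train only the head.
for p in model.parameters():
    p.requires_grad = False
for p in head.parameters():
    p.requires_grad = True
trainable_stage1 = count_trainable(model)

# Stage 2: also unfreeze the later layers.
for p in late.parameters():
    p.requires_grad = True
trainable_stage2 = count_trainable(model)

# Per-layer learning rates: smaller steps for layers closer to the input.
optimizer = torch.optim.Adam([
    {"params": late.parameters(), "lr": 1e-5},
    {"params": head.parameters(), "lr": 1e-3},
])
print(trainable_stage1, trainable_stage2)         # 165 1221
print([g["lr"] for g in optimizer.param_groups])  # [1e-05, 0.001]
```

In a real run you would train for a few epochs between stages and rebuild the optimizer whenever the set of trainable layers changes.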

10.4 Transfer Learning for NLP

The same idea works for text. Models like BERT and GPT are pre-trained on massive text corpora. You can fine-tune them for specific tasks:

Python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load pre-trained BERT
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Fine-tune on sentiment analysis
# The model already "understands" English — you just teach it your specific task

Hugging Face

The Hugging Face Transformers library provides thousands of pre-trained models for NLP, vision, and audio. It's become the standard way to use pre-trained models. Install it with: pip install transformers

✎ Exercises — Chapter 10

Use a pre-trained ResNet-18 to classify a custom 5-class image dataset (e.g., flowers, food). Compare feature extraction vs full fine-tuning.
Fine-tune a pre-trained BERT model for sentiment analysis on movie reviews. What accuracy do you achieve?
What happens if you fine-tune with too high a learning rate? Try lr=0.01 and observe the results.

Chapter 11

Generative Models

Models that create — generate images, text, music. Autoencoders, GANs, and the diffusion models behind DALL·E and Stable Diffusion.

11.1 Autoencoders

An autoencoder learns to compress data into a compact representation (encoding) and reconstruct it back. It's trained to minimize reconstruction error.

Python
class Autoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(784, 256), nn.ReLU(),
            nn.Linear(256, 64),  nn.ReLU(),
            nn.Linear(64, 16)    # Bottleneck: 784 → 16
        )
        self.decoder = nn.Sequential(
            nn.Linear(16, 64),   nn.ReLU(),
            nn.Linear(64, 256),  nn.ReLU(),
            nn.Linear(256, 784), nn.Sigmoid()
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

11.2 Generative Adversarial Networks (GANs)

A GAN pits two networks against each other:

Generator: Creates fake data from random noise
Discriminator: Tries to distinguish real data from fakes

They train together: the generator gets better at faking, the discriminator gets better at detecting. Ideally, training ends with a generator whose samples the discriminator can no longer tell apart from real data, although in practice GAN training is often unstable.

11.3 Diffusion Models

The technology behind Stable Diffusion, DALL·E 2, and Midjourney. The idea is beautifully simple:

Forward process: Gradually add noise to an image until it's pure static
Reverse process: Train a neural network to reverse each step of noising — to "denoise" one step at a time
Generation: Start with pure noise, denoise step by step, and a coherent image emerges
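
The forward process can be sketched in a few lines. This assumes a linear noise schedule with the values commonly used for DDPM-style models (1,000 steps, noise variance rising from 1e-4 to 0.02) and applies it to a single pixel value. The names betas, alpha_bar, and add_noise are illustrative.

```python
import math
import random

random.seed(0)
num_steps = 1000

# Assumed linear noise schedule (values commonly used with DDPM-style models).
betas = [1e-4 + (0.02 - 1e-4) * t / (num_steps - 1) for t in range(num_steps)]

# alpha_bar[t] is the running product of (1 - beta): the share of signal variance left at step t.
alpha_bar = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bar.append(running)

def add_noise(x0, t):
    """Jump straight to step t: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    noise = random.gauss(0.0, 1.0)
    return math.sqrt(alpha_bar[t]) * x0 + math.sqrt(1.0 - alpha_bar[t]) * noise

pixel = 0.8  # a single pixel value, for illustration
for t in (0, 250, 500, 999):
    print(t, round(alpha_bar[t], 4), round(add_noise(pixel, t), 3))
```

As t grows, alpha_bar falls toward zero, so the original pixel contributes less and less and the sample approaches pure Gaussian noise. The reverse process trains a network to predict the noise that was added, so it can be removed step by step.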

✎ Exercises — Chapter 11

Build and train an autoencoder on MNIST. Visualize the reconstructions — how do they look with a bottleneck of 2 vs 16 vs 64?
If the bottleneck is 2D, you can plot the latent space. Color-code by digit class. What structure do you see?
Implement a simple GAN for generating MNIST digits. Plot generated samples at epochs 1, 10, 50, and 100.

Chapter 12

Practical Skills

Theory gets you started. These skills get you to production.

12.1 Data Preprocessing

Real-world data is messy. Common tasks:

Missing values: Impute with mean/median, or flag with a mask
Normalization: Scale features to similar ranges (e.g., zero mean, unit variance)
Tokenization: Convert text to numerical sequences
Encoding: Convert categorical variables to numbers (one-hot or embedding)
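
A few of these steps on toy data, using only the standard library. The feature values are made up. In a real project the mean and standard deviation would be computed on the training split only and reused for validation and test data.

```python
from statistics import mean, pstdev

ages = [25.0, None, 40.0, 31.0]          # one missing value
colors = ["red", "green", "red", "blue"]

# Missing values: impute with the mean of the observed entries and keep a mask.
observed = [a for a in ages if a is not None]
fill = mean(observed)
missing_mask = [a is None for a in ages]
imputed = [fill if a is None else a for a in ages]

# Normalization: zero mean, unit variance.
mu, sigma = mean(imputed), pstdev(imputed)
normalized = [(a - mu) / sigma for a in imputed]

# Encoding: one-hot vectors for a categorical feature.
categories = sorted(set(colors))                      # ['blue', 'green', 'red']
one_hot = [[int(c == cat) for cat in categories] for c in colors]

print(imputed)   # [25.0, 32.0, 40.0, 31.0]
print(one_hot)   # [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```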

12.2 Debugging Models

When your model doesn't work (and it often won't), follow this checklist:

Overfit a single batch. If your model can't learn 10 examples perfectly, there's a bug.
Check the loss. Is it decreasing at all? If not, the learning rate might be too high or too low.
Visualize predictions. Plot what the model outputs vs. the truth.
Check gradients. Are they flowing? Are they exploding (NaN) or vanishing (~0)?
Simplify. Remove complexity until something works, then add it back incrementally.
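
The first and fourth checks can be sketched as follows. The model, data, and hyperparameters are arbitrary assumptions; the idea is that ten examples should be easy to memorize, so a loss that refuses to drop points to a bug rather than a hard problem.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Ten random examples with random labels: a model with enough capacity should memorize them.
x = torch.randn(10, 20)
y = torch.randint(0, 3, (10,))

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

losses = []
for step in range(300):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# Gradient check: overall norm of the gradients from the last backward pass.
grad_norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
accuracy = (model(x).argmax(dim=1) == y).float().mean().item()
print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}, accuracy {accuracy:.2f}")
print(f"gradient norm {grad_norm.item():.6f}")
```

If the final loss is not far below the first one, look at the data pipeline, the loss function, and the optimizer wiring before touching the architecture.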

12.3 Experiment Tracking

Python
# Use Weights & Biases for experiment tracking
import wandb
wandb.init(project="my-dl-project")
wandb.config.update({"lr": 0.001, "epochs": 50, "batch_size": 64})

for epoch in range(50):
    train_loss = train_one_epoch(...)
    val_acc = evaluate(...)
    wandb.log({"loss": train_loss, "accuracy": val_acc})

12.4 Common Pitfalls

Problem	Symptom	Fix
Data leakage	Unrealistically high accuracy	Ensure test data is never seen during training
Imbalanced classes	High accuracy but low recall on the minority class	Use weighted loss or oversample minority class
Wrong loss function	Loss doesn't decrease	Match loss to task: CrossEntropy for classification, MSE for regression
Learning rate too high	Loss oscillates or explodes	Start at 1e-3, decrease if unstable
Learning rate too low	Loss barely decreases	Increase by 10x
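
The class-imbalance row is easy to reproduce with toy numbers. The sketch below assumes a 90/10 split and uses inverse-frequency weights, one common choice among several; the resulting weights could then be passed to a weighted loss.

```python
from collections import Counter

labels = ["A"] * 90 + ["B"] * 10   # 90% class A, 10% class B

counts = Counter(labels)
num_classes = len(counts)
total = len(labels)

# Inverse-frequency weights: weight_c = total / (num_classes * count_c).
class_weights = {c: total / (num_classes * n) for c, n in counts.items()}
print(class_weights)

# A model that always predicts "A" looks accurate but never finds class B.
predictions = ["A"] * total
accuracy = sum(p == t for p, t in zip(predictions, labels)) / total
recall_b = sum(p == "B" and t == "B" for p, t in zip(predictions, labels)) / counts["B"]
print(accuracy, recall_b)  # 0.9 0.0
```

With these weights, a mistake on class B costs nine times as much as a mistake on class A, which pushes the model away from the trivial "always A" solution.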

✎ Exercises — Chapter 12

Take a dataset and intentionally introduce data leakage. Show that accuracy is artificially high. Then fix it.
Create an imbalanced dataset (90% class A, 10% class B). Train a model and observe the problem. Apply class weighting and compare.
Set up experiment tracking with either W&B or TensorBoard. Compare 3 different learning rates on the same task.

Part IV

The Real World

Chapter 13

Deploying Models

A model on your laptop is a research project. A model that serves real users is a product.

13.1 Saving & Exporting

Python
# PyTorch: save entire model
torch.save(model, 'model_full.pth')

# Or just the weights (recommended)
torch.save(model.state_dict(), 'model_weights.pth')

# Export to ONNX (framework-agnostic)
dummy = torch.randn(1, 1, 28, 28)
torch.onnx.export(model, dummy, "model.onnx")
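
Saving only the weights means the architecture has to be rebuilt in code before loading. A minimal round trip looks roughly like this; the build_model function is a stand-in architecture, and an in-memory buffer replaces the file so the sketch leaves nothing on disk.

```python
import io
import torch
import torch.nn as nn

def build_model():
    # Stand-in architecture; the same code must be available when loading the weights.
    return nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))

model = build_model()

# Save only the weights. The buffer stands in for 'model_weights.pth'.
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)

# Loading: rebuild the architecture first, then copy the saved weights into it.
restored = build_model()
restored.load_state_dict(torch.load(buffer))
restored.eval()

dummy = torch.randn(1, 1, 28, 28)
with torch.no_grad():
    same = torch.allclose(model(dummy), restored(dummy))
print(same)  # True
```

This is the main reason the weights-only route is recommended: the file contains plain tensors, so it keeps working when your class definitions move or change.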

13.2 Building a Simple API

Python
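from flask import Flask, request, jsonify
import torch

app = Flask(__name__)
# Load the full model saved in 13.1. Recent PyTorch versions default to
# weights_only=True, which rejects full-model pickles; only load files you trust.
model = torch.load('model_full.pth', weights_only=False)
model.eval()

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    tensor = preprocess(data['input'])  # preprocess: your own function that turns the JSON input into a tensor
    with torch.no_grad():
        prediction = model(tensor)
    return jsonify({'prediction': prediction.tolist()})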

13.3 Deployment Options

Platform	Best For
Flask / FastAPI	Simple APIs, prototyping
Docker + Cloud (AWS/GCP/Azure)	Production scale
ONNX Runtime	Cross-platform inference
TorchServe	PyTorch-native serving
Edge (TFLite, CoreML)	Mobile and embedded devices

13.4 Optimization for Inference

Quantization: Use 8-bit integers instead of 32-bit floats. Often 2-4x faster on supported hardware and about 4x smaller, usually with little accuracy loss.
Pruning: Remove small weights (set to zero). Creates sparse, smaller models.
Distillation: Train a small "student" model to mimic a large "teacher" model.
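
To get a feel for what quantization does to individual weights, here is a toy sketch of symmetric 8-bit quantization in plain Python. The weight values are made up, and real frameworks use more elaborate schemes (per-channel scales, zero points, calibration).

```python
weights = [-0.62, -0.10, 0.0, 0.25, 0.91]

# Symmetric 8-bit quantization: map [-max_abs, max_abs] onto integers in [-127, 127].
max_abs = max(abs(w) for w in weights)
scale = max_abs / 127

quantized = [round(w / scale) for w in weights]   # what would be stored as int8
dequantized = [q * scale for q in quantized]      # what the model computes with
max_error = max(abs(w - d) for w, d in zip(weights, dequantized))

print(quantized)
print(f"max round-trip error: {max_error:.5f}")
```

Each weight moves by at most half a quantization step (scale / 2), which is why accuracy usually changes little while storage drops from 32 bits to 8 bits per weight.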

✎ Exercises — Chapter 13

Wrap your trained MNIST model in a Flask API. Accept an image and return the predicted digit.
Export a PyTorch model to ONNX and run it with ONNX Runtime.
Apply dynamic quantization to your model and compare inference speed and model size.

Chapter 14

Ethics & Responsible AI

With great power comes real consequences. Building AI responsibly isn't optional.

14.1 Bias in, Bias Out

Models learn from data. If the data reflects historical biases — racial, gender, socioeconomic — the model will reproduce and amplify those biases. A hiring model trained on historical data may learn that "most successful candidates were male" and penalize female applicants.

14.2 Fairness Metrics

Demographic parity: Prediction rates should be similar across groups
Equal opportunity: True positive rates should be similar across groups
Individual fairness: Similar individuals should get similar predictions
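
The first two metrics are simple to compute once predictions are grouped. The records below are invented for illustration; each one holds a group label, the true outcome, and the model's prediction.

```python
# Made-up records: (group, true_label, predicted_label), with 1 = positive outcome.
records = [
    ("a", 1, 1), ("a", 1, 1), ("a", 0, 1), ("a", 0, 0),
    ("b", 1, 1), ("b", 1, 0), ("b", 0, 0), ("b", 0, 0),
]

def positive_rate(group):
    """Share of the group that receives a positive prediction (demographic parity)."""
    preds = [p for g, _, p in records if g == group]
    return sum(preds) / len(preds)

def true_positive_rate(group):
    """Share of truly positive members who are predicted positive (equal opportunity)."""
    preds = [p for g, y, p in records if g == group and y == 1]
    return sum(preds) / len(preds)

parity_gap = positive_rate("a") - positive_rate("b")
opportunity_gap = true_positive_rate("a") - true_positive_rate("b")
print(parity_gap, opportunity_gap)  # 0.5 0.5
```

A gap of zero would mean the groups are treated alike under that metric. The two metrics can disagree, so which one to prioritize depends on the application.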

14.3 Interpretability

When a model denies someone a loan or flags a medical image, people deserve an explanation. Techniques include:

SHAP/LIME: Explain individual predictions by testing which features matter most
Attention visualization: Show what the model "looks at"
Grad-CAM: Highlight which parts of an image influenced the decision

14.4 Environmental Cost

Training large models consumes significant energy. GPT-3's training reportedly used ~1,300 MWh of electricity. Consider:

Do you need a large model, or would a smaller one suffice?
Can you use transfer learning instead of training from scratch?
Can you train on renewable energy or during off-peak hours?

14.5 Guidelines for Responsible Development

Audit your data for bias before training
Test model performance across demographic groups
Be transparent about model limitations
Implement human-in-the-loop for high-stakes decisions
Document your model (what it was trained on, its intended use, known failure modes)

Chapter 15

Where the Field Is Heading

A snapshot of the frontier as of 2025 — and where things might go next.

15.1 Foundation Models

Large models pre-trained on broad data that can be adapted to many tasks. GPT-4, Claude, Gemini, and similar models represent a paradigm shift: instead of building task-specific models, you build one large model and adapt it.

15.2 Multimodal Models

Models that understand and generate across modalities — text, images, audio, video — simultaneously. GPT-4V can reason about images. Gemini processes text, code, images, and audio natively.

15.3 Key Trends

Smaller, more efficient models: Techniques like quantization, distillation, and efficient architectures are making powerful models accessible on consumer hardware.
Open source: Models like LLaMA, Mistral, and Stable Diffusion are democratizing access.
AI Agents: Models that can use tools, browse the web, write and execute code.
Reasoning: Models that can break down complex problems and reason step by step.
Regulation: The EU AI Act and similar legislation are shaping how AI can be deployed.

15.4 What to Learn Next

Read papers: Start with well-written survey papers and landmark papers (Attention Is All You Need, ResNet, etc.)
Build projects: The best way to learn is to build something and get it wrong
Join communities: Papers With Code, Hugging Face forums, r/MachineLearning
Specialize: Go deep in one area (computer vision, NLP, reinforcement learning, etc.)

Chapter 16

Capstone Project

Put it all together. This project walks you through building a complete deep learning application end to end.

Project: Image Classification Web App

Goal: Build a web app that classifies images into custom categories. The user uploads an image, the model predicts the class, and the result is displayed with a confidence score.

Step 1: Define the Problem

Choose a dataset. Suggestions:

Flowers (102 species)
Food (101 categories)
Dog breeds (120 breeds)
Your own collected dataset

Step 2: Data Pipeline

Python
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.2, 0.2),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

Step 3: Model

Python
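import torch
import torch.nn as nn
import torchvision.models as models

# num_classes: the number of categories in the dataset you chose in Step 1
model = models.resnet50(weights='IMAGENET1K_V2')
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Fine-tune with differential learning rates.
# Only layer4 and fc are passed to the optimizer, so earlier layers keep their pre-trained weights.
optimizer = torch.optim.AdamW([
    {'params': model.layer4.parameters(), 'lr': 1e-4},
    {'params': model.fc.parameters(), 'lr': 1e-3},
], weight_decay=1e-2)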

Step 4: Training with Best Practices

Mixup / CutMix augmentation
Cosine annealing learning rate schedule
Early stopping on validation loss
Model checkpointing (save best model)
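
Two of these practices fit in a short sketch: a cosine learning rate schedule and early stopping with checkpointing. The function and class names, the patience value, and the validation losses are assumptions for illustration; PyTorch also ships a ready-made cosine scheduler.

```python
import math

def cosine_lr(epoch, total_epochs, lr_max=1e-3, lr_min=0.0):
    """Cosine annealing: starts at lr_max and decays smoothly to lr_min."""
    progress = epoch / total_epochs
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

class EarlyStopping:
    """Signals a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss      # this is also the moment to save a checkpoint
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation losses that improve, then plateau.
val_losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.64]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    lr = cosine_lr(epoch, total_epochs=len(val_losses))  # would be applied to the optimizer
    if stopper.step(val_loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best)  # 5 0.6
```

Training stops three epochs after the best validation loss, and the checkpoint saved at that best epoch is the model you keep.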

Step 5: Deploy

Create a simple web interface with Flask or Streamlit. Accept image uploads, run the model, and display results with confidence bars.

You Made It

If you've followed along from Chapter 1, you now understand the foundations of deep learning — from the math to the models to the deployment. The field is vast and evolving fast, but you have the fundamentals to learn anything that comes next. Keep building.

✎ Final Exercises — Chapter 16

Complete the capstone project end-to-end with a dataset of your choice.
Write a model card documenting your model: training data, performance metrics, known limitations, intended use cases.
Deploy your model as a web app and share it with someone who isn't in tech. Get their feedback.
Reflect: what was the hardest part? What would you do differently next time?

Appendix

Math Refresher & Glossary

Quick reference for the math and terminology used throughout this book.

A.1 Linear Algebra Cheat Sheet

Concept	Notation	Meaning
Scalar	a	A single number
Vector	v = [v₁, v₂, v₃]	Ordered list of numbers (1D array)
Matrix	W	2D grid of numbers
Dot product	a · b	Element-wise multiply, then sum
Matrix multiply	AB	Rows of A dotted with columns of B
Transpose	Aᵀ	Rows become columns
Norm	‖v‖	"Length" of a vector
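
The operations in this table take only a few lines of plain Python, which can help when checking small examples by hand. The helper names are illustrative; in practice you would use PyTorch or NumPy.

```python
import math

def dot(a, b):
    """Element-wise multiply, then sum."""
    return sum(x * y for x, y in zip(a, b))

def transpose(m):
    """Rows become columns."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Entry (i, j) is row i of A dotted with column j of B."""
    return [[dot(row, col) for col in transpose(b)] for row in a]

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(dot(v, v))

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(dot([1, 2, 3], [4, 5, 6]))  # 32
print(matmul(A, B))               # [[19, 22], [43, 50]]
print(transpose(A))               # [[1, 3], [2, 4]]
print(norm([3, 4]))               # 5.0
```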

A.2 Calculus Cheat Sheet

Function	Derivative
f(x) = c	f'(x) = 0
f(x) = xⁿ	f'(x) = nxⁿ⁻¹
f(x) = eˣ	f'(x) = eˣ
f(x) = ln(x)	f'(x) = 1/x
f(x) = σ(x)	f'(x) = σ(x)(1 - σ(x))
f(x) = ReLU(x)	f'(x) = 1 if x > 0, else 0 (undefined at x = 0; taken as 0 by convention)
Chain rule: f(g(x))	f'(g(x)) × g'(x)
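
Any entry in this table can be sanity-checked numerically with a finite difference. The sketch below does this for the sigmoid derivative and for a chain-rule example; the test point x = 0.5 and the function exp(x²) are arbitrary choices.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def numerical_derivative(f, x, eps=1e-6):
    """Central difference approximation of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = 0.5
analytic = sigmoid(x) * (1 - sigmoid(x))
numeric = numerical_derivative(sigmoid, x)

# Chain rule on f(g(x)) = exp(x**2): the derivative is exp(x**2) * 2x.
chain_analytic = math.exp(x ** 2) * 2 * x
chain_numeric = numerical_derivative(lambda t: math.exp(t ** 2), x)

print(abs(analytic - numeric) < 1e-8, abs(chain_analytic - chain_numeric) < 1e-6)  # True True
```

The same trick is a handy way to check a hand-written gradient before trusting it in training code.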

A.3 Glossary

Term	Definition
Activation function	Nonlinear function applied to neuron output (ReLU, sigmoid, etc.)
Backpropagation	Algorithm for computing gradients via the chain rule
Batch size	Number of samples processed before updating weights
Bias	Learnable offset added after the weighted sum in a neuron
CNN	Convolutional Neural Network — specialized for grid-like data (images)
Cross-entropy	Loss function for classification tasks
Data augmentation	Artificially expanding training data through transformations
Dropout	Regularization: randomly zeroing activations during training
Epoch	One complete pass through the entire training dataset
Fine-tuning	Continuing training of a pre-trained model on a new task
Gradient	Vector of partial derivatives — direction of steepest increase
Gradient descent	Optimization: iteratively move parameters opposite to the gradient
GPU	Graphics Processing Unit — massively parallel hardware for DL
Learning rate	Step size for parameter updates during optimization
Loss function	Measure of how wrong the model's predictions are
LSTM	Long Short-Term Memory — RNN variant that handles long-range dependencies
Overfitting	Model memorizes training data instead of learning general patterns
Parameter	Learnable value in the model (weights and biases)
Pooling	Downsampling operation in CNNs (max or average)
Pre-training	Initial training on a large dataset before fine-tuning
Regularization	Techniques to prevent overfitting (dropout, weight decay, etc.)
RNN	Recurrent Neural Network — processes sequences step by step
Softmax	Function that converts logits to probabilities
Tensor	Multi-dimensional array (generalization of vectors and matrices)
Transfer learning	Using a pre-trained model as a starting point for a new task
Transformer	Architecture based on self-attention (introduced in 2017, now dominant in NLP)
Underfitting	Model is too simple to capture the patterns in data
Weight	Learnable parameter that scales the input to a neuron