A Beginner's Complete Guide

Deep Learning 101

From absolute zero to building real neural networks. No PhD required — just curiosity and Python.

First Edition · 2025

Table of Contents

Part I — Foundations
Ch 1. What Is Deep Learning?
Ch 2. The Math You Actually Need
Ch 3. Python & Tools Setup
Ch 4. Your First Neural Network
Part II — Core Deep Learning
Ch 5. How Neural Networks Learn
Ch 6. Building with PyTorch
Ch 7. Convolutional Neural Networks
Ch 8. Recurrent Neural Networks
Part III — Leveling Up
Ch 9. Transformers & Attention
Ch 10. Transfer Learning
Ch 11. Generative Models
Ch 12. Practical Skills
Part IV — The Real World
Ch 13. Deploying Models
Ch 14. Ethics & Responsible AI
Ch 15. Where the Field Is Heading
Ch 16. Capstone Project
Appendix. Math Refresher & Glossary
Part I
Foundations
Chapter 1

What Is Deep Learning?

Before we write a single line of code, let's understand what we're getting into — and why it matters.

1.1 A Brief History of Teaching Machines to Think

The dream of artificial intelligence is old. Ancient Greek myths told of Talos, a giant bronze automaton that guarded Crete. Medieval engineers built mechanical birds. But the real story of AI begins in the 1940s, when mathematicians and logicians first asked: can a machine think?

1943
McCulloch & Pitts propose the first mathematical model of a neuron — a simple unit that takes inputs, applies weights, and produces an output.
1957
Frank Rosenblatt builds the Perceptron, a machine that learns to classify patterns. The New York Times reports it as the embryo of a computer that will "walk, talk, see, write, reproduce itself and be conscious of its existence."
1969
Minsky & Papert prove that a single-layer Perceptron can't solve XOR (exclusive or). This result contributes to an "AI Winter" in which funding for neural network research dries up for over a decade.
1986
Rumelhart, Hinton & Williams popularize backpropagation, showing how multi-layer networks can learn complex patterns. The field warms up again.
2012
Alex Krizhevsky's AlexNet wins ImageNet by a massive margin using deep CNNs and GPUs. This is the Big Bang of modern deep learning.
2017
Vaswani et al. publish "Attention Is All You Need," introducing the Transformer architecture. It will reshape NLP, vision, and everything else.
2020+
GPT-3, DALL·E, Stable Diffusion, GPT-4 — foundation models demonstrate capabilities that seemed like science fiction five years earlier.

1.2 AI, Machine Learning, and Deep Learning

These three terms are often used interchangeably, but they're nested concepts:

[Diagram: three nested regions, with Artificial Intelligence outermost, Machine Learning inside it, and Deep Learning innermost]
Figure 1.1 — Deep Learning is a subset of Machine Learning, which is a subset of Artificial Intelligence.

Artificial Intelligence (AI) is the broadest concept: any system that exhibits intelligent behavior. This includes rule-based systems (if the temperature > 100°C, send an alert), search algorithms (GPS navigation), and much more. Many AI systems don't involve learning at all.

Machine Learning (ML) is a subset of AI where systems learn patterns from data rather than following hand-coded rules. Instead of programming explicit rules, you provide examples and the system figures out the rules itself.

Deep Learning (DL) is a subset of ML that uses neural networks with many layers — hence "deep." These layered networks can automatically discover the features that matter in data, eliminating much of the manual feature engineering that traditional ML requires.

Key Insight

In traditional ML, a human expert must carefully choose what features to look for. In deep learning, the network discovers these features on its own. This is why deep learning works so well for messy, unstructured data like images, audio, and text — the features are too complex for humans to define by hand.

1.3 What Can Deep Learning Do?

The honest answer: a lot. But it's not magic, and it's not good at everything. Here's a realistic landscape:

Task | Example | Architecture
Image Classification | Is this X-ray showing pneumonia? | CNN, Vision Transformer
Object Detection | Find all pedestrians in this street scene | YOLO, Faster R-CNN
Text Generation | Write a coherent paragraph about climate change | Transformer (GPT family)
Machine Translation | Translate English to French | Seq2Seq, Transformer
Speech Recognition | Convert audio to text | Whisper, CTC Networks
Image Generation | Create a photorealistic face that doesn't exist | GAN, Diffusion Model
Game Playing | Beat the world champion at Go | Reinforcement Learning + DL

1.4 What Deep Learning Is NOT Good At

Being honest about limitations is just as important as celebrating capabilities:

1.5 The Building Blocks (A 30,000-Foot View)

Every deep learning system has the same fundamental ingredients:

  1. Data — Examples the model learns from. Images, text, audio, numbers — organized into inputs and (usually) labels.
  2. A Model (Architecture) — The structure of the neural network. How many layers, what type, how they connect.
  3. A Loss Function — A mathematical measure of how wrong the model's predictions are.
  4. An Optimizer — An algorithm that adjusts the model's internal parameters to reduce the loss.
  5. Training Loop — The cycle: feed data → make predictions → measure error → adjust parameters → repeat.
[Diagram: Data → Model (Predict) → Loss (Measure Error) → Optimizer (Adjust) → Repeat (Train)]
Figure 1.2 — The training loop: the heartbeat of deep learning.

Every chapter in this book builds on these five ingredients. You'll understand each one deeply by the end.

1.6 What You'll Build in This Book

This isn't a theory textbook. By the end of these chapters, you will have:

Let's start building.

✎ Exercises — Chapter 1
  1. In your own words, explain the difference between AI, machine learning, and deep learning to a friend who has never taken a CS course.
  2. Find three real-world applications of deep learning that you interact with daily (hint: your phone uses several).
  3. For each application in the table in Section 1.3, think of one benefit and one risk.
  4. Why might a company choose traditional ML over deep learning for a spam filter? When might deep learning be the better choice?
Chapter 2

The Math You Actually Need

Don't panic. You don't need a math degree — just a few core ideas, explained with pictures and code.

Who This Chapter Is For

If you're comfortable with basic algebra and have seen graphs (x-y plots), you're fine. We'll build up everything else. If you already know linear algebra and calculus, skip to Chapter 3.

2.1 Vectors: Lists of Numbers

A vector is simply an ordered list of numbers. That's it. In deep learning, vectors represent data — a single image, a word, a row in a spreadsheet.

Python
# A vector in Python (as a list)
student = [170, 65, 22]  # height(cm), weight(kg), age

# The same as a NumPy array
import numpy as np
student = np.array([170, 65, 22])
print(student.shape)  # (3,) — a 1D array with 3 elements

Think of a vector as an arrow pointing from the origin to a point in space. A 2D vector [3, 4] points to the point (3, 4). A 3D vector [1, 2, 3] points to (1, 2, 3) in 3D space. In deep learning, our vectors often have hundreds or thousands of dimensions — we can't draw them, but the math works the same way.

Vector Operations

Addition: add element by element.

[1, 2, 3] + [4, 5, 6] = [5, 7, 9]

Scalar multiplication: multiply every element by a number.

3 × [1, 2, 3] = [3, 6, 9]

Dot product: multiply corresponding elements, then sum. This is the most important operation in neural networks.

[1, 2, 3] · [4, 5, 6] = (1×4) + (2×5) + (3×6) = 32
Python
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

print(a + b)           # [5 7 9]
print(3 * a)           # [3 6 9]
print(np.dot(a, b))    # 32
print(a @ b)           # 32 (same as dot product)

2.2 Matrices: Tables of Numbers

A matrix is a 2D grid of numbers. In deep learning, weight matrices are the core of every layer — they transform input data into useful representations.

Python
# A 2×3 matrix (2 rows, 3 columns)
W = np.array([
    [1, 2, 3],
    [4, 5, 6]
])
print(W.shape)  # (2, 3)

Matrix-Vector Multiplication

This is the fundamental operation in neural networks. Each row of the matrix is dotted with the vector to produce the output:

W (3×2 matrix) × x (2-element vector) = y (3-element output), where:
y₁ = w₁₁x₁ + w₁₂x₂
y₂ = w₂₁x₁ + w₂₂x₂
y₃ = w₃₁x₁ + w₃₂x₂
Figure 2.1 — Matrix-vector multiplication: the engine of a neural network layer.
Python
W = np.array([[1, 2], [3, 4], [5, 6]])  # 3×2 matrix
x = np.array([10, 20])                    # 2-element vector

y = W @ x   # Matrix-vector multiplication
print(y)     # [ 50 110 170]

# Verify: row 1 is [1,2] · [10,20] = 10+40 = 50 ✓
#         row 2 is [3,4] · [10,20] = 30+80 = 110 ✓
#         row 3 is [5,6] · [10,20] = 50+120 = 170 ✓

2.3 Derivatives: Measuring Change

A derivative measures how much a function's output changes when its input changes a tiny bit. If you know the derivative of a function at a point, you know the slope — which direction to move to increase or decrease the output.

Intuition

Imagine you're blindfolded on a hill. The derivative is like poking the ground in different directions with a stick to figure out which way is downhill. That's exactly what gradient descent does — it uses derivatives to find the direction that reduces the loss.

For f(x) = x², the derivative is f'(x) = 2x. At x = 3, the slope is 6 — the function is increasing. At x = -2, the slope is -4 — it's decreasing.

Python
# Numerical derivative (finite differences)
def numerical_derivative(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: x ** 2
print(numerical_derivative(f, 3.0))   # ≈ 6.0 ✓
print(numerical_derivative(f, -2.0))  # ≈ -4.0 ✓

The Chain Rule

The chain rule is how we compute derivatives of composed functions. If y = f(g(x)), then:

dy/dx = f'(g(x)) × g'(x)

This is the mathematical foundation of backpropagation — the algorithm that makes training neural networks possible. We'll see it in action in Chapter 4.
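
As a quick sanity check, the chain rule can be verified numerically. The sketch below uses an illustrative composition, y = (3x + 1)², which is an assumption for this example rather than something from the text, and compares the chain-rule result with a finite-difference estimate.

```python
# Illustrative composition (an assumption for this example): y = f(g(x))
# with g(x) = 3x + 1 and f(u) = u**2.
def g(x):
    return 3 * x + 1

def f(u):
    return u ** 2

def chain_rule_derivative(x):
    # dy/dx = f'(g(x)) * g'(x), where f'(u) = 2u and g'(x) = 3
    return 2 * g(x) * 3

def finite_difference(x, h=1e-5):
    # Central-difference estimate of d/dx f(g(x))
    return (f(g(x + h)) - f(g(x - h))) / (2 * h)

x = 2.0
analytic = chain_rule_derivative(x)   # 2 * 7 * 3 = 42
numeric = finite_difference(x)
print(analytic, numeric)
```

The two values should agree to several decimal places, which is the same kind of check people use to debug hand-written backpropagation.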

2.4 Gradients: Derivatives in Many Dimensions

When your function has multiple inputs (as neural network loss functions always do), the gradient is a vector of all the partial derivatives. It points in the direction of steepest ascent.

∇f = [∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ]

If we want to minimize the function (reduce the loss), we move in the opposite direction of the gradient. This is the essence of gradient descent.

Python
# Simple gradient descent example
x = 10.0  # Start far from minimum
learning_rate = 0.1

for step in range(20):
    gradient = 2 * x            # Derivative of x²
    x = x - learning_rate * gradient  # Move opposite to gradient
    print(f"Step {step+1}: x = {x:.4f}, f(x) = {x**2:.4f}")

# x converges to 0 — the minimum of x²
Don't Memorize — Understand

You don't need to memorize derivative rules. Modern frameworks like PyTorch compute gradients automatically (this is called automatic differentiation or autograd). What you need is to understand what gradients represent and why they matter.

2.5 Probability Basics

Deep learning is fundamentally about uncertainty. A model that classifies an image as "cat" with 92% confidence is more useful than one that just says "cat." Understanding a few probability concepts helps you interpret what models are really doing.

Softmax is a function that turns a list of numbers (called logits) into a probability distribution — all values are between 0 and 1, and they sum to 1:

Python
def softmax(z):
    exp_z = np.exp(z - np.max(z))  # subtract max for stability
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)        # ≈ [0.659 0.242 0.099]
print(probs.sum())   # 1.0 — a valid probability distribution

Cross-entropy loss measures the distance between the model's predicted probabilities and the true labels. Lower is better. This is the most common loss function for classification tasks.
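
For a single example with one correct class, cross-entropy reduces to the negative log of the probability the model assigned to that class. Here is a minimal sketch of that special case; the function name `cross_entropy` and the sample probabilities are illustrative assumptions.

```python
import numpy as np

def cross_entropy(probs, true_index, eps=1e-12):
    # Negative log-probability of the correct class.
    # eps guards against log(0) when a probability is exactly zero.
    return -np.log(probs[true_index] + eps)

confident_right = np.array([0.9, 0.05, 0.05])
confident_wrong = np.array([0.05, 0.9, 0.05])

# The true class is index 0 in both cases
print(cross_entropy(confident_right, 0))  # small loss
print(cross_entropy(confident_wrong, 0))  # much larger loss
```

A confident wrong answer is penalized far more heavily than a confident right answer is rewarded, which is what pushes the model toward well-calibrated probabilities.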

✎ Exercises — Chapter 2
  1. Compute the dot product of [2, 3, 4] and [5, 6, 7] by hand, then verify with NumPy.
  2. Multiply the matrix [[1, 0], [0, 1], [1, 1]] by vector [3, 7]. What do you notice?
  3. The function f(x) = x³ - 4x has derivative f'(x) = 3x² - 4. At what values of x is the derivative zero? (These are the local minimum and maximum.)
  4. Write a Python function that performs gradient descent on f(x) = (x-3)² + 2. Start at x = 0 and find the minimum.
  5. Apply softmax to [5, 5, 5] and [100, 0, 0]. What do you notice about the outputs?
Chapter 3

Python & Tools Setup

Let's get your development environment ready. You'll need Python, NumPy, and PyTorch — nothing more to start.

3.1 Installing Python

We recommend Python 3.10 or later. The easiest way to get everything set up is through Anaconda or Miniconda:

Terminal
# Option A: Miniconda (lightweight, recommended)
# Download from https://docs.conda.io/en/latest/miniconda.html

# Option B: If you already have Python, just use pip
pip install numpy matplotlib jupyter torch torchvision

3.2 Your Toolkit

Tool | Purpose | Why
NumPy | Matrix math | Fast, universal, the foundation everything is built on
Matplotlib | Plotting | Visualize data, loss curves, model outputs
Jupyter | Interactive notebooks | Write and run code in cells; great for learning
PyTorch | Deep learning framework | Intuitive, Pythonic, dominant in research

3.3 NumPy Crash Course

If you know Python but not NumPy, here's everything you need in 60 seconds:

Python
import numpy as np

# Create arrays
a = np.array([1, 2, 3])           # 1D
b = np.zeros((3, 4))              # 3×4 matrix of zeros
c = np.random.randn(2, 3)        # 2×3 random normal
d = np.ones((2, 3))               # 2×3 matrix of ones

# Shape and indexing
print(c.shape)      # (2, 3)
print(c[0, :])      # First row (all columns)
print(c[:, 1])      # Second column (all rows)

# Operations
print(c + d)         # Element-wise addition
print(c * 2)         # Scalar multiplication
print(c @ a)         # Matrix-vector multiply (if shapes match)
print(c.reshape(3, 2))  # Reshape
print(np.exp(c))      # Element-wise exponential
Common Gotcha

NumPy uses broadcasting: operations between arrays of different shapes can work if dimensions are compatible. For example, adding a (3,4) matrix and a (1,4) vector works — the vector is "broadcast" across rows. This is powerful but can cause silent bugs if shapes don't match as you expect.
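
A short sketch of both sides of broadcasting: the intended case described above, and a shape mismatch that runs without error but produces something unexpected. The array values are arbitrary.

```python
import numpy as np

matrix = np.arange(12).reshape(3, 4)      # shape (3, 4)
row = np.array([[10, 20, 30, 40]])        # shape (1, 4)
result = matrix + row                     # row is added to every row of matrix
print(result.shape)                       # (3, 4)

# A silent bug: a (3,) array plus a (3, 1) array does not add element by
# element. Broadcasting expands both to (3, 3) instead.
a = np.array([1, 2, 3])                   # shape (3,)
b = np.array([[1], [2], [3]])             # shape (3, 1)
print((a + b).shape)                      # (3, 3)
```

Printing `.shape` after each step is a cheap way to catch this kind of mistake early.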

3.4 Why PyTorch?

PyTorch has become the dominant framework in research and is increasingly common in industry. It's "Pythonic": if you know NumPy, PyTorch feels familiar. The key addition is automatic differentiation:

Python
import torch

# Create a tensor with gradient tracking
x = torch.tensor(3.0, requires_grad=True)

# Forward pass
y = x ** 2  # y = x²

# Backward pass (compute dy/dx)
y.backward()

print(x.grad)  # tensor(6.) — that's 2×3 = 6 ✓

PyTorch tracks every operation and automatically computes gradients. This is what makes training neural networks practical — you define the forward computation, and PyTorch handles the math of learning.

3.5 GPU Setup (Optional but Recommended)

Training on a CPU is fine for small learning exercises. For anything larger, you'll want a GPU. Options:

Python
# Check if GPU is available
import torch
print(torch.cuda.is_available())  # True if GPU is ready
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using: {device}")
✎ Exercises — Chapter 3
  1. Create a 4×4 matrix of random numbers using NumPy. Extract the second row, third column, and the 2×2 submatrix in the top-left corner.
  2. Use PyTorch to compute the gradient of f(x) = 3x³ + 2x² - 5x + 7 at x = 2.
  3. Create a NumPy array and convert it to a PyTorch tensor and back. What types do you get?
Chapter 4

Your First Neural Network

We'll build a neural network from scratch in NumPy — no frameworks, no magic. Every calculation spelled out.

4.1 The Neuron

A neuron is the basic unit of a neural network. It does three things:

  1. Weights each input — multiply each input by a learned weight
  2. Sums — add them all up, plus a bias term
  3. Activates — pass the sum through a nonlinear function
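[Diagram: inputs x₁, x₂, x₃ are multiplied by weights w₁, w₂, w₃, summed (Σ), and passed through the activation σ( ) to produce the output y]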
Figure 4.1 — A single neuron: weighted sum + activation.

Mathematically: y = σ(w₁x₁ + w₂x₂ + w₃x₃ + b) where σ is the activation function.
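
The formula maps directly onto a few lines of NumPy. In this sketch the input, weight, and bias values are made up for illustration (they are not learned), and sigmoid stands in for σ.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Illustrative values, not learned weights
x = np.array([1.0, 2.0, 3.0])      # inputs x1, x2, x3
w = np.array([0.5, -0.25, 0.1])    # weights w1, w2, w3
b = 0.2                            # bias

z = np.dot(w, x) + b               # weighted sum: 0.5 - 0.5 + 0.3 + 0.2 = 0.5
y = sigmoid(z)                     # activation
print(z, y)
```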

4.2 Activation Functions

Without activation functions, stacking layers would just produce linear transformations (a linear function of a linear function is still linear). Nonlinearity is what gives deep networks their power.

Python
def sigmoid(z):
    """Squashes values to range (0, 1)"""
    return 1 / (1 + np.exp(-z))

def relu(z):
    """Returns z if positive, else 0. The workhorse of modern DL."""
    return np.maximum(0, z)

def relu_derivative(z):
    return (z > 0).astype(float)

Function | Formula | Range | Used In
Sigmoid | 1/(1+e⁻ᶻ) | (0, 1) | Output layer (binary classification)
Tanh | (eᶻ-e⁻ᶻ)/(eᶻ+e⁻ᶻ) | (-1, 1) | Hidden layers (older architectures)
ReLU | max(0, z) | [0, ∞) | Hidden layers (modern default)
Softmax | eᶻⁱ/Σeᶻʲ | (0, 1), sums to 1 | Output layer (multi-class)
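
The claim at the start of this section, that stacked linear layers without an activation collapse into a single linear layer, can be checked directly. The sketch below uses small random matrices with arbitrary shapes; biases are left out to keep it short.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1 = rng.normal(size=(4, 3))   # first "layer", no activation
W2 = rng.normal(size=(2, 4))   # second "layer", no activation

two_layers = W2 @ (W1 @ x)
one_layer = (W2 @ W1) @ x      # a single equivalent weight matrix

print(np.allclose(two_layers, one_layer))  # True

# With a ReLU in between, the collapse no longer holds in general
with_relu = W2 @ np.maximum(0, W1 @ x)
```

Because `W2 @ W1` is itself just one matrix, the two-layer version can never represent anything a single layer couldn't. The nonlinearity is what breaks that equivalence.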

4.3 Building a Network from Scratch

Let's build a complete 2-layer neural network that learns the XOR function — the very problem that killed the Perceptron in 1969. Our network will solve it in seconds.

Python
import numpy as np

# ─── XOR Dataset ───
X = np.array([[0,0], [0,1], [1,0], [1,1]])
y = np.array([[0], [1], [1], [0]])

# ─── Initialize Weights ───
np.random.seed(42)
W1 = np.random.randn(2, 4)   # Input (2) → Hidden (4)
b1 = np.zeros((1, 4))
W2 = np.random.randn(4, 1)   # Hidden (4) → Output (1)
b2 = np.zeros((1, 1))

# ─── Activation ───
def sigmoid(z): return 1 / (1 + np.exp(-z))
def sigmoid_deriv(a): return a * (1 - a)

lr = 2.0  # Learning rate

# ─── Training Loop ───
for epoch in range(10000):

    # Forward pass
    z1 = X @ W1 + b1
    a1 = sigmoid(z1)
    z2 = a1 @ W2 + b2
    a2 = sigmoid(z2)

    # Loss (MSE)
    loss = np.mean((a2 - y) ** 2)

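    # Backward pass (the constant factor 2/N from the MSE derivative is
    # dropped here; it only rescales the effective learning rate)
    dz2 = (a2 - y) * sigmoid_deriv(a2)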
    dW2 = a1.T @ dz2
    db2 = np.sum(dz2, axis=0, keepdims=True)

    dz1 = (dz2 @ W2.T) * sigmoid_deriv(a1)
    dW1 = X.T @ dz1
    db1 = np.sum(dz1, axis=0, keepdims=True)

    # Update weights
    W2 -= lr * dW2;  b2 -= lr * db2
    W1 -= lr * dW1;  b1 -= lr * db1

    if epoch % 2000 == 0:
        print(f"Epoch {epoch:5d}  Loss: {loss:.6f}")

print(f"\nPredictions:\n{np.round(a2, 3)}")
Expected Output
Epoch     0  Loss: 0.258731
Epoch  2000  Loss: 0.004649
Epoch  4000  Loss: 0.001232
Epoch  6000  Loss: 0.000564
Epoch  8000  Loss: 0.000316

Predictions:
[[0.015]
 [0.983]
 [0.983]
 [0.017]]

The network has learned XOR. Input [0,0] → ~0, [0,1] → ~1, [1,0] → ~1, [1,1] → ~0.

4.4 What Just Happened?

Let's trace through the key steps:

  1. Forward pass: Data flows through the network. Each layer multiplies by weights, adds bias, applies activation. The output is a prediction.
  2. Loss calculation: Compare prediction to truth. Mean Squared Error: how far off are we, on average?
  3. Backward pass (backpropagation): Use the chain rule to compute how much each weight contributed to the error. Start from the output, work backward layer by layer.
  4. Update: Nudge each weight in the direction that reduces the loss. The learning rate controls how big the nudge is.

This cycle repeats thousands of times. Each iteration, the network gets slightly better. That's all training is — repetition of this four-step process.

4.5 Visualizing the Learning Process

Python
import matplotlib.pyplot as plt

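# Re-run the training from Section 4.3, storing the loss at every epoch.
# Assumes np, X, y, sigmoid, sigmoid_deriv, and lr are still defined.
# Weights are re-initialized so the curve starts from an untrained network.
np.random.seed(42)
W1, b1 = np.random.randn(2, 4), np.zeros((1, 4))
W2, b2 = np.random.randn(4, 1), np.zeros((1, 1))

losses = []
for epoch in range(10000):
    # Forward pass
    a1 = sigmoid(X @ W1 + b1)
    a2 = sigmoid(a1 @ W2 + b2)
    losses.append(np.mean((a2 - y) ** 2))

    # Backward pass and update (same math as Section 4.3)
    dz2 = (a2 - y) * sigmoid_deriv(a2)
    dz1 = (dz2 @ W2.T) * sigmoid_deriv(a1)
    W2 -= lr * (a1.T @ dz2);  b2 -= lr * dz2.sum(axis=0, keepdims=True)
    W1 -= lr * (X.T @ dz1);   b1 -= lr * dz1.sum(axis=0, keepdims=True)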

plt.figure(figsize=(8, 4))
plt.plot(losses, color='#c0392b', linewidth=1.5)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training Loss Over Time')
plt.yscale('log')
plt.grid(True, alpha=0.3)
plt.show()
✎ Exercises — Chapter 4
  1. Change the hidden layer from 4 neurons to 2. Can the network still learn XOR? What about 1 neuron? Explain why.
  2. Replace sigmoid with ReLU in the hidden layer (keep sigmoid in the output). Does training improve?
  3. Change the learning rate to 0.01 and then to 20. What happens to convergence?
  4. Add a third input to XOR: the output should be 1 if an odd number of inputs are 1. Extend the network to handle this.
  5. Plot the decision boundary of the trained XOR network. (Hint: create a grid of points, predict for each, and use plt.contourf.)
Part II
Core Deep Learning
Chapter 5

How Neural Networks Learn

Gradient descent, learning rates, overfitting, regularization — the mechanics of making networks actually work.

5.1 Gradient Descent Variants

In Chapter 4, we used batch gradient descent — computing the gradient over the entire dataset before updating. This works for XOR (4 samples), but what about a dataset with a million images?

Variant | Batch Size | Speed | Stability
Batch GD | Entire dataset | Slow per step | Very stable
Stochastic GD (SGD) | 1 sample | Fast per step | Noisy, can escape local minima
Mini-batch SGD | 32–256 samples | Fast | Good balance; the practical default
Python
# Mini-batch training loop
batch_size = 32
for epoch in range(num_epochs):
    # Shuffle data each epoch
    indices = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        end = start + batch_size
        X_batch = X[indices[start:end]]
        y_batch = y[indices[start:end]]
        # Forward → Loss → Backward → Update on this batch
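
The loop above is a skeleton. To see it run end to end, here is a self-contained sketch that fills in the "Forward → Loss → Backward → Update" step for the simplest possible model, a line y = wx + b. The synthetic dataset and all hyperparameter values are assumptions chosen for illustration.

```python
import numpy as np

# Synthetic data (an assumption for this example): y = 3x + 1 plus small noise
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(256, 1))
y = 3 * X + 1 + 0.01 * rng.normal(size=(256, 1))

w, b = 0.0, 0.0
lr, batch_size, num_epochs = 0.1, 32, 50

for epoch in range(num_epochs):
    indices = rng.permutation(len(X))          # shuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        X_batch, y_batch = X[batch], y[batch]

        error = (w * X_batch + b) - y_batch    # forward pass and residual
        grad_w = 2 * np.mean(error * X_batch)  # gradient of MSE w.r.t. w
        grad_b = 2 * np.mean(error)            # gradient of MSE w.r.t. b
        w -= lr * grad_w
        b -= lr * grad_b

print(w, b)  # should land close to 3 and 1
```

Each update only sees 32 samples, so individual steps are noisy, but the parameters still settle near the true values.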

5.2 Optimizers Beyond SGD

Vanilla SGD has a problem: it treats all parameters equally and uses a single learning rate. Modern optimizers adapt the learning rate per-parameter and use momentum to smooth out noisy gradients.

SGD with Momentum: Like a ball rolling downhill — it builds up speed in consistent directions and slows down when the gradient changes direction.

Adam (Adaptive Moment Estimation): The workhorse optimizer. Combines momentum with per-parameter learning rate adaptation. It's a sensible default for most projects.

Python
# In PyTorch, switching optimizers is one line:
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # ← Start here
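
The "ball rolling downhill" picture can be made concrete in plain Python. This is a sketch of one common formulation of momentum (velocity = momentum × velocity + gradient), applied to f(x) = x². The function name `sgd_momentum` and the step count are assumptions for illustration, not part of any library.

```python
def sgd_momentum(grad_fn, x, lr=0.1, momentum=0.9, steps=200):
    velocity = 0.0
    for _ in range(steps):
        # Velocity is a decaying running sum of past gradients
        velocity = momentum * velocity + grad_fn(x)
        x = x - lr * velocity
    return x

# Minimize f(x) = x**2, whose gradient is 2x, starting from x = 10
x_final = sgd_momentum(lambda x: 2 * x, x=10.0)
print(x_final)  # very close to 0
```

With momentum this high the iterate overshoots and oscillates around the minimum before settling, much like a ball rolling back and forth in a bowl.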

5.3 Overfitting: The Enemy

Overfitting happens when the model memorizes the training data instead of learning generalizable patterns. You can spot it when:

[Plot: loss (y-axis) against epochs (x-axis) for training loss and validation loss, with a marker where overfitting starts]
Figure 5.1 — Classic overfitting: the gap between training and validation loss widens.

5.4 Fighting Overfitting

1. Get more data. The single most effective solution, though not always possible.

2. Data augmentation. Artificially increase your dataset by transforming existing samples. For images: flip, rotate, crop, adjust colors.

Python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.2),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
])

3. Dropout. During training, randomly "turn off" a fraction of neurons. This prevents any single neuron from becoming too specialized and forces the network to learn redundant representations.

Python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # Drop 50% of neurons during training
    nn.Linear(256, 10),
)
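
Under the hood, dropout is just a random mask. The NumPy sketch below uses the "inverted dropout" convention, where surviving activations are scaled up by 1/(1 − p) during training so nothing needs to change at evaluation time. The function name and the test array are illustrative assumptions.

```python
import numpy as np

def dropout(activations, p=0.5, training=True, rng=None):
    # At evaluation time, dropout is a no-op
    if not training:
        return activations
    if rng is None:
        rng = np.random.default_rng()
    # Keep each unit with probability 1 - p
    mask = rng.random(activations.shape) >= p
    # Scale the survivors by 1 / (1 - p) so the expected value is unchanged
    return activations * mask / (1 - p)

a = np.ones((1000, 100))
dropped = dropout(a, p=0.5, rng=np.random.default_rng(0))
print(dropped.mean())       # close to 1.0 on average
print(np.unique(dropped))   # only 0.0 and 2.0 remain
```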

4. Weight decay (L2 regularization). Add a penalty to the loss based on the magnitude of the weights. This discourages the model from relying too heavily on any single feature.

Python
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)

5. Early stopping. Monitor validation loss during training. When it starts increasing while training loss keeps decreasing — stop.
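
Early stopping needs very little code. Here is a minimal sketch of a patience-based version; the class name `EarlyStopping`, the patience value, and the validation losses are all made up for illustration.

```python
class EarlyStopping:
    """Signals a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def should_stop(self, val_loss):
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience

# Made-up validation losses: improving at first, then rising
val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70, 0.80]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    if stopper.should_stop(val_loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best_loss)  # 5 0.6
```

In a real training loop you would typically also save the model weights whenever `best_loss` improves, so you can restore the best checkpoint after stopping.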

5.5 Learning Rate Scheduling

A large learning rate helps the model explore quickly at the start. A small learning rate helps it fine-tune near the end. A learning rate scheduler adjusts the rate during training:

Python
# Step decay: reduce by factor of 10 every 30 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

# Cosine annealing: smooth decay following a cosine curve
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(num_epochs):
    train_one_epoch(...)
    scheduler.step()

5.6 Batch Normalization

BatchNorm normalizes the activations of each layer to have mean 0 and variance 1, then scales and shifts them with learned parameters. Benefits:

Python
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),
    nn.ReLU(),
    nn.Linear(256, 10),
)
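
To see what the normalization step does numerically, here is a NumPy sketch of the training-time computation only. It leaves out the running averages that a real BatchNorm layer keeps for evaluation; the function name and input data are assumptions for illustration.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each feature (column) over the batch dimension
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Learned scale (gamma) and shift (beta)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))   # batch of 64, 4 features
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0))  # approximately 0 for every feature
print(out.std(axis=0))   # approximately 1 for every feature
```
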
✎ Exercises — Chapter 5
  1. Train a network on a dataset with and without dropout. Plot both training and validation curves. What do you observe?
  2. Compare Adam (lr=0.001) vs SGD with momentum (lr=0.01) on the same task. Which converges faster? Which achieves better final accuracy?
  3. Implement early stopping in your training loop. If validation loss doesn't improve for 10 epochs, stop training.
  4. Experiment with different dropout rates (0.1, 0.3, 0.5, 0.7). How does it affect training vs validation performance?
Chapter 6

Building with PyTorch

From manual NumPy to professional PyTorch. The same concepts, but cleaner, faster, and GPU-ready.

6.1 The PyTorch Way

PyTorch organizes deep learning into clear abstractions:

6.2 Classifying Handwritten Digits (MNIST)

The MNIST dataset is the "Hello World" of deep learning — 70,000 grayscale images of handwritten digits (0–9), each 28×28 pixels. Let's build a complete classifier:

Python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# ─── 1. Data ───
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))  # MNIST mean & std
])

train_data = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_data  = datasets.MNIST('./data', train=False, transform=transform)

train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
test_loader  = DataLoader(test_data, batch_size=1000)

# ─── 2. Model ───
class DigitClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.network = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 10)
        )

    def forward(self, x):
        return self.network(x)

model = DigitClassifier()

# ─── 3. Loss & Optimizer ───
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# ─── 4. Training Loop ───
for epoch in range(10):
    model.train()
    total_loss = 0
    for batch_x, batch_y in train_loader:
        optimizer.zero_grad()
        output = model(batch_x)
        loss = criterion(output, batch_y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    # ─── 5. Evaluate ───
    model.eval()
    correct = 0
    with torch.no_grad():
        for batch_x, batch_y in test_loader:
            preds = model(batch_x).argmax(dim=1)
            correct += (preds == batch_y).sum().item()

    acc = correct / len(test_data)
    print(f"Epoch {epoch+1}: Loss={total_loss:.1f}, Accuracy={acc*100:.2f}%")
Expected Result

After 10 epochs, you should see ~97-98% accuracy on the test set. Not bad for a simple fully-connected network!

6.3 Key PyTorch Patterns

model.train() vs model.eval(): This switches the behavior of Dropout and BatchNorm. Always call the right one.

torch.no_grad(): Disables gradient computation during evaluation. Saves memory and computation.

optimizer.zero_grad(): PyTorch accumulates gradients. You must zero them each step or they'll add up.

6.4 Saving and Loading Models

Python
# Save model weights
torch.save(model.state_dict(), 'mnist_model.pth')

# Load later
model = DigitClassifier()
model.load_state_dict(torch.load('mnist_model.pth'))
model.eval()
✎ Exercises — Chapter 6
  1. Add a third hidden layer. Does accuracy improve? By how much?
  2. Implement the training and evaluation as a reusable function that takes hyperparameters as arguments.
  3. Add a confusion matrix visualization: which digits does the model confuse most often?
  4. Train the same model with and without BatchNorm. Compare training curves.
Chapter 7

Convolutional Neural Networks

CNNs revolutionized computer vision by learning spatial hierarchies of features — edges → textures → parts → objects.

7.1 Why Convolutions?

A fully-connected layer for a 224×224 RGB image would need 224×224×3 = 150,528 input weights per neuron. This is wasteful — pixels near each other are related, pixels far apart usually aren't. Convolutional layers exploit this spatial locality:

7.2 How Convolution Works

A small filter (kernel) slides across the input, computing dot products at each position:

[Diagram: a 3×3 patch of the 5×5 input is multiplied by the 3×3 filter to produce one value of the 3×3 output]
Figure 7.1 — Convolution: a small filter slides across the input, computing dot products.
Python
import torch.nn as nn

# A single convolutional layer
conv = nn.Conv2d(
    in_channels=3,     # RGB input
    out_channels=16,   # 16 different filters
    kernel_size=3,     # 3×3 filter
    stride=1,          # Move 1 pixel at a time
    padding=1          # Keep same spatial size
)
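
To make the sliding-window computation concrete, here is a NumPy sketch for a single channel and a single filter, with stride 1 and no padding. The 5×5 input and 3×3 filter sizes mirror Figure 7.1; the filter values are an illustrative choice, and the function name is hypothetical.

```python
import numpy as np

def conv2d_single(image, kernel):
    # Stride 1, no padding. As in most deep learning libraries, the kernel
    # is not flipped, so this is technically cross-correlation.
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)   # dot product of patch and filter
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # 5x5 input
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)       # simple horizontal-gradient filter
result = conv2d_single(image, kernel)
print(result.shape)   # (3, 3)
print(result)         # every entry is -6.0 for this input
```

The input increases by 1 from left to right everywhere, so the filter gives the same response at every position. A real layer does this for many filters and channels at once, and learns the filter values.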

7.3 Pooling

Pooling layers reduce the spatial dimensions, making the network faster and more robust to small shifts. Max pooling takes the maximum value in each patch:

Python
pool = nn.MaxPool2d(2, stride=2)  # Halves the spatial dimensions
# Input: (16, 28, 28) → Output: (16, 14, 14)

7.4 Building a Complete CNN

Python
class CNNClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1: 1×28×28 → 32×14×14
            nn.Conv2d(1, 32, 3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),

            # Block 2: 32×14×14 → 64×7×7
            nn.Conv2d(32, 64, 3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),

            # Block 3: 64×7×7 → 128×3×3
            nn.Conv2d(64, 128, 3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 3 * 3, 256),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, 10),
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x)
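
The shape comments in the model (28 → 14 → 7 → 3) follow from the standard output-size formula, floor((size + 2·padding − kernel) / stride) + 1. This small sketch traces the spatial size through the three blocks; the helper name `conv_output_size` is an assumption for illustration.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    # floor((size + 2*padding - kernel) / stride) + 1
    return (size + 2 * padding - kernel) // stride + 1

size = 28
sizes = [size]
for _ in range(3):
    size = conv_output_size(size, kernel=3, padding=1)   # 3x3 conv, padding=1 keeps size
    size = conv_output_size(size, kernel=2, stride=2)    # 2x2 max pool halves it (floor)
    sizes.append(size)

print(sizes)              # [28, 14, 7, 3]
print(128 * size * size)  # 1152 inputs to the first Linear layer
```

The last pooling step rounds 7 / 2 down to 3, which is why the classifier expects 128 * 3 * 3 features.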

7.5 What CNNs See: Feature Visualization

The first convolutional layer typically learns to detect edges and simple textures. Deeper layers combine these into increasingly complex patterns:

Layer | Detects
Layer 1 | Edges, gradients, simple colors
Layer 2 | Textures, corners, small shapes
Layer 3 | Object parts (eyes, wheels, petals)
Layer 4+ | Entire objects, complex structures

7.6 Classic CNN Architectures

Architecture | Year | Key Innovation
LeNet-5 | 1998 | Pioneered CNNs for digit recognition
AlexNet | 2012 | Deep CNN + GPU training + ReLU + Dropout
VGGNet | 2014 | Showed depth matters (16-19 layers)
ResNet | 2015 | Skip connections enabling 152+ layers
EfficientNet | 2019 | Optimized scaling of depth/width/resolution
✎ Exercises — Chapter 7
  1. Train the CNN on MNIST. What accuracy do you get compared to the fully-connected network in Chapter 6?
  2. Modify the CNN for CIFAR-10 (32×32 color images, 10 classes). What changes are needed?
  3. Remove all pooling layers. How does this affect the output size and training time?
  4. Add a skip connection: make the output of block 1 also feed into block 3 (concatenated or added).
Chapter 8

Recurrent Neural Networks

Sequence data — text, audio, time series — requires networks that remember the past. RNNs process inputs one at a time while maintaining a hidden state.

8.1 The Problem with Fixed-Size Inputs

CNNs and fully-connected networks take a fixed-size input and produce a fixed-size output. But language is variable-length. "I love deep learning" and "I think I might love deep learning someday" need different-sized inputs. RNNs solve this by processing sequences step by step.

8.2 How RNNs Work

An RNN cell processes one element at a time, maintaining a hidden state that acts as a memory:

Figure 8.1: An unrolled RNN processing a sentence word by word (time steps t=1 to t=4, inputs "The", "cat", "sat", "down").

At each step: h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b)

The same weights are used at every time step. The hidden state h_t carries forward information from all previous steps.
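
The update rule above can be sketched in a few lines of PyTorch. The sizes, the random weights, and the variable names below are illustrative assumptions rather than values from this chapter; the point is only that one set of weights is reused at every step while the hidden state is carried forward.

```python
import torch

torch.manual_seed(0)
input_dim, hidden_dim, seq_len = 3, 4, 4   # illustrative sizes (assumptions)

# One set of weights, reused at every time step
W_xh = torch.randn(hidden_dim, input_dim) * 0.1
W_hh = torch.randn(hidden_dim, hidden_dim) * 0.1
b = torch.zeros(hidden_dim)

xs = torch.randn(seq_len, input_dim)   # stand-in for 4 word vectors
h = torch.zeros(hidden_dim)            # h_0: an empty memory

hidden_states = []
for x_t in xs:
    # h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b)
    h = torch.tanh(W_hh @ h + W_xh @ x_t + b)
    hidden_states.append(h)

print(len(hidden_states), h.shape)
```

In a real model you would likely reach for nn.RNN or nn.LSTM instead of writing the loop by hand, but the loop is what those layers compute under the hood.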

8.3 The Vanishing Gradient Problem

Plain RNNs struggle to learn long-range dependencies. Backpropagating through many time steps multiplies the gradient by the same recurrent weights (and the tanh derivative) at every step, so gradients either shrink exponentially toward zero (vanish) or grow exponentially (explode). In practice, a vanilla RNN has trouble using information from, say, 50 steps earlier.
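
You can watch the gradient vanish in a small experiment. The sketch below is a toy setup, not a trained model: it assumes a random 8-unit recurrent weight matrix with small entries, applies the tanh recurrence with no inputs, and measures how strongly the final hidden state still depends on the first one.

```python
import torch

torch.manual_seed(0)
hidden_dim = 8
# Small random recurrent weights (an assumption chosen to show vanishing)
W_hh = torch.randn(hidden_dim, hidden_dim, dtype=torch.float64) * 0.1

def grad_norm_wrt_first_state(num_steps):
    """Norm of d(sum of h_T) / d(h_0) after num_steps tanh recurrences."""
    h0 = torch.randn(hidden_dim, dtype=torch.float64, requires_grad=True)
    h = h0
    for _ in range(num_steps):
        h = torch.tanh(W_hh @ h)
    h.sum().backward()
    return h0.grad.norm().item()

norms = {steps: grad_norm_wrt_first_state(steps) for steps in (1, 10, 50)}
for steps, value in norms.items():
    print(f"{steps:>2} steps back: gradient norm {value:.2e}")
```

With larger recurrent weights the same loop can show the opposite failure, where the norm grows instead of shrinking.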

8.4 LSTMs and GRUs

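LSTM (Long Short-Term Memory) mitigates this with a gating mechanism and a separate cell state that acts as a conveyor belt for information. Three gates (forget, input, and output) control what is erased from, written to, and read from the cell state.

GRU (Gated Recurrent Unit) is a simplified version with only two gates (reset and update). It has fewer parameters, is usually somewhat faster to train, and often works about as well as LSTM.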

Python
# PyTorch LSTM
import torch.nn as nn

class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, num_layers=2)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        embedded = self.embedding(x)          # (batch, seq_len) → (batch, seq_len, embed_dim)
        output, (hidden, cell) = self.lstm(embedded)
        # hidden: (num_layers, batch, hidden_dim); hidden[-1] is the top layer's final state
        return self.fc(hidden[-1])

8.5 Project: Text Generation

Train a character-level LSTM on a text corpus. Given a sequence of characters, predict the next one:

Python
# Simplified character-level generation
import torch

text = "to be or not to be that is the question"
chars = sorted(set(text))
char_to_idx = {c: i for i, c in enumerate(chars)}
idx_to_char = {i: c for c, i in char_to_idx.items()}

# Create training pairs: "to b" → "o be", "o be" → " be "
seq_length = 4
# ... (create input-target pairs) ...

# Generate text: seed with a prompt, predict one char at a time.
# Assumes model(x) returns logits of shape (batch, seq_len, vocab_size).
def generate(model, seed, length=100):
    model.eval()
    generated = list(seed)  # separate name so the global `chars` vocabulary is not shadowed
    with torch.no_grad():   # no gradients needed at inference time
        for _ in range(length):
            x = torch.tensor([[char_to_idx[c] for c in generated[-seq_length:]]])
            logits = model(x)
            next_idx = logits[0, -1].argmax().item()  # greedy pick at the last position
            generated.append(idx_to_char[next_idx])
    return ''.join(generated)
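
The step that builds input-target pairs is left out of the listing above. One plausible way to fill it in, assuming each target is simply the input window shifted one character to the right, is sketched here. The names inputs, targets, X, and Y are made up for this example.

```python
import torch

text = "to be or not to be that is the question"
chars = sorted(set(text))
char_to_idx = {c: i for i, c in enumerate(chars)}
idx_to_char = {i: c for c, i in char_to_idx.items()}
seq_length = 4

inputs, targets = [], []
for i in range(len(text) - seq_length):
    window = text[i : i + seq_length]            # e.g. "to b"
    shifted = text[i + 1 : i + seq_length + 1]   # e.g. "o be"
    inputs.append([char_to_idx[c] for c in window])
    targets.append([char_to_idx[c] for c in shifted])

X = torch.tensor(inputs)    # (num_pairs, seq_length)
Y = torch.tensor(targets)   # (num_pairs, seq_length)
print(X.shape, Y.shape)
```

Each row of Y lines up with the same row of X, so the model is asked to predict the next character at every position in the window.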
✎ Exercises — Chapter 8
  1. Train the text generator on a Shakespeare dataset. How does the output quality change with training epochs?
  2. Compare LSTM vs GRU on the same task. Which converges faster?
  3. Add temperature-based sampling to the generation function: divide logits by a temperature before softmax. What happens at temperature 0.5 vs 1.5?
  4. Why are RNNs being replaced by Transformers? What are the computational disadvantages of processing sequences step by step?
Part III
Leveling Up
Chapter 9

Transformers & Attention

The most important architecture of the decade. Transformers power GPT, BERT, and virtually every state-of-the-art language model.

9.1 The Key Insight: Attention

Instead of processing a sequence step by step (like RNNs), the attention mechanism lets every element in a sequence look at every other element simultaneously. This is both more powerful and more parallelizable.

Think of it this way: when you read the word "it" in "The cat sat on the mat because it was tired," your brain attends to "cat" to understand what "it" refers to. Attention formalizes this process.

9.2 Self-Attention Step by Step

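For each token in the input, we compute three vectors: a Query (what this token is looking for), a Key (what this token offers to others), and a Value (the information it passes along).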

The attention score between two tokens is the dot product of one's Query with the other's Key. Higher score = more attention.

Attention(Q, K, V) = softmax(Q · Kᵀ / √d_k) · V
Python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class SelfAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads

        self.Q = nn.Linear(embed_dim, embed_dim)
        self.K = nn.Linear(embed_dim, embed_dim)
        self.V = nn.Linear(embed_dim, embed_dim)
        self.out = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        B, T, C = x.shape

        q = self.Q(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.K(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.V(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)

        scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        weights = F.softmax(scores, dim=-1)
        out = (weights @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.out(out)
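
To see the attention formula at work on actual numbers, here is a tiny worked example with three tokens and 2-dimensional vectors. The numbers are made up, and the learned projections are skipped so that Q, K, and V can be read directly.

```python
import math
import torch
import torch.nn.functional as F

# Three tokens, each with a 2-dimensional query, key, and value (made-up numbers)
Q = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
V = torch.tensor([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])

d_k = Q.shape[-1]
scores = Q @ K.T / math.sqrt(d_k)      # (3, 3): every query against every key
weights = F.softmax(scores, dim=-1)    # each row sums to 1
output = weights @ V                   # weighted average of the value vectors

print(weights)
print(output)
```

Row i of weights says how much token i attends to each token. The third token's query matches its own key best, so its largest weight sits on itself.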

9.3 The Transformer Architecture

A full Transformer block has:

  1. Multi-head self-attention (look at all positions)
  2. Add & Norm (residual connection + layer normalization)
  3. Feed-forward network (process each position independently)
  4. Add & Norm again
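
The four steps above can be sketched as a single module. This is a minimal version under a few assumptions: it uses PyTorch's built-in nn.MultiheadAttention rather than the SelfAttention class from 9.2, applies layer normalization after each residual connection, and picks ReLU and the sizes arbitrarily. The class name TransformerBlock is just for this example.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_dim, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads,
                                          dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_dim),
        )
        self.norm2 = nn.LayerNorm(embed_dim)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)            # 1. multi-head self-attention
        x = self.norm1(x + self.drop(attn_out))     # 2. add & norm
        ff_out = self.ff(x)                         # 3. feed-forward, per position
        return self.norm2(x + self.drop(ff_out))    # 4. add & norm again

block = TransformerBlock(embed_dim=32, num_heads=4, ff_dim=64)
x = torch.randn(2, 10, 32)    # (batch, seq_len, embed_dim)
y = block(x)
print(y.shape)
```

The output has the same shape as the input, which is what lets you stack many of these blocks on top of each other.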

9.4 Why Transformers Won

Feature | RNN | Transformer
Parallelization | Sequential (slow) | Parallel across positions during training (fast)
Long-range dependencies | Struggles beyond ~100 steps | Direct attention to any position
Training on GPUs | Limited by sequential nature | Well suited to GPUs
Scalability | Hard to scale past a few layers | Scales to billions of parameters

9.5 BERT, GPT, and the Foundation Model Era

BERT (2018): Encoder-only Transformer. Pre-trained to fill in masked words. Excels at understanding tasks (classification, Q&A).

GPT (2018–present): Decoder-only Transformer. Pre-trained to predict the next word. Excels at generation tasks (text completion, conversation).

T5 (2019): Full encoder-decoder Transformer. Frames everything as "text in → text out."

✎ Exercises — Chapter 9
  1. Implement the Transformer block and train it on a simple sequence prediction task. Compare to an LSTM on the same data.
  2. Visualize attention weights: which tokens attend to which? Feed a sentence through a Transformer and plot the attention matrix.
  3. Why does scaling attention by √d_k matter? What happens without it?
  4. Read the "Attention Is All You Need" paper (2017). Write a 1-paragraph summary of the key contributions.
Chapter 10

Transfer Learning

Why train from scratch when someone else already has? Use pre-trained models as a starting point and adapt them to your task.

10.1 The Big Idea

Training a large model from scratch requires millions of images and days of GPU time. Transfer learning says: take a model pre-trained on a massive dataset (like ImageNet's 1.2 million images), and adapt it to your task with your smaller dataset.

It works because early layers learn general features (edges, textures) that are useful for almost any vision task. Only the later layers are task-specific.

10.2 Feature Extraction vs Fine-Tuning

Strategy | What Changes | When to Use
Feature Extraction | Only the final classifier layer | Small dataset, similar domain
Fine-Tuning | All (or most) layers, at low learning rate | Larger dataset or different domain
Python
import torch
import torch.nn as nn
import torchvision.models as models

# Load pre-trained ResNet-18
model = models.resnet18(weights='IMAGENET1K_V1')

# Freeze all layers (feature extraction)
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer for your task
num_classes = 5  # e.g., 5 types of flowers
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only the new layer will be trained
optimizer = torch.optim.Adam(model.fc.parameters(), lr=0.001)

10.3 Fine-Tuning in Practice

Python
# Unfreeze all layers for fine-tuning
for param in model.parameters():
    param.requires_grad = True

# Use a LOWER learning rate (don't destroy learned features)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

# OR: unfreeze layers gradually (gradual unfreezing)
# Start training only the last layer, then unfreeze the last 2, etc.
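
The idea of unfreezing layers step by step, mentioned in the comments above, might look like the sketch below. The helper unfreeze_last_n is hypothetical (it is not part of PyTorch or torchvision), and a small stand-in network is used so the example runs without downloading weights.

```python
import torch.nn as nn

def unfreeze_last_n(model, n):
    """Freeze everything, then unfreeze the last n top-level child modules."""
    children = list(model.children())
    for p in model.parameters():
        p.requires_grad = False
    for child in children[len(children) - n:]:
        for p in child.parameters():
            p.requires_grad = True

# Small stand-in network (an assumption; swap in your pre-trained model)
toy = nn.Sequential(
    nn.Linear(8, 16), nn.ReLU(),
    nn.Linear(16, 16), nn.ReLU(),
    nn.Linear(16, 3),
)

unfreeze_last_n(toy, 1)   # stage 1: train only the final layer
trainable = [name for name, p in toy.named_parameters() if p.requires_grad]
print(trainable)
```

After each stage you would call unfreeze_last_n with a larger n and rebuild the optimizer so it picks up the newly trainable parameters.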

10.4 Transfer Learning for NLP

The same idea works for text. Models like BERT and GPT are pre-trained on massive text corpora. You can fine-tune them for specific tasks:

Python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load pre-trained BERT
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Fine-tune on sentiment analysis
# The model already "understands" English — you just teach it your specific task
Hugging Face

The Hugging Face Transformers library provides thousands of pre-trained models for NLP, vision, and audio. It's become the standard way to use pre-trained models. Install it with: pip install transformers

✎ Exercises — Chapter 10
  1. Use a pre-trained ResNet-18 to classify a custom 5-class image dataset (e.g., flowers, food). Compare feature extraction vs full fine-tuning.
  2. Fine-tune a pre-trained BERT model for sentiment analysis on movie reviews. What accuracy do you achieve?
  3. What happens if you fine-tune with too high a learning rate? Try lr=0.01 and observe the results.
Chapter 11

Generative Models

Models that create — generate images, text, music. Autoencoders, GANs, and the diffusion models behind DALL·E and Stable Diffusion.

11.1 Autoencoders

An autoencoder learns to compress data into a compact representation (encoding) and reconstruct it back. It's trained to minimize reconstruction error.

Python
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(784, 256), nn.ReLU(),
            nn.Linear(256, 64),  nn.ReLU(),
            nn.Linear(64, 16)    # Bottleneck: 784 → 16
        )
        self.decoder = nn.Sequential(
            nn.Linear(16, 64),   nn.ReLU(),
            nn.Linear(64, 256),  nn.ReLU(),
            nn.Linear(256, 784), nn.Sigmoid()
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)
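
Training an autoencoder means using the input as its own target. The sketch below shows that loop on a single batch. It assumes a smaller stand-in network and random data in place of MNIST, so the loss values mean little beyond showing that reconstruction error goes down.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Smaller stand-in for the Autoencoder class above, so this block runs on its own
model = nn.Sequential(
    nn.Linear(784, 64), nn.ReLU(),
    nn.Linear(64, 16),                  # bottleneck
    nn.Linear(16, 64), nn.ReLU(),
    nn.Linear(64, 784), nn.Sigmoid(),
)

batch = torch.rand(32, 784)   # stand-in for 32 flattened 28x28 images in [0, 1]
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

losses = []
for step in range(100):
    reconstruction = model(batch)
    loss = criterion(reconstruction, batch)   # the target is the input itself
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

There are no labels anywhere in this loop, which is why autoencoders count as unsupervised learning.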

11.2 Generative Adversarial Networks (GANs)

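A GAN pits two networks against each other: a generator, which produces fake samples from random noise, and a discriminator, which tries to tell real samples from fake ones.

They train together: the generator gets better at faking, the discriminator gets better at detecting. Ideally, the generator ends up producing data the discriminator can no longer distinguish from real data, although in practice GAN training can be unstable.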
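
One round of this back-and-forth can be sketched as follows. Everything here is a toy assumption: 2-D points stand in for images, the two networks are tiny, and the learning rates are arbitrary. It shows the shape of a single training step, not a recipe that will produce good samples.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
noise_dim, data_dim, batch_size = 8, 2, 64   # toy sizes (assumptions)

generator = nn.Sequential(nn.Linear(noise_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
discriminator = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1))

g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(batch_size, data_dim) + 3.0   # stand-in "real" data
ones = torch.ones(batch_size, 1)
zeros = torch.zeros(batch_size, 1)

# Discriminator step: push real samples toward 1 and generated samples toward 0
fake = generator(torch.randn(batch_size, noise_dim))
d_loss = bce(discriminator(real), ones) + bce(discriminator(fake.detach()), zeros)
d_opt.zero_grad()
d_loss.backward()
d_opt.step()

# Generator step: try to make the discriminator output 1 for the fakes
g_loss = bce(discriminator(fake), ones)
g_opt.zero_grad()
g_loss.backward()
g_opt.step()

print(f"d_loss {d_loss.item():.3f}, g_loss {g_loss.item():.3f}")
```

The detach() call in the discriminator step keeps that update from flowing back into the generator. A full training run repeats both steps over many batches.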

11.3 Diffusion Models

The technology behind Stable Diffusion, DALL·E 2, and Midjourney. The idea is beautifully simple:

  1. Forward process: Gradually add noise to an image until it's pure static
  2. Reverse process: Train a neural network to reverse each step of noising — to "denoise" one step at a time
  3. Generation: Start with pure noise, denoise step by step, and a coherent image emerges
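
The forward (noising) process is the easy half and fits in a few lines. The sketch below follows one common formulation (DDPM-style) and assumes 1,000 steps with a linear noise schedule. The schedule values and the helper name add_noise are assumptions for illustration, not something this chapter prescribes.

```python
import torch

torch.manual_seed(0)
num_steps = 1000
betas = torch.linspace(1e-4, 0.02, num_steps)     # noise added at each step (assumed schedule)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)    # fraction of the signal left after t steps

def add_noise(x0, t):
    """Jump straight to step t of the forward process."""
    noise = torch.randn_like(x0)
    x_t = alpha_bars[t].sqrt() * x0 + (1 - alpha_bars[t]).sqrt() * noise
    # In DDPM-style training, the network learns to predict `noise` from x_t and t
    return x_t, noise

x0 = torch.rand(1, 28, 28) * 2 - 1   # stand-in image scaled to [-1, 1]
x_early, _ = add_noise(x0, 10)       # still mostly image
x_late, _ = add_noise(x0, 999)       # almost pure static

print(alpha_bars[10].item(), alpha_bars[999].item())
```

Early steps keep nearly all of the original signal and the last steps keep almost none, which is why generation can start from pure noise.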
✎ Exercises — Chapter 11
  1. Build and train an autoencoder on MNIST. Visualize the reconstructions — how do they look with a bottleneck of 2 vs 16 vs 64?
  2. If the bottleneck is 2D, you can plot the latent space. Color-code by digit class. What structure do you see?
  3. Implement a simple GAN for generating MNIST digits. Plot generated samples at epochs 1, 10, 50, and 100.
Chapter 12

Practical Skills

Theory gets you started. These skills get you to production.

12.1 Data Preprocessing

Real-world data is messy. Common tasks:

12.2 Debugging Models

When your model doesn't work (and it often won't), follow this checklist:

  1. Overfit a single batch. If your model can't learn 10 examples perfectly, there's a bug.
  2. Check the loss. Is it decreasing at all? If not, the learning rate might be too high or too low.
  3. Visualize predictions. Plot what the model outputs vs. the truth.
  4. Check gradients. Are they flowing? Are they exploding (NaN) or vanishing (~0)?
  5. Simplify. Remove complexity until something works, then add it back incrementally.
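
The first and fourth checks are quick to automate. Here is a rough sketch that tries to overfit one tiny batch and then prints gradient norms. The model, data, and hyperparameters are stand-ins, so swap in your own.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))   # stand-in model
x = torch.randn(10, 20)            # one tiny batch of 10 examples
y = torch.randint(0, 3, (10,))     # random labels: a pure memorization test

optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()

for step in range(300):
    loss = criterion(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Check 4: gradient norm per parameter (NaN/inf = exploding, all ~0 early on = vanishing)
grad_norms = {name: p.grad.norm().item() for name, p in model.named_parameters()}
accuracy = (model(x).argmax(dim=1) == y).float().mean().item()

print(f"final loss {loss.item():.4f}, batch accuracy {accuracy:.2f}")
for name, norm in grad_norms.items():
    print(f"{name}: {norm:.2e}")
```

If a healthy model and training loop cannot drive the loss on 10 examples close to zero, suspect the data pipeline, the loss, or the optimizer wiring before touching the architecture.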

12.3 Experiment Tracking

Python
# Use Weights & Biases for experiment tracking
import wandb
wandb.init(project="my-dl-project")
wandb.config.update({"lr": 0.001, "epochs": 50, "batch_size": 64})

for epoch in range(50):
    train_loss = train_one_epoch(...)
    val_acc = evaluate(...)
    wandb.log({"loss": train_loss, "accuracy": val_acc})

12.4 Common Pitfalls

Problem | Symptom | Fix
Data leakage | Unrealistically high accuracy | Ensure test data is never seen during training
Unbalanced classes | High accuracy but low recall | Use weighted loss or oversample minority class
Wrong loss function | Loss doesn't decrease | Match loss to task: CrossEntropy for classification, MSE for regression
Learning rate too high | Loss oscillates or explodes | Start at 1e-3, decrease if unstable
Learning rate too low | Loss barely decreases | Increase by 10x
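
As one example of the fixes in the table, a weighted loss for unbalanced classes can be set up like this. The 90/10 split and the inverse-frequency weighting are assumptions. Other weighting schemes exist, and which one works best depends on your data.

```python
import torch
import torch.nn as nn

# Imbalanced labels: 90 samples of class 0, 10 of class 1 (made-up counts)
labels = torch.cat([torch.zeros(90, dtype=torch.long),
                    torch.ones(10, dtype=torch.long)])

counts = torch.bincount(labels).float()                  # tensor([90., 10.])
class_weights = counts.sum() / (len(counts) * counts)    # tensor([0.5556, 5.0000])
criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.zeros(100, 2)        # a model that is undecided on every sample
loss = criterion(logits, labels)
print(class_weights, loss.item())
```

With these weights, each mistake on the rare class costs nine times as much as a mistake on the common class, so the model can no longer score well by always predicting the majority.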
✎ Exercises — Chapter 12
  1. Take a dataset and intentionally introduce data leakage. Show that accuracy is artificially high. Then fix it.
  2. Create an imbalanced dataset (90% class A, 10% class B). Train a model and observe the problem. Apply class weighting and compare.
  3. Set up experiment tracking with either W&B or TensorBoard. Compare 3 different learning rates on the same task.
Part IV
The Real World
Chapter 13

Deploying Models

A model on your laptop is a research project. A model that serves real users is a product.

13.1 Saving & Exporting

Python
# PyTorch: save entire model
torch.save(model, 'model_full.pth')

# Or just the weights (recommended)
torch.save(model.state_dict(), 'model_weights.pth')

# Export to ONNX (framework-agnostic)
dummy = torch.randn(1, 1, 28, 28)
torch.onnx.export(model, dummy, "model.onnx")
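
Saving only the weights means you have to rebuild the architecture before loading them back. A minimal round trip, assuming a stand-in model and a temporary file path, might look like this:

```python
import os
import tempfile
import torch
import torch.nn as nn

def build_model():
    # Stand-in architecture; use the same class you trained with
    return nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))

model = build_model()

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_weights.pth")
    torch.save(model.state_dict(), path)

    restored = build_model()                      # 1. rebuild the architecture
    restored.load_state_dict(torch.load(path))    # 2. load the saved weights into it
    restored.eval()                               # 3. inference mode (affects dropout, batch norm)

dummy = torch.randn(1, 1, 28, 28)
with torch.no_grad():
    same = torch.allclose(model(dummy), restored(dummy))
print(same)
```

Comparing the outputs of the original and restored models on the same input is a cheap sanity check before you ship a checkpoint.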

13.2 Building a Simple API

Python
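from flask import Flask, request, jsonify
import torch

app = Flask(__name__)

# Load the full model saved in 13.1. The model's class must be importable here,
# and recent PyTorch versions need weights_only=False to unpickle a whole model.
# Only do this with files you trust.
model = torch.load('model_full.pth', weights_only=False)
model.eval()

def preprocess(pixels):
    # Assumption: the client sends a 28x28 nested list of pixel values in [0, 1]
    return torch.tensor(pixels, dtype=torch.float32).view(1, 1, 28, 28)

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    tensor = preprocess(data['input'])
    with torch.no_grad():
        prediction = model(tensor)
    return jsonify({'prediction': prediction.tolist()})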

13.3 Deployment Options

Platform | Best For
Flask / FastAPI | Simple APIs, prototyping
Docker + Cloud (AWS/GCP/Azure) | Production scale
ONNX Runtime | Cross-platform inference
TorchServe | PyTorch-native serving
Edge (TFLite, CoreML) | Mobile and embedded devices

13.4 Optimization for Inference

✎ Exercises — Chapter 13
  1. Wrap your trained MNIST model in a Flask API. Accept an image and return the predicted digit.
  2. Export a PyTorch model to ONNX and run it with ONNX Runtime.
  3. Apply dynamic quantization to your model and compare inference speed and model size.
Chapter 14

Ethics & Responsible AI

With great power comes real consequences. Building AI responsibly isn't optional.

14.1 Bias in, Bias Out

Models learn from data. If the data reflects historical biases — racial, gender, socioeconomic — the model will reproduce and amplify those biases. A hiring model trained on historical data may learn that "most successful candidates were male" and penalize female applicants.

14.2 Fairness Metrics

14.3 Interpretability

When a model denies someone a loan or flags a medical image, people deserve an explanation. Techniques include:

14.4 Environmental Cost

Training large models consumes significant energy. GPT-3's training reportedly used ~1,300 MWh of electricity. Consider:

14.5 Guidelines for Responsible Development

  1. Audit your data for bias before training
  2. Test model performance across demographic groups
  3. Be transparent about model limitations
  4. Implement human-in-the-loop for high-stakes decisions
  5. Document your model (what it was trained on, its intended use, known failure modes)
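
Guideline 2 is straightforward to start on: compute your metric separately for each group and look at the gap. The sketch below uses made-up predictions and a made-up binary group attribute. The helper accuracy_by_group is hypothetical, and accuracy is only one of several metrics worth comparing.

```python
import torch

# Made-up predictions, labels, and a binary group attribute for 8 people
preds  = torch.tensor([1, 0, 1, 1, 0, 0, 1, 0])
labels = torch.tensor([1, 0, 1, 0, 1, 0, 1, 1])
group  = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])

def accuracy_by_group(preds, labels, group):
    """Accuracy computed separately for each value of the group attribute."""
    results = {}
    for g in group.unique().tolist():
        mask = group == g
        results[g] = (preds[mask] == labels[mask]).float().mean().item()
    return results

per_group = accuracy_by_group(preds, labels, group)
gap = max(per_group.values()) - min(per_group.values())
print(per_group, gap)   # {0: 0.75, 1: 0.5} 0.25
```

An overall accuracy of 62.5% would hide the fact that this toy model is noticeably worse for group 1.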
Chapter 15

Where the Field Is Heading

A snapshot of the frontier as of 2025 — and where things might go next.

15.1 Foundation Models

Large models pre-trained on broad data that can be adapted to many tasks. GPT-4, Claude, Gemini, and similar models represent a paradigm shift: instead of building task-specific models, you build one large model and adapt it.

15.2 Multimodal Models

Models that understand and generate across modalities — text, images, audio, video — simultaneously. GPT-4V can reason about images. Gemini processes text, code, images, and audio natively.

15.3 Key Trends

15.4 What to Learn Next

  1. Read papers: Start with well-written survey papers and landmark papers (Attention Is All You Need, ResNet, etc.)
  2. Build projects: The best way to learn is to build something and get it wrong
  3. Join communities: Papers With Code, Hugging Face forums, r/MachineLearning
  4. Specialize: Go deep in one area (computer vision, NLP, reinforcement learning, etc.)
Chapter 16

Capstone Project

Put it all together. This project walks you through building a complete deep learning application end to end.

Project: Image Classification Web App

Goal: Build a web app that classifies images into custom categories. The user uploads an image, the model predicts the class, and the result is displayed with a confidence score.

Step 1: Define the Problem

Choose a dataset. Suggestions:

Step 2: Data Pipeline

Python
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.2, 0.2),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
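
The imports above bring in random_split and DataLoader, which are what you'd use to carve out a validation set and batch the data. A rough sketch, assuming a small random TensorDataset in place of real images and an 80/20 split:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Stand-in dataset: 100 small random "images" with labels from 5 classes
full_dataset = TensorDataset(torch.randn(100, 3, 32, 32),
                             torch.randint(0, 5, (100,)))

split_rng = torch.Generator().manual_seed(42)    # makes the split reproducible
train_set, val_set = random_split(full_dataset, [80, 20], generator=split_rng)

train_loader = DataLoader(train_set, batch_size=16, shuffle=True)
val_loader = DataLoader(val_set, batch_size=16, shuffle=False)

images, labels = next(iter(train_loader))
print(len(train_set), len(val_set), images.shape)
```

One thing to watch for with real image folders: both halves of a random_split share the underlying dataset's transform, so you may need two dataset objects (one per transform) that are indexed by the same split.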

Step 3: Model

Python
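import torch
import torch.nn as nn
import torchvision.models as models

num_classes = 5  # assumption: set this to the number of categories in your dataset

model = models.resnet50(weights='IMAGENET1K_V2')
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Fine-tune with differential learning rates.
# Only layer4 and the new head are passed to the optimizer, so earlier layers keep their pre-trained weights.
optimizer = torch.optim.AdamW([
    {'params': model.layer4.parameters(), 'lr': 1e-4},
    {'params': model.fc.parameters(), 'lr': 1e-3},
], weight_decay=1e-2)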

Step 4: Training with Best Practices

Step 5: Deploy

Create a simple web interface with Flask or Streamlit. Accept image uploads, run the model, and display results with confidence bars.

You Made It

If you've followed along from Chapter 1, you now understand the foundations of deep learning — from the math to the models to the deployment. The field is vast and evolving fast, but you have the fundamentals to learn anything that comes next. Keep building.

✎ Final Exercises — Chapter 16
  1. Complete the capstone project end-to-end with a dataset of your choice.
  2. Write a model card documenting your model: training data, performance metrics, known limitations, intended use cases.
  3. Deploy your model as a web app and share it with someone who isn't in tech. Get their feedback.
  4. Reflect: what was the hardest part? What would you do differently next time?
Appendix

Math Refresher & Glossary

Quick reference for the math and terminology used throughout this book.

A.1 Linear Algebra Cheat Sheet

Concept | Notation | Meaning
Scalar | a | A single number
Vector | v = [v₁, v₂, v₃] | Ordered list of numbers (1D array)
Matrix | W | 2D grid of numbers
Dot product | a · b | Element-wise multiply, then sum
Matrix multiply | AB | Rows of A dotted with columns of B
Transpose | Aᵀ | Rows become columns
Norm | ‖v‖ | "Length" of a vector
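
Each row of the cheat sheet maps onto a one-line PyTorch call. The numbers below are arbitrary and small enough to check by hand.

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])
A = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
B = torch.tensor([[0.0, 1.0], [1.0, 0.0]])

dot = torch.dot(a, b)      # 1*4 + 2*5 + 3*6 = 32
product = A @ B            # [[2, 1], [4, 3]]
transposed = A.T           # [[1, 3], [2, 4]]
norm = torch.linalg.norm(torch.tensor([3.0, 4.0]))   # sqrt(9 + 16) = 5

print(dot, product, transposed, norm, sep="\n")
```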

A.2 Calculus Cheat Sheet

Function | Derivative
f(x) = c | f'(x) = 0
f(x) = xⁿ | f'(x) = nxⁿ⁻¹
f(x) = eˣ | f'(x) = eˣ
f(x) = ln(x) | f'(x) = 1/x
f(x) = σ(x) | f'(x) = σ(x)(1 - σ(x))
f(x) = ReLU(x) | f'(x) = 1 if x > 0, else 0
Chain rule: f(g(x)) | f'(g(x)) × g'(x)
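
You don't have to take these rules on faith: PyTorch's autograd can check them numerically. The sketch below compares the sigmoid rule against autograd at x = 0.5 and applies the chain rule to (3x + 1)² at x = 2. Both functions and both input values are arbitrary picks for illustration.

```python
import torch

# Sigmoid: autograd result vs. the closed form σ(x)(1 - σ(x))
x = torch.tensor(0.5, requires_grad=True)
y = torch.sigmoid(x)
y.backward()
sigmoid_grad = x.grad.item()
closed_form = (y * (1 - y)).item()

# Chain rule on f(g(x)) = (3x + 1)^2: f'(g(x)) * g'(x) = 2(3x + 1) * 3
z = torch.tensor(2.0, requires_grad=True)
((3 * z + 1) ** 2).backward()
chain_grad = z.grad.item()     # 2 * 7 * 3 = 42

print(sigmoid_grad, closed_form, chain_grad)
```

Backpropagation is this same chain rule applied layer by layer through the whole network.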

A.3 Glossary

Term | Definition
Activation function | Nonlinear function applied to neuron output (ReLU, sigmoid, etc.)
Backpropagation | Algorithm for computing gradients via the chain rule
Batch size | Number of samples processed before updating weights
Bias | Learnable offset added after the weighted sum in a neuron
CNN | Convolutional Neural Network: specialized for grid-like data (images)
Cross-entropy | Loss function for classification tasks
Data augmentation | Artificially expanding training data through transformations
Dropout | Regularization: randomly zeroing activations during training
Epoch | One complete pass through the entire training dataset
Fine-tuning | Continuing training of a pre-trained model on a new task
Gradient | Vector of partial derivatives; points in the direction of steepest increase
Gradient descent | Optimization: iteratively move parameters opposite to the gradient
GPU | Graphics Processing Unit: massively parallel hardware for DL
Learning rate | Step size for parameter updates during optimization
Loss function | Measure of how wrong the model's predictions are
LSTM | Long Short-Term Memory: RNN variant that handles long-range dependencies
Overfitting | Model memorizes training data instead of learning general patterns
Parameter | Learnable value in the model (weights and biases)
Pooling | Downsampling operation in CNNs (max or average)
Pre-training | Initial training on a large dataset before fine-tuning
Regularization | Techniques to prevent overfitting (dropout, weight decay, etc.)
RNN | Recurrent Neural Network: processes sequences step by step
Softmax | Function that converts logits to probabilities
Tensor | Multi-dimensional array (generalization of vectors and matrices)
Transfer learning | Using a pre-trained model as a starting point for a new task
Transformer | Architecture based on self-attention (dominant since 2017)
Underfitting | Model is too simple to capture the patterns in data
Weight | Learnable parameter that scales the input to a neuron