What Is Deep Learning?
Before we write a single line of code, let's understand what we're getting into — and why it matters.
1.1 A Brief History of Teaching Machines to Think
The dream of artificial intelligence is old. Ancient Greek myths told of Talos, a giant bronze automaton that guarded Crete. Medieval engineers built mechanical birds. But the real story of AI begins in the 1940s, when mathematicians and logicians first asked: can a machine think?
1.2 AI, Machine Learning, and Deep Learning
These three terms are often used interchangeably, but they're nested concepts:
Artificial Intelligence (AI) is the broadest concept: any system that exhibits intelligent behavior. This includes rule-based systems (if the temperature > 100°C, send an alert), search algorithms (GPS navigation), and much more. Many classical AI systems involve no learning at all.
Machine Learning (ML) is a subset of AI where systems learn patterns from data rather than following hand-coded rules. Instead of programming explicit rules, you provide examples and the system figures out the rules itself.
Deep Learning (DL) is a subset of ML that uses neural networks with many layers — hence "deep." These layered networks can automatically discover the features that matter in data, eliminating much of the manual feature engineering that traditional ML requires.
In traditional ML, a human expert must carefully choose what features to look for. In deep learning, the network discovers these features on its own. This is why deep learning works so well for messy, unstructured data like images, audio, and text — the features are too complex for humans to define by hand.
1.3 What Can Deep Learning Do?
The honest answer: a lot. But it's not magic, and it's not good at everything. Here's a realistic landscape:
| Task | Example | Architecture |
|---|---|---|
| Image Classification | Is this X-ray showing pneumonia? | CNN, Vision Transformer |
| Object Detection | Find all pedestrians in this street scene | YOLO, Faster R-CNN |
| Text Generation | Write a coherent paragraph about climate change | Transformer (GPT family) |
| Machine Translation | Translate English to French | Seq2Seq, Transformer |
| Speech Recognition | Convert audio to text | Whisper, CTC Networks |
| Image Generation | Create a photorealistic face that doesn't exist | GAN, Diffusion Model |
| Game Playing | Beat the world champion at Go | Reinforcement Learning + DL |
1.4 What Deep Learning Is NOT Good At
Being honest about limitations is just as important as celebrating capabilities:
- Small datasets — Deep learning is data-hungry. With fewer than a few hundred examples, traditional ML or even simple heuristics often win.
- Explainability — A deep neural network's decisions are hard to interpret. If you need to explain why a decision was made (medical diagnosis, loan approval), this matters a lot.
- Common sense reasoning — Models can generate fluent text but don't "understand" the way humans do. They can be confidently wrong.
- Out-of-distribution generalization — A model trained on one kind of data can fail spectacularly when the data shifts.
- Simple, structured problems — If your data fits neatly in a spreadsheet and has clear rules, gradient-boosted trees (XGBoost) are often better and faster.
1.5 The Building Blocks (A 30,000-Foot View)
Every deep learning system has the same fundamental ingredients:
- Data — Examples the model learns from. Images, text, audio, numbers — organized into inputs and (usually) labels.
- A Model (Architecture) — The structure of the neural network. How many layers, what type, how they connect.
- A Loss Function — A mathematical measure of how wrong the model's predictions are.
- An Optimizer — An algorithm that adjusts the model's internal parameters to reduce the loss.
- Training Loop — The cycle: feed data → make predictions → measure error → adjust parameters → repeat.
Every chapter in this book builds on these five ingredients. You'll understand each one deeply by the end.
1.6 What You'll Build in This Book
This isn't a theory textbook. By the end of these chapters, you will have:
- Built a neural network from scratch in pure Python
- Trained an image classifier that recognizes handwritten digits
- Created a text generator that writes Shakespeare-like prose
- Used a pre-trained model to classify your own images
- Built and deployed a complete deep learning application
- Understood the Transformer architecture powering ChatGPT and its relatives
Let's start building.
- In your own words, explain the difference between AI, machine learning, and deep learning to a friend who has never taken a CS course.
- Find three real-world applications of deep learning that you interact with daily (hint: your phone uses several).
- For each application in the table in Section 1.3, think of one benefit and one risk.
- Why might a company choose traditional ML over deep learning for a spam filter? When might deep learning be the better choice?
The Math You Actually Need
Don't panic. You don't need a math degree — just a few core ideas, explained with pictures and code.
If you're comfortable with basic algebra and have seen graphs (x-y plots), you're fine. We'll build up everything else. If you already know linear algebra and calculus, skip to Chapter 3.
2.1 Vectors: Lists of Numbers
A vector is simply an ordered list of numbers. That's it. In deep learning, vectors represent data — a single image, a word, a row in a spreadsheet.
```python
# A vector in Python (as a list)
student = [170, 65, 22]  # height (cm), weight (kg), age

# The same as a NumPy array
import numpy as np
student = np.array([170, 65, 22])
print(student.shape)  # (3,): a 1D array with 3 elements
```
Think of a vector as an arrow pointing from the origin to a point in space. A 2D vector [3, 4] points to the point (3, 4). A 3D vector [1, 2, 3] points to (1, 2, 3) in 3D space. In deep learning, our vectors often have hundreds or thousands of dimensions — we can't draw them, but the math works the same way.
Vector Operations
Addition: add element by element.
Scalar multiplication: multiply every element by a number.
Dot product: multiply corresponding elements, then sum. This is the most important operation in neural networks.
```python
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

print(a + b)         # [5 7 9]
print(3 * a)         # [3 6 9]
print(np.dot(a, b))  # 32
print(a @ b)         # 32 (same as dot product)
```
2.2 Matrices: Tables of Numbers
A matrix is a 2D grid of numbers. In deep learning, weight matrices are the core of every layer — they transform input data into useful representations.
```python
# A 2×3 matrix (2 rows, 3 columns)
W = np.array([
    [1, 2, 3],
    [4, 5, 6]
])
print(W.shape)  # (2, 3)
```
Matrix-Vector Multiplication
This is the fundamental operation in neural networks. Each row of the matrix is dotted with the vector to produce the output:
```python
W = np.array([[1, 2], [3, 4], [5, 6]])  # 3×2 matrix
x = np.array([10, 20])                  # 2-element vector
y = W @ x                               # Matrix-vector multiplication
print(y)  # [ 50 110 170]

# Verify: row 1 is [1,2] · [10,20] = 10+40 = 50 ✓
#         row 2 is [3,4] · [10,20] = 30+80 = 110 ✓
```
2.3 Derivatives: Measuring Change
A derivative measures how much a function's output changes when its input changes a tiny bit. If you know the derivative of a function at a point, you know the slope — which direction to move to increase or decrease the output.
Imagine you're blindfolded on a hill. The derivative is like poking the ground in different directions with a stick to figure out which way is downhill. That's exactly what gradient descent does — it uses derivatives to find the direction that reduces the loss.
For f(x) = x², the derivative is f'(x) = 2x. At x = 3, the slope is 6 — the function is increasing. At x = -2, the slope is -4 — it's decreasing.
```python
# Numerical derivative (finite differences)
def numerical_derivative(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: x ** 2
print(numerical_derivative(f, 3.0))   # ≈ 6.0 ✓
print(numerical_derivative(f, -2.0))  # ≈ -4.0 ✓
```
The Chain Rule
The chain rule is how we compute derivatives of composed functions. If y = f(g(x)), then:
dy/dx = f'(g(x)) · g'(x)
This is the mathematical foundation of backpropagation — the algorithm that makes training neural networks possible. We'll see it in action in Chapter 4.
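As a quick sanity check of the chain rule, the sketch below compares the analytic derivative of a composed function with a finite-difference estimate. The particular functions g(x) = 3x + 1 and f(u) = u², and all helper names, are illustrative choices rather than anything used later in the book.

```python
# Chain rule check for y = f(g(x)) with g(x) = 3x + 1 and f(u) = u**2.
# Both functions are arbitrary illustrative choices.

def g(x):
    return 3 * x + 1

def f(u):
    return u ** 2

def g_prime(x):
    return 3.0

def f_prime(u):
    return 2 * u

def chain_rule_derivative(x):
    # dy/dx = f'(g(x)) * g'(x)
    return f_prime(g(x)) * g_prime(x)

def numerical_derivative(func, x, h=1e-5):
    # Central finite difference, as in Section 2.3
    return (func(x + h) - func(x - h)) / (2 * h)

x0 = 2.0
analytic = chain_rule_derivative(x0)  # f'(7) * 3 = 14 * 3 = 42
numeric = numerical_derivative(lambda x: f(g(x)), x0)
print(analytic, numeric)
```

The two values should agree to several decimal places; a mismatch in a check like this usually points to a mistake in the hand-derived derivative.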
2.4 Gradients: Derivatives in Many Dimensions
When your function has multiple inputs (as neural network loss functions always do), the gradient is a vector of all the partial derivatives. It points in the direction of steepest ascent.
If we want to minimize the function (reduce the loss), we move in the opposite direction of the gradient. This is the essence of gradient descent.
```python
# Simple gradient descent example
x = 10.0  # Start far from minimum
learning_rate = 0.1

for step in range(20):
    gradient = 2 * x                  # Derivative of x²
    x = x - learning_rate * gradient  # Move opposite to gradient
    print(f"Step {step+1}: x = {x:.4f}, f(x) = {x**2:.4f}")

# x converges to 0, the minimum of x²
```
You don't need to memorize derivative rules. Modern frameworks like PyTorch compute gradients automatically (this is called automatic differentiation or autograd). What you need is to understand what gradients represent and why they matter.
2.5 Probability Basics
Deep learning is fundamentally about uncertainty. A model that classifies an image as "cat" with 92% confidence is more useful than one that just says "cat." Understanding a few probability concepts helps you interpret what models are really doing.
Softmax is a function that turns a list of numbers (called logits) into a probability distribution — all values are between 0 and 1, and they sum to 1:
```python
def softmax(z):
    exp_z = np.exp(z - np.max(z))  # subtract max for stability
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)        # ≈ [0.659 0.242 0.099]
print(probs.sum())  # 1.0: a valid probability distribution
```
Cross-entropy loss measures how poorly the model's predicted probabilities match the true labels: the less probability the model puts on the correct class, the higher the loss. Lower is better. This is the most common loss function for classification tasks.
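For a single example with one correct class, cross-entropy reduces to the negative log of the probability assigned to that class. Here is a minimal NumPy sketch of that case; the function name `cross_entropy` and the `eps` constant are illustrative assumptions, not a library API.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))  # subtract max for numerical stability
    return exp_z / exp_z.sum()

def cross_entropy(probs, true_class, eps=1e-12):
    # Negative log of the probability assigned to the correct class.
    # eps guards against log(0); its value is an arbitrary small constant.
    return -np.log(probs[true_class] + eps)

probs = softmax(np.array([2.0, 1.0, 0.1]))
loss_right = cross_entropy(probs, true_class=0)  # model favors class 0
loss_wrong = cross_entropy(probs, true_class=2)  # model gave class 2 little probability
print(loss_right, loss_wrong)
```

The loss is small when the true class is the one the model favored and grows quickly when the true class received little probability.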
- Compute the dot product of [2, 3, 4] and [5, 6, 7] by hand, then verify with NumPy.
- Multiply the matrix [[1, 0], [0, 1], [1, 1]] by vector [3, 7]. What do you notice?
- The function f(x) = x³ - 4x has derivative f'(x) = 3x² - 4. At what values of x is the derivative zero? (These are the local minimum and maximum.)
- Write a Python function that performs gradient descent on f(x) = (x-3)² + 2. Start at x = 0 and find the minimum.
- Apply softmax to [5, 5, 5] and [100, 0, 0]. What do you notice about the outputs?
Python & Tools Setup
Let's get your development environment ready. You'll need Python, NumPy, and PyTorch — nothing more to start.
3.1 Installing Python
We recommend Python 3.10 or later. The easiest way to get everything set up is through Anaconda or Miniconda:
```bash
# Option A: Miniconda (lightweight, recommended)
# Download from https://docs.conda.io/en/latest/miniconda.html

# Option B: If you already have Python, just use pip
pip install numpy matplotlib jupyter torch torchvision
```
3.2 Your Toolkit
| Tool | Purpose | Why |
|---|---|---|
| NumPy | Matrix math | Fast, universal, the foundation everything is built on |
| Matplotlib | Plotting | Visualize data, loss curves, model outputs |
| Jupyter | Interactive notebooks | Write and run code in cells — great for learning |
| PyTorch | Deep learning framework | Intuitive, Pythonic, dominant in research |
3.3 NumPy Crash Course
If you know Python but not NumPy, here's everything you need in 60 seconds:
```python
import numpy as np

# Create arrays
a = np.array([1, 2, 3])      # 1D
b = np.zeros((3, 4))         # 3×4 matrix of zeros
c = np.random.randn(2, 3)    # 2×3 random normal
d = np.ones((2, 3))          # 2×3 matrix of ones

# Shape and indexing
print(c.shape)   # (2, 3)
print(c[0, :])   # First row (all columns)
print(c[:, 1])   # Second column (all rows)

# Operations
print(c + d)             # Element-wise addition
print(c * 2)             # Scalar multiplication
print(c @ a)             # Matrix-vector multiply (if shapes match)
print(c.reshape(3, 2))   # Reshape
print(np.exp(c))         # Element-wise exponential
```
NumPy uses broadcasting: operations between arrays of different shapes can work if dimensions are compatible. For example, adding a (3,4) matrix and a (1,4) vector works — the vector is "broadcast" across rows. This is powerful but can cause silent bugs if shapes don't match as you expect.
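The small sketch below shows both sides of broadcasting: the intended case from the paragraph above, and a shape combination that runs without error but may not do what you meant. The array values are made up for illustration.

```python
import numpy as np

matrix = np.arange(12).reshape(3, 4)   # shape (3, 4)
row = np.array([[10, 20, 30, 40]])     # shape (1, 4)

# The (1, 4) row is stretched across all 3 rows of the matrix.
summed = matrix + row
print(summed.shape)  # (3, 4)

# A silent surprise: a (3,) vector plus a (3, 1) column gives a (3, 3) grid,
# not an element-wise sum of three numbers.
v = np.array([1, 2, 3])           # shape (3,)
col = np.array([[1], [2], [3]])   # shape (3, 1)
grid = v + col
print(grid.shape)  # (3, 3)
```

Printing `.shape` after an operation, as done here, is a cheap habit that catches most broadcasting mistakes early.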
3.4 Why PyTorch?
PyTorch has become the dominant framework in research and is increasingly common in industry. It's "Pythonic": if you know NumPy, PyTorch feels familiar. The key addition is automatic differentiation:
```python
import torch

# Create a tensor with gradient tracking
x = torch.tensor(3.0, requires_grad=True)

# Forward pass
y = x ** 2  # y = x²

# Backward pass (compute dy/dx)
y.backward()
print(x.grad)  # tensor(6.); that's 2×3 = 6 ✓
```
PyTorch tracks every operation and automatically computes gradients. This is what makes training neural networks practical — you define the forward computation, and PyTorch handles the math of learning.
3.5 GPU Setup (Optional but Recommended)
A CPU is enough for the small examples you use while learning. For anything larger, you will want a GPU. Options:
- Google Colab (free) — Browser-based Jupyter with GPU access. Best option to start.
- Kaggle Notebooks (free) — Also provides GPU, plus datasets.
- Local GPU — NVIDIA GPU with CUDA installed.
```python
# Check if GPU is available
import torch

print(torch.cuda.is_available())  # True if GPU is ready
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using: {device}")
```
- Create a 4×4 matrix of random numbers using NumPy. Extract the second row, third column, and the 2×2 submatrix in the top-left corner.
- Use PyTorch to compute the gradient of f(x) = 3x³ + 2x² - 5x + 7 at x = 2.
- Create a NumPy array and convert it to a PyTorch tensor and back. What types do you get?
Your First Neural Network
We'll build a neural network from scratch in NumPy — no frameworks, no magic. Every calculation spelled out.
4.1 The Neuron
A neuron is the basic unit of a neural network. It does three things:
- Weights each input — multiply each input by a learned weight
- Sums — add them all up, plus a bias term
- Activates — pass the sum through a nonlinear function
Mathematically: y = σ(w₁x₁ + w₂x₂ + w₃x₃ + b) where σ is the activation function.
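The formula above translates almost word for word into NumPy. This is a minimal sketch of one neuron with a sigmoid activation; the input and weight values are made up so the arithmetic is easy to follow by hand.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def neuron(x, w, b):
    # Weighted sum of inputs plus bias, then a nonlinearity
    z = np.dot(w, x) + b
    return sigmoid(z)

x = np.array([1.0, 2.0, 3.0])    # three inputs (made-up values)
w = np.array([0.5, -1.0, 0.5])   # three weights (made-up values)
b = 0.0

y = neuron(x, w, b)   # z = 0.5 - 2.0 + 1.5 = 0.0, so y = sigmoid(0) = 0.5
print(y)
```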
4.2 Activation Functions
Without activation functions, stacking layers would just produce linear transformations (a linear function of a linear function is still linear). Nonlinearity is what gives deep networks their power.
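The claim that stacked linear layers collapse into one can be checked numerically. The sketch below uses random matrices as stand-ins for two bias-free layers; the sizes and the random seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))   # first "layer": 3 inputs -> 4 units
W2 = rng.standard_normal((2, 4))   # second "layer": 4 units -> 2 outputs
x = rng.standard_normal(3)

two_layers = W2 @ (W1 @ x)     # stack two linear layers, no activation
one_layer = (W2 @ W1) @ x      # a single layer with the merged matrix

# True: matrix multiplication is associative
print(np.allclose(two_layers, one_layer))

# With a ReLU in between, the stack is generally no longer one matrix product
with_relu = W2 @ np.maximum(0, W1 @ x)
```

However many linear layers you stack, the product of their matrices is a single matrix, so depth adds nothing until a nonlinearity sits between the layers.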
```python
import numpy as np

def sigmoid(z):
    """Squashes values to range (0, 1)"""
    return 1 / (1 + np.exp(-z))

def relu(z):
    """Returns z if positive, else 0. The workhorse of modern DL."""
    return np.maximum(0, z)

def relu_derivative(z):
    """Slope of ReLU: 1 where z > 0, else 0."""
    return (z > 0).astype(float)
```
| Function | Formula | Range | Used In |
|---|---|---|---|
| Sigmoid | 1/(1+e⁻ᶻ) | (0, 1) | Output layer (binary classification) |
| Tanh | (eᶻ-e⁻ᶻ)/(eᶻ+e⁻ᶻ) | (-1, 1) | Hidden layers (older architectures) |
| ReLU | max(0, z) | [0, ∞) | Hidden layers (modern default) |
| Softmax | eᶻⁱ/Σeᶻʲ | (0, 1), sums to 1 | Output layer (multi-class) |
4.3 Building a Network from Scratch
Let's build a complete 2-layer neural network that learns the XOR function — the very problem that killed the Perceptron in 1969. Our network will solve it in seconds.
```python
import numpy as np

# ─── XOR Dataset ───
X = np.array([[0,0], [0,1], [1,0], [1,1]])
y = np.array([[0], [1], [1], [0]])

# ─── Initialize Weights ───
np.random.seed(42)
W1 = np.random.randn(2, 4)   # Input (2) → Hidden (4)
b1 = np.zeros((1, 4))
W2 = np.random.randn(4, 1)   # Hidden (4) → Output (1)
b2 = np.zeros((1, 1))

# ─── Activation ───
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_deriv(a):
    return a * (1 - a)

lr = 2.0  # Learning rate

# ─── Training Loop ───
for epoch in range(10000):
    # Forward pass
    z1 = X @ W1 + b1
    a1 = sigmoid(z1)
    z2 = a1 @ W2 + b2
    a2 = sigmoid(z2)

    # Loss (MSE)
    loss = np.mean((a2 - y) ** 2)

    # Backward pass
    dz2 = (a2 - y) * sigmoid_deriv(a2)
    dW2 = a1.T @ dz2
    db2 = np.sum(dz2, axis=0, keepdims=True)
    dz1 = (dz2 @ W2.T) * sigmoid_deriv(a1)
    dW1 = X.T @ dz1
    db1 = np.sum(dz1, axis=0, keepdims=True)

    # Update weights
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

    if epoch % 2000 == 0:
        print(f"Epoch {epoch:5d} Loss: {loss:.6f}")

print(f"\nPredictions:\n{np.round(a2, 3)}")
```
```text
Epoch     0 Loss: 0.258731
Epoch  2000 Loss: 0.004649
Epoch  4000 Loss: 0.001232
Epoch  6000 Loss: 0.000564
Epoch  8000 Loss: 0.000316

Predictions:
[[0.015]
 [0.983]
 [0.983]
 [0.017]]
```
The network has learned XOR. Input [0,0] → ~0, [0,1] → ~1, [1,0] → ~1, [1,1] → ~0.
4.4 What Just Happened?
Let's trace through the key steps:
- Forward pass: Data flows through the network. Each layer multiplies by weights, adds bias, applies activation. The output is a prediction.
- Loss calculation: Compare prediction to truth. Mean Squared Error: how far off are we, on average?
- Backward pass (backpropagation): Use the chain rule to compute how much each weight contributed to the error. Start from the output, work backward layer by layer.
- Update: Nudge each weight in the direction that reduces the loss. The learning rate controls how big the nudge is.
This cycle repeats thousands of times. Each iteration, the network gets slightly better. That's all training is — repetition of this four-step process.
4.5 Visualizing the Learning Process
```python
import matplotlib.pyplot as plt

# (Re-run the training, storing losses)
losses = []
for epoch in range(10000):
    # ... forward/backward/update (same as above) ...
    losses.append(loss)

plt.figure(figsize=(8, 4))
plt.plot(losses, color='#c0392b', linewidth=1.5)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training Loss Over Time')
plt.yscale('log')
plt.grid(True, alpha=0.3)
plt.show()
```
- Change the hidden layer from 4 neurons to 2. Can the network still learn XOR? What about 1 neuron? Explain why.
- Replace sigmoid with ReLU in the hidden layer (keep sigmoid in the output). Does training improve?
- Change the learning rate to 0.01 and then to 20. What happens to convergence?
- Add a third input to XOR: the output should be 1 if an odd number of inputs are 1. Extend the network to handle this.
- Plot the decision boundary of the trained XOR network. (Hint: create a grid of points, predict for each, and use plt.contourf.)
How Neural Networks Learn
Gradient descent, learning rates, overfitting, regularization — the mechanics of making networks actually work.
5.1 Gradient Descent Variants
In Chapter 4, we used batch gradient descent — computing the gradient over the entire dataset before updating. This works for XOR (4 samples), but what about a dataset with a million images?
| Variant | Batch Size | Speed | Stability |
|---|---|---|---|
| Batch GD | Entire dataset | Slow per step | Very stable |
| Stochastic GD (SGD) | 1 sample | Fast per step | Noisy, can escape local minima |
| Mini-batch SGD | 32–256 samples | Fast | Good balance — the practical default |
```python
# Mini-batch training loop
batch_size = 32

for epoch in range(num_epochs):
    # Shuffle data each epoch
    indices = np.random.permutation(len(X))

    for start in range(0, len(X), batch_size):
        end = start + batch_size
        X_batch = X[indices[start:end]]
        y_batch = y[indices[start:end]]
        # Forward → Loss → Backward → Update on this batch
```
5.2 Optimizers Beyond SGD
Vanilla SGD has a problem: it treats all parameters equally and uses a single learning rate. Modern optimizers adapt the learning rate per-parameter and use momentum to smooth out noisy gradients.
SGD with Momentum: Like a ball rolling downhill — it builds up speed in consistent directions and slows down when the gradient changes direction.
Adam (Adaptive Moment Estimation): The workhorse optimizer. Combines momentum with per-parameter learning rate adaptation. It's the safe default for almost every project.
```python
# In PyTorch, switching optimizers is one line:
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # ← Start here
```
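To make the "ball rolling downhill" picture concrete, here is a minimal sketch of one common formulation of momentum, applied to f(x) = x² in plain Python. The function name `sgd_momentum` and the hyperparameter values are illustrative assumptions, not tuned recommendations.

```python
# Gradient descent with momentum on f(x) = x**2 (gradient 2x).
# Hyperparameter values are illustrative, not tuned recommendations.

def sgd_momentum(x, lr=0.1, momentum=0.9, steps=200):
    velocity = 0.0
    for _ in range(steps):
        grad = 2 * x
        # Velocity is a decaying running sum of past gradients
        velocity = momentum * velocity + grad
        x = x - lr * velocity
    return x

x_final = sgd_momentum(10.0)
print(x_final)
```

Because the velocity carries over between steps, the iterate overshoots the minimum and oscillates before settling near 0. That is the price momentum pays for moving faster along directions where the gradient is consistent.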
5.3 Overfitting: The Enemy
Overfitting happens when the model memorizes the training data instead of learning generalizable patterns. You can spot it when:
- Training loss keeps decreasing, but validation loss starts increasing
- The model performs great on training data but poorly on new data
5.4 Fighting Overfitting
1. Get more data. The single most effective solution, though not always possible.
2. Data augmentation. Artificially increase your dataset by transforming existing samples. For images: flip, rotate, crop, adjust colors.
```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.2),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
])
```
3. Dropout. During training, randomly "turn off" a fraction of neurons. This prevents any single neuron from becoming too specialized and forces the network to learn redundant representations.
```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # Drop 50% of neurons during training
    nn.Linear(256, 10),
)
```
4. Weight decay (L2 regularization). Add a penalty to the loss based on the magnitude of the weights. This discourages the model from relying too heavily on any single feature.
```python
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)
```
5. Early stopping. Monitor validation loss during training. When it starts increasing while training loss keeps decreasing — stop.
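Early stopping is usually implemented with a "patience" counter, so that a single noisy epoch does not end training prematurely. The sketch below is one possible shape for it; the `EarlyStopper` class is a hypothetical helper, not part of PyTorch, and the validation losses are made-up numbers.

```python
class EarlyStopper:
    """Hypothetical helper: signals a stop after `patience` epochs without improvement."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best_loss:
            self.best_loss = val_loss   # new best: reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation losses: improving, then rising as overfitting sets in
val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70, 0.80]
stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    if stopper.should_stop(val_loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best_loss)
```

In a real training loop you would also save the model weights whenever a new best loss is recorded, so that you can restore the best checkpoint after stopping.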
5.5 Learning Rate Scheduling
A large learning rate helps the model explore quickly at the start. A small learning rate helps it fine-tune near the end. A learning rate scheduler adjusts the rate during training:
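```python
# Pick one scheduler. Assigning both to the same variable would leave
# only the second one active.

# Option A, step decay: reduce by factor of 10 every 30 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

# Option B, cosine annealing: smooth decay following a cosine curve
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(num_epochs):
    train_one_epoch(...)  # placeholder for your own training function
    scheduler.step()      # advance the schedule once per epoch
```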
5.6 Batch Normalization
BatchNorm normalizes each feature's activations across the mini-batch to have mean 0 and variance 1, then scales and shifts them with learned parameters. Benefits:
- Faster training (can use higher learning rates)
- Acts as mild regularization
- Makes the network less sensitive to weight initialization
```python
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),
    nn.ReLU(),
    nn.Linear(256, 10),
)
```
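To see what the layer computes, here is a minimal NumPy sketch of the training-time forward pass only. It leaves out the running statistics that a real implementation keeps for inference; the function name `batchnorm_forward` and the input values are illustrative assumptions.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x has shape (batch, features); statistics are taken over the batch axis
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # mean 0, variance ~1 per feature
    return gamma * x_hat + beta               # learned scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))   # made-up activations
out = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0), out.std(axis=0))
```

With gamma set to 1 and beta to 0, each output column has mean close to 0 and standard deviation close to 1, whatever the scale of the input. During training the network is free to learn other values of gamma and beta.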
- Train a network on a dataset with and without dropout. Plot both training and validation curves. What do you observe?
- Compare Adam (lr=0.001) vs SGD with momentum (lr=0.01) on the same task. Which converges faster? Which achieves better final accuracy?
- Implement early stopping in your training loop. If validation loss doesn't improve for 10 epochs, stop training.
- Experiment with different dropout rates (0.1, 0.3, 0.5, 0.7). How does it affect training vs validation performance?
Building with PyTorch
From manual NumPy to professional PyTorch. The same concepts, but cleaner, faster, and GPU-ready.
6.1 The PyTorch Way
PyTorch organizes deep learning into clear abstractions:
- torch.Tensor: like NumPy arrays, but with GPU support and automatic differentiation
- nn.Module: base class for all models. You define layers and the forward pass.
- torch.optim: optimizers (SGD, Adam, etc.)
- DataLoader: handles batching, shuffling, and parallel data loading
6.2 Classifying Handwritten Digits (MNIST)
The MNIST dataset is the "Hello World" of deep learning — 70,000 grayscale images of handwritten digits (0–9), each 28×28 pixels. Let's build a complete classifier:
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# ─── 1. Data ───
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))  # MNIST mean & std
])
train_data = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_data = datasets.MNIST('./data', train=False, transform=transform)
train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
test_loader = DataLoader(test_data, batch_size=1000)

# ─── 2. Model ───
class DigitClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.network = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 10)
        )

    def forward(self, x):
        return self.network(x)

model = DigitClassifier()

# ─── 3. Loss & Optimizer ───
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# ─── 4. Training Loop ───
for epoch in range(10):
    model.train()
    total_loss = 0
    for batch_x, batch_y in train_loader:
        optimizer.zero_grad()
        output = model(batch_x)
        loss = criterion(output, batch_y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    # ─── 5. Evaluate ───
    model.eval()
    correct = 0
    with torch.no_grad():
        for batch_x, batch_y in test_loader:
            preds = model(batch_x).argmax(dim=1)
            correct += (preds == batch_y).sum().item()

    acc = correct / len(test_data)
    print(f"Epoch {epoch+1}: Loss={total_loss:.1f}, Accuracy={acc*100:.2f}%")
```
After 10 epochs, you should see ~97-98% accuracy on the test set. Not bad for a simple fully-connected network!
6.3 Key PyTorch Patterns
model.train() vs model.eval(): This switches the behavior of Dropout and BatchNorm. Always call the right one.
torch.no_grad(): Disables gradient computation during evaluation. Saves memory and computation.
optimizer.zero_grad(): PyTorch accumulates gradients. You must zero them each step or they'll add up.
6.4 Saving and Loading Models
```python
# Save model weights
torch.save(model.state_dict(), 'mnist_model.pth')

# Load later
model = DigitClassifier()
model.load_state_dict(torch.load('mnist_model.pth'))
model.eval()
```
- Add a third hidden layer. Does accuracy improve? By how much?
- Implement the training and evaluation as a reusable function that takes hyperparameters as arguments.
- Add a confusion matrix visualization: which digits does the model confuse most often?
- Train the same model with and without BatchNorm. Compare training curves.
Convolutional Neural Networks
CNNs revolutionized computer vision by learning spatial hierarchies of features — edges → textures → parts → objects.
7.1 Why Convolutions?
A fully-connected layer for a 224×224 RGB image would need 224×224×3 = 150,528 input weights per neuron. This is wasteful — pixels near each other are related, pixels far apart usually aren't. Convolutional layers exploit this spatial locality:
- Local connectivity: Each neuron only looks at a small patch (e.g., 3×3 pixels)
- Weight sharing: The same filter slides across the entire image, detecting the same pattern everywhere
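- Translation invariance: A cat detector works whether the cat is in the top-left or bottom-right. (Strictly, convolution shifts its output along with the input; pooling is what makes the final prediction approximately shift-invariant.)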
7.2 How Convolution Works
A small filter (kernel) slides across the input, computing dot products at each position:
```python
import torch.nn as nn

# A single convolutional layer
conv = nn.Conv2d(
    in_channels=3,    # RGB input
    out_channels=16,  # 16 different filters
    kernel_size=3,    # 3×3 filter
    stride=1,         # Move 1 pixel at a time
    padding=1         # Keep same spatial size
)
```
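To see the sliding dot product itself, here is a deliberately naive NumPy sketch for a single-channel image and a single filter, with no padding and stride 1. The function name `conv2d_single` and the hand-picked edge filter are illustrative assumptions; real layers learn their filter values and run far more efficiently.

```python
import numpy as np

def conv2d_single(image, kernel):
    # "Valid" sliding-window dot product (no padding, stride 1).
    # Like deep learning libraries, this does not flip the kernel.
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# A 5×5 image with a vertical edge: left columns 0, right columns 1
image = np.zeros((5, 5))
image[:, 3:] = 1.0

# Simple vertical-edge filter (illustrative, hand-picked values)
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

response = conv2d_single(image, kernel)
print(response)
```

The response is 0 where the patch is uniform and 3 where the filter straddles the edge. This is the same filter detecting the same pattern wherever it occurs.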
7.3 Pooling
Pooling layers reduce the spatial dimensions, making the network faster and more robust to small shifts. Max pooling takes the maximum value in each patch:
```python
pool = nn.MaxPool2d(2, stride=2)  # Halves the spatial dimensions
# Input: (16, 28, 28) → Output: (16, 14, 14)
```
7.4 Building a Complete CNN
```python
class CNNClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1: 1×28×28 → 32×14×14
            nn.Conv2d(1, 32, 3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),

            # Block 2: 32×14×14 → 64×7×7
            nn.Conv2d(32, 64, 3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),

            # Block 3: 64×7×7 → 128×3×3
            nn.Conv2d(64, 128, 3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 3 * 3, 256),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, 10),
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x)
```
7.5 What CNNs See: Feature Visualization
The first convolutional layer typically learns to detect edges and simple textures. Deeper layers combine these into increasingly complex patterns:
| Layer | Detects |
|---|---|
| Layer 1 | Edges, gradients, simple colors |
| Layer 2 | Textures, corners, small shapes |
| Layer 3 | Object parts (eyes, wheels, petals) |
| Layer 4+ | Entire objects, complex structures |
7.6 Classic CNN Architectures
| Architecture | Year | Key Innovation |
|---|---|---|
| LeNet-5 | 1998 | Pioneered CNNs for digit recognition |
| AlexNet | 2012 | Deep CNN + GPU training + ReLU + Dropout |
| VGGNet | 2014 | Showed depth matters (16-19 layers) |
| ResNet | 2015 | Skip connections enabling 152+ layers |
| EfficientNet | 2019 | Optimized scaling of depth/width/resolution |
- Train the CNN on MNIST. What accuracy do you get compared to the fully-connected network in Chapter 6?
- Modify the CNN for CIFAR-10 (32×32 color images, 10 classes). What changes are needed?
- Remove all pooling layers. How does this affect the output size and training time?
- Add a skip connection: make the output of block 1 also feed into block 3 (concatenated or added).
Recurrent Neural Networks
Sequence data — text, audio, time series — requires networks that remember the past. RNNs process inputs one at a time while maintaining a hidden state.
8.1 The Problem with Fixed-Size Inputs
CNNs and fully-connected networks take a fixed-size input and produce a fixed-size output. But language is variable-length. "I love deep learning" and "I think I might love deep learning someday" need different-sized inputs. RNNs solve this by processing sequences step by step.
8.2 How RNNs Work
An RNN cell processes one element at a time, maintaining a hidden state that acts as a memory:
At each step: h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b)
The same weights are used at every time step. The hidden state h_t carries forward information from all previous steps.
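The recurrence above can be unrolled in a few lines of NumPy. This is a forward-pass sketch only (no training), and the dimensions and random weights are arbitrary. It shows that the same three parameters handle sequences of different lengths.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b):
    # xs: sequence of input vectors; returns the hidden state after each step
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in xs:
        h = np.tanh(W_hh @ h + W_xh @ x_t + b)   # same weights at every step
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 5                       # arbitrary small sizes
W_xh = rng.standard_normal((hidden_dim, input_dim)) * 0.1
W_hh = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1
b = np.zeros(hidden_dim)

short_seq = rng.standard_normal((4, input_dim))    # 4 time steps
long_seq = rng.standard_normal((9, input_dim))     # 9 time steps

short_states = rnn_forward(short_seq, W_xh, W_hh, b)
long_states = rnn_forward(long_seq, W_xh, W_hh, b)
print(short_states.shape)  # (4, 5)
print(long_states.shape)   # (9, 5)
```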
8.3 The Vanishing Gradient Problem
Plain RNNs struggle to learn long-range dependencies. When backpropagating through many time steps, gradients either shrink exponentially to zero (vanish) or grow exponentially (explode). This means a vanilla RNN can't effectively remember information from 50 steps ago.
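A scalar toy model makes the exponential effect visible. Backpropagating through T steps multiplies the gradient by roughly the same factor T times, so the sketch below stands in for that repeated multiplication with a single number. The factors 0.9 and 1.1 are arbitrary illustrative values.

```python
# Backpropagating through T steps multiplies the gradient by roughly the same
# factor T times. A scalar stand-in for that factor shows the effect.

def gradient_after(steps, factor):
    grad = 1.0
    for _ in range(steps):
        grad *= factor
    return grad

vanishing = gradient_after(50, 0.9)   # about 0.005
exploding = gradient_after(50, 1.1)   # about 117
print(vanishing, exploding)
```

A factor only slightly below or above 1 is enough to shrink the signal to almost nothing, or blow it up, within 50 steps.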
8.4 LSTMs and GRUs
LSTM (Long Short-Term Memory) solves this with a gating mechanism and a separate cell state that acts as a conveyor belt for information:
- Forget gate: What information to discard from the cell state
- Input gate: What new information to store
- Output gate: What to output from the cell state
GRU (Gated Recurrent Unit) is a simplified version with only two gates (reset and update). It's faster and often works just as well as LSTM.
```python
# PyTorch LSTM
class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, num_layers=2)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        embedded = self.embedding(x)  # (batch, seq_len) → (batch, seq_len, embed_dim)
        output, (hidden, cell) = self.lstm(embedded)
        return self.fc(hidden[-1])    # Use last hidden state
```
8.5 Project: Text Generation
Train a character-level LSTM on a text corpus. Given a sequence of characters, predict the next one:
```python
import torch

# Simplified character-level generation
text = "to be or not to be that is the question"
chars = sorted(set(text))
char_to_idx = {c: i for i, c in enumerate(chars)}
idx_to_char = {i: c for c, i in char_to_idx.items()}

# Create training pairs: "to b" → "o be", "o be" → " be "
seq_length = 4
# ... (create input-target pairs) ...

# Generate text: seed with a prompt, predict one char at a time.
# Assumes model(x) returns next-character logits of shape (1, vocab_size).
def generate(model, seed, length=100):
    model.eval()
    generated = list(seed)  # separate name, so the vocabulary `chars` is not shadowed
    with torch.no_grad():   # no gradients needed while generating
        for _ in range(length):
            x = torch.tensor([[char_to_idx[c] for c in generated[-seq_length:]]])
            pred = model(x)
            next_char = idx_to_char[pred.argmax(-1).item()]
            generated.append(next_char)
    return ''.join(generated)
```
- Train the text generator on a Shakespeare dataset. How does the output quality change with training epochs?
- Compare LSTM vs GRU on the same task. Which converges faster?
- Add temperature-based sampling to the generation function: divide logits by a temperature before softmax. What happens at temperature 0.5 vs 1.5?
- Why are RNNs being replaced by Transformers? What are the computational disadvantages of processing sequences step by step?
Transformers & Attention
The most important architecture of the decade. Transformers power GPT, BERT, and virtually every state-of-the-art language model.
9.1 The Key Insight: Attention
Instead of processing a sequence step by step (like RNNs), the attention mechanism lets every element in a sequence look at every other element simultaneously. This is both more powerful and more parallelizable.
Think of it this way: when you read the word "it" in "The cat sat on the mat because it was tired," your brain attends to "cat" to understand what "it" refers to. Attention formalizes this process.
9.2 Self-Attention Step by Step
For each token in the input, we compute three vectors:
- Query (Q): "What am I looking for?"
- Key (K): "What do I contain?"
- Value (V): "What information do I provide?"
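The attention score between two tokens is the dot product of one's Query with the other's Key. The scores are scaled, then passed through a softmax so that each token's attention weights sum to 1. A higher score means more attention.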
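Written as a single formula, in the form used by the "Attention Is All You Need" paper, where d_k is the dimension of the key vectors:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

Each row of the softmax output holds one token's attention weights over all tokens, and multiplying by V turns those weights into a weighted average of the Value vectors.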
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class SelfAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads  # embed_dim must be divisible by num_heads
        self.Q = nn.Linear(embed_dim, embed_dim)
        self.K = nn.Linear(embed_dim, embed_dim)
        self.V = nn.Linear(embed_dim, embed_dim)
        self.out = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        B, T, C = x.shape  # batch, sequence length, embedding dim
        # Project, then split into heads: (B, T, C) → (B, num_heads, T, head_dim)
        q = self.Q(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.K(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.V(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        # Scaled dot-product scores: (B, num_heads, T, T)
        scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        weights = F.softmax(scores, dim=-1)
        # Weighted sum of values, then merge the heads back: (B, T, C)
        out = (weights @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.out(out)
```
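PyTorch also ships a built-in multi-head attention layer, `nn.MultiheadAttention`, which is handy as a cross-check on shapes. The sketch below runs it on random data; the sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, embed_dim, num_heads = 2, 5, 16, 4   # illustrative sizes (assumptions)

mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
mha.eval()

x = torch.randn(batch, seq_len, embed_dim)
with torch.no_grad():
    # Self-attention: the same tensor serves as query, key, and value
    out, weights = mha(x, x, x)

print(out.shape)      # one output vector per token
print(weights.shape)  # attention weights, averaged over heads by default
row_sums = weights.sum(dim=-1)  # each token's weights should sum to 1
```

The output keeps the input shape, while the weights form one seq_len × seq_len matrix per example.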
9.3 The Transformer Architecture
A full Transformer block has:
- Multi-head self-attention (look at all positions)
- Add & Norm (residual connection + layer normalization)
- Feed-forward network (process each position independently)
- Add & Norm again
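The four steps above can be sketched as a small module. This is one possible layout (the "post-norm" ordering, where normalization follows each residual addition), built on PyTorch's `nn.MultiheadAttention`. The class name `TransformerBlock` and all sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Post-norm block: attention -> Add & Norm -> feed-forward -> Add & Norm."""
    def __init__(self, embed_dim, num_heads, ff_dim, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads,
                                          dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_dim),
        )
        self.norm2 = nn.LayerNorm(embed_dim)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        # 1) Multi-head self-attention over all positions
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        # 2) Add & Norm: residual connection, then layer normalization
        x = self.norm1(x + self.drop(attn_out))
        # 3) Feed-forward network, applied to each position independently
        ff_out = self.ff(x)
        # 4) Add & Norm again
        return self.norm2(x + self.drop(ff_out))

block = TransformerBlock(embed_dim=32, num_heads=4, ff_dim=64)
x = torch.randn(2, 10, 32)   # (batch, seq_len, embed_dim)
y = block(x)
print(y.shape)
```

Because the output has the same shape as the input, blocks like this can be stacked one after another.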
9.4 Why Transformers Won
| Feature | RNN | Transformer |
|---|---|---|
| Parallelization | Sequential across time steps (slow) | Parallel across positions during training (fast) |
| Long-range dependencies | Degrade as the distance between tokens grows | Direct attention to any position |
| Training on GPUs | Limited by sequential nature | Well suited to GPUs |
| Scalability | Hard to scale past a few layers | Scales to billions of parameters |
9.5 BERT, GPT, and the Foundation Model Era
BERT (2018): Encoder-only Transformer. Pre-trained to fill in masked words. Excels at understanding tasks (classification, Q&A).
GPT (2018–present): Decoder-only Transformer. Pre-trained to predict the next word. Excels at generation tasks (text completion, conversation).
T5 (2019): Full encoder-decoder Transformer. Frames everything as "text in → text out."
- Implement the Transformer block and train it on a simple sequence prediction task. Compare to an LSTM on the same data.
- Visualize attention weights: which tokens attend to which? Feed a sentence through a Transformer and plot the attention matrix.
- Why does scaling attention by √d_k matter? What happens without it?
- Read the "Attention Is All You Need" paper (2017). Write a 1-paragraph summary of the key contributions.
Transfer Learning
Why train from scratch when someone else already has? Use pre-trained models as a starting point and adapt them to your task.
10.1 The Big Idea
Training a large model from scratch can require millions of images and days of GPU time. Transfer learning says: take a model pre-trained on a massive dataset (like ImageNet's 1.2 million images), and adapt it to your task with your smaller dataset.
It works because early layers learn general features (edges, textures) that are useful for almost any vision task. Only the later layers are task-specific.
10.2 Feature Extraction vs Fine-Tuning
| Strategy | What Changes | When to Use |
|---|---|---|
| Feature Extraction | Only the final classifier layer | Small dataset, similar domain |
| Fine-Tuning | All (or most) layers, at low learning rate | Larger dataset or different domain |
```python
import torch
import torch.nn as nn
import torchvision.models as models

# Load pre-trained ResNet-18
model = models.resnet18(weights='IMAGENET1K_V1')

# Freeze all layers (feature extraction)
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer for your task (new layers are trainable by default)
num_classes = 5  # e.g., 5 types of flowers
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only the new layer will be trained
optimizer = torch.optim.Adam(model.fc.parameters(), lr=0.001)
```
10.3 Fine-Tuning in Practice
```python
# Unfreeze all layers for fine-tuning
for param in model.parameters():
    param.requires_grad = True

# Use a LOWER learning rate (don't destroy learned features)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

# OR: unfreeze layers gradually.
# Start training only the last layer, then unfreeze the last 2, etc.
```
10.4 Transfer Learning for NLP
The same idea works for text. Models like BERT and GPT are pre-trained on massive text corpora. You can fine-tune them for specific tasks:
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load pre-trained BERT
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Fine-tune on sentiment analysis
# The model already "understands" English — you just teach it your specific task
```
The Hugging Face Transformers library provides thousands of pre-trained models for NLP, vision, and audio. It's become the standard way to use pre-trained models. Install it with: pip install transformers
- Use a pre-trained ResNet-18 to classify a custom 5-class image dataset (e.g., flowers, food). Compare feature extraction vs full fine-tuning.
- Fine-tune a pre-trained BERT model for sentiment analysis on movie reviews. What accuracy do you achieve?
- What happens if you fine-tune with too high a learning rate? Try lr=0.01 and observe the results.
Generative Models
Models that create — generate images, text, music. Autoencoders, GANs, and the diffusion models behind DALL·E and Stable Diffusion.
11.1 Autoencoders
An autoencoder learns to compress data into a compact representation (encoding) and reconstruct it back. It's trained to minimize reconstruction error.
```python
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(784, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 16)  # Bottleneck: 784 → 16
        )
        self.decoder = nn.Sequential(
            nn.Linear(16, 64), nn.ReLU(),
            nn.Linear(64, 256), nn.ReLU(),
            nn.Linear(256, 784), nn.Sigmoid()  # outputs in [0, 1], matching scaled pixel values
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)
```
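"Minimizing reconstruction error" means the input doubles as the target. Here is a minimal sketch of a few training steps, assuming mean squared error as the loss and a random tensor in place of real images. The model is a smaller stand-in (784 → 16 → 784) so the example stays short.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Smaller stand-in for an autoencoder (784 -> 16 -> 784); sizes are assumptions
model = nn.Sequential(
    nn.Linear(784, 16), nn.ReLU(),
    nn.Linear(16, 784), nn.Sigmoid(),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

# Random tensor standing in for a batch of flattened 28x28 images scaled to [0, 1]
batch = torch.rand(64, 784)

losses = []
for step in range(20):
    recon = model(batch)
    loss = criterion(recon, batch)   # the input is its own target
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

With real data you would loop over a DataLoader instead of reusing one batch, and flatten each image to 784 values first.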
11.2 Generative Adversarial Networks (GANs)
A GAN pits two networks against each other:
- Generator: Creates fake data from random noise
- Discriminator: Tries to distinguish real data from fakes
They train together: the generator gets better at faking, the discriminator gets better at detecting. In the ideal case, the generator eventually produces data that the discriminator can no longer tell apart from real data.
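One round of this game might look like the sketch below. It uses tiny fully connected networks and random 2-D points as "real" data; the sizes, learning rates, and the choice of binary cross-entropy loss are all assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
noise_dim, data_dim, batch_size = 8, 2, 32   # toy sizes (assumptions)

generator = nn.Sequential(nn.Linear(noise_dim, 16), nn.ReLU(), nn.Linear(16, data_dim))
discriminator = nn.Sequential(nn.Linear(data_dim, 16), nn.ReLU(), nn.Linear(16, 1))

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(batch_size, data_dim) + 3.0   # stand-in for real data
ones = torch.ones(batch_size, 1)                 # label "real"
zeros = torch.zeros(batch_size, 1)               # label "fake"

# 1) Discriminator step: push real samples toward 1 and fakes toward 0
fake = generator(torch.randn(batch_size, noise_dim))
d_loss = bce(discriminator(real), ones) + bce(discriminator(fake.detach()), zeros)
opt_d.zero_grad()
d_loss.backward()
opt_d.step()

# 2) Generator step: try to make the discriminator output 1 on fakes
g_loss = bce(discriminator(fake), ones)
opt_g.zero_grad()
g_loss.backward()
opt_g.step()

print(f"d_loss {d_loss.item():.3f}, g_loss {g_loss.item():.3f}")
```

The `detach()` in step 1 keeps the discriminator update from changing the generator. A full training run repeats both steps over many batches.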
11.3 Diffusion Models
The technology behind Stable Diffusion, DALL·E 2, and Midjourney. The idea is beautifully simple:
- Forward process: Gradually add noise to an image until it's pure static
- Reverse process: Train a neural network to reverse each step of noising — to "denoise" one step at a time
- Generation: Start with pure noise, denoise step by step, and a coherent image emerges
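The forward process has a convenient property: you can jump straight to any noise level without simulating every step in between. Here is a rough sketch, assuming a linear noise schedule with 1,000 steps (a common choice, not something specified above) and a random tensor in place of an image.

```python
import torch

torch.manual_seed(0)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # linear noise schedule (assumed values)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # how much signal remains after each step

def add_noise(x0, t):
    """Sample x_t from x_0 in one shot: sqrt(a_bar) * x0 + sqrt(1 - a_bar) * noise."""
    noise = torch.randn_like(x0)
    a_bar = alphas_bar[t]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise, noise

x0 = torch.rand(1, 1, 28, 28)      # random tensor standing in for an image
x_early, _ = add_noise(x0, 10)     # still mostly image
x_late, _ = add_noise(x0, T - 1)   # almost pure noise

print(f"signal kept at step 10: {alphas_bar[10].item():.4f}")
print(f"signal kept at step {T - 1}: {alphas_bar[T - 1].item():.6f}")
```

During training, the network sees the noisy image and the step index, and is typically asked to predict the noise that was added.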
- Build and train an autoencoder on MNIST. Visualize the reconstructions — how do they look with a bottleneck of 2 vs 16 vs 64?
- If the bottleneck is 2D, you can plot the latent space. Color-code by digit class. What structure do you see?
- Implement a simple GAN for generating MNIST digits. Plot generated samples at epochs 1, 10, 50, and 100.
Practical Skills
Theory gets you started. These skills get you to production.
12.1 Data Preprocessing
Real-world data is messy. Common tasks:
- Missing values: Impute with mean/median, or flag with a mask
- Normalization: Scale features to similar ranges (e.g., zero mean, unit variance)
- Tokenization: Convert text to numerical sequences
- Encoding: Convert categorical variables to numbers (one-hot or embedding)
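Three of these steps fit in a few lines of plain Python. The data below is made up, and in practice you would likely reach for a library such as pandas or scikit-learn; the point here is only to show what each step does.

```python
from statistics import mean, pstdev

raw = [1.0, None, 3.0, 4.0]   # None marks a missing value (toy data)

# Missing values: remember where they were, then impute with the mean of observed values
mask = [1 if v is None else 0 for v in raw]
observed = [v for v in raw if v is not None]
fill = mean(observed)
imputed = [fill if v is None else v for v in raw]

# Normalization: zero mean, unit variance
mu, sigma = mean(imputed), pstdev(imputed)
normalized = [(v - mu) / sigma for v in imputed]

# Encoding: one-hot vectors for a categorical column
colors = ["red", "green", "red"]
categories = sorted(set(colors))   # fixed column order
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]

print(mask, normalized, one_hot)
```

Compute the mean and standard deviation on the training split only and reuse them for validation and test data. Otherwise information leaks across splits.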
12.2 Debugging Models
When your model doesn't work (and it often won't), follow this checklist:
- Overfit a single batch. If your model can't learn 10 examples perfectly, there's a bug.
- Check the loss. Is it decreasing at all? If not, the learning rate might be too high or too low.
- Visualize predictions. Plot what the model outputs vs. the truth.
- Check gradients. Are they flowing? Are they exploding (NaN) or vanishing (~0)?
- Simplify. Remove complexity until something works, then add it back incrementally.
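The first and fourth checks are quick to script. The sketch below uses a tiny made-up model and 10 random examples: it trains on that single batch and then reports the overall gradient norm. All sizes and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()

# One tiny fixed batch: 10 random examples with random labels (toy data)
x = torch.randn(10, 20)
y = torch.randint(0, 3, (10,))

initial_loss = criterion(model(x), y).item()
for step in range(300):
    loss = criterion(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
final_loss = loss.item()

# Gradient check: total L2 norm over all parameters after the last backward pass
grad_norm = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters())).item()

print(f"loss {initial_loss:.3f} -> {final_loss:.4f}, grad norm {grad_norm:.4f}")
```

If the loss refuses to drop on a batch this small, suspect the data pipeline, the loss function, or the optimizer wiring before blaming the architecture.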
12.3 Experiment Tracking
```python
# Use Weights & Biases for experiment tracking
import wandb

wandb.init(project="my-dl-project")
wandb.config.update({"lr": 0.001, "epochs": 50, "batch_size": 64})

for epoch in range(50):
    train_loss = train_one_epoch(...)
    val_acc = evaluate(...)
    wandb.log({"loss": train_loss, "accuracy": val_acc})
```
12.4 Common Pitfalls
| Problem | Symptom | Fix |
|---|---|---|
| Data leakage | Unrealistically high accuracy | Ensure test data is never seen during training |
| Imbalanced classes | High accuracy but low recall on the minority class | Use weighted loss or oversample minority class |
| Wrong loss function | Loss doesn't decrease | Match loss to task: CrossEntropy for classification, MSE for regression |
| Learning rate too high | Loss oscillates or explodes | Start at 1e-3, decrease if unstable |
| Learning rate too low | Loss barely decreases | Increase by 10x |
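As an example of one fix from the table, here is a sketch of a weighted loss for a 90/10 class split. The weighting rule (total samples divided by number of classes times the class count) is one common heuristic and an assumption here, not the only option.

```python
import torch
import torch.nn as nn

# Toy label distribution: 90 samples of class 0, 10 of class 1
labels = torch.tensor([0] * 90 + [1] * 10)
counts = torch.bincount(labels).float()

# "Balanced" heuristic (an assumption): n_samples / (n_classes * class_count)
weights = counts.sum() / (len(counts) * counts)

# Mistakes on the rare class now cost more than mistakes on the common class
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.zeros(100, 2)   # a model that is equally unsure about every sample
loss = criterion(logits, labels)

print(weights, loss.item())
```

With these weights, each class contributes roughly equally to the loss, even though one is nine times more frequent.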
- Take a dataset and intentionally introduce data leakage. Show that accuracy is artificially high. Then fix it.
- Create an imbalanced dataset (90% class A, 10% class B). Train a model and observe the problem. Apply class weighting and compare.
- Set up experiment tracking with either W&B or TensorBoard. Compare 3 different learning rates on the same task.
Deploying Models
A model on your laptop is a research project. A model that serves real users is a product.
13.1 Saving & Exporting
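```python
import torch

# PyTorch: save entire model
torch.save(model, 'model_full.pth')

# Or just the weights (recommended)
torch.save(model.state_dict(), 'model_weights.pth')

# Export to ONNX (framework-agnostic)
model.eval()  # export in inference mode so dropout and batch norm behave deterministically
dummy = torch.randn(1, 1, 28, 28)  # example input; this shape assumes MNIST-sized images
torch.onnx.export(model, dummy, "model.onnx")
```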
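Loading saved weights is the mirror image of saving them: rebuild the same architecture, then copy the weights in. The sketch below uses a hypothetical `build_model` function and a temporary directory so it cleans up after itself.

```python
import os
import tempfile
import torch
import torch.nn as nn

def build_model():
    # Hypothetical architecture; loading a state_dict requires rebuilding the same one
    return nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))

torch.manual_seed(0)
model = build_model()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model_weights.pth")
    torch.save(model.state_dict(), path)

    restored = build_model()                      # fresh, randomly initialized copy
    restored.load_state_dict(torch.load(path))    # overwrite with the saved weights
    restored.eval()

x = torch.randn(1, 1, 28, 28)
with torch.no_grad():
    same = torch.allclose(model(x), restored(x))
print(same)
```

This is the main reason the weights-only approach is recommended: the file holds plain tensors, and the model code stays in your source tree where you can change it.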
13.2 Building a Simple API
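```python
from flask import Flask, request, jsonify
import torch

app = Flask(__name__)

# Load a full model saved with torch.save(model, ...).
# Recent PyTorch versions need weights_only=False to unpickle a whole model;
# only load files you trust.
model = torch.load('model_full.pth', weights_only=False)
model.eval()

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    # preprocess() is your own function: it turns the JSON input into a model-ready tensor
    tensor = preprocess(data['input'])
    with torch.no_grad():
        prediction = model(tensor)
    return jsonify({'prediction': prediction.tolist()})
```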
13.3 Deployment Options
| Platform | Best For |
|---|---|
| Flask / FastAPI | Simple APIs, prototyping |
| Docker + Cloud (AWS/GCP/Azure) | Production scale |
| ONNX Runtime | Cross-platform inference |
| TorchServe | PyTorch-native serving |
| Edge (TFLite, CoreML) | Mobile and embedded devices |
13.4 Optimization for Inference
- Quantization: Use 8-bit integers instead of 32-bit floats. Often 2-4x faster with little accuracy loss, depending on the model and hardware.
- Pruning: Remove small weights (set to zero). Creates sparse, smaller models.
- Distillation: Train a small "student" model to mimic a large "teacher" model.
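To see what quantization does to individual numbers, here is the arithmetic in plain Python. It uses a simple affine scheme that maps a float range onto the integers 0 to 255; the weight values are made up, and real libraries add many refinements on top of this.

```python
# Affine (asymmetric) quantization of floats to unsigned 8-bit integers, in plain Python.
weights = [-0.62, -0.1, 0.0, 0.33, 1.27]   # toy values (assumption)

qmin, qmax = 0, 255
lo, hi = min(weights), max(weights)
scale = (hi - lo) / (qmax - qmin)          # float step represented by one integer level
zero_point = round(qmin - lo / scale)      # integer that represents 0.0

def quantize(x):
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))         # clamp to the 8-bit range

def dequantize(q):
    return (q - zero_point) * scale

quantized = [quantize(w) for w in weights]
restored = [dequantize(q) for q in quantized]
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(quantized)
print(f"max round-trip error: {max_error:.5f} (scale = {scale:.5f})")
```

Each weight now fits in one byte instead of four, and the round-trip error stays within half a quantization step.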
- Wrap your trained MNIST model in a Flask API. Accept an image and return the predicted digit.
- Export a PyTorch model to ONNX and run it with ONNX Runtime.
- Apply dynamic quantization to your model and compare inference speed and model size.
Ethics & Responsible AI
With great power comes real consequences. Building AI responsibly isn't optional.
14.1 Bias in, Bias Out
Models learn from data. If the data reflects historical biases — racial, gender, socioeconomic — the model will reproduce and amplify those biases. A hiring model trained on historical data may learn that "most successful candidates were male" and penalize female applicants.
14.2 Fairness Metrics
- Demographic parity: Prediction rates should be similar across groups
- Equal opportunity: True positive rates should be similar across groups
- Individual fairness: Similar individuals should get similar predictions
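The first two metrics are simple to compute once you have predictions and group labels. A minimal sketch on made-up data follows; the group names, labels, and predictions are all invented for illustration.

```python
# Toy predictions for two groups (all values are made up for illustration)
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

def positive_rate(group):
    """Share of the group that receives a positive prediction (demographic parity)."""
    preds = [p for g, p in zip(groups, y_pred) if g == group]
    return sum(preds) / len(preds)

def true_positive_rate(group):
    """Share of the group's actual positives predicted positive (equal opportunity)."""
    preds = [p for g, t, p in zip(groups, y_true, y_pred) if g == group and t == 1]
    return sum(preds) / len(preds)

parity_gap = abs(positive_rate("a") - positive_rate("b"))
opportunity_gap = abs(true_positive_rate("a") - true_positive_rate("b"))

print(f"demographic parity gap: {parity_gap:.2f}")
print(f"equal opportunity gap:  {opportunity_gap:.2f}")
```

A gap of zero means the metric is satisfied exactly. How large a gap is acceptable depends on the application and is a judgment call, not something the code can decide.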
14.3 Interpretability
When a model denies someone a loan or flags a medical image, people deserve an explanation. Techniques include:
- SHAP/LIME: Explain individual predictions by testing which features matter most
- Attention visualization: Show what the model "looks at"
- Grad-CAM: Highlight which parts of an image influenced the decision
14.4 Environmental Cost
Training large models consumes significant energy. GPT-3's training reportedly used ~1,300 MWh of electricity. Consider:
- Do you need a large model, or would a smaller one suffice?
- Can you use transfer learning instead of training from scratch?
- Can you train on renewable energy or during off-peak hours?
14.5 Guidelines for Responsible Development
- Audit your data for bias before training
- Test model performance across demographic groups
- Be transparent about model limitations
- Implement human-in-the-loop for high-stakes decisions
- Document your model (what it was trained on, its intended use, known failure modes)
Where the Field Is Heading
A snapshot of the frontier as of 2025 — and where things might go next.
15.1 Foundation Models
Large models pre-trained on broad data that can be adapted to many tasks. GPT-4, Claude, Gemini, and similar models represent a paradigm shift: instead of building task-specific models, you build one large model and adapt it.
15.2 Multimodal Models
Models that understand and generate across modalities — text, images, audio, video — simultaneously. GPT-4V can reason about images. Gemini processes text, code, images, and audio natively.
15.3 Key Trends
- Smaller, more efficient models: Techniques like quantization, distillation, and efficient architectures are making powerful models accessible on consumer hardware.
- Open source: Models like LLaMA, Mistral, and Stable Diffusion are democratizing access.
- AI Agents: Models that can use tools, browse the web, write and execute code.
- Reasoning: Models that can break down complex problems and reason step by step.
- Regulation: The EU AI Act and similar legislation are shaping how AI can be deployed.
15.4 What to Learn Next
- Read papers: Start with well-written survey papers and landmark papers (Attention Is All You Need, ResNet, etc.)
- Build projects: The best way to learn is to build something and get it wrong
- Join communities: Papers With Code, Hugging Face forums, r/MachineLearning
- Specialize: Go deep in one area (computer vision, NLP, reinforcement learning, etc.)
Capstone Project
Put it all together. This project walks you through building a complete deep learning application end to end.
Project: Image Classification Web App
Goal: Build a web app that classifies images into custom categories. The user uploads an image, the model predicts the class, and the result is displayed with a confidence score.
Step 1: Define the Problem
Choose a dataset. Suggestions:
- Flowers (102 species)
- Food (101 categories)
- Dog breeds (120 breeds)
- Your own collected dataset
Step 2: Data Pipeline
```python
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.2, 0.2),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
```
Step 3: Model
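```python
import torch
import torch.nn as nn
import torchvision.models as models

num_classes = 102  # example value: set this to the number of classes in your dataset

model = models.resnet50(weights='IMAGENET1K_V2')
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Fine-tune with differential learning rates.
# Only layer4 and fc are passed to the optimizer, so the weights of earlier layers are not updated.
optimizer = torch.optim.AdamW([
    {'params': model.layer4.parameters(), 'lr': 1e-4},
    {'params': model.fc.parameters(), 'lr': 1e-3},
], weight_decay=1e-2)
```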
Step 4: Training with Best Practices
- Mixup / CutMix augmentation
- Cosine annealing learning rate schedule
- Early stopping on validation loss
- Model checkpointing (save best model)
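Three of these practices (cosine annealing, early stopping, and checkpointing) can share one training loop. In the sketch below, the model is a stand-in, the validation losses are made up, and the patience of 3 epochs is an assumption. Mixup and CutMix are left out to keep it short.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)                       # stand-in for the real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
epochs = 10
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

# Made-up validation losses, used here instead of a real evaluation loop
fake_val_losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.74, 0.73, 0.72, 0.71]

patience, bad_epochs = 3, 0
best_loss, best_state, stopped_at = float("inf"), None, None
lrs = []

for epoch in range(epochs):
    # ... train for one epoch here ...
    optimizer.step()                          # placeholder so the scheduler follows an optimizer step
    scheduler.step()                          # cosine annealing: lower the learning rate
    lrs.append(optimizer.param_groups[0]["lr"])

    val_loss = fake_val_losses[epoch]
    if val_loss < best_loss:
        best_loss, bad_epochs = val_loss, 0
        best_state = copy.deepcopy(model.state_dict())   # checkpoint the best model
    else:
        bad_epochs += 1
        if bad_epochs >= patience:            # early stopping
            stopped_at = epoch
            break

model.load_state_dict(best_state)             # restore the best checkpoint
print(f"stopped at epoch {stopped_at}, best val loss {best_loss}")
```

In a real run you would also write `best_state` to disk with `torch.save`, so the best model survives a crash.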
Step 5: Deploy
Create a simple web interface with Flask or Streamlit. Accept image uploads, run the model, and display results with confidence bars.
If you've followed along from Chapter 1, you now understand the foundations of deep learning — from the math to the models to the deployment. The field is vast and evolving fast, but you have the fundamentals to learn anything that comes next. Keep building.
- Complete the capstone project end-to-end with a dataset of your choice.
- Write a model card documenting your model: training data, performance metrics, known limitations, intended use cases.
- Deploy your model as a web app and share it with someone who isn't in tech. Get their feedback.
- Reflect: what was the hardest part? What would you do differently next time?
Math Refresher & Glossary
Quick reference for the math and terminology used throughout this book.
A.1 Linear Algebra Cheat Sheet
| Concept | Notation | Meaning |
|---|---|---|
| Scalar | a | A single number |
| Vector | v = [v₁, v₂, v₃] | Ordered list of numbers (1D array) |
| Matrix | W | 2D grid of numbers |
| Dot product | a · b | Element-wise multiply, then sum |
| Matrix multiply | AB | Rows of A dotted with columns of B |
| Transpose | Aᵀ | Rows become columns |
| Norm | ‖v‖ | "Length" of a vector |
A.2 Calculus Cheat Sheet
| Function | Derivative |
|---|---|
| f(x) = c | f'(x) = 0 |
| f(x) = xⁿ | f'(x) = nxⁿ⁻¹ |
| f(x) = eˣ | f'(x) = eˣ |
| f(x) = ln(x) | f'(x) = 1/x |
| f(x) = σ(x) | f'(x) = σ(x)(1 - σ(x)) |
| f(x) = ReLU(x) | f'(x) = 1 if x > 0, else 0 |
| Chain rule: f(g(x)) | f'(g(x)) × g'(x) |
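A quick way to build trust in these rules is to compare them with a numerical estimate. The sketch below checks the sigmoid derivative and one chain-rule example using a central finite difference. The test point 0.7 and the example functions are arbitrary choices.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def numerical_derivative(f, x, h=1e-5):
    """Central finite difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

x = 0.7   # arbitrary test point

# Sigmoid: closed form from the table vs finite difference
analytic_sigmoid = sigmoid(x) * (1 - sigmoid(x))
numeric_sigmoid = numerical_derivative(sigmoid, x)

# Chain rule on f(g(x)) with f = ln and g(x) = x^2: f'(g(x)) * g'(x) = (1 / x^2) * 2x
analytic_chain = (1 / x**2) * (2 * x)
numeric_chain = numerical_derivative(lambda t: math.log(t**2), x)

print(f"sigmoid':   analytic {analytic_sigmoid:.6f}, numeric {numeric_sigmoid:.6f}")
print(f"chain rule: analytic {analytic_chain:.6f}, numeric {numeric_chain:.6f}")
```

The same trick, applied to a loss function and its parameters, is known as gradient checking.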
A.3 Glossary
| Term | Definition |
|---|---|
| Activation function | Nonlinear function applied to neuron output (ReLU, sigmoid, etc.) |
| Backpropagation | Algorithm for computing gradients via the chain rule |
| Batch size | Number of samples processed before updating weights |
| Bias | Learnable offset added after the weighted sum in a neuron |
| CNN | Convolutional Neural Network — specialized for grid-like data (images) |
| Cross-entropy | Loss function for classification tasks |
| Data augmentation | Artificially expanding training data through transformations |
| Dropout | Regularization: randomly zeroing activations during training |
| Epoch | One complete pass through the entire training dataset |
| Fine-tuning | Continuing training of a pre-trained model on a new task |
| Gradient | Vector of partial derivatives — direction of steepest increase |
| Gradient descent | Optimization: iteratively move parameters opposite to the gradient |
| GPU | Graphics Processing Unit — massively parallel hardware for DL |
| Learning rate | Step size for parameter updates during optimization |
| Loss function | Measure of how wrong the model's predictions are |
| LSTM | Long Short-Term Memory — RNN variant that handles long-range dependencies |
| Overfitting | Model memorizes training data instead of learning general patterns |
| Parameter | Learnable value in the model (weights and biases) |
| Pooling | Downsampling operation in CNNs (max or average) |
| Pre-training | Initial training on a large dataset before fine-tuning |
| Regularization | Techniques to prevent overfitting (dropout, weight decay, etc.) |
| RNN | Recurrent Neural Network — processes sequences step by step |
| Softmax | Function that converts logits to probabilities |
| Tensor | Multi-dimensional array (generalization of vectors and matrices) |
| Transfer learning | Using a pre-trained model as a starting point for a new task |
| Transformer | Architecture based on self-attention (introduced in 2017) |
| Underfitting | Model is too simple to capture the patterns in data |
| Weight | Learnable parameter that scales the input to a neuron |