Part 2 of 3
Loss functions, gradient descent, overfitting, feature engineering, and neural networks — all explained visually.
Module 03
Linear regression is the "Hello World" of machine learning. The idea: find the straight line (or hyperplane) that best fits your data. It's simple, interpretable, and a foundation for understanding many other models.
The model's equation is just a weighted sum of features plus a bias term:
y′ = b + w₁x₁ + w₂x₂ + … + wₙxₙ
Equation 1: The linear model equation. y′ is the predicted output, w₁…wₙ are the learned weights, x₁…xₙ are the feature values, and b is the bias term.
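The weighted sum in Equation 1 can be sketched in a few lines of plain Python. The function name `predict` and the example weights are hypothetical, chosen only for illustration.

```python
def predict(weights, features, bias):
    """Return y' = w1*x1 + ... + wn*xn + b for a single example."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sum(w * x for w, x in zip(weights, features)) + bias


# Hypothetical example: two features with hand-picked (not learned) weights.
weights = [2.0, -1.0]
bias = 0.5
y_pred = predict(weights, [3.0, 4.0], bias)
print(y_pred)  # 2*3 + (-1)*4 + 0.5 = 2.5
```

In a trained model the weights and bias would come from the training procedure rather than being set by hand.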
Figure 8: Interactive linear regression. Click "Animate Fit" to watch the model iteratively find the best-fit line through the data points.
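A loss function measures how wrong your model's predictions are. It's the number we want to minimise during training. The most common loss for regression is Mean Squared Error (MSE), the average over all N examples of the squared difference between the actual value y and the prediction y′:
MSE = (1/N) · Σᵢ (yᵢ − y′ᵢ)²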
Figure 9: The red dashed lines are the residuals — the vertical distance from each actual point to the model line. MSE averages the squares of these distances. Residual = y − y′: points above the line have positive residuals; points below have negative ones.
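As a rough illustration of residuals and MSE, here is a small sketch on three made-up data points. The helper names `residuals` and `mean_squared_error` are assumptions for this example, not part of any library.

```python
def residuals(y_true, y_pred):
    """Residual = actual minus predicted, one value per example."""
    return [y - p for y, p in zip(y_true, y_pred)]


def mean_squared_error(y_true, y_pred):
    """Average of the squared residuals."""
    errs = residuals(y_true, y_pred)
    return sum(e * e for e in errs) / len(errs)


# Made-up values: the first point lies above the line, the last one below it.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
print(residuals(y_true, y_pred))           # [1.0, 0.0, -2.0]
print(mean_squared_error(y_true, y_pred))  # (1 + 0 + 4) / 3
```

Squaring makes positive and negative residuals count equally and penalises large misses more heavily than small ones.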
Module 04
We know we want to minimise loss — but how? We can't try every possible combination of weights; even a small model might have millions of parameters. The answer is gradient descent: an iterative algorithm that nudges weights in the direction that reduces loss, step by step.
Imagine you're blindfolded in a hilly landscape and you want to reach the lowest valley. You can't see the whole landscape, but you can feel which direction is downhill right where you're standing. You take a step downhill, feel the slope again, take another step. That's gradient descent.
Figure 10: Animated gradient descent on a loss curve. The ball represents the current weight value; the curve is the loss landscape. Try different learning rates to see how convergence changes.
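The downhill walk can be sketched for a one-feature linear model trained with MSE. This is a minimal illustration, assuming a fixed learning rate of 0.05, 2,000 steps, and a tiny made-up dataset that lies exactly on y = 2x + 1; the function name `gradient_descent` is hypothetical.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y' = w*x + b by repeatedly stepping against the MSE gradient."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error for every example at the current w and b.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of MSE with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Step "downhill": move opposite to the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
w, b = gradient_descent(xs, ys)
print(round(w, 3), round(b, 3))  # close to 2.0 and 1.0
```

The loop never looks at the whole loss landscape; it only uses the local slope at the current weights, which is the blindfolded-hiker idea from the analogy.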
Each step in gradient descent is scaled by a learning rate (α) — a hyperparameter you set before training. It controls how big each step is:
Too small: Training takes forever. The model takes tiny steps and may never converge in a reasonable time, especially in high-dimensional spaces.
Just right: Converges efficiently to a good minimum. Finding this value is often done by trying several values (a process called learning rate search or tuning). A learning rate schedule, which changes the rate during training, is a related technique.
Too large: The model overshoots the minimum, bouncing around or even diverging (loss goes up instead of down). The ball flies over the valley.
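The three behaviours can be reproduced on the simplest possible loss curve, L(w) = w², whose minimum is at w = 0. The specific learning rates below are assumptions picked to make each regime visible; suitable values for a real model depend on the problem.

```python
def descend(learning_rate, start=5.0, steps=50):
    """Run gradient descent on L(w) = w**2 and return the final w."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * w  # dL/dw for L(w) = w**2
        w -= learning_rate * gradient
    return w


for name, lr in [("too small", 0.001), ("about right", 0.1), ("too large", 1.1)]:
    print(f"{name:>11}: lr={lr}, final w={descend(lr):.4g}")
```

With the smallest rate, w has barely moved from its starting value of 5 after 50 steps. With the middle rate it is very close to 0. With the largest rate each step overshoots by more than the previous one, so the magnitude of w grows instead of shrinking.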
Computing gradients on the full dataset at every step is slow. Stochastic gradient descent (SGD) instead uses one random example (or a small mini-batch) per step. Each step is much faster and slightly noisy, but the method works very well in practice.
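A mini-batch variant of gradient descent might look like the sketch below. The batch size, learning rate, epoch count, and the function name `sgd` are illustrative assumptions, and the data is noiseless so that the result is easy to check.

```python
import random


def sgd(xs, ys, learning_rate=0.02, batch_size=2, epochs=500, seed=0):
    """Fit y' = w*x + b using gradients from small random batches."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new random order
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient estimated from this batch only, not the full dataset.
            errors = [(w * xs[i] + b) - ys[i] for i in batch]
            grad_w = 2.0 / len(batch) * sum(e * xs[i] for e, i in zip(errors, batch))
            grad_b = 2.0 / len(batch) * sum(errors)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x + 1.0 for x in xs]  # noiseless data from y = 2x + 1
w, b = sgd(xs, ys)
print(round(w, 2), round(b, 2))  # close to 2.0 and 1.0
```

Each update sees only two examples, so individual steps point in slightly different directions, but over many steps they average out towards the same solution as full-batch gradient descent.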
Module 05
A model that memorises the training data perfectly — but fails on new data — is useless. The true goal of ML is generalization: good performance on examples the model has never seen.
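This leads to the most important tension in machine learning, known as the bias-variance tradeoff: a model that is too simple underfits (high bias), while a model that is too flexible chases noise and overfits (high variance).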
Figure 11: The three regimes. Underfitting (left) — too simple, misses real patterns. Good fit (centre) — captures the true pattern without chasing noise. Overfitting (right) — memorises noise, performs poorly on new data.
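The three regimes can be imitated with three deliberately simple models on synthetic data. This is a sketch under stated assumptions: the data comes from y = 2x + 1 plus Gaussian noise, the "too simple" model always predicts the mean, and the "memoriser" is a 1-nearest-neighbour lookup. All function names are hypothetical.

```python
import random


def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_constant(xs, ys):
    """Too simple: always predict the mean label."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Closed-form least-squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    w = cov / var
    b = mean_y - w * mean_x
    return lambda x: w * x + b


def fit_memoriser(xs, ys):
    """Memorises training data: returns the label of the closest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


rng = random.Random(0)


def make_data(n):
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + 1 + rng.gauss(0, 2) for x in xs]  # true pattern plus noise
    return xs, ys


train_x, train_y = make_data(30)
new_x, new_y = make_data(30)  # examples the models never saw

results = {}
for name, fit in [("constant", fit_constant), ("line", fit_line), ("memoriser", fit_memoriser)]:
    model = fit(train_x, train_y)
    results[name] = (mse(model, train_x, train_y), mse(model, new_x, new_y))
    print(f"{name:>9}: train MSE={results[name][0]:.2f}, new-data MSE={results[name][1]:.2f}")
```

The memoriser reaches a training MSE of exactly zero because every training point is its own nearest neighbour, yet that says nothing about how it does on the unseen points. The gap between the two columns is the signal to watch.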
The standard way to detect overfitting is to hold out data the model never trains on. Google's ML course recommends three splits: a training set, a validation set, and a test set.
Figure 12: The three-way dataset split. Train on the training set, tune using the validation set, and evaluate exactly once on the test set. Using the test set more than once causes test-set contamination — you inadvertently optimise for it, and your reported accuracy becomes too optimistic.
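A three-way split could be sketched as follows. The 70/15/15 proportions and the function name `three_way_split` are assumptions for illustration; the source does not prescribe specific ratios.

```python
import random


def three_way_split(examples, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle once, then carve out test, validation, and training sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))  # 70 15 15
```

Shuffling before slicing matters when the data is ordered (for example by date or by label); for time-series data a chronological split is usually more appropriate than a random one.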
Module 06
Raw data rarely arrives in a form suitable for ML models. Feature engineering is the process of transforming raw data into informative, well-scaled inputs. It's often the most impactful thing a practitioner can do to improve model performance — more than choosing a fancier algorithm.
Numerical features: Numbers like age, price, or temperature. Usually normalised (scaled to 0–1 or z-scored) so that large-scale features don't dominate small-scale ones. A salary of $80,000 and an age of 35 need to live on comparable scales.
Categorical features: Labels like "city" or "colour". Must be converted to numbers. Common techniques: one-hot encoding (a separate 0/1 column per category) or embeddings (learned dense representations, widely used in NLP).
Bucketing (binning): Turning a continuous number into categories. Age → [child, teen, adult, senior]. Useful when the relationship with the label is non-linear. For example, click-through rate might peak in the 25–34 age bucket rather than rising or falling steadily with age.
Feature crosses: Combining two features into one. Crossing bucketed latitude with bucketed longitude creates a location cell that is more expressive than either feature alone. Google uses feature crosses extensively in linear models to capture interaction effects cheaply.
Missing values: Real datasets have gaps. Common fixes: imputation (replace with mean, median, or a learned value) or a separate is_missing indicator feature. Dropping rows loses data; the right strategy depends on why data is missing.
Feature selection: Not all features help, and some add noise. Techniques like correlation analysis, mutual information, and LASSO regularisation identify which features are truly predictive and which can be dropped to simplify the model.
Figure 13: Feature engineering pipeline. Raw text and numbers are transformed into clean, scaled, and informative numerical features before being fed to the model.
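Several of the transformations described above fit in a few lines each. The sketch below uses plain Python; every function name, the age boundaries, and the example vocabulary are assumptions made for illustration. In practice a library such as scikit-learn or pandas would usually handle these steps.

```python
import statistics


def min_max_scale(values):
    """Rescale numbers to the 0-1 range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Centre on the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def one_hot(value, vocabulary):
    """One 0/1 column per known category."""
    return [1 if value == item else 0 for item in vocabulary]


def bucketize(value, boundaries):
    """Index of the bucket a value falls into, given sorted lower boundaries."""
    return sum(value >= b for b in boundaries)


def cross(bucket_a, bucket_b, n_buckets_b):
    """Feature cross: a single id for each (bucket_a, bucket_b) pair."""
    return bucket_a * n_buckets_b + bucket_b


def impute_with_indicator(values):
    """Replace None with the median and add an is_missing flag per value."""
    median = statistics.median(v for v in values if v is not None)
    return [(median, 1) if v is None else (v, 0) for v in values]


print(min_max_scale([20, 35, 50]))                    # [0.0, 0.5, 1.0]
print(one_hot("Paris", ["London", "Paris", "Tokyo"]))  # [0, 1, 0]
age_labels = ["child", "teen", "adult", "senior"]
print(age_labels[bucketize(30, [13, 20, 65])])         # adult
print(cross(2, 3, n_buckets_b=10))                     # 23
print(impute_with_indicator([1.0, None, 3.0]))         # [(1.0, 0), (2.0, 1), (3.0, 0)]
```

Scaling statistics (minimum, maximum, mean, standard deviation, median) should be computed on the training set only and then reused for validation and test data, so that no information leaks from held-out examples.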
Module 07
A neural network is a type of machine learning model loosely inspired by the human brain. It consists of layers of simple processing units called neurons (or nodes), connected by weighted links. Information flows from the input layer through one or more hidden layers to the output layer.
Each connection has a weight, a number that controls how much influence one neuron has on the next. Training a neural network means adjusting these weights (often millions of them) until the network gives accurate outputs.
Figure 14: An animated neural network — pixels flow in from the left, patterns are extracted by hidden layers, and the output predicts "Cat" with 87% confidence. The dashed lines show signals propagating through the network in real time.
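The flow from input layer to hidden layer to output can be sketched as a forward pass through a tiny network. Everything here is an assumption for illustration: the layer sizes (2 inputs, 3 hidden neurons, 1 output), the hand-picked weights, and the choice of ReLU and sigmoid activations. A real network would learn its weights through training.

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """One layer: each neuron takes a weighted sum of all inputs, adds its bias,
    and passes the result through the activation function."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


# Hand-picked (not learned) weights: one row per neuron.
hidden_w = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
hidden_b = [0.0, 0.1, -0.1]
out_w = [[0.7, -0.5, 0.2]]
out_b = [0.05]

x = [1.0, 2.0]                                 # input layer
hidden = dense(x, hidden_w, hidden_b, relu)    # hidden layer
output = dense(hidden, out_w, out_b, sigmoid)  # output layer
print(hidden)
print(output)  # a single value between 0 and 1
```

The sigmoid squashes the final value into the 0–1 range, which is why an output can be read as a confidence such as "87% cat". Each neuron is essentially the linear model from Module 03 followed by a non-linear activation.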
When you stack many hidden layers, you get deep learning. Deep networks can learn incredibly complex representations — the lower layers detect edges and shapes, middle layers detect features like eyes or wheels, and upper layers recognise whole objects.
This 19-minute video by 3Blue1Brown is widely considered the single best visual explanation of neural networks ever made. If you watch one video from this entire course, make it this one.
Test Yourself
5 questions · 20 points each · 100 total.