Part 2 of 3

How Models Work

Loss functions, gradient descent, overfitting, feature engineering, and neural networks — all explained visually.

5 Modules
30 Minutes
3 Animations

Linear Regression & Loss

03 · Google ML Course

The Simplest ML Model

Linear regression is the "Hello World" of machine learning. The idea: find the straight line (or, with several features, the hyperplane) that best fits your data. It's simple, interpretable, and forms the foundation for understanding many other models.

The model's equation is just a weighted sum of features plus a bias term:

y' = w₁·x₁ + w₂·x₂ + … + wₙ·xₙ + b
y' = predicted output (label) · w = weights (learned) · b = bias (learned) · x = feature values (given)

Equation 1: The linear model equation. y' is the predicted output, w are the learned weights, x are the feature values, and b is the bias term.
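As a minimal sketch of Equation 1, the weighted sum can be written in a few lines of Python. The function name `predict` and all the numbers below are illustrative assumptions, not values from the course.

```python
def predict(weights, bias, features):
    """Return y' = w1*x1 + ... + wn*xn + b for a single example."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sum(w * x for w, x in zip(weights, features)) + bias


# Hypothetical weights, bias, and feature values, chosen only for illustration.
weights = [2.0, -1.0, 0.5]
bias = 4.0
features = [3.0, 1.0, 10.0]

y_pred = predict(weights, bias, features)
print(y_pred)  # 2*3 + (-1)*1 + 0.5*10 + 4 = 14.0
```

During training, only `weights` and `bias` change; the feature values are given by the data.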

Figure 8: Interactive linear regression. Click "Animate Fit" to watch the model iteratively find the best-fit line through the data points.

What is Loss?

A loss function measures how wrong your model's predictions are. It's the number we want to minimise during training. The most common loss for regression is Mean Squared Error (MSE):

[Figure 9 plot: x (input) on the horizontal axis, y (actual) on the vertical axis. Legend: actual value, residual (error), model line.]

Figure 9: The red dashed lines are the residuals, the vertical distances from each actual point to the model line. MSE averages the squares of these distances. Residual = y − y': points above the line have positive residuals; points below have negative ones.

MSE = (1/N) · Σ (y − y')²
N = number of examples · y = actual label · y' = model prediction
Why square the errors? Squaring does two things: it makes all errors positive (so negative and positive errors don't cancel out), and it penalises large errors more heavily — a prediction that's off by 10 contributes 100× more to MSE than one off by 1. This pushes the model to avoid wild mistakes.
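The MSE formula translates directly into code. The sketch below assumes plain Python lists of actual labels and predictions; the function name `mse` and the sample values are made up for illustration.

```python
def mse(actual, predicted):
    """Mean Squared Error: the average of (y - y')**2 over all examples."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("need two non-empty lists of equal length")
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / len(actual)


# Hypothetical labels and predictions.
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]
print(mse(actual, predicted))  # (1 + 0 + 4) / 3, about 1.667

# One error of size 10 costs 100 times as much as one error of size 1.
print(mse([0.0], [10.0]) / mse([0.0], [1.0]))  # 100.0
```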

Gradient Descent

04 · Google ML Course

How Models Actually Learn

We know we want to minimise loss — but how? We can't try every possible combination of weights; even a small model might have millions of parameters. The answer is gradient descent: an iterative algorithm that nudges weights in the direction that reduces loss, step by step.

Imagine you're blindfolded in a hilly landscape and you want to reach the lowest valley. You can't see the whole landscape, but you can feel which direction is downhill right where you're standing. You take a step downhill, feel the slope again, take another step. That's gradient descent.

Figure 10: Animated gradient descent on a loss curve. The ball represents the current weight value; the curve is the loss landscape. Try different learning rates to see how convergence changes.

The Learning Rate

Each step in gradient descent is scaled by a learning rate (α) — a hyperparameter you set before training. It controls how big each step is:

TOO SMALL

Learning rate too low

Training is very slow. The model takes tiny steps and may not converge in a reasonable time, especially in high-dimensional spaces.

JUST RIGHT

Optimal learning rate

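Converges efficiently to a good minimum. This value is usually found by trying several candidates (a learning rate search, one form of hyperparameter tuning). Learning rate scheduling is a related but different idea: it changes the rate during training.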

TOO LARGE

Learning rate too high

The model overshoots the minimum, bouncing around or even diverging (loss goes up instead of down). The ball flies over the valley.

IN PRACTICE

Stochastic Gradient Descent

Computing gradients on the full dataset is slow. SGD uses one random example (or a small mini-batch) per step — much faster, slightly noisy, but works very well in practice.
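The learning-rate regimes described in the cards above can be reproduced on a toy problem. This sketch assumes a one-weight loss L(w) = w², whose slope is 2w and whose minimum is at w = 0; the three rates are hypothetical values picked to show each regime, not recommendations.

```python
def gradient(w):
    """Slope of the toy loss L(w) = w**2."""
    return 2.0 * w


def run_gradient_descent(learning_rate, start=5.0, steps=20):
    """Take `steps` steps against the slope and return the final weight."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w


# Hypothetical learning rates, one per regime.
for name, rate in [("too small", 0.01), ("reasonable", 0.4), ("too large", 1.1)]:
    final_w = run_gradient_descent(rate)
    print(f"{name:>10}: alpha={rate:<4} final w={final_w:.4f} loss={final_w ** 2:.4f}")
```

With the small rate the weight has barely moved after 20 steps, with the middle rate it is essentially at the minimum, and with the large rate each step overshoots so the weight ends up further from zero than where it started.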

The update rule: At each training step, every weight is updated as:
w ← w − α · (∂Loss/∂w)
The term ∂Loss/∂w is the gradient — how much the loss changes when that weight changes. We subtract it because we want to go downhill.
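To make the update rule concrete, here is a minimal sketch that fits a one-feature linear model y' = w·x + b by full-batch gradient descent on MSE. The gradient formulas follow from differentiating MSE; the data, the learning rate, and the step count are assumptions chosen so the example converges.

```python
def fit_linear(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y' = w*x + b by repeatedly applying w <- w - alpha * dLoss/dw."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error (y' - y) for every example under the current w, b.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of MSE with respect to w and b.
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Step downhill: subtract the gradient scaled by the learning rate.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical data generated from y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
w, b = fit_linear(xs, ys)
print(round(w, 3), round(b, 3))  # close to w = 2, b = 1
```

A stochastic variant would compute `errors` on one random example or a small mini-batch per step instead of the whole dataset.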

Generalization & Overfitting

05 · Google ML Course

The Real Goal: Perform Well on New Data

A model that memorises the training data perfectly — but fails on new data — is useless. The true goal of ML is generalization: good performance on examples the model has never seen.

This leads to the most important tension in machine learning, known as the bias-variance tradeoff:

Underfitting: model too simple · high bias, low variance
Good fit: balanced complexity · low bias, low variance
Overfitting: model too complex · low bias, high variance

Figure 11: The three regimes. Underfitting (left) — too simple, misses real patterns. Good fit (centre) — captures the true pattern without chasing noise. Overfitting (right) — memorises noise, performs poorly on new data.

The Train / Validation / Test Split

The standard way to detect overfitting is to hold out data the model never trains on. Google's ML course recommends three splits:

Your full dataset, split three ways:
Training set (70%): the model learns its weights from this
Validation set (15%): used to tune hyperparameters
Test set (15%): final evaluation only

Figure 12: The three-way dataset split. Train on the training set, tune using the validation set, and evaluate exactly once on the test set. Using the test set more than once causes test-set contamination — you inadvertently optimise for it, and your reported accuracy becomes too optimistic.
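A 70/15/15 split can be sketched with the standard library alone. The function name, the fixed seed, and the use of a simple random shuffle are assumptions; time-ordered or grouped data usually needs a different splitting strategy.

```python
import random


def train_val_test_split(examples, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle once, then slice into train / validation / test parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains
    return train, val, test


# 100 hypothetical example ids.
train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # 70 15 15
```

Shuffling before slicing matters: if the file is sorted (by date or by label, say), unshuffled slices would give the three sets different distributions.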

Signs your model is overfitting: Training loss is low but validation loss is high — a gap between the two. The model has memorised the training data rather than learning the underlying pattern. Fixes include: getting more data, reducing model complexity, adding regularization (penalising large weights), or using dropout in neural networks.
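One common way to penalise large weights is L2 regularization, which adds the sum of squared weights to the loss. The sketch below assumes that form; the parameter name `l2_strength` and the numbers are illustrative only.

```python
def l2_regularized_loss(mse_value, weights, l2_strength=0.01):
    """Return MSE plus an L2 penalty: l2_strength * sum of squared weights."""
    penalty = l2_strength * sum(w * w for w in weights)
    return mse_value + penalty


# Hypothetical numbers: two models with the same data fit but different weights.
print(l2_regularized_loss(0.5, [0.3, -0.4], l2_strength=0.1))  # small penalty
print(l2_regularized_loss(0.5, [3.0, -4.0], l2_strength=0.1))  # much larger penalty
```

Because the penalty grows with the square of each weight, training is nudged toward the model with smaller weights when both fit the training data equally well.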

Feature Engineering

06 · Google ML Course

Turning Raw Data into Useful Inputs

Raw data rarely arrives in a form suitable for ML models. Feature engineering is the process of transforming raw data into informative, well-scaled inputs. It's often the most impactful thing a practitioner can do to improve model performance — more than choosing a fancier algorithm.

Numerical Features

Numbers like age, price, or temperature. Usually normalised (scaled to 0–1 or z-scored) so that large-scale features don't dominate small-scale ones. A salary of $80,000 and an age of 35 need to live on comparable scales.
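Both scaling options mentioned above can be sketched as follows. The function names and sample values are assumptions; in a real pipeline the minimum, maximum, mean, and standard deviation would be computed on the training set only and then reused for validation and test data.

```python
import statistics


def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical salaries and ages on very different raw scales.
salaries = [40000, 80000, 120000]
ages = [25, 35, 45]
print(min_max_scale(salaries))  # [0.0, 0.5, 1.0]
print(min_max_scale(ages))      # [0.0, 0.5, 1.0]
print(z_score(salaries))
```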

Categorical Features

Labels like "city" or "colour". Must be converted to numbers. Common techniques: one-hot encoding (a separate 0/1 column per category) or embeddings (learned dense representations, used in NLP).
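One-hot encoding can be sketched in a few lines. The function name and the city list are illustrative assumptions; libraries such as pandas or scikit-learn provide their own encoders for real projects.

```python
def one_hot_encode(values):
    """Return (sorted categories, one 0/1 row per input value)."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one column is "hot" per row
        rows.append(row)
    return categories, rows


# Hypothetical city column.
cities = ["New York", "Boston", "New York", "Chicago"]
categories, rows = one_hot_encode(cities)
print(categories)  # ['Boston', 'Chicago', 'New York']
print(rows)        # [[0, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```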

Binning / Bucketizing

Turning a continuous number into categories. Age → [child, teen, adult, senior]. Useful when the relationship with the label is non-linear: for example, click-through rate might peak in the 25–34 age bucket rather than rising or falling steadily with age.
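A bucketizer is essentially a lookup against a sorted list of boundaries. In the sketch below the age cut-offs are hypothetical; the text above names the buckets but not their boundaries.

```python
import bisect

# Hypothetical cut-offs: <13 child, 13-19 teen, 20-64 adult, 65+ senior.
AGE_BOUNDARIES = [13, 20, 65]
AGE_LABELS = ["child", "teen", "adult", "senior"]


def bucketize(value, boundaries, labels):
    """Map a number to the label of the bucket it falls into."""
    if len(labels) != len(boundaries) + 1:
        raise ValueError("need exactly one more label than boundaries")
    # bisect_right counts how many boundaries are <= value.
    return labels[bisect.bisect_right(boundaries, value)]


for age in [8, 13, 34, 70]:
    print(age, bucketize(age, AGE_BOUNDARIES, AGE_LABELS))
```

The resulting labels are categorical, so they would typically be one-hot encoded before reaching the model.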

Feature Crosses

Combining two or more features into one. Crossing bucketized latitude with bucketized longitude, for example, creates a location cell that is more expressive than either feature alone. Google uses feature crosses extensively in linear models to capture interaction effects cheaply.
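A location cross can be sketched by bucketizing each coordinate into a grid index and joining the two indices into one categorical value. The one-degree cell size, the function names, and the coordinates are assumptions for illustration.

```python
import math


def grid_bucket(value, cell_size):
    """Index of the grid cell that contains `value`."""
    return math.floor(value / cell_size)


def cross_lat_lon(lat, lon, cell_size=1.0):
    """Cross bucketized latitude with bucketized longitude into one feature."""
    return f"lat{grid_bucket(lat, cell_size)}_x_lon{grid_bucket(lon, cell_size)}"


# Approximate, illustrative coordinates.
print(cross_lat_lon(40.71, -74.01))  # lat40_x_lon-75
print(cross_lat_lon(42.36, -71.06))  # lat42_x_lon-72
```

Each distinct cell string then becomes its own one-hot column, which lets a linear model learn a separate weight per location.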

Missing Value Handling

Real datasets have gaps. Common fixes: imputation (replace with mean, median, or a learned value) or a separate is_missing indicator feature. Dropping rows loses data; the right strategy depends on why data is missing.
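Imputation and the indicator feature can be combined, as in this sketch. It assumes missing values are represented as `None` and fills them with the median of the observed values; both choices are illustrative.

```python
import statistics


def impute_with_indicator(values):
    """Fill None with the observed median and return a parallel is_missing flag."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = statistics.median(observed)
    filled = [fill if v is None else v for v in values]
    is_missing = [1 if v is None else 0 for v in values]
    return filled, is_missing


# Hypothetical age column with two gaps.
ages = [34, None, 29, 51, None]
filled, is_missing = impute_with_indicator(ages)
print(filled)      # [34, 34, 29, 51, 34]
print(is_missing)  # [0, 1, 0, 0, 1]
```

Keeping the `is_missing` column lets the model learn whether missingness itself carries signal.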

Feature Selection

Not all features help — some add noise. Techniques like correlation analysis, mutual information, and LASSO regularisation identify which features are truly predictive and which can be dropped to simplify the model.
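Correlation analysis, the simplest of the techniques listed above, can be sketched as ranking features by the absolute Pearson correlation with the label. The feature names and data are made up, and this measure only captures linear relationships, so treat it as a first filter rather than a verdict.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_features(features, label):
    """Sort feature names by |correlation with the label|, strongest first."""
    scores = {name: abs(pearson(column, label)) for name, column in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical columns: one informative, one noisy, one constant.
label = [1, 2, 3, 4, 5]
features = {
    "useful": [2, 4, 6, 8, 10],
    "noisy": [3, 1, 4, 1, 5],
    "constant": [7, 7, 7, 7, 7],
}
for name, score in rank_features(features, label):
    print(f"{name}: {score:.3f}")
```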

Raw Data: "2024-03-15" · "New York" · salary: 95000 · age: 34 · city: "New York"
→ engineer →
Engineered Features: day_of_week: 5 (Friday) · is_northeast: 1 · salary_norm: 0.72 · age_bucket: adult · city_NY: 1, city_Boston: 0
→ model →
ML Model: predicts label

Figure 13: Feature engineering pipeline. Raw text and numbers are transformed into clean, scaled, and informative numerical features before being fed to the model.

The practitioner's insight: "Better data beats fancier algorithms." A carefully engineered feature set fed to a simple linear model often outperforms a complex deep learning model fed raw, messy data. Feature engineering is where domain knowledge pays off: a clinician who knows which lab values matter can often build better medical AI features than a purely automated approach.

Neural Networks

07 · Deep Learning

Inspired by the Brain

A neural network is a type of machine learning model loosely inspired by the human brain. It consists of layers of simple processing units called neurons (or nodes), connected by weighted links. Information flows from the input layer through one or more hidden layers to the output layer.

Each connection has a weight — a number that controls how much influence one neuron has on the next. Training a neural network means adjusting millions of these weights until the network gives accurate outputs.

Input (pixels): x₁, x₂, x₃, x₄ → Hidden → Hidden → Output: Cat 87%, Dog 13%

Figure 14: An animated neural network — pixels flow in from the left, patterns are extracted by hidden layers, and the output predicts "Cat" with 87% confidence. The dashed lines show signals propagating through the network in real time.
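The flow from input through two hidden layers to an output can be sketched as a forward pass. Everything specific here is an assumption made for illustration: the layer sizes (4 → 3 → 3 → 2), the ReLU activation, the softmax output, and the hand-picked, untrained weights, so the printed probabilities will not match the figure.

```python
import math


def dense(inputs, weights, biases):
    """One layer: each neuron computes a weighted sum of its inputs plus a bias."""
    return [sum(w * x for w, x in zip(neuron_w, inputs)) + b
            for neuron_w, b in zip(weights, biases)]


def relu(values):
    """Activation that passes positive signals and zeroes out negative ones."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Turn raw output scores into probabilities that sum to 1."""
    peak = max(values)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(pixels, params):
    """Input -> hidden -> hidden -> output probabilities."""
    hidden1 = relu(dense(pixels, params["w1"], params["b1"]))
    hidden2 = relu(dense(hidden1, params["w2"], params["b2"]))
    return softmax(dense(hidden2, params["w3"], params["b3"]))


# Hypothetical, untrained weights: one row of weights per neuron.
params = {
    "w1": [[0.5, -0.2, 0.1, 0.4], [-0.3, 0.8, 0.2, -0.1], [0.2, 0.1, -0.5, 0.3]],
    "b1": [0.0, 0.1, -0.1],
    "w2": [[0.6, -0.4, 0.2], [0.1, 0.3, -0.2], [-0.5, 0.2, 0.4]],
    "b2": [0.0, 0.0, 0.1],
    "w3": [[0.7, -0.3, 0.2], [-0.6, 0.4, 0.1]],
    "b3": [0.0, 0.0],
}
probabilities = forward([0.9, 0.1, 0.4, 0.7], params)
for label, p in zip(["Cat", "Dog"], probabilities):
    print(f"{label}: {p:.0%}")
```

Training would adjust the numbers in `params` with gradient descent until the probabilities match the labels.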

When you stack many hidden layers, you get deep learning. Deep networks can learn incredibly complex representations — the lower layers detect edges and shapes, middle layers detect features like eyes or wheels, and upper layers recognise whole objects.

Why "deep"? The "depth" refers to the number of hidden layers. Modern large language models (like the ones that power AI chatbots) stack dozens to over a hundred layers and have billions of parameters (weights). This depth, combined with the scale of their training data, underlies much of their capability.

07 · Video

Neural Networks Explained

This 19-minute video by 3Blue1Brown is widely considered the single best visual explanation of neural networks ever made. If you watch one video from this entire course, make it this one.

But what is a neural network?
3Blue1Brown · 19 min · Essential viewing

Part 2 Quiz

5 questions · 20 points each · 100 total.

Well done!

You understand how models learn. Now see the cutting edge.

Part 3: Modern AI →