What we know: The exact update rule. At every single step, we know precisely what will happen: compute the gradient, multiply it by the learning rate, and subtract the result from the parameters.
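A minimal sketch of that rule in NumPy (the quadratic loss, learning rate, and step count here are illustrative, not taken from the demo):

```python
import numpy as np

def sgd_step(params, grad, lr):
    """One gradient-descent update: params <- params - lr * grad."""
    return params - lr * grad

# Toy example: minimize loss(theta) = (theta - 3)^2,
# whose gradient is 2 * (theta - 3).
theta = np.array([0.0])
for _ in range(50):
    theta = sgd_step(theta, grad=2 * (theta - 3), lr=0.1)

print(theta)  # approaches 3.0, the minimizer of the loss
```

That's the whole algorithm. Everything that follows is what happens when this one line of arithmetic is repeated many times.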
What we don't know: How the program that results from repeatedly applying the update rule works internally.
A tiny neural network learns to classify pixel patterns. Watch the weight visualization evolve: these patterns emerge from nothing but gradient descent. You didn't design them; they're emergent.
The floating-point numbers in those weights are executable code: a program that successfully classifies patterns. But the program is inscrutable. It works; we just don't automatically understand how. Hit "RESET NETWORK" and watch completely different weights emerge. Same algorithm, different solution every time.
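A sketch of that loop under assumed details (the 3x3 bar-classification task, the 4-hidden-unit architecture, and the function train_tiny_net are hypothetical stand-ins for the demo, not its actual code):

```python
import numpy as np

def train_tiny_net(seed, steps=5000, lr=0.5):
    """Train a 9-input, 4-hidden-unit network with plain gradient descent.
    The vertical-vs-horizontal-bar task is illustrative only."""
    rng = np.random.default_rng(seed)
    # Six 3x3 images flattened to 9 pixels: label 1 = vertical bar, 0 = horizontal bar.
    X = np.array([
        [1,0,0, 1,0,0, 1,0,0],
        [0,1,0, 0,1,0, 0,1,0],
        [0,0,1, 0,0,1, 0,0,1],
        [1,1,1, 0,0,0, 0,0,0],
        [0,0,0, 1,1,1, 0,0,0],
        [0,0,0, 0,0,0, 1,1,1],
    ], dtype=float)
    y = np.array([[1],[1],[1],[0],[0],[0]], dtype=float)

    # Random initialization: this is what "RESET NETWORK" redraws.
    W1 = rng.normal(scale=0.5, size=(9, 4)); b1 = np.zeros(4)
    W2 = rng.normal(scale=0.5, size=(4, 1)); b2 = np.zeros(1)

    for _ in range(steps):
        H = np.tanh(X @ W1 + b1)               # hidden activations
        P = 1 / (1 + np.exp(-(H @ W2 + b2)))   # sigmoid output probabilities
        dZ2 = (P - y) / len(y)                 # cross-entropy gradient at the output
        dW2, db2 = H.T @ dZ2, dZ2.sum(axis=0)
        dH = (dZ2 @ W2.T) * (1 - H**2)         # backprop through tanh
        dW1, db1 = X.T @ dH, dH.sum(axis=0)
        for param, grad in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
            param -= lr * grad                 # the same update rule as above

    return W1

# Same algorithm, different seed: visibly different learned weights,
# each a working "program" that the procedure found on its own.
print(np.round(train_tiny_net(seed=0), 2))
print(np.round(train_tiny_net(seed=1), 2))
```

Nothing in this code specifies what the hidden units should detect; the random initialization breaks the symmetry, and gradient descent carries each run to a different working solution.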
Why this matters: This is the essence of
modern AI. We design the optimization process, not the
system itself. We understand the principle (minimize loss via
gradient descent), but cannot predict what features,
representations, or behaviors will emerge from billions of
these simple steps. GPT-4 is just this, scaled up.
Modern AI systems are grown, not built.
"We have a very good idea of sort of roughly what it's doing. But as soon as it gets really complicated, we don't actually know what's going on any more than we know what's going on in your brain."
60 Minutes (Pelley: "What do you mean we don't know exactly how it works? It was designed by people.") "No, it wasn't… we designed the learning algorithm… But we don't really understand exactly how they do those things."