# Why Machines Learn
**Anil Ananthaswamy**

---
_Machine learning is fundamentally about finding boundaries in high-dimensional space. The cleverness is in how you find them._
Ananthaswamy traces the conceptual history from Rosenblatt's perceptron through to deep learning, and the book's great gift is a single clarifying framing: machine learning is geometry. Every algorithm, from the simplest linear classifier to the deepest neural network, is trying to draw a boundary that separates one kind of thing from another. A hyperplane is a line in two dimensions, a plane in three, and a generalisation of the same idea in any number of dimensions. If your data can be cleanly divided by such a boundary, the perceptron will find it; the convergence theorem guarantees it. But most interesting data cannot be divided that way in its original dimensions. The history of the field is the history of finding increasingly creative ways to draw boundaries in spaces too complex for human intuition to visualise.
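The perceptron's update rule is simple enough to sketch in a few lines. This is a minimal NumPy version on a toy dataset of my own (the AND function, which is linearly separable), not an example from the book:

```python
import numpy as np

# Linearly separable toy data: the AND function over {0, 1}^2.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, -1, -1, 1])   # labels in {-1, +1}

w = np.zeros(2)   # weights set the hyperplane's orientation
b = 0.0           # bias shifts it away from the origin

# Rosenblatt's rule: nudge the hyperplane toward each misclassified
# point. The convergence theorem guarantees this loop terminates
# whenever a separating hyperplane exists.
converged = False
while not converged:
    converged = True
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:   # point on the wrong side
            w += yi * xi             # rotate the boundary toward it
            b += yi
            converged = False

print(w, b)   # one separating hyperplane: w @ x + b = 0
```

The loop only ever reacts to mistakes, which is why separability matters: on inseparable data it would cycle forever.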
---
**The kernel trick is the manoeuvre that makes this tractable.** Take data that's inseparable in low dimensions and project it into higher dimensions where a separating boundary does exist. Support vector machines do this, and the kernel function lets you compute in the high-dimensional space without ever explicitly transforming the data. The optimal boundary isn't just any separator; it maximises the margin between the boundary and the nearest data points. That margin is what makes the model generalise to new data rather than just memorising the training set. Backpropagation, the algorithm that makes deep learning possible, solved a different problem: how to adjust millions of weights in a multi-layer network by propagating error signals backward using the chain rule from calculus. The algorithm itself is pure maths. Applying it to networks with many layers required computational power that took decades to arrive, which is why the 1986 breakthrough didn't produce deep learning immediately.
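The lift-then-separate move can be shown concretely. Below is a toy 1-D dataset of my own choosing (not the book's example) that no single threshold can split, an explicit lift via the hypothetical map phi(x) = (x, x²), and a kernel that reproduces the lifted inner products without constructing phi at all:

```python
import numpy as np

# 1-D data no threshold can separate: the positives sit between
# the negatives.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.array([-1, 1, 1, 1, -1])

# Explicit lift into 2-D: phi(x) = (x, x^2). In the lifted space the
# classes ARE linearly separable -- the line x2 = 2.5 splits them.
phi = np.stack([x, x**2], axis=1)
assert all(yi * (2.5 - p[1]) > 0 for p, yi in zip(phi, y))

# The kernel trick: for phi(a) = (a, a^2),
#   <phi(a), phi(b)> = a*b + (a*b)^2
# so the lifted inner product is a cheap function of the originals.
def kernel(a, b):
    return a * b + (a * b) ** 2

# The kernel's Gram matrix matches the explicit one exactly.
assert np.allclose(phi @ phi.T, kernel(x[:, None], x[None, :]))
```

With five points and two dimensions the saving is trivial, but the same identity holds when the lifted space has thousands of dimensions, or infinitely many, which is the case the trick was invented for.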
What changed about my intuition after reading this: I'd always thought of ML as pattern matching, as if the computer were looking at examples and learning to recognise similar ones. The geometric framing is more honest. The algorithm doesn't "understand" anything. It finds a boundary that separates the training data well enough that new data falling on one side or the other gets classified correctly. When ML fails, it's usually because the boundary was drawn in a space that didn't capture the relevant dimensions, or because the training data didn't represent the territory the model would actually encounter.
---
**Most machine learning is inherently probabilistic, even when the algorithm wasn't designed to be.** The outputs are [[Confidence]] levels, not certainties. A model that says "92% probability this image is a cat" is drawing a boundary and reporting how far inside the boundary the data point sits. Bayesian thinking underlies much of modern ML: you start with [[Priors]], update them based on evidence, and get a posterior distribution that tells you what to believe given what you've observed. Most of us are intuitive frequentists, thinking probability means counting how often things happen. The Bayesian framing is more general and more useful. It asks: what should I believe now, given everything I've seen?
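The prior-to-posterior update is mechanical once written out. A minimal sketch with invented numbers (two hypotheses about a coin; neither the coin nor the probabilities come from the book):

```python
# Two hypotheses about a coin: fair (P(H) = 0.5) or biased (P(H) = 0.8).
priors = {"fair": 0.5, "biased": 0.5}

P_HEADS = {"fair": 0.5, "biased": 0.8}

def likelihood(hypothesis, flip):
    p = P_HEADS[hypothesis]
    return p if flip == "H" else 1 - p

def update(beliefs, flip):
    # Bayes' rule: posterior is proportional to prior x likelihood,
    # then normalised so the beliefs sum to one.
    unnorm = {h: b * likelihood(h, flip) for h, b in beliefs.items()}
    total = sum(unnorm.values())
    return {h: v / total for h, v in unnorm.items()}

beliefs = priors
for flip in "HHHTH":          # the observed evidence, one flip at a time
    beliefs = update(beliefs, flip)

print(beliefs)   # posterior after five flips: "biased" now leads
```

Each flip shifts the posterior a little; after four heads and one tail the biased hypothesis has pulled ahead, but not to certainty, which is exactly the honesty the framing buys you.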
This matters beyond ML. Any time you encounter a confident-sounding prediction, from an algorithm or a person, the right question is: what's the probability distribution around that prediction? How wide is the uncertainty? Where did the priors come from? The probabilistic framing doesn't make predictions less useful. It makes them honest. And it makes you a better consumer of every model, human or machine, that claims to know what's coming.
---
**Constrained optimisation is the core mathematical problem throughout.** The task is always to find the best solution subject to real-world constraints, and Lagrange multipliers, gradient descent, and related techniques are the tools. This is the same mathematics that appears in economics, engineering, and physics whenever you're trying to maximise something subject to limitations. A computer itself is a dynamical system: its behaviour evolves from state to state with each tick of the clock. That framing, borrowed from physics, connects ML to thermodynamics and information theory in ways that illuminate why the field developed as it did.
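Gradient descent, the workhorse of this family, fits in a few lines. A minimal sketch on a quadratic of my own invention (not a problem from the book): step repeatedly in the direction of steepest descent until the gradient vanishes.

```python
import numpy as np

# A toy objective with its minimum at w = (1, -2).
def f(w):
    return (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2

def grad_f(w):
    # Gradient computed by hand via the chain rule.
    return np.array([2 * (w[0] - 1.0), 2 * (w[1] + 2.0)])

w = np.array([5.0, 5.0])    # arbitrary starting point
lr = 0.1                    # learning rate (step size)
for _ in range(200):
    w -= lr * grad_f(w)     # step downhill along the negative gradient

print(w)   # converges to approximately [1, -2]
```

The same loop, with `grad_f` supplied by backpropagation instead of by hand, is what trains a deep network; only the function being descended changes.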
The field didn't stall because the mathematics was wrong. The perceptron's limitations were real but narrow. The algorithms that worked in theory couldn't run on the hardware of their time. Both constraints were eventually overcome, and the pattern is instructive: the mathematics had been patient and correct the whole time. The engineering just needed to catch up.
---