A Map of Machine Learning Principles and Algorithms

Here is my attempt to map out the major classes of algorithms in Machine Learning, organised around the associated induction principles and learning theory. The usual caveats apply around this being biased towards my own experience.

At the highest level, we can distinguish between the Passive and Active learning settings. In the passive case, the agent receives data chosen by the environment and its job is to learn a model to predict what’s coming next. In the active case, in addition to receiving data from the environment, the agent has a set of actions it can take at every time step so its job is to learn a policy, a mapping from observations or states to an action, to maximise some measure of long-term reward. In sufficiently rich domains, one can have nested passive and active settings, like in state-of-the-art large language models and reinforcement learning agents.

The most standard passive setting is the supervised learning setting, where the agent is given a set of labelled examples of the form $\{ (x_i,y_i) \}$ drawn from some unknown probability distribution, and the goal for the agent is to learn a model $f : X \rightarrow Y$ that, given arbitrary $x$ , can predict the corresponding $y$ with high probability of success. The labelled examples can come one at a time or in a large batch, and the underlying unknown probability distribution can stay the same or change over time. A variation of the supervised learning setting is the sequential prediction setting, where at every time step $t$ the agent has to predict the next observation $x_t$ given the history of observations up till then $x_1x_2\ldots x_{t-1}$ . Here, the set of labelled examples given to the agent takes effectively this form: $\{ (x_1,x_2), (x_1x_2,x_3), \cdots, (x_1x_2\ldots x_{t-1}, x_t) \}$ .

In general terms, given the labelled examples $S := \{ (x_i,y_i) \}$ drawn from an unknown distribution $D$ , a class $F$ of possible models, and a loss function $l(\cdot,\cdot)$ , the agent’s goal is to find an $f \in F$ where the expected loss $l(f(x),y)$ is low for new previously unseen pairs $(x,y)$ drawn from $D$ . The algorithmic strategies for the different passive learning settings can be quite different, of course. From Bayesian probability theory, the optimal solution is given by $f^*(x) = \sum_{f \in F} Pr(f | S) f(x)$ , where $Pr(f | S) = Pr(S | f) Pr(f) / Pr(S)$ is the posterior probability that $f$ is the true underlying model given we have seen $S$ , and $Pr(f)$ is the prior probability of $f$ , with less complex models given higher probabilities in accordance with Occam’s razor. The class of algorithms listed under Bayesian Inference -> Mixtures / Ensembles are all algorithms that directly solve for $f^*$ . In the rare cases where the model class $F$ has nice mathematical structures (like those that satisfy generalised distributive laws), the Bayesian optimal solution can be computed exactly in a computationally efficient manner, and this strategy yields famous algorithms like the Context Tree Weighting family of algorithms and the Hedge family of algorithms under the so-called Tracking the Best Expert setting. In all other cases, an approximation of the full summation $\sum_{f \in F} Pr(f | S) f(x)$ is necessary.

The first approach, exemplified by the Boosting, Random Subspace and MCMC families of algorithms, is to approximate $f^*$ by averaging over only a subset of the $f \in F$ that may have high posterior probabilities. Boosting is a gradient-based approach to constructing an ensemble of weak-learners that can be shown to optimise a notion of margin on the training labelled examples. Random subspace methods like Bagging and Random Forests combine random projection, bootstrapping, and majority voting ideas to build practical, low-variance and high accuracy ensemble models. In addition to those, there are also many Markov Chain Monte Carlo algorithms for estimating mixture models from data, especially when the individual elements in the mixture model are distributions from the exponential family.

The second approach, called Maximum A Posteriori (MAP), to approximating $f^*$ is to find the model in $F$ with the highest posterior probability $f_{\it map} = \arg \max_{f} Pr(S | f)Pr(f)$ . This optimisation can be solved exactly for distributions where the prior $Pr(f)$ and the likelihood $Pr(S | f)$ are so-called conjugate distributions, in that their multiplication result in a new distribution with the same functional form as the prior distribution. Examples of such conjugate distributions can be found at https://en.wikipedia.org/wiki/Conjugate_prior. In cases where the MAP optimisation problem cannot be solved in close form, we need to rely on the mathematical structure of the model class $F$ and the loss function $l(\cdot,\cdot)$ to make progress. The entire field of Mathematical Programming / Mathematical Optimisation is dedicated to the study of such problems, and a lot of work in Machine Learning is in designing how we can (re)formulate a learning-from-data problem as an efficiently solvable optimisation problem, ideally as a convex optimisation problem that can be solved exactly. Importantly, the prior probability of models usually become a regularisation term in the optimisation problem, making it a well-defined problem. In cases where we cannot directly reformulate a learning-from-data problem as an easy optimisation problem, one can resort to defining an upper bound or lower bound to the desired optimisation criteria and solve that easier problem instead, a technique generally referred to as variational inference, which is closely related to the expectation-maximisation technique. If even that cannot be done, which is not unusual when the model class $F$ has a discrete structure, then one have to resort to approximate search algorithms that make use of local topology, or draw inspiration from biological processes (evolutionary search) or physical processes (simulated annealing).

The third and final approach to approximating $f^*$ is Maximum Likelihood, which is similar to the MAP approach except that we assume all models have equal prior probability, which means we seek the model that maximises the likelihood $f_{\it ml} = \arg \max_{f \in F} Pr(S | f)$ . As for MAP estimation, the techniques of Mathematical Programming, Expectation Maximisation / Variational Inference, and Approximate search also apply to Maximum Likelihood estimation.

In the diagram, we have also shown a collection of techniques grouped under Representation Learning in the Passive setting. Compression methods include, among other things, clustering algorithms that are informed by the Minimum Description Length / Minimum Message Length principle, or algorithmic information-theoretic constructs like Normalized Compression Distance. Manifold Learning include the different non-linear dimensionality reduction techniques that generalise Principal Component Analysis, including Isomap, Spectral Embedding, multi-dimensional scaling, t-SNE, etc. Another class of algorithms for computing embeddings are Auto Encoders, which are typically 3-layer neural networks where the input and output layer have the same number of nodes but the middle layer has a smaller number of nodes. By training identity functions on such neural networks, we rely on principles like the Information Bottleneck method in the middle layer to generate good compressed representations of the training data. Graph neural networks, and their predecessors like word2vec, are yet another class of techniques for computing embedding of words / tokens into numeric vectors that are then used in further downstream processing. These embeddings can usually capture some contextual semantics of words / tokens based on what other words / tokens tend to co-occur in their neighbourhood. Together with the Attention Mechanism, embedding techniques like Auto-encoders and graph neural networks have been used very successfully as architectural components in Transformers that underlie a lot of the success behind large language models in recent years.

We now move on to Active learning. The simplest Active setup is the multi-armed bandit problem, where the agent has to come up with a strategy of choosing, at each time step, N different possible actions, each of which has an unknown payoff. This is a foundational problem whose solution requires optimally balancing exploration and exploitation, the former to get better data to estimate the payoff of each action, and the latter to achieve faster and better rewards using payoff estimates based on data collected so far. In the more general reinforcement learning setup, the agent can be in many different underlying states, which may or may not be directly observable to the agent, and taking an action in a state can result in a reward signal that is usually sparse, so most actions in most states result in no feedback from the environment. The goal of the agent is then to learn from data collected from interactions with the environment to learn an action-selection policy that seeks to achieve long-term average or discounted accumulated rewards. This optimisation problem can be formulated using the Bellman equation, and various stochastic dynamic programming algorithms have been proposed for solving the optimisation problem. Algorithms like Q-learning and Temporal Difference learning can solve the Bellman equation exactly for small state and action spaces. For problems with large state, observation, and action spaces, we usually have to resort to state abstraction techniques and / or function approximation techniques, the latter either approximate the value function, the Q-function, or the policy function itself. The different choices can lead to different algorithmic strategies, which are shown in the diagram.

That’s a bit of a whirlwind tour of major machine learning principles and algorithms. Hope it is useful to you.

Mental Models 4 Life

A Map of Machine Learning Principles and Algorithms

Leave a comment Cancel reply

Mental Models 4 Life

Share this:

Related

Leave a comment Cancel reply