Welcome,
This is my first post ever , I’d love to get your feedback.
I will try to summarize important terms and concepts of machine learning. The focus is on intuition but there will also be practical and theoretical notes.
Target Audience, this is for you if you:
Having a bad memory but being (at least considering myself to be ) a philomath who loves machine learning, I developed the habit of taking notes, then summarizing and finally making a cheat sheet for every new ML domain I encounter. There are multiple reasons I want to switch to a web-page:
You do not really understand something unless you can explain it to your grandmother. - Albert EinsteinMy grandma's are awesome but not really into ML (yet). You have thus been designated "volunteer" to temporarily replace them.
Keeping a uniform notation in machine learning is not easy as many sub-domains use different notations due to historical reasons. I will try using a uniform notation for all the glossary:
To make it easier to search the relevant information in the Glossary here is the color coding I will be using:
Disclaimer:
Ctrl+f
.Enough talking: prepare your popcorn and let’s get going !
The Curse of dimensionality refers to various practical issues when working with high dimensional data. These are often computational problems or counter intuitive phenomenas, coming from our Euclidean view of the 3 dimensional world.
They are all closely related but I like to think of 3 major issues with high dimensional inputs $x \in \mathbb{R}^d, \ d \ggg 1$:
You need exponentially more data to fill in a high dimensional space. I.e. if the dataset size is constant, increasing the dimensions makes your data sparser.
Intuition : The volume size grows exponentially with the number of dimensions. Think of filling a $d$ dimensional unit hypercube with points at a $0.1$ interval. In 1 dimension we need $10$ of these points. In 2 dimension we already need 100 of these. In $d$ dimension we need $10^d$ observation !
Let’s look at a simple example :
Imagine we trained a certain classifier for distinguishing between and . Now we want to predict the class of an unkown observation . Let’s assume that:
Given only 1 feature (1D), we would simply need to look at $30\%$ of the dimension values. In 2D we would need to look at $\sqrt{0.3}=54.8\%$ of each dimensions. In 3D we would need $\sqrt[3]{0.3}=66.9\%$ of in each dimensions. Visually:
.
In order to keep a constant support (i.e. amount of knowledge of the space), we thus need more data when adding dimensions. In other words, if we add dimensions without adding data, there will be large unknown sub-spaces. This is called sparsity.
I have kept the same number of observation in the plots, so that you can appreciate how “holes” appear in our training data as the dimension grows.
Disadvantage : The data sparsity issue causes machine learning algorithms to fail finding patterns or to overfit.
Basically, the volume of a high dimensional orange is mostly in its skin and not in the pulp! Which means expensive high dimensional juices
Intuition : The volume of a sphere depends on $r^d$. The skin has a slightly greater $r$ than the pulp, in high dimensions this slight difference will become very important.
If you’re not convinced, stick with my simple proof. Let’s consider a $d$ dimensional unit orange (i.e. $r=1$), with a skin of width $\epsilon$. Let’s compute the ratio of the volume in the skin to the total volume of the orange. We can avoid any integrals by noting that the volume of a hypersphere is proportional to $r^d$ i.e. : $V_{d}(r) = k r^{d}$.
\[\begin{align*} ratio_{skin/orange}(d) &= \frac{V_{skin}}{V_{orange}} \\ &= \frac{V_{orange} - V_{pulp}}{V_{orange}} \\ &= \frac{V_{d}(1) - V_{d}(1-\epsilon) }{V_{d}(1)} \\ &= \frac{k 1^d - k (1-\epsilon)^d}{k 1^d} \\ &= 1 - (1-\epsilon)^d \end{align*}\]Taking $\epsilon = 0.05$ as an example, here is the $ratio_{skin/orange}(d)$ we would get:
Side Notes : The same goes for hyper-cubes: most of the mass is concentrated at the furthest points from the center (i.e. the corners). That’s why you will sometimes hear that hyper-cubes are “spiky”. Think of the $[-1,1]^d$ hyper-cube: the distance from the center of the faces to the origin will trivially be $1 \ \forall d$, while the distance to each corners will be $\sqrt{d}$ (Pythagorean theorem). So the distance to corners increases with $d$ but not the center of the faces, which makes us think of spikes. This is why you will sometimes see such pictures :
.
There’s nothing that makes Euclidean distance intrinsically meaningless for high dimensions. But due to our finite number of data, 2 points in high dimensions seem to be more “similar” due to sparsity and basic probabilities.
Intuition :
In such discussions, people often cite a theorem stating that for i.i.d points in high dimension, a query point $\mathbf{x}^{(q)}$ converges to the same distance to all other points $P=\{\mathbf{x}^{(n)}\}_{n=1}^N$ :
\[\lim_{d \to \infty} \mathop{\mathbb{E}} \left[\frac{\max_{n} \, (\mathbf{x}^{(q)},\mathbf{x}^{(n)})}{\min_{n} \, (\mathbf{x}^{(q)},\mathbf{x}^{(n)})} \right] \to 1\]Practical : using dimensionality reduction often gives you better results for subsequent steps due to this curse. It makes the algorithm converge faster and reduces overfitting. But be careful not to underfit by using too few features.
Side Notes :
Resources : Great post about the curse of dimensionality in classification which inspired me, On the Surprising Behavior of Distance Metrics in High Dimensional Space is a famous paper which proposes the use of fractional distance metrics, nice blog of simulations.
Images modified from: oranges, 7D cube
Side Notes : The focus is on binary classification but most scores can be generalized to the multi-class setting. Often this is achieved by only considering “correct class” and “incorrect class” in order to make it a binary classification, then you average (weighted by the proportion of observation in the class) the score for each classes.
Resources : Additional scores based on confusion matrix
These two major model types, distinguish themselves by the approach they are taking to learn. Although these distinctions are not task-specific task, you will most often hear those in the context of classification.
In classification, the task is to identify the category $y$ of an observation, given its features $\mathbf{x}$: $y \vert \mathbf{x}$. There are 2 possible approaches:
Some of advantages / disadvantages are equivalent with different wording. These are rule of thumbs !
Rule of thumb : If your need to train the best classifier on a large data set, use a discriminative model. If your task involves more constraints (online learning, semi supervised learning, small dataset, …) use a generative model.
Let’s illustrate the advantages and disadvantage of both methods with an example . Suppose we are asked to construct a classifier for the “true distribution” below. There are two training sets: “small sample” and “large sample”. Suppose that the generator assumes point are generated from a Gaussian.
How well will the algorithms distinguish the classes in each case ?
.
This was simply an example that hopefully illustrates the advantages and disadvantages of needing more assumptions. Depending on their assumptions, some generative models would find the small red cluster.
Resources : A. Ng and M. Jordan have a must read paper on the subject, T. Mitchell summarizes very well these concepts in his slides, and section 8.6 of K. Murphy’s book has a great overview of pros and cons, which strongly influenced the devoted section above.
Given a random variable $X$ and a possible outcome $x_i$ associated with a probability $p_X(x_i)=p_i$, the information content (also called self-information or surprisal) is defined as :
\[\operatorname{I} (p_i) = - log(p_i)\]Intuition : The “information” of an event is higher when the event is less probable. Indeed, if an event is not surprising, then learning about it does not convey much additional information.
Example : “it is currently cold in Antarctica” does not convey much information as you knew it with high probability. “It is currently very warm in Antarctica” carries a lot more information and you would be surprised about it.
Side note :
Intuition :
Side notes :
The concept of entropy is central in both thermodynamics and information theory, and I find that quite amazing. It originally comes from statistical thermodynamics and is so important, that it is carved on Ludwig Boltzmann’s grave (one of the father of this field). You will often hear:
These 2 way of thinking may seem different but in reality they are exactly the same. They essentially answer: how hard is it to describe this thing?
I will focus here on the information theory point of view, because its interpretation is more intuitive for machine learning. I also don’t want to spend to much time thinking about thermodynamics, as people that do often commit suicide .
\[H(X) = H(p) \equiv \mathbb{E}\left[\operatorname{I} (p_i)\right] = \sum_{i=1}^K p_i \ \log(\frac{1}{p_i}) = - \sum_{i=1}^K p_i\ log(p_i)\]In information theory there are 2 intuitive way of thinking of entropy. These are best explained through an example :
Suppose my friend Claude offers me to join him for a NBA game (Cavaliers vs Spurs) tonight. Unfortunately I can’t come, but I ask him to record who scored each field goals. Claude is very geeky and uses a binary phone which can only write 0 and 1. As he doesn’t have much memory left, he wants to use the smallest possible number of bits.
Taking $\alpha = 1 $ for simplicity, we get $surprise(p_i) = -log(p_i) = nBit(p_i)$. We thus derived a formula for computing the surprise associated with event $x_i$ and the optimal number of bits that should be used to encode that event. This value is called information content $I(p_i)$. In order to get the average surprise / number of bits associated with a random variable $X$ we simply have to take the expectation over all possible events (i.e average weighted by probability of event). This gives us the entropy formula $H(X) = \sum_i p_i \ \log(\frac{1}{p_i}) = - \sum_i p_i\ log(p_i)$
From the example above we see that entropy corresponds to :
Side notes :
Resources : Excellent explanation of the link between entropy in thermodynamics and information theory, friendly introduction to entropy related concepts
Differential entropy (= continuous entropy), is the generalization of entropy for continuous random variables.
Given a continuous random variable $X$ with a probability density function $f(x)$:
\[h(X) = h(f) := - \int_{-\infty}^{\infty} f(x) \log {f(x)} \ dx\]If you had to make a guess, which distribution maximizes entropy for a given variance ? You guessed it : it’s the Gaussian distribution.
Side notes : Differential entropy can be negative.
We saw that entropy is the expected number of bits used to encode an observation of $X$ under the optimal coding scheme. In contrast cross entropy is the expected number of bits to encode an observation of $X$ under the wrong coding scheme. Let’s call $q$ the wrong probability distribution that is used to make a coding scheme. Then we will use $-log(q_i)$ bits to encode the $i^{th}$ possible values of $X$. Although we are using $q$ as a wrong probability distribution, the observations will still be distributed based on $p$. We thus have to take the expected value over $p$ :
\[H(p,q) = \mathbb{E}_p\left[\operatorname{I} (q_i)\right] = - \sum_i p_i \log(q_i)\]From this interpretation it naturally follows that:
Side notes : Log loss is often called cross entropy loss, indeed it is the cross-entropy between the distribution of the true labels and the predictions.
The Kullback-Leibler Divergence (= relative entropy = information gain) from $q$ to $p$ is simply the difference between the cross entropy and the entropy:
\[\begin{align*} D_{KL}(p\|q) &= H(p,q) - H(p) \\ &= [- \sum_i p_i \log(q_i)] - [- \sum_i p_i \log(p_i)] \\ &= \sum_i p_i \log(\frac{p_i}{q_i}) \end{align*}\]Intuition
KL divergence is often used with probability distribution of continuous random variables. In this case the expectation involves integrals:
\[D_{KL}(p \parallel q) = \int_{- \infty}^{\infty} p(x) \log(\frac{p(x)}{q(x)}) dx\]In order to understand why KL divergence is not symmetrical, it is useful to think of a simple example of a dice and a coin (let’s indicate head and tails by 0 and 1 respectively). Both are fair and thus their PDF is uniform. Their entropy is trivially: $H(p_{coin})=log(2)$ and $H(p_{dice})=log(6)$. Let’s first consider $D_{KL}(p_{coin} \parallel p_{dice})$. The 2 possible events of $X_{dice}$ are 0,1 which are also possible for the coin. The average number of bits to encode a coin observation under the dice encoding, will thus simply be $log(6)$, and the KL divergence is of $log(6)-log(2)$ additional bits. Now let’s consider the problem the other way around: $D_{KL}(p_{dice} \parallel p_{coin})$. We will use $log(2)=1$ bit to encode the events of 0 and 1. But how many bits will we use to encode $3,4,5,6$ ? Well the optimal encoding for the dice doesn’t have any encoding for these as they will never happen in his world. The KL divergence is thus not defined (division by 0). The KL divergence is thus not symmetric and cannot be a distance.
Side notes : Minimizing cross entropy with respect to $q$ is the same as minimizing $D_{KL}(p \parallel q)$. Indeed the 2 equations are equivalent up to an additive constant (the entropy of $p$) which doesn’t depend on $q$.
Intuition : The mutual information between 2 random variables X and Y measures how much (on average) information about one of the r.v. you receive by knowing the value of the other. If $X,\ Y$ are independent, then knowing $X$ doesn’t give information about $Y$ so $\operatorname{I} (X;Y)=0$ because $p(x,y)=p(x)p(y)$. The maximum information you can get about $Y$ from $X$ is all the information of $Y$ i.e. $H(Y)$. This is the case for $X=Y$ : $\operatorname{I} (Y;Y)= \sum_{y \in \mathcal Y} p(y) \log{ \left(\frac{p(y)}{p(y)\,p(y)} \right) = H(Y) }$
Side note :
This is all interesting, but why are we talking about information theory concepts in machine learning ? Well it turns our that many ML algorithms can be interpreted with entropy related concepts.
The 3 major ways we see entropy in machine learning are through:
Maximizing information gain (i.e entropy) at each step of our algorithm. Example:
Minimizing KL divergence between the actual unknown probability distribution of observations $p$ and the predicted one $q$. Example:
Minimizing KL divergence between the computationally intractable $p$ and a simpler approximation $q$. Indeed machine learning is not only about theory but also about how to make something work in practice.Example:
There is no one model that works best for every problem.
Let’s try predicting the next fruit in the sequence:
You would probably say right ? Maybe with a lower probability you would say . But have you thought of saying ? I doubt it. I never told you that the sequence was constrained in the type of fruit, but naturally we make assumptions that the data “behaves well”. The point here is that without any knowledge/assumptions on the data, all future data are equally probable.
The theorem builds on this, and states that every algorithm has the same performance when averaged over all data distributions. So the average performance of a deep learning classifier is the same as random classifiers.
Side Notes :
Resources : D. Wolpert’s proof.
These 2 types of methods distinguish themselves based on their answer to the following question: “Will I use the same amount of memory to store the model trained on $100$ examples than to store a model trained on $10 000$ of them ? “ If yes then you are using a parametric model. If not, you are using a non-parametric model.
Practical : Start with a parametric model. It’s often worth trying a non-parametric model if: you are doing clustering, or the training data is not too big but the problem is very hard.
Side Note : Strictly speaking any non-parametric model could be seen as a infinite-parametric model. So if you want to be picky: next time you hear a colleague talking about non-parametric models, tell him it’s in fact parametric. I decline any liability for the consequence on your relationship with him/her .
Supervised learning tasks tackle problems that have labeled data.
Intuition : It can be thought of a teacher who corrects a multiple choice exam. At the end you will get your average result as well as the correct answer to any of the questions.
Supervised learning can be further separated into two broad type of problems:
The classification problem consists of assigning a set of classes/categories to an observation. I.e \(\mathbf{x} \mapsto y,\ y \in \{0,1,...,C\}\)
Classification problems can be further separated into:
Common evaluation metrics include Accuracy, F1-Score, AUC… I have a section devoted for these classification metrics.
Compare to : Regression
The basic idea behind building a decision tree is to :
Here is a little gif showing these steps:
Note: For more information, please see the “details” and “Pseudocode and Complexity” drop-down below.
The idea behind decision trees is to partition the input space into multiple regions. E.g. region of men who are more than 27 years old. Then predict the most probable class for each region, by assigning the mode of the training data in this region. Unfortunately, finding an optimal partitioning is usually computationally infeasible (NP-complete) due to the combinatorially large number of possible trees. In practice the different algorithms thus use a greedy approach. I.e. each split of the decision tree tries to maximize a certain criterion regardless of the next splits.
How should we define an optimality criterion for a split? Let’s define an impurity (error) of the current state, which we’ll try to minimize. Here are 3 possible state impurities:
Here is a quick graph showing the impurity depending on a class distribution in a binary setting:
Side Notes :
When should we stop splitting? It is important not to split too many times to avoid over-fitting. Here are a few heuristics that can be used as a stopping criterion:
Such heuristics require problem-dependent thresholds (hyperparameters), and can yield relatively bad results. For example decision trees might have to split the data without any purity gain, to reach high purity gains at the following step. It is thus common to grow large trees using the number of training example in a leaf node as a stopping criterion. To avoid over-fitting, the algorithm would prune back the resulting tree. In CART, the pruning criterion $C_{pruning}(T)$ balances impurity and model complexity by regularization. The regularized variable is often the number of leaf nodes $\vert T \vert$, as below:
\[C_{pruning}(T) = \sum^{\vert T \vert }_{v=1} I(T,v) + \lambda \vert T \vert\]$\lambda$ is selected via cross validation and trades-off impurity and model complexity, for a given tree $T$, with leaf nodes $v=1…\vertT \vert$ using Impurity measure $I$.
Variants: there are various decision tree methods, that differ with regards to the following points:
Famous variants:
Other variants include : C5.0 (next version of C4.5, probably less used because it’s patented), MARS.
Resources : A comparative study of different decision tree methods.
def buildTree(X,Y):
if stop_criteria(X,Y) :
# if stop then store the majority class
tree.class = mode(X)
return Null
minImpurity = infinity
bestSplit = None
for j in features:
for T in thresholds:
if impurity(X,Y,j,T) < minImpurity:
bestSplit = (j,T)
minImpurity = impurity(X,Y,j,T)
X_left,Y_Left,X_right,Y_right = split(X,Y,bestSplit)
tree.split = bestSplit # adds current split
tree.left = buildTree(X_left,Y_Left) # adds subsequent left splits
tree.right buildTree(X_right,Y_right) # adds subsequent right splits
return tree
def singlePredictTree(tree,xi):
if tree.class is not Null:
return tree.class
j,T = tree.split
if xi[j] >= T:
return singlePredictTree(tree.right,xi)
else:
return singlePredictTree(tree.left,xi)
def allPredictTree(tree,Xt):
t,d = Xt.shape
Yt = vector(d)
for i in t:
Yt[i] = singlePredictTree(tree,Xt[i,:])
return Yt
Let’s first think about the complexity for building the first decision stump (first function call):
We now have the complexity of a decision stump. You could think that finding the complexity of building a tree would be multiplying it by the number of function calls: Right? Not really, that would be an over-estimate. Indeed, at each function call, the training data size $N$ would have decreased. The intuition for the result we are looking for, is that at each level $l=1…M$ the sum of the training data in each function is still $N$. Multiple function working in parallel with a subset of examples take the same time as a single function would, with the whole training set $N$. The complexity at each level is thus still $O(DN\log(N))$ so the complexity for building a tree of depth $M$ is $O(MDN\log(N))$. Proof that the work at each level stays constant:
At each iterations the dataset is split into $\nu$ subsets of $k_i$ element and a set of $n-\sum_{i=1}^{\nu} k_i$. At every level, the total cost would therefore be (using properties of logarithms and the fact that $k_i \le N$ ) :
\[\begin{align*} cost &= O(k_1D\log(k_1)) + ... + O((N-\sum_{i=1}^{\nu} k_i)D\log(N-\sum_{i=1}^{\nu} k_i))\\ &\le O(k_1D\log(N)) + ... + O((N-\sum_{i=1}^{\nu} k_i)D\log(N))\\ &= O(((N-\sum_{i=1}^{\nu} k_i)+\sum_{i=1}^{\nu} k_i)D\log(N)) \\ &= O(ND\log(N)) \end{align*}\]The last possible adjustment I see, is to sort everything once, store it and simply use this precomputed data at each level. The final training complexity is therefore $O(MDN + ND\log(N))$ .
The time complexity of making predictions is straightforward: for each $t$ examples, go through a question at each $M$ levels. I.e. $O(MT)$ .
Naive Bayes is a family of generative models that predicts $p(y=c \vert\mathbf{x})$ by assuming that all features are conditionally independent given the label: $x_i \perp x_j \vert y , \forall i,j$. This is a simplifying assumption that very rarely holds in practice, which makes the algorithm “naive”. Classifying with such an assumption is very easy:
\[\begin{aligned} \hat{y} &= arg\max_c p(y=c\vert\mathbf{x}, \pmb\theta) \\ &= arg\max_c \frac{p(y=c, \pmb\theta)p(\mathbf{x}\vert y=c, \pmb\theta) }{p(x, \pmb\theta)} & & \text{Bayes Rule} \\ &= arg\max_c \frac{p(y=c, \pmb\theta)\prod_{j=1}^D p(x_\vert y=c, \pmb\theta) }{p(x, \pmb\theta)} & & \text{Conditional Independence Assumption} \\ &= arg\max_c p(y=c, \pmb\theta)\prod_{j=1}^D p(x_j\vert y=c, \pmb\theta) & & \text{Constant denominator} \end{aligned}\]Note that because we are in a classification setting $y$ takes discrete values, so $p(y=c, \pmb\theta)=\pi_c$ is a categorical distribution.
You might wonder why we use the simplifying conditional independence assumption. We could directly predict using $\hat{y} = arg\max_c p(y=c, \pmb\theta)p(\mathbf{x} \vert y=c, \pmb\theta)$. The conditional assumption enables us to have better estimates of the parameters $\theta$ using less data . Indeed, $p(\mathbf{x} \vert y=c, \pmb\theta)$ requires to have much more data as it is a $D$ dimensional distribution (for each possible label $c$), while $\prod_{j=1}^D p(x_j \vert y=c, \pmb\theta)$ factorizes it into $D$ 1-dimensional distributions which requires a lot less data to fit due to curse of dimensionality. In addition to requiring less data, it also enables to easily mix different family of distributions for each features.
We still have to address 2 important questions:
The family of distributions to use is an important design choice that will give rise to specific types of Naive Bayes classifiers. Importantly the family of distribution $p(x_j \vert y=c, \pmb\theta)$ does not need to be the same $\forall j$, which enables the use of very different features (e.g. continuous and discrete). In practice, people often use Gaussian distribution for continuous features, and Multinomial or Bernoulli distributions for discrete features :
Gaussian Naive Bayes :
Using a Gaussian distribution is a typical assumption when dealing with continuous data $x_j \in \mathbb{R}$. This corresponds to assuming that each feature conditioned over the label is a univariate Gaussian:
\[p(x_j \vert y=c, \pmb\theta) = \mathcal{N}(x_j;\mu_{jc},\sigma_{jc}^2)\]Note that if all the features are assumed to be Gaussian, this corresponds to fitting a multivariate Gaussian with a diagonal covariance : $p(\mathbf{x} \vert y=c, \pmb\theta)= \mathcal{N}(\mathbf{x};\pmb\mu_{c},\text{diag}(\pmb\sigma_{c}^2))$.
The decision boundary is quadratic as it corresponds to ellipses (Gaussians) that intercept .
Multinomial Naive Bayes :
In the case of categorical features $x_j \in \{1,…, K\}$ we can use a Multinomial distribution, where $\theta_{jc}$ denotes the probability of having feature $j$ at any step of an example of class $c$ :
\[p(\pmb{x} \vert y=c, \pmb\theta) = \operatorname{Mu}(\pmb{x}; \theta_{jc}) = \frac{(\sum_j x_j)!}{\prod_j x_j !} \prod_{j=1}^D \theta_{jc}^{x_j}\]Multinomial Naive Bayes is typically used for document classification , and corresponds to representing all documents as a bag of word (no order). We then estimate (see below)$\theta_{jc}$ by counting word occurrences to find the proportions of times word $j$ is found in a document classified as $c$.
The equation above is called Naive Bayes although the features $x_j$ are not technically independent because of the constraint $\sum_j x_j = const$. The training procedure is still the same because for classification we only care about comparing probabilities rather than their absolute values, in which case Multinomial Naive Bayes actually gives the same results as a product of Categorical Naive Bayes whose features satisfy the conditional independence property.
Multinomial Naive Bayes is a linear classifier when expressed in log-space :
\[\begin{aligned} \log p(y=c \vert \mathbf{x}, \pmb\theta) &\propto \log \left( p(y=c, \pmb\theta)\prod_{j=1}^D p(x_j \vert y=c, \pmb\theta) \right)\\ &= \log p(y=c, \pmb\theta) + \sum_{j=1}^D x_j \log \theta_{jc} \\ &= b + \mathbf{w}^T_c \mathbf{x} \\ \end{aligned}\]Multivariate Bernoulli Naive Bayes:
In the case of binary features $x_j \in \{0,1\}$ we can use a Bernoulli distribution, where $\theta_{jc}$ denotes the probability that feature $j$ occurs in class $c$:
\[p(x_j \vert y=c, \pmb\theta) = \operatorname{Ber}(x_j; \theta_{jc}) = \theta_{jc}^{x_j} \cdot (1-\theta_{jc})^{1-x_j}\]Bernoulli Naive Bayes is typically used for classifying short text , and corresponds to looking at the presence and absence of words in a phrase (no counts).
Multivariate Bernoulli Naive Bayes is not the same as using Multinomial Naive Bayes with frequency counts truncated to 1. Indeed, it models the absence of words in addition to their presence.
Finally we have to train the model by finding the best estimated parameters $\hat\theta$. This can either be done using pointwise estimates (e.g. MLE) or a Bayesian perspective.
Maximum Likelihood Estimate (MLE):
The negative log-likelihood of the dataset $\mathcal{D}=\{\mathbf{x}^{(n)},y^{(n)}\}_{n=1}^N$ is :
\[\begin{aligned} NL\mathcal{L}(\pmb{\theta} \vert \mathcal{D}) &= - \log \mathcal{L}(\pmb{\theta} \vert \mathcal{D}) \\ &= - \log \prod_{n=1}^N \mathcal{L}(\pmb{\theta} \vert \mathbf{x}^{(n)},y^{(n)}) & & \textit{i.i.d} \text{ dataset} \\ &= - \log \prod_{n=1}^N p(\mathbf{x}^{(n)},y^{(n)} \vert \pmb{\theta}) \\ &= - \log \prod_{n=1}^N \left( p(y^{(n)} \vert \pmb{\pi}) \prod_{j=1}^D p(x_{j}^{(n)} \vert\pmb{\theta}_j) \right) \\ &= - \log \prod_{n=1}^N \left( \prod_{c=1}^C \pi_c^{\mathcal{I}[y^{(n)}=c]} \prod_{j=1}^D \prod_{c=1}^C p(x_{j}^{(n)} \vert \theta_{jc})^{\mathcal{I}[y^{(n)}=c]} \right) \\ &= - \log \left( \prod_{c=1}^C \pi_c^{N_c} \prod_{j=1}^D \prod_{c=1}^C \prod_{n : y^{(n)}=c} p(x_{j}^{(n)} \vert \theta_{jc}) \right) \\ &= - \sum_{c=1}^C N_c \log \pi_c + \sum_{j=1}^D \sum_{c=1}^C \sum_{n : y^{(n)}=c} \log p(x_{j}^{(n)} \vert \theta_{jc}) \\ \end{aligned}\]As the negative log likelihood decomposes in terms that only depend on $\pi$ and each $\theta_{jc}$ we can optimize all parameters separately.
Minimizing the first term by using Lagrange multipliers to enforce $\sum_c \pi_c$, we get that $\hat\pi_c = \frac{N_c}{N}$. Which is naturally the proportion of examples labeled with $y=c$.
The $\theta_{jc}$ depends on the family of distribution we are using. In the Multinomial case it can be shown to be $\hat\theta_{jc}=\frac{N_{jc}}{N_c}$. Which is very easy to compute, as it only requires to count the number of times a certain feature $x_j$ is seen in an example with label $y=c$.
Bayesian Estimate:
The problem with MLE is that it over-fits. For example, if a feature is always present in all training samples (e.g. the word “the” in document classification) then the model will break if it sees a test sample without that features as it would give a probability of 0 to all labels.
By taking a Bayesian approach, over-fitting is mitigated thanks to priors. In order to do so we have to compute the posterior :
\[\begin{aligned} p(\pmb\theta \vert \mathcal{D}) &= \prod_{n=1}^N p(\pmb\theta \vert \mathbf{x}^{(n)},y^{(n)}) & \textit{i.i.d} \text{ dataset} \\ &\propto \prod_{n=1}^N p(\mathbf{x}^{(n)},y^{(n)} \vert \pmb\theta)p(\pmb\theta) \\ &\propto \prod_{c=1}^C \left( \pi_c^{N_c} \cdot p(\pi_c) \right) \prod_{j=1}^D \prod_{c=1}^C \prod_{n : y^{(n)}=c} \left( p(x_{j}^{(n)} \vert \theta_{jc}) \cdot p(\theta_{jc}) \right) \\ \end{aligned}\]Using (factored) conjugate priors (Dirichlet for Multinomial, Beta for Bernoulli, Gaussian for Gaussian), this gives the same estimates as in the MLE case (the prior has the same form than the likelihood and posterior) but regularized.
The Bayesian framework requires predicting by integrating out all the parameters $\pmb\theta$. The only difference with the first set of equations we have derived for classifying using Naive Bayes, is that the predictive distribution is conditioned on the training data $\mathcal{D}$ instead of the parameters $\pmb\theta$:
\[\begin{aligned} \hat{y} &= arg\max_c p(y=c \vert \pmb{x},\mathcal{D}) \\ &= arg\max_c \int p(y=c\vert \pmb{x},\pmb\theta) p(\pmb\theta \vert \mathcal{D}) d\pmb\theta\\ &= arg\max_c \int p(\pmb\theta\vert\mathcal{D}) p(y=c \vert\pmb\theta) \prod_{j=1}^D p(x_j\vert y=c, \pmb\theta) d\pmb\theta \\ &= arg\max_c \left( \int p(y=c\vert\pmb\pi) p(\pmb\pi \vert \mathcal{D}) d\pmb\pi \right) \prod_{j=1}^D \int p(x_j\vert y=c, \theta_{jc}) p(\theta_{jc}\vert\mathcal{D}) d\theta_{jc} & & \text{Factored prior} \\ \end{aligned}\]As before, the term to maximize decomposes in one term per parameter. All parameters can thus be independently maximized. These integrals are generally intractable, but in the case of the Gaussian, the Multinomial, and the Bernoulli with their corresponding conjugate prior, we can fortunately compute a closed form solution.
In the case of the Multinomial (and Bernoulli), the solution is equivalent to predicting with a point estimate $\hat{\pmb\theta}=\bar{\pmb\theta}$. Where $\bar{\pmb\theta}$ is the mean of the posterior distribution.
Using a symmetric Dirichlet prior : $p(\pmb\pi)=\text{Dir}(\pmb\pi; \pmb{1}\alpha_\pi)$ we get that $\hat\pi_c = \bar\pi_c = \frac{N_c + \alpha_\pi }{N + \alpha_\pi C}$.
The Bayesian Multinomial naive Bayes is equivalent to predicting with a point estimate :
\[\bar\theta_{jc} = \hat\theta_{jc}=\frac{N_{jc} + \alpha }{N_c + \alpha D}\]I.e. Bayesian Multinomial Naive Bayes with symmetric Dirichlet prior assigns predictive posterior distribution:
\[p(y=c\vert\mathbf{x},\mathcal{D}) \propto \frac{N_c + \alpha_\theta }{N + \alpha_\theta C} \prod_{j=1}^D \frac{N_{jc} + \alpha_\theta }{N_c + \alpha_\theta D}\]the corresponding graphical model is:
When using a uniform prior $\alpha_\theta=1$, this equation is called Laplace smoothing or add-one smoothing. $\alpha$ intuitively represents a “pseudocount” $\alpha_\theta$ of features $x_{jc}$ . For document classification it simply corresponds to giving an initial non zero count to all words, which avoids the problem of having a test document $x^{(t)}$ with $p(y=c\vert x^{(t)})=0$ if it contains a single word $x_{j}^*$ that has never been seen in a training document with label $c$. $\alpha=1$ is a common choice in examples although smaller often work better .
Side Notes :
The “Naive” comes from the conditional independence of features given the label. The “Bayes” part of the name comes from the use of Bayes theorem to use a generative model, but it is not a Bayesian method as it does not require marginalizing over all parameters.
If we estimated the Multinomial Naive Bayes using Maximum A Posteriori Estimate (MAP) instead of MSE or the Bayesian way, we would predict using the mode of the posterior (instead of the mean in the Bayesian case) $\hat\theta_{jc}=\frac{N_{jc} + \alpha - 1}{N_c + (\alpha -1)D}$ (similarly for $\pmb\pi$). This means that Laplace smoothing could also be interpreted as using MAP with a non uniform prior $\text{Dir}(\pmb\theta; \pmb{2})$. But when using a uniform Dirichlet prior MAP coincides with the MSE.
Gaussian Naive Bayes is equivalent to Quadratic Discriminant Analysis when each covariance matrix $\Sigma_c$ is diagonal.
Discrete Naive Bayes and Logistic Regression are a “generative-discriminative pair” as they both take the same form (linear in log probabilities) but estimate parameters differently. For example, in the binary case, Naive Bayes predicts the class with the largest probability. I.e. it predicts $C_1$ if $\log \frac{p(C_1 \vert \pmb{x})}{p(C_2 \vert \pmb{x})} = \log \frac{p(C_1 \vert \pmb{x})}{1-p(C_1 \vert \pmb{x})} > 0$. We have seen that discrete Naive Bayes is linear in the log space, so we can rewrite the equation as $\log \frac{p(C_1 \vert \pmb{x})}{1-p(C_1 \vert \pmb{x})} = 2 \log p(C_1 \vert \pmb{x}) - 1 = 2 \left( b + \mathbf{w}^T_c \mathbf{x} \right) - 1 = b’ + \mathbf{w’}^T_c \mathbf{x} > 0$. This linear regression on the log odds ratio is exactly the form of Logistic Regression (the usual equation is recovered by solving $\log \frac{p}{1-p} = b + \mathbf{w’}^T \mathbf{x}$). The same can be shown for Multinomial Naive Bayes and Multinomial Logistic Regression.
For document classification, it is common to replace raw counts in Multinomial Naive Bayes with tf-idf weights.
Resources : See section 3.5 of K. Murphy’s book for all the derivation steps and examples.
Decision trees are more often used for classification problems. I thus talk at length about them here.
The 2 differences with decision trees for classification are:
Let’s look at a simple plot to get a better idea of the algorithm:
Besides the disadvantages seen in the decision trees for classification, decision trees for regression suffer from the fact that it predicts a non smooth function .
Unsupervised learning tasks consist in finding structure in unlabeled data without a specific desired outcome.
Supervised learning can be further separated into multiple subtasks (the separation is not as clear as in the supervised setting):
Due to the lack of ground truth labels, it difficult to measure the performance of such methods, but such methods are extremely important due to the amount of accessible unlabeled data.
In Reinforcement Learning (RL), the sequential decision-making algorithm (an agent) interacts with an environment it is uncertain about. The agent learns to map situations to actions to maximize a long term reward. During training, the action it chooses are evaluated rather than instructed.
Example: Games can very naturally be framed in a RL framework. For example, when playing tennis you are not told how good every movement you make is, but you are given a certain reward if you win the whole game.
Side Notes: Games could also be framed in a supervised problem. The training set would consist in many different states of the environment and the optimal action to take in each of those. Creating such a dataset is not possible for most applications as it requires to enumerate the exponential number of states and to know the associated best action (e.g. exact rotation of all your joints when you play tennis). Note that during supervised training, the feedback indicates the correct action independently to the chosen action. The RL framework is a lot more natural as the agent is trained by playing the game. Importantly, the agent interacts with the environment such that the states that it will visit depend on previous actions. So it is a chicken-egg problem where it will unlikely reach good states before being trained, but it has to reach good states to get reward and train effectively. This leads to training curves that start with very long plateaus of low reward until it reaches a good state (somewhat by chance) and then learn quickly. In contrast, supervised methods have very steep loss curves at the start.
Resources : The link and differences between supervised and RL is described in details by A. Barto and T. Dietterich.
In RL, future states depend on current actions, thus requiring to model indirect consequences of actions and planning. Furthermore, the agent often has to take actions in real-time while planning for the future. All of the above makes it very similar to how humans learn, and is thus widely used in psychology and neuroscience.
Resources : All this section on RL is highly influenced by Sutton and Barto’s introductory book.
A fundamental trade-off in RL is how to balance exploration and exploitation. Indeed, we often assume that the agent has to maximize its reward over all episodes. As it often lacks knowledge about the environment, it has to decide between taking the action it currently thinks is the best, and taking new actions to learn about how good these are (which might be bring higher return in the long run).
Example: Lets consider a simplified version of RL (Multi-armed Bandits) to illustrate this concept (example and plots from Sutton and Barto):
Let’s assume that you go in a very generous casino. I.e. an utopic casino in which you can make money in the long run (spoiler alert: this doesn’t exist, and in real-life casinos the best action is always to leave ). This casino contains 10-slot machines, each of these gives a certain reward $r$ when the agent decides to play on on them. The agent can stay the whole day in this casino and wants to maximize it’s profit.
Although the agent doesn’t know it, the reward $r \sim \mathcal{N}(\mu_a, 1)$ where each of the 10 machines have a fixed $\mu_a$ which were sampled from $\mathcal{N}(0, 1)$ when building the casino. The actual reward distributions are the following (where $q_{*}(a)$ denote the expected return when choosing slot $a$):
A good agent would try to estimate $Q_t(a) = q_{*}(a)$, where the subscript $t$ indicates that the estimate depends on the time (the more the agent plays the better the estimates should be). The most natural estimate $Q_t(a)$ is to average over all rewards when taking action $a$. At every time step $t$ there’s at least one action $\hat{a}=arg\max_a Q_t(a)$ which the agent believes maximizes $q_{*}(a)$. Taking the greedy action $\hat{a}$ exploits your current beliefs of the environment. This might not always be “best action”, indeed if $\hat{a} \neq a^{*}$ then the agents estimates of the environment are wrong and it will be better-off by taking a non greedy action $a \neq \hat{a}$ (supposing it will still play many times). Such actions explore the environment to improve estimates.
During exploration, reward is lower in the short run, but higher in the long run because once you discovered better actions you can exploit them many times. Whether to explore or exploit is a complex problem that depends on your current estimate, uncertainties, and the number of remaining steps.
Here are a few possible exploration mechanisms:
Optimistic Initial Values (UCB): give a highly optimistic $Q_1(a), \ \forall a$(e.g. $Q_1(a)=5$ in our example, which a lot larger than $q_{*}(a)$). As th first few samples from an action will normally decrease $Q_t(a)$, this ensures that all actions will be at least sampled a few times before following the greedy policy. Note that this biases permanently the action-value estimates when estimating $Q_t(a)$ through averages (although the biases decreases).
Gradient Bandit Algorithms: An other natural way of choosing the best action would be to keep a numerical preference for each action $H_t(a)$. And sample more often actions with larger preference. I.e. sample from $\pi_t := \text{softmax}(H_t)$, where $\pi_t(a)$ denotes the probability of taking action $a$ at time $t$. The preference can then be learned via stochastic gradient ascent, by increasing $H_t(A_t)$ if the reward due to the current action $A_t$ is larger than the current average reward $\bar{R}_t$ (decreasing $H_t(A_t)$ if not). The non selected actions $a \neq A_t$ are moved in the opposite direction. Formally:
By running all the different strategies for different hyperparameters and averaging over 1000 decisions, we get:
We see that UCB performs best in this case, and is the most robust with regards to its hyper-parameter $c$. Although UCB tends to work well, this will not always be the case (No Free Lunch Theorem again).
Side Notes:
Markov Decision Processes (MDPs) are a mathematical idealized form of the RL problem that suppose that states follow the Markov Property. I.e. that future states are conditionally independent of past ones given the present state: $S_{t+1} \perp \{S_{i}\}_{i=1}^{t-1} \vert S_t$.
Before diving into the details, it is useful to visualize how simple a MDP is (image taken from Sutton and Barto):
Important concepts:
To schematically represents what different algorithms do, it is useful to look at Backup Diagrams. These are trees that are constructed by unrolling all the possible action and states. White nodes represent a state, from which teh agent will follow its policy to select one possible action (black node). The environment will then sample through the dynamics to bring the agent to a new state. Green nodes will be used to denote terminal nodes, and the path taen by the algorithm will be shown in red. Unlike transition graphs, state nodes are not unique (e.g. states can be their own successors). Example:
Side Notes : We usually assume that we have a finite MDP. I.e. that $\mathcal{R},\mathcal{A},\mathcal{S}$ are finite. Dealing with continuous state and actions pairs requires approximations. One possible way of converting a continuous problem to a finite one, is to discretized the state and actions space.
In the following sections, we will :
Solving the RL tasks, consists in finding a good policy. A policy $\pi’$ is defined to be better than $\pi \iff v_{\pi’}(s) \geq v_{\pi}(s), \ \forall s \in \mathcal{S}$. The optimal policy $\pi_{*}$ has an associated state optimal value-function and optimal action-value function:
\[v_*(s)=v_{\pi_*}(s):= \max_\pi v_\pi(s)\] \[\begin{aligned} q_*(s,a) &:= q_{\pi_*}(s,a) \\ &= \max_\pi q_\pi(s, a) \\ &= \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \vert S_{t}=s, A_{t}=a] \end{aligned}\]A special recursive update (the Bellman optimality equations) can be written for the optimal functions $v_{*}$, $q_{*}$ by taking the best action at each step instead of marginalizing:
\[\begin{aligned} v_*(s) &= \max_a q_{\pi_*}(s, a) \\ &= \max_a \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \vert S_{t}=s, A_{t}=a] \\ &= \max_a \sum_{s'} \sum_r p(s', r \vert s, a) \left[r + \gamma v_*(s') \right] \end{aligned}\] \[\begin{aligned} q_*(s,a) &= \mathbb{E}[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \vert S_{t}=s, A_{t}=a] \\ &= \sum_{s'} \sum_{r} p(s', r \vert s, a) \left[r + \gamma \max_{a'} q_*(s', a')\right] \end{aligned}\]The optimal Bellman equations are a set of equations ($\vert \mathcal{S} \vert$ equations and unknowns) with a unique solution. But these equations are now non-linear (due to the max). If the dynamics of the environment were known, you could solve the non-linear system of equation to get $v_{*}$, and follow $\pi_{*}$ by greedily choosing the action that maximizes the expected $v_{*}$. By solving for $q_{*}$, you would simply chose the action $a$ that maximizes $q_{*}(s,a)$, which doesn’t require to know anything about the system dynamics.
In practice, the optimal Bellman equations can rarely be solved due to three major problems:
Practical RL algorithms thus settle for approximating the optimal Bellman equations. Usually they parametrize functions and focus mostly on states that will be frequently encountered to make the computations possible.
Side Notes: DP algorithms use the Bellman equations to update estimates based on other estimates (typically the value function at the next state). This general idea is called Bootstrapping.
The high-level idea is to iteratively: evaluate $v_\pi$ for the current policy $\pi$ (policy evaluation 1), use $v_\pi$ to improve $\pi’$ (policy improvement 1), evaluate $v_{\pi’}$ (policy evaluation 2)… Thus obtaining a sequence of strictly monotonically improving policies and value functions (except when converged to $\pi_{*}$). As a finite MDP has only a finite number of possible policies, this is guaranteed to converge in a finite number of steps. This idea is often visualized using a 1d example taken from Sutton and Barto):
This simplified diagram shows that although the policy improvement and policy evaluation “pull in opposite directions”, the 2 processes still converge to find a single joint solution. Almost all RL algorithms can be described using 2 interacting processes (the prediction task for approximating a value, and the control task for improving the policy), which are often called Generalized Policy Iteration (GPI).
In python pseudo-code:
def policy_iteration(environment):
# v initialized arbitrarily for all states except V(terminal)=0
v, pi = initialize()
is_converged = False
while not is_converged:
v = policy_evaluation(pi, environment, v)
pi, is_converged = policy_improvement(v, environment, pi)
return pi
The first step is to evaluate $v_\pi$ for a given $\pi$. This can be done by solving the Bellman equation :
\[v_\pi(s) = \sum_{a} \pi(a \vert s) \sum_{s'} \sum_r p(s', r\vert s, a) \left[ r + \gamma v_\pi(s') \right]\]Solving the equation can be done by either:
Linear System: This is a set of linear equations ($\vert \mathcal{S} \vert$ equations and unknowns) with a unique solution (if $\gamma <1$ or if it is an episodic task). Note that we would have to solve for these equations at every step of the policy iteration and $\vert \mathcal{S} \vert$ is often very large. Assuming a deterministic policy and reward, this would take $O(\vert \mathcal{S} \vert^3)$ operations to solve.
Iterative method: Modify the Bellman equation to become an iterative method that is guaranteed to converge when $k\to \infty$ if $v_\pi$ exists. This is done by realizing that $v_\pi$ is a fixed point of:
Side Notes : In the iterative method, we would have to keep to keep 2 arrays $v_k(s)$ and $v_{k+1}(s)$. At each iteration $k$ we would update $v_{k+1}(s)$ by looping through all states. Importantly, the algorithm also converges to $v_\pi(s)$ if we keep in memory a single array that would be updated “in-place” (often converges faster as it updates the value for some states using the latest available values). The order of state updates has a significant influence on the convergence in the “in-place” case .
We solve for an approximation $V \approx v_\pi$ by halting the algorithm when the change for every state is “small”.
In python pseudo-code:
def q(s, a, v, environment, gamma):
"""Computes the action-value function `q` for a given state and action."""
dynamics, states, actions, rewards = environment
return sum(dynamics(s_p,r,s,a) * (r + gamma * v(s_p))
for s_p in states
for r in rewards)
def policy_evaluation(pi, environment, v_init, threshold=..., gamma=...):
dynamics, states, actions, rewards = environment
V = v_init # don't restart from scratch policy evaluation
delta = 0
while delta < threshold :
delta = 0
for s in states:
v = V(s)
# Bellman update
V(s) = sum(pi(a, s) * q(a,s, V, environment, gamma) for s_p in states)
# stop when change for any state is small
delta = max(delta, abs(v-V(s)))
return v
Now that we have an (estimate of) $v_\pi$ which says how good it is to be in $s$ when following $\pi$, we want to know how to change the policy to yield a higher return.
We have previously defined a policy $\pi’$ to be better than $\pi$ if $v_{\pi’}(s) > v_{\pi}(s), \forall s$. One simple way of improving the current policy $\pi$ would thus be to use a $\pi’$ which is identical to $\pi$ at each state besides one $s_{update}$ for which it will take a better action. Let’s assume that we have a deterministic policy $\pi$ that we follow at every step besides one step when in $s_{update}$ at which we follow the new $\pi’$ (and continue with $\pi$). By construction :
\[q_\pi(s, {\pi'}(s)) = v_\pi(s), \forall s \in \mathcal{S} \setminus \{s_{update}\}\] \[q_\pi(s_{update}, {\pi'}(s_{update})) > v_\pi(s_{update})\]Then it can be proved that (Policy Improvement Theorem) :
\[v_{\pi'}(s) \geq v_\pi(s), \forall s \in \mathcal{S} \setminus \{s_{update}\}\] \[v_{\pi'}(s_{update}) > v_\pi(s_{update})\]I.e. if such policy $\pi’$ can be constructed, then it is a better than $\pi$.
The same holds if we extend the update to all actions and all states and stochastic policies. The general policy improvement algorithm is thus :
\[\begin{aligned} \pi'(s) &= arg\max_a q_\pi(s,a) \\ &= arg\max_a \sum_{s'} \sum_r p(s', r\vert s, a) \left[ r + \gamma v_\pi(s') \right] \end{aligned}\]$\pi’$ is always going to be better than $\pi$ except if $\pi=\pi_{*}$, in which case the update equations would turn into the Bellman optimality equation.
In python pseudo-code:
def policy_improvement(v, environment, pi):
dynamics, states, actions, rewards = environment
is_converged = True
for s in states:
old_action = pi(s)
# q defined in `policy_evaluation` pseudo-code
pi(s) = argmax(q(s,a, v, environment, gamma) for a in actions)
if old_action != pi(s):
is_converged = False
return pi, is_converged
Practical : Assuming a deterministic reward, this would take $O(\vert \mathcal{S} \vert^2 \vert \mathcal{A} \vert)$ operations to solve. Each iteration of the policy iteration algorithm thus takes $O(\vert \mathcal{S} \vert^2 (\vert \mathcal{S} \vert + \vert \mathcal{A} \vert))$ for a deterministic policy and reward signal, if we solve the linear system of equations for the policy evaluation step.
In policy iteration, the bottleneck is the policy evaluation which requires multiple loops over the state space (convergence only for an infinite number of loops). Importantly, the same convergence guarantees as with policy iteration hold when doing a single policy evaluation step. Policy iteration with a single evaluation step, is called value iteration and can be written as a simple update step that combines the truncated policy evaluation and the policy improvement steps:
\[v_{k+1}(s) = \max_{a} \sum_{s'} \sum_r p(s', r\vert s, a) \left[ r + \gamma v_{k}(s') \right], \ \forall s \in \mathcal{S}\]This is guaranteed to converge for arbitrary $v_0$ if $v_*$ exists. Value iteration corresponds to the update rule derived from the Bellman optimality equation . The formula, is very similar to policy evaluation, but it maximizes instead of marginalizing over actions.
Like policy evaluation, value iteration needs an infinite number of steps to converge to $v_*$. In practice we stop whenever the change for all actions is small.
In python pseudo-code:
def value_to_policy(v, environment, gamma):
"""Makes a deterministic policy from a value function."""
dynamics, states, actions, rewards = environment
pi = dict()
for s in states:
# q defined in `policy_evaluation` pseudo-code
pi(s) = argmax(q(s,a, v, environment, gamma) for a in actions)
return pi
def value_iteration(environment, threshold=..., gamma=...):
dynamics, states, actions, rewards = environment
# v initialized arbitrarily for all states except V(terminal)=0
V = initialize()
delta = threshold + 1 # force at least 1 pass
while delta < threshold :
delta = 0
for s in states:
v = V(s)
# Bellman optimal update
# q defined in `policy_evaluation` pseudo-code
V(s) = max(q(s, a, v, environment, gamma) for a in actions)
# stop when change for any state is small
delta = max(delta, abs(v-V(s)))
pi = value_to_policy(v, environment, gamma)
return pi
Practical : Faster convergence is often achieved by doing a couple of policy evaluation sweeps (instead of a single one in the value iteration case) between each policy improvement. The entire class of truncated policy iteration converges. Truncated policy iteration can be schematically seen as using the modified generalized policy iteration diagram:
As seen above, truncated policy iteration uses only approximate value functions. This usually increases the number of required policy evaluation and iteration steps, but greatly decreases the number of steps per policy iteration making the overall algorithm usually quicker. Assuming a deterministic policy and reward signal, each iteration for the value iteration takes $O(\vert \mathcal{S} \vert^2 \vert \mathcal{A} \vert)$ which is less than exact (solving the linear system) policy iteration $O(\vert \mathcal{S} \vert^2 (\vert \mathcal{S} \vert + \vert \mathcal{A} \vert))$.
A drawback of DP algorithms, is that they require to loop over all states for a single sweep. Asynchronous DP algorithms are in-place iterative methods that update the value of each state in any order. Such algorithms are still guaranteed to converge as long as it can’t ignore any state after some points (for the episodic case it is also easy to avoid the few orderings that do not converge).
By carefully selecting the states to update, we can often improve the convergence rate. Furthermore, asynchronous DP enables to update the value of the states as the agent visits them. This is very useful in practice, and focuses the computations on the most relevant states.
Practical : Asynchronous DP are usually preferred for problems with large state-spaces
Although DP algorithms are the most used for finding exact solution of the Bellman optimality equations, other methods can have better worst-case convergence guarantees. Linear Programming (LP) is one of those methods. Indeed, the Bellman optimality equations can be written as a linear program. Let $B$ be the Bellman operator (i.e. $v_{k+1} = B(v_k)$), and $\pmb{\mu_0}$ is a probability distribution over states, then:
\[\begin{array}{ccc} v_* = & arg\min_{v} & \pmb{\mu}_0^T \mathbf{v} \\ & \text{s.t.} & \mathbf{v} \geq B(\mathbf{v}) \\ \end{array}\]Indeed, if $v \geq B(v)$ then $B(v) \geq B(B(v))$ due to the monotonicity of the Bellman operator. By repeated applications we must have that $v \geq B(v) \geq B(B(v)) \geq B^3(v) \geq \ldots \geq B^{\infty}(v) = v_{*}$. Any solution of the LP must satisfy $v \geq B(v)$ and must thus be $v_{*}$. Then the objective function $\pmb{\mu}_0^T \mathbf{v}$ is the expected cumulative reward when beginning at a state drawn from $\pmb{\mu}_0$. By substituting for the Bellman operator $B$:
\[\begin{array}{ccc} v_* = & arg\min_{v} & \sum_s \mu_0(s) v(s) \\ & \text{s.t.} & v(s) \geq \sum_{s'} \sum_{r} p(s', r \vert s , a) \left[r + \gamma v(s') \right]\\ & & \forall s \in \mathcal{s}, \ \forall a \in \mathcal{A} \end{array}\]Using the dual form of the LP program, the equation above can be rewritten as :
\[\begin{aligned} \max_\lambda & \sum_s \sum_a \sum_{s'} \sum_r \lambda(s,a) p(s', r \vert s , a) r\\ \text{s.t.} & \sum_a \lambda(s',a) = \mu_0(s') + \gamma \sum_s \sum_a p(s'|s,a) \lambda(s,a) \\ & \lambda(s,a) \geq 0 \\ & \forall s' \in \mathcal{s} \end{aligned}\]The constraints in the dual LP ensure that :
\[\lambda(s,a) = \sum_{t=0}^\infty \gamma^t p(S_t=s, A_t=a)\]I.e. they are the discounted state-action counts. While the dual objective maximizes the expected discounted return. The optimal policy can is :
$\pi_*(s)=max_a \mu(s,a)$
Practical: Linear programming is sometimes better than DP for small number of states, but it does not scale well.
Although LP are rarely useful, they provide connections to a number of other methods that have been used to find approximate large-scale MDP solutions.
As mentioned previously, the dynamics of the environment are rarely known. In such cases we cannot use DP. Monte Carlo (MC) methods bypass this lack of knowledge by estimating the expected return from experience (i.e. sampling of the unknown dynamics).
MC methods are very similar to the previously discussed Generalized Policy Iteration(GPI). The main differences being:
They learn the value-function by sampling from the MDP (experience) rather than computing these values using the dynamics of the MDP.
MC methods do not bootstrap: each value function for each state/action is estimated independently. I.e. they do not update value estimates based on other value estimates.
In DP, given a state-value function, we could look ahead one step to determine a policy. This is not possible anymore due to the lack of knowledge of the dynamics. It is thus crucial to estimate the action value function $q_{*}$ instead of $v_{*}$ in policy evaluation.
Recall that $q_\pi(s) = \mathbb{E}[R_{t+1} + \gamma G_{t+1} \vert S_{t}=s, A_t=a]$. Monte Carlo Estimation approximates this expectations through sampling. I.e. by averaging the returns after every visits of state action pairs $(s,a)$.
Note that pairs $(s,a)$ could be visited multiple times in the same episode. How we treat these visits gives rise to 2 slightly different methods:
Both methods converge to $q_\pi(s,a)$ as the number of visits $n \to \infty$. The convergence of the first-visit MC is easier to prove as each return is an i.i.d estimate of $q_\pi(s,a)$. The standard deviation of the estimate drops with $\frac{1}{\sqrt{n}}$.
Side Notes:
Let’s make a simple generalized policy iteration (GPI) algorithm using MC methods. As a reminder, GPI consists in iteratively alternating between evaluation (E) and improvement (I) of the policy, until we reach the optimal policy:
\[\pi _ { 0 } \stackrel { \mathrm { E } } { \longrightarrow } q _ { \pi _ { 0 } } \stackrel { \mathrm { I } } { \longrightarrow } \pi _ { 1 } \stackrel { \mathrm { E } } { \longrightarrow } q _ { \pi _ { 1 } } \stackrel { \mathrm { I } } { \longrightarrow } \pi _ { 2 } \stackrel { \mathrm { E } } { \longrightarrow } \cdots \stackrel { \mathrm { I } } { \longrightarrow } \pi _ { * } \stackrel { \mathrm { E } } { \rightarrow } q _ { * }\]Unsurprisingly, MC methods can be shown to converge if they maintain exploration and when the policy evaluation step uses an $\infty$ number of samples. Indeed, these 2 conditions ensure that all expectations are correct as MC sampling methods are unbiased.
Of course using an $\infty$ number of samples is not possible, and we would like to alternate (after every episode) between evaluation and improvement even when evaluation did not converge (similarly value iteration). Although MC methods cannot converge to a suboptimal policy in this case, the fact that it converges to the optimal fixed point has yet to be formally proved.
Maintaining exploration is a major issue. Indeed, if $\pi$ is deterministic then the samples will only improve estimates for one action per state. To ensure convergence we thus need a policy which is Greedy in the Limit with Infinite Exploration (GLIE). I.e. : 1/ All state-action pairs have to be explored infinitely many times; 2/ The policy has to converge to a greedy one. Possible solutions include:
In python pseudo-code:
def update_eps_policy(Q, pi, s, actions, eps):
"""Update an epsilon policy for a given state."""
q_best = -float("inf")
for a in actions:
pi[(a, s)] = eps/len(actions)
q_a = Q[(s, a)]
if q_a > q_best:
q_best = q_a
a_best = a
pi[(a_best, s)] = 1 - (eps - (eps/len(actions)))
return pi
def q_to_eps_pi(Q, eps, actions, states):
"""Creates an epsilon policy from a action value function."""
pi = defaultdict(lambda x: 0)
for s in states:
pi = update_eps_policy(Q, pi, s, actions, eps)
return pi
def on_policy_mc(game, actions, states, n=..., eps=..., gamma=...):
"""On policy Monte Carlo GPI using first-visit update and epsilon greedy policy."""
returns = defaultdict(list)
Q = defaultdict(lambda x: 0)
for _ in range(n):
pi = q_to_eps_pi(Q, eps, actions, states)
T, list_states, list_actions, list_rewards = game(pi)
G = 0
for t in range(T-1,-1,-1): # T-1, T-2, ... 0
r_t, s_t, a_t = list_rewards[t], list_states[t], list_actions[t]
G = gamma * G + r_t # current return
if s_t not in list_states[:t]: # if first
returns[(s_t, a_t)].append(G)
Q[(s_t, a_t)] = mean(returns[(s_t, a_t)]) # mean over all episodes
return V
The MC methods can be implemented incrementally. Instead of getting estimates of $q_\pi$ by keeping in memory all $G_t^{(i)}$ to average over those, the average can be computed exactly at each step $n$. Let $m \coloneqq \sum_i \mathcal{I}[F_i(s,a) \neq -1]$, then:
\[\begin{aligned} Q_{m+1}(s,a) &= \frac{1}{m} \sum_{i:G_{F_i(s,a)}^{(i)}(\pi)\neq -1}^m G_{F_i(s,a)}^{(i)}(\pi)\\ &= \frac{1}{m} \left(G_{F_m(s,a)}^{(m)}(\pi) + \sum_{i:G_{F_i(s,a)}^{(i)}(\pi)\neq -1}^{m-1} G_{F_i(s,a)}^{(i)}(\pi)\right)\\ &= \frac{1}{m} \left(G_{F_m(s,a)}^{(m)}(\pi) + (m-1)\frac{1}{m-1} \sum_{i:G_{F_i(s,a)}^{(i)}(\pi)\neq -1}^{m-1} G_{F_i(s,a)}^{(i)}(\pi)\right)\\ &= \frac{1}{m} \left(G_{F_m(s,a)}^{(m)}(\pi) + (m-1)Q_{m}(s,a) \right) \\ &= Q_{m}(s,a) + \frac{1}{m} \left(G_{F_m(s,a)}^{(m)}(\pi) - (m-1)Q_{m}(s,a) \right) \\ \end{aligned}\]This is of the general form :
\[\textrm{new_estimate}=\textrm{old_estimate} + \alpha_n \cdot (\textrm{target}-\text_rm{old_estimate})\]Where the step-size $\alpha_n$ varies as it is $\frac{1}{n}$. In RL, problems are often non-stationary, in which case we would like to give more importance to recent samples. A popular way of achieving this, is by using a constant step-size $\alpha \in ]0,1]$. This gives rise to an exponentially decaying weighted average (also called exponential recency-weighted average in RL). Indeed:
\[\begin{aligned} Q_{m+1}(s,a) &= Q_{m}(s,a) + \alpha \left(G_{F_m(s,a)}^{(m)}(\pi)- Q_{m}(s,a) \right) \\ &= \alpha G_{F_m(s,a)}^{(m)}(\pi) + (1-\alpha)Q_{m}(s,a) \\ &= \alpha G_{F_m(s,a)}^{(m)}(\pi) + (1-\alpha) \left( \alpha G_{F_m(s,a)}^{(m)}(\pi) + (1-\alpha)Q_{m-1}(s,a) \right) \\ &= (1-\alpha)^mQ_1 + \sum_{i=1}^m \alpha (1-\alpha)^{m-i} G_{F_i(s,a)}^{(i)}(\pi) & \text{Recursive Application} \\ \end{aligned}\]Intuition: At every step we update our estimates by taking a small step towards the goal, the direction (sign in 1D) is given by the error $\epsilon = \left(G_{F_m(s,a)}^{(m)}(\pi) - Q_{n}(s,a) \right)$ and the steps size by $\alpha * \vert \epsilon \vert$.
Side Notes:
In the on-policy case we had to use a hack ($\epsilon \text{-greedy}$ policy) in order to ensure convergence. The previous method thus compromises between ensuring exploration and learning the (nearly) optimal policy. Off-policy methods remove the need of compromise by having 2 different policy.
The behavior policy $b$ is used to collect samples and is a non-zero stochastic policy which ensures convergence by ensuring exploration. The target policy $\pi$ is the policy we are estimating and will be using at test time, it focuses on exploitation. The latter is often a deterministic policy. These methods contrast with on-policy ones, that uses a single policy.
Intuition: The intuition behind off-policy methods is to follow an other policy but to weight the final return in such a way that compensates for the actions taken by $b$. This can be done via Importance Sampling without biasing the final estimate.
Given a starting state $S_t$ the probability of all subsequent state-action trajectory $A_t, S_{t+1}, A_{t+1}, \ldots, S_T$ when following $\pi$ is:
\[P(A_t, S_{t+1}, A_{t+1}, \ldots, S_T \vert S_t, A_{t:T-1} \sim \pi) = \prod_{k=t}^{T-1} \pi (A_k \vert S_k) p(S_{k+1} \vert S_k, A_k)\]Note that the dynamics $p(s’ \vert s, a)$ are unknown but we only care about the ratio of the state-action trajectory when following $\pi$ and $b$. The importance sampling ratio is:
\[\begin{aligned} w_{t:T-1} &= \frac{P(A_t, S_{t+1}, A_{t+1}, \ldots, S_T \vert S_t, A_{t:T-1} \sim \pi)}{P(A_t, S_{t+1}, A_{t+1}, \ldots, S_T \vert S_t, A_{t:T-1} \sim b) }\\ &= \frac{\prod_{k=t}^{T-1} \pi (A_k \vert S_k) p(S_{k+1} \vert S_k, A_k)}{\prod_{k=t}^{T-1} b (A_k \vert S_k) p(S_{k+1} \vert S_k, A_k)}\\ &= \prod_{k=t}^{T-1} \frac{ \pi (A_k \vert S_k) }{b (A_k \vert S_k)} \end{aligned}\]Note that if we simply average over all returns we would get $\mathbb{E}[G_t \vert S_t=s] = v_b(s)$, to get $v_\pi(s)$ we can use the previously computed importance sampling ratio:
\[\mathbb{E}[w_{t:T-1} G_t \vert S_t=s] = v_\pi(s)\]Side Notes:
Practical: Off-policy methods are very useful to learn by seeing a human expert or non-learning controller.
In python pseudo-code:
def off_policy_mc(Q, game, b, actions, n=..., gamma=...):
"""Off policy Monte Carlo GPI using incremental update and weighted importance sampling."""
pi = dict()
returns = defaultdict(list)
Q = defaultdict(lambda x: random.rand)
C = defaultdict(lambda x: 0)
for _ in range(n):
T, list_states, list_actions, list_rewards = game(b)
G = 0
W = 1
for t in range(T-1,-1,-1): # T-1, T-2, ... 0
r_t, s_t, a_t = list_rewards[t], list_states[t], list_actions[t]
G = gamma * G + r_t # current return
C[(s_t, a_t)] += W
Q[(s_t, a_t)] += (W/C[(s_t, a_t)])(G-Q[(s_t, a_t)])
pi[s_t] = argmax(Q[(s_t, a)] for a in actions)
if a_t != pi(s):
break
W /= b[(s_t, a_t)]
return V
Side Notes: As for on policy, Off policy MC control can also be written in an Incremental manner. The details can be found in chapter 5.6 of Sutton and Barto.
The fundamental idea of temporal-difference (TD) learning is to remove the need of waiting until the end of an episode to get $G_t$ in MC methods, by taking a single step and update using $R_{t+1} + \gamma V(S_{t+1}) \approx G_t$. Note that when the estimates are correct, i.e. $V(S_{t+1}) = v_\pi(S_{t+1})$, then $\mathbb{E}[R_{t+1} + \gamma V(S_{t+1})] = \mathbb{E}[G_t]$.
As we have seen, incremental MC methods for evaluating $V$, can be written as $V_{k+1}(S_t) = V_{k}(S_t) + \alpha \left( G_t - V_{k}(S_t) \right)$. Using bootstrapping, this becomes:
\[V_{k+1}(S_t) = V_{k}(S_t) + \alpha \left( R_{t+1} + \gamma V_k(S_{t+1}) - V_{k}(S_t) \right)\]This important update is called $\textbf{TD(0)}$, as it is a specific case of TD(lambda). The error term that is used to update our estimate, is very important in RL and Neuroscience. It is called the TD Error:
\[\delta_t := R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\]Importantly $\text{TD}(0)$ always converges to $v_\pi$ if $\alpha$ decreases according to the usual stochastic conditions (e.g. $\alpha = \frac{1}{t}$). Although, the theoretical difference in speed of convergence of TD(0) and MC methods is still open, the former often converges faster in practice.
Side Notes:
Similarly to MC methods, we need an action-value function $q_\pi$ instead of a state one $v_\pi$ because we do not know the dynamics and thus cannot simply learn the policy from $v_\pi$. Replacing TD estimates in on policy MC GPI, we have:
Side Notes:
In python pseudo-code:
def sarsa(game, states, actions, n=..., eps=..., gamma=..., alpha=...):
"""SARSA using epsilon greedy policy."""
Q = defaultdict(lambda x: 0)
pi = q_to_eps_pi(Q, eps, actions, states) # defined in on policy MC
for _ in range(n):
s_t = game() #initializes the state
a_t = sample(pi, s_t)
while s_t not is terminal:
s_t1, r_t = game(s_t, a_t) # single run
# compute A_{t+1} for update
a_t1 = sample(pi, s_t1)
Q[(s_t, a_t)] += alpha * (r_t + gamma*Q[(s_t1, a_t1)] - Q[(s_t, a_t)])
pi = update_eps_policy(Q, pi, s_t, actions, eps) # defined in on policy MC
s_t, a_t = s_t1, a_t1
return pi
Just as with off-policy MC, TD can be written as an off-policy algorithm. In this case, the TD error is not computed with respect to the next sample but with respect to the current optimal greedy policy (i.e. the one maximizing the current action-value function $Q$)
The algorithm can be proved to converge as long as all state-action pairs continue being updated.
In python pseudo-code:
def q_learning(game, states, actions, n=..., eps=..., gamma=..., alpha=...):
"""Q learning."""
Q = defaultdict(lambda x: 0)
pi = q_to_eps_pi(Q, eps, actions, states) # defined in on policy MC
for _ in range(n):
s_t = game() # initializes the state
while s_t not is terminal:
a_t = sample(pi, s_t)
s_t1, r_t = game(s_t, a_t) # single run
a_best = argmax(Q[(s_t1, a)] for a in actions)
Q[(s_t, a_t)] += alpha * (r_t + gamma*a_best - Q[(s_t, a_t)])
pi = update_eps_policy(Q, pi, s_t, actions, eps) # defined in on policy MC
s_t, = s_t1
return pi
Soon
The previous sections were separated by learning style. I will now separate by algorithm similarity.
Soon
Intuition :
Advantage :
Disadvantage :
A VAE can be thought of an auto-encoder that maps and reconstructs inputs to and from latent distributions rather than point-wise latent codes. This enables sampling to generate new examples and forces similar examples to be encoded by similar latent distributions. Indeed, latent distributions with more overlap or forced to generate similar samples. Importantly, making the network generate noise around the latent point-wise estimates is not satisfactory as the encoder would learn to generate very large means or very small variance, thus making the noise neglectable. To avoid this issue we force the latent distributions to be close to a given distribution $p(\mathbf{z})$ (typically a unit Gaussian).
The encoder is a neural network parametrized by $\mathbf{\phi}$ that maps the input $\mathbf{x}$ to a latent distribution $q_\phi(\mathbf{z}\vert\mathbf{x})$ from which to sample a noisy representations of the low dimension latent code $\mathbf{z}$. The decoder is parametrized by $\mathbf{\theta}$ and deterministically maps the sampled latent code $\mathbf{z}$ to the corresponding reconstructed $\mathbf{\hat{x}}=f_\theta(\mathbf{z})$. The training procedure requires to minimize the expected reconstruction loss $L(\mathbf{x},\mathbf{\hat{x}})$ under the constraint of having $q_\phi(\mathbf{z}\vert \mathbf{x}) \approx p(\mathbf{z})$. The latter constraint can be written as $D_\text{KL}(q_\phi(\mathbf{z}\vert\mathbf{x})\parallel p(\mathbf{z}) ) < \delta$
\[\begin{array}{ccc} \pmb{\theta}^*, \pmb{\phi}^* = & arg\min_{\phi, \theta} & \mathbb{E}_{\mathbf{x}\sim\mathcal{D}}\left[\mathbb{E}_{\mathbf{z}\sim q_\phi(\mathbf{z}\vert\mathbf{x})}\left[L(\mathbf{x},f_\theta(\mathbf{z}))\right]\right]\\ & \text{s.t.} & D_\text{KL}(q_\phi(\mathbf{z}\vert\mathbf{x})\parallel p(\mathbf{z})) < \delta \text{, }\forall \mathbf{x}\sim\mathcal{D} \end{array}\]This constrained optimization problem can be rewritten as an unconstrained one using the KKT conditions (generalization of Lagrange multipliers allowing inequality constraints):
\[\begin{aligned} \pmb{\theta}^*, \pmb{\phi}^* &= arg\min_{\phi, \theta} \mathbb{E}_{\mathbf{x}\sim\mathcal{D}}\left[\mathbb{E}_{\mathbf{z}\sim q_\phi(\mathbf{z}\vert\mathbf{x})}\left[L(\mathbf{x},f_\theta(\mathbf{z}))\right] + \beta \left( D_\text{KL}(q_\phi(\mathbf{z}\vert\mathbf{x})\parallel p(\mathbf{z})) - \delta \right) \right]\\ &= arg\min_{\phi, \theta} \sum_{n=1}^N \left( \mathbb{E}_{q_\phi(\mathbf{z}\vert\mathbf{x}^{(n)})}\left[L(\mathbf{x}^{(n)},f_\theta(\mathbf{z}^{(n)}))\right] + \beta D_\text{KL}(q_\phi(\mathbf{z}^{(n)}\vert\mathbf{x}^{(n)})\parallel p(\mathbf{z}^{(n)}))\right) \\ &= arg\min_{\phi, \theta} \sum_{n=1}^N \left( - \mathbb{E}_{q_\phi(\mathbf{z}^{(n)}\vert\mathbf{x}^{(n)})}\left[\log \, p(\mathbf{x}^{(n)}\vert\mathbf{z}^{(n)})\right] + \beta D_\text{KL}(q_\phi(\mathbf{z}^{(n)}\vert\mathbf{x}^{(n)})\parallel p(\mathbf{z}^{(n)}))\right) \end{aligned}\]Where the last line replaced the reconstruction loss $L$ with the negative log likelihood of the data as predicted by a hypothetical probabilistic discriminator. The output of the deterministic discriminator can be interpreted as the expected value of the hypothetical probabilistic discriminator. Many of the usual losses $L$ for deterministic outputs correspond to negative log likelihoods: mean squared error <-> Gaussian likelihood, log loss <-> logistic likelihood, …
The global loss consists of the expected reconstruction loss (first term) and a regularisation term (KL divergence).
We now have a loss and a model: it should be straight-forward to train such a network in your favorite deep learning library, right ? Not really… The problem arises from the stochasticity of the latent representation $\mathbf{z}\sim q_\phi(\mathbf{z}\vert\mathbf{x})$. Indeed, during the backward pass, gradients cannot flow through stochastic nodes. Fortunately, there’s a simple way of making it work by using the reparametrisation trick, which consists in expressing $\mathbf{z}$ as a deterministic function of the inputs and of an independent random variable $\mathbf{z}=g_\phi(\mathbf{x}, \pmb{\epsilon})$. In the case of a multivariate Gaussian with a diagonal covariance, the following equations are equivalent:
\[\begin{aligned} \mathbf{z} &\sim \mathcal{N}(\pmb{\mu}(\mathbf{x}), diag(\pmb{\sigma}(\mathbf{x})^2)) & \\ \mathbf{z} &= \boldsymbol{\mu} + \pmb{\sigma} \odot \boldsymbol{\epsilon} \text{, where } \pmb{\epsilon} \sim \mathcal{N}(\mathbf{0}, I) \end{aligned}\]The figure from Lil’Log summarizes very well the deep learning point of view, with the small caveat that in practice we use a deterministic decoder and a loss function rather than the probabilistic one:
Side Notes :
The loss corresponds to $\beta$-VAE. You recover the original VAE loss by setting $\beta=1$.
Usually $p(\mathbf{z})=\mathcal{N}(\mathbf{z};\mathbf{0},I)$ and $q_\phi(\mathbf{z}\vert\mathbf{x})=\mathcal{N}(\mathbf{z};\pmb{\mu}(\mathbf{x}),\text{diag}(\pmb{\sigma}(\mathbf{x})))$. In which case the KL divergence can be computed in closed form :
In practice, it is better to make the network output $\log \, \pmb{\sigma^2}$ (the log variance) as this constrains sigma to be positive even with the network outputting a real value. We output the variance instead of $\pmb{\sigma}$ as it shows up twice in the KL divergence.
For black and white image generation (e.g. MNIST), cross-entropy loss (corresponding to fitting a Bernouilli for each pixel) gives better results than mean squared error. The same holds for colored images, which is achieved by rescaling the output of the sigmoid to 0-255 in each channel.
When defining a generative probabilistic model, it is important to define a step by step generation procedure. For each datapoint $i$:
The graphical model is simply :
The objective is to have a graphical model that maximizes the probability of generating real data :
\[p(\mathbf{x})=\int p(\mathbf{x} \vert \mathbf{z}) p(\mathbf{z}) d\mathbf{z}\]Unfortunately the integral above is not tractable, as it would require integrating over all possible $\mathbf{z}$ sampled from a prior. It could be approximated by Markov Chain Monte Carlo (MCMC), i.e. by sampling $N$ different $\mathbf{z}^{(n)}$ and then compute $p(\mathbf{x}) \approx \frac{1}{N}\sum_{n=1}^{N}p(\mathbf{x}\vert\mathbf{z}^{(n)})$. But as the volume spanned by $\mathbf{z}$ is potentially large, $N$ may need to be very large to obtain an accurate estimate of $p(\mathbf{x})$ (as we will sample many times from uninformative regions of the prior). To improve the sample efficiency (decrease variance) of the estimate we would like to sample “important” $\mathbf{z}$ rather than blindly sampling from the prior. Let’s call $q$ a probability distribution that assigns high probability to “important” $\mathbf{z}$. As the notion of “importance” might depend on the $\mathbf{x}$ we want to predict, let’s condition $q$ on those : $q(\mathbf{z} \vert \mathbf{x})$. We can use Importance Sampling to use samples from $q(\mathbf{z} \vert \mathbf{x})$ without biasing our estimate:
\[\begin{aligned} p(\mathbf{x}) &=\int p(\mathbf{x} \vert \mathbf{z}) p(\mathbf{z}) d\mathbf{z} \\ &=\int p(\mathbf{x} \vert \mathbf{z}) p(\mathbf{z}) \frac{q(\mathbf{z}\vert \mathbf{x})}{q(\mathbf{z}\vert \mathbf{x})} d\mathbf{z} \\ &=\int q(\mathbf{z}\vert \mathbf{x}) p(\mathbf{x} \vert \mathbf{z}) \frac{p(\mathbf{z})}{q(\mathbf{z}\vert \mathbf{x})} d\mathbf{z} \\ &= \mathbb{E}_{q(\mathbf{z}\vert \mathbf{x})} \left[ \log \left( p(\mathbf{x} \vert \mathbf{z}) \frac{p(\mathbf{z})}{q(\mathbf{z}\vert \mathbf{x})} \right) \right] \end{aligned}\]Noting that $-\log$ is a convex function, we can use Jensen Inequality, which states that $g(\mathbb{E}[X]) \leq \mathbb{E}[g(X)] \text{, where } g \text{ is convex}$:
\[\begin{aligned} \log \,p(\mathbf{x}) &\geq \int q(\mathbf{z}\vert \mathbf{x}) \log \left( p(\mathbf{x} \vert \mathbf{z}) \frac{p(\mathbf{z})}{q(\mathbf{z}\vert \mathbf{x})} \right) d\mathbf{z} \\ &\geq \int q(\mathbf{z}\vert \mathbf{x}) \log p(\mathbf{x} \vert \mathbf{z})d\mathbf{z} - \int q(\mathbf{z}\vert \mathbf{x}) \log \, \frac{q(\mathbf{z}\vert \mathbf{x})}{p(\mathbf{z})} d\mathbf{z} \\ &\geq \mathbb{E}_{q(\mathbf{z}\vert \mathbf{x})} \left[ \log \, p(\mathbf{x} \vert \mathbf{z}) \right] - D_\text{KL}(q(\mathbf{z}\vert \mathbf{x})\parallel p(\mathbf{z}) ) \end{aligned}\]The right hand side of the equation is called the Evidence Lower Bound (ELBO) as it lower bounds the (log) evidence $p(\mathbf{x})$. The last missing component is to chose appropriate probability functions :
Finally, let’s train $\pmb{\theta}$ and $\pmb{\phi}$ by maximizing the probability of generating the training (i.e. real) data $p_{\theta,\phi}(\mathbf{x})$. Assuming that all data points are i.i.d then:
\[\begin{aligned} \pmb{\theta}^*, \pmb{\phi}^* &= arg\max_{\theta, \phi} \prod_{n=1}^N p_{\theta,\phi}(\mathbf{x}^{(n)}) \\ &= arg\max_{\theta, \phi} \sum_{n=1}^N \log \, p_{\theta,\phi}(\mathbf{x}^{(n)})\\ &\approx arg\max_{\theta, \phi} \sum_{n=1}^N \text{ELBO}_{\theta,\phi}(\mathbf{x}^{(n)},\mathbf{z}^{(n)})\\ &= arg\max_{\theta, \phi} \sum_{n=1}^N \left( \mathbb{E}_{q(\mathbf{z}^{(n)} \vert \mathbf{x}^{(n)})} \left[ \log \, p(\mathbf{x}^{(n)} \vert \mathbf{z}^{(n)}) \right] - D_\text{KL}(q(\mathbf{z}^{(n)} \vert \mathbf{x}^{(n)}) \parallel p(\mathbf{z}^{(n)}) ) \right) \end{aligned}\]This corresponds to the loss we derived with the deep learning perspective for $\beta=1$. Please refer to the previous subsection for the reparamatrization trick and closed form solution of the KL divergence.
Side Notes :
The ELBO can also be derived by directly minimizing $D_\text{KL}(q(\mathbf{z} \vert \mathbf{x}) \parallel p(\mathbf{z} \vert \mathbf{x}) )$, which intuitively “turns the problem around” and tries to find a good mapping from $\mathbf{x}$ to $\mathbf{z}$. The idea of doing approximate tractable inference with a simple distribution $q(\mathbf{z} \vert \mathbf{x})$ instead of the real posterior $p(\mathbf{z} \vert \mathbf{x})$ is called variational inference. This is where the name of variational auto-encoder comes from.
The choice of direction of the KL divergence is natural with the importance sampling derivation.
You can get a tighter lower bound by getting a less biased estimate of the expectation you want, by using sampling methods: $\mathbb{E}_{q(\mathbf{z}\vert \mathbf{x})} \left[ \log \left( p(\mathbf{x} \vert \mathbf{z}) \frac{p(\mathbf{z})}{q(\mathbf{z}\vert \mathbf{x})} \right) \right]$. In expectation this is a tighter lower bound is called Importance Weighted Autoencoders.
Resources : D. Kingma and M. Welling first paper on VAEs
Disentangled representations most often refers to having a latent space where with each dimension is not very correlated (factorizable latent space). Such representations have been shown to be more general and robust. They enable compositionality, which is key for human thoughts and language. For example, if I ask you to draw a blue apple, you would not have any issues doing it (assuming your drawing skills are better than mine ) because you understand the concept of an apple and the concept of being blue. Being able to model such representation is what we are aiming for in VAE, as illustrated in Jeremy Jordan’s blog :
Having Gaussian distribution with diagonal covariance for the latent variable, means that each latent dimension can be modeled by a single Gaussian. Intuitively, we can thus think of latent dimensions as knobs used to add or remove something. Using a prior, forces the model to encode the confidence of how much of the knob to turn.
Although the variance of each knob is independent (diagonal covariance), the means can still be highly dependent. The unit Gaussian prior forces each dimension to correspond to the principle components of the distribution of means (i.e. the axes are parallel to the directions with the most variance like in PCA), thus forcing the knobs to encode disentangled factors of variations (small correlation) which are often more interpretable. This is because dimensions aligned with the principle components will decrease the KL divergence due to Pythagoras theorem. Graphically :
Soon
The previous sections were separated by learning style and algorithm similarity. I will now focus on the application.
Soon
Soon
Soon
Soon
The following sections are not strictly speaking part of Machine Learning. They are nevertheless very related and important.
The goal of mathematical optimization (also called mathematical programming), is to find a minima or maxima of an objective function $f(x)$. The “best” optimization algorithm, is very problem specific and depends on the objective function to minimize and the type of constraints the solution has to satisfy. In mathematics and machine learning, optimization problems are usually stated in terms of minimization, which is why I will focus on minimization. Importantly, if you want to maximize $f(x)$ you can equivalently minimize $f’(x)=-f(x)$.
The optimization problems are often separated into a large number of overlapping sets of problems. In this blog I will focus on a few key distinctions between sets of problems:
Side Notes: As optimization is crucial for many different fields, the vocabulary and conventions are not really standardize. For example $f(x)$, is often called the objective function, the loss function, or the cost function when it is minimized. And is called the utility or fitness function when it has to be maximized. In physics and computer vision, it is often called the energy function.
Linear programming (LP) deals with linear objective functions subject to linear constraints (equality and inequality). Any LP problem can be expressed in a canonical form:
\[\begin{array}{cc} \max_x & f(x)=\mathbf{c}^T\mathbf{x} \\ \text{s.t.} & A\mathbf{x} \leq \mathbf{b} \\ & \mathbf{x} \geq \mathbf{0} \end{array}\]The constraints $A\mathbf{x} \leq b$ are called the main constraints, while $\mathbf{x} \geq \mathbf{0}$ are the non-negativity constraints. The set of points that satisfy all constraints form the feasible region. This region is a convex polytope (points between in a polyhedron that is defined by intersecting lines. Formally, it is the intersection of a finite number of half-spaces) defined by the linear inequalities. The goal is to find the point in the polyhedron associated with the maximum (or minimum) of the function. Note: for unbounded problems (i.e. objective function can take large values) the feasible region is infinite.
Let’s look at the simple example used in this blog, from whom I borrowed the nice plots:
\[\begin{array}{cc} \max_x & f(\mathbf{x})=3x_1+2x_2 \\ \text{s.t.} & 2x_1 + x_2 \leq 100 \\ & x_1 + x_2 \leq 80 \\ & x_1 \leq 40 \\ & x_1, x_2 \geq 0 \\ \end{array}\]The feasible region in this case is :
We now have to find the point in this region which maximizes $3x_1+2x_2$. For any given $z$ let’s consider the set of all points whose cost is $\mathbf{c}^T\mathbf{x}=z$ (level curves of the objective function). This is the line described by $z=3x_1+2x_2$. Note that $\forall z$ these lines are parallel to each other and perpendicular to $\mathbf{c}=(3,2)$. In particular, increasing $z$ corresponds to moving the line $z=3x_1+2x_2$ along the direction of $\mathbf{c}$ ($\nabla f(\mathbf{x})=\mathbf{c}$):
As we want to maximize $z$, we would like to move the line as much as possible along $\mathbf{c}$, while staying in the feasible region. The best we can do is $z=180$, and the vector $\mathbf{x}=(20,60)$ is the corresponding optimal solution:
Note that the optimal solution $\mathbf{x}=(20,60)$ is a vertex of the polytope. We will show later that this is always the case.
Side Notes: Beware that the term standard form is very often used in the literature to mean different things. For example in Bertismas and Tsitsiklis the standard form is equivalent to the canonical form we have seen above, while the standard form in Vanderbei’ book refers to what I will call the slack form. I thus decided not to use the term “standard form”.
Canonical Form:
\(\begin{array}{cc} \max_x & f(x)=\mathbf{c}^T\mathbf{x} \\ \text{s.t.} & A\mathbf{x} \leq \mathbf{b} \\ & \mathbf{x} \geq \mathbf{0} \end{array}\)
Slack Form:
\(\begin{array}{cc} \min_x & f(x)=\mathbf{c}^T\mathbf{x} \\ \text{s.t.} & A\mathbf{x} = \mathbf{b} \\ & \mathbf{x} \geq \mathbf{0} \end{array}\)
The canonical form is very natural and thus often used to develop the theory, but the slack form is usually the starting points for all algorithms as it is computationally more convenient.
Any general LP problem can be converted back and forth to these 2 forms. Let’s first see how to convert general LP problems to the canonical form:
Let’s now see how to convert general LP problems to the slack form:
Side Notes:
We have already seen an example which visualized the canonical form of the problem. I.e. for $\mathbf{x} \in \mathbb{R}^{n’}$ the feasible set is a polytope with at most $m’$ vertices (described by $A \mathbf{x} \geq \mathbf{b}$) in a $n’$ dimensional space.
In the slack form with $m$ constraints and $\mathbf{x} \in \mathbb{R}^{n}$, the constraints force $\mathbf{x}$ to lie on an $(n-m)$ dimensional subspace (each constraint removes a degree of freedom of $\mathbf{x}$). Once in that $(n-m)$ dimensional subspace, the feasible set is only bounded by the non-negativity constraints $\mathbf{x} \geq \mathbf{0}$. I.e. the slack form of the problem, can be seen as minimizing a function in a polytope in a $(n-m)$ dimensional subspace. Importantly, the polytope has the exact same shape as for the canonical form, but lives is a different space!
In optimization, it is very common to convert constrained problems to a relaxed unconstrained formulation, by allowing constraints to be violated but adding “a price” $\mathbf{y}$ to pay for that violation. When the price is correctly chosen, the solution of both problems becomes the same (by letting $\mathbf{y} \to \infty$ you effectively don’t allow any violation of the constraints:). This is the key idea behind the well known Lagrange Multipliers. Let’s call the problem formulation we have seen until know the primal problem. Then the slack form of the primal problem can be relaxed as:
\[\begin{array}{cc} \min_x & \mathbf{c}^T\mathbf{x} + \mathbf{y}^T (\mathbf{b}-A\mathbf{x}) \\ \text{s.t.} & \mathbf{x} \geq \mathbf{0} \end{array}\]Let $g(\mathbf{y})$ be the optimal cost of the relaxed problem as a function of the violation price $\mathbf{y}$. $g(\mathbf{y})$ is a lower bound to the the optimal solution $\mathbf{x}^{*}$ of the constraint problem (weak duality theorem). Indeed ($A\mathbf{x}^*=\mathbf{b}$):
\[g(\mathbf{y})=min_{\mathbf{x} \geq \mathbf{0}} \left( \mathbf{c}^T\mathbf{x} + \mathbf{y}^T (\mathbf{b}-A\mathbf{x}) \right) \leq \mathbf{c}^T\mathbf{x}^* + \mathbf{y}^T (\mathbf{b}-A\mathbf{x}^*) = \mathbf{c}^T\mathbf{x}^*\]The goal is thus to find the tightest lower bound $\max_y g(\mathbf{y})$, and is called the dual problem. The problem can be simplified by noting that:
\[\begin{aligned} \max_\mathbf{y} g(\mathbf{y}) &= \max_\mathbf{y} & min_{\mathbf{x} \geq \mathbf{0}} \left( \mathbf{c}^T\mathbf{x} + \mathbf{y}^T (\mathbf{b}-A\mathbf{x}) \right) \\ &= \max_\mathbf{y} & \mathbf{y}^T \mathbf{b} + min_{\mathbf{x} \geq \mathbf{0}} \left( (\mathbf{c}^T - \mathbf{y}^T A)\mathbf{x} \right)\\ &= \max_\mathbf{y} & \mathbf{y}^T \mathbf{b} + 0 \\ &\ \ \ \text{ s.t.} & \mathbf{y}^TA \leq \mathbf{c}^T \end{aligned}\]Where the last line comes from the fact that $min_{\mathbf{x} \geq \mathbf{0}} \left( (\mathbf{c}^T + \mathbf{y}^T A)\mathbf{x} \right)$ is $0$ if $\mathbf{y}^TA \leq \mathbf{c}^T$ and $-\infty$ otherwise.
The dual problem is thus another LP problem that can be written as:
\[\begin{array}{cc} \max_y & \mathbf{y}^T \mathbf{b} \\ \text{s.t.} & A^T\mathbf{y} \leq \mathbf{c} \end{array}\]Side Notes:
Canonical Primal LP:
\(\begin{array}{cc} \max_x & \mathbf{c}^T\mathbf{x} \\ \text{s.t.} & A\mathbf{x} \leq b \\ & \mathbf{x} \geq \mathbf{0} \end{array}\)
Associated Dual LP:
\(\begin{array}{cc} \min_y & \mathbf{b}^T\mathbf{y} \\ \text{s.t.} & A^T\mathbf{y} \geq \mathbf{c} \\ & \mathbf{y} \geq \mathbf{0} \end{array}\)
Resources: the intuitive link between the primal and dual form explained above comes from the very good Introduction to Linear Optimization. Another great resource is the coursera course on discrete optimization (lecture LP6 week 5).
Soon
Unfortunately here ends today’s journey together. But don’t be too disappointed, I’m only getting started. I still have a tuns of notes eagerly waiting to get upgraded, and I don’t intend to stop learning yet .
PS: Any reaction/suggestions would be very appreciated: just drop a comment below or click on the .
See you soon !