Machine Learning Glossary


Machine Learning Glossary / Cheat Sheet with a focus on intuition

Welcome,

This is my first post ever :bowtie:, I’d love to get your feedback.

What

I will try to summarize important terms and concepts of machine learning. The focus is on intuition but there will also be practical and theoretical notes.

Target Audience, this is for you if you:

  • Have a decent understanding of a concept but want more intuition.
  • Are switching machine learning sub-domains.
  • Knew a term but need to refresh your knowledge, as it’s hard to remember everything (that’s me :sweat_smile: ).
  • Need to learn the important concepts in an efficient way. Students cramming for an exam: that’s for you :four_leaf_clover: !

Why

Having a bad memory but being (at least considering myself to be :sweat_smile: ) a philomath who loves machine learning, I developed the habit of taking notes, then summarizing and finally making a cheat sheet for every new ML domain I encounter. There are multiple reasons I want to switch to a web-page:

  • Paper is not practical and prone to loss.
  • Thinking that someone I don't know (I'm talking about you :raising_hand: ) might read this post makes me write higher quality notes.
  • I'm forever grateful to people that spend time on forums and open source projects. I now want to give back to the community (The contribution isn't comparable, but I have to start somewhere :innocent: ).
  • Taking notes on a computer is a necessary step for my migration to CS :sweat_smile: .
  • As a wise man once said:
    You do not really understand something unless you can explain it to your grandmother. - Albert Einstein
    My grandmas are awesome :heart: but not really into ML (yet). You have thus been designated "volunteer" to temporarily replace them.

Notation

Keeping a uniform notation in machine learning is not easy, as many sub-domains use different notations for historical reasons. I will try to use a uniform notation throughout the glossary:

  • Lower case letters (a,b,c,…): scalars and functions.
  • Bold capital letters (A,B,C …): matrices.
  • Bold lower case letters (a,b,c …): vectors.
  • Capital letters (X,Y,Z…): either a random variable or a number of values an index can take (e.g. $k=1 \ldots K$)
  • Superscripts with $(n)$ (e.g. $x^{(n)}$) are used to denote one of the $N$ training examples. Superscripts with $(t)$ denote one of the $T$ test examples.

To make it easier to search for the relevant information in the Glossary, here is the color coding I will be using:

  • :bulb: Intuition
  • :wrench: Practical
  • :x: Disadvantage
  • :white_check_mark: Advantage
  • :school_satchel: Example
  • :mag: Side Notes
  • :wavy_dash: Compare to
  • :information_source: Resources

Disclaimer:

  • This is my first post ever :bowtie:, I would love to get your feedback.
  • I’m bad at spelling: Apologies in advance (feel free to correct me).
  • Check out the resources from where I got most of this information.
  • ML sub-domains overlap A LOT. I’ll try not to make the separations too artificial. Any suggestions would be appreciated :relaxed: . Note that I separate domains both by learning style and by algorithm similarity.
  • This is not meant to be a post read in order, but rather used as a “cheat-sheet”. Use the table of content or Ctrl+f.
  • Many parts are still missing, come back soon for those.

Enough talking: prepare your popcorn and let’s get going :clapper: !

General Machine Learning Concepts

Curse of Dimensionality

The Curse of dimensionality refers to various practical issues when working with high dimensional data. These are often computational problems or counter-intuitive phenomena, coming from our Euclidean view of the 3-dimensional world.

They are all closely related but I like to think of 3 major issues with high dimensional inputs $x \in \mathbb{R}^d, \ d \ggg 1$:

Sparsity Issue

You need exponentially more data to fill in a high dimensional space. I.e. if the dataset size is constant, increasing the dimensions makes your data sparser.

:bulb: Intuition : The volume grows exponentially with the number of dimensions. Think of filling a $d$ dimensional unit hypercube with points spaced at $0.1$ intervals. In 1 dimension we need $10$ of these points. In 2 dimensions we already need $100$ of them. In $d$ dimensions we need $10^d$ observations!

Let’s look at a simple example :

Imagine we trained a certain classifier for distinguishing between :white_circle: and :large_blue_circle:. Now we want to predict the class of an unknown observation :black_circle: . Let’s assume that:

  • All features are given in percentages $[0,1]$
  • The algorithm is non-parametric and has to look at the points in the surrounding hypercube, which spans $30\%$ of the input space (see below).

Given only 1 feature (1D), we would simply need to look at $30\%$ of the dimension's values. In 2D we would need to look at $\sqrt{0.3}=54.8\%$ of each dimension. In 3D we would need $\sqrt[3]{0.3}=66.9\%$ of each dimension. Visually:

sparsity in 1D

sparsity in 2D

sparsity in 3D


In order to keep a constant support (i.e. amount of knowledge of the space), we thus need more data when adding dimensions. In other words, if we add dimensions without adding data, there will be large unknown sub-spaces. This is called sparsity.

I have kept the same number of observations in the plots, so that you can appreciate how “holes” appear in our training data as the dimension grows.
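
To make this concrete, here is a minimal sketch (plain Python, using the numbers from the example above) computing the fraction of each axis a hypercube must span to cover $30\%$ of the unit volume, i.e. $0.3^{1/d}$:

# Fraction of each axis a hypercube must span to cover 30% of the unit volume.
for d in [1, 2, 3, 10, 100]:
    print(f"d={d:>3}: fraction of each axis = {0.3 ** (1 / d):.3f}")

For $d=100$ you would already need to span about $98.8\%$ of every single axis.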

:x: Disadvantage : The data sparsity issue causes machine learning algorithms to fail to find patterns or to overfit.

Points are further from the center

Basically, the volume of a high dimensional orange is mostly in its skin and not in the pulp! Which means expensive high dimensional juices :pensive: :tropical_drink:

:bulb: Intuition : The volume of a sphere depends on $r^d$. The skin has a slightly greater $r$ than the pulp, in high dimensions this slight difference will become very important.

If you’re not convinced, stick with my simple proof. Let’s consider a $d$ dimensional unit orange (i.e. $r=1$), with a skin of width $\epsilon$. Let’s compute the ratio of the volume in the skin to the total volume of the orange. We can avoid any integrals by noting that the volume of a hypersphere is proportional to $r^d$ i.e. : $V_{d}(r) = k r^{d}$. The ratio is thus $ratio_{skin/orange}(d) = \frac{V_d(1) - V_d(1-\epsilon)}{V_d(1)} = \frac{k - k(1-\epsilon)^d}{k} = 1 - (1-\epsilon)^d$.

Taking $\epsilon = 0.05$ as an example, here is the $ratio_{skin/orange}(d)$ we would get:

2D orange

3D orange

5D orange

10D orange

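
Using the ratio derived above, $ratio_{skin/orange}(d) = 1-(1-\epsilon)^d$, a quick sketch (plain Python, with $\epsilon = 0.05$ as in the figures) shows how fast the skin takes over:

# Fraction of a unit d-dimensional orange's volume that lies in its skin,
# using V_d(r) = k * r**d  =>  ratio = 1 - (1 - eps)**d.
eps = 0.05
for d in [2, 3, 5, 10, 100]:
    print(f"d={d:>3}: fraction of volume in the skin = {1 - (1 - eps) ** d:.3f}")

At $d=100$, roughly $99.4\%$ of the orange is skin.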

:mag: Side Notes : The same goes for hyper-cubes: most of the mass is concentrated at the furthest points from the center (i.e. the corners). That’s why you will sometimes hear that hyper-cubes are “spiky”. Think of the $[-1,1]^d$ hyper-cube: the distance from the center of the faces to the origin will trivially be $1 \ \forall d$, while the distance to each corners will be $\sqrt{d}$ (Pythagorean theorem). So the distance to corners increases with $d$ but not the center of the faces, which makes us think of spikes. This is why you will sometimes see such pictures :

2D hypercube

3D hypercube

7D hypercube


Euclidean distance becomes meaningless

There’s nothing that makes Euclidean distance intrinsically meaningless in high dimensions. But due to our finite amount of data, 2 points in high dimensions seem to be more “similar” due to sparsity and basic probability.

:bulb: Intuition :

  • Let’s consider the distance between 2 points $\mathbf{q}$ and $\mathbf{p}$ that are close in $\mathbb{R}^d$. By adding independent dimensions, the probability that these 2 points differ greatly in at least one dimension grows (due to randomness). This is what causes the sparsity issue. Similarly, the probability that 2 points that are far away in $\mathbb{R}^d$ will have at least one similar dimension in $\mathbb{R}^{d'}, \ d'>d$, also grows. So basically, adding dimensions makes points seem more random, and the distances thus become less useful.
  • Euclidean distance accentuates the point above. Indeed, by adding dimensions, the probability that $\mathbf{x}^{(1)}$ and $\mathbf{x}^{(2)}$ have at least one completely different feature grows, i.e. $\max_j \, \vert x_j^{(1)} - x_j^{(2)} \vert$ increases. The Euclidean distance between 2 points is $D(\mathbf{x}^{(1)},\mathbf{x}^{(2)})=\sqrt{\sum_{j=1}^d (x_j^{(1)}-x_j^{(2)})^2}$. Because of the squared term, the distance depends strongly on $\max_j \, \vert x_j^{(1)}-x_j^{(2)} \vert$. This results in less relative difference between the distances of “similar” and “dissimilar” points in high dimensions. Manhattan ($L_1$) or fractional distance metrics ($L_c$ with $c<1$) are thus preferred in high dimensions.

In such discussions, people often cite a theorem stating that for i.i.d. points in high dimension, a query point $\mathbf{x}^{(q)}$ converges to the same distance to all other points $P=\{\mathbf{x}^{(n)}\}_{n=1}^N$ :

$\lim_{d \to \infty} \frac{\max_n \, D(\mathbf{x}^{(q)}, \mathbf{x}^{(n)}) - \min_n \, D(\mathbf{x}^{(q)}, \mathbf{x}^{(n)})}{\min_n \, D(\mathbf{x}^{(q)}, \mathbf{x}^{(n)})} \to 0$
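
If you prefer simulations to theorems, here is a small sketch (assuming numpy is available; the uniform data and the sizes are arbitrary) that estimates the relative contrast $\frac{D_{max}-D_{min}}{D_{min}}$ between a query point and uniformly drawn points, for growing dimension:

import numpy as np

rng = np.random.default_rng(0)

# Relative contrast (D_max - D_min) / D_min between a query point and N points
# drawn uniformly in [0, 1]^d; it shrinks as the dimension d grows.
N = 1000
for d in [2, 10, 100, 1000]:
    points = rng.random((N, d))
    query = rng.random(d)
    dists = np.linalg.norm(points - query, axis=1)
    print(f"d={d:>4}: relative contrast = {(dists.max() - dists.min()) / dists.min():.2f}")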

:wrench: Practical : using dimensionality reduction often gives you better results for subsequent steps due to this curse. It makes the algorithm converge faster and reduces overfitting. But be careful not to underfit by using too few features.

:mag: Side Notes :

  • Although the curse of dimensionality is a big issue, we can find effective techniques in high-dimensions because:
    • Real data is often confined to a lower effective dimensionality (e.g. a low dimensional manifold in a higher dimensional space).
    • Interpolation-like techniques can overcome some of the sparsity issues due to the local smoothness of real data.
  • You often see plots of the unit $d$-ball volume vs its dimensionality. Although the non-monotonicity of such plots is intriguing, they can erroneously make you believe that high dimensional hyperspheres are smaller than low dimensional ones. This does not make sense, as a lower dimensional hypersphere can always be fitted inside a higher dimensional one. The issue arises from comparing apples and oranges (no pun intended :sweat_smile:) due to different units: is $0.99 m^2$ really smaller than $1 m$?

:information_source: Resources : Great post about the curse of dimensionality in classification which inspired me, On the Surprising Behavior of Distance Metrics in High Dimensional Space is a famous paper which proposes the use of fractional distance metrics, nice blog of simulations.

Images modified from: oranges, 7D cube

Evaluation Metrics

Classification Metrics

Single Metrics

:mag: Side Notes : The focus is on binary classification, but most scores can be generalized to the multi-class setting. Often this is achieved by only considering “correct class” and “incorrect class” in order to make it a binary classification, then averaging (weighted by the proportion of observations in each class) the score for each class. A short code sketch at the end of this list shows how to compute most of these scores.

confusion matrix

  • Accuracy: correctly classified fraction of observations.
    • $ Acc = \frac{Correct}{Total} = \frac{TP+TN}{TP+FN+TN+FP}$
    • :bulb: In general, how much can we trust the predictions ?
    • :wrench: Use if no class imbalance and cost of error is the same for both types
  • Precision: fraction of positive predictions that were actually positive.
    • $ Prec = \frac{TP}{Predicted Positives} = \frac{TP}{TP+FP}$
    • :bulb: How much can we trust positive predictions ?
    • :wrench: Use if FP are the worst errors
  • Recall: fraction of positive observations that have been correctly predicted.
    • $ Rec = \frac{TP}{Actual Positives} = \frac{TP}{TP+FN}$
    • :bulb: How many actual positives will we find?
    • :wrench: Use if FN are the worst errors
  • F1-Score: harmonic mean (good for averaging rates) of recall and precision.
    • $F1 = 2 \frac{Precision * Recall}{Precision + Recall}$
    • If recall is $\beta$ times more important than precision use $F_{\beta} = (1+\beta^2) \frac{Precision * Recall}{\beta^2 Precision + Recall}$
    • :bulb: How much to trust our algorithms for the positive class
    • :wrench: Use if the positive class is the most important one (i.e. want a detector rather than a classifier)
  • Specificity: recall for the negative class.
    • $ Spec = \frac{TN}{Actual Negatives} = \frac{TN}{TN+FP}$
  • Log-Loss: measures performance when the model outputs a probability $\hat{y}_{nc}$ that observation $n$ is in class $c$
    • Also called Cross entropy loss or logistic loss
    • $logLoss = - \frac{1}{N} \sum^N_{n=1} \sum^C_{c=1} y_{nc} \ln(\hat{y}_{nc})$
    • Use the natural logarithm for consistency
    • Incorporates the idea of probabilistic confidence
    • Log Loss is the metric that is minimized through Logistic Regression and more generally Softmax
    • :bulb: Penalizes more if the model is confident but wrong (see graph below)
    • :bulb: Log-loss is the cross entropy between the distribution of the true labels and the predictions
    • :wrench: Use when you are interested in outputting confidence of results
    • The graph below shows the log loss depending on the confidence of the algorithm that an observation should be classed in the correct category. For multiple observations we compute the log loss of each and then average them.

    log loss

  • Cohen’s Kappa: improvement of your classifier compared to always guessing the most probable class
    • $\kappa = \frac{accuracy - percent_{MaxClass}}{1 - percent_{MaxClass}}$
    • Often used to compute inter-rater (e.g. 2 humans) reliability: $\kappa = \frac{p_o- p_e}{1 - p_e}$ where $p_o$ is the observed agreement and $p_e$ is the expected agreement due to chance.
    • $ \kappa \leq 1$ (if $<0$ then useless).
    • :bulb: Accuracy improvement weighted by class imbalance .
    • :wrench: Use when high class imbalance and all classes are of similar importance
  • AUC: Area Under the Curve. Summarizes curves in a single metric.
    • It normally refers to the ROC curve. Can also be used for other curves like the precision-recall one.
    • :bulb: Probability that a randomly selected positive observation is predicted with a higher score than a randomly selected negative observation.
    • :mag: AUC evaluates results at all possible cut-off points. It gives better insights about how well the classifier is able to separate between classes . This makes it very different from the other metrics that typically depend on the cut-off threshold (e.g. 0.5 for Logistic Regression).
    • :wrench: Use when building a classifier for users that will have different needs (they could tweak the cut-off point) . From my experience AUC is widely used in statistics (~go-to metric in bio-statistics) but less in machine learning.
    • Random predictions: $AUC = 0.5$. Perfect predictions: $AUC=1$.
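
A minimal scikit-learn sketch computing several of the scores above (assuming scikit-learn is installed; the toy labels and scores are made up):

import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, cohen_kappa_score, roc_auc_score)

# Toy binary problem: true labels, hard predictions and predicted scores.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.9, 0.2, 0.4, 0.8, 0.3, 0.6, 0.7, 0.1])

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("kappa    :", cohen_kappa_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_score))   # AUC uses scores, not hard predictions
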
Visual Metrics
  • ROC Curve : Receiver Operating Characteristic
    • Plot showing the TP rate vs the FP rate, over a varying threshold.
    • This plot from wikipedia shows it well:

ROC curve

  • Confusion Matrix: a $C*C$ matrix which shows the number of observations of class $c$ that have been labeled $c’, \ \forall c=1 \ldots C \text{ and } c’=1\ldots C$
    • :mag: Be careful: people are not consistent with the axes: you can find real-predicted and predicted-real.
    • Best understood with an example:

    Multi Confusion Matrix

:information_source: Resources : Additional scores based on confusion matrix

Generative vs Discriminative

These two major model types distinguish themselves by the approach they take to learning. Although these distinctions are not task-specific, you will most often hear about them in the context of classification.

Differences

In classification, the task is to identify the category $y$ of an observation, given its features $\mathbf{x}$: $y \vert \mathbf{x}$. There are 2 possible approaches:

  • Discriminative learn the decision boundaries between classes.
    • :bulb: Tell me in which class is this observation given past data.
    • Can be probabilistic or non-probabilistic models. If probabilistic, the prediction is $\hat{y}=arg\max_{y=1 \ldots C} \, p(y \vert \mathbf{x})$. If non-probabilistic, the model “draws” a boundary between classes: if the point $\mathbf{x}$ is on one side of the boundary then predict $y=1$, if it is on the other then $y=2$ (multiple boundaries for multiple classes).
    • Directly models what we care about: $y \vert \mathbf{x}$.
    • :school_satchel: As an example, for language classification, the discriminative model would learn to distinguish between languages from their sound but wouldn’t understand anything.
  • Generative model the distribution of each classes.
    • :bulb: First “understand” the meaning of the data, then use your knowledge to classify.
    • Model the joint distribution $p(y,\mathbf{x})$ (often using $p(y,\mathbf{x})=p(\mathbf{x} \vert y)p(y)$). Then find the desired conditional probability through Bayes theorem: $p(y \vert \mathbf{x})=\frac{p(y,\mathbf{x})}{p(\mathbf{x})}$. Finally, predict $\hat{y}=arg\max_{y=1 \ldots C} \, p(y \vert \mathbf{x})$ (same as discriminative).
    • Generative models often use more assumptions as it is a harder task.
    • :school_satchel: To continue with the previous example, the generative model would first learn how to speak the language and then classify from which language the words come from.

Pros / Cons

Some of the advantages / disadvantages are equivalent with different wording. These are rules of thumb!

  • Discriminative:
    • :white_check_mark: Such models need less assumptions as they are tackling an easier problem.
    • :white_check_mark: Often less bias => better if more data.
    • Often :x: slower convergence rate . Logistic Regression requires $O(d)$ observations to converge to its asymptotic error.
    • :x: Prone to over-fitting when there's less data, as it doesn't make assumptions to constrain it from finding non-existent patterns.
    • Often :x: More variance.
    • :x: Hard to update the model with new data (online learning).
    • :x: Have to retrain model when adding new classes.
    • :x: In practice needs additional regularization / kernel / penalty functions.
  • Generative
    • :white_check_mark: Faster convergence rate => better if less data . Naive Bayes only requires $O(\log(d))$ observations to converge to its asymptotic rate.
    • Often :white_check_mark: less variance.
    • :white_check_mark: Can easily update the model with new data (online learning).
    • :white_check_mark: Can generate new data by looking at $p(\mathbf{x} \vert y)$.
    • :white_check_mark: Can handle missing features .
    • :white_check_mark: You don't need to retrain model when adding new classes as the parameters of classes are fitted independently.
    • :white_check_mark: Easy to extend to the semi-supervised case.
    • Often :x: more bias.
    • :x: Uses computational power to compute something we didn't ask for.

:wrench: Rule of thumb : If you need to train the best classifier on a large data set, use a discriminative model. If your task involves more constraints (online learning, semi-supervised learning, small dataset, …) use a generative model.

Let’s illustrate the advantages and disadvantages of both methods with an example. Suppose we are asked to construct a classifier for the “true distribution” below. There are two training sets: “small sample” and “large sample”. Suppose that the generative model assumes points are generated from a Gaussian.

discriminative vs generative true distribution

discriminative vs generative small sample

discriminative vs generative large sample

How well will the algorithms distinguish the classes in each case ?

  • Small Sample:
    • The discriminative model never saw examples at the bottom of the blue ellipse. It will not find the correct decision boundary there.
    • The generative model assumes that the data follows a normal distribution (ellipse). It will therefore infer the correct decision boundary without ever having seen data points there!

small sample discriminative

small sample generative


  • Large Sample:
    • The discriminative model is not restricted by assumptions and can find the small red cluster inside the blue one.
    • The generative model assumes that the data follows a Gaussian distribution (ellipse) and won’t be able to find the small red cluster.

large sample discriminative

large sample generative

This was simply an example that hopefully illustrates the advantages and disadvantages of needing more assumptions. Depending on their assumptions, some generative models would find the small red cluster.
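
As a rough illustration of the rule of thumb above, here is a sketch (assuming scikit-learn; the synthetic dataset and the sizes are arbitrary, so the exact numbers mean little) comparing a discriminative and a generative classifier as the training set grows:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Discriminative (logistic regression) vs generative (Gaussian naive Bayes)
# accuracy for growing training-set sizes on a synthetic problem.
X, y = make_classification(n_samples=20000, n_features=20, random_state=0)
X_test, y_test = X[10000:], y[10000:]
for n in [20, 100, 1000, 10000]:
    disc = LogisticRegression(max_iter=1000).fit(X[:n], y[:n])
    gen = GaussianNB().fit(X[:n], y[:n])
    print(f"n={n:>5}: discriminative={disc.score(X_test, y_test):.3f}  "
          f"generative={gen.score(X_test, y_test):.3f}")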

Examples of Algorithms

Discriminative
Generative
  • Naive Bayes
  • Gaussian Discriminant Analysis
  • Latent Dirichlet Allocation
  • Restricted Boltzmann Machines
  • Gaussian Mixture Models
  • Hidden Markov Models
  • Sigmoid Belief Networks
  • Bayesian networks
  • Markov random fields
Hybrid
  • Generative Adversarial Networks

:information_source: Resources : A. Ng and M. Jordan have a must read paper on the subject, T. Mitchell summarizes very well these concepts in his slides, and section 8.6 of K. Murphy’s book has a great overview of pros and cons, which strongly influenced the devoted section above.

Information Theory

Information Content

Given a random variable $X$ and a possible outcome $x_i$ associated with a probability $p_X(x_i)=p_i$, the information content (also called self-information or surprisal) is defined as :

$\operatorname{I}(p_i) = -\log(p_i) = \log\left(\frac{1}{p_i}\right)$

:bulb: Intuition : The “information” of an event is higher when the event is less probable. Indeed, if an event is not surprising, then learning about it does not convey much additional information.

:school_satchel: Example : “it is currently cold in Antarctica” does not convey much information as you knew it with high probability. “It is currently very warm in Antarctica” carries a lot more information and you would be surprised about it.

:mag: Side note :

  • $\operatorname{I} (p_i) \in [0,\infty[$
  • Don’t confuse the information content in information theory with the everyday word which refers to “meaningful information”. A book with random letters will have more information content because each new letter would be a surprise to you. But it will definitely not have more meaning than a book with English words .

Entropy

Long Story Short

:bulb: Intuition :

  • The entropy of a random variable is the expected information-content. I.e. the expected amount of surprise you would have by observing a random variable .
  • Entropy is the expected number of bits (assuming $log_2$) used to encode an observation from a (discrete) random variable under the optimal coding scheme .

:mag: Side notes :

  • $H(X) \geq 0$
  • Entropy is maximized when all events occur with uniform probability. If $X$ can take $K$ values then : $\max(H) = H(X_{uniform})= \sum_{i=1}^K \frac{1}{K} \log(\frac{1}{ 1/K} ) = \log(K)$

Long Story Long

The concept of entropy is central in both thermodynamics and information theory, and I find that quite amazing. It originally comes from statistical thermodynamics and is so important that it is carved on Ludwig Boltzmann’s grave (one of the fathers of this field). You will often hear:

  • Thermodynamics: Entropy is a measure of disorder
  • Information Theory: Entropy is a measure of information

These 2 ways of thinking may seem different, but in reality they are exactly the same. They essentially answer: how hard is it to describe this thing?

I will focus here on the information theory point of view, because its interpretation is more intuitive for machine learning. I also don’t want to spend too much time thinking about thermodynamics, as people that do often commit suicide :flushed:.

In information theory there are 2 intuitive ways of thinking of entropy. These are best explained through an example :

:school_satchel: Suppose my friend Claude offers me to join him for an NBA game (Cavaliers vs Spurs) tonight. Unfortunately I can’t come, but I ask him to record who scored each field goal. Claude is very geeky and uses a binary phone which can only write 0 and 1. As he doesn’t have much memory left, he wants to use the smallest possible number of bits.

  1. From previous games, Claude knows that Lebron James will very likely score more than the old (but awesome :basketball: ) Manu Ginobili. Will he use the same number of bits to indicate that Lebron scored as he will for Ginobili? Of course not, he will allocate fewer bits for Lebron’s buckets as he will be writing them down more often. He’s essentially exploiting his knowledge about the distribution of field goals to reduce the expected number of bits to write down. It turns out that if he knew the probability $p_i$ of each player $i$ to score, he should encode their name with $nBit(p_i)=\log_2(1/p_i)$ bits. This has been intuitively constructed by Claude (Shannon) himself as it is the only measure (up to a constant) that satisfies the axioms of an information measure. The intuition behind this is the following:
    • Multiplying probabilities of 2 players scoring should result in adding their bits. Indeed imagine Lebron and Ginobili have respectively 0.25 and 0.0625 probability of scoring the next field goal. Then, the probability that Lebron scores the 2 next field goals would be the same as Ginobili scoring a single one ($p(Lebron)*p(Lebron) = 0.25 * 0.25 = 0.0625 = p(Ginobili)$). We should thus allocate 2 times fewer bits for Lebron, so that on average we always add the same number of bits per observation. $nBit(Lebron) = \frac{1}{2} * nBit(Ginobili) = \frac{1}{2} * nBit(p(Lebron)^2)$. The logarithm is a function that turns multiplications into sums as required. The number of bits should thus be of the form $nBit(p_i) = \alpha * \log(p_i) + \beta $
    • Players that have a higher probability of scoring should be encoded by a lower number of bits. I.e. $nBit$ should decrease when $p_i$ increases: $nBit(p_i) = - \alpha * \log(p_i) + \beta, \ \alpha > 0 $
    • If Lebron had a $100\%$ probability of scoring, why would I have bothered asking Claude to write anything down? I would have known everything a priori. I.e. $nBit$ should be $0$ for $p_i = 1$ : $nBit(p_i) = - \alpha * \log(p_i), \ \alpha > 0 $
  2. Now Claude sends me the message containing information about who scored each bucket. Seeing that Lebron scored will surprise me less than seeing that Ginobili did. I.e. Claude’s message gives me more information when telling me that Ginobili scored. If I wanted to quantify my surprise for each field goal, I should make a measure that satisfies the following conditions:
    • The lower the probability of a player to score, the more surprised I will be. The measure of surprise should thus be a decreasing function of probability: $surprise(p_i) = -f(p_i) * \alpha, \ \alpha > 0$, for some increasing function $f$.
    • Supposing that players scoring are independent of one another, it’s reasonable to ask that my surprise seeing Lebron and Ginobili scoring in a row should be the same as the sum of my surprise seeing that Lebron scored and my surprise seeing that Ginobili scored. I.e. multiplying independent probabilities should sum the surprise: $surprise(p_i * p_j) = surprise(p_i) + surprise(p_j)$.
    • Finally, the measure should be continuous given probabilities . $surprise(p_i) = -\log(p_{i}) * \alpha, \alpha > 0$

Taking $\alpha = 1$ for simplicity, we get $surprise(p_i) = -\log(p_i) = nBit(p_i)$. We thus derived a formula for computing the surprise associated with event $x_i$ and the optimal number of bits that should be used to encode that event. This value is called information content $I(p_i)$. In order to get the average surprise / number of bits associated with a random variable $X$ we simply have to take the expectation over all possible events (i.e. average weighted by the probability of each event). This gives us the entropy formula $H(X) = \sum_i p_i \ \log(\frac{1}{p_i}) = - \sum_i p_i\ \log(p_i)$

From the example above we see that entropy corresponds to :

  • the expected number of bits to optimally encode a message
  • the average amount of information gained by observing a random variable
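
To make the derivation concrete, a small sketch (assuming numpy; the scoring probabilities are made up) computing the optimal code lengths and the resulting entropy:

import numpy as np

# Entropy H(X) = -sum_i p_i * log2(p_i): expected number of bits under the optimal code.
p = np.array([0.5, 0.25, 0.125, 0.125])   # made-up scoring probabilities of 4 players
n_bits = np.log2(1 / p)                   # optimal code length per player: 1, 2, 3, 3 bits
entropy = np.sum(p * n_bits)
print("bits per player     :", n_bits)
print("entropy (avg. bits) :", entropy)   # 1.75 bits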

:mag: Side notes :

  • From our derivation we see that the function is defined up to a constant term $\alpha$. This is the reason why the formula works equally well for any logarithmic base, indeed changing the base is the same as multiplying by a constant. In the context of information theory we use $\log_2$.
  • Entropy is the reason (second law of thermodynamics) why putting an ice cube in your Moscow Mule (my go-to drink) doesn’t normally make your ice cube colder and your cocktail warmer. I say “normally” because it is possible but very improbable: ponder about this next time you’re sipping your own go-to drink :smirk: !

:information_source: Resources : Excellent explanation of the link between entropy in thermodynamics and information theory, friendly introduction to entropy related concepts

Differential Entropy

Differential entropy (= continuous entropy) is the generalization of entropy for continuous random variables.

Given a continuous random variable $X$ with a probability density function $f(x)$:

$h(X) = - \int_{\mathcal{X}} f(x) \log\left(f(x)\right) \, dx$

If you had to make a guess, which distribution maximizes entropy for a given variance ? You guessed it : it’s the Gaussian distribution.

:mag: Side notes : Differential entropy can be negative.

Cross Entropy

We saw that entropy is the expected number of bits used to encode an observation of $X$ under the optimal coding scheme. In contrast, cross entropy is the expected number of bits to encode an observation of $X$ under the wrong coding scheme. Let’s call $q$ the wrong probability distribution that is used to make a coding scheme. Then we will use $-\log(q_i)$ bits to encode the $i^{th}$ possible value of $X$. Although we are using $q$ as the wrong probability distribution, the observations will still be distributed based on $p$. We thus have to take the expected value over $p$ :

$H(p,q) = \mathbb{E}_{p}\left[-\log(q_i)\right] = - \sum_i p_i \log(q_i)$

From this interpretation it naturally follows that:

  • $H(p,q) > H(p), \forall q \neq p$
  • $H(p,p) = H(p)$

:mag: Side notes : Log loss is often called cross entropy loss, indeed it is the cross-entropy between the distribution of the true labels and the predictions.

Kullback-Leibler Divergence

The Kullback-Leibler Divergence (= relative entropy = information gain) from $q$ to $p$ is simply the difference between the cross entropy and the entropy:

$D_{KL}(p \parallel q) = H(p,q) - H(p) = \sum_i p_i \log\left(\frac{p_i}{q_i}\right)$

:bulb: Intuition

  • KL divergence corresponds to the number of additional bits you will have to use when using an encoding scheme based on the wrong probability distribution $q$ compared to the real $p$ .
  • KL divergence says on average how much more surprised you will be by rolling a loaded dice while thinking it’s fair, compared to the surprise of knowing that it’s loaded.
  • KL divergence is often called the information gain achieved by using $p$ instead of $q$
  • KL divergence can be thought of as the “distance” between 2 probability distributions. Mathematically it’s not a distance as it’s not symmetric. It is thus more correct to say that it is a measure of how a probability distribution $q$ diverges from another one $p$.

KL divergence is often used with probability distributions of continuous random variables. In this case the expectation involves integrals:

$D_{KL}(p \parallel q) = \int_{-\infty}^{\infty} p(x) \log\left(\frac{p(x)}{q(x)}\right) \, dx$

In order to understand why the KL divergence is not symmetric, it is useful to think of a simple example of a dice and a coin (let’s indicate heads and tails by 0 and 1 respectively). Both are fair and thus their distributions are uniform. Their entropies are trivially: $H(p_{coin})=\log(2)$ and $H(p_{dice})=\log(6)$. Let’s first consider $D_{KL}(p_{coin} \parallel p_{dice})$. The 2 possible events of $X_{coin}$ are 0 and 1, which are also possible outcomes for the dice. The average number of bits to encode a coin observation under the dice encoding will thus simply be $\log(6)$, and the KL divergence is of $\log(6)-\log(2)$ additional bits. Now let’s consider the problem the other way around: $D_{KL}(p_{dice} \parallel p_{coin})$. We will use $\log(2)=1$ bit to encode the events 0 and 1. But how many bits will we use to encode the 4 other faces of the dice? Well, the optimal encoding for the coin doesn’t have any code for these, as they will never happen in its world. The KL divergence is thus not defined (division by 0). The KL divergence is thus not symmetric and cannot be a distance.

:mag: Side notes : Minimizing cross entropy with respect to $q$ is the same as minimizing $D_{KL}(p \parallel q)$. Indeed the 2 equations are equivalent up to an additive constant (the entropy of $p$) which doesn’t depend on $q$.
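
A quick numerical check of the relation $D_{KL}(p \parallel q) = H(p,q) - H(p)$ (assuming numpy; the two distributions are made up):

import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "true" distribution
q = np.array([0.4, 0.4, 0.2])   # "wrong" coding distribution

entropy = -np.sum(p * np.log2(p))
cross_entropy = -np.sum(p * np.log2(q))
kl = np.sum(p * np.log2(p / q))
print(cross_entropy - entropy, kl)   # both values are the same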

Mutual Information

:bulb: Intuition : The mutual information $\operatorname{I}(X;Y) = \sum_{x,y} p(x,y) \log \left( \frac{p(x,y)}{p(x)\,p(y)} \right)$ between 2 random variables $X$ and $Y$ measures how much (on average) information about one of the r.v. you receive by knowing the value of the other. If $X,\ Y$ are independent, then knowing $X$ doesn’t give information about $Y$, so $\operatorname{I} (X;Y)=0$ because $p(x,y)=p(x)p(y)$. The maximum information you can get about $Y$ from $X$ is all the information of $Y$, i.e. $H(Y)$. This is the case for $X=Y$ : $\operatorname{I} (Y;Y)= \sum_{y \in \mathcal Y} p(y) \log \left(\frac{p(y)}{p(y)\,p(y)} \right) = H(Y)$

:mag: Side note :

  • The mutual information is more similar to the concept of entropy than to information content. Indeed, the latter was only defined for an outcome of a random variable, while the entropy and mutual information are defined for a r.v. by taking an expectation.
  • $\operatorname{I} (X;Y) \in [0, \min(H(X), H(Y))]$
  • $\operatorname{I} (X;X) = H(X)$
  • $\operatorname{I} (X;Y) = 0 \iff X \,\bot\, Y$
  • The generalization of mutual information to $V$ random variables $X_1,X_2,\ldots,X_V$ is the Total Correlation: $C(X_1, X_2, \ldots, X_V) := \operatorname{D_{KL}}\left[ p(X_1, \dots, X_V) \parallel p(X_1)p(X_2)\dots p(X_V)\right]$. It denotes the total amount of information shared across the entire set of random variables. The minimum $C_\min=0$ when no r.v. are statistically dependent. The maximum total correlation occurs when a single r.v. determines all the others : $C_\max = \sum_{i=1}^V H(X_i)-\max\limits_{X_i}H(X_i)$.
  • Data Processing Inequality: for any Markov chain $X \rightarrow Y \rightarrow Z$: $\operatorname{I} (X;Y) \geq \operatorname{I} (X;Z)$
  • Reparametrization Invariance: for invertible functions $\phi,\psi$: $\operatorname{I} (X;Y) = \operatorname{I} (\phi(X);\psi(Y))$
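
As a small check of the intuition above, mutual information can be computed directly from a joint probability table (assuming numpy; the joint distribution is made up):

import numpy as np

# I(X;Y) = sum_xy p(x,y) * log2( p(x,y) / (p(x) p(y)) )
p_xy = np.array([[0.3, 0.1],
                 [0.1, 0.5]])
p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of X (rows)
p_y = p_xy.sum(axis=0, keepdims=True)   # marginal of Y (columns)
mi = np.sum(p_xy * np.log2(p_xy / (p_x * p_y)))
print("I(X;Y) =", mi, "bits")   # 0 would mean X and Y are independent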

Machine Learning and Entropy

This is all interesting, but why are we talking about information theory concepts in machine learning :sweat_smile: ? Well it turns out that many ML algorithms can be interpreted with entropy-related concepts.

The 3 major ways we see entropy in machine learning are through:

  • Maximizing the information gain (i.e. the decrease in entropy) at each step of our algorithm. Example:

    • When building decision trees you greedily select to split on the attribute which maximizes information gain (i.e the difference of entropy before and after the split). Intuitively you want to know the value of the attribute, that decreases the randomness in your data by the largest amount.
  • Minimizing KL divergence between the actual unknown probability distribution of observations $p$ and the predicted one $q$. Example:

    • The Maximum Likelihood Estimator (MLE) of our parameters $\hat{ \theta }_{MLE}$ is also the parameter which minimizes the KL divergence between our predicted distribution $q_\theta$ and the actual unknown one $p$ (or equivalently the cross entropy). I.e.

      $\hat{ \theta }_{MLE} = arg\max_{\theta} \, \sum_{n=1}^N \log q_\theta(x^{(n)}) = arg\min_{\theta} \, D_{KL}(p \parallel q_\theta) = arg\min_{\theta} \, H(p, q_\theta)$
  • Minimizing the KL divergence between the computationally intractable $p$ and a simpler approximation $q$. Indeed machine learning is not only about theory but also about how to make something work in practice. Example:

    • This is the whole point of Variational Inference (= variational Bayes) which approximates posterior probabilities of unobserved variables that are often intractable due to the integral in the denominator. Thus turning the inference problem to an optimization one. These methods are an alternative to Monte Carlo sampling methods for inference (e.g. Gibbs Sampling). In general sampling methods are slower but asymptotically exact.

No Free Lunch Theorem

There is no one model that works best for every problem.

Let’s try predicting the next fruit in the sequence:

:tangerine: :apple: :tangerine: :apple: :tangerine: ...

You would probably say :apple: right ? Maybe with a lower probability you would say :tangerine: . But have you thought of saying :watermelon: ? I doubt it. I never told you that the sequence was constrained in the type of fruit, but naturally we make assumptions that the data “behaves well”. The point here is that without any knowledge/assumptions on the data, all future data are equally probable.

The theorem builds on this, and states that every algorithm has the same performance when averaged over all data distributions. So the average performance of a deep learning classifier is the same as that of a random classifier.

:mag: Side Notes :

  • You will often hear the name of this theorem when someone asks a question starting with “what is the best […] ?”.
  • In the real world, things tend to “behave well”. They are for example often (locally) continuous. In such settings some algorithms are definitely better than others.
  • Since the theorem’s publication in 1996, other methods have kept the lunch metaphor. For example: kitchen sink algorithms, random kitchen sink, fastfood, à la carte, and that’s one of the reasons why I decided to stick with fruit examples in this blog :wink:.
  • The theorem has been extended to optimization and search algorithms.

:information_source: Resources : D. Wolpert’s proof.

Parametric vs Non Parametric

These 2 types of methods distinguish themselves based on their answer to the following question: “Will I use the same amount of memory to store the model trained on $100$ examples as to store a model trained on $10 000$ of them?” If yes, then you are using a parametric model. If not, you are using a non-parametric model.

  • Parametric:
    • :bulb: The memory used to store a model trained on $100$ observations is the same as for a model trained on $10 000$ of them .
    • I.e: The number of parameters is fixed.
    • :white_check_mark: Computationally less expensive to store and predict.
    • :white_check_mark: Less variance.
    • :x: More bias.
    • :x: Makes more assumption on the data to fit less parameters.
    • :school_satchel: Example : K-Means clustering, Linear Regression, Neural Networks:

    Linear Regression

  • Non Parametric:
    • :bulb: I will use less memory to store a model trained on $100$ observations than for a model trained on $10 000$ of them.
    • I.e: The number of parameters grows with the training set.
    • :white_check_mark: More flexible / general.
    • :white_check_mark: Makes less assumptions.
    • :white_check_mark: Less bias.
    • :x: More variance.
    • :x: Bad if the test set is relatively different from the training set.
    • :x: Computationally more expensive as it has to store and compute over a higher number of “parameters” (unbounded).
    • :school_satchel: Example : K-Nearest Neighbors clustering, RBF Regression, Gaussian Processes:

    RBF Regression

:wrench: Practical : Start with a parametric model. It’s often worth trying a non-parametric model if: you are doing clustering, or the training data is not too big but the problem is very hard.

:mag: Side Note : Strictly speaking, any non-parametric model could be seen as an infinite-parametric model. So if you want to be picky: next time you hear a colleague talking about non-parametric models, tell them it’s in fact parametric. I decline any liability for the consequences on your relationship with them :sweat_smile: .
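
A small sketch of the distinction (assuming scikit-learn and numpy; the toy regression data is made up): the parametric model keeps a fixed number of coefficients, while the non-parametric one must keep the whole training set around to predict.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.random((1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(1000)

lin = LinearRegression().fit(X, y)                    # parametric
knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)    # non-parametric

print("parametric     :", lin.coef_.size + 1, "stored parameters (same for any N)")
print("non-parametric :", X.shape[0], "training points must be kept to predict")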

Supervised Learning

Supervised learning tasks tackle problems that have labeled data.

:bulb: Intuition : It can be thought of as a teacher who corrects a multiple choice exam. At the end you will get your average result as well as the correct answer to each of the questions.

Supervised learning can be further separated into two broad types of problems:

  • Classification: here the output variable $y$ is categorical. We are basically trying to assign one or multiple classes to an observation. Example: is it a cat or not ?
  • Regression: here the output variable $y$ is continuous. Example : how tall is this person ?

Classification

The classification problem consists of assigning a set of classes/categories to an observation. I.e. learning a mapping from the features $\mathbf{x}$ to a label $y \in \{1, \ldots, C\}$.

Classification problems can be further separated into:

  • Binary: There are 2 possible classes.
  • Multi-Class: There are more than 2 possible classes.
  • Multi-Label: If labels are not mutually exclusive. Often replaced by binary classification specifying whether an observation should be assigned to each class.

Common evaluation metrics include Accuracy, F1-Score, AUC… I have a section devoted for these classification metrics.

:wavy_dash: Compare to : Regression

Decision Trees

Overview
  • :bulb: Intuition :
    • Split the training data based on “the best” question (e.g. is he older than 27 ?). Recursively split the data while you are unhappy with the classification results.
    • Decision trees are basically the algorithm to use for the “20 questions” game. Akinator is a nice example of what can be implemented with decision trees. Akinator is probably based on fuzzy logic expert systems (as it can work with wrong answers) but you could do a simpler version with decision trees.
    • “Optimal” splits are found by maximization of information gain or similar methods.
  • :wrench: Practical :
    • Decision trees thrive when you need a simple and interpretable model but the relationship between $y$ and $\mathbf{x}$ is complex.
    • Training Complexity : $O(MND + ND\log(N) )$ .
    • Testing Complexity : $O(MT)$ .
    • Notation Used : $M = $ depth of the tree ; $N = $ number of training examples ; $D = $ number of features ; $T = $ number of test examples.
  • :white_check_mark: Advantage :
    • Interpretable .
    • Few hyper-parameters.
    • Needs less data cleaning :
      • No normalization needed.
      • Can handle missing values.
      • Handles numerical and categorical variables.
    • Robust to outliers.
    • Doesn’t make assumptions regarding the data distribution.
    • Performs feature selection.
    • Scales well.
  • :x: Disadvantage :
    • Generally poor accuracy because of the greedy selection.
    • High variance because if the top split changes, everything does.
    • Splits are parallel to features axes => need multiple splits to separate 2 classes with a 45° decision boundary.
    • No online learning.

The basic idea behind building a decision tree is to :

  1. Find an optimal split (feature + threshold). I.e. the split which minimizes the impurity (maximizes information gain).
  2. Partition the dataset into 2 subsets based on the split above.
  3. Recursively apply $1$ and $2$ this to each new subset until a stop criterion is met.
  4. To avoid over-fitting: prune the nodes which “aren’t very useful”.

Here is a little gif showing these steps:

Building Decision Trees Classification

Note: For more information, please see the “details” and “Pseudocode and Complexity” drop-down below.

Details

The idea behind decision trees is to partition the input space into multiple regions. E.g. region of men who are more than 27 years old. Then predict the most probable class for each region, by assigning the mode of the training data in this region. Unfortunately, finding an optimal partitioning is usually computationally infeasible (NP-complete) due to the combinatorially large number of possible trees. In practice the different algorithms thus use a greedy approach. I.e. each split of the decision tree tries to maximize a certain criterion regardless of the next splits.

How should we define an optimality criterion for a split? Let’s define an impurity (error) of the current state, which we’ll try to minimize. Here are 3 possible state impurities:

  • Classification Error:
    • :bulb: The accuracy error : $1-Acc$ of the current state. I.e. the error we would do, by stopping at the current state.
  • Entropy:
    • :bulb: How unpredictable are the classes of the current state.
    • Minimizing the entropy corresponds to maximizing the information gain.
  • Gini Impurity:
    • :bulb: Expected ($\mathbb{E}[\cdot] = \sum_{c=1}^C p(c) (\cdot) $) probability of misclassifying ($\sum_{c=1}^C p(c) (1-\cdot)$) a randomly selected element, if it were classified according to the label distribution ($\sum_{c=1}^C p(c) (1-p(c))$) .
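
A quick sketch of the 3 impurities above for a binary node with class probabilities $(p, 1-p)$ (assuming numpy):

import numpy as np

def classification_error(p):
    return 1 - max(p, 1 - p)

def entropy(p):
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def gini(p):
    return 2 * p * (1 - p)   # p(1-p) + (1-p)p

for p in [0.5, 0.7, 0.9]:
    print(f"p={p}: error={classification_error(p):.2f}, "
          f"entropy={entropy(p):.2f}, gini={gini(p):.2f}")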

Here is a quick graph showing the impurity depending on a class distribution in a binary setting:

Impurity Measure

:mag: Side Notes :

  • Classification error may seem like a natural choice, but don’t get fooled by appearances: it’s generally worse than the 2 other methods:
    • It is “more” greedy than the others. Indeed, it only focuses on the current error, while Gini and Entropy try to make a purer split which will make subsequent steps easier. Suppose we have a binary classification with 100 observations in each class $(100,100)$. Let’s compare a split which divides the data into $(20,80)$ and $(80,20)$, to another split which would divide it into $(40,100)$ and $(60,0)$. In both cases the accuracy error would be $0.20$ ($20\%$). But we would prefer the second case, which is pure and will not have to be split further. Gini impurity and the Entropy would correctly choose the latter.
    • Classification error takes into account only the most probable class. So a split with 2 extremely probable classes will have a similar error to a split with one extremely probable class and many improbable ones.
  • Gini Impurity and Entropy differ less than 2% of the time as you can see in the plot above. Entropy is a little slower to compute due to the logarithmic operation.

When should we stop splitting? It is important not to split too many times to avoid over-fitting. Here are a few heuristics that can be used as a stopping criterion:

  • When the number of training examples in a leaf node is small.
  • When the depth reaches a threshold.
  • When the impurity is low.
  • When the purity gain due to the split is small.

Such heuristics require problem-dependent thresholds (hyperparameters), and can yield relatively bad results. For example decision trees might have to split the data without any purity gain, to reach high purity gains at the following step. It is thus common to grow large trees using the number of training examples in a leaf node as a stopping criterion. To avoid over-fitting, the algorithm would prune back the resulting tree. In CART, the pruning criterion $C_{pruning}(T)$ balances impurity and model complexity by regularization. The regularized variable is often the number of leaf nodes $\vert T \vert$, as below:

$C_{pruning}(T) = \sum_{v=1}^{\vert T \vert} I(T_v) + \lambda \vert T \vert$

$\lambda$ is selected via cross validation and trades off impurity and model complexity, for a given tree $T$ with leaf nodes $v=1 \ldots \vert T \vert$, using impurity measure $I$.

Variants: there are various decision tree methods, that differ with regards to the following points:

  • Splitting Criterion ? Gini / Entropy.
  • Technique to reduce over-fitting ?
  • How many variables can be used in a split ?
  • Building binary trees ?
  • Handling of missing values ?
  • Do they handle regression?
  • Robustness to outliers?

Famous variants:

  • ID3: first decision tree implementation. Not used in practice.
  • C4.5: Improvement over ID3 by the same developer. Error based pruning. Uses entropy. Handles missing values. Susceptible to outliers. Can create empty branches.
  • CART: Uses Gini. Cost complexity pruning. Binary trees. Handles missing values. Handles regression. Not susceptible to outliers.
  • CHAID: Finds a splitting variable using Chi-squared to test the dependency between a variable and a response. No pruning. Seems better for describing the data, but worse for predicting.

Other variants include : C5.0 (next version of C4.5, probably less used because it’s patented), MARS.

:information_source: Resources : A comparative study of different decision tree methods.

Pseudocode and Complexity
  • Pseudocode: The simple version of a decision tree can be written in a few lines of Python pseudocode:
def buildTree(X,Y):
    if stop_criteria(X,Y):
        # if we stop, the node is a leaf predicting the majority class
        tree.class = mode(Y)
        return tree

    # greedily search for the split (feature, threshold) with the lowest impurity
    minImpurity = infinity
    bestSplit = None
    for j in features:
        for T in thresholds:
            if impurity(X,Y,j,T) < minImpurity:
                bestSplit = (j,T)
                minImpurity = impurity(X,Y,j,T)

    X_left,Y_left,X_right,Y_right = split(X,Y,bestSplit)

    tree.split = bestSplit # adds current split
    tree.left = buildTree(X_left,Y_left) # adds subsequent left splits
    tree.right = buildTree(X_right,Y_right) # adds subsequent right splits
    return tree

def singlePredictTree(tree,xi):
    # a leaf node stores the predicted class
    if tree.class is not Null:
        return tree.class

    # otherwise follow the branch indicated by the split
    j,T = tree.split
    if xi[j] >= T:
        return singlePredictTree(tree.right,xi)
    else:
        return singlePredictTree(tree.left,xi)

def allPredictTree(tree,Xt):
    t,d = Xt.shape
    Yt = vector(t)
    for i in range(t):
        Yt[i] = singlePredictTree(tree,Xt[i,:])

    return Yt
  • Complexity I will be using the following notation: $M = $ depth of the tree ; $K = $ number of thresholds tested per feature ; $N = $ number of training examples ; $D = $ number of features ; $T = $ number of test examples.

Let’s first think about the complexity for building the first decision stump (first function call):

  • In a decision stump, we loop over all features and thresholds $O(KD)$, then compute the impurity. The impurity depends solely on the class probabilities. Computing the probabilities means looping over all $X$ and counting the $Y$ : $O(N)$. With this simple pseudocode, the time complexity for building a stump is thus $O(KDN)$.
  • In reality, we don’t have to look for arbitrary thresholds, only for the unique values taken by at least one example. E.g. no need of testing $feature_j>0.11$ and $feature_j>0.12$ when all $feature_j$ are either $0.10$ or $0.80$. Let’s replace the number of possible thresholds $K$ by the training set size $N$: $O(N^2D)$.
  • Currently we are looping twice over all $X$, once for the threshold and once to compute the impurity. If the data was sorted by the current feature, the impurity could simply be updated as we loop through the examples. E.g. when considering the rule $feature_j>0.8$ after having already considered $feature_j>0.7$, we do not have to recompute all the class probabilities: we can simply take the probabilities from $feature_j>0.7$ and make the adjustments knowing the number of examples with $feature_j==0.7$. For each feature $j$ we should first sort all data $O(N\log(N))$ then loop once in $O(N)$; the final complexity would be $O(DN\log(N))$.

We now have the complexity of a decision stump. You could think that finding the complexity of building a tree would be multiplying it by the number of function calls: right? Not really, that would be an over-estimate. Indeed, at each function call, the training data size $N$ would have decreased. The intuition for the result we are looking for, is that at each level $l=1 \ldots M$ the sum of the training data in each function is still $N$. Multiple functions working in parallel with subsets of examples take the same time as a single function would with the whole training set $N$. The complexity at each level is thus still $O(DN\log(N))$, so the complexity for building a tree of depth $M$ is $O(MDN\log(N))$. Proof that the work at each level stays constant:

At each iteration the dataset is split into $\nu$ subsets of $k_i$ elements and a set of $N-\sum_{i=1}^{\nu} k_i$ elements. At every level, the total cost would therefore be (using properties of logarithms and the fact that $k_i \le N$ ) :

$O\left(D k_1 \log(k_1)\right) + \ldots + O\left(D \left(N - \sum_{i=1}^{\nu} k_i\right) \log\left(N - \sum_{i=1}^{\nu} k_i\right)\right) \leq O\left(D \log(N) \left(k_1 + \ldots + \left(N - \sum_{i=1}^{\nu} k_i\right)\right)\right) = O(DN\log(N))$

The last possible adjustment I see, is to sort everything once, store it and simply use this precomputed data at each level. The final training complexity is therefore $O(MDN + ND\log(N))$ .

The time complexity of making predictions is straightforward: for each of the $T$ examples, answer a question at each of the $M$ levels. I.e. $O(MT)$ .
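
In practice you would rarely implement the above yourself; here is a minimal scikit-learn sketch (assuming scikit-learn is installed) of a CART-style tree, where max_depth plays the role of $M$:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A CART-style decision tree; limiting the depth is a simple way to avoid over-fitting.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
tree = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))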

Naive Bayes

Overview
Gaussian Case: Piecewise Quadratic Decision Boundary
Discrete Case: Piecewise Linear Decision Boundary
  • :bulb: Intuition :
    • In the discrete case: “advanced counting”. E.g. given a sentence $x_i$ to classify as spam or not, count all the times each word $w^i_j$ appeared in previously seen spam sentences, and predict spam if the total (weighted) “spam counts” is larger than the “non spam counts”.
    • Use a conditional independence assumption (“naive”) to have better estimates of the parameters even with little data.
  • :wrench: Practical :
    • Good and simple baseline.
    • Thrives when the number of features is large but the dataset size is not.
    • Training Complexity : $O(ND)$ .
    • Testing Complexity : $O(TDK)$ .
    • Notation Used : $N = $ number of training examples ; $D = $ number of features ; $T = $ number of test examples ; $K = $ number of classes.
  • :white_check_mark: Advantage :
    • Simple to understand and implement.
    • Fast and scalable (train and test).
    • Handles online learning.
    • Works well with little data.
    • Not sensitive to irrelevant features.
    • Handles real and discrete data.
    • Probabilistic.
    • Handles missing value.
  • :x: Disadvantage :
    • Strong conditional independence assumption of features given labels.
    • Sensitive to features that have not often been seen (mitigated by smoothing).

Naive Bayes is a family of generative models that predicts $p(y=c \vert\mathbf{x})$ by assuming that all features are conditionally independent given the label: $x_i \perp x_j \vert y , \forall i,j$. This is a simplifying assumption that very rarely holds in practice, which makes the algorithm “naive”. Classifying with such an assumption is very easy:

$\hat{y} = arg\max_{c=1 \ldots C} \, p(y=c, \pmb\theta) \prod_{j=1}^D p(x_j \vert y=c, \pmb\theta)$

Note that because we are in a classification setting $y$ takes discrete values, so $p(y=c, \pmb\theta)=\pi_c$ is a categorical distribution.

You might wonder why we use the simplifying conditional independence assumption. We could directly predict using $\hat{y} = arg\max_c p(y=c, \pmb\theta)p(\mathbf{x} \vert y=c, \pmb\theta)$. The conditional independence assumption enables us to have better estimates of the parameters $\theta$ using less data. Indeed, $p(\mathbf{x} \vert y=c, \pmb\theta)$ requires much more data to estimate as it is a $D$ dimensional distribution (for each possible label $c$), while $\prod_{j=1}^D p(x_j \vert y=c, \pmb\theta)$ factorizes it into $D$ 1-dimensional distributions, which requires a lot less data to fit due to the curse of dimensionality. In addition to requiring less data, it also enables us to easily mix different families of distributions for each feature.

We still have to address 2 important questions:

  • What family of distribution to use for $p(x_j \vert y=c, \pmb\theta)$ (often called the event model of the Naive Bayes classifier)?
  • How to estimate the parameters $\theta$?
Event Models of Naive Bayes

The family of distributions to use is an important design choice that will give rise to specific types of Naive Bayes classifiers. Importantly the family of distribution $p(x_j \vert y=c, \pmb\theta)$ does not need to be the same $\forall j$, which enables the use of very different features (e.g. continuous and discrete). In practice, people often use Gaussian distribution for continuous features, and Multinomial or Bernoulli distributions for discrete features :

Gaussian Naive Bayes :

Using a Gaussian distribution is a typical assumption when dealing with continuous data $x_j \in \mathbb{R}$. This corresponds to assuming that each feature conditioned over the label is a univariate Gaussian:

$p(x_j \vert y=c, \pmb\theta) = \mathcal{N}(x_j;\mu_{jc},\sigma_{jc}^2)$

Note that if all the features are assumed to be Gaussian, this corresponds to fitting a multivariate Gaussian with a diagonal covariance : $p(\mathbf{x} \vert y=c, \pmb\theta)= \mathcal{N}(\mathbf{x};\pmb\mu_{c},\text{diag}(\pmb\sigma_{c}^2))$.

The decision boundary is quadratic as it corresponds to ellipses (Gaussians) that intersect.

Multinomial Naive Bayes :

In the case of categorical features $x_j \in \{1,…, K\}$ we can use a Multinomial distribution, where $\theta_{jc}$ denotes the probability of having feature $j$ at any step of an example of class $c$ :

$p(\mathbf{x} \vert y=c, \pmb\theta) = \operatorname{Mu}(\mathbf{x}; \pmb\theta_c) = \frac{(\sum_{j=1}^D x_j)!}{\prod_{j=1}^D x_j!} \prod_{j=1}^D \theta_{jc}^{x_j}$

Multinomial Naive Bayes is typically used for document classification, and corresponds to representing all documents as a bag of words (no order). We then estimate (see below) $\theta_{jc}$ by counting word occurrences to find the proportion of times word $j$ is found in documents classified as $c$.

The equation above is called Naive Bayes although the features $x_j$ are not technically independent because of the constraint $\sum_j x_j = const$. The training procedure is still the same because for classification we only care about comparing probabilities rather than their absolute values, in which case Multinomial Naive Bayes actually gives the same results as a product of Categorical Naive Bayes whose features satisfy the conditional independence property.

Multinomial Naive Bayes is a linear classifier when expressed in log-space :

$\log p(y=c \vert \mathbf{x}, \pmb\theta) = \log \pi_c + \sum_{j=1}^D x_j \log \theta_{jc} + const = b_c + \mathbf{w}_c^T \mathbf{x} + const$

Multivariate Bernoulli Naive Bayes:

In the case of binary features $x_j \in \{0,1\}$ we can use a Bernoulli distribution, where $\theta_{jc}$ denotes the probability that feature $j$ occurs in class $c$:

$p(\mathbf{x} \vert y=c, \pmb\theta) = \prod_{j=1}^D \operatorname{Ber}(x_j; \theta_{jc}) = \prod_{j=1}^D \theta_{jc}^{x_j} (1-\theta_{jc})^{1-x_j}$

Bernoulli Naive Bayes is typically used for classifying short text , and corresponds to looking at the presence and absence of words in a phrase (no counts).

Multivariate Bernoulli Naive Bayes is not the same as using Multinomial Naive Bayes with frequency counts truncated to 1. Indeed, it models the absence of words in addition to their presence.

Training

Finally we have to train the model by finding the best estimated parameters $\hat\theta$. This can either be done using pointwise estimates (e.g. MLE) or a Bayesian perspective.

Maximum Likelihood Estimate (MLE):

The negative log-likelihood of the dataset $\mathcal{D}=\{\mathbf{x}^{(n)},y^{(n)}\}_{n=1}^N$ is :

As the negative log likelihood decomposes into terms that only depend on $\pi$ and on each $\theta_{jc}$, we can optimize all parameters separately.

Minimizing the first term by using Lagrange multipliers to enforce $\sum_c \pi_c = 1$, we get $\hat\pi_c = \frac{N_c}{N}$, which is naturally the proportion of examples labeled with $y=c$.

The estimate of $\theta_{jc}$ depends on the family of distributions we are using. In the Multinomial case it can be shown to be $\hat\theta_{jc}=\frac{N_{jc}}{N_c}$, which is very easy to compute as it only requires counting how many times a certain feature $x_j$ is seen in examples with label $y=c$.
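As a minimal numpy sketch of these counting estimates for the Multinomial case (the names `X`, an $N \times D$ matrix of counts, and `y`, an array of integer labels, are made up for illustration):

import numpy as np

def multinomial_nb_mle(X, y, n_classes):
    """MLE for Multinomial Naive Bayes: pi_c = N_c / N and theta_jc = N_jc / N_c."""
    N, D = X.shape
    pi = np.zeros(n_classes)
    theta = np.zeros((n_classes, D))
    for c in range(n_classes):
        X_c = X[y == c]                   # examples labeled c
        pi[c] = len(X_c) / N              # proportion of examples with label y=c
        counts = X_c.sum(axis=0)          # N_jc: total count of feature j in class c
        theta[c] = counts / counts.sum()  # N_jc / N_c
    return pi, theta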

Bayesian Estimate:

The problem with MLE is that it over-fits. For example, if a feature is always present in all training samples (e.g. the word “the” in document classification), then the model will break if it sees a test sample without that feature, as it would give a probability of 0 to all labels.

By taking a Bayesian approach, over-fitting is mitigated thanks to priors. In order to do so we have to compute the posterior :

Using (factored) conjugate priors (Dirichlet for the Multinomial, Beta for the Bernoulli, Gaussian for the Gaussian), this gives the same estimates as in the MLE case (the prior has the same form as the likelihood and posterior) but regularized.

The Bayesian framework requires predicting by integrating out all the parameters $\pmb\theta$. The only difference with the first set of equations we have derived for classifying using Naive Bayes, is that the predictive distribution is conditioned on the training data $\mathcal{D}$ instead of the parameters $\pmb\theta$:

As before, the term to maximize decomposes into one term per parameter, so each integral can be computed independently. These integrals are generally intractable, but in the case of the Gaussian, the Multinomial, and the Bernoulli with their corresponding conjugate priors, we can fortunately compute closed form solutions.

In the case of the Multinomial (and Bernoulli), the solution is equivalent to predicting with a point estimate $\hat{\pmb\theta}=\bar{\pmb\theta}$, where $\bar{\pmb\theta}$ is the mean of the posterior distribution.

Using a symmetric Dirichlet prior : $p(\pmb\pi)=\text{Dir}(\pmb\pi; \pmb{1}\alpha_\pi)$ we get that $\hat\pi_c = \bar\pi_c = \frac{N_c + \alpha_\pi }{N + \alpha_\pi C}$.

The Bayesian Multinomial naive Bayes is equivalent to predicting with a point estimate :

I.e. Bayesian Multinomial Naive Bayes with symmetric Dirichlet prior assigns predictive posterior distribution:

the corresponding graphical model is:

Bayesian Naive Bayes

When using a uniform prior $\alpha_\theta=1$, this equation is called Laplace smoothing or add-one smoothing. $\alpha_\theta$ intuitively represents a “pseudocount” for each feature $x_{j}$ in each class $c$. For document classification it simply corresponds to giving an initial non zero count to all words, which avoids the problem of having a test document $x^{(t)}$ with $p(y=c\vert x^{(t)})=0$ if it contains a single word $x_{j}^*$ that has never been seen in a training document with label $c$. $\alpha=1$ is a common choice in examples, although smaller values often work better.
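Continuing the counting sketch above, Laplace smoothing simply adds the pseudocount before normalizing, and prediction can then be done in log-space to avoid underflow (again a sketch, with hypothetical helper names):

def multinomial_nb_smoothed(X, y, n_classes, alpha=1.0):
    """Same counting estimates as above, but with an additive pseudocount alpha."""
    N, D = X.shape
    pi = np.array([(y == c).sum() / N for c in range(n_classes)])
    theta = np.zeros((n_classes, D))
    for c in range(n_classes):
        counts = X[y == c].sum(axis=0) + alpha  # N_jc + alpha
        theta[c] = counts / counts.sum()        # (N_jc + alpha) / (N_c + alpha * D)
    return pi, theta

def predict(X, pi, theta):
    """Predicts argmax_c [ log pi_c + sum_j x_j log theta_jc ], i.e. linear in log-space."""
    return (np.log(pi) + X @ np.log(theta).T).argmax(axis=1)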

:mag: Side Notes :

  • The “Naive” part of the name comes from the conditional independence of features given the label. The “Bayes” part comes from the use of Bayes’ theorem to make predictions with a generative model, but it is not a Bayesian method as it does not require marginalizing over all parameters.

  • If we estimated the Multinomial Naive Bayes using a Maximum A Posteriori Estimate (MAP) instead of the MLE or the Bayesian way, we would predict using the mode of the posterior (instead of the mean in the Bayesian case) $\hat\theta_{jc}=\frac{N_{jc} + \alpha - 1}{N_c + (\alpha -1)D}$ (similarly for $\pmb\pi$). This means that Laplace smoothing could also be interpreted as using MAP with a non uniform prior $\text{Dir}(\pmb\theta; \pmb{2})$. But when using a uniform Dirichlet prior, MAP coincides with the MLE.

  • Gaussian Naive Bayes is equivalent to Quadratic Discriminant Analysis when each covariance matrix $\Sigma_c$ is diagonal.

  • Discrete Naive Bayes and Logistic Regression are a “generative-discriminative pair” as they both take the same form (linear in log probabilities) but estimate parameters differently. For example, in the binary case, Naive Bayes predicts the class with the largest probability. I.e. it predicts $C_1$ if $\log \frac{p(C_1 \vert \pmb{x})}{p(C_2 \vert \pmb{x})} = \log \frac{p(C_1 \vert \pmb{x})}{1-p(C_1 \vert \pmb{x})} > 0$. We have seen that discrete Naive Bayes is linear in log space, i.e. $\log p(C_k, \pmb{x}) = b_k + \mathbf{w}^T_k \mathbf{x}$ (up to a class-independent term), so we can rewrite the decision rule as $\log \frac{p(C_1 \vert \pmb{x})}{1-p(C_1 \vert \pmb{x})} = \log p(C_1, \pmb{x}) - \log p(C_2, \pmb{x}) = (b_1 - b_2) + (\mathbf{w}_1 - \mathbf{w}_2)^T \mathbf{x} = b’ + \mathbf{w’}^T \mathbf{x} > 0$. This linear model of the log odds ratio is exactly the form of Logistic Regression (the usual equation is recovered by solving $\log \frac{p}{1-p} = b’ + \mathbf{w’}^T \mathbf{x}$ for $p$). The same can be shown for Multinomial Naive Bayes and Multinomial Logistic Regression.

  • For document classification, it is common to replace raw counts in Multinomial Naive Bayes with tf-idf weights.

:information_source: Resources : See section 3.5 of K. Murphy’s book for all the derivation steps and examples.

Regression

Decision Trees

Decision trees are more often used for classification problems. I thus talk at length about them here.

The 2 differences with decision trees for classification are:

  • What error to minimize for an optimal split? This replaces the impurity measure in the classification setting. A widely used error function for regression is the sum of squared errors. We use the sum rather than the mean of squared errors so that subtracting the error after a split from the error before it makes sense (regions with more points weigh more). Sum of squared errors for region $R$ (a sketch of using it to find an optimal split is given after this list):
  • What to predict for a given space region? In the classification setting, we predicted the mode of the subset of training data in this space. Taking the mode doesn’t make sense for a continuous variable. Now that we’ve defined an error function above, we would like to predict the value that minimizes this sum of squares error function, which corresponds to the region's average value. Predicting the mean is intuitively what we would have done.
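As a minimal numpy sketch (assuming a single feature `x` and continuous targets `y`, both hypothetical arrays), an optimal split threshold could be found as follows:

import numpy as np

def sse(y):
    """Sum of squared errors of a region when predicting its mean."""
    return ((y - y.mean()) ** 2).sum() if len(y) > 0 else 0.0

def best_split(x, y):
    """Exhaustively searches for the threshold on `x` minimizing the total SSE of the 2 regions."""
    xs = np.sort(x)
    best_t, best_err = None, np.inf
    for t in (xs[:-1] + xs[1:]) / 2:  # candidate thresholds between consecutive values
        err = sse(y[x <= t]) + sse(y[x > t])
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err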

Let’s look at a simple plot to get a better idea of the algorithm:

Building Decision Trees Regression

:x: Besides the disadvantages seen in the decision trees for classification, decision trees for regression suffer from the fact that they predict a non-smooth (piecewise constant) function.

Unsupervised Learning

Unsupervised learning tasks consist in finding structure in unlabeled data without a specific desired outcome.

Unsupervised learning can be further separated into multiple subtasks (the separation is not as clear as in the supervised setting):

  • Clustering: can you find clusters in the data?
  • Density Estimation: what is the underlying probability distribution that gave rise to the data?
  • Dimensionality Reduction: how can the data best be compressed?
  • Outlier Detection: which data-points are outliers?

Due to the lack of ground truth labels, it is difficult to measure the performance of such methods, but they are extremely important due to the amount of accessible unlabeled data.

Reinforcement Learning

In Reinforcement Learning (RL), the sequential decision-making algorithm (an agent) interacts with an environment it is uncertain about. The agent learns to map situations to actions in order to maximize a long term reward. During training, the actions it chooses are evaluated rather than instructed.

:school_satchel: Example: Games can very naturally be framed in a RL framework. For example, when playing tennis you are not told how good every movement you make is, but you are given a certain reward if you win the whole game.

:mag: Side Notes: Games could also be framed as a supervised problem. The training set would consist of many different states of the environment and the optimal action to take in each of those. Creating such a dataset is not possible for most applications, as it requires enumerating the exponential number of states and knowing the associated best action (e.g. the exact rotation of all your joints when you play tennis). Note that during supervised training, the feedback indicates the correct action independently of the chosen action. The RL framework is a lot more natural as the agent is trained by playing the game. Importantly, the agent interacts with the environment, such that the states it will visit depend on its previous actions. This is a chicken-and-egg problem: the agent is unlikely to reach good states before being trained, but it has to reach good states to get rewards and train effectively. This leads to training curves that start with very long plateaus of low reward until the agent reaches a good state (somewhat by chance) and then learns quickly. In contrast, supervised methods have very steep loss curves at the start.

:information_source: Resources : The links and differences between supervised learning and RL are described in detail by A. Barto and T. Dietterich.

In RL, future states depend on current actions, thus requiring the agent to model indirect consequences of its actions and to plan ahead. Furthermore, the agent often has to take actions in real-time while planning for the future. All of the above makes it very similar to how humans learn, and RL is thus widely used in psychology and neuroscience.

:information_source: Resources : All this section on RL is highly influenced by Sutton and Barto’s introductory book.

Exploration vs Exploitation

A fundamental trade-off in RL is how to balance exploration and exploitation. Indeed, we often assume that the agent has to maximize its reward over all episodes. As it often lacks knowledge about the environment, it has to decide between taking the action it currently thinks is the best, and taking new actions to learn how good these are (which might bring a higher return in the long run).

:school_satchel: Example: Let’s consider a simplified version of RL (Multi-armed Bandits) to illustrate this concept (example and plots from Sutton and Barto):

Let’s assume that you go to a very generous casino, i.e. a utopian casino in which you can make money in the long run (spoiler alert: this doesn’t exist, and in real-life casinos the best action is always to leave :sweat_smile:). This casino contains 10 slot machines, each of which gives a certain reward $r$ when the agent decides to play on it. The agent can stay the whole day in this casino and wants to maximize its profit.

Although the agent doesn’t know it, the reward $r \sim \mathcal{N}(\mu_a, 1)$, where each of the 10 machines has a fixed $\mu_a$ that was sampled from $\mathcal{N}(0, 1)$ when building the casino. The actual reward distributions are the following (where $q_{*}(a)$ denotes the expected return when choosing slot $a$):

10-Armed Bandits

A good agent would try to estimate $Q_t(a) \approx q_{*}(a)$, where the subscript $t$ indicates that the estimate depends on time (the more the agent plays, the better the estimates should be). The most natural estimate of $Q_t(a)$ is the average of all rewards received when taking action $a$. At every time step $t$ there’s at least one action $\hat{a}=arg\max_a Q_t(a)$ which the agent believes maximizes $q_{*}(a)$. Taking the greedy action $\hat{a}$ exploits your current beliefs about the environment. This might not always be the best action: if $\hat{a} \neq a^{*}$ then the agent's estimates of the environment are wrong, and it would be better off taking a non-greedy action $a \neq \hat{a}$ (supposing it will still play many more times). Such actions explore the environment to improve the estimates.

During exploration, reward is lower in the short run, but higher in the long run because once you discovered better actions you can exploit them many times. Whether to explore or exploit is a complex problem that depends on your current estimate, uncertainties, and the number of remaining steps.

Here are a few possible exploration mechanisms:

  • $\pmb{\epsilon}\mathbf{-greedy}$: take the greedy action with probability $1-(\epsilon-\frac{\epsilon}{10})$ and each of the other actions with probability $\frac{\epsilon}{10}$ (a minimal sketch is given after this list).
  • Upper-Confidence Bound: $\epsilon\text{-greedy}$ forces the non greedy actions to be tried uniformly. It would seem like a better idea to give a preference to actions that are nearly greedy or particularly uncertain. One way of doing so is to add a term that measures the variance of the estimate of $Q_t(a)$. Such a term should be inversely proportional to the number of times we have seen an action, $N_t(a)$. We use $\frac{\log t}{N_t(a)}$ in order to force the model to take an action $a$ if it has not been taken in a long time (i.e. if $t$ increases but not $N_t(a)$, then $\frac{\log t}{N_t(a)}$ increases). The logarithm is used in order to have less exploration over time but still an infinite amount:
  • Optimistic Initial Values: give a highly optimistic $Q_1(a), \ \forall a$ (e.g. $Q_1(a)=5$ in our example, which is a lot larger than $q_{*}(a)$). As the first few samples from an action will normally decrease $Q_t(a)$, this ensures that all actions will be sampled at least a few times before following the greedy policy. Note that this permanently biases the action-value estimates when estimating $Q_t(a)$ through averages (although the bias decreases over time).

  • Gradient Bandit Algorithms: Another natural way of choosing the best action would be to keep a numerical preference for each action, $H_t(a)$, and sample more often the actions with larger preference. I.e. sample from $\pi_t := \text{softmax}(H_t)$, where $\pi_t(a)$ denotes the probability of taking action $a$ at time $t$. The preference can then be learned via stochastic gradient ascent, by increasing $H_t(A_t)$ if the reward due to the current action $A_t$ is larger than the current average reward $\bar{R}_t$ (decreasing $H_t(A_t)$ if not). The non selected actions $a \neq A_t$ are moved in the opposite direction. Formally:
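In python pseudo-code, a minimal $\epsilon\text{-greedy}$ agent with incremental sample-average estimates (assuming a hypothetical `bandit(a)` function that returns a sampled reward for arm `a`) could look like:

import numpy as np

def eps_greedy_bandit(bandit, n_actions=10, n_steps=1000, eps=0.1):
    """Keeps sample-average estimates Q(a) and acts epsilon-greedily with respect to them."""
    Q = np.zeros(n_actions)  # estimated action values
    N = np.zeros(n_actions)  # number of times each action was taken
    for _ in range(n_steps):
        if np.random.rand() < eps:
            a = np.random.randint(n_actions)  # explore: uniform over all actions
        else:
            a = int(np.argmax(Q))             # exploit: current greedy action
        r = bandit(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]             # incremental average update
    return Q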

By running all the different strategies for different hyperparameters and averaging over 1000 decisions, we get:

Parameters Exploration Multi-armed Bandits

We see that UCB performs best in this case, and is the most robust with regards to its hyper-parameter $c$. Although UCB tends to work well, this will not always be the case (No Free Lunch Theorem again).

:mag: Side Notes:

  • In non-stationary environments (i.e. when the reward probabilities change over time), it is important to always keep exploring, as the optimal action might change over time. In such environments, it is better to use an exponentially decaying weighted average for $Q_t(a)$, i.e. give more importance to recent samples.
  • The multi-armed bandits problem is a simplification of RL, as future states and actions are independent of the current action.
  • We will see other ways of maintaining exploration in future sections.

Markov Decision Process

Markov Decision Processes (MDPs) are a mathematical idealized form of the RL problem that suppose that states follow the Markov Property. I.e. that future states are conditionally independent of past ones given the present state: $S_{t+1} \perp \{S_{i}\}_{i=1}^{t-1} \vert S_t$.

Before diving into the details, it is useful to visualize how simple a MDP is (image taken from Sutton and Barto):

Markov Decision Process

Important concepts:

  • Agent: learner and decision maker. This corresponds to the controller in classical control theory.
  • Environment: everything outside of the agent. I.e. what it interacts with. It corresponds to the plant in classical control theory. Note that in the case of a human / robot, the body should be considered as the environment rather as the agent because it cannot be modified arbitrarily (the boundary is defined by the lack of possible control rather than lack of knowledge).
  • Time step t: discrete time at which the agent and environment interact. Note that it doesn’t have to correspond to fixed real-time intervals. Furthermore, it can be extended to the continuous setting.
  • State $S_t = s \in \mathcal{S}$ : information available to the agent about the environment.
  • Action $A_t=a \in \mathcal{A}$ : action that the agent decides to take. It corresponds to the control signal in classical control theory.
  • Reward $R_{t+1} = r \in \mathcal{R} \subset \mathbb{R}$: a value which is returned at each step by a (deterministic or stochastic) reward signal function depending on the previous $S_t$ and $A_t$. Intuitively, the reward corresponds to the current (short term) pain / pleasure that the agent is feeling. Although the reward is computed inside the agent / brain, we consider it to be external (given by the environment) as it cannot be modified arbitrarily by the agent. The reward signal we choose should truly represent what we want to accomplish, not how to accomplish it (such prior knowledge can be added in the initial policy or value function). For example, in chess, the agent should only be rewarded if it actually wins the game. Giving a reward for achieving subgoals (e.g. taking out an opponent’s piece) could result in reward hacking (i.e. the agent might find a way to achieve large rewards without winning).
  • Return $G_t := \sum_{\tau=1}^{T-t} \gamma^{\tau-1} R_{t+\tau} \text{, with } \gamma \in [0,1[$: the discounted cumulative reward, whose expectation the agent has to maximize. Note that the discounting factor $\gamma$ has a two-fold use. First and foremost, it ensures that $G_t$ is finite even for continuing tasks $T=\infty$ (the opposite of episodic tasks), assuming that $R_t$ is bounded $\forall t$. It also encodes a preference for rewards in the near future, and is a parameter that can be tuned to select the “farsightedness” of your agent (“myopic” agent with $\gamma=0$, “farsighted” as $\gamma \to 1$). Importantly, the return can be defined in a recursive manner (see the equations after this list):
  • The dynamics of the MDP: $p(s’, r\vert s, a) := P(S_{t+1}=s’, R_{t+1}=r \vert S_t=s, A_t=a)$. In an MDP, this probability completely characterizes the environment's dynamics due to the Markov property. Some useful functions that can be derived from it are:
    • State-transition probabilities
    • Expected rewards for state-actions pairs:
    • Expected rewards for state-actions-next state triplets:
  • Policy $\pi(a\vert s)$: a mapping from states to probabilities over actions. Intuitively, it corresponds to the agent's “behavioral function”, often called a stimulus-response mapping in psychology.
  • State-value function for policy $\pi$, $v_\pi(s)$: expected return for an agent that follows a policy $\pi$ and starts in state $s$. Intuitively, it corresponds to how good it is to be in a certain state $s$ (long term). This function is not given by the environment but is often predicted by the agent. Similar to the return, the value function can also be defined recursively, by the very important Bellman equation (written out after this list):
  • Action-value function (Q function) for policy $\pi$, $q_\pi(s,a)$: expected return an agent can get when starting from state $s$, taking action $a$, and then following policy $\pi$. Intuitively, it corresponds to how good it is to be in a certain state $s$ and take a specific action $a$ (long term). A Bellman equation can be derived in a similar way to the one for the value function:
  • Model of the environment: an internal model in the agent to predict the dynamic of the environment (e.g. probability of getting a certain reward or getting in a certain state for each action). This is only used by some RL agents for planning.
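For reference, the recursive forms referred to above should read as follows (standard forms, in the notation of this section):

$$G_t = R_{t+1} + \gamma G_{t+1}$$

$$v_\pi(s) = \sum_a \pi(a \vert s) \sum_{s', r} p(s', r \vert s, a) \left[ r + \gamma v_\pi(s') \right]$$

$$q_\pi(s,a) = \sum_{s', r} p(s', r \vert s, a) \left[ r + \gamma \sum_{a'} \pi(a' \vert s') q_\pi(s', a') \right]$$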

To schematically represent what different algorithms do, it is useful to look at Backup Diagrams. These are trees that are constructed by unrolling all the possible actions and states. White nodes represent a state, from which the agent will follow its policy to select one possible action (black node). The environment will then sample through the dynamics to bring the agent to a new state. Green nodes will be used to denote terminal nodes, and the path taken by the algorithm will be shown in red. Unlike transition graphs, state nodes are not unique (e.g. states can be their own successors). Example:

Backup Diagram

:mag: Side Notes : We usually assume that we have a finite MDP. I.e. that $\mathcal{R},\mathcal{A},\mathcal{S}$ are finite. Dealing with continuous state and action spaces requires approximations. One possible way of converting a continuous problem to a finite one is to discretize the state and action spaces.

In the following sections, we will :

  • see how to solve the RL MDP problem exactly through a non-linear set of equations or dynamic programming
  • approximate the solution by bypassing the need of knowing the dynamics of the system.
  • model the dynamics of the system to enable the use of exact methods.

Bellman Optimality Equations

Solving the RL task consists in finding a good policy. A policy $\pi’$ is defined to be better than $\pi \iff v_{\pi’}(s) \geq v_{\pi}(s), \ \forall s \in \mathcal{S}$. The optimal policy $\pi_{*}$ has an associated optimal state-value function and optimal action-value function:

A special recursive update (the Bellman optimality equations) can be written for the optimal functions $v_{*}$, $q_{*}$ by taking the best action at each step instead of marginalizing:

The optimal Bellman equations are a set of equations ($\vert \mathcal{S} \vert$ equations and unknowns) with a unique solution. But these equations are now non-linear (due to the max). If the dynamics of the environment were known, you could solve this non-linear system of equations to get $v_{*}$, and follow $\pi_{*}$ by greedily choosing the action that maximizes the expected $v_{*}$. By solving for $q_{*}$ instead, you would simply choose the action $a$ that maximizes $q_{*}(s,a)$, which doesn’t require knowing anything about the system dynamics.

In practice, the optimal Bellman equations can rarely be solved due to three major problems:

  • They require the dynamics of the environment which is rarely known.
  • They require large computational resources (memory and computation) to solve as the number of states $\vert \mathcal{S} \vert$ might be huge (or infinite). This is nearly always an issue.
  • They rely on the Markov property, which often does not hold exactly in practice.

Practical RL algorithms thus settle for approximating the optimal Bellman equations. Usually they parametrize functions and focus mostly on states that will be frequently encountered to make the computations possible.

Dynamic Programming

Overview
Exact
Bootstrapping
Full Backups
  • :bulb: Intuition :
    • Use dynamic programming as every step only depends on the previous step due to the MDP assumption.
    • derives update rules for $v_\pi$ or $q_\pi$ from the Bellman optimality equations to give rise to iterative methods that solve exactly the optimal control problem of finding $\pi_{*}$
  • :wrench: Practical:
    • DP algorithms are guaranteed to find the optimal policy in polynomial time with respect to $\vert \mathcal{S} \vert$ and $\vert \mathcal{A} \vert$, even though the number of possible deterministic policies is $\vert \mathcal{A} \vert ^ {\vert \mathcal{S} \vert}$. This exponential speedup comes from the MDP assumption.
    • In practice, DP algorithms converge much faster than the theoretical worst case scenario.
  • :white_check_mark: Advantage :
    • Exact solution.
  • :x: Disadvantage :
    • Requires the dynamics of the environment.
    • Requires large computational resources as $\vert \mathcal{S} \vert$ is usually huge.
    • Requires $\infty$ number of iterations to find the exact solution.
    • Strongly dependent on the MDP assumption.
  • Backup Diagram from David Silver’s slides:

Backup Diagram Dynamic Programming

:mag: Side Notes: DP algorithms use the Bellman equations to update estimates based on other estimates (typically the value function at the next state). This general idea is called Bootstrapping.

Policy Iteration

The high-level idea is to iteratively: evaluate $v_\pi$ for the current policy $\pi$ (policy evaluation 1), use $v_\pi$ to improve the policy into $\pi’$ (policy improvement 1), evaluate $v_{\pi’}$ (policy evaluation 2)… We thus obtain a sequence of strictly monotonically improving policies and value functions (except once converged to $\pi_{*}$). As a finite MDP has only a finite number of possible policies, this is guaranteed to converge in a finite number of steps. This idea is often visualized using a 1d example (taken from Sutton and Barto):

Generalized Policy Iteration

This simplified diagram shows that although the policy improvement and policy evaluation “pull in opposite directions”, the 2 processes still converge to find a single joint solution. Almost all RL algorithms can be described using 2 interacting processes (the prediction task for approximating a value, and the control task for improving the policy), which are often called Generalized Policy Iteration (GPI).

In python pseudo-code:

def policy_iteration(environment):
    # v initialized arbitrarily for all states except V(terminal)=0
    v, pi = initialize()
    is_converged = False
    while not is_converged:
        v = policy_evaluation(pi, environment, v)
        pi, is_converged = policy_improvement(v, environment, pi)
    return pi
Policy Evaluation

The first step is to evaluate $v_\pi$ for a given $\pi$. This can be done by solving the Bellman equation :

Solving the equation can be done by either:

  • Linear System: This is a set of linear equations ($\vert \mathcal{S} \vert$ equations and unknowns) with a unique solution (if $\gamma <1$ or if it is an episodic task). Note that we would have to solve for these equations at every step of the policy iteration and $\vert \mathcal{S} \vert$ is often very large. Assuming a deterministic policy and reward, this would take $O(\vert \mathcal{S} \vert^3)$ operations to solve.

  • Iterative method: Modify the Bellman equation to become an iterative method that is guaranteed to converge when $k\to \infty$ if $v_\pi$ exists. This is done by realizing that $v_\pi$ is a fixed point of:
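In the notation above, the resulting fixed-point iteration should be:

$$v_{k+1}(s) = \sum_a \pi(a \vert s) \sum_{s', r} p(s', r \vert s, a) \left[ r + \gamma v_k(s') \right]$$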

:mag: Side Notes : In the iterative method, we would have to keep 2 arrays $v_k(s)$ and $v_{k+1}(s)$. At each iteration $k$ we would update $v_{k+1}(s)$ by looping through all states. Importantly, the algorithm also converges to $v_\pi(s)$ if we keep in memory a single array that is updated “in-place” (this often converges faster as it updates the value of some states using the latest available values). The order of state updates has a significant influence on the convergence in the “in-place” case.

We solve for an approximation $V \approx v_\pi$ by halting the algorithm when the change for every state is “small”.

In python pseudo-code:

def q(s, a, v, environment, gamma):
    """Computes the action-value function `q` for a given state and action."""
    dynamics, states, actions, rewards = environment
    return sum(dynamics(s_p,r,s,a) * (r + gamma * v(s_p))
               for s_p in states
               for r in rewards)

def policy_evaluation(pi, environment, v_init, threshold=..., gamma=...):
    dynamics, states, actions, rewards = environment
    V = v_init # don't restart policy evaluation from scratch
    delta = threshold + 1 # force at least 1 pass
    while delta > threshold:
        delta = 0
        for s in states:
            v = V(s)
            # Bellman update: expectation of q under the policy's action probabilities
            V(s) = sum(pi(a, s) * q(s, a, V, environment, gamma) for a in actions)
            # stop when the change for every state is small
            delta = max(delta, abs(v - V(s)))
    return V
Policy Improvement

Now that we have an (estimate of) $v_\pi$ which says how good it is to be in $s$ when following $\pi$, we want to know how to change the policy to yield a higher return.

We have previously defined a policy $\pi’$ to be better than $\pi$ if $v_{\pi’}(s) > v_{\pi}(s), \forall s$. One simple way of improving the current policy $\pi$ would thus be to use a $\pi’$ which is identical to $\pi$ at each state besides one $s_{update}$ for which it will take a better action. Let’s assume that we have a deterministic policy $\pi$ that we follow at every step besides one step when in $s_{update}$ at which we follow the new $\pi’$ (and continue with $\pi$). By construction :

Then it can be proved that (Policy Improvement Theorem) :

I.e. if such a policy $\pi’$ can be constructed, then it is better than $\pi$.

The same holds if we extend the update to all actions and all states and stochastic policies. The general policy improvement algorithm is thus :

$\pi’$ is always going to be better than $\pi$ except if $\pi=\pi_{*}$, in which case the update equations would turn into the Bellman optimality equation.

In python pseudo-code:

def policy_improvement(v, environment, pi, gamma=...):
    dynamics, states, actions, rewards = environment
    is_converged = True
    for s in states:
        old_action = pi(s)
        # q defined in `policy_evaluation` pseudo-code
        pi(s) = argmax(q(s, a, v, environment, gamma) for a in actions)
        if old_action != pi(s):
            is_converged = False
    return pi, is_converged

:wrench: Practical : Assuming a deterministic reward, this would take $O(\vert \mathcal{S} \vert^2 \vert \mathcal{A} \vert)$ operations to solve. Each iteration of the policy iteration algorithm thus takes $O(\vert \mathcal{S} \vert^2 (\vert \mathcal{S} \vert + \vert \mathcal{A} \vert))$ for a deterministic policy and reward signal, if we solve the linear system of equations for the policy evaluation step.

Value Iteration

In policy iteration, the bottleneck is the policy evaluation which requires multiple loops over the state space (convergence only for an infinite number of loops). Importantly, the same convergence guarantees as with policy iteration hold when doing a single policy evaluation step. Policy iteration with a single evaluation step, is called value iteration and can be written as a simple update step that combines the truncated policy evaluation and the policy improvement steps:
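In the notation above, this update step should read:

$$v_{k+1}(s) = \max_a \sum_{s', r} p(s', r \vert s, a) \left[ r + \gamma v_k(s') \right]$$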

This is guaranteed to converge to $v_*$ for an arbitrary $v_0$ if $v_*$ exists. Value iteration corresponds to the update rule derived from the Bellman optimality equation. The formula is very similar to the policy evaluation one, but maximizes over actions instead of marginalizing over them.

Like policy evaluation, value iteration needs an infinite number of steps to converge to $v_*$. In practice we stop whenever the change for all actions is small.

In python pseudo-code:

def value_to_policy(v, environment, gamma):
    """Makes a deterministic policy from a value function."""
    dynamics, states, actions, rewards = environment
    pi = dict()
    for s in states:
        # q defined in `policy_evaluation` pseudo-code
        pi(s) = argmax(q(s,a, v, environment, gamma) for a in actions)
    return pi

def value_iteration(environment, threshold=..., gamma=...):
    dynamics, states, actions, rewards = environment
    # V initialized arbitrarily for all states except V(terminal)=0
    V = initialize()
    delta = threshold + 1 # force at least 1 pass
    while delta > threshold:
        delta = 0
        for s in states:
            v = V(s)
            # Bellman optimality update
            # q defined in `policy_evaluation` pseudo-code
            V(s) = max(q(s, a, V, environment, gamma) for a in actions)
            # stop when the change for every state is small
            delta = max(delta, abs(v - V(s)))
    pi = value_to_policy(V, environment, gamma)
    return pi

:wrench: Practical : Faster convergence is often achieved by doing a couple of policy evaluation sweeps (instead of a single one in the value iteration case) between each policy improvement. The entire class of truncated policy iteration converges. Truncated policy iteration can be schematically seen as using the modified generalized policy iteration diagram:

Value Iteration

As seen above, truncated policy iteration uses only approximate value functions. This usually increases the number of required policy evaluation and iteration steps, but greatly decreases the number of steps per policy iteration making the overall algorithm usually quicker. Assuming a deterministic policy and reward signal, each iteration for the value iteration takes $O(\vert \mathcal{S} \vert^2 \vert \mathcal{A} \vert)$ which is less than exact (solving the linear system) policy iteration $O(\vert \mathcal{S} \vert^2 (\vert \mathcal{S} \vert + \vert \mathcal{A} \vert))$.

Asynchronous Dynamic Programming

A drawback of DP algorithms is that they require looping over all states for a single sweep. Asynchronous DP algorithms are in-place iterative methods that update the value of each state in any order. Such algorithms are still guaranteed to converge as long as no state is permanently ignored after some point (for the episodic case it is also easy to avoid the few orderings that do not converge).

By carefully selecting the states to update, we can often improve the convergence rate. Furthermore, asynchronous DP enables to update the value of the states as the agent visits them. This is very useful in practice, and focuses the computations on the most relevant states.

:wrench: Practical : Asynchronous DP algorithms are usually preferred for problems with large state-spaces.

Non DP Exact Methods

Although DP algorithms are the most commonly used for finding exact solutions of the Bellman optimality equations, other methods can have better worst-case convergence guarantees. Linear Programming (LP) is one of those methods. Indeed, the Bellman optimality equations can be written as a linear program. Let $B$ be the Bellman operator (i.e. $v_{k+1} = B(v_k)$), and let $\pmb{\mu}_0$ be a probability distribution over states, then:

Indeed, if $v \geq B(v)$ then $B(v) \geq B(B(v))$ due to the monotonicity of the Bellman operator. By repeated applications we must have that $v \geq B(v) \geq B(B(v)) \geq B^3(v) \geq \ldots \geq B^{\infty}(v) = v_{*}$. Any solution of the LP must satisfy $v \geq B(v)$ and must thus be $v_{*}$. Then the objective function $\pmb{\mu}_0^T \mathbf{v}$ is the expected cumulative reward when beginning at a state drawn from $\pmb{\mu}_0$. By substituting for the Bellman operator $B$:
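As a sketch, substituting the operator should give the primal LP:

$$\min_{\mathbf{v}} \; \pmb{\mu}_0^T \mathbf{v} \quad \text{s.t.} \quad v(s) \geq \sum_{s', r} p(s', r \vert s, a) \left[ r + \gamma v(s') \right], \quad \forall s \in \mathcal{S}, \ \forall a \in \mathcal{A}$$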

Using the dual form of the LP program, the equation above can be rewritten as :

The constraints in the dual LP ensure that :

I.e. $\mu(s,a)$ are the discounted state-action counts, while the dual objective maximizes the expected discounted return. The optimal policy is then:

$\pi_*(s)=arg\max_a \mu(s,a)$

:wrench: Practical: Linear programming is sometimes better than DP for a small number of states, but it does not scale well.

Although LPs are rarely used in practice, they provide connections to a number of other methods that have been used to find approximate large-scale MDP solutions.

Monte Carlo GPI

Overview
Approximate
No Bootstrapping
Unbiased
Sample Backups
Model Free
  • :bulb: Intuition :
    • Approximates Generalized Policy Iteration by estimating the expectations through sampling rather than computing them.
    • The idea is to estimate the value function by following a policy and averaging returns over multiple episodes, then updating the value function at every visited state.
  • :white_check_mark: Advantage :
    • Can learn from experience, without explicit knowledge of the dynamics.
    • Unbiased
    • Less harmed by MDP violation because they do not bootstrap.
    • Not very sensitive to initialization
  • :x: Disadvantage :
    • High variance because depends on many random actions, transitions, rewards
    • Have to wait until end of episode to update.
    • Only works for episodic tasks.
    • Slow convergence.
    • Suffer if lack of exploration.
  • Backup Diagram

Backup Diagram Monte Carlo

As mentioned previously, the dynamics of the environment are rarely known. In such cases we cannot use DP. Monte Carlo (MC) methods bypass this lack of knowledge by estimating the expected return from experience (i.e. sampling of the unknown dynamics).

MC methods are very similar to the previously discussed Generalized Policy Iteration (GPI). The main differences being:

  • They learn the value-function by sampling from the MDP (experience) rather than computing these values using the dynamics of the MDP.

  • MC methods do not bootstrap: each value function for each state/action is estimated independently. I.e. they do not update value estimates based on other value estimates.

  • In DP, given a state-value function, we could look ahead one step to determine a policy. This is not possible anymore due to the lack of knowledge of the dynamics. It is thus crucial to estimate the action value function $q_{*}$ instead of $v_{*}$ in policy evaluation.

Recall that $q_\pi(s,a) = \mathbb{E}[R_{t+1} + \gamma G_{t+1} \vert S_{t}=s, A_t=a]$. Monte Carlo Estimation approximates this expectation through sampling, i.e. by averaging the returns observed after visits of the state-action pair $(s,a)$.

Note that pairs $(s,a)$ could be visited multiple times in the same episode. How we treat these visits gives rise to 2 slightly different methods:

  • First-visit MC Method: estimate $q_\pi(s,a)$ as the average of the returns following the first visit to $(s,a)$. This has been more studied and is the one I will be using.
  • Every-visit MC Method: estimate $q_\pi(s,a)$ as the average of the returns following all visits to $(s,a)$. These are often preferred as they don’t require keeping track of which state-action pairs have been visited.

Both methods converge to $q_\pi(s,a)$ as the number of visits $n \to \infty$. The convergence of the first-visit MC is easier to prove as each return is an i.i.d estimate of $q_\pi(s,a)$. The standard deviation of the estimate drops with $\frac{1}{\sqrt{n}}$.

:mag: Side Notes:

  • MC methods can learn from actual experience or simulated one. The former is used when there’s no knowledge about the dynamics. The latter is useful when it is possible to generate samples of the environment but infeasible to write it explicitly. For example, it is very easy to simulate a game of blackjack, but computing $p(s’,r’ \vert s,a)$ as a function of the dealer cards is complicated.
  • The return at each state depends on the future action. Due to the training of the policy, the problem becomes non-stationary from the point of view of the earlier states.
  • In order to have well defined returns, we will only be considering episodic tasks (i.e. finite $T$)
  • The idea is the same as in Bayesian modeling, where we approximated expectations by sampling.
  • MC methods do not bootstrap and are thus very useful to compute the value of only a subset of states, by starting many episodes at the state of interests.

On-Policy Monte Carlo GPI

Let’s make a simple generalized policy iteration (GPI) algorithm using MC methods. As a reminder, GPI consists in iteratively alternating between evaluation (E) and improvement (I) of the policy, until we reach the optimal policy:

  • Generalized Policy Evaluation (prediction). Evaluates the value function $Q \approx q_\pi$ (not $V$ as we do not have the dynamics). Let $F_i(s,a)$ return the time step $t$ at which state-action pair $(s,a)$ is first seen (first-visit MC) in episode $i$ (return $-1$ if never seen), and $G_{t}^{(i)}(\pi)$ be the discounted return from time $t$ of the $i^{th}$ episode when following $\pi$:
  • Policy Improvement, make a GLIE (defined below) policy $\pi$ from $Q$. Note that the policy improvement theorem still holds.

Unsurprisingly, MC methods can be shown to converge if they maintain exploration and if the policy evaluation step uses an infinite number of samples. Indeed, these 2 conditions ensure that all expectations are correct, as MC sampling methods are unbiased.

Of course, using an infinite number of samples is not possible, and we would like to alternate (after every episode) between evaluation and improvement even when evaluation did not converge (similarly to value iteration). Although MC methods cannot converge to a suboptimal policy in this case, the fact that they converge to the optimal fixed point has yet to be formally proved.

Maintaining exploration is a major issue. Indeed, if $\pi$ is deterministic then the samples will only improve estimates for one action per state. To ensure convergence we thus need a policy which is Greedy in the Limit with Infinite Exploration (GLIE). I.e. : 1/ All state-action pairs have to be explored infinitely many times; 2/ The policy has to converge to a greedy one. Possible solutions include:

  • Exploring Starts: start every episode with a sampled state-action pair from a distribution that is non-zero for all pairs. This ensures that all pairs $(s,a)$ will be visited an infinite number of times as $n \to \infty$. Choosing starting conditions is often not applicable (e.g. most games always start from the same position).
  • Non-Zero Stochastic Policy: to ensure that all pairs $(s,a)$ are encountered, use a stochastic policy with a non-zero probability for all actions in each state. $\epsilon\textbf{-greedy}$ is a well known such policy, which takes the greedy action with probability $1-\epsilon+\frac{\epsilon}{\vert \mathcal{A} \vert}$ and assigns a uniform probability of $\frac{\epsilon}{\vert \mathcal{A} \vert}$ to all other actions. To be a GLIE policy, $\epsilon$ has to converge to 0 in the limit (e.g. $\frac{1}{t}$).

In python pseudo-code:

def update_eps_policy(Q, pi, s, actions, eps):
    """Update an epsilon policy for a given state."""
    q_best = -float("inf")
    for a in actions:
        pi[(a, s)] = eps/len(actions)
        q_a = Q[(s, a)]
        if q_a  > q_best:
            q_best = q_a 
            a_best = a
    pi[(a_best, s)] = 1 - (eps - (eps/len(actions)))
    return pi

def q_to_eps_pi(Q, eps, actions, states):
    """Creates an epsilon policy from a action value function."""
    pi = defaultdict(lambda: 0)
    for s in states:
        pi = update_eps_policy(Q, pi, s, actions, eps)
    return pi

def on_policy_mc(game, actions, states, n=..., eps=..., gamma=...):
    """On policy Monte Carlo GPI using first-visit update and epsilon greedy policy."""
    returns = defaultdict(list)
    Q = defaultdict(lambda: 0)
    for _ in range(n):
        pi = q_to_eps_pi(Q, eps, actions, states)
        T, list_states, list_actions, list_rewards = game(pi)
        G = 0
        for t in range(T-1,-1,-1): # T-1, T-2, ... 0
            r_t, s_t, a_t = list_rewards[t], list_states[t], list_actions[t]
            G = gamma * G + r_t # current return
            if (s_t, a_t) not in zip(list_states[:t], list_actions[:t]): # first visit of the pair
                returns[(s_t, a_t)].append(G)
                Q[(s_t, a_t)] = mean(returns[(s_t, a_t)]) # mean over all episodes
    return pi
Incremental Implementation

The MC methods can be implemented incrementally. Instead of getting estimates of $q_\pi$ by keeping in memory all $G_t^{(i)}$ to average over those, the average can be computed exactly at each step $n$. Let $m \coloneqq \sum_i \mathcal{I}[F_i(s,a) \neq -1]$, then:
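Concretely, the incremental update should take the form (keeping only the running mean and the count $m$):

$$Q_{m+1}(s,a) = Q_{m}(s,a) + \frac{1}{m} \left( G_{F_m(s,a)}^{(m)}(\pi) - Q_{m}(s,a) \right)$$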

This is of the general form :

In this case the step-size $\alpha_n = \frac{1}{n}$ varies over time. In RL, problems are often non-stationary, in which case we would like to give more importance to recent samples. A popular way of achieving this is to use a constant step-size $\alpha \in ]0,1]$. This gives rise to an exponentially decaying weighted average (also called exponential recency-weighted average in RL). Indeed:

:bulb: Intuition: At every step we update our estimates by taking a small step towards the goal. The direction (sign in 1D) is given by the error $\epsilon = \left(G_{F_m(s,a)}^{(m)}(\pi) - Q_{n}(s,a) \right)$ and the step size by $\alpha \vert \epsilon \vert$.

:mag: Side Notes:

  • The weight given the $i^{th}$ return $G_{F_i(s,a)}^{(i)}(\pi)$ is $\alpha(1-\alpha)^{n-i}$, which decreases exponentially as i decreases ($(1-\alpha) \in [0,1[$).
  • It is a well-defined weighted average because the sum of the weights can be shown to be 1.
  • From stochastic approximation theory we know that we need a Robbins-Monro sequence of $\alpha_n$ for convergence: $\sum_{m=1}^{\infty} \alpha_m = \infty$ (steps large enough to overcome initial conditions / noise) and $\sum_{m=1}^{\infty} \alpha_m^2 < \infty$ (steps small enough to converge). This is the case for $\alpha_m = \frac{1}{m}$ but not for a constant $\alpha$. But in non-stationary environment we actually don’t want the algorithm to converge, in order to continue learning.
  • For every-visit MC, this can simply be rewritten as the following update step at the end of each episode:

Off-Policy Monte Carlo GPI

In the on-policy case we had to use a hack ($\epsilon \text{-greedy}$ policy) in order to ensure convergence. The previous method thus compromises between ensuring exploration and learning the (nearly) optimal policy. Off-policy methods remove the need for this compromise by using 2 different policies.

The behavior policy $b$ is used to collect samples and is a non-zero stochastic policy which ensures convergence by maintaining exploration. The target policy $\pi$ is the policy we are estimating and will be using at test time; it focuses on exploitation. The latter is often a deterministic policy. These methods contrast with on-policy ones, which use a single policy.

:bulb: Intuition: The intuition behind off-policy methods is to follow another policy but to weight the final return in a way that compensates for the actions taken by $b$. This can be done via Importance Sampling without biasing the final estimate.

Given a starting state $S_t$ the probability of all subsequent state-action trajectory $A_t, S_{t+1}, A_{t+1}, \ldots, S_T$ when following $\pi$ is:

Note that the dynamics $p(s’ \vert s, a)$ are unknown but we only care about the ratio of the state-action trajectory when following $\pi$ and $b$. The importance sampling ratio is:

Note that if we simply average over all returns we would get $\mathbb{E}[G_t \vert S_t=s] = v_b(s)$, to get $v_\pi(s)$ we can use the previously computed importance sampling ratio:

:mag: Side Notes:

  • Off policy methods are more general as on policy methods can be written as off-policy methods with the same behavior and target policy.
  • In order to estimate values of $\pi$ using $b$ we require $\pi(a \vert s) > 0 \implies b(a\vert s) > 0$ (coverage assumption). I.e. all actions that $\pi$ might take can also be taken by $b$.
  • The formula shown above is the ordinary importance sampling, although it is unbiased it can have large variance. Weighted importance sampling is biased (although it is a consistent estimate as the bias decreases with $O(1/n)$) but is usually preferred as the variance is usually dramatically smaller:
  • The importance sampling method above treats the return $G_0$ as a whole, without taking into account the discount factors. For example if $\gamma=0$ and the episode lasts 100 steps, then $G_0 = R_1$, so we would only need the importance sampling ratio $\frac{\pi(A_0 \vert S_0)}{b(A_0 \vert S_0)}$, yet we currently also use the 99 other factors $\frac{\pi(A_1 \vert S_1)}{b(A_1 \vert S_1)} \ldots \frac{\pi(A_{99} \vert S_{99})}{b(A_{99} \vert S_{99})}$, which greatly increases the variance. Discounting-aware importance sampling greatly decreases the variance by taking the discounts into account.

:wrench: Practical: Off-policy methods are very useful to learn by seeing a human expert or non-learning controller.

In python pseudo-code:

def off_policy_mc(game, b, actions, n=..., gamma=...):
    """Off policy Monte Carlo GPI using incremental update and weighted importance sampling."""
    pi = dict()
    Q = defaultdict(lambda: random.random())
    C = defaultdict(lambda: 0)
    for _ in range(n):
        T, list_states, list_actions, list_rewards = game(b)
        G = 0
        W = 1
        for t in range(T-1,-1,-1): # T-1, T-2, ... 0
            r_t, s_t, a_t = list_rewards[t], list_states[t], list_actions[t]
            G = gamma * G + r_t # current return
            C[(s_t, a_t)] += W
            Q[(s_t, a_t)] += (W / C[(s_t, a_t)]) * (G - Q[(s_t, a_t)]) # incremental weighted IS update
            pi[s_t] = argmax(Q[(s_t, a)] for a in actions)
            if a_t != pi[s_t]:
                break
            W /= b[(s_t, a_t)]
    return pi

:mag: Side Notes: As for on policy, Off policy MC control can also be written in an Incremental manner. The details can be found in chapter 5.6 of Sutton and Barto.

Temporal-Difference Learning

Overview
Approximate
Bootstrapping
Biased
Sample Backups
Model-Free
  • :bulb: Intuition :
    • Modify Monte Carlo methods to update estimates from other estimates without having to wait for a final outcome.
  • :white_check_mark: Advantage :
    • Can learn from experience, without explicit knowledge of the dynamics.
    • Can compute the value of only a subset of states.
    • On-line learning by updating at every step.
    • Works in continuing (non-terminating) environments
    • Low variance as only depends on a single random action, transition, reward
    • Very effective in Markov environments (exploit the property through bootstrapping)
  • :x: Disadvantage :
    • Suffer if Markov property does not hold.
    • Suffer if lack of exploration.
  • Backup Diagram:

Backup Diagram Temporal Difference

The fundamental idea of temporal-difference (TD) learning is to remove the need of waiting until the end of an episode to get $G_t$ in MC methods, by taking a single step and update using $R_{t+1} + \gamma V(S_{t+1}) \approx G_t$. Note that when the estimates are correct, i.e. $V(S_{t+1}) = v_\pi(S_{t+1})$, then $\mathbb{E}[R_{t+1} + \gamma V(S_{t+1})] = \mathbb{E}[G_t]$.

As we have seen, incremental MC methods for evaluating $V$ can be written as $V_{k+1}(S_t) = V_{k}(S_t) + \alpha \left( G_t - V_{k}(S_t) \right)$. Using bootstrapping, this becomes:
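In the same notation, this should give:

$$V_{k+1}(S_t) = V_{k}(S_t) + \alpha \left( R_{t+1} + \gamma V_{k}(S_{t+1}) - V_{k}(S_t) \right)$$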

This important update is called $\textbf{TD(0)}$, as it is a specific case of TD(lambda). The error term that is used to update our estimate is very important in RL and neuroscience. It is called the TD Error:
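In the notation above:

$$\delta_t := R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$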

Importantly, $\text{TD}(0)$ always converges to $v_\pi$ if $\alpha$ decreases according to the usual stochastic conditions (e.g. $\alpha = \frac{1}{t}$). Although the theoretical comparison of the convergence speed of TD(0) and MC methods is still an open question, the former often converges faster in practice.
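In python pseudo-code (reusing the conventions of the SARSA and Q-learning sketches below, e.g. `game`, `sample` and `terminal`), TD(0) prediction could look like:

def td0_evaluation(game, pi, n=..., gamma=..., alpha=...):
    """TD(0) prediction: bootstraps the update from the value of the next state."""
    V = defaultdict(lambda: 0)
    for _ in range(n):
        s_t = game()  # initializes the state
        while s_t is not terminal:
            a_t = sample(pi, s_t)
            s_t1, r_t = game(s_t, a_t)  # single step
            # V(S_t) <- V(S_t) + alpha * (R_{t+1} + gamma * V(S_{t+1}) - V(S_t))
            V[s_t] += alpha * (r_t + gamma * V[s_t1] - V[s_t])
            s_t = s_t1
    return V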

:mag: Side Notes:

  • $\delta_t$ is the error in $V(S_t)$ but is only available at time $t+1$.
  • The MC error can be written as a sum of TD errors if $V$ is not updated during the episode: $G_t - V(S_t) = \sum_{k=t}^{T-1} \gamma^{k-t} \delta_k$
  • Updating the value with the TD error, is called the backup
  • The name comes from the fact that you look at the difference between value estimates at consecutive time steps.
  • For batch updating (i.e. single batch of data that is processed multiple times and the model is updated after each batch pass), MC and TD(0) do not in general find the same answer. MC methods find the estimate that minimizes the mean-squared error on the training set, while TD(0) finds the maximum likelihood estimate of the Markov Process. For example, suppose we have 2 states $A$, $B$ and a training data $\{(A:0:B:0), (B:1), (B:1), (B:0)\}$. What should $V(A)$ be? There are at least 2 answers that make sense. MC methods would answer $V(A)=0$ because every time $A$ has been seen, it resulted in a reward of $0$. This doesn’t take into account the Markov property. Conversely, $TD(0)$ would answer $V(A)=0.5$ because $100\%$ of $A$ transited to $B$. Using the Markov assumption, the rewards should be independent of the past once we are in state $B$. I.e. $V(A)=V(B)=\frac{2}{4}=\frac{1}{2}$.

On-Policy TD GPI (SARSA)

Similarly to MC methods, we need an action-value function $q_\pi$ instead of a state one $v_\pi$ because we do not know the dynamics and thus cannot simply derive the policy from $v_\pi$. Using TD estimates in the on-policy MC GPI scheme, we have:

  • Generalized Policy Evaluation (Prediction): this update is done for every transition from a non-terminal state $S_t$ (set $Q(s_{terminal},A_{t+1})=0$):
  • Policy Improvement: make a GLIE policy $\pi$ from $Q$. Note that the policy improvement theorem still holds.

:mag: Side Notes:

  • The update rule uses the tuple $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$, which gave the algorithm its name.
  • As with On policy MC GPI, you have to ensure sufficient exploration which can be done using an $\epsilon\text{-greedy}$ policy.
  • As with MC, the algorithm converges to an optimal policy as long as the policy is GLIE.

In python pseudo-code:

def sarsa(game, states, actions, n=..., eps=..., gamma=..., alpha=...):
    """SARSA using epsilon greedy policy."""
    Q = defaultdict(lambda: 0)
    pi = q_to_eps_pi(Q, eps, actions, states) # defined in on policy MC
    for _ in range(n):
        s_t = game() #initializes the state
        a_t = sample(pi, s_t)
        while s_t is not terminal:
            s_t1, r_t = game(s_t, a_t) # single run
            # compute A_{t+1} for update 
            a_t1 = sample(pi, s_t1)
            Q[(s_t, a_t)] += alpha * (r_t + gamma*Q[(s_t1, a_t1)] - Q[(s_t, a_t)])
            pi = update_eps_policy(Q, pi, s_t, actions, eps) # defined in on policy MC
            s_t, a_t = s_t1, a_t1
    return pi

Off Policy TD GPI (Q-Learning)

Just as with off-policy MC, TD can be written as an off-policy algorithm. In this case, the TD error is not computed with respect to the next sampled action but with respect to the current greedy policy (i.e. the one maximizing the current action-value function $Q$).

  • Generalized Policy Evaluation (Prediction):
  • Policy Improvement: make a GLIE policy $\pi$ from $Q$.

The algorithm can be proved to converge as long as all state-action pairs continue being updated.

In python pseudo-code:

def q_learning(game, states, actions, n=..., eps=..., gamma=..., alpha=...):
    """Q learning."""
    Q = defaultdict(lambda: 0)
    pi = q_to_eps_pi(Q, eps, actions, states) # defined in on policy MC
    for _ in range(n):
        s_t = game() # initializes the state
        while s_t is not terminal:
            a_t = sample(pi, s_t)
            s_t1, r_t = game(s_t, a_t) # single run
            q_best = max(Q[(s_t1, a)] for a in actions)  # value of the greedy action in the next state
            Q[(s_t, a_t)] += alpha * (r_t + gamma * q_best - Q[(s_t, a_t)])
            pi = update_eps_policy(Q, pi, s_t, actions, eps) # defined in on policy MC
            s_t = s_t1
    return pi

Expected Sarsa

n-steps temporal-difference

Partially Supervised Learning

Soon


The previous sections were separated by learning style. I will now separate by algorithm similarity.


Bayesian Methods

Soon

Deep Learning

Deep Generative Models

Variational Autoencoders

Overview
Dimensionality Reduction
Generation

:bulb: Intuition :

  • Generates from a latent code, similar to how a human would first have to decide what to generate (the meaning) before trying to represent it.
  • An autoencoder that uses distributions instead of point estimates for the latent codes. This enables sampling of new examples by decoding a sampled latent code.
  • A variational Auto-encoder (VAE) embeds examples by forcing similar examples to have similar latent representations (“smooth latent space”). This is achieved by adding noise to the latent codes during the training procedure, such that the decoder learns to assign similar examples to similar latent representations to decrease the expected reconstruction loss.
  • VAEs are probabilistic graphical models with the goal of latent modeling. They use neural networks to approximate complex conditional probability distributions.

:white_check_mark: Advantage :

  • Theoretically appealing .
  • Nice latent manifold structure .
  • Finds disentangled factors of variation .
  • Stable training.
  • Straight-forward to implement.
  • Easy to evaluate and compare models using an (approximate) data log-likelihood (evidence).

:x: Disadvantage :

  • The samples are often blurry due to the type of reconstruction loss used in practice (although different losses, e.g. in VAE-GAN, can help with that issue).
  • The generator puts non-zero mass everywhere, making it hard to model a thin manifold.
  • The priors that can currently be used in practice are restrictive.

Deep Learning Perspective

A VAE can be thought of as an auto-encoder that maps and reconstructs inputs to and from latent distributions rather than point-wise latent codes. This enables sampling to generate new examples and forces similar examples to be encoded by similar latent distributions. Indeed, latent distributions with more overlap are forced to generate similar samples. Importantly, simply making the network generate noise around the latent point-wise estimates is not satisfactory, as the encoder would learn to generate very large means or very small variances, thus making the noise negligible. To avoid this issue we force the latent distributions to be close to a given distribution $p(\mathbf{z})$ (typically a unit Gaussian).

The encoder is a neural network parametrized by $\mathbf{\phi}$ that maps the input $\mathbf{x}$ to a latent distribution $q_\phi(\mathbf{z}\vert\mathbf{x})$, from which a noisy representation of the low-dimensional latent code $\mathbf{z}$ is sampled. The decoder is parametrized by $\mathbf{\theta}$ and deterministically maps the sampled latent code $\mathbf{z}$ to the corresponding reconstruction $\mathbf{\hat{x}}=f_\theta(\mathbf{z})$. The training procedure requires minimizing the expected reconstruction loss $L(\mathbf{x},\mathbf{\hat{x}})$ under the constraint $q_\phi(\mathbf{z}\vert \mathbf{x}) \approx p(\mathbf{z})$. The latter constraint can be written as $D_\text{KL}(q_\phi(\mathbf{z}\vert\mathbf{x})\parallel p(\mathbf{z}) ) < \delta$

This constrained optimization problem can be rewritten as an unconstrained one using the KKT conditions (a generalization of Lagrange multipliers allowing inequality constraints). With a multiplier $\beta \geq 0$:

$$\begin{aligned}
\mathcal{F}(\theta, \phi, \beta; \mathbf{x}) &= \mathbb{E}_{q_\phi(\mathbf{z}\vert\mathbf{x})}\left[L(\mathbf{x},\mathbf{\hat{x}})\right] + \beta \left( D_\text{KL}(q_\phi(\mathbf{z}\vert\mathbf{x})\parallel p(\mathbf{z})) - \delta \right) \\
&\leq \mathbb{E}_{q_\phi(\mathbf{z}\vert\mathbf{x})}\left[L(\mathbf{x},\mathbf{\hat{x}})\right] + \beta \, D_\text{KL}(q_\phi(\mathbf{z}\vert\mathbf{x})\parallel p(\mathbf{z})) \\
&= \mathbb{E}_{q_\phi(\mathbf{z}\vert\mathbf{x})}\left[-\log p_\theta(\mathbf{x}\vert\mathbf{z})\right] + \beta \, D_\text{KL}(q_\phi(\mathbf{z}\vert\mathbf{x})\parallel p(\mathbf{z}))
\end{aligned}$$

where the last line replaces the reconstruction loss $L$ with the negative log-likelihood of the data as predicted by a hypothetical probabilistic decoder. The output of the deterministic decoder can be interpreted as the expected value of this hypothetical probabilistic decoder. Many of the usual losses $L$ for deterministic outputs correspond to negative log-likelihoods: mean squared error <-> Gaussian likelihood, log loss <-> logistic likelihood, …

The global loss consists of the expected reconstruction loss (first term) and a regularisation term (KL divergence).

We now have a loss and a model: it should be straight-forward to train such a network in your favorite deep learning library, right? Not really… The problem arises from the stochasticity of the latent representation $\mathbf{z}\sim q_\phi(\mathbf{z}\vert\mathbf{x})$: during the backward pass, gradients cannot flow through stochastic nodes. Fortunately, there’s a simple way of making it work, the reparametrisation trick, which consists in expressing $\mathbf{z}$ as a deterministic function of the inputs and of an independent random variable $\mathbf{z}=g_\phi(\mathbf{x}, \pmb{\epsilon})$. In the case of a multivariate Gaussian with a diagonal covariance, the two following formulations are equivalent:

$$\mathbf{z} \sim \mathcal{N}(\mathbf{z}; \pmb{\mu}(\mathbf{x}), \text{diag}(\pmb{\sigma}(\mathbf{x}))) \quad \iff \quad \mathbf{z} = g_\phi(\mathbf{x}, \pmb{\epsilon}) = \pmb{\mu}(\mathbf{x}) + \pmb{\sigma}(\mathbf{x}) \odot \pmb{\epsilon}, \qquad \pmb{\epsilon} \sim \mathcal{N}(\mathbf{0}, I)$$
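
Below is a minimal NumPy sketch of the trick (the numbers are made up and stand in for the encoder's outputs); it only illustrates that the reparametrised sample is a deterministic, hence differentiable, function of $\pmb{\mu}$ and $\pmb{\sigma}$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])    # hypothetical encoder output: mean
sigma = np.array([0.5, 1.5])  # hypothetical encoder output: standard deviation

# Direct sampling: z is a stochastic node, gradients cannot flow through it.
z_direct = rng.normal(loc=mu, scale=sigma)

# Reparametrised sampling: z is a deterministic function of (mu, sigma) and an
# independent noise eps, so gradients can flow back to mu and sigma.
eps = rng.standard_normal(mu.shape)
z_reparam = mu + sigma * eps
```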
The figure from Lil’Log summarizes very well the deep learning point of view, with the small caveat that in practice we use a deterministic decoder and a loss function rather than the probabilistic one:

vae

:mag: Side Notes :

  • The loss corresponds to $\beta$-VAE. You recover the original VAE loss by setting $\beta=1$.

  • Usually $p(\mathbf{z})=\mathcal{N}(\mathbf{z};\mathbf{0},I)$ and $q_\phi(\mathbf{z}\vert\mathbf{x})=\mathcal{N}(\mathbf{z};\pmb{\mu}(\mathbf{x}),\text{diag}(\pmb{\sigma}(\mathbf{x})))$. In which case the KL divergence can be computed in closed form:

$$D_\text{KL}(q_\phi(\mathbf{z}\vert\mathbf{x}) \parallel p(\mathbf{z})) = \frac{1}{2} \sum_{d=1}^{D} \left( \sigma_d^2(\mathbf{x}) + \mu_d^2(\mathbf{x}) - 1 - \log \sigma_d^2(\mathbf{x}) \right)$$
  • In practice, it is better to make the network output $\log \, \pmb{\sigma^2}$ (the log-variance): the variance $\exp(\log \pmb{\sigma^2})$ is then guaranteed to be positive even though the network outputs an unconstrained real value. We work with the variance rather than $\pmb{\sigma}$ as it is what shows up (twice) in the KL divergence above (see the short sketch after these notes).

  • For black and white image generation (e.g. MNIST), the cross-entropy loss (corresponding to fitting a Bernoulli for each pixel) gives better results than the mean squared error. The same holds for colored images, which is achieved by rescaling the output of the sigmoid to 0-255 in each channel.
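
As a concrete illustration of the last three notes, here is a minimal sketch of the $\beta$-VAE loss in PyTorch (my choice of library, not one prescribed by this post; the function name is made up). It combines a Bernoulli reconstruction loss with the closed-form KL divergence, using the log-variance trick:

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """x, x_hat: (batch, d) inputs in [0, 1] and sigmoid outputs of the decoder.
    mu, log_var: (batch, latent_dim) outputs of the encoder (log-variance trick)."""
    # Bernoulli reconstruction loss: pixel-wise cross-entropy, summed over dimensions.
    rec = F.binary_cross_entropy(x_hat, x, reduction="sum")
    # Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), with log_var = log(sigma^2).
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return (rec + beta * kl) / x.size(0)  # average over the batch
```

Setting `beta=1` recovers the original VAE loss, as noted above.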

Graphical Model Perspective

When defining a generative probabilistic model, it is important to define a step-by-step generation procedure. For each datapoint $n$:

  1. Sample the latent variable (i.e. the semantics of what you want to generate): $\mathbf{z}^{(n)} \sim p(\mathbf{z})$.
  2. Sample the generated datapoint conditioned on the latent variable: $\mathbf{x}^{(n)} \sim p(\mathbf{x} \vert \mathbf{z})$.

The graphical model is simply :

vae graphical model

The objective is to have a graphical model that maximizes the probability of generating real data:

$$p(\mathbf{x}) = \int p(\mathbf{x} \vert \mathbf{z}) \, p(\mathbf{z}) \, d\mathbf{z}$$
Unfortunately the integral above is not tractable, as it would require integrating over all possible $\mathbf{z}$ sampled from the prior. It could be approximated by Monte Carlo sampling, i.e. by sampling $N$ different $\mathbf{z}^{(n)} \sim p(\mathbf{z})$ and then computing $p(\mathbf{x}) \approx \frac{1}{N}\sum_{n=1}^{N}p(\mathbf{x}\vert\mathbf{z}^{(n)})$. But as the volume spanned by $\mathbf{z}$ is potentially large, $N$ may need to be very large to obtain an accurate estimate of $p(\mathbf{x})$ (we would sample many times from uninformative regions of the prior). To improve the sample efficiency (decrease the variance) of the estimate we would like to sample “important” $\mathbf{z}$ rather than blindly sampling from the prior. Let’s call $q$ a probability distribution that assigns high probability to “important” $\mathbf{z}$. As the notion of “importance” might depend on the $\mathbf{x}$ we want to predict, let’s condition $q$ on it: $q(\mathbf{z} \vert \mathbf{x})$. We can use Importance Sampling to use samples from $q(\mathbf{z} \vert \mathbf{x})$ without biasing our estimate:

$$p(\mathbf{x}) = \int p(\mathbf{x} \vert \mathbf{z}) \, p(\mathbf{z}) \, d\mathbf{z} = \int p(\mathbf{x} \vert \mathbf{z}) \frac{p(\mathbf{z})}{q(\mathbf{z} \vert \mathbf{x})} q(\mathbf{z} \vert \mathbf{x}) \, d\mathbf{z} = \mathbb{E}_{q(\mathbf{z} \vert \mathbf{x})}\left[ p(\mathbf{x} \vert \mathbf{z}) \frac{p(\mathbf{z})}{q(\mathbf{z} \vert \mathbf{x})} \right]$$
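
To make the variance reduction concrete, here is a small NumPy/SciPy sketch on a made-up one-dimensional model ($p(\mathbf{z})=\mathcal{N}(0,1)$, $p(\mathbf{x}\vert\mathbf{z})=\mathcal{N}(\mathbf{z}, 0.1^2)$, all values hypothetical); both estimators target the same $p(\mathbf{x})$, but the importance-sampled one has much lower variance:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x, n_samples = 3.0, 10_000

def lik(z):
    """Toy likelihood p(x|z) = N(x; z, 0.1^2)."""
    return norm.pdf(x, loc=z, scale=0.1)

# Naive Monte Carlo: sample z from the prior p(z) = N(0, 1).
z_prior = rng.standard_normal(n_samples)
p_naive = lik(z_prior).mean()

# Importance sampling: sample z from q(z|x) = N(x, 0.1^2), which puts its mass
# on "important" z, and re-weight by p(z) / q(z|x) to keep the estimate unbiased.
z_q = rng.normal(loc=x, scale=0.1, size=n_samples)
weights = norm.pdf(z_q, loc=0, scale=1) / norm.pdf(z_q, loc=x, scale=0.1)
p_is = (lik(z_q) * weights).mean()

print(p_naive, p_is)  # both approximate p(x) ~ 0.0046, p_is much more reliably
```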
Noting that $-\log$ is a convex function, we can use Jensen's inequality, which states that $g(\mathbb{E}[X]) \leq \mathbb{E}[g(X)] \text{, where } g \text{ is convex}$:

$$\log p(\mathbf{x}) = \log \mathbb{E}_{q(\mathbf{z} \vert \mathbf{x})}\left[ p(\mathbf{x} \vert \mathbf{z}) \frac{p(\mathbf{z})}{q(\mathbf{z} \vert \mathbf{x})} \right] \geq \mathbb{E}_{q(\mathbf{z} \vert \mathbf{x})}\left[ \log \left( p(\mathbf{x} \vert \mathbf{z}) \frac{p(\mathbf{z})}{q(\mathbf{z} \vert \mathbf{x})} \right) \right] = \mathbb{E}_{q(\mathbf{z} \vert \mathbf{x})}\left[ \log p(\mathbf{x} \vert \mathbf{z}) \right] - D_\text{KL}(q(\mathbf{z} \vert \mathbf{x}) \parallel p(\mathbf{z}))$$
The right-hand side of the equation is called the Evidence Lower Bound (ELBO) as it lower bounds the log evidence $\log p(\mathbf{x})$. The last missing component is to choose appropriate probability distributions:

  • $p(\mathbf{x} \vert \mathbf{z})$ might be a very complicated distribution. We use a neural network parametrized by $\pmb{\theta}$ to approximate it: $p_\theta(\mathbf{x} \vert \mathbf{z})$.
  • $p(\mathbf{z})$ can be any prior. It is often set as $\mathcal{N}(\mathbf{0}, I)$ which makes the KL divergence term simple and states that we prefer having latent codes with uncorrelated dimensions (disentangled).
  • $q(\mathbf{z}\vert \mathbf{x})$ should give more mass to “important” $\mathbf{z}$. We let a neural network parametrized by $\pmb{\phi}$ decide how to set it: $q_\phi(\mathbf{z} \vert \mathbf{x})$. In order to have a closed-form KL divergence, we usually use a Gaussian distribution with diagonal covariance whose parameters $\pmb{\mu}, \pmb{\sigma}$ are the output of the neural network: $q_\phi(\mathbf{z} \vert \mathbf{x}) = \mathcal{N}(\mathbf{z};\pmb{\mu}(\mathbf{x}),\text{diag}(\pmb{\sigma}(\mathbf{x})))$.

Finally, let’s train $\pmb{\theta}$ and $\pmb{\phi}$ by maximizing the probability of generating the training (i.e. real) data $p_{\theta,\phi}(\mathbf{x})$. Assuming that all data points are i.i.d then:

This corresponds to the loss we derived with the deep learning perspective for $\beta=1$. Please refer to the previous subsection for the reparametrization trick and the closed-form solution of the KL divergence.

:mag: Side Notes :

  • The ELBO can also be derived by directly minimizing $D_\text{KL}(q(\mathbf{z} \vert \mathbf{x}) \parallel p(\mathbf{z} \vert \mathbf{x}) )$, which intuitively “turns the problem around” and tries to find a good mapping from $\mathbf{x}$ to $\mathbf{z}$. The idea of doing approximate tractable inference with a simple distribution $q(\mathbf{z} \vert \mathbf{x})$ instead of the real posterior $p(\mathbf{z} \vert \mathbf{x})$ is called variational inference. This is where the name of variational auto-encoder comes from.

  • The choice of direction of the KL divergence is natural with the importance sampling derivation.

  • You can get a tighter lower bound by using multiple samples to form a less biased estimate of the quantity inside the $\log$: $\log \frac{1}{K} \sum_{k=1}^{K} p(\mathbf{x} \vert \mathbf{z}^{(k)}) \frac{p(\mathbf{z}^{(k)})}{q(\mathbf{z}^{(k)} \vert \mathbf{x})}$ with $\mathbf{z}^{(k)} \sim q(\mathbf{z}\vert \mathbf{x})$. In expectation this gives a tighter lower bound; the resulting model is called the Importance Weighted Autoencoder.

:information_source: Resources : D. Kingma and M. Welling’s original paper on VAEs

Intuition: Disentangled Representations

Disentangled representations most often refer to a latent space whose dimensions are not very correlated with each other (a factorizable latent space). Such representations have been shown to be more general and robust. They enable compositionality, which is key for human thought and language. For example, if I ask you to draw a blue apple, you would not have any issues doing it (assuming your drawing skills are better than mine :sweat_smile:) because you understand the concept of an apple and the concept of being blue. Being able to model such representations is what we are aiming for in VAEs, as illustrated in Jeremy Jordan’s blog :

disentangled vae

Having a Gaussian distribution with diagonal covariance for the latent variable means that each latent dimension is modeled by its own univariate Gaussian. Intuitively, we can thus think of the latent dimensions as knobs used to add or remove something. Using a prior forces the model to encode how confident it is about how much each knob should be turned.

Although the variance of each knob is independent (diagonal covariance), the means can still be highly dependent. The unit Gaussian prior forces each dimension to correspond to the principal components of the distribution of means (i.e. the axes are parallel to the directions with the most variance, like in PCA), thus forcing the knobs to encode disentangled factors of variation (small correlation) which are often more interpretable. This is because dimensions aligned with the principal components decrease the KL divergence, due to the Pythagorean theorem. Graphically:

pca vae

Graphical Models

Soon


The previous sections were separated by learning style and algorithm similarity. I will now focus on the application.


Automatic Speech Recognition

Soon

Computer Vision

Soon

Natural Language Processing

Soon

Time Series

Soon


The following sections are not strictly speaking part of Machine Learning. They are nevertheless very related and important.


Optimization

The goal of mathematical optimization (also called mathematical programming) is to find a minimum or maximum of an objective function $f(x)$. The “best” optimization algorithm is very problem specific and depends on the objective function to minimize and on the type of constraints the solution has to satisfy. In mathematics and machine learning, optimization problems are usually stated in terms of minimization, which is why I will focus on minimization. Importantly, if you want to maximize $f(x)$ you can equivalently minimize $-f(x)$.

Optimization problems are often separated into a large number of overlapping classes. In this blog I will focus on a few key distinctions:

  • Convex Optimization vs Non-Convex Optimization: The former consists of the sets of problems that deal with a convex objective function and set of constraints. The latter is a lot harder and contains all other problems. Current non-convex algorithms usually don’t have nice convergence guarantees.
  • Discrete Optimization vs Continuous Optimization: as suggested by the name, some of the variables in discrete optimization are restricted to be discrete. Discrete optimization is often separated into combinatorial optimization (e.g. on graph structures) and integer programming (linear programming with integer variables). Discrete optimization tends to be harder than continuous optimization, which contains only continuous variables.
  • Finite terminating algorithms vs Iterative methods vs Heuristic-based methods: the first type of algorithm terminates after a finite number of steps; iterative methods converge to a solution; heuristic-based methods are not mathematically guaranteed to converge but are still very useful in practice.
  • Unconstrained Optimization vs Constrained Optimization: whether or not there are constraints on the variables of the objective function. The constraints might be hard (required to be satisfied) or soft, in which case the objective function is penalized by the extent that the constraints are not satisfied.
  • Deterministic Optimization vs Stochastic Optimization: whether the data is assumed to be noisy or not. The latter is extremely useful in practice as most data is noisy.
  • Linear Programming vs Non-Linear Programming: the former consists of the set of problems that have a linear objective function and only linear constraints. Non-linear programming consists of all other problems, which can be convex or not.
  • $n^{th}$ Order Algorithms: distinguishes algorithms by the order of the derivatives of the objective function they use. In optimization the most common are:
    • Zero-order optimization algorithm (=derivative free): use only the function values. This has to be used if the explicit function form is not given (i.e. black-box optimization).
    • First-order optimization algorithms: these optimize the objective function using its gradient, which indicates how the function increases or decreases in every direction. This class of algorithms is the most used in machine learning (a minimal sketch follows this list).
    • Second-order optimization algorithm: these algorithms use the Hessian to know about the curvature of the function.
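
As a tiny illustration of a first-order method, here is a minimal (hypothetical, plain NumPy) gradient descent sketch minimizing $f(x) = (x-3)^2$:

```python
import numpy as np

def gradient_descent(grad_f, x0, lr=0.1, n_steps=100):
    """Minimal first-order optimizer: repeatedly step against the gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - lr * grad_f(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3); converges to ~3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```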

:mag: Side Notes: As optimization is crucial for many different fields, the vocabulary and conventions are not really standardized. For example, $f(x)$ is often called the objective function, the loss function, or the cost function when it is minimized, and the utility or fitness function when it has to be maximized. In physics and computer vision, it is often called the energy function.

Convex Optimization

Linear Programming

Linear programming (LP) deals with linear objective functions subject to linear constraints (equality and inequality). Any LP problem can be expressed in a canonical form:

$$\max_{\mathbf{x}} \ \mathbf{c}^T\mathbf{x} \quad \text{subject to} \quad A\mathbf{x} \leq \mathbf{b}, \ \mathbf{x} \geq \mathbf{0}$$
The constraints $A\mathbf{x} \leq \mathbf{b}$ are called the main constraints, while $\mathbf{x} \geq \mathbf{0}$ are the non-negativity constraints. The set of points that satisfy all constraints forms the feasible region. This region is a convex polytope (formally, the intersection of a finite number of half-spaces) defined by the linear inequalities. The goal is to find the point of the polytope associated with the maximum (or minimum) of the objective function. Note: for unbounded problems (i.e. when the objective function can be made arbitrarily large) the feasible region is necessarily unbounded.

Let’s look at the simple example used in this blog, from whom I borrowed the nice plots:

The feasible region in this case is :

Linear Programming Feasible Region

We now have to find the point in this region which maximizes $3x_1+2x_2$. For any given $z$, let’s consider the set of all points whose cost is $\mathbf{c}^T\mathbf{x}=z$ (the level curves of the objective function). This is the line described by $z=3x_1+2x_2$. Note that $\forall z$ these lines are parallel to each other and perpendicular to $\mathbf{c}=(3,2)$. In particular, increasing $z$ corresponds to moving the line $z=3x_1+2x_2$ along the direction of $\mathbf{c}$ ($\nabla f(\mathbf{x})=\mathbf{c}$):

Linear Programming Normal Level Curves

As we want to maximize $z$, we would like to move the line as much as possible along $\mathbf{c}$, while staying in the feasible region. The best we can do is $z=180$, and the vector $\mathbf{x}=(20,60)$ is the corresponding optimal solution:

Linear Programming Level Curves

Note that the optimal solution $\mathbf{x}=(20,60)$ is a vertex of the polytope. We will show later that this is always the case.
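
For reference, this kind of small LP can be solved numerically with `scipy.optimize.linprog`. The sketch below uses hypothetical constraints chosen so that they reproduce the optimum quoted above ($z=180$ at $\mathbf{x}=(20,60)$); the exact constraints of the original example may differ:

```python
from scipy.optimize import linprog

c = [-3, -2]                 # linprog minimizes, so negate c to maximize 3*x1 + 2*x2
A_ub = [[1, 1], [2, 1]]      # hypothetical constraints: x1 + x2 <= 80, 2*x1 + x2 <= 100
b_ub = [80, 100]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x, -res.fun)       # -> [20. 60.] 180.0
```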

Canonical and Slack form

:mag: Side Notes: Beware that the term standard form is very often used in the literature to mean different things. For example, in Bertsimas and Tsitsiklis the standard form is equivalent to the canonical form we have seen above, while the standard form in Vanderbei’s book refers to what I will call the slack form. I thus decided not to use the term “standard form”.

Canonical Form:

$$\max_{\mathbf{x}} \ \mathbf{c}^T\mathbf{x} \quad \text{subject to} \quad A\mathbf{x} \leq \mathbf{b}, \ \mathbf{x} \geq \mathbf{0}$$

Slack Form:

$$\min_{\mathbf{x}} \ \mathbf{c}^T\mathbf{x} \quad \text{subject to} \quad A\mathbf{x} = \mathbf{b}, \ \mathbf{x} \geq \mathbf{0}$$

The canonical form is very natural and thus often used to develop the theory, but the slack form is usually the starting point for algorithms as it is computationally more convenient.

Any general LP problem can be converted back and forth to these 2 forms. Let’s first see how to convert general LP problems to the canonical form:

  1. Convert to maximizing problems, by multiplying the objective function of minimizing problems by $-1$.
  2. Convert constraints to $\leq$: $\forall i$ such that there exists a main constraint of the form $\mathbf{a}_i^T \mathbf{x} \geq b_i$, change it to $(-\mathbf{a}_i)^T\mathbf{x}\leq (-b_i)$.
  3. Remove equality constraints $\sum_j a_{ij} x_j = b_i$ by solving for an $x_j$ associated with a non-zero $a_{ij}$ and substituting this $x_j$ wherever it appears in the objective function or constraints. This removes one constraint and one variable from the problem.
  4. Restrict variables to be non-negative by replacing unrestricted or negative $x_j$ by $x_j=x_j^{+}-x_j^{-}$ where $x_j^{+}, x_j^{-} \geq 0$ . Indeed, any real number can be written as the difference between 2 non-negative numbers.

Let’s now see how to convert general LP problems to the slack form:

  1. Convert to minimizing problems, by multiplying the objective function of maximizing problems by $-1$.
  2. Restrict variables to be non-negative by replacing unrestricted or negative $x_j$ by $x_j=x_j^{+}-x_j^{-}$ where $x_j^{+}, x_j^{-} \geq 0$.
  3. Eliminate inequality constraints $\mathbf{a}_i^T\mathbf{x}\leq b_i$ by introducing a slack variable : $\mathbf{a}_i^T\mathbf{x} + s_i = b_i$, $s_i \geq 0$. Similarly for $\mathbf{a}_i^T\mathbf{x} \geq b_i$, introduce a surplus variable $\mathbf{a}_i^T\mathbf{x} - s_i = b_i$, $s_i \geq 0$.
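
As a small worked example of steps 2 and 3 (with made-up numbers): suppose a problem contains the constraint $x_1 + 2x_2 \leq 10$, where $x_2$ is unrestricted in sign. We first replace $x_2 = x_2^{+} - x_2^{-}$ with $x_2^{+}, x_2^{-} \geq 0$, and then add a slack variable $s_1 \geq 0$:

$$x_1 + 2x_2^{+} - 2x_2^{-} + s_1 = 10, \qquad x_1, x_2^{+}, x_2^{-}, s_1 \geq 0$$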

:mag: Side Notes:

  • When converting from the canonical form to the slack one, the objective function $f(x)=\mathbf{c}^T\mathbf{x}$ is unmodified because $\mathbf{x}^{slack}=[\mathbf{x}^{canonical}; \mathbf{s}]$ and $\mathbf{c}^{slack}=[\mathbf{c}^{canonical}; \mathbf{0}]$; the constraints are simply a re-parametrization using slack variables.
  • In the slack form, $A \in \mathbb{R}^{m \times n}$ has full row rank $m$ (i.e. $n > m$), otherwise there would be “no room for optimization” as there would be a single solution or none.

We have already seen an example which visualized the canonical form of the problem. I.e. for $\mathbf{x} \in \mathbb{R}^{n’}$ the feasible set is a polytope defined by $m’$ main constraints ($A \mathbf{x} \leq \mathbf{b}$) in an $n’$ dimensional space.

In the slack form with $m$ constraints and $\mathbf{x} \in \mathbb{R}^{n}$, the equality constraints force $\mathbf{x}$ to lie on an $(n-m)$ dimensional subspace (each constraint removes a degree of freedom of $\mathbf{x}$). Once in that $(n-m)$ dimensional subspace, the feasible set is only bounded by the non-negativity constraints $\mathbf{x} \geq \mathbf{0}$. I.e. the slack form of the problem can be seen as minimizing a function over a polytope in an $(n-m)$ dimensional subspace. Importantly, the polytope has the exact same shape as for the canonical form, but lives in a different space!

Duality

In optimization, it is very common to convert constrained problems to a relaxed unconstrained formulation, by allowing constraints to be violated but adding “a price” $\mathbf{y}$ to pay for any violation. When the price is correctly chosen, the solutions of both problems coincide (by letting $\mathbf{y} \to \infty$ you effectively don’t allow any violation of the constraints). This is the key idea behind the well known Lagrange multipliers. Let’s call the problem formulation we have seen until now the primal problem. The slack form of the primal problem can then be relaxed as:

$$\min_{\mathbf{x} \geq \mathbf{0}} \ \mathbf{c}^T\mathbf{x} + \mathbf{y}^T(\mathbf{b} - A\mathbf{x})$$
Let $g(\mathbf{y})$ be the optimal cost of the relaxed problem as a function of the violation price $\mathbf{y}$. $g(\mathbf{y})$ is a lower bound on the optimal cost $\mathbf{c}^T\mathbf{x}^{*}$ of the constrained problem (weak duality theorem). Indeed, using $A\mathbf{x}^*=\mathbf{b}$:

$$g(\mathbf{y}) = \min_{\mathbf{x} \geq \mathbf{0}} \left[ \mathbf{c}^T\mathbf{x} + \mathbf{y}^T(\mathbf{b} - A\mathbf{x}) \right] \leq \mathbf{c}^T\mathbf{x}^* + \mathbf{y}^T(\mathbf{b} - A\mathbf{x}^*) = \mathbf{c}^T\mathbf{x}^*$$
The goal is thus to find the tightest lower bound, $\max_\mathbf{y} g(\mathbf{y})$; this is called the dual problem. The problem can be simplified by noting that:

$$g(\mathbf{y}) = \min_{\mathbf{x} \geq \mathbf{0}} \left[ \mathbf{c}^T\mathbf{x} + \mathbf{y}^T(\mathbf{b} - A\mathbf{x}) \right] = \mathbf{y}^T\mathbf{b} + \min_{\mathbf{x} \geq \mathbf{0}} \left( (\mathbf{c}^T - \mathbf{y}^T A)\mathbf{x} \right)$$

and that $\min_{\mathbf{x} \geq \mathbf{0}} \left( (\mathbf{c}^T - \mathbf{y}^T A)\mathbf{x} \right)$ is $0$ if $\mathbf{y}^TA \leq \mathbf{c}^T$ and $-\infty$ otherwise.

The dual problem is thus another LP problem, which can be written as:

$$\max_{\mathbf{y}} \ \mathbf{y}^T\mathbf{b} \quad \text{subject to} \quad \mathbf{y}^TA \leq \mathbf{c}^T$$

:mag: Side Notes:

  • Each main constraint in the primal LP becomes a variable in the dual LP. Similarly, each variable in the primal LP becomes a main constraint in the dual LP. The objective is reversed (minimization ↔ maximization).
  • The dual of the dual is the primal.
  • Weak Duality Theorem: if $\mathbf{x}$ is a feasible solution to the primal LP and $\mathbf{y}$ is a feasible solution of the dual LP, then $\mathbf{y}^T\mathbf{b} \leq \mathbf{c}^T\mathbf{x}$.
  • Strong Duality Theorem: if one of the dual / primal LPs has an optimal solution, so does the other one, and the respective optimal objective values are equal.
  • Complementary Slackness: $\mathbf{x}, \ \mathbf{y}$ are optimal solutions for the primal and dual LP respectively iff $y_i(\mathbf{a}_i^T\mathbf{x}-b_i)=0, \ \forall i$ and $x_j(c_j-\mathbf{y}^T\mathbf{a}_j)=0, \ \forall j$. I.e. either a primal constraint is tight or the corresponding dual variable $y_i$ is 0 (and vice versa).
  • Importantly the dual problem is also a LP, meaning that the dual feasible set is a convex polytope. The polytope is now defined by $n$ constraints and lives in a $m$ dimensional space instead of the reverse in the primal case ($m$ constraints, $n$ dimensional space).
  • In case the primal LP is in the canonical form, the dual LP can be written as :

Canonical Primal LP:

$$\max_{\mathbf{x}} \ \mathbf{c}^T\mathbf{x} \quad \text{subject to} \quad A\mathbf{x} \leq \mathbf{b}, \ \mathbf{x} \geq \mathbf{0}$$

Associated Dual LP:

$$\min_{\mathbf{y}} \ \mathbf{y}^T\mathbf{b} \quad \text{subject to} \quad A^T\mathbf{y} \geq \mathbf{c}, \ \mathbf{y} \geq \mathbf{0}$$
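
As a quick numerical sanity check of strong duality, here is a sketch that solves both the (hypothetical) toy primal LP used earlier and its dual with `scipy.optimize.linprog`, and verifies that the two optimal objective values match:

```python
from scipy.optimize import linprog

# Toy canonical primal (hypothetical LP): max 3*x1 + 2*x2
# s.t. x1 + x2 <= 80, 2*x1 + x2 <= 100, x >= 0.
c, A, b = [3, 2], [[1, 1], [2, 1]], [80, 100]
primal = linprog([-ci for ci in c], A_ub=A, b_ub=b, bounds=[(0, None)] * 2)

# Associated dual: min 80*y1 + 100*y2  s.t.  A^T y >= c, y >= 0
# (rewritten as -A^T y <= -c because linprog expects <= constraints).
A_T = [list(col) for col in zip(*A)]
dual = linprog(b, A_ub=[[-a for a in row] for row in A_T],
               b_ub=[-ci for ci in c], bounds=[(0, None)] * 2)

print(-primal.fun, dual.fun)  # both 180.0: strong duality holds
```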

:information_source: Resources: the intuitive link between the primal and dual form explained above comes from the very good Introduction to Linear Optimization.

Computational Neuroscience

Soon

Unfortunately, here ends today’s journey together. But don’t be too disappointed, I’m only getting started. I still have a ton of notes eagerly waiting to get upgraded, and I don’t intend to stop learning any time soon :sweat_smile:.

PS: Any reaction/suggestions would be very appreciated: just drop a comment below or click on the :heart: .

See you soon !