Machine Learning Glossary

Machine Learning Glossary / Cheat Sheet with a focus on intuition


This is my first post ever :bowtie:, I would love to get your feedback.


I will try to summarize important terms and concepts of machine learning. The focus is on intuition but there will also be practical and theoretical notes.

Target Audience, this is for you if you:

  • Have a decent understanding of a concept but want more intuition.
  • Are switching machine learning sub-domains.
  • Knew a term, but want to refresh your knowledge as it’s hard to remember everything (that’s me :sweat_smile: ).
  • Need to learn the important concepts in an efficient way. Students cramming for an exam: that’s for you :four_leaf_clover: !


Having a bad memory but being (at least considering myself to be :sweat_smile: ) a philomath who loves machine learning, I developed the habit of taking notes, then summarizing and finally making a cheat sheet for every new ML domain I encounter. There are multiple reasons I want to switch to a web-page:

  • Paper is not practical and prone to loss.
  • Thinking that someone I don't know (I'm talking about you :raising_hand: ) might read this post makes me write higher quality notes .
  • I'm forever grateful to people that spend time on forums and open source projects. I now want to give back to the community (The contribution isn't comparable, but I have to start somewhere :innocent: ).
  • Taking notes on a computer is a necessary step for my migration to CS :sweat_smile: .
  • As a wise man once said:
    You do not really understand something unless you can explain it to your grandmother. - Albert Einstein
    My grandma's are awesome :heart: but not really into ML (yet). You have thus been designated "volunteer" to temporarily replace them.


To make it easier to search the relevant information in the Glossary here is the color coding I will be using:

  • :bulb: Intuition
  • :wrench: Practical
  • :x: Disadvantage
  • :white_check_mark: Advantage
  • :school_satchel: Example
  • :mag: Side Notes
  • :wavy_dash: Compare to
  • :information_source: Resources


  • This is my first post ever :bowtie:, I would love to get your feedback.
  • I’m bad at spelling: Apologies in advance (feel free to correct me).
  • Check out the resources from where I got most of this information.
  • ML sub-domains overlap A LOT. I’ll try not to make the separations too artificial. Any suggestions would be appreciated :relaxed: . Note that I separate domains both by learning style and by algorithm similarity.
  • This is not meant to be a post read in order, but rather used as a “cheat-sheet”. Use the table of content or Ctrl+f.

Enough talking: prepare your popcorn and let’s get going :clapper: !

General Machine Learning Concepts

Curse of Dimensionality

The Curse of dimensionality refers to various practical issues when working with high dimensional data. These are often computational problems or counter intuitive phenomenas, coming from our Euclidean view of the 3 dimensional world (let’s keep time out of the equations).

They are all closely related but I like to think of 3 major issues with high dimensional inputs $x \in \mathbb{R}^d, \ d \ggg 1$:

Sparsity Issue

You need exponentially more data to fill in a high dimensional space. I.e if the dataset size is constant, increasing the dimensions makes your data sparser.

:bulb: Intuition : The volume size grows exponentially with the number of dimensions. Think of filling a $d$ dimensional unit hypercube with points at a $0.1$ interval. In 1 dimension we need $10$ of these points. In 2 dimension we already need 100 of these. In $d$ dimension we need $10^d$ observation !

Let’s look at a simple example :

Imagine we trained a certain classifier for distinguishing between :white_circle: and :large_blue_circle: circles. Now we want to predict the class of an unkown observation :black_circle: . Let’s assume that:

  • All features are given in percentages $[0,1]$
  • The algorithm is non-parametric and has to look at the points in the surrounding hypercube, which spans $30\%$ of the input space (see below).

Given only 1 feature (1D), we would simply need to look at $30\%$ of the dimension values. In 2D we would need to look at $\sqrt{0.3}=54.8\%$ of each dimensions. In 3D we would need $\sqrt[3]{0.3}=66.9\%$ of in each dimensions. Visually:

sparsity in 1D

sparsity in 2D

sparsity in 3D


We thus see that in order to keep a constant support (i.e amount of knowledge of the space), we need to look at more data when adding dimensions. In other words, if we add dimensions without adding data, there will be large unknown spaces. This is called sparsity.

I have kept the same number of observation in the plots, so that you can appreciate how “holes” appear in our training data as the dimension grows.

:x: Disadvantage : The data sparsity issue causes machine learning algorithms to fail finding patterns or to overfit.

Points are further from the center

Basically, the volume of a high dimensional orange is mostly in its skin and not in the pulp! Which means expensive high dimensional juices :pensive: :tropical_drink:

:bulb: Intuition : The volume of a sphere depends on $r^d$. So as $d$ increases, the importance of $r$ will increase. The skin has a slightly greater $r$ than the pulp, in high dimensions this slight difference will become very important.

If you’re not convinced, stick with my simple proof. Let’s consider a $d$ dimensional unit orange (i.e $r=1$), with a skin of width $\epsilon$. Let’s compute the ratio of the volume in the skin to the total volume of the orange. This could be done by integration, but we can skip these steps by simply noting that the volume of a hypersphere is proportional to to $r^d$ i.e : $V_{d}(r) = k r^{d}$.

Taking $\epsilon = 0.05$ as an example, here is the $ratio_{skin/orange}(d)$ we would get:

2D orange

3D orange

5D orange

10D orange


:mag: Side Notes : The same goes for hyper-cubes. I.e most of the mass is concentrated in their edges. That’s why you will sometimes hear that hyper-cubes are “spiky”. Think of the $[-1,1]^d$ hyper-cube: the distance from the center of the faces to the origin will trivially be $0 \ \forall d$, while the distance to each corners will be $\sqrt{d}$ (Pythagorean theorem). So basically the corners go further but not the center of the faces, which makes us think of spikes. This is why you will sometimes see such pictures :

2D hypercube

3D hypercube

7D hypercube


Euclidean distance becomes meaningless

There’s nothing that makes Euclidean distance intrinsically meaningless for high dimensions. It is rather than with our finite number of data, 2 points in high dimensions seem to be more “similar”. This is simply due to probability and sparsity.

:bulb: Intuition :

  • Let’s consider the distance between 2 close random points: $q$ and $p$. By adding independent dimensions, the probability that these 2 points differ greatly in at least one dimension grows (due to chance). This is what causes the sparsity issue. Similarly, the probability that 2 points that were far away will have at least one similar dimension, also grows. So basically, adding dimensions makes points seem more random, and the distances thus become less useful.
  • Euclidean distance accentuates the point above. Indeed, by adding dimensions, the probability that $q$ and $p$ points have at least one completely different feature grows. i.e $max_i(q,p)$ increases. The Euclidean distance between 2 points is $d(q,p)=\sqrt{\sum_{i=1}^n (q_i-p_i)^2}$. Because of the squared term, the distance depends strongly on $max_i(q_i-p_i)$. So due to chance, there is often less difference between distances of “similar” and “dissimilar points” in high dimensions. This is why Manhattan (L1) distance or fractional distance metrics (Lc with $c<1$) are preferred in high dimensions.

In such discussions people often cite a theorem which states that if the dimension are independent, the minimum and maximum distance between a point $p$ and $n$ other points $q^l$ become indiscernible when normalized. I.e all $n$ points $q^l$ converge to the same distance of $p$ in high dimension:

The key point here, is that we fix the number of points $n$ (sparsity issues) and that we are adding independent dimensions (chance of having different features grows). This is exactly what I tried to show intuitively.

:wrench: Practical : using dimensionality reduction often gives you better results for subsequent steps due to this curse. It makes the algorithm converge faster and reduces overfitting. But be careful not to underfit by using too few features.

:mag: Side Notes :

  • Although the curse of dimensionality is a big issue, we can find effective techniques in high-dimensions because:
    • Real data will often be confined to a lower effective dimensionality (ex: a 2D Gaussian in a higher dimensional space).
    • Real data will often be locally smooth, so that interpolation-like techniques can overcome some of the sparsity issues.
  • You will often see plots of the unit $d$-ball volume vs its dimensionality. Although I find the non-monotonicity of such plots very intriguing, I am not fond of these as they make you want to conclude that high dimensional hypersphere are smaller than low dimensional ones. Of course this makes no sense as a lower dimensional hypersphere can always be fitted in a higher dimensional one. The issue is that we are comparing apple and oranges (no puns intended :sweat_smile:) by comparing different units: Is $1 m$ really smaller than $0.99 m^2$ ?

:information_source: Resources : Great post about the curse of dimensionality in classification which inspired me, On the Surprising Behavior of Distance Metrics in High Dimensional Space is a famous paper which proposes the use of fractional distance metrics, nice blog of simulations.

Images modified from: oranges, 7D cube

Evaluation Metrics

Classification Metrics

Single Metrics

:mag: Side Notes : I will mostly focus on binary classification but most scores can be generalized to the multi-class setting. Often this is achieved by only considering “correct class” and “incorrect class” in order to make it a binary classification, then you average (weighted by the proportion of observation in the class) the score for each classes.

  • TP / TN / FN / FP: The best way to understand these is to look at a confusion matrix.

confusion matrix

  • Accuracy: fraction of observation correctly classified.
    • $ Acc = \frac{Real Positives}{Total} = \frac{TP+FN}{TP+FN+TN+FP}$
    • :bulb: In general, how much can we trust the predictions ?
    • :wrench: Use if no class imbalance and cost of error is the same for both types
  • Precision fraction of the observation predicted as positive that were actually positive.
    • $ Prec = \frac{TP}{Predicted Positives} = \frac{TP}{TP+FP}$
    • :bulb: How much can we trust positive predictions ?
    • :wrench: Use if FP are the worst
  • Recall fraction of the positive observation that have been correctly predicted.
    • $ Rec = \frac{TP}{Actual Positives} = \frac{TP}{TP+FN}$
    • :bulb: How many actual positives will we find?
    • :wrench: Use if FN are the worst
  • F1-Score harmonic mean (good for averaging rates) of recall and precision.
    • $F1 = 2 \frac{Precision * Recall}{Precision + Recall}$
    • If recall is $\beta$ time more important than precision use $F_{\beta} = (1+\beta^2) \frac{Precision * Recall}{\beta^2 Precision + Recall}$
    • :bulb: How well much can we trust our algorithms for the positive class
    • :wrench: Use if the positive class is more important (want a detector more than a classifier)
  • Specificity like recall but for negatives. $ Spec = \frac{TN}{Actual Negatives} = \frac{TN}{TN+FP}$

  • Cohen’s Kappa Improvement of your classifier over always guessing the most probable class
    • $\kappa = \frac{accuracy - percentageMaxClass}{1 - percentageMaxClass}$
    • More generally can compare 2 raters (ex: 2 humans): $\kappa = \frac{p_o- p_e}{1 - p_e}$ where $p_o$ is the observed agreement and $p_e$ is the expected agreement due to chance.
    • $ \kappa \leq 1$ (if $<0$ then useless).
    • :bulb: Accuracy improvement weighted by class imbalance .
    • wrench: Use when high class imbalance and all classes are ~the same importance
  • Log-Loss measures performance when model outputs a probability $\hat{y_ic}$ that observation $i$ is in class $c$
    • Also called Cross entropy loss or logistic loss
    • $logLoss = - \frac{1}{N} \sum^N_{i=1} \sum^K_{c=1} y_{ic} \ln(\hat{y}_{ic})$
    • Use the natural logarithm for consistency
    • Incorporates the idea of probabilistic confidence
    • Log Loss is the metric that is minimized through Logistic Regression and more generally Softmax
    • :bulb: Penalizes more if confident but wrong (see graph below)
    • :bulb: Log-loss is the cross entropy between the distribution of the true labels and the predictions
    • :wrench: Use when you are interested in outputting confidence of results
    • The graph below shows the log loss depending on the confidence of the algorithm that an observation should be classed in the correct category. For multiple observation we compute the log loss of each and then average them.

    log loss

  • AUC Area Under the Curve. Summarizes curves in a simple single metric.
    • It normally refers to the ROC curve. Can also be used for other curves like the precision-recall one.
    • :bulb: Probability that a randomly selected positive has a higher score than a randomly selected negative observation .
    • :mag: AUC evaluates results at all possible cut-off points. It gives better insights about how well the classifier is able to separate between classes . This makes it very different from the other metrics above that depend on the cut-off threshold.
    • :wrench: Use when building a classifier for users that will have different needs (they could tweak the cut-off point) . From my experience AUC is often used in statistics (~go-to metric in bio-statistics) but less so in machine learning.
    • Random predictions: $AUC = 0.5$. Perfect predictions: $AUC=1$.
Visual Metrics
  • Confusion Matrix a $K*K$ matrix which shows the number of observation of class $c$ that have been labeled $c’$ $\forall c,c’ \in 1,…,K$
    • :mag: Be careful: People are not consistent with the axis :you can find real-predicted and predicted-real .
    • This is best understood through an example:

    Multi Confusion Matrix

  • ROC Curve Receiver Operating Characteristic
    • Plot showing the TP rate vs the FP rate, over a varying threshold.
    • This plot from wikipedia shows it well:

ROC curve

:information_source: Resources : Additional scores based on confusion matrix

Generative vs Discriminative

These two major types of models, distinguish themselves by the approach they are taking to learn. Although these distinctions are not specific to a particular task, you will most often hear about the distinction between generative and discriminative classifiers.


In classification, the task is to identify the category $y$ of an observation, given its features $x$. In mathematical notation we are looking for $y|x$. There are 2 approaches, to this problem:

  • Discriminative learn the boundaries between classes, called the decision boundaries.
    • :bulb: Simply tell me in which class is this observation given past data.
    • Can be probabilistic or non-probabilistic. If probabilistic, it tries to model $p(y|x)$ and give label $y$ for which $p(y|x)$ is maximum. If non probabilistic: simply “draws” a boundary between classes, if on one side then class A if on the other then B (easily generalizes for multiple class).
    • Directly models what we care about: $y|x$.
    • :school_satchel: As an example for detecting languages from a conversation, the discriminative model would learn to distinguish between languages from their sound but wouldn’t understand anything.
  • Generative model the distribution of each classes.
    • :bulb: First understand what this data means, then use your knowledge to classify.
    • First, model the joint distribution $p(y,x)$ (normally through $p(y,x)=p(x|y)p(y)$). Then find the conditional probability we are looking for, through Bayes theorem: $p(y|x)=\frac{p(y,x)}{p(x)}$. Finally find $y$ which maximizes $p(y|x)$ (same as discriminative).
    • Computes more information than discriminative classifiers. Thus more general.
    • :school_satchel: To continue with the previous example, the generative model would first learn how to speak the language and then say from which language the words come from.

Pros / Cons

Please note that some of advantages / disadvantages mean the same thing but are worded differently.

  • Discriminative:
    • :white_check_mark: Less bias => better if more data.
    • :white_check_mark: Less model assumptions as it's tackling an easier problem.
    • :x: Slower convergence rate . Logistic Regression requires $O(d)$ observations.
    • :x: Prone to over-fitting when there's less data, as it doesn't make assumptions to constrain it from finding inexistent patterns.
    • :x: More variance.
    • :x: Hard to update the model with new data (online learning).
    • :x: Have to retrain model when adding new classes.
    • :x: In practice needs additional regularization / kernel / penalty functions.
  • Generative
    • :white_check_mark: Faster convergence rate => better if less data . Naive Bayes only requires $O(\log(d))$ observations.
    • :white_check_mark: Less variance.
    • :white_check_mark: Can easily update the model with new data (online learning).
    • :white_check_mark: Can generate new data by looking at $p(x|y)$.
    • :white_check_mark: Can handle missing features .
    • :white_check_mark: You don't need to retrain model when adding new classes as the parameters of classes are fitted independently.
    • :white_check_mark: Easy to extend to the semi-supervised case.
    • :x: More Biais.
    • :x: Prone to under-fitting when more there's data because of the multiple assumptions.
    • :x: Uses computational power to compute something we didn't ask for.

:wrench: Rule of thumb : If your problem is only to train the best classifier on a large data set, use a discriminative model. If your task involves more constraints (online learning, semi supervised learning, small data set, …) use a generative model.

Let’s illustrate the advantages and disadvantage of both methods with an example . Imagine we are asked to make a classifier for the “true distribution” below. As a training set, we are once given a “small sample” and an other time a “large sample”.

discriminative vs generative true distribution

discriminative vs generative small sample

discriminative vs generative large sample

How well will the algorithms distinguish the classes in each case ?

  • Small Sample:
    • The discriminative model never saw any examples at the bottom of the blue ellipse. It has no chance of finding the correct decision boundary there.
    • The generative model assumes that the data follows a normal distribution (ellipse). It will therefore be able to infer the correct decision boundary without ever having seen data points there!

small sample discriminative

small sample generative


  • Large Sample:
    • The discriminative model doesn’t have any assumption that restricts it of finding the small red cluster inside the blue one.
    • The generative model still assumes that the data follows a normal distribution (ellipse). It will therefore not be able to find the small red cluster.

large sample discriminative

large sample generative

Please note that this is simply an example. Some generative models would find the small red cluster: it all depends on the assumptions they are making. (I hope that) It still gives you a good idea of the advantages and disadvantages.

Examples of Algorithms

  • Naives Bayes
  • Gaussian Discriminant Analysis
  • Latent Dirichlet Allocation
  • Restricted Boltzmann Machines
  • Gaussian Mixture Models
  • Hidden Markov Models
  • Sigmoid Belief Networks
  • Bayesian networks
  • Markov random fields

:information_source: Resources : A. Ng and M. Jordan have a must read paper on the subject, T. Mitchell summarizes very well these concepts in his slides, and section 8.6 of K. Murphy’s book has a great overview of pros and cons, which strongly influenced the devoted section above.

Information Theory


Long Story Short
  • :bulb: Intuition :
    • The entropy of a random variable is intuitively the expected amount of surprise you would have by observing it . We often say it is a measure of “information” in the sense that if something is not surprising to you, then you didn’t learn much by seeing. So it didn’t convey much new information.
    • Entropy is the expected number of bits (for $log_2$) used to encode an observation from a (discrete) random variable under the optimal coding scheme .
  • Don’t confuse the “information” in information theory with the everyday word which refers to “meaningful information”. A book with random letters will have more information because each new letter would be a surprise to you. But it will definitely not have more meaning than a book with English words .

  • :mag: Side notes :
    • $H(X) \geq 0$
    • Entropy is maximized when all events occur with uniform probability. If $X$ can take $n$ values then : $max(H) = H(p_{uniform})= \sum_i^n \frac{1}{n} \log(\frac{1}{ 1/n} ) = \log(n)$

Long Story Long

The simple concept of entropy is central in both thermodynamics and information theory, and I find that quite amazing. It originally comes from statistical thermodynamics and is so central there, that it is carved on Ludwig Boltzmann’s grave (one of the father of this field). You will often hear:

  • Thermodynamics: Entropy is a measure of disorder
  • Information Theory: Entropy is a measure of information

These 2 way of thinking may seem different but in reality they are exactly the same. They essentially answer: how hard is it to describe this thing?

I will focus here on the information theory point of view, because its interpretation is more intuitive for machine learning. Also I don’t want to spend to much time thinking about thermodynamics, as people that do often commit suicide :flushed:.

In information theory there are 2 intuitive way of thinking of entropy. These are best explained through an example :

:school_satchel: Imagine that my friend Claude offers me to go see a NBA game (Cavaliers vs Spurs) with him tonight. Unfortunately I can’t come but ask him to record who scored each field goals. Claude is very geeky and uses a binary phone which can only write 0 and 1. As he doesn’t have much memory left, he wants to use the smallest possible number of bits.

  1. From previous games, Claude knows that Lebron James will very likely score more than the old (but awesome :basketball: ) Manu Ginobili. Will he use the same number of bits to indicate that Lebron scored, than he will for Ginobili ? Of course not, he will allocate less bits for Lebron as he will be writing it down more often. He’s essentially exploiting his knowledge about the distribution of field goals to reduce the expected number of bits to write down. It turns out that if he knew the probability $p_i$ of each player $i$ to score he should encode their name with $nBit(p_i)=log_2(1/p_i)$ bits. This has been intuitively constructed by Claude (Shannon) himself as it is the only measure (up to a constant) that satisfies axioms of information measure. The intuition behind this is the following:
    • Multiplying probabilities of 2 players scoring should result in adding their bits. Indeed imagine Lebron and Ginobili have respectively 0.25 and 0.0625 probability of scoring the next field goal. Then, the probability that Lebron scores the 2 next field goals would be the same than Ginobili scoring a single one ($lebron*lebron = 0.25 * 0.25 = 0.0625 = Ginobili$). We should thus allocate 2 times less bits for Lebron, so that on average we always add the same number of bits per observation. $nBit(p_{Lebron}) = \frac{1}{2} * nBit(p_{Ginobili}) = \frac{1}{2} * nBit(p^2_{Lebron})$. From this we quickly realize that we need to use logarithms and that the simplest H will be of the form: $H(p_i) = \alpha * \log(p_i) + \beta $
    • Players that have higher probability of scoring should be encoded by a lower number of bits . I.e H should decrease when $p_i$ increases: $H(p_i) = - \alpha * \log(p_i) + \beta, \alpha > 0 $
    • If Lebron had $100%$ probability of scoring, why would I have bothered asking Claude to write anything down ? I would have known everything a priori . I.e H should be $0$ for $p_i = 1$ : $H(p_i) = - \alpha * \log(p_i), \alpha > 0 $
  2. Now Claude sends me the message containing information about who scored. Seeing that Lebron scored will surprise me less than Ginobili. I.e Claude’s message gives me more information when telling me that Ginobili scored. If I wanted to quantify my surprise for each field goal, I should make a measure that satisfies the following conditions:
    • The lower the probability of a player to score, the more surprised I will be . The measure of surprise should thus be a decreasing function of probability: $surprise(p_i) = -f(p_i) * \alpha, \alpha > 0$.
    • Supposing that players scoring are independent of one another, it’s reasonable to ask that my surprise if Lebron and Ginobili scored in a row should be the same than the sum of my suprise if Lebron scored and my surprise if Ginobili scored. Multiplying independent probabilities should sum the surprise : $surprise(p_i * p_j) = surprise(p_i) + surprise(p_j)$.
    • Finally, the measure should be continuous given probabilities . $surprise(Lebron) = -\log(p_{Lebron}) * \alpha, \alpha > 0$

Taking $\alpha = 1 $ for simplicity, we get $surprise(p_i) = -log(p_i) = nBit(p_i)$. We thus derived a formula for computing the surprise associated with event $i$ and the optimal number of bits that should be used to encode that event. In order to get the average surprise / number of bits associated with a random variable $X$ we simply have to take the expectation over all possible events (i.e average weighted by probability of event). This gives us the entropy formula $H(p) = \sum_i p_i \ \log(\frac{1}{p_i}) = - \sum_i p_i\ log(p_i)$

From the example above we see that entropy corresponds to :

  • to the expected number of bits to optimally encode a message
  • the average amount of information gained by observing a random variable

:mag: Side notes :

  • From our derivation we see that the function is defined up to a constant term $\alpha$. This is the reason why the formula works equally well for any logarithmic base, indeed changing the base is the same as multiplying by a constant. In the context of information theory we use $log_2$.
  • Entropy is the reason (second law of thermodynamics) why putting an ice cube in your Moscow Mule (yes that is my go-to drink) doesn’t normally make your ice cube colder and your cocktail warmer. I say “normally” because it is possible but very improbable : ponder about this next time your sipping your own go-to drink :smirk: !

:information_source: Resources : Excellent explanation of the link between entropy in thermodynamics and information theory, friendly introduction to entropy related concepts

Differential Entropy

Differential entropy (= continuous entropy), is the generalization of entropy for continuous random variables.

Given a continuous random variable $X$ with a probability density function $f(x)$:

If you had to make a guess, which distribution maximizes entropy for a given variance ? You guessed it : it’s the Gaussian distribution.

:mag: Side notes : Differential entropy can be negative.

Cross Entropy

We saw that entropy is the expected number of bits used to encode an observation of $X$ under the optimal coding scheme. In contrast cross entropy is the expected number of bits to encode an observation of $X$ under the wrong coding scheme. Let’s call $q$ the wrong probability distribution that is used to make a coding scheme. Then we will use $-log(q_i)$ bits to encode the $i^{th}$ possible values of $X$. Although we are using $q$ as a wrong probability distribution, the observations will still be distributed based on $p$. We thus have to take the expected value over $p$ :

From this interpretation it naturally follows that:

  • $H(p,q) > H(p), \forall q \neq p$
  • $H(p,p) = H(p)$

:mag: Side notes : Log loss is often called cross entropy loss, indeed it is the cross-entropy between the distribution of the true labels and the predictions.

Kullback-Leibler Divergence

The Kullback-Leibler Divergence (= relative entropy = information gain) from $q$ to $p$ is simply the difference between the cross entropy and the entropy:

  • :bulb: Intuition
    • KL divergence corresponds to the number of additional bits you will have to use when using an encoding scheme based on the wrong probability distribution $q$ compared to the real $p$ .
    • KL divergence says in average how much more surprised you will be by rolling a loaded dice but thinking it’s fair, compared to the surprise of knowing that it’s loaded.
    • KL divergence is often called the information gain achieved by using $p$ instead of $q$
    • KL divergence can be thought as the “distance” between 2 probability distribution. Mathematically it’s not a distance as it’s none symmetrical. It is thus more correct to say that it is a measure of how a probability distribution $q$ diverges from an other one $p$.

KL divergence is often used with probability distribution of continuous random variables. In this case the expectation involves integrals:

In order to understand why KL divergence is not symmetrical, it is useful to think of a simple example of a dice and a coin (let’s indicate head and tails by 0 and 1). Both are fair and thus their PDF is uniform. Their entropy is trivially: $H(p_{coin})=log(2)$ and $H(p_{dice}=log(6))$. Let’s first consider $D_{KL}(p_{coin}|p_{dice})$. The 2 possible events of $X_{dice}$ are 0,1 which are also possible in the dice. The average number of bits to encode a coin observation under the dice encoding, will thus simply be $log(6)$, and the KL divergence is of $log(6)-log(2)$ additional bits. Now let’s consider the problem the other way around: $D_{KL}(p_{dice}|p_{coin})$. We will use $log(2)=1$ bit to encode the events of 0 and 1. But how many bits will we use to encode $3,4,5,6$ ? Well the optimal encoding for the dice doesn’t have any encoding for these as they will never happen in his world. The KL divergence is thus not defined (division by 0). With this example you should clearly understand that the additional bits required to encode $p$ with $q$ is not the same as encoding $q$ with $p$. The KL divergence is thus not symmetric and cannot be a distance.

:mag: Side notes : Minimizing cross entropy with respect to $1$ is the same as minimizing $D_{KL}(p|q)$. Indeed the 2 equations are equivalent up to an additive constant (the entropy of $p$) which doesn’t depend on $q$.

Machine Learning and Entropy

This is all interesting, but why are we talking about information theory concepts in machine learning :sweat_smile: ? Well it turns our that many ML algorithms can be interpreted with entropy related concepts.

The 2 major ways we see entropy in machine learning are through:

  • Maximizing information gain (i.e entropy) at each step of our algorithm. Example:

    • When building decision trees you greedily select to split on the attribute which maximizes information gain (i.e the difference of entropy before and after the split). Intuitively you want to chose to know the value of the attribute, which would decrease the randomness in your data by the largest amount.
  • Minimizing KL divergence between the actual unknown probability distribution of observations $p$ and the predicted one $q$. Example:

    • The Maximum Likelihood Estimator (MLE) of our parameters $\hat{ \theta }_{MLE}$ are also the parameter which minimizes the KL divergence between our predicted distribution and the actual unknown one . I.e
  • Minimizing KL divergence between the computationally intractable $p$ and a simpler approximation $q$. Indeed machine learning is not only about theory but also about how to make something work in practice.Example:

    • This is the whole point of Variational Inference (= variational Bayes) which approximates posterior probabilities of unobserved variables that are often intractable due to the integral in the denominator. Thus turning the inference problem to an optimization one. These methods are an alternative to Monte Carlo sampling methods for inference (ex: Gibbs Sampling). In general sampling methods are slower but asymptotically exact.

No Free Lunch Theorem

There is no one model that works best for every problem.

Let’s try predicting the next fruit in the sequence:

:tangerine: :apple: :tangerine: :apple: :tangerine: ...

You would probably say :apple: right ? Maybe with a lower probability you would say :tangerine: . But have you thought of saying :watermelon: ? I doubt it. I never told you that the sequence was constrained in the type of fruit, but naturally we make assumptions that the data “behaves well”. The point here is that without any knowledge/assumptions on the data, all future data are equally probable.

The theorem builds on this, and states that every algorithm has the same performance when averaged over all data distributions. So for example a deep learning classifier have in average the same performance as a random one.

:mag: Side Notes :

  • You will often hear the name of this theorem when someone asks a question starting with “what is the best […] ?”.
  • In the real world, things tend to “behave well”. They are for example often (locally) continuous. In such settings some algorithms are definitely better than others.
  • Since the theorem publication in 1996, other methods have kept the lunch metaphor. For example: kitchen sink algorithms,random kitchen sink, fastfood, à la carte, and that’s one of the reason why I decided to stick with fruit examples in this blog :wink:.
  • The theorem has been extend to optimization and search algorithms.

:information_source: Resources : D. Wolpert’s proof.

Parametric vs Non Parametric

These 2 types of methods distinguish themselves based on their answer to the following question: “Will I use the same amount of memory to store the model trained on $100$ examples than to store a model trained on $10 000$ of them ? “ If yes then you are using a parametric model. If not, you are using a non-parametric model.

  • Parametric:
    • :bulb: The memory used to store a model trained on $100$ observations is the same as for a model trained on $10 000$ of them .
    • I.e: The number of parameters is fixed.
    • :white_check_mark: Computationally less expensive to store and predict.
    • :white_check_mark: Less variance.
    • :x: More bias.
    • :x: Makes more assumption on the data to fit less parameters.
    • :school_satchel: Example : K-Means clustering, Linear Regression:

    Linear Regression

  • Non Parametric:
    • :bulb: I will use less memory to store a model trained on $100$ observation than for a model trained on $10 000$ of them .
    • I.e: The number of parameters is grows with the training set.
    • :white_check_mark: More flexible / general.
    • :white_check_mark: Makes less assumptions.
    • :white_check_mark: Less bias.
    • :x: More variance.
    • :x: Bad if test set is relatively different than train set.
    • :x: Computationally more expensive as it has to store and compute over a higher number of “parameters” (unbounded).
    • :school_satchel: Example : K-Nearest Neighbors clustering, RBF Regression:

    RBF Regression

:wrench: Practical : Start with a parametric model. It’s often worth trying a non-parametric model if: you are doing clustering, or the training data is not too big but the problem is very hard.

:mag: Side Note : Strictly speaking any non-parametric model could be seen as a infinite-parametric model. So if you want to be picky: next time you hear a colleague talking about non-parametric models, tell him it’s in fact parametric. I decline any liability for the consequence on your relationship with him/her :sweat_smile: .

Supervised Learning

Supervised learning tasks tackle problems that have labeled data.

:bulb: Intuition : It can be thought of a teacher who corrects a multiple choice exam. At the end you will get the average result as well as a detailed result saying which answer was wrong what was the correct answer.

Supervised learning can be further separated into two broad type of problems:

  • Classification: here the output variable $y$ is categorical. We are basically trying to assign one or multiple classes to an observation. Example: is it a cat or not ?
  • Regression: here the output variable $y$ is continuous. Example : how tall is this person ?


The classification problem consists of assigning a set of classes/categories to an observation. I.e

Classification problems can be further separated into:

  • Binary: There are 2 possible classes.
  • Multi-Class: There are more than 2 possible classes.
  • Multi-Label: If labels are not mutually exclusive. Often replaced by binary classification specifying whether an observation should be assigned to each class.

Common evaluation metrics include Accuracy, F1-Score, AUC… I have a section devoted for these classification metrics.

:wavy_dash: Compare to : Regression

Decision Trees

  • :bulb: Intuition :
    • Split the training data based on “the best” question you can ask (ex: is he older than 27 ?). Recursively do the above while not happy with the classification results.
    • Decision trees are basically the algorithm to use for the “20 question” game. Akinator is a great algorithm that could have been implemented with decision trees. Akinator is probably based on fuzzy logic expert systems (as it can work with wrong answers) but you could do a simpler version with decision trees.
    • “Optimal” splits are found are found by maximization of information gain or similar methods.
  • :wrench: Practical :
    • “Use when you need a simple and interpretable model but the relationship between $y$ and $x$ is complex”.
    • Training Complexity : $O(mnd + nd\log(n) )$ .
    • Testing Complexity : $O(mt)$ .
    • Notation Used : $m=depth$ ; ; ; .
  • :white_check_mark: Advantage :
    • Interpretable .
    • Few hyper-parameters.
    • Needs less data cleaning :
      • No normalization needed.
      • Can handle missing values.
      • Handles numerical and categorical variables.
    • Robust to outliers.
    • Doesn’t make assumptions regarding the data distribution.
    • Performs feature selection.
    • Scales well.
  • :x: Disadvantage :
    • Generally poor accuracy because greedy selection.
    • High variance because if the top split changes, everything does.
    • Splits are parallel to features axes => need multiple splits to separate 2 classes with a 45° decision boundary.
    • No online learning.

The basic idea behind building a decision tree is to :

  1. Find an optimal split (feature + threshold). I.e the split which minimizes the impurity (maximizes information gain).
  2. Partition the dataset into 2 subsets based on the split above.
  3. Recursively apply $1$ and $2$ this to each new subset until a stop criterion is met.
  4. To avoid over-fitting: prune the nodes which “aren’t very useful”.

Here is a little gif showing these steps:

Building Decision Trees CLassification

Note: For more information, please see the “details” and “Pseudocode and Complexity” drop-down below.


The idea behind decision trees is to partition the input space into multiple region. Ex: region of men who are more than 27 years old. Then predict the most probable class for each region, by assigning the mode of the training data in this region. Unfortunately, finding an optimal partitioning is usually computationally infeasible (NP-complete) due to the combinatorially large number of possible trees. In practice the different algorithms thus use a greedy approach. I.e each split of the decision tree tries to maximize a certain criterion regardless of the next splits.

How should we define an optimality criterion for a split? Let’s define an impurity (error) of the current state, which we’ll try to minimize. Here are 3 possibilities of state impurities:

  • Classification Error:
    • :bulb: The accuracy error : $1-Acc$ of the current state. I.e the error we would do by stopping at the current state.
  • Entropy:
    • :bulb: How unpredictable are the classes of the current state.
    • Minimize the entropy corresponds to maximizing the information gain.
  • Gini Impurity:
    • :bulb: How unpredictable are the classes of the current state.
    • Minimize the entropy corresponds to maximizing the information gain.

Here is a quick graph showing the impurity depending on a class distribution in a binary setting:

Impurity Measure

:mag: Side Notes :

  • Classification error may seem like a natural choice, but don’t get fooled by the appearances: it’s generally worst than the 2 other methods:
    • It is “more” greedy than the others. Indeed, it only focuses on the current error, while Gini and Entropy try to make a purer split which will make subsequent steps easier. Suppose we have a binary classification with 100 observation in each class $(100,100)$. Let’s compare a split which divides the data into $(20,80)$ and $(80,20)$, to an other split which would divide it into $(40,100)$ and $(60,0)$. In both case the accuracy error would be of $0.20\%$. But we would prefer the second case, which is pure and will not have to be split further. Gini impurity and the Entropy would correctly chose the latter.
    • Classification error doesn’t look at all classes, only the most probable one. So having a split with 2 extremely probable classes would be the same than having a split with one extremely probable class and many improbable ones.
  • Gini Impurity and Entropy give very similar results as you can see in the plot above. Chose one and stick with it.

But when should we stop ? If possible, is important to stop soon to decrease over-fitting. There are a variety of different heuristics to determine when to stop. Here are a few:

  • When the number of training examples is small enough.
  • When the depth reaches a threshold.
  • If the impurity is low enough.
  • If the purity gain due to the split is too small.

One problem with these, are the multiple problem-dependent thresholds (hyperparameters) to chose. Furthermore, some heuristics might yield relatively bad results. For example decision trees might have to split the data without any purity gain, to reach high purity gain at the following step. For these reasons, it is common to grow large trees using only the number of training example in a leaf node as a stopping criterion. To avoid over-fitting, the algorithm would prune back the resulting tree. In CART, the pruning criterion $C_{pruning}(T)$ tries to balance impurity and model complexity by regularization. The regularized variable is often the number of leaf nodes $|T|$, as below:

Where $\lambda$ determines the trade-off between impurity and model complexity, for a given tree $T$, with leaf nodes $v=1…|T|$ using Impurity measure $I$. Then you would simply chose $\lambda$ using cross validation.

Variants: there are multiple different decision tree methods. They mostly differ with regards to the following points:

  • Splitting Criterion ? Gini / Entropy.
  • Technique to Reduce Over-fitting ?
  • How many variables can be used in a split ?
  • Do they build Binary Trees ?
  • How they handle Missing Values ?
  • Do they handle Regression?
  • Are they susceptible to outliers?

The most famous variants are:

  • ID3: first decision tree implementation. Not used in practice.
  • C4.5: Improvement over ID3 by the same developer. Error based pruning. Uses entropy. Handles missing values. Susceptible to outliers. Can create empty branches.
  • CART: Uses Gini. Cost complexity pruning. Binary trees. Handles missing values. Handles regression. Not susceptible to outliers.
  • CHAID: FInds a splitting variable using Chi-squared to test the dependency between a variable and a response. No pruning. Seems better for describing the data, but worst for predicting.

Other variants include : C5.0 (next version of C4.5, probably less used because patented), MARS.

:information_source: Resources : A comparative study of different decision tree methods.

Pseudocode and Complexity
  • Pseudocode The simple version of a decision tree can be written in a few lines of python pseudocode:
def buildTree(X,Y):
    if stop_criteria(X,Y) :
        # if stop then store the majority class
        tree.class = mode(X) 
        return Null

    minImpurity = infinity
    bestSplit = None
    for j in features:
        for T in thresholds:
            if impurity(X,Y,j,T) < minImpurity:
                bestSplit = (j,T)
                minImpurity = impurity(X,Y,j,T) 

    X_left,Y_Left,X_right,Y_right = split(X,Y,bestSplit)

    tree.split = bestSplit # adds current split
    tree.left = buildTree(X_left,Y_Left) # adds subsequent left splits
    tree.right buildTree(X_right,Y_right) # adds subsequent right splits

return tree

def singlePredictTree(tree,xi):
    if tree.class is not Null:
        return tree.class

    j,T = tree.split
    if xi[j] >= T:
        return singlePredictTree(tree.right,xi)
        return singlePredictTree(tree.left,xi)

def allPredictTree(tree,Xt):
    t,d = Xt.shape
    Yt = vector(d)
    for i in t:
        Yt[i] = singlePredictTree(tree,Xt[i,:])

    return Yt
  • Complexity I will be using the following notation: ; ; ; ; .

Let’s first think about the complexity for building the first decision stump (first function call):

  • In a decision stump, we loop over all features and thresholds $O(td)$, then compute the impurity. The impurity depends solely on class probabilities. Computing probabilities means looping over all $X$ and count the $Y$ : $O(n)$. With this simple pseudocode, the time complexity for building a stump is thus $O(tdn)$.
  • In reality, we don’t have to look for arbitrary thresholds, only for the unique values taken by at least an example. Ex: no need of testing $feature_j>0.11$ and $feature_j>0.12$ when all $feature_j$ are either $0.10$ or $0.80$. Let’s replace the number of possible thresholds $t$ by training set size $n$. $O(n^2d)$
  • Currently we are looping twice over all $X$, once for the threshold and once to compute the impurity. Instead, if the data was sorted by the current feature, the impurity could simply be updated as we loop through possible thresholds. Ex: when considering the rule $feature_j>0.8$ after having already considered $feature_j>0.7$, we do not have to recompute all the class probabilities: we can simply take the probabilities from $feature_j>0.7$ and make the adjustments knowing the number of example with $feature_j==0.7$. For each feature $j$ we should first sort all data $O(n\log(n))$ then loop once in $O(n)$, the final would be in $O(dn\log(n))$.

We now have the complexity of a decision stump. You could think that finding the complexity of building a tree would be multiplying it by by the number of function calls: Right ? Well not really, that would be an over-estimate. Indeed, at each function call, the training data size $n$ would have decreased. The intuition for the result we are looking for, is that at each level $l=1…m$ the sum of the training data in each function is still $n$. Multiple function working in parallel take the same time as a single function would, with the whole training set $n$. The complexity at each level is thus still $O(dn\log(n))$ so the complexity for building a tree of depth $m$ is $O(mdn\log(n))$. If you you want a proof that the work at each level stays constant:

At each iterations the dataset is split into $\nu$ subsets of $k_i$ element and a set of $n-\sum_{i=1}^{\nu} k_i$. At every level, the total cost would therefore be (using the well known log propriety and the fact that $k_i \le n$ ) :

The last possible adjustment I see, is to sort everything once, store it and simply use this precomputed data at each level. The final training complexity is therefore $O(mdn + nd\log(n))$ .

The time complexity of making predictions is straightforward: for each $t$ examples, go through a question at each $m$ levels. I.e $O(mt)$ .


Decision Trees

Decision trees are more often used for classification problems. I thus talk at length about them here.

The 2 differences with decision trees for classification are:

  • What error to minimize for an optimal split? This replaces the impurity measure in the classification setting. An easy error function for regression which has the nice interpretation of being linked to variance, is the sum of squared error. Note that we don’t use the mean squared error because when computing the error/variance reduction wafter a split we want to subtract the sum of errors after the split from the error before the split. Sum of squared error for region $R$:
  • What to predict for a given space region? Before we were predicting the mode of the subset of training data in this space. Taking the mode doesn’t make sense for a continuous variable. Now that we’ve defined an error function above, we would like to predict a value which minimizes this sum of squares error function. The values which minimizes this error function is simply the average of the values of the points in this region. Thankfully, predicting the mean is intuitively what we would have done.

The rest is the same as in the classification setting. Let’s look at a simple plot to get a better idea of the algorithm:

Building Decision Trees Regression

:x: Besides the disadvantages seen in the decision trees for classification, decision trees for regression suffer from the fact that it predicts a non smooth function .

Unsupervised Learning

Reinforcement Learning

Partially Supervised Learning

The previous sections were separated by learning style. I will now separate by algorithm similarity.

Bayesian Methods

Deep Learning

Graphical Models

The previous sections were separated by learning style and algorithm similarity. I will now focus on the application.

Natural Language Processing

Time Series

The following sections are not strictly speaking part of Machine Learning. They are nevertheless very related and important.


Computational Neuroscience

Unfortunately here ends today’s journey together. But don’t be too disappointed, I’m only getting started. I still have a tuns of notes eagerly waiting to get upgraded, and I don’t intend to stop learning yet :sweat_smile:.

PS: Any reaction/suggestions would be very appreciated: just drop a comment below or click on the :heart: .

See you soon !