Welcome,
This is my first post ever; I would love to get your feedback.
I will try to summarize important terms and concepts of machine learning. The focus is on intuition but there will also be practical and theoretical notes.
Target audience: this is for you if you:
Having a bad memory but being (at least considering myself to be) a philomath who loves machine learning, I developed the habit of taking notes, then summarizing them, and finally making a cheat sheet for every new ML domain I encounter. There are multiple reasons I want to switch to a web page:
You do not really understand something unless you can explain it to your grandmother. - Albert Einstein

My grandmas are awesome but not really into ML (yet). You have thus been designated "volunteer" to temporarily replace them.
To make it easier to find the relevant information in the Glossary, here is the color coding I will be using:
Disclaimer:
Ctrl+f
Enough talking: prepare your popcorn and let's get going!
The curse of dimensionality refers to various practical issues that arise when working with high-dimensional data. These are often computational problems or counterintuitive phenomena, coming from our Euclidean view of the 3-dimensional world (let's keep time out of the equations).
They are all closely related, but I like to think of 3 major issues with high-dimensional inputs $x \in \mathbb{R}^d, \ d \ggg 1$:
You need exponentially more data to fill a high-dimensional space. I.e., if the dataset size is constant, increasing the dimensionality makes your data sparser.
Intuition : the volume grows exponentially with the number of dimensions. Think of filling a $d$-dimensional unit hypercube with points at a $0.1$ interval. In 1 dimension we need $10$ of these points. In 2 dimensions we already need $100$ of them. In $d$ dimensions we need $10^d$ observations!
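The exponential growth above is easy to check numerically. A quick sketch (the $0.1$ spacing gives 10 points per axis; `grid_points` is a name of my own):

```python
# Number of grid points needed to cover the unit hypercube
# at a fixed spacing of 0.1 (i.e., 10 points per axis).
def grid_points(d, points_per_axis=10):
    return points_per_axis ** d

for d in [1, 2, 3, 10]:
    print(d, grid_points(d))
```

Already at $d=10$ we need ten billion points for the same coverage a 1D problem gets with ten.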
Let's look at a simple example:
Imagine we trained a certain classifier for distinguishing between circles and a second class. Now we want to predict the class of an unknown observation. Let's assume that:
Given only 1 feature (1D), we would simply need to look at $30\%$ of the values along that dimension. In 2D we would need to look at $\sqrt{0.3} \approx 54.8\%$ of each dimension's values. In 3D we would need $\sqrt[3]{0.3} \approx 66.9\%$ of each dimension. Visually:
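The percentages above follow directly from taking the $d$-th root of the target volume; a small sketch (`edge_fraction` is my own helper name):

```python
# Fraction of each axis we must cover so that the enclosed
# region contains a given fraction of the unit volume.
def edge_fraction(volume, d):
    return volume ** (1.0 / d)

for d in [1, 2, 3, 100]:
    print(d, round(edge_fraction(0.3, d), 3))
```

In 100 dimensions we would already need about 99% of every single axis to cover 30% of the space.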
We thus see that, in order to keep a constant support (i.e., amount of knowledge of the space), we need to look at more data when adding dimensions. In other words, if we add dimensions without adding data, there will be large unknown regions. This is called sparsity.
I have kept the same number of observations in the plots, so that you can appreciate how "holes" appear in our training data as the dimension grows.
Disadvantage : data sparsity causes machine learning algorithms to fail to find patterns, or to overfit.
Basically, the volume of a high-dimensional orange is mostly in its skin and not in the pulp! Which means expensive high-dimensional juices.
Intuition : the volume of a sphere depends on $r^d$. So as $d$ increases, the importance of $r$ increases. The skin has a slightly greater $r$ than the pulp; in high dimensions this slight difference becomes very important.
If you're not convinced, stick with my simple proof. Let's consider a $d$-dimensional unit orange (i.e., $r=1$), with a skin of width $\epsilon$. Let's compute the ratio of the volume in the skin to the total volume of the orange. This could be done by integration, but we can skip these steps by simply noting that the volume of a hypersphere is proportional to $r^d$, i.e., $V_{d}(r) = k r^{d}$. The ratio is thus:

$$ratio_{skin/orange}(d) = \frac{V_d(1) - V_d(1-\epsilon)}{V_d(1)} = 1 - (1-\epsilon)^d$$
Taking $\epsilon = 0.05$ as an example, here is the $ratio_{skin/orange}(d)$ we would get:
Side Notes : the same goes for hypercubes, i.e., most of the mass is concentrated at their corners. That's why you will sometimes hear that hypercubes are "spiky". Think of the $[-1,1]^d$ hypercube: the distance from the center of each face to the origin is trivially $1, \ \forall d$, while the distance to each corner is $\sqrt{d}$ (Pythagorean theorem). So the corners move further away while the centers of the faces stay put, which makes us think of spikes. This is why you will sometimes see such pictures:
There's nothing that makes Euclidean distance intrinsically meaningless in high dimensions. It is rather that, with our finite number of data points, any 2 points in high dimensions seem to be equally "similar". This is simply due to probability and sparsity.
Intuition :
In such discussions people often cite a theorem which states that, if the dimensions are independent, the minimum and maximum distance between a point $p$ and $n$ other points $q^l$ become indiscernible when normalized. I.e., all $n$ points $q^l$ converge to the same distance from $p$ in high dimensions:

$$\lim_{d \to \infty} E\left[ \frac{dist_{max}(d) - dist_{min}(d)}{dist_{min}(d)} \right] = 0$$
The key point here is that we fix the number of points $n$ (sparsity issue) and that we are adding independent dimensions (the chance of having different features grows). This is exactly what I tried to show intuitively.
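You can watch this concentration happen with a small simulation, using only the standard library (`relative_contrast` is a hypothetical helper of my own):

```python
import math
import random

def relative_contrast(d, n=1000, seed=0):
    """(max - min) / min of the distances from the origin to
    n uniform random points in [0, 1]^d."""
    rng = random.Random(seed)
    dists = []
    for _ in range(n):
        q = [rng.random() for _ in range(d)]
        dists.append(math.sqrt(sum(x * x for x in q)))
    return (max(dists) - min(dists)) / min(dists)

# The contrast between nearest and farthest neighbor shrinks
# as the dimension grows:
for d in [2, 10, 1000]:
    print(d, round(relative_contrast(d), 3))
```

In low dimensions the nearest point is dramatically closer than the farthest; in 1000 dimensions all distances are within a few percent of each other.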
Practical : due to this curse, using dimensionality reduction often gives better results for subsequent steps. It makes the algorithm converge faster and reduces overfitting. But be careful not to underfit by using too few features.
Side Notes :
Resources : Great post about the curse of dimensionality in classification which inspired me, On the Surprising Behavior of Distance Metrics in High Dimensional Space is a famous paper which proposes the use of fractional distance metrics, nice blog of simulations.
Images modified from: oranges, 7D cube
Side Notes : I will mostly focus on binary classification, but most scores can be generalized to the multi-class setting. Often this is achieved by only considering "correct class" and "incorrect class" in order to make the problem binary, then averaging the score over classes (weighted by the proportion of observations in each class).
Specificity : like recall but for negatives. $Spec = \frac{TN}{\text{Actual Negatives}} = \frac{TN}{TN+FP}$
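The confusion-matrix scores are all one-liners; here is a sketch with made-up counts (the function name and the numbers are my own, the formulas are the standard ones):

```python
# Standard scores derived from confusion-matrix counts.
def scores(TP, FP, TN, FN):
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)            # sensitivity
    specificity = TN / (TN + FP)       # recall for negatives
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, specificity, f1

# Hypothetical counts for a rare positive class:
print(scores(TP=8, FP=2, TN=85, FN=5))
```

Note how specificity stays high here even though recall is mediocre: with imbalanced classes, no single score tells the whole story.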
Resources : Additional scores based on confusion matrix
These two major types of models distinguish themselves by the approach they take to learn. Although the distinction is not specific to a particular task, you will most often hear about generative versus discriminative classifiers.
In classification, the task is to identify the category $y$ of an observation, given its features $x$. In mathematical notation, we are looking for $y|x$. There are 2 approaches to this problem:
Please note that some of the advantages / disadvantages mean the same thing but are worded differently.
Rule of thumb : if your problem is only to train the best classifier on a large dataset, use a discriminative model. If your task involves more constraints (online learning, semi-supervised learning, small dataset, …), use a generative model.
Let's illustrate the advantages and disadvantages of both methods with an example. Imagine we are asked to make a classifier for the "true distribution" below. As a training set, we are given a "small sample" once and a "large sample" another time.
How well will the algorithms distinguish the classes in each case?
Please note that this is simply an example: some generative models would find the small red cluster; it all depends on the assumptions they make. (I hope that) it still gives you a good idea of the advantages and disadvantages.
Resources : A. Ng and M. Jordan have a must read paper on the subject, T. Mitchell summarizes very well these concepts in his slides, and section 8.6 of K. Murphy’s book has a great overview of pros and cons, which strongly influenced the devoted section above.
Don't confuse the "information" in information theory with the everyday word, which refers to "meaningful information". A book with random letters has more information because each new letter is a surprise to you. But it definitely does not have more meaning than a book with English words.
The simple concept of entropy is central to both thermodynamics and information theory, and I find that quite amazing. It originally comes from statistical thermodynamics, where it is so central that it is carved on Ludwig Boltzmann's grave (one of the fathers of the field). You will often hear:
These 2 ways of thinking may seem different, but in reality they are exactly the same. They essentially answer: how hard is it to describe this thing?
I will focus here on the information theory point of view, because its interpretation is more intuitive for machine learning. Also, I don't want to spend too much time thinking about thermodynamics, as people who do often commit suicide.
In information theory there are 2 intuitive ways of thinking about entropy. These are best explained through an example:
Imagine that my friend Claude offers to take me to see an NBA game (Cavaliers vs Spurs) tonight. Unfortunately I can't come, but I ask him to record who scored each field goal. Claude is very geeky and uses a binary phone which can only write 0s and 1s. As he doesn't have much memory left, he wants to use the smallest possible number of bits.
Taking $\alpha = 1$ for simplicity, we get $surprise(p_i) = -\log(p_i) = nBit(p_i)$. We have thus derived a formula for computing the surprise associated with event $i$ and the optimal number of bits that should be used to encode that event. In order to get the average surprise / number of bits associated with a random variable $X$, we simply take the expectation over all possible events (i.e., an average weighted by the probability of each event). This gives us the entropy formula $H(p) = \sum_i p_i \log(\frac{1}{p_i}) = - \sum_i p_i \log(p_i)$
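The entropy formula translates directly into code; a minimal sketch, computing $H$ in bits (base-2 log), with the convention that $0 \log 0 = 0$:

```python
import math

def entropy(p, base=2):
    """H(p) = -sum_i p_i log(p_i), in bits by default.
    Zero-probability events contribute nothing."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

# A fair coin needs 1 bit per observation on average;
# a biased coin is less surprising, so it needs fewer bits:
print(entropy([0.5, 0.5]))
print(entropy([0.9, 0.1]))
```

Applied to Claude's problem: the more lopsided the scoring probabilities, the fewer bits per field goal he needs on average.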
From the example above, we see that entropy corresponds to:
Side notes :
Resources : Excellent explanation of the link between entropy in thermodynamics and information theory, friendly introduction to entropy related concepts
Differential entropy (= continuous entropy) is the generalization of entropy to continuous random variables.
Given a continuous random variable $X$ with a probability density function $f(x)$, it is defined as:

$$h(X) = - \int f(x) \log f(x) \, dx$$
If you had to guess, which distribution maximizes entropy for a given variance? You guessed it: the Gaussian distribution.
Side notes : Differential entropy can be negative.
We saw that entropy is the expected number of bits used to encode an observation of $X$ under the optimal coding scheme. In contrast, cross entropy is the expected number of bits needed to encode an observation of $X$ under the wrong coding scheme. Let's call $q$ the wrong probability distribution used to build the coding scheme. Then we will use $-\log(q_i)$ bits to encode the $i^{th}$ possible value of $X$. Although we are using $q$ as the coding distribution, the observations are still distributed according to $p$. We thus have to take the expected value over $p$:

$$H(p,q) = - \sum_i p_i \log(q_i)$$
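In code, cross entropy is a one-line variation on entropy; a sketch with made-up distributions (`cross_entropy` and the example values are my own):

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i log2(q_i): average number of bits
    when events follow p but the code was built for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]   # true distribution
q = [0.9, 0.1]   # wrong distribution used for the code
# Coding p with its own code is never more expensive than
# coding it with q's code (Gibbs' inequality):
print(cross_entropy(p, p), cross_entropy(p, q))
```

Note that $H(p, p)$ is simply the entropy $H(p)$, which is why cross entropy is always at least as large as entropy.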
From this interpretation it naturally follows that:
Side notes : log loss is often called cross entropy loss; indeed, it is the cross entropy between the distribution of the true labels and the predictions.
The Kullback-Leibler divergence (= relative entropy = information gain) from $q$ to $p$ is simply the difference between the cross entropy and the entropy:

$$D_{KL}(p|q) = H(p,q) - H(p) = \sum_i p_i \log\left(\frac{p_i}{q_i}\right)$$
KL divergence is often used with probability distributions of continuous random variables, in which case the expectation involves an integral:

$$D_{KL}(p|q) = \int p(x) \log\left(\frac{p(x)}{q(x)}\right) dx$$
In order to understand why KL divergence is not symmetric, it is useful to think of a simple example with a dice and a coin (let's denote heads and tails by 0 and 1). Both are fair, so their distributions are uniform. Their entropies are trivially $H(p_{coin})=\log(2)$ and $H(p_{dice})=\log(6)$. Let's first consider $D_{KL}(p_{coin}|p_{dice})$. The 2 possible events of $X_{coin}$ are 0 and 1, which are also possible for the dice. The average number of bits to encode a coin observation under the dice encoding will thus simply be $\log(6)$, and the KL divergence is $\log(6)-\log(2)$ additional bits. Now let's consider the problem the other way around: $D_{KL}(p_{dice}|p_{coin})$. We will use $\log(2)=1$ bit to encode the events 0 and 1. But how many bits will we use to encode the four remaining faces? The optimal encoding for the coin doesn't have any code for these, as they never happen in its world. The KL divergence is thus not defined (we would be dividing by 0 inside the log). With this example you should clearly see that the additional bits required to encode $p$ with $q$'s code are not the same as those required to encode $q$ with $p$'s code. The KL divergence is thus not symmetric and cannot be a distance.
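The coin/dice numbers are easy to verify in code; a small sketch (`kl` is my own helper, working in nats):

```python
import math

def kl(p, q):
    """D_KL(p || q) = sum_i p_i log(p_i / q_i), in nats.
    Zero-probability events of p contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

coin = [1 / 2, 1 / 2, 0, 0, 0, 0]  # coin embedded in the dice's 6 outcomes
dice = [1 / 6] * 6

# Encoding coin observations with the dice's code costs
# log(6) - log(2) = log(3) extra nats:
print(kl(coin, dice))
# kl(dice, coin) would divide by zero: the coin's code has no
# entry for the four faces the coin never shows.
```

Swapping the arguments makes the sum blow up, which is exactly the asymmetry discussed above.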
Side notes : minimizing cross entropy with respect to $q$ is the same as minimizing $D_{KL}(p|q)$. Indeed, the 2 equations are equivalent up to an additive constant (the entropy of $p$) which doesn't depend on $q$.
This is all interesting, but why are we talking about information theory concepts in machine learning? Well, it turns out that many ML algorithms can be interpreted through entropy-related concepts.
The 2 major ways we see entropy in machine learning are through:
Maximizing information gain (i.e., the reduction in entropy) at each step of our algorithm. Example:
Minimizing KL divergence between the actual unknown probability distribution of observations $p$ and the predicted one $q$. Example:
Minimizing KL divergence between the computationally intractable $p$ and a simpler approximation $q$. Indeed, machine learning is not only about theory but also about how to make something work in practice. Example:
There is no one model that works best for every problem.
Let’s try predicting the next fruit in the sequence:
You would probably guess the same fruit again, right? Maybe, with a lower probability, another fruit from the sequence. But have you thought of answering with something entirely different? I doubt it. I never told you that the sequence was constrained in the type of fruit, but naturally we make assumptions that the data "behaves well". The point here is that without any knowledge/assumptions about the data, all future data are equally probable.
The no free lunch theorem builds on this, and states that every algorithm has the same performance when averaged over all data distributions. So, for example, a deep learning classifier has on average the same performance as a random one.
Side Notes :
Resources : D. Wolpert’s proof.
These 2 types of methods distinguish themselves based on their answer to the following question: "Will I use the same amount of memory to store a model trained on $100$ examples as to store a model trained on $10\,000$ of them?" If yes, you are using a parametric model. If not, you are using a non-parametric model.
Practical : start with a parametric model. It's often worth trying a non-parametric model if you are doing clustering, or if the training data is not too big but the problem is very hard.
Side Note : strictly speaking, any non-parametric model could be seen as an infinite-parametric model. So if you want to be picky: next time you hear a colleague talking about non-parametric models, tell them it's in fact parametric. I decline any liability for the consequences on your relationship with them.
Supervised learning tasks tackle problems that have labeled data.
Intuition : it can be thought of as a teacher who corrects a multiple-choice exam. At the end, you get your average result as well as a detailed report saying which answers were wrong and what the correct answers were.
Supervised learning can be further separated into two broad types of problems:
The classification problem consists of assigning a set of classes/categories to an observation, i.e.:
Classification problems can be further separated into:
Common evaluation metrics include Accuracy, F1-Score, AUC… I have a section devoted for these classification metrics.
Compare to : Regression
The basic idea behind building a decision tree is to:
Here is a little gif showing these steps:
Note: For more information, please see the “details” and “Pseudocode and Complexity” drop-down below.
The idea behind decision trees is to partition the input space into multiple regions, e.g., the region of men who are more than 27 years old. Then predict the most probable class for each region, by assigning the mode of the training data in that region. Unfortunately, finding an optimal partitioning is usually computationally infeasible (NP-complete) due to the combinatorially large number of possible trees. In practice, the different algorithms thus use a greedy approach, i.e., each split of the decision tree tries to maximize a certain criterion regardless of subsequent splits.
How should we define an optimality criterion for a split? Let's define an impurity (error) of the current state, which we'll try to minimize. Here are 3 possible state impurities:
Here is a quick graph showing the impurity depending on a class distribution in a binary setting:
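To make the impurity measures concrete, here is a sketch of three functions commonly used in decision trees (misclassification error, Gini index, and entropy); these are standard choices, though they may not exactly match the ones plotted above:

```python
import math

# Each takes a list of class probabilities p and returns an impurity.
def misclassification(p):
    return 1 - max(p)

def gini(p):
    return 1 - sum(pi ** 2 for pi in p)

def entropy_impurity(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# All three are maximal for a 50/50 split and zero for a pure node:
for p in ([0.5, 0.5], [1.0, 0.0]):
    print(p, misclassification(p), gini(p), entropy_impurity(p))
```

Gini and entropy are differentiable in the class proportions, which is one reason they are often preferred over the raw misclassification error when growing trees.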
Side Notes :
But when should we stop? If possible, it is important to stop soon to decrease over-fitting. There are a variety of heuristics to determine when to stop. Here are a few:
One problem with these heuristics is the multiple problem-dependent thresholds (hyperparameters) to choose. Furthermore, some of them might yield relatively bad results. For example, decision trees might have to split the data without any purity gain in order to reach a high purity gain at the following step. For these reasons, it is common to grow large trees using only the number of training examples in a leaf node as a stopping criterion. To avoid over-fitting, the algorithm then prunes back the resulting tree. In CART, the pruning criterion $C_{pruning}(T)$ tries to balance impurity and model complexity by regularization. The regularized variable is often the number of leaf nodes $|T|$, as below:

$$C_{pruning}(T) = \sum_{v=1}^{|T|} I(v) + \lambda |T|$$
Where $\lambda$ determines the trade-off between impurity and model complexity, for a given tree $T$ with leaf nodes $v=1 \ldots |T|$, using impurity measure $I$. You would then simply choose $\lambda$ using cross-validation.
Variants: there are multiple decision tree methods. They mostly differ with regard to the following points:
The most famous variants are:
Other variants include : C5.0 (next version of C4.5, probably less used because patented), MARS.
Resources : A comparative study of different decision tree methods.
def buildTree(X, Y):
    tree = Node()  # simple container; Node() starts with label = None
    if stop_criteria(X, Y):
        # if we stop, store the majority class in the leaf
        tree.label = mode(Y)
        return tree
    minImpurity = infinity
    bestSplit = None
    for j in features:
        for T in thresholds:
            if impurity(X, Y, j, T) < minImpurity:
                bestSplit = (j, T)
                minImpurity = impurity(X, Y, j, T)
    X_left, Y_left, X_right, Y_right = split(X, Y, bestSplit)
    tree.split = bestSplit                        # stores current split
    tree.left = buildTree(X_left, Y_left)         # adds subsequent left splits
    tree.right = buildTree(X_right, Y_right)      # adds subsequent right splits
    return tree

def singlePredictTree(tree, xi):
    if tree.label is not None:    # reached a leaf
        return tree.label
    j, T = tree.split
    if xi[j] >= T:
        return singlePredictTree(tree.right, xi)
    else:
        return singlePredictTree(tree.left, xi)

def allPredictTree(tree, Xt):
    t, d = Xt.shape
    Yt = vector(t)    # one prediction per example
    for i in range(t):
        Yt[i] = singlePredictTree(tree, Xt[i, :])
    return Yt
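To make the pseudocode concrete, here is a tiny runnable companion: a single depth-1 tree (a decision stump) in plain Python, using the misclassification count as impurity. The helper names (`mode`, `best_stump`) and the toy data are my own, not from any library:

```python
from collections import Counter

def mode(ys):
    """Most common label in a list."""
    return Counter(ys).most_common(1)[0][0]

def best_stump(X, Y):
    """Greedy search over (feature, threshold) pairs minimizing the
    number of misclassified points after a single split."""
    n_features = len(X[0])
    best = (None, None, len(Y) + 1)  # (feature, threshold, errors)
    for j in range(n_features):
        for T in sorted({x[j] for x in X}):
            left = [y for x, y in zip(X, Y) if x[j] < T]
            right = [y for x, y in zip(X, Y) if x[j] >= T]
            if not left or not right:
                continue
            errors = sum(y != mode(left) for y in left) + \
                     sum(y != mode(right) for y in right)
            if errors < best[2]:
                best = (j, T, errors)
    return best[0], best[1]

# Two well-separated 1D clusters: the stump finds the gap.
X = [[1.0], [2.0], [8.0], [9.0]]
Y = ["circle", "circle", "cross", "cross"]
j, T = best_stump(X, Y)
print(j, T)
```

A full tree would simply recurse on the two sides of the split, as in the `buildTree` pseudocode above.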
Let’s first think about the complexity for building the first decision stump (first function call):
We now have the complexity of a decision stump. You might think that the complexity of building a tree is obtained by multiplying this by the number of function calls, right? Well, not really: that would be an over-estimate. Indeed, at each function call, the training data size $n$ has decreased. The intuition for the result we are looking for is that, at each level $l=1 \ldots m$, the sum of the training data across the function calls is still $n$. Multiple functions working in parallel take the same time as a single function would with the whole training set $n$. The complexity at each level is thus still $O(dn\log(n))$, so the complexity for building a tree of depth $m$ is $O(mdn\log(n))$. If you want a proof that the work at each level stays constant:
At each iteration the dataset is split into $\nu$ subsets of $k_i$ elements and a set of $n-\sum_{i=1}^{\nu} k_i$ elements. At every level, the total cost would therefore be (using the well-known log property and the fact that $k_i \le n$):
The last possible adjustment I see is to sort everything once, store it, and simply use this precomputed data at each level. The final training complexity is therefore $O(mdn + nd\log(n))$.
The time complexity of making predictions is straightforward: for each of the $t$ examples, go through a question at each of the $m$ levels, i.e., $O(mt)$.
Decision trees are more often used for classification problems. I thus talk at length about them here.
The 2 differences with decision trees for classification are:
The rest is the same as in the classification setting. Let’s look at a simple plot to get a better idea of the algorithm:
Besides the disadvantages seen for classification trees, decision trees for regression suffer from the fact that they predict a non-smooth function.
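To see where the non-smoothness comes from, here is a minimal regression stump in plain Python, assuming the usual choices (leaves predict the mean of their region, splits minimize squared error); `fit_regression_stump` and the toy data are my own:

```python
def fit_regression_stump(xs, ys):
    """Pick the 1D threshold minimizing squared error when each
    side predicts its mean (the usual regression-tree criterion)."""
    best = (None, None, None, float("inf"))  # (T, left_mean, right_mean, sse)
    for T in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < T]
        right = [y for x, y in zip(xs, ys) if x >= T]
        ml = sum(left) / len(left)
        mr = sum(right) / len(right)
        sse = sum((y - ml) ** 2 for y in left) + \
              sum((y - mr) ** 2 for y in right)
        if sse < best[3]:
            best = (T, ml, mr, sse)
    return best[:3]

# Two flat groups of targets: the stump predicts a step function,
# constant on each side of the threshold.
xs = [0, 1, 2, 3, 10, 11, 12, 13]
ys = [0.0, 0.0, 0.0, 0.0, 5.0, 5.0, 5.0, 5.0]
T, left_mean, right_mean = fit_regression_stump(xs, ys)
print(T, left_mean, right_mean)
```

The prediction jumps abruptly at the threshold; a deeper tree only produces more, smaller steps, never a smooth curve.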
The previous sections were separated by learning style. I will now separate by algorithm similarity.
The previous sections were separated by learning style and algorithm similarity. I will now focus on the application.
The following sections are not strictly speaking part of Machine Learning. They are nevertheless very related and important.
Unfortunately, here ends today's journey together. But don't be too disappointed: I'm only getting started. I still have tons of notes eagerly waiting to be upgraded, and I don't intend to stop learning yet.
PS: Any reaction/suggestions would be very appreciated: just drop a comment below or click on the .
See you soon !