Generative vs discriminative

These two major model types differ in the approach they take to learning. Although the distinction is not task-specific, you will most often hear about it in the context of classification.

Differences

In classification, the task is to identify the category $y$ of an observation, given its features $\mathbf{x}$: $y \vert \mathbf{x}$. There are two possible approaches:

  • Discriminative models learn the decision boundaries between classes.
    • :bulb: Tell me in which class this observation falls, given past data.
    • Can be probabilistic or non-probabilistic models. If probabilistic, the prediction is $\hat{y}=\arg\max_{y=1 \ldots C} \, p(y \vert \mathbf{x})$. If non-probabilistic, the model "draws" a boundary between classes; if the point $\mathbf{x}$ is on one side of the boundary then predict $y=1$, if it is on the other then $y=2$ (multiple boundaries for multiple classes).
    • Directly model what we care about: $y \vert \mathbf{x}$.
    • :school_satchel: As an example, for language classification, the discriminative model would learn to distinguish between languages from their sound but wouldn't understand anything.
  • Generative models model the distribution of each class.
    • :bulb: First "understand" the meaning of the data, then use your knowledge to classify.
    • Model the joint distribution $p(y,\mathbf{x})$ (often via $p(y,\mathbf{x})=p(\mathbf{x} \vert y)p(y)$). Then find the desired conditional probability through Bayes' theorem: $p(y \vert \mathbf{x})=\frac{p(y,\mathbf{x})}{p(\mathbf{x})}$. Finally, predict $\hat{y}=\arg\max_{y=1 \ldots C} \, p(y \vert \mathbf{x})$ (same as discriminative).
    • Generative models often make more assumptions, as they tackle a harder task.
    • :school_satchel: To continue with the previous example, the generative model would first learn how to speak the language and then classify which language the words come from.
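To make the generative recipe above concrete, here is a minimal NumPy sketch (the toy data and all names are illustrative, not from any particular library): it models $p(\mathbf{x} \vert y)$ with one diagonal Gaussian per class, estimates $p(y)$ from class frequencies, and predicts via $\arg\max_y p(y, \mathbf{x})$, which equals $\arg\max_y p(y \vert \mathbf{x})$ since $p(\mathbf{x})$ does not depend on $y$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-class data: each class is drawn from its own Gaussian.
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
X1 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(100, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

# Generative approach: model p(x | y) as a diagonal Gaussian per class
# and estimate p(y) from class frequencies.
classes = np.unique(y)
means = np.array([X[y == c].mean(axis=0) for c in classes])
stds = np.array([X[y == c].std(axis=0) for c in classes])
priors = np.array([(y == c).mean() for c in classes])

def log_joint(x):
    """log p(y, x) = log p(x | y) + log p(y), for every class y at once."""
    log_lik = -0.5 * (((x - means) / stds) ** 2
                      + np.log(2 * np.pi * stds ** 2)).sum(axis=1)
    return log_lik + np.log(priors)

def predict(x):
    # argmax_y p(y | x) = argmax_y p(y, x), since p(x) is constant in y.
    return classes[np.argmax(log_joint(x))]

print(predict(np.array([0.2, -0.1])))  # a point near the class-0 mean
print(predict(np.array([2.8, 3.1])))   # a point near the class-1 mean
```

Note that nothing here draws a boundary explicitly: the boundary is implied by where the two class-conditional densities (weighted by the priors) cross.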

Pros / Cons

Some of these advantages / disadvantages are the same idea in different wording. These are rules of thumb!

  • Discriminative:
    • :white_check_mark: Such models need fewer assumptions, as they tackle an easier problem.
    • :white_check_mark: Often less bias => better with more data.
    • Often :x: slower convergence rate. Logistic Regression requires $O(d)$ observations to converge to its asymptotic error.
    • :x: Prone to over-fitting when there's less data, as no assumptions constrain the model from finding non-existent patterns.
    • Often :x: more variance.
    • :x: Hard to update the model with new data (online learning).
    • :x: Have to retrain the model when adding new classes.
    • :x: In practice, needs additional regularization / kernels / penalty functions.
  • Generative:
    • :white_check_mark: Faster convergence rate => better with less data. Naive Bayes only requires $O(\log(d))$ observations to converge to its asymptotic error.
    • Often :white_check_mark: less variance.
    • :white_check_mark: Can easily update the model with new data (online learning).
    • :white_check_mark: Can generate new data by sampling from $p(\mathbf{x} \vert y)$.
    • :white_check_mark: Can handle missing features.
    • :white_check_mark: No need to retrain the model when adding new classes, as the parameters of each class are fitted independently.
    • :white_check_mark: Easy to extend to the semi-supervised case.
    • Often :x: more bias.
    • :x: Uses computational power to compute something we didn't ask for.

:wrench: Rule of thumb: If you need to train the best classifier on a large data set, use a discriminative model. If your task involves more constraints (online learning, semi-supervised learning, small dataset, …), use a generative model.

Let's illustrate the advantages and disadvantages of both methods with an example. Suppose we are asked to construct a classifier for the "true distribution" below. There are two training sets: "small sample" and "large sample". Suppose that the generative model assumes points are generated from a Gaussian.

discriminative vs generative true distribution

discriminative vs generative small sample

discriminative vs generative large sample

How well will the algorithms distinguish the classes in each case?

  • Small Sample:
    • The discriminative model never saw examples at the bottom of the blue ellipse. It will not find the correct decision boundary there.
    • The generative model assumes that the data follows a normal distribution (ellipse). It will therefore infer the correct decision boundary without ever having seen data points there!

small sample discriminative

small sample generative

  • Large Sample:
    • The discriminative model is not restricted by assumptions and can find the small red cluster inside the blue one.
    • The generative model assumes that the data follows a Gaussian distribution (ellipse) and won't be able to find the small red cluster.

large sample discriminative

large sample generative

This was simply an example that hopefully illustrates the advantages and disadvantages of making more assumptions. Depending on their assumptions, some generative models could find the small red cluster.
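The large-sample failure mode can be reproduced in a few lines. This is a toy sketch under the same assumption as the figures (one diagonal Gaussian per class; all data is synthetic): the red class is actually bimodal, with a small cluster hidden inside the blue region, so a single-Gaussian generative model labels that cluster blue.

```python
import numpy as np

rng = np.random.default_rng(2)

# Blue class: one big Gaussian. Red class: a far-away blob
# plus a small cluster hidden inside the blue one.
blue = rng.normal([0, 0], 1.0, size=(200, 2))
red = np.vstack([
    rng.normal([6, 0], 1.0, size=(180, 2)),
    rng.normal([0, 0], 0.3, size=(20, 2)),  # the small red cluster
])

def fit_gaussian(X):
    return X.mean(axis=0), X.std(axis=0)

params = {"blue": fit_gaussian(blue), "red": fit_gaussian(red)}
counts = {"blue": len(blue), "red": len(red)}

def predict(x):
    total = sum(counts.values())
    scores = {}
    for label, (mu, sd) in params.items():
        log_lik = -0.5 * (((x - mu) / sd) ** 2
                          + np.log(2 * np.pi * sd ** 2)).sum()
        scores[label] = log_lik + np.log(counts[label] / total)
    return max(scores, key=scores.get)

# A point from the small red cluster: the single Gaussian fitted to the
# red class sits near the far-away blob, so this point is labeled blue.
print(predict(np.array([0.0, 0.0])))  # → blue
```

Fitting the red class with a mixture of two Gaussians instead would recover the small cluster, which is exactly the point about assumptions above.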

Examples of Algorithms

Discriminative

  • Logistic Regression
  • Softmax
  • Traditional Neural Networks
  • Conditional Random Fields
  • Maximum Entropy Markov Model
  • Decision Trees

Generative

  • Naive Bayes
  • Gaussian Discriminant Analysis
  • Latent Dirichlet Allocation
  • Restricted Boltzmann Machines
  • Gaussian Mixture Models
  • Hidden Markov Models
  • Sigmoid Belief Networks
  • Bayesian networks
  • Markov random fields

Hybrid

  • Generative Adversarial Networks

:information_source: Resources: A. Ng and M. Jordan have a must-read paper on the subject, T. Mitchell summarizes these concepts very well in his slides, and section 8.6 of K. Murphy's book has a great overview of pros and cons, which strongly influenced the section above.
