Latent NPFs

Overview

We concluded the previous section by noting two important drawbacks of the CNPF:

  • The predictive distribution is factorised across target points, and thus can neither account for correlations in the predictive nor (as a result) produce “coherent” samples from the predictive distribution.

  • The predictive distribution requires specification of a particular parametric form (e.g. Gaussian).

In this section we discuss an alternative parametrisation of \(p_\theta( \mathbf{y}_\mathcal{T} | \mathbf{x}_\mathcal{T}; \mathcal{C})\) that still satisfies our desiderata for NPs, and addresses both of these issues. The main idea is to introduce a latent variable \(\mathbf{z}\) into the definition of the predictive distribution. This leads us to the second major branch of the NPF, which we refer to as the Latent Neural Process Sub-family, or LNPF for short. A graphical representation of the LNPF is given in Fig. 31.

graphical model LNP

Fig. 31 Probabilistic graphical model for LNPs.

To specify this family of models, we must define a few components:

  • An encoder: \(p_{\theta} \left( \mathbf{z} | \mathcal{C} \right)\), which provides a distribution over the latent variable \(\mathbf{z}\) having observed the context set \(\mathcal{C}\). As with the rest of the NPF, the encoder must be permutation invariant to correctly treat \(\mathcal{C}\) as a set. A typical choice is to first compute a deterministic representation \(R\) and then use it to output the mean and (log) standard deviation of a Gaussian distribution over \(\mathbf{z}\) (see the sketch after this list).

  • A decoder: \(p_{\theta} \left( \mathbf{y}_{\mathcal{T}} | \mathbf{x}_{\mathcal{T}}, \mathbf{z} \right)\), which provides predictive distributions conditioned on \(\mathbf{z}\) and the target locations \(\mathbf{x}_{\mathcal{T}}\). The decoder is usually the same as in the CNPF, but it conditions on a sample of the latent representation \(\mathbf{z}\) (which is then marginalised) rather than on a deterministic representation.
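For concreteness, here is a minimal PyTorch-style sketch of such an encoder \(p_{\theta} \left( \mathbf{z} | \mathcal{C} \right)\), using mean aggregation followed by a Gaussian over \(\mathbf{z}\). The class name, layer sizes, and tensor shapes are illustrative assumptions, not the architecture of any specific model in this section.

```python
import torch
import torch.nn as nn

class LatentEncoder(nn.Module):
    """Minimal sketch of p_theta(z | C): encode each context pair, mean-aggregate
    (permutation invariant), then output a Gaussian over the latent variable z."""

    def __init__(self, x_dim=1, y_dim=1, r_dim=128, z_dim=128):
        super().__init__()
        self.point_encoder = nn.Sequential(
            nn.Linear(x_dim + y_dim, r_dim), nn.ReLU(), nn.Linear(r_dim, r_dim)
        )
        self.to_gaussian = nn.Linear(r_dim, 2 * z_dim)  # mean and log-std of z

    def forward(self, x_cntxt, y_cntxt):
        # x_cntxt: [n_cntxt, x_dim], y_cntxt: [n_cntxt, y_dim]
        R = self.point_encoder(torch.cat([x_cntxt, y_cntxt], dim=-1)).mean(dim=0)
        mean, log_std = self.to_gaussian(R).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.exp())
```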

Putting it all together:

(4)\[\begin{split}\begin{align} p_{\theta}(\mathbf{y}_\mathcal{T} | \mathbf{x}_\mathcal{T}; \mathcal{C}) &= \int p_{\theta} \left( \mathbf{z} | \mathcal{C} \right) p_{\theta} \left( \mathbf{y}_{\mathcal{T}} | \mathbf{x}_{\mathcal{T}}, \mathbf{z} \right) \mathrm{d}\mathbf{z} & \text{Marginalisation} \\ &= \int p_{\theta} \left( \mathbf{z} | \mathcal{C} \right) \prod_{t=1}^{T} p_{\theta}(y^{(t)} | x^{(t)}, \mathbf{z}) \, \mathrm{d}\mathbf{z} & \text{Factorisation}\\ &= \int p_{\theta} \left( \mathbf{z} | \mathcal{C} \right) \prod_{t=1}^{T} \mathcal{N} \left( y^{(t)}; \mu^{(t)}, \sigma^{2(t)} \right) \mathrm{d}\mathbf{z} & \text{Gaussianity} \end{align}\end{split}\]

Now, you might be worried that we have still made both the factorisation and Gaussian assumptions! However, while \(p_{\theta} \left( \mathbf{y}_{\mathcal{T}} | \mathbf{x}_{\mathcal{T}}, \mathbf{z} \right)\) is still factorised, the predictive distribution we are actually interested in, \(p_\theta( \mathbf{y}_\mathcal{T} | \mathbf{x}_\mathcal{T}; \mathcal{C})\), is no longer factorised once \(\mathbf{z}\) is marginalised out, thus addressing the first problem we associated with the CNPF. Moreover, the predictive distribution is no longer Gaussian either. In fact, since the predictive now takes the form of an infinite mixture of Gaussians, it can in principle represent (i.e. learn) any predictive density. This is great news, as it (conceptually) relieves us of the burden of choosing or designing a bespoke likelihood function when deploying the NPF for a new application!

However, there is an important drawback. The key difficulty with the LNPF is that the likelihood we defined in Eq.(4) is no longer analytically tractable. We now discuss how to train members of the LNPF in general. After discussing several training procedures, we’ll introduce extensions of each of the CNPF members discussed in the previous chapter to their corresponding member of the LNPF.

Training LNPF members

Ideally, we would like to directly maximize the likelihood defined in Eq.(4) to optimise the parameters of the model. However, the integral over \(\mathbf{z}\) renders this quantity intractable, so we must consider alternatives. In fact, this story is not new, and the same issues arise when considering other latent variable models, such as variational auto-encoders (VAEs).

The question of how best to train LNPF members is still open, and there is ongoing research in this area. In this section, we will cover two methods for training LNPF models, but each has its flaws, and which one is preferable must often be decided empirically. Here is a brief summary of both methods, which are described in detail in the following sections:

Table 3 Summary of training methods for LNPF models

  Training Method    Approximation of Log Likelihood    Biased    Variance    Empirical Performance
  NPML [FBG+20]      Sample Estimate                    Yes       Large       Usually better
  NPVI [GSR+18]      Variational Inference              Yes       Small       Usually worse

Neural Process Maximum Likelihood (NPML)

First, let’s consider a direct approach to optimising the log-marginal predictive likelihood of LNPF members. While this quantity is no longer tractable (as it was with members of the CNPF), we can derive an estimator using Monte-Carlo sampling:

(5)\[\begin{split}\begin{align} \log p_{\theta}(\mathbf{y}_\mathcal{T} | \mathbf{x}_\mathcal{T}; \mathcal{C}) &= \log \int p_{\theta} \left( \mathbf{z} | \mathcal{C} \right) \prod_{t=1}^{T} p_{\theta} \left( y^{(t)} | x^{(t)}, \mathbf{z} \right) \mathrm{d}\mathbf{z} & \text{Marginalisation} \\ & \approx \log \left( \frac{1}{L} \sum_{l=1}^{L} \prod_{t=1}^{T} p_{\theta} \left( y^{(t)} | x^{(t)}, \mathbf{z}_l \right) \right) & \text{Monte-Carlo approximation} \\ & = \hat{\mathcal{L}}_{\mathrm{ML}} \end{align}\end{split}\]

where each \(\mathbf{z}_l \sim p_{\theta} \left( \mathbf{z} | \mathcal{C} \right)\).

Eq.(5) provides a simple-to-compute objective function for training LNPF-members, which we can then use with standard optimisers to learn the model parameters \(\theta\). The final (numerically stable) pseudo-code for NPML is given in Fig. 32:

Pseudo-code NPML.

Fig. 32 Pseudo-code for a single training step of a LNPF member with NPML.
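For readers who prefer code to pseudo-code, below is a minimal single-task sketch of the NPML objective in PyTorch. The `encoder` and `decoder` callables are placeholders assumed to return `torch.distributions.Normal` objects (they are not the API of any particular library), and numerical stability comes from working in log-space with `logsumexp`.

```python
import math
import torch

def npml_loss(encoder, decoder, x_cntxt, y_cntxt, x_trgt, y_trgt, n_z_samples=20):
    """Monte-Carlo estimate of the negative NPML objective for a single task (sketch)."""
    q_zCc = encoder(x_cntxt, y_cntxt)                     # p_theta(z | C)
    z_samples = q_zCc.rsample([n_z_samples])              # [n_z_samples, z_dim]

    # sum_t log p(y^(t) | x^(t), z_l) for each latent sample l
    log_p = decoder(x_trgt, z_samples).log_prob(y_trgt)   # [n_z_samples, n_trgt, y_dim]
    sum_log_p = log_p.sum(dim=(-2, -1))                   # [n_z_samples]

    # log( (1/L) * sum_l exp(sum_log_p_l) ), computed stably in log-space
    log_marginal = torch.logsumexp(sum_log_p, dim=0) - math.log(n_z_samples)
    return -log_marginal                                  # minimise the negative objective
```

In practice this loss would be averaged over a batch of tasks before taking a gradient step.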

NPML is conceptually very simple as it directly approximates the training procedure of the CNPF, in the sense that it targets the same predictive likelihood during training. Moreover, it tends to work well in practice, typically leading to models achieving good performance. However, it suffers from two important drawbacks:

  1. Bias: The Monte-Carlo approximation gives an unbiased estimator of the predictive likelihood. However, in practice we are interested in the log likelihood, and the log of an unbiased estimator is not itself an unbiased estimator of the log. By Jensen's inequality, NPML is therefore a biased (conservative) estimator of the true log-likelihood.

  2. High Variance: In practice it turns out that NPML is quite sensitive to the number of samples \(L\) used to approximate it. In both our GP and image experiments, we find that on the order of 20 samples are required to achieve “good” performance. Of course, the computational and memory costs of training scale linearly with \(L\), often limiting the number of samples that can be used in practice.

Unfortunately, decreasing the number of samples \(L\) needed to perform well turns out to be quite difficult, and is an open question in training latent variable models in general. However, we next describe an alternative training procedure that typically works well with fewer samples.

Neural Process Variational Inference (NPVI)

NPVI is a training procedure proposed by [GSR+18], which takes inspiration from the literature on variational inference (VI). The central idea behind this objective function is to use posterior sampling to reduce the variance of NPML. To build intuition, note that NPML is defined via an expectation with respect to \(p_{\theta}(\mathbf{z} | \mathcal{C})\). The idea of posterior sampling is to use the whole task, including the target set, \(\mathcal{D} = \mathcal{C} \cup \mathcal{T}\), to produce the distribution over \(\mathbf{z}\), leading to more informative samples and a lower-variance objective.

In our case, the posterior distribution from which we would like to sample is \(p(\mathbf{z} | \mathcal{C}, \mathcal{T})\), i.e., the distribution of the latent variable having observed both the context and target sets. Unfortunately, this posterior is intractable. To address this, [GSR+18] propose to replace the true posterior with simply passing both the context and target sets through the encoder, i.e.

(6)\[\begin{align} p \left( \mathbf{z} | \mathcal{C}, \mathcal{T} \right) \approx p_{\theta} \left( \mathbf{z} | \mathcal{D} \right) \end{align}\]

We can now derive the final objective function, which is a lower bound to the log marginal likelihood, by (i) introducing the approximate posterior as a sampling distribution, and (ii) employing a straightforward application of Jensen’s inequality.

(7)\[\begin{split}\begin{align} \log p_{\theta}(\mathbf{y}_\mathcal{T} | \mathbf{x}_\mathcal{T}; \mathcal{C}) &= \log \int p_{\theta} \left( \mathbf{z} | \mathcal{C} \right) p_{\theta} \left( \mathbf{y}_{\mathcal{T}} | \mathbf{x}_{\mathcal{T}}, \mathbf{z} \right) \mathrm{d}\mathbf{z} & \text{Marginalisation} \\ & = \log \int p_{\theta} \left( \mathbf{z} | \mathcal{D} \right) \frac{p_{\theta} \left( \mathbf{z} | \mathcal{C} \right)}{p_{\theta} \left( \mathbf{z} | \mathcal{D} \right)} p_{\theta} \left( \mathbf{y}_{\mathcal{T}} | \mathbf{x}_{\mathcal{T}}, \mathbf{z} \right) \mathrm{d}\mathbf{z} & \text{Importance Weight} \\ & \geq \int p_{\theta} \left( \mathbf{z} | \mathcal{D} \right) \left( \log p_{\theta} \left( \mathbf{y}_{\mathcal{T}} | \mathbf{x}_{\mathcal{T}}, \mathbf{z} \right) + \log \frac{p_{\theta} \left( \mathbf{z} | \mathcal{C} \right)}{p_{\theta} \left( \mathbf{z} | \mathcal{D} \right)} \right) \mathrm{d}\mathbf{z} & \text{Jensen's inequality} \\ & = \mathbb{E}_{\mathbf{z} \sim p_{\theta} \left( \mathbf{z} | \mathcal{D} \right)} \left[ \log p_{\theta} \left( \mathbf{y}_{\mathcal{T}} | \mathbf{x}_{\mathcal{T}}, \mathbf{z} \right) \right] - \mathrm{KL} \left( p_{\theta} \left( \mathbf{z} | \mathcal{D} \right) \| p_{ \theta} \left( \mathbf{z} | \mathcal{C} \right) \right) \\ & = \mathcal{L}_{\mathrm{VI}} \end{align}\end{split}\]

where \(\mathrm{KL}(p \| q)\) is the Kullback-Leibler (KL) divergence between two distributions \(p\) and \(q\), and we have used the shorthand \(p_{\theta} \left( \mathbf{y}_{\mathcal{T}} | \mathbf{x}_{\mathcal{T}}, \mathbf{z} \right) = \prod_{t=1}^{T} p_{\theta} \left( y^{(t)} | x^{(t)}, \mathbf{z} \right)\) to ease notation. Let's consider what we have achieved in Eq. (7).

Test Time

Of course, we can only sample from this approximate posterior during training, when we have access to both the context and target sets. At test time, we will only have access to the context set, and so the forward pass through the model will be equivalent to that of the model when trained with NPML, i.e., we will only pass the context set through the encoder. This is an important detail of NPVI: forward passes at meta-train time look different than they do at meta-test time!

When the encoder parameterises Gaussian distributions over \(\mathbf{z}\) (as is standard), so that both \(p_{\theta} \left( \mathbf{z} | \mathcal{C} \right)\) and \(p_{\theta} \left( \mathbf{z} | \mathcal{D} \right)\) are Gaussian, the KL term can be computed analytically. Hence we can derive an unbiased estimator of Eq.(7) by taking samples from \(p_{\theta} \left( \mathbf{z} | \mathcal{D} \right)\) to estimate the first term on the RHS. Fig. 33 provides the pseudo-code for a single training iteration of an LNPF member, using NPVI as the target objective.

Pseudo-code NPVI.

Fig. 33 Pseudo-code for a single training step of a LNPF member with NPVI.
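Analogously, here is a minimal single-task sketch of the NPVI objective, re-using the same placeholder `encoder`/`decoder` interface as in the NPML sketch above; note that the same encoder is applied to the full task to obtain \(p_{\theta} \left( \mathbf{z} | \mathcal{D} \right)\).

```python
import torch
from torch.distributions import kl_divergence

def npvi_loss(encoder, decoder, x_cntxt, y_cntxt, x_trgt, y_trgt, n_z_samples=1):
    """Estimate of the negative NPVI objective for a single task (sketch)."""
    q_zCc = encoder(x_cntxt, y_cntxt)                      # p_theta(z | C)

    # approximate posterior: pass context AND target points through the encoder
    x_all = torch.cat([x_cntxt, x_trgt], dim=0)
    y_all = torch.cat([y_cntxt, y_trgt], dim=0)
    q_zD = encoder(x_all, y_all)                           # p_theta(z | D)

    z_samples = q_zD.rsample([n_z_samples])                # posterior sampling
    log_p = decoder(x_trgt, z_samples).log_prob(y_trgt)    # [n_z_samples, n_trgt, y_dim]
    rec_term = log_p.sum(dim=(-2, -1)).mean(dim=0)         # E_{z ~ p(z|D)}[log p(y_T | x_T, z)]

    kl_term = kl_divergence(q_zD, q_zCc).sum(-1)           # analytic KL between Gaussians
    return -(rec_term - kl_term)
```

At test time only `q_zCc` is used, as discussed in the Test Time box above.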

To better understand the NPVI objective, it is important to note — see the box below — that it can be rewritten as the difference between the desired log marginal likelihood and a KL divergence between approximate and true posterior:

(8)\[\mathcal{L}_{\mathrm{VI}} = \log p_{\theta}(\mathbf{y}_\mathcal{T} | \mathbf{x}_\mathcal{T}; \mathcal{C}) - \mathrm{KL} \left( p_{\theta} \left( \mathbf{z} | \mathcal{D} \right) \| p \left( \mathbf{z} | \mathcal{C}, \mathcal{T} \right) \right)\]

The NPVI objective can thus be seen as maximising the desired log marginal likelihood while forcing the approximate and true posteriors to be similar.

As we have discussed, the most appealing property of NPVI is that it utilises posterior sampling to reduce the variance of the Monte-Carlo estimator of the intractable expectation. This means that often we can get away with training models taking just a single sample, resulting in computationally and memory efficient training procedures. However, it also comes with several drawbacks, which can be roughly summarised as follows:

  • NPVI focuses on approximating the posterior distribution over \(\mathbf{z}\) (see Eq.(8)). However, in the NPF setting we are typically only interested in the predictive distribution \(p_\theta(\mathbf{y}_\mathcal{T} | \mathbf{x}_\mathcal{T}; \mathcal{C})\), and it is unclear whether focusing our efforts on \(\mathbf{z}\) is beneficial to achieving higher-quality predictive distributions.

  • In NPVI, the encoder plays a dual role: it is both part of the model and used for posterior sampling. This dual role introduces additional complexity into the training procedure, and using the encoder as an approximate posterior may have a detrimental effect on the resulting predictive distributions.

As we shall see below, models trained with NPML often produce better fits than equivalent models trained with NPVI, at the price of greater computational and memory costs during training. Moreover, using NPVI often requires additional tricks and constraints to achieve decent performance.

Armed with procedures for training LNPF-members, we turn our attention to the models themselves. In particular, we next introduce the latent-variable variant of each of the conditional models introduced in the previous section, and we shall see that having addressed the training procedures, the extension to latent variables is quite straightforward from a practical perspective.

Latent Neural Process (LNP)

Computational graph LNP

Fig. 34 Computational graph for LNPs.

The latent neural process [GSR+18] is the latent counterpart of the CNP, and the first member of the LNPF proposed in the literature. Given the vector \(R\), which is computed as in the CNP, we simply pass it through an additional MLP to predict the mean and variance of the latent representation \(\mathbf{z}\), from which we can produce samples. The decoder then has the same architecture as that of the CNP, and we can simply pass samples of \(\mathbf{z}\), together with desired target locations, to produce our predictive distributions. Fig. 34 illustrates the computational graph of the LNP.
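To make the connection to coherent sampling explicit, here is a small sketch (using the same placeholder `encoder`/`decoder` interface as above) of drawing sample paths from a trained LNP: each path re-uses a single sample of \(\mathbf{z}\) across all target locations.

```python
import torch

@torch.no_grad()
def sample_predictive_paths(encoder, decoder, x_cntxt, y_cntxt, x_trgt, n_paths=5):
    """Draw coherent sample paths: one z per path, shared across all target points."""
    q_zCc = encoder(x_cntxt, y_cntxt)       # at test time, only the context set is encoded
    z_samples = q_zCc.sample([n_paths])     # [n_paths, z_dim]
    p_yCz = decoder(x_trgt, z_samples)      # one factorised predictive per latent sample
    return p_yCz.mean                       # [n_paths, n_trgt, y_dim] mean functions
```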

Throughout the section we train LNPs with NPVI as in the original paper. Below, we show the predictive distribution of an LNP trained on samples from the RBF-kernel GP, as the number of observed context points from the underlying function increases.

LNP on GP with RBF kernel

Fig. 35 Samples from the posterior predictive of an LNP (blue) and the oracle GP (green) with RBF kernel.

Fig. 35 shows that the latent variable indeed enables coherent sampling from the posterior predictive. In fact, within the range \([-1, 1]\) the model produces very nice samples that seem to properly mimic those from the underlying process. Nevertheless, here too we see that the LNP suffers from the same underfitting issue as discussed for CNPs: it tends to overestimate the uncertainty, and the mean functions often do not pass through all the context points. Moreover, beyond the \([-1, 1]\) range the model seems to “give up” on the context points and uncertainty, despite having been trained on the range \([-2, 2]\).

Let us now consider the image experiments as we did with the CNP.

LNP on CelebA and MNIST

Fig. 37 Samples (means conditioned on different samples of the latent) from the posterior predictive of an LNP on CelebA \(32\times32\) and MNIST. The last row shows the standard deviation of the posterior predictive corresponding to the last sample.

From Fig. 37 we see again that the latent variable enables relatively coherent sampling from the posterior predictive. As with the CNP, the LNP still underfits on images as is best illustrated when the context set is half the image.

Details

Model details, training and more plots in LNP Notebook. We also provide pretrained models to play around with.

Attentive Latent Neural Process (AttnLNP)

The Attentive LNP [KMS+19] is the latent counterpart of the AttnCNP. Unlike the LNP, the AttnLNP adds a “latent path” in addition to (rather than instead of) the deterministic path. The latent path is implemented in the same way as in the LNP, i.e. a mean aggregation followed by the parametrisation of a Gaussian. In other words, even though the deterministic representation \(R^{(t)}\) is target specific, the latent representation \(\mathbf{z}\) is target independent, as seen in the computational graph (Fig. 38).

Computational graph AttnLNP

Fig. 38 Computational graph for AttnLNPs.
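The sketch below illustrates, under assumed names and shapes, how the decoder input of an AttnLNP-style model combines the target-specific deterministic representation \(R^{(t)}\) from the attentive path with a single target-independent latent sample \(\mathbf{z}\) from the latent path.

```python
import torch
import torch.nn as nn

class AttnLNPDecoder(nn.Module):
    """Sketch: concatenate x^(t), the target-specific R^(t), and a shared z."""

    def __init__(self, x_dim=1, r_dim=128, z_dim=128, y_dim=1, h_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(x_dim + r_dim + z_dim, h_dim), nn.ReLU(),
            nn.Linear(h_dim, 2 * y_dim),   # mean and log-std of p(y | x, R, z)
        )

    def forward(self, x_trgt, R_trgt, z):
        # x_trgt: [n_trgt, x_dim], R_trgt: [n_trgt, r_dim] (target specific), z: [z_dim]
        z_shared = z.expand(x_trgt.shape[0], -1)            # the same z for every target point
        h = torch.cat([x_trgt, R_trgt, z_shared], dim=-1)
        mean, log_std = self.mlp(h).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.exp())
```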

Throughout the section we train AttnLNPs with NPVI as in the original paper. Below, we show the predictive distribution of an AttnLNP trained on samples from RBF, periodic, and noisy Matern kernel GPs, again viewing the predictive as the number of observed context points is increased.

AttnLNP on single GP

Fig. 39 Samples from the posterior predictive of an AttnLNP (blue) and the oracle GP (green) with RBF, periodic, and noisy Matern kernels.

Fig. 39 paints an interesting picture regarding the AttnLNP. On the one hand, we see that it is able to do a significantly better job in modelling the marginals than the LNP. However, on closer inspection, we can see several issues with the resulting distributions:

  • Kinks: The samples do not seem to be smooth, and we see “kinks” similar to (though even more pronounced than) those in Fig. 14.

  • Collapse to AttnCNP: In many places the AttnLNP seems to collapse the distribution over the latent variable, expressing all of its uncertainty through the observation noise. This tends to occur more often when the AttnLNP is trained with NPVI rather than NPML.

Let us now consider the image experiments as we did with the AttnCNP.

AttnLNP on CelebA, MNIST, ZSMM

Fig. 41 Samples from the posterior predictive of an AttnLNP on CelebA \(32\times32\), MNIST, and ZSMM.

From Fig. 41 we see that the AttnLNP generates quite impressive samples, exhibiting decent sampling and good performance when the model does not require generalisation (CelebA \(32\times32\), MNIST). However, as expected, the model “breaks” on ZSMM, as it still cannot extrapolate.

Details

Model details, training and more plots in AttnLNP Notebook. We also provide pretrained models to play around with.

Convolutional Latent Neural Process (ConvLNP)

The Convolutional LNP [FBG+20] is the latent counterpart of the ConvCNP. In contrast with the AttnLNP, the latent path replaces the deterministic one (as in the LNP), resulting in a latent functional representation (a latent stochastic process) rather than a latent vector-valued variable.

Another way of viewing the ConvLNP, which is useful for gaining an intuitive understanding of the computational graph (see Fig. 42), is as two stacked ConvCNPs: the first takes in the context set and outputs a latent stochastic process; the second takes as input a sample from that latent process and models the posterior predictive conditioned on it.

Computational graph ConvLNP using two ConvCNPs

Fig. 42 Computational graph for ConvLNPs.
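The “two stacked ConvCNPs” view can be sketched as follows; `conv_cnp_1` and `conv_cnp_2` are placeholder modules operating on a gridded (discretised) representation of the context set, which is an assumption about the interface rather than the actual implementation.

```python
def conv_lnp_forward(conv_cnp_1, conv_cnp_2, gridded_context, n_z_samples=1):
    """Sketch of a ConvLNP forward pass viewed as two stacked ConvCNPs."""
    # first ConvCNP: context -> a Normal with one (mu, sigma) per discretisation
    # point, i.e. a discretised latent stochastic process
    q_z = conv_cnp_1(gridded_context)
    # a single sample is an entire latent function evaluated on the grid
    z = q_z.rsample([n_z_samples])
    # second ConvCNP: sampled latent function -> posterior predictive at the targets
    return conv_cnp_2(z)
```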

One difficulty arises in training the ConvLNP with the NPVI objective, as it requires evaluating the KL divergence between two stochastic processes, which is a tricky proposition. [FBG+20] propose a simple approach that approximates this quantity by summing the KL divergences at each discretisation location (see the snippet below). However, as they note, the ConvLNP performs significantly better in most cases when trained with NPML rather than NPVI. Throughout this section we will thus use NPML instead of NPVI.
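Concretely, the approximation replaces the intractable functional KL by a sum of pointwise KL divergences over the discretisation grid. The snippet below illustrates the idea with dummy Gaussians at `n_grid` locations; the actual distributions would come from encoding \(\mathcal{D}\) and \(\mathcal{C}\) respectively.

```python
import torch
from torch.distributions import Normal, kl_divergence

n_grid = 64  # number of discretisation locations (dummy value)
q_zD_grid = Normal(torch.zeros(n_grid), torch.ones(n_grid))       # latent process given D, on the grid
q_zC_grid = Normal(torch.zeros(n_grid), 2 * torch.ones(n_grid))   # latent process given C, on the grid

# sum the per-location KLs as a stand-in for the KL between the two processes
kl_term = kl_divergence(q_zD_grid, q_zC_grid).sum()
```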

ConvLNP on GPs with RBF, periodic, Matern kernel

Fig. 44 Samples from the posterior predictive of a ConvLNP (blue) and the oracle GP (green) with RBF, periodic, and noisy Matern kernels.

From Fig. 44 we see that ConvLNP performs very well and the samples are reminiscent of those from a GP, i.e., with much richer variability compared to Fig. 39. Further, as in the case of the ConvCNP, we see that the ConvLNP elegantly generalises beyond the range in \(X\)-space on which it was trained.

Next, we consider the more challenging problem of having the ConvLNP model a stochastic process whose posterior predictive is non-Gaussian. We do so with the following generative process: first, sample one of the three kernels discussed above; second, sample a function from the chosen kernel. Importantly, the data-generating process is then a mixture of GPs, and the true posterior predictive process (obtained by marginalising over the different kernels) is non-Gaussian.
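For concreteness, a sketch of such a data-generating process might look as follows; the kernel hyperparameters and grid are illustrative rather than those used to train the models shown here.

```python
import numpy as np
from sklearn.gaussian_process.kernels import RBF, Matern, ExpSineSquared

def sample_task(n_points=128, x_range=(-2.0, 2.0), rng=None):
    """Sample a kernel uniformly at random, then sample a function from that GP prior."""
    rng = np.random.default_rng() if rng is None else rng
    kernels = [
        RBF(length_scale=0.25),
        ExpSineSquared(length_scale=0.5, periodicity=0.5),
        Matern(length_scale=0.25, nu=1.5),
    ]
    kernel = kernels[rng.integers(len(kernels))]
    x = np.linspace(*x_range, n_points)[:, None]
    cov = kernel(x) + 1e-6 * np.eye(n_points)   # small jitter for numerical stability
    y = rng.multivariate_normal(np.zeros(n_points), cov)
    return x, y
```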

ConvLNP trained on GPs with RBF,Matern,periodic kernel

Fig. 46 Similar to Fig. 44, but the model was trained on samples from all three kernels simultaneously.

Fig. 46 demonstrates that ConvLNP performs quite well in this harder setting. Indeed, it seems to model the predictive process using the periodic kernel when the number of context points is small but quickly (around 10 context points) recovers the correct underlying kernel. Note how in the middle plot, the ConvLNP becomes progressively more and more “confident” that the process is periodic as more data is observed. Note that in Fig. 46 we are plotting the posterior predictive for the sampled GP, rather than the actual, non-GP posterior predictive process.

Again, we consider the performance of the ConvLNP in the image setting. In Fig. 47 we see that the ConvLNP does a reasonable job producing samples when the context sets are uniformly subsampled from images, but struggles with the “structured” context sets, e.g. when the left or bottom halves of the image are missing. Moreover, the ConvLNP is able to produce samples in the generalisation setting (ZSMM), but these are not always coherent, and include some strange artifacts that seem more similar to sampling the MNIST “texture” than coherent digits.

ConvLNP on CelebA, MNIST, ZSMM

Fig. 47 Samples from the posterior predictive of a ConvLNP on CelebA \(32\times32\), MNIST, and ZSMM.

As discussed in the ‘Issues with the CNPF’ section, members of the CNPF could not be used to generate coherent samples, nor model non-Gaussian posterior predictive distributions. In contrast, Fig. 48 (right) demonstrates that, as expected, ConvLNP is able to produce non-Gaussian predictives for pixels, with interesting bi-modal and heavy-tailed behaviours.

Samples from ConvLNP on MNIST and posterior of different pixels

Fig. 48 Samples from the posterior predictive of a ConvLNP on MNIST (left) and the posterior predictive of some pixels (right).


Details

Model details, training and more plots in ConvLNP Notebook. We also provide pretrained models to play around with.

Issues and Discussion

We have seen how members of the LNPF utilise a latent variable to define a predictive distribution, thus achieving structured and expressive predictive distributions over target sets. Despite these advantages, the LNPF suffers from important drawbacks:

  • The training procedure only optimises a biased objective or a lower bound to the true objective.

  • Approximating the objective function requires sampling, which can lead to high variance during training.

  • They are more computationally and memory intensive, requiring many samples to estimate the objective when using NPML.

  • It is difficult to quantitatively evaluate and compare different models, since only lower bounds to the log predictive likelihood can be estimated.

Despite these challenges, the LNPF defines a useful and powerful class of models. Whether to deploy a member of the CNPF or LNPF depends on the task at hand. For example, if samples are not required for a particular application, and we have reason to believe a parametric distribution may be a good description of the likelihood, it may well be that a CNPF would be preferable. Conversely, it is crucial to use the LNPF if sampling or dependencies in the predictive are required, for example in Thompson sampling.