Datasets¶
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import logging
import os
import warnings
import torch
os.chdir("../..")
warnings.filterwarnings("ignore")
warnings.simplefilter("ignore")
logging.disable(logging.ERROR)
N_THREADS = 8
IS_FORCE_CPU = False # Nota Bene : notebooks don't deallocate GPU memory
if IS_FORCE_CPU:
    os.environ["CUDA_VISIBLE_DEVICES"] = ""
torch.set_num_threads(N_THREADS)
1D Gaussian Process Datasets¶
A good way to compare members of the NPF is to model samples from different Gaussian processes. Specifically, we will investigate NPFs in the following three settings:
Samples from a single GP: the first question we want to answer is how well NPFs can model a ground truth GP. To test this, we will repeatedly sample from a single GP and compare the posterior predictive of each NPF to that of the ground truth GP. Note that this is not a very natural setting, but the goal is to understand how biased the objective is and which NPF can best approximate the ground truth GP with finite computation. We will use:
Training: 100 epochs, each epoch consists of 10k different context and target sets from the GP (never see the same example twice)
Datasets: 3 datasets, each corresponding to a GP with a different kernel and fixed hyper-parameters, namely RBF, Exp-Sine-Squared (periodic), and Matern with noise
Evaluation: compare to the ground truth GP (see the sketch after this list)
Samples from GPs with varying Kernels: in the second experiment we will investigate whether members of the NPF can model a ground truth GP even when trained on samples from GPs with different kernels, i.e., whether they can “recognize” the kernel and, once they do, model the corresponding ground truth GP. To do so we will simply train on the 3 datasets from the first setting.
Training: 100 epochs, each epoch consists of 10k different context and target sets from the GP (never see the same example twice)
Datasets: union of the 3 datasets from the first setting
Evaluation: evaluate on each of the 3 datasets separately and compare to the ground truth GP
Samples from GPs with varying Kernel hyperparameters: finally, we will test whether NPFs can model a whole family of GPs.
Training: 100 epochs, each epoch consists of 10k different context and target sets from the GP (never see the same example twice)
Datasets: data generated by sampling a length scale uniformly in \([0.01, 0.3]\) and then sampling a function from a GP with a Matern kernel and that length scale
Evaluation: compare to a GP with a Matern kernel whose length scale is fitted on the context points by maximizing the marginal likelihood (see the sketch below)
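To make the evaluations concrete, both oracles can be computed with scikit-learn: for the first two settings a GaussianProcessRegressor with optimizer=None keeps the ground truth hyper-parameters fixed, while for the third setting the default optimizer fits the Matern length scale on the context points by maximizing the marginal likelihood. The snippet below is only a minimal sketch; the context and target points are made up for illustration, whereas in the experiments they come from the sampled tasks.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern

# Hypothetical context and target points (only for illustration).
X_context = np.random.uniform(-1, 1, size=(10, 1))
y_context = np.sin(3 * X_context).ravel()
X_target = np.linspace(-2, 2, 100)[:, None]

# Settings 1 and 2: optimizer=None keeps the ground truth hyper-parameters
# fixed, so the predictive below is the oracle we compare NPFs against.
gp_oracle = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), optimizer=None)
gp_oracle.fit(X_context, y_context)
mean, std = gp_oracle.predict(X_target, return_std=True)

# Setting 3: the default optimizer maximizes the log marginal likelihood over
# the kernel hyper-parameters, i.e. it fits the length scale on the context set.
gp_fitted = GaussianProcessRegressor(
    kernel=Matern(length_scale=0.1, length_scale_bounds=(0.01, 0.3), nu=1.5)
)
gp_fitted.fit(X_context, y_context)
print(gp_fitted.kernel_.length_scale)  # length scale fitted by marginal likelihood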
Extensions
See the docstring of GPDataset for more parameters. Adding a dataset from a new kernel is straightforward: define your own kernel and follow the same steps as below (a short sketch is given right after the docstring).
# GPDataset Docstring
from utils.data import GPDataset
print(GPDataset.__doc__)
Dataset of functions generated by a gaussian process.
Parameters
----------
kernel : sklearn.gaussian_process.kernels or list
The kernel specifying the covariance function of the GP. If None is
passed, the kernel "1.0 * RBF(1.0)" is used as default.
min_max : tuple of floats, optional
Min and max point at which to evaluate the function (bounds).
n_samples : int, optional
Number of sampled functions contained in dataset.
n_points : int, optional
Number of points at which to evaluate f(x) for x in min_max.
is_vary_kernel_hyp : bool, optional
Whether to sample each example from a kernel with random hyperparameters,
that are sampled uniformly in the kernel hyperparameters `*_bounds`.
save_file : string or tuple of strings, optional
Where to save and load the dataset. If tuple `(file, group)`, save in
the hdf5 under the given group. If `None` regenerate samples indefinitely.
Note that if the saved dataset has been completely used,
it will generate a new sub-dataset for every epoch and save it for future
use.
n_same_samples : int, optional
Number of samples with same kernel hyperparameters and X. This makes the
sampling quicker.
is_reuse_across_epochs : bool, optional
Whether to reuse the same samples across epochs. This makes the
sampling quicker and storing less memory heavy if `save_file` is given.
kwargs:
Additional arguments to `GaussianProcessRegressor`.
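As noted above, adding a dataset from a new kernel only requires defining the kernel and passing it through the same helper used in the rest of this notebook. Below is a minimal sketch; the RationalQuadratic kernel, its hyper-parameters, the variable names, and the smaller n_samples are arbitrary choices for illustration, not ones used in the experiments.
from sklearn.gaussian_process.kernels import RationalQuadratic
from utils.ntbks_helpers import get_gp_datasets

# Any sklearn kernel (or sum / product of kernels) can be used.
new_kernels = dict()
new_kernels["Rational_Quadratic_Kernel"] = RationalQuadratic(length_scale=0.2, alpha=1.0)

new_trainsets, new_testsets, new_validsets = get_gp_datasets(
    new_kernels,
    is_vary_kernel_hyp=False,  # fixed hyper-parameters
    n_samples=10000,  # number of different context-target sets
    n_points=128,  # size of target U context set for each sample
    is_reuse_across_epochs=False,  # never see the same example twice
)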
Samples from a single GP¶
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, Matern, WhiteKernel
from utils.ntbks_helpers import get_gp_datasets
def get_datasets_single_gp():
    """Return train / test / valid sets for 'Samples from a single GP'."""
    kernels = dict()
    kernels["RBF_Kernel"] = RBF(length_scale=0.2)
    kernels["Periodic_Kernel"] = ExpSineSquared(length_scale=1, periodicity=0.5)
    # kernels["Matern_Kernel"] = Matern(length_scale=0.2, nu=1.5)
    kernels["Noisy_Matern_Kernel"] = WhiteKernel(noise_level=0.1) + Matern(
        length_scale=0.2, nu=1.5
    )
    return get_gp_datasets(
        kernels,
        is_vary_kernel_hyp=False,  # use a single hyperparameter per kernel
        n_samples=50000,  # number of different context-target sets
        n_points=128,  # size of target U context set for each sample
        is_reuse_across_epochs=False,  # never see the same example twice
    )
# create the dataset and store it (if not already done)
(datasets, _, __,) = get_datasets_single_gp()
import matplotlib.pyplot as plt
from utils.visualize import plot_dataset_samples_1d
n_datasets = len(datasets)
fig, axes = plt.subplots(n_datasets, 1, figsize=(11, 5 * n_datasets), squeeze=False)
for i, (k, dataset) in enumerate(datasets.items()):
    plot_dataset_samples_1d(dataset, title=k.replace("_", " "), ax=axes.flatten()[i], n_samples=3)
Samples from GPs with varying Kernels¶
from utils.data.helpers import DatasetMerger
def get_datasets_varying_kernel_gp():
    """Return train / test / valid sets for 'Samples from GPs with varying Kernels'."""
    datasets, test_datasets, valid_datasets = get_datasets_single_gp()
    return (
        dict(All_Kernels=DatasetMerger(datasets.values())),
        dict(All_Kernels=DatasetMerger(test_datasets.values())),
        dict(All_Kernels=DatasetMerger(valid_datasets.values())),
    )
# create the dataset and store it (if not already done)
(datasets, _, __,) = get_datasets_varying_kernel_gp()
n_datasets = len(datasets)
fig, axes = plt.subplots(n_datasets, 1, figsize=(11, 5 * n_datasets), squeeze=False)
for i, (k, dataset) in enumerate(datasets.items()):
    plot_dataset_samples_1d(dataset, title=k.replace("_", " "), ax=axes.flatten()[i], n_samples=10)
Samples from GPs with varying Kernel hyperparameters¶
def get_datasets_variable_hyp_gp():
    """Return train / test / valid sets for 'Samples from GPs with varying Kernel hyperparameters'."""
    kernels = dict()
    kernels["Variable_Matern_Kernel"] = Matern(length_scale_bounds=(0.01, 0.3), nu=1.5)
    return get_gp_datasets(
        kernels,
        is_vary_kernel_hyp=True,  # sample the kernel hyperparameters for each example
        n_samples=50000,  # number of different context-target sets
        n_points=128,  # size of target U context set for each sample
        is_reuse_across_epochs=False,  # never see the same example twice
    )
# create the dataset and store it (if not already done)
(datasets, _, __,) = get_datasets_variable_hyp_gp()
n_datasets = len(datasets)
fig, axes = plt.subplots(n_datasets, 1, figsize=(11, 5 * n_datasets), squeeze=False)
for i, (k, dataset) in enumerate(datasets.items()):
    plot_dataset_samples_1d(dataset, title=k.replace("_", " "), ax=axes.flatten()[i], n_samples=5)
Image Datasets¶
We will be using the following datasets:
MNIST [LBB+98]
CelebA [LLWT18] (rescaled to \(32 \times 32\) pixels)
Zero Shot Multi MNIST [GBF+20] (ZSMM).
Note
We will follow [FBG+20] for ZSMM. Namely, train on translated MNIST and test on a larger canvas with multiple digits (ZSMM).
from utils.data import get_train_test_img_dataset
datasets = dict()
_, datasets["CelebA32"] = get_train_test_img_dataset("celeba32")
_, datasets["MNIST"] = get_train_test_img_dataset("mnist")
from utils.visualize import plot_dataset_samples_imgs
n_datasets = len(datasets)
fig, axes = plt.subplots(1, n_datasets, figsize=(5 * n_datasets, 5))
for i, (k, dataset) in enumerate(datasets.items()):
    plot_dataset_samples_imgs(dataset, title=k, ax=axes[i])
Let us now visualize some training and testing samples from ZSMM:
datasets_zsmm = dict()
datasets_zsmm["ZSMM Train"], datasets_zsmm["ZSMM Test"] = get_train_test_img_dataset("zsmms")
from utils.visualize import plot_dataset_samples_imgs
n_datasets = len(datasets_zsmm)
fig, axes = plt.subplots(1, n_datasets, figsize=(5 * n_datasets, 5))
for i, (k, dataset) in enumerate(datasets_zsmm.items()):
    plot_dataset_samples_imgs(dataset, title=k, ax=axes[i])