try:
    import astra
except ModuleNotFoundError:
    %pip install -U git+https://github.com/sustainability-lab/ASTRA
Goals:
G1: Given probability distributions \(p\) and \(q\), find the divergence (a measure of dissimilarity) between them
Let us first look at G1. Look at the illustration below. We have a normal distribution \(p\) and two other normal distributions \(q_1\) and \(q_2\). Which of \(q_1\) and \(q_2\) would we consider closer to \(p\)? \(q_2\), right?
To quantify this notion of closeness, we use the KL-divergence \(D_{KL}(a || b)\) between two distributions \(a\) and \(b\), defined as \(D_{KL}(a || b) = \mathbb{E}_{z\sim a}[\log a(z) - \log b(z)]\). Note that it is not symmetric in its arguments, so it is not a true distance metric.
For G1, we can say \(q_2\) is closer to \(p\) compared to \(q_1\) as:
\(D_{KL}(q_2 || p) \lt D_{KL}(q_1 || p)\)
For the above example, we have the values as \(D_{KL}(q_2|| p) = 0.07\) and \(D_{KL}(q_1|| p)= 0.35\)
G2: Assuming \(p\) to be fixed, can we find optimal parameters of \(q\) to make it as close as possible to \(p\)?
The following GIF shows the process of finding the optimal set of parameters for a normal distribution \(q\) so that it becomes as close as possible to \(p\). This is equivalent to minimizing \(D_{KL}(q || p)\).
The following GIF shows the above but for a two-dimensional distribution.
G3: Finding the “distance” between two distributions from different families
The image below shows the KL-divergence between distribution 1 (a mixture of Gaussians) and distribution 2 (a Gaussian)
G4: Optimizing the “distance” between two distributions from different families
The GIF below shows the optimization of the KL-divergence between distribution 1 (a mixture of Gaussians) and distribution 2 (a Gaussian)
G5: Approximating the KL-divergence
G6: Implementing variational inference for linear regression
Basic Imports
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import seaborn as sns
from ipywidgets import interact
from astra.torch.utils import train_fn
from astra.torch.models import AstraModel
dist = torch.distributions

# Default figure size for matplotlib
plt.rcParams['figure.figsize'] = (6, 4)
%matplotlib inline
%config InlineBackend.figure_format='retina'
Creating distributions
Creating \(p\sim\mathcal{N}(1.00, 4.00)\)
p = dist.Normal(1, 4)

z_values = torch.linspace(-5, 15, 200)
prob_values_p = torch.exp(p.log_prob(z_values))

plt.plot(z_values, prob_values_p, label=r"$p\sim\mathcal{N}(1.00, 4.00)$")
sns.despine()
plt.legend()
plt.xlabel("x")
plt.ylabel("PDF")
Text(0, 0.5, 'PDF')
Creating \(q\sim\mathcal{N}(loc, scale)\)
def create_q(loc, scale):
    return dist.Normal(loc, scale)
Generating a few \(q\)s for different location and scale values
q = {}
q[(0, 1)] = create_q(0.0, 1.0)

for loc in [0, 1]:
    for scale in [1, 2]:
        q[(loc, scale)] = create_q(float(loc), float(scale))
=r"$p\sim\mathcal{N}(1.00, 4.00)$", lw=3)
plt.plot(z_values, prob_values_p, label
plt.plot(
z_values,0.0, 2.0).log_prob(z_values)),
torch.exp(create_q(=r"$q_1\sim\mathcal{N}(0.00, 2.00)$",
label=2,
lw="--",
linestyle
)
plt.plot(
z_values,1.0, 3.0).log_prob(z_values)),
torch.exp(create_q(=r"$q_2\sim\mathcal{N}(1.00, 3.00)$",
label=2,
lw="-.",
linestyle
)
=(1.04, 1), borderaxespad=0)
plt.legend(bbox_to_anchor"x")
plt.xlabel("PDF")
plt.ylabel(
sns.despine()
plt.tight_layout()
plt.savefig("dkl.png",
=150,
dpi )
Computing KL-divergence
q_0_2_dkl = dist.kl_divergence(create_q(0.0, 2.0), p)
q_1_3_dkl = dist.kl_divergence(create_q(1.0, 3.0), p)
print(f"D_KL (q(0, 2)||p) = {q_0_2_dkl:0.2f}")
print(f"D_KL (q(1, 3)||p) = {q_1_3_dkl:0.2f}")
D_KL (q(0, 2)||p) = 0.35
D_KL (q(1, 3)||p) = 0.07
As mentioned earlier, \(q_2\sim\mathcal{N}(1.00, 3.00)\) is clearly closer to \(p\).
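As a quick sanity check (not part of the original code), the KL-divergence between two univariate Normals has a closed form, \(D_{KL}(\mathcal{N}(\mu_1, \sigma_1^2)\,||\,\mathcal{N}(\mu_2, \sigma_2^2)) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}\), and a few lines of Python reproduce the numbers above:

import math

def normal_kl(m1, s1, m2, s2):
    # Closed-form KL between N(m1, s1^2) and N(m2, s2^2)
    return math.log(s2 / s1) + (s1**2 + (m1 - m2) ** 2) / (2 * s2**2) - 0.5

print(f"{normal_kl(0.0, 2.0, 1.0, 4.0):0.2f}")  # 0.35, matches D_KL(q(0, 2)||p)
print(f"{normal_kl(1.0, 3.0, 1.0, 4.0):0.2f}")  # 0.07, matches D_KL(q(1, 3)||p)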
Optimizing the KL-divergence between q and p
We could create a grid of (loc, scale) pairs and find the best, as shown below.
=r"$p\sim\mathcal{N}(1.00, 4.00)$", lw=5)
plt.plot(z_values, prob_values_p, label
for loc in [0, 1]:
for scale in [1, 2]:
= q[(loc, scale)]
q_d = dist.kl_divergence(q[(loc, scale)], p)
kl_d
plt.plot(
z_values,
torch.exp(q_d.log_prob(z_values)),=rf"$q\sim\mathcal{{N}}({loc}, {scale})$"
label+ "\n"
+ rf"$D_{{KL}}(q||p)$ = {kl_d:0.2f}",
)=(1.04, 1), borderaxespad=0)
plt.legend(bbox_to_anchor"x")
plt.xlabel("PDF")
plt.ylabel( sns.despine()
Or, we could use continuous optimization to find the best loc and scale parameters for q.
class TrainableNormal(torch.nn.Module):
    def __init__(self, loc, scale):
        super().__init__()
        self.loc = nn.Parameter(torch.tensor(loc))
        # Store the scale via an inverse-softplus so the learnable
        # parameter is unconstrained while the scale stays positive
        raw_scale = torch.log(torch.expm1(torch.tensor(scale)))
        self.raw_scale = nn.Parameter(raw_scale)

    def forward(self, X=None):
        scale = torch.functional.F.softplus(self.raw_scale)
        return dist.Normal(loc=self.loc, scale=scale, validate_args=False)

    def __repr__(self):
        return f"TrainableNormal(loc={self.loc.item():0.2f}, scale={torch.functional.F.softplus(self.raw_scale).item():0.2f})"
def loss_fn(model_output, output, p):
    to_learn_dist = model_output
    return dist.kl_divergence(to_learn_dist, p)
loc = 4.0
scale = 0.5
model = TrainableNormal(loc=loc, scale=scale)
loc_array = [torch.tensor(loc)]
scale_array = [torch.tensor(scale)]
loss_array = [loss_fn(model(), None, p).item()]
opt = torch.optim.Adam(model.parameters(), lr=0.05)
epochs = 100
for i in range(4):
    (iter_losses, epoch_losses), state_dict_list = train_fn(model, loss_fn=loss_fn, loss_fn_kwargs={"p": p}, optimizer=opt, epochs=epochs, verbose=False, return_state_dict=True)
    loss_array.extend(epoch_losses)
    loc_array.extend([state_dict["loc"] for state_dict in state_dict_list])
    scale_array.extend([torch.functional.F.softplus(state_dict["raw_scale"]) for state_dict in state_dict_list])
    print(
        f"Iteration: {i*epochs}, Loss: {epoch_losses[-1]:0.2f}, Loc: {loc_array[-1]:0.2f}, Scale: {scale_array[-1]:0.2f}"
    )
Iteration: 0, Loss: 0.05, Loc: 1.06, Scale: 3.15
Iteration: 100, Loss: 0.00, Loc: 1.00, Scale: 3.92
Iteration: 200, Loss: 0.00, Loc: 1.00, Scale: 4.00
Iteration: 300, Loss: 0.00, Loc: 1.00, Scale: 4.00
plt.plot(torch.tensor(scale_array), label='scale')
plt.plot(torch.tensor(loc_array), label='loc')
plt.plot(torch.tensor(loss_array), label='loss')
plt.legend()
plt.xlabel("Iteration")
Text(0.5, 0, 'Iteration')
After training, we recover loc and scale values very close to those of \(p\).
Animation!
def animate(i):
    fig = plt.figure(tight_layout=True, figsize=(8, 4))
    ax = fig.gca()
    ax.clear()
    ax.plot(z_values, prob_values_p, label=r"$p\sim\mathcal{N}(1.00, 4.00)$", lw=5)
    to_learn_q = dist.Normal(loc=loc_array[i], scale=scale_array[i])
    loss = loss_array[i]
    loc = loc_array[i].item()
    scale = scale_array[i].item()

    ax.plot(
        z_values.numpy(),
        torch.exp(to_learn_q.log_prob(z_values)).numpy(),
        label=rf"$q\sim \mathcal{{N}}({loc:0.2f}, {scale:0.2f})$",
    )

    ax.set_title(rf"Iteration: {i}, $D_{{KL}}(q||p)$: {loss:0.2f}")
    ax.legend(bbox_to_anchor=(1.1, 1), borderaxespad=0)
    ax.set_ylim((0, 1))
    ax.set_xlim((-5, 15))

    ax.set_xlabel("x")
    ax.set_ylabel("PDF")
    sns.despine()
@interact(i=(0, len(loc_array) - 1))
def show_animation(i=0):
    animate(i)
Finding the KL divergence for two distributions from different families
Let us rework our example with p coming from a mixture of Gaussians and q being a Normal distribution.
p_s = dist.MixtureSameFamily(
    mixture_distribution=dist.Categorical(probs=torch.tensor([0.5, 0.5])),
    component_distribution=dist.Normal(
        loc=torch.tensor([-0.2, 1]), scale=torch.tensor([0.4, 0.5])  # One for each component.
    ),
)
p_s
MixtureSameFamily(
Categorical(probs: torch.Size([2]), logits: torch.Size([2])),
Normal(loc: torch.Size([2]), scale: torch.Size([2])))
plt.plot(z_values, torch.exp(p_s.log_prob(z_values)))
sns.despine()
Let us create two Normal distributions q_1 and q_2 and plot them to see which looks closer to p_s.
q_1 = create_q(3, 1)
q_2 = create_q(3, 4.5)
prob_values_p_s = torch.exp(p_s.log_prob(z_values))
prob_values_q_1 = torch.exp(q_1.log_prob(z_values))
prob_values_q_2 = torch.exp(q_2.log_prob(z_values))

plt.plot(z_values, prob_values_p_s, label=r"MOG")
plt.plot(z_values, prob_values_q_1, label=r"$q_1\sim\mathcal{N} (3, 1.0)$")
plt.plot(z_values, prob_values_q_2, label=r"$q_2\sim\mathcal{N} (3, 4.5)$")
sns.despine()
plt.legend()
plt.xlabel("x")
plt.ylabel("PDF")
plt.tight_layout()
plt.savefig(
    "dkl-different.png",
    dpi=150,
)
try:
    dist.kl_divergence(q_1, p_s)
except NotImplementedError:
    print(f"KL divergence not implemented between {q_1.__class__} and {p_s.__class__}")
KL divergence not implemented between <class 'torch.distributions.normal.Normal'> and <class 'torch.distributions.mixture_same_family.MixtureSameFamily'>
As we see above, we cannot compute this KL divergence directly. The core idea now is to use Monte Carlo sampling to approximate the expectation in the definition: \(D_{KL}(q || p) = \mathbb{E}_{z\sim q}[\log q(z) - \log p(z)] \approx \frac{1}{N}\sum_{i=1}^{N}\left[\log q(z_i) - \log p(z_i)\right]\) with \(z_i \sim q\). The following function does exactly that.
def kl_via_sampling(q, p, n_samples=100000):
    # Get samples from q
    sample_set = q.sample([n_samples])
    # Use the definition of KL-divergence
    return torch.mean(q.log_prob(sample_set) - p.log_prob(sample_set))
dist.kl_divergence(q_1, q_2)
tensor(1.0288)
kl_via_sampling(q_1, q_2)
tensor(1.0287)
kl_via_sampling(q_1, p_s), kl_via_sampling(q_2, p_s)
(tensor(9.4532), tensor(44.9828))
As we can see from the KL divergence calculations, q_1 is closer to our Gaussian mixture distribution.
Optimizing the KL divergence for two distributions from different families
We saw that we can calculate the KL divergence between two different distribution families via sampling. But, as we did earlier, will we be able to optimize the parameters of our target surrogate distribution? Not directly: the sampling step breaks the gradient flow. However, there is still a way – by reparameterization!
Our surrogate q in this case is parameterized by loc and scale. The key idea is to draw samples \(\epsilon\) from a standard normal distribution (loc=0, scale=1) and then apply the affine transformation \(z = \text{loc} + \text{scale} \cdot \epsilon\), so that \(z\) is distributed according to q. See my other post on sampling from normal distribution to understand this better.

The loss can now be thought of as a deterministic function of loc and scale.
Backpropagation through sampling
Version 1: Default
import torch.nn as nn

loc = nn.Parameter(torch.tensor(2.0, requires_grad=True))
scale = nn.Parameter(torch.tensor(0.5, requires_grad=True))
q = dist.Normal(loc, scale)
# sample() detaches the computation graph, so backpropagating through s fails
s = q.sample()

try:
    s.backward()
    print(f"Gradient of loc: {loc.grad}")
    print(f"Gradient of scale: {scale.grad}")
except Exception as e:
    print(f"Error: {e}")
Error: element 0 of tensors does not require grad and does not have a grad_fn
from torchviz import make_dot

make_dot(s, params={"loc": loc, "scale": scale}, show_attrs=True, show_saved=True)
loc = torch.tensor(2.0, requires_grad=True)
scale = torch.tensor(0.5, requires_grad=True)
q = dist.Normal(loc, scale)
s = q.sample()
try:
    # log_prob(s) is differentiable w.r.t. loc and scale even though s itself is detached
    q.log_prob(s).backward()
    print(f"Gradient of loc: {loc.grad}")
    print(f"Gradient of scale: {scale.grad}")
except Exception as e:
    print(f"Error: {e}")
Gradient of loc: 2.0397958755493164
Gradient of scale: 0.0803835391998291
={"loc": loc, "scale": scale},
make_dot(q.log_prob(s), params=False,
show_attrs=False) show_saved
Version 2: Using the reparameterization trick manually
torch.manual_seed(42)
std_normal = dist.Normal(0, 1)
loc = torch.tensor(2.0, requires_grad=True)
scale = torch.tensor(0.5, requires_grad=True)

eps = std_normal.sample()
z = loc + scale * eps
try:
    q.log_prob(z).backward()
    print(f"Gradient of loc: {loc.grad}")
    print(f"Gradient of scale: {scale.grad}")
    print(f"Sample from standard normal: {eps}")
except Exception as e:
    print(f"Error: {e}")
Gradient of loc: -0.6733808517456055
Gradient of scale: -0.22672083973884583
Sample from standard normal: 0.33669036626815796
={"loc": loc, "scale": scale},
make_dot(z, params=False, show_saved=False) show_attrs
Version 3: Using rsample instead of sample
torch.manual_seed(42)

loc = torch.tensor(2.0, requires_grad=True)
scale = torch.tensor(0.5, requires_grad=True)
q = dist.Normal(loc, scale)
s = q.rsample()

try:
    q.log_prob(s).backward()
    print(f"Gradient of loc: {loc.grad}")
    print(f"Gradient of scale: {scale.grad}")
except Exception as e:
    print(f"Error: {e}")
Gradient of loc: 0.0
Gradient of scale: -1.9999998807907104
Original formulation (not differentiable): \(\nabla_{\phi}\,\mathbb{E}_{z \sim q_{\phi}}\left[f(z)\right]\), where the gradient cannot pass through the random draw \(z \sim q_{\phi}\).

Reparameterized formulation (differentiable): \(\nabla_{\phi}\,\mathbb{E}_{\epsilon \sim \mathcal{N}(0, 1)}\left[f(\mu + \sigma\epsilon)\right]\), where the randomness is pushed into \(\epsilon\) and the gradient flows through the deterministic transform.
n_samples = 5000

def original_loss_rsample(model_output, output, p_s):
    q = model_output
    sample_set = q.rsample([n_samples])
    return torch.mean(q.log_prob(sample_set) - p_s.log_prob(sample_set))

def loss(model_output, output, p_s):
    q = model_output
    loc = q.loc
    scale = q.scale

    std_normal = dist.Normal(loc=0.0, scale=1.0)
    sample_set = std_normal.sample([n_samples])
    sample_set = loc + scale * sample_set
    return torch.mean(q.log_prob(sample_set) - p_s.log_prob(sample_set))
Having defined the losses above (the rsample version and the manual reparameterization are equivalent for a Normal, since rsample internally draws \(\epsilon \sim \mathcal{N}(0, 1)\) and returns loc + scale \(\cdot\) \(\epsilon\)), we can now optimize loc and scale to minimize the KL-divergence.
def optimize_loc_scale(loss_fn):
    loc = 6.0
    scale = 0.1
    model = TrainableNormal(loc=loc, scale=scale)

    loc_array = [torch.tensor(loc)]
    scale_array = [torch.tensor(scale)]
    loss_array = [loss_fn(model(), None, p_s).item()]
    opt = torch.optim.Adam(model.parameters(), lr=0.05)
    epochs = 500
    for i in range(7):
        (iter_losses, epoch_losses), state_dict_list = train_fn(model, loss_fn=loss_fn, loss_fn_kwargs={"p_s": p_s}, optimizer=opt, epochs=epochs, verbose=False, return_state_dict=True)
        loss_array.extend(epoch_losses)
        loc_array.extend([state_dict["loc"] for state_dict in state_dict_list])
        scale_array.extend([torch.functional.F.softplus(state_dict["raw_scale"]) for state_dict in state_dict_list])
        print(
            f"Iteration: {i*epochs}, Loss: {epoch_losses[-1]:0.2f}, Loc: {loc_array[-1]:0.2f}, Scale: {scale_array[-1]:0.2f}"
        )
    return loc_array, scale_array, loss_array
loc_array, scale_array, loss_array = optimize_loc_scale(original_loss_rsample)
Iteration: 0, Loss: 0.05, Loc: 0.43, Scale: 0.72
Iteration: 500, Loss: 0.06, Loc: 0.43, Scale: 0.72
Iteration: 1000, Loss: 0.05, Loc: 0.43, Scale: 0.72
Iteration: 1500, Loss: 0.05, Loc: 0.43, Scale: 0.72
Iteration: 2000, Loss: 0.06, Loc: 0.43, Scale: 0.71
Iteration: 2500, Loss: 0.05, Loc: 0.43, Scale: 0.72
Iteration: 3000, Loss: 0.06, Loc: 0.43, Scale: 0.74
loc_array, scale_array, loss_array = optimize_loc_scale(loss)
Iteration: 0, Loss: 0.06, Loc: 0.43, Scale: 0.72
Iteration: 500, Loss: 0.07, Loc: 0.43, Scale: 0.72
Iteration: 1000, Loss: 0.06, Loc: 0.43, Scale: 0.71
Iteration: 1500, Loss: 0.05, Loc: 0.43, Scale: 0.73
Iteration: 2000, Loss: 0.05, Loc: 0.43, Scale: 0.70
Iteration: 2500, Loss: 0.06, Loc: 0.43, Scale: 0.71
Iteration: 3000, Loss: 0.05, Loc: 0.42, Scale: 0.72
q_s = TrainableNormal(loc=loc_array[-1], scale=scale_array[-1])()
q_s
/tmp/ipykernel_1389396/4271825319.py:4: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
self.loc = nn.Parameter(torch.tensor(loc))
/tmp/ipykernel_1389396/4271825319.py:5: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
raw_scale = torch.log(torch.expm1(torch.tensor(scale)))
Normal(loc: Parameter containing:
tensor(0.4250, requires_grad=True), scale: 0.716407299041748)
prob_values_p_s = torch.exp(p_s.log_prob(z_values))
prob_values_q_s = torch.exp(q_s.log_prob(z_values))

plt.plot(z_values, prob_values_p_s.detach(), label=r"p")
plt.plot(z_values, prob_values_q_s.detach(), label=r"q")
sns.despine()
plt.legend()
plt.xlabel("x")
plt.ylabel("PDF")
Text(0, 0.5, 'PDF')
prob_values_p_s = torch.exp(p_s.log_prob(z_values))

def a(iteration):
    fig = plt.figure(tight_layout=True, figsize=(8, 4))
    ax = fig.gca()

    loc = loc_array[iteration]
    scale = scale_array[iteration]
    q_s = dist.Normal(loc=loc, scale=scale)

    prob_values_q_s = torch.exp(q_s.log_prob(z_values))

    ax.plot(z_values, prob_values_p_s, label=r"p")
    ax.plot(z_values, prob_values_q_s, label=r"q")
    ax.set_title(f"Iteration {iteration}, Loss: {loss_array[iteration]:0.2f}")
    ax.set_ylim((-0.05, 1.05))
    ax.legend()
    sns.despine()
@interact(i=(0, len(loc_array) - 1))
def show_animation(i=0):
    return a(i)
="loc")
plt.plot(loc_array, label="scale")
plt.plot(scale_array, label"Iterations")
plt.xlabel(
sns.despine() plt.legend()
<matplotlib.legend.Legend at 0x7fa021101650>
ELBO
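A quick recap before the example (a standard identity, stated here for completeness): for data \(\mathcal{D}\) and parameter \(\theta\),

\[\log p(\mathcal{D}) = \underbrace{\mathbb{E}_{q(\theta)}\left[\log p(\theta, \mathcal{D}) - \log q(\theta)\right]}_{\text{ELBO}} + D_{KL}\left(q(\theta)\,||\,p(\theta \mid \mathcal{D})\right)\]

Since \(\log p(\mathcal{D})\) does not depend on \(q\), maximizing the ELBO is equivalent to minimizing the KL divergence to the true posterior, without ever evaluating the intractable evidence.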
Coin toss example
def log_prior(theta):
    return dist.Beta(5, 3, validate_args=False).log_prob(theta)

def log_likelihood(theta, data):
    return dist.Bernoulli(theta, validate_args=False).log_prob(data).sum()

def log_joint(theta, data):
    return log_likelihood(theta, data) + log_prior(theta)

def true_log_posterior(theta, data):
    return dist.Beta(5 + data.sum(), 3 + len(data) - data.sum(), validate_args=False).log_prob(theta)
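The true_log_posterior above uses Beta-Bernoulli conjugacy (a standard result): with a \(\text{Beta}(5, 3)\) prior and \(n\) Bernoulli observations containing \(\sum y_i\) heads, the exact posterior is \(\text{Beta}(5 + \sum y_i,\ 3 + n - \sum y_i)\). This gives us a ground truth to compare the variational approximation against.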
theta_values = torch.linspace(0, 1, 200)

# Generate data: 17 heads out of 30 tosses
data = torch.ones(17)
data = torch.cat([data, torch.zeros(13)])
print(data.shape)
try:
    ll = log_likelihood(theta_values, data)
except Exception as e:
    print(f"Error: {e}")
torch.Size([30])
Error: The size of tensor a (200) must match the size of tensor b (30) at non-singleton dimension 0
try:
    ll = torch.vmap(lambda theta: log_likelihood(theta, data))(theta_values)
except Exception as e:
    print(f"Error: {e}")
# Plot prior
fig, ax = plt.subplots(2, 1, sharex=True, figsize=(4, 3))
ax[0].plot(theta_values, torch.exp(log_prior(theta_values)), label="Prior", color="C0")

# Plot likelihood
ax[1].plot(theta_values, ll, label="Likelihood", color="C1")

# Plot posterior
ax[0].plot(theta_values, torch.exp(true_log_posterior(theta_values, data)), label="Posterior", color="C2")

ax[1].set_xlabel(r"$\theta$")
# Legend outside the plot
fig.legend(bbox_to_anchor=(1.4, 1), borderaxespad=0)
<matplotlib.legend.Legend at 0x7fa020448490>
Now, let us create a target distribution q
\[q(\theta) = \text{Beta}(\alpha, \beta)\]
def neg_elbo(log_joint_fn, q, data, n_mc_samples=100):
    # Sample from q
    theta = q.rsample([n_mc_samples])
    # Compute the log probabilities
    log_q_theta = q.log_prob(theta)
    log_joint = torch.vmap(lambda theta: log_joint_fn(theta, data))(theta)
    # Compute ELBO as the difference of log probabilities
    elbo = log_joint - log_q_theta
    return -elbo.mean()
neg_elbo(log_joint, dist.Beta(0.1, 2), data=data)
tensor(188.3255)
neg_elbo(log_joint, dist.Beta(5 + data.sum(), 3 + len(data) - data.sum()), data=data)
tensor(21.3972)
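Note that when \(q\) equals the true posterior, the KL term vanishes and the ELBO equals the log evidence \(\log p(\mathcal{D})\). So the value of roughly 21.4 above for the conjugate Beta is (minus) the log marginal likelihood, and no choice of \(q\) can push the negative ELBO below it.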
alpha = 45.0
beta = 45.0

class TrainableBeta(AstraModel):
    def __init__(self, alpha, beta):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha))
        self.beta = nn.Parameter(torch.tensor(beta))

    def forward(self, X=None):
        return dist.Beta(self.alpha, self.beta, validate_args=False)

def loss_fn(model_output, output):
    q = model_output
    data = output
    return neg_elbo(log_joint, q, data)
alpha_array = []
beta_array = []
loss_array = []

model = TrainableBeta(alpha, beta)
opt = torch.optim.Adam(model.parameters(), lr=0.05)

epochs = 400
for i in range(10):
    (iter_losses, epoch_losses), state_dict_list = train_fn(model, output=data, loss_fn=loss_fn, optimizer=opt, epochs=epochs, verbose=False, return_state_dict=True)
    loss_array.extend(epoch_losses)
    alpha_array.extend([state_dict["alpha"] for state_dict in state_dict_list])
    beta_array.extend([state_dict["beta"] for state_dict in state_dict_list])
    print(model.alpha.item(), model.beta.item(), loss_array[-1])
50.2536506652832 36.2883186340332 21.54409408569336
46.11083984375 33.47718811035156 21.542587280273438
40.47787857055664 29.384883880615234 21.4658203125
34.14217758178711 24.53378677368164 21.41468048095703
28.26129150390625 20.57001304626465 21.416034698486328
24.69085121154785 18.057777404785156 21.396530151367188
23.145231246948242 16.898151397705078 21.405750274658203
22.396196365356445 16.42294692993164 21.39972496032715
22.189348220825195 16.165355682373047 21.397798538208008
22.17797088623047 15.884150505065918 21.395658493041992
# Plot the learned posterior against the true posterior
fig, ax = plt.subplots(1, 1, figsize=(4, 3))
alpha = model.alpha.item()
beta = model.beta.item()

with torch.no_grad():
    q = model()
    ax.plot(theta_values, torch.exp(true_log_posterior(theta_values, data)), label=fr"True posterior: $Beta({data.sum()+5}, {len(data)+3-data.sum()})$")
    ax.plot(theta_values, torch.exp(q.log_prob(theta_values)), ls='--', label=fr"Learned posterior: $Beta({alpha:0.2f}, {beta:0.2f})$")
plt.legend(bbox_to_anchor=(1.1, 1), borderaxespad=0)
<matplotlib.legend.Legend at 0x7fa011e48cd0>
As a cross-check, let us do the same inference with Pyro's stochastic variational inference (SVI).

import pyro
import pyro.distributions as py_dist
from pyro.infer import SVI, Trace_ELBO

def model(data):
    theta = pyro.sample("theta", py_dist.Beta(torch.tensor(5.0), torch.tensor(3.0)))
    with pyro.plate("data", len(data)):
        pyro.sample("obs", py_dist.Bernoulli(theta), obs=data)

def guide(data):
    alpha_q = pyro.param("alpha_q", torch.tensor(45.0))
    beta_q = pyro.param("beta_q", torch.tensor(45.0))
    pyro.sample("theta", py_dist.Beta(alpha_q, beta_q))
pyro.clear_param_store()

# Optimizer
adam_params = {"lr": 0.05}
optimizer = pyro.optim.Adam(adam_params)

# SVI; num_particles is the number of Monte Carlo samples used per ELBO/gradient estimate
t = Trace_ELBO(num_particles=5)
svi = SVI(model, guide, optimizer, loss=t)
t.num_particles
5
svi = SVI(model, guide, optimizer, loss=t)
for step in range(8001):
    loss = svi.step(data)
    if step % 400 == 0:
        print(f"Step {step} | Loss: {loss:0.2f} | alpha: {pyro.param('alpha_q').item():0.2f} | beta: {pyro.param('beta_q').item():0.2f}")
Step 0 | Loss: 21.66 | alpha: 45.05 | beta: 44.95
Step 400 | Loss: 21.69 | alpha: 50.92 | beta: 35.95
Step 800 | Loss: 21.60 | alpha: 47.69 | beta: 34.25
Step 1200 | Loss: 21.53 | alpha: 44.25 | beta: 32.45
Step 1600 | Loss: 21.29 | alpha: 41.77 | beta: 29.58
Step 2000 | Loss: 21.55 | alpha: 38.87 | beta: 27.68
Step 2400 | Loss: 21.43 | alpha: 35.87 | beta: 26.30
Step 2800 | Loss: 21.46 | alpha: 33.70 | beta: 24.43
Step 3200 | Loss: 21.31 | alpha: 31.49 | beta: 23.31
Step 3600 | Loss: 21.51 | alpha: 29.69 | beta: 21.95
Step 4000 | Loss: 21.41 | alpha: 28.48 | beta: 20.82
Step 4400 | Loss: 21.46 | alpha: 27.65 | beta: 19.81
Step 4800 | Loss: 21.31 | alpha: 26.54 | beta: 18.74
Step 5200 | Loss: 21.43 | alpha: 25.50 | beta: 18.48
Step 5600 | Loss: 21.32 | alpha: 24.82 | beta: 18.13
Step 6000 | Loss: 21.39 | alpha: 24.52 | beta: 17.72
Step 6400 | Loss: 21.43 | alpha: 23.89 | beta: 17.56
Step 6800 | Loss: 21.43 | alpha: 24.07 | beta: 17.03
Step 7200 | Loss: 21.42 | alpha: 23.48 | beta: 17.08
Step 7600 | Loss: 21.42 | alpha: 23.33 | beta: 16.81
Step 8000 | Loss: 21.41 | alpha: 22.91 | beta: 16.75
# Grab the learned variational parameters
alpha_q = pyro.param("alpha_q").item()
beta_q = pyro.param("beta_q").item()
alpha_q, beta_q
(22.907527923583984, 16.748062133789062)
=fr"True posterior: $Beta({data.sum()+5}, {len(data)+3-data.sum()})$")
plt.plot(theta_values, torch.exp(true_log_posterior(theta_values, data)), label=fr"Learned posterior (Our): $Beta({alpha:0.2f}, {beta:0.2f})$")
plt.plot(theta_values, torch.exp(q.log_prob(theta_values)), label=fr"Learned posterior (Pyro): $Beta({alpha_q:0.2f}, {beta_q:0.2f})$")
plt.plot(theta_values, torch.exp(py_dist.Beta(alpha_q, beta_q).log_prob(theta_values)), label# legend outside the plot
=(1.1, 1), borderaxespad=0) plt.legend(bbox_to_anchor
<matplotlib.legend.Legend at 0x7fa011ce8190>
Linear regression
N = 200
x_lin = torch.linspace(-5, 5, N)

f_true = lambda x: 3*x + 4

eps = 5*dist.Normal(0, 1).sample([N])

y = f_true(x_lin) + eps

plt.plot(x_lin, f_true(x_lin), label=r"$f(x)$")
plt.plot(x_lin, y, "o", label=r"$y$", alpha=0.5)
def log_prior(theta):
    return dist.MultivariateNormal(torch.zeros(2),
                                   torch.eye(2),
                                   validate_args=False).log_prob(theta)

def log_likelihood(theta, x, y):
    return dist.MultivariateNormal(theta[0] + theta[1] * x, torch.eye(N),
                                   validate_args=False).log_prob(y).sum()

def log_joint(theta, x, y):
    return log_likelihood(theta, x, y) + log_prior(theta)
log_prior(torch.tensor([0.0, 0.0])), log_prior(torch.tensor([4.0, 3.0])), log_prior(torch.tensor([4.0, 0.0]))
(tensor(-1.8379), tensor(-14.3379), tensor(-9.8379))
log_likelihood(torch.tensor([0.0, 0.0]), x_lin, y), log_likelihood(torch.tensor([4.0, 3.0]), x_lin, y), log_likelihood(torch.tensor([4.0, 0.0]), x_lin, y)
(tensor(-11948.3555), tensor(-2697.1028), tensor(-10563.2061))
def neg_elbo_x_y(log_joint_fn, q, x, y, n_mc_samples=100):
    # Sample from q
    theta = q.rsample([n_mc_samples])
    # Compute the log probabilities; the sum over the last dimension combines
    # the independent per-parameter log-probs of the factorized (mean-field)
    # Normal surrogate into a joint log q(theta)
    log_q_theta = q.log_prob(theta).sum(dim=-1)
    log_joint = torch.vmap(lambda theta: log_joint_fn(theta, x, y))(theta)
    # Compute ELBO as the difference of log probabilities
    elbo = log_joint - log_q_theta
    return -elbo.mean()
neg_elbo_x_y(log_joint, dist.Normal(torch.zeros(2), torch.ones(2), validate_args=False), x_lin, y), neg_elbo_x_y(log_joint, dist.Normal(torch.tensor([4.0, 3.0]), torch.ones(2), validate_args=False), x_lin, y), neg_elbo_x_y(log_joint, dist.Normal(torch.tensor([4.0, 0.0]), torch.ones(2), validate_args=False), x_lin, y)
(tensor(12558.4824), tensor(3719.2258), tensor(10860.5801))
def loss_fn(model_output, output, x):
    q = model_output
    y = output
    return neg_elbo_x_y(log_joint, q, x, y)
loc_array = []
scale_array = []
loss_array = []
model = TrainableNormal(loc=[0.0, 0.0], scale=[0.1, 0.1])
opt = torch.optim.Adam(model.parameters(), lr=0.05)

epochs = 400
for i in range(10):
    (iter_losses, epoch_losses), state_dict_list = train_fn(model, output=y, loss_fn=loss_fn, loss_fn_kwargs={"x": x_lin}, optimizer=opt, epochs=epochs, verbose=False, return_state_dict=True)
    loss_array.extend(epoch_losses)
    loc_array.extend([state_dict["loc"] for state_dict in state_dict_list])
    scale_array.extend([torch.functional.F.softplus(state_dict["raw_scale"]) for state_dict in state_dict_list])
    print(
        f"Iteration: {i*epochs}, Loss: {epoch_losses[-1]:0.2f}, Loc: {loc_array[-1]}, Scale: {scale_array[-1]}"
    )
Iteration: 0, Loss: 2705.07, Loc: tensor([3.7121, 3.0564]), Scale: tensor([0.0711, 0.0252])
Iteration: 400, Loss: 2705.07, Loc: tensor([3.7135, 3.0555]), Scale: tensor([0.0707, 0.0244])
Iteration: 800, Loss: 2705.07, Loc: tensor([3.7129, 3.0558]), Scale: tensor([0.0708, 0.0244])
Iteration: 1200, Loss: 2705.07, Loc: tensor([3.7117, 3.0553]), Scale: tensor([0.0709, 0.0242])
Iteration: 1600, Loss: 2705.07, Loc: tensor([3.7119, 3.0540]), Scale: tensor([0.0721, 0.0243])
Iteration: 2000, Loss: 2705.08, Loc: tensor([3.7177, 3.0562]), Scale: tensor([0.0685, 0.0244])
Iteration: 2400, Loss: 2705.07, Loc: tensor([3.7122, 3.0555]), Scale: tensor([0.0722, 0.0243])
Iteration: 2800, Loss: 2705.08, Loc: tensor([3.7158, 3.0519]), Scale: tensor([0.0714, 0.0242])
Iteration: 3200, Loss: 2705.06, Loc: tensor([3.7078, 3.0541]), Scale: tensor([0.0694, 0.0248])
Iteration: 3600, Loss: 2705.09, Loc: tensor([3.7101, 3.0551]), Scale: tensor([0.0722, 0.0243])
# Predictive distribution using the learned posterior
n_samples = 1000
q = model()
q_samples = q.sample([n_samples])
print(q_samples.shape)

plt.plot(x_lin, f_true(x_lin), label=r"$f(x)$")
plt.plot(x_lin, y, "o", label=r"$y$", alpha=0.1)

y_hats = []
for i in range(n_samples):
    y_hat = q_samples[i, 0] + q_samples[i, 1] * x_lin
    y_hats.append(y_hat)

y_hat_mean = torch.stack(y_hats).mean(0)
y_hat_std = torch.stack(y_hats).std(0)

plt.plot(x_lin, y_hat_mean, label=r"$\mathbb{E}[f(x)]$", color='C2')
plt.fill_between(x_lin, y_hat_mean - y_hat_std, y_hat_mean + y_hat_std, alpha=0.3, color='C2')
plt.legend()
torch.Size([1000, 2])
<matplotlib.legend.Legend at 0x7fa021239750>
Learning the full covariance matrix
Reference: 1. https://ericmjl.github.io/notes/stats-ml/estimating-a-multivariate-gaussians-parameters-by-gradient-descent/
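Why this parameterization works (a standard linear-algebra fact): any matrix of the form \(U U^{\top}\) is symmetric positive semi-definite, so writing the covariance as \(\Sigma = U U^{\top}\) with \(U\) upper-triangular lets the optimizer search over unconstrained entries of \(U\) while \(\Sigma\) always remains a valid covariance matrix. One caveat worth noting about this draft: neg_elbo_x_y's q.log_prob(theta).sum(dim=-1) was written for the factorized Normal surrogate; for a MultivariateNormal, log_prob already returns one value per sample, so that sum collapses across samples instead of across parameters.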
class LearnableMVN(AstraModel):
    def __init__(self, loc, log_L_flat):
        super().__init__()
        self.loc = nn.Parameter(loc)
        self.log_L_flat = nn.Parameter(log_L_flat)

    def forward(self, X=None):
        L = torch.reshape(self.log_L_flat, (2, 2))
        U = torch.triu(L)
        cov = U@U.T
        return dist.MultivariateNormal(self.loc, cov)

    def __repr__(self):
        return f"LearnableMVN(loc={self.loc.item():0.2f}, log_L_flat={self.log_L_flat.item():0.2f})"
loc_array = []
cov_array = []
loss_array = []

model = LearnableMVN(loc=torch.tensor([0.0, 0.0]), log_L_flat=torch.tensor([0.1, 0.1, 0.1, 0.1]))
opt = torch.optim.Adam(model.parameters(), lr=0.05)

epochs = 400
for i in range(6):
    (iter_losses, epoch_losses), state_dict_list = train_fn(model, output=y, loss_fn=loss_fn, loss_fn_kwargs={"x": x_lin}, optimizer=opt, epochs=epochs, verbose=False, return_state_dict=True)
    loss_array.extend(epoch_losses)
    loc_array.extend([state_dict["loc"] for state_dict in state_dict_list])
    cov_array.extend([state_dict["log_L_flat"] for state_dict in state_dict_list])

    L = torch.reshape(cov_array[-1], (2, 2))
    cov = L@L.T
    print(
        f"Iteration: {i*epochs}, Loss: {epoch_losses[-1]:0.2f}, Loc: {loc_array[-1]}, Cov: {cov}"
    )
Iteration: 0, Loss: 2694.08, Loc: tensor([3.7033, 3.0549]), Cov: tensor([[0.4711, 0.0753],
[0.0753, 0.0767]])
Iteration: 400, Loss: 2692.83, Loc: tensor([3.7217, 3.0509]), Cov: tensor([[0.4738, 0.0665],
[0.0665, 0.0786]])
Iteration: 800, Loss: 2694.50, Loc: tensor([3.7100, 3.0540]), Cov: tensor([[0.5015, 0.0794],
[0.0794, 0.0594]])
Iteration: 1200, Loss: 2696.70, Loc: tensor([3.7286, 3.0559]), Cov: tensor([[0.4520, 0.0622],
[0.0622, 0.0897]])
Iteration: 1600, Loss: 2694.03, Loc: tensor([3.7115, 3.0484]), Cov: tensor([[0.5839, 0.0805],
[0.0805, 0.0751]])
Iteration: 2000, Loss: 2693.38, Loc: tensor([3.7027, 3.0596]), Cov: tensor([[0.4718, 0.0615],
[0.0615, 0.0643]])
# Predictive distribution using the learned posterior
n_samples = 1000
q = model()
q_samples = q.sample([n_samples])
print(q_samples.shape)

plt.plot(x_lin, f_true(x_lin), label=r"$f(x)$")
plt.plot(x_lin, y, "o", label=r"$y$", alpha=0.1)

y_hats = []
for i in range(n_samples):
    y_hat = q_samples[i, 0] + q_samples[i, 1] * x_lin
    y_hats.append(y_hat)

y_hat_mean = torch.stack(y_hats).mean(0)
y_hat_std = torch.stack(y_hats).std(0)

plt.plot(x_lin, y_hat_mean, label=r"$\mathbb{E}[f(x)]$", color='C2')
plt.fill_between(x_lin, y_hat_mean - y_hat_std, y_hat_mean + y_hat_std, alpha=0.3, color='C2')
plt.legend()
torch.Size([1000, 2])
<matplotlib.legend.Legend at 0x7fa01186f1d0>
# Intentional stop: the undefined name below raises a NameError to halt execution here
asdjasndjka
NameError: name 'asdjasndjka' is not defined
Stochastic VI [TODO]
When data is large
We can use stochastic gradient descent to optimize the ELBO, estimating its gradient from mini-batches of data; the standard estimator is given below.
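A sketch of the estimator (stated from the general VI literature, since the formula itself is missing from this draft): for a mini-batch \(B\) drawn from \(N\) data points,

\[\mathcal{L}(q) \approx \mathbb{E}_{q(\theta)}\left[\frac{N}{|B|}\sum_{i \in B} \log p(y_i \mid \theta) + \log p(\theta) - \log q(\theta)\right]\]

which is an unbiased estimate of the full-data ELBO.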
Now, given our linear regression problem setup, we want to maximize the ELBO. We can do so as follows (a code sketch follows this list). As a simple example, let us assume \(\theta \in R^2\):
- Assume some q. Say, a Normal distribution, so \(q\sim \mathcal{N}_2\)
- Draw samples from q. Say N samples.
- Initialize ELBO = 0.0
- For each sample:
  - Let us assume the drawn sample is \([\theta_1, \theta_2]^T\)
  - Compute the log-prob of the prior on \([\theta_1, \theta_2]^T\): lp = p.log_prob(θ1, θ2)
  - Compute the log-prob of the likelihood on \([\theta_1, \theta_2]^T\): ll = l.log_prob(θ1, θ2)
  - Compute the log-prob of q on \([\theta_1, \theta_2]^T\): lq = q.log_prob(θ1, θ2)
  - ELBO = ELBO + (ll + lp - lq)
- Return ELBO/N
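A minimal sketch of this procedure (illustrative; it assumes q supports rsample and that the log-prob functions take the whole \(\theta\) vector):

def elbo_estimate(q, log_prior_fn, log_lik_fn, n_samples=100):
    # Monte Carlo estimate of the ELBO following the recipe above
    elbo = 0.0
    for _ in range(n_samples):
        theta = q.rsample()        # one draw [theta_1, theta_2]
        lp = log_prior_fn(theta)   # log prior
        ll = log_lik_fn(theta)     # log likelihood
        lq = q.log_prob(theta)     # log q
        elbo = elbo + (ll + lp - lq)
    return elbo / n_samples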
prior = dist.Normal(loc=0., scale=1.)
p = dist.Normal(loc=5., scale=1.)

samples = p.sample([1000])
mu = torch.tensor(1.0, requires_grad=True)

def surrogate_sample(mu):
    std_normal = dist.Normal(loc=0., scale=1.)
    sample_std_normal = std_normal.sample()
    return mu + sample_std_normal

samples_from_surrogate = surrogate_sample(mu)
samples_from_surrogate
tensor(0.1894, grad_fn=<AddBackward0>)
def logprob_prior(mu):
    return prior.log_prob(mu)

lp = logprob_prior(samples_from_surrogate)

def log_likelihood(mu, samples):
    di = dist.Normal(loc=mu, scale=1)
    return torch.sum(di.log_prob(samples))

ll = log_likelihood(samples_from_surrogate, samples)

# `surrogate` (the variational distribution q) is used below but never defined
# in this draft; presumably it is a unit-variance Normal centred at mu (assumed here).
surrogate = dist.Normal(loc=mu, scale=1.)
ls = surrogate.log_prob(samples_from_surrogate)
def elbo_loss(mu, data_samples):
    samples_from_surrogate = surrogate_sample(mu)
    lp = logprob_prior(samples_from_surrogate)
    ll = log_likelihood(samples_from_surrogate, data_samples)
    ls = surrogate.log_prob(samples_from_surrogate)
    return -lp - ll + ls
mu = torch.tensor(1.0, requires_grad=True)

loc_array = []
loss_array = []

opt = torch.optim.Adam([mu], lr=0.02)
for i in range(2000):
    loss_val = elbo_loss(mu, samples)
    loss_val.backward()
    loc_array.append(mu.item())
    loss_array.append(loss_val.item())
    if i % 100 == 0:
        print(
            f"Iteration: {i}, Loss: {loss_val.item():0.2f}, Loc: {mu.item():0.3f}"
        )
    opt.step()
    opt.zero_grad()
Iteration: 0, Loss: 11693.85, Loc: 1.000
Iteration: 100, Loss: 2550.90, Loc: 2.744
Iteration: 200, Loss: 2124.30, Loc: 3.871
Iteration: 300, Loss: 2272.48, Loc: 4.582
Iteration: 400, Loss: 2025.17, Loc: 4.829
Iteration: 500, Loss: 1434.45, Loc: 5.079
Iteration: 600, Loss: 1693.33, Loc: 5.007
Iteration: 700, Loss: 1495.89, Loc: 4.957
Iteration: 800, Loss: 2698.28, Loc: 5.149
Iteration: 900, Loss: 2819.85, Loc: 5.117
Iteration: 1000, Loss: 1491.79, Loc: 5.112
Iteration: 1100, Loss: 1767.87, Loc: 4.958
Iteration: 1200, Loss: 1535.30, Loc: 4.988
Iteration: 1300, Loss: 1458.61, Loc: 4.949
Iteration: 1400, Loss: 1400.21, Loc: 4.917
Iteration: 1500, Loss: 2613.42, Loc: 5.073
Iteration: 1600, Loss: 1411.46, Loc: 4.901
Iteration: 1700, Loss: 1587.94, Loc: 5.203
Iteration: 1800, Loss: 1461.40, Loc: 5.011
Iteration: 1900, Loss: 1504.93, Loc: 5.076
plt.plot(loss_array)
from numpy.lib.stride_tricks import sliding_window_view

plt.plot(np.average(sliding_window_view(loss_array, window_shape=10), axis=1))
Linear Regression
true_theta_0 = 3.
true_theta_1 = 4.

x = torch.linspace(-5, 5, 100)
y_true = true_theta_0 + true_theta_1*x
y_noisy = y_true + torch.normal(mean=torch.zeros_like(x), std=torch.ones_like(x))

plt.plot(x, y_true)
plt.scatter(x, y_noisy, s=20, alpha=0.5)
<matplotlib.collections.PathCollection at 0x1407aaac0>
theta_prior = dist.MultivariateNormal(loc=torch.tensor([0., 0.]), covariance_matrix=torch.eye(2))

# Design matrix with a column of ones for the intercept. `x_dash` was used here
# before any definition in this draft; this construction, mirroring the one
# inside `likelihood` below, is assumed.
x_dash = torch.vstack((torch.ones_like(x), x)).t()

y_pred = x_dash@theta_prior.sample()
plt.plot(x, y_pred, label="Fit")
plt.scatter(x, y_noisy, s=20, alpha=0.5, label='Data')
plt.legend()
<matplotlib.legend.Legend at 0x14064e850>
def likelihood(theta, x, y):
    x_dash = torch.vstack((torch.ones_like(x), x)).t()
    d = dist.Normal(loc=x_dash@theta, scale=torch.ones_like(x))
    return torch.sum(d.log_prob(y))
likelihood(theta_prior.sample(), x, y_noisy)
tensor(-3558.0769)
loc = torch.tensor([-1., 1.], requires_grad=True)
surrogate_mvn = dist.MultivariateNormal(loc=loc, covariance_matrix=torch.eye(2))
surrogate_mvn
MultivariateNormal(loc: torch.Size([2]), covariance_matrix: torch.Size([2, 2]))
surrogate_mvn.sample()
tensor([-1.1585, 2.6212])
def surrogate_sample_mvn(loc):
    std_normal_mvn = dist.MultivariateNormal(loc=torch.zeros_like(loc), covariance_matrix=torch.eye(loc.shape[0]))
    sample_std_normal = std_normal_mvn.sample()
    return loc + sample_std_normal
def elbo_loss(loc, x, y):
    samples_from_surrogate_mvn = surrogate_sample_mvn(loc)
    lp = theta_prior.log_prob(samples_from_surrogate_mvn)
    ll = likelihood(samples_from_surrogate_mvn, x, y_noisy)
    ls = surrogate_mvn.log_prob(samples_from_surrogate_mvn)
    return -lp - ll + ls
loc.shape, x.shape, y_noisy.shape
(torch.Size([2]), torch.Size([100]), torch.Size([100]))
elbo_loss(loc, x, y_noisy)
tensor(2850.3154, grad_fn=<AddBackward0>)
loc = torch.tensor([-1., 1.], requires_grad=True)

loc_array = []
loss_array = []

opt = torch.optim.Adam([loc], lr=0.02)
for i in range(10000):
    loss_val = elbo_loss(loc, x, y_noisy)
    loss_val.backward()
    # The draft appended `mu` here, a leftover from the 1-D example; track `loc` instead
    loc_array.append(loc.detach().clone())
    loss_array.append(loss_val.item())
    if i % 1000 == 0:
        print(
            f"Iteration: {i}, Loss: {loss_val.item():0.2f}, Loc: {loc}"
        )
    opt.step()
    opt.zero_grad()
Iteration: 0, Loss: 5479.97, Loc: tensor([-1., 1.], requires_grad=True)
Iteration: 1000, Loss: 566.63, Loc: tensor([2.9970, 4.0573], requires_grad=True)
Iteration: 2000, Loss: 362.19, Loc: tensor([2.9283, 3.9778], requires_grad=True)
Iteration: 3000, Loss: 231.23, Loc: tensor([2.8845, 4.1480], requires_grad=True)
Iteration: 4000, Loss: 277.94, Loc: tensor([2.9284, 3.9904], requires_grad=True)
Iteration: 5000, Loss: 1151.51, Loc: tensor([2.9620, 4.0523], requires_grad=True)
Iteration: 6000, Loss: 582.19, Loc: tensor([2.8003, 4.0540], requires_grad=True)
Iteration: 7000, Loss: 178.48, Loc: tensor([2.8916, 3.9968], requires_grad=True)
Iteration: 8000, Loss: 274.76, Loc: tensor([3.0807, 4.1957], requires_grad=True)
Iteration: 9000, Loss: 578.37, Loc: tensor([2.9830, 4.0174], requires_grad=True)
learnt_surrogate = dist.MultivariateNormal(loc=loc, covariance_matrix=torch.eye(2))
y_samples_surrogate = x_dash@learnt_surrogate.sample([500]).t()
plt.plot(x, y_samples_surrogate, alpha=0.02, color='k');
plt.scatter(x, y_noisy, s=20, alpha=0.5)
<matplotlib.collections.PathCollection at 0x144709a90>
theta_sd = torch.linalg.cholesky(learnt_surrogate.covariance_matrix)
x_dash@learnt_surrogate.loc.detach().t()
#y_samples_surrogate = x_dash@learnt_surrogate.loc.t()
#plt.plot(x, y_samples_surrogate, alpha = 0.02, color='k');
#plt.scatter(x, y_noisy, s=20, alpha=0.5)
tensor([-1.6542e+01, -1.6148e+01, -1.5754e+01, -1.5360e+01, -1.4966e+01,
-1.4572e+01, -1.4178e+01, -1.3784e+01, -1.3390e+01, -1.2996e+01,
-1.2602e+01, -1.2208e+01, -1.1814e+01, -1.1420e+01, -1.1026e+01,
-1.0632e+01, -1.0238e+01, -9.8441e+00, -9.4501e+00, -9.0561e+00,
-8.6621e+00, -8.2681e+00, -7.8741e+00, -7.4801e+00, -7.0860e+00,
-6.6920e+00, -6.2980e+00, -5.9040e+00, -5.5100e+00, -5.1160e+00,
-4.7220e+00, -4.3280e+00, -3.9340e+00, -3.5400e+00, -3.1460e+00,
-2.7520e+00, -2.3579e+00, -1.9639e+00, -1.5699e+00, -1.1759e+00,
-7.8191e-01, -3.8790e-01, 6.1054e-03, 4.0011e-01, 7.9412e-01,
1.1881e+00, 1.5821e+00, 1.9761e+00, 2.3702e+00, 2.7642e+00,
3.1582e+00, 3.5522e+00, 3.9462e+00, 4.3402e+00, 4.7342e+00,
5.1282e+00, 5.5222e+00, 5.9162e+00, 6.3102e+00, 6.7043e+00,
7.0983e+00, 7.4923e+00, 7.8863e+00, 8.2803e+00, 8.6743e+00,
9.0683e+00, 9.4623e+00, 9.8563e+00, 1.0250e+01, 1.0644e+01,
1.1038e+01, 1.1432e+01, 1.1826e+01, 1.2220e+01, 1.2614e+01,
1.3008e+01, 1.3402e+01, 1.3796e+01, 1.4190e+01, 1.4584e+01,
1.4978e+01, 1.5372e+01, 1.5766e+01, 1.6160e+01, 1.6554e+01,
1.6948e+01, 1.7342e+01, 1.7736e+01, 1.8131e+01, 1.8525e+01,
1.8919e+01, 1.9313e+01, 1.9707e+01, 2.0101e+01, 2.0495e+01,
2.0889e+01, 2.1283e+01, 2.1677e+01, 2.2071e+01, 2.2465e+01])
TODO
- Pyro for linear regression example
- Handle more samples in ELBO
- Reuse some methods
- Add figure on reparameterization
- Linear regression learn covariance also
- Linear regression posterior compare with analytical posterior (refer Murphy book)
- Clean up code and reuse code wherever possible
- Improve figures and make them consistent
- Add background maths wherever needed
- plot the Directed graphical model (refer Maths ML book and render in Pyro)
- Look at the TFP post on https://www.tensorflow.org/probability/examples/Probabilistic_Layers_Regression
- Show the effect of data size (less data, solution towards prior, else dominated by likelihood)
- Mean field (full covariance vs. diagonal) for the surrogate
References
- https://www.youtube.com/watch?v=HUsznqt2V5I
- https://www.youtube.com/watch?v=x9StQ8RZ0ag&list=PLISXH-iEM4JlFsAp7trKCWyxeO3M70QyJ&index=9
- https://colab.research.google.com/github/goodboychan/goodboychan.github.io/blob/main/_notebooks/2021-09-13-02-Minimizing-KL-Divergence.ipynb#scrollTo=gd_ev8ceII8q
- https://goodboychan.github.io/python/coursera/tensorflow_probability/icl/2021/09/13/02-Minimizing-KL-Divergence.html
Sandbox
from typing import Sequence

t = torch.tensor([1., 2., 3.])
isinstance(t, Sequence)
False