# Neural Linear Model ¶

TLDR

```
import probflow as pf
import probflow.utils.ops as O
class NeuralLinear(pf.ContinuousModel):
def __init__(self, dims):
self.net = pf.DenseNetwork(dims, probabilistic=False)
self.loc = pf.Dense(dims[-1], 1)
self.std = pf.Dense(dims[-1], 1)
def __call__(self, x):
h = O.relu(self.net(x))
return pf.Normal(self.loc(h), O.softplus(self.std(h)))
model = NeuralLinear([x.shape[1], 256, 128, 64, 32])
model.fit(x, y)
```

The neural linear model is an efficient way to get posterior samples and uncertainty out of a regular neural network. Basically, you just slap a Bayesian linear regression on top of the last hidden layer of a regular (non-Bayesian, non-probabilistic) neural network. It was first proposed for use with Bayesian optimization (Snoek et al., 2015), but has applications in reinforcement learning (see Riquelme et al., 2018 and Azizzadenesheli and Anandkumar, 2019), active learning, AutoML (Zhou and Precioso, 2019), and just Bayesian regression problems in general (Ober and Rasmussen, 2019).

Bayesian linear regressions have a closed form solution, so the usual approach for training a neural linear model is to first train the neural network to minimize the (for example) mean squared error. Then in a second step, compute the closed-form solution to the Bayesian linear regression regressing the last hidden layer’s activations onto the target variable. Here we’ll just use variational inference to train the neural network and Bayesian regression together end-to-end.

But first! Some imports:

```
import probflow as pf
import probflow.utils.ops as O
import matplotlib.pyplot as plt
```

## Load data¶

We’ll be training models to predict taxi fares given the origin, destination, and time of the trip. The data we’ll use comes from the New York City Taxi and Limousine Commission dataset, which is a public BigQuery dataset. I’ll skip the data loading and feature engineering here (see the colab for all of that!), but at the end of the day we’re left with around 2.4 million training samples and little over half a million validation samples, in the form of four numpy arrays:

```
x_train # training features, size (2399893, 9)
y_train # training target, size (2399893, 1)
x_val # validation features, size (600107, 9)
y_val # validation target, size (600107, 1)
```

## Fitting the Neural Linear Model¶

Let’s create a neural linear model with ProbFlow. ProbFlow’s `Dense`

,
`DenseNetwork`

, and `DenseRegression`

classes take a
`probabilistic`

keyword argument, which when `True`

(the default), uses
parameters which are probabilistic (i.e., they use Normal distributions as the
variational posteriors). But when the `probabilistic`

kwarg is set to
`False`

, then the parameters are totally non-probabilistic (i.e. a
Deterministic distribution, aka a Dirac function, is used as the
“variational posterior”). So, with `probabilistic = False`

, ProbFlow won’t
model any uncertainty as to those parameters’ values (like your run-of-the-mill
non-Bayesian neural network would).

So, to create a neural linear model, we can just create a regular
non-probabilistic neural network using `DenseNetwork`

with
`probabilistic = False`

, and then perform a Bayesian linear regression on top
of the final hidden layer (using the `Dense`

class - we’ll also predict
the noise error to allow for heteroscedasticity).

```
class NeuralLinear(pf.ContinuousModel):
def __init__(self, dims):
self.net = pf.DenseNetwork(dims, probabilistic=False)
self.loc = pf.Dense(dims[-1], 1) # probabilistic=True by default
self.std = pf.Dense(dims[-1], 1) # probabilistic=True by default
def __call__(self, x):
h = O.relu(self.net(x))
return pf.Normal(self.loc(h), O.softplus(self.std(h)))
```

However, one thing that really helps is to define what initialization we want
for the head of the network which predicts the standard deviation. You don’t
*need* to, but a lot of random initializations can lead to bad fits otherwise.
Basically we’re just initializing the layer which predicts the noise standard
deviation such that the standard deviation is small early in training.
Otherwise the noise coming from the variational posteriors overwhelms the
signal coming from the data, and so the network is unable to learn.
Initializing the standard deviation head to output small values allows the
network to learn the means early in training, and then later in training the
variational posteriors can expand to capture the uncertainty properly.

So, to set the values at which our variables for that layer are initialized,

```
std_kwargs = {
"weight_kwargs": {
"initializer": {
"loc": 0,
"scale": -2,
}
},
"bias_kwargs": {
"initializer": {
"loc": -2,
"scale": -2,
}
}
}
```

And then we can re-define our neural linear model using that initialization
(exactly the same as before, except we’ve added `**std_kwargs`

):

```
class NeuralLinear(pf.ContinuousModel):
def __init__(self, dims):
self.net = pf.DenseNetwork(dims, probabilistic=False)
self.loc = pf.Dense(dims[-1], 1)
self.std = pf.Dense(dims[-1], 1, **std_kwargs)
def __call__(self, x):
h = O.relu(self.net(x))
return pf.Normal(self.loc(h), O.softplus(self.std(h)))
```

Then we can instantiate the model. We’ll use a fully-connected sequential architecture, where the first hidden layer has 128 units, the second has 64, and the third has 32.

```
model = NeuralLinear([x.shape[1], 128, 64, 32])
```

We’ll also use a `MonitorMetric`

callback to monitor the mean absolute
error of the model’s predictions on the validation data over the course of
training. Additionally, we’ll anneal the learning rate from 0.0005 to near
zero over the course of training using a `LearningRateScheduler`

callback.

```
nlm_mae = pf.MonitorMetric('mae', x_val, y_val)
nlm_lrs = pf.LearningRateScheduler(lambda e: 5e-4 + e * 5e-6)
```

Then, we can fit the model!

```
nlm_model.fit(
x_train,
y_train,
batch_size=2048,
epochs=100,
callbacks=[nlm_mae, nlm_lrs]
)
```

## Fitting a fully Bayesian network¶

For comparison, let’s also fit a fully Bayesian model (i.e., one where *all*
the parameters of the network are modeled using Normal distributions as their
variational posteriors). We’ll define it in exactly the same way as the neural
linear model, except that we’ll use `probabilistic = True`

(the default) when
initializing the `DenseNetwork`

.

```
class FullyBayesianNetwork(pf.ContinuousModel):
def __init__(self, dims):
self.net = pf.DenseNetwork(dims)
self.loc = pf.Dense(dims[-1], 1)
self.std = pf.Dense(dims[-1], 1, **std_kwargs)
def __call__(self, x):
h = O.relu(self.net(x))
return pf.Normal(self.loc(h), O.softplus(self.std(h)))
```

And we’ll initialize it using the exact same architecture:

```
bnn_model = FullyBayesianNetwork([x_train.shape[1], 128, 64, 32])
```

Again we’ll monitor the MAE over the course of training, and anneal the learning rate:

```
bnn_mae = pf.MonitorMetric('mae', x_val, y_val)
bnn_lrs = pf.LearningRateScheduler(lambda e: 5e-4 + e * 5e-6)
```

And then we can fit it:

```
bnn_model.fit(
x_train,
y_train,
batch_size=2048,
epochs=100,
callbacks=[bnn_mae, bnn_lrs]
)
```

## Accuracy and time¶

Let’s compare how long it took each network to reach similar levels of
accuracy. We can use the `MonitorMetric`

callback objects to plot the
mean absolute error as a function of walltime.

```
# Plot wall time vs mean absolute error
nlm_mae.plot(x="time", label="Neural Linear")
bnn_mae.plot(x="time", label="Fully Bayesian")
plt.legend()
plt.show()
```

So the neural linear model is a lot faster to train! Both models seem to be able to reach similar levels of accuracy, but the neural linear model just gets to a given accuracy level more quickly.

## Uncertainty calibration¶

But is the neural linear model - a mostly non-Bayesian network - at all well-calibrated in terms of its uncertainty estimates? We can measure the calibration of the model (how accurate its uncertainty estimates are) using calibration curves and the mean squared calibration error for regression (see Kuleshov et al., 2018 and Chung et al., 2020).

Essentially, the calibration curve plots the proportion of samples which our model predicts to have less than or equal to some cumulative probability, against the actual proportion of samples. In other words, 10% of samples should fall below the model’s predicted lower 10th percentile confidence interval, 20% of samples should fall below the model’s predicted lower 20th percentile confidence interval, and so on, all the way up to the 100th percentile. If the model is perfectly calibrated, these values will match, and we’ll end up with a calibration curve which goes along the identity line from (0, 0) to (1, 1).

Then, the mean squared calibration error is just a metric of how far our
model’s calibration deviates from the ideal calibration curve (i.e., what the
calibration curve would be if the model were perfectly calibrated - the
identity line). Check out the docs for
`ContinuousModel.calibration_metric()`

for a more formal definition (and
for other calibration metrics we could have used).

ProbFlow’s `ContinuousModel`

class has methods to compute both the
calibration curve and various calibration metrics. So, all we need to do to
view the calibration curve is:

```
# Plot each model's calibration curve
nlm_model.calibration_curve_plot(
x_val, y_val, batch_size=10000, label="Neural Linear"
)
bnn_model.calibration_curve_plot(
x_val, y_val, batch_size=10000, label="Fully Bayesian"
)
plt.legend()
plt.show()
```

Note that for `ContinuousModel.calibration_curve_plot()`

above (and for
`ContinuousModel.calibration_metric()`

below) we’re using the
`batch_size`

keyword argument to perform the computation in batches.
Otherwise, if we tried to run the calculation using all ~600,000 samples in the
validation data as one single ginormous batch, we’d run out of memory!

To view the mean squared calibration error (MSCE), we can just use
`ContinuousModel.calibration_metric()`

(lower is better):

```
>>> nlm_model.calibration_metric("msce", x_val, y_val, batch_size=10000)
0.00307
>>> bnn_model.calibration_metric("msce", x_val, y_val, batch_size=10000)
0.00412
```

Both the neural linear model and the fully Bayesian network have pretty good calibration errors! With different data splits and different random initializations, one or the other comes out technically on top, but I’d say they’re about the same on average. Given that the neural linear model is so much faster to train, as long as you aren’t specifically interested in the posteriors of the neural network parameters, there’s definitely an upside to using this hybrid method to get the advantages of Bayesian modeling (uncertainty quantification, separation of epistemic and aleatoric uncertainty, etc) with most of the efficiency advantages of training a normal non-Bayesian neural network!

## References¶

Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Md. Mostofa Ali Patwary, Prabhat, Ryan P. Adams. Scalable Bayesian Optimization Using Deep Neural Networks, 2015

Carlos Riquelme, George Tucker, and Jasper Snoek. Deep Bayesian Bandits Showdown, 2018

Sebastian W. Ober and Carl Edward Rasmussen. Benchmarking the Neural Linear Model for Regression, 2019

Kamyar Azizzadenesheli and Animashree Anandkumar. Efficient Exploration through Bayesian Deep Q-Networks, 2019

Weilin Zhou and Frederic Precioso. Adaptive Bayesian Linear Regression for Automated Machine Learning, 2019

Youngseog Chung, Willie Neiswanger, Ian Char, Jeff Schneider. Beyond Pinball Loss: Quantile Methods for Calibrated Uncertainty Quantification, 2020