Information Dropout is a technique that generalizes Dropout, which was originally proposed to prevent overfitting in deep learning.
Its theoretical foundation is the **Information Bottleneck** principle, which aims to learn an optimal representation of the data for a given task.

The first draft was uploaded to arXiv on November 4, 2016 (arXiv:1611.01353) and submitted to ICLR 2017. Unfortunately, it was not accepted, but one of the reviewers commented:

> The authors all agree that the theory presented in the paper is of high quality and is promising but the experiments are not compelling.

The manuscript has been updated since the first draft, and I believe it will be accepted at one of the top conferences once enough numerical experiments have been added.

I found this paper interesting when I read it, but unfortunately I could not find an implementation. I had recently wanted to learn how to use the deep learning framework Keras, so I implemented the method myself and tried to reproduce the results of the paper. In this article, I describe the outline of Information Dropout and its implementation.

## Information Bottleneck

Given the data $\mathbf{x}$, consider the optimal representation $\mathbf{z}$ for solving a task $\mathbf{y}$.
Intuitively, to predict a person's annual income (**task**) from various characteristics of that person (**data**), features such as affiliation and career seem to be a good **representation**.
On the other hand, if the task is to predict running speed from the same data, the results of a fitness test will probably be a much more important representation.
Thus, the appropriate representation of the data can vary depending on the task.
More precisely, a representation $\mathbf{z}$ is optimal if it satisfies the following three conditions.

- The probability distribution of $\mathbf{z}$ depends only on $\mathbf{x}$. In other words, the variables form a Markov chain in which $\mathbf{z}$ depends on $\mathbf{y}$ only through $\mathbf{x}$.
- The mutual information is preserved, $I(\mathbf{x};\mathbf{y}) = I(\mathbf{z}; \mathbf{y})$, in the transformation from $\mathbf{x}$ to $\mathbf{z}$, so the representation does not lose information about the task $\mathbf{y}$.
- Among the representations that satisfy the two conditions above, $\mathbf{z}$ is the one with the smallest $I(\mathbf{x}; \mathbf{z})$.

These three conditions can be written as the following constrained minimization problem.

$\begin{aligned} \min &\:\: I(\mathbf{x}; \mathbf{z}) \\ \mathrm{s.t.} &\:\: H(\mathbf{y}|\mathbf{z}) = H(\mathbf{y}|\mathbf{x}), \end{aligned}$

where $I(p; q)$ is the mutual information between $p$ and $q$, and $H(p|q)$ is the conditional entropy of $p$ given $q$.
Using the method of Lagrange multipliers, we obtain the following **Information Bottleneck (IB) Lagrangian**,

$\mathcal{L} = H(\mathbf{y}|\mathbf{z}) + \beta I(\mathbf{x}; \mathbf{z}).$

The motivation is to obtain a representation that keeps only the task-related information while compressing the information about the data as much as possible. Suppose that $p(\mathbf{x}, \mathbf{y})$ is the true distribution, that training data $\{\mathbf{x}_i, \mathbf{y}_i\}_{i=1,2,...,N}$ are given, and that we estimate $p_{\theta}(\mathbf{z}|\mathbf{x})$ and $p_{\theta}(\mathbf{y}|\mathbf{z})$. Then the IB Lagrangian can be approximated as

$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^N \mathbb{E}_{\mathbf{z} \sim p_{\theta} (\mathbf{z}|\mathbf{x}_i)} [ -\log p_{\theta} (\mathbf{y}_i|\mathbf{z}) ] + \beta D_{\mathrm{KL}} ( p_{\theta} (\mathbf{z}|\mathbf{x}_i) || p_{\theta}(\mathbf{z}) ),$

using the sample mean. Here, the first term is the cross-entropy (reconstruction) error, and the second term can be seen as a penalty on the amount of information transmitted from $\mathbf{x}$ to $\mathbf{z}$.

Also, if $\beta=1$ and the task is to reconstruct the input itself ($\mathbf{y} = \mathbf{x}$), this objective function has the same form as that of the Variational Auto-Encoder (VAE). The IB Lagrangian can therefore be seen as a more general objective in which $\beta$ controls the balance between the reconstruction error and the penalty term of the VAE.
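To make the trade-off concrete, here is a minimal sketch that evaluates $H(\mathbf{y}|\mathbf{z})$, $I(\mathbf{x};\mathbf{z})$ and the IB Lagrangian exactly for small discrete distributions. The function name `ib_lagrangian`, the toy joint distribution and the three encoders are assumptions made for illustration only; they do not come from the original paper.

```python
import math


def ib_lagrangian(p_xy, p_z_given_x, beta):
    """Return (H(y|z), I(x;z), L) for discrete x, y and z.

    p_xy[x][y] is the joint distribution of data and task.
    p_z_given_x[x][z] is a stochastic encoder (each row sums to 1).
    """
    n_x, n_y, n_z = len(p_xy), len(p_xy[0]), len(p_z_given_x[0])
    p_x = [sum(row) for row in p_xy]
    # Marginal of the representation: p(z) = sum_x p(x) p(z|x)
    p_z = [sum(p_x[x] * p_z_given_x[x][z] for x in range(n_x))
           for z in range(n_z)]
    # z depends on y only through x: p(z, y) = sum_x p(x, y) p(z|x)
    p_zy = [[sum(p_xy[x][y] * p_z_given_x[x][z] for x in range(n_x))
             for y in range(n_y)] for z in range(n_z)]
    # H(y|z) = -sum_{z,y} p(z, y) log p(y|z)
    h_y_given_z = -sum(p_zy[z][y] * math.log(p_zy[z][y] / p_z[z])
                       for z in range(n_z) for y in range(n_y)
                       if p_zy[z][y] > 0)
    # I(x;z) = sum_{x,z} p(x) p(z|x) log(p(z|x) / p(z))
    i_xz = sum(p_x[x] * p_z_given_x[x][z]
               * math.log(p_z_given_x[x][z] / p_z[z])
               for x in range(n_x) for z in range(n_z)
               if p_z_given_x[x][z] > 0)
    return h_y_given_z, i_xz, h_y_given_z + beta * i_xz


# Toy problem: x is uniform on {0, 1, 2, 3} and the task is y = x // 2.
p_xy = [[0.25, 0.0], [0.25, 0.0], [0.0, 0.25], [0.0, 0.25]]
encoders = {
    "identity (z = x)": [[1.0 if z == x else 0.0 for z in range(4)]
                         for x in range(4)],
    "coarse (z = x // 2)": [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]],
    "constant (z = 0)": [[1.0], [1.0], [1.0], [1.0]],
}
for name, encoder in encoders.items():
    h, i, lagrangian = ib_lagrangian(p_xy, encoder, beta=0.5)
    print(name, round(h, 3), round(i, 3), round(lagrangian, 3))
```

In this toy problem the identity and coarse encoders both keep all task information ($H(\mathbf{y}|\mathbf{z}) = 0$), but the coarse encoder transmits only $\log 2$ nats about $\mathbf{x}$ instead of $\log 4$, so it is the representation favored by the three conditions. The constant encoder transmits nothing and loses the task information.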

## Information Dropout

From here I will explain how the IB Lagrangian is realized in practice. First, for $p_{\theta}(\mathbf{z}|\mathbf{x})$, Information Dropout models Dropout in a generalized form by injecting multiplicative noise $\epsilon$ into the deterministic transformation $f(\mathbf{x})$,

$\begin{aligned} \epsilon &\sim p_{\alpha(\mathbf{x})} (\epsilon) \\ \mathbf{z} &= \epsilon \odot f(\mathbf{x}). \end{aligned}$

The computation is illustrated in the following schematic diagram, taken from the original paper.

Here, if we take a Bernoulli distribution for $p_{\alpha(\mathbf{x})}$, the model exactly matches classical Dropout.
Since we can take an arbitrary discrete or continuous probability distribution for $p_{\alpha(\mathbf{x})}$, this is a more general formulation of Dropout.
It is noteworthy that $p_{\alpha(\mathbf{x})}(\epsilon)$ depends on $\mathbf{x}$, as shown in the formula.
That is, in this model, **the probability of dropping an element varies depending on the input data**.
Information Dropout can also be interpreted as a gain or a gate on the activity of each element.
Also, the approach of decomposing the representation into a deterministic term $f(\mathbf{x})$ and a stochastic term $\epsilon$ is based on the reparameterization trick used in the Variational Auto-Encoder.
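The multiplicative-noise formulation can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: $f$ is taken to be an elementwise ReLU, and the helper names (`noisy_representation`, `bernoulli_eps`, `lognormal_eps`) and the example function for $\alpha(x)$ are hypothetical. The log-normal sampler is one possible continuous choice for $p_{\alpha(\mathbf{x})}$.

```python
import math
import random


def relu(v):
    return max(0.0, v)


def noisy_representation(x, sample_eps, rng):
    """Compute z = eps * f(x) elementwise, with f = ReLU in this sketch."""
    return [sample_eps(v, rng) * relu(v) for v in x]


def bernoulli_eps(keep_prob):
    """Classical Dropout: eps is 1 with probability keep_prob, else 0.

    The noise distribution ignores the input x.
    """
    def sample(_x, rng):
        return 1.0 if rng.random() < keep_prob else 0.0
    return sample


def lognormal_eps(alpha_fn):
    """Input-dependent noise: eps = exp(alpha(x) * n) with n ~ N(0, 1)."""
    def sample(x, rng):
        return math.exp(alpha_fn(x) * rng.gauss(0.0, 1.0))
    return sample


rng = random.Random(0)
x = [-1.0, 0.5, 2.0]
z_dropout = noisy_representation(x, bernoulli_eps(0.5), rng)
# Hypothetical alpha(x): the noise level grows with the input magnitude.
z_info = noisy_representation(x, lognormal_eps(lambda v: 0.1 + 0.5 * abs(v)), rng)
print(z_dropout, z_info)
```

With the Bernoulli sampler each element is either kept or zeroed, whereas the log-normal sampler rescales each active element by a positive random gain whose spread depends on the input.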

The next problem is to compute the second term of the objective function, the KL divergence. To train deterministically with minibatches, we need to compute the KL divergence analytically. For simplicity, let $x$ and $z$ be one-dimensional. First, we assume that the random variable $\epsilon$ follows a log-normal distribution, $p_{\alpha(x)}(\epsilon) = \log \mathcal{N} (0, \alpha_{\theta}^2(x))$. Then, assuming that the prior distribution of $z$ is log-uniform, $p(\log(z))=c$, and that the activation function $f$ is the ReLU, $f(x)=\max(0, x)$, the KL divergence can be computed analytically as

$D_{\mathrm{KL}} ( p_{\theta} (z|x) || p_{\theta}(z) ) = - \log \alpha_{\theta}(x) + \mathrm{const.}$

From this result, minimizing the KL divergence as a penalty term has the effect of making $\log \alpha_{\theta}(x)$, which controls the variance of the noise, as large as possible. The penalty can be interpreted as increasing the variability of the elements under Dropout, which has the effect of creating a robust representation space.
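The closed form can be checked with a short calculation, sketched here only for the case $f(x) > 0$ (the case $f(x) = 0$, where the ReLU output is exactly zero, needs separate treatment and is omitted). Since $z = \epsilon f(x)$, we have $\log z \sim \mathcal{N}(\log f(x), \alpha_{\theta}^2(x))$. The KL divergence is invariant under the change of variables $z \mapsto \log z$, so it can be evaluated in the log domain, where the prior density is the constant $c$:

$\begin{aligned} D_{\mathrm{KL}} ( p_{\theta} (z|x) || p_{\theta}(z) ) &= \int p_{\theta}(\log z|x) \log \frac{p_{\theta}(\log z|x)}{c} \, d \log z \\ &= - \frac{1}{2} \log \left( 2 \pi e \, \alpha_{\theta}^2(x) \right) - \log c \\ &= - \log \alpha_{\theta}(x) + \mathrm{const.} \end{aligned}$

The second line uses the differential entropy of a Gaussian with variance $\alpha_{\theta}^2(x)$, which is $\frac{1}{2} \log ( 2 \pi e \, \alpha_{\theta}^2(x) )$. The mean $\log f(x)$ does not appear in the result, so in this case the penalty depends on the input only through $\alpha_{\theta}(x)$.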

## Implementation

An implementation using the deep learning framework Keras is shown below.

In the original paper, the authors performed two experiments: (1) replacing the latent variable $\mathbf{z}$ of a Variational Auto-Encoder with Information Dropout, and (2) using Information Dropout as a regularization layer at several intermediate layers of a supervised network.

Here I will explain the implementation of the latter model. The table of the network architecture below is taken from the original paper.

For this model, consider the following implementation, written in a top-down manner.

```
# input_shape, bn_axis and beta are assumed to be defined beforehand.
kernel_size = (3, 3)
nb_filters = [32, 64, 96, 192]

input_tensor = Input(shape=input_shape, name='input')

# Four convolution blocks, each ending with an Information Dropout layer
x = information_dropout_block(input_tensor, kernel_size, nb_filters[0], beta)
x = information_dropout_block(x, kernel_size, nb_filters[1], beta)
x = information_dropout_block(x, kernel_size, nb_filters[2], beta)
x = information_dropout_block(x, kernel_size, nb_filters[3], beta)

# One 3x3 and two 1x1 convolutions, ending with one feature map per class
x = Convolution2D(192, 3, 3)(x)
x = BatchNormalization(axis=bn_axis)(x)
x = Activation('relu')(x)
x = Convolution2D(192, 1, 1)(x)
x = BatchNormalization(axis=bn_axis)(x)
x = Activation('relu')(x)
x = Convolution2D(10, 1, 1)(x)
x = BatchNormalization(axis=bn_axis)(x)
x = Activation('relu')(x)

# Average each class map over space, then apply softmax
x = GlobalAveragePooling2D()(x)
x = Lambda(lambda x: softmax(x))(x)

model = Model(input_tensor, x, name='All-CNN-96')
```

Imports of the Keras classes and functions are omitted.
Since the table contains four blocks that perform similar Convolution and Dropout operations, we abstract them with the `information_dropout_block` function ^{*1}.
The `information_dropout_block` function takes an input tensor, a kernel size, the number of filters, and the Information Dropout hyperparameter $\beta$ as arguments.
The implementation details of the `information_dropout_block` function are shown below.

```
def information_dropout_block(input_tensor, kernel_size, nb_filter, beta):
    # Two ordinary convolution layers
    x = Convolution2D(nb_filter, kernel_size[0], kernel_size[1])(input_tensor)
    x = BatchNormalization(axis=bn_axis)(x)
    x = Activation('relu')(x)
    x = Convolution2D(nb_filter, kernel_size[0], kernel_size[1])(x)
    x = BatchNormalization(axis=bn_axis)(x)
    x = Activation('relu')(x)

    # Deterministic branch f(x): strided convolution with ReLU
    f_x = Convolution2D(nb_filter, kernel_size[0], kernel_size[1], subsample=(2, 2))(x)
    f_x = BatchNormalization(axis=bn_axis)(f_x)
    f_x = Activation('relu')(f_x)

    # Noise branch log(alpha(x)); the KL penalty is attached as an activity regularizer
    logalpha = Convolution2D(nb_filter, kernel_size[0], kernel_size[1],
                             activity_regularizer=KLRegularizer(beta=beta),
                             subsample=(2, 2))(x)

    def sampling(args):
        f_x, logalpha = args
        # Log-normal noise: epsilon = exp(n), n ~ N(0, alpha^2)
        epsilon = K.exp(K.random_normal(shape=K.shape(f_x), mean=0.,
                                        std=K.exp(logalpha)))
        # Inject noise only during training; return f(x) unchanged at test time
        return K.in_train_phase(f_x * epsilon, f_x)

    noise_x = Lambda(sampling)([f_x, logalpha])
    return noise_x
```

The point is that the probabilistic sampling is hidden inside the block as a `sampling` function, so it is no longer necessary to keep in mind that the method is probabilistic when constructing a network.
Also, the penalty term that appears in the objective function of Information Dropout is implemented as an `activity_regularizer`, which separates the network architecture from the modification of the objective function.
The implementation of `KLRegularizer` is as follows.

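```
class KLRegularizer(Regularizer):
    """Penalty -beta * log(alpha), the analytical KL term up to a constant."""

    def __init__(self, beta=0.0):
        self.beta = K.cast_to_floatx(beta)

    def __call__(self, logalpha):
        # K.sum without an axis argument reduces over all axes to a scalar
        regularization = 0
        regularization += - self.beta * K.mean(K.sum(logalpha, keepdims=0))
        return regularization
```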

Using the analytical solution above, the penalty appears as $-\log \alpha_{\theta}(\mathbf{x})$ for $\alpha$, which determines the amount of Dropout noise. As a result, the implementation takes the very simple form shown above.
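As a framework-free illustration of what this penalty computes, the sketch below evaluates $-\beta \log \alpha$ on plain Python lists. The function name `kl_penalty` is hypothetical. Note one assumption that differs from the Keras code above: this sketch averages over the samples of the minibatch, following the $\frac{1}{N} \sum_i$ in the objective, whereas `K.sum` without an axis argument sums over every axis, including the batch axis.

```python
def kl_penalty(logalpha_batch, beta):
    """Return beta * mean over samples of sum over units of (-log alpha).

    logalpha_batch is a list with one entry per sample; each entry is a
    flat list of log(alpha) activations for that sample.
    """
    per_sample = [-sum(sample) for sample in logalpha_batch]
    return beta * sum(per_sample) / len(per_sample)


# Two samples with two units each; a larger log(alpha) lowers the penalty.
low_noise = [[0.0, 0.0], [0.0, 0.0]]
high_noise = [[1.0, 2.0], [3.0, 0.0]]
print(kl_penalty(low_noise, beta=0.5), kl_penalty(high_noise, beta=0.5))
```

Because the penalty decreases as $\log \alpha$ grows, minimizing it pushes the network toward noisier units wherever the cross-entropy term can tolerate the noise.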

^{*1}: The idea of abstracting the processing into function blocks is based on the official Keras ResNet example.

## Numerical Results

For this classification task, I used Cluttered MNIST, a dataset that was also used in the original paper.

Cluttered MNIST is a dataset in which noise is randomly injected into images of the handwritten digits 0 to 9. The magnitude and amount of noise can be controlled by parameters. Samples are shown below.
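As a rough illustration of how such an image could be generated, the sketch below places a digit at a random position on a larger canvas and pastes small patches cropped from other images as clutter. This is an assumed procedure written for this article, not the generator actually used for the experiments; the names `paste` and `make_cluttered` and all parameters are hypothetical.

```python
import random


def paste(canvas, patch, top, left):
    """Overlay patch on canvas, keeping the brighter pixel at each position."""
    for i, row in enumerate(patch):
        for j, value in enumerate(row):
            canvas[top + i][left + j] = max(canvas[top + i][left + j], value)


def make_cluttered(digit, sources, canvas_size, n_clutter, clutter_size, rng):
    """Return a canvas_size x canvas_size image with one digit plus clutter.

    digit and every image in sources are 2-D lists of floats in [0, 1].
    n_clutter controls the amount of noise, clutter_size its spatial extent.
    """
    canvas = [[0.0] * canvas_size for _ in range(canvas_size)]
    height, width = len(digit), len(digit[0])
    paste(canvas, digit,
          rng.randint(0, canvas_size - height),
          rng.randint(0, canvas_size - width))
    for _ in range(n_clutter):
        source = rng.choice(sources)
        # Crop a random clutter_size x clutter_size patch from the source image
        r = rng.randint(0, len(source) - clutter_size)
        c = rng.randint(0, len(source[0]) - clutter_size)
        patch = [row[c:c + clutter_size] for row in source[r:r + clutter_size]]
        paste(canvas, patch,
              rng.randint(0, canvas_size - clutter_size),
              rng.randint(0, canvas_size - clutter_size))
    return canvas


rng = random.Random(0)
toy_digit = [[1.0] * 4 for _ in range(4)]  # stand-in for a real MNIST digit
image = make_cluttered(toy_digit, [toy_digit], canvas_size=12,
                       n_clutter=3, clutter_size=2, rng=rng)
```

In practice `digit` and `sources` would be real MNIST images, and the canvas size would match the input size expected by the network.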

I trained the network on 50,000 training samples and evaluated it on 10,000 test samples. The confusion matrix for the test data after training is shown in the figure below.

Training seems to have gone roughly well. The network occasionally confuses 1 and 7, 2 and 7, and 4 and 9, but since these digits look similar, the mistakes agree with human intuition.

Next, I tried the experiment that I personally found to be the most interesting part of the original paper.

The figure shown above is the visualization of the second term in the objective function,

$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^N \mathbb{E}_{\mathbf{z} \sim p_{\theta} (\mathbf{z}|\mathbf{x}_i)} [ -\log p_{\theta} (\mathbf{y}_i|\mathbf{z}) ] + \beta D_{\mathrm{KL}} ( p_{\theta} (\mathbf{z}|\mathbf{x}_i) || p_{\theta}(\mathbf{z}) ),$

in the input data space. In my implementation, the input $\mathbf{x}$ branches into $f(\mathbf{x})$ and $\log\alpha_{\theta}(\mathbf{x})$, so the activity of the latter corresponds directly to the penalty term (up to sign and a constant), and I visualized it as is. For the input image shown in the leftmost column, the layers on which Information Dropout was performed are arranged as Dropout 0, 1, 2, from the shallowest to the deepest. The KL divergence $D_{\mathrm{KL}} ( p_{\theta} (\mathbf{z}|\mathbf{x}_i) || p_{\theta}(\mathbf{z}) )$ quantifies how much the distribution of the representation $\mathbf{z}$ changes from the prior $p_{\theta}(\mathbf{z})$ when the data $\mathbf{x}$ is given, so it extracts the parts of the input data that seem to be essential for classification.

In the original paper, when an appropriate value was set for $\beta$, the penalty term gradually extracted only the region related to classification, which is the location of the digit 5 in this figure. In my experiment, however, the noise regions were also picked up and the result could not be reproduced well, possibly because the hyperparameters, including $\beta$, were not tuned sufficiently.

As a personal interpretation, I think that $D_{\mathrm{KL}} ( p_{\theta} (\mathbf{z}|\mathbf{x}_i) || p_{\theta}(\mathbf{z}) )$ is closely related to the concepts of saliency maps and Bayesian surprise. I find it interesting that such representations emerge spontaneously in the intermediate layers without being given explicitly as teacher labels.

## Putting it all together

The code shown here is only the essential part of the following repository. I also performed experiments on the Variational Auto-Encoder, and the results are available in the following repository.