Tags: Cross Platform, Entropy Coding, Quantization, Integer Networks, Data Compression, Variational Models
A review of "Integer Networks for Data Compression with Latent-Variable Models" by Aaron Dees and Myles Doyle.
TL;DR
This paper tackles the breakdown of entropy coding that occurs when the prior of a variational latent-variable model is computed on different hardware or software platforms. The solution proposed in the paper is to use ‘Integer Networks’, which eliminate the floating-point math that causes this non-deterministic behaviour across platforms. The results show that using such networks for both the computation of the prior and the representation itself can yield reliable cross-platform encoding and decoding of images with variational models. The paper describes how an Integer Network architecture can be constructed, and how training must be adapted to account for this change.
Variational latent-variable models are a popular choice for modern data compression. In short, a successful variational latent-variable model represents the relevant information in the data it is trained on in a compact fashion.
A variational autoencoder (VAE) is a type of variational latent-variable model that can be used for image compression. We can break the simple deep learning compression pipeline shown in Figure 1 below into several steps:
Figure 1: Simple model architecture for image compression. This model architecture describes the steps to encode and decode an image using a very simple variational model with a neural-network-based encoder and decoder (shown as the blue trapezoids, labelled appropriately), an arithmetic encoder and arithmetic decoder (shown as red rectangles), a quantization module (shown as a green square), and finally a factorized entropy model (shown as a red database-style rectangle with rounded edges). The diagram describes the flow of data through the pipeline to produce a bitstream for the latent representation.
From the above pipeline, we define the encoder model distribution as

$$ q(\tilde{y} \mid x) = \prod_i \mathcal{U}\!\left(\tilde{y}_i \,\middle|\; y_i - \tfrac{1}{2},\, y_i + \tfrac{1}{2}\right), \qquad y = g(x), $$

i.e., during training the output $y$ of the encoder network $g$ is relaxed with additive uniform noise of unit width, standing in for the rounding applied at evaluation time (Ballé et al., Feb 2018).
The entropy parameters used by the entropy encoder and entropy decoder must be identical. This is because minor changes in these inputs can cause large changes in the entropy decoded latents. This blog will dive deeper into this issue and the solution explored in the paper.
The universal representation of compressed data today is the binary channel, which is restrictive compared to the richness of latent representations obtained from variational latent-variable models. Thus, for practical compression, the latent representations discussed above need to be converted to binary. This is done through entropy coding.
A popular entropy coding technique is arithmetic coding, of which range coding is a variant; ANS coding (Duda, 2014) is a related alternative. Range coding is asymptotically optimal: the bitstream length converges to the expected KL divergence once the representation sequence is sufficiently long. For this to hold, the sending and receiving sides of the binary channel must be able to compute the prior $p$ identically, down to the very last bit.
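To make "asymptotically optimal" concrete: an ideal arithmetic/range coder spends about $-\log_2 p(y)$ bits per symbol, so its expected rate is the cross-entropy between the data distribution and the prior, and any mismatch costs exactly the KL divergence. A toy sketch of this accounting (in JAX, with made-up numbers, not from the paper):

```python
import jax.numpy as jnp

# Toy discrete source and a slightly mismatched model prior (made-up numbers).
true_dist = jnp.array([0.6, 0.2, 0.15, 0.05])
prior     = jnp.array([0.55, 0.25, 0.15, 0.05])

entropy       = -jnp.sum(true_dist * jnp.log2(true_dist))  # lower bound on bits/symbol
expected_bits = -jnp.sum(true_dist * jnp.log2(prior))      # what an ideal coder achieves

print("entropy:        ", float(entropy))        # ~1.53 bits/symbol
print("expected length:", float(expected_bits))  # ~1.54 bits/symbol
print("KL overhead:    ", float(expected_bits - entropy))
```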
Figure 2: Left: Image decoded using integer arithmetic. Right: Image decoded using floating-point arithmetic. The image was decoded correctly, beginning in the top-left corner, until floating-point round-off error caused a small discrepancy between the sender's and the receiver's computation of the prior, p, at which point the error propagated catastrophically. (Ballé et al., Sept 2018)
Thus it is imperative that the computation of the prior $p$ yields bit-identical results on the sending and receiving sides.
In typical compression applications, the sender and receiver may be using different hardware or software platforms. This becomes an issue because floating-point math and numerical round-off are highly platform-dependent and cannot presently be made deterministic across arbitrary platforms. Therefore, computing the prior on heterogeneous platforms may not yield the same results, which, as mentioned, range coding requires for its deterministic behaviour; otherwise we will likely get catastrophic results similar to those in Figure 2. In some cases, round-off errors can even be non-deterministic on the same platform (Ballé et al., Sept 2018).
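To see how little it takes to break things, the following illustrative sketch (my own, not the paper's experiment; the helper name `quantized_cdf` and the 16-bit precision are assumptions) quantizes a prior's CDF to the integer table a range coder typically consumes. A perturbation of $10^{-6}$ in one probability, of the order that accumulated floating-point round-off can reach, is enough to shift a bin boundary, after which encoder and decoder disagree:

```python
import jax.numpy as jnp

PRECISION = 16  # CDFs are scaled to 2**16, as a range coder might use

def quantized_cdf(probs):
    """Toy integer CDF of the kind a range coder consumes."""
    cum = jnp.cumsum(probs)
    return jnp.round(cum * (1 << PRECISION)).astype(jnp.int32)

# A probability that happens to sit on a rounding boundary of the
# quantized CDF (0.12499237... * 2**16 == 8191.5 exactly).
p_sender = jnp.array([0.12499237060546875, 0.5, 0.37500762939453125])

# The same prior as computed on another platform, off by ~1e-6 --
# an exaggerated but representative floating-point discrepancy.
p_receiver = p_sender.at[0].add(-1e-6)

print(quantized_cdf(p_sender))    # [ 8192 40960 65536]
print(quantized_cdf(p_receiver))  # [ 8191 40959 65536] -> decoder diverges
```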
Additionally, latent-variable models that employ ANNs to compute their priors, such as learned image compression models (Ballé et al., Feb 2018), are likely to be even more vulnerable to such issues, because ANNs generally perform large amounts of floating-point math.
The solution proposed in this paper is to use integer arithmetic. The authors construct a type of quantized neural network called Integer Networks, targeted at generative and compression models, to prevent cross-platform non-determinism in the computation of the prior $p$.
To assess the efficacy of Integer Networks in enabling platform-independent compression and decompression, the image compression model described in Ballé et al. (Feb 2018) is used; this can be seen in Figure 3 below.
Figure 3: Image Compression Model architecture used to evaluate the efficacy of Integer Networks in enabling platform-independent compression and decompression (Ballé et al., Feb 2018)
The above is a familiar model defined with a hyperprior network. The paper compares the original model seen in Figure 3 with a version in which the network computing the prior (the hyper-decoder) is replaced by an Integer Network.
As seen in Table 1 below, the original model incorrectly decompresses over 50% of images on average when the pipeline is split across platforms. The modified model on the other hand reduces this failure rate to 0%.
Model performance is also evaluated in terms of rate-distortion, plotting reconstruction quality against bits per pixel. This can be seen in Figure 4 (left). Here, the modified model performs identically to the original, which is what we would expect, given the only change is to the calculation of the prior and not to the latent representation itself.
Figure 4: Rate-distortion performance of image compression models (Ballé et al., Sept 2018). Left: Performance of Ballé et al. (Feb 2018) vs the same model with integer prior. Performance is identical. Right: Performance of Ballé et al. (Feb 2018) ReLU model with 128 filters per layer vs. the same model, with integer transforms and QReLU activation functions with 128 or 256 filters per layer.
Experiments were also carried out using Integer Networks throughout the entire model, based on Ballé's ReLU model (Ballé, July 2018). Rate-distortion performance was found to decrease, as seen in Figure 4 (right). The reason highlighted is that the reduced accuracy of integer arithmetic gives a lower approximation capacity than floating-point networks. This loss in model performance can be compensated for by increasing the number of filters per layer. It is noted that this increases the training time required for convergence, but may not affect runtime, since the storage overhead of the integer implementation is lower and integer arithmetic is computationally cheaper; presumably, though, this is unverified.
As we've previously discussed, the paper proposes the use of Integer Networks to ensure the network is deterministic across a wide variety of platforms. It restricts all data types to integers and all operations to basic arithmetic or lookup tables. Since integer multiplication expands the output range relative to the input range, an additional step is required after each linear operation, where each output is scaled down by a learned parameter $c$ using rounded division:

$$ v = (H u + b) \oslash c, \qquad w = g(v), $$

where $u$ is the layer input, $H$ an integer filter matrix, $b$ an integer bias vector, $c$ a strictly positive integer scaling vector, $\oslash$ elementwise division with rounding, and $g$ an integer-valued nonlinearity, such as QReLU, which simply clips its argument to the range of an unsigned 8-bit integer, $[0, 255]$.
We infer from the above that the accumulators of the linear transform require a larger bit width than the activations and filter coefficients, to account for the potential increase in output range due to the multiplicative operations. This means that, in practice, activations and coefficients can be stored at a small bit width (e.g., 8 bits), while the accumulator computing $Hu + b$ must use a wider integer type (e.g., 32 bits) to avoid overflow.
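A minimal sketch of such a layer (in JAX, with assumed bit widths: 8-bit coefficients and activations, 32-bit accumulator; `qrelu` and `integer_layer` are my names, not the paper's):

```python
import jax.numpy as jnp

def qrelu(v):
    # Integer "QReLU": clip to the unsigned 8-bit range [0, 255].
    return jnp.clip(v, 0, 255)

def integer_layer(u, H, b, c):
    """One integer-network layer: w = g((H u + b) (/) c), integer in, integer out.

    u: uint8 activations; H: int8 filters; b, c: int32 vectors.
    The accumulator is widened to int32 so the dot product cannot overflow.
    """
    acc = H.astype(jnp.int32) @ u.astype(jnp.int32) + b   # wide accumulator
    v = jnp.round(acc / c).astype(jnp.int32)              # rounded division (float here for
                                                          # brevity; a real implementation
                                                          # would stay in integer arithmetic)
    return qrelu(v).astype(jnp.uint8)

# Toy usage with made-up shapes and values:
u = jnp.array([3, 200, 17], dtype=jnp.uint8)
H = jnp.array([[1, -2, 3], [0, 1, -1]], dtype=jnp.int8)
b = jnp.array([10, -5], dtype=jnp.int32)
c = jnp.array([4, 4], dtype=jnp.int32)
print(integer_layer(u, H, b, c))   # -> [ 0 44]
```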
The paper also suggests some solutions to overcome the training difficulties this introduces. It proposes that the network is trained using floating-point computations, rounding the results to integers after every computation, with backpropagation done entirely in floating-point precision. The integer parameters $H$, $b$ and $c$ are maintained as floating-point surrogates $h'$, $b'$ and $c'$, which are quantized on the fly each time the layer is evaluated. Before rounding, the filter surrogate $h'$ is rescaled so that its values span the full dynamic range of the target integer type (for example, $[-2^{K-1}, 2^{K-1}-1]$ for $K$-bit signed coefficients). To ensure that the scaling vector $c$ stays strictly positive, its surrogate is passed through a reparameterization before rounding. A small constant $\epsilon$ is used in the rescaling, for example to keep its denominator bounded away from zero.
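A minimal sketch of this on-the-fly parameter quantization, assuming a symmetric rescaling of the filter surrogate to signed $K$-bit integers and a softplus reparameterization to keep $c$ positive (both are my assumptions; the paper's exact choices may differ):

```python
import jax
import jax.numpy as jnp

K = 8  # assumed bit width for the filter coefficients

def quantize_filters(h, eps=1e-20):
    # Rescale the float surrogate to span the signed K-bit range, then round.
    s = jnp.maximum(jnp.max(jnp.abs(h)), eps) / (2 ** (K - 1) - 1)
    return jnp.round(h / s)

def quantize_scaling(c_hat):
    # Softplus keeps the scaling strictly positive before rounding;
    # the paper uses a reparameterization to the same effect.
    return jnp.maximum(jnp.round(jax.nn.softplus(c_hat)), 1.0)

h = jnp.array([[0.31, -0.07], [-0.52, 0.18]])
print(quantize_filters(h))                       # integers in [-127, 127]
print(quantize_scaling(jnp.array([2.0, -1.0])))  # strictly positive integers
```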
As you would expect, an issue occurs when trying to backpropagate gradients into these parameters, as rounding has a zero gradient almost everywhere. The workaround highlighted in the paper is to replace the derivative of such rounding functions with the identity function.
Derivatives of the loss function are then computed with respect to the floating-point surrogates rather than the integer parameters, treating the quantization $Q(\cdot)$ as if it were the identity:

$$ \frac{\partial L}{\partial h'} := \frac{\partial L}{\partial H}, \qquad \frac{\partial L}{\partial b'} := \frac{\partial L}{\partial b}, \qquad \frac{\partial L}{\partial c'} := \frac{\partial L}{\partial c}, $$

where $h'$, $b'$ and $c'$ denote the surrogates (with the chain rule still applied through the rescaling). In other words, gradients flow straight through the rounding, while the parameter updates themselves are accumulated in floating point.
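In an autodiff framework, "replace the derivative of rounding with the identity" is one line; a minimal JAX sketch (the helper name `ste_round` is mine):

```python
import jax
import jax.numpy as jnp

def ste_round(x):
    # Forward: round(x). Backward: identity -- the stop_gradient trick
    # hides the non-differentiable rounding from autodiff.
    return x + jax.lax.stop_gradient(jnp.round(x) - x)

loss = lambda h: jnp.sum(ste_round(h) ** 2)
h = jnp.array([0.4, 1.6])
print(jax.grad(loss)(h))   # [0. 4.] == 2 * round(h), passed straight through
```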
Once training is completed, the integer parameters $H$, $b$ and $c$ are computed one final time from the surrogates and stored; from then on, only integer arithmetic is used for evaluation.
On top of rounding the parameters, it is also required to round the activations. To get gradients through the rounded division $\oslash$, the same trick is applied: the rounding is ignored during backpropagation, so gradients are computed as if ordinary division had been used.
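The same pattern covers the rounded division: round in the forward pass, but let gradients flow as if plain division had been used. A sketch under the same assumptions as above:

```python
import jax
import jax.numpy as jnp

def ste_rounded_div(acc, c):
    # Forward: rounded division acc (/) c. Backward: gradient of plain acc / c.
    exact = acc / c
    return exact + jax.lax.stop_gradient(jnp.round(exact) - exact)

grad = jax.grad(lambda a: jnp.sum(ste_rounded_div(a, 4.0)))(jnp.array([7.0]))
print(grad)   # [0.25] == d(a/4)/da, even though the forward pass returned round(7/4) = 2
```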
Assuming the prior on the latent representation is given by

$$ p(\tilde{y} \mid \tilde{z}) = \prod_i \left( \mathcal{N}(0, \tilde{\sigma}_i^2) * \mathcal{U}\!\left(-\tfrac{1}{2}, \tfrac{1}{2}\right) \right)\!(\tilde{y}_i), $$

where the scale parameters $\tilde{\sigma}$ are computed from the hyper-latents by the hyper-decoder, with

$$ \tilde{\sigma} = h_s(\tilde{z}), $$

the probability mass of a discrete latent value can be evaluated as a difference of the cumulative distribution function $c(\cdot)$,

$$ P(y_i) = c\!\left(y_i + \tfrac{1}{2}\right) - c\!\left(y_i - \tfrac{1}{2}\right), $$

where it is exactly this computation, performed by the hyper-decoder network, that must be made bit-exact on both sides of the channel by implementing $h_s$ as an Integer Network.
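As a concrete stand-in, here is how such box-integrated probabilities look for a zero-mean Gaussian prior (an assumption for illustration; the learned priors in the paper are more flexible):

```python
import jax.numpy as jnp
from jax.scipy.stats import norm

def discrete_prior(y, sigma):
    # P(y) = c(y + 1/2) - c(y - 1/2): a zero-mean Gaussian convolved with a
    # unit-width uniform, evaluated at integer latent values.
    return norm.cdf(y + 0.5, loc=0.0, scale=sigma) - norm.cdf(y - 0.5, loc=0.0, scale=sigma)

y = jnp.arange(-3, 4)            # a few discrete latent values
p = discrete_prior(y, sigma=1.5)
print(p, p.sum())                # per-bin probabilities (sum < 1; tails omitted)
```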
Now that we have a well-defined method of training a network using integer arithmetic, with identical priors used in the encoding and decoding of the bitstream, how can we compute the latent representation of our data deterministically at evaluation time while ensuring its marginal resembles a smooth function during training?
To do this, the model must be robust to non-determinism both in training and at evaluation. This has been explored for categorical distributions and vector quantization by Jang et al. (2017) and Agustsson et al. (2017), respectively. They evaluate their ANNs followed by an argmax to get a deterministic output, while using a softmax during training to get useful gradients. This allows for a platform-independent pipeline.
As we are familiar with from previous work by Ballé et al. (Feb 2018), a probabilistic encoder distribution is used during training and replaced with a deterministic encoder during evaluation. The probabilistic encoder distribution used in training is

$$ e(\tilde{y} \mid x) = \prod_i \mathcal{U}\!\left(\tilde{y}_i \,\middle|\; g(x)_i - \tfrac{1}{2},\, g(x)_i + \tfrac{1}{2}\right), $$

where $g$ is the encoder network; this is equivalent to adding uniform noise to the continuous encoder output. During evaluation, the representation is instead computed deterministically as

$$ \hat{y} = Q(g(x)), $$

since rounding, $Q(\cdot)$, is precisely the quantization that the unit-width uniform noise stands in for during training.
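A minimal sketch of the two regimes (the function name is mine): uniform noise during training, rounding during evaluation:

```python
import jax
import jax.numpy as jnp

def representation(gx, key=None):
    if key is not None:
        # Training: additive uniform noise in [-0.5, 0.5) stands in for rounding.
        return gx + jax.random.uniform(key, gx.shape, minval=-0.5, maxval=0.5)
    # Evaluation: deterministic rounding.
    return jnp.round(gx)

gx = jnp.array([1.3, -0.7, 4.5])                   # hypothetical encoder outputs
print(representation(gx))                          # [ 1. -1.  4.]
print(representation(gx, jax.random.PRNGKey(0)))   # noisy version, changes with the key
```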
Training with the above construction was found to lead to instabilities. These were potentially caused by the marginal, $m(\tilde{y}) = \mathbb{E}_x\, e(\tilde{y} \mid x)$, no longer resembling a smooth function once $g$ is implemented as an integer network, making it hard for the (necessarily smooth) prior to fit.
As a solution to this issue, the paper proposes using the first of the two equations above during training, but with the last layer of $g$ computed without a nonlinearity and without the rounding in its division. During training, the last layer is evaluated in floating point with additive uniform noise,

$$ \tilde{y} = \frac{H u + b}{c} + \nu, \qquad \nu \sim \mathcal{U}\!\left(-\tfrac{1}{2}, \tfrac{1}{2}\right), $$

where $u$ is the input to the last layer, while during evaluation the same layer is computed with rounded division,

$$ \hat{y} = (H u + b) \oslash c, $$

where the two expressions agree up to quantization error, so the network behaves consistently across the two regimes.
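A sketch of this modified last layer, under the same assumptions as the earlier layer sketch: plain division plus noise during training, rounded division at evaluation:

```python
import jax
import jax.numpy as jnp

def last_layer(w, H, b, c, key=None):
    # Last layer of the encoder g: linear, no nonlinearity.
    pre = H.astype(jnp.float32) @ w + b
    if key is not None:
        # Training: plain division plus uniform noise keeps the marginal smooth.
        return pre / c + jax.random.uniform(key, pre.shape, minval=-0.5, maxval=0.5)
    # Evaluation: rounded division gives a deterministic integer representation.
    return jnp.round(pre / c)

# last_layer(w, H, b, c, key=jax.random.PRNGKey(0))  -> noisy float latents (training)
# last_layer(w, H, b, c)                             -> integer latents (evaluation)
```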
Now, we have a formalism where the marginal resembles a smooth function during training but can be represented deterministically during evaluation. This removes the need for regularizing the prior.
The paper details the formalism used to define an Integer Network, as well as how such a network differs between training and evaluation. The solution proposed in the paper clearly resolves the issue of non-determinism when a compression pipeline is split across heterogeneous compute systems, as demonstrated in the results.
In this paper, there are two techniques explored that are very interesting: the use of Integer Networks to compute the prior identically across platforms, and the modified training construction that allows the latent representation itself to be computed with integer arithmetic while keeping its marginal smooth during training.
Ágústsson, E.Þ. et al., 2017. Soft-to-Hard Vector Quantization for End-to-End Learning Compressible Representations. Advances in Neural Information Processing Systems 30, 1141–1151.
Ballé, J., July 2018. Efficient Nonlinear Transforms for Lossy Image Compression. arXiv:1802.00847 [eess].
Ballé, J., Johnston, N., Minnen, D., Sept 2018a. Integer Networks for Data Compression with Latent-Variable Models. Presented at the International Conference on Learning Representations.
Ballé, J., Minnen, D., Singh, S., Hwang, S.J., Johnston, N., Feb 2018b. Variational image compression with a scale hyperprior. Presented at the International Conference on Learning Representations.
Duda, J., 2014. Asymmetric numeral systems: entropy coding combining speed of Huffman coding with compression rate of arithmetic coding. arXiv:1311.2540 [cs, math].
Jang, E., Gu, S., Poole, B., 2017. Categorical Reparameterization with Gumbel-Softmax. Presented at the International Conference on Learning Representations.