MADE: Masked Autoencoder for Distribution Estimation

A precursor to autoregressive models such as PixelCNN, this models a conditional probability distributed over the pixels in a flattened represented of the image where the value of each pixel depends on the value of those previous to it.

The auto-encoders in the paper use MLPs and they prevent information flow from future pixels by randomly assigned an id in the range $i \in [0, N-1)$ to each neuron of the first layer so that a neuron with id $n$ only receives inputs from pixels $0 \ldots n$. No inputs are received from the final pixel. The weight kernels are generated in such a way that from forbidden inputs are masked hence the name.

Subsequently layers are built in a similar way ensuring that a neuron with id $n^l$ in layer l only receives inputs from neurons in layer in $l-1$ with id $\leq n^l$. The only difference for the final layer $L$ is that neutrons only receive inputs from those with idx strictly less than $n^L$. This way pixel $i$ can be used to predict the outputs corresponding to positions $i+1 \ldots N-1$ but not output for position $i$

Later models like PixelCNN / PixelRNN used models that were sequential by nature making masking a lot easier. However what is interesting in this paper is that they experiment with different kinds of order rather than the natural one suggested by the order of the pixels in the flattened representation of the image and also try dynamically generating different random orderings as well as different maskings throughout training.

Scaling Autoregressive Video Models

In this paper they build a transformer-based model to generate videos as 3-dimensional slices (time $\times$ height $\times$ width). They essentially model the video as a nested sequence - subslices, pixel per subslice, channels per pixel. The model can be seen as a cascade of encoder followed by decoded followed by channel MLPs with each model refining the predictions of the one before.

  1. The video slice is subdivided into strided slices along each dimension. This ordered set of slices comprises the outer sequence that is fed to the encoder.The encoder generates a sub-slice level representation for the subsequent slice.

  2. The decoder generates the subslice a pixel at a time. First it predicts an initial value for the next pixel using only the previous pixels.

  3. Finally the channels of a pixel are predicted by channel-specific 1 layer MLPs conditioned on the predictions for the previous channels. Each subsequent MLP has increasingly more units to account for the fact the all the previous channel predictions are used.

References