I am a little less than a third of the way through my batch at the Recurse Center. Strictly it is a third (4 of 12 weeks), but that count includes a break for Christmas which I don't want to take.

It took me a while to settle into a regular rhythm, and I had not really taken that into account. At the same time, even in the first couple of weeks I have felt excited and motivated to learn. I have often found myself in a state of flow and have not had my usual problems with focus. In the past I would be reluctant to get started and would keep wanting to pause and do something else; I used to waste a lot of time browsing and felt unable to kick that habit. Now, on most days, I have hardly any temptation to browse.

What I need to work on now is leveraging this state of affairs to get things done. Although I feel more productive than before, I am not finishing things. I keep allowing unplanned goals to intrude and eat up sizeable chunks of time. Since many of these new tasks are challenging and interesting in themselves, they can induce a state of flow that makes me believe I am more productive than I really am.

I usually manage to pull back, but not before I have allowed the new goals to steal sizeable portions of my time. In consequence I have failed to devote enough time to most tasks, and have not returned to them often enough to avoid spending the first 30 minutes reminding myself what I did last time.

There are tasks I began in Week 1 that could have been finished by Week 2. They were not complex, long-running, open-ended projects but course assignments, which can be completed in the 7-10 days (depending on the course and the homework) that students usually get to finish them.

Whilst it might sound like I am trying to curtail my curiosity, contrary to the unschooling philosophy of RC, in my case it is an attempt to overcome FOMO. It is a problem I have often experienced during earlier periods of self-study. I try to read the RC forums daily to see what everyone is up to, as well as sharing my own updates. I usually get ideas and inspiration from their work, but I am also too easily swayed by what others are doing.

The obvious solution would seem to be to keep a backlog of some kind, and I have tried that. But I find it difficult to limit myself to a small number of tasks each day, even though I know that with a long list I won't get most of them done.

Experiment for week 5

I have wondered if it might help to state my goals for the upcoming week here. I am not sure it will work any better than posting daily check-ins on the RC forum, except that they will be more public here. I'm going to try it for what it is worth, despite feeling rather ashamed of sharing what is really not a particularly ambitious list of goals. The challenge is to get them all done within the next week and to minimise how much is carried over to the week after.

These are the tasks I aim to finish by 12/12/2020, along with my confidence (out of 5) that I will finish each one.

Courses

Assignments

  • CS242 HW 1
  • Graphics homework 0, 3/5
  • Distributed systems homework 1, 4/5

Studies

  • Study up to the end of chapter 3 of the RL book, 3/5
  • Finish studies on eigenvalues and eigenvectors, 3/5
  • Finish studies on wavelets, 3/5
  • Part 1 of Depth First Learning
  • Look up background for detection models

Papers

Read all these papers, where the completion criterion is adding a couple of sentences about each paper to the blog.

CS294 week 1

  • PixelCNN++: Salimans, Tim, et al. “Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications.” arXiv preprint arXiv:1701.05517 (2017)
  • Gated PixelCNN: Van den Oord, Aaron, et al. “Conditional image generation with pixelcnn decoders.” Advances in Neural Information Processing Systems. 2016.
  • WaveNet: Oord, Aaron van den, et al. “Wavenet: A generative model for raw audio.” arXiv preprint arXiv:1609.03499 (2016).
  • PixelCNN Super Resolution: Dahl, Ryan, Mohammad Norouzi, and Jonathon Shlens. “Pixel recursive super resolution.” Proceedings of the IEEE International Conference on Computer Vision. 2017.

Transformer (list adapted from references here)

  • Rami Al-Rfou, et al. “Character-level language modeling with deeper self-attention.” AAAI 2019. A transformer is used for character-level language modelling and achieves SoTA on enwik8 and text8. Unlike many earlier NLP architectures, the model used in the paper is very deep (64 layers, which was deep for an NLP model at the time the paper was published).

    They use a number of training strategies, all of which appear to be important for the model's success. These include: 1/ making predictions at each position, not just the final one (“Multiple Positions”); 2/ making more than one prediction at each position, e.g. not just the next character but also subsequent characters (“Multiple Targets”); 3/ making these predictions at every layer (“Intermediate Layer Losses”); 4/ learned positional embeddings at every layer (the motivation being that in a deep model the positional information might otherwise get lost downstream).

    These additional predictions are only made during training and are associated with extra parameters that are also only used during training. They result in auxiliary losses whose importance relative to the main task is adjusted through various weighting schemes (a toy sketch of how such losses can be combined appears after this list).

    They suggest that leaving out “Multiple Positions” and “Intermediate Layer Losses” results in the biggest drop in performance, because using these approaches is like training on many times more input-label pairs. Whilst the best model has 64 layers, a smaller model with 12 layers also outperforms most earlier models (except one result on enwik8 which used dynamic evaluation, which they don't employ).

    They also show that the model learns good long-range dependencies. For example, the model displays high uncertainty about an unlikely word that it has never seen, but inserting that word into the text hundreds of characters earlier makes the model much less uncertain about it.

  • Olah & Carter, “Attention and Augmented Recurrent Neural Networks”, Distill, 2016. (3/5)
  • Sainbayar Sukhbaatar, et al. “Adaptive Attention Span in Transformers”. ACL 2019. (3/5)
  • Rewon Child, et al. “Generating Long Sequences with Sparse Transformers” arXiv:1904.10509 (2019). (3/5)

  • Nikita Kitaev, et al. “Reformer: The Efficient Transformer” ICLR 2020. This is another approach to enable transformers to handle long sequences. The bottleneck is the attention map, whose size is of order $N^2$ where $N$ is the sequence length. In this paper they note that since the attention maps depend on $\text{softmax}(QK^T)$, they will be dominated by elements corresponding to key and query vectors that are closest to each other, whilst other elements will be close to zero. Based on this insight they use a hashing mechanism which, with high probability, places vectors that are close to each other (in the sense of having a large cosine similarity) in the same bins (a rough sketch of this bucketing appears after this list). The attention weights then only need to be computed for pairs of keys and queries in the same bin, with pairs in different bins given a weight of zero. If the bin sizes are much smaller than $N$, this approach is a lot cheaper. One change they make to the regular model is to define the key vectors as the unit vectors in the direction of the query vectors (i.e. $k_z = q_z / \lVert q_z \rVert$), which ensures that the hashes of $k_z$ and $q_z$ are the same and alleviates the problem of very uneven numbers of keys and queries in a bucket. I presume this means that 1/ they don't apply a key transform to the input in self-attention layers prior to applying attention; 2/ they don't use the encoder outputs / memory to generate the key vectors when obtaining the attention weights to apply to the encoder outputs. I am not sure if this is the case, but it looks like all the attention heads are self-attention, with the difference that an element does not attend to itself unless there is no other element available, e.g. if it is the first element in a sequence. They also experiment with adding reversible layers as in the reversible ResNet. The resulting model has performance comparable to the regular transformer on a number of datasets and problems (enwik8, WMT, ImageNet-64).

  • Alex Graves. “Adaptive Computation Time for Recurrent Neural Networks”. The key idea here is that a model should be able to decide dynamically how much computation time to spend on a task based on its complexity. Consider an addition problem with sequences of numbers of variable lengths: adding a larger number will be more involved than adding a smaller one, and the model should be able to adapt accordingly. This can also occur with non-sequential inputs, where the prediction is more difficult to make for some inputs than for others. The model works by modifying a normal RNN cell so that it effectively becomes an RNN itself, except that the sequence involved is the number of computation steps rather than the data. We distinguish between data timesteps $t$ and compute steps $i$. Each iteration receives the entire input at timestep $t$, with an indicator when $i=0$ to inform the cell that computation has just started for this data element. The cell generates an output and a state at each compute step $i$, and the state is fed back for the next compute step. (When $i=0$ the state from the previous timestep, or when $t=0$ the initial state, is used.) The cell makes an additional prediction at every step $i$ which is used to decide whether it should stop computing for element $t$. These predictions are also used to generate weights for aggregating the outputs and states from each compute step $i$ (a rough sketch of this halting loop appears after this list). The model is trained using the regular losses for the problem plus an additional loss that penalises long computation times. The model outperforms baseline models which don't use adaptive computation (i.e. where the RNN cell behaves like a regular RNN cell with just one compute step) on a number of algorithmic tasks (parity, logic, addition, sorting).

  • Niki Parmar, et al. “Image Transformer” ICML 2018. (4/5) They use transformers for autoregressive image generation. The Image Transformer has a larger receptive field compared to PixelCNN but can be trained for all positions in parallel unlike RNN-based approaches.

The input consists of all the channels of the image concatenated along the width (so the input size is $H \times 3W$). The ordering of the pixels differs from that of PixelCNN / PixelRNN: here the image is split into blocks which are generated one at a time, and the ordering is block by block, left to right, top to bottom.

A key change is the use of local attention for larger images, where it would be computationally infeasible to attend over all $(H \cdot W \cdot 3)^2$ pairs of positions (for RGB images). The image is split into non-overlapping blocks for the queries. For the key and value inputs, the image is split into overlapping blocks consisting of the pixels in the query block and some of the previous pixels (to the left or above); a rough sketch of this slicing is given below.

Two variants are tried, where the blocks are 1- or 2-dimensional. In terms of receptive field size both are equally good, but the 2-d blocks result in a more balanced context. They also experiment with categorical and mixture-of-logistics distributions for the colour probabilities.

A further difference is that for some of the models they only use a decoder: for unconditional image generation as in PixelCNN, or for class-conditional generation where the input is non-sequential (and in a sense the class label is already encoded). This makes sense because in the original Transformer the encoder is allowed to look at all the positions before and after the present one, whilst the decoder is only allowed to look at earlier positions.

Another experiment they try is super-resolution, for which they run the encoder on a downsampled image and use the decoder to generate a higher-resolution version of it.

They achieve SoTA for ImageNet unconditional generation, and their CelebA super-resolution images fooled humans significantly more often (around a third of the time versus 10%) than those generated by other models.
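
To make the local attention scheme more concrete, here is a rough sketch of the 1-d variant (my own hypothetical PyTorch code, not the paper's implementation; the block sizes and function names are made up): the flattened pixel positions are cut into non-overlapping query blocks, and each query block attends over a memory block made of itself plus some preceding positions.

```python
import torch

def local_blocks(x, q_len=64, mem_prev=128):
    """Split a flattened pixel sequence into query blocks and their
    (overlapping) key/value memory blocks, as in 1-d local attention.

    x: (seq_len, d) tensor of per-position representations.
    Returns a list of (query_block, memory_block) pairs.
    """
    seq_len, d = x.shape
    pairs = []
    for start in range(0, seq_len, q_len):
        end = min(start + q_len, seq_len)
        q_block = x[start:end]                # non-overlapping query block
        mem_start = max(0, start - mem_prev)  # include some previous pixels
        kv_block = x[mem_start:end]           # memory = previous pixels + query block
        pairs.append((q_block, kv_block))
    return pairs

def local_attention(q_block, kv_block):
    """Plain dot-product attention of a query block over its memory block."""
    scores = q_block @ kv_block.T / q_block.shape[-1] ** 0.5
    # NOTE: the causal mask (each pixel attending only to earlier positions)
    # is omitted to keep the sketch short.
    return torch.softmax(scores, dim=-1) @ kv_block
```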

  • Zihang Dai, et al. “Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context.” ACL 2019. The key idea is that when an input sequence is part of a longer sequence, you save the hidden states corresponding to earlier parts of the full sequence and concatenate them with the hidden states of the input sequence before feeding them to the attention layer, but only let gradients flow through the input sequence's hidden states and not through the previous ones (a rough sketch appears after this list).

    The motivation for this approach is that, unlike the basic transformer model, it lets you take advantage of the fact that transformers are not affected by vanishing gradients, and enables you to use information from a longer sequence rather than just a fixed-length input. It also prevents context fragmentation: if you have a long passage and are breaking it up into chunks to feed into the model whilst training, the basic model has no information from the earlier parts to help it understand the present chunk. They draw an analogy with memory-augmented networks.

    The other thing they do, which is complementary to this change, is to use relative positional encodings, so that the model is informed that the previous hidden states correspond to earlier parts of the sequence. They introduce two sets of attention weights to deal with the positional and the word (or token) embeddings, and replace the positional-embedding query with a pair of learned parameters, effectively turning the position-word and position-position attention components into biases.

    They achieve SoTA on WikiText-103, enwik8 and text8. They also achieve SoTA on the One Billion Word benchmark, which does not have long-range dependencies, and (among models without two-step fine-tuning) on Penn Treebank, which is a smaller dataset. This suggests that the model generalises well to datasets different from those for which it was designed.

  • Aidan N. Gomez, et al. “The Reversible Residual Network: Backpropagation Without Storing Activations” NIPS 2017. (1/5)
  • Mostafa Dehghani, et al. “Universal Transformers” ICLR 2019. The model in this paper is inspired by the [Adaptive Computation Time RNN](https://arxiv.org/abs/1603.08983) but applies the approach to a Transformer. As in the regular Transformer they use an encoder-decoder architecture, where the encoder block comprises a self-attention block and a transition block (plus LayerNorm and Dropout) and the decoder block comprises self-attention, memory attention and a transition block. The key difference is that there is just a single encoder block and a single decoder block. Encoding consists of applying the same encoder block, initially to the input and then to the output of the previous application, for a number of timesteps. The output of this is passed to the decoder, which is then also run repeatedly for a number of timesteps. The number of timesteps can be determined beforehand or decided dynamically on a per-symbol basis using a halting mechanism as in the ACT-RNN. The difference is that since the whole sequence needs to be present each time for self-attention, the output for symbols which have halted is simply copied over to the next step. This model is particularly good at algorithmic tasks at which the regular Transformer struggles. They show that, unlike the regular Transformer, it is Turing-complete. It also outperforms regular transformers on language tasks and achieves SoTA on the LAMBADA language modelling task.

  • Emilio Parisotto, et al. “Stabilizing Transformers for Reinforcement Learning” arXiv:1910.06764 (2019). They make some relatively minor changes to the residual connections in the Transformer architecture to make it suitable for RL: 1/ apply LayerNorm only at the input to the main branch and not at the input to the residual branch, allowing the residual connection to be an identity function of the input to the block; 2/ replace addition with gating. With a gating mechanism based on GRUs, the resulting model is more stable across runs with different hyperparameters and performs better than a regular transformer as well as other approaches, reaching state-of-the-art on the DMLab-30 benchmark (a rough sketch of the gated block appears after this list).

  • Krzysztof Choromanski et al. “Rethinking Attention with Performers” arXiv:2009.14794 (2020) (3/5)
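
A few rough sketches of mechanisms from the papers above; all of these are simplified, hypothetical PyTorch code of my own rather than the authors' implementations. First, for the character-level transformer's auxiliary losses, a toy sketch of how per-position and intermediate-layer losses might be combined (the fixed weights and names are my assumptions, not the paper's schedule, and the “Multiple Targets” predictions are left out):

```python
import torch.nn.functional as F

def combined_loss(layer_logits, targets, layer_weights):
    """layer_logits: list of (seq_len, vocab) prediction tensors, one per layer
    (intermediate layers included); targets: (seq_len,) next-character ids.

    Cross-entropy is taken at every position of every layer, and the per-layer
    losses are blended with scheduled weights (here just fixed numbers).
    """
    total = 0.0
    for logits, weight in zip(layer_logits, layer_weights):
        total = total + weight * F.cross_entropy(logits, targets)  # averages over positions
    return total
```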
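
For the Reformer's LSH attention, a minimal sketch of the bucketing step (the random-rotation hashing follows the scheme described in the paper, but the multi-round hashing and the chunking of buckets are omitted):

```python
import torch

def lsh_buckets(vectors, n_buckets, seed=0):
    """Assign each vector to one of n_buckets so that vectors with high
    cosine similarity are likely to land in the same bucket (angular LSH).

    vectors: (n, d) tensor; n_buckets must be even.
    """
    n, d = vectors.shape
    gen = torch.Generator().manual_seed(seed)
    # A shared random projection; the hash is the argmax over the projected
    # coordinates and their negations.
    rotation = torch.randn(d, n_buckets // 2, generator=gen)
    projected = vectors @ rotation                      # (n, n_buckets // 2)
    return torch.cat([projected, -projected], dim=-1).argmax(dim=-1)

# Attention is then restricted to pairs of queries/keys that share a bucket;
# with roughly n / n_buckets items per bucket this is much cheaper than the
# full n x n attention map.
```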
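
For Adaptive Computation Time, the per-timestep pondering loop could be sketched roughly as follows (a single-example simplification of my own: the class name, the choice of a GRU cell and the `max_steps` cap are assumptions, and only the state, not the per-step outputs, is aggregated here):

```python
import torch
import torch.nn as nn

class ACTCell(nn.Module):
    """Sketch of Adaptive Computation Time wrapped around an RNN cell."""

    def __init__(self, input_size, hidden_size, eps=0.01, max_steps=10):
        super().__init__()
        self.cell = nn.GRUCell(input_size + 1, hidden_size)  # +1 for the "first step" flag
        self.halt = nn.Linear(hidden_size, 1)
        self.eps, self.max_steps = eps, max_steps

    def forward(self, x, state):
        """x: (input_size,) input for one data timestep; state: (hidden_size,)."""
        total_halt = 0.0
        out_state = torch.zeros_like(state)
        for step in range(self.max_steps):
            flag = torch.ones(1) if step == 0 else torch.zeros(1)
            state = self.cell(torch.cat([x, flag]).unsqueeze(0),
                              state.unsqueeze(0)).squeeze(0)
            p = torch.sigmoid(self.halt(state)).squeeze()   # halting probability
            remainder = 1.0 - total_halt
            if total_halt + p > 1.0 - self.eps or step == self.max_steps - 1:
                # Final ponder step: it receives the leftover probability mass.
                out_state = out_state + remainder * state
                ponder_cost = (step + 1) + remainder        # N(t) + R(t) in the paper
                break
            out_state = out_state + p * state
            total_halt = total_halt + p
        # ponder_cost feeds the extra loss term that penalises long computation.
        return out_state, ponder_cost
```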
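
For Transformer-XL, the state reuse for one layer might look roughly like this (my own single-head sketch with made-up names; the relative positional encodings and the causal mask are left out):

```python
import torch

def attend_with_memory(h, memory, w_q, w_k, w_v):
    """One attention step over the current segment plus cached memory.

    h:      (cur_len, d)  hidden states of the current segment
    memory: (mem_len, d)  hidden states cached from the previous segment(s)
    w_q, w_k, w_v: (d, d) projection matrices
    """
    # Gradients are stopped through the cached states, so the extended
    # context does not increase the cost of backpropagation.
    context = torch.cat([memory.detach(), h], dim=0)   # (mem_len + cur_len, d)
    q = h @ w_q                                        # queries only for the current segment
    k, v = context @ w_k, context @ w_v
    scores = q @ k.T / h.shape[-1] ** 0.5
    # A causal mask (each position attends to the memory and to earlier
    # positions of the current segment only) is omitted for brevity.
    out = torch.softmax(scores, dim=-1) @ v
    new_memory = h.detach()                            # cache for the next segment
    return out, new_memory
```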
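
And for “Stabilizing Transformers for Reinforcement Learning”, a sketch of the reordered LayerNorm and the gated residual connection; I use nn.GRUCell as a stand-in for the paper's GRU-style gating equations, which additionally include a bias that lets the gate start close to the identity:

```python
import torch
import torch.nn as nn

class GatedBlock(nn.Module):
    """Residual branch with LayerNorm on the sub-module input only,
    and a GRU-style gate in place of the usual addition."""

    def __init__(self, d_model, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer                  # e.g. multi-head attention or an MLP
        self.gate = nn.GRUCell(d_model, d_model)  # stand-in for the paper's gating

    def forward(self, x):
        # LayerNorm is applied to the sub-module input only, so the skip path
        # from x to the output remains an identity map.
        y = torch.relu(self.sublayer(self.norm(x)))
        flat_x = x.reshape(-1, x.shape[-1])
        flat_y = y.reshape(-1, y.shape[-1])
        # Gate the update into the residual stream instead of adding it.
        return self.gate(flat_y, flat_x).reshape(x.shape)
```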

Implementations

  • Implement an experiment from “Adaptive Computation Time for Recurrent Neural Networks”, 4/5
  • Implement RetinaNet with accompanying notes, 4/5
  • Finish YogaNet [link], 4/5