Papers for CS330: Supervised multi-task learning and transfer learning
Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics
In this paper they use task uncertainty to weight the losses of different vision tasks, including semantic segmentation, instance segmentation and depth prediction. They show that the model they use is extremely sensitive to the choice of weights, but that with an appropriate weighting a multi-task model does better than individual models trained for each task. However, a grid search over weights is time-consuming and becomes increasingly expensive as more tasks are added.
Their approach is based on the idea of task uncertainty, a kind of uncertainty that does not depend on the input data but only on the task, e.g. classification, segmentation, etc. In their Bayesian formulation a variance is learned for each task, and the inverse of this variance weights that task's loss.
The variance parameter for each task is interpreted as the model's uncertainty about its predictions for that task: the higher the variance, the lower the task's contribution to the total loss. A log-variance term in the loss keeps the variances from growing without bound.
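A minimal PyTorch sketch of this weighting, assuming one learnable log-variance per task (the class and variable names are mine, and the paper's exact constants differ slightly between its regression and classification losses):

```python
import torch
from torch import nn

class UncertaintyWeightedLoss(nn.Module):
    """Combines per-task losses, weighting each by a learned task (homoscedastic) variance."""

    def __init__(self, num_tasks):
        super().__init__()
        # s_i = log(sigma_i^2) for each task; learned jointly with the model weights.
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses):
        # Each loss is scaled by 1/sigma_i^2 = exp(-s_i); the "+ s_i" term stops the
        # variances from growing without bound (which would zero out every loss).
        total = 0.0
        for s, loss in zip(self.log_vars, task_losses):
            total = total + torch.exp(-s) * loss + s
        return total

# Usage: combine any set of task losses, e.g. segmentation, instance and depth losses.
weighting = UncertaintyWeightedLoss(num_tasks=3)
seg_loss, inst_loss, depth_loss = torch.rand(3)  # placeholder per-task losses
total_loss = weighting([seg_loss, inst_loss, depth_loss])
```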
The architecture they use has a deep shared encoder based on DeepLabV3 and a shallow task-specific decoder for each task.
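Something like the following captures the overall shape of the architecture, with a toy convolutional stack standing in for the DeepLabV3-style encoder (all layer sizes here are placeholders):

```python
import torch
from torch import nn

class MultiTaskNet(nn.Module):
    """One deep shared encoder feeding a shallow decoder head per task."""

    def __init__(self, num_classes, feat_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(                      # shared by all tasks
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
        )
        self.semantic_head = nn.Conv2d(feat_dim, num_classes, 1)  # per-pixel class logits
        self.instance_head = nn.Conv2d(feat_dim, 2, 1)            # per-pixel (dx, dy) to instance centroid
        self.depth_head = nn.Conv2d(feat_dim, 1, 1)               # per-pixel depth

    def forward(self, x):
        feats = self.encoder(x)
        return self.semantic_head(feats), self.instance_head(feats), self.depth_head(feats)
```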
What was also interesting in this paper, beyond its main contribution, was how they cast instance segmentation, which is itself usually treated as a multi-task problem (e.g. by approaches like Mask R-CNN), as a single task. For each pixel they predict a vector to the centroid of the instance it belongs to, cluster the predicted centroids to generate instances, and then assign each pixel to its nearest instance.
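A small numpy/scikit-learn sketch of that post-processing step, with hypothetical names and mean-shift used as a stand-in for whatever clustering the authors actually use:

```python
import numpy as np
from sklearn.cluster import MeanShift

def pixels_to_instances(offsets, foreground_mask, bandwidth=5.0):
    """offsets: (H, W, 2) predicted (dx, dy) from each pixel to its instance centroid.
    foreground_mask: (H, W) boolean mask of pixels belonging to any instance."""
    H, W, _ = offsets.shape
    ys, xs = np.nonzero(foreground_mask)
    coords = np.stack([xs, ys], axis=1).astype(float)   # pixel (x, y) positions
    votes = coords + offsets[ys, xs]                    # each pixel votes for a centroid location

    clustering = MeanShift(bandwidth=bandwidth).fit(votes)
    centers = clustering.cluster_centers_               # one centre per discovered instance

    # Assign every foreground pixel to the nearest estimated instance centre.
    dists = np.linalg.norm(votes[:, None, :] - centers[None, :, :], axis=-1)
    labels = np.full((H, W), -1, dtype=int)             # -1 = background
    labels[ys, xs] = dists.argmin(axis=1)
    return labels
```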
They found that their weighting not only works much better than uniform weights but also improves over weights found via grid search, which they attribute to the limited resolution of grid search combined with the ability of the uncertainty-based weights to change dynamically during training.
Universal Language Model Fine-tuning for Text Classification
Transfer learning via fine-tuning is a common approach in computer vision, where weights typically trained for ImageNet classification are transferred to other datasets and tasks via fine-tuning. In NLP this is less common and usually limited to using word vectors trained on other datasets.
ULMFiT is a 3-layer LSTM language model trained on WikiText-103, consisting of 28,595 Wikipedia articles and 103 million words, which they characterise as an ImageNet equivalent for (English) natural language processing. Language modelling is chosen as the source task since it captures many aspects of language that are important downstream.
They train both a forward and a backward model, fine-tune these independently and average the results.
They use a number of techniques for fine-tuning including:
- Slanted triangular learning rates, which increase quickly over a small fraction of iterations early on and then decrease more slowly until the end of training. The idea is that you want the parameters to converge quickly to a suitable region of parameter space at the start and then refine them within that region for the remaining time (see the sketch after this list).
- To preserve information from earlier parts of the text, they concatenate the average-pooled and max-pooled hidden states over earlier time steps with the most recent hidden state ("concat pooling").
- They use a modified version of BPTT for text classification: a long document is split into fixed-length batches, the final state of the previous batch is used as the initial state for the current one, and gradients are back-propagated to the batches whose hidden states contributed to the final output.
- They use different learning rates for each layer, with lower rates for earlier layers, since in practice weights go from general to task-specific as you move deeper into a network ("discriminative fine-tuning").
- They also unfreeze the layers one at a time, one per epoch starting with the final layer, in order to avoid catastrophic forgetting ("gradual unfreezing").
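A rough PyTorch sketch of three of these tricks together: the slanted triangular schedule, discriminative (per-layer) learning rates and gradual unfreezing. The layer groups, lr_max and iteration count below are placeholders; cut_frac=0.1, ratio=32 and the 2.6 divisor between layers are the defaults reported in the paper.

```python
import math
import torch
from torch import nn

# Slanted triangular learning rate: returns the multiplier applied to lr_max at step t.
def stlr_multiplier(t, T, cut_frac=0.1, ratio=32):
    cut = math.floor(T * cut_frac)                       # step at which the schedule peaks
    if t < cut:
        p = t / cut                                      # short, steep increase
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))   # long, slow decay
    return (1 + p * (ratio - 1)) / ratio

# Placeholder layer groups standing in for the 3-layer LSTM language model + classifier head.
embedding = nn.Embedding(10000, 400)
lstm = nn.LSTM(400, 400, num_layers=3, batch_first=True)
head = nn.Sequential(nn.Linear(400, 50), nn.ReLU(), nn.Linear(50, 2))

# Discriminative fine-tuning: each earlier layer group gets the next group's lr divided by 2.6.
lr_max = 0.01
optimizer = torch.optim.SGD([
    {"params": head.parameters(),      "lr": lr_max},
    {"params": lstm.parameters(),      "lr": lr_max / 2.6},
    {"params": embedding.parameters(), "lr": lr_max / 2.6 ** 2},
])

T = 1000  # total number of fine-tuning updates
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda t: stlr_multiplier(t, T))

# Gradual unfreezing: start with everything frozen, then unfreeze one group per epoch,
# last layer first. Frozen parameters get no gradients, so the optimizer skips them.
groups = [head, lstm, embedding]
for g in groups:
    for p in g.parameters():
        p.requires_grad = False
for epoch, g in enumerate(groups):
    for p in g.parameters():
        p.requires_grad = True
    # ... run one epoch of fine-tuning here, calling optimizer.step() and then
    # scheduler.step() after every batch ...
```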
The model is described as universal since it works well when fine-tuned on various datasets with different quantities of data, documents of varying lengths and different label types, e.g. sentiment analysis, topic classification, etc., and it uses the same architecture in every case (the only difference being the number of classes in the final layer). It also doesn't need dataset-specific changes like custom preprocessing or feature engineering.
For the transfer learning part, a small MLP made up of two dense layers is attached to the output of the LSTM model. This model is then fine-tuned on six different text classification datasets, and the resulting models beat the state of the art on the test sets of all of them.
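A rough sketch of that classifier head, which also shows the concat pooling mentioned in the list above: the last hidden state is concatenated with max- and mean-pools over all time steps before the two dense layers (the hidden sizes here are placeholders, and the paper additionally uses batch normalization and dropout in these layers):

```python
import torch
from torch import nn

class ConcatPoolClassifier(nn.Module):
    """Two dense layers on top of concat-pooled LSTM hidden states."""

    def __init__(self, hidden_dim, num_classes, mlp_dim=50):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * hidden_dim, mlp_dim),  # [last, max-pool, mean-pool] concatenated
            nn.ReLU(),
            nn.Linear(mlp_dim, num_classes),
        )

    def forward(self, lstm_outputs):
        # lstm_outputs: (batch, seq_len, hidden_dim) hidden states from the language model
        last = lstm_outputs[:, -1, :]
        max_pool = lstm_outputs.max(dim=1).values
        mean_pool = lstm_outputs.mean(dim=1)
        return self.mlp(torch.cat([last, max_pool, mean_pool], dim=1))
```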
The IMDB validation split is used to tune the hyperparameters for transfer learning, and these are reused for all the other tasks. Fine-tuning with only a small number of labels also yields models that perform well compared to models trained from scratch on all the labels.