MAML

  • For each task $\mathcal{T}_i$ you compute the loss gradient on its $K$ ‘train-train’ examples and take a gradient step, giving a separate set of parameters for each task.
  • These per-task parameters are the ‘adapted’ parameters $\theta_i'$.
  • Then you evaluate each task’s loss on its ‘train-test’ data using the adapted parameters, average over tasks, and take the gradient of this meta-loss with respect to the original shared parameters $\theta$.
  • So basically you do something like a training step for each task, test the resulting task-specific models, and use those test losses to learn the shared parameters (written out as equations after this list).
  • To evaluate you fine-tune for one or more steps on ‘test-train’ examples and compute the metrics on the ‘test-test’ examples.
  • A sentence I didn’t understand very well:
    • After noting that MAML continues to improve during evaluation when fine-tuned for more than one step even though only a single step is used during training, they write:

      This improvement suggests that MAML optimizes the parameters such that they lie in a region that is amenable to fast adaptation and is sensitive to loss functions from $p(\mathcal{T})$ … rather than overfitting to parameters $\theta$ that only improve after one step.

    • Earlier they note that the goal is:

      [T]o find model parameters that are sensitive to changes in the task such that small changes in the parameters will produce large improvements on the loss function of any task drawn from $p(\mathcal{T})$.

    • I get that the weights need to be such that when you fine-tune them just a bit on a new task the performance improves a lot.
  • Implementation details for the sine experiment (Colab notebook here); a code sketch of this setup follows the list
    • Amplitude [0.1, 5.]
    • Phase [0, $\pi$]
    • x [-5., 5.]
    • Loss MSE
    • 2 layers, 40 units, ReLU
    • K = 10
    • $\alpha$ = 0.01
    • Adam as meta optimiser
    • Not clear how many examples are used as ‘train-test’ for the meta step (the pseudo-code seems to indicate just 1)
    • During evaluation do the adaptation step using K (they try 5, 10, 20) ‘test-train’ examples and (I think) report the loss on a single ‘test-test’ example (evaluation is sketched in code below)
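
The two steps above can be written compactly in the paper’s notation: one inner gradient step with step size $\alpha$ on each task’s ‘train-train’ loss, then a meta-update with step size $\beta$ on the ‘train-test’ losses of the adapted parameters:

$$\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta)$$

$$\theta \leftarrow \theta - \beta \nabla_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}(f_{\theta_i'})$$

The outer gradient is taken with respect to the shared $\theta$, i.e. it differentiates through the inner update, which is where the second-order terms come from.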
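
Since the notebook itself isn’t reproduced here, below is a minimal JAX sketch of the sine setup using the hyper-parameters listed above (amplitude, phase and $x$ ranges, 2 hidden layers of 40 ReLU units, $K = 10$, $\alpha = 0.01$, MSE loss). Function names, the weight initialisation and anything else not in the list are my own choices, not the paper’s or the notebook’s code.

```python
import jax
import jax.numpy as jnp

ALPHA = 0.01   # inner-loop step size (alpha above)
K = 10         # 'train-train' examples per task

def sample_task(key):
    """Sample a sine task: amplitude in [0.1, 5.0], phase in [0, pi]."""
    k_amp, k_phase = jax.random.split(key)
    amp = jax.random.uniform(k_amp, (), minval=0.1, maxval=5.0)
    phase = jax.random.uniform(k_phase, (), minval=0.0, maxval=jnp.pi)
    return amp, phase

def sample_points(key, amp, phase, n):
    """n (x, y) pairs from the task, with x uniform in [-5, 5]."""
    x = jax.random.uniform(key, (n, 1), minval=-5.0, maxval=5.0)
    return x, amp * jnp.sin(x + phase)

def init_params(key):
    """2 hidden layers of 40 units (the He-style init is my choice)."""
    sizes = [1, 40, 40, 1]
    params = []
    for d_in, d_out in zip(sizes[:-1], sizes[1:]):
        key, sub = jax.random.split(key)
        w = jax.random.normal(sub, (d_in, d_out)) * jnp.sqrt(2.0 / d_in)
        params.append((w, jnp.zeros(d_out)))
    return params

def forward(params, x):
    h = x
    for w, b in params[:-1]:
        h = jax.nn.relu(h @ w + b)
    w, b = params[-1]
    return h @ w + b

def mse(params, x, y):
    return jnp.mean((forward(params, x) - y) ** 2)

def inner_adapt(params, x_tr, y_tr):
    """One gradient step on the task's 'train-train' data -> adapted parameters."""
    grads = jax.grad(mse)(params, x_tr, y_tr)
    return jax.tree_util.tree_map(lambda p, g: p - ALPHA * g, params, grads)

def task_meta_loss(params, x_tr, y_tr, x_te, y_te):
    """Loss of the adapted parameters on the task's 'train-test' data."""
    return mse(inner_adapt(params, x_tr, y_tr), x_te, y_te)

def meta_loss(params, x_tr, y_tr, x_te, y_te):
    """Mean over a batch of tasks (leading axis = task). Differentiating this
    with respect to the shared params gives the MAML meta-gradient; autodiff
    through the inner step supplies the second-order terms."""
    per_task = jax.vmap(task_meta_loss, in_axes=(None, 0, 0, 0, 0))
    return jnp.mean(per_task(params, x_tr, y_tr, x_te, y_te))
```

A meta-training step would then sample a batch of tasks, draw $K$ ‘train-train’ points and a few ‘train-test’ points per task, stack them into arrays with a leading task axis, and feed `jax.grad(meta_loss)` to Adam (e.g. `optax.adam`) as the meta optimiser; how many ‘train-test’ points to use per task is the open question noted above.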
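
And a rough sketch of the evaluation protocol, reusing the helpers above: fine-tune the meta-learned parameters for one or more inner steps on K ‘test-train’ examples from a fresh task, then report the loss on held-out ‘test-test’ points from the same task. The number of ‘test-test’ points here is my own choice; as noted above, the paper may use just one.

```python
def adapt(params, x_tr, y_tr, n_steps=1):
    """Fine-tune on the 'test-train' examples for one or more gradient steps."""
    for _ in range(n_steps):
        grads = jax.grad(mse)(params, x_tr, y_tr)
        params = jax.tree_util.tree_map(lambda p, g: p - ALPHA * g, params, grads)
    return params

def evaluate(params, key, k_shots=10, n_steps=1, n_test=100):
    """Post-adaptation loss on a new task; they try k_shots in {5, 10, 20}."""
    k_task, k_tr, k_te = jax.random.split(key, 3)
    amp, phase = sample_task(k_task)
    x_tr, y_tr = sample_points(k_tr, amp, phase, k_shots)  # 'test-train'
    x_te, y_te = sample_points(k_te, amp, phase, n_test)   # 'test-test'
    return mse(adapt(params, x_tr, y_tr, n_steps), x_te, y_te)
```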

References