Supermasks in Superposition

Written by Michael (Mike) Erlihson, PhD.

This review is part of #DeepNightLearners, a series of Machine & Deep Learning paper reviews originally published in Hebrew that aims to make the research accessible in plain language.

Good night friends, we are back in our DeepNightLearners section with a review of a Deep Learning article. Today I've chosen to review the article Supermasks in Superposition.

Reviewer Corner:

Reading recommendation: Recommended - there are two cool ideas in the article.
Clarity of writing: Medium+
Math and DL knowledge level: Basic familiarity with continual learning and Hopfield Networks.
Practical applications: Building large neural networks with fixed weights that can be used for multiple tasks.

Article Details:

Article link: available for download here
Code link: available here
Published on: 22/10/2020 on arXiv
Presented at: NeurIPS 2020

Article Domains:

Continual learning with Neural Networks
Multi-task learning with Neural Networks.

Mathematical Tools, Concepts and Marks:

Binary masks on neural network weights
Catastrophic forgetting in neural networks
Hopfield Networks (HN)
Entropy (the basic concept on which the article is based)

The Article in Essence:

The article proposes SupSup, a method for training a large neural network (let's call it the base network) so that it can perform several different tasks. After random initialization, the weights of the network are frozen, yet the same network can still be used for various classification tasks. This is done by learning a separate binary mask (a collection of 0s and 1s) for each task. During inference, the learned mask for the task is used to turn specific neuron connections in the network on or off. Because the shared weights never change, this sidesteps the "forgetfulness" that happens when an existing network is trained on a new task.
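To make the masking idea concrete, here is a minimal sketch of a single linear layer whose random weights stay frozen while a per-task 0/1 mask decides which connections are active. The function name `masked_linear`, the task names and the hand-written masks are my own illustrative assumptions; in SupSup the masks are learned, not written by hand.

```python
import random

def masked_linear(weights, mask, x):
    """Compute (W * M) @ x, where M is a 0/1 mask with the same shape as W."""
    return [
        sum(w * m * xi for w, m, xi in zip(w_row, m_row, x))
        for w_row, m_row in zip(weights, mask)
    ]

rng = random.Random(0)
# Base weights: drawn once at random, then never updated.
W = [[rng.choice([-1.0, 1.0]) for _ in range(4)] for _ in range(3)]
frozen_copy = [row[:] for row in W]

# One binary mask per task (hypothetical values, for illustration only).
masks = {
    "task_a": [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0]],
    "task_b": [[0, 1, 0, 1], [1, 0, 1, 0], [0, 0, 1, 1]],
}

x = [0.5, -1.0, 2.0, 1.5]
# Same weights, different subnetworks: only the mask changes per task.
outputs = {task: masked_linear(W, m, x) for task, m in masks.items()}
```

Switching tasks only swaps the mask, so nothing learned for an earlier task is overwritten.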

Figure 1: (left) During training SupSup learns a separate supermask (subnetwork) for each task. (right) At inference time, SupSup can infer task identity by superimposing all supermasks, each weighted by an α_i, and using gradients to maximize confidence.

The article defines 4 typical continual learning scenarios and suggests, for each one, a method to train a single base network that can perform multiple tasks:

GG - The tasks are known and identified (given) during both training and inference, i.e. we know what the task is.
GN - The tasks are given and known during training, but not during inference, and the labels are shared between the tasks, i.e. we need to infer the task so we can choose the correct mask during inference.
GNu - The tasks are given and known during training, but not during inference, and the labels are different (unshared) for each task, i.e. the output layer must be large enough to contain the labels of all the different tasks combined.
NN - The tasks are known neither during training nor during inference. For this, the labels must be shared. During training, we must decide whether to use an existing mask or to train a new one, and a similar decision must be made during inference.

Basic Ideas:

There are several interesting ideas in the article, which can be divided into two groups:

Methods for solving the continual learning scenarios above
Methods to efficiently store and produce the appropriate mask for a task

Let's dive in.

Methods for solving the continual learning scenarios above

GG tasks: Since the task is given, the matching mask is simply selected during inference.
GN and GNu tasks: The article suggests expressing the mask used at inference as a linear combination of all the previously trained masks. The task is then identified by the coefficient whose increase causes the steepest drop in the entropy of the network output, i.e. the coefficient for which the negative entropy gradient (the gradient multiplied by -1) is maximal. This avoids running the network once per mask during inference, which gets significantly expensive when there are hundreds or thousands of masks or a large amount of data. Instead, the network runs once with the masks averaged (all coefficients equal) and a single gradient is calculated; the paper calls this One-Shot. It is important to mention that One-Shot relies on a single gradient of the base network's output entropy, which is a non-convex function w.r.t. the mask coefficients, so it may choose the wrong mask. To overcome this problem, the authors suggest choosing the coefficients iteratively: in every iteration, the half of the coefficients with the smallest entropy decrease is set to zero, until a single mask is left.
NN tasks: Similar to the GN/GNu method, but when no coefficient leads to a significant drop in the entropy (i.e. a softmax over the negative gradients is close to uniform), a new mask is trained. Otherwise, the mask whose coefficient yields the largest entropy decrease is used. Another simple option: instead of training a new mask when one is needed, train the coefficients of an optimal linear combination of the existing masks. After all, the coefficients need less storage space than a mask.
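The two task-inference procedures for the GN/GNu case (a single gradient step, and the iterative halving) can be sketched on a toy problem. This is only an illustration under loud assumptions: a single linear layer stands in for the base network, the gradient is estimated by finite differences rather than backpropagation, and the names `one_shot` and `binary_search` as well as the weights and masks are mine, not taken from the authors' code.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def output_entropy(alphas, masks, W, x):
    """Entropy of softmax((W * sum_i alpha_i M_i) @ x) for one linear layer."""
    logits = []
    for j, w_row in enumerate(W):
        total = 0.0
        for k, w in enumerate(w_row):
            mixed = sum(a * m[j][k] for a, m in zip(alphas, masks))
            total += w * mixed * x[k]
        logits.append(total)
    return entropy(softmax(logits))

def entropy_grad(alphas, masks, W, x, eps=1e-4):
    """Central finite-difference estimate of dH/d(alpha_i) (assumption: no autograd)."""
    grads = []
    for i in range(len(alphas)):
        up, down = list(alphas), list(alphas)
        up[i] += eps
        down[i] -= eps
        diff = output_entropy(up, masks, W, x) - output_entropy(down, masks, W, x)
        grads.append(diff / (2 * eps))
    return grads

def one_shot(masks, W, x):
    """Pick the mask whose coefficient gives the steepest entropy decrease."""
    n = len(masks)
    grads = entropy_grad([1.0 / n] * n, masks, W, x)
    return min(range(n), key=lambda i: grads[i])

def binary_search(masks, W, x):
    """Repeatedly zero the half of the coefficients with the weakest entropy decrease."""
    alive = list(range(len(masks)))
    while len(alive) > 1:
        alphas = [0.0] * len(masks)
        for i in alive:
            alphas[i] = 1.0 / len(alive)  # renormalize the surviving coefficients
        grads = entropy_grad(alphas, masks, W, x)
        alive.sort(key=lambda i: grads[i])
        alive = alive[: (len(alive) + 1) // 2]
    return alive[0]

# Toy setup: mask t keeps only input column t; only column 2 separates the classes.
W = [[1.0, 1.0, 2.0, 1.0], [1.0, 1.0, -2.0, 1.0]]
x = [1.0, 1.0, 1.0, 1.0]
masks = [[[1 if k == t else 0 for k in range(4)] for _ in range(2)] for t in range(4)]

print(one_shot(masks, W, x), binary_search(masks, W, x))
```

In this toy case both procedures agree. The halving variant costs more gradient evaluations, but it re-estimates the gradients after every pruning step, which is what protects it from a misleading first gradient.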

In addition, the authors suggest a nice trick that significantly improves performance. They add "artificial labels" to the task, that is, additional neurons in the last layer of the base network. For example, when training an MNIST network with 10 classes, the last layer would include 100 neurons, where the 90 added neurons belong to "non-existing" classes. The article uses the values of these neurons to compute the gradient with respect to the coefficients of the masks' linear combination.

Table 1: Overview of different Continual Learning scenarios. We suggest scenario names that provide an intuitive understanding of the variations in training, inference, and evaluation, while allowing a full coverage of the scenarios previously defined in [49] and [55]. See text for more complete description.

Methods to efficiently store and produce the appropriate mask for a task.

The article suggests storing the masks in a Hopfield network (HN) by updating the HN's weights. During inference, the optimal mask is located by minimizing the sum of the HN's energy function (its loss function) and the entropy of the base network's output under the candidate mask.

The Intuition Corner: Let's understand the rationale of each of the article's suggested ideas.

The gradient of the base network's output entropy w.r.t. the masks' linear combination coefficients: The more confident the network is about its prediction for a specific sample, the lower the output entropy. The main assumption here is that if the network is 'confident' under a specific mask, then the task must be similar or identical to the task this mask was trained on. Note that the entropy is calculated over the entire training set, which makes this assumption plausible.
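In symbols, the rule can be written roughly as follows. The notation is my own shorthand and should be read as an assumption: f is the base network, W its frozen weights, M^1, ..., M^k the trained masks, α_i their coefficients, and x the data on which the entropy is evaluated.

```latex
p(\alpha) = \mathrm{softmax}\!\Big( f\big(x,\; W \odot \sum_{i=1}^{k} \alpha_i M^{i}\big) \Big),
\qquad
H(p) = -\sum_{j} p_j \log p_j,
\qquad
\hat{i} = \arg\max_{i} \left( -\frac{\partial H\big(p(\alpha)\big)}{\partial \alpha_i} \right)\Bigg|_{\alpha_1 = \dots = \alpha_k = 1/k}
```

The chosen index is the mask whose weight, if increased slightly from the uniform mixture, would make the network most confident.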

Adding artificial labels to the output layer: This is an idea I liked a lot. When a network is trained for a specific task with more neurons in the output layer than there are classes, it learns to push the extra neurons to highly negative values (which the softmax squashes to nearly zero). If during inference the values of these neurons are not strongly negative, it's a sign that the given task doesn't match the task for which the mask was trained. Instead of using entropy, the article suggests calculating the logarithm of the sum of the exponents of these artificial neurons in the output layer. A high value indicates a mask-to-task mismatch. Yet with all the beauty in this idea, I have a feeling that a similar result could be achieved using sigmoid temperature.
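A minimal sketch of that score, assuming the real classes occupy the first positions of the logit vector and the artificial neurons come after them. The function name `superfluous_score` and the example logits are made up for illustration.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def superfluous_score(logits, num_real_classes):
    """Log-sum-exp over the artificial output neurons only."""
    return logsumexp(logits[num_real_classes:])

# Hypothetical logits: 2 real classes followed by 3 artificial neurons.
matched = [4.0, 0.5, -9.0, -8.5, -10.0]   # artificial neurons pushed far down
mismatched = [0.8, 0.6, 0.3, -0.2, 0.5]   # artificial neurons not suppressed

print(superfluous_score(matched, 2), superfluous_score(mismatched, 2))
```

The mismatched example produces the higher score, which is the signal for a mask-to-task mismatch described above.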

Saving masks in a Hopfield Network: To save space, the article suggests storing the masks in a Hopfield Network. An HN is a matrix designed to store vectors with entries in {-1, 1} and to retrieve them in noise-free form. The network masks are composed of {0, 1}, so they need to be transformed to match the HN format. Each time we store an additional vector in the HN, we update the matrix with this vector (there are several methods to do so; the authors used the Storkey learning rule). So how does one read values from this memory matrix? Given a noisy version of a stored vector, we feed it to the energy function defined by this matrix and try to minimize it. It can be shown that the minimum is reached at the stored vector nearest to the noisy input.
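Here is a small sketch of the store-and-recall cycle. To keep it short I use the classic Hebbian update instead of the Storkey rule the authors used, so treat it as an illustration of the mechanism rather than of the paper's exact procedure. The function names and the two 8-bit masks are my own assumptions.

```python
def to_bipolar(mask_bits):
    """Map a {0, 1} mask to the {-1, 1} format a Hopfield network stores."""
    return [2 * b - 1 for b in mask_bits]

def to_binary(state):
    """Map a {-1, 1} state back to a {0, 1} mask."""
    return [(s + 1) // 2 for s in state]

def store(weights, pattern):
    """Hebbian update (swapped in for Storkey): W += p p^T / n, zero diagonal."""
    n = len(pattern)
    for i in range(n):
        for j in range(n):
            if i != j:
                weights[i][j] += pattern[i] * pattern[j] / n

def energy(weights, state):
    """Hopfield energy E(s) = -1/2 * s^T W s."""
    n = len(state)
    return -0.5 * sum(
        weights[i][j] * state[i] * state[j] for i in range(n) for j in range(n)
    )

def recall(weights, state, sweeps=5):
    """Asynchronous sign updates; each flip can only lower the energy."""
    state = list(state)
    for _ in range(sweeps):
        for i in range(len(state)):
            field = sum(w * s for w, s in zip(weights[i], state))
            if field != 0:
                state[i] = 1 if field > 0 else -1
    return state

mask_a = [1, 1, 1, 1, 0, 0, 0, 0]
mask_b = [1, 0, 1, 0, 1, 0, 1, 0]
W = [[0.0] * 8 for _ in range(8)]
for mask in (mask_a, mask_b):
    store(W, to_bipolar(mask))

noisy_mask = [0, 1, 1, 1, 0, 0, 0, 0]  # mask_a with its first bit flipped
recovered = to_binary(recall(W, to_bipolar(noisy_mask)))
print(recovered == mask_a)
```

Both masks live in the same 8x8 matrix, and the noisy input slides down the energy landscape to the nearest stored one.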

But in our case (this works only in the GN scenario) we need not only to locate the stored vector most similar to the input (which is always the average of all the masks) but also to find a mask that minimizes the entropy of the network output. So the authors added a term containing the network entropy to the HN loss. The total loss is a linear combination of the 'normal' HN loss, with a weight that rises with the iterations, and the entropy loss, with a weight that decreases with the iterations. The intuition here is that in the beginning the entropy term moves the search in the direction of the correct mask, and upon reaching that mask's neighborhood, the regular HN energy minimization takes over.
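The weighting scheme can be sketched in a few lines. The linear ramp below is my assumption for illustration; the text above only says that one weight rises and the other falls with the iterations, and the function and parameter names are hypothetical.

```python
def combined_objective(hopfield_energy, output_entropy, step, total_steps):
    """Blend the two terms: the HN weight grows while the entropy weight shrinks."""
    gamma = step / total_steps  # assumed linear ramp from 0 to 1
    return gamma * hopfield_energy + (1.0 - gamma) * output_entropy

# Early steps are dominated by the entropy term, late steps by the HN energy.
for step in (0, 5, 10):
    print(step, combined_objective(-3.0, 0.5, step, 10))
```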

Finally, I would like to note that the article uses quite a large, over-parameterized base network. This makes it possible to find masks that zero out many of its weights, so masks can be trained for many different tasks. Also, the authors haven't stated how the mask for each task is trained (there are many methods).

Achievements

The article shows empirically that its method is superior in all the scenarios listed above, and that its performance isn't far from the optimal performance of the base network on each task (when the network is re-trained separately on each task). The authors also demonstrate a significant storage saving compared to methods with similar performance, and show that their approach can handle thousands of tasks on a single base network with almost no loss relative to the best performance.

The effect of output size s on SupSup performance using the One-Shot algorithm. Results shown for PermutedMNIST with LeNet 300-100 (left) and FC 1024-1024 (right)

(left) SplitImagenet performance in Scenario GG. SupSup approaches upper bound performance with significantly fewer bytes. (right) SplitCIFAR100 performance in Scenario GG shown as mean and standard deviation over 5 seed and splits. SupSup outperforms similar size baselines and benefits from transfer

Datasets

GG: SplitCIFAR100, SplitImageNet

GN: PermutedMNIST, RotatedMNIST, SplitMNIST

NN: PermutedMNIST

P.S.

The article suggests a brilliant method to train a single network over a large number of tasks. Yet it's important to remember several things:

The tasks they trained on had similar difficulty (I'm not sure this would have worked if the tasks had different degrees of difficulty - maybe then each mask should have a different number of 1s or something).
The trained tasks are semantically similar. They didn't try combining datasets from different domains.
The trained tasks aren't too difficult. So perhaps training smaller separate networks would require less storage space?

#deepnightlearners

This post was written by Michael (Mike) Erlihson, Ph.D.

Michael works at the cybersecurity company Salt Security as a principal data scientist. Michael researches and works in the deep learning field while lecturing and making scientific material more accessible to the public.