To meet the aforementioned constraints, we approach the continual learning problem from a counterintuitive perspective in which the “curse” of high dimensionality is turned into a blessing. The key insight is to exploit one of the intriguing properties of this “curse”: independent random high-dimensional vectors are mutually dissimilar and can therefore naturally represent different classes, a property that has already been exploited in few-shot learning [1]. As a result, the representation of a novel class not only adds incrementally to what has been learned before, but also avoids interfering with it. Going one step further, this design offers inherent compressibility: multiple training samples, or even multiple class vectors, can be represented by a single vector.
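To see this property in action, the short NumPy sketch below draws pairs of independent random bipolar vectors and measures their cosine similarity; the function name `mean_abs_cosine` and the chosen dimensions are illustrative, not part of our method.

```python
# Minimal sketch: independently drawn high-dimensional vectors are quasi-orthogonal.
# As the dimension d grows, the cosine similarity between random vectors
# concentrates around zero (roughly as 1/sqrt(d)).
import numpy as np

rng = np.random.default_rng(0)

def mean_abs_cosine(d, n_pairs=1000):
    # Draw pairs of independent bipolar (+1/-1) vectors and measure their similarity.
    a = rng.choice([-1.0, 1.0], size=(n_pairs, d))
    b = rng.choice([-1.0, 1.0], size=(n_pairs, d))
    cos = np.sum(a * b, axis=1) / d
    return np.mean(np.abs(cos))

for d in (16, 256, 4096):
    print(f"d={d:5d}  mean |cosine| = {mean_abs_cosine(d):.3f}")
# Classes mapped to such vectors therefore barely interfere with one another.
```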
When combined with deep CNNs, this approach allows the network to continually express many more classes than the fixed dimensionality of the class vectors, with minimal interference among all classes. We therefore propose C-FSCIL, which architecturally consists of a frozen CNN (as feature extractor), a trainable fixed-size fully connected layer, and a dynamically growing explicit memory that stores as many vectors as the number of classes encountered so far. The frozen part is separated from the growing part by the fully connected layer, which outputs class vectors in a high-dimensional embedding space whose dimensionality remains fixed and is therefore independent of the number of past and future classes. The CNN is meta-learned with a sharpened attention mechanism (see our 2021 blog article, or [1]) so that different image classes are represented by dissimilar vectors. Throughout the course of continual learning, C-FSCIL is constrained to either no gradient updates at all (Mode 1) or a small constant number of iterations that retrain only the fully connected layer (Modes 2 and 3).
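The following PyTorch sketch illustrates this architecture under simplifying assumptions: names such as `CFSCILSketch`, `embed_dim`, and the averaging in `add_class` are illustrative placeholders, not the exact implementation.

```python
# Sketch of the C-FSCIL layout described above: a frozen CNN feature extractor,
# a trainable fixed-size fully connected layer mapping into a d-dimensional
# embedding space, and an explicit memory that grows by one class vector per class.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CFSCILSketch(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, embed_dim: int = 512):
        super().__init__()
        self.backbone = backbone.eval()           # frozen feature extractor
        for p in self.backbone.parameters():
            p.requires_grad_(False)
        self.fc = nn.Linear(feat_dim, embed_dim)  # trainable, fixed size
        self.memory = []                          # explicit memory: one vector per class

    def embed(self, x):
        with torch.no_grad():
            feats = self.backbone(x)
        return F.normalize(self.fc(feats), dim=-1)

    @torch.no_grad()
    def add_class(self, support_images):
        # Mode 1: the class vector is the averaged support embedding,
        # computed in one pass without any gradient-based update.
        proto = F.normalize(self.embed(support_images).mean(dim=0), dim=-1)
        self.memory.append(proto)

    def forward(self, x):
        protos = torch.stack(self.memory)   # (num_classes, embed_dim), grows over time
        sims = self.embed(x) @ protos.t()   # cosine similarities to all class vectors
        # Meta-learning sharpens the attention over these similarities (see [1]).
        return sims
```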
Our retraining in Modes 2 and 3 can be seen as an extremely efficient version of the latent replay technique [2], applied only to the last fully connected layer. The retraining procedure is efficient because it stores only one compact activation pattern per class in the memory, and it applies only a small constant number of iterations to update the last layer. Such cheap replay is sufficient thanks to the high quality of the pre-trained embedding [3].
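To make the replay idea concrete, here is a minimal sketch of such a retraining loop, building on the architecture sketch above; the bipolarized targets, the cosine loss, and the hyperparameters are illustrative assumptions rather than a verbatim reproduction of the procedure.

```python
# Hedged sketch of Mode 2/3-style retraining: one stored average activation
# pattern per class, a small constant iteration budget, and updates restricted
# to the fully connected layer.
import torch
import torch.nn.functional as F

def retrain_fc(model, stored_acts, num_iters=50, lr=1e-3):
    # stored_acts: (num_classes, feat_dim) -- one compact activation per class
    optimizer = torch.optim.Adam(model.fc.parameters(), lr=lr)
    targets = torch.sign(torch.stack(model.memory))   # bipolarized class vectors (assumed targets)
    for _ in range(num_iters):                        # small, constant number of iterations
        emb = F.normalize(model.fc(stored_acts), dim=-1)
        loss = 1.0 - F.cosine_similarity(emb, targets, dim=-1).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Refresh the explicit memory with the updated class vectors.
    with torch.no_grad():
        model.memory = list(F.normalize(model.fc(stored_acts), dim=-1))
```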
Our most prominent result is enabling deep CNNs to learn continually at scale under extreme constraints, i.e., very few training samples, a fixed compute cost, and a slowly growing memory proportional to the number of classes encountered so far (see Figure 1). Experiments on natural-image datasets (CIFAR100 and miniImageNet) and on handwritten characters from 50 different alphabets (Omniglot) show that C-FSCIL sets a new state-of-the-art accuracy record, outperforming the baselines even in Mode 1, which simply computes class vectors in one pass without any gradient-based parameter update. C-FSCIL also scales up to the largest problem size ever attempted in this few-shot continual learning setting, learning 423 novel classes on top of 1,200 base classes with less than a 1.6% accuracy drop using Mode 3.
Continual learning faces a number of issues, namely catastrophic forgetting, class imbalance, and interference with past classes.
C-FSCIL avoids these issues systematically by assigning a high-dimensional quasi-orthogonal vector to every class, which reduces interference.
Other methods cannot achieve this because they cannot support more classes than the vector dimensionality of the layer they are connected to.
Our class vectors are stored in the explicit memory, where they can be selectively updated. Their alignment to the classes can be further improved by retraining the fully connected layer, whose structure remains fixed and independent of the number of classes.
C-FSCIL naturally pushes the meta-learned prototypes towards quasi-orthogonality. This yields class prototypes with large inter-class separation, which could provide robustness against adversarial perturbations without requiring any adversarial training. Furthermore, the precision of such robust class prototypes can be reduced, as confirmed by the bipolarization in Mode 2, which makes them ideal for implementation on emerging hardware technologies that exploit non-volatile memory for in-memory computing.
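As a quick numerical sanity check of this precision-reduction claim, the short NumPy snippet below bipolarizes a random high-dimensional prototype and verifies that it retains most of its direction while staying quasi-orthogonal to another prototype; the dimension and the random vectors are illustrative, not prototypes from our model.

```python
# Bipolarization check: keeping only the sign of each entry (1 bit per dimension)
# preserves most of a high-dimensional prototype's direction, while different
# prototypes remain quasi-orthogonal.
import numpy as np

rng = np.random.default_rng(0)
d = 4096
p1, p2 = rng.standard_normal(d), rng.standard_normal(d)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print("cos(p1, sign(p1)) =", round(cos(p1, np.sign(p1)), 3))            # about 0.80 (sqrt(2/pi))
print("cos(sign(p1), sign(p2)) =", round(cos(np.sign(p1), np.sign(p2)), 3))  # close to 0
```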
Editor’s note: The research was carried out by Michael Hersche, Geethan Karunaratne, Giovanni Cherubini, Abu Sebastian, and Abbas Rahimi at IBM Research Europe - Zurich in collaboration with Luca Benini from ETH Zurich.
[1] G. Karunaratne, M. Schmuck, M. Le Gallo, et al., "Robust high-dimensional memory-augmented neural networks," Nature Communications, vol. 12, 2468, 2021. https://doi.org/10.1038/s41467-021-22364-0
[2] L. Pellegrini, G. Graffieti, V. Lomonaco, and D. Maltoni, "Latent Replay for Real-Time Continual Learning," 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020, pp. 10203-10209. https://doi.org/10.1109/IROS45743.2020.9341460
[3] Y. Tian, Y. Wang, D. Krishnan, J. B. Tenenbaum, and P. Isola, "Rethinking Few-Shot Image Classification: A Good Embedding Is All You Need?," Computer Vision – ECCV 2020, Lecture Notes in Computer Science, vol. 12359, Springer, Cham, 2020. https://doi.org/10.1007/978-3-030-58568-6_16