Continually learning new classes from fresh data without forgetting previous knowledge of old classes is a very challenging research problem. Moreover, such learning must respect certain memory and computational constraints: (i) training samples are limited to only a few per class, (ii) the computational cost of learning a novel class remains constant, and (iii) the memory footprint of the model grows at most linearly with the number of classes observed. To meet these constraints, we propose C-FSCIL, which is architecturally composed of a frozen meta-learned feature extractor, a trainable fixed-size fully connected layer, and a rewritable, dynamically growing memory that stores as many vectors as the number of encountered classes. C-FSCIL provides three update modes that offer a trade-off between accuracy and the compute-memory cost of learning novel classes. C-FSCIL exploits hyperdimensional embedding, which allows it to continually express many more classes than the fixed number of dimensions in the vector space, with minimal interference. The quality of class vector representations is further improved by aligning them quasi-orthogonally to each other by means of novel loss functions. Experiments on the CIFAR100, mini-ImageNet, and Omniglot datasets show that C-FSCIL outperforms the baselines with remarkable accuracy and compression. It also scales up to the largest problem size ever tried in this few-shot setting by learning 423 novel classes on top of 1200 base classes with less than 1.6% accuracy drop. Our code is available at https://github.com/IBM/constrained-FSCIL.
Deep convolutional neural networks (CNNs) have achieved remarkable success in various computer vision tasks such as image classification, stemming from the availability of large numbers of training samples as well as huge computational and memory resources. This, however, poses challenges for their applicability to smart agents deployed in new and dynamic environments, where there is a need to (i) continually learn about novel classes from very few training samples without forgetting previous knowledge of old classes, and (ii) perform learning under extremely limited resource availability.
We consider such a challenging scenario of learning novel classes from an online stream of data, including never-seen-before classes, where we impose constraints on the training sample size, computational cost, and memory size. More specifically, our learner meets the following constraints: (i) training samples are limited to only a few per class, (ii) the computational cost of learning a novel class remains constant, and (iii) the memory footprint of the model grows at most linearly with the number of classes observed.
The IBM Research approach
To meet the aforementioned constraints, we approach the continual learning problem from a counterintuitive perspective in which the “curse” of high dimensionality is turned into a blessing. The key insight is to exploit one of the intriguing properties of the “curse”: independent random high-dimensional vectors are dissimilar to each other and so can naturally represent different classes, a property that has already been exploited in few-shot learning [1]. By doing so, the representation of a novel class is not only incremental to what was learned before but also avoids interfering with it. One step further, this design offers inherent compressibility, because multiple training samples, or even multiple class vectors, can be represented by a single vector.
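The dissimilarity of independent random high-dimensional vectors can be checked numerically. The following is a minimal sketch, not taken from the C-FSCIL code base: the bipolar {-1, +1} vectors, the dimensionality `d = 512`, and the helper names `random_bipolar` and `pairwise_cosine` are all illustrative assumptions. It draws twice as many vectors as there are dimensions, measures their pairwise cosine similarities, and then superposes five vectors into a single one to illustrate the compressibility argument.

```python
import numpy as np

def random_bipolar(n, d, seed=0):
    """Draw n independent random vectors with entries in {-1, +1} (assumed encoding)."""
    rng = np.random.default_rng(seed)
    return rng.choice([-1.0, 1.0], size=(n, d))

def pairwise_cosine(vectors):
    """Cosine similarity between every pair of rows."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T

d = 512      # embedding dimensionality (illustrative value)
n = 2 * d    # deliberately more vectors than dimensions
vectors = random_bipolar(n, d)

sims = pairwise_cosine(vectors)
off_diag = np.abs(sims[~np.eye(n, dtype=bool)])  # drop self-similarities
print("mean |cos|:", off_diag.mean(), "max |cos|:", off_diag.max())

# Superposition: bundle five vectors into one by element-wise summation.
bundle = vectors[:5].sum(axis=0)
bundle_sims = vectors @ bundle / (
    np.linalg.norm(vectors, axis=1) * np.linalg.norm(bundle)
)
print("similarity to bundled members:", bundle_sims[:5])
print("largest |similarity| to non-members:", np.abs(bundle_sims[5:]).max())
```

In this sketch the off-diagonal similarities concentrate around zero even though there are more vectors than dimensions, and the bundled vector stays noticeably closer to its five members than to any other vector. Exact orthogonality is not possible for more than `d` vectors, which is why the text speaks of quasi-orthogonality.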
When employed with deep CNNs, this representation makes it possible to continually express many more classes than the fixed number of dimensions in the class vector, with minimal interference among all classes. We therefore propose C-FSCIL, which is architecturally composed of a frozen CNN (as feature extractor), a trainable fixed-size fully connected layer, and a dynamically growing explicit memory that stores as many vectors as the number of classes encountered so far. The frozen part is separated from the growing part by the fully connected layer, which outputs class vectors in a high-dimensional embedding space whose dimensionality remains fixed and is therefore independent of the number of past and future classes. The CNN is meta-learned with a properly sharpened attention mechanism (see our blog article from 2021, or [1]) to represent different image classes with dissimilar vectors. Throughout the course of continual learning, C-FSCIL is constrained to either no gradient updates (Mode 1) or a small constant number of iterations for retraining only the fully connected layer (Modes 2 and 3).
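The gradient-free update of Mode 1 can be pictured with a small sketch. Everything here is an assumption made for illustration: the random matrices `W_feat` and `W_fc` merely stand in for the frozen CNN and the fully connected layer, the prototype is taken to be the plain average of the few-shot embeddings, and the names `embed`, `ExplicitMemory`, and `run_demo` are hypothetical rather than part of the released code.

```python
import numpy as np

rng = np.random.default_rng(0)
IN_DIM, FEAT_DIM, EMB_DIM = 32, 128, 512  # illustrative sizes

# Random stand-ins for the frozen feature extractor and the fully connected layer.
W_feat = rng.standard_normal((FEAT_DIM, IN_DIM)) / np.sqrt(IN_DIM)
W_fc = rng.standard_normal((EMB_DIM, FEAT_DIM)) / np.sqrt(FEAT_DIM)

def embed(x):
    features = np.maximum(W_feat @ x, 0.0)  # frozen extractor (stand-in)
    return W_fc @ features                  # fixed-size high-dimensional output

class ExplicitMemory:
    """Holds one prototype vector per class; grows by one entry per new class."""

    def __init__(self):
        self.prototypes = {}  # class label -> prototype vector

    def write(self, label, embeddings):
        # One pass over the few shots, no gradient step (assumed: simple averaging).
        self.prototypes[label] = np.mean(embeddings, axis=0)

    def classify(self, query):
        labels = list(self.prototypes)
        protos = np.stack([self.prototypes[l] for l in labels])
        sims = protos @ query / (
            np.linalg.norm(protos, axis=1) * np.linalg.norm(query)
        )
        return labels[int(np.argmax(sims))]

def run_demo(num_classes=10, shots=5, noise=0.1):
    memory = ExplicitMemory()
    centers = rng.standard_normal((num_classes, IN_DIM))  # synthetic classes
    for label, center in enumerate(centers):              # classes arrive one by one
        support = center + noise * rng.standard_normal((shots, IN_DIM))
        memory.write(label, np.stack([embed(x) for x in support]))
    correct = 0
    for label, center in enumerate(centers):
        query = center + noise * rng.standard_normal(IN_DIM)
        correct += int(memory.classify(embed(query)) == label)
    return memory, correct / num_classes

memory, accuracy = run_demo()
print("classes stored:", len(memory.prototypes), "accuracy:", accuracy)
```

Adding a class costs one averaging pass and one extra stored vector, so in this sketch the compute per class is constant and the memory grows linearly with the number of classes, matching the constraints stated earlier.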
Our retraining in Modes 2 and 3 can be seen as an extremely efficient version of the latent replay technique [2], applied only to the last fully connected layer. The procedure is efficient for two reasons: it stores only one compact activation pattern per class in the memory, and it applies only a small constant number of iterations to update the last layer. Such cheap replay is sufficient thanks to the high-quality pre-trained embedding [3].
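A rough sketch of such a replay loop is shown below. It is not the authors' training procedure: the squared-error objective is a simple stand-in for the loss functions used in C-FSCIL, the targets are assumed to be the sign of the current outputs, and `retrain_fc` with all of its parameters is a hypothetical helper. What it does illustrate is the cost structure: one stored activation per class and a fixed iteration budget that touches only the last layer.

```python
import numpy as np

def retrain_fc(W, activations, targets, iterations=50, lr=0.5):
    """Replay one stored activation per class to update only the last layer.

    W:           (emb_dim, feat_dim) weights of the fully connected layer
    activations: (num_classes, feat_dim) stored patterns, rows of unit norm
    targets:     (num_classes, emb_dim) desired class vectors
    """
    num_classes = activations.shape[0]
    losses = []
    for _ in range(iterations):  # fixed budget, independent of data seen so far
        residual = activations @ W.T - targets
        losses.append(float(np.mean(np.sum(residual ** 2, axis=1))))
        # Gradient of the mean squared error with respect to W.
        grad = (2.0 / num_classes) * residual.T @ activations
        # lr < 1 keeps this quadratic objective stable for unit-norm rows.
        W = W - lr * grad
    return W, losses

rng = np.random.default_rng(0)
num_classes, feat_dim, emb_dim = 10, 64, 256  # illustrative sizes

activations = rng.standard_normal((num_classes, feat_dim))
activations /= np.linalg.norm(activations, axis=1, keepdims=True)

W0 = rng.standard_normal((emb_dim, feat_dim)) / np.sqrt(feat_dim)
targets = np.sign(activations @ W0.T)  # assumed bipolar targets

W1, losses = retrain_fc(W0, activations, targets)
print("loss before:", losses[0], "loss at last iteration:", losses[-1])
```

The frozen feature extractor never appears in the loop, which is what keeps the replay cheap: the stored activations already summarize what the extractor would produce for each class.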
Results and implications
Our most prominent result is enabling deep CNNs to learn continually at scale under extreme constraints, i.e., very few training samples, fixed compute cost, and a slowly growing memory proportional to the number of classes encountered so far (see Figure 1). Experiments on datasets of natural images (CIFAR100 and mini-ImageNet) and of handwritten characters from 50 different alphabets (Omniglot) show that C-FSCIL sets a new state-of-the-art accuracy record by outperforming the baselines even in Mode 1, which simply computes class vectors in one pass without any gradient-based parameter update. C-FSCIL also scales up to the largest problem size ever tried in this few-shot continual learning setting by learning 423 novel classes on top of 1,200 base classes with less than 1.6% accuracy drop using Mode 3.
Continual learning faces a number of issues, namely catastrophic forgetting, class imbalance, and interference with past classes.
We avoid these issues systematically in C-FSCIL, where high-dimensional quasi-orthogonal vectors are assigned to each and every class with the aim of reducing interference.
This cannot be achieved with other methods because they fail to support a larger number of classes than the vector dimensionality of the layer they are connected to.
Our class vectors are stored in the explicit memory where they can be selectively updated. The alignment to classes can be improved by proper retraining of the fully connected layer whose structure remains fixed and independent of the number of classes.
C-FSCIL naturally pushes the meta-learned prototypes towards quasi-orthogonality. This leads to class prototypes with large inter-class separation, which could provide robustness against adversarial perturbations without requiring any adversarial training. Furthermore, the precision of such robust class prototypes can be reduced, as confirmed by the bipolarization in Mode 2, which makes them ideal for implementation on emerging hardware technologies that exploit non-volatile memory for in-memory computation.
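The effect of reducing prototype precision can be illustrated with synthetic data. This is a sketch under stated assumptions, not a reproduction of Mode 2: the Gaussian prototypes, the noise level, the sizes, and the helper `nearest_prototype` are all invented for illustration. It compares nearest-prototype classification with full-precision prototypes against the same prototypes reduced to their signs.

```python
import numpy as np

def nearest_prototype(queries, prototypes):
    """Index of the most cosine-similar prototype for each query row."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return np.argmax(q @ p.T, axis=1)

rng = np.random.default_rng(0)
num_classes, dim, queries_per_class = 20, 512, 10  # illustrative sizes

prototypes = rng.standard_normal((num_classes, dim))  # full-precision stand-ins
bipolar = np.sign(prototypes)                         # one bit per component

labels = np.repeat(np.arange(num_classes), queries_per_class)
queries = prototypes[labels] + rng.standard_normal((labels.size, dim))  # noisy samples

acc_full = np.mean(nearest_prototype(queries, prototypes) == labels)
acc_bipolar = np.mean(nearest_prototype(queries, bipolar) == labels)
print("full precision:", acc_full, "bipolar:", acc_bipolar)
```

Because the prototypes are well separated in a high-dimensional space, keeping only the sign of each component leaves the nearest-prototype decision largely intact in this toy setting, which is the property that makes low-precision storage attractive.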
Editor’s note: The research was carried out by Michael Hersche, Geethan Karunaratne, Giovanni Cherubini, Abu Sebastian, and Abbas Rahimi at IBM Research Europe - Zurich in collaboration with Luca Benini from ETH Zurich.
[1] Karunaratne, G., Schmuck, M., Le Gallo, M. et al. Robust high-dimensional memory-augmented neural networks. Nat Commun 12, 2468 (2021). https://doi.org/10.1038/s41467-021-22364-0
[2] L. Pellegrini, G. Graffieti, V. Lomonaco and D. Maltoni, "Latent Replay for Real-Time Continual Learning," 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020, pp. 10203-10209, doi: 10.1109/IROS45743.2020.9341460.
[3] Tian, Y., Wang, Y., Krishnan, D., Tenenbaum, J.B., Isola, P. (2020). Rethinking Few-Shot Image Classification: A Good Embedding is All You Need?. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol 12359. Springer, Cham. https://doi.org/10.1007/978-3-030-58568-6_16