Transductive Learning for Textual Few-Shot Classification: Abstract & Intro

1 Mar 2024

This paper is available on arxiv under CC 4.0 license.

Authors:

(1) Pierre Colombo, Equall, Paris, France & MICS, CentraleSupelec, Universite Paris-Saclay, France;

(2) Victor Pellegrain, IRT SystemX Saclay, France & MICS, CentraleSupelec, Universite Paris-Saclay, France;

(3) Malik Boudiaf, ÉTS Montreal, LIVIA, ILLS, Canada;

(4) Victor Storchan, Mozilla.ai, Paris, France;

(5) Myriam Tami, MICS, CentraleSupelec, Universite Paris-Saclay, France;

(6) Ismail Ben Ayed, ÉTS Montreal, LIVIA, ILLS, Canada;

(7) Celine Hudelot, MICS, CentraleSupelec, Universite Paris-Saclay, France;

(8) Pablo Piantanida, ILLS, MILA, CNRS, CentraleSupélec, Canada.

Abstract

Proprietary and closed APIs are becoming increasingly common for processing natural language, and are impacting the practical applications of natural language processing, including few-shot classification. Few-shot classification involves training a model to perform a new classification task with a handful of labeled data. This paper presents three contributions. First, we introduce a scenario where the embedding of a pre-trained model is served through a gated API with compute-cost and data-privacy constraints. Second, we propose transductive inference, a learning paradigm that has been overlooked by the NLP community. Transductive inference, unlike traditional inductive learning, leverages the statistics of unlabeled data. We also introduce a new parameter-free transductive regularizer based on the Fisher-Rao loss, which can be used on top of the gated API embeddings. This method fully utilizes unlabeled data, does not share any label with the third-party API provider, and could serve as a baseline for future research. Third, we propose an improved experimental setting and compile a benchmark of eight datasets involving multiclass classification in four different languages, with up to 151 classes. We evaluate our methods using eight backbone models, along with an episodic evaluation over 1,000 episodes, which demonstrates the superiority of transductive inference over the standard inductive setting.

1 Introduction

Recent advances in Natural Language Processing (NLP) have been largely driven by the scaling paradigm (Kaplan et al., 2020; Rosenfeld et al., 2019), where larger models with increased parameters have been shown to achieve state-of-the-art results in various NLP tasks (Touvron et al., 2023; Radford et al., 2019). This approach has led to the development of foundation models such as ChatGPT (Lehman et al., 2023; Kocoń et al., 2023; Brown et al., 2020), GPT-4 (OpenAI, 2023), GPT-3 (Brown et al., 2020), T5 (Raffel et al., 2020), and BERT (Devlin et al., 2018), which have achieved unprecedented performance in text classification (Liu et al., 2019b), language modeling, machine translation (Fan et al., 2021), and coding tasks (Chen et al., 2021a).

Despite the success of the scaling paradigm, significant challenges remain, especially when the many practical constraints of real-world scenarios have to be met: labeled data can be severely limited (i.e., the few-shot scenario (Song et al., 2022; Ye et al., 2021)), data privacy is critical for many industries and has become the subject of a growing number of regulations (Commission, 2020, 2016), and compute costs need to be optimized (Strubell et al., 2019). These challenges are made even more complex as the strongest foundation models are now available only through APIs (e.g., OpenAI's GPT-3, GPT-4, or ChatGPT, Anthropic's Claude, or Google's PaLM (Chowdhery et al., 2022)), which conceals some of their parameters and presents new challenges for model adaptation (Solaiman, 2023). This paper is centered on the fundamental task of few-shot text classification, specifically focusing on cloud-based/API access. We formulate three requirements for API-based few-shot learning (FSL) (see Fig. 1):

(R1) Black-box scenario. We focus on learning from models that are opaquely deployed in production: the end-user only has access to the end-point of the encoder, i.e., the text embedding produced by the final layer of the network.

(R2) Low resources / computation time. AI systems are often required to make rapid predictions at high frequency in real-world applications. Any few-shot classifier used in such scenarios should therefore have low training and inference times and require minimal computational resources.

(R3) Limited data sharing. When utilizing API models, data sharing becomes a major concern. In the current landscape, providers offer increasingly opaque procedures for training their networks. As a result, users prefer to share as little information as possible, such as the labeling schema and annotated data, to safeguard their data privacy.

Shortcomings of Existing Works. While numerous previous studies have addressed the popular few-shot classification setting, to our knowledge no existing line of work adequately satisfies the three API requirements described above. In particular, prompt-based FSL (Schick and Schütze, 2020a) and parameter-efficient fine-tuning FSL (Houlsby et al., 2019) both require access to the model's gradients, while in-context learning scales poorly with the task's size (e.g., number of shots, number of classes) (Chen et al., 2021b; Min et al., 2021, 2022; Brown et al., 2020) and requires full data sharing. Instead, we focus on methods that can operate within these API-based constraints.
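To make the three requirements concrete, the following sketch shows the kind of client-side workflow they permit. The `get_embedding` function is a hypothetical stand-in for a gated embedding API; the nearest-prototype classifier is only an illustration of a lightweight local model, not the method proposed in this paper.

```python
import numpy as np

def get_embedding(text: str) -> np.ndarray:
    """Hypothetical stand-in for a gated embedding API call (R1):
    only the final-layer embedding comes back, never gradients or weights.
    Here we fake it with a text-seeded random vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(16)

def fit_prototypes(support_texts, support_labels):
    """Fit a lightweight local classifier (R2) on API embeddings.
    Labels never leave the client (R3): only raw texts are sent out."""
    embs = np.stack([get_embedding(t) for t in support_texts])
    labels = np.asarray(support_labels)
    return {c: embs[labels == c].mean(axis=0) for c in np.unique(labels)}

def predict(prototypes, text):
    """Assign a text to the class of its nearest prototype."""
    emb = get_embedding(text)
    return min(prototypes, key=lambda c: np.linalg.norm(emb - prototypes[c]))
```

All learning happens on the client with a few vector operations, which is what makes the setting compatible with R1-R3 even when the backbone itself is inaccessible.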

Under R1, R2, and R3 requirements, the standard inductive learning (Liu et al., 2022) may be quite limiting. To mitigate the labeled data scarcity while retaining API compliance, we revisit transduction (Vapnik, 1999) in the context of textual few-shot classification. Specifically, in the context of FSL, transductive FSL (Liu et al., 2019a) advocates leveraging unlabeled test samples of a task as an additional source of information on the underlying task’s data distribution in order to better define decision boundaries. Such additional source essentially comes for free in many offline applications, including sentiment analysis for customer feedback, legal document classification, or text-based medical diagnosis.
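The idea of transduction can be illustrated with a minimal sketch: refine class prototypes using the soft assignments of the unlabeled query set, so decision boundaries adapt to the test distribution. This is a generic illustration of the transductive principle, not the Fisher-Rao method introduced later; the mixing weight `alpha` is an assumption of this sketch.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def transductive_refine(support_protos, query_embs, n_iters=10, alpha=0.5):
    """Illustrative transductive step: alternately soft-assign unlabeled
    query embeddings to class prototypes, then move each prototype toward
    the soft mean of the queries assigned to it, anchored to the
    support-set prototypes so the update cannot drift arbitrarily."""
    protos = support_protos.copy()
    for _ in range(n_iters):
        # Negative squared distances act as class logits.
        logits = -((query_embs[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
        q = softmax(logits, axis=1)                      # (n_query, n_class)
        query_means = (q.T @ query_embs) / np.maximum(q.sum(0)[:, None], 1e-8)
        protos = alpha * support_protos + (1 - alpha) * query_means
    return protos
```

An inductive classifier would stop at `support_protos`; the transductive one also exploits the query-set statistics, which is exactly the "free" extra information available in offline applications.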

Our findings corroborate recent results in computer vision (Liu et al., 2019a; Ziko et al., 2020; Lichtenstein et al., 2020; Boudiaf et al., 2020; Hu et al., 2021b) that substantial gains can be obtained from using transduction over induction, opening new avenues of research for the NLP community. However, the transductive gain comes at the cost of introducing additional hyperparameters that must be carefully tuned. Motivated by Occam's razor principle, we propose a novel hyperparameter-free transductive regularizer based on Fisher-Rao distances and demonstrate the strongest predictive performance across various benchmarks and models while keeping hyperparameter tuning minimal. We believe that this parameter-free transductive regularizer can serve as a baseline for future research.
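For discrete distributions on the probability simplex, the Fisher-Rao geodesic distance has the closed form d(p, q) = 2 arccos(Σᵢ √(pᵢ qᵢ)), i.e., twice the arccosine of the Bhattacharyya coefficient. The sketch below computes only this underlying distance; the paper's actual regularizer built on it is defined in later sections.

```python
import numpy as np

def fisher_rao_distance(p, q):
    """Fisher-Rao geodesic distance between two discrete probability
    distributions p and q on the simplex:
        d(p, q) = 2 * arccos(sum_i sqrt(p_i * q_i)).
    The sum is the Bhattacharyya coefficient; clipping guards against
    floating-point values slightly outside arccos's domain."""
    bc = np.sum(np.sqrt(np.asarray(p, float) * np.asarray(q, float)))
    return 2.0 * np.arccos(np.clip(bc, -1.0, 1.0))
```

Notably, the distance is bounded (it reaches its maximum of π for distributions with disjoint support) and involves no tunable parameter, which is what makes it attractive for a hyperparameter-free regularizer.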

Contributions

In this paper, we make several contributions to the field of textual FSL. Precisely, our contributions are threefold:

A new textual few-shot scenario. We present a new scenario for FSL using textual API-based models that accurately captures real-world constraints. Our new scenario opens up new research avenues and opportunities to address the challenges associated with FSL using API-based models, paving the way for improved performance in practical applications.

A novel transductive baseline. Our paper proposes a transductive FSL algorithm that utilizes a novel parameter-free Fisher-Rao-based loss. By leveraging only the network's embedding (R1), our approach enables fast and efficient predictions (R2) without the need to share the labeling schema or the labels of few-shot examples, making it compliant with (R3). This method marks a significant step forward in the field of FSL.

A truly improved experimental setting. Previous studies on textual few-shot classification (Schick and Schütze, 2022, 2020b; Mahabadi et al., 2022; Tam et al., 2021; Gao et al., 2020) have predominantly assessed their algorithms on classification tasks with a restricted number of labels (typically fewer than five). We take a step forward and create a benchmark that is more representative of real-world scenarios. Our benchmark relies on a total of eight datasets, covering multiclass classification tasks with up to 151 classes, across four different languages. Moreover, we further enhance the evaluation process by not only considering 10 classifiers trained with 10 different seeds (Logan IV et al., 2021; Mahabadi et al., 2022), but also by relying on episodic evaluation over 1,000 episodes (Hospedales et al., 2021). Our results clearly demonstrate the superiority of transductive methods.
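Episodic evaluation repeatedly samples small tasks (episodes) from the benchmark and averages performance across them. The sketch below shows a generic N-way K-shot episode sampler over example indices; the specific way/shot/query sizes are illustrative defaults, not the paper's exact configuration.

```python
import numpy as np

def sample_episode(labels, n_way=5, k_shot=5, n_query=15, rng=None):
    """Sample one N-way K-shot episode as index arrays: draw n_way
    classes, then k_shot support and n_query query examples per class,
    with no overlap between support and query sets."""
    if rng is None:
        rng = np.random.default_rng()
    labels = np.asarray(labels)
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    support, query = [], []
    for c in classes:
        idx = rng.permutation(np.flatnonzero(labels == c))
        support.extend(idx[:k_shot])
        query.extend(idx[k_shot:k_shot + n_query])
    return np.array(support), np.array(query)
```

Running this sampler 1,000 times and averaging the per-episode accuracy yields the kind of episodic estimate used in the evaluation, which is far less sensitive to a lucky draw of support examples than a single train/test split.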