Abstract
Recent work in natural language processing (NLP) has yielded appealing results from scaling model parameters and training data. (However, improving performance through scale alone also means increased resource consumption.)
This survey synthesizes and relates current methods and findings in efficient NLP. We aim both to provide guidance for conducting NLP under limited resources and to point towards promising research directions for developing more efficient methods.
1 Introduction
Scaling has become a key ingredient in achieving state-of-the-art performance in NLP. However, despite the merits of scaling, it poses key challenges: making these breakthroughs accessible in resource-constrained environments, limiting their non-negligible environmental impact, and complying with hardware constraints. (This has shifted the focus toward model efficiency.)
Definition: Efficiency is characterized by the relationship between resources going into a system and its output, with a more efficient system producing the same output with fewer resources.
Scope of this survey: We address this work to two groups of readers: (1) Researchers from all fields of NLP working with limited resources; and (2) Researchers interested in improving the state of the art of efficient methods in NLP.
(The survey proceeds in the order: data efficiency -> model design -> pre-training & fine-tuning -> inference -> hardware -> evaluation -> choosing the best-suited model.)
2 Data
Data efficiency is improved by using fewer training instances, or by making better use of available instances. Fixed compute budgets motivate balancing model size and training data size, especially during pre-training.
2.1 Filtering
Improving data quality can boost performance while reducing training costs during pre-training and fine-tuning.
While such filtering approaches are useful for mitigating biases, they may not always serve as tools to filter existing datasets, as these often suffer from insufficient training data. (Examples of such filtering approaches include removing duplicates and training on a data subset that performs nearly as well as the full corpus. In other words, these approaches can be effective, but not always.)
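As a concrete illustration of the duplicate-removal side of filtering, the sketch below drops exact (lightly normalized) duplicates from a raw text corpus. The `corpus` list and the normalization step are illustrative assumptions; real pipelines typically add near-duplicate detection (e.g., MinHash) and quality heuristics on top.

```python
import hashlib

def deduplicate(corpus):
    """Drop exact duplicates from a list of raw text documents (toy sketch)."""
    seen, kept = set(), []
    for doc in corpus:
        # Normalize lightly so trivially different copies collapse together.
        key = hashlib.md5(" ".join(doc.lower().split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

corpus = ["The cat sat.", "the cat  sat.", "A different sentence."]
print(deduplicate(corpus))  # keeps 2 of the 3 documents
```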
2.2 Active Learning
Active learning aims to reduce the number of training instances. In contrast to filtering, it is applied during data collection (instead of after) to annotate only the most helpful or useful instances for training. (Active learning has many advantages and has been applied to machine translation, language learning, entity linking, and other tasks.) Despite its advantages, some open questions make active learning difficult to apply in practice. It remains unclear how model-based sampling affects the performance of models whose architecture differs from the one used for sampling. Also, selecting "difficult" instances may increase annotation cost and difficulty.
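A minimal sketch of pool-based uncertainty sampling, one common active learning strategy: score unlabeled instances by predictive entropy and send only the top-scoring ones to annotators. The `model`, the `(index, batch)` loader format, and the budget are assumptions made for illustration.

```python
import torch

def select_for_annotation(model, unlabeled_loader, budget=100):
    """Pick the unlabeled instances whose predicted label distribution has
    the highest entropy; only these are sent to human annotators."""
    model.eval()
    scores, indices = [], []
    with torch.no_grad():
        for idx, batch in unlabeled_loader:          # assumed loader format
            probs = torch.softmax(model(batch), dim=-1)
            entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
            scores.append(entropy)
            indices.append(idx)
    scores, indices = torch.cat(scores), torch.cat(indices)
    top = torch.topk(scores, k=min(budget, len(scores))).indices
    return indices[top]   # ids of the most "informative" examples
```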
2.3 Curriculum Learning
Curriculum learning aims to find a data ordering that reduces the number of training steps required to achieve a target performance. This method does not reduce dataset size, but does improve its utilization. Hence, it is a common approach for improving training efficiency in both pre-training and fine-tuning.
A major challenge in curriculum learning is determining pace, i.e., when to progress to more difficult instances. (If the pace is not carefully chosen, compute may be wasted on "easy" instances.)
To tackle this, work has investigated adaptive ordering strategies based on current model state, called self-paced learning. However, self-paced learning involves large training costs, and disentangling instance ordering from factors such as optimizer choice and batch size is non-trivial.
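A toy curriculum schedule, under the assumptions that a `difficulty_fn` (e.g., sentence length or a teacher model's loss) is available and that pacing follows fixed stages; self-paced variants would instead re-score difficulty from the current model state.

```python
def curriculum_order(dataset, difficulty_fn, num_stages=4):
    """Order training data from easy to hard and release it in stages."""
    ranked = sorted(dataset, key=difficulty_fn)
    stage_size = max(1, len(ranked) // num_stages)
    for stage in range(1, num_stages + 1):
        # At each stage the model sees everything up to the current
        # difficulty cut-off, so earlier (easier) data is revisited.
        yield ranked[: stage * stage_size]

# Example: treat token count as a difficulty proxy.
data = ["a b", "a b c d e f", "a b c", "a"]
for subset in curriculum_order(data, difficulty_fn=lambda s: len(s.split()), num_stages=2):
    print(subset)
```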
2.4 Estimating Data Quality
Datasets frequently present high levels of noise and misaligned instances. Estimating data quality encompasses research efforts which propose better uncertainty estimates as well as analytical tools such as dataset cartography.
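As a sketch of one such analytical tool, dataset cartography characterizes each training example by the model's confidence in the gold label and its variability across training epochs; here it is assumed that the per-epoch gold-label probabilities have already been logged during training.

```python
import numpy as np

def cartography(gold_probs_per_epoch):
    """Dataset-cartography-style statistics from an (epochs x examples)
    array of the model's probability for the gold label. Low-confidence,
    low-variability points are candidates for label noise; high-variability
    points tend to be ambiguous."""
    probs = np.asarray(gold_probs_per_epoch)
    confidence = probs.mean(axis=0)   # mean p(gold) across epochs
    variability = probs.std(axis=0)   # spread of p(gold) across epochs
    return confidence, variability

conf, var = cartography([[0.90, 0.20, 0.50],
                         [0.95, 0.25, 0.80],
                         [0.97, 0.22, 0.30]])
print(conf.round(2), var.round(2))
```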
3 Model Design
Efficient model design covers architectural changes and adding new modules to accelerate training.
3.1 Improving Attention in Transformers
The transformer’s self-attention mechanism has a quadratic dependency on sequence length (compute and memory grow with the square of the input length, so doubling the sequence length roughly quadruples the attention cost), which is not fully utilized by existing models.
1) Existing strategies include better use of already processed segments via recurrence to connect multiple segments, learning a network to compress a longer-term memory, separately modeling global and local attention, and modeling long inputs as a continuous-time signal.
2) Another line of research uses fixed attention patterns, where tokens attend to their immediate context (local attention) and possibly to a few global positions (global attention; Beltagy et al., 2020; Zaheer et al., 2020; Child et al., 2019). Compared to using the full self-attention matrix, such approaches can scale linearly with the input length.
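A minimal sketch of how such a fixed local + global pattern can be expressed as a boolean attention mask; the window size and the choice of global positions are illustrative values, not those of any particular model.

```python
import torch

def local_global_mask(seq_len, window=4, global_positions=(0,)):
    """Boolean mask for a fixed sparse attention pattern: each token attends
    to a local window around itself, plus a few global positions that attend
    to (and are attended by) everything."""
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True                  # local attention
    for g in global_positions:
        mask[g, :] = True                      # global token attends to all
        mask[:, g] = True                      # all tokens attend to it
    return mask  # True = attention allowed; each row has O(window) entries

print(local_global_mask(8, window=1).int())
```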
Despite various improvements in attention mechanisms, most of them struggle with very long sequences. (For this reason, alternatives to transformers have been proposed, such as state space models, which alleviate the short-memory problem and the quadratic bottleneck of self-attention by discretizing state space representations through a parameterization of the state matrix, and variants that replace the multi-headed transformer attention mechanism with a single-headed mechanism.)
3.2 Sparse Modeling
To leverage sparsity for efficiency, many models follow the mixture-of-experts (MoE) concept, which routes computation through small subnetworks instead of passing the input through the entire model (i.e., the model is split into experts rather than used as a whole).
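A minimal MoE layer sketch under toy dimensions: a gating network selects the top-k experts per token and only those small feed-forward subnetworks are evaluated. The sizes, the value of k, and the simple loop-based dispatch are illustrative; production MoE layers add load-balancing losses and capacity limits.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Toy mixture-of-experts layer with top-k routing."""

    def __init__(self, d_model=16, num_experts=4, k=1):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts))
        self.k = k

    def forward(self, x):                       # x: (tokens, d_model)
        weights = torch.softmax(self.gate(x), dim=-1)
        topw, topi = weights.topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            hit = (topi == e).any(dim=-1)       # tokens routed to expert e
            if hit.any():
                w = topw[hit][topi[hit] == e].unsqueeze(-1)
                out[hit] += w * expert(x[hit])  # only routed tokens computed
        return out

print(TinyMoE()(torch.randn(5, 16)).shape)      # torch.Size([5, 16])
```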
3.3 Parameter Efficiency
Methods that reduce parameter count can reduce computational costs and memory usage. One such approach is to share weights across layers of a model while maintaining downstream task performance. (Beyond this, various other lines of work are ongoing.)
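A sketch of cross-layer weight sharing in the spirit of ALBERT-style sharing: one encoder layer is reused at every depth, so the parameter count does not grow with the number of layers. Dimensions and depth are toy values.

```python
import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """One transformer encoder layer's weights reused at every depth."""

    def __init__(self, d_model=32, nhead=4, depth=6):
        super().__init__()
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True)
        self.depth = depth

    def forward(self, x):
        for _ in range(self.depth):     # same weights applied repeatedly
            x = self.shared_layer(x)
        return x

model = SharedLayerEncoder()
n_params = sum(p.numel() for p in model.parameters())
print(n_params)   # does not grow with `depth`
```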
3.4 Retrieval-Augmented Models
Parametric models can be combined with retrieval mechanisms for text generation, leading to semi-parametric models.
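A rough sketch of the semi-parametric pattern: a retriever scores stored passages against the query and the generator conditions on the top hits. The `embed` and `generate` callables are placeholders, not any specific system's API; the hashed bag-of-words embedding in the usage example is purely for demonstration.

```python
import numpy as np

def retrieve_then_generate(query, passages, embed, generate, k=2):
    """Retrieve the k passages closest to the query (dot-product relevance)
    and let the parametric generator condition on them."""
    q = embed(query)
    scores = np.array([q @ embed(p) for p in passages])
    top = scores.argsort()[::-1][:k]
    context = "\n".join(passages[i] for i in top)
    return generate(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")

def embed(text):                      # toy stand-in for a real encoder
    v = np.zeros(64)
    for tok in text.lower().split():
        v[hash(tok) % 64] += 1.0      # hashed bag-of-words "embedding"
    return v

print(retrieve_then_generate(
    "Who wrote Hamlet?",
    ["Hamlet was written by Shakespeare.", "Paris is the capital of France."],
    embed, generate=lambda prompt: prompt, k=1))
```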
3.5 Model Design Considerations
Despite considerable advances, one major challenge is modeling the long sequences found in many real-world documents. (If inputs become very long, performance is likely to depend heavily on design choices such as the attention mechanism and positional encodings.)
Finally, while new model designs improve efficiency through different means, further improvements can emerge from combining approaches, such as making MoE (sparse modeling) approaches more efficient using quantization, and using parameter-efficient models for distillation.
4 Pre-training
Modern transfer learning approaches in NLP typically involve pre-training a model in a self-supervised fashion on large amounts of text before fine-tuning it on specific tasks.
4.1 Optimization Objective
The choice of the task can determine the success of the pre-trained model on downstream tasks.
Left-to-right language models: GPT and PaLM are trained with the causal language modeling (CLM) objective, which involves predicting the next token given a context.
BERT uses a masked language model (MLM) task, which involves filling in randomly masked tokens. (To make better use of the available data, various masking strategies have been studied.)
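The two objectives differ mainly in how inputs and labels are constructed, as the sketch below illustrates; the 15% masking rate and the replace-with-mask-only policy follow the common BERT-style recipe, and many masking variants exist.

```python
import torch

def clm_labels(token_ids):
    """Causal LM: predict token t+1 from tokens <= t, i.e. labels are the
    inputs shifted by one position."""
    return token_ids[:, :-1], token_ids[:, 1:]

def mlm_corrupt(token_ids, mask_id, mask_prob=0.15, ignore_index=-100):
    """Masked LM: replace a random fraction of tokens with [MASK] and
    compute the loss only on those positions (simplified BERT-style)."""
    labels = token_ids.clone()
    masked = torch.rand(token_ids.shape) < mask_prob
    labels[~masked] = ignore_index          # loss only on masked positions
    inputs = token_ids.clone()
    inputs[masked] = mask_id
    return inputs, labels

ids = torch.randint(5, 100, (2, 8))
print(clm_labels(ids)[1].shape, mlm_corrupt(ids, mask_id=0)[0].shape)
```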
4.2 Pre-training Considerations
Despite increases in the size of pre-trained models, many pre-training efficiency gains come from improving model design and selection as well as making more efficient use of the available data.
While transformers have been the dominant architecture in pre-trained models, more efficient modeling methods such as state space representations and MoEs have the potential to overcome some challenges of pre-training transformers.
5 Fine-tuning
Fine-tuning refers to adapting a pre-trained model to a new downstream task.
5.1 Parameter-Efficient Fine-Tuning
Gradient-based fine-tuning typically involves training all model parameters on a downstream task. Hence, fine-tuning a pre-trained model on a new task creates an entirely new set of model parameters. (Adapting a pre-trained model to downstream tasks by training only a new classification layer and leaving the rest of the parameters fixed updates far fewer parameters than training the full model, but it is less common because it tends to give worse performance. Several methods have been proposed that update only a small number of parameters when adapting a model to a new task.)
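Two sketches of the ideas above: freezing the backbone and training only a task head, and an adapter-style bottleneck module whose new parameters are the only ones updated. The head's attribute name and the layer sizes are assumptions about how the backbone is organized.

```python
import torch.nn as nn

def freeze_except_head(model, head_name="classifier"):
    """Freeze every pre-trained weight and update only a small task head."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(head_name)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"training {trainable}/{total} parameters")
    return model

class Adapter(nn.Module):
    """Bottleneck adapter inserted inside a frozen backbone; only these
    small new matrices are trained (an illustrative sketch)."""
    def __init__(self, d_model=768, bottleneck=32):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.ReLU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))   # residual bottleneck

# Toy usage: only the classifier's weights remain trainable.
toy = nn.ModuleDict({"encoder": nn.Sequential(nn.Linear(8, 8), nn.ReLU()),
                     "classifier": nn.Linear(8, 2)})
freeze_except_head(toy)
```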
5.2 Multi-Task and Zero-Shot Learning
Multi-task learning aims to train a single model that can perform a wide variety of tasks out of the box. (This approach can also improve fine-tuning performance.) In certain cases, a multi-task model works on new tasks without any fine-tuning, also referred to as zero-shot generalization.
5.3 Prompting
Inspired by models like GPT-3, prompting refers to casting a task as a textual instruction to a language model. In general, prompts can be either crafted manually or automatically using fill-in templates for token, span, and sentence-level completion.
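A manually crafted template might look like the following; the wording is an illustrative choice rather than a prescribed prompt.

```python
def sentiment_prompt(review):
    """Fill-in template that casts sentiment classification as text
    completion for a left-to-right language model."""
    return (f"Review: {review}\n"
            "Question: Is the sentiment of this review positive or negative?\n"
            "Answer:")

print(sentiment_prompt("The plot was dull and far too long."))
# A language model completing this prompt with "negative" solves the task
# without any gradient updates.
```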
5.4 Fine-Tuning Considerations
An emerging problem with large language models is the universally high cost of fully fine-tuning them. (Prompting can mitigate this, but crafting prompts can be tedious.) One promising direction for efficiently introducing new knowledge into models is to combine existing methods for efficient fine-tuning.
To gain a better understanding of these models (large pre-trained models used for fine-tuning) while still leveraging efficiency, a promising direction is to combine techniques such as sparse modeling and parameter-efficient methods.
6 Inference and Compression
Inference involves computing a trained model’s prediction for a given input. Inference can be made more efficient by accelerating the process for time efficiency (latency), or by compressing the model to reduce memory requirements.
6.1 Pruning
Pruning removes irrelevant weights from a neural network to reduce computation, which furthermore decreases memory capacity and bandwidth requirements. Pruning was initially introduced at the individual weight level (unstructured pruning), but more recent approaches prune larger components of the network.
While early pruning (e.g., during pre-training) can further reduce training costs, it increases the risk of over-pruning: removing nodes essential for downstream task performance. (This can be mitigated by "regrowing" pruned weights, but doing so increases training costs again.)
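A one-shot sketch of unstructured magnitude pruning: zero out the smallest-magnitude weights in each linear layer. Practical pipelines usually prune gradually, fine-tune between pruning steps, and rely on sparse kernels or structured patterns to turn the zeros into actual speed-ups.

```python
import torch
import torch.nn as nn

def magnitude_prune(model, sparsity=0.5):
    """Zero out the `sparsity` fraction of lowest-magnitude weights in
    every linear layer (one-shot, unstructured)."""
    with torch.no_grad():
        for module in model.modules():
            if isinstance(module, nn.Linear):
                w = module.weight
                k = max(1, int(sparsity * w.numel()))
                threshold = w.abs().flatten().kthvalue(k).values
                w.mul_((w.abs() > threshold).float())
    return model

pruned = magnitude_prune(nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4)))
print((pruned[0].weight == 0).float().mean())   # roughly 0.5
```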
6.2 Knowledge Distillation (using a large model to train a smaller one)
The process of knowledge distillation uses supervision signals from a large (teacher) model to train a smaller (student) model, and often leads to the student outperforming a similarly sized model trained without this supervision.
*supervision: the training signal used to guide learning. It provides the model with the correct output so that the model learns to predict the desired result; in supervised learning, where the model learns from input-label pairs, the label paired with each input is the "supervision signal."
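A sketch of the classic logit-distillation loss: the student matches the teacher's temperature-softened output distribution while still fitting the gold labels. The temperature and mixing weight are typical but arbitrary choices.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Mix a KL term (match the teacher's soft targets) with the usual
    cross-entropy on gold labels."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

loss = distillation_loss(torch.randn(4, 10), torch.randn(4, 10),
                         torch.randint(0, 10, (4,)))
print(loss.item())
```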
6.3 Quantization
Mapping high-precision data types to low-precision ones is referred to as quantization. (Quantization can be applied at different stages of the NLP pipeline to reduce training and inference costs.) Different components may have different sensitivities regarding their underlying precision, so there is a body of work on mixed-precision quantization.
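A sketch of the basic operation behind post-training quantization: map a float tensor's observed range onto int8 levels with a scale and zero point, then dequantize for comparison. This is a toy per-tensor affine scheme, not a full quantization framework.

```python
import torch

def quantize_int8(tensor):
    """Affine quantization of a float tensor to int8 (per-tensor)."""
    qmin, qmax = -128, 127
    scale = (tensor.max() - tensor.min()) / (qmax - qmin)
    zero_point = qmin - torch.round(tensor.min() / scale)
    q = torch.clamp(torch.round(tensor / scale + zero_point), qmin, qmax).to(torch.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.float() - zero_point)

w = torch.randn(3, 3)
q, s, z = quantize_int8(w)
print((w - dequantize(q, s, z)).abs().max())   # small reconstruction error
```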
6.4 Inference Considerations
While efficiency during pre-training and fine-tuning focuses on the computational resources and time required to train and optimize a model, inference efficiency is focused on how well a learned model can perform on new input data in real-world scenarios. Moreover, inference optimization is ultimately context-specific, and the requirements vary according to the use case. Therefore, there is no one-size-fits-all solution to optimizing inference, but instead a plethora of techniques.
Promising directions for optimizing inference efficiency might consider tighter integration across, or more general-purpose approaches to, algorithms, software, and hardware.
7 Hardware Utilization
8 Evaluating Efficiency
9 Model Selection
10 Conclusion