Research Question
- Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art finetuning approaches.
- Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting.
Introduction
Define Meta-learning for LMs:
- Meta-learning means the model develops a broad set of skills and pattern-recognition abilities at training time, and then uses those abilities at inference time to rapidly adapt to or recognize the desired task (illustrated in Figure 1.1).

Define zero-shot, one-shot, few-shot:
- We further specialize the description to “zero-shot”, “one-shot”, or “few-shot” (based on the meta-learning framework) depending on how many demonstrations are provided at inference time.
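The distinction is purely about how many completed demonstrations are packed into the context window before the query; no gradient updates occur in any of the three settings. A minimal sketch (the `build_prompt` helper, the `=>` separator, and the translation task are illustrative assumptions, loosely in the style of the paper's prompt figures, not the paper's actual code):

```python
# Hypothetical sketch of zero-/one-/few-shot prompt construction.
# The function name, separator, and task are assumptions for illustration.

def build_prompt(task_description, demonstrations, query):
    """Concatenate a task description, K demonstrations, and the query.

    K = 0 -> zero-shot, K = 1 -> one-shot, K > 1 -> few-shot.
    The demonstrations are pure context; the model is not fine-tuned.
    """
    lines = [task_description]
    for source, target in demonstrations:   # each demonstration is a completed pair
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")             # the model completes this final line
    return "\n".join(lines)

demos = [("sea otter", "loutre de mer"), ("cheese", "fromage")]
print(build_prompt("Translate English to French:", demos, "peppermint"))
```

With `demos = []` the same call yields a zero-shot prompt, and with a single pair a one-shot prompt, which is the only thing that changes across the three evaluation settings.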

Overall results:
A heuristic sense of the overall results can be seen in Figure 1.3, which aggregates the various tasks (though it should not be seen as a rigorous or meaningful benchmark in itself).

Approach
Our basic pre-training approach, including model, data, and training, is similar to the process described in GPT-2, with relatively straightforward scaling up of the model size, dataset size and diversity, and length of training.
Model and Architectures

- We use the same model and architecture as GPT-2, including the modified initialization, pre-normalization, and reversible tokenization described therein, with the exception of alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer.
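Pre-normalization means LayerNorm is applied to the *input* of each sub-block (attention, MLP) before the residual add, rather than after it as in the original post-norm transformer. A minimal NumPy sketch of that residual pattern, with the attention and MLP sub-blocks passed in as opaque callables (this is an illustrative simplification, not the actual GPT implementation):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the last (feature) dimension to zero mean, unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pre_norm_block(x, attn, mlp):
    # Pre-normalization: LayerNorm feeds each sub-block's input, and the
    # sub-block output is added to the untouched residual stream. In the
    # post-norm variant, LayerNorm would instead follow each residual add.
    x = x + attn(layer_norm(x))
    x = x + mlp(layer_norm(x))
    return x
```

One consequence of this arrangement is that the residual path from input to output is an identity plus sub-block contributions, which is generally credited with stabilizing training as depth grows.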