Research Question
We demonstrate that language models begin to learn various natural language processing tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText.
- The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks.
- Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText.
If the “GPT-1 paper” was trying to build a framework that unifies NLP task solutions (pre-training + SFT), then this “GPT-2” paper aims to drop the SFT stage and show that unsupervised learning (pre-training) alone is enough.
Motivation
- Our suspicion is that the prevalence of single task training on single domain datasets is a major contributor to the lack of generalization observed in current systems.
- The current best performing systems on language tasks utilize a combination of pre-training and supervised finetuning:
  - First, word vectors were learned and used as inputs to task-specific architectures;
  - then the contextual representations of recurrent networks were transferred;
  - and recent work suggests that task-specific architectures are no longer necessary and transferring many self-attention blocks is sufficient (GPT-1 and BERT).
- These methods still require supervised training in order to perform a task.
- We demonstrate language models can perform down-stream tasks in a zero-shot setting – without any parameter or architecture modification (meaning without further finetuning).
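For intuition, here is a minimal sketch of what zero-shot use with no parameter or architecture modification looks like in practice, querying the released GPT-2 weights through the Hugging Face `transformers` library (the library and the example prompt are additions for illustration, not part of the paper; the prompt mimics the paper's “TL;DR:” summarization framing):

```python
# Sketch only: query GPT-2 with a task framed as plain text.
# No fine-tuning and no task-specific head -- the prompt alone specifies the task.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Hypothetical prompt; the trailing "TL;DR:" asks for a summary purely via text.
prompt = "The city council voted on Tuesday to expand the riverside park.\nTL;DR:"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

output_ids = model.generate(
    input_ids,
    max_new_tokens=30,                    # continuation length, chosen arbitrarily
    do_sample=False,                      # greedy decoding keeps the sketch deterministic
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token; reuse EOS
)
print(tokenizer.decode(output_ids[0][input_ids.shape[1]:]))
```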
Approach
- Since the supervised objective is the same as the unsupervised objective but only evaluated on a subset of the sequence, the global minimum of the unsupervised objective is also the global minimum of the supervised objective (see the short formalization after this list).
- While dialog is an attractive approach, we worry it is overly restrictive. The internet contains a vast amount of information that is passively available without the need for interactive communication.
- Our speculation is that a language model with sufficient capacity will begin to learn to infer and perform the tasks demonstrated in natural language sequences in order to better predict them, regardless of their method of procurement.
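To make the objective argument concrete, a short formalization (notation added here, not taken from the paper): think of a WebText sequence that happens to demonstrate a task, where the final tokens play the role of the supervised target (e.g. the translated side of a sentence pair found in the text).

```latex
% Sequence s = (s_1, ..., s_n); the span s_k, ..., s_n is the "answer" part.

% Unsupervised (language modeling) loss over every position:
\mathcal{L}_{\text{unsup}}(\theta) = -\sum_{i=1}^{n} \log p_\theta(s_i \mid s_{<i})

% Supervised loss: the same terms, evaluated only on the target span:
\mathcal{L}_{\text{sup}}(\theta) = -\sum_{i=k}^{n} \log p_\theta(s_i \mid s_{<i})
```

Every term of the supervised loss already appears in the unsupervised loss, which is why a model that reaches the global minimum of the unsupervised objective also sits at the global minimum of the supervised one.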
Training Dataset
Since Common Crawl has significant data quality issues (a large amount of documents “whose content are mostly unintelligible”), we created a new web scrape of our own which emphasizes document quality.
- As a starting point, we scraped all outbound links from Reddit that received at least 3 karma.
- The resulting dataset, WebText, contains the text subset of these 45 million links.
- To extract the text from HTML responses we use a combination of the Dragnet (Peters & Lecocq, 2013) and Newspaper content extractors.
- We removed all Wikipedia documents from WebText since it is a common data source for other datasets and could complicate analysis due to overlapping training data with test evaluation tasks.
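The scraping code itself is not part of the paper, so the following is only a rough sketch of the described pipeline: the 3-karma threshold, the Wikipedia filter, and the Newspaper-style extraction come from the points above, while the function names, the `submissions` input format, and the choice of the `newspaper3k` library are assumptions made for illustration.

```python
# Rough sketch of a WebText-style collection pipeline -- not OpenAI's code.
# Assumes `submissions` is an iterable of (url, karma) pairs already pulled from
# a Reddit dump; deduplication and other heuristic cleanup are omitted.
from urllib.parse import urlparse

from newspaper import Article  # newspaper3k; the paper combined it with Dragnet

KARMA_THRESHOLD = 3  # "at least 3 karma"


def keep_link(url: str, karma: int) -> bool:
    """Apply the two documented filters: minimum karma and no Wikipedia."""
    if karma < KARMA_THRESHOLD:
        return False
    return "wikipedia.org" not in urlparse(url).netloc


def extract_text(url: str) -> str:
    """Download a page and pull out the article body text."""
    article = Article(url)
    article.download()
    article.parse()
    return article.text


def build_webtext(submissions):
    """Yield one extracted document per link that passes the filters."""
    for url, karma in submissions:
        if keep_link(url, karma):
            yield extract_text(url)
```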
Input Representation (BPE)
- Reference BPE implementations often operate on Unicode code points and not byte sequences.
- These implementations would result in a base vocabulary of over 130,000 before any multi-symbol tokens are added.
- In contrast, a byte-level version of BPE only requires a base vocabulary of size 256.
- Directly applying BPE to the byte sequence results in suboptimal merges due to BPE using a greedy frequency based heuristic for building the token vocabulary.
- We observed BPE including many versions of common words like “dog”, since it occurs in many variations such as “dog.”, “dog!”, and “dog?”.
- To avoid this, we prevent BPE from merging across character categories for any byte sequence.
- We add an exception for spaces which significantly improves the compression efficiency while adding only minimal fragmentation of words across multiple vocab tokens.
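The released GPT-2 encoder implements this restriction as a regex pre-tokenization step applied before the byte-level merges; the sketch below uses a simplified form of that pattern (the simplification and the helper name are mine) to show why “dog”, “dog.” and “dog!” can never be merged into single tokens, while a leading space still travels with the word.

```python
# Minimal sketch of the pre-tokenization that keeps BPE from merging across
# character categories. Byte-level BPE merges are then learned inside each chunk,
# over a base vocabulary of the 256 possible byte values.
import regex  # third-party module; unlike `re`, it supports \p{...} categories

# Letters, numbers, and everything else form separate chunks; an optional leading
# space attaches to the following word -- the "space exception".
PRE_TOKENIZE = regex.compile(r" ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+")


def pre_tokenize(text: str) -> list[bytes]:
    """Split text into chunks, then map each chunk to its UTF-8 bytes."""
    return [chunk.encode("utf-8") for chunk in PRE_TOKENIZE.findall(text)]


print(pre_tokenize("dog. dog! dog?"))
# [b'dog', b'.', b' dog', b'!', b' dog', b'?']
# "dog" always lands in its own chunk, so BPE never learns tokens like "dog.".
```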
Model Architecture