Research Question

The paper demonstrates that language models begin to learn a variety of natural language processing tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText.

If the “GPT-1 paper” tried to build a unified framework for solving NLP tasks (generative pre-training followed by supervised fine-tuning, SFT), then this “GPT-2” paper aims to drop the SFT stage entirely and show that unsupervised pre-training alone is enough for zero-shot task transfer.
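To make the contrast concrete, here is a minimal zero-shot sketch using the Hugging Face transformers library (not from the paper; the prompt template and decoding settings are illustrative assumptions). Instead of fine-tuning on a task-specific dataset, the task is expressed inside the prompt and the pre-trained language model simply continues the text.

```python
# Minimal zero-shot sketch (assumes the Hugging Face `transformers` package;
# the prompt template and decoding settings are illustrative, not the paper's
# exact setup).
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# The task is specified in natural language inside the prompt -- no SFT stage,
# no task-specific head: the LM just continues the sequence.
prompt = "Translate English to French:\nsea otter => loutre de mer\ncheese =>"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

output = model.generate(
    input_ids,
    max_new_tokens=5,                       # only a short continuation is needed
    do_sample=False,                        # greedy decoding for determinism
    pad_token_id=tokenizer.eos_token_id,
)
# Print only the newly generated continuation, not the prompt.
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
```

The small "gpt2" checkpoint will not translate reliably; the point of the sketch is only the interface: one frozen pre-trained model, many tasks, all specified through the input text.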

Motivation

Approach

Training Dataset

Since Common Crawl has significant data quality issues (a large number of documents “whose content are mostly unintelligible”), the authors built their own web scrape, WebText, which emphasizes document quality: they only scraped outbound links from Reddit posts that received at least 3 karma, using karma as a heuristic signal that humans found the linked page worthwhile.
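A sketch of what such a karma-based quality filter might look like (the Post structure, field names, and toy data below are hypothetical placeholders, not the authors' actual pipeline, which also deduplicated and cleaned the scraped text):

```python
# Sketch of a WebText-style quality filter: keep only outbound links from
# Reddit posts with >= 3 karma, then deduplicate the URLs. The `Post` class
# and example data are hypothetical, not the paper's actual code.
from dataclasses import dataclass

@dataclass
class Post:
    url: str      # outbound link found in a Reddit submission
    karma: int    # karma of the submission, used as a human-curation proxy

def select_quality_links(posts: list[Post], min_karma: int = 3) -> list[str]:
    """Return deduplicated outbound URLs from posts meeting the karma threshold."""
    seen: set[str] = set()
    kept: list[str] = []
    for post in posts:
        if post.karma >= min_karma and post.url not in seen:
            seen.add(post.url)
            kept.append(post.url)
    return kept

# Example usage with toy data:
posts = [
    Post("https://example.com/article", karma=12),
    Post("https://example.com/spam", karma=1),
    Post("https://example.com/article", karma=40),  # duplicate URL
]
print(select_quality_links(posts))  # ['https://example.com/article']
```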

Input Representation (BPE)

Model Architecture