Our training procedure consists of two stages.
Objective: Given an unsupervised corpus of tokens $\mathcal{U} = \{u_1, \ldots, u_n\}$, we use a standard language modeling objective to maximize the following likelihood:
$L_1(\mathcal{U}) = \sum_{i} \log P(u_i | u_{i-k}, \ldots, u_{i-1}; \Theta)$
where $k$ is the size of the context window, and the conditional probability $P$ is modeled using a neural network with parameters $\Theta$. These parameters are trained with stochastic gradient descent.
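To make the objective concrete, the following sketch (PyTorch; the `model` callable and its output shape are assumptions for illustration, not part of the paper) accumulates the log-probability of each token given the $k$ tokens preceding it:

```python
# Minimal sketch of the L_1 objective, assuming `model(context)` returns
# logits of shape (batch, seq_len, vocab_size) for the given context window.
import torch
import torch.nn.functional as F

def lm_objective(model, tokens, k):
    """tokens: LongTensor of shape (n,); returns the summed log-likelihood L_1."""
    total_log_prob = torch.tensor(0.0)
    for i in range(k, tokens.size(0)):
        context = tokens[i - k:i].unsqueeze(0)     # (1, k) window u_{i-k}, ..., u_{i-1}
        logits = model(context)[:, -1, :]          # logits for the next token
        log_probs = F.log_softmax(logits, dim=-1)
        total_log_prob = total_log_prob + log_probs[0, tokens[i]]
    return total_log_prob                          # maximized w.r.t. the parameters Theta
```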
Model Architecture: In our experiments, we use a multi-layer Transformer decoder for the language model:
$h_0 = U W_e + W_p$

$h_l = \text{transformer\_block}(h_{l-1}) \quad \forall l \in [1, n]$

$P(u) = \text{softmax}(h_n W_e^T)$
where $U = (u_{-k}, \ldots, u_{-1})$ is the context vector of tokens, $n$ is the number of layers, $W_e$ is the token embedding matrix, and $W_p$ is the position embedding matrix.
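The sketch below mirrors these three equations in PyTorch. It is illustrative only: the `blocks` argument (a list of masked self-attention transformer blocks) and the constructor hyperparameters are assumptions, and the block internals are not specified here.

```python
# Minimal sketch of the forward pass: embed the context, apply n transformer
# blocks, and project back onto the vocabulary with the tied embedding matrix.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerLM(nn.Module):
    def __init__(self, vocab_size, context_size, d_model, blocks):
        super().__init__()
        self.W_e = nn.Embedding(vocab_size, d_model)                  # token embedding matrix W_e
        self.W_p = nn.Parameter(torch.zeros(context_size, d_model))   # position embedding matrix W_p
        self.blocks = nn.ModuleList(blocks)                           # n transformer blocks

    def forward(self, U):                              # U: (batch, k) context token ids
        h = self.W_e(U) + self.W_p[: U.size(1)]        # h_0 = U W_e + W_p
        for block in self.blocks:                      # h_l = transformer_block(h_{l-1})
            h = block(h)
        return F.softmax(h @ self.W_e.weight.T, dim=-1)  # P(u) = softmax(h_n W_e^T)
```

Note that the output projection reuses `self.W_e.weight`, matching the $W_e^T$ in the final equation (tied input and output embeddings).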