Our training procedure consists of two stages.
Objective: Given an unsupervised corpus of tokens $\mathcal{U} = \{u_1, \ldots, u_n\}$, we use a standard language modeling objective to maximize the following likelihood:
$L_1(\mathcal{U}) = \sum_{i} \log P(u_i | u_{i-k}, \ldots, u_{i-1}; \Theta)$
where $k$ is the size of the context window, and the conditional probability $P$ is modeled using a neural network with parameters $\Theta$. These parameters are trained with stochastic gradient descent.
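To make the objective concrete, the following sketch (PyTorch; the `model` callable and its output shape are assumptions for illustration, not part of the paper) accumulates the log-probability of each token given the $k$ tokens preceding it:

```python
# Minimal sketch of the L_1 objective, assuming `model(context)` returns
# logits of shape (batch, seq_len, vocab_size) for the given context window.
import torch
import torch.nn.functional as F

def lm_objective(model, tokens, k):
    """tokens: LongTensor of shape (n,); returns the summed log-likelihood L_1."""
    total_log_prob = torch.tensor(0.0)
    for i in range(k, tokens.size(0)):
        context = tokens[i - k:i].unsqueeze(0)     # (1, k) window u_{i-k}, ..., u_{i-1}
        logits = model(context)[:, -1, :]          # logits for the next token
        log_probs = F.log_softmax(logits, dim=-1)
        total_log_prob = total_log_prob + log_probs[0, tokens[i]]
    return total_log_prob                          # maximized w.r.t. the parameters Theta
```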
Model Architecture: In our experiments, we use a multi-layer Transformer decoder for the language model:
$h_0 = U W_e + W_p$

$h_l = \text{transformer\_block}(h_{l-1}) \quad \forall l \in [1, n]$

$P(u) = \text{softmax}(h_n W_e^T)$
where $U = (u_{-k}, \ldots, u_{-1})$ is the context vector of tokens, $n$ is the number of layers, $W_e$ is the token embedding matrix, and $W_p$ is the position embedding matrix.
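The sketch below mirrors these three equations in PyTorch. It is illustrative only: the `blocks` argument (a list of masked self-attention transformer blocks) and the constructor hyperparameters are assumptions, and the block internals are not specified here.

```python
# Minimal sketch of the forward pass: embed the context, apply n transformer
# blocks, and project back onto the vocabulary with the tied embedding matrix.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerLM(nn.Module):
    def __init__(self, vocab_size, context_size, d_model, blocks):
        super().__init__()
        self.W_e = nn.Embedding(vocab_size, d_model)                  # token embedding matrix W_e
        self.W_p = nn.Parameter(torch.zeros(context_size, d_model))   # position embedding matrix W_p
        self.blocks = nn.ModuleList(blocks)                           # n transformer blocks

    def forward(self, U):                              # U: (batch, k) context token ids
        h = self.W_e(U) + self.W_p[: U.size(1)]        # h_0 = U W_e + W_p
        for block in self.blocks:                      # h_l = transformer_block(h_{l-1})
            h = block(h)
        return F.softmax(h @ self.W_e.weight.T, dim=-1)  # P(u) = softmax(h_n W_e^T)
```

Note that the output projection reuses `self.W_e.weight`, matching the $W_e^T$ in the final equation (tied input and output embeddings).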