Even for the largest configuration (8.3 billion parameters) running on 512 GPUs, we achieve 74% scaling relative to linear scaling of the strong single-GPU baseline configuration (1.2 billion parameters).
Research Question
- In this work, we present our techniques for training very large transformer models and implement a simple, efficient intra-layer model parallel approach.
- Our approach does not require a new compiler or library changes, and is orthogonal and complementary to pipeline model parallelism.
- We train an 8.3 billion parameter transformer language model similar to GPT-2 and a 3.9 billion parameter model similar to BERT.
- We show that careful attention to the placement of layer normalization in BERT-like models is critical to achieving increased performance as the model size grows (a generic sketch contrasting the two placements follows below).
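To make the layer-normalization point concrete, the generic sketch below contrasts the post-LN sublayer ordering of the original BERT with a pre-LN ordering in which normalization is applied before each sublayer. The exact rearrangement studied in the paper may differ in detail; the class names and PyTorch code here are illustrative only.

```python
# Generic illustration of layer-norm placement in a residual sublayer; the
# paper's precise rearrangement may differ, so treat this as a sketch of the
# design space rather than the paper's architecture.
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Original BERT ordering: x -> LayerNorm(x + Sublayer(x))."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer, self.norm = sublayer, nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))

class PreLNBlock(nn.Module):
    """Normalization moved before the sublayer: x -> x + Sublayer(LayerNorm(x))."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer, self.norm = sublayer, nn.LayerNorm(d_model)

    def forward(self, x):
        return x + self.sublayer(self.norm(x))
```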
Approach
Existing Approaches
Data parallelism (Valiant, 1990)
where a training minibatch is split across multiple workers (related work skipped; a minimal sketch follows below).
However, these techniques have one fundamental limitation in the problem size they can tackle: the model must fit entirely on one worker.
- One solution to this problem is to employ parameter sharing to reduce the memory footprint of the model (Lan et al., 2019), but this limits the overall capacity of the model.
- Our approach is to utilize model parallelism to split the model across multiple accelerators.
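Before moving on to model parallelism, the following minimal PyTorch sketch illustrates the data-parallel baseline described above: every worker holds a full replica of the model, consumes its own shard of the minibatch, and gradients are averaged across workers before each optimizer step. The tiny linear model, the sizes, and the `torchrun` launch assumption are placeholders, not the paper's setup.

```python
# Data parallelism sketch (not the paper's code): full model replica per GPU,
# per-worker minibatch shard, gradients all-reduced by DistributedDataParallel.
# Assumes one process per GPU, launched e.g. with `torchrun --nproc_per_node=8`.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

model = torch.nn.Linear(1024, 1024).cuda(rank)   # stand-in for a full transformer
model = DDP(model, device_ids=[rank])            # replicates the model, syncs gradients
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for _ in range(10):
    x = torch.randn(8, 1024, device=f"cuda:{rank}")  # this worker's shard of the minibatch
    loss = model(x).pow(2).mean()                    # dummy loss
    loss.backward()                                  # gradient all-reduce happens here
    optimizer.step()
    optimizer.zero_grad()

dist.destroy_process_group()
```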
Model parallelism
in which the memory usage and computation of a model are distributed across multiple workers.
Within model parallelism, there are two further paradigms:
- layer-wise pipeline parallelism:
- In pipeline model parallelism, groups of operations are performed on one device before the outputs are passed to the next device in the pipeline, where a different group of operations is performed (see the toy sketch after this list).
- Some approaches (skipped) suffer from inconsistency issues.
- The GPipe framework for TensorFlow (Huang et al., 2018) overcomes this inconsistency issue by using synchronous gradient descent.
- distributed tensor computation: an orthogonal and more general approach that partitions a tensor operation across multiple devices to accelerate computation or to increase model size.
- We utilize similar insights to those leveraged in Mesh-TensorFlow and exploit parallelism in computing the transformer’s attention heads to parallelize our transformer model.
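To illustrate the first paradigm (layer-wise pipelining) in the list above, the toy sketch below places the first group of layers on one GPU and the second on another, copying activations between devices. Real pipeline frameworks such as GPipe additionally split each minibatch into micro-batches and schedule them so that all devices stay busy; that scheduling is omitted here, and all names and sizes are illustrative.

```python
# Toy layer-wise model parallelism: stage 0 lives on cuda:0, stage 1 on cuda:1,
# and activations are copied from one device to the next. Micro-batch
# scheduling (as in GPipe) is intentionally left out. Requires two GPUs.
import torch
import torch.nn as nn

class TwoStagePipeline(nn.Module):
    def __init__(self, hidden=1024):
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU()).to("cuda:0")
        self.stage1 = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU()).to("cuda:1")

    def forward(self, x):
        x = self.stage0(x.to("cuda:0"))   # group of operations on device 0
        x = self.stage1(x.to("cuda:1"))   # outputs handed to device 1
        return x

model = TwoStagePipeline()
out = model(torch.randn(8, 1024))
print(out.device)   # cuda:1
```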
Model Parallel Transformers
We introduce model parallelism separately for the attention blocks, the MLP blocks, and the language-modeling head.
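As a concrete example of the intra-layer split for the MLP block, the sketch below simulates two workers inside a single process: the first GEMM's weight is partitioned by columns and the second by rows, so GeLU can be applied independently on each worker and only one sum of partial outputs (an all-reduce in the real multi-GPU setting) is required in the forward pass. The attention block is partitioned analogously by assigning each worker a subset of attention heads. Shapes are illustrative, and this is a numerical check of the split rather than a distributed implementation.

```python
# Single-process simulation of the intra-layer (tensor) parallel MLP split:
#   Z = GeLU(X A) B, with A split by columns and B by rows across two workers.
# Each worker computes GeLU(X A_i) B_i locally; summing the partial results
# (an all-reduce across GPUs in practice) recovers the unpartitioned output.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, d_model, d_ff = 4, 8, 32

X = torch.randn(batch, d_model)
A = torch.randn(d_model, d_ff)    # first MLP weight
B = torch.randn(d_ff, d_model)    # second MLP weight

Z_ref = F.gelu(X @ A) @ B         # unpartitioned reference

A1, A2 = A.chunk(2, dim=1)        # column split of A
B1, B2 = B.chunk(2, dim=0)        # row split of B

Z1 = F.gelu(X @ A1) @ B1          # worker 0: no communication needed before GeLU
Z2 = F.gelu(X @ A2) @ B2          # worker 1

Z = Z1 + Z2                       # the "all-reduce"
print(torch.allclose(Z, Z_ref, atol=1e-5))   # True
```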