Even for the largest configuration (8.3 billion parameters) running on 512 GPUs, we achieve 74% scaling relative to linear scaling of the strong single-GPU baseline configuration (1.2 billion parameters).

Research Question

Approach

Existing Approach

Data parallelism (Valiant, 1990)

where a training minibatch is split across multiple workers, each of which holds a full replica of the model and averages its gradients with the others. (Related work skipped.)

However, these techniques have one fundamental limitation in the problem size they can tackle: the model must fit entirely on one worker.
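A minimal single-process sketch of this idea (plain PyTorch, with the workers simulated by a loop; the sizes and the name n_workers are illustrative, not from the paper): every worker holds a full replica of the model, computes gradients on its own shard of the minibatch, and the gradients are then averaged (the all-reduce step in a real distributed run) so all replicas stay in sync. Because each worker stores the entire model, this only works if the model fits in one worker's memory.

import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
n_workers = 4                              # simulated data-parallel workers
model = nn.Linear(16, 4)                   # full model, replicated on every worker
replicas = [copy.deepcopy(model) for _ in range(n_workers)]

x = torch.randn(32, 16)                    # global training minibatch
y = torch.randn(32, 4)
x_shards = x.chunk(n_workers)              # split the minibatch across workers
y_shards = y.chunk(n_workers)

# Each "worker" runs forward/backward on its own shard.
for rep, xs, ys in zip(replicas, x_shards, y_shards):
    nn.functional.mse_loss(rep(xs), ys).backward()

# All-reduce: average gradients across workers so every replica
# applies the same parameter update and the copies stay identical.
for params in zip(*(rep.parameters() for rep in replicas)):
    avg = torch.stack([p.grad for p in params]).mean(dim=0)
    for p in params:
        p.grad = avg.clone()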

Model parallelism

in which the memory usage and computation of a model are distributed across multiple workers.

Within model parallelism, there are two further paradigms: layer-wise pipeline parallelism, and more general distributed tensor computation.
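A rough contrast of the two paradigms, again simulated in a single PyTorch process (the sizes and variable names are hypothetical): pipeline parallelism assigns whole layers to different workers and only activations move between them, while distributed tensor computation splits a single layer's weight matrix across workers.

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 512)

# Layer-wise pipeline parallelism: worker 0 owns the first layer,
# worker 1 owns the second; only activations are communicated.
layer_on_worker0 = nn.Linear(512, 512)
layer_on_worker1 = nn.Linear(512, 512)
h = layer_on_worker0(x)              # computed on worker 0
out_pipeline = layer_on_worker1(h)   # h is sent on to worker 1

# Distributed tensor computation: one weight matrix is split column-wise,
# each worker stores and multiplies only its slice, outputs are gathered.
W = torch.randn(512, 512)
W0, W1 = W.chunk(2, dim=1)           # half the parameters per worker
out_tensor = torch.cat([x @ W0, x @ W1], dim=1)
assert torch.allclose(out_tensor, x @ W, atol=1e-4)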

Model Parallel Transformers

We introduce model parallelism in the self-attention block, the MLP block, and the language modeling head, each handled separately; the MLP split is sketched below.
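A single-process sketch of the MLP split the paper describes (simulated with tensor slices rather than real GPUs; the sizes are illustrative and double precision is used only to make the equality check tight): the first weight matrix A is partitioned by columns so the GeLU nonlinearity can be applied independently on each worker, the second matrix B is partitioned by rows, and the partial results need just one all-reduce (a sum here) to recover the full output. The self-attention block is split in the same spirit by distributing attention heads across workers.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
dtype = torch.float64                       # double precision for a tight check
X = torch.randn(8, 1024, dtype=dtype)       # input activations
A = torch.randn(1024, 4096, dtype=dtype)    # first MLP weight (h -> 4h)
B = torch.randn(4096, 1024, dtype=dtype)    # second MLP weight (4h -> h)
n_parts = 2                                 # simulated tensor-parallel workers

A_parts = A.chunk(n_parts, dim=1)           # column-parallel split of A
B_parts = B.chunk(n_parts, dim=0)           # row-parallel split of B

# GeLU(X @ A_i) is local to each worker because GeLU is elementwise and
# A is split by columns; multiplying by the row block B_i then yields a
# partial sum of the final output on each worker.
partials = [F.gelu(X @ Ai) @ Bi for Ai, Bi in zip(A_parts, B_parts)]

# The only forward-pass communication: one all-reduce (summation).
Z = torch.stack(partials).sum(dim=0)

# Matches the unpartitioned MLP.
Z_ref = F.gelu(X @ A) @ B
assert torch.allclose(Z, Z_ref)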