Research Question
We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. The focus of this work is to train a series of language models that achieve the best possible performance at various inference budgets, by training on more tokens than is typically used.
Approach
Pre-training Data
- Data sources: For the most part, we reuse data sources that have been leveraged to train other LLMs, with the restriction of only using data that is publicly available.

- For most of our training data, each token is used only once during training, with the exception of the Wikipedia and Books domains, over which we perform approximately two epochs.
Tokenizer
- We tokenize the data with the byte-pair encoding (BPE) algorithm using the implementation from SentencePiece. Notably, we split all numbers into individual digits, and fall back to bytes to decompose unknown UTF-8 characters.
- Overall, our entire training dataset contains roughly 1.4T tokens after tokenization.
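As a rough illustration of this setup, the snippet below trains a BPE tokenizer with the SentencePiece Python bindings, enabling digit splitting and byte fallback; the corpus path, output name, and vocabulary size are placeholder assumptions, not values taken from the paper.

```python
import sentencepiece as spm

# Train a BPE tokenizer with digit splitting and byte fallback.
spm.SentencePieceTrainer.train(
    input="corpus.txt",              # assumed path to raw training text
    model_prefix="llama_tokenizer",  # hypothetical output name
    model_type="bpe",                # byte-pair encoding
    vocab_size=32000,                # assumed vocabulary size
    split_digits=True,               # split all numbers into individual digits
    byte_fallback=True,              # decompose unknown UTF-8 characters into bytes
)

sp = spm.SentencePieceProcessor(model_file="llama_tokenizer.model")
print(sp.encode("trained on 1.4T tokens in 2023", out_type=str))
# Digits such as "1", "4", "2", "0", "2", "3" appear as separate pieces.
```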
Architecture
Our network is based on the transformer architecture, and leverages various improvements that were subsequently proposed:
- Pre-normalization (as used in GPT2 and GPT3): To improve training stability, we normalize the input of each transformer sub-layer with RMSNorm, instead of normalizing the output.
- SwiGLU activation function (as used in PaLM): We replace the ReLU non-linearity by the SwiGLU activation function. We use a dimension of $\frac{2}{3}4d$ instead of $4d$ as in PaLM.
- Rotary Embeddings (as used in GPTNeo): We remove the absolute positional embeddings, and instead, add rotary positional embeddings (RoPE) at each layer of the network.
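The following is a minimal PyTorch sketch of these three components; the class names, epsilon value, and the exact rounding of the hidden dimension are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization: no mean subtraction, no bias term."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Normalize by the RMS of the last dimension, then rescale.
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight

class SwiGLUFeedForward(nn.Module):
    """Feed-forward block with SwiGLU gating and a hidden size of 2/3 * 4d."""
    def __init__(self, dim: int):
        super().__init__()
        hidden = int(2 * 4 * dim / 3)  # 2/3 * 4d (no rounding to a multiple here)
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        # SwiGLU: silu(gate(x)) * up(x), projected back to the model dimension.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

def apply_rope(x: torch.Tensor, theta: float = 10000.0) -> torch.Tensor:
    """Rotary positional embeddings for a (batch, seq, heads, head_dim) tensor."""
    _, seq_len, _, head_dim = x.shape
    freqs = 1.0 / (theta ** (torch.arange(0, head_dim, 2, device=x.device,
                                          dtype=torch.float32) / head_dim))
    angles = torch.outer(torch.arange(seq_len, device=x.device,
                                      dtype=torch.float32), freqs)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    # Rotate each pair of dimensions by a position-dependent angle.
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)
```

In a pre-normalized block these pieces would be wired roughly as `h = x + attention(RMSNorm(x))` followed by `out = h + SwiGLUFeedForward(dim)(RMSNorm(h))`, with `apply_rope` applied to the query and key tensors inside the attention itself.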
Optimizer
Our models are trained using the AdamW optimizer with the following hyper-parameters:
- $β_1$ = 0.9, $β_2$ = 0.95
- cosine learning rate schedule, such that the final learning rate is equal to 10% of the maximal learning rate.
- weight decay of 0.1
- gradient clipping of 1.0
- 2,000 warmup steps
- vary the learning rate and batch size with the size of the model
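A sketch of these hyper-parameters in PyTorch is shown below; the peak learning rate, batch, total step count, and dummy model are placeholder assumptions (the paper ties learning rate and batch size to model size), not the authors' training code.

```python
import math
import torch

model = torch.nn.Linear(16, 16)   # stand-in for the actual transformer
peak_lr = 3e-4                    # assumed peak learning rate (varies with model size)
total_steps, warmup_steps = 10_000, 2_000

optimizer = torch.optim.AdamW(
    model.parameters(), lr=peak_lr, betas=(0.9, 0.95), weight_decay=0.1
)

def lr_lambda(step: int) -> float:
    # Linear warmup over 2,000 steps, then cosine decay down to 10% of the peak.
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.1 + 0.45 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    loss = model(torch.randn(8, 16)).pow(2).mean()   # dummy batch and loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping at 1.0
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```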
Efficient Implementation
We make several optimizations to improve the training speed of our models:
- We use an efficient implementation of causal multi-head attention (available in the xformers library) to reduce memory usage and runtime; a usage sketch follows this list.
  - This is achieved by not storing the attention weights and not computing the key/query scores that are masked due to the causal nature of the language modeling task.
- We reduce the amount of activations recomputed during the backward pass (activation checkpointing) by saving the activations that are expensive to compute, such as the outputs of linear layers.
  - This is achieved by manually implementing the backward function for the transformer layers, instead of relying on the PyTorch autograd.
- We also overlap the computation of activations and the communication between GPUs over the network (due to all_reduce operations) as much as possible.
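For the attention part, a minimal usage sketch of the memory-efficient causal attention from xformers is shown below, assuming a CUDA device and the xformers package installed; the tensor shapes are illustrative, and the activation-saving and communication-overlap optimizations are not shown.

```python
import torch
import xformers.ops as xops

batch, seq_len, n_heads, head_dim = 2, 1024, 8, 64   # illustrative shapes
q = torch.randn(batch, seq_len, n_heads, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# The lower-triangular bias tells the fused kernel to skip masked key/query
# scores, and the full attention-weight matrix is never materialized.
out = xops.memory_efficient_attention(q, k, v, attn_bias=xops.LowerTriangularMask())
```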