Research Question

We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. The focus of this work is to train a series of language models that achieve the best possible performance at various inference budgets, by training on more tokens than is typically used.

Approach

Pre-training Data

Tokenizer

The data is tokenized with the byte-pair encoding (BPE) algorithm, using the SentencePiece implementation. Notably, numbers are split into individual digits, and unknown UTF-8 characters are decomposed into bytes.
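A minimal sketch of loading and using a SentencePiece BPE tokenizer in Python is shown below; the tokenizer.model filename is a placeholder, not a path from the paper.

```python
import sentencepiece as spm

# Load a trained SentencePiece BPE model (filename is hypothetical).
sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")

# Encode text to token ids and decode back; digits are split individually.
ids = sp.encode_as_ids("The largest model has 65 billion parameters.")
print(ids)
print(sp.decode_ids(ids))
```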

Architecture

The network is based on the transformer architecture, with three main modifications: pre-normalization (each sub-layer input is normalized with RMSNorm instead of normalizing the output), the SwiGLU activation function in the feed-forward network, and rotary positional embeddings (RoPE) in place of absolute positional embeddings.
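As a concrete illustration, here is a minimal PyTorch sketch of the RMSNorm and SwiGLU building blocks; the dimensions, epsilon value, and module names are illustrative rather than taken from the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square layer norm: rescale by the RMS of the features, no mean subtraction."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """Feed-forward block with the SwiGLU activation: w2(silu(w1(x)) * w3(x))."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

# Pre-normalization: the norm is applied to the input of each sub-layer,
# and the residual connection bypasses both the norm and the sub-layer.
x = torch.randn(2, 16, 512)
ffn, norm = SwiGLU(512, 1376), RMSNorm(512)
y = x + ffn(norm(x))
```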

Optimizer

Our models are trained using the AdamW optimizer with the following hyper-parameters:

- β1 = 0.9 and β2 = 0.95
- a cosine learning rate schedule, with the final learning rate equal to 10% of the maximal learning rate
- a weight decay of 0.1 and gradient clipping at 1.0
- 2,000 warmup steps, with the learning rate and batch size varied with the size of the model
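A minimal PyTorch sketch of this optimizer setup follows; the model, peak learning rate, and total step count are placeholders, not values from the paper.

```python
import math
import torch

model = torch.nn.Linear(4096, 4096)  # stand-in for the actual transformer
peak_lr, total_steps, warmup_steps = 3e-4, 100_000, 2_000  # peak LR and total steps are illustrative

optimizer = torch.optim.AdamW(
    model.parameters(), lr=peak_lr, betas=(0.9, 0.95), weight_decay=0.1
)

def lr_lambda(step: int) -> float:
    # Linear warmup to the peak learning rate, then cosine decay to 10% of the peak.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.1 + 0.9 * 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# In the training loop, clip gradients to 1.0 before each optimizer step:
# torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
# optimizer.step(); scheduler.step()
```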

Efficient Implementation

We make several optimizations to improve the training speed of our models:

- an efficient implementation of the causal multi-head attention (available in the xformers library) that reduces memory usage and runtime by not storing the attention weights and not computing the key/query scores that are masked out by causality (see the sketch after this list)
- reducing the amount of activations recomputed during the backward pass, by saving expensive activations (such as the outputs of the linear layers) and manually implementing the backward function for the transformer layers instead of relying on autograd
- model and sequence parallelism to reduce the memory usage per GPU
- overlapping the computation of activations with the communication between GPUs over the network (due to all_reduce operations) as much as possible
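The paper relies on the xformers library for the memory-efficient causal attention; as a rough sketch of the same idea, PyTorch 2.x's fused scaled_dot_product_attention likewise avoids materializing the full attention matrix when is_causal=True. Tensor shapes here are illustrative.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: (batch, heads, sequence length, head dimension).
batch, heads, seq_len, head_dim = 2, 8, 1024, 64
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn_like(q)
v = torch.randn_like(q)

# No explicit (seq_len x seq_len) mask is built and no attention-weight tensor
# is returned; on GPU this dispatches to fused (FlashAttention-style) kernels.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 1024, 64])
```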