Research Question
- In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters.
- Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases.
Pretraining

We began with the pretraining approach described in the LLaMA paper (Touvron et al., 2023), using an optimized auto-regressive transformer, but made several changes to improve performance:
- performed more robust data cleaning and updated our data mixes,
- trained on 40% more total tokens,
- doubled the context length,
- and used grouped-query attention (GQA) to improve inference scalability for our larger models (a minimal sketch follows this list).
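
Grouped-query attention lets many query heads share a small number of key/value heads, which shrinks the key/value state that has to be kept around at inference time. Below is a minimal PyTorch sketch of the idea; the dimensions, weight layout, and omission of the causal mask are illustrative assumptions, not the actual Llama 2 implementation.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(x, w_q, w_k, w_v, n_heads, n_kv_heads):
    """Minimal GQA: n_heads query heads share n_kv_heads key/value heads
    (n_heads must be divisible by n_kv_heads). Causal mask omitted for brevity."""
    bsz, seqlen, dim = x.shape
    head_dim = dim // n_heads

    # Project to many query heads but only a few key/value heads.
    q = (x @ w_q).view(bsz, seqlen, n_heads, head_dim)
    k = (x @ w_k).view(bsz, seqlen, n_kv_heads, head_dim)
    v = (x @ w_v).view(bsz, seqlen, n_kv_heads, head_dim)

    # Repeat each key/value head so every group of query heads attends to it.
    group_size = n_heads // n_kv_heads
    k = k.repeat_interleave(group_size, dim=2)
    v = v.repeat_interleave(group_size, dim=2)

    # Standard scaled dot-product attention per head.
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # (bsz, heads, seq, head_dim)
    scores = (q @ k.transpose(-2, -1)) / head_dim**0.5
    out = F.softmax(scores, dim=-1) @ v
    return out.transpose(1, 2).reshape(bsz, seqlen, dim)

# Toy usage: 8 query heads sharing 2 key/value heads.
dim, n_heads, n_kv_heads = 64, 8, 2
head_dim = dim // n_heads
x = torch.randn(1, 16, dim)
w_q = torch.randn(dim, n_heads * head_dim)
w_k = torch.randn(dim, n_kv_heads * head_dim)
w_v = torch.randn(dim, n_kv_heads * head_dim)
print(grouped_query_attention(x, w_q, w_k, w_v, n_heads, n_kv_heads).shape)  # (1, 16, 64)
```

The inference benefit comes from the cache: only the n_kv_heads key/value heads need to be stored per token rather than one per query head, which is why GQA matters most for the larger models.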

Pretraining Data
Our training corpus includes a new mix of data from publicly available sources, which does not include data from Meta’s products or services.
Training Details
Architecture

We adopt most of the pretraining setting and model architecture from Llama 1.
- We use the standard transformer architecture (Vaswani et al., 2017), apply pre-normalization using RMSNorm (Zhang and Sennrich, 2019), use the SwiGLU activation function (Shazeer, 2020), and use rotary positional embeddings (RoPE; Su et al., 2022).
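
To make two of these components concrete, here is a minimal PyTorch sketch of RMSNorm pre-normalization and a SwiGLU feed-forward block. The class names, hidden-dimension ratio, and toy shapes are illustrative assumptions, not the exact Llama 2 implementation.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square layer norm (Zhang and Sennrich, 2019): rescales by the
    RMS of the activations instead of subtracting a mean as LayerNorm does."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLUFeedForward(nn.Module):
    """Feed-forward block with the SwiGLU activation (Shazeer, 2020):
    a SiLU-gated linear unit in place of the usual ReLU/GELU MLP."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.w_down(nn.functional.silu(self.w_gate(x)) * self.w_up(x))

# Pre-normalization: normalize the input of the sub-layer, then add the residual.
dim = 64
norm, ffn = RMSNorm(dim), SwiGLUFeedForward(dim, hidden_dim=4 * dim)
x = torch.randn(2, 16, dim)
out = x + ffn(norm(x))   # the residual branch sees RMS-normalized activations
print(out.shape)         # torch.Size([2, 16, 64])
```

Pre-normalization applies the norm to the input of each sub-layer rather than to its output, an arrangement Llama 1 and Llama 2 use to improve training stability.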