Research Question

Pre-training

Data

To enhance the richness and diversity of the dataset, we have organized our approach into three essential stages: deduplication, filtering, and remixing.

Deduplication

The deduplication and remixing stages ensure a diverse representation of the data by sampling unique instances.
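As an illustration of what "sampling unique instances" can look like in practice, here is a minimal sketch of exact deduplication by content hash. The text above does not specify the method, so everything here is an assumption: the `normalize` and `deduplicate` helpers are hypothetical names, and the normalization rule (lowercasing and collapsing whitespace) is only one possible choice. Real pipelines often add near-duplicate detection on top, which this sketch does not attempt.

```python
import hashlib


def normalize(text: str) -> str:
    """Hypothetical normalization: lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def deduplicate(docs):
    """Keep the first occurrence of each document, comparing normalized content."""
    seen = set()      # hashes of normalized documents already kept
    unique = []       # first occurrences, in original order
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

For example, `deduplicate(["Hello  world", "hello world", "other"])` keeps the first and third documents, because the first two normalize to the same string.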


Filtering

Remixing

In the remixing phase, we adjust the data mix to address imbalances, focusing on increasing the share of underrepresented domains.
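One common way to raise the share of underrepresented domains is to sample each domain in proportion to its size raised to an exponent below 1. The sketch below assumes that scheme; the text does not say which method is actually used, and the function name `remix_weights`, the `alpha` parameter, and the example domain names are all hypothetical.

```python
def remix_weights(domain_counts, alpha=0.5):
    """Return sampling probabilities proportional to count ** alpha.

    alpha = 1.0 reproduces the original proportions; smaller values of alpha
    flatten the distribution and so upweight small domains. (Assumed scheme,
    not taken from the source.)
    """
    if not domain_counts:
        raise ValueError("domain_counts must not be empty")
    scaled = {domain: count ** alpha for domain, count in domain_counts.items()}
    total = sum(scaled.values())
    return {domain: value / total for domain, value in scaled.items()}
```

With counts of 900 "web" documents and 100 "code" documents, the original shares are 0.9 and 0.1; with `alpha=0.5` the scaled sizes are 30 and 10, so the sampling shares become 0.75 and 0.25.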

Tokenizer