Token-Based Scaling Results:
Models fine-tuned on token-scaled datasets show improvements in Q&A accuracy.
Gains are more pronounced at smaller scales (e.g., 1x to 5x) but diminish with further scaling (e.g., 10x), suggesting overfitting to the repeated Q&A pairs; the repetition scheme is sketched below.
As a sanity check, models were also evaluated on pre-cutoff events such as the 2018 FIFA World Cup, where they performed well from the start, confirming that the evaluation isolates genuinely new knowledge.
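To make the repetition concrete, here is a minimal sketch of token-based scaling, assuming the dataset is a list of (question, answer) strings and using a crude whitespace token count in place of a real tokenizer. The names `scale_by_tokens` and `count_tokens` are hypothetical, not the authors' implementation:

```python
from typing import List, Tuple

QAPair = Tuple[str, str]  # (question, answer)

def count_tokens(pair: QAPair) -> int:
    # Crude whitespace token count; a real setup would use the model's tokenizer.
    return len(pair[0].split()) + len(pair[1].split())

def scale_by_tokens(dataset: List[QAPair], scale: int) -> List[QAPair]:
    """Repeat Q&A pairs round-robin until the corpus reaches `scale`x its original token count."""
    budget = scale * sum(count_tokens(p) for p in dataset)
    scaled: List[QAPair] = []
    total = i = 0
    while total < budget:
        pair = dataset[i % len(dataset)]
        scaled.append(pair)
        total += count_tokens(pair)
        i += 1
    return scaled

pairs = [("Who won the 2022 FIFA World Cup?", "Argentina")]
print(len(scale_by_tokens(pairs, 5)))  # 5: the single pair repeated five times
```

Note that scaling only adds verbatim copies, which is why larger multiples invite overfitting rather than broader coverage.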
Fact-Based Scaling Results:
Fact-based scaling yields more consistent results across scales, with accuracy improving steadily as the scale increases.
Unlike token-based scaling, there is no performance drop at larger scales, indicating better generalization and retention of the injected knowledge.
This method provides better overall coverage of the fact set, leading to more reliable knowledge injection; one possible scaling scheme is sketched below.
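For contrast, a sketch of fact-based scaling under the same assumptions: each fact is expanded into `scale` differently phrased Q&A pairs instead of being repeated verbatim. The `paraphrase` helper is a templated stand-in for whatever generation step (e.g., an LLM prompt) would actually produce the variants; all names here are hypothetical:

```python
from typing import List, Tuple

QAPair = Tuple[str, str]  # (question, answer)

def paraphrase(question: str, variant: int) -> str:
    # Placeholder templates; a real pipeline would generate genuine paraphrases.
    templates = [
        "{q}",
        "Can you tell me: {q}",
        "Here is a quick question: {q}",
    ]
    return templates[variant % len(templates)].format(q=question)

def scale_by_facts(facts: List[QAPair], scale: int) -> List[QAPair]:
    """Emit `scale` differently phrased Q&A pairs for every fact, so coverage grows with scale."""
    return [
        (paraphrase(question, v), answer)
        for question, answer in facts
        for v in range(scale)
    ]

facts = [("Who won the 2022 FIFA World Cup?", "Argentina")]
for q, a in scale_by_facts(facts, 3):
    print(q, "->", a)
```

Because each increment of scale adds new phrasings rather than duplicates, the training signal keeps presenting each fact from fresh angles, which is consistent with the steadier gains reported above.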
Cross-Validation:
Evaluating token-scaled models on fact-based evaluation sets exposes a key limitation of token-based scaling: repeating tokens does not guarantee coverage of all relevant facts.
Fact-based evaluation therefore serves as a more robust measure of the model's knowledge retention; a minimal scoring sketch follows below.
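As a rough illustration of this cross-evaluation, the sketch below scores a model on a fact-based evaluation set, counting a fact as retained only when its probe is answered correctly. `model_answer` stands in for real model inference, and the case-insensitive exact-match check is a simplifying assumption:

```python
from typing import Callable, List, Tuple

QAPair = Tuple[str, str]  # (question, expected answer)

def fact_accuracy(model_answer: Callable[[str], str],
                  eval_set: List[QAPair]) -> float:
    """Fraction of facts answered correctly (case-insensitive exact match)."""
    correct = sum(
        model_answer(question).strip().lower() == answer.strip().lower()
        for question, answer in eval_set
    )
    return correct / len(eval_set)

# Toy usage with a stand-in model that always answers "Argentina":
eval_set = [("Who won the 2022 FIFA World Cup?", "Argentina")]
print(fact_accuracy(lambda q: "Argentina", eval_set))  # 1.0
```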