Token-Based Scaling Results:
Models fine-tuned on token-scaled datasets show improvements in Q&A accuracy.
Gains are more pronounced at smaller scales (e.g., 1x to 5x) but diminish with further scaling (e.g., 10x), suggesting overfitting to the repeated Q&A pairs; the repetition scheme is sketched below.
As a sanity check, models were also evaluated on pre-cutoff events such as the 2018 FIFA World Cup, where they performed well from the start, confirming that the evaluation isolates genuinely new knowledge.
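To make the repetition concrete, here is a minimal sketch of token-based scaling, assuming the dataset is a list of (question, answer) strings and using a crude whitespace token count in place of a real tokenizer. The names `scale_by_tokens` and `count_tokens` are hypothetical, not the authors' implementation:

```python
from typing import List, Tuple

QAPair = Tuple[str, str]  # (question, answer)

def count_tokens(pair: QAPair) -> int:
    # Crude whitespace token count; a real setup would use the model's tokenizer.
    return len(pair[0].split()) + len(pair[1].split())

def scale_by_tokens(dataset: List[QAPair], scale: int) -> List[QAPair]:
    """Repeat Q&A pairs round-robin until the corpus reaches `scale`x its original token count."""
    budget = scale * sum(count_tokens(p) for p in dataset)
    scaled: List[QAPair] = []
    total = i = 0
    while total < budget:
        pair = dataset[i % len(dataset)]
        scaled.append(pair)
        total += count_tokens(pair)
        i += 1
    return scaled

pairs = [("Who won the 2022 FIFA World Cup?", "Argentina")]
print(len(scale_by_tokens(pairs, 5)))  # 5: the single pair repeated five times
```

Note that scaling only adds verbatim copies, which is why larger multiples invite overfitting rather than broader coverage.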
Fact-Based Scaling Results:
Fact-based scaling yields more consistent results across scales, with accuracy improving steadily as the scale increases.
Unlike token-based scaling, there is no performance drop at larger scales, indicating better generalization and retention of the injected knowledge.
This method provides better overall coverage of the fact set, leading to more reliable knowledge injection; one possible scaling scheme is sketched below.
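For contrast, a sketch of fact-based scaling under the same assumptions: each fact is expanded into `scale` differently phrased Q&A pairs instead of being repeated verbatim. The `paraphrase` helper is a templated stand-in for whatever generation step (e.g., an LLM prompt) would actually produce the variants; all names here are hypothetical:

```python
from typing import List, Tuple

QAPair = Tuple[str, str]  # (question, answer)

def paraphrase(question: str, variant: int) -> str:
    # Placeholder templates; a real pipeline would generate genuine paraphrases.
    templates = [
        "{q}",
        "Can you tell me: {q}",
        "Here is a quick question: {q}",
    ]
    return templates[variant % len(templates)].format(q=question)

def scale_by_facts(facts: List[QAPair], scale: int) -> List[QAPair]:
    """Emit `scale` differently phrased Q&A pairs for every fact, so coverage grows with scale."""
    return [
        (paraphrase(question, v), answer)
        for question, answer in facts
        for v in range(scale)
    ]

facts = [("Who won the 2022 FIFA World Cup?", "Argentina")]
for q, a in scale_by_facts(facts, 3):
    print(q, "->", a)
```

Because each increment of scale adds new phrasings rather than duplicates, the training signal keeps presenting each fact from fresh angles, which is consistent with the steadier gains reported above.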
Cross-Validation:
Evaluating token-scaled models on fact-based evaluation sets exposes a key limitation of token-based scaling: repeating tokens does not guarantee coverage of all relevant facts.
Fact-based evaluation therefore serves as a more robust measure of the model's knowledge retention; a minimal scoring sketch follows below.
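As a rough illustration of this cross-evaluation, the sketch below scores a model on a fact-based evaluation set, counting a fact as retained only when its probe is answered correctly. `model_answer` stands in for real model inference, and the case-insensitive exact-match check is a simplifying assumption:

```python
from typing import Callable, List, Tuple

QAPair = Tuple[str, str]  # (question, expected answer)

def fact_accuracy(model_answer: Callable[[str], str],
                  eval_set: List[QAPair]) -> float:
    """Fraction of facts answered correctly (case-insensitive exact match)."""
    correct = sum(
        model_answer(question).strip().lower() == answer.strip().lower()
        for question, answer in eval_set
    )
    return correct / len(eval_set)

# Toy usage with a stand-in model that always answers "Argentina":
eval_set = [("Who won the 2022 FIFA World Cup?", "Argentina")]
print(fact_accuracy(lambda q: "Argentina", eval_set))  # 1.0
```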