How to Fine-tune LLM Successfully in Bakery
Introduction
This comprehensive guide will walk you through the process of fine-tuning Large Language Models (LLMs) in Bakery. We'll cover all the parameters available in the FineTunePayload and provide best practices for achieving optimal results.
Fine-tuning Parameters Overview
Essential Parameters
Base Model Selection
You must provide:
base_model: the Asset ID of a model stored in Bakery, or a Hugging Face model identifier (e.g., "gpt2", "facebook/opt-350m"), as shown in the example below
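For instance, a minimal payload fragment referencing a Hugging Face model might look like this sketch (the asset ID in the second variant is a placeholder, not a real ID):
{
"base_model": "facebook/opt-350m" // Hugging Face model identifier
}
{
"base_model": "<your-bakery-asset-id>" // Asset ID of a model already stored in Bakery
}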


Dataset Selection
Select a dataset from "My Datasets", confirm its file format, and choose the input and output columns to use for training; an illustrative payload fragment follows.
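As a sketch, dataset selection might add fields like the following to the payload. The field names dataset, input_column, and output_column are assumptions for illustration and may differ in your version of FineTunePayload:
{
"dataset": "my-datasets/customer-support.csv", // hypothetical reference to a dataset in "My Datasets"
"input_column": "question", // hypothetical input column name
"output_column": "answer" // hypothetical output column name
}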
Training Parameters
Learning Rate
{
"learning_rate": 2e-5 // Default value
}
Recommendations:
Small models (< 3B parameters): 2e-5 to 5e-5
Medium models (3B-7B parameters): 1e-5 to 3e-5
Large models (> 7B parameters): 5e-6 to 1e-5
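For example, when fine-tuning a model in the 3B-7B range you might lower the learning rate into the recommended band:
{
"learning_rate": 1e-5 // conservative choice for a medium-sized model
}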
Number of Epochs
{
"epochs": 1 // Default value
}
Guidelines:
Small datasets (< 1000 examples): 5-10 epochs
Medium datasets (1000-10000 examples): 3-5 epochs
Large datasets (> 10000 examples): 1-3 epochs
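For example, a small dataset of a few hundred examples usually benefits from more passes over the data:
{
"epochs": 5 // suitable for a dataset with < 1000 examples
}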
GPU Selection
{
"gpu": "NVIDIA_TESLA_A100" // or "NVIDIA_TESLA_V100"
}
Choose based on:
Model size
For small models such as GPT-2 or BERT, NVIDIA_TESLA_V100 is sufficient
For medium and large models such as LLaMA, StableCode, Qwen, or SmolLM2, use NVIDIA_TESLA_A100
Dataset size
For large or complex datasets, use NVIDIA_TESLA_A100
Training budget
Required training speed
For faster training, use NVIDIA_TESLA_A100
Step-by-Step Fine-tuning Guide
1. Prepare Your Dataset
Ensure your dataset meets these criteria:
Clean, high-quality data
Proper format (CSV, JSON, TXT, or Parquet)
Appropriate size for your model
Correct column names
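If you use the JSON format, each training example might look like the sketch below; the column names "input" and "output" are placeholders for whichever columns you select in Bakery:
{
"input": "Summarize: The quick brown fox jumps over the lazy dog.",
"output": "A fox jumps over a dog."
}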
2. Configure the Fine-tuning Parameters
Set the base model, dataset, learning rate, epochs, and GPU as described above; a sketch of a complete payload follows.
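The example below combines the parameters covered in this guide. The base_model, learning_rate, epochs, and gpu fields are documented above; the dataset-related field names are assumptions and may differ in your version of FineTunePayload:
{
"base_model": "facebook/opt-350m",
"dataset": "my-datasets/customer-support.csv", // hypothetical field name
"input_column": "question", // hypothetical field name
"output_column": "answer", // hypothetical field name
"learning_rate": 2e-5,
"epochs": 3,
"gpu": "NVIDIA_TESLA_V100"
}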
3. Monitor Training Progress
The system provides real-time metrics:
Loss values
Learning rate
Training time
Training logs
Best Practices
1. Dataset Quality
Clean your data thoroughly
Remove duplicates
Balance your dataset
Validate data format
2. Model Selection
Start with smaller models for testing
Consider computational resources
Match model size to dataset size
Check model licenses
3. Training Parameters
Start with recommended defaults
Adjust based on monitoring
Use early stopping when needed
Save checkpoints regularly
4. Resource Management
Monitor GPU memory usage
Schedule large jobs appropriately
Implement proper error handling
Back up important checkpoints
Common Issues and Solutions
1. Out of Memory Errors
Solution:
Reduce batch size
Use smaller sequence lengths
Choose a smaller model
Upgrade GPU
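For example, within the parameters documented in this guide, an out-of-memory failure can often be addressed by switching to a smaller base model or to the A100 GPU (batch size and sequence length adjustments depend on whether your version of FineTunePayload exposes them):
{
"base_model": "gpt2", // smaller model to reduce memory pressure
"gpu": "NVIDIA_TESLA_A100" // more GPU memory than the V100
}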
2. Poor Training Results
Solution:
Check data quality
Adjust learning rate
Increase epochs
Validate preprocessing
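As a sketch, a lower learning rate combined with more epochs is a common first adjustment when results are poor:
{
"learning_rate": 1e-5, // reduced from the 2e-5 default
"epochs": 5 // more passes for a small or noisy dataset
}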
3. Slow Training Speed
Solution:
Use more powerful GPU
Optimize batch size
Reduce dataset size
Simplify model architecture
Performance Optimization Tips
Epoch Planning
Use early stopping
Monitor validation metrics
Implement checkpointing
Save best models
Post-Training Steps
1. Model Evaluation
Test on validation set
Compare with baseline
Check for overfitting
Validate outputs
2. Model Storage
Automatic save to Bakery
Version control
Metadata storage
For additional assistance:
Join our Discord
Visit our Documentation