How to Fine-tune LLM Successfully in Bakery
Introduction
This comprehensive guide will walk you through the process of fine-tuning Large Language Models (LLMs) in Bakery. We'll cover all the parameters available in the FineTunePayload and provide best practices for achieving optimal results.
Fine-tuning Parameters Overview
Essential Parameters
Base Model Selection
You must provide:
base_model: the Asset ID of a model stored in Bakery, or a Hugging Face model identifier (e.g., "gpt2", "facebook/opt-350m"), as shown in the example below
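For instance, a minimal payload fragment referencing a Hugging Face model might look like this sketch (the asset ID in the second variant is a placeholder, not a real ID):
{
"base_model": "facebook/opt-350m" // Hugging Face model identifier
}
{
"base_model": "<your-bakery-asset-id>" // Asset ID of a model already stored in Bakery
}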


Dataset Selection
Select a dataset from "My Datasets", confirm its file format, and choose the input and output columns to use for training; an illustrative payload fragment follows.
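As a sketch, dataset selection might add fields like the following to the payload. The field names dataset, input_column, and output_column are assumptions for illustration and may differ in your version of FineTunePayload:
{
"dataset": "my-datasets/customer-support.csv", // hypothetical reference to a dataset in "My Datasets"
"input_column": "question", // hypothetical input column name
"output_column": "answer" // hypothetical output column name
}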
Training Parameters
Learning Rate
{
"learning_rate": 2e-5 // Default value
}
Recommendations:
Small models (< 3B parameters): 2e-5 to 5e-5
Medium models (3B-7B parameters): 1e-5 to 3e-5
Large models (> 7B parameters): 5e-6 to 1e-5
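For example, when fine-tuning a model in the 3B-7B range you might lower the learning rate into the recommended band:
{
"learning_rate": 1e-5 // conservative choice for a medium-sized model
}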
Number of Epochs
{
"epochs": 1 // Default value
}
Guidelines:
Small datasets (< 1000 examples): 5-10 epochs
Medium datasets (1000-10000 examples): 3-5 epochs
Large datasets (> 10000 examples): 1-3 epochs
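For example, a small dataset of a few hundred examples usually benefits from more passes over the data:
{
"epochs": 5 // suitable for a dataset with < 1000 examples
}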
GPU Selection
{
"gpu": "NVIDIA_TESLA_A100" // or "NVIDIA_TESLA_V100"
}
Choose based on:
Model size
For small models such as GPT-2 or BERT, NVIDIA_TESLA_V100 is sufficient
For medium and large models such as LLaMA, StableCode, Qwen, or SmolLM2, use NVIDIA_TESLA_A100
Dataset size
For large or complex datasets, use NVIDIA_TESLA_A100
Training budget
Required training speed
For faster training, use NVIDIA_TESLA_A100
Step-by-Step Fine-tuning Guide
1. Prepare Your Dataset
Ensure your dataset meets these criteria:
Clean, high-quality data
Proper format (CSV, JSON, TXT, or Parquet)
Appropriate size for your model
Correct column names
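If you use the JSON format, each training example might look like the sketch below; the column names "input" and "output" are placeholders for whichever columns you select in Bakery:
{
"input": "Summarize: The quick brown fox jumps over the lazy dog.",
"output": "A fox jumps over a dog."
}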
2. Configure the Fine-tuning Parameters
Set the base model, dataset, learning rate, epochs, and GPU as described above; a sketch of a complete payload follows.
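The example below combines the parameters covered in this guide. The base_model, learning_rate, epochs, and gpu fields are documented above; the dataset-related field names are assumptions and may differ in your version of FineTunePayload:
{
"base_model": "facebook/opt-350m",
"dataset": "my-datasets/customer-support.csv", // hypothetical field name
"input_column": "question", // hypothetical field name
"output_column": "answer", // hypothetical field name
"learning_rate": 2e-5,
"epochs": 3,
"gpu": "NVIDIA_TESLA_V100"
}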
3. Monitor Training Progress
The system provides real-time metrics:
Loss values
Learning rate
Training time
Training logs
Best Practices
1. Dataset Quality
Clean your data thoroughly
Remove duplicates
Balance your dataset
Validate data format
2. Model Selection
Start with smaller models for testing
Consider computational resources
Match model size to dataset size
Check model licenses
3. Training Parameters
Start with recommended defaults
Adjust based on monitoring
Use early stopping when needed
Save checkpoints regularly
4. Resource Management
Monitor GPU memory usage
Schedule large jobs appropriately
Implement proper error handling
Back up important checkpoints
Common Issues and Solutions
1. Out of Memory Errors
Solution:
Reduce batch size
Use smaller sequence lengths
Choose a smaller model
Upgrade GPU
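For example, within the parameters documented in this guide, an out-of-memory failure can often be addressed by switching to a smaller base model or to the A100 GPU (batch size and sequence length adjustments depend on whether your version of FineTunePayload exposes them):
{
"base_model": "gpt2", // smaller model to reduce memory pressure
"gpu": "NVIDIA_TESLA_A100" // more GPU memory than the V100
}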
2. Poor Training Results
Solution:
Check data quality
Adjust learning rate
Increase epochs
Validate preprocessing
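As a sketch, a lower learning rate combined with more epochs is a common first adjustment when results are poor:
{
"learning_rate": 1e-5, // reduced from the 2e-5 default
"epochs": 5 // more passes for a small or noisy dataset
}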
3. Slow Training Speed
Solution:
Use more powerful GPU
Optimize batch size
Reduce dataset size
Simplify model architecture
Performance Optimization Tips
Epoch Planning
Use early stopping
Monitor validation metrics
Implement checkpointing
Save best models
Post-Training Steps
1. Model Evaluation
Test on validation set
Compare with baseline
Check for overfitting
Validate outputs
2. Model Storage
Automatic save to Bakery
Version control
Metadata storage
For additional assistance:
Join our Discord
Visit our Documentation