Dataset Preparation Guide for Fine-tuning

Introduction

This guide explains how to prepare your datasets for fine-tuning models in Bakery. Four data formats are supported: CSV, TXT, JSON, and Parquet. Each format has specific requirements to ensure successful model training.

Supported File Formats

CSV Format

CSV (Comma-Separated Values) files should be structured with clear column headers and consistent data formatting.

input_text,output_text
"What is Python?","Python is a high-level programming language."
"Explain JSON","JSON is a lightweight data interchange format."

Requirements:

  • First row must contain column headers

  • Expected columns:

    • input_text: Required; holds the training input

    • output_text: Optional; holds the target output for supervised learning

  • Use quotes around text containing commas

  • UTF-8 encoding recommended
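
As a minimal sketch, Python's standard csv module can produce a file that meets these requirements; the file name training_data.csv is only an example.

# Example of writing a CSV dataset with Python's csv module
import csv

rows = [
    {'input_text': 'What is Python?',
     'output_text': 'Python is a high-level programming language.'},
    {'input_text': 'Explain JSON',
     'output_text': 'JSON is a lightweight data interchange format.'},
]

with open('training_data.csv', 'w', encoding='utf-8', newline='') as f:
    # QUOTE_ALL wraps every field in quotes, so embedded commas stay safe
    writer = csv.DictWriter(f, fieldnames=['input_text', 'output_text'],
                            quoting=csv.QUOTE_ALL)
    writer.writeheader()  # first row: column headers
    writer.writerows(rows)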

TXT Format

Text files should be organized with clear separation between input and output pairs.

What is Python?
Python is a high-level programming language.

Explain JSON
JSON is a lightweight data interchange format.

Requirements:

  • Separate input/output pairs with double newlines

  • Each pair should have the input on the first line and the output on the second

  • UTF-8 encoding recommended

  • No special formatting needed within lines
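
As a sketch of how this layout is consumed, the snippet below splits a file into pairs on blank lines; the file name training_data.txt is only an example.

# Example of reading the TXT pair format
pairs = []
with open('training_data.txt', encoding='utf-8') as f:
    for block in f.read().strip().split('\n\n'):  # pairs separated by double newlines
        lines = block.splitlines()
        pairs.append({
            'input_text': lines[0],                             # first line: input
            'output_text': lines[1] if len(lines) > 1 else '',  # second line: output
        })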

JSON Format

JSON files can be either a single object or an array of training examples.

[
  {
    "input_text": "What is Python?",
    "output_text": "Python is a high-level programming language."
  },
  {
    "input_text": "Explain JSON",
    "output_text": "JSON is a lightweight data interchange format."
  }
]

Requirements:

  • Must be valid JSON

  • Array of objects recommended

  • Each object should have:

    • input_text: Required field

    • output_text: Optional field

  • UTF-8 encoding recommended
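
A minimal sketch using the standard json module; training_data.json is only an example file name.

# Example of writing the recommended array-of-objects format
import json

examples = [
    {'input_text': 'What is Python?',
     'output_text': 'Python is a high-level programming language.'},
    {'input_text': 'Explain JSON',
     'output_text': 'JSON is a lightweight data interchange format.'},
]

with open('training_data.json', 'w', encoding='utf-8') as f:
    # ensure_ascii=False writes UTF-8 text instead of \u escape sequences
    json.dump(examples, f, ensure_ascii=False, indent=2)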

Parquet Format

Parquet files should be structured with clearly defined columns and data types.

# Example of creating a Parquet file using pandas
# (DataFrame.to_parquet requires the pyarrow or fastparquet package)
import pandas as pd

data = {
    'input_text': ['What is Python?', 'Explain JSON'],
    'output_text': ['Python is a high-level programming language.',
                    'JSON is a lightweight data interchange format.']
}
df = pd.DataFrame(data)
df.to_parquet('training_data.parquet')  # snappy compression by default

Requirements:

  • Expected columns:

    • input_text: Required; holds the training input

    • output_text: Optional; holds the target output for supervised learning

  • String data type for text columns

  • Compression recommended (e.g., snappy, pandas' default, or gzip)
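
Before uploading, a quick read-back can confirm that a file meets these requirements; this sketch assumes pandas (with pyarrow or fastparquet) is installed.

# Example of validating a Parquet file's columns and types
import pandas as pd

df = pd.read_parquet('training_data.parquet')
assert 'input_text' in df.columns, 'input_text column is required'
# text columns should contain Python strings
assert df['input_text'].map(type).eq(str).all(), 'input_text must hold strings'
print(df.dtypes)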

Best Practices

  1. Data Quality

    • Clean your data before creating the dataset

    • Remove any corrupted or irrelevant examples

    • Ensure consistent formatting

  2. File Size

    • Keep individual files under 1GB

    • Split large datasets into multiple files if needed (see the sketch after this list)

    • Consider using Parquet for large datasets

  3. Text Length

    • Consider model context window limitations

    • Keep inputs and outputs within reasonable lengths

    • Configure truncation settings so overlong examples fit within the context window
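
As one way to apply the file-size guidance above, the sketch below splits a dataset into fixed-size Parquet chunks; the 100,000-row chunk size is an arbitrary assumption, not a Bakery requirement, so tune it until each file stays under 1GB.

# Example of splitting a large dataset into multiple Parquet files
import pandas as pd

df = pd.read_parquet('training_data.parquet')  # the large source dataset
chunk_rows = 100_000  # assumed chunk size; adjust to keep each file under 1GB

for i, start in enumerate(range(0, len(df), chunk_rows)):
    chunk = df.iloc[start:start + chunk_rows]
    chunk.to_parquet(f'training_data_part{i:03d}.parquet')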

Support

For additional help:
