Dataset Preparation Guide for Fine-tuning
Introduction
This guide explains how to prepare your datasets for fine-tuning models in Bakery. We support multiple data formats, including CSV, TXT, JSON, and Parquet. Each format has specific requirements to ensure successful model training.
Supported File Formats
CSV Format
CSV (Comma-Separated Values) files should be structured with clear column headers and consistent data formatting.
input_text,output_text
"What is Python?","Python is a high-level programming language."
"Explain JSON","JSON is a lightweight data interchange format."
Requirements:
First row must contain column headers
Must include the input_text column (the training input)
The output_text column is optional; include it for supervised learning
Use quotes around text containing commas
UTF-8 encoding recommended
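If you build the CSV programmatically, pandas handles quoting and encoding for you. A minimal sketch (the file name training_data.csv is illustrative):

# Write a compliant CSV; QUOTE_ALL wraps every field in quotes so
# embedded commas cannot break the column layout
import csv
import pandas as pd

rows = [
    {'input_text': 'What is Python?',
     'output_text': 'Python is a high-level programming language.'},
    {'input_text': 'Explain JSON',
     'output_text': 'JSON is a lightweight data interchange format.'},
]
pd.DataFrame(rows).to_csv('training_data.csv', index=False,
                          quoting=csv.QUOTE_ALL, encoding='utf-8')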
TXT Format
Text files should be organized with clear separation between input and output pairs.
What is Python?
Python is a high-level programming language.

Explain JSON
JSON is a lightweight data interchange format.
Requirements:
Separate input/output pairs with double newlines
Each pair should have the input on the first line and the output on the second line
UTF-8 encoding recommended
No special formatting needed within lines
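Loading this format takes only a short parser. A minimal sketch, assuming a UTF-8 file named training_data.txt (illustrative):

# Blocks are separated by blank lines; within a block, the first line
# is the input and the second line is the output
pairs = []
with open('training_data.txt', encoding='utf-8') as f:
    for block in f.read().strip().split('\n\n'):
        lines = block.splitlines()
        if len(lines) >= 2:
            pairs.append({'input_text': lines[0], 'output_text': lines[1]})
print(pairs)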
JSON Format
JSON files can contain either a single object or an array of training examples.
[
  {
    "input_text": "What is Python?",
    "output_text": "Python is a high-level programming language."
  },
  {
    "input_text": "Explain JSON",
    "output_text": "JSON is a lightweight data interchange format."
  }
]
Requirements:
Must be valid JSON
Array of objects recommended
Each object should have:
input_text: Required field
output_text: Optional field
UTF-8 encoding recommended
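The standard-library json module can both write this layout and sanity-check it before upload. A minimal sketch (training_data.json is illustrative):

import json

examples = [
    {'input_text': 'What is Python?',
     'output_text': 'Python is a high-level programming language.'},
    {'input_text': 'Explain JSON',
     'output_text': 'JSON is a lightweight data interchange format.'},
]
# ensure_ascii=False keeps the file as readable UTF-8
with open('training_data.json', 'w', encoding='utf-8') as f:
    json.dump(examples, f, ensure_ascii=False, indent=2)

# Re-read and confirm every object carries the required field
with open('training_data.json', encoding='utf-8') as f:
    for i, obj in enumerate(json.load(f)):
        assert 'input_text' in obj, f'example {i} is missing input_text'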
Parquet Format
Parquet files should be structured with clearly defined columns and data types.
# Example of creating a Parquet file using pandas
import pandas as pd

data = {
    'input_text': ['What is Python?', 'Explain JSON'],
    'output_text': ['Python is a high-level programming language.',
                    'JSON is a lightweight data interchange format.'],
}
df = pd.DataFrame(data)
# Requires pyarrow or fastparquet; pandas compresses with Snappy by default
df.to_parquet('training_data.parquet')
Requirements:
Must include the input_text column (the training input)
The output_text column is optional; include it for supervised learning
String data type for text columns
Compression recommended (pandas applies Snappy compression by default)
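Before uploading, it can help to read the file back and confirm these requirements hold. A minimal sketch, assuming pyarrow or fastparquet is installed:

import pandas as pd

df = pd.read_parquet('training_data.parquet')
assert 'input_text' in df.columns, 'input_text column is required'
# Confirm the text columns actually hold strings
for col in ('input_text', 'output_text'):
    if col in df.columns:
        assert df[col].map(type).eq(str).all(), f'{col} must contain strings'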
Best Practices
Data Quality
Clean your data before creating the dataset
Remove any corrupted or irrelevant examples
Ensure consistent formatting (see the cleaning sketch after this list)
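With pandas, these cleaning steps reduce to a few calls. A minimal sketch, assuming a dataset with both input_text and output_text columns (file names illustrative):

import pandas as pd

df = pd.read_parquet('training_data.parquet')  # read_csv/read_json work similarly
# Drop rows with a missing input, strip stray whitespace, and remove
# exact duplicates so repeated examples do not skew training
df = df.dropna(subset=['input_text'])
df['input_text'] = df['input_text'].str.strip()
df = df[df['input_text'] != '']
df = df.drop_duplicates(subset=['input_text', 'output_text'])
df.to_parquet('training_data_clean.parquet')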
File Size
Keep individual files under 1GB
Split large datasets into multiple files if needed (see the splitting sketch after this list)
Consider using Parquet for large datasets
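Splitting amounts to slicing the DataFrame into fixed-size chunks. A minimal sketch; the chunk size is illustrative and should be tuned so each output file stays under 1GB:

import pandas as pd

df = pd.read_parquet('training_data.parquet')
chunk_size = 100_000  # illustrative; adjust for your row sizes
for i in range(0, len(df), chunk_size):
    part = df.iloc[i:i + chunk_size]
    part.to_parquet(f'training_data_{i // chunk_size:03d}.parquet')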
Text Length
Consider model context window limitations
Keep inputs and outputs within reasonable lengths
Use truncation settings appropriately (see the sketch after this list)
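A simple word-count cutoff keeps examples bounded, though it is only a rough proxy for the model's token budget. A minimal sketch with an illustrative limit:

import pandas as pd

MAX_WORDS = 512  # illustrative cap; the real limit is the model's context window in tokens

def truncate(text, max_words=MAX_WORDS):
    # Whitespace word counts only approximate tokens; use the target
    # model's tokenizer when you need an exact budget
    return ' '.join(str(text).split()[:max_words])

df = pd.read_parquet('training_data.parquet')
df['input_text'] = df['input_text'].map(truncate)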
Support
For additional help:
Check our GitHub repository
Join our Discord community