Dataset Preparation Guide for Fine-tuning

Introduction

This guide explains how to prepare your datasets for fine-tuning models in Bakery. Four data formats are supported: CSV, TXT, JSON, and Parquet. Each format has specific requirements to ensure successful model training.

Supported File Formats

CSV Format

CSV (Comma-Separated Values) files should be structured with clear column headers and consistent data formatting.

input_text,output_text
"What is Python?","Python is a high-level programming language."
"Explain JSON","JSON is a lightweight data interchange format."

Requirements:

  • First row must contain column headers

  • Expected columns:

    • input_text: Required; holds the training input

    • output_text: Optional; holds the target output for supervised learning

  • Use quotes around text containing commas

  • UTF-8 encoding recommended
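
As a minimal sketch, Python's standard csv module can produce a file that meets these requirements; the file name training_data.csv is only an example.

# Example of writing a CSV dataset with Python's csv module
import csv

rows = [
    {'input_text': 'What is Python?',
     'output_text': 'Python is a high-level programming language.'},
    {'input_text': 'Explain JSON',
     'output_text': 'JSON is a lightweight data interchange format.'},
]

with open('training_data.csv', 'w', encoding='utf-8', newline='') as f:
    # QUOTE_ALL wraps every field in quotes, so embedded commas stay safe
    writer = csv.DictWriter(f, fieldnames=['input_text', 'output_text'],
                            quoting=csv.QUOTE_ALL)
    writer.writeheader()  # first row: column headers
    writer.writerows(rows)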

TXT Format

Text files should be organized with clear separation between input and output pairs.

What is Python?
Python is a high-level programming language.

Explain JSON
JSON is a lightweight data interchange format.

Requirements:

  • Separate input/output pairs with double newlines

  • Each pair should have the input on the first line and the output on the second

  • UTF-8 encoding recommended

  • No special formatting needed within lines
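
As a sketch of how this layout is consumed, the snippet below splits a file into pairs on blank lines; the file name training_data.txt is only an example.

# Example of reading the TXT pair format
pairs = []
with open('training_data.txt', encoding='utf-8') as f:
    for block in f.read().strip().split('\n\n'):  # pairs separated by double newlines
        lines = block.splitlines()
        pairs.append({
            'input_text': lines[0],                             # first line: input
            'output_text': lines[1] if len(lines) > 1 else '',  # second line: output
        })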

JSON Format

JSON files can be either a single object or an array of training examples.

[
  {
    "input_text": "What is Python?",
    "output_text": "Python is a high-level programming language."
  },
  {
    "input_text": "Explain JSON",
    "output_text": "JSON is a lightweight data interchange format."
  }
]

Requirements:

  • Must be valid JSON

  • Array of objects recommended

  • Each object should have:

    • input_text: Required field

    • output_text: Optional field

  • UTF-8 encoding recommended
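
A minimal sketch using the standard json module; training_data.json is only an example file name.

# Example of writing the recommended array-of-objects format
import json

examples = [
    {'input_text': 'What is Python?',
     'output_text': 'Python is a high-level programming language.'},
    {'input_text': 'Explain JSON',
     'output_text': 'JSON is a lightweight data interchange format.'},
]

with open('training_data.json', 'w', encoding='utf-8') as f:
    # ensure_ascii=False writes UTF-8 text instead of \u escape sequences
    json.dump(examples, f, ensure_ascii=False, indent=2)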

Parquet Format

Parquet files should be structured with clearly defined columns and data types.

# Example of creating a Parquet file using pandas
# (DataFrame.to_parquet requires the pyarrow or fastparquet package)
import pandas as pd

data = {
    'input_text': ['What is Python?', 'Explain JSON'],
    'output_text': ['Python is a high-level programming language.',
                    'JSON is a lightweight data interchange format.']
}
df = pd.DataFrame(data)
df.to_parquet('training_data.parquet')  # snappy compression by default

Requirements:

  • Expected columns:

    • input_text: Required; holds the training input

    • output_text: Optional; holds the target output for supervised learning

  • String data type for text columns

  • Compression recommended (e.g., snappy, pandas' default, or gzip)
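
Before uploading, a quick read-back can confirm that a file meets these requirements; this sketch assumes pandas (with pyarrow or fastparquet) is installed.

# Example of validating a Parquet file's columns and types
import pandas as pd

df = pd.read_parquet('training_data.parquet')
assert 'input_text' in df.columns, 'input_text column is required'
# text columns should contain Python strings
assert df['input_text'].map(type).eq(str).all(), 'input_text must hold strings'
print(df.dtypes)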

Best Practices

  1. Data Quality

    • Clean your data before creating the dataset

    • Remove any corrupted or irrelevant examples

    • Ensure consistent formatting

  2. File Size

    • Keep individual files under 1GB

    • Split large datasets into multiple files if needed (see the sketch after this list)

    • Consider using Parquet for large datasets

  3. Text Length

    • Consider model context window limitations

    • Keep inputs and outputs within reasonable lengths

    • Configure truncation settings so overlong examples fit within the context window
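
As one way to apply the file-size guidance above, the sketch below splits a dataset into fixed-size Parquet chunks; the 100,000-row chunk size is an arbitrary assumption, not a Bakery requirement, so tune it until each file stays under 1GB.

# Example of splitting a large dataset into multiple Parquet files
import pandas as pd

df = pd.read_parquet('training_data.parquet')  # the large source dataset
chunk_rows = 100_000  # assumed chunk size; adjust to keep each file under 1GB

for i, start in enumerate(range(0, len(df), chunk_rows)):
    chunk = df.iloc[start:start + chunk_rows]
    chunk.to_parquet(f'training_data_part{i:03d}.parquet')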

Support

For additional help:
