🧠 Why This Matters
A small dataset can limit a model’s ability to generalize and perform well on unseen data, especially in domain-specific tasks. With Prem Studio, you can enrich your dataset using synthetic data generation strategies. This allows you to:

- 📈 Expand your dataset size with low human effort
- 🧠 Introduce diversity into training examples
- 🚀 Improve model performance and generalization during fine-tuning
- 🧪 Prototype quickly, even with minimal real data
💼 Use Case: Fine-Tuning a Customer Support Chatbot
Imagine you’re building a domain-specific chatbot to answer product-related queries for an e-commerce company. You’ve collected only 50 QA pairs from past support tickets — not enough for robust fine-tuning. Instead of manually creating more data, you can enrich your dataset using synthetic generation in Prem Studio. This guide shows you how.

⚙️ Step-by-Step: Enrich Your Dataset with Synthetic Data
1
Select Your Dataset and Perform a 50/50 Split
From the sidebar, go to Datasets and open your 50-row dataset. Split it 50/50: 25 datapoints for training and 25 for validation.
This ensures your evaluation is based on a meaningful validation size despite the small dataset. Only examples in the Training and Uncategorized buckets are used during enrichment.

Splitting your dataset into training and validation sets before running enrichment is not mandatory, but we highly recommend it: it helps avoid data leakage and ensures more reliable evaluations.
See our Dataset Best Practices Guide for more.
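Conceptually, the split in this step is just a shuffled partition of your QA pairs. If you wanted to reproduce it outside the UI, a minimal Python sketch (with made-up field names, not the Prem Studio API) could look like this:

```python
import random

def split_dataset(examples, train_fraction=0.5, seed=42):
    """Shuffle and split QA pairs into training and validation sets."""
    rng = random.Random(seed)          # fixed seed for a reproducible split
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# 50 collected QA pairs -> 25 training / 25 validation
qa_pairs = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(50)]
train, val = split_dataset(qa_pairs)
```

Fixing the seed keeps the split reproducible, so later evaluation runs compare against the same 25 validation examples.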
2
Launch the Enrichment Workflow
In the top right corner, click Enrich Dataset. Then choose Seed data enhancement.
You’ll be enriching the 25 training datapoints to boost generalizability.
3
Define Enrichment Settings and (Optional) Instructions
Set the following:

- New pairs to generate: 500
- Creativity: 0.1 (lower creativity = safer, more consistent results)
- Instructions (optional):
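One way to build intuition for the Creativity setting is to think of it as a knob that scales the generator’s sampling temperature: low values stay close to your seed examples, high values paraphrase more freely. This mapping is purely illustrative, not Prem Studio’s documented internals:

```python
def creativity_to_temperature(creativity, min_temp=0.1, max_temp=1.0):
    """Map a 0-1 'creativity' knob onto a sampling temperature range.
    Illustrative assumption only; the actual mapping is internal to the product."""
    return min_temp + creativity * (max_temp - min_temp)

# Creativity 0.1 (the value used in this guide) stays near the safe end.
temperature = creativity_to_temperature(0.1)
```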
4
Review and Approve Synthetic Examples
Click Generate. Review the 500 synthetic examples.
Approve the ones you’d like to keep.
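The review step boils down to keeping only the examples that pass your inspection. A conceptual sketch (the `id` field is hypothetical; Prem Studio handles approval in the UI):

```python
def filter_approved(synthetic_examples, approved_ids):
    """Keep only the synthetic QA pairs that passed human review."""
    approved = set(approved_ids)
    return [ex for ex in synthetic_examples if ex["id"] in approved]

synthetic = [{"id": i, "question": f"q{i}", "answer": f"a{i}"} for i in range(500)]
# Suppose review rejects every odd-numbered example:
kept = filter_approved(synthetic, approved_ids=range(0, 500, 2))
```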
5
Add Synthetic Datapoints to the Training Bucket
If the data quality meets your expectations, add all the approved datapoints to the Training bucket.

When using the Autosplit functionality, this happens automatically if “Allow synthetic data in Validation” is not selected — even if you apply a split like 80/20 or 70/30.
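The Autosplit rule described above can be sketched as follows. The function name and parameters are hypothetical, but the logic mirrors the described behavior: the split ratio only applies to real datapoints, and synthetic ones stay in Training unless explicitly allowed in Validation.

```python
def autosplit(real, synthetic, train_ratio=0.8, allow_synthetic_in_validation=False):
    """Sketch of the Autosplit rule (hypothetical function, not the Prem Studio API)."""
    if allow_synthetic_in_validation:
        pool = real + synthetic
        cut = int(len(pool) * train_ratio)
        return pool[:cut], pool[cut:]
    # Only real datapoints are split; all synthetic ones go to training.
    cut = int(len(real) * train_ratio)
    return real[:cut] + synthetic, real[cut:]

real = [f"real-{i}" for i in range(50)]
synthetic = [f"syn-{i}" for i in range(500)]
train, val = autosplit(real, synthetic, train_ratio=0.8)
```

With an 80/20 ratio over 50 real examples, validation gets 10 real datapoints and training gets the remaining 40 plus all 500 synthetic ones.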
6
(Optional) Further Enrich Using Textual Documents
You can optionally upload additional textual sources — such as PDFs, TXT files, or HTML pages — to provide contextual grounding for enrichment.

These documents (e.g. product manuals, help center articles, policy pages) are used alongside your seed datapoints to generate more realistic and context-aware synthetic QA pairs. When documents are provided, the enrichment engine combines both seed examples and content from your uploaded files to create new, high-quality datapoints that better reflect your domain language and topics.
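Conceptually, document grounding amounts to combining seed QA pairs with excerpts from your uploaded documents in a single generation prompt. A simplified, hypothetical sketch (the real enrichment engine is internal to Prem Studio):

```python
def build_grounded_prompt(seed_examples, doc_chunks, n_examples=3):
    """Combine seed QA pairs with document excerpts into one generation prompt."""
    shots = "\n\n".join(
        f"Q: {ex['question']}\nA: {ex['answer']}" for ex in seed_examples[:n_examples]
    )
    context = "\n\n".join(doc_chunks)
    return (
        "Using the product documentation below, write new question-answer pairs "
        "in the same style as the examples.\n\n"
        f"Documentation:\n{context}\n\n"
        f"Examples:\n{shots}\n\nNew pairs:"
    )

seed = [{"question": "How can I track my order?",
         "answer": "You can track it using the link in your confirmation email."}]
docs = ["Tracking links are emailed when an order is dispatched."]
prompt = build_grounded_prompt(seed, docs)
```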
📊 Example Before and After Enrichment
| Type | Question | Answer |
|---|---|---|
| Original | How can I track my order? | You can track it using the link in your confirmation email. |
| Synthetic | Where do I check the status of my shipment? | Use the tracking link in the confirmation email we sent you. |
| Synthetic | Can I know where my package is? | Yes, the tracking link in your confirmation email shows real-time updates. |
📊 Dataset Size: Before vs After
- Started with 50 original examples
- Split: 25 training / 25 validation
- After enrichment: +500 synthetic datapoints → 550 total (525 in training + 25 in validation)
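The before/after counts follow from simple arithmetic:

```python
train_split, val_split = 25, 25        # 50/50 split of the 50 original examples
synthetic_approved = 500               # generated and approved during enrichment

train_total = train_split + synthetic_approved   # training bucket after enrichment
total = train_total + val_split                  # full dataset size
```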
📦 What’s Next?
With your enriched dataset, you can now:

- Fine-tune a model with higher data diversity (Fine-Tuning Guide)
- Evaluate model generalization with agentic evaluation
💡 Pro Tips
- Always enrich after splitting to avoid data leakage.
- Use instructions to control output tone, complexity, topic, or QA structure.
- Review synthetic data for consistency — quality > quantity.
- Avoid over-relying on synthetic examples for evaluation.