🧠 Why This Matters

A small dataset can limit a model's ability to generalize and perform well on unseen data, especially in domain-specific tasks. With Prem Studio, you can enrich your dataset using synthetic data generation strategies. This allows you to:
  • 📈 Expand your dataset size with low human effort
  • 🧠 Introduce diversity into training examples
  • 🚀 Improve model performance and generalization during fine-tuning
  • 🧪 Prototype quickly, even with minimal real data
In this guide, we'll show you how to enrich an existing dataset with synthetic data, using a customer support chatbot as the example.

💼 Use Case: Fine-Tuning a Customer Support Chatbot

Imagine you're building a domain-specific chatbot to answer product-related queries for an e-commerce company. You've collected only 50 QA pairs from past support tickets, which is not enough for robust fine-tuning. Instead of manually creating more data, you can enrich your dataset using synthetic generation in Prem Studio.

⚙️ Step-by-Step: Enrich Your Dataset with Synthetic Data

1

Select Your Dataset and Perform a 50/50 Split

From the sidebar, go to Datasets and open your 50-row dataset. Split it 50/50: 25 datapoints for training and 25 for validation. This ensures your evaluation is based on a meaningful validation size, despite the small dataset. Only examples in the Training and Uncategorized buckets will be used during enrichment.
It's not mandatory, but we highly recommend splitting your dataset into training and validation sets before running enrichment. This helps avoid data leakage and ensures more reliable evaluations. See our Dataset Best Practices Guide for more.
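If you want to preview a 50/50 split locally before doing it in Prem Studio, the idea can be sketched in a few lines of Python. This is only an illustration: the `split_dataset` helper, the seed, and the placeholder QA pairs are assumptions made for the example and are not part of Prem Studio.

```python
import random


def split_dataset(rows, train_fraction=0.5, seed=42):
    """Shuffle a copy of rows and split it into (training, validation)."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(rows)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder QA pairs standing in for the 50 support-ticket examples.
qa_pairs = [{"question": f"Question {i}", "answer": f"Answer {i}"} for i in range(50)]

training, validation = split_dataset(qa_pairs)
print(len(training), len(validation))  # 25 25
```

Shuffling before cutting avoids a split that simply follows the order in which tickets were collected.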
2

Launch the Enrichment Workflow

In the top right corner, click Enrich Dataset. Then choose Seed data enhancement. You'll be enriching the 25 training datapoints to boost generalizability.
3

Define Enrichment Settings and (Optional) Instructions

Set the following:
  • New pairs to generate: 500
  • Creativity: 0.1 (lower creativity = safer, more consistent results). Use a higher value if your use case requires more creative outputs (e.g. roleplaying).
  • Instructions (optional): "Keep answers short and helpful. Focus on product-related questions, shipping issues, return policies, and discount inquiries. Avoid repetitive or overly technical questions."
4

Review and Approve Synthetic Examples

Click Generate. Review the 500 synthetic examples. Approve the ones you'd like to keep.
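Reviewing 500 examples by eye is tedious, so a small local helper can point you at likely repeats first. The sketch below assumes you have the synthetic examples available as a list of dictionaries with a `question` field. That layout, the `flag_near_duplicates` helper, and the 0.9 threshold are all assumptions for illustration and not a Prem Studio feature or export format.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return " ".join(text.lower().split())


def flag_near_duplicates(examples, threshold=0.9):
    """Return (i, j) index pairs whose questions are at least `threshold` similar."""
    questions = [normalize(ex["question"]) for ex in examples]
    flagged = []
    for i in range(len(questions)):
        for j in range(i + 1, len(questions)):
            ratio = SequenceMatcher(None, questions[i], questions[j]).ratio()
            if ratio >= threshold:
                flagged.append((i, j))
    return flagged


# Illustrative examples only; the third entry is a made-up placeholder.
synthetic = [
    {"question": "Where do I check the status of my shipment?",
     "answer": "Use the tracking link in the confirmation email we sent you."},
    {"question": "Where do I check the status of my shipment",
     "answer": "Use the tracking link in the confirmation email we sent you."},
    {"question": "Can I return a discounted item?",
     "answer": "Placeholder answer."},
]

print(flag_near_duplicates(synthetic))  # [(0, 1)]
```

The pairwise comparison is quadratic, which is fine for a few hundred examples. Flagged pairs are only candidates; the decision to approve or reject still belongs to the reviewer.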
5

Add Synthetic Datapoints to Training Bucket

If the data quality matches your expectations, add all the synthetic datapoints to the Training bucket.
This happens automatically when using the Autosplit functionality if "Allow synthetic data in Validation" is not selected, even if you apply a split like 80/20 or 70/30.
6

(Optional) Further Enrich Using Textual Documents

You can optionally upload additional textual sources, such as PDFs, TXT files, or HTML pages, to provide contextual grounding for enrichment. These documents (e.g. product manuals, help center articles, policy pages) are used alongside your seed datapoints to generate more realistic and context-aware synthetic QA pairs.
When documents are provided, the enrichment engine combines both seed examples and content from your uploaded files to create new, high-quality datapoints that better reflect your domain language and topics.
This step is useful if you have internal documentation or unstructured content that the model can learn from.
Once enrichment is complete, your dataset will contain both original and synthetic entries, ready to be used for model fine-tuning.

📊 Example Before and After Enrichment

| Type      | Question                                     | Answer                                                                    |
|-----------|----------------------------------------------|---------------------------------------------------------------------------|
| Original  | How can I track my order?                    | You can track it using the link in your confirmation email.               |
| Synthetic | Where do I check the status of my shipment?  | Use the tracking link in the confirmation email we sent you.              |
| Synthetic | Can I know where my package is?              | Yes, the tracking link in your confirmation email shows real-time updates. |

📊 Dataset Size: Before vs After

Started with 50 original examples.
  • Split: 25 training / 25 validation
  • After enrichment: +500 synthetic datapoints → 550 total (525 in training + 25 in validation)
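The arithmetic behind these numbers can be checked with a short sketch. It assumes that all 500 generated examples are approved and that every one of them goes to the Training bucket; the variable names are illustrative only.

```python
original_total = 50
train_fraction = 0.5
synthetic_added = 500  # assumes every generated example is approved

train_original = int(original_total * train_fraction)  # 25
validation = original_total - train_original           # 25
train_after = train_original + synthetic_added         # 525
total_after = train_after + validation                 # 550

# Share of the training bucket that is synthetic after enrichment.
synthetic_share = synthetic_added / train_after

print(train_after, validation, total_after)  # 525 25 550
print(f"{synthetic_share:.1%}")              # 95.2%
```

Under these assumptions, roughly 95% of the training bucket is synthetic, which is one reason to keep the validation set made of original examples.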

📦 What's Next?

With your enriched dataset, you can now move on to model fine-tuning.

💡 Pro Tips

  • Always enrich after splitting to avoid data leakage.
  • Use instructions to control output tone, complexity, topic, or QA structure.
  • Review synthetic data for consistency: quality > quantity.
  • Avoid over-relying on synthetic examples for evaluation.
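
The first and last tips can be turned into a quick sanity check. The following sketch assumes each bucket is available locally as a list of dictionaries with `question` and `type` fields. Both the layout and the helper names are hypothetical and not a Prem Studio export format, and the sample rows are illustrative.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return " ".join(text.lower().split())


def find_leaks(training, validation):
    """Return validation questions that also occur in training after normalization."""
    seen = {normalize(ex["question"]) for ex in training}
    return [ex["question"] for ex in validation if normalize(ex["question"]) in seen]


def synthetic_count(rows):
    """Count rows marked as synthetic via the (assumed) `type` field."""
    return sum(1 for ex in rows if ex.get("type") == "synthetic")


training = [
    {"type": "original", "question": "How can I track my order?"},
    {"type": "synthetic", "question": "Where do I check the status of my shipment?"},
]
validation = [
    {"type": "original", "question": "how can I  track my order?"},
    {"type": "original", "question": "What is your return policy?"},
]

print(find_leaks(training, validation))  # ['how can I  track my order?']
print(synthetic_count(validation))       # 0
```

This only catches exact matches after normalization. Paraphrased leaks need a similarity measure or a manual review.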