🧠 Why This Matters
Creating high-quality QA datasets is often a bottleneck. With synthetic QA generation, you can:
- Convert any text source into structured QA pairs.
- Quickly scale your dataset with minimal manual effort.
- Prepare reliable data for fine-tuning or evaluating small and large language models (SLMs/LLMs).
Example use case - Structured Output from Receipts
In this guide, we will synthetically generate QA pairs from a set of receipts. We'll use 20 invoice receipts from this dataset. Each file will generate one synthetic QA pair, producing a structured JSON output that extracts key fields (date, total amount, business name, etc.). This type of dataset is crucial for fine-tuning models that need to reliably extract structured data from semi-structured sources (like receipts, tickets, or invoices).
⚙️ Step-by-Step: Using QA Generation in Prem Studio
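To make the target concrete, here is a minimal sketch of what one QA pair for a receipt might look like. The field names (`business_name`, `date`, `total_amount`), the question wording, and the sample values are illustrative assumptions, not the exact format Prem Studio emits.

```python
import json

# Hypothetical QA pair for one receipt; keys and values are illustrative only.
qa_pair = {
    "question": "Extract the business name, date, and total amount from this receipt as JSON.",
    "answer": json.dumps(
        {
            "business_name": "Example Coffee Shop",
            "date": "2024-01-15",
            "total_amount": 12.50,
        }
    ),
}

# The answer is itself a JSON string, so it can be parsed back into fields.
fields = json.loads(qa_pair["answer"])
print(sorted(fields))
```

Keeping the answer as a single JSON object makes it straightforward to check each field programmatically later on.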
1
Create a New Dataset

2
Upload Your Input Files

3
Configure Advanced Settings (Optional but Recommended)

- Rules & Constraints: enforce requirements.
- QA Examples (up to 3): provide few-shot examples to guide generation.
- Question/Answer Guidance: define QA formats.
- Creativity Level: adjust diversity of generated QAs.
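One way to prepare these settings is to draft them as plain text before entering them in the UI. The sketch below is an illustration under stated assumptions: the rule wording, the field names, and the `make_example` helper are hypothetical, and only the limit of three QA examples comes from the settings listed above.

```python
import json

MAX_QA_EXAMPLES = 3  # limit stated in the advanced settings

# Hypothetical rules, written as plain sentences.
rules = [
    "Answer with a single JSON object and no extra text.",
    "Use ISO 8601 dates (YYYY-MM-DD).",
    "Use null for any field that is missing from the receipt.",
]

def make_example(business_name, date, total_amount):
    """Build one few-shot QA example with a JSON answer (hypothetical helper)."""
    answer = {
        "business_name": business_name,
        "date": date,
        "total_amount": total_amount,
    }
    return {
        "question": "Extract the key fields from this receipt as JSON.",
        "answer": json.dumps(answer),
    }

qa_examples = [
    make_example("Example Coffee Shop", "2024-01-15", 12.50),
    make_example("Sample Hardware Store", "2024-02-03", 48.99),
    make_example("Demo Bookstore", None, 23.00),  # missing date becomes null
]

assert len(qa_examples) <= MAX_QA_EXAMPLES
for example in qa_examples:
    print(example["question"])
    print(example["answer"])
```

Including one example with a missing field shows the generator how gaps should be represented, which tends to matter for semi-structured sources such as receipts.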
4
Review & Generate
Before starting, you'll see a recap of your setup:
- Data sources (your input files).
- Generation settings (QA per file, creativity level, etc.).
- Estimated cost.
Click Create dataset to start the process.

5
Review Your Synthetic Dataset
Once generated, preview your QAs directly in Prem Studio.
From the UI, you can also edit or delete datapoints to refine the dataset before using it.

📦 What's Next?
You can now:
- Preview and edit the generated data
- Fine-tune your model with it (see our fine-tuning guide)
- Evaluate models against it (see our evaluation guide)
💡 Pro Tips
- Always define advanced settings if you expect strict outputs.
- Start small (10–20 docs) to validate your setup, then scale up.
- Use domain-specific schemas (invoices, medical records, support tickets, etc.).
- Keep creativity low when extracting structured data.
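The tips about strict outputs and starting small can be combined into a quick check on a first batch. The sketch below assumes the generated answers are available as JSON strings and that the schema has the three fields shown; both are assumptions for illustration, and `validate_answer` is a hypothetical helper rather than part of Prem Studio.

```python
import json
from datetime import date

# Hypothetical schema: field name -> accepted Python types.
REQUIRED_FIELDS = {
    "business_name": (str,),
    "date": (str,),
    "total_amount": (int, float),
}

def validate_answer(raw_answer):
    """Return a list of problems found in one JSON answer string."""
    try:
        data = json.loads(raw_answer)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["not a JSON object"]
    problems = []
    for field, types in REQUIRED_FIELDS.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif isinstance(data[field], bool) or not isinstance(data[field], types):
            problems.append(f"wrong type for field: {field}")
    # Only check the date format when a string date is present.
    if isinstance(data.get("date"), str):
        try:
            date.fromisoformat(data["date"])
        except ValueError:
            problems.append("date is not an ISO 8601 date")
    return problems

# Illustrative answers, not real generated data.
answers = [
    '{"business_name": "Example Coffee Shop", "date": "2024-01-15", "total_amount": 12.5}',
    '{"business_name": "Sample Hardware Store", "date": "15/01/2024"}',
    "total: 12.50",
]
for raw in answers:
    print(validate_answer(raw))
```

Running a check like this on the first 10–20 documents gives an early signal on whether the rules and examples need tightening before scaling up.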