🧠 Why This Matters

Creating high-quality QA datasets is often a bottleneck. With synthetic QA generation, you can:
  • Convert any text source into structured QA pairs.
  • Quickly scale your dataset with minimal manual effort.
  • Prepare reliable data for fine-tuning or evaluating small and large language models (SLMs/LLMs).

Example use case: Structured Output from Receipts

In this guide, we will see how to synthetically generate QA pairs from a set of receipts. We’ll use 20 invoice receipts from this dataset. Each file will generate one synthetic QA pair, producing a structured JSON output that extracts key fields (date, total amount, business name, etc.). This type of dataset is crucial for fine-tuning models that need to reliably extract structured data from semi-structured sources such as receipts, tickets, or invoices.
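To make the target concrete, below is a hypothetical sketch (written as a Python dict) of what a single generated datapoint could look like once the formats defined in step 3 are applied. The receipt text and field values are placeholders for illustration, not taken from the dataset.

qa_pair = {
    # Question: the extracted receipt text followed by the extraction instruction.
    "question": (
        "Corner Cafe\n123 Main St, Springfield\n2024-01-05 09:41:00\nTotal: 12.50 USD\n\n"
        "Task: Extract all the information available in the text and present it "
        "in the JSON format below. Do not infer or invent details; only include "
        "what is explicitly stated."
    ),
    # Answer: the filled-in JSON object, and nothing else.
    "answer": (
        '{"DateTime": "2024-01-05 09:41:00", "Total Amount": "12.50", '
        '"Currency": "USD", "Business Name": "Corner Cafe", '
        '"Business Location": "Springfield"}'
    ),
}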

βš™οΈ Step-by-Step: Using QA Generation in Prem Studio

1

Create a New Dataset

GIF of navigating to synthetic data
From the left sidebar, go to Datasets → + Create dataset → Synthetic Data.
2

Upload Your Input Files

GIF of uploading input files
Drag and drop your text files (PDF, DOCX, TXT, HTML). 📌 In this example, we use 20 invoice receipts.
3

Configure Advanced Settings (Optional but Recommended)

Advanced Settings
This step lets you steer the QA generation toward your expected output. It includes:
  • Rules & Constraints → enforce requirements the generated QAs must follow.
  • QA Examples (up to 3) → provide few-shot examples to guide generation.
  • Question/Answer Guidance → define the expected question and answer formats.
  • Creativity Level → adjust the diversity of generated QAs.
For our receipts example:
Rules & Constraints:
- Always include the full extracted text in the question.
- Clearly instruct extraction into the given JSON schema.
- Only output the JSON object (no extra text).
- Fill fields only if explicitly present in the text.
- If a field is missing, leave it empty but keep the key.
- Strictly follow the schema.
- Never infer, guess, or fabricate information.
Question Format
{EXTRACTED_TEXT}
Task: Extract all the information available in the text and present it in the JSON format below.
Do not infer or invent details; only include what is explicitly stated.
JSON Schema:
{
    "DateTime": "YYYY-MM-DD HH:MM:SS",
    "Total Amount": "number",
    "Currency": "string",
    "Business Name": "string",
    "Business Location": "string"
}
Answer Format
{
    "DateTime": "<transaction_datetime_if_present>",
    "Total Amount": "<total_amount_if_present>",
    "Currency": "<currency_code_if_present>",
    "Business Name": "<business_name_if_present>",
    "Business Location": "<city_state_country_if_present>"
}
Creativity: set to 0
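Because the rules demand a strict, JSON-only answer, it can help to sanity-check generated answers programmatically before relying on them. Below is a minimal Python sketch, not a Prem Studio feature, that checks whether a single answer string parses as JSON and contains exactly the five schema keys; the answer_text argument is a stand-in for one generated answer.

import json

# The exact keys required by the schema defined above.
EXPECTED_KEYS = {
    "DateTime",
    "Total Amount",
    "Currency",
    "Business Name",
    "Business Location",
}

def is_valid_answer(answer_text: str) -> bool:
    """Return True if the answer is a JSON object with exactly the schema keys."""
    try:
        parsed = json.loads(answer_text)
    except json.JSONDecodeError:
        # The model wrapped the JSON in extra text, violating the rules.
        return False
    return isinstance(parsed, dict) and set(parsed.keys()) == EXPECTED_KEYS

# Example: a schema-compliant answer with an empty (missing) field still passes.
print(is_valid_answer(
    '{"DateTime": "", "Total Amount": "12.50", "Currency": "USD", '
    '"Business Name": "Corner Cafe", "Business Location": ""}'
))  # True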
4

Review & Generate

Before starting, you’ll see a recap of your setup:
  • Data sources (your input files).
  • Generation settings (QA pairs per file, creativity level, etc.).
  • Estimated cost.
Click Create dataset to start the process.
Recap Synthetic Data Generation Task
5

Review Your Synthetic Dataset

Once generated, preview your QAs directly in Prem Studio. From the UI, you can also edit or delete datapoints to refine the dataset before using it.
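If you export the dataset for further processing, you can also screen for values the model may have invented, which the rules above forbid. The sketch below assumes a JSONL export in which each line carries question and answer fields, with the answer being the JSON object from the schema; that export shape and the file name are assumptions for illustration, not a documented Prem Studio format.

import json

def flag_possible_fabrications(jsonl_path: str) -> list[dict]:
    """Flag datapoints whose non-empty answer values do not appear in the question text.

    Assumes each JSONL line looks like {"question": "...", "answer": "..."},
    where the answer is the JSON object defined by the schema above.
    """
    flagged = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            record = json.loads(line)
            question = record["question"]
            answer = json.loads(record["answer"])
            suspect = [
                key for key, value in answer.items()
                if value and str(value) not in question
            ]
            if suspect:
                flagged.append({"line": line_no, "suspect_fields": suspect})
    return flagged

# Hypothetical usage:
# print(flag_possible_fabrications("receipts_qa.jsonl"))

Fields that were legitimately reformatted by the model (for example, a date normalized into YYYY-MM-DD HH:MM:SS) will also be caught by this substring check, so treat flagged datapoints as candidates for manual review or editing in the UI rather than automatic deletion.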

📦 What’s Next?

You can now:
  • Fine-tune a model on your new synthetic dataset.
  • Use the dataset to evaluate existing models.

💡 Pro Tips

  • Always define advanced settings if you expect strict outputs.
  • Start small (10–20 docs) to validate your setup, then scale up.
  • Use domain-specific schemas (invoices, medical records, support tickets, etc.).
  • Keep creativity low when extracting structured data.