Creating a Dataset

In Prem, you can build datasets in two ways:
  • Upload an existing dataset in JSONL format.
  • Generate synthetic datasets directly from different input sources such as files, YouTube videos, websites, or a mix of sources.

Uploading a Dataset

Step 1: Create a Dataset

Click the + Create Dataset button in the top-right corner of the page.
GIF of clicking the create a dataset button

Step 2: Upload a File

  • Enter a name for your dataset.
  • Upload your JSONL file.
  • Click Confirm to create the dataset.
GIF of uploading a file
Refer to our Datasets documentation for more information on the JSONL file format.
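
Before uploading, it can help to confirm that the file is well-formed JSONL, meaning one JSON object per line. The sketch below is a generic check and not part of Prem: the function name `validate_jsonl` is hypothetical, and it does not verify the fields Prem expects in each record, which are described in the Datasets documentation.

```python
import json
from pathlib import Path


def validate_jsonl(path):
    """Return a list of (line_number, message) problems found in a JSONL file.

    Hypothetical helper: it only checks that every line is a JSON object.
    It does not check any Prem-specific schema.
    """
    problems = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                # Strict choice: blank lines are reported rather than skipped.
                problems.append((line_number, "blank line"))
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_number, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((line_number, "record is not a JSON object"))
    return problems
```

An empty list means every line parsed as a JSON object. Otherwise, each entry gives the line to fix before uploading.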

Generating a Dataset with Synthetic Data

Synthetic data generation lets you create datasets from various input sources beyond JSONL. You can import documents, scrape websites, process YouTube videos, or combine multiple sources into one dataset.

Step 1: Define Dataset and Sources

  • Enter a descriptive dataset name.
  • Choose one or more data sources:
    • Files Only: PDF, DOCX, TXT, HTML, PPTX
    • YouTube Videos: individual videos or playlists
    • Web Scraping: one or more website URLs
    • Mixed Sources: combine multiple input types
  • Set the number of QA pairs to generate from each source.
Synthetic Dataset Generation - Input sources selection
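
When preparing a mixed set of inputs, a quick local pre-check can sort them into the source types listed above. This is a minimal sketch, not Prem functionality: `classify_source` is a hypothetical helper, the file extensions come from the list above, and the YouTube host names are an assumption about which URLs count as YouTube sources.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# File types taken from the "Files Only" list above.
SUPPORTED_FILE_EXTENSIONS = {".pdf", ".docx", ".txt", ".html", ".pptx"}
# Assumption: these hosts identify YouTube videos or playlists.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def classify_source(source):
    """Label an input as 'file', 'youtube', 'website', or 'unsupported'."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        host = (parsed.hostname or "").lower()
        return "youtube" if host in YOUTUBE_HOSTS else "website"
    # Anything that is not an http(s) URL is treated as a file name.
    extension = PurePosixPath(source).suffix.lower()
    if extension in SUPPORTED_FILE_EXTENSIONS:
        return "file"
    return "unsupported"


for item in ["handbook.pdf", "https://youtu.be/abc123", "https://example.com/faq"]:
    print(item, "->", classify_source(item))
```

Inputs labeled `unsupported` would need converting to one of the listed file types before they can be added.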

Step 2: Set Optional Guidance

Control the generation process by setting additional parameters:
  • Rules & Constraints – add conditions for the generated content (e.g., enforce style, define tone, restrict scope).
  • QA Guidance – provide example pairs or specify output formats.
  • Creativity Level – adjust the model’s temperature to balance consistency vs. variety.
Synthetic Dataset Generation - Define advanced settings
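
To see why temperature trades consistency for variety, consider how it rescales a model's scores before sampling. The sketch below shows the general softmax-with-temperature mechanism using made-up scores. How Prem maps the Creativity Level setting onto a temperature value is not specified here, so treat the numbers as illustrative only.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for temperature in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Low temperatures concentrate almost all probability on the top-scoring candidate, which gives more consistent output. Higher temperatures spread probability across candidates, which gives more varied output.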

Step 3: Review and Create Dataset

Review a recap of your settings from the previous steps. Click Create Dataset to start the generation process.
Synthetic Dataset Generation - Summary

Options for Synthetic Data Generation

When generating synthetic datasets, you can configure:
  • Data Sources – files, YouTube videos, websites, or a mix.
  • Synthetic Pairs Configuration – number of QA pairs per input.
  • Rules & Constraints – optional rules to shape the outputs.
  • QA Guidance – add examples or output specifications.
  • Creativity Level – control the randomness of the generation.
For deeper explanations and advanced usage, see our user guide on how to Generate Dataset from Text Files.
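
The options above can be written down as a small planning structure before filling in the form. This is a hypothetical sketch for organizing your own settings, not a Prem API: the class name, field names, and validation rules are all assumptions, and the pair estimate assumes every source yields the full requested number of QA pairs.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SyntheticDatasetPlan:
    """Hypothetical record of the settings chosen in the three steps."""

    name: str
    sources: List[str]
    qa_pairs_per_source: int
    rules: List[str] = field(default_factory=list)        # Rules & Constraints
    qa_examples: List[str] = field(default_factory=list)  # QA Guidance
    creativity: Optional[float] = None                    # scale not specified in the docs

    def validate(self):
        """Raise ValueError if a required setting is missing."""
        if not self.name.strip():
            raise ValueError("dataset name is required")
        if not self.sources:
            raise ValueError("at least one data source is required")
        if self.qa_pairs_per_source < 1:
            raise ValueError("qa_pairs_per_source must be at least 1")

    def expected_pairs(self):
        """Estimate the dataset size, assuming each source yields the full count."""
        return len(self.sources) * self.qa_pairs_per_source


plan = SyntheticDatasetPlan(
    name="support-faq",
    sources=["handbook.pdf", "https://example.com/faq"],
    qa_pairs_per_source=25,
    rules=["Answer in a formal tone"],
)
plan.validate()
print(plan.expected_pairs())
```

Because the QA pair count applies to each source, adding sources raises the expected dataset size proportionally.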

Next Steps