Creating a Dataset
In Prem, you can build datasets in two ways:- Upload an existing dataset in
JSONL
format. - Generate synthetic datasets directly from different input sources such as files, YouTube videos, websites, or a mix of sources.
Uploading a Dataset
1
Create a Dataset
Click the + Create Dataset button in the top-right corner of the page.

2
Upload a File
- Enter a name for your dataset.
- Upload your
JSONL
file. - Click Confirm to create the dataset.

Refer to our Datasets documentation for more information on the
JSONL
file format.Generate a Dataset with Synthetic Data Generation
Synthetic data generation lets you create datasets from various input sources beyond JSONL. You can import documents, scrape websites, process YouTube videos, or combine multiple sources into one dataset.1
Step 1: Define Dataset and Sources
- Enter a descriptive dataset name.
- Choose one or more data sources:
- Files Only: PDF, DOCX, TXT, HTML, PPTX
- YouTube Videos: individual videos or playlists
- Web Scraping: one or more website URLs
- Mixed Sources: combine multiple input types
- Set the number of QA pairs to generate from each source.

2
Step 2: Set Optional Guidance (Optional)
Get control of the generation process by setting additional parameters:
- Rules & Constraints β add conditions for the generated content (e.g., enforce style, define tone, restrict scope).
- QA Guidance β provide example pairs or specify output formats.
- Creativity Level β adjust the modelβs temperature to balance consistency vs. variety.

3
Step 3: Review and Create Dataset
Review a recap of your settings from the previous steps.
Click Create Dataset to start the generation process.

Options for Synthetic Data Generation
When generating synthetic datasets, you can configure:- Data Sources β files, YouTube videos, websites, or a mix.
- Synthetic Pairs Configuration β number of QA pairs per input.
- Rules & Constraints β optional rules to shape the outputs.
- QA Guidance β add examples or output specifications.
- Creativity Level β control the randomness of the generation.