Creating a Dataset

1

Click + Create To Create a Dataset

GIF of clicking the create a dataset buttonTo create a dataset, click the + Create Dataset button in the top right corner of the page.you have some options here:
  • Upload a JSONL file with conversation datapoints
  • Generate synthetic data from text and pdf files
2

Upload a File

GIF of uploading a file
  • Choose a name for your dataset.
  • Write a system prompt for your dataset.
  • Upload a JSONL file.
  • Click Confirm to create the dataset.
Refer to our Datasets documentation for more information on the JSONL file format.
When you upload a JSONL file, Prem will automatically create a system prompt for you that you can edit.

Upload with Synthetic Data Generation

You may have unstructured data stored in different file types that are not in the JSONL format. This is where our synthetic data generation comes in. You can upload these files to Prem and generate synthetic data based on your files. You can import the following files:
  • PDF files
  • TXT files
  • DOC files
  • DOCX files
  • HTML files
  • PPT files
GIF of uploading a file with synthetic data generation
1

Choose a name for your dataset

Enter a descriptive name for your dataset that will help you identify it later.
2

Write a system prompt for your dataset

Create a system prompt that will guide the synthetic data generation process.
3

Upload your file

Upload your file (PDF, TXT, DOC, DOCX, HTML, or PPT) that will be used to generate synthetic data.
4

Click Confirm

Review your settings and click Confirm to start the synthetic data generation process.

Options for Synthetic Data Generation

Types of Pairs to Generate

QA: Generate question-answer pairs from your content. You can create multiple pairs per file. Summary: Generate one summary per file. The number of synthetic pairs equals the number of files uploaded.

New Pairs to Generate

For QA pairs, you can generate multiple question-answer pairs per file. The minimum is set to the number of files uploaded, and you can increase it in increments based on your file count.

User Instructions (optional)

You can add user instructions to the synthetic data generation process. This will be added to the system prompt for each pair.

Next Steps