Introduction
premsql provides a simple API for using various pre-processed datasets for Text-to-SQL tasks. Text-to-SQL is complex because it depends on databases and tables at every step. The premsql datasets help streamline this by providing easy access to existing datasets and enabling you to create your own datasets with private databases.
Available Datasets
Currently, a number of popular datasets are readily available out of the box. Let's see how to use them with premsql.
Loading Datasets
To get started, you'll use the Text2SQLDataset class from the premsql package. Here's an example using the BirdBench dataset:
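A minimal sketch of loading BirdBench; the dataset_folder path is illustrative, and premsql downloads the data there on first use:

```python
from premsql.datasets import Text2SQLDataset

# Load the BirdBench training split. The data is downloaded into
# `dataset_folder` on first use; the path here is illustrative.
bird_train = Text2SQLDataset(
    dataset_name="bird",
    split="train",
    force_download=False,
    dataset_folder="./data",
)
```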
BirdBench is a popular, recent Text-to-SQL benchmarking dataset. It comes with around 9K training samples and 1,534 validation samples, comprising complex and diverse queries over real-world databases. It also includes a private test set that is used to evaluate Text-to-SQL models and approaches from the community.
Accessing Raw Data
The dataset object provides two key methods (see the sketch after this list):

- raw_dataset: Returns a dictionary with raw data from the JSON file.
- filters_available: Lists the filters available for the dataset.
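For example, to peek at the raw records (a sketch; bird_train is the dataset object created above):

```python
# Inspect a single raw data point exactly as it appears in the JSON file
raw = bird_train.raw_dataset
print(raw[0])
```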
Available Filters
You can view the available filters for the dataset:
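A quick sketch (the printed value is illustrative):

```python
print(bird_train.filters_available)
# e.g. ['db_id'] for the BirdBench training split
```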
Here, the available filter is db_id. PremSQL automatically detects which filters are available for a dataset and provides them as a list.
Setting Up the Dataset
To load the processed dataset, use the setup_dataset method. This will process and return the dataset object. The method has several optional parameters, and a typical call is sketched after the list:
- filter_by: Filters the dataset based on the provided filter.
- num_rows: Limits the number of rows to return.
- num_fewshot: Defines how many few-shot examples to include in the prompt.
- model_name_or_path: Applies the model-specific prompt template and tokenizes the dataset if provided.
- prompt_template: Custom prompt template; defaults are available in the premsql prompts module.
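A sketch of a typical call (the model id is illustrative; any Hugging Face model path should work):

```python
bird_train_processed = bird_train.setup_dataset(
    num_rows=100,                                # keep only the first 100 rows
    num_fewshot=3,                               # 3 few-shot examples per prompt
    model_name_or_path="premai-io/prem-1B-SQL",  # illustrative model id
)
print(bird_train_processed[0])
```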
The difference between loading and setting up a dataset is that loading downloads the dataset or reads the raw data from a folder, while setting up processes the dataset (inserting schemas, adding few-shot prompts, applying customizations, filtering, etc.) based on the user's requirements.
Preview Without Tokenization
Tokenization can sometimes be computationally heavy. To preview the dataset without tokenizing it, set model_name_or_path to None.
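For example:

```python
preview = bird_train.setup_dataset(
    num_rows=3,
    num_fewshot=1,
    model_name_or_path=None,  # skip tokenization, return plain prompts
)
print(preview[0])
```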
Filtering Datasets
We are now loading the BirdBench validation dataset, and as you can see, it has two types of filters available: db_id and difficulty. You can filter datasets based on whichever criteria are available.
Example of filtering by difficulty:
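A sketch, assuming filter_by accepts a (filter_name, filter_value) tuple and that "challenging" is one of BirdBench's difficulty values:

```python
bird_validation = Text2SQLDataset(
    dataset_name="bird",
    split="validation",
    dataset_folder="./data",
).setup_dataset(
    filter_by=("difficulty", "challenging"),  # keep only challenging queries
    num_rows=10,
)
```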
Using Other Available Datasets
As mentioned earlier, PremSQL provides access to some of the best readily available open-source datasets. Let's load the other datasets and see what they look like:
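A sketch; the dataset_name values below ("spider" and "domains") are assumptions based on the open-source datasets premsql ships with, so check the premsql docs for the exact names:

```python
spider_train = Text2SQLDataset(
    dataset_name="spider", split="train", dataset_folder="./data"
).setup_dataset(num_rows=2, num_fewshot=1)

domains_train = Text2SQLDataset(
    dataset_name="domains", split="train", dataset_folder="./data"
).setup_dataset(num_rows=2, num_fewshot=1)
```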
Dataset Merging
premsql datasets support merging, allowing you to pack multiple datasets together for combined use. This is useful when you are training on multiple datasets or want to run validation on a combination of datasets.
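Since setup_dataset returns a list-like sequence of processed examples, merging can be as simple as concatenation (a sketch based on the objects created above):

```python
# Pack several processed datasets into one combined sequence
merged_dataset = [*bird_train_processed, *spider_train, *domains_train]
print(len(merged_dataset))
```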
Understanding Prompts in PremSQL
Here's how a prompt looks when it is wrapped in a model's prompt template:
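A sketch for inspecting one wrapped prompt (assuming each processed row exposes a prompt field; inspect the row's keys() if yours differs):

```python
sample = bird_train.setup_dataset(
    num_rows=1,
    num_fewshot=1,
    model_name_or_path="premai-io/prem-1B-SQL",  # illustrative model id
)
print(sample[0]["prompt"])  # assumed key holding the rendered prompt
```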
Notice that each data point carries the keys db_id, question, and SQL (all case-sensitive), and that the dataset follows the standardization used across PremSQL. Let's take a closer look at that standardization and how you can create your own dataset.
Creating Your Own Dataset
If you have annotations for your Text-to-SQL task on your private data, you can create your own dataset. When it comes to tasks like Text-to-SQL, we face database dependencies at every step. For example, to build a Text-to-SQL dataset, you need a module that can connect to databases, fetch the schemas, and apply them to the prompt. That's why we have introduced a more standard approach to providing inputs, which is used all across PremSQL. If you use it for your private databases, you can extend and customize it for your data and your use case. However, you strictly need to organize your files in the following structure:
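An illustrative layout (all names except databases, train.json, and validation.json are placeholders):

```
dataset_folder/
├── databases/
│   ├── california_schools/
│   │   └── california_schools.sqlite
│   └── company_sales/
│       └── company_sales.sqlite
├── train.json
└── validation.json
```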
Each sub-folder under databases should contain a .sqlite file matching the sub-folder name. Your train or validation JSON file should contain a list of dictionaries with these required keys (an example entry follows the list):
- db_id: Corresponds to the folder and .sqlite file name.
- question: The user question.
- SQL: The ground truth SQL query.
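For example, a single entry in train.json might look like this (values are illustrative):

```json
[
  {
    "db_id": "california_schools",
    "question": "How many schools are listed in total?",
    "SQL": "SELECT COUNT(*) FROM schools;"
  }
]
```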
Advanced Customization for Text-to-SQL Datasets
So far, all the examples shown are tailored to .sqlite databases. However, you might encounter scenarios where:
- You are working with different databases, such as PostgreSQL or other cloud-based database instances.
- You want to implement custom logic before generating prompts.
- You need to extend the utility of premsql to fit specific requirements.
Step 1: Define a Custom Dataset Instance
A dataset instance manages operations on individual data points. To create a custom dataset instance, extend the premsql.datasets.base.Text2SQLBaseInstance class. Below is a blueprint of how to define your custom instance:
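A hedged blueprint: the class and helper names here are hypothetical, so check Text2SQLBaseInstance in the premsql source for the exact method signatures:

```python
from premsql.datasets.base import Text2SQLBaseInstance

class MyCustomInstance(Text2SQLBaseInstance):
    # Hypothetical instance for a non-SQLite backend: only
    # schema_prompt is overridden; additional_prompt and
    # apply_prompt keep their database-agnostic defaults.
    def schema_prompt(self, db_path: str) -> str:
        # Connect to your database, fetch the table schemas, and
        # render them as text for the prompt.
        schemas = self.fetch_schemas(db_path)  # hypothetical helper
        return "\n\n".join(schemas)

    def fetch_schemas(self, db_path: str) -> list[str]:
        # Placeholder for your own connection logic
        # (e.g. PostgreSQL, a cloud database instance).
        raise NotImplementedError("plug in your database-specific logic")
```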
Besides schema_prompt, this class also includes other methods like additional_prompt and apply_prompt. These methods have default implementations that are database-agnostic, but you can customize them as needed.
Step 2: Define a Custom Dataset Class
Once you have your custom instance defined, you can create a custom dataset class by extending the premsql.datasets.base.Text2SQLBaseDataset class. This class handles the overall dataset setup and management. Here's an example:
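A hedged sketch; the constructor arguments are assumptions, so mirror the actual Text2SQLBaseDataset signature from the premsql source:

```python
from premsql.datasets.base import Text2SQLBaseDataset

class MyCustomDataset(Text2SQLBaseDataset):
    def __init__(self, split: str, dataset_folder: str, **kwargs):
        # Dataset-specific initialization (e.g. connection settings)
        # goes here before delegating to the base class.
        super().__init__(split=split, dataset_folder=dataset_folder, **kwargs)

    def setup_dataset(self, **kwargs):
        # Inject custom preprocessing (schemas, few-shot prompts,
        # filtering) before or after the base implementation runs.
        return super().setup_dataset(**kwargs)
```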
In the __init__ method, you can define any initialization logic specific to your dataset. Similarly, the setup_dataset method can be modified to include any custom setup steps or logic.
Summary
By following these steps, you can extend premsql to support different databases, implement custom preprocessing logic, or tailor the dataset setup to your specific needs. For a complete understanding, review the Text2SQLBaseDataset and Text2SQLBaseInstance classes in the premsql source code. We will be releasing a detailed tutorial soon on how to create datasets for different types of databases beyond SQLite.