Learn how to use and customize premsql datasets for Text-to-SQL tasks, including working with available datasets, creating your own, and extending functionalities.
`premsql` provides a simple API for working with various pre-processed Text-to-SQL datasets. Text-to-SQL is complex because it depends on databases and tables; the `premsql` datasets streamline this by giving you easy access to existing datasets and letting you create your own from private databases.
You can load a dataset using the `Text2SQLDataset` class from the `premsql` package. Here's an example using the BirdBench dataset:
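The original code block did not survive extraction; a minimal sketch of loading BirdBench might look like the following. The import path and constructor arguments here are assumptions inferred from the surrounding text, so check the premsql API reference before use:

```python
# Sketch only: the import path and constructor arguments are assumptions,
# not confirmed premsql API.
from premsql.datasets import Text2SQLDataset

bird_dataset = Text2SQLDataset(
    dataset_name="bird",                # the BirdBench dataset
    split="train",                      # or "validation"
    dataset_folder="/path/to/dataset",  # where the data is downloaded/cached
)
```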
The dataset object exposes two useful properties:

- `raw_dataset`: returns a dictionary with the raw data from the JSON file.
- `filters_available`: lists the filters available for the dataset.

For BirdBench, the available filters include `db_id`. PremSQL automatically detects which filters are available for a dataset and provides them as a list.
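The detection idea can be illustrated with plain Python: a field is usable as a filter when it appears in every record. This is a simplified stand-in, not premsql's actual implementation:

```python
# Simplified stand-in for filter detection: a key counts as filterable when
# it is present in every record and is not one of the core text fields.
def filters_available(records):
    if not records:
        return []
    common = set(records[0])
    for record in records[1:]:
        common &= set(record)
    return sorted(common - {"question", "SQL"})

records = [
    {"db_id": "california_schools", "difficulty": "simple",
     "question": "How many schools are there?",
     "SQL": "SELECT COUNT(*) FROM schools"},
    {"db_id": "card_games", "difficulty": "challenging",
     "question": "List all cards", "SQL": "SELECT name FROM cards"},
]

print(filters_available(records))  # -> ['db_id', 'difficulty']
```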
To process a dataset, call the `setup_dataset` method, which processes and returns the dataset object. It accepts several optional parameters:

- `filter_by`: filters the dataset based on the provided filter.
- `num_rows`: limits the number of rows to return.
- `num_fewshot`: defines how many few-shot examples to include in the prompt.
- `model_name_or_path`: if provided, applies the model-specific prompt template and tokenizes the dataset.
- `prompt_template`: a custom prompt template; defaults are available in the `premsql` prompts module.
The difference between *loading* and *setting up* a dataset: loading downloads the dataset or reads the raw data from a folder, while setting up processes it (inserting schemas, adding few-shot prompts, applying customizations, filtering, etc.) according to the user's requirements. If you do not want the dataset tokenized, set `model_name_or_path` to `None`.
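To make the load/setup distinction concrete, here is a self-contained sketch of what a setup step does to already-loaded records. It is illustrative only; premsql's real implementation also inserts schemas, adds few-shot examples, and can tokenize:

```python
def setup_records(records, filter_by=None, num_rows=None,
                  prompt_template="Question: {question}\nSQL:"):
    """Simplified stand-in for dataset setup: filter, truncate, build prompts."""
    if filter_by is not None:
        key, value = filter_by  # e.g. ("difficulty", "simple")
        records = [r for r in records if r.get(key) == value]
    if num_rows is not None:
        records = records[:num_rows]
    return [
        {**r, "prompt": prompt_template.format(question=r["question"])}
        for r in records
    ]

records = [
    {"db_id": "a", "difficulty": "simple", "question": "q1", "SQL": "s1"},
    {"db_id": "b", "difficulty": "challenging", "question": "q2", "SQL": "s2"},
]
processed = setup_records(records, filter_by=("difficulty", "simple"))
print(processed[0]["prompt"])  # prompt built for the single remaining record
```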
You can filter datasets based on the available criteria, such as `db_id` or `difficulty`. Here is an example of filtering by difficulty:
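The original example was lost in extraction; a sketch of the call might look like this. The tuple form of `filter_by` is an assumption, so verify it against the premsql documentation:

```python
# Sketch only: the exact filter_by format is an assumption, not confirmed
# premsql API. `dataset` is a loaded Text2SQLDataset instance.
validation_data = dataset.setup_dataset(
    filter_by=("difficulty", "challenging"),  # keep only "challenging" rows
    num_rows=100,                             # cap the number of returned rows
)
```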
`premsql` datasets support merging, allowing you to pack multiple datasets together for combined use. This is useful when training on multiple datasets, or when you want to run validation on a combination of datasets.
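Because every standardized dataset is ultimately a list of records with the same keys, merging can be pictured as simple concatenation. This is a stand-in illustration of the idea, not premsql's merge implementation:

```python
# Two standardized datasets: same record keys, different sources.
bird_records = [
    {"db_id": "california_schools", "question": "q1", "SQL": "s1"},
]
spider_records = [
    {"db_id": "concert_singer", "question": "q2", "SQL": "s2"},
]

# Merging amounts to concatenating the record lists.
merged = bird_records + spider_records
print(len(merged))  # -> 2
```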
Merging works because every dataset shares the keys `db_id`, `question`, and `SQL` (all case-sensitive) and follows the standardized structure used by PremSQL. Let's look at this standardization and how you can create your own dataset.
Each sub-folder inside `databases` should contain a `.sqlite` file matching the sub-folder name. Your `train` or `validation` JSON file should contain a list of dictionaries with these required keys:

- `db_id`: corresponds to the folder and `.sqlite` file name.
- `question`: the user question.
- `SQL`: the ground-truth SQL query.

By default, this setup assumes `.sqlite` databases.
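As a concrete illustration, the following stdlib-only snippet builds this layout on disk. The database name and table contents are made up for the example:

```python
import json
import os
import sqlite3
import tempfile

root = tempfile.mkdtemp()
db_id = "example_db"

# databases/<db_id>/<db_id>.sqlite : sub-folder and file share the db_id name.
db_dir = os.path.join(root, "databases", db_id)
os.makedirs(db_dir)
db_path = os.path.join(db_dir, f"{db_id}.sqlite")
with sqlite3.connect(db_path) as conn:
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# train.json : a list of dicts with the three required (case-sensitive) keys.
train = [
    {
        "db_id": db_id,
        "question": "How many users are there?",
        "SQL": "SELECT COUNT(*) FROM users",
    }
]
train_path = os.path.join(root, "train.json")
with open(train_path, "w") as f:
    json.dump(train, f)
```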
However, you might need to work with other database types or extend `premsql` to fit specific requirements. In such cases, start by extending the `premsql.datasets.base.Text2SQLBaseInstance` class. Below is a blueprint of how to define your custom instance:
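The original blueprint block did not survive extraction. Since the exact `Text2SQLBaseInstance` interface is not shown here, the sketch below uses a minimal mock base class to illustrate the override pattern; the `schema_prompt` method name comes from the surrounding text, and everything else is an assumption:

```python
# Minimal mock of premsql.datasets.base.Text2SQLBaseInstance, used only to
# illustrate the override pattern; the real base class lives in premsql.
class Text2SQLBaseInstance:
    def __init__(self, dataset):
        self.dataset = dataset  # list of record dicts

    def schema_prompt(self, db_path):
        raise NotImplementedError  # subclasses supply database-specific schemas


class MyCustomInstance(Text2SQLBaseInstance):
    def schema_prompt(self, db_path):
        # Custom logic: e.g. read the schema from a non-SQLite source.
        return f"-- schema for database at {db_path}"


instance = MyCustomInstance(dataset=[])
print(instance.schema_prompt("postgres://example"))
```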
In addition to `schema_prompt`, this class includes other methods such as `additional_prompt` and `apply_prompt`. These methods have default, database-agnostic implementations, but you can customize them as needed.
Next, extend the `premsql.datasets.base.Text2SQLBaseDataset` class, which handles the overall dataset setup and management. Here's an example:
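Again, the original example block is missing; the following sketch shows the shape such a subclass might take. The base class here is a mock standing in for premsql's real `Text2SQLBaseDataset`, so treat the method signatures as assumptions:

```python
# Mock of premsql.datasets.base.Text2SQLBaseDataset, for illustration only.
class Text2SQLBaseDataset:
    def __init__(self, split, dataset_folder):
        self.split = split
        self.dataset_folder = dataset_folder
        self.dataset = []  # the real class would load records here

    def setup_dataset(self, **kwargs):
        return self.dataset


class MyCustomDataset(Text2SQLBaseDataset):
    def __init__(self, split, dataset_folder):
        # Dataset-specific initialization, e.g. connection settings.
        super().__init__(split=split, dataset_folder=dataset_folder)

    def setup_dataset(self, num_rows=None, **kwargs):
        # Custom setup logic, e.g. truncating to num_rows.
        records = super().setup_dataset(**kwargs)
        return records[:num_rows] if num_rows is not None else records


ds = MyCustomDataset(split="train", dataset_folder="/tmp/data")
print(len(ds.setup_dataset(num_rows=5)))  # -> 0, the mock starts empty
```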
In the `__init__` method, you can define any initialization logic specific to your dataset. Similarly, the `setup_dataset` method can be modified to include custom setup steps or logic.
This approach lets you extend `premsql` to support different databases, implement custom preprocessing logic, or tailor the dataset setup to your specific needs. For a complete understanding, review the `Text2SQLBaseDataset` and `Text2SQLBaseInstance` classes in the `premsql` source code. We will be releasing a detailed tutorial soon on how to create datasets for different types of databases beyond SQLite.