Requirements
To create an error-handling dataset or prompt, the data entity must include:
- db_id: Identifier for the database.
- db_path: Path to the database.
- prompt: The original prompt used to generate the initial results.
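As a minimal sketch of the requirement above, the check below reports which of the three keys a data entity is missing. The helper name `missing_required_keys` and the sample values are illustrative assumptions and are not part of PremSQL.

```python
# Hypothetical helper, not part of PremSQL: lists the required keys
# that a data entity does not contain.
REQUIRED_KEYS = ("db_id", "db_path", "prompt")


def missing_required_keys(entity: dict) -> list:
    """Return the required keys absent from `entity`, in declaration order."""
    return [key for key in REQUIRED_KEYS if key not in entity]


# Illustrative entity; every value here is made up for the sketch.
sample = {
    "db_id": "demo_db",
    "db_path": "./data/demo_db.sqlite",
    "prompt": "Question: How many schools are there?",
}

incomplete = {"db_id": "demo_db"}
```

Calling `missing_required_keys(sample)` returns an empty list, while `missing_required_keys(incomplete)` names the two absent keys, which makes it easy to filter a dataset before running the error-handling pipeline.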
References
If you are unfamiliar with generators, executors, or datasets, you can check out the following tutorials:
Datasets Tutorial
Learn how to utilize pre-processed datasets for Text-to-SQL tasks. This tutorial covers dataset evaluation, fine-tuning, and creation of custom datasets.
Generators Tutorial
A step-by-step guide on how to use Text-to-SQL generators to create SQL queries from user input and specified database sources.
Executors Tutorial
Learn how to connect to databases and execute SQL queries generated by models. This tutorial covers execution, troubleshooting, and best practices.
Evaluators Tutorial
Understand the evaluation of Text-to-SQL models with metrics like execution accuracy and Valid Efficiency Score (VES).
Overview
For Training
- Start with an existing dataset compatible with premsql datasets.
- Use a generator to run on the dataset. The executor gathers errors for incorrect generations.
- Use the existing response, initial prompt, and the error to create new data points using an error-handling prompt.
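The three training steps above can be sketched with the standard library alone. This is a conceptual illustration, not the PremSQL implementation: `execute_sql` stands in for an executor, `fake_sql.get` stands in for a generator, and `ERROR_PROMPT` is a hypothetical template whose wording is an assumption.

```python
import os
import sqlite3
import tempfile

# Hypothetical template; the actual PremSQL error-handling prompt may differ.
ERROR_PROMPT = (
    "{prompt}\n\n"
    "# Previous SQL:\n{sql}\n\n"
    "# Execution error:\n{error}\n\n"
    "# Corrected SQL:\n"
)


def execute_sql(db_path, sql):
    """Stand-in executor: run `sql` on a SQLite file and return the error text, or None."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(sql).fetchall()
        return None
    except sqlite3.Error as exc:
        return str(exc)
    finally:
        conn.close()


def build_error_dataset(dataset, generate):
    """Run `generate` on each entity and keep only the generations that fail to execute."""
    error_points = []
    for entity in dataset:
        sql = generate(entity["prompt"])
        error = execute_sql(entity["db_path"], sql)
        if error is not None:
            error_points.append({
                "db_id": entity["db_id"],
                "db_path": entity["db_path"],
                "prompt": ERROR_PROMPT.format(prompt=entity["prompt"], sql=sql, error=error),
                "error": error,
            })
    return error_points


# Tiny demo database in a temporary directory.
db_path = os.path.join(tempfile.mkdtemp(), "demo.sqlite")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE schools (id INTEGER, name TEXT)")
conn.commit()
conn.close()

dataset = [
    {"db_id": "demo", "db_path": db_path, "prompt": "Count the schools."},
    {"db_id": "demo", "db_path": db_path, "prompt": "List school names."},
]
# Fake generator output: the second query references a table that does not exist.
fake_sql = {
    "Count the schools.": "SELECT COUNT(*) FROM schools",
    "List school names.": "SELECT title FROM school",
}
error_dataset = build_error_dataset(dataset, fake_sql.get)
```

Only the failing generation ends up in `error_dataset`, which matches the expectation that the error dataset is smaller than the training set when the generator is reasonably accurate.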
For Inference
PremSQL automatically handles error correction in BaseLine Agents and execution-guided decoding, so manual intervention is not needed.
Defining Generators and Executors
Let’s start by defining the necessary components:
Setting Up the Dataset
We define our existing training dataset using BirdBench. For demonstration, we’re using num_rows=10, but it’s recommended to use the full dataset when training. Typically, the error dataset will be smaller than the training set if you’re using a well-trained model.
Creating the Error Handling Dataset
Creating the error-handling dataset is simple: feed in the generator and executor of your choice.
Generating and Saving Error Dataset
Run the error-handling generation and save the results. The results are saved in ./experiments/train/<generator-experiment-name>/error_dataset.json. Use force=True to regenerate if needed.
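The save-and-regenerate behaviour can be pictured as a simple file cache. The sketch below is an assumption about the general pattern, not PremSQL's code: `generate_and_save` is a hypothetical helper, "demo-experiment" is a made-up experiment name, and the demo writes to a temporary directory.

```python
import json
import os
import tempfile


def generate_and_save(path, build, force=False):
    """Reuse the JSON file at `path` unless it is missing or `force` is True."""
    if os.path.exists(path) and not force:
        with open(path) as f:
            return json.load(f)
    data = build()
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        json.dump(data, f, indent=2)
    return data


# Demo location inside a temporary directory.
root = tempfile.mkdtemp()
path = os.path.join(root, "experiments", "train", "demo-experiment", "error_dataset.json")

calls = []  # records how many times the expensive build step actually runs


def build():
    calls.append(1)
    return [{"db_id": "demo", "error": "no such table: school"}]


first = generate_and_save(path, build)               # builds and writes the file
second = generate_and_save(path, build)              # reads the saved file
third = generate_and_save(path, build, force=True)   # rebuilds and overwrites
```

In this sketch the build step runs twice across the three calls: once for the initial generation and once for the forced regeneration.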
This is what a sample of the error dataset looks like:
Error Prompt Template
Each data point in the error dataset will include an error-handling prompt, which looks like this:
Loading an Existing Error Handling Dataset
You don’t need to rerun the error-handling pipeline once the dataset has been generated. Use from_existing to load the dataset during fine-tuning.
Tokenizing the Dataset
We also support tokenizing the error dataset during loading, which is particularly useful when integrating with Hugging Face Transformers.
Example with SQLite Executor
Let’s see another example using a different executor, the SQLite Executor.
Generating a Tokenized Dataset On-the-Fly
You can generate and save a tokenized dataset directly during the error generation process.
Output