Evaluators and Executors

premsql evaluation is built from two pieces. Executors connect to the database and execute SQL against it, and the PremSQL Text2SQLEvaluator uses those executors to evaluate your models / pipelines against a gold dataset. Our evaluators currently support the following metrics:

  1. Execution Accuracy (EX)
  2. Valid Efficiency Score (VES)

Execution Accuracy (EX)

Execution Accuracy measures the correctness of the SQL generated by the model by comparing the executed results with the ground truth SQL. This metric gives a direct measure of how accurate the generated SQL is.
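
Conceptually, this boils down to running both the generated and the ground-truth SQL against the same database and checking whether the result sets match. The snippet below is a minimal, simplified sketch of that idea using sqlite3, not premsql's actual implementation:

import sqlite3

def execution_accuracy(predicted_sql: str, gold_sql: str, db_path: str) -> int:
    # Run both queries against the same database and compare their result
    # sets, ignoring row order. Returns 1 on a match, 0 otherwise.
    with sqlite3.connect(db_path) as conn:
        predicted_rows = conn.execute(predicted_sql).fetchall()
        gold_rows = conn.execute(gold_sql).fetchall()
    return int(set(predicted_rows) == set(gold_rows))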

Valid Efficiency Score (VES)

The primary objective of the SQL queries generated by Large Language Models (LLMs) is accuracy. However, performance is also crucial when dealing with large databases. The Valid Efficiency Score assesses both accuracy and performance, ensuring that the generated query is not only correct but also efficient to execute.
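
As a rough sketch of how such a metric can be computed (following the definition popularised by the BIRD benchmark; premsql's internal implementation may differ in details such as repeated timing runs), a query that returns the wrong result scores zero, while a correct query is rewarded by the square root of the ratio between the ground-truth query's runtime and the generated query's runtime:

import math
import sqlite3
import time

def valid_efficiency_score(predicted_sql: str, gold_sql: str, db_path: str) -> float:
    # Simplified per-example VES: 0.0 if the predicted result set is wrong,
    # otherwise sqrt(gold_runtime / predicted_runtime), so a query that is
    # both correct and faster than the ground truth scores above 1.
    with sqlite3.connect(db_path) as conn:
        start = time.perf_counter()
        gold_rows = conn.execute(gold_sql).fetchall()
        gold_runtime = time.perf_counter() - start

        start = time.perf_counter()
        predicted_rows = conn.execute(predicted_sql).fetchall()
        predicted_runtime = time.perf_counter() - start

    if set(predicted_rows) != set(gold_rows):
        return 0.0
    return math.sqrt(gold_runtime / predicted_runtime)

The dataset-level VES is then the average of this per-example score.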

Evaluation Setup

Let’s dive into the code to see how you can use premsql to evaluate models or pipelines using these metrics.

Let's set up everything required for an evaluation. We need an evaluation / validation dataset, a generator that generates and saves the results, and an executor that we plug into our evaluator to execute the generated SQL queries and compare them with the ground-truth SQL queries.

Let's start by importing the required libraries and setting up our Datasets and Generators. If you are not familiar with Datasets and Generators, check out the Datasets and Generators guide.

from premsql.generators import Text2SQLGeneratorHF
from premsql.datasets import Text2SQLDataset

# Initialize the BirdBench Dataset
bird_dataset = Text2SQLDataset(
    dataset_name='bird', split="validation", force_download=False,
    dataset_folder="/path/to/the/dataset"
).setup_dataset(num_rows=10)

# Initialize the generator 
generator = Text2SQLGeneratorHF(
    model_or_name_or_path="premai-io/prem-1B-SQL",
    experiment_name="test_generators",
    device="cuda:0",
    type="test"
)

# Get the responses 
responses = generator.generate_and_save_results(
    dataset=bird_dataset,
    temperature=0.1,
    max_new_tokens=256
)

Now that we have generated / loaded our responses, we are ready to evaluate the model.

from premsql.executors import SQLiteExecutor
from premsql.evaluator import Text2SQLEvaluator

# Define the executor 
executor = SQLiteExecutor()

# Define the evaluator 
evaluator = Text2SQLEvaluator(
    executor=executor,
    experiment_path=generator.experiment_path
)

# Now evaluate the models 
results = evaluator.execute(
    metric_name="accuracy",
    model_responses=responses,
    filter_by="db_id",
    meta_time_out=10
)

print(results)
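
The exact keys and values depend on your run, but based on how the VES results are consumed in the plotting example later in this guide, the returned object behaves like a dictionary with one entry per filter value plus an overall entry. A quick way to inspect it:

# Assumption: the result behaves like a dict with an "overall" entry plus
# one entry per db_id, as the plotting example below suggests.
print("overall:", results["overall"])
for db_id, accuracy in results.items():
    if db_id != "overall":
        print(db_id, "->", accuracy)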

Arguments

  • metric_name: str The metric to be evaluated. It can be either accuracy or ves.
  • model_responses: dict The responses generated by the model.
  • filter_by: str The filter to be applied to the responses. It can be any key from the responses.
  • meta_time_out: int The timeout for executing the SQL queries.

As you can see, we used the filter_by option to filter the results by db_id. This not only gives you the overall accuracy but also breaks the accuracy down per available database. If your data has a key called difficulty, the results will likewise be broken down over the available difficulty levels, as shown below. Make sure these keys are available in the dataset; if they are, they will be reflected in the model responses and hence in the results.
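
For example, since the BirdBench validation entries carry a difficulty key (visible in the predict.json sample further down), the same call with a different filter_by gives the per-difficulty accuracy distribution:

results_by_difficulty = evaluator.execute(
    metric_name="accuracy",
    model_responses=responses,
    filter_by="difficulty",
    meta_time_out=10
)
print(results_by_difficulty)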

Once the evaluation is finished, it writes three files:

  1. predict.json: This file contains the responses and the evaluation results.
  2. accuracy.json: This file contains the accuracy results.
  3. ves.json: This file contains the VES results. It is only written if you set metric_name to ves.
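
These files are regular JSON, so you can reload them later for analysis. A minimal sketch, assuming they are written under generator.experiment_path:

import json
from pathlib import Path

# Assumption: the evaluation artifacts live under the experiment path.
experiment_path = Path(generator.experiment_path)
with open(experiment_path / "accuracy.json") as f:
    accuracy_report = json.load(f)
print(accuracy_report)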

This is how an entry in predict.json looks after evaluation.

{
    "question_id": 23,
    "db_id": "california_schools",
    "question": "List the names of schools with more than 30 difference in enrollments between K-12 and ages 5-17? Please also give the full street address of the schools.",
    "evidence": "Difference in enrollment = `Enrollment (K-12)` - `Enrollment (Ages 5-17)`",
    "SQL": "SELECT T1.School, T1.Street FROM schools AS T1 INNER JOIN frpm AS T2 ON T1.CDSCode = T2.CDSCode WHERE T2.`Enrollment (K-12)` - T2.`Enrollment (Ages 5-17)` > 30",
    "difficulty": "moderate",
    "db_path": "data/bird/validation/dev_databases/california_schools/california_schools.sqlite",
    "prompt": "<begin of sentence>You are an ...",
    "dataset_type": "real",
    "generated": "SELECT T2.`School Name`, T2.Street, T2.City, T2.Zip FROM satscores AS T1 INNER JOIN frpm AS T2 ON T1.cds = T2.CDSCode WHERE T1.`Enrollment (K-12)` - T1.`Enrollment (Ages 5-17)` > 30",
    "error": "no such column: T2.Street",
    "accuracy": 0
}
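
Because each entry records the generated SQL, any execution error, and the per-example accuracy, predict.json is handy for error analysis. A small sketch, assuming the file is a list of entries like the one above and lives under the experiment path:

import json
from pathlib import Path

# Assumption: predict.json is a list of entries such as the sample above.
predict_path = Path(generator.experiment_path) / "predict.json"
predictions = json.loads(predict_path.read_text())

# List the responses whose generated SQL did not match the ground truth.
for item in predictions:
    if item.get("accuracy") == 0:
        print(item["question_id"], item.get("error"))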

Similarly, to compute the VES, call the same method with metric_name set to ves. Any filter can be used here as well; in this example, we again filter the results by db_id.

ves = evaluator.execute(
    metric_name="ves", 
    model_responses=responses, 
    filter_by="db_id"
)

Plotting VES Results

Visualize the results to understand performance across different databases. Such analysis helps identify areas for improvement.

import matplotlib.pyplot as plt

# Drop the aggregate entry so only per-database scores are plotted
ves.pop("overall")

plt.figure(figsize=(12, 6))
plt.bar(ves.keys(), ves.values(), color='skyblue')
plt.xticks(rotation=45, ha='right')
plt.ylabel('VES')
plt.title('VES per Database (Excluding Overall)')
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()

This produces a bar chart of the VES for each database.

The results written to predict.json, together with the ability to plot them over different filters, make it easy to analyse the results and evaluate pipelines empirically.