Evaluators
Evaluate different Text-to-SQL models and pipelines with standard benchmarks and metrics.
Evaluators and Executors
premsql evaluation consists of two pieces. Executors connect to the database and execute SQL against it. The PremSQL Text2SQLEvaluator then uses an executor to evaluate your models / pipelines against a gold dataset. Our evaluators currently support the following metrics:
- Execution Accuracy (EX)
- Valid Efficiency Score (VES)
Execution Accuracy (EX)
Execution Accuracy measures the correctness of the generated SQL by executing it and comparing the results with those of the ground-truth SQL. This gives a direct measure of how often the generated SQL produces the right answer.
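To make this concrete, here is a minimal, self-contained sketch of execution-based matching (an illustration, not premsql's internal implementation): a generated query counts as correct if executing it returns the same set of rows as the ground-truth query.

import sqlite3

def execution_match(db_path: str, predicted_sql: str, gold_sql: str) -> bool:
    # Execute both queries against the same database and compare the result sets.
    # Illustrative only; premsql handles execution through its executors.
    with sqlite3.connect(db_path) as conn:
        predicted_rows = set(conn.execute(predicted_sql).fetchall())
        gold_rows = set(conn.execute(gold_sql).fetchall())
    return predicted_rows == gold_rows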
Valid Efficiency Score (VES)
The primary objective of the SQL queries generated by Large Language Models (LLMs) is accuracy. However, performance is also crucial when dealing with large datasets. The Valid Efficiency Score assesses both accuracy and performance, ensuring that the generated query is not only correct but also efficient to execute.
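As a rough sketch of the idea (following the BIRD-style definition; premsql's actual implementation may differ in details such as repeated timing runs), a generated query scores zero if its results do not match the ground truth, and otherwise scores the square root of the ratio between the gold query's runtime and the generated query's runtime:

import sqlite3
import time
from math import sqrt

def ves_for_pair(db_path: str, predicted_sql: str, gold_sql: str) -> float:
    # Illustrative, single-run sketch:
    # score = 1(results match) * sqrt(gold_time / predicted_time)
    def run(sql: str):
        with sqlite3.connect(db_path) as conn:
            start = time.perf_counter()
            rows = set(conn.execute(sql).fetchall())
        return rows, time.perf_counter() - start

    predicted_rows, predicted_time = run(predicted_sql)
    gold_rows, gold_time = run(gold_sql)
    if predicted_rows != gold_rows:
        return 0.0
    return sqrt(gold_time / max(predicted_time, 1e-9))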
Evaluation Setup
Let’s dive into the code to see how you can use premsql to evaluate models or pipelines using these metrics.
Let’s set up everything required for an evaluation. We need an evaluation / validation dataset, a generator that generates the responses and saves the results, and an executor that we plug into the evaluator to run the generated SQL queries so they can be compared with the ground-truth SQL queries.
Let’s start by importing the required libraries and setting up our dataset and generator. If you are not familiar with Datasets and Generators, check out the Datasets and Generators guides.
from premsql.generators import Text2SQLGeneratorHF
from premsql.datasets import Text2SQLDataset

# Initialize the BirdBench dataset
bird_dataset = Text2SQLDataset(
    dataset_name="bird",
    split="validation",
    force_download=False,
    dataset_folder="/path/to/the/dataset",
).setup_dataset(num_rows=10)

# Initialize the generator
generator = Text2SQLGeneratorHF(
    model_or_name_or_path="premai-io/prem-1B-SQL",
    experiment_name="test_generators",
    device="cuda:0",
    type="test"
)

# Get the responses
responses = generator.generate_and_save_results(
    dataset=bird_dataset,
    temperature=0.1,
    max_new_tokens=256
)
Now that we have generated / loaded our responses, we are ready to evaluate the model.
from premsql.executors import SQLiteExecutor
from premsql.evaluator import Text2SQLEvaluator

# Define the executor
executor = SQLiteExecutor()

# Define the evaluator
evaluator = Text2SQLEvaluator(
    executor=executor,
    experiment_path=generator.experiment_path
)

# Now evaluate the model
results = evaluator.execute(
    metric_name="accuracy",
    model_responses=responses,
    filter_by="db_id",
    meta_time_out=10
)

print(results)
Arguments
- metric_name: str The metric to be evaluated. It can be either accuracy or ves.
- model_responses: dict The responses generated by the model.
- filter_by: str The filter to be applied to the responses. It can be any key from the responses.
- meta_time_out: int The timeout for the execution of the SQL queries.
As you can see, we used the filter_by option to filter the results by db_id. This not only gives you the overall accuracy but also breaks the accuracy down over all the available databases. If the responses contain a key called difficulty, the results will likewise be broken down over the available difficulty levels. Make sure these keys are present in the dataset: if they are, they carry over to the model responses and hence into the results.
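The exact shape of the returned object is easiest to see by printing it, but based on the plotting example further below (which reads per-database scores straight off the dictionary), you can expect a mapping from "overall" and each db_id to a score. A small, hypothetical way to inspect it:

# Hypothetical sketch: print the overall score and the per-database breakdown.
# Assumes `results` maps "overall" and each db_id to a score, as the plotting
# example further below also assumes.
print("overall:", results["overall"])
for db_id, score in results.items():
    if db_id != "overall":
        print(f"{db_id}: {score}")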
Once the evaluation is finished, it writes the following files:
- predict.json: Contains the responses along with the evaluation results.
- accuracy.json: Contains the accuracy results.
- ves.json: Contains the VES results. This file is only written if you set metric_name to ves.
This is how an entry in predict.json would look after evaluation.
{
    "question_id": 23,
    "db_id": "california_schools",
    "question": "List the names of schools with more than 30 difference in enrollments between K-12 and ages 5-17? Please also give the full street address of the schools.",
    "evidence": "Difference in enrollment = `Enrollment (K-12)` - `Enrollment (Ages 5-17)`",
    "SQL": "SELECT T1.School, T1.Street FROM schools AS T1 INNER JOIN frpm AS T2 ON T1.CDSCode = T2.CDSCode WHERE T2.`Enrollment (K-12)` - T2.`Enrollment (Ages 5-17)` > 30",
    "difficulty": "moderate",
    "db_path": "data/bird/validation/dev_databases/california_schools/california_schools.sqlite",
    "prompt": "<begin of sentence>You are an ...",
    "dataset_type": "real",
    "generated": "SELECT T2.`School Name`, T2.Street, T2.City, T2.Zip FROM satscores AS T1 INNER JOIN frpm AS T2 ON T1.cds = T2.CDSCode WHERE T1.`Enrollment (K-12)` - T1.`Enrollment (Ages 5-17)` > 30",
    "error": "no such column: T2.Street",
    "accuracy": 0
}
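Because each entry records the generated SQL, any execution error, and an accuracy flag, you can mine predict.json for failure cases. A hedged sketch, assuming the file sits inside the generator's experiment path and holds a list of entries shaped like the one above:

import json
from pathlib import Path

# Assumption: predict.json is a list of entries with the keys shown above.
predictions = json.loads(
    (Path(generator.experiment_path) / "predict.json").read_text()
)
for entry in predictions:
    if entry.get("accuracy") == 0:
        print(entry["question_id"], entry.get("error"))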
Similarly, to compute the VES, use the same method but set metric_name to ves. Here we again filter the results by db_id.
ves = evaluator.execute(
    metric_name="ves",
    model_responses=responses,
    filter_by="db_id"
)
Plotting VES Results
Visualize the results to understand performance across different databases. Such analysis helps identify areas for improvement.
import matplotlib.pyplot as plt
ves.pop("overall")
plt.figure(figsize=(12, 6))
plt.bar(ves.keys(), ves.values(), color='skyblue')
plt.xticks(rotation=45, ha='right')
plt.ylabel('Values')
plt.title('Values of Different Categories Excluding Overall')
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()
This produces a bar chart of the VES scores for each database.
The results written to predict.json, together with the ability to plot them over different filters, make it easy to analyse the results and evaluate your pipelines empirically.