Learn about the evaluation metrics available in Prem and how to create your own.
Traditional deterministic metrics used for language models (such as token accuracy) are insufficient for evaluating LLMs. Therefore, we define several abstract evaluation concepts:
- Correctness: Measures the factual accuracy of the response
- Conciseness: Measures how concise and to-the-point the response is, without unnecessary information
- Hallucination: Measures whether the response contains made-up or fabricated information
Each metric is defined by its own set of rules. Deterministic metrics such as token accuracy use code-based rules, while qualitative metrics like those above use natural-language rules called Rubrics.
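To make the distinction concrete, here is a minimal sketch of the two kinds of rules. This is illustrative only, not Prem's internal implementation: the function name `token_accuracy` and the rubric list are assumptions for the example.

```python
# Illustrative sketch -- not Prem's implementation.
def token_accuracy(predicted: str, reference: str) -> float:
    """Code-based rule: fraction of reference tokens matched in order."""
    pred_tokens = predicted.split()
    ref_tokens = reference.split()
    matches = sum(p == r for p, r in zip(pred_tokens, ref_tokens))
    return matches / max(len(ref_tokens), 1)

# A rubric, by contrast, is written in natural language and scored by an
# LLM judge rather than by code.
conciseness_rubric = [
    "The response answers the question directly",
    "The response avoids repeating information",
    "The response contains no tangential details",
]
```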
You can edit an evaluation metric's rules: the rules the LLM should follow and the rules it should not follow. Just hover over a rule and click the pencil icon to edit it. When you're done, click the checkmark icon to save your changes.
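Conceptually, a metric's editable rules split into the two lists described above. The sketch below shows one hypothetical way to represent them; the field names `should_follow` and `should_not_follow` are illustrative, not Prem's actual schema.

```python
# Hypothetical representation of a metric's rubric rules.
correctness_rubric = {
    "should_follow": [
        "State only facts supported by the provided context",
        "Acknowledge uncertainty when the context is incomplete",
    ],
    "should_not_follow": [
        "Speculate beyond the given information",
        "Present uncertain claims as definitive",
    ],
}
```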