Evaluating LLMs: Building Trustworthy AI for Safety-Critical Applications

This ELOQUENCE webinar addressed one of the most important challenges in today’s AI landscape: how to evaluate large language models (LLMs) so that they can be used reliably, especially in safety-critical applications.

The session featured Professor Tatiana Kalganova from Brunel University of London, an expert in artificial intelligence, machine learning and intelligent systems. Within the ELOQUENCE project, Brunel University is contributing to the development of an evaluation framework for large language models, with a particular focus on trustworthiness, explainability and bias.

Tatiana opened the session by introducing Brunel’s interdisciplinary research environment, where technical AI expertise is combined with perspectives from law, business, economics, arts and social sciences. This interdisciplinary approach is especially important when working with AI systems that are expected to operate in real-world contexts. Technical performance alone is no longer enough; issues such as legal compliance, social acceptance, trust and accountability must be considered from the beginning.

A central theme of the webinar was the difficulty of trusting LLMs. Many users are already familiar with tools such as ChatGPT, Copilot or Gemini, but these systems often require several attempts, rewritten prompts or additional clarification before producing a useful answer. In everyday use, this may be acceptable. However, in safety-critical scenarios, where decisions must be made quickly and on the basis of reliable information, there is far less room for error.

Tatiana explained that evaluating LLMs is very different from evaluating traditional machine learning models. In classical machine learning, researchers often work with clearly defined datasets and can measure accuracy against known correct answers. With large language models, the situation is more complex: we often do not know exactly what data they were trained on, and their responses may vary depending on how a question is phrased. This makes it necessary to develop new evaluation methods that can assess not only whether an answer is correct, but also how faithful, consistent and reliable it is.
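
One simple way to make the consistency aspect concrete is to ask the same question in several phrasings and measure how much the answers agree. The sketch below is illustrative only and is not the method presented in the webinar: the function names `jaccard` and `consistency_score` are hypothetical, and token overlap is a crude stand-in for the semantic comparison a real evaluation would need.

```python
from itertools import combinations
from typing import List


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two answers (1.0 means identical token sets)."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty answers are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def consistency_score(answers: List[str]) -> float:
    """Mean pairwise overlap across answers to paraphrased versions of one question.

    Assumption: lexical overlap approximates agreement; this misses paraphrases
    that use different words to say the same thing.
    """
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0  # a single answer is trivially consistent with itself
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical answers collected from three phrasings of the same question.
answers = ["The capital is Paris", "the capital is paris", "It is Lyon"]
score = consistency_score(answers)
```

A low score flags a question whose answer changes with the wording, which is a cue for closer human review rather than a verdict on correctness.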

The webinar also explored the growing role of LLM-as-a-judge approaches, where one language model is used to evaluate the output of another. While this method is becoming increasingly common, Tatiana emphasised that human judgement remains essential, especially when evaluating factuality, faithfulness and multilingual responses. Current automatic metrics are still largely validated by comparison with human evaluation, particularly in more complex tasks such as long-form question answering, conversational AI and summarisation.
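
To illustrate what comparison with human evaluation can look like in practice, the sketch below computes Cohen's kappa, a standard chance-corrected agreement statistic, between labels given by human annotators and labels given by an LLM judge. This is a minimal example under stated assumptions, not the procedure used in ELOQUENCE: the function name `cohen_kappa` and the label values are hypothetical, and both raters are assumed to assign one categorical label per response.

```python
from collections import Counter
from typing import List


def cohen_kappa(human: List[str], judge: List[str]) -> float:
    """Chance-corrected agreement between human labels and LLM-judge labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("both label lists must be non-empty and of equal length")
    n = len(human)
    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Expected agreement if both raters labelled independently at their own rates.
    human_counts, judge_counts = Counter(human), Counter(judge)
    expected = sum(human_counts[label] * judge_counts[label] for label in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for four model responses.
human_labels = ["faithful", "faithful", "unfaithful", "unfaithful"]
judge_labels = ["faithful", "faithful", "unfaithful", "faithful"]
kappa = cohen_kappa(human_labels, judge_labels)
```

Values near 1 indicate that the judge tracks human judgement closely, while values near 0 indicate agreement no better than chance.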

Within ELOQUENCE, Brunel University is working on fused evaluation metrics that combine several existing approaches rather than relying on a single measure. Early results suggest that this combined approach can improve the assessment of trustworthiness across short-form question answering, long-form question answering, conversational question answering and summarisation.
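
The webinar summary does not describe how the individual metrics are fused, so the following is only a sketch of the simplest possible combination: a weighted average of scores that have already been normalised to the range 0 to 1. The function name `fuse_scores`, the metric names and the weights are all hypothetical and are not taken from the ELOQUENCE framework.

```python
from typing import Dict, Optional


def fuse_scores(scores: Dict[str, float],
                weights: Optional[Dict[str, float]] = None) -> float:
    """Combine several normalised metric scores into one value in [0, 1].

    Assumption: every score is already scaled so that 0 is worst and 1 is best.
    """
    if not scores:
        raise ValueError("at least one metric score is required")
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"score for {name!r} must lie in [0, 1]")
    if weights is None:
        weights = {name: 1.0 for name in scores}  # default: equal weighting
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Hypothetical per-response scores from three different kinds of metric.
example = {"lexical_overlap": 0.4, "entailment": 0.9, "llm_judge": 0.8}
fused = fuse_scores(example, weights={"lexical_overlap": 1.0, "entailment": 2.0, "llm_judge": 2.0})
```

The point of combining scores is that each metric tends to miss different failure modes; how the weights are chosen, and whether a linear combination is adequate at all, would need to be validated against human judgement.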

The key message of the session was clear: trustworthy AI cannot be achieved through one metric or one test. It requires a broader evaluation framework that brings together trustworthiness, explainability, bias, multilinguality and human judgement. This is especially crucial if LLMs are to be used in high-risk environments where accuracy, reliability and transparency matter most.

Watch the full webinar here.