Trustworthy AI in High-Stakes Applications

The joint ELOQUENCE and TrustLLM webinar brought together two Horizon Europe projects to explore one of the most pressing questions in artificial intelligence today: how can AI systems be made trustworthy enough for high-stakes applications, where reliability, robustness, and human trust are essential?

The session featured Annika Simonsen from the TrustLLM project and Dr Petr Motlíček from the ELOQUENCE project. Together, they offered complementary perspectives on trustworthy AI, from the alignment of large language models with European values to the development of explainable and reliable AI systems for real-world dialogue scenarios.

Annika opened the session by presenting work carried out within TrustLLM on evaluating whether large language models reflect European values. As LLMs are increasingly deployed in tools that interact directly with users, it becomes important to understand not only how accurate or fluent they are, but also what kinds of values they express in their responses. This is especially relevant in the European context, where AI systems are expected to align with principles such as human dignity, freedom, democracy, equality, the rule of law, and human rights.

To explore this, the TrustLLM team developed a benchmark based on the European Values Study, a large-scale survey that has been collecting data on people’s beliefs, attitudes, and values across Europe for decades. Instead of defining European values only in abstract terms, the team used real survey responses to identify patterns that are representative of European Union countries. Their findings showed that how a question is asked can significantly influence how an LLM responds. Models may appear highly aligned when answering direct multiple-choice questions, but behave differently when placed in more realistic, situational contexts.

The gap between direct and situational prompts highlights an important point for AI evaluation: benchmarks should reflect how people actually interact with AI systems in everyday life. The team also found that results differed depending on the language used to prompt the model, which shows that multilingual evaluation is essential when assessing trustworthiness.
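One way to make such comparisons concrete is to score a model separately for each prompt format and language. The Python sketch below is only an illustration and is not the TrustLLM benchmark: the record fields (`prompt_format`, `language`, `model_answer`, `survey_majority`), the function name, and the toy values are all hypothetical, and matching the most common survey answer is a simplifying assumption standing in for whatever comparison the team actually used.

```python
from collections import defaultdict


def alignment_by_condition(records):
    """Mean agreement with the survey reference per (prompt_format, language).

    Assumption: a model answer counts as aligned when it equals the most
    common survey answer. This is a simplification chosen for illustration.
    """
    totals = defaultdict(lambda: [0, 0])  # condition -> [matches, count]
    for record in records:
        key = (record["prompt_format"], record["language"])
        totals[key][0] += int(record["model_answer"] == record["survey_majority"])
        totals[key][1] += 1
    return {key: matches / count for key, (matches, count) in totals.items()}


# Made-up toy records, not real benchmark data or real model outputs.
records = [
    {"prompt_format": "multiple_choice", "language": "en",
     "model_answer": "agree", "survey_majority": "agree"},
    {"prompt_format": "multiple_choice", "language": "en",
     "model_answer": "agree", "survey_majority": "agree"},
    {"prompt_format": "situational", "language": "en",
     "model_answer": "disagree", "survey_majority": "agree"},
    {"prompt_format": "situational", "language": "en",
     "model_answer": "agree", "survey_majority": "agree"},
    {"prompt_format": "multiple_choice", "language": "da",
     "model_answer": "disagree", "survey_majority": "agree"},
    {"prompt_format": "multiple_choice", "language": "da",
     "model_answer": "agree", "survey_majority": "agree"},
]

scores = alignment_by_condition(records)
for condition, score in sorted(scores.items()):
    print(condition, score)
```

Reporting one score per condition, rather than a single overall average, keeps differences between prompt formats and between languages visible instead of letting them cancel out.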

In the second part of the webinar, Petr presented ELOQUENCE’s work on trustworthy AI for dialogue systems, with a particular focus on high-risk applications such as medical support, law enforcement, call centres, and air traffic navigation. These are contexts where AI should not replace humans, but support them in making better and more informed decisions.

Petr explained that AI systems, including large language models, can produce errors and hallucinations while still sounding confident. This makes trustworthiness especially challenging. In high-risk environments, it is not enough for a system to provide an answer; it should also be able to indicate when it does not know, rely on trusted sources, and support human oversight.

The webinar showed that trustworthy AI requires more than one solution. It depends on better evaluation methods, multilingual robustness, transparency, human involvement, and continuous testing in realistic scenarios. Ultimately, both projects share a common goal: developing AI systems that are not only powerful, but also safe, understandable, and aligned with the values and needs of the people who use them.

A recording of the full webinar is available to listen to.