Detecting and Mitigating Bias in Machine Translation

In this ELOQUENCE Web Café session, the discussion focused on one of the key challenges in modern language technologies: how to evaluate machine translation systems not only for accuracy, but also for fairness, robustness, and trustworthiness.

The session featured Javier García Gilabert and Miguel Claramunt from the Barcelona Supercomputing Center, where they work in the Language Technologies Unit. Within the ELOQUENCE project, BSC contributes to the integration of project pilots into the Interactive Playground and supports partners in deploying experiments using high-performance computing facilities. In this session, Javier and Miguel introduced their recent work on MTLens, a toolkit designed to support better machine translation evaluation.

The conversation began with the question of why traditional evaluation methods are no longer enough. For many years, machine translation models have mainly been assessed on translation quality: how close the output is to a reference translation, or how well it preserves the meaning of the original sentence. However, as AI systems are increasingly used in real-world settings, evaluation must go further. It is no longer sufficient to ask whether a translation is fluent or technically correct. We also need to ask whether it introduces bias, harmful language, or other unwanted behaviours.

One of the central issues discussed was gender bias in machine translation. This can happen when a model produces stereotypical translations, especially when translating from a language with fewer gender cues, such as English, into more heavily gendered languages such as Spanish or French. For example, a profession like “doctor” may be translated in the masculine form, while “nurse” may be translated in the feminine form, even though both professions can be performed by people of any gender. Such patterns may seem small, but they can reinforce stereotypes and affect how people perceive certain roles in society.
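To make the idea concrete, here is a minimal sketch in Python of how such a skew could be counted for Spanish output. It is not the MTLens API: the tiny lexicon, the function names, and the example sentences are all hypothetical, and a real evaluation would rely on a curated benchmark and proper morphological analysis instead of a hand-written word list.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon (an assumption for illustration only):
# Spanish surface form -> (English profession, grammatical gender).
LEXICON = {
    "doctor": ("doctor", "masculine"),
    "doctora": ("doctor", "feminine"),
    "médico": ("doctor", "masculine"),
    "médica": ("doctor", "feminine"),
    "enfermero": ("nurse", "masculine"),
    "enfermera": ("nurse", "feminine"),
}


def gender_counts(translations):
    """Count masculine and feminine profession forms found in the translations."""
    counts = {}
    for sentence in translations:
        for token in re.findall(r"\w+", sentence.lower()):
            if token in LEXICON:
                profession, gender = LEXICON[token]
                counts.setdefault(profession, Counter())[gender] += 1
    return counts


def masculine_share(counts, profession):
    """Fraction of masculine forms for a profession, or None if it never appears."""
    c = counts.get(profession, Counter())
    total = c["masculine"] + c["feminine"]
    return c["masculine"] / total if total else None


# Made-up outputs standing in for a system's translations of
# gender-ambiguous English sentences; not real model output.
sample = [
    "El doctor llegó tarde.",
    "La enfermera llegó tarde.",
    "La doctora habló.",
    "El médico habló.",
]
counts = gender_counts(sample)
print(masculine_share(counts, "doctor"))
print(masculine_share(counts, "nurse"))
```

A masculine share close to 1 for one profession and close to 0 for another, on inputs that carry no gender cue, would be one simple signal of the stereotypical pattern described above.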

The speakers also addressed added toxicity, which occurs when a translation model generates toxic or harmful content even though the original sentence is neutral. This issue is especially relevant in low-resource languages, where models may be less stable and more likely to hallucinate content that is not in the source. Another important aspect was robustness to misspellings. Real users often make spelling mistakes, and trustworthy systems should still be able to provide reliable translations even when the input is imperfect.
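Both checks can be illustrated with a short Python sketch. It rests on stated assumptions and does not reproduce how MTLens works: added toxicity is approximated with plain word lists supplied by the caller, misspellings are simulated by swapping adjacent letters, and every function name here is hypothetical.

```python
import random
import re


def tokens(text):
    """Lowercased set of word tokens in a sentence."""
    return set(re.findall(r"\w+", text.lower()))


def has_added_toxicity(source, translation, source_terms, target_terms):
    """True when the translation contains a listed term and the source contains none.

    source_terms and target_terms are caller-supplied sets of lowercase words
    (an assumption of this sketch; real tools use curated per-language lists
    or trained classifiers).
    """
    toxic_in_target = bool(tokens(translation) & target_terms)
    toxic_in_source = bool(tokens(source) & source_terms)
    return toxic_in_target and not toxic_in_source


def add_misspellings(text, rate=0.3, seed=0):
    """Swap two adjacent inner letters in roughly `rate` of the longer words."""
    rng = random.Random(seed)
    words = []
    for word in text.split():
        if len(word) > 3 and word.isalpha() and rng.random() < rate:
            # Pick a position that keeps the first and last letters in place.
            i = rng.randrange(1, len(word) - 2)
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        words.append(word)
    return " ".join(words)


# Placeholder term lists and sentences, chosen only for illustration.
print(has_added_toxicity("He is my neighbour.", "Es un idiota.",
                         {"idiot"}, {"idiota"}))
print(add_misspellings("please translate this sentence", rate=1.0))
```

To measure robustness, one would translate both the clean and the perturbed input with the system under test and compare the quality scores. That step is left out because it depends on the translation system and metric being used.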

MTLens was developed to bring these different evaluation dimensions together. Instead of focusing only on general translation quality, the toolkit allows researchers to examine specific risks such as gender bias, toxicity, and robustness. It also provides visualisations that help users better understand what evaluation scores mean and where a model may be failing.

The discussion also connected this work to speech translation, which is highly relevant for ELOQUENCE. Unlike text-based translation, speech translation uses audio as input and may include additional information, such as the speaker’s voice, which can help reduce ambiguity in gendered translations.

Ultimately, this Web Café highlighted that building trustworthy language technologies requires more than high performance. It requires careful evaluation, transparency, and tools that help researchers identify and address bias before AI systems are used in real-world contexts.

Listen to the full episode here.