Creating digital avatars that speak naturally and convincingly in real time remains one of the most complex challenges in artificial intelligence. While text-to-speech (TTS) systems have reached high levels of quality, realistic facial animation, in particular accurate synchronization of the lips, jaw, and facial expressions, is still an open research problem.
The challenge becomes even greater in low-resource languages such as Serbian, where limited datasets, a lack of standardized tools, and real-time constraints significantly increase complexity.
To address this challenge, the University of Novi Sad is organizing a student competition for Master’s students and 3rd and 4th year undergraduate students. Participants are invited to develop a system for audio-visual speech synthesis for the Serbian language.
The task is to generate time-dependent blendshape coefficients controlling facial animation, based on input text (and synthesized speech). Solutions will be evaluated based on animation naturalness, audio-visual synchronization, real-time capability, robustness, and quality of documentation and presentation.
Task Description
Participants are required to develop a system for audio-visual speech synthesis for the Serbian language, where:
Input: text (and speech synthesized from that text)
Output: time-dependent blendshape coefficients controlling the facial animation of an avatar
The goal is to produce animation that appears as natural as possible, is precisely synchronized with the speech, and operates in (or close to) real time.
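To make the expected output concrete, the sketch below generates a placeholder track of time-dependent blendshape coefficients. The blendshape names, frame rate, and frame layout here are assumptions for illustration only; the actual supported blendshapes and submission format are defined by the competition materials, and a real system would predict these values from the text and/or the synthesized speech rather than from a fixed oscillation.

```python
import json
import math

# Hypothetical blendshape names and frame rate -- the real list is
# provided with the competition dataset.
BLENDSHAPES = ["jawOpen", "mouthClose", "mouthSmileLeft", "mouthSmileRight"]
FPS = 30

def dummy_coefficients(duration_s: float) -> list[dict]:
    """Produce one coefficient dictionary per frame for an utterance
    of the given duration. Here the jaw simply opens and closes at
    ~4 Hz as a stand-in for speech-driven motion."""
    n_frames = int(duration_s * FPS)
    frames = []
    for i in range(n_frames):
        t = i / FPS
        coeffs = {name: 0.0 for name in BLENDSHAPES}
        # Keep values in [0, 1], the usual range for blendshape weights.
        coeffs["jawOpen"] = 0.6 * 0.5 * (1 + math.sin(2 * math.pi * 4 * t))
        frames.append({"time": round(t, 4), "coefficients": coeffs})
    return frames

track = dummy_coefficients(2.0)  # 2-second utterance at 30 fps -> 60 frames
print(json.dumps(track[0]))
```

A real submission would replace the oscillation with model predictions, but the overall shape (a time-stamped sequence of per-blendshape weights) is what the animation application consumes.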
Participants will have direct access to the avatar animation application, but they will not have control over the avatar itself. Based on the submitted blendshape coefficients, the application generates animation videos. These videos are provided to teams so they can iteratively improve their solutions.
In the final evaluation phase, participants will receive a new set of sentences. They will submit the generated blendshape coefficients for these sentences to the organizers, who will produce the final avatar animation videos using the application. Teams will be ranked based on these videos.
The evaluation will analyze the naturalness and realism of the animation, the quality of audio-visual synchronization, and the stability of the predictions. Once the five best solutions have been selected, those teams will submit their code and a written report.
The winner will be determined based on the following criteria:
- Naturalness of the animation
- Latency, i.e., the ability to perform real-time synthesis
- Overall system robustness
- Quality and clarity of the written report
- Quality and clarity of the live presentation
Details
Who can participate
- Master’s students
- 3rd and 4th year undergraduate students
Team applications
Applications should be sent to: tijana.nosek@uns.ac.rs
The application must include:
- Team name
- Names of all team members
- Name of the school/university each team member is affiliated with
Important dates
- Team application deadline: March 15, 2026
- Release of the test dataset: March 21, 2026
- Submission of results on the test dataset: March 28, 2026
- Announcement of the top 5 teams advancing to the second round: April 4, 2026
- Submission of code, report, and presentation: April 11, 2026
- Final event – live presentations: April 25, 2026
Additional notes
- Awards will be provided for the best teams.
- All participants are welcome to attend the final event (participants cover their own travel costs).
- Online participation will also be available for those who cannot attend in person.
Contact
- Vuk Stanojev: vukst@uns.ac.rs
- Tijana Nosek: tijana.nosek@uns.ac.rs
Dataset
The competition uses the AI-SPEAK dataset, which includes audio recordings and corresponding frame-level blendshape coefficients, a list of supported blendshapes, and synthesized sentences.
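As a starting point for working with paired audio and frame-level coefficients, the sketch below parses a coefficient track from CSV. The column layout (a `time` column followed by one column per blendshape) is a guess for illustration; the actual AI-SPEAK file format may differ and should be taken from the dataset documentation.

```python
import csv
import io

# Hypothetical CSV layout: one row per frame, a timestamp column,
# then one column per blendshape. Adjust to the real dataset format.
sample = """time,jawOpen,mouthClose
0.000,0.10,0.00
0.033,0.25,0.05
0.067,0.40,0.10
"""

def load_track(f):
    """Read a coefficient track into (timestamp, {blendshape: value})
    tuples, one per frame."""
    reader = csv.DictReader(f)
    frames = []
    for row in reader:
        t = float(row.pop("time"))
        frames.append((t, {name: float(v) for name, v in row.items()}))
    return frames

frames = load_track(io.StringIO(sample))
print(len(frames), frames[1][0])
```

Aligning these frames with the audio (e.g., by timestamp against the waveform's sample clock) is the first step toward training a predictor and measuring synchronization.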
The complete dataset can be downloaded via the provided link.
