STSM: Building a Multilingual Neologism Dataset for Low-Resource Languages
Name: Atul Kumar Ojha
Start: 17/08/2025
End: 29/08/2025
Atul Kumar Ojha’s STSM at the School of German Language and Literature, Aristotle University of Thessaloniki, hosted by Dr. Paraskevi Giouli, focused on how large language models process neologisms in multilingual contexts.

The work covered low-resource languages such as Bhojpuri, Irish, and Marathi. For these languages, the grantee designed a multilingual annotation schema, compiled neologism lists, and provided English translations to facilitate cross-linguistic analysis. Selected terms were then tested with large language models such as GPT-4 and NLLB to evaluate recognition, interpretation, and contextual generation, which revealed both the strengths and the limitations of the models.
The STSM resulted in an openly accessible, structured dataset accompanied by annotation guidelines and methodological notes. This resource supports future multilingual research and fosters ongoing collaboration within the ENEOLI Action.
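
As an illustration only, one record of such a structured dataset might look like the sketch below. The report does not specify the actual schema, so the class name, field names, and JSON Lines serialisation are assumptions made for this example, and the values are placeholders rather than real dataset entries.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class NeologismEntry:
    """One hypothetical dataset record; all field names are assumed."""
    term: str                 # the neologism in its original script
    language: str             # ISO 639-3 code, e.g. "bho", "gle", "mar"
    english_translation: str  # English gloss for cross-linguistic analysis
    # Free-text model responses keyed by evaluation task (assumed layout),
    # e.g. "recognition", "interpretation", "contextual_generation".
    model_outputs: dict = field(default_factory=dict)


def to_jsonl(entries):
    """Serialise entries as JSON Lines, keeping non-ASCII characters intact."""
    return "\n".join(json.dumps(asdict(e), ensure_ascii=False) for e in entries)


# Placeholder values only; no real terms or model responses are shown.
entry = NeologismEntry(
    term="<term>",
    language="bho",
    english_translation="<English gloss>",
)
entry.model_outputs["recognition"] = "<model response>"
print(to_jsonl([entry]))
```

Storing one JSON object per line keeps entries for different languages in a single file while allowing each to be read and validated independently; `ensure_ascii=False` keeps Devanagari and other non-Latin scripts readable in the output.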

