STSM: Building a Multilingual Neologism Dataset in Low-Resource Languages

Name: Atul Kumar Ojha

Start: 17/08/2025

End: 29/08/2025

Atul Kumar Ojha’s STSM at the School of German Language and Literature, Aristotle University of Thessaloniki, hosted by Dr. Paraskevi Giouli, focused on how large language models process neologisms in multilingual contexts.

His work involved low-resource languages such as Bhojpuri, Irish, and Marathi. For these, the grantee designed a multilingual annotation schema, compiled neologism lists, and provided English translations to facilitate cross-linguistic analysis. Selected terms were then tested with large language models such as GPT-4 and NLLB to evaluate recognition, interpretation, and contextual generation, revealing both the models’ strengths and their limitations.

The STSM resulted in an openly accessible, structured dataset accompanied by annotation guidelines and methodological notes. This resource supports future multilingual research and fosters ongoing collaboration within the ENEOLI Action.