Processing of a ‘living’ ancient language: Issues and Insights from Chinese
- Speaker: Dr. Chu-ren Huang (The Hong Kong Polytechnic University, P.R. China)
- Abstract: An inherited ancient language is an ancient language that has a living language as its direct descendant. In this talk, we further define a living ancient language as an inherited ancient language that has some significant linguistic components retained by its living descendant. Ancient Chinese, with its orthographic system retained in modern Mandarin, is a clear example. As such, the processing of a living ancient language poses unique challenges and opportunities at the same time. The common challenges include how to differentiate an ancient language from a similar language or a variety, how to differentiate descendants from different periods, the likely occurrences of faux amis, and how to deal with ‘code mixing’ of the ancient language and its living descendant when the codes of the two languages are very similar or even identical. The challenges unique to Chinese include that identical character strings may need to be segmented or chunked differently in ancient and living languages. The unique opportunities include the use of the retained component as a constant to identify changes and variations, the possibility of applying extrapolation (when an adjacent model is known) and interpolation (when prior and later models are known) to build language models, and the application of an end-to-end (instead of point-to-point) study of the diachronic changes through time. In this talk, I will elaborate on these important issues with an emphasis on several examples of how these unique opportunities are leveraged to inform us of the cultural, environmental, and societal changes that may (or may not) have been documented historically.
When the past meets the future at Odessus
- Speaker: Dr. Thea Sommerschield (Ca’ Foscari University of Venice, Italy)
- Bio: Thea Sommerschield is a Marie Skłodowska-Curie fellow at Ca’ Foscari University of Venice. Her research uses machine learning to study the epigraphic cultures of the ancient Mediterranean world. Since obtaining her DPhil in Ancient History (University of Oxford), she has been the Ralegh Radford Rome Awardee at the British School at Rome, Fellow in Hellenic Studies at Harvard’s CHS and Research Innovator at Google Cloud. She co-led the Pythia (EMNLP, 2019) and Ithaca (Nature, 2022) projects, and works extensively on Sicilian epigraphy.
- Abstract: Ancient languages preserve the cultures and histories of the past. However, their study is fraught with difficulties, and experts must tackle a range of challenging tasks, from deciphering lost languages to restoring damaged inscriptions. Technological aids have long supported these efforts, yet recent advances in Machine Learning have catalysed revolutionary shifts in the Humanities, akin to the impact of microscopes and telescopes on scientific exploration. This talk will survey some of the main tasks, trends and transformations in the multifaceted field of Machine Learning for Ancient Languages, inspired and guided by my personal experience in this domain. We will retrace the progress in ML techniques now at the disposal of historical researchers, and chart some of the main contributions to this dynamic field. We will address some of the many extant challenges (such as bias, interpretable outputs and uneven digitisation standards), and discuss promising directions for future advancement. The main takeaway of this talk is the pivotal role of active collaboration between specialists from both domains in producing impactful and compelling scholarship.
Harnessing Multilingual Models for Ancient Language Processing
- Speaker: Dr. Gabriel Stanovsky (Hebrew University of Jerusalem, Israel)
- Bio: Dr. Gabriel Stanovsky is a senior lecturer (assistant professor) in the school of computer science & engineering at the Hebrew University of Jerusalem, and a research scientist at the Allen Institute for AI (AI2). He did his postdoctoral research at the University of Washington and AI2 in Seattle, working with Prof. Luke Zettlemoyer and Prof. Noah Smith, and his PhD with Prof. Ido Dagan at Bar-Ilan University. He is interested in developing natural language processing models which deal with real-world texts and help answer multi-disciplinary research questions in archeology, law, medicine, and more. His work has received awards at top-tier venues, including ACL, NAACL, and CoNLL, and recognition in popular outlets such as Science, New Scientist, and The New York Times.
- Abstract: Recent years have seen large language models (LLMs) enabling progress in all NLP tasks. However, LLMs seem predicated on massive-scale raw training data, which currently limits their application to modern languages with a strong online presence. In this talk I’ll explore the question: How can ancient language processing benefit from recent strides in NLP, despite the limited amount of available data? I’ll argue that a promising avenue to achieve this is by leveraging data in modern languages in conjunction with small-scale ancient language corpora. Towards that end, I’ll present a multilingual LLM which achieves state-of-the-art performance for the Akkadian language, as well as a thorough evaluation of its performance. Finally, I’ll discuss future work enabled by such LLMs for ancient language processing.