Speaker: Dr. Patrick J. Burns
  • Talk title: The Role of “Small” Models for Ancient NLP in a World of Large Language Models
  • BIO: Patrick J. Burns is Associate Research Scholar, Digital Projects at NYU's Institute for the Study of the Ancient World, working in ancient-world data processing and historical-language text mining and analysis. Patrick earned his doctorate in Classics from Fordham University in 2016 and has since been active in the area of computational philology. Patrick is the maintainer of LatinCy, a set of pretrained natural language processing pipelines for Latin, a co-author/developer of Latin BERT, and has been a contributor to the Classical Language Toolkit. In forthcoming articles and chapters, Patrick ranges widely across related research topics such as Latin-to-Latin automatic question generation, assessing Latin readability via named entity linking, the application of contextual embeddings for intertextuality detection, and the philology of OCR correction, to name only a few directions. Recent publications include “(Re)Active Latin: Computational Chat as Future colloquia” in the New England Classical Journal, which looks at the pedagogical implications of using large language models and chatbots in the Latin classroom, as well as a blog post, “How Much Latin Does ChatGPT ‘Know’?”, which explores the Latin-language training data of OpenAI's popular AI interface. Lastly, Patrick has taught graduate seminars at ISAW on topics such as the uses of generative artificial intelligence in ancient-world research and ancient-language NLP.

  • Abstract: In the field of Latin natural language processing, there are tasks for which competitive, if not state-of-the-art, performance is exhibited by large language models like GPT-4o or Claude. Yet as opposed to modern English (for which that statement may also arguably be true), there are some Latin NLP tasks, like coreference resolution or automatic question generation, for which work on smaller, task-specific models is either just underway or does not yet exist. LLMs in this case have “skipped” steps on a path of continuous development and improvement. I argue in this talk that, while we should take advantage of such LLM advancements in Latin NLP, a significant part of our attention should also be directed backward, toward filling in these skipped steps. By returning to and focusing again on “small” language models—including everything from rigorously evaluated and field-tested static embedding models to recent iterations of smaller language models like BERT, most especially those with custom task-specific heads—we can promote a culture of interpretable and explainable philology: interpretable, following Russell and Norvig (Artificial Intelligence, 4th ed. [2021], pp. 711-12), because these smaller models—from their training data to their configuration and parameterization—can be directly inspected, and explainable because such models allow us to maintain an understanding of how specific outputs result from specific inputs. In sum, I argue that, although LLMs will serve (and serve well) short-term interests in ancient-language NLP, we should redouble our efforts—through data curation, through attention to model parameterization, and through competitive evaluation (like shared tasks)—to develop smaller models equally up to our biggest language challenges. While my talk will use Latin as its ancient-language focus, the conclusion will discuss ways to adapt lessons learned to other ancient languages, such as Ancient Greek, Akkadian, and Middle Egyptian, among others.
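
  As a concrete illustration of the kind of “small”, inspectable model the abstract advocates, the sketch below loads one of the LatinCy spaCy pipelines and prints per-token analyses that can be checked directly against the text. It is a minimal sketch under assumptions: it presumes a LatinCy pipeline is installed as a spaCy package, and the package name la_core_web_sm and the example sentence are illustrative choices, not details taken from the abstract.

      # Minimal sketch: a "small", directly inspectable Latin pipeline.
      # Assumes a LatinCy spaCy package is installed; the package name
      # "la_core_web_sm" is an assumption, not taken from the abstract.
      import spacy

      nlp = spacy.load("la_core_web_sm")
      doc = nlp("Gallia est omnis divisa in partes tres.")

      # Each component's output can be read off token by token, which is
      # what makes this kind of model interpretable and explainable.
      for token in doc:
          print(token.text, token.lemma_, token.pos_, token.dep_)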


Speaker: Dr. Donald Sturgeon
  • Talk title: Ancient languages in the age of LLMs: opportunities and challenges
  • BIO: Donald Sturgeon is Assistant Professor in the Department of Computer Science at Durham University. Prior to joining Durham, he held postdoctoral positions at the City University of Hong Kong and Harvard University. With a background in classical Chinese philosophy and digital humanities, his main research interests are in natural language processing for premodern Chinese, digital libraries, and the application of digital methods to the study of the language, literature, and history of premodern China. He is also the creator of the Chinese Text Project (https://ctext.org), a widely used digital library of premodern Chinese written works which he began in 2005 and which now contains over 36 million pages of primary source material, together with transcriptions and annotations maintained by an active crowdsourcing community.

  • Abstract: The advent of practical generative large language models capable of producing fluent, human-like text presents opportunities and challenges for large-scale digital analysis of historical writing, as well as for digital libraries and other systems that mediate human access to historical source materials. Generative tasks starting with – but by no means limited to – translation and summarization provide extensive opportunities for a variety of automated contextual assistance for navigation and reading, as well as for extraction of many types of information that would previously have required human involvement and/or the training of special-purpose models to obtain. At the same time, LLMs are not the solution to every problem – depending on the task, smaller fine-tuned models and even simpler rule-based approaches may be competitive, produce superior results, or generate outputs that are more reliable, consistent, and explainable. In addition to more general issues faced in applying contemporary LLMs, further challenges emerge when applying them to ancient languages in particular. Some of these stem directly from the limited volumes of text and data that may be available to train or fine-tune models, while others relate more directly to the problem domain, such as risks of bias due to the often very large imbalance between the volume of text available in training for the contemporary form of a language versus its historical forms. Focusing primarily on classical and literary Chinese but with applicability to premodern languages more generally, this talk aims to highlight the promise and limitations of innovative applications of LLMs to these domains.
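
  To make the contrast between prompting a generative LLM and using a smaller special-purpose model concrete, the sketch below queries a masked-language model pretrained on classical Chinese through the Hugging Face transformers fill-mask pipeline; its ranked, probability-scored predictions are easy to inspect and to evaluate consistently. This is a minimal sketch under assumptions: the model id "ethanyt/guwenbert-base" is a publicly shared classical-Chinese BERT chosen for illustration and is not named in the abstract.

      # Minimal sketch: a smaller, special-purpose model for classical Chinese,
      # run locally rather than prompted as a generative LLM. The model id is
      # an assumption (a publicly shared classical-Chinese BERT), not taken
      # from the abstract.
      from transformers import pipeline

      fill = pipeline("fill-mask", model="ethanyt/guwenbert-base")

      # Predict a masked character in a well-known Analects clause; candidates
      # come back ranked with probabilities, so the behaviour is easy to
      # inspect and to evaluate against a gold standard.
      masked = "學而時習之，不亦" + fill.tokenizer.mask_token + "乎。"
      for candidate in fill(masked):
          print(candidate["token_str"], round(candidate["score"], 3))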