Machine Learning at the Flatiron Institute: Danqi Chen
Title: Optimizing Data Use for Pre-training Language Models
Abstract: Modern language models are trained on massive, unstructured data consisting of trillions of tokens, typically obtained by crawling the web. In this talk, Danqi argues that we are still in the early stages of understanding pre-training data and unlocking its full potential, and that more effective use of data can lead to both more compute-efficient and more capable language models. Danqi will present several perspectives on improving data curation, focusing on three general techniques. First, quality filtering aims to train classifiers that can distinguish high-quality from low-quality documents at scale (QuRating). Second, domain curation focuses on developing taxonomies of web data and leveraging domain mixing strategies to enhance pre-training (WebOrganizer). Third, Danqi will introduce a simple pre-training approach that conditions on metadata, which both accelerates training and improves model steerability (MeCo). Together, these efforts highlight the importance of optimizing the use of pre-training data and point toward a more data-centric paradigm for training future language models.
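The metadata-conditioning idea mentioned above can be illustrated with a short sketch. This is not the MeCo implementation itself; the prompt template, the `cooldown` flag, and the function name are illustrative assumptions about how one might prepend source metadata (e.g., a document's URL) to training text, with a late-stage phase that omits the metadata so the model does not require it at inference time:

```python
def condition_on_metadata(doc: str, url: str, cooldown: bool = False) -> str:
    """Prepend source metadata to a training document.

    Illustrative sketch only: the exact template and training schedule
    here are assumptions, not the method described in the talk.
    """
    if cooldown:
        # Hypothetical "cooldown" phase: train on plain documents so the
        # model does not depend on metadata being present at inference.
        return doc
    # Conditioning phase: expose the model to the document's provenance.
    return f"URL: {url}\n\n{doc}"


# Example usage with a made-up document and URL:
conditioned = condition_on_metadata(
    "Web-crawled text goes here.", "https://example.org/page"
)
plain = condition_on_metadata(
    "Web-crawled text goes here.", "https://example.org/page", cooldown=True
)
```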