Machine Learning at the Flatiron Institute: Danqi Chen
Title: Optimizing Data Use for Pre-training Language Models
Abstract: Modern language models are trained on massive, unstructured data consisting of trillions of tokens, typically obtained by crawling the web. In this talk, Danqi argues that we are still in the early stages of understanding pre-training data and unlocking its full potential, and that more effective use of data can lead to both more compute-efficient and more capable language models. Danqi will present several perspectives on improving data curation, focusing on three general techniques. First, quality filtering aims to train classifiers that can distinguish high-quality from low-quality documents at scale (QuRating). Second, domain curation focuses on developing taxonomies of web data and leveraging domain mixing strategies to enhance pre-training (WebOrganizer). Third, Danqi will introduce a simple pre-training approach that conditions on metadata, which both accelerates training and improves model steerability (MeCo). Together, these efforts highlight the importance of optimizing the use of pre-training data and point toward a more data-centric paradigm for training future language models.
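The metadata-conditioning idea mentioned above can be illustrated with a short sketch. This is not the MeCo implementation itself; the prompt template, the `cooldown` flag, and the function name are illustrative assumptions about how one might prepend source metadata (e.g., a document's URL) to training text, with a late-stage phase that omits the metadata so the model does not require it at inference time:

```python
def condition_on_metadata(doc: str, url: str, cooldown: bool = False) -> str:
    """Prepend source metadata to a training document.

    Illustrative sketch only: the exact template and training schedule
    here are assumptions, not the method described in the talk.
    """
    if cooldown:
        # Hypothetical "cooldown" phase: train on plain documents so the
        # model does not depend on metadata being present at inference.
        return doc
    # Conditioning phase: expose the model to the document's provenance.
    return f"URL: {url}\n\n{doc}"


# Example usage with a made-up document and URL:
conditioned = condition_on_metadata(
    "Web-crawled text goes here.", "https://example.org/page"
)
plain = condition_on_metadata(
    "Web-crawled text goes here.", "https://example.org/page", cooldown=True
)
```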