Training sample-efficient language models on developmentally plausible corpora
Humans are remarkably efficient language learners compared to language models. It would take a human a million years to accumulate as much linguistic input as most popular large language models are trained on, yet we learn language in just a few years. BabyLM is a competition and workshop devoted to data-efficient language learning. How do humans acquire language from so little input? How can we build more data-efficient language models? How do we design language models to be better cognitive models? The competition’s goal is to challenge the community to answer these questions. We also aim to democratize research into LM pretraining, which has recently become increasingly concentrated in a few large industry groups. While a small academic group will not build the next ChatGPT, there are still valuable research questions about pretraining that we can address with a small budget.
In 2023, I co-founded and co-organized the BabyLM Challenge with a group of amazing collaborators, now including:
Leshem Choshen (IBM Research, MIT)
Ryan Cotterell (ETH Zurich)
Michael Hu (NYU)
Tal Linzen (NYU)
Aaron Mueller (Northeastern)
Candace Ross (Meta AI)
Alex Warstadt (ETH Zurich, UCSD)
Ethan Wilcox (ETH Zurich, Georgetown)
Adina Williams (Meta AI)
Chengxu Zhuang (MIT)
The competition was the shared task for CoNLL 2023, and we had 31 talks and posters presented at CoNLL in Singapore! Prior to organizing the challenge, I worked on training and evaluating data-limited language models (Warstadt et al., 2020; Zhang et al., 2021), and I published a position piece arguing for the value of data-limited models to cognitive science (Warstadt & Bowman, 2022). We published our proceedings (Warstadt et al., 2023), and we summarized the findings of all the submissions in a review article (Warstadt et al., 2023). I also contributed to submissions that incorporate visual input (Amariucai & Warstadt, 2023) and auditory input (Wolf et al., 2023) into LM training. More recently, some of the organizers have published a position piece arguing for the value of small language models (Wilcox et al., 2024). The 2024 BabyLM Challenge adds a Multimodal Track and will be co-located with CoNLL in Miami.
References
2024
Under Review
Bigger is not always better: The importance of human-scale language modeling for psycholinguistics
Ethan Gotlieb Wilcox, Michael Hu, Aaron Mueller, and 6 more authors
Rapid progress in machine learning for natural language processing has the potential to transform debates about how humans learn language. However, the learning environments and biases of current artificial learners and humans diverge in ways that weaken the impact of the evidence obtained from learning simulations. For example, today’s most effective neural language models are trained on roughly one thousand times the amount of linguistic data available to a typical child. To increase the relevance of learnability results from computational models, we need to train model learners without significant advantages over humans. If an appropriate model successfully acquires some target linguistic knowledge, it can provide a proof of concept that the target is learnable in a hypothesized human learning scenario. Plausible model learners will enable us to carry out experimental manipulations to make causal inferences about variables in the learning environment, and to rigorously test poverty-of-the-stimulus-style claims arguing for innate linguistic knowledge in humans. Comparable experiments will never be possible with human subjects due to practical and ethical considerations. So far, attempts to deprive current models of unfair advantages fail to achieve human-level grammatical knowledge. But before we can justifiably conclude that language learning requires more prior domain-specific knowledge than current models possess, we must first explore other training regimes as ways to make computational learners more efficient at learning from limited linguistic input.
2021
ACL
When Do You Need Billions of Words of Pretraining Data?
Yian Zhang, Alex Warstadt, Xiaocheng Li, and 1 more author
In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Aug 2021
NLP is currently dominated by language models like RoBERTa which are pretrained on billions of words. But what exact knowledge or skills do Transformer LMs learn from large-scale pretraining that they cannot learn from less data? To explore this question, we adopt five styles of evaluation: classifier probing, information-theoretic probing, unsupervised relative acceptability judgments, unsupervised language model knowledge probing, and fine-tuning on NLU tasks. We then draw learning curves that track the growth of these different measures of model ability with respect to pretraining data volume using the MiniBERTas, a group of RoBERTa models pretrained on 1M, 10M, 100M and 1B words. We find that these LMs require only about 10M to 100M words to learn to reliably encode most syntactic and semantic features we test. They need a much larger quantity of data in order to acquire enough commonsense knowledge and other skills required to master typical downstream NLU tasks. The results suggest that, while the ability to encode linguistic features is almost certainly necessary for language understanding, it is likely that other, unidentified, forms of knowledge are the major drivers of recent improvements in language understanding among large pretrained models.
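To make the “unsupervised relative acceptability judgment” evaluation concrete, here is a minimal sketch of how such a judgment can be computed: the model is counted as correct on a minimal pair if it assigns a higher log-probability to the acceptable sentence. The use of GPT-2 and the example pair are illustrative assumptions for simplicity; the paper itself evaluates RoBERTa-style masked models, which call for a related pseudo-likelihood scoring procedure.

```python
# Minimal-pair acceptability sketch (illustrative; model and sentences are not from the paper).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_log_prob(sentence: str) -> float:
    """Summed log-probability the LM assigns to a sentence."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean cross-entropy
        # over the predicted (next-token) positions.
        loss = model(ids, labels=ids).loss
    # Convert the mean back to a total log-probability.
    return -loss.item() * (ids.size(1) - 1)

acceptable = "The keys to the cabinet are on the table."
unacceptable = "The keys to the cabinet is on the table."
print("Prefers acceptable sentence:",
      sentence_log_prob(acceptable) > sentence_log_prob(unacceptable))
```

Accuracy over a benchmark of such pairs is then just the fraction of pairs where the acceptable member wins, which requires no fine-tuning or labeled training data.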
2020
ACL
Learning which features matter: RoBERTa acquires a preference for linguistic generalizations (eventually)
Alex Warstadt, Yian Zhang, Xiaocheng Li, and 2 more authors
In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Nov 2020
One reason pretraining on self-supervised linguistic tasks is effective is that it teaches models features that are helpful for language understanding. However, we want pretrained models to learn not only to represent linguistic features, but also to use those features preferentially during fine-tuning. With this goal in mind, we introduce a new English-language diagnostic set called MSGS (the Mixed Signals Generalization Set), which consists of 20 ambiguous binary classification tasks that we use to test whether a pretrained model prefers linguistic or surface generalizations during fine-tuning. We pretrain RoBERTa models from scratch on quantities of data ranging from 1M to 1B words and compare their performance on MSGS to the publicly available RoBERTa_BASE. We find that models can learn to represent linguistic features with little pretraining data, but require far more data to learn to prefer linguistic generalizations over surface ones. Eventually, with about 30B words of pretraining data, RoBERTa_BASE does begin to demonstrate a linguistic bias with some regularity. We conclude that while self-supervised pretraining is an effective way to learn helpful inductive biases, there is likely room to improve the rate at which models learn which features matter.
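To illustrate what “ambiguous” means in this design, here is a toy sketch in the spirit of MSGS; the features, sentences, and labels are invented for illustration and are not actual MSGS data (MSGS defines its own sets of linguistic and surface features).

```python
# Toy MSGS-style task (invented example, not real MSGS data).
# In the fine-tuning data, a linguistic feature (the sentence contains negation)
# and a surface feature (the sentence contains the word "movie") always co-occur,
# so either generalization explains the labels equally well.
ambiguous_train = [
    ("I did not enjoy the movie.", 1),   # negation: yes, "movie": yes -> 1
    ("The movie was never boring.", 1),  # negation: yes, "movie": yes -> 1
    ("I enjoyed the book.", 0),          # negation: no,  "movie": no  -> 0
    ("The book was delightful.", 0),     # negation: no,  "movie": no  -> 0
]

# Held-out items put the two features in conflict; the model's predictions here
# reveal which generalization it actually adopted during fine-tuning.
disambiguating_test = [
    "I did not enjoy the book.",   # linguistic rule -> 1, surface rule -> 0
    "The movie was delightful.",   # linguistic rule -> 0, surface rule -> 1
]
```

A model that classifies the disambiguating items by the negation rule shows a linguistic bias; one that tracks the word “movie” shows a surface bias.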