BabyLM

Training sample-efficient language models on developmentally plausible corpora

Humans are remarkably efficient language learners compared to language models. It would take a human a million years to accumulate as much linguistic input as most popular large language models receive, yet we learn language in just a few years. BabyLM is a competition and workshop devoted to data-efficient language learning. How do humans acquire language from so little input? How can we build more data-efficient language models? How do we design language models to be better cognitive models? The competition’s goal is to challenge the community to answer these questions. We also aim to democratize research into LM pretraining, which has recently become increasingly concentrated in a few large industry groups. While a small academic group will not build the next ChatGPT, there are still valuable research questions about pretraining that we can address with a small budget.
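
To make that scale gap concrete, here is a rough back-of-envelope sketch in Python. Both figures below are illustrative assumptions (roughly ten million words of input per child per year, roughly ten trillion pretraining tokens for a modern LLM), not numbers taken from the BabyLM call:

    # Back-of-envelope estimate; both constants are illustrative assumptions,
    # not official BabyLM figures.
    WORDS_PER_CHILD_YEAR = 10_000_000           # assumed words of input a child hears per year
    LLM_TRAINING_TOKENS = 10_000_000_000_000    # assumed LLM pretraining budget in tokens

    years_of_input = LLM_TRAINING_TOKENS / WORDS_PER_CHILD_YEAR
    print(f"Human-equivalent exposure: ~{years_of_input:,.0f} years")  # ~1,000,000 years

Even if either assumption is off by an order of magnitude, the gap between human and LM input remains enormous.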

In 2023, I co-founded and co-organized the BabyLM Challenge with a group of amazing collaborators, now including:

  • Leshem Choshen (IBM Research, MIT)
  • Ryan Cotterell (ETH Zurich)
  • Michael Hu (NYU)
  • Tal Linzen (NYU)
  • Aaron Mueller (Northeastern)
  • Candace Ross (Meta AI)
  • Alex Warstadt (ETH Zurich, UCSD)
  • Ethan Wilcox (ETH Zurich, Georgetown)
  • Adina Williams (Meta AI)
  • Chengxu Zhuang (MIT)

The competition was the shared task for CoNLL 2023, and we had 31 talks and posters presented at CoNLL in Singapore! Prior to organizing the challenge, I worked on training and evaluating data-limited language models (Warstadt et al., 2020; Zhang et al., 2021), and I published a position piece arguing for the value of data-limited models to cognitive science (Warstadt & Bowman, 2022). We published our proceedings (Warstadt et al., 2023), and we summarized the findings of all the submissions in a review article (Warstadt et al., 2023). I also contributed to submissions incorporating visual input (Amariucai & Warstadt, 2023) and auditory input (Wolf et al., 2023) into LM training. More recently, some of the organizers have published a position piece arguing for the value of small language models (Wilcox et al., 2024). The 2024 BabyLM Challenge adds a Multimodal Track and will be co-located with CoNLL in Miami.

References

2024

  1. Under Review
    Bigger is not always better: The importance of human-scale language modeling for psycholinguistics
    Ethan Gotlieb Wilcox, Michael Hu, Aaron Mueller, and 6 more authors
    2024

2023

  1. BabyLM
    Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning
    Dec 2023
  2. BabyLM
    Findings of the BabyLM Challenge: Sample-efficient pretraining on developmentally plausible corpora
    Alex Warstadt, Aaron Mueller, Leshem Choshen, and 8 more authors
    In Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, Dec 2023
  3. BabyLM
    Acquiring linguistic knowledge from multimodal input
    Theodor Amariucai and Alexander Scott Warstadt
    In Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, Dec 2023
  4. BabyLM
    WhisBERT: Multimodal text-audio language modeling on 100M words
    Lukas Wolf, Klemen Kotar, Greta Tuckute, and 4 more authors
    In Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, Dec 2023

2022

  1. Book Chapter
    What artificial neural networks can tell us about human language acquisition
    Alex Warstadt and Samuel R. Bowman
    In Algebraic Structures in Natural Language, Dec 2022

2021

  1. ACL
    When Do You Need Billions of Words of Pretraining Data?
    Yian Zhang, Alex Warstadt, Xiaocheng Li, and 1 more author
    In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Aug 2021

2020

  1. EMNLP
    Learning which features matter: RoBERTa acquires a preference for linguistic generalizations (eventually)
    Alex Warstadt, Yian Zhang, Xiaocheng Li, and 2 more authors
    In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Nov 2020