SFO Network

Details

PhD position: Biodiversity Estimation as a Lens into LLM Knowledge Content

Supervising PIs: Prof. Justin Calabrese (CASUS) & Prof. Simon Razniewski (TU Dresden)
Disciplines: AI foundations, statistical ecology

Motivation and research questions:

Foundation models, in particular large language models (LLMs), have significantly advanced AI. A major contributor to their success is internalized knowledge, which, in quantitative terms, is still poorly understood. LLMs memorize significant amounts of factual knowledge; however, no reliable quantification of the extent of this knowledge exists, and known lower bounds (100 M facts) and naïve upper-bound estimates (40 B facts) for frontier models like GPT-4 differ by orders of magnitude. Exhaustively probing LLMs is infeasible for both computational and monetary reasons.

In this project, we explore alternative approaches inspired by the study of biodiversity in ecology. We hypothesize that internalized knowledge in LLMs (hereafter “knowledge diversity”) can be viewed analogously to biodiversity in ecological communities. Ecology has decades of experience in developing both theories to explain biodiversity and statistical approaches to quantify it from limited samples. In particular, named entities in LLMs can, under some circumstances, be considered analogous to individuals within a species. Furthermore, LLM characteristics that correlate with increased knowledge diversity, including the number of model parameters, the size of the training dataset, and the total amount of compute time, can be mapped onto ecological concepts that correlate with increased biodiversity: the number of resource types, the size of the species pool, and the amount of successional time, respectively.

Quantifying biodiversity in ecological communities typically involves estimating the total number of species (i.e., species richness) and the abundance of each species from a limited set of samples. Communities can then be characterized, compared, and ranked in terms of their species richness and patterns of relative species abundance. A myriad of richness and abundance estimators exist in the ecological literature, each making different assumptions and tailored to different types of data.

Limited samples of named entities memorized by an LLM can be readily obtained, which, together with the analogies described above, suggests that existing biodiversity estimation techniques could be leveraged to quantify knowledge diversity in LLMs. However, no existing work explores which biodiversity estimators are most suitable, which estimator assumptions are most plausible for LLMs, how LLMs should be sampled to maximize compatibility with biodiversity estimators, or which existing biodiversity estimators are computationally efficient enough to handle the large samples that can be extracted from LLMs.

Computer science frequently supplies theory and techniques that accelerate discovery in domain sciences like ecology. In this project, however, we look to a domain science for inspiration in quantifying the knowledge diversity of LLMs, which is a frontier problem in computer science. This approach could, for the first time, enable reliable estimates of the factual knowledge seen and memorized by LLMs, and therefore advance our understanding of the potential and limitations of these models. For ecology, it could provide a stress test for estimation techniques on very large datasets, lead to improvements in the computational algorithms underpinning biodiversity estimators, and emphasize the wider relevance of statistical ecology beyond the core conservation science domain. This work therefore has the potential to significantly advance both computer science and ecology.

Apply until February 22 here: https://ideas.helmholtz.de/apply/

PhD Position Statistics and AI

