PhD Position Statistics and AI
TU Dresden and CASUS
Germany
Deadline: Feb 22, 2026
Details
PhD position: Biodiversity Estimation as a Lens into LLM Knowledge Content
Supervising PIs: Prof. Justin Calabrese (CASUS) & Prof. Simon
Razniewski (TU Dresden)
Disciplines: AI foundations, statistical ecology
Motivation and research questions: Foundation models, in particular
large language models (LLMs), have significantly advanced AI. A major
contributor to their success is internalized knowledge, which, in
quantitative terms, is still poorly understood. LLMs memorize
significant amounts of factual knowledge; however, there exists no
reliable quantification of the extent of this knowledge, with orders
of magnitude between known lower bounds (100 M facts) and naïve
estimates of upper bounds (40 B facts) for frontier models like GPT-4.
Exhaustively probing LLMs is infeasible for both computational and
monetary reasons.
In this project, we explore alternative approaches inspired by the
study of biodiversity in ecology. We hypothesize that internalized
knowledge in LLMs (hereafter “knowledge diversity”) can be viewed
analogously to biodiversity in ecological communities. Ecology has
decades of experience in developing both theories to explain
biodiversity, and statistical approaches to quantify it from limited
samples. In particular, named entities in LLMs can, under some
circumstances, be considered analogous to individuals within a
species. Furthermore, LLM characteristics that correlate with
increased knowledge diversity, including the number of model
parameters, the size of the training dataset, and the total amount of
compute time, can also be mapped onto ecological concepts that
correlate with increased biodiversity, such as the number of resource
types, the size of the species pool, and the amount of successional
time, respectively.
Quantifying biodiversity in ecological communities typically involves
estimating the total number of species (i.e., species richness) and
the abundance of each species from a limited set of samples.
Communities can then be characterized, compared, and ranked in terms
of their species richness and patterns of relative species abundance.
A myriad of richness and abundance estimators exist in the ecological
literature, with each making different assumptions and being tailored
to different types of data. Limited samples of named entities
memorized by an LLM can be readily obtained, which, together with the
above-described analogies, suggests that existing biodiversity
estimation techniques could be leveraged to quantify knowledge
diversity in LLMs. However, no existing work explores which
biodiversity estimators are most suitable, which estimator assumptions
are most plausible for LLMs, how LLMs should be sampled to maximize
compatibility with biodiversity estimators, or which existing
estimators are computationally efficient enough to handle the large
samples that can be extracted from LLMs.
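As a rough illustration of what a richness estimator does, the sketch
below applies the bias-corrected Chao1 estimator, a standard
nonparametric estimator from the ecological literature, to a toy
sample of named entities. It is only an example of the general idea:
the posting does not state which estimators the project will use, and
the function name chao1 and the sample data are hypothetical.

```python
from collections import Counter


def chao1(samples):
    """Bias-corrected Chao1 estimate of total richness.

    `samples` is a sequence of observed items (here: named entities),
    where each distinct item plays the role of a "species".
    """
    counts = Counter(samples)
    s_obs = len(counts)  # number of distinct items observed
    f1 = sum(1 for c in counts.values() if c == 1)  # items seen exactly once
    f2 = sum(1 for c in counts.values() if c == 2)  # items seen exactly twice
    # Many singletons relative to doubletons suggest many unseen items.
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical toy sample of entities elicited from a model.
sample = ["Paris", "Paris", "Paris", "Berlin", "Berlin",
          "Lima", "Oslo", "Quito"]
estimate = chao1(sample)
print(f"observed: {len(set(sample))}, estimated total: {estimate}")
```

In this toy sample, five distinct entities are observed, three of them
only once, so the estimate exceeds the observed count. Whether the
assumptions behind such an estimator hold for samples drawn from an
LLM is one of the open questions named above.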
Computer science frequently supplies theory and techniques that
accelerate discovery in domain sciences like ecology. In this project,
however, we look to a domain science to provide inspiration for
quantifying the knowledge diversity of LLMs, which is a frontier
problem in computer science. This approach could, for the first time,
enable reliable estimates of the factual knowledge seen and memorized
by LLMs, and therefore advance our understanding of the potentials and
limitations of these models. For ecology, it could provide a stress
test for estimation techniques on very large datasets, lead to
improvements in the computational algorithms underpinning biodiversity
estimators, and emphasize the wider relevance of statistical ecology
beyond the core conservation science domain. This work therefore has
the potential to significantly advance both computer science and
ecology.
Apply by February 22, 2026 here: https://ideas.helmholtz.de/apply/