Introduction
Nanobodies, compact single-domain antibody fragments, are increasingly used in therapeutics and diagnostics because of their high specificity and stability. However, jointly optimizing properties such as expression yield and binding affinity remains experimentally costly. Machine learning can accelerate candidate selection, but its effectiveness depends on the quality and diversity of the labeled data. Standard active learning (AL) approaches address this by prioritizing informative samples, but they typically ignore practical constraints that are critical to nanobody development, such as whether a candidate can be expressed at a usable yield.
Methods
We present a multi-objective active learning (MOAL) framework tailored to nanobody discovery. The framework combines predictive models for binding affinity and expression yield with uncertainty estimates derived from model ensembles. Candidate selection is guided by three objectives: informativeness (expected model improvement), feasibility (predicted expression), and performance (predicted binding affinity). To balance trade-offs among these objectives, we apply two evolutionary multi-objective optimization algorithms, NSGA-II (Non-dominated Sorting Genetic Algorithm II) and IBEA (Indicator-Based Evolutionary Algorithm). Optimizing the three objectives jointly enables exploration of diverse, high-potential regions of nanobody sequence space.
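As an illustration only, the three objectives and the trade-off among them can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used in this work: the function names (`score_candidate`, `dominates`, `pareto_front`) are hypothetical, informativeness is assumed to be the summed standard deviation of the ensemble predictions, all objectives are assumed to be oriented so that higher is better, and a single non-dominated filter stands in for the full NSGA-II or IBEA search.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

# (informativeness, feasibility, performance); all three are maximized.
Objectives = Tuple[float, float, float]


def score_candidate(
    affinity_preds: Sequence[float],
    expression_preds: Sequence[float],
) -> Objectives:
    """Map per-ensemble-member predictions for one candidate to three objectives."""
    # Assumption: ensemble disagreement is used as a proxy for informativeness.
    informativeness = pstdev(affinity_preds) + pstdev(expression_preds)
    feasibility = mean(expression_preds)  # predicted expression yield
    performance = mean(affinity_preds)    # predicted binding affinity score
    return (informativeness, feasibility, performance)


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is no worse than b on every objective and better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: List[Objectives]) -> List[int]:
    """Indices of candidates that no other candidate dominates.

    This corresponds to the first front produced by non-dominated sorting;
    crowding distance, crossover, and mutation are deliberately omitted.
    """
    return [
        i
        for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]


if __name__ == "__main__":
    # Toy ensemble predictions (made-up numbers, three ensemble members each).
    candidates = [
        ([8.0, 8.1, 7.9], [0.90, 0.88, 0.92]),  # confident, strong binder
        ([6.0, 7.5, 4.5], [0.50, 0.70, 0.30]),  # uncertain, hence informative
        ([5.0, 5.0, 5.0], [0.40, 0.40, 0.40]),  # weak on every objective
    ]
    scores = [score_candidate(aff, expr) for aff, expr in candidates]
    print("Selected for the next experimental round:", pareto_front(scores))
```

In this toy run the third candidate is dominated by the first and would not be queried, while the first two represent different trade-offs (high predicted performance versus high informativeness) and are both retained.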
Results
We evaluate our framework on a curated dataset of characterized nanobody sequences and a large-scale nanobody repertoire comprising over 10 million candidates. The curated data enable supervised learning, while the repertoire supports broad exploration. Our approach identifies nanobody candidates that are both experimentally viable and model-informative, improving generalization while reducing experimental costs. By avoiding redundant queries and favoring biologically diverse selections, this method supports efficient discovery.
Conclusions
Our domain-aware MOAL approach provides an effective strategy for guiding nanobody selection under multiple constraints. It enables iterative refinement of predictive models while maintaining experimental feasibility. Although it was developed for nanobody engineering, the framework is not specific to nanobodies and should extend to other biological domains that require data-efficient, multi-objective decision-making.