Nucleotide context models outperform protein language models for predicting antibody affinity maturation

Mackenzie M. Johnson, Kevin Sung, Hugh K. Haddox, Ashni A. Vora, Tatsuya Araki, Gabriel D. Victora, Yun S. Song, Julia Fukuyama, Frederick A. Matsen IV

Abstract

Antibodies play a crucial role in adaptive immunity. They develop as B cell receptors (BCRs): membrane-bound forms of antibodies that are expressed on the surfaces of B cells. BCRs are refined through affinity maturation, a process of somatic hypermutation (SHM) and natural selection, to improve binding to an antigen.

Introduction

Antibodies are an essential component of the adaptive immune system. B cell receptors (BCRs) are membrane-bound forms of antibodies that are expressed on the surfaces of B cells. These BCRs are created by V(D)J recombination during B cell development, where V, D, and J gene segments are randomly recombined to create the unique antigen-binding regions.

Materials and methods

The two major analysis steps were data preparation (Fig 1A) and model performance evaluation (Fig 1B and 1C). Data preparation involved separating the sequences of a BCR repertoire into clonal families, clustering together sequences likely to descend from the same naive B cell.
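To make the clustering idea concrete, the following is a minimal sketch of one common heuristic for grouping sequences into putative clonal families. It is an illustration under stated assumptions, not the procedure used in this study: the record keys (`id`, `v_gene`, `j_gene`, `cdr3`), the function names, and the 0.2 distance threshold are all hypothetical. Sequences that share a V gene, J gene, and CDR3 length are joined by single-linkage clustering on CDR3 Hamming distance.

```python
from collections import defaultdict
from itertools import combinations


def hamming_fraction(a: str, b: str) -> float:
    """Fraction of mismatched positions between two equal-length strings."""
    if len(a) == 0:
        return 0.0
    return sum(x != y for x, y in zip(a, b)) / len(a)


def cluster_clonal_families(seqs, max_dist=0.2):
    """Group records into putative clonal families.

    Each record is a dict with the hypothetical keys 'id', 'v_gene',
    'j_gene', and 'cdr3'. The threshold max_dist is an assumed value.
    """
    # Step 1: only sequences with the same V gene, J gene, and CDR3
    # length are candidates for the same family.
    buckets = defaultdict(list)
    for rec in seqs:
        key = (rec["v_gene"], rec["j_gene"], len(rec["cdr3"]))
        buckets[key].append(rec)

    families = []
    for records in buckets.values():
        # Step 2: single-linkage clustering within a bucket via union-find.
        parent = list(range(len(records)))

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]  # path compression
                i = parent[i]
            return i

        for i, j in combinations(range(len(records)), 2):
            dist = hamming_fraction(records[i]["cdr3"], records[j]["cdr3"])
            if dist <= max_dist:
                parent[find(i)] = find(j)

        groups = defaultdict(list)
        for i, rec in enumerate(records):
            groups[find(i)].append(rec["id"])
        families.extend(sorted(ids) for ids in groups.values())
    return sorted(families)
```

Single linkage is used here only because it is simple to state; a real pipeline would typically rely on a dedicated clonal inference tool rather than a fixed threshold.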

Results

Overview of data preparation and EPAM model evaluation

Our objective was to assess how well various models predict amino acid substitutions over the course of affinity maturation in BCR repertoires. To meet this objective, we devised EPAM, a unified framework for evaluating models of affinity maturation. Following Spisak et al. [14], data preparation for EPAM involved identifying clonal families from BCR repertoires and reconstructing the phylogenetic tree with ancestral sequences (Fig 1A).
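As a rough illustration of what evaluating substitution predictions on a parent-child sequence pair can look like, the sketch below scores a model at the sites where the child differs from its parent. This is a hypothetical example and not the EPAM implementation: the function names, the `site_probs` input format, and the accuracy definition are assumptions made for illustration only.

```python
def substituted_sites(parent: str, child: str):
    """Return the 0-based positions where the child differs from the parent."""
    if len(parent) != len(child):
        raise ValueError("parent and child must have the same length")
    return [i for i, (p, c) in enumerate(zip(parent, child)) if p != c]


def substitution_accuracy(parent: str, child: str, site_probs):
    """Fraction of substituted sites where the model's top choice is correct.

    site_probs[i] is an assumed input format: a dict mapping each amino
    acid to the model's predicted probability for child site i. At each
    substituted site, the prediction is the most probable amino acid
    other than the parent's. Returns None if there are no substitutions.
    """
    sites = substituted_sites(parent, child)
    if not sites:
        return None
    hits = 0
    for i in sites:
        # Exclude the parent amino acid: we condition on a substitution.
        candidates = {aa: p for aa, p in site_probs[i].items() if aa != parent[i]}
        predicted = max(candidates, key=candidates.get)
        hits += predicted == child[i]
    return hits / len(sites)
```

Conditioning on the observed substitution sites separates the question of which amino acid appears from the question of where substitutions occur, which a full evaluation would need to score as well.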

Discussion

Researchers have used a diversity of approaches to model antibody sequences. The ME models of affinity maturation directly probe the factors governing antibody evolution by estimating a limited set of parameters specific to the biological process. These models are usually evaluated on how the inclusion of parameters improves the likelihood of the reconstructed evolutionary process [17–19].

Acknowledgments

We thank the following for sharing processed human data with us: Corey Watson, Easton Ford, Melissa Smith, Oscar Rodriguez, and other members of the Watson lab for providing the Ford et al. and Rodriguez et al. data sets, and Thomas MacCarthy for the Tang et al. data set. We thank members of the Matsen group as well as the Victora and DeWitt labs and Oxford Protein Informatics Group for helpful discussions.

Citation: Johnson MM, Sung K, Haddox HK, Vora AA, Araki T, Victora GD, et al. (2025) Nucleotide context models outperform protein language models for predicting antibody affinity maturation. PLoS Comput Biol 21(12): e1013758. https://doi.org/10.1371/journal.pcbi.1013758

Editor: Alexey Onufriev, Virginia Polytechnic Institute and State University, UNITED STATES OF AMERICA

Received: July 28, 2025; Accepted: November 17, 2025; Published: December 1, 2025

Copyright: © 2025 Johnson et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: Our models, inference code, and analysis scripts are publicly available on GitHub at https://github.com/matsengrp/epam. All processed data files are publicly available on Zenodo at https://doi.org/10.5281/zenodo.17353498.

Funding: This work was supported by National Institutes of Health grants R01-AI146028 (PI Matsen), R56-HG013117 (PI Song) and R01-HG013117 (PI Song). Scientific Computing Infrastructure at Fred Hutch was funded by ORIP grant S10OD028685. Frederick Matsen is an investigator of the Howard Hughes Medical Institute. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: I have read the journal’s policy and the authors of this manuscript have the following competing interests: G.D.V. is an advisor for and holds stock of the Vaccine Company. T.A. is currently an employee of Pfizer Inc.