Nanobody binder design with BoltzGen#
Designing de novo nanobody binders to PDL1 using BoltzGen and ProteinMPNN
In this tutorial, we’ll demonstrate how to use the OpenProtein.AI Python client to design a nanobody that binds to a target protein. We refer to the designed protein as the binder and the protein being bound as the target.
Unlike general protein binder design, for nanobody design we utilize a scaffold-based approach. We will start with an existing nanobody framework and essentially “graft” new Complementarity-Determining Regions (CDRs) onto it. This ensures that our designed binder retains the stable, expressible framework regions of a natural nanobody while tailoring the binding loops (CDRs) to our specific target. The design process consists of four main steps:
1. Query Specification: Specify the design problem as a “query”, including:
   - the target protein (PDL1)
   - the nanobody scaffold (framework regions)
   - the lengths of the CDR loops to be designed
2. Structure Generation: Generate plausible structures for the nanobody binder CDRs using BoltzGen (Stark et al., 2025), a generative model capable of designing backbone structures using scaffolds.
3. Sequence Design: Design sequences for the generated CDRs using ProteinMPNN (Dauparas et al., 2022), an inverse folding model that generates the binder sequence conditioned on the generated structure.
4. In Silico Validation: Validate the designs by predicting their structures with Boltz-2 (Passaro et al., 2025) and computing metrics to select the best candidates for experimental testing.
Prerequisites#
To run this tutorial, you’ll need a Python environment containing the following packages:
openprotein_python>=0.10.1
molviewspec (for structure visualization)
See the Python client installation instructions for more info.
Additionally, you should have your credentials set up in ~/.openprotein/config.toml to authenticate with the OpenProtein.AI API.
Import necessary packages#
[ ]:
from dataclasses import dataclass
import numpy as np
import numpy.typing as npt
import pandas as pd
from tqdm import tqdm
import molviewspec as mvs
from molviewspec.nodes import RepresentationTypeT
import openprotein
from openprotein.molecules import Protein, Complex, Structure
Connect to OpenProtein.AI#
[2]:
session = openprotein.connect()
print("✅ Successfully connected to the OpenProtein.AI API!")
✅ Successfully connected to the OpenProtein.AI API!
Step 1: Binder design problem specification#
Specify the nanobody binder design problem
In this tutorial, we will design nanobody binders against Programmed Death-Ligand 1 (PDL1). This design problem is adapted from the BoltzGen study (Stark et al., 2025).
We will design a nanobody binder, which is a single-domain antibody fragment derived from heavy-chain-only antibodies found in camelids. To restrict the BoltzGen structure generator to designing nanobody binders specifically, we will use a scaffold. The scaffold defines an overall framework structure and specific designable regions for binding to the target. This means we will keep the framework regions of an existing, well-behaved nanobody constant, while redesigning the Complementarity-Determining Regions (CDRs) to bind our specific target. Later, we will use ProteinMPNN to fill in the CDR sequences conditioned on the generated binder CDR structures.
The scaffold approach also provides a convenient way to generate other binder types, such as scFvs or Fabs.
Note: We are using ProteinMPNN for inverse folding the CDRs here, but we could use other generative models such as PoET-2 instead. PoET-2 is unique in its ability to use prompt context sequences that define specific families of proteins, e.g., human VHH domains, that can be used to guide the generated proteins towards specific characteristics. This is especially useful if we want to redesign the whole binder sequence (framework regions and CDRs). Because we are only redesigning the CDRs here, we’ll use ProteinMPNN for simplicity. Learn more about PoET-2.
Step 1.1: Define and visualize the target#
We fetch the structure of the PDL1 target from PDB.
[ ]:
structure = Structure.from_pdb_id("7uxq") # PDL1
first_complex = structure[0]
target = first_complex.get_protein(chain_id="A")
print(target.formatted(include=("sequence", "structure_mask")))
Unknown amino acid at position 1: ACE
Residue at position 1 missing backbone atom=N
Residue at position 1 missing backbone atom=CA
Unknown amino acid at position 1: ACE
Residue at position 1 missing backbone atom=N
Residue at position 1 missing backbone atom=CA
0 SEQUENCE MAFTVTVPKDLYVVEYGSNMTIECKFPVEKQLDLAALIVYWEMEDKNIIQFVHGEEDLKV
0 STRUCTURE_MASK ^
60 SEQUENCE QHSSYRQRARLLKDQLSLGNAALQITDVKLQDAGVYRCMISYGGADYKRITVKVNAPYAA
60 STRUCTURE_MASK
120 SEQUENCE ALEHHHHHH
120 STRUCTURE_MASK ^
Remove the C-terminal linker and His tag (the last 8 residues, LEHHHHHH), since we don’t want the binder to target them.
[4]:
target = target[:len(target) - 8]
print(target.formatted(include=("sequence", "structure_mask")))
0 SEQUENCE MAFTVTVPKDLYVVEYGSNMTIECKFPVEKQLDLAALIVYWEMEDKNIIQFVHGEEDLKV
0 STRUCTURE_MASK ^
60 SEQUENCE QHSSYRQRARLLKDQLSLGNAALQITDVKLQDAGVYRCMISYGGADYKRITVKVNAPYAA
60 STRUCTURE_MASK
120 SEQUENCE A
120 STRUCTURE_MASK
Step 1.2: Define the nanobody scaffold#
We will use the structure from PDB ID 7eow as our scaffold. This scaffold is from caplacizumab, a humanized VHH. We’ll use the structure and framework regions of this VHH as the scaffold for our binder, but design new CDRs for binding to our target.
First, we load the protein.
[5]:
structure = Structure.from_pdb_id("7eow")
first_complex = structure[0]
binder_scaffold = first_complex.get_protein(chain_id="B")
print(binder_scaffold.formatted(include=("sequence", "structure_mask")))
0 SEQUENCE MEVQLVESGGGLVQPGGSLRLSCAASGRTFSYNPMGWFRQAPGKGRELVAAISRTGGSTY
0 STRUCTURE_MASK ^
60 SEQUENCE YPDSVEGRFTISRDNAKRMVYLQMNSLRAEDTAVYYCAAAGVRAEDGRVRTLPSEYTFWG
60 STRUCTURE_MASK
120 SEQUENCE QGTQVTVSSLEHHHHHH
120 STRUCTURE_MASK ^^^^^^^^
Clean the scaffold#
We remove the leading Methionine (M) and the trailing Histidine tag (His-tag) because they are expression artifacts. The structure mask above confirms these residues have no defined structure.
[6]:
binder_scaffold = binder_scaffold[~binder_scaffold.get_structure_mask()]
print(binder_scaffold.formatted(include=("sequence", "structure_mask")))
0 SEQUENCE EVQLVESGGGLVQPGGSLRLSCAASGRTFSYNPMGWFRQAPGKGRELVAAISRTGGSTYY
0 STRUCTURE_MASK
60 SEQUENCE PDSVEGRFTISRDNAKRMVYLQMNSLRAEDTAVYYCAAAGVRAEDGRVRTLPSEYTFWGQ
60 STRUCTURE_MASK
120 SEQUENCE GTQVTVSS
120 STRUCTURE_MASK
Define the framework and binding regions#
We want to use the nanobody structure as a framework, but design new CDRs for binding to our target. To do this, we keep the framework regions (FWRs) constant but replace the CDRs with designable regions.
In this example, we will set CDR1 length to 10 (increased from 9), CDR2 to 8 (same as scaffold), and CDR3 to 20 (decreased from 21). The X characters represent residues to be designed.
[ ]:
# Extract framework regions and print their sequences for verification
fwr1 = binder_scaffold[:25]
fwr2 = binder_scaffold[34:51]
fwr3 = binder_scaffold[59:97]
fwr4 = binder_scaffold[118:]
print("FWR1:", fwr1.sequence.decode())
print("FWR2:", fwr2.sequence.decode())
print("FWR3:", fwr3.sequence.decode())
print("FWR4:", fwr4.sequence.decode())
FWR1: EVQLVESGGGLVQPGGSLRLSCAAS
FWR2: GWFRQAPGKGRELVAAI
FWR3: YPDSVEGRFTISRDNAKRMVYLQMNSLRAEDTAVYYCA
FWR4: GQGTQVTVSS
[ ]:
# Create the binder scaffold by combining the frameworks and CDRs of the required length
cdr1_length = 10
cdr2_length = 8
cdr3_length = 20
binder_scaffold = (
    fwr1
    + "X" * cdr1_length
    + fwr2
    + "X" * cdr2_length
    + fwr3
    + "X" * cdr3_length
    + fwr4
)
print(binder_scaffold.formatted(include=("sequence", "structure_mask")))
0 SEQUENCE EVQLVESGGGLVQPGGSLRLSCAASXXXXXXXXXXGWFRQAPGKGRELVAAIXXXXXXXX
0 STRUCTURE_MASK ^^^^^^^^^^ ^^^^^^^^
60 SEQUENCE YPDSVEGRFTISRDNAKRMVYLQMNSLRAEDTAVYYCAXXXXXXXXXXXXXXXXXXXXGQ
60 STRUCTURE_MASK ^^^^^^^^^^^^^^^^^^^^
120 SEQUENCE GTQVTVSS
120 STRUCTURE_MASK
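The region boundaries can be sanity-checked with plain string slicing, independent of the OpenProtein.AI client. The sketch below copies the cleaned 7eow scaffold sequence and the slice indices from the cells above, reports the native CDR lengths, and builds the same X-masked sequence. The names `FWR_BOUNDS`, `NEW_CDR_LENGTHS`, `native_cdr_lengths`, and `build_masked_sequence` are illustrative and not part of the client.

```python
# Cleaned 7eow chain B sequence, copied from the output above.
SCAFFOLD = (
    "EVQLVESGGGLVQPGGSLRLSCAASGRTFSYNPMGWFRQAPGKGRELVAAISRTGGSTYY"
    "PDSVEGRFTISRDNAKRMVYLQMNSLRAEDTAVYYCAAAGVRAEDGRVRTLPSEYTFWGQ"
    "GTQVTVSS"
)

# Half-open [start, end) framework boundaries used in the tutorial.
FWR_BOUNDS = [(0, 25), (34, 51), (59, 97), (118, len(SCAFFOLD))]
# New CDR lengths chosen in the tutorial (CDR1, CDR2, CDR3).
NEW_CDR_LENGTHS = [10, 8, 20]


def native_cdr_lengths(bounds):
    """CDR lengths are the gaps between consecutive framework regions."""
    return [nxt[0] - cur[1] for cur, nxt in zip(bounds, bounds[1:])]


def build_masked_sequence(sequence, bounds, cdr_lengths):
    """Join framework regions with runs of 'X' marking designable CDRs."""
    frameworks = [sequence[start:end] for start, end in bounds]
    parts = [frameworks[0]]
    for fwr, n in zip(frameworks[1:], cdr_lengths):
        parts.append("X" * n)
        parts.append(fwr)
    return "".join(parts)


native = native_cdr_lengths(FWR_BOUNDS)
masked = build_masked_sequence(SCAFFOLD, FWR_BOUNDS, NEW_CDR_LENGTHS)
print("native CDR lengths:", native)
print("masked scaffold length:", len(masked))
```

Changing a value in `NEW_CDR_LENGTHS` only changes the number of `X` placeholders; the framework slices stay the same.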
Step 1.3: Configure relative positioning (Groups)#
By default, all residues are in “group 0”, which implies their relative positions are fixed. Since we want the nanobody to dock against the target (i.e., its position relative to the target is not fixed), we assign the scaffold to a different group (group 1).
[9]:
# Visualize current groups (all 0)
print("\nVisualize target groups:")
print(target.formatted(("sequence", "group")))
print("\nVisualize binder scaffold groups:")
print(binder_scaffold.formatted(("sequence", "group")))
Visualize target groups:
0 SEQUENCE MAFTVTVPKDLYVVEYGSNMTIECKFPVEKQLDLAALIVYWEMEDKNIIQFVHGEEDLKV
0 GROUP 000000000000000000000000000000000000000000000000000000000000
60 SEQUENCE QHSSYRQRARLLKDQLSLGNAALQITDVKLQDAGVYRCMISYGGADYKRITVKVNAPYAA
60 GROUP 000000000000000000000000000000000000000000000000000000000000
120 SEQUENCE A
120 GROUP 0
Visualize binder scaffold groups:
0 SEQUENCE EVQLVESGGGLVQPGGSLRLSCAASXXXXXXXXXXGWFRQAPGKGRELVAAIXXXXXXXX
0 GROUP 000000000000000000000000000000000000000000000000000000000000
60 SEQUENCE YPDSVEGRFTISRDNAKRMVYLQMNSLRAEDTAVYYCAXXXXXXXXXXXXXXXXXXXXGQ
60 GROUP 000000000000000000000000000000000000000000000000000000000000
120 SEQUENCE GTQVTVSS
120 GROUP 00000000
[10]:
# Set scaffold to group 1 to unfix relative position
binder_scaffold = binder_scaffold.set_group(1)
print("\nUpdated binder scaffold groups:")
print(binder_scaffold.formatted(("sequence", "group")))
Updated binder scaffold groups:
0 SEQUENCE EVQLVESGGGLVQPGGSLRLSCAASXXXXXXXXXXGWFRQAPGKGRELVAAIXXXXXXXX
0 GROUP 111111111111111111111111111111111111111111111111111111111111
60 SEQUENCE YPDSVEGRFTISRDNAKRMVYLQMNSLRAEDTAVYYCAXXXXXXXXXXXXXXXXXXXXGQ
60 GROUP 111111111111111111111111111111111111111111111111111111111111
120 SEQUENCE GTQVTVSS
120 GROUP 11111111
Finally, we combine the target and the binder scaffold into a single Complex query.
[11]:
query = target & binder_scaffold
print("Query type", type(query))
print("Chains in query:", list(query.get_chains().keys()))
print("\nVisualize target (Chain A):")
print(query.get_protein(chain_id="A").formatted(include=("sequence", "structure_mask")))
print("\nVisualize binder scaffold (Chain B):")
print(query.get_protein(chain_id="B").formatted(include=("sequence", "structure_mask")))
Query type <class 'openprotein.molecules.complex.Complex'>
Chains in query: ['A', 'B']
Visualize target (Chain A):
0 SEQUENCE MAFTVTVPKDLYVVEYGSNMTIECKFPVEKQLDLAALIVYWEMEDKNIIQFVHGEEDLKV
0 STRUCTURE_MASK ^
60 SEQUENCE QHSSYRQRARLLKDQLSLGNAALQITDVKLQDAGVYRCMISYGGADYKRITVKVNAPYAA
60 STRUCTURE_MASK
120 SEQUENCE A
120 STRUCTURE_MASK
Visualize binder scaffold (Chain B):
0 SEQUENCE EVQLVESGGGLVQPGGSLRLSCAASXXXXXXXXXXGWFRQAPGKGRELVAAIXXXXXXXX
0 STRUCTURE_MASK ^^^^^^^^^^ ^^^^^^^^
60 SEQUENCE YPDSVEGRFTISRDNAKRMVYLQMNSLRAEDTAVYYCAXXXXXXXXXXXXXXXXXXXXGQ
60 STRUCTURE_MASK ^^^^^^^^^^^^^^^^^^^^
120 SEQUENCE GTQVTVSS
120 STRUCTURE_MASK
Visualize the structure of the query#
It’s always a good idea to visualize the 3D structures of our query to ensure that we’ve created it correctly. First, let’s define a helper function using the molviewspec package to help us create the visualization. It’s not necessary to understand the implementation details.
[12]:
@dataclass(frozen=True)
class ColorSpec:
    chain_id: str
    color: str
    positions: list[int] | None = None
    rep_type: RepresentationTypeT = "cartoon"


def visualize_cif(cif_string: str, colors: list[ColorSpec]):
    builder = mvs.create_builder()
    model = (
        builder.download(url="structure.cif").parse(format="mmcif").model_structure()
    )
    for color_spec in colors:
        component = model.component(
            selector=(
                mvs.ComponentExpression(label_asym_id=color_spec.chain_id)
                if color_spec.positions is None
                else [
                    mvs.ComponentExpression(
                        label_asym_id=color_spec.chain_id, label_seq_id=i
                    )
                    for i in color_spec.positions
                ]
            )
        )
        rep = component.representation(type=color_spec.rep_type)
        rep.color(color=color_spec.color)
    builder.molstar_notebook(
        data={"structure.cif": cif_string},
        width=600,
        height=500,
    )
Now, let’s use the helper function, visualize_cif, to visualize our query.
[13]:
visualize_cif(
    cif_string=query.to_string(),
    colors=[
        ColorSpec(chain_id="A", color="#b5e2f5"),  # target in light blue
        ColorSpec(chain_id="B", color="#f4c30b"),  # binder scaffold in orange
    ],
)
We can see from the visualization that:
The query contains our target (blue) and binder scaffold (orange).
The binder scaffold contains dashed lines indicating the missing structure of the CDRs that we will design in the following step.
The binder does not bind to the target, and adopts an unnatural position far from the target. The correct positioning will also be designed in the following step.
Step 2: Structure generation with BoltzGen#
Generate plausible structures for the nanobody binder
We use BoltzGen, a structure generation model that supports scaffolds, to generate plausible backbone structures for our nanobody that complement the target.
[ ]:
N_STRUCTURES = 100
# 1. Create the BoltzGen job
boltzgen_job = session.models.boltzgen.generate(query=query, N=N_STRUCTURES)
print(boltzgen_job)
# 2. Wait for the job to finish
_ = boltzgen_job.wait_until_done(verbose=True)
# 3. Get results
generated_structures: list[Complex] = boltzgen_job.get()
print(f"Generated {len(generated_structures)} structures.")
job_id='18d53e75-a1d7-4b99-a682-44e2de4dddc3' job_type='/models/boltzgen' status=<JobStatus.PENDING: 'PENDING'> created_date=datetime.datetime(2026, 1, 21, 11, 16, 20, 391931, tzinfo=TzInfo(UTC)) start_date=None end_date=None prerequisite_job_id=None progress_message=None progress_counter=0 sequence_length=None
Waiting: 100%|██████████| 100/100 [26:02<00:00, 15.62s/it, status=SUCCESS]
Generated 100 structures.
Step 3: Sequence generation with ProteinMPNN#
Design the CDR sequences using inverse folding
BoltzGen generates the backbone structure and an initial sequence, but we can often improve the sequence quality (e.g., expression, stability) using an inverse folding model.
We will use ProteinMPNN, an inverse folding model commonly used for generating amino acid sequences likely to fold into a defined backbone structure.
Generate sequences#
We now iterate through our generated structures and redesign the CDR regions. In this example, we are only infilling the sequence of the CDRs while keeping the sequence of the framework region from the scaffold. We utilize the binder_scaffold sequence to mask the CDR regions (where the scaffold has X), ensuring we only redesign those specific areas while preserving the backbone structure.
We could mask the sequence of the complete nanobody to generate the sequence for both the framework and CDRs, but ProteinMPNN will not generate natural-like framework regions. For this, we recommend using PoET-2 with a human or camelid VHH context.
Note: To inverse fold multichain complexes with ProteinMPNN, we could use either a Complex object containing multiple chains or a single Protein object that joins the target and binder with a linker (e.g., three repeats of GGGGS). Below, we demonstrate the latter approach, since it is the only one compatible with PoET-2. For an example of the former approach, see the RFdiffusion binder design tutorial.
[ ]:
N_SEQS_PER_STRUCTURE = 10
# 1. Create the jobs
gen_jobs = []
for generated_structure in tqdm(
    generated_structures, mininterval=1.0, desc="Creating jobs"
):
    # 1a. Create protein for inverse folding by joining the generated target and binder
    #     with a linker
    target_for_inverse_folding = generated_structure.get_protein(chain_id="A")
    binder_for_inverse_folding = (
        generated_structure.get_protein(chain_id="B")
        .copy()
        # use the scaffold framework sequence
        .set_sequence(binder_scaffold.sequence)
        # mask the side chain atoms since we want to generate these sequences
        .mask_structure(side_chain_only=True)
    )
    linker = "GGGGS" * 3
    protein_for_inverse_folding = (
        target_for_inverse_folding + linker + binder_for_inverse_folding
    )
    # 1b. Create the sequence generation job and store the job
    gen_job = session.models.proteinmpnn.generate(
        query=protein_for_inverse_folding,
        num_samples=N_SEQS_PER_STRUCTURE,
        seed=42,  # for reproducibility
    )
    gen_jobs.append(gen_job)
# 2. Wait for the jobs to finish
for gen_job in tqdm(gen_jobs, mininterval=1.0, desc="Waiting for jobs"):
    _ = gen_job.wait_until_done()
    assert gen_job.status == "SUCCESS"
Creating jobs: 100%|██████████| 100/100 [01:49<00:00, 1.09s/it]
Waiting for jobs: 100%|██████████| 100/100 [09:16<00:00, 5.56s/it]
We collect all the designed sequences into a dataframe for further analysis:
[ ]:
records = []
for i, gen_job in enumerate(tqdm(gen_jobs, mininterval=1.0)):
    for j, proteinmpnn_result in enumerate(gen_job.get()):
        # The result includes target + linker + binder.
        full_sequence = proteinmpnn_result.sequence
        # We need to extract just the binder sequence from the end.
        designed_binder_sequence = full_sequence[-len(binder_scaffold):]
        records.append(
            {
                "design_idx": i * N_SEQS_PER_STRUCTURE + j,
                "structure_idx": i,
                "sequence_idx": j,
                "score": proteinmpnn_result.score.mean().item(),
                "sequence": designed_binder_sequence,
            }
        )
df = pd.DataFrame.from_records(records).set_index(["structure_idx", "sequence_idx"])
df.head()
100%|██████████| 100/100 [00:34<00:00, 2.87it/s]
| structure_idx | sequence_idx | design_idx | score | sequence |
|---|---|---|---|---|
| 0 | 0 | 0 | 0.7589 | EVQLVESGGGLVQPGGSLRLSCAASGDANWEKLCMGWFRQAPGKGR... |
| 0 | 1 | 1 | 0.7935 | EVQLVESGGGLVQPGGSLRLSCAASGDANFAKLCMGWFRQAPGKGR... |
| 0 | 2 | 2 | 0.7774 | EVQLVESGGGLVQPGGSLRLSCAASGDAVFEKLCMGWFRQAPGKGR... |
| 0 | 3 | 3 | 0.7570 | EVQLVESGGGLVQPGGSLRLSCAASGSANFSKLCFGWFRQAPGKGR... |
| 0 | 4 | 4 | 0.7914 | EVQLVESGGGLVQPGGSLRLSCAASGDANFEKLGMGWFRQAPGKGR... |
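Because only the CDRs were masked, every designed binder should match the scaffold at all non-`X` positions. The sketch below shows one way to check this on plain strings, using a shortened toy target and scaffold rather than the real sequences. The helper names `extract_binder` and `framework_preserved` are hypothetical, and the sketch assumes sequences are `str` (decode first if the client returns `bytes`).

```python
def extract_binder(full_sequence: str, binder_length: int) -> str:
    """Take the binder from the end of a target + linker + binder sequence."""
    return full_sequence[-binder_length:]


def framework_preserved(masked_scaffold: str, designed: str) -> bool:
    """True if `designed` equals the scaffold wherever the scaffold is not 'X'."""
    if len(masked_scaffold) != len(designed):
        return False
    return all(s == "X" or s == d for s, d in zip(masked_scaffold, designed))


# Toy example (not the real PDL1 or nanobody sequences).
toy_target = "MAFTV"
toy_linker = "GGGGS" * 3
toy_scaffold = "EVQXXXGW"   # three designable positions
toy_design = "EVQDANGW"     # CDR filled in, framework untouched
toy_full = toy_target + toy_linker + toy_design

binder = extract_binder(toy_full, len(toy_scaffold))
print(binder, framework_preserved(toy_scaffold, binder))
# A design that mutates a framework position fails the check.
print(framework_preserved(toy_scaffold, "EAQDANGW"))
```

Slicing from the end means the check does not depend on the target or linker length.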
Step 4: In silico validation#
Validate designs using structure prediction
Finally, we validate our designs by predicting the structure of the designed sequences using Boltz-2. We will predict the complex structure and compute metrics to filter for high-confidence designs.
Predict structures with Boltz-2#
We predict the structures of our designs using Boltz-2 (Passaro et al., 2025).
Typically, to validate in silico binder designs, we predict the target-binder complex structure and check for consistency with the original designed structure and binding interface. Because the structure of the target is known, but the binder structure is not, we usually run structure prediction in single sequence mode for both the binder and target sequences using only the target structure as a template. Single sequence mode means that no multiple sequence alignment is used for the binder. This is important to efficiently screen large numbers of binder designs where homology search is a bottleneck.
However, because we do not yet support templates (coming soon!), in this tutorial we’ll use an MSA for the target instead of a template; this generally achieves the same objective, allowing the model to accurately recapitulate the target’s known structure.
Below, we compute an MSA for the target, and use it to predict the structure of all designed sequences; this generally takes about 50 minutes.
[17]:
# 1. Compute MSA for target to use for folding
target_msa = session.align.create_msa(target.sequence)
# 2. Create the complexes to fold
complexes_to_fold = []
for i, generated_structure in enumerate(generated_structures):  # for each structure
    for j in range(N_SEQS_PER_STRUCTURE):
        designed_binder_sequence = df.loc[(i, j)]["sequence"]
        # named `complex_to_fold` to avoid shadowing Python's built-in `complex`
        complex_to_fold = Complex(
            {"A": Protein(target.sequence), "B": Protein(designed_binder_sequence)}
        )
        # set MSAs to use for structure prediction
        # note that `binder.msa = Protein.single_sequence_mode` needs to be set
        # explicitly when running structure prediction without MSAs
        complex_to_fold.get_protein(chain_id="A").msa = target_msa
        complex_to_fold.get_protein(chain_id="B").msa = Protein.single_sequence_mode
        complexes_to_fold.append(complex_to_fold)
# 3. Create the job
fold_job = session.fold.boltz_2.fold(complexes_to_fold)
print(fold_job)
# 4. Wait for the job to finish
_ = fold_job.wait_until_done(verbose=True)
num_records=1000 job_id='2d21c594-77e7-4a41-a7d1-4aafc5607be2' job_type=<JobType.embeddings_fold: '/embeddings/fold'> status=<JobStatus.PENDING: 'PENDING'> created_date=datetime.datetime(2026, 1, 21, 11, 54, 13, 291057, tzinfo=TzInfo(UTC)) start_date=None end_date=None prerequisite_job_id=None progress_message=None progress_counter=0 sequence_length=None
Waiting: 100%|██████████| 100/100 [47:56<00:00, 28.77s/it, status=SUCCESS]
We retrieve the predicted structures as a list of Complex objects:
[18]:
predicted_structures: list[Complex] = []
for structure in fold_job.get(verbose=True):
    # each prediction is a Structure object
    assert isinstance(structure, Structure)
    # since we only make one prediction per design, we extract the `Complex` of that one
    # prediction only, and append that to our list of predicted structures
    predicted_structures.append(structure[0])
Retrieving: 100%|██████████| 1000/1000 [01:01<00:00, 16.20it/s]
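This tutorial looks up each prediction by a flat index, `i * N_SEQS_PER_STRUCTURE + j`, which relies on the predictions being in the same order in which the complexes were built. A minimal sketch of that bookkeeping follows; the helper names `flat_index` and `unflatten_index` are illustrative only.

```python
N_SEQS_PER_STRUCTURE = 10


def flat_index(structure_idx: int, sequence_idx: int, n: int = N_SEQS_PER_STRUCTURE) -> int:
    """Position of design (structure_idx, sequence_idx) in a structure-major list."""
    return structure_idx * n + sequence_idx


def unflatten_index(design_idx: int, n: int = N_SEQS_PER_STRUCTURE) -> tuple[int, int]:
    """Inverse mapping: recover (structure_idx, sequence_idx) from a flat index."""
    return divmod(design_idx, n)


print(flat_index(33, 1))
print(unflatten_index(659))
```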
Let’s visualize the first predicted structure to check that it looks reasonable:
[ ]:
visualize_cif(
    predicted_structures[0].to_string(),
    colors=[
        ColorSpec(chain_id="A", color="#b5e2f5"),  # target in blue
        ColorSpec(chain_id="B", color="#f4c30b"),  # binder in orange
    ],
)
As desired, the predicted structure contains the target, and the binder close to the target chain.
In addition to the predicted structures, we also retrieve the predicted aligned errors (PAEs), which we will use for computing metrics below. The PAE is a structure prediction confidence metric that has been highly effective at identifying successful binders.
[20]:
predicted_paes: list[npt.NDArray[np.floating]] = fold_job.get_pae()
Filter and select designs by metrics#
We compute, filter, and select designs using standard structure prediction metrics and thresholds adapted from the RFdiffusion (Watson et al., 2023) and BoltzGen studies.
| Metric | Description | Ideal Value |
|---|---|---|
| RMSD | Measures how closely the predicted structure of the entire complex matches the generated structure. | < 2.5 Å |
| iPAE | Confidence that the binder forms an interface with the target. | < 10 |
| Binder RMSD | Measures how closely the predicted structure of just the binder matches the generated structure. | < 1.0 Å |
| Binder pLDDT | Confidence in the predicted structure of the binder. | > 80 |
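For intuition about the RMSD metrics, the sketch below computes an RMSD after optimal rigid superposition using the Kabsch algorithm in NumPy. This is a generic illustration on synthetic coordinates; it is an assumption that it resembles what the client's `rmsd` method does, and details such as which atoms are compared may differ. The function name `kabsch_rmsd` is illustrative.

```python
import numpy as np


def kabsch_rmsd(P: np.ndarray, Q: np.ndarray) -> float:
    """RMSD between (N, 3) point sets P and Q after optimally superimposing P onto Q."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Remove translation by centering both point sets.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    # Optimal rotation from the SVD of the covariance matrix.
    H = Pc.T @ Qc
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (reflection).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    P_aligned = Pc @ R.T
    return float(np.sqrt(np.mean(np.sum((P_aligned - Qc) ** 2, axis=1))))


# Synthetic check: Q is a rotated and translated copy of P.
rng = np.random.default_rng(0)
P = rng.normal(size=(20, 3))
theta = 0.7
Rz = np.array(
    [
        [np.cos(theta), -np.sin(theta), 0.0],
        [np.sin(theta), np.cos(theta), 0.0],
        [0.0, 0.0, 1.0],
    ]
)
Q = P @ Rz.T + np.array([1.0, -2.0, 3.0])

aligned_rmsd = kabsch_rmsd(P, Q)
naive_rmsd = float(np.sqrt(np.mean(np.sum((P - Q) ** 2, axis=1))))
print(f"aligned RMSD: {aligned_rmsd:.2e}, unaligned RMSD: {naive_rmsd:.2f}")
```

The aligned RMSD is essentially zero because the two point sets differ only by a rigid motion, while the unaligned value is large. This is why structures are superimposed before comparison.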
Compute metrics#
Below, we compute these metrics and collate the metrics and designed sequences into a dataframe for further analysis.
[21]:
records = []  # collect metrics and designed sequences into a list of records
for i, generated_structure in enumerate(tqdm(generated_structures, mininterval=1.0)):
    for j in range(N_SEQS_PER_STRUCTURE):
        predicted_structure = predicted_structures[i * N_SEQS_PER_STRUCTURE + j]
        # compute overall rmsd
        rmsd = predicted_structure.rmsd(generated_structure)
        # compute ipae
        pae = predicted_paes[i * N_SEQS_PER_STRUCTURE + j].squeeze(0)
        ipae0 = np.mean(pae[: len(target), len(target) :])
        ipae1 = np.mean(pae[len(target) :, : len(target)])
        ipae = (ipae0 + ipae1) / 2
        # compute binder metrics
        generated_binder = generated_structure.get_protein(chain_id="B")
        predicted_binder = predicted_structure.get_protein(chain_id="B")
        binder_rmsd = predicted_binder.rmsd(generated_binder)
        binder_plddt = predicted_binder.plddt.mean()
        # get dataframe row containing designed sequence
        row = df.loc[(i, j)]
        # record all relevant data
        records.append(
            {
                "design_idx": row["design_idx"],
                "structure_idx": i,
                "sequence_idx": j,
                "rmsd": rmsd,
                "ipae": ipae,
                "binder_rmsd": binder_rmsd,
                "binder_plddt": binder_plddt,
                "score": row["score"],
                "sequence": row["sequence"],
            }
        )
df = pd.DataFrame.from_records(records).set_index(["structure_idx", "sequence_idx"])
df.head()
100%|██████████| 100/100 [00:01<00:00, 99.55it/s]
[21]:
| structure_idx | sequence_idx | design_idx | rmsd | ipae | binder_rmsd | binder_plddt | score | sequence |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 7.577425 | 5.727729 | 1.943376 | 93.676575 | 0.7589 | EVQLVESGGGLVQPGGSLRLSCAASGDANWEKLCMGWFRQAPGKGR... |
| 0 | 1 | 1 | 2.838705 | 6.132021 | 1.090485 | 87.045631 | 0.7935 | EVQLVESGGGLVQPGGSLRLSCAASGDANFAKLCMGWFRQAPGKGR... |
| 0 | 2 | 2 | 3.169565 | 5.058994 | 1.452003 | 91.964874 | 0.7774 | EVQLVESGGGLVQPGGSLRLSCAASGDAVFEKLCMGWFRQAPGKGR... |
| 0 | 3 | 3 | 9.350628 | 9.426828 | 1.407688 | 94.753342 | 0.7570 | EVQLVESGGGLVQPGGSLRLSCAASGSANFSKLCFGWFRQAPGKGR... |
| 0 | 4 | 4 | 2.436349 | 8.677398 | 1.413848 | 87.227753 | 0.7914 | EVQLVESGGGLVQPGGSLRLSCAASGDANFEKLGMGWFRQAPGKGR... |
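To make the iPAE calculation above concrete, here is a small sketch on a synthetic PAE matrix. It assumes, as the metric code does, that target residues come first along both axes, followed by binder residues. The function name `interface_pae` and the toy values are illustrative.

```python
import numpy as np


def interface_pae(pae: np.ndarray, n_target: int) -> float:
    """Mean PAE over the two off-diagonal (target-binder) blocks of a PAE matrix."""
    pae = np.asarray(pae, dtype=float)
    target_to_binder = pae[:n_target, n_target:].mean()
    binder_to_target = pae[n_target:, :n_target].mean()
    return float((target_to_binder + binder_to_target) / 2)


# Toy complex: 3 target residues followed by 2 binder residues.
n_target, n_binder = 3, 2
pae = np.full((n_target + n_binder, n_target + n_binder), 1.0)  # intra-chain errors
pae[:n_target, n_target:] = 4.0  # target rows, binder columns
pae[n_target:, :n_target] = 8.0  # binder rows, target columns

ipae = interface_pae(pae, n_target)
print(ipae)
```

The intra-chain blocks (value 1.0 here) do not contribute, so a confidently folded binder that is poorly placed relative to the target still receives a high iPAE.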
Filter and select designs by metrics#
We start by filtering the designs based on the ideal metric thresholds.
[ ]:
df_filtered = df[
    (df["rmsd"] < 2.5)
    & (df["ipae"] < 10)
    & (df["binder_rmsd"] < 1.0)
    & (df["binder_plddt"] > 80)
]
print("# designs passing filters", len(df_filtered))
print(
    "# unique structures passing filters",
    df_filtered.index.get_level_values("structure_idx").nunique(),
)
# designs passing filters 250
# unique structures passing filters 77
Looks like we have a good number of designs meeting the ideal metric thresholds!
Next, we rank the designs based on iPAE to prioritize designs with high confidence of interaction. We’ll also select just the top design per unique structure, to select for a diverse set of binders.
[ ]:
df_selected = (
    # rank by ipae
    df_filtered.reset_index().sort_values(by="ipae")
    # select best sequence per structure
    .groupby("structure_idx", sort=False).first()
    # set dataframe index
    .reset_index().set_index(["structure_idx", "sequence_idx"])
)
df_selected.head(3)
| structure_idx | sequence_idx | design_idx | rmsd | ipae | binder_rmsd | binder_plddt | score | sequence |
|---|---|---|---|---|---|---|---|---|
| 33 | 1 | 331 | 1.469114 | 3.434022 | 0.583415 | 97.080017 | 0.7707 | EVQLVESGGGLVQPGGSLRLSCAASGDFDFSTYSFGWFRQAPGKGR... |
| 65 | 9 | 659 | 1.772557 | 3.734464 | 0.782987 | 95.126297 | 0.7117 | EVQLVESGGGLVQPGGSLRLSCAASGNVNFSLLKAGWFRQAPGKGR... |
| 22 | 2 | 222 | 1.485924 | 3.788646 | 0.728534 | 96.831818 | 0.7933 | EVQLVESGGGLVQPGGSLRLSCAASGSVNFSLLMMGWFRQAPGKGR... |
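The filter-then-rank pattern can be tried without running any jobs. The sketch below applies the same thresholds and the same "best iPAE per structure" selection to a small hand-made table; all values are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    [
        # structure_idx, sequence_idx, rmsd, ipae, binder_rmsd, binder_plddt
        (0, 0, 1.5, 4.0, 0.6, 95.0),
        (0, 1, 1.2, 3.0, 0.7, 96.0),
        (1, 0, 3.0, 2.0, 0.5, 97.0),  # fails the complex RMSD cutoff
        (1, 1, 2.0, 5.0, 0.9, 85.0),
        (2, 0, 2.0, 3.5, 1.4, 90.0),  # fails the binder RMSD cutoff
    ],
    columns=[
        "structure_idx", "sequence_idx", "rmsd", "ipae", "binder_rmsd", "binder_plddt",
    ],
)

# Keep only designs that meet every threshold.
passing = toy[
    (toy["rmsd"] < 2.5)
    & (toy["ipae"] < 10)
    & (toy["binder_rmsd"] < 1.0)
    & (toy["binder_plddt"] > 80)
]

# Sort by iPAE, then keep the first (lowest-iPAE) row for each structure.
selected = (
    passing.sort_values(by="ipae")
    .groupby("structure_idx", sort=False)
    .first()
    .reset_index()
)
print(selected[["structure_idx", "sequence_idx", "ipae"]])
```

Passing `sort=False` to `groupby` keeps the groups in order of first appearance, so the result stays ranked by each structure's best iPAE. Structure 1's lowest-iPAE design is excluded because it fails the RMSD filter, so its second design is selected instead.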
We now have a ranked list of promising binder designs!
Before we send them off for experimental validation, we should visually inspect their structures for any anomalies. For example, below we visualize the predicted structure of the top-ranked design superimposed onto the corresponding generated structure (light gray) and see that it looks reasonable on visual inspection:
[ ]:
# get predicted and generated structure of top design
design_idx, structure_idx = df_selected.reset_index().iloc[0][
    ["design_idx", "structure_idx"]
]
predicted_structure = predicted_structures[design_idx]
generated_structure = generated_structures[structure_idx]
# superimpose predicted structure on the generated structure
predicted_structure = predicted_structure.copy().superimpose_onto(generated_structure)
# visualize
visualize_cif(
    Complex(
        {
            "A_predicted": predicted_structure.get_protein(chain_id="A"),
            "B_predicted": predicted_structure.get_protein(chain_id="B"),
            "A_generated": generated_structure.get_protein(chain_id="A"),
            "B_generated": generated_structure.get_protein(chain_id="B"),
        }
    ).to_string(),
    colors=[
        ColorSpec(chain_id="A_predicted", color="#b5e2f5"),
        ColorSpec(chain_id="B_predicted", color="#f4c30b"),
        ColorSpec(chain_id="A_generated", color="#F2F0EF"),
        ColorSpec(chain_id="B_generated", color="#F2F0EF"),
    ],
)
The second-ranked design looks reasonable as well:
[ ]:
# get predicted and generated structure of the second-best design
design_idx, structure_idx = df_selected.reset_index().iloc[1][
    ["design_idx", "structure_idx"]
]
predicted_structure = predicted_structures[design_idx]
generated_structure = generated_structures[structure_idx]
# superimpose predicted structure on the generated structure
predicted_structure = predicted_structure.copy().superimpose_onto(generated_structure)
# visualize
visualize_cif(
    Complex(
        {
            "A_predicted": predicted_structure.get_protein(chain_id="A"),
            "B_predicted": predicted_structure.get_protein(chain_id="B"),
            "A_generated": generated_structure.get_protein(chain_id="A"),
            "B_generated": generated_structure.get_protein(chain_id="B"),
        }
    ).to_string(),
    colors=[
        ColorSpec(chain_id="A_predicted", color="#b5e2f5"),
        ColorSpec(chain_id="B_predicted", color="#f4c30b"),
        ColorSpec(chain_id="A_generated", color="#F2F0EF"),
        ColorSpec(chain_id="B_generated", color="#F2F0EF"),
    ],
)
After visually confirming the designs and further filtering based on any additional metrics you may have in mind (e.g. metrics relevant to your specific assay), the designs can then be sent off for experimental testing!
Conclusion#
In this walkthrough, we’ve demonstrated how to design nanobody binders for a target of interest using BoltzGen and ProteinMPNN. We validated the designs using in-silico metrics and visualized them to ensure their viability. The top-ranked designs from this workflow can be:
Expressed and purified for experimental validation
Tested for binding affinity
Further optimized through additional rounds of design, for example, with OpenProtein.AI’s property regression models.
Read more about our binder design workflows and other de novo design tools, or see the detailed API references.