Nanobody binder design with BoltzGen#

Designing de novo nanobody binders to PDL1 using BoltzGen and ProteinMPNN

In this tutorial, we’ll demonstrate how to use the OpenProtein.AI Python client to design a nanobody that binds to a target protein. We refer to the designed protein as the binder and the protein being bound as the target.

Unlike general protein binder design, for nanobody design we utilize a scaffold-based approach. We will start with an existing nanobody framework and essentially “graft” new Complementarity-Determining Regions (CDRs) onto it. This ensures that our designed binder retains the stable, expressible framework regions of a natural nanobody while tailoring the binding loops (CDRs) to our specific target. The design process consists of four main steps:

Query Specification: Specify the design problem as a “query”, including
1. the target protein (PDL1)
2. the nanobody scaffold (framework regions)
3. the lengths of the CDR loops to be designed
Structure Generation: Generate plausible structures for the nanobody binder CDRs using BoltzGen (Stark et al., 2025), a generative model capable of designing backbone structures using scaffolds.
Sequence Design: Design sequences for the generated CDRs using ProteinMPNN (Dauparas et al., 2022), an inverse folding model for generating the binder sequence conditioned on the generated structure.
In Silico Validation: Validate the designs by predicting their structures with Boltz-2 (Passaro et al., 2025) and computing metrics to select the best candidates for experimental testing.

Prerequisites#

To run this tutorial, you’ll need a Python environment containing the following packages:

openprotein_python>=0.11.0
molviewspec (for structure visualization)

See the Python client installation instructions for more info.

Additionally, you should have your credentials set up in ~/.openprotein/config.toml to authenticate with the OpenProtein.AI API.

Import necessary packages#

[ ]:

from dataclasses import dataclass

import numpy as np
import numpy.typing as npt
import pandas as pd

from tqdm import tqdm

import molviewspec as mvs
from molviewspec.nodes import RepresentationTypeT

import openprotein
from openprotein.molecules import Protein, Complex, Structure

Connect to OpenProtein.AI#

[2]:

session = openprotein.connect()
print("✅ Successfully connected to the OpenProtein.AI API!")

✅ Successfully connected to the OpenProtein.AI API!

Step 1: Binder design problem specification#

Specify the nanobody binder design problem

In this tutorial, we will design nanobody binders against Programmed Death-Ligand 1 (PDL1). This design problem is adapted from the BoltzGen study (Stark et al., 2025).

We will design a nanobody binder, which is a single-domain antibody fragment derived from heavy-chain-only antibodies found in camelids. To restrict the BoltzGen structure generator to nanobody binders, we will use a scaffold. The scaffold defines an overall framework structure and specific designable regions for binding to the target. This means we will keep the framework regions of an existing, well-behaved nanobody constant, while redesigning the Complementarity-Determining Regions (CDRs) to bind our specific target. Later, we will use ProteinMPNN to fill in the CDR sequences conditioned on the generated binder CDR structures.

The same scaffold-based approach provides a convenient way to generate binders of other types, such as scFvs or Fabs.

Note: We are using ProteinMPNN for inverse folding the CDRs here, but we could use other generative models such as PoET-2 instead. PoET-2 is unique in its ability to use prompt context sequences that define specific families of proteins, e.g., human VHH domains, that can be used to guide the generated proteins towards specific characteristics. This is especially useful if we want to redesign the whole binder sequence (framework regions and CDRs). Because we are only redesigning the CDRs here, we’ll use ProteinMPNN for simplicity. Learn more about PoET-2.

Step 1.1: Define and visualize the target#

We fetch the structure of the PDL1 target from PDB.

[ ]:

structure = Structure.from_pdb_id("7uxq") # PDL1
first_complex = structure[0]
target = first_complex.get_protein(chain_id="A")
print(target.formatted(include=("sequence", "structure_mask")))

Unknown amino acid at position 1: ACE
Residue at position 1 missing backbone atom=N
Residue at position 1 missing backbone atom=CA
Unknown amino acid at position 1: ACE
Residue at position 1 missing backbone atom=N
Residue at position 1 missing backbone atom=CA
0     SEQUENCE       MAFTVTVPKDLYVVEYGSNMTIECKFPVEKQLDLAALIVYWEMEDKNIIQFVHGEEDLKV
0     STRUCTURE_MASK ^

60    SEQUENCE       QHSSYRQRARLLKDQLSLGNAALQITDVKLQDAGVYRCMISYGGADYKRITVKVNAPYAA
60    STRUCTURE_MASK

120   SEQUENCE       ALEHHHHHH
120   STRUCTURE_MASK         ^

We remove the C-terminal His-tag and its two-residue linker (LEHHHHHH, 8 residues in total) because we do not want the binder to target them.

[4]:

target = target[:len(target) - 8]
print(target.formatted(include=("sequence", "structure_mask")))

0     SEQUENCE       MAFTVTVPKDLYVVEYGSNMTIECKFPVEKQLDLAALIVYWEMEDKNIIQFVHGEEDLKV
0     STRUCTURE_MASK ^

60    SEQUENCE       QHSSYRQRARLLKDQLSLGNAALQITDVKLQDAGVYRCMISYGGADYKRITVKVNAPYAA
60    STRUCTURE_MASK

120   SEQUENCE       A
120   STRUCTURE_MASK
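The hard-coded `8` above is the length of the `LEHHHHHH` tail. As a minimal sketch on plain strings (not the OpenProtein.AI `Protein` API), the tail length could instead be detected with a regular expression. The helper name `his_tag_tail_length` and the assumed tag pattern (an optional `LE` linker followed by six or more histidines) are illustrative assumptions; other constructs may carry different tags.

```python
import re

# Assumed tag pattern: an optional "LE" linker followed by >= 6 histidines
# at the C-terminus. This is an illustrative assumption, not a general rule.
HIS_TAG_TAIL = re.compile(r"(?:LE)?H{6,}$")


def his_tag_tail_length(sequence: str) -> int:
    """Return how many C-terminal residues belong to the assumed tag."""
    match = HIS_TAG_TAIL.search(sequence)
    return 0 if match is None else len(match.group(0))


# The last 9 residues of the PDL1 construct printed above.
tail = "ALEHHHHHH"
n_tag = his_tag_tail_length(tail)
trimmed = tail[: len(tail) - n_tag]
print(n_tag, trimmed)
```

The returned length could then replace the literal in a slice such as `target[: len(target) - n_tag]`.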

Step 1.2: Define the nanobody scaffold#

We will use the structure from PDB ID 7eow as our scaffold. This scaffold is from caplacizumab, a humanized VHH. We’ll use the structure and framework regions of this VHH as the scaffold for our binder, but design new CDRs for binding to our target.

First, we load the protein.

[5]:

structure = Structure.from_pdb_id("7eow")
first_complex = structure[0]
binder_scaffold = first_complex.get_protein(chain_id="B")
print(binder_scaffold.formatted(include=("sequence", "structure_mask")))

0     SEQUENCE       MEVQLVESGGGLVQPGGSLRLSCAASGRTFSYNPMGWFRQAPGKGRELVAAISRTGGSTY
0     STRUCTURE_MASK ^

60    SEQUENCE       YPDSVEGRFTISRDNAKRMVYLQMNSLRAEDTAVYYCAAAGVRAEDGRVRTLPSEYTFWG
60    STRUCTURE_MASK

120   SEQUENCE       QGTQVTVSSLEHHHHHH
120   STRUCTURE_MASK          ^^^^^^^^

Clean the scaffold#

We remove the leading methionine (M) and the trailing His-tag with its two-residue linker (LEHHHHHH) because they are expression artifacts. The structure mask above confirms these residues have no defined structure.

[6]:

binder_scaffold = binder_scaffold[~binder_scaffold.get_structure_mask()]
print(binder_scaffold.formatted(include=("sequence", "structure_mask")))

0     SEQUENCE       EVQLVESGGGLVQPGGSLRLSCAASGRTFSYNPMGWFRQAPGKGRELVAAISRTGGSTYY
0     STRUCTURE_MASK

60    SEQUENCE       PDSVEGRFTISRDNAKRMVYLQMNSLRAEDTAVYYCAAAGVRAEDGRVRTLPSEYTFWGQ
60    STRUCTURE_MASK

120   SEQUENCE       GTQVTVSS
120   STRUCTURE_MASK

Define the framework and binding regions#

We want to use the nanobody structure as a framework, but design new CDRs for binding to our target. To do this, we keep the framework regions (FWRs) constant but replace the CDRs with designable regions.

In this example, we will set CDR1 length to 10 (increased from 9), CDR2 to 8 (same as scaffold), and CDR3 to 20 (decreased from 21). The X characters represent residues to be designed.

[ ]:

# Extract framework regions and print their sequences for verification
fwr1 = binder_scaffold[:25]
fwr2 = binder_scaffold[34:51]
fwr3 = binder_scaffold[59:97]
fwr4 = binder_scaffold[118:]
print("FWR1:", fwr1.sequence.decode())
print("FWR2:", fwr2.sequence.decode())
print("FWR3:", fwr3.sequence.decode())
print("FWR4:", fwr4.sequence.decode())

FWR1: EVQLVESGGGLVQPGGSLRLSCAAS
FWR2: GWFRQAPGKGRELVAAI
FWR3: YPDSVEGRFTISRDNAKRMVYLQMNSLRAEDTAVYYCA
FWR4: GQGTQVTVSS

[ ]:

# Build the binder scaffold: framework regions joined by designable CDRs ("X") of the chosen lengths
cdr1_length = 10
cdr2_length = 8
cdr3_length = 20
binder_scaffold = (
    fwr1
    + "X" * cdr1_length
    + fwr2
    + "X" * cdr2_length
    + fwr3
    + "X" * cdr3_length
    + fwr4
)
print(binder_scaffold.formatted(include=("sequence", "structure_mask")))

0     SEQUENCE       EVQLVESGGGLVQPGGSLRLSCAASXXXXXXXXXXGWFRQAPGKGRELVAAIXXXXXXXX
0     STRUCTURE_MASK                          ^^^^^^^^^^                 ^^^^^^^^

60    SEQUENCE       YPDSVEGRFTISRDNAKRMVYLQMNSLRAEDTAVYYCAXXXXXXXXXXXXXXXXXXXXGQ
60    STRUCTURE_MASK                                       ^^^^^^^^^^^^^^^^^^^^

120   SEQUENCE       GTQVTVSS
120   STRUCTURE_MASK
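The framework slice boundaries used above also determine the original CDR lengths (9, 8, and 21). The following is a minimal plain-string sketch, independent of the `Protein` class, that recovers the original CDRs from those boundaries and rebuilds the `X`-masked template. The helper names `original_cdrs` and `build_template` are hypothetical and introduced only for illustration.

```python
# Cleaned caplacizumab VHH sequence as printed earlier in this tutorial.
SCAFFOLD = (
    "EVQLVESGGGLVQPGGSLRLSCAASGRTFSYNPMGWFRQAPGKGRELVAAISRTGGSTYY"
    "PDSVEGRFTISRDNAKRMVYLQMNSLRAEDTAVYYCAAAGVRAEDGRVRTLPSEYTFWGQ"
    "GTQVTVSS"
)

# Framework boundaries (start, end), identical to the slices used above.
FRAMEWORKS = [(0, 25), (34, 51), (59, 97), (118, len(SCAFFOLD))]


def original_cdrs(sequence: str, frameworks: list[tuple[int, int]]) -> list[str]:
    """Return the segments lying between consecutive framework regions."""
    return [
        sequence[prev_end:next_start]
        for (_, prev_end), (next_start, _) in zip(frameworks, frameworks[1:])
    ]


def build_template(
    sequence: str, frameworks: list[tuple[int, int]], cdr_lengths: list[int]
) -> str:
    """Join framework regions with runs of 'X' of the requested lengths."""
    parts = []
    for k, (start, end) in enumerate(frameworks):
        parts.append(sequence[start:end])
        if k < len(cdr_lengths):  # no CDR follows the last framework region
            parts.append("X" * cdr_lengths[k])
    return "".join(parts)


cdrs = original_cdrs(SCAFFOLD, FRAMEWORKS)
template = build_template(SCAFFOLD, FRAMEWORKS, [10, 8, 20])
print([len(c) for c in cdrs])  # original CDR lengths
print(template)
```

Checking the original CDR lengths this way is a quick guard against off-by-one errors in the slice boundaries before submitting a design job.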

Step 1.3: Configure relative positioning (Groups)#

By default, all residues are in “group 0”, which implies their relative positions are fixed. Since we want the nanobody to dock against the target (i.e., its position relative to the target is not fixed), we assign the scaffold to a different group (group 1).

[9]:

# Visualize current groups (all 0)
print("\nVisualize target groups:")
print(target.formatted(("sequence", "group")))
print("\nVisualize binder scaffold groups:")
print(binder_scaffold.formatted(("sequence", "group")))


Visualize target groups:
0     SEQUENCE MAFTVTVPKDLYVVEYGSNMTIECKFPVEKQLDLAALIVYWEMEDKNIIQFVHGEEDLKV
0     GROUP    000000000000000000000000000000000000000000000000000000000000

60    SEQUENCE QHSSYRQRARLLKDQLSLGNAALQITDVKLQDAGVYRCMISYGGADYKRITVKVNAPYAA
60    GROUP    000000000000000000000000000000000000000000000000000000000000

120   SEQUENCE A
120   GROUP    0

Visualize binder scaffold groups:
0     SEQUENCE EVQLVESGGGLVQPGGSLRLSCAASXXXXXXXXXXGWFRQAPGKGRELVAAIXXXXXXXX
0     GROUP    000000000000000000000000000000000000000000000000000000000000

60    SEQUENCE YPDSVEGRFTISRDNAKRMVYLQMNSLRAEDTAVYYCAXXXXXXXXXXXXXXXXXXXXGQ
60    GROUP    000000000000000000000000000000000000000000000000000000000000

120   SEQUENCE GTQVTVSS
120   GROUP    00000000

[10]:

# Set scaffold to group 1 to unfix relative position
binder_scaffold = binder_scaffold.set_group(1)
print("\nUpdated binder scaffold groups:")
print(binder_scaffold.formatted(("sequence", "group")))


Updated binder scaffold groups:
0     SEQUENCE EVQLVESGGGLVQPGGSLRLSCAASXXXXXXXXXXGWFRQAPGKGRELVAAIXXXXXXXX
0     GROUP    111111111111111111111111111111111111111111111111111111111111

60    SEQUENCE YPDSVEGRFTISRDNAKRMVYLQMNSLRAEDTAVYYCAXXXXXXXXXXXXXXXXXXXXGQ
60    GROUP    111111111111111111111111111111111111111111111111111111111111

120   SEQUENCE GTQVTVSS
120   GROUP    11111111

Finally, we combine the target and the binder scaffold into a single Complex query.

[11]:

query = target & binder_scaffold
print("Query type", type(query))
print("Chains in query:", list(query.get_chains().keys()))
print("\nVisualize target (Chain A):")
print(query.get_protein(chain_id="A").formatted(include=("sequence", "structure_mask")))
print("\nVisualize binder scaffold (Chain B):")
print(query.get_protein(chain_id="B").formatted(include=("sequence", "structure_mask")))

Query type <class 'openprotein.molecules.complex.Complex'>
Chains in query: ['A', 'B']

Visualize target (Chain A):
0     SEQUENCE       MAFTVTVPKDLYVVEYGSNMTIECKFPVEKQLDLAALIVYWEMEDKNIIQFVHGEEDLKV
0     STRUCTURE_MASK ^

60    SEQUENCE       QHSSYRQRARLLKDQLSLGNAALQITDVKLQDAGVYRCMISYGGADYKRITVKVNAPYAA
60    STRUCTURE_MASK

120   SEQUENCE       A
120   STRUCTURE_MASK

Visualize binder scaffold (Chain B):
0     SEQUENCE       EVQLVESGGGLVQPGGSLRLSCAASXXXXXXXXXXGWFRQAPGKGRELVAAIXXXXXXXX
0     STRUCTURE_MASK                          ^^^^^^^^^^                 ^^^^^^^^

60    SEQUENCE       YPDSVEGRFTISRDNAKRMVYLQMNSLRAEDTAVYYCAXXXXXXXXXXXXXXXXXXXXGQ
60    STRUCTURE_MASK                                       ^^^^^^^^^^^^^^^^^^^^

120   SEQUENCE       GTQVTVSS
120   STRUCTURE_MASK

Visualize the structure of the query#

It’s always a good idea to visualize the 3D structures of our query to ensure that we’ve created it correctly. First, let’s define a helper function using the molviewspec package to help us create the visualization. It’s not necessary to understand the implementation details.

[12]:

@dataclass(frozen=True)
class ColorSpec:
    chain_id: str
    color: str
    positions: list[int] | None = None
    rep_type: RepresentationTypeT = "cartoon"


def visualize_cif(cif_string: str, colors: list[ColorSpec]):
    builder = mvs.create_builder()
    model = (
        builder.download(url="structure.cif").parse(format="mmcif").model_structure()
    )
    for color_spec in colors:
        component = model.component(
            selector=(
                mvs.ComponentExpression(label_asym_id=color_spec.chain_id)
                if color_spec.positions is None
                else [
                    mvs.ComponentExpression(
                        label_asym_id=color_spec.chain_id, label_seq_id=i
                    )
                    for i in color_spec.positions
                ]
            )
        )
        rep = component.representation(type=color_spec.rep_type)
        rep.color(color=color_spec.color)
    builder.molstar_notebook(
        data={"structure.cif": cif_string},
        width=600,
        height=500,
    )

Now, let’s use the helper function, visualize_cif, to visualize our query.

[13]:

visualize_cif(
    cif_string=query.to_string(),
    colors=[
        ColorSpec(chain_id="A", color="#b5e2f5"),  # target in light blue
        ColorSpec(chain_id="B", color="#f4c30b"),  # binder scaffold in orange
    ],
)

We can see from the visualization that:

The query contains our target (blue) and binder scaffold (orange).
The binder scaffold contains dashed lines indicating the missing structure of the CDRs that we will design in the following step.
The binder does not bind to the target, and adopts an unnatural position far from the target. The correct positioning will also be designed in the following step.

Step 2: Structure generation with BoltzGen#

Generate plausible structures for the nanobody binder

We use BoltzGen, a structure generation model that supports scaffolds, to generate plausible backbone structures for our nanobody that complement the target.

[ ]:

N_STRUCTURES = 100

# 1. Create the BoltzGen job
boltzgen_job = session.models.boltzgen.generate(query=query, N=N_STRUCTURES)
print(boltzgen_job)

# 2. Wait for the job to finish
_ = boltzgen_job.wait_until_done(verbose=True)

# 3. Get results
generated_structures: list[Complex] = boltzgen_job.get()
print(f"Generated {len(generated_structures)} structures.")

job_id='18d53e75-a1d7-4b99-a682-44e2de4dddc3' job_type='/models/boltzgen' status=<JobStatus.PENDING: 'PENDING'> created_date=datetime.datetime(2026, 1, 21, 11, 16, 20, 391931, tzinfo=TzInfo(UTC)) start_date=None end_date=None prerequisite_job_id=None progress_message=None progress_counter=0 sequence_length=None

Waiting: 100%|██████████| 100/100 [26:02<00:00, 15.62s/it, status=SUCCESS]

Generated 100 structures.

Step 3: Sequence generation with ProteinMPNN#

Design the CDR sequences using inverse folding

BoltzGen generates the backbone structure and an initial sequence, but we can often improve the sequence quality (e.g., expression, stability) using an inverse folding model.

We will use ProteinMPNN, an inverse folding model commonly used for generating amino acid sequences likely to fold into a defined backbone structure.

Generate sequences#

We now iterate through our generated structures and redesign the CDR regions. In this example, we are only infilling the sequence of the CDRs while keeping the sequence of the framework region from the scaffold. We utilize the binder_scaffold sequence to mask the CDR regions (where the scaffold has X), ensuring we only redesign those specific areas while preserving the backbone structure.

We could mask the sequence of the complete nanobody to generate the sequence for both the framework and CDRs, but ProteinMPNN will not generate natural-like framework regions. For this, we recommend using PoET-2 with a human or camelid VHH context.

Note: To inverse fold a multichain complex with ProteinMPNN, we can either pass a Complex object containing multiple chains, or a single Protein object that joins the target and binder with a linker (e.g., GGGGS*3). Below, we demonstrate the latter approach because it is the only one compatible with PoET-2. For an example of the former approach, see the RFdiffusion binder design tutorial.

[ ]:

N_SEQS_PER_STRUCTURE = 10

# 1. Create the jobs
gen_jobs = []
for generated_structure in tqdm(
    generated_structures, mininterval=1.0, desc="Creating jobs"
):
    # 1a. Create protein for inverse folding by joining the generated target and binder
    #     with a linker
    target_for_inverse_folding = generated_structure.get_protein(chain_id="A")
    binder_for_inverse_folding = (
        generated_structure.get_protein(chain_id="B")
        .copy()
        # use the scaffold framework sequence
        .set_sequence(binder_scaffold.sequence)
        # mask the side chain atoms since we want to generate these sequences
        .mask_structure(side_chain_only=True)
    )
    linker = "GGGGS" * 3
    protein_for_inverse_folding = (
        target_for_inverse_folding + linker + binder_for_inverse_folding
    )
    # 1b. Create the sequence generation job and store the job
    gen_job = session.models.proteinmpnn.generate(
        query=protein_for_inverse_folding,
        num_samples=N_SEQS_PER_STRUCTURE,
        seed=42,  # for reproducibility
    )
    gen_jobs.append(gen_job)

# 2. Wait for the jobs to finish
for gen_job in tqdm(gen_jobs, mininterval=1.0, desc="Waiting for jobs"):
    _ = gen_job.wait_until_done()
    assert gen_job.status == "SUCCESS"

Creating jobs: 100%|██████████| 100/100 [01:49<00:00,  1.09s/it]
Waiting for jobs: 100%|██████████| 100/100 [09:16<00:00,  5.56s/it]

We collect all the designed sequences into a dataframe for further analysis:

[ ]:

records = []
for i, gen_job in enumerate(tqdm(gen_jobs, mininterval=1.0)):
    for j, proteinmpnn_result in enumerate(gen_job.get()):
        # The result includes target + linker + binder.
        full_sequence = proteinmpnn_result.sequence
        # We need to extract just the binder sequence from the end.
        designed_binder_sequence = full_sequence[-len(binder_scaffold):]
        records.append(
            {
                "design_idx": i * N_SEQS_PER_STRUCTURE + j,
                "structure_idx": i,
                "sequence_idx": j,
                "score": proteinmpnn_result.score.mean().item(),
                "sequence": designed_binder_sequence,
            }
        )
df = pd.DataFrame.from_records(records).set_index(["structure_idx", "sequence_idx"])
df.head()

100%|██████████| 100/100 [00:34<00:00,  2.87it/s]

		design_idx	score	sequence
structure_idx	sequence_idx
0	0	0	0.7589	EVQLVESGGGLVQPGGSLRLSCAASGDANWEKLCMGWFRQAPGKGR...
	1	1	0.7935	EVQLVESGGGLVQPGGSLRLSCAASGDANFAKLCMGWFRQAPGKGR...
	2	2	0.7774	EVQLVESGGGLVQPGGSLRLSCAASGDAVFEKLCMGWFRQAPGKGR...
	3	3	0.7570	EVQLVESGGGLVQPGGSLRLSCAASGSANFSKLCFGWFRQAPGKGR...
	4	4	0.7914	EVQLVESGGGLVQPGGSLRLSCAASGDANFEKLGMGWFRQAPGKGR...
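Because the binder is the last segment of the linked sequence, a negative slice recovers it, and the `X` positions of the scaffold template identify which residues were designed. The sketch below works on plain strings only; the helper names are hypothetical, the target fragment is just the first ten residues of PDL1, and the binder is the visible (truncated) prefix of the first design in the table above rather than a full sequence.

```python
LINKER = "GGGGS" * 3


def join_with_linker(target_seq: str, binder_seq: str, linker: str = LINKER) -> str:
    """Concatenate target, linker, and binder into one chain."""
    return target_seq + linker + binder_seq


def split_binder(full_seq: str, binder_length: int) -> str:
    """Recover the binder, which occupies the last `binder_length` residues."""
    return full_seq[-binder_length:]


def designed_segments(template: str, designed: str) -> list[str]:
    """Return the designed residues for each contiguous run of 'X' in the template."""
    segments, current = [], []
    for t, d in zip(template, designed):
        if t == "X":
            current.append(d)
        elif current:
            segments.append("".join(current))
            current = []
    if current:  # template ended inside a designable run
        segments.append("".join(current))
    return segments


# Prefixes only: FWR1 + CDR1 + the start of FWR2.
template_prefix = "EVQLVESGGGLVQPGGSLRLSCAASXXXXXXXXXXGWFRQAPGKGR"
designed_prefix = "EVQLVESGGGLVQPGGSLRLSCAASGDANWEKLCMGWFRQAPGKGR"

full = join_with_linker("MAFTVTVPKD", designed_prefix)
recovered = split_binder(full, len(designed_prefix))
print(designed_segments(template_prefix, recovered))
```

Listing the designed segments per design makes it easy to confirm that only CDR positions differ from the scaffold.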

Step 4: In silico validation#

Validate designs using structure prediction

Finally, we validate our designs by predicting the structure of the designed sequences using Boltz-2. We will predict the complex structure and compute metrics to filter for high-confidence designs.

Predict structures with Boltz-2#

We predict the structures of our designs using Boltz-2 (Passaro et al., 2025).

Typically, to validate in silico binder designs, we predict the target-binder complex structure and check for consistency with the original designed structure and binding interface. Because the structure of the target is known, but the binder structure is not, we usually run structure prediction in single sequence mode for both the binder and target sequences, using only the target structure as a template. Single sequence mode means that no multiple sequence alignment (MSA) is used for a chain. This matters when screening large numbers of binder designs, where homology search is a bottleneck.

Below, we apply this approach to predict the structures of all designed sequences; this generally takes about 50 minutes.

[ ]:

# 1. Create the complexes to fold
complexes_to_fold = []
for i, generated_structure in enumerate(generated_structures):  # for each structure
    for j in range(N_SEQS_PER_STRUCTURE):
        designed_binder_sequence = df.loc[(i, j)]["sequence"]
        # 1a. Create the complex to fold, containing the target sequence and designed
        #     binder sequence
        # (named `complex_to_fold` to avoid shadowing the built-in `complex`)
        complex_to_fold = Complex(
            {"A": Protein(target.sequence), "B": Protein(designed_binder_sequence)}
        )
        # 1b. Set both the target MSA and the binder MSA to single sequence mode
        complex_to_fold.get_protein(chain_id="A").msa = Protein.single_sequence_mode
        complex_to_fold.get_protein(chain_id="B").msa = Protein.single_sequence_mode
        # 1c. Set the known target structure as a template for the target only
        complex_to_fold.get_protein(chain_id="A").templates = [target]
        complexes_to_fold.append(complex_to_fold)

# 2. Create the job
fold_job = session.fold.boltz_2.fold(complexes_to_fold)
print(fold_job)

# 3. Wait for the job to finish
_ = fold_job.wait_until_done(verbose=True)

num_records=1000 job_id='2d21c594-77e7-4a41-a7d1-4aafc5607be2' job_type=<JobType.embeddings_fold: '/embeddings/fold'> status=<JobStatus.PENDING: 'PENDING'> created_date=datetime.datetime(2026, 1, 21, 11, 54, 13, 291057, tzinfo=TzInfo(UTC)) start_date=None end_date=None prerequisite_job_id=None progress_message=None progress_counter=0 sequence_length=None

Waiting: 100%|██████████| 100/100 [47:56<00:00, 28.77s/it, status=SUCCESS]

We retrieve the predicted structures as a list of Complex objects:

[18]:

predicted_structures: list[Complex] = []
for structure in fold_job.get(verbose=True):
    # each prediction is a Structure object
    assert isinstance(structure, Structure)
    # since we only make one prediction per design, we extract the `Complex` of that one
    # prediction only, and append that to our list of predicted structures
    predicted_structures.append(structure[0])

Retrieving: 100%|██████████| 1000/1000 [01:01<00:00, 16.20it/s]

Let’s visualize the first predicted structure to check that it looks reasonable:

[ ]:

visualize_cif(
    predicted_structures[0].to_string(),
    colors=[
        ColorSpec(chain_id="A", color="#b5e2f5"),  # target in blue
        ColorSpec(chain_id="B", color="#f4c30b"),  # binder in orange
    ],
)

As desired, the predicted structure contains both chains, with the binder positioned close to the target.

In addition to the predicted structures, we also retrieve the predicted aligned errors (PAEs), which we will use for computing metrics below. The PAE is a structure prediction confidence metric that has been highly effective at identifying successful binders.

[20]:

predicted_paes: list[npt.NDArray[np.floating]] = fold_job.get_pae()

Filter and select designs by metrics#

We compute, filter, and select designs using standard structure prediction metrics and thresholds adapted from the RFdiffusion (Watson et al., 2023) and BoltzGen studies.

Metric	Description	Ideal Value
RMSD	Measures how closely the predicted structure of the entire complex matches the generated structure.	< 2.5 Å
iPAE	Mean predicted aligned error between target and binder residues; lower values indicate higher confidence that the binder forms an interface with the target.	< 10 Å
Binder RMSD	Measures how closely the predicted structure of just the binder matches the generated structure.	< 1.0 Å
Binder pLDDT	Confidence in the predicted structure of the binder.	> 80
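To make the iPAE definition concrete, here is a minimal pure-Python sketch on a toy 3×3 PAE matrix (two target residues followed by one binder residue). It averages the two off-diagonal, inter-chain blocks of the matrix. The function name `interface_pae` and the toy values are illustrative assumptions, not output from Boltz-2.

```python
from statistics import mean


def interface_pae(pae: list[list[float]], n_target: int) -> float:
    """Average the two inter-chain blocks of a PAE matrix.

    Rows/columns 0..n_target-1 are assumed to be target residues and the
    remaining ones binder residues (chain order A, then B).
    """
    target_to_binder = [v for row in pae[:n_target] for v in row[n_target:]]
    binder_to_target = [v for row in pae[n_target:] for v in row[:n_target]]
    return (mean(target_to_binder) + mean(binder_to_target)) / 2


# Toy matrix: residues 0-1 are the "target", residue 2 is the "binder".
toy_pae = [
    [0.5, 1.0, 4.0],
    [1.0, 0.5, 6.0],
    [3.0, 5.0, 0.5],
]
ipae = interface_pae(toy_pae, n_target=2)
print(ipae)
```

The intra-chain blocks (top-left and bottom-right) are ignored on purpose, since they reflect confidence within each chain rather than at the interface.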

Compute metrics#

Below, we compute these metrics and collate the metrics and designed sequences into a dataframe for further analysis.

[21]:

records = []  # collect metrics and designed sequences into a list of records
for i, generated_structure in enumerate(tqdm(generated_structures, mininterval=1.0)):
    for j in range(N_SEQS_PER_STRUCTURE):
        predicted_structure = predicted_structures[i * N_SEQS_PER_STRUCTURE + j]
        # compute overall rmsd
        rmsd = predicted_structure.rmsd(generated_structure)
        # compute ipae
        pae = predicted_paes[i * N_SEQS_PER_STRUCTURE + j].squeeze(0)
        ipae0 = np.mean(pae[: len(target), len(target) :])
        ipae1 = np.mean(pae[len(target) :, : len(target)])
        ipae = (ipae0 + ipae1) / 2
        # compute binder metrics
        generated_binder = generated_structure.get_protein(chain_id="B")
        predicted_binder = predicted_structure.get_protein(chain_id="B")
        binder_rmsd = predicted_binder.rmsd(generated_binder)
        binder_plddt = predicted_binder.plddt.mean()
        # get dataframe row containing designed sequence
        row = df.loc[(i, j)]
        # record all relevant data
        records.append(
            {
                "design_idx": row["design_idx"],
                "structure_idx": i,
                "sequence_idx": j,
                "rmsd": rmsd,
                "ipae": ipae,
                "binder_rmsd": binder_rmsd,
                "binder_plddt": binder_plddt,
                "score": row["score"],
                "sequence": row["sequence"],
            }
        )
df = pd.DataFrame.from_records(records).set_index(["structure_idx", "sequence_idx"])
df.head()

100%|██████████| 100/100 [00:01<00:00, 99.55it/s]

[21]:

		design_idx	rmsd	ipae	binder_rmsd	binder_plddt	score	sequence
structure_idx	sequence_idx
0	0	0	7.577425	5.727729	1.943376	93.676575	0.7589	EVQLVESGGGLVQPGGSLRLSCAASGDANWEKLCMGWFRQAPGKGR...
	1	1	2.838705	6.132021	1.090485	87.045631	0.7935	EVQLVESGGGLVQPGGSLRLSCAASGDANFAKLCMGWFRQAPGKGR...
	2	2	3.169565	5.058994	1.452003	91.964874	0.7774	EVQLVESGGGLVQPGGSLRLSCAASGDAVFEKLCMGWFRQAPGKGR...
	3	3	9.350628	9.426828	1.407688	94.753342	0.7570	EVQLVESGGGLVQPGGSLRLSCAASGSANFSKLCFGWFRQAPGKGR...
	4	4	2.436349	8.677398	1.413848	87.227753	0.7914	EVQLVESGGGLVQPGGSLRLSCAASGDANFEKLGMGWFRQAPGKGR...
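The flat `design_idx` used to index `predicted_structures` and `predicted_paes` is `structure_idx * N_SEQS_PER_STRUCTURE + sequence_idx`. A small sketch of the mapping in both directions follows; the helper names `to_flat_index` and `from_flat_index` are hypothetical.

```python
N_SEQS_PER_STRUCTURE = 10


def to_flat_index(
    structure_idx: int, sequence_idx: int, n_seqs: int = N_SEQS_PER_STRUCTURE
) -> int:
    """Map (structure_idx, sequence_idx) to the flat design index."""
    return structure_idx * n_seqs + sequence_idx


def from_flat_index(
    design_idx: int, n_seqs: int = N_SEQS_PER_STRUCTURE
) -> tuple[int, int]:
    """Map a flat design index back to (structure_idx, sequence_idx)."""
    return divmod(design_idx, n_seqs)


print(to_flat_index(0, 4))
print(from_flat_index(331))
```

This mapping only holds if every structure yields exactly `N_SEQS_PER_STRUCTURE` sequences and the fold job preserves submission order, as assumed in this tutorial.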

Apply filters and rank designs#

We start by filtering the designs based on the ideal metric thresholds.

[ ]:

df_filtered = df[
    (df["rmsd"] < 2.5)
    & (df["ipae"] < 10)
    & (df["binder_rmsd"] < 1.0)
    & (df["binder_plddt"] > 80)
]
print("# designs passing filters", len(df_filtered))
print(
    "# unique structures passing filters",
    df_filtered.index.get_level_values("structure_idx").nunique(),
)

# designs passing filters 146
# unique structures passing filters 48

Looks like we have a good number of designs meeting the ideal metric thresholds!
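When few designs survive, it helps to know which threshold is the most restrictive. The sketch below counts how many designs pass each filter individually, using plain Python and eight rows copied from the output tables in this tutorial (not the full set of 1000 designs). The names `THRESHOLDS` and `passes` are illustrative.

```python
import operator

COLUMNS = ("design_idx", "rmsd", "ipae", "binder_rmsd", "binder_plddt")
# Rows copied from the tables shown in this tutorial (a small subset only).
ROWS = [
    (0, 7.577425, 5.727729, 1.943376, 93.676575),
    (1, 2.838705, 6.132021, 1.090485, 87.045631),
    (2, 3.169565, 5.058994, 1.452003, 91.964874),
    (3, 9.350628, 9.426828, 1.407688, 94.753342),
    (4, 2.436349, 8.677398, 1.413848, 87.227753),
    (331, 1.469114, 3.434022, 0.583415, 97.080017),
    (659, 1.772557, 3.734464, 0.782987, 95.126297),
    (222, 1.485924, 3.788646, 0.728534, 96.831818),
]
designs = [dict(zip(COLUMNS, row)) for row in ROWS]

# Same thresholds as the metrics table: (comparison, cutoff) per metric.
THRESHOLDS = {
    "rmsd": (operator.lt, 2.5),
    "ipae": (operator.lt, 10.0),
    "binder_rmsd": (operator.lt, 1.0),
    "binder_plddt": (operator.gt, 80.0),
}


def passes(design: dict, metric: str) -> bool:
    """Check a single metric of one design against its threshold."""
    compare, cutoff = THRESHOLDS[metric]
    return compare(design[metric], cutoff)


pass_counts = {m: sum(passes(d, m) for d in designs) for m in THRESHOLDS}
passing_all = [
    d["design_idx"] for d in designs if all(passes(d, m) for m in THRESHOLDS)
]
print(pass_counts)
print(passing_all)
```

In this small subset, binder RMSD is the most restrictive filter, which suggests inspecting CDR loop agreement first if too few designs pass.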

Next, we rank the designs based on iPAE to prioritize designs with high confidence of interaction. We’ll also select just the top design per unique structure, to select for a diverse set of binders.

[ ]:

df_selected = (
    # rank by ipae
    df_filtered.reset_index().sort_values(by="ipae")
    # select best sequence per structure
    .groupby("structure_idx", sort=False).first()
    # set dataframe index
    .reset_index().set_index(["structure_idx", "sequence_idx"])
)
df_selected.head(3)

		design_idx	rmsd	ipae	binder_rmsd	binder_plddt	score	sequence
structure_idx	sequence_idx
33	1	331	1.469114	3.434022	0.583415	97.080017	0.7707	EVQLVESGGGLVQPGGSLRLSCAASGDFDFSTYSFGWFRQAPGKGR...
65	9	659	1.772557	3.734464	0.782987	95.126297	0.7117	EVQLVESGGGLVQPGGSLRLSCAASGNVNFSLLKAGWFRQAPGKGR...
22	2	222	1.485924	3.788646	0.728534	96.831818	0.7933	EVQLVESGGGLVQPGGSLRLSCAASGSVNFSLLMMGWFRQAPGKGR...
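The sort-then-`groupby(...).first()` idiom keeps, for each structure, the sequence with the lowest iPAE. The same logic is sketched here in plain Python on a handful of iPAE values copied from the tables in this tutorial. Filtering is deliberately omitted, so structure 0 appears even though its designs do not pass the thresholds; `best_per_structure` is a hypothetical helper name.

```python
# (structure_idx, sequence_idx, ipae), copied from this tutorial's tables.
ROWS = [
    (0, 0, 5.727729),
    (0, 1, 6.132021),
    (0, 2, 5.058994),
    (33, 1, 3.434022),
    (65, 9, 3.734464),
    (22, 2, 3.788646),
]


def best_per_structure(
    rows: list[tuple[int, int, float]],
) -> dict[int, tuple[int, float]]:
    """Keep the lowest-iPAE sequence per structure, ordered by ascending iPAE."""
    best: dict[int, tuple[int, float]] = {}
    for structure_idx, sequence_idx, ipae in sorted(rows, key=lambda r: r[2]):
        # The first occurrence after sorting is the best one for that structure.
        best.setdefault(structure_idx, (sequence_idx, ipae))
    return best


selected = best_per_structure(ROWS)
print(list(selected))  # structure indices in ranked order
print(selected[0])     # best sequence for structure 0
```

Keeping one design per backbone trades a few high-scoring near-duplicates for structural diversity in the final selection.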

We now have a ranked list of promising binder designs!

Before we send them off for experimental validation, we should visually inspect their structures for any anomalies. For example, below we visualize the predicted structure of the top-ranked design superimposed onto the corresponding generated structure (light gray) and see that it looks reasonable:

[ ]:

# get predicted and generated structure of top design
design_idx, structure_idx = df_selected.reset_index().iloc[0][
    ["design_idx", "structure_idx"]
]
predicted_structure = predicted_structures[design_idx]
generated_structure = generated_structures[structure_idx]
# superimpose predicted structure on the generated structure
predicted_structure = predicted_structure.copy().superimpose_onto(generated_structure)
# visualize
visualize_cif(
    Complex(
        {
            "A_predicted": predicted_structure.get_protein(chain_id="A"),
            "B_predicted": predicted_structure.get_protein(chain_id="B"),
            "A_generated": generated_structure.get_protein(chain_id="A"),
            "B_generated": generated_structure.get_protein(chain_id="B"),
        }
    ).to_string(),
    colors=[
        ColorSpec(chain_id="A_predicted", color="#b5e2f5"),
        ColorSpec(chain_id="B_predicted", color="#f4c30b"),
        ColorSpec(chain_id="A_generated", color="#F2F0EF"),
        ColorSpec(chain_id="B_generated", color="#F2F0EF"),
    ],
)

The second-ranked design looks reasonable as well:

[ ]:

# get predicted and generated structure of the second best
design_idx, structure_idx = df_selected.reset_index().iloc[1][
    ["design_idx", "structure_idx"]
]
predicted_structure = predicted_structures[design_idx]
generated_structure = generated_structures[structure_idx]
# superimpose predicted structure on the generated structure
predicted_structure = predicted_structure.copy().superimpose_onto(generated_structure)
# visualize
visualize_cif(
    Complex(
        {
            "A_predicted": predicted_structure.get_protein(chain_id="A"),
            "B_predicted": predicted_structure.get_protein(chain_id="B"),
            "A_generated": generated_structure.get_protein(chain_id="A"),
            "B_generated": generated_structure.get_protein(chain_id="B"),
        }
    ).to_string(),
    colors=[
        ColorSpec(chain_id="A_predicted", color="#b5e2f5"),
        ColorSpec(chain_id="B_predicted", color="#f4c30b"),
        ColorSpec(chain_id="A_generated", color="#F2F0EF"),
        ColorSpec(chain_id="B_generated", color="#F2F0EF"),
    ],
)

After visually confirming the designs and further filtering based on any additional metrics you may have in mind (e.g. metrics relevant to your specific assay), the designs can then be sent off for experimental testing!

Conclusion#

In this walkthrough, we’ve demonstrated how to design nanobody binders for a target of interest using BoltzGen and ProteinMPNN. We validated the designs using in silico metrics and visualized them to check that they look plausible. The top-ranked designs from this workflow can be:

Expressed and purified for experimental validation
Tested for binding affinity
Further optimized through additional rounds of design, for example, with OpenProtein.AI’s property regression models.

Read more about our binder design workflows and other de novo design tools here:

or see the detailed API references