Library analysis, comparison, and cost efficiency calculation#

Library reagents and experiments, whether for combinatorial or bespoke libraries, are resource-intensive. The OpenProtein.AI platform provides tools to make informed decisions about cost-performance trade-offs.

This walkthrough shows you how to use the OpenProtein.AI Python client to analyze and compare libraries for expected performance. In this example, we evaluate ML-designed and combinatorial variant libraries and demonstrate how a user can make data-driven decisions around library composition and size to achieve success with a given confidence level.

What you need before getting started#

You need an OpenProtein.AI account and a dataset with sequence and property measurements. This tutorial uses the antibody variable fragment mutagenesis dataset, 14H.

Setting up your environment#

You also need a Jupyter notebook environment for data analysis and visualization:

[ ]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import scipy.stats
from scipy.stats import spearmanr
import json

Next, connect to the OpenProtein.AI backend using your credentials:

[ ]:
import openprotein

with open('secrets.config', 'r') as f:
    config = json.load(f)

session = openprotein.connect(config['username'], config['password'])

Load data for analysis#

We’ll use an antibody variable fragment mutagenesis dataset, 14H. This dataset contains heavy chain variable fragment sequences mutagenized in all three CDRs, with measured binding affinities.

Upload the dataset:

[ ]:
path = 'data/14H_affinities_nonan.csv'
table = pd.read_csv(path)
print(table.shape)
table.head()
(7476, 2)
sequence log_kdnm
0 EVQLVETGGGLVQPGGSLRLSCAASPFELNSYGISWVRQAPGKGPE... 2.696930
1 EVQLVETGGGLVQPGGSLRLSCAASGFDLNSYGISWVRQAPGKGPE... 1.988077
2 EVQLVETGGGLVQPGGSLRLSCAASGFKLNSYGISWVRQAPGKGPE... 2.975487
3 EVQLVETGGGLVQPGGSLRLSCAASGFHLNSYGISWVRQAPGKGPE... 2.381006
4 EVQLVETGGGLVQPGGSLRLSCAASGFTLLSYGISWVRQAPGKGPE... 3.481793

As we can see above, each row of the dataset contains a sequence along with its corresponding ‘log_kdnm’ value.

Post the dataset to train a model for affinity prediction:

[ ]:
dataset = session.data.create(table, '14H_affinities', 'Antibody heavy chain variable region mutagenesis with affinity measurements.')
dataset
AssayMetadata(model_config={'protected_namespaces': ()}, assay_name='14H_affinities', assay_description='Antibody heavy chain variable region mutagenesis with affinity measurements.', assay_id='84840578-3df0-40f7-bc7e-f7f8f7a3aae4', original_filename='assay_data', created_date=datetime.datetime(2023, 9, 18, 8, 16, 51, 87298), num_rows=7476, num_entries=7476, measurement_names=['log_kdnm'], sequence_length=118)

The returned AssayMetadata object describes the uploaded assay, including the model configuration, assay name, description, ID, original filename, creation date, number of rows and entries, measurement names, and sequence length.
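The assay ID is used later when configuring the design job. If you want to reference the other fields programmatically, they can be read from the returned object; the attribute names below mirror the AssayMetadata fields shown above and are an assumption about the client object rather than a documented interface:

[ ]:
# hypothetical attribute access; names mirror the AssayMetadata fields above
print(dataset.id)                 # assay ID, used later as assay_id for the design job
print(dataset.measurement_names)  # ['log_kdnm']
print(dataset.sequence_length)    # 118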

Train a sequence-to-function prediction model#

We’re ready to train the model using our dataset:

[ ]:
model_future = session.train.create_training_job(dataset, 'log_kdnm')
print(model_future.job)
model_future.wait_until_done(verbose=True)
model_config={'protected_namespaces': ()} status=<JobStatus.PENDING: 'PENDING'> job_id='2b587e80-48c2-411f-b737-beea861b9c3b' job_type='/workflow/train' created_date=datetime.datetime(2023, 9, 18, 8, 16, 57, 987677) start_date=None end_date=None prerequisite_job_id='e6606a24-c592-4249-a20e-62a24a3346fa' progress_message=None progress_counter=None num_records=None sequence_length=118
Waiting: 100%|██████████| 100/100 [06:06<00:00,  3.66s/it, status=SUCCESS]
True

Run cross validation#

It’s important to evaluate the trained model’s performance through cross-validation. This process estimates the model’s predictive abilities by simulating how well it would perform on independent datasets.

[ ]:
model_future.crossvalidate()
cv_future = model_future.crossvalidation
print(cv_future)
cv_future.wait_until_done(verbose=True)
model_config={'protected_namespaces': ()} status=<JobStatus.PENDING: 'PENDING'> job_id='6abe8cc9-24a9-4e69-b713-9c14f687b8ff' job_type='/workflow/crossvalidate' created_date=datetime.datetime(2023, 9, 18, 8, 23, 4, 765616) start_date=None end_date=None prerequisite_job_id='2b587e80-48c2-411f-b737-beea861b9c3b' progress_message=None progress_counter=None num_records=None
Waiting: 100%|██████████| 100/100 [00:27<00:00,  3.66it/s, status=SUCCESS]
True

Retrieve the results of the cross-validation process with:

[ ]:
cv_results = cv_future.get()

The next step is to process the cross-validation results and visualize the relationship between predicted and actual affinities on a scatter plot.

[ ]:
y = []
y_mu = []
y_var = []
for r in cv_results:
    y.append(r.y)
    y_mu.append(r.y_mu)
    y_var.append(r.y_var)
y = np.array(y)
y_mu = np.array(y_mu)
y_var = np.array(y_var)

y.shape, y_mu.shape, y_var.shape
((7476,), (7476,), (7476,))
[ ]:
print(spearmanr(y_mu, y))
plt.scatter(y_mu, y, s=2)
plt.axis('equal')
plt.xlabel('Predicted affinity')
plt.ylabel('Actual affinity')
SignificanceResult(statistic=0.6974296283359848, pvalue=0.0)
Text(0, 0.5, 'Actual affinity')
[Figure: scatter plot of predicted versus actual affinity from cross-validation]

Cross-validation performance is strong, with a Spearman rho of 0.697. We’re ready to move on to library design.
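As an optional sanity check on the predicted uncertainties, we can measure how often the measured values fall inside the model's prediction intervals. This is a minimal sketch, assuming (as elsewhere in this walkthrough) that each prediction is Gaussian with mean y_mu and variance y_var:

[ ]:
# optional: coverage of the 95% prediction interval under a Gaussian assumption
lower = y_mu - 1.96*np.sqrt(y_var)
upper = y_mu + 1.96*np.sqrt(y_var)
coverage = np.mean((y >= lower) & (y <= upper))
print(f'95% prediction interval coverage: {coverage:.3f}')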

Use the design API to identify variants with enhanced binding affinity#

[ ]:
# create the design objective
# look for variants with single-digit picomolar affinity, Kd < 10 pM (log10 Kd in nM < -2)
from openprotein.models import ModelCriterion, Criterion, DesignJobCreate

target = -2.0
criterion = [
    [ModelCriterion(
        criterion_type = 'model',
        model_id = model_future.job.job_id,
        measurement_name = 'log_kdnm',
        criterion = Criterion(target=target, weight=1.0, direction='<')
    )]
]

# limit mutation sites to the CDRs
sites = [ 25,  26,  27,  28,  29,  30,  31,  32,  33,  34,  49,  50,  51,
         52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
         98,  99, 100, 101, 102, 103, 104, 105, 106]
site_mask = np.ones(len(table['sequence'].iloc[0]), dtype=bool)
site_mask[sites] = False
sites = np.where(site_mask)[0].tolist()
print(sites)

# run the algorithm for 50 steps with otherwise default parameters
design_params = DesignJobCreate(
    assay_id = dataset.id,
    criteria = criterion,
    num_steps = 50,
    mutation_positions = sites,
)

design_future = session.design.create_design_job(design_params)
print(design_future.job)
design_future.wait_until_done(verbose=True)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117]
model_config={'protected_namespaces': ()} status=<JobStatus.PENDING: 'PENDING'> job_id='254204e0-49de-44ed-9267-85fe573fafc0' job_type='/workflow/design' created_date=datetime.datetime(2023, 9, 18, 11, 55, 49, 578750) start_date=None end_date=None prerequisite_job_id=None progress_message=None progress_counter=None num_records=None
Waiting: 100%|██████████| 100/100 [27:53<00:00, 16.73s/it, status=SUCCESS]
True
[ ]:
# get the design results and parse them into a table for analysis
design_results = design_future.get()

rows = []
for r in design_results:
    scores = r.scores[0][0]
    y_mu = scores.metadata.y_mu
    y_var = scores.metadata.y_var
    score = scores.score
    row = [r.step, score, y_mu, y_var, r.sequence]
    rows.append(row)
design_results = pd.DataFrame(rows, columns=['step', 'score', 'y_mu', 'y_var', 'sequence'])
print(design_results.shape)
design_results.sort_values('score', ascending=False).head()
(12800, 5)
step score y_mu y_var sequence
6400 25 -1.477311 -1.552875 0.360569 EVQLVETGGGLVQPGGSLRLSCAASGFSLNEYGISWVRQAPGKGPE...
9984 39 -1.477311 -1.552875 0.360569 EVQLVETGGGLVQPGGSLRLSCAASGFSLNEYGISWVRQAPGKGPE...
12032 47 -1.477311 -1.552875 0.360569 EVQLVETGGGLVQPGGSLRLSCAASGFSLNEYGISWVRQAPGKGPE...
4864 19 -1.477311 -1.552875 0.360569 EVQLVETGGGLVQPGGSLRLSCAASGFSLNEYGISWVRQAPGKGPE...
10240 40 -1.477311 -1.552875 0.360569 EVQLVETGGGLVQPGGSLRLSCAASGFSLNEYGISWVRQAPGKGPE...

Analyze possible libraries from the design results#

[ ]:
design_results_filtered = design_results.drop_duplicates('sequence').sort_values(['score', 'y_mu'], ascending=[False, True])
print(design_results_filtered.shape)
design_results_filtered.head()
(2208, 5)
step score y_mu y_var sequence
2048 8 -1.477311 -1.552875 0.360569 EVQLVETGGGLVQPGGSLRLSCAASGFSLNEYGISWVRQAPGKGPE...
2049 8 -1.610582 -1.460287 0.410439 EVQLVETGGGLVQPGGSLRLSCAASGFSLNEYGISWVRQAPGKGPE...
2306 9 -1.704086 -1.423261 0.403445 EVQLVETGGGLVQPGGSLRLSCAASGFSLNEYGISWVRQAPGKGPE...
2050 8 -1.753393 -1.368601 0.449600 EVQLVETGGGLVQPGGSLRLSCAASGFSLNEYGISWVRQAPGKGPE...
1792 7 -1.754609 -1.403983 0.399925 EVQLVETGGGLVQPGGSLRLSCAASGFSLNEYGISWVRQAPGKGPE...
[ ]:
library_1000 = design_results_filtered.iloc[:1000]
library_100 = design_results_filtered.iloc[:100]

pred_dist_1000 = scipy.stats.norm(library_1000.y_mu, np.sqrt(library_1000.y_var))
pred_dist_100 = scipy.stats.norm(library_100.y_mu, np.sqrt(library_100.y_var))

Examine the expected distributions over affinities in the designed libraries versus the initial dataset#

[ ]:
# plot the expected results vs. the input data
bins = np.linspace(-4, 6, 101)

format_params = {'alpha': 0.5, 'histtype': 'stepfilled', 'ec': 'k'}
_ = plt.hist(table['log_kdnm'].values, bins=bins, label='Initial Dataset', color='gray', density=True, **format_params)

# plot the predictive distribution over property values
c_lo = pred_dist_1000.cdf(bins[:-1][:, np.newaxis]).mean(axis=1)
c_hi = pred_dist_1000.cdf(bins[1:][:, np.newaxis]).mean(axis=1)
weights = (c_hi - c_lo)/(bins[1:] - bins[:-1])
_ = plt.hist(bins[:-1], bins=bins, weights=weights, label='Designed (Top 1000)', color='lightblue', hatch='/', **format_params)

c_lo = pred_dist_100.cdf(bins[:-1][:, np.newaxis]).mean(axis=1)
c_hi = pred_dist_100.cdf(bins[1:][:, np.newaxis]).mean(axis=1)
weights = (c_hi - c_lo)/(bins[1:] - bins[:-1])
_ = plt.hist(bins[:-1], bins=bins, weights=weights, label='Designed (Top 100)', **format_params)

_ = plt.legend(loc='best')
_ = plt.xlabel('log$_{10}$ $K_D$ nM')
_ = plt.ylabel('Density')
_ = plt.title('Affinity distributions of designed libraries versus the initial dataset')
[Figure: expected affinity distributions of the designed libraries versus the initial dataset]

Examine the predictive distribution over the best affinity in each library#

Above, we looked at the overall expected distribution of affinities, but we may be more interested in the likely affinity of the best sequence in each library. Comparing the libraries of size 100 and 1,000, the overall affinity distributions are expected to be nearly identical, but a larger library is more likely to contain a sequence that meets the design criterion (log KD < -2) and to contain a best sequence with stronger affinity.

Probability of containing a success#

[ ]:
# what is the probability that a variant meeting the design criteria is present in each library?
target = -2
print('target value <', target)

log_one_minus_p_1000 = pred_dist_1000.logsf(target)
p_1000 = 1 - np.exp(np.sum(log_one_minus_p_1000)) # the probability that at least one sequence meets the target value
print(f'Designed (Top 1000), probability of at least one success = {p_1000:.5f}')

log_one_minus_p_100 = pred_dist_100.logsf(target)
p_100 = 1 - np.exp(np.sum(log_one_minus_p_100))
print(f'Designed (Top 100), probability of at least one success = {p_100:.5f}')
target value < -2
Designed (Top 1000), probability of at least one success = 1.00000
Designed (Top 100), probability of at least one success = 1.00000
[ ]:
# in this case, we think that both libraries are virtually guaranteed to contain a successful variant!
# let's take a more fine-grained look

log_one_minus_p = pred_dist_1000.logsf(target)
print('Top k', 'Probability', sep='\t')
for i in range(20):
    p = 1 - np.exp(np.sum(log_one_minus_p[:i+1]))
    print(f'{i+1:2>d}', f'{p:.5f}', sep='\t')
Top k   Probability
1       0.22825
2       0.38242
3       0.49478
4       0.58228
5       0.65454
6       0.71242
7       0.75998
8       0.79950
9       0.83138
10      0.85754
11      0.87952
12      0.89809
13      0.91355
14      0.92650
15      0.93737
16      0.94663
17      0.95449
18      0.96119
19      0.96690
20      0.97175

As we can see, we would only need to test a few sequences to have a high probability of finding one with picomolar affinity. We can also use this kind of analysis to determine how large our library should be for a given success confidence level. For example, if we wanted a library with at least 0.95 probability of containing a variant with picomolar affinity (only a 1/20 chance of failure), we would need to test just the top 17 sequences.
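The same table can be turned into a direct calculation of the required library size. A minimal sketch, reusing pred_dist_1000 and assuming the library is built from the top-ranked designs in order (and that the chosen confidence is reachable within the 1,000 sequences):

[ ]:
# smallest top-k library whose success probability reaches a chosen confidence level
confidence = 0.95
log_one_minus_p = pred_dist_1000.logsf(target)        # log P(miss) for each sequence
p_topk = 1 - np.exp(np.cumsum(log_one_minus_p))       # P(at least one success) for the top k
k_needed = int(np.argmax(p_topk >= confidence)) + 1   # first k that reaches the confidence
print(f'Top {k_needed} sequences needed for {confidence:.0%} confidence of a success')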

[ ]:
# let's perform the same analysis but for a more aggressive affinity target value
target = -3
print('target value <', target)

log_one_minus_p_1000 = pred_dist_1000.logsf(target)
p_1000 = 1 - np.exp(np.sum(log_one_minus_p_1000)) # the probability that at least one sequence meets the target value
print(f'Designed (Top 1000), probability of at least one success = {p_1000:.5f}')

log_one_minus_p_100 = pred_dist_100.logsf(target)
p_100 = 1 - np.exp(np.sum(log_one_minus_p_100))
print(f'Designed (Top 100), probability of at least one success = {p_100:.5f}')
target value < -3
Designed (Top 1000), probability of at least one success = 0.82664
Designed (Top 100), probability of at least one success = 0.31175
[ ]:
# now we can see clearer resolution between library sizes and would need a much larger library to be confident
# that it will include a sequence with log kd < -3

log_one_minus_p = pred_dist_1000.logsf(target)
print('Top k', 'Probability', sep='\t')
for i in range(20):
    p = 1 - np.exp(np.sum(log_one_minus_p[:i+1]))
    print(f'{i+1:2>d}', f'{p:.5f}', sep='\t')
Top k   Probability
1       0.00798
2       0.01604
3       0.02246
4       0.02977
5       0.03541
6       0.03906
7       0.04367
8       0.04668
9       0.05171
10      0.05431
11      0.05778
12      0.06153
13      0.06569
14      0.06907
15      0.07257
16      0.07658
17      0.07929
18      0.08415
19      0.08712
20      0.09223

Predicted distribution over the best affinity in each library#

Next, let’s examine the distribution over the affinity of the best sequence we expect to see in each library.

[ ]:
# estimate best affinities in each library from sampling
num_samples = 10_000

r = np.random.randn(num_samples, 1000)*pred_dist_1000.std() + pred_dist_1000.mean()
sample_best_1000 = r.min(axis=-1)

r = np.random.randn(num_samples, 100)*pred_dist_100.std() + pred_dist_100.mean()
sample_best_100 = r.min(axis=-1)
[ ]:
# plot the expected results vs. the input data
bins = np.linspace(-4, 6, 101)

format_params = {'alpha': 0.5, 'histtype': 'stepfilled', 'ec': 'k'}
_ = plt.hist(table['log_kdnm'].values, bins=bins, label='Initial Dataset', color='gray', density=True, **format_params)

# plot the predictive distribution over property values
_ = plt.hist(sample_best_1000, bins=bins, label='Designed (Top 1000)', color='lightblue', density=True, hatch='/', **format_params)

_ = plt.hist(sample_best_100, bins=bins, label='Designed (Top 100)', density=True, **format_params)

_ = plt.legend(loc='best')
_ = plt.xlabel('log$_{10}$ $K_D$ nM')
_ = plt.ylabel('Density')
_ = plt.title('Affinity distributions of designed libraries versus the initial dataset')
[Figure: distributions of the best affinity in each designed library versus the initial dataset]
[ ]:
# plot out the expected best value versus library size
r = np.random.randn(num_samples, 1000)*pred_dist_1000.std() + pred_dist_1000.mean()
print('Top k', 'E[Best Affinity]', sep='\t')
for i in range(20):
    sample_best = r[:, :i+1].min(axis=-1)
    expected_best_value = sample_best.mean()
    print(f'{i+1:2>d}', f'{expected_best_value:.5f}', sep='\t')
Top k   E[Best Affinity]
1       -1.55617
2       -1.85484
3       -2.00796
4       -2.10551
5       -2.18081
6       -2.23472
7       -2.28024
8       -2.31515
9       -2.34844
10      -2.37435
11      -2.39734
12      -2.41974
13      -2.44148
14      -2.45966
15      -2.47678
16      -2.49347
17      -2.50689
18      -2.52336
19      -2.53482
20      -2.54932

ML-designed sequences are divergent from the initial dataset#

The variants we’ve identified are divergent from anything present in the initial dataset. Let’s calculate some edit distance statistics between our top 1000 sequence library and the sequences in the initial dataset.

[ ]:
x_1000 = library_1000['sequence'].values
x_1000 = np.array([[c for c in x_1000[i]] for i in range(len(x_1000))])
print(x_1000.shape)

x_init = table['sequence'].values
x_init = np.array([[c for c in x_init[i]] for i in range(len(x_init))])
print(x_init.shape)

# accumulate, position by position, the Hamming (edit) distance between
# every designed sequence and every sequence in the initial dataset
edit_dist = 0
for i in range(x_1000.shape[1]):
    edit_dist += (x_1000[:, i][:, np.newaxis] != x_init[:, i])
edit_dist.shape
(1000, 118)
(7476, 118)
(1000, 7476)
[ ]:
min_edit_dist = edit_dist.min(axis=-1)
print(min_edit_dist.min(), min_edit_dist.mean(), min_edit_dist.max())

bins = np.arange(min_edit_dist.min(), min_edit_dist.max()+1) - 0.5
_ = plt.hist(min_edit_dist, bins=bins)
4 6.383 9
[Figure: histogram of minimum edit distances from the top 1000 designed sequences to the initial dataset]

Sequences in the top 1000 library are, on average, more than 6 edits away from the most similar sequence in the initial dataset, and the most divergent variant identified is 9 mutations away from anything in the initial dataset.

Compare top sequences against alternative library designs#

Design a conventional combinatorial variant library (CVL)#

A typical library design approach is to re-randomize the substitution variants found in the initial screen. Let’s build a CVL based on the variants found in the top 50 sequences.

[ ]:
# randomize the variants found in the best sequences in the initial library
# to create a CVL

amino_acids = 'ARNDCQEGHILKMFPSTWYV'
amino_acid_to_index = {amino_acids[i]: i for i in range(len(amino_acids))}

topk = 50
sequences = table.sort_values('log_kdnm').head(n=topk)['sequence'].values
sequences = np.array([[amino_acid_to_index[sequences[i][j]] for j in range(len(sequences[i]))] for i in range(len(sequences))])
sequences.shape

(50, 118)
[ ]:
counts = np.zeros((sequences.shape[1], len(amino_acids)))
for i in range(sequences.shape[1]):
    index, count = np.unique(sequences[:, i], return_counts=True)
    counts[i, index] = count
# keep only amino acids observed more than once at each position, then weight
# them uniformly to define the position-specific probabilities of the CVL
counts = (counts > 1)
cvl = counts/counts.sum(axis=-1, keepdims=True)
[ ]:
# visualize the CVL
_, ax = plt.subplots(figsize=(20, 4))
ax.imshow(cvl.T)
_ = ax.set_yticks(np.arange(len(amino_acids)), [c for c in amino_acids])
_ = ax.set_xlabel('Position')
_ = ax.set_ylabel('Amino Acid')
[Figure: per-position amino acid probabilities of the standard CVL]

Assess the expected performance of the conventional CVL#

[ ]:
library_size = 10_000
random = np.random.RandomState(42)
cvl_sequences_sampled = np.zeros((library_size, cvl.shape[0]), dtype=int)
for i in range(len(cvl)):
    cvl_sequences_sampled[:, i] = random.choice(len(cvl[i]), p=cvl[i], size=(library_size), replace=True)
cvl_sequences_sampled = [''.join([amino_acids[i] for i in cvl_sequences_sampled[j]]) for j in range(cvl_sequences_sampled.shape[0])]
# only consider the unique sequences
cvl_sequences_sampled = np.unique(cvl_sequences_sampled).tolist()
[ ]:
cvl_predict_future = session.predict.create_predict_job(cvl_sequences_sampled, model_future)
print(cvl_predict_future.job)
cvl_predict_future.wait_until_done(verbose=True)
model_config={'protected_namespaces': ()} status=<JobStatus.PENDING: 'PENDING'> job_id='7987a0ff-0b6b-4045-a7f5-768c82e26d66' job_type='/workflow/predict' created_date=None start_date=None end_date=None prerequisite_job_id=None progress_message=None progress_counter=0 num_records=None
Waiting: 100%|██████████| 100/100 [10:56<00:00,  6.56s/it, status=SUCCESS]
True
[ ]:
cvl_predict_results = cvl_predict_future.get()

cvl_y_mus = np.zeros(len(cvl_predict_results))
cvl_y_vars = np.zeros(len(cvl_predict_results))
for i,r in enumerate(cvl_predict_results):
    props = r.predictions[0].properties['log_kdnm']
    cvl_y_mus[i] = props['y_mu']
    cvl_y_vars[i] = props['y_var']
cvl_dists = scipy.stats.norm(cvl_y_mus, np.sqrt(cvl_y_vars))
[ ]:
# plot the expected results vs. the input data
bins = np.linspace(-4, 6, 101)

format_params = {'alpha': 0.5, 'histtype': 'stepfilled', 'ec': 'k'}
_ = plt.hist(table['log_kdnm'].values, bins=bins, label='Initial Dataset', color='gray', density=True, **format_params)

# plot the predictive distribution over property values
c_lo = pred_dist_100.cdf(bins[:-1][:, np.newaxis]).mean(axis=1)
c_hi = pred_dist_100.cdf(bins[1:][:, np.newaxis]).mean(axis=1)
weights = (c_hi - c_lo)/(bins[1:] - bins[:-1])
_ = plt.hist(bins[:-1], bins=bins, weights=weights, label='Designed (Top 100)', **format_params)

# compare the expected performance of the 10,000 sequences from the CVL
c_lo = cvl_dists.cdf(bins[:-1][:, np.newaxis]).mean(axis=1)
c_hi = cvl_dists.cdf(bins[1:][:, np.newaxis]).mean(axis=1)
weights = (c_hi - c_lo)/(bins[1:] - bins[:-1])
_ = plt.hist(bins[:-1], bins=bins, weights=weights, label='CVL (standard)', color='lightblue', hatch='/', **format_params)

_ = plt.legend(loc='best')
_ = plt.xlabel('log$_{10}$ $K_D$ nM')
_ = plt.ylabel('Density')
_ = plt.title('Affinity distributions of designed libraries versus the initial dataset')
[Figure: expected affinity distributions of the designed library and the standard CVL versus the initial dataset]

A sequence from the standard CVL is expected to perform better than the average sequence in the initial dataset, and may improve on the best sequences in the initial library, but is expected to be significantly worse than the sequences in the bespoke library we created with the design API.
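We can put a rough number on this gap by comparing average predicted affinities across the libraries (a coarse summary; the distribution plots above are more informative):

[ ]:
# rough comparison of mean (predicted) log KD across libraries
print(f"Initial dataset mean log KD:        {table['log_kdnm'].mean():.3f}")
print(f"Designed (Top 100) mean prediction: {pred_dist_100.mean().mean():.3f}")
print(f"CVL (standard) mean prediction:     {cvl_dists.mean().mean():.3f}")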

Compare the library success statistics#

[ ]:
# what is the probability that a variant meeting the design criteria is present in each library?
# if we tested 10,000 sequences from the CVL, how would it compare with the bespoke top 100 sequences?
target = -2
print('target value <', target)

log_one_minus_p_100 = pred_dist_100.logsf(target)
p_100 = 1 - np.exp(np.sum(log_one_minus_p_100))
print(f'Designed, 100 sequences, probability of at least one success = {p_100:.5f}')

log_one_minus_p_cvl = cvl_dists.logsf(target)
p_cvl = 1 - np.exp(np.sum(log_one_minus_p_cvl))
print(f'CVL (standard), 10,000 sequences, probability of at least one success = {p_cvl:.5f}')
target value < -2
Designed, 100 sequences, probability of at least one success = 1.00000
CVL (standard), 10,000 sequences, probability of at least one success = 0.74066

We can see that even if we test 10,000 sequences from the standard CVL, the probability of finding a sequence that meets our target affinity (picomolar) is only about 74%, significantly lower than the virtual certainty that a success will be contained in the bespoke 100-sequence library. Comparing this with the fine-grained results above, we would only need to test roughly the top 7 bespoke sequence designs to achieve the same probability of success.
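This comparison can also be read off programmatically, for example by finding the smallest number of top-ranked designs whose success probability matches the 10,000-sequence CVL. A minimal sketch, reusing pred_dist_1000 and p_cvl from above:

[ ]:
# smallest top-k designed library that matches the CVL's probability of success
log_one_minus_p = pred_dist_1000.logsf(target)
p_topk = 1 - np.exp(np.cumsum(log_one_minus_p))
k_match = int(np.argmax(p_topk >= p_cvl)) + 1
print(f'Top {k_match} designed sequences match the CVL success probability ({p_cvl:.3f})')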

Predicted distribution over the best affinity in each library#

Next, let’s examine the distribution over the affinity of the best sequence we expect to see in each library.

[ ]:
# number of unique sequences sampled from the CVL
cvl_dists.std().shape
(9969,)
[ ]:
# estimate best affinities in each library from sampling
num_samples = 10_000

r = np.random.randn(num_samples, len(cvl_y_mus))*cvl_dists.std() + cvl_dists.mean()
sample_cvl = r.min(axis=-1)

r = np.random.randn(num_samples, 100)*pred_dist_100.std() + pred_dist_100.mean()
sample_best_100 = r.min(axis=-1)
[ ]:
# plot the expected results vs. the input data
bins = np.linspace(-4, 6, 101)

format_params = {'alpha': 0.5, 'histtype': 'stepfilled', 'ec': 'k'}
_ = plt.hist(table['log_kdnm'].values, bins=bins, label='Initial Dataset', color='gray', density=True, **format_params)

# plot the predictive distribution over property values
_ = plt.hist(sample_best_100, bins=bins, label='Designed (Top 100)', density=True, **format_params)
_ = plt.hist(sample_cvl, bins=bins, label='CVL (standard), 10,000 sequences', color='lightblue', density=True, hatch='/', **format_params)

_ = plt.legend(loc='best')
_ = plt.xlabel('log$_{10}$ $K_D$ nM')
_ = plt.ylabel('Density')
_ = plt.title('Affinity distributions of best sequence in each library')
[Figure: distributions of the best affinity in the designed library and the standard CVL]
[ ]:
# plot out the expected best value versus library size
r_des = np.random.randn(num_samples, 1000)*pred_dist_1000.std() + pred_dist_1000.mean()
r_cvl = np.random.randn(num_samples, len(cvl_y_mus))*cvl_dists.std() + cvl_dists.mean()
print('Library Size', 'E[Best Affinity | Bespoke]', 'E[Best Affinity | CVL (standard)]', sep='\t')
for i in range(20):
    sample_best = r_des[:, :i+1].min(axis=-1)
    expected_best_value_des = sample_best.mean()
    sample_best = r_cvl[:, :i+1].min(axis=-1)
    expected_best_value_cvl = sample_best.mean()
    print(f'{i+1: 12d}', f'{expected_best_value_des:26.5f}', f'{expected_best_value_cvl:33.5f}', sep='\t')
Library Size    E[Best Affinity | Bespoke]      E[Best Affinity | CVL (standard)]
           1                      -1.54616                                1.52048
           2                      -1.85585                                1.07512
           3                      -2.01505                                0.62800
           4                      -2.11463                                0.28505
           5                      -2.18733                               -0.10294
           6                      -2.23961                               -0.12237
           7                      -2.28554                               -0.16530
           8                      -2.31774                               -0.18378
           9                      -2.35245                               -0.18535
          10                      -2.37868                               -0.18983
          11                      -2.40330                               -0.20601
          12                      -2.42624                               -0.25380
          13                      -2.44741                               -0.31830
          14                      -2.46363                               -0.32024
          15                      -2.48038                               -0.39747
          16                      -2.49615                               -0.40643
          17                      -2.50757                               -0.40733
          18                      -2.52493                               -0.40755
          19                      -2.53690                               -0.40813
          20                      -2.55117                               -0.41004

Design a better combinatorial variant library from the OpenProtein.AI design results#

[ ]:
# randomize the variants found in the top 1000 designed sequences
# to create an optimized CVL

amino_acids = 'ARNDCQEGHILKMFPSTWYV'
amino_acid_to_index = {amino_acids[i]: i for i in range(len(amino_acids))}

sequences = library_1000['sequence'].values
sequences = np.array([[amino_acid_to_index[sequences[i][j]] for j in range(len(sequences[i]))] for i in range(len(sequences))])
sequences.shape

(1000, 118)
[ ]:
counts = np.zeros((sequences.shape[1], len(amino_acids)))
for i in range(sequences.shape[1]):
    index, count = np.unique(sequences[:, i], return_counts=True)
    counts[i, index] = count
# here we keep the raw frequencies (no thresholding), so the CVL weights amino
# acids by how often they appear in the designed sequences at each position
# counts = (counts > 1)
cvl2 = counts/counts.sum(axis=-1, keepdims=True)
[ ]:
# visualize the CVL
_, ax = plt.subplots(figsize=(20, 4))
ax.imshow(cvl2.T)
_ = ax.set_yticks(np.arange(len(amino_acids)), [c for c in amino_acids])
_ = ax.set_xlabel('Position')
_ = ax.set_ylabel('Amino Acid')
[Figure: per-position amino acid probabilities of the optimized CVL]

Assess the expected performance of the optimized CVL#

[ ]:
library_size = 10_000
random = np.random.RandomState(42)
cvl2_sequences_sampled = np.zeros((library_size, cvl2.shape[0]), dtype=int)
for i in range(len(cvl2)):
    cvl2_sequences_sampled[:, i] = random.choice(len(cvl2[i]), p=cvl2[i], size=(library_size), replace=True)
cvl2_sequences_sampled = [''.join([amino_acids[i] for i in cvl2_sequences_sampled[j]]) for j in range(cvl2_sequences_sampled.shape[0])]
# only consider the unique sequences
cvl2_sequences_sampled = np.unique(cvl2_sequences_sampled).tolist()
[ ]:
cvl2_predict_future = session.predict.create_predict_job(cvl2_sequences_sampled, model_future)
print(cvl2_predict_future.job)
cvl2_predict_future.wait_until_done(verbose=True)
model_config={'protected_namespaces': ()} status=<JobStatus.PENDING: 'PENDING'> job_id='615fa9ee-5c31-48b2-98bc-4830bc150e27' job_type='/workflow/predict' created_date=None start_date=None end_date=None prerequisite_job_id=None progress_message=None progress_counter=0 num_records=None
Waiting: 100%|██████████| 100/100 [01:21<00:00,  1.23it/s, status=SUCCESS]
True
[ ]:
cvl2_predict_results = cvl2_predict_future.get()

cvl2_y_mus = np.zeros(len(cvl2_predict_results))
cvl2_y_vars = np.zeros(len(cvl2_predict_results))
for i,r in enumerate(cvl2_predict_results):
    props = r.predictions[0].properties['log_kdnm']
    cvl2_y_mus[i] = props['y_mu']
    cvl2_y_vars[i] = props['y_var']
cvl2_dists = scipy.stats.norm(cvl2_y_mus, np.sqrt(cvl2_y_vars))
[ ]:
# plot the expected results vs. the input data
bins = np.linspace(-4, 6, 101)

format_params = {'alpha': 0.5, 'histtype': 'stepfilled', 'ec': 'k'}
_ = plt.hist(table['log_kdnm'].values, bins=bins, label='Initial Dataset', color='gray', density=True, **format_params)

# plot the predictive distribution over property values
c_lo = pred_dist_100.cdf(bins[:-1][:, np.newaxis]).mean(axis=1)
c_hi = pred_dist_100.cdf(bins[1:][:, np.newaxis]).mean(axis=1)
weights = (c_hi - c_lo)/(bins[1:] - bins[:-1])
_ = plt.hist(bins[:-1], bins=bins, weights=weights, label='Designed (Top 100)', **format_params)

# compare the expected performance of the 10,000 sequences from the CVL
c_lo = cvl_dists.cdf(bins[:-1][:, np.newaxis]).mean(axis=1)
c_hi = cvl_dists.cdf(bins[1:][:, np.newaxis]).mean(axis=1)
weights = (c_hi - c_lo)/(bins[1:] - bins[:-1])
_ = plt.hist(bins[:-1], bins=bins, weights=weights, label='CVL (standard)', color='lightblue', hatch='/', **format_params)

c_lo = cvl2_dists.cdf(bins[:-1][:, np.newaxis]).mean(axis=1)
c_hi = cvl2_dists.cdf(bins[1:][:, np.newaxis]).mean(axis=1)
weights = (c_hi - c_lo)/(bins[1:] - bins[:-1])
_ = plt.hist(bins[:-1], bins=bins, weights=weights, label='CVL (optimized)', color='blue', hatch='/', **format_params)

_ = plt.legend(loc='best')
_ = plt.xlabel('log$_{10}$ $K_D$ nM')
_ = plt.ylabel('Density')
_ = plt.title('Affinity distributions of designed libraries versus the initial dataset')
[Figure: expected affinity distributions of the designed library, standard CVL, and optimized CVL versus the initial dataset]

We can see that the CVL derived from variants identified by the design algorithm contains sequences that are expected to perform much better, on average, than the sequences in the standard CVL. Sequences in the optimized CVL are expected to be only slightly worse than the best bespoke variants we identified.

Compare the library success statistics#

[ ]:
# what is the probability that a variant meeting the design criteria is present in each library?
# if we tested 10,000 sequences from the CVL, how would it compare with the bespoke top 100 sequences?
target = -2
print('target value <', target)

log_one_minus_p_100 = pred_dist_100.logsf(target)
p_100 = 1 - np.exp(np.sum(log_one_minus_p_100))
print(f'Designed, 100 sequences, probability of at least one success = {p_100:.5f}')

log_one_minus_p_cvl = cvl_dists.logsf(target)
p_cvl = 1 - np.exp(np.sum(log_one_minus_p_cvl))
print(f'CVL (standard), 10,000 sequences, probability of at least one success = {p_cvl:.5f}')

log_one_minus_p_cvl2 = cvl2_dists.logsf(target)[:100]
p_cvl2 = 1 - np.exp(np.sum(log_one_minus_p_cvl2))
print(f'CVL (optimized), 100 sequences, probability of at least one success = {p_cvl2:.5f}')

log_one_minus_p_cvl2 = cvl2_dists.logsf(target)[:1000]
p_cvl2 = 1 - np.exp(np.sum(log_one_minus_p_cvl2))
print(f'CVL (optimized), 1,000 sequences, probability of at least one success = {p_cvl2:.5f}')

log_one_minus_p_cvl2 = cvl2_dists.logsf(target)
p_cvl2 = 1 - np.exp(np.sum(log_one_minus_p_cvl2))
print(f'CVL (optimized), 10,000 sequences, probability of at least one success = {p_cvl2:.5f}')
target value < -2
Designed, 100 sequences, probability of at least one success = 1.00000
CVL (standard), 10,000 sequences, probability of at least one success = 0.74066
CVL (optimized), 100 sequences, probability of at least one success = 0.81434
CVL (optimized), 1,000 sequences, probability of at least one success = 1.00000
CVL (optimized), 10,000 sequences, probability of at least one success = 1.00000
[ ]:
# what is the probability that a variant meeting the design criteria is present in each library?
# if we tested 10,000 sequences from the CVL, how would it compare with the bespoke top 100 sequences?
target = -3
print('target value <', target)

log_one_minus_p_100 = pred_dist_100.logsf(target)
p_100 = 1 - np.exp(np.sum(log_one_minus_p_100))
print(f'Designed, 100 sequences, probability of at least one success = {p_100:.5f}')

log_one_minus_p_cvl = cvl_dists.logsf(target)
p_cvl = 1 - np.exp(np.sum(log_one_minus_p_cvl))
print(f'CVL (standard), 10,000 sequences, probability of at least one success = {p_cvl:.5f}')

log_one_minus_p_cvl2 = cvl2_dists.logsf(target)[:100]
p_cvl2 = 1 - np.exp(np.sum(log_one_minus_p_cvl2))
print(f'CVL (optimized), 100 sequences, probability of at least one success = {p_cvl2:.5f}')

log_one_minus_p_cvl2 = cvl2_dists.logsf(target)[:1000]
p_cvl2 = 1 - np.exp(np.sum(log_one_minus_p_cvl2))
print(f'CVL (optimized), 1,000 sequences, probability of at least one success = {p_cvl2:.5f}')

log_one_minus_p_cvl2 = cvl2_dists.logsf(target)
p_cvl2 = 1 - np.exp(np.sum(log_one_minus_p_cvl2))
print(f'CVL (optimized), 10,000 sequences, probability of at least one success = {p_cvl2:.5f}')
target value < -3
Designed, 100 sequences, probability of at least one success = 0.31175
CVL (standard), 10,000 sequences, probability of at least one success = 0.00621
CVL (optimized), 100 sequences, probability of at least one success = 0.02804
CVL (optimized), 1,000 sequences, probability of at least one success = 0.34838
CVL (optimized), 10,000 sequences, probability of at least one success = 0.95893

With the more aggressive target of log KD < -3, the optimized CVL clearly outperforms the standard CVL: 10,000 optimized-CVL sequences give roughly a 96% probability of containing a success, compared with less than 1% for 10,000 standard-CVL sequences. An optimized CVL of 1,000 sequences already slightly exceeds the bespoke 100-sequence library (0.35 vs. 0.31), illustrating the trade-off between a small bespoke library and a larger, cheaper semi-randomized one.
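Because the optimized CVL sequences are random draws rather than a ranked list, we can also ask roughly how many of them would need to be screened to reach a given confidence. A sketch reusing cvl2_dists and the current target of -3 (the exact number will vary with the random sample):

[ ]:
# approximate optimized-CVL size needed for a chosen confidence at log KD < -3
confidence = 0.95
log_one_minus_p = cvl2_dists.logsf(target)                  # log P(miss) per sampled sequence
p_vs_size = 1 - np.exp(np.cumsum(log_one_minus_p))          # success probability vs. library size
size_needed = int(np.argmax(p_vs_size >= confidence)) + 1   # first size reaching the confidence
print(f'Roughly {size_needed} optimized-CVL sequences for {confidence:.0%} confidence')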

Predicted distribution over the best affinity in each library#

Next, let’s examine the distribution over the affinity of the best sequence we expect to see in each library.

[ ]:
# estimate best affinities in each library from sampling
num_samples = 10_000

r = np.random.randn(num_samples, len(cvl_y_mus))*cvl_dists.std() + cvl_dists.mean()
sample_cvl = r.min(axis=-1)

r = np.random.randn(num_samples, len(cvl2_y_mus))*cvl2_dists.std() + cvl2_dists.mean()
sample_cvl2 = r.min(axis=-1)

r = np.random.randn(num_samples, 100)*pred_dist_100.std() + pred_dist_100.mean()
sample_best_100 = r.min(axis=-1)
[ ]:
# plot the expected results vs. the input data
bins = np.linspace(-4, 6, 101)

format_params = {'alpha': 0.5, 'histtype': 'stepfilled', 'ec': 'k'}
_ = plt.hist(table['log_kdnm'].values, bins=bins, label='Initial Dataset', color='gray', density=True, **format_params)

# plot the predictive distribution over property values
_ = plt.hist(sample_best_100, bins=bins, label='Designed (Top 100)', density=True, **format_params)
_ = plt.hist(sample_cvl, bins=bins, label='CVL (standard), 10,000 sequences', color='lightblue', density=True, hatch='/', **format_params)
_ = plt.hist(sample_cvl2, bins=bins, label='CVL (optimized), 10,000 sequences', color='blue', density=True, hatch='/', **format_params)

_ = plt.legend(loc='best')
_ = plt.xlabel('log$_{10}$ $K_D$ nM')
_ = plt.ylabel('Density')
_ = plt.title('Affinity distributions of best sequence in each library')
[Figure: distributions of the best affinity in the designed library, standard CVL, and optimized CVL]
[ ]:
# plot out the expected best value versus library size
r_des = np.random.randn(num_samples, 1000)*pred_dist_1000.std() + pred_dist_1000.mean()
r_cvl = np.random.randn(num_samples, len(cvl_y_mus))*cvl_dists.std() + cvl_dists.mean()
r_cvl2 = np.random.randn(num_samples, len(cvl2_y_mus))*cvl2_dists.std() + cvl2_dists.mean()
print('Library Size', 'E[Best Affinity | Bespoke]', 'E[Best Affinity | CVL (standard)]', 'E[Best Affinity | CVL (optimized)]', sep='\t')
for i in range(20):
    sample_best = r_des[:, :i+1].min(axis=-1)
    expected_best_value_des = sample_best.mean()
    sample_best = r_cvl[:, :i+1].min(axis=-1)
    expected_best_value_cvl = sample_best.mean()
    sample_best = r_cvl2[:, :i+1].min(axis=-1)
    expected_best_value_cvl2 = sample_best.mean()
    print(f'{i+1: 12d}', f'{expected_best_value_des:26.5f}', f'{expected_best_value_cvl:33.5f}', f'{expected_best_value_cvl2:33.5f}', sep='\t')
Library Size    E[Best Affinity | Bespoke]      E[Best Affinity | CVL (standard)]       E[Best Affinity | CVL (optimized)]
           1                      -1.55286                                1.51589                                 0.38328
           2                      -1.84912                                1.08002                                -0.47345
           3                      -2.00703                                0.63800                                -0.81306
           4                      -2.10333                                0.29565                                -0.89996
           5                      -2.17862                               -0.11100                                -1.24034
           6                      -2.23045                               -0.12912                                -1.24835
           7                      -2.27606                               -0.17099                                -1.24867
           8                      -2.31387                               -0.18948                                -1.44040
           9                      -2.34779                               -0.19020                                -1.56067
          10                      -2.37340                               -0.19456                                -1.68243
          11                      -2.39798                               -0.21236                                -1.76242
          12                      -2.42163                               -0.26278                                -1.84131
          13                      -2.44410                               -0.32787                                -1.87820
          14                      -2.46091                               -0.32975                                -1.90222
          15                      -2.47712                               -0.39906                                -1.90426
          16                      -2.49407                               -0.40723                                -1.94257
          17                      -2.50681                               -0.40822                                -1.94602
          18                      -2.52192                               -0.40840                                -1.96347
          19                      -2.53436                               -0.40878                                -1.99541
          20                      -2.55005                               -0.41058                                -2.00509

Summary and next steps#

In this example, we evaluated:

- The initial antibody variable fragment mutagenesis dataset, 14H.
- A conventional combinatorial variant library designed using the top 50 sequences in the 14H dataset.
- A machine learning-designed library built using OP Models.
- A combinatorial variant library based on our machine learning-designed library.

For each library, we predicted expected performance and evaluated the sizes at which one library outperforms another. This is useful for making decisions about library design strategy and about size trade-offs between fully bespoke and partially randomized libraries. We also explored a strategy for designing semi-randomized combinatorial variant libraries using the design tool.
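As a final illustration of the cost-efficiency angle, per-sequence costs can be combined with the success probabilities computed above to compare libraries on a cost basis. The costs below are purely hypothetical placeholders; substitute your own figures:

[ ]:
# hypothetical per-sequence screening costs -- replace with your own numbers
cost_per_designed_seq = 50.0   # e.g., individually synthesized designs
cost_per_cvl_seq = 2.0         # e.g., pooled combinatorial variants

target = -2
p_designed_100 = 1 - np.exp(np.sum(pred_dist_100.logsf(target)))
p_cvl_10000 = 1 - np.exp(np.sum(cvl_dists.logsf(target)))

print(f'Designed (100 seqs): cost {100*cost_per_designed_seq:,.0f}, P(success) {p_designed_100:.3f}')
print(f'CVL (~10,000 seqs):  cost {10000*cost_per_cvl_seq:,.0f}, P(success) {p_cvl_10000:.3f}')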

We encourage you to try this out on your own libraries when making choices on library size and composition. Happy modeling!