Tutorial: Visualizing antibody embeddings with OpenProtein

In this Tutorial we will visualize embeddings from antibodies all derived from a common parent. In doing so we will show that that such proteins have a characteristic star-burst based pattern that derives from their relation to a common parent.

%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import scipy.special
import json

Connect to the OpenProtein.AI API

import openprotein

with open('secrets.config', 'r') as f:
    config = json.load(f)

session = openprotein.connect(config['username'], config['password'])

Basic inference endpoints

Models in the embedding module can be selected with the session.embedding.get_model(model_id) function and expose three basic inference methods * model.embed([sequences], reduction='MEAN') - request embeddings, returning the mean embedding over the sequence length by default. reduction=’SUM’ and reduction=None are also options, which return the summed embedding or the full LxD sized embedding (a D-dimensional embedding for each position of the sequence) * model.logits([sequences]) - request logits, returning the final output layer of the language model * model.attn([sequences]) - request attention maps, returning the attention maps for the final layer of transformer models. these are HxLxL where H is the number of attention heads used by the model * model.fit_svd([sequences], n_components=1024, reduction=None) - fit a SVD decomposition of the embeddings of the input sequences, truncated to the specified number of components after applying the specified embedding reduction. Using reduction=None requires the sequences to all be the same length. The resulting SVD model is especially useful for retrieving high fidelity, custom-sized embeddings of large numbers of sequences or sequence variants, because full sized embeddings are large and can be cumbersome to download and work with. This will be covered in more detail later in the tutorial.

Sequences are expected to be amino acid sequences using the standard single character amino acid alphabet, plus X to indicate masked/unknown positions. More information about can be retrieved from the model metadata (model.metadata) which includes sequence length constraints and other information. Available models can be listed with session.embedding.list_models().

Note - ESM models will return length plus 2 (\(L+2\)) rather than \(L\) sized results for full embeddings, logits, and attn, because “<start>” and “<stop>” tokens are appended to the input sequences.

# let's work with an antibody heavy chain sequence and our rotaprot-large-uniref90-ft model
model = session.embedding.get_model('rotaprot-large-uniref90-ft')

Inference requests are asynchronous

Inference functions return asynchronous Future objects that can be queried for job completion status, waited on, and used to retrieve results once available. The Future wraps a Job object including a job_id and other information about the job that can be used to re-connect and retrive results later.

Logits are un-normalized log-probabilities of amino acids at each position according to the masked-LM

logits_future = model.logits([wt])

# wait until the results are ready, polling periodically and reporting job progress (if multiple sequences are sent in the inference request)

# get the results, this is a list of (sequence, np.ndarray) tuples
# these are not guaranteed to be in the same order as the input, so it's generally recommended to check the sequence in the returned results
logits_results = logits_future.get()
returned_sequence, logits = logits_results[0]
# logits are of shape LxV, where V is the vocabulary size for the model (20 in this case)
# logits are not softmaxed, so if normalized probabilities are required they should be passed through a softmax or log_softmax function
model_config={'protected_namespaces': ()} status=<JobStatus.PENDING: 'PENDING'> job_id='2f7fc455-7600-42db-ad07-747905cf72c0' job_type='/embeddings/logits' created_date=datetime.datetime(2023, 9, 19, 7, 45, 44, 328183, tzinfo=datetime.timezone.utc) start_date=None end_date=None prerequisite_job_id=None progress_message=None progress_counter=0 num_records=1
Waiting: 100%|██████████| 100/100 [00:00<00:00, 370.59it/s, status=SUCCESS]
(118, 20)

Let’s plot these logits by position:

# plot the logits
alphabet = model.metadata.output_tokens

_, ax = plt.subplots(figsize=(20, 4))
ax.imshow(scipy.special.log_softmax(logits, axis=-1).T)
_ = ax.set_xlabel('Position')
_ = ax.set_ylabel('Amino Acid')
_ = ax.set_yticks(np.arange(len(alphabet)), alphabet)

Embeddings and reduced embeddings

Embeddings for a sequence are LxD, where L is the length of the sequence and D is the embedding dimension of the model. This can be prohibitively large for many sequences, as D is generally 1024 or larger. Reduced embeddings are of shape D and are either the mean (reduction='MEAN') or the sum (reduction='SUM') of the embeddings over the length dimension of the sequence.

Full size embeddings (reduction=None) generally give better fidelity, but mean embeddings are still powerful features and are easier to manipulate as we can make them uniform across proteins of different sequence lengths. In general, the reduced embeddings perform only marginally worse than full-sized embeddings in explaining protein sequences, but we leave it up to the user to decide what is most appropriate.

# because requests are async, these will run in parallel without waiting
embedding_full_future = model.embed([wt], reduction=None)
print('FULL', embedding_full_future.job)
embedding_mean_future = model.embed([wt], reduction='MEAN') # mean is the default
print('MEAN', embedding_mean_future.job)
embedding_sum_future = model.embed([wt], reduction='SUM') # we provide SUM as an option, but MEAN is generally the better reduction
print('SUM', embedding_sum_future.job)

# wait on the results
fs = [embedding_full_future, embedding_mean_future, embedding_sum_future]
for f in fs:

# NOTE - for large numbers of full size embeddings, downloading them can actually be the slowest step
result_full = embedding_full_future.get()
print('FULL shape =', result_full[0][1].shape)

result_mean = embedding_mean_future.get()
print('MEAN shape =', result_mean[0][1].shape)

result_sum = embedding_sum_future.get()
print('SUM shape =', result_sum[0][1].shape)
FULL model_config={'protected_namespaces': ()} status=<JobStatus.PENDING: 'PENDING'> job_id='f3a31feb-9ad7-4b08-90bf-6e6e0964cea0' job_type='/embeddings/embed' created_date=datetime.datetime(2023, 9, 19, 7, 45, 45, 691973, tzinfo=datetime.timezone.utc) start_date=None end_date=None prerequisite_job_id=None progress_message=None progress_counter=0 num_records=1
MEAN model_config={'protected_namespaces': ()} status=<JobStatus.PENDING: 'PENDING'> job_id='c3313e4b-32a0-41e1-9238-80f646ffb313' job_type='/embeddings/embed_reduced' created_date=datetime.datetime(2023, 9, 19, 7, 45, 45, 949078, tzinfo=datetime.timezone.utc) start_date=None end_date=None prerequisite_job_id=None progress_message=None progress_counter=0 num_records=1
SUM model_config={'protected_namespaces': ()} status=<JobStatus.PENDING: 'PENDING'> job_id='5fe56689-9bd3-4206-ba21-f7f398aa9029' job_type='/embeddings/embed_reduced' created_date=datetime.datetime(2023, 9, 19, 7, 45, 46, 208511, tzinfo=datetime.timezone.utc) start_date=None end_date=None prerequisite_job_id=None progress_message=None progress_counter=0 num_records=1
Waiting: 100%|██████████| 100/100 [00:00<00:00, 378.89it/s, status=SUCCESS]
Waiting: 100%|██████████| 100/100 [00:00<00:00, 377.30it/s, status=SUCCESS]
Waiting: 100%|██████████| 100/100 [00:00<00:00, 365.26it/s, status=SUCCESS]
FULL shape = (118, 1536)
MEAN shape = (1536,)
SUM shape = (1536,)
       -0.08315928, -0.95647794], dtype=float32))]

Attention maps can reveal statistical couplings between sites

attn_future = model.attn([wt])


attn_result = attn_future.get()
attn_map = attn_result[0][1]
model_config={'protected_namespaces': ()} status=<JobStatus.PENDING: 'PENDING'> job_id='998f5bd8-8c6f-434d-b872-39642a316cd3' job_type='/embeddings/attn' created_date=datetime.datetime(2023, 9, 19, 7, 45, 49, 415850, tzinfo=datetime.timezone.utc) start_date=None end_date=None prerequisite_job_id=None progress_message=None progress_counter=0 num_records=1
Waiting: 100%|██████████| 100/100 [00:00<00:00, 374.24it/s, status=SUCCESS]
(24, 118, 118)
# plot a few attention heads
_, axs = plt.subplots(1, 3, figsize=(24, 8))
for i in range(len(axs)):
    _ = axs[i].set_title(f'Attention Map for Head={i}')
    _ = axs[i].set_xlabel('Position')
    _ = axs[i].set_ylabel('Position')
# plot the symetrized average attention map
a = attn_map.mean(axis=0)
a = a + a.T - a*a.T

_, ax = plt.subplots(figsize=(8, 8))
_ = ax.set_title('Symmetrized Average Attention Map')
_ = ax.set_xlabel('Position')
_ = ax.set_ylabel('Position')

Using SVD to get high-fidelity custom-sized embeddings

Truncated SVD can be used to find reduced sized protein embeddings that retain the most information by finding vectors that best explain the localized sequence space.

This is extremely useful for retrieving large numbers of embeddings for mutagenesis analysis. The SVD can be fit on any set of sequences using the model.fit_svd() function, but these sequences must be of equal length if reduction=None (the default) is used. This returns a new SVD model object that can be used to embed new sequences.

Existing SVD models can be listed and retrieved with session.embedding.list_svd() and can be deleted with the svd.delete() function.

# enumerate all single substution variants of the seed sequence and use those to fit an SVD'd embedding of the local sequence landscape
alphabet = model.metadata.output_tokens
alphabet = [c.encode() for c in alphabet] # encode to bytes since we're working with byte strings

variants = [wt]
for i in range(len(wt)):
    wt_i = wt[i:i+1]
    for j in range(len(alphabet)):
        if alphabet[j] != wt_i:
            v = wt[:i] + alphabet[j] + wt[i+1:]
            assert len(v) == len(wt)
            assert v != wt

# fit a small SVD
svd = model.fit_svd(variants, n_components=256)
    "model_config": {
        "protected_namespaces": []
    "id": "5cc594f0-39b0-49f3-b3a2-1ef81dfba4f6",
    "status": "SUCCESS",
    "created_date": "2023-09-18T17:27:56.364375+00:00",
    "model_id": "rotaprot-large-uniref90-ft",
    "n_components": 256,
    "reduction": null,
    "sequence_length": 118
Waiting: 100%|██████████| 100/100 [00:00<00:00, 328.48it/s, status=SUCCESS]
# embed the variants
# NOTE - you need to wait for the embedding model to finish fitting before calling embed
# otherwise you will get an error

svd_embed_future = svd.embed(variants)

svd_embed_results = svd_embed_future.get()
model_config={'protected_namespaces': ()} status=<JobStatus.PENDING: 'PENDING'> job_id='f44717e3-08ba-42b0-95bd-833071112964' job_type='/embeddings/embed' created_date=datetime.datetime(2023, 9, 19, 7, 45, 53, 968240, tzinfo=datetime.timezone.utc) start_date=None end_date=None prerequisite_job_id=None progress_message=None progress_counter=0 num_records=2243
Waiting: 100%|██████████| 100/100 [00:05<00:00, 17.83it/s, status=SUCCESS]
embedding_vectors = np.stack([v for _,v in svd_embed_results], axis=0)
(2243, 256)
# visualize a UMAP of these embeddings
from umap import UMAP

um = UMAP(n_components=2, n_neighbors=30, min_dist=0)
z = um.fit_transform(embedding_vectors)

the UMAP (below) shows a characteristic star burst pattern typical of single substitions of a parent sequence


_, ax = plt.subplots(figsize=(8, 8)) ax.scatter(z[:, 0], z[:, 1], s=1) _ = ax.set_xticks([], []) _ = ax.set_yticks([], []) _ = ax.set_xlabel('UMAP1') _ = ax.set_ylabel('UMAP2')
[ ]: