[1]:
%matplotlib inline

Tutorial: Designing new Chorismate Mutase enzymes with PoET

Chorismate Mutase (CM; EC 5.4.99.5) plays a pivotal role in the biosynthesis of aromatic compounds, facilitating a crucial step in the shikimate pathway: the conversion of chorismate to prephenate. CM accelerates this reaction roughly a million-fold and is therefore essential to bacterial, archaeal, fungal, and plant life.

CM is instrumental in the production of two aromatic amino acids in vivo: phenylalanine and tyrosine. It operates at a pivotal branch point in the shikimate pathway, channelling its substrate (chorismate) towards the biosynthesis of tyrosine and phenylalanine and away from another aromatic amino acid, tryptophan. It can therefore determine the relative balance of the various aromatic amino acids in several biological systems.

As well as this core role in aromatic amino acid synthesis, CM displays several other characteristics of interest from both a biological and an industrial perspective:

  • CM is used in industrial biotechnology to develop novel antimicrobials efficiently.

  • CM is the only known natural enzyme that catalyses a pericyclic reaction.

  • Certain variants of the enzyme are thought to show allosteric regulation.

It is therefore a good candidate for mutagenesis, both to improve the biotechnological applications of this enzyme and to study its unique biochemistry.

Here, we will use OpenProtein.AI's PoET pipelines to delve into the intricacies of Chorismate Mutase, exploring its function and potential for enhanced activity through bioengineering.

PoET

PoET, a state-of-the-art protein language model, enables the evaluation and generation of protein sequences from scratch, without necessitating functional measurements of your protein.

PoET’s core workflow revolves around a ‘prompt’: a strategically assembled set of sequences encapsulating essential insights about the local fitness landscape and co-evolutionary patterns related to your protein of interest. Prompt creation involves two steps: first, generating a multiple sequence alignment (MSA) of your protein’s sequences spanning various evolutionary lineages, and second, applying filters to refine this MSA into a focused prompt.

The sequences, owing to their shared evolutionary ties, inherently embed information about the protein’s local fitness landscape. PoET utilizes this information to assess the probability of observing a specific sequence based on the inferred evolutionary process. Although PoET also boasts the capability to generate novel sequences, for the scope of this tutorial, we will focus solely on its scoring function to identify promising Chorismate Mutase enzyme variants.
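As a preview, and to make the moving parts concrete, the end-to-end flow we follow in this tutorial looks roughly like the sketch below. It uses the same session.poet calls that appear in the cells later in this notebook; seed and candidates stand in for the sequences we will define as we go.

# Minimal sketch of the PoET scoring workflow used in this tutorial.
# Assumes `session` is an authenticated openprotein session (created below)
# and that `seed` and `candidates` are amino-acid sequences encoded as bytes.
msa = session.poet.create_msa(seed)                    # 1. build an MSA around a seed sequence
prompt = msa.sample_prompt(num_ensemble_prompts=3,
                           random_seed=42)             # 2. filter/sample the MSA into prompts
scoring_job = session.poet.score(prompt, candidates)   # 3. score candidate sequences against the prompt
results = scoring_job.wait()                           # 4. block until the job completes and fetch results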

[2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import pearsonr, spearmanr
import json
import seaborn as sns

from umap import UMAP
sns.set()
[3]:
import openprotein # install from pypi
import openprotein.fasta as fasta

Connect to the OpenProtein.AI API

[4]:
with open('secrets.config', 'r') as f:
    config = json.load(f)

session = openprotein.connect(config['username'], config['password'])

Load the chorismate mutase dataset

We have a dataset of Chorismate mutase activity from Russ et al 2020.

It contains the following information (as seen below):

  • Species - the source organism of the Chorismate Mutase (CM) sequence

  • Sequence - the amino acid sequence

  • ID to EcCM - the sequence identity to E. coli CM (EcCM)

  • top ID - the highest sequence identity to any other sequence in the data (excluding EcCM)

  • norm r.e. - the normalized relative enrichment from the assay

The enrichment (norm r.e.) assay involves using a CM knock-out E. coli strain and then re-introducing a variant CM. As CM is essential for E. coli growth on minimal media (i.e. without aromatic amino acids), the E. coli count itself can be used as a proxy for enzymatic activity.

The dataset also includes a computational measure of protein fitness pioneered in Russ’ 2020 paper above:

  • EDCA - sequence modelling energy; lower energies are associated with higher probability

We will compare the wetlab functional assay (norm r.e.) with EDCA and with our own OpenProtein.AI proprietary model, PoET, which makes scoring and generating novel sequences easy.

[5]:
table_natural = pd.read_csv('data/chorismate_mutase_natural_sequences.csv' )
table_variant = pd.read_csv('data/chorismate_mutase_variant_sequences.csv')

table_natural.head()
[5]:
No. Species Sequence ID to EcCM top ID EDCA (rel. to EcCM) norm r.e.
0 1 Escherichia coli ... -TSENPLLALREKISALDEKLLALLAERRELAVEVGKAKLLSHRPV... 1.00 0.93 0.00 1.00
1 2 Thermus thermophilus HB8 ... ---DERIQALRKEVDRVNREILRLLSERGRLVQEIGRLQTELGLPH... 0.26 0.89 28.62 -0.06
2 3 Pseudomonas aeruginosa UCBPP-PA14 ... PEDCTGLADIREAIDQIDLDIVQALGRRMDYVKAASRFKANEA-AI... 0.20 0.71 -4.19 0.04
3 4 Enterobacter cloacae subsp. dissolvens SDM ... -TPENPLLDLRVKISALDEKLLALLAERRALAIEVGKAKLDSHRPV... 0.93 0.93 8.57 0.65
4 5 Thermus aquaticus Y51MC23 ... ---DERIQALRREVDRVNRELLRLLSERGRLVQEIGRIQTELGLPH... 0.28 0.89 18.42 -0.08

Datasets

Let’s look at the species with the highest identity to EcCM:

  • We can see norm r.e. is benchmarked to EcCM (= 1).

  • We can see the most similar sequences all score lower than EcCM.

[6]:
table_natural.sort_values("ID to EcCM", ascending=False).drop(columns=["Sequence"]).head()
[6]:
No. Species ID to EcCM top ID EDCA (rel. to EcCM) norm r.e.
0 1 Escherichia coli ... 1.00 0.93 0.00 1.00
3 4 Enterobacter cloacae subsp. dissolvens SDM ... 0.93 0.93 8.57 0.65
6 7 Cronobacter sakazakii ATCC BAA-894 ... 0.91 0.90 6.42 0.95
5 6 Salmonella enterica subsp. enterica serovar Sa... 0.90 0.89 9.83 0.70
8 9 Erwinia billingiae Eb661 ... 0.80 0.85 26.25 0.84
[7]:
table_natural.sort_values("EDCA (rel. to EcCM)", ascending=False).drop(columns=["Sequence"]).head()
[7]:
No. Species ID to EcCM top ID EDCA (rel. to EcCM) norm r.e.
379 380 Arthrobacter sp. FB24 ... 0.18 0.83 54.95 0.07
380 381 Arthrobacter phenanthrenivorans Sphe3 ... 0.18 0.83 53.67 -0.03
398 399 Helicobacter pylori B38 ... 0.27 0.89 52.00 0.88
1057 1058 Caulobacter sp. K31 ... 0.19 0.30 50.81 -0.06
399 400 Helicobacter pylori SouthAfrica7 ... 0.25 0.89 49.09 0.90
[8]:
import seaborn as sns
from scipy.stats import spearmanr

_, axs = plt.subplots(1, 2, figsize=(12, 4), sharex=True, sharey=True)

natural_corr, natural_p = spearmanr(table_natural['EDCA (rel. to EcCM)'].values, table_natural['norm r.e.'].values)
variant_corr, variant_p = spearmanr(table_variant['EDCA (rel. to EcCM)'].values, table_variant['norm r.e.'].values)

sns.regplot(data=table_natural, x='EDCA (rel. to EcCM)', y='norm r.e.', ax=axs[0])
axs[0].set_title(f'Natural sequences (Coeff: {natural_corr:.2f}, p-value: {natural_p:.2f})')
axs[0].set_xlabel("EDCA")
axs[0].set_ylabel("Enrichment")

sns.regplot(data=table_variant, x='EDCA (rel. to EcCM)', y='norm r.e.', ax=axs[1])
axs[1].set_title(f'Variant sequences (Coeff: {variant_corr:.2f}, p-value: {variant_p:.2f})')
axs[1].set_xlabel("EDCA")

plt.tight_layout()
plt.show()

../_images/demos_chorismate_12_0.png

In the graphics displayed above, we examine the relationships between the EDCA and the normalized relative enrichment over EcCM (‘norm r.e.’) for both ‘Natural’ and ‘Variant’ sequences.

We can see from the ‘Natural’ graph above that there is no clear relationship between the EDCA of a natural sequence and its relative enrichment; this is confirmed by a high Spearman p-value (0.24) and a low correlation coefficient (0.04).

By contrast, the ‘Variant’ graph shows a different picture. The data here exhibits a more patterned distribution, suggesting a possible relationship between the two variables. The Spearman correlation coefficient for the ‘Variant’ dataset is large and negative (approximately -0.4), and the p-value is significant (<0.05). Given that a more negative EDCA means a more likely sequence, there appears to be a positive relationship between the statistical likelihood of a sequence and its enrichment.

Lastly, we can also see that the span of EDCA values is much larger for the variant sequences than for the natural sequences.
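If you want to check this spread numerically, a quick sketch (output not shown here) is:

# Compare the spread of EDCA scores between the natural and variant datasets
for name, table in [('Natural', table_natural), ('Variant', table_variant)]:
    edca = table['EDCA (rel. to EcCM)']
    print(f"{name}: min={edca.min():.2f}, max={edca.max():.2f}, range={edca.max() - edca.min():.2f}")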

Interestingly, given the E. coli based enrichment assay, we might expect identity to EcCM alone to be a reasonable predictor for enzymatic activity. We find this to be somewhat true (see below):

[9]:
from Bio import pairwise2

ECCM_sequence = table_natural.query("`ID to EcCM`==1").Sequence[0]

table_natural['EcCM_similarity'] = [pairwise2.align.globalxx(i, ECCM_sequence)[0].score for i in table_natural.Sequence]
table_variant['EcCM_similarity'] = [pairwise2.align.globalxx(i, ECCM_sequence)[0].score for i in table_variant.Sequence]
[10]:
import seaborn as sns
from scipy.stats import spearmanr

_, axs = plt.subplots(1, 2, figsize=(12, 4), sharex=True, sharey=True)

natural_corr, natural_p = spearmanr(table_natural['EcCM_similarity'].values, table_natural['EDCA (rel. to EcCM)'].values)
variant_corr, variant_p = spearmanr(table_variant['EcCM_similarity'].values, table_variant['EDCA (rel. to EcCM)'].values)

sns.regplot(data=table_natural, x='EcCM_similarity', y='EDCA (rel. to EcCM)', ax=axs[0])
axs[0].set_title(f'Natural sequences (Coeff: {natural_corr:.2f}, p-value: {natural_p:.2f})')
axs[0].set_xlabel("EcCM similarity %")
axs[0].set_ylabel("EDCA")

sns.regplot(data=table_variant, x='EcCM_similarity', y='EDCA (rel. to EcCM)', ax=axs[1])
axs[1].set_title(f'Variant sequences (Coeff: {variant_corr:.2f}, p-value: {variant_p:.2f})')
axs[1].set_xlabel("EcCM similarity %")

plt.tight_layout()
plt.show()

../_images/demos_chorismate_16_0.png
[11]:
import seaborn as sns
from scipy.stats import spearmanr

_, axs = plt.subplots(1, 2, figsize=(12, 4), sharex=True, sharey=True)

natural_corr, natural_p = spearmanr(table_natural['EcCM_similarity'].values, table_natural['norm r.e.'].values)
variant_corr, variant_p = spearmanr(table_variant['EcCM_similarity'].values, table_variant['norm r.e.'].values)

sns.regplot(data=table_natural, x='EcCM_similarity', y='norm r.e.', ax=axs[0])
axs[0].set_title(f'Natural sequences (Coeff: {natural_corr:.2f}, p-value: {natural_p:.2f})')
axs[0].set_xlabel("EcCM similarity %")
axs[0].set_ylabel("Enrichment")

sns.regplot(data=table_variant, x='EcCM_similarity', y='norm r.e.', ax=axs[1])
axs[1].set_title(f'Variant sequences (Coeff: {variant_corr:.2f}, p-value: {variant_p:.2f})')
axs[1].set_xlabel("EcCM similarity %")

plt.tight_layout()
plt.show()

../_images/demos_chorismate_17_0.png

We can see above that there is a statistically significant relationship between sequence identity to EcCM and relative enrichment. The coefficients are <0.2 and therefore not particularly strong, but they do provide a simple baseline against which to judge model performance.

Let’s now use our PoET model to score these same sequences.

Generating an MSA and Prompt for PoET

We will build a prompt around the E. coli chorismate mutase sequence, generating an MSA and filtering it to create a curated prompt to evaluate our target sequences on.

[12]:
seed = table_natural.iloc[0]['Sequence']
seed = seed.replace('-', '')
seed
[12]:
'TSENPLLALREKISALDEKLLALLAERRELAVEVGKAKLLSHRPVRDIDRERDLLERLITLGKAHHLDAHYITRLFQLIIEDSVLTQQALLQQH'
[13]:
msa = session.poet.create_msa(seed.encode())
print(msa)

status=<JobStatus.SUCCESS: 'SUCCESS'> job_id='a4a314ec-61fc-40fa-84ee-e6ec7f9461ec' job_type='/align/align' created_date=datetime.datetime(2023, 9, 4, 12, 40, 28, 403610) start_date=None end_date=datetime.datetime(2023, 9, 4, 12, 40, 28, 463327) prerequisite_job_id=None progress_message=None progress_counter=None num_records=None msa_id='a4a314ec-61fc-40fa-84ee-e6ec7f9461ec'

Let’s examine the resulting MSA:

We can see the constructed MSA contains over 3,700 sequences, only 4 of which were found in our natural sequence dataset:

[14]:
msa_result = pd.DataFrame(list(msa.get_msa()), columns=['name', 'sequence'])
print(f"Number of sequences in the MSA: {msa_result.sequence.nunique()}")
Number of sequences in the MSA: 3741
[15]:
set(table_natural.Sequence.str.replace("-","").tolist()) & set(msa_result.sequence.str.replace("-","").tolist())
[15]:
{'EQNLVNLREKISNIDADLLDLLAKRRQLANNIAKTKLLENRPIRDKNRERELLNVLITKGKNRGLDGFYITRLFQMIIEDSVLTQQALLQQH',
 'TENSLLALRERISALDLKLLALLAERRTMAIEVAQSKMHSHRPIRDKEREAELINTLIQEGKKQGLDGHYITRVYQLIIEDSVLTQQALLQQH',
 'TPENPLLDLRVKISALDEKLLALLAERRALAIEVGKAKLDSHRPVRDIDRERDLLERLIQLGKAHHLDAHYITRLFQLIIEDSVLTQQALLQQH',
 'TSENPLLALREKISALDEKLLALLAERRELAVEVGKAKLLSHRPVRDIDRERDLLERLITLGKAHHLDAHYITRLFQLIIEDSVLTQQALLQQH'}

We have generated a large, diverse MSA that can be filtered into a suitable prompt for our PoET Model.

Rank the natural and variant sequences

Creating a prompt from an MSA involves filtering the ~3,700 sequences down to ~100 based on a number of criteria. Our aim is for the resulting prompt to successfully encode relevant evolutionary and fitness information that our PoET model can use as context when scoring or generating new sequences. For now, let’s try the default prompt settings (a sketch spelling out these parameters explicitly follows the list):

  • num_residues: This controls the maximum number of residues (tokens) in the prompt. We’ll use the default of 12288 here, which works out to roughly 100 sequences for proteins of this length (each of the prompts sampled below contains 100 sequences).

  • method: This parameter signifies the MSA sampling method to be used. The default method is NEIGHBORS_NONGAP_NORM_NO_LIMIT.

  • homology_level: Applicable for the neighbors methods only, this parameter controls the level of homology for sequences in the MSA. The value ranges between 0 and 1, with the default set at 0.8.

  • max_similarity: This defines the maximum similarity between sequences in the MSA and the seed, ranging between 0 and 1. By default, it is set to 1.0.

  • min_similarity: This determines the minimum similarity between sequences in the MSA and the seed, with values ranging from 0 to 1. The default is set at 0.0.

  • always_include_seed_sequence: This Boolean parameter controls whether to always include the seed sequence in the MSA or not. By default, it is set to False.

  • num_ensemble_prompts: This denotes the number of ensemble jobs to run; each job will use a different prompt, giving a more diverse set of scores. We will set this value to 3.

  • random_seed: The seed allows you to make reproducible runs. We’ll set it to 42 here.
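For reference, a call that spells out all of these parameters explicitly might look like the sketch below. The keyword names follow the parameter list above, and the values shown are the defaults except for num_ensemble_prompts and random_seed, so this should be equivalent to the call in the next cell (assuming the keyword names match the list above).

# Sketch: sampling a prompt with the parameters above written out explicitly.
# `method` is omitted here and so keeps its default, NEIGHBORS_NONGAP_NORM_NO_LIMIT.
prompt = msa.sample_prompt(
    num_residues=12288,                  # maximum number of residues (tokens) in the prompt
    homology_level=0.8,                  # homology level for the neighbors-based methods
    max_similarity=1.0,                  # maximum similarity to the seed
    min_similarity=0.0,                  # minimum similarity to the seed
    always_include_seed_sequence=False,  # don't force the seed sequence into the prompt
    num_ensemble_prompts=3,              # three prompts -> an ensemble of three scores
    random_seed=42,                      # for reproducible sampling
)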

[16]:
prompt = msa.sample_prompt(num_ensemble_prompts=3, random_seed=42)
print(prompt)
status=<JobStatus.SUCCESS: 'SUCCESS'> job_id='f16b9c39-83cc-413f-8a2a-ed6bb8034a13' job_type='/align/prompt' created_date=datetime.datetime(2023, 8, 31, 11, 47, 35, 788089) start_date=datetime.datetime(2023, 8, 31, 11, 48, 10, 80963) end_date=datetime.datetime(2023, 8, 31, 11, 48, 20, 327770) prerequisite_job_id=None progress_message=None progress_counter=None num_records=None
[17]:
prompt.wait() # wait for job to complete

prompt_result = []
for i in [1, 2, 3]:
    prompt_result.append(pd.DataFrame(list(prompt.get_prompt(i)), columns=['name', 'sequence']))

We have created 3 prompts, each of 100 sequences. We can see the prompt for the first model (in the ensemble of 3 models) below:

[18]:
prompt_result[0]
[18]:
name sequence
0 UPI00082D6948 LEKLRKQIDAVDQRIIDAFIERLGIAQGVAEYKIRQGLPVLDRSRE...
1 A0A846MY19 ELREQIDFLDSQLVELLSQRQRYIERAAEIKAHRHEIRDEARIQDV...
2 A0A2V9Y9X5 DIEDWRKKIDEMDRKLVALLSERAKAAIAIGKLKRETSLPIYEPDR...
3 U5Q7D5 TSNPLDELRQKIDICDDKLIRILHERMSISETIGRYKKENNITILQ...
4 A0A5Q4EK06 SETALDELRAEVDSMDEMLVWALSVRMELASKIAEVKKEAGLNSLQ...
... ... ...
95 A0A2M6GHJ6 TPEDQLEEMRREIDQLDNELMEVFARRARMSRKIGQYKKEHNMTIL...
96 A0A1Q6Q7M0 DLNDYRKEIDSIDDQLIALFAKRMETAEKIAAYKKANGLRVLDAHR...
97 A0A6J4XVB8 SEEDVATLRKRIDDIDEQILEMIGRRLEAARAIGRIKQQAGITVVD...
98 A0A379T9Q7 MAKRRALAIEVGQAKLLSHRPVRDIDRERALLDRLIHLGKAHHLDA...
99 A0A142V8X9 LSDLRKQIDNLDTELVKLMAKRLEVSDQIGKIKEETNSPVQDLSRE...

100 rows × 2 columns

Each of the three prompts will satisfy the criteria given above, but will use different sequences to do so. This diversity will enable us to run the model 3 times and get an ensemble of scores for a more robust comparison.

Let’s see how many sequences overlap between the prompts:

[19]:
overlap = np.zeros((3,3))
seqs = []
for i,p1 in enumerate(prompt_result):
    for j,p2 in enumerate(prompt_result):
        overlapped_seqs = list(set(p1['sequence']) & set(p2['sequence']))
        overlap[i,j] = len(overlapped_seqs)
        if i!=j:
            seqs += overlapped_seqs
seqs = list(set(seqs))
[20]:
pd.DataFrame(overlap, columns=['prompt1','prompt2','prompt3'], index=['prompt1','prompt2','prompt3'],).astype(int)
[20]:
prompt1 prompt2 prompt3
prompt1 100 2 0
prompt2 2 100 3
prompt3 0 3 100

We can see a handful of sequences appear more than once across the prompts, including TSENPLL…, which occurs under both accession A0A379UTL6 and accession A0A5R1YFZ7 because it is present in two different ColabFold sources.

[21]:
pd.concat(prompt_result).query("sequence in @seqs").drop_duplicates()
[21]:
name sequence
6 A0A4Y7S557 DIKELREELSQIDLELVQLFTKRMDAIARISKYKIQNNFPVLDNER...
63 B0NB59 LEELRIRLDDVDEQMVRLFEERMSICEEVGEYK
20 A0A379UTL6 TSENPLLALRDKISALDEELLALLAKRRALAIEVGQAKLLSHRPVR...
55 A0A4R2PQ94 DDPLVAARTRIAEIDRTLLDLLAERRAMSRRIAEHKLATGTAVRNT...
63 UPI000D3E7058 LQKLRESIDAIDSQLVELLAKRLELVGDVGRVKSHHGLPLYVPERE...
33 A0A5R1YFZ7 TSENPLLALRDKISALDEELLALLAKRRALAIEVGQAKLLSHRPVR...

PoET Scoring

We will now use the 3 diverse prompts above to condition our PoET model and score our database of CM sequences. To do this, we first remove any gap indicators (-) and then send the sequences to the PoET scoring endpoint:

[22]:
sequences = table_natural['Sequence'].apply(lambda s: s.replace('-', '').encode()).values
print(sequences.shape)

future_natural = session.poet.score(prompt, sequences)
print(future_natural.job)
(1130,)
status=<JobStatus.SUCCESS: 'SUCCESS'> job_id='935b59f2-572b-4b91-ab45-82ca7e2a193b' job_type='/poet' created_date=datetime.datetime(2023, 8, 31, 8, 34, 28, 835858) start_date=datetime.datetime(2023, 8, 31, 8, 40, 11, 311632) end_date=datetime.datetime(2023, 8, 31, 8, 40, 39, 834083) prerequisite_job_id=None progress_message=None progress_counter=None num_records=None
[23]:
sequences = table_variant['Sequence'].apply(lambda s: s.replace('-', '').encode()).values
print(sequences.shape)

future_variant = session.poet.score(prompt, sequences)
print(future_variant.job)
(1618,)
status=<JobStatus.SUCCESS: 'SUCCESS'> job_id='7b19f572-5e6b-423b-9a52-1a3f91c0697a' job_type='/poet' created_date=datetime.datetime(2023, 8, 31, 8, 34, 31, 917467) start_date=datetime.datetime(2023, 8, 31, 8, 40, 13, 146207) end_date=datetime.datetime(2023, 8, 31, 8, 40, 46, 100279) prerequisite_job_id=None progress_message=None progress_counter=None num_records=None

We can wait for the jobs to complete; this will also auto-fetch the results:

[24]:
results_natural = future_natural.wait(verbose=True)
results_variant = future_variant.wait(verbose=True)

print(len(results_natural), len(results_variant))
Waiting: 100%|██████████| 100/100 [00:00<00:00, 329.99it/s, status=SUCCESS]
Waiting: 100%|██████████| 100/100 [00:00<00:00, 435.26it/s, status=SUCCESS]
1130 1618
[25]:
scores_natural = np.array([r.score for r in results_natural])
scores_variant = np.array([r.score for r in results_variant])
scores_natural.shape, scores_variant.shape
[25]:
((1130, 3), (1618, 3))
[26]:
score_natural = scores_natural.mean(axis=1)
score_variant = scores_variant.mean(axis=1)
score_natural.shape, score_variant.shape
[26]:
((1130,), (1618,))

We can now compare the enrichment scores with the PoET likelihoods for this default prompt:

[27]:
_, axs = plt.subplots(1, 2, figsize=(12, 4))

x = score_natural
y = table_natural['norm r.e.'].values
sns.regplot(x=x, y=y, ax=axs[0])
natural_corr, natural_p = spearmanr(x, y)
axs[0].set_title(f'Natural sequences (Coeff: {natural_corr:.2f}, p-value: {natural_p:.2f})')
axs[0].set_xlabel("Score")
axs[0].set_ylabel("Enrichment")

x = score_variant
y = table_variant['norm r.e.'].values
sns.regplot(x=x, y=y, ax=axs[1])
variant_corr, variant_p = spearmanr(x, y)
axs[1].set_title(f'Variant sequences (Coeff: {variant_corr:.2f}, p-value: {variant_p:.2f})')
axs[1].set_xlabel("Score")
axs[1].set_ylabel("Enrichment")

plt.tight_layout()
plt.show()

../_images/demos_chorismate_48_0.png

We can see above that both the natural and variant sequences correlate strongly with the PoET likelihood scores when conditioned on the default prompt. Both relationships have p < 0.05, indicating significance. Furthermore, the correlation coefficient is stronger for the variant sequences than for the natural sequences (0.5 vs 0.35).

We can also see that both these coefficients exceed the <0.2 correlation we saw with our simple EcCM identity baseline model.

Most strikingly, however, we note that both of these results improve upon the EDCA measure we saw above, from Russ et al 2020, where the correlation coefficients were ~0.04 and ~0.4 for natural and variant sequences respectively.

Combining these sequences we can see the trend still holds:

[28]:
x_combined = np.concatenate([score_natural, score_variant])
y_combined = np.concatenate([table_natural['norm r.e.'].values, table_variant['norm r.e.'].values])
correlation, p_value = spearmanr(x_combined, y_combined)

sns.regplot(x=x_combined, y=y_combined)
plt.title(f'All Sequences (Spearman: {correlation:.2f}, p-value: {p_value:.2f})')
plt.xlabel("Score")
plt.ylabel("Enrichment")
plt.tight_layout()
plt.show()
../_images/demos_chorismate_51_0.png

Lastly, let’s compare PoET scores to EcCM identity:

[29]:
_, axs = plt.subplots(1, 2, figsize=(12, 4))

x = table_natural['EcCM_similarity'].values
y = score_natural
sns.regplot(x=x, y=y, ax=axs[0])
natural_corr, natural_p = spearmanr(x, y)
axs[0].set_title(f'Natural sequences (Coeff: {natural_corr:.2f}, p-value: {natural_p:.2f})')
axs[0].set_ylabel("Score")
axs[0].set_xlabel("EcCM similarity %")

x = table_variant['EcCM_similarity'].values
y = score_variant
sns.regplot(x=x, y=y, ax=axs[1])
variant_corr, variant_p = spearmanr(x, y)
axs[1].set_title(f'Variant sequences (Coeff: {variant_corr:.2f}, p-value: {variant_p:.2f})')
axs[1].set_ylabel("Score")
axs[1].set_xlabel("EcCM similarity %")

plt.tight_layout()
plt.show()

../_images/demos_chorismate_53_0.png

We can see there is a relationship between PoET scores and EcCM similarity. However, given that the PoET score vs enrichment correlations are much stronger than the similarity vs enrichment correlations, we know the model is learning something more fundamental about protein fitness from the MSA-based prompts.

Improving the prompt

We have seen how PoET, conditioned on a default prompt, produces statistically significant correlations between the experimental relative enrichment and PoET likelihoods.

Let’s now try to improve the prompting by guiding the model with purely natural, functional sequences of chorismate mutase. To accomplish this, we will skip the seed -> MSA generation step and supply a ready-made MSA instead (containing functional natural sequences only, where functional is defined as an enrichment of >0.45).

Providing sequences that operate within the natural functional subspace allows us to steer the PoET model’s attention towards naturally occurring, functionally viable protein sequences. This focused approach enables the model to discern evolutionary patterns, relationships and constraints more effectively. Consequently, when scoring potential sequences, PoET is better equipped to make accurate and informed predictions that are grounded in naturally observed functionality and diversity.

[30]:
table_nat_functional = table_natural.loc[table_natural['norm r.e.'] >= 0.45]
print(table_nat_functional.shape)
table_nat_functional.head()
(407, 8)
[30]:
No. Species Sequence ID to EcCM top ID EDCA (rel. to EcCM) norm r.e. EcCM_similarity
0 1 Escherichia coli ... -TSENPLLALREKISALDEKLLALLAERRELAVEVGKAKLLSHRPV... 1.00 0.93 0.00 1.00 96.0
3 4 Enterobacter cloacae subsp. dissolvens SDM ... -TPENPLLDLRVKISALDEKLLALLAERRALAIEVGKAKLDSHRPV... 0.93 0.93 8.57 0.65 89.0
5 6 Salmonella enterica subsp. enterica serovar Sa... -TSENPLLALRDKISALDEELLALLAKRRALAIEVGQAKLLSHRPV... 0.90 0.89 9.83 0.70 86.0
6 7 Cronobacter sakazakii ATCC BAA-894 ... -TAENPLLALRDKISALDEQLLALLAQRRALAVEVGKAKLATHRPV... 0.91 0.90 6.42 0.95 87.0
8 9 Erwinia billingiae Eb661 ... -TPENPLLGLRDKISALDEQLLNMLAQRRALAIEVATAKMATHRPI... 0.80 0.85 26.25 0.84 77.0
[31]:
from io import BytesIO

msa_const = session.poet.upload_msa(BytesIO(b'\n'.join(table_nat_functional['Sequence'].apply(lambda s: s.encode()))))
print(msa_const)


status=<JobStatus.SUCCESS: 'SUCCESS'> job_id='8466c46d-9741-4729-8b81-d5cbf9f416a8' job_type='/align/align' created_date=datetime.datetime(2023, 8, 31, 13, 3, 9, 66237) start_date=None end_date=datetime.datetime(2023, 8, 31, 13, 3, 9, 105718) prerequisite_job_id=None progress_message=None progress_counter=None num_records=None
[32]:
prompt_const = msa_const.sample_prompt( num_ensemble_prompts=3, random_seed=42)
print(prompt_const)
status=<JobStatus.SUCCESS: 'SUCCESS'> job_id='4c466f72-a87e-4c5c-8526-d0f629008a25' job_type='/align/prompt' created_date=datetime.datetime(2023, 8, 31, 13, 4, 17, 166501) start_date=datetime.datetime(2023, 8, 31, 13, 4, 48, 771624) end_date=datetime.datetime(2023, 8, 31, 13, 4, 50, 612839) prerequisite_job_id=None progress_message=None progress_counter=None num_records=None
[33]:
sequences = table_variant['Sequence'].apply(lambda s: s.replace('-', '').encode()).values
print(sequences.shape)


future_variant_const = session.poet.score(prompt_const, sequences)

print(future_variant_const.job)
(1618,)
reloaded c3f2bb8a-0bbf-4efa-9dc7-b931c747319c
status=<JobStatus.SUCCESS: 'SUCCESS'> job_id='c3f2bb8a-0bbf-4efa-9dc7-b931c747319c' job_type='/poet' created_date=datetime.datetime(2023, 8, 31, 13, 12, 34, 213447) start_date=datetime.datetime(2023, 8, 31, 13, 26, 23, 201731) end_date=datetime.datetime(2023, 8, 31, 13, 27, 1, 501021) prerequisite_job_id=None progress_message=None progress_counter=None num_records=None

We can again wait for the job to complete.

[34]:
results_variant_const = future_variant_const.wait(verbose=True)
print(len(results_variant_const))
Waiting: 100%|██████████| 100/100 [00:00<00:00, 505.06it/s, status=SUCCESS]
1618
[35]:
scores_variant_const = np.array([r.score for r in results_variant_const])
score_variant_const = scores_variant_const.mean(axis=1)
scores_variant_const.shape, score_variant_const.shape
[35]:
((1618, 3), (1618,))
[36]:
_, ax = plt.subplots(figsize=(6, 4))

x = score_variant_const
y = table_variant['norm r.e.'].values


correlation, p_value = spearmanr(x, y)

sns.regplot(x=x, y=y)
plt.title(f'Variant Sequences - specialized prompt (Spearman: {correlation:.2f}, p-value: {p_value:.2f})')
plt.xlabel("Score")
plt.ylabel("Enrichment")
plt.tight_layout()
plt.show()
../_images/demos_chorismate_63_0.png

We can see above that the specialised prompt has improved the Spearman correlation from 0.5 (with the default prompt parameters) to 0.55 on the variant dataset.

Sequence generation

Given the good performance of this prompt in scoring sequences, we can re-use it to generate novel sequences de novo.

[37]:

future_gen = session.poet.generate(prompt_const, num_samples=1000, topp=0.9)
print(future_gen.job)
status=<JobStatus.SUCCESS: 'SUCCESS'> job_id='6f029e64-3bed-4f19-804a-3f0a36e5e4b2' job_type='/poet/generate' created_date=datetime.datetime(2023, 8, 31, 14, 19, 55, 723824) start_date=datetime.datetime(2023, 8, 31, 16, 31, 12, 770341) end_date=datetime.datetime(2023, 8, 31, 17, 15, 5, 655404) prerequisite_job_id=None progress_message=None progress_counter=None num_records=None
[38]:
results_gen = future_gen.wait(verbose=True)
print(len(results_gen))
Waiting: 100%|██████████| 100/100 [00:00<00:00, 328.94it/s, status=SUCCESS]
Retrieving: 1000it [00:04, 229.63it/s]
1000

Let’s look at the top scoring sequences that were generated:

[39]:
sample_names = np.array([s.name for s in results_gen])
sample_sequences = np.array([s.sequence for s in results_gen])
sample_scores = np.array([s.score for s in results_gen]).mean(axis=1)
order = np.argsort(-sample_scores)
for i in order[:10]:
    print(sample_names[i], sample_scores[i], sample_sequences[i].decode(), sep='\t')
generated-sequence-912  -13.342949231465658     SSEKRLEEIRKRISEIDEQLITLIAERTGFALEIASLKNSLGASVTDSKREQDICEQTRILCEECIECSVALKIIKILMEYNKEVQAEFFRKV
generated-sequence-583  -14.095219612121582     KNLDLGELRGQLDEIDGQLVRLFEQRMKICADVAEYKIETGKAVYDGERERQKLEAVGAMAHGFNSQAVRELFAQLMTVSRKRQYQLLARH
generated-sequence-731  -14.92087173461914      SSEKRLEEIRKRISEIDEQLITLIAERTGFAPEIASLKNSLGASVTDSKREQDICEQTRILCEECIECSVALKIIKILMEYNKEVQADFFRKV
generated-sequence-910  -15.217793146769205     EARRLLEEHRRTIDNLDAALIHILAERFRCTQQVGRLKAANSMPPSDPEREERQIRRLRKLAADGLDPVFAEKFLRFIVSEVIRHHERIADES
generated-sequence-730  -15.646750450134277     KNLDLGELRGQLDEIDGQLVRLFEQRMKICADVAEYKIETGKAVYDGERERQKLEAVGAMAHGFNSQAVRELFSQLMTVSRKRQYQLLARQ
generated-sequence-956  -16.414565404256184     SQPNERLNDLRRAVDHLDDQLLDILQERFAVIDQIGAHKREHHLSVFQSDRWKDVMESRTQKGVQNLSEKFMKELLYSIHEESVKRQEKQLKES
generated-sequence-925  -17.29950205485026      KNLDLGELRGQLDEIDGQLVRLFEQRMKICADVAEYKIETGKAVYDGERERQKLEAVGAMAHGFNSQAVRELFSQLMTISRKRQYQLLAKH
generated-sequence-354  -18.183929443359375     QPNERLNDLRRAVDHLDDQLLDILQERFAVIDQIGAHKREHHLSVFQSDRWKDVMESRTQKGVQNLSEKFMKELLYSIHEESVKRQEKQLRDT
generated-sequence-848  -18.382567087809246     SSEKRLEEIRKRISEIDEQLITLIAERTGFAPEIASLKNSLGASVTDSKREQDICEQTRILCEECIECSVALKIIKILMEYNKEVQAEYLRKV
generated-sequence-248  -20.782123565673828     EARRLLEEHRRTIDNLDAALIHILAERFRCTQQVGRLKAANGMPPADPDREERQIRRLRKLAADGLDPVFAEKFLRFIVSEVIRHHERIADEF

We can also compare (below) the scores for the generated sequences with the scores of the variants from Russ 2020. Note that the generated sequences (orange) are generally right-shifted (i.e. higher scoring). Given the strong correlation (>0.5) we saw above between enrichment performance and PoET score, there is reason to believe these de novo generated sequences will perform better than the EDCA ranked sequences.

[40]:
plt.title("PoET Generated vs EDCA Variant sequences")
mi = np.min(score_variant_const)
ma = max(np.max(score_variant_const), np.max(sample_scores))
bins = np.linspace(mi, ma, 31)

plt.xlabel("PoET Score")
plt.ylabel("Frequency")
_ = plt.hist(score_variant_const, bins=bins, alpha=0.5, label='Variants')
_ = plt.hist(sample_scores, bins=bins, alpha=0.5, label="Generated")
plt.legend();
../_images/demos_chorismate_71_0.png

Generated sequences

Let’s also compare the top ranked sequences to the original EcCM:

[41]:
from Bio import pairwise2
import Bio.Align.substitution_matrices
blosum62 = Bio.Align.substitution_matrices.load('BLOSUM62')
[42]:
for i in order[:5]:
    print("---"*10)
    x = sample_sequences[i].decode()
    print(f'>{sample_names[i]} {sample_scores[i]}')
    for a in pairwise2.align.globalds(seed, x, blosum62, -11, -1):
        print(pairwise2.format_alignment(*a))

        break
------------------------------
>generated-sequence-912 -13.342949231465658
TSENPLLALREKISALDEKLLALLAERRELAVEVGKAKLLSHRPVRDIDRERDLLERLITLGKAHHLDAHYITRLFQLIIEDSVLTQQALLQQH
.||..|...|..||..||.|..|.|||...|.|....|......|.|..||.|..|....|.. ................|.....|.......
SSEKRLEEIRKRISEIDEQLITLIAERTGFALEIASLKNSLGASVTDSKREQDICEQTRILCE-ECIECSVALKIIKILMEYNKEVQAEFFRKV
  Score=91

------------------------------
>generated-sequence-583 -14.095219612121582
TSENPLLALREKISALDEKLLALLAERRELAVEVGKAKLLSHRPVRDIDRERDLLERLITLGKAHHLDAHYITRLFQLIIEDSVLTQQALLQQH
.... |..||......|..|..|...|......|...|......|.|..|||..||.....  ||.........||......|...|..||..|
KNLD-LGELRGQLDEIDGQLVRLFEQRMKICADVAEYKIETGKAVYDGERERQKLEAVGAM--AHGFNSQAVRELFAQLMTVSRKRQYQLLARH
  Score=76

------------------------------
>generated-sequence-731 -14.92087173461914
TSENPLLALREKISALDEKLLALLAERRELAVEVGKAKLLSHRPVRDIDRERDLLERLITLGKAHHLDAHYITRLFQLIIEDSVLTQQALLQQH
.||..|...|..||..||.|..|.|||...|.|....|......|.|..||.|..|....|.. ................|.....|.......
SSEKRLEEIRKRISEIDEQLITLIAERTGFAPEIASLKNSLGASVTDSKREQDICEQTRILCE-ECIECSVALKIIKILMEYNKEVQADFFRKV
  Score=87

------------------------------
>generated-sequence-910 -15.217793146769205
TSENPLLALREKISALDEKLLALLAERRELAVEVGKAKLLSHRPVRDIDRERDLLERLITLGKAHHLDAHYITRLFQLIIEDSVLTQQALLQQH
.....|...|..|..||..|...||||......||..|.....|..|..||.....||..|. |..||..........|...............
EARRLLEEHRRTIDNLDAALIHILAERFRCTQQVGRLKAANSMPPSDPEREERQIRRLRKLA-ADGLDPVFAEKFLRFIVSEVIRHHERIADES
  Score=76

------------------------------
>generated-sequence-730 -15.646750450134277
TSENPLLALREKISALDEKLLALLAERRELAVEVGKAKLLSHRPVRDIDRERDLLERLITLGKAHHLDAHYITRLFQLIIEDSVLTQQALLQQH
.... |..||......|..|..|...|......|...|......|.|..|||..||.....  ||.........||......|...|..||...
KNLD-LGELRGQLDEIDGQLVRLFEQRMKICADVAEYKIETGKAVYDGERERQKLEAVGAM--AHGFNSQAVRELFSQLMTVSRKRQYQLLARQ
  Score=69

We can see there is significant sequence diversity in the generated sequences compared to EcCM. Given the histogram above, we might expect some of these newly generated sequences to outperform the original variants from Russ 2020.

Sequence UMAP

We can gain more insight into the sequence space occupied by our generated sequences vs the variant and natural sequences in our original dataset by running a UMAP dimension reduction and visualization pipeline.

To visualize the generated sequences in the same space as the natural and variant sequences from the paper, we need to use an external multiple sequence alignment tool. We’ll output a FASTA file, run the alignment elsewhere (a sketch of one way to do this locally follows the FASTA-writing cell below), and save the resulting alignment to data/chorismate_mutase_combined_seqs.clustal for use in this notebook.

[43]:
seqs = np.stack(
    [np.frombuffer(s.encode().upper(), dtype=np.uint8) for s in table_natural['Sequence']] + [np.frombuffer(s.encode().upper(), dtype=np.uint8) for s in table_variant['Sequence']]
)
seqs.shape
[43]:
(2748, 96)
[44]:
with open('data/chorismate_mutase_combined_seqs.fasta', 'wb') as f:
    for i in range(len(table_natural)):
        name = f'>Natural_{i+1}'.encode()
        f.write(name)
        f.write(b'\n')
        f.write(table_natural.iloc[i]['Sequence'].encode().replace(b'-', b''))
        f.write(b'\n')
    for i in range(len(table_variant)):
        name = f'>Variant_{i+1}'.encode()
        f.write(name)
        f.write(b'\n')
        f.write(table_variant.iloc[i]['Sequence'].encode().replace(b'-', b''))
        f.write(b'\n')
    for i in range(len(sample_sequences)):
        name = f'>poet_generated_{i+1}'.encode()
        f.write(name)
        f.write(b'\n')
        f.write(sample_sequences[i])
        f.write(b'\n')
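For completeness, one way to produce that alignment locally is sketched below; it assumes the Clustal Omega binary (clustalo) is installed and on your PATH, but any aligner that writes Clustal-format output would work.

# Sketch: align the combined FASTA with Clustal Omega to produce the Clustal-format
# file that is read back in below. Assumes the `clustalo` binary is available locally.
import subprocess

subprocess.run([
    'clustalo',
    '-i', 'data/chorismate_mutase_combined_seqs.fasta',
    '-o', 'data/chorismate_mutase_combined_seqs.clustal',
    '--outfmt=clustal',
    '--force',  # overwrite the output file if it already exists
], check=True)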
[45]:
from Bio import AlignIO

with open('data/chorismate_mutase_combined_seqs.clustal') as f:
    alignment = AlignIO.read(f, 'clustal')
[46]:
aln = np.stack([np.frombuffer(s.seq.encode().upper(), dtype=np.uint8) for s in alignment])

groups = []
measure = []
for i in range(len(alignment)):
    tokens = alignment[i].id.split('_')
    g = '_'.join(tokens[:-1])
    j = int(tokens[-1]) - 1
    v = np.nan
    if g == 'Variant':
        v = table_variant.iloc[j]['norm r.e.']
    elif g == 'Natural':
        v = table_natural.iloc[j]['norm r.e.']

    groups.append(g)
    measure.append(v)
measure = np.array(measure)
groups[:10], measure[:10]
c:\Users\micha\miniconda3\envs\py38\lib\site-packages\Bio\Seq.py:1772: BiopythonDeprecationWarning: myseq.encode has been deprecated; please use bytes(myseq) instead.
  warnings.warn(
[46]:
(['Natural',
  'Natural',
  'Natural',
  'Variant',
  'Variant',
  'Natural',
  'Natural',
  'Natural',
  'Variant',
  'Variant'],
 array([ 0.27,  0.14, -0.02, -0.13, -0.15,  0.18, -0.2 , -0.02,  0.15,
        -0.07]))
[47]:
umap = UMAP(metric='hamming')
z = umap.fit_transform(aln)
c:\Users\micha\miniconda3\envs\py38\lib\site-packages\umap\umap_.py:1802: UserWarning: gradient function is not yet implemented for hamming distance metric; inverse_transform will be unavailable
  warn(

We can see below a UMAP representation of the CM natural sequences (blue) and the subset of sequences that are functional (defined as an enrichment of >0.45). Whilst there is overall good overlap, we can see that some natural sequences that are non-functional clearly cluster together:

[48]:
fig = plt.figure(figsize=(8,8))
s = 20

loc = ~np.isnan(measure)
sns.scatterplot(x=z[loc, 0], y=z[loc, 1], data=None, alpha=0.5, s=s, label='Natural sequence space')

loc = ~np.isnan(measure) & (measure > 0.45)
sns.scatterplot(x=z[loc, 0], y=z[loc, 1], data=None, alpha=0.5, s=s, label='Functional subspace')

plt.legend(loc='best')
plt.title("Natural and Functional sequences")
plt.xticks([])
plt.yticks([]);
plt.xlabel("UMAP1")
plt.ylabel("UMAP2");
../_images/demos_chorismate_83_0.png

We can now add in our generated PoET sequences (green):

[49]:
fig = plt.figure(figsize=(8,8))
s = 20

loc = ~np.isnan(measure)
sns.scatterplot(x=z[loc, 0], y=z[loc, 1], s=s, label='Natural sequence space')

loc = ~np.isnan(measure) & (measure > 0.45)
sns.scatterplot(x=z[loc, 0], y=z[loc, 1], s=s, label='Functional subspace')

loc = np.isnan(measure)
sns.scatterplot(x=z[loc, 0], y=z[loc, 1], s=s, label='PoET Generated')

plt.legend(loc='best')
plt.xticks([])
plt.yticks([])
plt.xlabel("UMAP1")
plt.ylabel("UMAP2");
plt.title("PoET, natural and functional sequences");
../_images/demos_chorismate_85_0.png

We can see above that the PoET generated sequences cover the functional subspace well (represented by the overlap of green and orange points) and rarely overlap with natural sequences that are non-functional as CM. This suggests PoET is correctly inferring functional properties from our CM prompt, and again that these sequences could be fruitful for further wetlab analysis!

Lastly, we can make a gif of these results to overlay them one-by-one:

[50]:
from matplotlib.animation import FuncAnimation

fig = plt.figure(figsize=(8,8))
s = 2.0

loc = ~np.isnan(measure)
plt.scatter(z[loc, 0], z[loc, 1], s=s, label='Natural sequence space')
plt.legend(loc='best')
plt.xticks([])
plt.yticks([])
plt.tight_layout()

def animate(step):
  if step == 1:
    loc = ~np.isnan(measure) & (measure > 0.45)
    plt.scatter(z[loc, 0], z[loc, 1], s=s, label='Functional subspace')
  if step == 2:
    loc = np.isnan(measure)
    plt.scatter(z[loc, 0], z[loc, 1], s=s, label='PoET Generated')
  plt.legend(loc='best')


animation = FuncAnimation(
  fig,
  animate,
  frames=[0, 1, 2],
  interval=5.0,
)

animation.save(
  'chorismate_umap.gif',
  dpi=300,
  fps=1.0,
)
plt.title("UMAP sequence space");
plt.xlabel("UMAP1")
plt.ylabel("UMAP2");
MovieWriter ffmpeg unavailable; using Pillow instead.
../_images/demos_chorismate_88_1.png

Summary

In this tutorial we have taken a dataset of chorismate mutase proteins with associated wetlab measurements (relative enrichment). We have shown that PoET can accurately score CM variants and natural sequences, and that our off-the-shelf solution performs better than another method (Russ et al. 2020) based on EDCA. Specifically, we achieve correlations between PoET score and enrichment of 0.35 and 0.5, versus 0.04 and 0.41 for EDCA, for natural and variant sequences respectively.

Furthermore, we develop novel CM sequences that may display superior enrichment vs the Russ (2020) proposed sequences.

Overall, we have used PoET to generate new sequences without costly wetlab experimentation and without prior measurements. We could take these sequences for downstream wetlab assays to kickstart a productive protein engineering cycle!