Tutorial: Generating substrate specific thiolases with PoET
This tutorial will introduce OpenProtein’s new PoET workflow, a cutting-edge protein language model designed to facilitate evaluating protein sequences from scratch, eliminating the need for functional measurements of your specific protein. With this model, you can effortlessly score candidate proteins, or generate a custom diverse library before running any wetlab experiments!
The core workflow in PoET is governed by a prompt, which consists of a collection of sequences that encode valuable insights into the local fitness landscape and co-evolutionary patterns associated with your protein of interest. One effective approach for constructing this prompt involves, first; generating a multiple sequence alignment (MSA) of your protein with sequences from various evolutionary lineages, and second; applying filters to that MSA to create a focussed prompt. Since these sequences share an evolutionary connection, they inherently carry information about the protein’s local fitness landscape. PoET leverages this information to both:
assess the probability of observing a specific sequence based on the inferred evolutionary process
generate novel sequences based on the same inferred evolutionary landscape.
We will apply this new machine learning model to develop novel thiolase enzymes with activity against a compound of interest.
Thiolase superfamily
Thiolases play a critical role in catalyzing thioester formation — a key process in fatty acid biosynthesis in vivo. Thiolases have attracted significant research interest due to their role in catalysing thioester formation, which is a key intermediary in several industrial processes, including:
Pharmaceuticals: thioesters can be extensively modified with specific moieties, enabling the synthesis of a wide range of pharmaceuticals with desired properties
Biopolymers: thiolases enable the efficient conversion of organic acids into thioesters, which are subsequently polymerized to create environmentally friendly biopolymers. For example polyhydroxyalkanoate (PHA).
Flavours and fragrences: thioesters are key components in the synthesis of aroma compounds and flavour enhancers
Understanding the factors that govern thiolase substrate specificity is therefore crucial. Furthermore, developing novel thiolases with improved specificity or activity may yield novel products or more efficient industrial processes.
This tutorial will build on the work and data made available by Robinson (2020). Robinson et al. investigated thiolase substrate specificity and activity with an extensive library of compounds and thiolases.
We will demonstrate how OpenProtein’s PoET model can leverage co-evolutionary data from an MSA to score sequences, generate new candidate sequences de novo, and facilitate further in-vitro testing. Our workflow will begin with a single seed sequence obtained freely from Uniprot, while the Robinson dataset will tie our observations back to experimental data, and provide an empirical validation of the trends we see in our modelling.
Activity and specificity dataset
I have reproduced figure 2 from the Robinson paper as a visual summary of the dataset; we have 15 chemical compounds of various chemical structures, and 73 thiolases from a broad range of taxonomic classes. We can note that most thiolases show activity against only a subset of compounds, and activity is highly variable (~1-3 orders of magnitude difference).
Code
import pandas as pd import jsonfrom AWSTools.Batchtools import batch_utils as bu import openproteinfrom scipy.stats import spearmanrimport numpy as npimport osimport matplotlib.pyplot as pltimport seaborn as snsfrom pandas import option_contextfrom misc import wait_for_jobimport requests sns.set_style("whitegrid")withopen('secrets.config', 'r') as f: config = json.load(f)session = openprotein.connect(config['username'], config['password'])
Loading the data
Let’s first load and examine the Robinson 2020 data:
Code
data = pd.read_csv("./data/Robinson/enzyme_substrate_activity.csv")data = data.rename(columns={"cmpnd": "compound", 'log_slope':'log_activity'})data['compound'] = data.compound.str.replace(" ","_")# read in sequences and match to datasequences = bu.fasta2csv("./data/Robinson//robinson.fa")sequences['org'] = sequences['accession'].apply(lambda x: " ".join( x.split('1_')[-1:]) )sequences['org'] = sequences.org.str.replace("_"," ")sequences = sequences.rename(columns={"seq": "sequence"})# manual matchessequences.loc[sequences.accession.str.contains("Halobacteriovorax"),'org'] ="Halobacteriovorax marinus"sequences.loc[sequences.accession.str.contains("Enhygromyxa"),'org'] ="Enhygromyxa salina"sequences.loc[sequences.accession.str.contains("Kocuria_varians"),'org'] ="Kocuria varians"sequences.loc[sequences.accession.str.contains("Pseudoxanthomonas"),'org'] ="Pseudoxanthomonas"sequences.loc[sequences.accession.str.contains("translucens"),'org'] ="Xanthomonas translucens"sequences.loc[sequences.accession.str.contains("Leifsonia"),'org'] ="Leifsonia sp. Leaf325"sequences.loc[sequences.accession.str.contains("Chloroflexi_bacterium"),'org'] ="Chloroflexi bacterium RBG 13 51 36"matched =len(set(sequences.org.unique()) &set(data.org.unique()))# merge togetherdata = data.merge(sequences, on='org', how='left')assert data[data.org.isna()].shape[0]==0data = data.sort_values(['org','compound'])print( f"Matched {matched} out of {len(data.org.unique())} thiolases" )data.head(3)
Matched 72 out of 73 thiolases
filename
compound
org
log_activity
accession
sequence
404
output//2019-11-22/2019-11-22_7Ph heptanoate_a...
7Ph_heptanoate
Actinomycetospora chiangmaiensis
1.437703
WP_026204600.1_Actinomycetospora_chiangmaiensis
MRAAQRYDDVVITGLAHVDAPHVVTSTELEDRLAGTLGRLRITPGL...
657
output//2019-12-13/2019-12-13_azido_all_data_c...
azido
Actinomycetospora chiangmaiensis
1.722814
WP_026204600.1_Actinomycetospora_chiangmaiensis
MRAAQRYDDVVITGLAHVDAPHVVTSTELEDRLAGTLGRLRITPGL...
295
output//2019-11-15/2019-11-15_butoxy_all_data_...
butoxy
Actinomycetospora chiangmaiensis
1.989242
WP_026204600.1_Actinomycetospora_chiangmaiensis
MRAAQRYDDVVITGLAHVDAPHVVTSTELEDRLAGTLGRLRITPGL...
We can see above the same data that corresponds to figure 2. We will need to wrangle the data a little to summarise replicates and add back in the 0 activity rows:
Code
# Summarize replicatesdata = ( data.groupby(['org', 'compound', 'sequence']) .agg(mean_log_activity=('log_activity', 'mean'), std_log_activity=('log_activity', 'std'), count_activity=('log_activity', 'count')) .reset_index())data = data.fillna(0.0)# add in zero-activity rowsnew_df = pd.DataFrame(index=pd.MultiIndex.from_product([data['org'].unique(), data['compound'].unique()], names=['org', 'compound']))new_df['mean_log_activity'] =0new_df['std_log_activity'] =0# update the new dataframe with the original dataframe's valuesnew_df.update(data.set_index(['org', 'compound']))# reset the index and mergedata = new_df.reset_index()data = data.merge(sequences, on='org', how='left')data.head()
We have been able to reproduce most of the Robinson dataset, yielding measurements for 14 out of 15 compounds across 72 out of 73 organisms.
Exploring the data
We can investigate the data briefly before proceeding with PoET.
First, let’s examine the sequences from each organism to get an idea of sequence diversity:
Code
from Bio import pairwise2def align_report(seq1, seq2):"""Calculate the similarity between two sequences using the globalxx alignment score"""if seq1==seq2:return1 alignments = pairwise2.align.globalxx(seq1, seq2) top_alignment = alignments[0]print(pairwise2.format_alignment(*top_alignment) )return pairwise2.format_alignment(*top_alignment)def calculate_similarity(seq1, seq2):"""Calculate the similarity between two sequences using the globalxx alignment score"""if seq1==seq2:return1 alignments = pairwise2.align.globalxx(seq1, seq2) top_alignment = alignments[0] alignment_score = top_alignment[-3] similarity = alignment_score /len(seq1)return similaritydef generate_similarity_matrix(sequences):"""Generate a similarity matrix for a list of sequences""" num_sequences =len(sequences) similarity_matrix = np.zeros((num_sequences, num_sequences))for i inrange(num_sequences):for j inrange(i+1, num_sequences): similarity = calculate_similarity(sequences[i], sequences[j]) similarity_matrix[i, j] = similarity similarity_matrix[j, i] = similarity np.fill_diagonal(similarity_matrix, 1)return similarity_matrix
We can see above that there is a lot of diversity in our sequences; the average pairwise similarity is <50%. And this covers a range from 36-88% (excluding the diagonal). This diversity of sequences makes the Robinson data very useful for evolutionary prompting.
Choosing a target candidate
We can now build our prompt, to do that we will first select a good compound for development (i.e. good data availability, broad range of activity and good specificity). Once we have selected our target compound we will take the most active enzyme sequence as a seed to generate an MSA, and then impose filters on that MSA to create a prompt.
First, let’s see how much non-zero data is available for each compound, choosing a compound with more data will make the resulting analysis more robust:
# compounds with most dataordered_compounds = data[data.mean_log_activity!=0].compound.value_counts().index.tolist()data[data.mean_log_activity!=0].compound.value_counts()[::-1].plot(kind='barh', figsize=(8, 4));plt.title("Number of datapoints per compound");plt.xlabel("Number of datapoints");
We can see above the number of data points per compound; azido, butoxy and heptynoate have the most available data, covering >90% of the enzymes.
We also want a target compound with a reasonable range of activities for two reasons:
ideally measurement error should be smaller than true activity differences, this is more likely for larger data ranges
a larger range suggests more scope for improved enzymatic activity, and a more interesting fitness landscape
I’ve plotted the range of activities below per compound:
Code
f, ax = plt.subplots(figsize=(10, 4))sns.boxplot( data.query("mean_log_activity!=0"), x='compound', y='mean_log_activity',ax=ax, order=ordered_compounds)plt.xticks(rotation=90)plt.title("Range of activity for each Compound");plt.ylabel("Enzymatic activity (log10)");plt.xlabel('');
We can see butoxy, hexanoate and heptanoate all have broad ranges of activity. Butoxy was also noted above for its data availibility.
Lastly, let’s look at enzyme specificity, here defined as the correlation in activity across compounds. Specific enzymes are usually more highly valued than promiscuous enzymes, partly due to their greater efficiency and production of fewer by-products. Therefore we want to pick a compound that will enable us to prompt for specificity (as well as activity).
We can see most specific enzymes target dimethyl, with butoxy a little further up the list.
Based on the above findings, butoxy stands out as a promising compound to focus on for several reasons:
Extensive experimental data: Butoxy has many non-zero data points.
Wide range activity: Butoxy activity spans 1-2 orders of magnitude.
Specificity: Butoxy is only weakly correlated to other compounds.
Moreover, the most recent data available lists production of butoxy at 10-50 million pounds per year in the USA. This is used in diverse applications such as:
Paint additives: plasticizers in paints improve durability and flexibility.
Cosmetics and personal care: butoxy solvents prevent ingredient separation and maintain clarity, even at high temperatures.
Hygeine and cleaning: due to its effective ability to dissolve grease and dirt.
Considering these factors, the development of an enhanced butoxy-specific thiolase would yield significant industrial benefits via potential gains in product performance, cost-effectiveness, and sustainability.
Code
selected_compound ='butoxy'
Let’s briefly examine the butoxy data a little more closely.
Below is the correlation between butoxy activity and every other compound in our Robinson dataset. We can see the weakest correlation is against dimethyl, ClPhpropionate and oxadiazole. These will serve as interesting negative controls later on.
Let’s also take a closer look at the distribution of butoxy activity per enzyme (below), we can see this spans 1-2 orders of magnitude, with K. sedentarius having the most active thiolase:
Code
f, ax = plt.subplots(figsize=(10, 4))sns.barplot( data.query("compound== @selected_compound ").sort_values('mean_log_activity'), x='org', y='mean_log_activity',ax=ax, color='darkblue' )plt.xticks(rotation=90)plt.title(f"Range of activity for {selected_compound}");plt.ylabel("Enzymatic activity (log10)");plt.xlabel('');
Summary
We have 72 enzymes that catalyse 14 thiols with a range of activity. We have selected butoxy as our target compound due to it’s data availability and industrial importance. Kytococcus sedentarius represents the most active butoxy thiolase in our dataset.
PoET thiolase scoring
Job creation
Having explored the data and selected butoxy catalysis as our main goal, we can now proceed to the PoET job, and scoring all 72 enzymes.
We will build a prompt around the most active butoxy catalysing enzyme (from K. sedentarius), and compare our model scores with experimental data from Robinson.
Code
x = data.query("compound==@selected_compound").sort_values('mean_log_activity', ascending=False)msa_seed = x.sequence.tolist()[0]msa_seed_id = x.org.tolist()[0] print(f"Selected compound: {selected_compound}")print(f"Most active {selected_compound} enzyme: {msa_seed_id}")print("Enzyme sequence:", msa_seed)
Selected compound: butoxy
Most active butoxy enzyme: Kytococcus sedentarius
Enzyme sequence: MPGNATFRHSNTSVLSVTALPAPEVHTSAEFDARIMDTLRRNKLVPGVLSRLVGIKERRWWPEDVHFTEAAAEAGRRALVDAGIKPEQVGLMVNTSVSRDHLEPSTAVRIHDTLGLPRSASNFDVTNACLGFVNGMQIAAAMIDSGQIEYALIVNAETTRHTHETTLKRLESEDATADDVLQQLATLTLGSGAAAMVLGRTDKHPEGHRFVGGVSRAGTEHHELCVGDFDRMTTDSVGLQEAGIALSEELWAEAAEEFDWQHLDRYVIHQVSTVHTEQICDRLGLDGAKVPRTFPTYGNMGPAAVPFTLALEAPHLERGDRVLLMGIGSGLNACCAEIEW
Code
# one sequence per organism, save to filedata.query("compound==@selected_compound")['sequence'].to_csv("./data/query_data.csv", index=False)
We can now kick off the job!
I will upload the query sequences (to be scored) as well as the seed sequence (to generate the MSA and prompt). For the prompt, I have selected hyper-parameters that will filter the MSA down to sequences that are at least somewhat similar to the seed sequence.
The exact arguments are:
Use first sequence as seed to generate MSA = True (i.e. use K. sedentarius sequence to generate MSA)
number of prompts to ensemble = 1 (ensemble only 1 replicate for now)
minimum similarity to seed = 0.5 (ensure all prompt sequences have >50% sequence similarity to the seed. This is important to select for butoxy activity specifically, and not just thiol activity in general)
Homology level = 0.5 (the tolerance for redundant sequences in the prompt, sequences with >0.5 similarity will be considered redundant and downsampled)
always include seed sequence: True (i.e. ensure K. sedentarius sequence is always included in the prompt)
maximum number of sequences = 50 (the total number of sequences per replicate)
random seed = 42 (for reproducible results)
Code
results_path ="./results/Robinson2020/prots2prot-query.csv"scoring_job_id ='26a43427-796e-4fc6-8c4a-31ca6109fa0b'evo_prompt_id ='3e581bc1-353e-4dfc-aeb4-82e1e6b5843e'scoring_result = pd.read_csv(results_path)print(f"Reading from {scoring_job_id}")scoring_result.head(3)
Reading from 26a43427-796e-4fc6-8c4a-31ca6109fa0b
scores
sequence
score
0
[-492.5433044433594]
MRAAQRYDDVVITGLAHVDAPHVVTSTELEDRLAGTLGRLRITPGL...
-492.543304
1
[-703.46240234375]
MNVGITGIAYALPAREVTTAELQRRVAPKLPPRMFEQATGITRRRV...
-703.462402
2
[-219.8573303222656]
MDGNATTTHRNVALLSVATTMATRITTSAEIDRRLAPSLKRLKLPT...
-219.857330
Prompting PoET
Let’s examine the MSA constructed from our seed sequence (before any filtering, such as minimum similarity is applied):
Code
response = session.prots2prot.get_msa(scoring_job_id)msadf = pd.DataFrame(list(response), columns=['name','sequence'])print("Number of sequences in MSA:", len(msadf))print("Sample:")msadf.head(10)
Number of sequences in MSA: 1604
Sample:
name
sequence
0
seed
MPGNATFRHSNTSVLSVTALPAPEVHTSAEFDARIMDTLRRNKLVP...
1
A0A892J436
MPGNATFRHTNTSVLSVTGLPAPEVHTSAEFDARIMDTLRRNKLVP...
2
A0A1F1KA82
MPGNATFRHSNTSILSVTGLPAPEVRTSAEFDERIMDTLRRNRLVP...
3
A0A4Q7V0T1
MTGTALFHHDNTVIVSVCGIDAPTVVTSAEIDDRLSDTYERLELRP...
4
A0A516PWD2
MSGNATYRLTNTSMLSVSAIEAPVVATSEEFDDRLEATYRRTGLRR...
5
A0A495XTP4
MSGNATFRYANTAVLSVCAVDAPVVMTSDAFDEKLLETYARVGLRS...
6
A0A839QRV3
MGGNATFRHDNSALLAIGSVEAPVVVTSDEFDRRLAPSLKRLRLSR...
7
A0A176UIU6
MIGNATFRHRNTALLSVISVQAPRIVSSTEFDHRLASTLRRLKFPP...
8
A0A838H8R0
MTGNGIFRSSNTSVLAVDALDAPRVVTSAELDERLSATYRRVGLRP...
9
A0A3N0C1T5
MIGNATFRHRNTALLSVSSVEAPKVVSSAEFDHMLAPTLQRLKFPP...
We have >1600 sequences in the MSA, and we told the prompt to take up to 50 (per replicate).
Let’s see how many sequences are in the prompt:
Code
response = session.prots2prot.get_prompt(scoring_job_id, prompt_index=1)promptdf = pd.DataFrame(list(response), columns=['name','sequence'])#rename seed from 101 to uniprotkb entry promptdf.loc[promptdf.name=='101', 'name'] ='C7NEV1'print("Number of sequences in prompt:", len(promptdf))print("Sample:")promptdf.head(10)
Number of sequences in prompt: 50
Sample:
name
sequence
0
seed
MPGNATFRHSNTSVLSVTALPAPEVHTSAEFDARIMDTLRRNKLVP...
1
A0A1F1KA82
MPGNATFRHSNTSILSVTGLPAPEVRTSAEFDERIMDTLRRNRLVP...
2
A0A516PWD2
MSGNATYRLTNTSMLSVSAIEAPVVATSEEFDDRLEATYRRTGLRR...
3
A0A543KKX1
MNGNATFRHHDTAVLSVTAVDAPVVKTSDDFDEVLADSYRRNGLRP...
4
A0A098YAU2
MSGNATHRFSNTSILTVQTADASQVVTSDDIDQRLADTYKRVGLRP...
5
UPI001C43CB83
MPGNATFHHRNASLLSVTEVQAPEVLTSKDLDKRLGDALKRLRLPT...
6
UPI000975AA8B
MPGNATTHFDDVALVSVASVLPSRVTTSDDIEERLAPALKRLKLKP...
7
UPI001C67085F
MSGNITFRHHDTAVLAVTAVDAPVVKTSEEFDAIIGDSYTRNGLRP...
8
UPI00047A1E2B
MTGNFNFRHRNTALLGVESVEAPVIITSAEFDRILSPALRRLRLSK...
9
UPI001963D3D9
MAGNAIHRFANTTVLAVCAIDAPRVVTSAEIDERLADVYTRVSLRP...
The prompt filtering resulted in 50 sequences, including the seed sequence (C7NEV1). Exactly as expected (though <50 would also be expected with more stringent filtering).
We can also sense-check that the prompt doesn’t simply contain the query sequences:
Code
# Use all the prompts combinedresponse = session.prots2prot.get_prompt(scoring_job_id)promptdf = pd.DataFrame(list(response), columns=['name','sequence'])promptdf = promptdf.drop_duplicates().reset_index(drop=True)print("Total sequences across all replicates:", promptdf.shape[0])
Total sequences across all replicates: 50
Code
print(f"Number of overlapping sequences in prompt and query: {len(set(unique_sequences) &set(promptdf.sequence.unique().tolist()))}")x = [i for i in unique_sequences if i in promptdf.sequence.unique().tolist()]data.query("compound==@selected_compound").set_index('sequence').loc[ x ]
Number of overlapping sequences in prompt and query: 2
print(f"Sequence similarity K. sedentarius v M. massiliensis: {calculate_similarity(x[0], x[1]):.2f}")
Sequence similarity K. sedentarius v M. massiliensis: 0.56
As well containing the original seed (as requested in our prompt definition), the prompt also contains a sequence from Mobilicoccus massiliensis (which is also part of the Micrococcales order, and also highly active against butoxy - so no surprise our MSA captured it).
The remaining 48 sequences are from organisms not in our dataset. Let’s try to find these in Uniprot. As the MSA uses ColabFold, which contains metagenomic and transcriptomic data, we may not find all the sequences in the UniProtKB. These will be marked notfound below:
Code
prompt_orgs = [] for accnumber in promptdf.name.unique()[0:]:if accnumber =='101':continue r = requests.get(f"https://www.uniprot.org/uniprotkb/{accnumber}.json", timeout=3) prompt_orgs.append((accnumber, r.json().get('organism', {}).get('scientificName', 'notfound'), r.json().get('organism', {}).get('lineage', [] )))promptdf = pd.DataFrame(prompt_orgs, columns=['accnumber', 'organism', 'lineage']).merge(promptdf, left_on='accnumber', right_on='name')# we know M. massiliensis assert promptdf.loc[promptdf.accnumber=='UPI00059046A7', 'sequence'].tolist()[0]== data.loc[data.org=='Mobilicoccus massiliensis', 'sequence'].tolist()[0]promptdf.loc[promptdf.accnumber=='UPI00059046A7', 'organism'] ='Mobilicoccus massiliensis'promptdf.organism.value_counts()
We can see above we were able to ascertain the origins of just over half the prompt sequences, suggesting the other half came from metagenomic or transcriptomic sources.
Below, I have summarised the taxonomic order for the sequences present in UniProtKB:
We can see the prompt is quite diverse, containing 10 different taxonomic orders.
Scoring results
We can now turn to the PoET results. PoET modelling will return a log-liklihood (LL) score to evaluate the fitness of our queries against the prompt. The higher the score (i.e. less negative) the better the sequence (the more fit).
We can download our results for further analysis from the UI:
Code
scoring_result.head()
scores
sequence
score
0
[-492.5433044433594]
MRAAQRYDDVVITGLAHVDAPHVVTSTELEDRLAGTLGRLRITPGL...
-492.543304
1
[-703.46240234375]
MNVGITGIAYALPAREVTTAELQRRVAPKLPPRMFEQATGITRRRV...
-703.462402
2
[-219.8573303222656]
MDGNATTTHRNVALLSVATTMATRITTSAEIDRRLAPSLKRLKLPT...
-219.857330
3
[-328.9368896484375]
MRGNATTRHSNVALLGITEAVAPVEVSSKSFDDRLADTLKTLRLPK...
-328.936890
4
[-766.7886352539062]
MTYSRIGQIAVHLPEGGRTSAEVEALVAELNPGFEPPEGLIERFTG...
-766.788635
We expect 72 results, one for each enzyme uploaded:
Now let’s see how well our PoET model scores correlate with the experimental activity data against butoxy:
Code
x = compound_scores.query("mean_log_activity>0") # drop 0 active rowsr,p = spearmanr(x['score'], x['mean_log_activity'])r,p = np.round(r,2), np.round(p,2)print(f"Spearman correlation between score and mean_log_activity: {r:.2f}, p={p:.2f}")f, axs = plt.subplots(1,1, figsize=(12,6), sharex=True)ax = axs# full resultssns.regplot(data=x, x='mean_log_activity', y='score', scatter_kws={'s': 5, 'alpha': 0.5}, n_boot=100, robust=True, ax=ax )ax .set_xlabel("Experimental ")ax .set_ylabel("Model score")# add spearmanr r and p results to graphax .text(0.05, 0.9, f"spearman r={r:.2f}\np-value ={p:.2f}", transform=plt.gca().transAxes, size=14);ax.set_title(f"Correlation between {selected_compound} and log-liklihood score\n MSA seed: {msa_seed_id}");
Spearman correlation between score and mean_log_activity: 0.41, p=0.00
Excellent, a >0.4 correlation (p<0.05) between activity and model score with just a single sequence, and no wetlab data!
We can also examine how it performed against compounds that were weakly correlated with butoxy to check specificity (see the reminder below) .
Code
print("Weakest correlation to butoxy activity:")compound_correlations.head(3).reset_index().rename( columns={'index':'compound', selected_compound:'correlation'} )
Weakest correlation to butoxy activity:
compound
correlation
0
dimethyl
0.12
1
ClPh_propionate
0.16
2
oxadiazole
0.25
Given the nature of the prompt we supplied, we expect there to be very little correlation between our model scores and experimental activity for these 3 compounds:
Code
f, axs = plt.subplots(1,3,figsize=(15,5), sharex=True, sharey=True)axs = axs.flatten()for i, compound inenumerate(compound_correlations.index.tolist()[0:3]): x = scoring_result.merge(data.query("compound==@compound"), on='sequence') x = x.query("mean_log_activity!=0") r,p = spearmanr(x['score'], x['mean_log_activity']) sns.regplot(data=x, x='mean_log_activity', y='score', scatter_kws={'s': 5, 'alpha': 0.5}, n_boot=100, robust=True, ax=axs[i] ) axs[i].set_ylabel("model score") axs[i].set_xlabel("Activity (log10)")# add spearmanr r and p results to graph axs[i].set_title(f"{compound} vs score {r=:.2f}, {p=:.2f}");plt.tight_layout()
/opt/conda/envs/py38/lib/python3.8/site-packages/statsmodels/robust/robust_linear_model.py:288: ConvergenceWarning: Estimated scale is 0.0 indicating that the most last iteration produced a perfect fit of the weighted data.
warnings.warn('Estimated scale is 0.0 indicating that the most'
As expected, our p-values are all >0.05, indicating a non-significant relationship between fitness (which we defined using a butoxy thiolase) and experimental activity against dimethyl, ClPh propionate and oxadiazole! This suggests our prompt was succesful in encoding the features that make our butoxy thiolase specific and active!
Note that the key to success here is two things (in order of importance):
minimum similarity sets a floor in how similar our prompt sequences are to the K. sedentarius seed. This is important as we need our prompt to reflect butoxy activity specifically (and not thiolases in general).
homology level sets a ceiling on redundant sequences, ensuring the prompt is still diverse even after the minimum similarity step.
For reference, you can see the results from the default prompt settings (min. similarity = 0, homology level =0.8) in the appendix. This yields a correlation coefficient of 0.26 (p=0.03). Which is significant, but far below the 0.4 with the more thoughtful prompt above.
PoET scoring Summary
This is a data-free way to start interrogating sequences for fitness, in this example, we took a series of thiolases and created a prompt to reflect butoxy catalysing activity. We achieved this by using the most active enzyme for butoxy catalysis as a seed for an MSA, and then a series of filters to ensure the resulting prompt looked reasonably like the original enzyme.
We then tested the results of this approach across 72 thiolases from 10 different taxonomic classes. This resulted in a strong (>0.4) spearman correlation between PoET model scores and experimental activity. We confirmed that this was a specific effect by comparing those same scores to dimethyl (and other weakly correlated thiols) catalysis - and found no significant relationship (as expected).
PoET de novo thiolase generation
Job creation
Now that we have a meaningful prompt, we can re-use it to generate entirely new sequences. We can use these new sequences to understand butoxy thiolases, and to explore sequence and structural constraints in butoxy catalysis. Additionally, we could run these through wetlab assays to generate more data, and start exploring the fitness landscape with empirical observations. For now, we will focus on computational insights.
To kick off the PoET generation job we will supply a few key arguments:
Number of sequences to generate = 50
top-k = 10 (at each step, sample from only the 10 most likely amino acids).
Maximum generated protein length = 500 (i.e. stop generating any protein beyond this size)
We have left the following arguments at their default value:
temperature = 1.0 (this will control the creativity of the model, higher temperatures lead to more diverse output. 1.0 is a good default value)
top-p = 1.0 (this is an alternative to top-k, where we sample from the amino acids within cumulative p probability distribution)
Code
gen_results_path ="./results/Robinson2020/Robinson_generate.csv"gen_job_id ='5597d678-0bc9-4828-90af-2527e6898977'print(f"Reading from job {gen_job_id}")gen_score_result = pd.read_csv(gen_results_path)gen_score_result = gen_score_result.sort_values(by='score', ascending=False).reset_index(drop=True)gen_score_result['name'] = [f"gen-{i+1}"for i inrange(gen_score_result.shape[0])]gen_score_result[['name','sequence']].to_csv("./results/Robinson2020/colabfold_input.csv", index=False)
Reading from job 5597d678-0bc9-4828-90af-2527e6898977
Evaluating generated sequences
We have generated a variety of sequences with different scores, let’s first download the novel sequences and scores like we did previously:
Code
gen_score_result.head()
scores
sequence
score
name
0
[-105.93228149414062]
MSENATYRLANTSVLSVSTIEAPEVVTSDHLDELLQPAYDSTGMMA...
-105.932281
gen-1
1
[-114.87188720703125]
MGRNATHRFGNTSVLSVCAIEAPRTITSAEIDERLAPVYQRVGLRA...
-114.871887
gen-2
2
[-121.16313171386719]
MNGNTSWRAANTALLSVTAVEAPVVVTSDELDERLAGPMARLGIRA...
-121.163132
gen-3
3
[-133.42160034179688]
GNATTRFANVSLLSVASRLPSRVTPSTQIQERLGAAMTRLRLPPRL...
-133.421600
gen-4
4
[-142.5845947265625]
MSGNATFRLTNTAILAVTTVEAPEVITSEQLDERLRATYDRIGMTP...
-142.584595
gen-5
Below I have plotted:
(top) the distribution of generated sequence scores (red) against all 72 sequences in our data (grey)
(bottom) generated sequence scores (red) against a subset of highly active sequences (light blue - defined as being more active than the median enzyme).
We can see the scores of the generated sequences are generally right-shifted, suggesting greater fitness (defined, as we saw above, as butoxy catalysis):
f, axs = plt.subplots(2,1, figsize=(12,6), sharex=True, sharey=True)plt.suptitle("Generated sequence scores")axs[0].set_title("Generated sequences vs all experimental sequences")axs[0].hist(gen_score_result.score, bins=20, alpha=0.5, color='red', label='PoET sequences');axs[0].hist(compound_scores.score, bins=20, alpha=0.2, color='grey', label='All experimental sequences');axs[0].legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0.)axs[0].set_ylabel( "Number of sequences")median_activity = compound_scores.mean_log_activity.median()axs[1].set_title("Generated sequences vs active experimental sequences")axs[1].hist(gen_score_result.score, bins=20, alpha=0.5, color='red', label='PoET sequences');axs[1].hist(compound_scores.query("mean_log_activity>@median_activity").score, bins=20, alpha=0.4, color='lightblue', label='All active sequences');axs[1].legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0.)axs[1].set_ylabel( "Number of sequences")plt.xlabel("model score");
We can also examine the similarity of our generated sequences; first to each other, and then to the prompt sequences. To make the plotting easier, I have named each sequence in order of descending score (so gen-1 is the highest scoring sequence, and gen-50 the lowest, with gen-25 being the median scoring sequence).
Sequences are on average ~60% similar to the seed sequence. This is roughly in line with our expectations, as we saw our diverse group of thiolases that catalyze all 14 compounds have approximately 50% sequence similarity. We would expect butoxy specific enzymes to be somewhat higher than that broad thiolase baseline.
We can also see below that there is no significant relationship between the similarity to the seed and the model score for the generated sequences: this suggests the prompt is doing a good job (and not overfitting):
Code
r,p = spearmanr(gen_score_result['score'], gen_score_result['seed_similarity'])print(f"Spearman correlation between score and seed similarity: {r:.2f}, p={p:.2f}")f, axs = plt.subplots(1,1, figsize=(12,6), sharex=True)ax = axs# full resultssns.regplot(data=gen_score_result, x='seed_similarity', y='score', scatter_kws={'s': 5, 'alpha': 0.5}, n_boot=100, robust=True, ax=ax )ax .set_xlabel("Similarity to seed")ax .set_ylabel("Model score")# add spearmanr r and p results to graphax .text(0.05, 0.9, f"spearman r={r:.2f}\np-value ={p:.2f}", transform=plt.gca().transAxes, size=14);ax.set_title(f"Seed sequence similarity vs model score");
Spearman correlation between score and seed similarity: 0.10, p=0.49
Code
from multiprocessing import Pooldef calculate_similarity_wrapper(args):return calculate_similarity(*args)def generate_similarity_distinct(sequences1, sequences2):"""Generate a similarity matrix for a list of sequences""" num_sequences1 =len(sequences1) num_sequences2 =len(sequences2)with Pool() as p: similarity_matrix = np.array(p.map(calculate_similarity_wrapper, [(seq1, seq2) for seq1 in sequences1 for seq2 in sequences2]))return similarity_matrix.reshape(num_sequences1, num_sequences2)
Lastly, let’s look at the sequence similarity between the prompt sequences (below heatmap, Y axis) and the generated sequences (X axis):
print(f"Summary statistics for pairwise generated v prompt comparison: ")print(f"\t Mean similarity: {compared_sim.mean(axis=None):.2f}")print(f"\t Max similarity: {compared_sim.max(axis=None):.2f}")print(f"\t Min similarity: {compared_sim.min(axis=None):.2f}")
Summary statistics for pairwise generated v prompt comparison:
Mean similarity: 0.58
Max similarity: 0.94
Min similarity: 0.32
Again, we can see mean similarity is low (within expectations).
There are a few hotspots for the top sequences, I’ve investigated these below by plotting the maximum similarity to any prompt sequence, for each generated sequence:
print(f"Average max similarity to prompt across all generated sequences: {compared_sim.max(axis=0).mean():.2f}")
Average max similarity to prompt across all generated sequences: 0.75
We see there appears to be a general trend of higher scoring sequences being more closely related to a prompt sequence (above). This is to be expected, our scoring is defined by our prompt, so sequences more similar to the prompt will score higher (by definition). We could tweak additional parameters in the PoET generate workflow if we wanted to increase, or decrease the “creativity” of the model (for example temperature, or top p).
Let’s take two sequences;
the best scoring (gen-1)
the median scoring (gen-25)
And further investigate their relationship to the prompt.
First, we can see the full results for prompt sequence similarity for the best sequence (gen-1) below. Note that for this sequence, the mean similarity to the prompt sequences is 56% (range: 46-94%).
Code
best_generated_sequence = gen_score_result.sequence.tolist()[0]print("Prompt similarity summary for gen-1:")print(f"\tmean: {compared_sim['gen-1'].mean():.2f}")print(f"\tmin: {compared_sim['gen-1'].min():.2f}")print(f"\tmax: {compared_sim['gen-1'].max():.2f}")
The best scoring sequence actually shares 93.5% identity with the M. massiliensis we saw previously (and know to be an active butoxy thiolase). We therefore expect this sequence to perform well in any wetlab experiment.
The median scoring sequence shows a more uniform similarity to prompt sequences (mean: 60%, range: 49-71%). Note that the closest prompt sequence derives from Microlunatus flavus, but at 71% similarity we expect significant divergence from this organism’s thiolase.
Code
print(f"M. flavus in Robinson dataset: {promptdf[promptdf.accnumber==compared_sim['gen-25'].idxmax()].sequence.tolist()[0] in data.sequence.tolist()}")
M. flavus in Robinson dataset: False
The sequences selected above represent two good case studies for futher evaluation. One sequence is the most fit sequence generated by PoET, whilst the median sequence should more closely reflect the bulk of the data. We will investigate these structurally below.
Investigating generated thiolase structures
Robinson reports a conserved Cys across all thiolases that is buried in the active site and vital for activity, we can see this below, at position 129 in the MSA seed Kytococcus sedentarius structure from uniprot (note this is an alphafold prediction of the enzyme, as the structure is yet to be solved experimentally).
Code
[f"Cys{i+1}"for i inrange(len(msa_seed)) if msa_seed[i]=='C']
Note the key Cys129-His269 interaction in the active site below (vital for the thiolase “ping-pong” reaction mechanism). As well as the stabilising non-covalent interaction between Gly301 and the same His269. It is likely that this stabilising His-Gly hydrogen bonding pattern has a role in active site dynamics.
First, since we know the active site Cys is so important to catalytic activity, and that loss of this Cys would completely ablate activity, let’s examine the Cys residues in our two generated sequences of interest:
Code
seq_of_interest = ['gen-1', 'gen-25'] # best scored 3 + medianfor name in seq_of_interest: seq = gen_score_result.loc[gen_score_result.name==name, 'sequence'].tolist()[0] x = calculate_similarity(msa_seed, seq)print(f"Sequence similarity of {msa_seed_id} and {name}= {x:.2f}")print(f"Cys positions for {name}:", [i+1for i inrange(len(seq)) if seq[i]=='C'])print()
Sequence similarity of Kytococcus sedentarius and gen-1= 0.56
Cys positions for gen-1: [120, 129, 224, 277, 279]
Sequence similarity of Kytococcus sedentarius and gen-25= 0.66
Cys positions for gen-25: [98, 120, 129, 225, 335]
We can see above that both our best, and median sequence, have a conserved Cys129.
Structural analysis
Best scoring generated sequence
For a more thorough evaluation we can also fold these sequences with Alphafold, and run a pairwise structural aligner with the original seed sequence.
We can see the results below for the best generated sequence, with the active site Cys highlighted in the enzyme’s core:
seed (green)
generated (orange):
We can see there is an overwhelming structural similarity between our seed and our best generated sequence, despite only having 56% sequence identity.
Zooming in on the active site, we see the same Cys-His interaction as the seed sequence’s structure (numbered Cys129-His268 in this generated sequence).
Looking at the full ensemble of 5 models that AlphaFold produced (below), we can see there is overall agreement for both the Cys (yellow) and Histidine (blue) in the generated sequence across predictions, this may suggest there is little uncertainty or flexibility around these residues.
This matches both the K. sedentarius and M. massiliensis AphaFold predictions. I’ve reproduced the M. massiliensis 5 model ensemble below for illustration:
An interesting difference however, is that the His269-Gly301 stabilizing non-covalent interaction in the seed structure has been replaced by a His268-Gln269 interaction in our generated structure:
Overall, we can see a high degree of structural similarity, with a few key active site differences. The overall RMSD between the seed and best generated sequence is <1A:
We can also compare this generated sequence to it’s closest prompt sequence (which had a 94% sequence identity):
First at the sequence level (below) which shows significant similarity:
start =250stop =310print(f"Alignment at ~250-310 region:")for i inrange(3):print(aligned.split("\n")[i][start:stop])
Alignment at ~250-310 region:
DAS-KLFVEGLSLATDAWAEAKGEWDWEEMDWYVAHQTSTAHIACLCDTLKLP-A-EKFP
|| ||||||||||||||||||||||||||||||||||||||||||||||| | ||||
DA-RKLFVEGLSLATDAWAEAKGEWDWEEMDWYVAHQTSTAHIACLCDTLK--VAPEKFP
Interestingly, the M. massiliensis structure (below, AlphaFold), shows a different hydrogen bonding pattern for His268, despite the sequence conservation (above). The M. massiliensis structure lacks any stabilising hydrogen bonding between His268 and adjacent residues. This differs to both the K. sedentarius structure (where Gly301 provides support) and the generated sequence above (where Gln269 provides support). Given that the K. sedentarius and M. massiliensis thiolases were amongst the most active for butoxy, there is reason to believe this generated sequence, which borrows from both, would also be highly active against butoxy.
Overall, we can see there is a strong structural similarity between our best scoring generated sequence and both the seed sequence and it’s closest prompt sequence; this is reflected in the alignment results, which show a <1A RMSD. Additionally we see strong evidence for conservation at the active site, specifically the Cys-His interaction. Interestingly the stabilising hydrogen bonding pattern differs between all three sequences. And despite the strong sequence similarity between our sequence and M. massiliensis, our structure has elements more reminscent of K. sedentarius (e.g. the stabilising hydrogen bonding around the active site Histidine). Given these shared elements from two of the best butoxy thiolases in our dataset, we would expect this sequence to perform well in any wetlab assay.
Median scoring sequence
Let’s now look at the median generated sequence, which has a much lower maximum sequence similarity to any sequence in the prompt (71%). We will follow the same workflow, and look for similar patterns in the active site that might indicate specificity or activity:
First, we can see the overall alignment between this sequence and the seed sequence is good, RMSD is again <1A.
Focussing on the active site (below), we see His269 and Cys129 are conserved again, likely preserving the core “ping-pong” catalytic mechanism. However, the Gln269-His268 interaction that we saw in the best generated sequence, is absent here (reminiscent of M. massiliensis).
There is a Gly301 in the vicinity of the active site Histidine, but this Glycine is distanced from His269, suggesting those stabilizing hydogen bonds we saw in the seed structure will be transient or non-existent here. This may have repercussions for active site volume and solvent accessibility, and therefore activity and specificity (as noted by Robinsin et al. 2020).
Examining the AlphaFold ensemble of 5 models (below), we see clear agreement on the Cys129 orientation, but less agreement on the positioning of the active site Histidine (blue) than either K. sedentarius, M. massiliensis or the best generated sequence. This may suggest more flexibility around His269, potentially due to the weaker non-covalent interactions discussed above:
If we examine the sequence similarity between this median generated sequence and it’s most similar prompt sequence (below) we see less conservation overall, as expected from its low sequence similarity percentage:
Interestingly, we see evidence for strong conservation around the Cys129 residue (see below: the NACL motif is conserved):
But less conservation around the His269 at the sequence level:
This pattern of conservation (structure and sequence) at the Cys129 position, with non-conservation at the His269 position is likely to have implications for activity.
Overall, the generate functionality of PoET has yielded a sequence with low identity but high structural fidelity. This median scoring sequence shows only 66% identity to the seed, and 71% to the closest prompt sequence. Despite this low sequence level similarity, the structures are remarkably similar, and the active sites have been preserved very carefully. Furthermore, the changes in secondary support for the active site are likely to effect protein dynamics and activity and warrant further study.
Generated sequences summary
We have generated 50 sequences de novo by using an MSA of our butoxy catalysing enzyme as a prompt. We previously saw that sequence fitness scored by protsprot correlated well with butoxy activity (but not with dimethyl, or other compounds). We therefore might expect these novel sequences to show similar activity and specificity against butoxy.
Utilizing AlphaFold, we were able to show that 2 representative sequences (the best, and the median) align with the seed sequence structure to within 1A RMSD, despite having low sequence identity to the seed (50-60%) and, in the case of the median sequence, low identity to any prompt sequence (mean: 60%, max: 71%). Furthermore, these generated sequences preserve a vital Cys-His interaction in the active site, but differ in how they stabilize the active site His with surrounding non-covalent interactions. We hypothesize that these secondary differences may influence substrate activity and specificity and warrant further study.
Overall, these sequences (or a subset of them) could be progressed to wetlab evaluation to assess for improved activity over our K. sedentarius sequence. This would then generate more data that could be used in OpenProtein’s suite of tools to engineer increasingly active and specific thiolases!
Summary
OpenProtein’s PoET workflow, a state-of-the-art protein language model, enables the evaluation of protein sequences without functional measurements, leveraging multiple sequence alignment to assess protein fitness and generate novel sequences.
Using an external thiolase dataset from Robinson 2020, PoET demonstrated a strong correlation between model scores and experimental butoxy catalytic activity. Furthermore, the novel sequences generated in the second half of this demo showed structural similarity to the seed sequence and potential for improved activity, thus offering a powerful tool for protein engineering without preliminary wetlab experiments.
Appendix
Additional figures:
Full results for compound activity vs model score:
Code
f, axs = plt.subplots(len(compound_correlations)//2, 2, figsize=(15,15), sharex=True, sharey=True)axs = axs.flatten()plt.suptitle("Model score vs compound activity:\n all compounds", y=1.02)for i, compound inenumerate(compound_correlations.index.tolist()): x = scoring_result.merge(data.query("compound==@compound"), on='sequence') x = x.query("mean_log_activity>0") r,p = spearmanr(x['score'], x['mean_log_activity']) sns.regplot(data=x, x='mean_log_activity', y='score', scatter_kws={'s': 5, 'alpha': 0.5}, n_boot=100, robust=True, ax=axs[i] ) axs[i] .set_ylabel("") axs[i].set_xlabel("")# add spearmanr r and p results to graph axs[i].set_title(f"{compound}{r=:.2f}, {p=:.2f}");axs[len(compound_correlations)-2].set_xlabel("log10 activity")axs[len(compound_correlations)-1].set_xlabel("log10 activity")axs[len(compound_correlations)-2].set_ylabel("Model score")plt.tight_layout()
Additional prompts:
Scoring with default parameters yields a weaker model:
The default prompt settings (as used below) are:
Use first sequence as seed to generate MSA = True (as before)
number of prompts to ensemble = 1
minimum similarity to seed = 0.0
Homology level = 0.8
always include seed sequence: False
maximum number of sequences = 0
maximum number of amino acids = 12000
Code
results_default_prompt ="./results/Robinson2020/prots2prot-query-defaultprompt.csv"scoring_defaultprompt_job_id ='0d1576e5-f954-453f-ba4e-d5c218701946'scoring_result_defaultprompt = pd.read_csv(results_default_prompt)print(f"Reading from {scoring_defaultprompt_job_id}")scoring_result_defaultprompt.head(3)
Reading from 0d1576e5-f954-453f-ba4e-d5c218701946
scores
sequence
score
0
[-497.7345581054688]
MRAAQRYDDVVITGLAHVDAPHVVTSTELEDRLAGTLGRLRITPGL...
-497.734558
1
[-329.5821228027344]
MNVGITGIAYALPAREVTTAELQRRVAPKLPPRMFEQATGITRRRV...
-329.582123
2
[-232.86839294433597]
MDGNATTTHRNVALLSVATTMATRITTSAEIDRRLAPSLKRLKLPT...
-232.868393
Code
x = scoring_result_defaultprompt.merge(data.query("compound==@selected_compound"), on='sequence')x = x.query("mean_log_activity>0")r,p = spearmanr(x['score'], x['mean_log_activity'])print(f"Spearman correlation between score and mean_log_activity: {r:.2f}, p={p:.2f}")f, axs = plt.subplots(1,1, figsize=(12,6), sharex=True)ax = axs# full resultssns.regplot(data=x, x='mean_log_activity', y='score', scatter_kws={'s': 5, 'alpha': 0.5}, n_boot=100, robust=True, ax=ax )ax .set_xlabel("Experimental ")ax .set_ylabel("Model score")# add spearmanr r and p results to graphax .text(0.05, 0.9, f"spearman r={r:.2f}\np-value ={p:.2f}", transform=plt.gca().transAxes, size=14);ax.set_title(f"Correlation between {selected_compound} and log-liklihood score\n Default prompt. ");
Spearman correlation between score and mean_log_activity: 0.26, p=0.03
Code
response = session.prots2prot.get_prompt(scoring_defaultprompt_job_id, prompt_index=1)defaultpromptdf = pd.DataFrame(list(response), columns=['name','sequence'])#rename seed from 101 to uniprotkb entry defaultpromptdf.loc[defaultpromptdf.name=='101', 'name'] ='C7NEV1'print("Number of sequences in prompt:", len(defaultpromptdf))print("Sample:")defaultpromptdf.head(10)
Number of sequences in prompt: 40
Sample:
name
sequence
0
A0A1F9RVZ1
RYDKVYLAGLGYELAPVVVDTAEIEERLKPALAALRIAEGQIESLT...
1
A0A810ML27
PVGITAVASVLPDREVTTADLQQRVAAASNLPLPPGMFAQATGIDR...
2
A0A354ZDY2
KVMPLFWGYTGSACVPMALAVSRGRIRSGDLVTLVGSGVGYNAGVALL
3
A0A3N1ARH3
VGITGVGVSLPDRTLTTADLQDRVAAASDVVLPPGLFATMTGIGQR...
4
A0A367YQ38
MKTDATYQLNNTAICSAVAVEAPDRVTSAWIDEQLAETYQRLGMTP...
5
A0A1V5NTN6
FSRVFLESIGYELPHNIITSTWIEERLSPLYKKLFLQPGQLEALTG...
6
A0A535XU34
MAGREVKYATRQLTESSLTALREGMSVDDIDQFLFHQANLRIIESV...
7
UPI0003B5A6F3
MNGNATTRHNNVALLAVASTIAPRITTSHEIDQQLAPVLKRLRLPQ...
8
A0A7X8QXU4
RFQKACISSIGYELGPVVVTSDALEARIAPVYKSLGLPLGQLQGLT...
9
A0A0K1PEU7
RWSRVCIEAIAYELPDERVTSSALEARLAPVYEALRLGGGQLEALT...
Code
print(f"Number of overlapping sequences in default args prompt and query: {len(set(unique_sequences) &set(defaultpromptdf.sequence.unique().tolist()))}")x = [i for i in unique_sequences if i in defaultpromptdf.sequence.unique().tolist()]
Number of overlapping sequences in default args prompt and query: 0
Code
prompt_orgs = [] for accnumber in defaultpromptdf.name.unique()[0:]:if accnumber =='101':continue r = requests.get(f"https://www.uniprot.org/uniprotkb/{accnumber}.json", timeout=3) prompt_orgs.append((accnumber, r.json().get('organism', {}).get('scientificName', 'notfound'), r.json().get('organism', {}).get('lineage', [] )))defaultpromptdf = pd.DataFrame(prompt_orgs, columns=['accnumber', 'organism', 'lineage']).merge(defaultpromptdf, left_on='accnumber', right_on='name')defaultpromptdf.organism.value_counts()
Interestingly, our default prompt contains neither the K. sedentarius or M. massiliensis sequence. And is drawn from a much larger range of taxonomic orders. This likely is including thiolases other than butoxy thiolases.
Code
defaultpromptdf['seed_similarity'] = [calculate_similarity(msa_seed, s) for s in defaultpromptdf.sequence ]promptdf['seed_similarity'] = [calculate_similarity(msa_seed, s) for s in promptdf.sequence ]
We can see above the prompt produced with the default args contains sequences that are far more divergent than our performant prompt used in section 2. This is likely a symptom of the taxonomic diversity and likely the cause of the poor performance with the default prompt.
Code
# Full results from default prompt args jobf, axs = plt.subplots(data.compound.nunique()//2, 2, figsize=(15,15), sharex=True, sharey=True)axs = axs.flatten()plt.suptitle("Correlation between experimental activity and model score\n Default prompt", y=1.02)for i, compound inenumerate(data.compound.unique()): x = scoring_result_defaultprompt.merge(data.query("compound==@compound"), on='sequence') x = x.query("mean_log_activity>0") r,p = spearmanr(x['score'], x['mean_log_activity']) sns.regplot(data=x, x='mean_log_activity', y='score', scatter_kws={'s': 5, 'alpha': 0.5}, n_boot=100, robust=True, ax=axs[i] ) axs[i] .set_ylabel("") axs[i].set_xlabel("")# add spearmanr r and p results to graph axs[i].set_title(f"Default prompt: {compound}{r=:.2f}, {p=:.2f}");axs[len(compound_correlations)-2].set_xlabel("log10 activity")axs[len(compound_correlations)-1].set_xlabel("log10 activity")axs[len(compound_correlations)-2].set_ylabel("Model score (default prompt)")plt.tight_layout()
/opt/conda/envs/py38/lib/python3.8/site-packages/statsmodels/robust/robust_linear_model.py:288: ConvergenceWarning: Estimated scale is 0.0 indicating that the most last iteration produced a perfect fit of the weighted data.
warnings.warn('Estimated scale is 0.0 indicating that the most'