Creating a prompt

This tutorial shows you how to use your multiple sequence alignment (MSA) to make a prompt, which is an input that instructs the PoET model to generate the desired response. PoET uses a prompt made up of a set of related sequences that encode information about the fitness landscape of a protein of interest. These sequences may be homologs, family members, or some other grouping that represents your protein of interest.

What you need before getting started

You need an MSA previously returned by the create_msa function. For more information, see Creating an MSA.

Setting up your prompt

Creating a prompt from an MSA involves filtering based on a number of criteria. Our goal is for the resulting prompt to successfully encode relevant evolutionary and fitness data that PoET can use as context when scoring or generating new sequences.

This tutorial will use the default prompt settings:

  • num_residues: This controls the maximum number of residues (tokens) in the prompt. The default is 12288.

  • num_sequences: This controls the number of sequences in the prompt. Use it as an alternative to num_residues.

  • method: This parameter signifies the MSA sampling method to be used. The default method is NEIGHBORS_NONGAP_NORM_NO_LIMIT.

  • homology_level: Applicable for the neighbors methods only, this parameter controls the level of homology for sequences in the MSA. The value ranges between 0 and 1, with the default set at 0.8.

  • max_similarity: This defines the maximum similarity between sequences in the MSA and the seed, ranging between 0 and 1. By default, it is set to 1.0.

  • min_similarity: This determines the minimum similarity between sequences in the MSA and the seed, with values ranging from 0 to 1. The default is set at 0.0.

  • always_include_seed_sequence: This Boolean parameter controls whether the seed sequence is always included in the MSA. By default, it is set to False.
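
As a quick reference, the defaults listed above can be collected into a dictionary and checked against the stated ranges. This is a minimal sketch, not part of the OpenProtein.AI client: DEFAULT_PROMPT_SETTINGS and check_prompt_settings are hypothetical names, and the sampling method is written as a plain string because the enum to import is not shown in this tutorial.

```python
# Defaults as listed in this tutorial; the dictionary name is hypothetical.
DEFAULT_PROMPT_SETTINGS = {
    "num_residues": 12288,
    "method": "NEIGHBORS_NONGAP_NORM_NO_LIMIT",  # plain-string stand-in for the method
    "homology_level": 0.8,
    "max_similarity": 1.0,
    "min_similarity": 0.0,
    "always_include_seed_sequence": False,
}


def check_prompt_settings(settings):
    """Return a list of problems found in a settings dict (empty if none)."""
    problems = []
    # The three fractional parameters are documented as ranging from 0 to 1.
    for key in ("homology_level", "max_similarity", "min_similarity"):
        value = settings[key]
        if not 0.0 <= value <= 1.0:
            problems.append(f"{key}={value} is outside [0, 1]")
    # A minimum above the maximum would leave no admissible sequences.
    if settings["min_similarity"] > settings["max_similarity"]:
        problems.append("min_similarity exceeds max_similarity")
    if settings["num_residues"] <= 0:
        problems.append("num_residues must be positive")
    return problems


print(check_prompt_settings(DEFAULT_PROMPT_SETTINGS))  # prints []
```

Overriding a value is then a dictionary update, for example {**DEFAULT_PROMPT_SETTINGS, "homology_level": 0.6}, which can be re-checked before use.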

We’ll set the following arguments:

  • num_ensemble_prompts: This sets the number of ensemble jobs to run. Each job uses a different prompt, which produces a more diverse set of scores. We will set this value to 3.

  • random_seed: The seed allows you to make reproducible runs. We’ll set it to 42 here.
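
To see why a fixed random_seed makes runs reproducible while the ensemble prompts can still differ from one another, consider the toy sketch below. It uses plain uniform sampling from Python's random module, not PoET's neighbors-based sampling, and sample_ensemble is a hypothetical helper rather than a client function.

```python
import random


def sample_ensemble(names, num_ensemble_prompts, prompt_size, random_seed):
    """Draw several random subsets of `names` from one seeded generator."""
    rng = random.Random(random_seed)  # private generator, global state is untouched
    # Each rng.sample call advances the generator, so the subsets generally
    # differ from one another, but the whole list repeats for a given seed.
    return [rng.sample(names, prompt_size) for _ in range(num_ensemble_prompts)]


names = [f"seq_{i}" for i in range(50)]  # made-up sequence identifiers
first = sample_ensemble(names, num_ensemble_prompts=3, prompt_size=5, random_seed=42)
second = sample_ensemble(names, num_ensemble_prompts=3, prompt_size=5, random_seed=42)

print(first == second)  # prints True: same seed, same prompts
print(len(first), [len(p) for p in first])  # prints 3 [5, 5, 5]
```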

Generating your prompt

Sample from the MSA to generate a prompt:

[ ]:
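# msa is the MSA object returned by create_msa (see Creating an MSA)
num_prompts = 3  # number of ensemble prompts to sample
prompt = msa.sample_prompt(num_ensemble_prompts=num_prompts, random_seed=42)
print(prompt)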
status=<JobStatus.PENDING: 'PENDING'> job_id='cc91a87b-77a7-473a-be4a-0dae1eb30662' job_type=<JobType.align_prompt: '/align/prompt'> created_date=datetime.datetime(2024, 5, 9, 5, 33, 8, 975262) start_date=None end_date=None prerequisite_job_id=None progress_message=None progress_counter=None num_records=None sequence_length=None msa_id='cc91a87b-77a7-473a-be4a-0dae1eb30662' prompt_id='cc91a87b-77a7-473a-be4a-0dae1eb30662'

OpenProtein.AI uses an asynchronous API: potentially long-running functions return a job ID that you can use to query for completed results. Retrieve the job ID with:

[ ]:
prompt.id
'cc91a87b-77a7-473a-be4a-0dae1eb30662'
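
The submit-then-poll pattern can be sketched with a stand-in object. ToyJob and wait_for below are hypothetical and do not contact the OpenProtein.AI service. Only the PENDING status name comes from the output above; the SUCCESS name is an assumption made for illustration.

```python
import time


class ToyJob:
    """Stand-in for a remote job that finishes after a few status checks."""

    def __init__(self, job_id, checks_until_done=3):
        self.id = job_id
        self._remaining = checks_until_done

    def refresh(self):
        # A real client would query the server here; the toy just counts down.
        self._remaining -= 1
        return "SUCCESS" if self._remaining <= 0 else "PENDING"


def wait_for(job, interval=0.01, max_checks=100):
    """Poll until the job leaves PENDING; return the number of checks made."""
    for attempt in range(1, max_checks + 1):
        if job.refresh() != "PENDING":
            return attempt
        time.sleep(interval)
    raise TimeoutError(f"job {job.id} still pending after {max_checks} checks")


job = ToyJob("example-job-id")
print(wait_for(job))  # prints 3
```

Capping the number of checks keeps a stalled job from blocking a script indefinitely.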

Wait for the job to complete with prompt.wait(), then retrieve each prompt with prompt.get_prompt:

[ ]:
import pandas as pd

# Block until the prompt job has finished
prompt.wait()

# Build one DataFrame per ensemble prompt from its (name, sequence) records
prompt_result = [
    pd.DataFrame(list(prompt.get_prompt(i)), columns=['name', 'sequence'])
    for i in range(num_prompts)
]

prompt_result
[Empty DataFrame
 Columns: [name, sequence]
 Index: [],
              name                                           sequence
 0   UPI001ED8170E  MDVLKKGFSMAKDGVVAAAEKTKAGVEEAAAKTKEGVIYVGNKTME...
 1   UPI0018F740AA  MDMFMKGLNMAKEGVVAAAEKTKQGVTEAAEKTKEGVLYVGNRTRE...
 2      A0A3Q2U6H0  MDVFMKGLSKAKEGMAVAAEKTKEGVAVAAEKTKEGVMFVGNKAKD...
 3      A0A6I8S5G8  MDVFMKGLSKAKEGVVAAAEKTKQGVAEAAGKTKEGVLYVGSKTKE...
 4   UPI001CE06301  MDALMKGFSMAKEGVVAAAEKTKAGMEEAAAKTKEGVMYVGNKTKE...
 ..            ...                                                ...
 95     A0A8C2WLA4  MDALKKGLNMAKDGVVSAAEKTKAGVGGAATKTKEGVFYVGNKTME...
 96  UPI000944B882  MDVFMKGLSKAKEGVVAAAEKTKQGVAEAAGKTKEGVLYVGSKTKE...
 97     A0A3B1KG41  MDVLKKGFSIAKEGVVAAAEKTKAGVEEAAAKTKEGVMYVGTKTKE...
 98     A0A0Q3TLY5  MDVFMKGLSKAKEGVVAAAEKTKQGVAEAAGKTKEGVLYVGSRTKE...
 99     A0A7K8NI46  MDVFMKGLSKAKEGVVAAAEKTKQGVAEAAGKTKEGVLYVGRWAAE...

 [100 rows x 2 columns],
              name                                           sequence
 0   UPI0003317F87  MDVFMKGLSMAKEGVVAAAEKTKQGVTEAEKTKEGVLYVVAEKTKE...
 1   UPI0003318113  MDVFKKGFSIAKEGVVGAVEKTKQGVTEAAEKTKEGVLYVGAKTKE...
 2   UPI0018B0E974  MDVFKKGFSIAKEGVVGAVEKTKQGVTEAAEKTKEGVMYVGTKTKE...
 3   UPI001CFABEBD  MDVFKKGFSMAKEGVVAAAEKTKQGVAEAAEKTKEGVMYVGTKTKE...
 4      A0A4W4DNW5  MDVLMKGLSKAKDGVATAAEKTKQGVTGAAGMTKDGVLYVGESFAE...
 ..            ...                                                ...
 95  UPI001C670E09  MDAFMKGLSKAKEGVVAAAEKTKQGVAEAAEKTKEGVLYVGSKTQG...
 96     A0A0A7HRT6  MDVFMKGLSKAKEGVVAAAEKTKQGVAEAAGKTKEGVLYVGSKTKE...
 97     A0A4W6ENE0  MDVLMKGFSMAKEGVVAAAEKTKAGMEEAAAKTKEGVMYVGSKTKE...
 98  UPI0018B08E9F            QNEEGAPQEGILQDMPVDPDNEAYEMPSEEGYQDYEPEA
 99     A0A673IWB3  MDVFMKGLSKAKEGMAVAAEKTKEGVAVAAEKTKEGVMFVGKTWMA...

 [100 rows x 2 columns]]
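
The first prompt in the output above is empty, so it is worth checking each prompt before using it. The sketch below summarizes a prompt given as (name, sequence) pairs, which is the shape the code above passes to pd.DataFrame. summarize_prompt is a hypothetical helper, the records are made-up placeholders, and the budget check assumes one token per residue, which may not match how the service counts tokens.

```python
def summarize_prompt(records, num_residues=12288):
    """Summarize one prompt given as an iterable of (name, sequence) pairs."""
    records = list(records)
    total = sum(len(sequence) for _, sequence in records)
    return {
        "num_sequences": len(records),
        "total_residues": total,
        "is_empty": not records,
        # Assumption: the residue budget counts one token per residue.
        "within_budget": total <= num_residues,
    }


toy_prompts = [
    [],  # an empty prompt, like the first result in the output
    [("seq_a", "MDVLKKG"), ("seq_b", "MDVFMKGLS")],  # placeholder sequences
]
for index, records in enumerate(toy_prompts):
    print(index, summarize_prompt(records))
```

With real results, each call could take list(prompt.get_prompt(i)) in place of the placeholder records.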

Next steps

See our Align and MSA API page for more information.

Now that you have a prompt, use it to score sequences, perform a single site analysis, or generate de novo sequences.