openprotein.align#

Some tools (e.g. PoET, AlphaFold2) require a multiple sequence alignment (MSA) as input. The tools below help you create or upload one.

class openprotein.align.AlignAPI[source]#

Align API interface for creating alignments and MSAs (multiple sequence alignments) which can be used for other protein tasks.

__init__(session)[source]#
Parameters:

session (APISession)

mafft(sequences, names=None, auto=True, ep=None, op=None)[source]#

Align sequences using the MAFFT algorithm.

Set auto to True to have the alignment parameters chosen automatically. Leave a parameter as None to use the system default.

Parameters:
  • sequences (Sequence[bytes or str]) – Sequences to align.

  • names (Sequence[str], optional) – Optional list of sequence names, must be the same length as sequences if provided.

  • auto (bool, default=True) – Set to True to automatically set algorithm parameters.

  • ep (float, optional) – MAFFT “ep” parameter. Sets the offset value for the scoring matrix; lower values make gap opening more difficult. If None, uses system default.

  • op (float, optional) – MAFFT “op” parameter. Sets the gap opening penalty; higher values increase the cost of opening gaps. If None, uses system default.

Returns:

Future object awaiting the resulting MSA.

Return type:

MSAFuture

Raises:

Exception – If names and sequences are not the same length.
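
As a minimal sketch of how the parameters above fit together, the hypothetical helper below checks the names/sequences length rule locally and then forwards to mafft() with the documented keyword arguments. The helper name align_with_mafft is illustrative, align_api stands for an AlignAPI instance, and turning auto off whenever op or ep is given is an assumption, not documented behavior.

```python
from typing import Optional, Sequence


def align_with_mafft(align_api, sequences: Sequence[str],
                     names: Optional[Sequence[str]] = None,
                     op: Optional[float] = None,
                     ep: Optional[float] = None):
    """Hypothetical helper: validate inputs, then call align_api.mafft()."""
    if names is not None and len(names) != len(sequences):
        # Fail fast locally instead of waiting for the API call to raise.
        raise ValueError(
            f"got {len(names)} names for {len(sequences)} sequences"
        )
    # Assumption: explicit op/ep values are intended to replace auto tuning.
    auto = op is None and ep is None
    return align_api.mafft(sequences, names=names, auto=auto, ep=ep, op=op)
```

The return value is whatever mafft() returns, which per the documentation above is an MSAFuture.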

mafft_file(file, auto=True, ep=None, op=None)[source]#

Align sequences using the MAFFT algorithm. Sequences can be provided in FASTA or CSV format. If CSV, the file must be headerless with either a single sequence column or two columns (name, sequence).

Set auto to True to have the alignment parameters chosen automatically. Leave a parameter as None to use the system default.

Parameters:
  • file (file-like object) – Sequences to align in FASTA or CSV format.

  • auto (bool, default=True) – Set to True to automatically set algorithm parameters.

  • ep (float, optional) – MAFFT “ep” parameter. Sets the offset value for the scoring matrix; lower values make gap opening more difficult. If None, uses system default.

  • op (float, optional) – MAFFT “op” parameter. Sets the gap opening penalty; higher values increase the cost of opening gaps. If None, uses system default.

Returns:

Future object awaiting the resulting MSA.

Return type:

MSAFuture
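
The file argument is described only as a file-like object, so one way to satisfy it without touching disk is an in-memory buffer. The sketch below builds a FASTA buffer from (name, sequence) pairs; the helper name fasta_file is hypothetical, and using a binary buffer (rather than a text one) is an assumption.

```python
import io
from typing import Iterable, Tuple


def fasta_file(records: Iterable[Tuple[str, str]]) -> io.BytesIO:
    """Build an in-memory FASTA file from (name, sequence) pairs."""
    lines = []
    for name, seq in records:
        lines.append(f">{name}")  # FASTA header line
        lines.append(seq)         # sequence on a single line
    # Assumption: the API accepts a binary file-like object.
    return io.BytesIO(("\n".join(lines) + "\n").encode("ascii"))
```

With an AlignAPI instance named align (an illustrative name), the buffer could then be passed as align.mafft_file(fasta_file(records)).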

clustalo(sequences, names=None, clustersize=None, iterations=None)[source]#

Align sequences using the Clustal Omega algorithm.

Sequences are passed directly as a list. To align sequences stored in a FASTA or CSV file, use clustalo_file instead.

Parameters:
  • sequences (Sequence[bytes or str]) – Sequences to align.

  • names (Sequence[str], optional) – Optional list of sequence names, must be the same length as sequences if provided.

  • clustersize (int, optional) – Maximum number of sequences per cluster during guide tree generation. If None, uses the default value.

  • iterations (int, optional) – Number of refinement iterations performed during alignment. If None, uses the default value.

Returns:

Future object awaiting the resulting MSA.

Return type:

MSAFuture

Raises:

Exception – If names and sequences are not the same length.

clustalo_file(file, clustersize=None, iterations=None)[source]#

Align sequences using the Clustal Omega algorithm.

Sequences can be provided in FASTA or CSV format. If CSV, the file must be headerless with either a single sequence column or two columns (name, sequence).

Parameters:
  • file (file-like object) – Sequences to align in FASTA or CSV format.

  • clustersize (int, optional) – Maximum number of sequences per cluster during guide tree generation. If None, uses the default value.

  • iterations (int, optional) – Number of refinement iterations performed during alignment. If None, uses the default value.

Returns:

Future object awaiting the resulting MSA.

Return type:

MSAFuture
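
The headerless CSV layout described above (one sequence column, or name and sequence columns) can be produced with the standard csv module. This is a sketch only: the helper name csv_file is hypothetical, and returning a UTF-8 encoded binary buffer is an assumption about what the file parameter accepts.

```python
import csv
import io
from typing import Optional, Sequence


def csv_file(sequences: Sequence[str],
             names: Optional[Sequence[str]] = None) -> io.BytesIO:
    """Build a headerless CSV: one column (sequence) or two (name, sequence)."""
    if names is not None and len(names) != len(sequences):
        raise ValueError("names and sequences must be the same length")
    text = io.StringIO(newline="")
    writer = csv.writer(text, lineterminator="\n")
    if names is None:
        for seq in sequences:
            writer.writerow([seq])        # single sequence column
    else:
        for name, seq in zip(names, sequences):
            writer.writerow([name, seq])  # name, sequence columns
    # Assumption: the API accepts a binary file-like object.
    return io.BytesIO(text.getvalue().encode("utf-8"))
```

No header row is written, matching the "headerless" requirement stated for the *_file methods.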

abnumber(sequences, names=None, scheme=AbNumberScheme.CHOTHIA)[source]#

Align antibody sequences using AbNumber.

Sequences are passed directly as a list. To align sequences stored in a FASTA or CSV file, use abnumber_file instead.

Use scheme to select the antibody numbering scheme.

Parameters:
  • sequences (Sequence[bytes or str]) – Sequences to align.

  • names (Sequence[str], optional) – Optional list of sequence names, must be the same length as sequences if provided.

  • scheme (AbNumberScheme, default=AbNumberScheme.CHOTHIA) – Antibody numbering scheme.

Returns:

Future object awaiting the resulting MSA.

Return type:

MSAFuture

Raises:

Exception – If names and sequences are not the same length.

abnumber_file(file, scheme=AbNumberScheme.CHOTHIA)[source]#

Align antibody sequences using AbNumber.

Sequences can be provided in FASTA or CSV format. If CSV, the file must be headerless with either a single sequence column or two columns (name, sequence).

Use scheme to select the antibody numbering scheme.

Parameters:
  • file (file-like object) – Sequences to align in FASTA or CSV format.

  • scheme (AbNumberScheme, default=AbNumberScheme.CHOTHIA) – Antibody numbering scheme.

Returns:

Future object awaiting the resulting MSA.

Return type:

MSAFuture

upload_msa(msa_file)[source]#

Upload an MSA from a file.

Parameters:

msa_file (str) – Path to a ready-made MSA file.

Returns:

Future object awaiting the contents of the MSA upload.

Return type:

MSAFuture

Raises:

APIError – If there is an issue with the API request.

create_msa(seed)[source]#

Construct an MSA via homology search with the seed sequence.

Parameters:

seed (bytes) – Seed sequence for the MSA construction.

Returns:

Future object awaiting the resulting MSA.

Return type:

MSAFuture

Raises:

APIError – If there is an issue with the API request.
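
Because seed is typed as bytes, a sequence held as a Python string needs encoding before it is passed in. The sketch below shows one way to do that; the helper name seed_to_bytes is hypothetical, and the whitespace stripping, upper-casing, and letters-only check are assumptions about sensible input hygiene rather than documented requirements.

```python
def seed_to_bytes(seed: str) -> bytes:
    """Normalize a sequence string and encode it for use as an MSA seed."""
    # Assumption: whitespace and letter case are not meaningful in the seed.
    cleaned = "".join(seed.split()).upper()
    if not (cleaned.isascii() and cleaned.isalpha()):
        raise ValueError("seed must contain only ASCII letters")
    return cleaned.encode("ascii")
```

With an AlignAPI instance named align (an illustrative name), the call would then look like align.create_msa(seed_to_bytes(my_sequence)).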

upload_prompt(prompt_file)[source]#

Directly upload a prompt.

This method is deprecated. Use create_prompt on the prompt module instead.

Parameters:

prompt_file (BinaryIO) – Binary I/O object representing the prompt file.

Returns:

An object representing the status and results of the prompt job.

Return type:

PromptJob

Raises:

DeprecationError – This method is no longer supported.

get_prompt(job, prompt_index=None)[source]#

Get prompts for a given job.

This method is deprecated. Use get_prompt on the prompt module instead.

Parameters:
  • job (Job) – The job for which to retrieve data.

  • prompt_index (int, optional) – The replicate number for the prompt (input_type=PROMPT only).

Returns:

An iterator over rows of the prompt data.

Return type:

Iterator[list[str]]

Raises:

DeprecationError – This method is no longer supported.

get_seed(job_id)[source]#

Get seed sequence for a given MSA job.

Parameters:
  • job_id (str) – ID of the MSA job for which to retrieve the seed.

Returns:

Seed sequence that was used to generate the MSA.

Return type:

str

get_msa(job_id)[source]#

Get generated MSA for a given job.

Parameters:
  • job_id (str) – ID of the job for which to retrieve the MSA.

Returns:

An iterator over names and sequences of the MSA data.

Return type:

Iterator[tuple[str, str]]
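
Since get_msa yields (name, sequence) pairs, the result can be consumed like any other iterator. The sketch below serializes such pairs to FASTA text; msa_to_fasta is a hypothetical helper, and it works on any iterable of pairs, so it makes no assumptions about the API beyond the documented return type.

```python
from typing import Iterable, Tuple


def msa_to_fasta(rows: Iterable[Tuple[str, str]]) -> str:
    """Serialize (name, sequence) pairs as FASTA text, one record per pair."""
    return "".join(f">{name}\n{seq}\n" for name, seq in rows)
```

The iterator is consumed in a single pass, so call get_msa again if the rows are needed a second time.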

class openprotein.align.MSAFuture[source]#

Represents a future for MSA (Multiple Sequence Alignment) results.

Parameters:
  • session (APISession) – An instance of APISession for API interactions.

  • job (MSAJob) – The MSA job.

  • page_size (int, optional) – The number of results to fetch in a single page. Defaults to config.POET_PAGE_SIZE.

session#

An instance of APISession for API interactions.

Type:

APISession

job#

The MSA job.

Type:

MSAJob | MafftJob | ClustalOJob | AbNumberJob

page_size#

The number of results to fetch in a single page.

Type:

int

msa_id#

The job ID for the MSA.

Type:

str

get(verbose=False)[source]#

Retrieve the MSA of the job as an iterator over CSV rows.

Parameters:

verbose (bool)

Return type:

Iterator[tuple[str, str]]

sample_prompt(...)[source]#

Create a protein sequence prompt from the linked MSA for PoET Jobs.

Parameters:
  • num_sequences (int | None)

  • num_residues (int | None)

  • method (MSASamplingMethod)

  • homology_level (float)

  • max_similarity (float)

  • min_similarity (float)

  • always_include_seed_sequence (bool)

  • num_ensemble_prompts (int)

  • random_seed (int | None)

Return type:

Prompt

__init__(session, job, page_size=50000)[source]#

Initialize an MSAFuture instance.

Parameters:
  • session (APISession) – An instance of APISession for API interactions.

  • job (MSAJob) – The MSA job.

  • page_size (int, optional) – The number of results to fetch in a single page. Defaults to config.POET_PAGE_SIZE.

get(verbose=False)[source]#

Retrieve the MSA of the job.

Parameters:

verbose (bool, optional) – Whether to print verbose output. Defaults to False.

Returns:

An iterator over names and sequences of the MSA data.

Return type:

Iterator[tuple[str, str]]

sample_prompt(num_sequences=None, num_residues=None, method=MSASamplingMethod.NEIGHBORS_NONGAP_NORM_NO_LIMIT, homology_level=0.8, max_similarity=1.0, min_similarity=0.0, always_include_seed_sequence=False, num_ensemble_prompts=1, random_seed=None)[source]#

Create a protein sequence prompt from the linked MSA for PoET Jobs.

Parameters:
  • num_sequences (int, optional) – Maximum number of sequences in the prompt. Must be less than 100.

  • num_residues (int, optional) – Maximum number of residues (tokens) in the prompt. Must be less than 24577.

  • method (MSASamplingMethod, optional) – Method to use for MSA sampling. Defaults to NEIGHBORS_NONGAP_NORM_NO_LIMIT.

  • homology_level (float, optional) – Level of homology for sequences in the MSA (neighbors methods only). Must be between 0 and 1. Defaults to 0.8.

  • max_similarity (float, optional) – Maximum similarity between sequences in the MSA and the seed. Must be between 0 and 1. Defaults to 1.0.

  • min_similarity (float, optional) – Minimum similarity between sequences in the MSA and the seed. Must be between 0 and 1. Defaults to 0.0.

  • always_include_seed_sequence (bool, optional) – Whether to always include the seed sequence in the MSA. Defaults to False.

  • num_ensemble_prompts (int, optional) – Number of ensemble jobs to run. Defaults to 1.

  • random_seed (int, optional) – Seed for random number generation. Defaults to a random number between 0 and 2**32-1.

Raises:
  • InvalidParameterError – If provided parameter values are not in the allowed range.

  • MissingParameterError – If both or neither of ‘num_sequences’ and ‘num_residues’ are specified.

Returns:

A Prompt instance for the created prompt job.

Return type:

Prompt
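
The constraints listed above can be checked locally before submitting a job. The following is a sketch of such a pre-flight check, not the library's own validation: the function name check_sample_prompt_args is hypothetical, it raises ValueError as a stand-in for InvalidParameterError and MissingParameterError, and treating the 0 to 1 ranges as inclusive is an assumption based on the defaults 0.0 and 1.0.

```python
from typing import Optional


def check_sample_prompt_args(num_sequences: Optional[int] = None,
                             num_residues: Optional[int] = None,
                             homology_level: float = 0.8,
                             max_similarity: float = 1.0,
                             min_similarity: float = 0.0) -> dict:
    """Validate documented sample_prompt limits; return kwargs for the call."""
    # Exactly one of num_sequences / num_residues must be given.
    if (num_sequences is None) == (num_residues is None):
        raise ValueError("specify exactly one of num_sequences and num_residues")
    if num_sequences is not None and num_sequences >= 100:
        raise ValueError("num_sequences must be less than 100")
    if num_residues is not None and num_residues >= 24577:
        raise ValueError("num_residues must be less than 24577")
    for name, value in (("homology_level", homology_level),
                        ("max_similarity", max_similarity),
                        ("min_similarity", min_similarity)):
        # Assumption: the documented 0..1 range is inclusive.
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return dict(num_sequences=num_sequences, num_residues=num_residues,
                homology_level=homology_level, max_similarity=max_similarity,
                min_similarity=min_similarity)
```

The returned dict can be unpacked into the call, for example msa.sample_prompt(**check_sample_prompt_args(num_sequences=50)), where msa is an MSAFuture.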

cancelled()#

Check whether the job is cancelled.

Return type:

bool

done()#

Check whether the job is complete.

Return type:

bool

get_input(input_type)#

Retrieve input data for this alignment job.

Parameters:

input_type (AlignType) – The type of input data to retrieve.

Returns:

An iterator over the input data rows.

Return type:

Iterator[list[str]]

get_seed()#

Retrieve the seed sequence for this alignment job.

Returns:

The seed sequence.

Return type:

str

property id#

The job ID for this alignment job.

Returns:

The job ID.

Return type:

str

refresh()#

Refresh job status.

wait(interval=5, timeout=None, verbose=False)#

Wait for the job to complete, then fetch results.

Parameters:
  • interval (int, optional) – Time between polls. Defaults to config.POLLING_INTERVAL.

  • timeout (int, optional) – Maximum time to wait. Defaults to None.

  • verbose (bool, optional) – Verbosity flag. Defaults to False.

Returns:

Results of the job.

Return type:

results

wait_until_done(interval=5, timeout=None, verbose=False)#

Wait for the job to complete without fetching results (unlike wait()).

Parameters:
  • interval (int, optional) – Time between polls. Defaults to config.POLLING_INTERVAL.

  • timeout (int, optional) – Maximum time to wait. Defaults to None.

  • verbose (bool, optional) – Verbosity flag. Defaults to False.

Returns:

results of job

Return type:

results
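
To illustrate what interval and timeout control, here is a sketch of a polling loop built only from the refresh() and done() methods documented above. It is not the library's implementation: poll_until_done is a hypothetical name, treating interval and timeout as seconds is an assumption, and returning False on timeout is a choice made for this example.

```python
import time
from typing import Optional


def poll_until_done(future, interval: float = 5,
                    timeout: Optional[float] = None) -> bool:
    """Refresh `future` until done() is true; return False on timeout."""
    start = time.monotonic()
    while True:
        future.refresh()      # update the job status
        if future.done():
            return True
        if timeout is not None and time.monotonic() - start >= timeout:
            return False      # gave up before the job finished
        time.sleep(interval)  # assumption: interval is in seconds
```

Any object exposing refresh() and done() works here, which makes the loop easy to exercise with a stand-in object before using it on a real job.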