openprotein.align#
Some tools (e.g. PoET, AlphaFold2) require a multiple sequence alignment (MSA) as input. The classes below let you create or upload one.
- class openprotein.align.AlignAPI[source]#
Align API interface for creating alignments and MSAs (multiple sequence alignments) which can be used for other protein tasks.
- __init__(session)[source]#
- Parameters:
session (APISession)
- mafft(sequences, names=None, auto=True, ep=None, op=None)[source]#
Align sequences using the MAFFT algorithm.
Set auto to True to have the algorithm parameters chosen automatically. Leave a parameter as None to use the system default.
- Parameters:
sequences (Sequence[bytes or str]) – Sequences to align.
names (Sequence[str], optional) – Optional list of sequence names. If provided, it must be the same length as sequences.
auto (bool, default=True) – Set to True to automatically set algorithm parameters.
ep (float, optional) – MAFFT “ep” parameter. Sets the offset value for the scoring matrix; lower values make gap opening more difficult. If None, uses system default.
op (float, optional) – MAFFT “op” parameter. Sets the gap opening penalty; higher values increase the cost of opening gaps. If None, uses system default.
- Returns:
Future object awaiting the contents of the MSA.
- Return type:
MSAFuture
- Raises:
Exception – If names and sequences are not the same length.
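As a minimal sketch of the call described above, the helper below unpacks a name-to-sequence mapping into the parallel sequences and names arguments. Building both lists from the same mapping guarantees they have equal length. The align_api argument is assumed to be an AlignAPI instance from an authenticated session. submit_mafft and _StubAlign are hypothetical names used only for illustration, and the stub stands in for the real API so that the example runs offline.

```python
from typing import Mapping


def submit_mafft(align_api, records: Mapping[str, str], auto: bool = True):
    """Submit named sequences to MAFFT through an AlignAPI-like object."""
    # Both lists come from the same mapping, so their lengths always match.
    names = list(records.keys())
    sequences = list(records.values())
    # ep and op stay None so that the system defaults apply.
    return align_api.mafft(sequences, names=names, auto=auto, ep=None, op=None)


class _StubAlign:
    """Hypothetical offline stand-in that echoes the arguments it receives."""

    def mafft(self, sequences, names=None, auto=True, ep=None, op=None):
        return {"sequences": sequences, "names": names, "auto": auto, "ep": ep, "op": op}


result = submit_mafft(_StubAlign(), {"seq_a": "MKVLAT", "seq_b": "MKVAT"})
```

With a real AlignAPI instance, the returned object would be the future described under Returns rather than a dictionary.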
- mafft_file(file, auto=True, ep=None, op=None)[source]#
Align sequences using the MAFFT algorithm. Sequences can be provided in FASTA or CSV format. If CSV, the file must be headerless with either a single sequence column or name, sequence columns.
Set auto to True to have the algorithm parameters chosen automatically. Leave a parameter as None to use the system default.
- Parameters:
file (file-like object) – Sequences to align in FASTA or CSV format.
auto (bool, default=True) – Set to True to automatically set algorithm parameters.
ep (float, optional) – MAFFT “ep” parameter. Sets the offset value for the scoring matrix; lower values make gap opening more difficult. If None, uses system default.
op (float, optional) – MAFFT “op” parameter. Sets the gap opening penalty; higher values increase the cost of opening gaps. If None, uses system default.
- Returns:
Future object awaiting the contents of the MSA.
- Return type:
MSAFuture
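The headerless two-column CSV layout described above can be produced in memory, as in the following sketch. to_headerless_csv is a hypothetical helper, not part of the library. Whether the API expects a binary or text file object is not stated here, so the use of bytes is an assumption.

```python
import csv
import io


def to_headerless_csv(records):
    """Serialize (name, sequence) pairs as a headerless two-column CSV."""
    text = io.StringIO()
    writer = csv.writer(text, lineterminator="\n")
    for name, sequence in records:
        # Column order is name, sequence; no header row is written.
        writer.writerow([name, sequence])
    # Assumption: encode to bytes so the object behaves like a file opened in binary mode.
    return io.BytesIO(text.getvalue().encode("utf-8"))


csv_file = to_headerless_csv([("seq_a", "MKVLAT"), ("seq_b", "MKVAT")])
csv_bytes = csv_file.getvalue()

# With an AlignAPI instance named align_api (assumed), the file could be submitted as:
# future = align_api.mafft_file(csv_file, auto=True)
```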
- clustalo(sequences, names=None, clustersize=None, iterations=None)[source]#
Align sequences using the Clustal Omega algorithm.
Sequences are passed directly as a list. To align sequences stored in a FASTA or CSV file, use clustalo_file instead.
- Parameters:
sequences (Sequence[bytes or str]) – Sequences to align.
names (Sequence[str], optional) – Optional list of sequence names. If provided, it must be the same length as sequences.
clustersize (int, optional) – Maximum number of sequences per cluster during guide tree generation. If None, uses the default value.
iterations (int, optional) – Number of refinement iterations performed during alignment. If None, uses the default value.
- Returns:
Future object awaiting the contents of the MSA.
- Return type:
MSAFuture
- Raises:
Exception – If names and sequences are not the same length.
- clustalo_file(file, clustersize=None, iterations=None)[source]#
Align sequences using the Clustal Omega algorithm.
Sequences can be provided in FASTA or CSV format. If CSV, the file must be headerless with either a single sequence column or name, sequence columns.
- Parameters:
file (file-like object) – Sequences to align in FASTA or CSV format.
clustersize (int, optional) – Maximum number of sequences per cluster during guide tree generation. If None, uses the default value.
iterations (int, optional) – Number of refinement iterations performed during alignment. If None, uses the default value.
- Returns:
Future object awaiting the contents of the MSA.
- Return type:
MSAFuture
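For the FASTA input option, a file-like object can be built in memory, as in this sketch. to_fasta and its width parameter are hypothetical, and passing bytes rather than text is an assumption about what the API accepts.

```python
import io


def to_fasta(records, width=60):
    """Serialize (name, sequence) pairs as FASTA, wrapping sequences at `width`."""
    lines = []
    for name, sequence in records:
        lines.append(f">{name}")
        # Split each sequence into fixed-width chunks, one per line.
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return io.BytesIO(("\n".join(lines) + "\n").encode("utf-8"))


fasta_file = to_fasta([("seq_a", "MKVLAT"), ("seq_b", "MKVAT")], width=4)
fasta_text = fasta_file.getvalue().decode("utf-8")

# With an AlignAPI instance named align_api (assumed), the file could be submitted as:
# future = align_api.clustalo_file(fasta_file, clustersize=None, iterations=None)
```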
- abnumber(sequences, names=None, scheme=AbNumberScheme.CHOTHIA)[source]#
Align antibody sequences using AbNumber.
Sequences are passed directly as a list. To align sequences stored in a FASTA or CSV file, use abnumber_file instead.
The antibody numbering scheme can be specified.
- Parameters:
sequences (Sequence[bytes or str]) – Sequences to align.
names (Sequence[str], optional) – Optional list of sequence names. If provided, it must be the same length as sequences.
scheme (AbNumberScheme, default=AbNumberScheme.CHOTHIA) – Antibody numbering scheme.
- Returns:
Future object awaiting the contents of the MSA.
- Return type:
MSAFuture
- Raises:
Exception – If names and sequences are not the same length.
- abnumber_file(file, scheme=AbNumberScheme.CHOTHIA)[source]#
Align antibody sequences using AbNumber.
Sequences can be provided in FASTA or CSV format. If CSV, the file must be headerless with either a single sequence column or name, sequence columns.
The antibody numbering scheme can be specified.
- Parameters:
file (file-like object) – Sequences to align in FASTA or CSV format.
scheme (AbNumberScheme, default=AbNumberScheme.CHOTHIA) – Antibody numbering scheme.
- Returns:
Future object awaiting the contents of the MSA.
- Return type:
MSAFuture
- upload_msa(msa_file)[source]#
Upload an MSA from a file.
- Parameters:
msa_file (str) – Path to a ready-made MSA file.
- Returns:
Future object awaiting the contents of the MSA upload.
- Return type:
MSAFuture
- Raises:
APIError – If there is an issue with the API request.
- create_msa(seed)[source]#
Construct an MSA via homology search with the seed sequence.
- Parameters:
seed (bytes) – Seed sequence for the MSA construction.
- Returns:
Future object awaiting the contents of the MSA.
- Return type:
MSAFuture
- Raises:
APIError – If there is an issue with the API request.
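Because seed is typed as bytes, a sequence held as a Python string needs encoding before the call. The sketch below normalizes and checks a seed first. prepare_seed is a hypothetical helper, and restricting input to the 20 standard amino acid codes is an assumption, since the API may accept a wider alphabet.

```python
# Assumption: only the 20 standard one-letter amino acid codes are allowed.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def prepare_seed(seed: str) -> bytes:
    """Strip whitespace, upper-case, validate, and encode a seed sequence."""
    cleaned = "".join(seed.split()).upper()
    if not cleaned:
        raise ValueError("seed sequence is empty")
    unexpected = sorted(set(cleaned) - VALID_RESIDUES)
    if unexpected:
        raise ValueError(f"unexpected residue codes: {unexpected}")
    return cleaned.encode("ascii")


seed_bytes = prepare_seed("mkvl at\n")

# With an AlignAPI instance named align_api (assumed):
# future = align_api.create_msa(seed_bytes)
```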
- upload_prompt(prompt_file)[source]#
Directly upload a prompt.
This method is deprecated. Use create_prompt on the prompt module instead.
- Parameters:
prompt_file (BinaryIO) – Binary I/O object representing the prompt file.
- Returns:
An object representing the status and results of the prompt job.
- Return type:
PromptJob
- Raises:
DeprecationError – This method is no longer supported.
- get_prompt(job, prompt_index=None)[source]#
Get prompts for a given job.
This method is deprecated. Use get_prompt on the prompt module instead.
- Parameters:
job (Job) – The job for which to retrieve data.
prompt_index (int, optional) – The replicate number for the prompt (input_type=PROMPT only).
- Returns:
An iterator over rows of the prompt data.
- Return type:
Iterator[list[str]]
- Raises:
DeprecationError – This method is no longer supported.
- class openprotein.align.MSAFuture[source]#
Represents a future for MSA (Multiple Sequence Alignment) results.
- Parameters:
session (APISession) – An instance of APISession for API interactions.
job (MSAJob) – The MSA job.
page_size (int, optional) – The number of results to fetch in a single page. Defaults to config.POET_PAGE_SIZE.
- session#
An instance of APISession for API interactions.
- Type:
APISession
- job#
The MSA job.
- Type:
MSAJob | MafftJob | ClustalOJob | AbNumberJob
- page_size#
The number of results to fetch in a single page.
- Type:
int
- msa_id#
The job ID for the MSA.
- Type:
str
- get(verbose=False)[source]#
Retrieve the MSA of the job as an iterator over CSV rows.
- Parameters:
verbose (bool)
- Return type:
Iterator[tuple[str, str]]
- sample_prompt(...)[source]#
Create a protein sequence prompt from the linked MSA for PoET Jobs.
- Parameters:
num_sequences (int | None)
num_residues (int | None)
method (MSASamplingMethod)
homology_level (float)
max_similarity (float)
min_similarity (float)
always_include_seed_sequence (bool)
num_ensemble_prompts (int)
random_seed (int | None)
- Return type:
Prompt
- __init__(session, job, page_size=50000)[source]#
Initialize an MSAFuture instance.
- Parameters:
session (APISession) – An instance of APISession for API interactions.
job (MSAJob) – The MSA job.
page_size (int, optional) – The number of results to fetch in a single page. Defaults to config.POET_PAGE_SIZE.
- get(verbose=False)[source]#
Retrieve the MSA of the job.
- Parameters:
verbose (bool, optional) – Whether to print verbose output. Defaults to False.
- Returns:
An iterator over names and sequences of the MSA data.
- Return type:
Iterator[tuple[str, str]]
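Since get() yields (name, sequence) tuples, the result can be consumed like any other iterator. The sketch below computes the fraction of gap characters per row from such an iterator. gap_fractions is a hypothetical helper, the sample rows are made up for illustration, and the use of "-" as the gap character is an assumption about the returned alignment.

```python
def gap_fractions(rows):
    """Return (name, fraction of gap characters) for each (name, sequence) row."""
    fractions = []
    for name, sequence in rows:
        # Assumption: gaps are encoded as "-"; empty rows count as having no gaps.
        fraction = sequence.count("-") / len(sequence) if sequence else 0.0
        fractions.append((name, fraction))
    return fractions


# Made-up rows standing in for the iterator returned by MSAFuture.get().
rows = iter([("seed", "MKV-LAT"), ("hit_1", "MKVALAT"), ("hit_2", "MK--LAT")])
fractions = gap_fractions(rows)

# With a completed MSAFuture named msa_future (assumed):
# fractions = gap_fractions(msa_future.get())
```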
- sample_prompt(num_sequences=None, num_residues=None, method=MSASamplingMethod.NEIGHBORS_NONGAP_NORM_NO_LIMIT, homology_level=0.8, max_similarity=1.0, min_similarity=0.0, always_include_seed_sequence=False, num_ensemble_prompts=1, random_seed=None)[source]#
Create a protein sequence prompt from the linked MSA for PoET Jobs.
- Parameters:
num_sequences (int, optional) – Maximum number of sequences in the prompt. Must be less than 100.
num_residues (int, optional) – Maximum number of residues (tokens) in the prompt. Must be less than 24577.
method (MSASamplingMethod, optional) – Method to use for MSA sampling. Defaults to NEIGHBORS_NONGAP_NORM_NO_LIMIT.
homology_level (float, optional) – Level of homology for sequences in the MSA (neighbors methods only). Must be between 0 and 1. Defaults to 0.8.
max_similarity (float, optional) – Maximum similarity between sequences in the MSA and the seed. Must be between 0 and 1. Defaults to 1.0.
min_similarity (float, optional) – Minimum similarity between sequences in the MSA and the seed. Must be between 0 and 1. Defaults to 0.0.
always_include_seed_sequence (bool, optional) – Whether to always include the seed sequence in the MSA. Defaults to False.
num_ensemble_prompts (int, optional) – Number of ensemble jobs to run. Defaults to 1.
random_seed (int, optional) – Seed for random number generation. Defaults to a random number between 0 and 2**32-1.
- Raises:
InvalidParameterError – If provided parameter values are not in the allowed range.
MissingParameterError – If both or neither of ‘num_sequences’ and ‘num_residues’ are specified.
- Returns:
A Prompt instance for the created prompt job.
- Return type:
Prompt
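The documented constraints on sample_prompt can be checked locally before submitting, as in this sketch. build_sample_prompt_kwargs is a hypothetical helper. It raises ValueError as a stand-in for the library's InvalidParameterError and MissingParameterError, and it enforces only the limits stated in the parameter list above.

```python
def build_sample_prompt_kwargs(num_sequences=None, num_residues=None,
                               homology_level=0.8, max_similarity=1.0,
                               min_similarity=0.0):
    """Validate arguments against the documented limits and return them as kwargs."""
    # Exactly one of the two size limits must be given.
    if (num_sequences is None) == (num_residues is None):
        raise ValueError("specify exactly one of num_sequences and num_residues")
    if num_sequences is not None and num_sequences >= 100:
        raise ValueError("num_sequences must be less than 100")
    if num_residues is not None and num_residues >= 24577:
        raise ValueError("num_residues must be less than 24577")
    fractions = {
        "homology_level": homology_level,
        "max_similarity": max_similarity,
        "min_similarity": min_similarity,
    }
    for label, value in fractions.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{label} must be between 0 and 1")
    kwargs = dict(fractions)
    if num_sequences is not None:
        kwargs["num_sequences"] = num_sequences
    else:
        kwargs["num_residues"] = num_residues
    return kwargs


kwargs = build_sample_prompt_kwargs(num_sequences=50, homology_level=0.9)

# With a completed MSAFuture named msa_future (assumed):
# prompt = msa_future.sample_prompt(**kwargs)
```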
- cancelled()#
Check whether the job is cancelled.
- Return type:
bool
- done()#
Check whether the job is complete.
- Return type:
bool
- get_input(input_type)#
Retrieve input data for this alignment job.
- Parameters:
input_type (AlignType) – The type of input data to retrieve.
- Returns:
An iterator over the input data rows.
- Return type:
Iterator[list[str]]
- get_seed()#
Retrieve the seed sequence for this alignment job.
- Returns:
The seed sequence.
- Return type:
str
- property id#
The job ID for this alignment job.
- Returns:
The job ID.
- Return type:
str
- refresh()#
Refresh job status.
- wait(interval=5, timeout=None, verbose=False)#
Wait for the job to complete, then fetch results.
- Parameters:
interval (int, optional) – Time between polling attempts. Defaults to config.POLLING_INTERVAL.
timeout (int, optional) – Maximum time to wait. Defaults to None.
verbose (bool, optional) – Verbosity flag. Defaults to False.
- Returns:
Results of the job.
- Return type:
results
- wait_until_done(interval=5, timeout=None, verbose=False)#
Wait for the job to complete without fetching results (unlike wait()).
- Parameters:
interval (int, optional) – Time between polling attempts. Defaults to config.POLLING_INTERVAL.
timeout (int, optional) – Maximum time to wait. Defaults to None.
verbose (bool, optional) – Verbosity flag. Defaults to False.
- Returns:
results of job
- Return type:
results
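The polling behaviour that wait() and wait_until_done() describe can be illustrated with a generic loop built on refresh() and done(). This is a sketch of the pattern, not the library's implementation. poll_until_done and _StubFuture are hypothetical, and treating interval and timeout as seconds is an assumption.

```python
import time


def poll_until_done(future, interval=5, timeout=None,
                    sleep=time.sleep, clock=time.monotonic):
    """Refresh `future` until done() is True; return False if the timeout elapses."""
    start = clock()
    while True:
        future.refresh()
        if future.done():
            return True
        if timeout is not None and clock() - start >= timeout:
            return False
        sleep(interval)


class _StubFuture:
    """Hypothetical stand-in that completes after a fixed number of refreshes."""

    def __init__(self, polls_needed):
        self._remaining = polls_needed

    def refresh(self):
        self._remaining = max(0, self._remaining - 1)

    def done(self):
        return self._remaining == 0


sleeps = []
# Inject a fake sleep so that the example finishes instantly.
finished = poll_until_done(_StubFuture(3), interval=5, sleep=sleeps.append)
timed_out = poll_until_done(_StubFuture(10), timeout=0, sleep=sleeps.append)
```

Injecting sleep and clock keeps the loop testable without real waiting. In normal use the built-in wait() and wait_until_done() methods cover this.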