openprotein.embeddings#

Create embeddings for your protein sequences using open-source and proprietary models!

Note that for PoET models, you will also need to use our align workflow.

Endpoints#

class openprotein.embeddings.EmbeddingsAPI[source]#

This class defines a high level interface for accessing the embeddings API.

You can access all our models either via get_model() or directly through the session’s embedding attribute using the model’s ID and the desired method. For example, to use the attention method on the protein sequence model, you would use session.embedding.prot_seq.attn().

Examples

Accessing a model’s method:

# To call the attention method on the protein sequence model:
import openprotein
session = openprotein.connect(username="user", password="password")
sequences = [b"MSKGEELFTGV"]  # example input; attn() requires sequences
future = session.embedding.prot_seq.attn(sequences)

Using the get_model method:

# Get a model instance by name:
import openprotein
session = openprotein.connect(username="user", password="password")
# list available models:
print(session.embedding.list_models())
# init model by name
model = session.embedding.get_model('prot-seq')
prot_seq: OpenProteinModel#
rotaprot_large_uniref50w: OpenProteinModel#
rotaprot_large_uniref90_ft: OpenProteinModel#
poet: PoETModel#
esm1b: ESMModel#
esm1b_t33_650M_UR50S: ESMModel#
esm1v: ESMModel#
esm1v_t33_650M_UR90S_1: ESMModel#
esm1v_t33_650M_UR90S_2: ESMModel#
esm1v_t33_650M_UR90S_3: ESMModel#
esm1v_t33_650M_UR90S_4: ESMModel#
esm1v_t33_650M_UR90S_5: ESMModel#
esm2: ESMModel#
esm2_t12_35M_UR50D: ESMModel#
esm2_t30_150M_UR50D: ESMModel#
esm2_t33_650M_UR50D: ESMModel#
esm2_t36_3B_UR50D: ESMModel#
esm2_t6_8M_UR50D: ESMModel#
__init__(session)[source]#
Parameters:

session (APISession)

list_models()[source]#

List the models available for creating embeddings of your sequences.

Return type:

list[EmbeddingModel]

get_model(name)[source]#

Get a model by its name (the model ID).

ProtembedModel allows all the usual job manipulation, e.g. making POST and GET requests for this model specifically.

Parameters:

name (str) – the model identifier

Returns:

The model

Return type:

ProtembedModel

Raises:

HTTPError – If the GET request does not succeed.
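The attribute names listed for EmbeddingsAPI (for example prot_seq) and the string passed to get_model() (for example 'prot-seq') appear to differ only in hyphens versus underscores. The following is a minimal sketch of that relationship using a stand-in registry. `ModelRegistry` and `to_attribute_name` are hypothetical names used for illustration only; they are not part of the openprotein package, and the real lookup may work differently.

```python
def to_attribute_name(model_id: str) -> str:
    """Turn a model ID into a valid Python attribute name (assumed convention)."""
    return model_id.replace("-", "_")


class ModelRegistry:
    """Stand-in for an API object exposing models by ID and by attribute."""

    def __init__(self, model_ids):
        # Map each model ID to a placeholder object (here: a descriptive string).
        self._models = {model_id: f"<model {model_id}>" for model_id in model_ids}
        # Expose the same objects as attributes with Python-safe names.
        for model_id, model in self._models.items():
            setattr(self, to_attribute_name(model_id), model)

    def get_model(self, name):
        # Look up by model ID; unknown names raise KeyError in this sketch.
        return self._models[name]


registry = ModelRegistry(["prot-seq", "esm2_t12_35M_UR50D"])
# Both access styles return the same object.
print(registry.get_model("prot-seq") is registry.prot_seq)
```

In this sketch both access paths return the same object, so choosing between them is a matter of convenience: attribute access suits interactive use, while lookup by name suits IDs that are only known at runtime.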

Models#

class openprotein.embeddings.OpenProteinModel[source]#

Class providing inference endpoints for proprietary protein embedding models served by OpenProtein.

Examples

View specific model details (including supported tokens) with the ? operator in IPython or Jupyter.

import openprotein
session = openprotein.connect(username="user", password="password")
session.embedding.prot_seq?
model_id: list[str] | str = ['prot-seq', 'rotaprot-large-uniref50w', 'rotaprot_large_uniref90_ft']#
__init__(session, model_id, metadata=None)#
Parameters:
  • session (APISession)

  • model_id (str)

  • metadata (ModelMetadata | None)

attn(sequences, **kwargs)#

Attention embeddings for sequences using this model.

Parameters:

sequences (List[bytes]) – sequences to compute attention embeddings for

Return type:

EmbeddingResultFuture

classmethod create(session, model_id, default=None)#

Create and return an instance of the appropriate Future class based on the job type.

Returns:

An instance of the appropriate Future class.

Parameters:
  • session (APISession)

  • model_id (str)

  • default (type[EmbeddingModel] | None)

embed(sequences, reduction=ReductionType.MEAN, **kwargs)#

Embed sequences using this model.

Parameters:
  • sequences (List[bytes]) – sequences to embed

  • reduction (ReductionType | None, Optional) – embeddings reduction to use (e.g. mean)

Return type:

EmbeddingResultFuture
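To make the reduction parameter concrete, here is a minimal sketch of what a mean reduction does, assuming that MEAN averages the per-residue vectors over the sequence length to give one fixed-size vector per sequence. `mean_reduce` is a hypothetical helper written for this illustration; it is not the service's implementation.

```python
from typing import List


def mean_reduce(per_residue: List[List[float]]) -> List[float]:
    """Average L per-residue vectors of size D into a single vector of size D."""
    if not per_residue:
        raise ValueError("need at least one residue vector")
    length = len(per_residue)
    dim = len(per_residue[0])
    # Average each embedding dimension across all residues.
    return [sum(vec[d] for vec in per_residue) / length for d in range(dim)]


# Three residues, each with a 2-dimensional embedding.
embedding = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
reduced = mean_reduce(embedding)
print(reduced)  # [3.0, 4.0]
```

Under this assumption, sequences of different lengths all map to vectors of the same size, which is what downstream models fitted on embeddings typically need.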

fit_gp(assay, properties, reduction, name=None, description=None, **kwargs)#

Fit a GP on assay using this embedding model and hyperparameters.

Parameters:
  • assay (AssayMetadata | str) – Assay to fit GP on.

  • properties (list[str]) – Properties in the assay to fit the GP on.

  • reduction (str) – Type of embedding reduction to use for computing features. Protein language model (PLM) embeddings must use a reduction.

  • name (str | None)

  • description (str | None)

Return type:

PredictorModel

fit_svd(sequences=None, assay=None, n_components=1024, reduction=None, **kwargs)#

Fit an SVD on the embedding results of this model.

This function will create an SVDModel based on the embeddings from this model as well as the hyperparameters specified in the args.

Parameters:
  • sequences (List[bytes]) – sequences to fit the SVD on

  • n_components (int) – number of components in SVD. Will determine output shapes

  • reduction (ReductionType | None) – embeddings reduction to use (e.g. mean)

  • assay (AssayDataset | None)

Return type:

SVDModel

fit_umap(sequences=None, assay=None, n_components=2, reduction=ReductionType.MEAN, **kwargs)#

Fit a UMAP on the embedding results of this model.

This function will create a UMAPModel based on the embeddings from this model as well as the hyperparameters specified in the args.

Parameters:
  • sequences (list[bytes] | None) – Optional sequences to fit UMAP with. Either use sequences or assay. sequences is preferred.

  • assay (AssayDataset | None) – Optional assay containing sequences to fit UMAP with. Either use sequences or assay. Ignored if sequences are provided.

  • n_components (int) – Number of components in UMAP fit. Will determine output shapes. Defaults to 2.

  • reduction (ReductionType | None) – Embeddings reduction to use (e.g. mean). Defaults to MEAN.

Return type:

UMAPModel
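The parameter descriptions above say that either sequences or assay may be supplied, and that assay is ignored when sequences are provided. A minimal sketch of that precedence rule follows. `resolve_fit_sequences` is a hypothetical helper, the assay is represented here as a plain list of sequences rather than an AssayDataset, and raising an error when neither is given is an assumption.

```python
from typing import List, Optional


def resolve_fit_sequences(
    sequences: Optional[List[bytes]] = None,
    assay_sequences: Optional[List[bytes]] = None,
) -> List[bytes]:
    """Pick the sequences to fit on: explicit sequences win over the assay."""
    if sequences is not None:
        return sequences  # the assay is ignored when sequences are provided
    if assay_sequences is not None:
        return assay_sequences
    # Assumed behaviour: fitting needs at least one source of sequences.
    raise ValueError("provide either sequences or an assay")


chosen = resolve_fit_sequences(
    sequences=[b"MKT"],
    assay_sequences=[b"AAA", b"CCC"],
)
print(chosen)  # [b'MKT']
```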

get_metadata()#

Get model metadata for this model.

Return type:

ModelMetadata

classmethod get_model()#
logits(sequences, **kwargs)#

Compute logits for sequences using this model.

Parameters:

sequences (List[bytes]) – sequences to compute logits for

Return type:

EmbeddingResultFuture

property metadata#
class openprotein.embeddings.ESMModel[source]#

Class providing inference endpoints for Facebook’s ESM protein language models.

Examples

View specific model details (including supported tokens) with the ? operator in IPython or Jupyter.

import openprotein
session = openprotein.connect(username="user", password="password")
session.embedding.esm2_t12_35M_UR50D?
__init__(session, model_id, metadata=None)#
Parameters:
  • session (APISession)

  • model_id (str)

  • metadata (ModelMetadata | None)

attn(sequences, **kwargs)#

Attention embeddings for sequences using this model.

Parameters:

sequences (List[bytes]) – sequences to compute attention embeddings for

Return type:

EmbeddingResultFuture

classmethod create(session, model_id, default=None)#

Create and return an instance of the appropriate Future class based on the job type.

Returns:

An instance of the appropriate Future class.

Parameters:
  • session (APISession)

  • model_id (str)

  • default (type[EmbeddingModel] | None)

embed(sequences, reduction=ReductionType.MEAN, **kwargs)#

Embed sequences using this model.

Parameters:
  • sequences (List[bytes]) – sequences to embed

  • reduction (ReductionType | None, Optional) – embeddings reduction to use (e.g. mean)

Return type:

EmbeddingResultFuture

fit_gp(assay, properties, reduction, name=None, description=None, **kwargs)#

Fit a GP on assay using this embedding model and hyperparameters.

Parameters:
  • assay (AssayMetadata | str) – Assay to fit GP on.

  • properties (list[str]) – Properties in the assay to fit the GP on.

  • reduction (str) – Type of embedding reduction to use for computing features. Protein language model (PLM) embeddings must use a reduction.

  • name (str | None)

  • description (str | None)

Return type:

PredictorModel

fit_svd(sequences=None, assay=None, n_components=1024, reduction=None, **kwargs)#

Fit an SVD on the embedding results of this model.

This function will create an SVDModel based on the embeddings from this model as well as the hyperparameters specified in the args.

Parameters:
  • sequences (List[bytes]) – sequences to fit the SVD on

  • n_components (int) – number of components in SVD. Will determine output shapes

  • reduction (ReductionType | None) – embeddings reduction to use (e.g. mean)

  • assay (AssayDataset | None)

Return type:

SVDModel

fit_umap(sequences=None, assay=None, n_components=2, reduction=ReductionType.MEAN, **kwargs)#

Fit a UMAP on the embedding results of this model.

This function will create a UMAPModel based on the embeddings from this model as well as the hyperparameters specified in the args.

Parameters:
  • sequences (list[bytes] | None) – Optional sequences to fit UMAP with. Either use sequences or assay. sequences is preferred.

  • assay (AssayDataset | None) – Optional assay containing sequences to fit UMAP with. Either use sequences or assay. Ignored if sequences are provided.

  • n_components (int) – Number of components in UMAP fit. Will determine output shapes. Defaults to 2.

  • reduction (ReductionType | None) – Embeddings reduction to use (e.g. mean). Defaults to MEAN.

Return type:

UMAPModel

get_metadata()#

Get model metadata for this model.

Return type:

ModelMetadata

logits(sequences, **kwargs)#

Compute logits for sequences using this model.

Parameters:

sequences (List[bytes]) – sequences to compute logits for

Return type:

EmbeddingResultFuture

class openprotein.embeddings.PoETModel[source]#

Class for OpenProtein’s foundation model, PoET. Note that PoET functions depend on a prompt supplied via the align endpoints.

Examples

View specific model details (including supported tokens) with the ? operator in IPython or Jupyter.

import openprotein
session = openprotein.connect(username="user", password="password")
session.embedding.poet.<embeddings_method>
__init__(session, model_id, metadata=None)[source]#
Parameters:
  • session (APISession)

  • model_id (str)

  • metadata (ModelMetadata | None)

embed(prompt, sequences, reduction=ReductionType.MEAN)[source]#

Embed sequences using this model.

Parameters:
  • prompt (Union[str, PromptFuture]) – prompt from an align workflow to condition the PoET model

  • sequences (list[bytes]) – sequences to embed

  • reduction (str) – embeddings reduction to use (e.g. mean)

Returns:

A future object that returns the embeddings of the submitted sequences.

Return type:

EmbeddingResultFuture

logits(prompt, sequences)[source]#

Compute logits for sequences using this model.

Parameters:
  • prompt (Union[str, PromptFuture]) – prompt from an align workflow to condition the PoET model

  • sequences (list[bytes]) – sequences to compute logits for

Returns:

A future object that returns the logits of the submitted sequences.

Return type:

EmbeddingResultFuture

attn()[source]#

Not available for PoET.

score(prompt, sequences)[source]#

Score query sequences using the specified prompt.

Parameters:
  • prompt (str | PromptFuture) – prompt, or prompt ID, from an align workflow used to condition the PoET model

  • sequences (list[bytes]) – sequences to score

Returns:

A future object that returns the scores of the submitted sequences.

Return type:

EmbeddingsScoreFuture

single_site(prompt, sequence)[source]#

Score all single substitutions of the query sequence using the specified prompt.

Parameters:
  • prompt (str | PromptFuture) – prompt, or prompt ID, from an align workflow used to condition the PoET model

  • sequence (bytes) – sequence to analyse

Returns:

A future object that returns the scores of the mutated sequence.

Return type:

EmbeddingsScoreFuture
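To illustrate what "all single substitutions" covers, the sketch below enumerates every variant of a query sequence that differs at exactly one position. It assumes an alphabet of the 20 canonical amino acids; the tokens actually supported by the model may differ, and `single_substitutions` is a hypothetical helper, not part of the client.

```python
from typing import Iterator

AMINO_ACIDS = b"ACDEFGHIKLMNPQRSTVWY"  # 20 canonical residues (assumed alphabet)


def single_substitutions(sequence: bytes) -> Iterator[bytes]:
    """Yield every variant of `sequence` that differs at exactly one position."""
    for position, original in enumerate(sequence):
        for residue in AMINO_ACIDS:
            if residue != original:
                # Replace the residue at `position`, keep everything else.
                yield sequence[:position] + bytes([residue]) + sequence[position + 1:]


variants = list(single_substitutions(b"MKT"))
print(len(variants))  # 57
```

With this alphabet a sequence of length L has 19 × L single-substitution variants, so the number of scores grows linearly with sequence length.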

generate(prompt, num_samples=100, temperature=1.0, topk=None, topp=None, max_length=1000, seed=None)[source]#

Generate protein sequences conditioned on a prompt.

Parameters:
  • prompt (Union[str, PromptFuture]) – prompt from an align workflow to condition the PoET model

  • num_samples (int, optional) – The number of samples to generate, by default 100.

  • temperature (float, optional) – The temperature for sampling. Higher values produce more random outputs, by default 1.0.

  • topk (int, optional) – The number of top-k residues to consider during sampling, by default None.

  • topp (float, optional) – The cumulative probability threshold for top-p sampling, by default None.

  • max_length (int, optional) – The maximum length of generated proteins, by default 1000.

  • seed (int, optional) – Seed for random number generation, by default a random number.

Returns:

A future object representing the status and information about the generation job.

Return type:

EmbeddingsGenerateFuture
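The temperature, topk and topp parameters correspond to standard sampling controls. The sketch below shows how they are commonly applied to a next-token distribution. It is a generic illustration in pure Python: the order in which the filters are applied and the helper name `filter_distribution` are assumptions, and nothing here is taken from the PoET implementation.

```python
import math
from typing import Dict, Optional


def filter_distribution(
    logits: Dict[str, float],
    temperature: float = 1.0,
    topk: Optional[int] = None,
    topp: Optional[float] = None,
) -> Dict[str, float]:
    """Return renormalized sampling probabilities after temperature, top-k and top-p."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(((w / total, tok) for tok, w in weights.items()), reverse=True)

    # Top-k: keep only the k most probable tokens.
    if topk is not None:
        ranked = ranked[:topk]

    # Top-p: keep the smallest prefix whose cumulative probability reaches topp.
    if topp is not None:
        kept, cumulative = [], 0.0
        for prob, tok in ranked:
            kept.append((prob, tok))
            cumulative += prob
            if cumulative >= topp:
                break
        ranked = kept

    # Renormalize the surviving tokens so they sum to 1.
    norm = sum(prob for prob, _ in ranked)
    return {tok: prob / norm for prob, tok in ranked}


example_logits = {"A": 2.0, "G": 1.0, "W": 0.0}
top2 = filter_distribution(example_logits, topk=2)
print(sorted(top2))  # ['A', 'G']
```

Lower temperatures concentrate probability on the most likely residues, while topk and topp remove the low-probability tail before sampling.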

fit_svd(prompt, sequences=None, assay=None, n_components=1024, reduction=None)[source]#

Fit an SVD on the embedding results of PoET.

This function will create an SVDModel based on the embeddings from this model as well as the hyperparameters specified in the args.

Parameters:
  • prompt (Union[str, PromptFuture]) – prompt from an align workflow to condition the PoET model

  • sequences (List[bytes]) – sequences to fit the SVD on

  • n_components (int) – number of components in SVD. Will determine output shapes

  • reduction (str) – embeddings reduction to use (e.g. mean)

  • assay (AssayDataset | None)

Returns:

A future that represents the fitted SVD model.

Return type:

SVDModel

fit_umap(prompt, sequences=None, assay=None, n_components=2, reduction=ReductionType.MEAN)[source]#

Fit a UMAP on assay using PoET and hyperparameters.

This function will create a UMAP based on the embeddings from this PoET model as well as the hyperparameters specified in the args.

Parameters:
  • prompt (Union[str, PromptFuture]) – prompt from an align workflow to condition the PoET model

  • sequences (list[bytes] | None) – Optional sequences to fit UMAP with. Either use sequences or assay. sequences is preferred.

  • assay (AssayDataset | None) – Optional assay containing sequences to fit UMAP with. Either use sequences or assay. Ignored if sequences are provided.

  • n_components (int) – Number of components in UMAP fit. Will determine output shapes. Defaults to 2.

  • reduction (ReductionType | None) – Embeddings reduction to use (e.g. mean). Defaults to MEAN.

Returns:

A future that represents the fitted UMAP model.

Return type:

UMAPModel

fit_gp(prompt, assay, properties, **kwargs)[source]#

Fit a GP on assay using this embedding model and hyperparameters.

Parameters:
  • prompt (str | PromptFuture) – prompt from an align workflow to condition the PoET model

  • assay (AssayMetadata | str) – Assay to fit GP on.

  • properties (list[str]) – Properties in the assay to fit the GP on.

  • reduction (str) – Type of embedding reduction to use for computing features. Protein language model (PLM) embeddings must use a reduction.

Returns:

A future that represents the trained predictor model.

Return type:

PredictorModel

classmethod create(session, model_id, default=None)#

Create and return an instance of the appropriate Future class based on the job type.

Returns:

An instance of the appropriate Future class.

Parameters:
  • session (APISession)

  • model_id (str)

  • default (type[EmbeddingModel] | None)

get_metadata()#

Get model metadata for this model.

Return type:

ModelMetadata

class openprotein.embeddings.SVDModel[source]#

Class providing embedding endpoint for SVD models. Also allows retrieving embeddings of sequences used to fit the SVD with get. Implements a Future to allow waiting for a fit job.

__init__(session, job=None, metadata=None)[source]#

Initialize from either a fit job or SVD metadata.

Parameters:
  • session (APISession)

  • job (FitJob | None)

  • metadata (SVDMetadata | None)

get_model()[source]#

Fetch the embeddings model.

Return type:

EmbeddingModel

delete()[source]#

Delete this SVD model.

Return type:

bool

get(verbose=False)[source]#

Return the results from this job.

Parameters:

verbose (bool)

get_inputs()[source]#

Get the sequences used for the SVD job.

Returns:

list of sequences

Return type:

List[bytes]

embed(sequences, **kwargs)[source]#

Use this SVD model to get reduced embeddings from input sequences.

Parameters:

sequences (List[bytes]) – List of protein sequences.

Returns:

Class for further job manipulation.

Return type:

EmbeddingResultFuture

fit_umap(sequences=None, assay=None, n_components=2, **kwargs)[source]#

Fit a UMAP on the embedding results of this model.

This function will create a UMAPModel based on the embeddings from this model as well as the hyperparameters specified in the args.

Parameters:
  • sequences (List[bytes]) – sequences to fit the UMAP on

  • n_components (int) – number of components in UMAP. Will determine output shapes

  • reduction (ReductionType | None) – embeddings reduction to use (e.g. mean)

  • assay (AssayDataset | None)

Return type:

UMAPModel

fit_gp(assay, properties, name=None, description=None, **kwargs)[source]#

Fit a GP on assay using this embedding model and hyperparameters.

Parameters:
  • assay (AssayMetadata | str) – Assay to fit GP on.

  • properties (list[str]) – Properties in the assay to fit the gp on.

  • name (str | None)

  • description (str | None)

Return type:

PredictorModel

cancelled()#

Check if job is cancelled

Return type:

bool

done()#

Check if job is complete

Return type:

bool

refresh()#

Refresh job status.

wait(interval=5, timeout=None, verbose=False)#

Wait for job to complete, then fetch results.

Parameters:
  • interval (int, optional) – time between polling. Defaults to config.POLLING_INTERVAL.

  • timeout (int, optional) – max time to wait. Defaults to None.

  • verbose (bool, optional) – verbosity flag. Defaults to False.

Returns:

results of job

Return type:

results

wait_until_done(interval=5, timeout=None, verbose=False)#

Wait for job to complete. Do not fetch results (unlike wait())

Parameters:
  • interval (int, optional) – time between polling. Defaults to config.POLLING_INTERVAL.

  • timeout (int, optional) – max time to wait. Defaults to None.

  • verbose (bool, optional) – verbosity flag. Defaults to False.

Returns:

results of job

Return type:

results
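The difference between wait() and wait_until_done() can be sketched with a simple polling loop. The code below uses a stand-in `FakeJob` class instead of a real fit job. The function bodies, the boolean return value of `wait_until_done`, and the choice to raise TimeoutError when the timeout expires are all assumptions made for illustration, not the library's implementation.

```python
import time


class FakeJob:
    """Stand-in job that reports completion after a fixed number of status checks."""

    def __init__(self, checks_until_done: int, results):
        self._remaining = checks_until_done
        self._results = results

    def done(self) -> bool:
        # Each status check brings the fake job one step closer to completion.
        self._remaining -= 1
        return self._remaining <= 0

    def get(self):
        return self._results


def wait_until_done(job, interval: float = 5, timeout=None) -> bool:
    """Poll until the job is done; raise TimeoutError once `timeout` seconds pass."""
    start = time.monotonic()
    while not job.done():
        if timeout is not None and time.monotonic() - start > timeout:
            raise TimeoutError("job did not finish in time")
        time.sleep(interval)  # pause between status checks
    return True


def wait(job, interval: float = 5, timeout=None):
    """Poll until the job is done, then fetch and return its results."""
    wait_until_done(job, interval=interval, timeout=timeout)
    return job.get()


results = wait(FakeJob(checks_until_done=3, results=[0.1, 0.2]), interval=0.01)
print(results)  # [0.1, 0.2]
```

A short interval gives faster feedback at the cost of more status requests; a timeout bounds how long a script blocks on a job that never finishes.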

class openprotein.embeddings.UMAPModel[source]#

Class providing embedding endpoint for UMAP models. Also allows retrieving embeddings of sequences used to fit the UMAP with get. Implements a Future to allow waiting for a fit job.

__init__(session, job=None, metadata=None)[source]#

Initialize from either a fit job or UMAP metadata.

Parameters:
  • session (APISession)

  • job (FitJob | None)

  • metadata (UMAPMetadata | None)

get_model()[source]#

Fetch the embeddings model.

Return type:

EmbeddingModel

delete()[source]#

Delete this UMAP model.

Return type:

bool

get(verbose=False)[source]#

Return the results from this job.

Parameters:

verbose (bool)

get_inputs()[source]#

Get the sequences used for the UMAP job.

Returns:

list of sequences

Return type:

List[bytes]

embed(sequences, **kwargs)[source]#

Use this UMAP model to get reduced embeddings from input sequences.

Parameters:

sequences (List[bytes]) – List of protein sequences.

Returns:

Class for further job manipulation.

Return type:

EmbeddingResultFuture

cancelled()#

Check if job is cancelled

Return type:

bool

done()#

Check if job is complete

Return type:

bool

refresh()#

Refresh job status.

wait(interval=5, timeout=None, verbose=False)#

Wait for job to complete, then fetch results.

Parameters:
  • interval (int, optional) – time between polling. Defaults to config.POLLING_INTERVAL.

  • timeout (int, optional) – max time to wait. Defaults to None.

  • verbose (bool, optional) – verbosity flag. Defaults to False.

Returns:

results of job

Return type:

results

wait_until_done(interval=5, timeout=None, verbose=False)#

Wait for job to complete. Do not fetch results (unlike wait())

Parameters:
  • interval (int, optional) – time between polling. Defaults to config.POLLING_INTERVAL.

  • timeout (int, optional) – max time to wait. Defaults to None.

  • verbose (bool, optional) – verbosity flag. Defaults to False.

Returns:

results of job

Return type:

results

Results#

class openprotein.embeddings.EmbeddingsResultFuture[source]#

Future for manipulating results for embeddings-related requests.

__init__(session, job, sequences=None, max_workers=10)[source]#

Retrieve results from asynchronous, mapped endpoints.

Use max_workers > 0 to enable concurrent retrieval of multiple pages.

Parameters:
  • session (APISession)

  • job (EmbeddingsJob | AttnJob | LogitsJob)

  • sequences (list[bytes] | list[str] | None)

  • max_workers (int)
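The max_workers option suggests that result pages are fetched by a pool of workers. A minimal sketch of that pattern with the standard library follows. `fetch_all_pages` and `fake_fetch_page` are hypothetical names, the page contents are made up, and falling back to sequential fetching when max_workers is 0 is an assumption rather than documented behaviour.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def fetch_all_pages(
    fetch_page: Callable[[int], List[str]],
    num_pages: int,
    max_workers: int = 10,
) -> List[str]:
    """Fetch `num_pages` pages, concurrently when max_workers > 0, in page order."""
    page_numbers = range(num_pages)
    if max_workers > 0:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # Executor.map returns results in input order, not completion order.
            pages = list(pool.map(fetch_page, page_numbers))
    else:
        # Assumed fallback: fetch pages one after another.
        pages = [fetch_page(n) for n in page_numbers]
    # Flatten the pages into a single list of results.
    return [item for page in pages for item in page]


def fake_fetch_page(page: int) -> List[str]:
    # Hypothetical page fetcher: two results per page.
    return [f"result-{page}-0", f"result-{page}-1"]


items = fetch_all_pages(fake_fetch_page, num_pages=3, max_workers=2)
print(len(items))  # 6
```

Threads fit this kind of work because each page fetch mostly waits on the network, so several requests can be in flight at once.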

get(verbose=False)[source]#

Return the results from this job.

Return type:

list

get_item(sequence)[source]#

Get embedding results for specified sequence.

Parameters:

sequence (bytes) – sequence to fetch results for

Returns:

embeddings

Return type:

np.ndarray

cancelled()#

Check if job is cancelled

Return type:

bool

done()#

Check if job is complete

Return type:

bool

refresh()#

Refresh job status.

stream()#

Retrieve results for this job as a stream.

wait(interval=5, timeout=None, verbose=False)#

Wait for job to complete, then fetch results.

Parameters:
  • interval (int, optional) – time between polling. Defaults to config.POLLING_INTERVAL.

  • timeout (int, optional) – max time to wait. Defaults to None.

  • verbose (bool, optional) – verbosity flag. Defaults to False.

Returns:

results of job

Return type:

results

wait_until_done(interval=5, timeout=None, verbose=False)#

Wait for job to complete. Do not fetch results (unlike wait())

Parameters:
  • interval (int, optional) – time between polling. Defaults to config.POLLING_INTERVAL.

  • timeout (int, optional) – max time to wait. Defaults to None.

  • verbose (bool, optional) – verbosity flag. Defaults to False.

Returns:

results of job

Return type:

results

class openprotein.embeddings.EmbeddingsScoreFuture[source]#

Future for manipulating results for embeddings score-related requests.

__init__(session, job, sequences=None)[source]#
Parameters:
  • session (APISession)

  • job (ScoreJob | ScoreSingleSiteJob | GenerateJob)

  • sequences (list[bytes] | list[str] | None)

stream()[source]#

Return the results from this job as a generator.

Return type:

Generator

cancelled()#

Check if job is cancelled

Return type:

bool

done()#

Check if job is complete

Return type:

bool

get(verbose=False, **kwargs)#

Return the results from this job.

Parameters:

verbose (bool)

Return type:

list

refresh()#

Refresh job status.

wait(interval=5, timeout=None, verbose=False)#

Wait for job to complete, then fetch results.

Parameters:
  • interval (int, optional) – time between polling. Defaults to config.POLLING_INTERVAL.

  • timeout (int, optional) – max time to wait. Defaults to None.

  • verbose (bool, optional) – verbosity flag. Defaults to False.

Returns:

results of job

Return type:

results

wait_until_done(interval=5, timeout=None, verbose=False)#

Wait for job to complete. Do not fetch results (unlike wait())

Parameters:
  • interval (int, optional) – time between polling. Defaults to config.POLLING_INTERVAL.

  • timeout (int, optional) – max time to wait. Defaults to None.

  • verbose (bool, optional) – verbosity flag. Defaults to False.

Returns:

results of job

Return type:

results

class openprotein.embeddings.EmbeddingsGenerateFuture[source]#

Future for manipulating results for embeddings generate-related requests.

__init__(session, job, sequences=None)#
Parameters:
  • session (APISession)

  • job (ScoreJob | ScoreSingleSiteJob | GenerateJob)

  • sequences (list[bytes] | list[str] | None)

cancelled()#

Check if job is cancelled

Return type:

bool

done()#

Check if job is complete

Return type:

bool

get(verbose=False, **kwargs)#

Return the results from this job.

Parameters:

verbose (bool)

Return type:

list

refresh()#

Refresh job status.

stream()#

Return the results from this job as a generator.

Return type:

Generator

wait(interval=5, timeout=None, verbose=False)#

Wait for job to complete, then fetch results.

Parameters:
  • interval (int, optional) – time between polling. Defaults to config.POLLING_INTERVAL.

  • timeout (int, optional) – max time to wait. Defaults to None.

  • verbose (bool, optional) – verbosity flag. Defaults to False.

Returns:

results of job

Return type:

results

wait_until_done(interval=5, timeout=None, verbose=False)#

Wait for job to complete. Do not fetch results (unlike wait())

Parameters:
  • interval (int, optional) – time between polling. Defaults to config.POLLING_INTERVAL.

  • timeout (int, optional) – max time to wait. Defaults to None.

  • verbose (bool, optional) – verbosity flag. Defaults to False.

Returns:

results of job

Return type:

results