openprotein.embeddings#
Create embeddings for your protein sequences using open-source and proprietary models!
Note that for PoET models, you will also need to use our align workflow.
Endpoints#
- class openprotein.embeddings.EmbeddingsAPI[source]#
This class defines a high level interface for accessing the embeddings API.
You can access all our models either via
get_model()
or directly through the session’s embedding attribute using the model’s ID and the desired method. For example, to use the attention method on the protein sequence model, you would use session.embedding.prot_seq.attn().
Examples
Accessing a model’s method:
# To call the attention method on the protein sequence model:
import openprotein
session = openprotein.connect(username="user", password="password")
session.embedding.prot_seq.attn()
Using the get_model method:
# Get a model instance by name:
import openprotein
session = openprotein.connect(username="user", password="password")
# list available models:
print(session.embedding.list_models())
# init model by name
model = session.embedding.get_model('prot-seq')
- prot_seq: OpenProteinModel#
- rotaprot_large_uniref50w: OpenProteinModel#
- rotaprot_large_uniref90_ft: OpenProteinModel#
- __init__(session)[source]#
- Parameters:
session (APISession)
- list_models()[source]#
List the models available for creating embeddings of your sequences.
- Return type:
list[EmbeddingModel]
- get_model(name)[source]#
Get a model by its model ID.
ProtembedModel allows all the usual job manipulation, e.g. making POST and GET requests for this model specifically.
- Parameters:
name (str) – the model identifier (model_id)
- Returns:
The model
- Return type:
ProtembedModel
- Raises:
HTTPError – If the GET request does not succeed.
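The examples above pair the model ID 'prot-seq' with the attribute prot_seq. The sketch below assumes, based only on those examples, that attribute names are model IDs with hyphens replaced by underscores. The helper names attribute_name and resolve_model are hypothetical and not part of openprotein.

```python
def attribute_name(model_id: str) -> str:
    """Map a model ID such as 'prot-seq' to an attribute-style name.

    Assumption: attribute names are model IDs with '-' replaced by '_'.
    """
    return model_id.replace("-", "_")


def resolve_model(embedding_api, model_id: str):
    """Return the model for model_id, preferring attribute access.

    embedding_api is expected to behave like session.embedding.
    Falls back to get_model(model_id) when no matching attribute exists.
    """
    model = getattr(embedding_api, attribute_name(model_id), None)
    if model is not None:
        return model
    return embedding_api.get_model(model_id)


# Possible usage, given a connected session:
# model = resolve_model(session.embedding, "prot-seq")
```

The fallback keeps the helper usable for models that are only reachable through get_model().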
Models#
- class openprotein.embeddings.OpenProteinModel[source]#
Class providing inference endpoints for proprietary protein embedding models served by OpenProtein.
Examples
View specific model details (including supported tokens) with the ? operator, which is available in IPython and Jupyter.
import openprotein
session = openprotein.connect(username="user", password="password")
session.embedding.prot_seq?
- model_id: list[str] | str = ['prot-seq', 'rotaprot-large-uniref50w', 'rotaprot_large_uniref90_ft']#
- __init__(session, model_id, metadata=None)#
- Parameters:
session (APISession)
model_id (str)
metadata (ModelMetadata | None)
- attn(sequences, **kwargs)#
Attention embeddings for sequences using this model.
- Parameters:
sequences (List[bytes]) – sequences to compute attention embeddings for
- Return type:
EmbeddingResultFuture
- classmethod create(session, model_id, default=None)#
Create and return an instance of the appropriate Future class based on the job type.
- Returns:
An instance of the appropriate Future class.
- Parameters:
session (APISession)
model_id (str)
default (type[EmbeddingModel] | None)
- embed(sequences, reduction=ReductionType.MEAN, **kwargs)#
Embed sequences using this model.
- Parameters:
sequences (List[bytes]) – sequences to embed
reduction (ReductionType | None, Optional) – embeddings reduction to use (e.g. mean)
- Return type:
EmbeddingResultFuture
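A typical embed call submits a job and then reads results from the returned future. The sketch below uses only methods documented on this page (embed, wait_until_done, get_item). The helper name embed_sequences and the example sequence in the usage comment are illustrative assumptions.

```python
def embed_sequences(model, sequences, timeout=None):
    """Submit an embed job and return {sequence: embedding} once it finishes.

    model is expected to behave like session.embedding.prot_seq.
    The default reduction of embed() is used.
    """
    future = model.embed(sequences=sequences)
    # Block until the job completes; results are fetched per sequence below.
    future.wait_until_done(timeout=timeout)
    return {seq: future.get_item(seq) for seq in sequences}


# Possible usage, given a connected session:
# embeddings = embed_sequences(session.embedding.prot_seq, [b"MSKGEELFTG"])
```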
- fit_gp(assay, properties, reduction, name=None, description=None, **kwargs)#
Fit a GP on assay using this embedding model and hyperparameters.
- Parameters:
assay (AssayMetadata | str) – Assay to fit GP on.
properties (list[str]) – Properties in the assay to fit the GP on.
reduction (str) – Type of embedding reduction to use for computing features. A protein language model (PLM) must use a reduction.
name (str | None)
description (str | None)
- Return type:
- fit_svd(sequences=None, assay=None, n_components=1024, reduction=None, **kwargs)#
Fit an SVD on the embedding results of this model.
This function will create an SVDModel based on the embeddings from this model as well as the hyperparameters specified in the args.
- Parameters:
sequences (List[bytes]) – sequences to fit the SVD on
n_components (int) – number of components in the SVD. Determines the output shape.
reduction (ReductionType | None) – embeddings reduction to use (e.g. mean)
assay (AssayDataset | None)
- Return type:
SVDModel
- fit_umap(sequences=None, assay=None, n_components=2, reduction=ReductionType.MEAN, **kwargs)#
Fit a UMAP on the embedding results of this model.
This function will create a UMAPModel based on the embeddings from this model as well as the hyperparameters specified in the args.
- Parameters:
sequences (list[bytes] | None) – Optional sequences to fit UMAP with. Either use sequences or assay. sequences is preferred.
assay (AssayDataset | None) – Optional assay containing sequences to fit UMAP with. Either use sequences or assay. Ignored if sequences are provided.
n_components (int) – Number of components in UMAP fit. Will determine output shapes. Defaults to 2.
reduction (ReductionType | None) – Embeddings reduction to use (e.g. mean). Defaults to MEAN.
- Return type:
UMAPModel
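The precedence rule for fit_umap (sequences win, assay is ignored when sequences are given) can be made explicit before calling the API. This is a minimal sketch: choose_fit_inputs and fit_umap_2d are hypothetical helpers, raising ValueError when neither input is given is a choice made here, and the object returned by fit_umap is assumed to expose wait_until_done() as UMAPModel does.

```python
def choose_fit_inputs(sequences=None, assay=None):
    """Return the keyword arguments to pass to fit_umap.

    Mirrors the documented rule: sequences take precedence, and assay is
    ignored when sequences are provided.
    """
    if sequences is not None:
        return {"sequences": sequences}
    if assay is not None:
        return {"assay": assay}
    raise ValueError("provide either sequences or assay")


def fit_umap_2d(model, sequences=None, assay=None):
    """Fit a 2-component UMAP and wait for the fit job to finish."""
    umap = model.fit_umap(n_components=2, **choose_fit_inputs(sequences, assay))
    umap.wait_until_done()
    return umap
```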
- get_metadata()#
Get model metadata for this model.
- Return type:
ModelMetadata
- classmethod get_model()#
- logits(sequences, **kwargs)#
Logit embeddings for sequences using this model.
- Parameters:
sequences (List[bytes]) – sequences to compute logits for
- Return type:
EmbeddingResultFuture
- property metadata#
- class openprotein.embeddings.ESMModel[source]#
Class providing inference endpoints for Facebook’s ESM protein language models.
Examples
View specific model details (including supported tokens) with the ? operator, which is available in IPython and Jupyter.
import openprotein
session = openprotein.connect(username="user", password="password")
session.embedding.esm2_t12_35M_UR50D?
- __init__(session, model_id, metadata=None)#
- Parameters:
session (APISession)
model_id (str)
metadata (ModelMetadata | None)
- attn(sequences, **kwargs)#
Attention embeddings for sequences using this model.
- Parameters:
sequences (List[bytes]) – sequences to compute attention embeddings for
- Return type:
EmbeddingResultFuture
- classmethod create(session, model_id, default=None)#
Create and return an instance of the appropriate Future class based on the job type.
- Returns:
An instance of the appropriate Future class.
- Parameters:
session (APISession)
model_id (str)
default (type[EmbeddingModel] | None)
- embed(sequences, reduction=ReductionType.MEAN, **kwargs)#
Embed sequences using this model.
- Parameters:
sequences (List[bytes]) – sequences to embed
reduction (ReductionType | None, Optional) – embeddings reduction to use (e.g. mean)
- Return type:
EmbeddingResultFuture
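To make the reduction parameter concrete, the sketch below shows what a mean reduction does to per-residue embeddings, assuming ReductionType.MEAN averages the per-residue vectors over the sequence length. It is a plain-Python illustration of the idea, not the service's implementation, and mean_reduce is a hypothetical name.

```python
def mean_reduce(per_residue):
    """Average L per-residue vectors of dimension D into a single D-vector."""
    if not per_residue:
        raise ValueError("need at least one residue vector")
    length = len(per_residue)
    dim = len(per_residue[0])
    # Average each embedding dimension across all residues.
    return [sum(vec[d] for vec in per_residue) / length for d in range(dim)]


# Two residues with 2-dimensional embeddings collapse to one 2-vector.
reduced = mean_reduce([[1.0, 2.0], [3.0, 4.0]])  # [2.0, 3.0]
```

With a reduction, every sequence yields a vector of the same size regardless of its length, which is what downstream models such as a GP need as features.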
- fit_gp(assay, properties, reduction, name=None, description=None, **kwargs)#
Fit a GP on assay using this embedding model and hyperparameters.
- Parameters:
assay (AssayMetadata | str) – Assay to fit GP on.
properties (list[str]) – Properties in the assay to fit the GP on.
reduction (str) – Type of embedding reduction to use for computing features. A protein language model (PLM) must use a reduction.
name (str | None)
description (str | None)
- Return type:
- fit_svd(sequences=None, assay=None, n_components=1024, reduction=None, **kwargs)#
Fit an SVD on the embedding results of this model.
This function will create an SVDModel based on the embeddings from this model as well as the hyperparameters specified in the args.
- Parameters:
sequences (List[bytes]) – sequences to fit the SVD on
n_components (int) – number of components in the SVD. Determines the output shape.
reduction (ReductionType | None) – embeddings reduction to use (e.g. mean)
assay (AssayDataset | None)
- Return type:
SVDModel
- fit_umap(sequences=None, assay=None, n_components=2, reduction=ReductionType.MEAN, **kwargs)#
Fit a UMAP on the embedding results of this model.
This function will create a UMAPModel based on the embeddings from this model as well as the hyperparameters specified in the args.
- Parameters:
sequences (list[bytes] | None) – Optional sequences to fit UMAP with. Either use sequences or assay. sequences is preferred.
assay (AssayDataset | None) – Optional assay containing sequences to fit UMAP with. Either use sequences or assay. Ignored if sequences are provided.
n_components (int) – Number of components in UMAP fit. Will determine output shapes. Defaults to 2.
reduction (ReductionType | None) – Embeddings reduction to use (e.g. mean). Defaults to MEAN.
- Return type:
UMAPModel
- get_metadata()#
Get model metadata for this model.
- Return type:
ModelMetadata
- logits(sequences, **kwargs)#
Logit embeddings for sequences using this model.
- Parameters:
sequences (List[bytes]) – sequences to compute logits for
- Return type:
EmbeddingResultFuture
- class openprotein.embeddings.PoETModel[source]#
Class for OpenProtein’s foundation model PoET. Note that PoET functions depend on a prompt supplied via the align endpoints.
Examples
View specific model details (including supported tokens) with the ? operator, which is available in IPython and Jupyter.
import openprotein
session = openprotein.connect(username="user", password="password")
session.embedding.poet.<embeddings_method>
- __init__(session, model_id, metadata=None)[source]#
- Parameters:
session (APISession)
model_id (str)
metadata (ModelMetadata | None)
- embed(prompt, sequences, reduction=ReductionType.MEAN)[source]#
Embed sequences using this model.
- Parameters:
prompt (Union[str, PromptFuture]) – prompt from an align workflow to condition the PoET model
sequences (list[bytes]) – Sequences to embed.
reduction (str) – embeddings reduction to use (e.g. mean)
- Returns:
A future object that returns the embeddings of the submitted sequences.
- Return type:
EmbeddingResultFuture
- logits(prompt, sequences)[source]#
Logit embeddings for sequences using this model.
- Parameters:
prompt (Union[str, PromptFuture]) – prompt from an align workflow to condition the PoET model
sequences (list[bytes]) – Sequences to analyse.
- Returns:
A future object that returns the logits of the submitted sequences.
- Return type:
EmbeddingResultFuture
- score(prompt, sequences)[source]#
Score query sequences using the specified prompt.
- Parameters:
prompt (str | PromptFuture) – Prompt, prompt_id, or prompt future from an align workflow used to condition the PoET model
sequences (list[bytes]) – Sequences to score.
- Returns:
A future object that returns the scores of the submitted sequences.
- Return type:
- single_site(prompt, sequence)[source]#
Score all single substitutions of the query sequence using the specified prompt.
- Parameters:
prompt (str | PromptFuture) – Prompt, prompt_id, or prompt future from an align workflow used to condition the PoET model
sequence (bytes) – Sequence to analyse.
- Returns:
A future object that returns the scores of all single-substitution variants of the sequence.
- Return type:
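To illustrate what "all single substitutions" covers, the sketch below enumerates every variant of a sequence that differs at exactly one position. It assumes an alphabet of the 20 standard amino acids. The tokens a given model actually supports are listed in its model details, and single_substitutions is a hypothetical helper, not the service's implementation.

```python
AMINO_ACIDS = b"ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids (assumed alphabet)


def single_substitutions(sequence: bytes):
    """Yield (position, new_residue, variant) for every single substitution."""
    for pos, original in enumerate(sequence):
        for residue in AMINO_ACIDS:
            if residue == original:
                continue  # skip the wild-type residue at this position
            variant = sequence[:pos] + bytes([residue]) + sequence[pos + 1:]
            yield pos, chr(residue), variant


# A sequence of length L has 19 * L single-substitution variants.
variants = list(single_substitutions(b"MK"))
```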
- generate(prompt, num_samples=100, temperature=1.0, topk=None, topp=None, max_length=1000, seed=None)[source]#
Generate protein sequences conditioned on a prompt.
- Parameters:
prompt (Union[str, PromptFuture]) – prompt from an align workflow to condition the PoET model
num_samples (int, optional) – The number of samples to generate, by default 100.
temperature (float, optional) – The temperature for sampling. Higher values produce more random outputs, by default 1.0.
topk (int, optional) – The number of top-k residues to consider during sampling, by default None.
topp (float, optional) – The cumulative probability threshold for top-p sampling, by default None.
max_length (int, optional) – The maximum length of generated proteins, by default 1000.
seed (int, optional) – Seed for random number generation, by default a random number.
- Returns:
A future object representing the status and information about the generation job.
- Return type:
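The temperature, topk and topp parameters correspond to standard sampling controls. The sketch below shows how these three controls are commonly combined when sampling one token from a list of logits. It is a generic illustration that assumes temperature > 0. The source does not confirm that PoET applies them in this order, and sample_token is a hypothetical name.

```python
import math
import random


def sample_token(logits, temperature=1.0, topk=None, topp=None, rng=None):
    """Sample one index from logits using temperature, top-k and top-p filtering."""
    rng = rng or random.Random()
    # Temperature scaling: values above 1 flatten the distribution, below 1 sharpen it.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Rank candidates from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if topk is not None:
        ranked = ranked[:topk]  # keep only the k most likely candidates
    if topp is not None:
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= topp:
                break  # smallest prefix whose probability mass reaches topp
        ranked = kept
    # random.choices renormalizes the remaining weights.
    return rng.choices(ranked, weights=[probs[i] for i in ranked], k=1)[0]
```

Passing a seeded random.Random instance as rng makes the draw reproducible, which parallels the role of the seed parameter.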
- fit_svd(prompt, sequences=None, assay=None, n_components=1024, reduction=None)[source]#
Fit an SVD on the embedding results of PoET.
This function will create an SVDModel based on the embeddings from this model as well as the hyperparameters specified in the args.
- Parameters:
prompt (Union[str, PromptFuture]) – prompt from an align workflow to condition the PoET model
sequences (List[bytes]) – sequences to fit the SVD on
n_components (int) – number of components in the SVD. Determines the output shape.
reduction (str) – embeddings reduction to use (e.g. mean)
assay (AssayDataset | None)
- Returns:
A future that represents the fitted SVD model.
- Return type:
SVDModel
- fit_umap(prompt, sequences=None, assay=None, n_components=2, reduction=ReductionType.MEAN)[source]#
Fit a UMAP on assay using PoET and hyperparameters.
This function will create a UMAP based on the embeddings from this PoET model as well as the hyperparameters specified in the args.
- Parameters:
prompt (Union[str, PromptFuture]) – prompt from an align workflow to condition the PoET model
sequences (list[bytes] | None) – Optional sequences to fit UMAP with. Either use sequences or assay. sequences is preferred.
assay (AssayDataset | None) – Optional assay containing sequences to fit UMAP with. Either use sequences or assay. Ignored if sequences are provided.
n_components (int) – Number of components in UMAP fit. Will determine output shapes. Defaults to 2.
reduction (ReductionType | None) – Embeddings reduction to use (e.g. mean). Defaults to MEAN.
- Returns:
A future that represents the fitted UMAP model.
- Return type:
UMAPModel
- fit_gp(prompt, assay, properties, **kwargs)[source]#
Fit a GP on assay using this embedding model and hyperparameters.
- Parameters:
prompt (str | PromptFuture) – prompt from an align workflow to condition the PoET model
assay (AssayMetadata | str) – Assay to fit GP on.
properties (list[str]) – Properties in the assay to fit the GP on.
reduction (str) – Type of embedding reduction to use for computing features. A protein language model (PLM) must use a reduction.
- Returns:
A future that represents the trained predictor model.
- Return type:
- classmethod create(session, model_id, default=None)#
Create and return an instance of the appropriate Future class based on the job type.
- Returns:
An instance of the appropriate Future class.
- Parameters:
session (APISession)
model_id (str)
default (type[EmbeddingModel] | None)
- get_metadata()#
Get model metadata for this model.
- Return type:
ModelMetadata
- class openprotein.embeddings.SVDModel[source]#
Class providing embedding endpoint for SVD models. Also allows retrieving embeddings of sequences used to fit the SVD with get. Implements a Future to allow waiting for a fit job.
- __init__(session, job=None, metadata=None)[source]#
Initialize from the result of either a job GET request or an SVD metadata GET request.
- Parameters:
session (APISession)
job (FitJob | None)
metadata (SVDMetadata | None)
- get_inputs()[source]#
Get the sequences used for the SVD job.
- Returns:
list of sequences
- Return type:
List[bytes]
- embed(sequences, **kwargs)[source]#
Use this SVD model to get reduced embeddings from input sequences.
- Parameters:
sequences (List[bytes]) – List of protein sequences.
- Returns:
Class for further job manipulation.
- Return type:
EmbeddingResultFuture
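Fitting an SVD and then embedding new sequences with it chains several calls documented on this page. The sketch below assumes that fit_svd returns an SVDModel, which the fit_svd description implies. The helper name svd_reduce is hypothetical.

```python
def svd_reduce(model, fit_sequences, query_sequences, n_components=1024):
    """Fit an SVD on fit_sequences, then return reduced embeddings for query_sequences."""
    svd = model.fit_svd(sequences=fit_sequences, n_components=n_components)
    svd.wait_until_done()  # wait for the fit job before embedding
    future = svd.embed(sequences=query_sequences)
    future.wait_until_done()
    return {seq: future.get_item(seq) for seq in query_sequences}


# Possible usage, given a connected session and lists of byte sequences:
# reduced = svd_reduce(session.embedding.prot_seq, train_seqs, test_seqs, n_components=64)
```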
- fit_umap(sequences=None, assay=None, n_components=2, **kwargs)[source]#
Fit a UMAP on the embedding results of this model.
This function will create a UMAPModel based on the embeddings from this model as well as the hyperparameters specified in the args.
- Parameters:
sequences (List[bytes]) – sequences to fit the UMAP on
n_components (int) – number of components in the UMAP. Determines the output shape.
reduction (ReductionType | None) – embeddings reduction to use (e.g. mean)
assay (AssayDataset | None)
- Return type:
UMAPModel
- fit_gp(assay, properties, name=None, description=None, **kwargs)[source]#
Fit a GP on assay using this embedding model and hyperparameters.
- Parameters:
assay (AssayMetadata | str) – Assay to fit GP on.
properties (list[str]) – Properties in the assay to fit the gp on.
name (str | None)
description (str | None)
- Return type:
- cancelled()#
Check if job is cancelled.
- Return type:
bool
- done()#
Check if job is complete
- Return type:
bool
- refresh()#
Refresh job status.
- wait(interval=5, timeout=None, verbose=False)#
Wait for job to complete, then fetch results.
- Parameters:
interval (int, optional) – time between polling. Defaults to config.POLLING_INTERVAL.
timeout (int, optional) – max time to wait. Defaults to None.
verbose (bool, optional) – verbosity flag. Defaults to False.
- Returns:
results of job
- Return type:
results
- wait_until_done(interval=5, timeout=None, verbose=False)#
Wait for job to complete. Do not fetch results (unlike wait())
- Parameters:
interval (int, optional) – time between polling. Defaults to config.POLLING_INTERVAL.
timeout (int, optional) – max time to wait. Defaults to None.
verbose (bool, optional) – verbosity flag. Defaults to False.
- Returns:
results of job
- Return type:
results
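The interval and timeout parameters describe a polling loop. The sketch below shows one way such a loop could be written using only the documented refresh() and done() methods. It is not the library's implementation: poll_until_done is a hypothetical helper, and returning False on timeout is a choice made here.

```python
import time


def poll_until_done(future, interval=5, timeout=None):
    """Refresh a job until done() is true; return False if the timeout elapses first."""
    start = time.monotonic()
    while True:
        future.refresh()  # update the job status
        if future.done():
            return True
        if timeout is not None and time.monotonic() - start >= timeout:
            return False
        time.sleep(interval)  # wait before polling again
```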
- class openprotein.embeddings.UMAPModel[source]#
Class providing embedding endpoint for UMAP models. Also allows retrieving embeddings of sequences used to fit the UMAP with get. Implements a Future to allow waiting for a fit job.
- __init__(session, job=None, metadata=None)[source]#
Initialize from the result of either a job GET request or a UMAP metadata GET request.
- Parameters:
session (APISession)
job (FitJob | None)
metadata (UMAPMetadata | None)
- get_inputs()[source]#
Get the sequences used for the UMAP job.
- Returns:
list of sequences
- Return type:
List[bytes]
- embed(sequences, **kwargs)[source]#
Use this UMAP model to get reduced embeddings from input sequences.
- Parameters:
sequences (List[bytes]) – List of protein sequences.
- Returns:
Class for further job manipulation.
- Return type:
EmbeddingResultFuture
- cancelled()#
Check if job is cancelled.
- Return type:
bool
- done()#
Check if job is complete
- Return type:
bool
- refresh()#
Refresh job status.
- wait(interval=5, timeout=None, verbose=False)#
Wait for job to complete, then fetch results.
- Parameters:
interval (int, optional) – time between polling. Defaults to config.POLLING_INTERVAL.
timeout (int, optional) – max time to wait. Defaults to None.
verbose (bool, optional) – verbosity flag. Defaults to False.
- Returns:
results of job
- Return type:
results
- wait_until_done(interval=5, timeout=None, verbose=False)#
Wait for job to complete. Do not fetch results (unlike wait())
- Parameters:
interval (int, optional) – time between polling. Defaults to config.POLLING_INTERVAL.
timeout (int, optional) – max time to wait. Defaults to None.
verbose (bool, optional) – verbosity flag. Defaults to False.
- Returns:
results of job
- Return type:
results
Results#
- class openprotein.embeddings.EmbeddingsResultFuture[source]#
Future for manipulating results for embeddings-related requests.
- __init__(session, job, sequences=None, max_workers=10)[source]#
Retrieve results from asynchronous, mapped endpoints.
Use max_workers > 0 to enable concurrent retrieval of multiple pages.
- Parameters:
session (APISession)
job (EmbeddingsJob | AttnJob | LogitsJob)
sequences (list[bytes] | list[str] | None)
max_workers (int)
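The max_workers setting controls concurrent retrieval of result pages. The sketch below illustrates the general pattern with a thread pool from the standard library. fetch_page is a hypothetical callable that maps a page index to a list of items, and nothing here reflects how the library pages its results internally.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all_pages(fetch_page, num_pages, max_workers=10):
    """Fetch pages 0..num_pages-1 and return their items in page order."""
    if max_workers > 0:
        # pool.map runs fetches concurrently but yields results in input order.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            pages = list(pool.map(fetch_page, range(num_pages)))
    else:
        # Sequential fallback when concurrency is disabled.
        pages = [fetch_page(i) for i in range(num_pages)]
    return [item for page in pages for item in page]
```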
- get_item(sequence)[source]#
Get embedding results for specified sequence.
- Parameters:
sequence (bytes) – sequence to fetch results for
- Returns:
embeddings
- Return type:
np.ndarray
- cancelled()#
Check if job is cancelled.
- Return type:
bool
- done()#
Check if job is complete
- Return type:
bool
- refresh()#
Refresh job status.
- stream()#
Retrieve results for this job as a stream.
- wait(interval=5, timeout=None, verbose=False)#
Wait for job to complete, then fetch results.
- Parameters:
interval (int, optional) – time between polling. Defaults to config.POLLING_INTERVAL.
timeout (int, optional) – max time to wait. Defaults to None.
verbose (bool, optional) – verbosity flag. Defaults to False.
- Returns:
results of job
- Return type:
results
- wait_until_done(interval=5, timeout=None, verbose=False)#
Wait for job to complete. Do not fetch results (unlike wait())
- Parameters:
interval (int, optional) – time between polling. Defaults to config.POLLING_INTERVAL.
timeout (int, optional) – max time to wait. Defaults to None.
verbose (bool, optional) – verbosity flag. Defaults to False.
- Returns:
results of job
- Return type:
results
- class openprotein.embeddings.EmbeddingsScoreFuture[source]#
Future for manipulating results for embeddings score-related requests.
- __init__(session, job, sequences=None)[source]#
- Parameters:
session (APISession)
job (ScoreJob | ScoreSingleSiteJob | GenerateJob)
sequences (list[bytes] | list[str] | None)
- cancelled()#
Check if job is cancelled.
- Return type:
bool
- done()#
Check if job is complete
- Return type:
bool
- get(verbose=False, **kwargs)#
Return the results from this job.
- Parameters:
verbose (bool)
- Return type:
list
- refresh()#
Refresh job status.
- wait(interval=5, timeout=None, verbose=False)#
Wait for job to complete, then fetch results.
- Parameters:
interval (int, optional) – time between polling. Defaults to config.POLLING_INTERVAL.
timeout (int, optional) – max time to wait. Defaults to None.
verbose (bool, optional) – verbosity flag. Defaults to False.
- Returns:
results of job
- Return type:
results
- wait_until_done(interval=5, timeout=None, verbose=False)#
Wait for job to complete. Do not fetch results (unlike wait())
- Parameters:
interval (int, optional) – time between polling. Defaults to config.POLLING_INTERVAL.
timeout (int, optional) – max time to wait. Defaults to None.
verbose (bool, optional) – verbosity flag. Defaults to False.
- Returns:
results of job
- Return type:
results
- class openprotein.embeddings.EmbeddingsGenerateFuture[source]#
Future for manipulating results for embeddings generate-related requests.
- __init__(session, job, sequences=None)#
- Parameters:
session (APISession)
job (ScoreJob | ScoreSingleSiteJob | GenerateJob)
sequences (list[bytes] | list[str] | None)
- cancelled()#
Check if job is cancelled.
- Return type:
bool
- done()#
Check if job is complete
- Return type:
bool
- get(verbose=False, **kwargs)#
Return the results from this job.
- Parameters:
verbose (bool)
- Return type:
list
- refresh()#
Refresh job status.
- stream()#
Return the results from this job as a generator.
- Return type:
Generator
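Because stream() returns a generator, results can be consumed lazily. The sketch below takes only the first few items with itertools.islice. The helper name first_n is hypothetical, and the shape of each streamed item is not specified on this page.

```python
from itertools import islice


def first_n(future, n):
    """Return up to n items from future.stream() without consuming the rest."""
    return list(islice(future.stream(), n))


# Possible usage with a generation job future:
# preview = first_n(generate_future, 5)
```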
- wait(interval=5, timeout=None, verbose=False)#
Wait for job to complete, then fetch results.
- Parameters:
interval (int, optional) – time between polling. Defaults to config.POLLING_INTERVAL.
timeout (int, optional) – max time to wait. Defaults to None.
verbose (bool, optional) – verbosity flag. Defaults to False.
- Returns:
results of job
- Return type:
results
- wait_until_done(interval=5, timeout=None, verbose=False)#
Wait for job to complete. Do not fetch results (unlike wait())
- Parameters:
interval (int, optional) – time between polling. Defaults to config.POLLING_INTERVAL.
timeout (int, optional) – max time to wait. Defaults to None.
verbose (bool, optional) – verbosity flag. Defaults to False.
- Returns:
results of job
- Return type:
results