Model metadata#

This tutorial briefly covers how to view the models available for embedding workflows and inspect their metadata.

What you need before getting started#

Import the json module:

[ ]:
import json

Viewing models and metadata#

List the available models:

[ ]:
session.embedding.list_models()
[esm1b_t33_650M_UR50S,
 esm1v_t33_650M_UR90S_1,
 esm1v_t33_650M_UR90S_2,
 esm1v_t33_650M_UR90S_3,
 esm1v_t33_650M_UR90S_4,
 esm1v_t33_650M_UR90S_5,
 esm2_t12_35M_UR50D,
 esm2_t30_150M_UR50D,
 esm2_t33_650M_UR50D,
 esm2_t36_3B_UR50D,
 esm2_t6_8M_UR50D,
 poet,
 prot-seq,
 rotaprot-large-uniref50w,
 rotaprot-large-uniref90-ft]
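The returned list can be filtered in plain Python. A minimal sketch using the model IDs printed above as strings (the actual objects returned by `list_models()` may wrap these IDs; treating them as plain strings here is an assumption for illustration):

```python
# A subset of the model IDs listed above, as plain strings
model_ids = [
    "esm1b_t33_650M_UR50S",
    "esm2_t12_35M_UR50D",
    "esm2_t33_650M_UR50D",
    "poet",
    "prot-seq",
]

# Select only the ESM-2 family by ID prefix
esm2_ids = [m for m in model_ids if m.startswith("esm2_")]
print(esm2_ids)  # ['esm2_t12_35M_UR50D', 'esm2_t33_650M_UR50D']
```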

Fetch metadata for more information, including publications and DOIs where available:

[ ]:
esm_model = session.embedding.list_models()[0]
for k, v in esm_model.metadata.dict()['description'].items():
    print(f"{k}: {v}")
citation_title: Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences
doi: 10.1101/622803
summary: ESM1b model with 650M parameters
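The `json` module imported at the start is handy for pretty-printing nested metadata instead of looping over keys. A sketch using the description fields shown above as a literal dictionary:

```python
import json

# The description dictionary printed above, reproduced as a literal
description = {
    "citation_title": "Biological Structure and Function Emerge from Scaling "
                      "Unsupervised Learning to 250 Million Protein Sequences",
    "doi": "10.1101/622803",
    "summary": "ESM1b model with 650M parameters",
}

# json.dumps with indent gives a readable view of nested metadata
print(json.dumps(description, indent=2))
```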

Alternatively, view a model's built-in documentation directly:

[ ]:
session.embedding.prot_seq?
Masked protein language model (~300M parameters) trained on UniRef50 with contact and secondary structure prediction as secondary objectives. Uses random Fourier position embeddings and FlashAttention to enable fast inference.
         max_sequence_length = 1024
         supported outputs = ['attn', 'embed', 'logits']
         supported tokens = ['A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V', 'X', 'O', 'U', 'B', 'Z']

View the full metadata, including supported tokens and output types:

[ ]:
for k, v in esm_model.metadata.dict().items():
    if k == "token_descriptions":
        continue
    print(f"{k}: {v}")
model_id: esm1b_t33_650M_UR50S
description: {'citation_title': 'Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences', 'doi': '10.1101/622803', 'summary': 'ESM1b model with 650M parameters'}
max_sequence_length: 1022
dimension: 1280
output_types: ['attn', 'embed', 'logits']
input_tokens: ['A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V', 'X', 'O', 'U', 'B', 'Z']
output_tokens: ['<cls>', '<pad>', '<eos>', '<unk>', 'L', 'A', 'G', 'V', 'S', 'E', 'R', 'T', 'I', 'D', 'P', 'K', 'Q', 'N', 'F', 'Y', 'M', 'H', 'W', 'C', '<null_0>', 'B', 'U', 'Z', 'O', '.', '-', '<null_1>', 'X']
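These fields are useful for validating inputs client-side before submitting a job. A minimal sketch, assuming the `input_tokens` and `max_sequence_length` values printed above (the `validate_sequence` helper is hypothetical, not part of the API):

```python
# Values taken from the esm1b_t33_650M_UR50S metadata above
INPUT_TOKENS = set("ARNDCQEGHILKMFPSTWYVXOUBZ")
MAX_SEQUENCE_LENGTH = 1022

def validate_sequence(seq: str) -> bool:
    """Check a sequence's length and alphabet against the model metadata."""
    return len(seq) <= MAX_SEQUENCE_LENGTH and set(seq) <= INPUT_TOKENS

print(validate_sequence("MKTAYIAKQR"))   # every residue is a supported token
print(validate_sequence("MKTA-YIAKQR"))  # '-' is not an input token
```

Rejecting invalid sequences locally avoids a round trip to the server for requests that would fail anyway.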

Next steps#

For more information, visit the Embeddings API reference.