Glossary#

General#

Term

Definition

Project

A project houses data for your protein of interest. You can upload multiple datasets, but each project should be for a different protein of interest.

Library

A library is a repository for your designed sequence variants. By saving your predicted results as a library, you can easily reference previously created variants.

Fitness

Similar to evolutionary fitness, but in this case refers to how much better in performance a protein is in terms of the design objective or natural evolutionary landscape.

Datasets#

Term

Definition

Identifier

This column contains the name of your sequence.

Property

This column contains the measured values of your functions of interest. You can input more than 1 property.

Sequence

This column contains the protein sequences of your variants.

Mutant

This column denotes your mutation codes. If you use mutation codes (for example, A20T where alanine at site 20 is substituted with threonine), you will be required to input the parent sequence.

Parent

This is the base sequence that your dataset can be based upon if you are using mutational codes. It is usually your wild type sequence.

Visualizations#

Term

Definition

UMAP

Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualisation similarly to t-SNE, and also for general nonlinear dimension reduction. Read more here.

Histogram

A representation of the distribution of the data, showing the number of observations that fall within each bin.

Joint plots

A way of understanding the relationship between two variables and the distribution of individuals of each variable.

Models#

Term

Definition

Gaussian

A nonparametric supervised machine learning method that models and predicts functions. It defines a distribution over functions, allowing for flexibility in capturing uncertainty and making predictions.

Bayesian regression

Bayesian regression is a statistical method that combines prior knowledge with observed data. It provides a framework for making predictions while accounting for uncertainty.

Embeddings

Embeddings are a way to represent the meaning of sequences as a list of numbers.

Dimension reduction

Transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally close to its intrinsic dimension.

Cross-validation

Cross-validation is a resampling method that uses different portions of the data to test and train a model on different iterations.

Training curve

Plots the optimal value of a model’s loss function for a training set against this loss function evaluated on a validation data set using the same parameters that produced the optimal function.

Spearman rho

A nonparametric measure of rank correlation. It assesses how well the relationship between two variables can be described using a monotonic function.

Loss

The training loss indicates how well the model is fitting the training data.

Design#

Term

Definition

Design objectives

This is your desired output based on your functions of interest.

Target value

The value you wish the output should achieve.

Direction

The direction you wish the output should achieve.

Weight

This helps the model to generate results based on the priorities of the output.

Iterations

The number of iterations is equivalent to the number of batches needed to complete one epoch during model training. So if a dataset includes 1,000 images split into mini-batches of 100 images, it will take 10 iterations to complete a single epoch.

Log-likelihood score

Indicates the probability that a predicted sequence will achieve the specified design objective.

Mean

The mean of the predicted experimental value that a sequence would achieve.

Standard deviation

Indicates the amount of variation in the mean value and therefore provides a level of confidence in the predicted value.

PoET#

Term

Definition

Prompt

A prompt is an input that instructs a Generative AI model to generate the desired response. PoET uses a prompt made up of a set of related sequences. These sequences may be homologs, family members, or some other grouping that represents your protein of interest.

Query

This refers to the list of sequences you wish to score using our PoET model.

Multiple sequence alignment

The sequence alignment of three or more biological sequences, usually DNA, RNA or protein. It can identify the evolutionary relationships and common patterns between genes and proteins.

Sampling methods#

Term

Definition

Ensemble

In an ensemble, multiple prompts are sampled independently from the MSA following the prompt sampling parameters. Each sequence is scored using each prompt individually, and the final score is the average score across prompts. Ensembling improves the accuracy of the sequence scores, but takes longer to run.

Neighbors

Sample more diverse, less redundant sequences from the MSA by sampling each sequence with weight inversely proportional to its number of homologs in the MSA.

Homology level

Determines the identity at which two sequences are considered redundant. For example, when the homology level is set to 0.8, it means that a sequence will be considered to belong to the same group if it has more than 80% sequence identity.

Random seed

Determines the state of the random number generator for random sampling. If it set to a specific number, the algorithm will sample the same set of sequences each time.

Sampling parameters#

Term

Definition

Top-p

Top-p (also known as nucleus sampling) limits sampling to amino acids with sum likelihoods which do not exceed the specified value. As a result, the list of possible amino acids is dynamically selected based on the sum of likelihood scores achieving the top-p value. For example, setting a top-p of 0.8 limits sampling to an 80% or greater probability. Other amino acids are ignored.

Top-k

Top-k limits sampling to a shortlist of amino acids, where the top-k parameter sets the size of the shortlist. For example, setting top-k to 5 means the model samples from the 5 likeliest amino acids at each position. Other amino acids are ignored.

Temperature

Temperature is a number used to tune the degree of randomness. A lower temperature means less randomness; a temperature of 0 will always yield the same output.