# Glossary

## General

| Term | Definition |
| --- | --- |
| Project | A project houses the data for your protein of interest. You can upload multiple datasets to a project, but each project should cover a different protein of interest. |
| Library | A library is a repository for your designed sequence variants. By saving your predicted results as a library, you can easily reference previously created variants. |
| Fitness | Analogous to evolutionary fitness. Here it refers to how well a protein performs with respect to the design objective or the natural evolutionary landscape. |

## Datasets

| Term | Definition |
| --- | --- |
| Identifier | This column contains the name of each sequence. |
| Property | This column contains the measured values of your functions of interest. You can input more than one property. |
| Sequence | This column contains the protein sequences of your variants. |
| Mutant | This column contains your mutation codes. If you use mutation codes (for example, A20T, where alanine at site 20 is substituted with threonine), you must also provide the parent sequence. |
| Parent sequence | The base sequence to which the mutation codes in your dataset are applied. It is usually your wild-type sequence. |
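To illustrate how a mutation code relates to the parent sequence, here is a minimal sketch that applies substitution codes such as A20T to a parent sequence. The function name `apply_mutations` and the example sequence are hypothetical, and the sketch assumes single-residue substitutions with 1-based site numbering; it is not the platform's own parser.

```python
import re


def apply_mutations(parent: str, codes: list[str]) -> str:
    """Apply substitution codes such as 'A20T' to a parent sequence.

    Sites are assumed to be 1-based, as in the A20T example.
    """
    residues = list(parent)
    for code in codes:
        match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", code)
        if match is None:
            raise ValueError(f"Unrecognized mutation code: {code!r}")
        original, site, substitute = match.group(1), int(match.group(2)), match.group(3)
        if not 1 <= site <= len(residues):
            raise ValueError(f"Site {site} is outside the parent sequence")
        # The first letter of the code must agree with the current residue.
        if residues[site - 1] != original:
            raise ValueError(
                f"{code}: expected {original} at site {site}, found {residues[site - 1]}"
            )
        residues[site - 1] = substitute
    return "".join(residues)


parent = "MKTAYIAKQR"  # hypothetical parent sequence
variant = apply_mutations(parent, ["A4G", "K8R"])
print(variant)  # MKTGYIARQR
```

Checking the first letter of each code against the parent sequence catches the common case where the codes were numbered against a different parent.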

## Visualizations

| Term | Definition |
| --- | --- |
| UMAP | Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualization, similarly to t-SNE, and also for general nonlinear dimension reduction. |
| Histogram | A representation of the distribution of the data, showing the number of observations that fall within each bin. |
| Joint plots | A way of understanding the relationship between two variables together with the individual distribution of each variable. |
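As a small illustration of the histogram definition, the sketch below counts observations per equal-width bin in plain Python. The function name `histogram` and the example values are illustrative only, and the equal-width binning is an assumption rather than a description of how the platform draws its plots.

```python
def histogram(values, n_bins):
    """Count observations in n_bins equal-width bins spanning min..max."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("All values are identical; cannot form bins")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for value in values:
        # The maximum value is placed in the last bin rather than a new one.
        index = min(int((value - lo) / width), n_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


edges, counts = histogram([1, 2, 2, 3, 5, 8, 9, 10], n_bins=3)
print(edges)   # [1.0, 4.0, 7.0, 10.0]
print(counts)  # [4, 1, 3]
```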

## Models

| Term | Definition |
| --- | --- |
| Gaussian process | A nonparametric supervised machine learning method that models and predicts functions. It defines a distribution over functions, which lets it capture uncertainty alongside its predictions. |
| Bayesian regression | A statistical method that combines prior knowledge with observed data. It provides a framework for making predictions while accounting for uncertainty. |
| Embeddings | A way to represent the meaning of sequences as a list of numbers. |
| Dimension reduction | Transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally close to its intrinsic dimension. |
| Cross-validation | A resampling method that trains and tests a model on different portions of the data across iterations. |
| Training curve | Plots the model's loss on the training set against the loss on a validation set, with the validation loss evaluated using the same parameters that produced the training loss. |
| Spearman rho | A nonparametric measure of rank correlation. It assesses how well the relationship between two variables can be described using a monotonic function. |
| Loss | A measure of how far the model's predictions are from the data. The training loss indicates how well the model is fitting the training data. |
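To make the Spearman rho entry concrete, the following sketch computes it as the Pearson correlation of the ranks, with tied values sharing their average rank. The function names and example data are illustrative; in practice a library routine such as `scipy.stats.spearmanr` would normally be used instead.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("Rank correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


measured = [0.1, 0.4, 0.35, 0.8]    # hypothetical measured property values
predicted = [1.0, 3.0, 10.0, 50.0]  # hypothetical model predictions
print(spearman_rho(measured, predicted))
```

Because only the ranks matter, any monotonic but nonlinear relationship (for example `y = x ** 3`) still gives a rho of 1.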

## Design

| Term | Definition |
| --- | --- |
| Design objectives | Your desired output, based on your functions of interest. |
| Target value | The value you want the output to achieve. |
| Direction | The direction in which you want the output to move relative to the target value, for example above or below it. |
| Weight | Sets the relative priority of each output, so that the model generates results according to those priorities. |
| Iterations | The number of iterations is equivalent to the number of batches needed to complete one epoch during model training. For example, if a dataset includes 1,000 images split into mini-batches of 100 images, it will take 10 iterations to complete a single epoch. |
| Log-likelihood score or score | Indicates the probability that a predicted sequence will achieve the specified design objective. |
| Mean | The mean of the predicted experimental value that a sequence would achieve. |
| Standard deviation | Indicates the amount of variation around the predicted mean and therefore provides a level of confidence in the predicted value. |
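One way to see how the mean, standard deviation, target value, direction, and weight could combine into a log-likelihood score is sketched below. It rests on assumptions that are not stated in this glossary: that each prediction is treated as a normal distribution, that direction is written as `">"` or `"<"`, and that objectives are combined as a weighted sum of log-probabilities. All names and numbers are hypothetical, and the platform's actual scoring may differ.

```python
import math


def probability_of_success(mean, std, target, direction=">"):
    """P(value is on the desired side of target), assuming a normal prediction."""
    if std <= 0:
        raise ValueError("std must be positive")
    z = (target - mean) / std
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # P(value <= target)
    if direction == ">":
        return 1.0 - cdf
    if direction == "<":
        return cdf
    raise ValueError("direction must be '>' or '<'")


def log_likelihood_score(predictions, objectives):
    """Weighted sum of log-probabilities (an assumed combination rule)."""
    score = 0.0
    for name, objective in objectives.items():
        mean, std = predictions[name]
        p = probability_of_success(mean, std, objective["target"], objective["direction"])
        # Guard against log(0) when the probability underflows.
        score += objective["weight"] * math.log(max(p, 1e-300))
    return score


# Hypothetical objectives and predicted (mean, standard deviation) pairs.
objectives = {
    "activity": {"target": 1.0, "direction": ">", "weight": 1.0},
    "aggregation": {"target": 0.5, "direction": "<", "weight": 0.5},
}
predictions = {"activity": (2.0, 1.0), "aggregation": (0.5, 0.2)}
score = log_likelihood_score(predictions, objectives)
print(round(score, 3))
```

Under these assumptions a score closer to 0 means a higher probability of meeting every objective, and a larger standard deviation pulls the probability toward 0.5 because the prediction is less certain.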

## PoET

| Term | Definition |
| --- | --- |
| Prompt | A prompt is an input that instructs a Generative AI model to generate the desired response. PoET uses a prompt made up of a set of related sequences. These sequences may be homologs, family members, or some other grouping that represents your protein of interest. |
| Query | This refers to the list of sequences you wish to score using our PoET model. |
| Multiple sequence alignment | The sequence alignment of three or more biological sequences, usually DNA, RNA or protein. It can identify the evolutionary relationships and common patterns between genes and proteins. |

## Sampling methods

| Term | Definition |
| --- | --- |
| Ensemble | In an ensemble, multiple prompts are sampled independently from the MSA following the prompt sampling parameters. Each sequence is scored using each prompt individually, and the final score is the average score across prompts. Ensembling improves the accuracy of the sequence scores, but takes longer to run. |
| Neighbors | Samples more diverse, less redundant sequences from the MSA by sampling each sequence with a weight inversely proportional to its number of homologs in the MSA. |
| Homology level | Determines the sequence identity at which two sequences are considered redundant. For example, when the homology level is set to 0.8, two sequences are considered to belong to the same group if they share more than 80% sequence identity. |
| Random seed | Determines the state of the random number generator for random sampling. If it is set to a specific number, the algorithm will sample the same set of sequences each time. |
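The interaction between the Neighbors method, the homology level, and the random seed can be sketched as follows. This is a simplified illustration, not PoET's implementation: identity is computed as the fraction of matching columns between aligned sequences of equal length, each sequence counts itself as one homolog, sampling is done with replacement, and all names and sequences are hypothetical.

```python
import random


def sequence_identity(a, b):
    """Fraction of identical positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Aligned sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def neighbor_weights(msa, homology_level=0.8):
    """Weight each sequence by 1 / (number of homologs, counting itself)."""
    weights = []
    for i, seq in enumerate(msa):
        homologs = sum(
            1
            for j, other in enumerate(msa)
            if i == j or sequence_identity(seq, other) > homology_level
        )
        weights.append(1.0 / homologs)
    return weights


def sample_prompt(msa, n_sequences, homology_level=0.8, seed=None):
    """Draw sequences with neighbor weights; a fixed seed repeats the draw."""
    rng = random.Random(seed)
    weights = neighbor_weights(msa, homology_level)
    return rng.choices(msa, weights=weights, k=n_sequences)


msa = [
    "MKTAYIAKQR",
    "MKTAYIAKQK",  # 90% identical to the first sequence
    "MKTAYIAKQH",  # 90% identical to the first two sequences
    "MLSGWVPRDE",  # unrelated to the others
]
print(neighbor_weights(msa))  # three redundant sequences get 1/3 each, the last gets 1.0
print(sample_prompt(msa, n_sequences=3, seed=42))
```

The three near-identical sequences together carry the same total weight as the single distinct one, which is how the weighting reduces redundancy in the sampled prompt.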

## Sampling parameters

| Term | Definition |
| --- | --- |
| Top-p | Top-p (also known as nucleus sampling) limits sampling to the smallest set of most likely amino acids whose cumulative probability reaches the specified value. The list of candidate amino acids is therefore selected dynamically at each position. For example, setting a top-p of 0.8 limits sampling to the most likely amino acids that together account for at least 80% of the probability. Other amino acids are ignored. |
| Top-k | Top-k limits sampling to a shortlist of amino acids, where the top-k parameter sets the size of the shortlist. For example, setting top-k to 5 means the model samples from the 5 likeliest amino acids at each position. Other amino acids are ignored. |
| Temperature | Temperature is a number used to tune the degree of randomness. A lower temperature means less randomness, and a higher temperature means more; a temperature of 0 will always yield the same output. |
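The three sampling parameters can be illustrated on a single position's amino acid distribution. The sketch below is a generic implementation of temperature scaling, top-k, and top-p filtering, not PoET's own code; the function names and the example probabilities are hypothetical, and a temperature of 0 is handled as always picking the likeliest amino acid.

```python
import math
import random


def apply_temperature(probs, temperature):
    """Rescale probabilities as p ** (1 / T); T = 0 keeps only the likeliest item."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        return {max(probs, key=probs.get): 1.0}
    scaled = {aa: math.exp(math.log(p) / temperature) for aa, p in probs.items() if p > 0}
    total = sum(scaled.values())
    return {aa: value / total for aa, value in scaled.items()}


def top_k_filter(probs, k):
    """Keep the k likeliest amino acids and renormalize."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {aa: p / total for aa, p in kept}


def top_p_filter(probs, p):
    """Keep the smallest set of likeliest amino acids whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for aa, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept.append((aa, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {aa: prob / total for aa, prob in kept}


def sample_amino_acid(probs, rng):
    """Draw one amino acid from a (filtered) distribution."""
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]


# Hypothetical distribution over amino acids at one position.
probs = {"A": 0.5, "G": 0.3, "S": 0.1, "T": 0.06, "V": 0.04}
print(sorted(top_k_filter(probs, 2)))     # ['A', 'G']
print(sorted(top_p_filter(probs, 0.85)))  # ['A', 'G', 'S']
print(apply_temperature(probs, 0))        # {'A': 1.0}
print(sample_amino_acid(top_k_filter(probs, 2), random.Random(0)))
```

Top-k always keeps a fixed number of candidates, whereas top-p keeps more candidates when the distribution is flat and fewer when one amino acid dominates.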