Designing Sequences

This tutorial shows you how to design protein sequences based on your chosen objectives. Use this tool to customize design objectives for individual projects and design a variant library based on your data.

What you need before getting started

Please note that this tool requires experimental data in the form of an uploaded dataset. For instructions on how to upload your data, see Uploading your data.

If you don’t yet have experimental data, start with our PoET tools.

If you run into any challenges or have questions while getting started, please contact OpenProtein.AI support.

About design criteria

OpenProtein.AI uses Bayesian property predictors, which output a distribution over possible values of the property for a variant. Design goals are defined by a property’s value being greater than or less than a target value.

The mean, which is the most likely value of the property for the sequence as predicted by the model, is what you would get from a typical regression model. The models also output a standard deviation indicating how uncertain they are about that value. Based on this distribution, we can calculate the probability that a sequence variant meets a design criterion and express it as a log-likelihood score.
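As an illustration of how such a score could be computed, the sketch below assumes the predicted distribution is Gaussian with the given mean and standard deviation. The function name and arguments are hypothetical and are not part of the OpenProtein.AI tool, whose exact scoring may differ.

```python
import math

def log_prob_meets_criterion(mean, std, target, direction="greater"):
    """Log-probability that a Gaussian-distributed property meets a threshold.

    Assumption: the model's predictive distribution is Normal(mean, std).
    """
    if std <= 0:
        raise ValueError("std must be positive")
    z = (target - mean) / std
    if direction == "greater":
        # P(X > target) = 1 - Phi(z) = 0.5 * erfc(z / sqrt(2))
        p = 0.5 * math.erfc(z / math.sqrt(2))
    elif direction == "less":
        # P(X < target) = Phi(z) = 0.5 * erfc(-z / sqrt(2))
        p = 0.5 * math.erfc(-z / math.sqrt(2))
    else:
        raise ValueError("direction must be 'greater' or 'less'")
    # Guard against floating-point underflow for very unlikely outcomes.
    return math.log(p) if p > 0 else float("-inf")

# A variant predicted at 2.0 +/- 1.0 with a goal of "greater than 1.0".
score = log_prob_meets_criterion(mean=2.0, std=1.0, target=1.0)
print(round(score, 3))
```

A score near 0 means the variant is very likely to meet the criterion, and more negative scores mean it is less likely.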

Creating your custom designs

Navigate to your dataset, then select Create design. This opens a new window where you can edit the design name and define your parameters and design objectives.

Explore tradeoffs between the number of mutations in each variant and predicted properties by selecting Use number of mutations criteria and setting criteria for multiple properties.

We recommend keeping the default of 25 steps in the Number of design steps field.

When your criteria are set, select Generate design. Your designs are accessible through the navigation panel on the left.

Interpreting your results

Your results will display in a UMAP, a histogram, a joint plot, and a design result table.

Variant sequences generated by the design process are overlaid on the Uniform Manifold Approximation and Projection (UMAP). Hover over a point in the UMAP to view the sequence and score.

Your designed sequences are colored by their predicted property, and you can adjust the display in several ways.

  • Use the color options panel to adjust the color settings and change which property the new points are colored by.
  • Highlight specific sequences by clicking on individual points, or hold Shift while dragging your cursor to select multiple points.
  • Select the eye icon to the left of sequences in the table to toggle the visibility of a sequence on and off in the UMAP.

The Histogram tab compares the expected property distributions of your designs against those of your original library, for all of the properties. Hover over the graph to view the property, source, binned value, and frequency.

The Joint plot tab shows the distribution of each individual variable and helps you understand the relationship between two variables.

The Design result table displays all designed and input sequences. If you want to compare your generated results against a benchmark, select Add a reference sequence, enter a parent sequence or sequence of interest, and select Add. You can update or delete reference sequences by selecting Edit reference sequence, choosing your desired action, then selecting Update. Using a reference sequence lets you view mutation sites to better understand specific substitutions present in your sequence libraries and designs.
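To make the idea of mutation sites concrete, here is a minimal sketch of comparing a variant against a reference sequence outside the tool. It assumes both sequences have the same length (substitutions only, no insertions or deletions). The function name and example sequences are hypothetical.

```python
def list_substitutions(reference, variant):
    """Return substitutions such as 'T3S' (reference residue, 1-based position, variant residue)."""
    if len(reference) != len(variant):
        raise ValueError("sequences must have the same length")
    return [
        f"{ref_aa}{position}{var_aa}"
        for position, (ref_aa, var_aa) in enumerate(zip(reference, variant), start=1)
        if ref_aa != var_aa
    ]

# Made-up sequences for illustration only.
substitutions = list_substitutions("MKTAYIAK", "MKSAYIAR")
print(substitutions)       # ['T3S', 'K8R']
print(len(substitutions))  # number of mutations: 2
```

The length of the returned list is the number of mutations relative to the reference.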

Select Advanced filters to only display unique sequences.

You can also sort the table by the predicted property or log-likelihood score. Use the filter icon to set simple filters for each column.

Select Generate more sequences at the bottom of your design results page to generate more candidate sequences.

Saving your sequences

Save your results as a library within your project by selecting Save as library. Add a library name and description, then select Save. Access your libraries from the left-hand navigation panel.

You can also export your results as a CSV file by selecting Export, then selecting which rows to include.
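Once exported, the CSV can be filtered and sorted offline in much the same way as the design result table. The sketch below uses only the Python standard library. The column names (sequence, predicted_activity, log_likelihood) and the example rows are assumptions made for illustration, so check the header of your own export before adapting it.

```python
import csv
import io

# Hypothetical export contents; the real column names may differ.
EXAMPLE_CSV = """sequence,predicted_activity,log_likelihood
MKTAYIAK,1.20,-0.35
MKSAYIAK,1.45,-0.10
MKSAYIAK,1.45,-0.10
MKTAYIAR,0.90,-1.80
"""

def top_unique_designs(csv_file, score_column="log_likelihood", n=10):
    """Drop duplicate sequences, then return the n rows with the highest score."""
    seen = set()
    unique_rows = []
    for row in csv.DictReader(csv_file):
        if row["sequence"] in seen:
            continue  # keep only the first occurrence of each sequence
        seen.add(row["sequence"])
        unique_rows.append(row)
    unique_rows.sort(key=lambda r: float(r[score_column]), reverse=True)
    return unique_rows[:n]

top = top_unique_designs(io.StringIO(EXAMPLE_CSV), n=2)
for row in top:
    print(row["sequence"], row["log_likelihood"])
```

To read a real export, replace the io.StringIO object with an open file handle, for example open("designs.csv", newline=""), where the file name is a placeholder.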

Fine-tuning your designs

Sometimes the designed sequences are highly diverse, bear little resemblance to the input data, or have predicted property values far from what you’re looking for. This can happen if you set ambitious design objectives, for example with target values far from those of the sequences in your dataset. If the model cannot find any sequences that are predicted to achieve your design objectives, it will explore sequences with maximum uncertainty, since a wide predicted distribution gives those sequences the best remaining chance of meeting the objective.

Setting a less ambitious design target can help the system find sequences that are likely to be incremental improvements, and will produce designs whose predicted values are closer to your target values.
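The effect of an overly ambitious target can be seen in a small numerical sketch. It assumes Gaussian predictive distributions and two made-up candidates: one predicted confidently near the values in the dataset, and one predicted slightly lower but with high uncertainty. The numbers and names are illustrative only.

```python
import math

def prob_above(mean, std, target):
    """P(X > target) for X ~ Normal(mean, std)."""
    return 0.5 * math.erfc((target - mean) / (std * math.sqrt(2)))

# Hypothetical candidates: (predicted mean, predicted standard deviation).
confident = (1.0, 0.1)
uncertain = (0.8, 1.0)

for target in (0.9, 3.0):
    p_confident = prob_above(*confident, target)
    p_uncertain = prob_above(*uncertain, target)
    favored = "confident" if p_confident > p_uncertain else "uncertain"
    print(f"target={target}: confident={p_confident:.3g}, "
          f"uncertain={p_uncertain:.3g}, favored={favored}")
```

With the modest target (0.9) the confident candidate scores higher, while with the distant target (3.0) only the uncertain candidate retains a non-negligible probability, which is why lowering the target tends to pull designs back toward well-characterized sequences.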

Using your designed sequences

Explore your sequence’s 3D structure with the Structure Prediction tool, or use Substitution Analysis to evaluate the single substitution variants of your sequence.