openprotein.molecules#

These data primitives provide a unified interface for working with our platform, whether for structure prediction, binder design, or inverse folding.

Protein#

Protein is a fundamental primitive for working with proteins on the platform. A Protein can be uploaded to the platform as a Query for use with models such as ProteinMPNNModel (e.g. for inverse folding), and for easy reuse.

class openprotein.molecules.Protein(sequence, name=None)[source]#

Represents a protein with an optional name.

This class supports partial or complete information: users may create a Protein with only a sequence, only a structure, or both. The class ensures that all provided fields have consistent residue-level lengths and provides convenient methods for indexing, masking, and structural comparisons.

Conventions:

Missing or unknown residues in the sequence are denoted by b"X".
Missing structural data (coordinates or pLDDT) are represented by NaN.
Residue indices are 1-indexed for user-facing methods whose names end in at, e.g. .at and mask_sequence_at.

Examples

Create a Protein from sequence only: Protein(sequence="ACDEFGHIK")
Create a Protein from sequence and name: Protein(sequence="ACDEFGHIK", name="my_protein")

property msa: str | MSAFuture | None | Type[NullMSA]#: A reference identifier to the MSA associated with this protein.

property templates: Sequence[Protein | Complex | Template]#: A list of templates for guiding the structure prediction of this protein.

at(positions)[source]#

Return a new Protein object containing residues at given 1-indexed positions.

mask_sequence()[source]#

Mask entire sequence.

mask_sequence_at(positions)[source]#

Mask sequence at given 1-indexed positions.

mask_sequence_except_at(positions)[source]#

Mask sequence at all positions except the given 1-indexed positions.
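The 1-indexed convention used by the sequence-masking methods can be illustrated with a standalone sketch. The helpers mask_at and mask_except_at below are hypothetical and operate on plain bytes; they are not the library's implementation and only assume that masked residues become b"X", as stated in the conventions above.

```python
MASK = ord("X")  # byte value of the masked/unknown residue symbol


def mask_at(sequence: bytes, positions: list[int]) -> bytes:
    """Return a copy of `sequence` with the given 1-indexed positions masked."""
    out = bytearray(sequence)
    for pos in positions:
        if not 1 <= pos <= len(sequence):
            raise IndexError(f"position {pos} is outside 1..{len(sequence)}")
        out[pos - 1] = MASK  # 1-indexed position -> 0-indexed offset
    return bytes(out)


def mask_except_at(sequence: bytes, positions: list[int]) -> bytes:
    """Mask every position except the given 1-indexed positions."""
    keep = set(positions)
    to_mask = [p for p in range(1, len(sequence) + 1) if p not in keep]
    return mask_at(sequence, to_mask)


masked = mask_at(b"ACDEFGHIK", [1, 3])
kept = mask_except_at(b"ACDEFGHIK", [1, 3])
print(masked)  # b'XCXEFGHIK'
print(kept)    # b'AXDXXXXXX'
```

Position 1 refers to the first residue, so passing 0 is an error in this sketch rather than a reference to the start of the sequence.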

mask_structure(side_chain_only=False)[source]#

Mask entire structure.

mask_structure_at(positions, side_chain_only=False)[source]#

Mask structure at given 1-indexed positions.

mask_structure_except_at(positions, side_chain_only=False)[source]#

Mask structure at all positions except the given 1-indexed positions.

get_structure_mask()[source]#

Computes the structure mask of the protein. The structure mask is a boolean array indicating, at each position, whether the structure is undefined at that position.
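As a rough illustration of how a structure mask relates to NaN coordinates, the sketch below derives one from nested Python lists. The function structure_mask is hypothetical, it returns a list rather than an array, and it assumes a residue counts as undefined only when every coordinate of every atom is NaN; the library may use a different rule.

```python
import math

NAN = float("nan")


def structure_mask(coords):
    """Return True for each residue whose structure is undefined.

    `coords[i]` is a list of (x, y, z) atom coordinates for residue i.
    Assumption: a residue is undefined when all of its coordinates are NaN.
    """
    return [
        all(math.isnan(v) for atom in residue for v in atom)
        for residue in coords
    ]


coords = [
    [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],  # residue 1: fully resolved
    [(NAN, NAN, NAN), (NAN, NAN, NAN)],  # residue 2: missing
    [(3.0, 1.0, 0.0), (NAN, NAN, NAN)],  # residue 3: partially resolved
]
mask = structure_mask(coords)
has_structure = not all(mask)  # structure known at any position
print(mask)           # [False, True, False]
print(has_structure)  # True
```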

property has_structure: bool#: Whether or not the structure is known at any position in the protein.

rmsd(tgt: Protein, backbone_only: bool | str | Sequence[str] = False, return_transform: Literal[False] = False) → float[source]#

rmsd(tgt: Protein, backbone_only: bool | str | Sequence[str] = False, return_transform: Literal[True] = True) → tuple[float, ndarray[tuple[Any, ...], dtype[floating]], ndarray[tuple[Any, ...], dtype[floating]]]

Compute the root-mean-square deviation (RMSD) between this Protein and a target Protein.

Only atoms that are present (i.e., not NaN) in both structures are included in the calculation.

Parameters:

tgt – The target Protein to compare against.
backbone_only – Specifies which atoms to include in the RMSD calculation.
- If False (default), all atom types are included.
- If True, only backbone atoms ("N", "CA", "C") are included.
- If a string, it must be a single atom type (e.g., "CA").
- If a sequence of strings, it must be a non-empty list of atom types (e.g., ["CA", "CB", "O"]). All specified atom types must be valid.
return_transform – If True, returns both the rmsd and the transformation that should be applied to tgt to superimpose it onto this Protein. If False (default), returns only the rmsd value.

Returns:

If return_transform is False (default): the RMSD value.
If return_transform is True: a tuple containing the RMSD value, the rotation matrix, and the translation vector.

Return type:

float, or tuple (float, np.ndarray, np.ndarray) if return_transform is True

Notes

This method assumes that sequences of self and tgt are already aligned.
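The atom-selection and NaN-handling rules described above can be sketched in plain Python. This is a simplified illustration, not the library's implementation: rmsd_without_fit and select_atom_names are hypothetical names, residues are modelled as dictionaries of atom name to coordinates, and the superposition step (the rotation and translation that return_transform exposes) is deliberately left out, so the coordinates are compared as given.

```python
import math

BACKBONE = ("N", "CA", "C")
NAN = float("nan")


def select_atom_names(backbone_only, available):
    """Translate the `backbone_only` argument into a list of atom names."""
    if backbone_only is False:
        return list(available)
    if backbone_only is True:
        return list(BACKBONE)
    if isinstance(backbone_only, str):
        return [backbone_only]
    names = list(backbone_only)
    if not names:
        raise ValueError("backbone_only must name at least one atom type")
    return names


def rmsd_without_fit(src, tgt, backbone_only=False):
    """RMSD over atoms present (non-NaN) in both residue-aligned structures."""
    if len(src) != len(tgt):
        raise ValueError("src and tgt must have the same number of residues")
    total, count = 0.0, 0
    for res_a, res_b in zip(src, tgt):
        for name in select_atom_names(backbone_only, res_a):
            p, q = res_a.get(name), res_b.get(name)
            if p is None or q is None:
                continue  # atom type absent from one of the residues
            if any(math.isnan(v) for v in (*p, *q)):
                continue  # atom missing in at least one structure
            total += sum((a - b) ** 2 for a, b in zip(p, q))
            count += 1
    if count == 0:
        raise ValueError("no atoms are present in both structures")
    return math.sqrt(total / count)


src = [
    {"N": (0.0, 0.0, 0.0), "CA": (1.0, 0.0, 0.0), "CB": (1.0, 1.0, 0.0)},
    {"N": (2.0, 0.0, 0.0), "CA": (3.0, 0.0, 0.0), "CB": (NAN, NAN, NAN)},
]
tgt = [
    {"N": (0.0, 0.0, 0.0), "CA": (1.0, 0.0, 2.0), "CB": (1.0, 1.0, 0.0)},
    {"N": (2.0, 0.0, 0.0), "CA": (3.0, 0.0, 2.0), "CB": (5.0, 5.0, 5.0)},
]
all_atoms = rmsd_without_fit(src, tgt)
backbone = rmsd_without_fit(src, tgt, backbone_only=True)
ca_only = rmsd_without_fit(src, tgt, backbone_only="CA")
print(ca_only)  # 2.0
```

In this toy data only the CA atoms differ (by 2 along z), and the second CB is skipped because it is NaN in src, so the CA-only value is larger than the all-atom value.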

to_string(format='cif')[source]#

Serialize this Protein to a string. Note that format="pdb" may not serialize all aspects of this object, so format="cif", the default, is preferred.

static from_expr(expr, name=None)[source]#

Create a Protein from a sequence expression.

A sequence expression allows you to define protein sequences using a concise notation that mixes fixed sequences, design regions, and length ranges.

Useful for creating a design Query.

Parameters:

expr (str | int) – Sequence expression string or integer.
- Fixed sequences: "ACGT" (literal amino acids)
- Design regions: "6" or 6 (any 6 amino acids)
- Length ranges: "3..5" (between 3 and 5 amino acids)
- Combined: "AAAA6C3..5" (AAAA + 6 design + C + 3-5 design)
name (str | None) – Optional name for the protein

Returns:

Protein object with the parsed sequence

Return type:

Protein

Examples

>>> from openprotein.molecules import Protein
>>> # Fixed sequence with 6 flexible positions and fixed end
>>> Protein.from_expr("MKLL6VVAA").sequence
b'MKLLXXXXXXVVAA'

>>> # Design region of any 15 amino acids
>>> Protein.from_expr(15).sequence
b'XXXXXXXXXXXXXXX'

>>> # Variable length region between 10-20 residues
>>> Protein.from_expr("10..20").sequence
b'XXXXXXXXXX??????????'
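To make the notation concrete, here is a minimal standalone sketch of how such an expression could be expanded. The function expand_expr is hypothetical and is not the library's parser; its behavior is inferred only from the examples above, in which a design region of length n becomes n "X" characters and a range "a..b" becomes a "X" characters followed by b - a "?" characters.

```python
import re

# One token is either a run of letters (fixed residues) or an integer,
# optionally followed by "..<integer>" (a length range).
_TOKEN = re.compile(r"(?P<fixed>[A-Za-z]+)|(?P<lo>\d+)(?:\.\.(?P<hi>\d+))?")


def expand_expr(expr) -> bytes:
    """Expand a sequence expression (str or int) into a byte string."""
    text = str(expr)
    parts = []
    pos = 0
    while pos < len(text):
        m = _TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if m.group("fixed") is not None:
            parts.append(m.group("fixed"))  # literal amino acids
        else:
            lo = int(m.group("lo"))
            hi = int(m.group("hi")) if m.group("hi") is not None else lo
            if hi < lo:
                raise ValueError(f"invalid range {lo}..{hi}")
            # Required positions as "X", optional positions as "?".
            parts.append("X" * lo + "?" * (hi - lo))
        pos = m.end()
    return "".join(parts).encode("ascii")


print(expand_expr("MKLL6VVAA"))   # b'MKLLXXXXXXVVAA'
print(expand_expr(15))            # b'XXXXXXXXXXXXXXX'
print(expand_expr("10..20"))      # b'XXXXXXXXXX??????????'
print(expand_expr("AAAA6C3..5"))  # b'AAAAXXXXXXCXXX??'
```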

static from_filepath(path, chain_id, use_bfactor_as_plddt=None, model_idx=0, verbose=True)[source]#

Create a Protein from a structure file.

If the structure file has multiple conformers, the first conformer is always used.

Parameters:

path (Path | str) – path to structure file (e.g. pdb or cif file)
chain_id (str) – id of the chain in the structure file to use
use_bfactor_as_plddt (bool | None) – whether or not to use bfactors as pLDDTs. If None, this parameter will be determined based on heuristics. These heuristics may change over time.
model_idx (int) – index of the model in the structure file to use
verbose (bool) – whether or not to print debugging information such as oddities in the structure e.g. missing atoms

formatted(include=('sequence',), width=60, value_maps=None)[source]#

Format the sequence and/or additional feature tracks aligned and wrapped to a specific width.

Complex#

Complex describes a molecular complex. A Complex can be uploaded to the platform as a Query for use with models such as RFdiffusionModel (e.g. for multi-chain binder design), and for easy reuse.

class openprotein.molecules.Complex(chains=None, name=None)[source]#

property templates: Sequence[Protein | Complex | Template]#: A list of templates for guiding the structure prediction of this molecular complex.

to_string(format='cif')[source]#

Serialize this Complex to a string. Note that format="pdb" may not serialize all aspects of this object, so format="cif", the default, is preferred.

Structure#

Structure describes a collection of Complex instances. These are typically created when parsing structure files (e.g., CIF, PDB) that contain multiple models of the same molecular complex, such as NMR ensembles or computational predictions with multiple conformations.

class openprotein.molecules.Structure(complexes, name=None)[source]#

Represents a collection of Complex instances.

to_string(format='cif')[source]#

Serialize this Structure to a string. Note that format="pdb" may not serialize all aspects of this object, so format="cif", the default, is preferred.

static from_pdb_id(pdb_id, verbose=True)[source]#

Creates a Structure instance by downloading data from the RCSB PDB.

This method performs an HTTP GET request to the RCSB web server to fetch the structure file (in CIF format) associated with the given PDB ID.

Parameters:

pdb_id (str) – The 4-character PDB identifier (e.g. "1XYZ").
verbose (bool, optional) – Whether to print warnings to stdout. Defaults to True.

Returns:

A new instance containing the parsed structure data.

Return type:

Structure

Raises:

requests.HTTPError – If the PDB ID is invalid, the server is unreachable, or the request returns a 404/500 status code.
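The download step can be pictured with a short standard-library sketch. Everything here is an assumption for illustration: rcsb_cif_url and fetch_cif are hypothetical helpers, the URL pattern is the public RCSB file-download endpoint and may not be the one the library uses, and urllib raises urllib.error.HTTPError rather than the requests.HTTPError documented above.

```python
from urllib.request import urlopen


def rcsb_cif_url(pdb_id: str) -> str:
    """Build a download URL for the CIF file of a 4-character PDB ID."""
    pdb_id = pdb_id.strip().upper()
    if len(pdb_id) != 4 or not pdb_id.isalnum():
        raise ValueError(f"expected a 4-character PDB ID, got {pdb_id!r}")
    # Assumed endpoint: the RCSB file-download service.
    return f"https://files.rcsb.org/download/{pdb_id}.cif"


def fetch_cif(pdb_id: str, timeout: float = 30.0) -> str:
    """Fetch the CIF text with an HTTP GET (requires network access)."""
    with urlopen(rcsb_cif_url(pdb_id), timeout=timeout) as response:
        return response.read().decode("utf-8")


url = rcsb_cif_url("1xyz")
print(url)  # https://files.rcsb.org/download/1XYZ.cif
```

Validating the identifier before issuing the request separates malformed input from a genuine server-side failure.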

Ligand#

Ligand represents a ligand described either by a SMILES string or by a CCD identifier. Ligands are intended to be used as part of a Complex.

class openprotein.molecules.Ligand(ccd=None, smiles=None, _structure_block=None)[source]#

Represents a ligand with optional Chemical Component Dictionary (CCD) identifier and SMILES string.

Requires either a CCD identifier or SMILES string.

ccd#

The CCD identifier for the ligand.

Type: str | None

smiles#

The SMILES representation of the ligand.

Type: str | None
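The "either a CCD identifier or SMILES string" requirement can be sketched with a small dataclass. LigandSketch is a hypothetical stand-in, not the library class, and it assumes the rule is "at least one of the two"; whether supplying both is allowed is not stated above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LigandSketch:
    """Hypothetical stand-in illustrating the ccd/smiles requirement."""

    ccd: Optional[str] = None
    smiles: Optional[str] = None

    def __post_init__(self):
        # Assumption: at least one identifier must be provided.
        if self.ccd is None and self.smiles is None:
            raise ValueError("provide a CCD identifier or a SMILES string")


atp = LigandSketch(ccd="ATP")         # identified by CCD code
ethanol = LigandSketch(smiles="CCO")  # identified by SMILES
```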

DNA#

DNA represents a DNA chain. These are intended to be used as part of a Complex.

class openprotein.molecules.DNA(sequence, cyclic=False, _structure_block=None)[source]#

Represents a DNA sequence.

sequence#

The nucleotide sequence of the DNA.

Type: str

RNA#

RNA represents an RNA chain. These are intended to be used as part of a Complex.

class openprotein.molecules.RNA(sequence, cyclic=False, _structure_block=None)[source]#

Represents an RNA sequence.

sequence#

The nucleotide sequence of the RNA.

Type: str

Template#

Template describes a template that can be used for constraining fold jobs.

class openprotein.molecules.Template(template, mapping=None)[source]#

A structural template used to guide the folding of a target chain or complex.

This class wraps a structural source (Protein or Complex) and defines how it should map to the target(s).

template#

The structural object to be used as a template. Must contain structural data (coordinates).

Type: Protein | Complex

mapping#

The rule for assigning this template to the target.
- Mapping[str, str]: Explicitly maps {template_chain_id: target_chain_id}.
- str: Apply this template to a specific target_chain_id. (If template is a Complex, a selection algorithm is used to pick the best source chain.)
- None: Automatic assignment. The folding algorithm will determine which chain(s) this template applies to.

Type: Mapping[str, str] | str | None

validate_for_target(target)[source]#

Ensures this Template is compatible with a specific target Molecule.

Parameters:

target (Protein | Complex) – The Protein or Complex that is being folded.

Raises:

ValueError – If this Template is invalid, or if chain IDs referenced in mapping do not exist in the target.
TypeError – If the template/target combination is structurally incompatible.
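The chain-ID check described under ValueError can be sketched independently of the library. The function check_mapping below is hypothetical: it handles the three documented forms of mapping and only verifies that referenced target chain IDs exist. Its TypeError covers an unsupported mapping type, which is narrower than the structural incompatibility the real method reports.

```python
from collections.abc import Mapping


def check_mapping(mapping, target_chain_ids):
    """Raise ValueError if `mapping` references a chain the target lacks."""
    if mapping is None:
        return  # automatic assignment: nothing to verify up front
    if isinstance(mapping, str):
        referenced = {mapping}  # a single target_chain_id
    elif isinstance(mapping, Mapping):
        referenced = set(mapping.values())  # {template_chain_id: target_chain_id}
    else:
        raise TypeError(f"unsupported mapping type: {type(mapping).__name__}")
    missing = sorted(referenced - set(target_chain_ids))
    if missing:
        raise ValueError(f"target has no chain(s): {', '.join(missing)}")


target_chains = ["A", "B"]
check_mapping(None, target_chains)        # automatic assignment
check_mapping("A", target_chains)         # apply template to chain A
check_mapping({"X": "B"}, target_chains)  # template chain X -> target chain B
```

Running a check like this before submitting a job surfaces a mistyped chain ID locally instead of after the request has been sent.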