openprotein.molecules#
These data primitives provide a unified interface for working with our platform, whether for structure prediction, binder design, or inverse folding.
Protein#
Protein is a fundamental primitive for working with proteins on the platform. A Protein can be uploaded to our platform as a Query for use with models such as ProteinMPNNModel (e.g. for inverse folding), and for easy reuse.
- class openprotein.molecules.Protein(sequence, name=None)[source]#
Represents a protein with an optional name.
This class supports partial or complete information: users may create a Protein with only a sequence, only a structure, or both. The class ensures that all provided fields have consistent residue-level lengths and provides convenient methods for indexing, masking, and structural comparisons.
- Conventions:
Missing or unknown residues in the sequence are denoted by b"X".
Missing structural data (coordinates or pLDDT) are represented by NaN.
Residue indices are 1-indexed for user-facing methods whose names end in at, e.g. .at, mask_sequence_at.
Examples
- Create a Protein from sequence only:
Protein(sequence="ACDEFGHIK")
- Create a Protein from sequence and name:
Protein(sequence="ACDEFGHIK", name="my_protein")
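The 1-indexed convention and the b"X" mask token described above can be illustrated with a minimal standalone sketch. It does not use the openprotein package: the two functions below are hypothetical stand-ins that operate on plain bytes, and the actual Protein methods may differ in return type and error handling.

```python
def mask_sequence_at(sequence: bytes, positions: list[int]) -> bytes:
    """Return a copy of `sequence` with the given 1-indexed positions set to b"X"."""
    masked = bytearray(sequence)
    for pos in positions:
        if not 1 <= pos <= len(sequence):
            raise IndexError(f"position {pos} is outside 1..{len(sequence)}")
        # Convert the 1-indexed residue position to a 0-indexed byte offset.
        masked[pos - 1] = ord("X")
    return bytes(masked)


def mask_sequence_except_at(sequence: bytes, positions: list[int]) -> bytes:
    """Mask every residue except those at the given 1-indexed positions."""
    keep = set(positions)
    to_mask = [i for i in range(1, len(sequence) + 1) if i not in keep]
    return mask_sequence_at(sequence, to_mask)


seq = b"ACDEFGHIK"
print(mask_sequence_at(seq, [1, 3]))         # b'XCXEFGHIK'
print(mask_sequence_except_at(seq, [1, 3]))  # b'AXDXXXXXX'
```

Position 1 refers to the first residue (A here), so no manual offset arithmetic is needed when positions are taken from a structure file or a paper.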
- property msa: str | MSAFuture | None | Type[NullMSA]#
A reference identifier to the MSA associated with this protein.
- at(positions)[source]#
Return a new Protein object containing residues at given 1-indexed positions.
- mask_sequence_except_at(positions)[source]#
Mask sequence at all positions except the given 1-indexed positions.
- mask_structure_at(positions, side_chain_only=False)[source]#
Mask structure at given 1-indexed positions.
- mask_structure_except_at(positions, side_chain_only=False)[source]#
Mask structure at all positions except the given 1-indexed positions.
- get_structure_mask()[source]#
Computes the structure mask of the protein. The structure mask is a boolean array indicating, at each position, whether the structure is undefined at that position.
- property has_structure: bool#
Whether or not the structure is known at any position in the protein.
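As a rough illustration of how a structure mask relates to NaN coordinates, the sketch below computes one over plain Python lists. The data layout (one list of atom coordinate tuples per residue) and the rule that a residue counts as undefined only when every atom is NaN are assumptions made for the example, not a description of the library's internals.

```python
import math

NAN = float("nan")


def get_structure_mask(coords):
    """Return True at each residue whose atom coordinates are all NaN."""
    return [
        all(math.isnan(value) for atom in residue for value in atom)
        for residue in coords
    ]


def has_structure(coords):
    """Return True if at least one residue has some defined coordinates."""
    return not all(get_structure_mask(coords))


# Three residues with two atoms each; residue 2 has no coordinates at all,
# residue 3 is missing only its second atom.
coords = [
    [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    [(NAN, NAN, NAN), (NAN, NAN, NAN)],
    [(3.8, 0.0, 0.0), (NAN, NAN, NAN)],
]

print(get_structure_mask(coords))  # [False, True, False]
print(has_structure(coords))       # True
```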
- rmsd(tgt: Protein, backbone_only: bool | str | Sequence[str] = False, return_transform: Literal[False] = False) float[source]#
- rmsd(tgt: Protein, backbone_only: bool | str | Sequence[str] = False, return_transform: Literal[True] = True) tuple[float, ndarray[tuple[Any, ...], dtype[floating]], ndarray[tuple[Any, ...], dtype[floating]]]
Compute the root-mean-square deviation (RMSD) between this Protein and a target Protein.
Only atoms that are present (i.e., not NaN) in both structures are included in the calculation.
- Parameters:
tgt – The target Protein to compare against.
backbone_only – Specifies which atoms to include in the RMSD calculation.
- If False (default), all atom types are included.
- If True, only backbone atoms ("N", "CA", "C") are included.
- If a string, it must be a single atom type (e.g., "CA").
- If a sequence of strings, it must be a non-empty list of atom types (e.g., ["CA", "CB", "O"]). All specified atom types must be valid.
return_transform – If True, returns both the rmsd and the transformation that should be applied to tgt to superimpose it onto this Protein. If False (default), returns only the rmsd value.
- Returns:
If return_transform is False (default): the RMSD value (float).
If return_transform is True: a tuple (float, np.ndarray, np.ndarray) containing the RMSD value, the rotation matrix, and the translation vector.
- Return type:
float | tuple[float, np.ndarray, np.ndarray]
Notes
This method assumes that sequences of self and tgt are already aligned.
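The rule that only atoms present in both structures contribute can be sketched in plain Python. This simplified example computes the RMSD of coordinates as given and deliberately omits the superposition step (the rotation and translation that rmsd can return), so it is not a reimplementation of the method. The function name and the flat list-of-atoms layout are hypothetical.

```python
import math

NAN = float("nan")


def rmsd_no_superposition(src, tgt):
    """RMSD over atom pairs that are defined (not NaN) in both structures.

    `src` and `tgt` are equal-length lists of (x, y, z) tuples that are
    assumed to be aligned already, atom for atom.
    """
    if len(src) != len(tgt):
        raise ValueError("structures must contain the same number of atoms")
    total = 0.0
    count = 0
    for a, b in zip(src, tgt):
        # Skip the pair if any coordinate is missing in either structure.
        if any(math.isnan(value) for value in a + b):
            continue
        total += sum((x - y) ** 2 for x, y in zip(a, b))
        count += 1
    if count == 0:
        raise ValueError("no atoms are defined in both structures")
    return math.sqrt(total / count)


src = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (NAN, NAN, NAN)]
tgt = [(0.0, 0.0, 2.0), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0)]

# The third atom is missing in src, so only the first two pairs are used.
result = rmsd_no_superposition(src, tgt)
print(round(result, 4))  # 1.4142
```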
- to_string(format='cif')[source]#
Serialize this Protein to a string. Note that format="pdb" may not serialize all aspects of this object, so format="cif", the default, is preferred.
- static from_expr(expr, name=None)[source]#
Create a Protein from a sequence expression.
A sequence expression allows you to define protein sequences using a concise notation that mixes fixed sequences, design regions, and length ranges.
Useful for creating a design Query.
- Parameters:
expr (str | int) – Sequence expression string or integer:
- Fixed sequences: "ACGT" (literal amino acids)
- Design regions: "6" or 6 (any 6 amino acids)
- Length ranges: "3..5" (between 3-5 amino acids)
- Combined: "AAAA6C3..5" (AAAA + 6 design + C + 3-5 design)
name (str | None) – Optional name for the protein
- Returns:
Protein object with the parsed sequence
- Return type:
Protein
Examples
>>> # Fixed sequence with 6 flexible positions and fixed end
>>> Protein.from_expr("MKLL6VVAA").sequence
b'MKLLXXXXXXVVAA'
>>> # Design region of any 15 amino acids
>>> Protein.from_expr(15).sequence
b'XXXXXXXXXXXXXXX'
>>> # Variable length region between 10-20 residues
>>> Protein.from_expr("10..20").sequence
b'XXXXXXXXXX??????????'
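To make the expression notation concrete, here is a minimal standalone parser that reproduces the outputs shown in the examples above. It is a sketch, not the library's parser: the function name expand_expr is hypothetical, and the encoding (X for each required design position, ? for each optional position up to the range maximum) is inferred from those documented outputs only.

```python
import re

# Try a length range first so "3..5" is not read as the count 3.
_TOKEN = re.compile(r"(?P<range>\d+\.\.\d+)|(?P<count>\d+)|(?P<fixed>[A-Za-z]+)")


def expand_expr(expr) -> bytes:
    """Expand a sequence expression into a byte string of residues."""
    text = str(expr)  # an integer such as 15 is treated like the string "15"
    parts = []
    pos = 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if match.group("range") is not None:
            low, high = (int(n) for n in match.group("range").split(".."))
            if low > high:
                raise ValueError(f"invalid range {low}..{high}")
            # Required design positions, then optional ones up to the maximum.
            parts.append("X" * low + "?" * (high - low))
        elif match.group("count") is not None:
            parts.append("X" * int(match.group("count")))
        else:
            parts.append(match.group("fixed"))
        pos = match.end()
    return "".join(parts).encode("ascii")


print(expand_expr("MKLL6VVAA"))   # b'MKLLXXXXXXVVAA'
print(expand_expr(15))            # b'XXXXXXXXXXXXXXX'
print(expand_expr("10..20"))      # b'XXXXXXXXXX??????????'
print(expand_expr("AAAA6C3..5"))  # b'AAAAXXXXXXCXXX??'
```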
- static from_filepath(path, chain_id, use_bfactor_as_plddt=None, model_idx=0, verbose=True)[source]#
Create a Protein from a structure file.
If the structure file has multiple conformers, the first conformer is always used.
- Parameters:
path (Path | str) – path to structure file (e.g. pdb or cif file)
chain_id (str) – id of the chain in the structure file to use
use_bfactor_as_plddt (bool | None) – whether or not to use bfactors as pLDDTs. If None, this parameter will be determined based on heuristics. These heuristics may change over time.
model_idx (int) – index of the model in the structure file to use
verbose (bool) – whether or not to print debugging information such as oddities in the structure e.g. missing atoms
Complex#
Complex describes a molecular complex representation. A Complex can be uploaded to our platform as a Query for use with models such as RFdiffusionModel (e.g. for multi-chain binder design), and for easy reuse.
Structure#
Structure describes a collection of Complex instances. These are typically created when parsing structure files (e.g., CIF, PDB) that contain multiple models of the same molecular complex, such as NMR ensembles or computational predictions with multiple conformations.
- class openprotein.molecules.Structure(complexes, name=None)[source]#
Represents a collection of Complex instances.
- to_string(format='cif')[source]#
Serialize this Structure to a string. Note that format="pdb" may not serialize all aspects of this object, so format="cif", the default, is preferred.
- static from_pdb_id(pdb_id, verbose=True)[source]#
Creates a Structure instance by downloading data from the RCSB PDB.
This method performs an HTTP GET request to the RCSB web server to fetch the structure file (in CIF format) associated with the given PDB ID.
- Parameters:
pdb_id (str) – The 4-character PDB identifier (e.g. “1XYZ”).
verbose (bool, optional) – Whether to print warnings to stdout. Defaults to True.
- Returns:
A new instance containing the parsed structure data.
- Return type:
Structure
- Raises:
requests.HTTPError – If the PDB ID is invalid, the server is unreachable, or the request returns a 404/500 status code.
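For readers who want to see what such a download involves, the following sketch builds the CIF URL for a PDB ID and fetches it with the standard library. It assumes the public RCSB file-download endpoint at files.rcsb.org and uses urllib rather than requests, so failures surface as urllib.error.HTTPError or URLError instead of requests.HTTPError. The helper names are hypothetical and are not part of openprotein.

```python
import urllib.request

# Assumed public RCSB endpoint for mmCIF downloads.
RCSB_DOWNLOAD_URL = "https://files.rcsb.org/download/{pdb_id}.cif"


def rcsb_cif_url(pdb_id: str) -> str:
    """Validate a 4-character PDB ID and return its CIF download URL."""
    pdb_id = pdb_id.strip().upper()
    if len(pdb_id) != 4 or not pdb_id.isalnum():
        raise ValueError(f"expected a 4-character PDB ID, got {pdb_id!r}")
    return RCSB_DOWNLOAD_URL.format(pdb_id=pdb_id)


def fetch_cif(pdb_id: str, timeout: float = 30.0) -> str:
    """Download the CIF text for `pdb_id` with an HTTP GET request."""
    with urllib.request.urlopen(rcsb_cif_url(pdb_id), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Building the URL needs no network access; fetch_cif does.
print(rcsb_cif_url("1xyz"))  # https://files.rcsb.org/download/1XYZ.cif
```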
Ligand#
Ligand represents a ligand that can be described either by smiles or ccd. These are intended to be used as part of a Complex.
- class openprotein.molecules.Ligand(ccd=None, smiles=None, _structure_block=None)[source]#
Represents a ligand with optional Chemical Component Dictionary (CCD) identifier and SMILES string.
Requires either a CCD identifier or SMILES string.
- ccd#
The CCD identifier for the ligand.
- Type:
str | None
- smiles#
The SMILES representation of the ligand.
- Type:
str | None
DNA#
DNA represents a DNA chain. These are intended to be used as part of a Complex.
RNA#
RNA represents an RNA chain. These are intended to be used as part of a Complex.