Structural Descriptors
mol.descriptors() returns a fixed-size descriptor dictionary for quick ML
feature tables:
features = mol.descriptors()
features["radius_of_gyration"]
features["principal_moments"]
features["distance_histogram"]
Batch featurization:
X, names = ms.featurize_many(
    ["a.pdb", "b.pdb", "c.xyz"],
    return_names=True,
)
Included features:
- atom and residue counts
- element counts
- molecular mass
- centroid and center of mass
- radius of gyration
- bounding-box dimensions and volume
- inertia tensor
- principal moments and axes
- shape anisotropy
- asphericity, acylindricity, and relative shape anisotropy (κ²)
- compactness
- distance histogram
- bond length summary statistics
- atom and residue contact summaries
- SASA summary statistics (total, mean, std, max)
- polar-contact count and salt-bridge count
Full contact matrices remain available through mol.contact_map(...).
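Distance histograms and atom contact counts are computed over blocks of
coordinates rather than from a full pairwise distance array, which keeps memory
bounded on large structures; distance_chunk_size sets the block size: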
features = mol.descriptors(distance_chunk_size=2048)
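The block-wise idea can be sketched in plain NumPy. This is an illustration of the technique, not MolScope's internal code: the function name chunked_distance_histogram and its bin_edges argument are hypothetical, and only the chunk_size parameter mirrors distance_chunk_size above.

```python
import numpy as np

def chunked_distance_histogram(coords, bin_edges, chunk_size=2048):
    """Histogram of unique pairwise distances, built one row block at a time."""
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    counts = np.zeros(len(bin_edges) - 1, dtype=np.int64)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Distances from this block of atoms to every atom from `start` onward.
        diff = coords[start:stop, None, :] - coords[None, start:, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))
        # Count each unordered pair once: keep entries whose column index
        # (in global numbering) is larger than the row index.
        rows = np.arange(start, stop)[:, None]
        cols = np.arange(start, n)[None, :]
        counts += np.histogram(dist[cols > rows], bins=bin_edges)[0]
    return counts

points = np.array([[0, 0, 0], [1, 0, 0], [2, 0, 0], [3, 0, 0]], dtype=float)
edges = np.array([0.5, 1.5, 2.5, 3.5])
hist = chunked_distance_histogram(points, edges, chunk_size=2)
```

Peak memory in this sketch scales with chunk_size × n_atoms instead of n_atoms², and the result does not depend on the chunk size chosen.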
Stable presets are available when you need reproducible feature columns:
features = mol.descriptors(preset="native-basic")
X, names = ms.featurize_many(paths, preset="native-3d", return_names=True)
names = ms.descriptor_feature_names("native-3d")
Preset options:
- native-basic: counts, mass, size, compactness, bond summaries, and contact summaries.
- native-3d: native-basic plus centres, inertia, principal axes/moments, the gyration-tensor shape descriptors (asphericity, acylindricity, relative shape anisotropy κ²), SASA summary statistics, polar-contact and salt-bridge counts, and distance histograms.
- rdkit-basic: native-basic plus a stable subset of RDKit scalar descriptors.
The gyration-tensor shape descriptors in native-3d come from the eigenvalues
λ₁ ≤ λ₂ ≤ λ₃ of the gyration tensor (recovered from the mass-weighted inertia
moments): asphericity b = λ₃ − ½(λ₁+λ₂) grows as a structure elongates,
acylindricity c = λ₂ − λ₁ is zero when the two smallest eigenvalues coincide
(for example, a shape axially symmetric about its long axis), and the
relative shape anisotropy κ² = (b² + ¾c²)/R_g⁴, with R_g² = λ₁ + λ₂ + λ₃, runs
from 0 (a sphere or higher-symmetry arrangement) to 1 (a perfectly linear
one). They are distinct
from the legacy shape_anisotropy column, which applies a similar formula to
the inertia moments directly.
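The formulas above can be checked with a short NumPy sketch. It builds the gyration tensor directly from coordinates rather than recovering it from the inertia moments as MolScope does, and the function name gyration_shape and its dictionary keys are hypothetical; optional mass weighting is an assumption.

```python
import numpy as np

def gyration_shape(coords, masses=None):
    """Asphericity, acylindricity and kappa^2 from gyration-tensor eigenvalues."""
    coords = np.asarray(coords, dtype=float)
    weights = np.ones(len(coords)) if masses is None else np.asarray(masses, dtype=float)
    weights = weights / weights.sum()
    # Centre on the (weighted) centroid, then form the weighted 3x3 tensor.
    centred = coords - (weights[:, None] * coords).sum(axis=0)
    tensor = (weights[:, None, None] * centred[:, :, None] * centred[:, None, :]).sum(axis=0)
    lam1, lam2, lam3 = np.linalg.eigvalsh(tensor)  # ascending order
    rg2 = lam1 + lam2 + lam3                       # squared radius of gyration
    b = lam3 - 0.5 * (lam1 + lam2)                 # asphericity
    c = lam2 - lam1                                # acylindricity
    kappa2 = (b ** 2 + 0.75 * c ** 2) / rg2 ** 2   # relative shape anisotropy
    return {"rg2": rg2, "asphericity": b, "acylindricity": c, "kappa2": kappa2}

# Three collinear points: the perfectly linear limit, kappa^2 = 1.
line = gyration_shape([[-1, 0, 0], [0, 0, 0], [1, 0, 0]])
# Octahedron vertices: a high-symmetry arrangement, kappa^2 = 0.
octa = gyration_shape([[1, 0, 0], [-1, 0, 0], [0, 1, 0],
                       [0, -1, 0], [0, 0, 1], [0, 0, -1]])
```

The two test shapes hit the limits quoted above: the collinear points give κ² = 1 and the octahedron gives κ² = 0.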
native-3d also includes surface and interaction summaries:
- sasa_total, sasa_mean, sasa_std, sasa_max: summary statistics of the Shrake-Rupley SASA, computed at a coarser sasa_n_points than mol.sasa() for batch speed (tune it via the sasa_n_points argument).
- polar_contact_count: N/O atom pairs 2.5-3.5 Å apart in different residues, a coarse geometric proxy for polar contacts rather than a validated hydrogen-bond count.
- salt_bridge_count: basic side-chain N within 4 Å of an acidic side-chain O, counting unique residue pairs.
These need coordinates and are computed only for native-3d (and the
unfiltered default), so native-basic and rdkit-basic stay fast.
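As a rough illustration of the polar-contact rule described above, the following NumPy sketch counts N/O pairs in the 2.5-3.5 Å window across different residues. It is not MolScope's implementation: the function name and arguments are hypothetical, and treating both distance bounds as inclusive is an assumption.

```python
import numpy as np

def polar_contact_count(coords, elements, residue_ids, lo=2.5, hi=3.5):
    """Count N/O atom pairs lo..hi Angstrom apart that sit in different residues."""
    coords = np.asarray(coords, dtype=float)
    elements = np.asarray(elements)
    residue_ids = np.asarray(residue_ids)
    # Restrict to nitrogen and oxygen atoms before building distances.
    polar = np.flatnonzero(np.isin(elements, ["N", "O"]))
    xyz = coords[polar]
    res = residue_ids[polar]
    dist = np.linalg.norm(xyz[:, None, :] - xyz[None, :, :], axis=-1)
    in_range = (dist >= lo) & (dist <= hi)          # assumed inclusive bounds
    other_residue = res[:, None] != res[None, :]
    # Upper triangle only, so each unordered pair is counted once.
    upper = np.triu(np.ones(in_range.shape, dtype=bool), k=1)
    return int((in_range & other_residue & upper).sum())

coords = [[0.0, 0.0, 0.0],   # N, residue 1
          [3.0, 0.0, 0.0],   # O, residue 2: 3.0 A from the N, counted
          [0.0, 3.0, 0.0],   # O, residue 1: same residue as the N, skipped
          [0.0, 0.0, 3.0]]   # C, residue 3: not a polar atom, skipped
n_contacts = polar_contact_count(coords, ["N", "O", "O", "C"], [1, 2, 1, 3])
```

A purely geometric count like this ignores donor/acceptor roles and angles, which is why it is described as a proxy and not a hydrogen-bond count.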
Ligand binding sites have their own fixed-size preset because they need a ligand context:
mol = ms.read("examples/data/3ptb.pdb")
site = mol.binding_site(cutoff=4.5)
pocket = site.descriptors(mol, preset="pocket-basic")
names = ms.pocket_descriptor_feature_names("pocket-basic")
pocket-basic includes pocket atom and residue counts, amino-acid composition,
protein-ligand contact counts, radius of gyration, bounding-box dimensions, and
ligand-distance summaries.
Solvent-accessible surface area (SASA)
mol.sasa() returns an approximate solvent-accessible surface area in Ų using
a vectorised Shrake-Rupley sphere-sampling method. It is a fast, pure-NumPy
descriptor of solvent exposure with no C extensions or external SASA libraries:
mol = ms.read("examples/data/1ubq.pdb")
per_atom = mol.sasa() # (n_atoms,) array, Ų
per_res = mol.sasa(level="residue") # (n_residues,) summed per residue
total = mol.sasa().sum() # whole-structure total
Each atom's expanded sphere (its van der Waals radius plus a probe_radius
water probe, 1.4 Å by default) is sampled with n_points quasi-uniform points;
a point is accessible when it lies outside every neighbouring atom's expanded
sphere. Accuracy improves with n_points (default 192, within a few percent of
an exact analytical surface) at the cost of speed. Residue-level values follow
mol.residue_groups() order.
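The sampling scheme described above can be sketched as follows. This is a minimal illustration, not MolScope's implementation: the golden-angle (Fibonacci) point set is one assumed way to obtain quasi-uniform points, the function names are hypothetical, and the loop over atoms with an all-atom comparison stands in for whatever neighbour search the library uses.

```python
import numpy as np

def fibonacci_sphere(n_points):
    """Quasi-uniform unit vectors from a golden-angle spiral."""
    i = np.arange(n_points) + 0.5
    z = 1.0 - 2.0 * i / n_points
    r = np.sqrt(1.0 - z * z)
    phi = i * np.pi * (3.0 - np.sqrt(5.0))
    return np.column_stack((r * np.cos(phi), r * np.sin(phi), z))

def shrake_rupley(coords, radii, probe_radius=1.4, n_points=192):
    """Per-atom SASA in squared Angstrom from sampled expanded spheres."""
    coords = np.asarray(coords, dtype=float)
    expanded = np.asarray(radii, dtype=float) + probe_radius
    unit = fibonacci_sphere(n_points)
    sasa = np.empty(len(coords))
    for i in range(len(coords)):
        # Sample points on atom i's expanded sphere.
        points = coords[i] + expanded[i] * unit
        others = np.arange(len(coords)) != i
        # Distance from every sample point to every other atom centre.
        d = np.linalg.norm(points[:, None, :] - coords[others][None, :, :], axis=-1)
        # A point is buried if it falls inside any other expanded sphere.
        buried = (d < expanded[others][None, :]).any(axis=1)
        sasa[i] = 4.0 * np.pi * expanded[i] ** 2 * (~buried).mean()
    return sasa

isolated = shrake_rupley([[0.0, 0.0, 0.0]], [1.7])
pair = shrake_rupley([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]], [1.7, 1.7])
```

An isolated atom recovers the full area of its expanded sphere, while each atom in the overlapping pair loses the fraction of sample points that fall inside its neighbour.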
This is an approximation aimed at the descriptors workflow, not a replacement
for an exact analytical surface. The per-atom and per-residue arrays are not
folded into the fixed descriptors() presets; native-3d carries only the sasa_*
summary statistics, so the preset feature columns stay stable.
Relative solvent accessibility (RSA)
mol.relative_sasa() normalises each residue's absolute SASA by a reference
maximum (Tien et al. 2013) to give RSA, then classifies residues as exposed or
buried — a high-signal per-residue feature for interface/binding-site work and
residue-level graphs:
exp = mol.relative_sasa(threshold=0.20) # ResidueExposure, per residue
exp.rsa # relative SASA (NaN where no reference)
exp.exposed # bool: rsa >= threshold
exp.sasa # absolute SASA (Ų)
zip(exp.resnames, exp.resids, exp.exposed) # label each residue
RSA can slightly exceed 1 (the reference is an extended Gly-X-Gly tripeptide),
and residues with no reference (ligands, waters, non-standard names) get NaN
RSA and count as not exposed. SASA is computed on the whole structure, so burial
reflects neighbours. RSA works well as a custom node feature on a residue
contact graph.
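The normalisation and thresholding can be sketched in a few lines of NumPy. The reference table here is a hypothetical three-entry stand-in with illustrative values in Ų, not necessarily the table MolScope ships, and the function name relative_sasa_sketch is invented for this example.

```python
import numpy as np

# Illustrative reference maxima (squared Angstrom); placeholder values only.
REFERENCE_MAX = {"ALA": 129.0, "GLY": 104.0, "LYS": 236.0}

def relative_sasa_sketch(resnames, residue_sasa, threshold=0.20,
                         reference=REFERENCE_MAX):
    """Return (rsa, exposed) arrays; residues without a reference get NaN."""
    residue_sasa = np.asarray(residue_sasa, dtype=float)
    ref = np.array([reference.get(name, np.nan) for name in resnames], dtype=float)
    rsa = residue_sasa / ref                 # NaN wherever no reference exists
    exposed = np.zeros(len(rsa), dtype=bool)
    known = ~np.isnan(rsa)
    exposed[known] = rsa[known] >= threshold  # NaN rows stay "not exposed"
    return rsa, exposed

rsa, exposed = relative_sasa_sketch(["ALA", "GLY", "HOH"], [64.5, 10.4, 30.0])
```

In this example the alanine is half exposed and passes the 0.20 threshold, the glycine falls below it, and the water has no reference, so its RSA is NaN and it counts as not exposed.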
RDKit descriptors
Install the optional chemical backend to access RDKit's scalar descriptor set:
pip install "molscope[chem]"
Use RDKit descriptors directly:
rdkit_features = mol.rdkit_descriptors(names=["MolWt", "TPSA", "NumHDonors"])
Or merge selected RDKit descriptors into the standard MolScope descriptor dictionary:
features = mol.descriptors(
    include_rdkit=True,
    rdkit_descriptor_names=["MolWt", "TPSA", "NumHDonors"],
)
When rdkit_descriptor_names is omitted, all scalar RDKit descriptors available
in the installed RDKit version are included with an rdkit_ prefix.