PDB to Descriptors

This tutorial turns a few PDB structures into a stable, numeric feature table. The result is a CSV you can load into scikit-learn, pandas, a notebook, or a plain spreadsheet.

You will build:

  • one row per input structure,
  • stable descriptor columns from the native-3d preset,
  • a lightweight CSV export with no optional dependencies.

Start with bundled PDB files

MolScope includes three useful teaching structures:

from pathlib import Path

paths = [
    Path("examples/data/1fqy.pdb"),  # Aquaporin-1, single model
    Path("examples/data/1aml.pdb"),  # amyloid-beta NMR ensemble, first model via read()
    Path("examples/data/3ptb.pdb"),  # trypsin with ligand/waters/calcium
]

read() returns one Molecule. For a multi-model PDB such as 1aml.pdb, it uses the first model; use read_pdb_models() when you want one row per model in an ensemble.
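
If you are unsure whether a file holds an ensemble, one option is to count its MODEL records before choosing between read() and read_pdb_models(). The sketch below uses only the standard library and relies on the PDB convention that each model opens with a MODEL record. The helpers count_pdb_models and count_models_in_file are hypothetical names for this example, not part of MolScope.

```python
from pathlib import Path


def count_pdb_models(lines):
    """Count MODEL records in an iterable of PDB lines.

    Single-model files usually omit MODEL records entirely, so a count
    of zero is reported as one model.
    """
    n_models = sum(1 for line in lines if line[:6].strip() == "MODEL")
    return max(n_models, 1)


def count_models_in_file(path):
    """Hypothetical convenience wrapper: count models in a PDB file on disk."""
    with Path(path).open() as handle:
        return count_pdb_models(handle)
```

A result greater than one suggests the file is an ensemble, in which case read_pdb_models() is the better fit when you want every model.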

Inspect one structure

import molscope as ms

mol = ms.read(paths[0])

print(mol.summary())
print("chains:", sorted(set(mol.chains)))
print("alpha carbons:", len(mol.alpha_carbons()))
print("radius of gyration:", round(mol.radius_of_gyration, 2), "A")

Descriptors are most useful when you know what biological or geometric signal you expect. For 1fqy, useful table-level signals include size, chain count, shape, contact density, and residue contacts.
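
To make the size signal concrete, here is a minimal sketch of how a radius of gyration can be computed from raw coordinates. It is an illustration of the standard formula, not MolScope's implementation; in particular, whether MolScope weights atoms by mass is an assumption this page does not settle, so the function accepts optional weights.

```python
import math


def radius_of_gyration(coords, masses=None):
    """Root-mean-square distance of atoms from their (weighted) center.

    coords: sequence of (x, y, z) tuples in angstroms.
    masses: optional per-atom weights; equal weights are assumed if omitted.
    """
    if masses is None:
        masses = [1.0] * len(coords)
    total = sum(masses)
    # Weighted center of the point cloud, one coordinate per axis.
    center = tuple(
        sum(m * c[axis] for m, c in zip(masses, coords)) / total
        for axis in range(3)
    )
    weighted_sq = sum(m * math.dist(c, center) ** 2 for m, c in zip(masses, coords))
    return math.sqrt(weighted_sq / total)
```

A compact, globular structure gives a smaller value than an extended one with the same number of atoms, which is why this column is a useful first check on shape.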

Compute one descriptor dictionary

features = mol.descriptors(preset="native-3d")

print(features["n_atoms"])
print(features["n_residues"])
print(features["radius_of_gyration"])
print(features["principal_moments"])
print(features["distance_histogram"])

The dictionary intentionally mixes scalar values and short vector values. Use featurize_many() when you want a flat numeric matrix with stable columns.
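
As a rough picture of what flattening means here, the sketch below expands each vector value into one column per element. The helper flatten_mixed and its key_index naming scheme are assumptions made for illustration; the column names MolScope actually produces may differ.

```python
def flatten_mixed(features):
    """Expand vector-valued entries into one numeric column per element.

    Scalars keep their key; a vector under key "k" becomes "k_0", "k_1", ...
    Only numeric values are expected. The suffix scheme is an assumption
    made for this sketch.
    """
    flat = {}
    for key, value in features.items():
        try:
            items = list(value)  # lists, tuples, and 1-D arrays
        except TypeError:
            flat[key] = float(value)  # plain scalar
            continue
        for i, item in enumerate(items):
            flat[f"{key}_{i}"] = float(item)
    return flat


example = {"n_atoms": 3, "principal_moments": [1.0, 2.5, 4.0]}
print(flatten_mixed(example))
```

Because every vector has a fixed length for a given preset, the flattened keys line up across structures, which is what makes a stable column order possible.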

Build a feature matrix

X, names = ms.featurize_many(
    paths,
    preset="native-3d",
    return_names=True,
)

print(f"{X.shape[0]} structures x {X.shape[1]} descriptor columns")
print(names[:8])

Expected shape for the bundled PDBs:

3 structures x 71 descriptor columns
['n_atoms', 'n_residues', 'molecular_mass', 'count_H', ...]

The native-3d preset includes counts, mass, bounding-box dimensions, compactness, bond summaries, contact summaries, centers, inertia, principal axes/moments, shape anisotropy, and a fixed-length pairwise-distance histogram.
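
The fixed-length histogram is what lets structures of different sizes share the same columns. The sketch below shows the idea with the standard library only. The bin count, the distance range, and the choice to fold long distances into the last bin are all assumptions for this example; the page does not specify MolScope's binning.

```python
import math
from itertools import combinations


def distance_histogram(coords, n_bins=8, max_distance=40.0):
    """Count pairwise distances in n_bins equal-width bins over [0, max_distance).

    Distances at or beyond max_distance are counted in the last bin, so the
    output always has exactly n_bins entries regardless of structure size.
    """
    width = max_distance / n_bins
    counts = [0] * n_bins
    for a, b in combinations(coords, 2):
        index = min(int(math.dist(a, b) / width), n_bins - 1)
        counts[index] += 1
    return counts
```

This loop visits every atom pair, so its cost grows quadratically with atom count; it is meant to explain the descriptor, not to replace a vectorized implementation.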

Write a CSV

import csv

out = Path("pdb_descriptors.csv")

with out.open("w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["file", *names])
    for path, row in zip(paths, X):
        writer.writerow([path.name, *row])

print(f"wrote {out}")
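
A quick way to confirm the export is to read the file back with the same standard-library module. The sketch below assumes the layout written above, with the file name in the first column and numeric values in the rest; load_feature_table is a hypothetical helper name.

```python
import csv


def load_feature_table(lines):
    """Parse a descriptor CSV whose first column is the file name.

    lines: an open text file (or any iterable of CSV lines).
    Returns (file_names, column_names, rows) with rows as lists of floats.
    """
    reader = csv.reader(lines)
    header = next(reader)
    column_names = header[1:]
    file_names, rows = [], []
    for record in reader:
        if not record:
            continue  # skip blank lines
        if len(record) != len(header):
            raise ValueError(
                f"row for {record[0]!r} has {len(record)} fields, expected {len(header)}"
            )
        file_names.append(record[0])
        rows.append([float(value) for value in record[1:]])
    return file_names, column_names, rows
```

Open the file with newline="" and pass the handle in, for example inside a with block on pdb_descriptors.csv. The length check catches truncated or hand-edited rows before they reach a model.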

You can also use the CLI for batch descriptor generation:

molscope analyze examples/data/*.pdb --out pdb_descriptors.csv --preset native-3d --jobs 4

Choose the right preset

| Preset | Best for | Notes |
| --- | --- | --- |
| native-basic | Fast, compact tables | Counts, mass, size, compactness, bond/contact summaries. |
| native-3d | Shape-aware ML baselines | Adds centers, inertia, principal axes/moments, and distance histograms. |
| rdkit-basic | Small-molecule chemistry tables | Requires the RDKit-backed chem extra. |

Descriptor values are not normalized. For ML, standardize columns after export and keep the fitted scaler with your model.
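
In practice scikit-learn's StandardScaler is the usual tool for this. The dependency-free sketch below shows the same idea so the fit and apply steps are explicit: statistics come from the training rows only and are reused unchanged on new data. The function names are hypothetical, and treating constant columns with a scale of 1.0 is a choice made for this example.

```python
from statistics import fmean, pstdev


def fit_scaler(rows):
    """Return per-column (means, scales) computed from the training rows."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Constant columns have zero spread; use 1.0 so they map to 0
    # instead of dividing by zero.
    scales = [pstdev(col) or 1.0 for col in columns]
    return means, scales


def apply_scaler(rows, means, scales):
    """Standardize rows with previously fitted means and scales."""
    return [
        [(value - mean) / scale for value, mean, scale in zip(row, means, scales)]
        for row in rows
    ]
```

Store means and scales next to the trained model so that structures featurized later are transformed with exactly the same numbers.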

Ensemble variant

To featurize every model in 1aml.pdb, read all models first and flatten each descriptor dictionary:

from molscope.descriptors import flatten_descriptors

models = ms.read_pdb_models("examples/data/1aml.pdb")
names = ms.descriptor_feature_names("native-3d")

rows = []
for model_id, model in enumerate(models, start=1):
    flat = flatten_descriptors(model.descriptors(preset="native-3d"))
    rows.append([model_id, *[flat[name] for name in names]])

That gives one row per conformer, which is useful for ensemble clustering, conformer classification, or toy graph-level labels.
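
To save the per-model rows in the same style as the earlier CSV, a small writer is enough. The sketch below assumes each row starts with the model id followed by one value per descriptor name, as in the loop above; write_ensemble_csv is a hypothetical helper name.

```python
import csv


def write_ensemble_csv(handle, names, rows):
    """Write one CSV line per model: model_id first, then the descriptor columns."""
    writer = csv.writer(handle)
    writer.writerow(["model_id", *names])
    for row in rows:
        if len(row) != len(names) + 1:
            raise ValueError("each row needs a model id plus one value per name")
        writer.writerow(row)
```

Pass a file opened with mode "w" and newline="", together with the names and rows built in the ensemble loop.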