MolScope documentation
Turn a molecular structure file into descriptors, contact maps, ML graphs, and coarse-grained bead models, with a small, readable Python API.
The problem MolScope is built for: you have structure files and you want ML-ready graphs and descriptors without installing a heavy stack or writing glue code. The core depends only on NumPy and Matplotlib; heavier backends (RDKit, PyTorch Geometric, DGL, Gemmi) are opt-in extras you add only when a workflow needs them. It is built for teaching, exploratory analysis, and ML-for-molecules prototyping, not as a replacement for full simulation or cheminformatics stacks.
import molscope as ms

mol = ms.read("examples/data/1fqy.pdb")  # or ms.fetch("1fqy")
print(mol.summary())                     # atoms, formula, chains, bounding box
desc = mol.descriptors()                 # dict of structural descriptors
graph = mol.to_graph()                   # ML-ready graph, no extra deps
data = mol.to_pyg_data()                 # PyTorch Geometric Data ([pyg])

# ...or a whole folder of structures, in one call:
ds = ms.build_dataset("data/*.pdb", fmt="pyg", labels="labels.csv",
                      split=(0.8, 0.1, 0.1), cache_dir=".graph_cache")
for batch in ds.loader("train", batch_size=32):  # batching DataLoader, ready to train
    ...
Core workflows
Each has a task-oriented tutorial:
| Workflow | Output |
|---|---|
| PDB to descriptors | Fixed-width structural and optional RDKit-backed feature tables for screening, QC, and classical ML. |
| PDB to graph/GNN | Atom/bond, residue-contact, and PyTorch Geometric-ready graph data, with positional encodings. |
| PDB to coarse-grained beads | Residue, simplified Martini-style, custom, and virtual-site bead models for inspection and graph prototyping. |
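
To make the residue-contact graph and positional-encoding outputs in the table concrete, here is a minimal NumPy-only sketch of the underlying ideas. It is not MolScope's implementation: the function names `contact_map` and `laplacian_positional_encoding`, the 8 Å cutoff, and the toy coordinates are all assumptions chosen for illustration.

```python
import numpy as np


def contact_map(coords: np.ndarray, cutoff: float = 8.0) -> np.ndarray:
    """Boolean (n, n) matrix, True where two points lie within `cutoff`.

    The 8.0 cutoff is an illustrative value, not a library default.
    """
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    contacts = dist <= cutoff
    np.fill_diagonal(contacts, False)  # a point is not in contact with itself
    return contacts


def laplacian_positional_encoding(adj: np.ndarray, k: int) -> np.ndarray:
    """Eigenvectors 1..k of the unnormalised graph Laplacian L = D - A.

    For a connected graph, eigenvector 0 is constant and carries no
    positional information, so it is skipped. Eigenvector signs are arbitrary.
    """
    a = adj.astype(float)
    lap = np.diag(a.sum(axis=1)) - a
    _, eigvecs = np.linalg.eigh(lap)  # eigh returns eigenvalues in ascending order
    return eigvecs[:, 1:k + 1]


# Toy "residue" coordinates: four points on a line, 5 Å apart.
coords = np.array([[0.0, 0.0, 0.0],
                   [5.0, 0.0, 0.0],
                   [10.0, 0.0, 0.0],
                   [15.0, 0.0, 0.0]])

cmap = contact_map(coords, cutoff=8.0)          # only neighbours are in contact
pe = laplacian_positional_encoding(cmap, k=2)   # one 2-dimensional encoding per node
print(cmap.astype(int))
print(pe.shape)
```

With this cutoff the toy points form a simple chain, so the contact map doubles as the adjacency matrix of a path graph. Real structures may yield disconnected graphs, in which case several eigenvalues are zero and skipping only the first eigenvector is no longer sufficient.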
What supports those workflows
- Read .pdb, .xyz, .cif atom-site loops, and .sdf files (and stream large multi-model files frame by frame), preserving explicit SDF/PDB bonds.
- Fetch structures from the RCSB by ID, or build from SMILES with RDKit.
- Select atoms by element, chain, residue name, atom name, and residue id.
- Compute geometry, RMSD, contacts, contact maps, ensembles, and descriptors.
- Analyse proteins through backbone/alpha-carbon selections, ligands, waters, binding sites, contact maps, and simplified DSSP-style secondary structure.
- Expose optional RDKit-backed chemical features, descriptors, and bond-order inference; preserve SDF formal charges.
- Export atom/bond and residue-contact graphs to NetworkX, PyTorch Geometric, or DGL, with Laplacian and random-walk positional encodings.
- Assemble a split, labelled graph dataset from a folder or RCSB accessions with build_dataset/fetch_dataset: an on-disk featurisation cache, batching DataLoaders, and train-only target standardisation.
- Prototype interpretable coarse-grained mappings (and export OpenMM residue templates) for teaching, inspection, and graph representations.
- Visualise molecules with Matplotlib or py3Dmol, from Python or the CLI.
- Document scientific validation against MDAnalysis, RDKit, mkdssp, and invariant checks, with explicit assumptions and tolerances.
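
The "train-only target standardisation" item in the list above is worth spelling out, because fitting the mean and standard deviation on the full dataset leaks validation and test information into training. The sketch below shows the general idea with plain NumPy. The helpers `split_indices` and `fit_standardiser` and the synthetic targets are hypothetical, and MolScope's own splitting and scaling may differ in detail.

```python
import numpy as np


def split_indices(n, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle indices 0..n-1 and cut them into train/val/test parts.

    Hypothetical helper: the third fraction is implied by the first two,
    so the test set receives whatever remains.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)
    n_train = int(round(fractions[0] * n))
    n_val = int(round(fractions[1] * n))
    return order[:n_train], order[n_train:n_train + n_val], order[n_train + n_val:]


def fit_standardiser(y_train):
    """Return (mean, std) computed from the training targets only."""
    mean = float(np.mean(y_train))
    std = float(np.std(y_train))
    return mean, (std if std > 0 else 1.0)  # avoid dividing by zero


# Synthetic regression targets standing in for a labels file.
targets = np.linspace(0.0, 9.9, 100)

train, val, test = split_indices(len(targets))
mean, std = fit_standardiser(targets[train])

# The same train-derived statistics are applied to every split.
z_train = (targets[train] - mean) / std
z_val = (targets[val] - mean) / std
z_test = (targets[test] - mean) / std
```

Only `z_train` is guaranteed to have zero mean and unit variance. The validation and test targets are shifted and scaled by the same constants, so predictions can be mapped back to original units with `z * std + mean`.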
Install
pip install molscope # core: NumPy + Matplotlib only
pip install "molscope[chem,cif,pyg]" # add extras for the workflows you need
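
Because the heavier backends are opt-in, it can help to check which ones are importable before calling a method that needs them. The following is a small sketch using only the standard library. The import names are those of the upstream packages (`rdkit`, `torch_geometric`, `dgl`, `gemmi`); the helper `backend_available` is hypothetical and not part of MolScope.

```python
from importlib.util import find_spec

# Display label -> import name of the optional backend package.
OPTIONAL_BACKENDS = {
    "RDKit": "rdkit",
    "PyTorch Geometric": "torch_geometric",
    "DGL": "dgl",
    "Gemmi": "gemmi",
}


def backend_available(module_name: str) -> bool:
    """Return True if `module_name` can be located, without importing it."""
    try:
        return find_spec(module_name) is not None
    except (ImportError, ValueError):
        return False


report = {label: backend_available(mod) for label, mod in OPTIONAL_BACKENDS.items()}
for label, ok in report.items():
    print(f"{label:<18} {'installed' if ok else 'missing'}")
```

Using `find_spec` rather than a bare `import` keeps the check cheap, since large packages such as PyTorch Geometric are located but not loaded.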
For development from the repository:
uv sync
uv run pytest
See the installation guide for the full list of extras, and the quickstart to get going.