Skip to content

Chemical Perception

MolScope keeps cheminformatics out of the core install. For aromaticity, valence and sanitized charge features, install the optional RDKit backend:

pip install "molscope[chem]"

Then call:

import molscope as ms

mol = ms.read("ligand.sdf")
chem = mol.chemical_features()

chem.formal_charges
chem.total_valences
chem.aromatic_atoms
chem.bond_orders
chem.aromatic_bonds

RDKit descriptors are also available through the same optional backend:

rdkit_features = mol.rdkit_descriptors(names=["MolWt", "TPSA"])
features = mol.descriptors(include_rdkit=True, rdkit_descriptor_names=["MolWt"])

SDF/MOL V2000 formal charges and bond orders are preserved by the built-in reader before RDKit is involved. If a structure only has coordinates, MolScope can pass geometrically inferred single bonds to RDKit, but it does not infer general bond orders from raw coordinates.

Proteins: residue-template bonds

A bare protein PDB has no bond records, so geometric inference gives single bonds only: aromaticity, backbone carbonyls, and side-chain double bonds are all lost, and chemical_features comes back essentially empty (zero aromatic atoms).

For standard residues the chemistry is known, so you don't need to guess it. Read with bond_perception="template" to have RDKit's residue-aware PDB reader assign the bonds, Kekule bond orders, and formal charges from its built-in residue templates (plus peptide bonds and disulfides):

mol = ms.read("protein.pdb", bond_perception="template")   # needs the chem extra
chem = mol.chemical_features()
chem.aromatic_atoms.sum()    # now counts the Phe/Tyr/His/Trp rings

read, read_pdb, and fetch all accept the option (PDB input only). It needs RDKit, and it only helps for standard residues: modified residues, non-standard ligands, and exotic chemistry still fall back to best-effort perception. This is not a force field, just template-based connectivity and bond orders; for energies and parameters use a dedicated force-field tool. The default stays "geometric", so nothing changes unless you opt in.

Protonation and formal charges

A crystallographic PDB models heavy atoms only, so RDKit reads every residue as neutral and chemical_features().formal_charges sums to zero — faithful to the file, but not the ionisation state at physiological pH. Add protonation="standard" (with template bonds) for an idealised pH-7 assignment of the standard ionisable side chains:

mol = ms.read("protein.pdb", bond_perception="template", protonation="standard")
mol.chemical_features().formal_charges.sum()   # e.g. +6 for trypsin

The fixed assignment is aspartate/glutamate -1, lysine/arginine +1, histidine neutral, and termini uncharged (see molscope.chem.STANDARD_PROTONATION). It is a textbook model: it ignores local pKa shifts, buried or metal-coordinating residues, and termini.

For an environment-aware prediction, use protonation="pka", which runs PROPKA to predict each ionisable group's pKa from the structure and then assigns the charge of the dominant state at a chosen pH:

mol = ms.read(
    "protein.pdb", bond_perception="template", protonation="pka", ph=7.4
)
mol.chemical_features().formal_charges.sum()   # reflects local pKa shifts

This accounts for the side chains, the N- and C-termini, and cysteine/tyrosine, and it tracks pH: lowering ph protonates (net charge rises) while raising it deprotonates (net charge falls). It needs the propka extra (pip install "molscope[propka]"). The default "none" keeps the as-modelled neutral state.

For small molecules the analogous knob lives on the dataset/library path: prepare_dataset(..., protonation="pka", ph=7.4) and smiles_descriptors(..., protonation="pka") set each SMILES to its dominant ionisation state at the target pH (via Dimorphite-DL, pip install "molscope[dimorphite]") before descriptors and fingerprints are computed — so charge-sensitive features match the species present at that pH.