Protein analysis: structured coordinate data¶

MolScope treats proteins read from PDB/mmCIF as structured coordinate data: coordinates plus atom names, residue names, residue ids, chain ids and ATOM/HETATM records. Those fields make protein-specific questions possible. None of the tools on this page need extra dependencies.

Backbone atoms, residues and chains¶

Protein backbone atoms use the standard atom names N, CA, C and O. Alpha carbons (CA) are a common residue-level proxy for contact maps, RMSD and coarse structural comparison.

import molscope as ms

mol = ms.read("examples/data/1fqy.pdb")

len(list(mol.residue_groups()))     # 226 residue groups
mol.chain_ids()                     # ['A']
mol.backbone()                      # N, CA, C, O atoms
mol.alpha_carbons()                 # one CA per residue for this protein

Residues are identified internally by chain, integer residue number, insertion code, and residue name. The old integer resids array remains available, while mol.icodes, mol.residue_ids, contact-map residue_ids, and binding-site records keep insertion-code distinctions such as A:SER100A vs A:THR100B.

For structures with HETATM records, MolScope separates polymer atoms from ligands, waters and ions:

trypsin = ms.read("examples/data/3ptb.pdb")

trypsin.protein()                   # ATOM atoms only
trypsin.hetero_atoms()              # HETATM atoms: ligand, waters, calcium
trypsin.select(resname="HOH")       # waters
trypsin.ligands()                   # non-solvent, non-ion HETATM groups

The tutorial notebook notebooks/protein_analysis_from_scratch.ipynb walks this from raw files through contacts, ligands and secondary structure using the bundled 1fqy, 1aml and 3ptb examples.

Contact maps¶

Residue contact maps summarize which residue pairs are close in 3D:

cmap = mol.contact_map(cutoff=8.0, level="residue", method="ca", min_seq_sep=4)
cmap.matrix.shape                  # (226, 226)
cmap.n_contacts                    # non-local contact count
cmap.contact_order()               # local vs long-range contact metric

For NMR ensembles, contact frequency reports how often each pair is in contact:

models = ms.read_pdb_models("examples/data/1aml.pdb")
freq = ms.ensemble_contact_frequency(models, cutoff=8.0)
freq.matrix                         # values in [0, 1]

Secondary structure elements¶

secondary_structure() runs MolScope's built-in simplified DSSP-style assignment and returns a [SecondaryStructure][]. It is designed for teaching and prototyping: it follows the Kabsch-Sander hydrogen-bond idea and is cross-checked against mkdssp, but it is not bit-identical to canonical reference DSSP on every edge case.

Use it for lightweight structure inspection. Use a reference mkdssp installation when production-grade secondary-structure labels are required. On top of the per-residue codes, you can extract elements, reduce to three states, and break the assignment down by chain.

ss = ms.read("examples/data/1fqy.pdb").secondary_structure()

ss.string            # 8-state codes, e.g. '--HHHHH--EEEE--'
ss.simplified()      # 3-state: 'CCHHHHHCCEEEEC...' (H helix, E strand, C coil)

for seg in ss.segments():           # contiguous helices/strands, coil omitted
    print(seg)                      # e.g. SSSegment(H A:9-36, len=28)

ss.summary()                        # helix/strand/coil counts and fractions
ss.per_chain()                      # the same breakdown per chain id

Backbone torsions (Ramachandran)¶

backbone_torsions() returns the phi/psi/omega dihedrals per residue in degrees, with NaN at chain termini and breaks:

tor = ms.read("examples/data/1fqy.pdb").backbone_torsions()
tor.phi, tor.psi, tor.omega         # (R,) arrays aligned with tor.resids/chains

Helical residues cluster near phi -63, psi -42; trans peptide bonds give omega near 180.

For the standard Ramachandran scatter, plot_ramachandran draws phi vs psi on a [-180, 180] grid, coloured by secondary-structure class by default, with schematic alpha/beta region guides:

mol = ms.read("examples/data/1fqy.pdb")
mol.plot_ramachandran()                       # color_by="ss" (default)
mol.plot_ramachandran(color_by="navy", regions=False)

The shaded regions are approximate teaching guides, not statistically-derived density contours. Residues whose phi or psi is undefined (chain ends, breaks) are skipped.

Chain interfaces¶

For multi-chain structures, find the residues that form an interface between two chains, or summarise inter-chain contacts across the whole assembly:

mol = ms.fetch("1brs")                       # barnase-barstar complex

iface = mol.interface("A", "D", cutoff=5.0)  # -> Interface
iface.residues_a                             # interface residues on chain A
iface.residues_b                             # ...and on chain D
iface.n_atom_contacts                        # atom pairs across the interface

ccm = mol.chain_contacts(cutoff=5.0)         # -> ChainContactMatrix
ccm.count("A", "D")                          # inter-chain atom contacts

Ligand-binding sites¶

HETATM groups (ligands, cofactors, ions, water) are tracked separately from polymer atoms. ligands() lists the non-solvent groups; binding_site() finds the protein residues around one.

mol = ms.read("examples/data/3ptb.pdb")      # trypsin + benzamidine

mol.ligands()                                # [LigandResidue(A:BEN1, 9 atoms)]

site = mol.binding_site(cutoff=4.5)          # the single ligand is auto-detected
for res, d in zip(site.residues, site.min_distances):
    print(res, round(d, 2))                  # closest residues first
# A:GLY219 2.82
# A:ASP189 2.87   <- the benzamidine specificity residue
# A:SER190 3.04
# ...

site.n_atom_contacts                         # protein-ligand atom contacts
site.to_records()                            # table-ready residue rows
site.descriptors(mol, preset="pocket-basic") # pocket descriptors
site.plot(mol, show=False)                   # residue subset plus ligand

When several ligands are present, name one explicitly:

mol.binding_site(ligand="BEN")               # by residue name
mol.binding_site(ligand=("A", 1))            # by (chain, resid)
mol.binding_site(ligand=("A", 100, "A"))     # by (chain, resid, insertion code)

Describe a pocket for LLMs¶

select_pocket(...).describe_environment() renders the pocket as a chemistry-aware paragraph (hydrophobic wall, aromatic residues, hydrogen bonds, salt bridges) you can feed directly to an LLM or RAG pipeline. The interactions come from distance-only heuristics over heavy atoms (no donor/acceptor typing, bond-angle, or protonation-state analysis), so the prose is phrased as candidates ("likely", "possible"); treat it as a first-pass scaffold and confirm with a dedicated profiler such as PLIP or ProLIF.

pocket = mol.select_pocket(ligand="BEN", cutoff=4.5)
print(pocket.describe_environment())
# ... A possible salt bridge / electrostatic contact is suggested between the
# ligand N2 and the carboxylate of ASP189 (2.9 A). ...

pocket.environment().to_dict()               # structured findings for JSON / RAG

Water and common ions are excluded from ligands() by default (exclude_water, exclude_ions). The polymer/hetero split is also available as selections:

mol.protein()        # ATOM atoms only
mol.hetero_atoms()   # HETATM atoms only
mol.select(hetero=True)

To export the binding-site residue table without writing Python:

molscope binding-site examples/data/3ptb.pdb --out site.csv --cutoff 4.5
molscope binding-site examples/data/3ptb.pdb --out site.csv --descriptors-out pocket.csv