Coordinate formats compared: XYZ, PDB, mmCIF, SDF
MolScope reads four common molecular coordinate formats. They look interchangeable (each one lists atoms and positions), but they differ sharply in what metadata they carry and how reliably they carry it. Choosing the right format, and knowing what you lose when you convert between them, avoids silent data loss.
What each format stores
| Capability | XYZ | PDB | mmCIF | SDF / MOL |
|---|---|---|---|---|
| 3D coordinates | yes | yes | yes | yes |
| Element symbols | optional | yes | yes | yes |
| Atom names | no | yes | yes | no |
| Residues and residue ids | no | yes | yes | no |
| Chains | no | yes (1 char) | yes | no |
| Explicit bonds | no | CONECT only | optional | yes |
| Bond orders | no | no | optional | yes |
| Formal charges | no | rarely | optional | yes |
| Multiple models / frames | yes (frames) | yes (MODEL) | yes | yes (records) |
| ATOM vs HETATM distinction | no | yes | yes (group_PDB) | no |
"Optional" means the format can express it but many files omit it; MolScope reads it when present.
XYZ: coordinates and not much else
XYZ is the lowest common denominator: an atom count, a comment line, then
element x y z per atom. MolScope also accepts bare x y z dumps (with #
comments) and multi-frame trajectories via read_xyz_frames.
- Reliable: coordinates, and element symbols when the first column is present.
- Missing: atom names, residues, chains, bonds, charges. There is no concept
of topology, so mol.residue_groups() or secondary_structure() will not work
on an XYZ structure.
Reach for XYZ for quantum-chemistry inputs/outputs, small molecules, and quick geometry dumps where connectivity does not matter.
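The layout described above is small enough to sketch with the standard library. This is an illustration only, not MolScope's reader: parse_xyz is a hypothetical helper, and it assumes a single frame with the element column present.

```python
def parse_xyz(text):
    """Parse one XYZ frame into (comment, [(element, x, y, z), ...]).

    Hypothetical helper for illustration; assumes an element column.
    """
    lines = text.splitlines()
    declared = int(lines[0].split()[0])             # line 1: atom count
    comment = lines[1] if len(lines) > 1 else ""    # line 2: free-form comment
    atoms = []
    for raw in lines[2:2 + declared]:
        fields = raw.split()
        if len(fields) < 4:
            continue                                # skip blank or short lines
        x, y, z = (float(value) for value in fields[1:4])
        atoms.append((fields[0], x, y, z))
    if len(atoms) != declared:
        raise ValueError(
            f"header declares {declared} atoms but {len(atoms)} were found"
        )
    return comment, atoms


# Made-up water geometry, used only to exercise the parser.
water = "\n".join([
    "3",
    "water, coordinates in angstrom",
    "O  0.000  0.000  0.117",
    "H  0.000  0.757 -0.469",
    "H  0.000 -0.757 -0.469",
])
comment, atoms = parse_xyz(water)
print(comment, len(atoms))
```

Nothing in the parsed result says which atoms are bonded or which residue they belong to, which is why topology-dependent calls have nothing to work with.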
PDB: convenient but messy
The Protein Data Bank format is the workhorse of structural biology, and almost everything reads it. Its problems come from its age: it is a fixed-column format frozen around 80-character punched-card records.
- Fields live in fixed columns, not whitespace-separated tokens, so MolScope slices columns (e.g. coordinates from columns 31-54). Whitespace splitting silently mis-reads real files.
- Hard size limits leak into the science: 4 characters for an atom name, 1 character for a chain id, 4 digits for a residue number, 5 for an atom serial. Large assemblies overflow these and need hacks or mmCIF.
- Connectivity is mostly implicit: bonds appear only in optional CONECT records (typically just for ligands/HETATM), so MolScope infers most bonds from geometry.
- Alternate conformations (altLoc) and insertion codes complicate "one residue, one atom" assumptions; read_pdb(..., altloc=...) selects a policy.
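The fixed-column point is easiest to see on a record whose coordinates fill their columns completely. The sketch below uses only the standard library and made-up values; it builds one ATOM record with the standard column layout, then compares column slicing against whitespace splitting. It is not MolScope's parser.

```python
# Standard ATOM layout: x, y, z occupy columns 31-38, 39-46, 47-54 (1-based).
ATOM_FORMAT = (
    "{record:<6s}{serial:>5d} {name:<4s}{altloc:1s}{resname:>3s} {chain:1s}"
    "{resseq:>4d}{icode:1s}   {x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{bfac:6.2f}"
    "          {element:>2s}"
)

# Illustrative values: x and y each use all 8 characters of their field,
# so no space separates them in the record.
line = ATOM_FORMAT.format(
    record="ATOM", serial=1, name=" CA ", altloc=" ", resname="ALA",
    chain="A", resseq=1, icode=" ", x=-100.123, y=-200.456, z=12.0,
    occ=1.0, bfac=20.0, element="C",
)

# Column slicing (0-based slices of the 1-based column ranges) is unambiguous.
x, y, z = float(line[30:38]), float(line[38:46]), float(line[46:54])

# Whitespace splitting sees x and y as a single fused token and cannot
# convert it to a number.
tokens = line.split()
try:
    naive = tuple(float(token) for token in tokens[6:9])
except ValueError:
    naive = None

print((x, y, z), naive)
```

In this example the splitting approach fails loudly. On other records the token count can still come out right while the fields are shifted, which is the silent mis-read the text warns about.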
What MolScope reads from PDB: coordinates, element, atom name, residue
name/id, chain, the ATOM/HETATM flag, multiple MODEL records (NMR ensembles via
read_pdb_models), and CONECT bonds.
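The size limits listed above can be checked before a structure is written out as PDB. The following is a hedged sketch: pdb_limit_problems is a hypothetical helper, not a MolScope function, and it tests only the four field widths named in this section.

```python
def pdb_limit_problems(n_atoms, max_residue_number, chain_ids, atom_names):
    """Return a list of reasons a structure overflows PDB's fixed fields.

    Hypothetical helper; an empty list means the four checked fields fit.
    """
    problems = []
    if n_atoms > 99_999:                      # 5-digit atom serial
        problems.append("atom serial needs more than 5 digits")
    if max_residue_number > 9_999:            # 4-digit residue number
        problems.append("residue number needs more than 4 digits")
    long_chains = sorted({c for c in chain_ids if len(c) > 1})
    if long_chains:                           # 1-character chain id
        problems.append(f"chain ids longer than 1 character: {long_chains}")
    long_names = sorted({n for n in atom_names if len(n) > 4})
    if long_names:                            # 4-character atom name
        problems.append(f"atom names longer than 4 characters: {long_names}")
    return problems


# A small protein fits; a large assembly with two-letter chain ids does not.
print(pdb_limit_problems(3_000, 350, ["A", "B"], ["N", "CA", "C", "O"]))
print(pdb_limit_problems(150_000, 350, ["AA", "B"], ["N", "CA"]))
```

A non-empty result is a signal to write mmCIF instead of relying on format-specific workarounds.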
mmCIF: why it exists
mmCIF (macromolecular CIF) is the PDB's modern successor and the archival format of the wwPDB. It exists specifically to fix PDB's structural limits.
- It is a key/value and loop format, not fixed columns, so there are no 4-character or 5-digit ceilings. Large ribosomes and capsids that cannot fit in PDB are expressed cleanly.
- It is self-describing and dictionary-based: every column is a named data item (_atom_site.Cartn_x, _atom_site.label_comp_id, ...), validated against a published dictionary. New data items can be added without breaking parsers.
- It is extensible: experimental metadata, assemblies, and chemistry live in the same file under documented names.
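To make the "named columns" idea concrete, here is a deliberately simplified reader for an _atom_site loop. It is a sketch, not MolScope's built-in reader: read_atom_site is a hypothetical name, and splitting rows on whitespace is only valid because this sample contains no quoted or multi-line values, both of which real mmCIF allows.

```python
def read_atom_site(text):
    """Return _atom_site rows as dicts keyed by column name (simplified)."""
    columns, rows = [], []
    state = "seek"
    for raw in text.splitlines():
        line = raw.strip()
        if state == "seek":
            if not line.startswith("_atom_site."):
                continue                      # skip everything before the loop
            state = "header"
        if state == "header":
            if line.startswith("_atom_site."):
                columns.append(line.split(".", 1)[1])
                continue
            state = "rows"                    # first non-header line is data
        if not line or line.startswith(("#", "_", "loop_", "data_")):
            break                             # end of the _atom_site loop
        rows.append(dict(zip(columns, line.split())))
    return rows


# Made-up sample: serial above 99999, three-letter chain id, and a residue
# number above 9999, none of which fit PDB's fixed fields.
sample = "\n".join([
    "data_demo",
    "loop_",
    "_atom_site.group_PDB",
    "_atom_site.id",
    "_atom_site.type_symbol",
    "_atom_site.label_atom_id",
    "_atom_site.label_comp_id",
    "_atom_site.label_asym_id",
    "_atom_site.label_seq_id",
    "_atom_site.Cartn_x",
    "_atom_site.Cartn_y",
    "_atom_site.Cartn_z",
    "ATOM 100000 N N MET AAA 12345 10.000 20.500 -3.250",
    "ATOM 100001 C CA MET AAA 12345 11.200 21.100 -2.900",
    "#",
])
rows = read_atom_site(sample)
print(rows[0]["label_asym_id"], float(rows[1]["Cartn_x"]))
```

Because values are looked up by name, the columns could appear in any order, and extra columns would not disturb the reader.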
MolScope ships a lightweight built-in _atom_site reader and an optional
Gemmi backend (pip install "molscope[cif]")
for robust parsing and validation. Prefer mmCIF over PDB for anything large or
when you care about archival correctness.
SDF / MOL: small molecules with real chemistry
SDF (and the single-molecule MOL it wraps) is the cheminformatics format. Its V2000 connection table stores the chemistry that PDB and XYZ lack.
- Reliable: coordinates, elements, explicit bonds with bond orders, and formal charges (both the per-atom charge codes and M CHG lines).
- Missing: residues, chains, and biological context. SDF describes a molecule, not a macromolecular assembly.
- MolScope reads the V2000 format. The newer V3000 connection table is a different layout and is rejected with a clear error rather than mis-read; convert V3000 to V2000 with RDKit or OpenBabel first.
Use SDF for ligands, drug-like molecules, and any workflow where bonds, bond orders, and charges must be exact.
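As a rough illustration of what a V2000 connection table holds, the sketch below reads atoms, bond orders, and M CHG charges from a hand-written cyanide molblock, and refuses V3000 input. parse_v2000 is a hypothetical helper written for this page, not MolScope's reader, and it ignores everything except those three items.

```python
def parse_v2000(molblock):
    """Return (atoms, bonds, charges) from a V2000 molblock (simplified)."""
    lines = molblock.splitlines()
    counts = lines[3]                         # line 4 is the counts line
    if "V3000" in counts:
        raise ValueError("V3000 connection tables are not supported (only V2000)")
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])

    atoms = []
    for line in lines[4:4 + n_atoms]:         # fixed-width x, y, z, then symbol
        xyz = float(line[0:10]), float(line[10:20]), float(line[20:30])
        atoms.append((line[31:34].strip(), *xyz))

    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        # (first atom, second atom, bond type): 1 single, 2 double, 3 triple
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))

    charges = {}
    for line in lines[4 + n_atoms + n_bonds:]:
        if line.startswith("M  CHG"):
            fields = line.split()[3:]         # pairs of atom index, charge
            for index, charge in zip(fields[0::2], fields[1::2]):
                charges[int(index)] = int(charge)
    return atoms, bonds, charges


# Hand-written cyanide ion: C and N joined by a triple bond, -1 on the carbon.
cyanide = "\n".join([
    "cyanide",
    "  hand-written example",
    "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.1600    0.0000    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  3  0",
    "M  CHG  1   1  -1",
    "M  END",
])
atoms, bonds, charges = parse_v2000(cyanide)
print(atoms, bonds, charges)
```

The bond order and the formal charge are stated explicitly in the file, so neither has to be guessed from geometry.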
Multi-pose SDF and docking output
Docking tools (AutoDock Vina, Gnina, Smina) write one SDF record per pose, with
the score in a > <tag> data field. read_sdf reads the
first record; read_sdf_frames reads every pose as a list
of molecules, keeping each pose's 3D coordinates and exposing its data fields via
Molecule.properties:
```python
import molscope as ms

poses = ms.read_sdf_frames("vina_out.sdf")  # core install, no extras
best = min(poses, key=lambda m: float(m.properties["minimizedAffinity"]))
print(best.name, best.properties["minimizedAffinity"])
```
Because the 3D pose is preserved (unlike an RDKit round-trip through SMILES),
each pose flows straight into MolScope's descriptor, contact-map, and diversity
tools, so a hit list can be ranked, de-duplicated, and triaged in place. Common
score tags are minimizedAffinity (Vina/Smina) and CNNaffinity / CNNscore
(Gnina); the values are kept as raw strings, so cast to float before sorting.
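For readers unfamiliar with the file layout behind this, the standard-library sketch below shows how records and data fields are delimited: each record ends with a $$$$ line, and each data field is a "> <tag>" header followed by its value and a blank line. read_records and data_fields are hypothetical helpers and the two poses are fabricated placeholders; this is not how MolScope is implemented.

```python
def read_records(text):
    """Split SDF text into records, each a list of lines, on $$$$ lines."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append(current)
            current = []
        else:
            current.append(line)
    return records


def data_fields(record_lines):
    """Collect '> <tag>' data fields from one record as raw strings."""
    fields, tag = {}, None
    for line in record_lines:
        if line.startswith(">") and "<" in line:
            start = line.index("<") + 1
            tag = line[start:line.index(">", start)]
            fields[tag] = []
        elif tag is not None:
            if line.strip():
                fields[tag].append(line)
            else:
                tag = None                    # a blank line ends the field
    return {name: "\n".join(value) for name, value in fields.items()}


def placeholder_pose(name, score):
    # Empty molblock plus one score field; placeholder data for the sketch.
    return "\n".join([
        name, "  placeholder", "",
        "  0  0  0  0  0  0  0  0  0  0999 V2000",
        "M  END",
        "> <minimizedAffinity>",
        score,
        "",
        "$$$$",
    ])


sdf_text = "\n".join([placeholder_pose("pose_1", "-7.2"),
                      placeholder_pose("pose_2", "-8.9")])
scored = [(record[0], data_fields(record)) for record in read_records(sdf_text)]
# Values are strings, so cast before comparing.
best_name, best_fields = min(
    scored, key=lambda item: float(item[1]["minimizedAffinity"])
)
print(best_name, best_fields["minimizedAffinity"])
```

Comparing the raw strings instead of the cast values would order them lexicographically, which is why the cast to float matters.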
Choosing a format
- Quantum chemistry, tiny molecules, geometry only: XYZ.
- Proteins and nucleic acids, broad tool compatibility: PDB (small/medium) or mmCIF (large or archival).
- Small molecules where chemistry matters: SDF.
Remember that converting down loses metadata permanently: PDB to XYZ drops
residues and chains; anything to XYZ drops bonds. MolScope's
Molecule keeps whatever the source provided, and
writers emit only what the target format supports.
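The capability table at the top of this page can be turned into a small lookup that answers "what do I lose if I convert A to B?". This is a sketch under a stated simplification: entries marked "optional", "rarely", or "CONECT only" in the table are counted as "can carry". CAPABILITIES and lost_on_conversion are hypothetical names, not part of MolScope.

```python
# Transcribed from the capability table; optional/rare support counts as support.
CAPABILITIES = {
    "xyz": {"coordinates", "elements", "frames"},
    "pdb": {"coordinates", "elements", "atom names", "residues", "chains",
            "bonds", "formal charges", "frames", "ATOM/HETATM"},
    "mmcif": {"coordinates", "elements", "atom names", "residues", "chains",
              "bonds", "bond orders", "formal charges", "frames",
              "ATOM/HETATM"},
    "sdf": {"coordinates", "elements", "bonds", "bond orders",
            "formal charges", "frames"},
}


def lost_on_conversion(source, target):
    """List what the source format can carry that the target cannot."""
    return sorted(CAPABILITIES[source] - CAPABILITIES[target])


print(lost_on_conversion("pdb", "xyz"))
print(lost_on_conversion("sdf", "pdb"))
print(lost_on_conversion("pdb", "mmcif"))
```

An empty result, as for PDB to mmCIF, means the target can represent everything the source can; it does not guarantee that a particular file actually contained that data.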
Handling malformed files
MolScope's readers report the file, format, and line when input is malformed,
instead of failing with a raw ValueError from deep inside a parser:
```text
structure.pdb: invalid PDB file (line 42): could not read coordinate columns 31-54 from record: 'ATOM ...'
frame.xyz: invalid XYZ file (line 1): header declares 30 atoms but 28 were found
model.cif: invalid mmCIF file: _atom_site loop is missing coordinate column(s) ['Cartn_x']; found columns [...]
ligand.sdf: invalid SDF file (line 4): V3000 connection tables are not supported (only V2000); convert the file with RDKit or OpenBabel first
```
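Messages of this shape are straightforward to produce by wrapping low-level failures in one exception that carries the file, format, and line. The sketch below shows the pattern in isolation. MalformedFileError and check_xyz_count are hypothetical names, and this page does not say which exception type MolScope actually raises.

```python
class MalformedFileError(ValueError):
    """Hypothetical error that records where and why parsing failed."""

    def __init__(self, path, fmt, message, line=None):
        where = f" (line {line})" if line is not None else ""
        super().__init__(f"{path}: invalid {fmt} file{where}: {message}")
        self.path, self.fmt, self.line = path, fmt, line


def check_xyz_count(path, text):
    """Compare the declared atom count with the atom lines actually present."""
    lines = text.splitlines()
    declared = int(lines[0].split()[0])
    found = sum(1 for line in lines[2:] if line.strip())
    if found != declared:
        raise MalformedFileError(
            path, "XYZ",
            f"header declares {declared} atoms but {found} were found",
            line=1,
        )


# A frame whose header promises three atoms but lists only two.
truncated = "3\ncomment\nO 0.0 0.0 0.0\nH 0.0 0.0 1.0\n"
try:
    check_xyz_count("frame.xyz", truncated)
    message = None
except MalformedFileError as error:
    message = str(error)
print(message)
```

Subclassing ValueError in the sketch keeps existing "except ValueError" handlers working while giving callers structured fields to inspect.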
See Reading molecular files for the reader/writer API.