Coordinate formats compared: XYZ, PDB, mmCIF, SDF

MolScope reads four common molecular coordinate formats. They look interchangeable (each one lists atoms and positions), but they differ sharply in what metadata they carry and how reliably they carry it. Choosing the right format, and knowing what you lose when you convert between them, avoids silent data loss.

What each format stores

| Capability | XYZ | PDB | mmCIF | SDF / MOL |
|---|---|---|---|---|
| 3D coordinates | yes | yes | yes | yes |
| Element symbols | optional | yes | yes | yes |
| Atom names | no | yes | yes | no |
| Residues and residue ids | no | yes | yes | no |
| Chains | no | yes (1 char) | yes | no |
| Explicit bonds | no | CONECT only | optional | yes |
| Bond orders | no | no | optional | yes |
| Formal charges | no | rarely | optional | yes |
| Multiple models / frames | yes (frames) | yes (MODEL) | yes | yes (records) |
| ATOM vs HETATM distinction | no | yes | yes (group_PDB) | no |

"Optional" means the format can express it but many files omit it; MolScope reads it when present.

XYZ: coordinates and not much else

XYZ is the lowest common denominator: an atom count, a comment line, then element x y z per atom. MolScope also accepts bare x y z dumps (with # comments) and multi-frame trajectories via read_xyz_frames.

  • Reliable: coordinates, and element symbols when the first column is present.
  • Missing: atom names, residues, chains, bonds, charges. There is no concept of topology, so mol.residue_groups() or secondary_structure() will not work on an XYZ structure.

Reach for XYZ for quantum-chemistry inputs/outputs, small molecules, and quick geometry dumps where connectivity does not matter.
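
The layout described above (atom count, comment line, then one element x y z row per atom) is simple enough to sketch in plain Python. This is an illustrative stdlib-only parser, not MolScope's read_xyz_frames; the function name parse_xyz_frames and the sample data are assumptions made for the example, and unlike MolScope it requires the element column.

```python
def parse_xyz_frames(text, source="<string>"):
    """Parse concatenated XYZ frames into lists of (element, x, y, z) tuples."""
    lines = text.splitlines()
    frames = []
    i = 0
    while i < len(lines):
        if not lines[i].strip():            # tolerate blank lines between frames
            i += 1
            continue
        header_line = i + 1                 # 1-based, for error messages
        try:
            declared = int(lines[i].split()[0])
        except ValueError:
            raise ValueError(
                f"{source}: invalid XYZ file (line {header_line}): "
                f"expected an atom count, got {lines[i]!r}"
            ) from None
        atoms = []
        # lines[i + 1] is the free-form comment line, so atoms start at i + 2.
        for row in lines[i + 2 : i + 2 + declared]:
            fields = row.split()
            if len(fields) < 4:             # ran into a blank line or the next header
                break
            x, y, z = (float(value) for value in fields[1:4])
            atoms.append((fields[0], x, y, z))
        if len(atoms) != declared:
            raise ValueError(
                f"{source}: invalid XYZ file (line {header_line}): "
                f"header declares {declared} atoms but {len(atoms)} were found"
            )
        frames.append(atoms)
        i += 2 + declared
    return frames


WATER_TRAJECTORY = """3
frame 0
O 0.000 0.000 0.000
H 0.757 0.586 0.000
H -0.757 0.586 0.000
3
frame 1
O 0.000 0.000 0.100
H 0.757 0.586 0.100
H -0.757 0.586 0.100
"""

frames = parse_xyz_frames(WATER_TRAJECTORY)
print(len(frames), "frames;", len(frames[0]), "atoms in the first")
```

Nothing in a frame says which atoms are bonded or which residue they belong to, which is why topology-dependent calls have nothing to work with on XYZ input.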

PDB: convenient but messy

The Protein Data Bank format is the workhorse of structural biology, and almost everything reads it. Its problems come from its age: it is a fixed-column format frozen around 80-character punched-card records.

  • Fields live in fixed columns, not whitespace-separated tokens, so MolScope slices columns (e.g. coordinates from columns 31-54). Whitespace splitting silently misreads real files, for example when adjacent coordinate fields fill their columns and run together without a space.
  • Hard size limits leak into the science: 4 characters for an atom name, 1 character for a chain id, 4 digits for a residue number, 5 for an atom serial. Large assemblies overflow these and need nonstandard workarounds (such as hybrid-36 numbering) or mmCIF.
  • Connectivity is mostly implicit: bonds appear only in optional CONECT records (typically just for ligands/HETATM), so MolScope infers most bonds from geometry.
  • Alternate conformations (altLoc) and insertion codes complicate "one residue, one atom" assumptions; read_pdb(..., altloc=...) selects a policy.
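
To make the fixed-column point concrete, here is a minimal sketch of column slicing for a single ATOM record, using the column positions from the wwPDB format description. It is not MolScope's parser; parse_atom_record is a hypothetical helper, and the record is synthetic, built with format widths so every field lands in its exact column.

```python
def parse_atom_record(line):
    """Slice the fixed columns of a PDB ATOM/HETATM record (1-based columns in comments)."""
    return {
        "record": line[0:6].strip(),      # columns 1-6
        "serial": int(line[6:11]),        # columns 7-11
        "name": line[12:16].strip(),      # columns 13-16
        "altloc": line[16].strip(),       # column 17
        "resname": line[17:20].strip(),   # columns 18-20
        "chain": line[21].strip(),        # column 22
        "resseq": int(line[22:26]),       # columns 23-26
        "icode": line[26].strip(),        # column 27
        "x": float(line[30:38]),          # columns 31-38
        "y": float(line[38:46]),          # columns 39-46
        "z": float(line[46:54]),          # columns 47-54
    }


# Synthetic record: a 4-digit residue number and large negative coordinates,
# so neighbouring fields touch each other with no separating space.
line = (
    f"{'ATOM':<6}{12345:>5} {' CA ':<4}{' ':1}{'ALA':>3} {'A':1}{1234:>4}{' ':1}   "
    f"{-100.123:8.3f}{-200.456:8.3f}{-300.789:8.3f}{1.0:6.2f}{20.0:6.2f}"
    f"{'':10}{'C':>2}"
)

atom = parse_atom_record(line)
print(atom["chain"], atom["resseq"], atom["x"], atom["y"], atom["z"])

# Whitespace splitting fuses the chain id with the residue number and
# merges all three coordinates into a single token.
print(line.split())
```

The slicing version recovers every field, while the split version yields fewer tokens than there are fields and gives no error to signal it.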

What MolScope reads from PDB: coordinates, element, atom name, residue name/id, chain, the ATOM/HETATM flag, multiple MODEL records (NMR ensembles via read_pdb_models), and CONECT bonds.

mmCIF: why it exists

mmCIF (macromolecular CIF) is the PDB's modern successor and the archival format of the wwPDB. It exists specifically to fix PDB's structural limits.

  • It is a key/value and loop format, not fixed columns, so there are no 4-character or 5-digit ceilings. Large ribosomes and capsids that cannot fit in PDB are expressed cleanly.
  • It is self-describing and dictionary-based: every column is a named data item (_atom_site.Cartn_x, _atom_site.label_comp_id, ...), validated against a published dictionary. New data items can be added without breaking parsers.
  • It is extensible: experimental metadata, assemblies, and chemistry live in the same file under documented names.
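
The named-column idea can be sketched with a toy _atom_site loop reader. This is a deliberately simplified illustration, not MolScope's built-in reader or Gemmi: it assumes one row per line and no quoted or multi-line values, which real mmCIF files do not guarantee. The function name read_atom_site and the sample rows are made up for the example.

```python
COORD_COLUMNS = ("Cartn_x", "Cartn_y", "Cartn_z")


def read_atom_site(text, source="<string>"):
    """Return _atom_site rows as dicts keyed by item name (category prefix removed)."""
    columns, rows = [], []
    in_loop = False
    for raw in text.splitlines():
        ln = raw.strip()
        if ln.startswith("_atom_site."):
            columns.append(ln.split(".", 1)[1])   # e.g. "Cartn_x"
            in_loop = True
        elif in_loop:
            if not ln or ln.startswith(("#", "_", "loop_", "data_")):
                break                              # end of the _atom_site loop
            # Assumption: plain whitespace-separated values, no quoting.
            rows.append(dict(zip(columns, ln.split())))
    missing = [name for name in COORD_COLUMNS if name not in columns]
    if missing:
        raise ValueError(
            f"{source}: invalid mmCIF file: _atom_site loop is missing "
            f"coordinate column(s) {missing}; found columns {columns}"
        )
    return rows


SAMPLE = """
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_seq_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
ATOM 100000 N N ALA AAA 10000 11.104 6.134 -6.504
HETATM 100001 O O HOH W . 5.000 1.250 -2.000
#
"""

for row in read_atom_site(SAMPLE):
    print(row["id"], row["label_asym_id"], row["label_seq_id"], float(row["Cartn_x"]))
```

Because values are looked up by item name, the atom id 100000, the chain label AAA, and the residue number 10000 all pass through unchanged, although none of them would fit the corresponding PDB columns. Column order in the file does not matter either.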

MolScope ships a lightweight built-in _atom_site reader and an optional Gemmi backend (pip install "molscope[cif]") for robust parsing and validation. Prefer mmCIF over PDB for anything large or when you care about archival correctness.

SDF / MOL: small molecules with real chemistry

SDF (and the single-molecule MOL it wraps) is the cheminformatics format. Its V2000 connection table stores the chemistry that PDB and XYZ lack.

  • Reliable: coordinates, elements, explicit bonds with bond orders, and formal charges (both the per-atom charge codes and M CHG lines).
  • Missing: residues, chains, and biological context. SDF describes a molecule, not a macromolecular assembly.
  • MolScope reads the V2000 format. The newer V3000 connection table uses a different layout and is rejected with a clear error rather than misread; convert V3000 to V2000 with RDKit or OpenBabel first.

Use SDF for ligands, drug-like molecules, and any workflow where bonds, bond orders, and charges must be exact.
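
As a rough illustration of what the V2000 connection table holds, the sketch below reads the counts line, atom block, bond block, and M  CHG lines of a hand-written molblock using the fixed columns from the CTfile specification. It is not MolScope's reader; parse_v2000 and the formate fragment are assumptions for the example, and radicals, isotopes, and other property lines are ignored.

```python
# Atom-block charge codes from the CTfile specification (code 4 is a doublet radical).
CHARGE_CODES = {0: 0, 1: 3, 2: 2, 3: 1, 5: -1, 6: -2, 7: -3}


def parse_v2000(molblock, source="<string>"):
    """Return (atoms, bonds) from a V2000 molblock; bonds are (atom1, atom2, order)."""
    lines = molblock.splitlines()
    counts = lines[3]                       # line 4 of the molblock
    if "V3000" in counts:
        raise ValueError(
            f"{source}: invalid SDF file (line 4): V3000 connection tables "
            "are not supported (only V2000)"
        )
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    atoms = [
        {
            "xyz": (float(row[0:10]), float(row[10:20]), float(row[20:30])),
            "element": row[31:34].strip(),
            "charge": CHARGE_CODES.get(int(row[36:39]), 0),
        }
        for row in lines[4 : 4 + n_atoms]
    ]
    bond_start = 4 + n_atoms
    bonds = [
        (int(row[0:3]), int(row[3:6]), int(row[6:9]))
        for row in lines[bond_start : bond_start + n_bonds]
    ]
    # When M  CHG lines are present they supersede the atom-block charge codes.
    charge_rows = [r for r in lines[bond_start + n_bonds :] if r.startswith("M  CHG")]
    if charge_rows:
        for atom in atoms:
            atom["charge"] = 0
    for row in charge_rows:
        pairs = row[6:].split()[1:]         # drop the entry count
        for index, value in zip(pairs[0::2], pairs[1::2]):
            atoms[int(index) - 1]["charge"] = int(value)
    return atoms, bonds


MOLBLOCK = "\n".join([
    "formate (heavy atoms only)",
    "  hand-written example",
    "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.2000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "   -0.6000    1.0400    0.0000 O   0  5  0  0  0  0  0  0  0  0  0  0",
    "  1  2  2  0",
    "  1  3  1  0",
    "M  CHG  1   3  -1",
    "M  END",
])

atoms, bonds = parse_v2000(MOLBLOCK)
print([(a["element"], a["charge"]) for a in atoms], bonds)
```

The double bond and the negative charge on the second oxygen are stated outright in the file, so nothing has to be inferred from geometry.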

Multi-pose SDF and docking output

Docking tools (AutoDock Vina, Gnina, Smina) write one SDF record per pose, with the score in a > <tag> data field. read_sdf reads the first record; read_sdf_frames reads every pose as a list of molecules, keeping each pose's 3D coordinates and exposing its data fields via Molecule.properties:

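import molscope as ms

# One Molecule per pose, each keeping its own 3D coordinates and data fields.
poses = ms.read_sdf_frames("vina_out.sdf")        # core install, no extras
# Property values are raw strings, so cast before comparing; the lowest affinity is the best pose.
best = min(poses, key=lambda m: float(m.properties["minimizedAffinity"]))
print(best.name, best.properties["minimizedAffinity"])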

Because the 3D pose is preserved (unlike an RDKit round-trip through SMILES), each pose flows straight into MolScope's descriptor, contact-map, and diversity tools, so a hit list can be ranked, de-duplicated, and triaged in place. Common score tags are minimizedAffinity (Vina/Smina) and CNNaffinity / CNNscore (Gnina); the values are kept as raw strings, so cast to float before sorting.
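
For readers curious about what sits behind those properties, this stdlib-only sketch splits a multi-record SDF on its $$$$ delimiters and collects the > <tag> data fields as raw strings. It is not how MolScope implements read_sdf_frames; the helper names and the two single-atom poses are invented for the example.

```python
import re

DATA_HEADER = re.compile(r"^>.*<([^<>]+)>")   # matches "> <tag>" data headers


def split_sdf_records(text):
    """Split SDF text into records (lists of lines), one per $$$$ delimiter."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append(current)
            current = []
        else:
            current.append(line)
    return records


def read_properties(record_lines):
    """Collect data fields; each value runs until the next blank line."""
    props, i = {}, 0
    while i < len(record_lines):
        match = DATA_HEADER.match(record_lines[i])
        i += 1
        if not match:
            continue
        value_lines = []
        while i < len(record_lines) and record_lines[i].strip():
            value_lines.append(record_lines[i])
            i += 1
        props[match.group(1)] = "\n".join(value_lines)   # kept as a raw string
    return props


def example_pose(name, affinity):
    """Build a single-atom record; the molblock is shortened to keep the sample small."""
    return [
        name, "  hand-written example", "",
        "  1  0  0  0  0  0  0  0  0  0999 V2000",
        "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
        "M  END",
        "> <minimizedAffinity>", affinity, "",
        "$$$$",
    ]


SDF_TEXT = "\n".join(example_pose("pose_1", "-7.4") + example_pose("pose_2", "-8.1")) + "\n"

poses = [(rec[0].strip(), read_properties(rec)) for rec in split_sdf_records(SDF_TEXT)]
best = min(poses, key=lambda pose: float(pose[1]["minimizedAffinity"]))
print(best)
```

Comparing the raw strings instead would rank "-7.4" below "-8.1", because strings are compared character by character; that is the reason for the float cast.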

Choosing a format

  • Quantum chemistry, tiny molecules, geometry only: XYZ.
  • Proteins and nucleic acids, broad tool compatibility: PDB (small/medium) or mmCIF (large or archival).
  • Small molecules where chemistry matters: SDF.
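
The PDB-versus-mmCIF decision can be made mechanically from PDB's column widths. The sketch below is one possible check, using the limits listed earlier (5-character serial, 4-character atom name, 1-character chain id, 4-character residue number); AtomId, pdb_overflows, and pick_macromolecule_format are hypothetical names, not part of MolScope.

```python
from dataclasses import dataclass


@dataclass
class AtomId:
    """Hypothetical minimal atom identity, just the fields PDB stores in narrow columns."""
    serial: int
    name: str
    chain: str
    resseq: int


def pdb_overflows(atoms):
    """Return the sorted set of PDB fields that at least one atom does not fit."""
    problems = set()
    for atom in atoms:
        if len(str(atom.serial)) > 5:
            problems.add("atom serial")
        if len(atom.name) > 4:
            problems.add("atom name")
        if len(atom.chain) > 1:
            problems.add("chain id")
        if len(str(atom.resseq)) > 4:       # the sign of a negative number uses a column too
            problems.add("residue number")
    return sorted(problems)


def pick_macromolecule_format(atoms):
    """Suggest PDB when everything fits its columns, otherwise mmCIF."""
    return "mmCIF" if pdb_overflows(atoms) else "PDB"


small = [AtomId(1, "CA", "A", 1), AtomId(2, "CB", "A", 1)]
large = [AtomId(100000, "CA", "AA", 10000)]
print(pick_macromolecule_format(small), pick_macromolecule_format(large))
print(pdb_overflows(large))
```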

Remember that converting down loses metadata permanently: PDB to XYZ drops residues and chains; anything to XYZ drops bonds. MolScope's Molecule keeps whatever the source provided, and writers emit only what the target format supports.
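
One way to see what a conversion drops is to encode the capability table as sets and take the difference. This is only a sketch of that idea: the CAPABILITIES mapping is transcribed from the table in this page, any entry other than "no" is counted as expressible (including PDB's CONECT-only bonds and rarely used charges), and lost_on_conversion is a hypothetical helper, not a MolScope function.

```python
BASE = {"coordinates", "elements", "frames"}
MACRO = {"atom names", "residues", "chains", "ATOM/HETATM"}

# What each format can express at all, per the capability table above.
CAPABILITIES = {
    "XYZ": BASE,
    "PDB": BASE | MACRO | {"bonds", "formal charges"},
    "mmCIF": BASE | MACRO | {"bonds", "bond orders", "formal charges"},
    "SDF": BASE | {"bonds", "bond orders", "formal charges"},
}


def lost_on_conversion(source_format, target_format):
    """Return the capabilities the source can carry but the target cannot."""
    return CAPABILITIES[source_format] - CAPABILITIES[target_format]


for src, dst in [("PDB", "XYZ"), ("SDF", "PDB"), ("mmCIF", "SDF"), ("XYZ", "PDB")]:
    print(f"{src} -> {dst}: loses {sorted(lost_on_conversion(src, dst)) or 'nothing'}")
```

An empty result only means the target can represent everything the source can; it does not mean the target gains information the source never had, so XYZ to PDB still has no residues to write.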

Handling malformed files

MolScope's readers report the file, format, and line when input is malformed, instead of failing with a raw ValueError from deep inside a parser:

structure.pdb: invalid PDB file (line 42): could not read coordinate columns 31-54 from record: 'ATOM ...'
frame.xyz: invalid XYZ file (line 1): header declares 30 atoms but 28 were found
model.cif: invalid mmCIF file: _atom_site loop is missing coordinate column(s) ['Cartn_x']; found columns [...]
ligand.sdf: invalid SDF file (line 4): V3000 connection tables are not supported (only V2000); convert the file with RDKit or OpenBabel first
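
The pattern behind messages like these can be sketched as an exception that carries the file, format, and line, raised in place of the low-level error. The class name MolFormatError and the helper read_coordinates are hypothetical and chosen for illustration; MolScope's actual exception types are not described on this page.

```python
class MolFormatError(ValueError):
    """Hypothetical error that prefixes a parse failure with file, format, and line."""

    def __init__(self, path, fmt, message, line=None):
        where = f" (line {line})" if line is not None else ""
        super().__init__(f"{path}: invalid {fmt} file{where}: {message}")
        self.path, self.fmt, self.line = path, fmt, line


def read_coordinates(path, line_no, record):
    """Read x, y, z from PDB columns 31-54, wrapping failures with context."""
    try:
        return tuple(float(record[start : start + 8]) for start in (30, 38, 46))
    except ValueError:
        # Replace the bare float() error with one that says where the problem is.
        raise MolFormatError(
            path, "PDB",
            f"could not read coordinate columns 31-54 from record: {record!r}",
            line=line_no,
        ) from None


try:
    read_coordinates("structure.pdb", 42, "ATOM      1  N   ALA A   1")
except MolFormatError as exc:
    print(exc)
```

Subclassing ValueError in this sketch keeps existing `except ValueError` handlers working while giving callers the path and line as attributes.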

See Reading molecular files for the reader/writer API.