Reading Molecular Files

Use ms.read() to dispatch by file extension:

import molscope as ms

mol = ms.read("structure.pdb")
mol = ms.read("trajectory.xyz")
mol = ms.read("small_molecule.sdf")
mol = ms.read("structure.cif")
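Dispatch by extension can be pictured as a lookup from the lowercased file suffix to a format key. The sketch below is illustrative only: the READERS table and the pick_reader helper are hypothetical names, not part of molscope, and the library's real suffix list may differ.

```python
from pathlib import Path

# Hypothetical suffix table; the mapping inside molscope may differ.
READERS = {
    ".pdb": "pdb",
    ".xyz": "xyz",
    ".sdf": "sdf",
    ".mol": "sdf",
    ".cif": "cif",
    ".mmcif": "cif",
}


def pick_reader(path):
    """Return the format key for a path, based on its lowercased suffix."""
    suffix = Path(path).suffix.lower()
    try:
        return READERS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file extension: {suffix!r}") from None


print(pick_reader("structure.PDB"))
print(pick_reader("ligands/small_molecule.mol"))
```

When a file carries a misleading extension, calling a format-specific reader such as ms.read_pdb or ms.read_cif directly should sidestep the lookup.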

Supported formats:

Format	Notes
PDB	Fixed-column parser for `ATOM`/`HETATM`; preserves insertion codes and `CONECT` bonds.
XYZ	Single-frame and multi-frame XYZ files.
SDF/MOL	V2000 atom and bond block reader; preserves bond orders and formal charges.
CIF/mmCIF	Reader for standard `_atom_site` coordinate loops, including quoted values and `_atom_site.pdbx_PDB_ins_code`.
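To see why a fixed-column parser matters for PDB, note that the residue number (columns 23-26) and the insertion code (column 27) sit next to each other with no separator, so splitting on whitespace merges them. The following is a minimal sketch of column slicing, not molscope's actual parser; format_atom_line and parse_atom_line are hypothetical helpers, and the occupancy and B-factor are fixed at 1.00 and 0.00 for brevity.

```python
def format_atom_line(serial, name, resname, chain, resid, icode, x, y, z, element):
    """Build an ATOM record with every field in its fixed column range."""
    return (
        f"{'ATOM':<6s}{serial:>5d} {name:<4s}{'':1s}{resname:>3s} {chain:1s}"
        f"{resid:>4d}{icode:1s}{'':3s}{x:8.3f}{y:8.3f}{z:8.3f}"
        f"{1.0:6.2f}{0.0:6.2f}{'':10s}{element:>2s}"
    )


def parse_atom_line(line):
    """Slice an ATOM/HETATM record by column (0-based slices of 1-based columns)."""
    return {
        "record": line[0:6].strip(),     # columns 1-6
        "serial": int(line[6:11]),       # columns 7-11
        "name": line[12:16].strip(),     # columns 13-16
        "altloc": line[16].strip(),      # column 17
        "resname": line[17:20].strip(),  # columns 18-20
        "chain": line[21].strip(),       # column 22
        "resid": int(line[22:26]),       # columns 23-26
        "icode": line[26].strip(),       # column 27
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        "occupancy": float(line[54:60]),
        "element": line[76:78].strip(),  # columns 77-78
    }


line = format_atom_line(1, " CA ", "ALA", "A", 52, "A", 11.104, 6.134, -6.504, "C")
atom = parse_atom_line(line)
print(atom["resid"], repr(atom["icode"]))
print(line.split()[5])  # whitespace splitting fuses the number and insertion code
```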

For what each format stores, which metadata is reliable, and why PDB and mmCIF differ, see Coordinate formats compared.

PDB alternate conformations can be selected explicitly:

mol = ms.read_pdb("structure.pdb", altloc="primary")
mol = ms.read_pdb("structure.pdb", altloc="highest_occupancy")

Supported policies are primary, first, highest_occupancy, and all.
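As a rough illustration of what a highest-occupancy policy involves, the sketch below keeps one record per atom site. It is an assumption-laden example rather than molscope's implementation: select_highest_occupancy is a hypothetical helper, the selection is made per atom rather than per residue, and ties keep the first record seen. The library may resolve these cases differently.

```python
def select_highest_occupancy(atoms):
    """Keep one record per atom site: the alternate location with the highest occupancy."""
    best = {}
    for atom in atoms:
        # An atom site is identified here by chain, residue number,
        # insertion code, and atom name (an assumption of this sketch).
        site = (atom["chain"], atom["resid"], atom["icode"], atom["name"])
        kept = best.get(site)
        if kept is None or atom["occupancy"] > kept["occupancy"]:
            best[site] = atom
    return list(best.values())


atoms = [
    {"chain": "A", "resid": 10, "icode": "", "name": "CA", "altloc": "", "occupancy": 1.0},
    {"chain": "A", "resid": 10, "icode": "", "name": "CB", "altloc": "A", "occupancy": 0.4},
    {"chain": "A", "resid": 10, "icode": "", "name": "CB", "altloc": "B", "occupancy": 0.6},
]
selected = select_highest_occupancy(atoms)
print([(a["name"], a["altloc"]) for a in selected])
```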

Residue numbers remain available as the integer mol.resids array. PDB/mmCIF insertion codes are available as mol.icodes, and full per-atom identities are available as mol.residue_ids.
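The reason insertion codes need their own array is that the integer residue number alone does not identify a residue: residues 52 and 52A share the number 52. The sketch below shows the idea with plain lists. The tuple layout is an assumption made for illustration; the actual element type of mol.residue_ids is not described here, and residue_keys is a hypothetical helper.

```python
def residue_keys(chains, resids, icodes):
    """Combine chain, residue number, and insertion code into one key per atom."""
    return list(zip(chains, resids, icodes))


chains = ["A", "A", "A", "A"]
resids = [51, 52, 52, 53]       # residues 52 and 52A share the integer 52
icodes = ["", "", "A", ""]

keys = residue_keys(chains, resids, icodes)
print(len(set(resids)), len(set(keys)))
```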

The built-in CIF reader handles standard atom-site coordinate loops with quoted values and semicolon-delimited text fields. Install molscope[cif] to use the optional Gemmi parser and validation helpers:

mol = ms.read_cif("structure.cif", parser="gemmi")
report = ms.validate_cif("structure.cif")
report.raise_for_errors()

Dictionary-aware validation is available when you provide local dictionary files:

report = ms.validate_cif("structure.cif", dictionaries=["mmcif_pdbx_v50.dic"])

Download a structure from RCSB:

mol = ms.fetch("1fqy")

From a SMILES string

ms.read_smiles() builds a Molecule from a SMILES string by generating one 3D conformer with RDKit (requires the chem extra):

mol = ms.read_smiles("CC(=O)O")                          # acetic acid
mol = ms.read_smiles("c1ccccc1", add_hs=False, seed=7)   # heavy atoms only, reproducible

Bonds, Kekulé bond orders, and formal charges come from RDKit. The coordinates are a generated conformer, not an experimental or energy-minimised structure. They are well suited as input for descriptors and graph-ML, where topology matters, but treat geometry-dependent results (contact maps, RMSD against experiment, precise distances) with care. An invalid SMILES string, or one that RDKit cannot embed, raises ValueError. For an experimental geometry, read a PDB, mmCIF, or SDF file instead.

Writing

ms.write_xyz(mol, "out.xyz")
ms.write_pdb(mol, "out.pdb")
ms.write_sdf(mol, "out.sdf")   # coords, bonds, Kekulé orders, formal charges, properties
ms.write_cif(mol, "out.cif")   # minimal _atom_site loop (coords + residue/chain metadata)

All four writers are dependency-free and round-trip with the matching readers. Two scope limits are worth knowing. SDF is an atom-and-bond format with no chain or residue metadata, and V2000 caps it at 999 atoms and 999 bonds. write_cif emits a coordinate-only mmCIF: an _atom_site loop with no symmetry, anisotropy, or bonds.
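The 999 limit follows from the V2000 counts line, which stores the atom and bond counts in two 3-character fields. A minimal sketch of building that line with a range check is shown below; v2000_counts_line is a hypothetical helper written for illustration and is not how write_sdf is necessarily implemented.

```python
def v2000_counts_line(n_atoms, n_bonds):
    """Return a V2000 counts line, refusing counts that overflow 3 characters."""
    for label, count in (("atoms", n_atoms), ("bonds", n_bonds)):
        if not 0 <= count <= 999:
            raise ValueError(
                f"V2000 cannot store {count} {label}: the count field is 3 characters wide"
            )
    # Atom count, bond count, eight unused 3-character fields set to 0,
    # then the conventional "999" property-line count and the version tag.
    return f"{n_atoms:3d}{n_bonds:3d}" + "  0" * 8 + "999 V2000"


print(repr(v2000_counts_line(7, 6)))
```

A molecule that exceeds the limit would need to be split or written in another format such as PDB or mmCIF.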

Multi-frame output

write_frames is the write-side counterpart to the multi-frame readers and stream. It takes a list or a generator of molecules and appends them one at a time, so you can filter, slice, or align an ensemble or trajectory-lite stream and save the subset without holding it all in memory. The output format follows the file extension, and the function returns the number of frames written.

frames = (m for m in ms.stream("traj.pdb") if m.radius_of_gyration < 15)
n = ms.write_frames(frames, "compact.pdb")   # MODEL/ENDMDL blocks
# `models` and `poses` stand for any list or generator of molecules
ms.write_frames(models, "ensemble.xyz")      # concatenated XYZ frames
ms.write_frames(poses, "hits.sdf")           # $$$$-delimited records, bonds kept

write_frames supports .pdb, .xyz, and .sdf (not mmCIF, which has no multi-frame form). Frames need not share an atom count, so SDF records of different sizes are fine. Multi-frame PDB output omits CONECT records because per-model serial numbers make a single global record ambiguous, so bonds are re-inferred on read; use .sdf if you need per-frame bonds.
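The streaming behaviour described above can be sketched for the XYZ case with a small standalone writer. This is an illustration under stated assumptions, not the code behind write_frames: write_xyz_frames is a hypothetical helper, and each frame is represented as a plain (symbols, coords, comment) tuple instead of a Molecule.

```python
import io


def write_xyz_frames(frames, handle):
    """Write frames one at a time to an open text handle; return the frame count."""
    count = 0
    for symbols, coords, comment in frames:
        # Each XYZ frame carries its own atom count, so frames may differ in size.
        handle.write(f"{len(symbols)}\n{comment}\n")
        for symbol, (x, y, z) in zip(symbols, coords):
            handle.write(f"{symbol:<2s} {x:12.6f} {y:12.6f} {z:12.6f}\n")
        count += 1
    return count


frames = [
    (["O", "H", "H"], [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)], "water"),
    (["He"], [(0.0, 0.0, 0.0)], "helium"),
]

buffer = io.StringIO()
# A generator is consumed lazily, so only one frame is held at a time.
n_written = write_xyz_frames((frame for frame in frames), buffer)
print(n_written)
```

Because the writer only iterates once over its input, a filtering generator can be passed in place of a list without materialising the whole ensemble.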