Reading Molecular Files¶
Use ms.read() to dispatch by file extension:
import molscope as ms
mol = ms.read("structure.pdb")
mol = ms.read("trajectory.xyz")
mol = ms.read("small_molecule.sdf")
mol = ms.read("structure.cif")
Supported formats:
| Format | Notes |
|---|---|
| PDB | Fixed-column parser for ATOM/HETATM; preserves insertion codes and CONECT bonds. |
| XYZ | Single-frame and multi-frame XYZ files. |
| SDF/MOL | V2000 atom and bond block reader; preserves bond orders and formal charges. |
| CIF/mmCIF | Reader for standard _atom_site coordinate loops, including quoted values and _atom_site.pdbx_PDB_ins_code. |
For what each format stores, which metadata is reliable, and why PDB and mmCIF differ, see Coordinate formats compared.
PDB alternate conformations can be selected explicitly:
mol = ms.read_pdb("structure.pdb", altloc="primary")
mol = ms.read_pdb("structure.pdb", altloc="highest_occupancy")
Supported policies are primary, first, highest_occupancy, and all.
Residue numbers remain available as the integer mol.resids array. PDB/mmCIF
insertion codes are available as mol.icodes, and full per-atom identities are
available as mol.residue_ids.
The built-in CIF reader handles standard atom-site coordinate loops with quoted
values and semicolon-delimited text fields. Install molscope[cif] to use the
optional Gemmi parser and validation helpers:
mol = ms.read_cif("structure.cif", parser="gemmi")
report = ms.validate_cif("structure.cif")
report.raise_for_errors()
Dictionary-aware validation is available when you provide local dictionary files:
report = ms.validate_cif("structure.cif", dictionaries=["mmcif_pdbx_v50.dic"])
Download a structure from RCSB:
mol = ms.fetch("1fqy")
From a SMILES string¶
ms.read_smiles() builds a Molecule from a SMILES by generating one 3D
conformer with RDKit (needs the chem extra):
mol = ms.read_smiles("CC(=O)O") # acetic acid
mol = ms.read_smiles("c1ccccc1", add_hs=False, seed=7) # heavy atoms only, reproducible
Bonds, Kekule bond orders, and formal charges come from RDKit. The coordinates
are a generated conformer, not an experimental or energy-minimised structure:
ideal as input for descriptors and graph-ML (where topology matters), but treat
geometry-dependent results (contact maps, RMSD against experiment, precise
distances) with care. An invalid SMILES, or one RDKit cannot embed, raises
ValueError. For an experimental geometry, read a PDB/mmCIF/SDF instead.
Writing¶
ms.write_xyz(mol, "out.xyz")
ms.write_pdb(mol, "out.pdb")
ms.write_sdf(mol, "out.sdf") # coords, bonds, Kekulé orders, formal charges, properties
ms.write_cif(mol, "out.cif") # minimal _atom_site loop (coords + residue/chain metadata)
All four writers are dependency-free and round-trip with the matching readers.
Two scope limits worth knowing: SDF is an atom-and-bond format (no chain/residue
metadata) capped at 999 atoms/bonds by V2000, and write_cif emits a coordinate
mmCIF — an _atom_site loop only, with no symmetry, anisotropy, or bonds.
Multi-frame output¶
write_frames is the write-side counterpart to the multi-frame readers and
stream: it takes a list or a generator of molecules and appends them one at
a time, so you can filter, slice, or align an ensemble or trajectory-lite stream
and save the subset without holding it all in memory. The format follows the
extension and it returns the number of frames written.
frames = (m for m in ms.stream("traj.pdb") if m.radius_of_gyration < 15)
n = ms.write_frames(frames, "compact.pdb") # MODEL/ENDMDL blocks
ms.write_frames(models, "ensemble.xyz") # concatenated XYZ frames
ms.write_frames(poses, "hits.sdf") # $$$$-delimited records, bonds kept
Supports .pdb/.xyz/.sdf (not mmCIF, which has no multi-frame form). Frames
need not share an atom count, so varied SDF records are fine. Multi-frame PDB
omits CONECT (per-model serials make a single global record ambiguous), so bonds
are re-inferred on read; use .sdf if you need per-frame bonds.