Structure QC: is this structure ML-ready?¶
Before a PDB or mmCIF file becomes a descriptor table, contact map, or graph, it
is worth asking whether the coordinates are trustworthy: are residues missing,
is the chain broken, are there alternate conformations, what is the net charge?
prepare_structure reads the file once and answers all of that in a single
StructureReport.
import molscope as ms
report = ms.prepare_structure("1ubq.pdb") # a path or a 4-char PDB id
print(report.summary())
# 1ubq.pdb: ML-ready | 660 atoms | chains A | net charge +0 | warnings: ...
What it checks¶
| Check | Severity | Needs |
|---|---|---|
| Missing backbone atoms (N, CA, C, O) | blocker | core |
| Backbone chain breaks (CA–CA > 4.5 Å between adjacent residues) | blocker | core |
| Residue-numbering gaps | warning | core |
| Truncated side chains (fewer heavy atoms than the residue should have) | warning | core |
| Non-standard residues + ligand / water inventory | warning | core |
| Hydrogens present | warning | core |
| Alternate conformations / partial occupancies | warning | core (PDB) |
| Net formal charge at a chosen pH | informational | chem (+ propka for "pka") |
The topology checks run on the bare NumPy install. The net-charge step reuses the
protonation backends and degrades gracefully: if RDKit (or PROPKA) is missing,
or the file cannot be template-parsed, the charge is reported as None with an
explanatory note instead of raising.
report = ms.prepare_structure("1ubq.pdb", protonation="pka", ph=7.4)
report.net_charge # e.g. 0 (PROPKA prediction at pH 7.4)
report.ml_ready # True
report.blockers # [] -> nothing that corrupts distance/graph features
report.warnings # ['57 atom(s) with occupancy < 1', 'no hydrogens present']
report.to_dict() # JSON-serialisable, for pipelines
print(report.report_markdown()) # a full human-readable report
ml_ready is a heuristic
A structure is called not ML-ready only for blockers — missing backbone atoms or internal chain breaks, which corrupt distance- and graph-based features. Everything else is surfaced as a warning. Treat the verdict as triage, not as a guarantee for a specific modelling task.
From the command line¶
molscope structure-report 1ubq.pdb # one-line verdict
molscope structure-report --fetch 1ubq --json # full JSON report
molscope structure-report 1ubq.pdb --out report.md # write a Markdown report
molscope structure-report model.pdb --protonation pka --ph 7.4
--fetch <PDBID> downloads (and caches) the entry from RCSB first. Gzipped PDBs
(.pdb.gz, as RCSB serves them) are handled transparently, including for the
net-charge step.