Why We Rewrote the Hermes Camera Intake Script From Scratch (And Cut 73% of the Code)
Hermes v0.7.4 replaces a 1,948-line camera intake script with a 522-line rewrite that does the same job. A post about why that happened — and why, for research infrastructure, a smaller honest tool is more trustworthy than a larger careful one.
Hermes v0.7.4 replaces a 1,948-line camera intake script with a 522-line rewrite that does the same job. The cut is 73 percent of the code, 70 percent of the file size, and roughly 100 percent of the pretending-we-were-building-a-distributed-system. This is a post about why that happened — and why, for research infrastructure, a smaller honest tool is a more trustworthy tool than a larger careful one.
What the script does
Project Hermes maintains a registry of cameras that feed into the platform. Each camera has a profile — location, pointing direction, field of view, stream URL, attribution mode, and two reference images (one daytime, one aerial-from-above). The camera intake script is the ceremony that promotes a curated camera profile from "draft on disk" to "live in the production registry." It validates the fields, checks the images, prompts the operator to confirm, appends to cameras.json, and copies the reference images into place.
That is all it does. There is one operator. There are no concurrent writers. The registry lives on the same disk as everything else. The failure modes are: a typo in the config file, a missing image, a duplicate camera ID, a hand-edit that broke the JSON. Real-world failure modes for a one-person research platform.
What we built first
The first version, written under careful audit, was 1,948 lines.
It included:
A thirteen-step atomic commit pipeline with a completed list and four distinct rollback boundary branches, each with its own recovery semantics. A TOCTOU re-validation step at commit time, in case a parallel writer had modified the registry between validation and commit (there is no parallel writer). An fsync of the parent directory after rename, for crash durability of the rename operation against power loss in the millisecond window between rename and fsync (the filesystem journal already handles this). A canonical-JSON byte-comparison verification step, which serialized the in-memory registry deterministically, wrote it to disk, read it back, and byte-compared the two canonical forms to detect "disk corruption" (which would be detected by the next read anyway). A profile_sha256 field hashed over the canonical form of each record, in case someone hand-edited a camera record and wanted to detect the tamper (the platform has no audit-log requirement that demands this). Backup pruning to retain the most recent twenty .bak files (one-line cron job, if you ever want it). A _rollback function with CRITICAL-marked log levels for the case where the rollback itself fails (manual recovery is cp .bak.timestamp cameras.json, takes thirty seconds).
Each of those decisions, taken individually, was defensible. Each one answered a real engineering question with a real engineering answer. None of them was wrong, exactly.
What was wrong was the aggregate. We had built distributed-database infrastructure for a single-operator workflow on a single machine. We had treated every imaginable failure mode as if it deserved code, regardless of whether the recovery from that failure mode was more expensive than just rerunning the script.
How we noticed
The way you notice this kind of drift is by asking, near the end of the build, what's actually load-bearing.
We went line-by-line through the commit pipeline and asked: if this safety check were removed, what would actually go wrong, and could the operator fix it in two minutes? The answers were almost uniformly: nothing significant would go wrong, and yes, the operator could fix it in two minutes. The fsync-parent-directory ceremony protected against a failure mode that requires power loss inside a millisecond window, on a system with a journaling filesystem, with a backup file sitting one directory over. The canonical-JSON byte verify protected against a corruption mode that would be caught by the next read. The TOCTOU recheck protected against a writer that does not exist.
We then did the inverse pass, which is harder: which of these checks is actually catching real operator mistakes? The answer turned out to be: field validation (catches typos and missing keys), image checks (catches corrupt or wrong-extension files), the sign-and-magnitude warning on near-zero coordinates (catches "I forgot the decimal point" in latitude), the aspect-ratio mismatch warning (catches "I uploaded the daytime as the aerial by accident"), and the COMMIT confirmation prompt (catches "I meant to type the test command not the live command"). Five things. Real things. Things an operator at 11 p.m. with a curated camera and a sleepy brain actually does wrong.
Everything else was ceremony for hazards we were never going to encounter.
The rewrite
The 522-line version keeps every check that catches a real operator mistake and discards every check that protects against an imaginary one. Specifically: field validation is unchanged. Image checks (existence, extension, size, samefile, luminance, aspect ratio) are unchanged. The summary format the operator reads before committing is unchanged. The COMMIT confirmation prompt and the HERMES_ALLOW_NO_CONFIRM env-var override are unchanged. The atomic-write-and-rename pattern is preserved (it is genuinely cheap and genuinely correct). The backup-before-write is preserved. The simple rollback — "if a copy fails partway, restore cameras.json from the .bak and remove the partial assets directory" — is preserved.
What is gone: the thirteen-step pipeline orchestration, the four rollback boundary branches, the TOCTOU recheck, the fsync-parent ceremony, the canonical-JSON byte verify, the profile_sha256, the backup pruning, the CRITICAL-level log path. Roughly 1,400 lines of code that protected against nothing the operator actually does.
The new script does the same job. It is faster to read, faster to audit, faster to modify, and — crucially for a research platform — easier to reason about when something does go wrong. There are fewer places for a bug to hide. There are fewer assumptions to keep in your head. There are fewer steps in the pipeline that could themselves be the source of an unexpected behavior the next time we trace one.
Why this matters for research infrastructure
The temptation in any infrastructure code is to treat thoroughness as virtue. More checks are more rigorous. More rollback paths are more defensive. More logging is more transparent. This is the implicit theory behind the 1,948-line version, and it is the theory we want to argue against.
In research infrastructure, every line of code is a place where an assumption can hide. A platform that produces cited results — and that wants to be auditable — is a platform whose code has to be small enough that someone who is not its author can read it and understand what it does. A 1,948-line script is not auditable in any meaningful sense. A 522-line script is.
The rigor in a research platform is not in the code's defensive depth. It is in the methodology the code implements, the validations it actually runs, and the legibility of the artifacts it produces. Code defensiveness against hazards that do not exist is not rigor. It is theater. And theater is the enemy of auditability, because it gives the reader the impression that something is being checked when nothing is.
The rewrite is a small thing. One script, one workflow. But the principle generalizes: in a research platform, ceremony is the enemy of auditability. We would rather ship 522 lines that we can defend, line for line, than 1,948 lines that we cannot.
The new intake script is at tools/camera_intake.py in the Hermes tree, and the first camera commit through it is in the registry as of this week.
We will have more to say about that camera in a few days.