The fastest way to kill a successful proof of concept is to hand it to operations without observability. You've built something that works. It solves the problem. Everyone's excited. Then it lands in production and the SRE team has no idea how to monitor it, no way to debug it, and no confidence they can roll it back if something breaks at 2am.
Six months later, it's still running on the original developer's laptop because nobody trusted it enough to run it properly. The Proof Sprint succeeded technically and failed organizationally.
We solve this with what we call the telemetry contract: a set of non-negotiable observability and handoff requirements that ship with every Proof Sprint. Operations doesn't have to trust us. They have to trust their own dashboards—and we make sure those dashboards work from day one.
Why telemetry is a delivery problem, not an ops problem
Most teams treat observability as something ops adds after delivery. The developers build the thing, throw it over the wall, and the SRE team figures out how to monitor it. This is backwards.
By the time ops gets involved, the code is frozen. Adding instrumentation means reopening a "finished" project. Adding log statements means changing code that's already passed QA. The team that understands how the system works has moved on to the next project. Nobody wants to go back.
So the monitoring gets hacked together from the outside. Ops adds synthetic checks that probe endpoints without understanding the internal flow. They set up alerts based on guesses about what "normal" looks like. When something breaks, they page the original developers anyway because the dashboards don't tell them anything useful.
The telemetry contract inverts this. Observability is a delivery requirement, not a post-delivery afterthought. The Proof Sprint isn't done until ops can monitor it with tools they already know.
Signal parity: your collectors, your dashboards
Every Proof Sprint emits traces, metrics, and logs in the formats the client already uses. If they're on Datadog, we emit to Datadog. If they're on Grafana with Prometheus and Loki, we emit to those. If they've got a custom OpenTelemetry collector, we wire into it.
This sounds obvious, but it's surprisingly rare. Most vendor deliverables arrive with their own monitoring stack. "Here's the system, and here's the new dashboard you need to watch it." Now ops has N+1 monitoring systems to check. The alert fatigue compounds. The mental context-switching burns people out.
Signal parity means the Proof Sprint shows up in the same place your team is already looking. Same dashboards. Same alert channels. Same runbook format. When something needs attention, it surfaces through the existing triage process instead of requiring a new one.
We do the integration work during the sprint, not after. By day 10, the metrics are flowing into your existing collectors and the traces are searchable in your existing trace viewer. Your SREs don't learn a new tool. They use the tools they trust.
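To make "wiring into your collectors" concrete, here is a minimal sketch of the kind of startup wiring involved, using the OpenTelemetry Python SDK and assuming an OTLP-capable collector endpoint. The service name, endpoint variable, and instrument names are placeholders, not the actual deliverable.

```python
# Minimal sketch: point the Proof Sprint's telemetry at the collector the
# client already runs, instead of standing up a new monitoring backend.
# Assumes opentelemetry-sdk and opentelemetry-exporter-otlp are installed.
import os

from opentelemetry import trace, metrics
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# The endpoint is whatever the client already operates: a Datadog Agent,
# an OTel collector in front of Grafana/Prometheus/Loki, or a custom gateway.
endpoint = os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317")

# One shared resource so traces and metrics carry the same service name
# in the client's existing dashboards.
resource = Resource.create({"service.name": "proof-sprint-demo"})

tracer_provider = TracerProvider(resource=resource)
tracer_provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint=endpoint)))
trace.set_tracer_provider(tracer_provider)

metrics.set_meter_provider(
    MeterProvider(
        resource=resource,
        metric_readers=[PeriodicExportingMetricReader(OTLPMetricExporter(endpoint=endpoint))],
    )
)

# From here, ordinary instrumentation flows into the existing backend.
tracer = trace.get_tracer("proof_sprint")
meter = metrics.get_meter("proof_sprint")
request_counter = meter.create_counter("proof_sprint.requests")
```

In a setup like this the backend-specific piece is the endpoint configuration, which is exactly why the integration can happen during the sprint rather than after it.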
The evidence packet
Every Proof Sprint ships with an evidence packet: a documented bundle of the security and operational artifacts that procurement and security teams need in order to approve the deployment.
The packet includes:
Software Bill of Materials (SBOM): A complete inventory of every dependency in the system, including version numbers and license information. When a new CVE drops for a library you use, you can check in five minutes whether the Proof Sprint is affected (a sketch of that check follows this list).
Software Composition Analysis (SCA) report: Automated vulnerability scanning results for all dependencies. We run this on every build and include the most recent scan in the handoff. Known vulnerabilities are documented with severity ratings and remediation status.
IAM diff: A precise list of every permission the system requires. Not "admin access to the cloud account"—specific IAM policies showing exactly what the system can read, write, and execute. Security reviews become tractable because the scope is explicit.
Data flow documentation: Where data comes from, where it goes, and what transformations happen in between. If the system touches PII, the data flow shows exactly which services handle it and how it's protected in transit and at rest.
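Here is a minimal sketch of the five-minute CVE check mentioned above, assuming the SBOM is exported as CycloneDX JSON. The file name, package name, and affected versions are hypothetical.

```python
# Minimal sketch: check a CycloneDX JSON SBOM for a vulnerable dependency.
import json


def affected_components(sbom_path: str, package: str, bad_versions: set[str]) -> list[str]:
    """Return name@version strings from the SBOM that match a vulnerable release."""
    with open(sbom_path) as f:
        sbom = json.load(f)
    hits = []
    for component in sbom.get("components", []):
        if component.get("name") == package and component.get("version") in bad_versions:
            hits.append(f"{component['name']}@{component['version']}")
    return hits


if __name__ == "__main__":
    # Hypothetical scenario: a new CVE lands for requests 2.30.0.
    hits = affected_components("sbom.cyclonedx.json", "requests", {"2.30.0"})
    print(hits or "not affected")
```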
The evidence packet turns security review from an interrogation into a checklist. Instead of asking "what does this thing do?" and waiting for someone to explain, the reviewer can read the documentation and ask specific follow-up questions. Approvals that used to take weeks happen in days.
Runway for ops: making Day 2 boring
The first day a system runs in production is exciting. The second day should be boring. If Day 2 is exciting, something is wrong.
We design for boring Day 2s. Every Proof Sprint includes:
Canary deployment plan: How to roll out changes incrementally, with automatic rollback triggers if error rates spike. This isn't a document that says "consider using canary deployments"—it's a working configuration that ops can execute (a sketch of such a trigger follows this list).
Runbooks: Step-by-step procedures for common operational scenarios. What to do when latency spikes. How to restart the service if it hangs. How to check if the database connection pool is exhausted. Written for someone who didn't build the system and is looking at it for the first time at 3am.
Rollback procedures: Exactly how to revert to the previous version, how long it takes, and what state will be lost. Not "you can roll back," but "execute this script, wait 4 minutes for the health checks, verify with this dashboard query."
Capacity planning notes: Current resource utilization, projected growth curves, and the specific bottlenecks that will appear first as load increases. When the system needs more capacity, ops knows whether to add replicas, increase memory, or provision a larger database.
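As an illustration of what an automatic rollback trigger can look like in practice, here is a minimal sketch that polls an error-rate query and hands off to the documented rollback procedure if the rate stays above a threshold. The Prometheus URL, metric query, threshold, and rollback script path are all assumptions for the sake of the example, not the actual deliverable.

```python
# Minimal sketch of a canary watchdog: roll back automatically if the
# error rate stays above a threshold for several consecutive checks.
# URL, query, threshold, and script path are hypothetical.
import json
import subprocess
import time
import urllib.parse
import urllib.request

PROM_URL = "http://prometheus.internal:9090"
ERROR_RATE_QUERY = (
    'sum(rate(http_requests_total{job="proof-sprint",code=~"5.."}[5m]))'
    ' / sum(rate(http_requests_total{job="proof-sprint"}[5m]))'
)
THRESHOLD = 0.05            # roll back if more than 5% of requests fail
CHECK_INTERVAL_SECONDS = 60
CONSECUTIVE_BREACHES = 3    # require three bad readings before acting


def error_rate() -> float:
    """Query Prometheus for the current 5-minute error rate."""
    url = f"{PROM_URL}/api/v1/query?" + urllib.parse.urlencode({"query": ERROR_RATE_QUERY})
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    results = payload["data"]["result"]
    return float(results[0]["value"][1]) if results else 0.0


def watch_canary() -> None:
    breaches = 0
    while True:
        rate = error_rate()
        breaches = breaches + 1 if rate > THRESHOLD else 0
        print(f"error rate {rate:.3%} (consecutive breaches: {breaches})")
        if breaches >= CONSECUTIVE_BREACHES:
            # Hand off to the documented rollback procedure and stop watching.
            subprocess.run(["./rollback.sh"], check=True)
            return
        time.sleep(CHECK_INTERVAL_SECONDS)


if __name__ == "__main__":
    watch_canary()
```

The point of shipping something executable like this, rather than a recommendation, is that ops can run it, read it, and adjust the threshold without coming back to us.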
This runway exists so that operations doesn't depend on us after handoff. They have everything they need to keep the system running, diagnose problems, and make changes. The Proof Sprint doesn't create an ongoing relationship—it creates an independent capability.
Why this matters for the business case
Proof Sprints often die in procurement because the total cost of ownership is unclear. The 10-day build is cheap, but what about supporting it forever? What about the ops team learning a new stack? What about security approvals that could take months?
The telemetry contract de-risks all of this upfront. When we hand off the Proof Sprint, the ops team already knows how to support it because they're using their existing tools. Security has already reviewed the evidence packet during the sprint. There's no hidden cost waiting on the other side of deployment.
This also means faster feedback. If the Proof Sprint doesn't work, you find out in 10 days while it's still cheap to kill. You don't find out six months later when you're debugging production incidents and realizing the system was never properly instrumented.
Observable systems are debuggable systems. Debuggable systems are maintainable systems. Maintainable systems are systems that actually get used. The telemetry contract isn't overhead—it's what makes the Proof Sprint useful beyond the demo.