Works in Prod Blog

The Classic 'Works on My Machine' — Now With Neural Networks

2026-05-21T09:00:00.000Z

There's a version of this story where we made a mistake and fixed it. That's true but incomplete. The fuller version is about what a structured evaluation process gets right, what it misses, and how the ground can shift under you even when you've done the work.

Why Document Fidelity Matters

Our virtual assistant answers questions by retrieving content from product documentation — policies, guides, structured forms. For retrieval to produce accurate, grounded answers, the source material has to survive parsing intact.

This sounds obvious until you see what "not intact" looks like in practice. A table rendered as a flat string of values with no row or column relationship. A heading that becomes a bold paragraph indistinguishable from body copy. A figure caption attached to nothing. These aren't cosmetic defects — they're retrieval failures. The model can't correctly reference what wasn't preserved.

Our requirements were non-negotiable:

Scanned and rasterised PDFs — documents with no embedded text layer, relying entirely on OCR
Table structure — including merged cells, multi-row headers, nested content
Layout and heading hierarchy — sections, subsections, callouts, columns
Image semantics — figures and diagrams needed descriptions, not just placeholders

Without these, the knowledge base becomes a lossy approximation of the actual documents. RAG answers become confidently wrong.

Starting Simple, Hitting Walls

We started where most teams start: the obvious libraries. python-docx for Word documents, pdfplumber, pypdf, and PyMuPDF for PDFs. These are excellent tools — fast, lightweight, well-maintained, and more than capable for clean digital documents.

Our documents weren't clean. The simpler libraries handled native digital PDFs without trouble but broke down on scanned content and gave up entirely on complex table structures. Merged cells became misaligned rows. Headings lost their hierarchy. OCR was absent or rudimentary.

We needed to go deeper.

Building the Evaluation Harness

Rather than try libraries one at a time, we built a structured comparison harness — available here if you want to run it yourself. Ten parsers, standardised document inputs representing the range of formats we'd encounter in production, and a consistent set of metrics:

Wall-clock parse time
Peak memory consumption
Word count accuracy (proxy for text extraction fidelity)
Table detection and structure preservation
Heading and layout extraction quality

The harness made evaluation repeatable and honest. We ran everything through the same documents, measured the same things, and scored against the same rubric.

Docling won on every qualitative dimension that mattered. It produced structured, readable Markdown that actually reflected the document's intent — tables with correct cell relationships, headings with correct hierarchy, image descriptions, and full OCR support for scanned content. For our requirements, it wasn't close.

We shipped it.

The Constraint We Didn't Measure

Here is where the story gets instructive.

Docling is not a lightweight parser. It's a neural network pipeline: layout detection models, table structure analysis, OCR, and PyTorch inference — all running locally, inside the pod, on every document processed.

We had evaluated it on an M4 Pro MacBook with Apple Silicon MPS acceleration. Near-GPU performance for PyTorch workloads. Parse times of two to three seconds per document.

Production was CPU-only Kubernetes nodes.

The performance gap was not a percentage difference. It was an order of magnitude. What took seconds on the MacBook took minutes on a CPU pod. Our AWS load balancer had a 60-second timeout. Docling on CPU regularly exceeded that.

The consequence: timeouts, 502 errors, retry storms, queue backlog, pod memory pressure. Five weeks of thread throttling, semaphore tuning, and concurrency experiments — the full story is in the Docling post-mortem. The constraint was never the code. It was the hardware, and we had benchmarked against hardware we didn't have in production.

There's a human element worth naming too. We were under pressure to find a solution that met those qualitative requirements, and Docling met them convincingly. When you're looking for something specific and you find it, the instinct is to commit — not to ask what happens when you move it to a different machine. That instinct is understandable. It's also exactly when the infrastructure question matters most.

The harness was rigorous. The question it was missing: does this library assume hardware you don't have?

Pragmatism Over Purity

At some point the engineering question stopped being "how do we make Docling work on CPU" and became "why does it have to run locally at all?"

We could have chased GPU nodes. We could have built an async queue and worked around the timeout with a callback model. We could have kept tuning. Any of those would have been defensible.

Instead we stepped back. The requirement was structured, semantically meaningful content extracted from documents. That's a capability problem, not an infrastructure problem. The assumption that it needed to be solved with local inference was an artefact of the evaluation process, not a genuine constraint.

Now It's All Prompt-Driven

Claude via Bedrock handles our complete requirements — scanned PDFs, merged-cell tables, layout hierarchy, image semantics — without a byte of local inference.

The implementation is straightforward. For documents under 4.5MB we send a document block directly to the Bedrock API. For larger documents, we rasterise each page to PNG and send image blocks. Claude returns structured Markdown that preserves the document's intent.

Pod CPU stays flat during parsing. No timeouts. No GPU nodes. No async queue. No concurrency tuning. The "parsing pipeline" is a well-prompted API call.

The accuracy is comparable to Docling on our document set. The operational complexity is dramatically lower.

What the Evaluation Harness Gets Right — And What It Doesn't

The structured evaluation process had real value. It forced rigour where gut feel would have been faster but less reliable. The harness surfaced Docling as the correct answer to the qualitative question we were asking.

The gap was in the question itself. We measured capability and performance, but performance on the wrong hardware. For any library that runs local inference — ML models, neural networks, GPU-accelerated workloads — production hardware parity is not an optional benchmark condition. It's the first one.

There's also a broader point about evaluation framing. The harness asked "which library does this best?" It didn't ask "should this be a library at all?" As LLM APIs have matured, the answer to a class of document understanding problems has shifted from "find the best local model" to "describe what you need and ask a capable model." The evaluation dimension that matters now isn't which OCR pipeline is most accurate — it's whether the capability can be prompt-driven and whether that changes your operational posture.

For us it did, substantially.

Lessons

1. Hardware parity is line one of the evaluation checklist for ML-heavy libraries. Benchmarking on an M4 Pro and deploying to CPU Kubernetes is not a benchmark. It's a misdirection. Add production-equivalent hardware to the evaluation environment before architectural commitment.

2. Structured evaluation is worth building — but the harness only finds what you measure. The comparison harness was the right approach. We just needed one more measurement axis: "does this work on the hardware we actually have?"

3. Pragmatism beats purity. We could have made Docling work in production. The question was whether the cost — GPU nodes, async queues, operational complexity — was proportionate to the benefit over an API-based alternative. It wasn't.

4. The right abstraction level has shifted. A year ago the answer to "parse this complex PDF" was "find the best parser library." Today it's often "send it to a VLM." The evaluation harness needs to include that option, and the question needs to be capability-first rather than implementation-first.

5. Five weeks is feedback, not failure. The operational pain of running Docling in production gave us the forcing function to reconsider the approach. Teams that avoid the pain by over-engineering around it (GPU nodes, larger instances, longer timeouts) often miss the signal.

We Spent Five Weeks Making Docling Work. Then We Deleted It.

2026-05-21T08:00:00.000Z

This is a post-mortem on five weeks of infrastructure work that ended with git rm and 1,452 lines deleted from the lockfile alone.

The library in question is Docling. It's a capable open-source document parser from IBM Research — handles PDFs, tables, figures, DOCX, the lot. On paper it looked like exactly what we needed. In practice it turned out to be a small ML platform hiding inside a Python package, and we didn't fully appreciate that distinction until we were already three acts deep.

Act I: The Optimistic Beginning

The first pull request adding Docling was merged and reverted on the same day. A flag from the universe that was politely ignored.

A couple of days later it was back in with the proper integration. The complications started immediately:

A Docker entrypoint script was needed to pre-download HuggingFace models at container startup
HOME and HF_HOME env vars had to be manually set so the image could write to its own cache directories
The DOCLING_MODELS list kept breaking as a shell argument — positional args split incorrectly, then comma-separated, then space-separated — three separate fixes for what should have been a config value
Tesseract OCR and OpenCV had to be added to the runtime Docker stage
EasyOCR kept sneaking back into the model list and had to be explicitly excluded every time

None of this is catastrophic. But it's the kind of friction that tells you something about what you're dealing with.

Act II: The Model Infrastructure Tax

Because Docling models can't be downloaded at cold-start in production — far too slow (EKS health checks started failing almost immediately) — a CI/CD model sync workflow was introduced to pre-bake them into the Docker image. The container ballooned from a small fish to a blowfish, which led to caching the models in S3 with an InitContainer. This became its own small project: a GitHub Actions workflow to sync models, config to point Docling at the S3 cache path, and downstream fixes when the sync workflow itself had bugs.

The application now had an out-of-band model synchronisation pipeline that had to be kept in step with the Docling version in pyproject.toml. Updating the OCR engine or parser models meant updating the Dockerfile, updating the sync workflow, and triggering the S3 pipeline before deploy — in that order.

This is the moment where "we added a parsing library" became "we are now operating a small model registry."

Act III: The Performance Whack-a-Mole

OMP thread count reduced from 4 to 2 because Docling was spawning more CPU threads than the pod's limit. The pod was being throttled.
images_scale pinned to 1.0, accelerator switched to AUTO
Thread bootstrap broke entirely and had to be fixed
The parse method was decomposed to make tuning easier
OCR engine switched from EasyOCR to RapidOCR — which then had to be added to the model sync workflow and the Docker defaults
OCR skipped entirely for native digital PDFs — meaning Docling's headline feature wasn't being used for the most common input type

On the same day as that last round of fixes, a VLM-based parser was added and validated in a few hours. It used Bedrock's document API directly. No local models. No thread budget. Just an API call.

Act IV: The Quiet Betrayal

A config change silently disabled Docling routing and enabled the VLM parsers by default. Docling was still in the codebase, still in the Docker image, still pulling in Torch and Tesseract and RapidOCR and a full HuggingFace model cache.

It was handling zero traffic.

Act V: The Purge

Gone. All of it.

The Docling parser and all its tests
The S3 model sync CI workflow
All Docling-specific settings and constants
Torch, Tesseract, OpenCV from the build
The Docker entrypoint script
The thread-count tuning, the images_scale pin, all of it

1,452 lines deleted from uv.lock alone.

What replaced it

Two parsers, one fallback chain, zero local ML models.

The primary path sends the raw document bytes directly to Bedrock as a document content block — one API call, no pre-processing. Claude handles layout, tables, and embedded images natively. The only constraint is an undocumented ~4.5 MB limit on document blocks; files over that automatically fall through to the fallback.

The fallback rasterises each page to an image using pypdfium2 and sends them to Bedrock vision in parallel. DOCX files go through LibreOffice headless first to become a PDF, then hit the same rasterise-and-describe path.

The comparison:

Concern	Docling	Now
PDF text extraction	Torch + Tesseract + layout models	Bedrock document block
Tables	Docling HTML table mode	Claude extracts to HTML natively
Images / figures	SmolVLM locally	Bedrock vision per page
DOCX	pydocx parser	LibreOffice → PDF → same VLM path
Dependencies	Torch, Tesseract, RapidOCR, OpenCV, HF models	pypdfium2, LibreOffice (system)
Infrastructure	S3 model sync pipeline, entrypoint bootstrap	Nothing — models live in Bedrock

The entire local inference stack is gone. Parsing is now API calls with a lightweight rasterisation step as fallback for large files.

The lesson, if there is one

Docling required bundling a mini ML stack — Torch, Tesseract, RapidOCR, HuggingFace models — plus a dedicated S3 sync pipeline and several rounds of thread-count tuning, to do PDF parsing that a Bedrock API call does better with zero infrastructure overhead and a thirty-line parser class.

In hindsight the clue was in the very first day: merged and reverted before anyone had even run it in a container. The library wasn't broken — it did what it said. The mistake was not recognising early enough that we were running local inference on CPU-based nodes. Docling's layout models and OCR pipeline are designed for GPU workloads — on CPU they're slow by nature, not by misconfiguration. No amount of OMP thread tuning was going to fix that. We gave it a Kubernetes pod with a CPU limit and spent three weeks wondering why it was slow, when the answer was baked into the infrastructure choice from the start.

To be clear: Docling is a capable library. On GPU-backed infrastructure, with a proper model serving layer, it likely performs well — and there are self-hosted or air-gapped contexts where a managed API isn't an option and something like Docling is exactly the right tool. We may also have missed configuration options that would have helped. It's entirely possible that with different infrastructure or more time, we'd have got there.

But that's the point. The question isn't whether a tool works in the right environment — it's whether your environment is the right one for it. Running local inference on CPU nodes in an application service, when a managed API exists that does the job with less ops surface, is a mismatch. Not a failure of the tool. A failure of context.

We learnt it the expensive way.

The Art of the Architecture Diagram Is Knowing What to Leave Out

2026-05-21T07:00:00.000Z

There's a particular type of diagram that engineers love producing and nobody can read. It has boxes for every service, arrows for every dependency, labels on each arrow explaining the protocol, and a legend in the corner that requires its own legend. It is, technically speaking, accurate. It is also completely useless.

I've been doing C4 modelling for a while now and I genuinely love it — not because it's fashionable, but because it gives you a proper mental model for what a diagram is actually for. System Context, Container, Component, Code. Four levels. Each one answers a different question for a different audience. The mistake most teams make is picking the wrong level, or worse, mixing levels in the same diagram because they couldn't decide and nobody wanted to have the argument.

The tooling has always been a headache. I tried a few diagram-as-code options and they all seemed to get in the way more than they helped. Recently switched to D2 and it's been noticeably better. Minimal syntax, clean output, doesn't try to do seventeen things at once. It renders the diagram and gets out of your way.

The discipline isn't in drawing the boxes though — it's in deciding what not to draw. Engineers instinctively want to put everything in. Every dependency. Every call. Every integration they personally wired up at 11pm on a Thursday and feel a certain proprietary affection for. The result is a diagram that documents institutional knowledge and communicates nothing.

Here's the difference in practice. The noisy version — everything that's technically true:

The clean version — what the Container diagram should actually say:

Same system. The second one takes thirty seconds to understand. The first takes thirty minutes and a whiteboard session that everyone leaves more confused than when they arrived.

My preference is top-down layout — things flow downward, dependencies point in one direction, you can scan it. It's a small thing that makes a significant difference to whether a diagram reads as a story or as a pub quiz question.

AI tools make the arrows problem considerably worse. They're very close to the code — that's their whole thing — and if you ask one to generate an architecture diagram, it will dutifully render every import, every function call, every database relationship it can find. The result looks impressively comprehensive. It communicates approximately nothing.

Simplification requires a judgement call about what matters to the reader. That depends on who the reader is, what decision they're trying to make, and what level of detail actually serves them. Current AI tools don't have access to any of that context. A human has to make the call. That judgement is the actual skill the diagram is expressing.

The test I use: if someone new to the team can look at your diagram and explain back what the system does — without you hovering over their shoulder narrating — the diagram is doing its job. If they need you to explain the diagram, you've drawn a very expensive set of notes.

Pick your level. Remove the noise. Push back on the arrows.

Your Codebase Has Rules. Does CI Know That?

2026-05-21T06:00:00.000Z

There's a particular kind of meeting that happens on mixed teams. Someone's opened a pull request, and two engineers are staring at the same diff with completely different facial expressions. One is confused. The other is quietly furious. Neither is wrong, exactly — they just have entirely different mental models of what the codebase is supposed to look like.

That's the drift I'm talking about. Not bugs. Not broken tests. Just two people who've been building in the same repository for months and have somehow ended up with incompatible ideas about what goes where.

The friction tends to come from different professional histories. Engineers who came up through software development often have layered architecture drilled in early. Engineers who came up through data science and ML often optimised for iteration speed over structure. Neither background is wrong — they just don't automatically agree on where things belong.

The answer, at least in part, is pytest-archon. It lets you write tests that assert structural rules about your codebase. Not "does this function return the right value" — more like "nothing in the API layer should reach directly into the database layer." Rules you'd normally write in a wiki nobody reads, enforced as a test that CI will actually fail on.

Here's what that looks like:

from pytest_archon import archrule

def test_api_does_not_import_database():
    (
        archrule("api-layer-isolation")
        .match("myapp.api.*")
        .should_not_import("myapp.database.*")
        .check("myapp")
    )

That's it. That test will fail if anyone — human or otherwise — writes an API handler that imports a database model directly. No relying on the comment thread getting resolved. No wiki page that's accurate until it isn't. The build fails. The feedback is immediate.

Which brings me to the agentic angle, because this isn't just about human engineers anymore. When you're vibe-coding with an agent generating chunks of your codebase, the agent doesn't inherently know your architectural rules. It knows how to write Python. It does not know that your team decided six months ago that services should never instantiate repositories directly. Architecture tests give it the same feedback signal they give a human: that's not how we do it here, try again.

The rules become the documentation. Living documentation, with teeth. If the structure is correct, the tests pass. If it drifts — whether from a human in a hurry or an agent that didn't know better — they don't.

We caught no bugs this way. We caught something slower and harder to fix than bugs: a gradual divergence in how the team understood the system. An AI engineer pulling in a service directly from an API handler because that's how you'd do it in a notebook. A software engineer quietly losing the will to review it because the conversation about why it's wrong is long and the PR queue is longer.

Architecture tests remove the conversation. The boundary is in the codebase. The codebase enforces it. Everyone, human and agent alike, gets the same feedback.

Getting ahead of drift is worth more than most people give it credit for.

You Can't Debug What Bedrock Swallowed

2026-05-21T05:00:00.000Z

There's a particular kind of hell reserved for debugging LLM-backed systems that nobody bothered to instrument. You've got a request that took twelve seconds and you don't know if the slow part was your retrieval pipeline, the prompt construction, the Bedrock call itself, or the post-processing that turned the model's output into something you'd actually show a user. You have logs. You have vibes. You have, essentially, nothing.

We hit this early on an LLM project and it focused the mind quickly.

AWS Bedrock is opaque by design. You send a prompt, you get tokens back, and what happens between those two events isn't your concern. That's fine — it's not your model to look inside.

The problem is when that opacity bleeds into the code you wrote. Your retrieval logic, your prompt templates, your retry handling, your fallback paths — none of that needs to be a mystery. But without deliberate instrumentation, it becomes one anyway. You end up with a black box you built yourself, which is a considerably more embarrassing situation than the one Bedrock put you in.

Rather than manually sprinkling trace calls everywhere and inevitably missing the interesting bits, I wrote a Python decorator that wraps functions and methods automatically. Every call gets emitted as a span — class name, method name, duration, outcome — and it all folds into a single trace you can read in sequence:

import functools
import time
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def traced(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        class_name = args[0].__class__.__name__ if args else ""
        span_name = f"{class_name}.{func.__name__}" if class_name else func.__name__

        with tracer.start_as_current_span(span_name) as span:
            start = time.perf_counter()
            try:
                result = func(*args, **kwargs)
                span.set_status(trace.StatusCode.OK)
                return result
            except Exception as exc:
                span.record_exception(exc)
                span.set_status(trace.StatusCode.ERROR, str(exc))
                raise
            finally:
                span.set_attribute("duration_ms", (time.perf_counter() - start) * 1000)
    return wrapper

Apply it to the functions you care about and suddenly your trace reads like a story. Vector search: 40ms. Prompt assembly: 2ms. That "fast" Bedrock call that's actually 3.8 seconds because you're using a large model with a 6,000-token context and no caching — that's visible now. The information was always there. You just couldn't see it.

The part I didn't anticipate: OpenTelemetry handles both technical traces and business metrics through the same pipeline. We used it to answer latency questions ("why did this request take four seconds?") and business questions at the same time ("how many users hit the fallback path today?", "what's our prompt cache hit rate this week?"). Same instrumentation layer, different dimensions. There's something satisfying about a monitoring setup that doesn't require you to maintain two separate systems with two separate mental models.

Here's the thing that surprised me most: a well-instrumented LLM pipeline can actually be easier to reason about than a lot of distributed systems. The order of operations is relatively clear, and when every step emits a span, you can read a trace like a timeline. The non-determinism of the model itself is a different problem — spans won't tell you why the model said what it said — but at least the plumbing stops being a mystery.

The opacity was never really about the LLM. It was about the code around the LLM that we hadn't bothered to make visible.

What I took from this: don't leave observability as something to add later when things go wrong. Wire it in from the start — it's the interface you build for yourself so that when Bedrock starts behaving oddly at 11pm, you have structured data to work with rather than a twelve-second request duration and a shrug.

TDD Was Solving the Agent Problem Before Agents Existed

2026-05-21T04:00:00.000Z

The first time I set an agent loose on a real codebase, it ran out of context before it had done anything useful. That's a clarifying experience.

The repository wasn't exotic — a Python monorepo with shared libraries and some infrastructure code. I drew a diagram to understand what was happening. A rectangle for the full context window; blocks for what was already consumed just from loading the codebase: directory tree, CLAUDE.md, relevant modules, config, dependencies. The bar was more than half full before the agent had read a single line of task context or seen a single error message.

That image stuck with me. Half the agent's working memory gone on orientation alone. And the uncomfortable follow-up question: whose fault is that?

Mostly ours, it turns out.

A codebase with fuzzy boundaries, large unfocused modules, and implicit conventions forces the agent to do the same orientation work a new engineer would do — except a new engineer can ask questions, build intuition over weeks, and remember what they learned yesterday. With the tools I've been using, there's no persistent memory between sessions by default. Every session is effectively day one. The codebase has to compensate for what the agent can't retain.

There's a body of practice — going back about twenty-five years — that points in exactly this direction. We just didn't know we were solving this particular problem at the time.

TDD and the XP practices around it — simple design, ruthless refactoring, tests as documentation — produce exactly the properties that make a codebase agent-readable. Small focused units with explicit interfaces. Behaviour described in tests rather than buried in implementation. No accidental complexity quietly accumulating in corners. Clear boundaries that tell you where one thing ends and another begins.

None of this is new. But the agentic era has made the value of it more visible. The "too much to hold in your head at once" problem that TDD was designed to address is the same problem the context window makes concrete. A codebase with small focused units and tests that describe behaviour fits into an agent's context cleanly. One where complexity has accumulated unchecked — regardless of how it got there — does not.

Tests also do something specific for agents that code alone can't: they describe intended behaviour without requiring the agent to read the implementation. A test called test_chat_service_returns_error_on_empty_prompt tells the agent more in one line than several hundred lines of service code could. When an agent needs to understand a boundary, it reads the tests. Targeted context. Problem contained.

The cost angle is real too. Context is billed by the token. An agent flailing around a poorly structured codebase — re-reading files, tracing implicit dependencies, inferring conventions that should be explicit — is burning money before it's produced anything. Good structure isn't just clean, it's cheap.

This also connects to the current conversation around "vibe-coding" and agents generating code freely. From what I've seen, the concern isn't really about whether the agent can write working code — it often can. The concern is whether the codebase it's writing into has enough structure to keep the output coherent over time. Architecture tests help here too: codify the rules, and both human and agent get immediate feedback when something drifts from the intended shape.

The agentic era didn't invent a new problem. It gave us a new, very legible way to feel the cost of one we'd been politely ignoring for years.

TDD and XP have always pushed toward properties — small units, explicit interfaces, behaviour-as-tests — that turn out to be just as valuable for agents as they are for humans. The reasons stack up.

The Blockers Don't Care That You're Using AI

2026-05-21T03:00:00.000Z

I wrote a post back in 2021 about walking skeletons — the idea that before you go deep on features, you ship something thin and deployable end-to-end. Not because it's useful to users, but because it flushes out the real blockers while the cost of finding them is still low. Permissions. Pipelines. Infrastructure assumptions that looked fine on a whiteboard.

That advice hasn't aged out. If anything, AI projects have made it more relevant, not less.

Here's the thing about AI systems specifically: the surface area for things to silently go wrong is larger. You've got model integrations, inference infrastructure, data pipelines, prompt management, evaluation loops, and whatever cloud hoops your organisation has decided to add on top. Any one of those can be perfectly fine in isolation and a disaster when wired together in a real environment. The feedback loop — getting something running end-to-end early — serves exactly the same purpose it always did. You learn what's actually broken before you've written ten thousand lines of feature code around it.

I saw this recently on an AI project. We pushed to establish the skeleton early, before anyone could argue we weren't ready. And yes, we hit the usual suspects: IAM permissions that looked correct until they didn't, model API access that needed a different approval path than expected, tooling that worked locally and had strong opinions about containers. None of it was AI-specific. I've been burned by the same class of problems in every non-trivial project I've worked on since before "AI engineering" was a job title.

What was different was the speed at which we got through it. That's where the AI era genuinely does change things — not the nature of the blockers, but the time between "we found the problem" and "we fixed it." Digging through IAM policies, drafting the right internal request, figuring out the correct incantation for whichever cloud service had opinions today — all of it moved faster with AI tooling alongside. It's a multiplier on the resolution side, not the discovery side.

Which is worth saying plainly: AI tooling doesn't help you find blockers you never looked for. The skeleton is still the mechanism for surfacing them. You still have to commit to the discipline of doing it early, before the temptation to just build features wins.

That temptation is strong. AI tooling makes feature development feel fast. You can go from idea to working prototype in a morning. That speed makes it very easy to go deep before you've established whether any of it will actually run in production. And then you've got a lot of impressive-looking code and a pipeline that doesn't exist yet.

The honest caveat: this is still a discipline problem more than a tooling problem. AI makes the fixing faster, but it can't make you look for trouble before you think you need to. The teams that skipped the walking skeleton before will probably still skip it now — just with faster excuses.

The Metric Your Users Feel Before You Measure It

2026-05-21T02:00:00.000Z

Working on a streaming chat product taught me something: the standard latency metrics don't really describe what users experience. They're not waiting for a page to load or an API to return a JSON blob. They're watching tokens appear — and what they feel before anything appears is the thing most teams aren't measuring.

That thing is time-to-first-token. TTFT.

I ran into this while load testing a streaming chat endpoint. The obvious thing to reach for is http_req_duration — it's right there in k6, it captures how long the request took, job done. Except it isn't. For a streaming LLM response, http_req_duration captures the entire stream from first byte sent to last byte received. If your model takes two seconds to start streaming and then streams for eight seconds, your p95 latency looks like ten seconds. That tells you almost nothing about whether the product feels responsive.

What matters to the person using a chat interface is: how long until something appears? A response that starts in 800ms and streams for thirty seconds feels fast. A response that sits blank for four seconds then dumps everything at once feels broken — even if the total duration is shorter.

That gap between "request sent" and "first content chunk received" is TTFT, and it's the metric that actually describes streaming UX.

Measuring it

Standard k6 doesn't parse SSE streams, so you need a custom binary built with xk6-sse:

xk6 build --with github.com/phymbert/xk6-sse

Then define a custom Trend metric and record it the moment the first content event arrives:

import sse from "k6/x/sse";
import { Trend } from "k6/metrics";

const ttft = new Trend("ttft_s", true);

export default function () {
  const start = Date.now();
  let firstTokenRecorded = false;

  sse.open(url, params, (client) => {
    client.on("event", (event) => {
      if (!firstTokenRecorded && event.data !== "[DONE]") {
        ttft.add((Date.now() - start) / 1000);
        firstTokenRecorded = true;
      }
    });
  });
}

Critically: read the stream to completion even after you've recorded TTFT. Dropping the connection early skews the load profile — the server is still doing work and your test stops accounting for it.

Load profile design

For observational load testing, a ramp-and-hold pattern gives you clean steady-state numbers to actually reason about:

VUs
100 ┤               ▄▄▄▄▄▄▄
 50 ┤         ▄▄▄▄▄▀       ▀▄▄▄▄▄
 10 ┤   ▄▄▄▄▄▀                   ▀▄▄▄▄▄
  0 ┤──▀                               ▀──
      ramp  hold   ramp   hold   ramp  hold

During a ramp phase, concurrency is in transition — exclude those samples from headline stats. During a hold phase, you have stable concurrency and predictable sample counts. Tag every sample with its phase and VU target so you can filter cleanly in post-processing.

One request per VU per hold window also gives you deterministic sample counts: 50 VUs × 60s hold = exactly 50 requests. Reproducible, comparable across runs.

What you actually learn

Here's where it gets interesting. Once you have TTFT as a real metric, you start seeing things that http_req_duration completely hides.

What we found was that different models have quite different characteristics. Some are fast to start and slow to finish. Some are the opposite. Prompt caching had a measurable effect in our tests — a cache hit on a large system prompt shaved hundreds of milliseconds off first token time, and we wouldn't have seen that signal at all if we'd only been watching total duration.

We also saw how concurrency affects perceived responsiveness differently than it affects throughput. At low VU counts the TTFT p95 looked fine. Add more concurrent users and TTFT degraded before throughput did — which means users start feeling slowness before the dashboards register a problem. Your mileage will vary depending on model and infrastructure, but it's worth checking.

These are the kinds of insights that only exist once you're measuring the right thing. Total request duration isn't the wrong metric — it's just the wrong first metric for a product where the UX is a stream.

Set a concrete target before you test

Before running at scale, decide what good looks like. What is an acceptable p95 TTFT at your expected concurrency? Write it down before you look at the numbers — otherwise you'll rationalise whatever you find.

Your target will depend on your model, your infrastructure, your users' expectations, and honestly, what your provider can actually deliver under load. The test reveals that; it doesn't guarantee it. But without a target, you're just generating numbers.

TTFT is one dimension. Token throughput — how fast the stream itself moves once started — is another. Both matter, and they can point in opposite directions. Worth measuring separately.

Stop Arguing With Your Terminal About Python Versions

2026-05-21T01:00:00.000Z

Project setup should be boring. Not in a "this is beneath me" way — in a "this takes thirty seconds and I never think about it" way.

Most of the time, it isn't.

You clone a repo you haven't touched in six months. The README says "install Python 3.11". You have 3.12. Something breaks. You remember there's a .python-version file somewhere, or maybe a requires-python in pyproject.toml. You spend twenty minutes figuring out which version manager you're even supposed to be using for this project before you've written a single line of code.

This isn't a Python problem. Terraform has tfenv. Node has nvm or volta or .nvmrc. Every language brings its own version manager, its own config file format, its own way of silently using the wrong version. And that's before you even get to figuring out how to run things — is it make test? ./scripts/test.sh? Some npm script buried in a package.json? Nobody knows. You ask Slack.

I got tired of this and started using mise. It's a single tool that handles both problems: pinned runtimes and discoverable tasks, for any language, in one file.

A Python service looks like this:

[tools]
python = "3.12.3"

[tasks.test]
description = "Run the test suite"
run = "pytest"

[tasks.verify]
description = "Run all checks before pushing"
depends = ["test", "build"]

Run mise install and you get exactly that Python version. Run mise tasks and you see everything the project knows how to do. Run mise run verify before pushing. That's it.

The part I find most satisfying is that mise run becomes a stable interface that hides whatever's behind it. I had a project that needed a custom k6 binary with SSE support for load testing a streaming API. Building it requires Go and a tool called xk6, which most people have never heard of. With mise, that's just:

[tools]
go = "1.22.3"

[tasks.build]
description = "Build k6 with xk6-sse extension"
run = "xk6 build --with github.com/phymbert/xk6-sse"

Now mise run build works for everyone — the developer who knows what xk6 is, the one who doesn't, and the CI job. Nobody has to know what's behind it. When I added another extension later, I changed one line. The interface didn't move.

Speaking of CI — this is where the real payoff is. A GitHub Actions workflow for a mise project looks like:

jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: jdx/mise-action@v4
      - run: mise run verify

mise-action reads mise.toml, installs the pinned versions, and puts them on PATH. Then mise run verify runs the exact same thing you run locally. No separate version install steps. No drift between what CI checks and what you check. This is the thing that makes it worth the setup cost — CI and local are no longer two separate mental models.

The one thing mise can't do is install itself. You need it on the machine before any of this works. I solve that with Chezmoi, a dotfile manager that runs once on a fresh machine. A run_once_install-mise.sh script does the bootstrap:

#!/bin/sh
curl https://mise.run | sh

Then the shell hook in ~/.zshrc (also managed by Chezmoi) activates mise per directory:

eval "$(mise activate zsh)"

Chezmoi sets up the machine, mise sets up each project. Neither knows the other exists. You go from a blank laptop to a running project without reading a setup guide — which is the point.

It won't fix an undocumented deployment process or a service that can't run locally. It encodes what's already known. And if your team is already settled on nvm + make for a single-language, single-runtime project, the migration cost might not be worth it. The value really compounds when you're working across multiple projects or switching between them regularly — which, in my experience, is most of the time.

mise replaces pyenv, nvm, rbenv, tfenv, asdf, and most other per-language version managers. If you're on asdf already, migration is painless — mise reads .tool-versions files natively.

Vscode Snippet To Add Markdown Frontmatter

2022-05-19T11:46:52.000Z

Click on settings for VSCode
Click on "User Snippets
Click on "New Global Snippets File..."
Add the following JSON which will be limited to markdown files only

{
    "Add Docusaurus blog frontmatter": {
        "body": [
            "---",
            "draft: true",
            "modified: ${CURRENT_YEAR}-${CURRENT_MONTH}-${CURRENT_DATE}T${CURRENT_HOUR}:${CURRENT_MINUTE}:${CURRENT_SECOND}.000Z",
            "date: ${CURRENT_YEAR}-${CURRENT_MONTH}-${CURRENT_DATE}T${CURRENT_HOUR}:${CURRENT_MINUTE}:${CURRENT_SECOND}.000Z",
            "title: ${TM_FILENAME_BASE/(\\w.*)/${1:/capitalized}/}",
            "slug: ${TM_FILENAME_BASE/([\\w-]+$)|([\\w-]+)|([-\\s]+)|([^\\w]+)/${1:/downcase}${2:/downcase}${2:+-}/gm}",
            "---"
        ],
        "description": "Create Blogpost Frontmatter",
        "scope": "markdown,mdx,md",
        "prefix": ["blog", "draft blog", "frontmatter", "add frontmatter"]
    }
}

Establishing A Walking Skeleton For Projects

2021-09-16T11:56:19.338Z

I've been reading the excellent book Growing Object-Oriented Software, Guided By Tests and there's so much that resonated with me about starting work on a new project.

As with anything new, give developers some shiny new something to work on and there's always the temptation to dive right in and get started with code. This often means that you're starting from the inside-out of a problem space and often some operational details are overlooked. When we're done solving that problem, trying to release that or to push that to production is often a problem nobody had perceived.

I recently experienced this on a project where we'd resorted to creating the application locally to put that online later. We had an idea of things like tech limitations and choices at the time, and deferring that decision seemed right, but it later came to bite us when we wanted to release the first feature.

We had roadblocks after one another, these came in the form of security policies, technology choices and release process already in place and trying something new. This whole thing cost us a couple of months of back and forth between dev/ops/admin folks.

So if I could tell my past self, I would say, release early and release often even if it means releasing the project skeleton in a hello world state.

In the context of the book I've been reading, establishing a walking skeleton is hugely important.

Journey To The Centre Of The Stack

2020-11-30T11:00:00.000Z

I first wrote this post in 2020 after spending several weeks containerising a legacy application I hadn't built and didn't fully understand. The experience was mostly archaeology — reading old config files, tracing hardcoded paths, figuring out what half a dozen processes actually did before touching anything. By the time I had a working Docker image, I'd earned it.

I'm updating it now because the journey has changed, and I think it's worth being honest about how.

The destination is the same. Legacy modernisation still means diving into unfamiliar depth, finding the load-bearing assumptions nobody documented, and making a series of architectural decisions that will outlive the sprint you're in. None of that has gone away.

What's changed is the discovery phase. The part where you spend half a day grepping through twelve config files to find every hardcoded /tmp path. The part where you read three hundred lines of an entrypoint script to understand what order things start in. The part where you're trying to build a mental model of a system from first principles because the person who built it left two years ago.

That part is cheaper now. Not free — cheaper. And for senior engineers in particular, that matters more than it might sound.

The cognitive load is the real cost

When you're working in unfamiliar legacy code, there's a ceiling on how much you can think about architecture while simultaneously trying to understand what you're looking at. The mental budget goes to comprehension first and decision- making second.

AI tooling shifts that balance. You can ask an agent to map the dependency graph, find all the places a config value is used, summarise what a given module does, or trace what happens to a file after it's uploaded. It doesn't always get this perfectly right, but it gets you oriented faster. And orientation is the precondition for good architectural thinking.

The senior engineer's job in a legacy modernisation isn't to read every file — it's to understand the system well enough to make the right calls. AI handles more of the reading. You do more of the deciding. That's a reasonable trade.

What still requires a human

This is the part worth being direct about: you cannot vibe code your way through a legacy containerisation.

Legacy systems have accumulated tradeoffs that aren't visible in the code itself. A hardcoded path exists for a reason. Session storage lives in a particular place because of a deployment constraint nobody remembered to remove. An environment variable has a default that only works in production because of how the CI pipeline was wired up five years ago.

An agent will find the path, tell you what it does, maybe suggest you move it to an env var. What it can't tell you is whether that change will break the cron job on the production server that six business processes depend on — the one that isn't in any of the tests because it predates the testing culture.

AI tools don't have access to the organisational context: the deployment constraints, the team agreements, the compliance requirements, the reason something was done a particular way three years ago. That knowledge lives in people, not code. And in legacy systems, it's often the most important knowledge there is.

That's still yours. The judgment about which changes are safe, which tradeoffs are real, which "quick wins" are landmines with a friendly face — that's the work. AI lowers the cost of getting to the point where you can make those calls. It doesn't make the calls for you.

The practical shape of it now

For what it's worth, here's roughly how I'd approach a legacy containerisation today, with AI tooling in the picture:

Discovery first. Ask the agent to map the application — what processes run, what config files exist, what external services are referenced, what file paths are hardcoded. Treat this as a starting point for your own investigation, not a definitive answer. Legacy codebases often have behaviour that doesn't show up in a static read.

Identify the decisions, not just the tasks. The genuine work in a containerisation is a small number of architectural choices — how to handle sessions, where persistent storage lives, how secrets are managed, how the application handles multiple running instances. Everything else is mechanics. Get to the decisions faster and spend your time there.

Keep the base image simple. If you have multiple applications sharing a common runtime, extract a base image. Agent tooling is good at spotting what's common across applications — use it for that comparison work.

Externalise everything that changes between environments. File paths, URLs, secrets, feature flags — environment variables. Not because it's clever, but because it's the minimum requirement for any container to be operationally useful. This hasn't changed.

Test incrementally. Don't wait until the Dockerfile is complete to run the application. Run it as early as possible, find the first thing that breaks, fix it, repeat. The agent can help write tests for the areas you're refactoring, but you need to know which areas matter enough to test.

Understand what you're committing to. Every dependency you add to the container — a system library, an OCR engine, a local model — is infrastructure you're now responsible for. The Docling post on this blog exists because we learnt that lesson the long way.

The title still fits

The journey to the centre of the stack is still a journey. The technology has changed, the tooling has improved, and you have better company on the way down than you did five years ago. But the stack is still there, and the centre of it still contains the decisions that matter.

The difference is that you can now spend more of your cognitive budget on the part that actually requires experience to get right.

That seems like a reasonable upgrade.

JSON Web Tokens

2017-02-28T11:45:44.128Z

What is it?

JSON Web Token (JWT) is a compact, URL-safe means of representing claims to be transferred between two parties. The claims in a JWT are encoded as a JSON object that is used as the payload of a JSON Web Signature (JWS) structure or as the plaintext of a JSON Web Encryption (JWE) structure, enabling the claims to be digitally signed or integrity protected with a Message Authentication Code (MAC) and/or encrypted.

JSON Web Tokens are an open, industry-standard RFC 7519 method for representing claims securely between two parties. See here: https://jwt.io

In this context, "claim" can be something like a "command", a one-time authorization, or basically any other scenario that you can word as:

Hello Server B, Server A told me that I could "claim goes here", and here’s the (cryptographic) proof.

Before we dive into this further, I’d like to define some terms we use in the realm of authentication.

Authentication — Proving who you are

Authorization — Being granted access to resources

Token — medium used to persist authentication and get authorization

So, what does It Look Like?

Well, it looks like another confusing looking string

eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiYWRtaW4iOnRydWV9.TJVA95OrM7E2cBab30RMHrHDcEfxjoYZgeFONFh7HgQ

Upon closer inspections, you’ll see that this JWT consist of three parts separated by dots (.), which are:

Header
Payload
Signature

Header.Payload.Signature

So, let’s break it down a little:

// header
    eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9
// payload
    .eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiYWRtaW4iOnRydWV9
// signature
    .TJVA95OrM7E2cBab30RMHrHDcEfxjoYZgeFONFh7HgQ

HS256 indicates that this token is signed using HMAC-SHA256.
{
    "alg": "HS256",
    "typ": "JWT"
}

Claims/Payload

The payload contains the claims that we wish to make
{ "sub": "1234567890", "name": "John Doe", "admin": true}

Signature

We use the following formula to calcalate signature

HMACSHA256(encodeBase64(header) + "." + encodeBase64(payload), secret)

This then gives us something like:

thiseyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJsb2dnZWRJbkFzIjoiYWRtaW4iLCJpYXQiOjE0MjI3Nzk2Mzh9.gzSraSYS8EXBxLN_oWnFSRgCzcmJmMjLiuyu5CSpyHI

Let’s expand on the claims section of JWT. The following claims are part of the RFC document:

iss: who is the issuer of this token auth.example.com sub: what is the subject of this token e.g. auth aud: who can use this token e.g ['client1.example.com','client2.example.com'] exp: Defines the expiration time as unix timestamp e.g. 1488192525 nbf: define how long after the issued token was generated we can use it e.g. 300 seconds (5 minutes) iat: issued at is a unix timestamp e.g. 1488192525 jti: JWT ID unique id. This can be used to prevent a token from being replayed e.g. "xa443D"

The key names are case sensitive and have been kept small to keep the JSON payload compact.

How does the Authentication Flow work?

In authentication, when the user successfully logs in using their credentials, a JSON Web Token will be returned and must be saved locally (typically in local storage, but cookies can be also used), instead of the traditional approach of creating a session in the server and returning a cookie.

POST /login
{
    email: "username@example-domain.com"
    password: "5£cUr3PA$$W0rd!"
}

Response 201 Created
{
    token: "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiYWRtaW4iOnRydWV9.TJVA95OrM7E2cBab30RMHrHDcEfxjoYZgeFONFh7HgQ"
}

Any subsequent calls to the API would typically send the Authorization header using the Bearer schema.

Authorization: "Bearer myToken"

Therefore the content of the header should look like the following.

GET /
# Headers
Authorization: "Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiYWRtaW4iOnRydWV9.TJVA95OrM7E2cBab30RMHrHDcEfxjoYZgeFONFh7HgQ"

This is a stateless authentication mechanism as the user state is never saved in the server memory. The server’s protected routes will check for a valid JWT in the Authorization header, and if there is, the user will be allowed.

signature valid?
client allowed? aud- expected issuer? iss- can this token be used? nbf

As JWTs are self-contained, all the necessary information is there, reducing the need of going back and forward to the database. This allows us to fully rely on data APIs that are stateless and even make requests to downstream services. It doesn’t matter which domains are serving the APIs, as Cross-Origin Resource Sharing (CORS) won’t be an issue as it doesn’t use cookies.

Making a case for JWT

Portability: they work across many different platforms, having implementations in various programming languages.
Compact: Because of its size, it can be sent through an URL, POST parameter, or inside an HTTP header. Additionally, due to its size its transmission is fast.
Self-contained: The payload contains all the required information about the user, to avoid querying the database more than once.
Control: Allows fine grained control over types of permissions. You can specify detailed access control information within the token itself as part of its payload. For instance, in the same way that you can create AWS security policies with very specific permissions, you can limit the token to only give read/write access to a single resource. In contrast, API Keys tend to have a coarse all-or-nothing access.

Problems with JWT

Cannot be used in place of Sessions & Cookies. If we want to use them in such a manner, then stick with Sessions and Cookies.
Data goes stale. For instance, an admin with a JWT token has had their access revoked but the token will keep on working because it was generated and verified correctly with the secret key.
There’s a critical vulnerability when using Asymmetric keys. The attackers know which algorithm was used to generate the token. This is open to abuse from the attackers. The server should already know which algorithm was used to generate/verify the integrity of this token.

Conclusion

JSON Web Tokens offer many advantages but not without having some drawbacks. If you work on an extremely large-scale application, sessions could be the appropriate choice. It is completely reasonable to combine sessions and JWT — they each have their own purpose, and sometimes you need both. Just don’t use JWT for persistent data.

Blogging Like a Hacker

2017-01-29T02:01:12.000Z

Hello World 🌏

This is my first post, hoping there's a lot more I can write, but for now, this is me getting started with blogging.

I am an experienced Software Developer from the UK. I started my first fulltime job in 2011, I never thought to share my thoughts & experience. Through this blog, I am hoping to channel my thoughts and hopefully pay forward the knowledge in the same way I've found to be useful from other bloggers.

For now, I have a lot to learn about GitHub pages but I shall be adding more content over the coming future.

Stay tuned. ⚠️ 🚧

Works in Prod Blog

The Classic 'Works on My Machine' — Now With Neural Networks

Why Document Fidelity Matters​

Starting Simple, Hitting Walls​

Building the Evaluation Harness​

The Constraint We Didn't Measure​

Pragmatism Over Purity​

Now It's All Prompt-Driven​

What the Evaluation Harness Gets Right — And What It Doesn't​

Lessons​

We Spent Five Weeks Making Docling Work. Then We Deleted It.

Act I: The Optimistic Beginning​

Act II: The Model Infrastructure Tax​

Act III: The Performance Whack-a-Mole​

Act IV: The Quiet Betrayal​

Act V: The Purge​

What replaced it​

The lesson, if there is one​

The Art of the Architecture Diagram Is Knowing What to Leave Out

Your Codebase Has Rules. Does CI Know That?

You Can't Debug What Bedrock Swallowed

TDD Was Solving the Agent Problem Before Agents Existed

The Blockers Don't Care That You're Using AI

The Metric Your Users Feel Before You Measure It

Measuring it​

Load profile design​

What you actually learn​

Set a concrete target before you test​

Stop Arguing With Your Terminal About Python Versions

Vscode Snippet To Add Markdown Frontmatter

Establishing A Walking Skeleton For Projects

Journey To The Centre Of The Stack

The cognitive load is the real cost​

What still requires a human​

The practical shape of it now​

The title still fits​