In Pulumi, stacks can separate an infrastructure build by environment or by more granular concerns. In a recent setup I worked on, we went with the latter approach: one stack handled compute, another handled networking, and a third managed global artifacts like container images. Given that Pulumi stacks are all about independent, interoperable units, the separation helped us reason about our infrastructure as distinct areas of concern. Until it didn’t. ...
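To make the split concrete, here is a minimal sketch (not the original setup; the stack name, output names, and AMI are hypothetical) of how a compute stack might consume the networking stack’s outputs through a Pulumi StackReference:

```typescript
import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";

// The compute stack reads outputs exported by the networking stack.
// "my-org/networking/prod" and the output names are placeholders.
const network = new pulumi.StackReference("my-org/networking/prod");

const subnetId = network.requireOutput("privateSubnetId") as pulumi.Output<string>;
const securityGroupId = network.requireOutput("appSecurityGroupId") as pulumi.Output<string>;

// Compute resources consume those outputs without owning the underlying resources.
const server = new aws.ec2.Instance("app-server", {
    ami: "ami-0123456789abcdef0", // placeholder AMI
    instanceType: "t3.micro",
    subnetId: subnetId,
    vpcSecurityGroupIds: [securityGroupId],
});

export const instanceId = server.id;
```

The stack reference is what keeps the units interoperable: the compute stack depends on values the networking stack exports, without either stack managing the other’s resources.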
The Journey to OTel Collector
In a previous role, I worked on establishing an observability stack using OpenTelemetry (OTel), which mostly involved setting up instances of the OTel Collector to run across a distributed network. The main goal of this effort was to decouple data collection from data export, so that data from various services could be gathered and exported more reliably. In the simplest of setups, logs and metrics can be sent directly to Grafana Cloud via an OTLP (OpenTelemetry Protocol) exporter. An OTLP exporter is the component that sends your telemetry data from the OTel Collector to a backend like Grafana. Exporters generally handle authentication and batching, and can be configured for HTTP or gRPC transport. Essentially, they act as the bridge between your application’s telemetry and your monitoring platform. ...
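As a rough sketch of what that direct path can look like from the application side, here is a Node service wiring up OTLP/HTTP exporters for traces and metrics with the OpenTelemetry JS SDK. The endpoint URL and credentials are placeholders (not real Grafana Cloud values), and in the setup described above a collector would normally sit between the app and the backend:

```typescript
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { OTLPMetricExporter } from "@opentelemetry/exporter-metrics-otlp-http";
import { PeriodicExportingMetricReader } from "@opentelemetry/sdk-metrics";

// Both exporters speak OTLP over HTTP and attach auth headers themselves.
// The endpoint and credentials below are placeholders.
const otlpHeaders = { Authorization: "Basic <base64 of instance-id:token>" };

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: "https://otlp-gateway.example.com/otlp/v1/traces",
    headers: otlpHeaders,
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: "https://otlp-gateway.example.com/otlp/v1/metrics",
      headers: otlpHeaders,
    }),
  }),
});

sdk.start(); // telemetry now flows straight to the backend, no collector in between
```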
You might not need an orchestrator
Across the startups that I’ve worked at, a recurring theme has been using Nomad and eventually migrating off of it. In one of them, which you might already be fairly acquainted with, Nomad was an initial iteration at constraint-based deployments, enabling regional rollouts and seamless rescheduling across a fleet during maintenance events. For a company like Fly.io that gives customers the ability to schedule apps in different regions, an orchestrator is fundamental to the user experience. That said, many companies (at least several that I’ve worked at) rely on orchestrators like Nomad for deployments. In these instances, Nomad is overkill and often complicates the deployment experience. ...
There's an 'A' in failure
I was recently let go from my role. This wasn’t my first experience with an unexpected termination. In fact, this time last year I found myself in a similar situation, albeit under different circumstances and with far more surprise. Yet this one felt like an especially low point in my career; unexpected in a similar way, but predictable given the circumstances. I was technically on an unofficial PIP prior to receiving the news, which was somewhat reasonable given that I wasn’t shipping and showing progress fast enough. I had taken on the substantial task of reconfiguring our infra from the ground up, all while maintaining the current one single-handedly. I’d written about this in a previous blog post, but this job stretched my skills well past my comfort zone and I was firing on all cylinders. With little to no onboarding (as is common at early-stage startups) I had to learn fast, make decisions confidently, and move at a breakneck pace. In previous roles, I frequently had the support of a team or a more senior coworker when I pushed significantly past a growth edge. This time, however, I was on my own, operating in a completely asynchronous remote environment, building greenfield infra I had no prior experience with. I thought I could manage (I am no stranger to working through uncertainty) but this work was arduous and there was barely enough room or time for a beginner’s mindset. ...
Sync, the local-first powerhouse
One of the hardest problems in local-first software is sync. In previous posts, I talked about OTs/CRDTs, the role of the cloud relative to the client, and general conflict resolution. But I never quite landed on the core theme behind those ideas: sync is the bread and butter of data reconciliation in local-first software. Briefly, a sync engine is responsible for collecting changes, resolving conflicts, and propagating those changes across all clients. Considering that every client operates on its own copy of the data, and can go offline for long, unknown stretches of time, a good sync engine must be designed for availability, partition tolerance, and eventual consistency, the classic trifecta of trade-offs in distributed systems. ...
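To make those responsibilities a little more tangible, here is a hypothetical interface (my own sketch, not a real library) for what a sync engine is on the hook for:

```typescript
// A hypothetical shape for a sync engine's responsibilities: collect local
// changes, merge remote ones, and keep track of what still needs propagating.

interface Change {
  id: string;          // unique id so re-delivery can be detected
  replicaId: string;   // which client produced the change
  timestamp: number;   // logical or wall-clock time used by the conflict resolver
  payload: unknown;    // the actual mutation
}

interface SyncEngine {
  // Record a mutation made locally, whether online or offline.
  recordLocalChange(change: Change): void;

  // Merge changes received from other replicas into local state,
  // resolving conflicts deterministically.
  applyRemoteChanges(changes: Change[]): void;

  // Everything not yet acknowledged by peers, to be pushed on reconnect.
  pendingChanges(): Change[];
}
```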
What Makes a Good Deterministic Merge
The goal of every data reconciliation algorithm is to ensure all replicas converge on the same state of the world, in spite of concurrent changes. The best way to achieve this is through a deterministic merge, where the same result is guaranteed regardless of the order or timing of updates. The backbone of a deterministic merge is an operation that is associative (grouping doesn’t matter), commutative (order doesn’t matter), and idempotent (reapplying changes yields the same result). ...
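As a small illustration (my own, with a last-writer-wins register standing in for whatever data type you actually merge), here is a merge that satisfies all three properties:

```typescript
// A last-writer-wins register whose merge is associative, commutative, and
// idempotent, so replicas can apply merges in any grouping, any order, and
// any number of times and still converge.

interface LwwRegister<T> {
  value: T;
  timestamp: number;  // e.g. a logical or hybrid clock
  replicaId: string;  // tie-breaker so the merge stays deterministic
}

function merge<T>(a: LwwRegister<T>, b: LwwRegister<T>): LwwRegister<T> {
  if (a.timestamp !== b.timestamp) {
    return a.timestamp > b.timestamp ? a : b;
  }
  // Equal timestamps: break the tie deterministically by replica id.
  return a.replicaId >= b.replicaId ? a : b;
}

// merge(merge(x, y), z) === merge(x, merge(y, z))   (associative)
// merge(x, y)           === merge(y, x)             (commutative)
// merge(x, x)           === x                       (idempotent)
```

Because the merge only ever keeps the “greater” of its two inputs under a total, deterministic ordering, regrouping, reordering, or reapplying merges cannot change the outcome.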
It's more than just OT vs CRDT
When it comes to data reconciliation in the context of building real-time, collaborative user experiences, two major techniques are often discussed: Operational Transform (OT) and Conflict-free Replicated Data Types (CRDTs). OT was developed in the late 80s and pioneered the foundational principles of real-time collaborative editors. CRDTs emerged much later (around 2006), and were partly motivated by the complexities of implementing OT and the correctness pitfalls it can introduce. If we compare the two methodologies more broadly, OT is a strictly operation-based approach to data reconciliation, while CRDTs offer both state-based and operation-based approaches. Examining the two from this lens, it’s clear that OT and CRDT are not simply opposites. More accurately, they are representative examples within a broad spectrum of reconciliation strategies. A more meaningful comparison, therefore, would be state-based vs operation-based approaches. This better captures the fundamental differences in how changes are propagated and merged across replicas. ...
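To sketch that distinction (my own illustration, using a simple grow-only counter rather than anything from OT or a specific CRDT library): a state-based replica ships its whole state and merges with a pointwise max, while an operation-based replica broadcasts the operations themselves and relies on each one being applied exactly once:

```typescript
// State-based: replicas exchange their full state and merge with a pointwise max.
type GCounterState = Record<string, number>; // replicaId -> count

function mergeState(a: GCounterState, b: GCounterState): GCounterState {
  const merged: GCounterState = { ...a };
  for (const [replica, count] of Object.entries(b)) {
    merged[replica] = Math.max(merged[replica] ?? 0, count);
  }
  return merged;
}

// Operation-based: replicas broadcast operations ("increment by n"), and every
// replica applies each operation exactly once, in any order.
type IncrementOp = { replicaId: string; amount: number };

function applyOp(state: GCounterState, op: IncrementOp): GCounterState {
  return { ...state, [op.replicaId]: (state[op.replicaId] ?? 0) + op.amount };
}

function total(state: GCounterState): number {
  return Object.values(state).reduce((sum, n) => sum + n, 0);
}
```

The trade-off in this toy example mirrors the broader one: state-based merging tolerates lossy, duplicated delivery but ships more data, while operation-based merging is lighter on the wire but leans on the delivery layer for exactly-once (or at least deduplicated) application.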
A Primer on Conflict in Local First
In my last post, I argued that conflict in local-first is an inherently human problem. Without understanding the intent and expectations of the end user, reconciliation is not always meaningful, given that correctness doesn’t equal usefulness from a user perspective. Before we dive into the technicalities of how conflicts get resolved, it’s worth exploring why drift happens in the first place and the basics of how it gets resolved. When we work in settings where multiple replicas can be edited independently, conflict is a direct, unavoidable result. Replicas can drift apart for several reasons: network latency, concurrent changes, and offline edits are all natural causes of divergence in local-first systems. Once replicas diverge, the need to agree on a consistent state arises. The goal here is deterministic reconciliation. In other words, given a set of changes, all replicas should eventually converge on the same result. ...
The semantics of conflict
Conflict is a recurring topic of discussion in the local-first space. This is unsurprising given that real-time collaboration and intermittent offline access inevitably introduce drift between replicas. When multiple users collaborate on a single application, or even when a user works on the same application from separate devices, change conflicts occur. This echoes a foundational principle in distributed systems: when networks partition, systems must decide how, and whether, to be consistent or available. In fact, much of the conversation around conflict resolution in local-first draws directly from distributed systems. Topics such as Conflict-free Replicated Data Types (CRDTs) and Operational Transform (OT) all stem from key research in distributed systems. However, unlike in distributed systems, where conflict is a matter of direct data reconciliation, conflict in local-first is a result of collaboration and is therefore, arguably, a human problem as much as a technical one. ...
The client is the replica
With the rise of mobile in the early aughts, where updates and collaborative edits felt instantaneous (since they wrote directly to local storage and used shared network drives), the traditional client-server model of the web proved clunky. Instead of the instantaneous updates that were common to mobile interactions, web updates incurred added latency from the extra round trip they had to make. The responsiveness of native apps set a new precedent for user experience. ...