Customer Success
PrepBusiness: Eliminating Mysterious Downtime
While Doubling Hardware
"We couldn't find anyone else offering the combination of engineering time, on-call SLA, and infrastructure that Lithus
provides. Instead of just migrating our infrastructure, they solved problems we'd been
struggling with for months, without us having to hire additional staff."
~ Keith Brink, Founder, PrepBusiness
The Situation
PrepBusiness is a Canadian SaaS company that provides a warehouse management system for Amazon prep centres. Their customers are global, so uptime matters.
They were running on AWS with a mix of ECS, EKS, and RDS. The setup worked, but sporadic outages kept hitting customer-facing services, and nobody could figure out why. The RDS instance was under-resourced, day-to-day AWS maintenance depended on external contractors, and the team had no visibility into what was actually happening inside their stack.
What We Built
Database migration
We set up secure network connections between their AWS infrastructure and our cluster, then ran live database synchronisation with a new redundant PostgreSQL cluster on StackGres. PrepBusiness opted for a weekend maintenance window rather than a more involved zero-downtime switchover. The cutover itself took 15 minutes, well within the maintenance window PrepBusiness had communicated to their customers.
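The live synchronisation step is the kind of thing PostgreSQL's built-in logical replication supports. A minimal sketch of that approach, assuming logical replication was the mechanism used (publication, subscription, and connection details here are illustrative, not PrepBusiness's actual configuration):

```sql
-- On the source (AWS RDS) database: publish all tables.
-- Requires logical decoding to be enabled
-- (rds.logical_replication = 1 in the RDS parameter group).
CREATE PUBLICATION prep_migration FOR ALL TABLES;

-- On the target PostgreSQL cluster: subscribe, which performs the
-- initial table copy and then streams changes continuously.
CREATE SUBSCRIPTION prep_migration
    CONNECTION 'host=source.example dbname=prep user=replicator password=...'
    PUBLICATION prep_migration;

-- At cutover: stop application writes, wait for replication lag to
-- reach zero, drop the subscription, and repoint the application.
DROP SUBSCRIPTION prep_migration;
```

With changes streaming continuously, the cutover window only needs to cover draining writes and repointing the application, which is how a 15-minute switchover inside a weekend maintenance window becomes practical.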
Private cloud cluster
We deployed a single-AZ cluster running our full managed Kubernetes stack that more than doubled their available compute and memory at the same monthly cost.
KubeVirt for VM workloads
We deployed KubeVirt for workloads that need full VMs. This includes dedicated GitHub Actions runners that we operate on PrepBusiness's behalf, so their team gets CI/CD runners without managing the infrastructure.
Finding the Downtime
The mysterious outages were one of PrepBusiness's main pain points, and they persisted after the migration. That ruled out the underlying infrastructure and pointed to something in the workload itself.
We used our observability stack — Prometheus metrics, Loki logs, and Grafana dashboards — to instrument the database layer over several weeks. The data eventually showed what was happening: long-running application queries, PostgreSQL autovacuum, and scheduled maintenance jobs were all landing on the database at the same time. CPU usage would peg at around 20 cores for extended periods. With the database saturated, web requests that depended on it would back up and time out after 30 seconds.
During the investigation, we used the spare capacity in the cluster to double the database's available CPU and memory at no extra cost. Once we could see the overlapping loads in the dashboards, we rescheduled the maintenance jobs and tuned the query patterns so they no longer competed with production traffic. The downtime stopped.
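Untangling overlapping loads like these typically comes down to per-table autovacuum tuning and query timeouts. A hypothetical sketch of the kind of changes involved (table names, role names, and values are illustrative, not PrepBusiness's actual settings):

```sql
-- Make autovacuum on a hot table run more often in smaller, throttled
-- passes, so it doesn't land one large burst of I/O on peak traffic.
ALTER TABLE orders SET (
    autovacuum_vacuum_scale_factor = 0.01,  -- trigger vacuum earlier
    autovacuum_vacuum_cost_delay = 10       -- throttle each pass (ms)
);

-- Cap application queries well below the web tier's 30-second request
-- timeout, so a runaway query fails fast instead of saturating CPU
-- while requests queue up behind it.
ALTER ROLE app_user SET statement_timeout = '15s';
```

Rescheduling the batch maintenance jobs away from peak hours does the rest: once the three load sources no longer coincide, the same hardware handles each of them comfortably.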
"We were struggling to pin down an issue that would cause unpredictable downtime for all our customers. Lithus dug into it with us, built dashboards to help us observe performance across all our services, and ultimately isolated the issue so that we could push out a fix."
~ Keith Brink, Founder, PrepBusiness
The Results
- Doubled compute and memory at the same monthly spend
- Downtime eliminated by tracing overlapping database loads to their source
- 45-day migration from contract to go-live, with 15 minutes of planned downtime
- No more relying on external contractors for day-to-day infrastructure work
Ongoing Work
Since the migration, we have upgraded PrepBusiness's deployment pipeline from NGINX Unit (now deprecated) to FrankenPHP, handled PHP version upgrades, and applied security patches when CVEs landed. We continue to operate their cluster and respond to infrastructure issues so their team can focus on the product. This is what included SRE time looks like in practice.