Customer Success
PrepBusiness: Eliminating Mysterious Downtime
While Doubling Hardware
"We couldn't find anyone else offering the combination of engineering time, on-call SLA, and infrastructure that Lithus
provides. Instead of just migrating our infrastructure, they solved problems we'd been
struggling with for months, without us having to hire additional staff."
~ Keith Brink, Founder, PrepBusiness
The Situation
PrepBusiness is a Canadian SaaS company that provides a warehouse management system for Amazon prep centres. Their customers are global, so uptime matters.
They were running on AWS with a mix of ECS, EKS, and RDS. The setup worked, but sporadic outages kept hitting customer-facing services, and nobody could figure out why. The RDS instance was under-resourced, day-to-day AWS maintenance depended on external contractors, and the team had no visibility into what was actually happening inside their stack.
What We Built
Database migration
We set up secure network connections between their AWS infrastructure and our cluster, then ran live database synchronisation with a new redundant PostgreSQL cluster on StackGres. PrepBusiness opted for a weekend maintenance window rather than a more involved zero-downtime switchover. The cutover itself took 15 minutes, well within the maintenance window PrepBusiness had communicated to their customers.
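The live synchronisation step is the kind of thing PostgreSQL's built-in logical replication supports. A minimal sketch of that approach, assuming logical replication was the mechanism used (publication, subscription, and connection details here are illustrative, not PrepBusiness's actual configuration):

```sql
-- On the source (AWS RDS) database: publish all tables.
-- Requires logical decoding to be enabled
-- (rds.logical_replication = 1 in the RDS parameter group).
CREATE PUBLICATION prep_migration FOR ALL TABLES;

-- On the target PostgreSQL cluster: subscribe, which performs the
-- initial table copy and then streams changes continuously.
CREATE SUBSCRIPTION prep_migration
    CONNECTION 'host=source.example dbname=prep user=replicator password=...'
    PUBLICATION prep_migration;

-- At cutover: stop application writes, wait for replication lag to
-- reach zero, drop the subscription, and repoint the application.
DROP SUBSCRIPTION prep_migration;
```

With changes streaming continuously, the cutover window only needs to cover draining writes and repointing the application, which is how a 15-minute switchover inside a weekend maintenance window becomes practical.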
Private cloud cluster
We deployed a single-AZ cluster running our full managed Kubernetes stack that more than doubled their available compute and memory at the same monthly cost.
KubeVirt for VM workloads
We deployed KubeVirt for workloads that need full VMs. This includes dedicated GitHub Actions runners that we operate on PrepBusiness's behalf, so their team gets CI/CD runners without managing the infrastructure.
Finding the Downtime
The mysterious outages were one of PrepBusiness's main pain points, and they persisted after the migration. That ruled out the underlying infrastructure and pointed to something in the workload itself.
We used our observability stack — Prometheus metrics, Loki logs, and Grafana dashboards — to instrument the database layer over several weeks. The data eventually showed what was happening: long-running application queries, PostgreSQL autovacuum, and scheduled maintenance jobs were all landing on the database at the same time. CPU usage would peg at around 20 cores for extended periods. With the database saturated, web requests that depended on it would back up and time out after 30 seconds.
During the investigation, we used the spare capacity in the cluster to double the database's available CPU and memory at no extra cost. Once we could see the overlapping loads in the dashboards, we rescheduled the maintenance jobs and tuned the query patterns so they no longer competed with production traffic. The downtime stopped.
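Untangling overlapping loads like these typically comes down to per-table autovacuum tuning and query timeouts. A hypothetical sketch of the kind of changes involved (table names, role names, and values are illustrative, not PrepBusiness's actual settings):

```sql
-- Make autovacuum on a hot table run more often in smaller, throttled
-- passes, so it doesn't land one large burst of I/O on peak traffic.
ALTER TABLE orders SET (
    autovacuum_vacuum_scale_factor = 0.01,  -- trigger vacuum earlier
    autovacuum_vacuum_cost_delay = 10       -- throttle each pass (ms)
);

-- Cap application queries well below the web tier's 30-second request
-- timeout, so a runaway query fails fast instead of saturating CPU
-- while requests queue up behind it.
ALTER ROLE app_user SET statement_timeout = '15s';
```

Rescheduling the batch maintenance jobs away from peak hours does the rest: once the three load sources no longer coincide, the same hardware handles each of them comfortably.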
"We were struggling to pin down an issue that would cause unpredictable downtime for all our customers. Lithus dug into it with us, built dashboards to help us observe performance across all our services, and ultimately isolated the issue so that we could push out a fix."
~ Keith Brink, Founder, PrepBusiness
The Results
- Doubled compute and memory at the same monthly spend
- Downtime eliminated by tracing overlapping database loads to their source
- 45-day migration from contract to go-live, with 15 minutes of planned downtime
- No more relying on external contractors for day-to-day infrastructure work
Ongoing Work
Since the migration, we have upgraded PrepBusiness's deployment pipeline from NGINX Unit (now deprecated) to FrankenPHP, handled PHP version upgrades, and applied security patches when CVEs landed. We continue to operate their cluster and respond to infrastructure issues so their team can focus on the product. This is what included SRE time looks like in practice.