Customer Success
The Climate Risk Group:
From Unpredictable AWS Bills to Bare Metal
"I tell people we moved off AWS, but that doesn't really capture it. We got a team that built us custom tooling, migrated our databases, deployed a private network, and responds on Slack within hours. AWS doesn't offer that at any price."
~ Tim McEwan, CTO, The Climate Risk Group
The Situation
The Climate Risk Group provides physical climate risk assessments to major global banks and governments. Their models analyse individual assets (harbours, buildings, critical infrastructure) against climate projections, producing detailed per-asset reports. The compute requirements are large and growing. The clients relying on these assessments expect the infrastructure behind them to be solid.
Their AWS setup had grown organically over several years: three separate EKS clusters across multiple sub-accounts, legacy servers, a managed database service, and configuration spread across multiple systems. Data scientists needed VMs for analysis work. The development team was on Kubernetes but still early in its adoption. One engineer was managing all of it alongside other responsibilities.
Monthly AWS spend averaged $25,000, with unpredictable spikes. One EFS transfer event cost $20,000 over three days. Legacy on-demand instances for data science work added another $5-10k/month with no cost governance.
The Climate Risk Group needed a partner to take ownership of the infrastructure, not just advise on it.
What We Built
We migrated The Climate Risk Group from three separate AWS clusters to a single multi-AZ bare-metal cluster in Germany, running our full managed Kubernetes stack. The cluster is logically separated into production, QA, and development environments. The migration was phased over four months: QA first, then production, then development. Billing started when workloads were running.
Production workloads get scheduling priority; data analysis workloads use spare capacity when available.
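In Kubernetes, this kind of split is typically expressed with priority classes; a minimal sketch, assuming standard pod priority and preemption (the class names and values here are illustrative, not The Climate Risk Group's actual configuration):

```yaml
# Hypothetical priority classes: production preempts batch analysis.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-critical      # illustrative name
value: 1000000
description: "Production workloads; may preempt lower-priority pods."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-spare-capacity     # illustrative name
value: 0
preemptionPolicy: Never          # batch jobs wait for free capacity
description: "Data analysis jobs; scheduled only when capacity is free."
```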
Storage
We deployed a dedicated object storage cluster and JuiceFS as a POSIX-compatible
shared filesystem, replacing AWS S3, Cloudflare R2, and EFS. The object
storage cluster benchmarks at 200 Gbps aggregate throughput while handling
50,000 requests per second. At sustained throughput, the equivalent S3 GET request
volume alone would cost ~$55,700 USD per month on AWS. On bare metal, the cost
is fixed.
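Inside a Kubernetes cluster, a JuiceFS filesystem is typically consumed through the JuiceFS CSI driver as a shared volume; a hedged sketch (the storage class name is illustrative, not their actual configuration):

```yaml
# Hypothetical PVC backed by a JuiceFS storage class via the JuiceFS CSI driver.
# ReadWriteMany gives pods POSIX shared access — the role EFS played before.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-analysis-data     # illustrative name
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: juicefs-sc   # illustrative name
  resources:
    requests:
      storage: 1Ti
```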
Data scientist VMs
The Climate Risk Group's data scientists need full Linux VMs for analysis work. On AWS, these were
expensive on-demand instances with no cost governance. We moved them into the
cluster using KubeVirt and built a custom Kubernetes operator and CLI in Rust
to manage the full VM lifecycle: provisioning, networking, resizing, monitoring,
and shutdown. Their team provisions VMs through the CLI. Each VM automatically
connects to their private Tailscale mesh network, powered by the open-source Headscale project.
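Under the hood, each KubeVirt VM is an ordinary Kubernetes resource that an operator can create and reconcile; a minimal sketch of what the CLI provisions (names, sizes, and the disk image are illustrative, and in practice the Tailscale join would be injected by the operator, e.g. via cloud-init):

```yaml
# Hypothetical data-scientist VM as a KubeVirt resource.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: analyst-vm-01            # illustrative name
spec:
  running: true
  template:
    spec:
      domain:
        cpu:
          cores: 4               # illustrative sizing
        memory:
          guest: 16Gi
        devices:
          disks:
            - name: rootdisk
              disk:
                bus: virtio
      volumes:
        - name: rootdisk
          containerDisk:
            image: quay.io/containerdisks/ubuntu:22.04   # illustrative image
```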
Database
We migrated The Climate Risk Group's databases off a managed database vendor into in-cluster
PostgreSQL. Every database runs as a primary/replica pair with snapshot backups
and point-in-time recovery, with 2-4x the resources of the managed instances they replaced.
We later handled major version upgrades.
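Point-in-time recovery rests on continuous WAL archiving alongside periodic base backups; a minimal postgresql.conf sketch, with an illustrative archive destination (not their actual setup):

```
# postgresql.conf fragment (illustrative): ship every completed WAL
# segment to durable storage so the database can be replayed to any
# point in time between snapshots.
wal_level = replica
archive_mode = on
archive_command = 'cp %p /mnt/wal-archive/%f'   # illustrative destination
archive_timeout = 300                           # force a segment switch every 5 minutes
```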
Private network
We deployed a self-hosted Headscale server, giving the entire team secure access
to the cluster and internal services through a Tailscale mesh network. They
didn't have this before. It simplified compliance and gave everyone access
to internal tooling without managing traditional VPN infrastructure.
Compliance
Moving off AWS removed the need for Security Hub — there are no longer dozens
of managed services to aggregate findings across. For their ISO 27001 requirements,
we deployed Falco for runtime threat detection at the kernel level and Kyverno
for policy enforcement. Security events across the cluster are reported and auditable.
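As an example of the kind of guardrail Kyverno enforces, a cluster policy can reject pods that request privileged containers; a sketch of the pattern, not their actual policy set:

```yaml
# Hypothetical Kyverno policy: block privileged containers cluster-wide.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged      # illustrative name
spec:
  validationFailureAction: Enforce
  rules:
    - name: deny-privileged-containers
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Privileged containers are not allowed."
        pattern:
          spec:
            containers:
              - =(securityContext):
                  =(privileged): "false"
```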
"Small configuration changes on AWS lead to them charging you orders of magnitude more money. That's not going to happen on bare metal."
~ Sohum Banerjea, Senior Architect, The Climate Risk Group
The Numbers
| | Before | After |
|---|---|---|
| Monthly cost | $25k USD, with unpredictable spikes | ~45% less, fixed monthly rate |
| Clusters to manage | 3 separate EKS clusters | 1 multi-AZ cluster, 3 logical environments |
| Storage | AWS S3 + EFS ($2,500/day during spikes) + Cloudflare R2 | 200 Gbps dedicated cluster, flat cost |
| Data scientist VMs | On-demand AWS instances, no cost controls | KubeVirt VMs with CLI provisioning and lifecycle management |
| Database | Managed vendor, resource-constrained. Costs growing with data volume | Primary database self-hosted in-cluster, with 2-4x resources |
| Private network access | None | Full team access via Tailscale mesh network, controlled via ACL |
| Compliance tooling | AWS Security Hub | Falco + Kyverno (simpler surface, fewer tools needed) |
Support
We work with The Climate Risk Group's team directly via Slack day-to-day and run fortnightly calls to coordinate priorities and work through their DevOps backlog. We respond to alerts, handle capacity planning, and help debug application-level issues — feeding back infrastructure-level findings to their developers with specific remediations.
Their team uses the Grafana observability stack directly. We build dashboards tailored to their needs, so their developers can see how their application behaves across the cluster without needing to be intimately familiar with the infrastructure underneath.
After migrating their workloads, we observed early batch runs and saw that resource use was suboptimal. High context-switch rates showed the scheduler struggling with bursty CPU demand from individual batch processes. We built custom metrics to measure actual throughput against allocated resources. We constrained CPU allocations to slow individual processes down, making scheduling more deterministic so batch jobs could be packed tightly into the available hardware. We tightened memory limits so that leaking processes would be killed before their usage ballooned and sat idle. Over several days, we tuned these limits continually until cluster throughput was optimised for their workload.
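The tuning described above comes down to setting deliberate requests and limits on the batch pods; a hedged sketch (the numbers are illustrative, not the tuned values):

```yaml
# Illustrative batch-pod resources: a hard CPU limit smooths bursty
# demand so the scheduler can bin-pack jobs predictably, and a tight
# memory limit ensures leaking processes are OOM-killed rather than
# sitting idle on RAM.
resources:
  requests:
    cpu: "2"
    memory: 4Gi
  limits:
    cpu: "2"        # cap bursts: slower per-process, higher cluster throughput
    memory: 4Gi     # leaks get killed instead of ballooning
```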
"Thank you very much for being so responsive, by the way. It is awesome."
~ Tim McEwan, CTO, The Climate Risk Group