# The Climate Risk Group: From Unpredictable AWS Bills to Bare Metal

**Industry**: Climate Risk Analytics | **Location**: Australia / UK | **Team**: ~45 people | **Website**: theclimateriskgroup.com

> "I tell people we moved off AWS, but that doesn't really capture it. We got a team that built us custom tooling, migrated our databases, deployed a private network, and responds on Slack within hours. AWS doesn't offer that at any price." - Tim McEwan, CTO, The Climate Risk Group

## The Situation

The Climate Risk Group provides physical climate risk assessments to major global banks and governments. Their models analyse individual assets (harbours, buildings, critical infrastructure) against climate projections.

Their AWS setup had grown organically over several years: three separate EKS clusters across multiple sub-accounts, legacy servers, a managed database service, and configuration spread across multiple systems. One engineer was managing all of it alongside other responsibilities.

Monthly AWS spend averaged $25,000, with unpredictable spikes. One EFS transfer event cost $20,000 over three days. Legacy on-demand instances for data science work added another $5-10k/month with no cost governance.

## What We Built

Migrated The Climate Risk Group from three separate AWS clusters to a single multi-AZ bare-metal cluster in Germany, running our [full managed Kubernetes stack](/services). Logically separated into production, QA, and development environments. Migration phased over four months: QA first, then production, then development. Billing started when workloads were running.

**Storage**: Dedicated object storage cluster and JuiceFS as a POSIX-compatible shared filesystem, replacing AWS S3, Cloudflare R2, and EFS. The object storage cluster benchmarks at 200 Gbps aggregate throughput while handling 50,000 requests per second. At sustained throughput, the equivalent S3 GET request volume alone would cost ~$55,700 USD per month on AWS. On bare metal, the cost is fixed.
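For context on that figure, a back-of-envelope sketch of the arithmetic, assuming a representative S3 Standard GET price of $0.0004 per 1,000 requests and a 30-day month (actual AWS request pricing varies by region and tier, which accounts for the gap to ~$55,700):

```python
# Rough S3 GET cost for the benchmarked request volume.
# The price per 1,000 requests is an assumed representative figure,
# not a quoted AWS rate.
REQUESTS_PER_SECOND = 50_000            # benchmarked cluster throughput
SECONDS_PER_MONTH = 60 * 60 * 24 * 30   # 30-day month
GET_PRICE_PER_1000 = 0.0004             # assumed USD per 1,000 GET requests

monthly_requests = REQUESTS_PER_SECOND * SECONDS_PER_MONTH
monthly_cost = monthly_requests / 1000 * GET_PRICE_PER_1000
print(f"{monthly_requests:,} requests/month -> ${monthly_cost:,.0f}/month")
# -> 129,600,000,000 requests/month -> $51,840/month at these assumptions
```

Either way, the conclusion holds: at this request volume, per-request pricing alone runs to five figures a month, while the dedicated cluster's cost does not move.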

**Data scientist VMs**: Moved data science VMs into the cluster using KubeVirt with a custom Kubernetes operator and CLI built in Rust. Full VM lifecycle management: provisioning, networking, resizing, monitoring, and shutdown. Each VM automatically connects to a private Tailscale mesh network via open-source Headscale.
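The lifecycle management described above can be pictured as a small state machine the operator enforces. The sketch below is illustrative only; the state and transition names are hypothetical and are not the actual operator's API:

```python
# Hypothetical VM lifecycle state machine: each state lists the
# transitions an operator would permit. Names are illustrative.
ALLOWED = {
    "provisioning": {"running", "failed"},
    "running": {"resizing", "stopping"},
    "resizing": {"running", "failed"},
    "stopping": {"stopped"},
    "stopped": {"running"},        # restart
    "failed": {"provisioning"},    # reprovision from scratch
}

def transition(state: str, target: str) -> str:
    """Move a VM to `target`, rejecting transitions the operator disallows."""
    if target not in ALLOWED.get(state, set()):
        raise ValueError(f"illegal transition {state} -> {target}")
    return target

# A typical VM lifetime: provision, run, resize, run, shut down.
state = "provisioning"
for nxt in ["running", "resizing", "running", "stopping", "stopped"]:
    state = transition(state, nxt)
print(state)  # -> stopped
```

Encoding the legal transitions in one place is what lets a CLI refuse nonsensical requests (resizing a stopped VM, say) instead of leaving the VM in an undefined state.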

**Database**: Migrated off a managed database vendor into in-cluster PostgreSQL. Every database runs as a primary/replica pair with snapshot backups and point-in-time recovery, with 2-4x the resources of the managed instances they replaced.
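Point-in-time recovery reduces to a simple invariant: a restore target is reachable only if a base snapshot exists at or before it and log retention covers the span up to it. A minimal sketch of that check, with illustrative timestamps (not the actual backup tooling):

```python
# Sketch of the PITR invariant: restorable iff some snapshot precedes the
# target and retained WAL reaches the target. Values are illustrative.
from datetime import datetime, timedelta

def can_restore_to(target, snapshots, wal_retained_until):
    """True if a base snapshot precedes `target` and WAL covers up to it."""
    base = max((s for s in snapshots if s <= target), default=None)
    return base is not None and target <= wal_retained_until

now = datetime(2024, 6, 1, 12, 0)
snapshots = [now - timedelta(days=d) for d in (1, 2, 3)]  # daily snapshots
print(can_restore_to(now - timedelta(hours=6), snapshots, now))   # True
print(can_restore_to(now - timedelta(days=10), snapshots, now))   # False
```

The practical consequence: the recovery window is bounded by snapshot retention on one side and log retention on the other, so both have to be sized deliberately.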

**Private network**: Self-hosted Headscale server giving the entire team secure access to the cluster and internal services through a Tailscale mesh network. Simplified compliance and gave everyone access to internal tooling without managing traditional VPN infrastructure.

**Compliance**: Deployed Falco for runtime threat detection at the kernel level and Kyverno for policy enforcement, replacing AWS Security Hub.

## The Numbers

| | Before | After |
|---|---|---|
| Monthly cost | $25k USD, with unpredictable spikes | ~45% less, [fixed monthly rate](/pricing) |
| Clusters to manage | 3 separate EKS clusters | 1 multi-AZ cluster, 3 logical environments |
| Storage | EFS ($2,500/day during spikes) + Cloudflare R2 | 200 Gbps dedicated cluster, flat cost |
| Data scientist VMs | On-demand AWS instances, no cost controls | KubeVirt VMs with CLI provisioning and lifecycle management |
| Database | Managed vendor, resource-constrained; costs grew with data volume | Primary database self-hosted in-cluster, with 2-4x resources |
| Private network access | None | Full team access via Tailscale mesh network, controlled via ACL |
| Compliance tooling | AWS Security Hub | Falco + Kyverno (simpler surface, fewer tools needed) |

## Support

Direct Slack access day-to-day and fortnightly calls to coordinate priorities. We respond to alerts, handle capacity planning, and help debug application-level issues — feeding back infrastructure-level findings to their developers with specific remediations.

Their team uses the Grafana observability stack directly. We build dashboards tailored to their needs, so their developers can see how their application behaves across the cluster without needing to be intimately familiar with the infrastructure underneath.

Once their workloads were migrated, we observed early batch runs and saw that resources were being used inefficiently. High context-switch rates showed the scheduler struggling with bursty CPU demand from individual batch processes. We built custom metrics to measure actual throughput against allocated resources, then constrained CPU allocations to slow individual processes down, making scheduling more deterministic so batch jobs could be packed tightly onto the available hardware. We also tightened memory limits so that leaking processes were killed before their usage ballooned and sat idle. Over several days we tuned these limits until cluster throughput was optimised for their workload.
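The metric at the heart of that tuning loop is throughput per allocated resource, not raw per-process speed. A simplified version, with hypothetical numbers (not the actual metrics pipeline):

```python
# Illustrative "throughput vs. allocated resources" metric: work units
# completed per allocated CPU-second. All figures are hypothetical.
def efficiency(units_completed: int, cpus_allocated: float, seconds: float) -> float:
    """Work units completed per allocated CPU-second."""
    return units_completed / (cpus_allocated * seconds)

# Before tuning: generous CPU allocations, bursty scheduling.
before = efficiency(units_completed=1200, cpus_allocated=16, seconds=3600)
# After capping CPU: individual jobs run slightly slower,
# but halving allocations lets twice as many jobs pack in.
after = efficiency(units_completed=1100, cpus_allocated=8, seconds=3600)
print(f"before={before:.4f}, after={after:.4f}")  # after > before
```

This is why capping CPU can *raise* cluster throughput: each process finishes a little later, but the freed allocation lets the scheduler pack more jobs onto the same hardware.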

> "Thank you very much for being so responsive, by the way. It is awesome." - Tim McEwan, CTO, The Climate Risk Group
