Customer Success
The Climate Risk Group:
From Unpredictable AWS Bills to Bare Metal
"I tell people we moved off AWS, but that doesn't really capture it. We got a team that built us custom tooling, migrated our databases, deployed a private network, and responds on Slack within hours. AWS doesn't offer that at any price."
~ Tim McEwan, CTO, The Climate Risk Group
The Situation
The Climate Risk Group provides physical climate risk assessments to major global banks and governments. Their models analyse individual assets (harbours, buildings, critical infrastructure) against climate projections, producing detailed per-asset reports. The compute requirements are large and growing. The clients relying on these assessments expect the infrastructure behind them to be solid.
Their AWS setup had grown organically over several years: three separate EKS clusters across multiple sub-accounts, legacy servers, a managed database service, and configuration spread across multiple systems. Data scientists needed VMs for analysis work. The development team was on Kubernetes but still early in its adoption. One engineer was managing all of it alongside other responsibilities.
Monthly AWS spend averaged $25,000, with unpredictable spikes. One EFS transfer event cost $20,000 over three days. Legacy on-demand instances for data science work added another $5-10k/month with no cost governance.
The Climate Risk Group needed a partner to take ownership of the infrastructure, not just advise on it.
What We Built
We migrated The Climate Risk Group from three separate AWS clusters to a single multi-AZ bare-metal cluster in Germany, running our full managed Kubernetes stack. The cluster is logically separated into production, QA, and development environments. The migration was phased over four months: QA first, then production, then development. Billing started when workloads were running.
Production workloads get scheduling priority; data analysis workloads use spare capacity when available.
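In Kubernetes, this kind of split is typically expressed with priority classes; a minimal sketch, assuming standard pod priority and preemption (the class names and values here are illustrative, not The Climate Risk Group's actual configuration):

```yaml
# Hypothetical priority classes: production preempts batch analysis.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-critical      # illustrative name
value: 1000000
description: "Production workloads; may preempt lower-priority pods."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-spare-capacity     # illustrative name
value: 0
preemptionPolicy: Never          # batch jobs wait for free capacity
description: "Data analysis jobs; scheduled only when capacity is free."
```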
Storage
We deployed a dedicated object storage cluster and JuiceFS as a POSIX-compatible
shared filesystem, replacing AWS S3, Cloudflare R2, and EFS. The object
storage cluster benchmarks at 200 Gbps aggregate throughput while handling
50,000 requests per second. At sustained throughput, the equivalent S3 GET request
volume alone would cost ~$55,700 USD per month on AWS. On bare metal, the cost
is fixed.
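Inside a Kubernetes cluster, a JuiceFS filesystem is typically consumed through the JuiceFS CSI driver as a shared volume; a hedged sketch (the storage class name is illustrative, not their actual configuration):

```yaml
# Hypothetical PVC backed by a JuiceFS storage class via the JuiceFS CSI driver.
# ReadWriteMany gives pods POSIX shared access — the role EFS played before.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-analysis-data     # illustrative name
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: juicefs-sc   # illustrative name
  resources:
    requests:
      storage: 1Ti
```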
Data scientist VMs
The Climate Risk Group's data scientists need full Linux VMs for analysis work. On AWS, these were
expensive on-demand instances with no cost governance. We moved them into the
cluster using KubeVirt and built a custom Kubernetes operator and CLI in Rust
to manage the full VM lifecycle: provisioning, networking, resizing, monitoring,
and shutdown. Their team provisions VMs through the CLI. Each VM automatically
connects to their private Tailscale mesh network, powered by the open-source Headscale project.
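Under the hood, each KubeVirt VM is an ordinary Kubernetes resource that an operator can create and reconcile; a minimal sketch of what the CLI provisions (names, sizes, and the disk image are illustrative, and in practice the Tailscale join would be injected by the operator, e.g. via cloud-init):

```yaml
# Hypothetical data-scientist VM as a KubeVirt resource.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: analyst-vm-01            # illustrative name
spec:
  running: true
  template:
    spec:
      domain:
        cpu:
          cores: 4               # illustrative sizing
        memory:
          guest: 16Gi
        devices:
          disks:
            - name: rootdisk
              disk:
                bus: virtio
      volumes:
        - name: rootdisk
          containerDisk:
            image: quay.io/containerdisks/ubuntu:22.04   # illustrative image
```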
Database
We migrated The Climate Risk Group's databases off a managed database vendor into in-cluster
PostgreSQL. Every database runs as a primary/replica pair with snapshot backups
and point-in-time recovery, with 2-4x the resources of the managed instances they replaced.
We later handled major version upgrades.
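Point-in-time recovery rests on continuous WAL archiving alongside periodic base backups; a minimal postgresql.conf sketch, with an illustrative archive destination (not their actual setup):

```
# postgresql.conf fragment (illustrative): ship every completed WAL
# segment to durable storage so the database can be replayed to any
# point in time between snapshots.
wal_level = replica
archive_mode = on
archive_command = 'cp %p /mnt/wal-archive/%f'   # illustrative destination
archive_timeout = 300                           # force a segment switch every 5 minutes
```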
Private network
We deployed a self-hosted Headscale server, giving the entire team secure access
to the cluster and internal services through a Tailscale mesh network. They
didn't have this before. It simplified compliance and gave everyone access
to internal tooling without managing traditional VPN infrastructure.
Compliance
Moving off AWS removed the need for Security Hub — there are no longer dozens
of managed services to aggregate findings across. For their ISO 27001 requirements,
we deployed Falco for runtime threat detection at the kernel level and Kyverno
for policy enforcement. Security events across the cluster are reported and auditable.
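As an example of the kind of guardrail Kyverno enforces, a cluster policy can reject pods that request privileged containers; a sketch of the pattern, not their actual policy set:

```yaml
# Hypothetical Kyverno policy: block privileged containers cluster-wide.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged      # illustrative name
spec:
  validationFailureAction: Enforce
  rules:
    - name: deny-privileged-containers
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Privileged containers are not allowed."
        pattern:
          spec:
            containers:
              - =(securityContext):
                  =(privileged): "false"
```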
"Small configuration changes on AWS lead to them charging you orders of magnitude more money. That's not going to happen on bare metal."
~ Sohum Banerjea, Senior Architect, The Climate Risk Group
The Numbers
| | Before | After |
|---|---|---|
| Monthly cost | $25k USD, with unpredictable spikes | ~45% less, fixed monthly rate |
| Clusters to manage | 3 separate EKS clusters | 1 multi-AZ cluster, 3 logical environments |
| Storage | AWS S3 + EFS ($2,500/day during spikes) + Cloudflare R2 | 200 Gbps dedicated cluster, flat cost |
| Data scientist VMs | On-demand AWS instances, no cost controls | KubeVirt VMs with CLI provisioning and lifecycle management |
| Database | Managed vendor, resource-constrained. Costs growing with data volume | Primary database self-hosted in-cluster, with 2-4x resources |
| Private network access | None | Full team access via Tailscale mesh network, controlled via ACL |
| Compliance tooling | AWS Security Hub | Falco + Kyverno (simpler surface, fewer tools needed) |
Support
We work with The Climate Risk Group's team directly via Slack day-to-day and run fortnightly calls to coordinate priorities and work through their DevOps backlog. We respond to alerts, handle capacity planning, and help debug application-level issues — feeding back infrastructure-level findings to their developers with specific remediations.
Their team uses the Grafana observability stack directly. We build dashboards tailored to their needs, so their developers can see how their application behaves across the cluster without needing to be intimately familiar with the infrastructure underneath.
After migrating their workloads, we observed early batch runs and saw that resource use was suboptimal. High context-switch rates showed the scheduler struggling with bursty CPU demand from individual batch processes. We built custom metrics to measure actual throughput against allocated resources. We constrained CPU allocations to slow individual processes down, making scheduling more deterministic so batch jobs could be packed tightly into the available hardware. We tightened memory limits so that leaking processes would be killed before their usage ballooned and sat idle. Over several days, we tuned these limits continually until cluster throughput was optimised for their workload.
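The tuning described above comes down to setting deliberate requests and limits on the batch pods; a hedged sketch (the numbers are illustrative, not the tuned values):

```yaml
# Illustrative batch-pod resources: a hard CPU limit smooths bursty
# demand so the scheduler can bin-pack jobs predictably, and a tight
# memory limit ensures leaking processes are OOM-killed rather than
# sitting idle on RAM.
resources:
  requests:
    cpu: "2"
    memory: 4Gi
  limits:
    cpu: "2"        # cap bursts: slower per-process, higher cluster throughput
    memory: 4Gi     # leaks get killed instead of ballooning
```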
"Thank you very much for being so responsive, by the way. It is awesome."
~ Tim McEwan, CTO, The Climate Risk Group