Architected and implemented a Kubernetes Operator in Python (Kopf framework) that watches Flink CRDs and reacts to cluster lifecycle events, reducing the operational overhead of collecting observability data across the fleet.
Scaled the Kubernetes Operator to cover 12,000+ jobs (1,000+ per region across 12 regions), replacing manual per-cluster monitoring with unified visibility across every regional Flink deployment.
Deployed and scaled the Flink History Server to support 1,000+ concurrent jobs per region across 12 regional Flink clusters, enabling post-run diagnostics, job performance analysis, and SLA tracking for the data platform engineering team.
Led a 4-person team on the AWS-to-GCP migration of Atlassian's data platform workloads, including real-time Flink streaming and batch pipelines across 12 regions; owned cross-cloud authentication, data transfer, Workload Identity Federation (WIF), service account provisioning, and end-to-end cluster setup on GCP.