Riki Mondal

Senior Platform & Backend Engineer

Bengaluru, IN

About

Senior Platform & Distributed Systems Engineer with 8+ years of experience building large-scale observability, streaming, and cloud infrastructure. Delivered $2M+ in annual infrastructure savings and ran high-availability systems at a 99.99% SLO across global deployments. Expert in Kubernetes, Kafka, Apache Flink, Prometheus/Mimir, Python, Java, and Go, specializing in platform ownership, multi-region architecture, and cross-functional engineering leadership.

Work

Atlassian India LLP

Senior Software Engineer – Data Platform Reliability & Infrastructure

Bangalore, Karnataka, India

Aug 2024

→

Dec 2024

Summary

Led the architecture and implementation of a Kubernetes Operator to enhance observability, and managed a 4-person team for the AWS-to-GCP migration of data platform workloads.

Highlights

Architected and implemented a Kubernetes Operator in Python (kopf framework) to watch Flink CRDs and react to cluster lifecycle events, reducing operational overhead for observability collection across the fleet.

Scaled the Kubernetes Operator to cover 12,000+ total jobs (1,000+ per region across 12 regions), providing unified visibility and replacing manual, per-cluster monitoring across every regional Flink deployment.

Deployed and scaled the Flink History Server to support 1,000+ concurrent jobs per region across 12 regional Flink clusters, enabling post-run diagnostics, job performance analysis, and SLA tracking for the data platform engineering team.

Led a 4-person team on the AWS-to-GCP migration of Atlassian's data platform workloads, including real-time Flink streaming and batch pipelines across 12 regions, owning cross-cloud authentication, data transfer, GCP Workload Identity Federation (WIF), service account provisioning, and end-to-end cluster setup on GCP.

Atlassian India LLP

Senior Software Engineer – Metrics & Observability Platform

Bangalore, Karnataka, India

Jul 2022

→

Aug 2024

Summary

Owned and scaled Atlassian's centralized Grafana Mimir platform, contributing to significant cost savings and performance improvements.

Highlights

Owned and scaled Atlassian's centralized Grafana Mimir platform to 500M+ active time series at 15K write QPS / 6K read QPS, supporting 130K+ monitoring detectors at 99.99% SLO.

Reduced infrastructure costs by $2M/year and query latency by 40% through strategic optimizations of the Grafana Mimir platform.

Designed and led implementation of a multi-tenant authentication gateway for Grafana Mimir ingestion, processing 15K write QPS while eliminating cross-tenant data exposure risk and replacing a manually operated allowlist system.

Pioneered an AI/ML-powered system that automatically translates SignalFlow to PromQL, achieving 88% accuracy, eliminating 2,000+ hours of manual migration work, and accelerating the SignalFx-to-Prometheus/Grafana transition by 6 months.

Deployed enterprise-grade Sentry error monitoring in a FedRAMP-compliant environment, enabling real-time debugging for government contracts and reducing production incident resolution time by 35%.

Glance (InMobi Group)

Senior Software Engineer (SDE-3)

Bangalore, Karnataka, India

Jul 2022

→

Aug 2024

Summary

Led a 5-person engineering team to scale the Glance Publishing Platform and engineered a distributed job scheduler to enhance content ingestion capacity.

Highlights

Led a 5-person engineering team on the Glance Publishing Platform, scaling system throughput from 10K to 15K RPS (a 50% increase) through strategic architectural improvements, supporting content delivery to 10M+ daily active users.

Engineered a distributed job scheduler handling 20K concurrent RSS scraping jobs using Redis, PostgreSQL, and Python on Kubernetes, reducing publisher onboarding time from 3 days to 5 minutes (a reduction of over 99%) and increasing content ingestion capacity by 50x.

Refactored legacy CMS architecture to Spring Boot microservices with Redis caching, cutting API latency from 850ms to 340ms (60% improvement) and enabling 200K daily content operations with zero additional infrastructure.

Architected a real-time content delivery pipeline for Glance Spaces using Java and Kafka, delivering 50M+ daily events to lock-screen users and contributing to 15% DAU growth and a 20% improvement in user satisfaction.

Audited and rightsized Glance's Kubernetes cloud estate using Terraform, ArgoCD, and Grafana, cutting monthly spend from $140K to $98K ($504K annualised saving) with zero SLA regressions.

Glance (InMobi Group)

Senior Software Engineer (SDE-2)

Bangalore, Karnataka, India

Sep 2021

→

Jun 2022

Summary

Developed a distributed load testing platform, built a vendor-agnostic event-driven microservice, and implemented a company-wide OpenTelemetry observability stack.

Highlights

Developed a distributed load testing platform using Locust, Python, and Kubernetes, enabling teams to simulate production-level traffic, cutting test setup time from 5 hours to 2 hours (a 60% reduction) and accelerating release cycles by 35%.

Built a vendor-agnostic event-driven microservice supporting multiple message brokers (Kafka, PubSub), handling 1,000+ subscribers and processing 10M+ daily events, saving development teams 20 engineering hours weekly on integration tasks.

Implemented a company-wide OpenTelemetry observability stack with eBPF-based instrumentation across 40+ microservices, capturing kernel-level network and latency traces without code changes, improving debugging efficiency by 50% and halving average incident resolution time from 4 hours to 2 hours.

Modernised 8 legacy Jersey-based services to Spring Boot, improving developer velocity by 45% and reducing build times from 12 to 9 minutes.

Tata 1MG

Senior Software Engineer (SDE-2)

Gurgaon, Haryana, India

Aug 2020

→

Sep 2021

Summary

Architected a digital health records platform and created a production-ready FastAPI microservice boilerplate to support telemedicine expansion.

Highlights

Architected a digital health records platform using FastAPI and PostgreSQL, enabling real-time lab report and prescription management for 50K+ monthly patients, improving data accessibility by 80% and supporting Tata 1MG's telemedicine expansion.

Created a production-ready FastAPI microservice boilerplate with unit tests and CI/CD integration, adopted by 10+ engineering teams, reducing new service setup from 2 days to 4 hours.

Built a centralised communication platform using Python and SQS/Kafka, handling 100K+ daily notifications (email, SMS, WhatsApp) and improving delivery reliability from 80% to 99.99%.

Tata 1MG

Software Engineer (SDE-1)

Gurgaon, Haryana, India

Feb 2018

→

Aug 2020

Summary

Developed an ML-powered recommendation engine, engineered a subscription-based diabetes care platform, and built an automated health survey system.

Highlights

Developed an ML-powered recommendation engine using Python and Cassandra, delivering personalised product suggestions that increased conversion rates by 30% and click-through rates by 20%, contributing $60K+ in additional quarterly revenue.

Engineered a subscription-based diabetes care platform using FastAPI and Kubernetes, growing to 3,500+ daily active users within 6 months at 75% retention, establishing it as a flagship product generating $100K+ annual revenue.

Engineered an automated health survey system using Python and PostgreSQL, delivering targeted assessments to 100K+ users monthly, boosting engagement by 45% through gamified experiences.

Education

National Institute of Technology Durgapur

Durgapur, West Bengal, India

Sep 2013

→

Jun 2017

B.Tech

Information Technology

Publications

Technical Blog

Published by

Medium

Summary

Published 5+ articles on distributed systems, observability, and platform engineering, reaching 10K+ readers. Topics included OpenTelemetry with eBPF instrumentation, Kafka event-driven architectures, Kubernetes-based load testing, and distributed job scheduling at scale.

Languages

English

Skills

Languages

Python, Java, Go, SQL, JavaScript (Node.js).

Infrastructure & Cloud

Kubernetes, Kubernetes Operators, Helm, Kustomize, Docker, Terraform, GCP, AWS, ArgoCD, Jenkins, GitHub Actions, Bitbucket Pipelines.

Observability & Streaming

Prometheus, Grafana Mimir, OpenTelemetry, Kafka, Apache Flink, Grafana, Sentry, eBPF.

Data & Storage

PostgreSQL, Redis, Cassandra, MongoDB, MySQL, Apache Iceberg.

Frameworks

Spring Boot, FastAPI, Flask, Django.

Core Competencies

Distributed Systems, Platform Engineering, SLI/SLO Design, Capacity Planning, FinOps, Cost Optimisation, High-Availability Systems, Multi-Region Infrastructure, Microservices Architecture.