【Capsule】
At FunNow, we’re building joyful experiences, at the speed of now. As a Site Reliability Engineer, you’ll play a crucial role in ensuring our platform stays fast, resilient, and secure for millions of users booking spontaneous fun across Asia. But here’s the twist: we don’t just monitor uptime; we build with AI and automation. From Kubernetes tuning to auto-healing infrastructure, CI/CD pipelines to incident response, you’ll be hands-on in evolving our DevOps culture. If you love scalable systems, believe in developer efficiency, and treat infrastructure as code, welcome aboard.

【Typical Accountability】
1. Design robust architectures to comprehensively improve system availability, scalability, and service quality
2. Ensure stable service operation, monitor core service status, and quickly troubleshoot issues
3. Conduct in-depth analysis of system performance bottlenecks, then propose and implement improvements
4. Maintain and optimize Kubernetes clusters (EKS/GKE), effectively handling resource pressure, node anomalies, and other situations
5. Maintain and improve CI/CD pipelines and automated deployment systems (GitHub Actions / ArgoCD) to significantly enhance the engineering team's development efficiency
6. Establish and continuously optimize system monitoring and alerting mechanisms (Prometheus / Grafana / Alertmanager)
7. Assist with incident response and problem investigation
8. Regularly participate in system inspections and audits, proactively proposing and implementing improvements
9. Assist in maintaining and implementing fundamental security settings (e.g., IAM, resource permissions, encrypted storage)
10. Actively share your experience to collectively enhance the team's engineering culture
2 years of experience required
No management responsibilities