We are looking for an experienced Site Reliability Engineer (SRE) to build and maintain a reliable, scalable, and resilient platform infrastructure. The ideal candidate will have strong expertise in automation, infrastructure as code (IaC), monitoring, and system scalability.

【Responsibility】
Develop and maintain automated deployment pipelines using IaC, CI/CD, and orchestration tools (e.g., Ansible, Jenkins, ArgoCD, Kubernetes).
Implement and manage monitoring and telemetry solutions with Prometheus and Grafana to ensure system visibility and support performance optimization.
Design and execute Disaster Recovery (DR) and backup strategies to enhance system resilience.
Improve system scalability and reliability through automation and proactive performance optimizations.
Continuously evolve the infrastructure by identifying and implementing improvements that enhance reliability and deployment speed.