The SRE Mission: Building Resilient and Scalable Digital Services
TVBS's service architecture supports daily traffic from tens of millions of users across multiple product lines, including news, video content, and e-commerce. The SRE team is the core force ensuring the 24/7 stability of this complex system. We don't just maintain infrastructure; we are the most trusted partners of the development teams. We safeguard service reliability and performance amidst rapid development cycles through architectural optimization, process automation, and SRE best practices.
We are looking for an SRE engineer with deep technical expertise and an automation-first mindset. You will lead the management of Kubernetes clusters in a high-traffic environment, design and implement Infrastructure as Code (IaC), and drive the evolution of our monitoring systems. If you are passionate about eliminating system bottlenecks and reducing operational toil, and you believe that automating everything is the way forward, this is the perfect stage for you to make an impact.
Key Responsibilities
SRE Practices and Reliability Definition: Lead the implementation of core SRE practices, including defining Service Level Objectives (SLOs), establishing Error Budgets, and promoting a Postmortem culture to fundamentally enhance system reliability.
Kubernetes (EKS) Operations and Optimization: Manage the daily operations, performance tuning, cost optimization, and high-availability design of AWS EKS clusters to ensure the stability of containerized applications.
Infrastructure as Code (IaC) and Automation: Manage cloud resources comprehensively using Terraform or CloudFormation. Develop automation scripts (Python/Shell) to reduce manual operations and improve deployment efficiency and consistency.
Monitoring Systems and Observability: Build and maintain comprehensive monitoring and alerting systems (e.g., Prometheus, Grafana, OpenSearch/ELK) so that issues are detected and pinpointed in real time, and continuously improve observability.
CI/CD Pipeline Optimization: Own and continuously improve CI/CD pipelines, collaborating with development teams to accelerate software delivery while ensuring deployment quality.
Incident Response and Performance Tuning: Participate in on-call rotations, handle production emergencies, and analyze and tune system performance bottlenecks.
1 year of experience required
No management responsibility