Site Reliability Engineer of AI Infrastructure Operations (IMC)

Job Description

Established in 1987 and headquartered in Taiwan, TSMC pioneered the pure-play foundry business model with an exclusive focus on manufacturing its customers’ products. As of 2024, TSMC serves more than 500 customers and manufactures over 11,000 products for high-performance computing, smartphones, the Internet of Things (IoT), automotive, and digital consumer electronics. It is the world’s largest provider of logic ICs, with an annual capacity of 16 million 12-inch equivalent wafers. TSMC operates fabs in Taiwan as well as manufacturing subsidiaries in Washington State, Japan and China, and the Company began construction on a specialty technology fab in Dresden, Germany, in 2024. In Arizona, TSMC is building three fabs, with the first starting 4nm production in 2025, the second by 2028, and the third by the end of the decade.

Manage and lead the design, implementation, and maintenance of AI infrastructure systems to keep VNAP's AI prediction services and training environments operating reliably.

  1. Collaborate with the IT/CIM infrastructure teams, which host the CPU/GPU application servers and database services for VNAP (such as VM/K8S, Kafka, MongoDB, and Oracle middleware), to ensure high availability and reliability through well-established monitoring metrics and alarms.
  2. Design and implement auto-recovery mechanisms using infra-as-code tools like Ansible and Terraform to minimize tool idle time and lot holds caused by system issues.
  3. Develop and maintain applications using C#/Delphi/Python on top of these infrastructure systems.
  4. Work location: Hsinchu or Taoyuan
  5. Hiring Organization: IMC

Requirements

  1. Master's degree in Computer Science, Information Technology, or related field.
  2. Minimum 3 years of experience in infrastructure and system administration/operations.
  3. Strong understanding of and hands-on experience with message queuing systems and SQL/NoSQL databases, such as Kafka, MongoDB, Oracle, and MariaDB.
  4. Experience in operating system administration, such as Windows servers and Linux distributions.
  5. Strong experience in networking technologies including firewalls, nginx load balancing, and virtual IP setup.
  6. Experience with operational monitoring systems such as Zabbix, Prometheus, and Grafana.
  7. Experience in infra-as-code tools like Ansible and Terraform.
  8. Experience in application development using C#/Delphi/Python on top of AI infrastructure system components for auto recovery.
  9. Excellent communication and interpersonal skills for cross-division and cross-department cooperation.

Fostering a global inclusive workplace reflects TSMC’s core values and business philosophy and is essential for our future success. Our commitment to a global inclusive workplace allows us to create an environment where every employee, regardless of gender, age, disability, religion, race, ethnicity, nationality, political affiliation, or sexual orientation, can bring their unique perspective and experiences to work, enabling us to drive profitability, increase productivity, and unleash innovation. We strive to create a workplace that is equitable and accessible to all employees. We are committed to fostering an inclusive culture where every employee feels valued and empowered to contribute to our mission and provide excellent service to our global customers.

3 years of experience required
40,000+ TWD / month

About us

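TSMC is the world's largest dedicated integrated circuit foundry.

TSMC was founded in 1987 in the Hsinchu Science Park, Taiwan, and pioneered the dedicated IC foundry business model.

TSMC supports a thriving ecosystem of global customers and partners with industry-leading process technologies and a portfolio of design solutions, thereby unleashing innovation across the global semiconductor industry. As a global corporate citizen, TSMC has operations spanning Asia, Europe, and North America.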