Role Summary
This role owns the engineering depth behind ZB’s high-density GPU deployment standards. You will design, validate, and continuously improve liquid cooling and power delivery architectures that enable reliable operation at modern AI rack densities. The scope spans concept design, vendor selection, testing and commissioning, field troubleshooting, and building repeatable standards that scale across multiple sites and partners. We want these standards to work together with our proprietary GPU optimization software, so that they reflect the specifications customers actually require in a data hall and data center environment.
What You Will Do
Architecture and Reference Designs
Own ZB reference architectures for liquid-cooled AI data centers, including direct-to-chip and hybrid approaches, and provide clear decision frameworks for when each architecture is appropriate.
Define rack-level and row-level cooling topology (CDUs, manifolds, supply and return routing, redundancy philosophy) and produce buildable specifications.
Partner with electrical engineering counterparts to define end-to-end power architecture (medium voltage interface where applicable, transformers, switchgear, UPS, PDUs, busway, grounding, and redundancy).
Translate AI workload and GPU platform requirements into thermal and electrical design targets, including transient behavior, ramp profiles, and failure modes.
Engineering Validation and Testing
Develop test plans for cooling loop performance, leak integrity, pressure and flow stability, and heat rejection efficiency under representative AI loads.
Define acceptance criteria for component suppliers and system integrators, including FAT and SAT procedures, instrumentation requirements, and documentation standards.
Establish reliability and maintainability standards: isolation valves, bypass loops, drain and fill procedures, service clearances, and spare strategy.
Create incident playbooks for thermal excursions, pump failures, flow alarms, and power events. Ensure procedures are realistic and operator-friendly.
Deployment, Commissioning, and Operations Support
Support commissioning and handover for deployments.
Validate that thermal and electrical systems meet design intent before scaling workloads.
Troubleshoot cross-discipline issues: hot spots, uneven flow distribution, unstable differential pressure, air entrainment, sensor drift, and control loop tuning.
Work with operations teams to define preventive maintenance schedules, calibration routines, water quality management, and filter strategy to protect IT equipment.
Build operational dashboards and telemetry requirements in partnership with software teams to ensure early detection and fast root-cause analysis.
Vendor Management and Cost-Performance Optimization
Evaluate suppliers on performance, reliability, lead time, serviceability, and total cost of ownership. Maintain an approved vendor list and qualification criteria.
Negotiate and enforce engineering deliverables: drawings, BOM transparency, testing evidence, warranty terms, and service response SLAs.
Drive cost and efficiency improvements through design simplification, standardization, and repeatable modules without compromising uptime and safety.
Ensure compliance with relevant standards and good practices for data centers and liquid cooling systems, including safety, labeling, and documentation.
Documentation and Knowledge Building
Produce clear, version-controlled engineering documentation: specifications, one-line diagrams, P&IDs, commissioning checklists, and SOPs.
Train internal teams and partners on ZB standards, including installation best practices and common failure modes.
Contribute to ZB’s customer-facing technical collateral where appropriate, ensuring accuracy and credibility.
First 30 Days
Review ZB’s current deployment footprint, partner ecosystem, existing reference designs, and known pain points.
Map critical vendors and identify the highest-risk components and processes (leaks, sensor quality, control logic, redundancy gaps).
Deliver an initial gap analysis with priority actions and quick wins.
First 60 Days
Publish v1 of ZB’s liquid cooling and power reference standards, aligned with the requirements of the company’s software (zWare) and including minimum acceptance criteria and commissioning checklists.
Run or redesign at least one validation and commissioning workflow with updated instrumentation and sign-off steps.
Create incident playbooks for the top failure scenarios and train relevant teams.
First 90 Days
Complete supplier qualification and define an approved vendor list with clear testing and documentation requirements.
Demonstrate measurable reliability improvements through telemetry, incident reduction, or commissioning cycle time reductions.
Establish a scalable documentation library and training cadence for internal and partner teams.