Description: Must have skills • PowerEdge Rack/Tower Experience, RHEL and Ubuntu OS, Experience working in a larger server environment.All potential applicants are encouraged to scroll through and read the complete job description before applying.• PowerEdge XE server experience
- Poweredge XE-series GPU servers and accelerated/HPC environment • Strong experience with HPC operations in production environments (scheduling, monitoring, troubleshooting, capacity management).• Hands on expertise with NVIDIA GPU infrastructure (preferably GB200/GB300 class racks, H100/B100 or similar generations) including firmware/driver stack, monitoring, and lifecycle management.• Familiarity with liquid cooled data center infrastructure: working safely around cold plate / rear door heat exchanger systems, understanding facility interfaces (CDU, manifolds, leak detection, etc.).• Solid understanding of Linux administration (RHEL/CentOS/Ubuntu) in HPC/AI clusters: OS provisioning, patching, performance tuning, and troubleshooting.• Experience with cluster management tools (e.g., Bright, OpenHPC, Slurm, PBS/LSF, Kubernetes based AI stacks, or equivalent schedulers and orchestration frameworks).• Strong Day 2 operations mindset: incident response, root-cause analysis, change management, and operational runbook creation.• Excellent customer facing communication skills and the ability to work on site, embedded with the customer’s team and their operations staff.Nice to have skills (if any) • Prior exposure to liquid-cooled or rack-scale GPU platforms (Nvidia HGX, NVLink, etc.) is highly desirable • Prior experience operating or supporting Dell XE/XE HPC & AI solutions and related management tooling.• Background in HP/HPE HPC environments (familiar with how HP is typically deployed/managed in HPC shops, to ease comparison and migration).• Exposure to AI/ML and GenAI workloads running at scale on NVIDIA GPU infrastructure (MLOps pipelines, model training/inference operations).• Experience with data center facilities coordination (power and cooling planning, rack integration, cabling standards, change windows).• Scripting skills in Python/Bash/PowerShell for automation, reporting, and integration with monitoring/ITSM systems.• Familiarity with ITIL aligned processes (incident, problem, and change management) and documenting runbooks/SOPs.• Strong scripting/automation skills, especially in Python and Bash, for automating operational workflows, health checks, and reporting across large GPU clusters.• Experience building or maintaining infrastructure automation (e.g., Ansible, Terraform, or similar) for repeatable deployment, configuration, and lifecycle management of HPC/AI nodes.• Some scripting skills in Python/Bash/PowerShell for automation, reporting, and integration with monitoring/ITSM systems. xrczosw• Experience with DevOps / Git based workflows (GitLab/GitHub, CI/CD) to help with scripts, automation playbooks, and configuration as code.