
[Remote] Lead Engineer – HPC Operations

Work from home, Full-time role, Hiring

Note: This is a remote position open to candidates in the USA. Core42 is a leader in AI-powered cloud and digital infrastructure, driving transformative technology solutions globally. The company is seeking a highly skilled Lead Engineer – HPC Operations to oversee the daily operations and support of the high-performance computing clusters that power large-scale AI and ML workloads, keeping the infrastructure stable and high-performing.

Responsibilities

  • Oversee the daily operational management of HPC infrastructure, including compute, storage, networking, and scheduler components (e.g., Slurm, Kubernetes)
  • Drive efforts to optimize the efficiency and performance of HPC systems, ensuring maximum resource utilization and minimizing downtime
  • Serve as the primary technical escalation point for L2 support teams, ensuring rapid and effective resolution of incidents and service requests
  • Continuously monitor system health, performance, and resource utilization using advanced monitoring tools (e.g., Prometheus, Grafana, DCGM)
  • Manage user environments for AI/ML workloads, including container orchestration (e.g., Docker, Kubernetes) and workflow tools (e.g., MLflow, Kubeflow)
  • Define and enforce job scheduling policies, priorities, and partitions within Slurm and/or Kubernetes environments to ensure resource fairness, efficiency, and workload optimization
  • Lead root cause analysis (RCA) of operational issues, contributing to post-mortem documentation and driving continuous improvement initiatives
  • Provide mentorship and technical guidance to junior engineers, fostering skills development and knowledge sharing across teams
  • Participate in the on-call rotation as necessary
  • Ensure adherence to security and operational policies, assisting in audits and maintaining documentation for change and incident management processes

Skills

  • Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field
  • Minimum of 8 years of experience in HPC operations, systems engineering, or DevOps roles, with at least 2 years in a leadership or ownership capacity
  • Advanced expertise in configuring, optimizing, and maintaining complex HPC environments, including hardware, software, and storage systems
  • Hands-on experience managing Slurm clusters and/or Kubernetes-based environments for AI/ML workloads
  • In-depth knowledge of GPU resource management, workload schedulers, and performance tuning for AI/ML workloads
  • Proficiency with monitoring and observability frameworks such as Prometheus, Grafana, and DCGM
  • Strong scripting and automation skills, including Python, Bash, Ansible, and Terraform
  • Solid understanding of Linux (RHEL/CentOS/Ubuntu), networking technologies (RDMA, InfiniBand, RoCE), and storage solutions (NFS, Lustre, Ceph)

Benefits

  • Bonus and benefits on top

Company Overview

  • Core42 develops foundation models to empower organizations across industries. It was founded in 2021 and is headquartered in Abu Dhabi, United Arab Emirates, with a workforce of 1,001 to 5,000 employees. Its website is https://www.core42.ai.