iimjobs


Posted in: GenAI

Job Code: 1652042

Novartis Healthcare - Associate Director - DDIT IES Cloud Engineering

Posted today

Rating: 4.1 (383+ Reviews)

Job Level: 5

Reports to: Director DDIT IES Cloud Engineering

ROLE PURPOSE:

- Design, build, and manage an AI and Generative AI infrastructure based on the NVIDIA SuperPOD NVL72 system, tailored for pharmaceutical business use cases. The platform will enable biomedical research scientists and other business users to accelerate early molecule development and research activities by providing robust, scalable, and secure GPU computing resources.

MAJOR ACCOUNTABILITIES:

- Architect and Design: Lead the design and architecture of an NVIDIA SuperPOD-based AI infrastructure platform supporting Generative AI workloads and advanced analytics for pharma use cases such as BioNeMo, AlphaFold, ESMFold, OpenFold, ProtGPT2, and the NVIDIA Clara suite.

- Platform Development: Implement MLOps solutions (Run:ai) on Kubernetes clusters optimized for NVIDIA GPUs.

- Data Management: Design and implement high-performance data pipelines for large-scale genomics and chemical compound datasets.

- Security and Compliance: Ensure robust security measures and compliance for HPC and multi-cloud environments.

- Performance Optimization: Optimize GPU cluster performance, networking, and storage for cost-efficiency and scalability.

- Innovation: Stay updated with NVIDIA AI infrastructure advancements and HPC trends.

TECHNICAL EXPERTISE:

- Expertise in deploying and managing GBX00 GPU-based clusters.

- Understanding of advanced interconnect technologies for GB-series GPUs.

- Performance tuning for multi-node GBX00 workloads using NCCL, CUDA, NVLink, NVSwitch, storage and in-band high-speed Ethernet fabric, RDMA tuning, QoS policies, and out-of-band management.

- Redundant power and cooling systems for HPC reliability.

- Cluster Management: NVIDIA Base Command Manager, Slurm, Kubernetes for GPU scheduling.

- Firmware & Driver Management: CUDA, NCCL, InfiniBand drivers, GPU firmware updates.

- EFA, NVLink and InfiniBand switches for ultra-low latency GPU cluster communication.

- Separate Ethernet-based management network for orchestration and monitoring.

- Parallel File Systems: Spectrum Scale (GPFS) or Lustre for high-performance distributed storage.

- Multi-petabyte capacity with NVMe SSD tiers for scratch space and HDD tiers for archival.

- Integration with object storage for AI datasets.

- Monitoring & Troubleshooting: DCGM, Prometheus, Grafana for telemetry and health checks.

- Security & Compliance: RBAC, encryption, secure multi-tenant configurations.

- AI/ML workflow optimization, troubleshooting, and job scheduling.

QUALIFICATIONS:

- Bachelor's degree in IT, Computer Science, or Engineering.

- 8+ years of experience in GPU-based AI infrastructure and HPC systems.

- Deep expertise in NVIDIA DGX systems and SuperPOD architecture.

- Strong knowledge of containerization (Docker, Kubernetes) and DevOps practices.

- Excellent collaboration and documentation skills.

KEY PERFORMANCE INDICATORS:

- On-time delivery of NVIDIA SuperPOD infrastructure.

- SLA adherence for AI workloads.

- Cost optimization and performance benchmarks.

- Successful onboarding of pharma AI use cases.
