
Description
WHAT YOU DO AT AMD CHANGES EVERYTHING
We care deeply about transforming lives with AMD technology to enrich our industry, our communities, and the world. Our mission is to build great products that accelerate next-generation computing experiences – the building blocks for the data center, artificial intelligence, PCs, gaming and embedded. Underpinning our mission is the AMD culture. We push the limits of innovation to solve the world's most important challenges. We strive for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives.
AMD together we advance_
THE ROLE:
AMD is seeking a driven and collaborative MLOps Engineer to join our Engineering Operations team in Atlanta. You will support and optimize large-scale, multi-GPU/CPU ML infrastructure to enable world-class AI and rendering research. Collaborating with teams across North America and Europe, you will design robust, automated pipelines and help push the boundaries of machine learning and high-performance compute in a production data center environment.
THE PERSON:
You are a hands-on engineer passionate about both machine learning operations and large-scale infrastructure. You excel at collaborating with researchers and IT specialists, drive automation, and enjoy solving complex technical challenges at the intersection of data science and systems engineering.
KEY RESPONSIBILITIES:
- Architect, deploy, and maintain high-availability Linux/GPU/CPU server clusters for ML workloads, ensuring optimal performance, security, and scalability.
- Collaborate cross-functionally with data science, research, and IT teams (across North America and Europe) to streamline ML model training, testing, deployment, and monitoring pipelines.
- Build and automate end-to-end CI/CD workflows for ML (using MLflow, DVC, Kubeflow, Airflow, or similar tools).
- Configure, monitor, and optimize large-scale NAS and data transfer for sharing of models, datasets, and training results.
- Proactively monitor infrastructure and application health (using Prometheus, Grafana, or similar), addressing performance bottlenecks, failures, and incidents.
- Implement robust security, user management, and access protocols in line with international compliance (GDPR, etc.).
- Document processes, workflows, and troubleshooting guides for global teams; support remote debugging and rapid incident response.
- Stay abreast of trends in AI infrastructure, MLOps toolchains, and AMD hardware accelerators.
PREFERRED EXPERIENCE:
- Strong programming/scripting background (Python, Bash, or Go) and proven experience with Linux server administration.
- Practical experience managing GPU/CPU clusters and Kubernetes orchestration.
- Experience with infrastructure automation (Ansible, Terraform) and CI/CD pipeline design.
- Familiarity with MLOps stacks (MLflow, DVC, Kubeflow, Flyte, Airflow).
- Experience monitoring and troubleshooting distributed workloads for ML/AI, HPC, or rendering.
- Experience configuring and managing NAS or other distributed file systems for large data.
- Knowledge of networking (TCP/IP, VLANs, firewalls), data privacy, and compliance.
- Strong communication, troubleshooting, and documentation skills.
- Previous exposure to supporting render farms or real-time graphics pipelines is a plus.
ACADEMIC CREDENTIALS:
- Degree in Computer Science, Computer Engineering, Electrical Engineering, or a closely related field.
Location: Atlanta GA Data Center (Onsite)
Benefits offered are described in "AMD benefits at a glance."
AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants' needs under the respective laws throughout all stages of the recruitment and selection process.