Description
WHAT YOU DO AT AMD CHANGES EVERYTHING
At AMD, our mission is to build great products that accelerate next-generation computing experiences—from AI and data centers, to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you'll discover the real differentiator is our culture. We push the limits of innovation to solve the world's most important challenges—striving for execution excellence, while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.
Lead Performance and Optimization Engineer
THE ROLE:
We are seeking a Performance Engineer with strong expertise in server-class CPUs, CPU microarchitecture, and ML inference. The role is responsible for benchmarking, analyzing, and optimizing CPU inference performance using EPYC-optimized ML libraries (e.g., ZenDNN) with common frameworks (PyTorch, TensorFlow, ONNX Runtime). It includes hands-on work in performance debugging, OS/BIOS tuning, thread/core affinity, multi-instance execution, and Python/scripting-based automation.
THE PERSON:
The ideal candidate is passionate about software engineering and has the leadership skills to drive complex issues to resolution. They communicate effectively and work well with different teams across AMD.
KEY RESPONSIBILITIES:
Performance Engineering & Optimization
- Run and optimize ML inference workloads on CPUs using EPYC-optimized libraries (ZenDNN), improving throughput/latency across single-instance and multi-instance scenarios.
- Configure and tune NUMA, HugePages, SMT, power/performance modes, CPU isolation, scheduler settings, scaling governors, and other OS/BIOS parameters.
- Design and validate thread/core affinity strategies for single-instance, multi-instance, multi-socket, and framework-level multi-instance execution models.
- Optimize workload behavior through NUMA-aware locality, thread scheduling/pinning, batch size tuning, operator-level parallelism, and other CPU-focused techniques.
- Contribute to multi-instance execution framework development, including policies for instance partitioning, core allocation, memory distribution, and orchestration of parallel runs on large EPYC systems.
Benchmarking & Analysis
- Develop and run structured benchmarks across EPYC SKUs, core counts, caching/topology variations, sockets, and diverse batch sizes.
- Analyze scaling for single-instance vs. multi-instance execution, instance placement strategies, and workload isolation.
- Use perf, VTune, ftrace/trace-cmd, PMU counters, and flame graphs to identify bottlenecks in compute, memory, thread scheduling, or instance-level competition.
- Perform root-cause analysis for regressions in latency, throughput, multi-instance efficiency, memory bandwidth, and pipeline behavior.
ML Inference Domain Knowledge
- Understand how ML frameworks execute models on CPU, including tensor shapes/layouts, operator behavior, threading models, kernel dispatch, scheduling strategies, and multi-instance runtime interactions.
- Interpret how model architecture and operator composition influence performance across single and multiple concurrent inference instances.
- Collaborate with ZenDNN and kernel/ops teams to relay findings and help guide kernel/operator-level improvements.
Automation & Tooling
- Build automation pipelines for single-instance and multi-instance benchmarking, profiling, orchestration, scaling studies, and regression detection.
- Develop Python/Bash tooling that manages instance spawning, CPU core partitioning, memory pinning, performance data capture, reporting, and visual dashboards.
- Maintain reproducible experiment workflows for both single-instance and multi-instance configurations.
Required Skills & Qualifications
- Strong understanding of CPU architecture: pipelines, caches, TLB, NUMA, SMT/HT, vector units (AVX2/AVX512/VNNI/BF16/INT8), and memory hierarchy.
- 8 to 12 years in performance engineering, systems optimization, or low-level execution on Linux.
- Hands-on experience with Linux tuning and server-class OS/BIOS configuration.
- Proficiency with perf, VTune, PMU counters, ftrace/trace-cmd, flame graphs, and multi-instance profiling.
- Strong knowledge of ML inference execution (tensors, operators, threading models) on CPU backends.
- Strong Python and Bash for automation and performance tooling.
- Experience in multicore scaling, thread affinity, scheduler behavior, concurrency techniques, and multi-instance execution strategies.
- Familiarity with PyTorch, TensorFlow, and ONNX Runtime for running inference workloads.
Benefits offered are described: AMD benefits at a glance.
AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants' needs under the respective laws throughout all stages of the recruitment and selection process.
AMD may use Artificial Intelligence to help screen, assess or select applicants for this position. AMD's “Responsible AI Policy” is available here.
This posting is for an existing vacancy.