Senior Applied Scientist - Machine Learning Systems Engineer - Photoshop

Adobe · Enterprise · Seattle, WA +1

Senior ML Systems & Efficiency Engineer on the Photoshop ART R&D team, focused on optimizing inference performance, latency, and cost efficiency for image editing applications. The role calls for deep expertise in AI/ML systems, computer vision, distributed inference, and performance optimization, with a mandate to deliver production-ready ML systems at lower cost and higher efficiency. Responsibilities include designing and optimizing inference systems, developing high-performance GPU kernels, profiling performance, collaborating on distributed serving systems, and establishing cost-aware ML engineering practices.

What you'd actually do

  1. Design and optimize high-throughput, low-latency inference systems. Optimize model architectures to improve deployment and runtime efficiency using techniques such as distillation, pruning, quantization, and Mixture-of-Experts (MoE). Implement advanced serving strategies, including batching, caching (KV, semantic, embedding), and quantization (FP8/INT8), along with distributed inference strategies such as data, tensor, pipeline, expert, and hybrid parallelism, with a focus on balancing computation and communication efficiency. Explore training or fine-tuning approaches when they lead directly to more efficient inference, simpler deployment, or improved runtime performance.
  2. Write and maintain high-performance GPU kernels using Triton or CUDA to accelerate custom model layers and critical workloads. Improve GPU utilization through kernel fusion, asynchronous pipelines, and optimized scheduling strategies.
  3. Conduct deep performance analysis using tools such as PyTorch Profiler and NVIDIA Nsight to identify bottlenecks in compute, memory, and communication. Optimize end-to-end system performance across inference workloads.
  4. Partner with infrastructure teams to design scalable and reliable distributed serving systems across heterogeneous hardware environments (e.g., A100, H100, B200, CPU). Contribute to resource scheduling, GPU pooling, and elastic workload management.
  5. Establish and track efficiency metrics such as cost per million inferences. Build benchmarking frameworks and dashboards to guide tradeoffs among quality, latency, and compute cost, enabling data-driven system and product decisions.
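
As an illustration of the efficiency metric named in item 5, cost per million inferences can be derived from hardware price and sustained throughput. The sketch below is a minimal Python example, not Adobe's actual methodology; the function name, its parameters, and the sample price and throughput figures are all assumptions made for illustration.

```python
def cost_per_million_inferences(
    gpu_hourly_cost_usd: float,
    num_gpus: int,
    throughput_per_sec: float,
) -> float:
    """Estimate the USD cost of serving one million inferences.

    All inputs are hypothetical: an hourly price per GPU, the number of
    GPUs in the serving pool, and the pool's sustained throughput in
    inferences per second.
    """
    if throughput_per_sec <= 0:
        raise ValueError("throughput_per_sec must be positive")
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")

    inferences_per_hour = throughput_per_sec * 3600
    pool_cost_per_hour = gpu_hourly_cost_usd * num_gpus
    return pool_cost_per_hour / inferences_per_hour * 1_000_000


# Hypothetical example: one GPU at 2.00 USD/hour sustaining 50 inferences/s.
example_cost = cost_per_million_inferences(2.00, 1, 50.0)
print(f"{example_cost:.2f} USD per million inferences")
```

With these made-up inputs the result is about 11.11 USD per million inferences. Doubling throughput on the same hardware halves the figure, which is the kind of tradeoff such a metric is meant to expose.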

Skills

Required

  • Python
  • C++
  • GPU architecture understanding
  • performance diagnosis
  • distributed systems
  • high-performance systems development
  • Triton or CUDA for performance-critical workloads
  • rigorous measurement and benchmarking
  • system efficiency, scalability, and reliability in production environments

Nice to have

  • Master’s or PhD in Computer Science, Electrical Engineering, or related field with focus on ML systems, distributed systems, or HPC
  • Triton
  • vLLM
  • SGLang
  • xDiT
  • TensorRT
  • ONNX Runtime
  • AOTI
  • operator fusion
  • graph-level optimization
  • PyTorch Profiler
  • NVIDIA Nsight
  • CUDA tooling
  • NCCL
  • Docker
  • Kubernetes
  • Transformers
  • multimodal models
  • Mixture-of-Experts (MoE)
  • Diffusion Transformers (DiT)

What the JD emphasized

  • production-ready
  • inference performance
  • latency
  • cost efficiency
  • high-quality ML systems
  • substantially lower cost
  • higher efficiency
  • deep expertise
  • distributed inference
  • multimodal model profiling
  • performance optimization
  • high-leverage role
  • outsized impact
  • saving millions of dollars
  • practical innovations
  • high-throughput
  • low-latency inference systems
  • runtime efficiency
  • advanced serving strategies
  • distributed inference strategies
  • computation and communication efficiency
  • more efficient inference
  • simpler deployment
  • improved runtime performance
  • high-performance GPU kernels
  • critical workloads
  • GPU utilization
  • asynchronous pipelines
  • optimized scheduling strategies
  • deep performance analysis
  • identify bottlenecks
  • end-to-end system performance
  • inference workloads
  • scalable and reliable distributed serving systems
  • heterogeneous hardware environments
  • resource scheduling
  • GPU pooling
  • elastic workload management
  • cost-aware ML engineering
  • efficiency metrics
  • cost per million inferences
  • benchmarking frameworks
  • data-driven system and product decisions
  • trusted technical advisor
  • efficiency tradeoffs
  • best practices
  • scalable and cost-efficient ML development
  • performance-oriented systems design
  • Distributed Inference & Serving Expertise
  • large-scale inference
  • serving workloads
  • distributed frameworks
  • runtime systems
  • inference compilation and optimization tools
  • system-level performance tradeoffs
  • GPU & Performance Engineering Skills
  • GPU architecture
  • diagnosing performance bottlenecks
  • compute, memory, and I/O subsystems
  • Programming & Systems Development
  • high-performance or distributed systems
  • performance-critical workloads
  • Data-Driven Engineering Mindset
  • rigorous measurement and benchmarking
  • system efficiency, scalability, and reliability
  • production environments
  • Open-source serving frameworks
  • Inference compilation tools
  • GPU profiling and performance analysis tools
  • Distributed Systems & Communication
  • low-level communication libraries
  • large-scale distributed serving environments
  • Containerization & Cluster Operations
  • containerized workflows
  • production ML workloads
  • shared GPU clusters
  • Model Architectures
