Staff Software Engineer - AI/ML Systems and Reliability

Adobe · Enterprise · San Jose, CA

This Staff Software Engineer role focuses on building and scaling the AI/ML platform behind Adobe Experience Platform's Personalization ML solutions and Generative AI capabilities. The work spans MLOps, infrastructure, and reliability engineering in support of scalable model training, reliable inference, automated ML workflows, and production-grade AI systems.

What you'd actually do

  1. Architect and build infrastructure for AI/ML systems, including Personalization and Generative AI platforms.
  2. Design and build MLOps capabilities such as model deployment pipelines, feature stores, model registries, and inference infrastructure.
  3. Improve reliability, scalability, observability, and operational efficiency of distributed AI systems.
  4. Build monitoring, alerting, logging, and tracing solutions for production services.
  5. Lead technical design and architecture discussions across teams.

Skills

Required

  • Python or Java
  • microservices
  • REST APIs
  • cloud-native architectures
  • AWS or Azure
  • Kubernetes
  • Docker
  • CI/CD
  • infrastructure automation
  • production operations
  • reliability
  • scalability
  • observability for distributed systems
  • troubleshooting
  • communication
  • collaboration

Nice to have

  • MLOps platforms
  • ML infrastructure
  • Generative AI applications
  • Ray
  • Kafka
  • Spark
  • Airflow
  • MySQL
  • PostgreSQL
  • Redis
  • Elasticsearch
  • Snowflake
  • high-throughput, low-latency production systems

What the JD emphasized

  • operating highly reliable cloud-native infrastructure
  • production reliability
  • high-throughput, low-latency production systems

Other signals

  • MLOps platform development
  • distributed systems engineering
  • production reliability
  • Generative AI capabilities