Software Engineer (Multiple Levels) - Machine Learning Infrastructure, Slack

Salesforce · Enterprise · Seattle, WA +3

Software Engineer role focused on building and operating foundational ML infrastructure for Slack AI. This includes systems for large-scale model training, GPU-backed inference, and deployment using Kubernetes, Ray, and vLLM. The role emphasizes reliability, performance, and scalability for AI-driven capabilities within Slack.

What you'd actually do

  1. Design, build, and operate systems to train, serve, and deploy machine learning models at scale, with a focus on reliability, performance, and operational simplicity
  2. Evolve GPU-backed inference infrastructure to support high-throughput, latency-sensitive workloads, including large-scale model serving
  3. Architect and optimize distributed training and data processing systems using platforms such as Ray, Airflow, Spark, or similar technologies
  4. Build and maintain Kubernetes-based platforms and orchestration layers using tools such as KubeRay, vLLM, and internally developed services
  5. Develop robust monitoring, observability, and alerting for production ML workloads to ensure operational excellence

Skills

Required

  • Distributed systems
  • GPU infrastructure
  • Kubernetes
  • ML lifecycle
  • Model training
  • Model deployment
  • Inference
  • Monitoring
  • Observability
  • Reliability
  • Performance
  • Scalability
  • Ray
  • vLLM
  • Airflow
  • Spark

Nice to have

  • Prompt engineering
  • AI agents

What the JD emphasized

  • GPU infrastructure
  • large-scale model serving
  • high throughput
  • latency-sensitive workloads
  • responsible training of models on sensitive data with strong privacy and safety requirements

Other signals

  • ML Infrastructure
  • GPU infrastructure
  • model training
  • model deployment
  • inference
  • monitoring
  • Kubernetes
  • vLLM
  • Ray
  • large scale
  • high throughput
  • latency sensitive