Sr. Computer Scientist

Adobe · Enterprise · Noida, India

Senior infrastructure developer role: own, evolve, and scale the platform that powers demanding ML training and serving workloads. The work involves architecting Kubernetes-native systems, leading cross-geo projects, writing production-grade code (Go, Python, Rust), and ensuring the reliability and efficiency of large-scale GPU clusters. The focus is on infrastructure as code, deep observability, and complex networking challenges in a multi-cloud environment.

What you'd actually do

  1. Architect for scale
  2. Lead cross-geo initiatives
  3. Codify infrastructure
  4. Build observability
  5. Write production code

Skills

Required

  • Kubernetes
  • GPU infrastructure
  • SRE
  • platform engineering
  • infrastructure roles
  • Kubernetes internals
  • scheduler
  • kubelet
  • CRDs
  • operators
  • admission controllers
  • GPU/accelerator training workloads
  • Multi-cluster management
  • federation
  • workload placement strategies
  • Helm
  • Kustomize
  • GitOps (Flux/ArgoCD)
  • AWS
  • VPC
  • EKS
  • EC2
  • S3
  • IAM
  • Transit Gateway (TGW)
  • Terraform
  • Pulumi
  • CI/CD for infrastructure
  • drift detection
  • plan gating
  • rollback strategies
  • Cost optimization
  • reserved capacity planning
  • spot instance management
  • Prometheus
  • Grafana
  • AlertManager
  • Distributed tracing
  • OpenTelemetry
  • Jaeger
  • Tempo
  • Log aggregation
  • Loki
  • Elasticsearch/OpenSearch
  • SLO/SLI design
  • error budget policy
  • multi-tier alerting
  • TCP/IP
  • DNS
  • TLS
  • HTTP/2
  • gRPC
  • CNI plugins
  • Cilium
  • Calico
  • Flannel
  • Service mesh (Istio/Linkerd)
  • ingress controllers
  • API gateways
  • Network debugging
  • packet captures
  • eBPF traces
  • kernel counters
  • Go
  • Python
  • Rust
  • Distributed systems design
  • consistency
  • availability
  • failure modes
  • Kubernetes operator authoring
  • controller-runtime patterns
  • Technical writing
  • design docs
  • ADRs
  • runbooks
  • Leadership
  • Cross-Geo Collaboration
  • async-first collaboration
  • distributed, cross-timezone teams

Nice to have

  • Azure
  • GCP
  • ML training pipeline internals
  • eBPF-based observability
  • eBPF-based networking
  • Chaos engineering
  • game days
  • Open-source infrastructure contributions
  • Security
  • compliance
  • audit experience

What the JD emphasized

  • 10+ years of experience
  • deep Kubernetes expertise
  • strong networking fundamentals
  • operated systems at massive scale
  • infrastructure that thousands of GPU hours depend on every day
  • massive, distributed training jobs running on GPU clusters spanning thousands of accelerators
  • Kubernetes & GPU Infrastructure
  • Expert-level Kubernetes internals
  • Proven experience running GPU/accelerator training workloads at scale
  • Deep AWS hands-on experience required
  • Prometheus, Grafana, AlertManager — at scale, not just lab setups
  • Deep TCP/IP, DNS, TLS, HTTP/2, gRPC — not just surface familiarity
  • Production-quality code in Go, Python, or Rust — you ship, not just script
  • Led multi-quarter, cross-functional projects from whiteboard to production

Other signals

  • powers our most demanding ML training workloads
  • architecting systems
  • leading multi-quarter projects
  • reliability bar for an infrastructure that thousands of GPU hours depend on
  • cutting-edge platform designed to train and serve large-scale machine learning models
  • supports everything from small-scale experimentation to massive, distributed training jobs running on GPU clusters spanning thousands of accelerators
  • provides ML engineers and researchers with the tools to onboard, monitor, and scale their workloads
  • Dynamic GPU orchestration
  • Training & inference workflows
  • Observability & cost tracking
  • Self-service developer tooling
  • Multi-cloud infrastructure
  • reliability, scalability, and efficiency of this platform
  • speed at which AI teams can innovate