Senior Systems Engineer, OS Automation

Weights & Biases · Data AI · Bellevue, WA +3 · Technology

Senior Systems Engineer focused on automating and scaling Linux OS and kernel build pipelines, with a strong emphasis on integrating AI/ML technologies such as LLMs, RAG, and predictive modeling to create AI-native infrastructure, smart CI/CD, auto-remediation, and predictive regression detection.

What you'd actually do

  1. Design, maintain, and automate reproducible OS image build pipelines for our massive fleet of GPU-accelerated servers.
  2. Collaborate with kernel engineers to package, validate, and distribute custom Linux builds across Intel, AMD, and ARM platforms.
  3. Build tooling to manage dependencies, versioning, and release workflows, ensuring hermetic builds.
  4. Standardize the collection of build metrics to create a baseline for future AI modeling.
  5. Architect AI agents that ingest and analyze build logs in real time.
  6. Develop systems that auto-triage errors, categorize failure patterns, and generate context-aware fix suggestions for engineering teams.
  7. Design ML workflows that use historical performance data to detect kernel and OS regressions (latency, throughput, stability) in staging environments before they impact production.
  8. Implement closed-loop feedback systems that analyze real-time system metrics and automatically suggest or apply sysctl parameter optimizations for specific customer workloads.
  9. Engineer LLM-driven interfaces for Slack/internal tools, enabling stakeholders to query build statuses, request log summaries, or provision resources using natural language commands.

Skills

Required

  • Linux Systems Engineering
  • Release Engineering
  • DevOps
  • Linux internals
  • package management (Debian/Ubuntu)
  • build systems
  • Python
  • integrating API-based AI models
  • RAG (Retrieval-Augmented Generation)
  • event-driven automation
  • data structures for vector search or time-series analysis

Nice to have

  • Kubeflow
  • MLflow
  • High-Performance Computing (HPC)
  • fine-tuning small language models (SLMs)

What the JD emphasized

  • AI-native infrastructure
  • AI agents
  • LLMs
  • RAG
  • predictive modeling
  • Python (essential for the AI integration aspects of this role)
  • integrating API-based AI models (OpenAI, Anthropic, or local open-source models) into software workflows
  • RAG (Retrieval-Augmented Generation) architectures

Other signals

  • ML workflows
  • ChatOps