Software Engineer II, ML Platform, tvScientific

Pinterest · Consumer · San Francisco, CA · tvScientific

This role is for a Systems/Platform Engineer focused on building and scaling low-latency, high-throughput ML infrastructure for real-time advertising and measurement. The engineer will work on the foundation of the training and serving stack, with an emphasis on performance, observability, and reliability. Although the posting mentions AI and the use of AI tools, the core responsibility is infrastructure engineering, not direct AI model development.

What you'd actually do

  1. Scale decision-making around tooling for the tvScientific AI team, from workflows to training infrastructure to Kubernetes deployments
  2. Improve the developer experience for the data science team
  3. Upgrade our observability tooling
  4. Make every deployment smooth as the infrastructure evolves

Skills

Required

  • Deep understanding of Linux
  • Excellent writing skills
  • A systems-oriented mindset
  • Experience in high-performance software (RTB, HFT, etc.)
  • Software engineering experience + reliability (e.g. CI/CD) expertise
  • Strong observability instincts
  • Demonstrated ability to use AI to improve the speed and quality of your day-to-day work
  • Strong track record of critical evaluation and verification of AI-assisted work (e.g., testing, source-checking, data validation, peer review)
  • High integrity and ownership: you protect sensitive data, avoid over-reliance on AI, and remain accountable for final decisions and deliverables

Nice to have

  • Reverse-engineering experience
  • Terraform, EKS, or MLOps experience
  • Python, Scala, or Zig experience
  • NixOS experience
  • Adtech or CTV experience
  • Experience deploying a distributed system across multiple clouds
  • Experience in hard real-time low-latency (<10 ms) environments

What the JD emphasized

  • sub‑millisecond decisioning
  • high‑throughput data access
  • tight integration with Pinterest’s core tech stack
  • queries and RPCs in terms of syscalls, cache lines, and wire formats
  • systems that stay fast and predictable under load
  • storage and indexing strategies
  • streaming and fanout
  • backpressure and failure handling across services and regions
  • observable, debuggable, and operable in production
  • IO scheduling and batching
  • lock‑free or low‑contention data structures
  • connection pooling
  • query planning
  • kernel and network tuning
  • on‑disk layout and indexing
  • circuit‑breaking
  • autoscaling
  • incident response
  • NixOS
  • Rust
  • robust SLIs/SLOs
  • high-performance software (RTB, HFT, etc.)
  • low-latency (<10 ms) environments
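Several of the emphasized items (circuit-breaking, backpressure, failure handling across services) are defensive patterns for keeping services fast and predictable under load. As a rough illustration of what the JD is pointing at, here is a minimal circuit-breaker sketch in Python (one of the languages the posting lists); the class name, thresholds, and `clock` parameter are illustrative, not anything from the posting itself:

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: opens after `max_failures` consecutive
    failures, rejects calls while open, and allows a half-open probe
    after `reset_after` seconds have elapsed."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock          # injectable for testing
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of piling load onto a struggling dependency.
                raise RuntimeError("circuit open; call rejected")
            # Enough time has passed: fall through and allow one probe call.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In a real serving stack this logic usually lives in a service mesh or RPC library rather than hand-rolled application code; the sketch only shows the state machine the bullet points above assume.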