AI Frontier · AI lab
Currently tracking 76 active AI roles, down 16% versus the prior 4 weeks. Primary focus: Data · Engineering. Salary range $148k–$600k (avg $306k).
| Title | Stage | AI score |
|---|---|---|
| Member of Technical Staff - Inference The role focuses on designing and optimizing large-scale model serving systems for high-performance inference, ensuring speed and reliability for millions of users. Responsibilities include architecting distributed infrastructure, optimizing latency and throughput, building high-concurrency systems, and accelerating inference engines. | Serve | 9 |
| Backend Engineer - API Responsible for building and owning the xAI API, focusing on high-throughput, low-latency inference for LLMs. This includes model serving infrastructure, request routing, rate limiting, observability, and scaling, with potential involvement in agent SDKs and orchestration. | ServeAgent | 8 |
| Member of Technical Staff - Voice Product The role focuses on building and shipping voice features for the Grok product, involving backend engineering for scalable, low-latency voice infrastructure and model integrations. The goal is to drive performance, reliability, and quality of voice interactions at a global scale. | Serve | 7 |
| Member of Technical Staff - Compute Infrastructure The role involves building and optimizing large-scale GPU clusters and the platform layer for AI training and inference. Responsibilities include low-level CUDA kernel development, Linux kernel internals, custom orchestration, and performance debugging across the full stack to accelerate AI model development. | ServeAgent | 7 |
| Member of Technical Staff This role manages and enhances reliability for multi-data-center AI infrastructure, focusing on process automation, observability, and seamless operations for mission-critical systems. The ideal candidate combines strong coding ability with hands-on data center experience to build scalable reliability services, optimize system performance, and, in partnership with facility operations, minimize downtime and end-user impact through proactive automation, robust observability, and integrated software-physical reliability strategies. | Serve | 5 |