Site Reliability Engineering

ThoughtSpot · Data AI · Bangalore, India

This Site Reliability Engineering role focuses on ensuring the reliability of a SaaS platform by leveraging AI/ML for proactive observability, predictive anomaly detection, and intelligent alerting. The role also involves customer-facing technical support and incident management, with a strong emphasis on integrating AI tools into daily SRE workflows.

What you'd actually do

  1. Act as the primary point of contact for customer-facing technical issues related to our SaaS platform, including data connectivity, report errors, performance concerns, access problems, data inconsistencies, software bugs, and integration challenges.
  2. Maintain, monitor, and troubleshoot ThoughtSpot cloud infrastructure using tools like Grafana, Prometheus, Datadog, and Splunk.
  3. Monitor system health and performance through metrics, logs, and dashboards to detect and prevent issues proactively.
  4. Implement and leverage AI/ML-driven solutions for proactive observability, predictive anomaly detection, and intelligent alerting to enhance service reliability and reduce Mean Time to Resolution (MTTR).
  5. Participate in on-call rotations, lead incident reviews, and conduct thorough root cause analyses to drive continuous improvement.

Skills

Required

  • B.S. in Computer Science or equivalent relevant experience
  • Proven experience troubleshooting complex Linux systems and managing virtualization and cloud platforms (VMware, AWS, Azure, GCP)
  • Hands-on experience with monitoring tools such as Grafana, Prometheus, Datadog, or Splunk
  • Demonstrated experience and a keen interest in leveraging AI/ML principles to address SRE challenges — including AIOps, predictive maintenance, and intelligent automation
  • Prior experience in enterprise customer support, including on-call rotations and incident management, with the ability to lead root cause analyses
  • Strong problem-solving and algorithmic thinking with a solid understanding of system internals
  • Excellent verbal and written communication skills
  • Familiarity with scripting and programming languages such as Python, Go, Bash, or Java
  • Exposure to infrastructure and service monitoring frameworks with the ability to analyze data to ensure high availability
  • Ability to comfortably and confidently integrate artificial intelligence into daily workflows to increase productivity and quality
  • Hands-on experience leveraging AI tools (industry-leading LLMs) to increase productivity, automate routine tasks, and improve work quality
  • Ability to speak to experience using AI for research, content creation, and document summarization while maintaining ownership of judgment and final decisions
  • Ability to write effective prompts to get the most accurate and creative results from AI tools
  • Curiosity about exploring new AI tools
  • Adaptability to quickly learn and implement new, emerging AI technologies
  • Critical thinking to identify when AI should be used versus when human judgement is necessary

Nice to have

  • Experience partnering with Engineering to design and implement mission-critical tooling and automation that advances system debuggability, high availability, elastic scalability, and performance
  • Experience with alerting strategies and monitoring system tuning to minimize alert fatigue and optimize Mean Time to Acknowledge (MTTA)
  • Familiarity with C/C++ or other low-level systems languages

What the JD emphasized

  • AI/ML-driven solutions for proactive observability, predictive anomaly detection, and intelligent alerting
  • Optimize SRE workflows with AI tools
  • leveraging AI/ML principles to address SRE challenges — including AIOps, predictive maintenance, and intelligent automation
  • integrate artificial intelligence into their daily workflow to increase productivity and quality
  • leverage AI tools (industry-leading LLMs) to increase productivity, automate routine tasks, and improve work quality

Other signals

  • leveraging AI/ML to deliver timely updates, meaningful solutions, and predictive improvements
  • Implement and leverage AI/ML-driven solutions for proactive observability, predictive anomaly detection, and intelligent alerting
  • Optimize SRE workflows with AI tools to boost operational effectiveness
  • Demonstrated experience and a keen interest in leveraging AI/ML principles to address SRE challenges — including AIOps, predictive maintenance, and intelligent automation
  • Ability to comfortably and confidently integrate artificial intelligence into daily workflows to increase productivity and quality
  • Hands-on experience leveraging AI tools (industry-leading LLMs) to increase productivity, automate routine tasks, and improve work quality