Principal Software Engineer - AI Infra Compute

Oracle · Enterprise · Austin, TX +1

Principal Software Engineer focused on building and enhancing the AI infrastructure compute platform, specifically GPU availability and monitoring. The role involves designing and developing software for cloud infrastructure components, with a strong emphasis on leveraging AI/ML techniques for tasks like spike detection, automated ticket routing, and test automation to improve performance and reliability of large-scale AI/ML/HPC workloads.

What you'd actually do

  1. Design, develop, troubleshoot, and debug software programs for various cloud infrastructure components, including databases, applications, tools, and networks.
  2. Design and implement ML-based spike detection mechanisms for provisioning failures to minimize operational disruptions.
  3. Develop an automated ticket routing framework, powered by NLP and ML, to streamline workflows, enhance efficiency, and reduce operational overhead.
  4. Use AI and ML to build tools and frameworks that automate testing, simulate complex environments, and reproduce incidents, freeing engineers to focus on higher-value work and improving the customer experience.

Skills

Required

  • Python
  • Java
  • TypeScript
  • Agile Principles
  • data modeling
  • data warehousing
  • data governance
  • OCI
  • AWS
  • Azure
  • Google Cloud Platform (GCP)
  • Linux
  • macOS
  • Bash
  • Perl
  • Ruby
  • Docker
  • RESTful APIs
  • API gateways
  • API security
  • Swagger/OpenAPI
  • AI-powered tools and platforms

Nice to have

  • RoCE
  • InfiniBand
  • Kafka
  • stream processing
  • chatbots
  • virtual assistants
  • predictive analytics

What the JD emphasized

  • AI Infrastructure
  • GPU platform
  • AI/ML/HPC workloads
  • thousands of GPUs
  • ML algorithms
  • NLP and ML
  • AI and ML

Other signals

  • GPU platform for AI/ML/HPC workloads
  • design and implement spike detection mechanisms for provisioning failures to minimize operational disruptions using ML algorithms
  • Developing an automated ticket routing framework to streamline workflows, enhance efficiency, and reduce operational overhead, powered by NLP and ML
  • Harness the power of AI and ML to create innovative tools and frameworks that automate testing, simulate complex environments, and reproduce incidents