What you'd actually do

Design and implement scalable orchestration for serving and training AI/ML models.

Explore and incorporate contemporary research on AI, agents, and inference systems into the software stack for designing, monitoring, troubleshooting and deploying networks.

Evaluate, integrate, and optimize technologies across the stack for latency, throughput, and resource utilization in training and inference workloads.

Lead initiatives in AI systems design, including Retrieval-Augmented Generation (RAG) and LLM fine-tuning.

Design and develop scalable services and tools in Python/Go to support GPU-accelerated AI pipelines and observability frameworks.

Skills

Required

Python
ML frameworks (PyTorch, TensorFlow)
LLMs
embeddings
vector search
RAG pipelines
fine-tuning
Data engineering: Spark, Kafka, Flink, OCI Streaming/Data Flow
Distributed systems
large-scale training/inference
Handling network telemetry (NetFlow, packet captures, streaming telemetry)
Network automation frameworks (Terraform, Ansible, NAPALM; Batfish is a plus)
Containerization
model serving
GPU workflows
CI/CD
MLOps tools
Writing design docs
scoping features
owning delivery end-to-end
7+ years of experience building software systems
prior experience building AI applications and training models

Nice to have

MSEE, MSCS, or MSCE
Batfish

What the JD emphasized

building advanced AI applications powered by AI models

training AI models

building and optimizing large-scale AI systems

development and deployment of AI solutions

serving and training AI/ML models

training and inference workloads

LLM fine-tuning

large-scale training/inference

model serving

MLOps tools

building software systems

building AI applications and training models

Other signals

design and development team to build advanced AI applications powered by AI models

use AI/ML to automate, optimize, and secure networks

training AI models

building and optimizing large-scale AI systems

development and deployment of AI solutions

serving and training AI/ML models

contemporary research on AI, agents, and inference systems

training and inference workloads

LLM fine-tuning

GPU-accelerated AI pipelines

large-scale training/inference

model serving

MLOps tools

In this role you will lead the design and development team building advanced AI applications powered by AI models. You will use AI/ML to automate, optimize, and secure networks, focusing on tasks such as self-provisioning, auto-ingesting, and auto-qualifying systems and self-healing networks. The role requires skills in Python, ML frameworks, and training AI models, along with an understanding of networking protocols, data center designs, infrastructure as a service, network monitoring, and network automation.

As a Principal AI Developer in the Networking Org, you will be responsible for building and optimizing large-scale AI systems, ensuring scalability, reliability, and performance. The candidate should be able to work collaboratively with cross-functional teams to drive the development and deployment of AI solutions. If you have a passion for building cutting-edge AI applications and are looking for a challenging role, we encourage you to apply. Strong problem-solving skills, attention to detail, and excellent communication skills are essential for this role.

Design and implement scalable orchestration for serving and training AI/ML models.
Explore and incorporate contemporary research on AI, agents, and inference systems into the software stack for designing, monitoring, troubleshooting and deploying networks.
Evaluate, integrate, and optimize technologies across the stack for latency, throughput, and resource utilization in training and inference workloads.
Lead initiatives in AI systems design, including Retrieval-Augmented Generation (RAG) and LLM fine-tuning.
Design and develop scalable services and tools in Python/Go to support GPU-accelerated AI pipelines and observability frameworks.

Required/Preferred experience:

Strong Python and ML framework skills (PyTorch, TensorFlow)
LLMs, embeddings, vector search, RAG pipelines, and fine-tuning
Data engineering: Spark, Kafka, Flink, OCI Streaming/Data Flow
Distributed systems and large-scale training/inference
Handling network telemetry (NetFlow, packet captures, streaming telemetry)
Network automation frameworks (Terraform, Ansible, NAPALM; Batfish is a plus)
Containerization, model serving, GPU workflows, CI/CD, and MLOps tools
Writing design docs, scoping features, and owning delivery end-to-end

Required Education and Work Experience:

BSEE, BSCS, BSCE, or equivalent. MSEE, MSCS, or MSCE is a plus. At least 7 years of experience building software systems and prior experience building AI applications and training models.

Disclaimer:

Certain U.S. based or U.S. customer or client-facing roles may be required to comply with applicable requirements, such as immunization/occupational health mandates, and/or drug testing requirements.

Range and benefit information provided in this posting is specific to the stated locations only.

US: Hiring Range in USD from: $99,600 to $234,600 per annum. May be eligible for bonus, equity, and compensation deferral.

Oracle maintains broad salary ranges for its roles in order to account for variations in knowledge, skills, experience, market conditions and locations, as well as reflect Oracle's differing products, industries and lines of business. Candidates are typically placed into the range based on the preceding factors as well as internal peer equity.

Oracle US offers a comprehensive benefits package which includes the following:

Medical, dental, and vision insurance, including expert medical opinion
Short term disability and long term disability
Life insurance and AD&D
Supplemental life insurance (Employee/Spouse/Child)
Health care and dependent care Flexible Spending Accounts
Pre-tax commuter and parking benefits
401(k) Savings and Investment Plan with company match
Paid time off: Flexible Vacation is provided to all eligible employees assigned to a salaried (non-overtime eligible) position. Accrued Vacation is provided to all other employees eligible for vacation benefits. For employees working at least 35 hours per week, the vacation accrual rate is 13 days annually for the first three years of employment and 18 days annually for subsequent years of employment. Vacation accrual is prorated for employees working between 20 and 34 hours per week. Employees working fewer than 20 hours per week are not eligible for vacation.
11 paid holidays
Paid sick leave: 72 hours of paid sick leave upon date of hire. Refreshes each calendar year. Unused balance will carry over each year up to a maximum cap of 112 hours.
Paid parental leave
Adoption assistance
Employee Stock Purchase Plan
Financial planning and group legal
Voluntary benefits including auto, homeowner and pet insurance

The role will generally accept applications for at least three calendar days from the posting date or as long as the job remains posted.

Career Level - IC4
