Applied Scientist 3

Oracle · Enterprise · Bengaluru, Karnataka, India

This Applied Scientist role focuses on designing and building data-centric Generative AI methods, including synthetic data generation, multimodal data curation, and data augmentation. The role involves developing evaluation frameworks that connect data quality to downstream GenAI model performance and implementing modern generative AI techniques. Responsibilities include building scalable data and ML pipelines, writing production-quality code for ML workflows, and translating research into practical systems. The role spans the full lifecycle, from research to production support.

What you'd actually do

Design and build data-centric GenAI methods for synthetic data generation, multimodal data curation, data augmentation, filtering, deduplication, and quality assessment.
Develop and evaluate synthetic data pipelines for text, speech, vision, and multimodal GenAI use cases, including controllable generation, provenance tracking, safety checks, and domain adaptation.
Build evaluation frameworks that connect data quality to downstream GenAI model performance, including benchmark design, ablation studies, error analysis, and model-feedback loops.
Research and implement modern generative AI techniques, including LLM/VLM-based data generation, fine-tuning, instruction tuning, preference optimization, and model-based data labeling.
Build scalable data and ML pipelines for acquisition, cleaning, transformation, metadata extraction, embedding generation, labeling, training, and evaluation.

Skills

Required

Python programming
PyTorch
Deep learning stacks
Data-centric AI
GenAI methods
Synthetic data generation
Data quality measurement
Dataset curation
Model-based labeling
Active learning
Deduplication
Data augmentation
Experiment design
Statistical analysis
Ablation studies
Benchmark evaluation
Error analysis
Model training
Model inference
Model evaluation
Production monitoring
Research paper implementation
Communication skills
Technical proposals
Design documents
Experiment reports
Stakeholder presentations
Scalable data pipelines
ML pipelines
Distributed compute
Cloud storage
Batch processing
Workflow orchestration

Nice to have

Hugging Face
LLMs
VLMs
Diffusion models
Multimodal models

What the JD emphasized

Ph.D. degree, Master's degree, or equivalent experience in computer science, artificial intelligence, machine learning, operations research, statistics, or a related technical field.
5+ years with a Master's degree or 3+ years with a Ph.D. applying machine learning to real-world problems.
Strong Python programming skills and experience building production-quality ML, GenAI, or data systems.

Other signals

data-centric GenAI methods
synthetic data generation
multimodal data curation
evaluation frameworks
production-quality code

Full job description

Design and build data-centric GenAI methods for synthetic data generation, multimodal data curation, data augmentation, filtering, deduplication, and quality assessment.
Develop and evaluate synthetic data pipelines for text, speech, vision, and multimodal GenAI use cases, including controllable generation, provenance tracking, safety checks, and domain adaptation.
Build evaluation frameworks that connect data quality to downstream GenAI model performance, including benchmark design, ablation studies, error analysis, and model-feedback loops.
Research and implement modern generative AI techniques, including LLM/VLM-based data generation, fine-tuning, instruction tuning, preference optimization, and model-based data labeling.
Build scalable data and ML pipelines for acquisition, cleaning, transformation, metadata extraction, embedding generation, labeling, training, and evaluation.
Develop production-quality code for batch and real-time ML workflows, including model inference, feature processing, data validation, monitoring, and operational automation.
Translate research papers and emerging GenAI techniques into practical systems that improve data quality, model quality, and customer-facing AI outcomes.
Partner with modeling, product, infrastructure, and domain teams to define GenAI data requirements, quality bars, evaluation criteria, and delivery plans.
Operate across the full lifecycle: research, prototyping, experimentation, productionization, testing, CI/CD, monitoring, runbooks, and production support.

Qualifications

Ph.D. degree, Master's degree, or equivalent experience in computer science, artificial intelligence, machine learning, operations research, statistics, or a related technical field.
5+ years with a Master's degree or 3+ years with a Ph.D. applying machine learning to real-world problems.
Strong Python programming skills and experience building production-quality ML, GenAI, or data systems.
Hands-on experience with PyTorch and modern deep learning stacks; experience with Hugging Face, LLMs, VLMs, diffusion models, or multimodal models is strongly preferred.
Experience with data-centric AI or GenAI methods such as synthetic data generation, data quality measurement, dataset curation, weak supervision, model-based labeling, active learning, deduplication, or data augmentation.
Experience designing experiments and interpreting results through statistical analysis, ablation studies, benchmark evaluation, and error analysis.
Strong understanding of model training, inference, evaluation, and production monitoring.
Ability to read research papers, identify practical value, and implement useful techniques in real systems.
Strong written and verbal communication skills, including technical proposals, design documents, experiment reports, and stakeholder presentations.
Experience building scalable data or ML pipelines using distributed compute, cloud storage, batch processing, or workflow orchestration.
Career Level: IC3
