Applied Scientist 3

Oracle · Enterprise · Bengaluru, Karnataka, India

This Applied Scientist role focuses on designing and building data-centric Generative AI methods, including synthetic data generation, multimodal data curation, and data augmentation. It involves developing evaluation frameworks that connect data quality to downstream GenAI model performance and implementing modern generative AI techniques. Responsibilities include building scalable data and ML pipelines, writing production-quality code for ML workflows, and translating research into practical systems. The role spans the full lifecycle, from research to production support.

What you'd actually do

  1. Design and build data-centric GenAI methods for synthetic data generation, multimodal data curation, data augmentation, filtering, deduplication, and quality assessment.
  2. Develop and evaluate synthetic data pipelines for text, speech, vision, and multimodal GenAI use cases, including controllable generation, provenance tracking, safety checks, and domain adaptation.
  3. Build evaluation frameworks that connect data quality to downstream GenAI model performance, including benchmark design, ablation studies, error analysis, and model-feedback loops.
  4. Research and implement modern generative AI techniques, including LLM/VLM-based data generation, fine-tuning, instruction tuning, preference optimization, and model-based data labeling.
  5. Build scalable data and ML pipelines for acquisition, cleaning, transformation, metadata extraction, embedding generation, labeling, training, and evaluation.

Skills

Required

  • Python programming
  • PyTorch
  • Deep learning stacks
  • Data-centric AI
  • GenAI methods
  • Synthetic data generation
  • Data quality measurement
  • Dataset curation
  • Model-based labeling
  • Active learning
  • Deduplication
  • Data augmentation
  • Experiment design
  • Statistical analysis
  • Ablation studies
  • Benchmark evaluation
  • Error analysis
  • Model training
  • Model inference
  • Model evaluation
  • Production monitoring
  • Research paper implementation
  • Communication skills
  • Technical proposals
  • Design documents
  • Experiment reports
  • Stakeholder presentations
  • Scalable data pipelines
  • ML pipelines
  • Distributed compute
  • Cloud storage
  • Batch processing
  • Workflow orchestration

Nice to have

  • Hugging Face
  • LLMs
  • VLMs
  • Diffusion models
  • Multimodal models

What the JD emphasized

  • Ph.D. degree, Master's degree, or equivalent experience in computer science, artificial intelligence, machine learning, operations research, statistics, or a related technical field.
  • 5+ years of experience with a Master's degree, or 3+ years with a Ph.D., applying machine learning to real-world problems.
  • Strong Python programming skills and experience building production-quality ML, GenAI, or data systems.

Other signals

  • data-centric GenAI methods
  • synthetic data generation
  • multimodal data curation
  • evaluation frameworks
  • production-quality code