What you'd actually do

Translate business problems into data science and ML solutions

Build NLP and GenAI applications using modern LLMs

Implement RAG pipelines, prompt engineering, and vector search

Own the full ML lifecycle: experimentation → training → deployment → monitoring

Design end-to-end data and ML architectures

Skills

Required

Python
advanced SQL
statistics
probability
linear algebra
XGBoost
LightGBM
PyTorch
TensorFlow
AWS
Azure
GCP

Nice to have

Spark / PySpark
Airflow
MLflow
DVC
W&B
FastAPI
Docker
Kubernetes
LangChain
LlamaIndex
FAISS
Pinecone
Weaviate
Flink
data governance
privacy
compliance
time-series
anomaly detection
recommendation systems
Apache Spark
Airflow / Dagster / Prefect / Azure Data Factory / Databricks
Kafka
Databricks
Snowflake
BigQuery

Job Description

Advanced Data Scientist

Location

Bangalore, India

Role Overview

We are looking for an Advanced Data Scientist who can own end‑to‑end data science and machine learning solutions, from problem formulation to production deployment. This role requires a strong blend of machine learning expertise, data engineering, MLOps, cloud platforms, and technical leadership.

You will work closely with product, engineering, and business stakeholders to design scalable data and ML systems that drive measurable business impact.

Key Responsibilities

Data Science & Machine Learning

Translate business problems into data science and ML solutions
Perform advanced EDA, feature engineering, and model development
Build and optimize:
- Classical ML models (regression, classification, tree‑based models)
- Time‑series, anomaly detection, and recommendation systems
Develop and fine‑tune deep learning models using PyTorch / TensorFlow
Design and evaluate experiments (A/B testing, statistical validation)

GenAI, NLP & LLM Solutions

Build NLP and GenAI applications using modern LLMs
Implement RAG pipelines, prompt engineering, and vector search
Integrate LLMs using OpenAI / Azure OpenAI APIs
Evaluate model quality, latency, and cost for production LLM systems

Data Engineering & Pipelines (Good to Have)

Design and build scalable data pipelines for batch and streaming use cases
Work with distributed processing frameworks like Apache Spark
Orchestrate workflows using Airflow / Dagster / Prefect / Azure Data Factory / Databricks
Handle real‑time data using Kafka or cloud‑native streaming services
Ensure data reliability, quality, and performance at scale

MLOps, Deployment & Production

Own the full ML lifecycle: experimentation → training → deployment → monitoring
Implement model versioning, reproducibility, and CI/CD pipelines
Deploy models using REST APIs or batch inference pipelines
Monitor model performance, drift, and data quality in production
Work with Docker and Kubernetes for scalable deployments

Cloud & Platform Engineering

Build solutions on AWS / Azure / GCP (at least one in depth)
Work with cloud data platforms such as Databricks, Snowflake, and BigQuery
Optimize system performance and cloud costs
Ensure security, access control, and compliance best practices

Architecture, Collaboration & Leadership

Design end‑to‑end data and ML architectures
Weigh trade-offs such as batch vs. streaming and cost vs. performance
Mentor junior data scientists and review code and models
Set data science and ML best practices across teams
Communicate insights clearly to technical and non‑technical stakeholders

Required Skills & Qualifications

Core Technical Skills

Strong proficiency in Python and advanced SQL
Solid foundation in statistics, probability, and linear algebra
Hands‑on experience with XGBoost and LightGBM
Experience with PyTorch or TensorFlow

Data Engineering (Good to have)

Strong experience with Spark / PySpark
Pipeline orchestration using Airflow or similar tools
Experience with relational, NoSQL, and analytical databases
Understanding of data lakes and warehouse architectures

MLOps & DevOps (Optional)

Experience with MLflow, DVC, or W&B
Model deployment using FastAPI
Containers and orchestration: Docker, Kubernetes
CI/CD and monitoring tools

Cloud Platforms

Deep expertise in at least one cloud provider: AWS, Azure, or GCP
Experience with managed ML and data services

Preferred / Nice‑to‑Have

Experience with LLM frameworks (LangChain, LlamaIndex)
Vector databases (FAISS, Pinecone, Weaviate)
Streaming frameworks (Flink)
Knowledge of data governance, privacy, and compliance
Experience leading cross‑functional technical initiatives

Machine Learning Algorithms & Techniques (Hands‑On)

Supervised Learning

Linear Models
- Linear Regression
- Logistic Regression
- Regularization (L1, L2, Elastic Net)
Tree‑Based Models
- Decision Trees
- Random Forest
- Gradient Boosting (XGBoost, LightGBM, CatBoost)

Unsupervised Learning

Clustering Techniques
- K‑Means
- Hierarchical Clustering
- DBSCAN
Dimensionality Reduction
- PCA (feature reduction)
- t‑SNE / UMAP (visualization & analysis)

Time Series & Forecasting (Basic–Intermediate)

Statistical forecasting:
- Moving averages
- ARIMA / SARIMA (conceptual + basic use)
ML‑based forecasting using regression and tree‑based models

Model Evaluation & Optimization

Cross‑validation techniques
Hyperparameter tuning (Grid Search, Random Search)
Bias–variance tradeoff
Handling class imbalance
Selection of appropriate evaluation metrics

Experience

8–12+ years