Job Description
Advanced Data Scientist
Location
Bangalore, India
Role Overview
We are looking for an Advanced Data Scientist who can own end‑to‑end data science and machine learning solutions, from problem formulation to production deployment. This role requires a strong blend of expertise in machine learning, data engineering, MLOps, and cloud platforms, along with technical leadership.
You will work closely with product, engineering, and business stakeholders to design scalable data and ML systems that drive measurable business impact.
Key Responsibilities
Data Science & Machine Learning
- Translate business problems into data science and ML solutions
- Perform advanced EDA, feature engineering, and model development
- Build and optimize:
  - Classical ML models (regression, classification, tree‑based models)
  - Time‑series, anomaly detection, and recommendation systems
- Develop and fine‑tune deep learning models using PyTorch / TensorFlow
- Design and evaluate experiments (A/B testing, statistical validation)
GenAI, NLP & LLM Solutions
- Build NLP and GenAI applications using modern LLMs
- Implement RAG pipelines, prompt engineering, and vector search
- Integrate LLMs using OpenAI / Azure OpenAI APIs
- Evaluate model quality, latency, and cost for production LLM systems
Data Engineering & Pipelines (Good to Have)
- Design and build scalable data pipelines for batch and streaming use cases
- Work with distributed processing frameworks like Apache Spark
- Orchestrate workflows using Airflow / Dagster / Prefect / Azure Data Factory / Databricks
- Handle real‑time data using Kafka or cloud‑native streaming services
- Ensure data reliability, quality, and performance at scale
MLOps, Deployment & Production
- Own the full ML lifecycle: experimentation → training → deployment → monitoring
- Implement model versioning, reproducibility, and CI/CD pipelines
- Deploy models using REST APIs or batch inference pipelines
- Monitor model performance, drift, and data quality in production
- Work with Docker and Kubernetes for scalable deployments
Cloud & Platform Engineering
- Build solutions on AWS / Azure / GCP (at least one in depth)
- Work with cloud data platforms like Databricks, Snowflake, BigQuery
- Optimize system performance and cloud costs
- Ensure security, access control, and compliance best practices
Architecture, Collaboration & Leadership
- Design end‑to‑end data and ML architectures
- Weigh tradeoffs such as batch vs streaming and cost vs performance
- Mentor junior data scientists and review code and models
- Set data science and ML best practices across teams
- Communicate insights clearly to technical and non‑technical stakeholders
Required Skills & Qualifications
Core Technical Skills
- Strong proficiency in Python and advanced SQL
- Solid foundation in statistics, probability, and linear algebra
- Hands‑on experience with XGBoost, LightGBM
- Experience with PyTorch or TensorFlow
Data Engineering (Good to Have)
- Strong experience with Spark / PySpark
- Pipeline orchestration using Airflow or similar tools
- Experience with relational, NoSQL, and analytical databases
- Understanding of data lakes and warehouse architectures
MLOps & DevOps (Optional)
- Experience with MLflow, DVC, or W&B
- Model deployment using FastAPI
- Containers and orchestration: Docker, Kubernetes
- CI/CD and monitoring tools
Cloud Platforms
- Deep expertise in at least one cloud provider (AWS, Azure, or GCP)
- Experience with managed ML and data services
Preferred / Nice‑to‑Have
- Experience with LLM frameworks (LangChain, LlamaIndex)
- Vector databases (FAISS, Pinecone, Weaviate)
- Streaming frameworks (Flink)
- Knowledge of data governance, privacy, and compliance
- Experience leading cross‑functional technical initiatives
Machine Learning Algorithms & Techniques (Hands‑On)
Supervised Learning
Linear Models
- Linear Regression
- Logistic Regression
- Regularization (L1, L2, Elastic Net)
Tree‑Based Models
- Decision Trees
- Random Forest
- Gradient Boosting (XGBoost, LightGBM, CatBoost)
Clustering Techniques
- K‑Means
- Hierarchical Clustering
- DBSCAN
Dimensionality Reduction
- PCA (feature reduction)
- t‑SNE / UMAP (visualization & analysis)
Time Series & Forecasting (Basic–Intermediate)
- Statistical forecasting:
  - Moving averages
  - ARIMA / SARIMA (conceptual + basic use)
- ML‑based forecasting using regression and tree‑based models
Model Evaluation & Optimization
- Cross‑validation techniques
- Hyperparameter tuning (Grid Search, Random Search)
- Bias–variance tradeoff
- Handling class imbalance
- Selection of appropriate evaluation metrics
Experience
8–12+ years