What you'd actually do

Design, develop, and maintain scalable data pipelines on AWS using services such as S3, Glue, Lambda, Redshift, and EMR.

Build and optimize data warehousing solutions using Snowflake, including performance tuning and data modeling.

Write efficient and reusable code in Python and SQL for data transformation and processing.

Develop and optimize solutions using graph databases (e.g., Neo4j, Amazon Neptune), including query design and performance tuning.

Design, build, and operate vector database solutions (e.g., Milvus, Amazon OpenSearch) to support semantic search, recommendations, RAG, and AI-driven use cases.

Skills

Required

AWS cloud stack
Snowflake
Python
SQL
graph databases
vector databases
data modeling
performance tuning
Git
Azure DevOps
analytical and problem-solving skills
communication and collaboration abilities
Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field

Nice to have

NVIDIA ecosystem
AWS Step Functions
data governance and compliance practices
real-time data processing frameworks (e.g., Kafka, Spark Streaming)
RAPIDS libraries (cuDF, cuML, cuGraph)
CUDA-based tooling

What the JD emphasized

AWS cloud stack

Snowflake

Python

SQL

graph and vector database technologies

AWS cloud services, including data and AI workloads

Snowflake architecture, performance tuning, and best practices

Python and SQL for data pipelines, transformations, and services

graph and vector data modelling concepts and their practical applications

graph databases (e.g., Neo4j, Neptune)

vector databases (e.g., Milvus, Amazon OpenSearch)

data ingestion pipelines for unstructured sources

embedding generation at scale

vector databases, specifically Milvus

Knowledge Graph ingestion pipelines

pipeline engineering skills in Python

orchestrating multi-stage document processing workflows

deploying and monitoring these pipelines in production environments

Career Area:

Technology, Digital and Data

Job Description:

**Your Work Shapes the World at Caterpillar Inc.**

When you join Caterpillar, you're joining a global team who cares not just about the work we do – but also about each other. We are the makers, problem solvers, and future world builders who are creating stronger, more sustainable communities. We don't just talk about progress and innovation here – we make it happen, with our customers, where we work and live. Together, we are building a better world, so we can all enjoy living in it.

Job Summary

We are looking for a highly motivated and experienced Data Engineer to join our data engineering team. The ideal candidate will have a strong background in building scalable data pipelines using the AWS cloud stack and extensive hands-on experience with Snowflake. Proficiency in Python and SQL, along with graph and vector database technologies, is essential. This role requires strong problem-solving abilities and a proactive mindset to deliver efficient, scalable, and reliable data solutions.

Key Responsibilities

Design, develop, and maintain scalable data pipelines on AWS using services such as S3, Glue, Lambda, Redshift, and EMR.
Build and optimize data warehousing solutions using Snowflake, including performance tuning and data modeling.
Write efficient and reusable code in Python and SQL for data transformation and processing.
Collaborate with cross-functional teams, including data scientists, analysts, and business stakeholders, to understand data requirements.
Develop and optimize solutions using graph databases (e.g., Neo4j, Amazon Neptune), including query design and performance tuning.
Design, build, and operate vector database solutions (e.g., Milvus, Amazon OpenSearch) to support semantic search, recommendations, RAG, and AI-driven use cases.
Integrate vector databases with LLM-based applications and AI workflows.
Monitor, troubleshoot, and improve pipeline performance and reliability.
Ensure data quality, integrity, and security across all stages of the pipeline.
Participate in code reviews, architecture discussions, and continuous improvement initiatives.

Required Qualifications

8+ years of experience in data engineering or related roles.
Strong hands-on experience with AWS cloud services, including data and AI workloads.
Deep understanding of Snowflake architecture, performance tuning, and best practices.
Advanced proficiency in Python and SQL for data pipelines, transformations, and services.
Strong understanding of graph and vector data modelling concepts and their practical applications.
Hands-on experience with graph databases (e.g., Neo4j, Neptune) and vector databases (e.g., Milvus, Amazon OpenSearch).
Experience with version control systems (e.g., Git) and Git workflows.
Experience working with Azure DevOps (AzDO) boards for backlog management in Agile environments.
Excellent analytical and problem-solving skills.
Strong communication and collaboration abilities.
Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.

Nice to Have Skills

Knowledge of the NVIDIA ecosystem and its applications in data and AI.

Preferred Qualifications

Experience with orchestration tools such as AWS Step Functions.
Familiarity with data governance and compliance practices.
Exposure to real-time data processing frameworks (e.g., Kafka, Spark Streaming).

More Detail on Knowledge Base

Experience designing and deploying data ingestion pipelines for unstructured sources such as PDFs, Word documents, and HTML files, including text extraction, chunking strategies, and embedding generation at scale.
Hands-on expertise with vector databases, specifically Milvus, covering schema design, indexing, and optimizing write performance for large-scale embedding ingestion pipelines.
Proficiency in building Knowledge Graph ingestion pipelines using graph databases, including entity extraction, relationship modelling, and populating nodes and attributes.
Strong pipeline engineering skills in Python and frameworks for orchestrating multi-stage document processing workflows, with experience deploying and monitoring these pipelines in production environments.
Bonus: Exposure to RAPIDS libraries (cuDF, cuML, cuGraph) or CUDA-based tooling for GPU-accelerated data processing, enabling faster transformation and optimization during large-scale ingestion workflows.

Posting Dates:

June 19, 2026 - June 25, 2026

Caterpillar is an Equal Opportunity Employer. Qualified applicants of any age are encouraged to apply.
