Distinguished, Software Engineer - Observability

Walmart · Retail · Sunnyvale, CA +2

Distinguished Software Engineer focused on architecting and developing cloud-native observability systems, processing large volumes of telemetry data. The role involves designing and implementing distributed systems, utilizing various telemetry technologies, and integrating AI/ML for anomaly detection and predictive behaviors. Collaboration with cross-functional teams and leadership in research and production deployment are key.

What you'd actually do

  1. As an observability Distinguished Engineer, you will be a key researcher and technical lead in the architecture and development of cloud-native observability designs, managed services, and real-time telemetry software systems.
  2. You will design, develop, and implement large-scale distributed systems that process large volumes of data, focusing on scalability, latency, and fault tolerance in every system built.
  3. You will utilize multiple telemetry technologies, such as data models, metric libraries, data logging, distributed tracing, data lakes, data correlation, rule-based alerting engines, real-time data streaming pipelines, time-series databases (TSDBs), and application performance management (APM).
  4. You will also utilize TSDBs and the correlation and fusion of multiple data types and heterogeneous data streams, coupled with artificial intelligence (AI) and learned behaviors, for anomaly detection and forward projections of expected system and application behaviors.
  5. This role will involve collaboration with enterprise architects, product managers, data scientists, engineers, and business managers to bring telemetry R&D projects into production.

Skills

Required

  • BS/MS in Computer Science, Engineering, or equivalent, with 15+ years in software engineering, design, and architecture
  • Java language and associated frameworks
  • Architecture leadership with demonstrated enterprise-level software implementations
  • Architectural leadership in research, evaluation, creation of software designs, and distributed software implementations in production
  • Technical leadership, software roadmaps, research and development, new software initiatives, and customer and engineering coordination and engagement
  • Full-stack cloud software development experience
  • API development, integration, and utilization
  • Cloud technologies and cloud-native designs
  • Cloud infrastructures and technologies, such as OpenStack, Azure, GCP, or AWS
  • Large-scale distributed systems experience, including scalability and fault tolerance
  • TSDBs (InfluxDB, KairosDB, Cortex, Thanos, Prometheus) or equivalent
  • Extract, transform, and load (ETL) processes
  • Real-time telemetry pipelines and publish/subscribe models (Kafka or equivalent)
  • Data warehousing, data lakes, processing, and data analytics
  • SQL (Azure SQL, PostgreSQL, or equivalent), with a solid foundation in advanced SQL
  • Unix/Linux shell scripting or similar programming/scripting knowledge
  • Real-time monitoring and alerting: metric agents, real-time dashboards, alerting rules
  • Excellent written and verbal communication skills for diverse audiences based on engineering subject matter
  • Ability to document requirements, architectural designs, and analysis findings in both business and technical terminology
  • Software development in an Agile, iterative CI/CD development environment
  • Promote and support company policies, procedures, mission, values, and standards of ethics and integrity

Nice to have

  • Agentic AI: Model Context Protocol (MCP) servers, large language models (LLMs), retrieval-augmented generation (RAG), natural language processing (NLP)
  • Python, JavaScript, advanced shell scripting, and configuration management (Ansible, Chef, Puppet)
  • Application Performance Monitoring (APM) and/or Distributed Tracing
  • Deployment of Kubernetes, containers, service meshes, and microservices
  • Microservices architectures, Istio, and Micrometer
  • OpenTelemetry standards and protocols
  • Go development
  • Observability tools and system architectures
  • Experience in creating and maintaining managed metric services
  • NoSQL (Cassandra, Cosmos DB, or equivalent)
  • Storm, Spark or similar real-time streaming software
  • UI development: JavaScript, HTML, CSS, and experience with frameworks like React and AngularJS
  • Involvement and contribution with open-source software communities
  • Demonstrated background in developing software systems

What the JD emphasized

  • deep understanding of the Java language
  • architectural leadership in research, evaluation, creation of software designs, and distributed software implementations in production
  • technical leadership, software roadmaps, research and development, new software initiatives and customer and engineering coordination and engagement

Other signals

  • design and development of cloud-native observability designs
  • real-time telemetry software systems
  • large-scale distributed systems
  • Artificial intelligence (AI) and Learned Behaviors for anomaly detection
  • collaboration with enterprise architects, product managers, data scientists, engineers, and business managers