What you'd actually do

Define and evolve the observability vision and roadmap for PCS DS applications.

Design and implement/integrate standardized observability frameworks (metrics, logs, traces, events, profiling).

Collaborate with platform, SRE, and product teams to instrument services using OpenTelemetry and other modern observability tooling.

Build and maintain dashboards, alerts, and SLOs that reflect both technical and business health indicators.

Lead or contribute to incident analysis and postmortem reviews, driving improvements in system resilience and observability coverage.

Skills

Required

observability strategy
instrumentation
metrics, logs, traces
OpenTelemetry
Prometheus
Grafana
Datadog
Dynatrace
Go
Python
Bash
Terraform
distributed tracing
SLO/SLI frameworks
incident response workflows
distributed systems
microservices
cloud platforms (AWS, Azure, GCP)
AI-powered anomaly detection
SRE practices

Nice to have

healthcare or regulated industries experience
data privacy and compliance (HIPAA, HITRUST)
cost optimization
telemetry data governance
chaos engineering

What the JD emphasized

observability vision and roadmap

standardized observability frameworks

instrument services

dashboards, alerts, and SLOs

incident analysis

healthcare compliance standards

observability-first development

observability solutions in cloud-native environments

observability pillars

distributed tracing

SLO/SLI frameworks

incident response workflows

AI-powered anomaly detection

Job Description Summary

As a Staff Software Engineer (Observability), you will be responsible for defining and implementing the observability strategy across PCS Digital Solutions Cloud Applications.

Job Description

Roles and Responsibilities

In this role, you will:

Define and evolve the observability vision and roadmap for PCS DS applications.
Design and implement/integrate standardized observability frameworks (metrics, logs, traces, events, profiling).
Collaborate with platform, SRE, and product teams to instrument services using OpenTelemetry and other modern observability tooling.
Build and maintain dashboards, alerts, and SLOs that reflect both technical and business health indicators.
Evaluate, integrate, and optimize observability agents (e.g., Prometheus, Fluent Bit, OTel, and other agents).
Design self-remediation solutions leveraging observability tooling.
Implement best practices for using the GenAI tools of observability platforms.
Lead or contribute to incident analysis and postmortem reviews, driving improvements in system resilience and observability coverage.
Conduct Operational Readiness Reviews (ORRs) to validate monitoring, alerting, and rollback strategies before go-live.
Ensure observability practices align with healthcare compliance standards (e.g., HIPAA, GDPR, HITRUST).
Mentor engineers and promote a culture of observability-first development.

Required Qualifications

Bachelor’s or master’s degree in Computer Science, Engineering, or a related technical field.
10+ years of experience in software engineering, SRE, or platform engineering roles.
4+ years of experience contributing to observability solutions in cloud-native environments (Kubernetes, microservices, serverless).
Deep expertise in observability pillars (metrics, logs, traces) and tools such as OpenTelemetry, Prometheus, Grafana, Datadog, and Dynatrace.
Strong programming/scripting skills (e.g., Go, Python, Bash, Terraform).
Experience with distributed tracing, SLO/SLI frameworks, and incident response workflows.
Deep expertise in distributed systems, microservices, and cloud platforms (AWS, Azure, GCP).
Experience with AI-powered anomaly detection, automated incident response, and cost optimization for observability at scale.
Familiarity with SRE practices and chaos engineering.
Excellent communication and collaboration skills.

Desired Characteristics

Experience in healthcare or regulated industries.
Knowledge of data privacy and compliance (HIPAA, HITRUST).
Experience with cost optimization and telemetry data governance.
Contributions to open-source observability projects.

Additional Information

**Relocation Assistance Provided:** No
