Site Reliability Engineer at Ford

What you'd actually do

Write, configure, and deploy code in Go and JavaScript that improves service reliability for existing or new systems; set the standard for others with respect to code quality.

Work within Google Cloud Platform (GCP) infrastructure, optimizing performance and cost, and scaling resources to meet demand.

Implement and manage SRE monitoring application backends using Golang, Postgres, and OpenTelemetry. Develop tooling using Terraform and other IaC tools to ensure visibility and proactive issue detection across our platforms.

Collaborate with development teams to enhance system reliability and performance, applying a platform engineering mindset to system administration tasks.

Troubleshoot and resolve issues in our dev, test, and production environments.

What the JD emphasized

3+ years of experience as an SRE, Software Engineer, DevOps Engineer or similar role

Solid programming skills in Golang and scripting languages

Proficient with monitoring and observability tools, particularly OpenTelemetry, Dynatrace, or similar tools

Proficient with cloud services, with a strong preference for Kubernetes and Google Cloud Platform (GCP) experience

Enterprise Technology is the engine driving the future of transportation. If you’re looking for the chance to leverage advanced technology to redefine the mobility landscape, enhance the customer experience and improve people’s lives, this is the opportunity for you.

Ford is seeking an experienced and passionate Site Reliability Engineer (SRE) to join our team in developing, enhancing, and expanding our global monitoring and observability platform. You'll blend software and systems engineering to ensure the uptime, scalability, and maintainability of our critical cloud services. You'll be at the intersection of SRE and Software Development, building and driving the adoption of our global monitoring capabilities.

If you're passionate about using your IT expertise and analytical skills to shape the future of transportation, this is your opportunity to make a real impact. Join us and be part of a team that's building the future of mobility!

Write, configure, and deploy code in Go and JavaScript that improves service reliability for existing or new systems; set the standard for others with respect to code quality.
Work within Google Cloud Platform (GCP) infrastructure, optimizing performance and cost, and scaling resources to meet demand.
Provide helpful and actionable feedback and review for code or production changes.
Drive repair/optimization of complex systems with consideration towards a wide range of contributing factors.
Lead debugging, troubleshooting, and analysis of service architecture and design.
Participate in on-call rotation.
Write documentation: designs, system analyses, runbooks, and playbooks. Provide design feedback and help raise the design skills of others.
Implement and manage SRE monitoring application backends using Golang, Postgres, and OpenTelemetry. Develop tooling using Terraform and other IaC tools to ensure visibility and proactive issue detection across our platforms.
Collaborate with development teams to enhance system reliability and performance, applying a platform engineering mindset to system administration tasks.
Develop and maintain automated solutions for operational aspects such as on-call monitoring, performance tuning, and disaster recovery.
Troubleshoot and resolve issues in our dev, test, and production environments.
Participate in postmortem analysis and create preventative measures for future incidents.
Implement and maintain security best practices across our infrastructure, ensuring compliance with industry standards and internal policies. Participate in security audits and vulnerability assessments.
Participate in capacity planning and forecasting efforts to ensure our systems can handle future growth and demand. Analyze trends and make recommendations for resource allocation.
Identify and address performance bottlenecks through code profiling, system analysis, and configuration tuning. Implement and monitor performance metrics to proactively identify and resolve issues.
Develop, maintain, and test disaster recovery plans and procedures to ensure business continuity in the event of a major outage or disaster. Participate in regular disaster recovery exercises.
Contribute to internal knowledge bases and documentation.

Bachelor’s degree in Computer Science, Engineering, Mathematics or equivalent work experience.
3+ years of experience as an SRE, Software Engineer, DevOps Engineer or similar role.
Solid programming skills in Golang and scripting languages, with a good understanding of software development best practices.
Proficient with monitoring and observability tools, particularly OpenTelemetry, Dynatrace, or similar tools.
Proficient with cloud services, with a strong preference for Kubernetes and Google Cloud Platform (GCP) experience.
Experience with relational and document databases.
Ability to debug, optimize code, and automate routine tasks.
Strong problem-solving skills and the ability to work under pressure in a fast-paced environment.
Excellent verbal and written communication skills.

You may not check every box, or your experience may look a little different from what we've outlined, but if you think you can bring value to Ford Motor Company, we encourage you to apply!

As an established global company, we offer the benefit of choice. You can choose what your Ford future will look like: will your story span the globe, or keep you close to home? Will your career be a deep dive into what you love, or a series of new teams and new skills? Will you be a leader, a changemaker, a technical expert, a culture builder…or all of the above? No matter what you choose, we offer a work life that works for you, including:

Immediate medical, dental, vision and prescription drug coverage
Flexible family care days, paid parental leave, new parent ramp-up programs, subsidized back-up child care and more
Family building benefits including adoption and surrogacy expense reimbursement, fertility treatments, and more
Vehicle discount program for employees and family members and management leases
Tuition assistance
Established and active employee resource groups
Paid time off for individual and team community service
A generous schedule of paid holidays, including the week between Christmas and New Year's Day
Paid time off and the option to purchase additional vacation time

For a detailed look at our benefits, see: https://fordcareers.co/GSR

This position spans salary grades 6-8, with a salary range of $85,400-$192,900.

Final determination of salary grade will be based on candidate's skills and experience, and base salary will be set within the applicable range according to job scope, responsibility and competitive market value.

Visa sponsorship is not available for this position.

Relocation assistance is not provided for this position.

Candidates for positions with Ford Motor Company must be legally authorized to work in the United States. Verification of employment eligibility will be required at the time of hire.

We are an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, religion, color, age, sex, national origin, sexual orientation, gender identity, disability status or protected veteran status. In the United States, if you need a reasonable accommodation for the online application process due to a disability, please call 1-888-336-0660.

#LI-Remote #LI-DS2
