Site Reliability Engineers (SRE) job offer

Site Reliability Engineers (SRE)

13 Jun 2024

Texas, Austin, 73301

Vacancy expired!

SREs ensure that our Cloud services meet the reliability and uptime requirements of our demanding enterprise customers. This is achieved proactively, through sound engineering practices and resilient design from day 0, as well as reactively, through a well-defined and effective on-call rotation that runs 24x7.

SREs engineer our production systems to run at scale, so that manual and repetitive work is fully eliminated. They follow blameless postmortem practices so that all incidents are well understood and problems are fixed at their root. Over time, they make our systems more robust, fault-tolerant and able to self-heal during the worst of outages and through the most unexpected circumstances.

SREs are experts in troubleshooting complex problems and can dig very deep into why systems break in production. To do that, they rely on observability practices like centralized logging, distributed tracing and anomaly detection. They shorten detection times (MTTD) and recovery times (MTTR) by improving the accuracy of alarms and the speed of troubleshooting.

SREs leverage the latest infrastructure automation best practices and the toolset offered by Cloud Providers, so that they multiply their effectiveness and achieve greater outcomes.

Key Responsibilities and Skills:

Automate highly scalable and resilient cloud operations that can be executed with no customer downtime;
Perform blameless root cause analysis on outages and ensure action items are completed;
Fix resiliency problems wherever they are in the product, or collaborate with product teams to do it;
Monitor customer infrastructure, measuring availability and system health;
Collaborate with customer support in recovering from escalated outages;
Troubleshoot complex incidents in highly distributed systems;
Shorten time to detection by improving the accuracy of alarms;
Be a key stakeholder in the design of cloud services so that they are resilient from day 0.

Minimum Qualifications and Skills:

Bachelor's or Master's degree in Computer Science or similar;
1+ years of experience in software development or operations;
Programming skills in at least one high-level programming language (C, Python, Java, C#, Go, etc.);
Experience in troubleshooting and debugging;
Availability to work in shifts and be part of the 24x7 on-call rotation;
Fluency in English and good communication skills.

Preferred Qualifications and Skills:

Experience with automation and IaC (Terraform, Ansible, etc.);
Experience with Cloud providers (AWS, Azure and Google Cloud Platform);
Experience with Docker and Kubernetes;
Experience with monitoring and troubleshooting complex distributed systems;
Experience in designing resilient and fault-tolerant systems;
Experience in debugging complex, distributed systems.

Location: US, remote

EEO Employer

Apex Systems is an equal opportunity employer. We do not discriminate or allow discrimination on the basis of race, color, religion, creed, sex (including pregnancy, childbirth, breastfeeding, or related medical conditions), age, sexual orientation, gender identity, national origin, ancestry, citizenship, genetic information, registered domestic partner status, marital status, disability, status as a crime victim, protected veteran status, political affiliation, union membership, or any other characteristic protected by law. Apex will consider qualified applicants with criminal histories in a manner consistent with the requirements of applicable law. If you have visited our website in search of information on employment opportunities or to apply for a position, and you require an accommodation in using our website for a search or application, please contact our Employee Services Department at or

Job Details

ID: JC15418563
State: Texas
City: Austin
Job type: Permanent
Salary: N/A
Hiring Company: Apex Systems
Date: 2021-06-12
Deadline: 2021-08-11
Category: Internet engineering