Senior Site Reliability Engineer, Coordination and Service Discovery Infrastructure

29 Sep 2024
Washington, USA

Vacancy expired!

Job Description

The Coordination team develops and operates highly available foundational services that are used by almost every engineer at Twitter. Our vision is to provide robust control plane infrastructure for Twitter that serves use cases such as distributed coordination, service discovery, topology management, and configuration management. We manage one of the world’s largest ZooKeeper deployments and are actively involved with the open-source community! As a Site Reliability Engineer embedded in the Coordination & Service Discovery Infrastructure team, you’ll bring the SRE discipline and perspective to the priorities and challenges we face.

What you’ll be doing:

- Build tooling to automate operations and reduce toil. This includes automatic failure remediation, application and systems deployment, capacity planning, and fleet management.

- Troubleshoot mission-critical distributed systems that have some of the highest availability and lowest latency objectives within Twitter.

- Collaborate with Software Engineering teams. Bring the SRE mindset for availability, reliability, scalability, disaster recovery, problem/incident management, and performance of production services.

- Help bring our service to more data centers and cloud environments faster with reliable automation, Docker + Kubernetes, and other ideas you’ve got!

- Identify and contribute to solutions for reducing service downtime, reducing alert noise, improving monitoring, and helping our services reach Service Level Objectives (SLOs).

- Participate in the team’s Scrums and on-call rotation.

- Work with highly distributed and diverse hardware, software, and networking teams throughout the company.

Qualifications

- 5+ years of improving the reliability of data-intensive applications, storage engines, and distributed systems in an internet-scale production environment.

- Practical knowledge of at least one programming language (Python, Java, Ruby, C, C++, Scala, or any other modern systems language).

- Demonstrable knowledge of Linux operating system internals, TCP/IP, filesystems, and disk/storage technologies.

- Experience with state configuration tools (Puppet, Chef, etc.).

- Experience setting up capacity plans for physical and/or virtual infrastructure.

- Ability to prioritize tasks and work independently. A self-starter.

- Good written and oral communication skills, to help create clarity when working across multiple services and stakeholders.

- Bonus: Hands-on experience with Finagle, ZooKeeper, or service discovery systems. Hands-on experience with Mesos and Aurora.


Additional Information

All of your information will be kept confidential according to EEO guidelines.

Job Details

  • ID
    JC4923941
  • Job type
    Full-time
  • Salary
    N/A
  • Hiring Company
    Twitter
  • Date
    2020-09-29
  • Deadline
    2020-11-28
