Principal SRE / Vanilla Kubernetes / AWS / Monitoring

28 Jan 2024

California, Los Angeles, 90001

Vacancy expired!

We are looking for an experienced Site Reliability Engineer to join our Technical Operations team. Site Reliability Engineers are hybrid software/systems engineers whose overarching goal is to ensure that Production Services are "Always On." They strive to build the most reliable and performant systems on the planet.

SREs work closely with cross-functional teams to ensure we have the right set of tools to generate, collect, analyze, visualize, and alert on operational data, so that we know exactly what is happening across the ecosystem, can spot problems before they occur, and can address them as quickly as possible.

Responsibilities:

Supervise capacity & utilization and work closely with cross-functional teams to orchestrate scale-up/down of the services
Own & operate critical open-source services like Elasticsearch, Kafka, RabbitMQ, Redis
Build tools and design processes that help improve observability and system resiliency of the platform
Triage Site Availability Incidents and proactively work towards reducing MTTR for customer-impacting incidents
Partner with Service owners to implement Service Level Metrics & Service Level Objectives that act as service level health indicators
Establish design patterns for monitoring, benchmarking and deploying new features for the backend services
Develop and maintain technical documentation, network diagrams, runbooks, and procedures
Drive initiatives to evolve our current platform, increasing efficiency and keeping it in line with current standards and best practices
Respond to production incidents and use your experience in software development, systems engineering, and networking to proactively prevent recurring issues
Provide relief and sustainable resolution to issues within our infrastructure
Drive initiatives with partner teams to improve the reliability and performance of the infrastructure through improved system design

Skills and Qualifications

Systematic problem-solving approach, combined with a sense of ownership and drive
Full-stack debugging and performance optimization ability, including knowledge of Cloud systems (load balancing, caching, content distribution, etc.), continuous integration/build systems, Java, SQL and NoSQL databases
Track record of monitoring and analyzing system performance and isolating issues or bottlenecks that could impact reliability, performance, and scalability
Strong experience with observability tools such as Grafana, Prometheus, Zabbix, etc.
Solid experience with a scripting or programming language such as Python or Go
Experience with one or more OSS technologies like Elasticsearch, Kafka and Redis
Familiarity with container technologies such as Docker, Kubernetes, and Mesos
Understanding of and experience with SRE concepts and practices, including advocating for the elimination of toil and driving simple solutions
Good verbal and written communication skills, and the ability to work effectively with geographically remote teams

Good to have

Experience operating and maintaining big data components (Hadoop, YARN, HBase, Hive, Spark, etc.)
Solid understanding of Linux systems is a big plus

Job Details

ID: JC32589542
State: California
City: Los Angeles
Job type: Permanent
Salary: N/A
Hiring Company: Motion Recruitment
Date: 2022-01-27
Deadline: 2022-03-28
Category: Architect/engineer/CAD