Site Reliability Engineer (3 Positions)
San Francisco, CA (remote during COVID)
What You'll Do:
Gain deep knowledge of our complex applications.
Serve as a primary point of contact responsible for the overall health, performance, and capacity of the Tempo platform and applications.
Design, develop and support tools and libraries as part of Infrastructure Tooling & Automation
Develop automation tools to support growing infrastructure and provide reporting and APIs for various applications
Develop tools to improve our ability to rapidly deploy and effectively monitor custom applications in a large-scale UNIX environment.
Troubleshoot and resolve issues with core infrastructure services
Incubate new ideas that can bring operational efficiency and support scaling of services
Lead internal working groups to evaluate, adopt and deploy new technology
Audit software for potential security and performance problems
Architect and develop configuration management policies
Assist in the roll-out and deployment of new product features and installations to facilitate our rapid iteration and constant growth.
Work closely with development teams to ensure that platforms are designed with "operability" in mind.
Function well in a fast-paced, rapidly-changing environment.
Participate in a 24x7 rotation for escalations.
About You:
7+ years of professional software experience in Operations and Reliability Engineering
Educational background in Management Information Systems (MIS), Computer Information Systems (CIS), Computer Science (CS), or Mathematics preferred
Experience with public cloud platforms such as AWS or Google Cloud Platform, at the application setup level and beyond
Experience working with Python, Flask, SQLAlchemy, and other frameworks
Experience working at scale with thousands of systems in a DevOps/SRE role
Experience with configuration management tools (Terraform, CloudFormation, etc.)
Python experience, specifically for systems automation.
Familiar with system hardening and server security best practices.
Knowledge of most of these: data structures, relational and non-relational databases, networking, Linux internals, filesystems, web architecture, APIs and related topics
Expertise automating system administration tasks with scripting tools (Python or shell preferred).
Experience with monitoring and automation tools such as Datadog, Sentry, Splunk, Ansible, Terraform, etc.
Aptitude for analyzing and troubleshooting operating system, networking, configuration and performance problems.
Fundamental understanding of Internet networking protocols: TCP/IP, TLS, DNS, HTTP, etc.
Ability to install, configure and maintain Linux hosts and popular open source applications such as Nginx, Apache httpd, etc.
Strong interpersonal communication skills and ability to work well in a diverse, team-focused environment with other SREs, Engineers, Product Managers, etc.
Strong desire to work in a fast-paced, start-up environment with short release cycles