Job Title: Site Reliability Engineer
Job Type: Direct Hire
Location: Dallas, TX
Job Description:
Our client is looking for a Site Reliability Engineer to join their SRE and Release Management team.
The keys to their SRE culture include, but are not limited to, teamwork, inquisitiveness, problem-solving, critical thinking, transparency, and diversity. The team works closely with software and systems engineers to drive adoption of modern reliability practices like SLOs, error budget policies, actionable alerts, incident retrospectives, chaos testing, and end-to-end ownership. You will also find ample opportunities for growth in many areas: improved technology skills, effective leadership, dedicated mentorship, creative design, strong communication skills, teamwork, and more.
Simply put, as an SRE, you will help build and operate fast and reliable systems that help people get jobs. Are you up for the challenge?
Responsibilities:
Design, develop, and implement software that improves the stability, scalability, availability, and latency of Corporate products
Implement application/infrastructure observability solutions and perform maintenance to ensure desired application availability
Manage services in real time, including building monitoring for the golden-signal SLIs, establishing and negotiating SLOs with the business, building alerting, and creating playbooks and runbooks for services in conjunction with development teams, product owners, and support
Triage incidents, decompose them into smaller pieces, and identify probable root causes using skills gained through debugging code, operating networks, building hardware, or working in other, entirely unrelated domains
Work closely with software engineers to build reliable, performant systems
Requirements:
Enjoy building and running distributed systems at scale in an AWS environment.
Appreciate the challenges and trade-offs involved in building and deploying systems to production: deployments, monitoring, scheduling, and load balancing
Knowledge of standard methodologies related to security, performance, and disaster recovery
Skilled in identifying performance bottlenecks, detecting anomalous system behavior, and resolving the root cause of service issues.
Effectively work across teams and functions to influence design, operations and deployment of highly available software
Strong analytical skills in support of production issue resolution and root cause identification.
Strong organizational skills to manage a variety of work areas and cross-team engagements.
Strong experience working on high-data-volume applications managed with modern Infrastructure-as-Code methodologies/tooling.
Experience with container technologies and orchestration platforms (Docker, Kubernetes, Rancher, Cloud Foundry)
Experience managing and using CI/CD tech stack systems (Bamboo, Azure DevOps, Jenkins, CircleCI)
Experience implementing a highly scalable, distributed CI/CD pipeline.
Experience working with monitoring and observability tools (we use New Relic and OpsGenie)
Strong knowledge of RDS and Snowflake.
Strong knowledge of programming/scripting languages (Python, Bash, Groovy, Go) and IaC tooling (Terraform).
Software engineers looking to get into SRE/DevOps are encouraged to apply.