Site Reliability Engineer

02 Dec 2024
Dallas / Fort Worth, Texas 75201, USA

Job Title: Site Reliability Engineer

Job Type: Direct Hire

Location: Dallas, TX


Job Description:
  • The client is looking for a Site Reliability Engineer to join their SRE and Release Management team.
  • The keys to their SRE culture include, but are not limited to, teamwork, inquisitiveness, problem-solving, critical thinking, transparency, and diversity. We work closely with software and systems engineers to drive adoption of modern reliability practices like SLOs, error budget policies, actionable alerts, incident retrospectives, chaos testing, and end-to-end ownership. You will also discover ample opportunities for growth in many areas: improved technology skills, effective leadership, dedicated mentorship, creative design, strong communication skills, teamwork, and more.
  • Simply put, as an SRE, you will help build and operate fast and reliable systems that help people get jobs. Are you up for the challenge?

Responsibilities:
  • Design, develop, and implement software that improves the stability, scalability, availability, and latency of Corporate products
  • Implement application/infrastructure observability solutions and perform maintenance to ensure desired application availability
  • Manage services in real time: build monitoring for the golden-signal SLIs, establish and negotiate SLOs with the business, build alerting, and create playbooks and runbooks for services in conjunction with development teams, product owners, and support
  • Triage and decompose incidents into smaller pieces, and identify probable root causes using skills gained through debugging code, operating networks, building hardware, or in other, entirely unrelated domains
  • Work closely with software engineers to build reliable, performant systems


Requirements:
  • Enjoy building and running distributed systems at scale in an AWS environment.
  • Appreciate the challenges and trade-offs involved in building and deploying systems to production: deployments, monitoring, scheduling, and load balancing
  • Knowledge of standard methodologies related to security, performance, and disaster recovery
  • Skilled in identifying performance bottlenecks, identifying anomalous system behavior, and resolving the root cause of service issues.
  • Effectively work across teams and functions to influence design, operations and deployment of highly available software
  • Strong analytical skills in support of production issue resolution and root cause identification.
  • Strong organizational skills to manage a variety of work areas and cross team engagements. Strong experience working on high data volume applications managed with modern Infrastructure-as-Code methodologies/tooling.
  • Experience with container technologies and orchestration platforms (Docker, Kubernetes, Rancher, Cloud Foundry)
  • Experience managing and using CI/CD tech stack systems (Bamboo, Azure DevOps, Jenkins, CircleCI)
  • Experience implementing a highly scalable, distributed CI/CD pipeline.
  • Experience working with monitoring and observability tools (we use New Relic and OpsGenie)
  • Strong knowledge of RDS and Snowflake.
  • Strong knowledge of programming/scripting languages (Python, Bash, Groovy, Go) and IaC (Terraform). Software engineers looking to get into SRE/DevOps are encouraged to apply.
