Position is bonus eligible.
Prestigious Enterprise Company is currently seeking a Site Reliability Software Engineer to implement the tools and processes necessary to achieve the required SLOs for our Platform.
Responsibilities:
Define and implement CI/CD pipelines.
Automate delivery of platform services using infrastructure as code.
Build self-service playbooks for the platform that can be consumed by globally distributed teams.
Define and implement the incident response management process and deploy the necessary tools.
Resolve support and escalation issues.
Conduct post-incident reviews.
Collaborate with application and business stakeholders to ensure that a high-quality product is developed and deployed to production.
Work diligently with other engineering teams to ratify the release processes necessary to meet business goals.
Drive the continuous improvement process.
Qualifications:
Expert knowledge of one of the major public cloud platforms (Azure, AWS, Google Cloud Platform).
Hands-on programming experience in Python or other object-oriented programming languages.
Expert knowledge of infrastructure and application monitoring tools: Prometheus, Grafana, Datadog, etc.
Experience implementing IaC concepts using Terraform, Chef, Puppet.
Experience with Elasticsearch and Kibana.
Experience administering databases.
Expert in Linux administration.
Expert knowledge of Docker and Helm.
Experience implementing CI/CD for cloud native applications.
Experience deploying applications that use a service mesh.
Experience administering Kubernetes clusters.
Experience defining and implementing incident response management processes.
Bachelor’s degree
8+ years’ experience in software engineering
Preferred Skills:
Master’s degree
Understanding of GitOps principles.
Experience implementing secure and compliant Kubernetes platforms.
Experience deploying and managing stateful distributed services in Kubernetes.
Experience with security scanning tools.
Experience with intrusion detection systems.
Experience with messaging systems such as Kafka or RabbitMQ.
Working knowledge of Databricks, Team Foundation Server, TeamCity, Octopus Deploy, and Datadog.