Site Reliability Engineer - Azure Platform
We are searching for an experienced software or systems engineer interested in ensuring the stability, reliability, and performance of our SaaS assessment platform. The primary focus of the SRE team is to maintain the integrity and performance of our continuously integrated platform by understanding the relationships between infrastructure-layer code, functional application pools, network and software-defined load balancers, Mongo, SQL and PostgreSQL databases, message queues and data caches, all running in a mixture of private data centers and Content Delivery Networks. We use a mixture of monitoring tools to identify and mitigate possible client-affecting issues.
As a member of the SRE team, you must have the desire to troubleshoot complex technical problems, and you must work effectively with a wide variety of internal technical teams and vendors, as well as internal non-technical teams. Strong verbal and written communication skills are a must, as you will be collaborating with others to diagnose, resolve and communicate issues as efficiently as possible. You must be self-driven and able to look at problems in new and different ways to find solutions. You will look for ways to implement scripting and automation to improve existing tasks and procedures.
Responsibilities
Address, resolve and perform root cause analysis on all support escalations
Analyze the current state of the application and infrastructure, designing appropriate solutions and working with teams to implement them
Coordinate emergency responses, perform root cause analysis, and identify and implement solutions to prevent recurrences
Work with the Operations and Software Engineering teams to identify ways to increase MTBF and lower MTTR for the environment
Review entire application stack and execute initiatives to reduce failures, defects and issues with overall performance
Review code base and make recommendations for improving performance
Collaborate with quality assurance engineers to assist in resolution of software defects
Collaborate with project architects and project lead developers to prove the validity of new software technologies
Engage and help improve software development methodology
Perform other duties as assigned to ensure the success of the team and the entire organization
Identify and work with the team to implement more efficient system procedures
Maintain environment monitoring systems to provide the best visibility into the state of the production environment
Maintain performance analysis tools, identify any negative changes to performance and work with the teams to resolve them
Research industry trends and technologies, and promote adoption of best-in-class tools and technologies
Take initiative to advance the quality, performance, or scalability of our applications by influencing the architecture or design of our products
Design, develop and execute automated tests to validate solutions and environments
Troubleshoot issues across the entire stack – hardware, software, application and network
Follow system resource utilization trends and identify capacity planning needs
Participate in regular meetings, both within the team and with other teams, to discuss previous accomplishments, upcoming goals and any roadblocks in the way
Skills and Abilities
Experience with Continuous Integration & Delivery
Deep Linux and Windows systems knowledge and administration background
Strong understanding of Java and/or .NET
Hands-on experience with SQL and NoSQL database troubleshooting and performance tuning
Good understanding of software design concepts
Understanding of software development methodologies
Understanding of Java EE mid-tier System Architecture principles
Excellent analytical, troubleshooting and communication skills
Experience with log aggregation and data analysis
Experience with at least one scripting language
Experience with Application Performance Management (APM), Network Performance Management (NPM), and Real User Monitoring (RUM) tools and data is a big plus
Ability to code in at least one scripting language (Ruby, Python, Lua, JavaScript, etc.)
Ability to support the web platform during off-hours
Demonstrated ability to follow through with all tasks, promises and commitments
Ability to communicate effectively and work within established priorities
Ability to advocate ideas and to objectively participate in design critiques
Ability to work under tight timelines in a fast-paced environment