Roles & Responsibilities:

  • Reduce toil, implement identified improvement opportunities, and handle minor enhancements and non-ticketed activities
  • Define and monitor service level metrics, including incident management KPIs such as MTTD, MTTR, MTBF, MTTF, unavailability rate, and incident count
  • Create rules to optimise incident response through metrics, streamlined alert flows, and better collaboration and communication across squads
  • Proactively identify issues that might disrupt the service in production
  • Address incoming service requests raised to the support groups or through Jira
  • Create and maintain alerts
  • Handle change validation and change planning requests
  • Assist business stakeholders in determining SLOs or adjusting threshold limits
  • Manage demand and capacity, and correct SLI/SLO threshold limits
  • Gather and analyse metrics from both operating systems and applications to assist in performance tuning and fault finding
  • Partner with development teams to improve services through rigorous testing and release procedures
  • Participate in system design consulting, platform management, and capacity planning
  • Create sustainable systems and services through automation and uplifts
  • Balance feature development speed and reliability using well-defined service level objectives and indicators (SLOs, SLIs)
  • Debug production issues across services and levels of the stack
  • Monitor and audit production operations and infrastructure-related policies

Education & Experience Requirements:

  • Bachelor’s Degree in Software Engineering, Computer Science, or a related field
  • Software engineering and task automation skills in Bash and Python
  • Familiarity with the Agile software development lifecycle
  • Deep background in Linux systems and engineering
  • Highly experienced with engineering and automating on Amazon Web Services (AWS)
  • Experience supporting web applications running on Java / Apache / Tomcat in a live production environment
  • Prior experience with IaC tools like Terraform
  • Prior experience with DevOps tools (Git, GitLab)
  • Background in supporting production at scale in a heavily microservice-based environment
  • Hands-on engineering and ops expertise in containerization (Docker, Kubernetes/EKS, CNI, and Ingress networking)
  • Strong understanding of Single Sign-On, SAML, and OAuth (hands-on experience with Okta is a bonus)
  • Solid expertise in X.509 certificate technology and basic encryption concepts
  • Experience working with relational and NoSQL databases such as PostgreSQL and MongoDB, and with SQL
  • Advanced exposure to application development, web UI (design and development), JSON, and application architecture
  • Strong experience with observability tools (logging/APM) such as Datadog, CloudWatch, and PagerDuty

Apply as Software Reliability Engineer