Job responsibilities
- Provide 24/7 support & incident management.
- Participate in on-call rotation & support, ensuring stability and performance of production environments.
- Respond to incidents or issues reported by CS, CSE or monitoring alerts.
- Run recovery jobs, following the steps in the SOPs.
- Proactively address infrastructure issues to mitigate and prevent production outages.
- Respond to monitoring alerts according to defined SOPs.
- Participate in Post Incident Reviews and discussions.
- Build effective working relationships with peers across global locations.
- Make suggestions for process improvements and enhanced operational efficiencies.
- Apply strong experience with monitoring and alerting tools: CloudWatch, Grafana, PagerDuty.
- Provide superior problem remediation support across web, application, cloud, and container tier environments in support of negotiated Service Level Agreements (SLAs).
- Partner with the SRE or CAE team to build automation that prevents problem recurrence.
Required experience
- Minimum of 2 years of development experience in a cloud environment.
- Minimum of 2 years in incident response and major incident management.
- Minimum of 4 years of Linux experience.
- Passion for analyzing and solving problems in global-scale distributed systems.
- Ability to prioritize and stay on top of all incidents reported.
- Working knowledge of configuration tools such as Chef, Puppet, Ansible, and Rundeck.
- Experience troubleshooting database- or ETL-related issues.