100% Remote || Hiring || C2C || SRE Lead
Hi Sir,
Hope you are doing well. Please review the job description below, and if you have a suitable consultant, please share an updated resume.
Position: SRE Lead
Location: Remote
Job description:
Key Responsibilities
- Strengthen the team’s SRE practices, starting from service level indicator definitions, objectives, error budgets, thresholds, alerting and error management systems.
- Site planning – Work with development and testing teams to plan changes to production and other systems.
- Optimizing planned outages – This includes optimizing the DevOps area and any other activity resulting in a planned outage.
- Toil management – Identify areas of high toil and find solutions for improvement.
- Leverage automation wherever possible to minimize workload, enhance stability, and improve the overall functionality of the environment.
- Alert management – Strengthen alerting, including establishing goals, criteria, alert recalls, resets, and enable/disable rules, and revising error budgets based on the toil undergone by teams.
- Prevention of outages – Respond to non-critical alerts and work closely with development and testing teams.
- Verification – Work closely with Load and Performance teams in redefining parameters like load and concurrent users.
- Incident management – Chair meetings with development and operations teams in the event of an incident.
- Post-incident reviews – Derive learnings from issues and alerts along with teams, inclusive of RCAs. Work on long-term solutions, which could include changes in code, configuration, design/architecture, or capacity planning.
- Reporting with reliability metrics – This includes a set of derived metrics such as Availability, Mean Time to Restore, Mean Time Between Repairs, and Probability of Failure.
- Continuous improvement – Develop and maintain a backlog of SRE improvement opportunities.
- With company sponsorship, undergo necessary background checks to obtain and maintain U.S. Federal Government “Public Trust” suitability clearance.
Requirements
- Knowledgeable in the Site Reliability Engineering discipline, with a proven track record of success.
- Proficient with administering Azure systems.
- Proficient with Kubernetes systems. Familiarity with Podman/Docker and Helm Charts.
- Proficient with Python.
- Experience with GitHub.
- Knowledgeable about resiliency / reliability design patterns.
To be a good fit, the candidate will have experience in:
- Prometheus
- AKS Monitoring
- Grafana
- Automation
Thanks & Regards,
Deepak Pal
Technical Recruiter
Diverse Lynx LLC
300 Alexander Park | Suite #200 | Princeton, NJ 08540
Office: +1 (732) 452-1006 Ext. 211
Email: [email protected] | URL: http://www.diverselynx.com