100% Remote || Hiring || C2C || SRE Lead

 

Hi Sir,

 

Hope you are doing well. Please review the job description below, and if you have a suitable consultant, please share an updated resume.

 

Position: SRE Lead

Location: Remote

 

Job description:

 Key Responsibilities

  • Strengthen the team’s SRE practices, starting from service level indicator definitions, objectives, error budgets, thresholds, alerting and error management systems.
  • Site planning – Work with development and testing teams to plan changes to production and other systems.
  • Optimizing planned outages – This includes optimizing the DevOps area and any other activity resulting in a planned outage.
  • Toil management – Identify areas of high toil and find solutions for improvement.
  • Leverage automation wherever possible to minimize workload, enhance stability, and improve the overall functionality of the environment.
  • Alert management – Strengthen alerting, including establishing goals, criteria, alert recalls, resets, and enable/disable rules, and revising error budgets based on the toil undergone by teams.
  • Prevention of outages – Respond to non-critical alerts and work closely with development and testing teams.
  • Verification – Work closely with Load and Performance teams in redefining parameters like load and concurrent users.
  • Incident management – Chair meetings with development and operations teams in the event of an incident.
  • Post-incident reviews – Derive learnings from issues and alerts together with the teams, including RCAs. Work on long-term solutions, which could include changes in code, configuration, design/architecture, or capacity planning.
  • Reporting with reliability metrics – This includes a set of derived metrics such as Availability, Mean Time to Restore, Mean Time Between Repairs, and Probability of Failure.
  • Continuous improvement – Develop and maintain a backlog of SRE improvement opportunities.
  • With company sponsorship, undergo the necessary background checks to obtain and maintain U.S. Federal Government “Public Trust” suitability clearance.

Requirements

  • Knowledgeable within the Site Reliability Engineering discipline with a proven track record of success.
  • Proficient with administering Azure systems.
  • Proficient with Kubernetes systems. Familiarity with Podman/Docker and Helm Charts.
  • Proficient with Python.
  • Experience with GitHub.
  • Knowledgeable about resiliency / reliability design patterns.

 To be a good fit, the candidate will have experience in:

  • Prometheus
  • AKS Monitoring
  • Grafana
  • Automation

 

Thanks & Regards,
 
Deepak Pal
Technical Recruiter
Diverse Lynx LLC
300 Alexander Park | Suite #200 | Princeton, NJ 08540
Office: +1 (732) 452-1006 Ext. 211
Email: [email protected] | URL: http://www.diverselynx.com

 
