Senior Site Reliability Engineer (SRE) for Atlanta, GA (F2F interview – locals only)

Hi,
Hope you are doing well.

Please review the job description below. If you are interested, please send me your updated resume or call me back at 512-399-0788.

Position: Senior Site Reliability Engineer (SRE)

Location: Atlanta, GA (F2F interview – locals only)

Duration: 12+ months

No H-1B

Job Overview:

As a Senior Site Reliability Engineer (SRE) with our Retail Technology team, you will be at the forefront of Cloud and Big Data technology. You'll play a key role in ensuring the reliability and performance of our critical applications and services. This position offers the opportunity to work with industry-leading technologies and establish yourself as a technical leader.

Key Responsibilities:

Implement, improve, and maintain monitoring, alerting, and logging solutions to detect and respond to incidents.
Collaborate closely with the development team to deploy applications and services, ensuring they meet reliability and performance standards.
Automate deployment, configuration management, and troubleshooting processes to streamline operations.
Participate in on-call rotation, triage production incidents, lead Root Cause Analysis (RCA) efforts, and implement preventive actions.
Serve as the escalation point for complex issues in both on-premises and AWS environments.

Qualifications:

Deep understanding of AWS services (Lambda, S3, SQS, IAM, Route 53, etc.) and proficiency in infrastructure as code (e.g., Terraform, CloudFormation).
Hands-on experience with monitoring tools (e.g., CloudWatch, Sumo Logic, Dynatrace, Grafana) for application performance monitoring and alerting.
Proficiency in scripting and automation (e.g., Python, Bash) to build and maintain deployment pipelines and infrastructure.
Strong analytical and troubleshooting skills to diagnose and resolve complex infrastructure, application, and data issues.
Experience with containerization (Docker, Kubernetes) and serverless architecture (AWS Lambda).

Core Responsibilities:

Manage and optimize data streaming and API components in on-premises OpenShift and AWS.
Proactively review application APIs and processes to identify opportunities for optimizing response times.
Automate testing (including data quality checks), delivery to production, and deployment processes.
Develop integrations between on-premises applications, AWS, and third-party tools (ServiceNow, VersionOne, Sumo Logic).
Collaborate with teams to create SLIs and SLOs.
Monitor and lead troubleshooting of performance issues for platform applications, develop solutions, and document artifacts from root cause analysis.
Evolve the cloud infrastructure ecosystem by experimenting with emerging technologies.
Design and develop CI/CD pipelines for deploying application artifacts, APIs, and data processing jobs.
Maintain data integrity and access control using AWS security tools (e.g., HSM, IAM).
Develop tools to monitor AWS billing, generate cost-related reports, and implement cost optimization strategies.
Design and implement data security tools in collaboration with enterprise security architects.
Monitor and analyze platform capacity and performance, implementing elastic infrastructure as needed.
Contribute to backup strategies and disaster recovery solutions.
Provide continuous improvement input on design, performance, and security enhancements.

Desired Skillset:

Deep understanding of AWS cloud platforms.
Proficiency in automation, scripting, and monitoring using tools such as OpenShift, CloudFormation, Terraform, Ansible, shell, and Python.
Strong technical knowledge of infrastructure layers (Linux OS, virtualization platforms, networking, storage, backup strategies).
Experience in end-to-end operations for enterprise systems and applications, including issue resolution for mission-critical systems.
Familiarity with CI/CD tools (GitLab, GitHub, Jenkins, Maven, Gradle, Nexus).
Experience with Software Release Management.

Required Qualifications:

Education: BS in Computer Science or related technical field (or equivalent practical experience).
Experience:
- 3+ years of DevOps/SysOps engineering experience focusing on major cloud platforms (AWS preferred).
- 2+ years of application development experience, including data streaming and deploying/monitoring high-availability critical application components.
- 1+ year of Site Reliability Engineering experience (preferred).