Hiring—Site Reliability Engineer—Houston TX (Onsite)

Hi,

Hope you are doing well!

Please find the job description below.

Job Title: Site Reliability Engineer

Location: Houston, TX (Onsite)

Job Description:

Looking for an SRE with 8+ years of IT and software experience to:

• Run the production environment by monitoring availability and taking a holistic view of system health

• Improve the reliability, quality, and time-to-market of our suite of software solutions

• Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating for continual improvement

• Provide primary operational support and engineering for multiple large-scale distributed software applications

Application monitoring tools: Datadog, Dynatrace

Responsibilities and Objectives

• Build and deploy infrastructure and software using DevOps and CI/CD practices

• Support Kubernetes platforms such as EKS, AKS, and OpenShift

• Support application teams whose workloads are deployed on cloud platforms

• Troubleshoot day-to-day issues in the cloud

• Ensure the safety and soundness of the platform

• Handle service management, process documentation, and knowledge documentation

• Collaborate with engineering teams on defects and new features, and operationalize them

• Respond to issues in a timely and efficient manner, with the assistance of other team members and resources, with the goal of minimizing the impact on our customers

• Assist in maturity efforts around application environments to create a more stable and effective solution for all consumers

• Work in conjunction with other application support members to create and facilitate a 24x7x365 resolution mechanism

• Assist in the continual improvement of documentation, processes, governance, customer onboarding, etc. around application environments

• Gather and analyze metrics from operating systems as well as applications to assist in performance tuning and fault finding

• Partner with development teams to improve services through rigorous testing and release procedures

• Participate in system design consulting, platform management, and capacity planning

• Create sustainable systems and services through automation and uplifts

• Balance feature development speed and reliability with well-defined service-level objectives

• Monitor infrastructure, applications, network components, and DevOps pipelines

• Review and provide inputs on overall design and observability of the platform

• Provide operational support for the platform and the workloads/products hosted on it

• Automate manual activities; perform troubleshooting, root cause analysis, and problem management

• Create observability plans and handle defect management

• Manage costs

Required skills and qualifications

• Cloud experience (AWS preferred)

• Ability to script in one or more languages (Python, Terraform, etc.)

• Understanding of structured and object-oriented programming in one or more high-level languages, such as Python, Java, C/C++, Ruby, or JavaScript

• Experience with distributed computing and storage technologies as well as dynamic resource management frameworks (Apache Mesos, Kubernetes, YARN)

• Proactive approach to identifying problems, performance bottlenecks, and areas for improvement

• Experience with DevOps and CI/CD tooling such as GitHub Actions or Jenkins

• Experience with monitoring software and the ability to create dashboards and reports

Best Regards,

Joshna Palla

Client Partner