SRE with AI/ML & IaC || Remote

Hi,
 
This is Alka from Crox Consulting. I am hiring for the role below. Please share your resume as soon as possible.
 

Role: SRE Contractor, AI/ML Infrastructure and Ops Engineer (AI/ML Training)

Location: Remote

 

Job Description – 

SRE Contractor – AI/ML Infrastructure and Ops Engineer (AI/ML Training)

As the Infrastructure and Ops Engineer, you will work on operations for UAIS (United AI Studio, an enterprise AI/ML platform), in particular the AI/ML training initiative that supports thousands of learners on the platform. This individual contributor (IC) role requires experience working on large-scale AI/ML platforms and ensuring their stability, reliability, scalability, and performance. Experience with modern infrastructure and DevOps tools and paradigms, as well as hands-on knowledge of major cloud services such as Azure, AWS, and GCP, is a must.

 

Primary Responsibilities:

    • Continuous support: Provide continuous SRE support to thousands of geographically distributed learners on the UAIS platform: respond to tickets, triage support requests, and liaise with customers.
    • Automation & DevOps: Improve the existing Infrastructure as Code (IaC) in line with DevOps best practices.
    • Systems Monitoring: Develop and maintain monitoring frameworks for the UAIS infrastructure that supports the AI/ML training program.
    • Security & Compliance: Collaborate with cybersecurity teams to ensure all systems and operations comply with industry standards and are secure against evolving threats.
    • Capacity Planning & Cost Optimization: Forecast and manage capacity requirements for the AI/ML training environment, while identifying opportunities to reduce costs without compromising performance.

Required Qualifications:

    • Bachelor’s degree in computer science, information technology, or a related field.
    • 5+ years of infrastructure experience: Proven experience working on large-scale, cloud-based, enterprise-level software platforms and a deep understanding of multi-cloud architectures, specifically Azure, AWS, and GCP, with hands-on experience in cloud management.
    • 3+ years of practical experience with Infrastructure-as-Code and CI/CD tools such as Terraform, GitHub Actions, and the like.
    • 2+ years of practical experience with containerization and orchestration technologies (Docker, Kubernetes).
    • 2+ years of practical experience in scripting and automation: Advanced proficiency in scripting languages such as Python and Bash to support automation and system integration efforts.

 

Preferred Qualifications:

    • Security & Compliance Knowledge: Strong understanding of security best practices and experience ensuring compliance with relevant regulatory frameworks.
    • Machine Learning and LLM Operations: Exposure to modern tools and techniques in MLOps and LLMOps fields. 
    • Exposure to AI/ML-specific infrastructure tools (e.g., MLflow, Kubeflow) for managing and deploying models at scale.
    • Exposure to a Regulated Industry: Experience working within healthcare or another regulated industry, with a solid understanding of the unique challenges and compliance requirements.
    • Ability to work independently, manage multiple projects simultaneously, and adapt to changing priorities in a fast-paced environment.
 
 
 
