Role: SRE Contractor AI/ML Infrastructure and Ops Engineer (AI/ML Training)
Location: (Remote)
Job Description:
SRE Contractor – AI/ML Infrastructure and Ops Engineer (AI/ML Training)
As the Infrastructure and Ops Engineer, you will work on operations for UAIS (United AI Studio, the enterprise AI/ML platform), in particular the AI/ML training initiative that supports thousands of learners on the platform. This individual contributor (IC) role requires experience working on large-scale AI/ML platforms and ensuring their stability, reliability, scalability, and performance. Experience with modern infrastructure and DevOps tools and paradigms, as well as hands-on knowledge of major cloud services such as Azure, AWS, and GCP, is a must.
Primary Responsibilities:
- Continuous support: Provide continuous SRE support to thousands of geographically distributed learners on the UAIS platform, including responding to tickets, triaging support requests, and liaising with customers.
- Automation & DevOps: Improve existing Infrastructure as Code (IaC) according to best DevOps practices.
- Systems Monitoring: Develop and maintain monitoring frameworks for the UAIS infrastructure that supports the AI/ML training program.
- Security & Compliance: Collaborate with cybersecurity teams to ensure all systems and operations comply with industry standards and are secure against evolving threats.
- Capacity Planning & Cost Optimization: Forecast and manage capacity requirements for the AI/ML training environment, while identifying opportunities to reduce costs without compromising performance.
Required Qualifications:
- Bachelor’s degree in computer science, information technology, or a related field.
- 5+ years of infrastructure experience: Proven experience working on large-scale, cloud-based, enterprise-level software platforms and a deep understanding of multi-cloud architectures, specifically Azure, AWS, and GCP, with hands-on experience in cloud management.
- 3+ years of practical experience with Infrastructure-as-Code and CI/CD tools such as Terraform, GitHub Actions, and the like.
- 2+ years of practical experience with containerization and orchestration technologies (Docker, Kubernetes).
- 2+ years of practical experience in scripting and automation: Advanced proficiency in scripting languages such as Python and Bash to support automation and system integration efforts.
Preferred Qualifications:
- Security & Compliance Knowledge: Strong understanding of security best practices and experience ensuring compliance with relevant regulatory frameworks.
- Machine Learning and LLM Operations: Exposure to modern MLOps and LLMOps tools and techniques.
- Exposure to AI/ML-specific infrastructure tools (e.g., MLflow, Kubeflow) for managing and deploying models at scale.
- Exposure to a Regulated Industry: Experience working in healthcare or another regulated industry, with a solid understanding of its unique challenges and compliance requirements.
- Ability to work independently, manage multiple projects simultaneously, and adapt to changing priorities in a fast-paced environment.