– Enterprise Consulting experience with a focus on Data, Machine Learning, and GenAI solutions (15+ years preferred)
– Proficiency in designing and delivering solutions that leverage GenAI technologies (e.g., LLMs, Foundation Models)
– Deep familiarity with relevant concepts and models/technologies (e.g., transformer models, prompt engineering, model fine-tuning)
– Experience delivering and scaling complex infrastructural solutions across diverse platforms
– Ability to translate complex processes and business problems into technical solutions
– Strong knowledge of:
  – vLLM
  – OpenShift AI
  – Prometheus
  – Grafana
  – Aqua
  – Automation of deployment and execution of pipelines
– Proficient knowledge of:
  – Python and SQL
  – Apache Spark, Apache Hadoop, Informatica, and similar data processing tools
– Proven experience with building test procedures and ensuring data pipeline quality, reliability, performance, and scalability
– Proven experience developing a comprehensive set of process documents, runbooks, and plans that cover key operational procedures and activities, including:
  – Operational procedures
  – Activities and tasks
  – Escalation processes
  – Communication plans
– Ability to develop applications that expose and use RESTful APIs for data querying and ingestion
– Understanding of the AI tooling ecosystem (e.g., Kubernetes, MLOps, AIOps)
– Strong communication and customer-facing skills
– Ability to work efficiently in collaborative teams using Agile methodologies
– Ability to influence and interact with confidence and credibility at all levels
– General awareness of Dell Technologies products
– Knowledge of industry-wide AI Studios and AI Workbenches
– Experience preparing data for machine learning and Large Language Model ingestion and training
– Experience building and using Information Retrieval frameworks for LLM inferencing
– Working knowledge of Linux, cloud technologies, and Lean and Iterative Deployment Methodologies
– University Degree aligned to Data Engineering and/or Data Science
– Relevant industry certifications (e.g., Databricks Certified Data Engineer, Microsoft Certifications, NVIDIA Certifications)
This AI Operations Engineer with ITIL Process Support Specialty will work within a broader team, ramping up to provide operational support by the target go-live date of March 1, 2025. The services to be performed include:
– Incident management: Level 3 support for issue identification, diagnosis, escalation, resolution, and coordination with key stakeholders and providers (e.g., NVIDIA, OpenShift).
– Vulnerability management: risk assessment, CVE scanning, patching, and remediation.
– Monitoring and alerting integration and operation.
– Model performance tuning and troubleshooting.
– Container image management and deployment.
– Model deployment, including pipeline development and operation of end-to-end deployment processes.
– Code deployment, including pipeline development and operation of end-to-end deployment processes.
– Day 1 Activities