Hybrid (partially remote) · Mid Level

ML Platform Engineer

Kaiko · Amsterdam, Netherlands
Posted 1 hour ago

About us

Kaiko is developing a next-generation autonomous clinical AI assistant that supports healthcare professionals in analyzing patient data, guidelines, and diagnostics. Medical decisions are seldom made by a single individual or based on a single data source. Kaiko's assistant preserves a continuous patient context across multiple encounters, clinicians, and institutions, facilitating collaboration, second opinions, and complex diagnostic processes. The system is engineered to function safely within real clinical settings, with human oversight, auditability, and compliance with regulations at its core.

Job description

As an ML Platform Engineer, you will help design, develop, and scale the infrastructure that supports Kaiko's foundation model training and deployment across the entire ML lifecycle: from compute orchestration that manages large-scale training jobs, through experiment tracking and model registry, to GPU-backed deployment in production. You will collaborate closely with research and product engineering teams to understand platform requirements and take ownership of the engineering solutions.

  • Build and evolve the infrastructure that makes ML development fast, reliable, and observable: from IaC to CI/CD to Kubernetes-based workload orchestration
  • Contribute to the compute orchestration layer: help scale GPU workloads across heterogeneous on-prem and cloud clusters using Kubernetes, Ray, and advanced GPU scheduling technologies such as KAI-Scheduler
  • Support hybrid and multi-cloud strategies that balance performance, compliance, and cost, including the Hammerspace storage rollout
  • Own and evolve parts of the AI Factory: the Dagster-based orchestrator and its Ray integration that training jobs depend on
  • Build and maintain the model lifecycle layer: experiment tracking, model registry, versioning, and GPU-backed serving infrastructure, so models trained on our clusters move reliably into production
  • Partner with both research and product engineering to understand platform requirements and translate them into shared infrastructure that works across use cases
  • Bring engineering rigour to a fast-moving stack: lineage, reproducibility, ownership boundaries, and documentation that lets the team move quickly without losing track

Relevant work experience

  • 2-5 years of experience in a production ML platform engineering or MLOps role
  • Hands-on experience with Kubernetes, Helm, Terraform, Docker, and CI/CD tooling (ArgoCD, GitHub Actions, or comparable)
  • Production experience scheduling GPU workloads on Kubernetes, Ray, or a comparable system, or strong motivation to grow into this quickly
  • Hands-on experience with Linux and NVIDIA GPU environments, including multi-node training stacks and the networking that connects them

Benefits

An attractive and competitive salary, a good pension plan and 25 vacation days per year

Great offsites and team events to strengthen the team and celebrate successes together

A EUR 1000 learning and development budget to help you grow

Autonomy to do your work the way that works best for you, whether you are working around family commitments or simply prefer early mornings

An annual commuting subsidy

Skills required for the job

ML platform engineering, Kubernetes, Terraform, Docker, CI/CD tooling, Python, Collaboration, Observability