Production-Grade ML Platform
Training → Deployment → Monitoring on K3s
Overview
A production-shaped MLOps stack running on K3s with GPU node affinity. MLflow handles experiments and the model registry, Kubeflow Pipelines orchestrates training DAGs, Triton Inference Server hosts versioned models behind a gRPC/HTTP gateway, and Prometheus + Grafana + Loki cover metrics, dashboards, and structured logs. CI/CD pushes models from registry → staging → production with automated canary rollout.
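To make the training path concrete, here's a minimal sketch of a one-step Kubeflow (KFP v2) pipeline whose component trains a toy model and logs it to MLflow. The tracking URI, experiment, and model names are placeholders, not the platform's actual values:

```python
# Minimal sketch: a one-step KFP v2 pipeline whose component trains a toy
# model and logs params, metrics, and the model itself to MLflow.
# The tracking URI, experiment, and model names are placeholders.
from kfp import compiler, dsl


@dsl.component(base_image="python:3.11",
               packages_to_install=["mlflow", "scikit-learn"])
def train(tracking_uri: str, experiment: str, registered_name: str):
    import mlflow
    import mlflow.sklearn
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import Ridge

    mlflow.set_tracking_uri(tracking_uri)
    mlflow.set_experiment(experiment)
    with mlflow.start_run():
        X, y = load_diabetes(return_X_y=True)
        model = Ridge(alpha=0.5).fit(X, y)
        mlflow.log_param("alpha", 0.5)
        mlflow.log_metric("train_r2", model.score(X, y))
        # Registering the model here is what makes it promotable later.
        mlflow.sklearn.log_model(model, "model",
                                 registered_model_name=registered_name)


@dsl.pipeline(name="train-and-register")
def training_pipeline(tracking_uri: str = "http://mlflow.mlflow:5000",
                      experiment: str = "demo",
                      registered_name: str = "demo-model"):
    train(tracking_uri=tracking_uri, experiment=experiment,
          registered_name=registered_name)


if __name__ == "__main__":
    # Emits the pipeline spec that gets submitted to the KFP backend.
    compiler.Compiler().compile(training_pipeline, "training_pipeline.yaml")
```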
The Problem
Most ML projects ship as a notebook and a Dockerfile. The hard part — reproducible training, versioned models, safe rollout, and observability — is usually skipped or bolted on later. The goal was a single platform that enforces these properties from day one, on hardware a single engineer can actually operate.
The Approach
Everything runs declaratively on K3s with GitOps via Argo CD. Training jobs are Kubeflow pipelines that log to MLflow; promoting a model is a registry tag change that triggers a sync of the Triton model repository. Triton handles dynamic batching, model ensembles, and per-model GPU allocation. SLOs are encoded as Prometheus alerts (p99 latency, GPU memory, error rate); structured logs ship to Loki. Canary rollouts split traffic at the gateway and roll back automatically when an alert fires.
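The promotion step is deliberately boring. A sketch of both sides, assuming MLflow's registry stage mechanism (the model name, version, and URIs here are illustrative):

```python
# Sketch of both sides of a promotion, assuming MLflow registry stages.
# Model name, version, and URIs are illustrative, not the real ones.
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://mlflow.mlflow:5000")

# Promote: move version 7 to Production, archiving the previous holder.
client.transition_model_version_stage(
    name="demo-model",
    version="7",
    stage="Production",
    archive_existing_versions=True,
)

# Sync side (e.g. a CronJob): resolve the current Production version and
# copy its artifacts into Triton's repository layout, /models/<name>/<version>/.
prod = client.get_latest_versions("demo-model", stages=["Production"])[0]
print(prod.version, prod.source)  # .source points at the stored artifacts
```

Keeping the registry as the single source of truth means a rollback is just another stage transition.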
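And a sketch of the canary gate that drives the auto-rollback, polling Prometheus's instant-query API. The metric names and thresholds are hypothetical; the real ones depend on how the gateway is instrumented:

```python
# Sketch of a canary gate: query Prometheus for the canary's p99 latency and
# error rate, fail the rollout if either SLO is breached. The metric names
# (gateway_request_duration_seconds_bucket, gateway_requests_total) and the
# Prometheus URL are hypothetical.
import sys
import requests

PROM = "http://prometheus.monitoring:9090/api/v1/query"

def instant(query: str) -> float:
    """Run a PromQL instant query; treat an empty result as 0.0 (sketch only)."""
    resp = requests.get(PROM, params={"query": query}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

p99 = instant(
    'histogram_quantile(0.99, sum by (le) ('
    'rate(gateway_request_duration_seconds_bucket{track="canary"}[5m])))'
)
error_rate = instant(
    'sum(rate(gateway_requests_total{track="canary",code=~"5.."}[5m])) / '
    'sum(rate(gateway_requests_total{track="canary"}[5m]))'
)

# SLO thresholds; breaching either aborts the canary.
if p99 > 0.250 or error_rate > 0.01:
    print(f"canary unhealthy: p99={p99:.3f}s err={error_rate:.2%}")
    sys.exit(1)  # non-zero exit -> rollout controller rolls back
print("canary healthy")
```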
Results
Reproducible training runs, versioned models with one-command promotion, p99 inference latency tracked per model, and zero-downtime canary deploys on a homelab budget. The platform consistently absorbs node restarts and GPU driver bumps without losing in-flight pipelines.
Like what you see?
I'm always open to collaborating on AI, robotics, edge computing, or embedded systems.