Production-Grade ML Platform
Training → Deployment → Monitoring on K3s
Overview
A production-shaped MLOps stack running on K3s with GPU node affinity. MLflow handles experiments and the model registry, Kubeflow Pipelines orchestrates training DAGs, Triton Inference Server hosts versioned models behind a gRPC/HTTP gateway, and Prometheus + Grafana + Loki cover metrics, dashboards, and structured logs. CI/CD pushes models from registry → staging → production with automated canary rollout.
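To make the training path concrete, here's a minimal sketch of a one-step Kubeflow (KFP v2) pipeline whose component trains a toy model and logs it to MLflow. The tracking URI, experiment, and model names are placeholders, not the platform's actual values:

```python
# Minimal sketch: a one-step KFP v2 pipeline whose component trains a toy
# model and logs params, metrics, and the model itself to MLflow.
# The tracking URI, experiment, and model names are placeholders.
from kfp import compiler, dsl


@dsl.component(base_image="python:3.11",
               packages_to_install=["mlflow", "scikit-learn"])
def train(tracking_uri: str, experiment: str, registered_name: str):
    import mlflow
    import mlflow.sklearn
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import Ridge

    mlflow.set_tracking_uri(tracking_uri)
    mlflow.set_experiment(experiment)
    with mlflow.start_run():
        X, y = load_diabetes(return_X_y=True)
        model = Ridge(alpha=0.5).fit(X, y)
        mlflow.log_param("alpha", 0.5)
        mlflow.log_metric("train_r2", model.score(X, y))
        # Registering the model here is what makes it promotable later.
        mlflow.sklearn.log_model(model, "model",
                                 registered_model_name=registered_name)


@dsl.pipeline(name="train-and-register")
def training_pipeline(tracking_uri: str = "http://mlflow.mlflow:5000",
                      experiment: str = "demo",
                      registered_name: str = "demo-model"):
    train(tracking_uri=tracking_uri, experiment=experiment,
          registered_name=registered_name)


if __name__ == "__main__":
    # Emits the pipeline spec that gets submitted to the KFP backend.
    compiler.Compiler().compile(training_pipeline, "training_pipeline.yaml")
```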
The Problem
Most ML projects ship as a notebook and a Dockerfile. The hard part — reproducible training, versioned models, safe rollout, and observability — is usually skipped or bolted on later. The goal was a single platform that enforces these properties from day one, on hardware a single engineer can actually operate.
The Approach
Everything runs declaratively on K3s with GitOps via Argo CD. Training jobs are Kubeflow pipelines that log to MLflow; promoting a model is a registry tag change that triggers a sync of the Triton model repository. Triton handles dynamic batching, model ensembles, and per-model GPU allocation. SLOs are encoded as Prometheus alerts (p99 latency, GPU memory, error rate); structured logs ship to Loki. Canary rollouts split traffic at the gateway and roll back automatically when an alert fires.
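The promotion step is deliberately boring. A sketch of both sides, assuming MLflow's registry stage mechanism (the model name, version, and URIs here are illustrative):

```python
# Sketch of both sides of a promotion, assuming MLflow registry stages.
# Model name, version, and URIs are illustrative, not the real ones.
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://mlflow.mlflow:5000")

# Promote: move version 7 to Production, archiving the previous holder.
client.transition_model_version_stage(
    name="demo-model",
    version="7",
    stage="Production",
    archive_existing_versions=True,
)

# Sync side (e.g. a CronJob): resolve the current Production version and
# copy its artifacts into Triton's repository layout, /models/<name>/<version>/.
prod = client.get_latest_versions("demo-model", stages=["Production"])[0]
print(prod.version, prod.source)  # .source points at the stored artifacts
```

Keeping the registry as the single source of truth means a rollback is just another stage transition.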
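And a sketch of the canary gate that drives the auto-rollback, polling Prometheus's instant-query API. The metric names and thresholds are hypothetical; the real ones depend on how the gateway is instrumented:

```python
# Sketch of a canary gate: query Prometheus for the canary's p99 latency and
# error rate, fail the rollout if either SLO is breached. The metric names
# (gateway_request_duration_seconds_bucket, gateway_requests_total) and the
# Prometheus URL are hypothetical.
import sys
import requests

PROM = "http://prometheus.monitoring:9090/api/v1/query"

def instant(query: str) -> float:
    """Run a PromQL instant query; treat an empty result as 0.0 (sketch only)."""
    resp = requests.get(PROM, params={"query": query}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

p99 = instant(
    'histogram_quantile(0.99, sum by (le) ('
    'rate(gateway_request_duration_seconds_bucket{track="canary"}[5m])))'
)
error_rate = instant(
    'sum(rate(gateway_requests_total{track="canary",code=~"5.."}[5m])) / '
    'sum(rate(gateway_requests_total{track="canary"}[5m]))'
)

# SLO thresholds; breaching either aborts the canary.
if p99 > 0.250 or error_rate > 0.01:
    print(f"canary unhealthy: p99={p99:.3f}s err={error_rate:.2%}")
    sys.exit(1)  # non-zero exit -> rollout controller rolls back
print("canary healthy")
```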
Results
Reproducible training runs, versioned models with one-command promotion, p99 inference latency tracked per model, and zero-downtime canary deploys on a homelab budget. The platform consistently absorbs node restarts and GPU driver bumps without losing in-flight pipelines.
Like what you see?
I'm always open to collaborating on AI, robotics, edge computing, or embedded systems.