ML MODEL DEPLOYMENT STRATEGIES
A comprehensive guide to deploying machine learning models in production environments. From containerization with Docker to orchestration with Kubernetes, learn battle-tested strategies for reliable ML systems.
The ML Deployment Challenge
Training a machine learning model is only half the battle. The real challenge begins when you need to serve predictions reliably at scale, handle model versioning, monitor performance drift, and maintain 99.9% uptime in production.
After deploying dozens of ML models at STARK Industries—from recommendation systems to predictive maintenance—I've learned that successful ML deployment requires a systematic approach combining DevOps best practices with ML-specific considerations.
What is MLOps?
MLOps (Machine Learning Operations) is a set of practices that combines ML, DevOps, and Data Engineering to deploy and maintain ML systems in production reliably and efficiently.
Deployment Architecture Overview
Our production ML deployment follows a microservices architecture, with each model served as an independent containerized service orchestrated by Kubernetes:
[Architecture diagram: the API gateway routes traffic across stable model services and a 10% canary, each backed by a TensorFlow Serving model container that pulls artifacts from S3/GCS model storage; Prometheus monitors all services, and the CI/CD pipeline publishes to a model registry that feeds the deployments.]
Step 1: Containerizing ML Models
Docker containers provide isolation, reproducibility, and portability for ML models. Here's how we containerize a TensorFlow model with a FastAPI serving layer:
Model Serving with FastAPI
# app.py - FastAPI model serving application
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import tensorflow as tf
import numpy as np
import uvicorn
app = FastAPI(title="Predictive Maintenance Model API")
# Load model at startup
model = None
@app.on_event("startup")
async def load_model():
global model
try:
model = tf.keras.models.load_model('/models/maintenance_model')
print("Model loaded successfully")
except Exception as e:
print(f"Error loading model: {e}")
raise
class PredictionRequest(BaseModel):
temperature: float
vibration: float
pressure: float
rpm: float
class PredictionResponse(BaseModel):
failure_probability: float
risk_level: str
recommended_action: str
@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
if model is None:
raise HTTPException(status_code=503, detail="Model not loaded")
# Prepare input features
features = np.array([[
request.temperature,
request.vibration,
request.pressure,
request.rpm
]])
# Make prediction
prediction = model.predict(features)[0][0]
# Determine risk level
if prediction < 0.3:
risk = "LOW"
action = "Continue normal operation"
elif prediction < 0.7:
risk = "MEDIUM"
action = "Schedule inspection within 48 hours"
else:
risk = "HIGH"
action = "Immediate maintenance required"
return PredictionResponse(
failure_probability=float(prediction),
risk_level=risk,
recommended_action=action
)
@app.get("/health")
async def health():
return {"status": "healthy", "model_loaded": model is not None}
if __name__ == "__main__":
uvicorn.run(app, host="0.0.0.0", port=8000)
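With the server running locally (for example via uvicorn app:app), a quick smoke test confirms the endpoint behaves as expected. The snippet below is a minimal sketch using the requests library; the feature values and host/port are illustrative:
# smoke_test.py - quick check against a locally running instance
import requests

payload = {
    "temperature": 78.5,
    "vibration": 0.42,
    "pressure": 101.3,
    "rpm": 3600
}

# Call the /predict endpoint defined above
response = requests.post("http://localhost:8000/predict", json=payload, timeout=5)
response.raise_for_status()
print(response.json())
# Expected keys: failure_probability, risk_level, recommended_action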
Dockerfile for Production
# Dockerfile
FROM python:3.9-slim
# Set working directory
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
build-essential \
curl \
&& rm -rf /var/lib/apt/lists/*
# Copy requirements and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY app.py .
# Create model directory
RUN mkdir -p /models
# Download model from S3 (in production)
# This would be replaced with actual S3 download in CI/CD
COPY ./models /models
# Expose port
EXPOSE 8000
# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=40s --retries=3 \
CMD curl -f http://localhost:8000/health || exit 1
# Run application
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]
Requirements File
# requirements.txt
fastapi==0.104.1
uvicorn[standard]==0.24.0
tensorflow==2.14.0
numpy==1.24.3
pydantic==2.4.2
prometheus-client==0.18.0
boto3==1.28.85 # For S3 model loading
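The Dockerfile above copies a local model directory for simplicity; in the CI/CD pipeline the artifact would instead be pulled from object storage, which is why boto3 is listed. A minimal sketch of that download step is shown below (the bucket name, key prefix, and environment variables are placeholders, not our actual configuration):
# download_model.py - fetch model artifacts from S3 before serving (illustrative paths)
import os
import boto3

BUCKET = os.environ.get("MODEL_BUCKET", "example-model-bucket")    # placeholder bucket
PREFIX = os.environ.get("MODEL_PREFIX", "maintenance_model/v1.2")  # placeholder prefix
LOCAL_DIR = "/models/maintenance_model"

s3 = boto3.client("s3")
os.makedirs(LOCAL_DIR, exist_ok=True)

# Download every object under the model prefix, preserving relative paths
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith("/"):
            continue  # skip folder markers
        rel_path = os.path.relpath(obj["Key"], PREFIX)
        dest = os.path.join(LOCAL_DIR, rel_path)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        s3.download_file(BUCKET, obj["Key"], dest)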
Step 2: Kubernetes Deployment
Kubernetes orchestrates our containerized models, providing auto-scaling, load balancing, and self-healing capabilities.
Deployment Manifest
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: ml-model-deployment
labels:
app: ml-model
version: v1.2
spec:
replicas: 3
selector:
matchLabels:
app: ml-model
template:
metadata:
labels:
app: ml-model
version: v1.2
spec:
containers:
- name: model-server
image: stark-registry.io/ml-model:v1.2
ports:
- containerPort: 8000
name: http
resources:
requests:
memory: "2Gi"
cpu: "1000m"
limits:
memory: "4Gi"
cpu: "2000m"
env:
- name: MODEL_NAME
value: "maintenance_predictor"
- name: MODEL_VERSION
value: "v1.2"
- name: AWS_REGION
value: "us-east-1"
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 5
periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
name: ml-model-service
spec:
type: LoadBalancer
selector:
app: ml-model
ports:
- protocol: TCP
port: 80
targetPort: 8000
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: ml-model-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: ml-model-deployment
minReplicas: 3
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
[Cluster diagram: ml-model-service load-balances requests across three pods running Model v1.2; the HPA scales the deployment, Prometheus monitors each pod, and all pods mount a shared PVC for model storage.]
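After applying these manifests with kubectl, it is worth verifying programmatically that the rollout is healthy. Below is a small sketch using the official kubernetes Python client; it assumes a local kubeconfig and a namespace of default, so adjust for your cluster:
# check_rollout.py - verify the deployment has all replicas ready (assumes local kubeconfig)
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
apps = client.AppsV1Api()

deployment = apps.read_namespaced_deployment(
    name="ml-model-deployment",
    namespace="default",  # adjust to your namespace
)

desired = deployment.spec.replicas or 0
ready = deployment.status.ready_replicas or 0
print(f"{ready}/{desired} replicas ready")

if ready < desired:
    raise SystemExit("Rollout not yet complete")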
Step 3: CI/CD Pipeline for ML
Automated deployment pipelines ensure consistent, repeatable model deployments with proper testing and validation at each stage.
GitHub Actions Workflow
# .github/workflows/ml-deploy.yml
name: ML Model Deployment
on:
push:
branches: [main]
paths:
- 'models/**'
- 'app.py'
- 'Dockerfile'
env:
REGISTRY: stark-registry.io
IMAGE_NAME: ml-model
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.9'
- name: Install dependencies
run: |
pip install -r requirements.txt
pip install pytest
- name: Run unit tests
run: pytest tests/
- name: Model validation
run: python scripts/validate_model.py
build:
needs: test
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Build Docker image
run: |
docker build -t ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} .
docker tag ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \
${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest
- name: Push to registry
run: |
echo "${{ secrets.REGISTRY_PASSWORD }}" | docker login ${{ env.REGISTRY }} -u ${{ secrets.REGISTRY_USER }} --password-stdin
docker push ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
docker push ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest
deploy-staging:
needs: build
runs-on: ubuntu-latest
steps:
- name: Deploy to staging
run: |
kubectl set image deployment/ml-model-deployment \
model-server=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \
--namespace=staging
- name: Wait for rollout
run: kubectl rollout status deployment/ml-model-deployment -n staging
- name: Run integration tests
run: python scripts/integration_tests.py --env=staging
deploy-production:
needs: deploy-staging
runs-on: ubuntu-latest
environment: production
steps:
- name: Canary deployment (10%)
run: |
kubectl apply -f k8s/canary-deployment.yaml
- name: Monitor canary metrics
run: python scripts/monitor_canary.py --duration=300
- name: Full rollout
run: |
kubectl set image deployment/ml-model-deployment \
model-server=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \
--namespace=production
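The workflow references scripts/monitor_canary.py. The exact checks are project-specific, but the idea is to query Prometheus for the canary's error rate and abort the rollout if it exceeds a threshold. A minimal sketch follows; the Prometheus address, label names, and threshold are illustrative assumptions:
# scripts/monitor_canary.py - sketch of a canary health check against Prometheus
import argparse
import sys
import time

import requests

PROMETHEUS_URL = "http://prometheus.monitoring:9090"  # illustrative address
# PromQL for the canary's error ratio over 5 minutes; label names are assumptions
ERROR_RATE_QUERY = (
    'sum(rate(model_predictions_total{status="error",track="canary"}[5m]))'
    ' / sum(rate(model_predictions_total{track="canary"}[5m]))'
)
THRESHOLD = 0.01  # abort if more than 1% of canary predictions fail


def error_rate() -> float:
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": ERROR_RATE_QUERY}, timeout=10
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--duration", type=int, default=300)
    args = parser.parse_args()

    deadline = time.time() + args.duration
    while time.time() < deadline:
        if error_rate() > THRESHOLD:
            sys.exit("Canary error rate above threshold - aborting rollout")
        time.sleep(30)
    print("Canary healthy - safe to proceed")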
Step 4: Monitoring and Observability
ML models require specialized monitoring beyond traditional application metrics. We track:
- Prediction latency: p50, p95, p99 response times
- Model drift: Distribution changes in input features
- Prediction distribution: Are predictions still within expected ranges?
- Model performance: Accuracy, precision, recall on validation data
- Resource utilization: CPU, memory, GPU usage
Prometheus Metrics
# metrics.py
from prometheus_client import Counter, Histogram, Gauge
import time
# Request metrics
prediction_counter = Counter(
'model_predictions_total',
'Total number of predictions',
['model_version', 'status']
)
prediction_latency = Histogram(
'model_prediction_latency_seconds',
'Prediction latency in seconds',
['model_version']
)
# Model drift metrics
feature_distribution = Gauge(
'model_feature_mean',
'Mean value of input features',
['feature_name', 'model_version']
)
# Usage in FastAPI
@app.post("/predict")
async def predict(request: PredictionRequest):
start_time = time.time()
try:
# Make prediction
result = model.predict(...)
# Record metrics
prediction_counter.labels(
model_version="v1.2",
status="success"
).inc()
latency = time.time() - start_time
prediction_latency.labels(
model_version="v1.2"
).observe(latency)
# Track feature distributions
feature_distribution.labels(
feature_name="temperature",
model_version="v1.2"
).set(request.temperature)
return result
except Exception as e:
prediction_counter.labels(
model_version="v1.2",
status="error"
).inc()
raise
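For Prometheus to scrape these metrics, the application also needs to expose them. One simple approach with prometheus-client is to mount its ASGI app on the FastAPI instance, sketched below; note that with multiple uvicorn workers you would additionally need prometheus-client's multiprocess mode:
# Expose a /metrics endpoint for Prometheus scraping
from prometheus_client import make_asgi_app

metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)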
Model Drift Detection
Set up alerts when feature distributions shift significantly from training data. This often indicates that your model may need retraining. We use the Kolmogorov-Smirnov test with a threshold of p < 0.05.
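A sketch of that check using scipy's two-sample KS test, comparing a reference sample from training data against recent production values (the data loading is omitted; the arrays here are placeholders):
# drift_check.py - flag drift when a feature's live distribution diverges from training
import numpy as np
from scipy.stats import ks_2samp

P_VALUE_THRESHOLD = 0.05  # matches the threshold described above


def feature_drifted(train_values: np.ndarray, live_values: np.ndarray) -> bool:
    """Return True if the two samples are unlikely to share a distribution."""
    statistic, p_value = ks_2samp(train_values, live_values)
    return p_value < P_VALUE_THRESHOLD


# Example with placeholder data
if __name__ == "__main__":
    train_temps = np.random.normal(75, 5, size=5000)  # stand-in for training data
    live_temps = np.random.normal(82, 5, size=1000)   # stand-in for recent requests
    print("Drift detected:", feature_drifted(train_temps, live_temps))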
Model Versioning Strategy
We use semantic versioning for models (v1.2.3) where:
- Major version: Incompatible API changes or complete model architecture changes
- Minor version: Backward-compatible improvements (retrained on new data)
- Patch version: Bug fixes, performance optimizations
Rollback and Disaster Recovery
Always maintain the ability to roll back to the previous model version instantly:
# Quick rollback command
kubectl rollout undo deployment/ml-model-deployment -n production
# Rollback to specific version
kubectl rollout undo deployment/ml-model-deployment --to-revision=2 -n production
Best Practices Summary
Key Takeaways
1. Containerize Everything: Use Docker for reproducible environments
2. Automate Testing: Unit tests, integration tests, and model validation
3. Gradual Rollouts: Use canary deployments to minimize risk
4. Monitor Continuously: Track both system and ML-specific metrics
5. Version Control: Models, code, and configurations
6. Plan for Rollbacks: Always have an escape hatch
Conclusion
Deploying ML models to production is a complex engineering challenge that requires careful planning and robust infrastructure. By combining containerization, orchestration, automated testing, and comprehensive monitoring, you can build ML systems that are reliable, scalable, and maintainable.
The strategies outlined here have enabled us to maintain 99.9% uptime for our ML services at STARK Industries while deploying new model versions multiple times per week. Start small, automate everything, and always plan for failure.