vDAG Metrics

🧠 vDAGs Health Check Policy, Metrics

This notebook provides an overview of the Health Check Status and vDAG Metrics features in the AIOSv1 platform. These features are designed to help users monitor and manage their vDAGs effectively.

To learn more about the vDAG Controller, refer to the vDAG Controller Documentation.

🏗️VDAG CONTROLLER ARCHITECTURE

[Figure: vDAG controller architecture]

1. 🩺 Health Status:

This feature provides a real-time overview of vDAG health, including the status of each block in the vDAG and of each block's instances.

The health checker policy operates on health check data that is collected periodically from all the blocks that are part of the vDAG.

The policy is invoked periodically, at the interval specified in the vDAG config (the interval field under healthChecker in the controller configuration).

🐍 Sample Policy for Health Checker:

import logging

class AIOSv1PolicyRule:
    def __init__(self, rule_id, settings, parameters):
        self.rule_id = rule_id
        self.settings = settings
        self.parameters = parameters

        # This is NOT a timestamp, it's seconds since last metric
        self.allowed_metrics_age = parameters.get("allowed_metrics_age", 30)
        self.forced_health_status = {}  # {block_id: True/False}
        self.last_healthy = False

        logging.warning(f"[INIT] HealthCheckerPolicy initialized with allowed_metrics_age={self.allowed_metrics_age}")

    def eval(self, parameters, input_data, context):
        logging.warning("[EVAL] eval() called with input_data keys: %s", list(input_data.keys()))

        # if input_data['mode'] != "default": #fast_check
        #     return {"overall_healthy": self.last_healthy}

        health_data = input_data.get("health_check_data", {})
        if not health_data:
            logging.warning("[EVAL] No health_check_data found in input.")
            return {
                "blocks": {},
                "overall_healthy": False
            }

        result = {"blocks": {}, "overall_healthy": True}

        for block_id, data in health_data.items():
            logging.warning("[EVAL] Processing block_id: %s", block_id)

            if block_id in self.forced_health_status:
                is_healthy = self.forced_health_status[block_id]
                reason = "forced_override"
                logging.warning("[EVAL] Forced status for %s: %s", block_id, is_healthy)
            else:
                instances = data.get("instances", [])
                healthy_instances = []

                for inst in instances:
                    if inst.get("healthy") is not True:
                        continue

                    last_metrics_age = inst.get("lastMetrics")
                    if last_metrics_age is None:
                        logging.warning("[EVAL] Skipping instance without lastMetrics: %s", inst.get("instanceId"))
                        continue

                    if last_metrics_age <= self.allowed_metrics_age:
                        healthy_instances.append(inst)
                    else:
                        logging.warning(
                            "[EVAL] instance %s too old: lastMetrics=%s > allowed=%s",
                            inst.get("instanceId"), last_metrics_age, self.allowed_metrics_age
                        )

                is_healthy = len(healthy_instances) > 0
                reason = f"{len(healthy_instances)} healthy (age ≤ {self.allowed_metrics_age}s)"

            result["blocks"][block_id] = {
                "healthy": is_healthy,
                "reason": reason
            }

            if not is_healthy:
                result["overall_healthy"] = False

        # Remember the overall verdict, not just the status of the last block processed
        self.last_healthy = result["overall_healthy"]

        logging.warning("[EVAL] Final health check result: %s", result)
        return result

    def management(self, action: str, data: dict) -> dict:
        logging.warning(f"[MGMT] management() called with action={action}, data={data}")
        try:
            action = action.lower()

            if action == "get_forced_status":
                return {"status": "ok", "value": self.forced_health_status}

            elif action == "force_healthy":
                block_id = data["block_id"]
                self.forced_health_status[block_id] = True
                return {"status": "ok", "message": f"Block {block_id} forced to healthy"}

            elif action == "force_unhealthy":
                block_id = data["block_id"]
                self.forced_health_status[block_id] = False
                return {"status": "ok", "message": f"Block {block_id} forced to unhealthy"}

            elif action == "clear_forced":
                block_id = data["block_id"]
                self.forced_health_status.pop(block_id, None)
                return {"status": "ok", "message": f"Forced status cleared for {block_id}"}

            elif action == "clear_all_forced":
                self.forced_health_status.clear()
                return {"status": "ok", "message": "All forced statuses cleared"}

            elif action == "set_allowed_metrics_age":
                self.allowed_metrics_age = int(data["value"])
                return {"status": "ok", "message": f"allowed_metrics_age set to {self.allowed_metrics_age}"}

            else:
                return {"status": "error", "message": f"Unknown action '{action}'"}

        except Exception as e:
            logging.error(f"[MGMT] Error handling management action: {e}")
            return {"status": "error", "message": str(e)}
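
The per-block rule in the sample policy above can be tried outside the platform. The following is a minimal sketch, not the platform's implementation: it mirrors the instance filter from eval() (an instance counts only if healthy is True and lastMetrics is at most allowed_metrics_age) and leaves out forced overrides, reasons, and logging. The helper names and the sample block and instance IDs are hypothetical.

```python
from typing import Any, Dict, List


def healthy_instances(instances: List[Dict[str, Any]],
                      allowed_metrics_age: float) -> List[Dict[str, Any]]:
    """Return instances that report healthy and whose metrics are fresh enough."""
    fresh = []
    for inst in instances:
        if inst.get("healthy") is not True:
            continue
        age = inst.get("lastMetrics")  # seconds since the last metric, not a timestamp
        if age is None or age > allowed_metrics_age:
            continue
        fresh.append(inst)
    return fresh


def evaluate(health_check_data: Dict[str, Any],
             allowed_metrics_age: float = 30) -> Dict[str, Any]:
    """A block is healthy if at least one instance passes the filter."""
    blocks = {
        block_id: len(healthy_instances(data.get("instances", []), allowed_metrics_age)) > 0
        for block_id, data in health_check_data.items()
    }
    # As in the sample policy, empty input is reported as unhealthy.
    return {"blocks": blocks, "overall_healthy": bool(blocks) and all(blocks.values())}


# Hypothetical input in the shape the sample policy reads from "health_check_data".
sample = {
    "block-a": {"instances": [
        {"instanceId": "a-0", "healthy": True, "lastMetrics": 12},
        {"instanceId": "a-1", "healthy": False, "lastMetrics": 3},
    ]},
    "block-b": {"instances": [
        {"instanceId": "b-0", "healthy": True, "lastMetrics": 95},
    ]},
}

result = evaluate(sample, allowed_metrics_age=30)
print(result)
```

With a 30-second limit, block-b fails because its only instance last reported metrics 95 seconds ago, so the vDAG as a whole is reported unhealthy. Raising the limit to 120 makes both blocks pass.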

📝 Registration Process:

- Zip the code: zip -r health_checker_2.zip code
- Upload the zip file: bash upload.sh
- Register the policy: bash register_policy.sh
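
Where the zip command-line tool is not available, the packaging step can be approximated in Python. This is a rough sketch under the assumption that the archive only needs the files under the code directory with the code/ prefix preserved. Unlike zip -r, it does not write separate directory entries. The function name zip_directory is hypothetical, and the upload and registration scripts are not reproduced here.

```python
import zipfile
from pathlib import Path
from typing import List


def zip_directory(src_dir: str, archive_path: str) -> List[str]:
    """Zip src_dir recursively, keeping the directory name as the path prefix."""
    src = Path(src_dir)
    names = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(src.rglob("*")):
            if path.is_file():
                # Store "code/..." rather than an absolute path
                arcname = path.relative_to(src.parent).as_posix()
                zf.write(path, arcname)
                names.append(arcname)
    return names
```

Calling zip_directory("code", "health_checker_2.zip") from the directory that contains code would produce an archive comparable to the one from the first step. The archive should be written outside the code directory so that it is not included in itself.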

🚀 Create vDAG Controller with Health Checker Policy:

%%bash
curl -X POST http://MANAGEMENTMASTER:30600/vdag-controller/gcp-cluster-2 \
  -H "Content-Type: application/json" \
  -d '{
    "action": "create_controller",
    "payload": {
      "vdag_controller_id": "policies-test-c", 
      "vdag_uri": "llm-analyzer:0.0.3-stable",
      "config": {
        "policy_execution_mode": "local",
        "replicas": 1,
        "custom_data": {
            "quotaChecker": {
                "quotaCheckerPolicyRule": {
                    "policyRuleURI": "quota-checker:2.0-stable",
                    "parameters": {
                        "default_limit": 1,
                        "whitelist": ["session10"]
                    }
                }
            },
            "qualityChecker": {
              "qualityCheckerPolicyRule": {
                "policyRuleURI": "quality-checker:2.0-stable",
                "parameters": {
                  "db_url": "redis://POLICYSTORESERVER:6379/0"
                }
              },
              "framesInterval": 1
            },
            "healthChecker": {
              "healthCheckerPolicyRule": {
                "policyRuleURI": "health-checker:3.0-stable",
                "parameters": {
                  "allowed_metrics_age": 60
                }
              },
              "interval": 60
            }
        }
      },
      "search_tags": []
    }
  }'
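
The same request can be assembled in Python when the controller needs to be created from a script. The sketch below only builds the request and leaves sending it commented out. It assumes the API accepts a config that carries just the healthChecker entry, which this notebook does not confirm; the quotaChecker and qualityChecker entries from the curl call above are omitted for brevity. The function name is hypothetical.

```python
import json
import urllib.request


def build_create_controller_request(base_url, cluster_id, controller_id, vdag_uri,
                                    health_policy_uri, allowed_metrics_age, interval):
    """Build (but do not send) a create_controller request with a health checker policy."""
    payload = {
        "action": "create_controller",
        "payload": {
            "vdag_controller_id": controller_id,
            "vdag_uri": vdag_uri,
            "config": {
                "policy_execution_mode": "local",
                "replicas": 1,
                "custom_data": {
                    "healthChecker": {
                        "healthCheckerPolicyRule": {
                            "policyRuleURI": health_policy_uri,
                            # Passed to the policy as parameters["allowed_metrics_age"]
                            "parameters": {"allowed_metrics_age": allowed_metrics_age},
                        },
                        # How often the policy is invoked
                        "interval": interval,
                    }
                },
            },
            "search_tags": [],
        },
    }
    return urllib.request.Request(
        url=f"{base_url}/vdag-controller/{cluster_id}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_create_controller_request(
    base_url="http://MANAGEMENTMASTER:30600",
    cluster_id="gcp-cluster-2",
    controller_id="policies-test-c",
    vdag_uri="llm-analyzer:0.0.3-stable",
    health_policy_uri="health-checker:3.0-stable",
    allowed_metrics_age=60,
    interval=60,
)

# Uncomment to send the request to a reachable management master:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```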

🔍 Query the vDAG Controller Details

Once the controller is created, we can verify its status and configuration using a GET request.

%%bash
curl -X GET http://MANAGEMENTMASTER:30103/vdag-controller/policies-test-c | json_pp

🏥 Query the health of the vDAG

Query the health of all the Blocks (with their instances) in the vDAG using the REST API.

%%bash
curl http://CLUSTER1MASTER:30828/health/check | json_pp
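
For a vDAG with many blocks, it can help to reduce the health result to just the failing blocks. The sketch below assumes the endpoint returns the dictionary produced by the sample policy's eval() (a blocks map plus overall_healthy); the actual response may wrap it differently. The function name and the sample data are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def unhealthy_blocks(result: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (block_id, reason) pairs for every block not marked healthy."""
    blocks = result.get("blocks", {})
    return sorted(
        (block_id, info.get("reason", ""))
        for block_id, info in blocks.items()
        if not info.get("healthy")
    )


# Hypothetical result in the shape returned by the sample policy.
sample_result = {
    "blocks": {
        "block-a": {"healthy": True, "reason": "1 healthy (age ≤ 60s)"},
        "block-b": {"healthy": False, "reason": "0 healthy (age ≤ 60s)"},
        "block-c": {"healthy": False, "reason": "forced_override"},
    },
    "overall_healthy": False,
}

for block_id, reason in unhealthy_blocks(sample_result):
    print(f"{block_id}: {reason}")
```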

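🏥 Use Management Commands for the Health Policy

Any management action supported by the health policy (for example, set_allowed_metrics_age, force_healthy, or clear_all_forced in the sample policy above) can be invoked through the management endpoint to query or update the policy's state.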

%%bash
curl -X POST http://CLUSTER1MASTER:30828/health/mgmt  -H "Content-Type: application/json" \
    -d '{"mgmt_action": "set_allowed_metrics_age", "mgmt_data": {"value": 5}}'
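
Request bodies for the other management actions follow the same mgmt_action / mgmt_data shape. The helper below is a small sketch that builds such a body and checks it against the actions and data keys read by the sample policy's management() method. The helper name and the action table are illustrative; a different policy may define different actions.

```python
import json

# Actions handled by the sample policy, mapped to the data keys each one reads.
MGMT_ACTIONS = {
    "get_forced_status": [],
    "force_healthy": ["block_id"],
    "force_unhealthy": ["block_id"],
    "clear_forced": ["block_id"],
    "clear_all_forced": [],
    "set_allowed_metrics_age": ["value"],
}


def build_mgmt_body(action: str, **data) -> str:
    """Return the JSON body for a management call, validating the action first."""
    action = action.lower()
    if action not in MGMT_ACTIONS:
        raise ValueError(f"Unknown action '{action}'")
    missing = [key for key in MGMT_ACTIONS[action] if key not in data]
    if missing:
        raise ValueError(f"Action '{action}' requires: {', '.join(missing)}")
    return json.dumps({"mgmt_action": action, "mgmt_data": data})


# "block-a" is a made-up block ID.
print(build_mgmt_body("set_allowed_metrics_age", value=5))
print(build_mgmt_body("force_unhealthy", block_id="block-a"))
print(build_mgmt_body("get_forced_status"))
```

Each returned string can be passed as the -d argument of the curl command above.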

🧹 Clean-up

The controller can be removed using the following command:

%%bash
curl -X POST http://MANAGEMENTMASTER:30600/vdag-controller/gcp-cluster-2 \
  -H "Content-Type: application/json" \
  -d '{
    "action": "remove_controller",
    "payload": {
      "vdag_controller_id": "policies-test-c"
    }
  }'

If the vDAG entry is no longer needed, it can be removed using the following command:

%%bash
curl -X DELETE http://MANAGEMENTMASTER:30103/vdag/llm-analyzer:0.0.3-stable

2. 📈 Metrics of vDAG:

The following metrics are exported by the vDAG controller:

- inference_requests_total: Total number of inference requests processed
- inference_fps: Frames per second (FPS) of inference processing
- inference_latency_seconds: Latency per inference request, in seconds

These metrics can be queried from the Global vDAG Metrics DB:

The Global vDAG Metrics database stores vDAG metrics from all the vDAG controllers running across the clusters in the network. The vDAG controllers report these metrics at fixed intervals. The database also provides query APIs that systems and users can use for monitoring and decision making.

Global vDAG Metrics DB APIs

%%bash
curl -X GET http://MANAGEMENTMASTER:30203/vdag/policies-test-c2 | json_pp

MongoDB-style vDAG Metrics DB APIs

%%bash
curl -X POST http://MANAGEMENTMASTER:30203/vdag/query \
  -H "Content-Type: application/json" \
  -d '{
    "vdagURI": "llm-analyzer:0.0.3-stable"
  }' | json_pp

%%bash
curl -X POST http://MANAGEMENTMASTER:30203/vdag/query \
  -H "Content-Type: application/json" \
  -d '{
    "inference_requests_total": { "$gt": 5 }
  }' | json_pp
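
To make the filter semantics concrete, the sketch below evaluates the two kinds of filter used above (exact match and a comparison operator such as $gt) against records held in memory. It illustrates how MongoDB-style filters are conventionally read and is not the metrics DB's implementation. Which operators the DB actually supports is not stated in this notebook, and the sample records are made up.

```python
import operator
from typing import Any, Dict

# Comparison operators handled by this sketch (an assumed subset).
_OPERATORS = {
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def matches(document: Dict[str, Any], query: Dict[str, Any]) -> bool:
    """Return True if the document satisfies every field condition in the query."""
    for field, condition in query.items():
        value = document.get(field)
        if isinstance(condition, dict):
            for op, operand in condition.items():
                if op not in _OPERATORS:
                    raise ValueError(f"Unsupported operator: {op}")
                # A missing field never satisfies a comparison
                if value is None or not _OPERATORS[op](value, operand):
                    return False
        elif value != condition:
            return False
    return True


# Hypothetical metric records; values are illustrative only.
records = [
    {"vdagURI": "llm-analyzer:0.0.3-stable", "inference_requests_total": 12},
    {"vdagURI": "llm-analyzer:0.0.3-stable", "inference_requests_total": 3},
    {"vdagURI": "other-vdag:1.0", "inference_requests_total": 40},
]

busy = [r for r in records if matches(r, {"inference_requests_total": {"$gt": 5}})]
by_uri = [r for r in records if matches(r, {"vdagURI": "llm-analyzer:0.0.3-stable"})]
print(len(busy), len(by_uri))
```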