vDAG Metrics
vDAGs Health Check Policy, Metrics
This notebook provides an overview of the Health Check Status and vDAG Metrics features in the AIOSv1 platform. These features are designed to help users monitor and manage their vDAGs effectively.
To learn more about the vDAG Controller, refer to the vDAG Controller Documentation.
[Figure: vDAG controller architecture]

1. Health Status:
This feature provides a real-time overview of the health of a vDAG, including the status of each block in the vDAG and of each block's instances.
The health checker policy operates on the health check data of all the blocks that are part of the vDAG. This data is collected periodically.
The policy itself is also invoked periodically, at the interval specified in the vDAG config.
Sample Policy for Health Checker:
import logging


class AIOSv1PolicyRule:
    def __init__(self, rule_id, settings, parameters):
        self.rule_id = rule_id
        self.settings = settings
        self.parameters = parameters
        # This is NOT a timestamp, it's seconds since last metric
        self.allowed_metrics_age = parameters.get("allowed_metrics_age", 30)
        self.forced_health_status = {}  # {block_id: True/False}
        self.last_healthy = False
        logging.warning(f"[INIT] HealthCheckerPolicy initialized with allowed_metrics_age={self.allowed_metrics_age}")

    def eval(self, parameters, input_data, context):
        logging.warning("[EVAL] eval() called with input_data keys: %s", list(input_data.keys()))
        # if input_data['mode'] != "default":  # fast_check
        #     return {"overall_healthy": self.last_healthy}
        health_data = input_data.get("health_check_data", {})
        if not health_data:
            logging.warning("[EVAL] No health_check_data found in input.")
            return {
                "blocks": {},
                "overall_healthy": False
            }
        result = {"blocks": {}, "overall_healthy": True}
        for block_id, data in health_data.items():
            logging.warning("[EVAL] Processing block_id: %s", block_id)
            if block_id in self.forced_health_status:
                # A forced status set through management() overrides the metrics check
                is_healthy = self.forced_health_status[block_id]
                reason = "forced_override"
                logging.warning("[EVAL] Forced status for %s: %s", block_id, is_healthy)
            else:
                instances = data.get("instances", [])
                healthy_instances = []
                for inst in instances:
                    if inst.get("healthy") is not True:
                        continue
                    last_metrics_age = inst.get("lastMetrics")
                    if last_metrics_age is None:
                        logging.warning("[EVAL] Skipping instance without lastMetrics: %s", inst.get("instanceId"))
                        continue
                    if last_metrics_age <= self.allowed_metrics_age:
                        healthy_instances.append(inst)
                    else:
                        logging.warning(
                            "[EVAL] instance %s too old: lastMetrics=%s > allowed=%s",
                            inst.get("instanceId"), last_metrics_age, self.allowed_metrics_age
                        )
                # A block is healthy if at least one instance is healthy and fresh
                is_healthy = len(healthy_instances) > 0
                reason = f"{len(healthy_instances)} healthy (age ≤ {self.allowed_metrics_age}s)"
            result["blocks"][block_id] = {
                "healthy": is_healthy,
                "reason": reason
            }
            if not is_healthy:
                result["overall_healthy"] = False
        # Record the overall result once, so it does not depend on which block came last
        self.last_healthy = result["overall_healthy"]
        logging.warning("[EVAL] Final health check result: %s", result)
        return result

    def management(self, action: str, data: dict) -> dict:
        logging.warning(f"[MGMT] management() called with action={action}, data={data}")
        try:
            action = action.lower()
            if action == "get_forced_status":
                return {"status": "ok", "value": self.forced_health_status}
            elif action == "force_healthy":
                block_id = data["block_id"]
                self.forced_health_status[block_id] = True
                return {"status": "ok", "message": f"Block {block_id} forced to healthy"}
            elif action == "force_unhealthy":
                block_id = data["block_id"]
                self.forced_health_status[block_id] = False
                return {"status": "ok", "message": f"Block {block_id} forced to unhealthy"}
            elif action == "clear_forced":
                block_id = data["block_id"]
                self.forced_health_status.pop(block_id, None)
                return {"status": "ok", "message": f"Forced status cleared for {block_id}"}
            elif action == "clear_all_forced":
                self.forced_health_status.clear()
                return {"status": "ok", "message": "All forced statuses cleared"}
            elif action == "set_allowed_metrics_age":
                self.allowed_metrics_age = int(data["value"])
                return {"status": "ok", "message": f"allowed_metrics_age set to {self.allowed_metrics_age}"}
            else:
                return {"status": "error", "message": f"Unknown action '{action}'"}
        except Exception as e:
            logging.error(f"[MGMT] Error handling management action: {e}")
            return {"status": "error", "message": str(e)}
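To see how the freshness rule in the sample policy behaves, the per-block decision can be isolated in a small standalone function. This is only a sketch: `block_is_healthy` and the sample data are hypothetical and not part of the platform, while the field names (`instances`, `healthy`, `lastMetrics`, `instanceId`) are the ones the sample policy reads.

```python
def block_is_healthy(block_data, allowed_metrics_age=30):
    """Return True if at least one instance is healthy and reported metrics recently.

    Hypothetical helper that mirrors the per-block rule of the sample policy.
    """
    for inst in block_data.get("instances", []):
        if inst.get("healthy") is not True:
            continue
        age = inst.get("lastMetrics")  # seconds since the last metric, not a timestamp
        if age is not None and age <= allowed_metrics_age:
            return True
    return False


# Made-up input in the shape the policy reads from input_data["health_check_data"].
sample_health_check_data = {
    "block-a": {"instances": [{"instanceId": "a-0", "healthy": True, "lastMetrics": 12}]},
    "block-b": {"instances": [{"instanceId": "b-0", "healthy": True, "lastMetrics": 95}]},
}

statuses = {
    block_id: block_is_healthy(data, allowed_metrics_age=30)
    for block_id, data in sample_health_check_data.items()
}
overall_healthy = all(statuses.values())
print(statuses, overall_healthy)
```

Here `block-b` fails the check because its only instance last reported metrics 95 seconds ago, which exceeds the 30-second allowance, so the vDAG as a whole is reported unhealthy.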
Registration Process:
- zip the code:
zip -r health_checker_2.zip code
- upload the zip file:
bash upload.sh
- register the policy:
bash register_policy.sh
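When the `zip` tool is not available, the packaging step can be approximated from Python. The sketch below assumes the policy lives in a folder named `code`, as in the command above; `package_policy` is a hypothetical helper, and the upload and registration scripts are not reproduced because their contents are not shown here.

```python
import shutil
from pathlib import Path


def package_policy(code_dir="code", archive_name="health_checker_2"):
    """Create <archive_name>.zip containing code_dir, similar to `zip -r`.

    Hypothetical helper; returns the path of the archive that was written.
    """
    code_path = Path(code_dir).resolve()
    if not code_path.is_dir():
        raise FileNotFoundError(f"policy directory not found: {code_path}")
    # root_dir/base_dir keep the top-level folder (e.g. "code/") inside the archive.
    return shutil.make_archive(
        str(archive_name),
        "zip",
        root_dir=str(code_path.parent),
        base_dir=code_path.name,
    )
```

Calling `package_policy()` from the directory that contains `code` writes `health_checker_2.zip` next to it.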
Create vDAG Controller with Health Checker Policy:
%%bash
curl -X POST http://MANAGEMENTMASTER:30600/vdag-controller/gcp-cluster-2 \
  -H "Content-Type: application/json" \
  -d '{
    "action": "create_controller",
    "payload": {
      "vdag_controller_id": "policies-test-c",
      "vdag_uri": "llm-analyzer:0.0.3-stable",
      "config": {
        "policy_execution_mode": "local",
        "replicas": 1,
        "custom_data": {
          "quotaChecker": {
            "quotaCheckerPolicyRule": {
              "policyRuleURI": "quota-checker:2.0-stable",
              "parameters": {
                "default_limit": 1,
                "whitelist": ["session10"]
              }
            }
          },
          "qualityChecker": {
            "qualityCheckerPolicyRule": {
              "policyRuleURI": "quality-checker:2.0-stable",
              "parameters": {
                "db_url": "redis://POLICYSTORESERVER:6379/0"
              }
            },
            "framesInterval": 1
          },
          "healthChecker": {
            "healthCheckerPolicyRule": {
              "policyRuleURI": "health-checker:3.0-stable",
              "parameters": {
                "allowed_metrics_age": 60
              }
            },
            "interval": 60
          }
        }
      },
      "search_tags": []
    }
  }'
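The same request can be assembled programmatically, which makes the health checker settings easier to vary. The sketch below builds a body containing only the `healthChecker` entry; whether the controller accepts a config without the quota and quality checkers is an assumption, as is the JSON response handled by the hypothetical `post_json` helper. The field names and values are taken from the request above.

```python
import json
import urllib.request


def build_create_controller_request(controller_id, vdag_uri, health_policy_uri,
                                    allowed_metrics_age=60, interval=60):
    """Build a create_controller body with only a health checker configured."""
    return {
        "action": "create_controller",
        "payload": {
            "vdag_controller_id": controller_id,
            "vdag_uri": vdag_uri,
            "config": {
                "policy_execution_mode": "local",
                "replicas": 1,
                "custom_data": {
                    "healthChecker": {
                        "healthCheckerPolicyRule": {
                            "policyRuleURI": health_policy_uri,
                            "parameters": {"allowed_metrics_age": allowed_metrics_age},
                        },
                        "interval": interval,
                    }
                },
            },
            "search_tags": [],
        },
    }


def post_json(url, body, timeout=10):
    """POST a JSON body and decode the response (assumed to be JSON)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


request_body = build_create_controller_request(
    "policies-test-c", "llm-analyzer:0.0.3-stable", "health-checker:3.0-stable"
)
print(json.dumps(request_body, indent=2))
```

To send it, pass `request_body` to `post_json` together with the controller creation URL used in the curl command.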
Query the vDAG Controller Details
Once the controller is created, we can verify its status and configuration using a GET request.
%%bash
curl -X GET http://MANAGEMENTMASTER:30103/vdag-controller/policies-test-c | json_pp
Query the health of the vDAG
Query the health of all the blocks (with their instances) in the vDAG using the REST API.
%%bash
curl http://CLUSTER1MASTER:30828/health/check | json_pp
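A health result is easier to act on once it is reduced to the blocks that failed. The sketch below assumes the endpoint returns the policy's `eval()` result unchanged, that is, a `blocks` map plus an `overall_healthy` flag; the actual response may wrap this differently. `summarize_health` and the sample response are hypothetical.

```python
def summarize_health(result):
    """Return (overall_healthy, {block_id: reason}) covering only unhealthy blocks."""
    blocks = result.get("blocks", {})
    unhealthy = {
        block_id: info.get("reason", "unknown")
        for block_id, info in blocks.items()
        if not info.get("healthy", False)
    }
    return bool(result.get("overall_healthy", False)), unhealthy


# Made-up response in the shape produced by the sample policy's eval().
sample_response = {
    "blocks": {
        "block-a": {"healthy": True, "reason": "1 healthy (age ≤ 60s)"},
        "block-b": {"healthy": False, "reason": "0 healthy (age ≤ 60s)"},
    },
    "overall_healthy": False,
}

overall, failing = summarize_health(sample_response)
for block_id, reason in failing.items():
    print(f"{block_id}: {reason}")
```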
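Use Management Commands for the Health Policy
Any action handled by the policy's management function can be invoked through the health management endpoint, either to query the policy's state or to update it. The example below sets allowed_metrics_age to 5 seconds.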
%%bash
curl -X POST http://CLUSTER1MASTER:30828/health/mgmt -H "Content-Type: application/json" \
-d '{"mgmt_action": "set_allowed_metrics_age", "mgmt_data": {"value": 5}}'
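Because the management endpoint takes an action name plus a free-form data object, a small builder can catch malformed requests before they are sent. This sketch lists the actions handled by the sample policy's `management()` method together with the data key each one reads; `build_mgmt_request` is a hypothetical helper, and the `mgmt_action` / `mgmt_data` field names follow the curl example above.

```python
# Actions handled by management() in the sample policy, with the data key each reads.
REQUIRED_KEY = {
    "get_forced_status": None,
    "force_healthy": "block_id",
    "force_unhealthy": "block_id",
    "clear_forced": "block_id",
    "clear_all_forced": None,
    "set_allowed_metrics_age": "value",
}


def build_mgmt_request(action, **data):
    """Validate a management action locally and return the request body."""
    action = action.lower()
    if action not in REQUIRED_KEY:
        raise ValueError(f"unknown action: {action}")
    key = REQUIRED_KEY[action]
    if key is not None and key not in data:
        raise ValueError(f"action '{action}' requires '{key}'")
    return {"mgmt_action": action, "mgmt_data": data}


print(build_mgmt_request("set_allowed_metrics_age", value=5))
print(build_mgmt_request("force_unhealthy", block_id="block-b"))
```

Validating on the client side is a design convenience only; the policy still returns an error status for actions it does not recognize.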
Step 4: Clean-up
The controller can be removed using the following command:
%%bash
curl -X POST http://MANAGEMENTMASTER:30600/vdag-controller/gcp-cluster-2 \
  -H "Content-Type: application/json" \
  -d '{
    "action": "remove_controller",
    "payload": {
      "vdag_controller_id": "policies-test-c"
    }
  }'
If the vDAG entry is no longer needed, it can be removed using the following command:
%%bash
curl -X DELETE http://MANAGEMENTMASTER:30103/vdag/llm-analyzer:0.0.3-stable
2. Metrics of vDAG:
The following metrics are exported by the vDAG controller:
- inference_requests_total: Total number of inference requests processed
- inference_fps: Frames per second (FPS) of inference processing
- inference_latency_seconds: Latency per inference request in seconds
These metrics can be queried from the global vDAG metrics DB:
The global vDAG metrics database stores vDAG metrics from all the vDAG controllers running across the clusters in the network. The controllers report these metrics at fixed intervals. The database also provides query APIs that systems and users can use for monitoring and decision making.
Global vDAG Metrics DB APIs
%%bash
curl -X GET http://MANAGEMENTMASTER:30203/vdag/policies-test-c2 | json_pp
MongoDB-style vDAG Metrics DB APIs
%%bash
curl -X POST http://MANAGEMENTMASTER:30203/vdag/query \
  -H "Content-Type: application/json" \
  -d '{
    "vdagURI": "llm-analyzer:0.0.3-stable"
  }' | json_pp
%%bash
curl -X POST http://MANAGEMENTMASTER:30203/vdag/query \
  -H "Content-Type: application/json" \
  -d '{
    "inference_requests_total": { "$gt": 5 }
  }' | json_pp
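To make the meaning of these filters concrete, the sketch below applies the two query shapes used above (field equality and `$gt`) to a few in-memory records. It is a local illustration only: the `matches` function is hypothetical, the records are made up, and which operators the metrics DB actually supports beyond those shown is not confirmed here.

```python
def matches(document, query):
    """Tiny subset of MongoDB-style matching: equality and {"$gt": n} only."""
    for field, condition in query.items():
        value = document.get(field)
        if isinstance(condition, dict):
            for op, operand in condition.items():
                if op != "$gt":
                    raise ValueError(f"unsupported operator: {op}")
                if value is None or not value > operand:
                    return False
        elif value != condition:
            return False
    return True


# Made-up metric records; the keys reuse names that appear in this notebook.
records = [
    {"vdagURI": "llm-analyzer:0.0.3-stable", "inference_requests_total": 12},
    {"vdagURI": "llm-analyzer:0.0.3-stable", "inference_requests_total": 3},
    {"vdagURI": "other-vdag:1.0", "inference_requests_total": 40},
]

by_uri = [r for r in records if matches(r, {"vdagURI": "llm-analyzer:0.0.3-stable"})]
busy = [r for r in records if matches(r, {"inference_requests_total": {"$gt": 5}})]
print(len(by_uri), len(busy))
```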