Use this page for VMS master, VMS agent collector, dashboard, alerting, and synthetic/readiness issues. The goal is to triage quickly by layer: Infrastructure -> Service -> User -> Business Flow, then route to the right owner.

Fast triage rules

1. Identify impact scope. Check whether the issue affects one host/service, one collector, one dashboard, or the whole VMS master.
2. Check freshness. Compare last metric time, last heartbeat, alert timestamp, and active maintenance windows.
3. Locate the layer. Classify the issue as Infrastructure, Service, User check, or Business Flow to avoid routing it to the wrong owner.
4. Verify inventory. Check system, environment, service, owner, criticality, and scope tags before declaring missing data.
5. Escalate with evidence. Include collector id, host, service, dashboard, alert id, timestamp, and related logs when escalating.
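
The freshness comparison in step 2 can be sketched as a small function. This is only an illustration: the `Freshness` type, the state names, and the 5-minute `max_age` default are assumptions made here, not part of VMS.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Freshness:
    """Hypothetical snapshot of what the master last saw for one collector."""
    last_metric: Optional[datetime]
    last_heartbeat: Optional[datetime]
    in_maintenance: bool


def classify_freshness(f: Freshness, now: datetime,
                       max_age: timedelta = timedelta(minutes=5)) -> str:
    """Return a coarse state; max_age is an assumed threshold, tune it per system."""
    if f.in_maintenance:
        # An active maintenance window explains missing data, so check it first.
        return "maintenance"
    hb_fresh = f.last_heartbeat is not None and now - f.last_heartbeat <= max_age
    metric_fresh = f.last_metric is not None and now - f.last_metric <= max_age
    if not hb_fresh:
        return "collector_silent"   # no recent heartbeat: look at the collector itself
    if not metric_fresh:
        return "metrics_stale"      # heartbeat is fine, metrics are not: look at modules
    return "fresh"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    snapshot = Freshness(last_metric=now - timedelta(minutes=20),
                         last_heartbeat=now - timedelta(minutes=1),
                         in_maintenance=False)
    print(classify_freshness(snapshot, now))
```

Separating "collector silent" from "metrics stale" mirrors the first two rows of the collector table that follows: the first points at the collector service, token, or ingest URL, the second at the metric module or its permissions.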

Collector does not send data

Symptom | Common cause | Fix
Master sees no heartbeat | Collector is stopped, token is wrong, ingest URL is wrong | Check service status, token file, DNS, TLS, and ingest URL
Heartbeat exists but metrics are empty | Host metric module is disabled or lacks read permission | Check [host_metrics], collector user, and permissions
Data appears intermittently | Unstable network, firewall idle timeout, retry too low | Check egress to master, increase retry/backoff and keepalive
Only one service check is missing | Wrong process name, port, or health path | Compare inventory with the real process and test port from collector
Duplicate collector | Two collectors use the same id or hostname tag | Assign a unique id, fix tags, and remove duplicate inventory entry
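
To spot the duplicate-collector case before fixing it by hand, an exported inventory can be scanned for repeated values. The sketch below assumes the inventory is available as a list of dictionaries with `id` and `hostname` keys; that shape is a guess for illustration, not the VMS export format.

```python
from collections import defaultdict


def find_duplicates(collectors, keys=("id", "hostname")):
    """Map (key, value) to the list positions of collectors that share that value."""
    dupes = {}
    for key in keys:
        seen = defaultdict(list)
        for index, collector in enumerate(collectors):
            value = collector.get(key)
            if value:  # ignore entries where the field is missing or empty
                seen[value].append(index)
        for value, indexes in seen.items():
            if len(indexes) > 1:
                dupes[(key, value)] = indexes
    return dupes


if __name__ == "__main__":
    # Hypothetical inventory entries.
    inventory = [
        {"id": "c1", "hostname": "app-01"},
        {"id": "c1", "hostname": "app-02"},
        {"id": "c2", "hostname": "app-02"},
    ]
    for (key, value), indexes in find_duplicates(inventory).items():
        print(f"duplicate {key}={value!r} at entries {indexes}")
```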

Dashboard is stale or missing data

Symptom | Common cause | Fix
Dashboard is not updating | Slow ingest queue, slow time-series store, wrong query range | Check master health, ingest queue, time range, and collector timestamp
One host group is missing | Missing tag or wrong environment/system tag | Standardize tags and refresh inventory
KPI/SLA is wrong | Service lacks criticality or owner metadata | Add metadata and rerun aggregation
Topology edge is missing | Connection check is not defined or dependency was renamed | Update connection inventory and remap source/target
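
Several rows above come down to missing or blank tags. A minimal sketch of a tag check follows, using the tag names listed under "Verify inventory"; the dictionary layout of an inventory entry and the example values are assumptions.

```python
# Tag names taken from the "Verify inventory" triage rule.
REQUIRED_TAGS = ("system", "environment", "service", "owner", "criticality", "scope")


def missing_tags(entry: dict) -> list:
    """Return the required tags that are absent, None, or blank in one entry."""
    missing = []
    for tag in REQUIRED_TAGS:
        value = entry.get(tag)
        if value is None or not str(value).strip():
            missing.append(tag)
    return missing


if __name__ == "__main__":
    # Hypothetical entry: owner is blank, criticality is unset, scope is absent.
    entry = {"system": "vms", "environment": "prod", "service": "ingest",
             "owner": " ", "criticality": None}
    print(missing_tags(entry))
```

Running a check like this over the whole inventory before rerunning aggregation helps distinguish a real data gap from a host that was filtered out by its tags.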

Alert noise

Symptom | Common cause | Fix
Alert flaps repeatedly | Threshold too tight, retry too low, timeout too short | Increase retry, use debounce, tune threshold from baseline
Alert fires during maintenance | Maintenance window missing or tag does not match | Create maintenance window by system/environment/service
Alert goes to wrong owner | Owner tag is wrong or routing rule is too broad | Fix owner tag and split rules by system/service
Too many Low alerts | Alerts are not grouped or suppressed | Group alerts by service and suppress secondary symptoms
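
The "use debounce" fix for flapping alerts can be illustrated with a small state machine: the alert fires only after several consecutive failures and clears only after several consecutive successes. The class name and the default counts below are illustrative assumptions, not VMS settings.

```python
class Debouncer:
    """Fire after `fail_after` consecutive failures, clear after `clear_after` successes."""

    def __init__(self, fail_after: int = 3, clear_after: int = 2):
        self.fail_after = fail_after
        self.clear_after = clear_after
        self.fails = 0
        self.oks = 0
        self.firing = False

    def update(self, ok: bool) -> bool:
        """Feed one check result and return whether the alert is firing now."""
        if ok:
            self.oks += 1
            self.fails = 0
            if self.firing and self.oks >= self.clear_after:
                self.firing = False
        else:
            self.fails += 1
            self.oks = 0
            if not self.firing and self.fails >= self.fail_after:
                self.firing = True
        return self.firing


if __name__ == "__main__":
    d = Debouncer(fail_after=3, clear_after=2)
    results = [False, True, False, False, False, True, True]
    print([d.update(r) for r in results])
    # prints [False, False, False, False, True, True, False]
```

A single failed check followed by a success never fires, which is the flapping pattern described in the first row. The counts should come from the observed baseline rather than from these defaults.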

Service health check fails

Symptom | Common cause | Fix
TCP fails | Service down, port changed, firewall blocked | Check process, listen port, and firewall from collector node
HTTP health fails | Wrong health path, unexpected status code, TLS problem | Check URL, method, expected status, certificate, and proxy
Latency increases | Slow network, dependency timeout, overloaded host | Compare network, CPU, memory, disk IO, and downstream checks
Process check fails | Process name changed after deployment | Update process matcher for the new release
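
When a TCP check fails, it helps to reproduce it by hand from the collector node so the result reflects the same network path. The following is a minimal sketch using only the Python standard library; the function name, the 3-second timeout, and the example host and port are placeholders, not the collector's own implementation.

```python
import socket
import time


def tcp_check(host: str, port: int, timeout: float = 3.0):
    """Try to open a TCP connection; return (ok, seconds_elapsed)."""
    start = time.monotonic()
    try:
        # create_connection resolves the host and connects within the timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False, time.monotonic() - start


if __name__ == "__main__":
    # Placeholder target: replace with the host and port from the inventory.
    ok, elapsed = tcp_check("127.0.0.1", 8080)
    print(f"ok={ok} elapsed={elapsed:.3f}s")
```

A fast failure usually means the connection was refused (service down or port changed), while a failure that takes the full timeout is more consistent with a firewall silently dropping packets.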

Pre-market readiness fails

Step | Check
1 | Identify failing step: host, service, dependency, synthetic, or report delivery
2 | Check whether maintenance or deployment happened before 08:30
3 | Compare Overall dashboard with the related system dashboard
4 | Rerun the check manually from collector or probe node when needed
5 | Send a report with pass/fail state, owner, and next action
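
The report in step 5 can be assembled from per-step results. The sketch below is one possible shape; the `StepResult` fields, the output format, and the example owners are assumptions chosen to carry the pass/fail state, owner, and next action named in the table.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    step: str              # e.g. host, service, dependency, synthetic, report delivery
    passed: bool
    owner: str
    next_action: str = ""


def summarize(results) -> str:
    """Build a plain-text readiness report; overall state is FAIL if any step failed."""
    any_failed = any(not r.passed for r in results)
    lines = [f"Readiness: {'FAIL' if any_failed else 'PASS'}"]
    for r in results:
        line = f"- {r.step}: {'pass' if r.passed else 'fail'} (owner: {r.owner})"
        if not r.passed and r.next_action:
            line += f"; next action: {r.next_action}"
        lines.append(line)
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical results for illustration.
    print(summarize([
        StepResult("host", True, "infra-team"),
        StepResult("dependency", False, "platform-team", "rerun check from probe node"),
    ]))
```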

Information to include for support

Information | Example
Collector | Collector id, hostname, version, environment
Target | Affected system/service/dependency
Timestamp | Start time and last metric time
Dashboard | Dashboard name, panel, query range
Alert | Alert id, severity, owner, routing channel
Log | Collector log, master ingest log, or synthetic run log