If you have ever opened a Kube-State-Metrics endpoint and seen hundreds of metrics, you already know the problem. It is not a lack of data. It is too much of it, without clear direction.
The kube state metrics list is powerful, but also overwhelming if you do not know what actually matters. Most setups either collect everything (and suffer performance issues), or collect too little (and miss critical signals). This guide fixes that. You will learn:
- what the metrics list actually contains
- how to group and understand it
- which metrics matter in real-world setups
- how to control it using allow lists
What the Kube State Metrics List Actually Is
At its core, Kube-State-Metrics converts Kubernetes object data into Prometheus metrics. That means instead of raw YAML or API responses, you get structured, queryable data like:
- pod status
- deployment replicas
- node conditions
- resource ownership
Each metric follows a consistent pattern:
kube_<resource>_<field>
Examples:
- kube_pod_status_phase
- kube_deployment_status_replicas_available
- kube_node_status_condition
These are not performance metrics (like CPU or memory). They reflect state: what is happening inside your cluster objects.
Why the Metrics List Feels Overwhelming
The confusion usually comes from volume, not complexity. Out of the box, Kube-State-Metrics exposes:
- dozens of resource types
- hundreds of individual metrics
- multiple labels per metric
This leads to:
- noisy dashboards
- slow Prometheus queries
- high cardinality issues
And most importantly, a lack of focus. If everything is tracked, nothing stands out.
Breaking Down the Metrics List (By Category)
The easiest way to understand the kube state metrics list is to group it.
1. Pod Metrics
Pods are the most commonly monitored resource. Key examples:
- kube_pod_status_phase
- kube_pod_container_status_ready
- kube_pod_container_status_restarts_total
What they tell you:
- Is the pod running, pending, or failed?
- Are containers healthy?
- Are restarts increasing?
Why it matters: This is usually your first signal when something breaks.
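As a sketch, these pod metrics translate directly into Prometheus alerting rules. The rule names, thresholds, and durations below are illustrative assumptions, not recommendations:

```yaml
# pod-health-rules.yaml — illustrative Prometheus rule file
groups:
  - name: pod-health
    rules:
      # Fires when a pod sits in Pending or Failed for more than 10 minutes
      - alert: PodNotRunning
        expr: kube_pod_status_phase{phase=~"Pending|Failed"} == 1
        for: 10m
      # Fires when a container restarts more than 3 times within an hour
      - alert: PodRestartingFrequently
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
```

Tune the window and threshold to your workloads; a batch job may restart legitimately, while a web server should not.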
2. Deployment Metrics
Deployments reflect application health at a higher level. Key examples:
- kube_deployment_spec_replicas
- kube_deployment_status_replicas_available
- kube_deployment_status_replicas_unavailable
What they tell you:
- desired vs actual state
- rollout success
- availability issues
Why it matters: It shows whether your application is meeting expectations.
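The desired-vs-actual comparison is the core of deployment monitoring. A hedged example rule (the 15-minute grace period is an assumption to allow rollouts to settle):

```yaml
# deployment-health-rules.yaml — illustrative thresholds
groups:
  - name: deployment-health
    rules:
      # Fires when available replicas lag behind the spec for a sustained period
      - alert: DeploymentReplicasMismatch
        expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available
        for: 15m
```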
3. Node Metrics
Nodes represent your infrastructure layer. Key examples:
- kube_node_status_condition
- kube_node_spec_unschedulable
What they tell you:
- node readiness
- scheduling issues
- infrastructure health signals
Why it matters: If nodes fail, everything above them is affected.
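Both node metrics above can back simple alerts. A sketch, with illustrative durations:

```yaml
# node-health-rules.yaml — illustrative Prometheus rule file
groups:
  - name: node-health
    rules:
      # A node that stops reporting Ready affects every pod scheduled on it
      - alert: NodeNotReady
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 5m
      # A cordoned node silently reduces schedulable capacity
      - alert: NodeUnschedulable
        expr: kube_node_spec_unschedulable == 1
        for: 30m
```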
4. Stateful & Storage Metrics
Important for data-heavy workloads. Key examples:
- kube_statefulset_status_replicas
- kube_persistentvolumeclaim_status_phase
What they tell you:
- storage binding status
- stateful workload stability
Why it matters: Data issues are harder to detect and recover from.
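A sketch of how these map to alerts. The second rule assumes the additional metrics kube_statefulset_replicas (desired count) and kube_statefulset_status_replicas_ready are collected alongside the ones listed above:

```yaml
# storage-health-rules.yaml — illustrative rules
groups:
  - name: storage-health
    rules:
      # A PVC stuck in Pending usually means no matching PV or a storage class issue
      - alert: PVCPending
        expr: kube_persistentvolumeclaim_status_phase{phase="Pending"} == 1
        for: 10m
      # StatefulSet ready replicas lagging the desired count
      - alert: StatefulSetReplicasMismatch
        expr: kube_statefulset_status_replicas_ready != kube_statefulset_replicas
        for: 15m
```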
5. Service & Networking Metrics
Focus on how traffic flows. Key examples:
- kube_service_spec_type
- kube_endpoint_address_available
What they tell you:
- service exposure
- endpoint availability
Why it matters: Apps can be healthy but still unreachable.
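A minimal reachability check, as a sketch. Note that depending on your kube-state-metrics version, kube_endpoint_address_available may be deprecated in favor of kube_endpoint_address; check what your deployment actually exposes:

```yaml
# service-health-rules.yaml — illustrative rule
groups:
  - name: service-health
    rules:
      # An Endpoints object with zero available addresses means the Service
      # exists but has nothing to route traffic to
      - alert: ServiceWithoutEndpoints
        expr: kube_endpoint_address_available == 0
        for: 5m
```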
6. Resource Metadata Metrics
These include labels, annotations, and ownership. Key examples:
- kube_pod_labels
- kube_deployment_labels
What they tell you:
- how resources are grouped
- how queries can be filtered
Why it matters: Labels drive dashboards, alerts, and segmentation.
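Label metrics are rarely graphed directly; their value comes from joins. A hedged example, assuming an app label is exported (in kube-state-metrics v2+, Kubernetes labels only appear on kube_pod_labels if they are explicitly allow-listed):

```promql
# Attach each pod's app label to its status phase,
# so dashboards can group and filter by application
kube_pod_status_phase
  * on (namespace, pod) group_left (label_app)
kube_pod_labels
```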
The Hidden Problem: Too Many Metrics
Here is where most setups fail. They assume: “More metrics = better monitoring”
In reality:
- too many metrics increase storage costs
- queries become slower
- dashboards become unreadable
And worse: important signals get buried. This is exactly where the kube state metrics allow list becomes critical.
How to Control the Metrics List (Allow List Approach)
Instead of collecting everything, you can restrict what gets exposed. This is done using the kube state metrics allow list. It lets you define:
- which resources to include
- which metrics to expose
- which labels to keep
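As a sketch, all three levels of filtering are configured through command-line flags on the kube-state-metrics container. The flags below assume kube-state-metrics v2.x, and the specific metric and label names are examples, not a recommendation:

```yaml
# Fragment of a kube-state-metrics Deployment spec (v2.x flags assumed)
containers:
  - name: kube-state-metrics
    image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.12.0
    args:
      # which resources to include
      - --resources=pods,deployments,nodes
      # which metrics to expose (comma-separated; regex is supported)
      - --metric-allowlist=kube_pod_status_phase,kube_pod_container_status_restarts_total,kube_deployment_.*,kube_node_status_condition
      # which labels to keep, per resource
      - --metric-labels-allowlist=pods=[app,environment]
```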
Example Strategy (Simple and Effective)
Start with only what you actually need. The goal is to build an allow list that is useful and maintainable.
Step 1: Core Metrics Only
Begin with the fundamentals: pods, deployments and nodes. These three give you immediate visibility into:
- whether workloads are running
- whether applications are available
- whether infrastructure is healthy
This baseline alone is enough to detect most common failures and understand cluster state at a high level.
Do not rush beyond this stage. First, make sure your dashboards and queries around these metrics are clear and reliable.
Step 2: Add Based on Use Case
Once your baseline is stable, expand only where it adds real value. Examples:
- Using databases or persistent workloads → add statefulsets
- Handling storage-heavy apps → include persistentvolumeclaims (PVCs)
- Running microservices → add services and endpoints
The key idea: Each new metric should answer a specific question.
If you cannot clearly say why you need a metric, you probably do not need it yet. This prevents unnecessary complexity from creeping into your setup.
Step 3: Reduce Label Noise
Labels are powerful, but also dangerous if overused. Too many labels lead to high cardinality, slower Prometheus queries and increased storage usage. Start with only essential labels:
- namespace → separates environments logically
- app → identifies the workload
- environment → distinguishes dev/staging/prod
Be cautious with: dynamic labels (like pod IDs), auto-generated metadata and excessive custom labels. Clean labeling early saves you from major performance issues later.
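In kube-state-metrics v2+, Kubernetes labels are not exported by default at all; you opt in per resource, which makes this step explicit. A sketch, using the example labels above:

```yaml
# args fragment for the kube-state-metrics container (v2.x assumed);
# only the listed labels will appear on kube_*_labels metrics
args:
  - --metric-labels-allowlist=pods=[app,environment],deployments=[app],nodes=[environment]
```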
Step 4: Iterate Based on Queries
Your monitoring setup should evolve based on actual usage, not assumptions. Ask:
- Which metrics are actively used in dashboards?
- Which metrics are used in alerts?
- Which ones are never touched?
If a metric is never queried: remove it. If dashboards feel cluttered:
- simplify inputs
- reduce dimensions
- refine queries
This step is ongoing. A good setup is not static; it improves over time based on real needs.
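Prometheus itself can tell you which metrics dominate your storage. A quick exploratory query, as a starting point for pruning:

```promql
# Top 20 kube-state-metrics metric names by number of series —
# high counts here are the first candidates for the deny side of your filter
topk(20, count by (__name__) ({__name__=~"kube_.+"}))
```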
How to Use the Metrics List Effectively
Knowing the list is not enough. You need to use it with intent. Otherwise, you end up with dashboards full of data but no real insight.
1. Focus on State Transitions
Static values rarely tell you much on their own. What actually matters is how things change over time. Instead of just checking current status, track patterns like:
- pod restarts increasing
- replicas dropping below desired count
- nodes shifting from Ready → NotReady
These transitions are early warning signals. They help you catch issues before they turn into outages. A stable system is not defined by perfect values; it is defined by predictable behavior over time.
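The three transition patterns above can be sketched as PromQL expressions; windows and thresholds are illustrative:

```promql
# Restarts increasing: more than 2 container restarts in the last 30 minutes
increase(kube_pod_container_status_restarts_total[30m]) > 2

# Replicas dropping below the desired count
kube_deployment_status_replicas_available < kube_deployment_spec_replicas

# Node flapping between Ready and NotReady within the last hour
changes(kube_node_status_condition{condition="Ready",status="true"}[1h]) > 1
```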
2. Build Intent-Driven Dashboards
Most dashboards fail because they are built around metrics instead of questions. Do not group metrics randomly. Group them based on how you actually think and operate. Examples:
- Application view → deployments, pods, availability
- Service view → endpoints, traffic readiness
- Team view → resources owned by a specific team
- Environment view → dev vs staging vs production
Each dashboard should answer a clear question:
- Is my app healthy?
- Is this service reachable?
- Is production stable right now?
If a dashboard does not answer a specific question, it becomes noise.
3. Combine Metrics for Context
Single metrics can be misleading. Real insight comes from connecting signals together. Example:
- pod restarting → could be minor
- deployment replicas dropping → could be temporary
But together: pod restarting + replicas dropping = strong indication of a real issue
Another example: node not ready + pods pending = scheduling or infrastructure problem
Context removes guesswork. It turns raw data into something you can act on confidently.
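Both combinations above can be expressed directly in PromQL using set operators. A sketch; the thresholds and the namespace-level matching are assumptions you should adapt:

```promql
# Restarting pods in namespaces that also have unavailable deployment replicas —
# together a much stronger signal than either condition alone
(increase(kube_pod_container_status_restarts_total[30m]) > 2)
  and on (namespace)
(kube_deployment_status_replicas_unavailable > 0)

# Pending pods while any node is NotReady → likely a scheduling or infra problem
(kube_pod_status_phase{phase="Pending"} == 1)
  and on ()
(kube_node_status_condition{condition="Ready",status="true"} == 0)
```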
4. Avoid “Metric Hoarding”
It is tempting to collect everything “just in case.” This is one of the biggest mistakes. Unused metrics:
- increase storage usage
- slow down queries
- clutter dashboards
- make troubleshooting harder
Be strict: if a metric is not used in dashboards, alerts, or debugging workflows, it should not exist in your setup. A smaller, focused metrics set is always more powerful than a large, noisy one.
Common Mistakes to Avoid
1. Treating the Metrics List as a Checklist
Not every metric is useful for every setup. Trying to include everything leads to:
- noise
- confusion
- wasted resources
Focus on relevance, not completeness.
2. Ignoring Cardinality
High-cardinality metrics slow everything down. Too many labels:
- increase Prometheus load
- make queries inefficient
Start minimal. Expand carefully.
3. Building Dashboards Before Understanding Metrics
This is a common trap. Without understanding:
- dashboards look impressive
- but provide no real insight
Learn the metrics first.
4. Not Using Allow Lists
Default exposure is too broad. Without filtering:
- unnecessary metrics pile up
- performance degrades
Control is not optional. It is essential.
5. Mixing State Metrics with Performance Expectations
Kube-State-Metrics does not show CPU or memory usage. It shows object state. Expecting performance insights from it leads to confusion.
When You Actually Need the Full Metrics List
Most setups do not need the full metrics list. But there are a few cases where broader visibility makes sense.
1. Multi-Team Clusters
If multiple teams use the same cluster, each may need different metrics. Start wider, but organize by team and usage.
2. Debugging Complex Issues
When problems are unclear (e.g., random pod failures or scheduling issues), you may need more data. Expand temporarily, then reduce again.
3. Platform or Tooling Setup
If you are building shared dashboards or monitoring systems, a broader metrics set helps. Still, let teams filter what they actually use.
4. Exploration Phase
When you are just starting, a wider view helps you understand what matters. But do not keep everything long-term.
Do not default to “collect everything.” Use the full metrics list only when needed and refine it afterward.
Conclusion
The kube state metrics list is not meant to be consumed in full. It is meant to be filtered, shaped, and aligned with your actual needs. The goal is not more data. The goal is useful data.
If your dashboards feel noisy or your queries are slowing down, it is time to simplify your setup. Start refining your metrics using a focused allow list and build a monitoring system that actually works for you.
FAQ Section
1. What is the kube state metrics list?
It is the full set of Prometheus metrics exposed by Kube-State-Metrics, representing the state of Kubernetes objects like pods, deployments, and nodes.
2. Should I use all available metrics?
No. Most setups benefit from selecting only relevant metrics. Collecting everything often creates noise and performance issues.
3. What is a kube state metrics allow list?
It is a configuration that lets you limit which metrics and labels are exposed, helping reduce noise and improve performance.
4. Why are my Prometheus queries slow?
High cardinality and too many unused metrics are common causes. Reducing labels and filtering metrics usually improves performance.
5. Does Kube-State-Metrics show CPU or memory usage?
No. It only exposes Kubernetes object state. For resource usage, you need tools like Metrics Server or node exporters.