Kube State Metrics List: Complete Breakdown

If you have ever opened a Kube-State-Metrics endpoint and seen hundreds of metrics, you already know the problem. It is not a lack of data. It is too much of it, without clear direction.

The kube state metrics list is powerful, but also overwhelming if you do not know what actually matters. Most setups either collect everything (and suffer performance issues), or collect too little (and miss critical signals). This guide fixes that. You will learn:

  • what the metrics list actually contains
  • how to group and understand it
  • which metrics matter in real-world setups
  • how to control it using allow lists

What the Kube State Metrics List Actually Is

At its core, Kube-State-Metrics converts Kubernetes object data into Prometheus metrics. That means instead of raw YAML or API responses, you get structured, queryable data like:

  • pod status
  • deployment replicas
  • node conditions
  • resource ownership

Each metric follows a consistent pattern:

kube_<resource>_<field>

Examples:

  • kube_pod_status_phase
  • kube_deployment_status_replicas_available
  • kube_node_status_condition

These are not performance metrics (like CPU or memory). They reflect state: what is happening inside your cluster objects.
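As a concrete illustration, a scrape of the /metrics endpoint returns plain-text lines like these (the pod and deployment names are made up, and exact label sets vary by kube-state-metrics version):

```text
kube_pod_status_phase{namespace="default",pod="web-7d4b9",phase="Running"} 1
kube_pod_status_phase{namespace="default",pod="web-7d4b9",phase="Pending"} 0
kube_deployment_status_replicas_available{namespace="default",deployment="web"} 3
```

Each series is a gauge describing one fact about one object, which is what makes the list so large: every phase, condition, and replica count becomes its own labeled series.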

Why the Metrics List Feels Overwhelming

The confusion usually comes from volume, not complexity. Out of the box, Kube-State-Metrics exposes:

  • dozens of resource types
  • hundreds of individual metrics
  • multiple labels per metric

This leads to:

  • noisy dashboards
  • slow Prometheus queries
  • high cardinality issues

And most importantly, a lack of focus. If everything is tracked, nothing stands out.

Breaking Down the Metrics List (By Category)

The easiest way to understand the kube state metrics list is to group it.

1. Pod Metrics

Pods are the most commonly monitored resource. Key examples:

  • kube_pod_status_phase
  • kube_pod_container_status_ready
  • kube_pod_container_status_restarts_total

What they tell you:

  • Is the pod running, pending, or failed?
  • Are containers healthy?
  • Are restarts increasing?

Why it matters: This is usually your first signal when something breaks.
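As a sketch of how these are queried, the PromQL below surfaces both signals (the 15-minute window and the phase regex are arbitrary choices, not recommendations):

```promql
# Containers that restarted in the last 15 minutes
increase(kube_pod_container_status_restarts_total[15m]) > 0

# Pods stuck in a non-running phase
kube_pod_status_phase{phase=~"Pending|Failed"} == 1
```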

2. Deployment Metrics

Deployments reflect application health at a higher level. Key examples:

  • kube_deployment_spec_replicas
  • kube_deployment_status_replicas_available
  • kube_deployment_status_replicas_unavailable

What they tell you:

  • desired vs actual state
  • rollout success
  • availability issues

Why it matters: It shows whether your application is meeting expectations.
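Because spec and status metrics share the same `namespace` and `deployment` labels, desired-vs-actual drift can be expressed as a simple subtraction, for example:

```promql
# Deployments where available replicas lag the desired count
kube_deployment_spec_replicas - kube_deployment_status_replicas_available > 0
```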

3. Node Metrics

Nodes represent your infrastructure layer. Key examples:

  • kube_node_status_condition
  • kube_node_spec_unschedulable

What they tell you:

  • node readiness
  • scheduling issues
  • infrastructure health signals

Why it matters: If nodes fail, everything above them is affected.
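A rough sketch of how these two metrics are typically used in queries (the `condition` and `status` label values follow the Kubernetes node condition model):

```promql
# Nodes reporting NotReady
kube_node_status_condition{condition="Ready",status="true"} == 0

# Nodes cordoned (marked unschedulable)
kube_node_spec_unschedulable == 1
```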

4. Stateful & Storage Metrics

Important for data-heavy workloads. Key examples:

  • kube_statefulset_status_replicas
  • kube_persistentvolumeclaim_status_phase

What they tell you:

  • storage binding status
  • stateful workload stability

Why it matters: Data issues are harder to detect and recover from.
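For illustration, both signals can be checked with queries like the following (metric names are from kube-state-metrics, but verify them against your version, since stateful set metrics have been renamed between releases):

```promql
# PVCs stuck waiting for a volume to bind
kube_persistentvolumeclaim_status_phase{phase="Pending"} == 1

# StatefulSets with fewer ready replicas than desired
kube_statefulset_replicas - kube_statefulset_status_replicas_ready > 0
```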

5. Service & Networking Metrics

Focus on how traffic flows. Key examples:

  • kube_service_spec_type
  • kube_endpoint_address_available

What they tell you:

  • service exposure
  • endpoint availability

Why it matters: Apps can be healthy but still unreachable.
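As a sketch, reachability gaps show up as endpoints with no available addresses (note that `kube_endpoint_address_available` is deprecated in newer kube-state-metrics releases, so check which endpoint metric your version exposes):

```promql
# Endpoints with zero available addresses behind them
kube_endpoint_address_available == 0
```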

6. Resource Metadata Metrics

These include labels, annotations, and ownership. Key examples:

  • kube_pod_labels
  • kube_deployment_labels

What they tell you:

  • how resources are grouped
  • how queries can be filtered

Why it matters: Labels drive dashboards, alerts, and segmentation.
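These metrics always have the value 1 and exist purely to be joined onto other series. A sketch of the usual pattern (this assumes your pods carry an `app` label, and note that in kube-state-metrics v2+ `kube_pod_labels` only includes labels you explicitly opt into via `--metric-labels-allowlist`):

```promql
# Attach the app label from kube_pod_labels to pod info
kube_pod_info
  * on(namespace, pod) group_left(label_app)
  kube_pod_labels
```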

The Hidden Problem: Too Many Metrics

Here is where most setups fail. They assume: “More metrics = better monitoring”

In reality:

  • too many metrics increase storage costs
  • queries become slower
  • dashboards become unreadable

And worse: important signals get buried. This is exactly where kube state metrics allow list becomes critical.

How to Control the Metrics List (Allow List Approach)

Instead of collecting everything, you can restrict what gets exposed. This is done using the kube state metrics allow list. It lets you define:

  • which resources to include
  • which metrics to expose
  • which labels to keep
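In practice, these controls are command-line flags on the kube-state-metrics binary, typically set as container args in its Deployment. A sketch of what that might look like (the flag names are real; the values are illustrative only):

```yaml
# Illustrative kube-state-metrics container args
args:
  - --resources=pods,deployments,nodes    # which resources to include
  - --metric-allowlist=kube_pod_status_phase,kube_deployment_status_replicas_available,kube_node_status_condition
  - --metric-labels-allowlist=pods=[app]  # which labels to keep
```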

Example Strategy (Simple and Effective)

Start with only what you actually need. The goal is not a complete metrics list, but a useful and maintainable one.

Step 1: Core Metrics Only

Begin with the fundamentals: pods, deployments and nodes. These three give you immediate visibility into:

  • whether workloads are running
  • whether applications are available
  • whether infrastructure is healthy

This baseline alone is enough to detect most common failures and understand cluster state at a high level.

Do not rush beyond this stage. First, make sure your dashboards and queries around these metrics are clear and reliable.

Step 2: Add Based on Use Case

Once your baseline is stable, expand only where it adds real value. Examples:

  • Using databases or persistent workloads → add statefulsets
  • Handling storage-heavy apps → include persistentvolumeclaims (PVCs)
  • Running microservices → add services and endpoints

The key idea: Each new metric should answer a specific question.

If you cannot clearly say why you need a metric, you probably do not need it yet. This prevents unnecessary complexity from creeping into your setup.

Step 3: Reduce Label Noise

Labels are powerful, but also dangerous if overused. Too many labels lead to high cardinality, slower Prometheus queries and increased storage usage. Start with only essential labels:

  • namespace → separates environments logically
  • app → identifies the workload
  • environment → distinguishes dev/staging/prod

Be cautious with: dynamic labels (like pod IDs), auto-generated metadata and excessive custom labels. Clean labeling early saves you from major performance issues later.
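kube-state-metrics keeps object labels out of metrics by default; you opt in per resource with the `--metric-labels-allowlist` flag. A minimal sketch, assuming your workloads carry `app` and `environment` labels:

```yaml
# Expose only the labels you will actually query on
args:
  - --metric-labels-allowlist=pods=[app,environment],deployments=[app]
```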

Step 4: Iterate Based on Queries

Your monitoring setup should evolve based on actual usage, not assumptions. Ask:

  • Which metrics are actively used in dashboards?
  • Which metrics are used in alerts?
  • Which ones are never touched?

If a metric is never queried, remove it. If dashboards feel cluttered:

  • simplify inputs
  • reduce dimensions
  • refine queries

This step is ongoing. A good setup is not static, it improves over time based on real needs.
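To ground this step in data rather than guesswork, you can ask Prometheus itself which kube-state-metrics series dominate your storage. A rough sketch (the `topk` limit is arbitrary):

```promql
# Which kube-state-metrics series contribute the most cardinality
topk(20, count by (__name__)({__name__=~"kube_.+"}))
```

Cross-checking this list against the metrics your dashboards and alerts actually reference is a quick way to find candidates for removal.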

How to Use the Metrics List Effectively

Knowing the list is not enough. You need to use it with intent. Otherwise, you end up with dashboards full of data but no real insight.

1. Focus on State Transitions

Static values rarely tell you much on their own. What actually matters is how things change over time. Instead of just checking current status, track patterns like:

  • pod restarts increasing
  • replicas dropping below desired count
  • nodes shifting from Ready → NotReady

These transitions are early warning signals. They help you catch issues before they turn into outages. A stable system is not defined by perfect values; it is defined by predictable behavior over time.
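Transitions can be captured directly in PromQL with functions like `changes()`. A sketch for the Ready → NotReady case (the one-hour window and threshold are arbitrary):

```promql
# Nodes whose Ready condition flipped more than twice in an hour
changes(kube_node_status_condition{condition="Ready",status="true"}[1h]) > 2
```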

2. Build Intent-Driven Dashboards

Most dashboards fail because they are built around metrics instead of questions. Do not group metrics randomly. Group them based on how you actually think and operate. Examples:

  • Application view → deployments, pods, availability
  • Service view → endpoints, traffic readiness
  • Team view → resources owned by a specific team
  • Environment view → dev vs staging vs production

Each dashboard should answer a clear question:

  • Is my app healthy?
  • Is this service reachable?
  • Is production stable right now?

If a dashboard does not answer a specific question, it becomes noise.

3. Combine Metrics for Context

Single metrics can be misleading. Real insight comes from connecting signals together. Example:

  • pod restarting → could be minor
  • deployment replicas dropping → could be temporary

But together: pod restarting + replicas dropping = a strong indication of a real issue.

Another example: node not ready + pods pending = a scheduling or infrastructure problem.

Context removes guesswork. It turns raw data into something you can act on confidently.
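The second example above can be sketched as a single alert expression using PromQL's `and` operator, which only returns a result when both sides match:

```promql
# Fires only when unready nodes and pending pods are present at once
(count(kube_node_status_condition{condition="Ready",status="true"} == 0) > 0)
and
(count(kube_pod_status_phase{phase="Pending"} == 1) > 0)
```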

4. Avoid “Metric Hoarding”

It is tempting to collect everything “just in case.” This is one of the biggest mistakes. Unused metrics:

  • increase storage usage
  • slow down queries
  • clutter dashboards
  • make troubleshooting harder

Be strict: if a metric is not used in dashboards, alerts, or debugging workflows, it should not exist in your setup. A smaller, focused metrics set is always more powerful than a large, noisy one.

Common Mistakes to Avoid

1. Treating the Metrics List as a Checklist

Not every metric is useful for every setup. Trying to include everything leads to:

  • noise
  • confusion
  • wasted resources

Focus on relevance, not completeness.

2. Ignoring Cardinality

High-cardinality metrics slow everything down. Too many labels:

  • increase Prometheus load
  • make queries inefficient

Start minimal. Expand carefully.

3. Building Dashboards Before Understanding Metrics

This is a common trap. Without understanding:

  • dashboards look impressive
  • but provide no real insight

Learn the metrics first.

4. Not Using Allow Lists

Default exposure is too broad. Without filtering:

  • unnecessary metrics pile up
  • performance degrades

Control is not optional. It is essential.

5. Mixing State Metrics with Performance Expectations

Kube-State-Metrics does not show CPU or memory usage. It shows object state. Expecting performance insights from it leads to confusion.

When You Actually Need the Full Metrics List

Most setups do not need the full metrics list. But there are a few cases where broader visibility makes sense.

1. Multi-Team Clusters

If multiple teams use the same cluster, each may need different metrics. Start wider, but organize by team and usage.

2. Debugging Complex Issues

When problems are unclear (e.g., random pod failures or scheduling issues), you may need more data. Expand temporarily, then reduce again.

3. Platform or Tooling Setup

If you are building shared dashboards or monitoring systems, a broader metrics set helps. Still, let teams filter what they actually use.

4. Exploration Phase

When you are just starting, a wider view helps you understand what matters. But do not keep everything long-term.

Do not default to “collect everything.” Use the full metrics list only when needed and refine it afterward.

Conclusion

The kube state metrics list is not meant to be consumed in full. It is meant to be filtered, shaped, and aligned with your actual needs. The goal is not more data. The goal is useful data.

If your dashboards feel noisy or your queries are slowing down, it is time to simplify your setup. Start refining your metrics using a focused allow list and build a monitoring system that actually works for you.

FAQ Section

1. What is the kube state metrics list?

It is the full set of Prometheus metrics exposed by Kube-State-Metrics, representing the state of Kubernetes objects like pods, deployments, and nodes.

2. Should I use all available metrics?

No. Most setups benefit from selecting only relevant metrics. Collecting everything often creates noise and performance issues.

3. What is a kube state metrics allow list?

It is a configuration that lets you limit which metrics and labels are exposed, helping reduce noise and improve performance.

4. Why are my Prometheus queries slow?

High cardinality and too many unused metrics are common causes. Reducing labels and filtering metrics usually improves performance.

5. Does Kube-State-Metrics show CPU or memory usage?

No. It only exposes Kubernetes object state. For resource usage, you need tools like Metrics Server or node exporters.
