Anyone running Kubernetes (k8s) probably knows ingress-nginx, an open-source project used by millions. Its metrics can grow to a cardinality in the millions if Ingress rules are updated or deleted on a regular basis.
The reason: ingress-nginx does not remove the labels of old Ingress rules from the Prometheus metrics it exposes for observability of the pod.
Here is an example of the increased cardinality from a single ingress-nginx pod:
grep -v '^#' metrics | cut -d '{' -f1 | sort | uniq -c | sort -rn | head -n 40
3048 nginx_ingress_controller_request_duration_seconds_bucket
2988 nginx_ingress_controller_response_duration_seconds_bucket
2988 nginx_ingress_controller_connect_duration_seconds_bucket
2820 nginx_ingress_controller_header_duration_seconds_bucket
2794 nginx_ingress_controller_response_size_bucket
2794 nginx_ingress_controller_request_size_bucket
2032 nginx_ingress_controller_bytes_sent_bucket
254 nginx_ingress_controller_response_size_sum
254 nginx_ingress_controller_response_size_count
254 nginx_ingress_controller_requests
254 nginx_ingress_controller_request_size_sum
254 nginx_ingress_controller_request_size_count
254 nginx_ingress_controller_request_duration_seconds_sum
254 nginx_ingress_controller_request_duration_seconds_count
254 nginx_ingress_controller_bytes_sent_sum
254 nginx_ingress_controller_bytes_sent_count
249 nginx_ingress_controller_response_duration_seconds_sum
249 nginx_ingress_controller_response_duration_seconds_count
249 nginx_ingress_controller_connect_duration_seconds_sum
249 nginx_ingress_controller_connect_duration_seconds_count
235 nginx_ingress_controller_header_duration_seconds_sum
235 nginx_ingress_controller_header_duration_seconds_count
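The listing above shows which metrics are large, but not which Ingress rules are behind the series. A minimal sketch of a per-Ingress breakdown follows. It assumes the scrape was saved to a file in the Prometheus text format (the `metrics` file used above) and that the series carry an `ingress` label. The function name `series_per_ingress` is made up for this example.

```shell
# Count exposed series per value of the `ingress` label.
# Assumption: $1 is a saved scrape in Prometheus text format.
series_per_ingress() {
  grep -v '^#' "$1" \
    | sed -n 's/.*[{,]ingress="\([^"]*\)".*/\1/p' \
    | sort | uniq -c | sort -rn \
    | awk '{print $1, $2}'   # normalize the padding added by uniq -c
}

# Usage (hypothetical file name):
#   series_per_ingress metrics | head -n 20
```

Ingress names that appear in this list but no longer exist in the cluster are likely the stale series described above.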
A single scrape target was generating 21 MB of data every 10 seconds, so one can imagine how much data a few pods in a k8s cluster will produce. In an example with 12 pods and a scrape interval of 10 seconds, it comes to tens of GB (about 45 GB) of data per hour.
This forces you to increase the resources of the Prometheus pod or server, and you end up storing useless (junk) data in the TSDB or on S3. That increases the cost of operations and causes problems in Grafana, where you end up monitoring data that is no longer useful.
Yes, this is an issue in ingress-nginx, and it does not look like it will be fixed any time soon. There is an issue about it under discussion.
For those who are already suffering from high cardinality because of this issue, or are about to be affected by it without knowing, the only workaround is to restart the ingress-nginx pod/deployment on k8s periodically.
A restart reduces the cardinality: you no longer get metrics for Ingress rules that were removed or updated, which is the expected and correct behavior for most SREs and DevOps engineers.
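A rough sketch of such a restart, assuming the controller runs as a Deployment named `ingress-nginx-controller` in the `ingress-nginx` namespace. Both names are assumptions, so check your own install, which may use other names or a DaemonSet. The `KUBECTL` override is a convenience invented for this example so the commands can be printed instead of executed.

```shell
# Rolling restart of the ingress-nginx controller, then wait for it to finish.
# Assumed defaults: namespace "ingress-nginx", deployment "ingress-nginx-controller".
restart_ingress_nginx() {
  ns="${1:-ingress-nginx}"
  deploy="${2:-ingress-nginx-controller}"
  # KUBECTL=echo turns this into a dry run that only prints the commands.
  "${KUBECTL:-kubectl}" rollout restart "deployment/${deploy}" -n "${ns}" &&
    "${KUBECTL:-kubectl}" rollout status "deployment/${deploy}" -n "${ns}" --timeout=300s
}

# Usage:
#   restart_ingress_nginx                      # assumed default names
#   restart_ingress_nginx my-namespace my-ctl  # hypothetical custom names
```

To make it periodic, the function could be called from cron or a Kubernetes CronJob. With more than one replica, a rolling restart should keep traffic flowing while the pods are replaced.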
Example after a restart of the ingress-nginx deployment (reduced cardinality):
grep -v '^#' metrics | cut -d '{' -f1 | sort | uniq -c | sort -rn | head -n 40
288 nginx_ingress_controller_response_duration_seconds_bucket
288 nginx_ingress_controller_request_duration_seconds_bucket
288 nginx_ingress_controller_header_duration_seconds_bucket
288 nginx_ingress_controller_connect_duration_seconds_bucket
264 nginx_ingress_controller_response_size_bucket
264 nginx_ingress_controller_request_size_bucket
192 nginx_ingress_controller_bytes_sent_bucket
24 nginx_ingress_controller_response_size_sum
24 nginx_ingress_controller_response_size_count
24 nginx_ingress_controller_response_duration_seconds_sum
24 nginx_ingress_controller_response_duration_seconds_count
24 nginx_ingress_controller_requests
24 nginx_ingress_controller_request_size_sum
24 nginx_ingress_controller_request_size_count
24 nginx_ingress_controller_request_duration_seconds_sum
24 nginx_ingress_controller_request_duration_seconds_count
24 nginx_ingress_controller_header_duration_seconds_sum
24 nginx_ingress_controller_header_duration_seconds_count
24 nginx_ingress_controller_connect_duration_seconds_sum
24 nginx_ingress_controller_connect_duration_seconds_count
24 nginx_ingress_controller_bytes_sent_sum
24 nginx_ingress_controller_bytes_sent_count
21 nginx_ingress_controller_ingress_upstream_latency_seconds
19 nginx_ingress_controller_orphan_ingress
7 nginx_ingress_controller_ingress_upstream_latency_seconds_sum
7 nginx_ingress_controller_ingress_upstream_latency_seconds_count
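To compare the state before and after a restart as a single number, one option is to count all series in each saved scrape. This is only a sketch: the function name `count_series` and the file names in the usage lines are hypothetical, and the files are assumed to be in the Prometheus text format.

```shell
# Count sample lines (series) in a saved scrape,
# skipping "# HELP" / "# TYPE" comments and empty lines.
count_series() {
  grep -c -v -e '^#' -e '^$' "$1"
}

# Usage (hypothetical file names):
#   count_series metrics-before-restart
#   count_series metrics-after-restart
```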