
Monitoring a Kafka Cluster on Kubernetes



Prometheus is an open-source monitoring and alerting toolkit that collects metrics from a wide range of data sources. We use Prometheus as the monitoring tool for our entire Kubernetes cluster as well.

What Is Prometheus Operator

Prometheus Operator is a tool that simplifies deploying and managing Prometheus and its related components in a Kubernetes cluster.

The Operator manages Prometheus through CRDs, so the individual components no longer have to be created and configured by hand. The main CRDs provided by Prometheus Operator are the following (a minimal ServiceMonitor sketch follows this list):

  • Prometheus: defines a Prometheus deployment

  • Alertmanager: defines an Alertmanager deployment

  • ThanosRuler: defines a ThanosRuler deployment, used to evaluate rules across multiple Prometheus data sources for a centralized alerting solution

  • ServiceMonitor: declaratively specifies how a set of Kubernetes Services should be monitored, so the Prometheus configuration file no longer has to be maintained by hand

  • PodMonitor: declaratively specifies how a set of Pods should be monitored, used to scrape Pods directly when they do not belong to a particular Service

  • Probe: declaratively specifies how a set of Ingresses or static targets should be monitored, used for external endpoints or network paths not directly tied to a Kubernetes Service

  • PrometheusRule: defines a set of Prometheus alerting rules

  • AlertmanagerConfig: declaratively specifies a sub-section of the Alertmanager configuration, allowing custom alert routing, inhibition rules, and receiver configuration for fine-grained alert management
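
For example, a minimal ServiceMonitor could look like the sketch below. This is only an illustrative example, not part of the Kafka setup that follows: the name, the monitoring namespace, the app=my-service label, and the metrics port name are all hypothetical.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-service-monitor        # hypothetical name
  namespace: monitoring           # hypothetical namespace
spec:
  selector:
    matchLabels:
      app: my-service             # hypothetical Service label
  namespaceSelector:
    matchNames:
      - default                   # hypothetical namespace of the target Service
  endpoints:
    - port: metrics               # assumes the Service exposes a named port "metrics"
      interval: 30s
      path: /metrics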

Installing Prometheus Operator

  1. Standalone installation: you can follow the documentation on GitHub.

  2. Integrated installation: we use KubeSphere as the management platform for our Kubernetes cluster, and KubeSphere installs Prometheus Operator by default.

How to Monitor Kafka

To monitor Kafka and set up alerting, we need to do the following:

  1. In the Kafka custom resource, we use the JMX Prometheus Exporter to export metrics:

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: business-kafka
  annotations:
    strimzi.io/node-pools: enabled
    strimzi.io/kraft: enabled
spec:
  kafka:
    version: 3.8.0
    metadataVersion: 3.8-IV0
    ....
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics
          key: kafka-metrics-config.yml

The JMX Prometheus Exporter exposes Kafka's JMX metrics in a format Prometheus can scrape. This lets you monitor Kafka performance metrics such as request rates, latency, and error rates through Prometheus. The exporter's rule configuration comes from the following ConfigMap:

kind: ConfigMap
apiVersion: v1
metadata:
  name: kafka-metrics
  namespace: kafka
  labels:
    app: strimzi
data:
  kafka-metrics-config.yml: |
    lowercaseOutputName: true
    rules:
    # Special cases and very specific rules
    - pattern: kafka.server<type=(.+), name=(.+), clientId=(.+), topic=(.+), partition=(.*)><>Value
      name: kafka_server_$1_$2
      type: GAUGE
      labels:
        clientId: "$3"
        topic: "$4"
        partition: "$5"
    - pattern: kafka.server<type=(.+), name=(.+), clientId=(.+), brokerHost=(.+), brokerPort=(.+)><>Value
      name: kafka_server_$1_$2
      type: GAUGE
      labels:
        clientId: "$3"
        broker: "$4:$5"
    - pattern: kafka.server<type=(.+), cipher=(.+), protocol=(.+), listener=(.+), networkProcessor=(.+)><>connections
      name: kafka_server_$1_connections_tls_info
      type: GAUGE
      labels:
        cipher: "$2"
        protocol: "$3"
        listener: "$4"
        networkProcessor: "$5"
    - pattern: kafka.server<type=(.+), clientSoftwareName=(.+), clientSoftwareVersion=(.+), listener=(.+), networkProcessor=(.+)><>connections
      name: kafka_server_$1_connections_software
      type: GAUGE
      labels:
        clientSoftwareName: "$2"
        clientSoftwareVersion: "$3"
        listener: "$4"
        networkProcessor: "$5"
    - pattern: "kafka.server<type=(.+), listener=(.+), networkProcessor=(.+)><>(.+-total):"
      name: kafka_server_$1_$4
      type: COUNTER
      labels:
        listener: "$2"
        networkProcessor: "$3"
    - pattern: "kafka.server<type=(.+), listener=(.+), networkProcessor=(.+)><>(.+):"
      name: kafka_server_$1_$4
      type: GAUGE
      labels:
        listener: "$2"
        networkProcessor: "$3"
    - pattern: kafka.server<type=(.+), listener=(.+), networkProcessor=(.+)><>(.+-total)
      name: kafka_server_$1_$4
      type: COUNTER
      labels:
        listener: "$2"
        networkProcessor: "$3"
    - pattern: kafka.server<type=(.+), listener=(.+), networkProcessor=(.+)><>(.+)
      name: kafka_server_$1_$4
      type: GAUGE
      labels:
        listener: "$2"
        networkProcessor: "$3"
    # Some percent metrics use MeanRate attribute
    # Ex) kafka.server<type=(KafkaRequestHandlerPool), name=(RequestHandlerAvgIdlePercent)><>MeanRate
    - pattern: kafka.(\w+)<type=(.+), name=(.+)Percent\w*><>MeanRate
      name: kafka_$1_$2_$3_percent
      type: GAUGE
    # Generic gauges for percents
    - pattern: kafka.(\w+)<type=(.+), name=(.+)Percent\w*><>Value
      name: kafka_$1_$2_$3_percent
      type: GAUGE
    - pattern: kafka.(\w+)<type=(.+), name=(.+)Percent\w*, (.+)=(.+)><>Value
      name: kafka_$1_$2_$3_percent
      type: GAUGE
      labels:
        "$4": "$5"
    # Generic per-second counters with 0-2 key/value pairs
    - pattern: kafka.(\w+)<type=(.+), name=(.+)PerSec\w*, (.+)=(.+), (.+)=(.+)><>Count
      name: kafka_$1_$2_$3_total
      type: COUNTER
      labels:
        "$4": "$5"
        "$6": "$7"
    - pattern: kafka.(\w+)<type=(.+), name=(.+)PerSec\w*, (.+)=(.+)><>Count
      name: kafka_$1_$2_$3_total
      type: COUNTER
      labels:
        "$4": "$5"
    - pattern: kafka.(\w+)<type=(.+), name=(.+)PerSec\w*><>Count
      name: kafka_$1_$2_$3_total
      type: COUNTER
    # Generic gauges with 0-2 key/value pairs
    - pattern: kafka.(\w+)<type=(.+), name=(.+), (.+)=(.+), (.+)=(.+)><>Value
      name: kafka_$1_$2_$3
      type: GAUGE
      labels:
        "$4": "$5"
        "$6": "$7"
    - pattern: kafka.(\w+)<type=(.+), name=(.+), (.+)=(.+)><>Value
      name: kafka_$1_$2_$3
      type: GAUGE
      labels:
        "$4": "$5"
    - pattern: kafka.(\w+)<type=(.+), name=(.+)><>Value
      name: kafka_$1_$2_$3
      type: GAUGE
    # Emulate Prometheus 'Summary' metrics for the exported 'Histogram's.
    # Note that these are missing the '_sum' metric!
    - pattern: kafka.(\w+)<type=(.+), name=(.+), (.+)=(.+), (.+)=(.+)><>Count
      name: kafka_$1_$2_$3_count
      type: COUNTER
      labels:
        "$4": "$5"
        "$6": "$7"
    - pattern: kafka.(\w+)<type=(.+), name=(.+), (.+)=(.*), (.+)=(.+)><>(\d+)thPercentile
      name: kafka_$1_$2_$3
      type: GAUGE
      labels:
        "$4": "$5"
        "$6": "$7"
        quantile: "0.$8"
    - pattern: kafka.(\w+)<type=(.+), name=(.+), (.+)=(.+)><>Count
      name: kafka_$1_$2_$3_count
      type: COUNTER
      labels:
        "$4": "$5"
    - pattern: kafka.(\w+)<type=(.+), name=(.+), (.+)=(.*)><>(\d+)thPercentile
      name: kafka_$1_$2_$3
      type: GAUGE
      labels:
        "$4": "$5"
        quantile: "0.$6"
    - pattern: kafka.(\w+)<type=(.+), name=(.+)><>Count
      name: kafka_$1_$2_$3_count
      type: COUNTER
    - pattern: kafka.(\w+)<type=(.+), name=(.+)><>(\d+)thPercentile
      name: kafka_$1_$2_$3
      type: GAUGE
      labels:
        quantile: "0.$4"
    # KRaft overall related metrics
    # distinguish between always increasing COUNTER (total and max) and variable GAUGE (all others) metrics
    - pattern: "kafka.server<type=raft-metrics><>(.+-total|.+-max):"
      name: kafka_server_raftmetrics_$1
      type: COUNTER
    - pattern: "kafka.server<type=raft-metrics><>(.+):"
      name: kafka_server_raftmetrics_$1
      type: GAUGE
    # KRaft "low level" channels related metrics
    # distinguish between always increasing COUNTER (total and max) and variable GAUGE (all others) metrics
    - pattern: "kafka.server<type=raft-channel-metrics><>(.+-total|.+-max):"
      name: kafka_server_raftchannelmetrics_$1
      type: COUNTER
    - pattern: "kafka.server<type=raft-channel-metrics><>(.+):"
      name: kafka_server_raftchannelmetrics_$1
      type: GAUGE
    # Broker metrics related to fetching metadata topic records in KRaft mode
    - pattern: "kafka.server<type=broker-metadata-metrics><>(.+):"
      name: kafka_server_brokermetadatametrics_$1
      type: GAUGE
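
To see how these rules work, the YAML comments below walk through how one of the generic per-second rules rewrites a JMX MBean into a Prometheus series. The topic name "orders" is hypothetical and only used for illustration.

# Illustrative only: mapping one JMX MBean through the rules above.
#
# Input MBean attribute (topic name "orders" is hypothetical):
#   kafka.server<type=BrokerTopicMetrics, name=MessagesInPerSec, topic=orders><>Count
#
# Matching rule from the ConfigMap:
#   - pattern: kafka.(\w+)<type=(.+), name=(.+)PerSec\w*, (.+)=(.+)><>Count
#     name: kafka_$1_$2_$3_total
#
# Captures: $1=server, $2=BrokerTopicMetrics, $3=MessagesIn, $4=topic, $5=orders.
# With lowercaseOutputName: true, the exported series becomes:
#   kafka_server_brokertopicmetrics_messagesin_total{topic="orders"}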

  2. In the Kafka custom resource, we also enable the Kafka Exporter, as shown below:

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: business-kafka
  annotations:
    strimzi.io/node-pools: enabled
    strimzi.io/kraft: enabled
spec:
  ....
  kafkaExporter:
    groupRegex: ".*"
    topicRegex: ".*"
    resources:
      requests:
        cpu: 200m
        memory: 64Mi
      limits:
        cpu: 500m
        memory: 128Mi
    logging: info
    enableSaramaLogging: true
    readinessProbe:
      initialDelaySeconds: 15
      timeoutSeconds: 5
    livenessProbe:
      initialDelaySeconds: 15
      timeoutSeconds: 5

Kafka Exporter and the JMX Prometheus Exporter are both tools for exposing Kafka metrics to Prometheus, but they focus on different things (representative sample series follow this list):

  • Kafka Exporter is dedicated to monitoring consumer group lag and consumer state.

  • JMX Prometheus Exporter exposes JVM-level metrics and Kafka-internal state from the Kafka process through the JMX interface.
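
To make the difference concrete, the comments below show representative series each exporter produces with the configuration above. The metric names follow the JMX rules and the Kafka Exporter's standard output; the label and sample values are hypothetical.

# Illustrative samples only; label values and numbers are hypothetical.
#
# From the JMX Prometheus Exporter (broker internals, named by the rules above):
#   kafka_server_replicamanager_underreplicatedpartitions 0
#   kafka_server_brokertopicmetrics_messagesin_total{topic="orders"} 12345
#
# From the Kafka Exporter (topic / consumer-group view):
#   kafka_consumergroup_lag{consumergroup="order-service",topic="orders",partition="0"} 42
#   kafka_topic_partition_under_replicated_partition{topic="orders",partition="0"} 0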

  3. Once both exporters are configured, we can use a PodMonitor to collect the metrics and write them into Prometheus:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: kafka-resources-metrics
  labels:
    app: strimzi
spec:
  selector:
    matchExpressions:
      - key: "strimzi.io/kind"
        operator: In
        values: ["Kafka", "KafkaConnect", "KafkaMirrorMaker", "KafkaMirrorMaker2"]
  namespaceSelector:
    matchNames:
      - kafka
  podMetricsEndpoints:
  - path: /metrics
    port: tcp-prometheus
    relabelings:
    - separator: ;
      regex: __meta_kubernetes_pod_label_(strimzi_io_.+)
      replacement: $1
      action: labelmap
    - sourceLabels: [__meta_kubernetes_namespace]
      separator: ;
      regex: (.*)
      targetLabel: namespace
      replacement: $1
      action: replace
    - sourceLabels: [__meta_kubernetes_pod_name]
      separator: ;
      regex: (.*)
      targetLabel: kubernetes_pod_name
      replacement: $1
      action: replace
    - sourceLabels: [__meta_kubernetes_pod_node_name]
      separator: ;
      regex: (.*)
      targetLabel: node_name
      replacement: $1
      action: replace
    - sourceLabels: [__meta_kubernetes_pod_host_ip]
      separator: ;
      regex: (.*)
      targetLabel: node_ip
      replacement: $1
      action: replace

  • spec.selector.matchExpressions: the labels a Pod must match in order to be monitored

  • spec.namespaceSelector: the namespaces to monitor

  • spec.podMetricsEndpoints.path: Prometheus will scrape metrics from the /metrics path

  • spec.podMetricsEndpoints.port: the port used for scraping metrics

  • spec.podMetricsEndpoints.relabelings: used to modify and normalize metric labels (an illustrative before/after follows this list)
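
As a rough illustration of what those relabelings do, the discovery labels on a scraped Kafka Pod are rewritten as shown in the comments below; the Pod name, node name, and IP are hypothetical.

# Illustrative only: effect of the relabelings above on one scraped Pod (values are hypothetical).
#
#   __meta_kubernetes_namespace="kafka"                              -> namespace="kafka"
#   __meta_kubernetes_pod_name="business-kafka-kafka-0"              -> kubernetes_pod_name="business-kafka-kafka-0"
#   __meta_kubernetes_pod_node_name="worker-1"                       -> node_name="worker-1"
#   __meta_kubernetes_pod_host_ip="10.0.0.11"                        -> node_ip="10.0.0.11"
#   __meta_kubernetes_pod_label_strimzi_io_cluster="business-kafka"  -> strimzi_io_cluster="business-kafka"  (labelmap)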

Metrics Visualization

Next, we visualize the data collected by Prometheus. We use Grafana for this and mainly build three dashboards.

The Kafka Exporter dashboard is shown in the figure below:

The configuration files for these three dashboard templates can be found in the strimzi/strimzi-kafka-operator repository on GitHub, as shown in the figure below:
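
One common way to load those dashboard JSON files into a Grafana managed alongside Prometheus Operator is the Grafana dashboard sidecar convention, where a ConfigMap carrying a grafana_dashboard label is picked up automatically. The sketch below assumes that sidecar is enabled (this depends on how Grafana was deployed, e.g. via kube-prometheus-stack or KubeSphere); the ConfigMap name and namespace are hypothetical.

# Sketch: wrap a Strimzi dashboard JSON in a ConfigMap for the Grafana dashboard sidecar.
apiVersion: v1
kind: ConfigMap
metadata:
  name: strimzi-kafka-exporter-dashboard     # hypothetical name
  namespace: monitoring                      # hypothetical namespace where Grafana runs
  labels:
    grafana_dashboard: "1"                   # label the sidecar watches for (configurable)
data:
  strimzi-kafka-exporter.json: |
    { ... paste the dashboard JSON from the strimzi-kafka-operator repo here ... }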

Adding PrometheusRule Rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: kafka-prometheus-rules
  namespace: kafka
spec:
  groups:
  - name: kafka
    rules:
    - alert: KafkaRunningOutOfSpace
      expr: kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"data(-[0-9]+)?-(.+)-kafka-[0-9]+"} * 100 / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"data(-[0-9]+)?-(.+)-kafka-[0-9]+"} < 15
      for: 10s
      labels:
        severity: warning
      annotations:
        summary: 'Kafka is running out of free disk space'
        description: 'There are only {{ $value }} percent available at {{ $labels.persistentvolumeclaim }} PVC'
    - alert: UnderReplicatedPartitions
      expr: kafka_server_replicamanager_underreplicatedpartitions > 0
      for: 10s
      labels:
        severity: warning
      annotations:
        summary: 'Kafka under replicated partitions'
        description: 'There are {{ $value }} under replicated partitions on {{ $labels.kubernetes_pod_name }}'
    - alert: AbnormalControllerState
      expr: sum(kafka_controller_kafkacontroller_activecontrollercount) by (strimzi_io_name) != 1
      for: 10s
      labels:
        severity: warning
      annotations:
        summary: 'Kafka abnormal controller state'
        description: 'There are {{ $value }} active controllers in the cluster'
    - alert: OfflinePartitions
      expr: sum(kafka_controller_kafkacontroller_offlinepartitionscount) > 0
      for: 10s
      labels:
        severity: warning
      annotations:
        summary: 'Kafka offline partitions'
        description: 'One or more partitions have no leader'
    - alert: UnderMinIsrPartitionCount
      expr: kafka_server_replicamanager_underminisrpartitioncount > 0
      for: 10s
      labels:
        severity: warning
      annotations:
        summary: 'Kafka under min ISR partitions'
        description: 'There are {{ $value }} partitions under the min ISR on {{ $labels.kubernetes_pod_name }}'
    - alert: OfflineLogDirectoryCount
      expr: kafka_log_logmanager_offlinelogdirectorycount > 0
      for: 10s
      labels:
        severity: warning
      annotations:
        summary: 'Kafka offline log directories'
        description: 'There are {{ $value }} offline log directories on {{ $labels.kubernetes_pod_name }}'
    - alert: ScrapeProblem
      expr: up{kubernetes_namespace!~"openshift-.+",kubernetes_pod_name=~".+-kafka-[0-9]+"} == 0
      for: 3m
      labels:
        severity: major
      annotations:
        summary: 'Prometheus unable to scrape metrics from {{ $labels.kubernetes_pod_name }}/{{ $labels.instance }}'
        description: 'Prometheus was unable to scrape metrics from {{ $labels.kubernetes_pod_name }}/{{ $labels.instance }} for more than 3 minutes'
    - alert: KafkaContainerRestartedInTheLast5Minutes
      expr: count(count_over_time(container_last_seen{container="kafka"}[5m])) > 2 * count(container_last_seen{container="kafka",pod=~".+-kafka-[0-9]+"})
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: 'One or more Kafka containers restarted too often'
        description: 'One or more Kafka containers were restarted too often within the last 5 minutes'
  - name: connect
    rules:
    - alert: ConnectFailedConnector
      expr: sum(kafka_connect_connector_status{status="failed"}) > 0
      for: 5m
      labels:
        severity: major
      annotations:
        summary: 'Kafka Connect Connector Failure'
        description: 'One or more connectors have been in failed state for 5 minutes.'
    - alert: ConnectFailedTask
      expr: sum(kafka_connect_worker_connector_failed_task_count) > 0
      for: 5m
      labels:
        severity: major
      annotations:
        summary: 'Kafka Connect Task Failure'
        description: 'One or more tasks have been in failed state for 5 minutes.'
  - name: bridge
    rules:
    - alert: AvgProducerLatency
      expr: strimzi_bridge_kafka_producer_request_latency_avg > 10
      for: 10s
      labels:
        severity: warning
      annotations:
        summary: 'Kafka Bridge producer average request latency'
        description: 'The average producer request latency is {{ $value }} on {{ $labels.clientId }}'
    - alert: AvgConsumerFetchLatency
      expr: strimzi_bridge_kafka_consumer_fetch_latency_avg > 500
      for: 10s
      labels:
        severity: warning
      annotations:
        summary: 'Kafka Bridge consumer average fetch latency'
        description: 'The average consumer fetch latency is {{ $value }} on {{ $labels.clientId }}'
    - alert: AvgConsumerCommitLatency
      expr: strimzi_bridge_kafka_consumer_commit_latency_avg > 200
      for: 10s
      labels:
        severity: warning
      annotations:
        summary: 'Kafka Bridge consumer average commit latency'
        description: 'The average consumer commit latency is {{ $value }} on {{ $labels.clientId }}'
    - alert: Http4xxErrorRate
      expr: strimzi_bridge_http_server_requestCount_total{code=~"^4..$", container=~"^.+-bridge", path !="/favicon.ico"} > 10
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: 'Kafka Bridge returns code 4xx too often'
        description: 'Kafka Bridge returns code 4xx too much ({{ $value }}) for the path {{ $labels.path }}'
    - alert: Http5xxErrorRate
      expr: strimzi_bridge_http_server_requestCount_total{code=~"^5..$", container=~"^.+-bridge"} > 10
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: 'Kafka Bridge returns code 5xx too often'
        description: 'Kafka Bridge returns code 5xx too much ({{ $value }}) for the path {{ $labels.path }}'
  - name: kafkaExporter
    rules:
    - alert: UnderReplicatedPartition
      expr: kafka_topic_partition_under_replicated_partition > 0
      for: 10s
      labels:
        severity: warning
      annotations:
        summary: 'Topic has under-replicated partitions'
        description: 'Topic {{ $labels.topic }} has {{ $value }} under-replicated partition {{ $labels.partition }}'
    - alert: TooLargeConsumerGroupLag
      expr: kafka_consumergroup_lag > 500
      for: 30s
      labels:
        severity: warning
      annotations:
        summary: 'Consumer group lag is too big'
        description: 'Consumer group {{ $labels.consumergroup }} lag is too big ({{ $value }}) on topic {{ $labels.topic }}/partition {{ $labels.partition }}'
  - name: certificates
    interval: 1m0s
    rules:
    - alert: CertificateExpiration
      expr: |
        strimzi_certificate_expiration_timestamp_ms/1000 - time() < 30 * 24 * 60 * 60
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: 'Certificate will expire in less than 30 days'
        description: 'Certificate of type {{ $labels.type }} in cluster {{ $labels.cluster }} in namespace {{ $labels.resource_namespace }} will expire in less than 30 days'

The rule file above can also be found in the strimzi/strimzi-kafka-operator repository on GitHub, as shown in the figure below:

How Alerting Works

Prometheus evaluates the PrometheusRule and fires alerts. Alerts are sent to Alertmanager, and from Alertmanager they are forwarded to KubeSphere's NotificationManager. We will describe NotificationManager and the alerting system separately in a later post.
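
If you need custom routing in addition to the default pipeline, the AlertmanagerConfig CRD mentioned earlier can route these alerts. The snippet below is only a hedged sketch: the resource name, receiver name, and webhook URL are hypothetical and would need to be adapted to your environment.

# Sketch: route warning/major alerts from the kafka namespace to a webhook receiver.
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: kafka-alert-routing            # hypothetical name
  namespace: kafka
spec:
  route:
    receiver: kafka-webhook
    groupBy: ['alertname', 'namespace']
    matchers:
      - name: severity
        matchType: =~
        value: "warning|major"
  receivers:
    - name: kafka-webhook
      webhookConfigs:
        - url: 'https://example.com/alert-hook'   # hypothetical endpoint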
