
Friday, July 12, 2024

Kafka Monitoring and Operations

1. Kafka exposes metrics through JMX.

2. These metrics are essential for monitoring Kafka and for confirming that the system behaves correctly under load.

3. Common places to host the Kafka metrics:

    - ELK (Elasticsearch + Kibana)

    - Datadog

    - New Relic

    - Confluent Control Center

    - Prometheus

    - Many others


4. Some of the most important metrics are:


5. Under Replicated Partitions: the number of partitions that have problems with their ISR (in-sync replicas), meaning at least one replica has fallen behind the leader. A nonzero value may indicate high load on the system.


6. Request Handlers: utilization of the broker's threads for I/O, network, etc. This reflects the overall utilization of an Apache Kafka broker.


7. Request timing: how long the broker takes to reply to requests. Lower is better, since it means lower latency.
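As a rough illustration of how these three metrics might feed an alert, here is a minimal Python sketch. The MBean names are the ones listed in the Apache Kafka monitoring documentation. Everything else is an assumption made for the example: the `Thresholds` values, the `check_broker` helper, and the idea that some collector has already read the JMX values into a plain dict.

```python
from __future__ import annotations

from dataclasses import dataclass

# JMX MBean names as listed in the Apache Kafka monitoring documentation.
UNDER_REPLICATED = "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions"
HANDLER_IDLE = "kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent"
PRODUCE_TOTAL_TIME = "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce"


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical alert limits; tune them for your own cluster."""
    min_handler_idle: float = 0.3       # idle fraction: 0.0 = saturated, 1.0 = idle
    max_produce_time_ms: float = 100.0  # limit for the produce request time


def check_broker(sample: dict[str, float],
                 limits: Thresholds = Thresholds()) -> list[str]:
    """Return one warning per metric in `sample` that breaches its limit.

    `sample` maps MBean names to values already collected from the broker.
    Metrics missing from the sample are skipped rather than guessed.
    """
    warnings = []

    urp = sample.get(UNDER_REPLICATED)
    if urp is not None and urp > 0:
        warnings.append(f"{int(urp)} under-replicated partition(s)")

    idle = sample.get(HANDLER_IDLE)
    if idle is not None and idle < limits.min_handler_idle:
        warnings.append(f"request handlers only {idle:.0%} idle")

    produce_ms = sample.get(PRODUCE_TOTAL_TIME)
    if produce_ms is not None and produce_ms > limits.max_produce_time_ms:
        warnings.append(f"produce requests taking {produce_ms:.0f} ms")

    return warnings


if __name__ == "__main__":
    sample = {UNDER_REPLICATED: 2, HANDLER_IDLE: 0.1, PRODUCE_TOTAL_TIME: 20.0}
    for line in check_broker(sample):
        print("WARN:", line)
```

In practice the tools listed earlier (Prometheus, Datadog, and so on) would collect the values and evaluate rules like these for you; the sketch only shows the shape of the checks.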


8. For the full picture, see the documentation:

- https://kafka.apache.org/documentation/#monitoring

- https://docs.confluent.io/current/kafka/monitoring.html

- https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/


9. A Kafka operations team must be able to perform the following tasks:

    - Rolling restart of Brokers

    - Updating Configurations

    - Rebalancing Partitions

    - Increasing replication factor

    - Adding a Broker

    - Replacing a Broker

    - Removing a Broker

    - Upgrading a Kafka Cluster with zero downtime
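To make one of these tasks concrete, here is a minimal Python sketch for increasing the replication factor. It builds a reassignment plan in the JSON format that `kafka-reassign-partitions.sh` accepts (`version` plus a list of `topic` / `partition` / `replicas` entries). The function name `plan_replication_increase`, the input shape, and the simple rotation used to spread new replicas are assumptions for illustration, not a recommended placement strategy.

```python
from __future__ import annotations

import json


def plan_replication_increase(current: dict[tuple[str, int], list[int]],
                              brokers: list[int],
                              target_rf: int) -> dict:
    """Build a reassignment plan that raises each partition to `target_rf`.

    `current` maps (topic, partition) to its present replica list.
    Existing replicas keep their order, so the first (preferred) replica
    stays the same; new replicas are appended from `brokers`.
    """
    if target_rf > len(brokers):
        raise ValueError("target replication factor exceeds broker count")

    partitions = []
    for i, ((topic, partition), replicas) in enumerate(sorted(current.items())):
        new_replicas = list(replicas)
        # Rotate the candidate list per partition so the extra replicas
        # do not all land on the same brokers.
        offset = i % len(brokers)
        candidates = brokers[offset:] + brokers[:offset]
        for broker in candidates:
            if len(new_replicas) >= target_rf:
                break
            if broker not in new_replicas:
                new_replicas.append(broker)
        partitions.append(
            {"topic": topic, "partition": partition, "replicas": new_replicas}
        )

    return {"version": 1, "partitions": partitions}


if __name__ == "__main__":
    # Hypothetical topic with two partitions, each with a single replica.
    current = {("orders", 0): [1], ("orders", 1): [2]}
    plan = plan_replication_increase(current, brokers=[1, 2, 3], target_rf=3)
    print(json.dumps(plan, indent=2))
```

The printed JSON could be saved to a file and passed to `kafka-reassign-partitions.sh` with `--reassignment-json-file` and `--execute`. Review the plan first, since moving replicas generates extra network and disk load on the brokers.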





