Friday, July 5, 2024

Message Compression

1. Producers usually send text-based data, for example JSON.

2. Text compresses well, so it is important to apply compression on the producer.


3. Compression is enabled at the Producer level and doesn't require any configuration change in the Brokers or in the Consumers

4. "compression.type" can be 'none' (default), 'gzip', 'lz4', 'snappy', or 'zstd' (Kafka >= 2.1)


5. Compression is more effective the bigger the batch of messages being sent to Kafka!

6. Benchmarks here: https://blog.cloudflare.com/squeezing-the-firehose/


7. The compressed batch has the following advantage:

    - Much smaller producer request size (compression ratio up to 4x!)

    - Faster to transfer data over the network => less latency

    - Better throughput

    - Better disk utilisation in Kafka (stored messages on disk are smaller)


8. Disadvantages (very minor):

    - Producers must commit some CPU cycles to compression

    - Consumers must commit some CPU cycles to decompression


9. Overall:

    - Consider testing snappy or lz4 for optimal speed / compression ratio
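
To illustrate point 5 (bigger batches compress better), here is a minimal sketch that gzips batches of small JSON messages with the JDK's java.util.zip. It does not use Kafka at all; the class name, the message shape, and the batch sizes are made up for illustration, and real ratios depend on your data and codec.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

final class BatchCompressionDemo {

    // Gzip the payload in memory and return the compressed size in bytes.
    static int gzipSize(byte[] payload) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.size();
    }

    // Build a hypothetical batch of n small JSON messages, one per line.
    static byte[] jsonBatch(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append("{\"userId\":").append(i)
              .append(",\"event\":\"page_view\",\"page\":\"/home\"}\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Original size divided by compressed size (higher means better compression).
    static double compressionRatio(int messages) {
        byte[] batch = jsonBatch(messages);
        return (double) batch.length / gzipSize(batch);
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 10, 100, 1000}) {
            System.out.printf("%5d messages -> ratio %.2f%n", n, compressionRatio(n));
        }
    }
}
```

On a real producer the equivalent switch would be a single setting, for example producerProps.put("compression.type", "snappy");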



Safe producer Summary & Demo

Kafka < 0.11

- acks = all (producer level)

    - Ensures data is properly replicated before an ack is received

- min.insync.replicas=2 (broker/topic level)

    - Ensures at least two brokers in the ISR (in-sync replicas) have the data after an ack

- retries=MAX_INT (producer level)

    - Ensures transient errors are retried indefinitely

- max.in.flight.requests.per.connection=1 (producer level)

    - Ensures only one request is tried at any time, preventing message re-ordering in case of retries


Kafka >= 0.11

- enable.idempotence=true (producer level) + min.insync.replicas=2 (broker/topic level)

    - Implies acks=all, retries=MAX_INT, max.in.flight.requests.per.connection=5(default)

    - while keeping ordering guarantees and improving performance!

- Running a "safe producer" might impact throughput and latency, always test for your use case
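
The two variants above could be written down as plain java.util.Properties, roughly as follows. This is a sketch, not a complete producer setup: the class and method names are hypothetical, bootstrap.servers and the serializers are left out, and min.insync.replicas=2 is not included because it is set on the broker or topic side.

```java
import java.util.Properties;

final class SafeProducerConfig {

    // Kafka < 0.11: no idempotence, so ordering is protected by allowing one in-flight request.
    static Properties legacy() {
        Properties props = new Properties();
        props.setProperty("acks", "all");
        props.setProperty("retries", Integer.toString(Integer.MAX_VALUE));
        props.setProperty("max.in.flight.requests.per.connection", "1");
        return props;
    }

    // Kafka >= 0.11: enable.idempotence implies the other settings;
    // they are spelled out here only to make the intent visible.
    static Properties idempotent() {
        Properties props = new Properties();
        props.setProperty("enable.idempotence", "true");
        props.setProperty("acks", "all");
        props.setProperty("retries", Integer.toString(Integer.MAX_VALUE));
        props.setProperty("max.in.flight.requests.per.connection", "5");
        return props;
    }
}
```

Either object would still need the connection and serializer settings added before being handed to a KafkaProducer.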






Idempotent Producer

- Here's the problem: the Producer can introduce duplicate messages in Kafka due to network errors (for example, the write succeeds but the ack is lost, so the producer retries)

- In Kafka >= 0.11, you can define an "idempotent producer" which won't introduce duplicates on network error


- Idempotent producers are great to guarantee a stable and safe pipeline!
- They come with:
    - retries = Integer.MAX_VALUE (2^31-1 = 2147483647)
    - max.in.flight.requests.per.connection = 1 (Kafka >= 0.11 & < 1.1) or
    - max.in.flight.requests.per.connection = 5 (Kafka >= 1.1 - higher performance)
    - acks = all

- Just set:
    - producerProps.put("enable.idempotence", true);
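
As a rough mental model of how duplicates are avoided, the broker can remember the last sequence number it accepted from each producer and drop a retry that carries a number it has already seen. The toy class below sketches only that idea; it is not Kafka's implementation (the real broker tracks this per partition and also rejects out-of-order sequences), and every name in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class IdempotentBrokerSketch {
    private final List<String> log = new ArrayList<>();
    // Last sequence number appended for each producer id.
    private final Map<Long, Integer> lastSequence = new HashMap<>();

    // Append the message unless this producer already wrote this sequence number.
    // Returns true if appended, false if it was recognised as a duplicate retry.
    boolean append(long producerId, int sequence, String message) {
        Integer last = lastSequence.get(producerId);
        if (last != null && sequence <= last) {
            return false; // the earlier attempt succeeded; only its ack was lost
        }
        log.add(message);
        lastSequence.put(producerId, sequence);
        return true;
    }

    // Snapshot of what has been written so far.
    List<String> log() {
        return new ArrayList<>(log);
    }
}
```

A retry of sequence 0 after a lost ack returns false and leaves the log with a single copy of the message.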












Producer retries

1. In case of transient failures, developers are expected to handle exceptions; otherwise the data will be lost.

2. Example of transient failure:

    - NotEnoughReplicasException

3. There is a "retries" setting

    - defaults to 0 for Kafka <= 2.0 (and to 2147483647 for Kafka >= 2.1)

    - You can increase it to a high number, e.g. Integer.MAX_VALUE


- In case of retries, by default there is a chance that messages will be sent out of order (if a batch has failed to be sent).

- If you rely on key-based ordering, that can be an issue.

- For this, you can set the setting which controls how many produce requests can be made in parallel: max.in.flight.requests.per.connection

    - Default: 5

    - Set it to 1 if you need to ensure ordering (may impact throughput)

- In Kafka >= 1.0.0, there's a better solution: the idempotent producer (see the Idempotent Producer section).
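
The re-ordering risk can be reproduced with a small simulation. The sketch below is a toy model, not Kafka code: batches are sent in rounds of up to maxInFlight, a batch listed in failOnce fails on its first attempt, and failed batches are retried in the next round. All names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class RetryOrderingSketch {

    // Returns the order in which batches end up in the (simulated) partition log.
    static List<String> deliver(List<String> batches, int maxInFlight, Set<String> failOnce) {
        if (maxInFlight < 1) {
            throw new IllegalArgumentException("maxInFlight must be >= 1");
        }
        Deque<String> pending = new ArrayDeque<>(batches);
        Set<String> alreadyFailed = new HashSet<>();
        List<String> log = new ArrayList<>();

        while (!pending.isEmpty()) {
            // Send up to maxInFlight batches "in parallel".
            List<String> inFlight = new ArrayList<>();
            while (inFlight.size() < maxInFlight && !pending.isEmpty()) {
                inFlight.add(pending.pollFirst());
            }
            List<String> toRetry = new ArrayList<>();
            for (String batch : inFlight) {
                if (failOnce.contains(batch) && alreadyFailed.add(batch)) {
                    toRetry.add(batch); // transient failure on the first attempt
                } else {
                    log.add(batch);     // written to the log
                }
            }
            // Retried batches go back to the front of the queue, keeping their order.
            for (int i = toRetry.size() - 1; i >= 0; i--) {
                pending.addFirst(toRetry.get(i));
            }
        }
        return log;
    }
}
```

With batches A, B, C and A failing once, maxInFlight=5 yields B, C, A, while maxInFlight=1 keeps A, B, C at the cost of sending one batch at a time.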


Producers Acks Deep Dive acks = all (replicas acks)

1. Leader + Replicas ack requested

2. Added latency and safety

3. No data loss if enough replicas


- Necessary setting if you don't want to lose data

- Acks=all must be used in conjunction with min.insync.replicas.

- min.insync.replicas can be set at the broker or topic level (override).

- min.insync.replicas=2 implies that at least 2 brokers that are in the ISR (including the leader) must respond that they have the data.

- That means if you use replication.factor=3, min.insync.replicas=2, acks=all, you can only tolerate 1 broker going down, otherwise the producer will receive an exception on send.
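
The arithmetic behind that last point is simply replication.factor minus min.insync.replicas. A small sketch, assuming every broker that is still up counts as in sync (class and method names are hypothetical):

```java
final class MinInsyncSketch {

    // Number of replica brokers that can be down while acks=all writes still succeed.
    static int tolerableBrokerFailures(int replicationFactor, int minInsyncReplicas) {
        return replicationFactor - minInsyncReplicas;
    }

    // True if an acks=all write would be accepted with the given number of replica brokers down.
    static boolean acceptsWrite(int replicationFactor, int minInsyncReplicas, int brokersDown) {
        int inSync = replicationFactor - brokersDown;
        return inSync >= minInsyncReplicas;
    }
}
```

For replication.factor=3 and min.insync.replicas=2 this gives one tolerable failure; with two brokers down the write is rejected, which is when the producer sees an exception such as NotEnoughReplicasException.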








Producers Acks Deep Dive acks = 1 (leader acks)

1. Leader response is requested, but replication is not guaranteed (it happens in the background)

2. If an ack is not received, the producer may retry

3. If the leader broker goes offline but replicas haven't replicated the data yet, the data is lost.
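
Point 3 can be pictured with a toy model of one leader and one follower. This is only a sketch under simplifying assumptions (a single follower, replication triggered by hand, the follower always taking over); the names are hypothetical and none of this is Kafka's actual replication code.

```java
import java.util.ArrayList;
import java.util.List;

final class AcksOneSketch {
    private final List<String> leaderLog = new ArrayList<>();
    private final List<String> followerLog = new ArrayList<>();

    // acks=1: the write is acknowledged as soon as the leader has appended it.
    boolean produce(String message) {
        leaderLog.add(message);
        return true;
    }

    // Background replication: the follower copies whatever it is still missing.
    void replicate() {
        for (int i = followerLog.size(); i < leaderLog.size(); i++) {
            followerLog.add(leaderLog.get(i));
        }
    }

    // The leader goes offline; the follower takes over with only its own copy.
    List<String> failover() {
        return new ArrayList<>(followerLog);
    }
}
```

If "m1" is produced and replicated, then "m2" is produced and the leader fails before the next replicate() call, the new leader holds only "m1" even though the producer received an ack for "m2".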






Producers Acks Deep Dive acks = 0 (no acks)

1. No response is requested

2. If the broker goes offline or an exception happens, we won't know and will lose data

3. Useful for data where it's okay to potentially lose messages:
    - Metrics collection
    - Log collection