Friday, July 5, 2024
Message Compression Recommendations
1. Find a compression algorithm that gives you the best performance for your specific data. Test all of them!
2. Always use compression in production and especially if you have high throughput
3. Consider tweaking linger.ms and batch.size to have bigger batches, and therefore more compression and higher throughput
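As a rough illustration of these recommendations, the sketch below collects the relevant producer settings in a plain `java.util.Properties` object. The broker address and the concrete values (`snappy`, 20 ms, 32 KB) are assumptions chosen for the example, not tuned recommendations; a real producer would also need key and value serializers.

```java
import java.util.Properties;

class CompressedProducerConfig {
    // Uses plain string keys so the sketch compiles without the Kafka client library.
    static Properties build() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.setProperty("compression.type", "snappy");          // assumed starting point; benchmark the others too
        props.setProperty("linger.ms", "20");                     // wait up to 20 ms so batches can fill (assumed value)
        props.setProperty("batch.size", Integer.toString(32 * 1024)); // 32 KB batches (assumed value)
        return props;
    }

    public static void main(String[] args) {
        build().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

These properties would typically be passed to the `KafkaProducer` constructor once the serializer settings are added.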
Message Compression
1. Producers usually send text-based data, for example JSON
2. In this case, it is important to apply compression to the producer.
3. Compression is enabled at the Producer level and doesn't require any configuration change in the Brokers or in the Consumers
4. "compression.type" can be 'none' (default), 'gzip', 'lz4', 'snappy' (and 'zstd' since Kafka 2.1)
5. Compression is more effective the bigger the batch of messages being sent to Kafka!
6. Benchmarks here: https://blog.cloudflare.com/squeezing-the-firehose/
7. The compressed batch has the following advantages:
- Much smaller producer request size (compression ratio up to 4x!)
- Faster to transfer data over the network => less latency
- Better throughput
- Better disk utilisation in Kafka (stored messages on disk are smaller)
8. Disadvantages (very minor):
- Producers must commit some CPU cycles to compression
- Consumers must commit some CPU cycles to decompression
9. Overall:
- Consider testing snappy or lz4 for optimal speed / compression ratio
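To see why bigger batches compress better, the following sketch compresses batches of different sizes and prints the resulting ratio. It uses the JDK's built-in gzip as a stand-in for Kafka's per-batch compression, and the JSON message shape is made up for the example, so the numbers only illustrate the trend and say nothing about snappy or lz4 on real data.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

class BatchCompressionDemo {
    // Returns the gzip-compressed size of the payload in bytes.
    static int gzipSize(byte[] payload) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.size();
    }

    // Builds a batch of similar JSON messages (hypothetical schema).
    static byte[] jsonBatch(int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append("{\"user_id\":").append(i)
              .append(",\"event\":\"page_view\",\"country\":\"KR\"}\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Compression ratio = raw bytes / compressed bytes for a batch of `count` messages.
    static double ratio(int count) {
        byte[] raw = jsonBatch(count);
        return (double) raw.length / gzipSize(raw);
    }

    public static void main(String[] args) {
        for (int count : new int[] {1, 10, 100, 1000}) {
            System.out.printf("%d messages: ratio %.2f%n", count, ratio(count));
        }
    }
}
```

A single small message carries the fixed gzip header and has little repetition to exploit, while a large batch of similar messages shares most of its structure, which is the effect linger.ms and batch.size are meant to encourage.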
Safe Producer Summary & Demo
Kafka < 0.11
- acks = all (producer level)
- Ensures data is properly replicated before an ack is received
- min.insync.replicas=2 (broker/topic level)
- Ensures at least two brokers in the ISR have the data after an ack
- retries=MAX_INT (producer level)
- Ensures transient errors are retried indefinitely
- max.in.flight.requests.per.connection=1 (producer level)
- Ensures only one request is tried at any time, preventing message re-ordering in case of retries
Kafka >= 0.11
- enable.idempotence=true (producer level) + min.insync.replicas=2 (broker/topic level)
- Implies acks=all, retries=MAX_INT, max.in.flight.requests.per.connection=5 (default; Kafka 0.11 itself required 1, Kafka >= 1.0 allows up to 5)
- while keeping ordering guarantees and improving performance!
- Running a "safe producer" might impact throughput and latency, always test for your use case
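The summary above could be expressed as producer properties roughly as follows. This is a minimal sketch using plain string keys: the broker address is an assumption, the settings implied by idempotence are spelled out only for readability, and min.insync.replicas is not included because it is set on the broker or topic, not on the producer.

```java
import java.util.Properties;

class SafeProducerConfig {
    static Properties build() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.setProperty("enable.idempotence", "true");
        // The three settings below are implied by idempotence; listed explicitly for readability.
        props.setProperty("acks", "all");
        props.setProperty("retries", Integer.toString(Integer.MAX_VALUE));
        props.setProperty("max.in.flight.requests.per.connection", "5");
        // min.insync.replicas=2 must be configured separately at the broker or topic level.
        return props;
    }

    public static void main(String[] args) {
        build().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```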
Idempotent Producer
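1. Here's the problem: the Producer can introduce duplicate messages in Kafka due to network errors (for example, the broker commits a message but the ack is lost, so the producer retries and the message is written twice)
- In Kafka >= 0.11, you can define an "idempotent producer", which won't introduce duplicates on network errors
Producer retries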
1. In case of transient failures, developers are expected to handle exceptions, otherwise the data will be lost.
2. Example of transient failure:
- NotEnoughReplicasException
3. There is a "retries" setting
- defaults to 0 for Kafka <= 2.0 (defaults to 2147483647 for Kafka >= 2.1)
- You can increase it to a high number, e.g. Integer.MAX_VALUE
- In case of retries, by default there is a chance that messages will be sent out of order (if a batch has failed to be sent).
- If you rely on key-based ordering, that can be an issue.
- For this, you can set the setting which controls how many produce requests can be made in parallel: max.in.flight.requests.per.connection
- Default: 5
- Set it to 1 if you need to ensure ordering (may impact throughput)
- In Kafka >= 1.0.0, there's a better solution: idempotent producers!
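The re-ordering risk described above can be shown with a toy simulation. It is a deliberately simplified model and not how the Kafka client is implemented: batches for one partition are sent in windows of `maxInFlight`, one batch (`failOnce`, a hypothetical parameter) fails on its first attempt, and the failed batch is retried after the rest of its window has already been written. Idempotence is assumed to be off.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class RetryOrderingDemo {
    // Returns the order in which batches end up in the partition log.
    static List<Integer> simulate(int batches, int maxInFlight, int failOnce) {
        Deque<Integer> pending = new ArrayDeque<>();
        for (int i = 0; i < batches; i++) {
            pending.add(i);
        }
        List<Integer> log = new ArrayList<>();
        boolean alreadyFailed = false;
        while (!pending.isEmpty()) {
            // Send up to maxInFlight batches without waiting for earlier responses.
            List<Integer> inFlight = new ArrayList<>();
            while (inFlight.size() < maxInFlight && !pending.isEmpty()) {
                inFlight.add(pending.poll());
            }
            for (int batch : inFlight) {
                if (batch == failOnce && !alreadyFailed) {
                    alreadyFailed = true;
                    pending.addFirst(batch); // transient failure: retry in the next round
                } else {
                    log.add(batch);          // written to the log
                }
            }
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(simulate(3, 5, 0)); // [1, 2, 0]: batch 0 lands after later batches
        System.out.println(simulate(3, 1, 0)); // [0, 1, 2]: ordering preserved
    }
}
```

With five requests in flight, batches 1 and 2 are written before the retry of batch 0 arrives; with a single request in flight, nothing else is sent until batch 0 succeeds.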
Producer Acks Deep Dive
acks = all (replicas acks)
1. Leader + Replicas ack requested
2. Added latency and safety
3. No data loss if enough replicas
- Necessary setting if you don't want to lose data
- acks=all must be used in conjunction with min.insync.replicas.
- min.insync.replicas can be set at the broker or topic level (override).
- min.insync.replicas=2 implies that at least 2 brokers that are in the ISR (including the leader) must respond that they have the data.
- That means if you use replication.factor=3, min.insync.replicas=2, acks=all, you can only tolerate 1 broker going down, otherwise the producer will receive an exception on send.
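The arithmetic behind that last point can be written out as a small helper. The class and method names are hypothetical and exist only for this example; the model assumes acks=all and that every replica is in sync until its broker goes down.

```java
class IsrMath {
    // Number of brokers that can be lost while acks=all writes are still accepted.
    static int tolerableBrokerFailures(int replicationFactor, int minInsyncReplicas) {
        if (minInsyncReplicas < 1 || minInsyncReplicas > replicationFactor) {
            throw new IllegalArgumentException("min.insync.replicas must be between 1 and replication.factor");
        }
        return replicationFactor - minInsyncReplicas;
    }

    // With acks=all, a write is accepted only while enough replicas remain in the ISR.
    static boolean acceptsWrite(int inSyncReplicas, int minInsyncReplicas) {
        return inSyncReplicas >= minInsyncReplicas;
    }

    public static void main(String[] args) {
        System.out.println(tolerableBrokerFailures(3, 2)); // 1
        System.out.println(acceptsWrite(2, 2));            // true: one broker down out of three
        System.out.println(acceptsWrite(1, 2));            // false: producer gets an exception on send
    }
}
```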