
Tuesday, July 9, 2024

Why Kafka Connect and Streams

 1. Four Common Kafka Use Cases:


Source => Kafka      Producer API                  Kafka Connect Source

Kafka  => Kafka      Consumer API, Producer API    Kafka Streams

Kafka  => Sink       Consumer API                  Kafka Connect Sink

Kafka  => App        Consumer API


- Simplify and improve getting data in and out of Kafka

- Simplify transforming data within Kafka without relying on external libraries


- Programmers always want to import data from the same sources:

Databases, JDBC, Couchbase, GoldenGate, SAP HANA, Blockchain, Cassandra, DynamoDB, FTP, IoT, MongoDB, MQTT, RethinkDB, Salesforce, Solr, SQS, Twitter, etc...


- Programmers always want to store data in the same sinks:

S3, ElasticSearch, HDFS, JDBC, SAP HANA, DocumentDB, Cassandra, DynamoDB, HBase, MongoDB, Redis, Solr, Splunk, Twitter


- It is tough to achieve Fault Tolerance, Idempotence, Distribution, and Ordering


- Other programmers may already have done a very good job!





Kafka Connect API: A brief history

 1. (2013) Kafka 0.8.x:

    - Topic replication, Log compaction

    - Simplified producer client API


2. (Nov 2015) Kafka 0.9.x:

    - Simplified high level consumer APIs, without Zookeeper dependency

    - Added security (Encryption and Authentication)

    - Kafka Connect APIs


3. (May 2016) Kafka 0.10.0:

    - Kafka Streams API


4. (end 2016 - March 2017) Kafka 0.10.1, 0.10.2:

    - Improved Connect API, Single Message Transforms API




Kafka Connect Introduction

 1. Do you feel you're not the first person in the world to write a way to get data out of Twitter?


2. Do you feel like you're not the first person in the world to send data from Kafka to PostgreSQL / ElasticSearch / MongoDB?


3. Additionally, won't someone already have fixed the bugs you're going to run into?


4. Kafka Connect is all about code & connector re-use!





Saturday, July 6, 2024

max.block.ms & buffer.memory

 1. If the producer produces faster than the broker can take, the records will be buffered in memory


2. buffer.memory=33554432 (32 MB): the size of the send buffer


3. That buffer will fill up over time and drain back down when throughput to the broker increases


4. If that buffer is full (all 32 MB), then the .send() method will start to block (won't return right away)


5. max.block.ms=60000: the time .send() will block before throwing an exception. The exception is basically thrown when

    - The producer has filled up its buffer

    - The broker is not accepting any new data

    - 60 seconds have elapsed.


6. If you hit that exception, it usually means your brokers are down or overloaded, as they can't respond to requests (see the configuration sketch below)
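
A minimal sketch of these two settings in a Java producer config, assuming the kafka-clients library; the broker address and topic name are made-up placeholders, and both values shown are the defaults, so setting them explicitly only documents the behavior:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class BufferSettingsDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // buffer.memory: 32 MB send buffer (the default); records queue here when the broker can't keep up
        props.setProperty(ProducerConfig.BUFFER_MEMORY_CONFIG, Long.toString(32L * 1024 * 1024));

        // max.block.ms: .send() blocks for at most 60 s (the default) when that buffer is full, then throws
        props.setProperty(ProducerConfig.MAX_BLOCK_MS_CONFIG, "60000");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("demo_topic", "key", "value")); // blocks only if the buffer is full
        }
    }
}

Raise buffer.memory or max.block.ms only if the producer genuinely outpaces the brokers in short bursts; otherwise fix the broker-side bottleneck.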






Producer Default Partitioner and how keys are hashed

 1. By default, your keys are hashed using the "murmur2" algorithm.


2. It is most likely preferred not to override the behavior of the partitioner, but it is possible to do so (partitioner.class).


3. The formula is:

targetPartition = Utils.abs(Utils.murmur2(record.key())) % numPartitions;


4. This means that the same key will go to the same partition (we already know this), and adding partitions to a topic will completely alter the formula (see the sketch below)
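
A small sketch of the same formula, using Kafka's internal Utils class (the key and partition count are made up; Utils.toPositive is used here to force the hash positive, which is how the clients library expresses what the formula above abbreviates as Utils.abs):

import java.nio.charset.StandardCharsets;

import org.apache.kafka.common.utils.Utils;

public class DefaultPartitionerDemo {
    public static void main(String[] args) {
        int numPartitions = 3;          // assumed partition count of the target topic
        String key = "truck_id_42";     // hypothetical record key

        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);

        // murmur2-hash the key bytes, force the result positive, then modulo the partition count
        int targetPartition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;

        System.out.println("Key '" + key + "' -> partition " + targetPartition);
    }
}

Running this with the same key always prints the same partition; change numPartitions and the result can change, which is why adding partitions breaks the key-to-partition mapping.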



High Throughput Producer Demo

1. We'll add snappy message compression in our producer

2. snappy is very helpful if your messages are text based, for example log lines or JSON documents

3. snappy has a good balance of CPU / compression ratio

4. We'll also increase the batch.size to 32 KB and introduce a small delay through linger.ms (20 ms); a minimal configuration sketch follows.
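
A minimal sketch of such a producer, assuming the kafka-clients library; the broker address and topic name are placeholders, and only the three commented settings differ from the defaults:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class HighThroughputProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // high-throughput settings from the notes above
        props.setProperty(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");              // good for text / JSON payloads
        props.setProperty(ProducerConfig.LINGER_MS_CONFIG, "20");                         // wait up to 20 ms to fill a batch
        props.setProperty(ProducerConfig.BATCH_SIZE_CONFIG, Integer.toString(32 * 1024)); // 32 KB batches

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 1000; i++) {
                producer.send(new ProducerRecord<>("demo_topic", Integer.toString(i), "message " + i));
            }
        } // close() flushes whatever is still sitting in open batches
    }
}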

linger.ms & batch.size

 1. By default, Kafka tries to send records as soon as possible

    - It will have up to 5 requests in flight, meaning up to 5 messages being sent individually at the same time.

    - After this, if more messages have to be sent while others are in flight, Kafka is smart and will start batching them while they wait, so it can send them all at once.


2. This smart batching allows Kafka to increase throughput while maintaining very low latency.

3. Batches have a higher compression ratio, so better efficiency


4. So how can we control the batching mechanism?


- linger.ms: Number of milliseconds a producer is willing to wait before sending a batch out (default 0)

- By introducing some lag (for example linger.ms=5), we increase the chances of messages being sent together in a batch

- So at the expense of introducing a small delay, we can increase the throughput, compression, and efficiency of our producer.


- If a batch is full (see batch.size) before the end of the linger.ms period, it will be sent to Kafka right away!







- batch.size: Maximum number of bytes that will be included in a batch. The default is 16 KB.

- Increasing the batch size to something like 32 KB or 64 KB can help increase the compression, throughput, and efficiency of requests

- Any message that is bigger than the batch size will not be batched

- A batch is allocated per partition, so make sure that you don't set it to a number that's too high, otherwise you'll waste memory!


- (Note: You can monitor the average batch size metric using Kafka Producer Metrics; see the sketch below)
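
A sketch of reading that metric programmatically (the broker address and topic are placeholders; "batch-size-avg" is the producer metric name I'd expect here, and the same value is also exposed over JMX):

import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.common.serialization.StringSerializer;

public class BatchSizeMetricDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 100; i++) {
                producer.send(new ProducerRecord<>("demo_topic", Integer.toString(i), "message " + i));
            }

            // look up the average batch size recorded by the producer so far
            for (Map.Entry<MetricName, ? extends Metric> entry : producer.metrics().entrySet()) {
                if ("batch-size-avg".equals(entry.getKey().name())) {
                    System.out.println(entry.getKey().group() + " / batch-size-avg = " + entry.getValue().metricValue());
                }
            }
        }
    }
}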