
Saturday, July 27, 2024

Lesson 1 AWS Building Blocks Learning Objectives

 1.1 What is AWS?

AWS Official Definition

Amazon Web Services (AWS) is the world's most comprehensive and broadly adopted cloud (on-demand, pay-as-you-go, network-accessible) platform, offering over 200 fully featured services (there is a service for almost everything, and you'll need to specialize!) from data centers globally (hundreds of data centers and millions of servers around the world!). Millions of customers - including the fastest-growing startups, largest enterprises, and leading government agencies - are using AWS to lower costs, become more agile, and innovate faster (you can do these in ways not possible using on-premises data centers!).


1.2 The Shared Responsibility Model



1.3 The AWS Account


1.4 Demo: Introduction to the AWS Console


1.5 AWS Service Categories


1.6 AWS Icons and Diagrams



Module 1: AWS Overview

Lesson 1: AWS Building Blocks

Lesson 2: AWS Global Infrastructure

Amazon Web Services (AWS) 3rd Edition

Module 1: AWS Overview

Module 2: AWS Identity and Access Management

Module 3: AWS Network Services

Module 4: AWS Compute Services

Module 5: AWS Storage Services

Module 6: AWS Database Services

Module 7: AWS High Availability Services

Module 8: AWS Analytics Services

Module 9: AWS Management Tools

Module 10: AWS Monitoring and Automation Services

Module 11: AWS Security Services

Module 12: AWS Developer Services

Module 13: AWS Billing and Cost Management

Module 14: Course Wrap-Up And Next Steps


Saturday, July 13, 2024

Why should I care about topic config?

1. Brokers have defaults for all the topic configuration parameters

2. These parameters impact performance and topic behavior


3. Some topics may need different values than the defaults

    - Replication Factor

    - # of Partitions

    - Message size

    - Compression level

    - Log Cleanup Policy

    - Min Insync Replicas

    - Other configurations


4. A list of configurations can be found at:

https://kafka.apache.org/documentation/#brokerconfigs


.\bin\windows\kafka-topics.bat --bootstrap-server 127.0.0.1:9092 --list


PS C:\kafka_2.12-3.7.0> .\bin\windows\kafka-topics.bat --bootstrap-server 127.0.0.1:9092 --create --topic configured-topic --partitions 3 --replication-factor 1
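
As a rough illustration (my own sketch, not part of the course material), the same kind of per-topic overrides can also be applied programmatically with the Java AdminClient. The topic name reuses "configured-topic" from the command above; the local broker address and the chosen config values are assumptions:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class CreateConfiguredTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "127.0.0.1:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 1 (single local broker)
            NewTopic topic = new NewTopic("configured-topic", 3, (short) 1)
                    // per-topic overrides of the broker defaults (example values)
                    .configs(Map.of(
                            "cleanup.policy", "delete",
                            "min.insync.replicas", "1"));
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}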


Friday, July 12, 2024

Kafka Multi Cluster + Replication

1. Kafka can only operate well in a single region

2. Therefore, it is very common for enterprises to have Kafka clusters across the world, with some level of replication between them


3. A replication application at its core is just a consumer + a producer

4. There are different tools to perform it:

    - Mirror Maker - open source tool that ships with Kafka

    - Netflix uses Flink - they wrote their own application

    - Uber uses uReplicator - addresses performance and operations issues with MirrorMaker

    - Comcast has their own open source Kafka Connect Source

    - Confluent has their own Kafka Connect Source (paid)

5. Overall, try these and see if it works for your use case before writing your own


6. There are two designs for cluster replication:


7. Active => Active:

    - You have a global application

    - You have a global dataset


8. Active => Passive:

    - You want to have an aggregation cluster (for example for analytics)

    - You want to create some form of disaster recovery strategy (it's hard)

    - Cloud Migration (from on-premise cluster to Cloud cluster)


9. Replicating doesn't preserve offsets, just data!











State of the art of Kafka Security

1. Kafka Security is fairly new (0.10.0)

2. Kafka Security improves over time and becomes more flexible / easier to set up as time goes by.

3. Currently, it is hard to set up Kafka Security.

4. Best support for Kafka Security for applications is with Java

Putting it all together

1. You can mix

    - Encryption

    - Authentication

    - Authorisation


2. This allows your Kafka clients to:

    - Communicate securely to Kafka

    - Clients would authenticate against Kafka

    - Kafka can authorise clients to read / write to topics



Authentication in Kafka

1. Authentication in Kafka ensures that only clients that can prove their identity can connect to our Kafka Cluster

2. This is a similar concept to a login (username / password)

3. Authentication in Kafka can take a few forms

4. SSL Authentication: clients authenticate to Kafka using SSL certificates

5. SASL Authentication:
    - PLAIN: clients authenticate using username / password (weak - easy to setup)
    - Kerberos: such as Microsoft Active Directory (strong - hard to setup)
    - SCRAM: username / password (strong - medium to setup)

6. Once a client is authenticated, Kafka can verify its identity

7. It still needs to be combined with authorisation, so that Kafka knows that
    - "User alice can view topic finance"
    - "User bob cannot view topic trucks"

8. ACLs (Access Control Lists) have to be maintained by administrators to onboard new users
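
A minimal sketch (my own, not from the course) of what SASL/SCRAM authentication over TLS looks like from the client side; the broker host/port, credentials and truststore path are made-up placeholders:

import java.util.Properties;

public class ScramClientConfig {
    public static Properties secureClientProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker.example.com:9093");  // hypothetical TLS listener
        // SASL over TLS: encrypted connection + username/password authentication
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.mechanism", "SCRAM-SHA-256");
        // JAAS config carrying the username / password (example credentials)
        props.put("sasl.jaas.config",
                "org.apache.kafka.common.security.scram.ScramLoginModule required "
                        + "username=\"alice\" password=\"alice-secret\";");
        // truststore so the client trusts the broker certificate
        props.put("ssl.truststore.location", "/path/to/kafka.client.truststore.jks");
        props.put("ssl.truststore.password", "changeit");
        return props;
    }
}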




Encryption in Kafka

1. Encryption in Kafka ensures that the data exchanged between clients and brokers is encrypted, so routers on the way cannot read it

2. This is a similar concept to an HTTPS website
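
A minimal sketch (my own assumption of a typical setup, not from the course) of an encryption-only client configuration, i.e. TLS without client authentication; host, port and truststore path are placeholders:

import java.util.Properties;

public class SslEncryptionConfig {
    public static Properties encryptedClientProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker.example.com:9093");  // hypothetical TLS listener
        // TLS encryption of client <-> broker traffic
        props.put("security.protocol", "SSL");
        // truststore so the client trusts the broker certificate
        props.put("ssl.truststore.location", "/path/to/kafka.client.truststore.jks");
        props.put("ssl.truststore.password", "changeit");
        return props;
    }
}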




The need for encryption, authentication & authorisation in Kafka

 1. Currently, any client can access your Kafka cluster (authentication)

2. The clients can publish / consume any topic data (authorisation)

3. All the data being sent is fully visible on the network (encryption)


- Someone could intercept data being sent

- Someone could publish bad data / steal data

- Someone could delete topics


- All these reasons push for more security and an authentication model

Kafka Monitoring and Operations

1. Kafka exposes metrics through JMX. (A minimal JMX read sketch follows at the end of this section.)

2. These metrics are highly important for monitoring Kafka, and ensuring the systems are behaving correctly under load.

3. Common places to host the Kafka metrics:

    - ELK (ElasticSearch + Kibana)

    - Datadog

    - NewRelic

    - Confluent Control Centre

    - Prometheus

    - Many others...!


4. Some of the most important metrics are:


5. Under Replicated Partitions: number of partitions that have problems with the ISR (in-sync replicas). May indicate a high load on the system


6. Request Handlers: utilization of threads for IO, network, etc... overall utilization of an Apache Kafka broker.


7. Request timing: how long it takes to reply to requests. Lower is better, as latency will be improved.


8. Overall have a look at the documentation here:

- https://kafka.apache.org/documentation/#monitoring

- https://docs.confluent.io/current/kafka/monitoring.html

- https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/


9. Kafka Operations team must be able to perform the following tasks:

    - Rolling restart of Brokers

    - Updating Configurations

    - Rebalancing Partitions

    - Increasing replication factor

    - Adding a Broker

    - Replacing a Broker

    - Removing a Broker

    - Upgrading a Kafka Cluster with zero downtime
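
As a sketch of how one of these JMX metrics can be read programmatically (my own example, not from the course), assuming the broker was started with JMX enabled on port 9999 (e.g. via the JMX_PORT environment variable):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class UnderReplicatedCheck {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url =
                new JMXServiceURL("service:jmx:rmi:///jndi/rmi://127.0.0.1:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // Under Replicated Partitions gauge exposed by the broker
            ObjectName urp = new ObjectName(
                    "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions");
            Object value = mbs.getAttribute(urp, "Value");
            System.out.println("UnderReplicatedPartitions = " + value);
        }
    }
}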






Kafka Cluster Setup Gotchas

1. It's not easy to set up a cluster

2. You want to isolate each Zookeeper & Broker on separate servers

3. Monitoring needs to be implemented

4. Operations have to be mastered

5. You need a really good Kafka Admin


- Alternative: many different "Kafka as a Service" offerings on the web

- No operational burdens (updates, monitoring, setup, etc...)



Wednesday, July 10, 2024

Kafka Cluster Setup High Level Architecture

1. You want multiple brokers in different data centers (racks) to distribute your load. You also want a cluster of at least 3 Zookeeper servers

2. In AWS:




Tuesday, July 9, 2024

Kafka Connect and Streams Architecture Design

 





Why Kafka Connect and Streams

 1. Four Common Kafka Use Cases:


Source => Kafka    Producer API               Kafka Connect Source
Kafka  => Kafka    Consumer, Producer API     Kafka Streams
Kafka  => Sink     Consumer API               Kafka Connect Sink
Kafka  => App      Consumer API


- Simplify and improve getting data in and out of Kafka

- Simplify transforming data within Kafka without relying on external libs


- Programmers always want to import data from the same sources:

Databases, JDBC, Couchbase, GoldenGate, SAP HANA, Blockchain, Cassandra, DynamoDB, FTP, IOT, MongoDB, MQTT, RethinkDB, Salesforce, Solr, SQS, Twitter, etc...


- Programmers always want to store data in the same sinks:

S3, ElasticSearch, HDFS, JDBC, SAP HANA, DocumentDB, Cassandra, DynamoDB, HBase, MongoDB, Redis, Solr, Splunk, Twitter


- It is tough to achieve Fault Tolerance, Idempotence, Distribution, Ordering


- Other programmers may already have done a very good job!
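
To make the "Kafka => Kafka" row of the table above concrete, here is a minimal Kafka Streams sketch (my own example, not from the course); the application id and the topic names "input-topic" / "output-topic" are made up:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import java.util.Properties;

public class UppercaseStreamApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "127.0.0.1:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Kafka => Kafka: read a topic, transform each value, write to another topic
        KStream<String, String> input = builder.stream("input-topic");
        input.mapValues(value -> value.toUpperCase()).to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}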





Kafka Connect API A brief history

 1. (2013) Kafka 0.8.x:

    - Topic replication, Log compaction

    - Simplified producer client API


2. (Nov 2015) Kafka 0.9.x:

    - Simplified high level consumer APIs, without Zookeeper dependency

    - Added security (Encryption and Authentication)

    - Kafka Connect APIs


3. (May 2016): Kafka 0.10.0:

    - Kafka Streams API


4. (end 2016 - March 2017) Kafka 0.10.1, 0.10.2:

    - Improved Connect API, Single Message Transforms API




Kafka Connect Introduction

1. Do you feel you're not the first person in the world to write a way to get data out of Twitter?


2. Do you feel like you're not the first person in the world to send data from Kafka to PostgreSQL / ElasticSearch / MongoDB?


3. Additionally, the bugs you'll run into - won't someone have fixed them already?


4. Kafka Connect is all about code & connectors re-use!





Saturday, July 6, 2024

Max.block.ms & buffer.memory

 1. If the producer produces faster than the broker can take, the records will be buffered in memory


2. buffer.memory=33554432 (32MB): the size of the send buffer


3. That buffer will fill up over time and fill back down when the throughput to the broker increases


4. If that buffer is full (all 32MB), then the .send() method will start to block (won't return right away)


5. max.block.ms=60000: the time .send() will block before throwing an exception. The exception is basically thrown when:

    - The producer has filled up its buffer

    - The broker is not accepting any new data

    - 60 seconds has elapsed.


6. If you hit that exception, it usually means your brokers are down or overloaded, as they can't respond to requests
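
A minimal sketch (my own, not from the course) of where these two settings go in a Java producer; the broker address is the usual local placeholder and the values shown are the defaults discussed above:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class BufferTunedProducer {
    public static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "127.0.0.1:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // 32MB send buffer (the default); .send() starts blocking once it is full
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, Integer.toString(32 * 1024 * 1024));
        // block at most 60s in .send() before throwing a TimeoutException
        props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, "60000");
        return new KafkaProducer<>(props);
    }
}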






Producer Default Partitioner and how keys are hashed

 1. By default, your keys are hashed using the "murmuyr2" algorithm.


2. It is most likely preferred not to override the behavior of the partitioner, but it is possible to do so (partitioner.class).


3. The formula is:

targetPartition = Utils.abs(Utils.murmur2(record.key())) % numPartitions;


4. This means that the same key will go to the same partition (we already know this), and adding partitions to a topic will completely alter the formula
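
A small sketch (my own, not from the course) showing the consequence of the formula: sending several records with the same key and printing the partition reported back; the topic "configured-topic" and the key "truck_42" are made-up examples:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class SameKeySamePartition {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "127.0.0.1:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 3; i++) {
                RecordMetadata md = producer.send(
                        new ProducerRecord<>("configured-topic", "truck_42", "position " + i)).get();
                // every send with the key "truck_42" should report the same partition number
                System.out.println("key=truck_42 -> partition " + md.partition());
            }
        }
    }
}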



High Throughput Producer Demo

1. We'll add snappy message compression in our producer

2. snappy is very helpful if your messages are text based, for example log lines or JSON documents

3. snappy has a good balance of CPU / compression ratio

4. We'll also increase the batch.size to 32KB and introduce a small delay through linger.ms (20ms)
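
A sketch of how the three demo settings would look as Java producer properties (my own example of applying the values above; the broker address is a local placeholder):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class HighThroughputProducer {
    public static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "127.0.0.1:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // high throughput settings from the demo
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");               // good CPU/ratio balance for text
        props.put(ProducerConfig.LINGER_MS_CONFIG, "20");                          // wait up to 20ms to build batches
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, Integer.toString(32 * 1024));  // 32KB batches
        return new KafkaProducer<>(props);
    }
}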

Linger.ms & batch.size

 1. By default, Kafka tries to send records as soon as possible

    - It will have up to 5 requests in flight, meaning up to 5 messages individually sent at the same time.

    - After this, if more messages have to be sent while others are in flight, Kafka is smart and will start batching them while they wait to send them all at once.


2. This smart batching allows Kafka to increase throughput while maintaining very low latency.

3. Batches have higher compression ratio so better efficiency


4. So how can we control the batching mechanism?


- Linger.ms: Number of milliseconds a producer is willing to wait before sending a batch out. (default 0)

- By introducing some lag (for example linger.ms=5), we increase the chances of messages being sent together in a batch

- So at the expense of introducing a small delay, we can increase throughput, compression and efficiency of our producer.


- If a batch is full (see batch.size) before the end of the linger.ms period, it will be sent to Kafka right away!



* Linger.ms




- batch.size: Maximum number of bytes that will be included in a batch. The default is 16KB.

- Increasing the batch size to something like 32KB or 64KB can help increase the compression, throughput, and efficiency of requests

- Any message that is bigger than the batch size will not be batched

- A batch is allocated per partition, so make sure that you don't set it to a number that's too high, otherwise you'll waste memory!


- (Note: You can monitor the average batch size metric using Kafka Producer Metrics)
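
A rough worked example of that warning (my own back-of-the-envelope figures, not from the course): with batch.size=32KB and a producer writing to a topic with 100 partitions, up to roughly 100 x 32KB ≈ 3MB of memory can sit in partially filled batches at once; with 1,000 partitions that grows to about 32MB, i.e. the whole default buffer.memory.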

















Friday, July 5, 2024

Message compression Recommendations

 1. Find a compression algorithm that gives you the best performance for your specific data. Test all of them!


2. Always use compression in production and especially if you have high throughput


3. Consider tweaking linger.ms and batch.size to have bigger batches, and therefore more compression and higher throughput

Message Compression

1. Producers usually send data that is text-based, for example JSON data

2. In this case, it is important to apply compression to the producer.


3. Compression is enabled at the Producer level and doesn't require any configuration change in the Brokers or in the Consumers

4. "compression.type" can be 'none'(default), 'gzip', 'lz4', 'snappy'


5. Compression is more effective the bigger the batch of messages being sent to Kafka!

6. Benchmarks here: https://blog.cloudflare.com/squeezing-the-firehose/


7. The compressed batch has the following advantage:

    - Much smaller producer request size (compression ratio up to 4x!)

    - Faster to transfer data over the network => less latency

    - Better throughput

    - Better disk utilisation in Kafka (stored messages on disk are smaller)


8. Disadvantages (very minor):

    - Producers must commit some CPU cycles to compression

    - Consumers must commit some CPU cycles to decompression


9. Overall:

    - Consider testing snappy or lz4 for optimal speed / compression ratio



Safe producer Summary & Demo

 Kafka < 0.11

- acks = all (producer level)

    - Ensures data is properly replicated before an ack is received

- min.insync.replicas=2 (broker/topic level)

    - Ensures two brokers in ISR at least have the data after an ack

- retries=MAX_INT (producer level)

    - Ensures transient errors are retried indefinitely

- max.in.flight.requests.per.connection=1 (producer level)

    - Ensures only one request is tried at any time, preventing message re-ordering in case of retries


Kafka >= 0.11

- enable.idempotence=true (producer level) + min.insync.replicas=2 (broker/topic level)

    - Implies acks=all, retries=MAX_INT, max.in.flight.requests.per.connection=5(default)

    - while keeping ordering guarantees and improving performance!

- Running a "safe producer" might impact throughput an latency, always test for your use case






Idempotent Producer

1. Here's the problem: the Producer can introduce duplicate messages in Kafka due to network errors

- In Kafka >= 0.11, you can define a "idempotent producer" which won't introduce duplicates on network error


- Idempotent producers are great to guarantee a stable and safe pipeline!
- They come with:
    - retries = Integer.MAX_VALUE (2^31-1 = 2147483647)
    - max.in.flight.requests = 1 (Kafka >= 0.11 & < 1.1), or
    - max.in.flight.requests = 5 (Kafka >= 1.1 - higher performance)
    - acks = all

- Just set:
    - producerProps.put("enable.idempotence", true);












Producer retries

 1. In case of transient failures, developers are expected to handle exceptions, otherwise the data will be lost.

2. Example of transient failure:

    - NotEnoughReplicasException

3. There is a "retries" setting

    - defaults to 0

    - You can increase it to a high number, e.g. Integer.MAX_VALUE


- In case of retries, by default there is a chance that messages will be sent out of order (if a batch has failed to be sent).

- If you rely on key-based ordering, that can be an issue.

- For this, you can set the setting which controls how many produce requests can be made in parallel: max.in.flight.requests.per.connection

    - Default: 5 

    - Set it to 1 if you need to ensure ordering (may impact throughput)

- In Kafka >= 1.0.0, there's a better solution!


Producers Acks Deep Dive acks = all (replicas acks)

1. Leader + Replicas ack requested

2. Added latency and safety

3. No data loss if enough replicas


- Necessary setting if you don't want to lose data

- Acks=all must be used in conjunction with min.insync.replicas.

- min.insync.replicas can be set at the broker or topic level (override).

- min.insync.replicas=2 implies that at least 2 brokers that are ISR (including the leader) must respond that they have the data.

- That means if you use replication.factor=3, min.insync=2, acks=all, you can only tolerate 1 broker going down, otherwise the producer will receive an exception on send.








Producers Acks Deep Dive acks = 1 (leader acks)

 1. Leader response is requested, but replication is not a guarantee

(happens in the background)

2. If an ack is not received, the producer may retry

3. If the leader broker goes offline but replicas haven't replicated the data yet, we have a data loss.






Producers Acks Deep Dive acks = 0 (no acks)

1. No response is requested

2. If the broker goes offline or an exception happens, we won't know and will lose data

3. Useful for data where it's okay to potentially lose messages:
    - Metrics collection
    - Log collection