Apache Kafka Platform
An Apache Kafka Platform is an open-source, distributed, partitioned, and replicated data stream processing platform that is managed by the Apache Kafka project.
- Context:
- It can (typically) be used to create a Publish-Subscribe Messaging System.
- It can be a Pull System (in contrast, Flume is a push system).
- It can (typically) support Kafka Producers that write IT micro-events to Kafka.
- It can (typically) support Kafka Consumers that subscribe to IT micro-events (e.g. click events).
- It can provide a vast array of metrics on performance and resource utilization (to help you manage and scale a cluster).
- It can (typically) guarantee order (within a Kafka partition).
- It can be used to create a Kafka Messaging System Instance.
- It can have Kafka APIs, such as the Kafka Connect API (for Kafka Connect).
- …
- Example(s):
- Kafka v2.4.1 (~2020-03-20) [1].
- Kafka v1.0.0 (~2017-11-01).
- Kafka v0.11.0.2 (~2017-11-17).
- https://kafka.apache.org/downloads
- …
- Counter-Example(s):
- See: Kafka Streams, AWS MSK, Distributed Messaging System, Near-line System, Apache Yarn, Kafka SQL (KSQL), Kafka Cruise Control.
References
2021
- (Wikipedia, 2021) ⇒ https://en.wikipedia.org/wiki/Apache_Kafka Retrieved:2021-6-10.
- Apache Kafka is a framework implementation of a software bus using stream-processing. It is an open-source software platform developed by the Apache Software Foundation written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Kafka can connect to external systems (for data import/export) via Kafka Connect and provides Kafka Streams, a Java stream processing library.
Kafka uses a binary TCP-based protocol that is optimized for efficiency and relies on a "message set" abstraction that naturally groups messages together to reduce the overhead of the network roundtrip. This "leads to larger network packets, larger sequential disk operations, contiguous memory blocks [...] which allows Kafka to turn a bursty stream of random message writes into linear writes."
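The "message set" grouping described in the quote above can be illustrated with a small sketch. This is not Kafka's wire format: the length-prefixed framing and the function names `encode_message_set` and `decode_message_set` are assumptions chosen only to show how several small messages can travel as one contiguous buffer.

```python
import struct


def encode_message_set(messages):
    """Pack several byte-string messages into one contiguous buffer.

    Hypothetical framing: each message is a 4-byte big-endian length
    followed by its payload. Kafka's real format differs.
    """
    parts = []
    for payload in messages:
        parts.append(struct.pack(">I", len(payload)))
        parts.append(payload)
    return b"".join(parts)


def decode_message_set(buffer):
    """Split a buffer produced by encode_message_set back into messages."""
    messages = []
    position = 0
    while position < len(buffer):
        (length,) = struct.unpack_from(">I", buffer, position)
        position += 4
        messages.append(buffer[position:position + length])
        position += length
    return messages


batch = encode_message_set([b"click:home", b"click:cart", b"click:pay"])
# One buffer can be sent in one network write or appended to disk in one
# sequential write, instead of three separate small operations.
restored = decode_message_set(batch)
```

In this sketch the three messages cost one write rather than three, which is the effect the quote attributes to batching.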
2017a
- (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/Apache_Kafka Retrieved:2017-7-21.
- Apache Kafka is an open-source stream processing platform developed by the Apache Software Foundation written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Its storage layer is essentially a "massively scalable pub/sub message queue architected as a distributed transaction log," making it highly valuable for enterprise infrastructures to process streaming data. Additionally, Kafka connects to external systems (for data import/export) via Kafka Connect and provides Kafka Streams, a Java stream processing library. The design is heavily influenced by transaction logs.
2017b
- (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/Apache_Kafka#Description Retrieved:2017-7-21.
- Kafka stores messages which come from arbitrarily many processes called "producers". The data can thereby be partitioned in different "partitions" within different "topics". Within a partition the messages are indexed and stored together with a timestamp. Other processes called "consumers" can query messages from partitions. Kafka runs on a cluster of one or more servers and the partitions can be distributed across cluster nodes.
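The data model in this description (topics split into partitions, with each message indexed and timestamped inside its partition) might be sketched as follows. It is an in-memory toy, not Kafka's storage engine; the class names `Record`, `Partition`, and `Topic` and the method `read_from` are hypothetical.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Record:
    offset: int       # index of the message within its partition
    timestamp: float  # time at which the message was stored
    value: str


@dataclass
class Partition:
    records: list = field(default_factory=list)

    def append(self, value):
        """Store a message at the next offset and return that offset."""
        record = Record(len(self.records), time.time(), value)
        self.records.append(record)
        return record.offset

    def read_from(self, offset):
        """Return every record at or after the given offset."""
        return self.records[offset:]


class Topic:
    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [Partition() for _ in range(num_partitions)]


clicks = Topic("clicks", num_partitions=2)
clicks.partitions[0].append("user-1 opened /home")
clicks.partitions[1].append("user-2 opened /cart")
clicks.partitions[0].append("user-1 opened /pay")

# A consumer queries one partition starting from a chosen offset.
values = [record.value for record in clicks.partitions[0].read_from(1)]
```

In a real cluster the partitions would live on different servers; here they are simply separate lists.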
2017c
- http://confluent.io/blog/apache-kafka-for-service-architectures/
- QUOTE: … The log-structured approach is itself a simple idea: a collection of messages, appended sequentially to a file. When a service wants to read messages from Kafka it ‘seeks’ to the position of the last message it read, then scans sequentially, reading messages in order, while periodically recording its new position in the log.
Taking a log-structured approach has an interesting side effect. Both reads and writes are sequential operations. This makes them sympathetic to the underlying media, leveraging pre-fetch, the various layers of caching and naturally batching operations together. This makes them efficient. In fact, when you read messages from Kafka, the server doesn’t even import them into the JVM. Data is copied directly from the disk buffer to the network buffer. An opportunity afforded by the simplicity of both the contract and the underlying data structure. …
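The append, seek, scan, and record-position cycle described in this quote can be sketched with an ordinary file. This is a minimal illustration under stated assumptions: messages are newline-terminated text with no embedded newlines, the position is a byte offset, and the helper names `append_message` and `read_new_messages` are hypothetical.

```python
import os
import tempfile


def append_message(log_path, message):
    """Append one newline-terminated message to the end of the log file."""
    with open(log_path, "ab") as log:
        log.write(message.encode("utf-8") + b"\n")


def read_new_messages(log_path, position):
    """Seek to a saved byte position, scan forward sequentially, and
    return the messages found together with the new position."""
    with open(log_path, "rb") as log:
        log.seek(position)
        data = log.read()
    messages = [line.decode("utf-8") for line in data.splitlines()]
    return messages, position + len(data)


log_path = os.path.join(tempfile.mkdtemp(), "topic.log")
for text in ["a", "b", "c"]:
    append_message(log_path, text)

first_batch, position = read_new_messages(log_path, 0)
append_message(log_path, "d")
# The reader resumes from its recorded position and sees only what is new.
second_batch, position = read_new_messages(log_path, position)
```

Both the writer and the reader only ever move forward through the file, which is why the quote calls the operations sequential.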
2014
- http://kafka.apache.org/documentation.html#introduction
- QUOTE: Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design. …
let's review some basic messaging terminology:
- Kafka maintains feeds of messages in categories called topics.
- We'll call processes that publish messages to a Kafka topic producers.
- We'll call processes that subscribe to topics and process the feed of published messages consumers.
- Kafka is run as a cluster comprised of one or more servers each of which is called a broker.
So, at a high level, producers send messages over the network to the Kafka cluster which in turn serves them up to consumers like this: …
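The terminology above (topics, producers, consumers, broker) can be tied together in a single-process sketch. It assumes a pull model in which each consumer tracks its own position in the feed; the classes `Broker` and `Consumer` and their methods are hypothetical stand-ins, not Kafka client APIs.

```python
from collections import defaultdict


class Broker:
    """Toy single-process stand-in for a Kafka broker (illustrative only)."""

    def __init__(self):
        self.topics = defaultdict(list)  # topic name -> ordered feed of messages

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def fetch(self, topic, position):
        return self.topics[topic][position:]


class Consumer:
    def __init__(self, broker, topic):
        self.broker = broker
        self.topic = topic
        self.position = 0  # each subscriber tracks its own place in the feed

    def poll(self):
        """Pull every message published since the last poll."""
        messages = self.broker.fetch(self.topic, self.position)
        self.position += len(messages)
        return messages


broker = Broker()
billing = Consumer(broker, "orders")
analytics = Consumer(broker, "orders")

# A producer is anything that calls publish on the broker.
broker.publish("orders", "order-1")
broker.publish("orders", "order-2")

billing_seen = billing.poll()
broker.publish("orders", "order-3")
analytics_seen = analytics.poll()
billing_later = billing.poll()
```

Because the broker keeps the feed and the consumers keep their positions, both subscribers receive every message, each at its own pace.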
2011
- http://incubator.apache.org/kafka/index.html
- Apache Kafka is a distributed publish-subscribe messaging system. It is designed to support the following:
- Persistent messaging with O(1) disk structures that provide constant time performance even with many TB of stored messages.
- High-throughput: even with very modest hardware Kafka can support hundreds of thousands of messages per second.
- Explicit support for partitioning messages over Kafka servers and distributing consumption over a cluster of consumer machines while maintaining per-partition ordering semantics.
- Support for parallel data load into Hadoop.
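The partitioning goal in the list above (spreading messages over servers and consumption over consumer machines while keeping per-partition order) can be sketched as follows. The CRC32 hash and the round-robin assignment are stand-ins chosen for illustration and do not reproduce Kafka's own partitioner or assignment strategy; `choose_partition` and `assign_partitions` are hypothetical names.

```python
import zlib


def choose_partition(key, num_partitions):
    """Map a key to a partition with a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions, consumers):
    """Spread partitions over consumers round-robin; one owner per partition."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


NUM_PARTITIONS = 4
partitions = [[] for _ in range(NUM_PARTITIONS)]

events = [("user-1", "view"), ("user-2", "view"), ("user-1", "add"), ("user-1", "buy")]
for key, action in events:
    # Messages with the same key always land in the same partition,
    # so their relative order is preserved there.
    partitions[choose_partition(key, NUM_PARTITIONS)].append((key, action))

user1_partition = partitions[choose_partition("user-1", NUM_PARTITIONS)]
user1_actions = [action for key, action in user1_partition if key == "user-1"]

assignment = assign_partitions(NUM_PARTITIONS, ["consumer-a", "consumer-b"])
```

Ordering holds only within a partition; messages for different keys in different partitions have no defined order relative to each other.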