Data Retention Strategies in Kafka

Data retention determines how long Kafka keeps messages before discarding them, which makes it a central part of managing a cluster. Kafka provides two main retention strategies: time-based retention and size-based retention. The two can be combined, in which case old segments are removed as soon as either limit is exceeded.

Time-Based Retention

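Time-based retention is configured at the broker level with the log.retention.hours, log.retention.minutes, or log.retention.ms parameters. If more than one is set, the finer-grained one wins: log.retention.ms takes precedence over log.retention.minutes, which takes precedence over log.retention.hours. These settings specify how long a log segment is retained before it becomes eligible for deletion.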

Key points:

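  • Default value is 168 hours (7 days), set through log.retention.hours.
  • Applies to every partition of a topic; each partition's segments are evaluated independently.
  • Deletion is not tied to a segment roll. The broker checks for eligible segments periodically (log.retention.check.interval.ms, 5 minutes by default).
  • Records are removed a whole segment at a time, and a segment's age is judged by its newest record, so older records can outlive the retention period until their segment expires.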
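
The precedence between the three broker properties can be sketched in a few lines of Python. This is an illustrative model rather than Kafka's own code: the function name effective_retention_ms and the use of a plain dict for the broker properties are assumptions made for the example, as is the treatment of any negative value as "no time limit".

```python
HOUR_MS = 60 * 60 * 1000
MINUTE_MS = 60 * 1000

# Checked in order of precedence: the finest-grained property wins.
_PRECEDENCE = (
    ("log.retention.ms", 1),
    ("log.retention.minutes", MINUTE_MS),
    ("log.retention.hours", HOUR_MS),
)


def effective_retention_ms(broker_props):
    """Resolve the broker-level retention period in milliseconds.

    broker_props is a hypothetical dict of values read from server.properties.
    Returns -1 when the winning value is negative (assumed: no time limit).
    """
    for key, unit_ms in _PRECEDENCE:
        if key in broker_props:
            value = int(broker_props[key])
            return -1 if value < 0 else value * unit_ms
    return 168 * HOUR_MS  # default: 168 hours (7 days)


print(effective_retention_ms({}))  # 604800000
print(effective_retention_ms({"log.retention.hours": 24,
                              "log.retention.minutes": 30}))  # 1800000
```

In the second call the minutes setting overrides the hours setting, so the effective retention is 30 minutes rather than 24 hours.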

Size-Based Retention

Size-based retention is configured using the log.retention.bytes parameter. This setting specifies the maximum size of a partition before old log segments are deleted to free up space.

Key points:

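  • Default value is -1 (no size limit).
  • Applies per partition, not per topic, so a topic can occupy roughly log.retention.bytes multiplied by its partition count on each replica set.
  • Useful for capping disk usage when traffic volume is hard to predict.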
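
To make the size check concrete, here is a minimal Python sketch of how oldest-first trimming might behave. It is a simplified model, not Kafka's implementation: the function name segments_deleted_by_size is hypothetical, and it assumes a segment is only removed if the partition would still hold at least the configured limit afterwards.

```python
def segments_deleted_by_size(segment_sizes, retention_bytes):
    """Return how many of the oldest segments a size check would delete.

    segment_sizes lists segment sizes in bytes, oldest first.
    Assumption: a segment is dropped only while the partition would still
    hold at least retention_bytes afterwards, so the log is trimmed toward
    the limit but not below it.
    """
    if retention_bytes < 0:
        return 0  # -1 means no size limit
    excess = sum(segment_sizes) - retention_bytes
    deleted = 0
    for size in segment_sizes:
        if excess - size < 0:
            break  # deleting this segment would undershoot the limit
        excess -= size
        deleted += 1
    return deleted


sizes = [100, 100, 100, 50]  # oldest first; the last entry is the newest segment
print(segments_deleted_by_size(sizes, 200))  # 1
print(segments_deleted_by_size(sizes, 150))  # 2
print(segments_deleted_by_size(sizes, -1))   # 0
```

With a limit of 200 only one segment goes, leaving 250 bytes: removing a second would leave 150, which is below the limit. Under this model a partition can therefore sit somewhat above log.retention.bytes, which is worth allowing for in capacity planning.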

Log Segment Deletion

Log segment deletion is triggered based on the active retention strategy. When a log segment meets the retention criteria (age or size), it becomes eligible for deletion.

Important aspects:

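  • Deletion runs in the background as a periodic retention check on each broker. The log cleaner is a separate component that handles compacted topics (cleanup.policy=compact).
  • Deletion is performed per partition.
  • Segment deletion does not modify committed consumer offsets. If a committed offset points at data that has since been deleted, however, the consumer's next fetch is out of range and its auto.offset.reset policy decides where it resumes.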
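
The interaction between segment deletion and consumer positions can be illustrated with a small Python model. Everything here is a sketch under stated assumptions: the Segment class and both function names are hypothetical, a segment is treated as expired once its newest record is older than the retention period, and the consumer logic only mirrors the earliest, latest, and none values of auto.offset.reset.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    base_offset: int       # offset of the first record in the segment
    next_offset: int       # offset just past the last record
    max_timestamp_ms: int  # timestamp of the newest record


def apply_time_retention(segments, now_ms, retention_ms):
    """Drop expired segments, oldest first, and return the survivors."""
    remaining = list(segments)
    if retention_ms < 0:
        return remaining  # no time limit
    while remaining and now_ms - remaining[0].max_timestamp_ms > retention_ms:
        remaining.pop(0)
    return remaining


def resume_offset(committed, log_start, log_end, auto_offset_reset):
    """Return the offset a consumer resumes from after retention has run."""
    if log_start <= committed <= log_end:
        return committed  # committed position still exists in the log
    if auto_offset_reset == "earliest":
        return log_start
    if auto_offset_reset == "latest":
        return log_end
    raise LookupError("committed offset is out of range and no reset policy applies")


log = [Segment(0, 100, 1_000), Segment(100, 200, 5_000), Segment(200, 250, 9_000)]
log = apply_time_retention(log, now_ms=10_000, retention_ms=6_000)
log_start, log_end = log[0].base_offset, log[-1].next_offset
print(log_start, log_end)                                 # 100 250
print(resume_offset(40, log_start, log_end, "earliest"))  # 100
print(resume_offset(150, log_start, log_end, "latest"))   # 150
```

The committed offset 40 itself is untouched, but the records it pointed at are gone, so a consumer with the earliest policy silently skips offsets 40 through 99.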

Implications of Data Retention

The choice of data retention strategy and parameters has several implications:

  1. Disk Space Usage:

    • Longer retention periods or larger size limits lead to increased disk space consumption.
    • Proper monitoring and capacity planning are essential.
  2. Data Availability:

    • Shorter retention periods may result in data being unavailable for consumers that fall behind.
    • Consumers need to process data within the retention window.
  3. Replay Scenarios:

    • Longer retention allows for replaying events from a specific point in time.
    • Useful for error recovery, auditing, or data reprocessing.
  4. Compliance and Legal Requirements:

    • Certain industries may have specific data retention requirements.
    • Retention settings should align with legal and compliance needs.
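
For the disk space and data availability points above, a back-of-the-envelope calculation is often enough to start capacity planning. The following Python sketch is a rough estimate only: both function names are hypothetical, the replication factor is an input not discussed in this article, and the formula ignores compression, index files, and the extra space held by segments that have not yet expired.

```python
def estimated_disk_bytes(bytes_per_second, retention_seconds, replication_factor):
    """Rough cluster-wide footprint of one topic under time-based retention."""
    return bytes_per_second * retention_seconds * replication_factor


def is_within_retention(consumer_lag_seconds, retention_seconds):
    """True if a lagging consumer can still read its oldest unprocessed record."""
    return consumer_lag_seconds < retention_seconds


SEVEN_DAYS = 7 * 24 * 60 * 60  # default retention period, in seconds

# 1 MB/s of incoming data, kept for 7 days, with 3 replicas
print(estimated_disk_bytes(1_000_000, SEVEN_DAYS, 3))  # 1814400000000

# A consumer that is 8 days behind has already lost data
print(is_within_retention(8 * 24 * 60 * 60, SEVEN_DAYS))  # False
```

Roughly 1.8 TB for a modest 1 MB/s stream shows why longer retention periods need to be budgeted for explicitly.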

Configuring Retention Settings

Retention settings can be configured at different levels:

  1. Broker Level:

    • Configured in the server.properties file.
    • Applies to all topics unless overridden at the topic level.
  2. Topic Level:

    • Configured with topic-level properties such as retention.ms and retention.bytes, set when the topic is created or changed later.
    • Overrides the broker-level settings for the specific topic.

# Example: set the retention period to 30 days (2,592,000,000 ms) for a topic
kafka-configs.sh --bootstrap-server localhost:9092 --alter --entity-type topics --entity-name my-topic --add-config retention.ms=2592000000
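
Because retention.ms is expressed in milliseconds, it is easy to miscount zeros when typing the value by hand. The Python sketch below derives the value from a duration and assembles the kafka-configs.sh arguments. The helper names retention_ms and alter_retention_command are hypothetical, the command uses the long-form --entity-type topics and --entity-name flags, and the code only builds the argument list without running anything.

```python
from datetime import timedelta


def retention_ms(period):
    """Convert a timedelta to whole milliseconds using integer arithmetic."""
    return period // timedelta(milliseconds=1)


def alter_retention_command(topic, period, bootstrap_server="localhost:9092"):
    """Build the kafka-configs.sh argument list for a topic-level override."""
    return [
        "kafka-configs.sh",
        "--bootstrap-server", bootstrap_server,
        "--alter",
        "--entity-type", "topics",
        "--entity-name", topic,
        "--add-config", f"retention.ms={retention_ms(period)}",
    ]


print(retention_ms(timedelta(days=30)))  # 2592000000
print(" ".join(alter_retention_command("my-topic", timedelta(days=30))))
```

On a machine where kafka-configs.sh is on the PATH, the returned list could be handed to subprocess.run(cmd, check=True) to apply the change.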