Large rows can cause CDC to miss WAL records

24 November 2025
Product: CDC
Affected Versions: v2.20.7 to v2.20.12, v2024.1.3+, v2024.2.0 to v2024.2.6, v2025.1.0 to v2025.1.2
Related Issues: #29060
Fixed In: v2.20.13, v2024.2.7, v2025.2.0

Description

Due to a bug in the affected versions, the WAL reading logic used by CDC can fail to deliver the last batch of records in the active WAL segment. As a result, CDC might not stream all the records in the active segment.

If CDC is continuously reading data from closed WAL segments, this issue does not occur.

This issue occurs only when CDC is consuming WAL data from the active segment and a record (WRITE_OP) from a single-shard transaction, larger than 4 MB, is at the end of that segment. When the issue does occur, however, all the records from the active segment are missed.

If encountered, this issue leads to a data mismatch between the source and sink databases.

Mitigation

CDC users running affected versions should upgrade to versions where the issue has been fixed.
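To decide whether a given cluster needs the upgrade, the affected ranges from the table at the top of this advisory can be encoded in a small check. The following is a minimal sketch, not an official tool: the names AFFECTED_RANGES, parse_version, and is_affected are hypothetical, the entry "v2024.1.3+" is assumed to mean every 2024.1.x patch release from 3 onward, and any build suffix after the third version component is ignored.

```python
import re

# Affected ranges copied from the advisory: (series, first affected patch, last affected patch).
# None as the upper bound models "v2024.1.3+" (assumption: all later 2024.1.x patches are affected).
AFFECTED_RANGES = [
    ((2, 20), 7, 12),
    ((2024, 1), 3, None),
    ((2024, 2), 0, 6),
    ((2025, 1), 0, 2),
]


def parse_version(text):
    """Extract (major, minor, patch) from strings such as 'v2024.2.6' or '2024.2.6.0-b94'."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def is_affected(text):
    """Return True if the version falls inside one of the affected ranges listed above."""
    major, minor, patch = parse_version(text)
    for series, first, last in AFFECTED_RANGES:
        in_series = (major, minor) == series
        if in_series and patch >= first and (last is None or patch <= last):
            return True
    return False
```

For example, is_affected("v2024.2.6") is True, while is_affected("v2024.2.7") is False because that release contains the fix.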

There is no workaround to stream the missed records using the same CDC stream.

If you hit this issue, first upgrade to a version that contains the fix, then create a new stream, take the snapshot again, and start streaming the records. This ensures that no records are missed, because the records missed earlier are streamed by the new stream as part of its snapshot.

Details

The cdcsdk_producer calls PeerMessageQueue::ReadReplicatedMessagesForConsistentCDC() to read the complete data from the active segment. Internally, the producer then calls LogCache::ReadOps() to read the data from the WAL. If the batch of records containing the last committed record exceeds 4 MB (FLAGS_consensus_max_batch_size_bytes), then none of the records in the active segment might be delivered to the cdcsdk_producer.

On receiving an empty batch of records from the WAL, cdcsdk_producer wrongly concludes that there are no more records to stream in the WAL and moves the safe_hybrid_time forward to the current tablet leader safe time. If the commit time of the missed records is less than the tablet leader safe time, this breaks the following invariant maintained by the CDC service: "If the safe_hybrid_time is x, then we have streamed all the records with commit time less than or equal to x for that tablet." As a result, in subsequent GetChanges calls, even if the previously missed records are read from the WAL, cdcsdk_producer filters them out, and they are never streamed in the CDC stream.
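The sequence above can be illustrated with a toy model. This is a simplified sketch and not the actual cdcsdk_producer code: Record, ToyProducer, get_changes, and demo are hypothetical names, hybrid times are modeled as plain integers, and advancing the safe time to the highest streamed commit time is a simplification made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Record:
    key: str
    commit_time: int  # stand-in for a hybrid time


class ToyProducer:
    """Hypothetical model of the safe-time bookkeeping; not the real cdcsdk_producer."""

    def __init__(self):
        self.safe_hybrid_time = 0
        self.streamed = []

    def get_changes(self, batch, leader_safe_time):
        if not batch:
            # An empty batch is read as "nothing left in the WAL", so the safe
            # time jumps forward to the tablet leader safe time.
            self.safe_hybrid_time = max(self.safe_hybrid_time, leader_safe_time)
            return []
        # Records at or below the safe time are assumed to be already streamed
        # (the invariant) and are filtered out.
        fresh = [r for r in batch if r.commit_time > self.safe_hybrid_time]
        self.streamed.extend(fresh)
        if fresh:
            # Simplification: advance the safe time to the newest streamed commit.
            self.safe_hybrid_time = max(r.commit_time for r in fresh)
        return fresh


def demo():
    wal = [Record("a", 10), Record("b", 20)]
    producer = ToyProducer()
    # Faulty read: the WAL returns an empty batch although records exist.
    producer.get_changes([], leader_safe_time=30)
    # A later read does return the records, but both commit times are <= 30.
    delivered = producer.get_changes(wal, leader_safe_time=40)
    return delivered, producer.safe_hybrid_time
```

In this model, demo() delivers no records: once the safe time has moved to 30, the records committed at 10 and 20 are treated as already streamed. A producer that receives the same two records on its first read streams both of them.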