Cloudflare revealed that it lost 55% of logs sent to customers during a 3.5-hour window on November 14, 2024, due to a bug in its log collection service.

The issue significantly impacted customers relying on the service for traffic monitoring, security analysis, and site optimization.

Cloudflare’s Logpush service lets customers export logs to external platforms such as Amazon S3, Google Cloud Storage, and Splunk for further analysis. Cloudflare processes over 50 trillion customer event logs daily, of which roughly 4.5 trillion are sent to customers, so losing more than half of that flow for several hours was a considerable disruption.

The incident stemmed from a misconfiguration in Logfwdr, a core component of Cloudflare’s logging pipeline responsible for directing event logs to downstream systems. A configuration update inadvertently pushed a blank configuration, causing the system to incorrectly assume that no logs needed forwarding. This misstep resulted in the logs being discarded entirely.

Logfwdr is designed with a failsafe that forwards all logs when its configuration is invalid, but the safeguard backfired, creating a massive spike in log processing volume. The influx overwhelmed Buftee, a distributed buffering system meant to temporarily store logs when downstream systems are overloaded. The surge, 40 times Buftee’s capacity, caused the system to fail completely within five minutes, requiring a full restart and exacerbating the log loss.
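
To make the amplification concrete, here is a minimal Python sketch of a "fail open" rule of the kind described above. It is not Cloudflare's code: the function name `customers_to_forward`, the customer IDs, and the counts (1,000 customers, 25 with log export configured) are all invented for illustration.

```python
def customers_to_forward(config, all_customers, fail_open=True):
    """Return the customers whose logs should be forwarded.

    `config` is the set of customers with log export configured.
    An empty config is treated as invalid (hypothetical rule).
    """
    if config:
        return set(config) & set(all_customers)
    # Fail open: forward everything rather than risk dropping logs.
    # Fail closed: forward nothing.
    return set(all_customers) if fail_open else set()


# Hypothetical numbers, for illustration only.
all_customers = {f"customer-{i}" for i in range(1000)}
configured = {f"customer-{i}" for i in range(25)}

normal = customers_to_forward(configured, all_customers)
blank = customers_to_forward(set(), all_customers)

# How much more work the downstream buffer sees after the blank config.
amplification = len(blank) / len(normal)
print(len(normal), len(blank), amplification)
```

In this toy setup, a blank configuration makes the forwarder fan out to every customer instead of the configured subset, so whatever sits downstream suddenly receives many times its usual load.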

Cloudflare acknowledged that Buftee’s own resource limits and throttling mechanisms also failed due to improper configuration and insufficient testing, highlighting critical gaps in its system’s resilience.
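
As a rough illustration of what a working resource limit looks like, the sketch below shows a buffer with a hard capacity that sheds excess load instead of failing. It says nothing about how Buftee is actually built; the `BoundedBuffer` class, its capacity of 100, and the burst of 4,000 items are assumptions made up for this example.

```python
from collections import deque


class BoundedBuffer:
    """Hypothetical buffer that rejects new items once it is full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()
        self.dropped = 0

    def offer(self, item):
        """Try to enqueue; return False and count a drop when full."""
        if len(self.items) >= self.capacity:
            self.dropped += 1
            return False
        self.items.append(item)
        return True

    def drain(self, n):
        """Remove and return up to n items in arrival order."""
        out = []
        while self.items and len(out) < n:
            out.append(self.items.popleft())
        return out


buf = BoundedBuffer(capacity=100)
# A burst far larger than the buffer can hold.
accepted = sum(buf.offer(i) for i in range(4000))
print(accepted, buf.dropped, len(buf.items))
```

Dropping the overflow still loses data, but the buffer keeps running and the loss is bounded and countable, which is the point of a limit that is configured and tested in advance.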

To prevent similar incidents, Cloudflare has introduced stricter measures, including a misconfiguration detection and alerting system to catch issues early, better configuration of Buftee to handle spikes, and routine overload testing to ensure failsafe mechanisms can handle unexpected surges.
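
A misconfiguration check of the kind mentioned above could, in its simplest form, refuse suspicious updates and keep the last known good configuration. The following Python sketch is one possible shape for such a guard, not a description of Cloudflare's system; `ConfigGuard`, its `max_drop_ratio` threshold, and the alert strings are hypothetical.

```python
class ConfigGuard:
    """Hypothetical guard that validates config updates before applying them."""

    def __init__(self, initial, max_drop_ratio=0.5):
        self.active = set(initial)            # last known good configuration
        self.max_drop_ratio = max_drop_ratio  # assumed threshold
        self.alerts = []

    def apply(self, new_config):
        """Apply new_config if it looks sane; otherwise keep the old one."""
        new = set(new_config)
        if not new:
            self.alerts.append("rejected: empty configuration")
            return False
        removed = len(self.active - new)
        if self.active and removed / len(self.active) > self.max_drop_ratio:
            self.alerts.append(f"rejected: update removes {removed} entries")
            return False
        self.active = new
        return True


guard = ConfigGuard({"a", "b", "c", "d"})
blank_ok = guard.apply(set())            # blank update is refused
small_ok = guard.apply({"a", "b", "c"})  # small change is accepted
large_ok = guard.apply({"a"})            # removes 2 of 3 entries, refused
print(blank_ok, small_ok, large_ok, sorted(guard.active), guard.alerts)
```

The design choice here is to treat an empty or drastically smaller configuration as a likely mistake that needs human attention, rather than silently acting on it.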

While the company’s transparency about the incident and the steps taken to address it provide some reassurance, the event underscores the complexity and challenges of managing large-scale logging systems in the face of unprecedented data volumes.