Cloudflare says it lost 55% of logs pushed to customers for 3.5 hours

November 27, 2024 at 11:17AM

On November 14, 2024, a bug in Cloudflare's log collection service caused the loss of about 55% of the logs it would normally push to customers over a 3.5-hour period. A misconfiguration in the Logfwdr system caused logs to be discarded, and the failsafe that followed overwhelmed Buftee, the buffering system. Cloudflare says it has since implemented measures to prevent a recurrence.

### Key Takeaways:

1. **Incident Overview**: On November 14, 2024, Cloudflare experienced a significant incident that resulted in the loss of 55% of customer logs over a 3.5-hour period due to a bug in its log collection service.

2. **Affected Services**: The incident impacted Cloudflare Logs, which are pivotal for customers needing to monitor traffic, troubleshoot issues, and analyze security incidents.

3. **Root Cause**: A misconfiguration in the Logfwdr component of Cloudflare’s logging pipeline led to a ‘blank configuration’ being issued, which incorrectly indicated that there were no customers for log forwarding. Consequently, the logs were discarded.

4. **Failsafe Failure**: The failsafe mechanism intended to prevent data loss instead caused an overwhelming spike in log volume (about 40 times the normal level), which overloaded Buftee, the log buffering system. Buftee's own safeguards were overwhelmed as well, and the service shut down.

5. **Log Volume Statistics**: Cloudflare processes over 50 trillion event logs daily, of which approximately 4.5 trillion are sent to customers. At that scale, losing 55% of customer-bound logs for 3.5 hours affects a very large number of events.

6. **Response and Remediation**:
   - Implementing a dedicated misconfiguration detection and alerting system to identify anomalies in log forwarding configurations.
   - Reconfiguring Buftee correctly so that spikes in log volume do not cause service outages.
   - Running routine overload tests to ensure system resilience against unexpected data surges.

7. **Next Steps**: Enhance monitoring and testing protocols to improve system robustness and prevent future incidents.
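
Takeaways 3, 4, and 6 describe a blank configuration being accepted as valid and a plan to detect such misconfigurations. The sketch below shows one way a guard of that kind could work: it rejects a proposed forwarding configuration that is blank or drops most customers, keeps the last known-good configuration, and raises an alert. This is only an illustration under stated assumptions. The names `ForwardingConfig`, `validate_update`, `apply_update`, and the `max_drop_ratio` threshold are hypothetical and do not come from Cloudflare's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ForwardingConfig:
    """Hypothetical config: the customer IDs whose logs should be forwarded."""
    customers: frozenset = frozenset()


class SuspiciousConfigError(ValueError):
    """Raised when a proposed config looks like a misconfiguration."""


def validate_update(current, proposed, max_drop_ratio=0.5):
    """Return `proposed` if it looks plausible relative to `current`.

    ASSUMPTION: a sudden loss of more than `max_drop_ratio` of the
    existing customers is treated as an anomaly, not a real change.
    """
    if current.customers and not proposed.customers:
        raise SuspiciousConfigError("proposed config is blank")
    if current.customers:
        removed = len(current.customers - proposed.customers)
        if removed / len(current.customers) > max_drop_ratio:
            raise SuspiciousConfigError(
                f"proposed config removes {removed} of "
                f"{len(current.customers)} customers"
            )
    return proposed


def apply_update(current, proposed, alert=print):
    """Apply `proposed`, or keep the last known-good config and alert."""
    try:
        return validate_update(current, proposed)
    except SuspiciousConfigError as exc:
        alert(f"config rejected: {exc}")
        return current
```

With this guard, a blank update leaves the active configuration unchanged instead of telling the forwarder that there are no customers, and the alert gives operators a signal to investigate.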
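
To put the volumes in takeaway 5 in perspective, here is a back-of-the-envelope calculation using only the figures quoted above. It assumes log traffic is spread evenly across the day, which is a simplification; the summary does not state how many logs were actually lost, so the result is a rough order-of-magnitude estimate rather than a reported number.

```python
# Figures quoted in the takeaways above.
TOTAL_LOGS_PER_DAY = 50e12       # "over 50 trillion" event logs per day
CUSTOMER_LOGS_PER_DAY = 4.5e12   # approximately 4.5 trillion sent to customers
INCIDENT_HOURS = 3.5             # duration of the incident
LOSS_FRACTION = 0.55             # share of customer logs lost


def customer_share():
    """Fraction of all processed logs that are sent to customers."""
    return CUSTOMER_LOGS_PER_DAY / TOTAL_LOGS_PER_DAY


def estimated_lost_logs():
    """Rough count of customer logs lost during the incident.

    ASSUMPTION: traffic is uniform over 24 hours, so the incident
    window carries INCIDENT_HOURS / 24 of the daily volume.
    """
    logs_in_window = CUSTOMER_LOGS_PER_DAY * (INCIDENT_HOURS / 24)
    return logs_in_window * LOSS_FRACTION


if __name__ == "__main__":
    print(f"customer share of all logs: {customer_share():.0%}")
    print(f"estimated logs lost: {estimated_lost_logs():.2e}")
```

Under that uniform-traffic assumption, about 9% of processed logs are customer-bound, and the estimate comes out at roughly 360 billion lost log events.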
