Afterwards, Cloudflare CEO Matthew Prince said that this round of downtime was not caused by a hacker DDoS attack but by server overload: "A bug in the Cloudflare Web Application Firewall (WAF) service caused a significant increase in CPU usage, crashing both the primary and backup systems and causing users' websites to return 502 errors. Cloudflare temporarily shut down part of the WAF's functionality, fixed the bug that caused the problem, and re-launched the WAF service."
Cloudflare also explained the ins and outs of the entire incident on its official blog. The full text of the post is as follows:
Description of the global large-scale Cloudflare 502 error incident: service interruption caused by improper software deployment
Starting at around 21:52 Beijing time on July 2nd and lasting about 30 minutes, users around the world who visited sites served by Cloudflare received 502 errors, caused by a sharp increase in CPU usage across the Cloudflare network. The CPU spike was triggered by an incorrect configuration pushed to production. After the configuration was rolled back and services were restarted, normal operation resumed and all domains using Cloudflare returned to normal traffic levels.
This was not an attack (as some people have speculated), and we are very sorry that this incident occurred. Internal teams have been meeting, and our CTO, John Graham-Cumming, has written a post-mortem analysis of how this happened and how we can prevent it from happening again.
The specific cause of this global outage: during a routine deployment of new Cloudflare WAF managed rules, a misconfigured rule was deployed to the Cloudflare Web Application Firewall (WAF). One of the rules contained a regular expression that drove CPU usage on the nodes processing it to 100%.
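The post does not reproduce the offending expression, but the failure mode it describes is characteristic of catastrophic backtracking, where a poorly constructed regular expression forces the matcher to explore an exponential number of paths. The following Python sketch uses an illustrative pattern, not Cloudflare's actual rule, to show how matching time explodes with input length:

```python
import re
import time

# Illustrative pattern prone to catastrophic backtracking: the nested
# quantifiers let the engine split the run of "a"s in exponentially
# many ways before concluding there is no match.
PATTERN = re.compile(r"^(a+)+$")

for n in range(16, 25, 2):
    subject = "a" * n + "b"  # the trailing "b" guarantees a failed match
    start = time.perf_counter()
    PATTERN.match(subject)
    elapsed = time.perf_counter() - start
    print(f"input length {n + 1:3d}: {elapsed:.3f}s")
```

Each additional character roughly doubles the work, so a single long request field can pin a CPU core for seconds or more, which is consistent with the 100% CPU saturation described above.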
Because the automated test suite exercised the WAF rule only in simulation mode, it passed testing and was pushed to all global CDN nodes in a single deployment, driving CPU usage on machines across the global cluster to 100%. This 100% CPU spike ultimately led to the 502 errors seen by a large number of users; at the worst point, it affected 82% of total traffic.
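Cloudflare has not published its test harness, but a minimal sketch of the kind of check that a simulation-only suite can miss might look like the following: measure the wall-clock cost of each candidate rule against representative request samples and reject any rule that exceeds a budget before it is pushed globally. The names `check_rule_cost`, `sample_requests`, and `TIME_BUDGET_S` are hypothetical, not part of Cloudflare's tooling.

```python
import re
import time

# Hypothetical per-rule latency budget (seconds) for a single request body.
TIME_BUDGET_S = 0.005

def check_rule_cost(pattern: str, sample_requests: list[str]) -> bool:
    """Return True only if the rule stays within budget on every sample."""
    compiled = re.compile(pattern)
    for body in sample_requests:
        start = time.perf_counter()
        compiled.search(body)
        if time.perf_counter() - start > TIME_BUDGET_S:
            return False  # reject the rule instead of pushing it globally
    return True

# Example: a backtracking-prone rule fails the check on a long field.
samples = ["x=" + "a" * 24 + "b"]
print(check_rule_cost(r"^x=(a+)+$", samples))  # likely False: over budget
print(check_rule_cost(r"^x=a+b$", samples))    # True: linear-time pattern
```

In practice such a harness would enforce a hard timeout, or use a non-backtracking engine such as RE2, rather than merely measuring elapsed time; the point of the sketch is that rule cost is evaluated before a rule ever reaches production.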
An incident like this causes real harm to customers, and our existing automated testing process is clearly not perfect. We will continue to review and improve our testing and deployment processes to prevent incidents like this in the future.