Even Famous Services Have Gone Down
"The server crashed," "The site went down" - you hear these phrases in the news all the time, but what's actually happening? Even the world's largest services can't avoid outages. And the causes are often surprisingly mundane.
Notable Service Outage Incidents
Facebook's 6-Hour Outage (October 2021)
On October 4, 2021, Facebook, Instagram, and WhatsApp went completely offline for approximately 6 hours. The cause was a network configuration error. A command issued during routine maintenance disconnected the backbone network linking Facebook's data centers, and as a result the company's BGP routes were withdrawn from the internet.
With no routes advertised, Facebook's network "vanished" from the internet, and DNS name resolution failed because the company's name servers could no longer be reached. Furthermore, because internal tools ran on the same network, engineers lost the very means to access and fix the problem. Ultimately, they had to physically travel to the data centers and restore service on site.
AWS Major Outage (February 2017)
On February 28, 2017, Amazon Web Services' S3 (storage service) went down for approximately 4 hours in the US-EAST-1 region, affecting numerous services including Netflix, Slack, and Trello. The cause was a mistyped command: while debugging, an engineer entered an incorrect input and removed more servers than intended.
The ironic aspect of this outage was that AWS's own status page depended on S3, so for a time it could not display accurate outage information.
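This outage and the Facebook incident share a pattern: the tool needed during the outage depended on the very system that had failed. As a rough illustration only, the Python sketch below walks a dependency map to look for that kind of hidden dependency. The component names and the map itself are invented for this example and do not describe how AWS or Facebook are actually built.

```python
from typing import Dict, List

# Hypothetical dependency map: each component lists what it needs to run.
DEPENDENCIES: Dict[str, List[str]] = {
    "status-page": ["static-assets"],
    "static-assets": ["object-storage"],
    "object-storage": [],
    "external-status-page": ["third-party-host"],
    "third-party-host": [],
}


def depends_on(deps: Dict[str, List[str]], component: str, target: str) -> bool:
    """Return True if `component` needs `target`, directly or indirectly."""
    seen = set()
    stack = list(deps.get(component, []))
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in seen:
            continue  # already explored; also guards against cycles
        seen.add(node)
        stack.extend(deps.get(node, []))
    return False


# A status page that indirectly relies on the storage service goes dark
# together with it; one hosted elsewhere does not.
risky = depends_on(DEPENDENCIES, "status-page", "object-storage")
safe = depends_on(DEPENDENCIES, "external-status-page", "object-storage")
```

In this made-up map, `risky` is `True` because the dependency is indirect (through `static-assets`), which is exactly the kind of link that is easy to overlook.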
Cloudflare Outage (June 2022)
Cloudflare, a CDN service used by approximately 20% of the world's websites, experienced an outage on June 21, 2022, affecting numerous services including Discord, Shopify, and Fitbit. The cause was a network configuration change, made as part of work to improve resilience, that unintentionally took 19 of its busiest data centers offline.
Google's 47-Minute Total Service Outage (December 2020)
On December 14, 2020, nearly all Google services, including Gmail, YouTube, Google Drive, and Google Maps, went down for approximately 47 minutes. The cause was a storage quota problem in Google's central authentication system: left with too little allocated storage, it could no longer verify logins. Every service requiring login was affected.
Main Reasons Websites Go Down
- Traffic spikes (overload): A surge of visitors from a popular ticket sale, the start of a sale, or breaking news exceeds the servers' processing capacity
- Configuration errors (human error): Outages caused by engineer mistakes are extremely common. Both the Facebook BGP incident and the AWS S3 incident came down to human error
- Software bugs: A bug shipped in an update goes unnoticed in testing and surfaces only in the production environment
- DDoS attacks: Attacks that intentionally flood servers with massive amounts of traffic to bring them down
- DNS failures: When DNS breaks, users can't access the site even though the server itself is functioning normally
- Certificate expiration: When an HTTPS certificate expires, browsers display warnings and block access
- Physical failures: Data center power outages, cooling system failures, undersea cable cuts
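Of the causes above, certificate expiration is one that can be seen coming. The following Python sketch, using only the standard library, computes how many days remain on a certificate. The function names and the 14-day threshold are assumptions chosen for illustration, not a recommended policy.

```python
import socket
import ssl
import time


def days_until_expiry(not_after: str, now: float) -> float:
    """Days between `now` (epoch seconds) and a certificate's notAfter string.

    `not_after` uses the format returned by getpeercert(),
    e.g. "Jan  5 09:34:43 2018 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    return (expires_at - now) / 86400


def expires_soon(not_after: str, now: float, threshold_days: float = 14.0) -> bool:
    """True if the certificate expires within `threshold_days` of `now`."""
    return days_until_expiry(not_after, now) < threshold_days


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the notAfter field of the certificate that `host` presents.

    If the certificate has already expired, the TLS handshake fails and
    this raises ssl.SSLCertVerificationError instead of returning.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def check_host(host: str, threshold_days: float = 14.0) -> bool:
    """Connect to `host` and report whether its certificate expires soon."""
    return expires_soon(fetch_not_after(host), time.time(), threshold_days)
```

Only `fetch_not_after` and `check_host` touch the network; the date arithmetic is kept separate so it can be checked without a connection.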
How to Check If a Site Is "Down"
When you can't access a site, a few quick checks can tell you whether the problem is on your side or the site's.
- Down Detector: A site that aggregates outage reports from users worldwide in real time
- isitdown.site: A site that checks whether a specified URL is accessible from various locations around the world
- IP Check-san: A site for first verifying that your own internet connection is working. If your IP address is displayed, your connection is fine
- Try from a different device or network: Try accessing via your smartphone's mobile data connection
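The same line of reasoning can be roughly approximated in a few lines of Python. The sketch below is a simplification and not how Down Detector or the other tools work: it only tests name resolution and a TCP connection from your own machine. The reference host (example.com), the port, the timeout, and the result labels are all assumptions made for this example.

```python
import socket


def can_resolve(host: str) -> bool:
    """True if DNS returns at least one address for `host`."""
    try:
        socket.getaddrinfo(host, 443)
        return True
    except socket.gaierror:
        return False


def can_connect(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """True if a TCP connection to `host` succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def diagnose(reference_ok: bool, dns_ok: bool, connect_ok: bool) -> str:
    """Turn three check results into a rough verdict."""
    if not reference_ok:
        return "your connection"  # even a known-good site is unreachable
    if not dns_ok:
        return "dns"  # the name does not resolve; the server may be fine
    if not connect_ok:
        return "site"  # the name resolves but the server does not answer
    return "reachable"


def check(target: str, reference: str = "example.com") -> str:
    """Run the checks for `target`, using `reference` as a known-good site."""
    return diagnose(can_connect(reference), can_resolve(target), can_connect(target))
```

A result of "site" from a single machine is only a hint. Checking from a second network, as in the last item above, is still worthwhile, because a route can fail between you and the site while other users reach it normally.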
Summary
Even the world's largest services can go down for hours due to a single configuration mistake. The causes of website outages range from traffic spikes and human error to DDoS attacks and DNS failures. Next time you can't access a site, first check your connection on IP Check-san, then check Down Detector for site-side outage information.