As of December 5, 2025, the global internet is once again grappling with reliability questions after a fresh Cloudflare disruption on top of recent outages at Amazon Web Services (AWS), Microsoft Azure and other core infrastructure providers. At the same time, AWS is rolling out a new Route 53 “Accelerated Recovery” feature and partnering with Google Cloud on a multicloud link designed to make it easier to ride out future failures. [1]
This combination of repeated outages and rapid engineering responses is reshaping the debate about whether the cloud can still be treated as “always on” — especially for industries and platforms that simply cannot go dark, from banking to massively popular games like Fortnite and Roblox. [2]
A bad season for the cloud: AWS, Azure and Cloudflare all stumble
The October AWS outage that “broke the internet”
On October 20, 2025, AWS suffered a major disruption centred on its US‑EAST‑1 (N. Virginia) region, the company’s busiest and most failure‑prone hub. The incident disrupted thousands of websites and apps globally, including Snapchat, Reddit and other high‑traffic consumer services. [3]
Analysis from internet monitoring firms later found the outage lasted around 15 hours and rippled through multiple AWS services. Organizations that concentrated most or all workloads in a single region found their businesses effectively frozen, while those with multi‑region or multi‑cloud failover saw only brief hiccups. [4]
Economic estimates underscore the stakes: one analytics firm, cited by Reuters and other outlets, projected that the October AWS incident alone would cost U.S. companies between $500 million and $650 million in lost revenue and productivity. [5]
Azure outages highlight that AWS is not alone
Microsoft’s cloud has had its own difficult autumn. A large Azure incident on October 29 disrupted services such as Office 365, Teams, Outlook, Xbox Live and major retail and airline platforms hosted on Azure, with many customers facing hours‑long disruption while Microsoft rolled back a faulty change. [6]
Just days later, a separate “thermal event” in a West Europe Azure data center caused multiple storage units to shut down after elevated temperatures, triggering cascading issues as power and cooling systems struggled to recover. Microsoft’s own status history details temperature breaches, power sag and manual intervention to restart cooling. [7]
These incidents underline a key point: hyperscale clouds run on physical data centers, power and cooling — and sometimes that infrastructure fails in very old‑fashioned ways.
Cloudflare’s double outage — including a new disruption today
Then there’s Cloudflare, the content‑delivery and security provider that sits in front of an enormous slice of the modern web.
- On November 18, Cloudflare suffered a major outage that took down or impaired services from Spotify and ChatGPT to payment providers and retailers. Cloudflare’s post‑mortem traced the incident to a problem in a “bot management” module within its core proxy pipeline, not to a cyberattack. [8]
- Today, December 5, Cloudflare is battling another disruption. Users reported issues globally with sites like Canva, outage tracker Downdetector and several trading platforms, while Cloudflare acknowledged problems with its dashboard and APIs on its status page. News outlets describe it as the second major disruption in less than a month. [9]
When Cloudflare stumbles, the impact is similar to an AWS or Azure outage: hundreds of millions of users can struggle to reach websites that did nothing “wrong” except depend on a shared infrastructure layer.
Fortnite, Roblox and the “is AWS down?” confusion
If October’s AWS outage showed what happens when a core region fails, late November showed how messy things become when users see mass breakage but the root cause is unclear.
On November 26, gamers across the world suddenly found Fortnite, Roblox, PlayStation Network, Xbox Live, Electronic Arts services, Alexa and other platforms glitching or unavailable. Outage‑tracking site Downdetector lit up as reports surged for AWS and these services, prompting headlines that “AWS is down” ahead of the critical Black Friday shopping period. [10]
Later that day, AWS pushed back. In a statement shared with media and via its Newsroom account, the company said its services were operating normally and that an “event elsewhere on the internet” had triggered speculation about an AWS‑wide outage. Amazon pointed users to the AWS Health Dashboard as the only authoritative source for its service status. [11]
Regardless of where the fault originated, the lesson for consumers and businesses was the same: when a single point of infrastructure goes wrong — an upstream network, a DNS provider, a CDN or a dominant cloud region — stacked dependencies can cause many unrelated brands to fail at once.
Forbes: cloud reliability has become a strategic risk
These incidents are part of a broader pattern flagged by commentators and cloud architects.
In a recent Forbes piece, venture investor and technologist Sanjit Singh Dang argues that the latest AWS and Azure failures expose a growing reliability problem in hyperscale clouds. As more critical workloads — payments, healthcare, logistics, government services — move to the cloud, “even a single outage” can trigger system‑wide disruption. The article warns that architects and executives now have to rethink redundancy, failover and vendor concentration as board‑level risks, not just technical nice‑to‑haves. [12]
The core message: the cloud has not “failed” as a model, but its risk profile has changed. Outages are inevitable in complex distributed systems; what’s unacceptable is acting as if they will never happen.
AWS’s counter‑move: Route 53 Accelerated Recovery for DNS
AWS’s newest answer to these concerns is a feature aimed at one of the most fundamental pieces of internet plumbing: DNS (Domain Name System).
On November 26, AWS announced Amazon Route 53 Accelerated Recovery for managing public DNS records, a new business‑continuity capability specifically designed to keep DNS management working even during a regional disruption in US‑EAST‑1 (N. Virginia). [13]
Key points from the launch:
- 60‑minute recovery time objective (RTO)
Accelerated Recovery is designed so customers can continue to make critical DNS changes within about an hour of a service disruption in the affected region. That’s meant to give teams enough time to redirect traffic or bring standby infrastructure online in another region. [14]
- Access to core Route 53 APIs during failover
AWS says the feature ensures continued access to essential Route 53 operations — such as changing resource record sets, checking the status of changes, listing hosted zones and enumerating DNS records — using the same API endpoints and tooling you already have. No new API surface, no custom integration layer. [15]
- Simple opt‑in, no extra cost
Accelerated Recovery is disabled by default; customers enable it per hosted zone through the console, CLI, SDKs or infrastructure‑as‑code tools like CloudFormation and CDK. AWS emphasizes that it comes at no additional charge, which is notable given that DNS is one of the levers that directly influences recovery time during an outage. [16]
Why DNS matters here: if your primary region fails, you can only shift user traffic to healthy regions or different providers as quickly as you can change DNS — and have those changes propagate. Locking DNS changes to a failed control plane can be a catastrophic bottleneck. Route 53 Accelerated Recovery is AWS’s attempt to ensure that, at minimum, the steering wheel still works when part of the vehicle is on fire.
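To make that concrete, here is a minimal sketch of the kind of emergency DNS change Accelerated Recovery is meant to keep possible, using the standard Route 53 ChangeResourceRecordSets and GetChange APIs via boto3. The hosted zone ID, record name and standby target are hypothetical placeholders, and the cut‑over shown (a CNAME pointed at a pre‑provisioned standby endpoint) is one illustrative pattern, not AWS's prescribed procedure.

```python
# Minimal sketch: re-point a public DNS record at a standby region during a
# regional incident. Zone ID, record name and target below are hypothetical.
import boto3

route53 = boto3.client("route53")  # same endpoints and tooling as day-to-day use

HOSTED_ZONE_ID = "Z0123456789EXAMPLE"               # hypothetical public hosted zone
RECORD_NAME = "app.example.com."                     # hypothetical customer-facing record
STANDBY_TARGET = "standby.eu-west-1.example.com."    # pre-provisioned standby endpoint

response = route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Comment": "Emergency failover away from the impaired region",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": RECORD_NAME,
                    "Type": "CNAME",
                    "TTL": 60,  # low TTL so the cut-over propagates quickly
                    "ResourceRecords": [{"Value": STANDBY_TARGET}],
                },
            }
        ],
    },
)

# GetChange confirms the change has reached Route 53's authoritative name
# servers (status moves from PENDING to INSYNC).
print(route53.get_change(Id=response["ChangeInfo"]["Id"])["ChangeInfo"]["Status"])
```

Because AWS describes Accelerated Recovery as exposing the same API surface, a script like this should not need to change; the point of the feature is that calls of this kind keep working even when the primary control‑plane region is impaired.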
AWS and Google Cloud: from rivals to resilience partners
Another major development this week is a rare collaboration between cloud competitors.
On December 1, Amazon and Google announced a jointly developed multicloud networking service that lets customers stand up private, high‑speed links between AWS and Google Cloud in minutes instead of weeks. The service combines AWS’s Interconnect–multicloud capabilities with Google Cloud’s Cross‑Cloud Interconnect, creating a simpler path for traffic and data to flow between the two platforms. [17]
A few important details:
- The offering is explicitly framed as a way to withstand outages like the October AWS incident, which knocked some of the internet’s most popular apps offline and led to huge estimated financial losses. [18]
- Early adopters reportedly include Salesforce, a flagship SaaS player whose own uptime obligations are stringent. [19]
- AWS has signalled plans to extend similar direct connectivity to Microsoft Azure in the coming year, further normalizing multicloud architectures as a hedge against any single provider’s downtime. [20]
Taken together with Route 53 Accelerated Recovery, this sends a clear signal: even the largest cloud providers now recognize they must help customers route around failures quickly when things go wrong, whether that means failing over to another AWS region or to a completely different cloud.
What the outages really tell us about the state of the cloud
Looking across AWS, Azure and Cloudflare, several themes emerge.
1. Concentration risk is real
The October AWS outage again centred on US‑EAST‑1, a region that has now suffered multiple high‑impact incidents in recent years. Analysts and post‑mortems highlight how concentrating control planes and critical services in one geography amplifies risk, especially when so many businesses treat that region as the “default” for new workloads. [21]
Similarly, Cloudflare’s outages show what happens when a single CDN and security provider fronts vast swathes of the web. When its control plane or configuration pipeline misbehaves, everything behind it can appear broken — even if every origin server is perfectly healthy. [22]
2. Failures are increasingly complex, not just “a server down”
Recent incident reports describe:
- cascading software bugs in health monitoring subsystems and load balancer orchestration at AWS [23]
- configuration changes in Cloudflare’s bot‑management module taking down frontline proxies [24]
- thermal and power anomalies in Azure data centers rippling into storage and compute services [25]
These are not simple hardware failures; they reflect complex interactions between automation, configuration management, physical infrastructure and global scale.
3. The human and economic impact is no longer “just tech”
When Fortnite, Roblox, major banking apps, productivity suites and trading platforms go offline, the disruption hits:
- everyday consumers (who suddenly can’t game, trade, work or pay)
- merchants and streamers whose income depends on always‑on platforms
- enterprises whose SLAs, reputations and even regulatory standing are tied to uptime
The October AWS outage alone carried nine‑figure cost estimates, and today’s Cloudflare disruption briefly affected trading and creative tools that professionals rely on. [26]
How enterprises should respond: from “cloud first” to “resilience first”
The big providers are evolving, but the responsibility for resilience ultimately sits with the organizations building on top of them. The message from experts, post‑mortems and even AWS’s own Route 53 launch is converging on a few concrete practices.
1. Architect for multi‑zone and multi‑region from day one
It is no longer defensible for critical services to live entirely in a single availability zone or region. At minimum:
- spread workloads across multiple AZs within a region
- design active‑active or active‑standby deployments across multiple regions, with regular failover drills
- ensure databases, stateful systems and queues are replicated and recoverable elsewhere
Outage analyses of the October AWS event show a stark divide between organizations that had cross‑region failover ready and those that didn’t. [27]
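As one illustrative example of the third point, the sketch below adds a second‑region replica to an existing DynamoDB table using boto3, so the data layer has a copy outside the primary region. The table name and regions are hypothetical, the call assumes DynamoDB global tables (which require Streams to be enabled on the table), and comparable replication patterns exist for relational databases, object storage and queues.

```python
# Minimal sketch: replicate a stateful store into a second region.
# Table name and regions are hypothetical; assumes the table already has
# DynamoDB Streams enabled, which global tables require.
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

dynamodb.update_table(
    TableName="orders",
    ReplicaUpdates=[
        {"Create": {"RegionName": "us-west-2"}},  # standby/active copy elsewhere
    ],
)

# The replica is usable once its status reaches ACTIVE; failover drills should
# exercise reads and writes against the secondary region as well.
print(dynamodb.describe_table(TableName="orders")["Table"].get("Replicas", []))
```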
2. Treat DNS as a core part of your disaster‑recovery plan
AWS’s Route 53 Accelerated Recovery is a timely reminder: if DNS is stuck, your failover is stuck.
Practical steps include:
- enabling Route 53 health checks and failover routing policies
- opting into Accelerated Recovery for public zones that front mission‑critical services
- keeping DNS TTLs low enough to make rapid cut‑overs feasible (but not so low they overload clients and resolvers)
- rehearsing DNS‑based region or provider failover in controlled drills
Regulated industries like banking and FinTech — explicitly mentioned in AWS’s announcement — are already under pressure to demonstrate that DNS will not become a single point of failure in a crisis. [28]
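Here is a minimal sketch of the first two steps above, assuming a Route 53 public hosted zone with a primary and a standby endpoint (the zone ID, hostnames and IPs are placeholders, with addresses taken from the documentation ranges). It creates a health check for the primary and publishes PRIMARY/SECONDARY failover records with a short TTL.

```python
import uuid
import boto3

route53 = boto3.client("route53")

ZONE_ID = "Z0123456789EXAMPLE"   # hypothetical hosted zone
PRIMARY_IP = "192.0.2.10"        # documentation-range placeholder addresses
SECONDARY_IP = "198.51.100.10"

# 1) Health check against the primary region's public endpoint.
health_check = route53.create_health_check(
    CallerReference=str(uuid.uuid4()),
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "primary.example.com",
        "Port": 443,
        "ResourcePath": "/health",
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)

# 2) PRIMARY/SECONDARY failover records with a short TTL, so resolvers pick up
#    the switch quickly once the health check marks the primary unhealthy.
def failover_record(role, ip, health_check_id=None):
    record = {
        "Name": "app.example.com.",
        "Type": "A",
        "SetIdentifier": f"app-{role.lower()}",
        "Failover": role,            # "PRIMARY" or "SECONDARY"
        "TTL": 60,
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}

route53.change_resource_record_sets(
    HostedZoneId=ZONE_ID,
    ChangeBatch={
        "Changes": [
            failover_record("PRIMARY", PRIMARY_IP, health_check["HealthCheck"]["Id"]),
            failover_record("SECONDARY", SECONDARY_IP),
        ]
    },
)
```

In practice the secondary record would typically point at a standby load balancer or a static fallback page, and the whole setup belongs in infrastructure‑as‑code so that failover drills exercise exactly what production uses.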
3. Embrace multicloud where it genuinely adds resilience
Not every workload needs to be multicloud, but some do:
- customer‑facing portals that must be available even if one hyperscaler suffers a regional or control‑plane incident
- critical back‑office systems where extended downtime would create regulatory or safety hazards
- real‑time trading, communications or industrial control systems
The new AWS‑Google multicloud networking service and pending AWS‑Azure connectivity make it far easier to stand up redundant front doors, replicate services and shift traffic between providers without months of bespoke network engineering. [29]
4. Invest in observability, chaos drills and clear incident playbooks
Finally, organizations need to assume that outages will happen and invest in:
- multi‑layer monitoring that can distinguish cloud‑provider issues from application bugs
- chaos engineering exercises that deliberately break dependencies in staging (or carefully controlled production)
- tested runbooks for different outage scenarios: regional failure, DNS/control‑plane loss, CDN issues, identity provider failures, etc.
Expert commentary on the recent outages repeatedly stresses that the organizations that fared best were the ones that had practised failure — across technical, operational and communications dimensions. [30]
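On the first of those points, even a small probe that checks each layer separately can shorten the "is it us or is it them?" phase of an incident. The sketch below is deliberately simplistic: the hostnames and URLs are placeholders, and a real deployment would classify results and feed them into alerting rather than printing them, but the layering (DNS, edge/CDN, application, origin) mirrors the dependency chain discussed above.

```python
# Minimal multi-layer probe: helps distinguish "our app is broken" from
# "a shared layer in front of it is broken". Hostnames/URLs are placeholders.
import socket
import urllib.request

CHECKS = {
    "dns":        lambda: socket.gethostbyname("app.example.com"),
    "edge/cdn":   lambda: urllib.request.urlopen("https://app.example.com/", timeout=5).status,
    "app health": lambda: urllib.request.urlopen("https://app.example.com/health", timeout=5).status,
    "origin":     lambda: urllib.request.urlopen("https://origin.internal.example.com/health", timeout=5).status,
}

for layer, probe in CHECKS.items():
    try:
        result = probe()
        print(f"{layer:10} OK ({result})")
    except Exception as exc:  # in production, classify and alert instead of printing
        print(f"{layer:10} FAIL ({exc.__class__.__name__}: {exc})")

# Reading the results: DNS or edge failing while the origin stays healthy
# points at a provider/CDN issue; a failing origin behind a healthy edge
# points at the application itself.
```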
What this means for users, regulators and the future of the cloud
For end users, the message behind today’s Cloudflare disruption and the recent AWS and Azure incidents is simple: “the cloud” is not one entity, and when a shared layer fails, it can take dozens of familiar brands down with it.
For regulators and policymakers, especially in finance and critical infrastructure, these outages strengthen the case for:
- stricter resilience and testing requirements on both cloud providers and cloud‑dependent firms
- transparency around incident root causes and time‑to‑recover
- encouraging diversity of providers — and perhaps even discouraging excessive reliance on single regions or vendors
For the cloud providers themselves, the next few years are likely to be defined less by who can offer the most AI GPUs and more by who can prove that their platforms can sustain and recover from failure — and who can help customers escape to safety when they cannot.
DNS‑level safeguards like Route 53 Accelerated Recovery, multicloud networking agreements between rivals, and renewed focus on reliability engineering all point to the same conclusion: the future of the cloud is not outage‑free, but failure‑tolerant. As of December 5, 2025, the internet’s nervous system is under scrutiny — and resilience has become the new battleground.
References
1. aws.amazon.com
2. en.as.com
3. www.reuters.com
4. www.thousandeyes.com
5. www.reuters.com
6. www.sangfor.com
7. azure.status.microsoft
8. www.theguardian.com
9. cybernews.com
10. en.as.com
11. en.as.com
12. www.forbes.com
13. aws.amazon.com
14. aws.amazon.com
15. aws.amazon.com
16. aws.amazon.com
17. www.reuters.com
18. www.reuters.com
19. www.reuters.com
20. www.theverge.com
21. www.thousandeyes.com
22. www.theguardian.com
23. www.reuters.com
24. blog.cloudflare.com
25. azure.status.microsoft
26. www.reuters.com
27. www.thousandeyes.com
28. aws.amazon.com
29. www.reuters.com
30. www.sciencemediacentre.org