Operational issue – Multiple services (UAE)

(aws.amazon.com)

133 points | by earthboundkid 3 hours ago

11 comments

  • bgentry 2 hours ago

    The important quote from the timeline:

    Mar 01 9:41 AM PST

    We want to provide some additional information on the power issue in a single Availability Zone in the ME-CENTRAL-1 Region. At around 4:30 AM PST, one of our Availability Zones (mec1-az2) was impacted by objects that struck the data center, creating sparks and fire. The fire department shut off power to the facility and generators as they worked to put out the fire. We are still awaiting permission to turn the power back on, and once we have, we will ensure we restore power and connectivity safely. It will take several hours to restore connectivity to the impacted AZ. The other AZs in the region are functioning normally.

    • jiggawatts 50 minutes ago

      This reminds me of a visit to an Equinix data centre where the sales person was droning on and on about how incredibly reliable their power supplies were, how uninterruptible everything was, etc, etc…

      Essentially, he was trying to assure us that no-no-no, we don’t need multiple zones like the public clouds, they can instead guarantee 100% uninterrupted power under all circumstances.

      A bit bored and annoyed, I pointed to the giant red button conspicuously placed in the middle of a pillar and asked what it is for.

      “Oh, that’s in case there’s a fire!”

      “What does it do?”

      “It cuts… the power… uhh… for the safety of the fire department.”

      “So… if there’s a wisp of smoke in a corner somewhere, the fireys turn up, the first thing they do is… cut the power?”

      “… yes.”

      “Not 100% then, is it?”

      • vntok 22 minutes ago

        Should have pushed it.

      • > we will ensure we restore power and connectivity safely

        this would require human intervention, and I am a bit worried: what if the strike happens again and human lives are lost?

        IIRC there have been cases in history where the same location was targeted across multiple days. AWS presumably has local employees working in the region, but would the relevant team within AWS evaluate this threat itself? What if they try to bring the service back, missiles strike again, and human lives are lost? Let's just hope that this is part of the evaluation as well.

        • tokyobreakfast 55 minutes ago

          > this would require human intervention

          that's the difference between heroes and ordinary employees who bitch about having to go into the office twice a month.

          same as the stories you hear of guys taking snow-cats up a mountain in a blizzard to restore phone circuits or radio transmitters gone offline.

          • flymasterv 45 minutes ago

            Man, don’t be a “hero” trying to restore a lower ping to someone trying to buy a kindle in Jeddah.

            • ok_dad 41 minutes ago

              What about local hospitals which may have service from that data center? There are heroes needed everywhere, all the time.

              • stonogo 21 minutes ago

                In that case, the hero was the person who avoided relying on a single AZ when they deployed to cloud.

            • thatguy0900 41 minutes ago

              I'm sure bezos will be really happy someone is being a hero for him in a war zone while he sails his newest yacht to wherever the new version of the island is.

              • tokyobreakfast 27 minutes ago

                on second thought there is a difference between restoring critical infrastructure in times of crisis vs restoring bot infrastructure for indian spamming operations. choose wisely

            • pwarner 21 minutes ago

              But I mean, are the employees safe at home? I guess if they really targeted the data center then home is safer, but in the fog of war maybe the data center wasn't the target?

          • p-o 2 hours ago

            An interesting adjacent question is how much datacenters are becoming military targets, struck as part of disrupting initial defenses. It doesn't seem that was the case in this instance, but I could see them becoming a more important target in future.

            Seems like it should be somewhat easier to bomb 50 datacenters than it would be to hack and disrupt 1000s of different services.

            Again, this is just me thinking out loud on a tangent and this doesn't have much to do with this story, but I felt it was an interesting thought to share nonetheless.

            • swiftcoder 1 hour ago

              The more interesting question is how many datacenters are just plonked next to a high-value military target?

              For infrastructure reasons, we plonk datacenters down next to airports big enough to fly major hardware into, and near where the big oceanic cables come ashore… and for strategic reasons those are also the perfect places to place military bases

              • throw475787 1 hour ago

                Is there actually some meaningful physical separation between military and civilian server deployments?

                We seem to be really bad at separating those two. For example Starlink is basically military infrastructure now, used to guide bombs.

                • cherryteastain 49 minutes ago

                  A datacenter IS a high value military target.

                • roncesvalles 1 hour ago

                  Exactly. 2 is only sufficient for HA against random failures. It's not enough for HA against a determined adversary willing to use targeted force.

                  • tbrownaw 2 hours ago

                    > Seems like it should be somewhat easier to nuke 50 datacenters than it would be to hack and disrupt 1000s of different services.

                    Previous outage news makes it sound like the cloud providers still have quite a few logical single points of failure.

                    • Zeyka 2 hours ago

                      That's so interesting. Are any of the US military (or other satellite state of the US) systems running in "normal" datacenters or do they have a few protected DoD datacenters in the US?

                    • Imustaskforhelp 2 hours ago

                      > Seems like it should be somewhat easier to nuke 50 datacenters than it would be to hack and disrupt 1000s of different services.

                      The bigger worry to me is that if someone nukes 50 datacenters at once, or say all of Amazon's datacenters at once, then the data stored there would simply be gone, given that so many datacenters are located in Virginia, USA (IIRC) and so many companies rely on a few datacenter providers.

                      The larger threat to me from the loss of data is, first, panic around public-facing services, and second, hedge funds, pension funds, or banks that might be using these datacenters: if they lose their data, it's going to cause even more public mayhem.

                      Some might say that off-site backups exist, but there has been at least one instance where a single Google accident led to massive issues for a $135 billion pension fund.

                      Relevant Kevin Faang video about it: https://www.youtube.com/watch?v=3GOAUyipnM4 [Google Accidentally Deletes $135 Billion Pension Fund, Chaos Ensues]

                      • roxolotl 1 hour ago

                        This is the data center version of https://xkcd.com/538/. Realistically if there is a hot war what you’re saying seems accurate.

                        • jcgrillo 2 hours ago

                          IIUC part of the reason ballistic missiles have multiple warheads is that some of them detonate high up to knock out air defenses and other electronics, allowing the rest to fall through to their targets. The last time we tried this experiment as a species was the Starfish Prime test in 1962, which caused some electrical havoc in Hawaii. These days our systems are probably more delicate and sensitive? All that is to say, in a scenario where nukes are going off I'm not sure you'd even need to target any datacenters in particular. They're probably all toast by default.

                        • ejdyksen 2 hours ago

                          Just one AZ, not the whole region:

                          > The other AZs in the region are functioning normally. Customers who were running their applications redundantly across the AZs are not impacted by this event.

                          • anonu 2 hours ago

                            We have business in UAE. For whatever reason I defaulted to us-west-2 since these particular applications are not latency sensitive.

                            • boxedemp 2 hours ago

                              Amazon usually has 3 AZs per region. It looks like there are surviving AZs, but the system didn't switch over gracefully.

                              I bet that was an interesting sev2 ticket!

                              • easton 2 hours ago

                                It depends on the service if things move gracefully or not. The incident explains it's only EC2 (and dependent services) in that AZ, so if they try to route traffic for services hosted on EC2 to that AZ it's not working (and customers running instances in that AZ have lost access).

                                The other ones are not impacted. They always like to tell you to pay for more than one instance in different AZs so if this happens you don't get impacted.
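
A minimal sketch of the multi-AZ point above. It assumes hypothetical zone IDs `mec1-az1` and `mec1-az3` alongside the `mec1-az2` named in the incident, and hypothetical helper names `place_instances` and `surviving_capacity`. It only models placement and calls no AWS API:

```python
from itertools import cycle


def place_instances(count, zones):
    """Assign `count` instances to `zones` round-robin; returns one zone name per instance."""
    zone_cycle = cycle(zones)
    return [next(zone_cycle) for _ in range(count)]


def surviving_capacity(placement, failed_zone):
    """Count the instances still running if `failed_zone` goes dark."""
    return sum(1 for zone in placement if zone != failed_zone)


# mec1-az2 comes from the incident text; the other two IDs are assumed for illustration.
zones = ["mec1-az1", "mec1-az2", "mec1-az3"]

spread = place_instances(6, zones)          # two instances per zone
single = place_instances(6, ["mec1-az2"])   # everything in the affected zone

print(surviving_capacity(spread, "mec1-az2"))
print(surviving_capacity(single, "mec1-az2"))
```

With six instances spread over three zones, losing one zone leaves four running; with all six in the affected zone, none survive. Whether four is enough depends on how much headroom the deployment was sized with.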

                              • Shank 2 hours ago

                                I wonder if this was a bad targeting job or intentional. I appreciate the transparency and optimism in the status updates though!

                                • sb057 2 hours ago

                                  Looking at Google Maps, there's Al Dhafra Air Base a couple of miles to the datacenter's south, an oil refinery a bit to the east, ports to the north, and a military academy to the west.

                                • eptcyka 2 hours ago

                                  > one of our Availability Zones (mec1-az2) was impacted by objects that struck the data center, creating sparks and fire.

                                  God forbid we'd ever say that it was struck by a missile or a munition in an act of war.

                                  • NikolaNovak 2 hours ago

                                    That's what I'm trying to understand too. Is this a meteor, tree, etc.? Or a human-made object, and if so, an accidental or intentional one? Further risk assessment would depend on the root cause.

                                    • rrvidqdi 1 hour ago

                                      The earliest phrasing I saw internally was "Root cause is identified as a drone attack to DXB61 site". That's somewhat open to interpretation, and could also have simply been incorrect. It was scrubbed from the ticket, though, and it now merely vaguely gestures toward a "power event". The ticket I'd expect to have further detail was locked down.

                                    • hdgvhicv 2 hours ago

                                      Maybe a missile, maybe a drone, maybe debris

                                      Doesn’t really matter, we know Trump’s latest war is the cause

                                      • arjie 1 hour ago

                                        I actually like the way they said it. I don't know if it's a different cultural tradition, but the cool steely-eyed fact-based conversation always really felt so much more inspiring:

                                            Conrad: I got three fuel cell lights, an AC bus light, a fuel cell disconnect, AC bus overload 1 and 2, Main Bus A and B out.
                                        
                                            Aaron: Flight, EECOM. Try SCE to Aux.
                                        
                                        Modern culture in the movies and whatnot is that someone should be yelling "Everything's failing. Give me something, Houston. All lights are on! MAYDAY MAYDAY!" and some sort of flavour commentary like that. But reading engineering updates that go like this feels like watching maximal professionalism under fire:

                                        > At around 4:30 AM PST, one of our Availability Zones (mec1-az2) was impacted by objects that struck the data center, creating sparks and fire. The fire department shut off power to the facility and generators as they worked to put out the fire. We are still awaiting permission to turn the power back on, and once we have, we will ensure we restore power and connectivity safely. It will take several hours to restore connectivity to the impacted AZ. The other AZs in the region are functioning normally. Customers who were running their applications redundantly across the AZs are not impacted by this event. EC2 Instance launches will continue to be impaired in the impacted AZ. We recommend that customers continue to retry any failed API requests. If immediate recovery of an affected resource (EC2 Instance, EBS Volume, RDS DB Instance, etc.) is required, we recommend restoring from your most recent backup, by launching replacement resources in one of the unaffected zones, or an alternate AWS Region. We will provide an update by 12:30 PM PST, or sooner if we have additional information to share.

                                        This has that same mechanical tone of an ice-cold captain dealing with a proximate situation providing exactly the information they know. No flavour commentary. Amazing. I fucking love it.

                                      • bigyabai 2 hours ago

                                        Potential moon bear attack; we're waiting on satellite imagery to confirm it. https://youtu.be/pvjgIxuVdo4?t=96

                                      • potatoproduct 2 hours ago

                                        We are living in increasingly weirder times.

                                        • astrange 2 hours ago

                                          A factory not working because of a missile strike seems pretty classic actually.

                                          • guerrilla 2 hours ago

                                            Sure, but it's not a factory.

                                            • astrange 2 hours ago

                                              It's a big building with a lot of capital assets inside that are the means of production for a business…

                                              • mediaman 2 hours ago

                                                Why not? It's a physical building with lots of equipment that produces products shipped to its customers.

                                                Its products are sequences of electrons, instead of atoms. But so are power plants. And in the context of what happens when they're hit by missiles, a factory, data center, and power plant all behave the same.

                                            • debo_ 2 hours ago

                                              When I first learned that there were AWS Middle East regions, my first thought was "wow they are more optimistic than I am."

                                              • toast0 2 hours ago

                                                Google Cloud also has Middle East locations, as do Azure, Oracle and Alibaba. Afaik, IBM Cloud does not. I think those five and AWS are the top 6 global public access clouds.

                                                • Cyph0n 2 hours ago

                                                  No, they are more aware of the customer demand for compute in the region.

                                                  • alexfoo 2 hours ago

                                                    And demand for data sovereignty.

                                                    • Cyph0n 2 hours ago

                                                      Absolutely, especially in the KSA.

                                                • dgxyz 2 hours ago

                                                  Not really. It's just been pretty damn quiet for years.

                                                • Trasmatta 2 hours ago

                                                  Is this the one in Bahrain?

                                                • Imustaskforhelp 2 hours ago

                                                  Has this ever happened before in the history of cloud providers because of war?

                                                  They mention that the datacenter had fires and sparks, and they are estimating hours of downtime, but given the situation, how does that prevent the situation from happening again? It's best for people to use safer regions than the Middle East at the moment, as missiles might target the same datacenter again now that some damage has been done.

                                                  Moving forward, will there be a demand (albeit small) for nuclear-bunker-esque datacenters which can withstand missiles? I know absolutely nothing about constructing underground, but can explosives not be used to create underground datacenters comparatively cheaply? One could also use revamped nuclear bunkers (although the scale of AWS datacenters might be too huge for that, who knows).

                                                  Had some ideas which show that this idea might be interesting, https://www.nature.com/articles/s44284-026-00406-2

                                                  I am also curious what safety measures are taken by internet exchange providers or (had to search it up) submarine cable landing stations. To me it feels like blowing these up would lead to internet downtime across a whole country or between providers.

                                                  • toast0 1 hour ago

                                                    Historically in the US, some portion of Bell installations were designed to be resistant to attack. But it comes at large expense for construction and maintenance. Underground facilities also bring increased risk of flooding.

                                                    Competition, deregulation, and a lack of attacks lead towards less robust installations to reduce costs. Geographically redundant installations help as long as all installations aren't targeted, and are valuable for operational concerns other than just attacks.

                                                    • userbinator 19 minutes ago

                                                      Cold War era definitely resulted in a lot of comms infrastructure being hardened against attack.

                                                    • crote 32 minutes ago

                                                      > will there be a demand (albeit small) for nuclear-bunker-esque datacenters

                                                      Those already exist. See for example Bahnhof's "Pionen - White Mountain" data center in Stockholm, or Cyberfort's "The Bunker" a bit west of London.

                                                      • tbrownaw 1 hour ago

                                                        > but given the situation, How does that prevent the situation from happening again

                                                        You don't. Instead, you make sure your failover or DR setup is regularly tested and works.

                                                        • SoftTalker 1 hour ago

                                                          Data centers are usually built to withstand local natural risks e.g. weather. All bets, SLAs, and insurance are usually off when it comes to acts of war.

                                                          • crote 18 minutes ago

                                                            There's also just an upper limit to the kind of risk you can reasonably defend against.

                                                            An out-of-control wildfire levels the entire city? The Big One hits the Bay Area? The entire city is flooded for a few months because the levees break during a Cat5 hurricane? Yeah, your DC will be completely ruined. And even if it isn't, you're probably not getting any outside power, generator fuel, or repair technicians for a while.

                                                            No matter how much money you pump into hardening your own super-bunker DC, there will always be disasters you aren't prepared for. At a certain point it just makes more financial sense to abandon the idea of invulnerability and build a redundant site a few states over. Accept that you will occasionally lose one, and only protect against incidents where mitigation is cheaper than occasionally rebuilding.

                                                        • general1465 2 hours ago

                                                          In Southern Europe some smaller web servers are intermittently not working, while big servers like YouTube are working fine. But I don't think it is related.