• GreenKnight23@lemmy.world
    English
    arrow-up
    40
    ·
    5 hours ago

    oh sure, when they fuck up DNS it’s a “race condition”.

    when I fuck up DNS it’s a “fireable offense”.

  • TommySoda@lemmy.world
    English
    arrow-up
    30
    ·
    9 hours ago

    This is purely anecdotal, but I have been running into a lot of DNS issues over the past couple of months where I work. Three of the computers and even one of the laptops for remote work were having DNS issues that needed to be fixed. One even needed Windows reinstalled after fixing the DNS issue (which was probably unrelated, but worth mentioning).

    I’m honestly starting to think that the internet in general might be imploding. Not sure why, but replacing so many developers and programmers with AI might be responsible. Who knows, but it’s definitely very strange.

    • ubergeek@lemmy.today
      English
      arrow-up
      17
      ·
      6 hours ago

      A huge problem is developers who lack a fundamental understanding of how the internet even works. I’ve had to explain how short, unqualified names resolve vs. how FQDNs resolve. Or why you may not even be able to reach another node in your proverbial cluster, because they are on different subnets. Or why using GUIDs as hostnames is generally a bad idea and will cause things to fail in unpredictable ways, especially with deeply nested subdomains.
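
To make the short-name vs. FQDN point concrete, here is a rough sketch of how a typical stub resolver decides which names to actually query, assuming `resolv.conf`-style `search` and `ndots` behaviour. The function name and the example domains are made up for illustration; nothing is sent over the network.

```python
from typing import List


def candidate_queries(name: str, search_domains: List[str], ndots: int = 1) -> List[str]:
    """Return the fully qualified names a stub resolver would try, in order.

    Assumes resolv.conf-style semantics: a trailing dot means the name is
    already fully qualified; otherwise the search list is applied before or
    after the literal name depending on how many dots the name contains.
    """
    if name.endswith("."):
        # Already an FQDN: the search list is never applied.
        return [name]

    absolute = name + "."
    expanded = [f"{name}.{domain.rstrip('.')}." for domain in search_domains]

    if name.count(".") >= ndots:
        # "Looks qualified": try it as-is first, then fall back to the search list.
        return [absolute] + expanded
    # Short name: walk the search list first, try the bare name last.
    return expanded + [absolute]


# Hypothetical search list, purely for illustration.
search = ["corp.example.com", "example.com"]
print(candidate_queries("db", search))
print(candidate_queries("db.corp.example.com.", search))
```

The short name `db` fans out into several lookups whose result depends on the local search list, while the trailing-dot form resolves the same way everywhere. That difference is usually what bites people when the same config is deployed on hosts with different search domains.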

    • Possibly linux@lemmy.zip
      English
      arrow-up
      49
      ·
      9 hours ago

      The biggest issue is how centralized the internet has become. It went from a bunch of local servers to a handful of cloud providers.

      We need to spread things out again

        • ramble81@lemmy.zip
          English
          arrow-up
          7
          ·
          6 hours ago

          Oh man. At one of my old companies, the devs would always blame the network, even after we spent a year upgrading and removing all SPOFs. They’d blame the network…

          “Your application is somehow producing 2 billion packets per second and your SQL queries are returning 5GB of data”…. “See! The network is too slow and it has problems”

      • NickwithaC@lemmy.world
        English
        arrow-up
        43
        ·
        10 hours ago

        I always view the source of websites like this, and this is one of the worst I’ve seen: 217 lines of code (including inline JavaScript?!) and a Google tag for some reason, all to put the word YES in green on black.

        • ijhoo@lemmy.ml
          English
          arrow-up
          8
          ·
          9 hours ago

          Did not think of doing that.

          I guess I never expected anyone to have fcking JavaScript on a page as simple as that.

  • falseWhite@lemmy.world
    English
    arrow-up
    79
    ·
    11 hours ago

    That’s what you get when you let go hundreds of employees from your cloud computing unit in favour of AI.

    I hope they end up having to compensate all the billions of losses they caused to all the businesses and people.

    • Possibly linux@lemmy.zip
      English
      arrow-up
      12
      ·
      7 hours ago

      Mistakes happen with or without AI.

      The problem is that the current internet is structured in a way that creates high-risk systems that can cause a massive outage. We went from having thousands of independent companies to a handful of massive ones. A mistake by a single company shouldn’t be able to black out half the internet.

    • Phoenixz@lemmy.ca
      English
      arrow-up
      10
      ·
      7 hours ago

      Was it proven that AI was the cause?

      I’m not saying it wasn’t, just that if it really was, I’d like a source for that claim.

      • Serinus@lemmy.world
        English
        arrow-up
        1
        ·
        56 minutes ago

        No, but it clearly wasn’t the solution. They likely could have used some of those people they fired for that.

      • jaybone@lemmy.zip
        English
        arrow-up
        4
        ·
        5 hours ago

        There was an article in my Lemmy All feed yesterday claiming so. But it was on a super questionable, shady site, which people were calling out.

      • falseWhite@lemmy.world
        English
        arrow-up
        26
        ·
        edit-2
        10 hours ago

        They do have contracts and are obligated to provide a certain “up time”, which is usually 99% or so. If they fail to provide that, they are liable to compensate for the losses.

        Or do you think that Amazon is above the law and no other company could sue them?

        It all depends on what kind of contracts they have.

        • WASTECH@lemmy.world
          English
          arrow-up
          4
          ·
          7 hours ago

          These contracts do not stipulate reimbursement for lost revenue. The “uptime guarantee” just gets you a partial discount or service refund for the impacted services.

          It is on the customer to architect their environment for high availability (use multiple regions or even multiple hyperscalers, depending on the uptime need).
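
As a very rough sketch of what "architect for high availability" can mean on the client side: keep an ordered list of regions and route to the first one that passes a health check. The region names and the `is_healthy` probe are placeholders I made up for illustration; real setups usually do this with DNS or load-balancer failover rather than in application code.

```python
from typing import Callable, Iterable, Optional


def first_healthy(regions: Iterable[str], is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first region whose health probe succeeds, or None if all fail.

    `is_healthy` is a caller-supplied probe (hypothetical here); a probe that
    raises is treated the same as an unhealthy region.
    """
    for region in regions:
        try:
            if is_healthy(region):
                return region
        except Exception:
            # A failing probe must not take down the failover logic itself.
            continue
    return None


# Illustrative only: pretend the primary region is down.
down = {"us-east-1"}
preferred_order = ["us-east-1", "us-west-2", "eu-west-1"]
print(first_healthy(preferred_order, lambda region: region not in down))
```

This only helps if the data and dependencies behind each region are independent too; a standby that quietly depends on the primary region's control plane fails with it.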

          Source: I work at an enterprise that is bound by one of these agreements (although not with AWS).

          • CheezyWeezle@lemmy.world
            English
            arrow-up
            3
            ·
            6 hours ago

            SLA contracts can have a plethora of stipulations, including fines and damages for missing SLO. It really depends on how big and important the customer is. For example, you can imagine government contracts probably include hefty fines for causing downtime or data loss, although I am not involved with or familiar with public sector/ government contracts or their terms.

            You can imagine that a customer that is big enough to contract a cloud provider to build new locations and install a bunch of new hardware just for them, would also be big enough to leverage contract terms that include fines and compensation for extended downtime or missing SLO.

            I work at a data center for a major cloud provider, also not AWS

        • Onomatopoeia@lemmy.cafe
          English
          arrow-up
          15
          ·
          edit-2
          10 hours ago

          Much of this stuff is automatic. I’ve worked with such contracted services where uptime is guaranteed. The contracts dictate the terms and conditions for refunds; we see them on a monthly basis when uptime is missed, and it’s not done by a person.

          I imagine many companies have already seen refunds for outage time, and Amazon scrambled to stop the automation around this.

          They’ll have little to stand on in court for something this visible and extensive, and could easily lose their shirt with fines and penalties when a big company sues over breach when they choose to not renew.

          Just cause they’re big doesn’t mean all their clients are small or don’t have legal teams of their own.

        • BakerBagel@midwest.social
          English
          arrow-up
          6
          ·
          10 hours ago

          Amazon has more money than most countries. They can outlast any company in court, or just ban you from their services in the future.

          • Onomatopoeia@lemmy.cafe
            English
            arrow-up
            7
            ·
            10 hours ago

            Depends on who we’re talking about. Companies like finance orgs are all about legal contracts and would be able to hold their feet to the fire.

            You don’t want to go to court against a finance company or any very large org where contract law is their bread and butter (basically any large/multinational corp).

            Amazon’s not hosting just small operations.

        • Passerby6497@lemmy.world
          English
          arrow-up
          2
          ·
          8 hours ago

          99% uptime in a year gives you 3.65 days of downtime, which I think would still be within SLA (assuming nothing else happened this year). Though, once you get to three-nines reliability (99.9%), you’ve got a shift and change (about 8.76 hours) you can be down before you breach SLA.

          If their reliability metrics are monthly, 99% gets you less than a shift of down time, so they’d be out of SLA and could probably yell to get money back.
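
For anyone who wants to check the arithmetic, here is a quick sketch that turns an availability percentage and a measurement window into a downtime budget. The function name is mine, and the 365-day year and 30-day month are simplifying assumptions; real SLAs define their own windows and exclusions.

```python
from datetime import timedelta


def downtime_budget(availability_pct: float, period: timedelta) -> timedelta:
    """Maximum downtime allowed within `period` at the given availability."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period * ((100 - availability_pct) / 100)


YEAR = timedelta(days=365)   # assumption: non-leap year
MONTH = timedelta(days=30)   # assumption: 30-day billing month

for pct in (99, 99.9, 99.99):
    hours = downtime_budget(pct, YEAR).total_seconds() / 3600
    print(f"{pct}% over a year: about {hours:.2f} hours")

monthly_hours = downtime_budget(99, MONTH).total_seconds() / 3600
print(f"99% over a 30-day month: about {monthly_hours:.2f} hours")
```

Under those assumptions, 99% works out to 3.65 days per year but only about 7.2 hours per 30-day month, 99.9% is about 8.76 hours per year, and 99.99% is a little under 53 minutes per year. The measurement window matters as much as the number of nines.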

          • Phoenixz@lemmy.ca
            English
            arrow-up
            7
            ·
            7 hours ago

            I worked at a datacenter that sold clients 99.99% uptime.

            Fun times with a maximum of about one hour of downtime per year for hundreds of servers

        • BCsven@lemmy.ca
          English
          arrow-up
          5
          ·
          10 hours ago

          Most services have a clause that they are not liable for unforeseen issues… Depends how good the lawyers were when formalizing the contracts.

          • Passerby6497@lemmy.world
            English
            arrow-up
            3
            ·
            8 hours ago

            Good luck arguing that a missed config counts as an ‘unforeseen issue’. If they go that route, people will be all over them for not being SOC compliant wrt change control.

            • BCsven@lemmy.ca
              English
              arrow-up
              1
              ·
              4 hours ago

              They can try to argue that the latency issue and the stale state were an unknown/unanticipated problem. Like when half of Canada’s Rogers network went down, affecting most debit payment systems. Testing of the routing showed it was OK, but the real-world flip went haywire.

      • amino@lemmy.blahaj.zone
        English
        arrow-up
        9
        ·
        8 hours ago

        Signal is definitely part of the fun internet, they just decided to rely on AWS due to techbro culture I assume?

  • SayCyberOnceMore@feddit.uk
    English
    arrow-up
    11
    ·
    10 hours ago

    I’m glad these things happen… it keeps everyone aware that cloud is fragile and Plan B should be considered for mission critical tasks.

    I’m also hoping that it will improve cloud resiliency because a complete / partial restart of cloud systems needs a whole different approach than maintaining a running system.

    • non_burglar@lemmy.world
      English
      arrow-up
      4
      ·
      10 hours ago

      It’s true.

      It comes up at work, it comes up in discussions on Linux podcasts I listen to, it comes up here…

      We have a big, dangerous impending problem in DNS.

      • Flax@feddit.uk
        English
        arrow-up
        8
        ·
        10 hours ago

        The issue here isn’t DNS. The issue here is a large portion of the internet relying on a single data centre on the US East coast. Ideally, a lot of competing hosting companies would exist so if one goes down, it’s just one service and very few people notice.

        • Onomatopoeia@lemmy.cafe
          English
          arrow-up
          6
          ·
          10 hours ago

          So much this.

          Why is Signal hosted in one location on AWS, for example? That’s the sort of thing that should be in multiple places around the world with automatic fail over.

        • non_burglar@lemmy.world
          English
          arrow-up
          2
          ·
          8 hours ago

          Yes, that’s true, I guess it’s a separate issue. But the way DNS currently runs is a problem waiting to happen.

  • pop [he/him]@lemmy.blahaj.zone
    English
    arrow-up
    4
    ·
    10 hours ago

    So, in the end they turned off the thing that caused this whole mess and everything is still working.

    What’s the point of having it, then?