
Worldwide outage due to a CrowdStrike software update

I can't stress enough the importance of making backups and testing updates in a test ("staging") environment before pushing them to production.

At the time of writing, a huge number of businesses worldwide are down due to a problem that seems to be related to a CrowdStrike software update (so much for the IT professionals' weekend :( ).

The update resulted in Windows PCs showing the "blue screen of death," and once a computer is affected, it's too late to roll the update back to the previous version.

...It remains to be seen whether BikeGremlin will be more fortunate on July 27th, 2024, when its hosting server update is planned - LOL :)

Relja
 
  1. Boot Windows into Safe Mode or the Windows Recovery Environment
  2. Navigate to the C:\Windows\System32\drivers\CrowdStrike directory
  3. Locate the file matching “C-00000291*.sys”, and delete it.
  4. Boot the host normally.
 

Yup, AFAIK that should fix it - but it needs to be done manually on each affected PC.
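If you can get any scripting environment onto an affected machine (not a given in Safe Mode or WinRE), steps 2 and 3 boil down to deleting one file - a minimal, untested Python sketch, assuming Python is available and you're running with admin rights:

import glob
import os

# The directory from step 2 of the workaround above.
DRIVER_DIR = r"C:\Windows\System32\drivers\CrowdStrike"

# Step 3: find and delete the problematic channel file(s).
for path in glob.glob(os.path.join(DRIVER_DIR, "C-00000291*.sys")):
    print("Deleting", path)
    os.remove(path)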

This might help with automation (haven't tested it, haven't been affected :) - not sure it would work if a computer blue-screens too quickly after booting for the policy to apply):

Automated CrowdStrike BSOD Workaround in Safe Mode using Group Policy
 
New info:

Throwaway account...
CrowdStrike in this context is an NT kernel loadable module (a .sys file) which does syscall-level interception and logs them to a separate process on the machine. It can also STOP syscalls from working if they are trying to connect out to other nodes and accessing files they shouldn't be (using some drunk ass heuristics).

What happened here was they pushed a new kernel driver out to every client without authorization to fix an issue with slowness and latency that was in the previous Falcon sensor product. They have a staging system which is supposed to give clients control over this but they pissed over everyone's staging and rules and just pushed this to production.

This has taken us out and we have 30 people currently doing recovery and DR. Most of our nodes are boot-looping with blue screens, which in the cloud is not something you can fix by just hitting F8 and removing the driver. We have to literally take each node down, attach the disk to a working node, delete the .sys file and bring it up. Either that or bring up a new node entirely from a snapshot.

This is fine but EC2 is rammed with people doing this now so it's taking forever. Storage latency is through the roof.

I fought for months to keep this shit out of production because of this reason. I am now busy but vindicated.

Edit: to all the people moaning about Windows, we've had no problems with Windows. This is not a Windows issue. This is a third party security vendor shitting in the kernel.

Source:
https://news.ycombinator.com/item?id=41003390
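For cloud-hosted nodes, the recovery the commenter describes (stop the broken node, move its boot disk to a healthy node, delete the channel file, move it back, boot) could be scripted roughly like this on AWS with boto3 - an untested sketch, with all instance and volume IDs as placeholders and the actual file deletion on the rescue node left out:

import boto3

ec2 = boto3.client("ec2", region_name="eu-central-1")  # placeholder region

BROKEN_INSTANCE = "i-0123456789abcdef0"    # placeholder: boot-looping node
RESCUE_INSTANCE = "i-0fedcba9876543210"    # placeholder: healthy node
BROKEN_ROOT_VOL = "vol-0123456789abcdef0"  # placeholder: broken node's root volume

# 1. Stop the broken node so its root volume can be detached.
ec2.stop_instances(InstanceIds=[BROKEN_INSTANCE])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[BROKEN_INSTANCE])

# 2. Detach the root volume and attach it to the rescue node as a data disk.
ec2.detach_volume(VolumeId=BROKEN_ROOT_VOL)
ec2.get_waiter("volume_available").wait(VolumeIds=[BROKEN_ROOT_VOL])
ec2.attach_volume(VolumeId=BROKEN_ROOT_VOL, InstanceId=RESCUE_INSTANCE, Device="xvdf")

# 3. On the rescue node, delete
#    <mounted drive>:\Windows\System32\drivers\CrowdStrike\C-00000291*.sys
#    (by hand or via remote management - not shown here).

# 4. Offline the disk on the rescue node, then move the volume back and
#    start the broken node again.
ec2.detach_volume(VolumeId=BROKEN_ROOT_VOL)
ec2.get_waiter("volume_available").wait(VolumeIds=[BROKEN_ROOT_VOL])
ec2.attach_volume(VolumeId=BROKEN_ROOT_VOL, InstanceId=BROKEN_INSTANCE, Device="/dev/sda1")
ec2.start_instances(InstanceIds=[BROKEN_INSTANCE])

The commenter's other option - bringing up a fresh node from a pre-update snapshot - skips the disk shuffling entirely.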
 

Attachments

  • crowdstrike-update.webp (99.3 KB)
