Lessons learned from the CrowdStrike incident

Digital security

Organizations, including those not affected by the CrowdStrike incident, should avoid the temptation to attribute the IT crisis to exceptional circumstances

July 23, 2024 • 3 min. read


As the dust settles on the cyber incident caused by CrowdStrike releasing a corrupt update, many companies will conduct a thorough post-mortem to determine how the incident affected their business and what can be done differently in the future.

Most critical infrastructure operators and large organizations will undoubtedly have put their tried-and-tested cyber resilience plans into action. However, the incident, which has been called “the largest IT failure in history,” was likely something that no organization, no matter how large or how compliant with cybersecurity frameworks, could have prepared for. It felt like an ‘Armageddon moment’, as was evident from the disruptions at major airports on Friday.

A company can prepare for the unavailability of its own systems, or of a number of important partner systems. However, when an incident is so widespread that it impacts, for example, air traffic control, government transportation departments, transportation providers, airport restaurants, and even the TV companies that could alert passengers to the issue, preparedness is likely to be limited to your own systems. Fortunately, incidents of this magnitude are rare.

What Friday’s incident does demonstrate is that only a small percentage of devices need to be taken offline to cause a major global incident. Microsoft confirmed that 8.5 million devices were affected – a conservative estimate would put this at between 0.5 and 0.75% of total PC devices.
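As a rough sanity check on those figures, the short sketch below works backwards from the 8.5 million devices to the total device count that each percentage would imply. The function name and variables are illustrative only, and the result is simple arithmetic on the numbers quoted above, not an independent estimate of how many PCs exist.

```python
# Figure confirmed by Microsoft, as quoted in the article.
AFFECTED_DEVICES = 8_500_000


def implied_installed_base(affected: int, share_percent: float) -> float:
    """Return the total device count implied if `affected` devices
    make up `share_percent` percent of all devices."""
    return affected / (share_percent / 100)


# The article's estimated range: 0.5% to 0.75% of total PC devices.
low_share, high_share = 0.5, 0.75

# A smaller share implies a larger total, and vice versa.
largest_base = implied_installed_base(AFFECTED_DEVICES, low_share)
smallest_base = implied_installed_base(AFFECTED_DEVICES, high_share)

print(f"Implied total: {smallest_base / 1e9:.2f} to {largest_base / 1e9:.2f} billion devices")
```

In other words, the quoted range corresponds to an installed base of somewhere between roughly 1.1 and 1.7 billion devices, which shows just how small a slice was enough to cause global disruption.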

However, this small percentage concerns the devices that must be kept safe and operational at all times. They are in critical services. That’s why the companies that use them deploy security updates and patches as they become available. Failure to do so could have serious consequences and lead cyber incident experts to question the organization’s reasoning and competence in managing cybersecurity risks.

The importance of cyber resilience plans

A detailed and comprehensive cyber resilience plan can help you get your business back up and running quickly. However, in exceptional circumstances like these, your business may still not become operational because others you depend on are not as prepared or as quick to deploy the necessary resources. No company can anticipate all scenarios and completely eliminate the risk of operational disruption.

That said, it is important that ALL companies adopt a cyber resilience plan and test it periodically to ensure it performs as expected. The plan could even be tested with direct business partners, but testing on the scale of the ‘CrowdStrike Friday’ incident would likely be impractical. In previous blogs I’ve detailed the core elements of cyber resilience to provide some guidance; two that may help are #ShieldsUp and these guidelines to help small businesses increase their preparedness.

The most important message after last Friday’s incident is not to skip the post-mortem or attribute the incident to exceptional circumstances. By assessing and learning from an incident, you improve your ability to deal with future incidents. This review should also address the issue of dependence on only a few suppliers, the pitfalls of a monoculture technology environment, and the benefits of implementing diversity in technology to reduce risk.

All eggs in one basket

There are several reasons why companies choose a single supplier. One is obviously cost-effectiveness; others are likely to be the appeal of a one-size-fits-all approach and the desire to avoid multiple management platforms and incompatibilities between similar, side-by-side solutions. It may be time for companies to explore how proven coexistence with their competitors and a diversified product selection can reduce risk and benefit customers. This could even take the form of an industry requirement or a standard.

The post-mortem should also be performed by those not affected by ‘CrowdStrike Friday’. You’ve seen the devastation that can be caused by an exceptional cyber incident, and while it didn’t affect you this time, you might not be so lucky next time. So use the lessons others have learned from this incident to improve your own cyber resilience.

Finally, running technology that is so old that it cannot be affected by such an incident is not the way to avoid one. Last weekend someone pointed me to an article reporting that Southwest Airlines was not affected, reportedly because it uses Windows 3.1 and Windows 95, and Windows 3.1 hasn’t been updated in over 20 years. I’m not sure whether any anti-malware products still support and protect this archaic technology. This old-technology strategy may not give me the confidence needed to fly Southwest anytime soon. Old technology is not the answer, and it is not a viable plan for cyber resilience; it is a disaster waiting to happen.