Six Best Practices for Enterprises in the Wake of the CrowdStrike–Microsoft IT Outage

The recent CrowdStrike and Microsoft incident, which rapidly escalated into one of the most significant IT outages in recent times, underscores the need for enterprises to reassess their own security. Readiness is increasingly critical because, as this incident amply demonstrated, IT infrastructure failures can have a significant business impact.

What Caused the CrowdStrike–Microsoft Outage?

While some minor questions remain about what happened, much is already known. CrowdStrike, a leading cybersecurity platform, has an auto-deploy feature for some of its updates. General software updates to the base product code can be managed through settings that allow organizations to run at “n-1” (one version behind the latest release), leaving time to fully test an update and prevent outages. However, “sensor” updates (think antivirus signature files) are frequently deployed as “content” updates and set to install automatically as soon as they become available.

In this specific case, a small content update was auto-deployed without sufficient testing on July 19, 2024. Once deployed, it interacted poorly with Microsoft Windows systems, causing roughly 8.5 million Windows devices (less than 1 percent of all Windows machines worldwide) to crash and leaving them unable to restart, according to a July 20 blog post from Microsoft. A fix was found within hours, but it required manual intervention on each affected machine and thus had deep financial and operational impacts, estimated at $10 billion by Business Insider.

Auto-deploying an update without sufficient testing is a major issue ISG has observed in the marketplace. In the era of continuous integration and continuous delivery and deployment (CI/CD), it is critical that software companies not only test every update to their products at full scale but also provide a mandatory process that lets organizations choose when and how any update is deployed. At the same time, CI/CD development processes allowed CrowdStrike to deploy a revised update to affected systems in less than 90 minutes.

This is both the power and the risk of rapid deployment in the cybersecurity space. Such speed is critical to effectively address zero-day attacks; however, unintended consequences can occur. In this case, more thorough testing and the ability for every customer to choose when and how similar updates are installed would have significantly reduced the risk to production systems.  
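The idea of letting customers control when and how updates reach production can be sketched as a ring-based rollout with a health gate. This is a minimal illustration only, not CrowdStrike's or any vendor's actual mechanism; the `Ring` type, the `deploy` and `probe` callbacks and the `max_crash_rate` threshold are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Ring:
    """A group of hosts that receives an update at the same stage (hypothetical)."""
    name: str
    hosts: List[str]


def staged_rollout(
    rings: List[Ring],
    deploy: Callable[[str], None],
    probe: Callable[[str], bool],
    max_crash_rate: float = 0.01,
) -> Dict[str, str]:
    """Deploy ring by ring and stop as soon as one ring fails the health gate.

    deploy(host) pushes the update to a host; probe(host) returns True if the
    host is still healthy afterwards. Both are supplied by the caller.
    """
    status: Dict[str, str] = {}
    halted = False
    for ring in rings:
        if halted:
            # An earlier ring failed the gate, so later rings never get the update.
            status[ring.name] = "skipped"
            continue
        for host in ring.hosts:
            deploy(host)
        crashed = sum(1 for host in ring.hosts if not probe(host))
        crash_rate = crashed / len(ring.hosts) if ring.hosts else 0.0
        if crash_rate > max_crash_rate:
            status[ring.name] = "halted"
            halted = True
        else:
            status[ring.name] = "ok"
    return status
```

With a small canary ring placed first, a faulty update that crashes test machines would stop there, and the broad production ring would be marked "skipped" rather than updated.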

The other significant trend that exacerbated this situation is the technical debt (i.e., the use of outdated, legacy technology or code) prevalent in large global organizations. The CrowdStrike update hit the wall of technical debt in industries such as airlines, utilities and banks that still use older systems to run critical parts of their business. It is unclear whether these older systems were lacking Microsoft patches and the most up-to-date software, which may have caused additional issues in recovery once the update from CrowdStrike became available.

What has become very clear is that the outage exposed issues with the process Microsoft uses for managing updates to the kernel—the core component that manages resources for the operating system and connects hardware with software—of recent Windows products.

In light of this, ISG does not believe that CrowdStrike should be singled out as the sole entity responsible for this incident. Instead, the incident should serve as a critical reminder for the cybersecurity ecosystem to bolster its preparedness for potential business disruptions.

Outages of this scale are rare, but some are inevitable. The necessity for readiness, robust response capabilities, and enhanced business and operational resilience—as well as a diverse ecosystem of vendors—cannot be overstated. It’s important for enterprises to pursue continuous improvement in these areas to mitigate risks and maintain cybersecurity integrity.

We highlight some of our recommended best practices below.

Six Best Practices for Handling Cybersecurity Tool Outages

As businesses prepare to adopt AI-enabled infrastructures, incidents like this one expose vulnerabilities in the cloud-based IT and cybersecurity tools that form the first step toward a future-ready enterprise. As organizations, including those supporting critical sectors like healthcare and public services, rely more and more on complex, interconnected IT systems, it is important to have a robust and reliable mechanism for proactive issue resolution and disaster recovery.

Based on our experience working with global enterprises, ISG recommends the following six best practices to help organizations prepare for and handle such incidents.

1. Incorporate testing and validation into software updates and patches.

Global cybersecurity software-as-a-service (SaaS) vendors frequently update their solutions to address the latest security threats. Enterprises should collaborate with these vendors to rigorously validate the updates, making sure they don’t create new vulnerabilities or compatibility issues within the existing ecosystem. Companies should work with their vendors to include site reliability engineering (SRE) practices and prioritize the reliability and criticality of systems within their software development lifecycle (SDLC) processes. This approach ensures proper testing, especially during production releases, and helps eliminate costly mistakes and failures. A similar approach should be applied to in-house application development, using technology such as software bill of materials (SBOM) to track risks introduced to the software supply chain by open-source components.

2. Watch the watchers.

Enterprises must understand the intricacies of their cybersecurity software solutions. Even after deploying popular solutions for critical technologies such as extended detection and response (XDR), IT teams must understand the interdependencies these solutions can have. Given most large organizations depend on the Microsoft Windows operating system, it is essential to be aware of the access cybersecurity solutions have to core operating system components, such as the kernel. This enables enterprises to conduct comprehensive risk assessments and prevent potential issues from escalating. Enterprises should also adopt a zero-trust approach, which requires continuous verification, least-privilege access and risk mitigation. This approach ensures an enterprise trusts no one, not even the security solution vendor.

3. Ensure robust backup and disaster recovery (DR) plans.

Enterprises must regularly back up critical data and systems to ensure quick restoration of operations during an outage. While enterprises often consider consolidating their backup vendors, they must also use multicloud infrastructure and load balancing to avoid a single point of failure. Additionally, enterprises should seek managed service providers with geographically distributed data centers and security operations centers (SOCs) to further mitigate risks.

4. Conduct crisis management drills in a simulated environment.

Incident response and crisis management teams should prepare response plans for such incidents, following industry frameworks and standards such as SANS incident response guidance, ISO/IEC 27001 and the CIS Controls. These teams must respond rapidly with clear communication, resilience strategies and strong ecosystem partnerships. Detailed procedures should be in place so all stakeholders understand their roles during a cyber incident. Only regular testing through drills and simulations will ensure effectiveness in real-world scenarios.

5. Quantify cyber risks.

Enterprises must quantify cyber risks in their environment to support decisions regarding cyber investments by assigning financial value to specific cyber incidents. This approach also plays a vital role in evaluating whether planned or existing cybersecurity insurance coverage is sufficient or needs to be increased. Enterprises can leverage ISG’s recommended approach for cyber risk quantification.
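One common textbook way to assign financial value to incidents is annualized loss expectancy: the estimated cost of a single occurrence multiplied by its expected yearly frequency. The sketch below illustrates that formulation only; it is not necessarily ISG's recommended approach, and the `Scenario` type, function names and any figures passed in are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Scenario:
    """One cyber incident scenario with illustrative, user-supplied estimates."""
    name: str
    single_loss_usd: float    # estimated cost of one occurrence
    annual_frequency: float   # expected occurrences per year

    @property
    def annualized_loss_usd(self) -> float:
        # Annualized loss expectancy = single loss x yearly frequency.
        return self.single_loss_usd * self.annual_frequency


def total_annualized_loss(scenarios: Iterable[Scenario]) -> float:
    """Sum of annualized losses, usable as a rough yardstick for security spend."""
    return sum(s.annualized_loss_usd for s in scenarios)


def coverage_gap(scenarios: Iterable[Scenario], insured_limit_usd: float) -> float:
    """Amount by which the worst single event exceeds the insured limit (0 if covered)."""
    worst = max((s.single_loss_usd for s in scenarios), default=0.0)
    return max(0.0, worst - insured_limit_usd)
```

Comparing `coverage_gap` against the current policy limit gives a first, rough indication of whether insurance coverage should be revisited; a real assessment would use ranges and probability distributions rather than single point estimates.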

6. Use DEX and adaptive, self-healing tools to prevent widespread impact.

Enterprises must leverage modern endpoint management and digital employee experience (DEX) solutions to continuously monitor endpoints and predict, prevent and remediate issues affecting them. The ISG Provider Lens Future of Work Solutions report notes that modern DEX solutions can monitor endpoint behavior, detect anomalies and flag issues. Enterprise IT teams actively using DEX solutions can identify unusual spikes in system crashes and act promptly to limit widespread impact. Additionally, service desk and remote support teams should employ out-of-band management solutions, such as Intel vPro, which can reach and remediate devices even when the operating system will not boot, as in blue screen of death (BSOD) scenarios.
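To make the idea of spotting an unusual spike in system crashes concrete, the sketch below compares the current crash count against a recent baseline using a simple z-score. It is an assumption-laden illustration, not the algorithm of any particular DEX product; the function name, the `z_threshold` and the `min_count` floor are hypothetical.

```python
from statistics import mean, pstdev
from typing import Sequence


def is_crash_spike(
    baseline: Sequence[int],
    current: int,
    z_threshold: float = 3.0,
    min_count: int = 5,
) -> bool:
    """Return True if `current` crashes is far above the baseline counts.

    baseline: crash counts from comparable earlier intervals (e.g. per hour).
    min_count: ignore tiny absolute numbers so a jump from 0 to 1 is not flagged.
    """
    if not baseline:
        raise ValueError("baseline must contain at least one interval")
    if current < min_count:
        return False
    mu = mean(baseline)
    sigma = pstdev(baseline)
    if sigma == 0:
        # Perfectly flat baseline: any count above it (and above min_count) is a spike.
        return current > mu
    return (current - mu) / sigma > z_threshold
```

A monitoring job could call this once per interval and, when it returns True, pause further update distribution and alert the service desk before the problem spreads.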

Not all global outage incidents are related to external attacks, as is evident in the CrowdStrike–Microsoft incident. Improper planning and inadequate quality assurance and validation processes can lead to catastrophic results.

The ISG Provider Lens report on Cybersecurity Services and Solutions explores leading cybersecurity solution vendors that are bringing innovative features to help enterprises prevent new and potential threats. Some offer automated implementation for uninterrupted functioning. Still, enterprises must balance rapid innovation with thorough security and validation testing. This will require keeping humans in the loop and being prepared for potential mishaps or oversights—even from cybersecurity providers.

ISG helps organizations better prepare for critical incidents, navigate the rapidly evolving cybersecurity market and find right-fit service providers.

About the authors

Maxime Martelli

As a Consulting Manager at ISG France, Dr. Maxime Martelli contributes to ISG’s “Cybersecurity” Solution for multinational firms and public sector services and applies his expertise to IT Benchmark and Sourcing projects.

Maxime also leads the SASE/SSE topic at ISG EMEA and is Lead Analyst of the ISG Provider Lens™ Cyber Security Solutions & Services report.

Gowtham Sampath

Gowtham Sampath is a Manager with ISG Research focusing on emerging technologies and their impact on businesses. He is also responsible for authoring Provider Lens quadrant reports for the Banking Industry Services and Analytics Solutions & Services markets. Gowtham’s responsibilities include authoring ongoing research articles and blogs on the data analytics market, covering a broad spectrum of verticals and functional domains. In his role, he also works with advisors to address enterprise clients' requests for ad hoc research within the IT services sector, across industries.

Mrinal Rai

Mrinal Rai is Assistant Director and Principal Analyst at ISG and leads research for the future of work and enterprise customer experience. His expertise is in the digital workplace, emerging technologies and the global IT outsourcing industry. He covers key areas of the workplace and end-user computing domain, including workplace modernization, enterprise mobility, BYOD, DEX, VDI, managed workplace services, service desk and IT architecture modernization. He also focuses on unified communications and collaboration as a service, enterprise social software, content collaboration, team collaboration, employee experience and productivity services and solutions. He has been with ISG for 10+ years and has 16+ years of industry experience. Mrinal works with ISG advisors and clients in engagements related to the digital workplace, unified communications and service desk. He also leads the ISG Star of Excellence™ program, which tracks and analyzes enterprise customer experience in the technology industry, and authors quarterly ISG CX Index reports. He is also ISG’s official media spokesperson in India.

Doug Saylors

Doug currently leads the ISG Cybersecurity unit and offers expertise in cybersecurity strategy, large-scale transformation projects, infrastructure, digital enablement, relationship management and service delivery. Clients benefit from Doug's years of experience working with global clients in the life sciences, automotive manufacturing, aerospace, banking, insurance, financial services, healthcare, utilities and retail industries, as well as his deep and current knowledge of the service provider market. Doug routinely performs Strategy and Assessment engagements to help clients select the optimal organizational and operational models to meet their business needs while minimizing security exposure and risk of loss.
