A short reflection on the recent CrowdStrike IT disaster.
I feel like there's a lot to learn from the #CrowdStrike meltdown, where a bug in a software update is causing havoc across the world. Here's what immediately comes to mind, both from the company's perspective and from ours as a society.
1. Don't put all your update eggs in one basket.
If you're a global service provider, unless you're sending a critical security patch, do you really need to go for a global rollout, or can you do it in batches? That way, if something goes wrong, you limit the fallout.
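To make that concrete, here's a minimal sketch of what a batched (canary-style) rollout could look like. The push_update and healthy helpers, the wave sizes, and the soak time are all made up for illustration; the point is simply that each wave has to look healthy before the next, larger one goes out.

```python
import time

# Hypothetical helpers -- their names and behaviour are assumptions for illustration.
def push_update(host: str) -> None:
    """Send the update to a single host (placeholder)."""
    print(f"updating {host}")

def healthy(host: str) -> bool:
    """Report whether the host came back up cleanly (placeholder)."""
    return True

def staged_rollout(hosts: list[str],
                   wave_fractions=(0.01, 0.10, 0.50, 1.0),
                   soak_seconds: int = 3600) -> None:
    """Push the update in growing waves, halting if a wave looks unhealthy."""
    done = 0
    for fraction in wave_fractions:
        target = int(len(hosts) * fraction)
        wave = hosts[done:target]
        for host in wave:
            push_update(host)
        time.sleep(soak_seconds)  # let the wave soak before widening the blast radius
        if not all(healthy(h) for h in wave):
            raise RuntimeError(f"halting rollout after {target} hosts: failures detected")
        done = target
```

With something like this in place, a broken update reaches 1% of the fleet and then stops, instead of reaching everyone at once.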
2. The importance of testing, and being responsible.
Why did the faulty code pass the CI/CD checks? As an avid software tester, I can't help but wonder how CrowdStrike's systems are set up. If you are a service provider for critical societal infrastructure like hospitals and aviation, you have a responsibility to put solid testing pipelines in front of every release. Of course, things can still slip through, but I still have to question the robustness of their delivery pipelines.
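I obviously don't know what CrowdStrike's pipeline actually looks like, so take this as a generic sketch of the kind of release gate I mean: the pipeline refuses to ship a content file that its own parser can't load. The directory layout, field names, and JSON format here are assumptions purely for the sake of the example.

```python
import json
import pathlib
import pytest

# Hypothetical release gate: every content file about to ship must at least
# parse and contain the fields the agent expects. Paths and field names are
# illustrative assumptions, not any vendor's real format.
RELEASE_DIR = pathlib.Path("release/content")
REQUIRED_FIELDS = {"version", "rules"}

@pytest.mark.parametrize("path", sorted(RELEASE_DIR.glob("*.json")))
def test_content_file_is_well_formed(path):
    data = json.loads(path.read_text())  # a malformed file fails the build here
    missing = REQUIRED_FIELDS - data.keys()
    assert not missing, f"{path.name} is missing fields: {missing}"
    assert data["rules"], f"{path.name} contains no rules"
```

A gate like this won't catch subtle logic bugs, but it does guarantee that an artifact the product can't even parse never leaves the building.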
3. The single point of failure problem.
Why aren't we more concerned about relying on single points of failure? It's quite honestly frightening how much chaos a single mistake can cause, and the direction we're heading in is even more worrying. Our industry has been moving towards relying on a handful of outside entities, and now everyone is paying the price.
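We can't remove every single point of failure, but we can at least design for graceful degradation. Here's a hypothetical sketch of the idea: if the primary vendor's check fails or is unreachable, fall back to a secondary path instead of letting the whole system go down with it. The scanner interfaces are invented for illustration.

```python
from typing import Callable

def check_with_fallback(primary: Callable[[bytes], bool],
                        secondary: Callable[[bytes], bool],
                        payload: bytes) -> bool:
    """Prefer the primary scanner, but degrade gracefully instead of failing hard."""
    try:
        return primary(payload)
    except Exception:
        # The primary vendor is down or misbehaving: use the backup path
        # rather than taking the whole service offline with it.
        return secondary(payload)
```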
Looking to the future.
This essay got dark. So let's try to end it on a more positive note. What can we as IT professionals do to prevent this from happening again? How can we make sure that we're not the next #CrowdStrike? I think that's a conversation worth having.
For now, I'm going to go back to my testing and make sure that my code is as solid as it can be. I hope you do the same. And do let me know your thoughts on this whole situation; I'm curious what you think and what you would do differently.