There are No Slam Dunks in IT.
That’s a saying I have thrown around for close to 10 years now, but one that I think too many people in technology fail to remember on a daily basis. They get caught up in the urgency of the moment, shortcut change management procedures, and fail to think about the downstream impact of what they see as a minor, isolated change. All too often the mindset of “the easy change,” “the lay-up,” or “the routine lazy fly ball” ends up as an unexpected outage. That breakaway slam dunk clanks off the rim and bounces out of bounds. That easy two points turns into a turnover.
As we kicked off 2012, a network engineer who was relatively new to the company noticed that a top-of-rack server switch had two fiber uplinks but only one was active. Anxious to make a good impression, he wanted to resolve that issue. It was an admirable thing to do. He was taking initiative to make things better. So one night during the first week of the fresh new year, he executed a change to bring up the second uplink. Things did not go well: the change (I will not go into the gory technical details) brought down the entire data center network. It was after standard business hours – whatever that means in today’s 24×7 business world – but the impact of that 10-minute outage was significant. A classic case of a self-inflicted wound from not following good change management procedures.
It was actually a frustrating incident for me because, as we put together the 2012 Business Plan for Corporate Technology Services, we were asked to list the keys to success for our operations and the actions we needed to take to achieve success.
THE #1 key for success listed was: Avoid self-inflicted outages and issues that take away cycles from the planned efforts and cause unplanned unavailability of our client facing solutions.
So 30 days prior I had told our CEO, CFO, and the rest of the executive management team that our #1 key to success in IT was to avoid such things, yet here I was, four days into the new year, staring at the carnage of a self-inflicted outage.
Outages are close to a given in the world of technology. Servers will crash, switches will randomly reboot, hard drives will fail, applications will act weird, redundancy will fail, and there will be maintenance efforts that we know will cause outages. Given that, every IT organization must take steps to not be the cause of even more outages. Business leaders know that there will be some level of downtime with technology – have you ever seen a 100% SLA? Rarely. It is usually some 99.xx% number. But outages that are caused by the very people charged with keeping things running drive them nuts, and rightfully so.
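To put those 99.xx% numbers in perspective, a little back-of-the-envelope arithmetic helps. The sketch below (Python, and the SLA levels are just examples, not our actual targets) converts an availability percentage into the downtime it allows in a year:

```python
# Back-of-the-envelope: how much downtime a "99.xx%" SLA allows per year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

for sla_percent in (99.0, 99.9, 99.99):  # example SLA levels
    allowed_minutes = MINUTES_PER_YEAR * (1 - sla_percent / 100)
    print(f"{sla_percent}% availability allows ~{allowed_minutes:.0f} minutes of downtime per year")
```

Even at 99.9%, that is only about eight and three-quarter hours for the whole year, so a self-inflicted 10-minute data center outage eats into the budget fast.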
The morning after that self-inflicted wound, I communicated out the following to every member of the IT organization:
We need to strive to make sure that we are not the cause of any unexpected outages. We must exercise good change management process and follow the five actions listed below. As our solutions and the underlying infrastructure become increasingly intertwined, we must make an extra effort to assess the potential unintended downstream (or upstream) impact as we plan the change.
When making a change we must always follow these steps:
Plan – make sure each change action/project we undertake is well thought out, steps are documented, risks are assessed. If disruption in service is expected, plan for when we make this change to limit the impact of the disruption.
Communicate – communicate each change action/project to the parties potentially impacted prior to executing the change
Execute – flawlessly execute according to the plan developed
Test – test to make sure that the change executed resulted in the expected results and there are no unintended consequences from the change
Communicate – communicate to the potentially impacted parties that the change has been completed and tested
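To make those steps harder to skip, it helps to bake them into the change record itself. Here is a minimal sketch (Python, purely illustrative; the class and field names are invented for this post, not taken from any tool we actually ran) of a change ticket that refuses to execute until the plan is documented and the pre-change communication has gone out:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class ChangeRecord:
    """Illustrative change ticket mirroring the five steps above."""
    description: str
    plan_documented: bool = False           # Plan: steps written down, risks assessed
    pre_change_notice_sent: bool = False    # Communicate: impacted parties notified beforehand
    executed_at: Optional[datetime] = None  # Execute: when the change was actually made
    tested_ok: bool = False                 # Test: expected results verified, no side effects
    post_change_notice_sent: bool = False   # Communicate: completion and test results announced

    def execute(self) -> None:
        # Block execution until planning and pre-change communication are done.
        if not (self.plan_documented and self.pre_change_notice_sent):
            raise RuntimeError("Change blocked: plan or pre-change communication is missing")
        self.executed_at = datetime.now()

    def is_closed(self) -> bool:
        # A change is only "done" once it has been executed, tested, and announced.
        return self.executed_at is not None and self.tested_ok and self.post_change_notice_sent
```

A spreadsheet or a ticketing tool can enforce the same thing; the point is that execution should be impossible to start before the plan and the first communication exist.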
To keep this goal of avoiding self-inflicted outages top of mind, we implemented an “It’s Been X Days Since Our Last Self-Inflicted Outage” counter, basically taking a page out of the factory accident prevention playbook.
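The counter itself is trivial to automate. A sketch like the one below (Python, using the early-January outage from the story as an approximate starting date) is enough to put the number on a dashboard or in a morning email:

```python
from datetime import date

# Approximate date of the last self-inflicted outage (the early-January incident above).
last_self_inflicted_outage = date(2012, 1, 4)

days_since = (date.today() - last_self_inflicted_outage).days
print(f"It's been {days_since} days since our last self-inflicted outage.")
```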
We had to reset it once after we implemented it, but we are now at 19 days and counting. Let’s hope the next reset is not anytime soon.