Change control management for a network department
This may be a familiar scenario: You are just going to make a small change on the router, and then go home. Your change will not have any effect on anything... You type your command, and then suddenly you lose contact with the router. It dawns on you... OOPS! The network is down, and you have lost contact with the router, so you cannot fix it. PANIC!
This happens sometimes. Human error is one of the most common reasons for network downtime. Even the most skillful network engineer makes mistakes.
Careful planning of all network changes is important, but it will not protect you completely against errors. You need to make sure that your errors will have the smallest impact possible.
You should divide all your changes into two categories:
- Routine tasks are simple tasks that you do regularly and that require no design or engineering work. Preferably, your routine tasks should be documented step by step, so that anyone with basic skills could perform them. Routine tasks are tasks that you "have to" be able to perform at any time. Examples include adding a new port to an existing switch, adding a new CPE to an access router for an ISP, adding a new BGP peer at an exchange point, or adding a new entry in the DNS.
- Scheduled tasks are all other tasks.
Any change that is not a routine change should be planned and performed at a time when the impact of errors will be as small as possible for your customers. Typically, if you are running an office LAN, this will be (well) outside office hours; if you are running an ISP network, it will be at night.
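The split between routine and scheduled changes can be sketched as a lookup against a catalogue of documented routine tasks. This is only an illustration: the names `ChangeKind`, `ROUTINE_TASKS` and `classify_change` are hypothetical, and the catalogue entries simply mirror the examples given above. Anything not in the catalogue is treated as a scheduled change.

```python
from enum import Enum


class ChangeKind(Enum):
    ROUTINE = "routine"
    SCHEDULED = "scheduled"


# Hypothetical catalogue of documented routine tasks (assumed names,
# based on the examples in the text).
ROUTINE_TASKS = {
    "add switch port",
    "add cpe",
    "add bgp peer",
    "add dns entry",
}


def classify_change(task: str) -> ChangeKind:
    """Return ROUTINE only for catalogued tasks; everything else is SCHEDULED."""
    # Normalize case and whitespace so "Add  DNS entry" matches the catalogue.
    normalized = " ".join(task.lower().split())
    if normalized in ROUTINE_TASKS:
        return ChangeKind.ROUTINE
    return ChangeKind.SCHEDULED
```

Defaulting to "scheduled" keeps the safe path as the fallback: a task only becomes routine once someone has deliberately documented it.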
Before you perform a scheduled change, you should make sure of the following:
- Make sure you announce the change to your customers well ahead of time, stating the time of the change, the expected impact on the network, and that there is a risk of downtime. Typically you should announce at least 1-2 weeks ahead. Make sure the maintenance window you announce leaves plenty of time to make the changes and to recover in case of error. Your customer may be another department within your company, but it is still important to do this. By doing this, you give your customers a chance to come back with feedback in case the change causes problems for them, and they will also know that there is a risk of network downtime.
- Make sure you prepare the change thoroughly, and have a plan to roll back your changes in case of major problems. This is extremely important: consider that your brain may not work at full capacity at 03:00 in a cold, noisy datacenter.
- It is a good idea to have two people making the change. It will make things go faster and greatly decrease the risk of error, because you have someone double-checking your changes.
- Make sure you have backups of the configuration files for all network devices, and make sure they are available even if the network is down. Putting them on your laptop is a good idea.
- Make sure you can access your network devices via out-of-band management (OOBM). This can be as simple as connecting directly to a local device in your datacenter via a serial line, or dialing in to a terminal server that is connected to the device via a serial line if the device is located somewhere else.
- Make sure you have all the phone numbers you may need for any type of problem that may occur, including but not limited to numbers for the support departments of your network device manufacturer(s), numbers for other departments in your company, and numbers for people who have physical access to the devices (if you don't).
- Make sure you have approval to make the change from your manager, if necessary.
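The checklist above can be turned into a simple pre-flight check. The following is a minimal sketch, not a prescribed tool: the `ChangePlan` record, its field names and the `unmet_prerequisites` function are all hypothetical, and the one-week threshold is just the lower end of the 1-2 week guideline.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional

# Assumption: use the lower end of the "1-2 weeks" announcement guideline.
MIN_NOTICE = timedelta(weeks=1)


@dataclass
class ChangePlan:
    """Hypothetical record of the preparation for one scheduled change."""
    window_start: datetime
    announced_at: Optional[datetime] = None
    rollback_plan: str = ""
    second_engineer: str = ""
    backups_available_offline: bool = False
    oob_access_verified: bool = False
    contact_numbers: List[str] = field(default_factory=list)
    approval_needed: bool = True
    approved: bool = False


def unmet_prerequisites(plan: ChangePlan) -> List[str]:
    """Return a description of every checklist item the plan does not satisfy."""
    problems = []
    if plan.announced_at is None:
        problems.append("change has not been announced")
    elif plan.window_start - plan.announced_at < MIN_NOTICE:
        problems.append("change was announced less than one week ahead")
    if not plan.rollback_plan.strip():
        problems.append("no rollback plan")
    if not plan.second_engineer.strip():
        problems.append("no second engineer (recommended)")
    if not plan.backups_available_offline:
        problems.append("no configuration backups available offline")
    if not plan.oob_access_verified:
        problems.append("out-of-band access not verified")
    if not plan.contact_numbers:
        problems.append("no contact phone numbers collected")
    if plan.approval_needed and not plan.approved:
        problems.append("manager approval missing")
    return problems
```

An empty list means every item has been ticked off. Returning the full list of gaps, rather than stopping at the first one, lets you see everything that is still missing before the window starts.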
After you have made the change, spend plenty of time testing that everything works as you expect. In a typical maintenance window, you should spend 10% of the time making changes and 90% of the time verifying that everything works.
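To make the 10%/90% rule of thumb concrete, here is a small sketch that splits a maintenance window into time for making changes and time for verification. The function name `split_window` and its `change_percent` parameter are hypothetical; the 10% default comes from the guideline above.

```python
from datetime import timedelta
from typing import Tuple


def split_window(window: timedelta, change_percent: int = 10) -> Tuple[timedelta, timedelta]:
    """Split a maintenance window into (time for changes, time for verification)."""
    if not 0 < change_percent < 100:
        raise ValueError("change_percent must be between 1 and 99")
    change_time = window * change_percent / 100
    verify_time = window - change_time
    return change_time, verify_time


# A four-hour window leaves 24 minutes for the change itself
# and 3 hours 36 minutes for testing.
change_time, verify_time = split_window(timedelta(hours=4))
print(change_time, verify_time)  # 0:24:00 3:36:00
```

Seen the other way around: if the change itself needs an hour of hands-on work, this rule of thumb suggests asking for a window of about ten hours.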
By planning as above, you make sure that any downtime due to human error occurs when your customers are prepared. One more thing: have the same people design, plan and implement the change; otherwise the risk of error increases dramatically.
Of course, sometimes you need to break these rules. Just don't do it unless you really have to.