In many organizations, the servers and especially the storage area network (SAN) are critical. The thought of installing firmware on these systems or performing a major operating system upgrade is incredibly scary to the IT staff.
I recently worked with a company that was having trouble with its backup system: it did not support some of their critical servers, and this had been a problem for years. Without a backup solution, performing any maintenance on these critical servers was a no-go. They knew that they needed to patch the servers because of security vulnerabilities. They knew that they needed to do disk maintenance. But because of the risk, they were paralyzed. Bringing in an outside consultant to do the maintenance was the right choice for them. I helped them research recovery options, found a backup system that supported their servers, and got backups running. At that point we were able to patch vulnerabilities and update the hardware safely.
Why is performing necessary maintenance scary?
- The IT staff is not used to administering these complex systems, and the person who set them up originally is no longer around (either they moved on, or they were a short-term consultant sent by the manufacturer).
- If a SAN fails, it can be devastating to a company, because this typically means that all or most services are down and the entire company is at a work stoppage. Data might also be lost, which is even worse than an outage. If a server fails, it can be almost as devastating, depending on whether it hosted critical services.
- There is always a small chance that maintenance will cause or uncover a failure.
- The IT staff is not sure how they would reverse the changes or recover if there is a problem.
How do we make maintenance less scary?
There are a variety of solutions that reduce risk from maintenance.
1. Having a plan.
Even if you have no budget for a proper backup system, you can still do a lot to reduce risk. You can configure scripts to copy critical files to a network location or an external drive. You can create a Disaster Recovery Plan with individual procedures to rebuild or recover each server. You can pre-stage the software needed for rebuilding the server, rather than trying to obtain it during an outage.
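As a rough illustration of the "scripts to copy critical files" idea, here is a minimal Python sketch. The function name, the timestamped folder layout, and the example paths are all assumptions made for illustration; it is not a substitute for a real backup product, and it does nothing about open files, databases, or permissions.

```python
import shutil
from datetime import datetime
from pathlib import Path


def copy_critical_files(sources, destination_root, now=None):
    """Copy each source file or folder into a new timestamped backup folder.

    Hypothetical helper: a missing source raises an error so that a
    failed copy is noticed rather than silently skipped.
    """
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    target = Path(destination_root) / f"backup-{stamp}"
    # exist_ok=False: never write into an earlier backup by accident.
    target.mkdir(parents=True, exist_ok=False)

    copied = []
    for src in map(Path, sources):
        # Sources that share a base name would collide; keep names unique.
        dest = target / src.name
        if src.is_dir():
            shutil.copytree(src, dest)
        else:
            shutil.copy2(src, dest)  # copy2 also preserves timestamps
        copied.append(dest)
    return target, copied


# Example call (paths are placeholders, adjust to your environment):
# copy_critical_files(["/etc/myapp", "/srv/data/licenses.txt"], "/mnt/backup_share")
```

A script like this could be run from a scheduled task before each maintenance window. Copies that are never test-restored should not be counted on.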
Applying a formal Change Management process is beneficial because it asks questions like “How would you revert this change if it goes wrong?” and “What systems could this impact?”
Kieri Solutions helps businesses by writing custom Disaster Recovery Plans, Business Continuity Plans, Continuity Of Operations Plans, and Change Management Plans. Beyond the benefits of having a written plan, we add value while researching and writing your BCP / DRP by identifying problems and working with your IT staff to design solutions.
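One lightweight way to enforce those two questions is to record each change in a structure that flags missing answers. The sketch below is only an illustration: the `ChangeRequest` class and its field names are hypothetical, not part of any change management standard or tool.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRequest:
    """Hypothetical record for one planned maintenance change."""
    summary: str
    rollback_plan: str = ""
    impacted_systems: list = field(default_factory=list)

    def unanswered_questions(self):
        """Return the change management questions that still lack an answer."""
        missing = []
        if not self.rollback_plan.strip():
            missing.append("How would you revert this change if it goes wrong?")
        if not self.impacted_systems:
            missing.append("What systems could this impact?")
        return missing

    def ready_for_review(self):
        """A change is ready only when both questions are answered."""
        return not self.unanswered_questions()
```

For example, `ChangeRequest("Update SAN firmware").ready_for_review()` returns `False` until a rollback plan and at least one impacted system are filled in.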
2. Having service agreements with vendors.
Professional IT departments pay for extended warranties and support contracts from their hardware vendors and their software vendors. Having a 4-hour parts replacement contract can reduce an outage time from three days to four hours. If you have 200 administrative staff at a work-stoppage, this is the difference between $40k and $700k in losses from an incident.
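The loss figures above can be roughly reproduced with simple arithmetic. The sketch below assumes a cost of about $50 per staff hour and counts all 72 clock hours of a three-day outage. Both numbers are assumptions chosen because they land close to the $40k and $700k quoted here; they are not stated in the text.

```python
def outage_cost(staff, hours_down, hourly_cost):
    """Estimated productivity loss: idle staff x hours down x cost per hour."""
    return staff * hours_down * hourly_cost


STAFF = 200        # from the example above
HOURLY_COST = 50   # assumption: loaded cost per staff hour, in dollars

# With a 4-hour parts replacement contract.
four_hour_loss = outage_cost(STAFF, 4, HOURLY_COST)

# Without one: three days, counted as 72 clock hours (assumption).
three_day_loss = outage_cost(STAFF, 3 * 24, HOURLY_COST)

print(f"4-hour outage: ${four_hour_loss:,}")   # $40,000
print(f"3-day outage:  ${three_day_loss:,}")   # $720,000
```

If you count only working hours (say three 8-hour days), the same formula gives a smaller figure, so plug in your own staffing and cost numbers when making the case for a support contract.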
Having support from software vendors is important for many reasons. You can ask them to “hold your hand” while performing tricky maintenance, which greatly reduces the risk from operator error. They can verify the health of the system, or identify which configurations need to be backed up, before performing maintenance. And if something does go wrong, they are a critical lifeline for restoring service fast.
3. Using a backup system which can host a restored server inside it.
One example is Datto. Their products make backups but also have the CPU and RAM capacity to run a server. If you have a catastrophic hardware failure on your server, you can tell the Datto device to run the latest backup internally. This will normally restore all functionality within an hour. Then you can address the hardware failure on your own time.
Another option is using a combination of VMware, Veeam, and optionally a “warm” disaster recovery site. VMware virtualization has natural resilience built in, and can recover quickly from server hardware failures. Veeam is an excellent backup solution that produces high-quality backups which are easy to recover from. Setting up an off-site disaster recovery environment with Veeam means that you can replicate the current state of your servers to a different location. If something shuts down your main server room (fire, power outage, flooding), you can turn on your disaster recovery site and restore operations very quickly. This can reduce outage times from a week or more to 1-3 hours.
Kieri Solutions specializes in disaster recovery and business continuity. We can help you set up any of these backup solutions or disaster recovery sites. We can also help you test recovery procedures safely (without overwriting your current operations) to make sure that they will work when you need them.
4. Running incident response drills.
Smart companies perform Incident Response drills regularly. They ask questions like “If a hard drive failed, would we detect it with our current procedures?” and “What would we do if the power circuit to the server room went out?”
Over time, these drills will help your company identify weaknesses and perform preventative projects. Kieri Solutions has experience with many types of incidents. We can educate your staff about the purpose of these drills and help you with the first few.
The experience of building a server from scratch or setting up a SAN is very helpful when performing risky upgrades. An experienced engineer knows which tasks could affect production and which ones can be done casually. They can help you schedule maintenance, gather all materials together, back up configuration files, and make a step-by-step plan.
Kieri Solutions specializes in these scary upgrade tasks. We understand the architecture of SANs and servers, which reduces unexpected problems. We are very careful, and will perform the extra steps necessary to reduce risk to a minimum. We are also happy to train your IT staff to perform routine maintenance so that your next updates can be performed in-house.
What strikes fear into an IT professional’s heart?
- Updating firmware on production servers
- Upgrading the operating system on your Storage Area Network or Network Attached Storage
- Installing the latest version of VMware
- Upgrading virtual appliances
If you have maintenance that is stalled because of the risk, Kieri Solutions can help. We specialize in this work and have patched and upgraded hundreds of servers and dozens of SANs successfully.