Data centre servers are just sophisticated machines. Like any machine, they require regular maintenance to operate at their best. Simple maintenance procedures reduce serious service calls and extend the working life of servers.

Even with the performance and redundancy features of modern servers, increased workload consolidation and rising reliability expectations can take a toll on your fleet. Your server maintenance checklist should cover physical elements as well as the system’s critical configuration.


Stick to a routine

Server administrators too often neglect to plan maintenance windows. Don’t wait until there is an actual failure; set aside time for routine maintenance to prevent problems.

Handy preventive server maintenance checklist:

  • Migrate workloads off the server
  • Inspect the hardware
  • Check airflow
  • Scan the hard disks (see the scripted example after this list)
  • Review event logs for alerts or trends
  • Test patches and updates
  • Install patches and updates
  • Schedule the next maintenance window
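
Several of these items lend themselves to simple scripting. As a minimal sketch of the hard-disk scan, the Python snippet below assumes a Linux host with the smartmontools package installed and asks smartctl for each detected drive's overall SMART health; adapt the tooling to your own platform.

    # disk_health.py -- minimal sketch of the "scan the hard disks" checklist item.
    # Assumes a Linux host with smartmontools installed; run with root privileges.
    import subprocess

    def detected_drives():
        """Enumerate drives via 'smartctl --scan' (lines such as '/dev/sda -d scsi ...')."""
        out = subprocess.run(["smartctl", "--scan"], capture_output=True, text=True)
        return [line.split()[0] for line in out.stdout.splitlines() if line.strip()]

    def health_status(device):
        """Return smartctl's overall health line for one device."""
        out = subprocess.run(["smartctl", "-H", device], capture_output=True, text=True)
        for line in out.stdout.splitlines():
            if "overall-health" in line or "SMART Health Status" in line:
                return line.strip()
        return "health status not reported"

    if __name__ == "__main__":
        for dev in detected_drives():
            print(f"{dev}: {health_status(dev)}")

A drive that reports anything other than a passing status is exactly the kind of finding worth catching during a planned window rather than in production.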

Maintenance frequency depends on the age of the equipment, the data centre environment, the volume of servers requiring maintenance and other factors. For example, older equipment located in an equipment closet will need more frequent inspections than new servers deployed in a HEPA-filtered, well-cooled data centre. Organizations can base routine maintenance schedules on vendor or third-party provider routines; if the vendor’s service contract calls for system inspections every four or six months, follow that schedule.

Before virtualization, maintenance windows disrupted workloads, forcing IT personnel to do maintenance at night or on weekends. Virtualized servers let administrators migrate workloads to another host instead of taking them down, so applications keep running whenever server maintenance occurs.
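
How workloads get moved depends on the hypervisor. As one hedged illustration, the sketch below assumes a KVM/libvirt host and uses virsh to live-migrate every running guest to a hypothetical standby host before maintenance starts; vMotion, Hyper-V Live Migration and other platforms have equivalent mechanisms.

    # evacuate_host.py -- sketch: live-migrate running KVM guests off a host before maintenance.
    # Assumes libvirt/KVM with virsh installed and SSH trust to the destination host (name is hypothetical).
    import subprocess

    DEST = "qemu+ssh://standby-host.example.com/system"  # hypothetical destination URI

    def running_guests():
        """List the names of running libvirt domains on this host."""
        out = subprocess.run(["virsh", "list", "--name", "--state-running"],
                             capture_output=True, text=True, check=True)
        return [name for name in out.stdout.splitlines() if name.strip()]

    if __name__ == "__main__":
        for guest in running_guests():
            print(f"Live-migrating {guest} ...")
            # --persistent keeps the guest defined on the destination; --undefinesource removes it here.
            subprocess.run(["virsh", "migrate", "--live", "--persistent",
                            "--undefinesource", guest, DEST], check=True)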


Make sure the server can breathe

Once a server is offline, visually inspect its external and internal air flow paths. Remove any accumulations of dust and other debris that can impede cooling air.

Start with the exterior air inlets and outlets, then proceed into the system chassis, looking at the CPU heat sink and fan assemblies, memory modules, and all cooling fan blades and air duct pathways. Remove dust or debris with clean, dry compressed air at an appropriate, static-safe workspace; do not clean the server right there in the rack.

Dust-busting is an old-school process, but that doesn’t mean it’s outdated. Dust is a thermal insulator, which makes it all the more important to remove now that alternative cooling schemes and ASHRAE recommendations have raised data centre operating temperatures. Dust and other airflow obstructions force the server to use more energy and can even precipitate avoidable premature component failures.


Read the event log’s fine print

Servers record a wealth of information in event logs, particularly details about problems. No server inspection is complete without a careful review of system, anti-malware and other event logs. Sure, critical system issues should have attracted the attention of IT administrators and technicians right away, but countless noncritical issues can signal chronic and serious problems.
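
What that review looks like depends on the platform. On a Linux host with the systemd journal, for instance, the rough sketch below tallies error-or-worse messages from the current boot by originating unit, which is one quick way to surface chronic offenders; Windows administrators would pull the equivalent data from the event logs, for example with PowerShell's Get-WinEvent.

    # log_trends.py -- sketch: tally error-or-worse journal messages per unit for the current boot.
    # Assumes a Linux host with systemd's journalctl available.
    import json
    import subprocess
    from collections import Counter

    def error_counts():
        """Count journal entries at priority 'err' (3) or worse, grouped by systemd unit."""
        out = subprocess.run(
            ["journalctl", "-b", "-p", "err", "-o", "json", "--no-pager"],
            capture_output=True, text=True)
        counts = Counter()
        for line in out.stdout.splitlines():
            if not line.strip():
                continue
            entry = json.loads(line)
            counts[entry.get("_SYSTEMD_UNIT", entry.get("SYSLOG_IDENTIFIER", "unknown"))] += 1
        return counts

    if __name__ == "__main__":
        for unit, count in error_counts().most_common(10):
            print(f"{count:6d}  {unit}")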

While you’re there, check the reporting setup and verify the correct alert and alarm recipients. For example, if a technician leaves the server group, you’ll need to update the server’s reporting system. Double-check the contact methods too; a critical error reported to a technician’s company email address might be entirely inadequate if the error occurs outside of business hours.

Be proactive with log data. When a log inspection reveals chronic or recurring issues, investigating early can resolve the problem before it escalates. For example, a recoverable error in a memory module will not trigger critical alarms, but repeated instances signal trouble with that module, and IT staff can run more detailed diagnostics to identify an impending failure.
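
Where those recoverable errors show up varies by vendor and OS. On Linux, one place they surface is the kernel's EDAC counters under /sys, and the hedged sketch below simply reads the correctable- and uncorrectable-error counts for each memory controller so a rising trend can be compared between maintenance windows; baseboard management controllers (IPMI, iDRAC, iLO and the like) expose similar counters.

    # memory_errors.py -- sketch: read EDAC correctable/uncorrectable memory error counters.
    # Assumes a Linux host whose memory controllers are exposed under /sys/devices/system/edac.
    from pathlib import Path

    EDAC_ROOT = Path("/sys/devices/system/edac/mc")

    def edac_counts():
        """Yield (controller, correctable, uncorrectable) for each memory controller."""
        for mc in sorted(EDAC_ROOT.glob("mc*")):
            ce = int((mc / "ce_count").read_text())  # correctable (recoverable) errors
            ue = int((mc / "ue_count").read_text())  # uncorrectable errors
            yield mc.name, ce, ue

    if __name__ == "__main__":
        for name, ce, ue in edac_counts():
            flag = "  <-- investigate the DIMMs behind this controller" if ce or ue else ""
            print(f"{name}: correctable={ce} uncorrectable={ue}{flag}")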

If the problems are not severe enough to warrant shutting down a server, it can return to production until replacement hardware comes in.


Make time for patches and updates

The server’s software stack — BIOS, OS, hypervisors, drivers, applications, support tools — must all interact and work together. Unfortunately, software code is rarely elegant or problem-free, so pieces of this software puzzle are frequently patched or updated to fix bugs, improve security, streamline interoperability, enhance performance and so on.

No production software should be able to update automatically. Administrators should determine whether a patch or upgrade is necessary, then evaluate and test the change thoroughly. If the update fixes a problem your server doesn’t have, why risk creating other problems?
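
Enforcing that policy is platform-specific, but auditing it is straightforward. As an assumption-laden sketch, the snippet below asks systemd whether the auto-update timers and services commonly found on Linux hosts are enabled; the unit names cover the dnf-automatic and unattended-upgrades mechanisms and may need adjusting for your distribution.

    # audit_autoupdate.py -- sketch: verify that common auto-update timers/services are not enabled.
    # Assumes a systemd-based Linux host; the unit names below are the usual suspects, adjust as needed.
    import subprocess

    AUTO_UPDATE_UNITS = [
        "dnf-automatic.timer",          # RHEL/Fedora family
        "dnf-automatic-install.timer",
        "unattended-upgrades.service",  # Debian/Ubuntu family
        "apt-daily-upgrade.timer",
    ]

    if __name__ == "__main__":
        for unit in AUTO_UPDATE_UNITS:
            out = subprocess.run(["systemctl", "is-enabled", unit],
                                 capture_output=True, text=True)
            state = out.stdout.strip() or "not found"
            print(f"{unit}: {state}")
            if state == "enabled":
                print(f"  WARNING: {unit} is enabled; production patching should be manual.")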

Software developers cannot possibly test every potential combination of hardware and software, so patches and updates can cause more problems than they fix on your specific server or software stack. For example, a monitoring-agent patch could cause performance problems with an important workload because the new agent takes more bandwidth than expected.

The shift to DevOps, with smaller and more frequent updates, exacerbates the potential for problems. You still need to test any patch or update in a lab before deploying it. And always be sure you can undo the change and restore the original software configuration if necessary.
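
What "undo" means depends on the stack: package-manager history, hypervisor snapshots or simply redeploying a known-good image. As one narrow illustration, the sketch below applies a package update on an RPM-based test host with dnf, runs a placeholder smoke test, and rolls the transaction back if the test fails; the package name and smoke-test command are hypothetical.

    # patch_with_rollback.py -- sketch: apply an update on a test host and undo it if a smoke test fails.
    # Assumes an RPM-based host using dnf; the package name and smoke test below are hypothetical.
    import subprocess
    import sys

    PACKAGE = "example-monitoring-agent"          # hypothetical package under evaluation
    SMOKE_TEST = ["/usr/local/bin/smoke-test"]    # hypothetical post-update check script

    def run(cmd):
        print("+", " ".join(cmd))
        return subprocess.run(cmd).returncode

    if __name__ == "__main__":
        if run(["dnf", "-y", "update", PACKAGE]) != 0:
            sys.exit("update failed; nothing to roll back")
        if run(SMOKE_TEST) == 0:
            print("smoke test passed; keep the update and schedule the production rollout")
        else:
            print("smoke test failed; rolling back the last dnf transaction")
            run(["dnf", "-y", "history", "undo", "last"])

The same pattern works with snapshots on a virtual test server: snapshot, patch, test, then revert or commit.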