In data center facilities, critical electromechanical infrastructure systems must operate with high business continuity. The main objective is uninterrupted operation 24 hours a day, 7 days a week, official holidays included.
One point should be noted in advance to avoid any misunderstanding. Uptime is a criterion for Information Technology (IT) equipment. It does not apply to electromechanical systems. On the contrary, for IT devices to provide service with high business continuity, the electromechanical systems that supply their power and cooling must be switched off, maintained, and switched on again on a schedule. The critical point is that every switching maneuver must be performed so professionally that the availability and performance of the IT equipment are not affected.
This requirement is necessary to keep the performance and reliability of the electromechanical systems at their initial level. Preventive maintenance involves an operating cycle that is costly but essential for sustainable critical infrastructure operation.
Isolating infrastructure equipment from the system in a defined order, completing every maintenance step, and reactivating the equipment without delay within the scheduled maintenance window is a particularly challenging process. Organizations with data centers allocate a large budget for it. A large part of the operating budget covers staff expenses and maintenance agreements.
The target of “high business continuity and efficient operation” cannot be achieved unless this periodic preventive maintenance process is performed to a high standard. The performance of the operations team is critical in this process. On the one hand, the team must deactivate and reactivate systems without impacting the critical loads. On the other hand, it has to make sure that the service providers perform complete, high-quality maintenance. If the team fails to deliver and errors arise because of the operations staff, the belief that “maintenance is the cause of problems” will become dominant, and the perception that “it is better not to do it” will take hold.
Dozens of systems together contain around 500-1000 pieces of equipment and devices. These devices require maintenance more than once a year. Planning maintenance that fits into the 365 days of the year without disrupting redundancy, isolating each device at the time of maintenance, and recommissioning it afterward all require careful scheduling, trained staff, and costly service agreements.
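The planning constraint described above can be illustrated with a small sketch. It assumes each device belongs to a redundancy group and has a fixed number of visits per year; the device names, group names, and the greedy placement rule are hypothetical and only show the idea of spreading visits over the year without taking two devices of the same group out on the same day.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    group: str            # hypothetical redundancy group, e.g. "chiller"
    visits_per_year: int  # planned maintenance visits, assumed >= 1


def plan_year(devices, days_in_year=365):
    """Greedy plan: spread each device's visits evenly over the year and
    never isolate two devices of the same redundancy group on one day."""
    busy = defaultdict(set)  # day -> groups that already have a device out
    plan = []                # (day, device name); day is 0-based
    for dev in devices:
        interval = days_in_year // dev.visits_per_year
        for visit in range(dev.visits_per_year):
            day = visit * interval
            shifts = 0
            # Shift forward until no other device of this group is out.
            while dev.group in busy[day % days_in_year]:
                day += 1
                shifts += 1
                if shifts >= days_in_year:
                    raise ValueError(f"no free day for {dev.name}")
            day %= days_in_year
            busy[day].add(dev.group)
            plan.append((day, dev.name))
    return sorted(plan)


devices = [
    Device("chiller-A", "chiller", 2),
    Device("chiller-B", "chiller", 2),
    Device("ups-A", "ups", 4),
    Device("ups-B", "ups", 4),
]
plan = plan_year(devices)
for day, name in plan:
    print(f"day {day:3d}: {name}")
```

A real plan would also have to account for maintenance duration, crew availability, and service provider calendars, none of which this sketch models.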
In the discipline of Maintenance and Reliability, it is an accepted principle that faults in systems whose maintenance is postponed or omitted will reach irrecoverable levels over time. This is easy to understand for anyone who drives a car daily. A car will not break down as long as its maintenance is performed on time by a competent service. From common experience, it is widely known that neglecting maintenance on a new car, or getting by with nothing more than an oil change, leads to major problems that keep the car off the road for days or weeks.
In a facility where thousands of maintenance activities are performed, many operators will find the following example familiar:
Maintenance is performed on critical cooling units. After maintenance on the main pump, the pump is run in the manual position to confirm that it operates correctly. The maintenance form is completed, and the maintenance is closed out. A few days later, the other cooling unit is taken out for maintenance, and its devices, including the main pump, are switched off for isolation. Then an employee who happens to enter the white space notices that the room temperature is extremely high and sees over-temperature warnings on the IT equipment. The “redundant” cooling system was expected to meet the required capacity, but it turns out that at least one of its devices is switched off. Can you imagine the stress of the technician rushing about asking “What did I leave switched off?” and the position of the executive who has to answer to the customers?
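One way to guard against this scenario is to record the state of every device before isolation and compare it with the state after maintenance, before the redundant unit is touched. The sketch below is a minimal illustration under stated assumptions: the device names and state labels are hypothetical, and the case of a pump left in the manual position is only one plausible reading of the example above.

```python
def unrestored(before, after):
    """Return devices whose state after maintenance differs from the
    state recorded before isolation, as {name: (expected, actual)}."""
    return {
        name: (expected, after.get(name, "missing"))
        for name, expected in before.items()
        if after.get(name) != expected
    }


# Hypothetical snapshot taken before cooling unit 1 was isolated.
before = {"main-pump-1": "auto", "valve-1": "open", "fan-1": "auto"}
# Hypothetical readings after the maintenance form was signed off:
# the pump was test-run in manual and never returned to automatic.
after = {"main-pump-1": "manual", "valve-1": "open", "fan-1": "auto"}

issues = unrestored(before, after)
for name, (expected, actual) in issues.items():
    print(f"{name}: expected {expected}, found {actual}")
if not issues:
    print("all devices restored; safe to isolate the redundant unit")
```

Making this comparison a mandatory step before the maintenance form is closed means the next isolation starts from a verified state rather than from memory.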